Of those 53 miles, just a few millimeters’ worth of pages have been scanned and made available online. Even fewer pages have been transcribed into computer text and made searchable. It just needs to know which groups of chunks represent real letters and which are bogus.

The four main scientists behind the project—Paolo Merialdo, Donatella Firmani, and Elena Nieddu at Roma Tre University, and Marco Maiorino at the VSA—skirt Sayre’s paradox with an innovation called jigsaw segmentation.

This process, as the team recently outlined in a paper, breaks words down not into letters but something closer to individual pen strokes.

But getting these systems up and running is a bear, because they require gargantuan memory banks.

Rather than a few dozen alphabet letters, these systems have to recognize images of thousands upon thousands of common words.

The team recruited students at 24 schools in Italy to build the project’s memory banks.

The students logged onto a website, where they found a screen with three sections:, what the Codice scientists call “false friends.” The grid at the bottom is the meat of the program.

The end result is a series of jigsaw pieces. By themselves, the jigsaw pieces aren’t tremendously useful.

But the software can chunk them together in various ways to make possible letters.
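To make the chunking idea concrete, here is a minimal sketch in Python of how consecutive jigsaw pieces might be grouped into candidate letters. It is an illustration only, not the team's actual code: the function name `candidate_groupings`, the piece labels, and the cap of three pieces per letter are all assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple


def candidate_groupings(
    pieces: Sequence[str], max_group: int = 3
) -> Iterator[List[Tuple[str, ...]]]:
    """Yield every way to split `pieces` into consecutive groups.

    Each group holds between 1 and `max_group` neighboring pieces and
    stands for one possible letter. The cap is a hypothetical parameter.
    """
    if not pieces:
        # Nothing left to group: one valid (empty) grouping.
        yield []
        return
    for size in range(1, min(max_group, len(pieces)) + 1):
        head = tuple(pieces[:size])  # first candidate letter
        for rest in candidate_groupings(pieces[size:], max_group):
            yield [head] + rest


# Four made-up pieces standing in for pen-stroke fragments of one word.
pieces = ["p1", "p2", "p3", "p4"]
groupings = list(candidate_groupings(pieces, max_group=3))

for grouping in groupings:
    print(grouping)
```

Even four pieces produce several competing readings, and the number grows quickly with longer words. Each group in each reading is only a possible letter, so something still has to judge which groups are plausible; this sketch leaves that judging step out.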

The result is a computational deadlock, sometimes referred to as Sayre’s paradox: OCR software needs to segment a word into individual letters before it can recognize them, but in handwritten texts with connected letters, the software needs to recognize the letters in order to segment them. Some computer scientists have tried to get around this problem by developing OCR to recognize whole words instead of letters.

This works fine technologically—computers don’t “care” whether they’re parsing words or letters.

If you want to peruse anything else, you have to apply for special access, schlep all the way to Rome, and go through every page by hand. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time.

