Handwritten Text Recognition

In February 2021, we began exploring machine learning as a method of transcribing Geniza documents in partnership with Daniel Stökl Ben-Ezra of the École Pratique des Hautes Études (EPHE) in Paris and the digital paleography framework e-Scriptorium.

Machine transcription is an evolving field of artificial intelligence. One recent development is the rapid growth of Handwritten Text Recognition (HTR) for ancient and medieval manuscripts.

Machines can be trained to transcribe texts in almost any writing system, directly from digital images and with human levels of accuracy. The first step in digital paleography is producing a good model, a process that takes time, trial and error, and an adequate supply of "ground truth": accurate transcriptions stripped of non-manuscript material such as reconstructions of lacunae. The HTR team is currently reviewing thousands of transcriptions in the PGP database for accuracy and creating a stripped-down corpus to use as ground truth.
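To illustrate what stripping non-manuscript material might involve, here is a minimal Python sketch. It assumes, purely for illustration, that editorial reconstructions of lacunae are enclosed in square brackets; the PGP transcriptions may follow other conventions, and the function names are hypothetical rather than part of any PGP or e-Scriptorium tooling.

```python
import re

# Assumption: reconstructed text is marked with square brackets,
# e.g. "abc [def] ghi". Nested brackets are not handled in this sketch.
_RECONSTRUCTION = re.compile(r"\[[^\[\]]*\]")
_SPACES = re.compile(r"[ \t]+")


def strip_reconstructions(line: str) -> str:
    """Remove bracketed reconstructions from one transcription line."""
    without_brackets = _RECONSTRUCTION.sub(" ", line)
    # Collapse the gaps left behind and trim the ends.
    return _SPACES.sub(" ", without_brackets).strip()


def build_ground_truth(transcription: str) -> list[str]:
    """Return the cleaned lines of a transcription, skipping lines that
    contained nothing but reconstructed text."""
    cleaned = (strip_reconstructions(line) for line in transcription.splitlines())
    return [line for line in cleaned if line]
```

Dropping lines that become empty is a simplification. A real pipeline would probably need to keep track of them so that each remaining line can still be matched to the correct line in the manuscript image.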

Applying HTR to Geniza documents adds an additional level of complexity. HTR has achieved excellent results for Hebrew-script manuscripts with predictable layouts, but the page layouts of Geniza manuscripts can be unpredictable.

We anticipate a learning period of roughly two years before we are able to produce machine transcriptions. If all goes well, the PGP will then expand its corpus of searchable fragments.

For more on HTR, see this article by Peter Stokes, a research professor at the EPHE specializing in digital and computational humanities applied to historical writings.