Source code for the language-aware OCR document error profiler. See the Profiler Manual for a description.
The profiler was originally written by Uli Reffle as part of his PhD thesis in computational linguistics at CIS during the IMPACT project (2008-2011).
It has since been further developed by Florian Fink at CIS as a CLARIN-D curation project (Kurationsprojekt).
Its underlying technology is described in the following publications:
Mihov, Stoyan, and Klaus U. Schulz. 2004. “Fast Approximate Search in Large Dictionaries.” Computational Linguistics 30 (4). MIT Press: 451–77.
Reffle, Ulrich. 2011. Algorithmen und Methoden zur dokumentenspezifischen Analyse historischer und OCR-erfasster Texte. Verlag Dr. Hut.
Reffle, Ulrich, and Christoph Ringlstetter. 2013. “Unsupervised Profiling of OCRed Historical Documents.” Pattern Recognition 46 (5): 1346–57. doi:http://dx.doi.org/10.1016/j.patcog.2012.10.002.
Schulz, Klaus U., and Stoyan Mihov. 2002. “Fast String Correction with Levenshtein Automata.” International Journal on Document Analysis and Recognition 5 (1). Springer: 67–85.