Reducing OCR Errors in Gothic-Script Documents

Lenz Furrer, University of Zurich ([email protected])
Martin Volk, University of Zurich ([email protected])

Abstract

In order to improve OCR quality in texts originally typeset in Gothic script, we have built an automated correction system which is highly specialized for the given text. Our approach includes external dictionary resources as well as information derived from the text itself. The focus lies on testing and improving different methods for classifying words as correct or erroneous. Also, different techniques are applied to find and rate correction candidates. In addition, we are working on a web application […]

1 Introduction

[…] which already lead to a lower accuracy compared to antiqua texts – but also particular words, phrases and even whole sentences are printed in antiqua (cf. figure 1). Although we are lucky to have an OCR engine capable of processing mixed Gothic and antiqua texts, the alternation of the two scripts still has an impairing effect on the text quality. Since the interspersed antiqua tokens can be very short (e. g. the abbreviation Dr.), their diverting script is sometimes not recognized by the engine. This leads to heavily misrecognized words due to the different shapes of the letters; for example antiqua Landrecht (Engl.: "citizenship") is rendered as completely illegible I>aii […]

Proceedings of Language Technologies for Digital Humanities and Cultural Heritage Workshop, pages 97–103, Hissar, Bulgaria, 16 September 2011.

[…] errors. Section 4 is concerned with a web application.

2 OCR system

We use Abbyy's Recognition Server 3.0 for OCR. Our main reason to decide in favour of this product was its capability of converting text with both antiqua and Gothic script. Another great advantage in comparison to Abbyy's FineReader and other commercial OCR software is its detailed XML output file, which contains a lot of information about the recognized text.

Figure 2: Output XML of Abbyy Recognition Server 3.0, representing the word "Aktum".

Figure 2 shows an excerpt of the XML output. The fragment contains information about the five letters of the word Aktum (here meaning "governmental session"), which corresponds to the first word in figure 1. The first two XML tags, par and line, hold the contents and further information of a paragraph and a line respectively; e. g. we can see that this paragraph's alignment is centered. Each character is embraced by a charParams tag that has a series of attributes. The graphical location and dimensions of a letter are described by the four pixel positions l, t, r and b, indicating the left, top, right and bottom borders. The following boolean-valued attributes specify if a character is at a left word boundary (wordStart), if the word was found in the internal dictionary (wordFromDictionary) and if it is a regular word or rather a number or an identifier token (wordNormal, wordNumeric, wordIdentifier). charConfidence has a value between 0 and 100 indicating the recognition confidence for this character, while serifProbability rates its probability of having serifs on a scale from 0 to 255. The fourth character u shows an additional attribute suspicious, which only appears if it is set to true. It marks characters that are likely to be misrecognized (which is not the case here).

We are pleased with the rich output of the OCR system. Still, we can think of features that could make a happy user even happier. For example, post-correction systems could gain precision if the alternative recognition hypotheses made during OCR were accessible.¹ Some of the information provided is not very reliable, such as the attribute wordFromDictionary, which indicates if a word was found in the internal dictionary: on the one hand, a lot of vocabulary is not covered, whereas on the other, there are misrecognized words that are found in the dictionary (e. g. Bandirektion, which is an OCR error for German Baudirektion, Engl.: "building ministry"). While the XML attribute suspicious (designating spots with low recognition confidence) can be useful, the charConfidence value does not help a lot with locating erroneous word tokens.

An irritating aspect is that we found the dehyphenation to be done worse than in the previous engine for Gothic OCR (namely FineReader XIX). Words hyphenated at a line break (e. g. fakultative on the fourth line in figure 1) are often split into two halves that are no proper words; thus looking up the word parts in its dictionary does not help the OCR engine with deciding for the correct conversion of the characters.

¹ Abbyy's FineReader Software Developer Kit (SDK) is capable of this.
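The per-character markup described above lends itself to simple programmatic filtering. The following sketch parses a hypothetical charParams fragment, modelled on the attributes named in this section (the coordinate and confidence values are invented, since figure 2 itself is not reproduced here), and collects characters with low recognition confidence:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the charParams markup described above;
# the actual Abbyy output contains further attributes and namespaces.
SNIPPET = """
<par align="Center">
  <line l="409" t="589" r="532" b="624">
    <charParams l="409" t="589" r="437" b="624" wordStart="true"
                wordFromDictionary="true" wordNormal="true"
                charConfidence="95" serifProbability="255">A</charParams>
    <charParams l="440" t="590" r="461" b="623" wordStart="false"
                wordFromDictionary="true" wordNormal="true"
                charConfidence="41" serifProbability="180"
                suspicious="true">u</charParams>
  </line>
</par>
"""

def flag_low_confidence(xml_text, threshold=50):
    """Collect characters whose recognition confidence falls below a
    threshold or which the engine itself marked as suspicious."""
    flagged = []
    for char in ET.fromstring(xml_text).iter("charParams"):
        conf = int(char.get("charConfidence", "100"))
        if conf < threshold or char.get("suspicious") == "true":
            flagged.append((char.text, conf))
    return flagged

print(flag_low_confidence(SNIPPET))  # -> [('u', 41)]
```

As the section notes, such per-character confidences locate suspicious spots but do not by themselves locate erroneous word tokens.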

Since Abbyy's systems mark hyphenation with a special character when it is recognised, one can easily determine the recall by looking at the output. We encountered the Recognition Server's recall to be significantly lower than that of FineReader XIX, which is regrettable since it goes along with a higher error rate in hyphenated words.

The Recognition Server provides dictionaries for various languages, including so-called "Old German".² In a series of tests the "Old German" dictionary has shown slightly better results than the Standard German dictionary for our texts. Some, but not all, of the regular spelling variations compared to modern orthography are covered by this historical dictionary; e. g. old German Urtheil (in contrast to modern German Urteil, Engl.: "judgement") is found in Abbyy's internal dictionary, whereas reduzirt (modern German reduziert, Engl.: "reduced") is not. It is also possible to include a user-defined list of words to improve the recognition accuracy. However, when we added a list of geographical terms to the system and compared the output before and after, there was hardly any improvement. Although the words from our list were recognized more accurately, the gain was almost completely offset by new OCR errors unrelated to the words added.

All in all, Abbyy's Recognition Server 3.0 serves our purposes of digitizing well, especially with respect to handling large amounts of data and pipeline processing.

3 Automated Post-Correction

The main emphasis of our project is on the post-correction of recognition errors. Our evaluation of a limited number of text samples yielded a word accuracy of 96.7 %, which means that one word out of 30 contains an error (as e. g. Regieruug instead of correct Regierung, Engl.: "government"). We aim to significantly reduce the number of misrecognized word tokens by identifying them in the OCR output and determining the originally intended word with its correct spelling.

In order to correct a maximum of errors in the corpus with the given resources, we decided on an automated approach to the post-correction. The great homogeneity of the texts concerning the layout as well as the limited variation in the vocabulary are a good premise to achieve remarkable improvements. Having in mind the genre of our texts, we followed the idea of generating useful documents ready for reading, and refrained from manually correcting or even re-keying the complete texts, as did e. g. the project Deutsches Textarchiv³, which has a very literary, almost artistic standard in digitizing first-edition books from 1780 to 1900.

The task resembles that of a spell-checker as found in modern text processing applications, with two major differences. First, the scale of our text data – with approximately 11 million words we expect more than 300,000 erroneous word forms – does not allow for an interactive approach, asking a human user for feedback on every occurrence of a suspicious word form. Therefore we need an automatic system that can account for corrections with high reliability. Second, dating from the late 19th century, the texts show historic orthography, which differs in many ways from the spelling encountered in modern dictionaries (e. g. historic Mittheilung vs. modern Mitteilung, Engl.: "message"). This means that using modern dictionary resources directly cannot lead to satisfactory results.

Additionally, the governmental resolutions typically contain many toponymical references, i. e. geographical terms such as Steinach (a stream or place name) or Schwarzenburg (a place name). Many of these are not covered by general-vocabulary dictionaries. We also find regional peculiarities, such as pronunciation variants or words that are not common elsewhere in the German-speaking area, e. g. Hülfstrupp vs. standard German Hilfstruppe, Engl.: "rescue team". Of course there is also a great amount of genre-specific vocabulary, i. e. administrative and legal jargon. We are hence challenged to build a fully-automated correction system with high precision that is aware of historic spelling and regional variants and covers geographical and governmental language.

3.1 Classification

The core task of our correction system with respect to its precision is the classification routine that determines the correctness of every word. This part is evidently the basis for all correcting steps.

² The term is not to be confused with the linguistic epoch Old High German, which refers to texts from the first millennium A. D. Abbyy's "Old German" seems to cover some orthographical variation from the 19th, maybe the 18th century.
³ See www.deutschestextarchiv.de

Misclassifications are detrimental, as they can lead to false corrections. At the same time, it is also the hardest task, as there are no formal criteria that universally describe orthography. Reynaert (2008) suggests building a corpus-derived lexicon that is mainly based on word frequencies and the distribution of similarly spelled words. However, this is sometimes misleading. For example, saumselig (Engl.: "dilatory") is a rare but correct word, which is found only once in the whole corpus, whereas Liegenschast is not correct (in fact, it is misrecognized for Liegenschaft, Engl.: "real estate"), but it is found sixteen times in this erroneous form. This shows that the frequency of word forms alone is not a satisfactory indicator to distinguish correct words from OCR errors; other information has to be used as well.

An interesting technique for reducing OCR errors is comparing the output of two or more different OCR systems. The fact that different systems make different errors is used to localize and correct OCR errors. For example, Volk et al. (2010) achieved improvements in OCR accuracy by combining the output of Abbyy FineReader 7 and Omnipage 17. However, this approach is not applicable to our project for the simple reason that there is, to our knowledge, for the time being only one state-of-the-art OCR system that recognizes mixed antiqua / Gothic script text.

The complex problem of correcting historical documents with erroneous tokens is addressed by Reffle (2011) with a two-channel model, which integrates spelling variation and error correction into one algorithm. To cover the unpredictable number of spelling variants of texts with historical orthography (or rather, depending on their age, texts without orthography), the use of static dictionaries is not reasonable, as they would require an enormous size. Luckily, the orthography of our corpus is comparatively modern and homogeneous, which means that there is a manageable number of deviations in comparison to modern spelling and between different parts of the corpus.

In our approach, we use a combination of various resources for this task, such as a large dictionary system for German that covers morphological variation and compounding (such as ging, a past form of gehen, Engl.: "to go", or German Niederdruckdampfsystem, a compound of four segments meaning "low-pressure steam system"), a list of local toponyms (since our texts deal mostly with events in the administered area) and the recognition confidence values of the OCR engine. Every word is either judged as correct or erroneous according to the information we gather from the various resources.

The historic and regional spelling deviations are modelled with a set of handwritten rules describing the regular differences. For example, with a rule stating that the sequence th corresponds to t in modern spelling, the old form Mittheilung can be turned into the standard form Mitteilung. While the former word is not found in the dictionary, the derived one is, which allows for the assumption that Mittheilung is a correctly spelled word.

The classification method is applied to single words, which are thus considered as a correct or erroneous form separately and without their original context. This reduces the number of correction candidates significantly, as not every word is to be handled, but only the set of the distinct word forms. However, this limits the correction to non-word errors, i. e. misrecognized tokens that cannot be interpreted as a proper word. For example, Fcuer is clearly a nonsense word, which should be spelled Feuer (Engl.: "fire"). In contrast, the two word forms Absatze and Absätze (Engl.: "paragraph") are two correct morphological variants of the same word, which are often confused during OCR. To decide which one of the two variants is correct in a particular occurrence, the word has to be judged in its original context, which is outside the scope of our project.

The decision for the correctness of a word is primarily based on the output of the German morphology system Gertwol by Lingsoft, in that we check every word for its analyzability. Although the analyses of Gertwol are reliable in most cases, there are two kinds of exceptions that can lead to false classifications. First, there are correct German words that are unknown to Gertwol, such as Latin words like recte (Engl.: "right") or proper names. Second, sometimes Gertwol finds analyses for misrecognized words; e. g. correct Regierungsrat (Engl.: "member of the government") is often rendered Negierungsrat, which is then analysed as a compound of Negierung (Engl.: "negation") and Rat (Engl.: "councillor"). To avoid these misclassifications we apply heuristics which include word lists for known frequent issues, the frequency of the word form and the feedback values of the OCR system concerning the recognition confidence.
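The rule-based handling of historic spelling just described can be sketched as follows. The dictionary and the two rules below are toy stand-ins chosen for illustration only; the actual system relies on the Gertwol morphology and a larger handwritten rule set:

```python
import re

# Toy stand-ins: the project uses the Gertwol morphology system and many
# more rules; these few entries only illustrate the mechanism.
MODERN_DICTIONARY = {"Mitteilung", "Urteil", "reduziert", "Regierung"}
SPELLING_RULES = [
    (re.compile(r"th"), "t"),        # historic th -> modern t (Mittheilung -> Mitteilung)
    (re.compile(r"irt\b"), "iert"),  # historic -irt -> modern -iert (reduzirt -> reduziert)
]

def is_known_word(word):
    """Accept a word if it is in the dictionary directly, or if one of the
    historic-spelling rules maps it onto a dictionary word."""
    if word in MODERN_DICTIONARY:
        return True
    for pattern, replacement in SPELLING_RULES:
        if pattern.search(word) and pattern.sub(replacement, word) in MODERN_DICTIONARY:
            return True
    return False

print(is_known_word("Mittheilung"))    # -> True: th -> t yields Mitteilung
print(is_known_word("Negierungsrat"))  # -> False with this toy dictionary
```

With a morphology system in place of the set lookup, the same derive-then-check logic classifies historic spellings as correct without listing them explicitly.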

3.2 Correction

The set of all words recognized as correct, together with their frequencies, can now serve as a lexicon for correcting erroneous word tokens. This corpus-derived lexicon is by nature highly genre-specific, which is desirable. On the other hand, rare words might remain uncorrected if all of their few occurrences happen to be misspelled, in which case there will be no correction candidate in the lexicon. Due to the repetitive character of the texts – administrative work involves many recurrent issues – there is also a lot of repetition in the vocabulary across the corpus. This increases the chance that a word misrecognized in one place has a correct occurrence in another.

The procedure of finding correction candidates is algorithmically complex, since the orthographical similarity of words is not easily described in a formal way. There is no linear ordering that would group together similar words. In the simplest approach to searching for similar words in a large list, every word has to be compared to every other word, which has quadratic complexity in the number of words. In our system we avoid this with two different approaches that are applied in succession. In the first step, only a limited set of character substitutions is performed, whereas the second step allows for a broader range of deviations between an erroneous form and its correction.

3.2.1 Substitution of similar characters

In this approach we attempt to undo the most common OCR errors. When working with automatically recognized text, one will inevitably notice that a small set of errors accounts for the majority of misrecognized characters. At character level, OCR errors can be described as substitutions of single characters. Thus, by ranking these error types by their frequency, it can be shown that already a small sample of OCR errors resembles a Zipfian distribution,⁴ as is demonstrated in figure 3. For example, the 20 most frequent types of errors, which is less than 14 % of the 143 types encountered, sum up to 50 % of all occurrences of character errors. Table 1 shows the head of this error list as well as its tail. Among the top 6 we find pairs like u and n that are also often confounded when recognizing antiqua text, as well as typical Gothic-script confusions like d–b or s–f. Figure 4 illustrates the optical similarity of certain characters that can also challenge the human eye. For example, due to its improper printing, the fifth letter cannot be determined clearly as u or n.

  rank   frequency   {correct}–{recognized}
     1          49   {u}–{n}
     2          38   {n}–{u}
     3          14   {a}–{ä}
     4          14   {}–{.}
     5          13   {d}–{b}
     6          11   {s}–{f}
     …           …   …
   140           1   {o}–{p}
   141           1   {r}–{?}
   142           1   {§}–{8}
   143           1   {ä}–{ö}

Table 1: Both ends of a ranked list, showing character errors and their frequency.

Figure 3: OCR errors ranked by their frequency, from a sample text of approx. 50,000 characters length.

Figure 4: Gothic characters of the Fraktur typeface that are frequently confused by the OCR system. From left to right, the characters are: s (in its long form), f, u, n, u or n, B, V, R, N.

For our correction system we use a manually edited substitution table that is based on these observations. The substitution pairs are inverted and used as replacement operations that are applied to the misspelled word forms. The spelling variants produced are then searched for correct words that can be found in our dictionary.

⁴ Zipf's Law states that, given a large corpus of natural language data, the rank of each word is inversely proportional to its frequency.
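The inverted substitution table can be sketched as follows. The table excerpt and the tiny lexicon are illustrative assumptions (the real table is manually edited from the full ranked error list, and the lexicon is derived from the corpus):

```python
from itertools import combinations

# Inverted substitution pairs (recognized character -> intended character),
# a small illustrative excerpt of the manually edited table.
SUBSTITUTIONS = {"u": "n", "n": "u", "c": "e", "b": "d", "f": "s"}
LEXICON = {"Regierung", "Feuer", "sprechen"}

def correction_candidates(token):
    """Apply every subset of possible single-character substitutions and
    keep the variants that are known words."""
    positions = [i for i, ch in enumerate(token) if ch in SUBSTITUTIONS]
    candidates = set()
    for r in range(1, len(positions) + 1):
        for subset in combinations(positions, r):
            chars = list(token)
            for i in subset:
                chars[i] = SUBSTITUTIONS[chars[i]]
            variant = "".join(chars)
            if variant in LEXICON:
                candidates.add(variant)
    return candidates

print(correction_candidates("Regieruug"))  # -> {'Regierung'}
```

Note that trying every subset of substitution positions is exponential in the number of substitutable characters, so a production system would cap the number of simultaneous replacements.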

For example, given an erroneous token sprechcn and a substitution pair e, c stating that recognized c can represent actual e, the system produces the variants spreehcn, sprechen and spreehen. Consulting the dictionary finally uncovers sprechen (Engl.: "to speak") as a correct word, suggesting it as a correction for garbled sprechcn.

3.2.2 Searching shared trigrams

In a second step, we look for similar words more generally. This is applied to the erroneous words that could not be corrected by the method described above. There are various techniques to efficiently find similar words in large amounts of data, such as Reynaert's (2005) anagram hashing method, which treats words as multisets of characters and models edit operations as arithmetic functions. Mihov and Schulz (2004) combine finite-state automata with filtering methods that include a-tergo dictionaries.

In our approach, we use character n-grams to find similar correct words for each of the erroneous words. Since many of the corrections with a close editing distance have already been found in the previous step, we have to look for corrections in a wider spectrum now. In order to reduce the search space, we create an index of trigrams that points to the lexicon entries. Every word in the lexicon is split into overlapping sequences of three characters, which are then stored in a hashed data structure. For example, from ersten (Engl.: "first") we get the four trigrams ers, rst, ste and ten. Each trigram is used as a key that points to a list of all words containing this three-letter sequence, so ten not only leads to ersten, but also to warten (Engl.: "to wait") and many other words with the substring ten.

Likewise, all trigrams are derived from each erroneous word form. All lexicon entries that share at least one trigram with the word in question can thus be accessed by the trigram index. The lexicon entries found are now ranked by the number of trigrams shared with the error word. To stay with the previous example, a misrecognized word form rrsten⁵ has three trigrams (rst, ste, ten) in common with ersten, but only one trigram (ten) is shared with warten.

The correction candidates found with this method are further challenged, e. g. by using the Levenshtein distance as a similarity measure and by rejecting corrections that affect the ending of a word.⁶

3.3 Evaluation

In order to measure the effect of the correction system, we manually corrected a random sample of 100 resolutions (approximately 35,000 word tokens). Using the ISRI OCR-Evaluation Frontiers Toolkit (Rice, 1996), we determined the recognition accuracy before and after the correction step. As can be seen in table 2, the text quality in terms of word accuracy could be improved from 96.72 % to 98.36 %, which means that the total number of misrecognized words was reduced by half.

4 Crowd Correction

Inspired by the Australian Newspaper Digitisation Program (Holley, 2009) and their online crowd-correction system, we are setting up an online application. We are working on a tool that allows for browsing through the corpus and reading the text in a synoptical view of both the recognized plain text and a scan image of the original printed page. Our online readers will have the possibility to immediately correct errors in the recognized text by clicking a word token for editing. The original word in the scan image is highlighted for the convenience of the user, which permits fast comparison of image and text, so the intended spelling of a word can be verified quickly.

Crowd correction combines easy access to the historic documents with manual correcting. Similar to the concept of publicly maintained services such as Wikipedia and its followers, information is provided free of charge. At the same time, the user has the possibility to give something back by improving the quality of the corpus.

⁵ The correction pair ersten–rrsten is not found in the first step although its editing distance is 1 and thus minimal. But since the error {e}–{r} has a low frequency, it is not contained in the substitution table, which makes this correction invisible for the common-error routine.
⁶ Editing operations affecting a word's ending are a delicate issue in an inflecting language like German, since it is likely that we are dealing with morphological variation instead of an OCR error. Data sparseness of rare words can lead to unfortunate constellations. For example, the adjective form innere (Engl.: "inner") is present in the lexicon, while the inflectional variant innern is lacking. However, the latter form is indeed present in the corpus, but only in the misrecognized form iunern. As innere is the only valuable correction candidate in this situation, the option would be changing misspelled iunern into grammatically inapt innere. For the sake of legibility, an unorthographical word is preferred to an ungrammatical form.
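The trigram index and the ranking of candidates by shared trigrams can be sketched as follows. The three-word lexicon is a toy example, and the Levenshtein and word-ending filters mentioned above are omitted:

```python
from collections import defaultdict

def trigrams(word):
    """Overlapping character trigrams, e.g. 'ersten' -> ers, rst, ste, ten."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(lexicon):
    """Map each trigram to the set of lexicon words containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for tri in trigrams(word):
            index[tri].add(word)
    return index

def ranked_candidates(error, index):
    """Rank lexicon entries by the number of trigrams shared with the error."""
    counts = defaultdict(int)
    for tri in trigrams(error):
        for word in index[tri]:
            counts[word] += 1
    return sorted(counts, key=counts.get, reverse=True)

index = build_index({"ersten", "warten", "besten"})
print(ranked_candidates("rrsten", index))  # -> ['ersten', 'besten', 'warten']
```

Only lexicon entries sharing at least one trigram with the error form are ever touched, which avoids the quadratic all-pairs comparison described above.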

                      plain (errors)    corrected (errors)    improvement
character accuracy    99.09 % (2428)    99.39 % (1622)        33.2 %
word accuracy         96.72 % (1165)    98.36 % (584)         49.9 %

Table 2: Evaluation results of the automated correction system.

It is important to keep the technical barrier low, so that new users can start correcting straight away without needing to learn a new formalism. With our click-and-type tool the access is easy and intuitive. However, in consequence, the scope of the editing operations is limited to the word level; it does not, for example, allow for rewriting whole passages.

5 Conclusion

With this project we demonstrate how to build a highly specialized correction system for a specific text collection. We are using both general and specific resources, while the approach as a whole is widely generalizable. Of course, working with older texts with less standardized orthography than late 19th-century governmental resolutions may lead to a lower improvement, but the techniques we applied are not bound to a certain epoch. In the crowd correction approach we see the possibility to further improve OCR output in combination with online access to the historic texts.

References

Rose Holley. 2009. How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine, 15(3/4).

Stoyan Mihov and Klaus U. Schulz. 2004. Fast Approximate Search in Large Dictionaries. Computational Linguistics, 30(4):451–477.

Ulrich Reffle. 2011. Efficiently generating correction suggestions for garbled tokens of historical language. Natural Language Engineering, 17(2):265–282.

Martin Reynaert. 2005. Text-Induced Spelling Correction. Ph. D. thesis, Tilburg University, Tilburg, NL.

Martin Reynaert. 2008. Non-Interactive OCR Post-Correction for Giga-Scale Digitization Projects. Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008, Berlin, 617–630.

Stephen V. Rice. 1996. Measuring the Accuracy of Page-Reading Systems. Ph. D. dissertation, University of Nevada, Las Vegas, NV.

Martin Volk, Torsten Marek, and Rico Sennrich. 2010. Reducing OCR Errors by Combining Two OCR Systems. ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), Lisbon, Portugal, 16 August 2010, 61–65.