An Unsupervised Method for OCR Post-Correction and Spelling Normalisation for Finnish

Quan Duong,♣ Mika Hämäläinen,♣,♦ Simon Hengchen♠
firstname.lastname@{helsinki.fi;gu.se}
♣University of Helsinki, ♦Rootroo Ltd, ♠University of Gothenburg

Abstract

Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process, and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub¹ and Zenodo².

1 Introduction

Natural language processing (NLP) is arguably tremendously difficult to tackle in Finnish, due to an extremely rich morphology. This difficulty is reinforced by the limited availability of NLP tools for Finnish in general, and perhaps even more so for historical data by the fact that morphology has evolved through time – some older inflections either do not exist anymore, or are hardly used in modern Finnish. As historical data comes with its own challenges, the presence of OCR errors makes the data even more burdensome to modern NLP methods.

Obviously, this problematic situation is not unique to Finnish. There are several other languages in the world with rich morphologies and relatively poor support for both historical and modern NLP. Such is the case with most of the languages related to Finnish, like Erzya, Sami and Komi: these Uralic languages are severely endangered but have valuable historical resources in books that are not yet available in a digital format. OCR remains a problem especially for endangered languages (Partanen, 2017), although OCR quality for such languages can be improved by limiting the domain in which the OCR models are trained and used (Partanen and Rießler, 2019).

Automated OCR post-correction is usually modelled as a supervised machine learning problem where a model is trained with parallel data consisting of erroneous OCRed text and manually corrected text. However, we want to develop a method that can be used even in contexts where no manually annotated data is available. The most viable recent method for such a task is the one presented by Hämäläinen and Hengchen (2019). However, their model works only on correcting individual words without considering the context in sentences, and as it focuses on English, it completely ignores the issues arising from a rich morphology. Extending their approach, we introduce a self-supervised model to automatically generate parallel data which is learned from the real OCRed text. We then train sequence-to-sequence (seq2seq) NMT models on the character level with context information to correct OCR errors. The NMT models are based on the Transformer architecture (Vaswani et al., 2017), and a detailed comparison of them is presented in this article.

¹Source code: https://github.com/ruathudo/post-ocr-correction
²Trained models: https://doi.org/10.5281/zenodo.4242890
2 Related work

As more and more digital humanities (DH) work starts to use the large-scale, digitised and OCRed collections made available by national libraries and other digitisation projects, the quality of OCR is a central point for text-based humanities research. Can one trust the output of complex NLP systems, if these are fed with bad OCR? Beyond the common pitfalls inherent to historical data (see Piotrowski (2012) for a very thorough overview), some works have tried to answer the question stated above: Hill and Hengchen (2019) use a subset of the 18th-century corpus ECCO³ as well as its keyed-in counterpart ECCO-TCP to compare the output of common NLP tasks used in DH, and conclude that OCR noise does not seem to be a large factor in quantitative analyses – a conclusion similar to previous work by Rodriquez et al. (2012) in the case of NER and to Franzini et al. (2018) for authorship attribution, but in opposition to Mutuvi et al. (2018), who focus on topic modelling for historical newspapers and conclude that OCR does play a role. More recently and still on historical newspapers, van Strien et al. (2020) conclude that while OCR noise does have an impact, its effect widely differs between downstream tasks.

It has become apparent that OCR quality for historical texts is central for funding bodies and collection-holding institutions alike. Reports such as the one put forward by Smith and Cordell (2019) give rise to OCR initiatives, while the Library-of-Congress-commissioned report by Cordell (2020) underlines the importance of OCR for cultural heritage collections. These reports echo earlier work by, among others, Tanner et al. (2009), who tackle the digitisation of British newspapers, the EU-wide IMPACT project⁴ that gathers 26 national libraries, or Adesam et al. (2019), who set out to analyse the quality of OCR made available by the Swedish language bank.

OCR post-correction has been tackled in previous work. Specifically for Finnish, Drobac et al. (2017) correct the OCR of newspapers using weighted finite-state methods, while Silfverberg and Rueter (2015) do the same for Finnish (and Erzya). Most recent approaches rely on the machine translation (MT) of "dirty" text into "clean" text. These MT approaches are quickly moving from statistical MT (SMT) – as previously used for historical text normalisation, e.g. the work by Pettersson et al. (2013) – to NMT: Dong and Smith (2018) use a word-level seq2seq NMT approach for OCR post-correction, while Hämäläinen and Hengchen (2019), on which we base our work, mobilised character-level NMT. Very recently, Nguyen et al. (2020) use BERT embeddings to improve an NMT-based OCR post-correction system for English.

³Eighteenth Century Collections Online, https://www.gale.com/primary-sources/eighteenth-century-collections-online
⁴http://www.impact-project.eu

3 Experiment

In this section, we describe our methods for automatically generating parallel data that can be used in a character-level NMT model to conduct OCR post-correction. In short, our method requires only a corpus with OCRed text that we want to automatically correct, a word list, a morphological analyzer and any corpus of error-free text. Although we focus on Finnish only, it is important to note that such resources exist for many endangered Uralic languages as well: they have extensive XML dictionaries and FSTs available (see Hämäläinen and Rueter (2018)), together with a growing number of Universal Dependencies treebanks (Nivre et al., 2016) such as Komi-Zyrian (Lim et al., 2018), Erzya (Rueter and Tyers, 2018), Komi-Permyak (Rueter et al., 2020) and North Sami (Sheyanova and Tyers, 2017).
3.1 Baseline

We design the first experiment based on the previous work of Hämäläinen and Hengchen (2019), who train a character-level NMT system. Their research indicates that there is a strong semantic relationship between a correct word and its erroneous forms, and that OCR error candidates can be generated using Word2Vec. To be able to train the NMT model, we need to extract parallel data of correct words and their OCR errors. Accordingly, we trained a Word2Vec model (Mikolov et al., 2013) on the Historical Newspapers of Finland from 1771 to 1929 using the Gensim library (Řehůřek and Sojka, 2010). After obtaining the Word2Vec model and its trained vocabulary, we extract the parallel data by using the Finnish morphological FST, Omorfi (Pirinen, 2015), provided in the UralicNLP library (Hämäläinen, 2019), and – following Hämäläinen and Hengchen (2019) – Levenshtein edit distance (Levenshtein, 1965). The original approach used a lemma list for English for the data extraction, but we use an FST so that we can distinguish morphological forms from OCR errors. Without the FST, different inflectional forms would also be considered OCR errors, which is particularly counterproductive with a highly-inflected language.

We build a list of correct Finnish words by lemmatising all words in the Word2Vec model's vocabulary: if the lemma is present in the Finnish Wiktionary lemma list,⁵ the word is considered correct and saved as such. Next, for each word in this "correct" list, we retrieve the most similar words from the Word2Vec model. Those similar words are checked to see whether they exist in the correct list or not, and are separated into two different groups: correct words and OCR errors. Notice that not all the words in the error list are erroneous OCR forms of the given correct word, and such words thus need to be filtered out. Following Hämäläinen and Hengchen (2019), we calculate the Levenshtein edit distance of each OCR error to the correct word and empirically set a threshold of 4 as the maximum distance at which a word is accepted as a true error form of the given word. As a result, for each given correct word, we have a set of similar correct words (including the given one) and a set of error words. From the two extracted groups, we do pairwise mapping to obtain one error word as training input and one correct word as the target output. Finally, the parallel data is converted into a character-level format before being fed to the NMT model for training. For example, in the pair j o l e e n → j o k e e n ("into a river"), the first word is incorrect and the second one is the right form. We follow Hämäläinen and Hengchen (2019) and use OpenNMT (Klein et al., 2017) with default settings, i.e. a bi-directional LSTM with global attention (Luong et al., 2015). We train for 10,000 steps and keep the last checkpoint as the baseline, which will be referred to as "NATAS" in the remainder of this paper.

⁵https://fi.wiktionary.org/wiki/Wikisanakirja:Etusivu
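For illustration, this extraction step could be sketched as follows with Gensim; the helper names, the pure-Python edit distance and the pre-built `correct_words` set (which in our pipeline comes from the Omorfi and Wiktionary check) are simplifying assumptions rather than the exact implementation.

```python
from gensim.models import Word2Vec

def levenshtein(a, b):
    # Standard dynamic-programming edit distance (Levenshtein, 1965).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def extract_pairs(model, correct_words, topn=10, max_dist=4):
    """Pair each correct word with Word2Vec neighbours that look like
    OCR errors: out-of-lexicon words within edit distance 4."""
    pairs = []
    for word in correct_words:
        if word not in model.wv:
            continue
        for neighbour, _ in model.wv.most_similar(word, topn=topn):
            if neighbour not in correct_words and levenshtein(neighbour, word) <= max_dist:
                pairs.append((neighbour, word))  # (OCR error, correct form)
    return pairs

# model = Word2Vec.load("hist_newspapers.w2v")   # hypothetical model path
# pairs = extract_pairs(model, correct_words)
```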
3.2 Methods

In the following subsections, we introduce a different method to create a parallel dataset, and apply a new sequence-to-sequence model to train on the data. The baseline approach presented above might introduce noise when we are unable to confidently know that an error word is mapped correctly to the given correct word, especially in the case of semantically similar words that have similar lengths. Another limitation of the baseline approach is that an NMT model usually requires more variants to achieve better performance – something limited by the vocabulary of the Word2Vec model, which is trained with a frequency threshold so as to provide semantically similar words. To solve these problems, we artificially introduce OCR-like errors into a modern corpus, and thus obtain more variants of the training word pairs and less noise in the data. We further specialise our approach by applying the Transformer model in experiments with and without context words, instead of the default OpenNMT algorithms for training. In the next section, we detail our implementation.

3.2.1 Dataset Construction

For the artificial dataset, we use the Yle News corpus⁶, which contains more than 700 thousand articles written in Finnish from 2011 to 2018. All the articles are stored in a text file. Punctuation and characters not present in the Finnish alphabet are removed before tokenisation. After cleaning, we generate an artificial dataset by two different methods: a random generator and a trained OCR error generator model.

⁶http://urn.fi/urn:nbn:fi:lb-2019030701

Random Generator. As previously stated, we use a random generator to sample an OCR error word. In OCRed text, an error normally happens when a character is misrecognized or ignored. This behavior causes some characters in the word to be missed, altered or introduced; the wrong characters make up a small ratio of the text. Thus, we design Algorithm 1 to produce similar errors in the modern corpus (a Python sketch of the procedure is given after the algorithm).

For each word in the dataset, we introduce errors by deleting, replacing and adding characters randomly, with a noise-rate threshold of 0.07. The characters to be changed, added or removed must be in the Finnish alphabet; we do not introduce special characters as errors. The idea is that we select a random character position in the string with a probability smaller than the noise rate multiplied by the length of the string, which restricts the percentage of errors in the word. This means that for a long word (e.g. 15 characters), an error will always be proposed. The process is repeated for each of the actions of deleting, replacing and adding; thus, a word could have all kinds of errors, or none if the random number is bigger than the threshold. A longer word is likely to have more errors than a shorter one.

Algorithm 1 Random error generator
1: procedure RANDOMERROR(Word, NoiseRate)
2:   Alphas = "abcdefghijklmnopqrstuvwxyzäåö"
3:   for Action in [delete, add, replace] do
4:     generate a random number Rand between 0 and 1
5:     if Rand < NoiseRate × WordLength then
6:       select a random character position P in Word
7:       if the character at P is in Alphas then
8:         do the Action at P with Alphas
9:       end if
10:    end if
11:  end for
12: end procedure
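As a concrete companion to Algorithm 1, a minimal Python rendering could look as follows; the guard against deleting a word away entirely is our addition.

```python
import random

ALPHAS = "abcdefghijklmnopqrstuvwxyzäåö"

def random_error(word, noise_rate=0.07):
    """Corrupt a word with OCR-like noise, mirroring Algorithm 1."""
    chars = list(word)
    for action in ("delete", "add", "replace"):
        if not chars:
            break  # our guard: do not corrupt a word away entirely
        # The trigger probability scales with word length, so a long word
        # (e.g. 15 characters) always receives at least one error.
        if random.random() < noise_rate * len(chars):
            p = random.randrange(len(chars))
            if chars[p] in ALPHAS:
                if action == "delete":
                    del chars[p]
                elif action == "add":
                    chars.insert(p, random.choice(ALPHAS))
                else:  # replace
                    chars[p] = random.choice(ALPHAS)
    return "".join(chars)
```

For instance, `random_error("jokeen")` might yield "joleen", the kind of pair used in the earlier character-level example.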

Trained Generator. Similarly to the random generator, we modify the correct word into an erroneous form, but with a different approach. Instead of pure randomness, we build a model to better simulate OCR erroneous forms. The hypothesis is that if the artificial errors introduced to words have the same patterns as found in the real OCRed text, the resulting model will be more effective when applied back to the real dataset. For example, the letters "i" and "l" are more likely to be misrecognized for one another by the OCR engine than "i" and "g".

To build the error generation model, we use the parallel dataset extracted in the NATAS experiment. However, the source and target for the NMT model are reversed, with correctly spelled words as the input and erroneous words as the output of the training. By trying to predict an OCR erroneous form for a given correct spelling, the model can learn an error pattern that mimics the real OCRed text.
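A minimal sketch of preparing such reversed training data, assuming pairs in the (error, correct) order produced by the extraction step; the helper and file names are illustrative.

```python
def to_char_format(word):
    # "jokeen" -> "j o k e e n": one character per token.
    return " ".join(word)

def write_generator_data(pairs, src_path="gen_src.txt", tgt_path="gen_tgt.txt"):
    """Write reversed (correct -> error) pairs as character sequences:
    training data for the OCR error *generator*, not the corrector."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for error, correct in pairs:
            src.write(to_char_format(correct) + "\n")  # clean spelling as input
            tgt.write(to_char_format(error) + "\n")    # OCR-like form as output
```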
OpenNMT uses cross-entropy loss by default, which causes an issue when applied to this problem: in our experiments, the model eventually predicted an output identical to the source, because that is the most optimal way to reduce the loss. If we want to generate output that differs from the input, we need to penalize the model when it makes the same prediction as the input. To solve the problem, we built a simple RNN translation model with GRU (gated recurrent unit) layers and a custom loss function as shown in Equation 2. The loss function is built up from the cross-entropy cost function in Equation 1, where $H = \{h^{(1)}, \dots, h^{(n)}\}$ is the set of predicted outcomes from the model and $T = \{t^{(1)}, \dots, t^{(n)}\}$ is the set of targets. We calculate the normal cross-entropy of the predicted output $\hat{Y}$ and the labels $Y$ to find an optimal way to mimic the target $Y$; on the other hand, the inverted cross-entropy between $\hat{Y}$ and the inputs $X$ punishes the model if the outcomes are identical to the inputs.

$$\mathrm{cross\_entropy}(H,T) = -\frac{1}{n}\sum_{i=1}^{n}\left[t^{(i)}\ln h^{(i)} + (1-t^{(i)})\ln(1-h^{(i)})\right] \qquad (1)$$

$$\mathrm{loss} = \mathrm{cross\_entropy}(\hat{Y},Y) + \frac{1}{\mathrm{cross\_entropy}(\hat{Y},X)} \qquad (2)$$

The model's encoder and decoder each have one embedding layer of 128 dimensions and one GRU layer of 512 hidden units. The input sequence is encoded to obtain the source's context, and this context is then passed to the decoder. For each next character of the output, the decoder concatenates the source's context, the hidden context and the character's embedding vector. The merged vectors are then passed through a linear layer to give the prediction. The model is trained with the teacher forcing technique at a rate of 0.5. This means that for the next input character, we either select the top one from the previous output or use the already known next character from the target label.
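A minimal PyTorch sketch of this custom loss, assuming the predictions are per-character probabilities and the labels and inputs are one-hot tensors padded to a common shape; the epsilon guard against division by zero is our addition.

```python
import torch.nn.functional as F

def generator_loss(pred, target, source, eps=1e-8):
    """Equation 2: mimic the erroneous target while penalising
    predictions that merely copy the clean input."""
    ce_target = F.binary_cross_entropy(pred, target)  # Equation 1 against Y
    ce_source = F.binary_cross_entropy(pred, source)  # Equation 1 against X
    # When the prediction collapses onto the input, ce_source tends to 0
    # and the inverted term grows, steering the model off the identity map.
    return ce_target + 1.0 / (ce_source + eps)
```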
3.2.2 Models

Parallelisation and long-range memorisation are weak points of RNNs in NMT (Bai et al., 2018). Fortunately, the Transformer has proved to be much faster (mainly due to the absence of recurrence), and since it processes sequences as a whole, it has been shown to "remember" information better through its multi-head attention mechanism and positional embeddings (Vaswani et al., 2017). The Transformer has been shown to be extremely efficient in various tasks (see e.g. BERT (Devlin et al., 2018)), which is why we apply this model to our problem. Our implementation of the Transformer model is based on Vaswani et al. (2017) and uses the PyTorch framework.⁷ The model contains 3 encoder and 3 decoder layers, each of which has 8 heads of self-attention. We also implement a learned positional encoding and use Adam (Kingma and Ba, 2014) as the optimizer with a static learning rate of 5 · 10⁻⁴, which in our experiments gave better convergence than the default value of 0.001. Following prior work, cross-entropy was again used as the loss function.

Our baseline NATAS only has fixed training samples extracted from the Word2Vec model. In this experiment, we design a dynamic data loader which generates new erroneous words for every mini-batch during training, allowing the model to learn from more variants at every iteration. As mentioned in the introduction, we train contextualised sequence-to-sequence character-based models. Instead of feeding a single error word to the model as the input, we combine it with the context words before and after it in the sequence. We only consider the correct form of the error word as the label, and do not predict the context words. The input consists of the error (target) word in the middle with context words on both sides, making up a window of an odd number of words. Hence, a valid window sliding over the corpus must have an odd size, for instance 3, 5, etc. The way we construct the input and gold label is presented as follows (a code sketch is given below):

• The window size of n words is selected. The middle word is considered the target word.
• The words on the left and right of the target are the context words.
• The input sequence is converted into the proper format; for example, with window n = 5, the input is the character sequence l e f t  c o n t e x t  f a r g e t  r i g h t  c o n t e x t wrapped in special tokens, where:
  – a start-of-sequence token marks the beginning of the sequence;
  – a separator token separates the context words;
  – a second separator sets the left and right context apart from the target;
  – an end-of-sequence token marks the end of the sequence;
  – a padding token provides the padding needed for mini-batch training.

Following the previous section, the "target" word is generated by creating artificial errors in two different ways: using the random generator and the trained generator. For instance, the word "target" in the example above has been modified to "farget", and the model is trained to predict the output "target". The gold label is formatted in the same way, but without any context words: in this case, the label is the character sequence t a r g e t between the start and end tokens. After having the pairs of input and label formatted properly, we feed them into the Transformer model with a batch size of 256 – a balance between speed and accuracy in our case. In this experiment, we evaluate our model with 3 different window sizes: 1, 3 and 5, where the window size of 1 is a special case: there are no context words, and the input is simply f a r g e t between the start and end tokens. For every window size, we train with the two different error generators (Random and Trained), and thus have 6 models in total. These models are named hereafter TFRandW1, TFRandW3, TFRandW5, TFTrainW1, TFTrainW3 and TFTrainW5, where TF stands for Transformer, Rand for the random generator, Train for the trained generator, and Wn for a window of n words. We proceeded with the training until the loss converged. All models converged after around 20 epochs. The losses for the Train models are ∼0.064, and those for the Rand models are slightly lower, at ∼0.059.

⁷https://pytorch.org/
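The window construction could be sketched as follows; the concrete token strings (<sos>, <eos>, <w>) are our stand-ins for the special symbols, and `corrupt` is either the random or the trained error generator.

```python
SOS, EOS, SEP = "<sos>", "<eos>", "<w>"  # stand-in special tokens

def make_example(words, idx, window, corrupt):
    """Build one (input, label) pair: the corrupted middle word plus its
    character-level context for an odd window size (1, 3, 5, ...)."""
    half = window // 2
    left = words[max(0, idx - half):idx]
    right = words[idx + 1:idx + 1 + half]
    middle = corrupt(words[idx])  # e.g. random_error(words[idx])
    pieces = left + [middle] + right
    body = f" {SEP} ".join(" ".join(w) for w in pieces)
    inp = f"{SOS} {body} {EOS}"
    label = f"{SOS} {' '.join(words[idx])} {EOS}"  # correct form, no context
    return inp, label
```

With window = 1 this degenerates to the no-context case, since the left and right slices are empty and only the corrupted target word remains between the start and end tokens.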
4 Evaluation

We evaluate all proposed models and the NATAS baseline on the Ground Truth Finnish Fraktur dataset⁸ made available by the National Library of Finland, a collection of 479 journal and newspaper pages from the time period 1836–1918 (Kettunen et al., 2018). The data is formatted as a CSV table with 471,903 lines of words or characters, with four columns of ground truth (GT) aligned with the output coming from 3 different OCR methods: TESSERACT, OLD and FR11 (Kettunen et al., 2018).

Despite the existence of character-level benchmarks for OCR post-correction (e.g. Drobac et al. (2017)), we elect to evaluate the models in the more realistic setting of whole words. We would like to note that Finnish has very long words, and as a result this metric is actually tougher. As described in the previous section, our models are trained without non-alphabetic characters, so all tokens containing such characters are removed; we also removed the blank lines which have no OCR result. After cleaning the ground truth and the OCRed text, the numbers of tokens for the OCR methods TESSERACT, OLD and FR11 are 458,799, 464,543 and 470,905, with accuracies of 88.29%, 75.34% and 79.79% respectively. The OCRed words are used as input data for the evaluation of our post-correction systems. The translation process is applied to each OCR method separately, with the input tokens formatted based on each model's requirements. For NATAS, we used OpenNMT to translate with the default settings. For the Transformer models with context, we created a sliding window over the rows of the OCRed text; for the non-context model, we only need a single token as source input. These models translate with beam search (k = 3), and the highest-probability sequence is chosen as the output. The results are shown in Table 1.

⁸"OCR Ground Truth Pages (Finnish Fraktur) [v1] (4.8 GB)", available at https://digi.kansalliskirjasto.fi/opendata

Models       TESSERACT (88.29)   OLD (75.34)   FR11 (79.79)
NATAS             63.35             61.63          64.95
TFRandW1          69.78             67.33          71.64
TFRandW3          70.02             67.45          71.69
TFRandW5          71.24             68.35          72.56
TFTrainW1         70.22             68.30          72.22
TFTrainW3         71.19             69.25          73.14
TFTrainW5         71.24             69.30          73.21

Table 1: Model accuracy on the word level for all three OCR methods (%).

4.1 Error Analysis

From the results in Table 1, we can see that none of the models improves on the accuracy of the raw OCRed text. However, there is clearly an advantage in using the artificial dataset and the Transformer model for training, which yields an accuracy about 7 percentage points higher than NATAS. After analyzing the results, we found many interesting cases where the output words are counted as errors when compared to the ground truth directly, but are nevertheless correct. The difference is that the ground truth has been corrected while maintaining the historical spelling, but as our models have been trained to correct words to a modern spelling, these forms appear incorrect when compared directly with the ground truth. Our models thus corrected many words right, but happened to normalise their spelling to modern Finnish at the same time. As examples, the word lukuwuoden ("academic year") is normalized to lukuvuoden, and the word kortt ("card") is normalized to kortti, which are the correct spellings in modern Finnish. So, the problem is that many words have acquired a new spelling in modern Finnish but are counted as wrong results when compared to the ground truth, which affects the real accuracy of our models. In 19th-century Finnish text, the most obvious difference compared to modern Finnish is the variation of w/v: most of the words containing v are written with w in the old text, whereas in modern Finnish w is not used in any regular word. Kettunen and Pääkkönen (2016) showed in their experiments that tokens containing the letter w make up 3.3% of all tokens, and that 97.5% of those tokens are misrecognized by FINTWOL, a morphological analyzer. They also tried replacing w with v, and the share of unrecognized tokens decreased to 30.6%. These numbers are significant, which gives us the idea of applying the same replacement to our results to get a better evaluation. Furthermore, there is another issue with our models: they sometimes make up new words which do not exist in the Finnish vocabulary. For example, the word samppaajaa is likely created from the word samppanjaa ("of Champagne"), which must be the correct one. To solve these issues, we suggest the following fixing pipeline for our results (sketched in code below):

1. Check if the words exist in the Finnish vocabulary using Omorfi with UralicNLP; if not, keep the OCRed words as the output.
2. Find all words containing the letter v and replace it with the letter w.
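A minimal sketch of this two-step pipeline, assuming UralicNLP's Finnish models have been downloaded; the function name and the blanket v-to-w replacement are our simplifications.

```python
from uralicNLP import uralicApi  # assumes uralicApi.download("fin") has been run

def fix_output(ocr_word, predicted):
    """Step 1: fall back to the OCR token when the prediction is not
    recognised as Finnish. Step 2: revert v -> w so the output matches
    the historical spelling kept in the ground truth."""
    word = predicted if uralicApi.analyze(predicted, "fin") else ocr_word
    return word.replace("v", "w")
```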

After processing with the strategy above, we get updated results, which can be found in Tables 2, 3 and 4.

Models       Post-processed   Error words   Correct words
             accuracy         accuracy      accuracy
NATAS            74.71           16.54          82.43
TFRandW1         80.49           16.13          89.03
TFRandW3         80.79           16.94          89.26
TFRandW5         81.89           17.02          90.49
TFTrainW1        83.05           17.11          91.79
TFTrainW3        83.96           18.15          92.68
TFTrainW5        84.00           18.02          92.75

Table 2: Model accuracy after post-processing for TESSERACT (88.29%).

Models       Post-processed   Error words   Correct words
             accuracy         accuracy      accuracy
NATAS            71.19           30.66          84.45
TFRandW1         75.10           28.14          90.47
TFRandW3         75.40           28.26          90.83
TFRandW5         76.26           28.63          91.85
TFTrainW1        78.19           35.07          92.30
TFTrainW3        79.26           36.03          93.41
TFTrainW5        79.17           35.41          93.50

Table 3: Model accuracy after post-processing for OLD (75.34%).

Models       Post-processed   Error words   Correct words
             accuracy         accuracy      accuracy
NATAS            75.06           36.52          84.81
TFRandW1         79.66           36.04          90.71
TFRandW3         80.06           37.00          90.96
TFRandW5         81.09           38.04          91.99
TFTrainW1        82.39           43.39          92.26
TFTrainW3        83.50           45.17          93.21
TFTrainW5        83.34           44.01          93.30

Table 4: Model accuracy after post-processing for FR11 (79.79%).
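To make the three columns of Tables 2, 3 and 4 precise, the sketch below shows how they can be computed from aligned (ground truth, OCR, prediction) triples: overall post-processed accuracy, accuracy on the words the OCR got wrong, and accuracy on the words it already had right. This decomposition is our reading of the tables, not code from our pipeline.

```python
def accuracy_breakdown(rows):
    """Compute (post-processed, error-word, correct-word) accuracies
    from (ground truth, OCR token, post-processed prediction) triples."""
    err = err_fixed = corr = corr_kept = 0
    for gt, ocr, pred in rows:
        if ocr != gt:               # the OCR engine got this word wrong
            err += 1
            err_fixed += (pred == gt)
        else:                       # the OCR engine was already correct
            corr += 1
            corr_kept += (pred == gt)
    total = err + corr
    return (100 * (err_fixed + corr_kept) / total,
            100 * err_fixed / err,
            100 * corr_kept / corr)
```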

The results in Tables 2, 3 and 4 show a vast improvement for all models, with accuracies increased by 10-12 percentage points. On Tesseract, where the original OCR already has a very high quality with an accuracy of about 88%, no model shows a gain; the best model in this case is TFTrainW5 with 84% accuracy. The reason for the models' worse performance is that they introduced more errors on the words that OCR already got right than they fixed actual error words: while the ratio of fixed error words (18.02%) is much higher than the ratio of confounded correct words (7.25%), correct words make up a much larger part of the corpus, so the overall accuracy decreases. In the OLD setting, with an accuracy of about 75%, 5 out of 7 models successfully improve on the accuracy of the original text. The highest number comes from TFTrainW3, which outperforms OLD by 3.92 percentage points, followed closely by TFTrainW5 with an accuracy of 79.17%. On OLD, we also see more error words corrected (36.03%) compared to Tesseract, and the accuracy of the TFTrainW5 model on the already correct words is also slightly higher, at 93.5% versus 92.75% for Tesseract. The last OCR method in the evaluation is FR11 (79.79%), where – just like for OLD – 5 out of 7 models surpass the OCR result. Again, TFTrainW3 gives the highest number, with a 3.71-percentage-point improvement over the OCRed text. While TFTrainW3 shows surprisingly good results in fixing the wrong words, with an accuracy of 45.17%, TFTrainW5 performs slightly better at handling the right words. Common to all our proposed models, the window size of 1 somewhat unsurprisingly performs worst within both the Rand and Train variants.

5 Conclusion and Future work

In this paper, we have shown that creating and using an artificial error dataset clearly outperforms the NATAS baseline (Hämäläinen and Hengchen, 2019), with a clear advantage for the Train over the Rand configuration. Another clear conclusion is that a larger context window increases the accuracy of the models. Comparing the new results for all three OCR methods, we see that the models are most effective on FR11, where the ratio of fixed wrong words (45.17%) is high enough to overcome the issue of breaking right words (6.7%). Our methods also work very well on OLD, with the ability to fix 36.03% of the wrong words and handle more than 93% of the right words correctly. However, our models are not compelling enough to beat the accuracy achieved by Tesseract, a shortcoming we leave for future work.

In spite of the effectiveness of the post-correction strategy, it does not guarantee that all the words with w/v replaced are correct, nor that UralicNLP manages to recognize all existing Finnish words. For example, the wrong OCR word mcntoistamuotiscn was fixed to metoistavuotisen, which is the correct form according to the gold standard, but UralicNLP filtered it out because it did not consider it a valid Finnish word. This is in fact reasonable, as the first syllable kol was dropped due to a line break in the data; without the line break, the word would be kolmetoistavuotisen ("13 years old"). This means that in the future, we need to develop better strategies, more suitable to OCR contexts, for telling correct and incorrect words apart.
This also implies that in reality the number of corrected cases can be higher if we do not revert the already normalised w/v words. In addition, if there were a better method to ensure that a word is valid Finnish, the results could be improved. Thus, our evaluation provides an overall view of how the Transformer and Trained Error Generator models with context words can notably improve OCR post-correction. Our methods also show that using an artificial dataset built from a modern corpus is very beneficial for normalising historical text.

Importantly, we would like to underline that our method does not rely on huge amounts of hand-annotated gold data, but can rather be applied as long as one has access to an OCRed text, a vocabulary list, a morphological FST and error-free data. There are several endangered languages related to Finnish that already have these aforementioned resources in place. In the future, we are interested in trying our method out in those contexts as well.

References

Yvonne Adesam, Dana Dannélls, and Nina Tahmasebi. 2019. Exploring the quality of the digital historical newspaper archive KubHist. In Proceedings of DHN.

Ryan Cordell. 2020. Machine learning and libraries: a report on the state of the field. Technical report, Library of Congress.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Rui Dong and David Smith. 2018. Multi-input attention for unsupervised OCR correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Senka Drobac, Pekka Sakari Kauppinen, and Bo Krister Johan Linden. 2017. OCR and post-correction of historical Finnish texts. In Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. Linköping University Electronic Press.

Greta Franzini, Mike Kestemont, Gabriela Rotari, Melina Jander, Jeremi K Ochab, Emily Franzini, Joanna Byszuk, and Jan Rybicki. 2018. Attributing authorship in the noisy digitized correspondence of Jacob and Wilhelm Grimm. Frontiers in Digital Humanities, 5:4.

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Mika Hämäläinen. 2019. UralicNLP: An NLP library for Uralic languages. Journal of Open Source Software, 4(37):1345.

Mika Hämäläinen and Simon Hengchen. 2019. From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436.

Mika Hämäläinen and Jack Rueter. 2018. Advances in synchronized XML-MediaWiki dictionary development in the context of endangered Uralic languages. In Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts.

Mark J Hill and Simon Hengchen. 2019. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34(4):825–843.

Kimmo Tapio Kettunen, Jukka Kervinen, and Jani Mika Olavi Koistinen. 2018. Creating and using ground truth OCR sample data for Finnish historical newspapers and journals. In Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, CEUR Workshop Proceedings, pages 162–169.

Kimmo Tapio Kettunen and Tuula Anneli Pääkkönen. 2016. Measuring lexical quality of a historical Finnish newspaper collection – analysis of garbled OCR data with basic language technology tools and means. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.
Vladimir I. Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848. In Russian.

KyungTae Lim, Niko Partanen, and Thierry Poibeau. 2018. Multilingual dependency parsing for low-resource languages: Case studies on North Saami and Komi-Zyrian. In Proceedings of LREC 2018.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Stephen Mutuvi, Antoine Doucet, Moses Odeo, and Adam Jatowt. 2018. Evaluating the impact of OCR errors on topic modeling. In International Conference on Asian Digital Libraries, pages 3–14. Springer.

Thi Tuyet Hai Nguyen, Adam Jatowt, Nhu-Van Nguyen, Mickael Coustaty, and Antoine Doucet. 2020. Neural machine translation with BERT for post-OCR error detection and correction. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pages 333–336.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Niko Partanen. 2017. Challenges in OCR today: Report on experiences from INEL. In Electronic Writing of the Peoples of the Russian Federation: Experience, Problems and Prospects, pages 263–273. In Russian.

Niko Partanen and Michael Rießler. 2019. An OCR system for the Unified Northern Alphabet. In The Fifth International Workshop on Computational Linguistics for Uralic Languages.

Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013.

Michael Piotrowski. 2012. Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 5(2):1–157.

Tommi A Pirinen. 2015. Development and use of computational morphology of Finnish in the open source and open science era: Notes on experiences with Omorfi development. SKY Journal of Linguistics, 28:381–393.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50.

Kepa Joseba Rodriquez, Mike Bryant, Tobias Blanke, and Magdalena Luszczynska. 2012. Comparison of named entity recognition tools for raw OCR text. In KONVENS, pages 410–414.

Jack Rueter, Niko Partanen, and Larisa Ponomareva. 2020. On the questions in developing computational infrastructure for Komi-Permyak. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 15–25.

Jack Rueter and Francis Tyers. 2018. Towards an open-source universal-dependency treebank for Erzya. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages.

Mariya Sheyanova and Francis M. Tyers. 2017. Annotation schemes in North Sámi dependency parsing. In Proceedings of the 3rd International Workshop for Computational Linguistics of Uralic Languages.

Miikka Silfverberg and Jack Rueter. 2015. Can morphological analyzers improve the quality of optical character recognition? In Septentrio Conference Series, volume 2, pages 45–56.

David A. Smith and Ryan Cordell. 2019. A research agenda for historical and multilingual optical character recognition. Technical report, Northeastern University.

Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020. Assessing the impact of OCR quality on downstream NLP tasks. In ICAART (1), pages 484–496.

Simon Tanner, Trevor Muñoz, and Pich Hemy Ros. 2009. Measuring mass text digitization quality and usefulness. D-Lib Magazine, 15(7/8).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.