OCR Post-Processing of Historical Swedish Text Using Machine Learning Techniques
OCR post-processing of historical Swedish text using machine learning techniques
Master's thesis in Computer science and engineering

Simon Persson
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2019

© Simon Persson, 2019.
Supervisor: Dana Dannélls, Språkbanken
Examiner: Krasimir Angelov, University of Gothenburg

Master's Thesis 2019
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Abstract

We present an OCR post-processing method that utilizes machine learning techniques and is targeted at historical Swedish texts. The method is developed completely independently of any OCR-tool, with the aim of avoiding bias towards any single OCR-tool. The purpose of the method is to solve the two main problems of OCR post-processing, i.e. detecting and correcting errors caused by the OCR-tool. Our method is divided into two main parts, each of which solves one of these problems. Error detection is solved by a Support Vector Machine (SVM) that classifies each word as either valid or erroneous. In order for the SVM to classify the words, each word is converted into a feature vector that contains several word features indicating the validity of the word.
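The idea of mapping each word to a feature vector can be sketched in plain Python. This is a minimal, hypothetical illustration only: the thesis's real features are built from large tri-gram and word-frequency databases (see Chapters 4 and 5), whereas the tiny lexicon, tri-gram set, and feature choices below are invented stand-ins.

```python
# Minimal sketch: turn a word into a numeric feature vector that an
# SVM-style valid/erroneous classifier could consume.
# LEXICON_FREQ and TRIGRAMS are invented toy data, not the thesis's databases.

LEXICON_FREQ = {"svenska": 120, "text": 80, "och": 300}   # hypothetical counts
TRIGRAMS = {"sve", "ven", "ens", "nsk", "ska", "tex", "ext", "och"}

def trigram_score(word):
    """Fraction of the word's character tri-grams found in the tri-gram set."""
    grams = [word[i:i + 3] for i in range(len(word) - 2)]
    if not grams:
        return 0.0
    return sum(g in TRIGRAMS for g in grams) / len(grams)

def feature_vector(word):
    """Map a word to features hinting at its validity."""
    return [
        LEXICON_FREQ.get(word, 0),                        # corpus frequency
        trigram_score(word),                              # tri-gram evidence
        sum(not c.isalpha() for c in word) / len(word),   # non-alpha ratio
        len(word),                                        # word length
    ]
```

Vectors like these, labelled against the ground truth, would then be used to train the classifier (for instance scikit-learn's `SVC`); the actual feature definitions and training-data generation are described in Chapter 4.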
The error correction part of the method takes the words that have been classified as erroneous and tries to replace them with the correct word. The error correction algorithm is based upon Levenshtein edit distance combined with frequency wordlists. We OCR-processed a collection of 400 documents from the 19th century using three OCR-tools: Ocropus, Tesseract and ABBYY, and used the output from each tool to develop our method. Experiments and evaluations were carried out against the ground truth of the documents. The method is built in a modular fashion and evaluation was performed on each module. We report quantitative and qualitative results showing varying degrees of OCR post-processing complexity.

Keywords: machine-learning, NLP, OCR, post-processing

Acknowledgements

I would like to express my deepest gratitude towards my supervisor at Språkbanken, Dana Dannélls, for her excellent help and patience. I would also like to thank my examiner Krasimir Angelov at the University of Gothenburg for the support and feedback. Lastly, I would like to thank my family and close friends for the support and encouragement throughout this project.

Simon Persson, Gothenburg, June 2019

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Aim
  1.2 Delimitations
2 Background
  2.1 OCR of historical texts
  2.2 Optical Character Recognition
    2.2.1 OCR-process
      2.2.1.1 Pre-processing
      2.2.1.2 Character recognition
      2.2.1.3 Post-processing
  2.3 OCR-tools
    2.3.1 Abbyy Finereader
    2.3.2 Ocropus
    2.3.3 Tesseract
  2.4 Support Vector Machine
  2.5 Word feature
  2.6 Levenshtein distance
  2.7 Evaluation
    2.7.1 Precision, Recall and F1-score
    2.7.2 OCR evaluation
    2.7.3 Word metrics evaluation
3 Data
  3.1 Targeted text
    3.1.1 Then Swänska Argus
    3.1.2 Grepect
  3.2 Project Runeberg
4 Methods
  4.1 Data pre-processing
  4.2 Post-processing
    4.2.1 Error detection
      4.2.1.1 Training data generation
      4.2.1.2 Feature vector
      4.2.1.3 Support Vector Machine
    4.2.2 Error correction
      4.2.2.1 Candidate search
      4.2.2.2 Candidate ranking
  4.3 OCR evaluation
  4.4 Project baseline
5 Results
  5.1 Error detection
    5.1.1 Word features
      5.1.1.1 Training data filtering
      5.1.1.2 Tri-gram database size
      5.1.1.3 Word frequencies database size
      5.1.1.4 Overall performance
    5.1.2 Support Vector Machine
      5.1.2.1 Kernel selection
      5.1.2.2 Parameter selection
  5.2 Error correction
    5.2.1 Input filtering
    5.2.2 Error correction algorithm
    5.2.3 Word database size
  5.3 Finalized performance
6 Discussion
  6.1 Error detection
    6.1.1 Word features
      6.1.1.1 Tri-gram database size
      6.1.1.2 Word frequencies database size
      6.1.1.3 Training data filtering
      6.1.1.4 SVM
  6.2 Error correction
    6.2.1 Input filtering
    6.2.2 Correction algorithm
    6.2.3 Word database size
  6.3 Overall performance
  6.4 Bias for any OCR-tool
  6.5 Suitability of a machine learning approach
  6.6 Limitations caused by the targeted text
  6.7 Future work
    6.7.1 Word features
    6.7.2 Error detection
7 Conclusion
Bibliography
A Appendix

List of Figures

2.1 An example of a deteriorated page. The prints from nearby pages have left marks on the current page.
2.2 A diagram of the different blocks in the OCR-process.
2.3 Illustration of a text getting segmented by the layout analysis during the pre-processing step in the OCR-process.
2.4 Depiction of an image before (left) and after (right) binarisation.
2.5 A diagram of the different parts of Ocropus.
2.6 A diagram of the Tesseract structure.
2.7 Classification problem with two classes (triangles and circles) labeled with values one and zero. The diagonal full line is a possible solution to the problem. By maximizing the distance to the margins (dotted lines) the solution is optimized.
2.8 The Levenshtein distance function (LD) applied to three pairs of strings.
2.9 Classification problem with two classes (triangles and circles) labeled with values one and zero. The diagonal full line is a possible solution to the problem.
2.10 Example of averages of two word features divided by the validity class.
2.11 Example of the percent difference metric for two word features.
3.1 Histogram of word length distribution for the targeted text.
3.2 Example of pages from both Then Swänska Argus (left) and Grepect (right). Notice the ink “bleed-through” and mixture of typefaces.
4.1 Simplified diagram of the complete OCR process including a more detailed version of the post-processing step.
4.2 Visualization of the data pre-processing. The second line is the processed version of the first line.
4.3 Example of how a word is split and some candidates for each split. The winning candidate is chosen by ranking the candidates by Levenshtein distance (LD) and the frequency of the word.
5.1 The order in which each subsystem will be evaluated. Each subsystem will be evaluated with optimal configurations for all previous subsystems. The dashed boxes indicate which problem each subsystem contributes to solving.
5.2 Performance of the word features without either of the two filters, the last char filter and the only non-alpha filter. The performance is displayed in the percent difference mean metric.
5.3 Performance of the word features with only the last char filter. The performance is displayed in the percent difference mean metric.
5.4 Performance of the word features with only the only non-alpha filter. The performance is displayed in the percent difference mean metric.
5.5 Performance of the word features with both the only non-alpha filter and the last character filter. The performance is displayed in the percent difference mean metric.
5.6 Performance of the Swedishness metric with different