Generic Text Recognition Using Long Short-Term Memory Networks
Total Page:16
File Type:pdf, Size:1020Kb
Generic Text Recognition using Long Short-Term Memory Networks by Adnan Ul-Hasan Thesis approved by the Department of Computer Science University of Kaiserslautern for the award of Doctoral Degree Doctor of Engineering (Dr. -Ing) Date of PhD Defense: 11.01.2016 Dean of the Department: Prof. Dr. Klaus Schneider Chairperson of the PhD Committee: Prof. Dr. Paul Lukowicz Thesis Reviewers: Prof. Dr. Andreas Dengel, DFKI Kaiserslautern Associate Prof. Dr. Faisal Shafait, SEECS, NUST Pakistan apl. Prof. Dr. Marcus Liwicki, University of Kaiserslautern D 386 It always seems impossible until it is done. Nelson Mandela Abstract The task of printed Optical Character Recognition (OCR) is considered a “solved” issue by many Pattern Recognition (PR) researchers. The notion, however, partially true, does not represent the whole picture. Although, it is true that state-of-the-art OCR systems for many scripts exist, for example, for Latin, Greek, Han (Chinese), and Kana (Japanese), there is still a need for exhaustive research for many other challeng- ing modern scripts. Example of such scripts are: cursive Nabataean, which include Arabic, Persian, and Urdu; and the Brahamic family of scripts, which contain Devana- gari, Sanskrit, and its derivatives. These scripts present many challenging issues for OCR, for example, change in shape of character within a word depending upon its lo- cation, kerning, and a huge number of ligatures. Moreover, OCR research for histori- cal documents still requires much probing; therefore, efforts are required to develop robust OCR systems to preserve the literary heritage. Likewise, there is a need to address the issue of OCR of multilingual documents. Plenty of multilingual documents exist in the current time of globalization, which has increased the influence of different languages on each other. There is an increasein the usage of foreign words and phrases in articles, newspapers, and books, which are generating a large body of multilingual literature everyday. Another effect is seen in the products we use in our daily lives. From packaging of imported food items to so- phisticated electronics, the demand of international customers to access information about these products in their native language is ever increasing. The use of multilin- gual operational manuals, books, and dictionaries motivates the need to have multi- lingual OCR systems for their digitization. The aim of this thesis is to find the answers to some of these challenges using the contemporary machine learning methodologies, especially the Recurrent Neural Networks (RNN). Specifically, a recent architecture of these networks, referred toas Long Short-Term Memory (LSTM) networks, has been employed to OCR modern as well historical documents. The excellent OCR results obtained on these documents encourage us to extend their application to the field of multilingual OCR. The LSTM networks are first evaluated on standard English datasets to benchmark their performance. They yield better recognition results than any other contempo- rary OCR techniques without using sophisticated features and language modeling. Therefore, their application is further extended to more complex scripts that include Urdu Nastaleeq and Devanagari. For Urdu Nastaleeq script, LSTM networks achieve the best reported OCR results (2.55% Character Error Rate (CER)) on a publicly avail- able data set, while for Devanagari script, a new freely available database has been introduced on which CER of 9% is achieved. The LSTM-based methodology is further extended to the OCR of historical docu- ments. In this regard, this thesis focuses on Old German Fraktur script, medieval Latin script of the 15th century, and the Polytonic Greek script. LSTM-based systems out- perform the contemporary OCR systems on all of these scripts. For old documents, it is usually very hard to prepare transcribed dataset for training a neural network in su- pervised learning paradigm. A novel methodology has been proposed by combining segmentation-based and segmentation-free approaches to OCR scripts for which no transcribed training data is available. For German Fraktur and Polytonic Greek scripts, artificially generated data from existing text corpora yield highly promising results (CER of <1% and 5% for Fraktur and Polytonic Greek scripts respectively). Another major contribution of this thesis is an efficient Multilingual OCR (MOCR) system, which has been achieved in two ways. Firstly, a sequence-learning based script identification system has been developed that works at the text-line level; thereby, eliminating the need to segment individual characters or words prior to actual script identification. And secondly, a unified approach for dealing with different typesof multilingual documents has been proposed. The core motivation behind this gen- eralized framework is the reading ability of the human mind to process multilingual documents, where no script identification takes place. In this design, the LSTM net- works recognize multiple scripts simultaneously without the need to identify differ- ent scripts. The first step in building this framework is the realization of a language independent OCR system that recognizes multilingual text in a single step, using a sin- gle Long Short-Term Memory (LSTM) model. The language independent approach is then extended to a script independent OCR framework that can recognize multiscript documents using a single OCR model. The proposed generalized approach yields low error rate (1.2%) on a database of English-Greek bilingual documents. In summary, this thesis aims to extend the OCR research, from modern Latin scripts to old Latin, to Greek and to other “underprivileged” scripts such as Devanagari and Urdu Nastaleeq. It also provides a generalized OCR framework in dealing with multi- lingual documents. Contents Abstract v List of Figures xi List of Tables xv Abbreviations xvii 1 Introduction 1 1.1 Motivation ..................................... 4 1.2 Research Hypothesis ............................... 6 1.3 Contributions of the Thesis ........................... 6 1.4 Thesis Structure .................................. 7 2 Research Methodologies for Printed OCR 11 2.1 Segmentation-Based OCR ............................ 12 2.1.1 Template Matching for OCR ...................... 12 2.1.2 Over Segmentation Methods ...................... 14 2.2 Segmentation-Free OCR ............................. 14 2.2.1 HMM-Based OCR Approach ....................... 16 2.2.2 Sequence Learning Approach ..................... 16 P art − I : Challenges for P rinted OCR 3 Challenges for Robust OCR 19 3.1 Modern English .................................. 20 3.1.1 Challenges for OCR in Printed English ................ 20 3.1.2 OCR Databases for modern English .................. 21 3.2 Latin Script of Late Medieval Ages ....................... 22 3.2.1 Challenges for Old Latin OCR ..................... 22 3.2.2 OCR Databases for Old Latin ...................... 23 3.3 German Fraktur Script .............................. 23 3.3.1 Challenges for Fraktur OCR ....................... 24 3.3.2 OCR Databases for Fraktur ....................... 24 3.4 Polytonic Greek Script .............................. 25 3.4.1 OCR Challenges for Polytonic Greek ................. 25 3.4.2 OCR Databases for Polytonic Greek .................. 25 vii Contents viii 3.5 Devanagari Script ................................. 26 3.5.1 Challenges for Devanagari OCR .................... 27 3.5.2 OCR Databases for Devanagari .................... 27 3.6 Urdu Nastaleeq Script .............................. 27 3.6.1 Challenges for Nastaleeq OCR ..................... 28 3.6.2 Databases for Urdu Nastaleeq OCR .................. 33 3.7 Printed Documents with Multiple Scripts and Languages ......... 35 3.7.1 Categorization of Multilingual Documents .............. 36 3.7.2 OCR Datasets for Multilingual Documents .............. 36 3.8 Chapter Summary ................................. 38 4 Benchmark Datasets for OCR 39 4.1 Related Work ................................... 40 4.2 Semi-automated Database Generation .................... 41 4.2.1 Preprocessing .............................. 42 4.2.2 Ligature Extraction ........................... 42 4.2.3 Clustering ................................. 43 4.2.4 Ligature Labeling ............................. 44 4.2.5 Experiments and Results ........................ 45 4.2.6 Conclusion ................................ 46 4.3 Synthetic Text-Line Generation ......................... 46 4.4 OCR Databases .................................. 47 4.4.1 Deva-DB – OCR Database for Devanagari Script ........... 47 4.4.2 Polyton-DB – Greek Polytonic Database ............... 49 4.4.3 Databases for Multilingual OCR (MOCR) ............... 49 4.5 Chapter Summary ................................. 52 P art − II : OCR for Monolingual Documents 5 OCR for Printed Modern Scripts 53 5.1 Design of LSTM-Based OCR System ...................... 53 5.1.1 Network Parameters Selection ..................... 55 5.1.2 Features .................................. 57 5.1.3 Performance Metric ........................... 57 5.2 Printed English OCR ............................... 57 5.2.1 Related Work ............................... 58 5.2.2 Database ................................. 59 5.2.3 Experimental Evaluation and Results ................. 59 5.2.4 Error Analysis ............................... 62 5.2.5 Conclusions ................................ 62 5.3 Printed Devanagari OCR ............................. 63 5.3.1 Related Work