END-TO-END TEXT RECOGNITION WITH DEEP LEARNING ARCHITECTURES

Ouais Alsharif

Master of Science School of Computer Science McGill University, Montreal

October 23, 2014

A thesis submitted to McGill University in partial fulfilment of the requirements of the degree of Master of Science. © Ouais Alsharif; October 23, 2014.


Acknowledgements

Foremost, I am grateful to my advisor, Joelle Pineau. Joelle balanced guiding me on the one hand and giving me the freedom to pursue my own ideas, and the means to do so, on the other. Her guidance and support were very helpful throughout my masters. Our weekly meetings were the highlights of my week. I consider myself lucky to be one of her students. I am also grateful to Doina Precup, who recommended me to Joelle in the first place and invited me to join the Reasoning and Learning lab's meetings. Doina's support was of immense help as I transitioned into McGill, and now that I am at the crossroads of several paths. My McGill experience wouldn't have been complete had it not been for Luc Devroye. Luc's classes are the best thing one could do at 8:30 in the morning. His teaching style, passion and knowledge are unparalleled. I hope one day to become a researcher of his calibre. Graduate school is nothing without friends. In a randomly generated order: Mahdi Milani Fard, Pierre-luc Bacon, Gheorghe Comanici, Neil Girdhar, Phillip Bachman, William Hamilton, Clement Gehring, Mike Ounsworth, Jimmy Li, David Cortes, Javona Whitebear, Jinxu Jia, Angus Leigh, Andrew Sutcliffe and Martin Gerdzhev. This would not have been the same without you. I am most grateful to my family: Obada, Ubai, Mom, Dad. There are no words to describe how your love and unconditional support affected my life. Thank you all.

Abstract

Accurate text recognition in documents was one of the early milestones of machine learning and computer vision techniques. However, despite this early success, general text recognition still remains an unsolved problem. Since textual information is an artificial signal, designed to be simple to draw, it can be easily confused with other simple signals that naturally exist. Moreover, unlike in document text recognition, assumptions on the way text exists should be kept to a minimum in the general setting, creating the need for more robust detectors and recognizers. From a practical point of view, engineering an end-to-end system is an elaborate effort. It involves designing multiple modules, from text detection to character recognition, and integrating these modules in a way that allows for scalability, modularity and high accuracy. That is why most previous works focused only on parts of the pipeline instead of the whole end-to-end system. Moreover, the most accurate previous works traded off scalability for accuracy, making them infeasible to use in real-world settings.

This thesis attempts to address this issue by showing how such an end-to-end system can be constructed with the high-level goals of balancing simplicity, accuracy and scalability, drawing on connections to speech and handwriting recognition. Specifically, this thesis shows how the end-to-end problem can be dissected into three main sub-problems: character recognition, word recognition and text detection. Then, novel solutions to each problem are proposed, and a method for integrating the three modules together is shown. Technically, the system leverages a recent variant of convolutional neural networks that uses dropout and a max activation function. It also makes use of hybrid HMM models, which were shown to be useful in speech recognition problems. Empirically, the system's performance is measured in comparison to previous systems in terms of accuracy and scalability. Results show the proposed system outperforms previous state-of-the-art systems on benchmark datasets on all sub-problems. It also addresses the scalability issues in lexicon size that previously proposed systems suffer from.

Résumé

Accurate recognition of text in documents was a cornerstone of machine learning and computer vision. However, despite these early successes, the general text recognition problem remains unsolved. Since textual information is an artificial signal designed to be easy to draw, it can easily be confused with other signals of the same kind already present in the environment. Moreover, unlike text recognition in a document, assumptions about the way text appears must be kept to a minimum in this more general scenario, which calls for more robust detectors and recognizers.

From a practical point of view, building an end-to-end recognition system demands considerable effort. One must not only design multiple modules for text detection and character recognition, but also integrate them in a way that allows scalability, modularity and accuracy. It is for this reason that previous efforts were devoted only to the constituent parts of this pipeline rather than to the complete end-to-end system. Moreover, those approaches that neglected scalability in favour of accuracy cannot be used in the real world.

This thesis attempts to solve these problems and shows how an end-to-end system can be designed to meet the ideal of simplicity without compromising accuracy or scalability. At the same time, this thesis draws connections to speech and handwriting recognition. More precisely, it shows how the end-to-end problem can be decomposed into three main sub-problems: character recognition, word recognition and text detection. Novel solutions to each of these problems are presented independently and then combined into a single system. Technically speaking, the system exploits a recent variant of convolutional neural networks using the "dropout" technique and a "max" activation function. A hybrid HMM model, which has proven useful in speech recognition, is also used for our problem. From an empirical point of view, the system's performance is evaluated against previous systems on the criteria of accuracy and scalability. The results show that the proposed system outperforms other state-of-the-art systems when evaluated on all sub-problems of the benchmark datasets. Finally, the scalability problem is resolved for lexicons whose size limited previous systems.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Prologue
  1.2 Problem Definition
  1.3 Contributions
  1.4 Methodology
  1.5 Organization

2 Technical Background
  2.1 Supervised Machine Learning
  2.2 Neural Networks
    2.2.1 Training neural networks
    2.2.2 Pros and Cons of Neural Networks
  2.3 Convolutional Neural Networks
  2.4 Dropout
  2.5 Maxout Networks
  2.6 Hidden Markov Models
  2.7 Hybrid HMM models
  2.8 Maximally Stable Extremal Regions (MSERs)

3 Character Recognition
  3.1 Problem Definition
  3.2 Related Works
  3.3 Dataset
  3.4 Method
  3.5 Results
  3.6 Discussion

4 Word Recognition
  4.1 Problem Definition
  4.2 Related Works
  4.3 Method Outline


  4.4 Hybrid HMM Maxout Model
  4.5 Constructing the Cascade
  4.6 Word Inference
  4.7 Implementation details
  4.8 Dataset and Results
  4.9 Speed-accuracy tradeoffs and language models' effect
    4.9.1 Effect of Beam Width
    4.9.2 Effect of Language Model Order
  4.10 Discussion

5 Text Detection and End-to-End system
  5.1 Problem Definition
  5.2 Related Works
    5.2.1 Text Detection
    5.2.2 End-to-End Pipelines
  5.3 Method
  5.4 End-to-End Results
  5.5 On Training an End-to-end System via Gradient Descent

6 Discussion
  6.1 Contributions
  6.2 Limitations
  6.3 Future Work

Bibliography

List of Figures

1.1 End-to-end pipeline overview

2.1 A multi-layer perceptron
2.2 The behaviour of popular neural activation functions
2.3 Effect of convolution in convolutional networks
2.4 Effect of pooling in convolutional networks

3.1 Character recognition confusion matrix

4.1 Word recognition pipeline
4.2 A hybrid HMM-Maxout model
4.3 A cascade induced graph for the word "JFC"
4.4 Effect of beam width
4.5 Effect of language model

5.1 End-to-end text recognition pipeline
5.2 Sample end-to-end results
5.3 Precision/recall curves for the end-to-end system

List of Tables

3.1 Character recognition accuracy on ICDAR 2003 test set. All methods use the same augmented training dataset.

4.1 Word recognition accuracies on ICDAR 2003 and SVT datasets. The last two lines are from this work.

5.1 End-to-end F-measures on the ICDAR 2003 and SVT datasets

Chapter One

Introduction

1.1 Prologue

The plethora of applications that text recognition has, from digitizing old documents to enhancing robotic navigation and planning, all point to why it is important to create robust, accurate and fast text detectors. In the past few years, significant progress was made on document and handwritten text recognition. However, detecting and recognizing text in unstructured environments, with as few assumptions as possible on the text's attributes (e.g., font, color, lighting, etc.), remains an elusive goal.

The difficulty of general text recognition can be attributed to many causes. Natural attributes, such as lighting, shadows, styles, fonts and backgrounds, affect the perception of textual information, making it hard to recognize in some cases even for humans. Instances of such difficult situations include words written using different fonts and different character sizes; words with missing characters, due to occlusion or wrong cropping; and words atop noisy backgrounds. The combination of these noise sources shifts the text recognition problem away from well-controlled document text recognition, and closer to speech recognition and handwriting recognition.

In earlier works on document-text recognition, designing a recognizer that worked well was a relatively simple task, even with simple preprocessing, because strong assumptions on fonts, colors and structure could be made. For instance, knowing that text was mostly black on a white background, formed in lines of specific height, and had attributes that were consistent among words and lines made the text recognizer's task easier. However, since this is not the case for text recognition in unstructured environments, more robust methods are needed. These methods would ideally be accurate, fast and simple. Loosely speaking, this problem of detecting textual information in an unstructured setting, separating it into lines and words, and recognizing those, is referred to as the end-to-end text recognition problem.

The end-to-end text recognition problem has historically, and arguably naturally, been decomposed into three main parts: text detection, character recognition and word recognition. The first problem is to identify text locations in a natural image. The second problem is to classify character images into their corresponding characters. The third problem is a sequencing problem: given an image of a word, output the most likely word corresponding to that image.

Technically, each of these sub-problems presents challenges in its own right. On the character level, the main challenge is to achieve high recognition accuracy. On the word level, the word recognizer needs to be robust, accurate, fast, and scalable with lexicon size. Finally, on the end-to-end level, the system needs to balance precision, recall, complexity and speed. Building a system that satisfies all these constraints while remaining relatively simple is why constructing such end-to-end systems is a challenge most previous works shied away from.

This thesis presents a detailed inspection into how to construct such an end-to-end system by reusing solutions to one problem within another. The overlapping nature of these problems, as character recognition is a part of word recognition and word recognition is a part of text detection, allows these sub-solutions to be exploited efficiently.

1.2 Problem Definition

In addressing the end-to-end problem, this thesis addresses three main sub-problems:

1. Character Recognition

2. Word Recognition

3. Text Detection

The first problem is a classification problem, namely: given an image x ∈ X_char and a set of labels Y = {1, ..., K}, create a function f : X_char → Y. The second problem is a sequencing problem, more precisely: given an image x ∈ X_word and a set of characters Y, create a function g : X_word → Y*, where * is the Kleene closure operator, i.e., Y* = ∪_{n∈N} Y^n. The third problem is the detection problem, where the goal is to create a function h : X_all → B*, where B is the set of all rectangles in an image. X_char, X_word and X_all are respectively the sets of character images, word images and all images.

In light of these definitions, the end-to-end problem is defined as creating a function e : X_all → (Y*)*.

1.3 Contributions

Abstractly, this thesis presents a way to construct an end-to-end recognition system. We connect the dots to other sub-fields, namely speech recognition and handwritten text recognition. We also show successful applications of multiple machine learning models.

More concretely, we focus on showing how to design an end-to-end system that balances accuracy, speed and relative simplicity. On a finer level, this thesis shows how to construct a highly accurate character recognizer; a fast, accurate and scalable word recognizer; and a relatively fast, highly accurate (in the F-measure sense) end-to-end recognizer.

The thesis presents dataset-specific results, showing how the system performs on the ICDAR 2003 [Lucas et al., 2003] and the SVT [Wang and Belongie, 2010] datasets. The proposed system outperforms most previous state-of-the-art systems at the time of writing. It also presents empirical results on the scalability of the system, when more time is allowed per query and when lexicons of different sizes are used.

1.4 Methodology

To achieve the goals specified above, the end-to-end problem is dissected into its natural sub-problems and new solutions are proposed for all sub-problems. More specifically, for the character recognition problem, a variant of the recently introduced deep convolutional Maxout network architecture [Goodfellow et al., 2013b] is proposed that allows for high accuracy with almost no preprocessing. Also, by drawing on connections to recent works in speech recognition [Hinton et al., 2012a], a method for sequencing words into characters using a hybrid HMM/Maxout architecture is proposed. The proposed model allows for simple integration of a lexicon's higher-order n-grams, resulting in a method that is fast, accurate and highly tunable, while taking constant time relative to lexicon size. These parts are then integrated in a novel end-to-end recognition system that utilises standard techniques from computer vision, like Maximally Stable Extremal Regions (MSERs) and DBSCAN, to achieve state-of-the-art F-measure on the ICDAR 2003 [Lucas et al., 2003] and SVT [Wang and Belongie, 2010] datasets. A depiction of the end-to-end pipeline is presented in Figure 1.1.

Figure 1.1: End-to-end pipeline overview

1.5 Organization

This thesis is organized as follows. Chapter two presents the technical background related to models used in the thesis. Chapter three presents a detailed construction of a general purpose character recognizer. Chapter four shows how this character recognizer can be used in a word recognizer that balances accuracy with speed. Chapter five shows how, with this word recognizer, a highly accurate end-to-end text recognition system can be created. Chapter six concludes with a discussion of the thesis, current open problems and future work.

Chapter Two

Technical Background

An end-to-end system is an elaborate effort. It involves multiple parts: some algorithmic parts are learned from data, and others are static image processing algorithms. This section provides the reader with the background needed to understand these building blocks and how they contribute to the system as a whole.

2.1 Supervised Machine Learning

Supervised machine learning concerns itself with estimating a mapping f : X → Y from a labelled dataset (called the training set) with n samples {(x_i, y_i)}_{i=1}^n. X is called the input domain and is usually a d-dimensional vector space X = R^d. Y is the output domain and is set to R for regression problems and to a set of K labels {1, ..., K} for classification problems. The hope underlying supervised learning is that by estimating f from a finite amount of data, f would be able to generalize to unseen data, providing us with labels for the new data. One particular quantity of interest in such estimations is the risk of f. The risk of a function denotes the expected error incurred by that function and is given as:

    R(f) = E_{(x,y)}[ 1_{f(x) ≠ y} ]

In practice we cannot measure the true risk, but rather an empirical estimate thereof:

    R_n(f) = (1/n) ∑_{i=1}^{n} 1_{f(x_i) ≠ y_i}    (2.1)

The empirical risk minimization principle [Vapnik, 1998] suggests that to create a function f that performs well on some unseen test set Z_test = {(x_i, y_i)}_{i=1}^m, we need to minimize the empirical risk on the training set Z_train = {(x_i, y_i)}_{i=1}^n [1]:

    f_n = argmin_{f ∈ F} R_n(f)

where F is the model space of interest. In order to control the search in F, we usually augment the above optimization problem with a regularizer on F, typically a norm ||f||. Then we minimize the regularized empirical risk:

    f_n = argmin_{f ∈ F} R_n(f) + λ||f||

where λ controls the strength of regularization. Controlling the search through regularization is necessary to prevent overfitting, a phenomenon in which a function f performs well on some training set, but does not generalize well onto the test set.

[1] This is because the true risk is bounded by the empirical risk plus a generalization factor. For a more detailed reasoning, consult [Bousquet et al., 2004].
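As a concrete illustration of the two risk quantities above, the following is a minimal numpy sketch. The classifier f (a function mapping a batch of inputs to predicted labels) and the regularization weight lam are illustrative placeholders, not objects from the thesis.

    import numpy as np

    def empirical_risk(f, X, y):
        """Empirical risk R_n(f) from (2.1): the fraction of samples f misclassifies."""
        return np.mean(f(X) != y)

    def regularized_risk(f, X, y, w, lam=0.1):
        """Regularized empirical risk: R_n(f) plus an L2 penalty on the parameters w."""
        return empirical_risk(f, X, y) + lam * np.linalg.norm(w)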

2.2 Neural Networks

Neural Networks are one of the cornerstones of modern machine learning techniques [Bishop, 1995, Hornik et al., 1989, Rumelhart et al., 1985]. Devised originally to be simplistic models of neurons, mathematically they are non-linear global function-approximators [2]. The structure of a neural network is that of a weighted directed acyclic graph. In this graph, vertices are called neurons and edges are called weights. Each neuron has a set of inputs and a single output. The inputs of a neuron are the outputs of other neurons with edges incident on that neuron.

[2] The universal approximation theorem [Barron, 1993] states: a neural network with a single hidden layer, containing a finite number of neurons, is a global function approximator. Note that the theorem concerns the existence of a network that approximates the function, not the learnability of such networks.

Figure 2.1: A multi-layer perceptron. [Diagram: input x = (x_1, x_2, ..., x_{d_1}), first hidden layer h_1 = φ_1(W_1 x), second hidden layer h_2 = φ_2(W_2 h_1), and a single output unit h_{3,1}.]

The output of a neuron is the result of applying a non-linear activation function to a dot product of its inputs and its weights. The dot product can be thought of as a pattern match in Euclidean space. The activation function augments this by adding non-linearity, as many functions of interest are non-linear. Examples of popular activation functions include:

1. Sigmoid: φ(z) = 1 / (1 + e^{−z})

2. Hyperbolic tangent: φ(z) = (e^z − e^{−z}) / (e^z + e^{−z})

3. Rectifier: φ(z) = max(0, z)

A variant of neural networks, where neurons are structured in layers, is called the Multi-Layer Perceptron (MLP) [Bishop, 1995]. In MLPs, neurons in the same layer have the same activation function. Thus, for an input x ∈ R^n the function computed by an MLP with N layers is of the form:

    f(x, W) = φ_N(W_N · ... φ_2(W_2 · φ_1(W_1 · x)))

where W_i is the weight matrix parametrizing layer i and φ_i is the non-linear activation function used in layer i. Some presentations of neural nets (e.g., [Bishop, 1995]) include a bias vector b and a weight matrix W for every layer instead of just W; however, the bias can be subsumed into W if we mandate the last entry in x to always be 1.
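A minimal sketch of this forward computation in numpy, assuming vector inputs and ignoring biases per the convention above (the names are illustrative):

    import numpy as np

    def mlp_forward(x, weights, activations):
        """Computes f(x, W) = phi_N(W_N ... phi_2(W_2 phi_1(W_1 x))) layer by layer.
        weights:     list of weight matrices [W_1, ..., W_N]
        activations: list of elementwise activation functions [phi_1, ..., phi_N]"""
        h = x
        for W, phi in zip(weights, activations):
            h = phi(W @ h)  # each layer: activation applied to a linear map
        return h

    # Example: a 2-layer MLP with tanh hidden units and a linear output
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(1, 5))
    y = mlp_forward(rng.normal(size=3), [W1, W2], [np.tanh, lambda z: z])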

Figure 2.2: The behaviour of popular neural activation functions. [Plot of the sigmoid, tanh and rectifier functions over the input range −6 to 6.]

Neural networks experienced fluctuating levels of interest over the years. In general, interest in these networks increased as computation became cheaper and more abundant. Also, due to the difficulty of training non-convex functions, neural networks gained traction whenever a new technique for training such functions surfaced. Such techniques include Backpropagation [Rumelhart et al., 1985] in the early days of neural networks, which is a reinvention of gradient descent that uses the chain rule to avoid duplicate computations; unsupervised pretraining [Hinton et al., 2006], which is a technique for initializing neural networks; and the reintroduction of convolutional networks [LeCun et al., 1998, Krizhevsky et al., 2012], which were shown to be good function approximators in the image domain.

One reason for the popularity of neural networks is their modularity: the structure of a neural network allows modular changes in activation functions [Nair and Hinton, 2010, Goodfellow et al., 2013b], pooling layers [Zeiler and Fergus, 2013], and regularizers [Weston et al., 2012], allowing designers to tune the architecture to their specific needs. Generally speaking, the real empirical problem with supervised learning of feed-forward neural networks with small to medium amounts of data is overfitting; usually, performance on training data quickly becomes perfect, while it plateaus on testing data.

2.2.1 Training neural networks

Training a neural net requires two things: an error function to minimize and a method to minimize it. The error function is called a loss function and the method is called a training algorithm. For classification tasks, the first error function that comes to mind is the 0-1 error, given in the case of two classes as follows:

    L(x, y) = 1_{sign(h(x)) ≠ y}

where h is the hypothesis function modelled by the neural network. However, in order to train neural networks with gradient-based methods, we need to replace the non-differentiable 0-1 loss with a surrogate differentiable loss. For a hypothesis class h, input x and target y, possible surrogate loss functions include:

1. Mean-square error: (h(x) − y)^2

2. Negative log-likelihood (aka cross-entropy): − log P(y|x, W)

3. Large margin loss (aka hinge or SVM loss): max(0, 1 − h(x)y)

4. Exponential loss: e^{−h(x)y}
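For concreteness, these losses are one-liners; a minimal sketch, assuming binary labels y ∈ {−1, +1} for the margin-based losses and writing the exponential loss with the conventional minus sign in the exponent:

    import numpy as np

    # Surrogate losses for label y in {-1, +1} and real-valued score s = h(x)
    mse      = lambda s, y: (s - y) ** 2                 # 1. mean-square error
    hinge    = lambda s, y: np.maximum(0.0, 1 - s * y)   # 3. large margin (hinge) loss
    exp_loss = lambda s, y: np.exp(-s * y)               # 4. exponential loss

    def nll(p):
        """2. negative log-likelihood, where p = P(y|x, W) is the model's
        predicted probability of the true label y."""
        return -np.log(p)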

As for training algorithms, the standard algorithm used in practice is gradient descent and variants thereof. Other possible training algorithms like genetic algorithms [Montana and Davis, ] and simulated annealing [Ingber, 1993] can be used as well; however, they tend to be inefficient in practice. Gradient descent updates the weights of a neural net to decrease the loss function by moving along the gradient with respect to the weights. Namely, at iteration t, gradient descent updates the weights as follows:

    Δ_t = γ(t) ∇f(x, W^{t−1})
    W^t = W^{t−1} − Δ_t

where γ(t) > 0 is called the learning rate, which is a value that depends on t. For gradient descent to converge to a local minimum, it suffices for the values of the gradient to be bounded and for the coefficients γ(t) to satisfy the conditions:

    ∑_{t=1}^{∞} γ(t) = ∞,    (2.2)
    ∑_{t=1}^{∞} γ(t)^2 < ∞    (2.3)

Training neural networks by gradient descent is commonly known as the backpropagation algorithm [LeCun, 1986, Rumelhart et al., 1986]. A simple practical variation of standard gradient descent is to include a momentum term during the update, i.e.:

    Δ_t = α Δ_{t−1} + γ(t) ∇f(x, W^{t−1})
    W^t = W^{t−1} − Δ_t

where α is called the momentum rate. Using momentum tends to accelerate convergence, usually without incurring a cost in the quality of the solution.
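A minimal sketch of the momentum update, assuming a function grad that returns the gradient of the loss at the current weights (illustrative names, not the thesis's code):

    import numpy as np

    def sgd_momentum(grad, W0, gamma=0.01, alpha=0.9, steps=1000):
        """Gradient descent with momentum:
        delta_t = alpha * delta_{t-1} + gamma * grad(W);  W <- W - delta_t."""
        W, delta = W0.copy(), np.zeros_like(W0)
        for t in range(steps):
            delta = alpha * delta + gamma * grad(W)
            W -= delta
        return W

    # Example: minimize the quadratic ||W||^2, whose gradient is 2W
    W = sgd_momentum(lambda W: 2 * W, np.array([5.0, -3.0]))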

2.2.2 Pros and Cons of Neural Networks

Like all other algorithms, neural networks have several advantages and several shortcomings. In [Vapnik, 1998], Vapnik points out some of the shortcomings of neural networks as follows:

1. Susceptibility to local minima

2. Slow convergence for standard gradient descent

3. Scaling factors of sigmoids affect approximation quality

The first point above follows naturally from the fact that this is a non-convex optimization problem, but this brings forth the question: are local minima really a negative? Local minima seem to be an inherent quality of nature. That is, it seems that the functions we are most interested in learning are non-convex functions, and it is difficult to imagine a way to get around that. This in part explains why neural networks perform so well on difficult tasks like image and speech recognition: since these problems are difficult, simple feature extraction with convex methods is unlikely to work well, and neural nets search a larger space that is more likely to be close to the true function we are trying to approximate than a convex method.

That said, local minima do introduce pragmatic difficulties. Since a local minimum only satisfies a property of the gradient of the function in a local region of the parameter space, there is not much we can say about the acquired solution in terms of its optimality. Moreover, local minima make results a bit harder to reproduce: as opposed to convex optimization problems, solutions to non-convex optimization problems depend on initialization points. Hence when one finds solutions with such architectures, there is no telling whether this is the best solution the architecture can give.

As for the second point, it tends to become less true as dataset sizes grow bigger and momentum-style methods are used. On today's machines, fitting a neural net with millions of parameters takes on the order of hours. Networks with billions of parameters take on the order of 5-6 days to learn [Krizhevsky et al., 2012]. The third point has also been addressed by learning scaling factors, using different activation functions and other engineering changes.

Pragmatically, one particular difficulty in designing neural networks is proper regularization. Being general function-approximators, without proper regularization or large amounts of data, there is no reason why the neural net should model functions in a desired way. This gives way to one other deficiency of neural networks: the multitude of parameters involved in designing them (i.e., hidden layer sizes, learning rates, decay factors, activation functions, etc.). The choice of these parameters is non-trivial and greatly affects results, making neural networks an unlikely candidate for off-the-shelf use.

On the other hand, the main case for neural networks is made by their strong empirical performance. While this kind of argument is not based on lasting theoretical principles, it is indeed a fact that at the time of this writing, most of the state-of-the-art results on benchmark datasets are held by variants of neural networks. This points out that the main problem with neural nets is not their susceptibility to local minima, but rather the sensitivity of the training process to changes in the parameters and proneness to overfitting.

2.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) [LeCun et al., 1998] are a class of MLPs that alternate convolution layers and pooling layers, with the last layer usually being a softmax or RBF layer. In a convolution layer, an input is convolved with multiple learned filters, leading to multiple maps which are then pooled together through a pooling scheme. CNNs are usually trained with standard backpropagation. Mathematically, a convolution layer applies the following function:

    h = activation(W_k ⋆ x + b)    (2.4)

where ⋆ denotes the discrete 2D convolution given as:

    h[m, n] = activation( ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} x[m + i, n + j] W[i, j] )    (2.5)

Usually, a convolution layer contains multiple convolution filters, to learn better representations of the data. Thus, the parameters connecting two layers form a 4D tensor (destination feature map index, source feature map index, source vertical position index, source horizontal position index). The original CNNs [LeCun et al., 1998] were created specifically to operate on images. The reasoning behind alternating convolution and pooling is twofold. The first is a connection to complex and simple cells in the cat's visual cortex [Hubel and Wiesel, 1968]. Simple cells are cells that were observed to respond to edge-like stimulus patterns in their receptive fields. Complex cells are ones observed to have a larger receptive field and to be locally invariant. Hence, convolution layers were meant to simulate simple cells and pooling layers were meant to simulate complex cells.
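A minimal sketch of equation (2.4) using scipy's 2D convolution; the [−1, 1] filter is the edge filter from Figure 2.3. (Note that deep learning libraries typically implement cross-correlation rather than true convolution; scipy's convolve2d performs true convolution, which suffices for illustration.)

    import numpy as np
    from scipy.signal import convolve2d

    def conv_layer(x, filters, biases, activation=lambda z: np.maximum(0, z)):
        """One convolution layer, per (2.4): one output map per learned filter."""
        return np.stack([activation(convolve2d(x, W, mode='valid') + b)
                         for W, b in zip(filters, biases)])

    img = np.random.rand(32, 32)            # a grey-scale input patch
    edge = np.array([[-1.0, 1.0]])          # the [-1, 1] filter from Figure 2.3
    maps = conv_layer(img, [edge], [0.0])   # output shape: (1, 32, 31)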

Figure 2.3: Effect of convolution in convolutional networks. [Convolution: the corgi image is convolved with the vector [−1, 1].]

A second, biology-free justification for the structure of CNNs is as follows. Convolution layers exploit a kind of deformation that does not change image class, hence regularizing learning through what has also been named weight sharing. Weight sharing leads to a reduced number of parameters in a convolutional layer compared to a standard fully connected layer, thereby reducing overfitting. Pooling layers combine stimuli from different convolution layers to produce a feature extractor that is invariant to small translations. Combined with regularization and pretraining techniques, these neural nets achieve state-of-the-art results on many datasets [Krizhevsky et al., 2012, Goodfellow et al., 2013b].

Figure 2.4: Effect of pooling in convolutional networks. [Pooling: the corgi image is pooled with a 2 × 2 max filter; the pooled image has half the width and half the height of the original.]

2.4 Dropout

Dropout [Hinton et al., 2012b] is a simple and efficient technique that can be used to reduce overfitting in neural networks. The main idea of dropout is to stochastically omit some of the units from the network during learning. Intuitively, dropout adds robustness to the network by introducing noise at all levels of the architecture. Another way to view dropout is as a way to do model averaging over exponentially many models with shared parameters. In that sense, dropout can be seen as optimizing the following objective function:

    min_W E_M[ L(X, M, W) ]

where L is the loss function, X is the dataset, M parametrizes the architecture on which the expectation is taken, and W is the set of the models' shared parameters. Note that the objective function above also models dropout's more recent cousin, DropConnect, which was shown to work slightly better than dropout on some datasets [Wan et al., 2013].
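A minimal sketch of a dropout mask, using the "inverted" scaling variant (rescaling at training time, rather than rescaling the weights at test time as in the original formulation) so that test-time activations need no change:

    import numpy as np

    def dropout(h, p_drop=0.5, train=True, seed=None):
        """Stochastically omit units during learning. Inverted scaling keeps the
        expected activation equal between training and test time."""
        if not train:
            return h
        rng = np.random.default_rng(seed)
        mask = rng.random(h.shape) >= p_drop   # keep each unit with prob. 1 - p_drop
        return h * mask / (1.0 - p_drop)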

2.5 Maxout Networks

A Maxout network [Goodfellow et al., 2013b] is a multi-layer perceptron that makes heavy use of dropout to regularize the neural net, thereby reducing overfitting. It also uses a max activation function that produces a sparse gradient.

Specifically, in these networks, for an input x ∈ R^n, every hidden layer implements the function:

    h_i(x) = max_{j ∈ [1,k]} z_ij, where
    z_ij = x^T W_{···ij} + b_ij,
    W ∈ R^{d×m×k}, b ∈ R^{m×k}.

Maxout networks apply the max operation over a number of linear units. Usually for these architectures, the number of linear units per maxout unit is the same for all Maxout units in a layer. That means that a neural network with 4 maxout layers is in reality an 8-layer neural network with alternating linear/max layers. The max layers in maxout are also known as pooling layers in convolutional networks. Maxout networks have produced state-of-the-art results on benchmark datasets without the use of pre-training or second-order optimization methods [Goodfellow et al., 2013b].

While Maxout networks were presented as an activation function that makes better use of dropout, they cannot be compared directly to rectifiers, as a maxout activation function adds new layers on which it computes a pooling operation. This leads to a multiplicative growth in the number of parameters in the network. For example, a rectifier layer with 400 nodes, connected to the output of a layer with 100 nodes, has 40,000 parameters, excluding biases, whereas a Maxout layer has 40,000 · N_p, where N_p is the number of units a maxout unit pools over.
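A minimal sketch of a single maxout layer following the equations above, with the (d, m, k) weight tensor shape they define:

    import numpy as np

    def maxout_layer(x, W, b):
        """Maxout layer: z_ij = x^T W[:, i, j] + b[i, j]; h_i = max_j z_ij.
        W has shape (d, m, k): d inputs, m maxout units, k linear pieces per unit."""
        z = np.einsum('d,dmk->mk', x, W) + b   # all m*k linear pieces at once
        return z.max(axis=1)                   # pool over the k pieces per unit

    d, m, k = 100, 400, 5                      # the 400-unit example from the text
    rng = np.random.default_rng(0)
    h = maxout_layer(rng.normal(size=d), rng.normal(size=(d, m, k)), np.zeros((m, k)))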

2.6 Hidden Markov Models

Hidden Markov Models (HMMs) [Rabiner, 1989] are relatively simple, reliable models for data sequencing problems. Technically, HMMs are partially-observable generalizations of Markov models, which are simple stochastic graphical models. In their simplest form, an HMM is characterized by a tuple λ = (A_{N×N}, B_{N×M}, π) where:

1. N: the number of states in the model; we denote the states as S = S_1, S_2, ..., S_N and q_t as the state at time t

2. M: the number of possible observations; we denote the observations as V = v_1, v_2, ..., v_M

3. A_{N×N}: the matrix of state transition probabilities, where A_ij = p(q_t = S_i | q_{t−1} = S_j). A is known as the transition model.

4. B_{N×M}: the matrix of symbol emission probabilities B = b_j(k), where b_j(k) = p(v_k | q_t = S_j)

5. π: the initial distribution on states, π_i = p(q_1 = S_i)

The three problems of Hidden Markov Models

For a sequence of observations O_1^T = o_1, o_2, ..., o_T and an HMM λ = (A, B, π), there are three basic questions that can be asked:

1. How to compute P(O_1^T | λ) efficiently?

2. How to compute a state sequence Q = q_1, q_2, ..., q_T that best explains O_1^T? I.e., compute the sequence Q such that Q = argmax_Q P(Q | O, λ)

3. How to train a given HMM λ = (A, B, π) from data?

Rabiner [Rabiner, 1989] gives an extensive explanation of how each of these problems can be solved.

Namely, the first problem is solved with what is called the forward algorithm. It decomposes P(O|λ) into smaller subproblems by exploiting the Markov property as follows:

    α_t(j) = ∑_{i ∈ [1,N]} α_{t−1}(i) A_ij b_j(o_t)    (2.6)

which computes the probability of a state sequence ending in state j at time t. Computing the above recursion can be done with dynamic programming, where it takes O(N^2 T) time. After this matrix is computed, P(O_1^T | λ) can be computed simply as:

    P(O_1^T | λ) = ∑_{j ∈ [1,N]} α_T(j)

The second problem is also a dynamic programming problem. It can be solved with the Viterbi algorithm by computing the following recursion:

    α_t(j) = max_{i ∈ [1,N]} α_{t−1}(i) A_ij b_j(o_t)    (2.7)

which computes the probability of the most likely sequence ending at state j at time t. Finding the most likely state sequence is done by simply tracing back through the dynamic programming matrix. Viterbi also takes O(N^2 T) time.

The third problem is intractable in general; however, many ways exist to come up with locally optimal solutions, the most popular of which is maximum likelihood (i.e., maximizing p(O|λ)) with a variant of the expectation-maximization algorithm [Dempster et al., 1977] called the Baum-Welch algorithm [Rabiner, 1989]. Another method is to discriminatively train the HMM to minimize the conditional entropy of hidden state sequences given the observations. This is equivalent to maximizing mutual information between the observations and the hidden states. Concretely, conditional entropy H(Q|O) and mutual information I(Q, O) are given as:

    H(Q|O) = − ∑_{q,o} p(q|o) log p(q|o)
    I(Q, O) = H(Q) − H(Q|O)

For works that extensively elaborate this procedure, consult [Vertanen, Valtchev et al., 1997]. Generally speaking, HMM-based models have long been among the main tools used for sequence modelling in voice recognition [Rabiner, 1989] and handwriting recognition [Hu et al., 1996].
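For concreteness, a minimal numpy sketch of the Viterbi recursion (2.7) with backtracking. Note that this sketch uses the row-stochastic convention A[i, j] = p(q_t = S_j | q_{t−1} = S_i), transposed relative to the definition given earlier in this section:

    import numpy as np

    def viterbi(A, B, pi, obs):
        """Most likely state sequence for observations obs, per recursion (2.7).
        A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial distribution."""
        N, T = A.shape[0], len(obs)
        alpha = np.zeros((T, N))            # alpha[t, j]: best score ending in j at t
        back = np.zeros((T, N), dtype=int)  # best predecessor state
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = alpha[t - 1, :, None] * A        # scores[i, j]: come from i to j
            back[t] = scores.argmax(axis=0)
            alpha[t] = scores.max(axis=0) * B[:, obs[t]]
        # trace back through the dynamic programming matrix
        q = [alpha[-1].argmax()]
        for t in range(T - 1, 0, -1):
            q.append(back[t, q[-1]])
        return q[::-1]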

2.7 Hybrid HMM models

Hybrid models [Morgan and Bourlard, 1995] extend HMMs with a simple idea: instead of using a generative observation model, hybrid models use Bayes rule and implicitly model the observation model using a probabilistic classifier over the HMM states. Concretely, let O be a sequence of observations and let Q be a state sequence; the purpose of the HMM in sequencing tasks is to produce argmax_Q p(Q|O). In a standard setting, to train an HMM, we require an observation model p(o|s), where o is an observation and s is an HMM state. In the hybrid model, we approximate the observation model through Bayes rule with a probabilistic classifier that computes the posterior distribution p(s|o) on HMM states s given an input o. Concretely:

    p(o|s) = p(s|o) p(o) / p(s) ∝ p(s|o) / p(s),    (2.8)

with p(o) assumed to be equal for all observations. Such hybrid models are usually trained with the embedded Viterbi algorithm [Bourlard and Morgan, 1998] to maximize the likelihood of the data. In other variants of the model, hybrid models are discriminatively trained to optimize segmentation accuracy directly [Bengio et al., 1992, Bengio et al., 1995] [3]. Combined with variants of neural networks, these models have increased accuracies on challenging sequencing tasks, primarily in speech recognition [Hinton et al., 2012a].

[3] It is worth noting that this model is pseudo-generative, in the sense that it is trained to maximize the likelihood of the data, but it cannot be sampled from.
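A minimal sketch of this substitution, assuming a matrix of per-frame classifier posteriors (one row per observation, one column per state); the resulting scaled likelihoods can be plugged into the Viterbi sketch above in place of the emission terms:

    import numpy as np

    def scaled_likelihoods(posteriors, state_priors):
        """Hybrid observation model per (2.8): replace p(o|s) with p(s|o) / p(s).
        posteriors:   (T, N) classifier outputs p(s|o_t), e.g. from a softmax layer
        state_priors: (N,) relative frequencies p(s) of states in the training data"""
        return posteriors / state_priors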

Embedded Viterbi

Embedded Viterbi alignment [Morgan and Bourlard, 1995] is an algorithm used to train hybrid HMM models. The algorithm differs from standard Baum-Welch training in two ways. The first is that for embedded Viterbi, when feeding in a sequence, a label is required for every slice of that sequence to train the probabilistic classifier. The second is that in the maximization step of Baum-Welch, the state labels are re-estimated using Viterbi after posterior probabilities are computed from the probabilistic model, which was trained on labels from previous iterations. Embedded Viterbi can be thought of as a supervised method for training HMMs; i.e., it is made for contexts where not only sequence information is available, but also information about a segmentation that is close to optimal.
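A sketch of this training loop; train_classifier, estimate_transitions and viterbi_align are hypothetical helpers standing in for whatever classifier and HMM implementation is used, not functions from the thesis:

    def embedded_viterbi_training(images, initial_alignments, n_iters=10):
        """Sketch of embedded Viterbi training for a hybrid HMM model."""
        labels = initial_alignments                  # heuristic starting alignment
        for _ in range(n_iters):
            # train the probabilistic classifier on the current frame labels
            clf = train_classifier(images, labels)   # hypothetical helper
            hmm = estimate_transitions(labels)       # hypothetical helper
            # re-estimate the alignment with Viterbi, using classifier posteriors
            labels = [viterbi_align(hmm, clf.posteriors(x)) for x in images]
        return clf, hmm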

2.8 Maximally Stable Extremal Regions (MSERs)

Maximally Stable Extremal Regions (MSERs) [Matas et al., 2004] are computer vision tools originally developed to detect key points in images as a part of solving the image correspondence problem, in which an algorithm seeks to find which objects correspond to each other in two images, where one is produced from the other by some movement, change of camera position or elapsed time. Intuitively, extremal regions are regions of the image that are local minima or maxima in the manifold defined by the image. More formally, MSERs possess the following qualities:

1. Invariance to affine transformations of image intensity.

2. Stability: only extremal regions whose support is unchanged over a range of thresholds are selected.

3. Multi-scale detection: without any smoothing, both fine and coarse regions are detected.

4. Low computational cost: can be enumerated in O(n) where n is the number of pixels in the image.

The process for computing MSERs is relatively straightforward. It begins by removing all pixels from the image and sorting them by intensity. Then, it proceeds by placing the pixels back into the image one by one in order of intensity. When a pixel is put back into the image, the set of other pixels this new one is connected to can be identified on the fly; this set is called a connected component, where we say two pixels are connected if a path of adjacent pixels exists from one to the other. While this process is taking place, the area of each connected component can also be computed on the fly as pixels are added. If one were to plot how a connected component's area changes as pixels are added, one could compute the rate of change (i.e., the first derivative) of area as a function of the number of pixels added. In this graph, local minima, which are points in intensity at which the area of a connected component stops changing, are the thresholds used to compute MSERs. Algorithm 1 presents the original algorithm [Matas et al., 2004] for computing MSERs.

Algorithm 1 MSER Extraction
Input: image I
1. Sort all pixels in I by intensity value. O(n)
2. Place pixels back in the image while maintaining a list of connected components and their areas. O(n log log n)
3. Report intensity levels that are local minima of the rate of change of the area function as thresholding values that produce MSERs.

In the original paper defining MSERs [Matas et al., 2004], the computational complexity was O(n log log n). However, in [Nistér and Stewénius, 2008] this was refined to O(n). As noted above, MSERs were originally created to help with the image correspondence problem. However, they were shown to be useful for other tasks like text detection [Chen et al., 2011].
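As an aside, OpenCV ships an off-the-shelf MSER detector; a minimal usage sketch (the cv2.MSER_create / detectRegions API of recent OpenCV releases, assuming a file scene.jpg exists) looks like this:

    import cv2

    gray = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)   # pixel lists and bounding boxes
    print(f'{len(regions)} stable regions found')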

Chapter Three

Character Recognition

This chapter presents a detailed inspection into the creation of a state-of-the-art character recognizer. We begin by formalizing the problem of character recognition in section 3.1. We then survey related works in the literature in section 3.2. Afterwards, in section 3.3, we describe the particular dataset this chapter is concerned with. Then we present our specific approach in section 3.4 and the results in section 3.5. We conclude with a discussion in section 3.6.

3.1 Problem Definition

Character recognition is a basic block in an end-to-end text recognition system.

Formally defined: given an image x ∈ X_char and a set Y = {1, ..., K} of character labels, construct a function f : X_char → Y, where X_char is the set of all character images.

3.2 Related Works

Character recognition is an instance of object recognition where characters are the object of interest. The difficulty of this problem stems from high confusion between upper-case/lower-case characters and letters/numbers, on which even humans make mistakes [de Campos et al., 2009].


Character recognition and its more general cousin, object recognition, have been addressed by the machine learning community using various methods ranging across the entire spectrum of classification algorithms (e.g., k-NN, SVMs, neural nets [LeCun et al., 1998], boosting). However, the aforementioned algorithms in their most basic form cannot compete for state-of-the-art results on the more challenging datasets, such as those in which characters are found "in the wild" as opposed to characters found in structured documents. Therefore, variants of vanilla machine learning algorithms (e.g., large margin nearest neighbours [Blitzer et al., 2005], virtual SVMs [Decoste and Schölkopf, 2002], boosting products of classifiers [Kégl and Busa-Fekete, 2009]) were devised with different underlying assumptions to tackle this problem. Generally speaking, the techniques used for constructing these variants, which were meant to improve models' performance on test sets, can be grouped into three classes:

1. Manually help the model by extracting useful features.

2. Use a larger training set while increasing model capacity.

3. Encode data invariances directly into the model, thereby restricting the hypothesis space to more useful areas.

The first approach has been the traditional approach for machine learning practitioners. Instances of this approach include [de Campos et al., 2009], where the authors manually define a set of features that they perceive to be discriminative and then utilise off-the-shelf algorithms to achieve good recognition results. While this approach is simple and efficient, designing and selecting features is not a trivial task. Moreover, the desired discriminative features may be too complicated for a human to create. This steered the interest of the field to learning discriminative features through neural networks.

Neural networks, however, are too simple in their vanilla form to achieve state-of-the-art results. This is in part due to their large hypothesis spaces, in which search is difficult without proper regularization. In general, encoding translation and scale invariances into the neural network has been shown to be effective. This is usually done with convolutional neural networks [LeCun et al., 1998, Ciresan et al., 2012]. However, while these networks do produce state-of-the-art results in their different instantiations, they require large amounts of data and take a long time to train (in our experiments, training a convolutional neural network with 100k examples and millions of parameters took around 7 hours). An interesting recent work [Bruna and Mallat, 2013] reached a CNN-like architecture (called scattering networks) from signal processing principles by attempting to encode invariances to translation and scaling into the learning machine.

Some other interesting methods include virtual SVMs [Decoste and Schölkopf, 2002], in which the authors sample new training examples by applying deformations to support vectors. This method is quite simple; however, experiments using it were limited to fairly small datasets. Another interesting method is the Large Margin Nearest Neighbour (LMNN), an instance of the popular k-NN class of methods, in which the authors [Blitzer et al., 2005] learn a distance metric such that a k-NN classifier would work well. This is done by enforcing constraints that a positive neighbour should be closer to a sample than a negative neighbour, where positive neighbours are ones that belong to the same class and negative neighbours are ones that do not.

On the particular task of recognizing English characters and digits, a more limited set of algorithms can be found in the literature. Specifically, pre-trained variants of CNNs [Wang et al., 2011, Wang et al., 2012], in which the authors, before discriminatively training the CNN, pretrain its convolutional weights by applying a variant of k-means to normalized data. Deformable parts models [Shi et al., 2013] were also shown to work well in this particular sub-domain.

3.3 Dataset

Since this work concerns itself with recognizing English text, we set Y to be the set of English characters and Arabic digits from 0 to 9; note that |Y| = 62. The particular dataset we use for this task is the ICDAR 2003 character recognition dataset [Lucas et al., 2003]. We choose this specific dataset because previous works used it to assess their methods (e.g., [Coates et al., 2011, Wang et al., 2012]), and also because we intend to use this classifier for recognizing words from the ICDAR 2003 words dataset. The character dataset contains 6114 training samples and 5379 test samples after removing all non-alphanumeric characters as in [Wang et al., 2012]. We augment the training dataset with 75,495 character images from the Chars74k English dataset [de Campos et al., 2009] and 50,000 synthetic characters generated by [Wang et al., 2012], making the total size of the training set 131,609 tightly cropped character images. All images are rescaled to a size of 32-by-32 with Lanczos interpolation over an 8x8 pixel neighborhood and then converted to grey-scale.

3.4 Method

We want to create a probabilistic, highly accurate classifier on a large dataset of characters. We would like our classifier to be invariant to small translations in input, as character cropping is usually not tight. We would also like our classifier to be invariant to scale, as text exists at multiple scales. SVMs and their variants take a very long time to train, as training time scales quadratically with the number of examples. Also, there is no clear way to encode such invariances into the SVM. Other more sophisticated options like scattering networks [Bruna and Mallat, 2013] were not tested on large-scale problems and were therefore difficult to assess. On the other hand, convolutional neural networks, while consuming long training times, offered the best tradeoff in terms of inference speed, accuracy and training time.

The particular variant of CNNs we chose to work with is a convolutional Maxout network. Maxout networks were shown to be highly accurate [Goodfellow et al., 2013b] on challenging real-world object recognition tasks, like the CIFAR 10 [Krizhevsky et al., 2012] and SVHN [Netzer et al., ] datasets. The particular architecture we create for this task is a five-layer convolutional Maxout network, with the first three layers being convolution-pooling Maxout layers, the fourth a Maxout layer, and finally a softmax layer on top. The first three layers have respectively 48, 128 and 128 filters, of sizes 8-by-8 for the first two and 5-by-5 for the third, pooling over regions of sizes 4-by-4, 4-by-4 and 2-by-2 respectively, with 2 linear pieces per Maxout unit and a 2-by-2 stride. The fourth layer has 400 units and 5 linear pieces per Maxout unit, fully connected with the softmax output layer. These choices of hyper-parameter values were based on the network constructed for CIFAR 10 in [Goodfellow et al., 2013b].

We train the proposed network on 32-by-32 grey-scale character image patches with a simple preprocessing stage of subtracting the mean of every patch and dividing its elements by its standard deviation + ε. We set ε = 0.001, where ε was chosen by cross-validating on the training set with an SVM classifier with an RBF kernel. Similar to [Goodfellow et al., 2013b], we train this network using stochastic gradient descent with momentum and dropout to minimize the cross-entropy loss − log p(y|x).
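The per-patch normalization is simple enough to state exactly; a minimal numpy sketch:

    import numpy as np

    EPS = 1e-3  # the epsilon chosen by cross-validation in the text

    def preprocess(patch):
        """Per-patch normalization applied before the Maxout network:
        subtract the patch mean, divide by (standard deviation + epsilon)."""
        return (patch - patch.mean()) / (patch.std() + EPS)

    x = preprocess(np.random.rand(32, 32))  # a 32-by-32 grey-scale character patch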

3.5 Results

The resulting character recognizer achieves state-of-the-art recognition rates on the ICDAR 2003 character test set, with an accuracy of 85.5% on the 62-way case-sensitive benchmark and 89.9% on the case-insensitive 36-way benchmark. When we use the Maxout network as a feature extractor and feed the features from the penultimate layer into an SVM with an RBF kernel, the recognition accuracy increases to 86% on the 62-way benchmark while it remains roughly the same (89.8%) on the 36-way benchmark. Table 3.1 compares our results to other works on this dataset. The works we compare to were state-of-the-art methods at the time of this writing, and these methods all used a training set augmented in the same way ours is. Overall, the performance of the convolutional Maxout network is superior to that of the CNNs used in previous approaches. Training took six hours on an NVIDIA Kepler K20c GPU using Theano [Bergstra et al., 2010] and pylearn2 [Goodfellow et al., 2013a].

Figure 3.1: Character recognition confusion matrix. [Heatmap of predicted vs. actual labels over the 62 alphanumeric classes (digits 0-9, lower-case a-z, upper-case A-Z), with intensities ranging from 0.0 to 1.0.]

As a side note, we found that different forms of binarization (Otsu, random walkers, GrabCut) and preprocessing methods, such as ZCA as used in [Wang et al., 2012], do not enhance the test accuracy and in some cases decrease it.

Table 3.1: Character recognition accuracy (%) on the ICDAR 2003 test set. All methods use the same augmented training dataset.

Work                    Method               Result
[Coates et al., 2011]   pre-trained CNNs     81.7
[Wang et al., 2012]     pre-trained CNNs     83.9
this work               Conv-Maxout          85.5
this work               Conv-Maxout + SVM    86.0

3.6 Discussion

This chapter presents a method to construct a state-of-the-art character recognizer on the ICDAR 2003 dataset. The recognizer utilizes a convolutional variant of the recently introduced Maxout networks in addition to dropout. The architecture we construct demonstrates superior results when compared to pre-trained CNNs, the previous leading method for this problem. By inspecting the confusion matrix (Figure 3.1), we see that most of the inaccuracy comes from the confusion between upper-case and lower-case letters. This kind of confusion is difficult even for humans to circumvent without context information. Therefore, one possible future work to sidestep this confusion is to include context information for the characters under recognition.

Chapter Four

Word Recognition

This chapter presents the word recognition module for transcribing images of words found "in the wild" into their corresponding text. We begin with a concrete definition of the word recognition problem in section 4.1. We proceed to survey related works in section 4.2. Then we present an outline of our method in section 4.3. Sections 4.4, 4.5 and 4.6 respectively present the three core steps of our method: segmentation with a hybrid HMM model, cascade construction, and word inference. In section 4.7 we present the implementation details pertaining to the hybrid HMM model. Section 4.8 presents the datasets used as well as the results obtained. In section 4.9, an analysis of the accuracy-speed tradeoff is presented, as well as an empirical analysis of the effect of language model order. Finally, we conclude in section 4.10.

4.1 Problem Definition

There are two variants of the word recognition problem: one where a lexicon of allowed words is provided, and another where predicted words are permitted to fall outside the lexicon. These problems can be formally defined as follows:

For the first problem, given an image x ∈ X_word and a set of words W, create a function f_1 : X_word → W. The second problem is defined similarly, except that instead of the word lexicon, we are given a set of characters Y, and the desired function is of the form f_2 : X_word → Y*.


In this chapter, we present a method that can be used for solving either problem (i.e., with or without limiting words to a lexicon). Our proposed method outperforms most previous state-of-the-art results on the ICDAR 2003 and SVT word recognition datasets.

4.2 Related Works

Word recognition, much like phone recognition and handwriting recognition, is a sequence recognition problem. Works on sequence recognition have generally been confined to their applied fields. However, machine learning methods used in one field tend to be applicable to another with minor modifications.

In the speech recognition community, the most popular sequencing models have been Hidden Markov Models (HMMs) and variants thereof [Rabiner, 1989, Dahl et al., 2012]. Other tools such as Dynamic Time Warping (DTW) and neural nets received attention during the 1980s, but they were then outperformed by HMM variants, such as hybrid HMM models. More recently, however, variants of recurrent neural nets have been shown to work well on phoneme sequencing tasks [Graves et al., 2013]. As for handwriting recognition, HMM alternatives such as Long Short Term Memory (LSTM) [Liwicki et al., ] and Graph Transformer Networks (GTNs) [LeCun et al., 1998] have been shown to work particularly well. While GTNs haven't received much attention recently, LSTMs continue to outperform other methods for handwriting recognition [Grosicki and El Abed, 2009].

As for image text recognition, previous works tackled this problem using variants of CNNs [Wang et al., 2012], Conditional Random Fields (CRFs) [Mishra et al., 2012, Novikova et al., 2012] and Pictorial Structures (PS) [Wang et al., 2011]. Surprisingly, most of the work in this area hasn't benefited from the already established methods for handwriting and speech recognition, such as hybrid models and recurrent nets.

The majority of techniques in this particular sub-domain have relied on segmentation-free, lexicon-dependent approaches. Using lexicons helps tackle the high confusion inherent in the text recognition problem. However, despite the argument for the validity of task-specific lexicon use in [Wang et al., 2011], it is clear that we ultimately wish to recognize text with a very general lexicon. To do so, we require word recognizers that scale well in the size of the lexicon. The works of [Neumann and Matas, 2011, Mishra et al., 2012, Novikova et al., 2012] are the only works we know of that show how their methods scale with lexicon size.

4.3 Method Outline

The purpose of the word recognition module is to transcribe a word image into text. Our approach to word recognition is a segmentation-based, lexicon-free approach that can easily incorporate a lexicon during inference or as a post-inference processing tool. As it currently stands, it is difficult to recognize words with high accuracy without any language model due to the character confusion problem; therefore, all previous systems rely on lexicons to improve their results. However, since lexicons can be very large, we make the distinction in our approach between when query time is linear in the size of the lexicon and when it is constant.

The pipeline proposed in this work for word recognition is as follows: a word image is received, after which it is segmented into possible characters using a hybrid HMM/Maxout model; from the resulting segmentation a cascade of potential characters is constructed. Then, to find the exact word, a beam-search style variant of the Viterbi algorithm is applied. The beam search allows trading off speed and accuracy to compute a list of candidate results. Figure 4.1 depicts the pipeline for the full word recognition module. We detail each sub-module in the following sections.
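To make the control flow concrete, the following is a minimal Python sketch of the pipeline; the three stage functions are passed in as callables and are hypothetical stand-ins for the modules described in the remainder of this chapter.

def recognize_word(word_image, segment, build_cascade, decode):
    # 1. Segment the word image into candidate character intervals
    #    (hybrid HMM/Maxout model, Section 4.4).
    intervals = segment(word_image)
    # 2. Build the cascade of potential characters (Section 4.5).
    cascade = build_cascade(intervals)
    # 3. Decode with a beam-search variant of Viterbi (Section 4.6);
    #    returns a ranked list of (word, cost) candidates.
    return decode(cascade)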

Figure 4.1: Word recognition pipeline. Pipeline for the word recognition module. Pentagons represent learned modules. The character recognition Maxout module was described in Chapter 3. Note that the lexicon can be used either as a post-processing tool, or during inference after a language model is constructed.

4.4 Hybrid HMM Maxout Model

The use of a hybrid HMM Maxout model for segmentation was inspired by works in speech recognition [Renals et al., 1994, Hinton et al., 2012a]. Whereas in speech recognition the hybrid model is used to directly sequence phonemes, here we use it to segment word images into character/inter-character regions. Unlike in speech recognition, our input domain consists of an image, which extends in two dimensions. To make use of the topological structure of our input, we construct a hybrid model combining an HMM with a convolutional network, instead of a standard neural network. To train the hybrid model from word image data, for every image we create a segmentation into character/inter-character regions as follows: for every word, for every pair of adjacent characters, we define an inter-character region as the region stretching 10% into the left character and 10% into the right character. The particular HMM structure we use, depicted in Figure 4.2, has two “strands” of states; the first strand models character regions and the second strand models the inter-character regions.


Figure 4.2: A hybrid HMM-Maxout model. Word-to-character hybrid HMM Maxout module. The Maxout network's fourth layer is coupled with the HMM.
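As an illustration of the labelling rule above, the inter-character regions can be generated as follows; char_boxes is a hypothetical list of (left, right) horizontal extents of a word's characters, ordered left to right, and the function is a sketch rather than the exact preprocessing code used in the thesis.

def inter_character_regions(char_boxes):
    # An inter-character region stretches 10% into the left character
    # and 10% into the right character (the rule described above).
    regions = []
    for (l1, r1), (l2, r2) in zip(char_boxes, char_boxes[1:]):
        w_left, w_right = r1 - l1, r2 - l2
        regions.append((r1 - 0.1 * w_left, l2 + 0.1 * w_right))
    return regions

# e.g., two adjacent characters spanning (0, 10) and (12, 20)
# yield the inter-character region (9.0, 12.8)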

Concretely, training the hybrid model follows this procedure: first, we start with a heuristic alignment of the input sequence with the target labels; in other words, we rely on an initial segmentation of the input data. Next, the neural network is trained via conventional gradient descent to minimize the errors on the current alignment. Then a new alignment is estimated using the Viterbi algorithm, on which the neural network is trained again, and the process repeats until convergence. This procedure is known as embedded Viterbi training [Bourlard and Morgan, 1998]. After the model is trained, we use it to find the segmentation Q that maximizes P(Q|O) by running the standard Viterbi algorithm.
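A sketch of the embedded Viterbi loop in Python; train_network and viterbi_align are hypothetical helpers standing in for gradient-descent training of the Maxout net and for Viterbi alignment under the HMM, respectively.

def embedded_viterbi(images, alignments, network, hmm,
                     train_network, viterbi_align, max_iters=10):
    # Alternate between (1) training the network on the current frame
    # labels and (2) re-aligning with Viterbi using the network's
    # posteriors as HMM emission probabilities.
    for _ in range(max_iters):
        train_network(network, images, alignments)
        new_alignments = [viterbi_align(hmm, network, im) for im in images]
        if new_alignments == alignments:  # alignments stable: converged
            break
        alignments = new_alignments
    return network, alignments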

4.5 Constructing the Cascade

The segmentation produced by the hybrid model suffers from two natural shortcomings: over- and under-segmentation. Over-segmentation arises because some characters can be composed by concatenating other characters, e.g., VV instead of W. Under-segmentation is more often observed in cases of difficult fonts, blurry images and complex background noise.

To filter out instances of over-segmentation, we train a 4-layer convolutional Maxout network, with the same architecture as the Maxout used with the hybrid model, to predict the probability of over- and under-segmentation. This network (called Segmentation Correction Maxout in Figure 4.1) is trained on correct, over-, and under-segmentations. We create a new interval from every two adjacent intervals if the joined interval has a higher probability of being a correct segmentation than both of its constituents under the learned network. As for under-segmentation, we simply divide every resulting interval into two by cutting it in the middle. (While we could alternatively use other segmentations from the hybrid model, this particular heuristic is simpler and equally effective.)

This operation produces a three-layered graph of overlapping intervals. We refer to this graph as a cascade. Every cascade induces an adjacency graph that we use later for inferring the corresponding word. Figure 4.3 depicts a cascade for the word “JFC” along with its induced graph. Note that the middle row is the segmentation from the hybrid HMM/Maxout model.
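The construction can be sketched as follows; intervals are (left, right) pairs from the hybrid model, and p_correct is a hypothetical callable wrapping the Segmentation Correction Maxout's probability that an interval is a correct character segmentation.

def build_cascade(intervals, p_correct):
    # Top layer: join adjacent intervals when the joined interval is more
    # likely to be a correct segmentation than both constituents.
    merged = []
    for a, b in zip(intervals, intervals[1:]):
        joined = (a[0], b[1])
        if p_correct(joined) > max(p_correct(a), p_correct(b)):
            merged.append(joined)
    # Bottom layer: split every interval in the middle to recover from
    # under-segmentation.
    split = []
    for left, right in intervals:
        mid = (left + right) / 2.0
        split += [(left, mid), (mid, right)]
    # Middle layer is the hybrid model's own segmentation.
    return merged + intervals + split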

4.6 Word Inference

Computing the most likely word given a cascade is equivalent to computing the most likely path from the beginning of the cascade to its end. This problem can be solved using dynamic programming in a way similar to the Viterbi algorithm, except that here both nodes and edges in the graph incur a cost.


Figure 4.3: A cascade induced graph for the word “JFC”. A sample cascade with its induced graph. The middle row is the output of the hybrid HMM Maxout module. The letter J in the top level was produced as an output of the Segmentation Correction Maxout module. The lower row is produced by systematically splitting all the nodes in the middle row.

Let the alphabet consist of $K$ characters, let $c_i$ be the character with index $i$, and let $v_k$ be the interval in the cascade with index $k$, where the cascade intervals are sorted by their left-most point. We define $S(c_i, v_k)$ to be the probability of the most likely sequence ending in interval $v_k$ with character $c_i$, and $N(v_k)$ to be the set of intervals that immediately precede $v_k$. $S$ can be computed optimally using a Viterbi-style algorithm in two cases: without a language model, or with one. In the first case $S$ becomes:

    $S(c_i, v_k) = p(c_i \mid v_k)\, p(v_k) \max_{j,q} S(c_j, v_q)$,    (4.1)

while in the second:

    $S(c_i, v_k) = p(c_i \mid v_k)\, p(v_k) \max_{j,q} p(c_i \mid c_j)\, S(c_j, v_q)$,    (4.2)

such that $q \in N(v_k)$. Computing $S$ from a list of intervals takes $O(K^2 V + V \log V)$ time, where $V$ is the number of intervals in the cascade, assuming that $|N(v_k)| = O(1)$ for any $v_k$. The $V \log V$ term comes from the binary search for the set of intervals $N(v)$ for every interval.

The most likely word can be found by tracing back the largest $S(c_i, v_k)$ over all intervals whose end is the end of the cascade. (One issue with the optimization problem in equation (4.2) is that it compares sequences on different probability spaces, in that the sequence-length prior is induced by the hybrid model's segmentation and the cascade construction. This particular issue was not directly studied by any works we know of; other works in the area have handled it in other ways, for which consult [Bengio et al., 1995, LeCun et al., 1998].)

While we can obtain $p(c|v)$ as the posterior from the five-layer character recognition Maxout network (from Chapter 3), we obtain $p(v)$ from the Segmentation Correction Maxout network specified in Section 4.5. As for the language model, $p(c_i|c_j)$, we obtain it from a predefined lexicon.

The straightforward generalization of equation (4.2) to n-gram language models incurs a large time penalty on the order of $O(K^n)$. To sidestep that penalty while allowing for higher-order language models, we propose an algorithm that trades off accuracy with inference time in a way similar to Beam Search [Russell and Norvig, ], keeping the top $B$ candidates for every interval. We call this Cascade Beam Search (see Algorithm 2). Here, $p(c|m, w)$ is the probability of a character given an n-gram language model $m$ and the sequence of characters $w$ that precede it, $cost_v$ is the visual cost of an interval-character pair, $cost_l$ is the linguistic cost of ending in an interval with a character $c$, and ∥ denotes string concatenation. Note that the beam search algorithm allows us to conduct the search for the target word with constant complexity in lexicon size, since the n-grams are compiled beforehand.

The outputs produced by the above algorithm need not belong to a predefined lexicon. One could constrain the output to a lexicon in multiple ways, the most popular of which are Finite State Transducers [Mohri, 1997], which keep the computation constant in lexicon size. In this work, however, we adopt a different approach: we map the predicted sequence to its nearest neighbor under edit distance. We compute this mapping naively with a method that consumes linear time in the size of the dictionary.
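A minimal sketch of this post-processing step, assuming a plain Levenshtein distance (the learned edit distances mentioned in Section 4.10 would slot in here as well):

def edit_distance(a, b):
    # classic O(|a||b|) dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_lexicon_word(prediction, lexicon):
    # naive linear scan over the lexicon, as described above
    return min(lexicon, key=lambda word: edit_distance(prediction, word))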

Algorithm 2 Cascade Beam Search
Input: intervals v_i, language model m, beam width B
for i = 1 to V do
    Q_i = Queue()
    for j ∈ N(v_i) do
        for c_k ∈ Alphabet do
            for every word w in Q_j do
                ŵ = w ∥ c_k
                cost_v = p(c_k | v_i) · p(v_i)
                cost_l = p(c_k | m, w)
                cost_ŵ = cost_v · cost_l · cost_w
                Add (ŵ, cost_ŵ) to Q_i
                if size(Q_i) > B then
                    remove the word with lowest cost from Q_i
                end if
            end for
        end for
    end for
end for
return all w ∈ Q_j, sorted decreasingly by their costs, such that v_j is at the end of the cascade
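For concreteness, a Python rendering of Algorithm 2; preds[i] encodes N(v_i), with -1 standing for the start of the cascade, and p_char, p_interval and p_lm are hypothetical callables wrapping the character recognizer, the segmentation-correction score and the n-gram model.

def cascade_beam_search(num_intervals, preds, alphabet,
                        p_char, p_interval, p_lm, B=100):
    # beams[i] holds up to B (word, cost) pairs ending at interval i;
    # -1 is a virtual start node holding the empty word with cost 1.
    beams = {-1: [("", 1.0)]}
    for i in range(num_intervals):
        candidates = []
        for j in preds[i]:
            for w, cost_w in beams.get(j, []):
                for c in alphabet:
                    cost_v = p_char(c, i) * p_interval(i)  # visual cost
                    cost_l = p_lm(c, w)                    # linguistic cost
                    candidates.append((w + c, cost_v * cost_l * cost_w))
        # keep only the B highest-cost (most probable) words
        beams[i] = sorted(candidates, key=lambda t: t[1], reverse=True)[:B]
    return beams  # read answers off the beams of cascade-final intervals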

4.7 Implementation Details

In the hybrid model we use, depicted in Figure 4.2, the HMM has four states for each of the character/inter-character regions. For the classifier, we use a four-layer convolutional Maxout net whose first three layers are convolution/pooling layers with 48 filters each, where the filters are of size 8-by-8 for the first two layers and 5-by-5 for the third, topped by a softmax layer. The first three layers have 2, 2 and 4 linear pieces per Maxout unit respectively, and pooling is done on regions of size 4-by-4 with a 2-by-2 stride. The particular dataset we use to create the hybrid HMM model is made from the first 500 words in the ICDAR 2003 training set. We find that a single iteration of embedded Viterbi is sufficient for the hybrid model to learn to segment; this is likely because the Maxout component is learning a good initial segmentation.
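For reference, a Maxout unit simply takes the maximum over several affine "pieces". Below is a minimal dense (non-convolutional) version in NumPy, purely illustrative since the layers above are convolutional.

import numpy as np

def maxout_layer(x, W, b):
    # W: (pieces, out_dim, in_dim), b: (pieces, out_dim).
    # Each output unit is the max over its linear pieces.
    z = np.einsum('poi,i->po', W, x) + b
    return z.max(axis=0)

# example: 2 linear pieces, 3 output units, 5 inputs
rng = np.random.default_rng(0)
h = maxout_layer(rng.standard_normal(5),
                 rng.standard_normal((2, 3, 5)),
                 rng.standard_normal((2, 3)))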

4.8 Dataset and Results

We test our word recognition subsystem on word images from the ICDAR 2003 [Lucas et al., 2003] and SVT [Wang and Belongie, 2010] word recognition test sets. The ICDAR 2003 test set consists of images of perfectly cropped words. The SVT test set is a harder benchmark with more loosely cropped words and case-wise incorrect labellings. Similar to [Wang et al., 2011, Wang et al., 2012], all of our tests are on words that do not contain non-alphanumeric characters and that are of length greater than 2, leaving 860 and 647 test words for the ICDAR 2003 and SVT datasets respectively.

For the ICDAR 2003 test set, we test the recognizer under three scenarios that vary by lexicon size: small, medium and large. In the small lexicon case, an image's lexicon contains the ground truth word in the image in addition to 50 distractor words provided by [Wang et al., 2011]. In the medium lexicon case, the lexicon contains all the words in the test set. For the large lexicon case, we use a spell-checking dictionary (available at http://wordlist.sourceforge.net/) that contains almost 50,000 words, in addition to all the words in the test set. We call these scenarios W-S, W-M and W-L respectively. As for the SVT test set, as in [Wang et al., 2011, Wang et al., 2012], we test the recognizer under a single setting in which, for every word, we use the distractor words provided with the dataset. Moreover, since the SVT dataset's lexicons contain only capitalized words, we collapse the classifier's output $p(c_k|v_i)$ by setting the probability of a character to the sum of its upper-case and lower-case probabilities.

We also test the system in two modes: in the first, the system takes constant time in lexicon size per query, while in the second we permit the query to post-process the result with a lexicon. In the first mode, we test the system with language models constructed from task-specific lexicons. In the second mode, we do not use a language model at all; instead, we take the resulting list of words from the cascade beam search algorithm and consider the recognition result to be the most likely resultant word that exists in the lexicon, or, in case none of the resulting words are in the lexicon, the word with the least edit distance to any word in the lexicon.
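The case-collapsing step described above can be sketched as follows; p_char is a hypothetical dictionary mapping characters to the recognizer's posterior probabilities.

def collapse_case(p_char):
    # sum upper- and lower-case probabilities into a single entry
    collapsed = {}
    for c, p in p_char.items():
        key = c.upper()
        collapsed[key] = collapsed.get(key, 0.0) + p
    return collapsed

# e.g., {'a': 0.3, 'A': 0.5, 'b': 0.2} -> {'A': 0.8, 'B': 0.2}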

Table 4.1: Word recognition accuracies (%) on the ICDAR 2003 and SVT datasets. The last two lines are from this work.

Work                       Method              W-S    W-M    W-L    SVT
[Wang et al., 2012]        CNNs                90.0   84.0   -      70.0
[Mishra et al., 2012]      CRFs                81.8   67.8   -      73.2
[Novikova et al., 2012]    CRFs                82.8   -      -      72.9
[Wang et al., 2011]        PS                  76.0   62.0   -      57.0
[Goel et al., 2013]        weighted DTW        89.7   -      -      77.3
[Bissacco et al., 2013]    over-segmentation   -      -      -      90.4
This work (5-gram LM)      HMM/Maxout          90.1   87.3   83.0   67.0
This work (edit distance)  HMM/Maxout          93.1   88.6   85.1   74.3

Table 4.1 compares our results on the benchmarks W-S, W-M, W-L and SVT to previously published results under the two modes specified above. (The work in [Bissacco et al., 2013] uses a private dataset of 2.2 million labelled characters to train their character recognizer; they also use 10^8 characters to train their language model. These datasets exceed by orders of magnitude the datasets used by other works, including ours.) All experiments were run with a beam width B = 100. Without the use of either a language model or a lexicon, the module reaches an accuracy of 55.6%. As shown in Table 4.1, our proposed algorithm outperforms previous state-of-the-art algorithms on the specified benchmarks. On the large lexicon benchmark, we could not find works that were directly comparable to ours. However, we note that when we increase the lexicon size 1000-fold, we get an accuracy of 85.1%, which compares favourably with the 78% achieved by [Novikova et al., 2012] when they increase their lexicon size 90-fold.

4.9 Speed-Accuracy Tradeoffs and the Effect of Language Models

4.9.1 Effect of Beam Width

Since the complexity of the cascade beam search algorithm is $O(KVB \log B + V \log V)$, we can trade off the accuracy of the algorithm against its speed through the parameter B. Figure 4.4 shows the effect of the beam width on recognition accuracy and on recognition speed on the W-S task. As shown in the figure, a small beam width does not lead to a great decrease in accuracy, yet permits a great increase in recognition speed, making the word recognition module almost 15 times faster.


Figure 4.4: Effect of beam width. Beam width vs. accuracy (%) and recognition speed (seconds per query) on the ICDAR 2003 word recognition dataset under the small lexicon scenario.

4.9.2 Effect of Language Model Order

As noted earlier, the Cascade Beam Search algorithm also allows for the integration of higher-order language models directly in the inference stage. This is most helpful in the case of very large lexicons, since the inference process takes constant time in lexicon size after the initial stage of encoding the lexicon by its n-grams. Figure 4.5 depicts how accuracy changes with the language model's order for different lexicon sizes. The Small, Medium and Large curves correspond to using the Small, Medium and Large lexicons specified in Section 4.8.


Figure 4.5: Effect of language model order. Language model order vs. accuracy by lexicon size on the ICDAR 2003 test set with beam width B = 100. Note that the Small, Medium and Large curves are tested on case-sensitive words while Large* is on case-insensitive words.

The Large* curve corresponds to using the same large lexicon but without adding the ground truth words to the lexicon; this is the only scenario run on case-insensitive words. The highest accuracy reached under the Large* scenario is 67.0%. It is notable that in the Large* scenario, higher orders of language models cause overfitting and thereby reduce recognition accuracy. This overfitting is probably due to the fact that many words in the test set were not in the lexicon, and with a higher-order language model, words that looked similar to those in the test set were chosen instead.

4.10 Discussion

This chapter presented a detailed account of creating a fast and accurate word recognizer. Our recognizer leverages techniques from speech recognition, particularly a hybrid HMM model combined with a state-of-the-art deep neural network, to outperform previous works on this task. Concretely, we use the hybrid model to sequence words into their character constituents, then we construct a graph on possible segmentations, on which we apply a variant of beam search. Our beam search variant allows fast and accurate recognition, as recognition speed is constant in lexicon size: we are able to compile the n-gram model before seeing the test words. We present two methods for including a lexicon in our recognizer: 1) a nearest-neighbor type approach on an edit distance metric, incurring an additional cost linear in the lexicon size; 2) inclusion of the lexicon's n-grams in the beam search to bias the search towards sequences that are more probable according to the language model. We compare our work to previous works on the ICDAR 2003 and SVT datasets; our work outperforms all other works on ICDAR 2003, and most others on the SVT dataset.

There are many approaches for improving the performance of the word recognizer. The simplest is adding more data, as the size of the training data currently used remains relatively small. Another possible improvement is the use of learned edit distances [Ristad and Yianilos, 1998, McCallum et al., 2012] instead of vanilla ones. Beyond that, most of the loss in accuracy comes from segmentations created by the hybrid HMM model. Designing a neural net that factors context information into the hybrid HMM while computing posterior probabilities should help reduce that loss in accuracy. A slightly different approach, mapping directly from images to words without sequencing or beam search, might perform better, especially in situations with more data and manageable lexicon sizes (currently on the order of tens of thousands of words). Apart from that, further incorporation of techniques from speech recognition, like Long Short-Term Memories (LSTMs) [Hochreiter and Schmidhuber, 1997], may also lead to improved performance.

Chapter Five

Text Detection and End-to-End System

This chapter shows how the previous modules can be fitted into an end-to-end text recognition pipeline which, given an image, produces bounding boxes on areas containing textual information as well as transcriptions of the text therein. We begin with a formalization of the text detection and end-to-end text recognition problems in section 5.1; we then proceed to present related works on text detection and, more generally, end-to-end systems, in section 5.2.

5.1 Problem Definition

Formally, the text detection problem is defined as follows: given an image $x \in X_{all}$, where $X_{all}$ is the set of all images, create a function $f : X_{all} \to B^*$, where $B$ is the set of all rectangles defined within the image boundaries. In light of this definition, the end-to-end problem can be defined as creating a function $f : X_{all} \to (B \times Y^*)^*$, where $Y$ is the set of characters in some alphabet. The end-to-end system combines information from the text detection stage with recognition results from a text recognizer to produce recognized text.

5.2 Related Works

In this section, we will provide an overview of related works in the sub-fields of text detection and end-to-end pipelines.

5.2.1 Text Detection

Text detection is defined such that, given a natural image, the goal is to output bounding boxes on all words in the image. Abstractly speaking, the problem is an instance of the object detection problem, followed by segmenting text regions into their constituent words. Previous works investigated different approaches for text detection, typically trading off precision, recall, training time, and the time consumed manually designing features. Pre-trained CNNs [Coates et al., 2011, Wang et al., 2012] applied in a multi-scale sliding window fashion are highly accurate but very time consuming. Viola-Jones style classifiers remedy the slowness of CNNs, but have long training times and require manually engineered features [Hanif et al., 2008, Chen and Yuille, 2011]. Alternative methods that cleverly exploit the nature of text, such as Maximally Stable Extremal Regions (MSERs) [Matas et al., 2004] and the Stroke Width Transform [Epshtein et al., 2010], generally have lower accuracy but are fast to compute. Such methods were used successfully to detect text in [Neumann and Matas, 2011, Chen et al., 2011, Wang et al., 2011].

Little focus has been given to end-to-end recognition systems in the literature; most works focus on each part alone. While this approach is a valid first step, the fact that the subproblems intertwine and interact makes presenting end-to-end systems and optimizing them the logical second step. To the best of our knowledge, the works that present end-to-end systems are [Wang et al., 2011, Neumann and Matas, 2011, Wang et al., 2012].

5.2.2 End-to-End Pipelines

There are several ways to structure end-to-end pipelines. The simplest is a feed-forward structure where modules assume results from previous modules to be correct and deterministic; in such frameworks, pruning negative results is done by every module individually. A more complicated structure is a hypothesis verification structure, in which pruning is postponed until the last stage of the pipeline, allowing for pruning on multiple criteria. A third, most involved, structure is a closed loop structure, where hypotheses are continuously refined. This work conforms with other major works on end-to-end systems in adopting a hypothesis verification structure.

The works of [Wang et al., 2011, Neumann and Matas, 2011] combined techniques from computer vision and image processing with tools for character recognition to build the end-to-end pipeline. In contrast, [Wang et al., 2012] built their entire system using deep, pretrained, convolutional neural nets. These systems trade off many qualities, the most important of which are speed and accuracy. The systems of [Wang et al., 2011, Neumann and Matas, 2011] were less accurate but much faster than the work of [Wang et al., 2012]. There are two main factors behind this slowdown. The first is having a word recognition method that scales linearly in lexicon size, where inference has to be done for every word in the lexicon. The second is using convolutional neural nets for the text detection part. While CNNs offer very high accuracies on detection tasks, they tend to be very time consuming: the CNN has to be applied to every pixel of the image at multiple scales. Such application is very costly and can be done in reasonable time only with classifiers specifically designed to minimize this time, such as Viola-Jones style classifiers [Viola and Jones, 2001b].

5.3 Method

To extract text locations from an image, we start by extracting possible text candidates using Maximally Stable Extremal Regions (MSERs). MSERs are defined as regions in the image that are either maxima or minima of image intensity with respect to their surroundings. While being highly imprecise text detectors, they can be computed very quickly [Nistér and Stewénius, 2008]. The use of MSERs allows us to sidestep the enormous time penalty incurred by applying a costly recognizer at multiple scales of the image as in [Wang et al., 2012], making our system much more efficient.

Since MSERs ideally correspond to character regions, we form candidate line boxes by clustering the character candidates with DBSCAN [Ester et al., 1996] using multiple distances to obtain candidate line-level bounding boxes. After we obtain the line-level bounding boxes, we segment these lines using the Line-to-Word Hybrid HMM/Maxout model, trained to segment lines into words on the ICDAR 2003 scene training set. After this, we threshold the resulting word bounding boxes using the Word Detection Maxout, a four-layer convolutional Maxout network with the same architecture as the one used in Sec. 4.4, trained on text/non-text images extracted from the ICDAR 2003 scene training dataset. We also threshold words on the cost_v score produced by the word recognition module.
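A sketch of this detection front end using OpenCV's MSER implementation and scikit-learn's DBSCAN; the clustering here uses only the distance between box centres, whereas the actual system clusters with multiple distances, and the eps/min_samples values are illustrative rather than the thesis's settings.

import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def candidate_line_boxes(gray, eps=25.0, min_samples=2):
    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)   # each box is (x, y, w, h)
    if len(bboxes) == 0:
        return []
    centres = np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in bboxes])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(centres).labels_
    lines = []
    for k in set(labels) - {-1}:           # -1 marks DBSCAN noise
        cluster = bboxes[labels == k]
        x0, y0 = cluster[:, 0].min(), cluster[:, 1].min()
        x1 = (cluster[:, 0] + cluster[:, 2]).max()
        y1 = (cluster[:, 1] + cluster[:, 3]).max()
        lines.append((x0, y0, x1, y1))
    return lines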

Figure 5.1: End-to-end text recognition pipeline. Pentagons represent learned modules. The word recognition module shown here represents the full system from Figure 4.1.

We follow this pipeline with non-max suppression (NMS) [Neubeck and Van Gool, 2006] on word boxes that overlap by 30% of the area of their bounding box, according to the visual cost cost_v from the word recognition module (computed in Algorithm 2). Non-max suppression operates by suppressing, among overlapping predictions, all but the box with the highest visual cost.
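A greedy sketch of this suppression step, with boxes as (x0, y0, x1, y1) tuples and scores being the visual costs; the 30% overlap rule above is the default threshold.

def overlap_ratio(a, b):
    # intersection area divided by the area of box a
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area_a if area_a > 0 else 0.0

def non_max_suppression(boxes, scores, overlap_thresh=0.3):
    # visit boxes from highest to lowest visual cost; keep a box only if
    # it does not overlap an already-kept box beyond the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(overlap_ratio(boxes[i], boxes[j]) <= overlap_thresh
               for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]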

5.4 End-to-End Results

We tested the above system on both the ICDAR 2003 and SVT end-to-end scene text recognition test sets; each of the datasets contains 249 scene images. For the ICDAR 2003 dataset, we conduct tests under five scenarios. In the first three, the lexicons consist of {5, 20, 50} distractor words per image in addition to the ground truth words for that image; in the fourth scenario, all the test words are included in the lexicon; and in the fifth scenario, we use the same large lexicon we used to test the word recognition module (Sec. 4.8). We label these scenarios I-5, I-20, I-50, I-Full and I-Large respectively. The lexicons were provided by the authors of [Wang et al., 2011]. As for the SVT dataset, we conduct the tests using the lexicons provided with the dataset.

Figure 5.2: Sample end-to-end results. Samples from the end-to-end results; the purple boxes represent the ground truth and the green boxes represent the predictions.

We test the end-to-end system using the standard precision/recall metrics under the benchmarks specified in [Lucas et al., 2003], where a prediction is considered a hit when the area of the overlap between the predicted box and the target box is greater than 50% of the bounding box area and the predicted text matches the ground truth exactly. Table 5.1 compares our results to other results in the field. Despite our use of a simple, low-accuracy method like MSERs to extract possible text regions, our end-to-end system is able to outperform previous state-of-the-art end-to-end systems and produce reasonable results for large lexicons. Figure 5.3 shows the precision/recall curves for all the tasks on the ICDAR 2003 dataset and Figure 5.2 shows a few sample outputs from our system.

To increase the F-measures of the end-to-end system, we should seek to boost recall. As pointed out in [Neumann and Matas, 2011], MSERs do not offer high recall for character location extraction. The alternative of using a time-consuming but highly accurate classifier as in [Wang et al., 2012] is not practical if the end-to-end system is to work in real time. In our opinion, a promising solution would be to develop a Viola-Jones style cascade [Viola and Jones, 2001a] coupled with feature learning. Such an approach could offer a fast, accurate, easy-to-train and feature-engineering-free text detector that would lead to increased recall.
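The hit criterion used above can be written down directly; boxes are (x0, y0, x1, y1) and, as an assumption of this sketch, the 50% threshold is taken relative to the ground-truth box's area.

def is_hit(pred_box, pred_text, gt_box, gt_text):
    ix = max(0.0, min(pred_box[2], gt_box[2]) - max(pred_box[0], gt_box[0]))
    iy = max(0.0, min(pred_box[3], gt_box[3]) - max(pred_box[1], gt_box[1]))
    overlap = ix * iy
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    # hit: overlap exceeds 50% of the box area and the text matches exactly
    return overlap > 0.5 * gt_area and pred_text == gt_text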

Table 5.1: End-to-end F-measures on the ICDAR 2003 and SVT datasets.

Work                  I-5   I-20   I-50   I-Full   I-Large   SVT
[Wang et al., 2011]   72    70     68     51       -         38
[Wang et al., 2012]   76    74     72     67       -         46
This work             80    79     77     70       63        48


Figure 5.3: Precision/recall curves for the end-to-end system. Precision/recall curves for the end-to-end system on the ICDAR 2003 dataset under different lexicon sizes.

5.5 On Training an End-to-End System via Gradient Descent

In [LeCun et al., 1998], the authors propose a complete method for end-to-end training of a recognition system. Abstractly, their method is quite similar to ours, and to most other recognition pipelines, involving feature extraction, segmentation and beam search decoding. The key high-level difference is in training the components together as opposed to training them independently.

Intuitively, the merit of end-to-end training is to train the signal-specific model (in our case, the hybrid model) with the knowledge that its outputs will be fed into a language model and a particular decoding algorithm. This information should help the signal model tailor its outputs as inputs to the decoding algorithm, which may lead to improved end-to-end performance. While this sounds plausible, the downside of such a training procedure could be having the language model overfit to the set of targets available in the training set, since the set of training signals available tends to be much smaller than the set of words available to draw a language model from. It should be noted, however, that for the system presented in [LeCun et al., 1998], end-to-end training does indeed help (see Figure 32 of that work).

For our particular system, the problem of detecting and recognizing words directly from scene images is less straightforward. While one could conceptually train a system similar to [LeCun et al., 1998], it appears that due to the size of the input data, such a training procedure would take too long to converge, as convolutional networks would need to be swept over input images, which tend to have high resolutions. However, it is conceivable that such a training procedure could be implemented with more resources. Whether it would lead to improved generalization remains to be seen.

Chapter Six

Discussion

In this chapter, we discuss the contributions presented in this thesis, the limitations of the work, and possible avenues for future work.

6.1 Contributions

Engineering an end-to-end system is a significant effort; this thesis presents a detailed account of designing such a system. We start with the character recognition problem, for which we propose the application of a convolutional Maxout network, leading to a state-of-the-art result compared to pre-trained CNNs. We then proceed to the word recognition problem, for which we propose a method inspired by speech recognition: we segment words into constituent regions and apply a variant of beam search on the induced graph, factoring in knowledge of the word's appearance and a character-level language model for the particular language being recognized. Our approach is both highly accurate and scalable to language models with tens of thousands of words (and possibly more). Then we proceed to the complete end-to-end problem, where we propose a hypothesis verification pipeline with a novel text detection stage that leads to improved end-to-end recognition rates.

6.2 Limitations

While our work surmounts some of the obstacles previously faced in this sub-field, many obstacles remain. The first are data-related obstacles, as training and testing set sizes are relatively small (roughly one thousand words for training and one thousand for testing). It is therefore likely that current leading methods, including ours, overfit on the test set. Beyond that, several problem-specific obstacles also exist. On the character recognition front, the main obstacle is inherently related to the character-confusion problem, for which, on the character level, no clear solution is available. On the word recognition front, scalability to large lexicons does not seem to be an issue; accuracy-wise, however, our method is still not accurate enough for industrial use. On the end-to-end recognition front, the main challenge is detection speed. As pointed out in section 5.4, while our method is reasonably fast for offline end-to-end recognition, requiring a few seconds per image, it is still not fast enough for real-time use. Currently, the main bottleneck remains the convolutional part of the neural network, as it plays a role in both detection and recognition. However, the system's overall running speed could perhaps be improved through a more optimized implementation or through changing the detection models.

6.3 Future Work

As this thesis is mainly concerned with the design of an end-to-end system, possible improvements exist for every module. Broadly speaking, adding more data and scaling up the system is the easiest way to improve performance. Beyond that, investigating other models from speech recognition, like Recurrent Neural Networks or Long Short-Term Memories, seems a promising alternative to convolutional networks for improving the recognizer's accuracy. That said, the main challenge in designing such systems remains the detection phase, for which no real-time, highly accurate solution exists.

We believe this work could be utilized in various scenarios: for one, in aiding self-driving vehicles in recognizing street names to improve navigation; also, in allowing advertisers to offer customized service when a customer sees a certain word or trademark through a “smart” device. On a higher level, it is likely that pipelines similar to ours can be utilized for handwriting and speech recognition systems. Whether their performance would surpass current methods remains to be seen.

Bibliography

[Barron, 1993] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. Information Theory, IEEE Transactions on, 39(3):930–945.

[Bengio et al., 1992] Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural network-hidden markov model hybrid. Neural Networks, IEEE Transactions on, 3(2):252–259.

[Bengio et al., 1995] Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A nn/hmm hybrid for on-line handwriting recognition. Neural Computation, 7(6):1289–1303.

[Bergstra et al., 2010] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

[Bishop, 1995] Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford university press.

[Bissacco et al., 2013] Bissacco, A., Cummins, M., Netzer, Y., and Neven, H. (2013). Photoocr: Reading text in uncontrolled conditions.


[Blitzer et al., 2005] Blitzer, J., Weinberger, K. Q., and Saul, L. K. (2005). Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pages 1473–1480.

[Bourlard and Morgan, 1998] Bourlard, H. and Morgan, N. (1998). Hybrid hmm/ann systems for speech recognition: Overview and new research directions. In Adaptive Processing of Sequences and Data Structures, pages 389–417. Springer.

[Bousquet et al., 2004] Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to statistical learning theory. pages 169–207.

[Bruna and Mallat, 2013] Bruna, J. and Mallat, S. (2013). Invariant scattering convolution networks. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1872–1886.

[Chen et al., 2011] Chen, H., Tsai, S. S., Schroth, G., Chen, D. M., Grzeszczuk, R., and Girod, B. (2011). Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 2609–2612. IEEE.

[Chen and Yuille, 2011] Chen, X. and Yuille, A. (2011). Adaboost learning for detecting and reading text in city scenes.

[Ciresan et al., 2012] Ciresan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE.

[Coates et al., 2011] Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., Wu, D. J., and Ng, A. Y. (2011). Text detection and character recognition in scene images with unsupervised feature learning. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 440–445. IEEE.

[Dahl et al., 2012] Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42.

[de Campos et al., 2009] de Campos, T. E., Babu, B. R., and Varma, M. (2009). Character recognition in natural images.

[Decoste and Schölkopf, 2002] Decoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1-3):161–190.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.

[Epshtein et al., 2010] Epshtein, B., Ofek, E., and Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE.

[Ester et al., 1996] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise.

[Goel et al., 2013] Goel, V., Mishra, A., Alahari, K., and Jawahar, C. V. (2013). Whole is greater than sum of parts: Recognizing scene text words. In Proceedings of International Conference on Document Analysis and Recognition.

[Goodfellow et al., 2013a] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013a). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[Goodfellow et al., 2013b] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout networks. In ICML.

[Graves et al., 2013] Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. arXiv preprint arXiv:1303.5778.

[Grosicki and El Abed, 2009] Grosicki, E. and El Abed, H. (2009). Icdar 2009 handwriting recognition competition. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages 1398–1402. IEEE.

[Hanif et al., 2008] Hanif, S. M., Prevost, L., and Negri, P. A. (2008). A cascade detector for text detection in natural scene images. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE.

[Hinton et al., 2012a] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012a). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97.

[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554.

[Hinton et al., 2012b] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

[Hornik et al., 1989] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366.

[Hu et al., 1996] Hu, J., Brown, M. K., and Turin, W. (1996). Hmm based online handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 18(10):1039–1045.

[Hubel and Wiesel, 1968] Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195:215–243.

[Ingber, 1993] Ingber, L. (1993). Simulated annealing: Practice versus theory. Mathematical and computer modelling, 18(11):29–57.

[Kégl and Busa-Fekete, 2009] Kégl, B. and Busa-Fekete, R. (2009). Boosting products of base classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 497–504. ACM.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114.

[LeCun, 1986] LeCun, Y. (1986). Learning processes in an asymmetric threshold network. In Bienenstock, E., Fogelman-Soulié, F., and Weisbuch, G., editors, Disordered systems and biological organization, pages 233–240, Les Houches, France. Springer-Verlag.

[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

[Liwicki et al., ] Liwicki, M., Graves, A., Bunke, H., and Schmidhuber, J. A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks.

[Lucas et al., 2003] Lucas, S. M., Panaretos, A., Sosa, L., Tang, A., Wong, S., and Young, R. (2003). Icdar 2003 robust reading competitions.

[Matas et al., 2004] Matas, J., Chum, O., Urban, M., and Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and vision computing, 22(10):761–767.

[McCallum et al., 2012] McCallum, A., Bellare, K., and Pereira, F. (2012). A conditional random field for discriminatively-trained finite-state string edit distance. arXiv preprint arXiv:1207.1406.

[Mishra et al., 2012] Mishra, A., Alahari, K., Jawahar, C., et al. (2012). Scene text recognition using higher order language priors.

[Mohri, 1997] Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational linguistics, 23(2):269–311.

[Montana and Davis, ] Montana, D. J. and Davis, L. Training feedforward neural networks using genetic algorithms.

[Morgan and Bourlard, 1995] Morgan, N. and Bourlard, H. (1995). Continuous speech recognition. Signal Processing Magazine, IEEE, 12(3):24–42.

[Nair and Hinton, 2010] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

[Netzer et al., ] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning.

[Neubeck and Van Gool, 2006] Neubeck, A. and Van Gool, L. (2006). Efficient non-maximum suppression. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pages 850–855. IEEE.

[Neumann and Matas, 2011] Neumann, L. and Matas, J. (2011). A method for text localization and recognition in real-world images. In Computer Vision–ACCV 2010, pages 770–783. Springer.

[Nistér and Stewénius, 2008] Nistér, D. and Stewénius, H. (2008). Linear time maximally stable extremal regions. In Computer Vision–ECCV 2008, pages 183–196. Springer.

[Novikova et al., 2012] Novikova, T., Barinova, O., Kohli, P., and Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In Computer Vision–ECCV 2012, pages 752–765. Springer.

[Rabiner, 1989] Rabiner, L. R. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

[Renals et al., 1994] Renals, S., Morgan, N., Bourlard, H., Cohen, M., and Franco, H. (1994). Connectionist probability estimators in hmm speech recognition. Speech and Audio Processing, IEEE Transactions on, 2(1):161–174.

[Ristad and Yianilos, 1998] Ristad, E. S. and Yianilos, P. N. (1998). Learning string-edit distance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(5):522–532.

[Rumelhart et al., 1985] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, DTIC Document.

[Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

[Russell and Norvig, ] Russell, S. J. and Norvig, P. Artificial intelligence: a modern approach, volume 74. Prentice hall Englewood Cliffs.

[Shi et al., 2013] Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S., and Zhang, Z. (2013). Scene text recognition using part-based tree-structured character detection.

[Valtchev et al., 1997] Valtchev, V., Odell, J., Woodland, P. C., and Young, S. J. (1997). Mmie training of large vocabulary recognition systems. Speech Communication, 22(4):303–314.

[Vapnik, 1998] Vapnik, V. N. (1998). Statistical learning theory.

[Vertanen, ] Vertanen, K. An overview of discriminative training for speech recognition.

[Viola and Jones, 2001a] Viola, P. and Jones, M. (2001a). Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511. IEEE.

[Viola and Jones, 2001b] Viola, P. and Jones, M. (2001b). Robust real-time ob- ject detection.

[Wan et al., 2013] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proc. International Conference on Machine Learning (ICML'13).

[Wang et al., 2011] Wang, K., Babenko, B., and Belongie, S. (2011). End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457–1464. IEEE.

[Wang and Belongie, 2010] Wang, K. and Belongie, S. (2010). Word spotting in the wild. In Computer Vision–ECCV 2010, pages 591–604. Springer.

[Wang et al., 2012] Wang, T., Wu, D. J., Coates, A., and Ng, A. Y. (2012). End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308. IEEE.

[Weston et al., 2012] Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer.

[Zeiler and Fergus, 2013] Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.