End-To-End Text Recognition with Hybrid HMM-Maxout Models
END-TO-END TEXT RECOGNITION WITH DEEP LEARNING ARCHITECTURES

Ouais Alsharif
Master of Science
School of Computer Science
McGill University, Montreal
October 23, 2014

A thesis submitted to McGill University in partial fulfilment of the requirements of the degree of Master of Science.

© Ouais Alsharif; October 23, 2014.

Acknowledgements

Foremost, I am grateful to my advisor, Joelle Pineau. Joelle balanced guiding me on the one hand and giving me the freedom to pursue my own ideas and the means to do so on the other. Her guidance and support were very helpful throughout my master's. Our weekly meetings were the highlights of my week. I consider myself lucky to be one of her students.

I am also grateful to Doina Precup, who recommended me to Joelle in the first place and invited me to join the Reasoning and Learning lab’s meetings. Doina’s support was of immense help as I transitioned into McGill and now that I am at the crossroads of several paths.

My McGill experience wouldn’t have been complete had it not been for Luc Devroye. Luc’s classes are the best thing one could do at 8:30 in the morning. His teaching style, passion and knowledge are unparalleled. I hope one day to become a researcher of his calibre.

Graduate school is nothing without friends. In a randomly generated order: Mahdi Milani Fard, Pierre-luc Bacon, Gheorghe Comanici, Neil Girdhar, Phillip Bachman, William Hamilton, Clement Gehring, Mike Ounsworth, Jimmy Li, David Cortes, Javona Whitebear, Jinxu Jia, Angus Leigh, Andrew Sutcliffe and Martin Gerdzhev. This would not have been the same without you.

I am most grateful to my family: Obada, Ubai, Mom, Dad. There are no words to describe how your love and unconditional support affected my life. Thank you all.

Abstract

Accurate text recognition in documents was one of the milestones of machine learning and computer vision techniques.
However, despite this early success, general text recognition remains an unsolved problem. Since textual information is an artificial signal, designed to be simple to draw, it can be easily confused with other simple signals that occur naturally. Moreover, unlike in document text recognition, assumptions about the way text appears should be kept to a minimum in the general setting, creating the need for more robust detectors and recognizers.

From a practical point of view, engineering an end-to-end system is an elaborate effort. It involves designing multiple modules, from text detection to character recognition, and integrating these modules in a way that allows for scalability, modularity and high accuracy. That is why most previous works focused only on parts of the pipeline instead of the whole end-to-end system. Moreover, the most accurate previous works traded scalability for accuracy, making them infeasible to use in real-world settings.

This thesis attempts to address this issue by showing how such an end-to-end system can be constructed with the high-level goals of balancing simplicity, accuracy and scalability, drawing on connections to speech and handwriting recognition. Specifically, this thesis shows how the end-to-end problem can be dissected into three main sub-problems: character recognition, word recognition and text detection. Then, novel solutions to each problem are proposed, and a method for integrating the three modules is shown. Technically, the system leverages a recent variant of convolutional neural networks that uses dropout and a max activation function. It also makes use of hybrid HMM models, which were shown to be useful in speech recognition problems. Empirically, the system’s performance is measured against previous systems in terms of accuracy and scalability. Results show that the proposed system outperforms previous state-of-the-art systems on benchmark datasets on all sub-problems.
It also addresses the scalability issues in lexicon size that previously proposed systems suffer from.

Résumé

Accurate text recognition in documents has been a cornerstone of machine learning and computer vision. However, despite these early successes, the general problem of text recognition remains unsolved. Because textual information is an artificial signal designed to be easy to draw, it can easily be confused with other signals of the same kind that already exist in the environment. Moreover, unlike text recognition in documents, assumptions about how text should appear must be kept to a minimum in this more general scenario. More robust detectors and recognizers must therefore be developed.

From a practical point of view, building an end-to-end recognition system requires considerable effort. One must not only design multiple modules for text detection and character recognition, but also integrate them in a way that allows for scalability, modularity and accuracy. For this reason, previous efforts were devoted only to the constituent parts of this pipeline rather than to the complete end-to-end system. Moreover, those approaches, having neglected scalability in favour of accuracy, cannot be used in the real world.

This thesis attempts to solve these problems and shows how an end-to-end system can be designed to meet the ideal of simplicity without compromising accuracy and scalability. At the same time, this thesis attempts to establish connections with speech and handwriting recognition. More precisely, this thesis shows how the end-to-end problem can be decomposed into three main sub-problems: character recognition, word recognition and text detection. Novel solutions to each of these problems are presented independently and then combined into a single system. Technically, the system exploits a recent variant of convolutional neural networks that uses the dropout technique and a max-type activation function. A hybrid HMM model that has proven useful in speech recognition is also used for our problem. Empirically, the system’s performance is evaluated against previous systems according to the criteria of accuracy and scalability. The results demonstrate that the proposed system outperforms other state-of-the-art systems when evaluated on all sub-problems of the datasets. Finally, the scalability problem is solved for the lexicons whose size limited previous systems.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Prologue
  1.2 Problem Definition
  1.3 Contributions
  1.4 Methodology
  1.5 Organization

2 Technical Background
  2.1 Supervised Machine Learning
  2.2 Neural Networks
    2.2.1 Training neural networks
    2.2.2 Pros and Cons of Neural Networks
  2.3 Convolutional Neural Networks
  2.4 Dropout
  2.5 Maxout Networks
  2.6 Hidden Markov Models
  2.7 Hybrid HMM models
  2.8 Maximally Stable Extremal Regions (MSERs)

3 Character Recognition
  3.1 Problem Definition
  3.2 Related Works
  3.3 Dataset
  3.4 Method
  3.5 Results
  3.6 Discussion

4 Word Recognition
  4.1 Problem Definition
  4.2 Related Works
  4.3 Method Outline
  4.4 Hybrid HMM Maxout Model
  4.5 Constructing the Cascade
  4.6 Word Inference
  4.7 Implementation details
  4.8 Dataset and Results
  4.9 Speed-accuracy tradeoffs and language models’ effect
    4.9.1 Effect of Beam Width
    4.9.2 Effect of Language Model Order
  4.10 Discussion

5 Text Detection and End-to-End system
  5.1 Problem Definition
  5.2 Related Works
    5.2.1 Text Detection
    5.2.2 End-to-End Pipelines
  5.3 Method
  5.4 End-to-End Results
  5.5 On Training an End-to-end System via Gradient Descent

6 Discussion
  6.1 Contributions
  6.2 Limitations
  6.3 Future Work

Bibliography

List of Figures

1.1 End-to-end pipeline overview
2.1 A multi-layer perceptron
2.2 The behaviour of popular neural activation functions
2.3 Effect of convolution in convolutional networks
2.4 Effect of pooling in convolutional networks
3.1 Character