END-TO-END TEXT RECOGNITION WITH DEEP LEARNING ARCHITECTURES

Ouais Alsharif

Master of Science School of Computer Science McGill University, Montreal

October 23, 2014

A thesis submitted to McGill University in partial fulfilment of the requirements of the degree of Master of Science. © Ouais Alsharif; October 23, 2014.


Acknowledgements

Foremost, I am grateful to my advisor, Joelle Pineau. Joelle balanced guiding me on the one hand and giving me the freedom to pursue my own ideas, and the means to do so, on the other. Her guidance and support were very helpful throughout my masters. Our weekly meetings were the highlights of my week. I consider myself lucky to be one of her students. I am also grateful to Doina Precup, who recommended me to Joelle in the first place and invited me to join the Reasoning and Learning lab's meetings. Doina's support was of immense help as I transitioned into McGill, and now that I am at the crossroads of several paths. My McGill experience wouldn't have been complete had it not been for Luc Devroye. Luc's classes are the best thing one could do at 8:30 in the morning. His teaching style, passion and knowledge are unparalleled. I hope one day to become a researcher of his calibre. Graduate school is nothing without friends. In a randomly generated order: Mahdi Milani Fard, Pierre-luc Bacon, Gheorghe Comanici, Neil Girdhar, Phillip Bachman, William Hamilton, Clement Gehring, Mike Ounsworth, Jimmy Li, David Cortes, Javona Whitebear, Jinxu Jia, Angus Leigh, Andrew Sutcliffe and Martin Gerdzhev. This would not have been the same without you. I am most grateful to my family: Obada, Ubai, Mom, Dad. There are no words to describe how your love and unconditional support affected my life. Thank you all.

Abstract

Accurate text recognition in documents was one of the early milestones of machine learning and computer vision techniques. However, despite this early success, general text recognition still remains an unsolved problem. Since textual information is an artificial signal, designed to be simple to draw, it can be easily confused with other simple signals that naturally exist. Moreover, unlike in document text recognition, assumptions on the way text exists should be kept to a minimum in the general setting, creating the need for more robust detectors and recognizers. From a practical point of view, engineering an end-to-end system is an elaborate effort. It involves designing multiple modules, from text detection to character recognition, and integrating these modules in a way that allows for scalability, modularity and high accuracy. That is why most previous works focused only on parts of the pipeline instead of the whole end-to-end system. Moreover, the most accurate previous works traded off scalability for accuracy, making them infeasible to use in real-world settings.

This thesis attempts to address this issue by showing how such an end-to-end system can be constructed with the high-level goals of balancing simplicity, accuracy and scalability, drawing on connections to speech and handwriting recognition. Specifically, this thesis shows how the end-to-end problem can be dissected into three main sub-problems: character recognition, word recognition and text detection. Then, novel solutions to each problem are proposed, and a method for integrating the three modules together is shown. Technically, the system leverages a recent variant of convolutional neural networks that uses dropout and a max activation function. It also makes use of hybrid HMM models, which were shown to be useful in speech recognition problems. Empirically, the system's performance is measured in comparison to previous systems in terms of accuracy and scalability. Results show the proposed system outperforms previous state-of-the-art systems on benchmark datasets on all sub-problems. It also addresses the scalability issues in lexicon size that previously proposed systems suffer from.

Résumé

Accurate recognition of text in documents was a cornerstone of machine learning and computer vision. However, despite these early successes, the general text recognition problem remains unsolved. Since textual information is an artificial signal designed to be easy to draw, it can easily be confused with other signals of the same kind already present in the environment. Moreover, unlike text recognition in a document, assumptions about the way text appears must be kept to a minimum in this more general scenario, which calls for more robust detectors and recognizers.

From a practical point of view, building an end-to-end recognition system demands considerable effort. One must not only design multiple modules for text detection and character recognition, but also integrate them in a way that allows scalability, modularity and accuracy. It is for this reason that previous efforts were devoted only to the constituent parts of this pipeline rather than to the complete end-to-end system. Moreover, those approaches that neglected scalability in favour of accuracy cannot be used in the real world.

This thesis attempts to solve these problems and shows how an end-to-end system can be designed to meet the ideal of simplicity without compromising accuracy or scalability. At the same time, this thesis draws connections to speech and handwriting recognition. More precisely, it shows how the end-to-end problem can be decomposed into three main sub-problems: character recognition, word recognition and text detection. Novel solutions to each of these problems are presented independently and then combined into a single system. Technically speaking, the system exploits a recent variant of convolutional neural networks using the "dropout" technique and a "max" activation function. A hybrid HMM model, which has proven useful in speech recognition, is also used for our problem. From an empirical point of view, the system's performance is evaluated against previous systems on the criteria of accuracy and scalability. The results show that the proposed system outperforms other state-of-the-art systems when evaluated on all sub-problems of the benchmark datasets. Finally, the scalability problem is resolved for lexicons whose size limited previous systems.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Prologue
  1.2 Problem Definition
  1.3 Contributions
  1.4 Methodology
  1.5 Organization

2 Technical Background
  2.1 Supervised Machine Learning
  2.2 Neural Networks
    2.2.1 Training neural networks
    2.2.2 Pros and Cons of Neural Networks
  2.3 Convolutional Neural Networks
  2.4 Dropout
  2.5 Maxout Networks
  2.6 Hidden Markov Models
  2.7 Hybrid HMM models
  2.8 Maximally Stable Extremal Regions (MSERs)

3 Character Recognition
  3.1 Problem Definition
  3.2 Related Works
  3.3 Dataset
  3.4 Method
  3.5 Results
  3.6 Discussion

4 Word Recognition
  4.1 Problem Definition
  4.2 Related Works
  4.3 Method Outline


  4.4 Hybrid HMM Maxout Model
  4.5 Constructing the Cascade
  4.6 Word Inference
  4.7 Implementation details
  4.8 Dataset and Results
  4.9 Speed-accuracy tradeoffs and language models' effect
    4.9.1 Effect of Beam Width
    4.9.2 Effect of Language Model Order
  4.10 Discussion

5 Text Detection and End-to-End system
  5.1 Problem Definition
  5.2 Related Works
    5.2.1 Text Detection
    5.2.2 End-to-End Pipelines
  5.3 Method
  5.4 End-to-End Results
  5.5 On Training an End-to-end System via Gradient Descent

6 Discussion
  6.1 Contributions
  6.2 Limitations
  6.3 Future Work

Bibliography

List of Figures

1.1 End-to-end pipeline overview

2.1 A multi-layer perceptron
2.2 The behaviour of popular neural activation functions
2.3 Effect of convolution in convolutional networks
2.4 Effect of pooling in convolutional networks

3.1 Character recognition confusion matrix

4.1 Word recognition pipeline
4.2 A hybrid HMM-Maxout model
4.3 A cascade induced graph for the word "JFC"
4.4 Effect of beam width
4.5 Effect of language model

5.1 End-to-end text recognition pipeline
5.2 Sample end-to-end results
5.3 Precision/recall curves for the end-to-end system

List of Tables

3.1 Character recognition accuracy on ICDAR 2003 test set. All methods use the same augmented training dataset.

4.1 Word recognition accuracies on ICDAR 2003 and SVT datasets. The last two lines are from this work.

5.1 End-to-end F-measures on the ICDAR 2003 and SVT datasets

Chapter One

Introduction

1.1 Prologue

The plethora of applications that text recognition has, from digitizing old documents to enhancing robotic navigation and planning, all point to why it is important to create robust, accurate and fast text detectors. In the past few years, significant progress was made on document and handwritten text recognition. However, detecting and recognizing text in unstructured environments, with as few assumptions as possible on the text's attributes (e.g., font, color, lighting, etc.), remains an elusive goal.

The difficulty of general text recognition can be attributed to many causes. Natural attributes, such as lighting, shadows, styles, fonts and backgrounds, affect the perception of textual information, making it hard to recognize in some cases even for humans. Instances of such difficult situations include words written using different fonts and different character sizes; words with missing characters, due to occlusion or wrong cropping; and words atop noisy backgrounds. The combination of these noise sources shifts the text recognition problem away from well-controlled document text recognition, and closer to speech recognition and handwriting recognition.

In earlier works on document-text recognition, designing a recognizer that worked well was a relatively simple task, even with simple preprocessing, because strong assumptions on fonts, colors and structure could be made. For instance, knowing that text was mostly black on a white background, formed in lines of specific height, and had attributes that were consistent among words and lines made the text recognizer's task easier. However, since this is not the case for text recognition in unstructured environments, more robust methods are needed. These methods would ideally be accurate, fast and simple. Loosely speaking, this problem of detecting textual information in an unstructured setting, separating it into lines and words, and recognizing those, is referred to as the end-to-end text recognition problem.

The end-to-end text recognition problem has historically, and arguably naturally, been decomposed into three main parts: text detection, character recognition and word recognition. The first problem is to identify text locations in a natural image. The second problem is to classify character images into their corresponding characters. The third problem is a sequencing problem: given an image of a word, output the most likely word corresponding to that image.

Technically, each of these sub-problems presents challenges in its own right. On the character level, the main challenge is to achieve high recognition accuracy. On the word level, the word recognizer needs to be robust, accurate, fast, and scalable with lexicon size. Finally, on the end-to-end level, the system needs to balance precision, recall, complexity and speed. Building a system that satisfies all these constraints while remaining relatively simple is why constructing such end-to-end systems is a challenge most previous works shied away from.

This thesis presents a detailed inspection into how to construct such an end-to-end system by reusing solutions to one problem within another. The overlapping nature of these problems, as character recognition is a part of word recognition and word recognition is a part of text detection, allows these sub-solutions to be exploited efficiently.

1.2 Problem Definition

In addressing the end-to-end problem, this thesis addresses three main sub-problems:

1. Character Recognition

2. Word Recognition

3. Text Detection

The first problem is a classification problem, namely: given an image x ∈ X_char and a set of labels Y = {1, ..., K}, create a function f : X_char → Y. The second problem is a sequencing problem, more precisely: given an image x ∈ X_word and a set of characters Y, create a function g : X_word → Y*, where * is the Kleene closure operator, i.e., Y* = ∪_{n∈N} Y^n. The third problem is the detection problem, where the goal is to create a function h : X_all → B*, where B is the set of all rectangles in an image. X_char, X_word and X_all are respectively the sets of character images, word images and all images.

In light of these definitions, the end-to-end problem is defined as creating a function e : X_all → (Y*)*.

1.3 Contributions

Abstractly, this thesis presents a way to construct an end-to-end recognition system. We connect the dots to other sub-fields, namely speech recognition and handwritten text recognition. We also show successful applications of multiple machine learning models.

More concretely, we focus on showing how to design an end-to-end system that balances accuracy, speed and relative simplicity. On a finer level, this thesis shows how to construct a highly accurate character recognizer; a fast, accurate and scalable word recognizer; and a relatively fast, highly accurate (in the F-measure sense) end-to-end recognizer.

The thesis presents dataset-specific results, showing how the system performs on the ICDAR 2003 [Lucas et al., 2003] and the SVT [Wang and Belongie, 2010] datasets. The proposed system outperforms most previous state-of-the-art systems at the time of writing. It also presents empirical results on the scalability of the system, when more time is allowed per query and when lexicons of different sizes are used.

1.4 Methodology

To achieve the goals specified above, the end-to-end problem is dissected into its natural sub-problems and new solutions are proposed for all sub-problems. More specifically, for the character recognition problem, a variant of the recently introduced deep convolutional Maxout network architecture [Goodfellow et al., 2013b] is proposed that allows for high accuracy with almost no preprocessing. Also, by drawing on connections to recent works in speech recognition [Hinton et al., 2012a], a method for sequencing words into characters using a hybrid HMM/Maxout architecture is proposed. The proposed model allows for simple integration of a lexicon's higher-order n-grams, resulting in a method that is fast, accurate and highly tunable, while taking constant time relative to lexicon size. These parts are then integrated in a novel end-to-end recognition system that utilises standard techniques from computer vision, like Maximally Stable Extremal Regions (MSERs) and DBSCAN, to achieve state-of-the-art F-measure on the ICDAR 2003 [Lucas et al., 2003] and SVT [Wang and Belongie, 2010] datasets. A depiction of the end-to-end pipeline is presented in Figure 1.1.

Figure 1.1: End-to-end pipeline overview

1.5 Organization

This thesis is organized as follows. Chapter two presents the technical background related to models used in the thesis. Chapter three presents a detailed construction of a general purpose character recognizer. Chapter four shows how this character recognizer can be used in a word recognizer that balances accuracy with speed. Chapter five shows how, with this word recognizer, a highly accurate end-to-end text recognition system can be created. Chapter six concludes with a discussion of the thesis, current open problems and future work.

Chapter Two

Technical Background

An end-to-end system is an elaborate effort. It involves multiple parts: some algorithmic parts are learned from data, and others are static image processing algorithms. This section provides the reader with the background needed to understand these building blocks and how they contribute to the system as a whole.

2.1 Supervised Machine Learning

Supervised machine learning concerns itself with estimating a mapping f : X → Y from a labelled dataset (called the training set) with n samples {(x_i, y_i)}_{i=1}^n. X is called the input domain and is usually a d-dimensional vector space X = R^d. Y is the output domain and is set to R for regression problems and to a set of K labels {1, ..., K} for classification problems. The hope underlying supervised learning is that by estimating f from a finite amount of data, f would be able to generalize to unseen data, providing us with labels for the new data. One particular quantity of interest in such estimations is the risk of f. The risk of a function denotes the expected error incurred by that function and is given as:

    R(f) = E_{(x,y)}[ 1_{f(x) ≠ y} ]

In practice we cannot measure the true risk, but rather an empirical estimate thereof:

    R_n(f) = (1/n) ∑_{i=1}^{n} 1_{f(x_i) ≠ y_i}    (2.1)

The empirical risk minimization principle [Vapnik, 1998] suggests that to create a function f that performs well on some unseen test set Z_test = {(x_i, y_i)}_{i=1}^m, we need to minimize the empirical risk on the training set Z_train = {(x_i, y_i)}_{i=1}^n [1]:

    f_n = argmin_{f ∈ F} R_n(f)

where F is the model space of interest. In order to control the search in F, we usually augment the above optimization problem with a regularizer on F, typically a norm ||f||. Then we minimize the regularized empirical risk:

    f_n = argmin_{f ∈ F} R_n(f) + λ||f||

where λ controls the strength of regularization. Controlling the search through regularization is necessary to prevent overfitting, a phenomenon in which a function f performs well on some training set, but does not generalize well onto the test set.

[1] This is because the true risk is bounded by the empirical risk plus a generalization factor. For a more detailed reasoning, consult [Bousquet et al., 2004].
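As a concrete illustration of the two risk quantities above, the following is a minimal numpy sketch. The classifier f (a function mapping a batch of inputs to predicted labels) and the regularization weight lam are illustrative placeholders, not objects from the thesis.

    import numpy as np

    def empirical_risk(f, X, y):
        """Empirical risk R_n(f) from (2.1): the fraction of samples f misclassifies."""
        return np.mean(f(X) != y)

    def regularized_risk(f, X, y, w, lam=0.1):
        """Regularized empirical risk: R_n(f) plus an L2 penalty on the parameters w."""
        return empirical_risk(f, X, y) + lam * np.linalg.norm(w)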

2.2 Neural Networks

Neural Networks are one of the cornerstones of modern machine learning techniques [Bishop, 1995, Hornik et al., 1989, Rumelhart et al., 1985]. Devised originally to be simplistic models of neurons, mathematically they are non-linear global function-approximators [2]. The structure of a neural network is that of a weighted directed acyclic graph. In this graph, vertices are called neurons and edges are called weights. Each neuron has a set of inputs and a single output. The inputs of a neuron are the outputs of other neurons with edges incident on that neuron.

[2] The universal approximation theorem [Barron, 1993] states: a neural network with a single hidden layer, containing a finite number of neurons, is a global function approximator. Note that the theorem concerns the existence of a network that approximates the function, not the learnability of such networks.

Figure 2.1: A multi-layer perceptron. [Diagram: input x = (x_1, x_2, ..., x_{d_1}), first hidden layer h_1 = φ_1(W_1 x), second hidden layer h_2 = φ_2(W_2 h_1), and a single output unit h_{3,1}.]

The output of a neuron is the result of applying a non-linear activation function to a dot product of its inputs and its weights. The dot product can be thought of as a pattern match in Euclidean space. The activation function augments this by adding non-linearity, as many functions of interest are non-linear. Examples of popular activation functions include:

1. Sigmoid: φ(z) = 1 / (1 + e^{−z})

2. Hyperbolic tangent: φ(z) = (e^z − e^{−z}) / (e^z + e^{−z})

3. Rectifier: φ(z) = max(0, z)

A variant of neural networks, where neurons are structured in layers, is called the Multi-Layer Perceptron (MLP) [Bishop, 1995]. In MLPs, neurons in the same layer have the same activation function. Thus, for an input x ∈ R^n the function computed by an MLP with N layers is of the form:

    f(x, W) = φ_N(W_N · ... φ_2(W_2 · φ_1(W_1 · x)))

where W_i is the weight matrix parametrizing layer i and φ_i is the non-linear activation function used in layer i. Some presentations of neural nets (e.g., [Bishop, 1995]) include a bias vector b and a weight matrix W for every layer instead of just W; however, the bias can be subsumed into W if we mandate the last entry in x to always be 1.
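A minimal sketch of this forward computation in numpy, assuming vector inputs and ignoring biases per the convention above (the names are illustrative):

    import numpy as np

    def mlp_forward(x, weights, activations):
        """Computes f(x, W) = phi_N(W_N ... phi_2(W_2 phi_1(W_1 x))) layer by layer.
        weights:     list of weight matrices [W_1, ..., W_N]
        activations: list of elementwise activation functions [phi_1, ..., phi_N]"""
        h = x
        for W, phi in zip(weights, activations):
            h = phi(W @ h)  # each layer: activation applied to a linear map
        return h

    # Example: a 2-layer MLP with tanh hidden units and a linear output
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(1, 5))
    y = mlp_forward(rng.normal(size=3), [W1, W2], [np.tanh, lambda z: z])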

Figure 2.2: The behaviour of popular neural activation functions. [Plot of the sigmoid, tanh and rectifier functions over the input range −6 to 6.]

Neural networks experienced fluctuating levels of interest over the years. In general, interest in these networks increased as computation became cheaper and more abundant. Also, due to the difficulty of training non-convex functions, neural networks gained traction whenever a new technique for training such functions surfaced. Such techniques include Backpropagation [Rumelhart et al., 1985] in the early days of neural networks, which is a reinvention of gradient descent that uses the chain rule to avoid duplicate computations; unsupervised pretraining [Hinton et al., 2006], which is a technique for initializing neural networks; and the reintroduction of convolutional networks [LeCun et al., 1998, Krizhevsky et al., 2012], which were shown to be good function approximators in the image domain.

One reason for the popularity of neural networks is their modularity: the structure of a neural network allows modular changes in activation functions [Nair and Hinton, 2010, Goodfellow et al., 2013b], pooling layers [Zeiler and Fergus, 2013], and regularizers [Weston et al., 2012], allowing designers to tune the architecture to their specific needs. Generally speaking, the real empirical problem with supervised learning of feed-forward neural networks with small to medium amounts of data is overfitting; usually, performance on training data quickly becomes perfect, while it plateaus on testing data.

2.2.1 Training neural networks

Training a neural net requires two things: an error function to minimize and a method to minimize it. The error function is called a loss function and the method is called a training algorithm. For classification tasks, the first error function that comes to mind is the 0-1 error, given in the case of two classes as follows:

    L(x, y) = 1_{sign(h(x)) ≠ y}

where h is the hypothesis function modelled by the neural network. However, in order to train neural networks with gradient-based methods, we need to replace the non-differentiable 0-1 loss with a surrogate differentiable loss. For a hypothesis class h, input x and target y, possible surrogate loss functions include:

1. Mean-square error: (h(x) − y)^2

2. Negative log-likelihood (aka cross-entropy): − log P(y|x, W)

3. Large margin loss (aka hinge or SVM loss): max(0, 1 − h(x)y)

4. Exponential loss: e^{−h(x)y}
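For concreteness, these losses are one-liners; a minimal sketch, assuming binary labels y ∈ {−1, +1} for the margin-based losses and writing the exponential loss with the conventional minus sign in the exponent:

    import numpy as np

    # Surrogate losses for label y in {-1, +1} and real-valued score s = h(x)
    mse      = lambda s, y: (s - y) ** 2                 # 1. mean-square error
    hinge    = lambda s, y: np.maximum(0.0, 1 - s * y)   # 3. large margin (hinge) loss
    exp_loss = lambda s, y: np.exp(-s * y)               # 4. exponential loss

    def nll(p):
        """2. negative log-likelihood, where p = P(y|x, W) is the model's
        predicted probability of the true label y."""
        return -np.log(p)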

As for training algorithms, the standard algorithm used in practice is gradient descent and variants thereof. Other possible training algorithms like genetic algorithms [Montana and Davis, ] and simulated annealing [Ingber, 1993] can be used as well; however, they tend to be inefficient in practice. Gradient descent updates the weights of a neural net to decrease the loss function by moving along the gradient with respect to the weights. Namely, at iteration t, gradient descent updates the weights as follows:

    Δ_t = γ(t) ∇f(x, W^{t−1})
    W^t = W^{t−1} − Δ_t

where γ(t) > 0 is called the learning rate, which is a value that depends on t. For gradient descent to converge to a local minimum, it suffices for the values of the gradient to be bounded and for the coefficients γ(t) to satisfy the conditions:

    ∑_{t=1}^{∞} γ(t) = ∞,    (2.2)
    ∑_{t=1}^{∞} γ(t)^2 < ∞    (2.3)

Training neural networks by gradient descent is commonly known as the backpropagation algorithm [LeCun, 1986, Rumelhart et al., 1986]. A simple practical variation of standard gradient descent is to include a momentum term during the update, i.e.:

    Δ_t = α Δ_{t−1} + γ(t) ∇f(x, W^{t−1})
    W^t = W^{t−1} − Δ_t

where α is called the momentum rate. Using momentum tends to accelerate convergence, usually without incurring a cost in the quality of the solution.
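A minimal sketch of the momentum update, assuming a function grad that returns the gradient of the loss at the current weights (illustrative names, not the thesis's code):

    import numpy as np

    def sgd_momentum(grad, W0, gamma=0.01, alpha=0.9, steps=1000):
        """Gradient descent with momentum:
        delta_t = alpha * delta_{t-1} + gamma * grad(W);  W <- W - delta_t."""
        W, delta = W0.copy(), np.zeros_like(W0)
        for t in range(steps):
            delta = alpha * delta + gamma * grad(W)
            W -= delta
        return W

    # Example: minimize the quadratic ||W||^2, whose gradient is 2W
    W = sgd_momentum(lambda W: 2 * W, np.array([5.0, -3.0]))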

2.2.2 Pros and Cons of Neural Networks

Like all other algorithms, neural networks have several advantages and several shortcomings. In [Vapnik, 1998], Vapnik points out some of the shortcomings of neural networks as follows:

1. Susceptibility to local minima

2. Slow convergence for standard gradient descent

3. Scaling factors of sigmoids affect approximation quality

The first point above follows naturally from the fact that this is a non-convex optimization problem, but this brings forth the question: are local minima really a negative? Local minima seem to be an inherent quality of nature. That is, it seems that the functions we are most interested in learning are non-convex functions, and it is difficult to imagine a way to get around that. This in part explains why neural networks perform so well on difficult tasks like image and speech recognition: since these problems are difficult, simple feature extraction with convex methods is unlikely to work well, and neural nets search a larger space that is more likely to be close to the true function we are trying to approximate than a convex method.

That said, local minima do introduce pragmatic difficulties. Since a local minimum only satisfies a property of the gradient of the function in a local region of the parameter space, there is not much we can say about the acquired solution in terms of its optimality. Moreover, local minima make results a bit harder to reproduce: as opposed to convex optimization problems, solutions to non-convex optimization problems depend on initialization points. Hence when one finds solutions with such architectures, there is no telling whether this is the best solution the architecture can give.

As for the second point, it tends to become less true as dataset sizes grow bigger and momentum-style methods are used. On today's machines, fitting a neural net with millions of parameters takes on the order of hours. Networks with billions of parameters take on the order of 5-6 days to learn [Krizhevsky et al., 2012]. The third point has also been addressed by learning scaling factors, using different activation functions and other engineering changes.

Pragmatically, one particular difficulty in designing neural networks is proper regularization. Being general function-approximators, without proper regularization or large amounts of data, there is no reason why the neural net should model functions in a desired way. This gives way to one other deficiency of neural networks: the multitude of parameters involved in designing them (i.e., hidden layer sizes, learning rates, decay factors, activation functions, etc.). The choice of these parameters is non-trivial and greatly affects results, making neural networks an unlikely candidate for off-the-shelf use.

On the other hand, the main case for neural networks is made by their strong empirical performance. While this kind of argument is not based on lasting theoretical principles, it is indeed a fact that at the time of this writing, most of the state-of-the-art results on benchmark datasets are held by variants of neural networks. This points out that the main problem with neural nets is not their susceptibility to local minima, but rather the sensitivity of the training process to changes in the parameters and proneness to overfitting.

2.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) [LeCun et al., 1998] are a class of MLPs that alternate convolution layers and pooling layers, with the last layer usually being a softmax or RBF layer. In a convolution layer, an input is convolved with multiple learned filters, leading to multiple maps which are then pooled together through a pooling scheme. CNNs are usually trained with standard backpropagation. Mathematically, a convolution layer applies the following function:

    h = activation(W_k ⋆ x + b)    (2.4)

where ⋆ denotes the discrete 2D convolution given as:

    h[m, n] = activation( ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} x[m + i, n + j] W[i, j] )    (2.5)

Usually, a convolution layer contains multiple convolution filters, to learn better representations of the data. Thus, the parameters connecting two layers form a 4D tensor (destination feature map index, source feature map index, source vertical position index, source horizontal position index). The original CNNs [LeCun et al., 1998] were created specifically to operate on images. The reasoning behind alternating convolution and pooling is twofold. The first is a connection to complex and simple cells in the cat's visual cortex [Hubel and Wiesel, 1968]. Simple cells are cells that were observed to respond to edge-like stimulus patterns in their receptive fields. Complex cells are ones observed to have a larger receptive field and to be locally invariant. Hence, convolution layers were meant to simulate simple cells and pooling layers were meant to simulate complex cells.
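A minimal sketch of equation (2.4) using scipy's 2D convolution; the [−1, 1] filter is the edge filter from Figure 2.3. (Note that deep learning libraries typically implement cross-correlation rather than true convolution; scipy's convolve2d performs true convolution, which suffices for illustration.)

    import numpy as np
    from scipy.signal import convolve2d

    def conv_layer(x, filters, biases, activation=lambda z: np.maximum(0, z)):
        """One convolution layer, per (2.4): one output map per learned filter."""
        return np.stack([activation(convolve2d(x, W, mode='valid') + b)
                         for W, b in zip(filters, biases)])

    img = np.random.rand(32, 32)            # a grey-scale input patch
    edge = np.array([[-1.0, 1.0]])          # the [-1, 1] filter from Figure 2.3
    maps = conv_layer(img, [edge], [0.0])   # output shape: (1, 32, 31)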

Figure 2.3: Effect of convolution in convolutional networks. [Convolution: the corgi image is convolved with the vector [−1, 1].]

A second, biology-free justification for the structure of CNNs is as follows. Convolution layers exploit a kind of deformation that does not change image class, hence regularizing learning through what has also been named weight sharing. Weight sharing leads to a reduced number of parameters in a convolutional layer compared to a standard fully connected layer, thereby reducing overfitting. Pooling layers combine stimuli from different convolution layers to produce a feature extractor that is invariant to small translations. Combined with regularization and pretraining techniques, these neural nets achieve state-of-the-art results on many datasets [Krizhevsky et al., 2012, Goodfellow et al., 2013b].

Figure 2.4: Effect of pooling in convolutional networks. [Pooling: the corgi image is pooled with a 2 × 2 max filter; the pooled image has half the width and half the height of the original.]

2.4 Dropout

Dropout [Hinton et al., 2012b] is a simple and efficient technique that can be used to reduce overfitting in neural networks. The main idea of dropout is to stochastically omit some of the units from the network during learning. Intuitively, dropout adds robustness to the network by introducing noise at all levels of the architecture. Another way to view dropout is as a way to do model averaging over exponentially many models with shared parameters. In that sense, dropout can be seen as optimizing the following objective function:

    min_W E_M[ L(X, M, W) ]

where L is the loss function, X is the dataset, M parametrizes the architecture on which the expectation is taken, and W is the set of the models' shared parameters. Note that the objective function above also models dropout's more recent cousin, DropConnect, which was shown to work slightly better than dropout on some datasets [Wan et al., 2013].
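A minimal sketch of a dropout mask, using the "inverted" scaling variant (rescaling at training time, rather than rescaling the weights at test time as in the original formulation) so that test-time activations need no change:

    import numpy as np

    def dropout(h, p_drop=0.5, train=True, seed=None):
        """Stochastically omit units during learning. Inverted scaling keeps the
        expected activation equal between training and test time."""
        if not train:
            return h
        rng = np.random.default_rng(seed)
        mask = rng.random(h.shape) >= p_drop   # keep each unit with prob. 1 - p_drop
        return h * mask / (1.0 - p_drop)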

2.5 Maxout Networks

A Maxout network [Goodfellow et al., 2013b] is a multi-layer perceptron that makes heavy use of dropout to regularize the neural net, thereby reducing overfitting. It also uses a max activation function that produces a sparse gradient.

Specifically, in these networks, for an input x ∈ R^n, every hidden layer implements the function:

    h_i(x) = max_{j ∈ [1,k]} z_ij, where
    z_ij = x^T W_{···ij} + b_ij,
    W ∈ R^{d×m×k}, b ∈ R^{m×k}.

Maxout networks apply the max operation over a number of linear units. Usually for these architectures, the number of linear units per maxout unit is the same for all Maxout units in a layer. That means that a neural network with 4 maxout layers is in reality an 8-layer neural network with alternating linear/max layers. The max layers in maxout are also known as pooling layers in convolutional networks. Maxout networks have produced state-of-the-art results on benchmark datasets without the use of pre-training or second-order optimization methods [Goodfellow et al., 2013b].

While Maxout networks were presented as an activation function that makes better use of dropout, they cannot be compared directly to rectifiers, as a maxout activation function adds new layers on which it computes a pooling operation. This leads to a multiplicative growth in the number of parameters in the network. For example, a rectifier layer with 400 nodes, connected to the output of a layer with 100 nodes, has 40,000 parameters, excluding biases, whereas a Maxout layer has 40,000 · N_p, where N_p is the number of units a maxout unit pools over.
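A minimal sketch of a single maxout layer following the equations above, with the (d, m, k) weight tensor shape they define:

    import numpy as np

    def maxout_layer(x, W, b):
        """Maxout layer: z_ij = x^T W[:, i, j] + b[i, j]; h_i = max_j z_ij.
        W has shape (d, m, k): d inputs, m maxout units, k linear pieces per unit."""
        z = np.einsum('d,dmk->mk', x, W) + b   # all m*k linear pieces at once
        return z.max(axis=1)                   # pool over the k pieces per unit

    d, m, k = 100, 400, 5                      # the 400-unit example from the text
    rng = np.random.default_rng(0)
    h = maxout_layer(rng.normal(size=d), rng.normal(size=(d, m, k)), np.zeros((m, k)))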

2.6 Hidden Markov Models

Hidden Markov Models (HMMs) [Rabiner, 1989] are relatively simple, reliable models for data sequencing problems. Technically, HMMs are partially-observable generalizations of Markov models, which are simple stochastic graphical models. In their simplest form, an HMM is characterized by a tuple λ = (A_{N×N}, B_{N×M}, π) where:

1. N: the number of states in the model; we denote the states as S = S_1, S_2, ..., S_N and q_t as the state at time t

2. M: the number of possible observations; we denote the observations as V = v_1, v_2, ..., v_M

3. A_{N×N}: the matrix of state transition probabilities, where A_ij = p(q_t = S_i | q_{t−1} = S_j). A is known as the transition model.

4. B_{N×M}: the matrix of symbol emission probabilities B = b_j(k), where b_j(k) = p(v_k | q_t = S_j)

5. π: the initial distribution on states, π_i = p(q_1 = S_i)

The three problems of Hidden Markov Models

For a sequence of observations O_1^T = o_1, o_2, ..., o_T and an HMM λ = (A, B, π), there are three basic questions that can be asked:

1. How to compute P(O_1^T | λ) efficiently?

2. How to compute a state sequence Q = q_1, q_2, ..., q_T that best explains O_1^T? I.e., compute the sequence Q such that Q = argmax_Q P(Q | O, λ)

3. How to train a given HMM λ = (A, B, π) from data?

Rabiner [Rabiner, 1989] gives an extensive explanation of how each of these problems can be solved.

Namely, the first problem is solved with what is called the forward algorithm. It decomposes P(O|λ) into smaller subproblems by exploiting the Markov property as follows:

    α_t(j) = ∑_{i ∈ [1,N]} α_{t−1}(i) A_ij b_j(o_t)    (2.6)

which computes the probability of a state sequence ending in state j at time t. Computing the above recursion can be done with dynamic programming, where it takes O(N^2 T) time. After this matrix is computed, P(O_1^T | λ) can be computed simply as:

    P(O_1^T | λ) = ∑_{j ∈ [1,N]} α_T(j)

The second problem is also a dynamic programming problem. It can be solved with the Viterbi algorithm by computing the following recursion:

    α_t(j) = max_{i ∈ [1,N]} α_{t−1}(i) A_ij b_j(o_t)    (2.7)

which computes the probability of the most likely sequence ending at state j at time t. Finding the most likely state sequence is done by simply tracing back through the dynamic programming matrix. Viterbi also takes O(N^2 T) time.

The third problem is intractable in general; however, many ways exist to come up with locally optimal solutions, the most popular of which is maximum likelihood (i.e., maximizing p(O|λ)) with a variant of the expectation-maximization algorithm [Dempster et al., 1977] called the Baum-Welch algorithm [Rabiner, 1989]. Another method is to discriminatively train the HMM to minimize the conditional entropy of hidden state sequences given the observations. This is equivalent to maximizing mutual information between the observations and the hidden states. Concretely, conditional entropy H(Q|O) and mutual information I(Q, O) are given as:

    H(Q|O) = − ∑_{q,o} p(q|o) log p(q|o)
    I(Q, O) = H(Q) − H(Q|O)

For works that extensively elaborate this procedure, consult [Vertanen, Valtchev et al., 1997]. Generally speaking, HMM-based models have long been among the main tools used for sequence modelling in voice recognition [Rabiner, 1989] and handwriting recognition [Hu et al., 1996].
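For concreteness, a minimal numpy sketch of the Viterbi recursion (2.7) with backtracking. Note that this sketch uses the row-stochastic convention A[i, j] = p(q_t = S_j | q_{t−1} = S_i), transposed relative to the definition given earlier in this section:

    import numpy as np

    def viterbi(A, B, pi, obs):
        """Most likely state sequence for observations obs, per recursion (2.7).
        A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial distribution."""
        N, T = A.shape[0], len(obs)
        alpha = np.zeros((T, N))            # alpha[t, j]: best score ending in j at t
        back = np.zeros((T, N), dtype=int)  # best predecessor state
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = alpha[t - 1, :, None] * A        # scores[i, j]: come from i to j
            back[t] = scores.argmax(axis=0)
            alpha[t] = scores.max(axis=0) * B[:, obs[t]]
        # trace back through the dynamic programming matrix
        q = [alpha[-1].argmax()]
        for t in range(T - 1, 0, -1):
            q.append(back[t, q[-1]])
        return q[::-1]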

2.7 Hybrid HMM models

Hybrid models [Morgan and Bourlard, 1995] extend HMMs with a simple idea: instead of using a generative observation model, hybrid models use Bayes rule and implicitly model the observation model using a probabilistic classifier over the HMM states. Concretely, let O be a sequence of observations and let Q be a state sequence; the purpose of the HMM in sequencing tasks is to produce argmax_Q p(Q|O). In a standard setting, to train an HMM, we require an observation model p(o|s), where o is an observation and s is an HMM state. In the hybrid model, we approximate the observation model through Bayes rule with a probabilistic classifier that computes the posterior distribution p(s|o) on HMM states s given an input o. Concretely:

    p(o|s) = p(s|o) p(o) / p(s) ∝ p(s|o) / p(s),    (2.8)

with p(o) assumed to be equal for all observations. Such hybrid models are usually trained with the embedded Viterbi algorithm [Bourlard and Morgan, 1998] to maximize the likelihood of the data. In other variants of the model, hybrid models are discriminatively trained to optimize segmentation accuracy directly [Bengio et al., 1992, Bengio et al., 1995] [3]. Combined with variants of neural networks, these models have increased accuracies on challenging sequencing tasks, primarily in speech recognition [Hinton et al., 2012a].

[3] It is worth noting that this model is pseudo-generative, in the sense that it is trained to maximize the likelihood of the data, but it cannot be sampled from.
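A minimal sketch of this substitution, assuming a matrix of per-frame classifier posteriors (one row per observation, one column per state); the resulting scaled likelihoods can be plugged into the Viterbi sketch above in place of the emission terms:

    import numpy as np

    def scaled_likelihoods(posteriors, state_priors):
        """Hybrid observation model per (2.8): replace p(o|s) with p(s|o) / p(s).
        posteriors:   (T, N) classifier outputs p(s|o_t), e.g. from a softmax layer
        state_priors: (N,) relative frequencies p(s) of states in the training data"""
        return posteriors / state_priors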

Embedded Viterbi

Embedded Viterbi alignment [Morgan and Bourlard, 1995] is an algorithm used to train hybrid HMM models. The algorithm differs from standard Baum-Welch training in two ways. The first is that for embedded Viterbi, when feeding in a sequence, a label is required for every slice of that sequence to train the probabilistic classifier. The second is that in the maximization step of Baum-Welch, the state labels are re-estimated using Viterbi after posterior probabilities are computed from the probabilistic model, which was trained on labels from previous iterations. Embedded Viterbi can be thought of as a supervised method for training HMMs; i.e., it is made for contexts where not only sequence information is available, but also information about a segmentation that is close to optimal.
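A sketch of this training loop; train_classifier, estimate_transitions and viterbi_align are hypothetical helpers standing in for whatever classifier and HMM implementation is used, not functions from the thesis:

    def embedded_viterbi_training(images, initial_alignments, n_iters=10):
        """Sketch of embedded Viterbi training for a hybrid HMM model."""
        labels = initial_alignments                  # heuristic starting alignment
        for _ in range(n_iters):
            # train the probabilistic classifier on the current frame labels
            clf = train_classifier(images, labels)   # hypothetical helper
            hmm = estimate_transitions(labels)       # hypothetical helper
            # re-estimate the alignment with Viterbi, using classifier posteriors
            labels = [viterbi_align(hmm, clf.posteriors(x)) for x in images]
        return clf, hmm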

2.8 Maximally Stable Extremal Regions (MSERs)

Maximally Stable Extremal Regions (MSERs) [Matas et al., 2004] are computer vision tools originally developed to detect key points in images as a part of solving the image correspondence problem, in which an algorithm seeks to find which objects correspond to each other in two images, where one is produced from the other by some movement, change of camera position or elapsed time. Intuitively, extremal regions are regions of the image that are local minima or maxima in the manifold defined by the image. More formally, MSERs possess the following qualities:

1. Invariance to affine transformations of image intensity.

2. Stability: only extremal regions whose support is unchanged over a range of thresholds are selected.

3. Multi-scale detection: without any smoothing, both fine and coarse regions are detected.

4. Low computational cost: can be enumerated in O(n) where n is the number of pixels in the image.

The process for computing MSERs is relatively straightforward. It begins by removing all pixels from the image and sorting them by intensity. Then, it proceeds by placing the pixels back into the image one by one in order of intensity. When a pixel is put back into the image, the set of other pixels this new one is connected to can be identified on the fly; this set is called a connected component, where we say two pixels are connected if a path of adjacent pixels exists from one to the other. While this process is taking place, the area of each connected component can also be computed on the fly as pixels are added. If one were to plot how a connected component's area changes as pixels are added, one could compute the rate of change (i.e., the first derivative) of area as a function of the number of pixels added. In this graph, local minima, which are points in intensity at which the area of a connected component stops changing, are the thresholds used to compute MSERs. Algorithm 1 presents the original algorithm [Matas et al., 2004] for computing MSERs.

Algorithm 1 MSER Extraction
Input: image I
1. Sort all pixels in I by intensity value. O(n)
2. Place pixels back in the image while maintaining a list of connected components and their areas. O(n log log n)
3. Report intensity levels that are local minima of the rate of change of the area function as thresholding values that produce MSERs.

In the original paper defining MSERs [Matas et al., 2004], the computational complexity was O(n log log n). However, in [Nistér and Stewénius, 2008] this was refined to O(n). As noted above, MSERs were originally created to help with the image correspondence problem. However, they were shown to be useful for other tasks like text detection [Chen et al., 2011].
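As an aside, OpenCV ships an off-the-shelf MSER detector; a minimal usage sketch (the cv2.MSER_create / detectRegions API of recent OpenCV releases, assuming a file scene.jpg exists) looks like this:

    import cv2

    gray = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)   # pixel lists and bounding boxes
    print(f'{len(regions)} stable regions found')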

Chapter Three

Character Recognition

This chapter presents a detailed inspection into the creation of a state-of-the-art character recognizer. We begin by formalizing the problem of character recognition in section 3.1. We then survey related works in the literature in section 3.2. Afterwards, in section 3.3, we describe the particular dataset this chapter is concerned with. Then we present our specific approach in section 3.4 and the results in section 3.5. We conclude with a discussion in section 3.6.

3.1 Problem Definition

Character recognition is a basic block in an end-to-end text recognition system.

Formally defined: given an image x ∈ X_char and a set Y = {1, ..., K} of character labels, construct a function f : X_char → Y, where X_char is the set of all character images.

3.2 Related Works

Character recognition is an instance of object recognition where characters are the object of interest. The difficulty of this problem stems from high confusion between upper-case/lower-case characters and letters/numbers, on which even humans make mistakes [de Campos et al., 2009].


Character recognition and its more general cousin, object recognition, have been addressed by the machine learning community using various methods ranging across the entire spectrum of classification algorithms (e.g., k-NN, SVMs, neural nets [LeCun et al., 1998], boosting). However, the aforementioned algorithms in their most basic form cannot compete for state-of-the-art results on the more challenging datasets, such as those in which characters are found "in the wild" as opposed to characters found in structured documents. Therefore, variants of vanilla machine learning algorithms (e.g., large margin nearest neighbours [Blitzer et al., 2005], virtual SVMs [Decoste and Schölkopf, 2002], boosting products of classifiers [Kégl and Busa-Fekete, 2009]) were devised with different underlying assumptions to tackle this problem. Generally speaking, the techniques used for constructing these variants, which were meant to improve models' performance on test sets, can be grouped into three classes:

1. Manually help the model by extracting useful features.

2. Use a larger training set while increasing model capacity.

3. Encode data invariances directly into the model, thereby restricting the hypothesis space to more useful areas.

The first approach has been the traditional approach for machine learning practitioners. Instances of this approach include [de Campos et al., 2009], where the authors manually define a set of features that they perceive to be discriminative and then utilise off-the-shelf algorithms to achieve good recognition results. While this approach is simple and efficient, designing and selecting features is not a trivial task. Moreover, the desired discriminative features may be too complicated for a human to create. This steered the interest of the field to learning discriminative features through neural networks.

Neural networks, however, are too simple in their vanilla form to achieve state-of-the-art results. This is in part due to their large hypothesis spaces, in which search is difficult without proper regularization. In general, encoding translation and scale invariances into the neural network has been shown to be effective. This is usually done with convolutional neural networks [LeCun et al., 1998, Ciresan et al., 2012]. However, while these networks do produce state-of-the-art results in their different instantiations, they require large amounts of data and take a long time to train (in our experiments, training a convolutional neural network with 100k examples and millions of parameters took around 7 hours). An interesting recent work [Bruna and Mallat, 2013] reached a CNN-like architecture (called scattering networks) from signal processing principles by attempting to encode invariances to translation and scaling into the learning machine.

Some other interesting methods include virtual SVMs [Decoste and Schölkopf, 2002], in which the authors sample new training examples by applying deformations to support vectors. This method is quite simple; however, experiments using it were limited to fairly small datasets. Another interesting method is the Large Margin Nearest Neighbour (LMNN), an instance of the popular k-NN class of methods, in which the authors [Blitzer et al., 2005] learn a distance metric such that a k-NN classifier would work well. This is done by enforcing constraints that a positive neighbour should be closer to a sample than a negative neighbour, where positive neighbours are ones that belong to the same class and negative neighbours are ones that do not.

On the particular task of recognizing English characters and digits, a more limited set of algorithms can be found in the literature. Specifically, pre-trained variants of CNNs [Wang et al., 2011, Wang et al., 2012], in which the authors, before discriminatively training the CNN, pretrain its convolutional weights by applying a variant of k-means to normalized data. Deformable parts models [Shi et al., 2013] were also shown to work well in this particular sub-domain.

3.3 Dataset

Since this work concerns itself with recognizing English text, we set Y to be the set of English characters and Arabic digits from 0 to 9; note that |Y| = 62. The particular dataset we use for this task is the ICDAR 2003 character recognition dataset [Lucas et al., 2003]. We choose this specific dataset because previous works used it to assess their methods (e.g., [Coates et al., 2011, Wang et al., 2012]), and also because we intend to use this classifier for recognizing words from the ICDAR 2003 words dataset. The character dataset contains 6114 training samples and 5379 test samples after removing all non-alphanumeric characters as in [Wang et al., 2012]. We augment the training dataset with 75,495 character images from the Chars74k English dataset [de Campos et al., 2009] and 50,000 synthetic characters generated by [Wang et al., 2012], making the total size of the training set 131,609 tightly cropped character images. All images are rescaled to a size of 32-by-32 with Lanczos interpolation over an 8x8 pixel neighborhood and then converted to grey-scale.

3.4 Method

We want to create a probabilistic, highly accurate classifier on a large dataset of characters. We would like our classifier to be invariant to small translations in input, as character cropping is usually not tight. We would also like our classifier to be invariant to scale, as text exists at multiple scales. SVMs and their variants take a very long time to train, as training time scales quadratically with the number of examples. Also, there is no clear way to encode such invariances into the SVM. Other more sophisticated options like scattering networks [Bruna and Mallat, 2013] were not tested on large-scale problems and were therefore difficult to assess. On the other hand, convolutional neural networks, while consuming long training times, offered the best tradeoff in terms of inference speed, accuracy and training time.

The particular variant of CNNs we chose to work with is a convolutional Maxout network. Maxout networks were shown to be highly accurate [Goodfellow et al., 2013b] on challenging real-world object recognition tasks, like the CIFAR 10 [Krizhevsky et al., 2012] and SVHN [Netzer et al., ] datasets. The particular architecture we create for this task is a five-layer convolutional Maxout network, with the first three layers being convolution-pooling Maxout layers, the fourth a Maxout layer, and finally a softmax layer on top. The first three layers have respectively 48, 128 and 128 filters, of sizes 8-by-8 for the first two and 5-by-5 for the third, pooling over regions of sizes 4-by-4, 4-by-4 and 2-by-2 respectively, with 2 linear pieces per Maxout unit and a 2-by-2 stride. The fourth layer has 400 units and 5 linear pieces per Maxout unit, fully connected with the softmax output layer. These choices of hyper-parameter values were based on the network constructed for CIFAR 10 in [Goodfellow et al., 2013b].

We train the proposed network on 32-by-32 grey-scale character image patches with a simple preprocessing stage of subtracting the mean of every patch and dividing its elements by its standard deviation + ε. We set ε = 0.001, where ε was chosen by cross-validating on the training set with an SVM classifier with an RBF kernel. Similar to [Goodfellow et al., 2013b], we train this network using stochastic gradient descent with momentum and dropout to minimize the cross-entropy loss − log p(y|x).
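The per-patch normalization is simple enough to state exactly; a minimal numpy sketch:

    import numpy as np

    EPS = 1e-3  # the epsilon chosen by cross-validation in the text

    def preprocess(patch):
        """Per-patch normalization applied before the Maxout network:
        subtract the patch mean, divide by (standard deviation + epsilon)."""
        return (patch - patch.mean()) / (patch.std() + EPS)

    x = preprocess(np.random.rand(32, 32))  # a 32-by-32 grey-scale character patch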

3.5 Results

The resulting character recognizer achieves state-of-the-art recognition rates on the ICDAR 2003 character test set, with an accuracy of 85.5% on the 62-way case-sensitive benchmark and 89.9% on the case-insensitive 36-way benchmark. When we use the Maxout network as a feature extractor and feed the features from the penultimate layer into an SVM with an RBF kernel, the recognition accuracy increases to 86% on the 62-way benchmark while it remains roughly the same (89.8%) on the 36-way benchmark. Table 3.1 compares our results to other works on this dataset. The works we compare to were state-of-the-art methods at the time of this writing, and these methods all used a training set augmented in the same way ours is. Overall, the performance of the convolutional Maxout network is superior to that of the CNNs used in previous approaches. Training took six hours on an NVIDIA Kepler K20c GPU using Theano [Bergstra et al., 2010] and pylearn2 [Goodfellow et al., 2013a].

Figure 3.1: Character recognition confusion matrix. [Heatmap of predicted vs. actual labels over the 62 alphanumeric classes (digits 0-9, lower-case a-z, upper-case A-Z), with intensities ranging from 0.0 to 1.0.]

As a side note, we found that different forms of binarization (Otsu, random walkers, GrabCut) and preprocessing methods, such as ZCA as used in [Wang et al., 2012], do not enhance the test accuracy and in some cases decrease it.

Table 3.1: Character recognition accuracy (%) on the ICDAR 2003 test set. All methods use the same augmented training dataset.

Work                    Method               Result
[Coates et al., 2011]   pre-trained CNNs     81.7
[Wang et al., 2012]     pre-trained CNNs     83.9
this work               Conv-Maxout          85.5
this work               Conv-Maxout + SVM    86.0

3.6 Discussion

This chapter presents a method to construct a state-of-the-art character recognizer on the ICDAR 2003 dataset. The recognizer utilizes a convolutional variant of the recently introduced Maxout networks in addition to dropout. The architecture we construct demonstrates superior results when compared to pre-trained CNNs, the previous leading method for this problem. By inspecting the confusion matrix (Figure 3.1), we see that most of the inaccuracy comes from the confusion between upper-case and lower-case letters. This kind of confusion is difficult even for humans to circumvent without context information. Therefore, one possible future work to sidestep this confusion is to include context information for the characters under recognition.

Chapter Four

Word Recognition

This chapter presents the word recognition module for transcribing images of words found "in the wild" into their corresponding text. We begin with a concrete definition of the word recognition problem in section 4.1. We proceed to survey related works in section 4.2. Then we present an outline of our method in section 4.3. Sections 4.4, 4.5 and 4.6 respectively present the three core steps of our method: segmentation with a hybrid HMM model, cascade construction, and word inference. In section 4.7 we present the implementation details pertaining to the hybrid HMM model. Section 4.8 presents the datasets used as well as the results obtained. In section 4.9, an analysis of the accuracy-speed tradeoff is presented, as well as an empirical analysis of the effect of language model order. Finally, we conclude in section 4.10.

4.1 Problem Definition

There are two variants of the word recognition problem: one where a lexicon of allowed words is provided, and another where predicted words are permitted to fall outside the lexicon. These problems can be formally defined as follows:

For the first problem, given an image x ∈ X_word and a set of words W, create a function f_1 : X_word → W. The second problem is defined similarly, except that instead of the word lexicon, we are given a set of characters Y, and the desired function is of the form f_2 : X_word → Y*.


In this chapter, we present a method that can be used for solving either problem (i.e., with or without limiting words to a lexicon). Our proposed method outperforms most previous state-of-the-art results on the ICDAR 2003 and SVT word recognition datasets.

4.2 Related Works

Word recognition, much like phone recognition and handwriting recognition, is a sequence recognition problem. Works on sequence recognition have generally been confined to their applied fields. However, machine learning methods used in one field tend to be applicable to another with minor modifications.

In the speech recognition community, the most popular sequencing models have been Hidden Markov Models (HMMs) and variants thereof [Rabiner, 1989, Dahl et al., 2012]. Other tools such as Dynamic Time Warping (DTW) and neural nets received attention during the 1980s, but they were then outperformed by HMM variants, such as hybrid HMM models. More recently, however, variants of recurrent neural nets have been shown to work well on phoneme sequencing tasks [Graves et al., 2013]. As for handwriting recognition, HMM alternatives such as Long Short Term Memory (LSTM) [Liwicki et al., ] and Graph Transformer Networks (GTNs) [LeCun et al., 1998] have been shown to work particularly well. While GTNs haven't received much attention recently, LSTMs continue to outperform other methods for handwriting recognition [Grosicki and El Abed, 2009].

As for image text recognition, previous works tackled this problem using variants of CNNs [Wang et al., 2012], Conditional Random Fields (CRFs) [Mishra et al., 2012, Novikova et al., 2012] and Pictorial Structures (PS) [Wang et al., 2011]. Surprisingly, most of the work in this area hasn't benefited from the already established methods for handwriting and speech recognition, such as hybrid models and recurrent nets.

The majority of techniques in this particular sub-domain have relied on segmentation-free, lexicon-dependent approaches. Using lexicons helps tackle the high confusion inherent in the text recognition problem. However, despite the argument for the validity of task-specific lexicon use in [Wang et al., 2011], it is clear that we ultimately wish to recognize text with a very general lexicon. To do so, we require word recognizers that scale well in the size of the lexicon. The works of [Neumann and Matas, 2011, Mishra et al., 2012, Novikova et al., 2012] are the only works we know of that show how their methods scale with lexicon size.

4.3 Method Outline

The purpose of the word recognition module is to transcribe a word image into text. Our approach to word recognition is a segmentation-based, lexicon-free approach that can easily incorporate a lexicon during inference or as a post-inference processing tool. As it currently stands, it is difficult to recognize words with high accuracy without any language model due to the character confusion problem; therefore, all previous systems rely on lexicons to improve their results. However, since lexicons can be very large, we make the distinction in our approach between when query time is linear in the size of the lexicon and when it is constant.

The pipeline proposed in this work for word recognition is as follows: a word image is received, after which it is segmented into possible characters using a hybrid HMM/Maxout model; from the resulting segmentation a cascade of potential characters is constructed. Then, to find the exact word, a beam-search style variant of the Viterbi algorithm is applied. The beam search allows trading off speed and accuracy to compute a list of candidate results. Figure 4.1 depicts the pipeline for the full word recognition module. We detail each sub-module in the following sections.
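To make the control flow concrete, the following is a minimal Python sketch of the pipeline; the three stage functions are passed in as callables and are hypothetical stand-ins for the modules described in the remainder of this chapter.

def recognize_word(word_image, segment, build_cascade, decode):
    # 1. Segment the word image into candidate character intervals
    #    (hybrid HMM/Maxout model, Section 4.4).
    intervals = segment(word_image)
    # 2. Build the cascade of potential characters (Section 4.5).
    cascade = build_cascade(intervals)
    # 3. Decode with a beam-search variant of Viterbi (Section 4.6);
    #    returns a ranked list of (word, cost) candidates.
    return decode(cascade)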

Figure 4.1: Word recognition pipeline. Pipeline for the word recognition module. Pentagons represent learned modules. The character recognition Maxout module was described in Chapter 3. Note that the lexicon can be used either as a post-processing tool, or during inference after a language model is constructed.

4.4 Hybrid HMM Maxout Model

The use of a hybrid HMM Maxout model for segmentation was inspired by works in speech recognition [Renals et al., 1994, Hinton et al., 2012a]. Whereas in speech recognition the hybrid model is used to directly sequence phonemes, here we use it to segment word images into character/inter-character regions. Unlike in speech recognition, our input domain consists of an image, which extends in two dimensions. To make use of the topological structure of our input, we construct a hybrid model combining an HMM with a convolutional network, instead of a standard neural network. To train the hybrid model from word image data, for every image we create a segmentation into character/inter-character regions as follows: for every word, for every pair of adjacent characters, we define an inter-character region as the region stretching 10% into the left character and 10% into the right character. The particular HMM structure we use, depicted in Figure 4.2, has two “strands” of states; the first strand models character regions and the second strand models the inter-character regions.


Figure 4.2: A hybrid HMM-Maxout model. Word-to-character hybrid HMM Maxout module. The Maxout network's fourth layer is coupled with the HMM.
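As an illustration of the labelling rule above, the inter-character regions can be generated as follows; char_boxes is a hypothetical list of (left, right) horizontal extents of a word's characters, ordered left to right, and the function is a sketch rather than the exact preprocessing code used in the thesis.

def inter_character_regions(char_boxes):
    # An inter-character region stretches 10% into the left character
    # and 10% into the right character (the rule described above).
    regions = []
    for (l1, r1), (l2, r2) in zip(char_boxes, char_boxes[1:]):
        w_left, w_right = r1 - l1, r2 - l2
        regions.append((r1 - 0.1 * w_left, l2 + 0.1 * w_right))
    return regions

# e.g., two adjacent characters spanning (0, 10) and (12, 20)
# yield the inter-character region (9.0, 12.8)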

Concretely, training the hybrid model follows this procedure: first, we start with a heuristic alignment of the input sequence with the target labels; in other words, we rely on an initial segmentation of the input data. Next, the neural network is trained via conventional gradient descent to minimize the errors on the current alignment. Then a new alignment is estimated using the Viterbi algorithm, on which the neural network is trained again, and the process repeats until convergence. This procedure is known as embedded Viterbi training [Bourlard and Morgan, 1998]. After the model is trained, we use it to find the segmentation Q that maximizes P(Q|O) by running the standard Viterbi algorithm.
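A sketch of the embedded Viterbi loop in Python; train_network and viterbi_align are hypothetical helpers standing in for gradient-descent training of the Maxout net and for Viterbi alignment under the HMM, respectively.

def embedded_viterbi(images, alignments, network, hmm,
                     train_network, viterbi_align, max_iters=10):
    # Alternate between (1) training the network on the current frame
    # labels and (2) re-aligning with Viterbi using the network's
    # posteriors as HMM emission probabilities.
    for _ in range(max_iters):
        train_network(network, images, alignments)
        new_alignments = [viterbi_align(hmm, network, im) for im in images]
        if new_alignments == alignments:  # alignments stable: converged
            break
        alignments = new_alignments
    return network, alignments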

4.5 Constructing the Cascade

The segmentation produced by the hybrid model suffers from two natural shortcomings: over- and under-segmentation. Over-segmentation arises because some characters can be composed by concatenating other characters, e.g., VV instead of W. Under-segmentation is more often observed in cases of difficult fonts, blurry images and complex background noise.

To filter out instances of over-segmentation, we train a 4-layer convolutional Maxout network, with the same architecture as the Maxout used with the hybrid model, to predict the probability of over- and under-segmentation. This network (called Segmentation Correction Maxout in Figure 4.1) is trained on correct, over-, and under-segmentations. We create a new interval from every two adjacent intervals if the joined interval has a higher probability of being a correct segmentation than both of its constituents under the learned network. As for under-segmentation, we simply divide every resulting interval into two by cutting it in the middle. (While we could alternatively use other segmentations from the hybrid model, this particular heuristic is simpler and equally effective.)

This operation produces a three-layered graph of overlapping intervals. We refer to this graph as a cascade. Every cascade induces an adjacency graph that we use later for inferring the corresponding word. Figure 4.3 depicts a cascade for the word “JFC” along with its induced graph. Note that the middle row is the segmentation from the hybrid HMM/Maxout model.
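The construction can be sketched as follows; intervals are (left, right) pairs from the hybrid model, and p_correct is a hypothetical callable wrapping the Segmentation Correction Maxout's probability that an interval is a correct character segmentation.

def build_cascade(intervals, p_correct):
    # Top layer: join adjacent intervals when the joined interval is more
    # likely to be a correct segmentation than both constituents.
    merged = []
    for a, b in zip(intervals, intervals[1:]):
        joined = (a[0], b[1])
        if p_correct(joined) > max(p_correct(a), p_correct(b)):
            merged.append(joined)
    # Bottom layer: split every interval in the middle to recover from
    # under-segmentation.
    split = []
    for left, right in intervals:
        mid = (left + right) / 2.0
        split += [(left, mid), (mid, right)]
    # Middle layer is the hybrid model's own segmentation.
    return merged + intervals + split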

4.6 Word Inference

Computing the most likely word given a cascade is equivalent to computing the most likely path from the beginning of the cascade to its end. This problem can be solved using dynamic programming in a way similar to the Viterbi algorithm, except that here both nodes and edges in the graph incur a cost.


Figure 4.3: A cascade induced graph for the word “JFC”. A sample cascade with its induced graph. The middle row is the output of the hybrid HMM Maxout module. The letter J in the top level was produced as an output of the Segmentation Correction Maxout module. The lower row is produced by systematically splitting all the nodes in the middle row.

Let the alphabet consist of $K$ characters, let $c_i$ be the character with index $i$, and let $v_k$ be the interval in the cascade with index $k$, where the cascade intervals are sorted by their left-most point. We define $S(c_i, v_k)$ to be the probability of the most likely sequence ending in interval $v_k$ with character $c_i$, and $N(v_k)$ to be the set of intervals that immediately precede $v_k$. $S$ can be computed optimally using a Viterbi-style algorithm in two cases: without a language model, or with one. In the first case $S$ becomes:

    $S(c_i, v_k) = p(c_i \mid v_k)\, p(v_k) \max_{j,q} S(c_j, v_q)$,    (4.1)

while in the second:

    $S(c_i, v_k) = p(c_i \mid v_k)\, p(v_k) \max_{j,q} p(c_i \mid c_j)\, S(c_j, v_q)$,    (4.2)

such that $q \in N(v_k)$. Computing $S$ from a list of intervals takes $O(K^2 V + V \log V)$ time, where $V$ is the number of intervals in the cascade, assuming that $|N(v_k)| = O(1)$ for any $v_k$. The $V \log V$ term comes from the binary search for the set of intervals $N(v)$ for every interval.

The most likely word can be found by tracing back the largest $S(c_i, v_k)$ over all intervals whose end is the end of the cascade. (One issue with the optimization problem in equation (4.2) is that it compares sequences on different probability spaces, in that the sequence-length prior is induced by the hybrid model's segmentation and the cascade construction. This particular issue was not directly studied by any works we know of; other works in the area have handled it in other ways, for which consult [Bengio et al., 1995, LeCun et al., 1998].)

While we can obtain $p(c|v)$ as the posterior from the five-layer character recognition Maxout network (from Chapter 3), we obtain $p(v)$ from the Segmentation Correction Maxout network specified in Section 4.5. As for the language model, $p(c_i|c_j)$, we obtain it from a predefined lexicon.

The straightforward generalization of equation (4.2) to n-gram language models incurs a large time penalty on the order of $O(K^n)$. To sidestep that penalty while allowing for higher-order language models, we propose an algorithm that trades off accuracy with inference time in a way similar to Beam Search [Russell and Norvig, ], keeping the top $B$ candidates for every interval. We call this Cascade Beam Search (see Algorithm 2). Here, $p(c|m, w)$ is the probability of a character given an n-gram language model $m$ and the sequence of characters $w$ that precede it, $cost_v$ is the visual cost of an interval-character pair, $cost_l$ is the linguistic cost of ending in an interval with a character $c$, and ∥ denotes string concatenation. Note that the beam search algorithm allows us to conduct the search for the target word with constant complexity in lexicon size, since the n-grams are compiled beforehand.

The outputs produced by the above algorithm need not belong to a predefined lexicon. One could constrain the output to a lexicon in multiple ways, the most popular of which are Finite State Transducers [Mohri, 1997], which keep the computation constant in lexicon size. In this work, however, we adopt a different approach: we map the predicted sequence to its nearest neighbor under edit distance. We compute this mapping naively with a method that consumes linear time in the size of the dictionary.
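A minimal sketch of this post-processing step, assuming a plain Levenshtein distance (the learned edit distances mentioned in Section 4.10 would slot in here as well):

def edit_distance(a, b):
    # classic O(|a||b|) dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nearest_lexicon_word(prediction, lexicon):
    # naive linear scan over the lexicon, as described above
    return min(lexicon, key=lambda word: edit_distance(prediction, word))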

Algorithm 2 Cascade Beam Search
Input: intervals v_i, language model m, beam width B
for i = 1 to V do
    Q_i = Queue()
    for j ∈ N(v_i) do
        for c_k ∈ Alphabet do
            for every word w in Q_j do
                ŵ = w ∥ c_k
                cost_v = p(c_k | v_i) · p(v_i)
                cost_l = p(c_k | m, w)
                cost_ŵ = cost_v · cost_l · cost_w
                Add (ŵ, cost_ŵ) to Q_i
                if size(Q_i) > B then
                    remove the word with lowest cost from Q_i
                end if
            end for
        end for
    end for
end for
return all w ∈ Q_j, sorted decreasingly by their costs, such that v_j is at the end of the cascade
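For concreteness, a Python rendering of Algorithm 2; preds[i] encodes N(v_i), with -1 standing for the start of the cascade, and p_char, p_interval and p_lm are hypothetical callables wrapping the character recognizer, the segmentation-correction score and the n-gram model.

def cascade_beam_search(num_intervals, preds, alphabet,
                        p_char, p_interval, p_lm, B=100):
    # beams[i] holds up to B (word, cost) pairs ending at interval i;
    # -1 is a virtual start node holding the empty word with cost 1.
    beams = {-1: [("", 1.0)]}
    for i in range(num_intervals):
        candidates = []
        for j in preds[i]:
            for w, cost_w in beams.get(j, []):
                for c in alphabet:
                    cost_v = p_char(c, i) * p_interval(i)  # visual cost
                    cost_l = p_lm(c, w)                    # linguistic cost
                    candidates.append((w + c, cost_v * cost_l * cost_w))
        # keep only the B highest-cost (most probable) words
        beams[i] = sorted(candidates, key=lambda t: t[1], reverse=True)[:B]
    return beams  # read answers off the beams of cascade-final intervals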

4.7 Implementation Details

In the hybrid model we use, depicted in Figure 4.2, the HMM has four states for each of the character/inter-character regions. For the classifier, we use a four-layer convolutional Maxout net whose first three layers are convolution/pooling layers with 48 filters each, where the filters are of size 8-by-8 for the first two layers and 5-by-5 for the third, topped by a softmax layer. The first three layers have 2, 2 and 4 linear pieces per Maxout unit respectively, and pooling is done on regions of size 4-by-4 with a 2-by-2 stride. The particular dataset we use to create the hybrid HMM model is made from the first 500 words in the ICDAR 2003 training set. We find that a single iteration of embedded Viterbi is sufficient for the hybrid model to learn to segment; this is likely because the Maxout component is learning a good initial segmentation.
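For reference, a Maxout unit simply takes the maximum over several affine "pieces". Below is a minimal dense (non-convolutional) version in NumPy, purely illustrative since the layers above are convolutional.

import numpy as np

def maxout_layer(x, W, b):
    # W: (pieces, out_dim, in_dim), b: (pieces, out_dim).
    # Each output unit is the max over its linear pieces.
    z = np.einsum('poi,i->po', W, x) + b
    return z.max(axis=0)

# example: 2 linear pieces, 3 output units, 5 inputs
rng = np.random.default_rng(0)
h = maxout_layer(rng.standard_normal(5),
                 rng.standard_normal((2, 3, 5)),
                 rng.standard_normal((2, 3)))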

4.8 Dataset and Results

We test our word recognition subsystem on word images from the ICDAR 2003 [Lucas et al., 2003] and SVT [Wang and Belongie, 2010] word recognition test sets. The ICDAR 2003 test set consists of images of perfectly cropped words. The SVT test set is a harder benchmark with more loosely cropped words and case-wise incorrect labellings. Similar to [Wang et al., 2011, Wang et al., 2012], all of our tests are on words that do not contain non-alphanumeric characters and that are of length greater than 2, leaving 860 and 647 test words for the ICDAR 2003 and SVT datasets respectively.

For the ICDAR 2003 test set, we test the recognizer under three scenarios that vary by lexicon size: small, medium and large. In the small lexicon case, an image's lexicon contains the ground truth word in the image in addition to 50 distractor words provided by [Wang et al., 2011]. In the medium lexicon case, the lexicon contains all the words in the test set. For the large lexicon case, we use a spell-checking dictionary (available at http://wordlist.sourceforge.net/) that contains almost 50,000 words, in addition to all the words in the test set. We call these scenarios W-S, W-M and W-L respectively. As for the SVT test set, as in [Wang et al., 2011, Wang et al., 2012], we test the recognizer under a single setting in which, for every word, we use the distractor words provided with the dataset. Moreover, since the SVT dataset's lexicons contain only capitalized words, we collapse the classifier's output $p(c_k|v_i)$ by setting the probability of a character to the sum of its upper-case and lower-case probabilities.

We also test the system in two modes: in the first, the system takes constant time in lexicon size per query, while in the second we permit the query to post-process the result with a lexicon. In the first mode, we test the system with language models constructed from task-specific lexicons. In the second mode, we do not use a language model at all; instead, we take the resulting list of words from the cascade beam search algorithm and consider the recognition result to be the most likely resultant word that exists in the lexicon, or, in case none of the resulting words are in the lexicon, the word with the least edit distance to any word in the lexicon.
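The case-collapsing step described above can be sketched as follows; p_char is a hypothetical dictionary mapping characters to the recognizer's posterior probabilities.

def collapse_case(p_char):
    # sum upper- and lower-case probabilities into a single entry
    collapsed = {}
    for c, p in p_char.items():
        key = c.upper()
        collapsed[key] = collapsed.get(key, 0.0) + p
    return collapsed

# e.g., {'a': 0.3, 'A': 0.5, 'b': 0.2} -> {'A': 0.8, 'B': 0.2}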

Table 4.1: Word recognition accuracies (%) on the ICDAR 2003 and SVT datasets. The last two lines are from this work.

Work                       Method              W-S    W-M    W-L    SVT
[Wang et al., 2012]        CNNs                90.0   84.0   -      70.0
[Mishra et al., 2012]      CRFs                81.8   67.8   -      73.2
[Novikova et al., 2012]    CRFs                82.8   -      -      72.9
[Wang et al., 2011]        PS                  76.0   62.0   -      57.0
[Goel et al., 2013]        weighted DTW        89.7   -      -      77.3
[Bissacco et al., 2013]    over-segmentation   -      -      -      90.4
This work (5-gram LM)      HMM/Maxout          90.1   87.3   83.0   67.0
This work (edit distance)  HMM/Maxout          93.1   88.6   85.1   74.3

Table 4.1 compares our results on the benchmarks W-S, W-M, W-L and SVT to previously published results under the two modes specified above. (The work in [Bissacco et al., 2013] uses a private dataset of 2.2 million labelled characters to train their character recognizer; they also use 10^8 characters to train their language model. These datasets exceed by orders of magnitude the datasets used by other works, including ours.) All experiments were run with a beam width B = 100. Without the use of either a language model or a lexicon, the module reaches an accuracy of 55.6%. As shown in Table 4.1, our proposed algorithm outperforms previous state-of-the-art algorithms on the specified benchmarks. On the large lexicon benchmark, we could not find works that were directly comparable to ours. However, we note that when we increase the lexicon size 1000-fold, we get an accuracy of 85.1%, which compares favourably with the 78% achieved by [Novikova et al., 2012] when they increase their lexicon size 90-fold.

4.9 Speed-Accuracy Tradeoffs and the Effect of Language Models

4.9.1 Effect of Beam Width

Since the complexity of the cascade beam search algorithm is $O(KVB \log B + V \log V)$, we can trade off the accuracy of the algorithm against its speed through the parameter B. Figure 4.4 shows the effect of the beam width on recognition accuracy and on recognition speed on the W-S task. As shown in the figure, a small beam width does not lead to a great decrease in accuracy, yet permits a great increase in recognition speed, making the word recognition module almost 15 times faster.


Figure 4.4: Effect of beam width. Beam width vs. accuracy (%) and recognition speed (seconds per query) on the ICDAR 2003 word recognition dataset under the small lexicon scenario.

4.9.2 Effect of Language Model Order

As noted earlier, the Cascade Beam Search algorithm also allows for the integration of higher-order language models directly in the inference stage. This is most helpful in the case of very large lexicons, since the inference process takes constant time in lexicon size after the initial stage of encoding the lexicon by its n-grams. Figure 4.5 depicts how accuracy changes with the language model's order for different lexicon sizes. The Small, Medium and Large curves correspond to using the Small, Medium and Large lexicons specified in Section 4.8.


Figure 4.5: Effect of language model order. Language model order vs. accuracy by lexicon size on the ICDAR 2003 test set with beam width B = 100. Note that the Small, Medium and Large curves are tested on case-sensitive words while Large* is on case-insensitive words.

The Large* curve corresponds to using the same large lexicon but without adding the ground truth words to the lexicon; this is the only scenario run on case-insensitive words. The highest accuracy reached under the Large* scenario is 67.0%. It is notable that in the Large* scenario, higher orders of language models cause overfitting and thereby reduce recognition accuracy. This overfitting is probably due to the fact that many words in the test set were not in the lexicon, and with a higher-order language model, words that looked similar to those in the test set were chosen instead.

4.10 Discussion

This chapter presented a detailed account of creating a fast and accurate word recognizer. Our recognizer leverages techniques from speech recognition, particularly a hybrid HMM model combined with a state-of-the-art deep neural network, to outperform previous works on this task. Concretely, we use the hybrid model to sequence words into their character constituents, then we construct a graph on possible segmentations, on which we apply a variant of beam search. Our beam search variant allows fast and accurate recognition, as recognition speed is constant in lexicon size: we are able to compile the n-gram model before seeing the test words. We present two methods for including a lexicon in our recognizer: 1) a nearest-neighbor type approach on an edit distance metric, incurring an additional cost linear in the lexicon size; 2) inclusion of the lexicon's n-grams in the beam search to bias the search towards sequences that are more probable according to the language model. We compare our work to previous works on the ICDAR 2003 and SVT datasets; our work outperforms all other works on ICDAR 2003, and most others on the SVT dataset.

There are many approaches for improving the performance of the word recognizer. The simplest is adding more data, as the size of the training data currently used remains relatively small. Another possible improvement is the use of learned edit distances [Ristad and Yianilos, 1998, McCallum et al., 2012] instead of vanilla ones. Beyond that, most of the loss in accuracy comes from segmentations created by the hybrid HMM model. Designing a neural net that factors context information into the hybrid HMM while computing posterior probabilities should help reduce that loss in accuracy. A slightly different approach, mapping directly from images to words without sequencing or beam search, might perform better, especially in situations with more data and manageable lexicon sizes (currently on the order of tens of thousands of words). Apart from that, further incorporation of techniques from speech recognition, like Long Short-Term Memories (LSTMs) [Hochreiter and Schmidhuber, 1997], may also lead to improved performance.

Chapter Five

Text Detection and End-to-End System

This chapter shows how the previous modules can be fitted into an end-to-end text recognition pipeline which, given an image, produces bounding boxes on areas containing textual information as well as transcriptions of the text therein. We begin with a formalization of the text detection and end-to-end text recognition problems in section 5.1; we then proceed to present related works on text detection and, more generally, end-to-end systems, in section 5.2.

5.1 Problem Definition

Formally, the text detection problem is defined as follows: given an image $x \in X_{all}$, where $X_{all}$ is the set of all images, create a function $f : X_{all} \to B^*$, where $B$ is the set of all rectangles defined within the image boundaries. In light of this definition, the end-to-end problem can be defined as creating a function $f : X_{all} \to (B \times Y^*)^*$, where $Y$ is the set of characters in some alphabet. The end-to-end system combines information from the text detection stage with recognition results from a text recognizer to produce recognized text.

5.2 Related Works

In this section, we will provide an overview of related works in the sub-fields of text detection and end-to-end pipelines.

5.2.1 Text Detection

Text detection is defined such that, given a natural image, the goal is to output bounding boxes on all words in the image. Abstractly speaking, the problem is an instance of the object detection problem, followed by segmenting text regions into their constituent words. Previous works investigated different approaches for text detection, typically trading off precision, recall, training time, and the time consumed manually designing features. Pre-trained CNNs [Coates et al., 2011, Wang et al., 2012] applied in a multi-scale sliding window fashion are highly accurate but very time consuming. Viola-Jones style classifiers remedy the slowness of CNNs, but have long training times and require manually engineered features [Hanif et al., 2008, Chen and Yuille, 2011]. Alternative methods that cleverly exploit the nature of text, such as Maximally Stable Extremal Regions (MSERs) [Matas et al., 2004] and the Stroke Width Transform [Epshtein et al., 2010], generally have lower accuracy but are fast to compute. Such methods were used successfully to detect text in [Neumann and Matas, 2011, Chen et al., 2011, Wang et al., 2011].

Little focus has been given to end-to-end recognition systems in the literature; most works focus on each part alone. While this approach is a valid first step, the fact that the subproblems intertwine and interact makes presenting end-to-end systems and optimizing them the logical second step. To the best of our knowledge, the works that present end-to-end systems are [Wang et al., 2011, Neumann and Matas, 2011, Wang et al., 2012].

5.2.2 End-to-End Pipelines

There are several ways to structure end-to-end pipelines. The simplest is a feed-forward structure where modules assume results from previous modules to be correct and deterministic; in such frameworks, pruning negative results is done by every module individually. A more complicated structure is a hypothesis verification structure, in which pruning is postponed until the last stage of the pipeline, allowing for pruning on multiple criteria. A third, most involved, structure is a closed loop structure, where hypotheses are continuously refined. This work conforms with other major works on end-to-end systems in adopting a hypothesis verification structure.

The works of [Wang et al., 2011, Neumann and Matas, 2011] combined techniques from computer vision and image processing with tools for character recognition to build the end-to-end pipeline. In contrast, [Wang et al., 2012] built their entire system using deep, pretrained, convolutional neural nets. These systems trade off many qualities, the most important of which are speed and accuracy. The systems of [Wang et al., 2011, Neumann and Matas, 2011] were less accurate but much faster than the work of [Wang et al., 2012]. There are two main factors behind this slowdown. The first is having a word recognition method that scales linearly in lexicon size, where inference has to be done for every word in the lexicon. The second is using convolutional neural nets for the text detection part. While CNNs offer very high accuracies on detection tasks, they tend to be very time consuming: the CNN has to be applied to every pixel of the image at multiple scales. Such application is very costly and can be done in reasonable time only with classifiers specifically designed to minimize this time, such as Viola-Jones style classifiers [Viola and Jones, 2001b].

5.3 Method

To extract text locations from an image, we start by extracting possible text candidates using Maximally Stable Extremal Regions (MSERs). MSERs are defined as regions in the image that are either maxima or minima of image intensity with respect to their surroundings. While being highly imprecise text detectors, they can be computed very quickly [Nistér and Stewénius, 2008]. The use of MSERs allows us to sidestep the enormous time penalty incurred by applying a costly recognizer at multiple scales of the image as in [Wang et al., 2012], making our system much more efficient.

Since MSERs ideally correspond to character regions, we form candidate line boxes by clustering the character candidates with DBSCAN [Ester et al., 1996] using multiple distances to obtain candidate line-level bounding boxes. After we obtain the line-level bounding boxes, we segment these lines using the Line-to-Word Hybrid HMM/Maxout model, trained to segment lines into words on the ICDAR 2003 scene training set. After this, we threshold the resulting word bounding boxes using the Word Detection Maxout, a four-layer convolutional Maxout network with the same architecture as the one used in Sec. 4.4, trained on text/non-text images extracted from the ICDAR 2003 scene training dataset. We also threshold words on the cost_v score produced by the word recognition module.
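A sketch of this detection front end using OpenCV's MSER implementation and scikit-learn's DBSCAN; the clustering here uses only the distance between box centres, whereas the actual system clusters with multiple distances, and the eps/min_samples values are illustrative rather than the thesis's settings.

import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def candidate_line_boxes(gray, eps=25.0, min_samples=2):
    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)   # each box is (x, y, w, h)
    if len(bboxes) == 0:
        return []
    centres = np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in bboxes])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(centres).labels_
    lines = []
    for k in set(labels) - {-1}:           # -1 marks DBSCAN noise
        cluster = bboxes[labels == k]
        x0, y0 = cluster[:, 0].min(), cluster[:, 1].min()
        x1 = (cluster[:, 0] + cluster[:, 2]).max()
        y1 = (cluster[:, 1] + cluster[:, 3]).max()
        lines.append((x0, y0, x1, y1))
    return lines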

Figure 5.1: End-to-end text recognition pipeline. Pentagons represent learned modules. The word recognition module shown here represents the full system from Figure 4.1.

We follow this pipeline with non-max suppression (NMS) [Neubeck and Van Gool, 2006] on word boxes that overlap by 30% of the area of their bounding box, according to the visual cost cost_v from the word recognition module (computed in Algorithm 2). Non-max suppression operates by suppressing, among overlapping predictions, all but the box with the highest visual cost.
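A greedy sketch of this suppression step, with boxes as (x0, y0, x1, y1) tuples and scores being the visual costs; the 30% overlap rule above is the default threshold.

def overlap_ratio(a, b):
    # intersection area divided by the area of box a
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area_a if area_a > 0 else 0.0

def non_max_suppression(boxes, scores, overlap_thresh=0.3):
    # visit boxes from highest to lowest visual cost; keep a box only if
    # it does not overlap an already-kept box beyond the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(overlap_ratio(boxes[i], boxes[j]) <= overlap_thresh
               for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]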

5.4 End-to-End Results

We tested the above system on both the ICDAR 2003 and SVT end-to-end scene text recognition test sets; each of the datasets contains 249 scene images. For the ICDAR 2003 dataset, we conduct tests under five scenarios. In the first three, the lexicons consist of {5, 20, 50} distractor words per image in addition to the ground truth words for that image; in the fourth scenario, all the test words are included in the lexicon; and in the fifth scenario, we use the same large lexicon we used to test the word recognition module (Sec. 4.8). We label these scenarios I-5, I-20, I-50, I-Full and I-Large respectively. The lexicons were provided by the authors of [Wang et al., 2011]. As for the SVT dataset, we conduct the tests using the lexicons provided with the dataset.

Figure 5.2: Sample end-to-end results. Samples from the end-to-end results; the purple boxes represent the ground truth and the green boxes represent the predictions.

We test the end-to-end system using the standard precision/recall metrics under the benchmarks specified in [Lucas et al., 2003], where a prediction is considered a hit when the area of the overlap between the predicted box and the target box is greater than 50% of the bounding box area and the predicted text matches the ground truth exactly. Table 5.1 compares our results to other results in the field. Despite our use of a simple, low-accuracy method like MSERs to extract possible text regions, our end-to-end system is able to outperform previous state-of-the-art end-to-end systems and produce reasonable results for large lexicons. Figure 5.3 shows the precision/recall curves for all the tasks on the ICDAR 2003 dataset and Figure 5.2 shows a few sample outputs from our system.

To increase the F-measures of the end-to-end system, we should seek to boost recall. As pointed out in [Neumann and Matas, 2011], MSERs do not offer high recall for character location extraction. The alternative of using a time-consuming but highly accurate classifier as in [Wang et al., 2012] is not practical if the end-to-end system is to work in real time. In our opinion, a promising solution would be to develop a Viola-Jones style cascade [Viola and Jones, 2001a] coupled with feature learning. Such an approach could offer a fast, accurate, easy-to-train and feature-engineering-free text detector that would lead to increased recall.
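The hit criterion used above can be written down directly; boxes are (x0, y0, x1, y1) and, as an assumption of this sketch, the 50% threshold is taken relative to the ground-truth box's area.

def is_hit(pred_box, pred_text, gt_box, gt_text):
    ix = max(0.0, min(pred_box[2], gt_box[2]) - max(pred_box[0], gt_box[0]))
    iy = max(0.0, min(pred_box[3], gt_box[3]) - max(pred_box[1], gt_box[1]))
    overlap = ix * iy
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    # hit: overlap exceeds 50% of the box area and the text matches exactly
    return overlap > 0.5 * gt_area and pred_text == gt_text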

Table 5.1: End-to-end F-measures on the ICDAR 2003 and SVT datasets.

Work                  I-5   I-20   I-50   I-Full   I-Large   SVT
[Wang et al., 2011]   72    70     68     51       -         38
[Wang et al., 2012]   76    74     72     67       -         46
This work             80    79     77     70       63        48


Figure 5.3: Precision/recall curves for the end-to-end system. Precision/recall curves for the end-to-end system on the ICDAR 2003 dataset under different lexicon sizes.

5.5 On Training an End-to-End System via Gradient Descent

In [LeCun et al., 1998], the authors propose a complete method for end-to-end training of a recognition system. Abstractly, their method is quite similar to ours, and to most other recognition pipelines, involving feature extraction, segmentation and beam search decoding. The key high-level difference is in training the components together as opposed to training them independently.

Intuitively, the merit of end-to-end training is to train the signal-specific model (in our case, the hybrid model) with the knowledge that its outputs will be fed into a language model and a particular decoding algorithm. This information should help the signal model tailor its outputs as inputs to the decoding algorithm, which may lead to improved end-to-end performance. While this sounds plausible, the downside of such a training procedure could be having the language model overfit to the set of targets available in the training set, since the set of training signals available tends to be much smaller than the set of words available to draw a language model from. It should be noted, however, that for the system presented in [LeCun et al., 1998], end-to-end training does indeed help (see Figure 32 of that work).

For our particular system, the problem of detecting and recognizing words directly from scene images is less straightforward. While one could conceptually train a system similar to [LeCun et al., 1998], it appears that due to the size of the input data, such a training procedure would take too long to converge, as convolutional networks would need to be swept over input images, which tend to have high resolutions. However, it is conceivable that such a training procedure could be implemented with more resources. Whether it would lead to improved generalization remains to be seen.

Chapter Six

Discussion

In this chapter, we discuss the contributions presented in this thesis, the limitations of the work, and possible avenues for future work.

6.1 Contributions

Engineering an end-to-end system is a significant effort; this thesis presents a detailed account of designing such a system. We start with the character recognition problem, for which we propose the application of a convolutional Maxout network, leading to a state-of-the-art result compared to pre-trained CNNs. We then proceed to the word recognition problem, for which we propose a method inspired by speech recognition: we segment words into constituent regions and apply a variant of beam search on the induced graph, factoring in knowledge of the word's appearance and a character-level language model for the particular language being recognized. Our approach is both highly accurate and scalable to language models with tens of thousands of words (and possibly more). Then we proceed to the complete end-to-end problem, where we propose a hypothesis verification pipeline with a novel text detection stage that leads to improved end-to-end recognition rates.

6.2 Limitations

While our work surmounts some of the obstacles previously faced in this sub-field, many obstacles remain. The first are data-related obstacles, as training and testing set sizes are relatively small (roughly one thousand words for training and one thousand for testing). It is therefore likely that current leading methods, including ours, overfit on the test set. Beyond that, several problem-specific obstacles also exist. On the character recognition front, the main obstacle is inherently related to the character-confusion problem, for which, on the character level, no clear solution is available. On the word recognition front, scalability to large lexicons does not seem to be an issue; accuracy-wise, however, our method is still not accurate enough for industrial use. On the end-to-end recognition front, the main challenge is detection speed. As pointed out in section 5.4, while our method is reasonably fast for offline end-to-end recognition, requiring a few seconds per image, it is still not fast enough for real-time use. Currently, the main bottleneck remains the convolutional part of the neural network, as it plays a role in both detection and recognition. However, the system's overall running speed could perhaps be improved through a more optimized implementation or through changing the detection models.

6.3 Future Work

As this thesis is mainly concerned with the design of an end-to-end system, possible improvements exist for every module. Broadly speaking, adding more data and scaling up the system is the easiest way to improve performance. Beyond that, investigating other models from speech recognition, like Recurrent Neural Networks or Long Short-Term Memories, seems a promising alternative to convolutional networks for improving the recognizer's accuracy. That said, the main challenge in designing such systems remains the detection phase, for which no real-time, highly accurate solution exists.

We believe this work could be utilized in various scenarios: for one, in aiding self-driving vehicles in recognizing street names to improve navigation; also, in allowing advertisers to offer customized service when a customer sees a certain word or trademark through a “smart” device. On a higher level, it is likely that pipelines similar to ours can be utilized for handwriting and speech recognition systems. Whether their performance would surpass current methods remains to be seen.

Bibliography

[Barron, 1993] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. Information Theory, IEEE Transactions on, 39(3):930–945.

[Bengio et al., 1992] Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural network-hidden markov model hybrid. Neural Networks, IEEE Transactions on, 3(2):252–259.

[Bengio et al., 1995] Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A nn/hmm hybrid for on-line handwriting recognition. Neural Computation, 7(6):1289–1303.

[Bergstra et al., 2010] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

[Bishop, 1995] Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford university press.

[Bissacco et al., 2013] Bissacco, A., Cummins, M., Netzer, Y., and Neven, H. (2013). Photoocr: Reading text in uncontrolled conditions.


[Blitzer et al., 2005] Blitzer, J., Weinberger, K. Q., and Saul, L. K. (2005). Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pages 1473–1480.

[Bourlard and Morgan, 1998] Bourlard, H. and Morgan, N. (1998). Hybrid hmm/ann systems for speech recognition: Overview and new research directions. In Adaptive Processing of Sequences and Data Structures, pages 389–417. Springer.

[Bousquet et al., 2004] Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to statistical learning theory. pages 169–207.

[Bruna and Mallat, 2013] Bruna, J. and Mallat, S. (2013). Invariant scattering convolution networks. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1872–1886.

[Chen et al., 2011] Chen, H., Tsai, S. S., Schroth, G., Chen, D. M., Grzeszczuk, R., and Girod, B. (2011). Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 2609–2612. IEEE.

[Chen and Yuille, 2011] Chen, X. and Yuille, A. (2011). Adaboost learning for detecting and reading text in city scenes.

[Ciresan et al., 2012] Ciresan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE.

[Coates et al., 2011] Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., Wu, D. J., and Ng, A. Y. (2011). Text detection and character recognition in scene images with unsupervised feature learning. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 440–445. IEEE.

[Dahl et al., 2012] Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42.

[de Campos et al., 2009] de Campos, T. E., Babu, B. R., and Varma, M. (2009). Character recognition in natural images.

[Decoste and Schölkopf, 2002] Decoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1-3):161–190.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.

[Epshtein et al., 2010] Epshtein, B., Ofek, E., and Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE.

[Ester et al., 1996] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise.

[Goel et al., 2013] Goel, V., Mishra, A., Alahari, K., and Jawahar, C. V. (2013). Whole is greater than sum of parts: Recognizing scene text words. In Proceedings of International Conference on Document Analysis and Recognition.

[Goodfellow et al., 2013a] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013a). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[Goodfellow et al., 2013b] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout networks. In ICML.

[Graves et al., 2013] Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. arXiv preprint arXiv:1303.5778.

[Grosicki and El Abed, 2009] Grosicki, E. and El Abed, H. (2009). Icdar 2009 handwriting recognition competition. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages 1398–1402. IEEE.

[Hanif et al., 2008] Hanif, S. M., Prevost, L., and Negri, P. A. (2008). A cascade detector for text detection in natural scene images. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE.

[Hinton et al., 2012a] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012a). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97.

[Hinton et al., 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554.

[Hinton et al., 2012b] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

[Hornik et al., 1989] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366.

[Hu et al., 1996] Hu, J., Brown, M. K., and Turin, W. (1996). Hmm based online handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 18(10):1039–1045.

[Hubel and Wiesel, 1968] Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195:215–243.

[Ingber, 1993] Ingber, L. (1993). Simulated annealing: Practice versus theory. Mathematical and computer modelling, 18(11):29–57.

[Kégl and Busa-Fekete, 2009] Kégl, B. and Busa-Fekete, R. (2009). Boosting products of base classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 497–504. ACM.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114.

[LeCun, 1986] LeCun, Y. (1986). Learning processes in an asymmetric threshold network. In Bienenstock, E., Fogelman-Soulié, F., and Weisbuch, G., editors, Disordered systems and biological organization, pages 233–240, Les Houches, France. Springer-Verlag.

[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

[Liwicki et al., ] Liwicki, M., Graves, A., Bunke, H., and Schmidhuber, J. A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks.

[Lucas et al., 2003] Lucas, S. M., Panaretos, A., Sosa, L., Tang, A., Wong, S., and Young, R. (2003). Icdar 2003 robust reading competitions.

[Matas et al., 2004] Matas, J., Chum, O., Urban, M., and Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and vision computing, 22(10):761–767.

[McCallum et al., 2012] McCallum, A., Bellare, K., and Pereira, F. (2012). A conditional random field for discriminatively-trained finite-state string edit distance. arXiv preprint arXiv:1207.1406.

[Mishra et al., 2012] Mishra, A., Alahari, K., Jawahar, C., et al. (2012). Scene text recognition using higher order language priors.

[Mohri, 1997] Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational linguistics, 23(2):269–311.

[Montana and Davis, ] Montana, D. J. and Davis, L. Training feedforward neural networks using genetic algorithms.

[Morgan and Bourlard, 1995] Morgan, N. and Bourlard, H. (1995). Continuous speech recognition. Signal Processing Magazine, IEEE, 12(3):24–42.

[Nair and Hinton, 2010] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

[Netzer et al., ] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning.

[Neubeck and Van Gool, 2006] Neubeck, A. and Van Gool, L. (2006). Efficient non-maximum suppression. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pages 850–855. IEEE.

[Neumann and Matas, 2011] Neumann, L. and Matas, J. (2011). A method for text localization and recognition in real-world images. In Computer Vision–ACCV 2010, pages 770–783. Springer.

[Nistér and Stewénius, 2008] Nistér, D. and Stewénius, H. (2008). Linear time maximally stable extremal regions. In Computer Vision–ECCV 2008, pages 183–196. Springer.

[Novikova et al., 2012] Novikova, T., Barinova, O., Kohli, P., and Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In Computer Vision–ECCV 2012, pages 752–765. Springer.

[Rabiner, 1989] Rabiner, L. R. (1989). A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

[Renals et al., 1994] Renals, S., Morgan, N., Bourlard, H., Cohen, M., and Franco, H. (1994). Connectionist probability estimators in hmm speech recognition. Speech and Audio Processing, IEEE Transactions on, 2(1):161–174.

[Ristad and Yianilos, 1998] Ristad, E. S. and Yianilos, P. N. (1998). Learning string-edit distance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(5):522–532.

[Rumelhart et al., 1985] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, DTIC Document.

[Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

[Russell and Norvig, ] Russell, S. J. and Norvig, P. Artificial intelligence: a modern approach, volume 74. Prentice hall Englewood Cliffs.

[Shi et al., 2013] Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S., and Zhang, Z. (2013). Scene text recognition using part-based tree-structured character detection.

[Valtchev et al., 1997] Valtchev, V., Odell, J., Woodland, P. C., and Young, S. J. (1997). Mmie training of large vocabulary recognition systems. Speech Communication, 22(4):303–314.

[Vapnik, 1998] Vapnik, V. N. (1998). Statistical learning theory.

[Vertanen, ] Vertanen, K. An overview of discriminative training for speech recognition.

[Viola and Jones, 2001a] Viola, P. and Jones, M. (2001a). Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511. IEEE.

[Viola and Jones, 2001b] Viola, P. and Jones, M. (2001b). Robust real-time ob- ject detection.

[Wan et al., 2013] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proc. International Conference on Machine Learning (ICML'13).

[Wang et al., 2011] Wang, K., Babenko, B., and Belongie, S. (2011). End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457–1464. IEEE.

[Wang and Belongie, 2010] Wang, K. and Belongie, S. (2010). Word spotting in the wild. In Computer Vision–ECCV 2010, pages 591–604. Springer.

[Wang et al., 2012] Wang, T., Wu, D. J., Coates, A., and Ng, A. Y. (2012). End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308. IEEE.

[Weston et al., 2012] Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer.

[Zeiler and Fergus, 2013] Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.