Gradient-Based Learning Applied to Document Recognition
YANN LECUN, MEMBER, IEEE, LÉON BOTTOU, YOSHUA BENGIO, AND PATRICK HAFFNER

Invited Paper

Manuscript received November 1, 1997; revised April 17, 1998. Y. LeCun, L. Bottou, and P. Haffner are with the Speech and Image Processing Services Research Laboratory, AT&T Labs-Research, Red Bank, NJ 07701 USA. Y. Bengio is with the Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Québec H3C 3J7 Canada.

Abstract—Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2-D) shapes, are shown to outperform all other techniques.

Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN's), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks.

A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.

Keywords— Convolutional neural networks, document recognition, finite state transducers, gradient-based learning, graph transformer networks, machine learning, neural networks, optical character recognition (OCR).

NOMENCLATURE
GT      Graph transformer.
GTN     Graph transformer network.
HMM     Hidden Markov model.
HOS     Heuristic oversegmentation.
K-NN    K-nearest neighbor.
NN      Neural network.
OCR     Optical character recognition.
PCA     Principal component analysis.
RBF     Radial basis function.
RS-SVM  Reduced-set support vector method.
SDNN    Space displacement neural network.
SVM     Support vector method.
TDNN    Time delay neural network.
V-SVM   Virtual support vector method.

I. INTRODUCTION

Over the last several years, machine learning techniques, particularly when applied to NN's, have played an increasingly important role in the design of pattern recognition systems. In fact, it could be argued that the availability of learning techniques has been a crucial factor in the recent success of pattern recognition applications such as continuous speech recognition and handwriting recognition.

The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning and less on hand-designed heuristics. This is made possible by recent progress in machine learning and computer technology. Using character recognition as a case study, we show that hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images. Using document understanding as a case study, we show that the traditional way of building recognition systems by manually integrating individually designed modules can be replaced by a unified and well-principled design paradigm, called GTN's, which allows training all the modules to optimize a global performance criterion.
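As a minimal, self-contained illustration of the gradient-based learning framework referred to above, the sketch below trains a small two-layer network with plain gradient descent and backpropagated gradients on a toy two-class problem. The layer sizes, learning rate, loss, and synthetic data are assumptions chosen for this example; they are not the architectures or training setups studied in the paper.

```python
# Minimal sketch of gradient-based learning: a two-layer network trained with
# plain gradient descent (gradients obtained by backpropagation) on a toy
# two-class problem. Layer sizes, learning rate, and the synthetic data are
# illustrative choices, not the setups used in the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs in 2-D, labels 0 and 1.
X = np.vstack([rng.normal(-1.0, 0.5, (100, 2)), rng.normal(1.0, 0.5, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)]).reshape(-1, 1)

# Trainable parameters of a small 2-8-1 network.
W1 = rng.normal(0.0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

eta = 0.5  # learning rate
for step in range(2000):
    # Forward pass: compute the network output for every training pattern.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

    # Backward pass: gradients of the loss with respect to each parameter.
    n = X.shape[0]
    dout = (p - y) / n                  # gradient at the pre-sigmoid output
    dW2 = h.T @ dout; db2 = dout.sum(axis=0)
    dh = (dout @ W2.T) * (1.0 - h**2)   # back through the tanh hidden layer
    dW1 = X.T @ dh;   db1 = dh.sum(axis=0)

    # Gradient-descent update: move every parameter against its gradient.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(f"final loss {loss:.3f}, training accuracy {np.mean((p > 0.5) == y):.2f}")
```

The essential mechanism is the repeated update of every trainable parameter against the gradient of a single loss, the same mechanism that later sections extend to convolutional networks and to globally trained multimodule systems.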
Since the early days of pattern recognition it has been known that the variability and richness of natural data, be it speech, glyphs, or other types of patterns, make it almost impossible to build an accurate recognition system entirely by hand. Consequently, most pattern recognition systems are built using a combination of automatic learning techniques and hand-crafted algorithms. The usual method of recognizing individual patterns consists in dividing the system into two main modules, shown in Fig. 1. The first module, called the feature extractor, transforms the input patterns so that they can be represented by low-dimensional vectors or short strings of symbols that: 1) can be easily matched or compared and 2) are relatively invariant with respect to transformations and distortions of the input patterns that do not change their nature. The feature extractor contains most of the prior knowledge and is rather specific to the task. It is also the focus of most of the design effort, because it is often entirely hand crafted. The classifier, on the other hand, is often general purpose and trainable.

Fig. 1. Traditional pattern recognition is performed with two modules: a fixed feature extractor and a trainable classifier.
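To make the two-module decomposition of Fig. 1 concrete, the following sketch pairs a fixed, hand-crafted feature extractor with a trainable classifier. The particular features (row and column ink histograms), the softmax linear classifier, and the random toy images are illustrative assumptions, not the feature sets or classifiers surveyed in the paper.

```python
# Minimal sketch of the traditional two-module recognizer of Fig. 1: a fixed,
# hand-crafted feature extractor feeding a trainable classifier. The features
# (row/column ink histograms), the classifier, and the random toy images are
# stand-ins for illustration, not the methods compared in the paper.
import numpy as np

def extract_features(image):
    """Fixed module: map a 2-D grayscale image to a short feature vector
    (normalized row and column ink histograms). Never trained."""
    feats = np.concatenate([image.sum(axis=1), image.sum(axis=0)]).astype(float)
    total = feats.sum()
    return feats / total if total > 0 else feats

class LinearClassifier:
    """Trainable module: multiclass linear classifier fit by gradient descent
    on a softmax cross-entropy loss."""
    def __init__(self, n_features, n_classes, lr=0.1):
        self.W = np.zeros((n_features, n_classes))
        self.b = np.zeros(n_classes)
        self.lr = lr

    def fit(self, F, labels, steps=500):
        onehot = np.eye(self.W.shape[1])[labels]
        for _ in range(steps):
            scores = F @ self.W + self.b
            scores -= scores.max(axis=1, keepdims=True)   # numerical stability
            probs = np.exp(scores)
            probs /= probs.sum(axis=1, keepdims=True)
            grad = (probs - onehot) / F.shape[0]
            self.W -= self.lr * (F.T @ grad)               # only the classifier
            self.b -= self.lr * grad.sum(axis=0)           # receives gradients

    def predict(self, F):
        return np.argmax(F @ self.W + self.b, axis=1)

# Usage with made-up 16x16 "images" and labels.
rng = np.random.default_rng(1)
images = rng.random((50, 16, 16))
labels = rng.integers(0, 10, size=50)
F = np.stack([extract_features(im) for im in images])
clf = LinearClassifier(n_features=F.shape[1], n_classes=10)
clf.fit(F, labels)
print("training accuracy:", np.mean(clf.predict(F) == labels))
```

Note that only the classifier's parameters receive gradient updates; the feature extractor is frozen by construction, which is precisely the limitation discussed next.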
One of the main problems with this approach is that the recognition accuracy is largely determined by the ability of the designer to come up with an appropriate set of features. This turns out to be a daunting task which, unfortunately, must be redone for each new problem. A large amount of the pattern recognition literature is devoted to describing and comparing the relative merits of different feature sets for particular tasks.

Historically, the need for appropriate feature extractors was due to the fact that the learning techniques used by the classifiers were limited to low-dimensional spaces with easily separable classes [1]. A combination of three factors has changed this vision over the last decade. First, the availability of low-cost machines with fast arithmetic units allows for more reliance on brute-force "numerical" methods than on algorithmic refinements. Second, the availability of large databases for problems with a large market and wide interest, such as handwriting recognition, has enabled designers to rely more on real data and less on hand-crafted feature extraction to build recognition systems. The third and very important factor is the availability of powerful machine learning techniques that can handle high-dimensional inputs and can generate intricate decision functions when fed with these large data sets. It can be argued that the recent progress in the accuracy of speech and handwriting recognition systems can be attributed in large part to this increased reliance on learning techniques and large training data sets.

In this paper, we consider the task of character recognition (Sections I and II) and compare the performance of several learning techniques on a benchmark data set for handwritten digit recognition (Section III). While more automatic learning is beneficial, no learning technique can succeed without a minimal amount of prior knowledge about the task. In the case of multilayer NN's, a good way to incorporate knowledge is to tailor their architecture to the task. Convolutional NN's [2], introduced in Section II, are an example of specialized NN architectures which incorporate knowledge about the invariances of two-dimensional (2-D) shapes by using local connection patterns and by imposing constraints on the weights. A comparison of several methods for isolated handwritten digit recognition is presented in Section III. To go from the recognition of individual characters to the recognition of words and sentences in documents, the idea of combining multiple modules trained to reduce the overall error is introduced in Section IV. Recognizing variable-length objects such as handwritten words using multimodule systems is best done if the modules manipulate directed graphs. This leads to the concept of a trainable GTN, also introduced in Section IV. Section V describes the now classical method of HOS for recognizing words or other character strings. Discriminative and nondiscriminative gradient-based techniques for training a recognizer at the word level without requiring manual segmentation and labeling are presented in Section VI. Section VII presents the promising space-displacement NN approach that eliminates the need for segmentation heuristics by scanning a recognizer at all possible locations on the input. In Section VIII, it is shown that trainable GTN's can be formulated as multiple generalized transductions based on a general graph composition algorithm. The connections between GTN's and HMM's, commonly used in speech recognition, are also treated. Section IX describes a globally trained GTN system for recognizing handwriting entered in a pen computer. This problem is known as "online" handwriting recognition since the machine must produce immediate feedback as the user writes. The core of the system is a convolutional NN. The results clearly demonstrate the advantages of training a recognizer at the word level, rather than training it on presegmented, hand-labeled, isolated characters. Section X describes a complete GTN-based system for reading handwritten and machine-printed bank checks. The core of the system is the convolutional NN called LeNet-5, which is described in Section II. This system is in commercial use in the NCR Corporation line of check recognition systems for the banking industry. It is reading millions of checks per month in several banks across the United States.

A. Learning from Data

There are several approaches to automatic machine learning