<<

PROC OF THE IEEE NOVEMBER

GradientBased Learning Applied to Do cument

Recognition

Yann LeCun Leon Bottou and Patrick Haner

Abstract I Introduction

Multilayer Neural Networks trained with the backpropa

gation algorithm constitute the b est example of a successful

Over the last several years techniques

GradientBased Learning technique Given an appropriate

particularly when applied to neural networks haveplayed

network architecture GradientBased Learning algorithms

an increasingly imp ortant role in the design of pattern

can b e used to synthesize a complex decision surface that can

recognition systems In fact it could be argued that the

classify highdimensional patterns such as handwritten char

acters with minimal prepro cessing This pap er reviews var

availability of learning techniques has b een a crucial fac

ious metho ds applied to handwritten character recognition

tor in the recent success of applica

and compares them on a standard handwritten digit recog

tions suchascontinuous sp eech recognition and handwrit

nition task Convolutional Neural Networks that are sp ecif

ically designed to deal with the variability of D shap es are

ing recognition

shown to outp erform all other techniques

The main message of this pap er is that b etter pattern

Reallife do cument recognition systems are comp osed

recognition systems can b e built by relying more on auto

of multiple mo dules including eld extraction segmenta

tion recognition and language mo deling A new learning

matic learning and less on handdesigned heuristics This

paradigm called Graph Transformer Networks GTN al

is made p ossible by recent progress in machine learning

ultimo dule systems to b e trained globally using lows suchm

and computer technology Using character recognition as

GradientBased metho ds so as to minimize an overall p er

a case study we show that handcrafted feature extrac

formance measure

Two systems for online are de

tion can b e advantageously replaced by carefully designed

scrib ed Exp eriments demonstrate the advantage of global

learning machines that op erate directly on pixel images

training and the exibility of Graph Transformer Networks

Using do cument understanding as a case study we show

A Graph Transformer Network for reading bank checkis

that the traditional way of building recognition systems by

also describ ed It uses Convolutional Neural Network char

acter recognizers combined with global training techniques

manually integrating individually designed mo dules can b e

to provides record accuracy on business and p ersonal checks

replaced by a unied and wellprincipled design paradigm

It is deployed commercially and reads several million checks

called Graph Transformer Networks that allows training

per day

all the mo dules to optimize a global p erformance criterion

Keywords Neural Networks OCR Do cument Recogni

tion Machine Learning GradientBased Learning Convo

Since the early days of pattern recognition it has been

lutional Neural Networks Graph Transformer Networks Fi

known that the variability and richness of natural data

nite State Transducers

be it sp eech glyphs or other typ es of patterns make it

almost imp ossible to build an accurate recognition system

entirely by hand Consequen tly most pattern recognition

Nomenclature

systems are built using a combination of automatic learn

GT Graph transformer

ing techniques and handcrafted algorithms The usual

GTN Graph transformer network

metho d of recognizing individual patterns consists in divid

HMM Hidden Markov mo del

ing the system into twomain mo dules shown in gure

HOS Heuristic oversegmentation

The rst mo dule called the feature extractor transforms

KNN Knearest neighb or

the input patterns so that they can b e represented bylow

NN Neural network

dimensional vectors or short strings of symb ols that a can

OCR Optical character recognition

b e easily matched or compared and b are relatively in

PCA Principal comp onent analysis

variant with resp ect to transformations and distortions of

RBF Radial basis function

the input patterns that do not change their nature The

RSSVM Reducedset supp ort vector metho d

feature extractor contains most of the prior knowledge and

SDNN Space displacement neural network

is rather sp ecic to the task It is also the fo cus of most of

SVM Supp ort vector metho d

the design eort b ecause it is often entirely handcrafted

TDNN Time delay neural network

The classier on the other hand is often generalpurp ose

VSVM Virtual supp ort vector metho d

and trainable One of the main problems with this ap

proach is that the recognition accuracy is largely deter

The authors are with the Sp eech and Image Pro

mined by the ability of the designer to come up with an

cessing Services Research Lab oratory ATT Labs

appropriate set of features This turns out to b e a daunt

Research Schulz Drive Red Bank NJ Email

fyannleonbyoshuahanergresearchattcom Yoshua Bengio

ing task which unfortunatelymust b e redone for eachnew

is also with the Departement dInformatique et de Recherche

problem A large amount of the pattern recognition liter

Operationelle UniversitedeMontreal CP Succ CentreVille

Chemin de la Tour Montreal Qu eb ec HC J ature is devoted to describing and comparing the relative

PROC OF THE IEEE NOVEMBER

Class scores

manipulate directed graphs This leads to the concept of

ransformer Network GTN also intro trainable Graph T

in Section IV Section V describ es the now clas

TRAINABLE CLASSIFIER MODULE duced

sical metho d of heuristic oversegmentation for recogniz

words or other character strings Discriminative and

Feature vector ing

nondiscriminative gradientbased techniques for training

a recognizer at the word level without requiring manual

FEATURE EXTRACTION MODULE

segmentation and lab eling are presented in Section VI Sec

tion VI I presents the promising SpaceDisplacementNeu

Raw input

ral Network approach that eliminates the need for seg

mentation heuristics by scanning a recognizer at all pos

Fig Traditional pattern recognition is p erformed with twomod

sible lo cations on the input In section VIII it is shown

ules a xed feature extractor and a trainable classier

that trainable Graph Transformer Networks can be for

mulated as multiple generalized transductions based on a

general graph comp osition algorithm The connections b e

merits of dierent feature sets for particular tasks

tween GTNs and Hidden Markov Mo dels commonly used

Historically the need for appropriate feature extractors

in sp eech recognition is also treated Section IX describ es

was due to the fact that the learning techniques used by

a globally trained GTN system for recognizing handwrit

the classiers were limited to lowdimensional spaces with

ing entered in a p en computer This problem is known as

easily separable classes Acombination of three factors

online handwriting recognition since the machine must

have changed this vision over the last decade First the

pro duce immediate feedback as the user writes The core of

availabilityoflowcost machines with fast arithmetic units

the system is a Convolutional Neural Network The results

allows to rely more on bruteforce numerical metho ds

clearly demonstrate the advantages of training a recognizer

than on algorithmic renements Second the availability

at the word level rather than training it on presegmented

of large databases for problems with a large market and

handlab eled isolated characters Section X describ es a

wide interest such as handwriting recognition has enabled

complete GTNbased system for reading handwritten and

designers to rely more on real data and less on handcrafted

machineprinted bank checks The core of the system is the

feature extraction to build recognition systems The third

Convolutional Neural Network called LeNet describ ed in

and very imp ortantfactoristheavailabilityofpowerful ma

Section I I This system is in commercial use in the NCR

chine learning techniques that can handle highdimensional

Corp oration line of check recognition systems for the bank

inputs and can generate intricate decision functions when

ing industry It is reading millions of checks p er month in

fed with these large data sets It can b e argued that the

several banks across the

recent progress in the accuracy of sp eech and handwriting

recognition systems can b e attributed in large part to an

A Learning from Data

increased reliance on learning techniques and large training

data sets As evidence to this fact a large prop ortion of

There are several approaches to automatic machine

mo dern commercial OCR systems use some form of multi

learning but one of the most successful approaches p op

Neural Network trained with backpropagation

ularized in recentyears by the neural network community

In this studywe consider the tasks of handwritten char can b e called numerical or gradientbased learning The

p p

acter recognition Sections I and I I and compare the p er learning machine computes a function Y F Z W

p

formance of several learning techniques on a benchmark where Z is the pth input pattern and W represents the

data set for handwritten digit recognition Section I I I collection of adjustable parameters in the system In a

p

While more automatic learning is b enecial no learning pattern recognition setting the output Y may be inter

p

technique can succeed without a minimal amount of prior preted as the recognized class lab el of pattern Z or as

knowledge ab out the task In the case of multilayer neu scores or probabilities asso ciated with each class A loss

p p p

FWZ measures the discrep ral networks a good way to incorp orate knowledge is to function E D D

p

Neu ancy b etween D the correct or desired output for pat tailor its architecture to the task Convolutional

p

ral Networks intro duced in Section II are an exam tern Z and the output pro duced by the system The

ple of sp ecialized neural network architectures which in average E W is the average of the er

tr ain

p

corp orate knowledge ab out the invariances of D shap es rors E over a set of lab eled examples called the training

P P

by using lo cal connection patterns and by imp osing con set fZ D Z D g In the simplest setting the

straints on the weights A comparison of several metho ds learning problem consists in nding the value of W that

for isolated handwritten digit recognition is presented in minimizes E W In practice the p erformance of the

tr ain

section I I I To go from the recognition of individual char system on a training set is of little interest The more rel

acters to the recognition of words and sentences in do cu evant measure is the error rate of the system in the eld

be used in practice This p erformance is ments the idea of combining multiple mo dules trained to where it would

estimated by measuring the accuracy on a set of samples reduce the overall error is intro duced in Section IV Rec

disjoint from the training set called the test set Much ognizing variablelength ob jects such as handwritten words

theoretical and exp erimental work has shown using multimo dule systems is best done if the mo dules

PROC OF THE IEEE NOVEMBER

that the gap between the exp ected error rate on the test Hessian matrix as in Newton or QuasiNewton metho ds

set E and the error rate on the training set E de The Conjugate Gradient metho d can also be used

test tr ain

creases with the numb er of training samples approximately However App endix B shows that despite many claims

as to the contrary in the literature the usefulness of these

secondorder metho ds to large learning machines is very

E E k hP

test tr ain

limited

where P is the numb er of training samples h is a measure of

A p opular minimization pro cedure is the sto chastic gra

eective capacity or complexity of the machine

dient algorithm also called the online up date It consists

is a number between and and k is a constant This

in up dating the parameter vector using a noisyorapprox

gap always decreases when the numb er of training samples

imated version of the average gradient In the most com

increases Furthermore as the capacity h increases E

tr ain

mon instance of it W is up dated on the basis of a single

decreases Therefore when increasing the capacity hthere

sample

p

is a tradeo between the decrease of E and the in

k

tr ain

W E

W W

k k 

crease of the gap with an optimal value of the capacity h

W

that achieves the lowest E Most

test

With this pro cedure the parameter vector uctuates

learning algorithms attempt to minimize E as well as

tr ain

around an average tra jectory but usually converges consid

some estimate of the gap A formal version of this is called

erably faster than regular and second or

structural risk minimization and is based on den

der metho ds on large training sets with redundant samples

ing a sequence of learning machines of increasing capacity

such as those encountered in sp eechorcharacter recogni

corresp onding to a sequence of subsets of the parameter

tion The reasons for this are explained in App endix B

space such that each subset is a sup erset of the previous

The prop erties of such algorithms applied to learning have

subset In practical terms Structural Risk Minimization

b een studied theoretically since the s

is implemented by minimizing E H W where the

tr ain

but practical successes for nontrivial tasks did not o ccur

function H W is called a regularization function and is

until the mid eighties

a constant H W is chosen such that it takes large val

ues on parameters W that b elong to highcapacity subsets

C Gradient BackPropagation

of the parameter space Minimizing H W in eect lim

GradientBased Learning pro cedures have been used

its the capacity of the accessible subset of the parameter

since the late s but they were mostly limited to lin

space thereby controlling the tradeo between minimiz

ear systems The surprising usefulness of such sim

ing the training error and minimizing the exp ected gap

ple gradient descenttechniques for complex machine learn

between the training error and test error

ing tasks was not widely realized until the following three

events o ccurred The rst event was the realization that

B GradientBased Learning

trary the presence despite early warnings to the con

The general problem of minimizing a function with re

of lo cal minima in the loss function do es not seem to

sp ect to a set of parameters is at the ro ot of many issues in

be a ma jor problem in practice This b ecame apparent

GradientBased Learning draws on the

when it was noticed that lo cal minima did not seem to

fact that it is generally much easier to minimize a reason

b e a ma jor imp ediment to the success of early nonlinear

ably smo oth continuous function than a discrete combi

gradientbased Learning techniques such as Boltzmann ma

natorial function The loss function can b e minimized by

chines The second eventwas the p opularization

estimating the impact of small variations of the parame

by Rumelhart Hinton and Williams and others of a

ter values on the loss function This is measured by the

simple and ecient pro cedure the backpropagation al

gradient of the loss function with resp ect to the param

gorithm to compute the gradient in a nonlinear system

eters Ecient learning algorithms can be devised when

comp osed of several layers of pro cessing The third event

the gradient vector can be computed analytically as op

was the demonstration that the backpropagation pro ce

posed to numerically through p erturbations This is the

dure applied to multilay er neural networks with sigmoidal

basis of numerous gradientbased learning algorithms with

units can solve complicated learning tasks The basic idea

continuousvalued parameters In the pro cedures describ ed

of backpropagation is that gradients can b e computed e

in this article the set of parameters W is a realvalued vec

ciently by propagation from the output to the input This

tor with resp ect to which E W is continuous as well as

idea was describ ed in the control theory literature of the

dierentiable almost everywhere The simplest minimiza

early sixties but its application to machine learning

tion pro cedure in such a setting is the gradient descent

was not generally realized then Interestingly the early

algorithm where W is iteratively adjusted as follows

derivations of backpropagation in the context of neural

network learning did not use gradients but virtual tar

EW

W W

k k 

gets for units in intermediate layers or minimal

W

disturbance arguments The Lagrange formalism used

In the simplest case is a scalar constant More sophisti in the control theory literature provides p erhaps the b est

cated pro cedures use variable or substitute it for a diag rigorous metho d for deriving backpropagation and for

onal matrix or substitute it for an estimate of the inverse deriving generalizations of backpropagation to recurrent

PROC OF THE IEEE NOVEMBER

networks and networks of heterogeneous mo dules ferentiable and therefore lends itself to the use of Gradient

A simple derivation for generic multilayer systems is given Based Learning metho ds Section V intro duces the use of

in Section IE directed acyclic graphs whose arcs carry numerical infor

mation as a way to represent the alternative hyp otheses

The fact that lo cal minima do not seem to b e a problem

and intro duces the idea of GTN

for multilayer neural networks is somewhat of a theoretical

mystery It is conjectured that if the network is oversized

The second solution describ ed in Section VI I is to elim

for the task as is usually the case in practice the presence inate segmentation altogether The idea is to sweep the

of extra dimensions in parameter space reduces the risk recognizer over every p ossible lo cation on the input image

of unattainable regions Backpropagation is by far the and to rely on the character sp otting prop erty of the rec

most widely used neuralnetwork learning algorithm and

ognizer ie its ability to correctly recognize a wellcentered

probably the most widely used learning algorithm of any

character in its input eld even in the presence of other

form characters b esides it while rejecting images containing no

centered characters The sequence of recognizer

D Learning in Real Handwriting Recognition Systems

eeping the recognizer over the in outputs obtained by sw

put is then fed to a Graph Transformer Network that takes

Isolated handwritten character recognition has b een ex

linguistic constraints into account and nally extracts the

tensively studied in the literature see for reviews

most likely interpretation This GTN is somewhat similar

and was one of the early successful applications of neural

to Hidden Markov Mo dels HMM which makes the ap

networks Comparativeexperiments on recognition of

proach reminiscent of the classical sp eech recognition

individual handwritten digits are rep orted in Section III

While this technique would be quite exp ensive in

They show that neural networks trained with Gradient

the general case the use of Convolutional Neural Networks

Based Learning p erform b etter than all other metho ds

makes it particularly attractive b ecause it allows signicant

tested here on the same data The b est neural networks

savings in computational cost

called Convolutional Networks are designed to learn to

extract relevant features directly from pixel images see

E Global ly Trainable Systems

Section I I

As stated earlier most practical pattern recognition sys One of the most dicult problems in handwriting recog

tems are comp osed of multiple mo dules For example a nition however is not only to recognize individual charac

do cument recognition system is comp osed of a eld lo cator

ters but also to separate out characters from their neigh

which extracts regions of interest a eld segmenter which

b ors within the word or sentence a pro cess known as seg

cuts the input image into images of candidate characters a

mentation The technique for doing this that has b ecome

recognizer which classies and scores each candidate char the standard is called Heuristic OverSegmentation It

acter and a contextual p ostpro cessor generally based on consists in generating a large number of p otential cuts

astochastic grammar which selects the b est grammatically

between characters using heuristic image pro cessing tech

correct answer from the hyp otheses generated by the recog

niques and subsequently selecting the b est combination of

nizer In most cases the information carried from mo dule

cuts based on scores given for each candidate character by

the recognizer In such a mo del the accuracy of the sys to mo dule is b est represented as graphs with numerical in

tem dep ends up on the quality of the cuts generated bythe formation attached to the arcs For example the output

of the recognizer mo dule can b e represented as an acyclic

heuristics and on the ability of the recognizer to distin

graph where each arc contains the lab el and the score of

guish correctly segmented characters from pieces of char

a candidate character and where each path represent a acters multiple characters or otherwise incorrectly seg

alternative interpretation of the input string Typically mented characters Training a recognizer to p erform this

eachmoduleismanually optimized or sometimes trained task p oses a ma jor challenge b ecause of the dicultyincre

outside of its context For example the character recog

ating a lab eled database of incorrectly segmented charac

nizer would b e trained on lab eled images of presegmented

ters The simplest solution consists in running the images

characters Then the complete system is assembled and of character strings through the segmenter and then man

a subset of the parameters of the mo dules is manually ad ually lab eling all the character hyp otheses Unfortunately

justed to maximize the overall p erformance This last step not only is this an extremely tedious and costly task it is

is extremely tedious timeconsuming and almost certainly

also dicult to do the lab eling consistently For example

sub optimal

should the righthalfofacutupbelabeledasaoras

a noncharacter should the right half of a cut up be

A b etter alternativewould b e to somehow train the en

lab eled as a

tire system so as to minimize a global error measure suchas

The rst solution describ ed in Section V consists in the probabilityofcharacter misclassications at the do cu

Ideallywewould want to nd a go o d minimum training the system at the level of whole strings of char mentlevel

el The notion of of this global loss function with resp ect to all the param acters rather than at the character lev

GradientBased Learning can b e used for this purp ose The eters in the system If the loss function E measuring the

system is trained to minimize an overall loss function which p erformance can b e made dierentiable with resp ect to the

measures the probability of an erroneous answer Section V systems tunable parameters W we can nd a lo cal min

explores various ways to ensure that the loss function is dif imum of E using GradientBased Learning However at

PROC OF THE IEEE NOVEMBER

rst glance it app ears that the sheer size and complexity tion system is b est represented by graphs with numerical

of the system would make this intractable information attached to the arcs In this case each mo dule

p p

To ensure that the global loss function E Z Wisdif called a Graph Transformer takes one or more graphs as

h ferentiable the overall system is built as a feedforward net input and pro duces a graph as output Networks of suc

mo dules are called Graph Transformer Networks GTN work of dierentiable mo dules The function implemented

Sections IV VI and VIII develop the concept of GTNs byeach mo dule must b e continuous and dierentiable al

and show that GradientBased Learning can be used to most everywhere with resp ect to the internal parameters of

train all the parameters in all the mo dules so as to mini the mo dule eg the weights of a Neural Net character rec

mize a global loss function It may seem paradoxical that ognizer in the case of a character recognition mo dule and

gradients can b e computed when the state information is with resp ect to the mo dules inputs If this is the case a

represented by essentially discrete ob jects such as graphs simple generalization of the wellknown backpropagation

but that diculty can b e circumvented as shown later pro cedure can b e used to eciently compute the gradients

of the loss function with resp ect to all the parameters in

II Convolutional Neural Networks for

the system For example let us consider a system

Isolated Character Recognition

built as a cascade of mo dules eachofwhich implements a

X is a vector rep function X F W X where

n n n n n

The ability of multilayer networks trained with gradi

resenting the output of the mo dule W is the vector of

n

ent descent to learn complex highdimensional nonlinear

tunable parameters in the mo dule a subset of W and

mappings from large collections of examples makes them

X is the mo dules input vector as well as the previous

n

obvious candidates for image recognition tasks In the tra

mo dules output vector The input X to the rst mo dule

ditional mo del of pattern recognition a handdesigned fea

p p

is the input pattern Z If the partial derivativeof E with

ture extractor gathers relevant information from the input

p

resp ect to X is known then the partial derivatives of E

n

and eliminates irrelevantv ariabilities A trainable classier

with resp ect to W and X can b e computed using the

n n

then categorizes the resulting feature vectors into classes

backward recurrence

In this scheme standard fullyconnected multilayer net

p p

E F E

works can b e used as classiers A p otentially more inter

W X

n n

esting scheme is to rely on as much as p ossible on learning

W W X

n n

p p

in the feature extractor itself In the case of character

E F E

W X

n n

recognition a network could be fed with almost raw in

X X X

n n

puts eg sizenormalized images While this can b e done

F

W X is the Jacobian of F with resp ect to where

with an ordinary fully connected feedforward network with

n n

W

F

some success for tasks suchascharacter recognition there

W evaluated at the p ointW X and W X

n n n n

X

are problems

is the Jacobian of F with resp ect to X The Jacobian of

avector function is a matrix containing the partial deriva Firstlytypical images are large often with several hun

tives of all the outputs with resp ect to all the inputs dred variables pixels A fullyconnected rst layer with

The rst equation computes some terms of the gradient sayonehundred hidden units in the rst layer would al

p

of E W while the second equation generates a back ready contain several tens of thousands of weights Such

ward recurrence as in the wellknown backpropagation a large numb er of parameters increases the capacityofthe

pro cedure for neural networks Wecanaverage the gradi system and therefore requires a larger training set In ad

ents over the training patterns to obtain the full gradient dition the memory requirement to store so manyweigh ts

It is interesting to note that in many instances there is may rule out certain hardware implementations But the

no need to explicitly compute the Jacobian matrix The main deciency of unstructured nets for image or sp eech

ab oveformula uses the pro duct of the Jacobian with a vec applications is that they have no builtin invariance with

tor of partial derivatives and it is often easier to compute resp ect to translations or lo cal distortions of the inputs

this pro duct directly without computing the Jacobian b e Before b eing sent to the xedsize input layer of a neural

forehand In By analogy with ordinary multilayer neural net character images or other D or D signals must b e

networks all but the last mo dule are called hidden layers approximately sizenormalized and centered in the input

b ecause their outputs are not observable from the outside eld Unfortunatelynosuch prepro cessing can b e p erfect

more complex situations than the simple cascade of mo d handwriting is often normalized at the word level which

ules describ ed ab ove the partial derivative notation be can cause size slant and p osition variations for individual

comes somewhat ambiguous and awkward A completely characters This combined with variability in writing style

rigorous derivation in more general cases can b e done using will cause variations in the p osition of distinctive features

Lagrange functions in input ob jects In principle a fullyconnected network of

Traditional multilayer neural networks are a sp ecial case sucient size could learn to pro duce outputs that are in

of the ab ove where the state information X is represented variant with resp ect to suchvariations However learning

n

such a task would probably result in multiple units with with xedsized vectors and where the mo dules are al

similar weight patterns p ositioned at various lo cations in ternated layers of matrix multiplications the weights and

the input so as to detect distinctive features wherever they comp onentwise sigmoid functions the neurons However

app ear on the input Learning these weight congurations as stated earlier the state information in complex recogni

PROC OF THE IEEE NOVEMBER

requires a very large numb er of training instances to cover planes eachofwhich is a feature map A unit in a feature

the space of p ossible variations In convolutional networks map has inputs connected to a by area in the input

describ ed b elow shift invariance is automatically obtained called the receptive eld of the unit Each unit has in

by forcing the replication of weight congurations across puts and therefore trainable co ecients plus a trainable

space bias The receptive elds of contiguous units in a feature

map are centered on corresp ondingly contiguous units in

Secondly a deciency of fullyconnected architectures is

the previous layer Therefore receptive elds of neighbor

that the top ology of the input is entirely ignored The in

ing units overlap For example in the rst hidden layer

put variables can b e presentedinany xed order without

of LeNet the receptiveeldsof horizontally contiguous

aecting the outcome of the training On the contrary

units overlap by columns and rows As stated earlier

images or timefrequency representations of sp eech have

the units in a feature map share the same set of all

a strong D lo cal structure variables or pixels that are

weights and the same bias so they detect the same feature

spatially or temp orally nearby are highly correlated Lo cal

at all p ossible lo cations on the input The other feature

correlations are the reasons for the wellknown advantages

maps in the layer use dierentsetsofweights and biases

of extracting and combining local features b efore recogniz

thereby extracting dierenttyp es of lo cal features In the

ing spatial or temp oral ob jects b ecause congurations of

case of LeNet at each input lo cation six dierenttyp es

neighb oring variables can b e classied into a small number

of features are extracted by six units in identical lo cations

of categories eg edges corners Convolutional Net

in the six feature maps A sequential implementation of

works force the extraction of lo cal features by restricting

a feature map would scan the input image with a single

the receptive elds of hidden units to b e lo cal

unit that has a lo cal receptive eld and store the states

A Convolutional Networks

of this unit at corresp onding lo cations in the feature map

This op eration is equivalenttoaconvolution followed by

Convolutional Networks combine three architectural

an additive bias and squashing function hence the name

ideas to ensure some degree of shift scale and distor

convolutional network The k ernel of the is the

tion invariance local receptive elds shared weights or

set of connection weights used by the units in the feature

weight replication and spatial or temp oral subsampling

map An interesting prop ertyofconvolutional layers is that

Atypical convolutional network for recognizing characters

if the input image is shifted the feature map output will

dubb ed LeNet is shown in gure The input plane

b e shifted by the same amount but will b e left unchanged

receives images of characters that are approximately size

otherwise This prop erty is at the basis of the robustness

normalized and centered Eachunitinalayer receives in

of convolutional networks to shifts and distortions of the

puts from a set of units lo cated in a small neighborhood

input

in the previous layer The idea of connecting units to lo cal

receptive elds on the input go es back to the in Once a feature has b een detected its exact lo cation

the early s and was almost simultaneous with Hub el and b ecomes less imp ortant Only its approximate p osition

Wiesels disco very of lo callysensitive orientationselective relative to other features is relevant For example once

neurons in the cats visual system Lo cal connections weknow that the input image contains the endp ointofa

have b een used many times in neural mo dels of visual learn roughly horizontal segment in the upp er left area a corner

ing With lo cal receptive in the upp er right area and the endp oint of a roughly ver

elds neurons can extract elementary visual features such tical segment in the lower p ortion of the image we can tell

as oriented edges endp oints corners or similar features in the input image is a Not only is the precise p osition of

other signals such as sp eech sp ectrograms These features each of those features irrelevant for identifying the pattern

are then combined by the subsequentlayers in order to de it is p otentially harmful b ecause the p ositions are likely to

tect higherorder features As stated earlier distortions or ary for dierent instances of the character A simple way v

shifts of the input can cause the p osition of salient features to reduce the precision with which the p osition of distinc

to vary In addition elementary feature detectors that are tive features are enco ded in a feature map is to reduce the

useful on one part of the image are likely to b e useful across spatial resolution of the feature map This can b e achieved

the entire image This knowledge can b e applied by forcing with a socalled subsampling layers which p erforms a lo cal

a set of units whose receptive elds are lo cated at dierent averaging and a subsampling reducing the resolution of

places on the image to have identical weightvectors the feature map and reducing the sensitivity of the output

Units in a layer are organized in planes within to shifts and distortions The second hidden layer of LeNet

which all the units share the same set of weights The set is a subsampling layer This layer comprises six feature

of outputs of the units in such a plane is called a feature maps one for each feature map in the previous layer The

map Units in a feature map are all constrained receptiveeldofeach unit is a by area in the previous to per

layers corresp onding feature map Each unit computes the form the same op eration on dierent parts of the image

average of its four inputs multiplies it by a trainable co ef A complete convolutional layer is comp osed of several fea

cient adds a trainable bias and passes the result through ture maps with dierentweightvectors so that multiple

a Contiguous units have nonoverlapping features can b e extracted at each lo cation A concrete ex

contiguous receptive elds Consequently a subsampling ample of this is the rst layer of LeNet shown in Figure

layer feature map has half the number of rows and columns Units in the rst hidden layer of LeNet are organized in

PROC OF THE IEEE NOVEMBER

C3: f. maps 16@10x10 C1: feature maps S4: f. maps 16@5x5 INPUT 6@28x28 32x32 S2: f. maps C5: layer OUTPUT 6@14x14 F6: layer 120 84 10

Full connection Gaussian connections

Convolutions Subsampling Subsampling Full connection

Fig Architecture of LeNet a Convolutional Neural Network here for digits recognition Each plane is a feature map ie a set of units

whose weights are constrained to b e identical

as the feature maps in the previous layer The trainable B LeNet

co ecient and bias control the eect of the sigmoid non

This section describ es in more detail the architecture of

linearity If the co ecient is small then the unit op erates

LeNet the Convolutional Neural Network used in the

in a quasilinear mo de and the subsampling layer merely

exp eriments LeNet comprises layers not counting the

blurs the input If the co ecient is large subsampling

input all of which contain trainable parameters weights

units can b e seen as p erforming a noisy OR or a noisy

The input is a x pixel image This is signicantly larger

AND function dep ending on the value of the bias Succes

than the largest character in the database at most x

sivelayers of convolutions and subsampling are typically

x eld The reason is that it is pixels centered in a

alternated resulting in a bipyramid at eachlayer the

desirable that p otential distinctive features such as stroke

numb er of feature maps is increased as the spatial resolu

endp oints or corner can app ear in the center of the recep

tion is decreased Each unit in the third hidden layer in g

tive eld of the highestlevel feature detectors In LeNet

ure mayhave input connections from several feature maps

the set of centers of the receptive elds of the last convolu

in the previous layer The convolutionsubsampling com

tional layer C see b elow form a x area in the center

bination inspired by Hub el and Wiesels notions of sim

of the x input The values of the input pixels are nor

ple and complex cells was implemented in Fukushimas

malized so that the background level white corresp onds

Neo cognitron though no globally sup ervised learning

to avalue of and the foreground black corresp onds

pro cedure such as backpropagation was av ailable then A

to This makes the mean input roughly and the

large degree of invariance to geometric transformations of

variance roughly which accelerates learning

the input can b e achieved with this progressive reduction

In the following convolutional layers are lab eled Cx sub

of spatial resolution comp ensated by a progressive increase

sampling layers are lab eled Sx and fullyconnected layers

of the richness of the representation the numb er of feature

are lab eled Fx where x is the layer index

maps

Layer C is a convolutional layer with feature maps

Since all the weights are learned with backpropagation

Eachunitineach feature map is connected to a x neigh

convolutional networks can be seen as synthesizing their

b orho o d in the input The size of the feature maps is x

own feature extractor The weight sharing technique has

whichprevents connection from the input from falling o

the interesting side eect of reducing the number of free

the b oundary C contains trainable parameters and

parameters thereby reducing the capacity of the ma

connections

chine and reducing the gap b etween test error and training

Layer S is a subsampling layer with feature maps of

error The network in gure contains con

size x Eachunitineach feature map is connected to a

nections but only trainable free parameters b ecause

x neighb orho o d in the corresp onding feature map in C

of the weight sharing

The four inputs to a unit in S are added then multiplied

Fixedsize Convolutional Networks have been applied

by a trainable co ecient and added to a trainable bias

among other handwriting recogni to many applications

The result is passed through a sigmoidal function The

tion machineprinted character recognition

x receptive elds are nonoverlapping therefore feature

online handwriting recognition and face recogni

maps in S have half the number of rows and column as

tion Fixedsize convolutional networks that share

feature maps in C Layer S has trainable parameters

weights along a single temp oral dimension are known as

and connections

TimeDelayNeuralNetworks TDNNs TDNNs have b een

Layer C is a convolutional layer with feature maps used in phoneme recognition without subsampling

Each unit in each feature map is connected to several x sp oken word recognition with subsampling

neighb orho o ds at identical lo cations in a subset of Ss online recognition of isolated handwritten charac

feature maps Table I shows the set of S feature maps ters and signature verication

PROC OF THE IEEE NOVEMBER

where A is the amplitude of the function and S determines

X X X X X X X X X X

is o dd with horizon its slop e at the origin The function f

X X X X X X X X X X

tal asymptotes at A and A The constant A is chosen

X X X X X X X X X X

to b e The rationale for this choice of a squashing

X X X X X X X X X X

function is given in App endix A

X X X X X X X X X X

Finally the output layer is comp osed of Euclidean Radial

X X X X X X X X X X

Basis Function units RBF one for each class with

inputs each The outputs of each RBF unit y is computed

i

TABLE I

as follows

X

Each column indicates which feature map in S are combined

y x w

i j ij

by the units in a particular feature map of C

j

In other words each output RBF unit computes the Eu

clidean distance b etween its input vector and its parameter

combined byeach C feature map Why not connect ev

vector The further away is the input from the parameter

ery S feature map to every C feature map The rea

vector the larger is the RBF output The output of a

son is twofold First a noncomplete connection scheme

particular RBF can b e interpreted as a p enalty term mea

keeps the numb er of connections within reasonable b ounds

suring the t b etween the input pattern and a mo del of the

More imp ortantly it forces a break of symmetry in the net

class asso ciated with the RBF In probabilistic terms the

work Dierent feature maps are forced to extract dierent

RBF output can b e in terpreted as the unnormalized nega

hop efully complementary features b ecause they get dif

tive loglikeliho o d of a Gaussian distribution in the space

ferent sets of inputs The rationale b ehind the connection

of congurations of layer F Given an input pattern the

scheme in table I is the following The rst six C feature

loss function should b e designed so as to get the congu

maps take inputs from every contiguous subsets of three

ration of F as close as p ossible to the parameter vector

feature maps in S The next six take input from every

of the RBF that corresp onds to the patterns desired class

contiguous subset of four The next three take input from

The parameter vectors of these units were chosen byhand

of four Finally the last one some discontinuous subsets

and kept xed at least initially The comp onents of those

takes input from all S feature maps Layer C has

parameters vectors were set to or While they could

trainable parameters and connections

have b een chosen at random with equal probabilities for

Layer S is a subsampling layer with feature maps of

and or even chosen to form an error correcting co de

size x Eachunitineach feature map is connected to a

as suggested by they were instead designed to repre

x neighb orho o d in the corresp onding feature map in C

sentastylized image of the corresp onding character class

in a similar way as C and S Layer S has trainable

drawn on a x bitmap hence the number Such a

parameters and connections

representation is not particularly useful for recognizing iso

Layer C is a convolutional layer with feature maps

lated digits but it is quite useful for recognizing strings of

Each unit is connected to a x neighborhood on all

characters taken from the full printable ASCI I set The

of Ss feature maps Here b ecause the size of S is also

rationale is that characters that are similar and therefore

x the size of Cs feature maps is x this amounts

confusable suchasuppercaseOlowercase O and zero or

to a full connection between S and C C is lab eled

lowercase l digit square brackets and upp ercase I will

as a convolutional layer instead of a fullyconnected layer

ha ve similar output co des This is particularly useful if the

b ecause if LeNet input were made bigger with everything

system is combined with a linguistic p ostpro cessor that

else kept constant the feature map dimension would be

can correct such confusions Because the co des for confus

larger than x This pro cess of dynamically increasing the

able classes are similar the output of the corresp onding

size of a convolutional network is describ ed in the section

RBFs for an ambiguous character will b e similar and the

Section VI I Layer C has trainable connections

p ostpro cessor will b e able to pick the appropriate interpre

Layer F contains units the reason for this number

tation Figure gives the output co des for the full ASCI I

comes from the design of the output layer explained be

set

low and is fully connected to C It has trainable

Another reason for using such distributed co des rather

parameters

than the more common of N co de also called place

As in classical neural networks units in layers up to F

co de or grandmother cell co de for the outputs is that

compute a dot pro duct b etween their input vector and their

non distributed co des tend to b ehave badly when the num

weightvector to whichabiasisadded This weighted sum

ber of classes is larger than a few dozens The reason is

denoted a for unit i is then passed through a sigmoid

i

that output units in a nondistributed co de must be o

squashing function to pro duce the state of unit i denoted

most of the time This is quite dicult to achieve with

by x

i

sigmoid units Yet another reason is that the classiers are

x f a

i i

haracters but also to re often used to not only recognize c

ject noncharacters RBFs with distributed co des are more

The squashing function is a scaled hyp erb olic tangent

appropriate for that purp ose b ecause unlike sigmoids they

are activated within a well circumscrib ed region of their in f aA tanhSa

PROC OF THE IEEE NOVEMBER

p enalties it means that in addition to pushing down the

! " # $ % & ’ ( ) * + , − . /

p enalty of the correct class like the MSE criterion this

0 1 2 3 4 5 6 7 8 9 : ; < = > ? criterion also pulls up the p enalties of the incorrect classes

P

X X p

@ A B C D E F G H I J K L M N O

p j y Z W

i

p

y Z W loge e E W

D

P

p

P Q R S T U V W X Y Z [ \ ] ^ _ i

‘ a b c d e f g h i j k l m n o

The negative of the second term plays a comp etitive role

It is necessarily smaller than or equal to the rst term

p q r s t u v w x y z { | } ~

therefore this loss function is p ositive The constant j is

p ositive and prevents the p enalties of classes that are al

Fig Initial parameters of the output RBFs for recognizing the

full ASCI I set

ready very large from b eing pushed further up The p os

terior probabilityof this rubbish class lab el would b e the

P

p

j j y Z W

i

ratio of e and e e This discrimina

i

put space that nontypical patterns are more likely to fall

tive criterion prevents the previously mentioned collaps

outside of

ing eect when the RBF parameters are learned b ecause

The parameter vectors of the RBFs play the role of target

it keeps the RBF centers apart from each other In Sec

vectors for layer F It is worth p ointing out that the com

tion VI we present a generalization of this criterion for

p onents of those vectors are or whichiswell within

systems that learn to classify multiple ob jects in the input

the range of the sigmoid of F and therefore prevents those

eg characters in words or in do cuments

sigmoids from getting saturated In fact and are the

Computing the gradient of the loss function with resp ect

p oints of maximum curvature of the sigmoids This forces

to all the weights in all the layers of the convolutional

the F units to op erate in their maximally nonlinear range

network is done with backpropagation The standard al

Saturation of the sigmoids must be avoided b ecause it is

gorithm must be slightly mo died to take accountof the

known to lead to slow convergence and illconditioning of

weight sharing An easy way to implement it is to rst com

the loss function

pute the partial derivatives of the loss function with resp ect

to each connection as if the network were a conventional

C Loss Function

multilayer network without weight sharing Then the par

The simplest output loss function that can b e used with

tial derivatives of all the connections that share a same

the ab ovenetwork is the Maximum Likeliho o d Estimation

parameter are added to form the derivative with resp ect to

criterion MLE which in our case is equivalent to the Min

that parameter

imum Mean Squared Error MSE The criterion for a set

Such a large architecture can b e trained very eciently

of training samples is simply

but doing so requires the use of a few techniques that are

describ ed in the app endix Section A of the app endix

P

X

describ es details such as the particular sigmoid used and

p

p

Z W E W y

D

P

the weight initialization Section B and C describ e the

p

minimization pro cedure used whichisastochastic version

of a diagonal approximation to the Levenb ergMarquardt

p

where y is the output of the D th RBF unit ie the

D p

pro cedure

one that corresp onds to the correct class of input pattern

p

Z While this cost function is appropriate for most cases

III Results and Comparison with Other

it lacks three imp ortant prop erties First if weallowthe

Methods

parameters of the RBF to adapt E W has a trivial but

While recognizing individual digits is only one of many

totally unacceptable solution In this solution all the RBF

problems involved in designing a practical recognition sys

parameter vectors are equal and the state of F is constant

tem it is an excellent b enchmark for comparing shap e

and equal to that parameter vector In this case the net

recognition metho ds Though many existing metho d com

work happily ignores the input and all the RBF outputs

bine a handcrafted feature extractor and a trainable clas

are equal to zero This collapsing phenomenon do es not

sier this study concentrates on adaptive metho ds that

o ccur if the RBF weights are not allowed to adapt The

op erate directly on sizenormalized images

second problem is that there is no comp etition between

the classes Suc h a comp etition can be obtained by us

A Database the ModiedNISTset

ing a more discriminative training criterion dubb ed the

The database used to train and test the systems de MAP maximum a p osteriori criterion similar to Maxi

scrib ed in this pap er was constructed from the NISTs Sp e mum Mutual Information criterion sometimes used to train

cial Database and Sp ecial Database containing binary HMMs It corresp onds to maximizing the

images of handwritten digits NIST originally designated p osterior probability of the correct class D or minimiz

p

SD as their training set and SD as their test set How ing the logarithm of the probabilityof the correct class

ever SD is much cleaner and easier to recognize than SD given that the input image can come from one of the classes

The reason for this can b e found on the fact that SD or from a background rubbish class lab el In terms of

PROC OF THE IEEE NOVEMBER

was collected among Census Bureau employees while SD

was collected among highscho ol students Drawing sensi

ble conclusions from learning exp eriments requires that the

result b e indep endent of the choice of training set and test

among the complete set of samples Therefore it was nec

essary to build a new database by mixing NISTs datasets

SD contains digit images written by dif

ferent writers In contrast to SD where blo cks of data

from each writer app eared in sequence the data in SD is

scrambled Writer identities for SD are available and we

used this information to unscramble the writers Wethen

split SD in two characters written by the rst writers

wentinto our new training set The remaining writers

were placed in our test set Thus we had two sets with

nearly examples each The new training set was

completed with enough examples from SD starting at

pattern to make a full set of training patterns

Similarly the new test set was completed with SD exam

Fig Sizenormalized examples from the MNIST database

ples starting at pattern to makea full set with

test patterns In the exp eriments describ ed here we

only used a subset of test images from SD

three for the next three for the next

and from SD but we used the full training

and thereafter Before each iteration the diagonal

samples The resulting database was called the Mo died

Hessian approximation was reevaluated on samples as

NIST or MNIST dataset

describ ed in App endix C and kept xed during the entire

The original black and white bilevel images were size

iteration The parameter wassetto The resulting

normalized to t in a x pixel box while preserving

eective learning rates during the rst pass varied b etween

their asp ect ratio The resulting images contain grey lev



approximately and over the set of parame

els as result of the antialiasing image interp olation tech

ters The test error rate stabilizes after around passes

nique used by the normalization algorithm Three ver

through the training set at The error rate on the

sions of the database were used In the rst version

training set reaches after passes Many authors

the images were centered in a x image by comput

have rep orted observing the common phenomenon of over

ing the center of mass of the pixels and translating the

training when training neural networks or other adaptive

image so as to p osition this point at the center of the

algorithms on various tasks When overtraining o ccurs

x eld In some instances this x eld was ex

the training error keeps decreasing over time but the test

tended to x with background pixels This version of

error go es through a minimum and starts increasing after

the database will be referred to as the regular database

a certain numb er of iterations While this phenomenon is

In the second version of the database the character im

very common it was not observed in our case as the learn

ages were deslanted and cropp ed down to x pixels im

ing curves in gure show A p ossible reason is that the

ages The deslanting computes the second moments of in

learning rate was k ept relatively large The eect of this is

ertia of the pixels counting a foreground pixel as and a

that the weights never settle down in the lo cal minimum

background pixel as and shears the image by horizon

but keep oscillating randomly Because of those uctua

tally shifting the lines so that the principal axis is verti

tions the average cost will b e lower in a broader minimum

cal This version of the database will b e referred to as the

Therefore sto chastic gradient will have a similar eect as

version of the database deslanted database In the third

a regularization term that favors broader minima Broader

used in some early exp eriments the images were reduced

minima corresp ond to solutions with large entropyof the

to x pixels The regular database training

parameter distribution which is b enecial to the general

examples test examples sizenormalized to x

ization error

and centered by center of mass in x elds is avail

able at httpwwwresearchattcomyannocrmnist

The inuence of the training set size was measured by

Figure shows examples randomly picked from the test set

training the network with and exam

ples The resulting training error and test error are shown

B Results

in gure It is clear that even with sp ecialized architec

tures such as LeNet more training data would improve

Several versions of LeNet were trained on the regular

the accuracy

MNIST database iterations through the entire train

ing data were p erformed for each session The values of Toverify this hyp othesis we articially generated more

the global learning rate see Equation in App endix C training examples by randomly distorting the original

for a denition was decreased using the following sched training images The increased training set was comp osed

for the rst two passes for the next plus instances of ule of the original patterns

PROC OF THE IEEE NOVEMBER

Error Rate (%)

5%

4%

3%

2%

1% Test

0% Training 0 4 8 12 16 20

Training set Iterations

Fig Training and test error of LeNet as a function of the num

b er of passes through the pattern training set without

distortions The average training error is measured onthey as

training pro ceeds This explains why the training error app ears

to b e larger than the test error Convergence is attained after

to passes through the training set

Fig Examples of distortions of ten training patterns Error Rate (%) 1.8

4−>6 3−>5 8−>2 2−>1 5−>3 4−>8 2−>8 3−>5 6−>5 7−>3 1.6

1.4 Test error (no distortions) 9−>4 8−>0 7−>8 5−>3 8−>7 0−>6 3−>7 2−>7 8−>3 9−>4

1.2 8−>2 5−>3 4−>8 3−>9 6−>0 9−>8 4−>9 6−>1 9−>4 9−>1

1 9−>4 2−>0 6−>1 3−>5 3−>2 9−>5 6−>0 6−>0 6−>0 6−>8

0.8 Test error (with distortions) 4−>6 7−>3 9−>4 4−>6 2−>7 9−>7 4−>3 9−>4 9−>4 9−>4 0.6

8−>7 4−>2 8−>4 3−>5 8−>4 6−>5 8−>5 3−>8 3−>8 9−>8 0.4

0.2 Training error (no distortions) 1−>5 9−>8 6−>3 0−>2 6−>5 9−>5 0−>7 1−>6 4−>9 2−>1

0 0 10 20 30 40 50 60 70 80 90 100 2−>8 8−>5 4−>9 7−>2 7−>2 6−>5 9−>7 6−>1 5−>6 5−>0 Training Set Size (x1000)

4−>9 2−>8

Fig Training and test errors of LeNet achieved using training

Fig The test patterns misclassied by LeNet Beloweach

sets of various sizes This graph suggests that a larger training

image is displayed the correct answers left and the network an

set could improve the p erformance of LeNet The hollow square

swer right These errors are mostly caused either by genuinely

show the test error when more training patterns are articially

ambiguous patterns or by digits written in a style that are under

generated using random distortions The test patterns are not

represented in the training set

distorted

p erfectly identiable by humans although they are writ

distorted patterns with randomly picked distortion param

ten in an underrepresented style This shows that further

eters The distortions were combinations of the follow

improvements are to b e exp ected with more training data

ing planar ane transformations horizontal and verti

cal translations scaling squeezing simultaneous horizon

C Comparison with Other Classiers

tal compression and vertical elongation or the reverse

For the sake of comparison a variety of other trainable

and horizontal shearing Figure shows examples of dis

classiers was trained and tested on the same database An

torted patterns used for training When distorted data was

early subset of these results was presented in The error

used for training the test error rate dropp ed to from

rates on the test set for the various metho ds are shown in

without deformation The same training parame

gure

ters were used as without deformations The total length of

the training session was left unchanged passes of

C Linear Classier and Pairwise Linear Classier

patterns each It is interesting to note that the network

eectively sees each individual sample only twice over the

Possibly the simplest classier that one might consider is

course of these passes

a linear classier Each input pixel value contributes to a

weighted sum for each output unit The output unit with Figure shows all misclassied test examples some

eral are the highest sum including the contribution of a bias con of those examples are genuinely ambiguous but sev

PROC OF THE IEEE NOVEMBER

Linear −−−− 12.0 −−−−> [deslant] Linear −−−− 8.4 −−−−> Pairwise −−−− 7.6 −−−−>

K−NN Euclidean 5 [deslant] K−NN Euclidean 2.4 40 PCA + quadratic 3.3 1000 RBF + linear 3.6 [16x16] Tangent Distance 1.1 SVM poly 4 1.1 RS−SVM poly 5 1 [dist] V−SVM poly 9 0.8

28x28−300−10 4.7 [dist] 28x28−300−10 3.6 [deslant] 20x20−300−10 1.6 28x28−1000−10 4.5 [dist] 28x28−1000−10 3.8 28x28−300−100−10 3.05 [dist] 28x28−300−100−10 2.5 28x28−500−150−10 2.95 [dist] 28x28−500−150−10 2.45

[16x16] LeNet−1 1.7 LeNet−4 1.1 LeNet−4 / Local 1.1 LeNet−4 / K−NN 1.1 LeNet−5 0.95 [dist] LeNet−5 0.8 [dist] Boosted LeNet−4 0.7

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

Fig Error rate on the test set for various classication metho ds deslant indicates that the classier was trained and tested on

the deslanted version of the database dist indicates that the training set was augmented with articially distorted examples x

indicates that the system used the x pixel images The uncertainty in the quoted error rates is ab out

stant indicates the class of the input character On the However the memory requirement and recognition time are

regular data the error rate is The network has large the complete twentybytwenty pixel training

free parameters On the deslanted images the test error images ab out Megabytes at one byte p er pixel must b e

rate is The network has free parameters The available at run time Much more compact representations

deciencies of the linear classier are well do cumented could b e devised with mo dest increase in error rate On the

and it is included here simply to form a basis of comparison regular test set the error rate was On the deslanted

for more sophisticated classiers Various combinations of with k Naturally a data the error rate was

sigmoid units linear units gradient descent learning and realistic Euclidean distance nearestneighb or system would

learning by directly solving linear systems gave similar re op erate on feature vectors rather than directly on the pix

sults els but since all of the other systems presented in this

study op erate directly on the pixels this result is useful for

A simple improvementofthe basic linear classier was

tested The idea is to train each unit of a singlelayer a baseline comparison

network to separate each class from each other class In our

C Principal Comp onent Analysis PCA and Polynomial

case this layer comprises units lab eled

Classier

Unit ij is trained to pro duce on patterns

of class i on patterns of class j andisnottrainedon

Following a prepro cessing stage was con

other patterns The nal score for class i is the sum of

structed which computes the pro jection of the input pat

the outputs all the units lab eled ix minus the sum of the

tern on the principal comp onents of the set of training

output of all the units lab eled yi for all x and y The

vectors To compute the principal comp onents the mean of

error rate on the regular test set was

each input comp onentwas rst computed and subtracted

from the training vectors The covariance matrix of the re

C Baseline Nearest Neighb or Classier

sulting vectors was then computed and diagonalized using

Singular Value Decomp osition The dimensional feature Another simple classier is a Knearest neighb or classi

vector was used as the input of a second degree p olynomial er with a Euclidean distance measure b etween input im

classier This classier can be seen as a linear classier ages This classier has the advantage that no training

with inputs preceded byamodule that computes all time and no brain on the part of the designer are required

PROC OF THE IEEE NOVEMBER

pro ducts of pairs of input variables The error on the reg only marginally improved error rates Training

ular test set was with distorted patterns improved the p erformance some

what error for the x network and

C Radial Basis Function Network

for the x network

Following an RBF network was constructed The

C ASmallConvolutional Network LeNet

rst layer was comp osed of Gaussian RBF units with

Convolutional Networks are an attempt to solve the x inputs and the second layer was a simple inputs

outputs linear classier The RBF units were divided

dilemma between small networks that cannot learn

the training set and large networks that seem over into groups of Each group of units was trained

parameterized LeNet was an early emb o diment of the on all the training examples of one of the classes using

Convolutional Network architecture which is included here the adaptive Kmeans algorithm The second layer weights

for comparison purp oses The images were downsampled were computed using a regularized pseudoinverse metho d

The error rate on the regular test set was

to x pixels and centered in the x input layer Al

though ab out multiplyadd steps are required to

C OneHidden Layer Fully Connected Multilayer Neural

evaluate LeNet its convolutional nature keeps the num

Network

ber of free parameters to only ab out The LeNet

architecture was develop ed using our own version of

Another classier that we tested was a fully connected

the USPS US Postal Service zip co des database and its

multilayer neural network with twolayers of weights one

size was tuned to match the available data LeNet

hidden layer trained with the version of backpropagation

achieved test error The fact that a network with such

describ ed in App endix C Error on the regular test set was

a small numb er of parameters can attain suchagooderror

for a network with hidden units and for a

rate is an indication that the architecture is appropriate

network with hidden units Using articial distortions

for the task

to generate more training data brought only marginal im

provement for hidden units and for

C LeNet

hidden units When deslanted images were used the test

error jump ed down to for a network with hidden Exp eriments with LeNet made it clear that a larger

units convolutional network was needed to make optimal use of

It remains somewhat of a mystery that networks with the large size of the training set LeNet and later LeNet

such a large numb er of free parameters manage to achieve were designed to address this problem LeNet is very

reasonably low testing errors We conjecture that the dy similar to LeNet except for the details of the architec

ture It contains rstlevel feature maps followed by

namics of gradient descent learning in multilayer nets has

a selfregularization eect Because the origin of weight subsampling maps connected in pairs to each rstlayer

space is a saddle p oint that is attractive in almost every feature maps then feature maps followed by sub

direction the weights invariably shrink during the rst sampling map followed by a fully connected layer with

few ep o chs recent theoretical analysis seem to conrm units followed by the output layer units LeNet

contains ab out connections and has ab out

this Small weights cause the sigmoids to op erate

free parameters Test error was In a series of ex in the quasilinear region making the network essentially

p eriments we replaced the last layer of LeNet with a equivalenttoalowcapacity singlelayer network As the

Euclidean Nearest Neighb or classier and with the lo cal learning pro ceeds the weights grow which progressively

learning metho d of Bottou and Vapnik in which a lo increases the eective capacity of the network This seems

cal linear classier is retrained each time a new test pattern

to be an almost p erfect if fortuitous implementation of

is shown Neither of those metho ds improved the raw error Vapniks Structural Risk Minimization principle A

rate although they did improve the rejection p erformance b etter theoretical understanding of these phenomena and

more empirical evidence are denitely needed

C Bo osted LeNet

C TwoHidden Layer Fully Connected Multilayer Neural

Following theoretical work byRSchapire Drucker

Network

et al develop ed the b o osting metho d for combining

multiple classiers Three LeNets are combined the rst To see the eect of the architecture several twohidden

one is trained the usual way the second one is trained on layer multilayer neural networks were trained Theoreti

patterns that are ltered by the rst net so that the second approxi cal results haveshown that any function can be

mac mated by a onehidden layer neural network However hine sees a mix of patterns of which the rst net

several authors have observed that twohidden layer archi got right and of which it got wrong Finally the

tectures sometimes yield b etter p erformance in practical third net is trained on new patterns on which the rst and

situations This phenomenon was also observed here The the second nets disagree During testing the outputs of

test error rate of a x network was the three nets are simply added Because the error rate of

a much b etter result than the onehidden layer network LeNet is very low it was necessary to use the articially

obtained using marginally more weights and connections distorted images as with LeNet in order to get enough

Increasing the network size to x yielded samples to train the second and third nets The test error

PROC OF THE IEEE NOVEMBER

[deslant] K−NN Euclidean 8.1

rate was the best of any of our classiers At rst

[16x16] Tangent Distance 1.9 e glance b o osting app ears to b e three times more exp ensiv SVM poly 4 1.8

as a single net In fact when the rst net pro duces a

[deslant] 20x20−300−10 3.2

er the other nets are not called The high condence answ

[16x16] LeNet−1 3.7

average computational cost is ab out times that of a LeNet−4 1.8

single net LeNet−4 / Local 1.4 LeNet−4 / K−NN 1.6

[dist] Boosted LeNet−4 0.5

C Tangent Distance Classier TDC

0123456789

The Tangent Distance classier TDC is a nearest

neighb or metho d where the distance function is made in

Fig Rejection Performance p ercentage of test patterns that

sensitive to small distortions and translations of the input

must b e rejected to achieve error for some of the systems

image If we consider an image as a point in a high

dimensional pixel space where the dimensionality equals

Linear 4

umb er of pixels then an evolving distortion of a char the n Pairwise 36

traces out a curve in pixel space Taken together

acter

[deslant] K−NN Euclidean −−−− 24,000 −−−−>

these distortions dene a lowdimensional manifold in

all 40 PCA+quadratic 39

space For small distortions in the vicinity of the pixel 1000 RBF 794

[16x16] Tangent Distance −−−− 20,000 −−−−>

original image this manifold can be approximated by a

SVM poly 4 −−−− 14,000 −−−−>

wn as the tangent plane An excellent measure plane kno RS−SVM poly 5 650

[dist] V−SVM poly 9 −−−− 28,000 −−−−>

of closeness for character images is the distance b etween

tangent planes where the set of distortions used to their [deslant] 20x20−300−10 123

28x28−1000−10 795

generate the planes includes translations scaling skewing

28x28−300−100−10 267

rotation and line thickness variations A test squeezing 28x28−500−150−10 469

error rate of was achieved using x pixel images

[16x16] LeNet−1 100 hniques using simple Euclidean distance at

Preltering tec LeNet−4 260

ultiple resolutions allowed to reduce the numb er of nec m LeNet−4 / Local −−−− 20,000 −−−−>

LeNet−4 / K−NN −−−− 10,000 −−−−>

essary Tangent Distance calculations LeNet−5 401

Boosted LeNet−4 460

Supp ort Vector Machine SVM

C 0 300 600 900

Polynomial classiers are wellstudied metho ds for gen

Fig Number of multiplyaccumulate op erations for the recogni

erating complex decision surfaces Unfortunately they

tion of a single character starting with a sizenormalized image

are impractical for highdimensional problems b ecause the

num b er of pro duct terms is prohibitive The Supp ort Vec

has reached using a mo died version of the VSVM

tor technique is an extremely economical way of represent

Unfortunately VSVM is extremely exp ensive ab out twice

ing complex surfaces in highdimensional spaces including

as much as regular SVM To alleviate this problem Burges

p olynomials and many other typ es of surfaces

has prop osed the Reduced Set Supp ort Vector technique

A particularly interesting subset of decision surfaces is

RSSVM which attained on the regular test set

the ones that corresp ond to hyp erplanes that are at a max

with a computational cost of only multiplyadds

imum distance from the convex hulls of the two classes in

p er recognition ie only ab out more exp ensivethan

the highdimensional space of the pro duct terms Boser

LeNet

Guyon and Vapnik realized that any p olynomial of

degree k in this maximum margin set can b e computed

D Discussion

by rst computing the dot pro duct of the input image with

A summary of the p erformance of the classiers is shown a subset of the training samples called the supp ort vec

in Figures to Figure shows the raw error rate of the tors elevating the result to the k th p ower and linearly

classiers on the example test set Bo osted LeNet combining the numb ers thereby obtained Finding the sup

p erformed b est achieving a score of closely followed port vectors and the co ecients amounts to solving a high

by LeNet at

dimensional quadratic minimization problem with linear

Figure shows the numberofpatternsin the test set inequality constraints For the sake of comparison we in

that must be rejected to attain a error for some of clude here the results obtained by Burges and Scholkopf

the metho ds Patterns are rejected when the value of cor rep orted in With a regular SVM their error rate

resp onding output is smaller than a predened threshold on the regular test set was Cortes and Vapnik had

In many applications rejection p erformance is more signif rep orted an error rate of with SVM on the same

icant than raw error rate The score used to decide up on data using a slightly dierent technique The computa

een the the rejection of a pattern was the dierence betw tional cost of this technique is very high ab out million

scores of the top two classes Again Bo osted LeNet has multiplyadds per recognition Using Scholkopf s Virtual

the b est p erformance The enhanced versions of LeNet Supp ort Vectors technique VSVM error was at

did b etter than the original LeNet even though the raw tained More recentlyScholkopf p ersonal communication

PROC OF THE IEEE NOVEMBER

Linear 4

ing the template images Not surprisingly neural networks

Pairwise 35

uch less memory than memorybased metho ds require m

[deslant] K−NN Euclidean −−− 24,000 −−−>

p erformance dep ends on many factors in The Overall

40 PCA+quadratic 40

running time and memory requirements 1000 RBF 794 cluding accuracy

−−− 25,000 −−−>

computer technology improves largercapacity recog [16x16] Tangent Distance As

SVM poly 4 −−−− 14,000 −−−−> Larger recognizers in turn require

RS−SVM poly 5 650 nizers b ecome feasible

LeNet was appropriate to the avail [dist] V−SVM poly 5 −−−− 28,000 −−−−> larger training sets

able technology in just as LeNet is appropriate now

[deslant] 20x20−300−10 123

ould have re 28x28−1000−10 795 In a recognizer as complex as LeNet w

28x28−300−100−10 267

quired several weeks training and more data than was

28x28−500−150−10 469

vailable and was therefore not even considered For quite a

[16x16] LeNet 1 3

a long time LeNet was considered the state of the art LeNet 4 17

LeNet 4 / Local −−− 24,000 −−−> The lo cal learning classier the optimal margin classier

−−− 24,000 −−−>

the tangent distance classier were develop ed to im LeNet 4 / K−NN and

LeNet 5 60

veupon LeNet and they succeeded at that How

Boosted LeNet 4 51 pro

er they in turn motivated a search for improved neural

0 300 600 900 ev

network architectures This searchwas guided in part by

estimates of the capacityofvarious learning machines de

Fig Memory requirements measured in number of variables for

each of the metho ds Most of the metho ds only require one byte

rived from measurements of the training and test error as

per variable for adequate p erformance

a function of the number of training examples We dis

covered that more capacitywas needed Through a series

of exp eriments in architecture combined with an analy

accuracies were identical

sis of the characteristics of recognition errors LeNet and

Figure shows the numberofmultiplyaccumulate op

LeNet were crafted

erations necessary for the recognition of a single size

We nd that b o osting gives a substantial improvementin

normalized image for each metho d Exp ectedly neural

accuracy with a relatively mo dest p enalty in memory and

networks are much less demanding than memorybased

computing exp ense Also distortion mo dels can be used

metho ds Convolutional Neural Networks are particu

to increase the eective size of a data set without actually

larly well suited to hardware implementations b ecause of

requiring to collect more data

their regular structure and their low memory requirements

The Supp ort Vector Machine has excellent accuracy

for the weights Single chip mixed analogdigital imple

which is most remarkable b ecause unlike the other high

mentations of LeNets predecessors have b een shown to

p erformance classiers it do es not include a priori knowl

op erate at sp eeds in excess of characters per sec

edge ab out the problem In fact this classier would do

ond However the rapid progress of mainstream com

just as well if the image pixels were p ermuted with a xed

puter technology renders those exotic technologies quickly

mapping and lost their pictorial structure However reach

obsolete Costeective implementations of memorybased

ing levels of p erformance comparable to the Convolutional

techniques are more elusive due to their enormous memory

Neural Networks can only b e done at considerable exp ense

requirements and computational requirements

in memory and computational requirements The reduced

Training time was also measured Knearest neighbors

set SVM requirements are within a factor of two of the

and TDC have essentially zero training time While the

Convolutional Networks and the error rate is very close

singlelayer net the pairwise net and PCAquadratic net

Improvements of those results are exp ected as the tech

could be trained in less than an hour the multilayer net

nique is relatively new

training times were exp ectedly much longer but only re

When plentyofdataisavailable many metho ds can at

quired to passes through the training set This

tain resp ectable accuracy The neuralnet metho ds run

amounts to to days of CPU to train LeNet on a Sil

much faster and require much less space than memory

icon Graphics Origin server using a single MHz

based techniques The neural nets advantage will b ecome

R pro cessor It is imp ortantto note that while the

more striking as training databases continue to increase in

training time is somewhat relev ant to the designer it is of

size

little interest to the nal user of the system Given the

choice b etween an existing technique and a new technique

E Invariance and Noise Resistance

that brings marginal accuracy improvements at the price

networks are particularly well suited for of considerable training time any nal user would chose Convolutional

recognizing or rejecting shap es with widely varying size the latter

p osition and orientation such as the ones typically pro

Figure shows the memory requirements and therefore

duced by heuristic segmenters in realworld string recogni

the number of free parameters of the various classiers

tion systems

measured in terms of the number of variables that need

In an exp eriment like the one describ ed ab ove the im to be stored Most metho ds require only ab out one byte

p ortance of noise resistance and distortion invariance is per variable for adequate p erformance However Nearest

not obvious The situation in most real applications is Neighb or metho ds maygetby with bits p er pixel for stor

PROC OF THE IEEE NOVEMBER

W1 W2

quite dierent Characters must generally be segmented

out of their context prior to recognition Segmentation

algorithms are rarely perfect and often leave extraneous

F1(X0,X1,W1)

marks in character images noise underlines neighb oring t X3

Input F3(X3,X4)

haracters or sometimes cut characters to o much and pro

c X5

incomplete characters Those images cannot be re duce Z Loss

X1

sizenormalized and centered Normalizing incom liably Function E X4

F0(X0)

plete characters can b e very dangerous For example an

F2(X2,W2)

y mark can lo ok likeagenuine Therefore

enlarged stra X2

y systems have resorted to normalizing the images at man D

Desired Output

the level of elds or words In our case the upp er and lower

proles of entire elds amounts in a check are detected

Fig A trainable system comp osed of heterogeneous mo dules

and used to normalize the image to a xed height While

this guarantees that stray marks will not b e blown up into

words that can b e trained to simultaneously segmentand

characterlo oking images this also creates wide variations

recognize words without ever b eing given the correct seg

of the size and vertical p osition of characters after segmen

mentation

tation Therefore it is preferable to use a recognizer that is

Figure shows an example of a trainable multimo dular

robust to suchvariations Figure shows several exam

system Amultimo dule system is dened by the function

ples of distorted characters that are correctly recognized by

implemented byeach of the mo dules and by the graph of

LeNet It is estimated that accurate recognition o ccurs

interconnection of the mo dules to each other The graph

for scale variations up to ab out a factor of vertical shift

implicitly denes a partial order according to which the

variations of plus or minus ab out half the height of the

mo dules must b e up dated in the forward pass For exam

character and rotations up to plus or minus degrees

ple in Figure mo dule is rst up dated then mo dules

While fully invariant recognition of complex shap es is still

and are up dated p ossibly in parallel and nally mo d

an elusive goal it seems that Convolutional Networks oer

ule Mo dules mayormay not have trainable parameters

a partial answer to the problem of invariance or robustness

Loss functions which measure the p erformance of the sys

with resp ect to geometrical distortions

tem are implemented as mo dule In the simplest case

Figure includes examples of the robustness of LeNet

the loss function mo dule receives an external input that

under extremely noisy conditions Pro cessing those

carries the desired output In this framework there is no

images would pose unsurmountable problems of segmen

qualitative dierence b etween trainable parameters WW

tation and feature extraction to many metho ds but

in the gure external inputs and outputs ZDE and

LeNet seems able to robustly extract salient features

intermediate state variablesXXXXX

from these cluttered images The training set used for

the network shown here was the MNIST training set

A A n ObjectOriented Approach

with salt and p epp er noise added Each pixel was ran

domly inverted with probability More examples

Ob jectOriented programming oers a particularly con

of LeNet in action are available on the Internet at

venientway of implementing multimo dule systems Each

httpwwwresearchattcomyannocr

mo dule is an instance of a class Mo dule classes have a for

ward propagation metho d or member function called

IV MultiModule Systems and Graph

fprop whose arguments are the inputs and outputs of the

Transformer Networks

mo dule For example computing the output of mo dule

in Figure can b e done by calling the metho d fprop on

The classical backpropagation algorithm as describ ed

mo dule with the arguments XXX Complex mo d

and used in the previous sections is a simple form of

ules can be constructed from simpler mo dules by simply

GradientBased Learning Howev er it is clear that the

dening anew class whose slots will contain the member

gradient backpropagation algorithm given by Equation

mo dules and the intermediate state variables b etween those

describ es a more general situation than simple multilayer

mo dules The fprop metho d for the class simply calls the

feedforward networks comp osed of alternated linear trans

fprop metho ds of the member mo dules with the appro

formations and sigmoidal functions In principle deriva

priate intermediate state variables or external input and

tives can b e backpropagated through any arrangementof

algorithms are eas outputs as arguments Although the

functional mo dules as long as we can compute the pro d

ily generalizable to anynetwork of such mo dules including

uct of the Jacobians of those mo dules byanyvector Why

those whose inuence graph has cycles we will limit the dis

would wewant to train systems comp osed of multiple het

cussion to the case of directed acyclic graphs feedforward

erogeneous mo dules The answer is that large and complex

networks

trainable systems need to b e built out of simple sp ecialized

Computing derivativesinamultimo dule system is just mo dules The simplest example is LeNet which mixes

as simple A backward propagation metho d called convolutional layers subsampling layers fullyconnected

bprop for each mo dule class can b e dened for that pur layers and RBF layers Another less trivial example de

pose The bprop metho d of a mo dule takes the same ar scrib ed in the next two sections is a system for recognizing

PROC OF THE IEEE NOVEMBER

C1 S2 C3 S4 C5

4 4 4 Output

F6

4 3 8

4 3 3

Fig Examples of unusual distorted and noisy characters correctly recognized by LeNet The greylevel of the output lab el represents

the p enalty lighter for higher p enalties

guments as the fprop metho d All the derivatives in the used to extend the pro cedures to networks with recurrent

system can b e computed by calling the bprop metho d on all connections

the mo dules in reverse order compared to the forward prop

B Special Modules

agation phase The state variables are assumed to contain

slots for storing the gradients computed during the back

Neural networks and many other standard pattern recog

ward pass in addition to storage for the states computed in

nition techniques can be formulated in terms of multi

the forward pass The backward pass eectively computes

mo dular systems trained with GradientBased Learning

the partial derivatives of the loss E with resp ect to all the

Commonly used mo dules include matrix multiplications

state variables and all the parameters in the system There

and sigmoidal mo dules the combination of which can be

is an interesting duality prop ertybetween the forward and

used to build conventional neural networks Other mo d

backward functions of certain mo dules For example a

ules include convolutional layers subsampling layers RBF

sum of several variables in the forward direction is trans

also layers and softmax layers Loss functions are

formed into a simple fanout replication in the backward

represented as mo dules whose single output pro duces the

direction Conversely a fanout in the forward direction

value of the loss Commonly used mo dules have simple

is transformed into a sum in the backward direction The

bprop metho ds In general the bprop metho d ofafunc

software environment used to obtain the results describ ed

tion F is a multiplication by the Jacobian of F Here are

in this pap er called SN uses the ab ove concepts It is

a few commonly used examples The bprop metho d of a

based on a homegrown ob jectoriented dialect of Lisp with

fanout a Y connection is asum and vice versa The

a compiler to C

bprop metho d of a multiplication by a co ecientisamul

The fact that derivatives can b e computed by propaga tiplication by the same co ecient The bprop metho d of a

tion in the reverse graph is easy to understand intuitively multiplication by a matrix is a multiplication by the trans

The b est way to justify it theoretically is through the use of p ose of that matrix The bprop metho d of an addition with

Lagrange functions The same formalism can b e a constant is the identity

PROC OF THE IEEE NOVEMBER

tween the mo dules are all xedsize vectors The limited

exibility of xedsize vectors for data representation is a

serious deciency for many applications notably for tasks

Layer Graph

ariable length inputs eg continuous sp eech

Transformer that deal with v

recognition and handwritten word recognition or for tasks

that require enco ding relationships b etween ob jects or fea

tures whose number and nature can vary invariant per

ception scene analysis recognition of comp osite ob jects

Layer Graph

imp ortant sp ecial case is the recognition of strings of

Transformer An

characters or words

xedsize vectors lack exibility for tasks (a) More generally

(b)

in which the state must enco de probability distributions

over sequences of vectors or symb ols as is the case in lin

Fig Traditional neural networks and multimo dule systems com

guistic pro cessing Such distributions over sequences are

municate xedsize vectors between layer MultiLayer Graph

ted bystochastic grammars or in the more b est represen

Transformer Networks are comp osed of trainable mo dules that

general case directed graphs in which each arc contains a

op erate on and pro duce graphs whose arcs carry numerical in

formation

vector sto chastic grammars are sp ecial cases in whichthe

vector contains probabilities and symb olic information

Eachpathin the graph represents a dierent sequence of

Interestingly certain nondierentiable mo dules can be

vectors Distributions over sequences can be represented

inserted in a multimo dule system without adverse eect

byinterpreting elements of the data asso ciated with each

An interesting example of that is the multiplexer mo dule

arc as parameters of a probability distribution or simply

It has two or more regular inputs one switching input

as a p enalty Distributions over sequences are particularly

and one output The mo dule selects one of its inputs de

handy for mo deling linguistic knowledge in sp eech or hand

p ending up on the discrete value of the switching input

writing recognition systems each sequence ie each path

and copies it on its output While this mo dule is not dif

in the graph represents an alternativeinterpretation of the

ferentiable with resp ect to the switching input it is dier

input Successive pro cessing mo dules progressively rene

entiable with resp ect to the regular inputs Therefore the

the interpretation For example a sp eech recognition sys

overall function of a system that includes such mo dules will

tem might start with a single sequence of acoustic vectors

b e dierentiable with resp ect to its parameters as long as

transform it into a lattice of phonemes distribution over

the switching input do es not dep end up on the parameters

phoneme sequences then into a lattice of words distribu

For example the switching input can b e an external input

tion over word sequences and then into a single sequence

Another interesting case is the min mo dule This mo d

of words representing the b est interpretation

ule has two or more inputs and one output The output

In our work on building largescale handwriting recog

of the mo dule is the minimum of the inputs The func

nition systems we have found that these systems could

tion of this mo dule is dieren tiable everywhere except on

much more easily and quickly b e develop ed and designed

the switching surface which is a set of measure zero In

by viewing the system as a networks of mo dules that take

terestingly this function is continuous and reasonably reg

one or several graphs as input and pro duce graphs as out

ular and that is sucient to ensure the convergence of a

put Such mo dules are called Graph Transformers and the

GradientBased Learning algorithm

complete systems are called Graph Transformer Networks

The ob jectoriented implementation of the multimo dule

or GTN Mo dules in a GTN communicate their states and

idea can easily be extended to include a bbprop metho d

gradients in the form of directed graphs whose arcs carry

that propagates GaussNewton approximations of the sec

numerical information scalars or vectors

ond derivatives This leads to a direct generalization for

From the statistical point of view the xedsize state

mo dular systems of the secondderivative backpropagation

vectors of conventional networks can b e seen as represent

Equation given in the App endix

ing the means of distributions in state space In variable

The multiplexer mo dule is a sp ecial case of a much

size netw orks such as the SpaceDisplacement Neural Net

more general situation describ ed at length in Section VI I I

works describ ed in section VI I the states are variable

where the architecture of the system changes dynamically

length sequences of xed size vectors They can be seen

with the input data Multiplexer mo dules can b e used to

as representing the mean of a probability distribution over

dynamically rewire or recongure the architecture of the

variablelength sequences of xedsize vectors In GTNs

system for each new input pattern

the states are represented as graphs which can be seen

as representing mixtures of probability distributions over

C Graph Transformer Networks

structured collections p ossibly sequences of vectors Fig

ure

Multimo dule systems are a very exible to ol for build

ing large trainable system However the descriptions in One of the main points of the next several sections is

the previous sections implicitly assumed that the set of to show that GradientBased Learning pro cedures are not

parameters and the state information communicated be limited to networks of simple mo dules that communicate

PROC OF THE IEEE NOVEMBER

through xedsize vectors but can b e generalized to GTNs

Gradient backpropagation through a Graph Transformer

takes gradients with resp ect to the numerical informa

tion in the output graph and computes gradients with re

sp ect to the numerical information attached to the input

graphs and with resp ect to the mo dules internal param

Fig Building a segmentation graph with Heuristic Over

eters GradientBased Learning can b e applied as long as

Segmentation

dierentiable functions are used to pro duce the numerical

data in the output graph from the numerical data in the

input graph and the functions parameters

that it avoids making hard decisions ab out the segmenta

The second p oint of the next several sections is to show

tion by taking a large number of dierent segmentations

that the functions implemented by many of the mo dules

into consideration The idea is to use heuristic image pro

used in typical do cument pro cessing systems and other

cessing techniques to nd candidate cuts of the word or

image recognition systems though commonly thoughtto

string and then to use the recognizer to score the alter

be combinatorial in nature are indeed dierentiable with

native segmentations thereby generated The pro cess is

resp ect to their internal parameters as well as with resp ect

depicted in Figure First a number of candidate cuts

to their inputs and are therefore usable as part of a globally

are generated Good candidate lo cations for cuts can be

trainable system

found bylocatingminimainthevertical pro jection prole

In most of the following we will purp osely avoid making

or minima of the distance between the upp er and lo wer

references to probabilitytheory All the quantities manip

contours of the word Better segmentation heuristics are

ulated are viewed as p enalties or costs which if necessary

describ ed in section X The cut generation heuristic is de

can be transformed into probabilities by taking exp onen

signed so as to generate more cuts than necessaryin the

tials and normalizing

hop e that the correct set of cuts will b e included Once

the cuts have b een generated alternative segmentations are

V Multiple Object Recognition Heuristic

b est represented by a graph called the segmentation graph

OverSegmentation

The segmentation graph is a DirectedAcyclic Graph DAG

with a start no de and an end no de Eachinternal no de is

One of the most dicult problems of handwriting recog

asso ciated with a candidate cut pro duced by the segmen

nition is to recognize not just isolated characters but

tation algorithm Each arc between a source no de and a

strings of characters such as zip co des check amounts

destination no de is asso ciated with an image that contains

or words Since most recognizers can only deal with one

all the ink b etween the cut asso ciated with the source no de

character at a time wemust rst segment the string into

and the cut asso ciated with the destination no de An arc

individual character images However it is almost imp os

is created b etween two no des if the segmentor decided that

sible to devise image analysis techniques that will infallibly

the ink b etween the corresp onding cuts could form a can

segment naturally written sequences of characters into well

didate character Typically each individual piece of ink

formed characters

would b e asso ciated with an arc Pairs of successive pieces

The recent history of automatic sp eech recognition

of ink would also b e included unless they are separated by

is here to remind us that training a recognizer byopti

a wide gap which is a clear indication that they b elong

mizing a global criterion at the word or sentence level is

to dierent characters Each complete path through the

much preferable to merely training it on handsegmented

graph contains each piece of ink once and only once Each

phonemes or other units Several recentworks haveshown

path corresp onds to a dierentway of asso ciating pieces of

that the same is true for handwriting recognition op

ink together so as to form characters

timizing a wordlevel criterion is preferable to solely train

ing a recognizer on presegmented characters b ecause the

B Recognition Transformer and Viterbi Transformer

recognizer can learn not only to recognize individual char

acters but also to reject missegmented characters thereby

A simple GTN to recognize character strings is shown

minimizing the overall word error

in Figure It is comp osed of two graph transformers

This section and the next describ e in detail a simple ex

called the recognition transformer T and the Viterbi

rec

ample of GTN to address the problem of reading strings of

transformer T The goal of the recognition transformer

vit

characters suchaswords or check amounts The metho d

is to generate a graph called the interpretation graph or

avoids the exp ensive and unreliable task of handtruthing

rec ognition graph G that contains all the p ossible inter

int

the result of the segmentation often required in more tra

pretations for all the p ossible segmentations of the input

ditional systems trained on individually lab eled character

Each path in G represents one p ossible interpretation of

int

images

one particular segmentation of the input The role of the

Viterbi transformer is to extract the best interpretation

A Segmentation Graph

from the interpretation graph

Anow classical metho d for word segmentation and recog The recognition transformer T takes the segmentation

rec

nition is called Heuristic OverSegmentation Its graph G as input and applies the recognizer for single

seg

main advantages over other approaches to segmentation are characters to the images asso ciated with eachof the arcs

PROC OF THE IEEE NOVEMBER

class Viterbi Penalty label character recognizer penalty for each class Σ

PIECE OF THE 7.9 INTERPRETATION "0" 6.7 "0" GRAPH "1" 11.2 2 "1" 10.3 Viterbi 6.8 3 Path "2" Gvit "3" 0.2

4 "8" 0.3 "8" 13.5 "9" 12.5 "9" 8.4 Viterbi T vit Transformer

Character Character W Recognizer Recognizer 3 3 1 2 Interpretation 2 4 4 3 Gint 4 Graph 3 1 4 4

8 3 candidate Recognition 0.1 0.5 segment T NN NN NN NN NN NN Transformer PIECE OF THE image rec SEGMENTATION GRAPH penalty given by

the segmentor

The recognition transformer renes each arc of the segmen

Segmentation Fig

to a set of arcs in the interpretation graph one p er

Gseg Graph tation arc in

character class with attached p enalties and lab els

Fig Recognizing a character string with a GTN For readability

only the arcs with low p enalties are shown

famous Viterbi algorithm an application of the prin

ciple of dynamic programming to nd the shortest path

in a graph eciently Let c b e the p enalty asso ciated to

i

in the segmentation graph The interpretation graph G

int

arc i with source no de s and destination no de d note

i i

has almost the same structure as the segmentation graph

that there can be multiple arcs between two no des In

except that each arc is replaced by a set of arcs from and

the interpretation graph arcs also have a lab el l The

i

to the same no de In this set of arcs there is one arc for

Viterbi algorithm pro ceeds as follows Eachnode n is as

each p ossible class for the image asso ciated with the cor

so ciated with acumulated Viterbi penalty v Those cu

n

resp onding arc in G As shown in Figure to each

seg

mulated p enalties are computed in any order that satises

arc is attached a class lab el and the p enalty that the im

the partial order dened bytheinterpretation graph which

age b elongs to this class as pro duced by the recognizer If

is directed and acyclic The start no de is initialized with

the segmentor has computed p enalties for the candidate

ulated penalty v The other no des cu the cum

start

segments these p enalties are combined with the p enalties

mulated p enalties v are computed recursively from the v

n

computed by the character recognizer to obtain the p enal

values of their parent no des through the upstream arcs

ties on the arcs of the interpretation graph Although com

U farc i with destination d ng

n i

bining p enalties of dierent nature seems highly heuristic

the GTN training pro cedure will tune the p enalties and

takeadvantage of this combination anyway Each path in

v min c v

n i s

i

the interpretation graph corresp onds to a p ossible inter

iU

n

input word The penalty of a particular pretation of the

interpretation for a particular segmentation is given bythe

sum of the arc p enalties along the corresp onding path in

Furthermore the value of i for eachnoden which minimizes

the interpretation graph Computing the p enalty of an in

the right hand side is noted m the minimizing entering

n

terpretation indep endently of the segmentation requires to

arc When the end no de is reached we obtain in v the

end

combine the p enalties of all the paths with that interpre

total p enalty of the path with the smallest total p enalty

tation An appropriate rule for combining the p enalties of

We call this p enalty the Viterbi penalty and this sequence

parallel paths is given in section VIC

of arcs and no des the Viterbi path To obtain the Viterbi

The Viterbi transformer pro duces a graph G with a path with no des n n and arcs i i we trace back

vit T T 

single path This path is the path of least cumulated these no des and arcs as follows starting with n the end

T

p enalty in the Interpretation graph The result of the no de and recursively using the minimizing entering arc

recognition can be pro duced by reading o the lab els of until the start no de is reached and n s i m

t i t n

t t

the arcs along the graph G extracted by the Viterbi The lab el sequence can then be read o the arcs of the

vit

transformer The Viterbi transformer owesitsnametothe Viterbi path

PROC OF THE IEEE NOVEMBER

Global Training for Graph Transformer

VI Constrained Viterbi Penalty Ccvit orks

Netw Σ

The previous section describ es the pro cess of recognizing

erSegmentation assuming that

a string using Heuristic Ov Best Constrained Path Gcvit

the recognizer is trained so as to givelow p enalties for the

Viterbi Transformer

correct class lab el of correctly segmented characters high

for erroneous categories of correctly segmented p enalties Constrained

Interpretation Graph Gc

characters and high p enalties for all categories for badly

haracters This section explains how to train the

formed c Desired Sequence Path Selector

system at the string level to do the ab ove without requiring

ual lab eling of character segments This training will

man Interpretation Graph Gint

be p erformed with a GTN whose architecture is slightly

Recognition

t from the recognition architecture describ ed in the

dieren Transformer

previous section

In many applications there is enough a priori knowl

Fig Viterbi Training GTN Architecture for a character string

recognizer based on Heuristic OverSegmentation

edge ab out what is exp ected from each of the mo dules in

order to train them separately For example with Heuris

tic OverSegmentation one could individually lab el single

character images and train a character recognizer on them

Neural Networks RNN Unfortunately despite early en

but it might be dicult to obtain an appropriate set of

thusiasm the training of RNNs with gradientbased tech

nonc haracter images to train the mo del to reject wrongly

niques has proved very dicult in practice

segmented candidates Although separate training is sim

The GTN techniques presented b elow simplify and gen

ple it requires additional sup ervision information that is

eralize the global training metho ds develop ed for sp eech

often lacking or incomplete the correct segmentation and

recognition

the lab els of incorrect candidate segments Furthermore

it can b e shown that separate training is suboptimal

The following section describ es three dierent gradient

A Viterbi Training

based metho ds for training GTNbased handwriting recog

During recognition we select the path in the Interpre nizers at the string level Viterbi training discriminative

tation Graph that has the lowest p enalty with the Viterbi Viterbi training forward training and discriminativefor

ward training The last one is a generalization to graph

algorithm Ideallywewould like this path of lowest p enalty

based systems of the MAP criterion intro duced in Sec

to b e asso ciated with the correct lab el sequence as often as

tion I IC Discriminative forward training is somewhat

p ossible An obvious loss function to minimize is therefore

the average over the training set of the p enalty of the path similar to the socalled Maximum Mutual Information cri

associated with the correct label sequence that has the low terion used to train HMM in sp eech recognition However

our rationale diers from the classical one We make no

est p enalty The goal of training will b e to nd the set of

recourse to a probabilistic interpretation but show that

recognizer parameters the weights if the recognizer is a

within the GradientBased Learning approach discrimina

neural net work that minimize the average p enalty of this

tive training is a simple instance of the p ervasive principle correct lowest penalty path The gradient of this loss

of error correcting learning function can be computed by backpropagation through

Training metho ds for graphbased sequence recognition the GTN architecture shown in gure This training

systems such as HMMs have been extensively studied in

architecture is almost identical to the recognition archi

the context of sp eech recognition Those metho ds re

tecture describ ed in the previous section except that an

extra graph transformer called a path selector is inserted quire that the system b e based on probabilistic generative

between the Interpretation Graph and the Viterbi Trans mo dels of the data which provide normalized likelihoods

former This transformer takes the interpretation graph over the space of p ossible input sequences Popular HMM

learning metho ds such as the the BaumWelsh algorithm

and the desired lab el sequence as input It extracts from

the interpretation graph those paths that contain the cor rely on this normalization The normalization cannot be

preserved when nongenerativemodelssuch as neural net rect desired lab el sequence Its output graph G is called

c

works are integrated into the system Other techniques the constr ained interpretation graph also known as forced

suchasdiscriminative training metho ds must be used in alignment in the HMM literature and contains all the

this case Several authors have prop osed such metho ds to

paths that corresp ond to the correct lab el sequence The

train neural networkHMM sp eech recognizers at the word constrained interpretation graph is then sent to the Viterbi

or sentence level transformer which pro duces a graph G with a single

cvit

path This path is the correct path with the lowest

Other globally trainable sequence recognition systems p enalty Finally a path scorer transformer takes G and

cvit

avoid the diculties of statistical mo deling by not resorting simply computes its cumulated p enalty C by adding up

cvit

to graphbased techniques The b est example is Recurrent the p enalties along the path The output of this GTN is

PROC OF THE IEEE NOVEMBER

the loss function for the current pattern that integrate neural networks with time alignment

or hybrid neuralnetworkHMM systems

E C

vit cvit

While it seems simple and satisfying this training ar

The only lab el information that is required by the ab ove

chitecture has a aw that can potentially be fatal The

system is the sequence of desired character lab els No

problem was already mentioned in Section IIC If the

knowledge of the correct segmentation is required on the

recognizer is a simple neural network with sigmoid out

part of the sup ervisor since it cho oses among the segmen

put units the minimum of the loss function is attained

tations in the interpretation graph the one that yields the

not when the recognizer always gives the rightanswer but

lowest p enalty

when it ignores the input and sets its output to a constant

The pro cess of backpropagating gradients through the

all the comp onents This is vector with small values for

Viterbi training GTN is now describ ed As explained in

known as the col lapse problem The collapse only o ccurs if

section IV the gradients must be propagated backwards

the recognizer outputs can simultaneously take their min

through all mo dules of the GTN in order to compute gra

imum value If on the other hand the recognizers out

dients in preceding mo dules and thereafter tune their pa

put layer contains RBF units with xed parameters then

rameters Backpropagating gradients through the path

there is no such trivial solution This is due to the fact

es of scorer is quite straightforward The partial derivativ

that a set of RBF with xed distinct parameter vectors

the loss function with resp ect to the individual p enalties on

cannot simultaneously take their minimum value In this

the constrained Viterbi path G are equal to since the

cvit

case the complete collapse describ ed ab ove do es not o ccur

loss function is simply the sum of those p enalties Back

However this do es not totally prevent the o ccurrence of a

propagating through the Viterbi Transformer is equally

milder collapse because the loss function still has a at

simple The partial derivatives of E with resp ect to the

vit

sp ot for a trivial solution with constant recognizer out

p enalties on the arcs of the constrained graph G are

c

put This at sp ot is a saddle p oint but it is attractivein

for those arcs that app ear in the constrained Viterbi path

almost all directions and is very dicult to get out of using

G and for those that do not Why is it legitimate

cvit

gradientbased minimization pro cedures If the parameters

to backpropagate through an essentially discrete function

of the RBFs are allowed to adapt then the collapse prob

suchasthe Viterbi Transformer The answer is that the

lems reapp ears b ecause the RBF centers can all converge

Viterbi Transformer is nothing more than a collection of

to a single vector and the underlying neural network can

min functions and adders put together It was shown in

learn to pro duce that vector and ignore the input A dif

Section IV that gradientscanbebackpropagated through

ferent kind of collapse o ccurs if the width of the RBFs are

min functions without adverse eects Backpropagation

also allowed to adapt The collapse only o ccurs if a train

through the path selector transformer is similar to back

able mo dule such as a neural network feeds the RBFs The

propagation through the Viterbi transformer Arcs in G

int

collapse do es not o ccur in HMMbased sp eech recognition

that app ear in G have the same gradient as the corre

c

systems b ecause they are generative systems that pro duce

sp onding arc in G ie or dep ending on whether the

c

normalized likeliho o ds for the input data more on this

arc app ear in G The other arcs ie those that do

cvit

later Another way to avoid the collapse is to train the

not have an alter ego in G b ecause they do not contain

c

whole system with resp ect to a discriminative training cri

the right lab el have a gradient of During the forward

terion such as maximizing the conditional probability of

propagation through the recognition transformer one in

the correct interpretations correct sequence of class lab els

stance of the recognizer for single character was created

given the input image

for each arc in the segmentation graph The state of rec

Another problem with Viterbi training is that the

ognizer instances was stored Since each arc penalty in

p enalty of the answer cannot be used reliably as a mea

G is pro duced by an individual output of a recognizer

int

sure of condence b ecause it do es not takelowp enalty or

instance we now have a gradient or for each out

highscoring comp eting answers into accoun t

put of each instance of the recognizer Recognizer outputs

that have a non zero gradientare part of the correct an

B Discriminative Viterbi Training

swer and will therefore have their value pushed down The

A mo dication of the training criterion can circumvent gradients present on the recognizer outputs can be back

the collapse problem describ ed ab ove and at the same time propagated through each recognizer instance For each rec

pro duce more reliable condence values The idea is to not ognizer instance we obtain a vector of partial derivatives

only minimize the cumulated p enaltyofthelowest p enalty of the loss function with resp ect to the recognizer instance

path with the correct interpretation but also to somehow parameters All the recognizer instances share the same pa

increase the p enalty of comp eting and p ossibly incorrect rameter vector since they are merely clones of each other

paths that have a dangerously low p enalty This typ e of therefore the full gradient of the loss function with resp ect

criterion is called discriminative b ecause it plays the go o d to the recognizers parameter vector is simply the sum of

ectors pro duced by each recognizer instance answers against the bad ones Discriminative training pro the gradientv

Viterbi training though formulated dierently is often use cedures can be seen as attempting to build appropriate

in HMMbased sp eech recognition systems Similar al separating surfaces between classes rather than to mo del

gorithms have b een applied to sp eech recognition systems individual classes indep endently of each other For exam

PROC OF THE IEEE NOVEMBER

Loss Function [0.1](+1) Σ + − [0.7](+1) [0.6](−1) + +

3 [0.1](+1) Gvit Gcvit 4 [0.6](+1) 3 [0.1](−1) 4 [0.4](−1) 1 [0.1](−1) Viterbi Tansformer

3 [0.1](+1) 4 [2.4](0) Gc Viterbi Transformer 3 [3.4](0) 4 [0.6](+1)

"34" Path Selector Desired Answer 3 [0.1](0) 4 [0.4](−1) 1 [0.1](−1) Interpretation 5 [2.3](0) 2 [1.3](0) 4 [2.4](0) Graph Gint 3 [3.4](0) 4 [0.6](+1) 4 [4.4](0) 9 [1.2](0)

(−1) (+1) (−1) Recognition 4 4 1 Transfomer W NN NN NN NN NN T Neural Net rec Weights

Segmentation Graph

Gseg

Segmenter

Fig Discriminative Viterbi Training GTN Architecture for a character string recognizer based on Heuristic OverSegmentation Quantities

in square brackets are p enalties computed during the forward propagation Quantities in parentheses are partial derivatives computed

during the backward propagation

PROC OF THE IEEE NOVEMBER

ple mo deling the conditional distribution of the classes low p enalty but should have had a higher p enalty since it

given the input image is more discriminative fo cussing is not part of the desired answer

more on the classication surface than having a separate

Variations of this technique have b een used for the sp eech

generative mo del of the input data asso ciated to each class

recognition Driancourt and Bottou used a version of

which with class priors yields the whole joint distribu

it where the loss function is saturated to a xed value

tion of classes and inputs This is b ecause the conditional

This can b e seen as a generalization of the Learning Vector

approach do es not need to assume a particular form for the

Quantization LVQ loss function Other variations

distribution of the input data

of this metho d use not only the Viterbi path but the K

One example of discriminative criterion is the dierence

b est paths The Discriminative Viterbi algorithm do es not

between the p enalty of the Viterbi path in the constrained

have the aws of the nondiscriminativeversion but there

graph and the p enalty of the Viterbi path in the uncon

are problems nonetheless The main problem is that the

strained interpretation graph ie the dierence b etween

criterion do es not build a margin b etween the classes The

the p enalty of the best correct path and the penalty of

gradientis zero as so on as the p enaltyoftheconstrained

the best path correct or incorrect The corresp onding

Viterbi path is equal to that of the Viterbi path It would

GTN training architecture is shown in gure The left

be desirable to push up the p enalties of the wrong paths

side of the diagram is identical to the GTN used for non

The when they are dangerously close to the good one

discriminative Viterbi training This loss function reduces

following section presents a solution to this problem

the risk of collapse b ecause it forces the recognizer to in

creases the p enalty of wrongly recognized ob jects Dis

C ForwardScoring and ForwardTraining

criminative training can also be seen as another example

of error correction procedure which tends to minimize the

While the p enalty of the Viterbi path is p erfectly appro

dierence b etween the desired output computed in the left

priate for the purp ose of recognition it gives only a partial

half of the GTN in gure and the actual output com

picture of the situation Imagine the lowest p enalty paths

puted in the right half of gure

corresp onding to several dierent segmentations pro duced

Let the discriminative Viterbi loss function be denoted

the same answer the same lab el sequence Then it could

E and let us call C the p enalty of the Viterbi path in

dvit cvit

be argued that the overall penalty for the interpretation

the constrained graph and C the p enalty of the Viterbi

vit

should b e smaller than the p enalty obtained when only one

path in the unconstrained interpretation graph

path pro duced that interpretation b ecause multiple paths

with identical lab el sequences are more evidence that the

E C C

dvit cvit vit

lab el sequence is correct Several rules can b e used com

pute the p enalty asso ciated to a graph that contains several

E is always p ositive since the constrained graph is a

dvit

parallel paths Weuseacombination rule b orrowed from

subset of the paths in the interpretation graph and the

a probabilistic interpretation of the p enalties as negative

Viterbi algorithm selects the path with the lo west total

log p osteriors In a probabilistic framework the p osterior

p enalty In the ideal case the two paths C and C

cvit vit

probability for the interpretation should b e the sum of the

coincide and E is zero

dvit

p osteriors for all the paths that pro duce that interpreta

Backpropagating gradients through the discriminative

tion Translated in terms of p enalties the p enalty of an

Viterbi GTN adds some negative training to the pre

interpretation should b e the negative logarithm of the sum

viously describ ed nondiscriminative training Figure

of the negative exp onentials of the p enalties of the individ

shows how the gradients are backpropagated The left

ual paths The overall p enalty will b e smaller than all the

half is identical to the nondiscriminative Viterbi training

p enalties of the individual paths

GTN therefore the backpropagation is identical The gra

dients backpropagated through the righthalfoftheGTN Given an interpretation there is a well known metho d

are multiplied by since C contributes to the loss with called the forward algorithm for computing the ab ove quan

vit

a negative sign Otherwise the pro cess is similar to the left tity eciently The p enalty computed with this pro

half The gradients on arcs of G get p ositivecontribu cedure for a particular interpretation is called the forward

int

tions from the left half and negative contributions from the penalty Consider again the concept of constrained graph

righthalf The two con tributions must b e added since the the subgraph of the interpretation graph which contains

p enalties on G arcs are sentto the two halves through only the paths that are consistentwith a particular lab el

int

a Y connection in the forward pass Arcs in G that sequence There is one constrained graph for each pos

int

app ear neither in G nor in G have a gradient of zero sible lab el sequence some may be empty graphs which

vit cvit

They do not contribute to the cost Arcs that app ear in have innite p enalties Given an interpretation running

both G and G also have zero gradient The contri the forward algorithm on the corresp onding constrained

vit cvit

graph gives the forward p enalty for that interpretation bution from the right half cancels the the contribution

The forward algorithm pro ceeds in a way very similar to from the left half In other words when an arc is rightfully

the Viterbi algorithm except that the op eration used at part of the answer there is no gradient If an arc app ears

each no de to combine the incoming cumulated p enalties in G but not in G the gradientis The arc should

cvit vit

instead of being the min function is the socalled logadd havehad alower p enalty to makeitto G If an arc is

vit

but not in G the gradientis Thearchada op eration which can b e seen as a soft version of the min in G

cvit vit

PROC OF THE IEEE NOVEMBER

function Edforw

f logadd c f

n i s

i

iU n

+ −

where f U is the set of upstream arcs of no de n n

start Cdforw

c is the p enalty on arc iand i Cforw

Forward Scorer

n

X

x

i

logaddx x x log e

n Constrained Forward Scorer

Interpretation Graph Gc

i Desired

Path Selector

that b ecause of numerical inaccuracies it is b etter

Note Sequence

x

i

to factorize the largest e corresp onding to the smallest

Gint y out of the logarithm

p enalt Interpretation Graph

An interesting analogy can b e drawn if we consider that

Recognition

hwe apply the forward algorithm is equiv

a graph on whic Transformer

alent to a neural network on whichwerunaforward prop

agation except that multiplications are replaced by addi

Fig Discriminative Forward Training GTN Architecture

for a character string recognizer based on Heuristic Over

tions the additions are replaced by logadds and there are

Segmentation

no sigmoids

ard algorithm is to think One way to understand the forw

ab out multiplicative scores eg probabilities instead of

G

c

X

E E

additive p enalties on the arcs score exp p enalty In

f c f

i n

d

i

e e

that case the Viterbi algorithm selects the path with the

f f

n d

i

iD

n

largest cumulative score with scores multiplied along the

where D farc i with source s ng is the set of down

n i

path whereas the forward score is the sum of the cumula

stream arcs from no de n From the ab ove derivatives the

tive scores asso ciated to each of the p ossible paths from the

derivatives with resp ect to the arc p enalties are obtained

start to the end no de The forward p enaltyisalways lower

than the cumulated p enaltyonany of the paths but if one

E E

c f f

i s

d

i

i

path dominates with a muchlower p enalty its p enalty

e

c f

i d

i

is almost equal to the forward p enalty The forward algo

rithm gets its name from the forward pass of the wellknown

This can b e seen as a soft version of the backpropagation

BaumWelsh algorithm for training Hidden MarkovMod

through a Viterbi scorer and transformer All the arcs in

els Section VI I IE gives more details on the relation

G havean inuence on the loss function The arcs that

c

betweenthiswork and HMMs

b elong to low p enalty paths have a larger inuence Back

The advantage of the forward p enalty with resp ect to

propagation through the path selector is the same as b efore

the Viterbi p enalty is that it takes into account all the

The derivative with resp ect to G arcs that havean alter

int

dierentways to pro duce an answer and not just the one

ego in G are simply copied from the corresp onding arc in

c

with the lowest p enalty This is imp ortant if there is some

G The derivatives with resp ect to the other arcs are

c

ambiguity in the segmentation since the combined forward

Several authors have applied the idea of back

p enaltyoftwo paths C and C asso ciated with the same

propagating gradients through a forward scorer to train

lab el sequence may b e less than the p enalty of a path C

sp eech recognition systems including Bridle and his net

asso ciated with another lab el sequence even though the

mo del and Haner and his TDNN mo del but

p enaltyofC might b e less than anyoneofC or C

these authors recommended discriminative training as de

The Forward training GTN is only a slight mo dica

scrib ed in the next section

tion of the previously intro duced Viterbi training GTN It

D Discriminative ForwardTraining

suces to turn the Viterbi transformers in Figure into

ForwardScorers that takeaninterpretation graph as input

The information contained in the forward p enalty can b e

an pro duce the forward p enaltyof that graph on output

used in another discriminative training criterion whichwe

Then the p enalties of all the paths that contain the correct

will call the discriminative forward criterion This criterion

answer are lowered instead of just that of the b est one

corresp onds to maximization of the posterior probability of

choosing the paths associated with the correct interpreta Backpropagating through the forward p enalty computa

tion This p osterior probability is dened as the exp onen tion the forward transformer is quite dierentfromback

propagating through a Viterbi transformer All the p enal tial of the minus the constrained forward p enalty normal

ties of the input graph have an inuence on the forward ized by the exp onential of minus the unconstrained forward

p enalty but p enalties that belong to lowp enalty paths p enalty Note that the forward p enalty of the constrained

have a stronger inuence Computing derivatives with re graph is always larger or equal to the forward p enaltyofthe

sp ect to the forward p enalties f computed at each n no de unconstrained interpretation graph Ideallywewould like

n

of a graph is done bybackpropagation through the graph the forward p enalty of the constrained graph to b e equal to

PROC OF THE IEEE NOVEMBER

the forward p enalty of the complete interpretation graph E Remarks on Discriminative Training

Equalitybetween those two quantities is achieved when the

In the ab ove discussion the global training criterion

combined p enalties of the paths with the correct lab el se

was given a probabilistic interpretation but the individ

quence is negligibly small compared to the p enalties of all

ual p enalties on the arcs of the graphs were not There are

the other paths or that the p osterior probability asso ci

go o d reasons for that For example if some p enalties are

ated to the paths with the correct interpretation is almost

asso ciated to the dierent class lab els they would have

which is precisely what we want The corresp onding

to sum to class p osteriors or integrate to over the

GTN training architecture is shown in gure

input domain likeliho o ds

Let the dierence be denoted E and let us call

dforw

Let us rst discuss the rst case class p osteriors normal

C the forward p enalty of the constrained graph and

cforw

ization This lo cal normalization of p enalties may elimi

C the forward p enalty of the complete interpretation

forw

nate information that is imp ortant for lo cally rejecting all

graph

the classes eg when a piece of image do es not cor

E C C

dforw cforw forw

resp ond to a valid character class b ecause some of the

segmentation candidates may b e wrong Although an ex

constrained graph is a E is always p ositive since the

dforw

plicit garbage class can b e intro duced in a probabilistic

subset of the paths in the interpretation graph and the

framework to address that question some problems remain

forward p enalty of a graph is always larger than the for

b ecause it is dicult to characterize such a class probabilis

ward p enalty of a subgraph of this graph In the ideal case

tically and to train a system in this way it would require

the p enalties of incorrect paths are innitely large there

a density mo del of unseen or unlab eled samples

fore the two p enalties coincide and E is zero Readers

dforw

The probabilistic interpretation of individual variables

familiar with the connectionist mo del

plays an imp ortant role in the BaumWelsh algorithm

might recognize the constrained and unconstrained graphs

in combination with the Exp ectationMaximization pro ce

as analogous to the clamp ed constrained by the ob

dure Unfortunately those metho ds cannot b e applied to

served values of the output variable and free uncon

strained phases of the Boltzmann machine algorithm discriminative training criteria and one is reduced to us

ing gradientbased metho ds Enforcing the normalization

Backpropagating derivatives through the discriminative

of the probabilistic quantities while p erforming gradient

Forward GTN distributes gradients more evenly than in the

based learning is complex inecient time consuming and

Viterbi case Derivatives are backpropagated through the

creates illconditioning of the lossfunction

left half of the the GTN in Figure down to the interpre

tation graph Derivatives are negated and backpropagated Following we therefore prefer to p ostp one normal

through the righthalf and the result for each arc is added ization as far as p ossible in fact until the nal decision

stage of the system Without normalization the quanti

to the contribution from the left half Eac h arc in G

int

ties manipulated in the system do not have a direct prob

nowhasaderivative Arcs that are part of a correct path

abilistic interpretation

have a p ositivederivative This derivativeisvery large if

an incorrect path has a lower p enalty than all the correct Let us now discuss the second case using a generative

paths Similarly the derivatives with resp ect to arcs that mo del of the input Generative mo dels build the b oundary

arepartofalowp enalty incorrect path have a large nega indirectlyby rst building an indep endent densitymodel

for each class and then p erforming classication decisions

tive derivative On the other hand if the p enaltyofapath

on the basis of these mo dels This is not a discriminative

asso ciated with the correct interpretation is much smaller

than all other paths the loss function is very close to approach in that it do es not fo cus on the ultimate goal of

and almost no gradientis backpropagated The training learning which in this case is to learn the classication de

therefore concentrates on examples of images which yield a cision surface Theoretical arguments suggest that

estimating input densities when the real goal is to obtain

classication error and furthermore it concentrates on the

a discriminant function for classication is a sub optimal

pieces of the image which cause that error Discriminative

forward training is an elegant and ecientway of solving strategy In theory the problem of estimating densities in

the infamous credit assignment problem for learning ma highdimensional spaces is much more illp osed than nd

chines that manipulate dynamic data structures suchas ing decision b oundaries

graphs More generally the same idea can Even though the internal variables of the system do not be used in all

have a direct probabilistic interpretation the overall sys situations where a learning machine must cho ose b etween

tem can still b e viewed as pro ducing p osterior probabilities discrete alternativeinterpretations

for the classes In fact assuming that a particular lab el se As previously the derivatives on the interpretation graph

quence is given as the desired sequence to the GTN in p enalties can then be backpropagated into the character

gure the exp onential of minus E can be inter recognizer instances Backpropagation through the char

dforw

preted as an estimate of the p osterior probabilityof that acter recognizer gives derivatives on its parameters All the

lab el sequence given the input The sum of those p osteriors gradientcontributions for the dierent candidate segments

for all the p ossible lab el sequences is Another approach are added up to obtain the total gradient asso ciated to one

would consists of directly minimizing an approximation of pair input image correct lab el sequence that is one ex

the numb er of misclassications We prefer to use ample in the training set A step of sto chastic gradient

the discriminative forward loss function b ecause it causes descent can then b e applied to up date the parameters

PROC OF THE IEEE NOVEMBER

"U"

Recognizer

Fig Explicit segmentation can b e avoided bysweeping a recog

nizer at every p ossible lo cation in the input eld

$

less numerical problems during the optimization We will

see in Section XC that this is a go o d way to obtain scores

Fig A Space DisplacementNeuralNetwork is a convolutional

on which to base a rejection strategy The imp ortant p oint

network that has b een replicated over a wide input eld

b eing made here is that one is free to cho ose any param

eterization deemed appropriate for a classication mo del

The fact that a particular parameterization uses internal

characters within a string may have widely varying sizes

variables with no clear probabilistic interpretation do es not

and baseline p ositions Therefore the recognizer must be

make the mo del any less legitimate than mo dels that ma

very robust to shifts and size variations

nipulate normalized quantities

These three problems are elegantly circumvented if a

An imp ortant advantage of global and discriminative

convolutional network is replicated over the input eld

training is that learning fo cuses on the most imp ortant

First of all as shown in section III convolutional neu

errors and the system learns to integrate the ambigui

ral networks are very robust to shifts and scale varia

ties from the segmentation algorithm with the ambigui

tions of the input image as well as to noise and extra

ties of the character recognizer In Section IX we present

neous marks in the input These prop erties take care of

exp erimental results with an online handwriting recogni

the latter two problems mentioned in the previous para

tion system that conrm the advantages of using global

graph Second convolutional networks provide a drastic

training versus separate training Exp eriments in sp eech

saving in computational requirement when replicated over

recognition with hybrids of neural networks and HMMs

large input elds A replicated convolutional network also

also showed marked improvements broughtby global train

called a Space Displacement Neural Network or SDNN

ing

is shown in Figure While scanning a recognizer can

VI I Multiple Object Recognition Space

be prohibitively exp ensive in general convolutional net

Displacement Neural Network

works can be scanned or replicated very eciently over

large variablesize input elds Consider one instance of

aconv olutional net and its alter ego at a nearby lo cation

There is a simple alternative to explicitly segmenting im

Because of the convolutional nature of the network units

ages of character strings using heuristics The idea is to

in the two instances that lo ok at identical lo cations on the

sweep a recognizer at all p ossible lo cations across a nor

input haveidentical outputs therefore their states do not

malized image of the entire word or string as shown in

need to be computed twice Only a thin slice of new

Figure With this technique no segmentation heuris

states that are not shared by the two network instances

tics are required since the system essentially examines al l

needs to be recomputed When all the slices are put to

the p ossible segmentations of the input However there

gether the result is simply a larger convolutional network

are problems with this approach First the metho d is in

whose structure is identical to the original network except

general quite exp ensive The recognizer must be applied

that the feature maps are larger in the horizontal dimen

at every p ossible lo cation on the input or at least at a

sion In other words replicating a convolutional network

largeenoughsubset of lo cations so that misalignments of

can b e done simply by increasing the size of the elds over

characters in the eld of view of the recognizers are small

which the convolutions are p erformed and by replicating

enough to have no eect on the error rate Second when

theoutputlayer accordingly The output layer eectively

the recognizer is centered on a character to b e recognized

b ecomes a convolutional layer An output whose receptive

the neighb ors of the center character will b e presentinthe

eld is centered on an elementary ob ject will pro duce the

eld of view of the recognizer p ossibly touching the cen

class of this ob ject while an inb etween output may indi

ter character Therefore the recognizer must be able to

cate no character or contain rubbish The outputs can b e

correctly recognize the character in the center of its input

interpreted as evidences for the presence of ob jects at all

eld even if neighb oring characters are very close to or

p ossible p ositions in the input eld

Third a word or charac touching the central character

ter string cannot b e p erfectly size normalized Individual The SDNN architecture seems particularly attractive for

PROC OF THE IEEE NOVEMBER e handwriting where no reliable segmen

recognizing cursiv Viterbi Answer

tation heuristic exists Although the idea of SDNN is quite

ery attractiveby its simplicity it has not gener

old and v Viterbi Graph

ated wide interest until recently b ecause as stated ab ove In

it puts enormous demands on the recognizer Viterbi Transformer

h recognition where the recognizer is at least one

sp eec Interpretation Graph

order of magnitude smaller replicated convolutional net

orks are easier to implement for instance in Haners w Character

MultiState TDNN mo del Model Compose

Transducer

A Interpreting the Output of an SDNN with a GTN S....c.....r...... i....p....t

SDNN Output s....e.....n.....e.j...o.T

ectors which

The output of an SDNN is a sequence of v 5...... a...i...u...... p.....f

enco de the likeliho o ds p enalties or scores of nding char

SDNN

of a particular class lab el at the corresp onding lo

acter Transformer

cation in the input A p ostpro cessor is required to pull

out the b est p ossible lab el sequence from this vector se

quence An example of SDNN output is sho wn in Fig

Fig A Graph Transformer pulls out the b est interpretation from

the output of the SDNN

ure Very often individual characters are sp otted by

several neighb oring instances of the recognizer a conse

of the robustness of the recognizer to horizontal

quence C1 C3 C5

Answer

Also quite often characters are erroneously

translations 2345

by recognizer instances that see only a piece of detected Compose + Viterbi

SDNN

character For example a recognizer instance that only

a 2 3 3 4 5 Output

t third of a might output the lab el How

sees the righ F6

can we eliminate those extraneous characters from the out

sequence and pullout the best interpretation This

put Input

can b e done using a new typ e of Graph Transformer with

two input graphs as shown in Figure The sequence of

vectors pro duced by the SDNN is rst co ded into a linear

graph with multiple arcs b etween pairs of successive no des

Fig An example of multiple character recognition with SDNN

Each arc b etween a particular pair of no des contains the

With SDNN no explicit segmentation is p erformed

lab el of one of the p ossible categories together with the

p enalty pro duced by the SDNN for that class lab el at that

lo cation This graph is called the SDNN Output Graph

B Experiments with SDNN

The second input graph to the transformer is a grammar

In a series of exp eriments LeNet was trained with the

transducer more sp ecically a nitestate transducer

goal of b eing replicated so as to recognize multiple char

that enco des the relationship b etween input strings of class

acters without segmentations The data was generated

lab els and corresp onding output strings of recognized char

from the previously describ ed Mo died NIST set as fol

actersThe transducer is a weighted nite state machine a

lows Training images were comp osed of a central char

graph where each arc contains a pair of lab els and p ossibly

acter anked bytwo side characters picked at random in

a p enalty Like a nitestate machine a transducer is in a

the training set The separation between the b ounding

state and follows an arc to a new state when an observed

boxes of the characters were chosen at random b etween

input symbol matches the rst symb ol in the symb ol pair

and pixels In other instances no central character was

attached to the arc Atthispoint the transducer emits the

present in which case the desired output of the network

second symb ol in the pair together with a p enaltythatcom

w as the blank space class In addition training images

bines the p enalty of the input symbol and the penaltyof

were degraded with salt and p epp er noise random

the arc A transducer therefore transforms a weighted sym

pixel inversions

b ol sequence into another weighted symb ol sequence The

graph transformer shown in gure p erforms a composi Figures and show a few examples of success

tion between the recognition graph and the grammar trans ful recognitions of multiple characters by the LeNet

ducer This op eration takes every p ossible sequence corre SDNN Standard techniques based on Heuristic Over

sp onding to every p ossible path in the recognition graph Segmentation would fail miserably on many of those ex

and matches them with the paths in the grammar trans amples As can be seen on these examples the network

ducer The comp osition pro duces the interpretation graph exhibits striking invariance and noise resistance prop erties

whichcontains a path for each corresp onding output lab el While some authors have argued that invariance requires

sequence This comp osition op eration may seem combina more sophisticated mo dels than feedforward neural net

torially intractable but it turns out there exists an ecient works LeNet exhibits these prop erties to a large ex

algorithm for it describ ed in more details in Section VI I I tent

PROC OF THE IEEE NOVEMBER

540 1114 5 5 4 0 1 1 1 4 4 1

Answer 678 3514 SDNN output 6 7 7 7 8 8 3 5 5 1 1 4 F6

Input

Fig An SDNN applied to a noisy image of digit string The digits shown in the SDNN output represent the winning class lab els with

a lighter grey level for highp enalty answers

Similarly it has b een suggested that accurate recognition which they can b e implemented on parallel hardware Sp e

of multiple overlapping ob jects require explicit mechanisms cialized analogdigital chips have b een designed and used

that would solve the socalled feature binding problem in character recognition and in image prepro cessing appli

As can b e seen on Figures and the network is able to cations However the rapid progress of conventional

tell the characters apart even when they are closely inter pro cessor technology with reducedprecision vector arith

twined a task that would b e imp ossible to achieve with the metic instructions suchasIntels MMX make the success

more classical Heuristic OverSegmentation technique The of sp ecialized hardware hyp othetical at b est

SDNN is also able to correctly group disconnected pieces

ed at Short video clips of the LeNet SDNN can b e view

of ink that form characters Good examples of that are

httpwwwresearchattcomyannocr

shown in the upp er half of gure In the top left ex

C Global Training of SDNN

ample the and the are more connected to each other

than they are connected with themselves yet the system

In the ab ove exp eriments the string image were arti

correctly identies the and the as separate ob jects The

cially generated from individual character The advantage

top right example is interesting for several reasons First

is that we know in advance the lo cation and the lab el of

the system correctly identies the three individual ones

the imp ortantcharacter With real training data the cor

Second the left half and right half of disconnected are

rect sequence of lab els for a string is generally available

correctly group ed even though no geometrical information

but the precise lo cations of each corresp onding character

could decide to asso ciate the left half to the vertical bar on

in the input image are unknown

its left or on its right The right half of the do es cause

In the exp eriments describ ed in the previous section the

the app earance of an erroneous on the SDNN output

b est interpretation was extracted from the SDNN output

but this one is removed by the character mo del transducer

using a very simple graph transformer Global training of

which prevents characters from app earing on contiguous

an SDNN can b e p erformed by backpropagating gradients

outputs

through such graph transformers arranged in architectures

similar to the ones describ ed in section VI Another imp ortantadvantage of SDNN is the ease with

PROC OF THE IEEE NOVEMBER

t exp eriments in

Edforw zip co de recognition and more recen

online handwriting recognition have demonstrated the

idea of globallytrained SDNNHMM hybrids SDNN is an

+

etechnique for OCR but − extremely promising and attractiv

Cdforw

so far it has not yielded b etter results than Heuristic Over

Cforw

Segmentation We hop e that these results will improveas

Forward Scorer more exp erience is gained with these mo dels

Constrained Forward Scorer

Gc

Object Detection and Spotting with SDNN

Interpretation Graph D teresting application of SDNNs is ob ject detection Desired An in

Sequence Path Selector

and sp otting The invariance prop erties of Convolutional

combined with the eciency with which they works

Interpretation Graph Gint Net

can be replicated over large elds suggest that they can Character

Model Compose b e used for brute force ob ject sp otting and detection in

The main idea is to train a single Convolu

Transducer large images

tional Network to distinguish images of the ob ject of inter

S....c.....r...... i....p....t

t in the background In utilization SDNN Output s....e.....n.....e.j...o.T est from images presen

5...... a...i...u...... p.....f

mo de the network is replicated so as to cover the entire

analyzed thereby forming a twodimensional SDNN image to b e

Transformer

Space Displacement Neural Network The output of the

SDNN is a twodimensional plane in which activated units

Fig A globally trainable SDNNHMM hybrid system expressed

indicate the presence of the ob ject of interest in the corre

as a GTN

sp onding receptive eld Since the sizes of the ob jects to

b e detected within the image are unknown the image can

be presented to the network at multiple resolutions and

This is somewhat equivalent to mo deling the output

the results at multiple resolutions combined The idea has

of an SDNN with a Hidden Markov Mo del Globally

b een applied to face lo cation address blo ck lo cation

trained variablesize TDNNHMM hybrids have b een used

on env elop es and hand tracking in video

for sp eech recognition and online handwriting recogni

To illustrate the metho d we will consider the case of

tion Space Displacement Neural Net

face detection in images as describ ed in First images

works havebeenusedincombination with HMMs or other

containing faces at various scales are collected Those im

elastic matching metho ds for handwritten word recogni

ages are ltered through a zeromean Laplacian lter so as

tion

to removevariations in global illumination and low spatial

frequency illumination gradients Then training samples

Figure shows the graph transformer architecture for

of faces and nonfaces are manually extracted from those

training an SDNNHMM hybrid with the Discriminative

images The face subimages are then size normalized so

Forward Criterion The top part is comparable to the top

that the height of the entire face is approximately pixels

part of gure On the right side the comp osition of the

while keeping fairly large variations within a factor of two

recognition graph with the grammar gives the interpreta

The scale of background subimages are picked at random

tion graph with all the p ossible legal interpretations On

A single convolutional network is trained on those samples

the left side the comp osition is p erformed with a grammar

to classify face subimages from nonface subimages

that only contains paths with the desired sequence of la

When a scene image is to b e analyzed it is rst ltered

bels This has a somewhat similar function to the path

through the Laplacian lter and subsampled at powers

selector used in the previous section Like in Section VID

oftwo resolutions The network is replicated over eachof

the loss function is the dierence b etween the forward score

multiple resolution images A simple voting technique is

obtained from the left half and the forward score obtained

used to combine the results from multiple resolutions

from the right half To backpropagate through the com

Atwodimensional v ersion of the global training metho d

p osition transformer we need to keep a record of whicharc

describ ed in the previous section can be used to allevi

in the recognition graph originated whicharcsintheinter

ate the need to manually lo cate faces when building the

pretation graph The derivative with resp ect to an arc in

training sample Eachpossiblelocationisseenasan

the recognition graph is equal to the sum of the derivatives

alternativeinterpretation ie oneofseveral parallel arcs

with resp ect to all the arcs in the interpretation graph that

in a simple graph that only contains a start no de and an

originated from it Derivative can also b e computed for the

end no de

p enalties on the grammar graph allowing to learn them as

Other authors have used Neural Networks or other clas well As in the previous example a discriminative criterion

siers suchasSupportVector Machines for face detection must b e used b ecause using a nondiscriminative criterion

with great success Their systems are very similar could result in a collapse eect if the networks output RBF

to the one describ ed ab ove including the idea of presenting are adaptive The ab ove training pro cedure can b e equiv

the image to the network at multiple scales But since those alently formulated in term of HMM Early exp eriments in

PROC OF THE IEEE NOVEMBER

systems do not use Convolutional Networks they cannot path and a corresp onding pair of inputoutput sequences

takeadvantage of the sp eedup describ ed here and haveto S S in the transducer graph The weights on the arcs

out in

rely on other techniques such as preltering and realtime of the output graph are obtained by adding the weights

tracking to keep the computational requirement within from the matching arcs in the input acceptor and trans

reasonable limits In addition b ecause those classiers are ducer graphs In the rest of the pap er we will call this

much less invariant to scale variations than Convolutional graph comp osition op eration using transducers the stan

Networks it is necessary to multiply the numb er of scales dard transduction operation

at which the images are presented to the classier

A simple example of transduction is shown in Figure

In this simple example the input and output symb ols on

VI I I Graph Transformer Networks and

the transducer arcs are always identical This typ e of trans

Transducers

ducer graph is called a grammar graph To b etter under

stand the transduction op eration imagine two tokens sit

In Section IV Graph Transformer Networks GTN

ting each on the start no des of the input acceptor graph

were intro duced as a generalization of multilayer multi

and the transducer graph The tokens can freely follow

mo dule networks where the state information is repre

any arc lab eled with a null input symb ol A token can

sented as graphs instead of xedsize vectors This section

followan arc lab eled with a nonnull input symbol if the

the framework of Generalized reinterprets the GTNs in

other token also follows an arc lab eled with the same in

Transduction and prop oses a p owerful Graph Composition

put symbol We havean acceptable trajectory when b oth

algorithm

tokens reach the end no des of their graphs ie the tokens

A Previous Work

ha ve reached the terminal conguration This tra jectory

represents a sequence of input symb ols that complies with

Numerous authors in sp eech recognition have used

b oth the acceptor and the transducer We can then collect

GradientBased Learning metho ds that integrate graph

the corresp onding sequence of output symb ols along the

based statistical mo dels notably HMM with acoustic

tra jectory of the transducer token The ab ove pro cedure

recognition mo dules mainly Gaussian mixture mo dels but

pro duces a tree but a simple technique describ ed in Sec

also neural networks Similar ideas have

tion VI I IC can b e used to avoid generating multiple copies

b een applied to handwriting recognition see for a re

of certain subgraphs by detecting when a particular output

view However there has b een no prop osal for a system

state has already b een seen

atic approachtomultilayer graphbased trainable systems

The transduction op eration can b e p erformed very e

The idea of transforming graphs into other graphs has re

ciently but presents complex b o okkeeping problems

ceived considerable interest in computer science through

concerning the handling of all combinations of null and non

the concept of weighted nitestate transducers Trans

null symb ols If the weights are interpreted as probabilities

ducers have been applied to sp eech recognition and

normalized appropriately then an acceptor graph repre

language translation and prop osals have b een made

sents a probability distribution over the language dened

for handwriting recognition This line of work has

by the set of lab el sequences asso ciated to all p ossible paths

been mainly fo cused on ecient search algorithms

from the start to the end no de in the graph

and on the algebraic asp ects of combining transducers and

An example of application of the transduction op era

graphs called acceptors in this context but very little

tion is the incorp oration of linguistic constraints a lexicon

eort has b een devoted to building globally trainable sys

or a grammar when recognizing words or other character

tems out of transducers What is prop osed in the follow

strings The recognition transformer pro duces the recog

ing sections is a systematic approach to automatic training

nition graph an acceptor graph by applying the neural

in graphmanipulating systems A dierent approach to

network recognizer to each candidate segment This ac

graphbased trainable systems called InputOutput HMM

ceptor graph is comp osed with a transducer graph for the

was prop osed in

grammar The grammar transducer contains a path for

B StandardTransduction

each legal sequence of symbol p ossibly augmented with

p enalties to indicate the relative likeliho o ds of the possi

In the established framework of nitestate transduc

ble sequences The arcs contain identical input and output

ers discrete symb ols are attached to arcs in the graphs

symbols Another example of transduction was mentioned

Acceptor graphs have a single symbol attached to each

in Section V the path selector used in the heuristic over

arc whereas transducer graphs havetwo symb ols an input

segmentation training GTN is implementable by a comp o

symbol and an output symb ol A sp ecial null symbol is

sition The transducer graph is linear graph which con

absorb ed by any other symb ol when concatenating sym

tains the correct lab el sequence The comp osition of the

b ols to build a symbol sequence Weighted transducers

interpretation graph with this linear graph yields the con

and acceptors also have a scalar quantityattached to each

strained graph

arc In this framework the comp osition op eration takes as

input an acceptor graph and a transducer graph and builds

C GeneralizedTransduction

an output acceptor graph Each path in this output graph

If the data structures asso ciated to each arc to ok only with symb ol sequence S corresp onds to one path with

out

a nite number of values comp osing the input graph and symb ol sequence S in the input acceptor graph and one in

PROC OF THE IEEE NOVEMBER

interpretations:

ould b e a sound solution For an appropriate transducer w interpretation graph cut (2.0)

cap (0.8)

wever the data structures attached to

our applications ho "t" 0.8 cat (1.4)

arcs of the graphs may be vectors images or other the 0.8 "u"

"p" 0.2

highdimensional ob jects that are not readily enumerated "c" 0.4

"a" 0.2

e present a new comp osition op eration that solves this W "t" grammar graph 0.8 problem "r" "n"

"a"

Instead of only handling graphs with discrete symbols

e are interested in considering

and p enalties on the arcs w "t" "b" "u"

match match match y carry complex data structures in graphs whose arcs ma & add & add & add "t" "e"

"u"

cluding continuousvalued data structures suchasvectors

"c" "r" "e"

images Comp osing such graphs requires additional and "a" "p"

information "t" "r" "d"

Graph Composition

When examining a pair of arcs one from each input e need a criterion to decide whether to create cor graph w "c" 0.4 0.1

"x" "p" 0.2 resp onding arcs and no des in the output graph based

"o" 1.0 "a" 0.2 Recognition

hed to the input arcs We can de on the information attac Graph "t" 0.8

1.8 "u" 0.8

to build an arc several arcs or an entire subgraph

cide "d"

with several no des and arcs

Fig Example of comp osition of the recognition graph with

When that criterion is met we must build the corre

the grammar graph in order to build an interpretation that is

sp onding arcs and no des in the output graph and com

consistentwithbothofthem During the forward propagation

pute the information attached to the newly created arcs

dark arrows the metho ds check and fprop are used Gradients

dashed arrows are backpropagated with the application of the

as a function the the information attached to the input

metho d bprop

arcs

These functions are encapsulated in an ob ject called a

Composition Transformer An instance of Comp osition

jectory is acceptable ie b oth tokens simultaneously reach

Transformer implements three metho ds

the end no des of their graphs The managementof null

checkarc arc

transitions is a straightforward mo dication of the token

compares the data structures p ointed to byarcsarc from

simulation function Before enumerating the p ossible non

the rst graph and arc from the second graph and re

null joint token transitions we lo op on the p ossible null

turns a b o olean indicating whether corresp onding arcs

transitions of each token recursively call the token sim

should b e created in the output graph

ulation function and nally call the metho d fprop The

fpropngraph upnode downnode arc arc

safest wayforidentifying acceptable tra jectories consists in

is called when checkarc arc returns true This

running a preliminary pass for identifying the token con

metho d creates new arcs and no des b etween no des upnode

gurations from whichwe can reach the terminal congu

and downnode in the output graph ngraph and computes

ration ie both tokensonthe end no des This is easily

the information attached to these newly created arcs as a

achieved by enumerating the tra jectories in the opp osite

function of the attached information of the input arcs arc

direction We start on the end no des and follow the arcs

and arc

upstream During the main pass we only build the no des

bpropngraph upnode downnode arc arc

that allowthetokens to reach the terminal conguration

is called during training in order to propagate gradient in

Graph comp osition using transducers ie standard

formation from the output subgraph b etween upnode and

transduction is easily and eciently implementedasagen

downnode into the data structures on the arc and arc

eralized transduction The metho d check simply tests the

as well as with resp ect to the parameters that were used in

equality of the input symbols on the two arcs and the

the fprop call with the same arguments This assumes that

metho d fprop creates a single arc whose symbol is the

the function used by fprop to compute the values attached

output symb ol on the transducers arc

to its output arcs is dierentiable

The comp osition b etween pairs of graphs is particularly

The check metho d can be seen as constructing a dy

useful for incorp orating linguistic constraints in a hand

namic architecture of functional dep endencies while the

writing recognizer Examples of its use are given in the

fprop metho d p erforms a forward propagation through

online handwriting recognition system describ ed in Sec

that architecture to compute the numerical information at

tion IX and in the check reading system describ ed in Sec

tached to the arcs The bprop metho d p erforms a back

tion X

ward propagation through the same architecture to com

pute the partial derivatives of the loss function with resp ect

In the rest of the pap er the term Composition Trans

to the information attached to the arcs This is illustrated

former will denote a Graph Transformer based on the gen

in Figure

eralized transductions of multiple graphs The concept of

Figure shows a simplied generalized graph comp osi generalized transduction is a very general one In fact

tion algorithm This simplied algorithm do es not handle many of the graph transformers describ ed earlier in this

null transitions and do es not check whether the tokens tra pap er such as the segmenter and the recognizer can be

PROC OF THE IEEE NOVEMBER

transduction In this formulated in terms of generalized

case the the generalized transduction do es not taketwo in

put graphs but a single input graph The metho d fprop of

the transformer may create several arcs or even a complete

subgraph for each arc of the initial graph In fact the pair

Function generalizedcompositionPGRAPH graph

check fprop itself can b e seen as pro cedurally dening

PGRAPH graph

a transducer

PTRANS trans

Returns PGRAPH

In addition It can b e shown that the generalized trans

duction of a single graph is theoretically equivalenttothe

Create new graph

standard comp osition of this graph with a particular trans

PGRAPH ngraph newgraph

ducer graph However implementing the op eration this

way may be very inecient since the transducer can be

Create map between token positions

very complicated

and nodes of the new graph

In practice the graph pro duced by a generalized trans

PNODE mapPNODEPNODE newemptymap

duction is represented pro cedurally in order to avoid build

mapendnodegraph endnodegraph

ing the whole output graph whichmaybehuge when for

endnodenewgraph

example the interpretation graph is comp osed with the

grammar graph We only instantiate the no des which

Recursive subroutine for simulating tokens

are visited by the search algorithm during recognition eg

Function simtokensPNODE node PNODE node

Viterbi This strategy propagates the b enets of pruning

Returns PNODE

algorithms eg Beam Search in all the Graph Transformer

Network

PNODE currentnode mapnode node

Check if already visited

D Notes on the Graph Structures

If currentnode nil

Record new configuration

Section VI has discussed the idea of global training

currentnode ngraphcreatenode

bybackpropagating gradient through simple graph trans

mapnode node currentnode

formers The bprop metho d is the basis of the back

Enumerate the possible nonnull

propagation algorithm for generic graph transformers A

joint token transitions

generalized comp osition transformer can b e seen as dynam

For ARC arc in downarcsnode

ically establishing functional relationships b etween the nu

For ARC arc in downarcsnode

merical quantities on the input and output arcs Once the

If transcheckarc arc

check function has decided that a relationship should b e es

PNODE newnode

tablished the fprop function implements the numerical re

simtokensdownnodearc

lationship The check function establishes the structure of

downnodearc

the ephemeral network inside the comp osition transformer

transfpropngraph currentnode

Since fprop is assumed to b e dierentiable gradients can

newnode arc arc

b e backpropagated through that structure Most param

Return node in composed graph

eters aect the scores stored on the arcs of the successive

Return currentnode

graphs of the system A few threshold parameters mayde

termine whether an arc app ears or not in the graph Since

non existing arcs are equivalent to arcs with very large

Perform token simulation

p enalties we only consider the case of parameters aect

simtokensstartnodegraph startnodegraph

ing the p enalties

Delete map

Return ngraph

In the kind of systems wehave discussed until nowand

the application describ ed in Section X muchoftheknowl

edge ab out the structure of the graph that is pro duced b y

Fig Pseudoco de for a simplied generalized comp osition algo

a Graph Transformer is determined by the nature of the

rithm For simplifying the presentation we do not handle null

Graph Transformer but it may also dep end on the value

transitions nor implement dead end avoidance The twomain

of the parameters and on the input It mayalsobeinterest

comp onent of the comp osition app ear clearly here a the re

cursive function simtoken enumerating the token tra jectories

ing to consider Graph Transformer mo dules which attempt

and b the asso ciative array map used for rememb ering which

to learn the structure of the output graph This might

no des of the comp osed graph have b een visited

b e considered a combinatorial problem and not amenable

to GradientBased Learning but a solution to this prob

lem is to generate a large graph that contains the graph

candidates as subgraphs and then select the appropriate subgraph

PROC OF THE IEEE NOVEMBER

E GTN and Hidden Markov Models arcs are simply added in order to obtain the complete out

put graph The input values of the emission and transition

GTNs can b e seen as a generalization and an extension of

mo dules are read o the data structure on the input arcs

HMMs On the one hand the probabilistic interpretation

of the IOHMM Graph Transformer In practice the out

can b e either kept with p enalties b eing logprobabilities

put graph maybevery large and needs not b e completely

pushed to the nal decision stage with the dierence of the

instantiated ie it is pruned only the low p enalty paths

constrained forward p enalty and the unconstrained forward

are created

p enalty being interpreted as negative logprobabilities of

lab el sequences or dropp ed altogether the network just

IX An OnLine Handwriting Recognition System

represents a decision surface for lab el sequences in input

space On the other hand Graph Transformer Networks

Natural handwriting is often a mixture of dierent

extend HMMs by allowing to combine in a wellprincipled

styles lower case printed upp er case and cursive A

framework multiple levels of pro cessing or multiple mo d

reliable recognizer for such handwriting would greatly im

els eg Pereira et al have been using the transducer

prove interaction with p enbased devices but its imple

framework for stacking HMMs representing dierentlevels

hallenges Characters mentation presents new technical c

of pro cessing in automatic sp eech recognition

taken in isolation can be very ambiguous but consider

Unfolding a HMM in time yields a graph that is very sim

able information is available from the context of the whole

ilar to our interpretation graph at the nal stage of pro

word We have built a word recognition system for pen

cessing of the Graph Transformer Network b efore Viterbi

based devices based on four main mo dules a prepro cessor

recognition It has no des nt i asso ciated to each time

that normalizes a word or word group by tting a geomet

step t and state i in the mo del The p enalty c foranarc

i

rical mo del to the word structure a mo dule that pro duces

from nt j to nt i then corresp onds to the nega

an annotated image from the normalized p en tra jectory

tive logprobability of emitting observed data o at p osi

t

a replicated convolutional neural network that sp ots and

tion t and going from state j to state i in the time interval

recognizes characters and a GTN that interprets the net

t t With this probabilistic interpretation the for

works output by taking wordlevel constraints into account

ward p enalty is the negative logarithm of the likeliho o d of

The network and the GTN are jointly trained to minimize

whole observed data sequence given the mo del

an error measure dened at the word level

In Section VI we mentioned that the collapsing phe

In this work we have compared a system based on

nomenon can o ccur when nondiscriminative loss functions

SDNNs such as describ ed in Section VI I and a system

are used to train neural networksHMM hybrid systems

based on Heuristic OverSegmentation such as describ ed

With classical HMMs with xed prepro cessing this prob

in Section V Because of the sequential nature of the infor

lem do es not o ccur b ecause the parameters of the emission

mation in the p en tra jectory whichreveals more informa

and transition probability mo dels are forced to satisfy cer

tion than the purely optical input from in image Heuristic

tain probabilistic constraints the sum or the integral of

OverSegmentation can b e very ecient in prop osing can

the probabilities of a random variable over its p ossible val

didate character cuts esp ecially for noncursive script

ues must b e Therefore when the probability of certain

events is increased the probability of other events must au

A Preprocessing

tomatically b e decreased On the other hand if the prob

Input normalization reduces intracharacter variability

abilistic assumptions in an HMM or other probabilistic

simplifying character recognition We have used a word

mo del are not realistic discriminative training discussed

normalization scheme based on tting a geometrical

in Section VI can improve p erformance as this has b een

mo del of the word structure Our mo del has four exi

clearly shown for sp eech recognition systems

ble lines representing resp ectively the ascenders line the

core line the base line and the descenders line The lines

The InputOutput HMM mo del IOHMM

are tted to lo cal minima or maxima of the p en tra jectory

is strongly related to graph transformers Viewed as a

The parameters of the lines are estimated with a mo died

probabilistic mo del an IOHMM represents the conditional

version of the EM algorithm to maximize the joint prob

distribution of output sequences given input sequences of

ability of observed points and parameter values using a

the same or a dierent length It is parameterized from

ents the lines from collapsing prior on parameters that prev

an emission probability mo dule and a transition probabil

on each other

ity mo dule The emission probability mo dule computes

the conditional emission probabilityofanoutputvariable The recognition of handwritten characters from a pen

given an input value and the value of discrete state vari tra jectory on a digitizing surface is often done in the time

able The transition probability mo dule computes condi domain Typically tra jectories are nor

tional transition probabilities of a change in the value of malized and lo cal geometrical or dynamical features are

the state variable given the an input value Viewed as a extracted The recognition may then be p erformed us

graph transformer it assigns an output graph representing ing curve matching or other classication techniques

a probability distribution over the sequences of the output such as TDNNs While these representations

variable to each path in the input graph All these output have several advantages their dep endence on stroke order

graphs have the same structure and the p enalties on their ing and individual writing styles makes them dicult to

PROC OF THE IEEE NOVEMBER

"Script" "Script"

Viterbi Graph Viterbi Graph

Beam Search Beam Search Transformer Transformer

Interpretation Graph Interpretation Graph

Language Compose Model Compose

Recognition Graph Recognition Graph

Character Recognition Compose Transformer Model SDNN Output AMAP Graph SDNN AMAP Computation Transformer

Segmentation Graph AMAP

Segmentation Transformer AMAP Computation Normalized Word Normalized Word

Word Normalization Word Normalization

Fig An online handwriting recognition GTN based on heuristic

Fig An online handwriting recognition GTN based on Space

oversegmentation

Displacement Neural Network

B Network Architecture

One of the b est networks we found for b oth online and

use in high accuracy writer indep endent systems that in

oine character recognition is a layer convolutional net

tegrate the segmentation with the recognition

work somewhat similar to LeNet Figure but with

multiple input planes and dierentnumb ers of units on the

Since the intent of the writer is to pro duce a legible im last twolayers layer convolution with kernels of size

age it seems natural to preserveasmuch of the pictorial x layer x subsampling layer convolution with

nature of the signal as p ossible while at the same time ex kernels of size x layer convolutionwithkernels

ploit the sequential information in the tra jectory For this of size x layer x subsampling classication layer

purp ose wehave designed a representation scheme called RBF units one per class in the full printable ASCI I

AMAP where p en tra jectories are represented bylow set The distributed co des on the output are the same as

resolution images in which each picture element contains for LeNet except they are adaptiveunlike with LeNet

information ab out the lo cal prop erties of the tra jectory An When used in the heuristic oversegmentation system the

AMAP can be viewed as an annotated image in which an AMAP with ve input to ab ove network consisted of

each pixel is a element feature vector features are as planes rows and columns It was determined that

so ciated to four orientations of the pen tra jectory in the this resolution was sucient for representing handwritten

area around the pixel and the fth one is asso ciated to characters In the SDNN version the numberofcolumns

lo cal curvature in the area around the pixel A particu was varied according to the width of the input word Once

larly useful feature of the AMAP representation is that it the numb er of subsampling layers and the sizes of the ker

makes very few assumptions ab out the nature of the input nels are chosen the sizes of all the layers including the

tra jectory It do es not dep end on stroke ordering or writ input are determined unambiguously The only architec

ing sp eed and it can b e used with all typ es of handwriting tural parameters that remain to b e selected are the num

capital lower case cursive punctuation symb ols Un b er of feature maps in eachlayer and the information as

like many other representations such as global features to what feature map is connected to what other feature

AMAPs can b e computed for complete words without re map In our case the subsampling rates were chosen as

quiring segmentation small as possible x and the kernels as small as pos

PROC OF THE IEEE NOVEMBER

sible in the rst layer x to limit the total number of In this application the language mo del simply constrains

connections Kernel sizes in the upp er layers are chosen to the nal output graph to represent sequences of character

b e as small as p ossible while satisfying the size constraints lab els from a given dictionary Furthermore the interpre

mentioned ab ove Larger architectures did not necessarily tation graph is not actually completely instantiated the

p erform b etter and required considerably more time to b e only no des created are those that are needed by the Beam

trained Avery small architecture with half the input eld Searchmodule The interpretation graph is therefore rep

also p erformed worse b ecause of insucient input resolu resented pro cedurally rather than explicitly

tion Note that the input resolution is nonetheless much

A crucial contribution of this researchwas the joint train

less than for optical character recognition b ecause the an

ing of all graph transformer mo dules within the network

gle and curvature provide more information than would a

with resp ect to a single criterion as explained in Sec

single grey level at each pixel

tions VI and VI I I We used the DiscriminativeForward loss

function on the nal output graph minimize the forward

C Network Training

p enalty of the constrained interpretation ie along all the

correct paths while maximizing the forward p enaltyof

Training pro ceeded in two phases First we kept the

the whole interpretation graph ie along all the paths

centers of the RBFs xed and trained the network weights

During global training the loss function was optimized

so as to minimize the output distance of the RBF unit

with the sto chastic diagonal Levenb ergMarquardt pro ce

corresp onding to the correct class This is equivalent to

dure describ ed in App endix C that uses second derivatives

minimizing the meansquared error between the previous

to compute optimal learning rates This optimization op

layer and the center of the correctclass RBF This b o ot

erates on al l the parameters in the system most notably

strap phase was p erformed on isolated characters In the

the network weights and the RBF centers

second phase all the parameters network weights and RBF

centers were trained globally to minimize a discriminative

D Experimental Results

criterion at the word level

With the Heuristic OverSegmentation approach the

In the rst set of exp eriments weevaluated the general

GTN was comp osed of four main Graph Transformers

ization ability of the neural network classier coupled with

The Segmentation Transformer performs the

the word normalization prepro cessing and AMAP input

Heuristic OverSegmentation and outputs the segmenta

representation All results are in writer independent mo de

tion graph An AMAP is then computed for each image

dierent writers in training and testing Initial train

attached to the arcs of this graph

ing on isolated characters was p erformed on a database of

The Character Recognition Transformer applies

approximately hand printed characters classes

the the convolutional network character recognizer to each

of upp er case lower case digits and punctuation Tests

candidate segment and outputs the recognition graph

on a database of isolated characters were p erformed sepa

with p enalties and classes on eacharc

rately on the four typ es of characters upp er case

The Comp osition Transformer comp oses the recog

error on patterns lower case error on

nition graph with a grammar graph representing a language

patterns digits error on patterns and punc

mo del incorp orating lexical constraints

tuation error on patterns Exp eriments were

The Beam SearchTransformer extracts a go o d inter

p erformed with the network architecture describ ed ab ove

pretation from the interpretation graph This task could

To enhance the robustness of the recognizer to variations

have been achieved with the usual Viterbi Transformer

in p osition size orientation and other distortions addi

The Beam Search algorithm however implements pruning

tional training data was generated by applying lo cal ane

strategies which are appropriate for large interpretation

transformations to the original characters

graphs

The second and third set of exp eriments concerned the

With the SDNN approach the main Graph Transformers

recognition of lower case words writer indep endent The

are the following

tests were p erformed on a database of words First

The SDNN Transformer replicates the convolutional weevaluated the improvements broughtby the word nor

network over the a whole word image and outputs a recog malization to the system For the SDNNHMM system

nition graph that is a linear graph with class p enalties for we have to use wordlevel normalization since the net

every window centered at regular intervals on the input work sees one whole word at a time With the Heuris

image tic OverSegmentation system and b efore doing anyword

level training we obtained with characterlevel normaliza The CharacterLevel Comp osition Transformer

tion and word and character errors adding in comp oses the recognition graph with a lefttoright HMM

hcharacter class as in Figure sertions deletions and substitutions when the searchwas for eac

The WordLevel Comp osition Transformer com constrained within a word dictionary When using

poses the output of the previous transformer with a lan the word normalization prepro cessing instead of a charac

guage mo del incorp orating lexical constraints and outputs ter level normalization error rates dropp ed to and

the interpretation graph for word and character errors resp ectively ie a rel

The Beam Search Transformer extracts a go o d in ative drop of and in word and character error

terpretation from the interpretation graph resp ectively This suggests that normalizing the word in

PROC OF THE IEEE NOVEMBER

the system as correct reject error The its entirety is b etter than rst segmenting it and then nor

system presented here was one of the rst to cross that malizing and pro cessing each of the segments

threshold on representative mixtures of business and p er

SDNN/HMM No Language Model

hecks no global training 12.4 sonal c

with global training 8.2

Checks contain at least twoversions of the amount The

No Language Model

is written with numerals while the Legal HOS Courtesy amount

no global training 8.5

is written with letters On business checks which with global training 6.3 amount

hineprinted these amounts are relatively HOS 25K Word Lexicon are generally mac

no global training 2

easy to read but quite dicult to nd due to the lackof

with global training 1.4

standard for business check layout On the other hand

0 5 10 15

these amounts on p ersonal checks are easy to nd but much

Fig Comparative results character error rates showing the

harder to read

improvement brought by global training on the SDNNHMM

For simplicity and sp eed requirements our initial task

hybrid and on the Heuristic OverSegmentation system HOS

is to read the Courtesy amount only This task consists of

without and with a words dictionary

two main steps

In the third set of exp eriments we measured the im

The system has to nd among all the elds lines of

provements obtained with the joint training of the neural

text the candidates that are the most likelytocontain the

network and the p ostpro cessor with the wordlevel crite

courtesy amount This is obvious for many p ersonal checks

rion in comparison to training based only on the errors

where the p osition of the amount is standardized However

p erformed at the character level After initial training on

as already noted nding the amount can b e rather dicult

individual characters as ab ove global wordlevel discrim

in business checks even for the human eye There are

inative training was p erformed with a database of

many strings of digits suchasthechecknumb er the date

lower case words For the SDNNHMM system without

or even not to exceed amounts that can be confused

an y dictionary constraints the error rates dropp ed from

with the actual amount In many cases it is very dicult

and word and character error to and

to decide which candidate is the courtesy amount b efore

resp ectively after wordlevel training ie a relative drop

p erforming a full recognition

of and For the Heuristic OverSegmentation sys

In order to read and cho ose some Courtesy amount

tem and a slightly improved architecture without anydic

candidates the system has to segment the elds into char

tionary constraints the error rates dropp ed from

acters read and score the candidate characters and nally

and word and character error to and re

nd the b est interpretation of the amount using contextual

sp ectively ie a relative drop of and With a

knowledge represen ted byastochastic grammar for check

word dictionary errors dropp ed from and

amounts

word and character errors to and resp ectively

The GTN metho dology was used to build a check amount

after wordlevel training ie a relative drop of and

reading system that handles b oth p ersonal checks and busi

Even lower error rates can be obtained by dras

ness checks

tically reducing the size of the dictionary to words

yielding and word and character errors

A A GTN for Check Amount Recognition

These results clearly demonstrate the usefulness of glob

We now describ e the successive graph transformations

ally trained NeuralNetHMM hybrids for handwriting

that allow this network to read the check amount cf Fig

recognition This conrms similar results obtained earlier

ure Each Graph Transformer pro duces a graph whose

in sp eech recognition

paths enco de and score the currenthyp otheses considered

X A Check Reading System

at this stage of the system

The input to the system is a trivial graph with a single

This section describ es a GTN based Check Reading Sys

arc that carries the image of the whole check cf Figure

tem intended for immediate industrial deployment It also

shows how the use of Gradient BasedLearning and GTNs The eld lo cation transformer T rst p erforms

field

make this deployment fast and costeective while yielding classical image analysis including connected comp onent

an accurate and reliable solution analysis ink density histograms layout analysis etc

The verication of the amountonacheck is a task that and heuristically extracts rectangular zones that maycon

is extremely time and money consuming for banks As a tain the check amount T pro duces an output graph

field

consequence there is a very high interest in automating the h can called the eld graph cf Figure such that eac

pro cess as much as p ossible see for example didate zone is asso ciated with one arc that links the start

Even a partial automation would result in consid no de to the end no de Each arc contains the image of the

erable cost reductions The threshold of economic viability zone and a penalty term computed from simple features

for automatic check readers as set by the bank is when extracted from the zone absolute p osition size asp ect ra

tio etc The p enalty term is close to zero if the features of the checks are read with less than error The

suggest that the eld is a likely candidate and is large if other of the check b eing rejected and senttohuman

the eld is deemed less likely to b e an amount The p enalty op erators In such a case we describ e the p erformance of

PROC OF THE IEEE NOVEMBER

ter uses a variety of heuristics to nd candi

Viterbi Answer The segmen

date cut One of the most imp ortant ones is called hit and

The idea is to cast lines downward from the

Best Amount Graph deect

top of the eld image When a line hits a black pixel it is

Viterbi Transformer

deected so as to follow the contour of the ob ject When a

um of the upp er prole ie when it "$" 0.2 line hits a lo cal minim

Interpretation Graph "*" 0.4

tinue downward without crossing a black pixel "3" 0.1 cannot con

......

it is just propagated vertically downward through the ink

wosuch lines meet each other they are merged into

Grammar Compose When t The pro cedure can b e rep eated from the b ot "$" 0.2 a single cut

"*" 0.4

up This strategy allows the separation of touching Recognition Graph "3" 0.1 tom

"B" 23.6

haracters such as double zeros ...... c

Recognition

The recognition transformer T iterates over all

Transformer rec

segment arcs in the segmentation graph and runs a charac

$ * 3

t image In our Segmentation Graph ter recognizer on the corresp onding segmen

45

the recognizer is LeNet the Convolutional Neural

** case

work describ ed in Section I I whose weights constitute

Segmentation Transf. Net

the largest and most imp ortant subset of tunable parame

$ *** 3.45

The recognizer classies segment images into one of 45/xx ters

Field Graph $10,000.00

classes full printable ASCI I set plus a rubbish class for

Field Location Transf.

unknown sym b ols or badlyformed characters Each arc in

2nd Nat. Bank

not to exceed $10,000.00 $ *** 3.45

the input graph T is replaced byarcsinthe output rec

Check Graph three dollars and 45/xx

graph Eachof those arcs contains the lab el of one of

the classes and a p enalty that is the sum of the p enalty

Fig A complete check amount reader implemented as a single

of the corresp onding arc in the input segmentation graph

cascade of Graph Transformer mo dules Successive graph trans

formations progressively extract higher level information

and the p enalty asso ciated with classifying the image in

the corresp onding class as computed by the recognizer In

other words the recognition graph represents a weighted

function is dierentiable therefore its parameter are glob

trellis of scored character classes Each path in this graph

ally tunable

represents a possible character string for the corresp ond

ing eld We can compute a p enalty for this interpretation

An arc may represent separate dollar and cent amounts

by adding the p enalties along the path This sequence of

as a sequence of elds In fact in handwritten checks the

characters mayormay not b e a valid check amount

centamountmay b e written over a fractional bar and not

aligned at all with the dollar amount In the worst case

The comp osition transformer T selects the

gram

one may nd several cent amount candidates ab oveand

paths of the recognition graph that represent valid char

below the fraction bar for the same dollar amount

acter sequences for check amounts This transformer takes

two graphs as input the recognition graph and the gram

The segmentation transformer T similar to the

seg

mar graph The grammar graph contains all p ossible se

one describ ed in Section VI I I examines each zone contained

quences of symb ols that constitute a wellformed amount

in the eld graph and cuts each image into pieces of ink

The output of the comp osition transformer called the in

using heuristic image pro cessing techniques Each piece

terpretation graph contains all the paths in the recognition

of ink may be a whole character or a piece of character

graph that are compatible with the grammar The op er

Each arc in the eld graph is replaced by its corresp ond

ation that combines the twoinput graphs to pro duce the

ing segmentation graph that represents all p ossible group

output is a generalized transduction see Section VI I IA

ings of pieces of ink Each eld segmentation graph is ap

dierentiable function is used to compute the data attached

p ended to an arc that contains the p enalty of the eld in

to the output arc from the data attached to the input arcs

the eld graph Each arc carries the segment image to

In our case the output arc receives the class lab el of the

gether with a penalty that provides a rst evaluation of

two arcs and a p enalty computed by simply summing the

the likeliho o d that the segment actually contains a charac

p enalties of the two input arcs the recognizer p enaltyand

ter This p enalty is obtained with a dierentiable function

the arc p enalty in the grammar graph Each path in the

that combines a few simple features such as the space b e

interpretation graph represents one interpretation of one

tween the pieces of ink or the compliance of the segment

of one eld on the check The sum of the segmentation

image with a global baseline and a few tunable parame

p enalties along the path represents the badness of the

ters The segmentation graph represents al l the p ossible

corresp onding interpretation and combines evidence from

segmentations of al l the eld images We can compute the

each of the mo dules along the pro cess as well as from the

p enalty for one segmented eld by adding the arc p enalties

grammar

along the corresp onding path As b efore using a dieren

tiable function for computing the p enalties will ensure that The Viterbi transformer nally selects the path with

the parameters can b e optimized globally the lowest accumulated p enalty corresp onding to the b est

PROC OF THE IEEE NOVEMBER

Edforw as describ ed in Figure using as our desired sequence the

Viterbi answer This is summarized in Figure with

+ −

E C condence exp

dforw dforw Cforw

Forward Scorer

D Results

Forward Scorer

A version of the ab ove system was fully implemented

tested on machineprint business checks This sys Viterbi and

Path Selector

is basically a generic GTN engine with task sp ecic

Answer tem

encapsulated in the check and fprop metho d

Interpretation Graph heuristics

consequence the amount of co de to write was min As a

imal mostly the adaptation of an earlier segmenter into

Fig Additional pro cessing required to compute the condence

the segmentation transformer The system that deals with

handwritten or p ersonal checks was based on earlier im

grammatically correct interpretations

plementations that used the GTN concept in a restricted

way

B GradientBased Learning

The neural network classier was initially trained on

Each stage of this check reading system contains tun

images of character images from various origins

able parameters While some of these parameters could b e

spanning the entire printable ASCI I set This contained

manually adjusted for example the parameters of the eld

b oth handwritten and machineprinted characters that had

lo cator and segmenter the vast ma jorityofthem must be

b een previously size normalized at the string level Addi

learned particularly the weights of the neural net recog

tional images were generated by randomly distorting the

nizer

original images using simple ane transformations of the

Prior to globally optimizing the system each mo dule pa

images The netw ork was then further trained on character

rameters are initialized with reasonable values The param

images that had b een automatically segmented from check

eters of the eld lo cator and the segmenter are initialized

images and manually truthed The network was also ini

by hand while the parameters of the neural net charac

tially trained to reject noncharacters that resulted from

ter recognizer are initialized by training on a database of

segmentation errors The recognizer was then inserted in

presegmented and lab eled characters Then the entire

the check reading system and a small subset of the parame

system is trained globally from whole check images lab eled

ters were trained globally at the eld level on whole check

with the correct amount No explicit segmentation of the

images

amounts is needed to train the system it is trained at the

On business checks that were automatically catego

checklevel

rized as machine printed the p erformance was cor

The loss function E minimized by our global train

rectly recognized checks errors and rejects This

ing pro cedure is the Discriminative Forward criterion de

can b e compared to the p erformance of the previous sys

scrib ed in Section VI the dierence between a the for

tem on the same test set correct errors and

ward p enalty of the constrained interpretation graph con

rejects A check is categorized as machineprinted

strained by the correct lab el sequence and b the forward

when characters that are near a standard p osition Dollar

p enalty of the unconstrained interpretation graph Deriva

sign are detected as machine printed or when if nothing

tives can b e backpropagated through the entire structure

found in the standard p osition at least one courtesy is

although it only practical to do it down to the segmenter

amount candidate is found somewhere else The improve

ment is attributed to three main causes First the neural

C Rejecting Low Condence Checks

network recognizer was bigger and trained on more data

Second b ecause of the GTN architecture the new system

In order to b e able to reject checks which are the most

could take advantage of grammatical constraints in a much

likely to carry erroneous Viterbi answers we must rate

more ecient way than the previous system Third the

them with a condence and reject the check if this con

GTN architecture provided extreme exibility for testing

dence is below a given threshold To compare the un

heuristics adjusting parameters and tuning the system

normalized Viterbi Penalties of two dierentchecks would

This last p oint is more imp ortant than it seems The GTN

b e meaningless when it comes to decide which answer we

framework separates the algorithmic part of the system

trust the most

from the knowledgebased part of the system allowing

The optimal measure of condence is the probabilityof

easy adjustments of the latter The imp ortance of global

the Viterbi answer given the input image As seen in Sec

training was only minor in this task b ecause the global

tion VIE given a target sequence which in this case

training only concerned a small subset of the parameters

would be the Viterbi answer the discriminative forward

An indep endent test p erformed by systems integrators loss function is an estimate of the logarithm of this prob

ability Therefore a simple solution to obtain a go o d esti in showed the sup eriority of this system over other

mate of the condence is to reuse the interpretation graph commercial Courtesy amount reading systems The system

see Figure to compute the discriminative forward loss was integrated in NCRs line of check reading systems It

PROC OF THE IEEE NOVEMBER

has b een elded in several banks across the US since June Neural Networks allows to learn appropriate features from

and has b een reading millions of checks p er daysince examples The success of this approachwas demonstrated

then in extensive comparative digit recognition exp eriments on

the NIST database

XI Conclusions

Segmentation and recognition of ob jects in images can

not b e completely decoupled Instead of taking hard seg

During the short history of automatic pattern recogni

mentation decisions to o earlywehave used Heuristic Over

tion increasing the role of learning seems to have invari

Segmentation to generate and evaluate a large number of

ably improved the overall p erformance of recognition sys

hyp otheses in parallel p ostp oning any decision until the

tems The systems describ ed in this pap er are more ev

overall criterion is minimized

idence to this fact Convolutional Neural Networks have

ted characters Hand truthing images to obtain segmen

been shown to eliminate the need for handcrafted fea

for training a character recognizer is exp ensive and do es

ture extractors Graph Transformer Networks have b een

not takeinto account the way in which a whole do cument

shown to reduce the need for handcrafted heuristics man

or sequence of characters will b e recognized in particular

ual lab eling and manual parameter tuning in do cument

the fact that some segmentation candidates maybewrong

recognition systems As training data b ecomes plentiful as

even though they maylooklike true characters Instead

computers get faster as our understanding of learning al

wetrain multimo dule systems to optimize a global mea

gorithms improves recognition systems will rely more and

sure of p erformance which do es not require time consum

more of learning and their p erformance will improve

ing detailed handtruthing and yields signicantly b etter

Just as the backpropagation algorithm elegantly solved

recognition p erformance because it allows to train these

the credit assignment problem in multilayer neural net

mo dules to co op erate towards a common goal

works the gradientbased learning pro cedure for Graph

Ambiguities inherent in the segmentation character

Transformer Networks intro duced in this pap er solves the

recognition and linguistic mo del should b e integrated op

credit assignment problem in systems whose functional ar

timally Instead of using a sequence of taskdep endent

chitecture dynamically changes with each new input The

heuristics to combine these sources of information we

learning algorithms presented here are in a sense nothing

have prop osed a unied framework in which generalized

more than unusual forms of gradient descent in complex

transduction metho ds are applied to graphs representing a

dynamic architectures with ecient backpropagation al

weighted set of hyp otheses ab out the input The success of

gorithms to compute the gradient The results in this pa

this approachwas demonstrated with a commercially de

p er help establish the usefulness and relevance of gradient

ployed check reading system that reads millions of business

based minimization metho ds as a general organizing prin

and p ersonal checks p er day the generalized transduction

ciple for learning in large systems

engine resides in only a few hundred lines of co de

It was shown that all the steps of a do cument analysis

Traditional recognition systems rely on many hand

system can be formulated as graph transformers through

crafted heuristics to isolate individually recognizable ob

which gradients can be backpropagated Even in the

jects The promising Space Displacement Neural Network

nontrainable parts of the system the design philosophy

approach draws on the robustness and eciency of Con

in terms of graph transformation provides a clear separa

volutional Neural Networks to avoid explicit segmentation

tion b etween domainsp ecic heuristics eg segmentation

altogether Simultaneous automatic learning of segmenta

heuristics and generic pro cedural knowledge the gener

tion and recognition can b e achieved with GradientBased

alized transduction algorithm

Learning metho ds

It is worth pointing out that data generating models

This pap er presents a small numb er of examples of graph

such as HMMs and the Maximum Likelihood Principle

transformer mo dules but it is clear that the concept can b e

were not called up on to justify most of the architectures

applied to many situations where the domain knowledge or

and the training criteria describ ed in this pap er Gradient

the state information can b e represented by graphs This is

based learning applied to global discriminative loss func

the case in many audio signal recognition tasks and visual

tions guarantees optimal classication and rejection with

scene analysis applications Future work will attempt to

out the use of hard to justify principles that put strong

apply Graph Transformer Networks to such problems with

constraints on the system architecture often at the exp ense

the hop e of allowing more reliance on automatic learning

of p erformances

and less on detailed engineering

More sp ecically the metho ds and architectures pre

sented in this pap er oer generic solutions to a large num

Appendices

ber of problems encountered in pattern recognition sys

A Preconditions for faster convergence

tems

As seen b efore the squashing function used in our Con Feature extraction is traditionally a xed transform

volutional Networks is f a A tanhSa Symmetric generally derived from some exp ert prior knowledge ab out

functions are b elieved to yield faster convergence although the task This relies on the probably incorrect assumption

the learning can b ecome extremely slow if the weights are that the human designer is able to capture all the rele

to o small The cause of this problem is that in weightspace vant information in the input We have shown that the

the origin is a xed point of the learning dynamics and application of GradientBased Learning to Convolutional

PROC OF THE IEEE NOVEMBER

although it is a saddle p oint it is attractive in almost all p erforming two complete learning iterations over the small

directions For our simulations we use A subset This idea can b e generalized to training sets where

and S see With this choice of parame there exist no precise rep etition of the same pattern but

ters the equalities f and f are satised where some redundancy is present In fact sto chastic up

The rationale b ehind this is that the overall gain of the date must b e b etter when there is redundancy ie when a

squashing transformation is around in normal op erat certain level of generalization is exp ected

ing conditions and the interpretation of the state of the

Many authors have claimed that secondorder meth

network is simplied Moreover the absolute value of the

ods should be used in lieu of gradient descent for neu

second derivativeoff is a maximum at and which

ral net training The literature ab ounds with recom

improves the convergence towards the end of the learning

mendations for classical secondorder metho ds such

session This particular choice of parameters is merely a

as the GaussNewton or Levenb ergMarquardt algorithms

convenience and do es not aect the result

for QuasiNewton metho ds suchasthe BroydenFletcher

GoldfarbShanno metho d BFGS Limitedstorage BFGS

Before training the weights are initialized with random

or for various versions of the Conjugate Gradients CG

F and values using a uniform distribution between

i

e metho ds are un metho d Unfortunately all of the ab ov

F where F is the numb er of inputs fanin of the unit

i i

suitable for training large neural networks on large data

which the connection b elongs to Since several connections

sets The GaussNewton and Levenb ergMarquardt meth

share a weight this rule could b e dicult to applybutin

ods require O N op erations per up date where N is

our case all connections sharing a same weight b elong to

the number of parameters which makes them impracti

units with identical fanins The reason for dividing bythe

cal for even mo derate size networks QuasiNewton meth

fanin is that wewould like the initial standard deviation

o ds require only O N op erations p er up date but that

of the weighted sums to be in the same range for each

still makes them impractical for large networks Limited

unit and to fall within the normal op erating region of the

Storage BFGS and Conjugate Gradient require only O N

sigmoid If the initial weights are to o small the gradients

op erations p er up date so they would app ear appropriate

are very small and the learning is slow If they are to o

Unfortunately their convergence sp eed relies on an accu

large the sigmoids are saturated and the gradient is also

rate evaluation of successive conjugate descent directions

very small The standard deviation of the weighted sum

which only makes sense in batch mo de For large data

scales like the square ro ot of the number of inputs when

sets the sp eedup broughtby these metho ds over regular

the inputs are indep enden t and it scales linearly with the

batch gradient descent cannot matc h the enormous sp eed

number of inputs if the inputs are highly correlated We

up broughtby the use of sto chastic gradient Several au

chose to assume the second hyp othesis since some units

thors have attempted to use Conjugate Gradient with small

receive highly correlated signals

batches or batches of increasing sizes but those

B Stochastic Gradient vs Batch Gradient

attempts havenotyet b een demonstrated to surpass a care

fully tuned sto chastic gradient Our exp eriments were p er

GradientBased Learning algorithms can use one of two

formed with a sto chastic metho d that scales the parameter

classes of metho ds to up date the parameters The rst

axes so as to minimize the eccentricity of the error surface

metho d dubb ed Batch Gradient is the classical one the

gradients are accumulated over the entire training set and

C Stochastic Diagonal LevenbergMarquardt

the parameters are up dated after the exact gradient has

Owing to the reasons given in App endix B we prefer to

b een so computed In the second metho d called Sto chas

up date the weights after eachpresentation of a single pat

tic Gradient a partial or noisy gradientisevaluated on

tern in accordance with sto chastic up date metho ds The

the basis of one single training sample or a small num

patterns are presented in a constant random order and the

ber of samples and the parameters are up dated using

training set is typically rep eated times

can be this approximate gradient The training samples

Our up date algorithm is dubb ed the Sto chastic Diagonal

selected randomly or according to a prop erly randomized

Levenb ergMarquardt metho d where an individual learning

sequence In the sto chastic version the gradient estimates

rate step size is computed for each parameter weight

are noisy but the parameters are up dated much more often

b efore each pass through the training set

than with the batch version An empirical result of con

These learning rates are computed using the diagonal terms

siderable practical imp ortance is that on tasks with large

of an estimate of the GaussNewton approximation to the

redundant data sets the sto chastic version is considerably

Hessian second derivative matrix This algorithm is not

faster than the batchversion sometimes by orders of mag

believed to bring a tremendous increase in learning sp eed

nitude Although the reasons for this are not totally

but it converges reliably without requiring extensive ad

understo o d theoretically an intuitive explanation can be

justments of the learning parameters It corrects ma jor ill

found in the following extreme example Let us take an

conditioning of the loss function that are due to the p ecu

example where the training database is comp osed of two

liarities of the network architecture and the training data

copies of the same subset Then accumulating the gradient

The additional cost of using this pro cedure over standard

over the whole set would cause redundant computations

sto chastic gradient descent is negligible

to be p erformed On the other hand running Sto chas

Ateach learning iteration a particular parameter w is set would amount to tic Gradient once on this training k

PROC OF THE IEEE NOVEMBER

up dated according to the following sto chastic up date rule the total input to unit i denoted a Interestingly there is

i

an ecient algorithm to compute those second derivatives

p

E

which is very similar to the backpropagation pro cedure

w w

k k k

w

k

used to compute the rst derivatives

p

p p p

where E is the instantaneous loss function for pattern p

X

E E E

 

u f a f a

i i

In Convolutional Neural Networks b ecause of the weight

ki

a a x p

i

i

k E

k

sharing the partial derivative is the sum of the partial

w

k

derivatives with resp ect to the connections that share the

Unfortunately using those derivatives leads to wellknown

parameter w

problems asso ciated with every Newtonlike algorithm

k

p p

X

E E

these terms can be negative and can cause the gradient

w u

k ij

algorithm to move uphill instead of downhill Therefore

ij V

k

our second approximation is a wellknown trick called the

where u is the connection weight from unit j to unit i V

ij k

GaussNewton approximation which guarantees that the

is the set of unit index pairs i j such that the connection

second derivative estimates are nonnegative The Gauss

between i and j share the parameter w ie

k

Newton approximation essentially ignores the nonlinearity

of the estimated function the Neural Network in our case

u w i j V

ij k k

but not that of the loss function The backpropagation

equation for GaussNewton approximations of the second

As stated previously the step sizes are not constantbut

k

derivatives is

are function of the second derivative of the loss function

p p

X

along the axis w

k

E E



f a u

i

ki

a a

i

k

k

k

h

kk

This is very similar to the formula for backpropagating the

rst derivatives except that the sigmoids derivative and

where is a handpicked constant and h is an estimate

kk

the weightvalues are squared The righthand side is a sum

of the second derivative of the loss function E with re

of pro ducts of nonnegative terms therefore the lefthand

sp ect to w The larger h the smaller the weight up date

k kk

side term is nonnegative

The parameter prevents the step size from b ecoming to o

The third approximation wemake is that wedonotrun

large when the second derivativeis small very much like

the average in Equation over the entire training set but

the mo deltrust metho ds and the Levenb ergMarquardt

run it on a small subset of the training set instead In

metho ds in nonlinear optimization The exact formula

addition the reestimation do es not need to be done of

to compute h from the second derivatives with resp ect

kk

ten since the second order prop erties of the error surface

to the connection weights is

change rather slowly In the exp eriments describ ed in this

X X

E

pap er we reestimate the h on patterns b efore each

kk

h

kk

u u

ij kl

training pass through the training set Since the size of the

ij V klV

k k

training set is the additional cost of reestimating

However wemake three approximations The rst approx

the h is negligible The estimates are not particularly

kk

imation is to drop the odiagonal terms of the Hessian

sensitive to the particular subset of the training set used in

with resp ect to the connection weights in the ab ove equa

the av eraging This seems to suggest that the secondorder

tion

prop erties of the error surface are mainly determined by

X

E

the structure of the network rather than by the detailed

h

kk

u

statistics of the samples This algorithm is particularly use

ij

ij V

k

ful for sharedweightnetworks b ecause the weight sharing



E

are the average over the training Naturally the terms



creates illconditionning of the error surface Because of

u

ij

the sharing one single parameter in the rst few layers can

set of the lo cal second derivatives

have an enormous inuence on the output Consequently

P

p

X

E E

the second derivative of the error with resp ect to this pa

rameter maybevery large while it can b e quite small for

u P u

ij ij

p

other parameters elsewhere in the network The ab ove al

gorithm comp ensates for that phenomenon

Those lo cal second derivatives with resp ect to connection

Unlike most other secondorder acceleration metho ds for

weights can b e computed from lo cal second derivatives with

backpropagation the ab ove metho d works in sto chastic

resp ect to the total input of the downstream unit

mo de It uses a diagonal approximation of the Hessian

p p

E E

Like the classical Levenb ergMarquardt algorithm it uses a

x

j

u a

safety factor to prevent the step sizes from getting to o

ij i

second derivative estimates are small Hence large if the

 p

E

is the second where x is the state of unit j and



j

the metho d is called the Sto chastic Diagonal Levenb erg

a

i

Marquardt metho d derivative of the instantaneous loss function with resp ect to

PROC OF THE IEEE NOVEMBER

F FogelmanSoulie and G Weisbuch Eds Acknowledgments E Bienensto ck

Les Houches pp SpringerVerlag

Some of the systems describ ed in this pap er is the work

D B Parker Learninglogic Tech Rep TR Sloan

Scho ol of Management MIT Cambridge Mass April

of many researchers nowatATT and LucentTechnolo

Y LeCun Modeles connexionnistes de lapprentissage con

gies In particular Christopher Burges Craig Nohl Troy

nectionist learning models PhD UniversitePetM

Cauble and Jane Bromley contributed much to the check

Curie June

Y LeCun A theoretical framework for backpropagation in

reading system Exp erimental results describ ed in sec

Proceedings of the Connectionist Models Summer School

tion III include contributions by Chris Burges Aymeric

D TouretzkyGHinton and T Sejnowski Eds CMU Pitts

Brunot Harris Drucker Larry Jackel Urs

burgh Pa pp Morgan Kaufmann

L Bottou and P Gallinari A framework for the co op eration of

Muller Bernhard Scholkopf and Patrice Simard The au

learning algorithms in Advances in Neural Information Pro

thors wish to thank Fernando Pereira

cessing SystemsDTouretzky and R Lippmann Eds Denver

John Denker and Isab elle Guyon for helpful discussions

vol Morgan Kaufmann

C Y Suen C Nadal R Legault T A Mai and L Lam

Charles Stenard and Ray Higgins for providing the appli

Computer recognition of unconstrained handwritten numer

cations that motivated some of this work and Lawrence R

als Proceedings of the IEEE Special issue on Optical Char

Rabiner and Lawrence D Jackel for relentless supp ort and

acter Recognitionvol no pp July

S N Srihari Highp erformance reading machines Proceed

encouragements

ings of the IEEE Special issue on Optical Character Recogni

tionvol no pp July

References

Y LeCun L D Jackel B Boser J S Denker H P Graf

I Guyon D Henderson R E Howard and W Hubbard

R O Duda and P E Hart Pattern Classication And Scene

Handwritten digit recognition Applications of neural net

A nalysis Wiley and Son

chips and automatic learning IEEE Communication pp

Y LeCun B Boser J S Denker D Henderson R E Howard

Novemb er invited pap er

W Hubbard and L D Jackel applied to

J Keeler D Rumelhart and W K Leow Integrated seg

handwritten zip co de recognition Neural Computationvol

mentation and recognition of handprinted numerals in Neu

no pp Winter

ral Information Processing Systems R P Lippmann J M

S Seung H Somp olinsky and N Tishby Statistical mechan

Moody and D S Touretzky Eds vol pp Morgan

ics of learning from examples Physical Review Avol pp

Kaufmann Publishers San Mateo CA

Ofer Matan Christopher J C Burges Yann LeCun and

V N Vapnik E Levin and Y LeCun Measuring the vc

John S Denker Multidigit recognition using a space dis

dimension of a learning machine Neural Computationvol

placement neural network in Neural Information Processing

no pp

SystemsJMMoody S J Hanson and R P Lippman Eds

C Cortes L Jackel S Solla V N Vapnik and J Denker

vol Morgan Kaufmann Publishers San Mateo CA

Learning curves asymptotic values and rate of convergence

L R Rabiner A tutorial on hidden Markov mo dels and se

in Advances in Neural Information Processing Systems JD

lected applications in sp eech recognition Proceedings of the

Cowan G Tesauro and J Alsp ector Eds San Mateo CA

IEEEvol no pp February

pp Morgan Kaufmann

H A Bourlard and N Morgan CONNECTIONIST SPEECH

V N Vapnik The Nature of Statistical Learning Theory

RECOGNITION A Hybrid ApproachKluwer Academic Pub

Springer NewYork

lisher Boston

V N Vapnik Statistical Learning The ory John Wiley Sons

D H Hub el and T N Wiesel Receptive elds bino cular

NewYork

interaction and functional architecture in the cats visual cor

W H Press B P FlannerySATeukolsky and W T Vet

tex Journal of Physiology Londonvol pp

terling Numerical Recipes The Art of Scientic Computing

Cambridge University Press Cambridge

K Fukushima Cognitron A selforganizing multilayered neu

S I Amari A theory of adaptive pattern classiers IEEE

ral network Biological Cyberneticsvol no pp

Transactions on Electronic Computers vol EC pp

November

K Fukushima and S Miyake Neo cognitron A new algorithm

Ya Tsypkin Adaptation and Learning in automatic systems

for pattern recognition tolerant of deformations and shifts in

Academic Press

p osition Pattern Recognitionvol pp

Ya Tsypkin Foundations of the theory of learning systems

M C Mozer The perception of multiple objects Aconnec

Academic Press

tionist approach MIT PressBradford Bo oks Cambridge MA

M Minsky and O Selfridge Learning in random nets in

th London symposium on Information Theory London

Y LeCun Generalization and network design strategies in

pp

Connectionism in Perspective R Pfeifer Z Schreter F Fogel

D H Ackley G E Hinton and T J Sejnowski A learning

man and L Steels Eds Zurich Switzerland Elsevier

algorithm for b oltzmann machines Cognitive Sciencevol

an extended version was published as a technical rep ort of the

pp

UniversityofToronto

G E Hinton and T J Sejnowski Learning and relearning

Y LeCun B Boser J S Denker D Henderson R E Howard

cessing in Boltzmann machines in Paral lel DistributedPro

W Hubbard and L D Jackel Handwritten digit recognition

Explorations in the Microstructure of Cognition Volume

with a backpropagation network in Advances in Neural In

Foundations D E Rumelhart and J L McClelland Eds MIT

formation Processing Systems NIPSDavid Touretzky

Press Cambridge MA

Ed Denver CO Morgan Kaufmann

D E Rumelhart G E Hinton and R J Williams Learning

G L Martin Centeredob ject integrated segmentation and

internal representations by error propagation in Paral lel dis

recognition of overlapping handprinted characters Neural

tributedprocessing Explorations in the microstructureofcog

Computationvol no pp

nitionvol I pp Bradford Bo oks Cambridge MA

J Wang and J Jean Multiresolution neural networks for om

nifontcharacter recognition in Proceedings of International

A E Jr Bryson and YuChi Ho Applied Optimal Control

Conference on Neural Networks vol I I I pp

Blaisdell Publishing Co

Y Bengio Y LeCun C Nohl and C Burges Lerec A

Y LeCun A learning scheme for asymmetric threshold net

NNHMM hybrid for online handwriting recognition Neural

works in Proceedings of Cognitiva Paris France

Computationvol no

pp

Y LeCun Learning pro cesses in an asymmetric threshold S Lawrence C Lee Giles A C Tsoi and A D Back Face

network in Disordered systems and biological organization recognition A convolutional neural network approach IEEE

PROC OF THE IEEE NOVEMBER

son J D Cowan and C L Giles Eds San Mateo CA Transactions on Neural Networks vol no pp

pp Morgan Kaufmann

P Simard Y LeCun and Denker J Ecient pattern recog K J Lang and G E Hinton A time delayneuralnetwork

nition using a new transformation distance in Advances in architecture for sp eech recognition Tech Rep CMUCS

Neural Information Processing Systems S Hanson J Cowan CarnegieMellon University Pittsburgh PA

and L Giles Eds vol Morgan Kaufmann

A H Waib el T Hanazawa G Hinton K Shikano and

B Boser I Guyon and V Vapnik A training algorithm for K Lang Phoneme recognition using timedelay neural net

optimal margin classiers in Proceedings of the Fifth Annual works IEEE Transactions on Acoustics Speech and Signal

Processingvol pp March

Workshop on Computational Learning Theory vol pp

L Bottou F Fogelman P Blanchet and J S Lienard

Sp eaker indep endent isolated digit recognition Multilayer

C J C Burges and B Scho elkopf Improving the accuracy

p erceptron vs dynamic time warping Neural Networksvol

and sp eed of supp ort vector machines in Advances in Neural

pp

Information Processing Systems M Jordan M Mozer and

T Petsche Eds The MIT Press Cambridge

P Haner and A H Waib el Timedelay neural networks

Eduard Sackinger Bernhard Boser Jane Bromley Yann Le emb edding time alignment a p erformance analysis in EU

ROSPEECH nd European ConferenceonSpeech Commu

Cun and Lawrence D Jackel Application of the ANNA neu

nication and Technology Genova Italy Sept

ral net work chip to highsp eed character recognition IEEE

Transaction on Neural Networks vol no pp

I Guyon P Albrecht Y LeCun J S Denker and W Hub

March

bard Design of a neural network character recognizer for a

touch terminal Pattern Recognitionvol no pp

J S Bridle Probabilistic interpretation of feedforward classi

cation networks outputs with relationship to statistical pattern

recognition in Neurocomputing Algorithms Architectures

J Bromley J W Bentz L Bottou I Guyon Y LeCun

and Applications F Fogelman J Herault and Y Burno d

C Mo ore E Sackinger and R Shah Signature verica

Eds Les Arcs France Springer

tion using a siamese time delay neural network International

Journal of Pattern Reco gnition and Articial Intel ligencevol

Y LeCun L Bottou and Y Bengio Reading checks with

no pp August

graph transformer networks in International Conferenceon

Acoustics Speech and Signal Processing Munich vol

Y LeCun I Kanter and S Solla Eigenvalues of covariance

pp IEEE

matrices application to neuralnetwork learning Physical

Review Lettersvol no pp May

Y Bengio Neural Networks for Speech and SequenceRecogni

tion International Thompson Computer Press London UK

T G Dietterich and G Bakiri Solving multiclass learning

problems via errorcorrecting output co des Journal of Arti

cial Intel ligenceResearchvol pp

C Burges O Matan Y LeCun J DenkerLJack el C Ste

L R Bahl P F Brown P V de Souza and R L Mercer nard C Nohl and J Ben Shortest path segmentation A

Maximum mutual information of hidden Markov mo del pa metho d for training a neural network to recognize character

rameters for sp eech recognition in Proc Int Conf Acoust strings in International Joint Conference on Neural Net

Speech Signal Processing pp works Baltimore vol pp

L R Bahl P F Brown P V de Souza and R L Mercer T M Breuel A system for the oline recognition of hand

Sp eech recognition with continuousparameter hidden Markov written text in ICPR IEEE Ed Jerusalem

mo dels Computer Speech and Languagevol pp pp

A Viterbi Error bounds for convolutional co des and an

B H Juang and S Katagiri Discriminative learning for min asymptotically optimum deco ding algorithm IEEE Trans

imum error classication IEEE Trans on Acoustics Sp actions on Information Theory pp April eech

and Signal Processingvol no pp December

Lippmann R P and Gold B Neuralnet classiers useful for

sp eech recognition in Proceedings of the IEEE First Interna

tional Conference on Neural Networks San Diego June Y LeCun L D Jackel L Bottou A Brunot C Cortes J S

pp Denker H Drucker I Guyon U A Muller E Sackinger

P Simard and V N Vapnik Comparison of learning al

H Sakoe R Isotani K Yoshida K Iso and T Watan

gorithms for handwritten digit recognition in International

ab e Sp eakerindep endent word recognition using dynamic

Conference on Articial Neural Networks F Fogelman and

programming neural networks in International Conference

P Gallinari Eds Paris pp EC Cie

on Acoustics Speech and Signal Processing Glasgow

IGuyon I Poujaud L Personnaz G Dreyfus J Denker and

pp

Y LeCun Comparing dierent neural net architectures for

J S Bridle Alphanets a recurrent neural network archi

classifying handwritten digits in Proc of IJCNN Washing

tecture with a hidden markov mo del interpretation Speech

ton DC vol I I pp IEEE

Communicationvol no pp

R Ott construction of quadratic p olynomial classiers

M A Franzini K F Lee and A H Waib el Connectionist

in Proc of International Conference on Pattern Recognition

viterbi training a new hybrid metho d for continuous sp eech

pp IEEE

recognition in International ConferenceonAcoustics Speech

J Schurmann A multifontword recognition system for p ostal

and Signal Processing Albuquerque NM pp

address reading IEEE Transactions on Computersvol C

L T Niles and H F Silverman Combining hidden markov

no pp August

mo dels and neural network classiers in International Con

Y Lee Handwritten digit recognition using knearest neigh

ference on Acoustics Speech and Signal Pr ocessing Albu

b or radialbasis functions and backpropagation neural net

querque NM pp

works Neural Computationvol no pp

X Driancourt and L Bottou MLPLVQ and DP Compari

D Saad and S A Solla Dynamics of online gradientde

son co op eration in Proceedings of the International Joint

scent learning for multilayer neural networks in Advances in

Conference on Neural Networks Seattle vol pp

Neural Information Processing Systems David S Touretzky

Michael C Mozer and Michael E Hasselmo Eds vol

Y Bengio R De Mori G Flammia and R Komp e Global

pp The MIT Press Cambridge

optimization of a neural networkhidden Markov mo del hy

G Cyb enko Approximation by sup erp ositions of sigmoidal

brid IEEE Transactions on Neural Networksvol no

functions Mathematics of Control Signals and Systemsvol

pp

no pp

P Haner and A H Waib el Multistate timedelay neural

L Bottou and V N Vapnik Lo cal learning algorithms Neu

networks for continuous sp eech recognition in Advances in

ral Computationvol no pp

Neural Information Processing Systems vol pp

Morgan Kaufmann San Mateo R E Schapire The strength of weak learnability Machine

Learningvol no pp

Y Bengio P Simard and PFrasconi Learning longterm

dep endencies with gradient descent is dicult IEEE Trans ker R Schapire and P Simard Improving p erfor H Druc

actions on Neural Networksvol no pp March mance in neural networks using a b o osting algorithm in Ad

Sp ecial Issue on vances in Neural Information Processing Systems SJHan

PROC OF THE IEEE NOVEMBER

T Kohonen G Barna and R Chrisley Statistical pattern Lippmann Eds Denver CO pp Morgan Kauf

recognition with neural network Benchmarking studies in mann

Proceedings of the IEEE Second International Conferenceon

F C N Pereira and M Riley bycompo

Neural Networks San Diego vol pp

sition of weighted nite automata in FiniteState Devices for

P Haner Connectionist sp eech recognition with a global Natural Langue ProcessingCambridge Massachusetts

MMI algorithm in EUROSPEECH rdEuropean Confer MIT Press

enceonSpeech Communication and Technology Berlin Sept

sp eech M Mohri Finitestate transducers in language and

pro cessing Computational Linguisticsvol no pp

J S Denker and C J Burges Image segmentation and recog

nition in The Mathematics of Induction Addison Wes

I Guyon M Schenkel and J Denker Overview and syn

ley

thesis of online cursive handwriting recognition techniques

L Bottou Une Approche theorique de lApprentissage Connex in Handbook on Optical Character Recognition and Document

ionniste ApplicationsalaR econnaissancedelaParolePhD Image AnalysisPSPWang and Bunke H Eds World

thesis UniversitedeParis XI Orsay cedex France Scientic

M Rahim Y Bengio and Y LeCun Discriminativefeature M Mohri and M Riley Weighted determinization and min

imization for large vo cabulary recognition in Proceedings of

and mo del design for automatic sp eech recognition in Proc

Eurospeech Rho des Greece Septemb er pp

of Eurospeech Rho des Greece

Y Bengio and PFrasconi An inputoutput HMM architec

U Bo denhausen S Manke and A Waib el Connectionist ar

ture in Advances in Neural Information Processing Systems

chitectural learning for high p erformance character and sp eech

G Tesauro D Touretzky and T Leen Eds vol pp

recognition in International ConferenceonAcoustics Speech

and Signal Processing Minneap olis vol pp MIT Press Cambridge MA

F Pereira M Riley and R Sproat Weighted rational trans Y Bengio and PFrasconi InputOutput HMMs for sequence

pro cessing IEEE Transactions on Neural Networksvol

ductions and their application to human language pro cessing

no pp

in ARPA Natural Language Processing workshop

M Mohri F C N Pereira and M Riley Arational design

M Lades J C Vorbruggen J Buhmann and C von der Mals

for a weighted nitestate transducer library Lecture Notes in

burg Distortion invariant ob ject recognition in the dynamic

Computer Science Springer Verlag

link architecture IEEE Trans Comp vol no pp

M Rahim C H Lee and B H Juang Discriminative ut

B Boser E Sackinger J Bromley Y LeCun and L Jackel terance verication for connected digits recognition IEEE

An analog neural network pro cessor with programmable top ol Trans on Speech Audio Procvol pp

ogy IEEE Journal of SolidState Circuitsvol no pp

M Rahim Y Bengio and Y LeCun Discriminativefeature

Decemb er

and mo del design for automatic sp eech recognition in Eu

rospeech Rho des Greece pp

M Schenkel H Weissman I Guyon C Nohl and D Hender

son Recognitionbased segmentation of online handprin ted

S Bengio and Y Bengio An EM algorithm for asynchronous

words in Advances in Neural Information Processing Systems

inputoutput hidden Markov mo dels in International Con

S J Hanson J D Cowan and C L Giles Eds Denver

ference On Neural Information Processing L Xu Ed Hong

CO pp

Kong pp

C Dugast L Devillers and X Aub ert Combining TDNN

C Tapp ert C Suen and T Wakahara The state of the

and HMM in a hybrid system for improved continuoussp eech

art in online handwriting recognition IEEE Transactions on

recognition IEEE Transactions on Speech and Audio Pro

Pattern Analysis and Machine Intel ligencevol no pp

cessingvol no pp

Ofer Matan Henry S Baird Jane Bromley Christopher J C

S Manke and U Bo denhausen A connectionist recognizer for

Burges John S Denker Lawrence D Jackel YannLeCunEd

online cursive handwriting recognition in International Con

win PDPednault William D Sattereld Charles E Stenard

fer enceonAcoustics Speech and Signal Processing Adelaide

and Timothy J Thompson Reading handwritten digits A

vol pp

ZIP co de recognition system Computervol no pp

M Gilloux and M Leroux Recognition of cursive script

July

amounts on p ostal checks in European Conferencededicated

Y Bengio and Y Le Cun Word normalization for online to Postal Technologies Nantes France June pp

handwritten word recognition in Proc of the International

Conference on Pattern Recognition IAPR Ed Jerusalem

D Guillevic and C Y Suen Cursive script recognition applied

IEEE

to the pro cessing of bank checks in Int Conf on Document

R Vaillant C Monro cq and Y LeCun Original approac h

Analysis and Recognition Canada August pp

for the lo calization of ob jects in images IEE Proc on Vision

Image and Signal Processing vol no pp

L Lam C Y Suen D Guillevic N W Strathy M Cheriet

August

K Liu and J N Said Automatic pro cessing of information

R Wolf and J Platt Postal address blo ck lo cation using a on checks in Int Conf on Systems Man Cybernetics

convolutional lo cator network in Advances in Neural Infor Vancouver Canada Octob er pp

mation Processing Systems JDCowan G Tesauro and

C J C Burges J I Ben J S Denker Y LeCun and C R

J Alsp ector Eds pp Morgan Kaufmann Pub

Nohl O line recognition of handwritten p ostal words using

lishers San Mateo CA

neural net works Int Journal of Pattern Recognition and Ar

S Nowlan and J Platt A convolutional neural network hand

ticial Intel ligencevol no pp Sp ecial Issue

tracker in Advances in Neural Information Processing Sys

on Applications of Neural Networks to Pattern Recognition I

tems GTesauro D Touretzky and T Leen Eds San Ma

Guyon Ed

teo CA pp Morgan Kaufmann

Y LeCun Y Bengio D Henderson A Weisbuch H Weiss

H A Rowley S Baluja and T Kanade Neural network man and Jackel L Online handwriting recognition with

based face detection in Proceedings of CVPR pp neural networks spatial representation versus temp oral repre

IEEE Computer So ciety Press sentation in Proc International Conference on handwriting

and drawing Ecole Nationale Sup erieure des Telecommu

E Osuna R Freund and F Girosi Training supp ort vector

nications

mac hines an application to face detection in Proceedings of

CVPR pp IEEE Computer So cietyPress

U Muller A Gunzinger and W Guggenbuhl Fast neural

net simulation with a DSP pro cessor array IEEE Trans on

H Bourlard and C J Wellekens Links b etween Markovmod

Neural Networksvol no pp

els and multilayer p erceptrons in Advances in Neural Infor

R Battiti First and secondorder metho ds for learning Be mation Processing SystemsDTouretzky Ed Denver

tween steep est descent and newtons metho d Neural Com vol pp MorganKaufmann

putationvol no pp

Y Bengio R De Mori G Flammia and R Komp e Neu

A H Kramer and A SangiovanniVincentelli Ecient par ral network gaussian mixture hybrid for sp eech recognition

allel learning algorithms for neural networks in Advances in or density estimation in Advances in Neural Information

Neural Information Processing Systems Processing Systems JEMoodySJHansonandRP DSTouretzky Ed

PROC OF THE IEEE NOVEMBER

Yoshua Bengio Yoshua Bengio received his

Denver vol pp Morgan Kaufmann San

BEng in electrical engineering in from

Mateo

McGill University He also received a MSc

M Moller Ecient Training of FeedForward Neural Net

and a PhD in computer science from McGill

works PhD thesis Aarhus University Aarhus Denmark

University in and resp ectively In

he was a p ostdo ctoral fellowatthe

S Becker and Y LeCun Improving the convergence of back

Massachusetts Institute of Technology In

propagation learning with secondorder metho ds Tech Rep

he joined ATT Bell Lab oratories whichlater

CRGTR UniversityofToronto Connectionist Research

b ecame ATT LabsResearch In he

Group Septemb er

joined the faculty of the computer science de

partment of the UniversitedeMontreal where

he is now an asso ciate professor Since his rst work on neural net

works in his researchinterests have b een centered around learn

ing algorithms esp ecially for data with a sequential or spatial nature

such as sp eech handwriting and timeseries

Patrick Haner Patrick Haner graduated

Yann LeCun Yann LeCun received a

from Ecole Polytechnique Paris France in

Diplome dIngenieur from the Ecole Superieure

and from Ecole Nationale Superieure des

dIngenieur en Electrotechnique et Electron

Telecommunications ENST Paris France in

ique Paris in and a PhD in Computer

He received his PhD in sp eech and sig

Science from the Universite Pierre et Marie

nal pro cessing from ENST in In

Curie Paris in during which he prop osed

and he worked with Alex Waib el on the

an early version of the backpropagation learn

design of the TDNN and the MSTDNN ar

ing algorithm for neural networks He then

chitectures at ATR Japan and Carnegie Mel

joined the Department of Computer Science at

lon University From to as a re

the UniversityofToronto as a researchasso

search scientist for CNETFranceTelecom in

ciate In he joined the Adaptive Systems

Lannion France he develop ed connectionist learning algorithms for

Research DepartmentatATT Bell Lab oratories in Holmdel NJ

telephone sp eech recognition In he joined ATT Bell Lab ora

where he worked among other thing on neural networks machine

tories and worked on the application of Optical Character Recognition

learning and handwriting recognition Following ATTs second

and transducers to the pro cessing of nancial do cuments In he

breakup in he b ecame head of the Image Pro cessing Services

joined the Image Pro cessing Services Research DepartmentatATT

Research DepartmentatATT LabsResearch

LabsResearch His researchinterests include statistical and connec

He is serving on the b oard of the Machine Learning Journal and

tionist mo dels for sequence recognition machine learning sp eech and

has served as asso ciate editor of the IEEE T rans on Neural Networks

image recognition and information theory

He is general chair of the Machines that Learn workshop held every

year since in Snowbird Utah He has served as program cochair

of IJCNN INNC NIPS and He is a member of the

IEEE Neural Network for Signal Pro cessing Technical Committee

He has published over technical pap ers and b o ok chapters on

neural networks machine learning pattern recognition handwriting

recognition do cument understanding image pro cessing VLSI design

and information theory In addition to the ab ove topics his current

interests include videobased user interfaces image compression and

contentbased indexing of multimedia material

Leon Bottou Leon Bottou received a Diplome

from Ecole Polytechnique Paris in a

Magistere en Mathematiques Fondamentales et

Appliquees et Informatiques from Ecole Nor

male Superieure Paris in and a PhD

in Computer Science from UniversitedeParis

Sud in during whichheworked on sp eech

recognition and prop osed a framework for

sto chastic gradient learning and global train

ing He then joined the Adaptive Systems Re

search DepartmentatATT Bell Lab oratories

where he worked on neural network statistical learning theory and

lo cal learning algorithms He returned to France in as a research

engineer at ONERA He then b ecame chairman of Neuristique SA

a company making neural network simulators and trac forecast

ing software He eventually came backtoATT Bell Lab oratories

in where he work ed on graph transformer networks for optical

character recognition He is now a memb er of the Image Pro cess

ing Services Research DepartmentatATT LabsResearch Besides

learning algorithms his currentinterests include arithmetic co ding image compression and indexing