A Connectionist Recognizer for On-Line Cursive Handwriting Recognition
Stefan Manke and Ulrich Bodenhausen
University of Karlsruhe, Computer Science Department, D-76128 Karlsruhe, Germany

ABSTRACT

In this paper we show how the Multi-State Time Delay Neural Network (MS-TDNN), which is already used successfully in continuous speech recognition tasks, can be applied both to on-line single character and to cursive (continuous) handwriting recognition. The MS-TDNN integrates the high accuracy single character recognition capabilities of a TDNN with a non-linear time alignment procedure (a dynamic time warping algorithm) for finding stroke and character boundaries in isolated, handwritten characters and words. In this approach each character is modelled by up to 3 different states, and words are represented as sequences of these characters. We describe the basic MS-TDNN architecture and the input features used in this paper, and present results (up to 97.7% word recognition rate) both on writer dependent/independent single character recognition tasks and on writer dependent cursive handwriting tasks with vocabulary sizes of up to 20000 words.

1. INTRODUCTION

This paper describes a connectionist solution for the problem of single character and cursive (continuous) handwriting recognition. The recognition of continuous handwriting, as it is written on a touch screen or graphics tablet, has not only scientific but also significant practical value, e.g. for note pad computers or for integration into multi-modal systems. Figure 1 shows the example application which we use for on-line demonstrations of our handwriting recognizer. The main advantage of on-line handwriting recognition is the temporal information of writing, which can be recorded and used for recognition. Handwritten words can be represented as a time-ordered sequence of coordinates with varying speed and pressure at each coordinate. As in speech recognition, the main problem in recognizing continuous words is that character or stroke boundaries are not known (in particular if no pen lifts or white space indicate these boundaries), so an optimal time alignment has to be found [1].
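As a concrete picture of this raw data, a handwritten word can be held as a time-ordered list of (x, y, pressure) samples. The following minimal Python sketch is our own illustration, not part of the original system; the `PenSample` type and the sample values are assumptions made for demonstration.

```python
from typing import List, NamedTuple

class PenSample(NamedTuple):
    """One digitizer dot: tablet coordinates plus pen pressure."""
    x: float
    y: float
    pressure: float  # 0.0 would mean pen up / hovering

# A handwritten word as recorded on the tablet: a time-ordered
# sequence of dots (captured at up to 200 dots per second).
word_trajectory: List[PenSample] = [
    PenSample(10.0, 5.0, 0.8),
    PenSample(10.6, 5.4, 0.9),
    PenSample(11.3, 5.1, 0.7),
]

# The temporal order of the dots (and any pen-up gaps) is exactly
# the information that purely optical, off-line recognition lacks.
pen_down = [p for p in word_trajectory if p.pressure > 0.0]
print(len(pen_down))  # all 3 illustrative dots are pen-down
```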
The connectionist recognizer described in this paper integrates recognition and segmentation into a single network architecture, the Multi-State Time Delay Neural Network (MS-TDNN), which was originally proposed for continuous speech recognition tasks [2,3,4]. For on-line single character recognition, the Time Delay Neural Network (TDNN) [5] with its time-shift invariant architecture has already been applied successfully [6]. The MS-TDNN, an extension of the TDNN, combines the high accuracy character recognition capabilities of a TDNN with a non-linear time alignment procedure (Dynamic Time Warping) [7] for finding an optimal alignment between strokes and characters in handwritten continuous words.

The following section describes the basic network architecture and training method of the MS-TDNN, followed by a description of the input features used in this paper (section 3). Section 4 presents results both on different writer dependent/writer independent single character recognition tasks and on writer dependent cursive handwriting recognition tasks.

Figure 1: Demonstration application for our on-line continuous handwriting recognizer

0-7803-1775-0/94 $3.00 © 1994 IEEE

2. THE MULTI-STATE TIME DELAY NEURAL NETWORK (MS-TDNN)

The Time Delay Neural Network provides a time-shift invariant architecture for high performance on-line single character recognition. The Multi-State TDNN is capable of recognizing continuous words by integrating a dynamic time warping (DTW) algorithm into the TDNN architecture. Words are represented as sequences of characters, where each character is modelled by one or more states. For the results in this paper, each character is modelled with 3 states representing the first, middle, and last part of the character. We use this approach not only for the recognition of cursive words, but also for the recognition of isolated single characters (letters and digits).

Figure 2: Architecture of the basic MS-TDNN (input layer, hidden layer, states layer, and DTW layer with word output units such as "zero" and "one")

Figure 2 shows the basic architecture of the MS-TDNN. The first three layers constitute a standard TDNN with sliding input windows of certain sizes. This TDNN computes scores for each state and each time frame in the states layer. In the DTW layer, each word to be recognized is modelled by a sequence of states; the corresponding scores for the states are simply copied from the states layer into the word models of the DTW layer. An optimal alignment path is found by the DTW algorithm for each word, and the sum of all activations along this optimal path is taken as the score for the word output unit.

All network parameters, such as the number of states per character, the size of the input windows, or the number of hidden units, are optimized manually for the results presented in this paper, but they can also be optimized automatically by the Automatic Structure Optimization (ASO) algorithm that we have already proposed in [8,9]. With the ASO algorithm, no time-consuming manual tuning of these network parameters for particular handwriting tasks and training set sizes is necessary to obtain optimal recognition performance.

The MS-TDNN is trained with standard back-propagation. Training starts in a forced alignment mode, during which the MS-TDNN is trained with hand-segmented training data. For this purpose only a small part of the training data must be labeled manually with character boundaries; stroke boundaries are determined automatically. During this forced alignment mode, back-propagation starts at the states layer of the front-end TDNN with fixed state boundaries, and the McClelland objective function [3] is used to avoid the problems with 1-out-of-n codings for large n that appear with the Mean Squared Error.

After a certain number of iterations, when the network has successfully learned the character boundaries of this first, smaller training set, the forced alignment is replaced by a free alignment found by the DTW algorithm. Training then starts at the word level of the MS-TDNN and is performed on unsegmented training data.
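The role of the DTW layer can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' implementation: `state_scores[t][s]` stands for the TDNN activation of word state `s` at time frame `t`, frames are assigned to states monotonically (stay in a state or advance to the next), and the word score is the sum of activations along the best path.

```python
def word_score(state_scores):
    """Align time frames (rows) to the states of one word model
    (columns) and return the sum of activations along the best
    monotonic path from the first state to the last.
    """
    T, S = len(state_scores), len(state_scores[0])
    NEG = float("-inf")
    # dp[t][s]: best path score ending in state s at frame t
    dp = [[NEG] * S for _ in range(T)]
    dp[0][0] = state_scores[0][0]  # path must start in the first state
    for t in range(1, T):
        for s in range(S):
            prev = dp[t - 1][s]                     # stay in state s
            if s > 0:
                prev = max(prev, dp[t - 1][s - 1])  # advance one state
            dp[t][s] = prev + state_scores[t][s]
    return dp[T - 1][S - 1]  # path must end in the last state

# Toy example: 4 frames, a word modelled by 2 states.
scores = [[0.9, 0.1],
          [0.8, 0.2],
          [0.2, 0.7],
          [0.1, 0.9]]
print(round(word_score(scores), 2))  # best path s0,s0,s1,s1 -> 3.3
```

Because the state scores are shared by all word models of the DTW layer, a dynamic program of this kind runs once per vocabulary word over the same states-layer activations.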
For word level training (where back-propagation starts at the output units) we use an objective function, the Classification Figure of Merit [10], which tries to maximize the distance between the activation of the correct output unit and the activation of the best incorrect output unit.

3. DATA COLLECTION AND PREPROCESSING

The databases used for training and testing of the MS-TDNN were collected at the University of Karlsruhe. All data is recorded on a pressure sensitive graphics tablet with a cordless stylus, which produces a time-ordered sequence of 3-dimensional vectors (at a maximum rate of 200 dots per second) consisting of the x-y-coordinates and a pressure value for each dot. All subjects had to write a set of single words from a 400 word vocabulary covering all lower case letters, and at least one set of isolated lower case letters, upper case letters, and digits. For the continuous handwriting results presented in this paper, only the data of the first author was used.

A relatively straightforward and fast preprocessing is performed on this data. To preserve the dynamic writing information, which is the main advantage of on-line handwriting recognition over pure optical handwriting recognition, the sequences of 3-dimensional vectors are transformed into time-ordered sequences of equally spaced, 13-dimensional feature vectors, which describe relative position changes and the local topological context of each dot. The local context is represented by a 3x3 context pixel-map, which is calculated from a larger 20x20 pixel-map around the current dot. This pixel-map encodes the local shape of the curvature and all previous or future dots in the context of the current dot; further features describe the relative x and y position changes at each dot. Using these 13 features, one can combine the advantages of pure pixel-maps (all dots can be seen at once) with the dynamic writing information (order of writing). We will present these features in more detail in a forthcoming paper.

Figure 3: a) the word "be" written on the graphics tablet and b) its input representation

Figure 3 shows the handwritten word "be" as it has been written on the graphics tablet and its input representation after preprocessing.
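The two ingredients of these feature vectors, relative position changes and a pooled local context map, can be sketched as follows. This is our own illustrative reconstruction under stated assumptions, not the authors' preprocessing code: the window size `radius`, the average-pooling scheme, and the helper name `features_at` are ours; the paper fixes only the 20x20 and 3x3 map sizes.

```python
def features_at(points, i, radius=1.0, big=20, small=3):
    """Feature ingredients for dot i of a resampled pen trajectory.

    points: list of (x, y) tuples, already resampled to be equally
    spaced. Returns the relative position change (dx, dy) to the
    previous dot, plus a 3x3 context map obtained by average-pooling
    a 20x20 binary pixel-map of the neighbourhood of dot i.
    """
    x, y = points[i]
    px, py = points[i - 1] if i > 0 else (x, y)
    dx, dy = x - px, y - py  # dynamic (order-of-writing) information

    # 20x20 binary pixel-map of all dots (past and future alike) that
    # fall into a square window around the current dot: the static view.
    pixmap = [[0] * big for _ in range(big)]
    for qx, qy in points:
        gx = int((qx - (x - radius)) / (2 * radius) * big)
        gy = int((qy - (y - radius)) / (2 * radius) * big)
        if 0 <= gx < big and 0 <= gy < big:
            pixmap[gy][gx] = 1

    # Average-pool the 20x20 map down to a 3x3 context map.
    bounds = [round(k * big / small) for k in range(small + 1)]
    context = []
    for r in range(small):
        for c in range(small):
            block = [pixmap[u][v]
                     for u in range(bounds[r], bounds[r + 1])
                     for v in range(bounds[c], bounds[c + 1])]
            context.append(sum(block) / len(block))
    return (dx, dy), context

# Toy usage: four equally spaced dots on a horizontal stroke.
(dx, dy), ctx = features_at([(0.0, 0.0), (0.1, 0.0),
                             (0.2, 0.0), (0.3, 0.0)], i=2)
print(len(ctx))  # 9 context values
```

Combining the pooled map with the (dx, dy) pair gives a per-dot vector that mixes static shape context with dynamic writing order, which is the stated motivation for the 13 features.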
4. EXPERIMENTS AND RESULTS

The proposed MS-TDNN architecture was trained and tested both on writer dependent, cursive (continuous) handwriting tasks and on writer dependent/writer independent, single character recognition tasks.

a) writer dependent, cursive (continuous) handwriting

The MS-TDNN was trained in writer dependent mode with 2000 training patterns (isolated words) from a 400 word vocabulary (5 training patterns for each word in the vocabulary). The average word length of this vocabulary is 6.3 characters/word, and it covers all single lower case letters. The trained network was then tested without any retraining on 5 different vocabularies with vocabulary sizes varying from 400 up to 20000 words. Test results for these vocabularies (given as word recognition rates) are shown in Table 1.

Task         Vocabulary Size (words)   Test Patterns   Recognition Rate
msm-20000    20000                     2000            83.0%

Database msm-400-U consists of 800 test patterns from the same