AI Magazine Volume 19, Number 1 (1998) (© AAAI)

Combining Neural Networks and Context-Driven Search for Online, Printed Handwriting Recognition in the NEWTON

Larry S. Yaeger, Brandyn J. Webb, and Richard F. Lyon

■ While online handwriting recognition is an area of long-standing and ongoing research, the recent emergence of portable, pen-based computers has focused urgent attention on usable, practical solutions. We discuss a combination and improvement of classical methods to produce robust recognition of hand-printed English text for a recognizer shipping in new models of Apple Computer's NEWTON MESSAGEPAD and EMATE. Combining an artificial neural network (ANN) as a character classifier with a context-driven search over segmentation and word-recognition hypotheses provides an effective recognition system. Long-standing issues relative to training, generalization, segmentation, models of context, probabilistic formalisms, and so on, need to be resolved, however, to achieve excellent performance. We present a number of recent innovations in the application of ANNs as character classifiers for word recognition, including integrated multiple representations, normalized output error, negative training, stroke warping, frequency balancing, error emphasis, and quantized weights. User adaptation and extension to cursive recognition pose continuing challenges.

Pen-based hand-held computers are heavily dependent on fast and accurate handwriting recognition because the pen serves as the primary means for inputting data to such devices. Some earlier attempts at handwriting recognition have utilized strong, limited language models to maximize accuracy, but they proved unacceptable in real-world applications, generating disturbing and seemingly random word substitutions—known colloquially within Apple and NEWTON as "The Doonesbury Effect" because of Garry Trudeau's satirical look at first-generation recognition performance. However, the original handwriting-recognition technology in the NEWTON, and the current, much-improved cursive-recognizer technology, both of which were licensed from ParaGraph International, Inc., are not the subject of this article.

In Apple's Advanced Technology Group (aka Apple Research Labs), we pursued a different approach, using bottom-up classification techniques based on trainable artificial neural networks (ANNs) in combination with comprehensive but weakly applied language models. To focus our work on a subproblem that was tractable enough to lead to usable products in a reasonable time, we initially restricted the domain to hand printing so that strokes were clearly delineated by pen lifts. By simultaneously providing accurate character-level recognition, dictionaries exhibiting wide coverage of the language, and the ability to write entirely outside these dictionaries, we have produced a hand-print recognizer that some have called the "first usable" handwriting-recognition system.

Copyright © 1998, American Association for Artificial Intelligence. All rights reserved.

[Figure 1 block diagram: (x,y) points and pen lifts feed Tentative Segmentation, which passes segmentation hypotheses to the Neural Network Character Classifier, which passes character-class hypotheses to the Search with Context stage, which outputs words.]

Figure 1. A Simplified Block Diagram of Our Hand-Print Recognizer.

The ANN character classifier required some innovative training techniques to perform its task well. The dictionaries required large word lists, a regular expression grammar (to describe special constructs such as date, time, and telephone numbers), and a means of combining all these dictionaries into a comprehensive language model. In addition, well-balanced prior probabilities had to be determined for in-dictionary and out-of-dictionary writing. Together with a maximum-likelihood search engine, these elements form the basis of the so-called "Print Recognizer," which was first shipped in NEWTON OS 2.0–based MESSAGEPAD 120 units in December 1995 and has shipped in all subsequent NEWTON devices. In the most recent units, since the MESSAGEPAD 2000, despite retaining its label as a print recognizer, it has been extended to handle connected characters (as well as a full Western European character set).

There is ample prior work in combining low-level classifiers with dynamic time warping, hidden Markov models, Viterbi algorithms, and other search strategies to provide integrated segmentation and recognition for writing (Tappert, Suen, and Wakahara 1990) and speech (Renals et al. 1992). In addition, there is a rich background in the use of ANNs as classifiers, including their use as low-level character classifiers in a higher-level word-recognition system (Bengio et al. 1995). However, these approaches leave a large number of open-ended questions about how to achieve acceptable (to a real user) levels of performance. In this article, we survey some of our experiences in exploring refinements and improvements to these techniques.

System Overview

Apple's print recognizer (APR) consists of three conceptual stages—(1) tentative segmentation, (2) classification, and (3) context-driven search—as indicated in figure 1. The primary data on which we operate are simple sequences of (x,y) coordinate pairs plus pen-up–pen-down information, thus defining stroke primitives. The segmentation stage decides which strokes will be combined to produce segments—the tentative groupings of strokes that will be treated as possible characters—and produces a sequence of these segments together with legal transitions between them. This process builds an implicit graph that is then labeled in the classification stage and examined for a maximum-likelihood interpretation in the search stage. The classification stage evaluates each segment using the ANN classifier and produces a vector of output activations that are used as letter-class probabilities. The search stage then uses these class probabilities, together with models of lexical and geometric context, to find the N most likely word or sentence hypotheses.

Tentative Segmentation

Character segmentation—the process of deciding which strokes make up which characters—is inherently ambiguous. Ultimately, this decision must be made, but short of writing in boxes, it is impossible to do so (with any accuracy) in advance, external to the recognition process. Hence, the initial segmentation stage in APR produces multiple, tentative groupings of strokes and defers the final segmentation decisions until the search stage, thus integrating these segmentation decisions with the overall recognition process.

APR uses a potentially exhaustive, sequential enumeration of stroke combinations to generate a sequence of viable character-segmentation hypotheses. These segments are subjected to some obvious constraints (such as "all strokes must be used" and "no strokes can be used twice") and some less obvious filters (to cull impossible segments for the sake of efficiency). The resulting algorithm produces the actual segments that will be processed as possible characters, along with the legal transitions between these segments.

The legal transitions are defined by forward and reverse delays. The forward delay indicates the next possible segment in the sequence. The reverse delay indicates the start of the current batch of segments, all of which share the same leading stroke. Because of the enumeration scheme, a segment's reverse delay is the same as its stroke count minus one, unless preceding segments (sharing the same leading stroke) were eliminated by the filters mentioned previously. These two simple delay parameters (for each segment) suffice to define an implicit graph of all legal segment transitions. For a transition from segment number i to segment number j to be legal, the sum of segment i's forward delay plus segment j's reverse delay must be equal to j – i. Figure 2 provides an example of some ambiguous ink and the segments that might be generated from its strokes, supporting interpretations of dog, clog, cbg, or even %g.

Segment Number   Stroke Count   Forward Delay   Reverse Delay
1                1              3               0
2                2              4               1
3                3              4               2
4                1              2               0
5                2              2               1
6                1              1               0
7                1              0               0

Figure 2. Segmentation of Strokes into Tentative Characters, or Segments. (The figure also shows the ink for each segment, which is not reproduced here.)
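The legal-transition rule lends itself to a few lines of code. The sketch below is purely illustrative (it is not APR's implementation); the delay values are taken from figure 2.

```python
def legal_transition(i, j, forward, reverse):
    """A transition from segment i to segment j is legal when
    forward[i] + reverse[j] == j - i."""
    return forward[i] + reverse[j] == j - i

# Forward and reverse delays for the seven segments of figure 2.
forward = {1: 3, 2: 4, 3: 4, 4: 2, 5: 2, 6: 1, 7: 0}
reverse = {1: 0, 2: 1, 3: 2, 4: 0, 5: 1, 6: 0, 7: 0}

# Enumerate the implicit graph of legal segment transitions.
edges = [(i, j) for i in forward for j in forward
         if j > i and legal_transition(i, j, forward, reverse)]
print(edges)  # (1, 4) is legal, for example, because 3 + 0 == 4 - 1
```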

Character Classification

The output of the segmentation stage is a stream of segments that are then passed to an ANN for classification as characters. Except for the architecture and training specifics detailed later, a fairly standard multilayer perceptron trained with error backpropagation provides the ANN character classifier at the heart of APR. A large body of prior work exists to indicate the general applicability of ANN technology as a classifier providing good estimates of a posteriori probabilities of each class given the input (Renals and Morgan 1992; Richard and Lippmann 1991; Gish 1990; and others cited herein). Compelling arguments have been made for why ANNs providing posterior probabilities in a probabilistic recognition formulation should be expected to outperform other recognition approaches (Lippmann 1994), and ANNs have performed well as the core of speech-recognition systems (Morgan and Bourlard 1995).

Representation

A recurring theme in ANN research is the extreme importance of the representation of the data that are given as input to the network. We experimented with a variety of input representations, including stroke features both antialiased (gray scale) and not (binary) and images both antialiased and not, and with various schemes for positioning and scaling the ink within the image input window. In every case, antialiasing was a significant win. This result is consistent with others' findings: ANNs perform better when presented with smoothly varying, distributed input than they do when presented with binary, localized input. Almost the simplest image representation possible, a non–aspect-ratio-preserving, expand-to-fill-the-window image (limited only by a maximum scale factor to keep from blowing dots up to the full window size), together with either a single unit or a thermometer code (some number of units turned on in sequence to represent larger values) for the aspect ratio, proved to be the most effective single-classifier solution. However, the best overall classifier accuracy was ultimately obtained by combining multiple distinct representations into nearly independent, parallel classifiers joined at a final output layer. Hence, representation proved not only to be as important as architecture, but ultimately, it helped to define the architecture of our nets. For our final, hand-optimized system, we use four distinct inputs, as indicated in table 1. The stroke-count representation was dithered (changed randomly at a small probability) to expand the effective training set, prevent the network from fixating on this simple input, and thereby improve the network's ability to generalize. A schematic of the various input representations can be seen as part of the architecture drawing in figure 3.

Input Feature    Resolution   Description
Image            14 × 14      Antialiased, scale to window, scale limited
Stroke Feature   20 × 9       Antialiased, limited-resolution tangent slope, resampled to fixed number of points
Aspect Ratio     1 × 1        Normalized and capped to [0,1]
Stroke Count     5 × 1        Dithered thermometer code

Table 1. Input Representations Used in the Apple Print Recognizer.
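To make the table concrete, the following sketch builds three of the four inputs from raw (x,y) stroke data. It is an illustrative approximation rather than Apple's code: the bilinear splatting, the aspect-ratio normalization, and the dither probability are assumptions, and the 20 × 9 tangent-slope stroke feature is omitted for brevity.

```python
import random

def antialiased_image(strokes, size=14):
    """Rasterize ink into a size x size grid, stretching to fill the window
    (aspect ratio not preserved) and splatting each point bilinearly so the
    result is smoothly varying rather than binary."""
    pts = [p for s in strokes for p in s]
    x0, y0 = min(x for x, _ in pts), min(y for _, y in pts)
    w = max(max(x for x, _ in pts) - x0, 1e-6)
    h = max(max(y for _, y in pts) - y0, 1e-6)
    img = [[0.0] * size for _ in range(size)]
    for x, y in pts:
        gx, gy = (x - x0) / w * (size - 1), (y - y0) / h * (size - 1)
        ix, iy = int(gx), int(gy)
        fx, fy = gx - ix, gy - iy
        for dx, wx in ((0, 1 - fx), (1, fx)):
            for dy, wy in ((0, 1 - fy), (1, fy)):
                if ix + dx < size and iy + dy < size:
                    img[iy + dy][ix + dx] = min(1.0, img[iy + dy][ix + dx] + wx * wy)
    return img

def aspect_ratio(strokes):
    """Aspect ratio normalized into [0, 1] (one plausible convention)."""
    pts = [p for s in strokes for p in s]
    w = max(x for x, _ in pts) - min(x for x, _ in pts)
    h = max(y for _, y in pts) - min(y for _, y in pts)
    return min(h / max(h + w, 1e-6), 1.0)

def stroke_count_code(n_strokes, units=5, dither_prob=0.1):
    """Thermometer code for the stroke count, occasionally dithered by one
    stroke so the net cannot fixate on this simple input."""
    if random.random() < dither_prob:
        n_strokes = max(1, n_strokes + random.choice((-1, 1)))
    return [1.0 if i < min(n_strokes, units) else 0.0 for i in range(units)]
```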

[Figure 3 shows the network schematically. The four inputs are the 14 × 14 image, the 20 × 9 stroke feature, the 5 × 1 stroke count, and the 1 × 1 aspect ratio. The stroke feature feeds a 72 × 1 first hidden layer; the image feeds first-hidden-layer grids of sizes 7 × 7, 5 × 5, 7 × 2, 2 × 7, 9 × 1, and 1 × 9, with receptive fields annotated as (8x8), (10x10), (8x7;1,7), (7x8;7,1), (8x6;1,8), (6x8;8,1), (14x6), and (6x14). These feed fully connected second hidden layers of 104 and 112 units, which share a 95 × 1 output layer covering a…z, A…Z, 0…9, the printable punctuation ! through ~, and £.]

Figure 3. Final English-Language Net Architecture.

Architecture

As with representations, we experimented with a variety of architectures, including simple fully connected layers; receptive fields; shared weights; multiple hidden layers; and, ultimately, multiple nearly independent classifiers tied to a common output layer. The final choice of architecture includes multiple input representations; a first hidden layer (separate for each input representation) using receptive fields; fully connected second hidden layers (again distinct for each representation); and a final, shared, fully connected output layer. Simple scalar features—aspect ratio and stroke count—connect to both second hidden layers. The final network architecture, for our original English-language system, is shown in figure 3.

Layers are fully connected except for the input to the first hidden layer on the image side. This first hidden layer on the image side consists of eight separate grids, each of which accepts input from the image input grid with its own receptive field sizes and strides, shown parenthetically in figure 3 as (x-size × y-size; x-stride, y-stride). A stride is the number of units (pixels) in the input image space between sequential positionings of the receptive fields in a given direction. The 7 × 2 and 2 × 7 side panels (surrounding the central 7 × 7 grid) pay special attention to the edges of the image. The 9 × 1 and 1 × 9 side panels specifically examine full-size vertical and horizontal features, respectively. The 5 × 5 grid observes features at a different spatial scale than the 7 × 7 grid.
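The receptive-field wiring can be sketched as follows. This is an illustrative reconstruction, not the shipping code: for a grid specified as (x-size × y-size; x-stride, y-stride), each hidden unit takes input from one positioning of the receptive field over the 14 × 14 image.

```python
def receptive_field_indices(field_w, field_h, stride_x, stride_y, input_size=14):
    """Return, for each positioning of a field_w x field_h receptive field
    stepped by (stride_x, stride_y) over an input_size x input_size image,
    the flat input indices feeding the corresponding hidden unit."""
    fields = []
    y = 0
    while y + field_h <= input_size:
        x = 0
        while x + field_w <= input_size:
            fields.append([(y + dy) * input_size + (x + dx)
                           for dy in range(field_h) for dx in range(field_w)])
            x += stride_x
        y += stride_y
    return fields

# Consistent with figure 3: an (8x8) field with stride 1 tiles into the
# central 7 x 7 grid, and an (8x7; 1,7) field tiles into a 7 x 2 side panel.
print(len(receptive_field_indices(8, 8, 1, 1)))   # -> 49
print(len(receptive_field_indices(8, 7, 1, 7)))   # -> 14
```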


Combining the two classifiers at the output layer, rather than, say, averaging the output of completely independent classifiers, allows generic backpropagation to learn the best way to combine them, which is both convenient and powerful. However, our integrated multiple-representation architecture is conceptually related to, and motivated by, prior experiments at combining nets such as Steve Nowlan's "mixture of experts" (Jacobs et al. 1991).

Normalizing Output Error

Analyzing a class of errors involving words that were misrecognized as a result of, perhaps, a single misclassified character, we realized that the net was doing a poor job of representing second- and third-choice probabilities. Essentially, the net was being forced to attempt unambiguous classification of intrinsically ambiguous patterns due to the nature of the mean-squared error minimization in backpropagation, coupled with the typical training vector, which consists of all 0s except for the single 1 of the target. Lacking any viable means of encoding legitimate probabilistic ambiguity into the training vectors, we decided to try normalizing the "pressure toward 0" versus the "pressure toward 1" introduced by the output error during training. We refer to this technique as NormOutErr because of its normalizing effect on target versus nontarget output error.

We reduce the backpropagation error for nontarget classes relative to the target class by a factor that normalizes the total nontarget error seen at a given output unit relative to the total target error seen at the unit. If the training set consisted of an equal representation of classes, then this normalization should be based on the number of nontarget versus target classes in a typical training vector or, simply, the number of output units (minus one). Hence, for nontarget output units, we scale the error at each unit by a constant

e' = Ae,

where e is the error at an output unit, and A is defined as

A = 1 / [d(Noutput – 1)],

where Noutput is the number of output units, and d is our tuning parameter, typically ranging from 0.1 to 0.2. Error at the target output unit is unchanged. Overall, this error modulation raises the activation values at the output units because of the reduced pressure toward zero, particularly for low-probability samples. Thus, the learning algorithm no longer converges to a least mean-squared error (LMSE) estimate of P(class | input) but to an LMSE estimate of a nonlinear function f(P(class | input), A) depending on the factor A by which we reduced the error pressure toward zero.

Using a simple version of the technique of Bourlard and Wellekens (1990), we worked out the resulting nonlinear function. The net will attempt to converge to minimize the modified quadratic error function

E = p(1 – y)² + A(1 – p)y²

by setting its output y for a particular class to

y = p / (A – Ap + p),

where p = P(class | input), and A is as previously defined. For small values of p, the activation y is increased by a factor of nearly 1/A relative to the conventional case of y = p, and for high values of p, the activation is closer to 1 by nearly a factor of A. The inverse function, useful for converting back to a probability, is

p = yA / (yA + 1 – y).

We verified the fit of this function by looking at histograms of character-level empirical percentage correct versus y, as in figure 4. Even for this moderate amount of output error normalization, it is clear that the lower-probability samples have their output activations raised significantly, relative to the 45° line that A = 1 yields.

[Figure 4 plots empirical p = P(correct) against net output y, with both axes running from 0 to 1.]

Figure 4. Empirical p Versus y Histogram for a Net Trained with A = 0.11 (d = 0.1), with the Corresponding Theoretical Curve.

The primary benefit derived from this technique is that the net does a much better job of representing second- and third-choice probabilities, and low probabilities in general. Despite a small drop in top-choice character accuracy when using NormOutErr, we obtain a significant increase in word accuracy with this technique. Figure 5 shows an exaggerated example of this effect for an atypically large value of d (0.8), which overly penalizes character accuracy; however, the 30-percent decrease in word error rate is normal for this technique. (Note: These data are from a multi-year-old experiment and are not necessarily representative of current levels of performance on any absolute scale.)
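The error scaling and the activation-probability mapping above reduce to a few lines. This sketch is illustrative only and assumes the 95-unit output layer of the English-language net.

```python
N_OUTPUT = 95                      # output units in the English-language net
d = 0.1                            # NormOutErr tuning parameter
A = 1.0 / (d * (N_OUTPUT - 1))     # ~0.11 for d = 0.1

def scaled_output_error(target, y):
    """Backpropagated output error under NormOutErr: nontarget error is scaled
    by A; error at the target unit is unchanged."""
    e = target - y
    return e if target == 1.0 else A * e

def prob_to_activation(p):
    """Activation the net converges to for a class with true posterior p."""
    return p / (A - A * p + p)

def activation_to_prob(y):
    """Inverse mapping, used to convert trained outputs back to probabilities."""
    return y * A / (y * A + 1.0 - y)

print(prob_to_activation(0.1))     # ~0.51: low-probability classes are raised
```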
Negative Training

The previously discussed inherent ambiguities in character segmentation necessarily result in the generation and testing of a large number of invalid segments. During recognition, the network must classify these invalid segments just as it would any valid segment, with no knowledge of which are valid or invalid. A significant increase in word-level recognition accuracy was obtained by performing negative training with these invalid segments. Negative training consists of presenting invalid segments to the net during training with all-zero target vectors.

We retain control over the degree of negative training in two ways: First is a negative-training factor (ranging from 0.2 to 0.5) that modulates the learning rate (equivalently by modulating the error at the output layer) for these negative patterns. Lowering the learning rate for negative patterns reduces the impact of negative training on positive training, thus modulating the impact on whole characters that specifically look like pieces of multistroke characters (for example, I, 1, l, o, O, 0). Second, we control a negative-training probability (ranging between 0.05 and 0.3), which determines the probability that a particular negative sample will actually be trained on (for a given presentation). Probabilistically skipping negative patterns both reduces the overall impact of negative training and significantly reduces training time, because invalid segments are more numerous than valid segments. As with NormOutErr, this modification hurts character-level accuracy a little bit but helps word-level accuracy a lot.
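A sketch of how negative patterns might enter the training loop follows; the `net.train` interface and the default parameter values are hypothetical, chosen from the ranges quoted above.

```python
import random

def present_segment(net, pattern, is_valid, target, neg_factor=0.3, neg_prob=0.18):
    """Train on one segment. Invalid (negative) segments get an all-zero target,
    a reduced learning-rate factor, and are probabilistically skipped."""
    if is_valid:
        net.train(pattern, target, lr_scale=1.0)
    elif random.random() < neg_prob:
        net.train(pattern, [0.0] * net.n_outputs, lr_scale=neg_factor)
    # otherwise this negative sample is skipped for this presentation
```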

SPRING 1998 77 Articles

Stroke Warping

During training (but not during recognition), we produce random variations in stroke data, consisting of small changes in skew, rotation, and x and y linear and quadratic scalings. This stroke warping produces alternate character forms that are consistent with stylistic variations within and between writers and induces an explicit aspect ratio and rotation invariance within the framework of standard backpropagation. The amounts of each distortion to apply were chosen through cross-validation experiments as just the amounts needed to yield optimum generalization. (Cross-validation is a standard technique for early stopping of ANN training to prevent overlearning of the training set. Such overlearning of extraneous detail in the training samples reduces the ANN's ability to generalize and, thus, reduces its accuracy on new data. The technique consists of partitioning the available data into a training set and a cross-validation set. Then, during the learning process, the training set is used in the usual fashion to train the network while accuracy on the cross-validation set is simply measured periodically. Training is terminated when accuracy on the cross-validation set is maximized, despite the fact that further training might continue to improve the accuracy measured on the training set. Cross-validation is a very effective and widely accepted means of determining the optimal amount of learning for a given ANN and body of data.)

We chose relative amounts of the various transformations by testing for optimal final, converged accuracy on a cross-validation set. We then increased the amount of all stroke warping being applied to the training set, just to the point at which accuracy on the training set ceased to diverge from accuracy on the cross-validation set. We also examined a number of such samples by eye to verify that they represent a natural range of variation. A small set of such variations is shown in figure 6.

Our stroke warping scheme is somewhat related to the ideas of TANGENT DIST and TANGENT PROP (Simard, LeCun, and Denker 1993; Simard et al. 1992) in terms of the use of predetermined families of transformations, but we believe it is much easier to implement. It is also somewhat distinct in applying transformations on the original coordinate data as opposed to using distortions of images. The voice-transformation scheme of Chang and Lippmann (1995) is also related, but they use a static replication of the training set through a small number of transformations rather than dynamic random transformations of an essentially infinite variety.
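A rough sketch of this kind of coordinate-space warping is shown below; the distortion ranges are illustrative assumptions rather than the cross-validated values, and the quadratic scalings are omitted.

```python
import math
import random

def warp_strokes(strokes, max_rot=0.1, max_skew=0.1, max_scale=0.1):
    """Apply a small random rotation, skew, and independent x/y scaling to
    (x, y) coordinate data, producing a plausible stylistic variant of the
    same character."""
    theta = random.uniform(-max_rot, max_rot)            # radians
    skew = random.uniform(-max_skew, max_skew)
    sx = 1.0 + random.uniform(-max_scale, max_scale)
    sy = 1.0 + random.uniform(-max_scale, max_scale)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    warped = []
    for stroke in strokes:
        out = []
        for x, y in stroke:
            x, y = x + skew * y, y                                  # skew
            x, y = cos_t * x - sin_t * y, sin_t * x + cos_t * y     # rotate
            out.append((sx * x, sy * y))                            # scale
        warped.append(out)
    return warped
```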


[Figure 5 bar chart values: with NormOutErr disabled (d = 0.0), character error is 9.5 percent and word error is 31.6 percent; with d = 0.8, character error rises to 12.4 percent while word error falls to 22.7 percent.]

Figure 5. Character and Word Error Rates for Two Different Values of NormOutErr (d). A value of 0.0 disables NormOutErr, yielding normal backpropagation. The unusually high value of 0.8 (A = 0.014) produces nearly equal pressures toward 0 and 1.

Figure 6. A Few Random Stroke Warpings of the Same Original m Data.

Frequency Balancing

Training data from natural English words and phrases exhibit nonuniform priors for the various character classes, and ANNs readily model these priors. However, as with NormOutErr, we find that reducing the effect of these priors on the net in a controlled way and, thus, forcing the net to allocate more of its resources to low-frequency, low-probability classes is of significant benefit to the overall word-recognition process. To this end, we explicitly (partially) balance the frequencies of the classes during training. We obtain this frequency balancing by probabilistically skipping and repeating patterns based on a precomputed repetition factor. Each presentation of a repeated pattern is warped uniquely, as discussed previously.

To compute the repetition factor for a class i, we first compute a normalized frequency of this class:

Fi = Si / S̄,

where Si is the number of samples in class i, and S̄ is the average number of samples over all classes, computed in the obvious way:

S̄ = (1/C) Σ(i=1..C) Si,

with C the number of classes. Our repetition factor is then defined as

Ri = (a / Fi)^b,

with a and b as adjustable controls over the amount of skipping versus repeating and the degree of prior normalization, respectively. Typical values of a range from 0.2 to 0.8, and b ranges from 0.5 to 0.9. The factor a < 1 lets us do more skipping than repeating. For example, for a = 0.5, classes with relative frequency equal to half the average will neither skip nor repeat; more frequent classes will skip; and less frequent classes will repeat. A value of 0.0 for b would do no balancing, giving Ri = 1.0 for all classes, and a value of 1.0 would provide full normalization. A value of b somewhat less than one seems to be the best choice, letting the net keep some bias in favor of classes with higher prior probabilities.

This explicit prior-bias reduction is conceptually related to Lippmann's (1994) and Morgan and Bourlard's (1995) recommended method for converting from the net's estimate of posterior probability, p(class | input), to the value needed in an HMM or Viterbi search, p(input | class), which is to divide by p(class) priors. This division, however, may produce noisier estimates for low-frequency classes, resulting in a set of estimates that are not really optimized in an LMSE sense (as the net outputs are). In addition, output activations that are naturally bounded between 0 and 1 because of the sigmoid convert to potentially large probability estimates, requiring a renormalization step. Our method of frequency balancing during training eliminates both these concerns. Perhaps more significantly, frequency balancing also allows the standard backpropagation training process to dedicate more network resources to the classification of the lower-frequency classes, although we have no current method for characterizing or quantifying this benefit.
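A minimal sketch of the repetition-factor computation and the skip-or-repeat decision it drives; turning a fractional Ri into concrete presentations is one plausible reading of the scheme, not necessarily the exact mechanism used.

```python
import random

def repetition_factors(samples_per_class, a=0.5, b=0.8):
    """Ri = (a / Fi)^b, where Fi = Si / S-bar is the class frequency normalized
    by the average class frequency."""
    s_bar = sum(samples_per_class.values()) / len(samples_per_class)
    return {c: (a / (s / s_bar)) ** b for c, s in samples_per_class.items()}

def presentations_this_epoch(class_label, r_factors):
    """R = 2.3 yields 2 presentations plus a 30% chance of a third; R = 0.4
    yields a 40% chance of a single presentation. Each repeated presentation
    would be warped uniquely, as described earlier."""
    r = r_factors[class_label]
    n = int(r)
    return n + (1 if random.random() < r - n else 0)

# Example: a frequent class is mostly skipped, a rare class is repeated.
r = repetition_factors({'e': 12000, 'a': 8000, 'q': 300})
```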


Phase   Epochs (Approximate)   Learning Rate   Correct-Training Probability   Negative-Training Probability
1       25                     1.0–0.5         0.1                            0.05
2       25                     0.5–0.1         0.25                           0.1
3       50                     0.1–0.01        0.5                            0.18
4       30                     0.01–0.001      1.0                            0.3

Table 2. Typical Multiphase Schedule of Learning Rates and Other Parameters for Training a Character-Classifier Net.

Error Emphasis

Although frequency balancing corrects for underrepresented classes, it cannot account for underrepresented writing styles. We use a conceptually related probabilistic skipping of patterns, but of just those patterns that the net correctly classifies in its forward-recognition pass, as a form of error emphasis, to address this problem. We define a correct-train probability (ranging from 0.1 to 1.0), which is used as a biased coin to determine whether a particular pattern, having been correctly classified, will also be used for the backward-training pass. This applies only to correctly segmented, or positive, patterns; misclassified patterns are never skipped.

Especially during the early stages of training, we set this parameter fairly low (around 0.1), thus concentrating most of the training time and the net's learning capability on patterns that are more difficult to correctly classify. This error emphasis is the only way we were able to get the net to learn to correctly classify unusual character variants, such as a three-stroke 5 as written by only one training writer.

Many variants of this scheme are possible, in which misclassified patterns would be repeated, or different learning rates would apply to correctly and incorrectly classified patterns. It is also related to techniques that use a training subset, from which easily classified patterns are replaced by randomly selected patterns from the full training set (Guyon et al. 1992).

Annealing

Although some discussions of backpropagation espouse explicit formulas for modulating the learning rate over time, many seem to assume the use of a single, fixed learning rate. We view the stochastic backpropagation process as a kind of simulated annealing, with a learning rate starting very high and decreasing only slowly to a very low value. However, rather than using any prespecified formula to decelerate learning, the rate at which the learning rate decreases is determined by the dynamics of the learning process itself. We typically start with a rate near 1.0 and reduce the rate by a multiplicative decay factor of 0.9 until it gets down to about 0.001. The rate decay factor is applied following any epoch in which the total squared error on the training set increases relative to the previous epoch. This total squared error is summed over all output units and over all patterns in one full epoch and normalized by these counts. Thus, even though we are using online or stochastic training, we have a measure of performance over whole epochs that can be used to guide the annealing of the learning rate. Repeated tests indicate that this approach yields better results than low (or even moderate) initial learning rates, which we speculate to be related to a better ability to escape local minima.

In addition, we find that we obtain best overall results when we also allow some of our training parameters to change over the course of a training run. In particular, the correct-train probability needs to start out very low to give the net a chance to learn unusual character styles, but it should finish near 1.0 in order to avoid introducing a general posterior probability bias in favor of classes with lots of ambiguous examples. We typically train a net in four phases, according to parameters such as those in table 2.
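A compact sketch of this error-driven annealing, with the four-phase schedule of table 2 alongside it; `train_one_epoch` is a hypothetical function that runs one epoch of online backpropagation at the given learning rate and returns the normalized total squared error.

```python
def anneal(train_one_epoch, lr=1.0, lr_min=0.001, decay=0.9, max_epochs=200):
    """Apply the multiplicative decay factor after any epoch whose normalized
    total squared error is worse than the previous epoch's."""
    prev_err = float("inf")
    for _ in range(max_epochs):
        err = train_one_epoch(lr)
        if err > prev_err:
            lr *= decay
        prev_err = err
        if lr <= lr_min:
            break
    return lr

# Four-phase schedule from table 2 (the learning rate anneals within each range).
PHASES = [
    dict(epochs=25, lr=(1.0, 0.5),    correct_train_p=0.1,  neg_train_p=0.05),
    dict(epochs=25, lr=(0.5, 0.1),    correct_train_p=0.25, neg_train_p=0.1),
    dict(epochs=50, lr=(0.1, 0.01),   correct_train_p=0.5,  neg_train_p=0.18),
    dict(epochs=30, lr=(0.01, 0.001), correct_train_p=1.0,  neg_train_p=0.3),
]
```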

Quantized Weights

The work of Asanović and Morgan (1991) shows that 2-byte (16-bit) weights are about the smallest that can be tolerated in training large ANNs using backpropagation. However, memory is expensive in small devices, and reduced instruction set computer processors, such as the ARM-610 in the first devices in which this technology was deployed, are much more efficient doing one-byte loads and multiplies than two-byte loads and multiplies; so, we were motivated to make one-byte weights work.

Running the net for recognition demands significantly less precision than training the net. It turns out that one-byte weights provide adequate precision for recognition if the weights are trained appropriately. In particular, a dynamic range should be fixed, weights limited to the legal range during training, and then rounded to the requisite precision after training. For example, we find that a range of weight values from (almost) –8 to +8 in steps of 1/16 does a good job. Figure 7 shows a typical resulting distribution of weight values. If the weight limit is enforced during high-precision training, the resources of the net will be adapted to make up for the limit. Because bias weights are few in number, however, and very important, we allow them to use two bytes with essentially unlimited range. Performing our forward-recognition pass with low-precision, 1-byte weights (a ±3.4 fixed-point representation), we find no noticeable degradation relative to floating-point, 4-byte, or 2-byte weights using this scheme.

We have also developed a scheme for training with one-byte weights. It uses a temporary augmentation of the weight values with two additional low-order bytes to achieve precision in training but runs the forward pass of the net using only the one-byte high-order part. Thus, any cumulative effect of the one-byte rounded weights in the forward pass can be compensated through further training. Small weight changes accumulate in the low-order bytes and only occasionally carry into a change in the one-byte weights used by the net. In a personal product, this scheme could be used for adaptation to the user, after which the low-order residuals could be discarded and the temporary memory reclaimed.

[Figure 7 plots the count of weights per bin of width 1/16 against weight value w, on a log count scale, over the range –8 to +8.]

Figure 7. Distribution of Weight Values in a Net with One-Byte Weights on a Log Count Scale. Weights with magnitudes greater than four are sparse but important.
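The ±3.4 fixed-point quantization and the low-order-residual training trick can be sketched as follows (illustrative only; the byte-level layout is abstracted away).

```python
STEP = 1.0 / 16.0            # 4 fractional bits
W_MAX = 8.0 - STEP           # 3 integer bits plus sign: (almost) -8 to +8

def quantize(w):
    """Clamp to the legal range and round to the nearest 1/16."""
    w = max(-W_MAX, min(W_MAX, w))
    return round(w / STEP) * STEP

class QuantizedWeight:
    """One-byte weight with a temporary high-precision residual used only
    during training; the forward pass sees only the quantized value."""
    def __init__(self, w=0.0):
        self.precise = w                 # augmented, high-precision value
    @property
    def value(self):                     # what the forward pass uses
        return quantize(self.precise)
    def update(self, delta):             # small deltas accumulate in the residual
        self.precise = max(-W_MAX, min(W_MAX, self.precise + delta))
```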

Context-Driven Search

The output of the ANN classifier is a stream of probability vectors, one vector for each segmentation hypothesis, with as many potentially nonzero probability elements in each vector as there are characters (that the system is capable of recognizing). In practice, we typically only pass the top 10 (or fewer) scored character-class hypotheses for each segment to the search engine for the sake of efficiency. The search engine then looks for a minimum-cost path through this vector stream, abiding by the legal transitions between segments as defined in the tentative-segmentation step discussed previously. This minimum-cost path is the APR system's best interpretation of the ink input by the user and is returned to the system in which APR is embedded as the recognition result for whole words or sentences of the user's input.

The search is driven by a somewhat ad hoc generative language model that consists of a set of graphs that are searched in parallel. We use a simple beam search in a negative-log-probability (or penalty) space for the best N hypotheses. The beam is based on a fixed maximum number of hypotheses rather than a particular value. Each possible transition token (character) emitted by one of the graphs is scored not only by the ANN but by the full language model, a simple letter-case model, and a geometric-context model discussed later. The fully integrated search process takes place over a space of character- and word-segmentation hypotheses as well as character-class hypotheses.
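A highly simplified sketch of a beam search of this shape: penalties are negative log probabilities, the beam keeps a fixed number of hypotheses, and each character's cost combines the ANN score with a context penalty. All names here are illustrative; the real search also integrates word breaks and the parallel dictionary graphs.

```python
import heapq
import math

def beam_search(start_segments, edges, class_probs, context_penalty, beam_width=20):
    """edges: segment id -> list of legal successor ids ([] for terminal segments).
    class_probs: segment id -> list of (char, p) pairs from the ANN.
    context_penalty(prev_char, char): penalty from the language, letter-case,
    and geometric-context models."""
    beam = [(0.0, seg, "", None) for seg in start_segments]
    best = (math.inf, "")
    while beam:
        candidates = []
        for penalty, seg, text, prev in beam:
            for char, p in class_probs[seg]:
                cost = penalty - math.log(max(p, 1e-12)) + context_penalty(prev, char)
                succs = edges.get(seg, [])
                if not succs:                        # complete hypothesis
                    best = min(best, (cost, text + char))
                for nxt in succs:
                    candidates.append((cost, nxt, text + char, char))
        beam = heapq.nsmallest(beam_width, candidates)   # fixed-size beam
    return best
```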

dig = [0123456789]
digm01 = [23456789]

acodenums = (digm01 [01] dig)

acode = { ("1-"? acodenums "-"):40 , ("1"? "(" acodenums ")"):60 }

phone = (acode? digm01 dig dig "-" dig dig dig dig)

Figure 8. Sample of the Regular-Expression Grammar Used to Define a Simple Model of Telephone Numbers. Symbols are defined by the equal operator; square brackets enclose multiple, alternative characters; parentheses enclose sequences of symbols; curly braces enclose multiple, alternative symbols; an appended colon, followed by numbers, designates a prior probability of this alternative; an appended question mark means “zero or one occurrence”; and the final symbol definition represents the graph or grammar expressed by this dictionary.

BiGrammar Phone

[Phone.lang 1. 1. 1.]

Figure 9. Sample of a Simple BiGrammar Describing a Telephone-Only Context. The BiGrammar is first named (Phone) and then specified as a list of dictionaries (Phone.lang) together with the probability of starting with this dictionary, ending with this dictionary, and cycling within this dictionary (the three numeric values).

Lexical Context

Context is essential to accurate recognition, even if this context takes the form of a broad language model. Humans achieve just 90-percent accuracy on isolated characters from our database. Without any context, this character accuracy would translate to a word accuracy of not much more than 60 percent (0.9^5), assuming an average word length of 5 characters. We need to obtain word accuracies much higher than this 60 percent, despite basing our recognition on character accuracies that may never reach human levels. We achieve this accuracy by careful application of our context models.

A simple model of letter case and adjacency—penalizing case transitions except between the first and second characters, penalizing alphabetic-to-numeric transitions, and so on—together with the geometric-context models discussed later, is sufficient to raise word accuracy to around 77 percent.

The next large gain in accuracy requires a genuine language model. We provide this model by means of dictionary graphs and assemblages of these graphs combined into what we refer to as BiGrammars. BiGrammars are essentially scored lists of dictionaries together with specified legal (scored) transitions between these dictionaries. This scheme allows us to use word lists, prefix and suffix lists, and punctuation models and enable appropriate transitions between them. Some dictionary graphs are derived from a regular-expression grammar that permits us to easily model phone numbers, dates, times, and so on, as shown in figure 8.

All these dictionaries can be searched in parallel by combining them into a general-purpose BiGrammar that is suitable for most applications. It is also possible to combine subsets of these dictionaries, or special-purpose dictionaries, into special BiGrammars targeted at more limited contexts. A simple BiGrammar, which might be useful to specify context for a field that only accepts telephone numbers, is shown in figure 9. A more complex BiGrammar (although still far short of the complexity of our final general-input context) is shown in figure 10.

We refer to our language model as "weakly applied" because in parallel with all the word list–based dictionaries and regular-expression grammars, we simultaneously search both an alphabetic-character grammar (wordlike) and a completely general, any-character-anywhere grammar (symbols). These more flexible models, although given fairly low a priori probabilities, permit users to write any unusual character string they might desire. When the prior probabilities for the various dictionaries are properly balanced, the recognizer is able to benefit from the language model and deliver the desired level of accuracy for common in-dictionary words (and special constructs such as phone numbers) but can also recognize arbitrary, nondictionary character strings, especially if they are written neatly enough that the character classifier can be confident of its classifications.

We have also experimented with bigrams, trigrams, and N-grams, and we are continuing experiments with other, more data-driven language models; to date, however, our generative approach has yielded the best results.


BiGrammar FairlyGeneral
  (.8
    (.6
      [WordList.dict .5 .8 1. EndPunct.lang .2]
      [User.dict .5 .8 1. EndPunct.lang .2]
    )
    (.4
      [Phone.lang .5 .8 1. EndPunct.lang .2]
      [Date.lang .5 .8 1. EndPunct.lang .2]
    )
  )
  (.2
    [OpenPunct.lang 1. 0. .5
      (.6
        WordList.dict .5
        User.dict .5
      )
      (.4
        Phone.lang .5
        Date.lang .5
      )
    ]
  )
  [EndPunct.lang 0. .9 .5 EndPunct.lang .1]

Figure 10. Sample of a Slightly More Complex BiGrammar Describing a Fairly General Context. The BiGrammar is first named (FairlyGeneral) and then specified as a list of dictionaries (the *.dict and *.lang entries), together with the probability of starting with this dictionary, ending with this dictionary, and cycling within this dictionary (the first three numeric values following each dictionary name), as well as any dictionaries to which this dictionary can legally transition and the probability of taking this transition. The parentheses permit easy specification of multiplicative prior probabilities for all dictionaries contained within them. Note that in this simple example, it is not possible (starting probability = 0) to start a string with the EndPunct (end punctuation) dictionary, just as it is not possible to end a string with the OpenPunct dictionary.

[Figure 11 compares the bounding boxes of the letters in "if" as written by the user with those predicted by the table-driven model.]

Figure 11. The Eight Measurements That Contribute to the GeoContext Error Vector and Corresponding Score for Each Letter Pair.

Geometric Context

We have never found a way to reliably estimate a baseline or topline for characters, independent of classifying these characters in a word. Non-recognition-integrated estimates of these line positions, based on strictly geometric features, have too many pathological failure modes, which produce erratic recognition failures. However, the geometric positioning of characters most certainly bears information important to the recognition process. Our system factors the problem by letting the ANN classify representations that are independent of baseline and size and then using separate modules to score both the absolute size of individual characters and the relative size and position of adjacent characters.

The scoring based on absolute size is derived from a set of simple Gaussian models of individual character heights relative to some running scale parameters computed during both learning and recognition. This CharHeight score directly multiplies the scores emitted by the ANN classifier and helps significantly in case disambiguation.

We also use a GeoContext module that scores adjacent characters based on the classification hypotheses for these characters and their relative size and placement. GeoContext scores each tentative character based on its class and the class of the immediately preceding letter (for the current search hypothesis). The character classes are used to look up expected character sizes and positions in a standardized space (baseline = 0.0, top line = 1.0). The ink being evaluated provides actual sizes and positions that can be compared directly to the expected values, subject only to a scale factor and offset, which are chosen to minimize the estimated error of fit between data and model. This same quadratic error term, computed from the inverse covariance matrix of a full multivariate Gaussian model of these sizes and positions, is used directly as GeoContext's score (or penalty because it is applied in the negative log probability space of the search engine). Figure 11 illustrates the bounding boxes derived from the user's ink versus the table-driven model with the associated error measures for our GeoContext module.

GeoContext's multivariate Gaussian model is learned directly from data. The problem in doing so was to find a good way to train per-character parameters of top, bottom, width, space, and so on, in our standardized space from data that had no labeled baselines or other absolute referent points. Because we had a technique for generating an error vector from the table of parameters, we decided to use a backpropagation variant to train the table of parameters to minimize the squared error terms in the error vectors given all the pairs of adjacent characters and correct class labels from the training set.

GeoContext plays a major role in properly recognizing punctuation and disambiguating case and a major role in recognition in general. A more extended discussion of GeoContext was provided by Lyon and Yaeger (1996).
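GeoContext's scoring can be sketched as a quadratic form over an error vector; the eight specific measurements, and the fitting of the scale factor and offset, are simplified away here.

```python
import numpy as np

def error_vector(observed, expected):
    """Differences between observed and expected letter-pair geometry (for
    example, tops, bottoms, widths, and spacing of the two characters), after
    the ink has been scaled and offset to best fit the model."""
    return [o - e for o, e in zip(observed, expected)]

def geocontext_penalty(error_vec, inv_cov):
    """Quadratic error term e^T Sigma^-1 e from the multivariate Gaussian model,
    used directly as a penalty in the search's negative-log-probability space."""
    e = np.asarray(error_vec, dtype=float)
    return float(e @ inv_cov @ e)
```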


P_WordBreak = Γ_WordGap / (Γ_StrokeGap + Γ_WordGap)

[Figure 12 plots the two Gaussian density distributions against gap size: Γ_StrokeGap, the density of stroke (non-word) breaks, and Γ_WordGap, the density of word breaks, with sample counts on the vertical axis.]

Figure 12. Gaussian Density Distributions Yield a Simple Statistical Model of Word-Break Probability, Which Is Applied in the Region between the Peaks of the StrokeGap and WordGap Distributions. Hashed areas indicate regions of clear-cut decisions, where P_WordBreak is set to either 0.0 or 1.0 to avoid problems dealing with tails of these simple distributions.

Integration with Word Segmentation

Just as it is necessary to integrate character segmentation with recognition using the search process, so is it essential to integrate word segmentation with recognition and search to obtain accurate estimates of word boundaries and reduce the large class of errors associated with missegmented words. To perform this integration, we first need a means of estimating the probability of a word break between each pair of tentative characters. We use a simple statistical model of gap sizes and stroke-centroid spacing to compute this probability (spaceProb). Gaussian density distributions, based on means and standard deviations computed from a large training corpus, together with a prior-probability scale factor, provide the basis for the word-gap and stroke-gap (nonword-gap) models, as illustrated in figure 12. Because any given gap is, by definition, either a word gap or a nonword gap, the simple ratio defined in figure 12 provides a convenient, self-normalizing estimate of the word-gap probability. In practice, this equation further reduces to a simple sigmoid form, thus allowing us to take advantage of a lookup-table–based sigmoid derived for use in the ANN. In a thresholding, non-integrated word-segmentation model, word breaks would be introduced when spaceProb exceeds 0.5, that is, when a particular gap is more likely to be a word gap than a nonword gap. For our integrated system, both word-break and non–word-break hypotheses are generated at each segment transition and weighted by spaceProb and (1 – spaceProb), respectively. The search process then proceeds over this larger hypothesis space to produce best estimates of whole phrases or sentences, thus integrating word segmentation as well as character segmentation.
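A sketch of the gap model of figure 12; the distribution parameters below are placeholders, and the real models also use stroke-centroid spacing and a prior-probability scale factor.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def space_prob(gap, stroke_mu=0.2, stroke_sigma=0.1, word_mu=1.0, word_sigma=0.3):
    """P(word break | gap) = G_word / (G_stroke + G_word), clamped to 0.0 or 1.0
    outside the region between the two distribution peaks."""
    if gap <= stroke_mu:
        return 0.0
    if gap >= word_mu:
        return 1.0
    g_stroke = gaussian(gap, stroke_mu, stroke_sigma)
    g_word = gaussian(gap, word_mu, word_sigma)
    return g_word / (g_stroke + g_word)
```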

Discussion

The combination of elements described in the preceding sections produces a powerful, integrated approach to character segmentation, word segmentation, and recognition. Users' experiences with APR are almost uniformly positive, unlike experiences with previous handwriting-recognition systems. Writing within the dictionary is remarkably accurate, yet the ease with which people can write outside the dictionary has fooled many people into thinking that the NEWTON's Print Recognizer does not use dictionaries. As discussed previously, our recognizer certainly does use dictionaries. Indeed, the broad-coverage language model, although weakly applied, is essential for high-accuracy recognition. Curiously, there seems to be little problem with dictionary perplexity—little difficulty as a result of using very large, very complex language models. We attribute this fortunate behavior to the excellent performance of the neural network character classifier at the heart of the system.

One of the side benefits of the weak application of the language model is that even when recognition fails and produces the wrong result, the answer that is returned to the user is typically understandable by the user—perhaps involving substitution of a single character. Two useful phenomena ensue as a result: First, the user learns what works and what doesn't, especially when he/she refers back to the ink that produced the misrecognition; so, the system trains the user gracefully over time. Second, the meaning is not lost the way it can be, all too easily, with whole-word substitutions—with that Doonesbury effect found in first-generation, strong-language-model recognizers.

Although we provided legitimate accuracy statistics for certain comparative tests of some of our algorithms, we deliberately shied away from claiming specific levels of accuracy in general. Neat printers, who are familiar with the system, can achieve 100-percent accuracy if they are careful. Testing on data from complete novices, writing for the first time using a metal pen on a glass surface, without any feedback from the recognition system and with ambiguous instructions about writing with disconnected characters (intended to mean printing but often interpreted as writing with otherwise cursive characters but separated by large spaces in a wholly unnatural style), can yield word-level accuracies as low as 80 percent. Of course, the entire interesting range of recognition accuracies lies between these two extremes. Perhaps a slightly more meaningful statistic comes from common reports on newsgroups and some personal testing that suggest accuracies of 97 percent to 98 percent in regular use. For scientific purposes, none of these numbers have any real meaning because our testing data sets are proprietary, and the only valid tests between different recognizers would have to be based on results obtained by processing the exact same bits or analyzing large numbers of experienced users of the systems in the field—a difficult project that has not been undertaken.

One of the key reasons for the success of APR is the suite of innovative neural network–training techniques that help the network encode better class probabilities, especially for underrepresented classes and writing styles. Many of these techniques—stroke-count dithering, normalization of output error, frequency balancing, and error emphasis—share a unifying theme: Reducing the effect of a priori biases in the training data on network learning significantly improves the network's performance in an integrated recognition system despite a modest reduction in the network's accuracy for individual characters. Normalization of output error prevents overrepresented non-target classes from biasing the net against underrepresented target classes. Frequency balancing prevents overrepresented classes from biasing the net against underrepresented classes. Stroke-count dithering and error emphasis prevent overrepresented writing styles from biasing the net against underrepresented writing styles. One could even argue that negative training eliminates an absolute bias toward properly segmented characters and that stroke warping reduces the bias toward those writing styles found in the training data, although these techniques also provide wholly new information to the system.

Although we've offered arguments for why each of these techniques individually helps the overall recognition process, it is unclear why prior bias reduction in general should be so consistently valuable. The general effect might be related to the technique of dividing out priors, as is sometimes done to convert from p(class | input) to p(input | class). However, we also believe that forcing the net during learning to allocate resources to represent less frequent sample types might be directly beneficial. In any event, it is clear that paying attention to such biases and taking steps to modulate them is a vital component of effective training of a neural network serving as a classifier in a maximum-likelihood recognition system.


The majority of this article describes a sort of snapshot of the system and its architecture as it was deployed in its first commercial release, when it was, indeed, purely a print recognizer. Letters had to be fully disconnected; that is, the pen had to be lifted between each pair of characters. The characters could overlap to some extent, but the ink could not be continuous. Connected characters proved to be the largest remaining class of errors for most of our users because even a person who normally prints (as opposed to writing in cursive script) might occasionally connect a pair of characters—the cross-bar of a t with the h in the, the o and n in any word ending in ion, and so on. To address this issue, we experimented with some fairly straightforward modifications to our recognizer, involving the fragmenting of user strokes into multiple system strokes, or fragments. Once the ink representing the connected characters is broken into fragments, we then allow our standard integrated segmentation and recognition process to stitch them back together into the most likely character and word hypotheses, in the fashion routinely used in print recognition. This technique has proven to work well, and the version of the print recognizer in the most recent NEWTONs, since the MESSAGEPAD 2000, supports recognition of printing with connected characters. This capability was added without significant modification of the main recognition algorithms as presented in this article. Because of certain assumptions and constraints in the current release of the software, APR is not yet a full cursive recognizer, although this is an obvious next direction to explore.

The net architecture discussed earlier and shown in figure 3 also corresponds to the true printing-only recognizer. The final output layer has 95 elements corresponding to the full printable ASCII character set plus the British pound sign. Initially for the German market, and now even in English units, we have extended APR to handle diacritical marks and the special symbols needed for most European languages (although there is only limited coverage of foreign languages in the English units). The main innovation that permitted this extended character set was an explicit handling of any compound character as a base plus an accent. In this way, only a few nodes needed to be added to the neural network output layer, representing just the bases and accents rather than all combinations and permutations of the same. In addition, training data for all compound characters sharing a common base or a common accent contribute to the network's ability to learn this base or accent as opposed to contributing only to an explicit base + accent combination. Here again, however, the fundamental recognizer technology has not changed significantly from that presented in this article.

Future Extensions

We are optimistic that our algorithms, having proven themselves to work essentially as well for connected characters as for disconnected characters, might extend gracefully to full cursive script. On a more speculative note, we believe that the technique might extend well to ideographic languages, substituting radicals for characters and ideographic characters for words.

Finally, a note about learning and user adaptation: For a learning technology such as ANNs, user adaptation is an obvious and natural fit and was planned as part of the system


from its inception. However, because of random-access memory constraints in the initial shipping product and the subsequent prioritization of European character sets and connected characters, we have not yet deployed a learning system. We have, however, done some testing of user adaptation and believe it to be of considerable value. Figure 13 shows a comparison of the average performance on an old user-independent net trained on data from 45 writers and the performance for 3 individuals using (1) the user-independent net, (2) a net trained on data exclusively from this individual, and (3) a copy of the user-independent net adapted to the specific user by some incremental training. (Note: These data are from a multi-year-old experiment and are not necessarily representative of current levels of performance on any absolute scale.)

[Figure 13 bar chart: character error rate (%) for the user-independent, user-specific, and user-adapted nets, for the three individual writers and for all 45 writers on the user-independent net.]

Figure 13. User-Adaptation Test Results for Three Individual Writers with Three Different Nets Each Plus the Overall Results for 45 Writers Tested on a User-Independent Net Trained on All 45 Writers.

An important distinction is being made here between user-adapted and user-specific nets. User-specific nets have been trained with a relatively large corpus of data exclusively from this specific user. User-adapted nets were based on the user-independent net with some additional training using limited data from the user in question. All testing was performed with data held out from all training sets.

One obvious thing to note is the reduction in error rate ranging from a factor of two to a factor of five that both user-specific and user-adapted nets provide. An equally important thing to note is that the user-adapted net performs essentially as well as a user-specific net—in fact, slightly better for two of the three writers. Given ANNs' penchant for local minima, we were concerned that this might not be the case, but it appears that the features learned during the user-independent net training served the user-adapted net well. We believe that a small amount of training data from an individual will allow us to adapt the user-independent net to the user and improve the overall accuracy for the user significantly, especially for individuals whose writing is more stylized or whose writing style is underrepresented in our user-independent training corpus. Even for writers with common or neat writing styles, there is inherently less ambiguity in a single writer's style than in a corpus of data necessarily doing its best to represent essentially all possible writing styles.

These results might be exaggerated somewhat by the limited data in the user-independent training corpus at the time these tests were performed (just 45 writers), and at least two of the three writers in question had particularly problematic writing styles. We have also made significant advances in our user-independent recognition accuracies since these tests were performed. Nonetheless, we believe these results are suggestive of the significant value of user adaptation, even in preference to a user-specific solution.

Acknowledgments

This work was done in collaboration with Bill Stafford, Apple Computer, and Les Vogel, Angel Island Technologies. We would also like to acknowledge the contributions of Michael Kaplan, Rus Maxham, Kara Hayes, Gene Ciccarelli, and Stuart Crawford. We also received support and assistance from the NEWTON Recognition, Software Testing, and Hardware Support groups. We are also indebted to our many colleagues in the connectionist community for their advice, help, and encouragement over the years as well as our many colleagues at Apple who pitched in to help throughout the life of this project.

Some of the techniques described in this article are the subject of pending U.S. and foreign patent applications.

References

Asanović, K., and Morgan, N. 1991. Experimental Determination of Precision Requirements for Back-Propagation Training of Artificial Neural Networks, TR-91-036, International Computer Science Institute, Berkeley, California.

Bengio, Y.; LeCun, Y.; Nohl, C.; and Burges, C. 1995. LEREC: A NN-HMM Hybrid for Online Handwriting Recognition. Neural Computation 7:1289–1303.

Bourlard, H., and Wellekens, C. J. 1990. Links between Markov Models and Multilayer Perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:1167–1178.

Chang, E. I., and Lippmann, R. P. 1995. Using Voice Transformations to Create Additional Training Talkers for Word Spotting. In Advances in Neural Information Processing Systems 7, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, 875–882. Cambridge, Mass.: MIT Press.

Gish, H. 1990. A Probabilistic Approach to Understanding and Training of Neural Network Classifiers. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1361–1364. Washington, D.C.: IEEE Computer Society.

Guyon, I.; Henderson, D.; Albrecht, P.; LeCun, Y.; and Denker, P. 1992. Writer-Independent and Writer-Adaptive Neural Network for Online Character Recognition. In From Pixels to Features III, ed. S. Impedovo, 493–506. New York: Elsevier.

Jacobs, R. A.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3:79–87.

Lippmann, R. P. 1994. Neural Networks, Bayesian a Posteriori Probabilities, and Pattern Classification. In From Statistics to Neural Networks—Theory and Pattern Recognition Applications, eds. V. Cherkassky, J. H. Friedman, and H. Wechsler, 83–104. Berlin: Springer-Verlag.

Lyon, R. F., and Yaeger, L. S. 1996. Online Hand-Printing Recognition with Neural Networks. Paper presented at the Fifth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, 12–14 February, Lausanne, Switzerland.

Morgan, N., and Bourlard, H. 1995. Continuous Speech Recognition—An Introduction to the Hybrid HMM-Connectionist Approach. IEEE Signal Processing 13(3): 24–42.

Parizeau, M., and Plamondon, R. 1993. Allograph Adjacency Constraints for Cursive Script Recognition. Paper presented at the Third International Workshop on Frontiers in Handwriting Recognition, 25–27 May, Buffalo, New York.

Renals, S., and Morgan, N. 1992. Connectionist Probability Estimation in HMM Speech Recognition, TR-92-081, International Computer Science Institute, Berkeley, California.

Renals, S.; Morgan, N.; Cohen, M.; and Franco, H. 1992. Connectionist Probability Estimation in the DECIPHER Speech-Recognition System. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, I-601–I-604. Washington, D.C.: IEEE Computer Society.

Richard, M. D., and Lippmann, R. P. 1991. Neural Network Classifiers Estimate Bayesian a Posteriori Probabilities. Neural Computation 3:461–483.

Simard, P.; Victorri, B.; LeCun, Y.; and Denker, J. 1992. Tangent Prop—A Formalism for Specifying Selected Invariances in an Adaptive Network. In Advances in Neural Information Processing Systems 4, eds. J. E. Moody, S. J. Hanson, and R. P. Lippmann, 895–903. San Francisco, Calif.: Morgan Kaufmann.

Simard, P.; LeCun, Y.; and Denker, J. 1993. Efficient Pattern Recognition Using a New Transformation Distance. In Advances in Neural Information Processing Systems 5, eds. S. J. Hanson, J. D. Cowan, and C. L. Giles, 50–58. San Francisco, Calif.: Morgan Kaufmann.

Tappert, C. C.; Suen, C. Y.; and Wakahara, T. 1990. The State of the Art in Online Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:787–808.

Larry Yaeger (//pobox.com/~larryy) studied aerospace engineering with a focus on computers. He was director of software development at Digital Productions and generated the computer graphic special effects for The Last Starfighter, 2010, and Labyrinth as well as a number of television commercials. At Apple Computer, as part of Alan Kay's Vivarium Program, he designed and programmed a computer voice for Koko the gorilla, helped introduce MACINTOSHes into routine production on Star Trek: The Next Generation, and created a widely respected artificial-life computational ecology (POLYWORLD). Most recently, in Apple's Advanced Technology Group, he was technical lead in the development of the neural network handprint-recognition system in second-generation NEWTONs.

Brandyn Webb (www.brainstorm.com/~brandyn) is an independent consultant specializing in neural networks, computer graphics, and simulation. He received a degree in computer science from the University of California at San Diego at the age of 18. He was responsible for the San Diego Supercomputer Center's SYNU renderer as well as numerous scientific visualization tools and SIGGRAPH demonstrations for Ardent Computer. He created Pertek's ADME package to create realistic animated models of facilities operations involving human decision and interaction. He designed hardware and software algorithms for Neural Semiconductor and cofounded Apple's handprint-recognition project.

Richard F. Lyon (www.pcmp.caltech.edu/~dick) is known for his research in very large system integration and machine perception at Xerox PARC, Schlumberger Palo Alto Research, and Apple Computer. His achievements include the development of a computational model of the cochlea and auditory brain stem and the application of the neuromorphic system approach to problems in handwriting recognition, optical mouse tracking, and machine hearing. He recently directed the team at Apple that developed the improved NEWTON handwriting-recognition technology. Currently, Lyon continues his research affiliation with the Computation and Neural Systems Program at the California Institute of Technology and serves as a technology-development consultant to several companies in the information industry.
