AI Magazine Volume 19, Number 1 (1998) (© AAAI)

Combining Neural Networks and Context-Driven Search for Online, Printed Handwriting Recognition in the NEWTON

Larry S. Yaeger, Brandyn J. Webb, and Richard F. Lyon

■ While online handwriting recognition is an area of long-standing and ongoing research, the recent emergence of portable, pen-based computers has focused urgent attention on usable, practical solutions. We discuss a combination and improvement of classical methods to produce robust recognition of hand-printed English text for a recognizer shipping in new models of Apple Computer's NEWTON MESSAGEPAD and EMATE. Combining an artificial neural network (ANN) as a character classifier with a context-driven search over segmentation and word-recognition hypotheses provides an effective recognition system. Long-standing issues relative to training, generalization, segmentation, models of context, probabilistic formalisms, and so on, need to be resolved, however, to achieve excellent performance. We present a number of recent innovations in the application of ANNs as character classifiers for word recognition, including integrated multiple representations, normalized output error, negative training, stroke warping, frequency balancing, error emphasis, and quantized weights. User adaptation and extension to cursive recognition pose continuing challenges.

Pen-based hand-held computers are heavily dependent on fast and accurate handwriting recognition because the pen serves as the primary means for inputting data to such devices. Some earlier attempts at handwriting recognition have utilized strong, limited language models to maximize accuracy, but they proved unacceptable in real-world applications, generating disturbing and seemingly random word substitutions—known colloquially within Apple and NEWTON as "The Doonesbury Effect" because of Garry Trudeau's satirical look at first-generation recognition performance. However, the original handwriting-recognition technology in the NEWTON, and the current, much-improved cursive-recognizer technology, both of which were licensed from ParaGraph International, Inc., are not the subject of this article.

In Apple's Advanced Technology Group (aka Apple Research Labs), we pursued a different approach, using bottom-up classification techniques based on trainable artificial neural networks (ANNs) in combination with comprehensive but weakly applied language models. To focus our work on a subproblem that was tractable enough to lead to usable products in a reasonable time, we initially restricted the domain to hand printing so that strokes were clearly delineated by pen lifts. By simultaneously providing accurate character-level recognition, dictionaries exhibiting wide coverage of the language, and the ability to write entirely outside these dictionaries, we have produced a hand-print recognizer that some have called the "first usable" handwriting-recognition system.

Copyright © 1998, American Association for Artificial Intelligence. All rights reserved.

[Figure 1 block diagram: (x,y) points and pen lifts feed Tentative Segmentation, which passes segmentation hypotheses to the Neural Network Character Classifier, which passes character-class hypotheses to the Search with Context stage, which outputs words.]

Figure 1. A Simplified Block Diagram of Our Hand-Print Recognizer.

The ANN character classifier required some innovative training techniques to perform its task well. The dictionaries required large word lists, a regular expression grammar (to describe special constructs such as date, time, and telephone numbers), and a means of combining all these dictionaries into a comprehensive language model. In addition, well-balanced prior probabilities had to be determined for in-dictionary and out-of-dictionary writing. Together with a maximum-likelihood search engine, these elements form the basis of the so-called "Print Recognizer," which was first shipped in NEWTON OS 2.0–based MESSAGEPAD 120 units in December 1995 and has shipped in all subsequent NEWTON devices. In the most recent units, since the MESSAGEPAD 2000, despite retaining its label as a print recognizer, it has been extended to handle connected characters (as well as a full Western European character set).

There is ample prior work in combining low-level classifiers with dynamic time warping, hidden Markov models, Viterbi algorithms, and other search strategies to provide integrated segmentation and recognition for writing (Tappert, Suen, and Wakahara 1990) and speech (Renals et al. 1992). In addition, there is a rich background in the use of ANNs as classifiers, including their use as low-level character classifiers in a higher-level word-recognition system (Bengio et al. 1995). However, these approaches leave a large number of open-ended questions about how to achieve acceptable (to a real user) levels of performance. In this article, we survey some of our experiences in exploring refinements and improvements to these techniques.

System Overview

Apple's print recognizer (APR) consists of three conceptual stages—(1) tentative segmentation, (2) classification, and (3) context-driven search—as indicated in figure 1. The primary data on which we operate are simple sequences of (x,y) coordinate pairs plus pen-up–pen-down information, thus defining stroke primitives. The segmentation stage decides which strokes will be combined to produce segments—the tentative groupings of strokes that will be treated as possible characters—and produces a sequence of these segments together with legal transitions between them. This process builds an implicit graph that is then labeled in the classification stage and examined for a maximum-likelihood interpretation in the search stage. The classification stage evaluates each segment using the ANN classifier and produces a vector of output activations that are used as letter-class probabilities. The search stage then uses these class probabilities, together with models of lexical and geometric context, to find the N most likely word or sentence hypotheses.

Tentative Segmentation

Character segmentation—the process of deciding which strokes make up which characters—is inherently ambiguous. Ultimately, this decision must be made, but short of writing in boxes, it is impossible to do so (with any accuracy) in advance, external to the recognition process. Hence, the initial segmentation stage in APR produces multiple, tentative groupings of strokes and defers the final segmentation decisions until the search stage, thus integrating these segmentation decisions with the overall recognition process.

APR uses a potentially exhaustive, sequential enumeration of stroke combinations to generate a sequence of viable character-segmentation hypotheses. These segments are subjected to some obvious constraints (such as "all strokes must be used" and "no strokes can be used twice") and some less obvious filters (to cull impossible segments for the sake of efficiency). The resulting algorithm produces the actual segments that will be processed as possible characters, along with the legal transitions between these segments.

The legal transitions are defined by forward and reverse delays. The forward delay indicates the next possible segment in the sequence. The reverse delay indicates the start of the current batch of segments, all of which share the same leading stroke. Because of the enumeration scheme, a segment's reverse delay is the same as its stroke count minus one, unless preceding segments (sharing the same leading stroke) were eliminated by the filters mentioned previously. These two simple delay parameters (for each segment) suffice to define an implicit graph of all legal segment transitions. For a transition from segment number i to segment number j to be legal, the sum of segment i's forward delay plus segment j's reverse delay must be equal to j – i. Figure 2 provides an example of some ambiguous ink and the segments that might be generated from its strokes, supporting interpretations of dog, clog, cbg, or even %g.

Segment Number   Stroke Count   Forward Delay   Reverse Delay
1                1              3               0
2                2              4               1
3                3              4               2
4                1              2               0
5                2              2               1
6                1              1               0
7                1              0               0

Figure 2. Segmentation of Strokes into Tentative Characters, or Segments. (The figure also shows the ink for each segment, which is not reproduced here.)
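The legal-transition rule lends itself to a few lines of code. The sketch below is purely illustrative (it is not APR's implementation); the delay values are taken from figure 2.

```python
def legal_transition(i, j, forward, reverse):
    """A transition from segment i to segment j is legal when
    forward[i] + reverse[j] == j - i."""
    return forward[i] + reverse[j] == j - i

# Forward and reverse delays for the seven segments of figure 2.
forward = {1: 3, 2: 4, 3: 4, 4: 2, 5: 2, 6: 1, 7: 0}
reverse = {1: 0, 2: 1, 3: 2, 4: 0, 5: 1, 6: 0, 7: 0}

# Enumerate the implicit graph of legal segment transitions.
edges = [(i, j) for i in forward for j in forward
         if j > i and legal_transition(i, j, forward, reverse)]
print(edges)  # (1, 4) is legal, for example, because 3 + 0 == 4 - 1
```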

Character Classification

The output of the segmentation stage is a stream of segments that are then passed to an ANN for classification as characters. Except for the architecture and training specifics detailed later, a fairly standard multilayer perceptron trained with error backpropagation provides the ANN character classifier at the heart of APR. A large body of prior work exists to indicate the general applicability of ANN technology as a classifier providing good estimates of a posteriori probabilities of each class given the input (Renals and Morgan 1992; Richard and Lippmann 1991; Gish 1990; and others cited herein). Compelling arguments have been made for why ANNs providing posterior probabilities in a probabilistic recognition formulation should be expected to outperform other recognition approaches (Lippmann 1994), and ANNs have performed well as the core of speech-recognition systems (Morgan and Bourlard 1995).

Representation

A recurring theme in ANN research is the extreme importance of the representation of the data that are given as input to the network. We experimented with a variety of input representations, including stroke features both antialiased (gray scale) and not (binary) and images both antialiased and not, and with various schemes for positioning and scaling the ink within the image input window. In every case, antialiasing was a significant win. This result is consistent with others' findings: ANNs perform better when presented with smoothly varying, distributed input than they do when presented with binary, localized input. Almost the simplest image representation possible, a non–aspect-ratio-preserving, expand-to-fill-the-window image (limited only by a maximum scale factor to keep from blowing dots up to the full window size), together with either a single unit or a thermometer code (some number of units turned on in sequence to represent larger values) for the aspect ratio, proved to be the most effective single-classifier solution. However, the best overall classifier accuracy was ultimately obtained by combining multiple distinct representations into nearly independent, parallel classifiers joined at a final output layer. Hence, representation proved not only to be as important as architecture, but ultimately, it helped to define the architecture of our nets. For our final, hand-optimized system, we use four distinct inputs, as indicated in table 1. The stroke-count representation was dithered (changed randomly at a small probability) to expand the effective training set, prevent the network from fixating on this simple input, and thereby improve the network's ability to generalize. A schematic of the various input representations can be seen as part of the architecture drawing in figure 3.

Input Feature    Resolution   Description
Image            14 × 14      Antialiased, scale to window, scale limited
Stroke Feature   20 × 9       Antialiased, limited-resolution tangent slope, resampled to fixed number of points
Aspect Ratio     1 × 1        Normalized and capped to [0,1]
Stroke Count     5 × 1        Dithered thermometer code

Table 1. Input Representations Used in the Apple Print Recognizer.
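To make the table concrete, the following sketch builds three of the four inputs from raw (x,y) stroke data. It is an illustrative approximation rather than Apple's code: the bilinear splatting, the aspect-ratio normalization, and the dither probability are assumptions, and the 20 × 9 tangent-slope stroke feature is omitted for brevity.

```python
import random

def antialiased_image(strokes, size=14):
    """Rasterize ink into a size x size grid, stretching to fill the window
    (aspect ratio not preserved) and splatting each point bilinearly so the
    result is smoothly varying rather than binary."""
    pts = [p for s in strokes for p in s]
    x0, y0 = min(x for x, _ in pts), min(y for _, y in pts)
    w = max(max(x for x, _ in pts) - x0, 1e-6)
    h = max(max(y for _, y in pts) - y0, 1e-6)
    img = [[0.0] * size for _ in range(size)]
    for x, y in pts:
        gx, gy = (x - x0) / w * (size - 1), (y - y0) / h * (size - 1)
        ix, iy = int(gx), int(gy)
        fx, fy = gx - ix, gy - iy
        for dx, wx in ((0, 1 - fx), (1, fx)):
            for dy, wy in ((0, 1 - fy), (1, fy)):
                if ix + dx < size and iy + dy < size:
                    img[iy + dy][ix + dx] = min(1.0, img[iy + dy][ix + dx] + wx * wy)
    return img

def aspect_ratio(strokes):
    """Aspect ratio normalized into [0, 1] (one plausible convention)."""
    pts = [p for s in strokes for p in s]
    w = max(x for x, _ in pts) - min(x for x, _ in pts)
    h = max(y for _, y in pts) - min(y for _, y in pts)
    return min(h / max(h + w, 1e-6), 1.0)

def stroke_count_code(n_strokes, units=5, dither_prob=0.1):
    """Thermometer code for the stroke count, occasionally dithered by one
    stroke so the net cannot fixate on this simple input."""
    if random.random() < dither_prob:
        n_strokes = max(1, n_strokes + random.choice((-1, 1)))
    return [1.0 if i < min(n_strokes, units) else 0.0 for i in range(units)]
```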

[Figure 3 shows the network schematically. The four inputs are the 14 × 14 image, the 20 × 9 stroke feature, the 5 × 1 stroke count, and the 1 × 1 aspect ratio. The stroke feature feeds a 72 × 1 first hidden layer; the image feeds first-hidden-layer grids of sizes 7 × 7, 5 × 5, 7 × 2, 2 × 7, 9 × 1, and 1 × 9, with receptive fields annotated as (8x8), (10x10), (8x7;1,7), (7x8;7,1), (8x6;1,8), (6x8;8,1), (14x6), and (6x14). These feed fully connected second hidden layers of 104 and 112 units, which share a 95 × 1 output layer covering a…z, A…Z, 0…9, the printable punctuation ! through ~, and £.]

Figure 3. Final English-Language Net Architecture.

Architecture

As with representations, we experimented with a variety of architectures, including simple fully connected layers; receptive fields; shared weights; multiple hidden layers; and, ultimately, multiple nearly independent classifiers tied to a common output layer. The final choice of architecture includes multiple input representations; a first hidden layer (separate for each input representation) using receptive fields; fully connected second hidden layers (again distinct for each representation); and a final, shared, fully connected output layer. Simple scalar features—aspect ratio and stroke count—connect to both second hidden layers. The final network architecture, for our original English-language system, is shown in figure 3.

Layers are fully connected except for the input to the first hidden layer on the image side. This first hidden layer on the image side consists of eight separate grids, each of which accepts input from the image input grid with its own receptive field sizes and strides, shown parenthetically in figure 3 as (x-size × y-size; x-stride, y-stride). A stride is the number of units (pixels) in the input image space between sequential positionings of the receptive fields in a given direction. The 7 × 2 and 2 × 7 side panels (surrounding the central 7 × 7 grid) pay special attention to the edges of the image. The 9 × 1 and 1 × 9 side panels specifically examine full-size vertical and horizontal features, respectively. The 5 × 5 grid observes features at a different spatial scale than the 7 × 7 grid.
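The receptive-field wiring can be sketched as follows. This is an illustrative reconstruction, not the shipping code: for a grid specified as (x-size × y-size; x-stride, y-stride), each hidden unit takes input from one positioning of the receptive field over the 14 × 14 image.

```python
def receptive_field_indices(field_w, field_h, stride_x, stride_y, input_size=14):
    """Return, for each positioning of a field_w x field_h receptive field
    stepped by (stride_x, stride_y) over an input_size x input_size image,
    the flat input indices feeding the corresponding hidden unit."""
    fields = []
    y = 0
    while y + field_h <= input_size:
        x = 0
        while x + field_w <= input_size:
            fields.append([(y + dy) * input_size + (x + dx)
                           for dy in range(field_h) for dx in range(field_w)])
            x += stride_x
        y += stride_y
    return fields

# Consistent with figure 3: an (8x8) field with stride 1 tiles into the
# central 7 x 7 grid, and an (8x7; 1,7) field tiles into a 7 x 2 side panel.
print(len(receptive_field_indices(8, 8, 1, 1)))   # -> 49
print(len(receptive_field_indices(8, 7, 1, 7)))   # -> 14
```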


Combining the two classifiers at the output layer, rather than, say, averaging the output of completely independent classifiers, allows generic backpropagation to learn the best way to combine them, which is both convenient and powerful. However, our integrated multiple-representation architecture is conceptually related to, and motivated by, prior experiments at combining nets such as Steve Nowlan's "mixture of experts" (Jacobs et al. 1991).

Normalizing Output Error

Analyzing a class of errors involving words that were misrecognized as a result of, perhaps, a single misclassified character, we realized that the net was doing a poor job of representing second- and third-choice probabilities. Essentially, the net was being forced to attempt unambiguous classification of intrinsically ambiguous patterns due to the nature of the mean-squared error minimization in backpropagation, coupled with the typical training vector, which consists of all 0s except for the single 1 of the target. Lacking any viable means of encoding legitimate probabilistic ambiguity into the training vectors, we decided to try normalizing the "pressure toward 0" versus the "pressure toward 1" introduced by the output error during training. We refer to this technique as NormOutErr because of its normalizing effect on target versus nontarget output error.

We reduce the backpropagation error for nontarget classes relative to the target class by a factor that normalizes the total nontarget error seen at a given output unit relative to the total target error seen at the unit. If the training set consisted of an equal representation of classes, then this normalization should be based on the number of nontarget versus target classes in a typical training vector or, simply, the number of output units (minus one). Hence, for nontarget output units, we scale the error at each unit by a constant

e' = Ae,

where e is the error at an output unit, and A is defined as

A = 1 / [d(Noutput – 1)],

where Noutput is the number of output units, and d is our tuning parameter, typically ranging from 0.1 to 0.2. Error at the target output unit is unchanged. Overall, this error modulation raises the activation values at the output units because of the reduced pressure toward zero, particularly for low-probability samples. Thus, the learning algorithm no longer converges to a least mean-squared error (LMSE) estimate of P(class | input) but to an LMSE estimate of a nonlinear function f(P(class | input), A) depending on the factor A by which we reduced the error pressure toward zero.

Using a simple version of the technique of Bourlard and Wellekens (1990), we worked out the resulting nonlinear function. The net will attempt to converge to minimize the modified quadratic error function

E = p(1 – y)² + A(1 – p)y²

by setting its output y for a particular class to

y = p / (A – Ap + p),

where p = P(class | input), and A is as previously defined. For small values of p, the activation y is increased by a factor of nearly 1/A relative to the conventional case of y = p, and for high values of p, the activation is closer to 1 by nearly a factor of A. The inverse function, useful for converting back to a probability, is

p = yA / (yA + 1 – y).

We verified the fit of this function by looking at histograms of character-level empirical percentage correct versus y, as in figure 4. Even for this moderate amount of output error normalization, it is clear that the lower-probability samples have their output activations raised significantly, relative to the 45° line that A = 1 yields.

[Figure 4 plots empirical p = P(correct) against net output y, with both axes running from 0 to 1.]

Figure 4. Empirical p Versus y Histogram for a Net Trained with A = 0.11 (d = 0.1), with the Corresponding Theoretical Curve.

The primary benefit derived from this technique is that the net does a much better job of representing second- and third-choice probabilities, and low probabilities in general. Despite a small drop in top-choice character accuracy when using NormOutErr, we obtain a significant increase in word accuracy with this technique. Figure 5 shows an exaggerated example of this effect for an atypically large value of d (0.8), which overly penalizes character accuracy; however, the 30-percent decrease in word error rate is normal for this technique. (Note: These data are from a multi-year-old experiment and are not necessarily representative of current levels of performance on any absolute scale.)
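The error scaling and the activation-probability mapping above reduce to a few lines. This sketch is illustrative only and assumes the 95-unit output layer of the English-language net.

```python
N_OUTPUT = 95                      # output units in the English-language net
d = 0.1                            # NormOutErr tuning parameter
A = 1.0 / (d * (N_OUTPUT - 1))     # ~0.11 for d = 0.1

def scaled_output_error(target, y):
    """Backpropagated output error under NormOutErr: nontarget error is scaled
    by A; error at the target unit is unchanged."""
    e = target - y
    return e if target == 1.0 else A * e

def prob_to_activation(p):
    """Activation the net converges to for a class with true posterior p."""
    return p / (A - A * p + p)

def activation_to_prob(y):
    """Inverse mapping, used to convert trained outputs back to probabilities."""
    return y * A / (y * A + 1.0 - y)

print(prob_to_activation(0.1))     # ~0.51: low-probability classes are raised
```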
Negative Training

The previously discussed inherent ambiguities in character segmentation necessarily result in the generation and testing of a large number of invalid segments. During recognition, the network must classify these invalid segments just as it would any valid segment, with no knowledge of which are valid or invalid. A significant increase in word-level recognition accuracy was obtained by performing negative training with these invalid segments. Negative training consists of presenting invalid segments to the net during training with all-zero target vectors.

We retain control over the degree of negative training in two ways: First is a negative-training factor (ranging from 0.2 to 0.5) that modulates the learning rate (equivalently by modulating the error at the output layer) for these negative patterns. Lowering the learning rate for negative patterns reduces the impact of negative training on positive training, thus modulating the impact on whole characters that specifically look like pieces of multistroke characters (for example, I, 1, l, o, O, 0). Second, we control a negative-training probability (ranging between 0.05 and 0.3), which determines the probability that a particular negative sample will actually be trained on (for a given presentation). Probabilistically skipping negative patterns both reduces the overall impact of negative training and significantly reduces training time, because invalid segments are more numerous than valid segments. As with NormOutErr, this modification hurts character-level accuracy a little bit but helps word-level accuracy a lot.
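A sketch of how negative patterns might enter the training loop follows; the `net.train` interface and the default parameter values are hypothetical, chosen from the ranges quoted above.

```python
import random

def present_segment(net, pattern, is_valid, target, neg_factor=0.3, neg_prob=0.18):
    """Train on one segment. Invalid (negative) segments get an all-zero target,
    a reduced learning-rate factor, and are probabilistically skipped."""
    if is_valid:
        net.train(pattern, target, lr_scale=1.0)
    elif random.random() < neg_prob:
        net.train(pattern, [0.0] * net.n_outputs, lr_scale=neg_factor)
    # otherwise this negative sample is skipped for this presentation
```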

SPRING 1998 77 Articles

Stroke Warping

During training (but not during recognition), we produce random variations in stroke data, consisting of small changes in skew, rotation, and x and y linear and quadratic scalings. This stroke warping produces alternate character forms that are consistent with stylistic variations within and between writers and induces an explicit aspect ratio and rotation invariance within the framework of standard backpropagation. The amounts of each distortion to apply were chosen through cross-validation experiments as just the amounts needed to yield optimum generalization. (Cross-validation is a standard technique for early stopping of ANN training to prevent overlearning of the training set. Such overlearning of extraneous detail in the training samples reduces the ANN's ability to generalize and, thus, reduces its accuracy on new data. The technique consists of partitioning the available data into a training set and a cross-validation set. Then, during the learning process, the training set is used in the usual fashion to train the network while accuracy on the cross-validation set is simply measured periodically. Training is terminated when accuracy on the cross-validation set is maximized, despite the fact that further training might continue to improve the accuracy measured on the training set. Cross-validation is a very effective and widely accepted means of determining the optimal amount of learning for a given ANN and body of data.)

We chose relative amounts of the various transformations by testing for optimal final, converged accuracy on a cross-validation set. We then increased the amount of all stroke warping being applied to the training set, just to the point at which accuracy on the training set ceased to diverge from accuracy on the cross-validation set. We also examined a number of such samples by eye to verify that they represent a natural range of variation. A small set of such variations is shown in figure 6.

Our stroke warping scheme is somewhat related to the ideas of TANGENT DIST and TANGENT PROP (Simard, LeCun, and Denker 1993; Simard et al. 1992) in terms of the use of predetermined families of transformations, but we believe it is much easier to implement. It is also somewhat distinct in applying transformations on the original coordinate data as opposed to using distortions of images. The voice-transformation scheme of Chang and Lippmann (1995) is also related, but they use a static replication of the training set through a small number of transformations rather than dynamic random transformations of an essentially infinite variety.
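A rough sketch of this kind of coordinate-space warping is shown below; the distortion ranges are illustrative assumptions rather than the cross-validated values, and the quadratic scalings are omitted.

```python
import math
import random

def warp_strokes(strokes, max_rot=0.1, max_skew=0.1, max_scale=0.1):
    """Apply a small random rotation, skew, and independent x/y scaling to
    (x, y) coordinate data, producing a plausible stylistic variant of the
    same character."""
    theta = random.uniform(-max_rot, max_rot)            # radians
    skew = random.uniform(-max_skew, max_skew)
    sx = 1.0 + random.uniform(-max_scale, max_scale)
    sy = 1.0 + random.uniform(-max_scale, max_scale)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    warped = []
    for stroke in strokes:
        out = []
        for x, y in stroke:
            x, y = x + skew * y, y                                  # skew
            x, y = cos_t * x - sin_t * y, sin_t * x + cos_t * y     # rotate
            out.append((sx * x, sy * y))                            # scale
        warped.append(out)
    return warped
```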


[Figure 5 bar chart values: with NormOutErr disabled (d = 0.0), character error is 9.5 percent and word error is 31.6 percent; with d = 0.8, character error rises to 12.4 percent while word error falls to 22.7 percent.]

Figure 5. Character and Word Error Rates for Two Different Values of NormOutErr (d). A value of 0.0 disables NormOutErr, yielding normal backpropagation. The unusually high value of 0.8 (A = 0.014) produces nearly equal pressures toward 0 and 1.

Figure 6. A Few Random Stroke Warpings of the Same Original m Data.

Frequency Balancing

Training data from natural English words and phrases exhibit nonuniform priors for the various character classes, and ANNs readily model these priors. However, as with NormOutErr, we find that reducing the effect of these priors on the net in a controlled way and, thus, forcing the net to allocate more of its resources to low-frequency, low-probability classes is of significant benefit to the overall word-recognition process. To this end, we explicitly (partially) balance the frequencies of the classes during training. We obtain this frequency balancing by probabilistically skipping and repeating patterns based on a precomputed repetition factor. Each presentation of a repeated pattern is warped uniquely, as discussed previously.

To compute the repetition factor for a class i, we first compute a normalized frequency of this class:

Fi = Si / S̄,

where Si is the number of samples in class i, and S̄ is the average number of samples over all classes, computed in the obvious way:

S̄ = (1/C) Σ(i=1..C) Si,

with C the number of classes. Our repetition factor is then defined as

Ri = (a / Fi)^b,

with a and b as adjustable controls over the amount of skipping versus repeating and the degree of prior normalization, respectively. Typical values of a range from 0.2 to 0.8, and b ranges from 0.5 to 0.9. The factor a < 1 lets us do more skipping than repeating. For example, for a = 0.5, classes with relative frequency equal to half the average will neither skip nor repeat; more frequent classes will skip; and less frequent classes will repeat. A value of 0.0 for b would do no balancing, giving Ri = 1.0 for all classes, and a value of 1.0 would provide full normalization. A value of b somewhat less than one seems to be the best choice, letting the net keep some bias in favor of classes with higher prior probabilities.

This explicit prior-bias reduction is conceptually related to Lippmann's (1994) and Morgan and Bourlard's (1995) recommended method for converting from the net's estimate of posterior probability, p(class | input), to the value needed in an HMM or Viterbi search, p(input | class), which is to divide by p(class) priors. This division, however, may produce noisier estimates for low-frequency classes, resulting in a set of estimates that are not really optimized in an LMSE sense (as the net outputs are). In addition, output activations that are naturally bounded between 0 and 1 because of the sigmoid convert to potentially large probability estimates, requiring a renormalization step. Our method of frequency balancing during training eliminates both these concerns. Perhaps more significantly, frequency balancing also allows the standard backpropagation training process to dedicate more network resources to the classification of the lower-frequency classes, although we have no current method for characterizing or quantifying this benefit.
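A minimal sketch of the repetition-factor computation and the skip-or-repeat decision it drives; turning a fractional Ri into concrete presentations is one plausible reading of the scheme, not necessarily the exact mechanism used.

```python
import random

def repetition_factors(samples_per_class, a=0.5, b=0.8):
    """Ri = (a / Fi)^b, where Fi = Si / S-bar is the class frequency normalized
    by the average class frequency."""
    s_bar = sum(samples_per_class.values()) / len(samples_per_class)
    return {c: (a / (s / s_bar)) ** b for c, s in samples_per_class.items()}

def presentations_this_epoch(class_label, r_factors):
    """R = 2.3 yields 2 presentations plus a 30% chance of a third; R = 0.4
    yields a 40% chance of a single presentation. Each repeated presentation
    would be warped uniquely, as described earlier."""
    r = r_factors[class_label]
    n = int(r)
    return n + (1 if random.random() < r - n else 0)

# Example: a frequent class is mostly skipped, a rare class is repeated.
r = repetition_factors({'e': 12000, 'a': 8000, 'q': 300})
```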


Phase   Epochs (Approximate)   Learning Rate   Correct-Training Probability   Negative-Training Probability
1       25                     1.0–0.5         0.1                            0.05
2       25                     0.5–0.1         0.25                           0.1
3       50                     0.1–0.01        0.5                            0.18
4       30                     0.01–0.001      1.0                            0.3

Table 2. Typical Multiphase Schedule of Learning Rates and Other Parameters for Training a Character-Classifier Net.

Error Emphasis

Although frequency balancing corrects for underrepresented classes, it cannot account for underrepresented writing styles. We use a conceptually related probabilistic skipping of patterns, but of just those patterns that the net correctly classifies in its forward-recognition pass, as a form of error emphasis, to address this problem. We define a correct-train probability (ranging from 0.1 to 1.0), which is used as a biased coin to determine whether a particular pattern, having been correctly classified, will also be used for the backward-training pass. This applies only to correctly segmented, or positive, patterns; misclassified patterns are never skipped.

Especially during the early stages of training, we set this parameter fairly low (around 0.1), thus concentrating most of the training time and the net's learning capability on patterns that are more difficult to correctly classify. This error emphasis is the only way we were able to get the net to learn to correctly classify unusual character variants, such as a three-stroke 5 as written by only one training writer.

Many variants of this scheme are possible, in which misclassified patterns would be repeated, or different learning rates would apply to correctly and incorrectly classified patterns. It is also related to techniques that use a training subset, from which easily classified patterns are replaced by randomly selected patterns from the full training set (Guyon et al. 1992).

Annealing

Although some discussions of backpropagation espouse explicit formulas for modulating the learning rate over time, many seem to assume the use of a single, fixed learning rate. We view the stochastic backpropagation process as a kind of simulated annealing, with a learning rate starting very high and decreasing only slowly to a very low value. However, rather than using any prespecified formula to decelerate learning, the rate at which the learning rate decreases is determined by the dynamics of the learning process itself. We typically start with a rate near 1.0 and reduce the rate by a multiplicative decay factor of 0.9 until it gets down to about 0.001. The rate decay factor is applied following any epoch in which the total squared error on the training set increases relative to the previous epoch. This total squared error is summed over all output units and over all patterns in one full epoch and normalized by these counts. Thus, even though we are using online or stochastic training, we have a measure of performance over whole epochs that can be used to guide the annealing of the learning rate. Repeated tests indicate that this approach yields better results than low (or even moderate) initial learning rates, which we speculate to be related to a better ability to escape local minima.

In addition, we find that we obtain best overall results when we also allow some of our training parameters to change over the course of a training run. In particular, the correct-train probability needs to start out very low to give the net a chance to learn unusual character styles, but it should finish near 1.0 in order to avoid introducing a general posterior probability bias in favor of classes with lots of ambiguous examples. We typically train a net in four phases, according to parameters such as those in table 2.
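A compact sketch of this error-driven annealing, with the four-phase schedule of table 2 alongside it; `train_one_epoch` is a hypothetical function that runs one epoch of online backpropagation at the given learning rate and returns the normalized total squared error.

```python
def anneal(train_one_epoch, lr=1.0, lr_min=0.001, decay=0.9, max_epochs=200):
    """Apply the multiplicative decay factor after any epoch whose normalized
    total squared error is worse than the previous epoch's."""
    prev_err = float("inf")
    for _ in range(max_epochs):
        err = train_one_epoch(lr)
        if err > prev_err:
            lr *= decay
        prev_err = err
        if lr <= lr_min:
            break
    return lr

# Four-phase schedule from table 2 (the learning rate anneals within each range).
PHASES = [
    dict(epochs=25, lr=(1.0, 0.5),    correct_train_p=0.1,  neg_train_p=0.05),
    dict(epochs=25, lr=(0.5, 0.1),    correct_train_p=0.25, neg_train_p=0.1),
    dict(epochs=50, lr=(0.1, 0.01),   correct_train_p=0.5,  neg_train_p=0.18),
    dict(epochs=30, lr=(0.01, 0.001), correct_train_p=1.0,  neg_train_p=0.3),
]
```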

Quantized Weights

The work of Asanović and Morgan (1991) shows that 2-byte (16-bit) weights are about the smallest that can be tolerated in training large ANNs using backpropagation. However, memory is expensive in small devices, and reduced instruction set computer processors, such as the ARM-610 in the first devices in which this technology was deployed, are much more efficient doing one-byte loads and multiplies than two-byte loads and multiplies; so, we were motivated to make one-byte weights work.

Running the net for recognition demands significantly less precision than training the net. It turns out that one-byte weights provide adequate precision for recognition if the weights are trained appropriately. In particular, a dynamic range should be fixed, weights limited to the legal range during training, and then rounded to the requisite precision after training. For example, we find that a range of weight values from (almost) –8 to +8 in steps of 1/16 does a good job. Figure 7 shows a typical resulting distribution of weight values. If the weight limit is enforced during high-precision training, the resources of the net will be adapted to make up for the limit. Because bias weights are few in number, however, and very important, we allow them to use two bytes with essentially unlimited range. Performing our forward-recognition pass with low-precision, 1-byte weights (a ±3.4 fixed-point representation), we find no noticeable degradation relative to floating-point, 4-byte, or 2-byte weights using this scheme.

We have also developed a scheme for training with one-byte weights. It uses a temporary augmentation of the weight values with two additional low-order bytes to achieve precision in training but runs the forward pass of the net using only the one-byte high-order part. Thus, any cumulative effect of the one-byte rounded weights in the forward pass can be compensated through further training. Small weight changes accumulate in the low-order bytes and only occasionally carry into a change in the one-byte weights used by the net. In a personal product, this scheme could be used for adaptation to the user, after which the low-order residuals could be discarded and the temporary memory reclaimed.

[Figure 7 plots the count of weights per bin of width 1/16 against weight value w, on a log count scale, over the range –8 to +8.]

Figure 7. Distribution of Weight Values in a Net with One-Byte Weights on a Log Count Scale. Weights with magnitudes greater than four are sparse but important.
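The ±3.4 fixed-point quantization and the low-order-residual training trick can be sketched as follows (illustrative only; the byte-level layout is abstracted away).

```python
STEP = 1.0 / 16.0            # 4 fractional bits
W_MAX = 8.0 - STEP           # 3 integer bits plus sign: (almost) -8 to +8

def quantize(w):
    """Clamp to the legal range and round to the nearest 1/16."""
    w = max(-W_MAX, min(W_MAX, w))
    return round(w / STEP) * STEP

class QuantizedWeight:
    """One-byte weight with a temporary high-precision residual used only
    during training; the forward pass sees only the quantized value."""
    def __init__(self, w=0.0):
        self.precise = w                 # augmented, high-precision value
    @property
    def value(self):                     # what the forward pass uses
        return quantize(self.precise)
    def update(self, delta):             # small deltas accumulate in the residual
        self.precise = max(-W_MAX, min(W_MAX, self.precise + delta))
```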

Context-Driven Search

The output of the ANN classifier is a stream of probability vectors, one vector for each segmentation hypothesis, with as many potentially nonzero probability elements in each vector as there are characters (that the system is capable of recognizing). In practice, we typically only pass the top 10 (or fewer) scored character-class hypotheses for each segment to the search engine for the sake of efficiency. The search engine then looks for a minimum-cost path through this vector stream, abiding by the legal transitions between segments as defined in the tentative-segmentation step discussed previously. This minimum-cost path is the APR system's best interpretation of the ink input by the user and is returned to the system in which APR is embedded as the recognition result for whole words or sentences of the user's input.

The search is driven by a somewhat ad hoc generative language model that consists of a set of graphs that are searched in parallel. We use a simple beam search in a negative-log-probability (or penalty) space for the best N hypotheses. The beam is based on a fixed maximum number of hypotheses rather than a particular value. Each possible transition token (character) emitted by one of the graphs is scored not only by the ANN but by the full language model, a simple letter-case model, and a geometric-context model discussed later. The fully integrated search process takes place over a space of character- and word-segmentation hypotheses as well as character-class hypotheses.
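A highly simplified sketch of a beam search of this shape: penalties are negative log probabilities, the beam keeps a fixed number of hypotheses, and each character's cost combines the ANN score with a context penalty. All names here are illustrative; the real search also integrates word breaks and the parallel dictionary graphs.

```python
import heapq
import math

def beam_search(start_segments, edges, class_probs, context_penalty, beam_width=20):
    """edges: segment id -> list of legal successor ids ([] for terminal segments).
    class_probs: segment id -> list of (char, p) pairs from the ANN.
    context_penalty(prev_char, char): penalty from the language, letter-case,
    and geometric-context models."""
    beam = [(0.0, seg, "", None) for seg in start_segments]
    best = (math.inf, "")
    while beam:
        candidates = []
        for penalty, seg, text, prev in beam:
            for char, p in class_probs[seg]:
                cost = penalty - math.log(max(p, 1e-12)) + context_penalty(prev, char)
                succs = edges.get(seg, [])
                if not succs:                        # complete hypothesis
                    best = min(best, (cost, text + char))
                for nxt in succs:
                    candidates.append((cost, nxt, text + char, char))
        beam = heapq.nsmallest(beam_width, candidates)   # fixed-size beam
    return best
```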

dig = [0123456789]
digm01 = [23456789]

acodenums = (digm01 [01] dig)

acode = { ("1-"? acodenums "-"):40 , ("1"? "(" acodenums ")"):60 }

phone = (acode? digm01 dig dig "-" dig dig dig dig)

Figure 8. Sample of the Regular-Expression Grammar Used to Define a Simple Model of Telephone Numbers. Symbols are defined by the equal operator; square brackets enclose multiple, alternative characters; parentheses enclose sequences of symbols; curly braces enclose multiple, alternative symbols; an appended colon, followed by numbers, designates a prior probability of this alternative; an appended question mark means “zero or one occurrence”; and the final symbol definition represents the graph or grammar expressed by this dictionary.

BiGrammar Phone

[Phone.lang 1. 1. 1.]

Figure 9. Sample of a Simple BiGrammar Describing a Telephone-Only Context. The BiGrammar is first named (Phone) and then specified as a list of dictionaries (Phone.lang) together with the probability of starting with this dictionary, ending with this dictionary, and cycling within this dictionary (the three numeric values).

Lexical Context

Context is essential to accurate recognition, even if this context takes the form of a broad language model. Humans achieve just 90-percent accuracy on isolated characters from our database. Without any context, this character accuracy would translate to a word accuracy of not much more than 60 percent (0.9^5), assuming an average word length of 5 characters. We need to obtain word accuracies much higher than this 60 percent, despite basing our recognition on character accuracies that may never reach human levels. We achieve this accuracy by careful application of our context models.

A simple model of letter case and adjacency—penalizing case transitions except between the first and second characters, penalizing alphabetic-to-numeric transitions, and so on—together with the geometric-context models discussed later, is sufficient to raise word accuracy to around 77 percent.

The next large gain in accuracy requires a genuine language model. We provide this model by means of dictionary graphs and assemblages of these graphs combined into what we refer to as BiGrammars. BiGrammars are essentially scored lists of dictionaries together with specified legal (scored) transitions between these dictionaries. This scheme allows us to use word lists, prefix and suffix lists, and punctuation models and enable appropriate transitions between them. Some dictionary graphs are derived from a regular-expression grammar that permits us to easily model phone numbers, dates, times, and so on, as shown in figure 8.

All these dictionaries can be searched in parallel by combining them into a general-purpose BiGrammar that is suitable for most applications. It is also possible to combine subsets of these dictionaries, or special-purpose dictionaries, into special BiGrammars targeted at more limited contexts. A simple BiGrammar, which might be useful to specify context for a field that only accepts telephone numbers, is shown in figure 9. A more complex BiGrammar (although still far short of the complexity of our final general-input context) is shown in figure 10.

We refer to our language model as "weakly applied" because in parallel with all the word list–based dictionaries and regular-expression grammars, we simultaneously search both an alphabetic-character grammar (wordlike) and a completely general, any-character-anywhere grammar (symbols). These more flexible models, although given fairly low a priori probabilities, permit users to write any unusual character string they might desire. When the prior probabilities for the various dictionaries are properly balanced, the recognizer is able to benefit from the language model and deliver the desired level of accuracy for common in-dictionary words (and special constructs such as phone numbers) but can also recognize arbitrary, nondictionary character strings, especially if they are written neatly enough that the character classifier can be confident of its classifications.

We have also experimented with bigrams, trigrams, and N-grams, and we are continuing experiments with other, more data-driven language models; to date, however, our generative approach has yielded the best results.


BiGrammar FairlyGeneral
  (.8
    (.6
      [WordList.dict .5 .8 1. EndPunct.lang .2]
      [User.dict .5 .8 1. EndPunct.lang .2]
    )
    (.4
      [Phone.lang .5 .8 1. EndPunct.lang .2]
      [Date.lang .5 .8 1. EndPunct.lang .2]
    )
  )
  (.2
    [OpenPunct.lang 1. 0. .5
      (.6
        WordList.dict .5
        User.dict .5
      )
      (.4
        Phone.lang .5
        Date.lang .5
      )
    ]
  )
  [EndPunct.lang 0. .9 .5 EndPunct.lang .1]

Figure 10. Sample of a Slightly More Complex BiGrammar Describing a Fairly General Context. The BiGrammar is first named (FairlyGeneral) and then specified as a list of dictionaries (the *.dict and *.lang entries), together with the probability of starting with this dictionary, ending with this dictionary, and cycling within this dictionary (the first three numeric values following each dictionary name), as well as any dictionaries to which this dictionary can legally transition and the probability of taking this transition. The parentheses permit easy specification of multiplicative prior probabilities for all dictionaries contained within them. Note that in this simple example, it is not possible (starting probability = 0) to start a string with the EndPunct (end punctuation) dictionary, just as it is not possible to end a string with the OpenPunct dictionary.

[Figure 11 compares the bounding boxes of the letters in "if" as written by the user with those predicted by the table-driven model.]

Figure 11. The Eight Measurements That Contribute to the GeoContext Error Vector and Corresponding Score for Each Letter Pair.

Geometric Context

We have never found a way to reliably estimate a baseline or topline for characters, independent of classifying these characters in a word. Non-recognition-integrated estimates of these line positions, based on strictly geometric features, have too many pathological failure modes, which produce erratic recognition failures. However, the geometric positioning of characters most certainly bears information important to the recognition process. Our system factors the problem by letting the ANN classify representations that are independent of baseline and size and then using separate modules to score both the absolute size of individual characters and the relative size and position of adjacent characters.

The scoring based on absolute size is derived from a set of simple Gaussian models of individual character heights relative to some running scale parameters computed during both learning and recognition. This CharHeight score directly multiplies the scores emitted by the ANN classifier and helps significantly in case disambiguation.

We also use a GeoContext module that scores adjacent characters based on the classification hypotheses for these characters and their relative size and placement. GeoContext scores each tentative character based on its class and the class of the immediately preceding letter (for the current search hypothesis). The character classes are used to look up expected character sizes and positions in a standardized space (baseline = 0.0, top line = 1.0). The ink being evaluated provides actual sizes and positions that can be compared directly to the expected values, subject only to a scale factor and offset, which are chosen to minimize the estimated error of fit between data and model. This same quadratic error term, computed from the inverse covariance matrix of a full multivariate Gaussian model of these sizes and positions, is used directly as GeoContext's score (or penalty because it is applied in the negative log probability space of the search engine). Figure 11 illustrates the bounding boxes derived from the user's ink versus the table-driven model with the associated error measures for our GeoContext module.

GeoContext's multivariate Gaussian model is learned directly from data. The problem in doing so was to find a good way to train per-character parameters of top, bottom, width, space, and so on, in our standardized space from data that had no labeled baselines or other absolute referent points. Because we had a technique for generating an error vector from the table of parameters, we decided to use a backpropagation variant to train the table of parameters to minimize the squared error terms in the error vectors given all the pairs of adjacent characters and correct class labels from the training set.

GeoContext plays a major role in properly recognizing punctuation and disambiguating case and a major role in recognition in general. A more extended discussion of GeoContext was provided by Lyon and Yaeger (1996).
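GeoContext's scoring can be sketched as a quadratic form over an error vector; the eight specific measurements, and the fitting of the scale factor and offset, are simplified away here.

```python
import numpy as np

def error_vector(observed, expected):
    """Differences between observed and expected letter-pair geometry (for
    example, tops, bottoms, widths, and spacing of the two characters), after
    the ink has been scaled and offset to best fit the model."""
    return [o - e for o, e in zip(observed, expected)]

def geocontext_penalty(error_vec, inv_cov):
    """Quadratic error term e^T Sigma^-1 e from the multivariate Gaussian model,
    used directly as a penalty in the search's negative-log-probability space."""
    e = np.asarray(error_vec, dtype=float)
    return float(e @ inv_cov @ e)
```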


P_WordBreak = Γ_WordGap / (Γ_StrokeGap + Γ_WordGap)

[Figure 12 plots the two Gaussian density distributions against gap size: Γ_StrokeGap, the density of stroke (non-word) breaks, and Γ_WordGap, the density of word breaks, with sample counts on the vertical axis.]

Figure 12. Gaussian Density Distributions Yield a Simple Statistical Model of Word-Break Probability, Which Is Applied in the Region between the Peaks of the StrokeGap and WordGap Distributions. Hashed areas indicate regions of clear-cut decisions, where P_WordBreak is set to either 0.0 or 1.0 to avoid problems dealing with tails of these simple distributions.

Integration with Word Segmentation

Just as it is necessary to integrate character segmentation with recognition using the search process, so is it essential to integrate word segmentation with recognition and search to obtain accurate estimates of word boundaries and reduce the large class of errors associated with missegmented words. To perform this integration, we first need a means of estimating the probability of a word break between each pair of tentative characters. We use a simple statistical model of gap sizes and stroke-centroid spacing to compute this probability (spaceProb). Gaussian density distributions, based on means and standard deviations computed from a large training corpus, together with a prior-probability scale factor, provide the basis for the word-gap and stroke-gap (nonword-gap) models, as illustrated in figure 12. Because any given gap is, by definition, either a word gap or a nonword gap, the simple ratio defined in figure 12 provides a convenient, self-normalizing estimate of the word-gap probability. In practice, this equation further reduces to a simple sigmoid form, thus allowing us to take advantage of a lookup-table–based sigmoid derived for use in the ANN. In a thresholding, non-integrated word-segmentation model, word breaks would be introduced when spaceProb exceeds 0.5, that is, when a particular gap is more likely to be a word gap than a nonword gap. For our integrated system, both word-break and non–word-break hypotheses are generated at each segment transition and weighted by spaceProb and (1 – spaceProb), respectively. The search process then proceeds over this larger hypothesis space to produce best estimates of whole phrases or sentences, thus integrating word segmentation as well as character segmentation.
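A sketch of the gap model of figure 12; the distribution parameters below are placeholders, and the real models also use stroke-centroid spacing and a prior-probability scale factor.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def space_prob(gap, stroke_mu=0.2, stroke_sigma=0.1, word_mu=1.0, word_sigma=0.3):
    """P(word break | gap) = G_word / (G_stroke + G_word), clamped to 0.0 or 1.0
    outside the region between the two distribution peaks."""
    if gap <= stroke_mu:
        return 0.0
    if gap >= word_mu:
        return 1.0
    g_stroke = gaussian(gap, stroke_mu, stroke_sigma)
    g_word = gaussian(gap, word_mu, word_sigma)
    return g_word / (g_stroke + g_word)
```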

Discussion

The combination of elements described in the preceding sections produces a powerful, integrated approach to character segmentation, word segmentation, and recognition. Users' experiences with APR are almost uniformly positive, unlike experiences with previous handwriting-recognition systems. Writing within the dictionary is remarkably accurate, yet the ease with which people can write outside the dictionary has fooled many people into thinking that the NEWTON's Print Recognizer does not use dictionaries. As discussed previously, our recognizer certainly does use dictionaries. Indeed, the broad-coverage language model, although weakly applied, is essential for high-accuracy recognition. Curiously, there seems to be little problem with dictionary perplexity—little difficulty as a result of using very large, very complex language models. We attribute this fortunate behavior to the excellent performance of the neural network character classifier at the heart of the system.

One of the side benefits of the weak application of the language model is that even when recognition fails and produces the wrong result, the answer that is returned to the user is typically understandable by the user—perhaps involving substitution of a single character. Two useful phenomena ensue as a result: First, the user learns what works and what doesn't, especially when he/she refers back to the ink that produced the misrecognition; so, the system trains the user gracefully over time. Second, the meaning is not lost the way it can be, all too easily, with whole-word substitutions—with that Doonesbury effect found in first-generation, strong-language-model recognizers.

Although we provided legitimate accuracy statistics for certain comparative tests of some of our algorithms, we deliberately shied away from claiming specific levels of accuracy in general. Neat printers, who are familiar with the system, can achieve 100-percent accuracy if they are careful. Testing on data from complete novices, writing for the first time using a metal pen on a glass surface, without any feedback from the recognition system and with ambiguous instructions about writing with disconnected characters (intended to mean printing but often interpreted as writing with otherwise cursive characters but separated by large spaces in a wholly unnatural style), can yield word-level accuracies as low as 80 percent. Of course, the entire interesting range of recognition accuracies lies between these two extremes. Perhaps a slightly more meaningful statistic comes from common reports on newsgroups and some personal testing that suggest accuracies of 97 percent to 98 percent in regular use. For scientific purposes, none of these numbers have any real meaning because our testing data sets are proprietary, and the only valid tests between different recognizers would have to be based on results obtained by processing the exact same bits or analyzing large numbers of experienced users of the systems in the field—a difficult project that has not been undertaken.

One of the key reasons for the success of APR is the suite of innovative neural network–training techniques that help the network encode better class probabilities, especially for underrepresented classes and writing styles. Many of these techniques—stroke-count dithering, normalization of output error, frequency balancing, and error emphasis—share a unifying theme: Reducing the effect of a priori biases in the training data on network learning significantly improves the network's performance in an integrated recognition system despite a modest reduction in the network's accuracy for individual characters. Normalization of output error prevents overrepresented non-target classes from biasing the net against underrepresented target classes. Frequency balancing prevents overrepresented classes from biasing the net against underrepresented classes. Stroke-count dithering and error emphasis prevent overrepresented writing styles from biasing the net against underrepresented writing styles. One could even argue that negative training eliminates an absolute bias toward properly segmented characters and that stroke warping reduces the bias toward those writing styles found in the training data, although these techniques also provide wholly new information to the system.

Although we've offered arguments for why each of these techniques individually helps the overall recognition process, it is unclear why prior bias reduction in general should be so consistently valuable. The general effect might be related to the technique of dividing out priors, as is sometimes done to convert from p(class | input) to p(input | class). However, we also believe that forcing the net during learning to allocate resources to represent less frequent sample types might be directly beneficial. In any event, it is clear that paying attention to such biases and taking steps to modulate them is a vital component of effective training of a neural network serving as a classifier in a maximum-likelihood recognition system.


The majority of this article describes a sort of snapshot of the system and its architecture as it was deployed in its first commercial release, when it was, indeed, purely a print recognizer. Letters had to be fully disconnected; that is, the pen had to be lifted between each pair of characters. The characters could overlap to some extent, but the ink could not be continuous. Connected characters proved to be the largest remaining class of errors for most of our users because even a person who normally prints (as opposed to writing in cursive script) might occasionally connect a pair of characters—the cross-bar of a t with the h in the, the o and n in any word ending in ion, and so on. To address this issue, we experimented with some fairly straightforward modifications to our recognizer, involving the fragmenting of user strokes into multiple system strokes, or fragments. Once the ink representing the connected characters is broken into fragments, we then allow our standard integrated segmentation and recognition process to stitch them back together into the most likely character and word hypotheses, in the fashion routinely used in print recognition. This technique has proven to work well, and the version of the print recognizer in the most recent NEWTONs, since the MESSAGEPAD 2000, supports recognition of printing with connected characters. This capability was added without significant modification of the main recognition algorithms as presented in this article. Because of certain assumptions and constraints in the current release of the software, APR is not yet a full cursive recognizer, although this is an obvious next direction to explore.

The net architecture discussed earlier and shown in figure 3 also corresponds to the true printing-only recognizer. The final output layer has 95 elements corresponding to the full printable ASCII character set plus the British pound sign. Initially for the German market, and now even in English units, we have extended APR to handle diacritical marks and the special symbols needed for most European languages (although there is only limited coverage of foreign languages in the English units). The main innovation that permitted this extended character set was an explicit handling of any compound character as a base plus an accent. In this way, only a few nodes needed to be added to the neural network output layer, representing just the bases and accents rather than all combinations and permutations of the same. In addition, training data for all compound characters sharing a common base or a common accent contribute to the network's ability to learn this base or accent as opposed to contributing only to an explicit base + accent combination. Here again, however, the fundamental recognizer technology has not changed significantly from that presented in this article.

Future Extensions

We are optimistic that our algorithms, having proven themselves to work essentially as well for connected characters as for disconnected characters, might extend gracefully to full cursive script. On a more speculative note, we believe that the technique might extend well to ideographic languages, substituting radicals for characters and ideographic characters for words.

Finally, a note about learning and user adaptation: For a learning technology such as ANNs, user adaptation is an obvious and natural fit and was planned as part of the system


from its inception. However, because of random-access memory constraints in the initial shipping product and the subsequent prioritization of European character sets and connected characters, we have not yet deployed a learning system. We have, however, done some testing of user adaptation and believe it to be of considerable value. Figure 13 shows a comparison of the average performance on an old user-independent net trained on data from 45 writers and the performance for 3 individuals using (1) the user-independent net, (2) a net trained on data exclusively from this individual, and (3) a copy of the user-independent net adapted to the specific user by some incremental training. (Note: These data are from a multi-year-old experiment and are not necessarily representative of current levels of performance on any absolute scale.)

[Figure 13 bar chart: character error rate (%) for the user-independent, user-specific, and user-adapted nets, for the three individual writers and for all 45 writers on the user-independent net.]

Figure 13. User-Adaptation Test Results for Three Individual Writers with Three Different Nets Each Plus the Overall Results for 45 Writers Tested on a User-Independent Net Trained on All 45 Writers.

An important distinction is being made here between user-adapted and user-specific nets. User-specific nets have been trained with a relatively large corpus of data exclusively from this specific user. User-adapted nets were based on the user-independent net with some additional training using limited data from the user in question. All testing was performed with data held out from all training sets.

One obvious thing to note is the reduction in error rate ranging from a factor of two to a factor of five that both user-specific and user-adapted nets provide. An equally important thing to note is that the user-adapted net performs essentially as well as a user-specific net—in fact, slightly better for two of the three writers. Given ANNs' penchant for local minima, we were concerned that this might not be the case, but it appears that the features learned during the user-independent net training served the user-adapted net well. We believe that a small amount of training data from an individual will allow us to adapt the user-independent net to the user and improve the overall accuracy for the user significantly, especially for individuals whose writing is more stylized or whose writing style is underrepresented in our user-independent training corpus. Even for writers with common or neat writing styles, there is inherently less ambiguity in a single writer's style than in a corpus of data necessarily doing its best to represent essentially all possible writing styles.

These results might be exaggerated somewhat by the limited data in the user-independent training corpus at the time these tests were performed (just 45 writers), and at least two of the three writers in question had particularly problematic writing styles. We have also made significant advances in our user-independent recognition accuracies since these tests were performed. Nonetheless, we believe these results are suggestive of the significant value of user adaptation, even in preference to a user-specific solution.

Acknowledgments

This work was done in collaboration with Bill Stafford, Apple Computer, and Les Vogel, Angel Island Technologies. We would also like to acknowledge the contributions of Michael Kaplan, Rus Maxham, Kara Hayes, Gene Ciccarelli, and Stuart Crawford. We also received support and assistance from the NEWTON Recognition, Software Testing, and Hardware Support groups. We are also indebted to our many colleagues in the connectionist community for their advice, help, and encouragement over the years as well as our many colleagues at Apple who pitched in to help throughout the life of this project.

Some of the techniques described in this article are the subject of pending U.S. and foreign patent applications.

References

Asanović, K., and Morgan, N. 1991. Experimental Determination of Precision Requirements for Back-Propagation Training of Artificial Neural Networks, TR-91-036, International Computer Science Institute, Berkeley, California.

Bengio, Y.; LeCun, Y.; Nohl, C.; and Burges, C. 1995. LEREC: A NN-HMM Hybrid for Online Handwriting Recognition. Neural Computation 7:1289–1303.

Bourlard, H., and Wellekens, C. J. 1990. Links between Markov Models and Multilayer Perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:1167–1178.

Chang, E. I., and Lippmann, R. P. 1995. Using Voice Transformations to Create Additional Training Talkers for Word Spotting. In Advances in Neural Information Processing Systems 7, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, 875–882. Cambridge, Mass.: MIT Press.

Gish, H. 1990. A Probabilistic Approach to Understanding and Training of Neural Network Classifiers. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1361–1364. Washington, D.C.: IEEE Computer Society.

Guyon, I.; Henderson, D.; Albrecht, P.; LeCun, Y.; and Denker, P. 1992. Writer-Independent and Writer-Adaptive Neural Network for Online Character Recognition. In From Pixels to Features III, ed. S. Impedovo, 493–506. New York: Elsevier.

Jacobs, R. A.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3:79–87.

Lippmann, R. P. 1994. Neural Networks, Bayesian a Posteriori Probabilities, and Pattern Classification. In From Statistics to Neural Networks—Theory and Pattern Recognition Applications, eds. V. Cherkassky, J. H. Friedman, and H. Wechsler, 83–104. Berlin: Springer-Verlag.

Lyon, R. F., and Yaeger, L. S. 1996. Online Hand-Printing Recognition with Neural Networks. Paper presented at the Fifth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, 12–14 February, Lausanne, Switzerland.

Morgan, N., and Bourlard, H. 1995. Continuous Speech Recognition—An Introduction to the Hybrid HMM-Connectionist Approach. IEEE Signal Processing 13(3): 24–42.

Parizeau, M., and Plamondon, R. 1993. Allograph Adjacency Constraints for Cursive Script Recognition. Paper presented at the Third International Workshop on Frontiers in Handwriting Recognition, 25–27 May, Buffalo, New York.

Renals, S., and Morgan, N. 1992. Connectionist Probability Estimation in HMM Speech Recognition, TR-92-081, International Computer Science Institute, Berkeley, California.

Renals, S.; Morgan, N.; Cohen, M.; and Franco, H. 1992. Connectionist Probability Estimation in the DECIPHER Speech-Recognition System. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, I-601–I-604. Washington, D.C.: IEEE Computer Society.

Richard, M. D., and Lippmann, R. P. 1991. Neural Network Classifiers Estimate Bayesian a Posteriori Probabilities. Neural Computation 3:461–483.

Simard, P.; Victorri, B.; LeCun, Y.; and Denker, J. 1992. Tangent Prop—A Formalism for Specifying Selected Invariances in an Adaptive Network. In Advances in Neural Information Processing Systems 4, eds. J. E. Moody, S. J. Hanson, and R. P. Lippmann, 895–903. San Francisco, Calif.: Morgan Kaufmann.

Simard, P.; LeCun, Y.; and Denker, J. 1993. Efficient Pattern Recognition Using a New Transformation Distance. In Advances in Neural Information Processing Systems 5, eds. S. J. Hanson, J. D. Cowan, and C. L. Giles, 50–58. San Francisco, Calif.: Morgan Kaufmann.

Tappert, C. C.; Suen, C. Y.; and Wakahara, T. 1990. The State of the Art in Online Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:787–808.

Larry Yaeger (//pobox.com/~larryy) studied aerospace engineering with a focus on computers. He was director of software development at Digital Productions and generated the computer graphic special effects for The Last Starfighter, 2010, and Labyrinth as well as a number of television commercials. At Apple Computer, as part of Alan Kay's Vivarium Program, he designed and programmed a computer voice for Koko the gorilla, helped introduce MACINTOSHes into routine production on Star Trek: The Next Generation, and created a widely respected artificial-life computational ecology (POLYWORLD). Most recently, in Apple's Advanced Technology Group, he was technical lead in the development of the neural network handprint-recognition system in second-generation NEWTONs.

Brandyn Webb (www.brainstorm.com/~brandyn) is an independent consultant specializing in neural networks, computer graphics, and simulation. He received a degree in computer science from the University of California at San Diego at the age of 18. He was responsible for the San Diego Supercomputer Center's SYNU renderer as well as numerous scientific visualization tools and SIGGRAPH demonstrations for Ardent Computer. He created Pertek's ADME package to create realistic animated models of facilities operations involving human decision and interaction. He designed hardware and software algorithms for Neural Semiconductor and cofounded Apple's handprint-recognition project.

Richard F. Lyon (www.pcmp.caltech.edu/~dick) is known for his research in very large system integration and machine perception at Xerox PARC, Schlumberger Palo Alto Research, and Apple Computer. His achievements include the development of a computational model of the cochlea and auditory brain stem and the application of the neuromorphic system approach to problems in handwriting recognition, optical mouse tracking, and machine hearing. He recently directed the team at Apple that developed the improved NEWTON handwriting-recognition technology. Currently, Lyon continues his research affiliation with the Computation and Neural Systems Program at the California Institute of Technology and serves as a technology-development consultant to several companies in the information industry.
