IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-30, NO. 4, AUGUST 1982 535 AnAdaptive,Ordered,Graph Search Technique for Dynamic Time Warping for Isolated Word Recognition

MICHAELK. BROWN AND LAWRENCE R. RABINER, FELLOW, IEEE

Abstract—The technique of dynamic time warping (DTW) is relied on heavily in isolated word recognition systems. The advantage of using DTW is that reliable time alignment between reference and test E patterns is obtained. The disadvantage of using DTW is the heavy UI computational burden required to find the optimal time alignment OPTIMAL TIME path. Several alternative procedures have been proposed for reducing "II'- ALIGNMENT PATH the computation of DTW algorithms. However, these alternative 0 UIz methods generally suffer from a loss of optimality or precision in UI defining points along the alignment path. In this paper we propose LU another alternative procedure for implementing a DTW algorithm. The procedure is based on the well-known class of techniques for a directed search through a grid to find the "shortest" path. An adaptive version of a directed search procedure is defined and shown to be TEST FRAME In) capable of obtaining the exact D1'W solution with reduced computa. Fig.1. Region in the (n, m) plane ror which a time alignment contour tion of distances but with increased overhead. It is shown that for is calculated in dynamic time warping. machines where the time for distance computation is significantly larger than the time for combinatorics and overhead, a potential gain in speed of up to 3: 1 can be realised with the directed search algorithm. made to modify the DTW algorithm to eliminate some of the Formal comparison of the directed search algorithm with a standard computation either by being more restrictive with the DTW DTW method, in an isolated word recognition test, showed essentially no loss in recognition accuracy when the parameters of the directedconstraints [3] —[5J, or by approximating the reference pat- search were selected to realize the 3: 1 reduction in distancetern by states that are variable in duration [6] —[7]. In either computation. case, efficiency is gained at the sacrifice of optimal time alignment. As a result, recognition error rate generally in- creases over that obtained using the full DTW alignment I. INTRODUCTION algorithm. T IS well known in the area of that In this paper we present a new approach to finding an opti- Ioptimal time alignment of reference patterns to test pat-mal time alignment path which can substantially reduce terns substantially reduces recognition errors for a vocabularycomputation without sacrificing optimality of the resulting with polysyllabic words [I]. Typically, time alignment ispath. The way in which these efficiencies are achieved is by performed on speech data which is represented as a timemodeling the DTW problem as one of finding a directed path sequence of feature vectors (e.g., vectors of linear predictionthrough a constrained grid. By modeling the grid as a digraph coefficients) which represent the spectral information in cor-with conditional branch costs (or equivalently production responding "frames" of the speech signal. The most com-rules), an ordered graph searching (OGS) algorithm can be monly used time alignment procedures, for the speech recogni-used to solve for the best path through the grid. It will be tion problem, are the class of algorithms referred to as dynamicshown that such an algorithm can be designed to guarantee programming (DP) or dynamic time warping (DTW) methodsessentially optimal time alignment while reducing computation [2] —[5].Asshown in Fig. 1, these procedures require calcula-over that required for a conventional tion of the local distance between each possible reference and(DP) solution. Furthermore, we will show that by slightly test frame (within a prescribed global range, e.g., the parallelo-relaxing the path optimally conditions, a substantial reduction gram of Fig. I), in order to determine the optimal time align-in computation (>60 percent) can be achieved with only a ment path relating reference and test frames. small loss in accuracy. For most isolated word recognition systems using DTW for The ordered graph search type of algorithm, of the type to time alignment, it has been shown that the DTW algorithmbe described in this paper, is most useful for implementations requires the vast majority of the processing time required toof a word recognizer where distance calculations made on local recognize a word. Consequently, several attempts have beenfeatures of the speech pattern (e.g., LPC coefficients) are computationally expensive (e.g., simple microprocessor sys- Manuscript received August 3, 1981. tems). In such cases the control overhead is relatively inex- The authors are with Bell Laboratories, Murray Hill, NJ 07974. pensive compared to the cost of distance calculation. Then

0096-35l8/82/0800-0535800.75 © 1982 IEEE 536 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-30, NO. 4, AUGUST 1982

the reduction in computation for local distances is approxi- Endpoint Constraints: We assume that the word endpoints mately 65 percent, at the expense of doubling the controlof both the test and reference patterns have been accurately overhead, and about 4.5 kbytes of additional memory. Fordetermined, and so we require the path to obey the constraints implementations where the cost of local distances is negligible w(l) =1 (e.g., using a peripheral array processor or peripheral high (5a) speed multiplier), the increased cost of overhead can be corn- w(N) =M. (5b) parable to the decreased cost of local distances. For such cases Local Path Constraints: We assume the Itakura path con- there would be little advantage to the proposed algorithm. The organization of this paper is as follows. In Section IIstraints [2] are obeyed, namely, we review the "standard" DTW algorithm as this provides the 0 '(w(n) —w(n—1) 2 (6a) basis of comparison for the OGS algorithm to be presented in w(n) - -=0iff w(n - w(n - Section III. In Section IV we describe the results of a series w(n 1) 1)- 2)>0. (6b) of isolated word recognition tests designed to compare speedThese local path constraints guarantee that the average slope and accuracy of the two time alignment procedures. Finally,of the warping function lies between 4and2, and guarantee in Section V we discuss the advantages and disadvantages ofpath monotonicity. the OGS algorithm relative to the standard DP method, for Global Path Constraints: The endpoint conditions (5) and various implementations. the local path constraints (6) lead to a set of global path constraints of the form II. THE STANDARD DTW ALGORITHM We assume that we are given a test pattern T, consisting of a mL(n)

A. The Basic DTW Iteration Based on the above specifications, the techniques of dy- namic programming can be used to solve (10) iteratively for the best path as follows. We define an accumulated distance m function DA (n, m) as the total distance from the grid point (1, 1) to the grid point (n, m), along the best path between these grid points (minimum distance), and we can use the iteration (based on the local constraints) n S(i,1) DA(n, m)d(T(n),R(m))+min [DA(n— 1,m)g(n- 1,m),Fig. 2. Illustration of the nodal structure and a typical path for ordered DA(n— 1,m— 1),DA(n— 1,m— 2)] graph searching. lnN, mL(n)mmH(n) (12) ordered graph searching (OGS) algorithm is then applied to where g(n, m) is the nonlinear weighting find the best time alignment path. Before describing the graph searching algorithm, it is im- if w(n)w(n-1) g(n, m) =Ji (13) portant to understand why such a procedure can lead to signi- if w(n)w(n- 1) ficant reduction in computation over standard DTW algorithms. to guarantee that the optimum path to (n, m) does not stayThis gain in efficiency for the digraph search is achieved by flat for two consecutive frames. The final, desired solution toomitting the local distance calculations associated with nodes (10) is which are not searched. Dynamic programming, on the other hand, requires that all local distances within the global con- D DA(N,M) (14a)straints be calculated. By restricting the digraph search to D=DA(N,M)IN. (14b)investigate only the most likely warping paths, the number of nodes that are expanded (i.e., used in subsequent calculations) Thus the DTW algorithm requires on the order of NM dis-can be kept to between -and that used for the DP search; tance calculations, and NM sets of combinatorics [(12) andat the same time, under certain readily attainable conditions, (13)] to obtain the best path and the total distance for eachthe resulting warping path can be shown to be optimal, i.e., reference pattern. We now consider alternative path findingidentical to the one found by the DP algorithm. techniques which seek to reduce the number of local distance calculations. A. The Directed Search Algorithm Consider the graph structure of Fig. 2. Each point in the III. AN ORDERED GRAPH SEARCH APPROACH TO grid is a node, and we designate the ith node by its coordinates FINDING THE BEST PATH THROUGH A GRID (n, m), i.e., We have shown that the conventional DTW algorithm solves the problem of optimally time aligning a test and a reference i(n,m). (15) pattern, at the same time providing a measure of the similarityWe denote the starting node as s (1, 1), and the ending node (distance) between test and reference along the alignment path.as t =(N,M). By restructuring the entire time alignment problem as a prob- For any path through the grid passing through node i, we lem in finding the best path through a finite grid of points, wedenote the path cost (the accumulated distance along the can take advantage of a large class of ordered tree and graphpath) as searching algorithms, as described by Nilsson[8], [9] ,tofind. the best path with substantially reduced computation of local f(i) g(i) + h(i) (16) distances. These searching algorithms have been commonlywhere g(i) is the minimal cost of the path from node s to node used in the artificial intelligence (Al) area for machine simula-i, and h(i) is the minimal cost of the path from node i to tion of cognitive processes, e.g., problem solving or gamenode t. playing. For a directed search through the grid (i.e., proceeding from The way in which we apply graph searching to dynamic timeleft to right), the cost g(i) along the path to node I is known warping is illustrated in Fig. 2. We represent the grid of pointsexactly; however, the cost h(i) from node ito node t is not in which the alignment path can lie as a directed graph inknown, and therefore must be estimated. Thus, an estimate of which the nodes represent local distances (between test anda minimal cost path passing through node i is reference frames), and allowable node transitions are repre- sented by branches of the directed graph (digraph). An (i) 'g(i) ÷ (i) (17) 538 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-30, NO. 4, AUGUST 1982 where (i) is the estimate of h(i). Thus, in trying to find the 22 f minimal cost path through the grid, we build up a series of nodes for which we know the exact cost from the start node, and for which we estimate a cost to the terminal node. We expand the node which currently provides the smallest cost estimate (simultaneously keeping track of all previously encountered nodes, for backtracking) until we reach the termi- nal node t. It should be clear that any path from start node s to inter- m mediate node i is completely characterized by the nodal state, which includes: 1) the node coordinates (n, m); 2) the estimated cost h(i) from the node ito the end; 3) a pointer to the previous node on the path (the ancestor or parent node); 4) the true cost g(i) from the start node to node i—the cost S n g(i) is saved to facilitate computation of f(i') where i' is aFig. 3. Illustration of the computation to determine a path from node successor node to i. s to node t, using the ordered graph searching concept. In solving for the minimal cost path through the grid, gen- erally nodes from several paths are encountered. By using the(termination) a node i' such that ?(i') f*(t) 0 V i sandg(i) is monotonic; which, upon expansion, shows node 18 (an earlier successor) 3) f(i)f(i)Vi; to have smallest f of all open nodes. Upon expanding node 4) h(i) is monotonic for all potential paths h(i)> 0, Vit. 18, successor node 19 is generated but not added to the open To show that these conditions are sufficient to find the opti-list since it was encountered when node 7 was expanded pre- mal path we have to show that the search is guaranteed toviously. Node 18 and its successors are expanded until node reach the terminal state t, and having reached this state, the20 comes to the top of the open list and this local search is resulting path is minimal cost. As proved by Nilsson [9, pp.terminated by the optimality property just described. Node 74—79] for finite graphs the search must always terminate20 is next expanded to node 21, followed by node 22 and since these are only a finite number of nodes that can befinally state t is reached with a minimal cost path. expanded. To show that the search finds the optimal path to The computational savings, for this simple example, over a t, we assume that a path to node t gets a nonminimal costconvention DP approach are approximately the ratio of grid f(t) g(t) >f*(t) where f*(t) is the minimal cost to node t.points to the number of nodes for which distances are cal- By condition 3) (i.e., the estimator of cost is less than or equalculated, or about 3: 1. That such savings are achievable in to the true cost) there existed, before expansion of node treal examples will be shown later in this paper. BROWN AND RABINER: GRAPH SEARCH TECHNIQUE FOR DYNAMIC TIME WARPING 539

REFERENCE AND TEST REFERENCE AND TEST FROM SAME CLASS FROM DIFFERENT CLASSES

LOCAL DISTANCE

Fig. 4. Plots of local distance in the (n, m) plane for reference and test Fig. 5. Plots of local distance in the (is, in) plane for reference and test patterns from the same word class. patterns from different word classes.

B. Ordered Graph Searching Applied to DTW however, is that paths and path distances obtained by the As applied to DTW algorithms for isolated word recognition,OGS approach are comparable to those obtained from DP all the optimality conditions of Section Ill-A are satisfied,algorithms. except con4ition 1). This condition is violated because the 1) Flowchart of OGS Procedure: A flow diagram of the local constraints are data dependent, i.e., the optimal pathbasic ordered graph searching algorithm is given in Fig. 6. The cannot stay flat for two consecutive frames, and thus thestart node s (1, 1) is the only node on the open list and the expansion of any node depends on the path to that node.closed list is empty at the beginning of the search. The open This means that for ordered graph searching (as for conven-list is always maintained as a sorted list of nodes such that the tional DTW algorithms [5]) optimality of the path is notnode having least estimated total warp path cost heads the guaranteed. However, as shown by Myers et al. [5], thelist.This head node is removed from the open list for ex- nature of the problem essentially makes the path finding apansion into successor nodes. Expansion proceeds by gen- robust procedure which is basically insensitive to the dataerating one node at a time until all possibilities are exhausted. dependent path constraints, especially when the test andFor efficiency reasons, an array of flags is maintained which reference patterns are from the same word class. This point isindicates which nodes have been generated. In this way, the illustrated in Figs. 4 and 5 which show plots of the localstep in the flow labeled "check OPEN & CLOSED" can be distance d'(T(n), R(m)) over the entire (n, m) plane for refer-performed by simply checking one element of the array rather ence and test from the same class (Fig. 4 for the digit 5), andthan searching two lists. The generated node must lie within for reference and test from different classes (Fig. 5 for thethe global constraints. If it is not the terminal node, then digits S nd 6). Also shown in these figures is the optimum?(i) =g(i) + (i) is computed. Using this?'(i) value the node is warping path (indicated by the dashed line). As shown in Fig.inserted in the open list at the appropriate location and the 4 the warping path lies in a valley in the local distance func-process continues. tion. This valley is reasonably broad along most of the path; When the terminal node is found, the g(i) already calculated hence slight deviations from the time optimal path do notfor this node is the path cost g(t) f(t). The warping path generally result in substantial increases in path cost (distance).may be recovered if necessary by following the parent node When the reference and test are from different classes (Fig.pointers backward from the terminal node to the start node 5), the warping path generally deviates substantially from theand the solution is complete. diagonal path, indicating highly nonlinear compressions and 2) The Estimator Function. h(i): The only unspecified expansions of the time scales to achieve best matches. Thequantity for the OGS algorithm is the estimator function general shape of the distance function is a series of sharph(i) used to provide the estimated path cost f(i) g(i) + h(i). peaks and valleys which fluctuate rapidly in the (n, m) plane.The quantity f(i) must be calculatedat each node i visited Thus, small deviations in the warping path, due to the localduring the search process, and since g(i) is known exactly, constraints, can lead to significant cost increases over theonly (i) must be specified to give f(i). optimal path. There are several ways that h(i) could be calculated. Since lfthe previous description were entirely correct, we wouldh(i) is the distance (cost) along the path from node i to the be able to take advantage of it to increase the separation be-terminal node t, and since h(i) must underestimate the true tween the distance (cost) distributions of "same" and "differ-path, cost h(i) (to satisfy the path optimality constraints of ent" words, since the calculated distances for same words areSection 111-A), then for i (n, m) we have essentially minimal costs, whereas for different words they are N above minimal cost. Although in practice this situation does h(i)h(i) d(T(k), R(w(k))). (17) occur, the magnitude of the effect is small. The key point, k= n+i 540 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-30, NO. 4, AUGUST 1982

INITIALIZE I i.e., the probability density function ofis independent of PLACE START NODEI ON OPEN LIST T, R, and k [10]. For example, for the LPC parameter set used here we have

I REMOVE NODE FROM OPEN LIST HAVING LOWEST i(i) P1Ij3) x2($). (19) ANDPLACE ON CLOSED ST FOR EXPANSION Based on the above discussion we can use, as a bound on h(i), the quantity H GENERATE A I (i) =- .d SUCCESSOR NODE_r_L0CAL CONSTRAINTS (Nn) (20) whereis sufficiently small so that we can guarantee that the probability that h(i)> h(i) is kept to any desired value. SUCCESS NODE The problem with the estimator of (20) is that for test and GENERATED reference words of different word classes, the estimator is a

YES gross underestimator, causing a needlessly large number of nodes to be searched. For such comparisons we would prefer CHECK OPEN AND CLOSED LISTS FORCOPY OFNEW NODE a gross overestimator; whereas for cases when reference and test are from the same word class we need the underestimator

COPY of (10). To combat these difficulties we have developed an YES FOUND adaptive estimation procedure, which we now describe. Consider first a fixed estimator of the form NO i(i)(N- n)a (21) CHECKGLOBAL CONSTRAINTS where a is a parameter of the estimator. Fig. 7 shows some representative plots of three measures of computation and NODE YEs OUTSIDE accuracy, namely BOUNDARIES 1) accumulated path cost f(t), 0 2) number of distance calculations ND, and SAVEPATH POINTER 3) number of nodes expanded NE, SET LOCAL CONSTRAINTS PARAMETER as a function of a. Fig. 7(a) is for a typical case when refer- COMPUTEg(i) ence and test words are from the same class, and Fig. 7(b) is for a typical case when reference and test words are from IS THIS THE TERMINAL YES different classes. It can be seen from Fig. 7(a) that for values NODE of a 0.885, the accumulated path cost remains constant, whereas ND and NE fall dramatically as a increases above 0.1. NO For this example the average cost per frame was 0.45, showing COMPUTEI(i)AND that a can increase above the value by almost a factor of 1 .7 SORTNEW NODE ONTO OPENLIST before optimality of the path is sacrificed. Other examples have shown similar behavior as a function of a. I RECOVER OPTIONAL PATH For the data of Fig. 7(b), when reference and test words [_ BY BACKTRACKING were different, decreases of NE and ND became significant only for very large values of a, e.g., a> 1.2, whereas path COST cost is seen to be almost independent of a. The above results suggest a data adaptive estimator of the form

Fig. 6. Flowchart of the OGS method. (i)(N- n)a- (22)

Equation (17) says that the true path cost from node ito nodeThe estimator of (22) has the advantage that for words of the tisthe sum of the local distances along all the nodes in thesame class, g(i)/n eventually becomes small, giving an effective path, and for the asymmetric path constraints used in thea multiplier in the correct range, whereas for words in differ- DTW implementation, this distance is the sum of the distancesent classes, g(i)/n eventually becomes large, giving the best results here.Experimentation with the estimator of (22) along the (N -n)grid points of the path. When T and R are from the same word class, then, along the optimal alignmentshowed that the resulting path costs were basically insensitive path, we have the theoretical result that to a over a wide range of a. Based on this experimentation, a value of a 0.7 was chosen. Although there is no theoretical 'c1(T(k),R (w(k)))(13) =F(t3), (18) guarantee that h(i) of (22) is an underestimate of h(i) in all WN AND RABINER: GRAPH SEARCH TECHNIQUE FOR DYNAMIC TIME WARPING 541

Values of 3 between 0.6 and 0.7 were used in two evaluation tests to be described in Section IV. The chosen value of !3 ACCUMULATED PATH DISTANCE zU NuHeER OF NODES EXPANDED is essentially the largest reasonable average distance one would — NUM8ER OF DISTANCE CALCULATION a expect to encounter when comparing test and reference words za of the same class. zw - z IV. EXPERIMENTAL COMPARISON OF DP ANt) OGS 0 To compare the performance of the OGS algorithm of Section III with the standard DTW algorithm, two recognition _j_IIIIIlIlIIIIIljjIIIitIlII experiments were performed. For the first experiment, each a— of 6 talkers (3 male, 3 female) trained the recognizer on iso- (a) lated digits (using a robust training method [11]), thereby obtaining 6 sets of speaker trained templates. Then each talker spoke the 10 digits S times to form a test set of 50 I I I P II I I I I I I I p I utterances. For the second experiment a speaker independent set of templates was used in which each word of the vocabulary was zC represented by 6 templates: The vocabulary for this experi- I" z a mentwas 129 word airlines vocabulary [12], andthe templates I.- z S - were obtained from a clustering analysis of the speech of 100 C, - - talkers (50 male, 50 female) [13]. A set of 4 talkers (2 male, 2 female—all not part of the training set) was used for this

- - experiment, with each talker saying the vocabulary one time. I I C. Ill_I_I11111 I 0 Both recognition experiments used speech recorded off a 0 a— IS dialed-up telephone line. The "standard" LPCbasedrecog- (b) nizer of Rabiner et a!. was used to derive the features of the Fig.7. Plots of ND, NE, and D versus a for reference and test patterns speech, and to provide a set of Qbestrecognition candidates in (a) the same class and (b) in different classes. [14] .Thenormalize-and-warp procedure [5] was also used to provide fixed length test and reference patterns. For each cases, practical experience indicates this is essentially alwaysexperiment both the DP and the OGS algorithms were used, - and the results are compared in terms of recognition accuracy the case. Efficiency can be improved further (without significant lossand computation. in accuracy) by forcing to become much larger than 0.7 wh•i preliminary results indicate that the test and referenceA. Results of Experiment 1—Speaker Trained Recognition WcL come from different classes. This makes h(i) predomi-Using Isolated Digits nant in calculating f(i) and tends to minimize backing up The results of the first experiment on the digit vocabulary during the search. In this case the actual warp path cost isare presented in Table I and in Fig. 8. Table I gives statistics, not important as long as it is large. The algorithm can predictfor each talker, for both the OGS and DP algorithms, for the early in the search if the test and reference do not belong tofollowing: the same class by observing the average per test frame value 1) number of local distance calculations ND; of g(i). A large value for g(i) indicates a potential mismatch. 2) average distance to the correct reference Th; There is always the possibility that for the first few test frames, 3)average distance to the second candidate 2 g( '.ll be larger than the global average. This could cause the 4) recognition error rate for the top candidate; aig. hm to predict a class mismatch during the first few 5) recognition error rate for the top two candidates. frames of the search. Since the sensitivity of path cost to is Fig. 8 shows histograms (obtained from one of the 6 talkers), low, however, this will not generally cause any substantialas a function of comparisons with the correct reference (C), difference in total path cost as long as the class mismatchand with all incorrect references (1), of prediction is corrected at each frame. 1) the number of nodes expanded NE; Based on the above discussion, the final form of the esti- 2) the number of local distance calculations ND; mator function is 3) the average per frame distance from the optimal path D. (Histograms from each of the 6 talkers showed similar 0.7-(N-n) if (23a)behavior). Examination of Table I shows the following. 1) For correct references an average reduction in the nuni- ber of local distances by a factor of 3.2 is obtained. if (23b) 2) For incorrect references, an average reduction in the number of local distances by a factor of 2.4 is obtained. 542 iEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-30, NO. 4, AUGUST 1982

TABLE I TABLE II RECOGNITION STATISTICS FORTHE 10WORD DIGITVOCABULARY RECOGNITIONSTATISTICS FOR THE 129 WORDAIRLINE VOCABULARY (SPEAKERTRAINED) (SPEAKERINDEPENDENT) Talker Number/Sex Talker Number/Sex

1(M) 2(F)3(F)4(M)5(M)6(F)Average 1(M)2(M)3(F)4(F) Average

Number DR 682 682 682682 682 Number DR 620 620620620620620 620 Distances Distances

Number OGS 242 243 240249 244 Number OGS (72 226 62230 212 167 (95 Distances/C Distances/C Number OGS 268 260 266260 264 Number OGS 257 268248 256 267 240 256 Distances/I Distances/I Average Distance DR .257 .259.286.30! .276 ToFirst Average DistanceDR .4) .40 .36 .34 .36 (7 .34 Candidate OGS.262 .264.292.306 .28! To Correct —————-— Reference OGS .43 .42 .37 .35 38 .17 .35 Average Distance DR .319 .309.329.347 .326 To Second —-———— Average DistanceDR .66 .66 .60 .50 .56.50 .58 Candidate OGS.327 .318.337.357 .335 To Second ——— ——— Candidate OGS .68 .70 .62 .52 .59 .52 .6) Error Rate (%) DP 0.8 7.815.59.4 3.4 lop Candidate OGS10.8 (0.! 14.720.2 (4.0 DR 0 4 0 6 (0 0 3.33 Error Rate (%) ———— Error Rate )%) DR 2.3 5.4 6.2 9.3 5.8 lop Candidate008 2 4 0 8 0 &33 lop2 Candidates OGSfl54 7.8 3.2 7.2 Error Rate (%) DR 0 0 0 0 0 0 0 lop2 Candidate'OGS 0 0 0 2 2 0 67 8(c) and (d)] are similar, and the small differences reflect the loss in accuracy of not always finding the optimal warping path. (0) A B. Results of Experiment 2—Speaker Independent 30 Recognition of Airline Terms The results of the speaker independent test, using 6 tempIati (b( perword and a 129 word airlines vocabulary are given in Tab I...z II and Fig. 9. The type of statistics given in the table and Fig. 5.))iAJ 9 are similar to those of Table I and Fig. 8, with one excep- OGS tion. Since there were 6 "correct" references for each spoken word, we present statistics on the first candidate (in Table II) 2.5 0 0/I 2.5 and on both the correct references and the first candidate in Fig. 9. (dl The results on reduction in distance computation are similar ,P to those of Experiment 1. However here the reduction '-3 2.5 I correct references is about 2.8: 1, and for incorrect references Fig. 8. Histograms of NE,ND, andDconditionedon correct and in- it is 2.6: 1, thereby indicating a somewhat larger overall re- correct references for the digit recognition experiment. duction than for the digit vocabulary. Another difference from Experiment 1 is that the average 3) The average distance of the OGS method to correctdistance of the second candidate is close to the average dis- references (0.35) is slightly larger than the average distance oftance of the first candidate, indicating that, in general, the first the DP method to correct references (0.34); similarly thetwo candidates were reference patterns from the correct word average distance of OGS to the second candidate (0.6) isclass. slightly larger than the DP distance (0.58). As in the digit experiment, it was found that the average 4) The average error rate for the OGS system is slightlyseparation of the distributions of distances for correct and larger (for both top I and top 2 candidates) than for the DPincorrect references for the OGS method, was slightly larger system. than for the DP method. The differences, however, occur The results above indicate that the OGS method yields onlyprimarily on the upper side of the distributions for the incor- slightly worse accuracy results than the DP method, at therect references, thereby having little effect on recognition same time achieving about a 2.5: 1 overall reduction in distanceaccuracy. computation. Close scrutiny of the OGS distribution reveals slight but Examination of the histograms of Fig. 8 shows a fairly wideconsistent irregularity in the distance distributions for jn:or- spread in the NE and ND statistics for incorrect references,rect references. This irregularity can be attributed to the indicating that for some words a large percentage of the avail-underestimator to overestimator function switching, which able nodes were used. For correct references a lower, tighteroccurs when predicted distance per test frame exceeds the distribution is found for both N and ND. Finally, it is seenchosen value of 13(0.7). When the estimator function h(i) that the overall distance distributions for DP and OGS [Figs.becomes a cost overestimator it generally forces the actual BROWN AND RABINER: GRAPH SEARCH TECHNIQUE FOR DYNAMIC TIME WARPING 543

expected accumulated distance histogram. Calculations by Rabiner etcii.show that potential savings of from 50—75per- cent of the computation can be achieved in this manner, with I1T<'.SC D 5( 0 NE/C Ne/I no loss in recognition accuracy. It should be clear to the reader that the rejection threshold idea can equally well be applied to the OGS method as to the 0 DP method; hence the data reduction of the OGS method is essentially in addition to that of the rejection threshold. A second, and perhaps more substantial objection to the OGS method is that the reduction in distance computation is I— OGS z attained at the expense of a more complicated control struc- 0 U ture. This combinatoric effort has been estimated to be about 200-250 percent of the combinatoric effort of the DP algo- OGj rithm based on CPU time measurements. This increased over- tA JGs(2, head may have a substantial impact on the overall efficiency, especially if most of the numerical effort can be handled by high speed hardware. 50 2,5 If we assume a typical microprocessor application without E''nI special purpose hardware, nominal time to perform one DTW is about 5s.Typically, combinatorics consumes only about rt'\ IP 0.1 s of this time leaving 4.9 s for numerical computation 2.5U 2.5 0 D1/C J\P4 using standard DP methods. The OGS algorithm is well suited to this type of application.Combinatoric time for OGS Fig. 9. Histograms of NE, ND, D, andD2conditioned on correct and incorrect references for the airlines vocabulary recognition experiment. would be about 0.2-0.25 s while numerical computation time would drop to about 1.7 s. Thus the OGS method would be more than 60 percent faster than the DP method, performing warp path cost to increase slightly thus forming a slight de-a typical DTW in less than 2 s. pression in the distribution of distances just above a distance On the other hand, special purpose hardware now exists for value of 0.7. This irregularity occurs at sufficiently high dis-computing local distances in a parallel manner. Under there tance values as to have no effect on recognition. circumstances the ratio of numeric to combinatoric time may Finally, we see that the average recognition error rates onbe considerably less than one. The OGS method would then the top candidate are comparable with the DP method yield-be considerably slower than the standard DP method. ing a 0.6 percent smaller error rate than the OGS method. Thus, as in the digit experiment, we see that a 2.6: 1 reduction VI. SUMMARY in local distance computations can be achieved at the expense In summary, the OGS method has been shown to be a prac- of about 0.6 percent in recognition accuracy. tical alternative to standard DP methods for solving the DTW V. DISCUSSION problem. The OGS method substantially reduces numerical computation at the expense of increased combinatoric effort The purpose of this paper was to describe an alternative without significantly altering the DTW path cost. OGS is a approach to dynamic programming to solve the time alignment problem for isolated word recognition, It was shown that theparticularly viable technique for microprocessor implementa- tions of speech recognition systems. class of ordered graph searching methods, widely used in the area of artificial intelligence, could readily be applied to the REFERENCES time alignment problem. One such algorithm was developed (1] G. M. White and R. B. Neely, "Speech recognition experiments and discussed in detail. with linear prediction, bandpass filtering, and dynamic program- It was shown that the OGS method could solve the time ming," IEEE Trans. Acoust., Speech,SignalProcessing, vol. alignment problem with essentially the same accuracy as DP ASSP-24, pp. 183—188, Apr. 1976. [2] F. Itakura, "Minimum prediction residual principle applied to methods; however the required computation for local dis- speech recognition," IEEE Trans. Acoust., Speech, Signal Pro- tances was reduced by a factor of about 2.5. cessing, vol. ASSP-23, pp. 67—72, Feb. 1975. There are, however, at least two mitigating circumstances [31 H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., that affect the results presented above. First is that there are Speech, Signal Processing, vol. ASSP-26, pp. 43—49, Feb. 1978. alternative ways of achieving computational reductions in [4] L. R. Rabiner, A. E. Rosenberg, and S. E. Levinson, "Considera- standard DP approaches. For example Rabiner eta!. [14] tions in dynamic time warping for discrete word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-26, have proposed the use of a rejection threshold to monitor the pp. 575—582, Dec. 1978. DTW distance as it proceeds through the grid and to abort [5] C. S. Myers, L. R. Rabiner, and A. E. Rosenberg, "Performance any reference pattern whose accumulated distances exceeds tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Trans. Acoust., Speech, Signal Processing, the rejection threshold. This rejection threshold can either be vol. ASSP-28, pp. 622—633, Dec. 1980. static or dynamic, depending on detailed knowledge of the [6] H. I. Silverman and N. R. Dixon, "State constrained dynamic 544 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-30, NO. 4, AUGUST 1982

programming (SCDP) for discrete utterance recognition," inPh.D. degree at the University of Michigan. His dissertation was in the Proc. mt. Conf ASSP, Apr. 1980, pp. 169—172. area of image processing and . Since 1980, he has [7] M. H. Kuhn, H. Tomaschewski, and H. Ney, "Fast nonlinear time been with the Group at Bell Laboratories, Murray alignment for isolated word recognition," in Proc. mt. ConfHill, NJ, where he has been involved in research in speech recognition ASSP, Apr. 1981, Pp. 736—740. and synthesis techniques. [8] N. J. Nilsson, Problem-Solving Methods in Artificial Intelligence. New York: McGraw-Hill, 1971. [91—, Principlesof Artificial Intelligence. Tioga, 1980. [10] L. R. Rabiner and J. G. Wilpon, "A two-pass system for isolated word recognition," Bell Syst. Tech. J, vol. 60, pp. 739—766, May—June 1981. [11] —,"Asimplified, robust training procedure for speaker trained, isolated word recognition systems," J. Acoust. Soc. Amer., vol. 68, pp. 127 1—1276, Nov. 1980. [12] S. E. Levinson, A. E. Rosenberg, and J. L. Flanagan, "Evaluation Lawrence R. Rabiner (S'62—M'67—SM'75—F'75) was born in Brooklyn, NY, on September 28, of a word recognition system using syntax analysis," Bell Syst. 1943. He received the S.B. and S.M. degrees Tech. J, vol. 57, pp. 1619—1626, May—June 1978. simultaneously in June 1964, and the Ph.D. [13] J. G. Wilpon, L. R. Rabiner, and A. Bergh, "Speaker independent degree in electrical engineering in June 1967, isolated word recognition using a 129 word airline vocabulary," to be published. all from the Massachusetts Institute of Tech- [14] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon, nology, Cambridge, MA. From 1962 through 1964, he participated in "Speaker independent recognition of isolated words using cluster- the cooperative plan in electrical engineering at ing techniques," IEEE Trans. Acoust., Speech, Signal Processing, Bell Laboratories, Whippany, NJ, and Murray vol. ASSP-27, pp. 336—349, Aug. 1979. Hill, NJ. He worked on digital circuitry, mili- tary communications problems, and problems in binaural hearing. Presently, he is engaged in research on speech communications and digi- tal signal-processing techniques at Bell Laboratories, Murray Hill. He is coauthor of the books Theory and Application of Digital Signal Michael K. Brown was born in Highland Park, Processing (Prentice-Hall, 1975) and Digital Processing of Speech Ml, on January 4, 1951. He received the Signals (Prentice-Hall, 1978). B.S.E.E. degree in 1973 in electrical engineering Dr. Rabiner is a member of Eta Kappa Nu, Sigma Xi, Tau Beta Pj, and the M.S. and Ph.D. degrees in 1977 and and a Fellow of the Acoustical Society of America. He is a former 1981, respectively, all from the University of President of the IEEE S-ASSP Ad Com, and is currently a member of Michigan, Ann Arbor. the S-ASSP Technical Committee on Digital Signal Processing, former From 1973 to 1976 he was employed with member of the S-ASSP Technical Committee on Speech Communica- the Burrough's Corporation and was involved in tion, former Associate Editor of the S-ASSP TRANSACTIONS, member of the development of ink jet printer systems. the PROCEEDINGS OF THE IEEE Editorial Board, and a former member From 1976 to 1980 he continued his work with of the Technical Committee on Speech Communication of the Acoustical Burrough's as a consultant while pursuing the Society.