Utterance Segmentation Using Combined Approach Based on Bi-directional N-gram and Maximum Entropy

Ding Liu and Chengqing Zong
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
[email protected], [email protected]

Abstract

This paper proposes a new approach to the segmentation of utterances into sentences, using a new linguistic model based upon Maximum-entropy-weighted Bi-directional N-grams. The usual N-gram algorithm searches for sentence boundaries in a text from left to right only. Thus a candidate sentence boundary in the text is evaluated mainly with respect to its left context, without fully considering its right context. Using this approach, utterances are often divided into incomplete sentences or fragments. In order to make use of both the right and left contexts of candidate sentence boundaries, we propose a new linguistic modeling approach based on Maximum-entropy-weighted Bi-directional N-grams. Experimental results indicate that the new approach significantly outperforms the usual N-gram algorithm for segmenting both Chinese and English utterances.

1 Introduction

Due to the improvement of speech recognition technology, spoken language user interfaces, spoken dialogue systems, and speech translation systems are no longer only laboratory dreams. Roughly speaking, such systems have the structure shown in Figure 1.

[Figure 1. System with speech input: input speech passes through speech recognition to language analysis and generation, producing output text or speech.]

In these systems, the language analysis module takes the output of the speech recognizer as its input, representing the current utterance exactly as pronounced, without any punctuation symbols marking the boundaries of sentences. Here is an example: 这边请您坐电梯到 9 楼服务生将在那里等您并将您带到 913 号房间 (this way please, please take this elevator to the ninth floor, the floor attendant will meet you at your elevator entrance there and show you to room 913). As the example shows, it will be difficult for a text analysis module to parse the input if the utterance is not segmented. Further, the output utterance from the speech recognizer usually contains wrongly recognized words or noise words. Thus it is crucial to segment the utterance before further language processing. We believe that accurate segmentation can greatly improve the performance of language analysis modules.

Stevenson et al. have demonstrated the difficulty of this task through an experiment in which six people, educated to at least the Bachelor's degree level, were required to segment into sentences broadcast transcripts from which all punctuation symbols had been removed. The experimental results show that humans do not always agree on the insertion of punctuation symbols, and that their segmentation performance is not very good (Stevenson and Gaizauskas, 2000). Thus it is a great challenge for computers to perform the task automatically.

To solve this problem, many methods have been proposed, which can be roughly classified into two categories. One approach is based on simple acoustic criteria, such as non-speech intervals (e.g. pauses), pitch and energy; we can call this approach acoustic segmentation. The other approach, which can be called linguistic segmentation, is based on linguistic clues, including lexical knowledge, syntactic structure, semantic information, etc. Acoustic segmentation cannot always work well, because utterance boundaries do not always correspond to acoustic criteria. For example: 您好请问明天的单人间还有吗或者标准间也行 (hello, could you tell me whether a single room is still available for tomorrow, or a standard room would also do). Since simple acoustic criteria are inadequate in such cases, linguistic clues play an indispensable role in utterance segmentation, and many methods relying on them have been proposed.

This paper proposes a new approach to linguistic segmentation using a Maximum-entropy-weighted Bi-directional N-gram-based algorithm (MEBN). To evaluate the performance of MEBN, we conducted experiments in both Chinese and English. All the results show that MEBN outperforms the normal N-gram algorithm. The remainder of this paper focuses on the description of our new approach to linguistic segmentation. In Section 2, some related work on utterance segmentation is briefly reviewed, and our motivations are described. Section 3 describes MEBN in detail. The experimental results are presented in Section 4. Finally, Section 5 gives our conclusion.

2 Related Work and Our Motivations

2.1 Related Work

Stolcke et al. (1998, 1996) proposed an approach to the detection of sentence boundaries and disfluency locations in speech transcribed by an automatic recognizer, based on a combination of prosodic cues modeled by decision trees and N-gram language models. Their N-gram language model is mainly based on part of speech, and retains some words which are particularly relevant to segmentation. They applied these part-of-speech-based N-gram language models to utterance segmentation, and then combined them with prosodic models. Compared with the N-gram language models alone, their combined models achieved an improvement of 0.5% and 2.3% in precision and recall respectively. Of course, most part-of-speech taggers require sentence boundaries to be pre-determined, so requiring the use of part-of-speech information in utterance segmentation would risk circularity. Cettolo et al.'s (1998) approach to sentence boundary detection is somewhat similar to Stolcke et al.'s.

Beeferman et al. (1998) used the CYBERPUNC system to add intra-sentence punctuation (especially commas) to the output of an automatic speech recognition (ASR) system. They claim that, since commas are the most frequently used punctuation symbols, their correct insertion is by far the most helpful addition for making texts legible. CYBERPUNC augmented a standard speech recognition model with lexical information concerning commas, and achieved a precision of 75.6% and a recall of 65.6% when tested on 2,317 sentences from the Wall Street Journal.

Gotoh et al. (1998) applied a simple non-speech interval model to detect sentence boundaries in English broadcast speech transcripts. They compared their results with those of N-gram language models and found theirs far superior. However, broadcast speech transcripts are not really spoken language, but something more like spoken written language. Further, radio broadcasters speak formally, so their reading pauses match sentence boundaries quite well. It is thus understandable that the simple non-speech interval model outperforms the N-gram language model under these conditions; but the segmentation of natural utterances is quite different.

Zong et al. (2003) proposed an approach to utterance segmentation aimed at improving the performance of spoken language translation (SLT) systems. Their method is based on rules which are oriented toward keyword detection, template matching, and syntactic analysis. Since this approach is intended to facilitate translation in Chinese-to-English SLT systems, it rewrites long sentences as several simple units. Once again, these results cannot be regarded as general-purpose utterance segmentation. Furuse et al. (1998) similarly propose an input-splitting method for translating spoken language which includes many long or ill-formed expressions. The method splits an input into well-balanced translation units, using a semantic dictionary.

Ramaswamy et al. (1998) applied a maximum entropy approach to the detection of command boundaries in a conversational natural language user interface. They considered as their features words and their distances to potential boundaries. They posited 400 feature functions, and trained their weights using 3000 commands. The system then achieved a precision of 98.2% on a test set of 1900 commands. However, command sentences for conversational natural language user interfaces have much smaller vocabularies and simpler structures than the sentences of natural spoken language. In any case, this method has been very helpful to us in designing our own approach to utterance segmentation.

There are several additional approaches which were not designed for utterance segmentation but which can nevertheless provide useful ideas. For example, Reynar et al. (1997) proposed an approach to the disambiguation of punctuation marks. They considered only the first word to the left and right of any potential sentence boundary, and claimed that examining wider context was not beneficial. The features they considered included the candidate's prefix and suffix; the presence of particular characters in the prefix or suffix; whether the candidate was an honorific (e.g. Mr., Dr.); and whether the candidate was a corporate designator (e.g. Corp.). The system was tested on the Brown Corpus, and achieved a precision of 98.8%. Elsewhere, Nakano et al. (1999) proposed a method for incrementally understanding user utterances whose semantic boundaries were unknown. The method operated by incrementally finding plausible sequences of utterances that play crucial roles in the task execution of dialogues, and by utilizing beam search to deal with the ambiguity of boundaries as well as with syntactic and semantic ambiguities. Though the method does not require utterance segmentation before discourse processing, it employs special rule tables for the discontinuation of significant utterance boundaries. Such rule tables are not easy to maintain, and experimental results have demonstrated only that the method outperformed a baseline assuming pauses to be semantic boundaries.

2.2 Our Motivations

Though numerous methods for utterance segmentation have been proposed, many problems remain unsolved. One remaining problem relates to the language model. The N-gram model evaluates candidate sentence boundaries mainly according to their left context, and has achieved reasonably good results, but it cannot take into account the distant right context of the candidate. This is the reason that N-gram methods often wrongly divide long sentences into halves or multiple segments. For example: 小王病了一个星期 (Xiao Wang was ill for a week). The N-gram method is likely to insert a boundary mark between "了" and "一", which corresponds to our everyday impression that, reading from the left and not considering several more words to the right of the current word, we will probably take "小王病了" to be a whole sentence. However, we find that, if we search for sentence boundaries from right to left, such errors can be effectively avoided: in the present example, we will not take "一个星期" to be a whole sentence, and the search will continue until the word "小" is encountered. Accordingly, in order to avoid segmentation errors made by the normal N-gram method, we propose a reverse N-gram segmentation method (RN) which seeks sentence boundaries from right to left. Further, we integrate the two N-gram methods and propose a bi-directional N-gram method (BN), which takes into account both the left and the right context of a candidate segmentation site. Since the relative usefulness or significance of the two N-gram methods varies depending on the context, we propose a method of weighting them appropriately, using parameters generated by a maximum entropy method which takes as its features information about words in the context. This is our Maximum-Entropy-Weighted Bi-directional N-gram-based segmentation method (MEBN). We hope MEBN can retain the correct segments discovered by the usual N-gram algorithm, yet effectively skip the wrong segments.

3 Maximum-Entropy-Weighted Bi-directional N-gram-based Segmentation Method

3.1 Normal N-gram Algorithm (NN) for Utterance Segmentation

Assuming that W1W2...Wm (where m is a natural number) is a word sequence, we consider it as an n-order Markov chain, in which the word Wi (1 <= i <= m) is predicted by the n-1 words to its left. Here is the corresponding formula:

P(Wi | W1W2...Wi-1) = P(Wi | Wi-n+1...Wi-1)

From this conditional probability formula for a word, we can derive the probability of a word sequence W1W2...Wi:

P(W1W2...Wi) = P(W1W2...Wi-1) x P(Wi | W1W2...Wi-1)

Integrating the two formulas above, we get:

P(W1W2...Wi) = P(W1W2...Wi-1) x P(Wi | Wi-n+1...Wi-1)

Let us use SB to indicate a sentence boundary and add it to the word sequence. The values of P(W1W2...Wi SB Wi+1) and P(W1W2...Wi Wi+1) determine whether a specific word Wi (1 <= i <= m) is the final word of a sentence. We say Wi is the final word of a sentence if and only if

P(W1W2...Wi SB Wi+1) > P(W1W2...Wi Wi+1)

Taking the trigram as our example and considering the two cases where Wi-1 is and is not the final word of a sentence, P(W1W2...Wi SB Wi+1) and P(W1W2...Wi Wi+1) are computed respectively by the following two formulas:

P(W1W2...Wi SB Wi+1) = P(W1W2...SB Wi) x P(SB | SB Wi) x P(Wi+1 | Wi SB)
                     + P(W1W2...Wi-1 Wi) x P(SB | Wi-1 Wi) x P(Wi+1 | Wi SB)

P(W1W2...Wi Wi+1) = P(W1W2...SB Wi) x P(Wi+1 | SB Wi)
                  + P(W1W2...Wi-1 Wi) x P(Wi+1 | Wi-1 Wi)

In the normal N-gram method, the above iterative formulas are computed to search for sentence boundaries from W1 to Wm.
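To make the left-to-right recursion concrete, the following sketch walks a word sequence once and applies the two trigram formulas above at every candidate site. The trigram table, the uniform backoff probability, and the function names are ours; a real system would use properly smoothed estimates.

```python
def make_trigram(counts, backoff=1e-4):
    """Toy trigram model P(w | u, v): a probability table with a uniform
    backoff standing in for a smoothed model (values are hypothetical)."""
    def p(w, u, v):
        return counts.get((u, v, w), backoff)
    return p

SB = "<SB>"  # sentence boundary token, as in the paper

def nn_segment(words, p):
    """Left-to-right (NN) search: at each candidate site compare
    P(W1..Wi SB Wi+1) with P(W1..Wi Wi+1), each expanded over the two
    cases 'the previous word was / was not sentence-final'."""
    boundaries = []
    # probabilities of the prefix ending "... SB Wi" and "... Wi-1 Wi"
    with_sb, no_sb = 0.0, 1.0
    for i in range(1, len(words)):
        prev2 = words[i - 2] if i >= 2 else "<s>"
        prev1, cur = words[i - 1], words[i]
        p_sb = (with_sb * p(SB, SB, prev1) + no_sb * p(SB, prev2, prev1)) * p(cur, prev1, SB)
        p_no = with_sb * p(cur, SB, prev1) + no_sb * p(cur, prev2, prev1)
        if p_sb > p_no:
            boundaries.append(i)  # boundary between words[i-1] and words[i]
        with_sb, no_sb = p_sb, p_no
    return boundaries
```

With a toy table that makes a boundary after the second word likely (high P(SB | a b) and P(c | b SB)), `nn_segment(["a", "b", "c", "d"], p)` reports site 2 and no others.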

3.2 Reverse N-gram Algorithm (RN) for Utterance Segmentation

In the reverse N-gram segmentation method, we take the word sequence W1W2...Wm as a reverse Markov chain in which Wi (1 <= i <= m) is predicted by the n-1 words to its right. That is:

P(Wi | WmWm-1...Wi+1) = P(Wi | Wi+n-1...Wi+1)

As in the N-gram algorithm, we compute the probability of the word sequence using the formula:

P(WmWm-1...Wi) = P(WmWm-1...Wi+1) x P(Wi | WmWm-1...Wi+1)

Then the iterative computation formula is:

P(WmWm-1...Wi) = P(WmWm-1...Wi+1) x P(Wi | Wi+n-1...Wi+1)

By adding SB to the word sequence, we say Wi is the final word of a sentence if and only if

P(WmWm-1...Wi+1 SB Wi) > P(WmWm-1...Wi+1 Wi)

Similar to NN, P(WmWm-1...Wi+1 SB Wi) and P(WmWm-1...Wi+1 Wi) are computed as follows in the trigram:

P(WmWm-1...Wi+1 SB Wi) = P(WmWm-1...SB Wi+1) x P(SB | SB Wi+1) x P(Wi | Wi+1 SB)
                       + P(WmWm-1...Wi+2 Wi+1) x P(SB | Wi+2 Wi+1) x P(Wi | Wi+1 SB)

P(WmWm-1...Wi+1 Wi) = P(WmWm-1...SB Wi+1) x P(Wi | SB Wi+1)
                    + P(WmWm-1...Wi+2 Wi+1) x P(Wi | Wi+2 Wi+1)

In contrast to the normal N-gram segmentation method, we compute the above iterative formulas to seek sentence boundaries from Wm to W1.

3.3 Bi-directional N-gram Algorithm (BN) for Utterance Segmentation

From the iterative formulas of the normal N-gram algorithm and the reverse N-gram algorithm, we can see that the normal N-gram method recognizes a candidate sentence boundary location mainly according to its left context, while the reverse N-gram method mainly depends on its right context. Theoretically at least, it is reasonable to suppose that, if we synthetically consider both the left and the right context by integrating the NN and the RN, the overall segmentation accuracy will be improved.

Considering the word sequence W1W2...Wm, the candidate sites for sentence boundaries may be found between W1 and W2, between W2 and W3, ..., or between Wm-1 and Wm. The number of candidate sites is thus m-1. We number those m-1 candidate sites 1, 2, ..., m-1 in succession, and we use Pis(i) (1 <= i <= m-1) and Pno(i) (1 <= i <= m-1) respectively to indicate the probability that the current site i really is, or is not, a sentence boundary. Thus, to compute the word sequence segmentation, we must compute Pis(i) and Pno(i) for each of the m-1 candidate sites. In the bi-directional BN, we compute Pis(i) and Pno(i) by combining the NN results and RN results. The combination is described by the following formulas:

Pis_BN(i) = Pis_NN(i) x Pis_RN(i)
Pno_BN(i) = Pno_NN(i) x Pno_RN(i)

where Pis_NN(i) and Pno_NN(i) denote the probabilities calculated by NN, which correspond to P(W1W2...Wi SB Wi+1) and P(W1W2...Wi Wi+1) in Section 3.1 respectively, and Pis_RN(i) and Pno_RN(i) denote the probabilities calculated by RN, which correspond to P(WmWm-1...Wi+1 SB Wi) and P(WmWm-1...Wi+1 Wi) in Section 3.2 respectively. We say there exists a sentence boundary at site i (1 <= i <= m-1) if and only if Pis_BN(i) > Pno_BN(i).
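Since RN is simply the same recursion run over the reversed word sequence with a right-to-left trigram model, the two passes can share one routine, and BN then multiplies the per-site scores. The following is a sketch under the same assumptions as before (toy probability tables, function names ours):

```python
SB = "<SB>"  # sentence boundary token

def table_model(counts, backoff=1e-4):
    """P(w | u, v) from a toy probability table (hypothetical values)."""
    return lambda w, u, v: counts.get((u, v, w), backoff)

def directional_scores(words, p):
    """One left-to-right trigram pass, returning (P_is(i), P_no(i)) for
    each candidate site i = 1..m-1. Running it on the reversed word list
    with a right-to-left model yields the RN scores."""
    scores = []
    with_sb, no_sb = 0.0, 1.0
    for i in range(1, len(words)):
        prev2 = words[i - 2] if i >= 2 else "<s>"
        prev1, cur = words[i - 1], words[i]
        p_sb = (with_sb * p(SB, SB, prev1) + no_sb * p(SB, prev2, prev1)) * p(cur, prev1, SB)
        p_no = with_sb * p(cur, SB, prev1) + no_sb * p(cur, prev2, prev1)
        scores.append((p_sb, p_no))
        with_sb, no_sb = p_sb, p_no
    return scores

def bn_segment(words, p_fwd, p_rev):
    """Bi-directional combination: site i is a boundary iff
    P_is_NN(i) * P_is_RN(i) > P_no_NN(i) * P_no_RN(i)."""
    nn = directional_scores(words, p_fwd)
    rn = directional_scores(list(reversed(words)), p_rev)
    rn.reverse()  # site j of the reversed sequence is site m-j of the original
    return [i + 1 for i, ((a, b), (c, d)) in enumerate(zip(nn, rn)) if a * c > b * d]
```

Note the index flip when mapping RN scores back to the original sequence: candidate site j of the reversed sequence sits between the same two words as site m-j of the original.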
3.4 Maximum Entropy Approach for Utterance Segmentation

In this section, we explain our maximum-entropy-based model for utterance segmentation. That is, we estimate the joint probability distribution of the candidate sites and their surrounding words. Since we consider information concerning the lexical context to be useful, we define the feature functions for our maximum entropy method as follows:

f_j10(b, c) = 1 if (include(Prefix(c), Sj) && b == 0), else 0
f_j11(b, c) = 1 if (include(Prefix(c), Sj) && b == 1), else 0
f_j20(b, c) = 1 if (include(Suffix(c), Sj) && b == 0), else 0
f_j21(b, c) = 1 if (include(Suffix(c), Sj) && b == 1), else 0

Sj denotes a sequence of one or more words, which we call a Matching String. (Note that Sj may contain the sentence boundary mark SB.) The candidate c's state is denoted by b, where b = 1 indicates that c is a sentence boundary and b = 0 indicates that it is not. Prefix(c) denotes all the word sequences ending with c (that is, c's left context plus c), and Suffix(c) denotes all the word sequences beginning with c (in other words, c plus its right context). For example, in the utterance 去机场怎么走 (how do I get to the airport), with c3 the candidate site after the third character, '场', '机场', and '去机场' are c3's Prefixes, while '怎', '怎么', and '怎么走' are c3's Suffixes. The value of the function include(Prefix(c), Sj) is true when the word sequence Sj is one of c's Prefixes, and the value of include(Suffix(c), Sj) is true when Sj is one of c's Suffixes.

Corresponding to the four feature functions f_j10(b,c), f_j11(b,c), f_j20(b,c), f_j21(b,c) are the four parameters alpha_j10, alpha_j11, alpha_j20, alpha_j21. Thus the joint probability distribution of the candidate sites and their surrounding contexts is given by:

P(c, b) = pi x Prod_{j=1..k} (alpha_j10^f_j10(b,c) x alpha_j11^f_j11(b,c) x alpha_j20^f_j20(b,c) x alpha_j21^f_j21(b,c))

where k is the total number of Matching Strings and pi is a parameter set to make P(c,1) and P(c,0) sum to 1. The unknown parameters alpha_j10, alpha_j11, alpha_j20, alpha_j21 are chosen to maximize the likelihood of the training data using the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972). In the maximum entropy approach, we say that a candidate site is a sentence boundary if and only if P(c, 1) > P(c, 0).

(At this point, we can anticipate a technical problem with the maximum entropy approach to utterance segmentation. When a Matching String contains SB, we cannot know whether it belongs to the Prefixes or Suffixes of the candidate site until the left and right contexts of the candidate site have been segmented. Thus if the segmentation proceeds from left to right, the lexical information in the right context of the current candidate site will always remain uncertain. Likewise, if it proceeds from right to left, the information in the left context of the current candidate site remains uncertain. The next subsection describes a pragmatic solution to this problem.)
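The scoring side of this model can be sketched as follows. Here `alpha` maps a (Matching String, side, b) triple to its parameter, with absent features contributing a factor of 1; the alpha values in the usage example are hypothetical stand-ins for parameters fit by Generalized Iterative Scaling, and all names are ours.

```python
from math import prod

def me_boundary_prob(prefixes, suffixes, alpha):
    """Joint scores (P(c,0), P(c,1)) for one candidate site, in the form
    P(c,b) = pi * prod_j alpha_j^{f_j(b,c)}.

    `prefixes` / `suffixes` are the Matching Strings found in the
    candidate's left and right context; `alpha` maps (string, side, b)
    to its parameter, defaulting to 1.0 when the feature is absent.
    """
    def score(b):
        fired = [(s, "prefix", b) for s in prefixes] + [(s, "suffix", b) for s in suffixes]
        return prod(alpha.get(f, 1.0) for f in fired)

    s0, s1 = score(0), score(1)
    pi = 1.0 / (s0 + s1)  # normalizer: makes P(c,0) + P(c,1) sum to 1
    return pi * s0, pi * s1
```

For the 去机场怎么走 example, feeding c3's three Prefixes and three Suffixes together with a parameter table that rewards, say, 机场 as a pre-boundary string yields P(c,1) > P(c,0).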

3.5 Maximum-Entropy-Weighted Bi-directional N-gram Algorithm (MEBN) for Utterance Segmentation

In the bi-directional N-gram-based algorithm, we have considered the left-to-right N-gram algorithm and the right-to-left algorithm as having the same significance. Actually, however, they should be assigned differing weights, depending on the lexical contexts. The combination formulas are as follows:

Pis(i) = Wn_is(Ci) x Pis_NN(i) x Wr_is(Ci) x Pis_RN(i)
Pno(i) = Wn_no(Ci) x Pno_NN(i) x Wr_no(Ci) x Pno_RN(i)

Wn_is(Ci), Wn_no(Ci), Wr_is(Ci), Wr_no(Ci) are functions of the context Ci surrounding candidate site i, which denote the weights of Pis_NN(i), Pno_NN(i), Pis_RN(i) and Pno_RN(i) respectively. Assuming that the weights of Pis_NN(i) and Pno_NN(i) depend upon the context to the left of the candidate site, and that the weights of Pis_RN(i) and Pno_RN(i) depend on the context to the right of the candidate site, the weight functions can be rewritten as Wn_is(LeftCi), Wn_no(LeftCi), Wr_is(RightCi), Wr_no(RightCi).

It is reasonable to assume that as the joint probability P(LeftCi, i = SB) rises, Pis_NN(i) will increase in significance. (The joint probability in question is the probability of the current candidate's left context, taken together with the probability that the candidate is a sentence boundary.) Therefore the value of Wn_is(LeftCi) is given by:

Wn_is(LeftCi) = P(LeftCi, i = SB)

Similarly, we can give the formulas for computing Wn_no(LeftCi), Wr_is(RightCi), and Wr_no(RightCi) as follows:

Wn_no(LeftCi) = P(LeftCi, i != SB)
Wr_is(RightCi) = P(RightCi, i = SB)
Wr_no(RightCi) = P(RightCi, i != SB)

We can easily get the values of P(LeftCi, i = SB), P(LeftCi, i != SB), P(RightCi, i = SB), and P(RightCi, i != SB) using the method described in the maximum entropy section. For example:

P(LeftCi, i = SB) = pi x Prod_{j=1..k} alpha_j11^f_j11(1, i)
P(LeftCi, i != SB) = pi x Prod_{j=1..k} alpha_j10^f_j10(0, i)

As mentioned in the last subsection, the maximum entropy approach needs segmented contexts. Since the maximum entropy parameters in the MEBN algorithm are used to modify NN and RN, we simply estimate the joint probability of the candidate and its surrounding contexts based upon the segments produced by NN and RN. Using NLeftCi to indicate the left context of candidate i as segmented by the NN algorithm, and RRightCi to indicate the right context of i as segmented by RN, the combined probability formulas for MEBN are as follows:

Pis_MEBN(i) = P(NLeftCi, i = SB) x Pis_NN(i) x P(RRightCi, i = SB) x Pis_RN(i)
Pno_MEBN(i) = P(NLeftCi, i != SB) x Pno_NN(i) x P(RRightCi, i != SB) x Pno_RN(i)

We evaluate site i as a sentence boundary if and only if Pis_MEBN(i) > Pno_MEBN(i).

4 Experiment

4.1 Model Training

Our models are trained on both Chinese and English corpora, which cover the domains of hotel reservation, flight booking, traffic information, sightseeing, daily life and so on. We replaced the full stops with "SB" and removed all other punctuation marks in the training corpora. Since in most actual systems part-of-speech information cannot be accessed before the sentence boundaries are determined, we use Chinese characters and English words without POS tags as the units of our N-gram models. Trigram and reverse trigram probabilities are estimated from the processed training corpus using Modified Kneser-Ney Smoothing (Chen and Goodman, 1998). As to the maximum entropy model, the Matching Strings are chosen as all the word sequences occurring in the training corpus whose length is no more than 3 words. The unknown parameters corresponding to the feature functions are generated from the training corpus using the Generalized Iterative Scaling algorithm. Table 1 gives an overview of the training corpus.

Corpus  | Size   | SB Number | Average Sentence Length
Chinese | 4.02MB | 148967    | 8 Chinese characters
English | 4.49MB | 149311    | 6 words

Table 1. Overview of the Training Corpus.

4.2 Testing Results

We test our methods using open corpora which are also limited to the domains mentioned above. All punctuation marks are removed from the test corpora. An overview of the test corpus appears in Table 2.

Corpus  | Size  | SB Number | Average Sentence Length
Chinese | 412KB | 12032     | 10 Chinese characters
English | 391KB | 10518     | 7 words

Table 2. Overview of the Testing Corpus.

We have implemented four segmentation algorithms using NN, RN, BN and MEBN respectively.
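Per candidate site, the MEBN decision of Section 3.5 reduces to one weighted comparison. A minimal sketch (the eight inputs correspond to the N-gram scores and the maximum entropy joint probabilities defined above; the function and argument names are ours):

```python
def mebn_decide(p_is_nn, p_no_nn, p_is_rn, p_no_rn,
                left_sb, left_no, right_sb, right_no):
    """MEBN decision for one candidate site: NN scores are weighted by
    the maximum entropy joint probabilities of the NN-segmented left
    context (P(NLeftC_i, i = SB) and P(NLeftC_i, i != SB)), and RN
    scores by those of the RN-segmented right context."""
    p_is = left_sb * p_is_nn * right_sb * p_is_rn
    p_no = left_no * p_no_nn * right_no * p_no_rn
    return p_is > p_no
```

The effect of the weighting is visible even with made-up numbers: a site whose N-gram evidence is weak but whose left and right contexts strongly resemble sentence-final contexts can still be accepted, and vice versa.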

If we use RightNum to denote the number of right segmentations, WrongNum to denote the number of wrong segmentations, and TotalNum to denote the number of segmentations in the original testing corpus, the precision (P) is computed as P = RightNum / (RightNum + WrongNum), the recall (R) as R = RightNum / TotalNum, and the F-Score as F-Score = 2 x P x R / (P + R). The testing results are described in Table 3 and Table 4.

Methods | TotalNum | RightNum | WrongNum | Precision | Recall | F-Score
NN      | 12032    | 10167    | 2638     | 79.4%     | 84.5%  | 81.9%
RN      | 12032    | 10396    | 2615     | 79.9%     | 86.4%  | 83.0%
BN      | 12032    | 10528    | 2249     | 82.4%     | 87.5%  | 84.9%
MEBN    | 12032    | 10348    | 1587     | 86.7%     | 86.0%  | 86.3%

Table 3. Experimental Results for Chinese Utterance Segmentation.

Methods | TotalNum | RightNum | WrongNum | Precision | Recall | F-Score
NN      | 10518    | 8730     | 3164     | 73.4%     | 83.0%  | 77.9%
RN      | 10518    | 9014     | 3351     | 72.9%     | 85.7%  | 78.8%
BN      | 10518    | 9056     | 3019     | 75.0%     | 86.1%  | 80.2%
MEBN    | 10518    | 8929     | 2403     | 78.8%     | 84.9%  | 81.7%

Table 4. Experimental Results for English Utterance Segmentation.

From the result tables it is clear that RN, BN, and MEBN all outperform the normal N-gram algorithm in F-score for both Chinese and English utterance segmentation. MEBN achieves the best performance: it improves precision by 7.3% and recall by 1.5% in the Chinese experiment, and improves precision by 5.4% and recall by 1.9% in the English experiment.
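These definitions are easy to check in code; the small helper below (the function name is ours) reproduces, for example, the NN row of Table 3 from its RightNum, WrongNum and TotalNum values.

```python
def prf(right_num, wrong_num, total_num):
    """Precision, recall and F-score as defined in Section 4.2."""
    p = right_num / (right_num + wrong_num)   # P = Right / (Right + Wrong)
    r = right_num / total_num                 # R = Right / Total
    f = 2 * p * r / (p + r)                   # harmonic mean of P and R
    return p, r, f
```

`prf(10167, 2638, 12032)` gives approximately (0.794, 0.845, 0.819), matching the reported 79.4% / 84.5% / 81.9%.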
4.3 Result Analysis

MEBN was proposed in order to maintain the correct segments of the normal N-gram algorithm while skipping the wrong segments. In order to see whether this original intention has been realized, we compare the segments determined by RN, BN, and MEBN in turn with those determined by NN. For RN, BN and MEBN, TN denotes the total number of segmentations; CON denotes the number of correct segmentations overlapping with those found by NN; SWN denotes the number of wrong NN segmentations which were skipped; WNON denotes the number of wrong segmentations not overlapping with those of NN; and CNON denotes the number of segmentations which were correct but did not overlap with those of NN. The statistical results are listed in Table 5 and Table 6.

Methods | TN    | CON  | SWN  | WNON | CNON
RN      | 13011 | 9525 | 1098 | 1077 | 870
BN      | 12777 | 9906 | 753  | 355  | 622
MEBN    | 11935 | 9646 | 1274 | 223  | 678

Table 5. Chinese Utterance Segmentation Results Comparison.

Methods | TN    | CON  | SWN  | WNON | CNON
RN      | 12365 | 8223 | 1077 | 1271 | 792
BN      | 12075 | 8565 | 640  | 488  | 491
MEBN    | 11332 | 8370 | 1247 | 486  | 559

Table 6. English Utterance Segmentation Results Comparison.

Focusing on the Chinese results, we can see that RN skips 1098 incorrect segments found by NN, and has 9525 correct segments in common with NN. This verifies our supposition that RN can effectively avoid some errors made by NN. But because RN at the same time brings in 1077 new errors, it does not improve much in precision. BN skips 753 incorrect segments and brings in 355 new segmentation errors; it has 9906 correct segments in common with NN and brings in 622 new correct segments. So by equally integrating NN and RN, BN on the one hand finds more correct segments, and on the other hand brings in fewer wrong segments than NN. But in skipping incorrect NN segments, BN still performs worse than RN, showing that it exploits the error-skipping ability of RN only to some extent. As for MEBN, it skips 1274 incorrect segments while bringing in only 223 new incorrect segments. Additionally, it maintains 9646 correct segments in common with NN and brings in 678 new correct segments. In recall MEBN performs a little worse than BN, but in precision it achieves much better performance, showing that, modified by the maximum entropy weights, MEBN exploits the error-skipping ability of RN more effectively.
Further, in skipping wrong NN segments, MEBN even outperforms RN, which indicates that the weights we set on NN and RN not only act as modifying parameters, but also have a direct beneficial effect on utterance segmentation.

5 Conclusion

This paper proposes a reverse N-gram algorithm, a bi-directional N-gram algorithm and a Maximum-entropy-weighted Bi-directional N-gram algorithm for utterance segmentation. The experimental results for both Chinese and English utterance segmentation show that MEBN significantly outperforms the usual N-gram algorithm. This is because MEBN takes into account both the left and right contexts of candidate sites: it integrates the left-to-right N-gram algorithm and the right-to-left N-gram algorithm with appropriate weights, using clues from the sites' lexical context, as modeled by maximum entropy.

Acknowledgements

This work is sponsored by the Natural Sciences Foundation of China under grant No. 60175012, and supported by the National Key Fundamental Research Program (the 973 Program) of China under grant G1998030504. The authors are very grateful to Dr. Mark Seligman for his very useful suggestions and his very careful proofreading.

References

Beeferman D., A. Berger, and J. Lafferty. 1998. CYBERPUNC: A lightweight punctuation annotation system for speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, pp. 689-692.

Beeferman D., A. Berger, and J. Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34, pp. 177-210.

Berger A., S. Della Pietra, and V. Della Pietra. 1996. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), pp. 39-71.

Cettolo M. and D. Falavigna. 1998. Automatic Detection of Semantic Boundaries Based on Acoustic and Lexical Knowledge. ICSLP 1998, pp. 1551-1554.

Chen S. F. and J. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University.

Darroch J. N. and D. Ratcliff. 1972. Generalized Iterative Scaling for Log-Linear Models. The Annals of Mathematical Statistics, 43(5), pp. 1470-1480.

Furuse O., S. Yamada, and K. Yamamoto. 1998. Splitting Long or Ill-formed Input for Robust Spoken-language Translation. COLING-ACL 1998, pp. 421-427.

Gotoh Y. and S. Renals. 2000. Sentence Boundary Detection in Broadcast Speech Transcripts. In Proc. International Workshop on Automatic Speech Recognition, pp. 228-235.

Nakano M., N. Miyazaki, J. Hirasawa, K. Dohsaka, and T. Kawabata. 1999. Understanding Unsegmented User Utterances in Real-Time Spoken Dialogue Systems. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), College Park, MD, USA, pp. 200-207.

Ramaswamy G. N. and J. Kleindienst. 1998. Automatic Identification of Command Boundaries in a Conversational Natural Language User Interface. ICSLP 1998, pp. 401-404.

Reynar J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP), Washington DC, pp. 16-19.

Seligman M. 2000. Nine Issues in Speech Translation. Machine Translation, 15, pp. 149-185.

Stevenson M. and R. Gaizauskas. 2000. Experiments on sentence boundary detection. In Proceedings of the Sixth Conference on Applied Natural Language Processing and the First Conference of the North American Chapter of the Association for Computational Linguistics, pp. 24-30.

Stolcke A. and E. Shriberg. 1996. Automatic linguistic segmentation of conversational speech. Proc. Intl. Conf. on Spoken Language Processing, Philadelphia, PA, vol. 2, pp. 1005-1008.

Stolcke A., E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauche, G. Tur, and Y. Lu. 1998. Automatic Detection of Sentence Boundaries and Disfluencies based on Recognized Words. Proc. Intl. Conf. on Spoken Language Processing, Sydney, Australia, vol. 5, pp. 2247-2250.

Zhou Y. 2001. Utterance Segmentation Based on Decision Tree. Proceedings of the 6th National Joint Conference on Computational Linguistics, Taiyuan, China, pp. 246-252.

Zong C. and F. Ren. 2003. Chinese Utterance Segmentation in Spoken Language Translation. In Proceedings of the 4th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Mexico, Feb 16-22, pp. 516-525.