Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters

Hirofumi Yamamoto
ATR SLT
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto-fu, Japan
[email protected]

Shuntaro Isogai
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo-to, Japan
[email protected]

Yoshinori Sagisaka
GITI / ATR SLT
1-3-10 Nishi-Waseda, Shinjuku-ku, Tokyo-to, Japan
[email protected]

Abstract

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem of spoken language, for which it is difficult to collect sufficient training data. The Multi-Class Composite N-gram maintains an accurate word prediction capability and reliability for sparse data with a compact model size, based on multiple word clusters called Multi-Classes. In the Multi-Class, the statistical connectivity at each position of the N-gram is regarded as a separate word attribute, and one word cluster is created to represent each positional attribute. Furthermore, by introducing higher order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams. In experiments, the Multi-Class Composite N-grams result in 9.5% lower perplexity and a 16% lower word error rate in speech recognition with a 40% smaller parameter size than conventional word 3-grams.

1 Introduction

Word N-grams have been widely used as a statistical language model for language processing. Word N-grams are models that give the transition probability of the next word from the previous N-1 word sequence, based on a statistical analysis of a huge corpus. Though word N-grams are more effective and flexible than rule-based grammatical constraints in many cases, their performance strongly depends on the size of the training data, since they are statistical models.

In word N-grams, the accuracy of the word prediction capability increases with the order N, but the number of word transition combinations also increases exponentially. Moreover, the size of the training data needed for reliable transition probability values increases dramatically as well. This is a critical problem for spoken language, for which it is difficult to collect enough training data for a reliable model. As a solution to this problem, class N-grams have been proposed.

In class N-grams, multiple words are mapped to one word class, and the transition probabilities from word to word are approximated by the probabilities from word class to word class. The performance and model size of class N-grams strongly depend on the definition of the word classes. In fact, the performance of class N-grams based on part-of-speech (POS) word classes is usually considerably lower than that of word N-grams. Effective word class definitions are therefore required for high performance in class N-grams.

In this paper, the Multi-Class assignment is proposed for effective word class definitions. The word class is used to represent word connectivity, i.e. which words will appear in a neighboring position with what probability. In Multi-Class assignment, the word connectivity in each position of the N-gram is regarded as a different attribute, and multiple classes corresponding to each attribute are assigned to each word. For the word clustering of each Multi-Class for each word, a method is used in which word classes are formed automatically and statistically from a corpus, without using a priori knowledge such as POS information. Furthermore, by introducing higher order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams.

2 N-gram Language Models Based on Multiple Word Classes

2.1 Class N-grams

Word N-grams are models that statistically give the transition probability of the next word from the previous N-1 word sequence. This transition probability is given by the following formula:

  p(w_i | w_{i-N+1}, ..., w_{i-2}, w_{i-1})                                 (1)

In word N-grams, accurate word prediction can be expected, since a word-dependent, unique connectivity from word to word can be represented. On the other hand, the number of estimated parameters, i.e., the number of combinations of word transitions, is V^N for a vocabulary of size V. As V^N increases exponentially with N, reliable estimation of each word transition probability is difficult for large N.

Class N-grams have been proposed to resolve the problem that a huge number of parameters is required in word N-grams. In class N-grams, the transition probability of the next word from the previous N-1 word sequence is given by the following formula:

  p(c_i | c_{i-N+1}, ..., c_{i-2}, c_{i-1}) p(w_i | c_i)                    (2)

where c_i represents the word class to which the word w_i belongs.

In class N-grams with C classes, the number of estimated parameters is decreased from V^N to C^N. However, the word prediction accuracy will be lower than that of word N-grams trained on a sufficient amount of data, since the word-dependent, unique connectivity attribute can no longer be represented once words are approximated by their classes.

2.2 Problems in the Definition of Word Classes

In class N-grams, word classes are used to represent the connectivity between words. In the conventional word class definition, the connectivity to the words that follow and the connectivity to the words that precede are treated as the same neighboring characteristic, without distinction. Therefore, only words that have the same connectivity to both the following and the preceding words can belong to the same word class, and this word class definition cannot represent the word connectivity attributes efficiently. Take "a" and "an" as an example. Both are classified by POS as an Indefinite Article, and are assigned to the same word class. In this case, information about the difference in their following word connectivity is lost. On the other hand, assigning different classes to the two words loses the information that they share the same preceding word connectivity. This directional distinction is quite crucial for languages with inflection, such as French and Japanese.

2.3 Multi-Class and Multi-Class N-grams

As in the previous example of "a" and "an", the following and preceding word connectivities are not always the same. Let us consider the case of different connectivity for the words that precede and follow. Multiple word classes are assigned to each word to represent the following and preceding word connectivity separately. As the connectivity to the words preceding "a" and "an" is the same, it is efficient to assign them to the same word class to represent the preceding word connectivity, while assigning different word classes to represent the following word connectivity. Applying these word class definitions to formula (2) gives the following formula:

  p(c_i^t | c_{i-N+1}^{f_{N-1}}, ..., c_{i-2}^{f_2}, c_{i-1}^{f_1}) p(w_i | c_i^t)      (3)

In the above formula, c_i^t represents the word class in the target position to which the word w_i belongs, and c_i^{f_N} represents the word class in the N-th preceding position of the conditional word sequence. We call this multiple word class definition a Multi-Class. Similarly, we call class N-grams based on the Multi-Class, Multi-Class N-grams (Yamamoto and Sagisaka, 1999).
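To make the factorization in formula (3) concrete, the following sketch (ours, not from the paper; all class names and probability values are toy assumptions) evaluates a Multi-Class 2-gram probability: the class transition is looked up between the preceding word's following class c^f and the predicted word's target class c^t, and the word is then emitted from its target class.

```python
# Minimal sketch of a Multi-Class 2-gram, following formula (3).
# All tables below are illustrative toy values, not trained parameters.

# Each word gets two classes: a target class c_t (how it is predicted from
# the left context) and a following class c_f (how it predicts the next word).
multi_class = {
    "a":     {"t": "DET_t",  "f": "DET_a_f"},
    "an":    {"t": "DET_t",  "f": "DET_an_f"},  # same target class as "a",
                                                # different following class
    "apple": {"t": "NOUN_t", "f": "NOUN_f"},
}

# p(c_t(w_i) | c_f(w_{i-1})): class-to-class transition probabilities.
class_bigram = {
    ("DET_an_f", "NOUN_t"): 0.60,
    ("DET_a_f",  "NOUN_t"): 0.55,
}

# p(w_i | c_t(w_i)): word emission inside its target class.
word_given_class = {
    ("apple", "NOUN_t"): 0.02,
}

def multiclass_bigram_prob(prev_word: str, word: str) -> float:
    """p(word | prev_word) = p(c_t(word) | c_f(prev_word)) * p(word | c_t(word))."""
    c_f_prev = multi_class[prev_word]["f"]
    c_t_curr = multi_class[word]["t"]
    trans = class_bigram.get((c_f_prev, c_t_curr), 0.0)
    emit = word_given_class.get((word, c_t_curr), 0.0)
    return trans * emit

if __name__ == "__main__":
    print(multiclass_bigram_prob("an", "apple"))  # 0.60 * 0.02 = 0.012
    print(multiclass_bigram_prob("a",  "apple"))  # 0.55 * 0.02 = 0.011
```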

3 Automatic Extraction of Word Clusters

3.1 Word Clustering for Multi-Class 2-grams

For word clustering in class N-grams, POS information is sometimes used. Though POS information can be used for words that do not appear in the corpus, it is not always an optimal word classification for N-grams, because POS information does not accurately represent the statistical word connectivity characteristics. Better word clustering should be based on word connectivity, reflecting the neighboring characteristics observed in the corpus. In this paper, vectors are used to represent the word neighboring characteristics. The elements of the vectors are the smoothed forward or backward word 2-gram probabilities to the clustering target word. We consider that word pairs whose vectors are close to each other also have similar word neighboring characteristics (Brown et al., 1992) (Bai et al., 1998). In this method, the same vector is assigned to all words that do not appear in the corpus, so these words are assigned to the same word cluster. To avoid excessively rough clustering across different POS, we cluster the words under the condition that only words with the same POS can belong to the same cluster. Parts-of-speech that have the same connectivity in each Multi-Class are merged; for example, if different parts-of-speech are assigned to "a" and "an", these parts-of-speech are regarded as the same for the preceding word cluster. Word clustering is thus performed in the following manner (a small sketch of this procedure is given at the end of this section).

1. Assign one unique class per word.

2. Assign a vector to each class or to each word x. This represents the word connectivity attribute:

     v^t(x) = [p^t(w_1|x), p^t(w_2|x), ..., p^t(w_N|x)]                     (4)
     v^f(x) = [p^f(w_1|x), p^f(w_2|x), ..., p^f(w_N|x)]                     (5)

   where v^t(x) represents the preceding word connectivity, v^f(x) represents the following word connectivity, p^t is the probability of the succeeding class-word 2-gram or word 2-gram, and p^f is the same for the preceding one.

3. Merge two classes. We choose the two classes whose merge gives the lowest rise in the dispersion weighted with the 1-gram probability:

     U_new = sum_w ( p(w) D(v(c_new(w)), v(w)) )                            (6)
     U_old = sum_w ( p(w) D(v(c_old(w)), v(w)) )                            (7)

   That is, we merge the pair of classes whose merge cost U_new - U_old is the lowest. D(v_c, v_w) represents the square of the Euclidean distance between the vectors v_c and v_w, c_old represents the classes before merging, and c_new represents the classes after merging.

4. Repeat step 2 until the number of classes is reduced to the desired number.

3.2 Word Clustering for Multi-Class 3-grams

To apply the multiple clustering for 2-grams to 3-grams, the clustering target in the conditional part is extended from the single word of the 2-gram case to a word pair. The number of clustering targets in the preceding class increases from V to V^2, and the length of the vector in the succeeding class also increases to V^2. Therefore, efficient word clustering is needed to keep the 3-grams reliable after clustering while keeping the calculation cost reasonable.

To avoid losing reliability due to the data sparseness of the word pairs in the 3-gram histories, an approximation using distance-2 2-grams is employed. The validity of this approximation is supported by a report that the association of word 2-grams and distance-2 2-grams by the maximum entropy method gives a good approximation of word 3-grams (Zhang et al., 1999). The vector for clustering is given by the following equation:

  v^{f2}(x) = [p^{f2}(w_1|x), p^{f2}(w_2|x), ..., p^{f2}(w_N|x)]            (8)

where p^{f2}(y|x) represents the distance-2 2-gram value from word x to word y. The POS constraints for clustering are the same as in the clustering for preceding words.
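The clustering loop of Section 3.1 can be sketched as follows. This is our simplified illustration, not the authors' implementation: the vectors stand for the smoothed 2-gram probability vectors of equations (4), (5) or (8), class vectors are approximated here by 1-gram-weighted centroids of their member words, and candidate merges are searched by brute force. The merge cost is the rise in the 1-gram-weighted squared Euclidean distortion of equations (6) and (7).

```python
# Sketch of the greedy class merging of Section 3.1 (our simplified version).
import itertools
import numpy as np

def distortion(classes, vec, p1):
    """U = sum_w p(w) * ||v(class(w)) - v(w)||^2  (cf. equations (6), (7))."""
    u = 0.0
    for members in classes.values():
        weights = np.array([p1[w] for w in members])
        centroid = np.average([vec[w] for w in members], axis=0, weights=weights)
        u += sum(p1[w] * np.sum((centroid - vec[w]) ** 2) for w in members)
    return u

def cluster(vec, p1, pos, n_classes):
    """Greedily merge classes (same-POS only) until n_classes remain."""
    classes = {w: [w] for w in vec}            # step 1: one class per word
    while len(classes) > n_classes:            # repeat until the target size
        best, best_rise = None, None
        u_old = distortion(classes, vec, p1)
        for a, b in itertools.combinations(classes, 2):
            if pos[a] != pos[b]:               # POS constraint on merges
                continue
            merged = {k: v for k, v in classes.items() if k not in (a, b)}
            merged[a] = classes[a] + classes[b]
            rise = distortion(merged, vec, p1) - u_old   # merge cost
            if best_rise is None or rise < best_rise:
                best, best_rise = (a, b), rise
        if best is None:
            break                              # no mergeable pair left
        a, b = best                            # apply the cheapest merge
        classes[a] = classes[a] + classes.pop(b)
    return classes

if __name__ == "__main__":
    vec = {"a": np.array([0.8, 0.1]), "an": np.array([0.7, 0.2]),
           "apple": np.array([0.1, 0.9]), "pear": np.array([0.2, 0.8])}
    p1 = {"a": 0.4, "an": 0.1, "apple": 0.3, "pear": 0.2}
    pos = {"a": "DET", "an": "DET", "apple": "N", "pear": "N"}
    print(cluster(vec, p1, pos, n_classes=2))
```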

4 Multi-Class Composite N-grams

4.1 Multi-Class Composite 2-grams Introducing Variable Length Word Sequences

Let us consider the condition in which only the word sequence (A, B, C) has sufficient frequency within the sequence (X, A, B, C, D). In this case, the value of the word 2-gram p(B|A) can be used as a reliable value for the estimation of word B, as the frequency of the sequence (A, B) is sufficient. The value of the word 3-gram p(C|A, B) can be used for the estimation of word C for the same reason. For the estimation of words A and D, it is reasonable to use the value of the class 2-gram, since the value of the word N-gram is unreliable (note that the frequencies of the word sequences (X, A) and (C, D) are insufficient). Based on this idea, the transition probability of the word sequence (A, B, C, D) from word X is given by the following equation in the Multi-Class 2-gram:

  P = p(c^t(A) | c^f(X)) p(A | c^t(A))
      × p(B | A)
      × p(C | A, B)
      × p(c^t(D) | c^f(C)) p(D | c^t(D))                                    (9)

When the word succession A+B+C is introduced as a variable-length word sequence corresponding to (A, B, C), equation (9) can be rewritten exactly as the following equation (Deligne and Bimbot, 1995) (Masataki et al., 1996):

  P = p(c^t(A) | c^f(X)) p(A+B+C | c^t(A))
      × p(c^t(D) | c^f(C)) p(D | c^t(D))                                    (10)

Here, we find the following properties. The preceding word connectivity of the word succession A+B+C is the same as the connectivity of word A, the first word of A+B+C. The following connectivity is the same as that of the last word C. With these assignments, no new cluster is required, whereas conventional class N-grams require a new cluster for the new word succession.

  c^t(A+B+C) = c^t(A)                                                       (11)
  c^f(A+B+C) = c^f(C)                                                       (12)

Applying these relations to equation (10), the following equation is obtained:

  P = p(c^t(A+B+C) | c^f(X))
      × p(A+B+C | c^t(A+B+C))
      × p(c^t(D) | c^f(A+B+C))
      × p(D | c^t(D))                                                       (13)

Equation (13) means that if the frequency of an N-length word sequence is sufficient, we can partially introduce higher order word N-grams using an N-length word succession, while maintaining the reliability of the estimated probabilities and the form of the Multi-Class 2-grams. We call Multi-Class 2-grams that are created by partially introducing higher order word N-grams through word successions, Multi-Class Composite 2-grams. In addition, equation (13) shows that the number of parameters does not increase much when frequent word successions are added as word entries: only a 1-gram of the word succession A+B+C needs to be added to the conventional N-gram parameters. Multi-Class Composite 2-grams are created in the following manner (a small sketch of this loop is given after the list).

1. Start from a Multi-Class 2-gram as the initial state.

2. Find a word pair whose frequency is above the threshold.

3. Create a new word succession entry for the frequent word pair and add it to the lexicon. The preceding class of the word succession is the same as the preceding class of the first word in the pair, and its following class is the same as the following class of the last word in it.

4. Replace the frequent word pair in the training data with the word succession, and recalculate the frequencies of the word or word-succession pairs. In this way, the summation of the probabilities is always kept equal to 1.

5. Repeat from step 2 with the newly added word successions, until no more word pairs are found.
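A rough sketch of the succession-building loop above (our illustration; the corpus and threshold are toy values, whereas the paper uses a frequency threshold of 10). Each frequent adjacent pair is concatenated into a single token, the training data is rewritten, and counting is repeated on the rewritten data, so successions longer than two words emerge across iterations. The Multi-Class assignment of each new token (preceding class from its first word, following class from its last word) is omitted here.

```python
# Sketch of steps 1-5: iteratively turn frequent adjacent word pairs into
# single word-succession tokens ("A+B"), rewriting the corpus each round.
from collections import Counter

def introduce_successions(corpus, threshold):
    """corpus: list of token lists. Returns rewritten corpus and new tokens."""
    new_tokens = []
    while True:
        pair_counts = Counter()
        for sent in corpus:
            pair_counts.update(zip(sent, sent[1:]))        # step 2: count pairs
        frequent = [p for p, c in pair_counts.items() if c >= threshold]
        if not frequent:
            break                                          # step 5: stop
        pair = max(frequent, key=pair_counts.get)          # most frequent pair
        token = pair[0] + "+" + pair[1]                    # step 3: new entry
        new_tokens.append(token)
        rewritten = []
        for sent in corpus:                                # step 4: rewrite data
            out, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
                    out.append(token)
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            rewritten.append(out)
        corpus = rewritten
    return corpus, new_tokens

if __name__ == "__main__":
    corpus = [["thank", "you", "very", "much"]] * 12 + [["thank", "you"]] * 3
    corpus, added = introduce_successions(corpus, threshold=10)
    print(added)       # ['thank+you', 'thank+you+very', 'thank+you+very+much']
    print(corpus[0])   # ['thank+you+very+much']
```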

4.2 Extension to Multi-Class Composite 3-grams

Next, we put the word successions into the formulation of Multi-Class 3-grams. The transition probability to the word sequence (A, B, C, D, E, F) from the word pair (X, Y) is given by the following equation:

  P = p(c^t(A+B+C+D) | c^{f2}(X), c^{f1}(Y))
      × p(A+B+C+D | c^t(A+B+C+D))
      × p(c^t(E) | c^{f2}(Y), c^{f1}(A+B+C+D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(A+B+C+D), c^{f1}(E))
      × p(F | c^t(F))                                                       (14)

where the Multi-Classes of the word succession A+B+C+D are given by the following equations:

  c^t(A+B+C+D) = c^t(A)                                                     (15)
  c^{f2}(A+B+C+D) = c^{f2}(D)                                               (16)
  c^{f1}(A+B+C+D) = (c^{f2}(C), c^{f1}(D))                                  (17)

In equation (17), please notice that a class sequence (not a single class) is assigned as the preceding class of the word succession. This class sequence is the preceding class of the last word of the word succession and the pre-preceding class of the second word from the last. Applying these class assignments to equation (14) gives the following equation:

  P = p(c^t(A) | c^{f2}(X), c^{f1}(Y))
      × p(A+B+C+D | c^t(A))
      × p(c^t(E) | c^{f2}(C), c^{f1}(D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(D), c^{f1}(E))
      × p(F | c^t(F))                                                       (18)

In the above formulation, the parameter increase over the Multi-Class 3-gram is p(A+B+C+D | c^t(A)). After expanding this term, the following equation is obtained:

  P = p(c^t(A) | c^{f2}(X), c^{f1}(Y))
      × p(A | c^t(A))
      × p(B | A)
      × p(C | A, B)
      × p(D | A, B, C)
      × p(c^t(E) | c^{f2}(C), c^{f1}(D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(D), c^{f1}(E))
      × p(F | c^t(F))                                                       (19)

In equation (19), all words except B are estimated by models that are the same as or more accurate than Multi-Class 3-grams (Multi-Class 3-grams for words A, E and F, and a word 3-gram and word 4-gram for words C and D). However, for word B, a word 2-gram is used instead of the Multi-Class 3-gram, and its accuracy is lower than that of the Multi-Class 3-gram. To prevent this decrease in estimation accuracy, the following process is introduced.

First, the 3-gram entry p(c^t(E) | c^{f2}(Y), c^{f1}(A+B+C+D)) is removed. After this deletion, back-off smoothing is applied to this entry as follows:

  p(c^t(E) | c^{f2}(Y), c^{f1}(A+B+C+D))
      = b(c^{f2}(Y), c^{f1}(A+B+C+D))
      × p(c^t(E) | c^{f1}(A+B+C+D))                                         (20)

Next, we assign the following value to the back-off parameter in equation (20). This value is used to correct the decrease in the accuracy of the estimation of word B:

  b(c^{f2}(Y), c^{f1}(A+B+C+D))
      = p(c^t(B) | c^{f2}(Y), c^{f1}(A)) p(B | c^t(B)) / p(B | A)           (21)

After this assignment, the probabilities of words B and E are locally incorrect. However, the total probability is correct, since the back-off parameter corrects the decrease in the accuracy of the estimation of word B. In fact, applying equations (20) and (21) to equation (14) according to the above definitions gives the following equation, in which the probability of word B is changed from a word 2-gram to a class 3-gram:

  P = p(c^t(A) | c^{f2}(X), c^{f1}(Y))
      × p(A | c^t(A))
      × p(c^t(B) | c^{f2}(Y), c^{f1}(A))
      × p(B | c^t(B))
      × p(C | A, B)
      × p(D | A, B, C)
      × p(c^t(E) | c^{f2}(C), c^{f1}(D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(D), c^{f1}(E))
      × p(F | c^t(F))                                                       (22)

In the above process, only two kinds of parameters are additionally used. One is the word 1-gram of the word succession, p(A+B+C+D). The other is the word 2-gram of the first two words of the word succession. The number of combinations of the first two words of the word successions is at most the number of word successions. Therefore, the number of additional parameters in the Multi-Class Composite 3-gram is at most twice the number of introduced word successions.
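As a small illustration of the correction in equations (20) and (21), the sketch below (ours; all numbers are made up) computes the back-off weight stored for the context (c^{f2}(Y), c^{f1}(A+B+C+D)). Multiplying word B's 2-gram contribution from equation (19) by this weight yields exactly the class 3-gram contribution that appears for B in equation (22).

```python
# Toy illustration of the back-off weight of equation (21).
# All probabilities are made-up values for the sketch.

def backoff_weight(p_class3_B, p_B_given_class, p_B_given_A):
    """b(c_f2(Y), c_f1(A+B+C+D)) =
       p(c_t(B) | c_f2(Y), c_f1(A)) * p(B | c_t(B)) / p(B | A)."""
    return p_class3_B * p_B_given_class / p_B_given_A

# Word B inside the succession A+B+C+D:
p_B_given_A     = 0.20   # word 2-gram used when expanding eq. (19)
p_class3_B      = 0.30   # p(c_t(B) | c_f2(Y), c_f1(A))
p_B_given_class = 0.50   # p(B | c_t(B))

b = backoff_weight(p_class3_B, p_B_given_class, p_B_given_A)

# B's contribution in eq. (19) times the back-off weight equals B's
# class 3-gram contribution in eq. (22):
print(p_B_given_A * b)               # 0.15
print(p_class3_B * p_B_given_class)  # 0.15
```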

5 Evaluation Experiments

5.1 Evaluation of Multi-Class N-grams

We have evaluated Multi-Class N-grams in terms of perplexity, defined by the following equations:

  Entropy = -(1/N) sum_i log_2 p(w_i)                                       (23)
  Perplexity = 2^Entropy                                                    (24)

The Good-Turing discount is used for smoothing. The perplexity is compared with those of word 2-grams and word 3-grams. The evaluation data set is the ATR Spoken Language Database (Takezawa et al., 1998). The total number of words in the training set is 1,387,300, the vocabulary size is 16,531, and 5,880 words in 42 conversations that are not included in the training set are used for the evaluation.

Figure 1 shows the perplexity of Multi-Class 2-grams for each number of classes. In the Multi-Class, the numbers of following and preceding classes are fixed to the same value just for comparison. As shown in the figure, the Multi-Class 2-gram with 1,200 classes gives the lowest perplexity, 22.70, which is smaller than the 23.93 of the conventional word 2-gram.

[Figure 1: Perplexity of Multi-Class 2-grams (perplexity vs. number of classes, 400-1,600, for the Multi-Class 2-gram and the word 2-gram baseline).]

Figure 2 shows the perplexity of Multi-Class 3-grams for each number of classes. The numbers of following and preceding classes are fixed to 1,200 (which gives the lowest perplexity in Multi-Class 2-grams), and the number of pre-preceding classes is changed from 100 to 1,500. As shown in this figure, Multi-Class 3-grams result in lower perplexity than the conventional word 3-gram, indicating the reasonableness of word clustering based on the distance-2 2-gram.

[Figure 2: Perplexity of Multi-Class 3-grams (perplexity vs. number of pre-preceding classes, 100-1,500, for the Multi-Class 3-gram and the word 3-gram baseline).]

5.2 Evaluation of Multi-Class Composite N-grams

We have also evaluated Multi-Class Composite N-grams in perplexity, under the same conditions as for the Multi-Class N-grams in the previous section. The Multi-Class 2-gram is used as the initial condition of the Multi-Class Composite 2-gram. The frequency threshold for introducing word successions is set to 10, based on a preliminary experiment. The same word succession set as that of the Multi-Class Composite 2-gram is used for the Multi-Class Composite 3-gram. The evaluation results are shown in Table 1. Table 1 shows that the Multi-Class Composite 3-gram results in 9.5% lower perplexity with a 40% smaller parameter size than the conventional word 3-gram, and that it is in fact a compact and high-performance model.

Table 1: Evaluation of Multi-Class Composite N-grams in Perplexity

  Kind of model                   Perplexity   Number of parameters
  Word 2-gram                     23.93          181,555
  Multi-Class 2-gram              22.70           81,556
  Multi-Class Composite 2-gram    19.81           92,761
  Word 3-gram                     17.88          713,154
  Multi-Class 3-gram              17.38          438,130
  Multi-Class Composite 3-gram    16.20          455,431
  Word 4-gram                     17.45        1,703,207
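The perplexity figures above follow equations (23) and (24); a minimal sketch (ours, with made-up per-word probabilities) is:

```python
# Test-set perplexity per equations (23) and (24):
# Entropy = -(1/N) * sum_i log2 p(w_i),  Perplexity = 2 ** Entropy.
import math

def perplexity(word_probs):
    """word_probs: model probabilities p(w_i) of each test-set word."""
    n = len(word_probs)
    entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2.0 ** entropy

if __name__ == "__main__":
    # Made-up per-word probabilities from some language model.
    probs = [0.05, 0.10, 0.02, 0.20, 0.04]
    print(round(perplexity(probs), 2))
```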

5.3 Evaluation in Continuous Speech Recognition

Though perplexity is a good measure of the performance of language models, it does not always have a direct bearing on performance in language processing. We have therefore also evaluated the proposed model in continuous speech recognition. The experimental conditions are as follows:

- Evaluation set
  - The same 42 conversations as used in the evaluation of perplexity
- Acoustic features
  - Sampling rate: 16 kHz
  - Frame shift: 10 msec
  - Mel-cepstrum 12 + power and their deltas, 26 parameters in total
- Acoustic models
  - 800-state, 5-mixture HMnet model based on ML-SSS (Ostendorf and Singer, 1997)
  - Automatic selection of gender-dependent models
- Decoder (Shimizu et al., 1996)
  - 1st pass: frame-synchronized Viterbi search
  - 2nd pass: full search after changing the language model and LM scale

The Multi-Class Composite 2-gram and 3-gram are compared with the word 2-gram, Multi-Class 2-gram, word 3-gram and Multi-Class 3-gram. The number of classes is 1,200 for all class-based models. For the evaluation of each 2-gram, the 2-gram is used in both the 1st and the 2nd pass of the decoder. For each 3-gram, the corresponding 2-gram is replaced by the 3-gram in the 2nd pass. The evaluation measures are the conventional word accuracy and %Correct, calculated as follows:

  WordAccuracy = (W - D - I - S) / W × 100
  %Correct = (W - D - S) / W × 100

(W: number of words in the correct sentence, D: deletion errors, I: insertion errors, S: substitution errors)
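For reference, a small sketch (ours, with toy error counts) of the two scoring measures defined above:

```python
# Word accuracy and %Correct from alignment error counts.
def word_accuracy(n_ref, deletions, insertions, substitutions):
    return (n_ref - deletions - insertions - substitutions) / n_ref * 100.0

def percent_correct(n_ref, deletions, substitutions):
    return (n_ref - deletions - substitutions) / n_ref * 100.0

if __name__ == "__main__":
    # Toy counts: 1000 reference words, 20 deletions, 15 insertions,
    # 60 substitutions.
    print(word_accuracy(1000, 20, 15, 60))   # 90.5
    print(percent_correct(1000, 20, 60))     # 92.0
```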

Table 2: Evaluation of Multi-Class Composite N-grams in Continuous Speech Recognition

  Kind of model                   Word Acc.   %Correct
  Word 2-gram                     84.15       88.42
  Multi-Class 2-gram              85.45       88.80
  Multi-Class Composite 2-gram    88.00       90.84
  Word 3-gram                     86.07       89.76
  Multi-Class 3-gram              87.11       90.50
  Multi-Class Composite 3-gram    88.30       91.48

Table 2 shows the evaluation results. As in the perplexity results, the Multi-Class Composite 3-gram shows the highest performance of all the models, and its error reduction over the conventional word 3-gram is 16%.

6 Conclusion

This paper proposes an effective word clustering method called Multi-Class. In the Multi-Class method, multiple classes are assigned to each word by clustering the following word characteristics and the preceding word characteristics separately. This word clustering is performed based on the word connectivity in the corpus. Therefore, Multi-Class N-grams based on the Multi-Class can improve reliability with a compact model size without losing accuracy.

Furthermore, Multi-Class N-grams are extended to Multi-Class Composite N-grams, in which higher order word N-grams are introduced through the grouping of frequent word successions. These models therefore add the accuracy of higher order word N-grams to the reliability of the Multi-Class N-grams. The number of additional parameters introduced with the word successions is at most twice the number of word successions, so Multi-Class Composite 3-grams maintain the compact model size of the Multi-Class N-grams. Nevertheless, Multi-Class Composite 3-grams are represented in the usual form of 3-grams, which is easily handled by a language processor, especially one with a huge calculation cost such as a speech recognizer.

In experiments, the Multi-Class Composite 3-gram resulted in 9.5% lower perplexity and a 16% lower word error rate in continuous speech recognition with a 40% smaller model size than the conventional word 3-gram. It is thus confirmed that a high-performance model with a small model size can be created as a Multi-Class Composite 3-gram.

Acknowledgments

We would like to thank Michael Paul and Rainer Gruhn for their assistance in writing some of the explanations in this paper.

References

Shuanghu Bai, Haizhou Li, and Baosheng Yuan. 1998. Building class-based language models with contextual statistics. In Proc. ICASSP, pages 173-176.

P. F. Brown, V. J. D. Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Sabine Deligne and Frederic Bimbot. 1995. Language modeling by variable length sequences. In Proc. ICASSP, pages 169-172.

Hirokazu Masataki, Shoichi Matsunaga, and Yoshinori Sagisaka. 1996. Variable-order n-gram generation by word-class splitting and consecutive word grouping. In Proc. ICASSP, pages 188-191.

M. Ostendorf and H. Singer. 1997. HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language, 11(1):17-41.

Tohru Shimizu, Hirofumi Yamamoto, Hirokazu Masataki, Shoichi Matsunaga, and Yoshinori Sagisaka. 1996. Spontaneous dialogue speech recognition using cross-word context constrained word graphs. In Proc. ICASSP, pages 145-148.

Toshiyuki Takezawa, Tsuyoshi Morimoto, and Yoshinori Sagisaka. 1998. Speech and language databases for speech translation research in ATR. In Proc. of the 1st International Workshop on East-Asian Language Resource and Evaluation, pages 148-155.

Hirofumi Yamamoto and Yoshinori Sagisaka. 1999. Multi-class composite n-gram based on connection direction. In Proc. ICASSP, pages 533-536.

S. Zhang, H. Singer, D. Wu, and Y. Sagisaka. 1999. Improving n-gram modeling using distance-related unit association maximum entropy language modeling. In Proc. EuroSpeech, pages 1611-1614.