Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters

Hirofumi Yamamoto
ATR SLT
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto-fu, Japan
[email protected]

Shuntaro Isogai
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo-to, Japan
[email protected]

Yoshinori Sagisaka
GITI / ATR SLT
1-3-10 Nishi-Waseda, Shinjuku-ku, Tokyo-to, Japan
[email protected]

Abstract

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem of spoken language, for which it is difficult to collect sufficient training data. The Multi-Class Composite N-gram maintains an accurate word prediction capability and reliability for sparse data with a compact model size, based on multiple word clusters called Multi-Classes. In the Multi-Class, the statistical connectivity at each position of the N-gram is regarded as a separate word attribute, and one word cluster is created to represent each positional attribute. Furthermore, by introducing higher order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams. In experiments, the Multi-Class Composite N-grams result in 9.5% lower perplexity and a 16% lower word error rate in speech recognition with a 40% smaller parameter size than conventional word 3-grams.

1 Introduction

Word N-grams have been widely used as a statistical language model for language processing. Word N-grams are models that give the transition probability of the next word from the previous N-1 word sequence, based on a statistical analysis of a huge corpus. Though word N-grams are more effective and flexible than rule-based grammatical constraints in many cases, their performance strongly depends on the size of the training data, since they are statistical models.

In word N-grams, the accuracy of the word prediction capability increases with the order N, but the number of word transition combinations also increases exponentially. Moreover, the size of the training data needed for reliable transition probability values increases dramatically as well. This is a critical problem for spoken language, for which it is difficult to collect enough training data for a reliable model. As a solution to this problem, class N-grams have been proposed.

In class N-grams, multiple words are mapped to one word class, and the transition probabilities from word to word are approximated by the probabilities from word class to word class. The performance and model size of class N-grams strongly depend on the definition of the word classes. In fact, the performance of class N-grams based on part-of-speech (POS) word classes is usually considerably lower than that of word N-grams. Effective word class definitions are therefore required for high performance in class N-grams.

In this paper, the Multi-Class assignment is proposed for effective word class definitions. The word class is used to represent word connectivity, i.e. which words will appear in a neighboring position with what probability. In Multi-Class assignment, the word connectivity in each position of the N-gram is regarded as a different attribute, and multiple classes corresponding to each attribute are assigned to each word. For the word clustering of each Multi-Class for each word, a method is used in which word classes are formed automatically and statistically from a corpus, without using a priori knowledge such as POS information. Furthermore, by introducing higher order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams.

2 N-gram Language Models Based on Multiple Word Classes

2.1 Class N-grams

Word N-grams are models that statistically give the transition probability of the next word from the previous N-1 word sequence. This transition probability is given by the following formula:

  p(w_i | w_{i-N+1}, ..., w_{i-2}, w_{i-1})                                 (1)

In word N-grams, accurate word prediction can be expected, since a word-dependent, unique connectivity from word to word can be represented. On the other hand, the number of estimated parameters, i.e., the number of combinations of word transitions, is V^N for a vocabulary of size V. As V^N increases exponentially with N, reliable estimation of each word transition probability is difficult for large N.

Class N-grams have been proposed to resolve the problem that a huge number of parameters is required in word N-grams. In class N-grams, the transition probability of the next word from the previous N-1 word sequence is given by the following formula:

  p(c_i | c_{i-N+1}, ..., c_{i-2}, c_{i-1}) p(w_i | c_i)                    (2)

where c_i represents the word class to which the word w_i belongs.

In class N-grams with C classes, the number of estimated parameters is decreased from V^N to C^N. However, the word prediction accuracy will be lower than that of word N-grams trained on a sufficient amount of data, since the word-dependent, unique connectivity attribute can no longer be represented once words are approximated by their classes.

2.2 Problems in the Definition of Word Classes

In class N-grams, word classes are used to represent the connectivity between words. In the conventional word class definition, the connectivity to the words that follow and the connectivity to the words that precede are treated as the same neighboring characteristic, without distinction. Therefore, only words that have the same connectivity to both the following and the preceding words can belong to the same word class, and this word class definition cannot represent the word connectivity attributes efficiently. Take "a" and "an" as an example. Both are classified by POS as an Indefinite Article, and are assigned to the same word class. In this case, information about the difference in their following word connectivity is lost. On the other hand, assigning different classes to the two words loses the information that they share the same preceding word connectivity. This directional distinction is quite crucial for languages with inflection, such as French and Japanese.

2.3 Multi-Class and Multi-Class N-grams

As in the previous example of "a" and "an", the following and preceding word connectivities are not always the same. Let us consider the case of different connectivity for the words that precede and follow. Multiple word classes are assigned to each word to represent the following and preceding word connectivity separately. As the connectivity to the words preceding "a" and "an" is the same, it is efficient to assign them to the same word class to represent the preceding word connectivity, while assigning different word classes to represent the following word connectivity. Applying these word class definitions to formula (2) gives the following formula:

  p(c_i^t | c_{i-N+1}^{f_{N-1}}, ..., c_{i-2}^{f_2}, c_{i-1}^{f_1}) p(w_i | c_i^t)      (3)

In the above formula, c_i^t represents the word class in the target position to which the word w_i belongs, and c_i^{f_N} represents the word class in the N-th preceding position of the conditional word sequence. We call this multiple word class definition a Multi-Class. Similarly, we call class N-grams based on the Multi-Class, Multi-Class N-grams (Yamamoto and Sagisaka, 1999).
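To make the factorization in formula (3) concrete, the following sketch (ours, not from the paper; all class names and probability values are toy assumptions) evaluates a Multi-Class 2-gram probability: the class transition is looked up between the preceding word's following class c^f and the predicted word's target class c^t, and the word is then emitted from its target class.

```python
# Minimal sketch of a Multi-Class 2-gram, following formula (3).
# All tables below are illustrative toy values, not trained parameters.

# Each word gets two classes: a target class c_t (how it is predicted from
# the left context) and a following class c_f (how it predicts the next word).
multi_class = {
    "a":     {"t": "DET_t",  "f": "DET_a_f"},
    "an":    {"t": "DET_t",  "f": "DET_an_f"},  # same target class as "a",
                                                # different following class
    "apple": {"t": "NOUN_t", "f": "NOUN_f"},
}

# p(c_t(w_i) | c_f(w_{i-1})): class-to-class transition probabilities.
class_bigram = {
    ("DET_an_f", "NOUN_t"): 0.60,
    ("DET_a_f",  "NOUN_t"): 0.55,
}

# p(w_i | c_t(w_i)): word emission inside its target class.
word_given_class = {
    ("apple", "NOUN_t"): 0.02,
}

def multiclass_bigram_prob(prev_word: str, word: str) -> float:
    """p(word | prev_word) = p(c_t(word) | c_f(prev_word)) * p(word | c_t(word))."""
    c_f_prev = multi_class[prev_word]["f"]
    c_t_curr = multi_class[word]["t"]
    trans = class_bigram.get((c_f_prev, c_t_curr), 0.0)
    emit = word_given_class.get((word, c_t_curr), 0.0)
    return trans * emit

if __name__ == "__main__":
    print(multiclass_bigram_prob("an", "apple"))  # 0.60 * 0.02 = 0.012
    print(multiclass_bigram_prob("a",  "apple"))  # 0.55 * 0.02 = 0.011
```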

3 Automatic Extraction of Word Clusters

3.1 Word Clustering for Multi-Class 2-grams

For word clustering in class N-grams, POS information is sometimes used. Though POS information can be used for words that do not appear in the corpus, it is not always an optimal word classification for N-grams, because POS information does not accurately represent the statistical word connectivity characteristics. Better word clustering should be based on word connectivity, reflecting the neighboring characteristics observed in the corpus. In this paper, vectors are used to represent the word neighboring characteristics. The elements of the vectors are the smoothed forward or backward word 2-gram probabilities to the clustering target word. We consider that word pairs whose vectors are close to each other also have similar word neighboring characteristics (Brown et al., 1992) (Bai et al., 1998). In this method, the same vector is assigned to all words that do not appear in the corpus, so these words are assigned to the same word cluster. To avoid excessively rough clustering across different POS, we cluster the words under the condition that only words with the same POS can belong to the same cluster. Parts-of-speech that have the same connectivity in each Multi-Class are merged; for example, if different parts-of-speech are assigned to "a" and "an", these parts-of-speech are regarded as the same for the preceding word cluster. Word clustering is thus performed in the following manner (a small sketch of this procedure is given at the end of this section).

1. Assign one unique class per word.

2. Assign a vector to each class or to each word x. This represents the word connectivity attribute:

     v^t(x) = [p^t(w_1|x), p^t(w_2|x), ..., p^t(w_N|x)]                     (4)
     v^f(x) = [p^f(w_1|x), p^f(w_2|x), ..., p^f(w_N|x)]                     (5)

   where v^t(x) represents the preceding word connectivity, v^f(x) represents the following word connectivity, p^t is the probability of the succeeding class-word 2-gram or word 2-gram, and p^f is the same for the preceding one.

3. Merge two classes. We choose the two classes whose merge gives the lowest rise in the dispersion weighted with the 1-gram probability:

     U_new = sum_w ( p(w) D(v(c_new(w)), v(w)) )                            (6)
     U_old = sum_w ( p(w) D(v(c_old(w)), v(w)) )                            (7)

   That is, we merge the pair of classes whose merge cost U_new - U_old is the lowest. D(v_c, v_w) represents the square of the Euclidean distance between the vectors v_c and v_w, c_old represents the classes before merging, and c_new represents the classes after merging.

4. Repeat step 2 until the number of classes is reduced to the desired number.

3.2 Word Clustering for Multi-Class 3-grams

To apply the multiple clustering for 2-grams to 3-grams, the clustering target in the conditional part is extended from the single word of the 2-gram case to a word pair. The number of clustering targets in the preceding class increases from V to V^2, and the length of the vector in the succeeding class also increases to V^2. Therefore, efficient word clustering is needed to keep the 3-grams reliable after clustering while keeping the calculation cost reasonable.

To avoid losing reliability due to the data sparseness of the word pairs in the 3-gram histories, an approximation using distance-2 2-grams is employed. The validity of this approximation is supported by a report that the association of word 2-grams and distance-2 2-grams by the maximum entropy method gives a good approximation of word 3-grams (Zhang et al., 1999). The vector for clustering is given by the following equation:

  v^{f2}(x) = [p^{f2}(w_1|x), p^{f2}(w_2|x), ..., p^{f2}(w_N|x)]            (8)

where p^{f2}(y|x) represents the distance-2 2-gram value from word x to word y. The POS constraints for clustering are the same as in the clustering for preceding words.
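The clustering loop of Section 3.1 can be sketched as follows. This is our simplified illustration, not the authors' implementation: the vectors stand for the smoothed 2-gram probability vectors of equations (4), (5) or (8), class vectors are approximated here by 1-gram-weighted centroids of their member words, and candidate merges are searched by brute force. The merge cost is the rise in the 1-gram-weighted squared Euclidean distortion of equations (6) and (7).

```python
# Sketch of the greedy class merging of Section 3.1 (our simplified version).
import itertools
import numpy as np

def distortion(classes, vec, p1):
    """U = sum_w p(w) * ||v(class(w)) - v(w)||^2  (cf. equations (6), (7))."""
    u = 0.0
    for members in classes.values():
        weights = np.array([p1[w] for w in members])
        centroid = np.average([vec[w] for w in members], axis=0, weights=weights)
        u += sum(p1[w] * np.sum((centroid - vec[w]) ** 2) for w in members)
    return u

def cluster(vec, p1, pos, n_classes):
    """Greedily merge classes (same-POS only) until n_classes remain."""
    classes = {w: [w] for w in vec}            # step 1: one class per word
    while len(classes) > n_classes:            # repeat until the target size
        best, best_rise = None, None
        u_old = distortion(classes, vec, p1)
        for a, b in itertools.combinations(classes, 2):
            if pos[a] != pos[b]:               # POS constraint on merges
                continue
            merged = {k: v for k, v in classes.items() if k not in (a, b)}
            merged[a] = classes[a] + classes[b]
            rise = distortion(merged, vec, p1) - u_old   # merge cost
            if best_rise is None or rise < best_rise:
                best, best_rise = (a, b), rise
        if best is None:
            break                              # no mergeable pair left
        a, b = best                            # apply the cheapest merge
        classes[a] = classes[a] + classes.pop(b)
    return classes

if __name__ == "__main__":
    vec = {"a": np.array([0.8, 0.1]), "an": np.array([0.7, 0.2]),
           "apple": np.array([0.1, 0.9]), "pear": np.array([0.2, 0.8])}
    p1 = {"a": 0.4, "an": 0.1, "apple": 0.3, "pear": 0.2}
    pos = {"a": "DET", "an": "DET", "apple": "N", "pear": "N"}
    print(cluster(vec, p1, pos, n_classes=2))
```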

4 Multi-Class Composite N-grams

4.1 Multi-Class Composite 2-grams Introducing Variable Length Word Sequences

Let us consider the condition in which only the word sequence (A, B, C) has sufficient frequency within the sequence (X, A, B, C, D). In this case, the value of the word 2-gram p(B|A) can be used as a reliable value for the estimation of word B, as the frequency of the sequence (A, B) is sufficient. The value of the word 3-gram p(C|A, B) can be used for the estimation of word C for the same reason. For the estimation of words A and D, it is reasonable to use the value of the class 2-gram, since the value of the word N-gram is unreliable (note that the frequencies of the word sequences (X, A) and (C, D) are insufficient). Based on this idea, the transition probability of the word sequence (A, B, C, D) from word X is given by the following equation in the Multi-Class 2-gram:

  P = p(c^t(A) | c^f(X)) p(A | c^t(A))
      × p(B | A)
      × p(C | A, B)
      × p(c^t(D) | c^f(C)) p(D | c^t(D))                                    (9)

When the word succession A+B+C is introduced as a variable-length word sequence corresponding to (A, B, C), equation (9) can be rewritten exactly as the following equation (Deligne and Bimbot, 1995) (Masataki et al., 1996):

  P = p(c^t(A) | c^f(X)) p(A+B+C | c^t(A))
      × p(c^t(D) | c^f(C)) p(D | c^t(D))                                    (10)

Here, we find the following properties. The preceding word connectivity of the word succession A+B+C is the same as the connectivity of word A, the first word of A+B+C. The following connectivity is the same as that of the last word C. With these assignments, no new cluster is required, whereas conventional class N-grams require a new cluster for the new word succession.

  c^t(A+B+C) = c^t(A)                                                       (11)
  c^f(A+B+C) = c^f(C)                                                       (12)

Applying these relations to equation (10), the following equation is obtained:

  P = p(c^t(A+B+C) | c^f(X))
      × p(A+B+C | c^t(A+B+C))
      × p(c^t(D) | c^f(A+B+C))
      × p(D | c^t(D))                                                       (13)

Equation (13) means that if the frequency of an N-length word sequence is sufficient, we can partially introduce higher order word N-grams using an N-length word succession, while maintaining the reliability of the estimated probabilities and the form of the Multi-Class 2-grams. We call Multi-Class 2-grams that are created by partially introducing higher order word N-grams through word successions, Multi-Class Composite 2-grams. In addition, equation (13) shows that the number of parameters does not increase much when frequent word successions are added as word entries: only a 1-gram of the word succession A+B+C needs to be added to the conventional N-gram parameters. Multi-Class Composite 2-grams are created in the following manner (a small sketch of this loop is given after the list).

1. Start from a Multi-Class 2-gram as the initial state.

2. Find a word pair whose frequency is above the threshold.

3. Create a new word succession entry for the frequent word pair and add it to the lexicon. The preceding class of the word succession is the same as the preceding class of the first word in the pair, and its following class is the same as the following class of the last word in it.

4. Replace the frequent word pair in the training data with the word succession, and recalculate the frequencies of the word or word-succession pairs. In this way, the summation of the probabilities is always kept equal to 1.

5. Repeat from step 2 with the newly added word successions, until no more word pairs are found.
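A rough sketch of the succession-building loop above (our illustration; the corpus and threshold are toy values, whereas the paper uses a frequency threshold of 10). Each frequent adjacent pair is concatenated into a single token, the training data is rewritten, and counting is repeated on the rewritten data, so successions longer than two words emerge across iterations. The Multi-Class assignment of each new token (preceding class from its first word, following class from its last word) is omitted here.

```python
# Sketch of steps 1-5: iteratively turn frequent adjacent word pairs into
# single word-succession tokens ("A+B"), rewriting the corpus each round.
from collections import Counter

def introduce_successions(corpus, threshold):
    """corpus: list of token lists. Returns rewritten corpus and new tokens."""
    new_tokens = []
    while True:
        pair_counts = Counter()
        for sent in corpus:
            pair_counts.update(zip(sent, sent[1:]))        # step 2: count pairs
        frequent = [p for p, c in pair_counts.items() if c >= threshold]
        if not frequent:
            break                                          # step 5: stop
        pair = max(frequent, key=pair_counts.get)          # most frequent pair
        token = pair[0] + "+" + pair[1]                    # step 3: new entry
        new_tokens.append(token)
        rewritten = []
        for sent in corpus:                                # step 4: rewrite data
            out, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
                    out.append(token)
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            rewritten.append(out)
        corpus = rewritten
    return corpus, new_tokens

if __name__ == "__main__":
    corpus = [["thank", "you", "very", "much"]] * 12 + [["thank", "you"]] * 3
    corpus, added = introduce_successions(corpus, threshold=10)
    print(added)       # ['thank+you', 'thank+you+very', 'thank+you+very+much']
    print(corpus[0])   # ['thank+you+very+much']
```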

4.2 Extension to Multi-Class Composite 3-grams

Next, we put the word successions into the formulation of Multi-Class 3-grams. The transition probability to the word sequence (A, B, C, D, E, F) from the word pair (X, Y) is given by the following equation:

  P = p(c^t(A+B+C+D) | c^{f2}(X), c^{f1}(Y))
      × p(A+B+C+D | c^t(A+B+C+D))
      × p(c^t(E) | c^{f2}(Y), c^{f1}(A+B+C+D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(A+B+C+D), c^{f1}(E))
      × p(F | c^t(F))                                                       (14)

where the Multi-Classes of the word succession A+B+C+D are given by the following equations:

  c^t(A+B+C+D) = c^t(A)                                                     (15)
  c^{f2}(A+B+C+D) = c^{f2}(D)                                               (16)
  c^{f1}(A+B+C+D) = (c^{f2}(C), c^{f1}(D))                                  (17)

In equation (17), please notice that a class sequence (not a single class) is assigned as the preceding class of the word succession. This class sequence is the preceding class of the last word of the word succession and the pre-preceding class of the second word from the last. Applying these class assignments to equation (14) gives the following equation:

  P = p(c^t(A) | c^{f2}(X), c^{f1}(Y))
      × p(A+B+C+D | c^t(A))
      × p(c^t(E) | c^{f2}(C), c^{f1}(D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(D), c^{f1}(E))
      × p(F | c^t(F))                                                       (18)

In the above formulation, the parameter increase over the Multi-Class 3-gram is p(A+B+C+D | c^t(A)). After expanding this term, the following equation is obtained:

  P = p(c^t(A) | c^{f2}(X), c^{f1}(Y))
      × p(A | c^t(A))
      × p(B | A)
      × p(C | A, B)
      × p(D | A, B, C)
      × p(c^t(E) | c^{f2}(C), c^{f1}(D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(D), c^{f1}(E))
      × p(F | c^t(F))                                                       (19)

In equation (19), all words except B are estimated by models that are the same as or more accurate than Multi-Class 3-grams (Multi-Class 3-grams for words A, E and F, and a word 3-gram and word 4-gram for words C and D). However, for word B, a word 2-gram is used instead of the Multi-Class 3-gram, and its accuracy is lower than that of the Multi-Class 3-gram. To prevent this decrease in estimation accuracy, the following process is introduced.

First, the 3-gram entry p(c^t(E) | c^{f2}(Y), c^{f1}(A+B+C+D)) is removed. After this deletion, back-off smoothing is applied to this entry as follows:

  p(c^t(E) | c^{f2}(Y), c^{f1}(A+B+C+D))
      = b(c^{f2}(Y), c^{f1}(A+B+C+D))
      × p(c^t(E) | c^{f1}(A+B+C+D))                                         (20)

Next, we assign the following value to the back-off parameter in equation (20). This value is used to correct the decrease in the accuracy of the estimation of word B:

  b(c^{f2}(Y), c^{f1}(A+B+C+D))
      = p(c^t(B) | c^{f2}(Y), c^{f1}(A)) p(B | c^t(B)) / p(B | A)           (21)

After this assignment, the probabilities of words B and E are locally incorrect. However, the total probability is correct, since the back-off parameter corrects the decrease in the accuracy of the estimation of word B. In fact, applying equations (20) and (21) to equation (14) according to the above definitions gives the following equation, in which the probability of word B is changed from a word 2-gram to a class 3-gram:

  P = p(c^t(A) | c^{f2}(X), c^{f1}(Y))
      × p(A | c^t(A))
      × p(c^t(B) | c^{f2}(Y), c^{f1}(A))
      × p(B | c^t(B))
      × p(C | A, B)
      × p(D | A, B, C)
      × p(c^t(E) | c^{f2}(C), c^{f1}(D))
      × p(E | c^t(E))
      × p(c^t(F) | c^{f2}(D), c^{f1}(E))
      × p(F | c^t(F))                                                       (22)

In the above process, only two kinds of parameters are additionally used. One is the word 1-gram of the word succession, p(A+B+C+D). The other is the word 2-gram of the first two words of the word succession. The number of combinations of the first two words of the word successions is at most the number of word successions. Therefore, the number of additional parameters in the Multi-Class Composite 3-gram is at most twice the number of introduced word successions.
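As a small illustration of the correction in equations (20) and (21), the sketch below (ours; all numbers are made up) computes the back-off weight stored for the context (c^{f2}(Y), c^{f1}(A+B+C+D)). Multiplying word B's 2-gram contribution from equation (19) by this weight yields exactly the class 3-gram contribution that appears for B in equation (22).

```python
# Toy illustration of the back-off weight of equation (21).
# All probabilities are made-up values for the sketch.

def backoff_weight(p_class3_B, p_B_given_class, p_B_given_A):
    """b(c_f2(Y), c_f1(A+B+C+D)) =
       p(c_t(B) | c_f2(Y), c_f1(A)) * p(B | c_t(B)) / p(B | A)."""
    return p_class3_B * p_B_given_class / p_B_given_A

# Word B inside the succession A+B+C+D:
p_B_given_A     = 0.20   # word 2-gram used when expanding eq. (19)
p_class3_B      = 0.30   # p(c_t(B) | c_f2(Y), c_f1(A))
p_B_given_class = 0.50   # p(B | c_t(B))

b = backoff_weight(p_class3_B, p_B_given_class, p_B_given_A)

# B's contribution in eq. (19) times the back-off weight equals B's
# class 3-gram contribution in eq. (22):
print(p_B_given_A * b)               # 0.15
print(p_class3_B * p_B_given_class)  # 0.15
```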

5 Evaluation Experiments

5.1 Evaluation of Multi-Class N-grams

We have evaluated Multi-Class N-grams in terms of perplexity, defined by the following equations:

  Entropy = -(1/N) sum_i log_2 p(w_i)                                       (23)
  Perplexity = 2^Entropy                                                    (24)

The Good-Turing discount is used for smoothing. The perplexity is compared with those of word 2-grams and word 3-grams. The evaluation data set is the ATR Spoken Language Database (Takezawa et al., 1998). The total number of words in the training set is 1,387,300, the vocabulary size is 16,531, and 5,880 words in 42 conversations that are not included in the training set are used for the evaluation.

Figure 1 shows the perplexity of Multi-Class 2-grams for each number of classes. In the Multi-Class, the numbers of following and preceding classes are fixed to the same value just for comparison. As shown in the figure, the Multi-Class 2-gram with 1,200 classes gives the lowest perplexity, 22.70, which is smaller than the 23.93 of the conventional word 2-gram.

[Figure 1: Perplexity of Multi-Class 2-grams (perplexity vs. number of classes, 400-1,600, for the Multi-Class 2-gram and the word 2-gram baseline).]

Figure 2 shows the perplexity of Multi-Class 3-grams for each number of classes. The numbers of following and preceding classes are fixed to 1,200 (which gives the lowest perplexity in Multi-Class 2-grams), and the number of pre-preceding classes is changed from 100 to 1,500. As shown in this figure, Multi-Class 3-grams result in lower perplexity than the conventional word 3-gram, indicating the reasonableness of word clustering based on the distance-2 2-gram.

[Figure 2: Perplexity of Multi-Class 3-grams (perplexity vs. number of pre-preceding classes, 100-1,500, for the Multi-Class 3-gram and the word 3-gram baseline).]

5.2 Evaluation of Multi-Class Composite N-grams

We have also evaluated Multi-Class Composite N-grams in perplexity, under the same conditions as for the Multi-Class N-grams in the previous section. The Multi-Class 2-gram is used as the initial condition of the Multi-Class Composite 2-gram. The frequency threshold for introducing word successions is set to 10, based on a preliminary experiment. The same word succession set as that of the Multi-Class Composite 2-gram is used for the Multi-Class Composite 3-gram. The evaluation results are shown in Table 1. Table 1 shows that the Multi-Class Composite 3-gram results in 9.5% lower perplexity with a 40% smaller parameter size than the conventional word 3-gram, and that it is in fact a compact and high-performance model.

Table 1: Evaluation of Multi-Class Composite N-grams in Perplexity

  Kind of model                   Perplexity   Number of parameters
  Word 2-gram                     23.93          181,555
  Multi-Class 2-gram              22.70           81,556
  Multi-Class Composite 2-gram    19.81           92,761
  Word 3-gram                     17.88          713,154
  Multi-Class 3-gram              17.38          438,130
  Multi-Class Composite 3-gram    16.20          455,431
  Word 4-gram                     17.45        1,703,207
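The perplexity figures above follow equations (23) and (24); a minimal sketch (ours, with made-up per-word probabilities) is:

```python
# Test-set perplexity per equations (23) and (24):
# Entropy = -(1/N) * sum_i log2 p(w_i),  Perplexity = 2 ** Entropy.
import math

def perplexity(word_probs):
    """word_probs: model probabilities p(w_i) of each test-set word."""
    n = len(word_probs)
    entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2.0 ** entropy

if __name__ == "__main__":
    # Made-up per-word probabilities from some language model.
    probs = [0.05, 0.10, 0.02, 0.20, 0.04]
    print(round(perplexity(probs), 2))
```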

5.3 Evaluation in Continuous Speech Recognition

Though perplexity is a good measure of the performance of language models, it does not always have a direct bearing on performance in language processing. We have therefore also evaluated the proposed model in continuous speech recognition. The experimental conditions are as follows:

- Evaluation set
  - The same 42 conversations as used in the evaluation of perplexity
- Acoustic features
  - Sampling rate: 16 kHz
  - Frame shift: 10 msec
  - Mel-cepstrum 12 + power and their deltas, 26 parameters in total
- Acoustic models
  - 800-state, 5-mixture HMnet model based on ML-SSS (Ostendorf and Singer, 1997)
  - Automatic selection of gender-dependent models
- Decoder (Shimizu et al., 1996)
  - 1st pass: frame-synchronized Viterbi search
  - 2nd pass: full search after changing the language model and LM scale

The Multi-Class Composite 2-gram and 3-gram are compared with the word 2-gram, Multi-Class 2-gram, word 3-gram and Multi-Class 3-gram. The number of classes is 1,200 for all class-based models. For the evaluation of each 2-gram, the 2-gram is used in both the 1st and the 2nd pass of the decoder. For each 3-gram, the corresponding 2-gram is replaced by the 3-gram in the 2nd pass. The evaluation measures are the conventional word accuracy and %Correct, calculated as follows:

  WordAccuracy = (W - D - I - S) / W × 100
  %Correct = (W - D - S) / W × 100

(W: number of words in the correct sentence, D: deletion errors, I: insertion errors, S: substitution errors)
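For reference, a small sketch (ours, with toy error counts) of the two scoring measures defined above:

```python
# Word accuracy and %Correct from alignment error counts.
def word_accuracy(n_ref, deletions, insertions, substitutions):
    return (n_ref - deletions - insertions - substitutions) / n_ref * 100.0

def percent_correct(n_ref, deletions, substitutions):
    return (n_ref - deletions - substitutions) / n_ref * 100.0

if __name__ == "__main__":
    # Toy counts: 1000 reference words, 20 deletions, 15 insertions,
    # 60 substitutions.
    print(word_accuracy(1000, 20, 15, 60))   # 90.5
    print(percent_correct(1000, 20, 60))     # 92.0
```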

Table 2: Evaluation of Multi-Class Composite N-grams in Continuous Speech Recognition

  Kind of model                   Word Acc.   %Correct
  Word 2-gram                     84.15       88.42
  Multi-Class 2-gram              85.45       88.80
  Multi-Class Composite 2-gram    88.00       90.84
  Word 3-gram                     86.07       89.76
  Multi-Class 3-gram              87.11       90.50
  Multi-Class Composite 3-gram    88.30       91.48

Table 2 shows the evaluation results. As in the perplexity results, the Multi-Class Composite 3-gram shows the highest performance of all the models, and its error reduction over the conventional word 3-gram is 16%.

6 Conclusion

This paper proposes an effective word clustering method called Multi-Class. In the Multi-Class method, multiple classes are assigned to each word by clustering the following word characteristics and the preceding word characteristics separately. This word clustering is performed based on the word connectivity in the corpus. Therefore, Multi-Class N-grams based on the Multi-Class can improve reliability with a compact model size without losing accuracy.

Furthermore, Multi-Class N-grams are extended to Multi-Class Composite N-grams, in which higher order word N-grams are introduced through the grouping of frequent word successions. These models therefore add the accuracy of higher order word N-grams to the reliability of the Multi-Class N-grams. The number of additional parameters introduced with the word successions is at most twice the number of word successions, so Multi-Class Composite 3-grams maintain the compact model size of the Multi-Class N-grams. Nevertheless, Multi-Class Composite 3-grams are represented in the usual form of 3-grams, which is easily handled by a language processor, especially one with a huge calculation cost such as a speech recognizer.

In experiments, the Multi-Class Composite 3-gram resulted in 9.5% lower perplexity and a 16% lower word error rate in continuous speech recognition with a 40% smaller model size than the conventional word 3-gram. It is thus confirmed that a high-performance model with a small model size can be created as a Multi-Class Composite 3-gram.

Acknowledgments

We would like to thank Michael Paul and Rainer Gruhn for their assistance in writing some of the explanations in this paper.

References

Shuanghu Bai, Haizhou Li, and Baosheng Yuan. 1998. Building class-based language models with contextual statistics. In Proc. ICASSP, pages 173-176.

P. F. Brown, V. J. D. Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Sabine Deligne and Frederic Bimbot. 1995. Language modeling by variable length sequences. In Proc. ICASSP, pages 169-172.

Hirokazu Masataki, Shoichi Matsunaga, and Yoshinori Sagisaka. 1996. Variable-order n-gram generation by word-class splitting and consecutive word grouping. In Proc. ICASSP, pages 188-191.

M. Ostendorf and H. Singer. 1997. HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language, 11(1):17-41.

Tohru Shimizu, Hirofumi Yamamoto, Hirokazu Masataki, Shoichi Matsunaga, and Yoshinori Sagisaka. 1996. Spontaneous dialogue speech recognition using cross-word context constrained word graphs. In Proc. ICASSP, pages 145-148.

Toshiyuki Takezawa, Tsuyoshi Morimoto, and Yoshinori Sagisaka. 1998. Speech and language databases for speech translation research in ATR. In Proc. of the 1st International Workshop on East-Asian Language Resource and Evaluation, pages 148-155.

Hirofumi Yamamoto and Yoshinori Sagisaka. 1999. Multi-class composite n-gram based on connection direction. In Proc. ICASSP, pages 533-536.

S. Zhang, H. Singer, D. Wu, and Y. Sagisaka. 1999. Improving n-gram modeling using distance-related unit association maximum entropy language modeling. In Proc. EuroSpeech, pages 1611-1614.