Boosting N-Gram Coverage for Unsegmented Languages Using Multiple Text Segmentation Approach

Boosting N-gram Coverage for Unsegmented Languages Using Multiple Text Segmentation Approach Solomon Teferra Abate Laurent Besacier Sopheap Seng LIG Laboratory, LIG Laboratory, LIG Laboratory, CNRS/UMR-5217 CNRS/UMR-5217 CNRS/UMR-5217 [email protected] [email protected] MICA Center, CNRS/UMI- 2954 [email protected] Abstract have a writing system without obvious word de- limiters, the N-grams of words are usually esti- Automatic word segmentation errors, mated from the text corpus segmented into for languages having a writing system words employing automatic methods. Automat- without word boundaries, negatively af- ic segmentation of text is not a trivial task and fect the performance of language mod- introduces errors due to the ambiguities in natu- els. As a solution, the use of multiple, ral language and the presence of out of vocabu- instead of unique, segmentation has re- lary words in the text. cently been proposed. This approach While the lack of text resources has a nega- boosts N-gram counts and generates tive impact on the performance of language new N-grams. However, it also pro- models, the errors produced by the word seg- duces bad N-grams that affect the lan- mentation make those data even less usable. The guage models' performance. In this pa- word N-grams not found in the training corpus per, we study more deeply the contribu- could be due not only to the errors introduced tion of our multiple segmentation ap- by the automatic segmentation but also to the proach and experiment on an efficient fact that a sequence of characters could have solution to minimize the effect of adding more than one correct segmentation. bad N-grams. In previous article (Seng et al., 2009), we have proposed a method to estimate an N-gram 1 Introduction language model from the training corpus on which each sentence is segmented into multiple A language model is a probability assignment ways instead of a unique segmentation. The ob- over all possible word sequences in a natural jective of multiple segmentation is to generate language. It assigns a relatively large probabili- more N-grams from the training corpus to use in ty to meaningful, grammatical, or frequent word language modeling. It was possible to show that sequences and a low probability or a zero proba- this approach generates more N-grams (com- bility to nonsensical, ungrammatical or rare pared to the classical dictionary-based unique ones. The statistical approach used in N-gram segmentation method) that are potentially useful language modeling requires a large amount of and relevant in language modeling. The applica- text data in order to make an accurate estimation tion of multiple segmentation in language mod- of probabilities. These data are not available in eling for Khmer and Vietnamese showed im- large quantities for under-resourced languages provement in terms of tri-gram hits and recogni- and the lack of text data has a direct impact on tion error rate in Automatic Speech Recognition the performance of language models. While the (ASR) systems. word is usually a basic unit in statistical lan- This work is a continuation of our previous guage modeling, word identification is not a work on the use of multiple segmentation. It is simple task even for languages that separate conducted on Vietnamese only. A close analysis words by a special character (a white space in of N-gram counts shows that the approach has general). For unsegmented languages, which in fact two contributions: boosting the N-gram 1 Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pages 1–7, the 23rd International Conference on Computational Linguistics (COLING), Beijing, August 2010 counts that are generated with first best segmen- ed text corpus using unsupervised methods, dic- tation and generating new N-grams. We have tionaries are often created manually. The state- also identified that there are N-grams that nega- of-the-art methods generally use a combination tively affect the performance of the language of hand-crafted, dictionary and statistical tech- models. In this paper, we study the contribution niques to obtain a better result. However, statis- of boosting N-gram counts and of new N-grams tical methods need a large corpus segmented to the performance of the language models and manually beforehand. Statistical methods and consequently to the recognition performance. complex training methods are not appropriate in We also present experiments where rare or bad the context of under-resourced languages as the N-grams are cut off in order to minimize their resources needed to implement these methods negative effect on the performance of the lan- do not exist. For an under-resourced language, guage models. we seek segmentation methods that allow better The paper is organized as follows: section 2 exploitation of the limited resources. In our pre- presents the theoretical background of our mul- vious paper (Seng et al., 2009) we have indicat- tiple segmentation approach; in section 3 we ed the problems of existing text segmentation point out the set up of our experiment; in sec- approaches and introduced a weighted finite tion 4 we present the results of our detailed sta- state transducer (WFST) based multiple text tistical analysis of N-grams generated by multi- segmentation algorithm. ple segmentation systems. Section 5 presents Our approach is implemented using the AT & the evaluation results of our language models T FSM Toolkit (Mohri et al., 1998). The algo- for ASR and finally, we give concluding re- rithm is inspired with the work on the segmen- marks. tation of Arabic words (Lee et al., 2003). The multiple segmentation of a sequence of charac- 2 Multiple Text Segmentation ters is made using the composition of three con- trollers. Given a finite list of words we can Text segmentation is a fundamental task in nat- build a finite state transducer M (or word trans- ural language processing (NLP). Many NLP ap- ducer) that, once composed with an acceptor I plications require the input text segmented into of the input string that represent a single charac- words before making further progress because ter with each arc, generates a lattice of the the word is considered the basic semantic unit in words that represent all of the possible segmen- natural languages. For unsegmented languages tations. To handle out-of-vocabulary entries, we segmenting text into words is not a trivial task. make a model of any string of characters by a Because of ambiguities in human languages, a star closure operation over all the possible char- sequence of characters may be segmented in acters. Thus, the unknown word WFST can more than one way to produce a sequence of parse any sequence of characters and generate a valid words. This is due to the fact that there unique unk word symbol. The word transducer are different segmentation conventions and the can, therefore, be described in terms of the definition of word in a language is often am- WFST operations as M = (WD ∪ UNK)+ biguous. where WD is a WFST that represents the dictio- Text segmentation techniques generally use nary and UNK represents the unknown word an algorithm which searches in the text the WFST. Here, ∪ and + are the union and Kleene words corresponding to those in a dictionary. In “+” closure operations. A language model L is case of ambiguity, the algorithm selects the one used to score the lattice of all possible segmen- that optimizes a parameter dependent on the tations obtained by the composition of our word chosen strategy. The most common optimiza- transducer M and the input string I. A language tion strategies consist of maximizing the length model of any order can be represented by a of words (“longest matching”) or minimizing WFST. In our case, it is important to note that the number of words in the entire sentence only a simple uni-gram language model is used. (“maximum matching”). These techniques rely The uni-gram model is estimated from a small heavily on the availability and the quality of the training corpus segmented automatically into dictionaries and while it is possible to automati- words using a dictionary based method. The cally generate a dictionary from an unsegment- composition of the sequence of input string I 2 with the word transducer M yields a transducer on a separate speech test set (different from the that represents all possible segmentations. This development set). transducer is then composed with the language First of all, a language model named lm_1 is model L, resulting in a transducer that repre- trained using the SRILM toolkit (Stolcke 2002) sents all possible segmentations for the input from the first best segmentation (Segmul1), string I, scored according to L. The highest which has the highest scoring paths (based on scoring paths of the compound transducer is the the transducer explained in section 2) of each segmentation m that can be defined as: sentence in the whole corpus. Then, additional language models have been trained using the P m =maxP mk The segmentation procedure can then be ex- corpus segmented with N-best segmentation: pressed formally as: the number of N-best segmentations to generate for each sentence is fixed to 2, 5, 10, 50, 100 m=bestpath I◦M◦L and 1000. The resulting texts are named accord- where ◦ is the composition operator. The N- ingly as Segmul2, Segmul5, Segmul10, Seg- best segmentations are obtained by decoding the mul50, Segmul100, Segmul1000. Using these final lattice to output the N-best highest scoring as training data, we have developed different paths and will be used for the N-gram count. language models.

Load more