Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information
Ling-Xiang Tang1, Shlomo Geva1, Yue Xu1 and Andrew Trotman2
1School of Information Technology, Faculty of Science and Technology, Queensland University of Technology, Queensland 4000, Australia
{l4.tang, s.geva, yue.xu}@qut.edu.au
2Department of Computer Science, University of Otago, Dunedin 9054, New Zealand
[email protected]

Proceedings of the 14th Australasian Document Computing Symposium, Sydney, Australia, 4 December 2009. Copyright for this article remains with the authors.

Abstract

In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information", or NGMI, which is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort that is required in preparing and maintaining manually segmented Chinese text for training purposes, and in manually maintaining ever-expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words. The NGMI approach extends this to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.

Keywords: Chinese word segmentation, mutual information, n-gram mutual information, boundary confidence

1 Introduction

Modern Chinese has two forms of writing: simplified and traditional. For instance, the word China is written as 中国 in simplified Chinese, but as 中國 in traditional Chinese. Furthermore, a few variants of the Chinese language exist in different locales, including Mainland China, Taiwan, Hong Kong, Macau, Singapore and Malaysia. For instance, a laser printer is called 激光打印机 in mainland China, but 鐳射打印機 in Hong Kong, and 雷射印表機 in Taiwan.

In digital representations of Chinese text, different encoding schemes have been adopted to represent the characters, and most of these schemes are incompatible with each other. To avoid conflicts between encoding standards and to cater for people's linguistic preferences, Unicode is often used in collaborative work, for example in Wikipedia articles. With Unicode, Chinese articles can be composed by people from all of the above Chinese-speaking areas in a collaborative way, without encoding difficulties. As a result, these different forms of Chinese writing and their variants may coexist within the same pages. Besides this, Wikipedia also has a Chinese collection in Classical Chinese only, and versions for a few Chinese dialects, for example the 贛語 (Gan) Wikipedia, the 粵語 (Cantonese) Wikipedia and others. Moreover, in this Internet age, more and more new Chinese terms are coined at a faster than ever rate. Correspondingly, new Chinese Wikipedia pages will be created to explain such terms, and it is difficult to keep a dictionary up to date given the rate of creation and the extent of new terms.

All these issues can lead to serious segmentation problems in Wikipedia text processing when attempting to recognise meaningful words in a Chinese article, because text will be broken down into single-character words whenever the actual n-gram word cannot be recognised. In order to extract n-gram words from a Wikipedia page, the following problems must be overcome:

• Mix of Chinese writing forms: simplified and traditional
• Mix of Chinese variants
• Mix of Classical Chinese and Modern Chinese
• Out-of-vocabulary words

There are two options for tackling these new issues in Chinese segmentation: (1) use existing methods and solutions; or (2) attempt a new technique. In general, on the basis of the required human effort, Chinese word segmentation approaches can be classified into two categories:

• Supervised methods, e.g. training-based or rule-based methods, which require specific language knowledge. Normally, a pre-segmented corpus is employed to train the segmentation models, e.g. PPM [13], or word lexicons need to be prepared for dictionary-based methods, e.g. CRF [10].

• Unsupervised methods, which are less complicated, and commonly need only simple statistical data derived from known text to perform the segmentation. For instance, statistical methods that use different mutual information formulas to extract two-character words rely on the bi-gram statistics from a corpus [2] [12].

The drawbacks of supervised methods are obvious. The effort of preparing the manually segmented corpus and tuning the parameters is extensive. Also, the selected corpus, drawn mainly from modern Chinese text sources, may cover only a small portion of Wikipedia Chinese text. Furthermore, out-of-vocabulary (OOV) words are problematic for dictionary-based methods: different writing forms and different variants can lead to different combinations of characters representing the same word. Moreover, according to the result summary of the 2nd International Chinese Word Segmentation Bakeoff [1], the rankings of the participants' results on different corpora are not very consistent, which may indicate that the supervised methods used in their segmentation systems are form (simplified or traditional) sensitive. To make use of these existing systems, the segmentation could be done by first converting all Chinese text into one unified form, simplified Chinese for example. However, the resulting performance may be cast in doubt, because the form conversion cannot radically change the way a variant is used. For example, the 鐳 (Radium) character in 鐳射 (laser; the simplified Chinese equivalent is 激光) will remain the same after such a conversion, and 鐳射 would still not be recognised correctly as a word by a simplified-Chinese oriented segmentation system. At the time of writing, no segmentation performance on the Chinese Wikipedia corpus using state-of-the-art systems has been reported.

To avoid the effort of preparing and maintaining segmented text and lexicons for different corpora, and the potential issues in applying existing methods to Chinese Wikipedia articles, a simple unsupervised statistical method called n-gram mutual information (NGMI) is proposed in this paper; it relies only on statistical data mined from the Chinese Wikipedia corpus itself. We extend the use of character-based mutual information to be segment-based in order to realise n-gram Chinese word segmentation. To achieve this goal, we introduce a new concept named boundary confidence (BC), which is used to determine the boundaries between segments; the n-gram words are thus separated by the boundaries. The estimation of boundary confidence is based on the mutual information of adjoining segments. Since n-gram mutual information looks for boundaries in text rather than for words directly, it overcomes the limitation of the traditional use of mutual information, which recognises bi-gram words only.

2 Previous Studies

Pure statistical methods for word segmentation are less well studied in Chinese segmentation research. They are the approaches that make use of statistical information extracted from text to identify words; the text itself is the only "training" corpus used by the segmentation models. Generally, the statistical methods used in Chinese segmentation can be classified into the following groups: information theory (e.g. entropy and mutual information), accessor variety, t-score, and others. The accuracy of the segmentation is commonly evaluated using the simple recall and precision measures:

R = c / N and P = c / n

where R is the recall rate of the segmentation, P is the precision rate of the segmentation, c is the number of correctly identified segmented words, N is the number of unique correct words in the test data, and n is the number of segmented words in the test data.

In a recent study, an accessor variety (AV) method was proposed by Feng et al. [4] to segment words in an unsupervised manner. Accessor variety measures the probability of a character sequence being a word. A word is separated from the input text by judging the independence of a candidate word from the rest using the accessor variety criterion, which considers the number of distinct preceding and trailing characters. The AV value of a candidate word is the minimum of the number of distinct preceding characters and the number of distinct trailing characters; the higher the value, the more independent the word is.

Information theory can also help group character sequences into words. Lua [7] [8] and Gan [8] used the entropy measure in their word segmentation algorithms: a character sequence is a possible word if its overall entropy is lower than the total entropy of its individual characters. Using this entropy theory for word judgement differently, Tung and Lee [14] considered the relationship of a candidate word with all of the possible preceding and trailing single characters appearing in the corpus. Entropy values are calculated for those characters given that they occur on either the left-hand side or the right-hand side of the candidate word. If the entropy values on either side are high, the candidate word could be an actual word.

Mutual information and its derived algorithms are mainly used to find bi-gram words. Generally, mutual information is used to measure the strength of association between two adjoining characters: the stronger the association, the more likely it is that the two characters form a word. The formula used for calculating the association score of two adjacent characters x and y is:

A(xy) = MI(x, y) = log2[ (freq(xy)/N) / ((freq(x)/N) × (freq(y)/N)) ]

where freq(·) is a frequency in the corpus and N is the corpus size. Searching for words by looking for the word boundaries inside a given sentence, combining mutual information with contextual information rather than looking for the words themselves, was tried by Sun et al. [9] before. In their approach, two adjacent characters are "bounded" or "separated" through a series of judgement rules based on the values of mutual information and the difference of t-scores. For NGMI, in contrast, no rules are involved, and the mutual information of segments, not just of adjacent characters, is considered.
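The association score above can be computed directly from corpus counts. A minimal sketch, assuming a toy corpus given as a single string and taking N to be the total character count (the function name is ours):

```python
from collections import Counter
from math import log2

def mi_association(corpus, x, y):
    """A(xy) = MI(x, y) = log2( (freq(xy)/N) / ((freq(x)/N) * (freq(y)/N)) ),
    with frequencies counted over a character corpus of total size N."""
    n = len(corpus)                                                # corpus size N
    char_freq = Counter(corpus)                                    # freq(x), freq(y)
    bigram_freq = Counter(corpus[i:i + 2] for i in range(n - 1))   # freq(xy)
    p_xy = bigram_freq[x + y] / n
    p_x, p_y = char_freq[x] / n, char_freq[y] / n
    return log2(p_xy / (p_x * p_y))

# In "abab", "a" and "b" adjoin more often than chance predicts:
print(mi_association("abab", "a", "b"))  # 1.0
```

A positive score indicates the pair co-occurs more often than independence would predict, which is the cue that the two characters may form a word.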
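The accessor variety criterion described in Section 2 can likewise be computed from raw text. A sketch, assuming a single corpus string and ignoring any special handling of sentence boundaries (the function name is ours):

```python
def accessor_variety(corpus, candidate):
    """AV of a candidate word: the minimum of the number of distinct
    characters immediately preceding and immediately following its
    occurrences; a higher AV means a more independent candidate."""
    left, right = set(), set()
    pos = corpus.find(candidate)
    while pos != -1:
        if pos > 0:
            left.add(corpus[pos - 1])          # distinct preceding characters
        end = pos + len(candidate)
        if end < len(corpus):
            right.add(corpus[end])             # distinct trailing characters
        pos = corpus.find(candidate, pos + 1)
    return min(len(left), len(right))

# "AB" is preceded by {x, z, q} and followed by {y, w, r}:
print(accessor_variety("xABy zABw qABr", "AB"))  # 3
```

A candidate that always appears with the same neighbour gets a low AV, signalling it is probably a fragment of a longer word rather than a word itself.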
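The recall and precision measures R = c/N and P = c/n from Section 2 can be sketched as below. This is a simplification that counts a system word as correct whenever it occurs in the gold vocabulary; bakeoff scoring instead aligns segment positions:

```python
def recall_precision(gold_words, system_words):
    """R = c/N and P = c/n, where c = correctly identified words, N = unique
    correct words in the test data, n = words the segmenter produced."""
    gold_vocab = set(gold_words)                           # N unique correct words
    c = sum(1 for w in system_words if w in gold_vocab)    # correctly identified
    return c / len(gold_vocab), c / len(system_words)

# Gold has 2 unique words; the system output 3 words, 1 of them correct:
r, p = recall_precision(["中国", "人民"], ["中国", "人", "民"])
```

Here the over-segmentation of 人民 into single characters lowers both measures, mirroring the failure mode the introduction describes for unrecognised n-gram words.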