Statistics and Its Interface Volume 10 (2017) 165–173

Word segmentation in Chinese language processing

Xinxin Shu, Junhui Wang, Xiaotong Shen, and Annie Qu∗

∗Corresponding author.

This paper proposes a new statistical learning method for word segmentation in Chinese language processing. Word segmentation is the crucial first step towards natural language processing. Despite progress, segmentation remains under-studied, particularly for the Chinese language, the second most popular language among all internet users. One major difficulty is that the Chinese language is highly context-dependent and ambiguous in terms of word representations. To overcome this difficulty, we cast the problem of segmentation into a framework of sequence classification, where an instance (observation) is a sequence of characters, and a class label is a sequence determining how each character is segmented. Given the class label, each character sequence can be segmented into linguistically meaningful words. The proposed method is investigated through the Peking university corpus of Chinese documents. Our numerical study shows that the proposed method compares favorably with the state-of-the-art segmentation methods in the literature.

AMS 2000 subject classifications: Primary 62H30; secondary 68T50.

Keywords and phrases: Cutting-plane algorithm, Language processing, Support vector machines, Word segmentation.

1. INTRODUCTION

Digital information has become an essential part of modern life, from media news, entertainment, business, distance learning and communication, and research on product marketing, to potential threat detection and national security. With the enormous amount of information gathered nowadays, manual information processing is far from sufficient, and the development of fast automatic processes of information extraction is becoming extremely important.

In this paper, we focus on problems arising from Chinese natural language processing, and in particular we address problems of word segmentation. This is an important and quite timely area, since the Chinese language has become the second most popular language among all internet users. In 2000, there were about 22.5 million Chinese internet users. However, after rapid growth in the last decade, there were over 649 million internet users in 2013 writing text documents in Chinese, making up 23.2% of all internet users, compared to 28.6% for English. The online sales and entertainment businesses also promote the popularity of Chinese in the digital world. For example, the Amazon.cn data (Zhang et al., 2009) consist of 5 × 10⁵ Chinese reviews on various products.

However, Chinese language processing is still an area which has been severely under-studied. This is likely due to specific challenges caused by the characteristics of the Chinese language. Word segmentation is considered a crucial step towards Chinese language processing tasks, due to the unique characteristics of Chinese language structure. Chinese words generally are composed of multiple characters without any delimiter appearing between words. For example, the word 博客 "blog" consists of two characters, 博 "plentiful" and 客 "guest". If the characters in a word are treated individually rather than together, this could lead to a completely different meaning. Good word segmenters correctly transform text documents into collections of linguistically meaningful words, and make it possible to extract information accurately from the documents. Therefore, accurate segmentation is a prerequisite step for Chinese document processing. Without effective word segmentation of Chinese documents, it is extremely difficult to extract correct information, given the ambiguous nature of Chinese words.

Existing methods for Chinese segmentation are essentially based on either characters, words or their hybrids (Sun, 2010; Gao et al., 2005). Teahan et al. (2000) proposed a word-based method by applying forward or backward maximum matching strategies. Their method requires an existing corpus as a reference to identify exact character sequences, and then segments character by character sequentially, processing documents in either a forward or backward direction. This method is also developed in Chen and Goodman (1999). One obvious drawback of this approach is that the segmentation heavily relies on the coverage of the given corpus, and thus it is not designed for identifying new words which are not in the corpus.

The character-based method considers segmentation as a sequence labeling problem (Xue, 2003). That is, the location of characters in a word is labeled through statistical modeling such as conditional random fields (CRF; Lafferty et al., 2001; Chang et al., 2008) or structured support vector machines (SVMstruct; Tsochantaridis et al., 2005) based on the hinge loss. Xue and Shen (2003) proposed a maximum entropy approach which combines both the character-based and word-based methods. Specifically, their idea is to integrate forward/backward maximum matching with statistical or machine-learning models to improve segmentation performance. Sun and Xu (2011) proposed a unified approach for learning a segmentation model from both training and test datasets to improve segmentation accuracy.

However, these approaches suffer major drawbacks in that they do not utilize available linguistic information which can enhance the segmentation (Gao et al., 2005), and/or are incapable of identifying new words not appearing in training documents. Some current segmenters treat word segmentation and new word identification as two separate processes (Chen, 2003; Wu and Jiang, 2000), which may lead to inconsistent results in segmentation. Other methods embed segmentation into other processing procedures, such as translation, to serve a specific purpose, for example, Chinese-English translation (Xu et al., 2008; Zhang et al., 2008). These methods, unified with other processing approaches, have not been proposed for general use.

Although in some situations character-based methods tend to outperform word-based methods in terms of segmentation accuracy (Wang et al., 2010), the enormous variety of possible permutations of Chinese characters makes the computation of segmentation intractable. In this paper, we propose a statistical method for Chinese word segmentation which incorporates linguistic rules to restrict the possible permutations of Chinese characters and thus alleviate the computational burden. Specifically, an instance (observation) is treated as a character sequence linked with a tag sequence which represents the position of each character as the beginning, middle, or end of a word. Given the tag sequence, each character sequence can then be segmented (broken or grouped) into linguistically meaningful words. The key challenge of segmentation is that it does not have explicit features, and the number of choices for tag sequences increases exponentially with the length of the sequence. To circumvent this difficulty, a segmentation function is formulated as a rating function measuring the meaningfulness of the segmented words given each sequence labeling.

Specifically, we utilize linguistically meaningful features through higher-order N-gram templates in the segmentation model. Linguistic features using different-order gram templates are constructed to build the candidate set of features, and significant features are selected from the candidate set by minimizing a segmentation loss function. The proposed model can achieve higher segmentation accuracy by incorporating linguistic rules while reducing the estimation complexity. This is because the proposed segmentation strategy does not rely completely on training samples, which cannot identify new words. Instead, it is built on established linguistic rules, which can segment new words more accurately, as the linguistic rules apply to new words as well.

The paper is organized as follows. Section 2 proposes the linguistically-embedded learning model. Section 3 provides a computational strategy to meet the computational challenges in solving large-scale optimization for the proposed model. Section 4 illustrates the proposed method through application to the Peking university corpus in the SIGHAN Bakeoff. The final section provides concluding remarks and a brief discussion.

2. CHINESE LANGUAGE SEGMENTATION

One major challenge in Chinese word segmentation is that the Chinese language is a highly context-dependent, or strongly analytic, language. The major differences between Chinese and English are as follows. Chinese morphemes corresponding to words have very little inflection. English, on the other hand, is rich in inflection and therefore more context-independent. A large number of Chinese words have more than one meaning under different contexts. For example, the original meaning of the word 水分 is "water," but it could also mean "inflated;" the word 算账 has the double meanings of "balance budget" and "reckoning." Chinese has no tense on verbal inflections to distinguish past such as "-ed," present such as "-ing" and future activities, no number marking such as "-s" in English to distinguish singular versus plural, and no upper or lower case marking to indicate the beginning of a sentence. In addition, English morphemes can have more than one syllable, while Chinese morphemes are typically monosyllabic and written as one character (Wong et al., 2010).

Another challenge is that the number of Chinese characters is much greater than the number of letters in English. The Kangxi Dictionary from the Qing dynasty in the 17th century records around 47,035 characters. Nowadays the number of characters has almost doubled to 87,019, according to the Zhonghua Zihai dictionary (Zhonghua Book Company, 1994). Moreover, new Chinese characters are constantly being created by internet users, at a pace matching the rapid growth of the internet in this information age.

In addition, the writing of Chinese characters is not unified, because there are two versions of character writing: one is based on traditional characters, and the other is simplified character writing. Simplified characters are officially used in mainland China, whereas traditional characters are maintained in Taiwan, Hong Kong and Macau. This leads to different coding systems for electronic Chinese documents and webpages. There are three main coding systems, namely GB, Big5, and Unicode. The GB encoding scheme is applied to simplified characters, while Big5 is for traditional characters. Unicode can be applied to both writing styles. One advantage of the Unicode system is that both GB and Big5 can be converted into Unicode.
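As a side note, this conversion is a one-line decode step in most modern programming environments. A minimal Python sketch follows; the codec identifiers "gbk" and "big5" are Python's standard names for these encodings, not terminology from this paper:

```python
# Python 3 strings are Unicode, so converting GB- or Big5-encoded bytes
# into Unicode amounts to decoding with the appropriate codec.
simplified = "博客"    # "blog", written in simplified characters
traditional = "網球"   # "tennis", written in traditional characters

gb_bytes = simplified.encode("gbk")      # bytes as stored in a GB-encoded file
big5_bytes = traditional.encode("big5")  # bytes as stored in a Big5-encoded file

assert gb_bytes.decode("gbk") == simplified      # GB -> Unicode
assert big5_bytes.decode("big5") == traditional  # Big5 -> Unicode
```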

Segmentation in Chinese language processing is a crucial step because there is no boundary delimiter between consecutive Chinese words. In fact, most Chinese characters can appear in any position within different words. Table 1 shows an example where the Chinese character 发 "happen" occurs at three different positions.

Table 1. A character can appear in different positions within different words

    position     example
    beginning    发生 "to happen"
    middle       始发站 "starting station"
    end          头发 "hair"

This unique feature of the Chinese language makes it quite challenging to determine word boundaries simply through detecting certain types of Chinese characters, even though the number of characters is finite. This is due to the fact that a character appearing in different positions leads to different meanings and interpretations of words and phrases. For instance, a segmenter could segment the sentence 网球拍卖完了 as 网球拍/卖完/了 "Tennis racquets are sold out," or segment the sentence as 网球/拍卖完/了 "Tennis ball(s) is/are auctioned." The ambiguity of the Chinese language is extremely challenging for Chinese language processing, since mechanical methods such as tabulating frequencies of key words from the context are not effective for text mining. Therefore segmentation plays a very important role in Chinese language processing, since different segmentations may lead to different sentiment analyses.

2.1 Linguistically-embedded learning framework

In this section, we first introduce a character-based framework, and then illustrate how to incorporate the character-based framework into the proposed model. Let T be the number of characters in one sentence, denote the corresponding sentence as c = c_1 … c_T, and denote the set of its segmentation locators as s = s_1 … s_T. Here each character c_t corresponds to a segmentation locator s_t, and s_t ∈ S. Meng et al. (2010) suggest that a simple 4-tag set S = {B, M, E, S} is sufficient for unique determination and segmentation, where B, M, E, and S denote the beginning, middle, and end of a word, and a single-character word, respectively. For instance, consider the 11-character sentence c = 我们将创造美好的新世纪 "we will create a bright future," where T = 11, c_1 = 我, c_2 = 们, …, c_11 = 纪, and s = BESBEBESBME. So the linguistically meaningful segmentation is: 我们/将/创造/美好/的/新世纪. The 4-tag segmentation rule is effective in achieving segmentation accuracy and computational efficiency, which are two important and desirable properties in natural language processing.
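To make the 4-tag convention concrete, the following minimal Python sketch (the helper name decode_tags is ours, not from the paper) recovers the word segmentation from a {B, M, E, S} tag sequence, reproducing the example above:

```python
def decode_tags(chars, tags):
    """Recover segmented words from a B/M/E/S tag sequence."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # a single-character word
            words.append(ch)
        elif tag == "B":          # a word begins
            current = ch
        elif tag == "M":          # the word continues
            current += ch
        else:                     # tag == "E": the word ends
            words.append(current + ch)
            current = ""
    return words

print(decode_tags("我们将创造美好的新世纪", "BESBEBESBME"))
# ['我们', '将', '创造', '美好', '的', '新世纪']
```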
To identify the segmentation locator for each Chinese sentence, we construct the segmentation model based on training data (c_i, s_i), i = 1, …, n, with a mapping φ: C^T → S^T, where c_i and s_i are the character and locator vectors in the i-th sentence, and n is the number of sentences. For instance, φ({我, 们}) = {B, E}, indicating that 我们 is segmented as a word. Here φ can be a discontinuous and ultra-high-dimensional function when the size of a Chinese document is large. To reduce the dimensionality, we introduce a continuous segmentation function f, which quantifies the appropriateness of the segmentation for each sentence. Specifically, φ(c) = argmax_s f(c, s), and

    f(c, s) = Σ_{k=1}^K Σ_{t=1}^T λ_k f_k(s, c, t),

where f_k(s, c, t) is a linguistically meaningful feature measuring the appropriateness of c at a specific location t of s, K is the number of features, and Λ = (λ_1, …, λ_K)^T are the relative importance measures of the features. Usually, f_k(s, c, t) takes values in {0, 1}, and thus the value of f(c, s) becomes a weighted measure of all features appearing in the segmentation of c by s. Therefore, one key idea of the proposed method is to learn the relative importance Λ through the training documents, and then apply the estimated f(c, s) to segment future documents.

To estimate Λ, we minimize the following cost function:

(1)    argmin_Λ  Σ_{i=1}^n L( Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(s_i, c_i, t) ) + η Σ_{k=1}^K J(λ_k),

where L(u) is a large-margin loss function that is non-increasing in u, J(λ) is a regularizer, and η is a tuning parameter. The large-margin loss function can take various forms and gives preference to a large value of f(c, s), which mimics the optimal rule φ(c) to achieve good estimation accuracy of Λ. To ensure model sparsity, the choices of J(λ) include the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the truncated L1-penalty (Shen et al., 2012), among others. Note that the positivity constraint λ_k ≥ 0 is unnecessary in (1), as the f_k(s, c, t)'s are all non-negative and L(u) is non-increasing in u. We obtain f̂(c, s) with the selected important features through minimizing (1), and the sequence c can then be segmented by ŝ = φ̂(c) = argmax_s f̂(c, s). That is, a document is segmented by maximizing the weighted combination of those important features constructed based on each candidate segmentation. Note that the segmentation formulation in (1) does not produce probabilistic outputs but only the most appropriate segmentation s. In the literature, a number of methods have been proposed (e.g., Wang et al., 2008; Wu et al., 2010) to construct probabilistic estimates for margin-based methods, which may be adapted to the segmentation formulation.

The binary linguistic features f_k(s_i, c_i, t) are constructed based on N-gram templates. The unigram (or 1-gram) templates contain I(s(0) = s_t, c(−1) = c_{t−1}), I(s(0) = s_t, c(0) = c_t) and I(s(0) = s_t, c(+1) = c_{t+1}), and the bigram (or 2-gram) templates include I(s(0) = s_t, c(−1) = c_{t−1}, c(0) = c_t) and I(s(0) = s_t, c(0) = c_t, c(+1) = c_{t+1}), where c(−1), c(0), c(+1) and s(0) denote the previous, current and next characters, and the tag for the current character, respectively.
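As an illustration, the sketch below enumerates the unigram and bigram template features that fire for the character at position t; the tuple-keyed encoding of the indicator functions is our own illustrative choice, not the paper's implementation:

```python
def ngram_features(chars, tags, t, pad="<PAD>"):
    """Active unigram/bigram indicator features for character position t."""
    prev = chars[t - 1] if t > 0 else pad
    cur = chars[t]
    nxt = chars[t + 1] if t + 1 < len(chars) else pad
    s0 = tags[t]
    return [
        ("U-1", s0, prev),          # I(s(0)=s_t, c(-1)=c_{t-1})
        ("U0", s0, cur),            # I(s(0)=s_t, c(0)=c_t)
        ("U+1", s0, nxt),           # I(s(0)=s_t, c(+1)=c_{t+1})
        ("B-1,0", s0, prev + cur),  # I(s(0)=s_t, c(-1)=c_{t-1}, c(0)=c_t)
        ("B0,+1", s0, cur + nxt),   # I(s(0)=s_t, c(0)=c_t, c(+1)=c_{t+1})
    ]

# Features firing for c_2 = 们 (tag E) in the example sentence:
print(ngram_features("我们将创造美好的新世纪", "BESBEBESBME", 1))
# [('U-1','E','我'), ('U0','E','们'), ('U+1','E','将'),
#  ('B-1,0','E','我们'), ('B0,+1','E','们将')]
```

These five active features correspond exactly to the features f_4 through f_8 listed below for the word 我们.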

The unigram templates contain the single character's information on the previous, current or next character given the current segmentation locator, while the bigram templates include two consecutive characters' information through combining the previous or next character with the current character.

For illustration, let c = 我们将创造美好的新世纪 and s = BESBEBESBME. If each first and last character has 2 unigram and 1 bigram features, and each middle character has 3 unigram and 2 bigram features, then this generates 31 unigram and 20 bigram features in total. Higher-order gram templates can be defined in a similar way. However, the more higher-order gram templates are used, the more complex the model becomes. Fortunately, mastering around 3000 characters is sufficient for understanding 99% of Chinese documents (Wong et al., 2010). Moreover, the proportions of words with one, two, three, and four or more characters are 5%, 75%, 14% and 6%, respectively. Therefore, unigram, bigram, trigram and quadrigram templates are sufficient to capture most Chinese words. For lexicon words, as illustrated in the above example, we can apply unigram and bigram templates to construct their binary features. For example, for the word 我们 "we" appearing at the beginning of the above sentence, the feature functions are

    f_1 = I(s(0) = B, c(0) = 我),
    f_2 = I(s(0) = B, c(+1) = 们),
    f_3 = I(s(0) = B, c(0) = 我, c(+1) = 们),
    f_4 = I(s(0) = E, c(−1) = 我),
    f_5 = I(s(0) = E, c(0) = 们),
    f_6 = I(s(0) = E, c(+1) = 将),
    f_7 = I(s(0) = E, c(−1) = 我, c(0) = 们),
    f_8 = I(s(0) = E, c(0) = 们, c(+1) = 将).

To further facilitate the computation, we consider a surrogate loss function L(f, c_i, s_i) = L(Λᵀ F_{c_i,s_i}), where Λᵀ F_{c_i,s_i} is the generalized function margin for multiclass classification (Vapnik, 1998). We propose to use the hinge loss L(u) = (1 − u)₊ for the segmentation formulation, which is often used in large-margin classification such as the support vector machine (SVM; Cortes and Vapnik, 1995). The hinge loss works effectively as a loss function for large-margin classification, since the more the margin is violated, the greater the loss.

The proposed formulation with the hinge loss has a number of advantages. First, the model (2) contains only 2n + 1 constraints, which is on a much smaller scale compared to the exponential order of operations required by conditional random fields (CRF) and the structured support vector machine (SVMstruct). The CRF combines conditional models with the global normalization of random fields, and the SVMstruct solves classification problems involving multiple dependent output variables or structured outputs, applicable to complex-output problems such as natural language parsing. However, both methods involve exponential numbers of constraints. The proposed method makes use of the functional representation in CRF and integrates it in a regularization form as in SVMstruct. In a sense, it integrates the strengths of both CRF and SVMstruct, leading to a much simpler formulation. Second, the optimization process of (2) can be efficiently implemented through parallel computing, which makes the segmentation scalable. The details of the parallel algorithm for a scalable implementation are provided in Section 3. Third and most interestingly, the proposed method has the potential to incorporate various linguistic rules of Chinese in constructing the features, which may lead to an improvement in segmentation performance. More detailed discussion is deferred to Section 5.

Specifically, the model in (1) with the hinge loss can be formulated as

(2)    argmin_{Λ,ξ}  Σ_{i=1}^n ξ_i + η Σ_{k=1}^K J(λ_k)
       s.t.  1 − Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(s_i, c_i, t) ≤ ξ_i,  ξ_i ≥ 0,  i = 1, …, n,

where ξ_i is a slack variable for the hinge loss of each sentence.

In addition, we may also consider alternative surrogate loss functions, such as the ψ-loss function L(u) = ψ(u) = min(1, (1 − u)₊) (Shen et al., 2003). Although the ψ-loss function is non-convex, it is able to attain the optimal rate of convergence under certain conditions and outperforms the hinge loss in general. Intuitively, the advantage of the ψ-loss lies in the fact that it is much closer to the 0–1 loss I(u ≤ 0) in identifying segmentation errors, especially when u is negative. Consequently, the ψ-loss is much less affected by, e.g., an outlying misclassified sentence with a large negative functional margin Λᵀ F_{c_i,s_i}. More detailed discussion can be found in Shen et al. (2003).

3. ALGORITHM AND COMPUTATION

It is important to develop an efficient optimization strategy to improve computational efficiency. The key idea of the proposed computational scheme is "decomposition and combination," to meet the computational challenges of solving the large-scale optimization for the proposed model. The procedure is summarized in Algorithm 1.

We apply the truncated L1-penalty (Shen et al., 2012) for J(Λ), i.e., J(·) = min(|·|/ν, 1), where ν is a tuning parameter. The truncated L1-penalty has the following advantages. It selects linguistic features adaptively with Λ and also corrects the Lasso bias through tuning ν. It is capable of handling small Λ of linguistic features through tuning ν, and therefore improves accuracy in segmentation. In addition, the truncated L1-penalty is piecewise linear, and is computationally efficient in the optimization process.
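For reference, here is a small vectorized sketch of the three functions just introduced: the hinge loss of (2), the ψ-loss, and the truncated L1-penalty. The names are ours, and the absolute value inside the penalty is our reading of Shen et al. (2012):

```python
import numpy as np

def hinge(u):
    """Hinge loss L(u) = (1 - u)_+ used in formulation (2)."""
    return np.maximum(0.0, 1.0 - u)

def psi(u):
    """psi-loss L(u) = min(1, (1 - u)_+): caps the penalty at 1, so a badly
    violated sentence (large negative margin) cannot dominate the fit."""
    return np.minimum(1.0, hinge(u))

def truncated_l1(lam, nu):
    """Truncated L1-penalty J(lambda) = min(|lambda|/nu, 1)."""
    return np.minimum(np.abs(lam) / nu, 1.0)

u = np.array([-2.0, 0.0, 0.5, 2.0])
print(hinge(u))  # [3.  1.  0.5 0. ]  -- grows without bound as u decreases
print(psi(u))    # [1.  1.  0.5 0. ]  -- bounded by 1
```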

Algorithm 1. Chinese word segmentation procedure based on penalized hinge loss

    1: Build features f_k(s_i, c_i, t) with N-gram templates for every k = 1, …, K, i = 1, …, n and t = 1, …, T.
    2: Implement the ad hoc cutting-plane algorithm to obtain estimates λ̂_k for k = 1, …, K by minimizing (2).
    3: Predict segmentation locators ŝ by maximizing f̂(c, s) ≡ Σ_{k=1}^K Σ_{t=1}^T λ̂_k f_k(c, s, t) for any c ∈ C.

The optimization is carried out using an ad hoc cutting-plane algorithm. The idea of the cutting-plane method is to refine feasible sets iteratively through linear inequalities. Let M be the set of constraints in model (2) and W ⊂ M be the current working set of constraints. In each iteration, the algorithm finds the solution over the current working set W, searches for the most violated constraint in M\W, and then adds it to the working set. The algorithm stops when all violations of the constraints are smaller than the tolerance ε. The ad hoc cutting-plane algorithm is illustrated in Algorithm 2.

Algorithm 2. The cutting-plane algorithm for model (2)

    1: Initialize η and ε, and set W = ∅;
    2: Repeat
    3:   Compute Λ̂⁽ᵐ⁾ = argmin_{Λ,ξ} Σ_{i=1}^n ξ_i + η Σ_{k=1}^K J(λ_k), subject to the constraints in W;
    4:   Obtain the constraint l⁽ᵐ⁾ ∈ M which has the largest violation in M given Λ̂⁽ᵐ⁾;
    5:   Set W = W ∪ {l⁽ᵐ⁾};
    6: Until no violation is larger than ε;
    7: Obtain the estimator Λ̂.
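The loop structure of Algorithm 2 can be sketched as follows. The subproblem solver and the violation oracle are deliberately left abstract, since their implementations (an optimizer for (2) restricted to W, and a search for the most violated margin constraint) are problem-specific; this is a structural sketch, not the paper's code:

```python
def cutting_plane(constraints, solve_over, violation, eps):
    """Skeleton of Algorithm 2.

    constraints: the full constraint set M (assumed non-empty);
    solve_over(W): minimizer of (2) subject only to the working set W;
    violation(con, sol): how much solution `sol` violates constraint `con`.
    """
    working = []                                   # step 1: W = empty set
    while True:
        sol = solve_over(working)                  # step 3
        worst = max(constraints, key=lambda c: violation(c, sol))  # step 4
        if violation(worst, sol) <= eps:           # step 6: stop when no
            return sol                             #   violation exceeds eps
        working.append(worst)                      # step 5: W = W ∪ {l(m)}
```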

The computational efficiency of the cutting-plane algorithm for the hinge loss has been extensively investigated in empirical studies. Indeed, it is much faster than conventional training methods derived from decomposition methods (Joachims et al., 2009). Note that model (2) contains a large number of features, which is computationally challenging. For example, assuming that C contains the 1000 most common Chinese characters and the character locator set S has the 4 tags {B, M, E, S}, the unigram and bigram templates involve 3 × |S| × |C| + 2 × |S| × |C| × |C| = 3 × 4 × 1000 + 2 × 4 × 1000² = 12,000 + 8,000,000, or roughly 8 × 10⁶ features. It might be necessary to implement the parallel computing strategy in Step 3 to accelerate the computational speed.

To handle large documents or texts, MapReduce computation can be utilized to break large problems into many small subproblems in a recursive and parallel manner. In particular, we decompose our cost functions and regularizers over many observations by transforming complicated nonconvex optimization problems into many subproblems of convex minimization. In addition, we can alleviate high storage costs and increase the computational speed through parallel computing. To achieve this goal, we can employ OpenMP, the multi-platform shared-memory parallel programming platform (http://www.openmp.org), or Mahout, a library for scalable machine learning and data mining. These tools for solving large-scale problems allow us to analyze data containing several billions (10⁹) of observations on a single machine with reasonable computational speed.

We can also consider other penalty functions in model (2). For example, if the L2-norm penalty is applied as the regularizer, then solving model (2) is a convex optimization problem. This can be solved by sequential quadratic programming (QP) or linear programming (LP). Solving the LP problem using parallelization has been studied by Dongarra et al. (2002), and QP parallelization can be carried out by a parallel gradient projection-based decomposition method (Zanni et al., 2006). The key idea of QP parallelization is to split an original problem into a sequence of smaller QP subproblems, and to parallelize the most demanding tasks of the gradient projection within each QP subproblem. Through QP or LP parallelization, we can process large-scale Chinese text data of size O(10⁷).

4. BENCHMARK: PEKING UNIVERSITY CORPUS

In this section, we analyze the corpora obtained from the SIGHAN Bakeoff (http://www.sighan.org), which are popular corpora in Chinese language processing competitions. There are four datasets included in SIGHAN's International Chinese Word Segmentation Bakeoff: the Academia Sinica (AS), City University of Hong Kong (HK), Peking University (PK) and Microsoft Research (MSR) corpora. Each corpus is coded in Unicode and consists of a training set and a test set. The number of words in each corpus is shown in Table 2. In the Bakeoff corpora, out-of-vocabulary (OOV) words are defined as words in the test set which are not present in the training set. The contents of the corpora are carefully selected, and the domains in the corpora are broadly represented, including politics, economics, culture, law, science and technology, sports, military and literature. Therefore, the corpora are sufficiently representative to assess the performance of Chinese word segmentation.

We use the PK corpus in our experiments to investigate the proposed model. Table 2 shows that the PK corpus is well-balanced in terms of the OOV percentage and the size of the training and test sets, compared to the other three corpora. In the PK corpus, the training set has 161,212 sentences, i.e., n = 161,212, which is about 1.1 million words, while the test set has 14,922 sentences, equivalent to 17 thousand words. Two fragments randomly selected from the PK training and test sets are displayed in Figure 1. Furthermore, approximately 6.9% of the words in the test set are OOV, among which 30% are new words from time-sensitive events, such as 三通 "three links" and 非典 "SARS." In addition, more than 85% of the new words fall into the categories of 2-character new words or a 2-character word followed by another character.

Figure 1. Fragments of the PK training set and test set. [Figure not reproduced.]

Table 2. SIGHAN's corpora

    corpus   # training words    # test words      OOV rate
             (in thousands)      (in thousands)
    AS       5800                12                0.021
    HK       240                 35                0.071
    PK       1100                17                0.069
    MSR      20,000              226               0.002

We test the proposed method based on model (2) using two sets of N-gram templates. The first, method12, contains only the unigram and bigram templates, while the second, method1234, utilizes additional trigram and quadrigram templates. We compare these two methods with the two top performers in the Second International Chinese Word Segmentation Bakeoff. The performance of Chinese word segmenters is generally reported in terms of three performance criteria: precision (P), recall (R) and the evenly-weighted F-measure (F). The precision is the fraction of segmented words that are correct, the recall is the fraction of correct words that are segmented, and the evenly-weighted F-measure is the harmonic mean of the precision and recall, defined as F = 2PR/(P + R).

Table 3. Comparison of two top performers and the proposed method

    method            precision   recall   F-measure
    method12          0.929       0.965    0.947
    method1234        0.951       0.969    0.960
    1st in bakeoff    0.952       0.951    0.952
    2nd in bakeoff    0.955       0.939    0.947

The segmentation results are shown in Table 3. The proposed method has higher recall values than the two top performers across all situations. In particular, the proposed method12 using the unigram and bigram templates attains 92.9% precision, 96.5% recall and 94.7% in the F-measure, which delivers a comparable performance against the two top performers. Note that method12 has relatively low precision and high recall values. This is because unigram and bigram templates only utilize the information of two consecutive characters, and are unlikely to segment words with three or more characters correctly. When the trigram and quadrigram templates are added to the proposed method, a significant improvement in performance is achieved. Specifically, method1234 achieves 95.1% in precision, and delivers the highest recall, at 96.9%, and the highest F-measure, at 96.0%. In conclusion, method1234 performs the best among the four methods, at an expense of computation time.
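For concreteness, word-level precision and recall can be computed by aligning the character spans of the predicted and gold-standard words. A minimal sketch (helper names ours), applied to the ambiguous sentence of Section 2:

```python
def word_spans(words):
    """Map a word sequence to its set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def precision_recall_f(gold, pred):
    """Word-level P, R and evenly-weighted F = 2PR/(P + R)."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)          # words whose spans match exactly
    if correct == 0:
        return 0.0, 0.0, 0.0
    P, R = correct / len(p), correct / len(g)
    return P, R, 2 * P * R / (P + R)

# Gold: 网球拍/卖完/了 vs predicted: 网球/拍卖完/了 -- only 了 matches.
print(precision_recall_f(["网球拍", "卖完", "了"], ["网球", "拍卖完", "了"]))
# -> P = R = F = 1/3
```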

Our segmentation for the PK corpus data is computed using an Intel processor with 2 cores, quad CPU at 2.40GHz and 4GB of RAM. The quadrigram-based method1234 requires about 1.75 times the run time of the bigram-based method12, because of the complexity of quadrigrams over bigrams. Note that run times for the two competitors are not available on the SIGHAN website. The proposed method outperforms the two top performers for the PK corpus in the literature when sufficient N-gram templates are incorporated. However, higher-order N-grams require higher computational cost.

5. CONSTRUCTION OF LINGUISTICALLY-EMBEDDED FEATURES

The segmentation accuracy can be further improved by incorporating linguistic rules into the construction of the features f_k(s, c, t), through word categorization for Chinese words. This categorization method was first introduced by Gao et al. (2005) with five categories: lexical words (LW), morphologically derived words (MDW), factoids (FT), named entities (NE) and new words (NW). The taxonomy of Chinese words is summarized in Table 4.

Table 4. Taxonomy in Chinese words

    category   subcategory          examples
    LW         lexical word         学生, 照片, 约会
    MDW        affixation           老师们
               reduplication        马马虎虎
               splitting            吃了饭
               merging              上下文
               head+morpheme        拿出来
    FT         date & time          5月3日, 六月五日, 12点半, 三点二十分
               number & fraction    一千零二十四, 4897, 60%, 百分之一, 1/6
               email & website      [email protected], www.google.com
    NE         person name          张三, 约翰
               location name        北京, 上海
               organization name    长城, 大都会博物馆
    NW         new word             吐槽, 非典

For morphologically derived words, factoids and named entities, we use trigram and quadrigram templates such that the five main morphological rules are incorporated. The five rules include affixation (e.g., 老师们 "teachers" is teacher + plural), reduplication (e.g., 马马虎虎 "careless" reduplicates and emphasizes the word 马虎), splitting (e.g., 吃了饭 "already ate" splits a lexical word 吃饭 "eat" by a particle 了), merging (e.g., 上下文 "context" merges 上文 "above text" and 下文 "following text"), and head morpheme (e.g., 拿出来 "take out" is the head 拿 "take" + the morpheme 出来 "out"). For instance, the head morpheme rule yields the trigram template I(s(0) = B, s(+1) = M, s(+2) = E, c(0) = e₀, (c(+1), c(+2)) = (e₁, e₂)), where e₀ is a head character such as 拿 "take" or 放 "put," and (e₁, e₂) takes a value from a set of selected morphemes such as 出来 "out," 进去 "in," or 下来 "down"; the reduplication rule leads to the quadrigram template I(s(0) = B, s(+1) = M, s(+2) = M, s(+3) = E, c(0) = c(+1), c(+2) = c(+3)).

Factoid words mainly consist of numeric and foreign characters, such as the number 一千零二十四 "1024" or a foreign organization "FBI." Given a set of numeric and foreign characters F, factoid words lead to trigram templates I(s(0) = B, c(−1) ∉ F, (c(0), c(+1)) ∈ F), I(s(0) = M, (c(−1), c(0), c(+1)) ∈ F), and so on. Named entities include frequently-used Chinese names for persons, locations and organizations. A person's name requires extensive enumeration to identify, since it does not follow any language rules. In contrast, names of locations and organizations can be identified by using built-in feature templates, for example, an organization template I(s(−2) = B, s(−1) = M, s(0) = E, c(0) ∈ L), where L is a collection of keywords such as 部, 局 and 委员会 ("ministry," "bureau" and "committee").

It is much more challenging to identify new words in Chinese segmentation. There is little literature on new word identification, though it has a substantial impact on the performance of word segmentation. Therefore, it is important to develop good strategies to detect new words utilizing linguistic rules and more up-to-date language features. For example, enumeration can be used to detect new factoid words and named entities, as discussed above. Linguistic features constructed for lexicon and morphologically derived words can also be employed to detect new words. Specifically, certain characters are almost always located at the beginning or at the end of a Chinese word, so new words containing those characters can be easily detected by using the unigram template. For instance, 反 "anti-" typically appears at the beginning of a Chinese word, so the unigram template I(s(0) = B, c(0) = 反) can be used to detect new words such as 反对 "disagree" and 反抗 "resist." In addition, if a new word satisfies the splitting rule as discussed above, trigram templates such as I(s(−1) = B, s(0) = M, s(+1) = E, c(0) = 了) can be utilized for detecting new words like 吐了槽 "already complained."
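As an illustration of turning such rules into binary features, the sketch below implements the reduplication quadrigram template and the new-word unigram template for 反 described above; the function names are ours, not the paper's:

```python
def reduplication_feature(chars, tags, t):
    """Quadrigram indicator for the reduplication rule, e.g. 马马虎虎:
    I(s(0)=B, s(+1)=M, s(+2)=M, s(+3)=E, c(0)=c(+1), c(+2)=c(+3))."""
    if t + 3 >= len(chars):
        return 0
    return int(tags[t:t + 4] == "BMME"
               and chars[t] == chars[t + 1]
               and chars[t + 2] == chars[t + 3])

def anti_prefix_feature(chars, tags, t):
    """Unigram indicator I(s(0)=B, c(0)=反) for detecting new words
    that begin with 反 'anti-'."""
    return int(tags[t] == "B" and chars[t] == "反")

print(reduplication_feature("马马虎虎", "BMME", 0))  # 1
print(anti_prefix_feature("反对", "BE", 0))          # 1
```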

More importantly, the linguistically-embedded constraints can be integrated into (1), which is powerful for reducing the effective size of the parameter space through ranking the importance of features. As discussed above, some Chinese characters appear much more frequently at the beginning of a word. For instance, 读 "read" has a 74% chance of occurring in the beginning position (Li et al., 2004). Therefore we can formulate some simple constraints to obtain the relative order of the importance measures λ_k. For this example, we can require the importance measure λ_k for I(s(0) = B, c(0) = 读) to be larger than those associated with I(s(0) = M, c(0) = 读), I(s(0) = E, c(0) = 读) and I(s(0) = S, c(0) = 读). In addition, the existing linguistic rules presented earlier in this section should be considered and incorporated into the constraints as well. For example, the merging rule implies that the λ_k for 上下文 "context" with I(s(0) = B, s(+1) = M, s(+2) = E, (c(0), c(+1), c(+2)) = 上下文) should be relatively large compared to those associated with other choices of trigram features.

Specifically, the model in (1) with the hinge loss and these constraints can be formulated as

(3)    argmin_{Λ,ξ}  Σ_{i=1}^n ξ_i + η Σ_{k=1}^K J(λ_k)
       s.t.  1 − Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(s_i, c_i, t) ≤ ξ_i,  ξ_i ≥ 0,  i = 1, …, n,
             λ_k ≥ λ_j for all (k, j) ∈ I,

where ξ_i is a slack variable for the hinge loss of each sentence, and I comprises all available linguistically-embedded constraints. However, the construction of I requires enumerating all possible language rules manually, and thus can be labor intensive and time consuming. In the current project, we only make use of the standard N-gram features, and leave the linguistically-embedded features as a future development.

6. DISCUSSION

In this paper, we propose a machine-learning framework to utilize linguistically-embedded features for Chinese word segmentation. The proposed model is a character-based method constructing feature functions that map from characters to segmentation locators in words. The key idea is to build feature functions through N-gram templates, which contain the information of the character itself in conjunction with its consecutive characters. We apply the hinge loss to make the model more scalable, which is effective in reducing the number of constraints and also simplifies the constraint forms.

In addition, computational tractability is one crucial component in segmentation, because segmentation requires one to process a large amount of text information in a short time period for real-life applications. One important property of the proposed model is that the optimization process can be efficiently implemented through transforming complicated nonconvex optimization problems into many subproblems of convex minimization. This allows one to solve the many convex subproblems in a parallel fashion, and therefore helps to achieve scalable computing for high-volume text data.

Furthermore, there is a trade-off between accuracy in segmentation performance and computational complexity. Applying higher-order gram templates leads to higher accuracy in segmentation, but could result in an increased computational cost. For different corpora, the order of grams needs to be selected carefully to meet the time constraints of real-life applications. Finally, the segmentation of new words is still a challenging problem, and further research is needed on developing segmentation strategies which incorporate more complex linguistic rules.

ACKNOWLEDGEMENTS

Research supported in part by National Science Foundation Grants DMS-1207771, DMS-1415500, DMS-1415308, DMS-1308227, DMS-1415482, and HK GRF-11302615. The authors thank the editors, the associate editor and the reviewers for helpful comments and suggestions.

Received 2 December 2015

REFERENCES

Chang, P., Galley, M., and Manning, C. (2008). Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 224–232.
Chen, A. (2003). Chinese word segmentation using minimal linguistic knowledge. In Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, pp. 148–151.
Chen, S. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language 13, 359–394.
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning 20, 273–297.
Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., and White, A. (2002). The Sourcebook of Parallel Computing. Morgan Kaufmann, San Francisco.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360. MR1946581
Gao, J., Li, M., Wu, A., and Huang, C. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics 31, 531–574.
Joachims, T., Finley, T., and Yu, C. (2009). Cutting-plane training of structural SVMs. Machine Learning 77, 27–59.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, pp. 282–289.
Li, H., Huang, C., Gao, J., and Fan, X. (2004). The use of SVM for Chinese new word identification. In Proceedings of the 1st International Joint Conference on Natural Language Processing, Springer, pp. 497–504.
Meng, W., Liu, L., and Chen, A. (2010). A comparative study on Chinese word segmentation using statistical models. In IEEE International Conference on Software Engineering and Service Sciences, pp. 482–486.
Shen, X., Pan, W., and Zhu, Y. (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107, 223–232. MR2949354
Shen, X., Tseng, G., Zhang, X., and Wong, W. (2003). On ψ-learning. Journal of the American Statistical Association 98, 724–734. MR2011686
Sun, W. (2010). Word-based and character-based word segmentation models: Comparison and combination. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 1211–1219.
Sun, W. and Xu, J. (2011). Enhancing Chinese word segmentation using unlabeled data. In Proceedings of Empirical Methods in Natural Language Processing.
Teahan, W., Wen, Y., and Witten, I. (2000). A compression-based algorithm for Chinese word segmentation. Computational Linguistics 26, 375–393.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society Series B 58, 267–288. MR1379242
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6, 1453–1484. MR2249862
Vapnik, V. (1998). Statistical Learning Theory. Wiley, Chichester, UK. MR1641250
Wang, J., Shen, X., and Liu, Y. (2008). Probability estimation for large margin classifiers. Biometrika 95, 149–167. MR2409720
Wang, K., Zong, C., and Su, K. (2010). A character-based joint model for Chinese word segmentation. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 1173–1181.
Wong, K. F., Li, W., Xu, R., and Zhang, Z.-S. (2010). Introduction to Chinese Natural Language Processing. Morgan & Claypool Publishers.
Wu, A. and Jiang, Z. (2000). Statistically-enhanced new word identification in a rule-based Chinese system. In Proceedings of the 2nd ACL Chinese Processing Workshop, pp. 41–66.
Wu, Y., Zhang, H. H., and Liu, Y. (2010). Robust model-free multiclass probability estimation. Journal of the American Statistical Association 105, 424–436. MR2656060
Xu, J., Gao, J., Toutanova, K., and Ney, H. (2008). Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics 1, pp. 1017–1024.
Xue, N. (2003). Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing 8, 29–48.
Xue, N. and Shen, L. (2003). Chinese word segmentation as LMR tagging. In Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, pp. 176–179.
Zanni, L., Serafini, T., and Zanghirati, G. (2006). Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research 7, 1467–1492. MR2274413
Zhang, C., Zeng, D., Li, J., Wang, F., and Zuo, W. (2009). Sentiment analysis of Chinese documents: From sentence to document level. Journal of the American Society for Information Science and Technology 60, 2474–2487.
Zhang, R., Yasuda, K., and Sumita, E. (2008). Chinese word segmentation and statistical machine translation. ACM Transactions on Speech and Language Processing 5, 4:1–4:19.
Zhonghua Zihai Dictionary (1994). Zhonghua Book Company.

Xinxin Shu
Department of Biostatistics and Research Decision Sciences
Merck Research Laboratories
USA
E-mail address: [email protected]

Junhui Wang
Department of Mathematics
City University of Hong Kong
People's Republic of China
E-mail address: [email protected]

Xiaotong Shen
School of Statistics
University of Minnesota
USA
E-mail address: [email protected]

Annie Qu
Department of Statistics
University of Illinois at Urbana-Champaign
USA
E-mail address: [email protected]
