Statistics and Its Interface Volume 10 (2017) 165–173

Word segmentation in Chinese language processing

Xinxin Shu, Junhui Wang, Xiaotong Shen, and Annie Qu∗

∗Corresponding author.

This paper proposes a new statistical learning method for word segmentation in Chinese language processing. Word segmentation is the crucial first step towards natural language processing. Despite progress, segmentation remains under-studied, particularly for the Chinese language, the second most popular language among all internet users. One major difficulty is that the Chinese language is highly context-dependent and ambiguous in terms of word representations. To overcome this difficulty, we cast the problem of segmentation into a framework of sequence classification, where an instance (observation) is a sequence of characters, and a class label is a sequence determining how each character is segmented. Given the class label, each character sequence can be segmented into linguistically meaningful words. The proposed method is investigated through the Peking university corpus of Chinese documents. Our numerical study shows that the proposed method compares favorably with the state-of-the-art segmentation methods in the literature.

AMS 2000 subject classifications: Primary 62H30; secondary 68T50.

Keywords and phrases: Cutting-plane algorithm, Language processing, Support vector machines, Word segmentation.

1. INTRODUCTION

Digital information has become an essential part of modern life, from media news, entertainment, business, distance learning and communication, and research on product marketing, to potential threat detection and national security. With the enormous amount of information gathered nowadays, manual information processing is far from sufficient, and the development of fast automatic processes of information extraction is becoming extremely important.

In this paper, we focus on problems arising from Chinese natural language processing, and in particular we address problems of word segmentation. This is an important and quite timely area, since the Chinese language has become the second most popular language among all internet users. In 2000, there were about 22.5 million Chinese internet users. However, after rapid growth in the last decade, there were over 649 million internet users in 2013 writing text documents in Chinese, making up 23.2% of all internet users, compared to 28.6% for English. The online sales and entertainment businesses also promote the popularity of Chinese in the digital world. For example, the Amazon.cn data (Zhang et al., 2009) consist of 5 × 10⁵ Chinese reviews on various products.

However, Chinese language processing is still an area which has been severely under-studied. This is likely due to specific challenges caused by the characteristics of the Chinese language. Word segmentation is considered a crucial step towards Chinese language processing tasks, due to the unique characteristics of Chinese language structure. Chinese words generally are composed of multiple characters without any delimiter appearing between words. For example, the word 博客 "blog" consists of two characters, 博 "plentiful" and 客 "guest". If the characters in a word are treated individually rather than together, this could lead to a completely different meaning. Good word segmenters correctly transform text documents into collections of linguistically meaningful words, and make it possible to extract information accurately from the documents. Therefore, accurate segmentation is a prerequisite step for Chinese document processing. Without effective word segmentation of Chinese documents, it is extremely difficult to extract correct information, given the ambiguous nature of Chinese words.

Existing methods for Chinese segmentation are essentially based on either characters, words or their hybrids (Sun, 2010; Gao et al., 2005). Teahan et al. (2000) proposed a word-based method by applying forward or backward maximum matching strategies. Their method requires an existing corpus as a reference to identify exact character sequences, and then segments character by character sequentially, processing documents in either a forward or backward direction. This method is also developed in Chen and Goodman (1999). One obvious drawback of this approach is that the segmentation heavily relies on the coverage of the given corpus, and thus it is not designed for identifying new words which are not in the corpus.

The character-based method considers segmentation as a sequence labeling problem (Xue, 2003). That is, the location of characters in a word is labeled through statistical modeling such as conditional random fields (CRF; Lafferty et al., 2001; Chang et al., 2008) or structured support vector machines (SVMstruct; Tsochantaridis et al., 2005) based on the hinge loss. Xue and Shen (2003) proposed a maximum entropy approach which combines both the character-based and word-based methods. Specifically, their idea is to integrate forward/backward maximum matching with statistical or machine-learning models to improve segmentation performance. Sun and Xu (2011) proposed a unified approach for learning a segmentation model from both training and test datasets to improve segmentation accuracy.

However, these approaches suffer major drawbacks in that they do not utilize available linguistic information which can enhance the segmentation (Gao et al., 2005), and/or are incapable of identifying new words not appearing in training documents. Some current segmenters treat word segmentation and new word identification as two separate processes (Chen, 2003; Wu and Jiang, 2000), which may lead to inconsistent results in segmentation. Other methods embed segmentation into other processing procedures, such as translation, to serve a specific purpose, for example, Chinese-English translation (Xu et al., 2008; Zhang et al., 2008). These methods, unified with other processing approaches, have not been proposed for general use.

Although in some situations character-based methods tend to outperform word-based methods in terms of segmentation accuracy (Wang et al., 2010), the enormous variety of possible permutations of Chinese characters makes the computation of segmentation intractable. In this paper, we propose a statistical method for Chinese word segmentation which incorporates linguistic rules to restrict the possible permutations of Chinese characters and thus alleviate the computational burden. Specifically, an instance (observation) is treated as a character sequence linked with a tag sequence which represents the position of each character as the beginning, middle, or end of a word. Given the tag sequence, each character sequence can then be segmented (broken or grouped) into linguistically meaningful words. The key challenge of segmentation is that it does not have explicit features, and the number of choices for tag sequences increases exponentially with the length of the sequence. To circumvent this difficulty, a segmentation function is formulated as a rating function measuring the meaningfulness of the segmented words given each sequence labeling.

Specifically, we utilize linguistically meaningful features through higher-order N-gram templates in the segmentation model. Linguistic features using different-order gram templates are constructed to build the candidate set of features, and significant features are selected from the candidate set by minimizing a segmentation loss function. The proposed model can achieve higher segmentation accuracy by incorporating linguistic rules while reducing the estimation complexity. This is because the proposed segmentation strategy does not rely completely on training samples, which cannot identify new words. Instead, it is built on established linguistic rules, which can segment new words more accurately, as the linguistic rules apply to new words as well.

The paper is organized as follows. Section 2 proposes the linguistically-embedded learning model. Section 3 provides a computational strategy to meet the computational challenges in solving large-scale optimization for the proposed model. Section 4 illustrates the proposed method through application to the Peking university corpus in the SIGHAN Bakeoff. The final section provides concluding remarks and a brief discussion.

2. CHINESE LANGUAGE SEGMENTATION

One major challenge in Chinese word segmentation is that the Chinese language is a highly context-dependent, or strongly analytic, language. The major differences between Chinese and English are as follows. Chinese morphemes corresponding to words have very little inflection. English, on the other hand, is rich in inflection and therefore more context-independent. A large number of Chinese words have more than one meaning under different contexts. For example, the original meaning of the word 水分 is "water," but it could also mean "inflated;" the word 算账 has the double meanings of "balance budget" and "reckoning." Chinese has no tense on verbal inflections to distinguish past such as "-ed," present such as "-ing" and future activities, no number marking such as "-s" in English to distinguish singular versus plural, and no upper or lower case marking to indicate the beginning of a sentence. In addition, English morphemes can have more than one syllable, while Chinese morphemes are typically monosyllabic and written as one character (Wong et al., 2010).

Another challenge is that the number of Chinese characters is much greater than the number of letters in English. The Kangxi Dictionary from the Qing dynasty in the 17th century records around 47,035 characters. Nowadays the number of characters has almost doubled to 87,019, according to the Zhonghua Zihai dictionary (Zhonghua Book Company, 1994). Moreover, new Chinese characters are constantly being created by internet users, at a pace matching the rapid growth of the internet in this information age.

In addition, the writing of Chinese characters is not unified, because there are two versions of character writing: one is based on traditional characters, and the other is simplified character writing. Simplified characters are officially used in mainland China, whereas traditional characters are maintained in Taiwan, Hong Kong and Macau. This leads to different coding systems for electronic Chinese documents and webpages. There are three main coding systems, namely GB, Big5, and Unicode. The GB encoding scheme is applied to simplified characters, while Big5 is for traditional characters. Unicode can be applied to both writing styles. One advantage of the Unicode system is that both GB and Big5 can be converted into Unicode.
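As a side note, this conversion is a one-line decode step in most modern programming environments. A minimal Python sketch follows; the codec identifiers "gbk" and "big5" are Python's standard names for these encodings, not terminology from this paper:

```python
# Python 3 strings are Unicode, so converting GB- or Big5-encoded bytes
# into Unicode amounts to decoding with the appropriate codec.
simplified = "博客"    # "blog", written in simplified characters
traditional = "網球"   # "tennis", written in traditional characters

gb_bytes = simplified.encode("gbk")      # bytes as stored in a GB-encoded file
big5_bytes = traditional.encode("big5")  # bytes as stored in a Big5-encoded file

assert gb_bytes.decode("gbk") == simplified      # GB -> Unicode
assert big5_bytes.decode("big5") == traditional  # Big5 -> Unicode
```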

Segmentation in Chinese language processing is a crucial step because there is no boundary delimiter between consecutive Chinese words. In fact, most Chinese characters can appear in any position within different words. Table 1 shows an example where the Chinese character 发 "happen" occurs at three different positions.

Table 1. A character can appear in different positions within different words

    position     example
    beginning    发生 "to happen"
    middle       始发站 "starting station"
    end          头发 "hair"

This unique feature of the Chinese language makes it quite challenging to determine word boundaries simply through detecting certain types of Chinese characters, even though the number of characters is finite. This is due to the fact that a character appearing in different positions leads to different meanings and interpretations of words and phrases. For instance, a segmenter could segment the sentence 网球拍卖完了 as 网球拍/卖完/了 "Tennis racquets are sold out," or segment the sentence as 网球/拍卖完/了 "Tennis ball(s) is/are auctioned." The ambiguity of the Chinese language is extremely challenging for Chinese language processing, since mechanical methods such as tabulating frequencies of key words from the context are not effective for text mining. Therefore segmentation plays a very important role in Chinese language processing, since different segmentations may lead to different sentiment analyses.

2.1 Linguistically-embedded learning framework

In this section, we first introduce a character-based framework, and then illustrate how to incorporate the character-based framework into the proposed model. Let T be the number of characters in one sentence, denote the corresponding sentence as c = c_1 … c_T, and denote the set of its segmentation locators as s = s_1 … s_T. Here each character c_t corresponds to a segmentation locator s_t, and s_t ∈ S. Meng et al. (2010) suggest that a simple 4-tag set S = {B, M, E, S} is sufficient for unique determination and segmentation, where B, M, E, and S denote the beginning, middle, and end of a word, and a single-character word, respectively. For instance, consider the 11-character sentence c = 我们将创造美好的新世纪 "we will create a bright future," where T = 11, c_1 = 我, c_2 = 们, …, c_11 = 纪, and s = BESBEBESBME. So the linguistically meaningful segmentation is: 我们/将/创造/美好/的/新世纪. The 4-tag segmentation rule is effective in achieving segmentation accuracy and computational efficiency, which are two important and desirable properties in natural language processing.
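To make the 4-tag convention concrete, the following minimal Python sketch (the helper name decode_tags is ours, not from the paper) recovers the word segmentation from a {B, M, E, S} tag sequence, reproducing the example above:

```python
def decode_tags(chars, tags):
    """Recover segmented words from a B/M/E/S tag sequence."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # a single-character word
            words.append(ch)
        elif tag == "B":          # a word begins
            current = ch
        elif tag == "M":          # the word continues
            current += ch
        else:                     # tag == "E": the word ends
            words.append(current + ch)
            current = ""
    return words

print(decode_tags("我们将创造美好的新世纪", "BESBEBESBME"))
# ['我们', '将', '创造', '美好', '的', '新世纪']
```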
To identify the segmentation locator for each Chinese sentence, we construct the segmentation model based on training data (c_i, s_i), i = 1, …, n, with a mapping φ: C^T → S^T, where c_i and s_i are the character and locator vectors in the i-th sentence, and n is the number of sentences. For instance, φ({我, 们}) = {B, E}, indicating that 我们 is segmented as a word. Here φ can be a discontinuous and ultra-high-dimensional function when the size of a Chinese document is large. To reduce the dimensionality, we introduce a continuous segmentation function f, which quantifies the appropriateness of the segmentation for each sentence. Specifically, φ(c) = argmax_s f(c, s), and

    f(c, s) = Σ_{k=1}^K Σ_{t=1}^T λ_k f_k(s, c, t),

where f_k(s, c, t) is a linguistically meaningful feature measuring the appropriateness of c at a specific location t of s, K is the number of features, and Λ = (λ_1, …, λ_K)^T are the relative importance measures of the features. Usually, f_k(s, c, t) takes values in {0, 1}, and thus the value of f(c, s) becomes a weighted measure of all features appearing in the segmentation of c by s. Therefore, one key idea of the proposed method is to learn the relative importance Λ through the training documents, and then apply the estimated f(c, s) to segment future documents.

To estimate Λ, we minimize the following cost function:

(1)    argmin_Λ  Σ_{i=1}^n L( Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(s_i, c_i, t) ) + η Σ_{k=1}^K J(λ_k),

where L(u) is a large-margin loss function that is non-increasing in u, J(λ) is a regularizer, and η is a tuning parameter. The large-margin loss function can take various forms and gives preference to a large value of f(c, s), which mimics the optimal rule φ(c) to achieve good estimation accuracy of Λ. To ensure model sparsity, the choices of J(λ) include the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the truncated L1-penalty (Shen et al., 2012), among others. Note that the positivity constraint λ_k ≥ 0 is unnecessary in (1), as the f_k(s, c, t)'s are all non-negative and L(u) is non-increasing in u. We obtain f̂(c, s) with the selected important features through minimizing (1), and the sequence c can then be segmented by ŝ = φ̂(c) = argmax_s f̂(c, s). That is, a document is segmented by maximizing the weighted combination of those important features constructed based on each candidate segmentation. Note that the segmentation formulation in (1) does not produce probabilistic outputs but only the most appropriate segmentation s. In the literature, a number of methods have been proposed (e.g., Wang et al., 2008; Wu et al., 2010) to construct probabilistic estimates for margin-based methods, which may be adapted to the segmentation formulation.

The binary linguistic features f_k(s_i, c_i, t) are constructed based on N-gram templates. The unigram (or 1-gram) templates contain I(s(0) = s_t, c(−1) = c_{t−1}), I(s(0) = s_t, c(0) = c_t) and I(s(0) = s_t, c(+1) = c_{t+1}), and the bigram (or 2-gram) templates include I(s(0) = s_t, c(−1) = c_{t−1}, c(0) = c_t) and I(s(0) = s_t, c(0) = c_t, c(+1) = c_{t+1}), where c(−1), c(0), c(+1) and s(0) denote the previous, current and next characters, and the tag for the current character, respectively.
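As an illustration, the sketch below enumerates the unigram and bigram template features that fire for the character at position t; the tuple-keyed encoding of the indicator functions is our own illustrative choice, not the paper's implementation:

```python
def ngram_features(chars, tags, t, pad="<PAD>"):
    """Active unigram/bigram indicator features for character position t."""
    prev = chars[t - 1] if t > 0 else pad
    cur = chars[t]
    nxt = chars[t + 1] if t + 1 < len(chars) else pad
    s0 = tags[t]
    return [
        ("U-1", s0, prev),          # I(s(0)=s_t, c(-1)=c_{t-1})
        ("U0", s0, cur),            # I(s(0)=s_t, c(0)=c_t)
        ("U+1", s0, nxt),           # I(s(0)=s_t, c(+1)=c_{t+1})
        ("B-1,0", s0, prev + cur),  # I(s(0)=s_t, c(-1)=c_{t-1}, c(0)=c_t)
        ("B0,+1", s0, cur + nxt),   # I(s(0)=s_t, c(0)=c_t, c(+1)=c_{t+1})
    ]

# Features firing for c_2 = 们 (tag E) in the example sentence:
print(ngram_features("我们将创造美好的新世纪", "BESBEBESBME", 1))
# [('U-1','E','我'), ('U0','E','们'), ('U+1','E','将'),
#  ('B-1,0','E','我们'), ('B0,+1','E','们将')]
```

These five active features correspond exactly to the features f_4 through f_8 listed below for the word 我们.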

The unigram templates contain the single character's information on the previous, current or next character given the current segmentation locator, while the bigram templates include two consecutive characters' information through combining the previous or next character with the current character.

For illustration, let c = 我们将创造美好的新世纪 and s = BESBEBESBME. If each first and last character has 2 unigram and 1 bigram features, and each middle character has 3 unigram and 2 bigram features, then this generates 31 unigram and 20 bigram features in total. Higher-order gram templates can be defined in a similar way. However, the more higher-order gram templates are used, the more complex the model becomes. Fortunately, mastering around 3000 characters is sufficient for understanding 99% of Chinese documents (Wong et al., 2010). Moreover, the proportions of words with one, two, three, and four or more characters are 5%, 75%, 14% and 6%, respectively. Therefore, unigram, bigram, trigram and quadrigram templates are sufficient to capture most Chinese words. For lexicon words, as illustrated in the above example, we can apply unigram and bigram templates to construct their binary features. For example, for the word 我们 "we" appearing at the beginning of the above sentence, the feature functions are

    f_1 = I(s(0) = B, c(0) = 我),
    f_2 = I(s(0) = B, c(+1) = 们),
    f_3 = I(s(0) = B, c(0) = 我, c(+1) = 们),
    f_4 = I(s(0) = E, c(−1) = 我),
    f_5 = I(s(0) = E, c(0) = 们),
    f_6 = I(s(0) = E, c(+1) = 将),
    f_7 = I(s(0) = E, c(−1) = 我, c(0) = 们),
    f_8 = I(s(0) = E, c(0) = 们, c(+1) = 将).

To further facilitate the computation, we consider a surrogate loss function L(f, c_i, s_i) = L(Λᵀ F_{c_i,s_i}), where Λᵀ F_{c_i,s_i} is the generalized function margin for multiclass classification (Vapnik, 1998). We propose to use the hinge loss L(u) = (1 − u)₊ for the segmentation formulation, which is often used in large-margin classification such as the support vector machine (SVM; Cortes and Vapnik, 1995). The hinge loss works effectively as a loss function for large-margin classification, since the more the margin is violated, the greater the loss.

The proposed formulation with the hinge loss has a number of advantages. First, the model (2) contains only 2n + 1 constraints, which is on a much smaller scale compared to the exponential order of operations required by conditional random fields (CRF) and the structured support vector machine (SVMstruct). The CRF combines conditional models with the global normalization of random fields, and the SVMstruct solves classification problems involving multiple dependent output variables or structured outputs, applicable to complex-output problems such as natural language parsing. However, both methods involve exponential numbers of constraints. The proposed method makes use of the functional representation in CRF and integrates it in a regularization form as in SVMstruct. In a sense, it integrates the strengths of both CRF and SVMstruct, leading to a much simpler formulation. Second, the optimization process of (2) can be efficiently implemented through parallel computing, which makes the segmentation scalable. The details of the parallel algorithm for a scalable implementation are provided in Section 3. Third and most interestingly, the proposed method has the potential to incorporate various linguistic rules of Chinese in constructing the features, which may lead to an improvement in segmentation performance. More detailed discussion is deferred to Section 5.

Specifically, the model in (1) with the hinge loss can be formulated as

(2)    argmin_{Λ,ξ}  Σ_{i=1}^n ξ_i + η Σ_{k=1}^K J(λ_k)
       s.t.  1 − Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(s_i, c_i, t) ≤ ξ_i,  ξ_i ≥ 0,  i = 1, …, n,

where ξ_i is a slack variable for the hinge loss of each sentence.

In addition, we may also consider alternative surrogate loss functions, such as the ψ-loss function L(u) = ψ(u) = min(1, (1 − u)₊) (Shen et al., 2003). Although the ψ-loss function is non-convex, it is able to attain the optimal rate of convergence under certain conditions and outperforms the hinge loss in general. Intuitively, the advantage of the ψ-loss lies in the fact that it is much closer to the 0–1 loss I(u ≤ 0) in identifying segmentation errors, especially when u is negative. Consequently, the ψ-loss is much less affected by, e.g., an outlying misclassified sentence with a large negative functional margin Λᵀ F_{c_i,s_i}. More detailed discussion can be found in Shen et al. (2003).

3. ALGORITHM AND COMPUTATION

It is important to develop an efficient optimization strategy to improve computational efficiency. The key idea of the proposed computational scheme is "decomposition and combination," to meet the computational challenges of solving the large-scale optimization for the proposed model. The procedure is summarized in Algorithm 1.

We apply the truncated L1-penalty (Shen et al., 2012) for J(Λ), i.e., J(·) = min(|·|/ν, 1), where ν is a tuning parameter. The truncated L1-penalty has the following advantages. It selects linguistic features adaptively with Λ and also corrects the Lasso bias through tuning ν. It is capable of handling small Λ of linguistic features through tuning ν, and therefore improves accuracy in segmentation. In addition, the truncated L1-penalty is piecewise linear, and is computationally efficient in the optimization process.
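For reference, here is a small vectorized sketch of the three functions just introduced: the hinge loss of (2), the ψ-loss, and the truncated L1-penalty. The names are ours, and the absolute value inside the penalty is our reading of Shen et al. (2012):

```python
import numpy as np

def hinge(u):
    """Hinge loss L(u) = (1 - u)_+ used in formulation (2)."""
    return np.maximum(0.0, 1.0 - u)

def psi(u):
    """psi-loss L(u) = min(1, (1 - u)_+): caps the penalty at 1, so a badly
    violated sentence (large negative margin) cannot dominate the fit."""
    return np.minimum(1.0, hinge(u))

def truncated_l1(lam, nu):
    """Truncated L1-penalty J(lambda) = min(|lambda|/nu, 1)."""
    return np.minimum(np.abs(lam) / nu, 1.0)

u = np.array([-2.0, 0.0, 0.5, 2.0])
print(hinge(u))  # [3.  1.  0.5 0. ]  -- grows without bound as u decreases
print(psi(u))    # [1.  1.  0.5 0. ]  -- bounded by 1
```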

Algorithm 1. Chinese word segmentation procedure based on penalized hinge loss

    1: Build features f_k(s_i, c_i, t) with N-gram templates for every k = 1, …, K, i = 1, …, n and t = 1, …, T.
    2: Implement the ad hoc cutting-plane algorithm to obtain estimates λ̂_k for k = 1, …, K by minimizing (2).
    3: Predict segmentation locators ŝ by maximizing f̂(c, s) ≡ Σ_{k=1}^K Σ_{t=1}^T λ̂_k f_k(c, s, t) for any c ∈ C.

The optimization is carried out using an ad hoc cutting-plane algorithm. The idea of the cutting-plane method is to refine feasible sets iteratively through linear inequalities. Let M be the set of constraints in model (2) and W ⊂ M be the current working set of constraints. In each iteration, the algorithm finds the solution over the current working set W, searches for the most violated constraint in M\W, and then adds it to the working set. The algorithm stops when all violations of the constraints are smaller than the tolerance ε. The ad hoc cutting-plane algorithm is illustrated in Algorithm 2.

Algorithm 2. The cutting-plane algorithm for model (2)

    1: Initialize η and ε, and set W = ∅;
    2: Repeat
    3:   Compute Λ̂⁽ᵐ⁾ = argmin_{Λ,ξ} Σ_{i=1}^n ξ_i + η Σ_{k=1}^K J(λ_k), subject to the constraints in W;
    4:   Obtain the constraint l⁽ᵐ⁾ ∈ M which has the largest violation in M given Λ̂⁽ᵐ⁾;
    5:   Set W = W ∪ {l⁽ᵐ⁾};
    6: Until no violation is larger than ε;
    7: Obtain the estimator Λ̂.
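The loop structure of Algorithm 2 can be sketched as follows. The subproblem solver and the violation oracle are deliberately left abstract, since their implementations (an optimizer for (2) restricted to W, and a search for the most violated margin constraint) are problem-specific; this is a structural sketch, not the paper's code:

```python
def cutting_plane(constraints, solve_over, violation, eps):
    """Skeleton of Algorithm 2.

    constraints: the full constraint set M (assumed non-empty);
    solve_over(W): minimizer of (2) subject only to the working set W;
    violation(con, sol): how much solution `sol` violates constraint `con`.
    """
    working = []                                   # step 1: W = empty set
    while True:
        sol = solve_over(working)                  # step 3
        worst = max(constraints, key=lambda c: violation(c, sol))  # step 4
        if violation(worst, sol) <= eps:           # step 6: stop when no
            return sol                             #   violation exceeds eps
        working.append(worst)                      # step 5: W = W ∪ {l(m)}
```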

The computational efficiency of the cutting-plane algorithm for the hinge loss has been extensively investigated in empirical studies. Indeed, it is much faster than conventional training methods derived from decomposition methods (Joachims et al., 2009). Note that model (2) contains a large number of features, which is computationally challenging. For example, assuming that C contains the 1000 most common Chinese characters and the character locator set S has the 4 tags {B, M, E, S}, the unigram and bigram templates involve 3 × |S| × |C| + 2 × |S| × |C| × |C| = 3 × 4 × 1000 + 2 × 4 × 1000² = 12,000 + 8,000,000, or roughly 8 × 10⁶ features. It might be necessary to implement the parallel computing strategy in Step 3 to accelerate the computational speed.

To handle large documents or texts, MapReduce computation can be utilized to break large problems into many small subproblems in a recursive and parallel manner. In particular, we decompose our cost functions and regularizers over many observations by transforming complicated nonconvex optimization problems into many subproblems of convex minimization. In addition, we can alleviate high storage costs and increase the computational speed through parallel computing. To achieve this goal, we can employ OpenMP, the multi-platform shared-memory parallel programming platform (http://www.openmp.org), or Mahout, a library for scalable machine learning and data mining. These tools for solving large-scale problems allow us to analyze data containing several billions (10⁹) of observations on a single machine with reasonable computational speed.

We can also consider other penalty functions in model (2). For example, if the L2-norm penalty is applied as the regularizer, then solving model (2) is a convex optimization problem. This can be solved by sequential quadratic programming (QP) or linear programming (LP). Solving the LP problem using parallelization has been studied by Dongarra et al. (2002), and QP parallelization can be carried out by a parallel gradient projection-based decomposition method (Zanni et al., 2006). The key idea of QP parallelization is to split an original problem into a sequence of smaller QP subproblems, and to parallelize the most demanding tasks of the gradient projection within each QP subproblem. Through QP or LP parallelization, we can process large-scale Chinese text data of size O(10⁷).

4. BENCHMARK: PEKING UNIVERSITY CORPUS

In this section, we analyze the corpora obtained from the SIGHAN Bakeoff (http://www.sighan.org), which are popular corpora in Chinese language processing competitions. There are four datasets included in SIGHAN's International Chinese Word Segmentation Bakeoff: the Academia Sinica (AS), City University of Hong Kong (HK), Peking University (PK) and Microsoft Research (MSR) corpora. Each corpus is coded in Unicode and consists of a training set and a test set. The number of words in each corpus is shown in Table 2. In the Bakeoff corpora, out-of-vocabulary (OOV) words are defined as words in the test set which are not present in the training set. The contents of the corpora are carefully selected, and the domains in the corpora are broadly represented, including politics, economics, culture, law, science and technology, sports, military and literature. Therefore, the corpora are sufficiently representative to assess the performance of Chinese word segmentation.

We use the PK corpus in our experiments to investigate the proposed model. Table 2 shows that the PK corpus is well-balanced in terms of the OOV percentage and the size of the training and test sets, compared to the other three corpora. In the PK corpus, the training set has 161,212 sentences, i.e., n = 161,212, which is about 1.1 million words, while the test set has 14,922 sentences, equivalent to 17 thousand words. Two fragments randomly selected from the PK training and test sets are displayed in Figure 1. Furthermore, approximately 6.9% of the words in the test set are OOV, among which 30% are new words from time-sensitive events, such as 三通 "three links" and 非典 "SARS." In addition, more than 85% of the new words fall into the categories of 2-character new words or a 2-character word followed by another character.

Figure 1. Fragments of the PK training set and test set. [Figure not reproduced.]

Table 2. SIGHAN's corpora

    corpus   # training words    # test words      OOV rate
             (in thousands)      (in thousands)
    AS       5800                12                0.021
    HK       240                 35                0.071
    PK       1100                17                0.069
    MSR      20,000              226               0.002

We test the proposed method based on model (2) using two sets of N-gram templates. The first, method12, contains only the unigram and bigram templates, while the second, method1234, utilizes additional trigram and quadrigram templates. We compare these two methods with the two top performers in the Second International Chinese Word Segmentation Bakeoff. The performance of Chinese word segmenters is generally reported in terms of three performance criteria: precision (P), recall (R) and the evenly-weighted F-measure (F). The precision is the fraction of segmented words that are correct, the recall is the fraction of correct words that are segmented, and the evenly-weighted F-measure is the harmonic mean of the precision and recall, defined as F = 2PR/(P + R).

Table 3. Comparison of two top performers and the proposed method

    method            precision   recall   F-measure
    method12          0.929       0.965    0.947
    method1234        0.951       0.969    0.960
    1st in bakeoff    0.952       0.951    0.952
    2nd in bakeoff    0.955       0.939    0.947

The segmentation results are shown in Table 3. The proposed method has higher recall values than the two top performers across all situations. In particular, the proposed method12 using the unigram and bigram templates attains 92.9% precision, 96.5% recall and 94.7% in the F-measure, which delivers a comparable performance against the two top performers. Note that method12 has relatively low precision and high recall values. This is because unigram and bigram templates only utilize the information of two consecutive characters, and are unlikely to segment words with three or more characters correctly. When the trigram and quadrigram templates are added to the proposed method, a significant improvement in performance is achieved. Specifically, method1234 achieves 95.1% in precision, and delivers the highest recall, at 96.9%, and the highest F-measure, at 96.0%. In conclusion, method1234 performs the best among the four methods, at an expense of computation time.
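For concreteness, word-level precision and recall can be computed by aligning the character spans of the predicted and gold-standard words. A minimal sketch (helper names ours), applied to the ambiguous sentence of Section 2:

```python
def word_spans(words):
    """Map a word sequence to its set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def precision_recall_f(gold, pred):
    """Word-level P, R and evenly-weighted F = 2PR/(P + R)."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)          # words whose spans match exactly
    if correct == 0:
        return 0.0, 0.0, 0.0
    P, R = correct / len(p), correct / len(g)
    return P, R, 2 * P * R / (P + R)

# Gold: 网球拍/卖完/了 vs predicted: 网球/拍卖完/了 -- only 了 matches.
print(precision_recall_f(["网球拍", "卖完", "了"], ["网球", "拍卖完", "了"]))
# -> P = R = F = 1/3
```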

Our segmentation for the PK corpus data is computed using an Intel processor with 2 cores, quad CPU at 2.40GHz and 4GB of RAM. The quadrigram-based method1234 requires about 1.75 times the run time of the bigram-based method12, because of the complexity of quadrigrams over bigrams. Note that run times for the two competitors are not available on the SIGHAN website. The proposed method outperforms the two top performers for the PK corpus in the literature when sufficient N-gram templates are incorporated. However, higher-order N-grams require higher computational cost.

5. CONSTRUCTION OF LINGUISTICALLY-EMBEDDED FEATURES

The segmentation accuracy can be further improved by incorporating linguistic rules into the construction of the features f_k(s, c, t), through word categorization for Chinese words. This categorization method was first introduced by Gao et al. (2005) with five categories: lexical words (LW), morphologically derived words (MDW), factoids (FT), named entities (NE) and new words (NW). The taxonomy of Chinese words is summarized in Table 4.

Table 4. Taxonomy in Chinese words

    category   subcategory          examples
    LW         lexical word         学生, 照片, 约会
    MDW        affixation           老师们
               reduplication        马马虎虎
               splitting            吃了饭
               merging              上下文
               head+morpheme        拿出来
    FT         date & time          5月3日, 六月五日, 12点半, 三点二十分
               number & fraction    一千零二十四, 4897, 60%, 百分之一, 1/6
               email & website      [email protected], www.google.com
    NE         person name          张三, 约翰
               location name        北京, 上海
               organization name    长城, 大都会博物馆
    NW         new word             吐槽, 非典

For morphologically derived words, factoids and named entities, we use trigram and quadrigram templates such that the five main morphological rules are incorporated. The five rules include affixation (e.g., 老师们 "teachers" is teacher + plural), reduplication (e.g., 马马虎虎 "careless" reduplicates and emphasizes the word 马虎), splitting (e.g., 吃了饭 "already ate" splits a lexical word 吃饭 "eat" by a particle 了), merging (e.g., 上下文 "context" merges 上文 "above text" and 下文 "following text"), and head morpheme (e.g., 拿出来 "take out" is the head 拿 "take" + the morpheme 出来 "out"). For instance, the head morpheme rule yields the trigram template I(s(0) = B, s(+1) = M, s(+2) = E, c(0) = e₀, (c(+1), c(+2)) = (e₁, e₂)), where e₀ is a head character such as 拿 "take" or 放 "put," and (e₁, e₂) takes a value from a set of selected morphemes such as 出来 "out," 进去 "in," or 下来 "down"; the reduplication rule leads to the quadrigram template I(s(0) = B, s(+1) = M, s(+2) = M, s(+3) = E, c(0) = c(+1), c(+2) = c(+3)).

Factoid words mainly consist of numeric and foreign characters, such as the number 一千零二十四 "1024" or a foreign organization "FBI." Given a set of numeric and foreign characters F, factoid words lead to trigram templates I(s(0) = B, c(−1) ∉ F, (c(0), c(+1)) ∈ F), I(s(0) = M, (c(−1), c(0), c(+1)) ∈ F), and so on. Named entities include frequently-used Chinese names for persons, locations and organizations. A person's name requires extensive enumeration to identify, since it does not follow any language rules. In contrast, names of locations and organizations can be identified by using built-in feature templates, for example, an organization template I(s(−2) = B, s(−1) = M, s(0) = E, c(0) ∈ L), where L is a collection of keywords such as 部, 局 and 委员会 ("ministry," "bureau" and "committee").

It is much more challenging to identify new words in Chinese segmentation. There is little literature on new word identification, though it has a substantial impact on the performance of word segmentation. Therefore, it is important to develop good strategies to detect new words utilizing linguistic rules and more up-to-date language features. For example, enumeration can be used to detect new factoid words and named entities, as discussed above. Linguistic features constructed for lexicon and morphologically derived words can also be employed to detect new words. Specifically, certain characters are almost always located at the beginning or at the end of a Chinese word, so new words containing those characters can be easily detected by using the unigram template. For instance, 反 "anti-" typically appears at the beginning of a Chinese word, so the unigram template I(s(0) = B, c(0) = 反) can be used to detect new words such as 反对 "disagree" and 反抗 "resist." In addition, if a new word satisfies the splitting rule as discussed above, trigram templates such as I(s(−1) = B, s(0) = M, s(+1) = E, c(0) = 了) can be utilized for detecting new words like 吐了槽 "already complained."
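As an illustration of turning such rules into binary features, the sketch below implements the reduplication quadrigram template and the new-word unigram template for 反 described above; the function names are ours, not the paper's:

```python
def reduplication_feature(chars, tags, t):
    """Quadrigram indicator for the reduplication rule, e.g. 马马虎虎:
    I(s(0)=B, s(+1)=M, s(+2)=M, s(+3)=E, c(0)=c(+1), c(+2)=c(+3))."""
    if t + 3 >= len(chars):
        return 0
    return int(tags[t:t + 4] == "BMME"
               and chars[t] == chars[t + 1]
               and chars[t + 2] == chars[t + 3])

def anti_prefix_feature(chars, tags, t):
    """Unigram indicator I(s(0)=B, c(0)=反) for detecting new words
    that begin with 反 'anti-'."""
    return int(tags[t] == "B" and chars[t] == "反")

print(reduplication_feature("马马虎虎", "BMME", 0))  # 1
print(anti_prefix_feature("反对", "BE", 0))          # 1
```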

More importantly, the linguistically-embedded constraints can be integrated into (1), which is powerful for reducing the effective size of the parameter space through ranking the importance of features. As discussed above, some Chinese characters appear much more frequently at the beginning of a word. For instance, 读 "read" has a 74% chance of occurring in the beginning position (Li et al., 2004). Therefore we can formulate some simple constraints to obtain the relative order of the importance measures λ_k. For this example, we can require the importance measure λ_k for I(s(0) = B, c(0) = 读) to be larger than those associated with I(s(0) = M, c(0) = 读), I(s(0) = E, c(0) = 读) and I(s(0) = S, c(0) = 读). In addition, the existing linguistic rules presented earlier in this section should be considered and incorporated into the constraints as well. For example, the merging rule implies that the λ_k for 上下文 "context" with I(s(0) = B, s(+1) = M, s(+2) = E, (c(0), c(+1), c(+2)) = 上下文) should be relatively large compared to those associated with other choices of trigram features.

Specifically, the model in (1) with the hinge loss and these constraints can be formulated as

(3)    argmin_{Λ,ξ}  Σ_{i=1}^n ξ_i + η Σ_{k=1}^K J(λ_k)
       s.t.  1 − Σ_{t=1}^T Σ_{k=1}^K λ_k f_k(s_i, c_i, t) ≤ ξ_i,  ξ_i ≥ 0,  i = 1, …, n,
             λ_k ≥ λ_j for all (k, j) ∈ I,

where ξ_i is a slack variable for the hinge loss of each sentence, and I comprises all available linguistically-embedded constraints. However, the construction of I requires enumerating all possible language rules manually, and thus can be labor intensive and time consuming. In the current project, we only make use of the standard N-gram features, and leave the linguistically-embedded features as a future development.

6. DISCUSSION

In this paper, we propose a machine-learning framework to utilize linguistically-embedded features for Chinese word segmentation. The proposed model is a character-based method constructing feature functions that map from characters to segmentation locators in words. The key idea is to build feature functions through N-gram templates, which contain the information of the character itself in conjunction with its consecutive characters. We apply the hinge loss to make the model more scalable, which is effective in reducing the number of constraints and also simplifies the constraint forms.

In addition, computational tractability is one crucial component in segmentation, because segmentation requires one to process a large amount of text information in a short time period for real-life applications. One important property of the proposed model is that the optimization process can be efficiently implemented through transforming complicated nonconvex optimization problems into many subproblems of convex minimization. This allows one to solve the many convex subproblems in a parallel fashion, and therefore helps to achieve scalable computing for high-volume text data.

Furthermore, there is a trade-off between accuracy in segmentation performance and computational complexity. Applying higher-order gram templates leads to higher accuracy in segmentation, but could result in an increased computational cost. For different corpora, the order of grams needs to be selected carefully to meet the time constraints of real-life applications. Finally, the segmentation of new words is still a challenging problem, and further research is needed on developing segmentation strategies which incorporate more complex linguistic rules.

ACKNOWLEDGEMENTS

Research supported in part by National Science Foundation Grants DMS-1207771, DMS-1415500, DMS-1415308, DMS-1308227, DMS-1415482, and HK GRF-11302615. The authors thank the editors, the associate editor and the reviewers for helpful comments and suggestions.

Received 2 December 2015

REFERENCES

Chang, P., Galley, M., and Manning, C. (2008). Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 224–232.
Chen, A. (2003). Chinese word segmentation using minimal linguistic knowledge. In Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, pp. 148–151.
Chen, S. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language 13, 359–394.
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning 20, 273–297.
Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., and White, A. (2002). The Sourcebook of Parallel Computing. Morgan Kaufmann, San Francisco.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360. MR1946581
Gao, J., Li, M., Wu, A., and Huang, C. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics 31, 531–574.
Joachims, T., Finley, T., and Yu, C. (2009). Cutting-plane training of structural SVMs. Machine Learning 77, 27–59.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, Morgan Kaufmann, pp. 282–289.
Li, H., Huang, C., Gao, J., and Fan, X. (2004). The use of SVM for Chinese new word identification. In Proceedings of the 1st International Joint Conference on Natural Language Processing, Springer, pp. 497–504.
Meng, W., Liu, L., and Chen, A. (2010). A comparative study on Chinese word segmentation using statistical models. In IEEE International Conference on Software Engineering and Service Sciences, pp. 482–486.
Shen, X., Pan, W., and Zhu, Y. (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107, 223–232. MR2949354
Shen, X., Tseng, G., Zhang, X., and Wong, W. (2003). On ψ-learning. Journal of the American Statistical Association 98, 724–734. MR2011686
Sun, W. (2010). Word-based and character-based word segmentation models: Comparison and combination. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 1211–1219.
Sun, W. and Xu, J. (2011). Enhancing Chinese word segmentation using unlabeled data. In Proceedings of Empirical Methods in Natural Language Processing.
Teahan, W., Wen, Y., and Witten, I. (2000). A compression-based algorithm for Chinese word segmentation. Computational Linguistics 26, 375–393.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society Series B 58, 267–288. MR1379242
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6, 1453–1484. MR2249862
Vapnik, V. (1998). Statistical Learning Theory. Wiley, Chichester, UK. MR1641250
Wang, J., Shen, X., and Liu, Y. (2008). Probability estimation for large margin classifiers. Biometrika 95, 149–167. MR2409720
Wang, K., Zong, C., and Su, K. (2010). A character-based joint model for Chinese word segmentation. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 1173–1181.
Wong, K. F., Li, W., Xu, R., and Zhang, Z.-S. (2010). Introduction to Chinese Natural Language Processing. Morgan & Claypool Publishers.
Wu, A. and Jiang, Z. (2000). Statistically-enhanced new word identification in a rule-based Chinese system. In Proceedings of the 2nd ACL Chinese Processing Workshop, pp. 41–66.
Wu, Y., Zhang, H. H., and Liu, Y. (2010). Robust model-free multiclass probability estimation. Journal of the American Statistical Association 105, 424–436. MR2656060
Xu, J., Gao, J., Toutanova, K., and Ney, H. (2008). Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics 1, pp. 1017–1024.
Xue, N. (2003). Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing 8, 29–48.
Xue, N. and Shen, L. (2003). Chinese word segmentation as LMR tagging. In Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, pp. 176–179.
Zanni, L., Serafini, T., and Zanghirati, G. (2006). Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research 7, 1467–1492. MR2274413
Zhang, C., Zeng, D., Li, J., Wang, F., and Zuo, W. (2009). Sentiment analysis of Chinese documents: From sentence to document level. Journal of the American Society for Information Science and Technology 60, 2474–2487.
Zhang, R., Yasuda, K., and Sumita, E. (2008). Chinese word segmentation and statistical machine translation. ACM Transactions on Speech and Language Processing 5, 4:1–4:19.
Zhonghua Zihai Dictionary (1994). Zhonghua Book Company.

Xinxin Shu
Department of Biostatistics and Research Decision Sciences
Merck Research Laboratories
USA
E-mail address: [email protected]

Junhui Wang
Department of Mathematics
City University of Hong Kong
People's Republic of China
E-mail address: [email protected]

Xiaotong Shen
School of Statistics
University of Minnesota
USA
E-mail address: [email protected]

Annie Qu
Department of Statistics
University of Illinois at Urbana-Champaign
USA
E-mail address: [email protected]
