Arxiv:1905.05526V2 [Cs.CL] 6 Oct 2019

Is Word Segmentation Necessary for Deep Learning of Chinese Representations? Yuxian Meng∗1, Xiaoya Li∗1, Xiaofei Sun1, Qinghong Han1 Arianna Yuan1;2, and Jiwei Li1 1 Shannon.AI 2 Computer Science Department, Stanford University f yuxian meng, xiaoya li, xiaofei sun, qinghong han arianna yuan, jiwei [email protected] Abstract is present between words in written Chinese sentences. This gives rise to the task of Chinese Word Segmenting a chunk of text into words is usually the first step of processing Chinese text, Segmentation (CWS) (Zhang et al., 2003; Peng but its necessity has rarely been explored. et al., 2004; Huang and Zhao, 2007; Zhao et al., 2006; Zheng et al., 2013; Zhou et al., 2017; Yang In this paper, we ask the fundamental question of whether Chinese word segmentation (CWS) et al., 2017, 2018). In the context of deep learning, is necessary for deep learning-based Chinese the segmented words are usually treated as the ba- Natural Language Processing. We bench- sic units for operations (we call these models the mark neural word-based models which rely on word-based models for the rest of this paper). Each word segmentation against neural char-based segmented word is associated with a fixed-length models which do not involve word segmenta- vector representation, which will be processed by tion in four end-to-end NLP benchmark tasks: deep learning models in the same way as how En- language modeling, machine translation, sentence matching/paraphrase and text classifica- glish words are processed. Word-based models tion. Through direct comparisons between come with a few fundamental disadvantages, as these two types of models, we find that char- will be discussed below. based models consistently outperform word- Firstly, word data sparsity inevitably leads to based models. overfitting and the ubiquity of OOV words limits Based on these observations, we conduct com- the model’s learning capacity. Particularly, Zipf’s prehensive experiments to study why word- law applies to most languages including Chinese. based models underperform char-based mod- Frequencies of many Chinese words are extremely els in these deep learning-based NLP tasks. small, making the model impossible to fully learn We show that it is because word-based models their semantics. Let us take the widely used Chi- are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, nese Treebank dataset (CTB) as an example (Xia, 3 and thus more prone to overfitting. We hope 2000). Using Jieba, the most widely-used open- this paper could encourage researchers in the sourced Chinese word segmentation system, to seg- community to rethink the necessity of word ment the CTB, we end up with a dataset consist- segmentation in deep learning-based Chinese ing of 615,194 words with 50,266 distinct words. Natural Language Processing. 12 arXiv:1905.05526v2 [cs.CL] 6 Oct 2019 Among the 50,266 distinct words, 24,458 words 1 Introduction appear only once, amounting to 48.7% of the total vocabulary, yet they only take up 4.0% of the entire There is a key difference between English (or more corpus. If we increase the frequency bar to 4, we broadly, languages that use some form of the Latin get 38,889 words appearing less or equal to 4 times, alphabet) and Chinese (or other languages that do which contribute to 77.4% of the total vocabulary not have obvious word delimiters such as Korean but only 10.1% of the entire corpus. Statistics are and Japanese) : words in English can be easily given in Table1. This shows that the word-based recognized since the space token is a good approxi- data is very sparse. The data sparsity issue is likely mation of a word divider, whereas no word divider to induce overfitting, since more words means a 1Yuxian Meng and Xiaoya Li contribute equally to this larger number of parameters. In addition, since it paper. 2Paper to appear at ACL2019. 3https://github.com/fxsjy/jieba bar # distinct prop of vocab prop of corpus closed if bigrams of characters are used in char- 1 50,266 100% 100% 4 38,889 77.4% 10.1% based models. In the phrase-based machine trans- 1 24,458 48.7% 4.0% lation, Xu et al.(2004) reported that CWS only showed non-significant improvements over mod- Table 1: Word statistics of Chinese TreeBank. els without word segmentation. Zhao et al.(2013) found that segmentation itself does not guarantee Corpora Yao Ming reaches the final better MT performance and it is not key to MT im- CTB 姚明进e ;³[ PKU 姚明进e ; ³[ provement. For text classification, Liu et al.(2007) compared a na¨ıve character bigram model with Table 2: CTB and PKU have different segmentation word-based models, and concluded that CWS is criteria (Chen et al., 2017c). not necessary for text classification. Outside the literature of computational linguistics, there have is unrealistic to maintain a huge word-vector ta- been discussions in the field of cognitive science. ble, many words are treated as OOVs, which may Based on eye movement data, Tsai and McConkie further constrain the model’s learning capability. (2003) found that fixations of Chinese readers do Secondly, the state-of-the-art word segmenta- not land more frequently on the centers of Chi- tion performance is far from perfect, the errors of nese words, suggesting that characters, rather than which would bias downstream NLP tasks. Partic- words, should be the basic units of Chinese reading ularly, CWS is a relatively hard and complicated comprehension. Consistent with this view, Bai et al. task, primarily because word boundary of Chinese (2008) found that Chinese readers read unspaced words is usually quite vague. As discussed in Chen text as fast as word spaced text. et al.(2017c), different linguistic perspectives have In this paper, we ask the fundamental question different criteria for CWS (Chen et al., 2017c). As of whether word segmentation is necessary for shown in Table 1, in the two most widely adopted deep learning-based Chinese natural language pro- CWS datasets PKU (Yu et al., 2001) and CTB (Xia, cessing. We first benchmark word-based models 2000), the same sentence is segmented differently. against char-based models (those do not involve Thirdly, if we ask the fundamental problem of Chinese word segmentation). We run apples-to- how much benefit word segmentation may provide, apples comparison between these two types of it is all about how much additional semantic infor- models on four NLP tasks: language modeling, mation is present in a labeled CWS dataset. Af- document classification, machine translation and ter all, the fundamental difference between word- sentence matching. We observe that char-based based models and char-based models is whether models consistently outperform word-based model. teaching signals from the CWS labeled dataset are We also compare char-based models with word- utilized. Unfortunately, the answer to this question char hybrid models (Yin et al., 2016; Dong et al., remains unclear. For example. in machine transla- 2016; Yu et al., 2017), and observe that char-based tion we usually have millions of training examples. models perform better or at least as good as the The labeled CWS dataset is relatively small (68k hybrid model, indicating that char-based models sentences for CTB and 21k for PKU), and the do- already encode sufficient semantic information. main is relatively narrow. It is not clear that CWS It is also crucial to understand the inadequacy dataset is sure to introduce a performance boost. of word-based models. To this end, we perform Before neural network models became popular, comprehensive analyses on the behavior of word- there were discussions on whether CWS is nec- based models and char-based models. We identify essary and how much improvement it can bring the major factor contributing to the disadvantage about. In information retrieval(IR), Foo and Li of word-based models, i.e., data sparsity, which in (2004) discussed CWS’s effect on IR systems and turn leads to overfitting, prevelance of OOV words, revealed that segmentation approach has an effect and weak domain transfer ability. on IR effectiveness as long as the SAME segmenta- Instead of making a conclusive (and arrogant) tion method is used for query and document, and argument that Chinese word segmentation is not that CWS does not always work better than mod- necessary, we hope this paper could foster more els without segmentation. In cases where CWS discussions and explorations on the necessity of does lead to better performance, the gap between the long-existing task of CWS in the community, word-based models and char-based models can be alongside with its underlying mechanisms. 2 Related Work model dimension ppl word 512 199.9 Since the First International Chinese Word Seg- char 512 193.0 word 2048 182.1 mentation Bakeoff in 2003 (Sproat and Emerson, char 2048 170.9 2003) , a lot of effort has been made on Chinese hybrid (word+char) 1024+1024 175.7 word segmentation. hybrid (word+char) 2048+1024 177.1 Most of the models in the early years are based hybrid (word+char) 2048+2048 176.2 hybrid (char only) 2048 171.6 on a dictionary, which is pre-defined and thus in- dependent of the Chinese text to be segmented. Table 3: Language modeling perplexities in different The simplest but remarkably robust model is the models. maximum matching model (Jurafsky and Martin, 2014). The simplest version of it is the left-to-right maximum matching model (maxmatch). Starting important hyper-parameters such as learning rate, with the beginning of a string, maxmatch chooses batch size, dropout rate, etc.

Load more