<<

Multiple Character Embeddings for Chinese Word Segmentation

Jingkang Wang∗ Jianing Zhou∗ Jie Zhou Gongshen Liu† The Lab of Information Content Intelligent Analysis, Shanghai, School of Cyber Science and Engineering, Shanghai Jiao Tong University {wangjksjtu,zhjjn1919}@gmail.com, {sanny02,lgshen}@sjtu.edu.cn

Abstract a prerequisite for neural networks to solve NLP tasks in different languages. Chinese word segmentation (CWS) is often re- However, existing approaches neglect an im- garded as a character-based sequence label- portant fact that Chinese characters contain both ing task in most current works which have achieved great success with the help of pow- semantic and phonetic meanings - there are vari- erful neural networks. However, these works ous representations of characters designed for cap- neglect an important clue: Chinese characters turing these features. The most intuitive one is incorporate both semantic and phonetic mean- Romanization (拼音) that keeps many-to- ings. In this paper, we introduce multiple char- one relationship with Chinese characters - for one acter embeddings including Pinyin Romaniza- character, different meanings in specific context tion and Wubi Input, both of which are easily may lead to different pronunciations. This phe- accessible and effective in depicting semantics nomenon called Polyphony (and Polysemy) in lin- of characters. We propose a novel shared Bi- LSTM-CRF model to fuse linguistic features guistics is very common and crucial to word seg- efficiently by sharing the LSTM network dur- mentation task. Apart from Pinyin Romanization, ing the training procedure. Extensive experi- Wubi Input (五笔) is another effective represen- ments on five corpora show that extra embed- tation which absorbs semantic meanings of Chi- dings help obtain a significant improvement in nese characters. Compared to Radical (偏旁)(Sun labeling accuracy. Specifically, we achieve the et al., 2014; Dong et al., 2016; Shao et al., 2017), state-of-the-art performance in AS and CityU Wubi includes more comprehensive graphical and corpora with F1 scores of 96.9 and 97.3, re- spectively without leveraging any external lex- structural information that is highly relevant to the ical resources. semantic meanings and word boundaries, due to plentiful pictographic characters in Chinese and 1 Introduction effectiveness of Wubi in embedding the structures. This paper will thoroughly study how impor- Chinese is written without explicit word delim- tant the extra embeddings are and what schol- iters so word segmentation (CWS) is a preliminary ars can achieve by combining extra embeddings and essential pre-processing step for most natural with representative models. To leverage extra pho- language processing (NLP) tasks in Chinese, such netic and semantic information efficiently, we pro- arXiv:1808.04963v3 [cs.CL] 30 May 2019 as part-of-speech tagging (POS) and named-entity pose a shared Bi-LSTMs-CRF model, which feeds recognition (NER). The representative approaches embeddings into three stacked LSTM layers with are treating CWS as a character-based sequence shared parameters and finally scores with CRF labeling task following Xu(2003) and Peng et al. layer. We evaluate the proposed approach on five (2004). corpora and demonstrate that our method produces Although not relying on hand-crafted features, state-of-the-art results and is highly efficient as most of the neural network models rely heavily on previous single-embedding scheme. the embeddings of characters. Since Mikolov et al. Our contributions are summarized as follows: (2013) proposed word2vec technique, the vector 1) We firstly propose to leverage both seman- representation of words or characters has become tic and phonetic features of Chinese characters ∗ Equal contribution (alphabetical order). in NLP tasks by introducing Pinyin Romaniza- † Corresponding author. tion and Wubi Input embeddings, which are easily et al., 2017). However, Chinese characters are de- 乐 和 便 veloped to absorb and fuse phonetics, semantics, [he] 1 peaceful (adj) and hieroglyphology. In this paper, we would like [le] 1 happy (adj) 2 with (conj / adv) [bian] 1 convenient (adj) [yue] 2 music (n) [huo] 3 stir or join (v) [pian] 2 cheap (adj) to explore other linguistic features so the charac- [hu] 4 success in game (interjection) ters are the basic inputs with two other presenta- (a) (b) (c) tions (Pinyin and Wubi) introduced as auxiliary.

Figure 1: Examples of phono-semantic compound 2.2 Pinyin Romanization characters and polyphone characters. Pinyin Romanization (拼 音) is the official ro- manization system for charac- Character Wubi Input Code Character Wubi Input Code ters (ISO 7098:2015,E), representing the pronun- 提 花 (Carry) R J G H (Flower) A W X B ciation of Chinese characters like phonogram in English. Moreover, Pinyin is highly relevant to 打 草 / A I / (Hit) R S H (Grass) J semantics - one character may correspond var-

找 芽 ied Pinyin code that indicates different semantic / A A H T (Find) R A T (Bud) meanings. This phenomenon is very common in 扑 莲 Asian languages and termed as polyphone. (Leap) R H Y / (Lotus) A L P U Figure1 shows several examples of polyphone 抬 芦 乐 (Lift) R C K G (Reed) A Y N R characters. For instance, the character ‘ ’ in Fig- ure1 (a) has two different pronunciations (Pinyin Verb.(hands related) Noun. (plants related) code). When pronounced as ‘yue’, it means ’mu- (a) (b) sic’, as a noun. However, with the pronunciation Figure 2: Potential semantic relationships between of ’le’, it refers to ’happiness’. Similarly, the char- Chinese characters and Wubi Input. Gray area indi- acter ‘和’ in Figure1 (b) even has four meanings cates that these characters have the same first letter in with three varied Pinyin code. the Wubi Input representation. Through Pinyin code, a natural bridge is con- accessible and effective in representing semantic structed between the words and their semantics. and phonetic features; 2) We put forward a shared Now that human could understand the different Bi-LSTM-CRF model for efficiently integrating meanings of characters according to varied pro- multiple embeddings and sharing useful linguis- nunciations, the neural networks are also likely tic features; 3) We evaluate the proposed multi- to learn the mappings between semantic meanings embedding scheme on Bakeoff2005 and CTB6 and Pinyin code automatically. corpora. Extensive experiments show that auxil- Obviously, Pinyin provides extra phonetic and iary embeddings help achieve state-of-the-art per- semantic information required by some basic tasks formance without external lexical resources. such as CWS. It is worthy to notice that Pinyin is a dominant computer input method of Chinese 2 Multiple Embeddings characters, and it is easy to represent characters with Pinyin code as supplementary inputs. To fully leverage various properties of Chinese characters, we propose to split the character-level 2.3 Wubi Input embeddings into three parts: character embed- Wubi Input (五笔) is based on the structure of dings for textual features, Pinyin Romanization characters rather than the pronunciation. Since embeddings for phonetic features and Wubi Input plentiful Chinese characters are hieroglyphic, embeddings for structure-level features. Wubi Input can be used to find out the poten- tial semantic relationships as well as the word 2.1 Chinese Characters boundaries. It is beneficial to CWS task mainly in CWS is often regarded as a character-based se- two aspects: 1) Wubi encodes high-level semantic quence labeling task, which aims to label ev- meanings of characters; 2) characters with similar ery character with {B, M, E, S} tagging scheme. structures (e.g., radicals) are more likely to make Recent studies show that character embeddings up a word, which effects the word boundaries. are the most fundamental inputs for neural net- To understand its effectiveness in structure de- works (Chen et al., 2015; Cai and Zhao, 2016; Cai scription, one has to go through the rules of Wubi TaTgag TTaTaggag TaTTgaagg

CRCFRF CCRCRFRFF CRCCFRRFF

BBi-iL-SLTSTMM Bi-LSTM BiB-BLi-iS-LTLSSMTTMM

BiB-BLi-iSL-TLSSMTTMM

BiB-Li-SLTSMTM1 1 BiiB-Li-SLTSMTM2 2 BBii-B-LLi-SSLTTSMMTM33 3 FFCCF LC La aLyyaeeyrrer BBi-BiL-iSL-TLSSMTTMM ShSSahhraeardree dPd aP Praarmraamemteetrteserrss

ChCahrar PiinPyiniinyin WWuubbuiibi CChhCaahrrar PPiniPnyinyiniynin WWuWububi bi i CChCahhraarr PiPnPiyniniynyinin WWuWbuuibbii (a) Model-I (b) Model-II (c) Model-III

Figure 3: Network architecture of three multi-embedding models. (a) Model-I: Multi-Bi-LSTMs-CRF Model. (b) Model-II: FC-Layer Bi-LSTMs-CRF Model. (c) Model-III: Shared Bi-LSTMs-CRF Model.

Input method. It is an efficient encoding system 2.4 Multiple Embeddings which represents each Chinese character with at To fully utilize various properties of Chinese char- most four English letters. Specifically, these let- acters, we construct the Pinyin and Wubi embed- ters are divided into five regions, each of which dings as two supplementary character-level fea- 笔画 represents a type of structure (stroke, ) in tures. We firstly pre-process the characters and Chinese characters. obtain the basic character embedding following Figure2 provides some examples of Chinese the strategy in Lample et al.(2016); Shao et al. 1 characters and their corresponding Wubi code (2017). Then we use the Pypinyin Library to (four letters). For instance, ‘提’ (carry), ‘打’ (hit) annotate Pinyin code, and an official transforma- 2 and ‘抬’ (lift) in Figure2 (a) are all verbs related tion table to translate characters to Wubi code. to hands and correspond different spellings in En- Finally, we retrieve multiple embeddings using glish. On the contrary, in Chinese, these characters word2vec tool (Mikolov et al., 2013). are all left-right symbols and have the same radical For simplicity, we treat Pinyin and Wubi code (‘R’ in Wubi code). That is to say, Chinese char- as units like characters processed by canonical acters that are highly semantically relevant usually word2vec, which may discard some semantic have similar structures which could be perfectly affinities. It is worth noticing that the sequence captured by Wubi. Besides, characters with simi- order in Wubi code is an intriguing property con- lar structures are more likely to make up a word. sidering the fact that structures of characters are For example, ‘花’ (flower), ‘草’ (grass) and ‘芽’ encoded by the order of letters (see Sec 2.3). This (bud) in Figure2 (b) are nouns and represent dif- point merits further study. Finally, we remark that ferent plants. Whereas, they are all up-down sym- generating Pinyin code relies on the external re- bols and have the same radical (‘A’ in Wubi code). sources (statistics prior). Nontheless, Wubi code These words usually make up new words such as is converted under a transformation table so does ‘花草’ (flowers and grasses) and ‘花芽’ (the buds not introduce any external resources. of flowers). 3 Multi-Embedding Model Architecture In addition, the sequence in Wubi code is one approach to interpret the relationships between We adopt the popular Bi-LSTMs-CRF as our base- Chinese characters. In Figure2, it is easy to find line model (Figure4 without Pinyin and Wubi in- some interesting component rules. For instance, put), similar to the architectures proposed by Lam- we can conclude: 1) the sequence order implies ple et al.(2016) and Dong et al.(2016). To ob- the order of character components (e.g., ‘IA’ vs tain an efficient fusion and sharing mechanism for ‘AI’ and ‘IY’ vs ‘YI’); 2) some code has practical multiple features, we design three varied architec- meanings (e.g., ‘I’ denotes water). Consequently, tures (see Figure3). In what follows, we will pro- Wubi is an efficient encoding of Chinese charac- vide detailed explanations and analysis. ters so incorporated as a supplementary input like 1https://pypi.python.org/pypi/pypinyin Pinyin in our multi-embedding model. 2http://wubi.free.fr/index_en.html where σ is the logistic sigmoid function; Wfc and …… CRF Layer bfc are trainable parameters of fully connected (t) (t) (t) layer; xc , xp and xw are the input vectors of character, pinyin and wubi embeddings. The out- …… put of the fully connected layer x(t) forms the in- Bi-LSTMs Layer …… put sequence of the Bi-LSTMs-CRF. This archi- tecture benefits from its low computation cost but suffers from insufficient extraction from raw code. … … … char ...... Embedding .. Meanwhile, Model-I and Model-II ignore the in- Lookup Table PY … … … WB teractions between different embeddings.

3.3 Model-III: Shared Bi-LSTMs-CRF Input Layer 长 江 长 满 绿 藻 char Model chang jiang zhang man lv zao PY To address feature dependency while maintaining TAYI IAG/ TAYI IAGW XVIY AIKS WB training efficiency, Model-III (Figure 3c) intro- duces a sharing mechanism - rather than employ- Figure 4: The architecture of Bi-LSTM-CRF network. ing independent Bi-LSTMs networks for Pinyin PY and WB represent Pinyin Romanization and Wubi and Wubi, we let them share the same LSTMs with Input introduced in this paper. character embeddings. 3.1 Model-I: Multi-Bi-LSTMs-CRF Model In Model-III, we feed character, Pinyin and Wubi embeddings sequentially into a stacked Bi- In Model-I (Figure 3a), the input vectors of char- LSTMs network shared with the same parameters: acter, pinyin and wubi embeddings are fed into three independent stacked Bi-LSTMs networks  (t)   (t) h3,c wc and the output high-level features are fused via ad-      (t)  = ( (t) , θ), dition: h3,p  Bi-LSTMs wp      (3) (t) (t) (t) (t) h3,c = Bi-LSTMs1(xc , θc), h3,w ww (t) (t) t (t) (t) (t) h3,p = Bi-LSTMs2(xp , θp), h = h3,c + h3,p + h3,w, (1) (t) (t) h3,w = Bi-LSTMs3(xw , θw), where θ denotes the shared parameters of Bi- LSTMs. Different from Eqn (1), there is only h(t) = h(t) + h(t) + h(t) , 3,c 3,p 3,w one shared Bi-LSTMs rather than three indepen- where θc, θp and θw denote parameters in three dent LSTM networks with more trainable param- Bi-LSTMs networks respectively. The outputs of eters. In consequence, the shared Bi-LSTMs-CRF (t) (t) (t) model can be trained more efficiently compared to three-layer Bi-LSTMs are h3,c, h3,p and h3,w, Model-I and Model-II (extra FC-Layer expense). which form the input of the CRF layer h(t). Here three LSTM networks maintain independent pa- Specifically, at each epoch, the parameters of rameters for multiple features thus leading to a three networks are updated based on unified se- large computation cost during training. quential character, Pinyin and Wubi embeddings. The second LSTM network will share (or synchro- 3.2 Model-II: FC-Layer Bi-LSTMs-CRF nize) the parameters with the first network before Model it begins the training procedure with Pinyin as in- On the contrary, Model-II (Figure 3b) incorpo- puts. In this way, the second network will take rates multiple raw features directly by inserting fewer efforts in refining the parameters based on one fully-connected (FC) layer to learn a mapping the former correlated embeddings. So does the between fused linguistic features and concatenated third network (taking Wubi embedding as inputs). raw input embeddings. Then the output of this FC layer is fed into the LSTM network: 4 Experimental Evaluations

(t) (t) (t) (t) In this section, we provide empirical results to ver- x = [xc ; xp ; xw ], in (2) ify the effectiveness of multiple embeddings for (t) (t) x = σ(Wfcxin + bfc), CWS. Besides, our proposed Model-III can be CTB6 PKU MSR AS CityU Models PRFPRFPRFPRFPRF baseline 94.1 94.0 94.1 95.8 95.9 95.8 95.3 95.7 95.5 95.6 95.5 95.6 95.9 96.0 96.0 Model-I 94.9 95.0 94.9 95.7 95.7 95.7 96.8 96.6 96.7 96.6 96.5 96.5 96.7 96.5 96.6 Model-II 95.4 95.3 95.4 96.3 95.7 96.0 96.6 96.5 96.6 96.8 96.5 96.7 97.2 97.0 97.1 Model-III 95.4 95.0 95.2 96.3 96.1 96.2 97.0 96.9 97.0 96.9 96.8 96.9 97.1 97.0 97.1

Table 1: Comparison of different architectures on five corpora. Bold font signifies the best performance in all given models. Our proposed multiple-embedding models result in a significant improvement compared to vanilla character-embedding baseline model. trained efficiently (slightly costly than baseline) Model PKU MSR AS CityU and obtain the state-of-the-art performance. (Sun and Wan, 2012) 95.4 97.4 - - 4.1 Experimental Setup (Chen et al., 2015) 94.8 95.6 - - (Chen et al., 2017) 94.3 96.0 - 94.8 To make the results comparable and convincing, we evaluate our models on SIGHAN 2005 (Emer- (Ma et al., 2018) 96.1 97.4 96.2 97.2 son, 2005) and Chinese Treebank 6.0 (CTB6) (Zhang et al., 2013)* 96.1 97.4 - - (Xue et al., 2005) datasets, which are widely (Chen et al., 2015)* 96.5 97.4 - - used in previous works. We leverage standard (Cai et al., 2017)* 95.8 97.1 95.6 95.3 word2vec tool to train multiple embeddings. In (Wang and Xu, 2017)* 96.5 98.0 - - experiments, we tuned the embedding size follow- (Sun et al., 2017)* 96.0 97.9 96.1 96.9 ing Yao and Huang(2016) and assigned equal size (256) for three types of embedding. The number baseline 95.8 95.5 95.6 96.0 of Bi-LSTM layers is set as 3. ours (+PY)* 96.0 96.8 96.7 97.0 ours (+WB) 96.3 97.2 96.5 97.3 4.2 Experimental Results ours (+PY+WB)* 96.2 97.0 96.9 97.1 Performance under Different Architectures Table 2: Comparison with previous state-of-the-art We comprehensively conduct the analysis of three models on all four Bakeoff2005 datasets. The second architecture proposed in Section3. As illus- block (*) represents allowing the use of external re- trated in Table1, considerable improvements are sources such as lexicon dictionary or trained embed- obtained by three multi-embedding models com- dings on large-scale external corpora. Note that our pared with our baseline model which only takes approach does not leverage any external resources. character embeddings as inputs. Overall, Model- higher out of vocabulary rate. It again verifies that III (shared Bi-LSTMs-CRF) achieves better per- Pinyin and Wubi embeddings are capable of de- formance even with fewer trainable parameters. creasing mis-segmentation rate in large-scale data. Competitive Performance Embedding Ablation To demonstrate the effectiveness of supplemen- We conduct embedding ablation experiments on tary embeddings for CWS, we compare our mod- CTB6 and CityU to explore the effectiveness of els with previous state-of-the-art models. Pinyin and Wubi embeddings individually. As Table2 shows the comprehensive comparison shown in Table3, Pinyin and Wubi result in a con- of performance on all Bakeoff2005 corpora. To siderable improvement on F1-score compared to the best of our knowledge, we have achieved the vanilla single character-embedding model (base- best performance on AS and CityU datasets (with line). Moreover, Wubi-aided model usually leads F1 score 96.9 and 97.3 respectively) and com- to a larger improvement than Pinyin-aided one. petitive performance on PKU and MSR even if not leveraging external resources (e.g. pre-trained Convergence Speed char/word embeddings, extra dictionaries, labeled To further study the additional expense after in- or unlabeled corpora). It is worthy to notice that corporating Pinyin and Wubi, we record the train- AS and CityU datasets are considered more diffi- ing time (batch time and convergence time in Ta- cult by researchers due to its larger capacity and ble4) of proposed models on MSR. Compared to CTB6 CityU and NER tasks. In addition, their model leverages Models PRFPRF both the character-level and word-level informa- baseline 94.1 94.0 94.1 95.9 96.0 96.0 tion. IO + PY 94.6 94.9 94.8 96.8 96.4 96.6 Our work distinguishes itself by utilizing multi- ple dimensions of features in Chinese characters. IO + WB 95.3 95.4 95.3 97.3 97.3 97.3 With phonetic and semantic meanings taken into Model-II 95.4 95.3 95.4 97.2 97.0 97.1 consideration, three proposed models achieve bet- Table 3: Feature ablation on CTB6 and CityU. IO + ter performance on CWS and can be also adapted PY and IO + WB denote injecting Pinyin and Wubi to POS and NER tasks. In particular, compared embeddings separately under Model-II. to radical-level information in (Dong et al., 2016), Model Time (batch) Time (P-95%) Wubi Input encodes richer structure details and potentially semantic relationships. baseline 1 × 1 × Recently, researchers propose to treat CWS as Model-I 2.61 × 2.51 × a word-based sequence labeling problem, which Model-II 1.03 × 1.50 × also achieves competitive performance (Zhang Model-III 1.07 × 1.04 × et al., 2016; Cai and Zhao, 2016; Cai et al., 2017; Table 4: Relative training time on MSR. (a) averaged Yang et al., 2017). Other works try to introduce training time per batch; (b) convergence time, where very deep networks (Wang and Xu, 2017) or treat above 95% precision is considered as convergence. CWS as a gap-filling problem (Sun et al., 2017). We believe that proposed linguistic features can the baseline model, it almost takes the same train- also be transferred into word-level sequence la- ing time (1.07×) per batch and convergence time beling and correct the error. In a nutshell, multi- (1.04×) for Model-III. By contrast, Model-II leads ple embeddings are generic and easily accessible, to slower convergence (1.50×) in spite of its lower which can be applied and studied further in these batch-training cost. In consequence, we recom- works. mend Model-III in practice for its high efficiency.

5 Related Work 6 Conclusion Since Xu(2003), researchers have mostly treated In this paper, we firstly propose to leverage pho- CWS as a sequence labeling problem. Following netic, structured and semantic features of Chinese this idea, great achievements have been reached in characters by introducing multiple character em- Pinyin Wubi the past few years with the effective embeddings beddings ( and ). We conduct a com- introduced and powerful neural networks armed. prehensive analysis on why Pinyin and Wubi em- beddings are so essential in CWS task and could In recent years, there are plentiful works ex- be translated to other NLP tasks such as POS ploiting different neural network architectures in and NER. Besides, we design three generic mod- CWS. Among these architectures, there are sev- els to fuse the multi-embedding and produce the eral models most similar to our model: Bi-LSTM- start-of-the-art performance in five public corpora. CRF (Huang et al., 2015), Bi-LSTM-CRF (Lam- In particular, the shared Bi-LSTM-CRF models ple et al., 2016; Dong et al., 2016), and Bi-LSTM- (Model III in Figure3) could be trained effi- CNNs-CRF (Ma and Hovy, 2016). ciently and produce the best performance on AS Huang et al.(2015) was the first to adopt Bi- and CityU corpora. In future, the effective ways of LSTM network for character representations and leveraging hierarchical linguistic features to other CRF for label decoding. Lample et al.(2016) languages, NLP tasks (e.g., POS and NER) and and Dong et al.(2016) exploited the Bi-LSTM- refining mis-labeled sentences merit further study. CRF model for named entity recognition in west- ern languages and Chinese, respectively. More- Acknowledgement over, Dong et al.(2016) introduced radical-level information that can be regarded as a special case This research work has been partially funded by of Wubi code in our model. the National Natural Science Foundation of China Ma and Hovy(2016) proposed to combine Bi- (Grant No.61772337, U1736207), and the Na- LSTM, CNN and CRF, which results in faster con- tional Key Research and Development Program of vergence speed and better performance on POS China NO.2016QY03D0604. References Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detec- Deng Cai and Hai Zhao. 2016. Neural word segmen- tion using conditional random fields. In Proceedings tation learning for chinese. In ACL (1), Berlin, Ger- of the 20th International Conference on Computa- many. The Association for Computer Linguistics. tional Linguistics, COLING ’04, Stroudsburg, PA, USA. Association for Computational Linguistics. Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and Yan Shao, Christian Hardmeier, Jorg¨ Tiedemann, and accurate neural word segmentation for chinese. In Joakim Nivre. 2017. Character-based joint segmen- ACL (2), pages 608–615. Association for Computa- tation and POS tagging for chinese using bidirec- tional Linguistics. tional RNN-CRF. In IJCNLP(1), pages 173–183. Asian Federation of Natural Language Processing. Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term mem- Weiwei Sun and Xiaojun Wan. 2012. Reducing ap- ory neural networks for chinese word segmentation. proximation and estimation errors for chinese lexical In EMNLP, pages 1197–1206. The Association for processing with heterogeneous annotations. In ACL Computational Linguistics. (1), pages 232–241. The Association for Computer Linguistics. Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-criteria learning Yaming Sun, Lei Lin, Nan Yang, Zhenzhou Ji, and for chinese word segmentation. In ACL (1), pages Xiaolong Wang. 2014. Radical-enhanced chinese 1193–1203. Association for Computational Linguis- character embedding. In ICONIP (2), volume 8835 tics. of Lecture Notes in Computer Science, pages 279– 286. Springer. Chuanhai Dong, Jiajun Zhang, Chengqing Zong, Masanori Hattori, and Hui Di. 2016. Character- Zhiqing Sun, Gehui Shen, and Zhi-Hong Deng. 2017. based LSTM-CRF with radical-level features A gap-based framework for chinese word segmenta- for chinese named entity recognition. In tion via very deep convolutional networks. CoRR, NLPCC/ICCPOL, volume 10102 of Lecture Notes abs/1712.09509. in Computer Science, pages 239–250. Springer. Chunqi Wang and Bo Xu. 2017. Convolutional neu- Thomas Emerson. 2005. The second international chi- ral network with word embeddings for chinese word nese word segmentation bakeoff. In Proceedings of segmentation. In IJCNLP(1), pages 163–172. Asian the Fourth SIGHAN Workshop on Federation of Natural Language Processing. Processing, SIGHAN@IJCNLP 2005, Jeju Island, Nianwen Xu. 2003. Chinese word segmentation as Korea, 14-15, 2005. character tagging. Computational Linguistics and Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidi- Chinese Language Processing, 8(1):29–48. rectional LSTM-CRF models for sequence tagging. Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha CoRR, abs/1508.01991. Palmer. 2005. The penn chinese treebank: Phrase structure annotation of a large corpus. Natural Lan- ISO 7098:2015(E). 2015. Information and documen- guage Engineering, 11(2):207–238. tation – Romanization of Chinese. Standard, Inter- national Organization for Standardization, Geneva, Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural CH. word segmentation with rich pretraining. In ACL (1), pages 839–849. Association for Computational Guillaume Lample, Miguel Ballesteros, Sandeep Sub- Linguistics. ramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. Yushi Yao and Zheng Huang. 2016. Bi-directional In HLT-NAACL, pages 260–270. The Association LSTM recurrent neural network for chinese word for Computational Linguistics. segmentation. In ICONIP (4), volume 9950 of Lec- ture Notes in Computer Science, pages 345–353. Ji Ma, Kuzman Ganchev, and David Weiss. 2018. State-of-the-art chinese word segmentation with bi- Longkai Zhang, Houfeng Wang, Xu Sun, and Mairgup lstms. In EMNLP, pages 4902–4908. Association Mansur. 2013. Exploring representations from un- for Computational Linguistics. labeled data with co-training for chinese word seg- mentation. In EMNLP, pages 311–321. ACL. Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end se- quence labeling via bi-directional lstm-cnns-crf. In Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. ACL (1). The Association for Computer Linguistics. Transition-based neural word segmentation. In Pro- ceedings of the 54th Annual Meeting of the Associ- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor- ation for Computational Linguistics, ACL 2016, Au- rado, and Jeffrey Dean. 2013. Distributed represen- gust 7-12, 2016, Berlin, Germany, Volume 1: Long tations of words and phrases and their composition- Papers. ality. CoRR, abs/1310.4546.