Arxiv:1808.04963V3 [Cs.CL] 30 May 2019 As Part-Of-Speech Tagging (POS) and Named-Entity Pose a Shared Bi-Lstms-CRF Model, Which Feeds Recognition (NER)

Multiple Character Embeddings for Chinese Word Segmentation Jingkang Wang∗ Jianing Zhou∗ Jie Zhou Gongshen Liuy The Lab of Information Content Intelligent Analysis, Shanghai, China School of Cyber Science and Engineering, Shanghai Jiao Tong University fwangjksjtu,[email protected], fsanny02,[email protected] Abstract a prerequisite for neural networks to solve NLP tasks in different languages. Chinese word segmentation (CWS) is often re- However, existing approaches neglect an im- garded as a character-based sequence label- portant fact that Chinese characters contain both ing task in most current works which have achieved great success with the help of pow- semantic and phonetic meanings - there are vari- erful neural networks. However, these works ous representations of characters designed for cap- neglect an important clue: Chinese characters turing these features. The most intuitive one is incorporate both semantic and phonetic mean- Pinyin Romanization (拼音) that keeps many-to- ings. In this paper, we introduce multiple char- one relationship with Chinese characters - for one acter embeddings including Pinyin Romaniza- character, different meanings in specific context tion and Wubi Input, both of which are easily may lead to different pronunciations. This phe- accessible and effective in depicting semantics nomenon called Polyphony (and Polysemy) in lin- of characters. We propose a novel shared Bi- LSTM-CRF model to fuse linguistic features guistics is very common and crucial to word seg- efficiently by sharing the LSTM network dur- mentation task. Apart from Pinyin Romanization, ing the training procedure. Extensive experi- Wubi Input (五笔) is another effective represen- ments on five corpora show that extra embed- tation which absorbs semantic meanings of Chi- dings help obtain a significant improvement in nese characters. Compared to Radical (OÁ)(Sun labeling accuracy. Specifically, we achieve the et al., 2014; Dong et al., 2016; Shao et al., 2017), state-of-the-art performance in AS and CityU Wubi includes more comprehensive graphical and corpora with F1 scores of 96.9 and 97.3, re- spectively without leveraging any external lex- structural information that is highly relevant to the ical resources. semantic meanings and word boundaries, due to plentiful pictographic characters in Chinese and 1 Introduction effectiveness of Wubi in embedding the structures. This paper will thoroughly study how impor- Chinese is written without explicit word delim- tant the extra embeddings are and what schol- iters so word segmentation (CWS) is a preliminary ars can achieve by combining extra embeddings and essential pre-processing step for most natural with representative models. To leverage extra pho- language processing (NLP) tasks in Chinese, such netic and semantic information efficiently, we pro- arXiv:1808.04963v3 [cs.CL] 30 May 2019 as part-of-speech tagging (POS) and named-entity pose a shared Bi-LSTMs-CRF model, which feeds recognition (NER). The representative approaches embeddings into three stacked LSTM layers with are treating CWS as a character-based sequence shared parameters and finally scores with CRF labeling task following Xu(2003) and Peng et al. layer. We evaluate the proposed approach on five (2004). corpora and demonstrate that our method produces Although not relying on hand-crafted features, state-of-the-art results and is highly efficient as most of the neural network models rely heavily on previous single-embedding scheme. the embeddings of characters. Since Mikolov et al. Our contributions are summarized as follows: (2013) proposed word2vec technique, the vector 1) We firstly propose to leverage both seman- representation of words or characters has become tic and phonetic features of Chinese characters ∗ Equal contribution (alphabetical order). in NLP tasks by introducing Pinyin Romaniza- y Corresponding author. tion and Wubi Input embeddings, which are easily et al., 2017). However, Chinese characters are de- 乐和便 veloped to absorb and fuse phonetics, semantics, [he] 1 peaceful (adj) and hieroglyphology. In this paper, we would like [le] 1 happy (adj) 2 with (conj / adv) [bian] 1 convenient (adj) [yue] 2 music (n) [huo] 3 stir or join (v) [pian] 2 cheap (adj) to explore other linguistic features so the charac- [hu] 4 success in game (interjection) ters are the basic inputs with two other presenta- (a) (b) (c) tions (Pinyin and Wubi) introduced as auxiliary. Figure 1: Examples of phono-semantic compound 2.2 Pinyin Romanization characters and polyphone characters. Pinyin Romanization (拼音) is the official romanization system for standard Chinese charac- Character Wubi Input Code Character Wubi Input Code ters (ISO 7098:2015,E), representing the pronun- 提花（Carry） R J G H （Flower） A W X B ciation of Chinese characters like phonogram in English. Moreover, Pinyin is highly relevant to 打草 / A I / （Hit） R S H （Grass） J semantics - one character may correspond var- 找芽 ied Pinyin code that indicates different semantic / A A H T （Find） R A T （Bud） meanings. This phenomenon is very common in 扑莲 Asian languages and termed as polyphone. （Leap） R H Y / （Lotus） A L P U Figure1 shows several examples of polyphone 抬芦 P （Lift） R C K G （Reed） A Y N R characters. For instance, the character ‘ ’ in Fig- ure1 (a) has two different pronunciations (Pinyin Verb.（hands related） Noun. （plants related） code). When pronounced as ‘yue’, it means ’mu- (a) (b) sic’, as a noun. However, with the pronunciation Figure 2: Potential semantic relationships between of ’le’, it refers to ’happiness’. Similarly, the char- Chinese characters and Wubi Input. Gray area indi- acter ‘和’ in Figure1 (b) even has four meanings cates that these characters have the same first letter in with three varied Pinyin code. the Wubi Input representation. Through Pinyin code, a natural bridge is con- accessible and effective in representing semantic structed between the words and their semantics. and phonetic features; 2) We put forward a shared Now that human could understand the different Bi-LSTM-CRF model for efficiently integrating meanings of characters according to varied pro- multiple embeddings and sharing useful linguis- nunciations, the neural networks are also likely tic features; 3) We evaluate the proposed multi- to learn the mappings between semantic meanings embedding scheme on Bakeoff2005 and CTB6 and Pinyin code automatically. corpora. Extensive experiments show that auxil- Obviously, Pinyin provides extra phonetic and iary embeddings help achieve state-of-the-art per- semantic information required by some basic tasks formance without external lexical resources. such as CWS. It is worthy to notice that Pinyin is a dominant computer input method of Chinese 2 Multiple Embeddings characters, and it is easy to represent characters with Pinyin code as supplementary inputs. To fully leverage various properties of Chinese characters, we propose to split the character-level 2.3 Wubi Input embeddings into three parts: character embed- Wubi Input (五笔) is based on the structure of dings for textual features, Pinyin Romanization characters rather than the pronunciation. Since embeddings for phonetic features and Wubi Input plentiful Chinese characters are hieroglyphic, embeddings for structure-level features. Wubi Input can be used to find out the potential semantic relationships as well as the word 2.1 Chinese Characters boundaries. It is beneficial to CWS task mainly in CWS is often regarded as a character-based se- two aspects: 1) Wubi encodes high-level semantic quence labeling task, which aims to label ev- meanings of characters; 2) characters with similar ery character with fB, M, E, Sg tagging scheme. structures (e.g., radicals) are more likely to make Recent studies show that character embeddings up a word, which effects the word boundaries. are the most fundamental inputs for neural net- To understand its effectiveness in structure de- works (Chen et al., 2015; Cai and Zhao, 2016; Cai scription, one has to go through the rules of Wubi TaTgag TTaTaggag TaTTgaagg CRCFRF CCRCRFRFF CRCCFRRFF BBi-iL-SLTSTMM Bi-LSTM BiB-BLi-iS-LTLSSMTTMM BiB-BLi-iSL-TLSSMTTMM BiB-Li-SLTSMTM1 1 BiiB-Li-SLTSMTM2 2 BBii-B-LLi-SSLTTSMMTM33 3 FFCCF LC La aLyyaeeyrrer BBi-BiL-iSL-TLSSMTTMM ShSSahhraeardree dPd aP Praarmraamemteetrteserrss ChCahrar PiinPyiniinyin WWuubbuiibi CChhCaahrrar PPiniPnyinyiniynin WWuWububi bi i CChCahhraarr PiPnPiyniniynyinin WWuWbuuibbii (a) Model-I (b) Model-II (c) Model-III Figure 3: Network architecture of three multi-embedding models. (a) Model-I: Multi-Bi-LSTMs-CRF Model. (b) Model-II: FC-Layer Bi-LSTMs-CRF Model. (c) Model-III: Shared Bi-LSTMs-CRF Model. Input method. It is an efficient encoding system 2.4 Multiple Embeddings which represents each Chinese character with at To fully utilize various properties of Chinese char- most four English letters. Specifically, these let- acters, we construct the Pinyin and Wubi embed- ters are divided into five regions, each of which dings as two supplementary character-level fea- 笔; represents a type of structure (stroke, ) in tures. We firstly pre-process the characters and Chinese characters. obtain the basic character embedding following Figure2 provides some examples of Chinese the strategy in Lample et al.(2016); Shao et al. 1 characters and their corresponding Wubi code (2017). Then we use the Pypinyin Library to (four letters). For instance, ‘Ð’ (carry), ‘S’ (hit) annotate Pinyin code, and an official transforma- 2 and ‘¬’ (lift) in Figure2 (a) are all verbs related tion table to translate characters to Wubi code. to hands and correspond different spellings in En- Finally, we retrieve multiple embeddings using glish. On the contrary, in Chinese, these characters word2vec tool (Mikolov et al., 2013). are all left-right symbols and have the same radical For simplicity, we treat Pinyin and Wubi code (‘R’ in Wubi code). That is to say, Chinese char- as units like characters processed by canonical acters that are highly semantically relevant usually word2vec, which may discard some semantic have similar structures which could be perfectly affinities.

Load more