SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining

Chenglei Si¹∗, Zhengyan Zhang²,³,⁴∗, Yingfa Chen²,³,⁴∗, Fanchao Qi²,³,⁴, Xiaozhi Wang²,³,⁴, Zhiyuan Liu²,³,⁴†, Maosong Sun²,³,⁴

¹ University of Maryland, College Park, MD, USA
² Department of Computer Science and Technology, Tsinghua University, Beijing, China
³ Institute for Artificial Intelligence, Tsinghua University, Beijing, China
⁴ State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
[email protected], [email protected]

∗ Equal contribution
† Corresponding author email: [email protected]

arXiv:2106.00400v1 [cs.CL] 1 Jun 2021

Abstract

Conventional tokenization methods for Chinese pretrained language models (PLMs) treat each character as an indivisible token (Devlin et al., 2019), which ignores the characteristics of the Chinese writing system. In this work, we comprehensively study the influences of three main factors on Chinese tokenization for PLMs: pronunciation, glyph (i.e., shape) and word boundary. Correspondingly, we propose three kinds of tokenizers: 1) SHUOWEN (说文, meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (解字, meaning Solve Character), the glyph-based tokenizers; 3) word segmented tokenizers, the tokenizers with Chinese word segmentation. To empirically compare the effectiveness of the studied tokenizers, we pretrain BERT-style language models with them and evaluate the models on various downstream NLU tasks. We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers, while Chinese word segmentation shows no benefit as a preprocessing step. Moreover, the proposed SHUOWEN and JIEZI tokenizers exhibit significantly better robustness in handling noisy texts. The code and pretrained models will be publicly released to facilitate linguistically informed Chinese NLP.¹

¹ Please refer to Appendix A.1 for the historical meaning of SHUOWEN-JIEZI.

1 Introduction

Large-scale Transformer-based pretrained language models (PLMs) (Devlin et al., 2019; Lan et al., 2020; Clark et al., 2020; Ma et al., 2020; He et al., 2021, inter alia) have achieved great success in recent years and attracted wide research interest, in which tokenization plays a fundamental role. Unfortunately, current tokenization methods are developed primarily for English (Bostrom and Durrett, 2020). Almost all current PLMs adopt a sub-word tokenization method originating from machine translation, such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012; Devlin et al., 2019) and SentencePiece based on the unigram language model (Kudo and Richardson, 2018). While the idea of sub-word tokenization is intuitive and effective for morphologically rich synthetic languages, it is not the case for Chinese.

We believe that it is crucial to develop tailored techniques for languages beyond English because there can be huge differences between different languages (Bender, 2019, #BenderRule). Towards this end, we devote this work to analysing three unique linguistic characteristics of Chinese (writing system) compared to English:

1) The Chinese writing system is morphemic (Hill, 2016), which means that Chinese characters poorly reflect the pronunciation, so conventional character-based tokenization misses much more phonological information.

2) Modern Chinese words basically do not undergo morphological alternations (Packard, 2000), thus rendering sub-word tokenization inapplicable. However, Chinese characters are mainly logograms, which means their glyphs, the composition of strokes and radicals, also contain rich semantic information (Wu et al., 2019).

3) In Chinese writing, there is no natural word boundary like the space in English writing. Although it is possible to inject word boundaries via Chinese Word Segmentation (CWS), there is no study on how this works for Chinese PLMs.

Targeting the three factors, we then explore three corresponding tokenization strategies:

1) A pronunciation-based tokenizer family called SHUOWEN, which first romanizes the Chinese characters based on their pronunciations, and then constructs the vocabulary with the romanized scripts using the unigram language model (Kudo and Richardson, 2018).

2) A glyph-based tokenizer family called JIEZI, which decomposes characters into combinations of Chinese strokes or radicals, and then constructs the vocabulary with the stroke or radical sequences using the unigram language model.

3) A word segmented tokenizer family, which first uses a Chinese word segmenter to segment Chinese texts into words, and then constructs the vocabulary with the segmented word sequences using the unigram language model.

We pretrain BERT-style PLMs using the proposed tokenizers from scratch and evaluate the resultant models on various downstream tasks. Through comprehensive evaluation on ten Chinese NLU tasks, we find that our pronunciation-based (SHUOWEN) and glyph-based (JIEZI) tokenizers outperform conventional single-character tokenizers in most tasks. Furthermore, as they have the unique advantage of learning the meanings of complex characters through the composition of simpler sub-characters, they are naturally more robust in handling noisy input. Surprisingly, we find that Chinese Word Segmentation (CWS) has no benefit for Chinese language model pretraining.

Our work suggests that linguistically informed techniques based on the characteristics of different languages need more attention. We will release the code, pretrained models, and the SHUOWEN-JIEZI tokenizers to serve as a better foundation for future research on Chinese PLMs.

2 Related Work

Chinese PLM. Several previous works have explored techniques to improve Chinese language model pretraining. Zhu (2020) and Zhang et al. (2021) expanded the BERT vocabulary with Chinese words apart from the single characters and incorporated them in the pretraining objectives. Xiao et al. (2021) and Cui et al. (2019) considered coarse-grained information through masking n-grams and whole words during the masked language modeling pretraining. Diao et al. (2020) incorporated word-level information via superimposing the character and word embeddings. Lai et al. (2021) incorporated Chinese word lattice structures in the pretraining.

Linguistically Informed Techniques for Chinese. CWS is a common preprocessing step for Chinese NLP tasks (Li and Sun, 2009). Meng et al. (2019) empirically analysed whether CWS is helpful for downstream Chinese NLP tasks before the PLM era and found that in many cases the answer is negative. We examine the impact of CWS for PLMs instead. Wu et al. (2019) incorporated glyph information of Chinese characters through adding extra encoders to encode the images of Chinese characters and then combining them with the character embeddings. We do not intend to fuse in additional information from sources like images; instead, all of our proposed tokenization methods are drop-in replacements for the existing single-character tokenizers, without adding any extra layers or parameters. Tan et al. (2018) explored converting Chinese text into Wubi sequences that represent character glyph information for the task of machine translation.

3 Method

In this section, we introduce our proposed tokenization methods.

3.1 SHUOWEN: Pronunciation-based Tokenizers

The Chinese writing system is morphemic (Hill, 2016) and barely conveys phonological information. However, the pronunciation of Chinese characters also reveals semantic patterns (Duanmu, 2007) and has long been widely used in input methods in China (e.g., pinyin). In order to capture such information, we propose a pronunciation-based tokenizer named SHUOWEN. On raw Chinese input texts (e.g., 魑魅魍魉), SHUOWEN performs the following steps:

1. Romanize the text using Chinese transliteration systems. In this work, we explore two different transliteration methods: pinyin and zhuyin (i.e., bopomofo). Pinyin uses the Latin alphabet and four² different tones (¯, ´, ˇ, `) to romanize pronunciations of characters, e.g., 魑魅魍魉 → Chi¯ Mei` Wangˇ Liangˇ. On the other hand, zhuyin uses a set of self-invented characters and the same four tones to romanize the characters, e.g., 魑魅魍魉 → ㄔ ㄇㄟ` ㄨㄤˇ ㄌㄧㄤˇ. Note that in zhuyin, the first tone mark (¯) is usually omitted.

   ² The light tone is sometimes considered as the fifth tone but we omit it for simplicity.

2. Insert special separation symbols (+) after each character's romanized sequence, e.g., Chi¯+Mei`+Wangˇ+Liangˇ+, ㄔ+ㄇㄟ`+ㄨㄤˇ+ㄌㄧㄤˇ+. This prevents cases where romanized sequences of different characters are mixed together, especially when there are no tone markers to split them in zhuyin.

3. Different Chinese characters often have the same pronunciation. For disambiguation, we append different indices after the romanized sequences for the homophonic characters, thereby allowing a biunique mapping between each Chinese character and

acters based on the standard stroke orders⁴, e.g., 魑 → pszhshpzznnhpnzsszshn; 魅 → pszhshpzznhhspn. To convert into radical sequences, we adopt three existing glyph-based Chinese input methods: Wubi, Zhengma, Cangjie. These methods group strokes together in different ways to form radicals or stroke combinations, and then represent characters with them. We use the Latin alphabet to represent these radicals or stroke combinations, e.g., 魑魅魍魉 → Wubi: rqcc rqci rqcn rqcw; Zhengma: njlz njbk njld njoo; Cangjie: