SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining

Chenglei Si¹∗, Zhengyan Zhang²,³,⁴∗, Yingfa Chen²,³,⁴∗, Fanchao Qi²,³,⁴, Xiaozhi Wang²,³,⁴, Zhiyuan Liu²,³,⁴†, Maosong Sun²,³,⁴
¹ University of Maryland, College Park, MD, USA
² Department of Computer Science and Technology, Tsinghua University
³ Institute for Artificial Intelligence, Tsinghua University, Beijing, China
⁴ State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China

arXiv:2106.00400v1 [cs.CL] 1 Jun 2021

Abstract

Conventional tokenization methods for Chinese pretrained language models (PLMs) treat each character as an indivisible token (Devlin et al., 2019), which ignores the characteristics of the Chinese writing system. In this work, we comprehensively study the influence of three main factors on Chinese tokenization for PLMs: pronunciation, glyph (i.e., shape) and word boundary. Correspondingly, we propose three kinds of tokenizers: 1) SHUOWEN (说文, meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (解字, meaning Solve Character), the glyph-based tokenizers; 3) word segmented tokenizers, the tokenizers with Chinese word segmentation. To empirically compare the effectiveness of the studied tokenizers, we pretrain BERT-style language models with them and evaluate the models on various downstream NLU tasks. We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers, while Chinese word segmentation shows no benefit as a preprocessing step. Moreover, the proposed SHUOWEN and JIEZI tokenizers exhibit significantly better robustness in handling noisy texts. The code and pretrained models will be publicly released to facilitate linguistically informed Chinese NLP.¹

¹ Please refer to Appendix A.1 for the historical meaning of SHUOWEN-JIEZI.
∗ Equal contribution
† Corresponding author

1 Introduction

Large-scale Transformer-based pretrained language models (PLMs) (Devlin et al., 2019; Lan et al., 2020; Clark et al., 2020; Ma et al., 2020; He et al., 2021, inter alia) have achieved great success in recent years and attracted wide research interest, in which tokenization plays a fundamental role.

Unfortunately, current tokenization methods are developed primarily for English (Bostrom and Durrett, 2020). Almost all current PLMs adopt sub-word tokenization methods originating from machine translation, such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012; Devlin et al., 2019) and SentencePiece based on the unigram language model (Kudo and Richardson, 2018). While the idea of sub-word tokenization is intuitive and effective for morphologically rich synthetic languages, this is not the case for Chinese.

We believe that it is crucial to develop tailored techniques for languages beyond English because there can be huge differences between languages (Bender, 2019, #BenderRule). Towards this end, we devote this work to analysing three unique linguistic characteristics of Chinese (more precisely, of its writing system) compared to English:

1) The Chinese writing system is morphemic (Hill, 2016), which means that Chinese characters poorly reflect pronunciation, so conventional character-based tokenization misses much phonological information.
2) Modern Chinese words basically do not undergo morphological alternations (Packard, 2000), which renders sub-word tokenization inapplicable. However, Chinese characters are mainly logograms, which means their glyphs, the composition of strokes and radicals, also contain rich semantic information ( et al., 2019).
3) In Chinese writing, there is no natural word boundary like the space in English writing. Although it is possible to inject word boundaries via Chinese Word Segmentation (CWS), there is no study on how this works for Chinese PLMs.

Targeting these three factors, we explore three corresponding tokenization strategies:

1) A pronunciation-based tokenizer family called SHUOWEN, which first romanizes Chinese characters based on their pronunciations, and then constructs the vocabulary from the romanized scripts using the unigram language model (Kudo and Richardson, 2018).
2) A glyph-based tokenizer family called JIEZI, which decomposes characters into combinations of Chinese strokes or radicals, and then constructs the vocabulary from the stroke or radical sequences using the unigram language model.
3) A word segmented tokenizer family, which first uses a Chinese word segmenter to segment Chinese texts into words, and then constructs the vocabulary from the segmented word sequences using the unigram language model.

We pretrain BERT-style PLMs from scratch using the proposed tokenizers and evaluate the resultant models on various downstream tasks. Through comprehensive evaluation on ten Chinese NLU tasks, we find that our pronunciation-based (SHUOWEN) and glyph-based (JIEZI) tokenizers outperform conventional single-character tokenizers in most tasks. Furthermore, as they have the unique advantage of learning the meanings of complex characters through the composition of simpler sub-characters, they are naturally more robust in handling noisy input. Surprisingly, we find that Chinese Word Segmentation (CWS) has no benefit for Chinese language model pretraining.

Our work suggests that linguistically informed techniques based on the characteristics of different languages need more attention. We will release the code, pretrained models, and the SHUOWEN-JIEZI tokenizers to serve as a better foundation for future research on Chinese PLMs.

2 Related Work

Chinese PLM. Several previous works have explored techniques to improve Chinese language model pretraining. Zhu (2020) and Zhang et al. (2021) expanded the BERT vocabulary with Chinese words in addition to single characters and incorporated them in the pretraining objectives. Xiao et al. (2021) and Cui et al. (2019) considered coarse-grained information by masking n-grams and whole words during masked language modeling pretraining. Diao et al. (2020) incorporated word-level information by superimposing character and word embeddings. Lai et al. (2021) incorporated Chinese word lattice structures in pretraining.

Linguistically Informed Techniques for Chinese. CWS is a common preprocessing step for Chinese NLP tasks (Li and Sun, 2009). Meng et al. (2019) empirically analysed whether CWS is helpful for downstream Chinese NLP tasks before the PLM era and found that in many cases the answer is negative. We examine the impact of CWS for PLMs instead. Wu et al. (2019) incorporated glyph information of Chinese characters by adding extra encoders that encode images of Chinese characters and then combining them with the character embeddings. We do not intend to fuse in additional information from sources like images; instead, all of our proposed tokenization methods are drop-in replacements for the existing single-character tokenizers, without any extra layers or parameters. Tan et al. (2018) explore converting Chinese text into Wubi sequences that represent character glyph information for the task of machine translation.

3 Method

In this section, we introduce our proposed tokenization methods.

3.1 SHUOWEN: Pronunciation-based Tokenizers

The Chinese writing system is morphemic (Hill, 2016) and barely conveys phonological information. However, the pronunciation of Chinese characters also reveals semantic patterns (Duanmu, 2007) and has long been widely used as the basis of input methods in China. In order to capture such information, we propose a pronunciation-based tokenizer named SHUOWEN. On raw Chinese input texts (e.g., 魑魅魍魉), SHUOWEN performs the following steps:

1. Romanize the text using Chinese transliteration systems. In this work, we explore two different transliteration methods: pinyin and zhuyin. Pinyin uses the Latin alphabet and four different tones² (¯, ´, ˇ, `) to romanize the pronunciations of characters, e.g., 魑魅魍魉 → Chi¯ Mei` Wangˇ Liangˇ. Zhuyin, on the other hand, uses a set of self-invented characters and the same four tones to transcribe the characters, e.g., 魑魅魍魉 → ㄔ ㄇㄟ` ㄨㄤˇ ㄌㄧㄤˇ. Note that in zhuyin, the first tone mark (¯) is usually omitted.

2. Insert a special separation symbol (+) after each character's romanized sequence, e.g., Chi¯+Mei`+Wangˇ+Liangˇ+, ㄔ+ㄇㄟ`+ㄨㄤˇ+ㄌㄧㄤˇ+. This prevents the romanized sequences of different characters from being mixed together, especially when there are no tone markers to split them in zhuyin.

3. Different Chinese characters often have the same pronunciation. For disambiguation, we append different indices after the romanized sequences of homophonic characters, which yields a one-to-one (biunique) mapping between each Chinese character and its romanized sequence, e.g., Chi¯33+Mei`24+Wangˇ25+Liangˇ13+, ㄔ10+ㄇㄟ`3+ㄨㄤˇ6+ㄌㄧㄤˇ1+.

4. Apply a unigram language model (ULM) as in Kudo and Richardson (2018) on the romanized sequences to build the final vocabulary.

² The light tone is sometimes considered as the fifth tone but we omit it for simplicity.

We do not set any constraint on the vocabulary other than the vocabulary size. The resultant vocabulary contains tokens corresponding to flexible combinations of characters and sub-characters.

3.2 JIEZI: Glyph-based Tokenizers

The shapes of Chinese characters contain rich semantic information and can help NLP models (Cao et al., 2018). For example, most Chinese characters can be broken down into semantically meaningful radicals. Characters that share common radicals often share related semantic information: the four characters ‘魑魅魍魉’ all share the same radical ‘鬼’ (meaning ghost), and their meanings are indeed related to ghosts and monsters.³

³ Interestingly, the word ‘魑魅魍魉’ is in fact a Chinese idiom, which is now often used to refer to bad people who are like monsters.

However, the prevailing tokenization method for Chinese treats each Chinese character as a separate token and hence prevents the model from learning the shared semantics of characters with common radicals. In order to solve this problem, we propose the glyph-based tokenizer JIEZI, which performs the following steps on raw Chinese input (e.g., 魑魅魍魉):

1. Convert each character into a stroke or radical sequence. To convert into stroke sequences, we use the Latin alphabet to represent the basic strokes and convert the characters based on the standard stroke orders⁴, e.g., 魑 → pszhshpzznnhpnzsszshn; 魅 → pszhshpzznhhspn. To convert into radical sequences, we adopt three existing glyph-based Chinese input methods: Wubi, Zhengma, and Cangjie. These methods group strokes together in different ways to form radicals or stroke combinations, and then represent characters with them. We use the Latin alphabet to represent these radicals or stroke combinations, e.g., 魑魅魍魉 → Wubi: rqcc rqci rqcn rqcw; Zhengma: njlz njbk njld njoo; Cangjie: hiyub hijd hibtv himob.

2. Similar to the pronunciation-based tokenizers, we add the same separation symbol after each character, and also add disambiguation indices for characters whose stroke sequences are identical (e.g., 人 (people) and 八 (eight)).

⁴ https://en.wikipedia.org/wiki/Stroke_order

In the converted sequences, we can see how common radicals naturally appear (the underlined parts). Please refer to Appendix A.2 for a detailed explanation of the differences between the input methods involved.

3.3 Word Segmented Tokenizer

Chinese Word Segmentation (CWS) is a common technique to split Chinese text chunks into a sequence of Chinese words. The resultant segmented words sometimes provide better granularity for downstream tasks (e.g., Chang et al., 2008). However, the impact of CWS is unclear in the context of pretraining, especially how it interacts with statistical approaches like BPE and unigram LM. Hence, we study word segmented tokenizers that perform the following process on raw Chinese input, e.g., “这篇论文有意思。” (this paper is interesting):

1. We use a state-of-the-art segmenter, THULAC (Li and Sun, 2009), to segment the sentence into a sequence of words joined by spaces, e.g., ‘这篇\论文\有意思\。’ (We use \ to indicate a blank space for easier reading.)

2. We directly apply ULM on these space-joined sequences to construct the vocabulary.

In other words, CWS is used as a preprocessing step on the training corpora when we build the vocabulary using the above-mentioned methods. When we perform the actual pretraining and finetuning, we also perform CWS as preprocessing before tokenization with the word segmented tokenizer.

4 Experiment

4.1 Baselines

We use two strong baseline tokenizers in this work. The first one is the conventional single-character tokenizer as used in BERT-Chinese and many other follow-up Chinese PLMs (e.g., Cui et al., 2019, 2020). We name this tokenizer BERT-Chinese as it originated from the Chinese version of BERT.

For the second baseline, we directly apply SentencePiece with unigram LM on the raw Chinese corpus to generate the vocabulary. As a result, the vocabulary contains both single characters and words (i.e., character combinations). This approach resembles the vocabulary of some recent multi-granularity Chinese PLM variants such as AMBERT (Zhang et al., 2021) and Lattice-BERT (Lai et al., 2021). Unlike them, we do not add any new model designs or pretraining objectives, but instead use the original BERT architecture and masked LM objective. We name this baseline SP-ULM.

To ensure a fair comparison, we set the same vocabulary size of 22675 for all tokenizers. We use the same training corpus to train all the tokenizers. We use the SentencePiece library's unigram LM implementation to train the tokenizers.

In order to evaluate the effectiveness of the tokenizers, we pretrain a BERT model using each tokenizer and compare their performance on downstream tasks. When pretraining the BERT models using each tokenizer, we use the same pretraining corpus and the same set of hyper-parameters for all models being compared. Notably, we also re-pretrain the BERT model using the BERT-Chinese tokenizer on our pretraining corpus instead of just loading from existing checkpoints, to ensure that all baselines and proposed methods are directly comparable. Since our proposed tokenizers are direct drop-in replacements for the baseline tokenizers, they do not incur any extra parameters. As a result, all the models being compared have the same number of parameters, allowing for a truly apples-to-apples comparison.

4.2 Datasets

We evaluate the models trained with different tokenization methods on a total of ten different downstream datasets, including single-sentence tasks, sentence-pair tasks, as well as reading comprehension tasks. We briefly introduce each dataset below and present the dataset statistics in Table 1.

Dataset    Task  MaxLen  Batch  Epoch  #Train  #Dev   #Test  Domain
TNEWS      DC    256     32     6      53.4K   10K    10K    News
IFLYTEK    DC    256     32     6      12.1K   2.6K   2.6K   App Description
BQ         SPM   256     32     6      100K    10K    10K    Bank Service
THUCNEWS   DC    256     32     6      669K    83.6K  83.6K  News
CLUEWSC    WSC   256     32     24     1.2K    0.3K   0.3K   Literature
AFQMC      SPM   256     32     6      34.3K   4.3K   3.9K   Financial
CSL        SPM   256     32     6      20K     3K     3K     Academic Papers
OCNLI      SPM   256     32     6      45.4K   5K     3K     Mixed
CHID       MRC   96      24     6      519K    57.8K  23K    Mixed
C3         MRC   512     24     6      12K     3.8K   3.9K   Mixed

Table 1: Hyper-parameters and statistics of different datasets. DC: document classification. SPM: sentence pair matching (including natural language inference). WSC: Winograd Schema Challenge. MRC: machine reading comprehension.

TNEWS (Co., 2018) is a news title classification dataset containing 15 classes. We use the split as released in Xu et al. (2020).

IFLYTEK (Co., 2019) is a long text classification dataset containing 119 classes. The task is to classify mobile applications into corresponding categories given their description.

BQ (Chen et al., 2018) is a sentence-pair question matching dataset extracted from an online bank customer service log. The goal is to evaluate whether two questions are semantically identical or can be answered by the same answer.

THUCNEWS (Li and Sun, 2007) is a document classification dataset with 14 classes. The task is to classify news into the corresponding categories given their title and content.
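As an aside on the tokenizers of Sections 3.1 and 3.2, the character-level encoding that precedes vocabulary learning can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the lookup tables are tiny hypothetical stand-ins for a full pronunciation dictionary or input-method table, the names `build_encoder` and `encode` are invented for illustration, and an index is appended to every character (not only to homophones) to keep the code short. The unigram LM step is omitted.

```python
from collections import defaultdict

SEP = "+"  # separation symbol inserted after every character's code

# Hypothetical toy tables for illustration only. The pinyin strings follow the
# tone-mark notation of Section 3.1 and the Wubi codes are the ones quoted in
# Section 3.2; a real tokenizer would load complete dictionaries.
PINYIN_TABLE = {
    "魑": "chi¯", "魅": "mei`", "魍": "wangˇ", "魉": "liangˇ",
    "真": "zhen¯", "针": "zhen¯",  # homophones: identical romanization
}
WUBI_TABLE = {"魑": "rqcc", "魅": "rqci", "魍": "rqcn", "魉": "rqcw"}


def build_encoder(table):
    """Map each character to code + index + separator, one-to-one."""
    seen = defaultdict(int)  # number of characters that already use a code
    encoder = {}
    for char in sorted(table):  # sorted for a deterministic index assignment
        code = table[char]
        encoder[char] = f"{code}{seen[code]}{SEP}"
        seen[code] += 1
    return encoder


def encode(text, encoder):
    """Rewrite covered characters; anything else passes through unchanged."""
    return "".join(encoder.get(char, char) for char in text)


PINYIN_ENCODER = build_encoder(PINYIN_TABLE)
WUBI_ENCODER = build_encoder(WUBI_TABLE)

print(encode("魑魅魍魉", PINYIN_ENCODER))
print(encode("真针", PINYIN_ENCODER))
print(encode("魑魅魍魉", WUBI_ENCODER))
```

Because the index makes every encoded string unique, the mapping can be inverted after tokenization, which is the property that lets the resulting sequences serve as a drop-in replacement for the raw characters.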
Tokenizer        TNEWS  IFLY   THUC   BQ     WSC    AFQMC  CSL    OCNLI  CHID   C3     AVG
6-layer
BERT-Chinese     64.10  57.77  96.97  81.98  62.39  68.95  82.60  68.46  72.33  53.51  70.91
SP-ULM           64.26  55.44  97.09  81.52  62.06  69.88  83.16  68.98  72.77  51.73  70.69 (-0.22)
JIEZI-CangJie    63.86  59.51  97.04  81.59  63.27  70.47  82.91  69.03  72.73  52.67  71.31 (+0.40)
JIEZI-Stroke     63.81  58.74  96.87  81.55  62.94  69.66  82.44  68.02  72.21  53.35  70.96 (+0.05)
JIEZI-Zhengma    63.96  58.74  96.99  82.27  61.95  69.86  83.46  68.56  72.12  54.91  71.28 (+0.37)
JIEZI-Wubi       64.91  59.39  97.03  81.41  62.72  69.14  82.60  69.12  72.02  53.99  71.16 (+0.25)
SHUOWEN-Pinyin   63.58  59.55  97.04  81.65  63.60  68.60  82.66  67.93  72.81  53.02  71.04 (+0.13)
SHUOWEN-Zhuyin   64.11  59.16  97.01  81.64  63.93  68.53  82.86  69.39  71.48  54.59  71.27 (+0.36)
12-layer
BERT-Chinese     65.07  58.01  97.05  82.33  73.14  71.04  83.90  70.19  76.61  55.90  73.32
SP-ULM           65.01  58.98  97.20  82.99  73.36  70.93  83.45  70.46  77.28  57.70  73.74 (+0.42)
JIEZI-CangJie    64.26  60.29  97.15  83.48  71.16  71.48  83.68  71.50  76.82  57.99  73.78 (+0.46)
JIEZI-Stroke     65.11  59.75  97.09  82.88  70.72  71.64  83.63  70.03  77.45  59.68  73.53 (+0.21)
JIEZI-Zhengma    64.51  60.78  97.14  83.15  72.15  70.76  83.68  71.22  76.72  57.49  73.76 (+0.44)
JIEZI-Wubi       64.47  60.05  97.16  82.76  72.70  72.00  83.62  70.77  76.34  58.31  73.82 (+0.50)
SHUOWEN-Pinyin   64.50  60.40  97.17  83.13  70.18  71.37  84.12  71.97  76.11  58.05  73.70 (+0.38)
SHUOWEN-Zhuyin   64.50  59.98  97.09  82.99  73.03  71.83  83.82  71.74  76.74  57.23  73.90 (+0.58)

Table 2: Results for standard evaluation. Best result on each dataset of each model size is boldfaced. The numbers in brackets in the last column indicate the average difference compared to the BERT-Chinese baseline.
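The AVG column and the bracketed differences in Table 2 are plain arithmetic over the ten per-dataset scores, and a short Python check makes the computation explicit. The sketch below copies four 6-layer rows from the table; the helper names `average` and `delta` are introduced here for illustration, and the difference is taken between the rounded averages, which is an assumption about how the bracketed numbers were produced (it reproduces the printed values for these rows).

```python
# Four 6-layer rows copied from Table 2, in column order:
# TNEWS, IFLY, THUC, BQ, WSC, AFQMC, CSL, OCNLI, CHID, C3.
ROWS = {
    "BERT-Chinese": [64.10, 57.77, 96.97, 81.98, 62.39, 68.95, 82.60, 68.46, 72.33, 53.51],
    "SP-ULM": [64.26, 55.44, 97.09, 81.52, 62.06, 69.88, 83.16, 68.98, 72.77, 51.73],
    "JIEZI-CangJie": [63.86, 59.51, 97.04, 81.59, 63.27, 70.47, 82.91, 69.03, 72.73, 52.67],
    "SHUOWEN-Zhuyin": [64.11, 59.16, 97.01, 81.64, 63.93, 68.53, 82.86, 69.39, 71.48, 54.59],
}
BASELINE = "BERT-Chinese"


def average(name):
    """Mean score over the ten datasets, rounded to two decimals."""
    scores = ROWS[name]
    return round(sum(scores) / len(scores), 2)


def delta(name):
    """Difference between a tokenizer's average and the baseline average."""
    return round(average(name) - average(BASELINE), 2)


for name in ROWS:
    print(f"{name:15s} avg={average(name):.2f} delta={delta(name):+.2f}")
```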

Tokenizer             TNEWS  IFLYTEK  CLUEWSC  AFQMC  CSL    OCNLI  C3     AVG
SP-ULM                64.26  55.44    62.06    69.88  83.16  68.98  51.73  65.07
SP-ULM + CWS          64.26  54.15    63.05    69.62  82.87  68.64  51.77  64.91
JIEZI-Wubi            64.91  59.39    62.72    69.14  82.60  69.12  53.99  65.98
JIEZI-Wubi + CWS      63.66  59.22    63.16    68.65  82.21  68.81  52.76  65.50
SHUOWEN-Zhuyin        64.11  59.16    63.93    68.53  82.86  69.39  54.59  66.08
SHUOWEN-Zhuyin + CWS  63.37  57.24    62.83    68.94  82.12  68.69  51.48  64.95

Table 3: Results of models trained with Word Segmented Tokenization. All models are 6-layer.
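The "+ CWS" rows in Table 3 differ from their counterparts only in a preprocessing step that joins segmented words with spaces before the tokenizer sees the text (Section 3.3). A minimal Python sketch of that step follows. It is an illustration under stated assumptions: the paper uses the THULAC segmenter, whereas this sketch swaps in a greedy forward-maximum-matching segmenter over a hypothetical three-word lexicon so that it runs without external dependencies, and the names `segment` and `preprocess` are invented here.

```python
# Hypothetical mini-lexicon; it stands in for a real segmenter such as THULAC.
LEXICON = {"这篇", "论文", "有意思"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Greedy forward maximum matching, a simple stand-in segmenter."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in LEXICON:
                words.append(piece)
                i += size
                break
    return words


def preprocess(text):
    """Join segmented words with spaces, as done before tokenization."""
    return " ".join(segment(text))


print(preprocess("这篇论文有意思。"))
```

The same `preprocess` step would have to be applied both when building the vocabulary and when feeding text to the model, since the tokenizer learns tokens that never cross the inserted spaces.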

CLUEWSC (Xu et al., 2020) is a coreference res- C3 (Sun et al., 2019) is a multiple choice ma- olution dataset in the format of Winograd Schema chine reading comprehension dataset. The goal is Challenge. The task is to determine whether the to choose the correct answer for some questions given noun and pronoun in the sentence co-refer. given a context. AFQMC is the Ant Financial Question Matching 4.3 Experiment Setup: Standard Evaluation Corpus for the question matching task that aims to predict whether two sentences are semantically For each tokenizer, we pretrain a 6-layer and 12- similar layer BERT style model using the Baidu Baike corpus (Zhang et al., 2020) which has 2.2G of raw CSL is the Chinese Scientific Literature dataset text before processing. The model configuration extracted from academic papers. Given an ab- is exactly the same for all models: 6 or 12 layers, stract and some keywords, the goal is to determine 12 attention heads, intermediate size 3072, hid- whether they belong to the same paper. It is for- den size 768. For pretraining, we follow the origi- matted as a sentence-pair matching task. nal BERT paper’s two-stage pretraining procedure OCNLI (Hu et al., 2020) is a natural language where we first train with sequence length 128 for inference dataset. The task is to determine 8k steps and then train with sequence length 512 whether the relationship between the hypothesis for 4k steps. We only keep the masked language and premise is entailment, neural, or contradic- modeling objective in pretraining and discard the tion. next sentence prediction objective as suggested in CHID (Zheng et al., 2019) is a cloze-style read- RoBERTa (Liu et al., 2019). During fine-tuning, ing comprehension dataset where . Given contexts we use the set of hyper-parameters as shown in Ta- where some idioms are masked, the task is to se- ble1. For all experiments in this paper, we report lect the appropriate idiom from a list of candidates. 
results of the average run of three different random clean 15% 30% 45% 60% TNEWS BERT-Chinese 63.99 62.18 60.49 57.64 53.97 SP-ULM 63.99 62.88 60.79 59.20 55.42 JIEZI-Wubi 64.05 62.80 62.56 62.71 62.81 OCNLI BERT-Chinese 68.07 62.60 56.83 51.37 46.00 SP-ULM 69.10 64.00 56.57 52.43 46.73 JIEZI-Wubi 68.43 67.10 65.47 65.40 64.80

Table 4: Results for noisy evaluation with glyph noises.

clean 15% 30% 45% 60% TNEWS BERT-Chinese 63.99 60.87 57.70 52.21 45.51 SP-ULM 63.99 61.52 58.60 53.30 46.81 SHUOWEN-Pinyin 63.35 60.60 57.45 53.61 49.29 SHUOWEN-Zhuyin 63.79 60.99 57.69 53.15 48.98 OCNLI BERT-Chinese 68.07 61.73 54.50 49.97 44.80 SP-ULM 69.10 62.23 54.77 50.33 45.70 SHUOWEN-Pinyin 67.83 61.00 54.33 49.87 45.10 SHUOWEN-Zhuyin 69.63 60.77 54.77 50.67 47.70

Table 5: Results for noisy evaluation with phonology noises. seeds. ilar characters could be chosen since their in- put encoding are similar. 4.4 Experiment Setup: Noisy Evaluation Apart from evaluating on the standard bench- • Pronunciation-based noise: we replace origi- marks, we also evaluate in a noisy setting to illus- nal characters with other characters that have trate the advantage of our proposed tokenization the same pronunciation but different seman- methods to handle noisy inputs. Specifically, we tic meanings (e.g., 真 (real) and 针 (needle)). inject two types of synthetic noises into both the Specifically, we obtain a substitution candi- training and test data in order to test whether the date list for each character, where all the can- models can learn from noisy training data and also didates have the same pronunciation as the perform robustly on noisy test data. We vary the original character. Then, similarly, we ran- ratio of noise in the data to examine the impact. domly sample a certain ratio r% of the origi- The two types of noises we inject are: nal characters, for each of them, we randomly sample a substitution character from its can- • Glyph-based noise: we replace original char- didate list for substitution. This simulates the acters with other characters that have similar common noise when users use pronunciation- glyph but have different semantic meanings based input methods where the input encod- (e.g., 壁 (wall) and 璧 (jade)). Specifically, ing of these characters and their substitutions we obtain a substitution candidate list for are the same. each character, where the candidates are se- lected so that they share at least one common For our experiment, we vary the noise ratio r% radical with the original character. Then, within the range of {0, 15%, 30%, 45% 60%}. 
we randomly sample a certain ratio r% of For the sampled characters to be replaced with the original characters, for each of them, noises, we randomly sample a substitution charac- we randomly sample a substitution character ter from their candidate list for every appearance from its candidate list for substitution. This of the character, instead of substituting with the simulates common noises when people use same candidate. This induces more noise varia- glyph-based input methods where these sim- tions. Note that some substitutions are both glyph- based noises and pronunciation-based noises (e.g., 5.2 Noisy Evaluation 快 and 块 share a same radical and also have the We perform noisy evaluation on two datasets: same pronunciation), we keep them in both types TNEWS and OCNLI. For glyph-based noises, of noises. we compare baselines BERT-Chinese and SP-ULM Intuitively, our SHUOWEN tokenizer could be with our JIEZI-Wubi. The results are presented robust to pronunciation-based noises and JIEZI to- in Table4. We observe that when the noise ra- kenizer could be robust to glyph-based noises be- tio increases, the advantage of JIEZI is particu- cause the substitution characters share similar pro- larly large. For example, when 60% characters are nunciation or glyph components with the original substituted, JIEZI-Wubi still performs close to the characters, which may be captured by our tokeniz- original performance, while other baselines suffer ers. large drops in performance. On OCNLI, the gap This noisy setup is reflective of real-life use can be as large as 18 points in accuracy. cases where user queries often contain such For pronunciation-based noises, we compare noises. Since most Chinese people use ei- baselines BERT-Chinese and SP-ULM with our ther glyph-based input methods (e.g., wubi) or SHUOWEN-Pinyin and SHUOWEN-Zhuyin. The pronunciation-based input methods (e.g., pinyin, results are shown in Table5. 
Unlike the cases on zhuyin), such mis-typed characters can be very glyph-based noises, we observe that the advantage common. This highlights the potential impact of of our SHUOWEN tokenizers are not so significant our work. compared to the baselines. One potential reason is that there many Chinese characters with the same 5 Results pronunciation. Unlike how the semantic meanings of radicals can be rather consistent across different characters, phoneme combinations can have vastly 5.1 Standard Evaluation different meanings across different characters (i.e., We compare the results of the baseline tokeniz- characters may pronounce the same but have to- tally different semantic meanings), which makes ers (BERT-Chinese, SP-ULM) with our proposed it difficult to learn these different semantic mean- JIEZI (including four variants: JIEZI-Cangjie, ings in the pronunciation-based token embedding. JIEZI-Stroke, JIEZI-Zhengma, JIEZI-Wubi) and SHUOWEN (including two variants: SHUOWEN- Pinyin and SHUOWEN-Zhuyin) tokenizers in Ta- 6 Conclusion ble2. In this paper, we have explored three linguisti- Despite some variations across different cally informed tokenization methods motivated by datasets, we observe that in terms of the average the unique linguistic characteristics of the Chi- score over ten datasets, all of our proposed nese writing system. Specifically, we find that tokenizers outperform the BERT-Chinese base- pronunciation-based and glyph-based tokenizers line. Notably, for the 6-layer model size, our can match or outperform the conventional Chinese JIEZI-Cangjie tokenizer obtains an average of tokenizers and Chinese word segmentation is not 0.40 points over the BERT-Chinese tokenizer, for a useful addition for the tokenizer. 
Moreover, we the 12-layer model size, our SHUOWEN-Zhuyin find that our glyph-based tokenizers achieve large tokenizer achieves an average of 0.58 points of gains on noisy input as compared to the baselines, improvement over the BERT-Chinese tokenizer. while our pronunciation-based tokenizers obtain These results indicate that on standard bench- limited success. This highlights the potential ad- marks, our proposed tokenizers can match or out- vantage of our proposed methods in real-life sce- perform the existing tokenizers for Chinese. narios with noisy data. We believe that our work On the other hand, we examine the impact of sets an important example of exploiting the unique CWS by comparing three tokenizers with their linguistic property of a language beyond English word segmented counterparts in Table3. We to develop more tailored techniques, which should can see that adding CWS as a preprocessing ac- be an important direction for the global NLP com- tually slightly decreased the average performance munity. on downstream tasks. References San Duanmu. 2007. The Phonology of . Oxford University Press. Emily Bender. 2019. The #BenderRule: On Nam- ing the Languages We Study and Why It Mat- Pengcheng He, Xiaodong Liu, Jianfeng Gao, and ters. The Gradient. Weizhu Chen. 2021. DeBERTa: Decoding- enhanced BERT with Disentangled Attention. Kaj Bostrom and Greg Durrett. 2020. Byte Pair In ICLR. Encoding is Suboptimal for Language Model Pretraining. In Findings of EMNLP. Archibald A Hill. 2016. The typology of writing systems. In Papers in linguistics in honor of Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Leon Dostert, pages 92–99. De Gruyter Mou- Li. 2018. cw2vec: Learning Chinese Word Em- ton. beddings with Stroke n-gram Information. In AAAI. Hai Hu, Kyle Richardson, Liang Xu, Lu Li, San- dra Kübler, and Lawrence S. Moss. 2020. OC- Pi-Chuan Chang, Michel Galley, and Christo- NLI: Original Chinese Natural Language Infer- pher D. Manning. 2008. Optimizing Chinese ence. 
In Findings of EMNLP. Word Segmentation for Machine Translation Performance. In WMT@ACL. Taku Kudo and John Richardson. 2018. Sentence- Piece: A simple and language independent sub- Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, word tokenizer and detokenizer for Neural Text Daohe Lu, and Buzhou Tang. 2018. The BQ Processing. In EMNLP. Corpus: A Large-scale Domain-specific Chi- nese Corpus For Sentence Semantic Equiva- Yuxuan Lai, Yijia Liu, Yansong Feng, Songfang lence Identification. In EMNLP. Huang, and Dongyan Zhao. 2021. Lattice- BERT: Leveraging Multi-Granularity Repre- Kevin Clark, Minh-Thang Luong, Quoc V. Le, sentations in Chinese Pre-trained Language and Christopher D. Manning. 2020. ELEC- Models. TRA: Pre-training Text Encoders as Discrimi- nators Rather Than Generators. In ICLR. Zhenzhong Lan, Mingda Chen, Sebastian Good- IFLYTEK Co. 2019. IFLYTEK: A multiple cate- man, Kevin Gimpel, Piyush Sharma, and Radu gories Chinese text classifier. Soricut. 2020. ALBERT: A Lite BERT for Self- supervised Learning of Language Representa- TouTiao Co. 2018. TNEWS Dataset. tions. In ICLR.

Yiming Cui, Wanxiang Che, Ting Liu, Bing , Jingyang Li and Maosong Sun. 2007. Scalable Shijin Wang, and Guoping Hu. 2020. Revis- Term Selection for Text Categorization. In iting Pre-Trained Models for Chinese Natural CoNLL. Language Processing. In Findings of EMNLP. Zhongguo Li and Maosong Sun. 2009. Punctua- Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, tion as Implicit Annotations for Chinese Word Ziqing Yang, Shijin Wang, and Guoping Hu. Segmentation. Computational Linguistics. 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv, abs/1906.08101. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Mike Lewis, Luke Zettlemoyer, and Veselin Kristina Toutanova. 2019. BERT: Pre-training Stoyanov. 2019. RoBERTa: A Robustly Op- of Deep Bidirectional Transformers for Lan- timized BERT Pretraining Approach. arXiv, guage Understanding. In NAACL-HLT. abs/1907.11692.

Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, Wentao Ma, Yiming Cui, Chenglei Si, Ting Liu, and Yonggang Wang. 2020. ZEN: Pre-training Shijin Wang, and Guoping Hu. 2020. Char- Chinese Text Encoder Enhanced by N-gram BERT: Character-aware Pre-trained Language Representations. In Findings of EMNLP. Model. In COLING. Yuxian Meng, Xiaoya Li, Xiaofei Sun, Qinghong Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Han, Arianna Yuan, and Jiwei Li. 2019. Is Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Word Segmentation Necessary for Deep Learn- Su, Haozhe Ji, Jian Guan, Fanchao , Xiaozhi ing of Chinese Representations? In ACL. Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Jerome L. Packard. 2000. The Morphology of Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Chinese: A Linguistic and Cognitive Approach. Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Cambridge University Press. Sun. 2020. CPM: A Large-scale Generative Mike Schuster and Kaisuke Nakajima. 2012. Chinese Pre-trained Language Model. arXiv, Japanese and Korean voice search. ICASSP. abs/2012.00413.

Rico Sennrich, B. Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. arXiv, abs/1508.07909.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension. TACL.

Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. In ACL.

Wei Zhu. 2020. MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining. arXiv, abs/2011.08539.

Mi Xue Tan, Yuhuang Hu, Nikola I. Nikolov, and Richard H. R. Hahnloser. 2018. wubi2en: Character-level Chinese-English Translation through ASCII Encoding. In WMT.

Wei Wu, Yuxian Meng, F. Wang, Qinghong Han, Muyu Li, Xiaoya Li, Jie Mei, Ping Nie, Xiaofei Sun, and Jiwei Li. 2019. Glyce: Glyph-vectors for Chinese Character Representations. In NeurIPS.

Dongling Xiao, Yukun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding. In NAACL.

Liang Xu, Hai Hu, Xuanwei Zhang, Chenjie Cao, Lu Li, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. 2020. CLUE: A Chinese Language Understanding Evaluation Benchmark. In COLING.

Xinsong Zhang, Pengshuai Li, and Hang Li. 2021. AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization. In Findings of ACL.

A Appendix

A.1 What is SHUOWEN-JIEZI?

SHUOWEN-JIEZI (‘说文解字’) is an ancient Chinese dictionary from the Han dynasty. It was the first dictionary to analyze the structure of Chinese characters and to give the rationale behind them, and also the first dictionary to use Chinese radicals to organise the sections.

The literal meanings of SHUOWEN and JIEZI correspond nicely to the core intuitions behind our pronunciation- and glyph-based tokenizers. We chose this name for our methods to pay tribute to the ancient wisdom of our ancestors.

A.2 Input Methods

Current Chinese input methods for computers can be categorized into pronunciation-based and glyph-based. Both kinds encode each Chinese character into a sequence of units from a smaller alphabet (e.g., the Latin alphabet), but differ in what the units represent. In pronunciation-based input methods, each unit usually represents a phoneme, while in glyph-based methods, one unit or a group of units generally represents a radical or stroke composition. Chinese characters have a standardized stroke order, which can be taken into account by glyph-based methods. In almost all commonly used input methods, there exist different characters that are encoded into the same sequence, in which case the solution is usually to list all matching characters and let the user select the correct one. We briefly introduce each input method used in the paper below.

Cangjie (‘仓颉’) and Wubi (‘五笔’) are two similar glyph-based input methods. They map keys on the QWERTY keyboard to fundamental radicals or combinations of strokes that are combined to represent the shape of entire characters. They sometimes disregard stroke order in favor of combinations that are visually more similar to the character. The main differences between them are the key mapping and the rules for breaking down characters into fundamental components.

Zhengma (‘郑码’), a glyph-based method, is similar to Cangjie and Wubi. It maps each Latin letter to fundamental radicals, which are combined into entire characters. But Zhengma differs from Cangjie and Wubi in that it strictly follows stroke order.

Stroke (‘笔画’), a glyph-based method more commonly used on mobile phones or numerical keypads, maps five keys to five basic types of strokes. The user presses the corresponding keys according to the character's stroke order.

Pinyin (‘拼音’) input method, a pronunciation-based input method, is the most widely used input method among Chinese speakers. It is based on the Hanyu Pinyin (‘汉语拼音’, meaning Chinese Sound-Spelling) romanization system for Chinese. Each Chinese monophthong is mapped to one or two Latin letters.

Zhuyin (‘注音’) input method, a pronunciation-based input method, is the most common input method in the Taiwan province of China. It is based on the Zhuyin phonetic transcription system, which consists of 37 characters that represent the Chinese phonemes. Note that while both the Pinyin and Zhuyin input methods disregard tones when used on electronic devices, we do keep the tones during encoding in our tokenizers.
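The many-to-one encoding described in A.2 can be illustrated with a minimal Python sketch. The table `TOY_PINYIN` and the helpers `encode` and `collisions` are hypothetical names introduced here for illustration only; they are not the paper's implementation or vocabulary. The sketch shows how dropping tones, as Pinyin input methods do, makes several characters share one code, and how keeping a tone digit separates them in this small example.

```python
from collections import defaultdict

# Toy table (an assumption for illustration, not the paper's data):
# character -> (toneless Pinyin syllable, tone number).
TOY_PINYIN = {
    "妈": ("ma", 1),
    "麻": ("ma", 2),
    "马": ("ma", 3),
    "骂": ("ma", 4),
    "你": ("ni", 3),
}


def encode(char, keep_tone):
    """Encode one character as a Latin-letter sequence, optionally with a tone digit."""
    syllable, tone = TOY_PINYIN[char]
    return f"{syllable}{tone}" if keep_tone else syllable


def collisions(keep_tone):
    """Return every code shared by more than one character in the toy table."""
    groups = defaultdict(list)
    for char in TOY_PINYIN:
        groups[encode(char, keep_tone)].append(char)
    return {code: chars for code, chars in groups.items() if len(chars) > 1}


print(collisions(keep_tone=False))  # toneless codes: the four "ma" characters collide
print(collisions(keep_tone=True))   # with tone digits: no collisions in this toy table
```

Keeping tones removes the collisions only within this toy table; real homophones that share both syllable and tone would still map to the same sequence, which is why input methods present a candidate list.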