UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database

Canwen Xu1*, Tao Ge2, Chenliang Li3, Furu Wei2
1 University of California, San Diego   2 Microsoft Research Asia   3 Wuhan University

* This work was done during Canwen's internship at Microsoft Research Asia.

Abstract

Chinese and Japanese share many characters with similar surface morphology. To better utilize the knowledge shared across the two languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts, to first merge morphologically similar characters into clusters. The resulting clusters are used to replace the original characters in sentences for the coarse-grained pretraining of the MLM. Then, we restore the clusters back to the original characters in sentences for the fine-grained pretraining to learn the representations of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that our proposed UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages.[1]

[1] The code and pretrained weights are available at https://github.com/JetRunner/unihan-lm.

JA    台¹風²は熱³帯⁴低気⁵圧⁶の一種⁷です。
T-ZH  颱¹風²是熱³帶⁴低氣⁵壓⁶的一種⁷。
S-ZH  台¹风²是热³带⁴低气⁵压⁶的一种⁷。
EN    Typhoon is a type of tropical depression.

Table 1: A sentence example in Japanese (JA), Traditional Chinese (T-ZH) and Simplified Chinese (S-ZH) with its English translation (EN). The characters that already share the same Unicode code point are marked with an underline. In this work, we further match characters with identical meanings but different Unicode code points and merge them. Characters eligible to be merged together are marked with the same superscript.

1 Introduction

Recently, pretrained language models have shown promising performance on many NLP tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019c; Lan et al., 2020). Many attempts have been made to train a single model that supports multiple languages. Among them, Multilingual BERT (mBERT) (Devlin et al., 2019) is released as part of BERT; it directly adopts the same model architecture and training objective, and is trained on Wikipedia in different languages. XLM (Lample and Conneau, 2019) is proposed with an additional language embedding and a new training objective (translation language modeling, TLM). XLM-R (Conneau et al., 2019) has a larger size and is trained with more data. Based on XLM, Unicoder (Huang et al., 2019) collects more data and uses multi-task learning to train on three supervised tasks.

The consensus of cross-lingual approaches is to allow lexical information to be shared between languages. XLM and mBERT exploit shared lexical information through Byte Pair Encoding (BPE) (Sennrich et al., 2016) and WordPiece (Wu et al., 2016), respectively. However, these automatically learned shared representations have been criticized by recent work (K et al., 2020), which reveals their limitations in sharing meaningful semantics across languages. Also, words in both Chinese and Japanese are short, which hinders effective learning of sub-word representations. Different from European languages, Chinese and Japanese naturally share Chinese characters as a subword component.
Early work (Chu et al., 2013) shows that the characters shared by these two languages can benefit Example-based Machine Translation (EBMT) with statistical phrase extraction and alignment. For Neural Machine Translation (NMT), Zhang and Komachi (2019) exploited such information by learning a BPE representation over sub-character (i.e., ideograph and stroke) sequences. They applied this technique to unsupervised Chinese-Japanese machine translation and achieved state-of-the-art performance. However, this approach relies heavily on unreliable automatic BPE learning and may suffer from the noise introduced by the many character variants.

To facilitate lexical sharing, we propose the Unihan Language Model (UnihanLM), a cross-lingual pretrained masked language model for Chinese and Japanese. We propose a two-stage coarse-to-fine pretraining procedure that enables better generalization and takes advantage of the characters shared among Japanese, Traditional Chinese and Simplified Chinese. First, we let the model exploit the maximum possible amount of shared lexical information. Instead of learning a shared sub-word vocabulary as in prior work, we leverage the Unihan database (Jenkins et al., 2019), a ready-made constituent of the Unicode standard, to extract the lexical information shared across the languages. Using this database, we effectively merge characters that have similar surface morphology but independent Unicode code points, as shown in Table 1, into thousands of clusters. These clusters replace the original characters in sentences during the first-stage coarse-grained pretraining. After the coarse-grained pretraining finishes, we restore the clusters back to the original characters, initialize each character's representation with the representation of its corresponding cluster, and then learn the character-specific representations during the second-stage fine-grained pretraining. In this way, our model can make full use of shared characters while maintaining a good sense of the nuances among similar characters.
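Conceptually, the two stages amount to a character-to-cluster substitution followed by an embedding hand-off. The following is a minimal sketch of that idea, assuming a toy char_to_cluster mapping derived from the Unihan variant relations described in Section 2.2 and PyTorch embedding tables; the function and variable names are illustrative and do not come from the released implementation.

```python
import torch
import torch.nn as nn

# Toy cluster map derived from Unihan variant relations (illustrative values only).
char_to_cluster = {"台": 0, "颱": 0, "風": 1, "风": 1, "は": 2, "一": 3}
char_vocab = {ch: i for i, ch in enumerate(char_to_cluster)}   # fine-grained ids
num_clusters = len(set(char_to_cluster.values()))
hidden = 8  # tiny embedding size, just for the sketch

def coarse_tokenize(text):
    """Stage 1 input: every character is replaced by its cluster id, so all
    Japanese / Traditional / Simplified variants share one token."""
    unk = num_clusters                      # one shared id for unseen characters
    return [char_to_cluster.get(ch, unk) for ch in text]

def fine_tokenize(text):
    """Stage 2 input: every character keeps its own id."""
    unk = len(char_vocab)
    return [char_vocab.get(ch, unk) for ch in text]

# Stage 1 (coarse-grained): pretrain an MLM over cluster ids.
coarse_emb = nn.Embedding(num_clusters + 1, hidden)
# ... masked language modeling over coarse_tokenize(corpus) goes here ...

# Stage 2 (fine-grained): switch to a character-level vocabulary and initialize
# each character's embedding from the embedding of the cluster it belonged to.
fine_emb = nn.Embedding(len(char_vocab) + 1, hidden)
with torch.no_grad():
    for ch, char_id in char_vocab.items():
        fine_emb.weight[char_id] = coarse_emb.weight[char_to_cluster[ch]]
# ... masked language modeling over fine_tokenize(corpus) continues here ...

print(coarse_tokenize("台風"), fine_tokenize("颱風"))   # [0, 1] [1, 2]
```

Initializing the fine-grained table this way keeps what was learned at the cluster level while still allowing, e.g., 台 and 颱 to drift apart during the second stage.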
To verify the effectiveness of our approach, we evaluate on both lexical and semantic tasks in Chinese and Japanese. On word segmentation, our model outperforms monolingual and multilingual BERT (Devlin et al., 2019) and shows much higher performance on cross-lingual zero-shot transfer. Also, our model achieves state-of-the-art performance on unsupervised Chinese-Japanese machine translation, and is even comparable to the supervised baseline on Chinese-to-Japanese translation. On classification tasks, our model achieves performance comparable to monolingual BERT and other cross-lingual models trained with the same scale of data.

To summarize, our contributions are three-fold: (1) We propose UnihanLM, a cross-lingual pretrained language model for Chinese and Japanese NLP tasks. (2) We pioneer the use of a linguistic resource, the Unihan Database, to help model pretraining, allowing more lexical information to be shared between the two languages. (3) We devise a novel coarse-to-fine two-stage pretraining strategy with different granularities for Chinese-Japanese language modeling.

2 Preliminaries

2.1 Chinese Character

Chinese characters are pictographs used in Chinese and Japanese. These characters often share the same background and origin. However, due to historical reasons, Chinese characters have developed into different writing systems, including Japanese Kanji, Traditional Chinese and Simplified Chinese. Also, even within a single text, multiple variants of the same character can be used interchangeably (e.g., "台灣" and "臺灣" for "Taiwan" in Traditional Chinese). These characters have identical or overlapping meanings. Thus, it is critical to better exploit such information for modeling cross-lingual (i.e., between Chinese and Japanese), cross-system (i.e., between Traditional and Simplified Chinese) and cross-variant semantics.

Neither Chinese nor Japanese uses a delimiter (e.g., white space) to mark word boundaries. There have always been debates over whether word segmentation is necessary for Chinese NLP; recent work (Li et al., 2019) concludes that it is not necessary for various NLP tasks in Chinese. Previous cross-lingual language models use different methods for tokenization. mBERT adds white spaces around Chinese characters and leaves Katakana/Hiragana Japanese (also known as kanas) unprocessed. Different from mBERT, XLM uses the Stanford Tokenizer[2] and KyTea[3] to segment Chinese and Japanese sentences, respectively. After tokenization, mBERT and XLM use WordPiece (Wu et al., 2016) and Byte Pair Encoding (Sennrich et al., 2016) for sub-word encoding, respectively.

[2] https://nlp.stanford.edu/software/tokenizer.html
[3] http://www.phontron.com/kytea/

Tokenization Scheme | Result
BERT (2019) | 台 / 風 / はひどい
XLM (2019) | 台風 / は / ひどい
UnihanLM | 台 / 風 / は / ひ / ど / い

Table 3: Different tokenization schemes used in recent work and ours. Note that the tokenized results of both BERT and XLM in this table are before WordPiece/BPE is applied; WordPiece/BPE may further split a token.

Nevertheless, both approaches suffer from obvious drawbacks. For mBERT, kanas and Chinese characters are treated differently, which causes a mismatch for labeling tasks. Also, leaving kanas untokenized may cause a data sparsity problem. For XLM, as pointed out by Li et al. (2019), an external word segmenter introduces extra segmentation errors and compromises the performance of the model. Also, for a word-based model, it is difficult to share cross-lingual characters unless the segmented words in Chinese and Japanese match exactly. Furthermore, both approaches enlarge the vocabulary and thus introduce more parameters.
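In contrast, the UnihanLM row of Table 3 corresponds to a fully character-level scheme. The snippet below is a minimal sketch contrasting the two pre-tokenization styles discussed above; the helper names, the Unicode ranges and the handling of non-CJK text are illustrative assumptions, not the released tokenizer.

```python
import re

# CJK Unified Ideographs plus Extension A (an illustrative subset of the ranges
# a real implementation would need to cover).
CJK = r"\u4e00-\u9fff\u3400-\u4dbf"

def mbert_style(text: str) -> list[str]:
    """mBERT-like pre-tokenization (Table 3, first row): pad each ideograph
    with spaces and leave kana runs untouched."""
    return re.sub(f"([{CJK}])", r" \1 ", text).split()

def unihanlm_style(text: str) -> list[str]:
    """Fully character-level scheme (Table 3, last row): ideographs and kanas
    are both split into single characters; whitespace is dropped."""
    return [ch for ch in text if not ch.isspace()]

print(" / ".join(mbert_style("台風はひどい")))      # 台 / 風 / はひどい
print(" / ".join(unihanlm_style("台風はひどい")))   # 台 / 風 / は / ひ / ど / い
```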
2.2 Unihan Database

Chinese, Japanese and Korean (CJK) characters share a common origin from the ancient Chinese characters. However, with the development of each writing system, a single character can have multiple variant characters (e.g., the traditional variants of "发" in Table 2).

Variant | Description | Example
Traditional Variant | The traditional versions of a simplified Chinese character. | 发 → 髮 (hair), 發 (to burgeon)
Simplified Variant | The simplified version of a traditional Chinese character. | 團 → 团 (group)
Z-Variant | The same character with different Unicode code points, kept only for compatibility. | 說 ↔ 説 (say)
Semantic Variant | Characters with identical meaning. | 兎 ↔ 兔 (rabbit)
Specialized Semantic Variant | Characters with overlapping meaning. | 丼 (rice bowl, well) ↔ 井 (well)

Table 2: The five types of variants in the Unihan database.
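These variant relations are what the coarse-grained stage builds its clusters from. The sketch below shows one way such clusters could be derived, assuming the tab-separated layout of the Unihan_Variants.txt file shipped with the Unicode Unihan database and simply unioning all five relation types with a union-find; whether the actual model merges every relation type (e.g., Specialized Semantic Variants, which only overlap in meaning) is a design choice not specified here.

```python
import re
from collections import defaultdict

# The five variant fields listed in Table 2, as they appear in Unihan_Variants.txt.
VARIANT_FIELDS = {
    "kTraditionalVariant", "kSimplifiedVariant", "kZVariant",
    "kSemanticVariant", "kSpecializedSemanticVariant",
}

parent = {}                     # union-find forest over characters

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

def build_clusters(path="Unihan_Variants.txt"):
    """Merge every pair of characters connected by a variant relation and
    return a char -> cluster-id mapping. A data line looks roughly like
    'U+767C<TAB>kSimplifiedVariant<TAB>U+53D1'; values may carry source tags
    such as 'U+9AEE<kLau', which the regex below ignores."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            code, field, value = line.rstrip("\n").split("\t", 2)
            if field not in VARIANT_FIELDS:
                continue
            src = chr(int(code[2:], 16))
            for target in re.findall(r"U\+[0-9A-F]+", value):
                union(src, chr(int(target[2:], 16)))
    clusters = defaultdict(list)
    for ch in parent:
        clusters[find(ch)].append(ch)
    return {ch: idx for idx, members in enumerate(clusters.values()) for ch in members}

# char_to_cluster = build_clusters()   # e.g. 兎 and 兔 end up in the same cluster
```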