
Chinese Embedding via Stroke and Glyph Information: A Dual-channel View

Hanqing Tao†, Shiwei Tong†, Tong Xu†§, Qi Liu†§, Enhong Chen†§∗
†Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China
§School of Data Science, University of Science and Technology of China
{hqtao, tongsw}@mail.ustc.edu.cn, {tongxu, qiliuql, cheneh}@ustc.edu.cn

Abstract

Recent studies have consistently given positive hints that morphology is helpful in enriching word embeddings. In this paper, we argue that Chinese word embeddings can be substantially enriched by the morphological information hidden in characters, which is reflected not only in strokes order sequentially, but also in character glyphs spatially. We then propose a novel Dual-channel Word Embedding (DWE) model to realize the joint learning of the sequential and spatial information of characters. Through the evaluation on both word similarity and word analogy tasks, our model shows its rationality and superiority in modelling the morphology of Chinese.

1 Introduction

Word embeddings are fixed-length vector representations for words (Mikolov et al., 2013; Cui et al., 2018). In recent years, the morphology of words is drawing more and more attention (Cotterell and Schütze, 2015), especially for Chinese, whose writing system is based on logograms[1].

With the gradual exploration of the semantic features of Chinese, scholars have found that not only words and characters are important semantic carriers, but also the stroke[2] feature of Chinese characters is crucial for inferring semantics (Cao et al., 2018). Actually, a Chinese word usually consists of several characters, and each character can be further decomposed into a stroke sequence which is certain and changeless, and this kind of stroke sequence is very similar to the construction of English words. In Chinese, a particular sequence of strokes can reflect the inherent semantics. As shown in the upper half of Figure 1, the Chinese character "驾" (drive) can be decomposed into a sequence of eight strokes, where the last three strokes together correspond to a root character "马" (horse), similar to the root "clar" of the English words "declare" and "clarify".

Moreover, Chinese is a language originated from Oracle Bone Inscriptions (a kind of hieroglyphics). Its character glyphs have a spatial structure similar to graphs, which can convey abundant semantics (Su and Lee, 2017). Additionally, the critical reason why Chinese characters are so rich in morphological information is that they are composed of basic strokes in a 2-D spatial order. However, different spatial configurations of strokes may lead to different semantics. As shown in the lower half of Figure 1, the three Chinese characters "入" (enter), "八" (eight) and "人" (man) share exactly a common stroke sequence, but they have completely different semantics because of their different spatial configurations.

In addition, some biological investigations have confirmed that there are actually two processing channels for the Chinese language. Specifically, Chinese readers not only activate the left brain, which is the dominant hemisphere in processing alphabetic languages (Springer et al., 1999; Knecht et al., 2000; Paulesu et al., 2000), but also activate the areas of the right brain that are responsible for image processing and spatial information at the same time (Tan et al., 2000). Therefore, we argue that the morphological information of characters in Chinese consists of two parts, i.e., the sequential information hidden in root-like strokes order, and the spatial information hidden in graph-like character glyphs. Along this line, we propose a novel Dual-channel Word Embedding (DWE) model for Chinese to realize the joint learning of sequential and spatial information in characters.
Finally, we evaluate DWE on two representative tasks, where the experimental results exactly validate the superiority of DWE in capturing the morphological information of Chinese.

∗ Corresponding author.
[1] https://en.wikipedia.org/wiki/Logogram
[2] https://en.wikipedia.org/wiki/Stroke_(CJK_character)

[Figure 1 here: "驾" (drive) with its stroke sequence ㇆ ㇓ ㇑ ㇕ ㇐ ㇕ ㇉ ㇐, whose last three strokes ㇕ ㇉ ㇐ form "马" (horse); and "入" (enter), "八" (eight) and "人" (person), which share one stroke sequence.]

Figure 1: The upper part is an example for illustrating the inclusion relationship hidden in strokes order and character glyphs. The lower part reflects that a common stroke sequence may form different Chinese characters if their spatial configurations are different.
2 Related Work

2.1 Morphological Word Representations

Traditional methods of obtaining word embeddings are mainly based on the distributional hypothesis (Harris, 1954): words with similar contexts tend to have similar semantics. To explore more interpretable models, some scholars have gradually noticed the importance of the morphology of words in conveying semantics (Luong et al., 2013; Qiu et al., 2014), and some studies have proved that the morphology of words can indeed enrich the semantics of word embeddings (Sak et al., 2010; Soricut and Och, 2015; Cotterell and Schütze, 2015). More recently, Wieting et al. (2016) proposed to represent words using character n-gram count vectors. Further, Bojanowski et al. (2017) improved the classic skip-gram model (Mikolov et al., 2013) by taking subwords into account in the acquisition of word embeddings, which is instructive for us to regard certain stroke sequences as the counterparts of roots in English.

2.2 Embedding for Chinese Language

The complexity of Chinese itself has given birth to a lot of research on Chinese embedding, including the utilization of character features (Chen et al., 2015) and radicals (Sun et al., 2014; Yin et al., 2016; Yu et al., 2017). Considering the 2-D graphic structure of Chinese characters, Su and Lee (2017) creatively proposed to enhance word representations by character glyphs. Lately, Cao et al. (2018) proposed that a Chinese word can be decomposed into a sequence of strokes which correspond to subwords in English, and Wu et al. (2019) designed a Tianzige-CNN to model the spatial structure of Chinese characters from the perspective of image processing. However, their methods are either somewhat loose in their stroke criteria or unable to capture the interactions between strokes and character glyphs.

3 DWE Model

As we mentioned earlier, it is reasonable and imperative to learn Chinese word embeddings from two channels, i.e., a sequential stroke n-gram channel and a spatial glyph channel. Inspired by previous works (Chen et al., 2015; Dong et al., 2016; Su and Lee, 2017; Wu et al., 2019), we propose to combine the representation of Chinese words with the representation of characters to obtain finer-grained semantics, so that unknown words can be identified and their relationship with other known Chinese characters can be found by distinguishing the common stroke sequences or character glyphs they share.

Our DWE model is shown in Figure 2. An arbitrary Chinese word w, e.g., "驾车", is firstly decomposed into several characters, e.g., "驾" and "车", and each of the characters is further processed in a dual-channel character embedding sub-module to refine its morphological information. In the sequential channel, each character is decomposed into a stroke sequence according to the criteria of the Chinese writing system, as shown in Figure 1. After retrieving the stroke sequence, we add special boundary symbols < and > at its beginning and end, and adopt the efficient stroke n-gram method (Cao et al., 2018)[3] to extract strokes-order information for each character. More precisely, we firstly scan each character throughout the training corpus and obtain a stroke n-gram dictionary G. Then, we use G(c) to denote the collection of stroke n-grams of each character c in w.

[3] We apply a richer standard of strokes (32 kinds of strokes) than they did (only 5 kinds of strokes).
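To make the sequential channel concrete, the following is a minimal Python sketch of this n-gram extraction. CHAR_STROKES is a hypothetical lookup table from characters to stroke sequences (the paper crawls this information from an online stroke dictionary, see Section 4.1); the stroke sequences shown are the ones given in Figure 1, and the 3 ≤ n ≤ 6 window follows Section 4.2.

```python
# A minimal sketch of the sequential (stroke n-gram) channel. CHAR_STROKES is a
# hypothetical lookup table from characters to stroke sequences; the paper
# obtains this information by crawling an online stroke dictionary.
CHAR_STROKES = {
    "马": ["㇕", "㇉", "㇐"],                              # from Figure 1
    "驾": ["㇆", "㇓", "㇑", "㇕", "㇐", "㇕", "㇉", "㇐"],  # from Figure 1
}

def stroke_ngrams(char, n_min=3, n_max=6):
    """Return the stroke n-gram collection G(c) of one character, with the
    boundary symbols < and > added before extraction (3 <= n <= 6)."""
    strokes = ["<"] + CHAR_STROKES[char] + [">"]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(strokes) - n + 1):
            grams.append(tuple(strokes[i:i + n]))
    return grams

def build_ngram_dict(corpus_chars):
    """Scan every character seen in the corpus once and index all stroke n-grams (G)."""
    G = {}
    for c in set(corpus_chars):
        for g in stroke_ngrams(c):
            G.setdefault(g, len(G))  # assign an index to each distinct n-gram
    return G
```

The boundary symbols make n-grams at the start and end of a character distinguishable from those in its middle, mirroring the subword treatment of Bojanowski et al. (2017).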
While in the spatial channel, to capture the semantics hidden in glyphs, we render the glyph I_c for each character c and apply a well-known CNN structure, LeNet (LeCun et al., 1998), to process each character glyph, which is also helpful to distinguish between those characters that are identical in strokes.

Figure 2: An illustration of our Dual-channel Word Embedding (DWE) model.

After that, we combine the representation of words with the representation of characters and define the word embedding for w as follows:

w = w_{ID} \oplus \Big( \frac{1}{N_c} \sum_{c \in w} \sum_{g \in G(c)} g * \mathrm{CNN}(I_c) \Big),   (1)

where ⊕ and ∗ are compositional operations[4], w_{ID} is the word ID embedding, and N_c is the number of characters in w.

[4] There are a variety of options for ⊕ and ∗, e.g., addition, item-wise dot product and concatenation. In this paper, we use the addition operation for ⊕ and the item-wise dot product operation for ∗.
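As a concrete reading of Eq. (1), the sketch below composes a word vector from its parts using the choices in footnote [4] (⊕ as addition, ∗ as the item-wise product). It assumes, purely for illustration, that the LeNet glyph features have already been projected to the 300-dimensional embedding space; the function and argument names are ours, not the paper's.

```python
import numpy as np

DIM = 300  # embedding dimension used throughout the paper

def word_embedding(word_id_vec, char_ngram_vecs, char_glyph_feats):
    """Compose a word vector as in Eq. (1).

    word_id_vec      : (DIM,) word ID embedding w_ID
    char_ngram_vecs  : per character, a list of (DIM,) stroke n-gram vectors g in G(c)
    char_glyph_feats : per character, a (DIM,) glyph feature CNN(I_c)
    """
    n_chars = len(char_ngram_vecs)            # N_c
    char_term = np.zeros(DIM)
    for grams, glyph in zip(char_ngram_vecs, char_glyph_feats):
        for g in grams:
            char_term += g * glyph            # g ∗ CNN(I_c), item-wise product
    return word_id_vec + char_term / n_chars  # w_ID ⊕ averaged character term
```

Averaging over the N_c characters keeps the character term on a comparable scale for words of different lengths.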

According to the previous work (Mikolov et al., 2013), we compute the similarity between the current word w and one of its context words e by defining a score function s(w, e) = w · e, where w and e are the embedding vectors of w and e respectively. Following the previous works (Mikolov et al., 2013; Bojanowski et al., 2017), the objective function is defined as follows:

L = \sum_{w \in D} \sum_{e \in T(w)} \Big[ \log \sigma(s(w, e)) + \lambda \, \mathbb{E}_{e' \sim P}\big[ \log \sigma(-s(w, e')) \big] \Big],   (2)

where λ is the number of negative samples and E_{e'∼P}[·] is the expectation term. For each word w in the training corpus D, T(w) is the set of its context words, and the negative samples e' are selected according to the distribution P, which is usually set as the word unigram distribution; σ is the sigmoid function.
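A minimal sketch of one (word, context) term of Eq. (2) follows, with the expectation approximated by λ negatives drawn from the unigram distribution P. The vectors are assumed to be the composed word embedding from Eq. (1) and a context embedding; the helper names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(w_vec, e_vec):
    """Score function s(w, e) = w · e."""
    return np.dot(w_vec, e_vec)

def pair_loss(w_vec, ctx_vec, neg_vecs):
    """One (word, context) term of Eq. (2): the positive part plus the
    expectation term approximated by the sampled negative vectors."""
    positive = np.log(sigmoid(score(w_vec, ctx_vec)))
    negative = sum(np.log(sigmoid(-score(w_vec, n))) for n in neg_vecs)
    return positive + negative

def sample_negatives(unigram_probs, n_samples, rng):
    """Draw negative word indices from the word unigram distribution P."""
    return rng.choice(len(unigram_probs), size=n_samples, p=unigram_probs)
```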

4 Experiments

4.1 Dataset Preparation

We download parts of the Chinese Wikipedia articles from Large-Scale Chinese Datasets for NLP[5]. For word segmentation and filtering of stopwords, we apply the jieba[6] toolkit based on a stopwords table[7]. Finally, we get 11,529,432 segmented words. In accordance with the work of Chen et al. (2015), all items whose Unicode falls into the range between 0x4E00 and 0x9FA5 are treated as Chinese characters. We crawl the stroke information of all 20,402 characters from an online dictionary[8] and render each character glyph to a 28 × 28 1-bit grayscale bitmap by using Pillow[9].

[5] https://github.com/brightmart/nlp_chinese_corpus
[6] https://github.com/fxsjy/jieba
[7] https://github.com/YueYongDev/stopwords
[8] https://bihua.51240.com/
[9] https://github.com/python-pillow/Pillow
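A minimal sketch of this glyph rendering step with Pillow is shown below. The font path is a placeholder (the paper does not name the CJK font it uses), and the simple top-left placement is an assumption for illustration.

```python
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "NotoSansCJK-Regular.ttc"  # placeholder: any font covering CJK characters

def render_glyph(char, size=28):
    """Render one character to a size x size 1-bit bitmap."""
    font = ImageFont.truetype(FONT_PATH, size)
    canvas = Image.new("L", (size, size), color=0)        # black grayscale canvas
    ImageDraw.Draw(canvas).text((0, 0), char, fill=255, font=font)
    return canvas.convert("1")                            # threshold to a 1-bit bitmap

glyph_bitmap = render_glyph("驾")  # 28 x 28 input image I_c for the glyph CNN channel
```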
4.2 Experimental Setup

We choose Adagrad (Duchi et al., 2011) as our optimization algorithm, and we set the batch size to 4,096 and the learning rate to 0.05. In practice, the sliding window size n of stroke n-grams is set as 3 ≤ n ≤ 6. The dimension of all word embeddings of the different models is consistently set to 300.

We use two test tasks to evaluate the performance of the different models: one is word similarity, and the other is word analogy. A word similarity test consists of multiple word pairs and similarity scores annotated by humans. Good word representations should make the calculated similarity have a high rank correlation with the human-annotated scores, which is usually measured by Spearman's correlation ρ (Zar, 1972).

The form of an analogy problem is like "king" : "queen" = "man" : "?", and "woman" is the most proper answer to "?". That is, in this task, given three words a, b and h, the goal is to infer the fourth word t which satisfies "a is to b as h is to t". We use the 3CosAdd (Mikolov et al., 2013) and 3CosMul (Levy and Goldberg, 2014) functions to calculate the most appropriate word t. Using the same data as in (Chen et al., 2015) and (Cao et al., 2018), we adopt two manually annotated datasets for the Chinese word similarity task, i.e., wordsim-240 and wordsim-296 (Jin and Wu, 2012), and a three-group[10] dataset for the Chinese word analogy task.

[10] Capitals of countries, states/provinces of cities (China), and family relations.
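For reference, here is a small sketch of the two analogy scoring functions applied to a row-indexed embedding matrix. The vocab dictionary and the ε smoothing constant follow the formulation of Levy and Goldberg (2014) rather than anything specific to this paper.

```python
import numpy as np

def cosines(emb, v):
    """Cosine similarity between every row of emb (V x D) and the vector v."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb_n @ (v / np.linalg.norm(v))

def analogy(emb, vocab, a, b, h, method="3CosAdd", eps=1e-3):
    """Answer 'a is to b as h is to ?' with 3CosAdd or 3CosMul."""
    ia, ib, ih = vocab[a], vocab[b], vocab[h]
    ca, cb, ch = cosines(emb, emb[ia]), cosines(emb, emb[ib]), cosines(emb, emb[ih])
    if method == "3CosAdd":
        scores = cb - ca + ch
    else:                                    # 3CosMul (Levy and Goldberg, 2014)
        ca, cb, ch = (ca + 1) / 2, (cb + 1) / 2, (ch + 1) / 2   # shift cosines to [0, 1]
        scores = cb * ch / (ca + eps)
    for i in (ia, ib, ih):                   # never return one of the question words
        scores[i] = -np.inf
    return max(vocab, key=lambda word: scores[vocab[word]])
```

The word similarity task can then be scored with scipy.stats.spearmanr between the model cosines and the human ratings.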

Table 1: Performance on the word similarity and word analogy tasks. The dimension of embeddings is set as 300. The evaluation metric is ρ for word similarity and accuracy percentage for word analogy.

                                    Word Similarity            Word Analogy (3CosAdd)       Word Analogy (3CosMul)
Model                               wordsim-240  wordsim-296   Capital   City     Family    Capital   City     Family
Skipgram (Mikolov et al., 2013)     0.5670       0.6023        0.7592    0.8800   0.3676    0.7637    0.8857   0.3529
CBOW (Mikolov et al., 2013)         0.5248       0.5736        0.6499    0.6171   0.3750    0.6219    0.5486   0.2904
GloVe (Pennington et al., 2014)     0.4981       0.4019        0.6219    0.7714   0.3167    0.5805    0.7257   0.2375
sisg (Bojanowski et al., 2017)      0.5592       0.5884        0.4978    0.7543   0.2610    0.5303    0.7829   0.2206
CWE (Chen et al., 2015)             0.5035       0.4322        0.1846    0.1714   0.1875    0.1713    0.1600   0.1583
GWE (Su and Lee, 2017)              0.5531       0.5507        0.5716    0.6629   0.2417    0.5761    0.6914   0.2333
JWE (Yu et al., 2017)               0.4734       0.5732        0.1285    0.3657   0.2708    0.1492    0.3771   0.2500
cw2vec (Cao et al., 2018)           0.5529       0.5992        0.5081    0.7086   0.2941    0.5465    0.7714   0.2721
DWE (ours)                          0.6105       0.6137        0.7120    0.7486   0.6250    0.6765    0.7257   0.6140

4.3 Baseline Methods

We use gensim[11] to implement both CBOW and Skipgram, and apply the source codes published by the authors to implement CWE[12], JWE[13], GWE[14] and GloVe[15]. Since Cao et al. (2018) did not publish their code, we follow their paper and reproduce cw2vec in mxnet[16], which we also use to implement sisg (Bojanowski et al., 2017)[17] and our DWE. To encourage further research, we will publish our model and datasets.

[11] https://radimrehurek.com/gensim/
[12] https://github.com/Leonard-Xu/CWE
[13] https://github.com/hkust-knowcomp/jwe
[14] https://github.com/ray1007/GWE
[15] https://github.com/stanfordnlp/GloVe
[16] https://mxnet.apache.org/
[17] http://gluon-nlp.mxnet.io/master/examples/word_embedding/word_embedding_training.html

4.4 Experimental Results

The experimental results are shown in Table 1. We can observe that our DWE model achieves the best results on both wordsim-240 and wordsim-296 in the similarity task, as expected because of the particularity of Chinese morphology, but it only improves the accuracy of the family group in the analogy task.

Actually, it is not by chance that we get these results, because DWE has the advantage of distinguishing between morphologically related words, which can be verified by the results of the similarity task.
Meanwhile, in the word analogy task, the words expressing family relations in Chinese are mostly compositional in their character glyphs. For example, in the analogy pair "兄弟" (brother) : "姐妹" (sister) = "儿子" (son) : "女儿" (daughter), we can easily find that "兄弟" and "儿子" share a common glyph component "儿" (male relative of a junior generation), while "姐妹" and "女儿" share a common glyph component "女" (female), and this kind of morphological pattern can be accurately captured by our model. However, most names of countries, capitals and cities are transliterated words, and the relationship between the morphology and semantics of these words is minimal, which is consistent with the findings reported in (Su and Lee, 2017). For instance, in the analogy pair "西班牙" (Spain) : "马德里" (Madrid) = "法国" (France) : "巴黎" (Paris), we cannot infer any relevance among these four words literally because they are all translated by pronunciation.

In summary, since different words that are morphologically similar tend to have similar semantics in Chinese, simultaneously modeling the sequential and spatial information of characters from both stroke n-grams and glyph features can indeed improve the modeling of Chinese word representations substantially.

5 Conclusions

In this article, we first analyzed the similarities and differences in terms of morphology between alphabetic languages and Chinese. Then, we delved deeper into the particularity of Chinese morphology and proposed our DWE model by taking into account the sequential information of strokes order and the spatial information of glyphs. Through the evaluation on two representative tasks, our model shows its superiority in capturing the morphological information of Chinese.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2018. cw2vec: Learning Chinese word embeddings with stroke n-gram information. In Thirty-Second AAAI Conference on Artificial Intelligence.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. Joint learning of character and word embeddings. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Ryan Cotterell and Hinrich Schütze. 2015. Morphological word-embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1287–1292.

Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. 2018. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering.

Chuanhai Dong, Jiajun Zhang, Chengqing Zong, Masanori Hattori, and Hui Di. 2016. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In Natural Language Understanding and Intelligent Applications, pages 239–250. Springer.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Peng Jin and Yunfang Wu. 2012. SemEval-2012 task 4: Evaluating Chinese word similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics and the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 374–377. Association for Computational Linguistics.

Stefan Knecht, Michael Deppe, Bianca Dräger, L. Bobe, Hubertus Lohmann, E.-B. Ringelstein, and H. Henningsen. 2000. Language lateralization in healthy right-handers. Brain, 123(1):74–81.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Eraldo Paulesu, Eamon McCrory, Ferruccio Fazio, Lorena Menoncello, Nicola Brunswick, Stefano F. Cappa, Maria Cotelli, Giuseppe Cossu, Francesca Corte, M. Lorusso, et al. 2000. A cultural effect on brain function. Nature Neuroscience, 3(1):91.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 141–150.

Haşim Sak, Murat Saraclar, and Tunga Güngör. 2010. Morphology-based and sub-word language modeling for Turkish speech recognition. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5402–5405. IEEE.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637.

Jane A. Springer, Jeffrey R. Binder, Thomas A. Hammeke, Sara J. Swanson, Julie A. Frost, Patrick S. F. Bellgowan, Cameron C. Brewer, Holly M. Perry, George L. Morris, and Wade M. Mueller. 1999. Language dominance in neurologically normal and epilepsy subjects: A functional MRI study. Brain, 122(11):2033–2046.

Tzu-Ray Su and Hung-Yi Lee. 2017. Learning Chinese word representations from glyphs of characters. arXiv preprint arXiv:1708.04755.

Yaming Sun, Lei Lin, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2014. Radical-enhanced Chinese character embedding. In International Conference on Neural Information Processing, pages 279–286. Springer.

Li Hai Tan, John A. Spinks, Jia-Hong Gao, Ho-Ling Liu, Charles A. Perfetti, Jinhu Xiong, Kathryn A. Stofer, Yonglin Pu, Yijun Liu, and Peter T. Fox. 2000. Brain activation in the processing of Chinese characters and words: A functional MRI study. Human Brain Mapping, 10(1):16–27.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. arXiv preprint arXiv:1607.02789.
Wei Wu, Yuxian Meng, Qinghong Han, Muyu Li, Xiaoya Li, Jie Mei, Ping Nie, Xiaofei Sun, and Jiwei Li. 2019. Glyce: Glyph-vectors for Chinese character representations. arXiv preprint arXiv:1901.10125.

Rongchao Yin, Quan Wang, Peng Li, Rui Li, and Bin Wang. 2016. Multi-granularity Chinese word embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 981–986.

Jinxing Yu, Xun Jian, Hao Xin, and Yangqiu Song. 2017. Joint embeddings of Chinese words, characters, and fine-grained subcharacter components. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 286–291.

Jerrold H. Zar. 1972. Significance testing of the Spearman rank correlation coefficient. Journal of the American Statistical Association, 67(339):578–580.