
Multimodal neural pronunciation modeling for spoken languages with logographic origin

Minh Nguyen (National University of Singapore), Gia H. Ngo (National University of Singapore), Nancy F. Chen (Institute for Infocomm Research, Singapore)
[email protected], [email protected], [email protected]

arXiv:1809.04203v1 [cs.CL] 12 Sep 2018

Abstract

Graphemes of most languages encode pronunciation, though some are more explicit than others. Languages like Spanish have a straightforward mapping between their graphemes and phonemes, while this mapping is more convoluted for languages like English. Spoken languages such as Cantonese present even more challenges in pronunciation modeling: (1) they do not have a standard written form, and (2) the closest graphemic origins are logographic Han characters, of which only a subset implicitly encodes pronunciation. In this work, we propose a multimodal approach to predict the pronunciation of Cantonese logographic characters, using neural networks with a geometric representation of logographs and the pronunciation of cognates in historically related languages. The proposed framework improves performance by 18.1% and 25.0% relative to unimodal and multimodal baselines, respectively.

1 Introduction

In phonographic languages, there is a direct correspondence between graphemes and phonemes (Defrancis, 1996), though this correspondence is not always one-to-one. For example, in English, the word "table" corresponds to the pronunciation [ˈteɪ.bl], in which each alphabetic character corresponds to one phoneme, and the character "e" is mapped to silence. However, in logographic languages, the correspondence between graphemes and phonemes is more ambiguous (Defrancis, 1996), as only some sub-units in a grapheme are indicative of its phonemes. Korean¹, Vietnamese² and Chinese languages (e.g. Cantonese) are examples of logographic languages, all belonging to the Han logographic family. Similar to pronunciation modeling in phonographic languages, in which words are broken down into characters and modeling is done at the character level, pronunciation modeling in logographic languages requires decomposing logographs into sub-units and extracting only the sub-units carrying pronunciation hints. As the correspondence of Han logographs to phonemes is intricately complex, with many sub-rules and exceptions (Hashimoto, 1978), it is challenging to model these correspondences computationally using white-box approaches (e.g. graphical models). Instead, we exploit neural networks, as they (1) can flexibly model the implicit similarity of grapheme-phoneme relationships across languages with Han origin, and (2) can automatically learn the most relevant knowledge representation with minimal feature engineering (LeCun et al., 2015), such as extracting pronunciation hints from logographic representations.

Due to historical contact, there is much lexical overlap across Han logographic languages, as they borrowed words from one another (Rokuro, 1969; Miyake, 1997; Loveday, 1996; Sohn, 2001; Alves, 1999). As a result, cognates in different languages are written using identical graphemes but pronounced differently. For example, [she] in Mandarin and [sip] in Cantonese are cognates; their pronunciations are different, yet they are written using the same logograph (懾), which represents "admire". Though Han logographic languages are mutually unintelligible (Tang and Van Heuven, 2009; Handel, 2015), the correspondence of Han logographic graphemes to phonemes across languages is often similar in systematic ways (Cai et al., 2011; Frellesvig and Whitman, 2008; Miyake, 1997). The shared characteristics in the pronunciation of cognates could be leveraged in deciphering the pronunciation of Han logographs.

¹A large portion of Korean vocabulary is Sino-Korean, written in Hanja (Korean logographs) (Sohn, 2001).
²Traditional Vietnamese vocabulary comprises Sino-Vietnamese words written with Chinese logographs and locally-invented Nom logographs (Alves, 1999).
In this work, we propose a neural pronunciation model that exploits both embeddings of logographs and cognates' phonemes. The proposed model significantly improves pronunciation prediction of logographs in Cantonese.

2 Related Work

The basic units in writing (graphemes) of Han logographic languages are logographs. A word contains one or more logographs, and a logograph consists of one or more radicals. The pronunciation of a logograph corresponds to a syllable, which has three phonemes: onset, nucleus and coda.

Grapheme-to-phoneme (G2P) approaches such as (Xu et al., 2004; Chen et al., 2016) predicted a Han logograph's pronunciation from its local context in a phrase. This was similar to predicting a Latin word's pronunciation from its surrounding words: it essentially treated individual logographs as the basic units of the model and did not delve further into the logographic sub-units (the radicals).

While we are unaware of any work that derives features for pronunciation prediction from logographs, there is recent work on deriving representations of logographs for various semantic tasks. Some methods (Shi et al., 2015; Ke and Hagiwara, 2017; Nguyen et al., 2017; Zhuang et al., 2017) decomposed logographs into sub-units using expert-defined rules and then extracted the relevant semantic features. Other methods used convolutional neural networks to extract features from images of logographs (Dai and Cai, 2017; Liu et al., 2017; Toyama et al., 2017). Other works combined multiple levels of information for feature extraction, using both the logograph and the sub-units obtained from logograph decomposition (Dong et al., 2016; Han et al., 2017; Peng et al., 2017; Yu et al., 2017; Yin et al., 2016).

In this work, we explicitly look at the relationship between a logograph's constituent radicals and its pronunciation. Among Han logographs, 81% of frequently used logographs are semantic-phonetic compounds (Li and Kang, 1993), which consist of radicals that might contain phonetic or semantic hints (Hsiao and Shillcock, 2006). The pronunciation of a logograph could conceivably be predicted from the phonetic radicals. Furthermore, the relative position of radicals in the logograph might also offer clues about its pronunciation. Table 1 shows an example of such intricate relationships between a logograph's pronunciation and its constituent radicals. All Han logographs in the table have a common phonetic radical (in red), which offers an inkling of the pronunciation of these logographs. For instance, logographs that have the phonetic radical on the left (剖 and 部) share a similar pronunciation in Korean (in blue), while logographs that have the phonetic radical on the right (陪, 賠, and 蓓) share a similar pronunciation in Mandarin, Cantonese and Vietnamese. Note that for each logograph, the pronunciations across the different languages share similarities: when the phonetic radical is on the left, the nucleus ends in a back vowel like u or o, whereas when the phonetic radical is on the right, the nucleus ends in a front vowel like i.

Position of 咅   left   left   right  right  right
Logograph       剖     部     陪     賠     蓓
Mandarin        pou    bu     pei    pei    bei
Cantonese       fau    bou    pui    pui    bui
Korean          pwu    pwu    pay    pay    pay
Vietnamese      phau   bo     boi    boi    bui

Table 1: The position of radicals affects pronunciations. All logographs share a common radical (in red). Similar pronunciations for 剖 and 部 are bolded in blue; similar pronunciations for 陪, 賠, and 蓓 are bolded in green. The pronunciations of a logograph in Mandarin, Cantonese, Korean and Vietnamese are represented by Pinyin, Jyutping, Yale, and Vietnamese alphabet symbols, respectively.

The example in Table 1 motivates our proposed approach of predicting a logograph's pronunciation by modeling both the constituent radicals and their geometric positions. Furthermore, the proposed approach can generalize to unseen logographs if the co-occurrence patterns of their constituent radicals have been learnt.

3 Model

We first describe a geometric decomposition of logographs and then different neural pronunciation models for logographs. Finally, we present a multimodal neural model that incorporates both the logographic input and the cognates' phonemes in predicting the pronunciation of logographs.
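As a concrete picture of the prediction task, the cognates in Table 1 can be viewed as training pairs mapping a logograph and its cognates' pronunciations to Cantonese phonemes. The following is a toy sketch (not the authors' code; data transcribed from Table 1):

```python
# Toy illustration: (logograph, cognate pronunciations) -> Cantonese syllable.
# Pronunciations are transcribed from Table 1.
TABLE_1 = {
    "剖": {"mandarin": "pou", "korean": "pwu", "vietnamese": "phau", "cantonese": "fau"},
    "部": {"mandarin": "bu",  "korean": "pwu", "vietnamese": "bo",   "cantonese": "bou"},
    "陪": {"mandarin": "pei", "korean": "pay", "vietnamese": "boi",  "cantonese": "pui"},
    "賠": {"mandarin": "pei", "korean": "pay", "vietnamese": "boi",  "cantonese": "pui"},
    "蓓": {"mandarin": "bei", "korean": "pay", "vietnamese": "bui",  "cantonese": "bui"},
}

def training_pair(logograph):
    """Split one Table 1 entry into (input features, Cantonese target)."""
    entry = TABLE_1[logograph]
    features = {lang: pron for lang, pron in entry.items() if lang != "cantonese"}
    return (logograph, features), entry["cantonese"]

(x, y) = training_pair("陪")
print(y)  # pui
```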
Representation of Han logographs

The majority of logographs (characters) in the Han logographic family comprise a semantic radical that indicates the logograph's nominal semantic category and a phonetic radical that gives an inkling of its pronunciation (Defrancis, 1996). Thus, patterns of co-occurrence of radicals across logographs might be exploited to find the phonetic radicals, which in turn can suggest the corresponding pronunciation of a logograph. Using this intuition, we model the pronunciation of logographs at the radical level.

We investigated two representations of the radicals in a logograph. In the first approach, a logograph is represented as a bag of its unordered constituent radicals (BoR), encoded as a vector of radical counts. The second approach uses a decomposition of radicals that retains the original geometric organization of the logograph. This geometric decomposition (GeoD) preserves important cues about the word's pronunciation in the relative positions of the radicals. For example, differentiating the left radical from the right radical in a left-right semantic-phonetic compound allows more effective extraction of pronunciation hints. In addition, radicals that should be interpreted together are spatially closer in the GeoD representation, making the knowledge representation easier to learn. Note that the GeoD representation is lossless, as the original logograph can be reconstructed perfectly from it (details in Appendix A). Figure 1 shows the geometric decomposition of the Han logograph "admire" at three levels of granularity.

Figure 1: Geometric representation of the logograph "admire" (懾). A, B and C are equivalent decompositions of the same logograph at different levels of granularity, with vector forms ⿰忄聶, ⿰忄⿱耳聑, and ⿰忄⿱耳⿰耳耳. The geometric representation comprises both the radicals and geometric operators, which can be used to reconstruct the original logograph. [Tree diagrams omitted.]
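The two representations can be sketched as follows (a toy illustration, not the paper's code; the radical vocabulary is the four-radical example from Figure 2, and the GeoD sequence for 懾 is decomposition C in Figure 1):

```python
# Sketch of the two logograph representations.
from collections import Counter

RADICAL_VOCAB = ["忄", "氵", "耳", "灬"]  # toy vocabulary from Figure 2

def bag_of_radicals(radicals):
    """BoR: unordered vector of radical counts over a fixed vocabulary."""
    counts = Counter(radicals)
    return [counts[r] for r in RADICAL_VOCAB]

# GeoD: prefix-order sequence of geometric operators and radicals.
# ⿰ = left-right arrangement, ⿱ = top-bottom arrangement.
geod_admire = ["⿰", "忄", "⿱", "耳", "⿰", "耳", "耳"]

# BoR keeps only the radical counts and discards the geometric operators.
bor_admire = bag_of_radicals([t for t in geod_admire if t in RADICAL_VOCAB])
print(bor_admire)  # [1, 0, 3, 0]
```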
Neural pronunciation prediction models

Figure 2 and Figure 3 show two neural pronunciation prediction models of logographs. In Figure 2, each logograph is treated as an unordered "bag of radicals" (BoR). For example, assuming the vocabulary of radicals in the whole dataset is [忄, 氵, 耳, 灬], the word 懾 ("admire", see Figure 1) is represented by the vector of counts [1, 0, 3, 0], corresponding to one radical 忄 and three radicals 耳. The BoR is input to a multilayer perceptron (MLP) with three hidden layers of size 750, 500 and 250. L2 regularization of 1e-4 is applied to the hidden layers. The three dropout layers have dropout probabilities of 0.5, 0.5 and 0.2, respectively. As the output variables are categorical, cross-entropy loss was used.

We investigated two structures for predicting the output phonemes (i.e. onset, nucleus, coda). In the first structure, the output phonemes were predicted independently using the last hidden layer. The second structure made a sequential prediction: (1) the coda was first predicted using the last hidden layer; (2) the nucleus was predicted using both the last hidden layer and the predicted coda; and (3) the onset was predicted using the last hidden layer together with the predicted coda and nucleus. The second structure was motivated by the stronger dependency between the nucleus and coda. For example, the nucleus and coda are often grouped together as a single unit (rime/final) in the syllabic structure of most languages (Kessler and Treiman, 2002). In our experiments, the sequential structure yielded lower error rates, so it is used in all neural network models.

Figure 2: Pronunciation model of logographs using a multilayer perceptron (MLP). FC: fully connected.

In Figure 3, each logograph is represented by its geometric decomposition (GeoD). For example, the logograph 懾 is represented by the sequence of radicals and geometric operators shown in Figure 1C. The neural prediction model consists of two LSTM layers with 256 memory cells each. Input and recurrent dropout (Gal and Ghahramani, 2016) of 0.2 and 0.5 are applied to the LSTM layers to prevent overfitting.

Figure 3: Neural pronunciation model with geometric decomposition of logographs.

Multimodal neural pronunciation model of logographs

In this section, we model the pronunciation of a logograph in the target language, Cantonese, using multimodal information from both the logograph and the phonemes of its cognates, as shown in Figure 4. Given a vocabulary of phonemes in the source languages related to Cantonese (Mandarin, Korean, Vietnamese), the cognates' phonemes are encoded as an indicator vector, with an element equal to 1 if the corresponding phoneme in the vocabulary appears in a cognate's pronunciation, and 0 otherwise.
The geometric decomposition (GeoD) of the logograph is fed to two LSTM layers. The output at the last time step is concatenated with the multilingual phonemic vector and used as input to a multilayer perceptron (MLP). The MLP and LSTM setups are the same as those in Figure 2 and Figure 3, respectively. Deep supervision (Szegedy et al., 2015) was applied by using the output of the LSTM to make an auxiliary prediction of the output phonemes. Note that the auxiliary prediction should be identical to the main prediction. While predicting the same target, the main prediction used both the cognate phonemes and the logograph, whereas the auxiliary prediction used only the logograph. This was to ensure that the features extracted from the logographs are useful for pronunciation prediction and are complementary to the features extracted from the multilingual phonemes.

4 Experiments

We investigate whether Cantonese phonemes can be predicted using Han logographs and the cognates' phonemes from Mandarin, Korean, and Vietnamese. The prediction outputs are Cantonese onsets, nuclei and codas. The experimental design is motivated by the nature of Han logographic languages. A Chinese logograph (character) is phonologically equivalent to a syllable in English, while the constituent radicals are analogous to alphabet letters (with far less phonetic information). While in most languages a syllable's pronunciation is influenced by neighboring syllables, most Han logographic languages are monosyllabic and a logograph's pronunciation is rarely affected by neighboring logographs. Therefore, pronunciation prediction at the logograph (character) level is more appropriate for Han logographs. We use string error rate (SER) and token error rate (TER) as evaluation metrics. A wrongly predicted phoneme (onset, nucleus or coda) is counted as one token error. A syllable containing token error(s) is counted as one string error. All the neural networks were trained using Adam (Kingma and Ba, 2014).

Data

The dataset is extracted from the UniHan database³, a pronunciation database of logographs from Han logographic languages, maintained by the Unicode Consortium. For each entry in the dataset, a logograph corresponds to phonemes in Cantonese, Mandarin, Korean and Vietnamese, represented by Jyutping⁴, Pinyin⁵, Yale⁶, and Vietnamese alphabet⁷ symbols, respectively. We randomly partition the dataset into two sets, with 80% for training and the other 20% for testing. Overall, there are 16,011 entries in the training set and 4,002 entries in the test set. 1000 entries of the training set are used as the development set for hyper-parameter fine-tuning.

In the test set, only 16% of logographs have pronunciations in all non-target languages, while 6% of logographs have no non-target language pronunciation. The availability of pronunciations in non-target languages differs from logograph to logograph. For example, some logographs have Mandarin and Korean pronunciations, while others only have Mandarin pronunciations.

³https://www.unicode.org/charts/unihan.html
⁴https://en.wikipedia.org/wiki/Jyutping
⁵https://en.wikipedia.org/wiki/Pinyin
⁶https://en.wikipedia.org/wiki/Yale
⁷https://en.wikipedia.org/wiki/Vietnamese_alphabet

Predicting pronunciation using logograph input

We compared the neural networks against a decision tree baseline. The decision tree baseline was implemented using scikit-learn (Pedregosa et al., 2011).
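A minimal sketch of such a decision-tree baseline over BoR count vectors is shown below (toy data and labels, not the paper's training set; the real model is trained on UniHan-derived features):

```python
# Decision-tree baseline over bag-of-radicals count vectors (toy data).
from sklearn.tree import DecisionTreeClassifier

# BoR count vectors over a toy radical vocabulary [忄, 氵, 耳, 灬].
X = [[1, 0, 3, 0],   # 懾-like input
     [0, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 2, 0]]
# One classifier per output phoneme; here, hypothetical onset labels.
y_onset = ["s", "h", "f", "n"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y_onset)
print(clf.predict([[1, 0, 3, 0]])[0])  # "s" for the 懾-like input
```

In the full setup, separate classifiers would be fitted for the onset, nucleus and coda targets.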
The input of the decision tree (DT) model is the BoR representation of the logograph, while the input of the neural networks can be either BoR or GeoD. The MLP network in Figure 2 uses BoR, while the LSTM in Figure 3 uses GeoD as input. All models output phonemes in Cantonese.

From Table 2, the neural network (MLP) outperforms the decision tree when using BoR input. Both the SER and TER of the MLP model are lower than those of the decision tree. The LSTM model using GeoD leads to the lowest SER and TER, suggesting the benefits of the relative positional information of radicals in predicting pronunciation. The trends of the onset, nucleus and coda error rates are similar to those of TER and SER. However, as the gap in error rate between MLP (BoR) and LSTM (GeoD) for TER and SER is quite small, using BoR instead of GeoD can be a good computation-accuracy trade-off.

Method         SER    TER    On.    Nu.    Cd.
DT (BoR)       63.8   39.8   50.7   45.7   22.9
MLP (BoR)      59.2   33.6   44.5   38.6   17.8
LSTM (GeoD)    58.4   32.6   43.3   37.4   17.1

Table 2: Prediction error rates of Cantonese phonemes by decision tree (DT), MLP and LSTM using only logographic input. Best results are in bold.

Figure 4: Multimodal neural pronunciation prediction model using logographs' geometric representation and cognates' phonemes.

Predicting pronunciation using multimodal input

The inputs of the models are logographs and cognate phonemes from Mandarin, Korean and Vietnamese. Table 3 shows that the proposed multimodal neural network exploits multimodal and geometric information effectively. The relative improvement reaches 18.2% and 33.3% for SER and TER, respectively. The last rows in Table 2 and Table 3 show that by combining the Korean, Mandarin and Vietnamese phoneme input with GeoD, the prediction performance improves by 54.1% relative in TER and by 65.5% relative in SER. Moreover, using solely logograph input resulted in a higher onset error (43.3%) than nucleus error (37.4%), while using both logographs and multilingual phonemes improves the onset error (23.5%) to below the nucleus error (24.6%). This suggests that logographs and phonemes of cognates provide complementary information about the pronunciation of a logograph, in this case most notably at the onset position. While logographs usually carry hints about phonemes at the nucleus and coda positions but not at the onset position, the multilingual phoneme input might carry hints about pronunciation at all three positions.

Method            SER    TER    On.    Nu.    Cd.
DT (BoR, ph)      44.0   24.8   29.8   29.9   14.7
MLP (BoR, ph)     38.5   19.6   23.4   24.8   10.5
LSTM (GeoD, ph)   37.2   18.6   22.6   23.4   9.8

Table 3: Prediction error rates of Cantonese phonemes by multimodal models; BoR: Bag of Radicals; GeoD: Geometric Decomposition; ph: phonemes. Best results are in bold.
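The SER and TER figures above follow the definitions in Section 4: a wrong phoneme is one token error, and a syllable with any token error is one string error. A small sketch of the computation (a hypothetical helper, not the authors' evaluation code):

```python
# Sketch of the SER/TER evaluation metrics over (onset, nucleus, coda) triples.
def error_rates(references, predictions):
    token_errors = string_errors = total_tokens = 0
    for ref, pred in zip(references, predictions):
        wrong = sum(r != p for r, p in zip(ref, pred))  # token errors in syllable
        token_errors += wrong
        string_errors += 1 if wrong else 0              # any error -> string error
        total_tokens += len(ref)
    ser = 100.0 * string_errors / len(references)
    ter = 100.0 * token_errors / total_tokens
    return ser, ter

refs  = [("s", "i", "p"), ("f", "au", "")]
preds = [("s", "i", "t"), ("f", "au", "")]
print(error_rates(refs, preds))  # 1 bad syllable of 2, 1 bad token of 6
```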

5 Discussion

We have empirically shown that the systematic yet tenuous correspondence between the pronunciations of cognates in Han logographic languages can be exploited for pronunciation modeling using neural networks. Moreover, combining logographs with cognate pronunciations further improves pronunciation prediction. These results could potentially be applied to speech processing tasks such as speech recognition, where the construction of pronunciation dictionaries is expert labor-intensive, especially for under-resourced spoken languages. For future work, recursive neural networks (Tai et al., 2015) could be used, as they are better suited to the hierarchical logographic decomposition. Besides, incorporating more detailed relationships between radicals (e.g. Zhuang et al. (2017)) could help improve the model. The proposed approaches can also be applied to other languages such as Min Nan or Hakka, which are spoken languages even less well-documented than Cantonese.

References
Mark J. Alves. 1999. What's so Chinese about Vietnamese? In Papers from the Ninth Annual Meeting of the Southeast Asian Linguistics Society, pages 221–242.

Zhenguang G. Cai, Martin J. Pickering, Hao Yan, and Holly P. Branigan. 2011. Lexical and syntactic representations in closely related languages: Evidence from Cantonese–Mandarin bilinguals. Journal of Memory and Language, 65(4):431–445.

Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2016. Acoustic data-driven pronunciation lexicon generation for logographic languages. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5350–5354. IEEE.

Falcon Z. Dai and Zheng Cai. 2017. Glyph-aware embedding of Chinese characters. EMNLP 2017, page 64.

John Defrancis. 1996. Graphemic indeterminacy in writing systems. Word, 47(3):365–377.

Chuanhai Dong, Jiajun Zhang, Chengqing Zong, Masanori Hattori, and Hui Di. 2016. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In Natural Language Understanding and Intelligent Applications, pages 239–250. Springer.

Bjarke Frellesvig and John Whitman. 2008. The Japanese-Korean vowel correspondences. Japanese/Korean Linguistics, 13:15–28.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.

He Han, Yang Xiaokun, Wu Lei, Yan Hua, Gao Zhimin, Feng Yi, and Townsend George. 2017. Dual long short-term memory networks for sub-character representation learning. arXiv preprint arXiv:1712.08841.

Zev Handel. 2015. The classification of Chinese: Sinitic (the Chinese language family). In The Oxford Handbook of Chinese Linguistics, pages 34–44. Oxford University Press.

Mantaro J. Hashimoto. 1978. Current developments in Sino-Vietnamese studies. Journal of Chinese Linguistics, pages 1–26.

Janet Hui-wen Hsiao and Richard Shillcock. 2006. Analysis of a Chinese phonetic compound database: Implications for orthographic processing. Journal of Psycholinguistic Research, 35(5):405–426.

Yuanzhi Ke and Masafumi Hagiwara. 2017. Radical-level ideograph encoder for RNN-based sentiment analysis of Chinese and Japanese. arXiv preprint arXiv:1708.03312.

Brett Kessler and Rebecca Treiman. 2002. Syllable structure and the distribution of phonemes in English syllables.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436.

Y. Li and J.S. Kang. 1993. Analysis of phonetics of the ideophonetic characters in Modern Chinese. Information Analysis of Usage of Characters in Modern Chinese, pages 84–98.

Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. 2017. Learning character-level compositionality with visual features. arXiv preprint arXiv:1704.04859.

Leo J. Loveday. 1996. Language Contact in Japan: A Sociolinguistic History. Clarendon Press.

Marc Hideo Miyake. 1997. Pre-Sino-Korean and Pre-Sino-Japanese: Reexamining an old problem from a modern perspective. Japanese/Korean Linguistics, 6:179–211.

Viet Nguyen, Julian Brooke, and Timothy Baldwin. 2017. Sub-character neural language modelling in Japanese. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 148–153.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Haiyun Peng, Erik Cambria, and Xiaomei Zou. 2017. Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level. In The 30th International FLAIRS Conference. Marco Island.

Kono Rokuro. 1969. The Chinese writing and its influence on the scripts of the neighbouring peoples with special reference to Korea and Japan. Memoirs of the Research Department of the Toyo Bunko (The Oriental Library), 27:117–123.

Xinlei Shi, Junjie Zhai, Xudong Yang, Zehua Xie, and Chao Liu. 2015. Radical embedding: Delving deeper to Chinese radicals. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 594–598.

Ho-Min Sohn. 2001. The Korean Language. Cambridge University Press.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Chaoju Tang and Vincent J. Van Heuven. 2009. Mutual intelligibility of Chinese dialects experimentally tested. Lingua, 119(5):709–732.

Yota Toyama, Makoto Miwa, and Yutaka Sasaki. 2017. Utilizing visual forms of Japanese characters for neural review classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 378–382.

Jun Xu, Guohong Fu, and Haizhou Li. 2004. Grapheme-to-phoneme conversion for Chinese text-to-speech. In Eighth International Conference on Spoken Language Processing.

Rongchao Yin, Quan Wang, Peng Li, Rui Li, and Bin Wang. 2016. Multi-granularity Chinese word embedding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 981–986.

Jinxing Yu, Xun Jian, Hao Xin, and Yangqiu Song. 2017. Joint embeddings of Chinese words, characters, and fine-grained subcharacter components. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 286–291.

Hang Zhuang, Chao Wang, Changlong Li, Qingfeng Wang, and Xuehai Zhou. 2017. Natural language processing service based on stroke-level convolutional networks for Chinese text classification. In Web Services (ICWS), 2017 IEEE International Conference on, pages 404–411. IEEE.

A The Ideographic Description Algorithm

The Ideographic Description algorithm, defined by the Unicode Consortium, describes a way to represent a grapheme by its components. All Han logographs (i.e. graphemes) can be recursively decomposed into smaller components that are themselves logographs. With IDS denoting a logograph, the Ideographic Description algorithm can be written as

IDS := IDS | BinaryOperator IDS IDS | TrinaryOperator IDS IDS IDS

This simply means that a logograph can be decomposed into one, two or three smaller logographs. The operators indicate the relative positions of the operands. Many logographs can be described in more than one way using this algorithm, as the logographs can themselves be broken down further.
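The grammar above can be decoded with a short recursive routine. The sketch below assumes that ⿰ and ⿱ are binary operators and ⿲ and ⿳ are ternary operators (a subset of the Unicode Ideographic Description Characters), and treats any other symbol as an atomic component:

```python
# Sketch of prefix-order IDS decoding into a nested tree.
BINARY = {"⿰", "⿱"}   # left-right, top-bottom
TERNARY = {"⿲", "⿳"}  # left-middle-right, top-middle-bottom

def parse_ids(tokens):
    """Consume one IDS expression from `tokens`, returning a nested tree."""
    head = tokens.pop(0)
    if head in BINARY:
        return (head, parse_ids(tokens), parse_ids(tokens))
    if head in TERNARY:
        return (head, parse_ids(tokens), parse_ids(tokens), parse_ids(tokens))
    return head  # atomic logograph / radical

# Decomposition C of 懾 from Figure S5: ⿰ 忄 ⿱ 耳 ⿰ 耳 耳
tree = parse_ids(list("⿰忄⿱耳⿰耳耳"))
print(tree)  # ('⿰', '忄', ('⿱', '耳', ('⿰', '耳', '耳')))
```

Because the operators are evaluated in prefix order, the sequence and the tree are interchangeable, which is why the sequence form is lossless.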

Figure S5: (reproduced from Figure 1) A, B and C are equivalent ideographic description sequences for the same logograph: ⿰忄聶, ⿰忄⿱耳聑, and ⿰忄⿱耳⿰耳耳. Each sequence can also be represented as a tree. [Tree diagrams omitted.]

Figure S5 shows three different ways the logograph for "admire" can be decomposed at different levels of granularity. The granularity depends on the set of basic logographs at which the algorithm terminates. As the algorithm is recursive, the decomposition of a logograph is a tree, or equivalently a sequence with the operators evaluated in prefix order. The sequence representation is lossless, as it preserves the relative geometric positions of the components; the logograph can be reconstructed perfectly from the sequence of components. Figure S5 also shows how the three sequences of components can be represented as three vectors of counts. Different from the sequence representation, representing the components as a vector of counts is lossy, as the geometric relationships between components are not preserved. Representing graphemes as sequences rather than as vectors may therefore lead to higher prediction accuracy if the positional information is useful for the task.
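The lossiness of the count-vector form can be demonstrated directly: two hypothetical arrangements of the same two radicals share one count vector but remain distinct as sequences (the radicals 口 and 木 here are illustrative only):

```python
# Why vectors of counts are lossy while sequences are not.
from collections import Counter

seq_left_right = ["⿰", "口", "木"]  # 口 beside 木
seq_top_bottom = ["⿱", "口", "木"]  # 口 above 木

def counts(seq, vocabulary=("口", "木")):
    c = Counter(t for t in seq if t in vocabulary)
    return tuple(c[v] for v in vocabulary)

# Identical count vectors, so the geometric relationship is lost...
print(counts(seq_left_right) == counts(seq_top_bottom))  # True
# ...whereas the sequences themselves remain distinguishable.
print(seq_left_right == seq_top_bottom)  # False
```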