Multimodal Neural Pronunciation Modeling for Spoken Languages with Logographic Origin
Minh Nguyen, National University of Singapore
Gia H. Ngo, National University of Singapore
Nancy F. Chen, Institute for Infocomm Research, Singapore

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2916–2922, Brussels, Belgium, October 31 - November 4, 2018. © 2018 Association for Computational Linguistics

Abstract

Graphemes of most languages encode pronunciation, though some are more explicit than others. Languages like Spanish have a straightforward mapping between their graphemes and phonemes, while this mapping is more convoluted for languages like English. Spoken languages such as Cantonese present even more challenges in pronunciation modeling: (1) they do not have a standard written form, (2) the closest graphemic origins are logographic Han characters, only a subset of which implicitly encodes pronunciation. In this work, we propose a multimodal approach to predict the pronunciation of Cantonese logographic characters, using neural networks with a geometric representation of logographs and the pronunciation of cognates in historically related languages. The proposed framework improves performance by 18.1% and 25.0% relative to unimodal and multimodal baselines, respectively.

1 Introduction

In phonographic languages, there is a direct correspondence between graphemes and phonemes (Defrancis, 1996), though this correspondence is not always one-to-one. For example, in English, the word table corresponds to the pronunciation [ˈteɪ.bl], in which each alphabetic character corresponds to one phoneme, and the character e is mapped to silence. However, in logographic languages, the correspondence between graphemes and phonemes is more ambiguous (Defrancis, 1996), as only some sub-units in a grapheme are indicative of its phonemes. Korean[1], Vietnamese[2] and Chinese languages (e.g. Cantonese) are examples of logographic languages, all belonging to the Han logographic family. Similar to pronunciation modeling in phonographic languages, in which words are broken down into characters and modeling is done at the character level, pronunciation modeling in logographic languages requires decomposing logographs into sub-units and extracting only the sub-units carrying pronunciation hints. As the correspondence of Han logograph to phoneme is intricately complex, with many sub-rules and exceptions (Hashimoto, 1978), it is challenging to computationally model these correspondences using white-box approaches (e.g. graphical models). Instead, we exploit neural networks, as they (1) can flexibly model the implicit similarity of grapheme-phoneme relationships across languages with Han origin, and (2) can automatically learn the most relevant knowledge representation with minimal feature engineering (LeCun et al., 2015), such as extracting pronunciation hints from logographic representations.

[1] A large portion of Korean vocabulary is Sino-Korean, written in Hanja (Korean logographs) (Sohn, 2001).
[2] Traditional Vietnamese vocabulary comprises Sino-Vietnamese words written in Chinese logographs and locally invented Nom logographs (Alves, 1999).

Due to historical contact, there is much lexical overlap across Han logographic languages, as they borrowed words from one another (Rokuro, 1969; Miyake, 1997; Loveday, 1996; Sohn, 2001; Alves, 1999). As a result, cognates in different languages are written using identical graphemes but pronounced differently. For example, [she] in Mandarin and [sip] in Cantonese are cognates; their pronunciations are different yet they are written using the same logograph (懾), which represents “admire”. Though Han logographic languages are mutually unintelligible (Tang and Van Heuven, 2009; Handel, 2015), the correspondence of Han logographic graphemes to phonemes across languages is often similar in systematic ways (Cai et al., 2011; Frellesvig and Whitman, 2008; Miyake, 1997). The shared characteristics in the pronunciation of cognates could be leveraged in deciphering the pronunciation of Han logographs.

In this work, we propose a neural pronunciation model that exploits both embeddings of logographs and cognates’ phonemes. The proposed model significantly improves pronunciation prediction of logographs in Cantonese.

2 Related Work

The basic units in writing (graphemes) of Han logographic languages are logographs. A word contains one or more logographs and a logograph consists of one or more radicals. The pronunciation of a logograph corresponds to a syllable which has three phonemes: onset, nucleus and coda.

Grapheme-to-phoneme (G2P) approaches such as (Xu et al., 2004; Chen et al., 2016) predicted a Han logograph’s pronunciation from its local context in a phrase. This was similar to predicting a Latin word’s pronunciation from its surrounding words: these approaches essentially treated individual logographs as the basic units of the model and did not delve further into the logographic sub-units (the radicals).

While we are unaware of any work that derives features for pronunciation prediction from logographs, there is recent work on deriving representations of logographs for various semantic tasks. Some methods (Shi et al., 2015; Ke and Hagiwara, 2017; Nguyen et al., 2017; Zhuang et al., 2017) decomposed logographs into sub-units using expert-defined rules and then extracted the relevant semantic features. Other methods used convolutional neural networks to extract features from the images of logographs (Dai and Cai, 2017; Liu et al., 2017; Toyama et al., 2017). Other works combined multiple levels of information for feature extraction, using both logographs and sub-units obtained from logograph decomposition (Dong et al., 2016; Han et al., 2017; Peng et al., 2017; Yu et al., 2017; Yin et al., 2016).

In this work, we explicitly look at the relationship between a logograph’s constituent radicals and its pronunciation. Among Han logographs, 81% of frequently used logographs are semantic-phonetic compounds (Li and Kang, 1993), which consist of radicals that might contain phonetic or semantic hints (Hsiao and Shillcock, 2006). The pronunciation of a logograph could conceivably be predicted from the phonetic radicals. Furthermore, the relative position of radicals in the logograph might also offer clues about its pronunciation. Table 1 shows an example of such intricate relationships between a logograph’s pronunciation and its constituent radicals. All Han logographs in the table have a common phonetic radical (in red), which offers an inkling of the pronunciation of these logographs. For instance, logographs that have the phonetic radical on the left (剖 and 部) share a similar pronunciation in Korean (in blue), while logographs that have the phonetic radical on the right (陪, 賠, and 蓓) share a similar pronunciation in Mandarin, Cantonese and Vietnamese. Note that for each logograph, the pronunciations across the different languages share similarities: when the phonetic radical is on the left, the nucleus ends in a back vowel like u or o, whereas when the phonetic radical is on the right, the nucleus ends in a front vowel like i.

Position of 咅    left    left    right   right   right
Logograph         剖      部      陪      賠      蓓
Mandarin          pou     bu      pei     pei     bei
Cantonese         fau     bou     pui     pui     bui
Korean            pwu     pwu     pay     pay     pay
Vietnamese        phau    bo      boi     boi     bui

Table 1: The position of radicals affects pronunciations. All logographs share a common radical in red. Similar pronunciations for 剖 and 部 are bolded in blue. Similar pronunciations for 陪, 賠, and 蓓 are bolded in green. The pronunciations of a logograph in Mandarin, Cantonese, Korean and Vietnamese are represented by Pinyin, Jyutping, Yale, and Vietnamese alphabet symbols respectively.

The example in Table 1 explains the motivation for our proposed approach: predicting a logograph’s pronunciation by modelling both the constituent radicals and their geometric positions. Furthermore, the proposed approach can generalize to unseen logographs if the co-occurrence patterns of their constituent radicals have been learnt.

3 Model

We first describe a geometric decomposition of logographs and then different neural pronunciation models for logographs. Finally, we present a multimodal neural model that incorporates both logographic input and the cognates’ phonemes in predicting the pronunciation of logographs.

Representation of Han logographs

The majority of logographs (characters) in the Han logographic language family comprise a radical that indicates the logograph's nominal semantic category and a phonetic radical that gives an inkling of the pronunciation (Defrancis, 1996). Thus, patterns of co-occurrence of radicals across logographs might be exploited to find the phonetic radicals, which in turn can suggest the corresponding pronunciation of a logograph. Using this intuition, we model the

[Figure 1: tree forms and vector forms of three decompositions of 懾. Vector forms: A: ⿰ 忄 聶; B: ⿰ 忄 ⿱ 耳 聑; C: ⿰ 忄 ⿱ 耳 ⿰ 耳 耳]

Figure 1: Geometric representation of the logograph “admire”. A, B and C are equivalent decompositions of the same logograph but with different levels of granularity. The geometric representation comprises both the radicals and geometric operators, which can be used to reconstruct the original logograph.

The BoR is input to a multilayer perceptron (MLP) with three layers of size 750, 500, 250. L2 regularization of 1e-4 is applied to the hidden layers. The three dropout layers have dropout probabilities of 0.5, 0.5, and 0.2, respectively. As the output variables are categorical, cross-entropy loss was used.

We investigated two structures for predicting output phonemes (i.e. onset, nucleus, coda). In the first structure, output phonemes were predicted independently using the last hidden layer. The second structure made a sequential prediction: (1) the coda was first predicted using the last hidden layer, (2) the nucleus was predicted using both the last hidden layer and the predicted coda, and (3) the onset was predicted using the last hidden layer together with the predicted coda and nucleus. The second structure was motivated by a stronger dependency between the nucleus and coda.
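The sequential prediction structure described above can be illustrated with a small, dependency-free Python sketch of the forward pass. It is only an illustration under stated assumptions, not the authors' implementation: the class name SequentialPhonemePredictor, the ReLU activation, the random placeholder weights, and the choice to feed each predicted phoneme to the next output head as a one-hot vector are all hypothetical. Training (cross-entropy loss, L2 regularization, dropout) is omitted. Only the hidden layer sizes (750, 500, 250) and the coda, nucleus, onset ordering come from the text.

```python
import math
import random

HIDDEN_SIZES = (750, 500, 250)  # hidden layer sizes stated in the text


def init_layer(rng, n_in, n_out):
    """Random weights (n_out x n_in) and zero biases; stand-ins for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Affine map: one output value per weight row."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def argmax(x):
    return max(range(len(x)), key=x.__getitem__)


class SequentialPhonemePredictor:
    """Hypothetical sketch: predict coda, then nucleus, then onset."""

    def __init__(self, n_inputs, n_onsets, n_nuclei, n_codas,
                 hidden_sizes=HIDDEN_SIZES, seed=0):
        rng = random.Random(seed)
        sizes = (n_inputs,) + tuple(hidden_sizes)
        self.hidden = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
        last = sizes[-1]
        self.n_codas, self.n_nuclei = n_codas, n_nuclei
        # Each later head also sees the phonemes predicted before it.
        self.coda_head = init_layer(rng, last, n_codas)
        self.nucleus_head = init_layer(rng, last + n_codas, n_nuclei)
        self.onset_head = init_layer(rng, last + n_codas + n_nuclei, n_onsets)

    def predict(self, features):
        h = list(features)
        for layer in self.hidden:
            h = relu(dense(layer, h))  # ReLU is an assumption
        # (1) coda from the last hidden layer only
        coda = argmax(softmax(dense(self.coda_head, h)))
        coda_vec = one_hot(coda, self.n_codas)
        # (2) nucleus from the last hidden layer plus the predicted coda
        nucleus = argmax(softmax(dense(self.nucleus_head, h + coda_vec)))
        nucleus_vec = one_hot(nucleus, self.n_nuclei)
        # (3) onset from the last hidden layer plus predicted coda and nucleus
        onset = argmax(softmax(dense(self.onset_head, h + coda_vec + nucleus_vec)))
        return onset, nucleus, coda
```

In this sketch the first (independent) structure would correspond to three heads that each take only the last hidden layer as input. The sequential variant differs only in the wider inputs of the nucleus and onset heads; the phoneme inventory sizes passed to the constructor are placeholders to be set from the data.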