INTERSPEECH 2015

Phonology-Augmented Statistical Transliteration for Low-Resource Languages

Hoang Gia Ngo1, Nancy F. Chen2, Nguyen Binh Minh1, Bin Ma2, Haizhou Li2
1National University of Singapore, Singapore
2Institute for Infocomm Research, Singapore
[email protected], [email protected], [email protected], {mabin, hli}@i2r.a-star.edu.sg

Abstract

Transliteration converts words in a source language (e.g., English) into phonetically equivalent words in a target language (e.g., Vietnamese). This conversion needs to take into account the phonology of the target language, i.e., the rules determining how phonemes can be organized. For example, a transliterated word in Vietnamese that begins with a consonant cluster is phonologically invalid. While statistical transliteration approaches have been widely adopted, most do not explicitly model the target language's phonology, and thus produce invalid outputs. The problem is compounded for low-resource languages, where training data is scarce. In this work, we present a phonology-augmented statistical framework suitable for languages with minimal linguistic resources. We propose the concept of pseudo-syllables as structures representing how segments of a foreign word are arranged according to the target language's phonology. We use Vietnamese, a tonal language with monosyllabic structure, as an example. We show that the proposed system outperforms the statistical baseline by up to 70.3% relative when there are limited training examples (94 word pairs). We also investigate the trade-off between training corpus size and transliteration performance of different methods on two distinct corpora.

Index Terms: machine translation, under-resourced languages

1. Introduction

In every language, new words are constantly being invented or borrowed from foreign languages, especially in colloquial speech. For example, 'facebook' is now part of the vocabulary in many languages other than English. These new words present out-of-vocabulary (OOV) challenges to spoken language technologies such as automatic speech recognition [1], keyword search [2], and text-to-speech [3]. Transliteration is a mechanism for modeling OOV's adopted from foreign languages. Transliteration converts words written in one writing system (source language, e.g., English) into phonetically equivalent words in another writing system (target language, e.g., Vietnamese) [4], and is often used to translate foreign names of people, locations, organizations, and products. For example, the English word facebook can be transliterated to Vietnamese phonemes using X-SAMPA notation [5]: f @I _1 . b_< u k _2 ("." is a syllable delimiter).

Letter-to-sound tools employing statistical approaches are common solutions for OOV's [6][7]. Transliteration deals with two different language systems, with the input as letters from a source language and the output as phonemes of a target language. Transliteration outputs therefore need to comply with the phonological rules of the target language. Such phonological rules of OOV's adopted from foreign words are often difficult to model because, compared to OOV's originating from the native language, their numbers are smaller. This often results in non-interpretable outputs from statistical transliteration models [8]. The problem is compounded in low-resource languages such as Vietnamese [9]. Given the limited corpus size of such languages, the performance of statistical transliteration approaches is suboptimal. On the other hand, rule-based transliteration approaches have been shown to produce phonologically-valid outputs with little training resources [10]. However, rule-based approaches have their performance limited by the complexity of the predefined rules, and therefore under-perform with larger datasets, as compared to statistical methods [10].

We propose a transliteration framework in which statistical n-gram language modeling is augmented with phonological knowledge. We propose the concept of pseudo-syllables in statistical models to impose the phonological constraints of syllable structure in the target language, yet retain the acoustic authenticity of the source language as closely as possible. We assess the transliteration performance of the proposed framework against existing baseline approaches on low-resource languages, using Vietnamese as an example. We further investigate their performance with corpora of different dialects and various training data sizes. Our proposed framework integrates advantages of rule-based approaches on top of classical statistical transliteration models. The proposed approach ensures phonologically-valid outputs, while maintaining the strengths of statistical models (e.g., language independence; performance scales up with increases in training data size). We are working on applying the framework to other languages such as Cantonese and Mandarin.

2. Background

2.1. Phonology

We introduce two phonological concepts essential to transliteration.

i. Syllables: A syllable is considered the smallest phonological unit of a word [11], with the following structure [12]:

    [O] N [Cd] + [T]    (1)

where "[ ]" specifies an optional unit. O denotes the Onset, which is a consonant or a cluster of consonants at the beginning of a syllable. N denotes the Nucleus, which contains at least a vowel. Cd denotes the Coda, which mostly contains consonants. T denotes the lexical tone, a feature existing in many languages to distinguish different words. The syllabic structure above is shared across most languages [12]. However, how consonants (C) and vowels (V) constitute the Onset, Nucleus, and Coda differs across languages. For example, in English an Onset can be a consonant cluster, such as 'kl', while no consonant cluster can be the Onset of a syllable in Vietnamese [13][14].
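As an illustration of the [O] N [Cd] + [T] template, the sketch below checks syllable validity under two simplified Vietnamese-style constraints discussed in this paper (no consonant-cluster onset, mandatory lexical tone). The small consonant, vowel, and tone inventories are illustrative assumptions, not a complete Vietnamese phoneme set.

```python
# Minimal sketch of the [O] N [Cd] + [T] syllable check described above.
# The phoneme inventories and the "no cluster onset, tone required" rules
# are simplified illustrations, not a complete Vietnamese phonology.

CONSONANTS = {"k", "l", "h", "n", "s", "b_<", "f", "J"}
VOWELS = {"@:", "a", "a:", "a:I", "O", "u", "@I", "E"}
TONES = {"_1", "_2", "_3", "_6", "1", "2", "3", "6"}

def is_valid_vi_syllable(tokens):
    """Check [O] N [Cd] + T: optional single-consonant onset,
    a vowel nucleus, optional consonant coda, mandatory tone."""
    if not tokens or tokens[-1] not in TONES:
        return False          # Vietnamese syllables carry a lexical tone
    core = tokens[:-1]
    # optional onset: at most ONE consonant (no clusters in Vietnamese)
    if core and core[0] in CONSONANTS:
        core = core[1:]
    if core and core[0] in CONSONANTS:
        return False          # consonant-cluster onset -> invalid
    if not core or core[0] not in VOWELS:
        return False          # nucleus (vowel) is mandatory
    core = core[1:]
    # optional coda: a single consonant
    if core and core[0] in CONSONANTS:
        core = core[1:]
    return not core

print(is_valid_vi_syllable(["k", "@:", "3"]))            # valid
print(is_valid_vi_syllable(["k", "l", "E", "_1"]))       # cluster onset
print(is_valid_vi_syllable(["J", "a:", "n"]))            # no lexical tone
```

The two rejected examples mirror the invalid output k l E i _1 . J a: n discussed in Section 3.1.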

Copyright © 2015 ISCA. September 6-10, 2015, Dresden, Germany.

ii. Lexical Tones: In tonal languages, pitch is used to distinguish the meaning of words which are phonetically the same. This distinctive pitch level or contour is referred to as lexical tone [15]. For instance, there are 6 distinct lexical tones in Vietnamese [16], and 4 distinct lexical tones in Mandarin [17]. Each lexical tone is commonly encoded in phonetic representation with a number. For example, consider two different Vietnamese words: b_< O _3 (cow) and b_< O _6 (bug). The two words are represented in phonetic units using X-SAMPA notation [5], and have the same Onset (b_<) and Nucleus (O), but are distinguished by two different lexical tones (tone 3 and tone 6). Around 70% of languages are tonal [15], concentrated in Africa, East Asia, and Southeast Asia [18].

2.2. Baseline Model: Joint Source-Channel Model

The joint source-channel model for transliteration was introduced in [19]. Similar approaches have also been proposed for grapheme-to-phoneme conversion in [20]. The transliteration problem can be formulated as follows: given a string of graphemes f = (f1, f2, ..., fm) from a source language F, the objective is to convert this input string into a string of phonemes e = (e1, e2, ..., en) in a target language E. The string of phonemes is inferred as:

    e* = arg max_e p(e | f) = arg max_e p(e, f)    (2)

The transliteration process can be seen as finding the alignment of sub-sequences of the input string f with sub-sequences of the output string e [21]. Let there be K aligned transliteration pairs <x, y>_1, ..., <x, y>_K, where x takes on sub-sequences of e and y takes on sub-sequences of f. The joint probability p(e, f) is estimated through an N-gram model, where the k-th alignment pair <x, y>_k is dependent on its N predecessor pairs [19][20]:

    p(e, f) = ∏_{k=1}^{K} P(<x, y>_k | <x, y>_{k-N+1}, ..., <x, y>_{k-1})

3. Proposed Phonology-Augmented Statistical Framework

3.1. Motivation

Figure 1 shows how the English word "KLEINHANS" is transliterated into Vietnamese. Units 3, 7, 12 and 16 denote Vietnamese lexical tones. Units 4, 8 and 13 are delimiters between syllables. The arrows approximate the correspondence between the English graphemes and the Vietnamese phonemes.

[Figure 1: Standard statistical transliteration model (no explicit phonological knowledge). The source word K L E I N H A N S (units 1-9) maps to the Vietnamese phoneme sequence k @: 3 . l a:I 1 . h a n 1 . s @: 1 (units 1-16).]

Some observations made from this example are:
1. Lexical tones are added to the transliteration output (encoded by the numbers at positions 3, 7, 12 and 16).
2. The source word and its transliterated version can have different numbers of syllables: the original word has 2 syllables, while the transliterated output has 4.
3. There might not be an obvious correspondence between a segment in the source word and a target phoneme: either segment "EI" or "EIN" of the source word can correspond to the phoneme /a:I/ at position 6.
4. New phonemes can be added to the output to retain the phonological structure of the source language: a 'schwa' is added to imitate the pronunciation of the consonant cluster "KL", which does not have an equivalent in Vietnamese.

In existing statistical transliteration models, the phonological considerations listed above are implicitly modeled using n-gram language modeling of the source language graphemes [22], or of joint sequences of graphemes and phonemes [19][20]. Due to limited training data in low-resource languages, phonological structure in the transliteration output is not well modeled, resulting in a high rate of invalid syllables in the output [10]. For example, k l E i _1 . J a: n is another transliteration output in Vietnamese phonemes produced by a statistical model for the word "KLEINHANS" in Figure 1. The output is invalid in Vietnamese phonology, since the first syllable has a consonant cluster and the second syllable has no lexical tone. Empirically, at least 21% of transliterated entries lack lexical tones when we run Sequitur G2P [23] (a statistical system) on 100 Vietnamese transliteration pairs extracted from the OpenKWS13 corpus [24]. Previous rule-based approaches have been shown to improve transliteration performance by imposing phonological constraints on their outputs [10]. However, predefined rules are likely to make mistakes with words not observed in training data. Rule-based approaches are thus outperformed by statistical models on larger data sets [10].

Our proposed approach has the strengths of both the rule-based and statistical approaches, as it integrates the phonology of the target language explicitly with a statistical model. The proposed framework can thus generate transliterated outputs that are phonologically valid, yet improves its performance with more data. Figure 2 shows the three steps of the proposed approach: (1) pseudo-syllable formulation, (2) grapheme-to-phoneme mapping, and (3) lexical tone assignment.

[Figure 2: Phonology-augmented statistical framework. For the source word K L E I N H A N S: (1) pseudo-syllable formation yields K(O) @:(N) | L(O) EIN(N) | H(O) A(N) N(Cd) | S(O) @:(N); (2) grapheme-to-phoneme mapping yields k @: l a:I h a: n s @:; (3) tone setting yields k @: 3 . l a:I 1 . h a: n 1 . s @: 1. O: Onset, N: Nucleus, Cd: Coda, @: 'schwa'.]

3.2. Pseudo-Syllable Formulation

A pseudo-syllable is a representation of how segments of a foreign word are arranged according to the syllable structure specified by the target language's phonology. The concept is inspired by how native speakers process a foreign loan word by imposing native phonological constraints on the foreign word's form [25]. The structure of a pseudo-syllable is defined as s_i = {[O_i] N_i [Cd_i]}, where O_i, N_i, and Cd_i are the Onset, Nucleus, and Coda of the i-th pseudo-syllable, respectively. Onset, Nucleus, and Coda can be seen as sub-syllable roles assigned to different parts of a syllable.

Extra units representing the 'schwa' can be added when forming pseudo-syllables to maintain acoustic authenticity of the source language, such as consonant clusters and finals. This is a phenomenon known as vowel epenthesis, commonly observed among loanwords in languages such as Vietnamese [16], Japanese [25], and many more [26][27][28].
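To make the pseudo-syllable structure concrete, the sketch below groups the letters of a foreign word into [Onset] Nucleus [Coda] units with schwa ('@:') epenthesis. It uses a simple greedy left-to-right heuristic in place of the beam search described in Section 3.2, so its groupings can differ from the reference ones (e.g., it may split "EIN" into nucleus "EI" plus coda "N" where the reference treats "EIN" as the nucleus); the vowel-letter set is an assumption.

```python
# Illustrative sketch of pseudo-syllable formation with schwa epenthesis.
# A greedy left-to-right heuristic stands in for the beam search the paper
# uses; role names (O, N, Cd) follow the structure s_i = [O] N [Cd].

VOWEL_LETTERS = set("AEIOUY")

def pseudo_syllables(word):
    """Group letters of a foreign word into [Onset] Nucleus [Coda] units,
    inserting '@:' (schwa) as the nucleus after onset consonants that
    would otherwise form a cluster or end the word (epenthesis)."""
    sylls, i = [], 0
    letters = word.upper()
    while i < len(letters):
        syl = {}
        if letters[i] not in VOWEL_LETTERS:          # onset consonant
            syl["O"] = letters[i]
            i += 1
        if i < len(letters) and letters[i] in VOWEL_LETTERS:
            j = i
            while j < len(letters) and letters[j] in VOWEL_LETTERS:
                j += 1                               # vowel run -> nucleus
            nucleus = letters[i:j]
            # keep a following consonant as coda only if the letter after
            # it is also a consonant (very rough heuristic)
            if j < len(letters) - 1 and letters[j] not in VOWEL_LETTERS \
                    and letters[j + 1] not in VOWEL_LETTERS:
                syl["Cd"] = letters[j]
                j += 1
            syl["N"] = nucleus
            i = j
        else:
            syl["N"] = "@:"                          # epenthetic schwa
        sylls.append(syl)
    return sylls

for s in pseudo_syllables("KLEINHANS"):
    print(s)
```

On "KLEINHANS" this yields four pseudo-syllables, with epenthetic schwas after the cluster-initial "K" and the word-final "S", mirroring the epenthesis pattern described above.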

For example, given the foreign word f = "KLEINHANS", a possible sequence of pseudo-syllables is: s1 = {K, @:} with O1 = {K}, N1 = {@:}; s2 = {L, EIN} with O2 = {L}, N2 = {EIN}; s3 = {H, A, N} with O3 = {H}, N3 = {A}, Cd3 = {N}; s4 = {S, @:} with O4 = {S}, N4 = {@:}, where "@:" is the graphemic representation whose phonemic counterpart is the vowel schwa [@:].

i. Training: Given a foreign word f = [f1, f2, ..., fm] and its transliterated output e = [e1, e2, ..., en], let T_i be the function that assigns the letter f_i a sub-syllable role in a pseudo-syllable. For example, the function assigning the first letter of f to the onset of the second pseudo-syllable is T_1 = O_2. During training, a beam search is used to find the role assignment T for all letters of the source word such that the sequence of resulting pseudo-syllables has the same structures as those of the reference transliterated output. Using the word "KLEINHANS" as an example, the assignment function for all letters of the word is T = [ON1, O2, N2, N2, N2, O3, N3, Cd3, ON4], which results in 4 pseudo-syllables of structures [O, N], [O, N], [O, N, Cd], [O, N], matching the syllable structures in the reference Vietnamese output. Note that the role ON_i assigned to a consonant indicates that the consonant takes the Onset role of the i-th pseudo-syllable and a 'schwa' is added as the nucleus.

ii. Decoding: During decoding, the role assignment function T for all letters of f is estimated as:

    T* = arg max_T p(T | f)    (3)

We model T as a Markov chain of n-grams, with n an even number:

    T* = arg max_T ∏_{i=1}^{m} p(T_i | f_{i-n/2}, ..., f_{i+n/2})    (4)

iii. Smoothing: T_i in Eq. (4) is estimated from a weighted score of the probabilities of smoothed n-grams:

    q(T_i) = Σ_{l=0}^{n/2} Σ_{k=0}^{n/2} ω_{k,l} p(T_i | f'_{i-n/2}, ..., f'_{i+n/2})    (5)

where (f'_{i-n/2}, ..., f'_{i+n/2}) is a smoothed n-gram such that, given 0 ≤ k, l ≤ n/2 (n is an even number):

    f'_j = f_j  if i - n/2 + l ≤ j ≤ i + n/2 - k
    f'_j = C    otherwise, if f_j is a consonant
    f'_j = V    otherwise, if f_j is a vowel

For example, given the tri-gram of letters (B, E, S), its smoothed n-grams are {(B, E, S), (C, E, S), (C, E, C), ..., (C, V, C)}. ω_{k,l} is the weight corresponding to the smoothed n-gram, with ω_{k,l} > ω_{k',l'} for k + l < k' + l'; for example, ω_{1,0} (corresponding to (C, E, S)) > ω_{1,1} (corresponding to (C, E, C)).
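The consonant/vowel smoothing of Eq. (5) can be sketched as follows: letters outside the kept window [i - n/2 + l, i + n/2 - k] are backed off to their class C or V, and the rank k + l orders the weights ω_{k,l} (lower rank, more specific context, larger weight). The letter classes are an illustrative assumption.

```python
# Sketch of the smoothed n-gram expansion in Eq. (5): letters outside the
# window [i-n/2+l, i+n/2-k] are backed off to their class, C (consonant)
# or V (vowel). Ranks stand in for the trained weights.

VOWEL_LETTERS = set("AEIOUY")

def letter_class(ch):
    return "V" if ch in VOWEL_LETTERS else "C"

def smoothed_ngrams(gram, n):
    """Return (rank, smoothed_gram) pairs for all 0 <= k, l <= n/2.
    Lower rank (= k + l) means a more specific context."""
    i = n // 2                      # index of the center letter
    out = []
    for l in range(n // 2 + 1):
        for k in range(n // 2 + 1):
            sm = tuple(
                ch if (i - n // 2 + l) <= j <= (i + n // 2 - k)
                else letter_class(ch)
                for j, ch in enumerate(gram))
            out.append((k + l, sm))
    return out

for rank, sm in smoothed_ngrams(("B", "E", "S"), 2):
    print(rank, sm)
```

For the tri-gram (B, E, S) this produces (B, E, S) at rank 0, (C, E, S) and (B, E, C) at rank 1, and (C, E, C) at rank 2, matching the ordering of the weights ω_{k,l} described above.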

3.3. Grapheme-to-Phoneme Mapping

Transliteration takes into account (1) the graphemes of the source word, (2) how the word is pronounced in the source language, and (3) the phonological rules of the target language [16][25]. This relation can be formalized with the following equation:

    e* = arg max_e p(e | f, v, L_e)    (6)

where e is the phoneme sequence of the transliterated word in the target language, f and v are the graphemes and phonemes of the original word from the source language, and L_e is the phonological rules of the target language. This relationship is simplified in the proposed model by introducing the sub-syllable role r:

    e*_{i,r} = arg max_{e_{i,r}} p(e_{i,r} | f_i, v_i)    (7)

with e_{i,r} as the phoneme at position r ∈ {Onset, Nucleus, Coda} of the i-th syllable in the transliterated word. f_i and v_i are the orthographic and phonetic forms of the i-th pseudo-syllable. p(e_{i,r} | f_i, v_i) is learned from the training data prepared in Section 3.2. The pseudo-syllables formulated in Section 3.2 provide a simplified model of the phonological constraints L_e by modeling (1) valid syllable structures in the transliteration output, and (2) valid phonemes for each sub-syllable role r.

3.4. Lexical Tone Assignment

In most statistical transliteration models, lexical tones are treated the same as phonetic units of the transliterated output. Thus, lexical tone assignment depends on large amounts of training data to be correct [19][29]. Previous studies using tone assignment to complement the output of statistical models have shown improvements in transliteration performance [30][31]. Phonology studies show that in many tonal languages, the assignment of a lexical tone to a syllable is influenced by the phonemes in that syllable [16][32] and by the lexical tones assigned to adjacent syllables (tone sandhi) [33][34]. In this work, we attempt to use tonal and phonetic context to model tone assignment to a syllable, with no dependence on the target language.

Let t be the lexical tone assignment for a transliterated word, and t_i be the lexical tone assigned to the i-th syllable of the transliterated output:

    t* = arg max_t ∏_i p(t_i | t_{i-1}, (e_{i,O}, e_{i,N}, e_{i,Cd}), t_{i+1})    (8)

with (e_{i,O}, e_{i,N}, e_{i,Cd}) as the phonemes at the Onset, Nucleus, and Coda sub-syllable roles of the i-th syllable in the output. The training data used to model lexical tones is the reference transliterated output used in Sections 3.2 and 3.3.

4. Transliteration Experiments

4.1. Experimental Setup

i. Low-Resource Corpus: NIST OpenKWS13 Evaluation: The Vietnamese corpus is released by the IARPA Babel program [24] for the NIST OpenKWS13 Evaluation, with a setup similar to that in [10]; all experiments share the same test set of 140 words.

ii. Implementation Details: The joint source-channel model was implemented with Sequitur G2P [23]. A rule-based framework proposed in [10] was used as another reference. In our proposed model, we use a text-to-phoneme tool for American English [35] to generate the source language's phonemes in Section 3.3, because the majority of words in the corpus are of English origin.

iii. Evaluation Metrics:
1) Token error rate (TER): tokens include both phones and syllable delimiters.
2) Invalid syllable rate (ISR): a syllable is invalid if it does not exist among the syllables extracted from the lexicon.
3) String error rate (SER): any error within a string results in a string error.
TER and SER are computed using SCLITE [36].
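The evaluation metrics above can be sketched as follows. The paper computes TER and SER with SCLITE; this stand-in uses a plain token-level Levenshtein distance, and the syllable inventory passed to the ISR function is a hypothetical stand-in for syllables extracted from the lexicon.

```python
# Rough sketch of the Section 4.1 metrics: token error rate via
# Levenshtein distance over phone/delimiter tokens, and invalid syllable
# rate against a set of syllables extracted from a lexicon.
# (The paper computes TER/SER with SCLITE; this is a simplified stand-in.)

def edit_distance(ref, hyp):
    """Rolling-row dynamic-programming Levenshtein distance over tokens."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1]

def token_error_rate(refs, hyps):
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

def invalid_syllable_rate(hyps, lexicon_syllables):
    """A syllable (token span between '.' delimiters) is invalid if it is
    not among the syllables extracted from the lexicon."""
    total = invalid = 0
    for hyp in hyps:
        syl = []
        for tok in hyp + ["."]:           # sentinel delimiter
            if tok == ".":
                if syl:
                    total += 1
                    invalid += tuple(syl) not in lexicon_syllables
                    syl = []
            else:
                syl.append(tok)
    return invalid / total if total else 0.0
```

String error rate would follow the same pattern, counting a whole hypothesis as wrong if it differs from its reference in any token.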

[Figure 3: Low-resource scenario: performance of different transliteration models (Rule-based, Joint Source-channel, Proposed) as a function of corpus size (NIST OpenKWS13 dataset). Panels: (a) token error rate, (b) invalid syllable rate, (c) string error rate; x-axis: # words in the training & dev sets (80-580).]

[Figure 4: Performance of different transliteration models (Rule-based, Joint Source-channel, Proposed) as a function of training data size (HCMUS corpus [9]). Panels: (a) token error rate, (b) invalid syllable rate, (c) string error rate; x-axis: corpus size (0-4,000).]

4.2. Results

From Figures 3a, 3b, and 3c, we see that the proposed model consistently outperforms the statistical baseline on all three evaluation metrics, across all training set sizes.

We analyze in more detail the performance of the different models at different low-resource setups. Table 1 shows that the proposed system outperforms the baselines at the largest training data set (588 word pairs): it improves on the rule-based model by 8.59% relative in TER, and is on par in ISR and SER; it improves on the joint source-channel model by 22.46% relative in TER, by 70.3% relative in ISR, and by 10.67% relative in SER. Table 2 shows that the proposed system also outperforms the statistical baseline at the smallest training data set (94 word pairs): it improves on the joint source-channel model by 64.90% relative in TER, by 67.76% relative in ISR, and by 32.34% relative in SER.

The error rates also validate the phonological basis of our proposed approach. As the proposed model better captures syllabic structures and syllable boundaries, evident from the much lower ISR compared to the baseline statistical model in low-resource scenarios, the overall SER of the output is also lower.

Table 1: Transliteration performance at 588 training word pairs.

Model                      | TER (%) | ISR (%) | SER (%)
Joint source-channel model | 23.6    | 19.0    | 60.0
Rule-based                 | 19.8    | 13.6    | 52.9
Proposed model             | 18.1    | 5.63    | 52.9

Table 2: Transliteration performance at 94 training word pairs.*

Model                      | TER (%) | ISR (%) | SER (%)
Joint source-channel model | 66.1    | 46.0    | 97.1
Rule-based                 | 19.8    | 13.6    | 52.9
Proposed model             | 23.2    | 14.8    | 65.7

*Since the rule-based approach is not statistical in nature, its performance remains constant regardless of the training set size in Figures 3a-3c and Tables 1-2.

4.3. Further Analysis: Data Size vs. Performance Trade-off

To gain deeper insight into the effectiveness of the proposed approach in comparison with the baselines, we analyzed the corpus size and performance trade-off using a larger corpus (HCMUS corpus [9]). The setup is the same as in [10].

As shown in Figures 4a-4c, the proposed model performs better than the rule-based model in TER for all corpus sizes larger than 350 words. It is also better than the rule-based model in all three metrics for all corpus sizes larger than 1,000 words. The proposed algorithm performs consistently better than the joint source-channel model for corpus sizes up to 1,500 words, and is on par for corpus sizes larger than 1,500 words. In Figure 4b, the consistently low ISR across all data sets produced by our proposed model demonstrates the model's ability to infer the most common syllables. However, the less typical syllables tend to be missed, evident from the earlier plateauing of the proposed model's performance compared to the baseline statistical model in Figure 4c. As less typical syllables can still be modeled by the baseline joint source-channel model, which permits joint units longer than one sub-syllabic unit, the baseline model is better at reproducing such syllables during decoding.

5. Conclusion

By augmenting statistical n-gram language modeling with phonological knowledge, our proposed framework ensures that the transliteration outputs are phonologically valid, resulting in lower error rates compared with baseline approaches in low-resource scenarios. We are working on generalizing the phonological constraints from the syllable level to the word level in order to further improve transliteration accuracy. Furthermore, in this work, we used Vietnamese to demonstrate how often-neglected phonological constraints such as syllables and lexical tones can be incorporated into our proposed approach. Since such phonological constraints are shared across most languages, we are applying the proposed approach to more languages such as Cantonese and Mandarin.

6. References

[1] André Mansikkaniemi and Mikko Kurimo, "Unsupervised Vocabulary Adaptation for Morph-based Language Models," in Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, Stroudsburg, PA, USA, 2012, WLM '12, pp. 37-40, Association for Computational Linguistics.
[2] Nancy F. Chen, Sunil Sivadas, Boon Pang Lim, Hoang Gia Ngo, Haihua Xu, Van Tung Pham, Bin Ma, and Haizhou Li, "Strategies for Vietnamese Keyword Search," in ICASSP, 2014.
[3] Robert Eklund and Anders Lindström, "How To Handle "Foreign Sounds" in Swedish Text-to-Speech Conversion: Approaching the 'Xenophone' Problem," in Proc. of the International Conference on Spoken Language Processing, 1998.
[4] Kevin Knight and Jonathan Graehl, "Machine transliteration," Computational Linguistics, vol. 24, no. 4, pp. 599-612, Dec. 1998.
[5] John C. Wells, "Computer-coding the IPA: a proposed extension of SAMPA," http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm, 2001.
[6] Wei Gao, Kam-Fai Wong, and Wai Lam, "Phoneme-based transliteration of foreign names for OOV problem," in Proceedings of the First International Joint Conference on Natural Language Processing, Berlin, Heidelberg, 2005, IJCNLP '04, pp. 110-119, Springer-Verlag.
[7] Paola Virga and Sanjeev Khudanpur, "Transliteration of proper names in cross-language applications," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2003, SIGIR '03, pp. 365-366, ACM.
[8] Kyuchul Yoon and Chris Brew, "A linguistically motivated approach to grapheme-to-phoneme conversion for Korean," Computer Speech & Language, vol. 20, no. 4, pp. 357-381, 2006.
[9] Nam X. Cao, Nhut M. Pham, and Quan H. Vu, "Comparative Analysis of Transliteration Techniques Based on Statistical Machine Translation and Joint-sequence Model," in Proceedings of the 2010 Symposium on Information and Communication Technology, New York, NY, USA, 2010, SoICT '10, pp. 59-63, ACM.
[10] Hoang Gia Ngo, Nancy F. Chen, Sunil Sivadas, Bin Ma, and Haizhou Li, "A minimal-resource transliteration framework for Vietnamese," in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pp. 1410-1414.
[11] Peter Ladefoged and Keith Johnson, A Course in Phonetics, Cengage Learning, 2014.
[12] Brett Kessler and Rebecca Treiman, "Syllable structure and the distribution of phonemes in English syllables," Journal of Memory and Language, vol. 37, no. 3, pp. 295-311, 1997.
[13] A. C. Gimson, An Introduction to the Pronunciation of English, St. Martin's Press, New York, 1970.
[14] Dinh-Hoa Nguyen, "Vietnamese," in The World's Major Languages, B. Comrie, Ed., pp. 777-796, Oxford University Press, Oxford, 1990.
[15] M. Yip, Tone, Cambridge Textbooks in Linguistics, Cambridge University Press, 2002.
[16] Thi-Quynh-Hoa Hoang, "A phonological contrastive study of Vietnamese and English," MA Thesis, Texas Technological College, 1965.
[17] C. Julian Chen, Ramesh A. Gopinath, Michael D. Monkowski, Michael A. Picheny, and Katherine Shen, "New methods in continuous Mandarin speech recognition," in Proc. of Eurospeech, 1997.
[18] Ian Maddieson, Tone, Max Planck Institute for Evolutionary Anthropology, Leipzig, 2013.
[19] Haizhou Li, Min Zhang, and Jian Su, "A Joint Source-channel Model for Machine Transliteration," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 2004, ACL '04, Association for Computational Linguistics.
[20] Maximilian Bisani and Hermann Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434-451, 2008.
[21] Vladimir Pervouchine, Haizhou Li, and Bo Lin, "Transliteration alignment," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, 2009, pp. 136-144, Association for Computational Linguistics.
[22] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer, "The Mathematics of Statistical Machine Translation: Parameter Estimation," Computational Linguistics, vol. 19, no. 2, pp. 263-311, June 1993.
[23] Maximilian Bisani, "Sequitur G2P: A trainable Grapheme-to-Phoneme converter," http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html, 2011.
[24] NIST, "Open Keyword Search 2013 (OpenKWS13) Evaluation," http://www.nist.gov/itl/iad/mig/openkws13.cfm, 2013.
[25] Daniel Silverman, "Multiple scansions in loanword phonology: evidence from Cantonese," Phonology, vol. 9, 1992.
[26] Yvan Rose and Katherine Demuth, "Vowel epenthesis in loanword adaptation: Representational and phonetic considerations," Lingua, vol. 116, no. 7, pp. 1112-1139, 2006.
[27] C. Uffmann, Vowel Epenthesis in Loanword Adaptation, Linguistische Arbeiten, De Gruyter, 2007.
[28] Suyeon Yun, "Perceptual similarity and epenthesis positioning in loan adaptation," in Chicago Linguistic Society, 2012, vol. 48.
[29] Paola Virga and Sanjeev Khudanpur, "Transliteration of proper names in cross-lingual information retrieval," in Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition - Volume 15, Stroudsburg, PA, USA, 2003, MultiNER '03, pp. 57-64, Association for Computational Linguistics.
[30] Oi Yee Kwong, "Homophones and Tonal Patterns in English-Chinese Transliteration," in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Stroudsburg, PA, USA, 2009, ACLShort '09, pp. 21-24, Association for Computational Linguistics.
[31] Yan Song, Chunyu Kit, and Hai Zhao, "Reranking with Multiple Features for Better Transliteration," in Proceedings of the 2010 Named Entities Workshop, Stroudsburg, PA, USA, 2010, NEWS '10, pp. 62-65, Association for Computational Linguistics.
[32] J. Setter, C. S. P. Wong, and B. H. S. Chan, Hong Kong English, Dialects of English, Edinburgh University Press, 2010.
[33] Kent A. Lee, Chinese Tone Sandhi and Prosody, Ph.D. thesis, University of Illinois at Urbana-Champaign, 1997.
[34] Takenobu Tokunaga, Dain Kaplan, Chu-Ren Huang, Shu-Kai Hsieh, Nicoletta Calzolari, Monica Monachini, Claudia Soria, Kiyoaki Shirai, Virach Sornlertlamvanich, Thatsanee Charoenporn, et al., "Adapting international standard for Asian language technologies," 2008.
[35] Kevin Lenzo, "t2p: Text-to-Phoneme Converter Builder," Carnegie Mellon University, http://www.cs.cmu.edu/afs/cs.cmu.edu/user/lenzo/html/areas/t2p, 1998.
[36] NIST, "Evaluation Tools," http://www.itl.nist.gov/iad/mig//tools/, 2007.
