Grapheme-to-Phoneme Models for (Almost) Any Language
Aliya Deri and Kevin Knight
Information Sciences Institute
Department of Computer Science
University of Southern California
{aderi, knight}@isi.edu

Abstract

Grapheme-to-phoneme (g2p) models are rarely available in low-resource languages, as the creation of training and evaluation data is expensive and time-consuming. We use Wiktionary to obtain more than 650k word-pronunciation pairs in more than 500 languages. We then develop phoneme and language distance metrics based on phonological and linguistic knowledge; applying those, we adapt g2p models for high-resource languages to create models for related low-resource languages. We provide results for models for 229 adapted languages.

lang   word      pronunciation
eng    anybody   e̞ n iː b ɒ d iː
pol    żołądka   z̻ o w o n̪ t̪ k a
ben    শক্ত      s̪ ɔ k t̪ ɔ
heb    חלומות    ʁ a l o m o t

Table 1: Examples of English, Polish, Bengali, and Hebrew pronunciation dictionary entries, with pronunciations represented with the International Phonetic Alphabet (IPA).

word    eng        deu        nld
gift    ɡ ɪ f tʰ   ɡ ɪ f t    ɣ ɪ f t
class   kʰ l æ s   k l aː s   k l ɑ s
send    s e̞ n d    z ɛ n t    s ɛ n t

Table 2: Example pronunciations of English words using English, German, and Dutch g2p models.

1 Introduction

Grapheme-to-phoneme (g2p) models convert words into pronunciations and are ubiquitous in speech- and text-processing systems. Due to the diversity of scripts, phoneme inventories, phonotactic constraints, and spelling conventions among the world's languages, they are typically language-specific. Thus, while most statistical g2p learning methods are language-agnostic, they are trained on language-specific data, namely a pronunciation dictionary consisting of word-pronunciation pairs, as in Table 1.

Building such a dictionary for a new language is both time-consuming and expensive, because it requires expertise in both the language and a notation system like the International Phonetic Alphabet, applied to thousands of word-pronunciation pairs. Unsurprisingly, resources have been allocated only to the most heavily researched languages. GlobalPhone, one of the most extensive multilingual text and speech databases, has pronunciation dictionaries in only 20 languages (Schultz et al., 2013).¹

¹ We have been unable to obtain this dataset.

For most of the world's more than 7,100 languages (Lewis et al., 2009), no data exists, and the many technologies enabled by g2p models are inaccessible.

Intuitively, however, pronouncing an unknown language should not necessarily require large amounts of language-specific knowledge or data. A native German or Dutch speaker, with no knowledge of English, can approximate the pronunciations of an English word, albeit with slightly different phonemes. Table 2 demonstrates that German and Dutch g2p models can do the same.

Motivated by this, we create and evaluate g2p models for low-resource languages by adapting existing g2p models for high-resource languages using linguistic and phonological information. To facilitate our experiments, we create several notable data resources, including a multilingual pronunciation dictionary with entries for more than 500 languages.

The contributions of this work are:
• Using data scraped from Wiktionary, we clean and normalize pronunciation dictionaries for 531 languages. To our knowledge, this is the most comprehensive multilingual pronunciation dictionary available.
• We synthesize several named entity corpora to create a multilingual corpus covering 384 languages.
• We develop a language-independent distance metric between IPA phonemes.
• We extend previous metrics for language-to-language distance with additional information and metrics.
• We create two sets of g2p models for "high-resource" languages: 97 simple rule-based models extracted from Wikipedia's "IPA Help" pages, and 85 data-driven models built from Wiktionary data.
• We develop methods for adapting these g2p models to related languages, and describe results for 229 adapted models.
• We release all data and models.

2 Related Work

Because of the severe lack of multilingual pronunciation dictionaries and g2p models, different methods of rapid resource generation have been proposed.

Schultz (2009) reduces the amount of expertise needed to build a pronunciation dictionary by providing a native speaker with an intuitive rule-generation user interface. Schlippe et al. (2010) crawl web resources like Wiktionary for word-pronunciation pairs. More recently, attempts have been made to automatically extract pronunciation dictionaries directly from audio data (Stahlberg et al., 2016). However, the requirement of a native speaker, web resources, or audio data specific to the language still blocks development, and the number of g2p resources remains very low. Our method avoids these issues by relying only on text data from high-resource languages.

Instead of generating language-specific resources, we are inspired by research on cross-lingual automatic speech recognition (ASR) by Vu and Schultz (2013) and Vu et al. (2014), who exploit linguistic and phonetic relationships in low-resource scenarios. Although these works focus on ASR instead of g2p models and rely on audio data, they demonstrate that speech technology is portable across related languages.

3 Method

Given a low-resource language l without g2p rules or training data, we adapt resources (either an existing g2p model or a pronunciation dictionary) from a high-resource language h to create a g2p model for l. We assume the existence of two modules: a phoneme-to-phoneme distance metric phon2phon, which allows us to map from the phonemes used by h to the phonemes used by l, and a closest-language module lang2lang, which provides us with a related language h.

Figure 1: Strategies for adapting existing language resources through output mapping (a) and training data mapping (b).

Using these resources, we adapt resources from h to l in two different ways (see the sketch after this list):
• Output mapping (Figure 1a): We use g2p_h to pronounce word_l, then map the output to the phonemes used by l with phon2phon.
• Training data mapping (Figure 1b): We use phon2phon to map the pronunciations in h's pronunciation dictionary to the phonemes used by l, then train a g2p model using the adapted data.
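The following is a minimal sketch of these two strategies, assuming a callable high-resource model g2p_h, a phoneme mapping table precomputed with phon2phon, and a generic trainer train_g2p; these names and interfaces are illustrative stand-ins for exposition, not the authors' released implementation.

    # Minimal sketch of the two adaptation strategies (Section 3).
    # Assumed (illustrative) interfaces, not the paper's actual code:
    #   g2p_h(word) -> list of IPA phoneme strings (high-resource model)
    #   mapping     -> dict sending each phoneme of h to its closest
    #                  phoneme of l, precomputed with phon2phon
    #   train_g2p   -> any standard g2p trainer over word-pron pairs

    def output_mapping(word_l, g2p_h, mapping):
        """Strategy (a): pronounce the word with g2p_h, then map each
        output phoneme into l's phoneme inventory."""
        pron_h = g2p_h(word_l)
        return [mapping[p] for p in pron_h]

    def training_data_mapping(dictionary_h, mapping, train_g2p):
        """Strategy (b): map the pronunciations in h's dictionary into
        l's inventory, then train a g2p model on the adapted data."""
        adapted = [(word, [mapping[p] for p in pron])
                   for word, pron in dictionary_h]
        return train_g2p(adapted)  # a model usable on l's words directly

Note the trade-off the two strategies embody: (a) reuses h's model unchanged and post-processes its output, while (b) pays a one-time training cost to obtain a model that targets l's phonemes natively.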
The next sections describe how we collect data, create phoneme-to-phoneme and language-to-language distance metrics, and build high-resource g2p models.

4 Data

This section describes our data sources, which are summarized in Table 3.

Phoible                   Wiki IPA Help tables     Wiktionary
1674 languages            97 languages             531 languages
2155 lang. inventories    24 scripts               49 scripts
2182 phonemes             1753 graph. segments     658k word-pron pairs
37 features               1534 phon. segments
                          3389 unique g-p rules

Wiktionary train          Wiktionary test          NE data
85 languages              501 languages            384 languages
42 scripts                45 scripts               36 scripts
629k word-pron pairs      26k word-pron pairs      9.9m NEs

Table 3: Summary of data resources obtained from Phoible, named entity resources, Wikipedia IPA Help tables, and Wiktionary. Note that, although our Wiktionary data technically covers over 500 languages, fewer than 100 include more than 250 entries (Wiktionary train).

4.1 Phoible

Phoible (Moran et al., 2014) is an online repository of cross-lingual phonological data. We use two of its components: language phoneme inventories and phonetic features.

4.1.1 Phoneme inventories

A phoneme inventory is the set of phonemes used to pronounce a language, represented in IPA. Phoible provides 2155 phoneme inventories for 1674 languages. (Some languages have multiple inventories from different linguistic studies.)

4.1.2 Phoneme feature vectors

For each phoneme included in its phoneme inventories, Phoible provides information about 37 phonological features, such as whether the phoneme is nasal, consonantal, sonorant, or a tone. Each phoneme thus maps to a unique feature vector, with features expressed as +, -, or 0.

4.2 Named Entity Resources

For our language-to-language distance metric, it is useful to have written text in many languages. The most easily accessible source of this data is multilingual named entity (NE) resources.

We synthesize 7 different NE corpora: Chinese-English names (Ji et al., 2009), Geonames (Vatant and Wick, 2006), JRC names (Steinberger et al., 2011), corpora from LDC, NEWS 2015 (Banchs et al., 2015), Wikipedia names (Irvine et al., 2010), and Wikipedia titles (Lin et al., 2011); to this, we also add multilingual Wikipedia titles with English translation, named entity type, and script information where possible.

4.3 Wikipedia IPA Help tables

To explain different languages' phonetic notations, Wikipedia users have created "IPA Help" pages, which provide tables of simple grapheme examples of a language's phonemes. For example, on the English page, the phoneme z has the examples "zoo" and "has." We automatically scrape these tables for 97 languages to create simple grapheme-phoneme rules.

Using the phon2phon distance metric and mapping technique described in Section 5, we clean each table by mapping its IPA phonemes to the language's Phoible phoneme inventory, if it exists. If it does not exist, we map the phonemes to valid Phoible phonemes and create a phoneme inventory for that language.

4.4 Wiktionary pronunciation dictionaries

Ironically, to train data-driven g2p models for high-resource languages, and to evaluate our low-resource g2p models, we require pronunciation dictionaries for many languages. A common and successful technique for obtaining this data (Schlippe et al., 2010; Schlippe et al., 2012a; Yao and Kondrak, 2015) is scraping Wiktionary, an open-source multilingual dictionary maintained by Wikimedia.
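To illustrate how Phoible's feature vectors (Section 4.1.2) can drive the inventory mapping used to clean the IPA Help tables (Section 4.3), here is a toy sketch. The paper's actual phon2phon metric is developed in Section 5; the plain Hamming distance, the five-feature excerpt, and all names below are illustrative assumptions only.

    # Toy illustration of snapping a scraped phoneme onto a language's
    # Phoible inventory. The real phon2phon metric is defined in
    # Section 5; here we assume a plain Hamming distance over features.

    # Hypothetical excerpt of Phoible-style features (the full table
    # has 37 features per phoneme); values are '+', '-', or '0'.
    # Columns: consonantal, sonorant, nasal, voiced, continuant.
    FEATURES = {
        "z": ("+", "-", "-", "+", "+"),
        "s": ("+", "-", "-", "-", "+"),
        "n": ("+", "+", "+", "+", "-"),
    }

    def feature_distance(p1, p2):
        """Count the features on which two phonemes' vectors disagree."""
        return sum(a != b for a, b in zip(FEATURES[p1], FEATURES[p2]))

    def map_to_inventory(phoneme, inventory):
        """Map a phoneme to the closest member of a Phoible inventory,
        keeping it unchanged if it is already a member."""
        if phoneme in inventory:
            return phoneme
        return min(inventory, key=lambda q: feature_distance(phoneme, q))

    # Example: a language whose inventory lacks /z/ gets /s/ instead,
    # since the two differ only in voicing.
    assert map_to_inventory("z", {"s", "n"}) == "s"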