Grapheme-to-Phoneme Conversion with a Multilingual Transformer Model
Omnia ElSaadany
Department of Informatics, University of Zurich, Switzerland
omnia.elsaadany@uzh.ch

Benjamin Suter
Department of Informatics, University of Zurich, Switzerland
[email protected]
Abstract

In this paper, we describe our three submissions to the SIGMORPHON 2020 Shared Task 1 on grapheme-to-phoneme conversion for 15 languages. We experimented with a single multilingual Transformer model. We observed that the multilingual model achieves results on par with our separately trained monolingual models and is even able to avoid a few of the errors made by the monolingual models.

1 Introduction

Grapheme-to-phoneme conversion is the task of predicting the phonemic representation of a given orthographic word, where a phoneme is the smallest unit of sound that can distinguish one word from another. In many languages, some phonemes have different realizations depending on their context; these variants are called allophones. While the task is about predicting phonemes and not allophones, most datasets (e.g., the datasets for Hungarian, Bulgarian, and Armenian) in fact also contain allophones. However, since the distribution of allophones conditioned on the context is learnable, this is not an issue.

The shared task training data covers 15 languages with diverse phonologies, ranging from tonal languages to languages with glottalized consonants, and written in eight different writing systems. The data comes from the English version of Wiktionary. Each training set contains 3600 words, and each development and test set contains 450 words. The official metrics for the task are Word Error Rate (WER) and Phoneme Error Rate (PER).

A multilingual approach to grapheme-to-phoneme conversion has been explored by Milde et al. (2017). They propose a sequence-to-sequence multilingual model that benefits from training on additional phonetic representations for the same language (which was not permitted in our shared task).

The Transformer (Vaswani et al., 2017) with its attention mechanism has been applied very successfully to machine translation, and it has also been used for grapheme-to-phoneme conversion: Yolchuyeva et al. (2019) suggested a Transformer-based approach for grapheme-to-phoneme conversion, and Yu et al. (2020) proposed a multilingual Transformer model for languages with different writing systems by employing a byte-level input representation.

In our submission to the shared task, we explore the performance of a multilingual Transformer model with an augmented input representation, which can transduce a word from any language present in the training data into its IPA representation.

2 Linguistic Background

2.1 IPA

The phonemic representation in this task uses the International Phonetic Alphabet (IPA). Interestingly, the IPA has a notable weakness: it lacks an "orthography". This might seem surprising, given that the IPA aims at representing the pronunciation of words with more rigor than typical orthographies. However, different levels of depth of analysis are possible with the IPA, and this makes inconsistent use of symbols among annotators unavoidable. To give an example, Bulgarian exhibits a voiceless coronal plosive /t/~/t̪/, which is articulated as a dental plosive. Somewhat arbitrarily, the IPA provides an atomic symbol for the voiceless alveolar plosive (/t/), but only a composed symbol for the voiceless dental plosive (/t̪/). In principle, /t̪/ would be the correct representation for the phoneme in question, but since there is no phonemic contrast between dental and alveolar articulation in Bulgarian, a simple /t/ suffices to represent the voiceless coronal plosive phoneme. Hence, as is to be expected, the phoneme is not transcribed consistently in the training data: /t̪/ is used 1588 times, while /t/ is used 681 times. Similar issues are found frequently for other phonemes and for other languages.

2.2 Languages

In our monolingual baseline models, trained with the Transformer baseline published by the task organizers, the WER (PER) ranged from only 3.78 (0.66) for Hungarian up to 40.00 (16.38) for Korean. Seeing these huge differences in performance, it seemed worth analyzing the difficulties faced by the model for the three languages with the worst WER, viz. Korean (40.00), Bulgarian (30.67), and Georgian (28.44).

2.2.1 Georgian

We were particularly surprised to see Georgian among the seemingly most difficult languages. Georgian has a fully phonemic alphabet: each character represents exactly one phoneme, and each phoneme is represented by exactly one character (Hewitt, 1995). Grapheme-to-phoneme conversion (and phoneme-to-grapheme conversion) for Georgian is thus a trivial task and can in principle be done with 100% accuracy using a simple 1-to-1 look-up table.

We actually implemented this look-up table, and it allowed us to identify and quantify the issues in the Georgian dataset. We found that there are three phonemes that are each inconsistently represented by two IPA symbols (distributed roughly 50/50): i~ɪ, x~χ, and ɣ~ʁ. The difference between these symbols is neither phonemic nor allophonic. Rather, it is caused by different annotators using different representations for a given phoneme, in line with the orthographic weakness of the IPA outlined in Section 2.1.

We reported these data inconsistencies (https://github.com/sigmorphon/2020/issues/8) and prepared a consistent dataset produced with our look-up table. Together with the organizers, we planned to update the Georgian data directly on Wiktionary and then re-retrieve the training data from there. Unfortunately, bulk uploading to Wiktionary is not trivial, and it was not possible for us to update the data before the task deadline. For the current task, this means that the WER for Georgian cannot be substantially reduced due to these inconsistencies.
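The look-up table itself is not reproduced in the paper; a minimal sketch of such a 1-to-1 conversion in Python could look as follows. The handful of grapheme-to-IPA mappings shown, and the choice of IPA symbols (e.g., /x/ rather than /χ/), are illustrative assumptions rather than values taken from the shared task data.

```python
# Minimal sketch of a 1-to-1 Georgian grapheme-to-phoneme look-up table.
# Only a few illustrative mappings are shown; the IPA symbols chosen here
# are assumptions about the transcription convention, not dataset values.
GEORGIAN_TO_IPA = {
    "ა": "a", "ბ": "b", "გ": "ɡ", "დ": "d", "ე": "e",
    "ვ": "v", "თ": "tʰ", "ი": "i", "კ": "kʼ", "ლ": "l",
    "მ": "m", "ნ": "n", "ო": "o", "რ": "r", "ს": "s",
    "უ": "u", "ქ": "kʰ", "ხ": "x",
    # ... remaining letters of the Georgian alphabet
}

def georgian_g2p(word: str) -> str:
    """Convert a Georgian word to a space-separated IPA phoneme sequence."""
    return " ".join(GEORGIAN_TO_IPA[ch] for ch in word)
```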
2.2.2 Bulgarian

Bulgarian exhibits vowel reduction in unstressed syllables (similar phenomena are found, for instance, in English, German, and Russian), which leads to many allophones for vowels in unstressed positions (Leafgren, 2020). These allophones should not be present in a purely phonemic transcription; however, they are present in the given training set. Furthermore, the pronunciation of a vowel in Bulgarian depends on the position of stress, yet Bulgarian word stress can fall on any syllable and is not completely predictable. We experimented with a self-written tool that predicts the stress position in Bulgarian based on heuristics, but the WER could only be decreased marginally using a stress-annotated training set, which is why we abandoned this approach. Issues similar to the ones discussed above for Georgian are also present in the Bulgarian training data and were likewise discussed on GitHub (https://github.com/sigmorphon/2020/issues/9). However, these issues are somewhat more difficult to solve automatically than for Georgian.

2.2.3 Korean

Korean uses an alphabet that provides a symbol for each consonant and each vowel, yet it groups these symbols into square syllable blocks, which makes it look somewhat similar to Chinese and Japanese writing, although it is much simpler. By default, Unicode encodes Korean in syllable blocks and not as single sounds, which results in a character set comprising thousands of characters. Luckily, Unicode also provides code points for the single-sound characters, called Jamo (http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf), and syllable characters can easily be decomposed into single-sound characters. We used hangul-jamo (https://github.com/jonghwanhyeon/hangul-jamo) for this decomposition. To give an example, 가감 /k a̠ ɡ a̠ m/ is decomposed to ㄱㅏㄱㅏㅁ. With this approach, we were able to decrease the WER and PER of our Korean baseline Transformer model considerably: the WER was reduced from 40.00 to 21.50, and the PER from 16.38 to 3.86. We use this preprocessing step for Korean in all our submitted models.
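The decomposition itself is fully determined by the Unicode Hangul syllable layout, so an equivalent preprocessing step can also be sketched with the Python standard library alone. The following sketch is our own reformulation of that step, not the code of the hangul-jamo package.

```python
# Sketch of the Hangul syllable-to-jamo decomposition used as a preprocessing
# step for Korean, based on the Unicode Hangul syllable composition formula.
LEADS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                    # 19 initial consonants
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")               # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 final consonants + empty

def decompose(text: str) -> str:
    """Replace every precomposed Hangul syllable (U+AC00-U+D7A3) by its jamo."""
    out = []
    for ch in text:
        index = ord(ch) - 0xAC00
        if 0 <= index < 19 * 21 * 28:          # character lies in the syllable block
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:                                  # leave all other characters untouched
            out.append(ch)
    return "".join(out)

print(decompose("가감"))  # -> ㄱㅏㄱㅏㅁ
```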
3 Approach

We trained a multilingual model which can transduce a word in any of the 15 source languages into its IPA representation. Multilingual models can be of the types many-to-one, one-to-many, or many-to-many. In our case, there are obviously multiple languages on the source side. On the target side, there is usually exactly one desired phoneme sequence for a given source word. Superficially, we thus have a many-to-one problem. However, many character sequences exist in more than one language.

In our hyperparameter tuning, we experimented with the following values: embedding dimension {128, 256} and hidden size {512, 1024} for both the encoder and the decoder, batch size {256, 512, 1024}, and dropout probability {0.1, 0.2, 0.3}. The number of epochs was limited to 400.

Our submitted model uses the largest of the tuned values for all hyperparameters: embedding dimensions of 256, hidden sizes of 1024, a batch size of 1024, and a dropout probability of 0.3. Due to limitations in available computation power, further tuning with even larger hyperparameter values was not feasible for us.
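For concreteness, the explored grid can be written down as follows; the configuration keys and the commented-out train_and_evaluate() call are illustrative placeholders, not the actual interface of the shared-task baseline code.

```python
# Hyperparameter grid explored for the multilingual Transformer, using the
# values reported above. Key names and the training call are placeholders.
from itertools import product

grid = {
    "embedding_dim": [128, 256],     # encoder and decoder embeddings
    "hidden_size": [512, 1024],      # encoder and decoder hidden layers
    "batch_size": [256, 512, 1024],
    "dropout": [0.1, 0.2, 0.3],
    "max_epochs": [400],             # fixed upper bound, not tuned
}

for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    # train_and_evaluate(config)     # placeholder for the actual training call
    print(config)
```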