Grapheme-to-phoneme transcription in Hungarian

Attila Novák1, Borbála Siklósi2

1 MTA-PPKE Technology Research Group, 2 Pázmány Péter Catholic University, Faculty of Information Technology and Bionics, 50/a Práter street, 1083 Budapest, Hungary
{novak.attila, siklosi.borbala}@itk.ppke.hu

Abstract. A crucial component of text-to-speech systems is the one responsible for transcribing written text to its phonemic representation. Though the complexity of the relation between the written and spoken form varies from language to language, most languages have both regular and irregular phonological rules. In this paper, we present a system for the phonemic transcription of Hungarian. Besides implementing transcription rules, the tool incorporates the knowledge of a Hungarian morphological analyzer in order to detect morpheme and compound boundaries. We show that the system performs well even on texts containing a high number of foreign names, which could not be achieved by a lexicon-based method.

1 Introduction

In this study, our goal was to create a method that automatically transforms written Hungarian to its phonetic representation. The system was used to transcribe a database of Hungarian geographic terms. Even though units of a written alphabet might correspond to phonetic units of the spoken language, the complexity of this mapping varies among languages. Even if we consider only languages using the Latin alphabet, there are significant, language-specific differences. Thus, a transcription system must be language-specific, and the applicability of certain methods depends on both the morphosyntactic and the phonological characteristics of the given language.

In English, orthographic standards were fixed quite early, while the pronunciation system has evolved further [10]. Thus, it is often quite difficult to predict the correspondence between written and spoken forms. However, since the number of wordforms is limited, a manually created or automatically generated lexicon, containing both written and transcribed wordforms, can cover almost the whole vocabulary of the language. The main problem in English (in addition to occasional OOV items, such as names) is massive homography, with items belonging to different parts of speech often having different pronunciations.

In the case of some other languages, such as Hungarian, the relation between written and spoken forms is much closer; the orthography is basically phonemic. In most cases, pronunciation is predictable from the orthographic form. Still, there are many exceptional phenomena and restrictions arising from phonetic capabilities. Moreover, agglutination yields a huge number of wordforms, making the inclusion of the full vocabulary in a lexicon impossible [8]. Thus, an automated method is necessary. It is also favoured by technological constraints, i.e. it exploits processing capabilities instead of storing large amounts of offline data in the form of lexicons.

The structure of this paper is as follows. In Section 2, related approaches are briefly reviewed. Then, our method for transcription is described, including a detailed discussion of the language-specific difficulties of Hungarian phonology. Finally, an evaluation of our tool is presented, with an analysis of the most significant errors revealed during the experiments.

2 Related Work

There are three main branches of grapheme-to-phoneme transcription methods [4]:

– dictionary look-up,
– rule-based approaches,
– data-driven approaches.

Dictionary look-up is used when the mapping between the orthographic and phonological representation is based on conventions, and rules or generalizations are not applicable. The advantage of such methods is that other information (e.g. lexical stress, part-of-speech) can also be stored in the dictionary. However, the creation of such dictionaries by hand is very expensive and tedious. No matter how limited the agglutinating behaviour of a language is, there will always be new words or wordforms that are not covered by a predefined lexicon.

Rule-based approaches overcome this limitation by applying a set of predefined grapheme-to-phoneme transcription rules. These rules are language-specific and have to be defined manually by linguists; they can then be formulated, for example, in the framework of finite-state automata [7]. Such rule-based methods also require an exception lexicon for irregular wordforms.

Machine learning methods are also applied to grapheme-to-phoneme transcription. In [5], it has been shown that the generalization capability of such methods is better than that of rule-based approaches (at least for English). One of the most successful implementations is based on the idea of Pronunciation by Analogy (PbA) [6]. The theory behind this approach is based on psycholinguistic models, i.e. predicting the pronunciation of a word by finding similarities to words for which the phonological representation is known. Joint-sequence models [4] aim at finding the most likely pronunciation for an orthographic form by using Bayes' decision rule. For all data-driven approaches, a dictionary or a transcribed corpus is needed for training the system or building statistical models.

For Hungarian, there is an online dictionary containing 1.5 million wordforms and their phonetic transcription [2]. The construction of this dictionary included several main steps. First, wordforms were collected from a large written corpus and the resulting word list was cleaned (i.e. foreign and misspelled words were removed). Then, transformation rules were applied. Finally, exceptions were defined and corrected manually. The authors state that their dictionary can be considered a reference dictionary, providing the largest coverage of Hungarian wordforms and their IPA transcriptions. However, only wordforms appearing in the original corpus are included, so the dictionary can neither transcribe other inflected forms (unavoidable in Hungarian) nor accommodate new words entering the language.
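As a brief illustration of the decision rule mentioned above for joint-sequence models, the search for the most likely pronunciation can be written as follows. The symbols g (orthographic form) and φ (candidate pronunciation) are notation introduced here for the sketch, not taken from the systems discussed.

```latex
\varphi^{*}(g)
  = \operatorname*{arg\,max}_{\varphi} \; p(\varphi \mid g)
  = \operatorname*{arg\,max}_{\varphi} \; p(g, \varphi)
```

The second equality holds because p(g) does not depend on φ, so maximizing the posterior is equivalent to maximizing the joint probability of the letter and phoneme sequences.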

3 Method

In the case of languages with a largely phonemic orthography, such as Finnish, Estonian or Hungarian, the transcription of a written wordform is almost always straightforward. For example, the word ablak (‘window’) is pronounced as [ɔblɔk]. (Table 1 shows the transcription of the standard pronunciation of letters in the International Phonetic Alphabet, which is used in this research to represent phonetic transcription.) However, there are two types of phenomena that make the transcription non-trivial: changes in pronunciation due to the interaction of adjacent sounds, and traditionally spelled or foreign words. Another problem is the normalization of semiotic systems.

letter  IPA    letter  IPA    letter  IPA    letter  IPA
á       aː     b       b      n       n      zs      ʒ
a       ɔ      p       p      ny      ɲ      s       ʃ
o       o      d       d      j       j      cs      t͡ʃ
u       u      t       t      h       h      l       l
ü       y      g       g      v       v      r       r
i       i      k       k      f       f      dz      d͡z
é       eː     gy      ɟ      z       z      dzs     d͡ʒ
ö       ø      ty      c      sz      s
e       ɛ      m       m      c       t͡s

Table 1: The standard pronunciation of the letters of Hungarian in IPA

Our method is based on three components: a morphological analyzer, a lexicon for irregular stems, and an implementation of phonological rules formalized in XFST (Xerox Finite-State Tool) [3].

3.1 Morphological analysis

First, the morphological structure of each word is identified. This is necessary in order to find the morpheme boundaries to which certain morpho-phonological rules refer. Lexical palatalization, for example, applies only to some specific inflectional suffixes. In addition, certain phonemes are represented by digraphs or trigraphs (cs, gy, ty, ny, sz, zs, dz, dzs, and their long forms). However, if a morpheme boundary intervenes, the individual letters of these sequences are pronounced as consonant clusters (other rules might affect their behaviour, resulting in partial or full assimilation). For example, the correct transcription of the word eszközsáv ‘toolbar’ is [ɛskøsʃaːv] rather than [ɛskøʒaːv].

Compounds, which are also quite frequent in Hungarian, might contain components that have an irregular pronunciation. These should also be recognized by the morphological analyzer so that they are not transcribed by the regular phonological rules. In our system, we used the Humor morphological analyzer [9, 11].
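The effect of morpheme boundaries on multi-letter graphemes can be illustrated with a minimal Python sketch. It assumes that boundaries have already been marked with a `+` sign, which is a hypothetical convention and not the actual output format of Humor; the function name `segment_graphemes` is likewise illustrative, and the grapheme inventory is the one listed above.

```python
# Multi-letter graphemes listed in the text, longest first so that
# "dzs" is tried before "dz".
MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ny", "sz", "ty", "zs"]


def segment_graphemes(analysed_word, boundary="+"):
    """Split a word into graphemes without joining letters across a boundary.

    `analysed_word` is assumed to carry morpheme boundaries as `boundary`
    characters, e.g. "eszköz+sáv" (hypothetical analyzer output).
    """
    graphemes = []
    for morpheme in analysed_word.split(boundary):
        i = 0
        while i < len(morpheme):
            for multi in MULTIGRAPHS:
                if morpheme.startswith(multi, i):
                    graphemes.append(multi)
                    i += len(multi)
                    break
            else:
                # No multi-letter grapheme starts here: take a single letter.
                graphemes.append(morpheme[i])
                i += 1
    return graphemes


if __name__ == "__main__":
    print(segment_graphemes("eszköz+sáv"))  # z and s stay separate
    print(segment_graphemes("eszközsáv"))   # zs is read as one grapheme
```

With the boundary marked, z and s remain separate letters and can later undergo assimilation; without it, the sequence zs would be read as a single grapheme, which corresponds to the incorrect transcription mentioned above.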

3.2 Lexicon of irregular stems

In all natural languages, there are wordforms with irregular pronunciation. These are usually proper names and foreign words. Words of the latter category might adapt to the adopting language to some extent. For example, the English word file might be written in Hungarian either in its original form, file, or in a form adapted to its pronunciation, i.e. fájl. In both cases, the phonetic form is [faːjl]. However, the phrase New York is used only in its original form in written text and is pronounced as [ɲuːjork].

In Hungarian, however, not only foreign words but also some traditionally spelled words fall into this category. Such irregularities occur in quite a few family names, geographical names, etc. In addition, there are cases where standard pronunciation deviates from what orthography suggests in terms of vowel and/or consonant length. For example, the word egyesület ‘association’ is pronounced as [ɛɟːɛʃylet] rather than [ɛɟɛʃylet], as the orthographic form would suggest.

Another group of words included in the lexicon are members of semiotic systems. These use the same set of characters and symbols as the writing system of the language, but assign meaning to such units of text in a different manner. In order to produce the phonological transcription, these units must be normalized in a preprocessing step. Examples are numbers, abbreviations, acronyms, units of measurement, dates, mathematical expressions, e-mail addresses, etc. Though each of these involves a number of subproblems, a detailed discussion is beyond the scope of this paper; the reader is referred to [13]. However, the case of abbreviations is worth mentioning, where we must distinguish forms

– where the abbreviated form can be pronounced as if it were a word (e.g. NATO [naːtoː]),
– where the abbreviated form is replaced by the original form in speech (e.g. du. [deːlutaːn] ‘afternoon’),
– where the abbreviation is spelled out in speech (e.g. USB [uːeʃbeː]).

In our system, abbreviations are first matched against the lexicon, which includes the transcription of those that are pronounced as words. If there is no match, the default rule is to spell out the abbreviated form.
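The lookup-then-spell strategy for abbreviations can be sketched as follows. This is only an illustration: the dictionary and function names are hypothetical, the transcriptions are the examples given above, the letter-name table is deliberately partial, and storing the expanded form of du. in the same dictionary is an assumption of this sketch.

```python
# Abbreviations with a stored transcription; the entries are the examples
# from the text. Keeping "du." here is an assumption of this sketch.
ABBREVIATION_LEXICON = {
    "NATO": "naːtoː",
    "du.": "deːlutaːn",
}

# Partial table of letter names, derived from the USB example only;
# a real system would need an entry for every letter.
LETTER_NAMES = {"U": "uː", "S": "eʃ", "B": "beː"}


def transcribe_abbreviation(abbreviation):
    """Use the lexicon if possible, otherwise spell the form letter by letter."""
    if abbreviation in ABBREVIATION_LEXICON:
        return ABBREVIATION_LEXICON[abbreviation]
    letters = [ch for ch in abbreviation.upper() if ch.isalpha()]
    # Raises KeyError for letters missing from the partial table above.
    return "".join(LETTER_NAMES[ch] for ch in letters)


if __name__ == "__main__":
    for item in ("NATO", "du.", "USB"):
        print(item, transcribe_abbreviation(item))
```

Checking the lexicon first ensures that word-like abbreviations are never spelled out by the default rule.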

3.3 Phonological rules

The morphophonological rules in our system were implemented using XFST. The description is based on [12]. The rules are listed in the order of their application in Table 2. This order is determined by the following factors. Orthographic peculiarities of consonant notation must be handled before other rules. Lexical rules are applied before those describing postlexical processes. There are also a few feeding constraints between specific processes; these are noted in the description of each process below.

Handling orthographic peculiarities

1. Certain palatal and sibilant consonants and affricates are denoted by digraphs in Hungarian orthography, as shown in Table 1. Geminate consonants are in general denoted by doubling the corresponding letter. However, geminates of sounds denoted by digraphs are denoted by doubling only the first letter of the digraph. This rule handles these cases. Letter sequences that look like the geminate form of a digraph-denoted consonant may also be clusters, e.g. ssz may be a sequence of s+sz [ʃs], but this can occur only if there is an intervening morpheme boundary. In addition, the (partly context-sensitive) pronunciation of the letters q, w, x and y, used only in loanwords and names, is defined.
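The treatment of long digraphs in rule 1 can be sketched in Python as follows. The sketch is not the XFST rule itself: the `+` boundary marker, the `ː` length mark appended to the digraph, and the function name are assumptions made for illustration.

```python
import re

# Long forms of digraph-denoted consonants: only the first letter is doubled.
# "ddzs" precedes "ddz" so that the longer pattern is tried first.
LONG_DIGRAPHS = {
    "ddzs": "dzs", "ccs": "cs", "ddz": "dz", "ggy": "gy", "lly": "ly",
    "nny": "ny", "ssz": "sz", "tty": "ty", "zzs": "zs",
}
_LONG_PATTERN = re.compile("|".join(LONG_DIGRAPHS))


def expand_long_digraphs(analysed_word, boundary="+"):
    """Mark long digraphs as geminates, but only inside a single morpheme."""
    expanded = [
        _LONG_PATTERN.sub(lambda m: LONG_DIGRAPHS[m.group(0)] + "ː", morpheme)
        for morpheme in analysed_word.split(boundary)
    ]
    return boundary.join(expanded)


if __name__ == "__main__":
    print(expand_long_digraphs("hosszú"))     # monomorphemic: long sz
    print(expand_long_digraphs("kis+szerű"))  # s + sz cluster is left alone
```

Because the substitution is applied to each morpheme separately, a sequence such as s+sz that spans a boundary is never mistaken for a geminate.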

Lexical processes

2. The final h of a subset of h-final words (e.g. méh ‘bee, uterus’) is not pronounced unless a vowel-initial suffix follows.
3. The initial j of inflectional suffixes palatalizes preceding stem-final dental stops and l. The rule applies only at inflectional suffix boundaries.
4. The initial j of inflectional suffixes merges with a preceding stem-final palatal consonant. Lexical palatalization feeds this process.
5. Polysyllabic stems whose orthographic form ends in a long high vowel (í, ú, ű) are in general pronounced with a short final vowel, except in highly polished speech. We implemented this optional shortening.
6. Intervocalic and word-final dzs and dz are long. There are a handful of lexical exceptions with a short intervocalic dzs (e.g. fridzsider ‘fridge’ [fridʒider]).
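The feeding relation between lexical palatalization (rule 3) and palatal merging (rule 4) can be sketched as follows. The phoneme symbols, the inclusion of n among the affected consonants, the mapping of l to j, and the representation of the merged consonant as a long palatal are assumptions of this sketch rather than details given above.

```python
# Stem-final consonants affected by a suffix-initial j (rule 3), mapped to
# their palatal counterparts. The exact inventory is an assumption.
PALATALIZE = {"t": "c", "d": "ɟ", "n": "ɲ", "l": "j"}
PALATALS = {"c", "ɟ", "ɲ", "j"}


def attach_j_suffix(stem, suffix):
    """Join a stem and an inflectional suffix given as lists of phonemes."""
    stem = list(stem)
    suffix = list(suffix)
    if stem and suffix and suffix[0] == "j":
        # Rule 3: lexical palatalization of the stem-final consonant.
        stem[-1] = PALATALIZE.get(stem[-1], stem[-1])
        # Rule 4: palatal merging, which can apply to the output of rule 3.
        if stem[-1] in PALATALS:
            stem[-1] += "ː"
            suffix = suffix[1:]
    return stem + suffix


if __name__ == "__main__":
    print(attach_j_suffix(["l", "aː", "t"], ["j", "ɔ"]))  # lát + ja
    print(attach_j_suffix(["k", "eː", "r"], ["j", "ɛ"]))  # kér + je
```

Applying rule 3 before rule 4 is what lets a stem-final t end up merged with the suffix-initial j; in the reverse order, the merger would find no palatal to apply to.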

Stress

7. Stress assignment is rather trivial in Hungarian: stress always falls on the first syllable. The only complication is posed by unstressed words such as determiners, but these are handled outside the rule set.

 #  rule
 1. convert long digraphs, x, w, qu, y, ly
 2. lexical h-deletion
 3. lexical palatalization
 4. lexical palatal merging (lex. palatalization must feed it)
 5. shortening of high final vowels of polysyllabic stems (optional)
 6. lengthening of intervocalic and word-final dzs and dz
 7. first syllable of every word stressed
 8. voicing assimilation (regressive, right context checked on the output)
 9. adaffrication (voicing assim. must feed it)
10. nasal assimilation
11. degemination
12. j at the end of phon. phrase: friction and devoicing after voiceless obstruents; friction after voiced consonants
13. postlexical alternation of h (post-sonorant voicing; palatalization and velarization in coda)
14. postlexical palatalization
15. stops, fricatives, nasals, liquids: gemination over all boundaries
16. affricates: gemination over suffix boundaries
17. convert vowels

Table 2: Phonological rules in the order of their application

Postlexical rules

8. There is a regressive (right-to-left) voicing assimilation affecting obstruents. Its peculiarities are that v is devoiced but does not trigger voicing, while h triggers devoicing but is not itself voiced. This process must feed adaffrication.
9. Adaffrication: certain stop + fricative and stop + affricate clusters merge into the corresponding affricates. We did not implement optional adaffrication processes that characterize only very casual speech, such as stop + fricative adaffrication across word boundaries, or palatal + stop and palatal + affricate adaffrication.
10. Nasal n assimilates to the place of articulation of a right-adjacent stop or nasal; before labiodental fricatives, n and m are realized as a labiodental nasal [ɱ].
11. There are a number of degemination processes, which are conditioned on different contexts. Monomorphemic geminates degeminate in the context of any other consonant: CC-X → C-X, X-CC → X-C (where - can be any boundary or none at all). Degemination across boundaries, XC-C → X-C and C-CX → C-X, is obligatory if X is an obstruent (and we implemented the process in nasal contexts as well). C-CX → C-X degemination affects a restricted subset of obstruents only. Degemination following a liquid L, LC=C → L=C, occurs only across inflectional suffix boundaries.
12. At the end of a word, j is realized as a voiceless fricative [ç] after a voiceless consonant and as a voiced fricative [ʝ] after a voiced consonant.
13. There is a postlexical alternation of h. It is voiced in intervocalic position and between a sonorant and a vowel. It is palatalized to [ç] in coda position after front vowels and velarized to [x] in other codas.
14. Postlexical palatalization: dental t, d, n are palatalized before the palatals ty, gy, ny.
15. Stops, fricatives, nasals and liquids geminate over any type of boundary.
16. In not-very-casual speech, affricates geminate only over suffix boundaries.
17. Finally, we convert the orthographic representation of long vowels to the Vː notation as well.
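Regressive voicing assimilation (rule 8) can be sketched in Python as follows. This is not the XFST rule used in the system: the inventory of voiced/voiceless obstruent pairs and the function name are assumptions, while the special behaviour of v and h follows the description above.

```python
# Voiced obstruents and their voiceless counterparts (assumed inventory).
VOICED_TO_VOICELESS = {
    "b": "p", "d": "t", "g": "k", "z": "s", "ʒ": "ʃ", "ɟ": "c",
    "v": "f", "d͡z": "t͡s", "d͡ʒ": "t͡ʃ",
}
VOICELESS_TO_VOICED = {vl: vd for vd, vl in VOICED_TO_VOICELESS.items()}

# v undergoes devoicing but does not trigger voicing.
VOICING_TRIGGERS = set(VOICED_TO_VOICELESS) - {"v"}
# h triggers devoicing but has no voiced counterpart in the tables above.
DEVOICING_TRIGGERS = set(VOICED_TO_VOICELESS.values()) | {"h"}


def voicing_assimilation(phones):
    """Apply regressive voicing assimilation to a list of phoneme symbols."""
    out = list(phones)
    # Scan right to left so that the right context is the already updated
    # output, letting assimilation spread through longer clusters.
    for i in range(len(out) - 2, -1, -1):
        right = out[i + 1]
        if right in DEVOICING_TRIGGERS:
            out[i] = VOICED_TO_VOICELESS.get(out[i], out[i])
        elif right in VOICING_TRIGGERS:
            out[i] = VOICELESS_TO_VOICED.get(out[i], out[i])
    return out


if __name__ == "__main__":
    # z + ʃ across the compound boundary of eszköz+sáv
    print(voicing_assimilation(["ɛ", "s", "k", "ø", "z", "ʃ", "aː", "v"]))
```

Scanning from right to left is one simple way to obtain the "right context checked on the output" behaviour noted in Table 2; a finite-state implementation would achieve the same effect by different means.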

4 Evaluation

Our system was evaluated on the 80,206-word Hungarian version of George Orwell's 1984. We used the Hungarian model of the eSpeak speech synthesizer [1] as a baseline system; it was the only freely available tool we found that is capable of performing grapheme-to-phoneme conversion for Hungarian. eSpeak can output an IPA transcription of its input. We also considered using the online pronunciation database [2] available at http://beszedmuhely.tmit.bme.hu/mksz/ as another baseline. This dictionary contains 1.5 million wordforms, including inflected forms, and is supposed to be both representative and 99% correct. However, the database is not available for download, and even the function mentioned in the user guide of the site, which would allow the user to download the first 1000 hits returned for a query, is not implemented. We were therefore unable to use it either as a reference or as a baseline vocabulary-based system.

We measured word error rate on the whole corpus. In the case of optional alternations, we accepted all correct variants. The eSpeak output lacks any indication of postlexical assimilation processes (obstruent voicing, palatalization, nasal assimilation, and the alternations of /h/ and /j/), fails to clearly distinguish the IPA representation of affricates from that of obstruent clusters (e.g. /t͡ʃ/ vs. /tʃ/), and often erroneously represents geminate consonants as e.g. /tt/ instead of /tː/. We postcorrected these errors in the eSpeak output in order to make it comparable to our output (and correct). Another discrepancy between the two systems was that we implemented the optional shortening of stem-final long high vowels, which is typical even in non-casual standard Hungarian speech, while eSpeak outputs these vowels in their somewhat stilted long form. The word error rates of the two systems are shown in Table 3.

system                        WER
our system WER                0.35%
eSpeak u/i                    0.98%
eSpeak WER                    2.26%
eSpeak assim/h/j/N/voic       14.81%

Table 3: Evaluation. u/i: ratio of words with shortening of stem-final long high vowels; assim/h/j/N/voic: ratio of words erroneously lacking marking of voicing/palatal/nasal/j/h assimilation but otherwise correct; WER: residual word error rate.

The errors in eSpeak's output not mentioned above are mainly due to lexical gaps (including the numerous English names in the text), its inability to resolve some common abbreviations, errors concerning geminate /r/'s and the pronunciation of the digraph ch, some idiosyncratic errors concerning the representation of certain words, and the overapplication of lexical palatalization to morphemes that should not be affected. The latter type of error is caused by the lack of morphological analysis in eSpeak: lexical palatalization is handled in a pattern-based manner, which also matches in the wrong places. Our system is much better at pronouncing English names; its errors are mainly due to lexical gaps (different from those in eSpeak), wrong resolution of abbreviations and overanalysis of certain bogus compounds. The numerous Newspeak words made up by Orwell in 1984, which have a ‘Hungarian’ translation in the text, did not cause much trouble for either system, as they generally contain easy-to-convert letter sequences, and both systems have a productive transcription component instead of relying solely on a dictionary.
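The word-level error measure with accepted variants can be sketched as follows. The function name and the data layout (one set of accepted transcriptions per word) are assumptions made for illustration, not a description of the actual evaluation scripts.

```python
def word_error_rate(hypotheses, references):
    """Fraction of words whose transcription matches no accepted variant.

    `hypotheses` is a list of transcriptions; `references` is a parallel
    list of sets, each holding all accepted transcriptions of one word.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be aligned word by word")
    if not hypotheses:
        return 0.0
    errors = sum(
        1 for hyp, accepted in zip(hypotheses, references) if hyp not in accepted
    )
    return errors / len(hypotheses)


if __name__ == "__main__":
    # Toy data: the second word has two accepted variants, the third is wrong.
    hyps = ["a", "b", "c", "d"]
    refs = [{"a"}, {"b", "x"}, {"z"}, {"d"}]
    print(word_error_rate(hyps, refs))
```

Representing the reference for each word as a set makes it straightforward to accept every correct variant of an optional alternation, such as the long and short forms of stem-final high vowels.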

5 Conclusion

In this paper, we described an automatic tool that transcribes Hungarian text to its phonetic representation. The system is more than a look-up tool for individual words: it is able to transcribe whole sentences, taking into account sound assimilations at word boundaries as well. This is achieved by incorporating a morphological analyzer capable of detecting morpheme and compound boundaries, together with a set of transcription rules. Moreover, as the system is not limited to the vocabulary of a prebuilt dictionary, it is capable of transcribing any wordform, which is of crucial importance in languages like Hungarian, where agglutination and compounding can produce an unlimited number of words. We have shown that, on a dataset containing many wordforms not available in dictionaries, our system achieves a much lower error rate than the eSpeak baseline, even when the latter is evaluated leniently.

References

1. espeak. http://espeak.sourceforge.net/, accessed: 2015-04-10
2. Abari, K., Olaszy, G., Zainkó, C., Kiss, G.: Magyar kiejtési szótár az interneten [Hungarian Online Pronunciation Dictionary]. In: IV. Magyar Számítógépes Nyelvészeti Konferencia. pp. 223–230. SZTE, Szeged (2006)
3. Beesley, K., Karttunen, L.: Finite State Morphology. No. 1 in CSLI studies in computational linguistics: Center for the Study of Language and Information, CSLI Publications (2003)
4. Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (May 2008)
5. Damper, R., Marchand, Y., Adamson, M., Gustafson, K.: Evaluating the pronunciation component of text-to-speech systems for English: a performance comparison of different approaches. Computer Speech and Language 13(2), 155–176 (1999)
6. Dedina, M.J., Nusbaum, H.C.: Pronounce: a program for pronunciation by analogy. Computer Speech and Language 5(1), 55–64 (1991)
7. Kaplan, R.M., Kay, M.: Regular models of phonological rule systems. Comput. Linguist. 20(3), 331–378 (Sep 1994)
8. Kurimo, M., Puurula, A., Arisoy, E., Siivola, V., Hirsimäki, T., Pylkkönen, J., Alumäe, T., Saraclar, M.: Unlimited vocabulary speech recognition for agglutinative languages. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. pp. 487–494. HLT-NAACL '06, Association for Computational Linguistics, Stroudsburg, PA, USA (2006)
9. Novák, A.: What is good Humor like? [Milyen a jó Humor?]. In: I. Magyar Számítógépes Nyelvészeti Konferencia. pp. 138–144. SZTE, Szeged (2003)
10. Németh, G., Olaszy, G.: A magyar beszéd (Hungarian Speech). Akadémiai Kiadó, Budapest, Hungary (2010)
11. Prószéky, G., Kis, B.: A unification-based approach to morpho-syntactic parsing of agglutinative and other (highly) inflectional languages. In: Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. pp. 261–268. ACL '99, Association for Computational Linguistics, Stroudsburg, PA, USA (1999)
12. Siptár, P.: A magánhangzók [Consonants]. In: Kiefer, F., Bánréti, Z., Ács, P. (eds.) Fonológia. No. 2 in Strukturális magyar nyelvtan, Akadémiai Kiadó (1994)
13. Taylor, P.A.: Text-to-speech synthesis. Cambridge University Press, Cambridge, UK, New York (2009)