Grapheme-To-Phoneme Transcription in Hungarian
Attila Novák¹, Borbála Siklósi²

¹ MTA-PPKE Hungarian Language Technology Research Group,
² Pázmány Péter Catholic University, Faculty of Information Technology and Bionics,
50/a Práter street, 1083 Budapest, Hungary
fnovak.attila, [email protected]

Abstract. A crucial component of text-to-speech systems is the one responsible for transcribing written text into its phonemic representation. Though the complexity of the relation between the written and spoken forms of languages varies, most languages have both regular and irregular phonological rules. In this paper, we present a system for the phonemic transcription of Hungarian. Besides implementing transcription rules, the tool incorporates the knowledge of a Hungarian morphological analyzer in order to detect morpheme and compound boundaries. It is shown that the system performs well even on texts containing a high number of foreign names, which could not be achieved by a lexicon-based method.

1 Introduction

In this study, our goal was to create a method that automatically transforms written Hungarian into its phonetic representation. The system was used to transcribe a database of Hungarian geographic terms.

Even though units of a written alphabet might correspond to phonetic units of the spoken language, the complexity of this mapping varies among languages. Even if we consider only languages using the Latin alphabet, there are significant language-specific differences. Thus, a transcription system must be language-specific, and the applicability of certain methods depends on both the morphosyntactic and the phonological characteristics of the given language.

In English, orthographic standards were fixed quite early, while pronunciation has continued to evolve [10]. Thus, it is often quite difficult to predict the correspondence between written and spoken forms.
However, since the number of wordforms is limited, either a manually created or an automatically generated lexicon, containing both written and transcribed wordforms, can cover almost the whole vocabulary of the language. In addition to occasional out-of-vocabulary (OOV) items such as names, the main problem in English is massive homography: items belonging to different parts of speech often have different pronunciations.

In the case of some other languages, such as Hungarian, the relation between written and spoken forms is much closer; the orthography is basically phonemic. In most cases, pronunciation is predictable from the orthographic form. Still, there are many exceptional phenomena and restrictions arising from phonetic capabilities. Moreover, agglutination yields a huge number of wordforms, making the inclusion of the full vocabulary in a lexicon impossible [8].

Thus, an automated method is necessary. It is also favoured by technological constraints: it exploits processing capabilities instead of storing large amounts of offline data in the form of lexicons.

The structure of this paper is as follows. In Section 2, related approaches are briefly reviewed. Then, our method for transcription is described, including a detailed discussion of the language-specific difficulties of Hungarian phonology. Finally, an evaluation of our tool is presented, with an analysis of the most significant errors revealed during the experiments.

2 Related Work

There are three main branches of grapheme-to-phoneme transcription methods [4]:
– dictionary look-up,
– rule-based approaches,
– data-driven approaches.

Dictionary look-up is used when the mapping between the orthographic and phonological representations is based on conventions, and rules or generalizations are not applicable. The advantage of such methods is that other information (e.g. lexical stress, part-of-speech) can also be stored in the dictionary.
However, the creation of such dictionaries by hand is very expensive and tedious. No matter how limited the agglutinating behaviour of a language is, there will always be new words or wordforms that are not covered by a predefined lexicon.

Rule-based approaches overcome this limitation by applying a set of predefined grapheme-to-phoneme transcription rules. These rules are language-specific and have to be defined manually by linguists; they can then be formulated, for example, in the framework of finite-state automata [7]. Such rule-based methods also require an exception lexicon for irregular wordforms.

Machine learning methods are also applied to grapheme-to-phoneme transcription. In [5], it has been shown that the generalization capability of such methods is better than that of rule-based approaches (at least for English). One of the most successful implementations is based on the idea of Pronunciation by Analogy (PbA) [6]. The theory behind this approach is based on psycholinguistic models, i.e. predicting the pronunciation of a word by finding similarities to words for which the phonological representation is known. Joint-sequence models [4] aim at finding the most likely pronunciation for an orthographic form by using Bayes' decision rule. For all data-driven approaches, a dictionary or a transcribed corpus is needed for training the system or building statistical models.

For Hungarian, there is an online dictionary containing 1.5 million wordforms and their phonetic transcription [2]. The construction of this dictionary included several main steps. First, wordforms were collected from a large written corpus and the resulting word list was cleaned (i.e. foreign and misspelled words were removed). Then, transformation rules were applied. Finally, exceptions were defined and corrected manually.
The authors state that their dictionary can be considered a reference dictionary, providing the largest coverage of Hungarian wordforms and their IPA transcriptions. However, only wordforms appearing in the original corpus are included, so the dictionary offers no way either to transcribe other inflected forms (unavoidable in Hungarian) or to include new words entering the language.

3 Method

In the case of languages with a largely phonemic orthography, such as Finnish, Estonian or Hungarian, the transcription of a written wordform is almost always straightforward. For example, the word ablak ('window') is pronounced as [OblOk]. (Table 1 shows the transcription of the standard pronunciation of letters in the International Phonetic Alphabet, which is used in this research to represent phonetic transcription.) However, there are two types of phenomena that make the transcription non-trivial: changes in pronunciation due to the interference of certain sounds, and traditional or foreign words. Another problem is the normalization of semiotic systems.

letter  IPA    letter  IPA    letter  IPA    letter  IPA
á       a:     b       b      n       n      zs      Z
a       O      p       p      ny      ñ      s       S
o       o      d       d      j       j      cs      tS
u       u      t       t      h       h      l       l
ü       y      g       g      v       v      r       r
i       i      k       k      f       f      dz      dz
é       e:     gy      é      z       z      dzs     dZ
ö       ø      ty      c      sz      s
e       E      m       m      c       ts

Table 1: The phonemes of Hungarian

Our method is based on three components: a morphological analyzer, a lexicon for irregular stems and the implementation of phonological rules defined in an XFST (Xerox Finite-State Tool) formalization [3].

3.1 Morphological analysis

First, the morphological structure of each word is identified. This is necessary in order to find the morpheme boundaries to which certain morpho-phonological rules refer. Lexical palatalization, for example, applies only to some specific inflectional suffixes. In addition, certain phonemes are represented by digraphs, and in the case of dzs by a trigraph (cs, gy, ty, ny, sz, zs, dz, dzs, and their long forms).
However, if a morpheme boundary falls between the letters of such a sequence, the letters are pronounced as separate consonants forming a cluster (other rules might then affect their behaviour, resulting in partial or full assimilation). For example, the correct transcription of the word eszközsáv 'toolbar' is [EskøsSa:v] rather than [EskøZa:v].

Compounds, which are also quite frequent in Hungarian, might contain components that have an irregular pronunciation. These should also be recognized by the morphological analyzer so that they are not transcribed by the regular phonological rules. In the system, we used the Humor morphological analyzer [9, 11].

3.2 Lexicon of irregular stems

In all natural languages, there are wordforms with irregular pronunciation. These are usually proper names and foreign words. Words of the latter category might adapt to the adopting language to some extent. For example, the English word file might be written in Hungarian either in its original form, file, or in a form adapted to its pronunciation, fájl. In both cases, the phonetic form is [fa:jl]. However, the phrase New York is used only in its original form in written text and is pronounced as [ñu:jork].

In Hungarian, however, not only foreign words but also some traditionally spelled words fall into this category. Such irregularities occur in quite a few family names, geographical names, etc. In addition, there are cases where standard pronunciation deviates from what orthography suggests in terms of vowel and/or consonant length. For example, the word egyesület 'association' is pronounced as [Eé:ESylet] rather than [EéESylet], as suggested by the orthographic form.

Another group of items included in the lexicon belongs to semiotic systems. These use the same set of characters and symbols as the writing system of the language, but assign meaning to such units of text in a different manner. In order to produce the phonological transcription, these units must be normalized in a preprocessing step. Examples are numbers, abbreviations, acronyms, units of measurement, dates, mathematical expressions, e-mail addresses, etc. Each of these involves a number of subproblems whose details are beyond the scope of this paper; see [13].
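
As a rough illustration of how the letter-to-phoneme pairs of Table 1, the morpheme boundaries of Section 3.1 and the irregular-stem lexicon of Section 3.2 interact, consider the following Python sketch. It is not the XFST rule set or the Humor analyzer used in the system. The '+' boundary marker, the function names, the one-entry lexicon and the simplified devoicing step are all assumptions made for this example, and phoneme symbols are written as they are printed in Table 1.

```python
# Illustrative sketch only; all names below are hypothetical, not the authors' code.
# Letter-to-phoneme pairs copied from Table 1 (symbols as printed there).
LETTER_TO_PHONEME = {
    "á": "a:", "a": "O", "o": "o", "u": "u", "ü": "y", "i": "i",
    "é": "e:", "ö": "ø", "e": "E",
    "b": "b", "p": "p", "d": "d", "t": "t", "g": "g", "k": "k",
    "gy": "é", "ty": "c", "m": "m", "n": "n", "ny": "ñ", "j": "j",
    "h": "h", "v": "v", "f": "f", "z": "z", "sz": "s", "c": "ts",
    "zs": "Z", "s": "S", "cs": "tS", "l": "l", "r": "r",
    "dz": "dz", "dzs": "dZ",
}

# Assumed one-entry exception lexicon; the transcription of "file" is from Section 3.2.
IRREGULAR = {"file": ["f", "a:", "j", "l"]}

# Assumed, simplified assimilation: a voiced obstruent becomes voiceless
# when the next phoneme is a voiceless obstruent.
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "Z": "S",
           "v": "f", "dz": "ts", "dZ": "tS", "é": "c"}
VOICELESS = set(DEVOICE.values())


def transcribe_morpheme(morpheme):
    """Greedy longest-match over one morpheme; returns a list of phoneme symbols."""
    phonemes = []
    i = 0
    while i < len(morpheme):
        for size in (3, 2, 1):  # try the trigraph, then digraphs, then single letters
            chunk = morpheme[i:i + size]
            if chunk in LETTER_TO_PHONEME:
                phonemes.append(LETTER_TO_PHONEME[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for {morpheme[i]!r}")
    return phonemes


def transcribe(segmented_word, lexicon=IRREGULAR):
    """Transcribe a word whose morpheme boundaries are marked with '+'."""
    phonemes = []
    for morpheme in segmented_word.split("+"):
        if morpheme in lexicon:          # irregular stems bypass the letter rules
            phonemes.extend(lexicon[morpheme])
        else:                            # digraphs cannot span a '+' boundary
            phonemes.extend(transcribe_morpheme(morpheme))
    # Apply the simplified devoicing right to left so that it can propagate.
    for k in range(len(phonemes) - 2, -1, -1):
        if phonemes[k] in DEVOICE and phonemes[k + 1] in VOICELESS:
            phonemes[k] = DEVOICE[phonemes[k]]
    return "".join(phonemes)


print(transcribe("ablak"))        # OblOk
print(transcribe("eszköz+sáv"))   # EskøsSa:v  (boundary keeps z and s apart)
print(transcribe("eszközsáv"))    # EskøZa:v   (without the boundary, zs is read as a digraph)
print(transcribe("file"))         # fa:jl      (taken from the exception lexicon)
```

In this sketch the boundary information is supplied by hand; in the system described above it would come from the morphological analyzer. Long vowels and long consonants, which Table 1 does not list, are deliberately left out.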