Training Automatic Transliteration Models on DBpedia Data

Velislava Todorova, Kiril Simov
Linguistic Modeling Department, IICT-BAS
[email protected], [email protected]

Abstract

The paper presents several transliteration models and their evaluation. Our goal is to facilitate named entity recognition in Bulgarian texts by extending the coverage of DBpedia (http://www.dbpedia.org/) for Bulgarian. For this task we have trained translation Moses models to transliterate foreign names to Bulgarian. The training sets were obtained by extracting the names of all people, places and organizations from DBpedia and its extension Airpedia (http://www.airpedia.org/). Our approach is extendable to other languages with small DBpedia coverage.

1 Introduction

The DBpedia Linked Open Dataset¹ provides the wealth of Wikipedia in a formalized way via the ontological language in which DBpedia statements are represented. Still, DBpedia reflects the multilingual nature of Wikipedia. But if a user needs to access the huge number of instances extracted from the English version of Wikipedia, for example, in a different language (in our case Bulgarian), he/she will not be able to do so, because the DBpedia in the other language will not provide appropriate Uniform Resource Identifiers (URIs) for many of the instances in the English DBpedia. In this paper we describe an approach to this problem. It generates appropriate names in Bulgarian from DBpedia URIs in other languages. These new names are used in two ways: (1) to form gazetteers for the annotation of DBpedia instances in Bulgarian texts; and (2) they refer back to the DBpedia URIs from which they have been created and in this way provide access to all RDF statements about the original URIs.

The evaluation of the models is done over 100 examples, transliterated manually by two people, independently of each other. The discrepancies between the two human transliterations demonstrate the complexity of the task.

The structure of the paper is as follows: in section 2 we describe the problem in more detail; in section 3 we present some related approaches; section 4 reports on the preliminary experiments; section 5 describes how the training data is extended on the basis of the results from the preliminary models and new models are trained; section 6 provides some heuristic rules for combining transliteration and translation of names' parts; section 7 describes the evaluation of the models; the last section concludes the paper and presents some directions for future work.

¹http://wiki.dbpedia.org/

Proceedings of Recent Advances in Natural Language Processing, pages 654–662, Hissar, Bulgaria, Sep 7–9 2015.

2 Challenges

The transliteration of proper names presents many and quite challenging difficulties, not only for automatic systems but also for humans. A lot of information is needed to perform the task: the language of origin, to determine the pronunciation; some real-world facts, like how the name was or is actually pronounced by the person it belongs or belonged to (in the case of personal names) or by the locals (in the case of toponyms), and so on; and also the tradition of transliterating names from this particular language into the given target language. Even if all of this information is gathered and is consistent (which is not always the case), there are still decisions to be made – the task of finding a phonetic equivalent of one language's phoneme in another is not trivial; in
some cases the name or parts of it are meaningful words in the source language and it might not be obvious which is more appropriate – translation or transliteration; and sometimes it might be better to leave the name in its original script.

In their survey on machine transliteration, (Karimi et al., 2011) give a list of five main challenges for an automated approach to the task: 1) script specifications, 2) language of origin, 3) missing sounds, 4) deciding whether to translate or to transliterate a name (or part of it), and 5) transliteration variants.

Fortunately, 1) does not present great difficulties when dealing with European languages only, as the direction of writing is the same and the characters do not undergo changes in shape due to the phonetic environment. The only problem related to the script (apart from the minor one of choosing an appropriate encoding) is the letter case. We decided to leave the upper- and lower-case characters in our training data, trading overcomplication of the statistical models for a result that does not need additional postprocessing.

On the other hand, 2) is a very hard challenge and we decided to leave it aside for now. When we extract names from Wikipedia, we know the language they are written in, but there is no straightforward way to figure out the language they came from.

3) is the challenge of finding a phonetic match for a sound that does not exist in the target language. Human translators need to decide which of the phonemes at hand is most appropriate in the particular case, and appropriate does not necessarily mean phonetically close, as orthography and etymology may also be considered important. We hope to overcome this difficulty by training machine translation Moses models on parallel lists of names where the decisions about sound mapping have already been made by humans.

Challenge 4) is related to the fact that transferring a name from one language to another can happen not only through transliteration but also via translation (for example, until 1986 the French name Côte d'Ivoire was translated, not transliterated, into Bulgarian as Бряг на слоновата кост) or direct adoption (as with many music band names).² To distinguish the cases where transliteration is needed from those where direct adoption or translation is more appropriate, we use some heuristics that involve the combination of several different models; they are described in more detail in section 6.

Problem 5) is very peculiar. It looks similar to variation in translation, but differs in its 'abnormality'. One would accept as normal different translations of a single sentence, and we know that even in the source language the meaning of this sentence can be expressed in other ways. Transliteration, on the other hand, is expected to produce one single result for each name, just as this name is an unvaried reference to an entity. However, there are often multiple transliterations of the same name. Our models do not attempt to generate all the acceptable variants, but we calculate how different our results are from manually generated transliterations, and we expect this estimation to be useful for determining whether an expression is likely to be a transliteration of a certain name.

²Or it might be a combination of the three. Here we will not deal with mixed cases as such. We will consider as cases of direct adoption only those where the whole name has been directly adopted. We will treat as translation cases all cases where at least some part of the name has been translated, and we will take the rest to be transliteration ones.

3 Related Works

(Matthews, 2007) approaches transliteration very similarly to the way we do. Like us, he trains machine translation Moses models on parallel lists of proper names. The language pairs for which he obtains transliteration and back-transliteration models are English and Arabic, and English and Chinese. Unlike him, we are interested only in forward transliteration to Bulgarian, and our approach differs from his in several other aspects. First, we do not lowercase our training data. Second, we explore not only unigram models but also bigram ones. And finally, we employ heuristics to decide whether to transliterate or not.

The construction of our models was inspired by (Nakov and Tiedemann, 2012). They train the only transliteration models for Bulgarian known to us. They use automatic transliteration as a substitute for machine translation between very closely related languages, namely Bulgarian and Macedonian. Their models are of several different types – unigram, bigram and trigram – and their results show that bigrams perform best, because they are "a good compromise between generality and contextual specificity" (Nakov and Tiedemann, 2012, p. 302). We have also trained and compared unigram and bigram models (see Sections 4 and 5); however, we left the trigrams out because of the specificity problem (a trigram generally occurs much less often than a unigram, for example), which gets even worse with proper names coming from many different languages with very diverse letter sequence patterns.

Here we will not give a full overview of automatic transliteration techniques; instead we will reference (Karimi et al., 2011), which is a detailed survey on the topic, and (Choi et

4 Preliminary Experiments

The first step was to train models on data cleaned and tidied as much as possible (details are given in section 4.1). The second step was to apply the first models to the data to further clean and tidy it up (details are given in section 5.1), so that a second, better series of models is obtained. In this section we describe the first models; the next section deals with the ones that were the product of the second step.

4.1 Training Data

The parallel lists of names on which we have trained our models have been extracted from DBpedia. We have used the instance type feature to select the URLs of all people, places³ and organizations in seven languages: Bulgarian, English, German, French, Russian, Italian and Spanish. Then we have mapped the Bulgarian names to the corresponding foreign names via the interlanguage links in DBpedia.
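The extraction and pairing step can be sketched as follows. The SPARQL formulation, the dbo: class name and the toy result rows are illustrative assumptions – the paper does not give its exact queries or the format of the DBpedia data it used:

```python
# Sketch of the name-pair extraction: select instances of a DBpedia
# ontology class and pair their English labels with Bulgarian labels
# via interlanguage links. Query shape and rows are hypothetical.

def build_query(dbo_class: str, lang: str = "bg") -> str:
    """Build a SPARQL query pairing an entity's English label with its
    label in the target language (hypothetical formulation)."""
    return f"""
    SELECT ?en_label ?tgt_label WHERE {{
      ?s a {dbo_class} ;
         rdfs:label ?en_label ;
         owl:sameAs ?tgt .
      ?tgt rdfs:label ?tgt_label .
      FILTER (lang(?en_label) = 'en' && lang(?tgt_label) = '{lang}')
    }}"""

def to_pairs(rows):
    """Turn (foreign_name, bulgarian_name) result rows into a parallel
    list, dropping empty entries."""
    return [(f.strip(), b.strip()) for f, b in rows if f and b]

# Toy rows standing in for a real query result.
rows = [("Sofia", "София"), ("John Atanasoff", "Джон Атанасов")]
pairs = to_pairs(rows)
```

The parallel list produced this way is what the Moses training step consumes, one name pair per line.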
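Since Moses operates on space-separated tokens, transliteration is cast as character-level translation: each name is split into single characters (for the unigram models) or character pairs (for the bigram models), with letter case preserved as described in section 2. A minimal sketch; whether the authors' bigrams overlap is an assumption made here for illustration:

```python
def unigram_tokens(name: str) -> str:
    """Space-separate single characters, preserving letter case."""
    return " ".join(name)

def bigram_tokens(name: str) -> str:
    """Space-separate overlapping character bigrams (one of several
    possible bigram schemes; assumed here, not taken from the paper)."""
    if len(name) < 2:
        return name
    return " ".join(name[i:i + 2] for i in range(len(name) - 1))
```

For example, `unigram_tokens("Sofia")` yields `"S o f i a"`, and `bigram_tokens("Sofia")` yields `"So of fi ia"`; a training corpus would hold one tokenized name per line, source and target sides aligned line by line.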
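Section 2 mentions estimating how far an automatic transliteration is from a manually produced one. The paper's own measure is given in its evaluation section; a standard choice for such a comparison is Levenshtein edit distance, sketched below purely as an illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: insertions,
    deletions and substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For instance, `levenshtein("Джон", "Джони")` is 1 (one inserted character), so a small distance between a model output and a human transliteration suggests the expression is a plausible transliteration of the name.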