Evaluation of Dictionary Creating Methods for Finno-Ugric Minority Languages
Zsanett Ferenczi, Iván Mittelholcz, Eszter Simon, Tamás Váradi
Research Institute for Linguistics, Hungarian Academy of Sciences, H-1068 Budapest, Benczúr u. 33.
ferenczi.zsanett, mittelholcz.ivan, simon.eszter, [email protected]

Abstract
In this paper, we present the evaluation of several bilingual dictionary building methods applied to {Komi-Permyak, Komi-Zyrian, Hill Mari, Meadow Mari, Northern Saami, Udmurt}–{English, Finnish, Hungarian, Russian} language pairs. Since these Finno-Ugric minority languages are under-resourced and standard dictionary building methods require a large amount of pre-processed data, we had to find alternative methods. In a thorough evaluation, we compare the results for each method; this confirmed our expectation that the precision of standard lexicon building methods is quite low for under-resourced languages. However, utilizing Wikipedia title pairs extracted via inter-language links and Wiktionary-based methods provided useful results. The newly created word pairs, enriched with several kinds of linguistic information, are to be deployed on the web in the framework of Wiktionary. With our dictionaries, the number of Wiktionary entries in the above-mentioned Finno-Ugric minority languages can be multiplied.

Keywords: bilingual dictionaries, evaluation, under-resourced languages, dictionary building methods

1. Introduction

The research presented in this paper is part of a project whose general objective is to provide linguistically based support for several small Finno-Ugric (FU) digital communities to generate online content and help revitalize the digital functions of some FU minority languages. The practical objective of the project is to create bilingual dictionaries for six small FU languages (Komi-Permyak, Komi-Zyrian, Meadow Mari, Hill Mari, Northern Saami, and Udmurt) paired with four major languages that are important for these small communities (English, Finnish, Hungarian, Russian) as well as to deploy the enriched lexical material on the web in the framework of the collaborative dictionary project Wiktionary.

The status of any particular language of the world is usually described using the Expanded Graded Intergenerational Disruption Scale (EGIDS) (Lewis and Simons, 2010), which gives an estimation of the overall development versus endangerment of the language. On this scale, the highest level is 0, where languages are koinés used worldwide, while languages on level 10 are already extinct. Northern Saami is on the highest level among the above-mentioned FU languages: its level is 2 (provincial), thus it is used in education, work, mass media, and government within some officially bilingual regions of Norway, Sweden, and Finland. In the case of the Meadow Mari language, the EGIDS level is 4 (educational), which means that it is in vigorous use, with standardization and literature being sustained through a widespread system of institutionally supported education. The EGIDS level of the other FU languages (Komi-Permyak, Komi-Zyrian, Hill Mari, Udmurt) is 5, i.e. they are developing, which means that there is literature available in a standardized form, though it is not yet widespread or sustainable.

The above-mentioned FU languages are not endangered but under-resourced, hence we could not collect enough data for building parallel and comparable corpora, on which the standard dictionary building methods are based. The standard approach of bilingual lexicon extraction from parallel and comparable corpora is based on context similarity methods (e.g. Fung and Yee (1998); Rapp (1995)). Recently, source and target vectors are learned as word embeddings in neural networks based on gigaword corpora (e.g. Vulić and Moens (2015)). However, these methods need a large amount of (pre-processed) data and a seed lexicon which is then used to acquire additional translations of the context words. One of the shortcomings of this approach is that it is sensitive to the choice of parameters such as the size of the context, the size of the corpus, the size of the seed lexicon, and the choice of the association and similarity measures.

For these reasons, the above-mentioned standard dictionary building methods cannot be used for our purposes. Therefore, it was necessary to conduct experiments with alternative methods. We experimented with several lexicon building methods utilizing crowd-sourced language resources, such as Wikipedia and Wiktionary (Simon et al., 2015; Benyeda et al., 2016). Completely automatic generation of clean bilingual resources is not possible according to the state of the art, but it is possible to create certain lexical resources, termed proto-dictionaries, that can support lexicographic and NLP work. Proto-dictionaries contain candidate translation pairs produced by bilingual dictionary building methods. Depending on the method used, they either comprise more incorrect translation candidates and provide greater coverage, or provide precise word pairs at the expense of some decrease in recall; their right size depends on the specific needs.

Once the proto-dictionaries were prepared, they were merged for each language pair and repeated lines were filtered out. These files were then manually validated by native speakers and linguist experts of the languages. The validated dictionaries containing translation units were the input for generating new Wiktionary entries, which were created fully automatically. As the last step of the project, we upload the entries to Wiktionary.
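The merging step described above (combining the proto-dictionaries of one language pair and filtering out repeated lines) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `merge_proto_dictionaries`, the representation of a proto-dictionary as a list of (source, target) tuples, and the placeholder word pairs are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def merge_proto_dictionaries(dictionaries: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Merge candidate pairs for one language pair, dropping repeated pairs.

    Hypothetical helper: the first occurrence of each (source, target)
    pair is kept, and the original order is preserved.
    """
    seen = set()
    merged = []
    for dictionary in dictionaries:
        for source, target in dictionary:
            # Normalise surrounding whitespace so trivially different lines match.
            pair = (source.strip(), target.strip())
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


# Placeholder data, not real dictionary entries.
from_wikipedia = [("fu_word_1", "en_word_1"), ("fu_word_2", "en_word_2")]
from_wiktionary = [("fu_word_2", "en_word_2 "), ("fu_word_3", "en_word_3")]

merged = merge_proto_dictionaries([from_wikipedia, from_wiktionary])
print(merged)
```

Keeping the first occurrence and preserving order is a design choice of this sketch; the paper only states that repeated lines were filtered out.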
The rest of the article is organized as follows. In Section 2, the methods used for creating the proto-dictionaries are briefly presented. We conducted a thorough evaluation of the dictionaries produced for each language pair. In Section 3, the results of the evaluation are presented: in Section 3.1, we present the workflow of the manual validation of the automatically generated dictionaries, in Section 3.2, we detail the precision of each dictionary creating method applied here, and in Section 3.3, we estimate a kind of coverage for the newly created dictionaries in each language pair. The article ends with some conclusions and future directions in Section 4.

2. Creating the Proto-dictionaries

For the creation of the proto-dictionaries, we applied several lexicon building methods utilizing Wikipedia and Wiktionary. For more details on the dictionary creating methods we used, see Benyeda et al. (2016) and Simon and Mittelholcz (2017); here we only provide a short description.

Wikipedia is not only the largest publicly available database of comparable documents, but it can also be used for bilingual lexicon extraction in several ways. For example, Erdmann et al. (2009) used pairs of article titles for creating bilingual dictionaries, which were later expanded with translation pairs extracted from the article texts. Mohammadi and Ghasem-Aghaee (2010) extracted parallel sentences from the English and Persian Wikipedia using a bilingual dictionary generated from Wikipedia titles as a seed lexicon. We followed the approach common to both articles, thus we created bilingual dictionaries from Wikipedia title pairs using the interwiki links.

Besides Wikipedia, Wiktionary is also considered a crowd-sourced language resource that can serve as a source of bilingual dictionary extraction. Although Wiktionary is primarily intended for a human audience, the extraction of the underlying data can be automated to a certain degree. Ács et al. (2013) extracted translations from the so-called translation tables. Since their tool Wikt2dict is freely available¹, we could apply it to our language pairs. We parsed the English, Finnish, Russian and Hungarian editions of Wiktionary looking for translations in the small FU languages we deal with.

[...] the automatic word alignment created with GIZA++ and the Moses toolkit. First, word pairs where the source and target words were character-level equivalents of each other were removed, since they are probably incorrect word pairs and remnants left after (or in the absence of) boilerplate removal. The remaining part of the dictionary was also merged into the large dictionary, serving as an interesting example of applying standard lexicon extraction tools to an under-resourced language. The text material from which the Opus proto-dictionaries come is a parallel corpus of KDE4 localization files, where the Northern Saami–English parallel data contain 0.9M tokens, the Northern Saami–Finnish data contain 0.6M tokens, and the Northern Saami–Hungarian data contain 0.8M tokens. At the time of creating the proto-dictionaries, there were no dic files available in the Opus corpus for the other language pairs besides the three mentioned above.

3.1. Manual Validation

The large merged files were then manually validated by native speakers and linguist experts of the FU languages. The instructions for the validators were as follows. The source and the target word must be valid words in the language concerned, they must be dictionary forms, and they must be translations of each other. If the source word is not a valid word in the FU language, the word pair is treated as wrong. If the source word is a valid word but not a dictionary form, the correct dictionary form should be manually added. If the target word is a good translation of the source word but is not a dictionary form, the correct dictionary form should be added. If the target word is not a good translation, a new translation should be given.

The following categories come from these instructions:

• ok-ok: The source and the target word are valid words, they are dictionary forms, and they are translations of each other.

• ok-nd: The source and the target word are valid words, they are translations of each other, but the target word is not a dictionary form.

• nd-ok: The source and the target word are valid words,