A Comparison of MT Methods for Closely Related Languages: a Case Study on Czech – Slovak Language Pair ∗

Vladislav Kuboň¹
¹ Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague
[email protected]

Jernej Vičič²,³
² FAMNIT, University of Primorska
³ Fran Ramovš Institute of the Slovenian Language, SRC SASA
[email protected]

Language Technology for Closely Related Languages and Language Variants (LT4CloseLang), pages 92–98, October 29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

∗ This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013).

Abstract

This paper describes an experiment comparing results of machine translation between two closely related languages, Czech and Slovak. The comparison is performed by means of two MT systems, one representing the rule-based approach, the other representing the statistical approach to the task. Both sets of results are manually evaluated by native speakers of the target language. The results are discussed from both the linguistic and the quantitative points of view.

1 Introduction

Machine translation (MT) of related languages is a specific field in the domain of MT which attracted the attention of several research teams in the past by promising relatively good results through the application of classic rule-based methods. The exploitation of lexical, morphological and syntactic similarity of related languages seemed to balance the advantages of data-driven approaches, especially for the language pairs with smaller volumes of available parallel data.

This simple and straightforward assumption has led to the construction of numerous rule-based translation systems for related (or similar) natural languages. The following list (ordered alphabetically) includes several examples of those systems:

• (Altintas and Cicekli, 2002) for Turkic languages.
• Apertium (Corbi-Bellot et al., 2005) for Romance languages.
• (Dyvik, 1995; Bick and Nygaard, 2007; Ahrenberg and Holmqvist, 2004) for Scandinavian languages.
• Česílko (Hajič et al., 2000), for Slavic languages with rich inflectional morphology, mostly language pairs with Czech language as a source.
• Ruslan (Oliva, 1989), a full-fledged transfer based RBMT system from Czech to Russian.
• (Scannell, 2006) for Gaelic languages; Irish (Gaeilge) and Scottish Gaelic (Gàidhlig).
• (Tyers et al., 2009) for the North Sami to Lule Sami language pair.
• Guat (Vičič, 2008) for Slavic languages with rich inflectional morphology, mostly language pairs with Slovenian language.

Many of the systems listed above were created in a period when it was hard to obtain a good quality data-driven system against which they could be compared. The existence of Google Translate¹, which nowadays enables automatic translation even between relatively small languages, has made it possible to investigate the advantages and disadvantages of both approaches. This paper introduces the first step in this direction: the comparison of results of two different systems for two very closely related languages, Czech and Slovak.

¹ Google Translate: https://translate.google.com/

2 State of the art

There has already been a lot of research in Machine Translation evaluation. There are quite a few conferences and shared tasks devoted entirely to this problem, such as the NIST Machine Translation Evaluation (NIST, 2009) or the Workshop on Statistical Machine Translation (Bojar et al., 2013). (Weijnitz et al., 2004) presents research on how systems from two different MT paradigms cope with a new domain. (Kolovratník et al., 2009) presents research on how relatedness of languages influences the translation quality of an SMT system.

The novelty of the presented paper is in the focus on machine translation for closely related languages and in the comparison of the two most widely used paradigms for this task: the shallow parse and transfer RBMT paradigm and the SMT paradigm.

3 Translation systems

The translation systems selected for the experiment are:

• Google Translate
• Česílko (Hajič et al., 2003)

Google Translate was selected as the most used translation system. Česílko belongs to the shallow-parse and shallow-transfer rule based machine translation paradigm, which many authors consider the most suitable for translation of related languages.

Česílko (Hajič et al., 2003) was used as a representative of rule-based MT systems. The translation direction from Czech to Slovak was naturally chosen because this is the only direction this system supports for this particular language pair.

The on-line publicly available versions of the systems were used in the experiment to ensure its reproducibility. All the test data is publicly available at the language technologies server of the University of Primorska².

² Test data: http://jt.upr.si/research_projects/related_languages/

Let us now introduce the systems in more detail.

3.1 Google Translate

This system is currently probably the most popular and most widely used MT system in the world. It belongs to the Statistical Machine Translation (SMT) paradigm. SMT is based on parametric statistical models, which are constructed on bilingual aligned corpora (training data). The methods focus on looking for general patterns that arise in the use of language instead of analyzing sentences according to grammatical rules. The main tool for finding such patterns is counting a variety of objects, i.e. statistics. The main idea of the paradigm is to model the probability that parts of a sentence from the source language translate into suitable parts of a sentence in the target language.

The system takes advantage of the vast parallel resources which Google Inc. has at its disposal and it is therefore able to translate a large number of language pairs. Currently (July 2014), this system offers automatic translation among 80 languages. This makes it a natural candidate as a universal quality standard for MT, especially for pairs of smaller (underrepresented) languages for which there are very few MT systems.

3.2 Česílko

One of the first systems which fully relied on the similarity of related languages, Česílko (Hajič et al., 2003), originally had a very simple architecture. Its first implementation translated from Czech to Slovak. It used the method of direct word-for-word translation (after necessary morphological processing). More precisely, it translated each lemma obtained by morphological analysis and morphological tag provided by a tagger to a lemma and a corresponding tag in the target language. For the translation of lemmas it was necessary to use a bilingual dictionary; the differences in morphology of both languages, although to a large extent regular, did not allow the use of simple transliteration. The translation of lemmas was necessary due to differences in tagsets of the source and target language.

The syntactic similarity of both languages allowed the omission of syntactic analysis of the source language and syntactic synthesis of the target one. The dictionary phase of the system was therefore immediately followed by morphological synthesis of the target language. No changes of the word order were necessary; the target language preserved the word order of the source language.

Later versions of the system experimented with an architecture change involving the omission of the source language tagger and the addition of a stochastic ranker of translation hypotheses at the target language side. The purpose of this experiment was to eliminate tagging errors at the beginning of the translation process and to enable more variants of translation from which the stochastic ranker chose the most probable hypothesis. This change has been described for example in (Homola and Kuboň, 2008). For the purpose of our experiment we are using the original version of the system, which has undergone some minor improvements (better tagger, improved dictionary etc.) and which is publicly available for testing at the website of the LINDAT project³. The decision to use this version is natural, given the fact that this is the only publicly available version of the system.

³ Česílko: http://lindat.mff.cuni.cz/services/cesilko/

4 Methodology

In the planning phase of our experiment it was necessary to make a couple of decisions which could cause certain bias and invalidate the results obtained. First of all, the choice of the language pair (Czech to Slovak) was quite natural. These languages show a very high degree of similarity at all levels (morphological, syntactic, semantic) and thus they constitute an ideal language pair for the development of a simplified rule-based architecture. We are of course aware that for a complete answer to this question it would be necessary to test more systems and more language pairs, but in this phase of our experiments we do not aim at obtaining a complete answer. Our main goal is to develop a methodology and to perform some kind of pilot testing showing the possible directions of future research.

The second important decision concerned the method of evaluation. Our primary goal was to set up a method which would be relatively simple and fast, thus allowing a reasonable volume of results to be processed manually (the reasons for manual evaluation are given in subsection 4.2.1). The second goal concerned the endeavor to estimate the evaluators' confidence in their judgments.

4.1 Basic properties of the language pair

The language pair used in our experiment belongs to the western Slavic language group. We must admit that the reason for choosing this language group was purely pragmatic: there is extensive previous experience with the translation of several language pairs from this group, see, e.g.

4.2.1 Translation quality evaluation

This part of the experiment relied on the methodology similar to that used in the 2013 Workshop on Statisti-