Statistical Machine Translation from English to Tuvan*
Rachel Killackey, Swarthmore College
Linguistics Senior Thesis 2013

* Many thanks to the following people for their help throughout the process of writing this thesis: Nathan Sanders, for his incredibly helpful thesis advising; K. David Harrison, who generously provided the opportunity for me to do this research; Kathryn Montemurro, for her indispensable wisdom and music taste; Vicki Sear, for her excellent editing skills; and Peter Nilsson, for his unparalleled scripting expertise. I also want to extend a huge thank you to the Microsoft Research machine translation team and to all of the Tuvan reviewers and project leaders who worked on the Tuvan MT Project. I am profoundly indebted to their contributions to the Project and therefore to this thesis. Any remaining errors are my own. This thesis is dedicated to my late father, Joseph Killackey.

Abstract

This thesis aims to describe and analyze findings of the Tuvan Machine Translation Project, which attempts to create a functional statistical machine translation (SMT) model between English and Tuvan, a minority language spoken in southern Siberia. Though most Tuvan speakers are also fluent in Russian, easily accessible SMT technology would allow for simpler English translation without the use of Russian as an intermediary language. The English to Tuvan half of the system that I examine makes consistent morphological errors, particularly involving the absence of the accusative suffix with the basic form -ni. Along with a typological analysis of these errors, I show that the introduction of novel data that corrects for the missing accusative suffix can improve the performance of an SMT system. This result leads me to conclude that SMT can be a useful avenue for efficient translation. However, I also argue that SMT may benefit from the incorporation of some linguistic knowledge, such as morphological rules, in the early steps of creating a system.

1. Introduction

This thesis explores the field of machine translation (MT), the use of computers in rendering one natural language into another, with a specific focus on MT between English and Tuvan, a Turkic language spoken in south central Siberia. While MT is a growing force in the translation of major languages with millions of speakers, such as French, Spanish, and Russian, minority and non-dominant languages with relatively few speakers have been largely ignored. Additionally, languages with complex morphology have been difficult candidates for the creation of successful MT systems, particularly with regard to statistical machine translation (SMT), which uses probabilistic methods in processing a corpus of texts. Tuvan fulfills both of these qualities: it is a minority language with complex, agglutinative morphology. Thus, Tuvan presents an interesting subject for the implementation of an SMT system.

This thesis culminates with the assertion that an SMT system can in fact improve through the reintroduction of data that targets and corrects morphological errors the system has previously made, specifically errors involving a linguistic unit as minute as one affix. Thus, I argue for an approach to SMT that also incorporates elements of linguistic structure.

I begin in Section 2 with an overview of the available literature on machine translation, focusing on the two major paradigms of MT: rule-based and statistical machine translation. In addition, I discuss the primary way in which the output quality of most MT systems is evaluated: the Bilingual Evaluation Understudy (BLEU) score. In Section 3, I introduce the Tuvan Machine Translation Project and the Microsoft Translator Hub and summarize the methodology of the Project.
I also present a basic sketch of the grammar of Tuvan, emphasizing complex elements of phonology and morphology that have been difficult for the Project's SMT system to grapple with, and I present the results of the English to Tuvan half of the system. In Section 4, I analyze the types of errors that the Project's SMT system makes and the post-editing corrections to these errors made by fluent Tuvan speakers. In Section 5, I analyze the effects of re-presenting the corrected data to the system, and in Section 6, I summarize the major results of this thesis and offer concluding remarks.

2. Machine Translation

From its inception in the 1950s, the field of machine translation (MT) has had a long and varied history on the way to its status today as a major presence in both the research community and the commercial sector (Hutchins 1986). Defined as "the application of computers to the translation of texts from one natural language to another," MT has been implemented to achieve any one or more of the following goals: assimilation, the translation of foreign material for the purpose of understanding the content; dissemination, the translation of text for publication in other languages; and communication, the translation of more informal content such as emails, chat room discussions, and online blogs (Hutchins 1986:14, Koehn 2010). While most MT systems certainly do not produce perfect translations, the output can still be useful even for monolingual foreign language speakers in gathering a basic understanding of a text. For the purposes of this thesis, I am concerned primarily with the MT goals of assimilation and communication.

To begin, I define some key terms. In the domain of MT, a source language is the language from which a text is being translated, while a target language is the language into which a text is being translated (Hutchins 1986). Together, these two languages are called a language pair.
Thus, translation can be defined as the general task in which texts in the source language are rendered into the target language, such that "the only invariant between the two is meaning" (Nirenburg and Goodman 1998:291). Furthermore, most MT systems require some degree of post-editing, or human revision of the MT system output. Native speakers of the target language usually perform the post-editing, with their main task being the rearrangement of the MT output into coherent, grammatical sentences of the target language. Parallel corpora are source language texts paired with their target translations and are essential for many types of MT. Monolingual texts are documents written in the target language that help the translation system decide which of the alternative translations under consideration is most accurate, natural sounding, and in tune with context as seen in examples of the target language. Finally, reference translations are texts written in the target language to which the translation system output is compared in computing the BLEU score.

The subject of what constitutes a "good" translation of any kind is still relatively ill-defined, though there are methods for assessing and comparing the quality of the output of MT systems (see Section 2.3). However, the main task of MT can be stated quite simply: the computer must obtain input in the source language and produce an output text in the target language so that the meaning of the source text is the same as that of the target text. In fact, the differences among the MT efforts can be summarized in terms of the solutions that they propose for the problem of finding means of expression in the target language for the various facets of meaning of the input text units. Nirenburg raises several important questions with regard to the issue of the translation of meaning (1987:2):

1. What is the meaning of the text?
2. Does it have any component structure?
3. How does one represent the meaning of a text?
4. How does one set out to extract the meaning of a text?
5. Is it absolutely necessary to extract meaning (or at least all of the meaning) in order to translate?

While MT may not be able to answer these questions directly, they do help to underscore the fact that the central problem of MT (and perhaps of translation in general) is not computational, but linguistic. Creating linguistic rules or statistical algorithms with which to analyze data is difficult enough, but dealing with lexical ambiguity, syntactic complexity, vocabulary differences, and elliptical and ungrammatical constructions, all while retaining meaning, makes the process decidedly more complex. Two main paradigms currently implemented in the field of MT attempt to address these difficulties: one based on rule-based methods and one based on statistical methods.

2.1 Rule-based Machine Translation

In general, rule-based machine translation (RBMT), also known as "Knowledge-based MT" or the "Classical Approach," was the approach championed by some of the first MT researchers in the 1970s and 1980s, including those who built pioneering systems such as SYSTRAN and Eurotra (Toma 1977, Johnson et al. 1985). This approach is characterized by a heavy emphasis on both source and target linguistic information in creating a system. Historically, there have been three subtypes of RBMT: direct, transfer, and interlingua. Each of these types differs in the degree to which the representation of meaning and the linguistic structures are tied to the language pair in question.

The direct approach, depicted in Figure 1, is the simplest of the three. This method carries out translation unidirectionally, that is, from only one language to another, for one specific language pair (e.g., only English to Russian).

Figure 1. Direct approach to rule-based machine translation (Hutchins and Somers 1992).
Second, the interlingua approach uses an additional intermediate step to create a general representation of meaning that is independent of the language pair in question. This approach operates bidirectionally (e.g., both from English to Russian and from Russian to English) and occurs in two stages for each direction: from the source language to the interlingua, and then from the interlingua to the target language. Finally, the transfer approach involves three stages, each generating some level of syntactic representation of both the source and target languages (Hutchins and Somers 1992).
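
The structural difference among the three approaches can be illustrated with a minimal Python sketch. This is a toy under stated assumptions and does not reproduce any actual RBMT system: the three-word lexicons, the concept labels, and all function names are invented here for illustration, and real systems operate on far richer morphological and syntactic representations than single nouns.

```python
from typing import Callable, Dict, List, Tuple

# Toy bilingual lexicon (hypothetical data, for illustration only).
EN_TO_RU: Dict[str, str] = {"water": "вода", "house": "дом", "book": "книга"}


def direct_translate(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Direct approach: one unidirectional, pair-specific mapping.
    Words missing from the lexicon are passed through unchanged."""
    return [lexicon.get(token, token) for token in tokens]


# Interlingua approach: each language maps to and from a shared,
# language-independent representation (here, invented concept labels).
ANALYZE: Dict[str, Dict[str, str]] = {
    "en": {"water": "WATER", "house": "HOUSE", "book": "BOOK"},
    "ru": {"вода": "WATER", "дом": "HOUSE", "книга": "BOOK"},
}
GENERATE: Dict[str, Dict[str, str]] = {
    lang: {concept: word for word, concept in table.items()}
    for lang, table in ANALYZE.items()
}


def interlingua_translate(tokens: List[str], source: str, target: str) -> List[str]:
    """Two stages: source -> interlingua, then interlingua -> target.
    Works in either direction; raises KeyError on unknown words."""
    concepts = [ANALYZE[source][token] for token in tokens]
    return [GENERATE[target][concept] for concept in concepts]


# Transfer approach: analysis, transfer, and generation, with
# pair-specific knowledge confined to the middle stage.
Tagged = List[Tuple[str, str]]


def analyze_en(tokens: List[str]) -> Tagged:
    # Toy analysis: every word in this lexicon is tagged as a noun.
    return [("N", token) for token in tokens]


def transfer_en_ru(source_repr: Tagged) -> Tagged:
    return [(tag, EN_TO_RU.get(word, word)) for tag, word in source_repr]


def generate_ru(target_repr: Tagged) -> List[str]:
    return [word for _tag, word in target_repr]


def transfer_translate(
    tokens: List[str],
    analyze: Callable[[List[str]], Tagged],
    transfer: Callable[[Tagged], Tagged],
    generate: Callable[[Tagged], List[str]],
) -> List[str]:
    """Three stages: analysis, transfer, generation."""
    return generate(transfer(analyze(tokens)))
```

In this sketch, adding a new language to the interlingua design requires only one new table in ANALYZE, whereas the direct and transfer designs each need a new pair-specific component for every language pair and direction.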