Statistical Machine Translation from English to Tuvan*

Rachel Killackey, Swarthmore College
Linguistics Senior Thesis, 2013

Abstract

This thesis aims to describe and analyze findings of the Tuvan Machine Translation Project, which attempts to create a functional statistical machine translation (SMT) model between English and Tuvan, a minority language spoken in southern Siberia. Though most Tuvan speakers are also fluent in Russian, easily accessible SMT technology would allow for simpler English translation without the use of Russian as an intermediary language. The English to Tuvan half of the system that I examine makes consistent morphological errors, particularly involving the absence of the accusative suffix with the basic form -ni. Along with a typological analysis of these errors, I show that the introduction of novel data that corrects for the missing accusative suffix can improve the performance of an SMT system. This result leads me to conclude that SMT can be a useful avenue for efficient translation. However, I also argue that SMT may benefit from the incorporation of some linguistic knowledge such as morphological rules in the early steps of creating a system.

1. Introduction

This thesis explores the field of machine translation (MT), the use of computers in rendering one natural language into another, with a specific focus on MT between English and Tuvan, a Turkic

language spoken in south central Siberia. While MT is a growing force in the translation of major languages with millions of speakers such as French, Spanish, and Russian, minority and non-dominant languages with relatively few speakers have been largely ignored.

Additionally, languages with complex morphology have been difficult candidates for the creation

of successful MT systems, particularly with regards to statistical machine translation (SMT), which uses probabilistic methods in processing a corpus of texts. Tuvan fulfills both of these

* Many thanks to the following people for their help throughout the process of writing this thesis: Nathan Sanders, for his incredibly helpful thesis advising; K. David Harrison, who generously provided the opportunity for me to do this research; Kathryn Montemurro, for her indispensable wisdom and music taste; Vicki Sear, for her excellent editing skills; and Peter Nilsson, for his unparalleled scripting expertise. I also want to extend a huge thank you to the Microsoft Research machine translation team and to all of the Tuvan reviewers and project leaders who worked on the Tuvan MT Project. I am profoundly indebted to their contributions to the Project and therefore to this thesis. Any remaining errors are my own. This thesis is dedicated to my late father, Joseph Killackey.

qualities: it is a minority language with complex, agglutinative morphology. Thus, Tuvan

presents an interesting subject for the implementation of an SMT system. This thesis culminates with the assertion that an SMT system can in fact improve from the reintroduction of data that targets and corrects morphological errors - specifically involving a linguistic unit as minute as

one affix - that the system has made previously. Thus, I argue for an approach to SMT that also

incorporates elements of linguistic structure.

I begin in Section 2 with an overview of the available literature on machine translation,

focusing on the two major paradigms of MT: rule-based and statistical machine translation. In

addition, I discuss the primary way in which the output quality of most MT systems is evaluated: the Bilingual Evaluation Understudy (BLEU) score. In Section 3, I introduce the Tuvan Machine

Translation Project and the Microsoft Translator Hub and summarize the methodology of the

Project. I also present a basic sketch of the grammar of Tuvan, emphasizing complex elements of

phonology and morphology that have been difficult for the Project's SMT system to grapple with, and I present the results of the English to Tuvan half of the system. In Section 4, I analyze the types of errors that the Project's SMT system makes and the post-editing corrections to these

errors made by fluent Tuvan speakers. I provide an analysis of the effects of re-presenting the

corrected data into the system in Section 5 and summarize the major results of this thesis and

offer concluding remarks in Section 6.

2. Machine Translation

From its inception in the 1950s, the field of machine translation (MT) has undergone a long and varied history to reach its status today as a major presence in both the research community and

commercial sector (Hutchins 1986). Defined as "the application of computers to the translation

of texts from one natural language to another," MT has been implemented to achieve any one or more of the following goals: assimilation, the translation of foreign material for the purpose of

understanding the content; dissemination, translating text for publication in other languages; and

communication, the translation of more informal content such as emails, chat room discussions,

and online blogs (Hutchins 1986:14, Koehn 2010). While most MT systems certainly do not

produce perfect translations, the output can still be useful even for monolingual foreign language

speakers in gathering a basic understanding of a text. For the purposes of this thesis, I am

concerned primarily with the MT goals of assimilation and communication.

To begin, I define some key terms. In the domain of MT, a source language is the

language from which a text is being translated, while a target language is the language into which a text is being translated (Hutchins 1986). Together, these two languages are called a

language pair. Thus, translation can be defined as the general task in which texts in the source

language are rendered into the target language, such that "the only invariant between the two is meaning" (Nirenburg and Goodman 1998:291). Furthermore, most MT systems require some

degree of post-editing, or human revision to the MT system output. Native speakers of the target

language usually perform the post-editing, with their main task being the rearrangement of the

MT output into coherent, grammatical sentences of the target language. Parallel corpora are

source language texts paired with their target translations and are imperative for many types of

MT. Monolingual texts are documents written in the target language that help the translation

system decide which of the considered alternative translations is more accurate, natural-sounding, and in tune with context in examples of the target language. Finally, reference

translations are texts written in the target language to which the translation system output is

compared in computing the BLEU score.

The subject of what constitutes a "good" translation of any kind is still relatively ill-defined, though there are methods for assessing and comparing the quality of the output of MT systems (see Section 2.3). However, the main task of MT can be stated quite simply: the

computer must obtain input in the source language and produce an output text in the target

language so that the meaning of the source text is the same as that of the target text. In fact, the

differences among the MT efforts can be summarized in terms of the solutions that they propose

for the problem of finding target language expressions for the various facets of meaning of the input text units. Nirenburg raises several important questions with regard to the

issue of the translation of meaning (1987:2):

1. What is the meaning of the text?
2. Does it have any component structure?
3. How does one represent the meaning of a text?
4. How does one set out to extract the meaning of a text?
5. Is it absolutely necessary to extract meaning (or at least all of the meaning) in order to translate?

While MT may not be able to answer these questions directly, they do help to underscore the fact that the central problem of MT (and perhaps of translation in general) is not

computational, but linguistic. Creating linguistic rules or statistical algorithms with which to

analyze data is difficult enough, but dealing with lexical ambiguity, syntactic complexity, vocabulary differences, elliptical and ungrammatical constructions, and retaining meaning makes the process decidedly more complex.

There are two main paradigms currently implemented in the field of MT that attempt to address these difficulties: one based on linguistic rules and one based on statistical methods.

2.1 Rule-based Machine Translation

In general, rule-based machine translation (RBMT) - also known as "Knowledge-based MT" or the "Classical Approach" - was the approach heralded by some of the first MT researchers in the

1970s and 1980s, including those who built pioneering systems such as SYSTRAN and Eurotra

(Toma 1977, Johnson et al. 1985). This approach is characterized by a heavy emphasis on both source and target linguistic information in creating a system.

Historically, there have been three subtypes of RBMT: direct, transfer, and interlingua.

Each of these types differs in the degree to which the representation of meaning and the linguistic structures are tied to the language pair in question. The direct approach, depicted in

Figure 1, is the simplest of the three. This method carries out translation unidirectionally, or from only one language to another, for one specific language pair (e.g., only English to Russian).

Figure 1. Direct approach to rule-based machine translation (Hutchins and Somers 1992).

Second, the interlingua approach uses an additional intermediate step to create a general representation of meaning that is independent of the language pair in question. This approach operates bidirectionally (e.g., both from English to Russian and from Russian to English) and occurs in two stages for each direction: from the source language to the interlingua, and then from the interlingua to the target language. Finally, the transfer approach involves three stages, each generating some level of syntactic representation of both the source and target languages

(Hutchins and Somers 1992).

Each type of RBMT system also requires the creation of a source and target bilingual dictionary, as well as a method for deriving source language morphological and syntactic structures and converting them into the corresponding target language structures. The first stage of any RBMT system involves analyzing input source language text in terms of morphology, syntax, and semantics in order to create an internal representation of the grammar of the source

language. The same process is implemented to create an internal representation of the grammar

of the target language. Translations are then generated from these representations using both the bilingual dictionary and the grammatical rules. Depending on the language pair, RBMT is

capable of producing accurate, high-quality translations.

However, the rule-based approach has several issues that make it difficult for commercial

MT systems to employ. First, RBMT is computationally expensive and time consuming. The

sheer number of rules and lexical items needed in order to describe both the source and target languages means it could take months or even years to create one system. Additionally, the

interconnectedness and mutual dependency of linguistic rules make alterations to the system

difficult. Changing one rule creates a domino effect of sorts, meaning most other rules are also

affected (Sumita et al. 1990).

Moreover, the creation of syntactic rules, semantic restrictions, structural transfer rules, word selections, and generation rules requires a linguistically trained staff, which can mean

RBMT efforts are also monetarily costly. Furthermore, RBMT is less comprehensive than other types of MT because it works on exact-match, and not best-match, reasoning. In this way,

RBMT is unable to use context for the translation of ambiguous words or sentences, as its

architecture contains linguistic rules and semantic information but no source or target language texts (Sumita et al. 1990). Thus, RBMT fails to translate when it has no knowledge that matches

its input exactly.

For all of these reasons, RBMT has not become the norm in the field of MT and is generally restricted to organizations like the National Security Agency that demand accurate translations of strategic, high-priority languages (Soudi et al. 2012).

2.2 Statistical Machine Translation

The second and more recent theoretical approach to MT is statistical machine translation (SMT), which is the paradigm used in both the Tuvan MT Project and the majority of new MT models

being created today. Popular SMT systems include Google Translate and Bing Translate, as well

as the Microsoft Translator Hub analyzed in this thesis. This statistical approach is founded upon the belief that language need not be fully distilled into a set of rules, and thus these systems

contain little or no explicit linguistic knowledge (Koehn 2010). Instead, SMT research argues that by presenting a computer with source-target language parallel corpora, the computer can

carry out the process of translation with minimal human intervention. To do this, SMT systems use probabilistic methods of pairing the input and output in order to learn from statistical patterns

in these data.

Finding sufficiently large and high-quality parallel corpora is one of the main issues presented to

SMT researchers. Most texts are found either by crawling, or searching, the web or by utilizing

documents that have already been translated by others. Using a greater amount of parallel

corpora in training gives SMT systems a higher chance of learning the mapping between source

and target languages, which is especially true when languages with rich morphology like Tuvan

are involved. Complex morphology creates the possibility for a higher number of stem and affix

combinations, and thus a higher number of possible words. Consequently, these languages have what K. David Harrison calls high novelty indices, which means that as the corpus size increases, the number of new words encountered also increases (p.c., based on ideas in Nichols 1999). In

other words, even with larger numbers of texts, the chance of encountering novel, morphologically complex words continues to be high (see Figure 2). Such circumstances are

analogous to the economic property of increasing returns, which states that when a single

process (here, corpus size) is increased, a single factor of output (here, the number of novel words encountered in a growing corpus) also continues to increase (Samuelson and Nordhaus

2001).

[Figure 2: plot of the number of novel word forms (y-axis) against corpus size (x-axis); the curve keeps rising for a language with high morphological complexity.]

Figure 2. Representation of novelty index of language with high morphological complexity (e.g., Tuvan).

With languages that do not have complex morphology (e.g., English), increasing the size

of the corpus does not also increase the number of instances of novel words; at a certain point, the number of novel words levels off and begins to decrease (see Figure 3). These languages

have relatively low novelty indices. Such circumstances are analogous to the economic property

of diminishing returns, which states that when a single process (here, corpus size) is increased, a

single factor of output (here, the number of novel words encountered in a growing corpus)

decreases after a certain point (Samuelson and Nordhaus 2001).

[Figure 3: plot of the number of novel word forms (y-axis) against corpus size (x-axis); the curve levels off and declines for a language with low morphological complexity.]

Figure 3. Representation of novelty index of language with low morphological complexity (e.g., English).
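To make the notion of a novelty index concrete, the following sketch (not part of the Project's pipeline; the function name and chunk size are my own) counts how many previously unseen word types each successive slice of a corpus contributes.

def novelty_curve(tokens, chunk_size=10000):
    """Count how many previously unseen word types each corpus slice adds."""
    seen = set()
    new_types_per_chunk = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        new_types = {t for t in chunk if t not in seen}
        new_types_per_chunk.append(len(new_types))
        seen.update(new_types)
    return new_types_per_chunk

# For a language like Tuvan the curve is expected to stay high as the corpus
# grows; for a language like English it levels off and declines.
# tuvan_curve = novelty_curve(tuvan_tokens)
# english_curve = novelty_curve(english_tokens)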

Because SMT is limited by the quantity and scope of its training data, it is often not

possible to present a system with the full picture of a language that has a high novelty index. In

fact, this is a problem encountered in the Tuvan MT Project.

Once a search for available parallel corpora has been exhausted, the texts must be made ready for input into the SMT system. Namely, each source language text must be aligned to its

corresponding target language text on a sentence-by-sentence basis. Some SMT systems also require a processing step of tokenization, or breaking up raw text into words (which may include

splitting off punctuation or breaking up compounds) and true casing, or changing all letters from

uppercase to lowercase aside from proper names (Koehn 2010). However, the Microsoft

Translator Hub system disregards any type of punctuation or capitalization, so it was not necessary for the Tuvan MT Project to carry out this step.
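For systems that do require this step, a minimal sketch of tokenization and truecasing might look like the following; the proper-noun set is a hypothetical stand-in for the frequency-based truecasing models Koehn (2010) describes.

import re

def tokenize(line):
    # Split off punctuation as separate tokens, e.g. "game." -> "game", "."
    return re.findall(r"\w+|[^\w\s]", line)

def truecase(tokens, proper_nouns):
    # Lowercase everything except words listed as proper names (hypothetical list).
    return [t if t in proper_nouns else t.lower() for t in tokens]

print(truecase(tokenize("Of course John has fun with the game."), {"John"}))
# ['of', 'course', 'John', 'has', 'fun', 'with', 'the', 'game', '.']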

After the parallel corpora have been aligned and uploaded, most of the current SMT

systems use models that translate small word sequences as basic units. Though I will not

elaborate on the precise algorithms these models use to translate sentences, I will describe the

basic parsing process that the texts undergo. Input sentences in the source language are first

parsed into phrases, i.e., any multiword units (note that this does not necessarily correlate to the linguistic notion of a phrase; for example, an SMT model might group together the words fun with the as in Figure 4, even though theories of English syntax would argue that this phrase on its

own is ungrammatical [Koehn 2010:128]). Most SMT researchers choose to base their models on the translation of phrases, and not words, because single words can be translated in multiple ways. There is rarely a one-to-one correspondence between the vocabularies of natural languages

(Hutchins 1986); one word in the source language may translate to two or more words in the target language, or vice versa. Word-based models often break down in these cases. Thus,

phrases (or strings of multiple words) prove to be a better atomic unit for translation.

Next, these source language phrases are reordered and mapped onto the corresponding target language phrases. This process is illustrated between German and English in Figure 4.

natuerlich hat john spass am spiel

of course john has fun with the game

Figure 4. Phrasal matching between German and English (Koehn 2010).

If the SMT model encounters lexical or phrasal ambiguity, it will use the surrounding

context to influence the translation decision (Soudi et al. 2012). Translating phrases instead of

single words also helps to resolve these ambiguities, allowing for a larger context window. In

Figure 4, it may seem that the phrase fun with the is an odd grouping. However, it is useful in this context because German and English prepositions are generally semantically dissimilar.

While the German preposition am has multiple potential translations in English, such as at the,

on the, and with the, the context of the preceding word spass tells the MT system that the translation with

the is most appropriate (Koehn 2010).

Mathematically, reordering is achieved through a "distance-based reordering model,"

which views one phrase at a time relative to the previous one (Koehn 2010:129). In this model, source language phrases are reordered to match target language phrases. The reordering distance for the ith target language phrase is calculated as start_i - end_(i-1) - 1, where start_i is the position of the first word of the source language phrase that translates into the ith target language phrase, and end_(i-1) is the position of the last word of the source language phrase that translates into the (i-1)th target language phrase. Here, the "reordering distance" refers to "the number of

words skipped" when taking source language words out of sequence (129). Given the German

source language phrase natuerlich hat john spass am spiel and the English target language phrase

of course john has fun with the game shown in Figure 4, the distance-based reordering model

carries out the distance calculations outlined in Figure 5.

English phrase    translates (source words)    movement               distance
1                 1-3                          start at beginning     0
2                 6                            skip over 4-5          +2
3                 4-5                          move back over 4-6     -3
4                 7                            skip over 6            +1

Figure 5. Distance-based reordering model for German to English translation (Koehn 2010).
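The calculation in Figure 5 can be reproduced with a few lines of code; the function below is an illustrative sketch, not the Hub's implementation, and takes as input the source-word spans covered by each target phrase.

def reordering_distances(spans):
    """Reordering distance start_i - end_(i-1) - 1 for each target phrase.

    spans: (start, end) source-word positions covered by the source phrase
    that translates into the ith target phrase, in target order.
    """
    distances = []
    previous_end = 0  # treated as position 0 before the first source word
    for start, end in spans:
        distances.append(start - previous_end - 1)
        previous_end = end
    return distances

# The four English phrases of Figure 4 cover these German word positions:
print(reordering_distances([(1, 3), (6, 6), (4, 5), (7, 7)]))
# [0, 2, -3, 1], matching Figure 5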

After phrases have been reordered and the resulting pairs have been extracted, the model then looks over the entire aligned parallel corpora in order to compute the conditional probability

distributions of the phrase pairs. In the context of MT, a conditional probability distribution

takes a source language phrase and gives the likelihood of each possible target language phrase as its translation. For each

sentence pair, the model extracts a number of phrase pairs and counts in how many sentence pairs a particular phrase pair is extracted. These phrase pair probabilities are then stored by the

system in a "phrase translation table" like the one in Figure 6, which the system uses for the task of translation (128).

Translation       Probability φ(e|f)
of course         0.5
naturally         0.3
of course ,       0.15
, of course ,     0.05

Figure 6. Example phrase translation table for German word natuerlich (Koehn 2010).

This is of course a simplified outline of the process that an SMT system carries out, but it is nonetheless useful for the purposes of this thesis.
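To make the outline a bit more concrete, the following sketch shows how such a table can be estimated by relative frequency from extracted phrase pairs; the counts are hypothetical and chosen so that the resulting probabilities match Figure 6.

from collections import Counter, defaultdict

def phrase_table(phrase_pairs):
    """Estimate phi(e|f) by relative frequency from extracted phrase pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(f for f, _ in phrase_pairs)
    table = defaultdict(dict)
    for (f, e), n in pair_counts.items():
        table[f][e] = n / source_counts[f]
    return table

# Hypothetical extraction counts chosen to reproduce Figure 6 (10+6+3+1 = 20):
pairs = ([("natuerlich", "of course")] * 10
         + [("natuerlich", "naturally")] * 6
         + [("natuerlich", "of course ,")] * 3
         + [("natuerlich", ", of course ,")])
print(phrase_table(pairs)["natuerlich"])
# {'of course': 0.5, 'naturally': 0.3, 'of course ,': 0.15, ', of course ,': 0.05}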

2.3 The Bilingual Evaluation Understudy (BLEU) Score

The current primary accepted metric for evaluating SMT systems is the Bilingual Evaluation

Understudy (BLEU) score. Thus far, there is no equivalent metric for the evaluation of RBMT systems. IBM researchers Papineni, Roukos, Ward, and Zhu developed the BLEU score as an efficient and inexpensive way to imitate the assessment mechanisms of human translators. The growing demand for translation creates what Papineni et al. call an "evaluation bottleneck," which means that human translators are unable to handle the sheer number of requests from researchers to evaluate MT systems (2002:311). As a result, human evaluation is expensive and may take weeks or months to complete. The BLEU score addresses this problem by providing quick, language-independent translation quality assessment that attempts to correlate as closely as possible with professional human judgments of adequacy, fidelity to the original text, and fluency. The BLEU score corresponds to a percentage (i.e., a number ranging from either 0-1, as with many MT systems, or 0%-100%, as with the Microsoft Translator Hub). No translation is awarded a perfect score unless it is indistinguishable from a reference translation.

In order to evaluate an MT system successfully, the BLEU score requires "a numerical

'translational closeness' metric [and] a corpus of good quality human reference translations"

(Papineni et al. 2002:311). The corpus includes both source and target language texts, and while this evaluation system prefers multiple reference translations, using only one reference translation (i.e., one target language text) has not been shown systematically to decrease the

BLEU score (Koehn 2004). The BLEU score is modeled after the word error rate metric used in the speech recognition community, which uses a weighted average of variable length phrase matches against the reference translations. Using a "modified n-gram precision measure," the

BLEU score computes the number of n-gram (i.e., phrases containing n words) matches in both the candidate translation, or MT system output, and the reference translation(s) (Papineni et al.

2002:312). The higher the number of matches, the better the output translation scores.

Importantly, these matches are position-independent, meaning the order of an n-gram with respect to other n-grams in the same sentence is irrelevant.

For example, given the reference translations in (1a-c), and the candidate translations in (2a-b), it is clear that the candidate in (2a) is a better translation. Candidate (2a) shares several phrases with the reference translations; it shares it is a guide to action with reference (1a), which with reference (1b), ensures that the military with reference (1a), always with references (1b) and (1c), commands with reference (1a), and of the party with reference (1b) (Papineni et al.

2002). In contrast, candidate (2b) exhibits far fewer matches, and the overall quality of the

sentence is poor. Therefore, candidate (2a) would receive a higher BLEU score.

(1) a. It is a guide to action that ensures that the military will forever heed Party commands.
    b. It is the guiding principle which guarantees the military forces always being under the command of the Party.
    c. It is the practical guide for the army always to heed the directions of the party.

(2) a. It is a guide to action which ensures that the military always obeys the commands of the party.
    b. It is to insure the troops forever hearing the activity guidebook that party direct.

In order to ensure that the BLEU score is as accurate as possible, the metric also accounts

for duplicate n-grams. In other words, precise MT systems that overgenerate in output by translating one word or phrase multiple times in different ways often result in inaccurate and

improbable translations. Thus, the BLEU score exhausts a particular reference word or phrase

after it has been found one time in the candidate translation. A candidate translation with

frequent duplicate translations is considered inadequate and decreases in score. For instance,

candidate translation (4) would receive quite a low BLEU score when compared with its reference translations (3a-b). The word the is exhausted in candidate translation (4) after

appearing twice, as in reference translation (3a) (Papineni et al. 2002). However, the is

duplicated many more times in the candidate translation.

(3) a. The cat is on the mat.
    b. There is a cat on the mat.

(4) the the the cat the the the
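A minimal sketch of the clipped (modified) unigram precision underlying this behavior is given below; it reproduces the effect described for (3)-(4), though the full BLEU metric combines precisions over several n-gram lengths.

from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Credit each candidate word at most as often as it appears in any one reference."""
    cand_counts = Counter(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for word, n in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], n)
    clipped = sum(min(n, max_ref_counts[word]) for word, n in cand_counts.items())
    return clipped / len(candidate)

candidate = "the the the cat the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(clipped_unigram_precision(candidate, references))
# 3/7: 'the' is clipped to two occurrences and 'cat' counts once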

Furthermore, the BLEU score enforces length matches between the candidate and reference translation(s). The n-gram precision measure penalizes candidate translations that are too long or too short, meaning the candidate contains words that do not appear in the reference translation(s) or is missing words that do appear in the reference translation(s). While the modified n-gram precision measure already penalizes candidate translations that are longer than their reference(s), the introduction of a sentence brevity penalty ensures that candidate translations with missing words or phrases are also penalized. For example, given the reference translations (5a-c) and candidate translations (6a-b), candidate (6b) is the better translation (Papineni et al. 2002). While candidate (6a) contains more words from each reference translation, its length suggests it is a poorer translation. Thus, candidate (6b) would receive a higher BLEU score.

(5) a. I always do.
    b. I invariably do.
    c. I perpetually do.

(6) a. I always invariably perpetually do.
    b. I always do.

The sentence brevity penalty is computed over the entire corpus of reference translations,

so that if the length of a candidate translation matches the length of at least one of the reference

Project) the corpus contains only one reference translation, the chance of receiving a sentence

brevity penalty may increase.
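The standard corpus-level brevity penalty from Papineni et al. (2002) can be sketched as follows; the lengths passed in are illustrative and correspond to the three-word references in (5) and candidate (6b).

import math

def brevity_penalty(candidate_length, effective_reference_length):
    """Corpus-level brevity penalty from Papineni et al. (2002)."""
    if candidate_length > effective_reference_length:
        return 1.0
    return math.exp(1 - effective_reference_length / candidate_length)

print(brevity_penalty(3, 3))  # candidate (6b) matches the 3-word references: 1.0, no penalty
print(brevity_penalty(2, 3))  # a two-word candidate would be penalized: ~0.61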

The BLEU score is of course not without limitations. Zhang et al. (2004) ran several

experiments to elucidate the relative weight given to unigram (one-word) and bigram (two-word) matches by the BLEU score. They found that the metric used to calculate the BLEU score gives too much credit to multiple-word phrase matches, which overrides the contribution of one-word matches. In other words, the BLEU score places more importance on correct word order rather than correct lexical choice. Thus, Zhang et al. argue that the BLEU score essentially measures

"document similarity," and not actual translational closeness (2004:2052). However, given this

information, Zhang et al. maintain that while the BLEU score does not necessarily replace human judgment, it still provides a meaningful method of assessment. Additionally, differences

between BLEU scores can be meaningful, and methods such as bootstrap resampling may help to calculate the statistical significance of differences between BLEU scores (Koehn 2004).

For the purposes of this paper, I rely on the BLEU score to evaluate the adequacy of the Tuvan

MT Project's English to Tuvan SMT system.

2.4 Hybrid Machine Translation

It is worth mentioning that a combined, or hybrid, RBMT and SMT system is also feasible. This type of system, known as hybrid machine translation, is becoming increasingly prevalent, as it

attempts to address the issues that continue to plague both SMT and RBMT. Namely, SMT research struggles with ungrammatical output, while RBMT research grapples with high cost and

inefficiency (Eisele et al. 2008). Several relatively new MT systems currently use this combined

paradigm, the most prominent of which is AppTek, launched in 2009 (Sawaf et al. 2010).

AppTek represents the machine translation industry's first completely integrated

statistical and rule-based system. Most other hybrid systems fall under two categories: a rule-based system with subsequent statistical processing and a primarily statistical engine guided by rules. In the former, a linguistically informed rule-based engine performs the bulk of the translation; statistics are used only as part of the post-editing process. The latter approach

employs linguistic rules in order to inform and guide the statistical engine; these rules may be used in the post-editing stage as well (Boretz 2009). Instead of adding rules to a statistical system

or a statistical module to a rule-based system, fully hybrid engines like AppTek integrate both

statistical and rule-based features from the very start. While research is ongoing and hybrid MT

literature is somewhat scarce, results thus far have shown that, compared to its counterparts, hybrid MT output achieves greater fluency and more aptly translates both "literal and non-literal

meanings" of a text (Boretz 2009: 1). With more language pairs continually being added to

systems like AppTek, hybrid MT is poised to be the promising future of machine translation.

3. The Tuvan Machine Translation Project

Using a model developed by Microsoft Research called the Microsoft Translator Hub, the Tuvan

Machine Translation Project is an ongoing research initiative begun in 2012 both at Swarthmore

College in Pennsylvania and in native speaker workshops in the Republic of Tuva headed by

Russian linguist and fluent Tuvan speaker Vitaly Voinov. Student researchers include

Swarthmore students Peter Nilsson, Kathryn Montemurro, and Rachel Killackey. Faculty

advisors to the Tuvan MT Project include Swarthmore Professors K. David Harrison and Nathan

Sanders. This project is also indebted to the following Tuvan speakers who carried out post-editing efforts on the Microsoft Translator Hub's Tuvan output from August to September 2012:

Vitaly Voinov, Sasha Ondar, Sean Quirk, Chinchi Kungaa Villiere, Valeria Kulundarij, Rollanda

Kongar, and Aldynai Seden-Xuurak.

The Tuvan MT Project set out with the goal of creating a functional MT system with

English as the source language and Tuvan as the target language as well as a reversed model (not

analyzed in this thesis) with Tuvan as the source language and English as the target language.

We chose to make Tuvan the target and not the source language as the primary model in the hopes that Tuvan speakers who may not understand English at all would be able to easily input

English text for translation. Additionally, these individuals are better equipped as native speakers

to make target language corrections during post-editing.

The Microsoft Translator Hub is an SMT system developed by Microsoft Research that

was released for public and commercial use in July 2012. The purpose of the Hub is to provide

communities of speakers and businesses with free access to create their own automatic language translation systems (Microsoft 2012). Microsoft also describes a major goal of the Hub as

supporting languages that are not made available for automatic translation by the other major

providers such as Google Translate. According to Microsoft Research, only about 100 out of the world's 7000+ languages are currently available for automatic machine translation (Microsoft

2012). At the same time, many of these unsupported minority and non-dominant languages are

quickly losing speakers as dominant languages like English, Chinese, and Russian become more necessary for global exchange. Thus, access to SMT technology like the Microsoft Translator

Hub, which affords communities the independence to organize their own MT efforts, could have transformative effects on minority and non-dominant language speakers who are pressured to abandon their native tongue in favor of a more global one.

3.1 Tuvan Language

In this section, I provide a basic sketch of Tuvan language and grammar, emphasizing elements

of complex nominal and verbal morphology that have proven challenging for our English to

Tuvan system to learn. Tuvan is a member of the eastern branch of the Turkic language family

spoken in the Republic of Tuva in south central Siberia. The Republic of Tuva is under the

governance of the Russian Federation, and as a result most Tuvan people are also fluent in

Russian (Harrison 2000). Tuvan is a minority language, meaning it is a non-official language

spoken by a relatively small group of people who constitute a national minority. Furthermore,

Tuvan is also a low-density language, which indicates that there is very little online-accessible material written in the language (Koehn 2005). These two factors make it all the more significant to create a Tuvan MT system.

Like most members of the Turkic family, Tuvan exhibits agglutinative, highly suffixing nominal and verbal morphology. Tuvan has a phoneme inventory of 16 vowels and 20

consonants, given in Table 3.1 and Table 3.2, respectively1; these are adapted from Anderson

and Harrison (1999). As Tuvan vowels are contrastive for length, there are eight distinct vowel

qualities in addition to the long vowels. This symmetrical patterning is typical of Turkic languages, allowing Tuvan's vowels in particular to group together according to backness, height, and rounding.

1 As Tuvan is customarily written in Cyrillic script, I give both a Cyrillic transcription (Cyr) and a turcological transcription (Turk) common to Turkic linguistics to represent the Tuvan characters.

                    Front                            Back
                    Unround        Round             Unround        Round
                    Cyr   Turk     Cyr   Turk        Cyr   Turk     Cyr   Turk
High      Short     и     i        ү     ü           ы     ï        у     u
          Long      ии    ii       үү    üü          ыы    ïï       уу    uu
Non-high  Short     е     e        ө     ö           а     a        о     o
          Long      ее    ee       өө    öö          аа    aa       оо    oo

Table 3.1. Tuvan vowel phoneme inventory.

               Labial       Labio-dental    Alveolar      Palatal      Velar
               Cyr  Turk    Cyr  Turk       Cyr  Turk     Cyr  Turk    Cyr  Turk
Plosive        б п  b p                     д т  d t                   к г  k g
Nasal          м    m                       н    n                     ң    ŋ
Trill                                       р    r
Fricative                   ф    f          с з  s z      ш ж  š ž     х    x
Affricate                                                 ч    č
Lateral                                     л    l
Approximant    в    v                                     й    y

Table 3.2. Tuvan consonant phoneme inventory.

Tuvan consonants undergo several predictable alternations given in Table 3.3, adapted from Harrison (2000).

Basic Form            Surface Forms
Cyr      Turk         Cyr           Turk
п        p            п б в м       p b v m
т        t            т д н         t d n
к        k            к г           k g
ч        c            ч ж           c ž
н        n            н т д         n t d
л        l2           л т д н       l t d n
з        z            с з           s z

Table 3.3. Consonant surface alternations in Tuvan.

2 When /l/ appears in the onset of an enclitic, its possible surface alternations are restricted to only [l] and [n] because of its position at a morpheme boundary.

Furthermore, several sequences of labial and velar stops undergo metathesis, which in

Tuvan refers to the switching of contiguous sounds. Most commonly, the consonant cluster /pk/ «пк» becomes /kp/ (even across word boundaries) in Russian loanwords. This type of metathesis is especially prevalent in the rapid speech of younger speakers. For example, the Russian loanword in (7) frequently undergoes p/k metathesis (Harrison 2000:16).

(7) пропка → прокпа
    propka → prokpa
    'cork' (from Russian propka)

A second type of metathesis also occurs in Tuvan. Stop+fricative sequences become

fricative+stop, but never across word boundaries. This type of metathesis is observed in

borrowed words and rapid speech, and dialects vary in their use of this rule. For example, the

desiderative affix /-ks-/ «-кс-» often surfaces as [-sk-], as in (8) (Harrison 2000:17).

(8) /aksa/ → [aska] 'money'

Velar deletion occurs in intervocalic environments in Tuvan, targeting velars in both

codas and onsets. While onset velars are found in nine productive (derivational and inflectional)

suffixes, coda velars are present in both nominal and verbal stems (both mono- and polysyllabic).

For monosyllabic words, /g/ «г», which sometimes surfaces as [y], is deleted when it occurs in the coda before a vowel-initial suffix (9). When this process creates a potential vowel hiatus (i.e., two

different vowel sounds occurring in adjacent syllables), the non-high vowel wins out (10)

(Harrison 2000:88).

(9) /ug + -u/ → [uu] (sometimes [uyu])
    direction -3POSS
    'his/her/its direction'

(10) /og + -u/ → [oo]
     yurt -3POSS
     'his/her/its yurt'

For polysyllabic roots, both /k/ «к» and /g/ «г» are deleted in the same context as with monosyllabic roots (11). However, certain "short" velar-initial suffixes (12) and "short" velar-final stems that contain a high vowel (13) don't allow the velar to be deleted (Harrison

2000:95).

(11) /idik + -i/ → [idii]
     boot(s) -3.POSS
     'his/her/its boot(s)'

(12) /kizi + -ge/ → [kizige], [*kizie], [*kizee]
     person -DAT
     'to the person'

(13) /ug + -gan/ → [uggan], [*ugan], [*uan], [*aan]
     lift -PAST.I
     'lifted'

Tuvan also exhibits a phonological process that makes the language particularly difficult

for a statistical machine translation system to learn - vowel harmony, which requires that certain vowels agree according to certain phonological features. Most Tuvan vowels are harmonic for backness, each vowel taking the backness of the preceding vowel. Thus, all vowels in a given word must belong to either the front class [i ii eo] «H y e e» or the back class [i u a 0] «bI y a

0». However, there are exceptions to Tuvan vowel harmony, such as the presence of the

allative, diminutive, and durative suffixes, as well as certain lexicalized Russian loan words and the ablaut (Harrison 2000). The following examples demonstrate backness harmony, with a front

harmonic word (14) and a back harmonic word (15) (Harrison 2000:112). Both words contain

the harmonizing plural, possessive, and ablative suffixes in order to clearly illustrate vowel

harmony3.

3 For Tuvan words and sentences that require a morphological parsing, I provide the original Tuvan sentence in Cyrillic on the first line, the same sentence in turcological notation on the second line, the morphological parsing on the third line, the morpheme-by-morpheme gloss on the fourth line, and the complete English gloss on the fifth line.

(14) истеримден
     isterimden
     is -ter -im -den
     footprint -PL -1SG.POSS -ABL
     'from my footprints'

(15) аттарымдан
     attarimdan
     at -tar -im -dan
     name -PL -1SG.POSS -ABL
     'from my names'

Additionally, Tuvan exhibits rounding harmony that targets only high vowels. Rounding harmony is generally progressive, and requires that a vowel be round if it occurs near another round vowel (Harrison 2000). Tuvan round vowels include [ü ö u o] «ү ө у о». Rounding harmony is most clearly shown through the third person possessive (3POSS) suffix. In (16) and

(17), the vowel in the 3POSS suffix is high. Thus, the round vowel (either [o] or [u]) in the stem causes the high, unround vowel in the suffix to become round (either [ü] or [u]) (Harrison

2000:132).

(16) хөлү
     xölü (*xöli)
     xöl -ü
     lake -3POSS
     'her/his/its lake'

(17) улузу
     uluzu (*uluzi)
     ulu -zu
     dragon -3POSS
     'her/his/its dragon'

Nominal inflection in Tuvan is quite regular, with nouns inflecting for number,

possession, and case (in that order). As Tuvan is an agglutinative, almost exclusively suffixing

language, each suffix generally contains only one piece of grammatical information. The plural marker precedes both possessor markers and case suffixes. While the singular suffix is unmarked, the plural suffix has eight phonologically conditioned variants, which are shown in

(18) (Harrison 2000:20).

(18) бүрүлер      эмнер          холдар      инектер
     bürüler      emner          xoldar      inekter
     'leaves'     'medicines'    'hands'     'cows'

     номнар       хөлдер         аттар       хадылар
     nomnar       xölder         attar       xadïlar
     'books'      'lakes'        'horses'    'pines'

Furthermore, there are six cases in Tuvan that are realized morphologically on the noun:

locative (LOC), ablative (ABL), dative (DAT), accusative (ACC), genitive (GEN), and allative (ALL).

Aside from the allative case, each case suffix is subject to phonological processes like vowel harmony. Therefore, each case suffix has several phonologically conditioned surface variants.

However, the basic form is the form that suffixes to vowel-final stems. I give an example of a vowel-final stem, along with each case suffix in (19) (Harrison 2000:21). Nominative case is unmarked.

(19) 'leaf'
     NOM  бүрү       bürü
     ACC  бүрүнү     bürünü
     GEN  бүрүнүң    bürünüŋ
     DAT  бүрүге     bürüge
     LOC  бүрүде     bürüde
     ABL  бүрүден    bürüden
     ALL  бүрүже     bürüze

Finally, the person and number of the possessor are marked morphologically on the nominal "possessum" (Harrison 2000:26). Again, each possessor suffix has several

phonologically conditioned surface variants. In (20), I provide an example of each possessor

suffix (1SG.POSS, 2SG.POSS, 3.POSS, 1PL.POSS, 2PL.POSS), which precedes the unmarked nominative case (Harrison 2000:26). The singular and plural forms of the third person possessive suffix are identical.

(20)      'my book'   'your (SG) book'   'its/his/her/their book'   'our book'   'your (PL) book'
     NOM  номум       номуң              ному                       номувус      номуңар
          nomum       nomuŋ              nomu                       nomuvus      nomuŋar

Verbal morphology, however, is not nearly as regular as its nominal counterpart. As verbal morphology did not play as big a part in generating original parallel corpora for the

English to Tuvan system, I describe it only briefly here. Verbs are marked for the person and number of the subject, in addition to tense, mood, aspect, and negation. In fact, the most

prominent feature of Tuvan's verbal morphology is its TAM-system (i.e., Tense Aspect Mood

system), which uses a range of affixes and auxiliaries to mark temporal, aspectual, and modal

categories (Anderson and Harrison 1999). With regards to tense, Tuvan exhibits a distinction

between only present/future (P/F) (21), the meaning of which depends on context, and past (22-

23) (Harrison 2000:36-37). There is also a further distinction in the past tense between non-

assertive (indefinite, PAST.I) past, as in (22), and assertive (definite, PAST.II), as in (23). As with nominal morphology, the surface form of these suffixes may change depending on their

phonological environment. Personal pronouns, such as the first men in (22), are optional and are

frequently omitted.

(21) Бо улус мени билир.
     Bo ulus meni bilir.
     bo ulus me -ni bil -ir
     this people 1SG -ACC know -P/F
     'These people know me.'

(22) (Мен) келген мен.
     (Men) kelgen men.
     men kel -gen men
     1SG come -PAST.I 1SG
     'I have come.'

(23) Келдим.
     Keldim.
     kel -di -m
     come -PAST.II -1SG
     'I came.'

There are two types of verbal person markers in Tuvan: suffixal (CLASS.I), as in (24), and

enclitic (CLASS.II), as in (25) (Harrison 2000:35). Each class of affix is grammatical in only

certain types of clauses.

(24) (Силер) ушту-ңар.
     (Siler) ustu-ŋar.
     si -ler us -tu -ŋar
     2 -PL fly -PAST.II -2PL
     'You flew.'

(25) (Бис) чанаар бис.
     (Bis) canaar bis.
     bis cana -ar bis
     1PL go.home -P/F 1PL
     'We're going home.'

Tuvan also exhibits six aspectual and modal constructions - iterative, perfective, resultative, cessative, emphatic, and unaccomplished (about to be realized in the near future) - in

addition to various modal categories, including the conditional, concessive, evidential past and

present, imperative, optative, and desiderative. Both aspectual and modal constructions are realized by either an affix or enclitic. Furthermore, negation in Tuvan can be expressed by either

an affix or lexeme. Several verbal modifiers such as converbs (cv) and participles (both a type of

affix) may also be present in a sentence in order to assist the main verb in expressing some type

of semantic component. The distinction between participles and converbs in Turkic literature has traditionally been identified with converbs being unable carry inflection; however, inflection is

found on certain converbs in Tuvan (Anderson and Harrison 1999).

The final element of Tuvan verbal morphology that I discuss is the auxiliary verb (AUX), which generally appears after a converb form of the lexical stem. While they are technically

semantically bleached, auxiliaries are ubiquitous in the Tuvan verbal system and index a wide range of modal and aspectual categories (Anderson and Harrison 1999). For example, the

auxiliary verb alir (sometimes aar) either marks the self-benefactive voice or functions as the

capabilitive mood (CAP) marker when it follows the lexical verb stem, as in (26). Additionally, the common auxiliary verb bar or baar marks a completed/perfective action, as in (27)

(Anderson and Harrison 1999:62-63).

(26) Ол бижий албас.
     Ol biziy albas.
     ol bizi -y al -bas
     s/he write -CV CAP -NEG.FUT
     'She can't write.'

(27) Ол бижий соксап барган.
     Ol biziy soksap bargan.
     ol bizi -y soks -ap bar -gan
     s/he write -CV stop -CV AUX -PAST.I
     'She stopped writing.'

3.2 General Methodology Using Microsoft Translator Hub

The process for creating an MT system on the Microsoft Translator Hub is relatively simple. The

basic procedure is outlined in Figure 7.

[Figure 7: process diagram with three stages: 1. Setup Project (create project, invite people); 2. Conduct Trainings; 3. Share, Translate & Improve.]

Figure 7. Microsoft Translator Hub process diagram (Microsoft 2012).

First, a project must be created referencing the desired source and target languages.

Subsequently, any researchers and post-editors, provided with the necessary login information,

are invited to join the project. Primary researchers are named "Co-owners," while post-editors

are called "Reviewers." All relevant parallel corpora and monolingual documents may then be uploaded to the Hub. The Hub uses these documents as a basis for building the translation

system. This initial step is known as training. Microsoft suggests assembling no fewer than

10,000 total parallel sentences for training a system in order to ensure optimum performance. In this way, the Hub is able to learn the following from these data (Microsoft 2012):

• How words, phrases, and sentences are commonly translated. • How context, or surrounding phrases, affects translation (i.e., a particular word may not always translate the same way). • How to conjugate verbs and to inflect nouns, which may vary depending on context.

Once the parallel and monolingual documents have been uploaded, the source-target

language system is ready to be trained. Smaller parallel documents (around 2,500 sentences) may

also be added to a tuning data set "in order to set the parameters of the system to the optimal values" (Microsoft 2012). Documents added to the tuning set must be of excellent quality (i.e.,

carefully and precisely aligned) because the system puts extra weight on this data set to assist in

generating translations. Accordingly, these documents are excluded from the training data set.

There is also an option to have the system randomly choose the tuning data set for itself from the

provided parallel corpora. However, the Tuvan MT Project observed no difference in BLEU

score between any combination and type of tuning data sets. Thus, we are still relatively unclear

as to the effect of the tuning data set. Additionally, a small group of parallel sentences (again,

around 2,500 lines) may be uploaded as the testing data set. These documents are used to

compute the translation system's BLEU score and, as with the tuning set, are not included in training the system. Once again, there is also an option to allow the system to select its own testing data set.
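As an illustration only (the Hub accepts these sets as uploaded documents and can also select them itself), a project might carve its aligned sentence pairs into the three data sets roughly as follows; the function and sizes are assumptions based on the figures above.

import random

def split_corpus(sentence_pairs, tuning_size=2500, testing_size=2500, seed=0):
    """Divide aligned sentence pairs into training, tuning, and testing sets."""
    pairs = list(sentence_pairs)
    random.Random(seed).shuffle(pairs)
    tuning = pairs[:tuning_size]
    testing = pairs[tuning_size:tuning_size + testing_size]
    training = pairs[tuning_size + testing_size:]  # tuning/testing are excluded from training
    return training, tuning, testing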

Next, the number and type of documents can be adjusted and manipulated through training new source-target language systems. The Hub allows multiple systems to be trained within a given model (i.e., language pair) with the goal of achieving the highest possible BLEU

score. This also allows BLEU scores to be compared across each trained system. Once training has been completed, the desired system can be deployed, and fluent source and target language

speakers, or reviewers, are able to make interactive suggestions on how to further refine translations based on the results of the system's output under the "Translate Documents" feature.

Here, reviewers are also able to approve or reject suggested translations from others.

Furthermore, MT researchers may both evaluate the results of the training and review the post-editing corrections throughout this stage. The process of reviewing and correcting the system is

also known as post-editing.

During deployment, the system is also available for source to target language translation,

under the "Test System" feature. Testing involves inputting source language text, which the

deployed system renders into target language text in a format that is meant to be

indistinguishable from the finished translation system product. The final step is going live with the system, or making it available to the public. To do this, a system generally must achieve the minimum BLEU score required for public use, which is 20 (Microsoft 2012). The English to Tuvan system

developed by the Tuvan MT Project received a maximum BLEU score of 48.99.

3.3 Main Tasks

The first, and arguably most substantial, task in creating the English to Tuvan system involved

aligning almost 40,000 sentences in a portion of the English-Tuvan parallel corpora in order to

ensure a 1:1 sentence correspondence between the source and target language versions of the

documents. The Microsoft Translator Hub learns translations one sentence at a time, by "reading

a sentence, the translation of this sentence, and then aligning words and phrases in these two

sentences to each other" (Microsoft 2012). This process enables the Hub to create a map of the words and phrases in one sentence to the equivalent words and phrases in the translation of this

sentence in the other language, ensuring that the system trains on sentences that are actually translations of each other. Consequently, alignment is an important task, as the system will fail to

learn from the data if any two parallel documents differ from each other by more than 5%

(Microsoft 2012).
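One rough, automated interpretation of this constraint is sketched below: compare the non-empty line counts of a source-target file pair before upload. The file names and the exact check are hypothetical; they are not part of the Hub.

def alignment_ok(source_path, target_path, tolerance=0.05):
    """Flag parallel files whose non-empty line counts differ by more than 5%."""
    with open(source_path, encoding="utf-8") as f:
        source_lines = [line for line in f if line.strip()]
    with open(target_path, encoding="utf-8") as f:
        target_lines = [line for line in f if line.strip()]
    longer = max(len(source_lines), len(target_lines))
    difference = abs(len(source_lines) - len(target_lines))
    return longer > 0 and difference / longer <= tolerance

# Hypothetical file names:
print(alignment_ok("prince_caspian_en.txt", "prince_caspian_tyv.txt"))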

For a full list of the parallel and monolingual texts used in the Tuvan MT Project, refer to

Table 3.4.

Name of Text                               Type of Text    Number of Sentences
The Bible (NIV)                            Parallel        29,573
The Lion, the Witch, and the Wardrobe      Parallel        1,489
Prince Caspian                             Parallel        1,072
Voyage of the Dawn Treader                 Parallel        1,099
Generated Sentences 1 (LOC, ABL)           Parallel        648,956
Generated Sentences 2 (ACC)                Parallel        46,354
Noun Paradigm (1SG.FUT)                    Parallel        1,197
Noun Paradigm (1SG.PAST)                   Parallel        1,165
Noun Paradigm (2SG.QUES)                   Parallel        1,197
"Boktu Kirish"                             Parallel        184
"Because I Love You"                       Parallel        49
"Tos-karak"                                Parallel        10
TyvaWiki4                                  Parallel        464
Adjectives5                                Parallel        934
Tuvan Government Articles6                 Monolingual     102
"Jehovah's Witness" Articles7              Monolingual     2,597
www8                                       Monolingual     303
Tuvan News Articles9                       Monolingual     2,276
Tuvan Poems10                              Monolingual     1,277
Tuvan Language Blog11                      Monolingual     392

Table 3.4. Training documents for English to Tuvan model of Tuvan MT Project.

The bulk of the parallel corpora used in training each English-Tuvan system included both the

entire Bible (New International Version) and three books of The Chronicles of Narnia (The Lion,

4 These texts can be found online at: http://www.tyvawiki.org/wiki/Main_Page.
5 "Adjectives" refers to the complete list of Tuvan adjectives found on the Tuvan Talking Dictionary.
6 These news articles may be found online at: http://gov.tuva.ru/news.aspx.
7 These articles can be found online at: http://www.watchtower.org/vi/bh/article_00.htm.
8 "www" denotes random monolingual Tuvan documents found online at: http://www.lrd63.narod2.ru/.
9 These texts can be found online at: http://orlan.tuva.ru/.
10 Tuvan poems can be found online at: http://lrd63.narod2.ru/.
11 This text can be found online at: http://orlan.tuva.ru/index.php.

the Witch, and the Wardrobe; Prince Caspian; and Voyage of the Dawn Treader), each of which was translated with help from or directly by fluent Tuvan speaker and Russian linguist Vitaly

Voinov. We also carried out sentence alignment on three short stories: "Boktu Kirish," "Tos-karak," and "Because I Love You." During this initial step, we also crawled the web in order to

Using cues such as proper names, numbers, paragraph breaks, and punctuation, we

segmented each text so that one line in the source language text corresponded to the same line in the target language text. Because none of the student researchers on the Tuvan MT Project

understood (written or spoken) Tuvan, ascertaining relevant sentence boundaries during sentence

alignment was difficult and often involved a bit of approximation. Frequently, one sentence in the source language corresponded to two or more sentences in the target language, or vice versa.

Fortunately, the Microsoft Translator Hub ignores punctuation like periods or semicolons, and

assumes that any text on one line of the target language is a translation of the same line of text in the source language. Therefore, removing corresponding source and target language lines of text with mismatching numbers of sentences was not necessary.

Instead, multiple sentences in either a source or target language text remain on the same

line as long as these sentences match the same line in the second parallel text. Many lines found

in both the English and Tuvan versions of The Chronicles of Narnia: Prince Caspian parallel texts exemplify this 2:1 line correspondence. Example (28), which is line 47 from Chapter Three

of the English version of Prince Caspian, contains two sentences. However, sentence (29), which is the corresponding line 47 from the Tuvan text, contains one sentence. Using proper

names and the surrounding context, the student researchers determined that these lines are in fact

translations of each other. Therefore, the two English sentences remain on one line.

(28) The Dwarf stared round at all four of them. He had a very curious expression on his face.

(29) Гном дөрт кижиже сонуургап көрген.
     Gnom dört kizize sonuurgap körgen.

While the majority of sentences in the corpora are aligned appropriately, instances of misalignment are bound to occur. When researchers with little knowledge of the target language

perform the semi-automated process of sentence alignment on target language texts, such mistakes are to be expected. Yet, the English to Tuvan system does appear to be learning and translating correctly; therefore, misalignment is most likely not a statistically significant factor.

One instance of misalignment occurs in example (30), or line 76 from Chapter Ten of the

English text, which contains only one sentence. Example (31) contains two sentences, appearing

on the corresponding line 76 of the Tuvan text.

(30) Susan looked at him very hard and was quite sure from the expression on his face that he was not making fun of them.

(31) Суузен профессорже ийи караа улгады берген көрүп турган, ынчалза-даа ооң арын-шырайының аянындан алырга, кеңгү баштактанмайн турар хире. Ам база шкафты. Кижилер ораны талазынд.
     Suuzen professorže iyi karaa ulgadi bergen körüp turgan, incalza-daa ooŋ arin-sirayiniŋ ayanindan alirga, keŋgü bastaktanmayn turar xire. Am baza skafti, kiziler orani talazind.

Upon review of the aligned corpora, David Harrison observed that example (31) contains the extraneous fragment Am baza skafti, kiziler orani talazind12, which is not a translation of line

76 in (30), but is instead part of the translation of the following line 77. While (31) did remain on

one line during the English to Tuvan system training, such misalignment errors did not appear to

substantially affect the results of the training.

12 While this fragment has no discernible meaning on its own, I provide its morphological analysis below: AM 6a3a mKa

During sentence alignment, we also encountered a problem differentiating between orthographic and formatting hyphenation. In many Tuvan texts of the Tuvan MT Project's parallel corpora, clitics and compounds are represented orthographically with a hyphen; words that are broken at the end of a line of a Tuvan text are also split by a hyphen. While Tuvan

compounds and words containing clitics did not need to be altered, each word containing an end-

of-line hyphen was re-unified and the hyphen was removed, so that the MT system would not mistake these hyphens for indicators of compounds or clitics. In order to do this, David Harrison

systematically perused the Tuvan texts, using his extensive knowledge of the language to remove the hyphens that corresponded to end-of-line characters. The Tuvan word in (32) contains the

emphatic enclitic -daa; the hyphen in this word is orthographic and is therefore not removed.

Example (33), lines 47 and 48 of the Tuvan version of The Lion, the Witch, and the Wardrobe,

contains an instance of an end-of-line hyphen. Example (33a) contains line 47 and (33b) contains

line 48 of the text, while (33c) is the English gloss of both lines. For the purposes of the Project,

lines (33a) and (33b) were conjoined and the hyphen was removed from copseeresken 'they

agreed,' as the word does not normally contain a hyphen.

(32) KhIM-.n:aa kim-daa 'anyone,' 'no one'

(33) a. ApTKaH ypyrnap myII'ry qerrIll33- Artkan urugdar suptu cop see- b. -peIllKeH. OJIapHhIH; y)l(ypan.n:aphl IllaK hIHQaap 3reJI33H. -pesken. Olarnil] uzuraldari sak incaar egeleen. c. 'Everyone agreed to this and that was how the adventures began.'
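As a rough illustration of this de-hyphenation step, here is a minimal sketch that rejoins words split across lines under the convention seen in (33), while leaving word-internal hyphens such as the one in kim-daa (32) untouched. The function name and the sample lines in the usage example are hypothetical; in the Project, the cleanup was performed by hand.

```python
# A minimal sketch of rejoining words split by end-of-line hyphenation, assuming
# the convention seen in (33): the broken word ends line n with '-' and its
# remainder begins line n+1 with '-'. Word-internal hyphens, as in kim-daa (32),
# never sit at a line boundary and are left untouched.
def rejoin_hyphenated(lines):
    out, i = [], 0
    while i < len(lines):
        line = lines[i].rstrip()
        # Keep merging while both halves carry the line-break hyphen.
        while line.endswith("-") and i + 1 < len(lines) and lines[i + 1].lstrip().startswith("-"):
            nxt = lines[i + 1].strip()
            line = line[:-1] + nxt[1:]
            i += 1
        out.append(line)
        i += 1
    return out

sample = ["Kim-daa chedip kel-",   # hypothetical line ending in a formatting hyphen
          "-gen."]                 # continuation of the split word
print(rejoin_hyphenated(sample))   # ['Kim-daa chedip kelgen.']
```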

Second, the researchers generated original noun paradigms, as well as full sentences, in both Tuvan and English, adding to the English to Tuvan system's parallel corpora. As Tuvan nominal morphology is fairly regular, we were able to create three basic noun paradigms: first

person singular future (34), first person singular past (35), and second person singular question

(36). In creating these noun paradigms, phonological processes such as vowel harmony and velar

deletion, along with exceptions to various morphological rules had to be factored into the

production. In (34)-(36), I have provided an example from each noun paradigm. These

paradigms were created with the goal of teaching the English to Tuvan system complex Tuvan

morphology in a clear, direct way.

(34) MeH 6y3ype,IlJlp MeH. Men büzüredir men. men büzüred -ir men 1SG persuade -P/F 1SG 'I will persuade.'

(35) MeH eCKepTKeH MeH. Men oskertken men. men oskert -ken men 1SG agonize -PAST.I 1SG 'I agonized.'

(36) CeH Ke3e,Il;IIp ceH 6e? Sen kezedir sen be? sen kezed -ir sen be 2SG punish -P/F 2SG QUES 'Will you punish?'

In order to generate full Tuvan sentences, we used two carrier sentences to provide a

basic skeleton onto which we inserted Tuvan nouns. As with the noun paradigms, complex

aspects of mainly nominal, but also some verbal morphology, as well as phonological processes

like vowel harmony mentioned in Section 3.1 had to be accounted for. While many of the

resulting sentences are semantically nonsensical, they are still syntactically grammatical and therefore useful for the goal of teaching the system Tuvan morphology. We inflected each noun

found on the Tuvan Talking Dictionary13 for two of Tuvan's six total cases (locative [LOC] and

ablative [ABL]), in addition to both the singular (SG) and plural (PL) number suffixes and the

13 The Tuvan Talking Dictionary can be found online at: http://tuvan.swarthmore.edu/.

person/number of the possessor. Possessor suffixes include the first person singular/plural (1SG/PL.POSS), second person singular/plural (2SG/PL.POSS), third person singular/plural (3SG/PL.POSS) markers, and a single unpossessed form. As there are two number, one case, and

seven possessor suffixes for each carrier sentence, these combinations result in a total of 14

sentences per noun.

Each carrier sentence contains two slots for nouns - NOUN.X and NOUN.Y - but only one of these nouns (NOUN.Y) changes its inflection in each carrier sentence; NOUN.X maintains the same inflection throughout the sentences generated for each carrier sentence. Additionally, because these generated sentences became part of the system's parallel corpora, we also created the

corresponding English sentences. Ultimately, we produced a total of just fewer than 650,000

original parallel sentences.

For the carrier sentence in (37), we constructed noun forms containing all of the possible

combinations of forms for NOUN.Y-LOC using number and possessor markers.

(37) NOUNy-LOC 6HP yJIyr NOUNx 6ap. NOUNy-LOC bir ulug NOUNx bar. 'One big x is in the y.'

Given NOUN.X yala 'punishment' and NOUN.Y saliimniimda 'ability', we produced 14 total sentences in which NOUN.Y is inflected for number, case, and possession. In (38)-(40), I

provide three examples of these sentences. In each sentence, the noun stem saliimni- attaches to the locative suffix -da, which marks present tense location. Both sentences (38) and (39) are unmarked for singular number; sentence (40) contains the plural suffix -lar. Finally, sentence

(38) contains the first person singular possessive suffix -im, while (39) and (40) contain the second person singular possessive suffix -iŋ.

(38) CaJIhIbIMHbIhIM.n:a 6HP ynyr 5IJIa 6ap. Saliimniimda bir ulug yala bar. saliimni -im -da bir ulug yala b -ar ability -1SG.POSS -LOC one big punishment be -P/F 'The one big punishment is in my ability.'

(39) CaJIbIbIMHbIbIH:.n:a 6HP yJIyr 5IJIa 6ap. Saliimniiŋda bir ulug yala bar. saliimni -iŋ -da bir ulug yala b -ar ability -2SG.POSS -LOC one big punishment be -P/F 'The one big punishment is in your ability.'

(40) CaJIbIbIMHbIbIJIapbIH.n:a 6HP ynyr 5IJIa 6ap. Saliimniilarinda bir ulug yala bar. saliimnii -lar -iŋ -da bir ulug yala b -ar ability -PL -2SG.POSS -LOC one big punishment be -P/F 'The one big punishment is in your abilities.'
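For concreteness, the following sketch enumerates the 2 x 7 x 1 = 14 combinations per noun for the locative carrier sentence (37). The suffix shapes are the basic citation forms given in this section (the 2PL possessive shape is a placeholder guess), and the sketch deliberately omits the vowel harmony, velar deletion, and other adjustments that the Project had to factor into production.

```python
# A sketch enumerating the 14 forms per noun for carrier sentence (37),
# 'One big X is in the Y.' Basic suffix shapes only; no phonological adjustments.
NUMBER = [("SG", ""), ("PL", "lar")]
POSSESSOR = [("UNPOSS", ""), ("1SG.POSS", "im"), ("2SG.POSS", "iŋ"), ("3SG.POSS", "zi"),
             ("1PL.POSS", "ivis"), ("2PL.POSS", "iŋar"), ("3PL.POSS", "i")]  # 2PL is a guess
LOC = "da"
POSS_EN = {"UNPOSS": "the", "1SG.POSS": "my", "2SG.POSS": "your", "3SG.POSS": "its",
           "1PL.POSS": "our", "2PL.POSS": "your", "3PL.POSS": "their"}

def locative_pairs(stem_y, en_y_sg, en_y_pl, noun_x, en_x):
    """Yield (Tuvan, English) pairs for NOUN.Y-LOC in carrier sentence (37)."""
    for num_label, num in NUMBER:
        for poss_label, poss in POSSESSOR:
            tuvan = f"{stem_y}{num}{poss}{LOC} bir ulug {noun_x} bar."
            en_y = en_y_sg if num_label == "SG" else en_y_pl
            yield tuvan, f"The one big {en_x} is in {POSS_EN[poss_label]} {en_y}."

pairs = list(locative_pairs("saliimni", "ability", "abilities", "yala", "punishment"))
print(len(pairs))  # 14 sentences per noun
print(pairs[1])    # the 1SG.POSS singular form, cf. (38)
```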

For carrier sentence (41), we again constructed all of the combinations of forms for

NOUN.Y-ABL using the ablative case, which marks motion away from a location.

(41) NOUNx NOUNy-ABL yJIyr. NOUNx NOUNy-ABL ulug. 'x is bigger than y.'

Given NOUN.X bodaarizi 'decision' and NOUN.Y oskertirinden 'declension', we produced the following three of the 14 total sentences, (42)-(44). In each sentence, the noun stem oskertir-

attaches to the ablative suffix -ten or -den. Sentences (42) and (43) are unmarked for the

singular number, while (44) contains the plural suffix -ler. Additionally, sentence (42) contains the first person plural possessive suffix -ivis and (43) contains the second person singular possessive suffix -iŋ; sentence (44) is unpossessed. The noun stem bodaar- also attaches to the third person singular possessive suffix -zi in each sentence.

(42) 60.n:aaPbI3bI eCKepTHpHBHcTeH ynyr. Bodaarizi oskertirivisten ulug. bodaari -zi oskertir -ivis -ten ulug decision -3SG.POSS declension -1PL.POSS -ABL big 'Its decision is bigger than our declension.'

(43) 60.n:aaPhI3hI eCKepTllplUJ:.n:eH ynyr. Bodaarizi oskertiriŋden ulug. bodaari -zi oskertir -iŋ -den ulug decision -3SG.POSS declension -2SG.POSS -ABL big 'Its decision is bigger than your declension.'

(44) 60.n:aaPhI3hI eCKepTllpHJIep.n:eH ynyr. Bodaarizi oskertirilerden ulug. bodaari -zi oskertiri -ler -den ulug decision -3SG.POSS declension -PL -ABL big 'Its decision is bigger than the declensions.'

Once all of the parallel and monolingual corpora had been uploaded to the Hub, we

created various systems (12 in total), using different combinations of corpora in order to achieve the highest possible BLEU score. In other words, each system contains different combinations of

parallel and monolingual documents (see Appendix A). The goal of this task was to discern whether certain documents were more or less helpful in teaching the system English to Tuvan translation. While we used BLEU scores to compare the relative successes of these systems, we

found that the scores could be misleading. For example, system #11 shows the largest BLEU

score. Previously, the highest BLEU score was achieved by system #10, with a score of 9.32. However, the addition of 650,000 lines of generated sentences increased the BLEU score to 48.99 in

system #11.

Thus, a disproportionately large addition of parallel corpora to a system skews the BLEU score, making the system appear to perform better than it actually does. Because the testing documents in system #11 are randomly selected aligned sentences from all of the parallel

corpora, it appears that the vast number of generated sentences overwhelmed the testing set so that the system was tested on grammatical constructions similar to those in its training set.
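A rough, back-of-the-envelope illustration of this effect follows. The generated-corpus size is the figure reported above, but the size of the remaining (natural) parallel data is a placeholder, since the thesis does not report an exact line count for it.

```python
# A rough illustration of the test-set imbalance behind system #11's BLEU score.
generated_lines = 650_000   # Gen1 template sentences added to system #11
natural_lines = 50_000      # hypothetical size of the non-generated parallel data
share = generated_lines / (generated_lines + natural_lines)
print(f"Expected share of template sentences in a random test split: {share:.0%}")
# With these placeholder numbers, roughly 93% of a randomly drawn test set would
# consist of the same templates the system was trained on, which is consistent
# with the jump to a BLEU score of 48.99 in system #11.
```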

Additionally, we did not observe any major changes in BLEU scores of systems with different tuning or testing documents. However, it is important to note that the BLEU scores of systems with different tuning and testing sets cannot be meaningfully compared, reducing the possible

choices of systems to compare.
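To make this test-set dependence explicit, here is a simplified single-reference BLEU calculation in the spirit of Papineni et al. (2002); the Hub's internal implementation may differ in tokenization and smoothing, so this is only a sketch. Every quantity in it, the clipped n-gram precisions and the brevity penalty, is computed over one specific test set, which is why scores from different test sets are not comparable.

```python
# A simplified single-reference BLEU (Papineni et al. 2002), for illustration only.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for hyp, ref in zip(hypotheses, references):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())  # clipped counts
            total += max(len(hyp) - n + 1, 0)
        log_precisions.append(math.log(matched / total) if matched else float("-inf"))
    hyp_len = sum(len(h) for h in hypotheses)
    ref_len = sum(len(r) for r in references)
    log_brevity = min(0.0, 1 - ref_len / hyp_len)  # log of the brevity penalty
    return 100 * math.exp(log_brevity + sum(log_precisions) / max_n)

# Post-edited sentence (50b) used as both hypothesis and reference scores 100.
sent = "caŋgis kizil baziŋni körüp tur men".split()
print(round(bleu([sent], [sent]), 2))
```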

After the English to Tuvan system #11 had been deployed, the final task involved

creating a set of about 3,000 basic English sentences in order to test the system. These sentences

differ minimally from each other, containing the same basic words aside from slight changes in

inflection (e.g., tense, number) or the addition of prepositional phrases. The purpose of this task was to isolate areas of Tuvan syntax or morphology that may pose problems for the English to

Tuvan system. In (45)-(49), I provide an example set of the generated English sentences. These

sentences constitute a part of the post-editing analysis in Section 4 and the re-training in Section

5.

(45) I see one red house.
(46) I do not see one red house.
(47) I see one red house down the street.
(48) You see two red houses.
(49) Yesterday you saw one red house.
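A minimal sketch of how such a minimally varying test set could be generated from a single template is given below. The slot values are illustrative only; the Project's actual set of about 3,000 sentences covered more lexical and grammatical variation.

```python
# A minimal sketch of generating minimally differing English test sentences like (45)-(49).
from itertools import product

SUBJECTS = [("I", "see", "saw"), ("You", "see", "saw")]
POLARITY = ["", "do not "]
NUMBERS  = [("one", "house"), ("two", "houses")]
PLACES   = ["", " down the street"]

def test_sentences():
    for (subj, pres, past), neg, (num, noun), place in product(SUBJECTS, POLARITY, NUMBERS, PLACES):
        yield f"{subj} {neg}{pres} {num} red {noun}{place}."
        if not neg:  # one past-tense variant per affirmative present-tense sentence
            low = subj if subj == "I" else subj.lower()
            yield f"Yesterday {low} {past} {num} red {noun}{place}."

for sentence in list(test_sentences())[:5]:
    print(sentence)
```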

In summary, twelve English to Tuvan test systems were created under the English to

Tuvan model of the Tuvan MT Project. Only one system (#11), which received a BLEU score of

48.99, was deployed. Out of the twelve total trained systems under the English to Tuvan model,

system #11 received the highest BLEU score. Refer to Appendix A for a table of each system in the English to Tuvan model, along with its corresponding BLEU score and a list of the parallel

and monolingual texts used.

4. Post-editing

In this section, I review the fluent speaker post-editing data, with particular attention to the missing affix that I will be using in Section 5 on re-training. I collected a sample of 47 sentences that have gone through a post-editing process by fluent Tuvan speakers. In addition to the post-

edited sentences, I have included the ungrammatical Hub output and the original English

sentence. The Tuvan MT Project student researchers created each of the original English source sentences as part of the testing data set under system #11; therefore, the system had never

encountered these exact sentences before. In my analysis of the post-editing corrections made to these 47 sentences, I observe that the Tuvan to English system consistently makes several

lexical, morphological, and syntactic errors. These errors, which I have collated in tables found

in Appendix B, fall into the following seven typological categories: Extra Word, Missing Word,

Wrong Word, Wrong Affix, Missing Affix, Extra Affix, and Word Order/Syntax.
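The tallies in Appendix B can be thought of as counts over annotations of the following shape. The two records in the sketch reproduce only the errors identified for (50) and (53) later in this section; they are illustrative and not the full 47-sentence data set.

```python
# A minimal sketch of tallying post-editing errors by typological category.
from collections import Counter

CATEGORIES = ["Extra Word", "Missing Word", "Wrong Word", "Wrong Affix",
              "Missing Affix", "Extra Affix", "Word Order/Syntax"]

annotations = [
    {"english": "I see one red house.",
     "errors": ["Wrong Word", "Missing Affix"]},                     # cf. (50a)
    {"english": "You do not see two red houses.",
     "errors": ["Extra Word", "Missing Affix", "Missing Affix",
                "Wrong Affix", "Missing Word", "Missing Word"]},     # cf. (53a)
]

tally = Counter(error for a in annotations for error in a["errors"])
for category in CATEGORIES:
    print(f"{category:<18} {tally[category]}")
```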

In sentences (50)-(55),14 I provide examples of each error type found in the output of

English to Tuvan system #11, in the order that the errors occur in the Hub output sentence. There

are many instances in which one sentence contains errors from more than one category. The number of errors and the error types are in bold.

(50) a. Elip KhI3hID 6a)l(bIH; Kapyn Typ MeH. Bir kizil baZilJ kOriip tur men. b. lJaH;rbIc KbI3bID 6a)l(bIH;HbI Kapyn TYP MeH. CalJgis kizil bazilJni kOriip tur men. c. 'I see one red house.'

There are two errors in sentence (50a): wrong word (bir 'one' should be caŋgis 'lone') and missing suffix (ACC -ni from baziŋni 'house'). Each of these errors is corrected in sentence (50b).

(51) a. ,.qyyH 611p KbI3bID 6a)l(bIH; KapreH cllJIep. Dillin bir kizil baZilJ korgen siler. b. ,.qyyH ceH -qaH;rbIC KbI3bID 6a)l(bIH; KapreH ceH. Dillin sen caIJgis kizil bazilJ korgen sen. c. 'Yesterday you saw one red house.'

14 Each example is written in the customary Cyrillic script, along with a turcological transliteration underneath. The English to Tuvan system output is found in (a), while the post-edited sentence is found in (b). Additionally, I have provided the original English source sentence as a gloss in (c).

There are three errors in sentence (51a): missing word (the second person nominative singular pronoun sen 'you'); wrong word (bir 'one' should be caŋgis 'lone'); and once again wrong word (the second person nominative plural pronoun siler 'you' should be the singular sen 'you'). Each of these errors is corrected in sentence (51b).

(52) a. Ky,llyWI)'HYH; .lJ:Y)K)'H.lJ:a qaH;rbIC KbI3bIJI 6a)l(bIH; Kepyn TYP MeH. KudumcunUlJ duzunda caIJgis kizil baziIJ kortip tur men. b. KY.lJ:YMqy a.lJ:aaH.lJ:a qaH;rbIC KbI3bIJI 6a)l(bIH; Kopyrr TYP MeH. Kudumcu adaanda caIJgis kizil baziIJ kortip tur men. c. 'I see one red house down the street.'

There are two errors in sentence (52a): extra suffix (GEN -nuŋ from kudumcunuŋ 'street') and wrong word (duzunda 'against' should be adaanda 'down'). Both of these errors are

corrected in (52b).

(53) a. CeH HMH KbI3bIJI 6a)l(bIH; Kep6ec. Sen iyi kizil baziIJ korbes. b. liMH KbI3bIJI 6a)l(bIH;HapHbI Kep6eMH Typ CHJIep. Iyi kizil baziIJnarni korbeyn tur siler. c. 'You do not see two red houses.'

There are six errors in sentence (53a): extra word (the second person nominative singular pronoun sen 'you'); missing suffix (PL -nar from baziŋnarni 'houses'); missing suffix (ACC -ni from baziŋnarni 'houses'); wrong suffix (NEG.P/F -bes from korbes 'see' should be NEG.PAST.II -beyn); and two missing words (AUX tur and the second person nominative plural pronoun siler 'you'). The suffix -ni is targeted for correction in Section 5. Each of these errors is

corrected in sentence (53b).

(54) a. KY.lJ:yMqyHyH; .lJ:y)K)'H.lJ:a HMH KbI3bIJI 6a)l(bIH;HbI Kepyn TYP CHJIep. KudumcunUlJ duzunda iyi kizil ba.ziI]ni kortip tur siler. b. KY.lJ:YMqy a.lJ:aaH.lJ:a HMH KbI3bIJI 6a)l(bIH;HapHbI Kepyrr Typ ceH. Kudumcu adaanda iyi kizil baziIJnarni kortip tur sen. c. 'You see two red houses down the street.' KILLACKEY 40

There are four errors in sentence (54a): extra suffix (GEN -nuŋ from kudumcunuŋ 'street'); wrong word (duzunda 'against' should be adaanda 'down'); missing suffix (PL -nar from baziŋnarni 'houses'); and once again wrong word (the second person nominative plural

pronoun siler 'you' should be the singular sen 'you'). Each of these errors is corrected in (54b).

(55) a. MeH .zwpreH qeMHeHllp MeH. Men diirgen cemnenir men. b. M33H; qeMHeHHpHM .zwpreH. Mee1) cemnenirim diirgen. c. 'I eat quickly.'

There are four errors in sentence (55a): wrong word (the first person nominative singular pronoun men 'I' should be the genitive meeŋ 'my'); wrong word order/syntax (dürgen 'quick' cemnenir 'eat' should be cemnenirim 'I eat' dürgen); missing suffix (1SG -im from cemnenirim 'I eat'); and extra word (the first person nominative singular pronoun men 'I'). Each of these errors is corrected in (55b).

5. Re-training

Given the errors outlined in Section 4 and in Appendix B, I observe that the English to Tuvan

system #11 most frequently produces morphological errors involving incorrect, missing, or extra

Tuvan suffixes. I now provide the results of re-training the English to Tuvan model of the Tuvan

MT Project with a new system #13. I target the accusative case, of the basic form -ni (along with

several surface forms), for correction in the hopes that the new system will not produce any

errors involving this suffix. Unlike in many other languages, the accusative case in Tuvan not

only marks the direct object of a verb, but also adds the property of definiteness (Anderson and

Harrison 1999). For example, the direct object book in the English sentence 'I saw a book' would not take the accusative case in Tuvan. However, in the sentence 'I saw the book,' book would take the accusative case, as the noun is definite (indicated by the presence of the determiner the

in English).

In order to correct the Tuvan accusative case, Tuvan MT Project student researcher Peter

Nilsson and I created an original corpus of 46,354 parallel English and Tuvan sentences

containing the accusative case suffix. The procedure for generating this corpus mirrors that

delineated in Section 3.3. Using carrier sentence (56), we inserted each noun found on the Tuvan

Talking Dictionary into the appropriate slot indicated by NOUN (in this case, there is only one noun slot), inflecting for possession, number, and of course accusative case.

(56) NOUN-ACC Kep;::(YM. NOUN-ACC kordiim. 'I saw the noun.'

Once again, this gave us a total of 14 sentences per noun (1 case, 7 possessive, 2 number

suffixes). I give three examples of these sentences in (57)-(59). Each example contains the

accusative case suffix -ni, as this is the suffix that is being targeted for correction in this section.

While (57) is unmarked in the singular number, sentences (58) and (59) contain the plural suffix

-nar. Additionally, sentence (57) contains the first person singular possessive suffix -im, (58)

contains the second person singular possessive suffix -iŋ, and (59) contains the third person

plural possessive suffix -i. The verb stem kor- 'see' attaches to the PAST-II tense -dii and the first

person singular -m suffixes in each example.

(57) AMhITaHbIMHbI Kep;::(YM. Amitanimni kordiim. amitan -im -ni kor -dii -m animal -1SG.POSS -ACC see -PAST.II -1SG 'I saw my animal.' (58) AMbITaHHapbIH;HbI Kep;::(YM. Amitannariŋni kordiim. amitan -nar -iŋ -ni kor -dii -m animal -PL -2SG.POSS -ACC see -PAST.II -1SG 'I saw your animals.'

(59) AMhITaHHapbIHhI Kep):{yM. Amitannarini kordiim. amitan -nar -i -ni kor -dii -m animal -PL -3PL.POSS -ACC see -PAST.II -1SG 'I saw their animals.'

As BLEU scores can be meaningfully compared only between systems with identical tuning and testing data sets, I would have preferred to replicate system #11 in creating the new system #13, and subsequently to deploy system #13. This would have allowed me to use the

convenient "Test System" feature, available only under deployed systems, to input an English

sentence and automatically receive the Tuvan output. However, Microsoft Translator Hub allows

only one system to be deployed at a time, and because system #11 is currently deployed, I am unable to deploy a new system. Therefore, I created two new systems under the English to Tuvan model: #13 and #14. System #13 contains identical training and tuning documents to system #11

(see Appendix A). For system #13 and #14's testing data set, I used a subset of the post-edited

sentences (some of which are evaluated in Section 4; see Appendix B for the rest), in which I

observed consistent errors involving the accusative case. System #14 replicates system #13, with the addition of the generated parallel corpus containing the accusative case suffix to the training

data set. Thus, I am able to evaluate the results of system #14 against #13 for changes both in

BLEU score and performance involving the accusative case.
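The comparison itself was made by inspecting the Hub's "Evaluate Results" view, but the check being applied can be sketched as follows: does a word that carries the accusative in the post-edited reference appear without the suffix in the #13 output and with it in the #14 output? The list of accusative surface shapes below is an incomplete assumption, and the transliterations in the usage example are normalized.

```python
# A sketch of the accusative-correction check applied between systems #13 and #14.
ACC_SHAPES = ("ni", "nı", "ti", "tı", "di", "dı")  # assumed variants of basic -ni

def tokens(sentence):
    return [w.strip(".,?!") for w in sentence.split()]

def has_accusative(word):
    return word.endswith(ACC_SHAPES)

def accusative_fixed(reference, old_output, new_output):
    acc_words = [w for w in tokens(reference) if has_accusative(w)]
    return bool(acc_words) and all(
        w not in tokens(old_output) and any(has_accusative(t) for t in tokens(new_output))
        for w in acc_words)

# Example (60): post-edited reference (53b), system #13 output (60a), system #14 output (60b).
print(accusative_fixed("Iyi kizil baziŋnarni körbeyn tur siler.",
                       "Sen körbeyn tur iyi silgi baziŋnar.",
                       "Sen körbeyn tur iyi silgi baziŋni."))  # True
```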

The results of training systems #13 and #14 first show that changing the testing data set

from randomly selected text to the post-editing sentences caused a dramatic decrease in BLEU

score from 48.99 in system #11 to 5.30 in system #13. Furthermore, the addition of the

accusative case parallel corpus to system #14 also caused a relatively small decrease in BLEU

score; system #13 received a BLEU score of 5.30, and #14 received a BLEU score of 4.74.

While the decrease in score in both systems is troubling, the cause is unclear. However, it does

appear that even slightly altering the testing data set can have a noticeable (primarily negative)

effect on the BLEU score of a system.

Next, I compare the "Evaluate Results" section of system #14 to system #13 in order to

see if the addition of the parallel corpus that targets the accusative case can improve the

performance of the system with respect to the accusative case. Out of the 13 total test sentences targeted in system #14, the accusative case is successfully corrected in each one ofthese 13

sentences (see Appendix C). In each instance, the accusative case suffix is missing in the output

of system #13; the output of system #14 shows that the case has been added correctly. I was not

able to find any new, extraneous examples of the accusative suffix (i.e., appearing where it's not

supposed to be) in system #14. I provide three examples of the corrected sentences in (60)-(62).15

(60) a. CeH Kep6eMH TYP HMH llHJIrH 6a)l(bIH;Hap. Sen korbeyn tur iyi silgi baZilJnar. b. CeH Kep6eMH TYP HMH llHJIrH 6a)l(bIH;HbI. Sen korbeyn tur iyi silgi baZilJni. c. 'You do not see two red houses.'

In sentence (60b) from system #14, the accusative suffix is found on the noun baziŋni 'house.' The form of this noun is corrected from the output of system #13 in (60a), in which it appears as baziŋnar, without the accusative case suffix -ni. From the post-edited sentence (53b), we know that the correct form of 'houses' in the sentence 'You do not see two red houses' is baziŋnarni. While (60b) contains other grammatical errors (e.g., baziŋni is missing the plural suffix -nar), it is clear that the accusative case has been corrected from (60a).

(61) a. Ky,llYWI)' 6a,[(bll HMH 6a)l(bIH;Hap llHJIrH Keep ceH. Kudumcu badip iyi bazilJar silgi koor sen. b. Ky,[(yMqy 6a,[(bll HMH 6a)l(bIH;HapHbI KbI3bIJI Keep ceH. Kudumcu badip iyi bazilJnarni kizil koor sen. c. 'You see two red houses down the street.'

15 The output sentence from system #13 is found in (a), while the output sentence from system #14 is found in (b). The English gloss is provided in (c). KILLACKEY 44

From the post-edited sentence (54b), we know that the correct form of 'houses' in the sentence 'You see two red houses down the street' is baziŋnarni. While the output of system #13 in (61a) produces the incorrect form baziŋar, the output of system #14 in (61b) produces the correct form baziŋnarni. In this example, it appears that the system has learned the accusative case.

(62) a. KhI3hIJI 6ap 6a)l(bIH;ra KapyH;ep. Kizil bar baii1Jga kori.i1)er. b. Kapyn KbI3bIJI 6a)l(bIH;HbI 6ap. Kortip kizil bazi1Jni bar. c. 'I see one red house.'

In sentence (62b), system #14 has once again corrected the form of 'house' from baziŋga to baziŋni. The post-edited sentence (50b) tells us that the correct form of 'house' in 'I see one red house' is baziŋni. Thus, sentences (60b)-(62b), along with the rest of the examples from system #14 given in Appendix C, generally show that providing the English to Tuvan system with a parallel corpus that contains the accusative case has improved the performance of the

system with respect to that case.

6. Conclusion

As the field of machine translation expands to include translation to and from many of the world's major languages, it becomes clear that minority and low-density languages like Tuvan

are being overlooked. In this thesis, I have examined the English to Tuvan system developed by the Tuvan Machine Translation Project, which hopes to initiate efforts to rectify this problem, using the paradigm of statistical machine translation. However, both major paradigms

of machine translation - statistical and rule-based - have issues that prevent them from

consistently and efficiently producing high quality output, and consequently from being widely

available for public use.

In my analysis of the English to Tuvan system, I have shown that SMT systems make

predictable grammatical errors, particularly when languages with complex, agglutinative morphology like Tuvan are involved. By feeding corpora that corrects the error(s) into the

system, the performance of the system can be improved. In order to expand on the analysis

presented in Section 5, I would hope to add a larger set of test sentences so that the performance

of system #14 with respect to the accusative case is based on a larger range of data. While

system #14 corrects the accusative case in each test sentence provided in Section 5, it is entirely

possible that the system would make different kinds of errors if presented with more

grammatically complex data.

However, this method is overall inefficient, as it may take a long time to account for each

error that an SMT system makes. It would be more effective to incorporate a degree of linguistic

knowledge, such as morphological rules and phonological processes, in MT systems from the

start. Hybrid machine translation attempts to do just this, and hybrid systems such as AppTek

build in equal parts of rule-based and statistical methods (Sawaf et al. 2010). While research in hybrid MT is relatively limited at this point, I contend that this paradigm holds the best hope for the highest quality MT output. In the future, I hope that hybrid MT will also look toward minority and low-density languages like Tuvan in creating new systems, as these languages have much to tell us about flawed areas of MT that call for improvement. KILLACKEY 46

Appendices

Appendix A.16 All training documents for English to Tuvan systems #1-14.

Training Training Tuning Documents Testing Documents BLEU System # Documents Documents (Parallel) (Parallel) Score (Parallel) (Monolingual) Bible (NIV) JW All Web TuvanPoems LWW PC 1 TyvaWiki DT Random Random 8.49 Adjectives NP All Stories 2 #1 #1 LWW(HQ) Bible: Chs. 9, 13,42 7.45 3 #1, no Adjectives #1, no DT and PC #2 #2 7.58 4 #2 #3 Random #2 7.83 Bible (NIV) Bible: Chs. 9, 13, LWW(HQ) 14, 15,42 All Stories 5 #3 8.33 LWW PC LWW: Ch.13 TyvaWiki 6 #5 #3 Random Random 8.17 7 #5 #3 PC DT 3.26 Bible (NIV) All Stories 8 #3 Random Random 7.46 All Narnia TyvaWiki 9 #8 None Random Random 7.90 Bible (NIV) All Web 10 All Narnia Random Random 9.32 JW All stories 11 #10, plus Gen1 #10 Random Random 48.99 Bible: Ch. 2 12 #10 #10 Random 5.05 LWW 13 #11 #10 Random PE 5.30 14 #11, plus Gen2 #10 Random PE 4.74

16 List of Abbreviations:
JW: Jehovah's Witness articles
All Web: Tuvan Gov., JW, WWW, Tuvan news articles, Tuvan poems, and Tuvan language blog
LWW: The Chronicles of Narnia: The Lion, the Witch, and the Wardrobe
PC: The Chronicles of Narnia: Prince Caspian
DT: The Chronicles of Narnia: The Voyage of the Dawn Treader
NP: Noun paradigms (1SG.FUT, 1SG.PAST, 2SG.QUES)
All Stories: "Tos-karak," "Because I Love You," and "Boktu Kirish"
HQ: denotes 'high quality,' a document that contains only certain lines from the original text that have been carefully and precisely aligned
Tuvan Gov.: Tuvan Government texts
Gen1: generated sentences containing LOC and ABL case suffixes
Gen2: generated sentences containing ACC case suffix
PE: post-editing sentences given in Appendix B

Appendix B.17 Post-editing corrections to English to Tuvan system #11.

Table BI. Post-editing pt 1· Original English Sentence , Translation Hub Output, Post-edited Sentence Row Original English Translation Hub Output Post-edited Sentence # Sentence 1 I see one red house. Ellp KhI3bIJI 6IDKbIH; Kapyn TYP MeH. 1JaH;rbIc KbI3bIJI 6mKbIH;HbI Kapyn TYP MeH. 2 I see one red house. Ellp KhI3bIJI 6IDKbIH; Kapyn TYP MeH. 1JaH;rbIc KbI3bIJI 6mKbIH; Kapyn TYP MeH. 3 I see one red house. Ellp KhI3bIJI 6IDKbIH; Kapyn TYP MeH. 1JaH;rbIc 6mKbIH;ra TYP MeH. 4 I see one red house down KY.L\YM'I)'HYH; Y)I()'H)l;a qaH;rbIC KY.L\YM'I)' a)l;aaH)l;a qaHrbIC KbI3bIJI the street. KbI3bIJI 6mKbIH; Kapyn TYP MeH. 6mKbIH Kopyn TYP MeH. 5 I see one red house down KY.L\YM'I)'HYH; Y)I()'H)l;a qaH;rbIC KY.L\YM'I)')l;a qaHrbIC KbI3bIJI 6mKbIH the street. KbI3bIJI 6mKbIH; Kapyn TYP MeH. Kapyn TYP MeH. 6 I see one red house down KY.L\YM'I)'HYH; Y)I()'H)l;a qaH;rbIC KY.L\YM'I)')l;a qaH;rbIC KbI3bIJI 6mKbIH; the street. KbI3bIJI 6mKbIH; Kapyn TYP MeH. Kapyn TYP MeH. 7 I do not see one red house. MeH qaH;rbIC KbI3bIJI 6mKbIH;HbI 1JaHrbIc KbI3bIJI 6mKbIHHbI Kop6eMllH Kap6eMH TYP MeH. TYP MeH. 8 I do not see one red house. MeH qaH;rbIC KbI3bIJI 6mKbIH;HbI MeH qaHrbIC KbI3bIJI 6mKbIHHbI Kap6eMH TYP MeH. Kop6eMH TYP MeH. 9 I do not see one red house. MeH qaH;rbIC KbI3bIJI 6mKbIH;HbI 1JaHrbIc KbI3bIJI 6mKbIH;HbI Kop6eMllH Kap6eMH TYP MeH. TYP MeH. 10 I do not see one red house. MeH qaH;rbIC KbI3bIJI 6mKbIH;HbI MeH qaHrbIC KbI3bIJI 6mKbIH;HbI Kap6eMH TYP MeH. Kap6eMH TYP MeH. 11 Yesterday you did not see ,lJ,yyH 611P KbI3bIJI 6IDKbIH;HbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIHHbI one red house. Kap633H CllJIep. KopreH ceH. 12 Yesterday you did not see ,lJ,yyH 611P KbI3bIJI 6IDKbIH;HbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIHHbI one red house. Kap633H CllJIep. Kop633H ceH. 13 Yesterday you did not see ,lJ,yyH 611P KbI3bIJI 6IDKbIH;HbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIHHbI one red house. Kap633H CllJIep. Kap633H ceH. 14 Yesterday you did not see ,lJ,yyH 611P KbI3bIJI 6IDKbIH;HbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIH;HbI one red house. Kap633H CllJIep. Kap633H ceH. 15 Yesterday you did not see ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI two red houses. Kap633H-)l;IlP ceH. Kop633H ceH. 16 Yesterday you did not see ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI two red houses. Kap633H-)l;IlP ceH. Kap633H ceH. 17 Yesterday you did not see ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI two red houses. Kap633H-)l;IlP ceH. Kap633H-)l;IlP ceH. 18 Yesterday you saw one red ,lJ,yyH 611P KbI3bIJI 6IDKbIH; KapreH ,lJ,yyH ceH qaH;rbIC KhI3bIJI 6IDKbIH; house. CllJIep. KapreH. 19 Yesterday you saw one red ,lJ,yyH 611P KbI3bIJI 6IDKbIH; KapreH ,lJ,yyH ceH qaH;rbIC KhI3bIJI 6IDKbIH; house. CllJIep. KapreH ceH. 20 Yesterday you saw two red ,lJ,yyH KY.L\YMqYHYH; y)l()'H)l;a llMll ,lJ,yyH ceHll Kapyn KY.L\YM'I)' 6a)l;bIII houses down the street. KbI3bIJI 6mKbIH;HapHbI KapreH CllJIep. llMll 6IDKbIH;Hap IImJIrll. 21 Yesterday you saw two red ,lJ,yyH KY.L\YMqYHYH; y)l()'H)l;a llMll ,lJ,yyH ceH llMll KbI3bIJI 6mKbIH;HapHbI houses down the street. KbI3bIJI 6mKbIH;HapHbI KapreH CllJIep. KapreH ceH. 22 Yesterday you saw two red ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIHHbI houses. KapreH CllJIep. KopreH ceH. 23 Yesterday you saw two red ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIHHbI houses. KapreH CllJIep. KapreH ceH.

17 There are multiple post-edited versions of the same English source sentence in instances where reviewers disagree as to the best way to translate a sentence. Whenever possible, I have attempted to account for these differences by randomizing the selection of post-edited sentences in my analysis.

24 Yesterday you saw two red ,lJ,yyH llMll KbI3bIJI 6IDKbIH;HapHbI ,lJ,yyH ceH llMll KbI3bIJI 6mKbIH;HbI houses. KapreH CllJIep. KapreH ceH. 25 You do not see two red CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. houses. 26 You do not see two red CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. CeH llMll KbI3bIJI HbI Kap6eMH TYP houses. ceH. 27 You do not see two red CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. CeH llMll KbI3bIJI 6IDKbIH;HbI Kap6eMH houses. TYP ceH. 28 You do not see two red CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. CeH llMll KbI3bIJI 6IDKbIH; Kap6eMH TYP houses. ceH. 29 You do not see two red CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. MMII KbI3bIJI 6mKbIH;HapHbI Kap6eMH houses. TYP CllJIep. 30 You do not see two red CeH llMll KbI3bIJI 6IDKbIH; Kap6ec. MMII KbI3bIJI 6mKbIH; Kap6eMH TYP houses. CllJIep. 31 You see two red houses KY.L\YM'I)'HYH; Y)l(YH)l;a llMll KbI3bIJI KY.L\YM'I)' a)l;aaH)l;a llMll KbI3bIJI down the street. 6a)l(bIH;HbI Kapyn TYP CllJIep. 6a)l(bIH; KapHn TYP ceH. 32 You see two red houses KY.L\YM'I)'HYH; y~H)l;a llMll KbI3bIJI KY.L\YM'I)')l;a llMll KbI3bIJI 6a)l(bIH; down the street. 6a)l(bIH;HbI Kapyn TYP CllJIep. Kapyn TYP ceH. 33 You see two red houses. MMII KbI3bIJI 6a)l(bIH;HapHbI Kapyn MMII KbI3bIJI 6a)l(bIH Kapyn TYP ceH. TYP CllJIep. 34 You see two red houses. MMII KbI3bIJI 6a)l(bIH;HapHbI Kapyn MMII KbI3bIJI 6a)l(bIH; Kapyn TYP ceH. TYP CllJIep. 35 Your house is big. C33H; 6IDKbIH; ynyr 6oop. C33H; 6IDKbIH; ynyr 6oop. 36 Your house is big. C33H; 6IDKbIH; ynyr 6oop. CllJIepHllH; 6IDKbIH;bIH;ap ynyr. 37 Your house is big. C33H; 6IDKbIH; ynyr 6oop. C33H; 6IDKbIH;bIH; yJIyr. 38 Your houses are big. C33H; 6IDKbIH;Hap ynyr. C33H; 6IDKbIH;Hap ynyr. 39 Your houses are big. C33H; 6IDKbIH;Hap ynyr. CllJIepHllH; 6IDKbIH;HapbIH;ap yJIyr. 40 Your houses are big. C33H; 6IDKbIH;Hap ynyr. C33H; 6IDKbIH;HapbIH; ynyr. 41 My house is big. M33H; 6IDKbIH; ynyr 6oop. M33H; 6IDKbIH; ynyr 6oop. 42 My house is big. M33H; 6IDKbIH; ynyr 6oop. Ea)l(bIH;bIM ynyr 43 My house is big. M33H; 6IDKbIH; ynyr 6oop. M33H; 6IDKbIH;bIM ynyr.

44 Yesterday you saw one red ,lJ,yyH KY.L\YMqYHYH; y~H)l;a 611P ,lJ,yyH ceH KY.L\YM'I)'ra qaH;rbIC house down the street. KbI3bIJI 6a)l(bIH; KapreH CllJIep. Kapyrr6a)l;bIII 6ap 6IDKbIH;bIHra IImJIrll. 45 Yesterday you saw one red ,lJ,yyH KY.L\YMqYHYH; y~H)l;a 611P ,lJ,yyH ceH KY.L\YM'I)'ra qaH;rbIC KbI3bIJI house down the street. KbI3bIJI 6a)l(bIH; KapreH CllJIep. 6a)l(bIH; KapreH ceH. 46 Yesterday you saw one red ,lJ,yyH 611P KbI3bIJI 6IDKbIH; KapreH ,lJ,yyH ceH qaH;rbIC KbI3bIJI 6IDKbIH; house. CllJIep. KapreH. 47 Yesterday you saw one red ,lJ,yyH 611P KbI3bIJI 6IDKbIH; KapreH ,lJ,yyH ceH qaH;rbIC KbI3bIJI 6IDKbIH; house. CllJIep. KapreH ceH.

Table B2. Post-editing pt. 2: Extra Word, Missing Word, Wrong Word. Row # Extra Word Extra Word II Missing Word(s) Missing Word(s) II Wrong Word(s) Wrong Word(s) II 1 Ellp/qaH;rbIC 2 Ellp/qaH;rbIC 3 KbI3bIJI 4 y~H)l;a/a)l;aaH)l;a

5 y~H)l;a 6a)l(bIH;/6IDKbIH

6 y~H)l;a 7 MeH 6a)l(bIH;HbI/6a)l(bIHHbI Kap6eMH/Kop6eMllH KILLACKEY 49

8 '1aH;rblc/'1aHrbIC 6mKbIH;HbI/6mKbIHHbI 9 MeH '1aH;rblc/'1aHrbIC Kep6eitH/Kop6eitHH 10 '1aH;rblc/'1aHrbIC 11 6Hp/ceH HitH 6mKbIH;HbI/6mKbIHHbI 12 6Hp/ceH HitH 6mKbIH;HbI/6mKbIHHbI 13 6Hp/ceH HitH 6mKbIH;HbI/6mKbIHHbI 14 6Hp/ceH HitH 15 Kep633H-AHp/KOp633H 16 Kep633H-AHp/Kep633H 17 18 CHJlep 6Hp/ceH 19 '1aH;rbIC CHJlep/ceH 20 KbI3bIJI KepreH ceHH Kepyll 6aAbIII YX

Table B3. Post-editing pt. 3: Wrong Word, Wrong Affix, Missing Affix. Row Wrong Word(s) III Wrong Word(s) IV Wrong Affix Missing Affix # 1 -HbI 2 3 -ra 4 5 6 7 8 9 10 11 Kep633H/KopreH CHJIep/ceH 12 Kep633H/KOp633H 13 14 15 16 17 18 19 20 21 22 6mKblH/6IDKbIH 23 6mKblH/6IDKbIH 24 25 26 -6ec/-6eHH 27 -6ec/-6eHH -HbI 28 -6ec/-6eHH 29 -6ec/-6eHH -Hap 30 31 CHJIep/ceH -HbI 32 CHJIep/ceH -HYW-Aa 33 CHJIep/ceH 34 CHJIep/ceH 35 36 -blH;ap 37 -bIH; 38 39 -bIH;ap 40 -bIH; KILLACKEY 51

41 42 -bIM 43 -bIM

44 611p/6ap -H)'w-ra -reH/-yrr6a)l;bIII

45 -H)'w-ra 46 47

Table B4. Post-editing pt. 4: Missing Affix, Extra Affix, Word Order, No Changes. Row # Missing Affix II Extra Affix Extra Affix II Word Order No Changes 1 2 3

4 -H)'H; 5 6 7 8 9 10 11 12 13 14 15 16 17 " 18 19

20 -H)'H; -HbI 21 22 -Hap 23 -Hap 24 -Hap 25 " 26 27 28 29 -HbI 30

31 -H)'H; 32 -HbI 33 -Hap -HbI KILLACKEY 52

34 -Hap -HbI 35 " 36 37 38 " 39 40 41 " 42 43

44 -bIH; -ra 611P KbI3bIJI/ 611P KbI3bIJI 6mKbIH; KepreH 45 46 47 KILLACKEY 53

Appendix C. Accusative case from system #13 to system #14.

# English Gloss System #13 Output System #14 Output 1 'I see one red house.' KbI3bIJI 6ap 6IDKbIH:bIHra KepYH:ep. KbI3bIJI 6ap 6IDKbIH:HbI KepYH:ep. 'I see one red house down the KY.L\YM'I)' 6a)J,bIII 6ap 6mKbIHra KY.L\YM'I)' 6a)J,bIII 6ap 6mKbIHblra 2 street. ' KbI3bIJI KepYH:ep. KbI3bIJI KepYH:ep. 'You do not see two red CeH Kep6eMH TYP llMll IIIllJIrll CeH Kep6eMH TYP llMll IIIllJIrll 3 houses.' 6mKbIH:Hap. 6mKbIH:HbI. Kep6eMH TYP MeH KbI3bIJI Kep6eMH TYP MeH KbI3bIJI 4 'I do not see one red house.' 6mKbIH:bIHra. 6mKbIH:Hblra 6ap. 'Yesterday you did not see two ,lJ,yyH llMll 6IDKbIH:Hap Kep6eMH ,lJ,yyH llMll KbI3bIJI 6mKbIH:HapHbI 5 red houses.' TYpraH ceH. Kep6eMH TYpraH ceH. 'I see one red house down the KY)J,YM'I)' 6a)J,bIII 6ap 6mKbIH:bIHra KY.L\YM'I)' 6a)J,bIII 6ap 6mKbIH;blra 6 street. ' IIIllJIrll KepYH:ep. KbI3bIJI KepYH:ep. 'Yesterday you saw two red ,lJ,yyH llMll IIIllJIrll 6IDKbIH:Hap ,lJ,yyH ceH llMll KbI3bIJI 6mKbIH:HapHbI 7 houses.' Kepyrr ceH. Kep)J,YM. 8 'You see two red houses.' MMII IIIllJIrll 6IDKbIH:Hap Keep ceH. MMII KbI3bIJI 6mKbIH:HapHbI Keep ceH. 'Yesterday you did not see one ,lJ,yyH ceHll KepeMH TYpraH KbI3bIJI ,lJ,yyH ceHll Kep6eMH TYpraH KbI3bIJI 9 red house.' 6mKbIH:bIHra 6ap. 6mKbIH:Hblra 6ap. 'You see two red houses down KY.L\YM'I)' 6a)J,bIII llMll 6mKbIH:Hap KY.L\YM'I)' 6a)J,bIII llMll 6mKbIH:HapHbI 10 the street.' IIIllJIrll Keep ceH. KbI3bIJI Keep ceH. 'Yesterday you saw one red ,lJ,yyHY Kepyrr KY)J,YMqy 6a)J,bIII 6ap ,lJ,yyHY Kepyrr KY)J,YMqy 6a)J,bIII 6ap 11 house down the street.' 6mKbIH: KbI3bIJI. 6mKbIH;bI KbI3bIJI. 'Yesterday you saw one red 12 ,lJ,yyHY Kepyrr KbI3bIJI 6mKbI 6ap. ,lJ,yyHy Kepyrr KbI3bIJI 6mKbIH;bI 6ap. house.' 'Yesterday you saw two red ,lJ,yyH KY.L\YM'I)' 6a)J,bIII llMll ,lJ,yyH KY.L\YMqy 6a)J,bIII llMll 13 houses down the street.' 6mKbIH:Hap KbI3bIJI Kepyrr KaaH. 6mKbIH:HapHbI KbI3bIJI Kepyrr KaaH. KILLACKEY 54

References

Anderson, Gregory, and K. David Harrison. 1999. Tyvan. Languages of the world/materials: Volume 257. Munich: Lincom Europa.

Boretz, Adam. 2009. AppTek launches hybrid machine translation software. SpeechTechMag.com. 2 March 2009. Online: http://www.speechtechmag.com/Articles/News/News-Feature/AppTek-Launches-Hybrid-Machine-Translation-Software-52871.aspx.

Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics 16:79-85.

Eisele, Andreas, Christian Federmann, Hans Uszkoreit, Herve Saint-Amand, Martin Kay, Michael Jellinghaus, Sabine Hunsicker, Teresa Herrmann, and Yu Chen. 2008. Hybrid machine translation architectures within and beyond the EuroMatrix project. In Proceedings of the 12th Annual Conference of the European Association for Machine Translation 12:27-34.

Harrison, K. David. 2000. Topics in the phonology and morphology of Tuvan. Doctoral dissertation. Yale University. New Haven, CT.

Henisz-Dostert, Bozena, R. Ross Macdonald, and Michael Zarechnak. 1979. Trends in linguistics: Machine translation. Studies and Monographs 11. New York, NY: Mouton Publishers.

Hutchins, William J. 1986. Machine translation: Past, present, future. New York: Halsted Press.

Hutchins, W. John, and Harold L. Somers. 1992. An introduction to machine translation. London: Academic Press.

Johnson, Rod, Maghi King, and Louis des Tombe. 1985. Eurotra: A multilingual system under development. Computational Linguistics 11:155-169.

Koehn, Philipp. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing 1:388-395.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. Machine Translation Summit X. 79-86.

Koehn, Philipp. 2010. Statistical machine translation. New York, NY: Cambridge University Press.

Microsoft. 2012. Microsoft Translator Hub. Online: https://hub.microsofttranslator.com.

Nichols, Johanna. 1999. Linguistic diversity in space and time. Chicago: University of Chicago Press.

Nirenburg, Sergei. 1989. Knowledge-based machine translation. Machine Translation 4:5-24.

Nirenburg, Sergei, and Kenneth Goodman. 1998. Treatment of meaning in MT systems. In Nirenburg et al. 2003:281-293.

Nirenburg, Sergei, Harold L. Somers, and Yorick A. Wilks (eds.). 2003. Readings in machine translation. Cambridge, MA: MIT Press.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics 40:311-318.

Povlson, Claus, Nancy Underwood, Bradley Music, and Anne Neville. 1998. Evaluating text-type suitability for machine translation: A case study on an English-Danish MT system. In Proceedings of the First International Conference on Language Resources and Evaluation 1:27-31.

Reifler, Erwin. 1998. The mechanical determination of meaning. In Nirenburg et al. 2003:21-35.

Samuelson, Paul, and William Nordhaus. 2001. Economics. 17th ed. University of Virginia: McGraw-Hill.

Sawaf, Hassan, Mohammad Shihadah, and Mudar Yaghi. 2010. Hybrid machine translation. WIPO Patent No. WO/2010/046782. Geneva, Switzerland: World Intellectual Property Organization.

Soudi, Abdelhadi, Ali Farghaly, Günter Neumann, and Rabih Zbib. 2012. Challenges for machine translation. Philadelphia: John Benjamins Publishing Company.

Sumita, Eiichiro, Hitoshi Iida, and Hideo Kohyama. 1990. Translating with examples: A new approach to machine translation. TMI Conference 1990. 203-212.

Toma, Peter. 1977. Systran as a multilingual machine translation system. Overcoming the language barrier 1:569-581.

Zhang, Ying, Stephan Vogel, and Alex Waibel. 2004. Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? In Proceedings of the Fourth International Conference on Language Resources and Evaluation 4:2051-2054.