Machine and Monolingual Postediting: The AFRL WMT-14 System

Lane O.B. Schwartz        Air Force Research Laboratory    [email protected]
Timothy Anderson          Air Force Research Laboratory    [email protected]
Jeremy Gwinnup            SRA International†               [email protected]
Katherine M. Young        N-Space Analysis LLC†            [email protected]

† This work is sponsored by the Air Force Research Laboratory under Air Force contract FA-8650-09-D-6939-029.

Abstract

This paper describes the AFRL statistical MT system and the improvements that were developed during the WMT14 evaluation campaign. As part of these efforts we experimented with a number of extensions to the standard phrase-based model that improve performance on the Russian-to-English and Hindi-to-English translation tasks. In addition, we describe our efforts to make use of monolingual English speakers to correct the output of machine translation, and present the results of monolingual postediting of the entire 3003 sentences of the WMT14 Russian-English test set.

1 Introduction

As part of the 2014 Workshop on Machine Translation (WMT14) shared translation task, the human language technology team at the Air Force Research Laboratory participated in two language pairs: Russian-English and Hindi-English. Our submission builds on our system from IWSLT 2013 (Kazi et al., 2013). In this paper, we focus on enhancements to our procedures with regard to data processing and the handling of unknown words.

In addition, we describe our efforts to make use of monolingual English speakers to correct the output of machine translation, and present the results of monolingual postediting of the entire 3003 sentences of the WMT14 Russian-English test set. Using a binary adequacy classification, we evaluate the entire postedited test set for correctness against the reference. Using bilingual judges, we further evaluate a substantial subset of the postedited test set using a more fine-grained adequacy metric; using this metric, we show that monolingual posteditors can successfully produce postedited translations that convey all or most of the meaning of the original source sentence in up to 87.8% of sentences.

2 System Description

We submitted systems for the Russian-to-English and Hindi-to-English MT shared tasks. In all submitted systems, we use the phrase-based moses decoder (Koehn et al., 2007). We used only the constrained data supplied by the evaluation for each language pair when training our systems.

2.1 Data Preparation

Before training our systems, a cleaning pass was performed on all data. Unicode characters in the unallocated and private use ranges were all removed, along with C0 and C1 control characters, zero-width and non-breaking spaces and joiners, and directionality and paragraph markers.

2.1.1 Hindi Processing

The HindEnCorp corpus (Bojar et al., 2014) is distributed in tokenized form; in order to ensure a uniform tokenization standard across all of our data, we began by detokenizing this data using the Moses detokenization scripts. In addition to normalizing various extended Latin punctuation marks to their Basic Latin equivalents, following Bojar et al. (2010) we normalized the Devanagari Danda (U+0964), Double Danda (U+0965), and Abbreviation Sign (U+0970) punctuation marks to Latin Full Stop (U+002E), converted any Devanagari Digit to the equivalent ASCII Digit, and decomposed all Hindi data into Unicode Normalization Form D (Davis and Whistler, 2013) using charlint.¹ In addition, we performed Hindi diacritic and vowel normalization, following Larkey et al. (2003).

¹ http://www.w3.org/International/charlint
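As a concrete illustration, the sketch below approximates this cleaning pass and the Hindi-specific normalization. It is an illustrative reimplementation under our own assumptions about the relevant character classes, not the scripts actually used for the submission, and it omits the extended-Latin punctuation mapping and the Larkey-style diacritic normalization.

    import re
    import unicodedata

    # Zero-width and non-breaking spaces/joiners, directionality and paragraph
    # marks (an assumed, partial list of the characters removed).
    INVISIBLE = re.compile('[\u00a0\u200b-\u200f\u2028-\u202e\u2060\ufeff]')

    DEVANAGARI_PUNCT = {'\u0964': '.',   # Danda
                        '\u0965': '.',   # Double Danda
                        '\u0970': '.'}   # Abbreviation Sign

    def clean_line(line):
        """Approximate the corpus-wide cleaning pass of Section 2.1."""
        line = INVISIBLE.sub('', line)
        # Drop C0/C1 control, unassigned, and private-use code points.
        return ''.join(c for c in line
                       if unicodedata.category(c) not in ('Cc', 'Cn', 'Co'))

    def normalize_hindi(line):
        """Danda/Abbreviation Sign -> Full Stop, Devanagari digits -> ASCII,
        then decompose to Normalization Form D (Section 2.1.1)."""
        for src, tgt in DEVANAGARI_PUNCT.items():
            line = line.replace(src, tgt)
        line = ''.join(chr(ord('0') + ord(c) - 0x0966)
                       if '\u0966' <= c <= '\u096f' else c
                       for c in line)
        return unicodedata.normalize('NFD', clean_line(line))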

Since no Hindi-English development test set was provided in WMT14, we randomly sampled 1500 sentence pairs from the Hindi-English parallel training data to serve this purpose. Upon discovering duplicate sentences in the corpus, we removed from the sample 552 sentences that overlapped with the training portion, leaving a development test set of 948 sentences.

2.1.2 Russian Processing

The Russian sentences contained many examples of mixed-character spelling, in which both Latin and Cyrillic characters are used in a single word, relying on the visual similarity of the characters. For example, although the first letter and the last letter in the word cейчас appear visually indistinguishable, the former is U+0063 Latin Small Letter C and the latter is U+0441 Cyrillic Small Letter Es. We created a spelling normalization program to convert these words to all Cyrillic or all Latin characters, with a preference for all-Cyrillic conversion if possible. Normalization also removes U+0301 Combining Acute Accent and converts U+00F2 Latin Small Letter O with Grave (ò) and U+00F3 Latin Small Letter O with Acute (ó) to the unaccented U+043E Cyrillic Small Letter O (о).

The Russian-English Common Crawl parallel corpus (Smith et al., 2013) is relatively noisy. A number of Russian source sentences are incorrectly encoded using characters in the Latin-1 Supplement block; we correct these sentences by shifting the offending characters ahead by 0x350 code points into the correct Cyrillic character range.²

² For example: "Ñïðàâêà ïî ãîðîäàì Ðîññèè è ìèðà." becomes "Справка по городам России и мира."

We examine the Common Crawl parallel sentences and mark for removal any non-Russian source sentences and non-English target sentences. Target sentences were marked as non-English if more than half of the characters in the sentence were non-Latin, or if more than half of the words were unknown to the aspell English spelling correction program, not counting short words, which frequently occur as (possibly false) cognates across languages (English die vs. German die, or English on vs. French on, for example). Because aspell does not recognize some proper names, brand names, and borrowed words as known English words, this method incorrectly flags for removal some English sentences that have a high proportion of these types of words.

Source sentences were marked as non-Russian if less than one-third of the characters were within the Russian Cyrillic range, or if non-Russian characters equaled or outnumbered Russian characters and the sentence contained no contiguous sequence of at least three Russian characters. Some portions of the Cyrillic character set are not used in typical Russian text; source sentences were therefore marked for removal if they contained the Cyrillic extension characters Ukrainian I (і І), Yi (ї Ї), Ghe With Upturn (ґ Ґ), or Ie (є Є) in either upper- or lowercase, with exceptions for U+0406 Ukrainian I (І) in Roman numerals and for U+0491 Ghe With Upturn (ґ) when it occurred as an encoding error artifact.³

³ Specifically, we allowed lines containing ґ where it appears as an encoding error in place of an apostrophe within English words. For example: "Песня The Kelly Family Iґm So Happy представлена вам Lyrics-Keeper."

Sentence pairs where the source was identified as non-Russian or the target was identified as non-English were removed from the parallel corpus. Overall, 12% of the parallel sentences were excluded based on a non-Russian source sentence (94k instances) or a non-English target sentence (11.8k instances).

Our Russian-English parallel training data also includes a parallel corpus extracted from Wikipedia headlines (Ammar et al., 2013), provided as part of the WMT14 shared translation task. Two files in this parallel corpus (wiki.ru-en and guessed-names.ru-en) contained some overlapping data. We removed 6415 duplicate lines within wiki.ru-en (about 1.4%), and removed 94 lines of guessed-names.ru-en that were already present in wiki.ru-en (about 0.17%).
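The sketch below illustrates two of the repairs described above in Section 2.1.2: converting mixed Latin/Cyrillic spellings to all-Cyrillic, and shifting Latin-1 mojibake into the Cyrillic block. The homoglyph table is a partial assumption; the authors' full mapping and their Latin/Cyrillic preference logic are not reproduced here.

    LATIN_TO_CYRILLIC = {
        'a': 'а', 'c': 'с', 'e': 'е', 'o': 'о', 'p': 'р', 'x': 'х', 'y': 'у',
        'ò': 'о', 'ó': 'о',
        'A': 'А', 'B': 'В', 'C': 'С', 'E': 'Е', 'H': 'Н', 'K': 'К',
        'M': 'М', 'O': 'О', 'P': 'Р', 'T': 'Т', 'X': 'Х',
    }

    def is_cyrillic(ch):
        return '\u0400' <= ch <= '\u04ff'

    def fix_mixed_word(word):
        """Prefer an all-Cyrillic spelling when a word mixes Latin look-alikes
        into a Cyrillic word and every Latin character has a Cyrillic twin."""
        if not any(is_cyrillic(c) for c in word):
            return word
        if all(is_cyrillic(c) or c in LATIN_TO_CYRILLIC for c in word):
            return ''.join(LATIN_TO_CYRILLIC.get(c, c) for c in word)
        return word

    def fix_latin1_mojibake(line):
        """Shift Latin-1 Supplement letters ahead by 0x350 into the Cyrillic
        block, e.g. 'Ñïðàâêà' -> 'Справка' (the exact range is an assumption)."""
        return ''.join(chr(ord(c) + 0x350) if 0x00c0 <= ord(c) <= 0x00ff else c
                       for c in line)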

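A rough sketch of the sentence filters follows, using only the character-ratio heuristics; the aspell-based English vocabulary check is omitted, and counting alphabetic characters rather than all characters is our own simplification of the thresholds described above.

    def cyrillic_ratio(text):
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return 0.0
        return sum(1 for c in letters if '\u0400' <= c <= '\u04ff') / len(letters)

    def longest_cyrillic_run(text):
        best = run = 0
        for c in text:
            run = run + 1 if '\u0400' <= c <= '\u04ff' else 0
            best = max(best, run)
        return best

    def is_non_russian(src):
        """Under one third Cyrillic, or Cyrillic not in the majority and no run
        of three or more consecutive Cyrillic characters."""
        ratio = cyrillic_ratio(src)
        if ratio < 1.0 / 3.0:
            return True
        return ratio <= 0.5 and longest_cyrillic_run(src) < 3

    def is_non_english(tgt):
        """Over half of the letters fall outside the basic Latin alphabet
        (the spelling-based check is not reproduced here)."""
        letters = [c for c in tgt if c.isalpha()]
        if not letters:
            return True
        non_latin = sum(1 for c in letters if not ('a' <= c.lower() <= 'z'))
        return non_latin > len(letters) / 2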
2.2 Machine Translation

Our baseline system is a variant of the MIT-LL/AFRL IWSLT 2013 system (Kazi et al., 2013), with some modifications to the training and decoding processes.

2.2.1 Phrase Table Training

For our Russian-English system, we trained a phrase table using the Moses Experiment Management System (Koehn, 2010b), with mgiza (Gao and Vogel, 2008) as the word aligner; this phrase table was trained using the Russian-English Common Crawl, News Commentary, Yandex (Bojar et al., 2013), and Wikipedia headlines parallel corpora.

The phrase table for our Hindi-English system was trained using a similar in-house training pipeline, making use of the HindEnCorp and Wikipedia headlines parallel corpora.

2.2.2 Language Model Training

During the training process we built n-gram language models (LMs) for use in decoding and rescoring using the KenLM language modelling toolkit (Heafield et al., 2013). Class-based language models (Brown et al., 1992) were also trained, for later use in n-best list rescoring, using the SRILM language modelling toolkit (Stolcke, 2002). We trained a 6-gram language model from the LDC English Gigaword Fifth Edition, for use in both the Hindi-English and Russian-English systems. All language models were binarized in order to reduce model disk usage and loading time.

For the Russian-to-English task, we concatenated the English portion of the parallel training data for the WMT 2014 shared translation task (the Common Crawl, News Commentary, Wiki Headlines, and Yandex corpora) together with the shared task English monolingual training data (the Europarl, News Commentary, and News Crawl corpora) into a training set for a large 6-gram language model built with KenLM. We denote this model as "BigLM". Individual 6-gram models were also constructed from each respective corpus.

For the Hindi-to-English task, individual 6-gram models were constructed from the respective English portions of the HindEnCorp and Wikipedia headlines parallel corpora, and from the monolingual English sections of the Europarl and News Crawl corpora.

    Decoding Features
      P(f|e)
      P(e|f)
      P_w(f|e)
      P_w(e|f)
      Phrase Penalty
      Lexical Backoff
      Word Penalty
      Distortion Model
      Unknown Word Penalty
      Lexicalized Reordering Model
      Operation Sequence Model
    Rescoring Features
      P_class(E) – 7-gram class-based LM
      P_lex(F|E) – sentence-level averaged lexical translation score

Table 1: Models used in log-linear combination.

2.2.3 Decoding, n-best List Rescoring, and Optimization

We decode using the phrase-based moses decoder (Koehn et al., 2007), choosing the best translation for each source sentence according to a linear combination of decoding features:

    Ê = argmax_E  Σ_r λ_r h_r(E, F)                                   (1)

We make use of a standard set of decoding features, listed in Table 1. In contrast to our IWSLT 2013 system, all experiments submitted to this year's WMT evaluation made use of version 2.1 of moses, and incorporated additional decoding features, namely the Operation Sequence Model (Durrani et al., 2011) and the Lexicalized Reordering Model (Tillman, 2004; Galley and Manning, 2008).

Following Shen et al. (2006), we use the word-level lexical translation probabilities P_w(f_j|e_i) to obtain a sentence-level averaged lexical translation score (Eq. 2), which is added as an additional feature to each n-best list entry:

    P_lex(F|E) = Π_{j=1..J} [ 1/(I+1) Σ_{i=1..I} P_w(f_j|e_i) ]        (2)

Shen et al. (2006) use the term "IBM model 1 score" to describe the value calculated in Eq. 2. While the lexical probability distribution from IBM Model 1 (Brown et al., 1993) could in fact be used as the P_w(f_j|e_i) in Eq. 2, in practice we use a variant of P_w(f_j|e_i) defined by Koehn et al. (2003).
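To make Eqs. 1 and 2 concrete, the following sketch computes the averaged lexical score and applies the log-linear rescoring. It is an illustrative reimplementation, not the system's actual code; p_w is assumed to be a word-level lexical table keyed by (source word, target word) pairs, and each n-best entry is assumed to carry a dictionary of named feature values.

    def lexical_score(src_words, tgt_words, p_w):
        """P_lex(F|E) = prod_{j=1..J} [ 1/(I+1) * sum_{i=1..I} P_w(f_j|e_i) ] (Eq. 2)."""
        I = len(tgt_words)
        score = 1.0
        for f in src_words:                      # j ranges over source words
            score *= sum(p_w.get((f, e), 0.0) for e in tgt_words) / (I + 1)
        return score

    def rescore_nbest(nbest, weights):
        """Return the entry maximizing sum_r lambda_r * h_r(E, F) (Eq. 1)."""
        def model_score(entry):
            return sum(weights[name] * value
                       for name, value in entry["features"].items())
        return max(nbest, key=model_score)

In practice the Eq. 2 score would simply be added to each n-best entry's feature map, alongside the class-based LM score described next, before the rescoring step selects the new 1-best translation.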

We also add a 7-gram class language model score P_class(E) (Brown et al., 1992) as an additional feature of each n-best list entry. After adding these features to each translation in an n-best list, Eq. 1 is applied, rescoring the entries to extract new 1-best translations.

To optimize system performance we train the scaling factors λ_r, for both decoding and rescoring features, so as to minimize an objective error criterion. In our systems we use DREM (Kazi et al., 2013) or PRO (Hopkins and May, 2011) to perform this optimization. For development data during optimization, we used the newstest2013 set for the Russian-to-English task and the newsdev2014 set for the Hindi-to-English task, as supplied by WMT14.

2.2.4 Unknown Words

For the Hindi-to-English task, unknown words were marked during the decoding process and were transliterated by the icu4j Devanagari-to-Latin transliterator.⁴

⁴ http://site.icu-project.org

For the Russian-to-English task, we selectively stemmed and inflected input words not found in the phrase table. Each input sentence was examined to identify any source words which did not occur as a phrase of length 1 in the phrase table. For each such unknown word, we used treetagger (Schmid, 1994; Schmid, 1995) to identify the part of speech, and then removed inflectional endings to derive a stem. We applied all possible Russian inflectional endings for the given part of speech; if an inflected form of the unknown word could be found as a stand-alone phrase in the phrase table, that form was used to replace the unknown word in the original Russian file. If multiple candidates were found, we used the one with the highest frequency of occurrence in the training data. This process replaces words that we know we cannot translate with semantically similar words that we can translate, replacing an unknown word like фотоном "photon" (instrumental case) with a known morphological variant фотон "photon" (nominative case) that is found in the phrase table. Selective stemming of just the unknown words allows us to retain information that would be lost if we applied stemming to all the data.

Any remaining unknown words were transliterated as a post-process, using a simple letter-mapping from Cyrillic characters to Latin characters representing their typical sounds.
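As an illustration of these fallbacks, the sketch below first tries known inflected variants of an unknown word's stem and otherwise falls back to letter-by-letter transliteration. The ending list and the letter map are partial, assumed examples, not the authors' actual resources.

    NOUN_ENDINGS = ["", "а", "ы", "е", "у", "ом", "ов", "ам", "ами", "ах"]  # partial

    TRANSLIT = {
        'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e', 'ж': 'zh',
        'з': 'z', 'и': 'i', 'й': 'y', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n',
        'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u', 'ф': 'f',
        'х': 'kh', 'ц': 'ts', 'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ъ': '',
        'ы': 'y', 'ь': '', 'э': 'e', 'ю': 'yu', 'я': 'ya',
    }

    def reinflect(stem, endings, phrase_vocab, freq):
        """Return the most frequent inflected form of `stem` that occurs as a
        single-word phrase in the phrase table, or None if no variant is known."""
        candidates = [stem + end for end in endings if stem + end in phrase_vocab]
        if not candidates:
            return None
        return max(candidates, key=lambda w: freq.get(w, 0))

    def transliterate(word):
        """Last-resort Cyrillic-to-Latin transliteration by typical sound."""
        out = []
        for c in word:
            mapped = TRANSLIT.get(c.lower(), c)
            out.append(mapped.capitalize() if c.isupper() else mapped)
        return ''.join(out)

For example, an unknown фотоном would be replaced by фотон if the latter appears in the phrase table; a word with no known variant would be passed to transliterate() instead.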
    System        BLEU    BLEU-cased
    1  hi-en      13.1    12.1
    2  ru-en      32.0    30.8
    3  ru-en      32.2    31.0
    4  ru-en      31.5    30.3
    5  ru-en      33.0    31.1

Table 2: Translation results, as measured by BLEU (Papineni et al., 2002).

2.3 MT Results

Our best Hindi-English system for newstest2014 is listed in Table 2 as System 1. This system uses a combination of 6-gram language models built from the HindEnCorp, News Commentary, Europarl, and News Crawl corpora. Transliteration of unknown words was performed after decoding but before n-best list rescoring.

System 2 is Russian-English, and handles unknown words following §2.2.4. We used as independent decoder features separate 6-gram LMs trained respectively on the Common Crawl, Europarl, News Crawl, Wiki headlines, and Yandex corpora. This system was optimized with DREM. No rescoring was performed. We also tested a variant of System 2 which did perform rescoring; that variant (not listed in Table 2) performed worse than System 2, with scores of 31.2 BLEU and 30.1 BLEU-cased.

System 3, our best Russian-English system for newstest2014, used the BigLM and Gigaword language models (see §2.2.2) as independent decoder features and was optimized with DREM. Rescoring was performed after decoding. Instead of following §2.2.4, unknown words were dropped to maximize BLEU score. We note that the optimizer assigned weights of 0.314 and 0.003 to the BigLM and Gigaword models, respectively, suggesting that the optimizer found the BigLM to be much more useful than the Gigaword LM. This intuition was confirmed by an experimental variation of System 3 (not listed in Table 2) in which we omitted the BigLM; that variant performed substantially worse, with scores of 25.3 BLEU and 24.2 BLEU-cased. We also tested a variant of System 3 which did not perform rescoring; that variant (also not listed in Table 2) performed worse, with scores of 31.7 BLEU and 30.6 BLEU-cased.

System 5 is the result of monolingual postediting (see §3) of the uncased output of System 4, a variant of System 2 tuned using PRO. Due to time constraints, the monolingual postediting experiments in §3 were conducted (using the machine translation results from System 4) before the results of Systems 2 and 3 were available. The Moses recaser was applied in all experiments except for System 5.

[Figure 1: Posteditor user interface]

    Posteditor   Documents   Sentences    Words
    1                   44         950    20086
    2                   21         280     6031
    3                   25         476    10194
    4                   25         298     6164
    5                   20         301     5809
    6                   15         210     4433
    7                   10         140     2650
    8                   15         348     6743
    All                175        3003    62110

Table 3: Number of documents within the Russian-English test set processed by each monolingual human posteditor. The number of machine translated sentences processed by each posteditor is also listed, along with the total number of words in the corresponding Russian source sentences.

    Posteditor    # ✓    # ✗     % ✓
    1              684    266    72.0%
    2              190     90    67.9%
    3              308    168    64.7%
    4              162    136    54.4%
    5              194    107    64.5%
    6               94    116    44.8%
    7               88     52    62.9%
    8              196    152    56.3%
    All           1916   1087    63.8%

Table 4: For each monolingual posteditor, the number and percentage of sentences judged to be correct (✓) versus incorrect (✗) according to a monolingual human judge.⁶

    12   The postedited translation is superior to the reference translation
    10   The meaning of the Russian source sentence is fully conveyed in the
         postedited translation
     8   Most of the meaning is conveyed
     6   Misunderstands the sentence in a major way; or has many small mistakes
     4   Very little meaning is conveyed
     2   The translation makes no sense at all

Table 5: Evaluation guidelines for bilingual human judges, adapted from Albrecht et al. (2009).

    Evaluation Category      2      4      6      8     10     12
                          0.2%   2.2%   9.8%  24.7%  60.2%   2.8%

Table 6: Percentage of evaluated sentences judged to be in each category by a bilingual judge. Category labels are defined in Table 5.

    Evaluation Category      2      4      6      8     10     12
    # ✗                      2     20     72     89     79      4
    # ✓                      0      1     21    146    493     23
    % ✓                     0%     5%    23%    62%    86%    85%

Table 7: Number of sentences in each evaluation category (see Table 5) that were judged as correct (✓) or incorrect (✗) according to a monolingual human judge.

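As a consistency check on the tables above (assuming the per-category counts in Table 7), the snippet below reproduces the per-category shares reported in Table 6 and the 87.8% figure quoted in the text, which appears to be the combined share of categories 8, 10, and 12 among the 950 bilingual-judged sentences.

    incorrect = {2: 2, 4: 20, 6: 72, 8: 89, 10: 79, 12: 4}
    correct   = {2: 0, 4: 1,  6: 21, 8: 146, 10: 493, 12: 23}

    total = sum(incorrect.values()) + sum(correct.values())       # 950 sentences
    for cat in sorted(incorrect):
        share = 100.0 * (incorrect[cat] + correct[cat]) / total
        print(f"category {cat:>2}: {share:4.1f}%")                # matches Table 6

    adequate = sum(incorrect[c] + correct[c] for c in (8, 10, 12))
    print(f"categories 8+10+12: {100.0 * adequate / total:.1f}%")  # ~87.8%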
3 Monolingual Postediting

Postediting is the process whereby a human user corrects the output of a machine translation system. The use of basic postediting tools by bilingual human translators has been shown to yield substantial increases in productivity (Plitt and Masselot, 2010) as well as improvements in translation quality (Green et al., 2013) when compared to bilingual human translators working without assistance from machine translation and postediting tools. More sophisticated interactive interfaces (Langlais et al., 2000; Barrachina et al., 2009; Koehn, 2009b; Denkowski and Lavie, 2012) may also provide benefit (Koehn, 2009a).

We hypothesize that for at least some language pairs, monolingual posteditors with no knowledge of the source language can successfully translate a substantial fraction of test sentences. We expect this to be the case especially when the monolingual humans are domain experts with regard to the documents to be translated. If this hypothesis is confirmed, it could allow for multi-stage translation workflows, in which less highly skilled monolingual posteditors triage the translation process, postediting many of the sentences, while forwarding the most difficult sentences to more highly skilled bilingual translators.

Small-scale studies have suggested that monolingual human posteditors, working without knowledge of the source language, can also improve the quality of machine translation output (Callison-Burch, 2005; Koehn, 2010a; Mitchell et al., 2013), especially if well-designed tools provide automated linguistic analysis of the source sentences (Albrecht et al., 2009).

In this study, we designed a simple user interface for postediting that presents the user with the source sentence, machine translation, and word alignments for each sentence in a test document (Figure 1). While it may seem counter-intuitive to present monolingual posteditors with the source sentence, we found that the presence of alignment links between source words and target words can in fact aid a monolingual posteditor, especially with regard to correcting word order. For example, in our experiments posteditors encountered some sentences where a word or phrase was enclosed within bracketing punctuation marks (such as quotation marks, commas, or parentheses) in the source sentence, and the machine translation system incorrectly reordered the word or phrase outside the enclosing punctuation; by examining the alignment links the posteditors were able to correct such reordering mistakes.

The Russian-English test set comprises 175 documents in the news domain, totaling 3003 sentences. We assigned each test document to one of 8 monolingual⁵ posteditors (Table 3). The postediting tool did not record timing information. However, several posteditors informally reported that they were able to process on average approximately four documents per hour; since documents average about 17 sentences, this would, if accurate, indicate a processing speed of around one sentence per minute.

⁵ All posteditors are native English speakers. Posteditors 2 and 3 know Chinese and Arabic, respectively, but not Russian. Posteditor 8 understands the Cyrillic character set and has a minimal Russian vocabulary from two undergraduate semesters of Russian taken several years ago.

Following Koehn (2010a), we evaluated postedited translation quality according to a binary adequacy metric, as judged by a monolingual English speaker⁶ against the English references.

⁶ All monolingual adequacy judgements were performed by Posteditor 1. In an additional analysis, Posteditor 1's 950 postedited translations were independently judged by bilingual judges against the reference and the source sentence (Table 7).

In this metric, incorrect spellings of transliterated proper names were not grounds to judge an otherwise adequate postedited translation as incorrect. Binary adequacy results are shown in Table 4; we observe that correctness varied widely between posteditors (44.8–72.0%), and between documents.

Interestingly, several posteditors self-reported that they could tell which documents were originally written in English and subsequently translated into Russian, and which were originally written in Russian, based on the observation that sentences from the latter were substantially more difficult to postedit. Once per-document source language data is released by the WMT14 organizers, we intend to examine translation quality on a per-document basis and test whether posteditors did indeed perform worse on documents which originated in Russian.

Using bilingual judges, we further evaluate a substantial subset of the postedited test set using a more fine-grained adequacy metric (Table 5). Because of time constraints, only the first 950 postedited sentences of the test set⁶ were evaluated in this manner. Each sentence was evaluated by one of two bilingual human judges. In addition to the 2–10 point scale of Albrecht et al. (2009), judges were instructed to indicate (with a score of 12) any sentences where the postedited machine translation was superior to the reference translation. Using this metric, we show in Table 6 that monolingual posteditors can successfully produce postedited translations that convey all or most of the meaning of the original source sentence in up to 87.8% of sentences; this includes 2.8% which were superior to the reference.

Finally, as part of WMT14, the results of our Systems 1 (hi-en), 3 (ru-en), and 5 (postedited ru-en) were ranked by monolingual human judges against the machine translation output of the other WMT14 participants. These judgements are reported in WMT (2014). Due to time constraints, the machine translations (from System 4) presented to posteditors were not evaluated by human judges, neither using our 12-point evaluation scale nor as part of the WMT human evaluation rankings. However, to enable such evaluation by future researchers, and to enable replication of our experimental evaluation, the System 4 machine translations, the postedited translations, and the monolingual and bilingual evaluation results are released as supplementary data to accompany this paper.

4 Conclusion

In this paper, we present data preparation and language-specific processing techniques for our Hindi-English and Russian-English submissions to the 2014 Workshop on Machine Translation (WMT14) shared translation task. Our submissions examine the effectiveness of handling the various monolingual target-language corpora either as individual component language models (System 2) or concatenated together into a single big language model (System 3). We also examine the utility of n-best list rescoring using class language model and lexicalized translation model rescoring features.

In addition, we present the results of monolingual postediting of the entire 3003 sentences of the WMT14 Russian-English test set. Postediting was performed by monolingual English speakers, who corrected the output of machine translation without access to external resources, such as bilingual dictionaries or online search engines. According to BLEU, this system scored highest of all Russian-English submissions to WMT14.

Using a binary adequacy classification, we evaluate the entire postedited test set for correctness against the reference translations. Using bilingual judges, we further evaluate a substantial subset of the postedited test set using a more fine-grained adequacy metric; using this metric, we show that monolingual posteditors can successfully produce postedited translations that convey all or most of the meaning of the original source sentence in up to 87.8% of sentences.

Acknowledgements

We would like to thank the members of the SCREAM group at Wright-Patterson AFB.

Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government. Cleared for public release on 1 Apr 2014. Originator reference number RH-14-112150. Case number 88ABW-2014-1328.

References

Joshua S. Albrecht, Rebecca Hwa, and G. Elisabeta Marai. 2009. Correcting automatic translations through collaborations between MT and monolingual target-language users. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09), pages 60–68, Athens, Greece, March–April.

Waleed Ammar, Victor Chahuneau, Michael Denkowski, Greg Hanneman, Wang Ling, Austin Matthews, Kenton Murray, Nicola Segall, Yulia Tsvetkov, Alon Lavie, and Chris Dyer. 2013. The CMU machine translation systems at WMT 2013: Syntax, synthetic translation options, and pseudo-references. In Proceedings of the Eighth Workshop on Statistical Machine Translation (WMT '13), pages 70–77, Sofia, Bulgaria, August.

Sergio Barrachina, Oliver Bender, Francisco Casacuberta, Jorge Civera, Elsa Cubel, Shahram Khadivi, Antonio Lagarda, Hermann Ney, Jesús Tomás, Enrique Vidal, and Juan-Miguel Vilar. 2009. Statistical approaches to computer-assisted translation. Computational Linguistics, 35(1):3–28, March.

Ondřej Bojar, Pavel Straňák, and Daniel Zeman. 2010. Data issues in English-to-Hindi machine translation. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC '10), pages 1771–1777, Valletta, Malta, May.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation (WMT '13), pages 1–44, Sofia, Bulgaria, August.

Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna, and Dan Zeman. 2014. Hindi-English and Hindi-only corpus for machine translation. In Proceedings of the Ninth International Language Resources and Evaluation Conference (LREC '14), Reykjavik, Iceland, May. ELRA, European Language Resources Association.

Peter Brown, Vincent Della Pietra, Peter deSouza, Jenifer Lai, and Robert Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

Peter Brown, Vincent Della Pietra, Stephen Della Pietra, and Robert Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, June.

Chris Callison-Burch. 2005. Linear B system description for the 2005 NIST MT evaluation exercise. In Proceedings of the NIST 2005 Machine Translation Evaluation Workshop.

Mark Davis and Ken Whistler. 2013. Unicode normalization forms. Technical Report UAX #15, The Unicode Consortium, September. Rev. 39.

Michael Denkowski and Alon Lavie. 2012. TransCenter: Web-based translation research suite. In Proceedings of the AMTA 2012 Workshop on Post-Editing Technology and Practice Demo Session, November.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL '11), pages 1045–1054, Portland, Oregon, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 848–856, Honolulu, Hawai'i, October.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing and Quality Assurance for Natural Language Processing, pages 49–57, Columbus, Ohio, June.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '13), pages 439–448, Paris, France, April–May.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL '13), pages 690–696, Sofia, Bulgaria, August.

Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pages 1352–1362, Edinburgh, Scotland, U.K.

Michaeel Kazi, Michael Coury, Elizabeth Salesky, Jessica Ray, Wade Shen, Terry Gleason, Tim Anderson, Grant Erdmann, Lane Schwartz, Brian Ore, Raymond Slyh, Jeremy Gwinnup, Katherine Young, and Michael Hutt. 2013. The MIT-LL/AFRL IWSLT-2013 MT system. In The 10th International Workshop on Spoken Language Translation (IWSLT '13), pages 136–143, Heidelberg, Germany, December.

Philipp Koehn, Franz Joseph Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '03), pages 48–54, Edmonton, Canada, May–June.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL '07) Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June.

Philipp Koehn. 2009a. A process study of computer aided translation. Machine Translation, 23(4):241–263, November.

Philipp Koehn. 2009b. A web-based interactive computer aided translation tool. In Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, pages 17–20, Suntec, Singapore, August.

Philipp Koehn. 2010a. Enabling monolingual translators: Post-editing vs. options. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '10), pages 537–545, Los Angeles, California, June.

Philipp Koehn. 2010b. An experimental management system. The Prague Bulletin of Mathematical Linguistics, 94:87–96, December.

Philippe Langlais, George Foster, and Guy Lapalme. 2000. TransType: A computer-aided translation typing system. In Proceedings of the ANLP/NAACL 2000 Workshop on Embedded Machine Translation Systems, pages 46–51, Seattle, Washington, May.

Leah S. Larkey, Margaret E. Connell, and Nasreen Abduljaleel. 2003. Hindi CLIR in thirty days. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):130–142, June.

Linda Mitchell, Johann Roturier, and Sharon O'Brien. 2013. Community-based post-editing of machine translation content: monolingual vs. bilingual. In Proceedings of the 2nd Workshop on Post-editing Technology and Practice (WPTP-2), pages 35–43, Nice, France, September. EAMT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311–318, Philadelphia, Pennsylvania, July.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16, January.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, England, September.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the EACL SIGDAT Workshop, Dublin, Ireland, March.

Wade Shen, Brian Delaney, and Tim Anderson. 2006. The MIT-LL/AFRL IWSLT-2006 MT system. In The 3rd International Workshop on Spoken Language Translation (IWSLT '06), Kyoto, Japan.

Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL '13), pages 1374–1383, Sofia, Bulgaria, August.

Andreas Stolcke. 2002. SRILM — an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), pages 901–904, Denver, Colorado, September.

Christoph Tillman. 2004. A unigram orientation model for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '04), Companion Volume, pages 101–104, Boston, Massachusetts, May.

WMT. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT '14), Baltimore, Maryland, June.