Machine Translation and Monolingual Postediting: The AFRL WMT-14 System

Lane O.B. Schwartz, Air Force Research Laboratory
Timothy Anderson, Air Force Research Laboratory
Jeremy Gwinnup, SRA International†
Katherine M. Young, N-Space Analysis LLC†

† This work is sponsored by the Air Force Research Laboratory under Air Force contract FA-8650-09-D-6939-029.

Abstract

This paper describes the AFRL statistical MT system and the improvements that were developed during the WMT14 evaluation campaign. As part of these efforts we experimented with a number of extensions to the standard phrase-based model that improve performance on Russian to English and Hindi to English translation tasks. In addition, we describe our efforts to make use of monolingual English speakers to correct the output of machine translation, and present the results of monolingual postediting of the entire 3003 sentences of the WMT14 Russian-English test set.

1 Introduction

As part of the 2014 Workshop on Machine Translation (WMT14) shared translation task, the human language technology team at the Air Force Research Laboratory participated in two language pairs: Russian-English and Hindi-English. Our machine translation system represents enhancements to our system from IWSLT 2013 (Kazi et al., 2013). In this paper, we focus on enhancements to our procedures with regard to data processing and the handling of unknown words.

In addition, we describe our efforts to make use of monolingual English speakers to correct the output of machine translation, and present the results of monolingual postediting of the entire 3003 sentences of the WMT14 Russian-English test set. Using a binary adequacy classification, we evaluate the entire postedited test set for correctness against the reference translations. Using bilingual judges, we further evaluate a substantial subset of the postedited test set using a more fine-grained adequacy metric; using this metric, we show that monolingual posteditors can successfully produce postedited translations that convey all or most of the meaning of the original source sentence in up to 87.8% of sentences.

2 System Description

We submitted systems for the Russian-to-English and Hindi-to-English MT shared tasks. In all submitted systems, we use the phrase-based moses decoder (Koehn et al., 2007). We used only the constrained data supplied by the evaluation for each language pair for training our systems.

2.1 Data Preparation

Before training our systems, a cleaning pass was performed on all data. Unicode characters in the unallocated and private use ranges were all removed, along with C0 and C1 control characters, zero-width and non-breaking spaces and joiners, and directionality and paragraph markers.

2.1.1 Hindi Processing

The HindEnCorp corpus (Bojar et al., 2014) is distributed in tokenized form; in order to ensure a uniform tokenization standard across all of our data, we began by detokenizing this data using the Moses detokenization scripts.

In addition to normalizing various extended Latin punctuation marks to their Basic Latin equivalents, following Bojar et al. (2010) we normalized the Devanagari Danda (U+0964), Double Danda (U+0965), and Abbreviation Sign (U+0970) punctuation marks to Latin Full Stop (U+002E), converted any Devanagari Digit to the equivalent ASCII Digit, and decomposed all Hindi data into Unicode Normalization Form D (Davis and Whistler, 2013) using charlint (http://www.w3.org/International/charlint). In addition, we performed Hindi diacritic and vowel normalization, following Larkey et al. (2003).
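These character-level mappings are simple enough to restate in code. The following is a minimal Python sketch (not the charlint or Moses tooling actually used) of the Danda, digit, and NFD steps described above; it omits the detokenization and the Larkey-style diacritic and vowel normalization.

```python
import unicodedata

# Devanagari sentence punctuation -> Latin Full Stop (U+002E)
PUNCT_MAP = {
    "\u0964": ".",  # Devanagari Danda
    "\u0965": ".",  # Devanagari Double Danda
    "\u0970": ".",  # Devanagari Abbreviation Sign
}

# Devanagari Digits (U+0966-U+096F) -> ASCII Digits (U+0030-U+0039)
DIGIT_MAP = {chr(0x0966 + i): chr(0x0030 + i) for i in range(10)}

def normalize_hindi(text: str) -> str:
    """Map Devanagari punctuation and digits, then decompose to Unicode NFD."""
    for src, tgt in {**PUNCT_MAP, **DIGIT_MAP}.items():
        text = text.replace(src, tgt)
    return unicodedata.normalize("NFD", text)
```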
Since no Hindi-English development test set was provided in WMT14, we randomly sampled 1500 sentence pairs from the Hindi-English parallel training data to serve this purpose. Upon discovering duplicate sentences in the corpus, 552 sentences that overlapped with the training portion were removed from the sample, leaving a development test set of 948 sentences.

2.1.2 Russian Processing

The Russian sentences contained many examples of mixed-character spelling, in which both Latin and Cyrillic characters are used in a single word, relying on the visual similarity of the characters. For example, although the first and last letters of the word cейчас appear visually indistinguishable, the former is U+0063 Latin Small Letter C and the latter is U+0441 Cyrillic Small Letter Es. We created a spelling normalization program to convert these words to all Cyrillic or all Latin characters, with a preference for all-Cyrillic conversion if possible. Normalization also removes U+0301 Combining Acute Accent and converts U+00F2 Latin Small Letter O with Grave (ò) and U+00F3 Latin Small Letter O with Acute (ó) to the unaccented U+043E Cyrillic Small Letter O (о).

The Russian-English Common Crawl parallel corpus (Smith et al., 2013) is relatively noisy. A number of Russian source sentences are incorrectly encoded using characters in the Latin-1 Supplement block; we correct these sentences by shifting these characters ahead by 350 hexadecimal code points into the correct Cyrillic character range. For example, "Ñïðàâêà ïî ãîðîäàì Ðîññèè è ìèðà." becomes "Справка по городам России и мира."

We examine the Common Crawl parallel sentences and mark for removal any non-Russian source sentences and non-English target sentences. Target sentences were marked as non-English if more than half of the characters in the sentence were non-Latin, or if more than half of the words were unknown to the aspell English spelling correction program, not counting short words, which frequently occur as (possibly false) cognates across languages (English die vs. German die, English on vs. French on, for example). Because aspell does not recognize some proper names, brand names, and borrowed words as known English words, this method incorrectly flags for removal some English sentences that have a high proportion of these types of words.

Source sentences were marked as non-Russian if less than one-third of the characters were within the Russian Cyrillic range, or if non-Russian characters equaled or outnumbered Russian characters and the sentence contained no contiguous sequence of at least three Russian characters. Some portions of the Cyrillic character set are not used in typical Russian text; source sentences were therefore marked for removal if they contained the Cyrillic extension characters Ukrainian I (і І), Yi (ї Ї), Ghe With Upturn (ґ Ґ), or Ie (є Є) in either upper- or lowercase, with exceptions for U+0406 Ukrainian I (І) in Roman numerals and for U+0491 Ghe With Upturn (ґ) where it appears as an encoding error artifact in place of an apostrophe within English words (for example: "Песня The Kelly Family Iґm So Happy представлена вам Lyrics-Keeper.").

Sentence pairs where the source was identified as non-Russian or the target was identified as non-English were removed from the parallel corpus. Overall, 12% of the parallel sentences were excluded based on a non-Russian source sentence (94k instances) or a non-English target sentence (11.8k instances).

Our Russian-English parallel training data includes a parallel corpus extracted from Wikipedia headlines (Ammar et al., 2013), provided as part of the WMT14 shared translation task.
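The text specifies only the c/с homoglyph pair and the 0x350 shift; a rough Python sketch of both character-level repairs follows. The homoglyph table beyond c → с and the trigger for deciding that a sentence is mis-encoded Latin-1 are assumptions made for illustration, not the authors' actual rules.

```python
# Latin -> Cyrillic homoglyph pairs; only c -> U+0441 is cited in the text,
# the remaining pairs are assumed for illustration.
_PAIRS = [("c", "\u0441"), ("a", "\u0430"), ("e", "\u0435"), ("o", "\u043e"),
          ("p", "\u0440"), ("x", "\u0445"), ("y", "\u0443")]
HOMOGLYPHS = {}
for lat, cyr in _PAIRS:
    HOMOGLYPHS[lat] = cyr
    HOMOGLYPHS[lat.upper()] = cyr.upper()
HOMOGLYPHS["\u00f2"] = "\u043e"  # ò -> unaccented Cyrillic о
HOMOGLYPHS["\u00f3"] = "\u043e"  # ó -> unaccented Cyrillic о

def normalize_mixed_word(word: str) -> str:
    """Prefer an all-Cyrillic spelling when every Latin letter has a Cyrillic twin."""
    converted = "".join(HOMOGLYPHS.get(ch, ch) for ch in word)
    converted = converted.replace("\u0301", "")  # drop Combining Acute Accent
    if not any("a" <= ch.lower() <= "z" for ch in converted):
        return converted
    return word  # otherwise leave unchanged (an all-Latin conversion could be tried here)

def fix_latin1_misencoding(sentence: str) -> str:
    """Shift Latin-1 Supplement characters ahead by 0x350 into the Cyrillic block."""
    letters = [ch for ch in sentence if ch.isalpha()]
    # Assumed trigger: most letters fall in the Latin-1 Supplement range (U+0080-U+00FF).
    if letters and sum(0x80 <= ord(ch) <= 0xFF for ch in letters) > len(letters) / 2:
        return "".join(chr(ord(ch) + 0x350) if 0x80 <= ord(ch) <= 0xFF else ch
                       for ch in sentence)
    return sentence
```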
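The sentence-level language filters are also described precisely enough to sketch. In the Python below, the Unicode ranges taken to delimit "Russian" and "Latin" characters, the short-word cutoff, and the known_english_word predicate (standing in for aspell) are assumptions; the one-third, one-half, and three-character-run thresholds come from the text.

```python
import re

RUSSIAN = re.compile(r"[\u0410-\u044F\u0401\u0451]")       # А..я plus Ё/ё (assumed delimitation)
RUSSIAN_RUN = re.compile(r"[\u0410-\u044F\u0401\u0451]{3}")
LATIN = re.compile(r"[A-Za-z]")
# Ukrainian-specific Cyrillic letters that trigger removal: І і Ї ї Ґ ґ Є є
EXTENSION_CHARS = set("\u0406\u0456\u0407\u0457\u0490\u0491\u0404\u0454")

def is_non_russian_source(sent: str) -> bool:
    chars = [c for c in sent if not c.isspace()]
    if not chars:
        return True
    russian = sum(bool(RUSSIAN.match(c)) for c in chars)
    if russian < len(chars) / 3:                            # less than one-third Russian
        return True
    if (len(chars) - russian) >= russian and RUSSIAN_RUN.search(sent) is None:
        return True
    # Exceptions for І in Roman numerals and ґ as an apostrophe artifact are omitted here.
    return any(c in EXTENSION_CHARS for c in sent)

def is_non_english_target(sent: str, known_english_word) -> bool:
    letters = [c for c in sent if c.isalpha()]              # assumed: count alphabetic characters
    if letters and sum(not LATIN.match(c) for c in letters) > len(letters) / 2:
        return True
    words = [w for w in re.findall(r"[A-Za-z]+", sent) if len(w) > 3]  # skip short words (assumed cutoff)
    return bool(words) and sum(not known_english_word(w) for w in words) > len(words) / 2
```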
Two files in this parallel corpus (wiki.ru-en and guessed-names.ru-en) contained some overlapping data. We removed 6415 duplicate lines within wiki.ru-en (about 1.4%), and removed 94 lines of guessed-names.ru-en that were already present in wiki.ru-en (about 0.17%).

2.2 Machine Translation

Our baseline system is a variant of the MIT-LL/AFRL IWSLT 2013 system (Kazi et al., 2013) with some modifications to the training and decoding processes.

2.2.1 Phrase Table Training

For our Russian-English system, we trained a phrase table using the Moses Experiment Management System (Koehn, 2010b), with mgiza (Gao and Vogel, 2008) as the word aligner; this phrase table was trained using the Russian-English Common Crawl, News Commentary, Yandex (Bojar et al., 2013), and Wikipedia headlines parallel corpora.

The phrase table for our Hindi-English system was trained using a similar in-house training pipeline, making use of the HindEnCorp and Wikipedia headlines parallel corpora.

2.2.2 Language Model Training

During the training process we built n-gram language models (LMs) for use in decoding and rescoring using the KenLM language modelling toolkit (Heafield et al., 2013). Class-based language models (Brown et al., 1992) were also trained, for later use in n-best list rescoring, using the SRILM language modelling toolkit (Stolcke, 2002). We trained a 6-gram language model from the LDC English Gigaword Fifth Edition, for use in both the Hindi-English and Russian-English systems.

2.2.3 Decoding, n-best List Rescoring, and Optimization

We decode using the phrase-based moses decoder (Koehn et al., 2007), choosing the best translation for each source sentence according to a linear combination of decoding features:

    Ê = argmax_E Σ_r λ_r h_r(E, F)    (1)

We make use of a standard set of decoding features (Table 1).

Table 1: Models used in log-linear combination

Decoding features:
  P(f|e)
  P(e|f)
  Pw(f|e)
  Pw(e|f)
  Phrase Penalty
  Lexical Backoff
  Word Penalty
  Distortion Model
  Unknown Word Penalty
  Lexicalized Reordering Model
  Operation Sequence Model

Rescoring features:
  Pclass(E) – 7-gram class-based LM
  Plex(F|E) – sentence-level averaged lexical translation score
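Equation 1 simply selects the hypothesis with the highest weighted sum of feature scores. As a minimal illustration of that selection over an n-best list (the feature names, values, and weights below are invented for the example; this is not the moses or rescoring code actually used):

```python
from typing import Dict, List, Tuple

def best_hypothesis(nbest: List[Tuple[str, Dict[str, float]]],
                    weights: Dict[str, float]) -> str:
    """Pick E-hat = argmax_E sum_r lambda_r * h_r(E, F) over an n-best list."""
    def score(features: Dict[str, float]) -> float:
        return sum(weights[name] * value for name, value in features.items())
    return max(nbest, key=lambda hyp: score(hyp[1]))[0]

# Illustrative use with made-up feature names and values:
nbest = [
    ("the meeting takes place tomorrow", {"lm": -12.3, "tm": -4.1, "word_penalty": -5.0}),
    ("the meeting happens tomorrow",     {"lm": -11.8, "tm": -4.9, "word_penalty": -4.0}),
]
weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.1}
print(best_hypothesis(nbest, weights))
```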