Generating English Determiners in Phrase-Based Translation with Synthetic Translation Options

Yulia Tsvetkov  Chris Dyer  Lori Levin  Archna Bhatia
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{ytsvetko, cdyer, lsl, archna}@cs.cmu.edu

Abstract

We propose a technique for improving the quality of phrase-based translation systems by creating synthetic translation options—phrasal translations that are generated by auxiliary translation and post-editing processes—to augment the default phrase inventory learned from parallel data. We apply our technique to the problem of producing English determiners when translating from Russian and Czech, languages that lack definiteness morphemes. Our approach augments the English side of the phrase table using a classifier to predict where English articles might plausibly be added or removed, and then we decode as usual. Doing so, we obtain significant improvements in quality relative to a standard phrase-based baseline and to post-editing complete translations with the classifier.

1 Introduction

Phrase-based translation works as follows. A set of candidate translations for an input sentence is created by matching contiguous spans of the input against an inventory of phrasal translations, reordering them into a target-language appropriate order, and choosing the best according to a discriminative model that combines features of the phrases used, reordering patterns, and a target language model (Koehn et al., 2003). This relatively simple approach to translation can be remarkably effective, and, since its introduction, it has been the basis for further innovations, including developing better models for distinguishing good translations from bad ones (Chiang, 2012; Gimpel and Smith, 2012; Cherry and Foster, 2012; Eidelman et al., 2013), improving the identification of phrase pairs in parallel data (DeNero et al., 2008; DeNero and Klein, 2010), and formal generalizations to gapped rules and rich nonterminal types (Chiang, 2007; Galley et al., 2006). This paper proposes a different mechanism for improving phrase-based translation: the use of synthetic translation options to supplement the standard phrasal inventory used in phrase-based translation systems.

In the following, we argue that phrase tables acquired in the usual way can be expected to have gaps in their coverage in certain language pairs, and that supplementing these with synthetic translation options is a priori preferable to alternative techniques, such as post-processing, for generalizing beyond the translation pairs observable in training data (§2). As a case study, we consider the problem of producing English definite/indefinite articles (the, a, and an) when translating from Russian and Czech, two languages that lack overt definiteness morphemes (§3). We develop a classifier that predicts the presence and absence of English articles (§4). This classifier is used to generate synthetic translation options that are used to augment phrase tables in the usual way (§5). We evaluate their performance relative to a post-processing approach and to a baseline phrase-based system, finding that synthetic translation options reliably outperform the other approaches (§6). We then discuss how our approach relates to previous work (§7) and conclude by discussing further applications of our technique (§8).

2 Why Synthetic Translation Options?

Before turning to the problem of generating English articles, we give arguments for why synthetic translation options are a useful extension of

Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 271–280, Sofia, Bulgaria, August 8-9, 2013. ©2013 Association for Computational Linguistics

standard phrase-based translation approaches, and why this technique might be better than some alternative proposals that have been made for generalizing beyond translation examples directly observable in the training data.

In language pairs that are typologically similar (i.e., when both languages lexicalize the same kinds of semantic and syntactic information), and words and phrases map relatively directly from source to target languages, the standard approach to learning phrase pairs is quite effective.¹ However, in language pairs in which individual source language words have many different possible translations (e.g., when the target language word could have many different inflections or could be surrounded by different function words that have no direct correspondence in the source language), we can expect the standard phrasal inventory to be incomplete, except when very large quantities of parallel data are available or for very frequent words. There simply will not be enough examples from which to learn the ideal set of translation options.

¹ When translating from a language with a richer lexical inventory to a simpler one, approximate matching or backing off to (e.g.) morphologically simpler forms likewise reliably produces good translations.

[Figure 1: Russian-English phrase-based translation example for the input "Я увидел кошку" (1SG+NOM saw+1SG+PST cat+ACC), with candidate translation options such as I saw the, I saw a, I saw, saw a cat, saw the cat, the cat, a cat, and cat. Since Russian lacks a definiteness morpheme, the determiners a and the must be part of a translation option containing увидел or кошку in order to be present in the right place in the English output. Translation options shown in dashed boxes should exist but were not observed in the training data. This work seeks to produce such missing translation options synthetically.]

Therefore, since phrase-based translation can only generate input/output word pairs that were directly observed in the training corpus, the decoder's only hope for producing a good output is to find a fluent, meaning-preserving translation using incomplete translation lexicons. Synthetic translation option generation seeks to fill these gaps using secondary generation processes that produce possible phrase translation alternatives that are not directly extractable from the training data. We hypothesize that by filling in gaps in the translation options, discriminative translation models will be more effective (leading to better translation quality).

The creation of synthetic translation options can be understood as a kind of translation or post-editing of phrasal units/translations. This raises a question: if we have the ability to post-edit a phrasal translation or retranslate a source phrase so as to fill in gaps in the phrasal inventory, we should be able to use the same technique to translate the whole sentence; why not do this? While the effectiveness of this approach will ultimately be assessed empirically, translation option generation is appealing because the translation option synthesizer need not produce only single-best guesses—if multiple possibilities appear to be equally good (say, multiple inflections of a translated lemma), then multiple translation options may be synthesized. Ultimately, of course, the global translation model must select one translation for every phrase it uses, but the decoder will have access to global information that it can use to pick better translation options.

3 Case Study: English Definite Articles

We now turn to a translation problem that we will use to assess the value of synthetic translation options: generating English in/definite articles when translating from Russian.

Definiteness is a semantic property of noun phrases that expresses information such as identifiability, specificity, familiarity and uniqueness (Lyons, 1999). In English, it is expressed through the use of article determiners and non-article determiners. Although languages may express definiteness through such morphemes, many languages use alternative mechanisms. For example, a language may use noncanonical word orders (Mohanan, 1994)² or different constructions such as existentials, differential object marking (Aissen, 2003), and the ba (把) construction in Chinese

² See pp. 11–12 for an example in Hindi, a language without articles.

(Chen, 2004). While these languages lack articles, they may use demonstratives and the quantifier one to emphasize definiteness and indefiniteness, respectively.

Russian and Czech are examples of languages that use non-lexical means to express definiteness. As such, in Russian-to-English translation systems, we expect that most Russian nouns should have at least three translation options—the bare noun, the noun preceded by the, and the noun preceded by a/an. Fig. 1 illustrates how the definiteness mismatch between Russian and English can result in "gaps" in the phrasal inventory learned from a relatively large parallel corpus. The Russian input should translate (depending on context) as either I saw a cat or I saw the cat; however, the phrase table we learned is only able to generate the former.³

³ The phrase table for this example was extracted from the WMT 2013 shared task training data consisting of 1.2M sentence pairs.

4 Predicting English Definite Articles

Although articles express semantic content, their use is largely predictable in context, both for native English speakers and for automated systems (Knight and Chander, 1994).⁴ In this section we describe a classifier that uses local contextual features to predict whether an article belongs in a particular position in a sequence of words, and if so, whether it is definite or indefinite (the form of the indefinite article is deterministic given the pronunciation of the following word).

⁴ An interesting contribution of this work is a discussion of lower and upper bounds that can be achieved by native English speakers in predicting determiners. 67% is a lower bound, obtained by guessing the for every instance. The upper bound was obtained experimentally, and was measured on noun phrases (NPs) without context, in a context of 4 words (2 before and 2 after the NP), and given full context. Human subjects achieved an accuracy of 94–96% given full context, 83–88% for NPs in a context of 4 words, and 79–80% for NPs without context. Since, given the current state of the art, building an automated determiner predictor with full context (representing meaning computationally) is not a feasible task, we view 83–88% accuracy as our goal, and 88% as an upper bound for our method.

4.1 Model

The classifier takes an English word sequence w = ⟨w₁, w₂, ..., w_|w|⟩ with missing articles and an index i, and predicts whether no article, a definite article, or an indefinite article should appear before w_i. We parameterize the classifier as a multiclass logistic regression:

  p(y | w, i) ∝ exp Σⱼ λⱼ hⱼ(y, w, i),

where hⱼ(·) are feature functions, λⱼ are the corresponding weights, and y ∈ {D, I, N} refers, respectively, to the outputs: definite article, indefinite article, and no article.⁵

⁵ Realization of the classes D and N as lexical items is straightforward. To convert I into a or an, we use the CMU pronouncing dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) and select an if w_i starts with a phonetic vowel.

4.2 Features

The English article system is extremely complex (as non-native English speakers will surely know!): in addition to the general placement rule that articles must precede a noun or its modifiers in an NP, multiple other factors can also affect article selection, including countability of the noun, syntactic properties of a modifying word (superlative, ordinal), discourse factors, general knowledge, etc. We define morphosyntactic features aimed at reflecting basic grammatical rules, and we define statistical, semantic and shallow lexical features to capture additional regular and idiosyncratic usages of definite and indefinite articles in English. Below we provide brief details of the features and their motivation.

Lexical. Because training data can be constructed inexpensively (from any unannotated English corpus), n-gram indicator features, such as [[w_{i−1} y w_i w_{i+1} = with y lot of]], can be estimated reliably and capture construction-specific article use.

Morphosyntactic. We used part-of-speech (POS) tags produced by the Stanford POS tagger (Toutanova and Manning, 2000) to capture general article patterns. These are relevant features in the prediction of articles, as we observe certain constraints regarding the use of articles in the neighborhood of certain POS tags. For example, we do not expect to predict an article following an adjective (JJ).

Semantic. We extract further information indicating whether a named entity, as identified by the Stanford NE Recognizer (Finkel et al., 2005), begins at w_i. These features are relevant as there

is, in general, a constraint on the co-occurrence of articles with named entities which can help us predict the use of articles in such constructions. For example, proper nouns do not tend to co-occur with articles in English. There are some proper nouns that have an article included in them, such as the Netherlands and the United States of America, but these are fixed expressions and the model is easily able to capture such cases with lexical features.

  Feature combination     All    I     D     N
  Statistical             0.80   0.76  0.79  0.87
  Lexical                 0.82   0.79  0.80  0.87
  Morphosyntactic         0.75   0.71  0.64  0.86
  Semantic                0.35   0.99  0.02  0.04
  Statistical+Lexical     0.85   0.83  0.82  0.89
   + Morphosyntactic      0.87   0.86  0.83  0.92
   + Semantic             0.87   0.86  0.83  0.92

Table 1: 10-fold cross-validation accuracy of the classifier, overall and by class.
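As a concrete illustration, the multiclass logistic regression of §4.1 can be sketched as below. This is a toy, not the system's implementation: the feature names, the tiny weight table (the real model has ~142K features trained with creg), and the example phrase are all invented for illustration.

```python
import math

CLASSES = ("D", "I", "N")  # definite article, indefinite article, no article

# Hypothetical weights lambda_j, keyed by (class, feature); invented values.
WEIGHTS = {
    ("D", "prev=saw"): 0.7,
    ("I", "prev=saw"): 0.6,
    ("N", "prev=saw"): 0.1,
}

def features(words, i):
    """Toy local contextual features h_j(w, i) for the slot before words[i]."""
    feats = ["cur=" + words[i]]
    if i > 0:
        feats.append("prev=" + words[i - 1])
    return feats

def predict(words, i):
    """p(y | w, i) ∝ exp Σ_j λ_j h_j(y, w, i): softmax over the three classes."""
    scores = {}
    for y in CLASSES:
        scores[y] = sum(WEIGHTS.get((y, f), 0.0) for f in features(words, i))
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

probs = predict(["i", "saw", "cat"], 2)  # which article belongs before "cat"?
best = max(probs, key=probs.get)         # "D" under these toy weights
```

With the toy weights above, the definite class wins for the slot before cat; in the real system the decision is made by thousands of lexical, morphosyntactic, statistical, and semantic features.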

Statistical. Statistical features capture the probability of co-occurrence of a sample with each of the classes; e.g., for w_{i−1} y w_i we collect probabilities of w_{i−1} I w_i, w_{i−1} D w_i, and w_{i−1} N w_i.⁶

⁶ Although statistical features are de rigueur in NLP, they are arguably justified for this problem on linguistic grounds, since human subjects use frequency-based knowledge in addition to their grammatical knowledge. For example, we say He is at school rather than He is at the school, but Americans say He is in the hospital while UK English speakers might prefer He is in hospital.

4.3 Training and evaluation

We employ the creg regression modeling framework to train a ternary logistic regression classifier.⁷ All features were computed for the target side of the Russian-English TED corpus (Cettolo et al., 2012); from 117,527 sentences we removed 5K sentences used as tuning and test sets in the MT system. We extract statistical features from monolingual English corpora released for WMT-11 (Callison-Burch et al., 2011).

⁷ https://github.com/redpony/creg

In the training corpus there are 65,075 I instances, 114,571 D instances, and 2,435,287 N instances. To create a balanced training set we randomly sample 65K instances from each set of collected instances.⁸ This training set of feature vectors has 142,604 features and 285,210 parameters. To minimize the number of free parameters in our model we use ℓ1 regularization. We perform 10-fold cross-validation experiments with various feature combinations, evaluating the classifier accuracy for all classes and for each class independently. The performance of the classifier on individual classes and consolidated results for all classes are listed in Table 1.

⁸ Preliminary experiments indicated that the excess of N labels resulted in poor performance.

We observe that morphosyntactic and lexical features are highly significant, reducing the error rate of statistical features by 25%. A combination of morphosyntactic, lexical, and statistical features is also helpful, reducing 13% more errors. Semantic features do not contribute to the classifier accuracy (we believe, mainly due to feature sparsity).

5 Experimental Setup

Our experimental workflow includes the following steps. First, we select a phrase table PT_source from which we generate synthetic phrases. For each phrase pair ⟨f, e⟩ in PT_source we generate n synthetic variants of the target-side phrase e, which we then append to PT_baseline. We annotate both the original and synthetic phrases with additional translation features in PT_baseline.

For this language pair, we have several options for how to construct PT_source. The most straightforward way is to extract the phrasal inventory as usual; a second option is to extract phrases from training data from which definite articles have been removed (since we will rely on the classifier to reinsert them where they belong).

To synthesize phrases, we employ two different techniques: LM-based and classifier-based. We use an LM for one- or two-word phrases and an auxiliary classifier for longer phrases, and create a new phrase in which we insert, remove or substitute an article between each adjacent pair of words in the original phrase. This distinction between short and longer phrases has a clear motivation: phrases without context may allow alternative, equally plausible options for article selection, therefore we can rely on an LM, trained on large monolingual corpora, to identify phrases unobserved in the MT training corpus. Longer context restricts determiner usage, and statistical model decisions are less prone to generating ungrammatical synthetic phrases.

The LM-based method is applied to phrases shorter than three words. These phrases are numerous, roughly 20% of a phrase table, and extracted from

many sites in the training data. For each short (target) phrase we add all possible alternative entries observed in the LM and not observed in the original translation model. For example, for a short target phrase a cat we extract the cat.

We apply an auxiliary classifier to longer phrases, containing three or more words. Based on the classifier prediction, we use the maximally probable class to insert, remove or substitute an article between each adjacent pair of words in the original phrase. Synthetic phrases are generated by linguistically-informed features and can introduce alternative grammatically-correct translations of source phrases by adding or removing existing articles (since English article selection in a local context is often ambiguous and not categorical). We add a synthetic phrase only if the phrase pair is not observed in the original model.

We compare two possible applications of the classifier: one-pass and iterative prediction. With one-pass prediction we decide on the prediction for each position independently of other decisions. With iterative update we adopt the best-first (greedy) strategy, selecting in each iteration the update location in which the classifier obtains the highest confidence score. In each iteration we incorporate a prediction in a target phrase, and in the next iteration the best-first decision is made on the updated phrase. Iterative prediction stops when no updates are introduced.

Synthetic phrases are added to a phrase table with the five standard phrasal translation features that were found in the source phrase, and with several new features. First, we add a boolean feature indicating the origin of a phrase: synthetic or original. Second, we experiment with a posterior probability of the classifier averaged over all locations where it could be extracted from the training data. The next feature is derived from this score: it is a boolean feature indicating the confidence of the classifier; the feature value is 1 iff the average classifier score is higher than some threshold.

Consider again the phrase I saw a cat discussed in Section 1. Synthetic entry generation from the original phrase table entry is illustrated in Figure 2.

6 Translation Results

We now review the results of experiments using synthetic translation options in a machine translation system. We use the Moses toolkit (Koehn et al., 2007) to train a baseline phrase-based SMT system. Each configuration we compare has a different phrase table, with synthetic phrases generated with best-first or iterative strategies, from a phrase table with or without determiners, and with a variable number of translation features. To verify that system improvement is consistent, and is not a result of optimizer instability (Clark et al., 2011), we replicate each experimental setup three times, and then estimate the translation quality of the median MT system using the MultEval toolkit.⁹

⁹ https://github.com/jhclark/multeval

The corpus is the same as in Section 4.3: the training part contains 112,527 sentences from the Russian-English TED corpus; a randomly sampled 3K sentences are used for tuning and a disjoint set of 2K sentences is used for test. We lowercase both sides, and use Stanford CoreNLP¹⁰ tools to tokenize the corpora. We employ the SRILM toolkit (Stolcke, 2002) to linearly interpolate the target side of the training corpus with the WMT English corpus, optimizing towards the MT tuning set. This LM is used in all experiments.

¹⁰ http://nlp.stanford.edu/software/corenlp.shtml

The rest of this section is organized as follows. First, we compare two approaches to applying the determiners classifier. Then, we provide a detailed description of experiments with synthetic phrases. We evaluate various aspects of synthetic phrase generation and summarize all the results in Table 3. In Table 5 we show examples of improved translations.

Classifier application: one-pass vs. iterative. First, as an intrinsic evaluation of the prediction strategy, we remove definite and indefinite articles from the reference translations (2K test sentences) and then employ the determiners classifier to reproduce the original sentences. In Table 2 we report the word error rate (WER) derived from the Levenshtein distance between the original sentences and the sentences (1) without articles, (2) with articles recovered using one-pass prediction, and (3) with articles recovered using iterative prediction. The WER is averaged over all test sentences. Both one-pass and iterative approaches are effective in the task of determiner prediction, reducing the number of errors by 44%. The iterative approach yields slightly lower WER, hence we employ iterative prediction in the subsequent experiments with synthetic phrases.
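The one-pass vs. iterative distinction can be made concrete with a short sketch. The following toy implements the best-first greedy loop; `classify` is an invented stub standing in for the trained determiners classifier (its confidence values and the threshold are made up), and for simplicity it only handles insertions, not removals or substitutions.

```python
ARTICLES = {"the", "a", "an"}

def classify(words, i):
    """Stub for the determiners classifier: returns (best_class, confidence)
    for the slot before words[i]. A real implementation would score D/I/N
    with the trained log-linear model; this toy only wants 'the' before
    an article-less 'cat'."""
    if words[i] == "cat" and (i == 0 or words[i - 1] not in ARTICLES):
        return "D", 0.9
    return "N", 0.6

def iterative_predict(words, threshold=0.8):
    """Best-first greedy updates: in each iteration apply the single most
    confident article insertion, re-run predictions on the updated phrase,
    and stop when no position clears the confidence threshold."""
    words = list(words)
    while True:
        candidates = []
        for i in range(len(words)):
            if words[i] in ARTICLES:
                continue  # never stack an article before another article
            label, conf = classify(words, i)
            if label != "N" and conf >= threshold:
                candidates.append((conf, i, label))
        if not candidates:
            return words
        conf, i, label = max(candidates)  # highest-confidence update first
        words.insert(i, "the" if label == "D" else "a")

result = iterative_predict(["i", "saw", "cat"])  # -> ["i", "saw", "the", "cat"]
```

One-pass prediction would instead commit to every position's label in a single sweep; the iterative variant lets each accepted article change the context seen by later decisions, which is why it yields slightly lower WER in Table 2.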

[Figure 2: Synthetic entry generation example.

  original phrase:   я увидел кошку ||| i saw a cat ||| f0 f1 f2 f3 f4 exp(0) exp(0)
  synthetic phrase:  я увидел кошку ||| i saw the cat ||| f0 f1 f2 f3 f4 exp(1) exp(0)

The original parallel phrase has two additional boolean features (set to false) indicating that this is not a synthetic phrase and not a short phrase. We apply our determiners classifier to predict an article at each location marked with a dashed box. Based on the classifier prediction we derive a new phrase I saw the cat. Since the corresponding parallel entry is not in the original phrase table, we set the synthetic indicator feature to 1.]
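The entry-synthesis step of Figure 2 might be sketched as follows, assuming Moses-style ` ||| `-separated phrase-table lines. `make_synthetic_entries` and the toy `swap_article` rewriter are illustrative stand-ins for the LM- and classifier-based generators, not the authors' code.

```python
def make_synthetic_entries(line, rewrite, existing_targets):
    """Given one phrase-table line 'src ||| tgt ||| f0 f1 f2 f3 f4', emit
    new lines for each rewritten target phrase not already in the table,
    copying the five translation features and appending two indicator
    features: exp(1) for synthetic origin, exp(0) for not-short."""
    src, tgt, feats = [f.strip() for f in line.split("|||")][:3]
    entries = []
    for new_tgt in rewrite(tgt):
        if new_tgt != tgt and new_tgt not in existing_targets:
            entries.append(f"{src} ||| {new_tgt} ||| {feats} exp(1) exp(0)")
    return entries

def swap_article(tgt):
    """Toy rewriter: substitute a <-> the at each article position (a real
    generator would also insert or remove articles between adjacent words)."""
    toks = tgt.split()
    out = []
    for i, t in enumerate(toks):
        if t in ("a", "the"):
            alt = "the" if t == "a" else "a"
            out.append(" ".join(toks[:i] + [alt] + toks[i + 1:]))
    return out

line = "я увидел кошку ||| i saw a cat ||| f0 f1 f2 f3 f4"
new = make_synthetic_entries(line, swap_article, {"i saw a cat"})
# new[0] == "я увидел кошку ||| i saw the cat ||| f0 f1 f2 f3 f4 exp(1) exp(0)"
```

The guard against targets already present in the table mirrors the paper's rule of adding a synthetic phrase only if the phrase pair is not observed in the original model.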

  Post-processing   WER
  None              5.6%
  One-pass          3.2%
  Iterative         3.1%

Table 2: WER (lower is better) of reference translations without articles and of post-processed reference translations. Both one-pass and iterative approaches are effective in the task of determiner prediction.

MT output post-processing. We then evaluate the post-processing strategy directly on the MT output. We experiment with one-pass and iterative post-processing of two variants of the baseline system outputs: the original output and the output without articles (we remove the articles prior to post-processing). The results are listed in Table 3. Interestingly, we do not obtain any improvements applying the determiners classifier in the conventional way, as MT output post-processing. It is the combination of linguistically-motivated features with synthetic phrases that contributes to the best performance.

LM-based synthetic phrases. As discussed above, LM-based (short) phrases are shorter than 3 tokens, and their synthetic variants contain the same words with articles inserted or deleted between each adjacent pair of words. The phrase table of the baseline system contains 2,441,678 phrase pairs. There are 518,453 original short phrases, and our technique yields 842,252 new synthetic entries which we append to the baseline phrase table. Table 3 shows the evaluation of the median SMT system (derived from three systems) with short phrases. In these systems the five phrasal translation features are the same as in the baseline systems. The improvement in BLEU score (Papineni et al., 2002) is statistically significant (p < .05) compared to the baseline system.

Classifier-generated synthetic phrases. We apply the classifier with iterative prediction directly to the baseline phrase table entries and synthesize 944,145 new parallel phrases, increasing the phrase table size by 38%. The phrasal translation features in each synthetic phrase are the same as in the phrase it was derived from. The BLEU score of the median SMT system with synthetic phrases is 22.9 ± .1; the improvement is statistically significant (p < .01). Post-processing a phrase table created from corpora without articles and adding the synthetic phrases to the baseline phrase table yielded similar results.

Translation features for synthetic phrases. In the following experiments we aim to establish the optimal set of translation features that should be used with synthetic phrases. We train several SMT systems, each containing synthetic phrases derived from the original phrase table by iterative classification, together with LM-based short phrases. Each synthetic phrase has the same five translation features as the original phrase it was derived from. The additional features that we evaluate are:

1. Boolean feature for LM-based synthetic phrases

2. Boolean feature for classifier-generated synthetic phrases

3. Classifier confidence: posterior probability of the classifier averaged over all samples in a target phrase.

4. Boolean feature indicating the confidence of the classifier: the feature value is 1 iff Feature 3 scores higher than some threshold. The threshold was set to 0.8; we did not experiment with other values.

5. Boolean feature for a synthetic phrase of any type: LM-based or classifier-generated

Table 3 details the change in the BLEU score of each experimental setup. The best performing system has five original phrase translation features and two additional boolean features indicating whether the phrase is LM-based or not, and classifier-generated or not. Note that all the synthetic systems are significantly better than the baseline.

  MT System                                BLEU
  Baseline                                 22.6 ± .1
  MT output post-processing
    one-pass, MT output with articles      20.8
    one-pass, MT output without articles   19.7
    iterative, MT output with articles     22.6
    iterative, MT output without articles  21.8
  With synthetic phrases
    LM-based phrases                       22.9 ± .1
    + classifier-generated phrases         22.9 ± .1
    + features 1,2                         23.0 ± .1
    + features 1,2,3                       22.8 ± .1
    + features 1,2,3,4                     22.8 ± .1
    + feature 5                            22.9 ± .1

Table 3: Summary of experiments with MT output post-processing and with synthetic translation options in a phrase table. Post-processing of the MT output does not improve translations. The best performing system with synthetic phrases has five original phrase translation features and two additional boolean features indicating whether the phrase is LM-based or not, and classifier-generated or not. All the synthetic systems are significantly better than the baseline system.

Czech-English. Our technique was developed using the Russian-English system in the TED domain, so we want to see how our method generalizes to a different domain when translating from a different language. We therefore applied our most successful configuration to a Czech-English news translation task.¹¹ For training, we use the WMT Czech-English parallel corpus CzEng0.7; we tune using the WMT2011 test set and test on the WMT2012 test set. The LM is trained on the target side of the training corpus. The determiners classifier, re-trained on the English side of this corpus with statistical, lexical, morphosyntactic and dependency features, obtained an accuracy of 88%.

¹¹ Like Russian, Czech is a Slavic language that does not have definite or indefinite articles.

In Table 4, we report the results of evaluating the performance of the Russian-to-English and Czech-to-English MT systems with synthetic phrases. The results of both systems show a statistically significant (p < .01) improvement in terms of BLEU score.

              Russian      Czech
  Baseline    22.6 ± .1    16.0 ± .05
  Synthetic   23.0 ± .1    16.2 ± .03

Table 4: BLEU scores of Russian-to-English and Czech-to-English MT systems with synthetic phrases and features 1 and 2, showing a significant improvement.

Qualitative analysis. Table 5 shows some examples from the output of our Russian-to-English systems. Although both systems produce comprehensible translations, the system augmented with the determiners classifier is more fluent. The first example represents a case where a singular count noun (piece) is present, which requires an article. The baseline is not able to identify this requirement and hence does not insert the article an before the phrase extraordinary engineering piece. Our system, however, correctly identifies the construction requiring an article and thus provides the appropriate form of the article (an, the indefinite article form for lexical items beginning with a vowel). Thus we see that our system is able to capture the linguistic requirement of singular count nouns to co-occur with an article. In the second row, the lexical item poor is used as an adjective. The baseline has inserted an article in front of it, changing it to a noun. Our system, however, is able to maintain the status of poor as an adjective since it has the option not to insert an article. Thus we see that besides fluency, our system also does better in maintaining the grammatical category of a lexical item. In the third row, the phrase three

billion people refers to a nonidentifiable referent. The baseline inserts the definite article the. If a human subject reads this translation, it would mislead him/her to interpret the object three billion people as referring to a specific identifiable set. Our system, on the other hand, correctly selects the determiner class N and hence does not insert an article. Thus we see that our system does not just add fluency but also captures a semantic distinction, namely identifiability, that a human subject makes when producing or interpreting a phrase.

  Source:    но тем не менее , это выдающееся произведение инженерного искусства .
  Reference: but nonetheless , it 's an extraordinary piece of engineering .
  Baseline:  but nevertheless , it 's extraordinary engineering piece of art .
  Ours:      but nevertheless , it 's an extraordinary piece of engineering art .

  Source:    и по многим дефинициям она уже не бедная .
  Reference: and by many definitions she is no longer poor .
  Baseline:  and in a lot definitions , it 's not a poor .
  Ours:      and in a lot definitions she 's not poor .

  Source:    нам нужно накормить три миллиарда городских жителей .
  Reference: we must feed three billion people in cities .
  Baseline:  we need to feed the three billion urban hundreds of them .
  Ours:      we need to feed three billion people in the city .

Table 5: Examples of translations with improved article handling.

7 Related Work

Automated determiner prediction has been found beneficial in a variety of applications, including post-editing of MT output (Knight and Chander, 1994), text generation (Elhadad, 1993; Minnen et al., 2000), and more recently identification and correction of ESL errors (Han et al., 2006; De Felice and Pulman, 2008; Gamon et al., 2009; Rozovskaya and Roth, 2010). Our work on determiners extends previous studies in several dimensions. While all previous approaches were tested only on NP constructions, we evaluate our classifier on any sequence of tokens.

To the best of our knowledge, the only studies that directly address generation of synthetic phrase table entries were conducted by Chen et al. (2011) and Koehn and Hoang (2007). The former find semantically similar source phrases and produce "fabricated" translations by combining these source phrases with a set of their target phrases; however, they do not observe improvements. The latter work integrates the synthesis of translation options into the decoder. While related in spirit, their method only supports a limited set of generative processes for producing the candidate set (lacking, for instance, the simple and effective phrase post-editing process we have used), and their implementation has been plagued by computational challenges.

Post-processing techniques have been extremely popular. These can be understood as using a translation model to generate a translation skeleton (or k-best skeletons) and then post-editing these in various ways. They have been applied to translation into morphologically rich languages, such as Japanese, German, Turkish, and Finnish (de Gispert et al., 2005; Suzuki and Toutanova, 2006; Suzuki and Toutanova, 2007; Fraser et al., 2012; Clifton and Sarkar, 2011; Oflazer and Durgar El-Kahlout, 2007).

8 Conclusions and future work

The contribution of this work is twofold. First, we propose a new supervised method to predict definite and indefinite articles. Our log-linear model trained on a linguistically-motivated set of features outperforms previously reported results, and obtains the upper bound of accuracy achieved by human subjects given a context of four words. However, the more important result of this work is the experimentally verified idea of improving phrase-based SMT via synthetic phrases. While we have focused on a limited problem in this paper, there are numerous alternative applications, including translation into morphologically rich languages, incorporating (source) contextual information in making local translation decisions, enriching the target language lexicon using lexical translation resources, and many others.

Acknowledgments

We are grateful to Shuly Wintner for insightful suggestions and support. This work was supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533.

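As a rough illustration of the synthetic-translation-option idea, the sketch below augments a toy phrase table with article variants of each target phrase. The function names, the phrase-table representation, and the enumeration of variants are all illustrative assumptions; in particular, the simple insert/remove rules stand in for the paper's log-linear article classifier, which would score candidates in context rather than generate all of them.

```python
# Toy sketch: generate synthetic phrase-table entries by post-editing
# English articles on target phrases. This is NOT the authors' code;
# the rule-based variant generator is a stand-in for their classifier.

ARTICLES = {"the", "a", "an"}

def article_variants(phrase):
    """Given a target phrase as a list of tokens, yield variants with
    one article removed, or with an article prepended when the phrase
    does not already begin with one."""
    variants = []
    # Removal: drop each article token in turn.
    for i, tok in enumerate(phrase):
        if tok in ARTICLES:
            variants.append(phrase[:i] + phrase[i + 1:])
    # Insertion: prepend each candidate article. A real system would
    # let the article classifier choose and score these in context.
    if phrase and phrase[0] not in ARTICLES:
        for art in sorted(ARTICLES):
            variants.append([art] + phrase)
    return variants

def augment_phrase_table(table):
    """Return (src, tgt, is_synthetic) triples: the original entries
    plus one synthetic entry per article variant. The indicator flag
    would become a feature the decoder can weight during tuning."""
    augmented = [(src, tgt, False) for src, tgt in table]
    for src, tgt in table:
        for variant in article_variants(tgt.split()):
            augmented.append((src, " ".join(variant), True))
    return augmented

if __name__ == "__main__":
    table = [("tri milliarda", "three billion")]
    for src, tgt, synth in augment_phrase_table(table):
        print(src, "|||", tgt, "|||", "synthetic" if synth else "original")
```

Because the synthetic entries carry an indicator flag, the decoder can learn to distrust them unless the language model and other features support the added or deleted article, which is the division of labor the paper argues for over one-best post-editing.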
References

J. Aissen. 2003. Differential object marking: Iconicity vs. economy. Natural Language and Linguistic Theory, 21(3):435–483.

C. Callison-Burch, P. Koehn, C. Monz, and O. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics.

M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.

B. Chen, R. Kuhn, and G. Foster. 2011. Semantic smoothing and fabrication of phrase pairs for SMT. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2011).

P. Chen. 2004. Identifiability and definiteness in Chinese. Linguistics, 42(6):1129–1184.

C. Cherry and G. Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of HLT-NAACL 2012, volume 12, pages 34–35.

D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

D. Chiang. 2012. Hope and fear for discriminative training of statistical translation models. The Journal of Machine Learning Research, 98888:1159–1187.

J. H. Clark, C. Dyer, A. Lavie, and N. A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of ACL.

A. Clifton and A. Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of ACL.

R. De Felice and S. G. Pulman. 2008. A classifier-based approach to preposition and determiner error correction in L2 English. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 169–176. Association for Computational Linguistics.

A. de Gispert, J. B. Mariño, and J. M. Crego. 2005. Improving statistical machine translation by classifying and generalizing inflected verb forms. In Proceedings of InterSpeech.

J. DeNero, A. Bouchard-Côté, and D. Klein. 2008. Sampling alignment structure under a Bayesian translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 314–323. Association for Computational Linguistics.

J. DeNero and D. Klein. 2010. Discriminative modeling of extraction sets for machine translation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1453–1463. Association for Computational Linguistics.

V. Eidelman, Y. Marton, and P. Resnik. 2013. Online relative margin maximization for statistical machine translation. In Proceedings of ACL.

M. Elhadad. 1993. Generating argumentative judgment determiners. In AAAI, pages 344–349.

J. R. Finkel, T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL '05: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370, Morristown, NJ, USA. Association for Computational Linguistics.

A. Fraser, M. Weller, A. Cahill, and F. Cap. 2012. Modeling inflection and word-formation in SMT. In Proceedings of EACL.

M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 961–968, Morristown, NJ, USA. Association for Computational Linguistics.

M. Gamon, J. Gao, C. Brockett, A. Klementiev, W. B. Dolan, D. Belenko, and L. Vanderwende. 2009. Using contextual speller techniques and language modeling for ESL error correction. Urbana, 51:61801.

K. Gimpel and N. A. Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL 2012), Montreal, Canada.

N.-R. Han, M. Chodorow, and C. Leacock. 2006. Detecting errors in English article usage by non-native speakers.

K. Knight and I. Chander. 1994. Automated postediting of documents. In Proceedings of the National Conference on Artificial Intelligence, pages 779–779, Seattle, WA.

P. Koehn and H. Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic, June. Association for Computational Linguistics.

P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54. Association for Computational Linguistics.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.

K. Toutanova and C. D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 63–70, Morristown, NJ, USA. Association for Computational Linguistics.

C. Lyons. 1999. Definiteness. Cambridge University Press.

G. Minnen, F. Bond, and A. Copestake. 2000. Memory-based learning for article generation. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, Volume 7, pages 43–48. Association for Computational Linguistics.

T. Mohanan. 1994. Argument Structure in Hindi. CSLI Publications.

K. Oflazer and I. Durgar El-Kahlout. 2007. Exploring different representational units in English-to-Turkish statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA. Association for Computational Linguistics.

A. Rozovskaya and D. Roth. 2010. Training paradigms for correcting errors in grammar and usage. Urbana, 51:61801.

A. Stolcke. 2002. SRILM—an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901–904.

H. Suzuki and K. Toutanova. 2006. Learning to predict case markers in Japanese. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 1049–1056. Association for Computational Linguistics.

H. Suzuki and K. Toutanova. 2007. Generating case markers in machine translation. In Proceedings of HLT-NAACL 2007, pages 49–56.
