
Generating Complex Morphology for Machine Translation

Einat Minkov∗                        Kristina Toutanova            Hisami Suzuki
Language Technologies Institute      Microsoft Research            Microsoft Research
Carnegie Mellon University           Redmond, WA, USA              Redmond, WA, USA
Pittsburgh, PA, USA                  [email protected]          [email protected]
[email protected]

∗ This research was conducted during the author's internship at Microsoft Research.

Abstract

We present a novel method for predicting inflected word forms for generating morphologically rich languages in machine translation. We utilize a rich set of syntactic and morphological knowledge sources from both source and target sentences in a probabilistic model, and evaluate their contribution in generating Russian and Arabic sentences. Our results show that the proposed model substantially outperforms the commonly used baseline of a trigram target language model; in particular, the use of morphological and syntactic features leads to large gains in prediction accuracy. We also show that the proposed method is effective with a relatively small amount of data.

1 Introduction

Machine Translation (MT) quality has improved substantially in recent years due to the application of data-intensive statistical techniques. However, state-of-the-art approaches are essentially lexical, considering every surface word or phrase in both the source sentence and the corresponding translation as an independent entity. A shortcoming of this word-based approach is that it is sensitive to data sparsity. This is an important issue because aligned corpora are an expensive resource that is not abundantly available for many language pairs. It is particularly problematic for morphologically rich languages, where word stems are realized in many different surface forms, which exacerbates the sparsity problem.

In this paper, we explore an approach in which words are represented as collections of morphological entities, and we use this information to aid MT for morphologically rich languages. Our goal is twofold: first, to allow generalization over morphology in order to alleviate the data sparsity problem in morphology generation; second, to model syntactic coherence, in the form of morphological agreement in the target language, to improve the generation of morphologically rich languages. So far, this problem has been addressed in only a very limited manner in MT, most typically by using a target language model.

In the framework suggested in this paper, we train a model that predicts the inflected forms of a sequence of word stems in a target sentence, given the corresponding source sentence. We use word and word-alignment information, as well as lexical resources that provide morphological information about the words on both the source and target sides. Given a sentence pair, we also obtain syntactic analyses of both the source and translated sentences. We generate the inflected forms of words in the target sentence using all of the available information, with a log-linear model that learns the relevant mapping functions.

As a case study, we focus on the English-Russian and English-Arabic language pairs. Unlike English, Russian and Arabic have very rich systems of morphology, each with distinct characteristics. Translating from a morphology-poor to a morphology-rich language is especially challenging, since detailed morphological information must be decoded from a language that does not encode this information or does so only implicitly (Koehn, 2005). We believe that these language pairs are representative in this respect and therefore demonstrate the generality of our approach.

There are several contributions of this work. First, we propose a general approach that shows promise in addressing the challenges of MT into morphologically rich languages. We show that the use of both syntactic and morphological information improves translation quality. We also show the utility of source language information in predicting the word forms of the target language. Finally, we achieve these results with limited morphological resources and training data, suggesting that the approach is generally useful for resource-scarce language pairs in MT.
2 Russian and Arabic Morphology

Table 1 describes the morphological features relevant to Russian and Arabic, along with their possible values. The rightmost column of the table lists the morphological features that are shared by Russian and Arabic, including person, number, gender and tense. While these features are fairly generic (they are also present in English), note that Russian includes an additional gender (neuter) and Arabic has a distinct number notion for two (dual). A central dimension of Russian morphology is case marking, realized as suffixation on nouns and nominal modifiers.[1] The Russian case feature includes six possible values, representing the notions of subject, direct object, location, etc. In Arabic, as in other Semitic languages, word surface forms may include proclitics and enclitics (or prefixes and suffixes, as we refer to them in this paper), concatenated to inflected stems. For nouns, prefixes include conjunctions (wa: "and", fa: "and, so"), prepositions (bi: "by, with", ka: "like, such as", li: "for, to") and a determiner, and suffixes include possessive pronouns. Verbal prefixes include conjunction and negation, and suffixes include object pronouns. Both object and possessive pronouns are captured by an indicator feature for their presence or absence, as well as by features that indicate their person, number and gender. As can be observed from the table, a large number of surface inflected forms can be generated by the combination of these features, making the morphological generation of these languages a non-trivial task.

[1] Case marking also exists in Arabic. However, in many instances it is realized by diacritics, which are ignored in standard orthography. In our experiments, we include case marking in Arabic only when it is reflected in the orthography.

Features          | Russian                                    | Arabic                                    | Both
------------------|--------------------------------------------|-------------------------------------------|------------------------------------------
POS               | (11 categories)                            | (18 categories)                           |
Person            |                                            |                                           | 1, 2, 3
Number            |                                            | dual                                      | sing(ular), pl(ural)
Gender            | neut(er)                                   |                                           | masc(uline), fem(inine)
Tense             | gerund                                     |                                           | present, past, future, imperative
Mood              |                                            | subjunctive, jussive                      |
Case              | dat(ive), prep(ositional), instr(umental)  |                                           | nom(inative), acc(usative), gen(itive)
Negation          |                                            | yes, no                                   |
Determiner        |                                            | yes, no                                   |
Conjunction       |                                            | wa, fa, none                              |
Preposition       |                                            | bi, ka, li, none                          |
ObjectPronoun     |                                            | yes, no; Pers/Numb/Gend of pronoun, none  |
PossessivePronoun |                                            | same as ObjectPronoun                     |

Table 1: Morphological features used for Russian and Arabic

Morphologically complex languages also tend to display a rich system of agreement. In Russian, for example, adjectives agree with head nouns in number, gender and case, and verbs agree with the subject noun in person and number (past tense verbs agree in gender and number). Arabic has a similarly rich system of agreement, with unique characteristics. For example, in addition to agreement involving person, number and gender, it requires a determiner for each word in a definite noun phrase with adjectival modifiers; in a noun compound, a determiner is attached to the last noun in the chain. Also, non-human subject plural nouns require the verb to be inflected in a singular feminine form. Generating these morphologically complex languages is therefore more difficult than generating English in terms of capturing the agreement phenomena.
3 Related Work

The use of morphological features in language modelling has been explored in the past for morphology-rich languages. For example, (Duh and Kirchhoff, 2004) showed that factored language models, which consider morphological features and use an optimized backoff policy, yield lower perplexity.

In the area of MT, there has been a large body of work attempting to modify the input to a translation system in order to improve the generated alignments for particular language pairs. For example, it has been shown (Lee, 2004) that determiner segmentation and deletion in Arabic sentences in an Arabic-to-English translation system improves sentence alignment, thus leading to improved overall translation quality. Another work (Koehn and Knight, 2003) showed improvements by splitting compounds in German. (Nießen and Ney, 2004) demonstrated that a similar level of alignment quality can be achieved with smaller corpora by applying morpho-syntactic source restructuring, using hierarchical lexicon models, when translating from German into English. (Popović and Ney, 2004) experimented successfully with translating from inflectional languages into English, making use of POS tags, word stems and suffixes in the source language. More recently, (Goldwater and McClosky, 2005) achieved improvements in Czech-English MT by optimizing a set of possible source transformations incorporating morphology. In general, this line of work has focused on translating from morphologically rich languages into English; there has been limited research on MT in the opposite direction. Koehn (2005) includes a survey of statistical MT systems in both directions for the Europarl corpus, and points out the challenges of this task. A recent work (El-Kahlout and Oflazer, 2006) experimented with English-to-Turkish translation with limited success, suggesting that inflection generation given morphological features may give positive results.

In the current work, we suggest a probabilistic framework for morphology generation performed as post-processing. It can therefore be considered complementary to the techniques described above. Our approach is general in that it is not specific to a particular language pair, and is novel in that it allows modelling of agreement on the target side. The framework suggested here is most closely related to (Suzuki and Toutanova, 2006), which uses a probabilistic model to generate Japanese case markers for English-to-Japanese MT. Our work can be viewed as a generalization of (Suzuki and Toutanova, 2006) in that our model generates inflected forms of words, and is not limited to generating a small, closed set of case markers. In addition, the morphology generation problem is more challenging in that it requires handling complex agreement phenomena along multiple morphological dimensions.
4 Inflection Prediction Framework

In this section, we define the task of morphological generation as inflection prediction, and introduce the lexical operations relevant to the task.

4.1 Morphology Analysis and Generation

Morphological analysis can be performed by applying language-specific rules. These may include a full-scale morphological analysis with contextual disambiguation or, when such resources are not available, simple heuristic rules, such as treating the last few characters of a word as its morphological suffix. In this work, we assume that lexicons L_S and L_T are available for the source and target languages, respectively. Such lexicons can be created manually, or automatically from data. Given a lexicon L and a surface word w, we define the following operations (a code sketch follows the definitions):

• Stemming – let S_w = {s^1, ..., s^l} be the set of possible morphological stems (lemmas) of w according to L.[2]

• Inflection – let I_w = {i^1, ..., i^m} be the set of surface words that have the same stem as w; that is, i ∈ I_w iff S_i ∩ S_w ≠ ∅.

• Morphological analysis – let A_w = {a^1, ..., a^v} be the set of possible morphological analyses of w. A morphological analysis a is a vector of categorical values, where the dimensions and the possible values of each dimension are defined by L.

[2] Multiple stems are possible due to ambiguity in morphological analysis.
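To make these three operations concrete, here is a minimal Python sketch of a lexicon supporting stemming, inflection and analysis lookup. The class, its methods and the toy transliterated entries are our own illustrative assumptions, not the lexicons used in the paper.

```python
from collections import defaultdict

class Lexicon:
    """Toy lexicon L built from (surface, stem, analysis) triples,
    where an analysis is a dict of morphological feature values."""

    def __init__(self, entries):
        self.stems = defaultdict(set)      # surface word -> stem set S_w
        self.forms = defaultdict(set)      # stem -> surface forms sharing it
        self.analyses = defaultdict(list)  # surface word -> analyses A_w
        for surface, stem, analysis in entries:
            self.stems[surface].add(stem)
            self.forms[stem].add(surface)
            self.analyses[surface].append(analysis)

    def stemming(self, w):
        """S_w: the set of possible stems (lemmas) of w."""
        return self.stems[w]

    def inflection(self, w):
        """I_w: all surface forms i with S_i intersecting S_w."""
        return {i for s in self.stems[w] for i in self.forms[s]}

    def analysis(self, w):
        """A_w: the possible morphological analyses of w."""
        return self.analyses[w]

# Tiny Russian-like example (transliterated, hypothetical entries):
lex = Lexicon([
    ("resurs",   "resurs", {"POS": "NN", "Number": "sg", "Case": "nom"}),
    ("resursov", "resurs", {"POS": "NN", "Number": "pl", "Case": "gen"}),
    ("resursy",  "resurs", {"POS": "NN", "Number": "pl", "Case": "nom"}),
])
assert lex.inflection("resursov") == {"resurs", "resursov", "resursy"}
```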
4.2 The Task

We assume that we are given aligned sentence pairs, where each pair consists of a source and a target sentence, together with lexicons L_S and L_T that support the operations described above. Let the sentence w_1, ..., w_t, ..., w_n be the output of an MT system in the target language. This sentence can be converted into the corresponding stem set sequence S_1, ..., S_t, ..., S_n by applying the stemming operation. The task is then, for every stem set S_t in the output sentence, to predict an inflection y_t from its inflection set I_t. The predicted inflections should both reflect the meaning conveyed by the source sentence and comply with the agreement rules of the target language.[3]

[3] This assumes that the stem sequence output by the MT system is correct.

Figure 1 shows an example of an aligned English-Russian sentence pair: on the source (English) side, POS tags and word dependency structure are indicated by solid arcs. The alignments between English and Russian words are indicated by dotted lines. The dependency structure on the Russian side, also indicated by solid arcs, is in our case given by a treelet MT system (see Section 6.1), projected from the English word dependency structure and the word alignment information. Note that the Russian sentence displays agreement in number and gender between the subject noun (raspredelenie) and the predicate (zaversheno); note also that resursov is in genitive case, as it modifies the noun on its left.

[Figure 1: Aligned English-Russian sentence pair with syntactic and morphological annotation. The English "the allocation of resources has completed" (DET NN+sg PREP NN+pl AUXV+sg VERB+pastpart) is aligned to the Russian "распределение ресурсов завершено" (raspredelenie resursov zaversheno: NN+sg+nom+neut, NN+gen+pl+masc, VERB+perf+pass+part+neut+sg).]

5 Models for Inflection Prediction

5.1 A Probabilistic Model

Our learning framework uses a Maximum Entropy Markov model (McCallum et al., 2000). The model decomposes the overall probability of a predicted inflection sequence into a product of local probabilities for the individual word predictions, where the local probabilities are conditioned on the previous k predictions. The model implemented here is of second order: at any decision point t, we condition the probability distribution over labels on the previous two predictions y_{t-1} and y_{t-2}, in addition to the given (static) word context from both the source and target sentences. That is, the probability of a predicted inflection sequence is defined as:

    p(y | x) = ∏_{t=1}^{n} p(y_t | y_{t-1}, y_{t-2}, x_t),   y_t ∈ I_t

where x_t denotes the given context at position t, and I_t is the set of inflections corresponding to S_t, from which the model chooses y_t.

The features we construct pair up predicates on the context (x, y_{t-1}, y_{t-2}) with predicates on the target label y_t. In this framework, it is straightforward to encode the morphological properties of a word in addition to its surface inflected form. For example, for a particular inflected word form y' and a particular stem s', the derived paired features may include:

    φ_k     = 1 if the surface word y_t is y' and s' ∈ S_{t+1}; 0 otherwise

    φ_{k+1} = 1 if Gender(y_t) = "Fem" and Gender(y_{t-1}) = "Fem"; 0 otherwise

In the first example, a neighboring stem set S_{t+1} is used as a context feature for predicting the target word y_t. The second feature captures gender agreement with the previous word; this is possible because our model is of second order, so we can derive context features describing the morphological properties of the two previous predictions.[4] Note that our model is not a simple multi-class classifier, because our features are shared across multiple target labels: the gender feature above, for example, applies to many different inflected forms. It is therefore a structured prediction model, where the structure is defined by the morphological properties of the target predictions, in addition to the word sequence decomposition.

[4] While we decompose the prediction task left-to-right, an appealing alternative is a top-down decomposition traversing the dependency tree of the sentence. However, this requires syntactic analysis of sufficient quality.
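The decomposition above can be read directly as code. Below is a minimal sketch of the local log-linear distribution over an inflection set I_t and the chain product over a sentence; the feature function mirrors φ_{k+1}, and all function and dictionary names (local_prob, ctx["analyze"], etc.) are our own illustrative choices, not the paper's implementation.

```python
import math

def phi_gender_fem_agreement(y, ctx):
    """Binary feature mirroring phi_{k+1}: the predicted word and the
    previous prediction are both feminine. ctx["analyze"] maps a surface
    form to its (dominant) analysis dict; ctx["y_prev"] is y_{t-1}."""
    a = ctx["analyze"]
    return 1.0 if (a(y).get("Gender") == "Fem"
                   and a(ctx["y_prev"]).get("Gender") == "Fem") else 0.0

def local_prob(y, candidates, ctx, weights, features):
    """p(y_t | y_{t-1}, y_{t-2}, x_t) as a log-linear (softmax) model
    over the inflection set I_t (here `candidates`)."""
    def score(cand):
        return sum(w * f(cand, ctx) for w, f in zip(weights, features))
    z = sum(math.exp(score(c)) for c in candidates)
    return math.exp(score(y)) / z

def sequence_prob(ys, inflection_sets, contexts, weights, features):
    """p(y | x) = product over t of the local probabilities; contexts[t]
    bundles y_{t-1}, y_{t-2} and the static source/target context x_t."""
    p = 1.0
    for t, y in enumerate(ys):
        p *= local_prob(y, inflection_sets[t], contexts[t],
                        weights, features)
    return p
```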
5.2 Feature Categories

The information available for estimating the distribution over y_t can be split into several categories corresponding to the feature source. The first major distinction is monolingual versus bilingual features: monolingual features refer only to the context (and predicted label) in the target language, while bilingual features have access to information in the source sentence, obtained by traversing the word alignment links from target words to a (set of) source words, as shown in Figure 1.

Both monolingual and bilingual features can be further split into three classes: lexical, morphological and syntactic. Lexical features refer to surface word forms, as well as to their stems. Since our model is of second order, our monolingual lexical features include the features of a standard word trigram language model. Furthermore, since our model is discriminative (predicting word forms given their stems), the monolingual lexical model can use stems in addition to predicted words for the left and current positions, as well as stems from the right context. Morphological features are those that refer to the features given in Table 1; morphological information is used in describing the target label as well as its context, and is intended to capture morphological generalizations. Finally, syntactic features make use of syntactic analyses of the source and target sentences. Such analyses may be derived for the target language using the pre-stemmed sentence. Without loss of generality, we use a dependency parsing paradigm here. Given a syntactic analysis, one can construct syntactic features, for example, the stem of the parent word of y_t. Syntactic features are expected to be useful in capturing agreement phenomena.

5.3 Features

Table 2 gives the full set of suggested features for Russian and Arabic, detailed by type. For monolingual lexical features, we consider the stems of the predicted word and its immediately adjacent words, in addition to traditional word bigram and trigram features. For monolingual morphological features, we consider the morphological attributes of the two previously predicted words and of the current prediction; for monolingual syntactic features, we use the stem of the parent node.

Feature categories          | Instantiations
----------------------------|----------------------------------------------------------------------
Monolingual lexical         | s_{t-1}, s_{t-2}, s_t, s_{t+1}; predicted words y_t, y_{t-1}, y_{t-2}
Monolingual morphological   | f(y_{t-2}), f(y_{t-1}), f(y_t), for f in {POS, Person, Number, Gender, Tense, Neg, Det, Prep, Conj, ObjPron, PossPron}
Monolingual syntactic       | parent stem s_{HEAD(t)}
Bilingual lexical           | aligned word sets Al_t, Al_{t-1}, Al_{t+1}
Bilingual morph & syntactic | f(Al_t), f(Al_{t-1}), f(Al_{t+1}), f(Al_{HEAD(t)}), for f in {POS, Person, Number, Gender, Tense, Neg, Det, Prep, Conj, ObjPron, PossPron, Comp}

Table 2: The feature set suggested for the English-Russian and English-Arabic pairs

The bilingual features include the set of words aligned to the focus word at position t; these are treated as a bag of words, i.e., each aligned word is assigned a separate feature. Bilingual lexical features can refer to words aligned to y_t, as well as to words aligned to its immediate neighbors y_{t-1} and y_{t+1}. Bilingual morphological and syntactic features refer to features of the source language that are expected to be useful for predicting morphology in the target language. For example, the bilingual Det (determiner) feature is computed according to the source dependency tree: if a child of a word aligned to w_t is a determiner, the feature is assigned its surface word form (such as a or the). The bilingual Prep feature is computed similarly, by checking the parent chain of the word aligned to w_t for the existence of a preposition. This feature is expected to be useful for predicting Arabic inflected forms with a prepositional prefix, as well as for predicting case marking in Russian. The bilingual ObjPron and PossPron features represent any object pronoun of the word aligned to w_t and a preceding possessive pronoun, respectively; these features are expected to map to the object and possessive pronoun features in Arabic. Finally, the bilingual Comp(ound) feature checks whether a word appears as part of a noun compound in the English source. If this is the case, the feature is assigned the value "head" or "dependent". This feature is relevant for predicting genitive case in Russian and definiteness in Arabic. A sketch of how the Det and Prep features can be computed follows.
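The following hypothetical helpers illustrate the traversal just described, computing the bilingual Det and Prep feature values from a source dependency tree encoded as parent/children maps; the encodings and the Penn-style tags ("DT", "IN") are assumptions for illustration, not the paper's data structures.

```python
def bilingual_det(t, alignment, children, pos, surface):
    """Bilingual Det feature for target position t: the surface form of
    a determiner child of any aligned source word, or None.
    alignment: target index -> iterable of source indices;
    children:  source index -> iterable of dependent indices;
    pos, surface: source index -> POS tag / surface word."""
    for src in alignment.get(t, ()):
        for child in children.get(src, ()):
            if pos[child] == "DT":
                return surface[child]   # e.g. "a" or "the"
    return None

def bilingual_prep(t, alignment, parent, pos, surface):
    """Bilingual Prep feature: walk up the parent chain of each aligned
    source word and return the first preposition found, or None.
    parent: source index -> parent index (None at the root)."""
    for src in alignment.get(t, ()):
        node = parent.get(src)
        while node is not None:
            if pos[node] == "IN":
                return surface[node]
            node = parent.get(node)
    return None
```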
6 Experimental Settings

In order to evaluate the effectiveness of the suggested approach, we performed reference experiments, that is, we used the aligned sentence pairs of reference translations, rather than the output of an MT system, as input.[5] This allows us to evaluate our method at a reduced noise level, since the words and word order are perfect in reference translations. These experiments thus constitute a preliminary step towards the real task of inflecting words in MT.

[5] In this case, y_t should equal w_t, according to the task definition.

6.1 Data

We used a corpus of approximately 1 million aligned sentence pairs for English-Russian, and 0.5 million pairs for English-Arabic. Both corpora are from a technical (software manual) domain, which we believe is somewhat restricted along some morphological dimensions, such as tense and person. We used 1,000 sentence pairs each for development and testing for both language pairs. The details of the data sets are given in Table 3.

Data        | Eng-Rus | Eng-Ara | Avg. sent len (Eng / Rus) | Avg. sent len (Eng / Ara)
------------|---------|---------|---------------------------|---------------------------
Training    | 1M      | 470K    | 14.06 / 12.90             | 12.85 / 11.90
Development | 1,000   | 1,000   | 13.73 / 12.91             | 13.48 / 12.90
Test        | 1,000   | 1,000   | 13.61 / 12.84             |  8.49 /  7.50

Table 3: Data set statistics: corpus size (sentence pairs) and average sentence length (in words)

The sentence pairs were word-aligned using GIZA++ (Och and Ney, 2000) and submitted to a treelet-based MT system (Quirk et al., 2005), which uses the word dependency structure of the source language and projects it onto the target language, creating the structure shown in Figure 1 above.

6.2 Lexicon

Table 4 gives the relevant statistics of the lexicons we used. For Russian, a general-domain lexicon was available to us, consisting of about 80,000 lemmas (stems) with 9.4 inflected forms per stem on average.[6] Limiting the lexicon to word types seen in the training set reduces its size substantially, to about 14,000 stems with an average of 3.8 inflections per stem. We use this latter "domain-adapted" lexicon in our experiments.

[6] The averages reported in Table 4 are by type and do not consider word frequencies in the data.

Source               | Stems  | Avg(|I|) | Avg(|S|)
---------------------|--------|----------|---------
Rus. Lexicon         | 79,309 | 9.4      |
Rus. Lexicon ∩ Train | 13,929 | 3.8      | 1.6
Ara. Lexicon ∩ Train | 12,670 | 7.0      | 1.7

Table 4: Lexicon statistics

For Arabic, as a full-size Arabic lexicon was not available to us, we used the Buckwalter morphological analyzer (Buckwalter, 2004) to derive a lexicon. In order to acquire the stemming and inflection operators, we first submitted all words in our training data to the Buckwalter analyzer. Note that Arabic displays a high level of ambiguity, with each word corresponding to many possible segmentations and morphological analyses; we considered all of the different stems returned by the Buckwalter analyzer in creating a word's stem set. The lexicon created in this manner contains 12,670 distinct stems and 89,360 inflected forms.

For the generation of word features, we only consider one dominant analysis per surface word, for simplicity. In case of ambiguity, we took the first (arbitrary) analysis for Russian. For Arabic, we applied the following heuristic: use the most frequent analysis, estimated from the gold-standard labels in the Arabic Treebank (Maamouri et al., 2005); if a word does not appear in the treebank, choose the first analysis returned by the Buckwalter analyzer. Ideally, the best word analysis would be provided by contextual disambiguation (e.g., (Habash and Rambow, 2005)); we leave this for future work.

6.3 Baseline

As a baseline, we pick a morphological inflection y_t at random from I_t. This random baseline serves as an indicator of the difficulty of the problem. Another, more competitive baseline is a word trigram language model (LM). The LMs were trained using the CMU language modelling toolkit (Clarkson and Rosenfeld, 1997) with default settings, on the training data described in Table 3.
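A minimal sketch of how the random baseline can be scored, skipping out-of-lexicon tokens as in the evaluation described in Section 7; the interface (`inflection_set` returning I_w or None) is hypothetical.

```python
import random

def random_baseline_accuracy(sentences, inflection_set, seed=0):
    """Accuracy of picking an inflection uniformly at random from I_t.
    `sentences` holds gold target sentences as lists of words;
    out-of-lexicon tokens (punctuation, unknown lemmas) are skipped."""
    rng = random.Random(seed)
    correct = total = 0
    for sent in sentences:
        for w in sent:
            candidates = inflection_set(w)
            if not candidates:
                continue  # no prediction made for these tokens
            total += 1
            if rng.choice(sorted(candidates)) == w:
                correct += 1
    return correct / total if total else 0.0
```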
6.4 Experiments

In the experiments, our primary goal is to evaluate the effectiveness of the proposed model using all of the features available to us. Additionally, we are interested in the contribution of each information source, namely the morpho-syntactic and bilingual features. We therefore study the performance of models with the full feature schemata as well as models restricted to feature subsets according to the feature types described in Section 5.2. The models are as follows: Monolingual-Word, including LM-like word and stem n-gram features only; Bilingual-Word, which also includes bilingual lexical features;[7] Monolingual-All, which has access to all the information available in the target language, including morphological and syntactic features; and finally Bilingual-All, which includes all feature types from Table 2.

[7] Overall, this feature set approximates the information that is available to a state-of-the-art statistical MT system.

For each model and language, we perform feature selection in the following manner. Features are represented as feature templates, such as "POS=X", which generate a set of binary features corresponding to different instantiations of the template, as in "POS=NOUN". In addition to individual templates, conjunctions of up to three templates are also considered for selection (e.g., "POS=NOUN & Number=plural"); every conjunction considered contains at least one predicate on the prediction y_t and up to two predicates on the context. The selection algorithm performs a greedy forward stepwise search over the feature templates so as to maximize development set accuracy, and is similar to the one described in (Toutanova, 2006); a sketch is given below. After this process, we performed some manual inspection of the selected templates, and finally obtained 11 and 36 templates for the Monolingual-All and Bilingual-All settings for Russian, respectively. These templates generated 7.9 million and 9.3 million binary feature instantiations in the final model, respectively. The corresponding numbers for Arabic were 27 feature templates (0.7 million binary instantiations) for Monolingual-All and 39 feature templates (2.3 million binary instantiations) for Bilingual-All.
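A compact sketch of the greedy forward stepwise procedure; `train_and_eval` is a hypothetical callback that trains a model with a given template set and returns development-set accuracy, and the stopping criterion is our simplification of the procedure the text describes.

```python
def greedy_template_selection(candidates, train_and_eval, max_templates=40):
    """Greedy forward stepwise selection of feature templates.
    `candidates` is the pool of templates (including conjunctions);
    at each step, add the template with the largest accuracy gain."""
    selected = []
    best_acc = train_and_eval(selected)  # accuracy with no templates
    while len(selected) < max_templates:
        best_gain, best_tpl = 0.0, None
        for tpl in candidates:
            if tpl in selected:
                continue
            acc = train_and_eval(selected + [tpl])
            if acc - best_acc > best_gain:
                best_gain, best_tpl = acc - best_acc, tpl
        if best_tpl is None:  # no remaining template improves accuracy
            break
        selected.append(best_tpl)
        best_acc += best_gain
    return selected, best_acc
```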
7 Results and Discussion

Table 5 shows the accuracy of predicting word forms for the baseline and proposed models. We report accuracy only on words that appear in our lexicons; punctuation, English words occurring in the target sentence, and words with unknown lemmas are excluded from the evaluation. The reported accuracy measure therefore abstracts away from the issue of incomplete lexicon coverage. When we encounter such words in the true MT scenario, we will make no predictions about them and simply leave them unmodified. In our current experiments, 68.2% of all Russian word tokens were in Cyrillic, of which 93.8% were included in our lexicon; 85.5% of all Arabic word tokens were in Arabic characters, of which 99.1% were in our lexicon.[8]

[8] For Arabic, the inflection ambiguity was extremely high: there were on average 39 inflected forms per stem set in our development corpus (per token), as opposed to 7 in Russian. We therefore limited the evaluation of Arabic to stems that have up to 30 inflected forms, resulting in 17 inflected forms per stem set on average in the development data.

Model            | Eng-Rus | Eng-Ara
-----------------|---------|--------
Random           | 31.7    | 16.3
LM               | 77.6    | 31.7
Monolingual Word | 85.1    | 69.6
Bilingual Word   | 87.1    | 71.9
Monolingual All  | 87.1    | 71.6
Bilingual All    | 91.5    | 73.3

Table 5: Accuracy (%) results by model

The results in Table 5 show that the suggested models outperform the language model substantially for both languages. In particular, the contribution of both the bilingual and the non-lexical features is noteworthy: adding non-lexical features consistently leads to a 1.5% to 2% absolute gain in both the monolingual and bilingual settings, in both language pairs. We obtain a particularly large gain in the Russian bilingual case, where the absolute gain is more than 4%, translating to a 34% error rate reduction. Adding bilingual features has a similar effect, gaining about 2% (and 4% for Russian non-lexical models) in accuracy over the monolingual models. The overall accuracy is lower for Arabic than for Russian, reflecting the inherent difficulty of the task, as indicated by the random baseline (31.7% for Russian vs. 16.3% for Arabic).

In order to evaluate the effectiveness of the model in alleviating the data sparsity problem in morphological generation, we trained inflection prediction models on various subsets of the training data described in Table 3 and tested their accuracy. The results are given in Figure 2. With as few as 5,000 training sentence pairs, the model obtains much better accuracy than the language model, which is trained on data larger by a few orders of magnitude. We also note that the learning curve becomes less steep as more training data is used, suggesting that the models successfully learn generalizations.

[Figure 2: Accuracy (%) as a function of training data size (5,000 to 100,000 sentence pairs) for the RUS-bi-word, RUS-bi-all, ARA-bi-word and ARA-bi-all models.]

We have also manually examined some representative cases where the proposed model failed to make a correct prediction. In both Russian and Arabic, a very common pattern was a mistake in predicting the gender (as well as the number and person, in Arabic) of pronouns. This may be attributed to the fact that the correct choice of pronoun requires coreference resolution, which is not available to our model. A more thorough analysis of the results should help bring further improvements.

8 Conclusions and Future Work

We presented a probabilistic framework for morphological generation given aligned sentence pairs, incorporating morpho-syntactic information from both the source and target sentences. The results, using reference translations, show that the proposed models achieve substantially better accuracy than language models, even with a relatively small amount of training data. Our models using morpho-syntactic information also outperformed models using only lexical information by a wide margin. This result is very promising for our ultimate goal of improving MT output by using a specialized model for target language morphological generation. Though this goal is clearly outside the scope of this paper, we conducted a preliminary experiment in which an English-to-Russian MT system was trained on a stemmed version of the aligned data and used to generate stemmed word sequences, which were then inflected using the suggested framework. This simple integration of the proposed model with the MT system improved the BLEU score by 1.7. The most obvious next step of our research, therefore, is to further pursue the integration of the proposed model into the end-to-end MT scenario.

There are multiple paths towards further improvements over the results presented here. These include refinements in feature design, word analysis disambiguation, and morphological and syntactic analysis on the English source side (e.g., assigning semantic role tags), to name a few. Another area of investigation is capturing longer-distance agreement phenomena, which can be done by implementing a global statistical model, or by using features from dependency trees more effectively.

References

Tim Buckwalter. 2004. Buckwalter Arabic morphological analyzer version 2.0.

Philip Clarkson and Roni Rosenfeld. 1997. Statistical language modelling using the CMU-Cambridge toolkit. In Eurospeech.

Kevin Duh and Kathrin Kirchhoff. 2004. Automatic learning of language model structure. In COLING.

Ilknur Durgar El-Kahlout and Kemal Oflazer. 2006. Initial explorations in English to Turkish statistical machine translation. In NAACL Workshop on Statistical Machine Translation.

Sharon Goldwater and David McClosky. 2005. Improving statistical MT through morphological analysis. In EMNLP.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In ACL.

Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In EACL.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Young-Suk Lee. 2004. Morphological analysis for statistical machine translation. In HLT-NAACL.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Hubert Jin. 2005. Arabic Treebank: Part 1 v 3.0. Linguistic Data Consortium.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML.

Sonja Nießen and Hermann Ney. 2004. Statistical machine translation with scarce resources using morpho-syntactic information. Computational Linguistics, 30(2):181–204.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In ACL.

Maja Popović and Hermann Ney. 2004. Towards the use of word stems and suffixes for statistical machine translation. In LREC.

Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: Syntactically informed phrasal SMT. In ACL.

Hisami Suzuki and Kristina Toutanova. 2006. Learning to predict case markers in Japanese. In COLING-ACL.

Kristina Toutanova. 2006. Competitive generative models with structure learning for NLP classification tasks. In EMNLP.