Munich-Edinburgh-Stuttgart Submissions at WMT13: Morphological and Syntactic Processing for SMT

Marion Weller¹, Max Kisselew¹, Svetlana Smekalova¹, Alexander Fraser², Helmut Schmid², Nadir Durrani³, Hassan Sajjad⁴, Richárd Farkas⁵

¹University of Stuttgart – {wellermn|kisselmx|smekalsa}@ims.uni-stuttgart.de
²Ludwig-Maximilian University of Munich – {schmid|fraser}@cis.uni-muenchen.de
³University of Edinburgh – [email protected]
⁴Qatar Computing Research Institute – [email protected]
⁵University of Szeged – [email protected]

Abstract

We present 5 systems of the Munich-Edinburgh-Stuttgart joint submissions¹ to the 2013 SMT Shared Task: FR-EN, EN-FR, RU-EN, DE-EN and EN-DE. The first three systems employ inflectional generalization, while the latter two employ parser-based reordering, and DE-EN performs compound splitting. For our experiments, we use standard phrase-based Moses systems and operation sequence models (OSM).

Footnote 1: The language pairs DE-EN and RU-EN were developed in collaboration with the Qatar Computing Research Institute and the University of Szeged.

1 Introduction

Morphologically complex languages often lead to data sparsity problems in statistical machine translation. For translation pairs with morphologically rich source languages and English as target language, we focus on simplifying the input language in order to reduce the complexity of the translation model. The pre-processing of the source language is language-specific, requiring morphological analysis (FR, RU) as well as sentence reordering (DE) and dealing with compounds (DE). Due to time constraints we did not deal with inflection for DE-EN and EN-DE.

The morphological simplification process consists in lemmatizing inflected word forms and dealing with word formation (splitting portmanteau prepositions or compounds). This needs to take into account translation-relevant features (e.g. number) which vary across the different language pairs: while French only has the features number and gender, a wider array of features needs to be considered when modelling Russian (cf. table 6). In addition to morphological reduction, we also apply transliteration models learned from automatically mined transliterations to handle out-of-vocabulary words (OOVs) when translating from Russian.

Replacing inflected word forms with simpler variants (lemmas or the components of split compounds) aims not only at reducing the general complexity of the translation model, but also at decreasing the number of out-of-vocabulary words in the input data. This is particularly the case with German compounds, which are very productive and thus often lack coverage in the parallel training data, whereas the individual components can be translated. Similarly, inflected word forms (e.g. adjectives) benefit from the reduction to lemmas if the full inflection paradigm does not occur in the parallel training data.

For EN-FR, a translation pair with a morphologically complex target language, we describe a two-step translation system built on non-inflected word stems with a post-processing component for predicting morphological features and the generation of inflected forms. In addition to the advantage of a more general translation model, this method also allows the generation of inflected word forms which do not occur in the training data.

2 Experimental setup

The translation experiments in this paper are carried out with either a standard phrase-based Moses system (DE-EN, EN-DE, EN-FR and FR-EN) or with an operation sequence model (RU-EN, DE-EN), cf. Durrani et al. (2013b) for more details. An operation sequence model (OSM) is a state-of-the-art SMT system that learns translation and reordering patterns by representing a sentence pair and its word alignment as a unique sequence of operations (see e.g. Durrani et al. (2011), Durrani et al. (2013a) for more details). For the Moses systems we used the old train-model perl scripts rather than the EMS, so we did not perform Good-Turing smoothing; parameter tuning was carried out with batch-mira (Cherry and Foster, 2012).

The development data consists of the concatenated news-data sets from the years 2008-2011. Unless otherwise stated, we use all constrained data (parallel and monolingual). For the target-side language models, we follow the approach of Schwenk and Koehn (2008) and train a separate language model for each corpus and then interpolate them using weights optimized on development data.

3 French to English

French has a much richer morphology than English; for example, adjectives in French are inflected with respect to gender and number whereas adjectives in English are not inflected at all. This causes data sparsity in coverage of French inflected forms. We try to overcome this problem by simplifying French inflected forms in a pre-processing step in order to adapt the French input better to the English output.

Processing of the training and test data. The pre-processing of the French input consists of two steps: (1) normalizing data that is not well-formed (cf. table 1) and (2) morphological simplification. In the second step, the normalized training data is annotated with Part-of-Speech tags (PoS-tags) and word lemmas using RFTagger (Schmid and Laws, 2008), which was trained on the French treebank (Abeillé et al., 2003). French forms are then simplified according to the rules given in table 2.

1  Removal of empty lines
2  Conversion of HTML special characters like &quot; to the corresponding characters
3  Unification of words that were written both with an œ or with an oe to only one spelling
4  Punctuation normalization and tokenization
5  Putting together clitics and apostrophes like l ’ or d ’ to l’ and d’
Table 1: Text normalization for FR-EN.

Definite determiners: la / l’ / les → le
Indefinite determiners: un / une → un
Adjectives: inflected form → lemma
Portmanteaus: e.g. au → à le
Verb participles inflected for gender and number (ending in ée/és/ées): reduced to the non-inflected verb participle form ending in é
Clitics and apostrophized words are converted to their lemmas: d’ → de, qu’ → que, n’ → ne, ...
Table 2: Rules for morphological simplification.

Data and experiments. We trained a French to English Moses system on the preprocessed and simplified constrained parallel data. Due to tractability problems with word alignment, the 10^9 French-English corpus and the UN corpus were filtered to a more manageable size. The filtering criteria are sentence length (between 15 and 25 words), as well as strings indicating that a sentence is neither French nor English, or otherwise not well-formed, aiming to obtain a subset of good-quality sentences. In total, we use 9M parallel sentences. For the English language model we use large training data with 287.3M true-cased sentences (including the LDC Giga-word data).

We compare two systems: a baseline with regular French text, and a system with the described morphological simplifications. Results for the WMT-2012 test set are shown in table 3. Even though the baseline is better than the simplified system in terms of BLEU, we assume that the translation model of the simplified system benefits from the overall generalization; thus, human annotators might prefer the output of the simplified system. For the WMT-2013 set, we obtain BLEU scores of 29.97 (cs) and 31.05 (ci) with the system built on simplified French (mes-simplifiedfrench).

System                BLEU (cs)   BLEU (ci)
Baseline              29.90       31.02
Simplified French*    29.70       30.83
Table 3: Results of the French to English system (WMT-2012). The marked system (*) corresponds to the system submitted for manual evaluation. (cs: case-sensitive, ci: case-insensitive)

4 English to French

Translating into a morphologically rich language faces two problems: that of asymmetry of morphological information contained in the source and target language and that of data sparsity. In this section we describe a two-step system designed to overcome these types of problems: first, the French data is reduced to non-inflected forms (stems) with translation-relevant morphological features, which is used to build the translation model. The second step consists of predicting all necessary morphological features for the translation output, which are then used to generate fully inflected forms. This two-step setup decreases the complexity of the translation task by removing language-specific features from the translation model. Furthermore, generating inflected forms based on word stems and morphological features makes it possible to generate forms which do not occur in the parallel training data; this is not possible in a standard SMT setup.

The idea of separating the translation into two steps to deal with complex morphology was introduced by Toutanova et al. (2008). Fraser et al. (2012) applied this method to the language pair English-German with an additional special focus on word formation issues such as the splitting and merging of portmanteau prepositions and compounds. The presented inflection prediction system focuses on nominal inflection; verbal inflection is not addressed.

Morphological analysis and resources. The morphological analysis of the French training data is obtained using RFTagger, which is designed [...] FRMOR (column 3).

Post-processing. As the French data has been normalized, a post-processing step is needed in order to generate correct French surface forms: split portmanteaus are merged into their regular forms based on a simple rule set. Furthermore, apostrophes are reintroduced for words like le, la, ne, ... if they are followed by a vowel. Column 4 in table 4 shows post-processing including portmanteau formation. Since we work on lowercased data, an additional recasing step is required.

Experiments and evaluation. We use the same set of reduced parallel data as the FR-EN system; the language model is built on 32M French sentences. Results for the WMT-2012 test set are given in table 5.
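
The EN-FR post-processing step described in this section (reintroducing apostrophes before vowels and merging split portmanteaus) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the rule set of the submitted system: the function name `postprocess`, the portmanteau table, the list of eliding words and the vowel test are all hypothetical, and exceptions of French elision (such as aspirated h) are ignored. Elision is applied before merging so that "à le ami" becomes "à l'ami" rather than "au ami".

```python
# Hypothetical rule tables; the paper does not list its actual rules.
PORTMANTEAUS = {
    ("à", "le"): "au",
    ("de", "le"): "du",
}
ELIDING = {"le", "la", "de", "ne", "que"}
VOWELS = set("aeiouàâéèêëîïôû")


def postprocess(tokens):
    """Reintroduce apostrophes before vowels, then merge split portmanteaus."""
    # Pass 1: elision, e.g. "le ami" -> "l'ami".
    elided = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in ELIDING and nxt[:1] in VOWELS:
            # Drop the final vowel and attach the word to its successor.
            elided.append(tok[:-1] + "'" + nxt)
            i += 2
        else:
            elided.append(tok)
            i += 1

    # Pass 2: portmanteau merging, e.g. "à le" -> "au".
    merged = []
    i = 0
    while i < len(elided):
        pair = tuple(elided[i:i + 2])
        if pair in PORTMANTEAUS:
            merged.append(PORTMANTEAUS[pair])
            i += 2
        else:
            merged.append(elided[i])
            i += 1
    return merged
```

The recasing step mentioned in the text is not covered; the sketch assumes lowercased, tokenized input.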
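
For comparison, the FR-EN pre-processing from section 3 (tables 1 and 2) runs in the opposite direction. The sketch below illustrates a subset of those rules under several assumptions: the tag names `DET`, `ADJ` and `VPP` are hypothetical placeholders rather than RFTagger's actual tagset, the direction of the œ/oe unification (towards "oe") is a guess, the entry `du` is added by analogy with `au`, and tokenization (step 4 of table 1) is assumed to have happened elsewhere.

```python
import html
import re

# Hypothetical lookup tables covering a few of the rules in table 2.
DETERMINERS = {"la": "le", "l'": "le", "les": "le", "une": "un"}
PORTMANTEAUS = {"au": ["à", "le"], "du": ["de", "le"]}
CLITICS = {"d'": "de", "qu'": "que", "n'": "ne"}


def normalize(lines):
    """Apply steps 1, 2, 3 and 5 of table 1 to an iterable of raw lines."""
    for line in lines:
        line = html.unescape(line).strip()      # step 2: &quot; -> "
        if not line:                            # step 1: drop empty lines
            continue
        line = line.replace("œ", "oe")          # step 3 (direction assumed)
        # Step 5: rejoin a short clitic with its detached apostrophe.
        line = re.sub(r"\b(\w{1,2}) (['’])", r"\1\2", line)
        yield line


def simplify(tagged):
    """Simplify (form, pos, lemma) triples following a subset of table 2."""
    out = []
    for form, pos, lemma in tagged:
        key = form.replace("’", "'")
        if key in PORTMANTEAUS:
            out.extend(PORTMANTEAUS[key])       # au -> à le
        elif key in CLITICS:
            out.append(CLITICS[key])            # d' -> de
        elif pos == "DET" and key in DETERMINERS:
            out.append(DETERMINERS[key])        # les -> le
        elif pos == "ADJ":
            out.append(lemma)                   # inflected form -> lemma
        elif pos == "VPP":
            # Participles in ée/és/ées are reduced to the form in é.
            out.append(re.sub(r"é(?:es|e|s)$", "é", form))
        else:
            out.append(form)
    return out
```

Using the PoS tag to decide whether "la" or "les" is a determiner reflects the text's reliance on tagger output; a purely surface-based lookup would also rewrite object pronouns.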
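
The language models used here are, as stated in section 2, interpolated from per-corpus models with weights optimized on development data. The paper does not say how the weights were optimized; one common choice is expectation-maximization over the probabilities each model assigns to the development tokens, sketched below. The function name `optimise_weights` and the input layout are assumptions made for illustration.

```python
def optimise_weights(probs, iterations=50):
    """Estimate linear interpolation weights with EM.

    probs[i][k] is the probability that language model k assigns to
    development token i (given its history). At least one model must
    assign a non-zero probability to every token.
    """
    n_models = len(probs[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        counts = [0.0] * n_models
        for token_probs in probs:
            # Probability of the token under the current mixture.
            mix = sum(w * p for w, p in zip(weights, token_probs))
            # E-step: each model's share of responsibility for the token.
            for k, (w, p) in enumerate(zip(weights, token_probs)):
                counts[k] += w * p / mix
        # M-step: new weights are the average responsibilities.
        weights = [c / len(probs) for c in counts]
    return weights
```

A model that consistently assigns higher probability to the development data receives a larger weight, which is the intended effect of tuning the mixture on in-domain news data.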