
-- Submissions at WMT13: Morphological and Syntactic Processing

Marion Weller¹, Max Kisselew¹, Svetlana Smekalova¹, Alexander Fraser², Helmut Schmid², Nadir Durrani³, Hassan Sajjad⁴, Richárd Farkas⁵
¹University of Stuttgart, ²Ludwig Maximilian University of Munich, ³University of Edinburgh, ⁴Qatar Computing Research Institute, ⁵University of Szeged

Introduction

We deal with problems caused by morphologically complex languages and diverging syntactic structures:
• Simplification of morphologically rich source languages
  – Morphological reduction (FR-EN/RU-EN)
  – Compound splitting (DE-EN)
• Modeling target-side morphology (EN-FR)
  – Simplified translation model on stems
  – Generate inflected target words
• Sentence reordering (DE-EN/EN-DE)
• Transliteration mining (RU-EN)

We used standard phrase-based Moses systems or operation sequence models (OSM).
• Target-side language model: individual models for each corpus were trained and then interpolated using weights optimized on the dev set.

English – French

Two-step system: Translation model built on stems, with a component for generating inflected target forms using morphological features. This makes it possible to generate new inflected forms [5], [7].

Stemming and markup: Nouns are marked with the inflection-relevant features gender and number. Portmanteaus are split: au → à + le

Predicting morphological features: Using a sequence model, the stem markup is propagated over the complete linguistic phrase.

Post-processing: Split portmanteaus are merged.

Experiments: Statistically significant improvements with a small system (1), less clear results on a full system (2); slightly better results for a factored model.

SMT output with     predicted    generated     after post-    gloss
stem markup         features     forms         processing
réduction[N]        Fem.Pl       réductions    réductions     reductions
de[P]               –            de            du             of
le[ART]             Masc.Sg      le                           the
budget[N]           Masc.Sg      budget        budget         budget
de[P]               –            de            de             of
le[ART]             Fem.Sg       la            la             the
défense[N]          Fem.Sg       défense       défense        défense
Table 2: Processing steps for the input phrase "defence cuts".

System (Moses)               BLEU (ci)    BLEU (cs)
(1) Baseline                 24.91        23.40
    InflPred                 25.31        23.81
    InflPred-factored        25.53        24.04
(2) Baseline                 29.32        27.65
    InflPred (submitted)     29.07        27.40
    InflPred-factored        29.17        27.46
Table 3: Results for French inflection prediction on the WMT-2012 test set.

English – German

• English and German can have diverging clause orders, which makes SMT difficult.
• The English SVO order is mapped to the different clausal orders in German [3].
• English clauses are annotated with the clause type of their German translation.
• English clauses are then reordered to the German order according to rules.

system (Moses)    BLEU (ci)    BLEU (cs)    system name
EN-DE             19.68        18.97        MES-reorder
Table 7: Results on WMT-2013 (blindtest).

Russian – English

• Many of the Russian morphological features are redundant in RU-EN translation.
• We applied semi-automatic correction to the output of the Russian TreeTagger [8] and trained a new version of RFTagger [6].
• Morphological reduction was applied to nouns, pronouns, verbs, adjectives, prepositions and conjunctions.
• Transliteration mining: unknown words are transliterated [2]; inflectional suffixes are removed before transliteration.
• We compared systems built with GIZA++ and transliteration-augmented GIZA++ (TA-GIZA++): TA-GIZA++ leads to an improvement in BLEU.
• The morph-reduced data leads to decreased BLEU scores, probably caused by problems with choosing the right verb tense.

             original corpus      morph-reduced
             WMT12     WMT13      WMT12     WMT13
GIZA++       32.51     25.5       31.22     24.3
TA-GIZA++    33.40     25.9       31.40     24.45
Table 1: Results for RU-EN: original vs. reduced data (OSM).

French – English

• In contrast to English, French NPs are inflected for gender and number.
• Pre-processing: text normalization and tagging.
• The language model includes the LDC Gigaword data.
• Sparsity problems are avoided by simplifying French nominal inflection, see Table 4.
• The baseline system is slightly better (BLEU) than the simplified version; however, we assume that the simplified system benefits from the generalization.

Definite determiners                      la / l’ / les → le
Indefinite determiners                    un / une → un
Adjectives                                infl. form → lemma
Portmanteaus                              e.g. au → à le
Verb participles inflected for gender     reduced to non-inflected verb
and number, ending in ée/és/ées           participle form ending in é
Clitics and apostrophized words           d’ → de, n’ → ne, qu’ → que, ...
→ lemmatized form
Table 4: Rules for morphological simplification.

system (Moses)       BLEU (cs)    BLEU (ci)
Baseline             29.90        31.02
Simplified French    29.70        30.83
Table 5: Results for French to English (WMT-2012).

German – English

• Mapping German to SVO order: sentence reordering rules [1] are applied to parsed German data.
• We use a version of BitPar [3] optimized with targeted self-training: from the top 100 parses, select the parse tree with the highest usefulness score for the reordering task.
• Usefulness score: similarity between the word order obtained after reordering and the word order indicated by automatic word alignment.
• Better generalization: linguistically informed compound splitting [2] based on SMOR [5] and corpus statistics, as well as portmanteau splitting.

system                                     BLEU (ci)    BLEU (cs)    system name
DE-EN (OSM)                                27.60        26.12        MES
DE-EN (OSM), BitPar not self-trained       27.48        25.99        not submitted
DE-EN (Moses)                              27.14        25.65        MES-Szeged-reorder-split
DE-EN (Moses), BitPar not self-trained     26.82        25.36        not submitted
Table 6: Results on WMT-2013.

References

[1] A. Fraser. Experiments in morphosyntactic processing for translating to and from German. EACL-WMT 2009.
[2] F. Fritzinger, A. Fraser. How to avoid burning ducks: combining linguistic analysis and corpus statistics for German compound processing. ACL-WMT 2010.
[3] A. Gojun, A. Fraser. Determining the placement of German verbs in English-to-German SMT. EACL 2012.
[4] H. Sajjad, A. Fraser, H. Schmid. A statistical model for unsupervised and semi-supervised transliteration mining. ACL 2012.
[5] H. Schmid. Efficient Parsing of highly ambiguous context-free grammars with bit-vectors. COLING 2004.
[6] H. Schmid, F. Laws. Estimation of conditional probabilities with decision trees and an application to fine-grained PoS-tagging. COLING 2008.
[7] H. Schmid, A. Fitschen, U. Heid. SMOR: a German Computational Morphology covering Derivation, Composition and Inflection. LREC 2004.
[8] S. Sharoff, M. Kopotev, T. Erjavec, A. Feldmann, D. Divjak. Designing and evaluating Russian Tagsets. LREC 2008.
[9] K. Toutanova, H. Suzuki, A. Ruopp. Applying Morphology Generation Models to Machine translation. ACL-HLT 2008.

Acknowledgements

The research leading to these results received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreements n. 248005 and n. 287658, the Deutsche Forschungsgemeinschaft grants SFB 732 and Models of Morphosyntax for SMT, and the European Social Fund through Project FuturICT.hu (grant n. TÁMOP-4.2.2.C-11/1/KONV-2012-0013).

The language pairs DE-EN and RU-EN were developed in collaboration with the Qatar Computing Research Institute and the University of Szeged.
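
The French simplification rules of Table 4 (French – English) can be illustrated with a small sketch. This is not the authors' implementation: the `Token` type, the tag names `DET`, `ADJ` and `VPP`, and the function names are assumptions made for the example, and the input is assumed to be already tagged and lemmatized. The `du` entry is taken from the portmanteau shown in Table 2; other portmanteaus are left out.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str   # surface form from the text
    pos: str    # part-of-speech tag (hypothetical tagset: DET, ADJ, VPP, ...)
    lemma: str  # lemma supplied by the tagger


# Rule tables transcribed from Table 4.
DEFINITE = {"la": "le", "l'": "le", "les": "le"}
INDEFINITE = {"un": "un", "une": "un"}
PORTMANTEAUS = {"au": ["à", "le"], "du": ["de", "le"]}  # "du" is from Table 2
CLITICS = {"d'": "de", "n'": "ne", "qu'": "que"}
PARTICIPLE_ENDINGS = ("ées", "ée", "és")  # longest ending first


def strip_participle(form: str) -> str:
    """Reduce a participle inflected for gender/number to the form in -é."""
    for ending in PARTICIPLE_ENDINGS:
        if form.endswith(ending):
            return form[: -len(ending)] + "é"
    return form


def simplify(tokens: List[Token]) -> List[str]:
    """Apply the Table 4 rules to a tagged sentence; nouns are left as they are."""
    out: List[str] = []
    for tok in tokens:
        # Match case-insensitively and treat both apostrophe styles alike.
        key = tok.form.lower().replace("’", "'")
        if key in PORTMANTEAUS:
            out.extend(PORTMANTEAUS[key])
        elif key in CLITICS:
            out.append(CLITICS[key])
        elif tok.pos == "DET" and key in DEFINITE:
            out.append(DEFINITE[key])
        elif tok.pos == "DET" and key in INDEFINITE:
            out.append(INDEFINITE[key])
        elif tok.pos == "ADJ":
            out.append(tok.lemma)
        elif tok.pos == "VPP":
            out.append(strip_participle(key))
        else:
            out.append(tok.form)
    return out


sentence = [
    Token("les", "DET", "le"),
    Token("décisions", "N", "décision"),
    Token("importantes", "ADJ", "important"),
    Token("adoptées", "VPP", "adopter"),
    Token("au", "P+D", "au"),
    Token("Parlement", "N", "parlement"),
]
print(simplify(sentence))
# ['le', 'décisions', 'important', 'adopté', 'à', 'le', 'Parlement']
```

Keying the determiner rules on the tag as well as the form keeps pronoun uses of "la" or "les" untouched, which is one plausible reading of the table rather than something the poster states.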