Modeling the Position and Inflection of Verbs in English to German
Total Page:16
File Type:pdf, Size:1020Kb
Modeling the position and inflection of verbs in English to German machine translation Von der Fakult¨atInformatik, Elektrotechnik und Informationstechnik der Universit¨at Stuttgart zur Erlangung der W¨urdeeines Doktors der Philosophie (Dr. phil.) genehmigte Abhandlung. Vorgelegt von Anita Ramm aus Vukovar, Kroatien Hauptberichter Prof. Dr. Alexander Fraser 1. Mitberichterin PD Dr. Sabine Schulte im Walde 2. Mitberichter Prof. Dr. Jonas Kuhn Tag der m¨undlichen Pr¨ufung:23.03.2018 Institut f¨urMaschinelle Sprachverarbeitung der Universit¨atStuttgart 2018 Abstract Machine translation (MT) is automatic translation of speech or text from a source lan- guage (SL) into a target language (TL). Machine translation is performed without hu- man interaction: a text or a speech signal is used as input to a computer program which automatically generates translation of the input data. There are two main approaches to machine translation: rule-based and corpus-based. Rule-based MT systems rely on manually or semi-automatically acquired rules which describe lexical, as well as syntactic correspondences between SL and TL. Corpus-based methods learn such correspondences automatically using large sets of parallel texts, i.e., texts which are translations of each other. Corpus-based methods such as statistical machine translation (SMT) and neural machine translation (NMT) rely on statistics which express the probability of translat- ing a specific SL translation unit into a specific TL unit (typically, word and/or word sequence). While SMT is a combination of different statistical models, NMT makes use of a neural network which encodes parallel SL and TL contexts in a more sophisticated way. Many problems have been observed when translating from English into German using SMT. One of the most prominent errors in the German SMT outputs is related to the verbs. SMT often misplaces or does not generate the German verbs at all. Furthermore, the inflected German verb forms are often incorrect. This thesis describes methods for handling the two respective problems. While the positional problems are dealt with in a pre-processing step which can be seen as a preparation of the English data for training and translation, the verbal inflection is handled in a post-processing step and can thus be seen as an automatic post-editing (or correction) step to the translation. Consider the position of the verbs have/habe and read/gelesen in the following English- German sentence pair: I have read that book $ Ich habe dieses Buch gelesen. For SMT, the different position of the participles read/gelesen is problematic since the trans- lation step needs to jump over many words, in this case over the words dieses/this and Buch/book, to place the German participle into the correct position. Such positional differences, caused by grammatical constraints in English and German, are given in al- most all sentence types and lead to many errors in the German translations. I correct these errors by applying the so-called preordering of the English sentences. Preordering transforms (i.e., reorders) English sentences in a way that they encounter German-like word order. The reordered English texts are then used to train English-German SMT models and also to translate English test sentences. Thus, instead of being trained on the sentence pairs such as I have read that book $ Ich habe dieses Buch gelesen, an English-German SMT system is now trained on the following data: I have that book read $ Ich habe dieses Buch gelesen. Doing this, SMT does not need to perform problematic search for the correct positions of the German verbs since they correspond iii to the positions of their English counterparts. I test the improvement potential of pre- ordering for English!German in many different experimental setups in which the effect of domain, size of the training data, as well as the used language models is taken into account. The experiments show that the German translations generated by an SMT model trained on reordered English sentences have more verbs which are more often correctly placed when compared with translations generated by an SMT model trained on the original parallel corpus. Correct placement of the verbs does not mean that their inflection is correct. The English!German SMT systems have problems generating correct German verb forms and this problem gets even more severe when reordering of the English data is performed. The difficulty of generating the correct German verb forms is due to the difference in the morphological richness of English and German. English differentiates between only a few forms of a single verb lemma, while in German, a single verb lemma has many different inflectional variants. In the context of (S)MT, this means that a single English verb form may be translated into numerous German verb forms, e.g., had $ {hatte, hattest, hatten, gehabt}. Which of these variants is correct, depends on the context in which the verbs occur. In German, for instance, the subjects require a specific form of the finite verbs, e.g., I work $ Ich arbeite or They work $ Sie arbeiten. This linguistic property is called subject-verb agreement. SMT often fails to capture required contextual dependencies between subjects and verbs which leads to German translations in which the subject-verb agreement is violated: those translations are grammatically incorrect. The German verbal morphology includes not only information about agreement (person and number), but also about tense and mood. Generation of the verb forms with tense and mood properties which do not correspond to the source leads to sentences which may be interpreted incorrectly. Furthermore, if the target language constraints on usage of tense and mood are not met, the translations are, as in the case of false subject- verb agreement, grammatically incorrect. In this thesis, both of the inflection-related problems are tackled with a subsequent generation of the German finite verbs according to morphological features derived by considering relevant contextual information. Subject-verb agreement errors are dealt with by parsing the German SMT outputs. Given a parse tree, first the subject-verb pairs are identified. Subsequently, the person and number features of the subject are transferred to the corresponding finite verb. This approach ensures that the agreement is established between the generated subjects and the agreeing finite verbs in the German translations. The method works well for the used test set, its success, however, largely depends on the parsing accuracy. As mentioned above, the generation of the German verbs also requires information about tense and mood. These morphological features are gained with a classifier which is trained on many types of different contextual information derived from the English and German sentences. Although the classification accuracy is relatively high when computed on well-formed test sets, it is not sufficient to generally improve tense and mood of the verbs in the German SMT outputs. In some cases, the predicted values indeed correct false German verbs: particularly the German finite verbs generated as translations of the English non-finite verbs profit from the tense and mood prediction step. iv The tense/mood classifier used in this thesis is a first attempt to model tense and mood translation from English to German. A deeper analysis of the translation exam- ples, of the parallel data, as well as of the theoretical research on (human) translation in general shows the whole complexity of the problem. The present thesis includes a summary of the most important findings with respect to the translation of tense and mood for the English!German translation direction. Not only the theoretical knowl- edge about this topic is required when it comes to its automatic modeling. Tense and mood depend on factors which need to be extractable from the data in terms of their automatic annotation. One of the by-products of my research is an open-source tool for the annotation of tense and mood for English and German in the monolingual context. Along with the results and discussions provided in this theses, the tool provides a strong basis for further work in this research area. v Deutsche Zusammenfassung Maschinelle Übersetzung (MÜ) befasst sich mit der automatischen Erstellung von Über- setzungen. Für einen Text (oder gesprochene Sprache) in der Quellsprache (QS), wird ein MÜ-System dazu verwendet, automatisch, das heißt ohne Hilfe des Menschen, den äquivalenten Text in der Zielsprache (ZS) zu generieren. Im Jahre 2004 wurde das erste frei verfügbare Programm, genannt Moses, zur statistischen maschinellen Übersetzung (SMÜ) veröffentlicht. Dies bedeutete den entscheidenden Durchbruch für den breit ge- fächerten Einsatz der MÜ. Das Erstellen der SMÜ-Systeme ist denkbar einfach: man benötigt lediglich Texte in QS und ZS, die Übersetzungen voneinander sind. Das SMÜ- Modell lernt aus den Texten, welche Wörter bzw. Wortfolgen in QS und ZS Übersetzun- gen voneinander sind. Den Übersetzungspaaren werden Übersetzungswahrscheinlichkei- ten zugewiesen, die auf der Häufigkeit des Auftretens der ermittelten Übersetzungspaare im gegebenen Textpaar basieren. Für viele Sprachpaare generiert SMÜ gute Überset- zungen, allerdings ist die Qualität der SMÜ-Übersetzungen für Sprachpaare, die sich morphologisch und/oder syntaktisch bedeutend voneinander unterscheiden, immer noch mangelhaft. Zu solchen Sprachpaaren gehören auch Englisch und Deutsch. Diese Doktorarbeit befasst sich mit den Verben in den deutschen SMÜ-Übersetzungen mit Englisch