Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information
Sonja Nießen∗ Hermann Ney∗ RWTH Aachen RWTH Aachen
In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. We propose the construction of hierarchical lexicon models on the basis of equivalence classes of words. In addition, we introduce sentence-level restructuring transformations which aim at the assimilation of word order in related sentences. We have systematically investigated the amount of bilingual training data required to maintain an acceptable quality of machine translation. The combination of the suggested methods for improving translation quality in frameworks with scarce resources has been successfully tested: We were able to reduce the amount of bilingual training data to less than 10% of the original corpus, while losing only 1.6% in translation quality. The improvement of the translation results is demonstrated on two German-English corpora taken from the Verbmobil task and the Nespole! task.
1. Introduction
The statistical approach to machine translation has proved successful in various com- parative evaluations since its revival by the work of the IBM research group more than a decade ago. The IBM group dispensed with linguistic analysis, at least in its earliest publications. Although the IBM group finally made use of morphological and syntactic information to enhance translation quality (Brown et al. 1992; Berger et al. 1996), most of today’s statistical machine translation systems still consider only surface forms and use no linguistic knowledge about the structure of the languages involved. In many applications only small amounts of bilingual training data are available for the desired domain and language pair, and it is highly desirable to avoid at least parts of the costly data collection process. The main objective of the work reported in this article is to introduce morphological knowledge in order to reduce the amount of bilingual data necessary to sufficiently cover the vocabulary expected in testing. This is achieved by explicitly taking into account the interdependencies of related inflected forms. In this work, a hierarchy of equivalence classes at different levels of abstraction is proposed. Features from those hierarchy levels are combined to form hierarchical lexicon models, which can replace the standard probabilistic lexicon used
∗ Lehrstuhl fur ¨ Informatik VI, Computer Science Department, RWTH Aachen—University of Technology, D-52056 Aachen, Germany. E-mail: [email protected]; [email protected].