Integration Techniques for Selective Stemming in SMT Systems


Stripping Adjectives: Integration Techniques for Selective Stemming in SMT Systems

Isabel Slawik, Jan Niehues, Alex Waibel
Institute for Anthropomatics and Robotics
KIT - Karlsruhe Institute of Technology, Germany
firstname.lastname@kit.edu

© 2015 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution (CC-BY-ND).

Abstract

In this paper we present an approach to reduce data sparsity problems when translating from morphologically rich languages into less inflected languages by selectively stemming certain word types. We develop and compare three different integration strategies: replacing words with their stemmed form, combined input using alternative lattice paths for the stemmed and surface forms, and a novel hidden combination strategy, where we replace the stems in the stemmed phrase table by the observed surface forms in the test data. This allows us to apply advanced models trained on the surface forms of the words. We evaluate our approach by stemming German adjectives in two
German→English translation scenarios: a low-resource condition as well as a large-scale state-of-the-art translation system. We are able to improve between 0.2 and 0.4 BLEU points over our baseline and reduce the number of out-of-vocabulary words by up to 16.5%.

1 Introduction

Statistical machine translation (SMT) is currently the most promising approach to automatically translating text from one natural language into another. While it has been successfully used for many languages and applications, many challenges still remain. Translating from a morphologically rich language is one such challenge, where the translation quality of modern systems is often still not sufficient for many applications.

Traditional SMT approaches work on a lexical level, that is, every surface form of a word is treated as its own distinct token. This can create data sparsity problems for morphologically rich languages, since the occurrences of a word are distributed over all its different surface forms. This problem becomes even more apparent when translating from an under-resourced language, where parallel training data is scarce.

When we translate from a highly inflected language into a less morphologically rich language, not all syntactic information encoded in the surface forms may be needed to produce an accurate translation. For example, verbs in French must agree with the noun in case and gender. When we translate these verbs into English, case and gender information may be safely discarded.

We therefore propose an approach to overcome these sparsity problems by stemming different morphological variants of a word prior to translation. This allows us not only to estimate translation probabilities more reliably, but also to translate previously unseen morphological variants of a word, thus leading to a better generalization of our models. To fully exploit the potential of our SMT system, we looked at three different integration strategies. We evaluated hard decision stemming, where all adjectives are replaced by their stem, as well as soft integration strategies, where we consider the words and their stemmed forms as translation alternatives.

2 Related Work

The specific challenges arising from the translation of morphologically rich languages have been widely studied in the field of SMT. The factored translation model (Koehn and Hoang, 2007) enriches phrase-based MT with linguistic information.
By translating the stem of a word and its morphological components separately and then applying generation rules to form the correct surface form of the target word, it is possible to generate translations for surface forms that have not been seen in training.

Talbot and Osborne (2006) address lexical redundancy by automatically clustering source words with similar translation distributions, whereas Yang and Kirchhoff (2006) propose a backoff model that uses increasing levels of morphological abstraction to translate previously unseen word forms.

Niehues and Waibel (2011) present quasi-morphological operations as a means to translate out-of-vocabulary (OOV) words. The automatically learned operations are able to split off potentially inflected suffixes, look up the translation for the base form using a lexicon of Wikipedia (http://www.wikipedia.org) titles in multiple languages, and then generate the appropriate surface form on the target side.
Sim- we need the surface form of a word to make an in- ilar operations were learned for compound parts formed translation decision. We therefore propose by Macherey et al. (2011). to keep the search space small by only stemming Hardmeier et al. (2010) use morphological re- selected word classes which have a high diversity duction in a German English SMT system by → in inflections and whose additional morphological adding the lemmas of every word output as a by- information content can be safely disregarded. product of compound splitting as an alternative For our use case of translating from German to edge to input lattices. A similar approach is used English, we chose to focus only on stemming ad- by Dyer et al. (2008) and Wuebker and Ney (2012). jectives. Adjectives in German can havefive dif- They used word lattices to represent different ferent suffixes, depending on the gender, number source language alternatives for Arabic English → and case of the corresponding noun, whereas in and German English respectively. → English adjectives are only rarely inflected. We Weller et al. (2013a) employ morphological can therefore discard the information encoded in simplification for their French English WMT → the suffix of a German adjective without losing any system, including replacing inflected adjective vital information for translation. forms with their lemma using hand-written rules, and their Russian English (Weller et al., 2013b) → 3.1 Degrees of Comparison system, removing superfluous attributes from the While we want to remove gender, number and case highly inflected Russian surface forms. Their sys- information from the German adjective, we want tems are unable to outperform the baseline system to preserve its comparative or superlative nature. trained on the surface forms. Weller et al. argue In addition to its base form (e.g. 
3.1 Degrees of Comparison

While we want to remove gender, number and case information from the German adjective, we want to preserve its comparative or superlative nature. In addition to its base form (e.g. schön [pretty]), a German adjective can have one of five suffixes (-e, -em, -en, -er, -es). However, we cannot simply remove all suffixes using fixed rules, because the comparative base form of an adjective is identical to the inflected masculine, nominative, singular form of an attributive adjective.

For example, the inflected form schöner of the adjective schön is used as an attributive adjective in the phrase schöner Mann [handsome man] and as a comparative in the phrase schöner wird es nicht [won't get prettier]. We can stem the adjective in the attributive case to its base form without any confusion (schön Mann), as we generate a form that does not exist in proper German. However, were we to apply the same stemming to the comparative case, we would lose the degree of comparison and still generate a valid German sentence (schön wird es nicht [won't be pretty]) with a different meaning than our original sentence. In order to differentiate between cases in which stemming is desirable and cases where we would lose information, a detailed morphological analysis of the source text prior to stemming is vital.
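The decision rule described above can be sketched as follows. The tag strings are hypothetical, loosely modeled on fine-grained RFTagger-style output (part of speech plus degree, e.g. "ADJA.Pos.Nom.Sg.Masc"), and the suffix stripping is deliberately simplistic; the point is only that the morphological analysis, not the surface form alone, decides whether stemming is safe.

```python
def stem_adjective(word: str, tag: str) -> str:
    """Strip the inflectional suffix only when the analysis says it is safe:
    attributive adjectives (ADJA) in the positive degree are stemmed, while
    comparatives and superlatives keep their surface form so the degree of
    comparison is preserved."""
    fields = tag.split(".")
    pos = fields[0]
    degree = fields[1] if len(fields) > 1 else ""
    if pos == "ADJA" and degree == "Pos":
        for suffix in ("em", "en", "er", "es", "e"):
            if word.endswith(suffix):
                return word[: -len(suffix)]
    return word  # comparative/superlative/predicative forms stay untouched

# Attributive "schöner Mann" is stemmed; comparative "schöner wird es
# nicht" is preserved, keeping the two readings distinct.
print(stem_adjective("schöner", "ADJA.Pos.Nom.Sg.Masc"))  # schön
print(stem_adjective("schöner", "ADJD.Comp"))             # schöner
```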
3.2 Implementation

We used readily available part-of-speech (POS) taggers, namely the TreeTagger (Schmid, 1994) and the RFTagger (Schmid and Laws, 2008), for morphological analysis and stemming. In order to achieve accurate results, we performed standard machine translation preprocessing on our corpora.

4 Integration

After clustering the words into groups that can be translated in the same or at least a similar way, there are different possibilities to use them in the translation system. A naive strategy is to replace each word by its cluster representative, called hard decision stemming. However, this carries the risk of discarding vital information. Therefore we investigated techniques to integrate both the surface forms and the word stems into the translation system. In the combined input, we add the stemmed adjectives as translation alternatives to the preordering lattices. Since this poses problems for the application of more advanced translation models during decoding, we propose the novel hidden combination technique.

4.1 Hard Decision Stemming

Assuming that the translation probabilities of the word stems can be estimated more reliably than those of the surface forms, the most intuitive strategy is to consequently replace each surface form by its stem.
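A minimal sketch of this hard decision preprocessing, assuming a POS-tagged source sentence and an illustrative stand-in stemmer; neither the tagset nor the stemmer is the paper's actual pipeline. Every token tagged as an attributive adjective is irreversibly replaced by its stem, and the same replacement must be applied consistently to training data and test input.

```python
def hard_decision_stem(tagged_sentence, stem_fn):
    """Replace each adjective surface form by its stem; all other tokens
    pass through unchanged. tagged_sentence is a list of (word, tag) pairs."""
    return [stem_fn(word) if tag.startswith("ADJA") else word
            for word, tag in tagged_sentence]

def toy_stem(word):
    """Illustrative stand-in stemmer: strip one German adjective ending."""
    for suffix in ("em", "en", "er", "es", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

sentence = [("ein", "ART"), ("schöner", "ADJA.Pos.Nom.Sg.Masc"), ("Mann", "N")]
print(hard_decision_stem(sentence, toy_stem))  # ['ein', 'schön', 'Mann']
```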