Productive Generation of Compound Words in Statistical Machine Translation

Sara Stymne, Linköping University, Linköping, Sweden ([email protected])
Nicola Cancedda, Xerox Research Centre Europe, Meylan, France ([email protected])

Abstract

In many languages the use of compound words is very productive. A common practice to reduce sparsity consists in splitting compounds in the training data. When this is done, the system incurs the risk of translating components in non-consecutive positions, or in the wrong order. Furthermore, a post-processing step of compound merging is required to reconstruct compound words in the output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order. We also propose new heuristic methods for merging components that outperform all known methods, and a learning-based method that has accuracy similar to the heuristic methods, is better at producing novel compounds, and can operate with no background linguistic resources.

1 Introduction

In many languages, including most of the Germanic (German, Swedish, etc.) and Uralic (Finnish, Hungarian, etc.) language families, so-called closed compounds are used productively. Closed compounds are written as single words without spaces or other word boundaries, as in the Swedish:

gatstenshuggare = gata + sten + huggare
'paving stone cutter' = 'street' + 'stone' + 'cutter'

To cope with the productivity of the phenomenon, any effective strategy should be able to correctly process compounds that have never been seen in the training data as such, although possibly their components have, either in isolation or within a different compound.

The extended use of compounds makes them problematic for machine translation. In translation into a compounding language, often fewer compounds are produced than in normal texts. This can be because the desired compounds are missing in the training data, or because they have not been aligned correctly. When a compound is the idiomatic choice in the translation, an MT system can often produce separate words, genitive or other alternative constructions, or translate only one part of the compound.

Most research on compound translation in combination with SMT has focused on translation from a compounding language into a non-compounding one, typically English. A common strategy then consists in splitting compounds into their components prior to training and translation.

Only a few studies have investigated translation into a compounding language. In that direction, the process becomes:

• Split compounds on the target (compounding-language) side of the training corpus (sketched below);
• Learn a translation model from this split training corpus, from source (e.g. English) into decomposed target (e.g. decomposed German);
• At translation time, translate using the learned model from source into decomposed target;
• Apply a post-processing "merge" step to reconstruct compounds.

The merging step must solve two problems: identify which words should be merged into compounds, and choose the correct form of the compound parts.
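To make the splitting step concrete, the following is a minimal sketch of a corpus-frequency-based splitter in the spirit of Koehn and Knight (2003), not the exact splitter used in this paper; the function names and the vocab_freq frequency table are illustrative assumptions, and real splitters additionally handle linking morphemes such as the Swedish "s".

from itertools import combinations
from math import prod

def candidate_splits(word, vocab_freq, min_len=3):
    """Enumerate segmentations of `word` into known parts of length
    >= min_len, including the unsplit word itself."""
    n = len(word)
    yield (word,)
    for k in range(1, n // min_len):  # k = number of split points
        for points in combinations(range(min_len, n - min_len + 1), k):
            parts = [word[i:j] for i, j in zip((0,) + points, points + (n,))]
            if all(len(p) >= min_len and p in vocab_freq for p in parts):
                yield tuple(parts)

def best_split(word, vocab_freq, min_len=3):
    """Pick the segmentation with the highest geometric mean of part
    frequencies (frequency-based splitting a la Koehn and Knight, 2003)."""
    def geo_mean(parts):
        return prod(vocab_freq.get(p, 0) for p in parts) ** (1.0 / len(parts))
    return max(candidate_splits(word, vocab_freq, min_len), key=geo_mean)

# e.g. best_split("gatstenshuggare", freq) could yield ("gat", "stens", "huggare"),
# provided those surface parts occur in the vocabulary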

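Conversely, the first of the two merging problems, deciding which tokens to join, can be sketched as below. The sketch follows the list-lookup idea of Popović et al. (2006) discussed in Section 2, extended with the compound-versus-bigram frequency check introduced later in Section 4.1; the parts list and frequency tables are hypothetical resources, and restoring the correct form of the parts (the second problem) is not handled.

def merge_compounds(tokens, parts, compound_freq, bigram_freq):
    """Greedy left-to-right merging: join a token with its successor
    when the concatenation is a known compound (list lookup as in
    Popovic et al., 2006) that is observed more often as a compound
    than as a two-word bigram (the check added in Section 4.1)."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        while i + 1 < len(tokens) and tok in parts:
            candidate = tok + tokens[i + 1]
            # merge only if the compound beats the bigram in frequency;
            # an unseen candidate (count 0) is never merged
            if compound_freq.get(candidate, 0) > bigram_freq.get((tok, tokens[i + 1]), 0):
                tok, i = candidate, i + 1
            else:
                break
        out.append(tok)
        i += 1
    return out

# e.g. merge_compounds(["gat", "stens", "huggare"], ...) -> ["gatstenshuggare"];
# chains of more than two parts require the intermediate concatenations to be
# listed as parts, a simplification of this sketch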

The former problem can become hopelessly difficult if the translation has not put components side by side and in the correct order. Before merging, then, the problem of promoting translations in which compound elements are correctly positioned needs to be addressed. We call this promoting compound coalescence.

2 Related work

The first suggestion of a compound merging method for MT that we are aware of was described by Popović et al. (2006). Each word in the translation output is looked up in a list of compound parts, and merged with the next word if this results in a known compound. This method led to improved overall translation results from English to German. Stymne (2008) suggested a merging method based on part-of-speech matching in a factored translation system, where compound parts had a special part-of-speech tag, and compound parts are only merged with the next word if the part-of-speech tags match. This resulted in improved translation quality from English to German, and from English to Swedish (Stymne and Holmqvist, 2008). Another method, based on several decoding runs, was investigated by Fraser (2009).

Stymne (2009a) investigated and compared merging methods inspired by Popović et al. (2006) and Stymne (2008), as well as a method inspired by morphology merging (El-Kahlout and Oflazer, 2006; Virpioja et al., 2007), where compound parts were annotated with symbols, and parts with symbols in the translation output were merged with the next word.

3 Promoting coalescence of compounds

If compounds are split in the training set, then there is no guarantee that translations of components will end up in contiguous positions and in the correct order. This is primarily a language model problem, and we will model it as such, by applying POS language models over specially designed part-of-speech sets and by applying language-model-inspired count features.

The approach proposed in Stymne (2008) consists in running a POS tagger on the target side of the corpus, decomposing only tokens with some predefined POS (e.g. nouns), and then marking with special POS tags whether an element is a head or a modifier. As an example, the German compound "Fremdsprachenkenntnisse", originally tagged as N(oun), would be decomposed and re-tagged before training as:

fremd      sprachen   kenntnisse
N-Modif    N-Modif    N

A POS n-gram language model using this extended tagset then naturally steers the decoder towards translations with good relative placement of these components.

We modify this approach by blurring distinctions among POS not relevant to the formation of compounds, thus further reducing the tagset to only three tags:

• N-p – all parts of a split compound except the last
• N – the last part of the compound (its head) and all other nouns
• X – all other tokens

The above scheme assumes that only noun compounds are treated, but it could easily be extended to other types of compounds. Alternatively, splitting can be attempted irrespective of POS on all tokens longer than a fixed threshold, removing the need for a POS tagger.

3.1 Sequence models as count features

We expect a POS-based n-gram language model on our reduced tagset to learn to discourage sequences unseen in the training data, such as a sequence of compound parts not followed by a suitable head. Such a generative LM, however, might also tend to bias lexical selection towards translations with fewer compounds, since the corresponding tag sequences might be more common in text. To compensate for this bias, we experiment with injecting a small dose of a priori knowledge, and add a count feature which explicitly counts the number of occurrences of POS sequences that we deem good or bad in the translation output (see the sketch below).

Table 1 gives an overview of the possible combinations, using the three-symbol tagset plus sentence beginning and end markers, and their judgment as good, bad or neutral.
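As an illustration of this count feature, the sketch below counts good and bad adjacent tag pairs in a single tagged output sentence, using the judgments from Table 1 below; the function and its interface are assumptions, not the actual decoder integration.

GOOD = {("N-p", "N-p"), ("N-p", "N")}
BAD = {("N-p", "</s>"), ("N-p", "X")}

def coalescence_counts(tags):
    """Count good and bad adjacent tag pairs (Table 1) in one tagged
    sentence; "</s>" marks the sentence end. The two counts would be
    exposed to the decoder as feature values."""
    padded = list(tags) + ["</s>"]
    good = bad = 0
    for pair in zip(padded, padded[1:]):
        if pair in GOOD:
            good += 1
        elif pair in BAD:
            bad += 1
    return good, bad

# coalescence_counts(["X", "N-p", "N-p", "N"]) == (2, 0)
# coalescence_counts(["X", "N-p"]) == (0, 1)   # dangling modifier is bad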

Combination               Judgment
N-p N-p                   Good
N-p N                     Good
N-p </s>                  Bad
N-p X                     Bad
all other combinations    Neutral

Table 1: Tag combinations in the translation output

4.1 Improving and combining heuristics

We empirically verified that the simple heuristic of Popović et al. (2006) tends to misfire quite often, leading to too many compounds. We modify it by adding an additional check: tokens are merged if they appear combined in the list of compounds, but only if their observed frequency as a compound is larger than their frequency as a bigram. This blocks the