Improved Language Modeling for Statistical Machine Translation

Katrin Kirchhoff and Mei Yang
Department of Electrical Engineering, University of Washington, Seattle, WA 98195
{katrin,yangmei}@ee.washington.edu

Abstract

Statistical machine translation systems use a combination of one or more translation models and a language model. While there is a significant body of research addressing the improvement of translation models, the problem of optimizing language models for a specific translation task has not received much attention. Typically, standard word trigram models are used as an out-of-the-box component in a statistical machine translation system. In this paper we apply language modeling techniques that have proved beneficial in automatic speech recognition to the ACL05 machine translation shared data task and demonstrate improvements over a baseline system with a standard language model.

1 Introduction

Statistical machine translation (SMT) makes use of a noisy channel model where a sentence $\bar{e}$ in the desired language can be conceived of as originating as a sentence $\bar{f}$ in a source language. The goal is to find, for every input utterance $\bar{f}$, the best hypothesis $\bar{e}^*$ such that

$$\bar{e}^* = \operatorname{argmax}_{\bar{e}} P(\bar{e}|\bar{f}) = \operatorname{argmax}_{\bar{e}} P(\bar{f}|\bar{e})\,P(\bar{e}) \quad (1)$$

$P(\bar{f}|\bar{e})$ is the translation model expressing probabilistic constraints on the association of source and target strings. $P(\bar{e})$ is a language model specifying the probability of target language strings. Usually, a standard word trigram model of the form

$$P(e_1, \ldots, e_l) \approx \prod_{i=3}^{l} P(e_i \mid e_{i-1}, e_{i-2}) \quad (2)$$

is used, where $\bar{e} = e_1, \ldots, e_l$. Each word is predicted based on a history of two preceding words.

Most work in SMT has concentrated on developing better translation models, decoding algorithms, or minimum error rate training for SMT. Comparatively little effort has been spent on language modeling for machine translation. In other fields, particularly in automatic speech recognition (ASR), there exists a large body of work on statistical language modeling, addressing e.g. the use of word classes, language model adaptation, or alternative probability estimation techniques. The goal of this study was to use some of the language modeling techniques that have proved beneficial for ASR in the past and to investigate whether they transfer to statistical machine translation. In particular, this includes language models that make use of morphological and part-of-speech information, so-called factored language models.

2 Factored Language Models

A factored language model (FLM) (Bilmes and Kirchhoff, 2003) is based on a representation of words as feature vectors and can utilize a variety of information sources in addition to words, such as part-of-speech (POS) information, morphological information, or semantic features, in a unified and principled framework. Assuming that each word $w$ can be decomposed into $K$ features, i.e. $w \equiv f^{1:K}$, a trigram model can be defined as

$$p(f_1^{1:K}, f_2^{1:K}, \ldots, f_T^{1:K}) \approx \prod_{t=3}^{T} p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}) \quad (3)$$

Each word is dependent not only on a single stream of temporally preceding words, but also on additional parallel streams of features. This representation can be used to provide more robust probability estimates when a particular word n-gram has not been observed in the training data but its corresponding feature combinations (e.g. stem or tag trigrams) have been observed. FLMs are therefore designed to exploit sparse training data more effectively.

[Figure 1: Standard backoff path for a 4-gram language model over words (left) and backoff graph over word features (right).]
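To make the factored representation in Equation 3 concrete, the sketch below (not part of the original system) bundles each word with hypothetical stem and POS factors and collects factored trigram counts. The factor names, the toy sentence, and its stem/tag values are illustrative assumptions; an actual FLM would be estimated with SRILM's factored language model tools rather than code like this.

```python
# Minimal sketch of the factored word representation behind Equation (3).
# The factor set {word (W), stem (S), POS (P)} mirrors the features used in
# this paper; the example tokens and their factor values are made up.
from collections import Counter

# Each token is a bundle of parallel factors f^{1:K} = (word, stem, POS).
tagged = [
    {"W": "translations",  "S": "translat", "P": "NNS"},
    {"W": "were",          "S": "were",     "P": "VBD"},
    {"W": "selected",      "S": "select",   "P": "VBN"},
    {"W": "automatically", "S": "automat",  "P": "RB"},
]

def factored_trigrams(tokens, factors=("W", "S", "P")):
    """Yield factored trigrams (f_t, f_{t-1}, f_{t-2}) over the chosen factor streams."""
    for t in range(2, len(tokens)):
        yield tuple(
            tuple(tokens[t - d][f] for f in factors) for d in (0, 1, 2)
        )

counts = Counter(factored_trigrams(tagged))
for gram, c in counts.items():
    print(gram, c)
```

Because every token carries its stem and tag alongside the surface form, a model can fall back on stem or tag trigram counts when the word trigram itself is unseen, which is exactly the situation generalized parallel backoff is designed for.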
However, even when a sufficient amount of training data is available, a language model utilizing morphological and POS information may bias the system towards selecting more fluent translations, by boosting the score of hypotheses with e.g. frequent POS combinations. In FLMs, word feature information is integrated via a new generalized parallel backoff technique. In standard Katz-style backoff, the maximum-likelihood estimate of an n-gram with too few observations in the training data is replaced with a probability derived from the lower-order (n-1)-gram and a backoff weight as follows:

$$p_{BO}(w_t \mid w_{t-1}, w_{t-2}) = \begin{cases} d_c\, p_{ML}(w_t \mid w_{t-1}, w_{t-2}) & \text{if } c > \tau \\ \alpha(w_{t-1}, w_{t-2})\, p_{BO}(w_t \mid w_{t-1}) & \text{otherwise} \end{cases} \quad (4)$$

where $c$ is the count of $(w_t, w_{t-1}, w_{t-2})$, $p_{ML}$ denotes the maximum-likelihood estimate, $\tau$ is a count threshold, $d_c$ is a discounting factor and $\alpha(w_{t-1}, w_{t-2})$ is a normalization factor. During standard backoff, the most distant conditioning variable (in this case $w_{t-2}$) is dropped first, followed by the second most distant variable, etc., until the unigram is reached. This can be visualized as a backoff path (Figure 1(a)). If additional conditioning variables are used which do not form a temporal sequence, it is not immediately obvious in which order they should be eliminated. In this case, several backoff paths are possible, which can be summarized in a backoff graph (Figure 1(b)). Paths in this graph can be chosen in advance based on linguistic knowledge, or at run time based on statistical criteria such as counts in the training set. It is also possible to choose multiple paths and combine their probability estimates. This is achieved by replacing the backed-off probability $p_{BO}$ in Equation 4 by a general function $g$, which can be any non-negative function applied to the counts of the lower-order n-gram. Several different $g$ functions can be chosen, e.g. the mean, weighted mean, product, minimum or maximum of the smoothed probability distributions over all subsets of conditioning factors. In addition to different choices for $g$, different discounting parameters can be selected at different levels in the backoff graph.

One difficulty in training FLMs is the choice of the best combination of conditioning factors, backoff path(s) and smoothing options. Since the space of different combinations is too large to be searched exhaustively, we use a guided search procedure based on Genetic Algorithms (Duh and Kirchhoff, 2004), which optimizes the FLM structure with respect to the desired criterion. In ASR, this is usually the perplexity of the language model on a held-out dataset; here, we use the BLEU scores of the oracle 1-best hypotheses on the development set, as described below. FLMs have previously shown significant improvements in perplexity and word error rate on several ASR tasks (e.g. (Vergyri et al., 2004)).
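The following sketch illustrates the backoff logic of Equation 4 and the idea of generalized parallel backoff, assuming toy count tables, a fixed discount, and $g = \max$; the $\alpha$ normalization is omitted, so the values are unnormalized scores rather than the properly smoothed estimates an FLM toolkit would produce.

```python
# Sketch of Katz-style backoff (Equation 4) and of generalized parallel backoff.
# Counts, the discount d_c, and the threshold tau are toy placeholders, and the
# alpha normalization term is omitted, so the results are unnormalized scores.
from collections import Counter

trigram_counts = Counter()  # (w_t, w_{t-1}, w_{t-2}) -> count, filled from training data
bigram_counts  = Counter()  # (w_t, w_{t-1}) -> count
context_tri    = Counter()  # (w_{t-1}, w_{t-2}) -> count
context_bi     = Counter()  # (w_{t-1},) -> count
TAU, DISCOUNT = 1, 0.7      # count threshold tau and discounting factor d_c

def p_ml(numer, denom):
    return numer / denom if denom else 0.0

def p_backoff(w, h1, h2):
    """Katz-style backoff: discounted ML trigram if seen more than tau times,
    otherwise back off to the bigram estimate (alpha weight omitted)."""
    c = trigram_counts[(w, h1, h2)]
    if c > TAU:
        return DISCOUNT * p_ml(c, context_tri[(h1, h2)])
    return p_ml(bigram_counts[(w, h1)], context_bi[(h1,)])

def generalized_backoff(parent_estimates, g=max):
    """Generalized parallel backoff: instead of following one backoff path,
    combine the estimates obtained by dropping different conditioning factors
    with a non-negative function g (mean, product, min, max, ...)."""
    return g(parent_estimates) if parent_estimates else 0.0

# Example: combine the scores of two parent nodes of the backoff graph, e.g.
# one obtained by dropping the stem factor and one by dropping the POS factor:
# score = generalized_backoff([p_drop_stem, p_drop_pos], g=max)
```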
3 Baseline System

We used a fairly simple baseline system trained using standard tools, i.e. GIZA++ (Och and Ney, 2000) for training word alignments and Pharaoh (Koehn, 2004) for phrase-based decoding. The training data was that provided on the ACL05 Shared MT Task website for four different language pairs (translation from Finnish, Spanish, French, and German into English); no additional data was used. Preprocessing consisted of lowercasing the data and filtering out sentences with a length ratio greater than 9. The total number of training words per language pair ranged between 11.3M (Finnish-English) and 15.7M (Spanish-English). The development data consisted of the development sets provided on the website (2000 sentences each).

We trained our own word alignments, phrase table, language model, and model combination weights. The language model was a trigram model trained using the SRILM toolkit, with modified Kneser-Ney smoothing and interpolation of higher- and lower-order n-grams. Combination weights were trained using the minimum error weight optimization procedure provided by Pharaoh. We use a two-pass decoding approach: in the first pass, Pharaoh is run in N-best mode to produce N-best lists with 2000 hypotheses per sentence. Seven different component model scores are collected from the outputs, including the distortion model score, the first-pass language model score, word and phrase penalties, and bidirectional phrase and word translation scores, as used in Pharaoh (Koehn, 2004). In the second pass, the N-best lists are rescored with additional language models. The resulting scores are then combined with the above scores in a log-linear fashion. The combination weights are optimized on the development set to maximize the BLEU score. The weighted combined scores are then used to select the final 1-best hypothesis. The individual rescoring steps are described in more detail below.

4 Language Models

We trained two additional language models to be used in the second pass: one word-based 4-gram model and a factored trigram model. Both were trained on the same training set as the baseline system. The 4-gram model uses modified Kneser-Ney smoothing and interpolation of higher-order and lower-order n-gram probabilities. The potential advantage of this model is that it models n-grams up to length 4; since the BLEU score is a combination of n-gram precision scores up to length 4, the integration of a 4-gram language model might yield better results. Note that this can only be done in a rescoring framework, since the first-pass decoder can only use a trigram language model.

For the factored language models, a feature-based word representation was obtained by tagging the text with Ratnaparkhi's maximum-entropy tagger (Ratnaparkhi, 1996) and by stemming words using the Porter stemmer (Porter, 1980). Thus, the factored language models use two additional features per word. A word history of up to 2 was considered (3-gram FLMs). Rather than optimizing the FLMs on the development set references, they were optimized to achieve a low perplexity on the oracle 1-best hypotheses (the hypotheses with the best individual BLEU scores) from the first decoding pass. This is done to avoid optimizing the model on word combinations that might never be hypothesized by the first-pass decoder, and to bias the model towards achieving a high BLEU score. Since N-best lists differ for different language pairs, a separate FLM was trained for each language pair. While both the 4-gram language model and the FLMs achieved an 8-10% reduction in perplexity on the dev set references compared to the baseline language model, their perplexities on the oracle 1-best hypotheses were not significantly different from that of the baseline model.
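As an illustration of how the oracle 1-best hypotheses used for FLM tuning can be extracted, the sketch below selects, for each sentence, the N-best entry with the highest sentence-level BLEU score. The data layout is a simplifying assumption, and NLTK's sentence_bleu is used only as a stand-in for whatever BLEU implementation is preferred.

```python
# Sketch of selecting the oracle 1-best hypothesis from each N-best list, i.e.
# the entry with the highest individual (sentence-level) BLEU score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores for short hypotheses

def oracle_hypotheses(nbest_lists, references):
    """nbest_lists[i]: list of tokenized hypotheses for sentence i;
    references[i]: tokenized reference translation for sentence i."""
    oracles = []
    for hyps, ref in zip(nbest_lists, references):
        best = max(
            hyps,
            key=lambda h: sentence_bleu([ref], h, smoothing_function=smooth),
        )
        oracles.append(best)
    return oracles
```

The resulting oracle hypotheses can then serve as the text on which the FLM structure search (conditioning factors, backoff paths, smoothing options) minimizes perplexity, as described above.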
5 N-best List Rescoring

For N-best list rescoring, the original seven model scores are combined with the scores of the second-pass language models using the framework of discriminative model combination (Beyerlein, 1998). This approach aims at an optimal (with respect to a given error criterion) integration of different information sources in a log-linear model, whose combination weights are trained discriminatively. This combination technique has been used successfully in ASR, where weights are typically optimized to minimize the empirical word error count on a held-out set. In this case, we use the BLEU score of the rescored N-best output as the optimization criterion. Optimization is performed using a simplex downhill method known as amoeba search (Nelder and Mead, 1965), which is available as part of the SRILM toolkit.
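A sketch of this second-pass combination step is given below: each hypothesis receives a weighted sum of its log-domain feature scores (the seven first-pass scores plus the second-pass language model scores), and the weights are tuned to maximize dev-set BLEU. The feature layout is an assumption, and SciPy's Nelder-Mead optimizer stands in for the amoeba search provided with SRILM.

```python
# Sketch of log-linear N-best rescoring and discriminative weight tuning.
# Feature layout and the use of NLTK/SciPy are assumptions made for this sketch.
import numpy as np
from scipy.optimize import minimize
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def rescore(nbest_features, nbest_tokens, weights):
    """Pick, for each sentence, the hypothesis with the highest weighted score.
    nbest_features[i]: (N, F) array of log-domain feature scores;
    nbest_tokens[i]: list of N tokenized hypotheses for sentence i."""
    selected = []
    for feats, hyps in zip(nbest_features, nbest_tokens):
        combined = feats @ np.asarray(weights)   # log-linear combination
        selected.append(hyps[int(np.argmax(combined))])
    return selected

def tune_weights(nbest_features, nbest_tokens, references, init_weights):
    """Maximize corpus BLEU of the rescored 1-best output on the dev set
    with a downhill simplex (Nelder-Mead) search over the weights."""
    def neg_bleu(w):
        hyps = rescore(nbest_features, nbest_tokens, w)
        return -corpus_bleu([[r] for r in references], hyps,
                            smoothing_function=smooth)
    result = minimize(neg_bleu, init_weights, method="Nelder-Mead")
    return result.x
```

The tuned weights would then be frozen and applied to the evaluation N-best lists to select the final 1-best output.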
6 Results

The results from the first decoding pass on the development set are shown in Table 1. The second column of Table 1 lists the oracle BLEU scores for the N-best lists, i.e. the scores obtained by always selecting the hypothesis known to have the highest individual BLEU score. We see that considerable improvements can in principle be obtained by a better second-pass selection of hypotheses.

Language pair   1st pass   oracle
Fi-En           21.8       29.8
Fr-En           28.9       34.4
De-En           23.9       31.0
Es-En           30.8       37.4

Table 1: First-pass and oracle results on the dev set (% BLEU).

The language model rescoring results are shown in Table 2, for both types of second-pass language models individually and for their combination. In both cases we obtain small improvements in BLEU score, with the 4-gram model providing larger gains than the 3-gram FLM. Since their combination yielded only negligible additional improvements, only the 4-gram model was used for processing the final evaluation sets. The evaluation results are shown in Table 3.

Language pair   4-gram   FLM    both
Fi-En           22.2     22.2   22.3
Fr-En           30.2     30.2   30.4
De-En           24.6     24.2   24.6
Es-En           31.4     31.0   31.3

Table 2: Second-pass rescoring results (% BLEU) on the dev set for the 4-gram LM, the 3-gram FLM, and their combination.

Language pair   baseline   4-gram
Fi-En           21.6       22.0
Fr-En           29.3       30.3
De-En           24.2       24.8
Es-En           30.5       31.0

Table 3: Second-pass rescoring results (% BLEU) on the evaluation set.

7 Conclusions

We have demonstrated improvements in BLEU score by utilizing more complex language models in the rescoring pass of a two-pass SMT system. We noticed that FLMs performed worse than word-based 4-gram models. However, only trigram FLMs were used in the present experiments; larger improvements might be obtained with 4-gram FLMs. The weights assigned to the second-pass language models during weight optimization were larger than those assigned to the first-pass language model, suggesting that both the word-based model and the FLM provide more useful scores than the baseline language model. Finally, we observed that the overall improvement represents only a small portion of the possible increase in BLEU score indicated by the oracle results, suggesting that better language models do not have a significant effect on overall system performance unless the translation model is improved as well.

Acknowledgements

This work was funded by the National Science Foundation, Grant no. IIS-0308297. We are grateful to Philipp Koehn for assistance with Pharaoh.

References

P. Beyerlein. 1998. Discriminative model combination. In Proc. ICASSP, pages 481-484.

J.A. Bilmes and K. Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of HLT/NAACL, pages 4-6.

K. Duh and K. Kirchhoff. 2004. Automatic learning of language model structure. In Proceedings of COLING.

P. Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA.

J.A. Nelder and R. Mead. 1965. A simplex method for function minimization. Computing Journal, 7(4):308-313.

F.J. Och and H. Ney. 2000. GIZA++: Training of statistical translation models. http://www-i6.informatik.rwth-aachen.de/~och/software/GIZA++.html.

M.F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130-137.

A. Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of EMNLP, pages 133-141.

D. Vergyri et al. 2004. Morphology-based language modeling for Arabic speech recognition. In Proceedings of ICSLP.