Improved Language Modeling for Statistical Machine Translation

Katrin Kirchhoff and Mei Yang
Department of Electrical Engineering, University of Washington, Seattle, WA 98195
{katrin,yangmei}@ee.washington.edu

Abstract

Statistical machine translation systems use a combination of one or more translation models and a language model. While there is a significant body of research addressing the improvement of translation models, the problem of optimizing language models for a specific translation task has not received much attention. Typically, standard word trigram models are used as an out-of-the-box component in a statistical machine translation system. In this paper we apply language modeling techniques that have proved beneficial in automatic speech recognition to the ACL05 machine translation shared data task and demonstrate improvements over a baseline system with a standard language model.

1 Introduction

Statistical machine translation (SMT) makes use of a noisy channel model where a sentence $\bar{e}$ in the desired language can be conceived of as originating as a sentence $\bar{f}$ in a source language. The goal is to find, for every input utterance $\bar{f}$, the best hypothesis $\bar{e}^*$ such that

$$\bar{e}^* = \operatorname{argmax}_{\bar{e}} P(\bar{e}|\bar{f}) = \operatorname{argmax}_{\bar{e}} P(\bar{f}|\bar{e})\,P(\bar{e}) \quad (1)$$

$P(\bar{f}|\bar{e})$ is the translation model expressing probabilistic constraints on the association of source and target strings. $P(\bar{e})$ is a language model specifying the probability of target language strings. Usually, a standard word trigram model of the form

$$P(e_1, \ldots, e_l) \approx \prod_{i=3}^{l} P(e_i \mid e_{i-1}, e_{i-2}) \quad (2)$$

is used, where $\bar{e} = e_1, \ldots, e_l$. Each word is predicted based on a history of two preceding words.

Most work in SMT has concentrated on developing better translation models, decoding algorithms, or minimum error rate training for SMT. Comparatively little effort has been spent on language modeling for machine translation. In other fields, particularly in automatic speech recognition (ASR), there exists a large body of work on statistical language modeling, addressing e.g. the use of word classes, language model adaptation, or alternative probability estimation techniques. The goal of this study was to use some of the language modeling techniques that have proved beneficial for ASR in the past and to investigate whether they transfer to statistical machine translation. In particular, this includes language models that make use of morphological and part-of-speech information, so-called factored language models.

2 Factored Language Models

A factored language model (FLM) (Bilmes and Kirchhoff, 2003) is based on a representation of words as feature vectors and can utilize a variety of information sources in addition to words, such as part-of-speech (POS) information, morphological information, or semantic features, in a unified and principled framework. Assuming that each word $w$ can be decomposed into $K$ features, i.e. $w \equiv f^{1:K}$, a trigram model can be defined as

$$p(f_1^{1:K}, f_2^{1:K}, \ldots, f_T^{1:K}) \approx \prod_{t=3}^{T} p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}) \quad (3)$$

Each word is dependent not only on a single stream of temporally preceding words, but also on additional parallel streams of features. This representation can be used to provide more robust probability estimates when a particular word n-gram has not been observed in the training data but its corresponding feature combinations (e.g. stem or tag trigrams) have been observed. FLMs are therefore designed to exploit sparse training data more effectively.

[Figure 1: Standard backoff path for a 4-gram language model over words (left) and backoff graph over word features (right).]
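To make the factored representation in Equation 3 concrete, the sketch below (not part of the original system) bundles each word with hypothetical stem and POS factors and collects factored trigram counts. The factor names, the toy sentence, and its stem/tag values are illustrative assumptions; an actual FLM would be estimated with SRILM's factored language model tools rather than code like this.

```python
# Minimal sketch of the factored word representation behind Equation (3).
# The factor set {word (W), stem (S), POS (P)} mirrors the features used in
# this paper; the example tokens and their factor values are made up.
from collections import Counter

# Each token is a bundle of parallel factors f^{1:K} = (word, stem, POS).
tagged = [
    {"W": "translations",  "S": "translat", "P": "NNS"},
    {"W": "were",          "S": "were",     "P": "VBD"},
    {"W": "selected",      "S": "select",   "P": "VBN"},
    {"W": "automatically", "S": "automat",  "P": "RB"},
]

def factored_trigrams(tokens, factors=("W", "S", "P")):
    """Yield factored trigrams (f_t, f_{t-1}, f_{t-2}) over the chosen factor streams."""
    for t in range(2, len(tokens)):
        yield tuple(
            tuple(tokens[t - d][f] for f in factors) for d in (0, 1, 2)
        )

counts = Counter(factored_trigrams(tagged))
for gram, c in counts.items():
    print(gram, c)
```

Because every token carries its stem and tag alongside the surface form, a model can fall back on stem or tag trigram counts when the word trigram itself is unseen, which is exactly the situation generalized parallel backoff is designed for.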
However, even when a sufficient amount of training data is available, a language model utilizing morphological and POS information may bias the system towards selecting more fluent translations, by boosting the score of hypotheses with e.g. frequent POS combinations. In FLMs, word feature information is integrated via a new generalized parallel backoff technique. In standard Katz-style backoff, the maximum-likelihood estimate of an n-gram with too few observations in the training data is replaced with a probability derived from the lower-order (n-1)-gram and a backoff weight as follows:

$$p_{BO}(w_t \mid w_{t-1}, w_{t-2}) = \begin{cases} d_c\, p_{ML}(w_t \mid w_{t-1}, w_{t-2}) & \text{if } c > \tau \\ \alpha(w_{t-1}, w_{t-2})\, p_{BO}(w_t \mid w_{t-1}) & \text{otherwise} \end{cases} \quad (4)$$

where $c$ is the count of $(w_t, w_{t-1}, w_{t-2})$, $p_{ML}$ denotes the maximum-likelihood estimate, $\tau$ is a count threshold, $d_c$ is a discounting factor and $\alpha(w_{t-1}, w_{t-2})$ is a normalization factor. During standard backoff, the most distant conditioning variable (in this case $w_{t-2}$) is dropped first, followed by the second most distant variable, etc., until the unigram is reached. This can be visualized as a backoff path (Figure 1(a)). If additional conditioning variables are used which do not form a temporal sequence, it is not immediately obvious in which order they should be eliminated. In this case, several backoff paths are possible, which can be summarized in a backoff graph (Figure 1(b)). Paths in this graph can be chosen in advance based on linguistic knowledge, or at run time based on statistical criteria such as counts in the training set. It is also possible to choose multiple paths and combine their probability estimates. This is achieved by replacing the backed-off probability $p_{BO}$ in Equation 4 by a general function $g$, which can be any non-negative function applied to the counts of the lower-order n-gram. Several different $g$ functions can be chosen, e.g. the mean, weighted mean, product, minimum or maximum of the smoothed probability distributions over all subsets of conditioning factors. In addition to different choices for $g$, different discounting parameters can be selected at different levels in the backoff graph.

One difficulty in training FLMs is the choice of the best combination of conditioning factors, backoff path(s) and smoothing options. Since the space of different combinations is too large to be searched exhaustively, we use a guided search procedure based on Genetic Algorithms (Duh and Kirchhoff, 2004), which optimizes the FLM structure with respect to the desired criterion. In ASR, this is usually the perplexity of the language model on a held-out dataset; here, we use the BLEU scores of the oracle 1-best hypotheses on the development set, as described below. FLMs have previously shown significant improvements in perplexity and word error rate on several ASR tasks (e.g. (Vergyri et al., 2004)).
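The following sketch illustrates the backoff logic of Equation 4 and the idea of generalized parallel backoff, assuming toy count tables, a fixed discount, and $g = \max$; the $\alpha$ normalization is omitted, so the values are unnormalized scores rather than the properly smoothed estimates an FLM toolkit would produce.

```python
# Sketch of Katz-style backoff (Equation 4) and of generalized parallel backoff.
# Counts, the discount d_c, and the threshold tau are toy placeholders, and the
# alpha normalization term is omitted, so the results are unnormalized scores.
from collections import Counter

trigram_counts = Counter()  # (w_t, w_{t-1}, w_{t-2}) -> count, filled from training data
bigram_counts  = Counter()  # (w_t, w_{t-1}) -> count
context_tri    = Counter()  # (w_{t-1}, w_{t-2}) -> count
context_bi     = Counter()  # (w_{t-1},) -> count
TAU, DISCOUNT = 1, 0.7      # count threshold tau and discounting factor d_c

def p_ml(numer, denom):
    return numer / denom if denom else 0.0

def p_backoff(w, h1, h2):
    """Katz-style backoff: discounted ML trigram if seen more than tau times,
    otherwise back off to the bigram estimate (alpha weight omitted)."""
    c = trigram_counts[(w, h1, h2)]
    if c > TAU:
        return DISCOUNT * p_ml(c, context_tri[(h1, h2)])
    return p_ml(bigram_counts[(w, h1)], context_bi[(h1,)])

def generalized_backoff(parent_estimates, g=max):
    """Generalized parallel backoff: instead of following one backoff path,
    combine the estimates obtained by dropping different conditioning factors
    with a non-negative function g (mean, product, min, max, ...)."""
    return g(parent_estimates) if parent_estimates else 0.0

# Example: combine the scores of two parent nodes of the backoff graph, e.g.
# one obtained by dropping the stem factor and one by dropping the POS factor:
# score = generalized_backoff([p_drop_stem, p_drop_pos], g=max)
```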
3 Baseline System

We used a fairly simple baseline system trained using standard tools, i.e. GIZA++ (Och and Ney, 2000) for training word alignments and Pharaoh (Koehn, 2004) for phrase-based decoding. The training data was that provided on the ACL05 Shared MT Task website for four different language pairs (translation from Finnish, Spanish, French, and German into English); no additional data was used. Preprocessing consisted of lowercasing the data and filtering out sentences with a length ratio greater than 9. The total number of training words per language pair ranged between 11.3M (Finnish-English) and 15.7M (Spanish-English). The development data consisted of the development sets provided on the website (2000 sentences each).

We trained our own word alignments, phrase table, language model, and model combination weights. The language model was a trigram model trained using the SRILM toolkit, with modified Kneser-Ney smoothing and interpolation of higher- and lower-order n-grams. Combination weights were trained using the minimum error weight optimization procedure provided by Pharaoh. We use a two-pass decoding approach: in the first pass, Pharaoh is run in N-best mode to produce N-best lists with 2000 hypotheses per sentence. Seven different component model scores are collected from the outputs, including the distortion model score, the first-pass language model score, word and phrase penalties, and bidirectional phrase and word translation scores, as used in Pharaoh (Koehn, 2004). In the second pass, the N-best lists are rescored with additional language models. The resulting scores are then combined with the above scores in a log-linear fashion. The combination weights are optimized on the development set to maximize the BLEU score. The weighted combined scores are then used to select the final 1-best hypothesis. The individual rescoring steps are described in more detail below.

4 Language Models

We trained two additional language models to be used in the second pass: one word-based 4-gram model and a factored trigram model. Both were trained on the same training set as the baseline system. The 4-gram model uses modified Kneser-Ney smoothing and interpolation of higher-order and lower-order n-gram probabilities. The potential advantage of this model is that it models n-grams up to length 4; since the BLEU score is a combination of n-gram precision scores up to length 4, the integration of a 4-gram language model might yield better results. Note that this can only be done in a rescoring framework, since the first-pass decoder can only use a trigram language model.

For the factored language models, a feature-based word representation was obtained by tagging the text with Ratnaparkhi's maximum-entropy tagger (Ratnaparkhi, 1996) and by stemming words using the Porter stemmer (Porter, 1980). Thus, the factored language models use two additional features per word. A word history of up to 2 was considered (3-gram FLMs). Rather than optimizing the FLMs on the development set references, they were optimized to achieve a low perplexity on the oracle 1-best hypotheses (the hypotheses with the best individual BLEU scores) from the first decoding pass. This is done to avoid optimizing the model on word combinations that might never be hypothesized by the first-pass decoder, and to bias the model towards achieving a high BLEU score. Since N-best lists differ for different language pairs, a separate FLM was trained for each language pair. While both the 4-gram language model and the FLMs achieved an 8-10% reduction in perplexity on the dev set references compared to the baseline language model, their perplexities on the oracle 1-best hypotheses were not significantly different from that of the baseline model.
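As an illustration of how the oracle 1-best hypotheses used for FLM tuning can be extracted, the sketch below selects, for each sentence, the N-best entry with the highest sentence-level BLEU score. The data layout is a simplifying assumption, and NLTK's sentence_bleu is used only as a stand-in for whatever BLEU implementation is preferred.

```python
# Sketch of selecting the oracle 1-best hypothesis from each N-best list, i.e.
# the entry with the highest individual (sentence-level) BLEU score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores for short hypotheses

def oracle_hypotheses(nbest_lists, references):
    """nbest_lists[i]: list of tokenized hypotheses for sentence i;
    references[i]: tokenized reference translation for sentence i."""
    oracles = []
    for hyps, ref in zip(nbest_lists, references):
        best = max(
            hyps,
            key=lambda h: sentence_bleu([ref], h, smoothing_function=smooth),
        )
        oracles.append(best)
    return oracles
```

The resulting oracle hypotheses can then serve as the text on which the FLM structure search (conditioning factors, backoff paths, smoothing options) minimizes perplexity, as described above.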
5 N-best List Rescoring

For N-best list rescoring, the original seven model scores are combined with the scores of the second-pass language models using the framework of discriminative model combination (Beyerlein, 1998). This approach aims at an optimal (with respect to a given error criterion) integration of different information sources in a log-linear model, whose combination weights are trained discriminatively. This combination technique has been used successfully in ASR, where weights are typically optimized to minimize the empirical word error count on a held-out set. In this case, we use the BLEU score of the rescored N-best output as the optimization criterion. Optimization is performed using a simplex downhill method known as amoeba search (Nelder and Mead, 1965), which is available as part of the SRILM toolkit.
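A sketch of this second-pass combination step is given below: each hypothesis receives a weighted sum of its log-domain feature scores (the seven first-pass scores plus the second-pass language model scores), and the weights are tuned to maximize dev-set BLEU. The feature layout is an assumption, and SciPy's Nelder-Mead optimizer stands in for the amoeba search provided with SRILM.

```python
# Sketch of log-linear N-best rescoring and discriminative weight tuning.
# Feature layout and the use of NLTK/SciPy are assumptions made for this sketch.
import numpy as np
from scipy.optimize import minimize
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def rescore(nbest_features, nbest_tokens, weights):
    """Pick, for each sentence, the hypothesis with the highest weighted score.
    nbest_features[i]: (N, F) array of log-domain feature scores;
    nbest_tokens[i]: list of N tokenized hypotheses for sentence i."""
    selected = []
    for feats, hyps in zip(nbest_features, nbest_tokens):
        combined = feats @ np.asarray(weights)   # log-linear combination
        selected.append(hyps[int(np.argmax(combined))])
    return selected

def tune_weights(nbest_features, nbest_tokens, references, init_weights):
    """Maximize corpus BLEU of the rescored 1-best output on the dev set
    with a downhill simplex (Nelder-Mead) search over the weights."""
    def neg_bleu(w):
        hyps = rescore(nbest_features, nbest_tokens, w)
        return -corpus_bleu([[r] for r in references], hyps,
                            smoothing_function=smooth)
    result = minimize(neg_bleu, init_weights, method="Nelder-Mead")
    return result.x
```

The tuned weights would then be frozen and applied to the evaluation N-best lists to select the final 1-best output.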
6 Results

The results from the first decoding pass on the development set are shown in Table 1. The second column of Table 1 lists the oracle BLEU scores for the N-best lists, i.e. the scores obtained by always selecting the hypothesis known to have the highest individual BLEU score. We see that considerable improvements can in principle be obtained by a better second-pass selection of hypotheses.

Language pair   1st pass   oracle
Fi-En           21.8       29.8
Fr-En           28.9       34.4
De-En           23.9       31.0
Es-En           30.8       37.4

Table 1: First-pass and oracle results on the dev set (% BLEU).

The language model rescoring results are shown in Table 2, for both types of second-pass language models individually and for their combination. In both cases we obtain small improvements in BLEU score, with the 4-gram model providing larger gains than the 3-gram FLM. Since their combination yielded only negligible additional improvements, only the 4-gram model was used for processing the final evaluation sets. The evaluation results are shown in Table 3.

Language pair   4-gram   FLM    both
Fi-En           22.2     22.2   22.3
Fr-En           30.2     30.2   30.4
De-En           24.6     24.2   24.6
Es-En           31.4     31.0   31.3

Table 2: Second-pass rescoring results (% BLEU) on the dev set for the 4-gram LM, the 3-gram FLM, and their combination.

Language pair   baseline   4-gram
Fi-En           21.6       22.0
Fr-En           29.3       30.3
De-En           24.2       24.8
Es-En           30.5       31.0

Table 3: Second-pass rescoring results (% BLEU) on the evaluation set.

7 Conclusions

We have demonstrated improvements in BLEU score by utilizing more complex language models in the rescoring pass of a two-pass SMT system. We noticed that FLMs performed worse than word-based 4-gram models. However, only trigram FLMs were used in the present experiments; larger improvements might be obtained with 4-gram FLMs. The weights assigned to the second-pass language models during weight optimization were larger than those assigned to the first-pass language model, suggesting that both the word-based model and the FLM provide more useful scores than the baseline language model. Finally, we observed that the overall improvement represents only a small portion of the possible increase in BLEU score indicated by the oracle results, suggesting that better language models do not have a significant effect on overall system performance unless the translation model is improved as well.

Acknowledgements

This work was funded by the National Science Foundation, Grant no. IIS-0308297. We are grateful to Philipp Koehn for assistance with Pharaoh.

References

P. Beyerlein. 1998. Discriminative model combination. In Proc. ICASSP, pages 481-484.

J.A. Bilmes and K. Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of HLT/NAACL, pages 4-6.

K. Duh and K. Kirchhoff. 2004. Automatic learning of language model structure. In Proceedings of COLING.

P. Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA.

J.A. Nelder and R. Mead. 1965. A simplex method for function minimization. Computing Journal, 7(4):308-313.

F.J. Och and H. Ney. 2000. GIZA++: Training of statistical translation models. http://www-i6.informatik.rwth-aachen.de/~och/software/GIZA++.html.

M.F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130-137.

A. Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of EMNLP, pages 133-141.

D. Vergyri et al. 2004. Morphology-based language modeling for Arabic speech recognition. In Proceedings of ICSLP.