Large-scale Discriminative n-gram Language Models for Statistical Machine Translation

Zhifei Li and Sanjeev Khudanpur
Department of Computer Science and Center for Language and Speech Processing
Johns Hopkins University, Baltimore, MD 21218, USA
[email protected] and [email protected]

Abstract

We extend discriminative n-gram language modeling techniques originally proposed for automatic speech recognition to a statistical machine translation task. In this context, we propose a novel data selection method that leads to good models using a fraction of the training data. We carry out systematic experiments on several benchmark tests for Chinese to English translation using a hierarchical phrase-based machine translation system, and show that a discriminative language model significantly improves upon a state-of-the-art baseline. The experiments also highlight the benefits of our data selection method.

1 Introduction

Recent research in statistical machine translation (SMT) has made remarkable progress by evolving from word-based translation (Brown et al., 1993), through flat phrase-based translation (Koehn et al., 2003) and hierarchical phrase-based translation (Chiang, 2005; Chiang, 2007), to syntax-based translation (Galley et al., 2006). These systems usually contain three major components: a translation model, a word-reordering model, and a language model. In this paper, we mainly focus on improving the language model (LM).

A language model constitutes a crucial component in many other tasks such as automatic speech recognition, handwriting recognition, optical character recognition, etc. It assigns a priori probabilities to word sequences. In general, we expect a low probability for an ungrammatical or implausible word sequence. Normally, a language model is derived from a large corpus of text in the target language via maximum likelihood estimation (MLE), in conjunction with some smoothing (Chen and Goodman, 1998). In particular, the so-called n-gram model is particularly effective and has become the dominant LM in most systems. Several attempts have been made, particularly in speech recognition, to improve LMs by appealing to more powerful estimation techniques, e.g., decision trees (Bahl et al., 1989), maximum entropy (Rosenfeld, 1996), neural networks (Bengio et al., 2001), and random forests (Xu and Jelinek, 2004). Attempts have also been made to extend beyond n-gram dependencies by exploiting (hidden) syntax structure (Chelba and Jelinek, 2000) and semantic or topical dependencies (Khudanpur and Wu, 2000). We limit ourselves to n-gram models in this paper, though the ideas presented extend easily to the latter kinds of models as well.

A regular LM obtained through the MLE technique is task-independent and tries to distinguish between likely and unlikely word sequences without considering the actual confusions that may exist in a particular task. This is sub-optimal, as different tasks have different types of confusion. For example, while the main confusion in a speech recognition task is between similar-sounding words (e.g., "red" versus "read"), the confusion in a machine translation task is mainly due to multiple word senses or to word order. An LM derived using an explicitly discriminative criterion has the potential to resolve task-specific confusion: its parameters are tuned/adjusted by observing actual confusions in task-dependent outputs. Such discriminative n-gram language modeling has been investigated by Stolcke and Weintraub (1998), Chen et al. (2000), Kuo et al. (2002), and Roark et al. (2007) on various speech recognition tasks.

It is scientifically interesting to see whether such techniques lead to improvement in an SMT task, given the substantial task differences. We do so in this paper.

We investigate application of the discriminative n-gram language modeling framework of Roark et al. (2007) to a large-scale SMT task. Our discriminative language model is trained using an averaged perceptron algorithm (Collins, 2002). In our task, there are millions of training examples available, and many of them may not be beneficial due to various reasons, including noisy reference translations resulting from automatic sentence-alignment of a document-aligned bilingual text corpus. Moreover, our discriminative model contains millions of features, making standard perceptron training prohibitively expensive. To address these two issues, we propose a novel data selection method that strives to obtain a comparable/better model using only a fraction of the training data. We carry out systematic experiments on a state-of-the-art SMT system (Chiang, 2007) for the Chinese to English translation task. Our results show that a discriminative LM is able to improve over a very strong baseline SMT system. The results also demonstrate the benefits of our data selection method.

2 Discriminative Language Modeling

We begin with the description of a general framework for discriminative language modeling, recapitulating for the sake of completeness the detailed description of these ideas in (Collins, 2002; Roark et al., 2007).

2.1 Global Linear Models

A linear discriminant aims to learn a mapping from an input x ∈ X to an output y ∈ Y, given

1. training examples (x^i, y^i), i = 1, ..., N,
2. a representation Φ: X × Y → R^d mapping each possible (x, y) to a feature vector,
3. a function GEN(x) ⊆ Y that enumerates putative labels for each x ∈ X, and
4. a vector α ∈ R^d of free parameters.

In SMT, x is a sentence in the source language, GEN(x) enumerates possible translations of x into the target language, and y is the desired translation: either a reference translation produced by a bilingual human or the alternative in GEN(x) that is most similar to such a reference translation. In the latter case, y is often called the oracle-best (or simply oracle) translation of x.

The components GEN(·), Φ, and α define a mapping from an input x to an output y* through

    y* = argmax_{y ∈ GEN(x)} Φ(x, y) · α,    (1)

where Φ(x, y) · α = Σ_j α_j Φ_j(x, y), with j indexing the feature dimensions, is the inner product.

Since y is a word sequence, (1) is called a global linear model to emphasize that the maximization is jointly over the entire sentence y, not locally over each word/phrase in y (as done in (Zens and Ney, 2006; Chan et al., 2007; Carpuat and Wu, 2007)).

The learning task is to obtain the "optimal" parameter vector α from training examples, while the decoding task is to search, given an x, for the maximizer y* of (1). These tasks are discussed next in Section 2.2 and Section 2.3, respectively.

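With sparse feature vectors, (1) amounts to a dot product followed by an argmax over the candidate set. The following sketch (Python, with dictionaries standing in for sparse vectors; the helper names are our illustrative choices, not part of the original formulation) makes the two operations concrete.

```python
def dot(features, alpha):
    """Inner product Phi(x, y) . alpha, summing only over non-zero feature dimensions."""
    return sum(value * alpha.get(name, 0.0) for name, value in features.items())

def decode(x, gen, phi, alpha):
    """Eq. (1): return y* = argmax over GEN(x) of Phi(x, y) . alpha."""
    return max(gen(x), key=lambda y: dot(phi(x, y), alpha))
```

Here gen(x) would be the N-best list and phi(x, y) the sparse feature map defined later in Section 2.3.1.
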
2.2 Parameter Estimation Methods

Given a set of training examples, the choice of α may be guided, for instance, by an explicit criterion such as maximizing, among distributions in an exponential family parameterized by Φ, the conditional log-likelihood of y^i given x^i. Algorithms that determine α in this manner typically operate in batch mode: they require processing all the training data (often repeatedly) before arriving at an answer, and they require regularization techniques to prevent over-fitting, but they are amenable to parallelized computing and often exhibit good empirical performance.

On the other hand, sequential algorithms such as the linear perceptron (Collins, 2002) operate in online mode, processing the training data sequentially to incrementally adjust the parameters. They are not amenable to parallelization, but exhibit faster convergence to parameter values that yield comparable empirical performance, particularly when large amounts of training data are available. In this paper, we use the perceptron algorithm due to its simplicity and suitability to large data settings.

    Perceptron(x, GEN(x), y)
    1  α ← 0                                          ▷ initialize as zero vector
    2  for t ← 1 to T
    3      for i ← 1 to N
    4          z^i ← argmax_{z ∈ GEN(x^i)} Φ(x^i, z) · α
    5          if (z^i ≠ y^i)
    6              α ← α + Φ(x^i, y^i) − Φ(x^i, z^i)
    7  return α

Figure 1: The Basic Perceptron Algorithm.

    Rerank-Nbest(GEN(x))
    1  y* ← GEN(x)[1]                                 ▷ baseline 1-best
    2  for y in GEN(x)
    3      S(x, y) ← β Φ_0(x, y) + Σ_{j ∈ [1, F]} α_j Φ_j(x, y)
    4      if S(x, y) > S(x, y*)
    5          y* ← y
    6  return y*

Figure 2: Discriminative LM reranking of the N-best list GEN(x) of a source sentence x.

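For readers who prefer running code to pseudocode, here is a minimal Python rendering of Figure 1. It assumes GEN(x) is already materialized as a candidate list and that phi returns a sparse feature dictionary; these assumptions, and the function names, are ours rather than the paper's.

```python
def dot(features, alpha):
    """Sparse inner product Phi(x, y) . alpha (repeated here to keep the sketch self-contained)."""
    return sum(value * alpha.get(name, 0.0) for name, value in features.items())

def perceptron(train, gen, phi, T):
    """Basic perceptron of Figure 1 over training pairs (x, y), where y is the oracle for x."""
    alpha = {}  # the zero vector, stored sparsely
    for t in range(T):
        for x, y in train:
            # line 4 of Figure 1: best candidate under the current parameters
            z = max(gen(x), key=lambda c: dot(phi(x, c), alpha))
            if z != y:
                # lines 5-6: reward the oracle's features, penalize the mistake's
                for name, value in phi(x, y).items():
                    alpha[name] = alpha.get(name, 0.0) + value
                for name, value in phi(x, z).items():
                    alpha[name] = alpha.get(name, 0.0) - value
    return alpha
```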

2.2.1 Averaged Perceptron Algorithm

Figure 1 depicts the perceptron algorithm (Roark et al., 2007). Given a set of training examples, the algorithm sequentially iterates over the examples and adjusts the parameter vector α. After iterating over the training data a few times, an averaged model, defined as

    α_avg = (1 / (T·N)) Σ_{t=1}^{T} Σ_{i=1}^{N} α_t^i,    (2)

is computed and is used for testing, where α_t^i represents the parameter vector after seeing the i-th example in the t-th iteration, N represents the size of the training set, and T is the number of iterations the perceptron algorithm runs.

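The average in (2) can be accumulated on the fly instead of storing all T·N intermediate parameter vectors. The sketch below extends the perceptron sketch above (reusing its dot() helper); the direct running sum is shown only to mirror (2), since a practical implementation would use lazy, timestamped averaging to avoid touching every weight after each example.

```python
def averaged_perceptron(train, gen, phi, T):
    """Averaged perceptron: return alpha_avg of Eq. (2)."""
    alpha, running_sum = {}, {}
    N = len(train)
    for t in range(T):
        for x, y in train:
            z = max(gen(x), key=lambda c: dot(phi(x, c), alpha))
            if z != y:
                for name, value in phi(x, y).items():
                    alpha[name] = alpha.get(name, 0.0) + value
                for name, value in phi(x, z).items():
                    alpha[name] = alpha.get(name, 0.0) - value
            # accumulate alpha_t^i, the parameter vector after example i of pass t
            for name, value in alpha.items():
                running_sum[name] = running_sum.get(name, 0.0) + value
    return {name: value / (T * N) for name, value in running_sum.items()}
```
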
2.3 The LM Reranking Framework

The model and training algorithms above are general and can be applied in many natural language and speech processing tasks. To apply them in a specific task, one needs to define an application-specific feature vector Φ(x, y), and a decoding algorithm that is needed both for perceptron training (see line 4 of Figure 1) and for testing on held-out data. As in (Roark et al., 2007), we use a reranking approach, employing a discriminative LM estimated using the averaged perceptron to rerank the N-best hypotheses produced by a baseline SMT system. Therefore, GEN(x) in our case is the N-best list for the input x generated by a baseline SMT system. The reranking approach has the advantage of being simple and capable of exploiting non-local features.

2.3.1 Features used in the Discriminative LM

Each component Φ_j(x, y) of the feature vector can be any function of the input x and the output y. We define the feature vector specific to our language modeling task as follows.

Baseline Feature: We first define a baseline feature Φ_0(x, y) to be the score assigned to y by the baseline SMT system. This score itself is often a linear combination of several models, with the relative weights among these models obtained via some minimum error rate training procedure (Och, 2003).

Discriminative n-gram Features: The count of each n-gram in y constitutes a feature. E.g., the first n-gram feature may be

    Φ_1(x, y) = Count of the bigram "the of" in y.

Normally, the baseline SMT system also employs an n-gram LM, and the baseline score Φ_0(x, y) is thus not independent of the n-gram features. However, we do not change the parameterization of the n-gram LM used by the SMT system; i.e., we keep Φ_0(x, y) fixed when carrying out our discriminative LM training.

2.3.2 Reranking as Decoding

Given the feature and parameter vectors, the total score assigned to an output y ∈ GEN(x) for a given input x is

    S(x, y) = β Φ_0(x, y) + Σ_{j ∈ [1, F]} α_j Φ_j(x, y),    (3)

where β is the weight for the baseline feature and F is the number of discriminative n-gram features. To find the optimal weight β for the baseline feature, one could simply treat Φ_0 as a feature in the discriminative language model (DLM) and set the value of β via the perceptron algorithm. This, however, may lead to under-training (Sutton et al., 2006) of the discriminative n-gram features in the DLM: the baseline feature is strongly indicative of the overall goodness of y for x, relative to any single discriminative n-gram feature, which indicates only the local goodness of y. Therefore, we use a fixed value for β during the training of the DLM, as suggested by Roark et al. (2007).

The decoding task therefore is to search for the y that maximizes S(x, y) in (3). Figure 2 illustrates our language model reranking algorithm: it performs a linear scan of the N-best list and maintains the best variant y* in terms of S(x, ·).

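The n-gram count features and the reranking score (3) are simple enough to sketch directly. In the Python fragment below (again assuming the dot() helper from the perceptron sketch), the feature-key naming, the restriction to unigrams and bigrams, and the representation of an N-best entry as a (hypothesis, baseline_score) pair are our illustrative choices.

```python
def ngram_features(y, orders=(1, 2)):
    """Sparse count features of Section 2.3.1: one feature per n-gram occurring in hypothesis y."""
    words = y.split()
    feats = {}
    for n in orders:
        for i in range(len(words) - n + 1):
            key = ("ngram", tuple(words[i:i + n]))
            feats[key] = feats.get(key, 0) + 1
    return feats

def rerank_nbest(nbest, alpha, beta):
    """Figure 2 / Eq. (3): return the hypothesis maximizing beta * Phi_0 + sum_j alpha_j * Phi_j.

    nbest: list of (hypothesis, baseline_score) pairs, baseline 1-best first.
    """
    def score(entry):
        y, phi_0 = entry
        return beta * phi_0 + dot(ngram_features(y), alpha)
    return max(nbest, key=score)[0]
```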

3 Discriminative LM Training for SMT

The discriminative LM reranking framework of Section 2 is quite general and may be applied in many language modeling tasks. In this section, we discuss how this framework is applied in SMT.

The training for our discriminative LM involves five major steps:

1. Train the baseline SMT system(s);
2. Decode all x^i in the discriminative training data using the system(s) trained in Step 1;
3. Find the oracle translation y^i from the N-best list GEN(x^i) corresponding to each x^i;
4. Select samples for discriminative training;
5. Train the discriminative LM per Figure 1.

Since parallel text is scarce, we use the same data in Step 1 and in Steps 2 through 5, i.e., for both SMT training and discriminative training. One could train a single baseline SMT system on the entire parallel corpus and decode all the source language sentences using this system. However, this may lead to a significant mismatch with actual test conditions. In particular, the N-best lists generated on the SMT training data will have better translation quality than those generated on unseen test data. To avoid this pitfall, we partition the parallel text into multiple disjoint sections and train multiple baseline SMT systems in Step 1. To decode sentences from a particular section in Step 2, we use a baseline system that excluded that section. Unlike Roark et al. (2007), who carry out this leave-one-out training for the LM component but not the acoustic models, we do so for both the language and translation models. This is because both the TM and LM in an SMT system are non-parametric and equally susceptible to over-fitting.

In a large-scale task like statistical machine translation, Step 2 is the most time-consuming step, as millions of parallel sentences need to be decoded, each taking several CPU-seconds.

In Step 3, we find the oracle translation in each N-best list by using an automatic sentence-level metric of similarity to the reference translation, specifically BLEU (Papineni et al., 2002).¹ The oracle translation is used to provide supervision during discriminative training.

¹When there is no n-gram match (particularly 4-gram) between the hypothesis and reference, a smoothed version of BLEU is used.

Step 4 is optional, and we discuss its benefits in Section 4. The details of Step 5 have already been discussed in Section 2.

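Steps 1 through 3 can be summarized in a few lines of Python. Everything callable below (train_smt, decode_nbest, sentence_bleu) is a stand-in for the actual baseline training pipeline, the N-best decoder, and the smoothed sentence-level BLEU of footnote 1; the sketch only illustrates the leave-one-out bookkeeping and the oracle selection, not the real implementation.

```python
def build_training_examples(sections, train_smt, decode_nbest, sentence_bleu, n_best=300):
    """Steps 1-3: leave-one-out decoding of the parallel text and oracle selection.

    sections: disjoint sections of the parallel corpus; each is a list of (source, reference) pairs.
    """
    examples = []
    for k, held_out in enumerate(sections):
        # Step 1: a baseline system trained on everything except section k
        system = train_smt([pair for j, sec in enumerate(sections) if j != k for pair in sec])
        for source, reference in held_out:
            nbest = decode_nbest(system, source, n_best)                        # Step 2
            oracle = max(nbest, key=lambda hyp: sentence_bleu(hyp, reference))  # Step 3
            examples.append((source, nbest, oracle, reference))
    return examples
```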

4 Data Selection for Discriminative LMs

After decoding (Step 2) and oracle-finding (Step 3), we can simply present all the training examples (i.e., N-best lists together with the corresponding oracle hypotheses) to the discriminative training algorithm. Alternatively, we can carry out a data selection procedure that selects a subset of these examples. Data selection is useful for speeding up the training procedure, as most training algorithms require computational time proportional to the size of training data. It also has potential to improve the model quality, due to the following three reasons.

1. In general, it is critical to give correct supervision to the training algorithm in order for it to learn a proper model. However, the parallel training data is quite noisy. In particular, some pairs of sentences in the parallel data are not adequate translations of each other. This may be due to poor annotation, or because some of the parallel data are created by an automatic sentence-alignment tool, from a corpus that is manually aligned only at the document level.

2. In a discriminative model with millions of features, it is important to make the model as compact as possible. Intuitively, the more training data, the more features will be activated. To help the model learn the most profitable features, we should present the training examples where the baseline 1-best can be improved the most, while ignoring or postponing examples with only a marginal gap between the 1-best and the oracle hypothesis.

3. Another reason for data selection is that, due to the localness of n-gram features and to its corrective role, our discriminative LM cannot correct drastic errors made by the baseline system. Therefore, it is helpful to present the training algorithm with only correctable training examples, while ignoring cases where the translation model is so far off the mark that no reasonable weight given to a corrective LM will suffice.

In light of these observations, we propose the data selection method illustrated in Figure 3, where G(x, y) represents the goodness (e.g., BLEU) of a hypothesis y with respect to a reference x, while T1, T2, and T3 are three configurable thresholds. Clearly, each of the three conditions listed in the figure deals with one issue described above. Specifically, the first condition eliminates badly aligned parallel data; the second condition makes sure the training examples are highly profitable; while the third condition makes sure the examples are correctable.

    [Figure 3 (diagram): relations among the reference, the oracle, and the baseline 1-best of a sample.]
    G(ref, oracle) > T1
    G(ref, oracle) − G(ref, 1-best) > T2
    G(oracle, 1-best) > T3

Figure 3: To be selected for discriminative training, a sample must satisfy the 3 stated conditions.

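A sketch of the selection test of Figure 3 in Python. G is a stand-in for the sentence-level goodness metric (e.g., smoothed BLEU), and the default thresholds shown are those later used as defaults in Section 5.5; both are assumptions of the sketch rather than fixed parts of the method.

```python
def select_example(G, reference, oracle, one_best, T1=0.10, T2=0.01, T3=0.20):
    """Keep a training example only if it satisfies the three conditions of Figure 3."""
    well_aligned = G(reference, oracle) > T1                          # condition 1: drop badly aligned pairs
    profitable = G(reference, oracle) - G(reference, one_best) > T2   # condition 2: enough headroom over the 1-best
    correctable = G(oracle, one_best) > T3                            # condition 3: the 1-best is not hopelessly far off
    return well_aligned and profitable and correctable
```
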

5 Experimental Results

We report experimental results using the hierarchical machine translation system of Chiang (2007) for Chinese to English translation.

5.1 Training Data

We compile a parallel dataset consisting of various corpora distributed by the Linguistic Data Consortium (LDC) for NIST MT evaluation. The selected subset has about 1M sentence pairs, corresponding to about 28M words in each language. To train the English language model, we use a 130M word subset of the English Gigaword corpus (LDC2007T07) and the English side of the parallel corpora.

5.2 Baseline Systems Training

We partition the parallel text into 30 non-overlapping sections, and train 30 baseline SMT systems. We use the GIZA toolkit (Och and Ney, 2000), a suffix-array architecture (Lopez, 2007), the SRILM toolkit (Stolcke, 2002), and minimum error rate training (MERT) (Och, 2003) to obtain word-alignments, translation models, language models, and the optimal weights for combining these models, respectively. The baseline LM is a trigram² LM with modified Kneser-Ney smoothing (Chen and Goodman, 1998). The MERT is performed on the NIST MT'03 evaluation data set.

²We use a trigram LM because the decoding of the whole training set (about 1M sentences) is too computationally expensive if we use a larger order LM.

As mentioned in Section 3, we carry out the leave-one-out training for both LM and TM. When training an LM, we need to exclude one portion (among the 30 portions) of the English side of the parallel corpora. To do this efficiently, we first get the n-gram counts from the subset of Gigaword and the whole English side of the parallel corpora, then deduct the n-gram counts collected from the portion that we want to exclude, and finally train an LM on the resulting counts. This procedure is done for each of the 30 baseline systems we need to train.

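The count-deduction trick can be illustrated with plain n-gram count dictionaries. The actual systems are built with SRILM, so the sketch below is only meant to show the arithmetic of subtracting one section's counts from the total; sentence-boundary padding and the smoothing step are omitted.

```python
from collections import Counter

def ngram_counts(sentences, order=3):
    """Collect n-gram counts up to the given order (boundary padding omitted for brevity)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def leave_one_out_counts(total_counts, held_out_sentences, order=3):
    """Counts for the LM of one baseline system: the full counts minus the held-out section's counts."""
    return total_counts - ngram_counts(held_out_sentences, order)
```
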

5.3 Data for Discriminative Training

The decoding of the training data (containing about 1M sentences) is the most expensive part of the training. We have developed a fast decoder (Li and Khudanpur, 2008).³ To decode source language sentences in a section, we use the baseline system trained on the data excluding that section. For each Chinese sentence, we generate 300 unique hypotheses. Note that we decode the SMT training data only once.

³The decoder is freely available to be downloaded at http://www.cs.jhu.edu/∼zfli/.

5.4 Goodness of the Oracle Translations

Figure 4 shows the average BLEU scores of the oracle and 1-best translations for the training data. The discriminative LM will use this gap to learn how to pick better translations from the N-best list. As shown in Figure 4, the longer the sentences, the smaller the BLEU scores. This is expected, as longer sentences are more difficult to translate.

    [Figure 4 (plot): Oracle BLEU and 1-best BLEU against Reference Length (# of Words); BLEU Score on the y-axis.]

Figure 4: Average BLEU Scores (4-gram BLEU with one reference) for the Oracle and 1-best Hypotheses in the training data.

We also report, in Table 1, the BLEU scores of the oracle translations on three NIST evaluation data sets on which we will later report performance. This table sets an upper bound on improvements possible using the discriminative LM. Note that the NIST MT'03 set is used for maximum-BLEU training, and thus the difference between the BLEU scores of the oracle and 1-best is smaller compared with the other MT sets.

    Task     1-best   oracle
    MT'03    0.350    0.407
    MT'04    0.357    0.440
    MT'05    0.326    0.412
    MT'06    0.283    0.351

Table 1: Oracle and 1-best BLEU Scores (4-gram BLEU with four references) on MT Sets.

5.5 Impact of Data Selection

As mentioned in Section 4, there are three configurable thresholds (i.e., T1, T2, and T3) in our data selection algorithm (see Figure 3). Different thresholds lead to the selection of different training data for the perceptron training; therefore, we need to train a perceptron model for each different configuration of the thresholds.

Clearly, it is impossible to evaluate all the possible configurations. To carry out experiments efficiently, we first set the default parameters as T1 = 0.10, T2 = 0.01, and T3 = 0.20, leading to the selection of about 630K training sentences from about 1M. Then, we vary one of the thresholds while keeping the other two at the default values. In total, we train 42 perceptron models, one for each configuration of the thresholds.

Figures 5, 6, and 7 report the BLEU scores on the MT'04 test set under the perceptron models trained by varying T1, T2, and T3, respectively. In all the models, we use discriminative unigram and bigram features.⁴ The number of iterations in the perceptron training, and the baseline weight β, are tuned by maximizing the BLEU score on MT'04. As seen in the figures, there is no telling pattern. This may be due to the instability of the perceptron training algorithm, or due to the fact that we are unable to capture the complex interaction between the three thresholds. Notably, however, all the BLEU scores are consistently better than the MT'04 baseline of 0.357. This shows the benefit of data selection, as we can train a comparable/better model with much less training data.

⁴As noted by Roark et al. (2007), adding trigram features does not lead to much improvement in the speech recognition task. We observe a similar trend in our machine translation task. Therefore, we only report results with unigram and bigram features here.

BLEU Scores by varying T2 0.10, T2 = 0.01, and T3 = 0.25, under which about 0.366 610K examples are selected from among about 1M 0.365 examples in total. The number of n-gram fea- 0.364 tures contained in the discriminative model are about 0.363 1.9M and 1.4M, for the case without data-selection 0.362 and the one with data-selection, respectively. More- 0.361 over, the perceptron training obtains optimal perfor- BLEUScore 0.36 mance after the first iteration, though a maximum of 0.359 three iterations (i.e., T = 3 in Figure 1) have been 0.358 run in our experiments. 0.357 The results in Table 2 show that our discrimina- 0.01 0.04 0.07 0.1 tive LM reranking framework improves the baseline Value of T2 system significantly. The results also show that the Figure 6: BLEU Scores on MT’04 when varying the data selection procedure can lead to a comparable or value of T2 ∈ [0.01, 0.10] with a step size 0.01. better model with less training data. To put our re- sults in perspective, in the second column of Table

BLEU Scores by varying T3 2 we include the results reported by Chiang (2007) 0.366 who uses a training set similar to ours. Clearly, our 0.365 baseline itself is stronger than the system in Chiang 5 0.364 (2007). 0.363 To get a sense of what features (e.g., the ones 0.362 affecting word order versus lexical choice) make 0.361 the most contributions in improving the MT perfor- BLEU BLEU Score 0.36 mance, we present in Table 3 the individual n-gram 0.359 precisions reported by the BLEU script, and the rel- 0.358 ative improvement ratio (in terms of percentage). 0.357 Roughly speaking, the features concerning lexical 0.2 0.35 0.5 0.65 choice mainly affects unigram precision, while the Value of T3 features concerning word order mainly affects the

Figure 7: BLEU Scores on MT’04 when varying the 5The improvement may be due to the difference in training value of T3 ∈ [0.20, 0.75] with a step size 0.05. data or due to the fact that our decoder is scalable enough to allow larger search beams. n-gram precision (%) slightly improves the model quality as shown in our Task System 1 2 3 4 experimental results. baseline 81.2 47.5 27.7 16.1 In the context of an SMT task, discriminative MT’04 rerank 81.2 47.9 28.1 16.3 training methods have been applied both in a small- Improvement (%) 0 0.8 1.4 1.2 scale setting (i.e, finding optimal weights among a baseline 77.8 43.5 24.7 14.2 small set of generative models) and in a large-scale MT’05 rerank 77.7 43.6 25.0 14.6 setting (i.e., training optimal weights for millions of Improvement (%) -0.1 0.2 1.2 2.8 features). In the small-scale setting, the minimum error rate baseline 79.7 42.6 23.3 12.8 MT’06 training algorithm (Och, 2003) has become the de rerank 79.9 43.4 24.0 13.5 facto standard in SMT systems. Smith and Eis- Improvement (%) 0.3 1.9 3.0 5.5 ner (2006) propose an annealed minimum risk ap- proach, while Zens et al. (2007) give a systematic Table 3: n-gram Precisions and Relative Improvements experimental comparison for different training crite- on MT Test Sets ria. Shen et al. (2004) use perceptron-inspired algo- rithms to tune weights for tens of features, and the 6 higher-order n-gram. As clear from “relative im- model is used to rerank a N-best as we have done provement” rows in the table, our discriminative LM here. mainly improves on the word order, and not so much Large-scale discriminative training has been at- on lexical choice. tempted by several groups. They differ in the fea- In all the experiments above, we use a tri-gram tures used, for example, reordering features (Till- language model in the baseline system. We made mann and Zhang, 2006; Chang and Toutanova, this choice because the decoding of the whole train- 2007), translation model features (Blunsom et al., ing set (about 1M sentences) is too computationally 2008), and both translation and language models expensive if we use a larger language model. To features (Liang et al., 2006; Watanabe et al., 2007). verify if the discriminative training is able to im- They also differ in whether using a fixed n-best prove the MT performance when we have a larger (e.g., Watanabe et al. (2007)) or running an end- language model, we have carried out several (though to-end decoding (e.g., (Liang et al., 2006; Blunsom not all) of the preceding experiments with two-stage et al., 2008)), at each iteration of training. In the reranking, in which the N-best list is first reranked latter case, they resort to a relative weak baseline by a regular higher-order n-gram LM, and by the for computational efficiency. For example, Liang et discriminative LM in a second reranking pass. We al. (2006) use a monotone translation system, while find that the gains from the discriminative LM con- Blunsom et al. (2008) do not use a language model tinue to hold up, and we expect to report more com- and they train the baseline system only on sentences prehensive results in due time. whose length is less than 15 words and whose ref- erence is reachable by the decoder. In comparison, 6 Related Work in this paper, we focus exclusively on the language The work most closely related to ours is the discrim- model features using an n-best reranking method. inative n-gram LM of Roark et al. 
(2007): they fo- Moreover, our improvement is over a full-grown cus on a speech recognition task, while we present state-of-the-art hierarchical machine translation sys- results for a statistical machine translation task. Ad- tem for a known difficult task (i.e., translation from ditionally, we propose a novel data selection proce- Chinese to English). dure, which not only speeds up the training but also While all the above discriminative models are global (meaning the optimization is over the whole 6 One should note that this may not be strictly true. For ex- output sequence), it is also possible to decompose ample, the improvement in the lexical choice may also lead to the global prediction problem into many indepen- the improvement in the word order. However, since our dis- criminative model reranks on a n-best, this effect may not be dent local prediction problems and then train a clas- substantial. sifier for each local prediction problem, as done in (Zens and Ney, 2006; Chan et al., 2007; Carpuat and Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007 Wu, 2007). Compared with a global model, the local Word sense disambiguation improves statistical ma- model has significantly less computational demand. chine translation. In Proceedings of EMNLP 2007. Pi-Chuan Chang and Kristina Toutanova. 2007. A Dis- On the other hand, independent locally optimal pre- criminative Syntactic Word Order Model for Machine diction may not yield a globally optimal prediction, Translation. In Proceedings of ACL 2007. and thus a global model may be preferred. Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 7 Conclusions 14(4):283-332. Stanley F. Chen and Joshua Goodman. 1998. An empiri- In this paper, we successfully apply a discriminative cal study of smoothing techniques for language mod- n-gram language model to improve a large-scale, eling. Technical Report TR-10-98, Harvard University Center for Research in Computing Technology. competitive statistical machine translation system. Zheng Chen, Kai-Fu Lee, and Ming-jing Li. 2000. Dis- We also propose a novel data selection method, criminative training on language model. In Proceed- through which a comparable/better model can be ob- ings of International Conference on Spoken Language tained with much less training data. We carry out Processing(ICSLP) 2000. systematic experiments on a hierarchical machine David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of translation system (Chiang, 2007) for the Chinese ACL 2005. to English translation task. Our results show that David Chiang. 2007. Hierarchical phrase-based transla- the discriminative language model is able to improve tion. Computational , 33(2):201-228. over a very strong baseline system. Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments Acknowledgments with perceptron algorithms. In Proceedings of EMNLP 2002. We thank Damianos Karakos for very helpful dis- Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio cussions, and the anonymous reviewers for their Thayer. 2006. Scalable inference and training of constructive comments. This research was par- context-rich syntactic translation models. In Proceed- tially supported by the Defense Advanced Research ings of COLING/ACL 2006. Projects Agency’s GALE program via Contract No Sanjeev Khudanpur and Jun Wu. 2000. Maximum En- ¯ HR0011-06-2-0001. 
tropy Techniques for Exploiting Syntactic, Semantic and Collocational Dependencies in Language Model- ing. Computer Speech and Language, 14(4):355-372. Philipp Koehn, Franz Josef Och, and Daniel Marcu. References 2003. Statistical phrase-based translation. In Proceed- Lalit R. Bahl, Peter F. Brown, Peter V. de Souza, and ings of NAACL 2003. Robert L. Mercer. 1989. A tree-based statistical lan- Hong-Kwang Jeff Kuo, Eric Fosler-Lussier, Hui Jiang, guage model for natural language speech recognition. and Chin-Hui Lee. 2002. Discriminative training of IEEE Trans Acoustics, Speech, and Signal Processing, language models for speech recognition. In Proceed- ings of ICASSP 2002. 37(7):1001-1008. Zhifei Li and Sanjeev Khudanpur. 2008. A Scalable Yoshua Bengio, Rejean Ducharme, and Pascal Vin- Decoder for Parsing-based Machine Translation with cent. 2001. A neural probabilistic language model. In Equivalent Language Model State Maintenance. In Advances in Neural Information Processing Systems Proceedings SSST, ACL 2008 Workshop on Syntax and (NIPS). Structure in Statistical Translation. Phil Blunsom, Trevor Cohn and Miles Osborne. 2008. Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and A Discriminative Latent Variable Model for Statistical Ben Taskar. 2006. An end-to-end discriminative ap- Machine Translation. In Proceedings of ACL 2008. proach to machine translation. In Proceedings of COL- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della ING/ACL 2006. Pietra, and R. L. Mercer. 1993. The mathematics of Adam Lopez. 2007. Hierarchical Phrase-Based Transla- statistical machine translation: Parameter estimation. tion with Suffix Arrays. In Proceedings of EMNLP- Computational Linguistics, 19(2):263-311. CoNLL 2007. Marine Carpuat and Dekai Wu. 2007. Improving statisti- Franz Josef Och. 2003. Minimum error rate training in cal machine translation using word sense disambigua- statistical machine translation. In Proceedings of ACL tion. In Proceedings of EMNLP 2007. 2003. Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of ACL 2000. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. BLEU: a method for automatic eval- uation of machine translation. In Proceedings of ACL 2002. Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373-392. Roni Rosenfeld. 1996. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer Speech and Language, 10(3):187-228. Libin Shen, Anoop Sarkar and Franz Josef Och. 2004. Discriminative Reranking for Machine Translation. In Proceedings of HLT/NAACL 2004. David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceed- ings of ACL 2006. Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2, pages 901-904. Andreas Stolcke and Mitch Weintraub. 1998. Discrim- initive language modeling. In Proceedings of the 9th Hub-5 Conversational Speech Recognition Workshop. Charles Sutton, Michael Sindelar, and Andrew McCal- lum. 2006. Reducing Weight Undertraining in Struc- tured Discriminative Learning. In Proceedings of HLT- NAACL 2006. Christoph Tillmann and Tong Zhang. 2006. A discrimi- native global training algorithm for statistical MT. In Proceedings of COLING/ACL 2006. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. 
Online Large-Margin Training for Statistical Machine Translation. In Proceedings of EMNLP-CoNLL 2007. Peng Xu and Frederick Jelinek. 2004. Random Forests in Language Modeling. In Proceedings of EMNLP 2004. Richard Zens and Hermann Ney. 2006. Discriminative re- ordering models for statistical machine translation. In Proceedings of WSMT 2006. Richard Zens, Sasa Hasan, and Hermann Ney. 2007. A Systematic Comparison of Training Criteria for Statis- tical Machine Translation. In Proceedings of EMNLP- CoNLL 2007.