Stream-based Translation Models for Statistical Machine Translation
Abby Levenberg (School of Informatics, University of Edinburgh)
Chris Callison-Burch (Computer Science Department, Johns Hopkins University)
Miles Osborne (School of Informatics, University of Edinburgh)

Abstract

Typical statistical machine translation systems are trained with static parallel corpora. Here we account for scenarios with a continuous incoming stream of parallel training data. Such scenarios include daily governmental proceedings, sustained output from translation agencies, or crowd-sourced translations. We show incorporating recent sentence pairs from the stream improves performance compared with a static baseline. Since frequent batch retraining is computationally demanding we introduce a fast incremental alternative using an online version of the EM algorithm. To bound our memory requirements we use a novel data-structure and associated training regime. When compared to frequent batch retraining, our online time and space-bounded model achieves the same performance with significantly less computational overhead.

1 Introduction

There is more parallel training data available today than there has ever been and it keeps increasing. For example, the European Parliament (http://www.europarl.europa.eu) releases new parallel data in 22 languages on a regular basis. Project Syndicate (http://www.project-syndicate.org) translates editorials into seven languages (including Arabic, Chinese and Russian) every day. Existing translation systems often get 'crowd-sourced' improvements such as the option to contribute a better translation to Google Translate (http://www.translate.google.com). In these and many other instances, the data can be viewed as an incoming unbounded stream since the corpus grows continually with time. Dealing with such unbounded streams of parallel sentences presents two challenges: making retraining efficient and operating within a bounded amount of space.

Statistical Machine Translation (SMT) systems are typically batch trained, often taking many CPU-days of computation when using large volumes of training material. Incorporating new data into these models forces us to retrain from scratch. Clearly, this makes rapidly adding newly translated sentences into our models a daunting engineering challenge. We introduce an adaptive training regime using an online variant of EM that is capable of incrementally adding new parallel sentences without incurring the burdens of full retraining.

For situations with large volumes of incoming parallel sentences we are also forced to consider placing space-bounds on our SMT system. We introduce a dynamic suffix array which allows us to add and delete parallel sentences, thereby maintaining bounded space despite processing a potentially high-rate input stream of unbounded length.

Taken as a whole we show that online translation models operating within bounded space can perform as well as systems which are batch-based and have no space constraints, thereby making our approach suitable for stream-based translation.

2 Stepwise Online EM

The EM algorithm is a common way of inducing latent structure from unlabeled data in an unsupervised manner (Dempster et al., 1977). Given a set of unlabeled examples and an initial, often uniform guess at a probability distribution over the latent variables, the EM algorithm maximizes the marginal log-likelihood of the examples by repeatedly computing the expectation of the conditional probability of the latent data with respect to the current distribution, and then maximizing the expectations over the observations into a new distribution used in the next iteration. EM (and related variants such as variational or sampling approaches) form the basis of how SMT systems learn their translation models.

2.1 Batch vs. Online EM

Computing an expectation for the conditional probabilities requires collecting the sufficient statistics S over the set of n unlabeled examples. In the case of a multinomial distribution, S is comprised of the counts over each conditional observation occurring in the n examples. In traditional batch EM, we collect the counts over the entire dataset of n unlabeled training examples via the current 'best-guess' probability model θ̂_t at iteration t (E-step) before normalizing the counts into probabilities θ̄(S) (M-step). (As the M-step can be computed in closed form we designate it in this work as θ̄(S).) After each iteration all the counts in the sufficient statistics vector S are cleared and the count collection begins anew using the new distribution θ̂_{t+1}.

When we move to processing an incoming data stream, however, the batch EM algorithm's requirement that all data be available for each iteration becomes impractical since we do not have access to all n examples at once. Instead we receive examples from the input stream incrementally. For this reason online EM algorithms have been developed to update the probability model θ̂ incrementally without needing to store and iterate through all the unlabeled training data repeatedly.

Various online EM algorithms have been investigated (see Liang and Klein (2009) for an overview) but our focus is on the stepwise online EM (sOEM) algorithm (Cappe and Moulines, 2009). Instead of iterating over the full set of training examples, sOEM stochastically approximates the batch E-step and incorporates the information from the newly available streaming observations in steps. Each step is called a mini-batch and is comprised of one or more new examples encountered in the stream.

Unlike in batch EM, in sOEM the expected counts are retained between EM iterations and not cleared. That is, for each new example we interpolate its expected count with the existing set of sufficient statistics. For each step we use a stepsize parameter γ which mixes the information from the current example with information gathered from all previous examples. Over time the sOEM model probabilities begin to stabilize and are guaranteed to converge to a local maximum (Cappe and Moulines, 2009).

Note that the stepsize γ has a dependence on the current mini-batch. As we observe more incoming data the model's current probability distribution is closer to the true distribution, so the new observations receive less weight. From Liang and Klein (2009), if we set the stepsize as γ_t = (t + 2)^(−α), with 0.5 < α ≤ 1, we can guarantee convergence in the limit as n → ∞. If we set α low, γ weighs the newly observed statistics heavily, whereas if γ is low new observations are down-weighted.
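To make the update concrete, the following Python sketch applies stepwise EM to a toy two-coin mixture. This is only an illustration under our own assumptions (the mixture model, the function names, the synthetic mini-batch generator, and the choice of α = 0.7 are not from the paper); what carries over to the alignment setting is the stepsize schedule γ_k = (k + 2)^(−α) and the interpolated update S ← γ s̄ + (1 − γ) S in which the counts are never cleared.

    import random

    ALPHA = 0.7   # stepsize exponent, 0.5 < ALPHA <= 1
    COINS = 2

    def e_step(batch, theta):
        """E-step for one mini-batch: expected counts under the current model."""
        mix, bias = theta
        s_bar = {"seq": [1e-9] * COINS, "heads": [1e-9] * COINS, "tails": [1e-9] * COINS}
        for flips in batch:
            h = sum(flips)
            t = len(flips) - h
            post = [mix[c] * bias[c] ** h * (1 - bias[c]) ** t for c in range(COINS)]
            z = sum(post)
            for c in range(COINS):
                r = post[c] / z                      # responsibility of coin c
                s_bar["seq"][c] += r
                s_bar["heads"][c] += r * h
                s_bar["tails"][c] += r * t
        return s_bar

    def m_step(S):
        """Closed-form M-step: normalize the sufficient statistics."""
        mix = [S["seq"][c] / sum(S["seq"]) for c in range(COINS)]
        bias = [S["heads"][c] / (S["heads"][c] + S["tails"][c]) for c in range(COINS)]
        return mix, bias

    def stepwise_em(stream, theta):
        """sOEM: interpolate each mini-batch's counts into the running statistics."""
        S = {"seq": [0.0] * COINS, "heads": [0.0] * COINS, "tails": [0.0] * COINS}
        for k, batch in enumerate(stream):
            s_bar = e_step(batch, theta)
            gamma = (k + 2) ** -ALPHA                # stepsize schedule
            S = {key: [gamma * s_bar[key][c] + (1 - gamma) * S[key][c]
                       for c in range(COINS)] for key in S}
            theta = m_step(S)                        # counts are NOT cleared
        return theta

    if __name__ == "__main__":
        random.seed(0)
        def mini_batch(n=20, length=10):
            batch = []
            for _ in range(n):
                p = random.choice([0.2, 0.8])        # latent coin for this sequence
                batch.append([int(random.random() < p) for _ in range(length)])
            return batch
        stream = (mini_batch() for _ in range(300))
        mix, bias = stepwise_em(stream, ([0.5, 0.5], [0.4, 0.6]))
        print([round(p, 2) for p in mix], [round(b, 2) for b in bias])

The same interpolation reappears in Algorithm 2 below, applied to expected word-alignment counts rather than coin counts.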
2.2 Batch EM for Word Alignments

Batch EM is used in statistical machine translation to estimate word alignment probabilities between parallel sentences. From these alignments, bilingual rules or phrase pairs can be extracted. Given a set of parallel sentence examples, {F, E}, with F the set of source sentences and E the corresponding target sentences, we want to find the latent alignments a for a sentence pair (f, e) ∈ {F, E} that define the most probable correspondence between words f_j and e_i such that a_j = i. We can induce these alignments using an HMM-based alignment model where the probability of alignment a_j is dependent only on the previous alignment at a_{j−1} (Vogel et al., 1996). We can write

    Pr(f, a | e) = Σ_{a′ ∈ a} Π_{j=1..|f|} p(a_j | a_{j−1}, |e|) · p(f_j | e_{a_j})

where we assume a first-order dependence on previously aligned positions.

To find the most likely parameter weights for the translation and alignment probabilities for the HMM-based alignments, we employ the EM algorithm via dynamic programming. Since HMMs have multiple local minima, we seed the HMM-based model probabilities with a better than random guess using IBM Model 1 (Brown et al., 1993), as is standard. IBM Model 1 is of the same form as the HMM-based model except it uses a uniform distribution instead of a first-order dependency. Although a series of more complex models are defined, IBM Models 2 to 6 (Brown et al., 1993; Och and Ney, 2003), researchers typically find that extracting phrase pairs or translation grammar rules using Model 1 and the HMM-based alignments results in equivalently high translation quality. Nevertheless, there is nothing in our approach which limits us to using just Model 1 and the HMM model.

A high-level overview of the standard, batch EM algorithm applied to the HMM-based word alignment model is shown in Algorithm 1.

Algorithm 1: Batch EM for Word Alignments
    Input:  {F (source), E (target)} sentence pairs
    Output: MLE θ̂_T over alignments a
    θ̂_0 ← MLE initialization
    for iteration t = 0, ..., T do
        S ← 0                                    // reset counts
        foreach (f, e) ∈ {F, E} do               // E-step
            S ← S + Σ_{a′ ∈ a} Pr(f, a′ | e; θ̂_t)
        end
        θ̂_{t+1} ← θ̄_t(S)                         // M-step
    end
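As a runnable counterpart to Algorithm 1, the sketch below performs batch EM for word alignment using IBM Model 1 rather than the HMM-based model, so the sum over alignments factorizes per word and no dynamic programming over previous positions is needed. The toy corpus, function name, and number of iterations are illustrative assumptions, not the paper's implementation; it does, however, estimate p(f | e) by the same collect-counts-then-normalize loop.

    from collections import defaultdict

    def batch_em_model1(corpus, iterations=5):
        """Batch EM for IBM Model 1 translation probabilities p(f | e).

        corpus: list of (source_words, target_words) tokenized sentence pairs.
        Returns a dict-like table t with t[(f, e)] = p(f | e).
        """
        # MLE initialization: uniform translation probabilities.
        target_vocab = {e for _, tgt in corpus for e in tgt}
        uniform = 1.0 / len(target_vocab)
        t = defaultdict(lambda: uniform)

        for _ in range(iterations):
            # E-step: reset the counts, then accumulate expected counts over
            # the whole corpus under the current model (the batch requirement).
            count = defaultdict(float)   # expected count of (f, e) links
            total = defaultdict(float)   # expected count of e being linked
            for src, tgt in corpus:
                for f in src:
                    # posterior over which target word generated f
                    # (uniform alignment prior, no first-order dependency)
                    norm = sum(t[(f, e)] for e in tgt)
                    for e in tgt:
                        p = t[(f, e)] / norm
                        count[(f, e)] += p
                        total[e] += p
            # M-step: normalize the sufficient statistics in closed form.
            t = defaultdict(lambda: uniform,
                            {(f, e): c / total[e] for (f, e), c in count.items()})
        return t

    if __name__ == "__main__":
        corpus = [("das haus".split(), "the house".split()),
                  ("das buch".split(), "the book".split()),
                  ("ein buch".split(), "a book".split())]
        t = batch_em_model1(corpus)
        print(round(t[("haus", "house")], 2), round(t[("das", "the")], 2))

Note that the expected counts are rebuilt from scratch at every iteration, which is exactly the step the streaming setting cannot afford.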
Algorithm 2 adapts this procedure to the stream: in each step we interpolate the newest sufficient statistics s̄ with our full count vector S using an interpolation parameter γ. The interpolation parameter γ has a dependency on how far along the input stream we are processing.

Algorithm 2: sOEM Algorithm for Word Alignments
    Input:  mini-batches of sentence pairs {M : M ⊂ {F (source), E (target)}}
    Input:  stepsize weight α
    Output: MLE θ̂_T over alignments a
    θ̂_0 ← MLE initialization
    S ← 0; k ← 0
    foreach mini-batch {m : m ∈ M} do
        for iteration t = 0, ..., T do
            foreach (f, e) ∈ {m} do              // E-step
                s̄ ← Σ_{a′ ∈ a} Pr(f, a′ | e; θ̂_t)
            end
            γ ← (k + 2)^(−α); k ← k + 1          // stepsize
            S ← γ s̄ + (1 − γ) S                  // interpolate
            θ̂_{t+1} ← θ̄_t(S)                     // M-step
        end
    end

3 Dynamic Suffix Arrays