Generalized Graphical Abstractions for Statistical Machine Translation
Karim Filali and Jeff Bilmes*
Departments of Computer Science & Engineering and Electrical Engineering
University of Washington
Seattle, WA 98195, USA
{karim@cs,[email protected]

* This material was supported by NSF under Grant No. ISS-0326276.

Abstract

We introduce a novel framework for the expression, rapid prototyping, and evaluation of statistical machine translation (MT) systems using graphical models. The framework extends dynamic Bayesian networks with multiple connected different-length streams, switching variable existence and dependence mechanisms, and constraint factors. We have implemented a new general-purpose MT training/decoding system in this framework and have tested it on a variety of existing MT models (including the four IBM models), as well as some novel ones, all using Europarl as a test corpus. We describe the semantics of our representation and present preliminary evaluations, showing that it is possible to prototype novel MT ideas in a short amount of time.

1 Introduction

We present a unified graphical model framework for statistical machine translation based on (Filali and Bilmes, 2006). Graphical models utilize graphical descriptions of probabilistic processes and are capable of quickly describing a wide variety of different sets of model assumptions. In our approach, either phrases or words can be used as the unit of translation, but as a first step we have implemented only word-based models, since our main goal is to show the viability of our graphical model representation and new software system.

There are several important advantages to a unified probabilistic framework for MT, including: (1) the same codebase can be used for training and decoding, without having to implement a separate decoder for each model; (2) new models can be prototyped quickly; (3) combining models (such as in a speech-MT system) is easier when they are encoded in the same framework; and (4) sharing algorithms across different disciplines (e.g., the MT and constraint-satisfaction communities) is facilitated.

2 Graphical Model Framework

A graphical model (GM) represents a factorization of a family of joint probability distributions over a set of random variables using a graph. The graph specifies conditional independence relationships between the variables, and parameters of the model are associated with each variable or group thereof. There are many types of graphical models. For example, Bayesian networks use an acyclic directed graph, and their parameters are the conditional probabilities of each variable given its parents. Various forms of GM and their conditional independence properties are defined in (Lauritzen, 1996).

Our graphical representation, which we call Multi-dynamic Bayesian Networks (MDBNs) (Filali and Bilmes, 2006), is a generalization of dynamic Bayesian networks (DBNs) (Dean and Kanazawa, 1988). DBNs are an appropriate representation for sequential (for example, temporal) stochastic processes, but they can be very difficult to apply when dependencies have arbitrary time-span and the existence of random variables is contingent on the values of certain variables in the network. In (Filali and Bilmes, 2006), we discuss inference and learning in MDBNs. Here we focus on representation and on the evaluation of our new implementation and framework. Below, we summarize the key features underlying our framework; in section 3, we explain how these features apply to a specific MT model.

• Multiple DBNs can be represented, along with rules for how they are interconnected; the rule description lengths are fixed (they do not grow with the length of the DBNs).

• Switching dependencies (Geiger and Heckerman, 1996): a variable X is a switching parent of Y if X influences what type of dependencies Y has with respect to its other parents; e.g., an alignment variable in IBM Models 1 and 2 is switching.

• Switching existence: a variable X is "switching existence" with respect to variable Y if the value of X determines whether Y exists. An example is a fertility variable in IBM Models 3 and above.

• Constraints and aggregation: BN semantics can encode various types of constraints between groups of variables (Pearl, 1988). For example, in the construct A → B ← C where B is observed, B can constrain A and C to be unequal. We extend those semantics to support a more efficient evaluation of constraints under some variable order conditions. (Both the switching and the constraint mechanisms are illustrated in the code sketch following this list.)
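To make the two mechanisms concrete, here is a minimal C++ sketch of our own; it is not code from the system described in this paper, and the names (LexTable, pFrenchWord, pConstraintChild) are invented for exposition. It shows an IBM Model 1/2-style switching dependency, where an alignment variable selects which English parent a French word depends on, and an observed-child constraint of the A → B ← C kind.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lexical table t(f | e), keyed by (French word, English word).
using LexTable = std::map<std::pair<std::string, std::string>, double>;

// Switching dependency: the alignment variable a_j selects WHICH English
// parent f_j depends on; a_j switches the dependency rather than the value.
double pFrenchWord(const LexTable& t, const std::vector<std::string>& e,
                   int a_j, const std::string& f_j) {
  auto it = t.find({f_j, e[a_j]});  // parent e[a_j] is chosen by the switch
  return it == t.end() ? 0.0 : it->second;
}

// Constraint via an observed child: in A -> B <- C with B observed, B's CPT
// can be an indicator enforcing a relation between A and C (here inequality),
// so P(B = 1 | A, C) is 1 iff the constraint holds.
double pConstraintChild(int A, int C) { return A != C ? 1.0 : 0.0; }

int main() {
  LexTable t = {{{"maison", "house"}, 0.8}, {{"maison", "the"}, 0.05}};
  std::vector<std::string> e = {"the", "house"};
  std::printf("p(f = maison | e, a = 1) = %.2f\n",
              pFrenchWord(t, e, 1, "maison"));  // switch picks e[1] = "house"
  std::printf("P(B = 1 | A = 2, C = 2) = %.1f\n", pConstraintChild(2, 2));
}
```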
3 GM Representation of IBM MT Models

In this section we present a GM representation of IBM Model 3 (Brown et al., 1993), shown in fig. 1. Model 3 is intricate enough to showcase some of the features of our graphical representation, but it is not as complex as, and is thus easier to describe than, Model 4. Our choice of representing IBM models is not because we believe they are state-of-the-art MT models (although they are still widely used in producing alignments and as features in log-linear models), but because they provide a good initial testbed for our architecture.

[Figure 1: Unrolled Model 3 graphical model with fertility assignment φ0 = 2, φ1 = 3, φ2 = 1, φ3 = 0.]

The topmost random variable (RV), ℓ, is a hidden switching existence variable corresponding to the length of the English string. The box abutting ℓ includes all the nodes whose existence depends on the value of ℓ. In the figure, ℓ = 3, resulting in three English words e1, ..., e3 connected using a second-order Markov chain. To each English word ei corresponds a conditionally dependent fertility φi, which indicates how many times ei is used by words in the French string. Each φi in turn grants existence to a set of RVs under it. Given the fertilities (the figure depicts the case φ1 = 3, φ2 = 1, φ3 = 0), for each word ei, φi French word RVs are granted existence; they are denoted by the tablet τi1, τi2, ..., τiφi of ei. The values of the τ variables need to match the actual observed French sequence f1, ..., fm. This is represented as a shared constraint between all the f, π, and τ variables, which have incoming edges into the observed variable v. v's conditional probability table is such that it is one only when the associated constraint is satisfied. The variable πi,k is a switching dependency parent with respect to the constraint variable v and determines which fj participates in an equality constraint with τi,k.
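The indicator semantics of v can be sketched directly. The following C++ fragment is our own illustration of the description above, not the paper's implementation; the name vCPT and the 0-based position indices are expository choices.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Indicator CPT of the observed constraint variable v: it is 1 only when
// every tablet value tau[i][k] equals the observed French word at the
// position selected by its switching permutation parent pi[i][k].
double vCPT(const std::vector<std::vector<std::string>>& tau,  // tau[i][k]
            const std::vector<std::vector<int>>& pi,           // pi[i][k]
            const std::vector<std::string>& f) {               // f_1 .. f_m
  for (std::size_t i = 0; i < tau.size(); ++i)
    for (std::size_t k = 0; k < tau[i].size(); ++k)
      if (tau[i][k] != f[pi[i][k]]) return 0.0;  // an equality constraint fails
  return 1.0;  // all equality constraints hold
}

int main() {
  // Toy case: one English word with fertility 2 whose tablet words must land
  // at French positions 0 and 2 (0-based here purely for C++ convenience).
  std::vector<std::vector<std::string>> tau = {{"la", "bleue"}};
  std::vector<std::vector<int>> pi = {{0, 2}};
  std::vector<std::string> f = {"la", "maison", "bleue"};
  std::printf("v = %.0f\n", vCPT(tau, pi, f));  // prints v = 1
}
```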
In the null word sub-model, the constraint that successive permutation variables be ordered is implemented using the observed child w of π0i and π0(i+1). The probability of w being unity is one only when the constraint is satisfied and zero otherwise.

The bottom variable m is a switching existence node (observed to be 6 in the figure) with corresponding French word sequence and alignment variables. The French sequence participates in the v constraint described above, while the alignment variables aj ∈ {1, ..., ℓ}, j ∈ 1, ..., m, constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of the permutation variables, π, using the constraint variable x. Finally, the domain size of each aj has to lie in the interval [0, ℓ], and that is enforced by the variable u. The dashed edges connecting the alignment a variables represent an extension to implement an M3/M-HMM hybrid.¹

¹ We refer to the HMM MT model in (Vogel et al., 1996) as M-HMM to avoid any confusion.

4 Experiments

We have developed (in C++) a new, entirely self-contained, general-purpose MT training/decoding system based on our framework, of which we provide a preliminary evaluation in this section. Although the framework is perfectly capable of representing phrase-based models, we restrict ourselves to word-based models to show the viability of graphical models for MT, and we will consider different translation units in future work. We perform MT experiments on an English-French subset of the Europarl corpus used for the ACL 2005 SMT evaluations (Koehn and Monz, 2005). We train an English language model on the whole training set using the SRILM toolkit (Stolcke, 2002) and train MT models mainly on a 10k sentence pair subset of the ACL training set. We test on the 2000 sentence test set used for the same evaluations.

For comparison, we use the MT training program GIZA++ together with the Pharaoh and Rewrite decoders. Our system and Pharaoh both make use of a fixed lexical table learned using an M-HMM model specified using our tool, and neither uses minimum error rate training; Rewrite uses Model 4 parameters learned using GIZA++. This comparison is informative because Rewrite is a special-purpose Model 4 decoder, and we would expect it to perform at least as well as decoders not written for a specific IBM model. Pharaoh is more general in that it only requires, as input, a lexical table from any given model. Our MDBN system is not tailored for the translation task.

Pharaoh was able to decode the 2000 sentences of the test set in 5000s on a 3.2GHz machine; Rewrite took 84000s, and we allotted 400000s for our engine (200s per sentence). We attribute the difference in speed and BLEU score between our system and Pharaoh to the fact that Value Elimination searches in a depth-first fashion over the space of partial configurations of RVs, while Pharaoh expands partial translation hypotheses in a best-first search manner.
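The contrast between the two search regimes can be sketched schematically. The C++ below is our own illustration with a placeholder score, not code from either decoder: dfs extends one partial configuration of variables at a time and backtracks, while bestFirst keeps an agenda of scored partial hypotheses and always expands the most promising one.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// Depth-first: commit to a value for the next variable, recurse, backtrack.
// This explores one branch of the configuration space at a time.
int dfs(std::vector<int>& partial, int numVars, int domainSize) {
  if ((int)partial.size() == numVars) return 1;  // complete configuration
  int complete = 0;
  for (int v = 0; v < domainSize; ++v) {
    partial.push_back(v);                        // extend the configuration
    complete += dfs(partial, numVars, domainSize);
    partial.pop_back();                          // backtrack
  }
  return complete;
}

// Best-first: keep an agenda of scored partial hypotheses and always expand
// the highest-scoring one, as a phrase decoder expands partial translations.
struct Hyp {
  double score;
  std::vector<int> prefix;
  bool operator<(const Hyp& o) const { return score < o.score; }
};

Hyp bestFirst(int numVars, int domainSize) {
  std::priority_queue<Hyp> agenda;
  agenda.push({0.0, {}});
  while (true) {
    Hyp h = agenda.top(); agenda.pop();
    if ((int)h.prefix.size() == numVars) return h;  // first complete pop wins
    for (int v = 0; v < domainSize; ++v) {
      Hyp next = h;
      next.prefix.push_back(v);
      next.score -= 1.0;  // stand-in for a real model score increment
      agenda.push(next);
    }
  }
}

int main() {
  std::vector<int> partial;
  std::printf("depth-first visited %d complete configurations\n",
              dfs(partial, 3, 2));
  std::printf("best-first returned a hypothesis of length %zu\n",
              bestFirst(3, 2).prefix.size());
}
```

With a real model score in place of the uniform decrement, the best-first agenda concentrates effort on high-scoring prefixes, which is one plausible reading of the speed gap reported above.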