Proceedings of SSST-4

Fourth Workshop on

Syntax and Structure in Statistical Translation

Dekai Wu (editor)

COLING 2010 / SIGMT Workshop
23rd International Conference on Computational Linguistics
Beijing, China, 28 August 2010

Produced by the Chinese Information Processing Society of China
All rights reserved for Coling 2010 CD production.

To order the CD of Coling 2010 and its Workshop Proceedings, please contact:

Chinese Information Processing Society of China
No. 4, Southern Fourth Street
Haidian District, Beijing, 100190
China
Tel: +86-010-62562916
Fax: +86-010-62562916
[email protected]

Introduction

The Fourth Workshop on Syntax and Structure in Statistical Translation (SSST-4) was held on 28 August 2010 following the Coling 2010 conference in Beijing. Like the first three SSST workshops in 2007, 2008, and 2009, it aimed to bring together researchers from different communities working in the rapidly growing field of statistical, tree-structured models of natural language translation.

We were honored to have Martin Kay deliver this year’s invited keynote talk. This field is indebted to Martin Kay for not one but two of the classic cornerstone ideas that inspired bilingual tree-structured models for statistical machine translation: first, chart parsing, and second, parallel text alignment.

Tabular approaches to parsing, using dynamic programming and/or memoization, were heavily influenced by Kay’s (1980) charts (or forests, packed forests, well-formed substring tables, or WFSTs). Today’s biparsing models—the bilingual generalizations of this influential work—lie at the heart of numerous alignment and training algorithms for inversion transduction grammars or ITGs—including all syntax-directed transduction grammars or SDTGs (or synchronous CFGs) of binary rank or ternary rank, such as those learned by hierarchical phrase-based translation approaches.

At the same time, Kay and Röscheisen’s (1988) seminal work on alignment of parallel texts led the way in statistical machine translation’s basic paradigm of integrating the simultaneous learning of translation lexicons with aligning parallel texts. Today’s biparsing models generalize this by simultaneously learning tree structures as well. Once again, cross-pollination of ideas across different areas and disciplines, empirical and theoretical, has provided mutual inspiration.

We selected 15 papers for this year’s workshop. Studies emphasizing formal and algorithmic aspects include a method for intersecting synchronous/transduction grammars (S/TGs) with finite-state transducers (Dymetman and Cancedda), and a comparison of linear transduction grammars (LTGs) with ITGs (Saers, Nivre and Wu). Experiments on using syntactic features and constraints within flat phrase-based translation models include studies by Jiang, Du and Way; by Cao, Finch and Sumita; by Kolachina, Venkatapathy, Bangalore, Kolachina and PVS; and by Zhechev and van Genabith. Dependency constraints are also used to improve HMM word alignments for both flat phrase-based as well as S/TG based translation models (Ma and Way). Extensions to the features and parameterizations in two S/TG based translation models, as well as methods for merging models, are studied by Zollmann and Vogel. The potential of incorporating LFG-style deep syntax within S/TGs is explored by Graham and van Genabith. A tree transduction based approach is presented by Khalilov and Sima’an. Meanwhile, Lo and Wu empirically compare n-gram, syntactic, and semantic structure based MT evaluation approaches. An encouraging trend is an uptick in work on low-resource language pairs and from underrepresented regions, including English-Persian (Mohaghegh, Sarrafzadeh and Moir), Manipuri-English (Singh and Bandyopadhyay), Arabic (Khemakhem, Jamoussi and Ben hamadou), and English-Hindi (Venkatapathy, Sangal, Joshi and Gali).

Once again this year, thanks are due to our authors and our Program Committee for making the SSST workshop another success.

Acknowledgements

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under GALE Contract No. HR0011-06-C-0022 (subcontract to BBN Technologies) and Contract No. HR0011-06-C-0023 (subcontract to SRI International). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

Organizers:

Dekai WU, Hong Kong University of Science and Technology (HKUST), Hong Kong

Program Committee:

Srinivas BANGALORE, AT&T Research, USA
Marine CARPUAT, Hong Kong University of Science and Technology (HKUST), Hong Kong
David CHIANG, USC Information Sciences Institute, USA
Pascale FUNG, Hong Kong University of Science and Technology (HKUST), Hong Kong
Daniel GILDEA, University of Rochester, USA
Dan KLEIN, University of California at Berkeley, USA
Kevin KNIGHT, USC Information Sciences Institute, USA
Jonas KUHN, University of Potsdam, Germany
Yang LIU, Institute of Computing Technology, Chinese Academy of Sciences, China
Yanjun MA, Dublin City University, Ireland
Daniel MARCU, USC Information Sciences Institute, USA
Yuji MATSUMOTO, Nara Institute of Science and Technology, Japan
Hermann NEY, RWTH Aachen, Germany
Owen RAMBOW, Columbia University, USA
Philip RESNIK, University of Maryland, USA
Stefan RIEZLER, Google Inc., USA
Libin SHEN, BBN Technologies, USA
Christoph TILLMANN, IBM T. J. Watson Research Center, USA
Stephan VOGEL, Carnegie Mellon University, USA
Taro WATANABE, NTT Communication Science Laboratories, Japan
Andy WAY, Dublin City University, Ireland
Yuk-Wah WONG, Google Inc., USA
Richard ZENS, Google Inc., USA
Chengqing ZONG, Institute of Automation, Chinese Academy of Sciences, China

Table of Contents

Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms Marc Dymetman and Nicola Cancedda ...... 1

A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment Markus Saers, Joakim Nivre and Dekai Wu ...... 10

Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT Jie Jiang, Jinhua Du and Andy Way ...... 19

Syntactic Constraints on Phrase Extraction for Phrase-Based Machine Translation Hailong Cao, Andrew Finch and Eiichiro Sumita ...... 28

Phrase Based Decoding using a Discriminative Model Prasanth Kolachina, Sriram Venkatapathy, Srinivas Bangalore, Sudheer Kolachina and Avinesh PVS...... 34

Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment Ventsislav Zhechev and Josef van Genabith ...... 43

Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation Chi-kiu Lo and Dekai Wu ...... 52

Arabic morpho-syntactic feature disambiguation in a translation context Ines Turki Khemakhem, Salma Jamoussi and Abdelmajid Ben hamadou ...... 61

A Discriminative Approach for Dependency Based Statistical Machine Translation Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali ...... 66

Improved Language Modeling for English-Persian Statistical Machine Translation Mahsa Mohaghegh, Abdolhossein Sarrafzadeh and Tom Moir ...... 75

Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations Thoudam Doren Singh and Sivaji Bandyopadhyay ...... 83

A Discriminative Syntactic Model for Source Permutation via Tree Transduction Maxim Khalilov and Khalil Sima’an...... 92

HMM Word-to-Phrase Alignment with Dependency Constraints Yanjun Ma and Andy Way ...... 101

New Parameterizations and Features for PSCFG-Based Machine Translation Andreas Zollmann and Stephan Vogel ...... 110

Deep Syntax Language Models and Statistical Machine Translation Yvette Graham and Josef van Genabith ...... 118

Conference Program

Saturday, August 28, 2010

9:00–10:40 Open Tutorial: Tree-Structured and Syntactic SMT (Dekai Wu)

10:40–11:05 Coffee break

11:05–12:05 Invited Keynote (Martin Kay)

12:05–12:25 Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms Marc Dymetman and Nicola Cancedda

12:25–12:55 A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment Markus Saers, Joakim Nivre and Dekai Wu

12:55–14:00 Lunch break

14:00–14:20 Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT Jie Jiang, Jinhua Du and Andy Way

14:20–14:40 Syntactic Constraints on Phrase Extraction for Phrase-Based Machine Translation Hailong Cao, Andrew Finch and Eiichiro Sumita

14:40–15:00 Phrase Based Decoding using a Discriminative Model Prasanth Kolachina, Sriram Venkatapathy, Srinivas Bangalore, Sudheer Kolachina and Avinesh PVS

15:00–15:20 Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment Ventsislav Zhechev and Josef van Genabith

15:20–15:40 Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation Chi-kiu Lo and Dekai Wu

15:40–16:05 Posters / Coffee break

Arabic morpho-syntactic feature disambiguation in a translation context Ines Turki Khemakhem, Salma Jamoussi and Abdelmajid Ben hamadou

Saturday, August 28, 2010 (continued)

A Discriminative Approach for Dependency Based Statistical Machine Translation Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali

Improved Language Modeling for English-Persian Statistical Machine Translation Mahsa Mohaghegh, Abdolhossein Sarrafzadeh and Tom Moir

Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations Thoudam Doren Singh and Sivaji Bandyopadhyay

16:05–16:25 A Discriminative Syntactic Model for Source Permutation via Tree Transduction Maxim Khalilov and Khalil Sima’an

16:25–16:45 HMM Word-to-Phrase Alignment with Dependency Constraints Yanjun Ma and Andy Way

16:45–17:05 New Parameterizations and Features for PSCFG-Based Machine Translation Andreas Zollmann and Stephan Vogel

17:05–17:25 Deep Syntax Language Models and Statistical Machine Translation Yvette Graham and Josef van Genabith

17:25–17:40 Discussion

Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms

Marc Dymetman    Nicola Cancedda
Xerox Research Centre Europe
{marc.dymetman,nicola.cancedda}@xrce.xerox.com

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 1–9, COLING 2010, Beijing, August 2010.

Abstract

We address the problem of constructing hybrid translation systems by intersecting a Hiero-style hierarchical system with a phrase-based system and present formal techniques for doing so. We model the phrase-based component by introducing a variant of weighted finite-state automata, called σ-automata, provide a self-contained description of a general algorithm for intersecting weighted synchronous context-free grammars with finite-state automata, and extend these constructs to σ-automata. We end by briefly discussing complexity properties of the presented algorithms.

1 Introduction

Phrase-based (Och and Ney, 2004; Koehn et al., 2007) and hierarchical (Hiero-style) (Chiang, 2007) models are two mainstream approaches for building Statistical Machine Translation systems, with different characteristics. Phrase-based systems allow a direct capture of correspondences between surface-level lexical patterns, but at the cost of a simplistic handling of re-ordering; hierarchical systems are better able to constrain re-ordering, especially for distant language pairs, but tend to produce sparser rules and often lag behind phrase-based systems for less distant language pairs. It might therefore make sense to capitalize on the complementary advantages of the two approaches by combining them in some way.

This paper attempts to lay out the formal prerequisites for doing so, by developing techniques for intersecting a hierarchical model and a phrase-based model. In order to do so, one first difficulty has to be overcome: while hierarchical systems are based on the mathematically well-understood formalism of weighted synchronous CFGs, phrase-based systems do not correspond to any classical formal model; they are loosely connected to weighted finite-state transducers, but crucially go beyond these by allowing phrase re-orderings.

One might try to address this issue by limiting a priori the amount of re-ordering, in the spirit of (Kumar and Byrne, 2005), which would allow one to approximate a phrase-based model by a standard transducer, but this would introduce further issues. First, limiting the amount of re-ordering in the phrase-based model runs contrary to the underlying intuitions behind the intersection, namely that the hierarchical model should be mainly responsible for controlling re-ordering, and the phrase-based model mainly responsible for lexical choice. Second, the transducer resulting from the operation could be large. Third, even if we could represent the phrase-based model through a finite-state transducer, intersecting this transducer with the synchronous CFG would actually be intractable in the general case, as we indicate later.

We then take another route. For a fixed source sentence x, we show how to construct an automaton that represents all the (weighted) target sentences that can be produced by applying the phrase-based model to x. However, this "σ-automaton" is non-standard in the sense that each transition is decorated with a set of source sentence tokens and that the only valid paths are those that do not traverse two sets containing the same token (in other words, valid paths cannot "consume" the same source token twice).

The reason we are interested in σ-automata is the following. First, it is known that intersecting a synchronous grammar simultaneously with the source sentence x and a (standard) target automaton results in another synchronous grammar; we provide a self-contained description of an algorithm for performing this intersection, in the general weighted case, and where x is generalized to an arbitrary source automaton. Second, we extend this algorithm to σ-automata. The resulting weighted synchronous grammar represents, as in Hiero, the "parse forest" (or "hypergraph") of all weighted derivations that can be built over x, but where the weights incorporate knowledge of the phrase-based component; it can therefore form the basis of a variety of dynamic programming or sampling algorithms (Chiang, 2007; Blunsom and Osborne, 2008), as is the case with standard Hiero-type representations. While in the worst case the intersected grammar can contain an exponential number of nonterminals, we argue that such combinatorial explosion will not happen in practice, and we also briefly indicate formal conditions under which it will not be allowed to happen.

2 Intersecting weighted synchronous CFGs with weighted automata

We assume that the notions of weighted finite-state automaton [W-FSA] and weighted synchronous grammar [W-SCFG] are known (for short descriptions see (Mohri et al., 1996) and (Chiang, 2006)), and we consider:

1. A W-SCFG G, with associated source grammar Gs (resp. target grammar Gt); the terminals of Gs (resp. Gt) vary over the source vocabulary Vs (resp. target vocabulary Vt).

2. A W-FSA As over the source vocabulary Vs, with initial state s# and final state s$.

3. A W-FSA At over the target vocabulary Vt, with initial state t# and final state t$.

The grammar G defines a weighted synchronous language L_G over (Vs, Vt), the automaton As a weighted language Ls over Vs, and the automaton At a weighted language Lt over Vt. We then define the intersection language L′ between these three languages as the synchronous language denoted L′ = Ls ∩ L_G ∩ Lt over (Vs, Vt) such that, for any pair (x, y) of a source and a target sentence, the weight L′(x, y) is defined by

    L′(x, y) ≡ Ls(x) · L_G(x, y) · Lt(y),

where Ls(x), L_G(x, y), Lt(y) are the weights associated to each of the component languages.

It is natural to ask whether there exists a synchronous grammar G′ generating the language L′, which we will now show to be the case.¹ Our approach is inspired by the construction in (Bar-Hillel et al., 1961) for the intersection of a CFG and an FSA and the observation in (Lang, 1994) relating this construction to parse forests, and also partially from (Satta, 2008), although, by contrast to that work, our construction (i) is done simultaneously rather than as the sequence of intersecting As with G, then the resulting grammar with At, and (ii) handles weighted formalisms rather than non-weighted ones.

We will describe the construction of G′ based on an example, from which the general construction follows easily. Consider a W-SCFG grammar G for translating between French and English, with initial nonterminal S, and containing among others the following rule:

    N → A manque à B / B misses A : θ,        (1)

where the source and target right-hand sides are separated by a slash symbol, and where θ is a non-negative real weight (interpreted multiplicatively) associated with the rule.

Now let's consider the following "rule scheme" (we linearize the paper's sub/superscript indices as X[source-span; target-span] for nonterminals and w[span] for terminals):

    N[s0,s4; t0,t3] → A[s0,s1; t2,t3] manque[s1,s2] à[s2,s3] B[s3,s4; t0,t1] /
                      B[s3,s4; t0,t1] misses[t1,t2] A[s0,s1; t2,t3]        (2)

¹We will actually only need the application of this result to the case where As is a "degenerate" automaton describing a single source sentence x, but the general construction is not harder to do than this special case and the resulting format for G′ is well-suited to our needs below.
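To make scheme (2) and weight formula (3) concrete, here is a small Python sketch. It is our own illustration, not the authors' implementation: the automata, the sentence pair, and all weights are made-up assumptions, and a W-FSA is encoded as a plain dict from transitions to weights.

```python
from itertools import product

# Toy W-FSAs (hypothetical data, not from the paper): a transition table
# maps (state, word, next_state) -> weight; absent transitions have weight 0.
A_s = {  # "degenerate" source automaton for "jean manque a marie" ("a" = "à")
    (0, "jean", 1): 1.0, (1, "manque", 2): 1.0,
    (2, "a", 3): 1.0, (3, "marie", 4): 1.0,
}
A_t = {  # a small target automaton
    (0, "marie", 1): 0.5, (1, "misses", 2): 0.8, (2, "jean", 3): 0.5,
}

def instantiate(theta, src_states, tgt_states):
    """Instantiate rule (1), N -> A manque a B / B misses A : theta,
    following scheme (2): choose source states s0..s4 and target states
    t0..t3; keep an instantiation iff every indexed terminal is an actual
    transition, and weight it as in formula (3):
      theta' = theta * w_As(s1,manque,s2) * w_As(s2,a,s3) * w_At(t1,misses,t2).
    Terminal indices are then "forgotten": we only keep the four indices
    decorating each nonterminal (LHS N, and the RHS copies of A and B).
    """
    rules = []
    for s in product(src_states, repeat=5):          # s0..s4
        w_src = (A_s.get((s[1], "manque", s[2]), 0.0)
                 * A_s.get((s[2], "a", s[3]), 0.0))
        if w_src == 0.0:
            continue
        for t in product(tgt_states, repeat=4):      # t0..t3
            w_tgt = A_t.get((t[1], "misses", t[2]), 0.0)
            if w_tgt == 0.0:
                continue
            lhs = ("N", s[0], s[4], t[0], t[3])
            rhs = [("A", s[0], s[1], t[2], t[3]),    # same four indices on
                   ("B", s[3], s[4], t[0], t[1])]    # both copies of each NT
            rules.append((lhs, rhs, theta * w_src * w_tgt))
    return rules

rules = instantiate(0.7, range(5), range(4))
# Every realized instantiation forces manque on (1,2) and a on (2,3) in A_s
# and misses on (1,2) in A_t; s0, s4, t0, t3 stay free, giving 5*5*4*4 rules.
```

The reduction step described next in the paper would then discard the many instantiations whose nonterminals are improductive, keeping, for example, the one spanning the whole sentence pair, ("N", 0, 4, 0, 3).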

This scheme consists in an "indexed" version of the original rule, where the bottom indices si correspond to states of As ("source states"), and the top indices tj to states of At ("target states"). The nonterminals are associated with two source and two target indices, and for the same nonterminal, these four indices have to match across the source and the target RHSs of the rule. As for the original terminals, they are replaced by "indexed terminals", where source (resp. target) terminals have two source (resp. target) indices. The source indices appear sequentially on the source RHS of the rule, in the pattern s0, s1, s1, s2, s2, ..., s(m−1), sm, with the nonterminal on the LHS receiving source indices s0 and sm, and similarly the target indices appear sequentially on the target RHS of the rule, in the pattern t0, t1, t1, t2, t2, ..., t(n−1), tn, with the nonterminal on the LHS receiving target indices t0 and tn. To clarify, the operation of associating indices to terminals and nonterminals can be decomposed into three steps:

    N[s0,s4] → A[s0,s1] manque[s1,s2] à[s2,s3] B[s3,s4] / B misses A
    N[t0,t3] → A manque à B / B[t0,t1] misses[t1,t2] A[t2,t3]
    N[s0,s4; t0,t3] → A[s0,s1; t2,t3] manque[s1,s2] à[s2,s3] B[s3,s4; t0,t1] /
                      B[s3,s4; t0,t1] misses[t1,t2] A[s0,s1; t2,t3]

where the first two steps correspond to handling the source and target indices separately, and the third step then assembles the indices in order to get the same four indices on the two copies of each RHS nonterminal. The rule scheme (2) now generates a family of rules, each of which corresponds to an arbitrary instantiation of the source and target indices to states of the source and target automata respectively. With every such rule instantiation, we associate a weight θ′ which is defined as:

    θ′ ≡ θ · ∏_{s-term[si,si+1]} θ_As(si, s-term, si+1) · ∏_{t-term[tj,tj+1]} θ_At(tj, t-term, tj+1),        (3)

where the first product is over the indexed source terminals s-term[si,si+1], the second product over the indexed target terminals t-term[tj,tj+1]; θ_As(si, s-term, si+1) is the weight of the transition (si, s-term, si+1) according to As, and similarly for θ_At(tj, t-term, tj+1). In these products, it may happen that θ_As(si, s-term, si+1) is null (and similarly for At), and in such a case the corresponding rule instantiation is considered not to be realized. Let us consider the multiset of all the weighted rule instantiations for (1) computed in this way, and for each rule in the collection, let us "forget" the indices associated to the terminals. In this way, we obtain a collection of weighted synchronous rules over the vocabularies Vs and Vt, but where each nonterminal is now indexed by four states.²

When we apply this procedure to all the rules of the grammar G, we obtain a new weighted synchronous CFG G′, with start symbol S[s#,s$; t#,t$], for which we have the following Fact, of which we omit the proof for lack of space.

Fact 1. The synchronous language L_G′ associated with G′ is equal to L′ = Ls ∩ L_G ∩ Lt.

The grammar G′ that we have just constructed does fulfill the goal of representing the bilateral intersection that we were looking for, but it has a serious defect: most of its nonterminals are improductive, that is, can never produce a bi-sentence. If a rule refers to such an improductive nonterminal, it can be eliminated from the grammar. This is the analogue for an SCFG of the classical operation of reduction for CFGs; while, conceptually, we could start from G′ and perform the reduction by deleting the many rules containing improductive nonterminals, it is equivalent but much more efficient to do the reverse, namely to incrementally add the productive nonterminals and rules of G′ starting from an initially empty set of rules, and by proceeding bottom-up starting from the terminals. We do not detail this process, which is relatively straightforward.³

A note on intersecting SCFGs with transducers   Another way to write Ls ∩ L_G ∩ Lt is as the intersection (Ls × Lt) ∩ L_G. (Ls × Lt) can be seen as a rational language (a language generated by a finite-state transducer) of an especially simple form over Vs × Vt. It is then natural to ask whether our previous construction can be generalized to the intersection of G with an arbitrary finite-state transducer. However, this is not the case. Deciding the emptiness problem for the intersection between two finite-state transducers is already undecidable, by reduction to Post's Correspondence Problem (Berstel, 1979, p. 90), and we have extended the proof of this fact to show that intersection between a synchronous CFG and a finite-state transducer also has an undecidable emptiness problem (the proof relies on the fact that a finite-state transducer can be simulated by a synchronous grammar). A fortiori, this intersection cannot be represented through an (effectively constructed) synchronous CFG.

3 Phrase-based models and σ-automata

3.1 σ-automata: definition

Let Vs be a source vocabulary, Vt a target vocabulary. Let x = x1, ..., xM be a fixed sequence of words over the source vocabulary Vs. Let us denote by z a token in the sequence x, and by Z the set of the M tokens in x. A σ-automaton over x has the general form of a standard weighted automaton over the target vocabulary, but where the edges are also decorated with elements of P(Z), the powerset of Z (see Fig. 1). An edge in the σ-automaton between two states q and q′ then carries a label of the form (α, β), where α ∈ P(Z) and β ∈ Vt (note that here we do not allow β to be the empty string ε). A path from the initial state of the automaton to its final state is defined to be valid iff each token of x appears in exactly one label of the path, but not necessarily in the same order as in x. As usual, the output associated with the path is the ordered sequence of target labels on that path, and the weight of the path is the product of the weights on its edges.

3.2 σ-automata and phrase-based translation

A mainstream phrase-based translation system such as Moses (Koehn et al., 2007) can be accounted for in terms of σ-automata in the following way. To simplify exposition, we assume that the language model used is a bigram model, but any n-gram model can be accommodated. Then, given a source sentence x, decoding works by attempting to construct a sequence of phrase-pairs of the form (x̃1, ỹ1), ..., (x̃k, ỹk) such that each x̃i corresponds to a contiguous subsequence of tokens of x, and the x̃i's do not overlap and completely cover x, but may appear in a different order than that of x; the output associated with the sequence is simply the concatenation of all the ỹi's in that sequence.⁴ The weight associated with the sequence of phrase-pairs is then the product (when we work with probabilities rather than log-probabilities) of the weight of each (x̃i+1, ỹi+1) in the context of the previous (x̃i, ỹi), which consists in the product of several elements: (i) the "out-of-context" weight of the phrase-pair (x̃i+1, ỹi+1) as determined by its features in the phrase table, (ii) the language model probability of finding ỹi+1 following ỹi,⁵ and (iii) the contextual weight of (x̃i+1, ỹi+1) relative to (x̃i, ỹi), corresponding to the distortion cost of "jumping" from the token sequence x̃i to the token sequence x̃i+1 when these two sequences may not be consecutive in x.⁶

Such a model can be represented by a σ-automaton, where each phrase-pair (x̃, ỹ), for

²It is possible that the multiset obtained by this simplifying operation contains duplicates of certain rules (possibly with different weights), due to the non-determinism of the automata: for instance, two sequences such as manque[s1,s2] à[s2,s3] and manque[s1,s2′] à[s2′,s3] become indistinguishable after the operation. Rather than producing multiple instances of rules in this way, one can "conflate" them together and add their weights.
³This bottom-up process is analogous to chart-parsing, but here we have decomposed the construction into first building a semantics-preserving grammar and then reducing it, which we think is formally neater.
⁴We assume here that the phrase-pairs (x̃i, ỹi) are such that ỹi is not the empty string (this constraint could be removed by an adaptation of the ε-removal operation (Mohri, 2002) to σ-automata).
⁵This is where the bigram assumption is relevant: for a trigram model, we may need to encode in the automaton not only the immediately preceding phrase-pair, but also the previous one, and so on for higher-order models. An alternative is to keep the n-gram language model outside the σ-automaton and intersect it later with the grammar G′ obtained in section 4, possibly using approximation techniques such as cube-pruning (Chiang, 2007).
⁶Any distortion model (in particular "lexicalized reordering") that only depends on comparing two consecutive phrase-pairs can be implemented in this way.

[Figure 1 (diagram): top, the σ-automaton for the example sentence, with boxed phrase-pair states #, h, a, b, tcl, r, k, f, $ and unboxed internal states tcl1, tcl2; bottom, the associated target W-FSA At obtained by dropping the source-token sets. See the caption below.]

Figure 1: On the top: a σ-automaton with two valid paths shown. Each box denotes a state corresponding to a phrase pair, while states internal to a phrase pair (such as tcl1 and tcl2) are not boxed. Above each transition we have indicated the corresponding target word, and below it the corresponding set of source tokens. We use a terminal symbol $ to denote the end of sentence both on the source and on the target. The solid path corresponds to the output these totally corrupt lawyers are finished, the dotted path to the output these brown avocadoes are cooked. Note that the source tokens are not necessarily consumed in the order given by the source, and that, for example, there exists a valid path generating these are totally corrupt lawyers finished and moving according to h → r → tcl1 → tcl2 → tcl → f. Note, however, that this does not mean that if a biphrase such as (marrons avocats, avocado chestnuts) existed in the phrase table, it would be applicable to the source sentence here: because the source words in this biphrase would not match the order of the source tokens in the sentence, the biphrase would not be included in the σ-automaton at all. On the bottom: the target W-FSA automaton At associated with the σ-automaton, where we are ignoring the source tokens (but keeping the same weights).

x̃ a sequence of tokens in x and (x̃, ỹ) an entry in the global phrase table, is identified with a state of the automaton, and where the fact that the phrase-pair (x̃′, ỹ′) = ((x1, ..., xk), (y1, ..., yl)) follows (x̃, ỹ) in the decoding sequence is modeled by introducing l "internal" transitions with labels (σ, y1), (∅, y2), ..., (∅, yl), where σ = {x1, ..., xk}, and where the first transition connects the state (x̃, ỹ) to some unique "internal state" q1, the second transition the state q1 to some unique internal state q2, and the last transition the state q(l−1) to the state (x̃′, ỹ′).⁷ Thus, a state (x̃′, ỹ′) essentially encodes the previous phrase-pair used during decoding, and it is easy to see that it is possible to account for the different weights associated with the phrase-based model by weights associated to the transitions of the σ-automaton.⁸

Example   Let us consider the following French source sentence x: ces avocats marrons sont cuits (idiomatic expression for these totally corrupt lawyers are finished). Let's assume that the phrase table contains the following phrase pairs:

    h: (ces, these)
    a: (avocats, avocados)
    b: (marrons, brown)
    tcl: (avocats marrons, totally corrupt lawyers)
    r: (sont, are)
    k: (cuits, cooked)
    f: (cuits, finished).

An illustration of the corresponding σ-automaton SA is shown at the top of Figure 1, with only a few transitions made explicit, and with no weights shown.⁹

⁷For simplicity, we have chosen to collect the set {x1, ..., xk} of all the source tokens on the first transition, but we could distribute it on the l transitions arbitrarily (but keeping the subsets disjoint) without changing the semantics of what we do. This is because once we have entered one of the l internal transitions, we will always have to traverse the remaining internal transitions and collect the full set of source tokens.
⁸By creating states such as ((x̃, ỹ), (x̃′, ỹ′)) that encode the two previous phrase-pairs used during decoding, it is possible in principle to account for a trigram language model, and similarly for higher-order LMs. This is similar to implementing n-gram language models by automata whose states encode the n−1 words previously generated.
⁹Only two (valid) paths are shown. If we had shown the full σ-automaton, then the graph would have been "complete" in the sense that for any two box states B and B′, we would have shown a connection B → B′1 → ... → B′(k−1) → B′, where the B′i are internal states, and k is the length of the target side of the biphrase B′.
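The validity condition on paths can be sketched in a few lines of Python. This toy is our own reconstruction from the caption of Figure 1 (the exact transition set is an assumption, and weights are omitted): states are phrase-pair ids, transitions carry a target word plus the set of consumed source tokens, and valid outputs are enumerated by depth-first search.

```python
# Toy σ-automaton for "ces avocats marrons sont cuits" (token ids 0..4,
# with 5 standing for the end-of-sentence marker $). Transition set assumed
# from the caption of Figure 1; weights omitted.
# TRANS[state] = list of (next_state, target_word, consumed_source_tokens)
TRANS = {
    "#":    [("h", "these", {0})],
    "h":    [("a", "avocadoes", {1}), ("b", "brown", {2}),
             ("tcl1", "totally", {1, 2}), ("r", "are", {3})],
    "a":    [("b", "brown", {2}), ("r", "are", {3})],
    "b":    [("a", "avocadoes", {1}), ("r", "are", {3})],
    "tcl1": [("tcl2", "corrupt", set())],  # internal states of the biphrase
    "tcl2": [("tcl", "lawyers", set())],   # tcl; its source tokens are all
    "tcl":  [("r", "are", {3}), ("f", "finished", {4})],  # on the first edge
    "r":    [("k", "cooked", {4}), ("f", "finished", {4}),
             ("tcl1", "totally", {1, 2})],
    "k":    [("$", "$", {5})],
    "f":    [("$", "$", {5})],
}
ALL_TOKENS = frozenset(range(6))

def valid_outputs(state="#", used=frozenset(), out=()):
    """Enumerate outputs of valid paths: no source token may be consumed
    twice, and every token must be consumed by the time we reach $."""
    if state == "$":
        if used == ALL_TOKENS:
            yield " ".join(out)
        return
    for nxt, word, toks in TRANS.get(state, []):
        if used & toks:                    # token consumed twice: invalid
            continue
        yield from valid_outputs(nxt, used | toks, out + (word,))

outputs = set(valid_outputs())
```

Among the outputs is the re-ordered these are totally corrupt lawyers finished $ (the path h → r → tcl1 → tcl2 → tcl → f mentioned in the caption), while a string such as these brown avocadoes brown are $, although accepted by the projected automaton At, is correctly rejected here.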
Bk0 1 → → − → source tokens. B0, where the Bi0 are internal states, and k is the length of 8 By creating states such as ((˜x, y˜), (˜x0, y˜0)) that en- the target side of the biphrase B0.

4 Intersecting a synchronous grammar with a σ-automaton

We base our explanations on the example just given.

Intersection of a W-SCFG with a σ-automaton. If SA is a σ-automaton over input x, with each valid path in SA we associate a weight in the same way as we do for a weighted automaton. For any target word sequence in Vt* we can then associate the sum of the weights of all valid paths outputting that sequence. The weighted language L_SA,x over Vt obtained in this way is called the language associated with SA. Let G be a W-SCFG over Vs, Vt, and let us denote by L_G,x the weighted language over Vs, Vt corresponding to the intersection {x} ⊓ G ⊓ Vt*, where {x} denotes the language giving weight 1 to x and weight 0 to other sequences in Vs*, and Vt* denotes the language giving weight 1 to all sequences in Vt*. Note that non-null bi-sentences in L_G,x have their source projection equal to x, and therefore L_G,x can be identified with a weighted language over Vt. The intersection of the languages L_SA,x and L_G,x is denoted by L_SA,x ⊓ L_G,x.

Example. Let us consider the following W-SCFG (where again, weights are not explicitly shown, and where we use a terminal symbol $ to denote the end of a sentence, a technicality needed for making the grammar compatible with the automaton SA of Figure 1):

  S  → NP VP $ / NP VP $
  NP → ces N A / these A N
  VP → sont A / are A
  A  → marrons / brown
  A  → marrons / totally corrupt
  A  → cuits / cooked
  A  → cuits / finished
  N  → avocats / avocadoes
  N  → avocats / lawyers

It is easy to see that, for instance, the sentences these brown avocadoes are cooked $, these brown avocadoes are finished $, and these totally corrupt lawyers are finished $ all belong to the intersection L_SA,x ⊓ L_G,x, while the sentences these avocadoes brown are cooked $ and totally corrupt lawyers are finished these $ belong only to L_SA,x.

A Relaxation of the Intersection. At the bottom of Figure 1, we show how we can associate an automaton At with the σ-automaton SA: we simply "forget" the source sides of the labels carried by the transitions, and retain all the weights. As before, note that we are only showing a subset of the transitions here. All valid paths for SA map into valid paths for At (with the same weights), but the reverse is not true, because some valid At paths can correspond to traversals of SA that either consume the same source token several times or do not consume all source tokens. For instance, the sentence these brown avocadoes brown are $ belongs to the language of At, but cannot be produced by SA. Let us however consider the intersection {x} ⊓ G ⊓ At, where, with a slight abuse of notation, we have notated {x} the "degenerate" automaton representing the sentence x, namely the automaton (with weights on all transitions equal to 1):

  (0) —ces→ (1) —avocats→ (2) —marrons→ (3) —sont→ (4) —cuits→ (5) —$→ (6)

This is a relaxation of the true intersection, but one that we can represent through a W-SCFG, as we know from section 2.¹⁰ This being noted, we now move to the construction of the full intersection.

The full intersection. We discussed in section 2 how to modify a synchronous grammar rule to produce the indexed rule scheme (2) representing the bilateral intersection of the grammar with two automata. Let us redo that construction here, in the case of our example.

¹⁰ Note that, in the case of our very simple example, any target string that belongs to this relaxed intersection (which consists of the eight sentences these {brown | totally corrupt} {avocadoes | lawyers} are {cooked | finished}) actually belongs to the full intersection, as none of these sentences corresponds to a path in SA that violates the token-consumption constraint. More generally, it may often be the case in practice that the W-SCFG, by itself, provides enough "control" of the possible target sentences to prevent generation of sentences that would violate the token-consumption constraints, so that there may be little difference in practice between performing the relaxed intersection {x} ⊓ G ⊓ At and performing the full intersection {x} ⊓ G ⊓ L_SA,x.
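As a concrete check on the running example, the target side of the relaxed intersection can be enumerated by brute force. The sketch below is not from the paper: the `lexicon` dict and the `target_sentences` helper are invented names, and only the lexical choices plus the N/A inversion of the NP rule are modelled.

```python
from itertools import product

# Lexical alternatives of the toy W-SCFG (weights omitted); "ces" and
# "sont" translate deterministically, and the NP rule inverts N and A.
lexicon = {
    "avocats": ["avocadoes", "lawyers"],
    "marrons": ["brown", "totally corrupt"],
    "cuits": ["cooked", "finished"],
}

def target_sentences():
    """Enumerate the target side of the relaxed intersection for the
    source sentence 'ces avocats marrons sont cuits $'."""
    sents = []
    for tn, ta, tc in product(lexicon["avocats"], lexicon["marrons"],
                              lexicon["cuits"]):
        # NP -> ces N A / these A N: the adjective precedes the noun
        sents.append(" ".join(["these", ta, tn, "are", tc, "$"]))
    return sents
```

For this grammar the enumeration yields exactly the eight sentences of the relaxed intersection; a full implementation would additionally check each sentence against the token-consumption constraints of SA.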

Building the intersection. We now describe how to build a W-SCFG that represents the intersection of the W-SCFG, of the target automaton represented on the bottom of Figure 1, and of the source automaton {x}. The construction is done in three steps. Writing indices for source positions below and target states above, the rule NP → ces N A / these A N is first annotated with source positions, then with target states:

  s0NPs3 → s0 ces s1 · s1Ns2 · s2As3 / these A N
  t0NPt3 → ces N A / t0 these t1 · t1At2 · t2Nt3
  s0,t0NPs3,t3 → s0 ces s1 · s1Ns2 · s2As3 / t0 these t1 · t1At2 · t2Nt3

(on the source side, N is associated with the target states t2, t3 and A with t1, t2, reflecting the inversion of N and A on the target side).

In order to adapt that construction to the case where we want the intersection to be with a σ-automaton, what we need to do is to further specialize the nonterminals. Rather than specializing a nonterminal X in the form sX^{t,t′}s′, we specialize it in the form sX^{t,t′,σ}s′, where σ represents a set of source tokens obtained by "collecting" the source tokens in the σ-automaton along a path connecting the states t and t′.¹¹ We then proceed to define a new rule scheme associated to our rule, which is obtained as before in three steps, with each target-state pair now carrying a token set σ:

  t0NPt3,σ03 → ces N A / t0 these t1 (σ01) · t1At2 (σ12) · t2Nt3 (σ23)

The only difference with our previous technique is in the addition of the σ's to the top indices. Let us focus on the second step of the annotation process. Conceptually, when instantiating this scheme, the ti's may range over all possible states of the σ-automaton, and the σij over all subsets of the source tokens, but under the following constraints: the RHS σ's (here σ01, σ12, σ23) must be disjoint and their union must be equal to the σ on the LHS (here σ03). Additionally, a σ associated with a target terminal (as σ01 here) must be equal to the token set associated with the transition that this terminal realizes between σ-automaton states (here, this means that σ01 must be equal to the token set {ces} associated with the transition between t0 and t1 labelled with 'these'). If we perform all these instantiations, compute their weights according to equation (3), and finally remove the indices associated with terminals in the rules (by adding the weights of the rules differing only by the indices of terminals, as done previously), we obtain a very large "raw" grammar, but one for which one can prove a direct counterpart of Fact 1. Let us call, as before, G′ the raw W-SCFG obtained, its start symbol being #S$^{t#,t$,σall}, with σall the set of all source tokens in x.

Fact 2. The synchronous language L_G′ associated with G′ is equal to ({x}, L_SA,x ⊓ L_G,x).

The grammar that is obtained this way, despite correctly representing the intersection, contains a lot of useless rules, this being due to the fact that many nonterminals cannot produce any output. The situation is wholly similar to the case of section 2, and the same bottom-up techniques can be used for activating nonterminals and rules.

The algorithm is illustrated in Figure 2, where we have shown the result of the process of activating in turn the nonterminals (abbreviated by) N1, A1, A2, NP1, VP1, S1. As a consequence of these activations, the original grammar rule NP → ces N A / these A N (for instance) becomes instantiated as the rule (writing X[t, t′, σ] for a symbol decorated with σ-automaton states t, t′ and token set σ):

  0NP3[#, tcl, {ces, avocats, marrons}] →
      0 ces 1 · 1N2[tcl2, tcl, ∅] · 2A3[h, tcl2, {avocats, marrons}]
    / these[#, h, {ces}] · 2A3[h, tcl2, {avocats, marrons}] · 1N2[tcl2, tcl, ∅]

¹¹ To avoid a possible confusion, it is important to note right away that σ is not necessarily related to the tokens appearing between the positions s and s′ in the source sentence (that is, between these states in the associated source automaton), but is defined solely in terms of the source tokens along the t, t′ path. See the example with "persons" and "people" below.
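The instantiation constraints on the σ sets (the RHS σ's must be pairwise disjoint and their union must equal the LHS σ) can be expressed as a small predicate. A minimal sketch; `valid_instantiation` is a hypothetical helper, not code from the paper.

```python
def valid_instantiation(lhs_sigma, rhs_sigmas):
    """Check the side conditions on the token sets when instantiating a
    rule scheme: the RHS sigmas must be pairwise disjoint and their
    union must equal the sigma on the LHS."""
    union = set()
    for s in rhs_sigmas:
        if union & s:          # overlap: the sets are not disjoint
            return False
        union |= s
    return union == lhs_sigma
```

The additional condition on target terminals (that σ01 equal the token set {ces} of the realized transition) would be checked separately against the σ-automaton's transition labels.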

[Figure 2 appears here: biparse charts over the source sequence ces avocats marrons sont cuits $ and over a σ-automaton path, showing the activated nonterminals N1, A1, A2, NP1, VP1, S1 and the abbreviations used in the caption.]

Figure 2: Building the intersection. The bottom of the figure shows some active non-terminals associated with the source sequence, and at the top these same non-terminals associated with a sequence of transitions in the σ-automaton, corresponding to the target sequence these totally corrupt lawyers are finished $. To avoid cluttering the drawing, we have used the abbreviations shown on the right. Note that while A1 only spans marrons in the bottom chart, it is actually decorated with the source token set {avocats, marrons}: such a "disconnect" between the views that the W-SCFG and the σ-automaton have of the source tokens is not ruled out.

That is, after removal of the indices on terminals:

  0NP3[#, tcl, {ces, avocats, marrons}] →
      0 ces 1 · 1N2[tcl2, tcl, ∅] · 2A3[h, tcl2, {avocats, marrons}]
    / these · 2A3[h, tcl2, {avocats, marrons}] · 1N2[tcl2, tcl, ∅]

Note that while the nonterminal 1N2[tcl2, tcl, ∅] by itself consumes no source token (it is associated with the empty token set), any actual use of this nonterminal (in this specific rule or possibly in some other rule using it) does require traversing the internal node tcl2 and therefore all the internal nodes "belonging" to the biphrase tcl (because otherwise the path from # to $ would be disconnected); in particular this involves consuming all the tokens on the source side of tcl, including 'avocats'.¹²

Complexity considerations. The bilateral intersection that we defined between a W-SCFG and two W-FSAs in section 2 can be shown to be of polynomial complexity, in the sense that it takes polynomial time and space, relative to the sum of the sizes of the two automata and of the grammar, to construct the (reduced) intersected grammar G′, under the condition that the grammar right-hand sides have length bounded by a constant.¹³ The situation here is different, because the construction of the intersection can in principle introduce nonterminals indexed not only by states of the automata, but also by arbitrary subsets of source tokens, and this may lead in extreme cases to an exponential number of rules. Such problems however can only happen in situations where, in a nonterminal sX^{t,t′,σ}s′, the set σ is allowed to contain tokens that are "unrelated" to the tokens appearing between s and s′ in the source automaton. An illustration of such a situation is given by the following example.

¹² In particular there is no risk that a derivation relative to the intersected grammar generates a target containing two instances of 'lawyers', one associated with the expansion of 1N2[tcl2, tcl, ∅] and consuming no source token, and another one associated with a different nonterminal and consuming the source token 'avocats': this second instance would involve not traversing tcl1, which is impossible as soon as 1N2[tcl2, tcl, ∅] is used.

¹³ If this condition is removed, and for the simpler case where the source (resp. target) automaton encodes a single sentence x (resp. y), Satta and Peserico (2005) have shown that the problem of deciding whether (x, y) is recognized by G is NP-hard relative to the sum of the sizes. A consequence is then that the grammar G′ cannot be constructed in polynomial time unless P = NP.

Suppose that the source sentence contains the two tokens personnes and gens between positions i, i+1 and j, j+1 respectively, with i and j far from each other; that the phrase table contains the two phrase pairs (personnes, persons) and (gens, people); but that the synchronous grammar only contains the two rules X → personnes/people and Y → gens/persons, with these phrases and rules exhausting the possibilities for translating gens and personnes. Then the intersected grammar will contain such nonterminals as iXi+1[t, t′, {gens}] and jYj+1[r, r′, {personnes}], where in the first case the token set {gens} in the nonterminal is unrelated to the tokens appearing between i and i+1, and similarly in the second case.

Without experimentation on real cases, it is impossible to say whether such phenomena would empirically lead to combinatorial explosion, or whether the synchronous grammar would sufficiently constrain the phrase-based component (whose re-ordering capabilities are responsible in fine for the potential NP-hardness of the translation process) to avoid it. Another possible approach is to prevent a priori a possible combinatorial explosion by adding formal constraints to the intersection mechanism. One such constraint is the following: disallow introduction of iXj[t, t′, σ] when the symmetric difference between σ and the set of tokens between positions i and j in the source sentence has cardinality larger than a small constant. Such a constraint could be interpreted as keeping the SCFG and phrase-based components "in sync", and would be better adapted to the spirit of our approach than limiting the amount of re-ordering permitted to the phrase-based component, which would contradict the reason for using a hierarchical component in the first place.

5 Conclusion

Intersecting hierarchical and phrase-based models of translation could allow capitalizing on complementarities between the two approaches. Thus, one might train the hierarchical component on corpora represented at the part-of-speech level (or at a level where lexical units are abstracted into some kind of classes) while the phrase-based component would focus on translation of lexical material. The present paper does not have the ambition to demonstrate that such an approach would improve translation performance, but only to provide some formal means for advancing towards that goal.

References

Bar-Hillel, Y., M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172.

Berstel, Jean. 1979. Transductions and Context-Free Languages. Teubner, Stuttgart.

Blunsom, P. and M. Osborne. 2008. Probabilistic inference for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 215–223. Association for Computational Linguistics.

Chiang, David. 2006. An introduction to synchronous grammars. www.isi.edu/~chiang/papers/synchtut.pdf, June.

Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33:201–228.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. The Association for Computational Linguistics.

Kumar, Shankar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. HLT/EMNLP.

Lang, Bernard. 1994. Recognition can be harder than parsing. Computational Intelligence, 10:486–494.

Mohri, Mehryar, Fernando Pereira, and Michael Riley. 1996. Weighted automata in text and speech processing. In ECAI-96 Workshop on Extended Finite State Models of Language.

Mohri, Mehryar. 2002. Generic epsilon-removal and input epsilon-normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science, 13:129–143.

Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Satta, Giorgio and Enoch Peserico. 2005. Some computational complexity results for synchronous context-free grammars. In HLT '05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 803–810, Morristown, NJ, USA. Association for Computational Linguistics.

Satta, Giorgio. 2008. Translation algorithms by means of language intersection. Submitted. www.dei.unipd.it/~satta/publ/paper/inters.pdf.

A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment

Markus Saers and Joakim Nivre
Dept. of Linguistics & Philology
Uppsala University

Dekai Wu
Human Language Technology Center
Dept. of Computer Science & Engineering
Hong Kong Univ. of Science & Technology

[email protected]  [email protected]

Abstract

We present two contributions to grammar-driven translation. First, since both Inversion Transduction Grammars and Linear Inversion Transduction Grammars have been shown to produce better alignments than the standard word alignment tool, we investigate how the trade-off between speed and end-to-end translation quality extends to the choice of grammar formalism. Second, we prove that Linear Transduction Grammars (LTGs) generate the same transductions as Linear Inversion Transduction Grammars, and present a scheme for arriving at LTGs by bilingualizing Linear Grammars. We also present a method for obtaining Inversion Transduction Grammars from Linear (Inversion) Transduction Grammars, which can speed up grammar induction from parallel corpora dramatically.

1 Introduction

In this paper we introduce Linear Transduction Grammars (LTGs), which are the bilingual case of Linear Grammars (LGs). We also show that LTGs are equal to Linear Inversion Transduction Grammars (Saers et al., 2010). To be able to induce transduction grammars directly from parallel corpora, an approximate search for parses is needed. The trade-off between speed and end-to-end translation quality is investigated and compared to Inversion Transduction Grammars (Wu, 1997) and the standard tool for word alignment, GIZA++ (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003). A heuristic for converting stochastic bracketing LTGs into stochastic bracketing ITGs is presented, and fitted into the speed–quality trade-off.

In section 3 we give an overview of transduction grammars, introduce LTGs and show that they are equal to LITGs. In section 4 we give a short description of the rationale for the transduction grammar pruning used. In section 5 we describe a way of seeding a stochastic bracketing ITG with the rules and probabilities of a stochastic bracketing LTG. Section 6 describes the setup, and results are given in section 7. Finally, some conclusions are offered in section 8.

2 Background

Any form of automatic translation that relies on generalizations of observed translations needs to align these translations on a sub-sentential level. The standard way of doing this is by aligning words, which works well for languages that use white-space separators between words. The standard method is a combination of the family of IBM models (Brown et al., 1993) and Hidden Markov Models (Vogel et al., 1996). These methods all arrive at a function (A) from language 1 (F) to language 2 (E). By running the process in both directions, two functions can be estimated and then combined to form an alignment. The simplest of these combinations are intersection and union, but usually the intersection is heuristically extended. Transduction grammars, on the other hand, impose a shared structure on the sentence pairs, thus forcing a consistent alignment in both directions.
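The symmetrization step can be sketched as follows, assuming each directional alignment is a dict from source position to target position (a simplification: real IBM-model alignments also allow unaligned words). The `combine` helper is a hypothetical name for this sketch, not part of GIZA++.

```python
def combine(f2e, e2f, method="intersection"):
    """Symmetrize two directional word alignments, given as dicts from
    source position to target position. Links from the E->F direction
    are flipped so both link sets live in (F, E) coordinates."""
    a1 = set(f2e.items())
    a2 = {(j, i) for i, j in e2f.items()}  # flip E->F links to (F, E)
    return a1 & a2 if method == "intersection" else a1 | a2
```

The heuristic extension mentioned in the text would then grow the intersection towards the union by adding adjacent links.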

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 10–18, COLING 2010, Beijing, August 2010.

This method has proved successful in the settings in which it has been tried (Zhang et al., 2008; Saers and Wu, 2009; Haghighi et al., 2009; Saers et al., 2009; Saers et al., 2010). Most efforts focus on cutting down time complexity so that larger data sets than toy examples can be processed.

3 Transduction Grammars

Transduction grammars were first introduced in Lewis and Stearns (1968), and further developed in Aho and Ullman (1972). The original notation called for regular CFG rules in language F with rephrased E productions, either in curly brackets or comma-separated. The bilingual version of CFGs is called Syntax-Directed Transduction Grammars (SDTGs). To differentiate identical nonterminal symbols, indices were used (the bags of nonterminals for the two productions are equal by definition):

  A → B(1) a B(2) {x B(1) B(2)}  =  A → B(1) a B(2), x B(1) B(2)

The semantics of the rules is that one nonterminal rewrites into a bag of nonterminals that is distributed independently in the two languages, and interspersed with any number of terminal symbols in the respective languages. As with CFGs, the terminal symbols can be factored out into preterminals, with the added twist that they are shared between the two languages, since preterminals are formally nonterminals. The above rule can thus be rephrased as

  A → B(1) Xa/x B(2), Xa/x B(1) B(2)
  Xa/x → a, x

In this way, rules producing nonterminals and rules producing terminals can be separated. Since only nonterminals are allowed to move, their movement can be represented as the original sequence of nonterminals and a permutation vector as follows:

  A → B Xa/x B ; 1, 0, 2
  Xa/x → a, x

To keep the reordering as monotone as possible, the terminals a and x can be produced separately, but doing so eliminates any possibility of parameterizing their lexical relationship. Instead, the individual terminals are paired up with the empty string (ǫ):

  A → Xx B Xa B ; 0, 1, 2, 3
  Xa → a, ǫ
  Xx → ǫ, x

Lexical rules involving the empty string are referred to as singletons. Whenever a preterminal is used to pair up two terminal symbols, we refer to that pair of terminals as a biterminal, which will be written as e/f.

Any SDTG can be rephrased to contain permuted nonterminal productions and biterminal productions only, and we will call this the normal form of SDTGs. Note that it is not possible to produce a two-normal form for SDTGs, as there are some rules that are not binarizable (Wu, 1997; Huang et al., 2009). This is an important point to make, since efficient parsing for CFGs is based on either restricting parsing to only handle binary grammars (Cocke, 1969; Kasami, 1965; Younger, 1967), or relying on on-the-fly binarization (Earley, 1970). When translating with a grammar, parsing only has to be done in F, which is binarizable (since it is a CFG), and can therefore be computed in polynomial time (O(n³)). Once there is a parse tree for F, the corresponding tree for E can easily be constructed. When inducing a grammar from examples, however, biparsing (finding an analysis that is consistent across a sentence pair) is needed. The time complexity for biparsing with SDTGs is O(n^(2n+2)), which is clearly intractable.

Inversion Transduction Grammars or ITGs (Wu, 1997) are transduction grammars that have a two-normal form, thus guaranteeing binarizability. Defining the rank of a rule as the number of nonterminals in the production, and the rank of a grammar as the highest-ranking rule in the rule set, ITGs are a) any SDTG of rank two, b) any SDTG of rank three, or c) any SDTG where no rule has a permutation vector other than the identity permutation or the inversion permutation. It follows from this definition that ITGs have a two-normal form.
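The effect of a permutation vector can be made concrete with a one-line helper (a hypothetical sketch, not from the paper): only the nonterminals move, and the vector gives the target-side order of the source-side sequence.

```python
def apply_permutation(nonterminals, perm):
    """Reorder the nonterminal sequence of an SDTG production according
    to its permutation vector, e.g. A -> B Xa/x B ; 1, 0, 2."""
    return [nonterminals[i] for i in perm]
```

With the vector (1, 0, 2), the target side of A → B Xa/x B′ places Xa/x first, then B, then B′.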

The two-normal form is usually expressed as SDTG rules with brackets around the production, to distinguish the different kinds of rules from each other:

  A → B C ; 0, 1  =  A → [ B C ]
  A → B C ; 1, 0  =  A → ⟨ B C ⟩
  A → e/f

By guaranteeing binarizability, biparsing time complexity becomes O(n⁶).

There is an even more restricted version of SDTGs called Simple Transduction Grammars (STGs), where no permutation at all is allowed, and which can also biparse a sentence pair in O(n⁶) time.

A Linear Transduction Grammar (LTG) is a bilingual version of a Linear Grammar (LG).

Definition 1. An LG in normal form is a tuple

  G_L = ⟨N, Σ, R, S⟩

where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols, R is a finite set of rules and S ∈ N is the designated start symbol. The rule set is constrained so that

  R ⊆ N × ((Σ ∪ {ǫ}) N (Σ ∪ {ǫ}) ∪ {ǫ})

where ǫ is the empty string.

To bilingualize a linear grammar, we will take the same approach as taken when a finite-state automaton is bilingualized into a finite-state transducer. That is: we replace all terminal symbols with biterminal symbols.

Definition 2. An LTG in normal form is a tuple

  G_LT = ⟨N, Σ, ∆, R, S⟩

where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols in language E, ∆ is a finite set of terminal symbols in language F, R is a finite set of linear transduction rules and S ∈ N is the designated start symbol. The rule set is constrained so that

  R ⊆ N × (Ψ N Ψ ∪ {⟨ǫ, ǫ⟩})

where Ψ = (Σ ∪ {ǫ}) × (∆ ∪ {ǫ}) and ǫ is the empty string.

Graphically, we will represent LTG rules as production rules with biterminals:

  ⟨A, ⟨x, p⟩ B ⟨y, q⟩⟩  =  A → x/p B y/q
  ⟨A, ⟨ǫ, ǫ⟩⟩           =  A → ǫ/ǫ

Like STGs, LTGs do not allow any reordering and are monotone; but because they are linear, this has no impact on expressiveness, as we shall see later.

Linear Inversion Transduction Grammars (LITGs) were introduced in Saers et al. (2010), and represent ITGs that are allowed to have at most one nonterminal symbol in each production. These are attractive because they can biparse a sentence pair in O(n⁴) time, which can be further reduced to linear time by severely pruning the search space. This makes them tractable for large parallel corpora, and a viable way to induce transduction grammars from large parallel corpora.

Definition 3. An LITG in normal form is a tuple

  G_LIT = ⟨N, Σ, ∆, R, S⟩

where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols from language E, ∆ is a finite set of terminal symbols from language F, R is a set of rules and S ∈ N is the designated start symbol. The rule set is constrained so that

  R ⊆ N × {[], ⟨⟩} × (Ψ N ∪ N Ψ ∪ {⟨ǫ, ǫ⟩})

where [] represents the identity permutation, ⟨⟩ represents the inversion permutation, Ψ = (Σ ∪ {ǫ}) × (∆ ∪ {ǫ}) is a possibly empty biterminal, and ǫ is the empty string.

Graphically, a rule will be represented as an ITG rule:

  ⟨A, [], B ⟨e, f⟩⟩  =  A → [ B e/f ]
  ⟨A, ⟨⟩, ⟨e, f⟩ B⟩  =  A → ⟨ e/f B ⟩
  ⟨A, [], ⟨ǫ, ǫ⟩⟩    =  A → ǫ/ǫ

As with ITGs, productions with only biterminals will be represented without their permutation, as any such rule can be trivially rewritten into inverted or identity form.
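To make the rule notation concrete, here is a toy LITG in normal form and a generator that expands it into a bisentence. The grammar, the tuple encoding, and the `generate` helper are all invented for this sketch; straight rules emit the biterminal on the same side in both languages, inverted rules on opposite sides.

```python
# A tiny LITG: each production is (perm, rhs), where rhs is either a
# single biterminal or a biterminal/nonterminal pair in source order.
RULES = {
    "S": ("[]", (("good", "bon"), "A")),    # S -> [ good/bon A ]
    "A": ("<>", ("B", ("very", "tres"))),   # A -> < B very/tres >
    "B": ("[]", (("wine", "vin"),)),        # B -> wine/vin
}

def generate(sym):
    """Expand a nonterminal into an (E, F) token-list pair."""
    perm, rhs = RULES[sym]
    if len(rhs) == 1:                       # biterminal-only production
        e, f = rhs[0]
        return [e], [f]
    left, right = rhs
    if isinstance(left, tuple):             # biterminal precedes B
        (e, f), sub, pre = left, right, True
    else:                                   # biterminal follows B
        sub, (e, f), pre = left, right, False
    se, sf = generate(sub)
    if perm == "[]":                        # straight: same side in E and F
        return ([e] + se, [f] + sf) if pre else (se + [e], sf + [f])
    # inverted: the F half goes to the opposite side of the subderivation
    return ([e] + se, sf + [f]) if pre else (se + [e], [f] + sf)
```

Running `generate("S")` shows the inverted A rule swapping tres to the other side of the F sentence relative to E.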

Definition 4. An ǫ-free LITG is an LITG where no rule may rewrite one nonterminal into another nonterminal only. Formally, the rule set is constrained so that

  R ∩ N × {[], ⟨⟩} × ({⟨ǫ, ǫ⟩ B} ∪ {B ⟨ǫ, ǫ⟩}) = ∅

The LITG presented in Saers et al. (2010) is thus an ǫ-free LITG in normal form, since it has the following thirteen rule forms (of which 8 are meaningful, 1 is only used to terminate generation, and 4 are redundant):

  A → [ e/f B ]        A → ⟨ e/f B ⟩
  A → [ B e/f ]        A → ⟨ B e/f ⟩
  A → [ e/ǫ B ]   |    A → ⟨ e/ǫ B ⟩
  A → [ B e/ǫ ]   |    A → ⟨ B e/ǫ ⟩
  A → [ ǫ/f B ]   |    A → ⟨ B ǫ/f ⟩
  A → [ B ǫ/f ]   |    A → ⟨ ǫ/f B ⟩
  A → ǫ/ǫ

All the singleton rules can be expressed either in straight or inverted form, but the result of applying the two rules is the same.

Lemma 1. Any LITG in normal form can be expressed as an LTG in normal form.

Proof. The above LITG can be rewritten in LTG form as follows:

  A → [ e/f B ]  =  A → e/f B
  A → ⟨ e/f B ⟩  =  A → e/ǫ B ǫ/f
  A → [ B e/f ]  =  A → B e/f
  A → ⟨ B e/f ⟩  =  A → ǫ/f B e/ǫ
  A → [ e/ǫ B ]  =  A → e/ǫ B
  A → [ B e/ǫ ]  =  A → B e/ǫ
  A → [ ǫ/f B ]  =  A → ǫ/f B
  A → [ B ǫ/f ]  =  A → B ǫ/f
  A → ǫ/ǫ        =  A → ǫ/ǫ

To account for all LITGs in normal form, the following two non-ǫ-free rules also need to be accounted for:

  A → [ B ]  =  A → B
  A → ⟨ B ⟩  =  A → B

Lemma 2. Any LTG in normal form can be expressed as an LITG in normal form.

Proof. An LTG in normal form has two rule forms, which can be rewritten in LITG form, either as straight or inverted rules, as follows:

  A → x/p B y/q  =  A → [ x/p B̄ ],  B̄ → [ B y/q ]
                 =  A → ⟨ x/q B̄ ⟩,  B̄ → ⟨ B y/p ⟩
  A → ǫ/ǫ        =  A → ǫ/ǫ

Theorem 1. LTGs in normal form and LITGs in normal form express the same class of transductions.

Proof. Follows from lemmas 1 and 2.

By theorem 1, everything concerning LTGs is also applicable to LITGs; an LTG can be expressed in LITG form when convenient, and vice versa.

4 Pruning the Alignment Space

The alignment space for a transduction grammar is the combination of the parse spaces of the sentence pair. Let e be the E sentence and f be the F sentence. The parse spaces would be O(|e|²) and O(|f|²) respectively, and the combination of these spaces would be O(|e|² × |f|²), or O(n⁴) if we assume n to be proportional to the sentence lengths. In the case of LTGs, this space is searched linearly, giving time complexity O(n⁴), and in the case of ITGs there is branching within both parse spaces, adding an order of magnitude each, giving a total time complexity of O(n⁶). There is, in other words, a tight connection between the alignment space and the time complexity of the biparsing algorithm. Furthermore, most of this alignment space is clearly useless. Consider the case where the entire F sentence is deleted, and the entire E sentence is simply inserted. Although it is possible that this is allowed by the grammar, it should have a negligible probability (since it is clearly a translation strategy that generalizes poorly), and could, for all practical reasons, be ignored.
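The table of rewritings in Lemma 1 can be sketched as a lookup that maps an LITG rule to the pair of biterminals flanking B in the equivalent LTG rule A → x/p B y/q. The encoding (`perm`, `side`, the empty string for ǫ) is a hypothetical one chosen for this sketch.

```python
EPS = ""  # stands in for the empty string ǫ

def litg_to_ltg(perm, biterminal, side):
    """Return the (left, right) biterminals flanking B in the LTG rule
    equivalent to an LITG rule. perm is "[]" (straight) or "<>"
    (inverted); side says whether the biterminal precedes ("pre") or
    follows ("post") the nonterminal B."""
    e, f = biterminal
    if perm == "[]":
        # straight rules keep the biterminal on its own side of B
        return ((e, f), (EPS, EPS)) if side == "pre" else ((EPS, EPS), (e, f))
    if side == "pre":
        # A -> < e/f B >  =  A -> e/eps B eps/f
        return ((e, EPS), (EPS, f))
    # A -> < B e/f >  =  A -> eps/f B e/eps
    return ((EPS, f), (e, EPS))
```

Inverted rules split the biterminal across B: the E half stays put while the F half moves to the opposite side, mirroring the proof of Lemma 1.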

  Language pair      Bisentences    Tokens
  Spanish–English    108,073        1,466,132
  French–English     95,990         1,340,718
  German–English     115,323        1,602,781

Table 1: Size of training data.

Saers et al. (2009) present a scheme for pruning away most of the points in the alignment space. Parse items are binned according to coverage (the total number of words covered), and each bin is restricted to carry a maximum of b items. Any items that do not fit in the bins are excluded from further analysis. To decide which items to keep, inside probability is used. This pruning scheme effectively linearizes the alignment space, as it will be of size O(nb), regardless of which type of grammar is used. An ITG can thus be biparsed in cubic time, and an LTG in linear time.

5 Seeding an ITG with an LTG

Since LTGs are a subclass of ITGs, it is possible to convert an LTG to an ITG. This could save a lot of time, since LTGs are much faster to induce from corpora than ITGs. Converting a BLTG to a BITG is fairly straightforward. Consider the BLTG rule

  X → [ e/f X ]

To convert it to BITG in two-normal form, the biterminal has to be factored out. Replacing the biterminal with a temporary symbol X̄, and introducing a rule that rewrites this temporary symbol to the replaced biterminal, produces two rules:

  X → [ X̄ X ]
  X̄ → e/f

This is no longer a bracketing grammar, since there are two nonterminals, but equating X̄ to X restores this property. An analogous procedure can be applied in the case where the nonterminal comes before the biterminal, as well as for the inverting cases.

When converting stochastic LTGs, the probability mass of each SLTG rule has to be distributed over the two SITG rules. The fact that the LTG rule X → ǫ/ǫ lacks a correspondence in ITGs has to be weighted in as well. In this paper we took the maximum-entropy approach and distributed the probability mass uniformly. This means defining the probability mass function p′ for the new SBITG from the probability mass function p of the original SBLTG such that:

  p′(X → [ X X ]) = Σ_{e/f} [ √( p(X → [ e/f X ]) / (1 − p(X → ǫ/ǫ)) )
                            + √( p(X → [ X e/f ]) / (1 − p(X → ǫ/ǫ)) ) ]

  p′(X → ⟨ X X ⟩) = Σ_{e/f} [ √( p(X → ⟨ e/f X ⟩) / (1 − p(X → ǫ/ǫ)) )
                            + √( p(X → ⟨ X e/f ⟩) / (1 − p(X → ǫ/ǫ)) ) ]

  p′(X → e/f) = √( p(X → [ e/f X ]) / (1 − p(X → ǫ/ǫ)) )
              + √( p(X → [ X e/f ]) / (1 − p(X → ǫ/ǫ)) )
              + √( p(X → ⟨ e/f X ⟩) / (1 − p(X → ǫ/ǫ)) )
              + √( p(X → ⟨ X e/f ⟩) / (1 − p(X → ǫ/ǫ)) )

6 Setup

The aim of this paper is to compare the alignments from SBITGs and SBLTGs to those from GIZA++, and to study the impact of pruning on efficiency and translation quality. Initial grammars will be estimated by counting cooccurrences in the training corpus, after which expectation-maximization (EM) will be used to refine the initial estimate. At the last iteration, the one-best parse of each sentence will be considered as the word alignment of that sentence. In order to keep the experiments comparable, relatively small corpora will be used. If larger corpora were used, it would not be possible to get any results for unpruned SBITGs because of the prohibitive time complexity. The Europarl corpus (Koehn, 2005) was used as a starting point, and then all sentence pairs where one of the sentences was longer than 10 tokens were filtered

Figure 1: Trade-offs between translation quality (as measured by BLEU) and biparsing time (in seconds, plotted on a logarithmic scale) for SBLTGs, SBITGs and the combination.

                              Beam size
  System       1        10       25       50       75       100      ∞
  BLEU
  SBITG        0.1234   0.2608   0.2655   0.2653   0.2661   0.2671   0.2663
  SBLTG        0.2574   0.2645   0.2631   0.2624   0.2625   0.2633   0.2628
  GIZA++       0.2597   0.2597   0.2597   0.2597   0.2597   0.2597   0.2597
  NIST
  SBITG        3.9705   6.6439   6.7312   6.7101   6.7329   6.7445   6.6793
  SBLTG        6.6023   6.6800   6.6657   6.6637   6.6714   6.6863   6.6765
  GIZA++       6.6464   6.6464   6.6464   6.6464   6.6464   6.6464   6.6464
  Training times
  SBITG        03:10    17:00    38:00    1:20:00  2:00:00  2:40:00  3:20:00
  SBLTG        35       1:49     3:40     7:33     9:44     12:13    11:59

Table 2: Results for the Spanish–English translation task.

out (see Table 1). The GIZA++ system was built according to the instructions for creating a baseline system for the Fifth Workshop on Statistical Machine Translation (WMT'10),¹ but the above corpora were used instead of those supplied by the workshop. This includes word alignment with GIZA++, a 5-gram language model built with SRILM (Stolcke, 2002), and parameter tuning with MERT (Och, 2003). To carry out the actual translations, Moses (Koehn et al., 2007) was used. The SBITG and SBLTG systems were built in exactly the same way, except that the alignments from GIZA++ were replaced by those from the respective grammars.

In addition to trying out exhaustive biparsing for SBITGs and SBLTGs on three different translation tasks, several different levels of pruning were tried (beam sizes 1, 10, 25, 50, 75 and 100). We also used the grammar induced from SBLTGs with a beam size of 25 to seed SBITGs (see section 5), which were then run for an additional iteration of EM, also with beam size 25. All systems are evaluated with BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).

7 Results

The results for the three different translation tasks are presented in Tables 2, 3 and 4. It is interesting to note that the trend they portray is quite similar. When the beam is very narrow, GIZA++ is better, but already at beam size 10, both transduction grammars are superior.

¹ http://www.statmt.org/wmt10/

15 Beam size System 1 10 25 50 75 100 ∞ BLEU SBITG 0.1268 0.2632 0.2654 0.2669 0.2668 0.2655 0.2663 SBLTG 0.2600 0.2638 0.2651 0.2668 0.2672 0.2662 0.2649 GIZA++ 0.2603 0.2603 0.2603 0.2603 0.2603 0.2603 0.2603 NIST SBITG 4.0849 6.7136 6.7913 6.8065 6.8068 6.8088 6.8151 SBLTG 6.6814 6.7608 6.7656 6.7992 6.8020 6.7925 6.7784 GIZA++ 6.6907 6.6907 6.6907 6.6907 6.6907 6.6907 6.6907 Training times SBITG 03:25 17:00 42:00 1:25:00 2:10:00 2:45:00 3:10:00 SBLTG 31 1:41 3:25 7:06 9:35 13:56 10:52

Table 3: Results for the French–English translation task.

Beam size System 1 10 25 50 75 100 ∞ BLEU SBITG 0.0926 0.2050 0.2091 0.2090 0.2091 0.2094 0.2113 SBLTG 0.2015 0.2067 0.2066 0.2073 0.2080 0.2066 0.2088 GIZA++ 0.2059 0.2059 0.2059 0.2059 0.2059 0.2059 0.2059 NIST SBITG 3.4297 5.8743 5.9292 5.8947 5.8955 5.9086 5.9380 SBLTG 5.7799 5.8819 5.8882 5.8963 5.9252 5.8757 5.9311 GIZA++ 5.8668 5.8668 5.8668 5.8668 5.8668 5.8668 5.8668 Training times SBITG 03:20 17:00 41:00 1:25:00 2:10:00 2:45:00 3:40:00 SBLTG 38 1:58 4:52 8:08 11:42 16:05 13:32

Table 4: Results for the German–English translation task.

sistent with Saers et al. (2009), SBITG has a sharp trate this trade-off. It is clear that SBLTGs are rise in quality going from beam size 1 to 10, indeed much faster than SBITGs, the only excep- and then a gentle slope up to beam size 25, af- tion is when SBITGs are run with b = 1, but then ter which it levels out. SBLTG, on the other hand the BLEU score is so low that is is not worth con- start out at a respectable level, and goes up a gen- sidering. tle slope from beam size 1 to 10, after which is The time may seem inconsistent between b = level out. This is an interesting observation, as it 100 and b = for SBLTG, but the extra time SBLTG ∞ suggests that reaches its optimum with a for the tighter beam is because of beam manage- lower beam size (although that optimum is lower ment, which the exhaustive search doesn’t bother than that of SBITG). The trade-off between qual- with. ity and time can now be extended beyond beam size to include grammar choice. In Figure 1, run In table 5 we compare the pure approaches times are plotted against BLEU scores to illus- to one where an LTG was trained during 10 it- erations of EM and then used to seed (see sec-

    Translation task     System   BLEU     NIST    Total time
    Spanish–English      SBLTG    0.2631   6.6657     36:40
                         SBITG    0.2655   6.7312   6:20:00
                         Both     0.2660   6.7124   1:14:40
    French–English       SBLTG    0.2651   6.7656     34:10
                         SBITG    0.2654   6.7913   7:00:00
                         Both     0.2625   6.7609   1:16:10
    German–English       SBLTG    0.2066   5.8882     48:52
                         SBITG    0.2091   5.9292   6:50:00
                         Both     0.2095   5.9224   1:29:40

Table 5: Results for seeding an SBITG with an SBLTG (Both) compared to the pure approach. Total time refers to 10 iterations of EM training for SBITG and SBLTG respectively, and 10 iterations of SBLTG and one iteration of SBITG training for the combined system.

tion 5) an SBITG, which was then trained for one iteration of EM. Although the differences are fairly small, German–English and Spanish–English seem to reach the level of SBITG, whereas French–English is actually hurt. The big difference is in time, since the combined system needs about a fifth of the time the SBITG-based system needs. This phenomenon needs to be more thoroughly examined.

It is also worth noting that GIZA++ was beaten by an aligner that used less than 20 minutes (less than 2 minutes per iteration and at most 10 iterations) to align the corpus.

8 Conclusions

In this paper we have introduced the bilingual version of linear grammar: Linear Transduction Grammars, and found that they generate the same class of transductions as Linear Inversion Transduction Grammars. We have also compared Stochastic Bracketing versions of ITGs and LTGs to GIZA++ on three word alignment tasks. The efficiency issues with transduction grammars have been addressed by pruning, and the conclusion is that there is a trade-off between run time and translation quality. A part of the trade-off is choosing which grammar framework to use, as LTGs are faster but not as good as ITGs. It also seems possible to take a short-cut in this trade-off by starting out with an LTG and converting it to an ITG. We have also shown that it is possible to beat the translation quality of GIZA++ with a quite fast transduction grammar.

Acknowledgments

This work was funded by the Swedish National Graduate School of Language Technology (GSLT), the Defense Advanced Research Projects Agency (DARPA) under GALE Contracts No. HR0011-06-C-0022 and No. HR0011-06-C-0023, and the Hong Kong Research Grants Council (RGC) under research grants GRF621008, DAG03/04.EG09, RGC6256/00E, and RGC6083/99E. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency. The computations were performed on UPPMAX resources under project p2007020.

References

Aho, Alfred V. and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Inc., Upper Saddle River, NJ.

Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Cocke, John. 1969. Programming languages and their compilers: Preliminary notes. Courant Institute of Mathematical Sciences, New York University.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology conference (HLT-2002), San Diego, California.

Earley, Jay. 1970. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery, 13(2):94–102.

Haghighi, Aria, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 923–931, Suntec, Singapore, August.

Huang, Liang, Hao Zhang, Daniel Gildea, and Kevin Knight. 2009. Binarization of synchronous context-free grammars. Computational Linguistics, 35(4):559–595.

Kasami, Tadao. 1965. An efficient recognition and syntax analysis algorithm for context-free languages. Technical Report AFCRL-65-00143, Air Force Cambridge Research Laboratory.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, Phuket, Thailand, September.

Lewis, Philip M. and Richard E. Stearns. 1968. Syntax-directed transduction. Journal of the Association for Computing Machinery, 15(3):465–488.

Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, July.

Saers, Markus and Dekai Wu. 2009. Improving phrase-based translation via word alignments from Stochastic Inversion Transduction Grammars. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009, pages 28–36, Boulder, Colorado, June.

Saers, Markus, Joakim Nivre, and Dekai Wu. 2009. Learning Stochastic Bracketing Inversion Transduction Grammars with a cubic time biparsing algorithm. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 29–32, Paris, France, October.

Saers, Markus, Joakim Nivre, and Dekai Wu. 2010. Word alignment with Stochastic Bracketing Linear Inversion Transduction Grammar. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, June.

Stolcke, Andreas. 2002. SRILM – an extensible language modeling toolkit. In International Conference on Spoken Language Processing, Denver, Colorado, September.

Vogel, Stephan, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, pages 836–841, Morristown, New Jersey.

Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Younger, Daniel H. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208.

Zhang, Hao, Chris Quirk, Robert C. Moore, and Daniel Gildea. 2008. Bayesian learning of non-compositional phrases with synchronous parsing. In Proceedings of ACL-08: HLT, pages 97–105, Columbus, Ohio, June.

Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT

Jie Jiang, Jinhua Du, Andy Way
CNGL, School of Computing, Dublin City University
{jjiang,jdu,away}@computing.dcu.ie

Abstract

Inspired by previous source-side syntactic reordering methods for SMT, this paper focuses on using automatically learned syntactic reordering patterns with functional words which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese–English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected testset and 10.45% relative on the NIST 2008 testset in terms of BLEU score. Furthermore, a system with 61.88% of the patterns filtered out by functional words obtains comparable performance with the unfiltered one on the randomly selected testset, and achieves a 1.74% relative improvement on the NIST 2008 testset.

1 Introduction

Previous work has shown that the problem of structural differences between language pairs in SMT can be alleviated by source-side syntactic reordering. Taking account of the integration with SMT systems, these methods can be divided into two different kinds of approaches (Elming, 2008): the deterministic reordering and the non-deterministic reordering approach.

To carry out the deterministic approach, syntactic reordering is performed uniformly on the training, devset and testset before being fed into the SMT systems, so that only the reordered source sentences are dealt with during the building of the SMT system. In this case, most work is focused on methods to extract and to apply syntactic reordering patterns, which come from manually created rules (Collins et al., 2005; Wang et al., 2007a) or via an automatic extraction process taking advantage of parse trees (Collins et al., 2005; Habash, 2007). Because reordered source sentences cannot be undone by the SMT decoders (Al-Onaizan et al., 2006), which implies a systematic error for this approach, classifiers (Chang et al., 2009b; Du & Way, 2010) are utilized to obtain high-performance reordering for some specialized syntactic structures (e.g. the DE construction in Chinese).

On the other hand, the non-deterministic approach leaves the decisions to the decoders to choose appropriate source-side reorderings. This is more flexible because both the original and reordered source sentences are presented in the inputs. Word lattices generated from syntactic structures for N-gram-based SMT are presented in (Crego et al., 2007). In (Zhang et al., 2007a; Zhang et al., 2007b), chunks and POS tags are used to extract reordering rules, while the generated word lattices are weighted by language models and reordering models. Rules created from a syntactic parser are also utilized to form weighted n-best lists which are fed into the decoder (Li et al., 2007). Furthermore, (Elming, 2008; Elm-

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 19–27, COLING 2010, Beijing, August 2010.

ing, 2009) uses syntactic rules to score the output word order, both on English–Danish and English–Arabic tasks. Syntactic reordering information is also considered as an extra feature to improve PBSMT in (Chang et al., 2009b) for the Chinese–English task. These results confirmed the effectiveness of syntactic reorderings.

However, for the particular case of Chinese source inputs, although the DE construction has been addressed for both PBSMT and HPBSMT systems in (Chang et al., 2009b; Du & Way, 2010), as indicated by (Wang et al., 2007a) there are still many unexamined structures that imply source-side reordering, especially in the non-deterministic approach. As specified in (Xue, 2005), these include the bei-construction, the ba-construction, three kinds of de-construction (including the DE construction) and general preposition constructions. Such structures are referred to via their functional words in this paper, and all of the constructions can be identified by their corresponding tags in the Penn Chinese TreeBank. It is interesting to investigate these functional words for the syntactic reordering task, since most of them tend to produce structural reordering between the source and target sentences.

Another related work is to filter the bilingual phrase pairs with closed-class words (Sánchez-Martínez, 2009). By taking account of the word alignments and word types, the filtering process reduces the phrase tables by up to a third, but still provides a system with competitive performance compared to the baseline. Similarly, our idea is to use a special type of words for the filtering purpose on the syntactic reordering patterns.

In this paper, our objective is to exploit these functional words for source-side syntactic reordering of Chinese–English SMT in the non-deterministic approach. Our assumption is that syntactic reordering patterns with functional words are the most effective ones, and others can be pruned for both speed and performance.

To validate this assumption, three systems are compared in this paper: a baseline PBSMT system, a syntactic reordering system with all patterns extracted from a corpus, and a syntactic reordering system with patterns filtered with functional words. To accomplish this, firstly the lattice scoring approach (Jiang et al., 2010) is utilized to discover non-monotonic phrase alignments, and then syntactic reordering patterns are extracted from source-side parse trees. After that, functional word tags specified in (Xue, 2005) are adopted to perform pattern filtering. Finally, both the unfiltered pattern set and the filtered one are used to transform inputs into word lattices presenting potential reorderings for improving the PBSMT system. A comparison between the three systems is carried out to examine the performance of syntactic reordering as well as the usefulness of functional words for pattern filtering.

The rest of this paper is organized as follows: in section 2 we describe the extraction process of syntactic reordering patterns, including the lattice scoring approach and the extraction procedures. Then section 3 presents the filtering process used to obtain patterns with functional words. After that, section 4 shows the generation of word lattices with patterns, and the experimental setup and results, including related discussion, are presented in section 5. Finally, we give our conclusion and avenues for future work in section 6.

2 Syntactic reordering patterns extraction

Instead of top-down approaches such as (Wang et al., 2007a; Chang et al., 2009a), we use a bottom-up approach similar to (Xia et al., 2004; Crego et al., 2007) to extract syntactic reordering patterns from non-monotonic phrase alignments and source-side parse trees. The following steps are carried out to extract syntactic reordering patterns: 1) the lattice scoring approach proposed in (Jiang et al., 2010) is used to obtain phrase alignments from the training corpus; 2) reordering regions from the non-monotonic phrase alignments are used to identify minimum treelets for pattern extraction; and 3) the treelets are transformed into syntactic reordering patterns, which are then weighted by their occurrences in the training corpus. Details of each of these steps are presented in the rest of this section.

2.1 Lattice scoring for phrase alignments

The lattice scoring approach is proposed in (Jiang et al., 2010) for the SMT data cleaning task.

To clean the training corpus, word alignments are used to obtain approximate decoding results, which are then used to calculate BLEU (Papineni et al., 2002) scores to filter out low-scoring sentence pairs. The following steps are taken in the lattice scoring approach: 1) train an initial PBSMT model; 2) collect anchor pairs containing source and target phrase positions from the word alignments generated in the training phase; 3) build source-side lattices from the anchor pairs and the translation model; 4) search on the source-side lattices to obtain approximate decoding results; 5) calculate BLEU scores for the purpose of data cleaning.

Note that the source-side lattices in step 3 come from anchor pairs, so each edge in the lattices contains both the source and target phrase positions. Thus the outputs of step 4 contain phrase alignments on the training corpus. These phrase alignments are used to identify non-monotonic areas for the extraction of reordering patterns.

2.2 Reordering patterns

Non-monotonic regions of the phrase alignments are examined as potential source-side reorderings. By taking a bottom-up approach, the reordering regions are identified and mapped to minimum treelets on the source parse trees. After that, syntactic reordering patterns are transformed from these minimum treelets.

In this paper, only reordering regions A and B indicating swapping operations on the source side are considered as potential source-side reorderings. Thus, given reordering regions AB, this implies (1):

    AB ⇒ BA    (1)

on the source-side word sequences. Referring to the phrase alignment extraction in the last section, each non-monotonic phrase alignment produces one reordering region. Furthermore, for each reordering region identified, all of its sub-areas indicating non-monotonic alignments are also attempted to produce more reordering regions.

To represent the reordering region using syntactic structure, given the extracted reordering regions AB, the following steps are taken to map them onto the source-side parse trees and to generate the corresponding patterns:

1. Generate a parse tree for each of the source sentences. The Berkeley parser (Petrov, 2006) is used in this paper. To obtain simpler tree structures, right-binarization is performed on the parse trees, while tags generated from binarization are not distinguished from the original ones (e.g. @VP and VP are the same).

2. Map reordering regions AB onto the parse trees. Denote N_A as the set of leaf nodes in region A and N_B for region B. The mapping is carried out on the parse tree to find a minimum treelet T which satisfies the following two criteria: 1) there must exist a path from each node in N_A ∪ N_B to the root node of T; 2) each leaf node of T can only be the ancestor of nodes in N_A or N_B (or none of them).

3. Traverse T in pre-order to obtain syntactic reordering pattern P. Label all the leaf nodes of T with A or B as reorder options, which indicate that the descendants of nodes with label A are supposed to be swapped with those with label B.

Instead of using subtrees, we use treelets to refer to the located parse tree substructures, since treelets do not necessarily go down to leaf nodes. Since phrase alignments cannot always be perfectly matched with parse trees, we also expand AB to the right and/or the left side with a limited number of words to find a minimum treelet. In this situation, a minimum number of ancestors of expanded tree nodes are kept in T, but they are assigned the same labels as those from which they have been expanded. In this case, the expanded tree nodes are considered as the context nodes of syntactic reordering patterns.

Figure 1 illustrates the extraction process. Note that the symbol @ indicates the right-binarization symbols (e.g. @VP in the figure). In the figure, tree T (surrounded by dashed lines) is the minimum treelet mapped from the reordering region AB. Leaf node NP is labeled by A, VP is labeled by B, and the context node P is also labeled by A.

Figure 1: Reordering pattern extraction

Leaf nodes labeled A or B are collected into node sequences L_A or L_B to indicate the reordering operations. Thus the syntactic reordering pattern P is obtained from T as in (2):

    P = {VP(PP(P NP) VP) | O = {L_A, L_B}}    (2)

where the first part of P is the VP with its tree structure, and the second part O indicates the reordering scheme, which implies that source words corresponding with descendants of L_A are supposed to be swapped with those of L_B.

2.3 Pattern weights estimation

We use p_reo to represent the chance of reordering when a treelet is located by a pattern on the parse tree. It is estimated by the number of reorderings over the occurrences of the pattern as in (3):

    p_reo(P) = count{reorderings of P} / count{observations of P}    (3)

By contrast, one syntactic pattern P usually contains several reordering schemes (specified in formula (2)), each of them weighted as in (4):

    w(O, P) = count{reorderings of O in P} / count{reorderings of P}    (4)

Generally, a syntactic reordering pattern is expressed as in (5):

    P = {tree | p_reo | O_1, w_1, ..., O_n, w_n}    (5)

where tree is the tree structure of the pattern, p_reo is the reordering probability, and O_i and w_i are the reordering schemes and weights (1 ≤ i ≤ n).

3 Patterns with functional words

Some of the patterns extracted may not benefit the final system, since the extraction process is controlled by phrase alignments rather than syntactic knowledge. Inspired by the study of DE constructions (Chang et al., 2009a; Du & Way, 2010), we assume that syntactic reorderings are indicated by functional words for the Chinese–English task. To incorporate the knowledge of functional words into the extracted patterns, instead of directly specifying the syntactic structure from the linguistic aspects, we use functional word tags to filter the extracted patterns. In this case, we assume that all patterns containing functional words tend to produce meaningful syntactic reorderings. Thus the filtered patterns carry the reordering information from the phrase alignments as well as the linguistic knowledge, so the noise produced in phrase alignments and the size of the pattern set can be reduced, and the speed and the performance of the system can be improved.

The functional word tags used in this paper are shown in Table 1; they come from (Xue, 2005). We choose them as functional words because they normally imply word reorderings between Chinese and English sentence pairs.

    Tag   Description
    BA    ba-construction
    DEC   de (1st kind) in a relative clause
    DEG   associative de (1st kind)
    DER   de (2nd kind) in V-de const. and V-de-R
    DEV   de (3rd kind) before VP
    LB    bei in long bei-construction
    P     preposition excluding bei and ba
    SB    bei in short bei-construction

Table 1: Syntactic reordering tags for functional words

Note that there are three kinds of de-constructions, but only the first kind is the DE construction in (Chang et al., 2009a; Du & Way, 2010). After the filtering process, both the unfiltered pattern set and the filtered one are used to build different syntactic reordering PBSMT systems for comparison purposes.
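As a concrete toy illustration of formulas (3)–(4) and the tag-based filtering of section 3, the sketch below estimates p_reo and the per-scheme weights from hypothetical counts, and keeps only patterns whose treelet contains a Table 1 tag. The counts and scheme names are invented; this is not the authors' implementation:

```python
# Illustrative sketch (hypothetical counts, not the paper's code):
# estimate p_reo (formula (3)) and per-scheme weights (formula (4)),
# then filter patterns by the functional-word tags of Table 1.
FUNCTIONAL_TAGS = {"BA", "DEC", "DEG", "DER", "DEV", "LB", "P", "SB"}

def weight_pattern(observations, reorder_counts):
    """observations: how often pattern P matched a parse tree;
    reorder_counts: {scheme O: times O actually reordered}."""
    reorderings = sum(reorder_counts.values())
    p_reo = reorderings / observations                 # formula (3)
    weights = {o: c / reorderings                      # formula (4)
               for o, c in reorder_counts.items()}
    return p_reo, weights

def keep_pattern(node_tags):
    """Section 3 filter: keep a pattern only if its treelet
    contains at least one functional-word tag."""
    return bool(FUNCTIONAL_TAGS & set(node_tags))

p_reo, weights = weight_pattern(100, {"O1": 30, "O2": 10})
print(p_reo, weights)                         # 0.4 {'O1': 0.75, 'O2': 0.25}
print(keep_pattern(["VP", "PP", "P", "NP"]))  # True: contains tag P
print(keep_pattern(["VP", "NP"]))             # False: pruned
```

A pattern record holding its tree, p_reo, and the scheme/weight pairs then corresponds to the representation in formula (5).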

4 Word lattice construction

Both the devset and testset are transformed into word lattices by the extracted patterns to incorporate potential reorderings. Figure 2 illustrates this process: treelet T′ is matched with a pattern, then its leaf nodes {a_1, ..., a_m} ∈ L_A (spanning {w_1, ..., w_p}) are swapped with leaf nodes {b_1, ..., b_n} ∈ L_B (spanning {v_1, ..., v_q}) on the generated paths in the word lattice.

Figure 2: Incorporating potential reorderings into lattices

We sort the matched patterns by p_reo in formula (5), and only apply a pre-defined number of reorderings for each sentence. For each lattice node, if we denote E_0 as the edge from the original sentence, while patterns {P_1, ..., P_i, ..., P_k} are applied to this node, then E_0 is weighted as in (6):

    w(E_0) = α + ((1 − α) / k) * Σ_{i=1..k} {1 − p_reo(P_i)}    (6)

where p_reo(P_i) is the pattern weight in formula (3), and α is the base probability that keeps E_0 from being equal to zero. Suppose {E_s, ..., E_{s+r−1}} are generated by the r reordering schemes of P_i; then E_j is weighted as in (7):

    w(E_j) = ((1 − α) / k) * p_reo(P_i) * w_{j−s+1}(P_i) / Σ_{t=1..r} w_t(P_i)    (7)

where w_t(P_i) is the reordering scheme weight in formula (5), and s ≤ j < s + r.

5 Experiments and results

We conducted our experiments on a medium-sized corpus, FBIS (a multilingual paragraph-aligned corpus with LDC resource number LDC2003E14), for the Chinese–English SMT task. The Champollion aligner (Ma, 2006) is utilized to perform sentence alignment. A total of 256,911 sentence pairs are obtained, of which 2,000 pairs for the devset and 2,000 pairs for the testset are randomly selected; we call these the FBIS set. The rest of the data is used as the training corpus.

The baseline system is Moses (Koehn et al., 2007), and GIZA++ is used to perform word alignment. Minimum error rate training (MERT) (Och, 2003) is carried out for tuning. A 5-gram language model built via SRILM is used for all the experiments in this paper.

Experimental results are reported on two different sets: the FBIS set and the NIST set. For the NIST set, the NIST 2005 testset (1,082 sentences) is used as the devset, and the NIST 2008 testset (1,357 sentences) is used as the testset. The FBIS set contains only one reference translation for both devset and testset, while the NIST set has four references.

5.1 Pattern extraction and filtering with functional words

The lattice scoring approach is carried out with the same baseline system as specified above to produce the phrase alignments. The initial PBSMT system in the lattice scoring approach is tuned with the FBIS devset to obtain the weights. As specified in section 2.1, phrase alignments are generated in step 4 of the lattice scoring approach.

From the generated phrase alignments and source-side parse trees of the training corpus, we obtain 48,285 syntactic reordering patterns (57,861 reordering schemes) with an average number of 11.02 non-terminals. For computational efficiency, any patterns with fewer than 3 or more than 9 non-terminals are pruned. This procedure leaves 18,169 syntactic reordering patterns (22,850 reordering schemes) with an aver-

23 age number of 7.6 non-terminals. This pattern set impact on our experiments. is used to built the syntactic reordering PBSMT system without pattern filtering, which here after 5.2 Word lattice construction we call the ‘unfiltered system’. As specified in section 4, for both unfiltered and Using the tags specified in Table 1, the ex- the filtered systems, both the devset and testset tracted syntactic reordering patterns without func- are converted into word lattices with the unfiltered tional words are filtered out, while only 6,926 syn- and filtered syntactic reordering patterns respec- tactic reordering patterns (with 9,572 reordering tively. To avoid a dramatic increase in size of the schemes) are retained. Thus the pattern set are lattices, the following constraints are applied: for reduced by 61.88%, and over half of them are each source sentence, the maximum number of re- pruned by the functional word tags. The filtered ordering schemes is 30, and the maximum span of pattern set is used to build the syntactic reorder- a pattern is 30. ing PBSMT system with pattern filtering, which For the lattice construction, the base probabil- we refer as the ‘filtered system’. ity in (6) and (7) is set to 0.05. The two syntac- tic reordering PBSMT systems also incorporate Type Tag Patterns Percent the built-in reordering models (distance-based and ba-const. BA 222 3.20% lexical reordering) of Moses, and their weights in LB 97 bei-const. 2.79% the log-linear model are tuned with respect to the SB 96 devsets. DEC 1662 de-const. (1st) 60.11% The effects of the pattern filtering by functional DEG 2501 words are also reported in Table 3. For both the nd de-const. (2 ) DER 52 0.75% FBIS and NIST sets, the average number of nodes de-const. (3rd) DEV 178 2.57% in word lattices are illustrated before and after pat- preposition P 2591 37.41% tern filtering. From the table, it is clear that the excl. 
ba & bei pattern filtering procedure dramatically reduces the input size for the PBSMT system. The reduc- Table 2: Statistics on the number of patterns for tion is up to 37.99% for the NIST testset. each type of functional word Data set Unfiltered Filtered Reduced Statistics on the patterns with respect to func- FBIS dev 183.13 131.38 28.26% tional word types are shown in Table 2. The num- FBIS test 183.68 136.56 25.65% ber of patterns for each functional word in the fil- NIST dev 175.78 115.89 34.07% tered pattern set are illustrated, and percentages of functional word types are also reported. Note that NIST test 149.13 92.48 37.99% some patterns contain more than one kind of func- Table 3: Comparison of the average number of tional word, so that the percentages of functional nodes in word lattices word types do not sum to one. As demonstrated in Table 2, the first kind of de- construction takes up 60.11% of the filtered pat- 5.3 Results on FBIS set tern set, and is the main type of patterns used in Three systems are compared on the FBIS set: our experiment. This indicates that more than half the baseline PBSMT system, and the syntactic of the patterns are closely related to the DE con- reordering systems with and without pattern fil- struction examined in (Chang et al., 2009b; Du tering. Since the built-in reordering models of & Way, 2010). However, the general preposi- Moses are enabled, several values of the distor- tion construction (excluding bei and ba) accounts tion limit (DL) parameter are chosen to validate for 37.41% of the filtered patterns, which implies consistency. The evaluation results on the FBIS that it is also a major source of syntactic reorder- set are shown in Table 4. ing. By contrast, other constructions have much As shown in Table 4, the syntactic reordering smaller amount of percentages, so have a minor systems with and without pattern filtering outper-

form the baseline system for each of the distortion limit parameters in terms of the BLEU, NIST and METEOR scores (scores in bold face). By contrast, the filtered system has comparable performance with the unfiltered system: for some of the distortion limits, the filtered system even outperforms the unfiltered system (scores in bold face, e.g. BLEU and NIST for DL=12, METEOR for DL=10).

System      DL  BLEU   NIST  METEOR
Baseline    0   22.32  6.45  52.51
            6   23.67  6.63  54.07
            10  24.52  6.66  54.04
            12  24.57  6.69  54.31
Unfiltered  0   23.92  6.60  54.30
            6   24.57  6.68  54.64
            10  24.98  6.71  54.67
            12  24.84  6.69  54.65
Filtered    0   23.71  6.60  54.11
            6   24.65  6.68  54.61
            10  24.87  6.71  54.84
            12  24.91  6.70  54.51

Table 4: Results on FBIS testset (DL = distortion limit)

The best performance of the baseline system is obtained with distortion limit 12 (underlined); the best performance of the unfiltered system is achieved with distortion limit 10 (underlined); while for the filtered system, the best BLEU score is achieved with distortion limit 12 (underlined), and the best NIST and METEOR scores with distortion limit 10 (underlined). Thus the unfiltered system outperforms the baseline by 0.41 (1.67% relative) BLEU points, 0.02 (0.30% relative) NIST points and 0.36 (0.66% relative) METEOR points. By contrast, the filtered system outperforms the baseline by 0.34 (1.38% relative) BLEU points, 0.02 (0.30% relative) NIST points and 0.53 (0.98% relative) METEOR points.

Compared with the unfiltered system, pattern filtering with functional words degrades performance by 0.07 (0.28% relative) in terms of BLEU, but improves the system by 0.17 (0.31% relative) in terms of METEOR, while the two systems achieve the same best NIST score.

These results indicate that the filtered system has comparable performance with the unfiltered one on the FBIS set, while both of them outperform the baseline.

5.4 Results on NIST set

The evaluation results on the NIST set are shown in Table 5.

System      DL  BLEU   NIST  METEOR
Baseline    0   14.43  5.75  45.03
            6   15.61  5.88  45.75
            10  15.73  5.78  45.27
            12  15.89  6.16  45.88
Unfiltered  0   16.77  6.54  47.16
            6   17.25  6.67  47.65
            10  17.15  6.64  47.78
            12  16.88  6.56  47.17
Filtered    0   16.79  6.64  47.67
            6   17.55  6.71  48.06
            10  17.51  6.72  48.15
            12  17.37  6.72  48.08

Table 5: Results on NIST testset (DL = distortion limit)

From Table 5, the unfiltered system outperforms the baseline system for each of the distortion limits in terms of the BLEU, NIST and METEOR scores (scores in bold face). By contrast, the filtered system also outperforms the unfiltered system for each of the distortion limits in terms of the three evaluation metrics (scores in bold face).

The best performance of the baseline system is obtained with distortion limit 12 (underlined), while the best performance of the unfiltered system is obtained with distortion limit 6 for BLEU and NIST, and 10 for METEOR (underlined). For the filtered system, the best BLEU score is obtained with distortion limit 6, and the best NIST and METEOR scores with distortion limit 10 (underlined). Thus the unfiltered system outperforms the baseline by 1.36 (8.56% relative) BLEU points, 0.51 (8.28% relative) NIST points and 1.90 (4.14% relative) METEOR points. By contrast, the filtered system outperforms the baseline by 1.66 (10.45% relative) BLEU points, 0.56 (9.52% relative) NIST points and 2.27 (4.95% relative) METEOR points.
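The absolute and relative differences quoted above can be recomputed directly from the best per-system scores in the tables. A small verification sketch (scores taken from Tables 3, 4 and 5; `rel_gain` is an illustrative helper, not code from the paper):

```python
# Recompute the relative improvements quoted in Sections 5.3-5.4
# from the raw table scores (best score per system, as underlined).

def rel_gain(new, old):
    """Absolute difference and relative gain in percent over `old`."""
    return round(new - old, 2), round((new - old) / old * 100, 2)

# FBIS testset (Table 4): best scores per system.
print(rel_gain(24.98, 24.57))  # unfiltered vs baseline BLEU   -> (0.41, 1.67)
print(rel_gain(54.67, 54.31))  # unfiltered vs baseline METEOR -> (0.36, 0.66)

# NIST testset (Table 5): best scores per system.
print(rel_gain(17.25, 15.89))  # unfiltered vs baseline BLEU   -> (1.36, 8.56)
print(rel_gain(17.55, 15.89))  # filtered vs baseline BLEU     -> (1.66, 10.45)
print(rel_gain(48.15, 45.88))  # filtered vs baseline METEOR   -> (2.27, 4.95)

# Lattice-size reduction (Table 3), e.g. for the NIST testset:
reduction = round((149.13 - 92.48) / 149.13 * 100, 2)
print(reduction)  # -> 37.99
```

The recomputed values match the BLEU and METEOR figures reported in the text.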

Compared with the unfiltered system, patterns with functional words boost the performance by 0.30 (1.74% relative) in terms of BLEU, 0.05 (0.75% relative) in terms of NIST, and 0.37 (0.77% relative) in terms of METEOR.

These results demonstrate that the pattern filtering improves the syntactic reordering system on the NIST set, while both systems significantly outperform the baseline.

5.5 Discussion

Experiments in the previous sections demonstrate that: 1) the two syntactic reordering systems improve the PBSMT system by providing potential reorderings obtained from phrase alignments and parse trees; 2) patterns with functional words play a major role in the syntactic reordering process, and filtering the patterns with functional words maintains or even improves system performance for the Chinese–English SMT task.

Furthermore, as shown in the previous section, pattern filtering prunes the whole pattern set by 61.88% and also reduces the sizes of the word lattices by up to 37.99%. Thus the whole syntactic reordering procedure for the original inputs, as well as the tuning/decoding steps, is sped up dramatically, which makes the proposed method more useful in the real world, especially for online SMT systems.

From the statistics on the filtered pattern set in Table 2, we also argue that the first kind of de-construction and the general preposition construction (excluding bei and ba) are the main sources of Chinese–English syntactic reordering. Previous work (Chang et al., 2009b; Du & Way, 2010) showed the advantages of dealing with the DE construction. In our experiments too, even though all the patterns are automatically extracted from phrase alignments, these two constructions still dominate the filtered pattern set. This result confirms the effectiveness of previous work on the DE construction, and also highlights the importance of the general preposition construction in this task.

6 Conclusion and future work

Syntactic reordering patterns with functional words are examined in this paper. The aim is to exploit these functional words within the syntactic reordering patterns extracted from phrase alignments and parse trees. Three systems are compared: a baseline PBSMT system, a syntactic reordering system with all patterns extracted from a corpus, and a syntactic reordering system with patterns filtered with functional words. Evaluation results on a medium-sized corpus showed that the two syntactic reordering systems consistently outperform the baseline system. The pattern filtering with functional words prunes 61.88% of the patterns, but still maintains comparable performance with the unfiltered system on the randomly selected testset, and even obtains a 1.74% relative improvement on the NIST 2008 testset.

In future work, the structures of patterns containing functional words will be investigated to obtain a fine-grained analysis of such words in this task. Furthermore, experiments on larger corpora as well as on other language pairs will also be carried out to validate our method.

Acknowledgements

This research is supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University. Thanks to Yanjun Ma for the sentence-aligned FBIS corpus.

References

Yaser Al-Onaizan and Kishore Papineni. 2006. Distortion models for statistical machine translation. Coling-ACL 2006: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 529-536, Sydney, Australia.

Pi-Chuan Chang, Dan Jurafsky, and Christopher D. Manning. 2009a. Disambiguating DE for Chinese–English machine translation. Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 215-223, Athens, Greece.

Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009b. Discriminative reordering with Chinese grammatical features. Proceedings of SSST-3: Third Workshop on Syntax and Structure in Statistical Translation, pages 51-59, Boulder, CO.

Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. ACL-2005: 43rd Annual Meeting of the

Association for Computational Linguistics, pages 531-540, University of Michigan, Ann Arbor, MI.

Josep M. Crego and José B. Mariño. 2007. Syntax-enhanced N-gram-based SMT. MT Summit XI, pages 111-118, Copenhagen, Denmark.

Jinhua Du and Andy Way. 2010. The Impact of Source-Side Syntactic Reordering on Hierarchical Phrase-based SMT. EAMT 2010: 14th Annual Conference of the European Association for Machine Translation, Saint-Raphaël, France.

Jakob Elming. 2008. Syntactic reordering integrated with phrase-based SMT. Coling 2008: 22nd International Conference on Computational Linguistics, pages 209-216, Manchester, UK.

Jakob Elming and Nizar Habash. 2009. Syntactic reordering for English-Arabic phrase-based machine translation. Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 69-77, Athens, Greece.

Nizar Habash. 2007. Syntactic preprocessing for statistical machine translation. MT Summit XI, pages 215-222, Copenhagen, Denmark.

Jie Jiang, Andy Way, and Julie Carson-Berndsen. 2010. Lattice Score-Based Data Cleaning for Phrase-Based Statistical Machine Translation. EAMT 2010: 14th Annual Conference of the European Association for Machine Translation, Saint-Raphaël, France.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007: Proceedings of demo and poster sessions, pages 177-180, Prague, Czech Republic.

Chi-Ho Li, Dongdong Zhang, Mu Li, Ming Zhou, Minghui Li, and Yi Guan. 2007. A probabilistic approach to syntax-based reordering for statistical machine translation. ACL 2007: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 720-727, Prague, Czech Republic.

Xiaoyi Ma. 2006. Champollion: A Robust Parallel Text Sentence Aligner. LREC 2006: Fifth International Conference on Language Resources and Evaluation, pages 489-492, Genova, Italy.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. ACL-2003: 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, Sapporo, Japan.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, PA.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. Coling-ACL 2006: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 433-440, Sydney, Australia.

Felipe Sánchez-Martínez and Andy Way. 2009. Marker-based filtering of bilingual phrase pairs for SMT. EAMT-2009: Proceedings of the 13th Annual Conference of the European Association for Machine Translation, pages 144-151, Barcelona, Spain.

Chao Wang, Michael Collins, and Philipp Koehn. 2007a. Chinese syntactic reordering for statistical machine translation. EMNLP-CoNLL-2007: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 737-745, Prague, Czech Republic.

Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. Coling 2004: 20th International Conference on Computational Linguistics, pages 508-514, University of Geneva, Switzerland.

Nianwen Xue, Fei Xia, Fu-dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2), pages 207-238.

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine translation. KI 2002: Advances in Artificial Intelligence, 25th Annual German Conference on AI, pages 18-32, Aachen, Germany.

Yuqi Zhang, Richard Zens, and Hermann Ney. 2007a. Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation. SSST, NAACL-HLT-2007 / AMTA Workshop on Syntax and Structure in Statistical Translation, pages 1-8, Rochester, NY.

Yuqi Zhang, Richard Zens, and Hermann Ney. 2007b. Improved chunk-level reordering for statistical machine translation. IWSLT 2007: International Workshop on Spoken Language Translation, pages 21-28, Trento, Italy.

Syntactic Constraints on Phrase Extraction for Phrase-Based Machine Translation

Hailong Cao, Andrew Finch and Eiichiro Sumita
Language Translation Group, MASTAR Project
National Institute of Information and Communications Technology
{hlcao,andrew.finch,eiichiro.sumita}@nict.go.jp

Abstract

A typical phrase-based machine translation (PBMT) system uses phrase pairs extracted from word-aligned parallel corpora. All phrase pairs that are consistent with the word alignments are collected. The resulting phrase table is very large and includes many non-syntactic phrases which may not be necessary. We propose to filter the phrase table based on source language syntactic constraints. Rather than filter out all non-syntactic phrases, we only apply syntactic constraints when there is phrase segmentation ambiguity arising from unaligned words. Our method is very simple and yields a 24.38% phrase pair reduction and a 0.52 BLEU point improvement when compared to a baseline PBMT system with full-size tables.

1 Introduction

Both PBMT models (Koehn et al., 2003; Chiang, 2005) and syntax-based machine translation models (Yamada et al., 2000; Quirk et al., 2005; Galley et al., 2006; Liu et al., 2006; Marcu et al., 2006; and numerous others) are state-of-the-art statistical machine translation (SMT) methods. Over the last several years, an increasing amount of work has been done to combine the advantages of the two approaches. DeNeefe et al. (2007) made a quantitative comparison of the phrase pairs that each model has to work with and found it useful to improve the phrasal coverage of their string-to-tree model. Liu et al. (2007) proposed forest-to-string rules to capture the non-syntactic phrases in their tree-to-string model. Zhang et al. (2008) proposed a tree sequence based tree-to-tree model which can describe non-syntactic phrases with syntactic structure information.

The converse of the above methods is to incorporate syntactic information into the PBMT model. Zollmann and Venugopal (2006) started with a complete set of phrases as extracted by traditional PBMT heuristics, and then annotated the target side of each phrasal entry with the label of the constituent node in the target-side parse tree that subsumes the span. Marton and Resnik (2008) and Cherry (2008) imposed syntactic constraints on the PBMT system by making use of prior linguistic knowledge in the form of syntax analysis. In their PBMT decoders, a candidate translation gets extra credit if it respects the source side syntactic parse tree, but may incur a cost if it violates a constituent boundary. Xiong et al. (2009) proposed a syntax-driven bracketing model to predict whether a phrase (a sequence of contiguous words) is bracketable or not using rich syntactic constraints.

In this paper, we try to utilize syntactic knowledge to constrain phrase extraction from word-based alignments for a PBMT system. Rather than filter out all non-syntactic phrases, we only apply syntactic constraints when there is phrase segmentation ambiguity arising from unaligned words. Our method is very simple and yields a 24.38% phrase pair reduction and a 0.52 BLEU point improvement when compared to the baseline PBMT system with full-size tables.

2 Extracting Phrase Pairs from Word-based Alignments

In this section, we briefly review a simple and effective phrase pair extraction algorithm upon which this work builds.
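The consistency-based extraction reviewed in this section can be sketched in a few lines. This is a generic illustration of the standard algorithm (the function name and structure are ours, not the authors' toolkit code); the example alignment is the one in Figure 1, which yields the nine pairs of Table 1:

```python
# Sketch of standard phrase pair extraction: collect every source/target
# span pair that is consistent with the word alignment (all links inside
# the source span land inside the target span and vice versa, and the
# pair contains at least one link; neither side may be empty).
# Generic illustration only, not the paper's actual toolkit code.

def extract_phrase_pairs(n_src, links, max_len=None):
    pairs = set()
    for i1 in range(n_src):
        for i2 in range(i1, n_src):
            if max_len and i2 - i1 + 1 > max_len:
                continue  # real systems cap the phrase length
            # Target positions linked to any source word in [i1, i2].
            tgt = [t for s, t in links if i1 <= s <= i2]
            if not tgt:
                continue  # no internal link: not extractable
            j1, j2 = min(tgt), max(tgt)
            # Consistency: no link from inside [j1, j2] to outside [i1, i2].
            if all(i1 <= s <= i2 for s, t in links if j1 <= t <= j2):
                pairs.add(((i1, i2), (j1, j2)))
    return pairs

# Alignment of Figure 1: f1-e1, f2-e2, f4-e3 (0-based indices).
links = [(0, 0), (1, 1), (3, 2)]
pairs = extract_phrase_pairs(4, links)
print(len(pairs))  # -> 9, matching Table 1
```

Note how the unaligned word f3 attaches to both its neighbours, producing both "f2 f3 ||| e2" and "f3 f4 ||| e3"; this is exactly the segmentation ambiguity the paper's constraint targets.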

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 28–33, COLING 2010, Beijing, August 2010.

The basic translation unit of a PBMT model is the phrase pair, which consists of a sequence of source words, a sequence of target words, and a vector of feature values which represents this pair's contribution to the translation model. In typical PBMT systems such as MOSES (Koehn, 2007), phrase pairs are extracted from word-aligned parallel corpora. Figure 1 shows the form of a training example.

f1 f2 f3 f4
|  |     |
e1 e2    e3

Figure 1: An example parallel sentence pair and word alignment

Since there is no phrase segmentation information in the word-aligned sentence pair, in practice all pairs of "source word sequence ||| target word sequence" that are consistent with the word alignments are collected. The words in a legal phrase pair are only aligned to each other, and not to words outside (Och et al., 1999). For example, given the sentence pair and word alignments shown in Figure 1, the following nine phrase pairs will be extracted:

Source phrase ||| Target phrase
f1 ||| e1
f2 ||| e2
f4 ||| e3
f1 f2 ||| e1 e2
f2 f3 ||| e2
f3 f4 ||| e3
f1 f2 f3 ||| e1 e2
f2 f3 f4 ||| e2 e3
f1 f2 f3 f4 ||| e1 e2 e3

Table 1: Phrase pairs extracted from the example in Figure 1

Note that neither the source phrase nor the target phrase can be empty, so "f3 ||| EMPTY" is not a legal phrase pair.

Phrase pairs are extracted over the entire training corpus. Given all the collected phrase pairs, we can estimate the phrase translation probability distribution by relative frequency. The collected phrase pairs will also be used to build the lexicalized reordering model. For more details of the lexicalized reordering model, please refer to Tillmann and Zhang (2005) and section 2.7.2 of the MOSES manual (http://www.statmt.org/moses/).

The main problem with such a phrase pair extraction procedure is that the resulting phrase translation table is very large, especially when a large quantity of parallel data is available. This is not desirable in real applications where speed and memory consumption are often critical concerns. In addition, some phrase translation pairs are generated from training data errors and word alignment noise. Therefore, we need to filter the phrase table in an appropriate way for both efficiency and translation quality (Johnson et al., 2007; Yang and Zheng, 2009).

3 Syntactic Constraints on Phrase Pair Extraction

We can divide all the possible phrases into two types: syntactic phrases and non-syntactic phrases. A "syntactic phrase" is defined as a word sequence that is covered by a single subtree in a syntactic parse tree (Imamura, 2002). Intuitively, we would expect syntactic phrases to be much more reliable and non-syntactic phrases to be useless. However, Koehn et al. (2003) showed that restricting phrasal translation to only syntactic phrases yields poor translation performance: the ability to translate non-syntactic phrases (such as "there are", "note that", and "according to") turns out to be critical and pervasive.

Koehn et al. (2003) use syntactic constraints from both the source and target languages, and over 80% of all phrase pairs are eliminated. In this section, we try to use syntactic knowledge in a less restrictive way. Firstly, instead of using syntactic restrictions on both source phrases and target phrases, we only apply the syntactic restriction to the source language side. Secondly, we only apply the syntactic restriction to source phrases whose first or last word is unaligned.

For example, given the parse tree illustrated in Figure 2, we will filter out the phrase pair "f2 f3 ||| e2" since the source phrase "f2 f3" is a non-syntactic phrase and its last word "f3" is not

aligned to any target word. The phrase pair "f1 f2 f3 ||| e1 e2" will also be eliminated for the same reason. But we do keep phrase pairs such as "f1 f2 ||| e1 e2" even if its source phrase "f1 f2" is a non-syntactic phrase. Also, we keep "f3 f4 ||| e3" since "f3 f4" is a syntactic phrase. Table 2 shows the complete set of phrase pairs that are extracted with our constraint-based method.

Source phrase ||| Target phrase
f1 ||| e1
f2 ||| e2
f4 ||| e3
f1 f2 ||| e1 e2
f3 f4 ||| e3
f2 f3 f4 ||| e2 e3
f1 f2 f3 f4 ||| e1 e2 e3

Table 2: Phrase pairs extracted from the example in Figure 2

[Figure 2: a parse tree with nodes N1, N2 and N3 over the source words f1 f2 f3 f4, with the word-based alignments f1-e1, f2-e2 and f4-e3]

Figure 2: An example parse tree and word-based alignments

State-of-the-art alignment tools such as GIZA++ (http://fjoch.com/GIZA++.html) cannot always find alignments for every word in the sentence pair. The possible reasons could be: the frequency of a word is too low; noisy data; or auxiliary words and function words which have no obvious correspondence in the opposite language.

In the automatically aligned parallel corpus, unaligned words are frequent enough to be noticeable (see section 4.1 of this paper). How to decide the translation of an unaligned word is left to the phrase extraction algorithm. An unaligned source word should be translated together with either the words on its right or the words on its left. The existing algorithm considers both directions, so both "f2 f3 ||| e2" and "f3 f4 ||| e3" are extracted. However, it is unlikely that "f3" can be translated into both "e2" and "e3". So our algorithm uses prior syntactic knowledge to keep "f3 f4 ||| e3" and exclude "f2 f3 ||| e2".

4 Experiments

Our SMT system is based on a fairly typical phrase-based model (Finch and Sumita, 2008). For the training of our SMT model, we use a modified training toolkit adapted from the MOSES decoder. Our decoder can operate on the same principles as the MOSES decoder. Minimum error rate training (MERT) with respect to the BLEU score is used to tune the decoder's parameters, and it is performed using the standard technique of Och (2003). A lexicalized reordering model was built by using the "msd-bidirectional-fe" configuration in our experiments.

The translation model was created from the FBIS parallel corpus. We used a 5-gram language model trained with modified Kneser-Ney smoothing. The language model was trained on the target side of the FBIS corpus and the Xinhua news in the GIGAWORD corpus. The development and test sets are from the NIST MT08 evaluation campaign. Table 3 shows the statistics of the corpora used in our experiments.

Data             Sentences   Chinese words  English words
Training set     221,994     6,251,554      8,065,629
Development set  1,664       38,779         46,387
Test set         1,357       32,377         42,444
GIGAWORD         19,049,757  -              306,221,306

Table 3: Corpora statistics

The Chinese sentences are segmented, POS tagged and parsed by the tools described in Kruengkrai et al. (2009) and Cao et al. (2007), both of which are trained on the Penn Chinese Treebank 6.0.
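The selective constraint of Section 3 can be sketched as an extra filter on top of consistency-based extraction. The parse spans below are a hypothetical stand-in for Figure 2 (chosen so that f3 f4 is a constituent while f1 f2, f2 f3 and f1 f2 f3 are not, matching the discussion above); the helper functions are ours, not the paper's code:

```python
# Sketch of the selective filter: a candidate phrase pair is discarded
# only when (a) its first or last source word is unaligned AND (b) its
# source span is not covered by a single subtree of the parse.
# Generic illustration; the parse spans below stand in for Figure 2.

def extract_phrase_pairs(n_src, links):
    """Standard consistency-based extraction (see Section 2)."""
    pairs = set()
    for i1 in range(n_src):
        for i2 in range(i1, n_src):
            tgt = [t for s, t in links if i1 <= s <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            if all(i1 <= s <= i2 for s, t in links if j1 <= t <= j2):
                pairs.add(((i1, i2), (j1, j2)))
    return pairs

def selective_filter(pairs, links, syntactic_spans):
    aligned = {s for s, _ in links}
    kept = set()
    for (i1, i2), tgt_span in pairs:
        boundary_unaligned = i1 not in aligned or i2 not in aligned
        if boundary_unaligned and (i1, i2) not in syntactic_spans:
            continue  # ambiguous segmentation and non-syntactic: drop
        kept.add(((i1, i2), tgt_span))
    return kept

links = [(0, 0), (1, 1), (3, 2)]   # f1-e1, f2-e2, f4-e3; f3 is unaligned
# Hypothetical constituent spans: single words, f3 f4, and the whole
# sentence (consistent with the discussion of Figure 2).
syntactic_spans = {(0, 0), (1, 1), (2, 2), (3, 3), (2, 3), (0, 3)}

pairs = extract_phrase_pairs(4, links)
kept = selective_filter(pairs, links, syntactic_spans)
print(len(pairs), len(kept))  # 9 -> 7
```

The filter removes exactly "f2 f3 ||| e2" and "f1 f2 f3 ||| e1 e2", leaving the seven pairs of Table 2; "f3 f4 ||| e3" survives because its source span is a constituent, and "f1 f2 ||| e1 e2" survives because both boundary words are aligned.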

4.1 Experiments on Word Alignments

We use GIZA++ to align the sentences in both the Chinese-English and English-Chinese directions. Then we combine the alignments using the standard "grow-diag-final-and" procedure provided with MOSES.

In the combined word alignments, 614,369 or 9.82% of the Chinese words are unaligned. Table 4 shows the top 10 most frequently unaligned words. Basically, these words are auxiliary words or function words whose usage is very flexible, so it would be difficult to align them automatically to the target words.

Unaligned word  Frequency
的              77776
,              29051
在              9414
一              8768
中              8543
个              7471
是              7365
上              6155
了              5945
不              5450

Table 4: Frequently unaligned words from the training corpus

4.2 Experiments on Chinese-English SMT

In order to confirm that it is advantageous to apply appropriate syntactic constraints on phrase extraction, we performed three translation experiments using different ways of phrase extraction.

In the first experiment, we used the method introduced in Section 2 to extract all possible phrase translation pairs without using any constraints arising from knowledge of syntax. The second experiment used source language syntactic constraints to filter out all non-syntactic phrases during phrase pair extraction. The third experiment used source language syntactic constraints to filter out only non-syntactic phrases whose first or last source word was unaligned. With the exception of the above differences in phrase translation pair extraction, all the other settings were identical in the three experiments. Table 5 summarizes the SMT performance. The evaluation metric is case-sensitive BLEU-4 (Papineni et al., 2002), which estimates the accuracy of translation output with respect to a set of reference translations.

Syntactic Constraints  Number of distinct phrase pairs  BLEU
None                   14,195,686                       17.26
Full constraint        4,855,108                        16.51
Selective constraint   10,733,731                       17.78

Table 5: Comparison of different constraints on phrase pair extraction by translation quality

As shown in the table, it is harmful to fully apply syntactic constraints on phrase extraction, even just on the source language side. This is consistent with the observation of Koehn et al. (2003), who applied both source and target constraints in German to English translation experiments.

Clearly, we obtained the best performance when we use source language syntactic constraints only on phrases whose first or last source word is unaligned. In addition, we reduced the number of distinct phrase pairs by 24.38% over the baseline full-size phrase table.

The results in Table 5 show that while some non-syntactic phrases are very important to maintain the performance of a PBMT system, not all of them are necessary. We can achieve better performance and a smaller phrase table by applying syntactic constraints when there is phrase segmentation ambiguity arising from unaligned words.

5 Related Work

To some extent, our idea is similar to Ma et al. (2008), who used an anchor word alignment model to find a set of high-precision anchor links and then aligned the remaining words relying on dependency information invoked by the acquired anchor links. The similarity is that both Ma et al. (2008) and this work utilize structure information to find appropriate translations for words which are difficult to align. The differ-

ence is that they used dependency information in the word alignment stage while our method uses syntactic information during the phrase pair extraction stage. There are also many works which leverage syntax information to improve word alignments (e.g., Cherry and Lin, 2006; DeNero and Klein, 2007; Fossum et al., 2008; Hermjakob, 2009).

Johnson et al. (2007) presented a technique for pruning the phrase table in a PBMT system using Fisher's exact test. They compute the significance value of each phrase pair and prune the table by deleting phrase pairs with significance values smaller than a certain threshold. Yang and Zheng (2009) extended the work of Johnson et al. (2007) to a hierarchical PBMT model, which is built on synchronous context free grammars (SCFG). Tomeh et al. (2009) described an approach for filtering phrase tables in a statistical machine translation system which relies on a statistical independence measure called Noise, first introduced in Moore (2004). The difference between the above research and this work is that they took advantage of statistical measures while we use syntactic knowledge to filter phrase tables.

6 Conclusion and Future Work

Phrase pair extraction plays a very important role in the performance of PBMT systems. We utilize syntactic knowledge to constrain phrase extraction from word-based alignments for a PBMT system. Rather than filter out all non-syntactic phrases, we only filter out non-syntactic phrases whose first or last source word is unaligned. Our method is very simple and yields a 24.38% phrase pair reduction and a 0.52 BLEU point improvement when compared to the baseline PBMT system with full-size tables.

In future work, we will use other language pairs to test our phrase extraction method so that we can discover whether or not it is language independent.

References

Hailong Cao, Yujie Zhang and Hitoshi Isahara. 2007. Empirical study on parsing Chinese based on Collins' model. In PACLING.

Colin Cherry and Dekang Lin. 2006. Soft syntactic constraints for word alignment through discriminative training. In ACL.

Colin Cherry. 2008. Cohesive phrase-based decoding for statistical machine translation. In ACL-HLT.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In ACL.

Steve DeNeefe, Kevin Knight, Wei Wang and Daniel Marcu. 2007. What can syntax-based MT learn from phrase-based MT? In EMNLP-CoNLL.

John DeNero and Dan Klein. 2007. Tailoring word alignments to syntactic machine translation. In ACL.

Andrew Finch and Eiichiro Sumita. 2008. Dynamic model interpolation for statistical machine translation. In SMT Workshop.

Victoria Fossum, Kevin Knight and Steven Abney. 2008. Using syntax to improve word alignment precision for syntax-based machine translation. In SMT Workshop, ACL.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In ACL.

Ulf Hermjakob. 2009. Improved word alignment with statistics and linguistic heuristics. In EMNLP.

Kenji Imamura. 2002. Application of translation knowledge acquired by hierarchical phrase alignment for pattern-based MT. In TMI.

Howard Johnson, Joel Martin, George Foster and Roland Kuhn. 2007. Improving translation quality by discarding most of the phrase table. In EMNLP-CoNLL.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In HLT-NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL demo and poster sessions.

Robert C. Moore. 2004. On log-likelihood-ratios and the significance of rare events. In EMNLP.

Franz Josef Och, Christoph Tillmann and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In EMNLP-VLC.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL.

Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In ACL-IJCNLP.

Yang Liu, Qun Liu and Shouxun Lin. 2006. Tree-to-string alignment template for statistical machine translation. In ACL-COLING.

Yang Liu, Yun Huang, Qun Liu and Shouxun Lin. 2007. Forest-to-string statistical translation rules. In ACL.

Yanjun Ma, Sylwia Ozdowska, Yanli Sun and Andy Way. 2008. Improving word alignment using syntactic dependencies. In SSST.

Daniel Marcu, Wei Wang, Abdessamad Echihabi and Kevin Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. In EMNLP.

Yuval Marton and Philip Resnik. 2008. Soft syntactic constraints for hierarchical phrase-based translation. In ACL-HLT.

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Chris Quirk, Arul Menezes and Colin Cherry. 2005. Dependency treelet translation: Syntactically informed phrasal SMT. In ACL.

Christoph Tillmann and Tong Zhang. 2005. A localized prediction model for statistical machine translation. In ACL.

Nadi Tomeh, Nicola Cancedda and Marc Dymetman. 2009. Complexity-based phrase-table filtering for statistical machine translation. In MT Summit.

Deyi Xiong, Min Zhang, Aiti Aw and Haizhou Li. 2009. A syntax-driven bracketing model for phrase-based translation. In ACL-IJCNLP.

Kenji Yamada and Kevin Knight. 2000. A syntax-based statistical translation model. In ACL.

Mei Yang and Jing Zheng. 2009. Toward smaller, faster, and better hierarchical phrase-based SMT. In ACL.

Min Zhang, Hongfei Jiang, Aiti Aw, Chew Lim Tan and Sheng Li. 2008. A tree sequence alignment-based tree-to-tree translation model. In ACL-HLT.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In SMT Workshop, HLT-NAACL.

Phrase Based Decoding using a Discriminative Model

Prasanth Kolachina, LTRC, IIIT-Hyderabad, {prasanth k}@research.iiit.ac.in
Sriram Venkatapathy, LTRC, IIIT-Hyderabad, {sriram}@research.iiit.ac.in

Srinivas Bangalore, AT&T Labs-Research, NY, {srini}@research.att.com
Sudheer Kolachina, LTRC, IIIT-Hyderabad, {sudheer.kpg08}@research.iiit.ac.in

Avinesh PVS, LTRC, IIIT-Hyderabad, {avinesh}@research.iiit.ac.in

Abstract

In this paper, we present an approach to statistical machine translation that combines the power of a discriminative model (for training a model for Machine Translation) and the standard beam-search based decoding technique (for the translation of an input sentence). A discriminative approach to learning lexical selection and reordering utilizes a large set of feature functions (thereby providing the power to incorporate greater contextual and linguistic information), which leads to effective training of these models. This model is then used by the standard state-of-the-art Moses decoder (Koehn et al., 2007) for the translation of an input sentence.

We conducted our experiments on the Spanish-English language pair, using a maximum entropy model. We show that the performance of our approach (using simple lexical features) is comparable to that of the state-of-the-art statistical MT system (Koehn et al., 2007). When additional syntactic features (POS tags in this paper) are used, there is a boost in performance, which is likely to improve further when richer syntactic features are incorporated in the model.

1 Introduction

The popular approaches to machine translation use the generative IBM models for training (Brown et al., 1993; Och et al., 1999). The parameters of these models are learnt using the standard EM algorithm. The parameters used in these models are extremely restrictive: a simple, small and closed set of feature functions is used to represent the translation process. Moreover, these feature functions are local and word-based. In spite of these limitations, the models perform very well for the task of word alignment because of the restricted search space. However, they perform poorly during decoding (translation) because of their limitations in the context of a much larger search space.

To handle contextual information, phrase-based models were introduced (Koehn et al., 2003). The phrase-based models use the word-alignment information from the IBM models and train source-target phrase pairs for lexical selection (the phrase table) and distortions of source phrases (the reordering table). These models are still relatively local, as the target phrases are tightly associated with their corresponding source phrases.

In contrast to a phrase-based model, a discriminative model has the power to integrate much richer contextual information into the training model. Contextual information is extremely useful in making lexical selections of higher quality, as illustrated by the models for Global Lexical Selection (Bangalore et al., 2007; Venkatapathy and

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 34–42, COLING 2010, Beijing, August 2010.

Bangalore, 2009). However, the limitation of global lexical selection models has been sentence construction. In global lexical selection models, lattice construction and scoring (LCS) is used for sentence construction (Bangalore et al., 2007; Venkatapathy and Bangalore, 2009). In our work, we address this limitation of global lexical selection models by using an existing state-of-the-art decoder (Koehn et al., 2007) for sentence construction. The translation model used by this decoder is derived from a discriminative model, instead of the usual phrase-table and reordering-table construction algorithms. This allows us to exploit the effectiveness of an existing phrase-based decoder while retaining the advantages of the discriminative model. In this paper, we compare the sentence construction accuracies of the lattice construction and scoring approach (see section 4.1 for LCS decoding) and the phrase-based decoding approach (see section 4.2).

Another advantage of using a discriminative approach to construct the phrase table and the reordering table is the flexibility it provides to incorporate linguistic knowledge in the form of additional feature functions. In the past, factored phrase-based approaches to machine translation have allowed the use of linguistic feature functions, but they are still bound by the locality of context and the definition of a fixed structure of dependencies between the factors (Koehn and Hoang, 2007). Furthermore, factored phrase-based approaches place constraints both on the type and the number of factors that can be incorporated into the training. In this paper, though we do not test this aspect extensively, we show that using syntactic feature functions does improve the performance of our approach, which is likely to improve further when much richer syntactic feature functions (such as information about the parse structure) are incorporated in the model.

As the training model in a standard phrase-based system is relatively impoverished with respect to contextual/linguistic information, integration of the discriminative model, in the form of a phrase table and reordering table, with the phrase-based decoder is highly desirable. We propose to do this by defining sentence-specific tables. For example, given a source sentence s, the phrase table contains all the possible phrase pairs conditioned on the context of the source sentence s.

In this paper, the key contributions are:

1. We combine a discriminative training model with a phrase-based decoder. We obtained comparable results with the state-of-the-art phrase-based decoder.

2. We evaluate the performance of the lattice construction and scoring (LCS) approach to decoding. We observed that even though the lexical accuracy obtained using LCS is high, the performance in terms of sentence construction is low when compared to the phrase-based decoder.

3. We show that the incorporation of syntactic information (POS tags) in our discriminative model boosts the performance of translation.

In future, we plan to use richer syntactic feature functions (which the discriminative approach allows us to incorporate) to evaluate the approach.

The paper is organized in the following sections. Section 2 presents the related work. In section 3, we describe the training of our model. In section 4, we present the decoding approaches (both LCS and the phrase-based decoder). We describe the data used in our experiments in section 5. Section 6 consists of the experiments and results. Finally, we conclude the paper in section 7.

2 Related Work

In this section, we present approaches that are directly related to ours. The Direct Translation Model (DTM) proposed for statistical machine translation by Papineni et al. (1998) and Och and Ney (2002) presents a discriminative set-up for natural language understanding (and MT). They use a slightly modified equation (in comparison to the IBM models), shown in equation 1: they consider the translation model from f → e (p(e|f)) instead of the theoretically sound (after the application of Bayes' rule) e → f (p(f|e)), and use grammatical features such as the presence of an equal number of

verb forms, etc.

    ê = argmax_e p_TM(e|f) · p_LM(e)    (1)

In their model, they use generic feature functions such as a language model, and cooccurrence features such as the presence of a lexical relationship in the lexicon. Their search algorithm limited the use of complex features.

Direct Translation Model 2 (DTM2) (Ittycheriah and Roukos, 2007) expresses the phrase-based translation task in a unified log-linear probabilistic framework consisting of three components:

1. a prior conditional distribution P_0;
2. a number of feature functions Φ_i() that capture the effects of the translation and language models;
3. the weights of the features λ_i, estimated using MaxEnt training (Berger et al., 1996), as shown in equation 2:

    Pr(e|f) = (P_0(e, j|f) / Z) exp(Σ_i λ_i Φ_i(e, j, f))    (2)

In the above equation, j is the skip reordering factor for the phrase pair captured by Φ_i(), and represents the jump from the previous source word. Z represents the per-source-sentence normalization term (Hassan et al., 2009). While a uniform prior on the set of features results in a maximum entropy model, choosing other priors yields minimum divergence models. The normalized phrase count has been used as the prior P_0 in the DTM2 model. The following decision rule is used to obtain the optimal translation:

    ê = argmax_e Pr(e|f) = argmax_e Σ_{m=1..M} λ_m Φ_m(f, e)    (3)

The DTM2 model differs from other phrase-based SMT models in that it avoids the redundancy present in other systems by extracting from a word-aligned parallel corpus a set of minimal phrases such that no two phrases overlap with each other (Hassan et al., 2009).

The decoding strategy in DTM2 (Ittycheriah and Roukos, 2007) is similar to a phrase-based decoder, except that the score of a particular translation block is obtained from the maximum entropy model using the set of feature functions. In our approach, instead of providing the complete scoring function ourselves, we compute the parameters needed by a phrase-based decoder, which in turn uses these parameters appropriately. In comparison with DTM2, we also use minimal non-overlapping blocks as the entries in the phrase table that we generate.

Xiong et al. (2006) present a phrase reordering model under the ITG constraint using a maximum entropy framework. They model the reordering problem as a two-class classification problem, the classes being straight and inverted. The model is used to merge the phrases obtained from translating the segments in a source sentence. The decoder used is a hierarchical decoder motivated by the CYK parsing algorithm, employing beam search. The maximum entropy model is presented with features extracted from the blocks being merged, and probabilities are estimated using the log-linear equation shown in (4):

    p_λ(y|x) = (1 / Z_λ(x)) exp(Σ_i λ_i Φ_i(x, y))    (4)

In addition to lexical and collocational features, the work uses an additional metric called the information gain ratio (IGR) as a feature. The authors report an improvement of 4% BLEU score over the traditional distance-based distortion model upon using the lexical features alone.

3 Training

The training process of our approach has two steps:

1. training the discriminative models for translation and reordering;
2. integrating the models into a phrase-based decoder.
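The log-linear form shared by equations (2) and (4) can be sketched directly; the toy feature functions and weights below are hypothetical illustrations, not those of any system discussed here:

```python
import math

def loglinear_prob(x, y, candidates, features, weights):
    """Log-linear (maximum entropy) model: p(y|x) is proportional to
    exp(sum_i lambda_i * Phi_i(x, y)), normalized over all candidates."""
    def unnorm(y_):
        return math.exp(sum(lam * phi(x, y_) for lam, phi in zip(weights, features)))
    z = sum(unnorm(c) for c in candidates)  # per-input normalization term Z(x)
    return unnorm(y) / z

# Two hypothetical indicator features over (source word, target word) pairs.
features = [
    lambda x, y: 1.0 if len(y) == len(x) else 0.0,     # Phi_1: same length
    lambda x, y: 1.0 if y.startswith(x[0]) else 0.0,   # Phi_2: same first letter
]
weights = [2.0, 1.0]  # lambda_1, lambda_2 (in practice estimated by MaxEnt training)
candidates = ["casa", "house", "home"]
p = loglinear_prob("casa", "casa", candidates, features, weights)
```

Because the normalizer Z(x) sums the exponentiated scores over all candidate outputs, the resulting values form a proper probability distribution regardless of the feature weights.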

The input to our training step is the word alignments between the source and target sentences, obtained using GIZA++ (an implementation of the IBM and HMM models).

3.1 Training discriminative models

We train two models: one to model the translation of source blocks, and the other to model the reordering of source blocks. We call the translation model a 'context dependent block translation model' for two reasons:

1. It is concerned with the translation of minimal phrasal units called blocks.
2. The context of the source block is used during its translation.

The word alignments are used to obtain the set of possible target blocks, which are added to the target vocabulary. A target block b is a sequence of n words paired with a sequence of m source words (Ittycheriah and Roukos, 2007). In our approach, we restrict ourselves to target blocks that are associated with only one source word; however, this constraint can easily be relaxed.

Similarly, we call the reordering model a 'context dependent block distortion model'. For training, we use the maximum entropy software library Llama, presented in (Haffner, 2006).

3.1.1 Context Dependent Block Translation Model

In this model, the goal is to predict a target block given the source word and contextual and syntactic information. Given a source word and its lexical context, the model estimates the probabilities of the presence or absence of the possible target blocks (see Figure 1).

[Figure 1: Word prediction model. Given a source word, its context window, and the words syntactically dependent on the source word, the model predicts a distribution over target blocks (target word 1 ... target word K with probabilities p1 ... pK).]

The probabilities of the candidate target blocks are obtained from the maximum entropy model. The probability p_{e_i} of a candidate target block e_i is estimated as given in equation 5:

    p_{e_i} = P(true | e_i, f_j, C)    (5)

where f_j is the source word corresponding to e_i and C is its context.

Using the maximum entropy model, binary classifiers are trained for every target block in the vocabulary. These classifiers predict whether a particular target block should be present given the source word and its context. This model is similar to the global lexical selection (GLS) model described in (Bangalore et al., 2007; Venkatapathy and Bangalore, 2009), except that in GLS the predicted target blocks are not associated with any particular source word, unlike the case here.

For the set of experiments in this paper, we used a context of size 6, containing three words to the left and three words to the right. We also used the POS tags of the words in the context window as features. In future, we plan to use the words syntactically dependent on a source word as global context (shown in Figure 1).

3.1.2 Context Dependent Block Distortion Model

An IBM model 3 like distortion model is trained to predict the relative position of a source word in the target given its context. Given a source word and its context, the model estimates the probability of a particular relative position being an appropriate position for the source word in the target (see Figure 2).

[Figure 2: Position prediction model. Given a source word and its context, the model outputs a distribution p_{-w} ... p_{-1}, p_0, p_1 ... p_w over relative target positions -w ... w.]
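Both models above amount to a bank of independent binary classifiers whose outputs are pruned by a threshold. A minimal sketch of the block-selection side (equation 5), with a made-up probability table standing in for the trained Llama/MaxEnt classifiers:

```python
# Hypothetical output of the per-block binary classifiers: for a source word
# f_j in context C, the probability P(true | e_i, f_j, C) that block e_i occurs.
block_probs = {
    ("casa", "house"): 0.62,
    ("casa", "home"): 0.30,
    ("casa", "flat"): 0.04,
}

def candidate_blocks(f_j, threshold=0.1):
    """Keep every target block whose presence probability clears the
    threshold (the pruning criterion used when building the phrase table),
    sorted from most to least probable."""
    return sorted(
        ((e_i, p) for (f, e_i), p in block_probs.items() if f == f_j and p >= threshold),
        key=lambda pair: -pair[1],
    )

cands = candidate_blocks("casa")  # 'flat' falls below the threshold
```

In the real system each probability would come from a classifier conditioned on the source word's context window and POS tags, rather than from a static table.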

Using a maximum entropy model similar to the one described for the context dependent block translation model, binary classifiers are trained for every possible relative position in the target. These classifiers output a probability distribution over the various relative positions, given a source word and its context.

The word alignments in the training corpus are used to train the distortion model. While computing the relative position, the difference in sentence lengths is also taken into account. Hence, the relative position of the target block located at position i corresponding to the source word located at position j is given in equation 6:

    r = round(i · m/n − j)    (6)

where m is the length of the source sentence, n is the number of target blocks, and round computes the nearest integer of its argument. If the source word is not aligned to any target word, a special symbol 'INF' is used to indicate such a case. In our model, this symbol is also part of the target distribution.

The features used to train this model are the same as those used for the block translation model. In order to use further lexical information, we also incorporated information about the target word for predicting the distribution. The information about the possible target words is obtained from the 'context dependent block translation model'. The probabilities in this case are measured as shown in equation 7:

    p_{r,e_i} = P(true | r, e_i, f_j, C)    (7)

3.2 Integration with the phrase-based decoder

The discriminative models trained are sentence-specific, i.e. the context of the sentence is used to make predictions in these models. Hence, the phrase-based decoder is required to use information specific to a source sentence. To handle this issue, a different phrase table and reordering table are constructed for every input sentence, using the discriminative models trained earlier.

In Moses (Koehn et al., 2007), the phrase table contains the source phrase, the target phrase and the various scores associated with the phrase pair, such as the phrase translation probability, lexical weighting, inverse phrase translation probability, etc.¹

In our approach, given a source sentence, the following steps are followed to construct the phrase table:

1. Extract the source blocks ('words' in this work).

2. Use the 'context dependent block translation model' to predict the possible target blocks. The set of possible blocks can be predicted using two criteria: (1) a probability threshold, and (2) K-best. Here, we use a threshold value to prune the set of possible candidates in the target vocabulary.

3. Use the prediction probabilities to assign scores to the phrase pairs.

A similar set of steps is used to construct the reordering table corresponding to an input sentence in the source language.

4 Decoding

4.1 Decoding with the LCS Decoder

The lattice construction and scoring algorithm, as the name suggests, consists of two steps:

1. Lattice construction. In this step, a lattice representing the various possible target sequences is obtained. In the approach for global lexical selection (Bangalore et al., 2007; Venkatapathy and Bangalore, 2009), the input to this step is a bag of words, which is used to construct an initial sequence (a single-path lattice). To this sequence, deletion arcs are added to incorporate additional paths (at a cost) that facilitate the deletion of words in the initial sequence. The sequence is then permuted using a permutation window in order to construct a lattice representing the possible sequences; the permutation window is used to control the search space.

In our experiments, we used a similar process for sentence construction.

¹ http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases

Using the context dependent block translation algorithm, we obtain a number of translation blocks for every source word. These blocks are interconnected in order to obtain the initial lattice (see Figure 3).

[Figure 3: Lattice construction. The candidate target blocks t_(i,1), t_(i,2), ... predicted for each source word f_(i) are interconnected to form the initial target lattice.]

To control deletions at the various source positions, deletion nodes may be added to the initial lattice. This lattice is permuted using a permutation window to construct a lattice representing the possible sequences. Hence, the parameters that dictate lattice construction are: (1) the threshold for lexical selection, (2) whether deletion arcs are used, and (3) the permutation window.

2. Scoring. In this step, each of the paths in the lattice constructed in the earlier step is scored using a language model (Haffner, 2006), which is the same as the one used for sentence construction in global lexical selection models. It is to be noted that we do not use the discriminative reordering model in this decoder; only the language model is used to score the various target sequences.

The path with the lowest score is considered the best possible target sentence for the given source sentence. Using this decoder, we conducted experiments on the development set by varying the threshold values and the size of the permutation window. The best parameter values obtained using the development set were used for decoding the test corpus.

4.2 Decoding with the Moses Decoder

In this approach, the phrase table and the reordering table are constructed using the discriminative model for every source sentence (see section 3.2). These tables are then used by the state-of-the-art Moses decoder to obtain the corresponding translations.

The various training and decoding parameters of the discriminative model are computed by exhaustively exploring the parameter space and correspondingly measuring the output quality on the development set. The best set of parameters was used for decoding the sentences in the test corpus. We modified the weights assigned by Moses to the translation model, reordering model and language model. Experiments were conducted by pruning the options in the phrase table and by using the word penalty feature in Moses.

We trained a language model of order 5 on the entire Europarl corpus using the SRILM package. The method uses the improved Kneser-Ney smoothing algorithm (Chen and Goodman, 1999) to compute sequence probabilities.

5 Dataset

The experiments were conducted on the Spanish-English language pair. The latest version of the Europarl corpus (version 5) was used in this work. A small set of 200K sentences was selected from the training set to conduct the experiments. The test and development sets, containing 2525 and 2051 sentences respectively, were used without making any changes.

Table 1: Corpus statistics for the Spanish-English corpus.

Corpus                   | No. of sentences | Source | Target
Training                 | 200000           | 59591  | 36886
Testing                  | 2525             | 10629  | 8905
Development              | 2051             | 8888   | 7750
Monolingual English (LM) | 200000           | n.a.   | 36886

6 Experiments and Results

The output of our experiments was evaluated using two metrics: (1) BLEU (Papineni et al., 2002), and (2) Lexical Accuracy (LexAcc). Lexical accuracy measures the similarity between the unordered bag of words in the reference sentence

against the unordered bag of words in the hypothesized translation. Lexical accuracy is a measure of the fidelity of lexical transfer from the source to the target sentence, independent of the syntax of the target language (Venkatapathy and Bangalore, 2009). We report lexical accuracies to show the performance of LCS decoding in comparison with the baseline system.

We first present the results of the state-of-the-art phrase-based model (Moses) trained on a parallel corpus. We treat this as our baseline. The reordering feature used is msd-bidirectional, which allows all possible reorderings over a specified distortion limit. The baseline accuracies are shown in Table 2.

Table 2: Baseline accuracy.

Corpus      | BLEU   | Lexical Accuracy
Development | 0.1734 | 0.448
Testing     | 0.1823 | 0.492

We conduct two types of experiments to test our approach:

1. experiments using lexical features (see section 6.1), and
2. experiments using syntactic features (see section 6.2).

6.1 Experiments using Lexical Features

In this section, we present the results of our experiments that use only lexical features. First, we measure the translation accuracy using LCS decoding. On the development set, we explored the set of decoding parameters (as described in section 4.1) to compute the optimal parameter values. The best lexical accuracy obtained on the development set is 0.4321 and the best BLEU score obtained is 0.0923, at a threshold of 0.17 and a permutation window of size 3. The accuracies corresponding to a few other parameter values are shown in Table 3.

Table 3: Lexical accuracies of the lattice output using lexical features alone, for various parameter values.

Threshold | Perm. Window | LexAcc | BLEU
0.16      | 3            | 0.4274 | 0.0914
0.17      | 3            | 0.4321 | 0.0923
0.18      | 3            | 0.4317 | 0.0918
0.16      | 4            | 0.4297 | 0.0912
0.17      | 4            | 0.4315 | 0.0915

On the test data, we obtained a lexical accuracy of 0.4721 and a BLEU score of 0.1023. As we can observe, the BLEU score obtained using the LCS decoding technique is low when compared to the BLEU score of the state-of-the-art system. However, the lexical accuracy is comparable to the lexical accuracy of Moses. This shows that the discriminative model provides good lexical selection, while the sentence construction technique does not perform as expected.

Next, we present the results of the Moses-based decoder that uses the discriminative model (see section 3.2). In our experiments, we did not use MERT training for tuning the Moses parameters. Rather, we explore a set of possible parameter values (i.e. the weights of the translation model, reordering model and language model) to check the performance. We show the BLEU scores obtained on the development set using the Moses decoder in Table 4.

Table 4: BLEU for different weight values, using lexical features only.

Reordering weight (d) | LM weight (l) | Translation weight (t) | BLEU
0                     | 0.6           | 0.3                    | 0.1347
0                     | 0.6           | 0.6                    | 0.1354
0.3                   | 0.6           | 0.3                    | 0.1441
0.3                   | 0.6           | 0.6                    | 0.1468

On the test set, we obtained a BLEU score of 0.1771. We observe that both the lexical accuracy and the BLEU scores obtained using the discriminative training model combined with the Moses decoder are comparable to the state-of-the-art results. A summary of the results obtained using the three approaches and lexical feature functions is presented in Table 5.

6.2 Experiments using Syntactic Features

In this section, we present the effect of incorporating syntactic features using our model on the

translation accuracies. Table 6 presents the results of our approach using syntactic features at different parameter values. Here, we can observe that the translation accuracies (both LexAcc and BLEU) are better than those of the model that uses only lexical features.

Table 5: Translation accuracies using lexical features for the different approaches.

Approach                                           | BLEU   | LexAcc
State-of-the-art (Moses)                           | 0.1823 | 0.492
LCS decoding                                       | 0.1023 | 0.4721
Moses decoder trained using a discriminative model | 0.1771 | 0.4841

Table 6: BLEU for different weight values, using syntactic features.

Reordering weight (d) | LM weight (l) | Translation weight (t) | BLEU
0                     | 0.6           | 0.3                    | 0.1661
0                     | 0.6           | 0.6                    | 0.1724
0.3                   | 0.6           | 0.3                    | 0.1780
0.3                   | 0.6           | 0.6                    | 0.1847

Table 7 shows the comparative performance of the model using syntactic as well as lexical features against the one with lexical feature functions only.

Table 7: Comparison between the translation accuracies of the models using syntactic and lexical features.

Model                      | BLEU   | LexAcc
Lexical features           | 0.1771 | 0.4841
Lexical+Syntactic features | 0.201  | 0.5431

On the test set, we obtained a BLEU score of 0.20, an improvement of 2.3 points over the model that uses lexical features alone. We also obtained an increase of 6.1% in lexical accuracy using the model with syntactic features, as compared to the model using lexical features only.

7 Conclusions and Future Work

In this paper, we presented an approach to statistical machine translation that combines the power of a discriminative model (for training a model for Machine Translation) and the standard beam-search based decoding technique (for the translation of an input sentence). The key contributions are:

1. We incorporated a discriminative model in a phrase-based decoder. We obtained comparable results with the state-of-the-art phrase-based decoder (see section 6.1). The advantage of our approach is its flexibility to incorporate richer contextual and linguistic feature functions.

2. We showed that the incorporation of syntactic information (POS tags) in our discriminative model boosted the performance of translation. The lexical accuracy of our approach improved by 6.1% when syntactic features were used in addition to the lexical features. Similarly, the BLEU score improved by 2.3 points when syntactic features were used, compared to the model that uses lexical features alone. The accuracies are likely to improve further when richer linguistic feature functions (that use parse structure) are incorporated in our approach.

In future, we plan to work on:

1. experimenting with rich syntactic and structural features (parse tree-based features) using our approach;
2. experimenting on other language pairs, such as Arabic-English and Hindi-English;
3. improving the LCS decoding algorithm using syntactic cues in the target (Venkatapathy and Bangalore, 2007), such as supertags.

References

Bangalore, S., P. Haffner, and S. Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 152.

Berger, A.L., V.J.D. Pietra, and S.A.D. Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Brown, P.F., V.J.D. Pietra, S.A.D. Pietra, and R.L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chen, S.F. and J. Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359–394.

Haffner, P. 2006. Scaling large margin classifiers for spoken language understanding. Speech Communication, 48(3-4):239–261.

Hassan, H., K. Sima'an, and A. Way. 2009. A syntactified direct translation model with linear-time decoding. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1182–1191. Association for Computational Linguistics.

Ittycheriah, A. and S. Roukos. 2007. Direct translation model 2. In Proceedings of NAACL HLT, pages 57–64.

Koehn, P. and H. Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876.

Koehn, P., F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 2.

Och, F.J. and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of ACL, volume 2, pages 295–302.

Och, F.J., C. Tillmann, H. Ney, et al. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28.

Papineni, K.A., S. Roukos, and R.T. Ward. 1998. Maximum likelihood and discriminative training of direct translation models. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1.

Papineni, K., S. Roukos, T. Ward, and W.J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Venkatapathy, S. and S. Bangalore. 2007. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In Proceedings of the NAACL-HLT 2007/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 96–102. Association for Computational Linguistics.

Venkatapathy, Sriram and Srinivas Bangalore. 2009. Discriminative Machine Translation Using Global Lexical Selection. ACM Transactions on Asian Language Information Processing, 8(2).

Xiong, D., Q. Liu, and S. Lin. 2006. Maximum entropy based phrase reordering model for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, page 528. Association for Computational Linguistics.

Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment

Ventsislav Zhechev, EuroMatrixPlus, CNGL, School of Computing, Dublin City University, [email protected]
Josef van Genabith, EuroMatrixPlus, CNGL, School of Computing, Dublin City University, [email protected]

Abstract

With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed with them an SMT system to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.

1. Introduction

As the world becomes increasingly interconnected, the major trend is to try to deliver ideas and products to the widest audience possible. This requires the localisation of products for as many countries and cultures as possible, with translation being one of the main parts of the localisation process. Because of this, the amount of data that needs professional high-quality translation is continuing to increase well beyond the capacity of the world's human translators.

Thus, current efforts in the localisation industry are mostly directed at the reduction of the amount of data that needs to be translated from scratch by hand. Such efforts mainly include the use of Translation Memory (TM) systems, where earlier translations are stored in a database and offered as suggestions when new data needs to be translated. As TM systems were originally limited to providing translations only for (almost) exact matches of the new data, the integration of Machine Translation (MT) techniques is seen as the only feasible development that has the potential to significantly reduce the amount of manual translation required.

At the same time, the use of SMT is frowned upon by the users of CAT tools, as they still do not trust the quality of the SMT output. There are two main reasons for that. First, currently there is no reliable way to automatically ascertain the quality of SMT-generated translations, so that the user could at a glance make a judgement as to the amount of effort that might be needed to post-edit the suggested translation (Simard and Isabelle, 2009). Not having such automatic quality metrics also has the side effect of it being impossible for a Translation-Services Provider (TSP) company to reliably determine in advance the increase in translator productivity due to the use of MT, and to adjust their resource-allocation and cost models correspondingly.

The second major problem for users is that SMT-generated translations are as a rule only obtained for cases where the TM system could not produce a good-enough translation (cf. Heyn, 1996). Given that the SMT system used is usually trained only on the data available in the TM, expectedly it also has few examples from which to construct the translation, thus producing low-quality output.
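As a toy illustration of this TM-plus-MT division of labour, the sketch below retrieves the closest stored segment with a fuzzy-match score (the FMS described in Section 2.2 is a normalised character-level Levenshtein measure) and accepts it only above a threshold, falling back to plain MT otherwise. Everything here, including the helper names and the toy TM, is our own illustrative reconstruction, not the authors' code.

```python
def levenshtein(a, b):
    # Character-level edit distance via the standard dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fms(a, b):
    # Fuzzy-Match Score: 1 minus the length-normalised edit distance.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def lookup(sentence, tm, threshold=0.5):
    # tm is a list of (source, target) segment pairs.
    # Returns (source, target, score) for the best match above the
    # threshold, or None, in which case a pure SMT run would follow.
    src, tgt, score = max(((s, t, fms(sentence, s)) for s, t in tm),
                          key=lambda x: x[2])
    return (src, tgt, score) if score >= threshold else None
```

For an input that repeats a stored segment, `lookup` returns the stored pair with a score of 1.0; for an input far from every stored segment it returns `None`, modelling the fallback to translation from scratch.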

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 43–51, COLING 2010, Beijing, August 2010.

In this paper, we combine TM, SMT and automatic Sub-Tree Alignment (STA) backends in a single integrated tool. When a new sentence that needs to be translated is supplied, first a Fuzzy-Match Score (FMS – see Section 2.2) is obtained from the TM backend, together with the best suggested matching sentence and its translation. For sentences that receive a reasonably high FMS, the STA backend is used to find the correspondences between the input sentence and the TM-suggested translation, marking up the parts of the input that are correctly translated by the TM. The SMT backend is then employed to obtain the final translation from the marked-up input sentence. In this way we expect to achieve a better result compared to using pure SMT.

In Section 2, we present the technical details of the design of our system, together with motivation for the particular design choices. Section 3 details the experimental setup and the data set used for the evaluation results in Section 4. We present improvements that we plan to investigate in further work in Section 5, and provide concluding remarks in Section 6.

2. System Framework

We present a system that uses a TM match to pre-translate parts of the input sentence and guide an SMT system to the generation of a higher-quality translation.

2.1. Related Approaches

We are not aware of any published research where TM output is used to improve the performance of an SMT system in a manner similar to the system presented in this paper.

Most closely related to our approach are the systems by Biçici and Dymetman (2008) and Simard and Isabelle (2009), where the authors use the TM output to extract new phrase pairs that supplement the SMT phrase table. Such an approach, however, does not guarantee that the SMT system will select the TM-motivated phrases even if a heavy bias is applied to them.

Another related system is presented in (Smith and Clark, 2009). Here the authors use a syntax-based EBMT system to pre-translate and mark up parts of the input sentence and then supply this marked-up input to an SMT system. This differs from our system in two ways. First, Smith and Clark use EBMT techniques to obtain partial translations of the input from the complete example base, while we are only looking at the best TM match for the given input. Second, the authors use dependency structures for EBMT matching, while we employ phrase-based structures.

2.2. Translation Memory Backend

Although the intention is to use a full-scale TM system as the translation memory backend, to have complete control over the process for this initial research we decided to build a simple prototype TM backend ourselves.

We employ a database setup using the PostgreSQL¹ v.8.4.3 relational database management (RDBM) system. The segment pairs from a given TM are stored in this database and assigned unique IDs for further reference. When a new sentence is supplied for translation, the database is searched for (near) matches, using an FMS based on normalised character-level Levenshtein edit distance (Levenshtein, 1965).

Thus for each input sentence, from the database we obtain the matching segment with the highest FMS, its translation and the score itself.

2.3. Sub-Tree Alignment Backend

The system presented in this paper uses phrase-based sub-tree structural alignment (Zhechev, 2010) to discover parts of the input sentence that correspond to parts of the suggested translation extracted from the TM database. We chose this particular tool because it can produce aligned phrase-based-tree pairs from unannotated (i.e. unparsed) data. It can also function fully automatically, without the need for any training data. The only auxiliary requirement it has is for a probabilistic dictionary for the languages that are being aligned. As described later in this section, in our case this is obtained automatically from the TM data during the training of the SMT backend.

The matching between the input sentence and the TM-suggested translation is done in a three-step process. First, the plain TM match and its

1 http://www.postgresql.org/

translation are aligned, which produces a sub-tree-aligned phrase-based tree pair with all non-terminal nodes labelled 'X' (cf. Zhechev, 2010). As we are only interested in the relations between the lexical spans of the non-terminal nodes, we can safely ignore their labels. We call this first step of our algorithm bilingual alignment.

In the second step, called monolingual alignment, the phrase-based tree-annotated version of the TM match is aligned to the unannotated input sentence. The reuse of the tree structure for the TM match allows us to use it in the third step as an intermediary to establish the available sub-tree alignments between the input sentence and the translation suggested from the TM.

During this final alignment, we identify matched and mismatched portions of the input sentence and their possible translations in the TM suggestion and, thus, this step is called matching. Additionally, the sub-tree alignments implicitly provide us with reordering information, telling us where the portions of the input sentence that we translate should be positioned in the final translation.

The alignment process is exemplified in Figure 1. The tree marked 'I' corresponds to the input sentence, the one marked 'M' to the TM match and the one marked 'T' to the TM translation. Due to space constraints, we only display the node ID numbers of the non-terminal nodes in the phrase-structure trees — in reality all nodes carry the label 'X'. These IDs are used to identify the sub-sentential alignment links. The lexical items corresponding to the leaves of the trees are presented in the table below the graph.

[Figure 1. Example of sub-tree alignment between an input sentence, TM match and TM translation. Input (I): "sender email address"; TM match (M): "sender 's email address ."; TM translation (T): "adresse électronique de l' expéditeur du message ."]

The alignment process can be visually represented as starting at a linked node in the I tree and following the link to the M tree. Then, if available, we follow the link to the T tree, and this leads us to the T-tree node corresponding to the I-tree node we started from. In Figure 1, this results in the I–T alignments I1–T18, I2–T2, I3–T1, I4–T32 and I6–T34. The first three links are matches, because the lexical items covered by the I nodes correspond exactly to the lexical items covered by their M node counterparts. Such alignments provide us with direct TM translations for our input. The last two links in the group are mismatched, because there is no lexical correspondence between the I and M nodes (node I4 corresponds to the phrase sender email, while the linked node M10 corresponds to sender 's email). Such alignments can only be used to infer reordering information. In particular in this case, we can infer that the target word order for the input sentence is address email sender, which produces the translation adresse électronique de l' expéditeur.

We decided to use sub-tree-based alignment, rather than plain word alignment (e.g. GIZA++ – Och and Ney, 2003), due to a number of factors. First, sub-tree-based alignment provides much better handling of long-distance reorderings, while word- and phrase-based alignment models always have a fixed limit on reordering distance that tends to be relatively low to allow efficient computation.

The alignments produced by a sub-tree alignment model are also precision-oriented, rather than recall-oriented (cf. Tinsley, 2010). This is important in our case, where we want to only extract those parts of the translation suggested by the TM for which we are most certain that they are good translations.
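The link-following procedure described for Figure 1 can be sketched in a few lines. The node IDs, lexical spans and links below are a hand-built toy in the spirit of the figure, not the actual trees, and the code is our illustration rather than the authors' aligner.

```python
# Toy sub-tree data loosely following Figure 1: each node ID maps to the
# tuple of tokens it covers; i2m holds input-to-match links from the
# monolingual alignment, m2t holds match-to-translation links from the
# bilingual alignment.  All IDs and spans are illustrative only.
i_span = {1: ("sender",), 2: ("email",), 3: ("address",),
          4: ("sender", "email")}
m_span = {1: ("sender",), 3: ("email",), 4: ("address",),
          10: ("sender", "'s", "email")}
i2m = {1: 1, 2: 3, 3: 4, 4: 10}
m2t = {1: 8, 3: 2, 4: 1, 10: 32}

def follow(i_node):
    """Follow I -> M -> T and label the resulting link.

    A link is a match when the input node covers exactly the same lexical
    items as its match-tree counterpart; otherwise it is mismatched and
    can only contribute reordering information.
    """
    m_node = i2m[i_node]
    t_node = m2t.get(m_node)
    kind = "match" if i_span[i_node] == m_span[m_node] else "mismatch"
    return t_node, kind
```

Here `follow(1)` yields a matched link (a direct TM translation for that span), while `follow(4)` is mismatched, since "sender email" and "sender 's email" do not cover the same tokens, and so contributes only reordering information.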

As stated earlier, the only resource necessary for the operation of this system is a probabilistic bilingual dictionary covering the data that needs to be aligned. For the bilingual alignment step, such a bilingual dictionary is produced as a by-product of the training of the SMT backend and is therefore available. For the monolingual alignment step, the required probabilistic dictionary is generated by simply listing each unique token seen in the source-language data in the TM as translating only as itself with probability 1.

2.4. Statistical Machine Translation Backend

Once the matching step is completed, we have identified and marked up the parts of the input sentence for which translations will be extracted from the TM suggestions, as well as the parts that need to be translated from scratch. The lengths of the non-translated segments vary depending on the FMS, but are in general relatively short (one to three tokens).

The further processing of the input relies on a specific feature of the SMT backend we use, namely the Moses system (Koehn et al., 2007). We decided to use this particular system as it is the most widely adopted open-source SMT system, both for academic and commercial purposes. In this approach, we annotate the segments of the input sentence for which translations have been found from the TM suggestion using XML tags, with the translation corresponding to each segment given as an attribute of the encapsulating XML tag, similarly to the system described in (Smith and Clark, 2009). The SMT backend is supplied with marked-up input in the form of a string consisting of the concatenation of the XML-enclosed translated segments and plain non-translated segments in the target-language word order, as established by the alignment process. The SMT backend is instructed to translate this input while keeping the translations supplied via the XML annotation. This allows the SMT backend to produce translations informed by and conforming to actual examples from the TM, which should result in improvements in translation quality.

2.5. Auxiliary Tools

It must be noted that in general the SMT backend sees the data it needs to translate in the target-language word order (e.g. it is asked to translate an English sentence that has French word order). This, however, does not correspond to the data found in the TM, which we use for deriving the SMT models. Because of this discrepancy, we developed a pre-processing tool that goes over the TM data, performing bilingual alignment and outputting reordered versions of the sentences it processes by using the information implicitly encoded in the sub-tree alignments. In this way we obtain the necessary reordered data to train a translation model where the source language already has the target-language word order. In our system we then use this model — together with the proper-word-order model — for translation.

One specific aspect of real-world TM data that we need to deal with is that they often contain meta-tag annotations of various sorts: annotation tags specific to the file format used for storing the TM data, XML tags annotating parts of the text as appearing in Graphical User Interface (GUI) elements, and formatting tags specific to the file format the TM data was originally taken from, e.g. RTF, OpenDoc, etc. Letting any MT system try to deal with these tags in a probabilistic manner can easily result in ill-formed, mistranslated and/or out-of-order meta-tags in the translation.

This motivates the implementation of a rudimentary handling of meta-tags in the system presented in this paper, in particular handling the XML tags found in the TM data we work with, as described in Section 3. The tool we developed for this purpose simply builds a map of all unique XML tags per language and replaces them in the data with short placeholders that are designed in such a way that they would not interfere with the rest of the TM data.² A special case that the tool has to take care of is when an XML tag contains an attribute whose value needs to be translated. In such situations, we decided not to perform any processing, but rather leave the XML tag as is, so that all text may be translated as needed. A complete treatment of meta-tags, however, is beyond the scope of the current paper.
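The marked-up input handed to the SMT backend can be pictured with a small string-builder. This is our own sketch: the tag name `seg` and the attribute name `translation` follow the general shape of Moses's XML-input markup, but the exact tag and attribute names, and any escaping of special characters, would have to match the Moses version actually used.

```python
def markup_input(segments):
    # segments: (source_text, translation_or_None) pairs, already placed
    # in target-language word order by the matching step.  Segments with
    # a translation are wrapped in an XML tag carrying it as an
    # attribute; the rest is left for the decoder to translate.
    parts = []
    for src, trans in segments:
        if trans is None:
            parts.append(src)  # to be translated from scratch
        else:
            parts.append(f'<seg translation="{trans}">{src}</seg>')
    return " ".join(parts)
```

For example, a matched span plus one outstanding token would yield a string such as `<seg translation="adresse électronique">email address</seg> sender`, which the decoder translates while keeping the supplied span fixed.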

2 In the current implementation, the XML tags are replaced with a short placeholder string built around a unique numeric identifier for the XML tag that is being replaced.
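A minimal version of such a tag-to-placeholder map can be sketched with a regular-expression substitution. The `TAGn` placeholder format and the simplistic `<[^>]+>` tag pattern below are purely illustrative (the paper's actual placeholder string is not reproduced here), and the code is ours, not the authors' tool.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive XML-tag pattern, for illustration

def mask_tags(text, tag_map):
    # Replace each distinct XML tag with a short placeholder, recording
    # the mapping so the substitution can be reversed after translation.
    def repl(match):
        tag = match.group(0)
        if tag not in tag_map:
            tag_map[tag] = f"TAG{len(tag_map)}"
        return tag_map[tag]
    return TAG_RE.sub(repl, text)

def unmask_tags(text, tag_map):
    # Restore the original tags in the translated output.  Longer
    # placeholders are replaced first so e.g. TAG10 is not clobbered
    # by the TAG1 substitution.
    for tag, placeholder in sorted(tag_map.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(placeholder, tag)
    return text
```

Masking and unmasking round-trip: the same map produced while masking the TM data is later used to restore the tags in the translation.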

We also had to build a dedicated tokeniser/de-tokeniser pair to handle real-world TM data containing meta-tags, e-mail addresses, file paths, etc., as described in Section 3. Both tools are implemented as a cascade of regular-expression substitutions in Perl.

Finally, we use a tool to extract the textual data from the TM. That is, we strip all tags specific to the format in which the TM is stored, as they can in general be recreated and thus do not need to be present during translation. In our particular case the TM is stored in the XML-based TMX format.³

2.6. Complete Workflow

Besides the components described above, we also performed two further transformations on the data. First, we lowercase the TM data before using it to train the SMT backend models. This also means that the alignment steps from Section 2.3 are performed on lowercased data, as the bilingual dictionary used there is obtained during the SMT training process.⁴

Additionally, the SMT and sub-tree alignment systems that we use cannot handle certain characters, which we need to mask in the data. For the SMT backend, this includes '|', '<' and '>', and for the sub-tree aligner, '(' and ')'. The reason these characters cannot be handled is that the SMT system uses '|' internally to separate data fields in the trained models, and '<' and '>' cannot be handled whilst using XML tags to annotate pre-translated portions of the input. The sub-tree aligner uses '(' and ')' to represent the phrase-based tree structures it generates, and the presence of these characters in the data may create ambiguity when parsing the tree structures. All these characters are masked by substituting in high-Unicode counterparts, namely '│', '﹤', '﹥', '﹙' and '﹚'. Visually, there is a very slight distinction, and this is intentionally so to simplify debugging. However, the fact that the character codes are different alleviates the problems discussed above. Of course, in the final output the masking is reversed and the translation contains the regular versions of the characters.

[Figure 2. Pre-Processing Workflow: boxes cover extraction of textual data from the TMX format, meta-tag handling and substitution maps, tokenisation and masking of special characters, lowercasing, language-model training and binarisation, generation of the probabilistic dictionaries for monolingual and bilingual alignment, sub-tree alignment producing the bilingual parallel treebank, reordering of the source-language data, and SMT training and binarisation for both the normal and the target word order.]

The complete pre-processing workflow is presented in Figure 2, where the rectangles with vertical bars represent the use of open-source tools, while the plain rectangles represent tools developed by the authors of this paper.

First, the textual data is extracted from the original TM format, producing one plain-text file for each language side. These data can be pre-loaded in a PostgreSQL database either at this time or during the first run of the translation system. Next, the meta-tag-handling tool is used to generate the substitution tables for the source and target languages, as well as new files for each language with the tags substituted by the corresponding identifiers (cf. Section 2.5). These files are then tokenised, lowercased, and all conflicting characters are masked, as described above.

The pre-processed files are then used to produce a file containing pairs of sentences in the input format of the sub-tree aligner, as well as to generate the probabilistic dictionary required for

3 http://www.lisa.org/fileadmin/standards/tmx1.4/tmx.htm
4 Currently, we do not use a recaser tool and the translations produced are always in lowercase. This component, however, will be added in a future version of the system.

the monolingual alignment and to train the SMT model on the data in the proper word order. The SMT training produces the necessary bilingual dictionary for use by the sub-tree aligner, which is run to obtain a parallel-treebank version of the TM data. The parallel treebank is then used to retrieve bilingual alignments for the TM data, rather than generate them on the fly during translation. This is an important design decision, as the complexity of the alignment algorithm is high for plain-text alignment (cf. Zhechev, 2010).

Once we have generated the bilingual parallel treebank, we run the reordering tool, which generates a new plain-text file for the source language, where the sentences are modified to conform to the target-language word order, as implied by the data in the parallel treebank. This is then matched with the proper-order target-language file to train the SMT backend for the actual use in the translation process.

[Figure 3. Translation Workflow: the input sentence is read and the TM match with the highest FMS is found in the TM database; input, match and translation are pre-processed (meta-tag handling, tokenisation, masking). If FMS >= 50, the bilingual alignment for M and T is extracted from the bilingual parallel treebank, the monolingual alignment for M and I is generated, alignment matching is performed and the SMT backend (both word orders) produces the xml translation; otherwise the SMT backend (normal word order) produces the direct translation. The tm translation is also output, and outputs are de-tokenised and unmasked.]

Once all the necessary files have been generated and all pre-processing steps have been completed, the system is ready for use for translation. The translation workflow is shown in Figure 3, with 'I', 'M' and 'T' having the same meanings as in Figure 1. Here, the first step after an input sentence has been read in is to find the TM match with the highest FMS. This is done using the original plain non-pre-processed data, to simulate real-life operation with a proper TM backend. After the best TM match and its translation are extracted from the TM, they — together with the input sentence — are pre-processed by tokenisation, lowercasing, meta-tag and special-character substitution. Next, the corresponding tree pair is extracted from the bilingual parallel treebank to establish the tree structure for the TM source-language match. This tree structure is then used to perform the monolingual alignment, which allows us to perform the matching step next. After the matching is complete, we generate a final translation as described in Section 2.4. Finally, the translations are de-tokenised and the XML tags and special characters are unmasked.

3. Experimental Setup

We use real-life TM data from an industrial partner. The TM was generated during the translation of RTF-formatted customer support documentation. The data is in TMX format and originally contains 108 967 English–French translation segments, out of which 14 segments either have an empty language side or an extreme discrepancy in the number of tokens for each language side, and were therefore discarded.

A particular real-life trait of the data is the presence of a large number of XML tags. Running the tag-mapping tool described in Section 2.6, we gathered 2 049 distinct tags for the English side of the data and 2 653 for the French side. Still, there were certain XML tags that included a label argument whose value was translated from one language to the other. These XML tags were left intact so that our system could handle the translation correctly.

The TM data also contain a large number of file paths, e-mail addresses, URLs and others, which makes bespoke tokenisation of the data necessary. Our tokenisation tool ensures that none of these elements are tokenised, keeps RTF formatting sequences non-tokenised, and properly handles non-masked XML tags, minimising their fragmentation.

As translation segments rarely occur more than once in a TM, we observe a high number of unique tokens (measured after pre-processing) — 41 379 for English and 49 971 for French — out of

108 953 segment pairs. The average sentence length is 13.2 for English and 15.0 for French.

For evaluation, we use a data set of 4977 English–French segments from the domain of the TM. The sentences in the test set are significantly shorter on average, compared to the TM — 9.2 tokens for English and 10.9 for French.

It must be noted that we used SMT models with a maximum phrase length of 3 tokens, rather than the standard 5 tokens, and for decoding we used a 3-gram language model. This results in much smaller models than the ones usually used in mainstream SMT applications. (The standard for some tools goes as far as a 7-token phrase-length limit and 7-gram language models.)

4. Evaluation Results

For the evaluation of our system, we used a number of widely accepted automatic metrics, namely BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006) and inverse F-score based on token-level precision and recall.

We set up our system to only fully process input sentences for which a TM match with an FMS over 50% was found, although all sentences were translated directly using the SMT backend to check the overall pure SMT performance. The TM-suggested translations were also output for all input sentences.

The results of the evaluation are given in Figure 4, where the tm and direct scores are also given for the FMS range [0%; 50%)∪{100%}. Across all metrics we see a uniform drop in the quality of TM-suggested translations, which is what we expected, given that these translations contain one or more wrong words. We believe that the relatively high scores recorded for the TM-suggested translations at the high end of the FMS scale are a result of the otherwise perfect word order and lexical choice. For n-gram-match-based metrics like the ones we used, such a result is expected and predictable. Although the inverse F-score results show the potential of our setup to translate the outstanding tokens in a 90%–100% TM match, it appears that the SMT system produces word order that does not correspond to the reference translation, and because of this receives lower scores on the other metrics.

The unexpected drop in scores for perfect TM matches is due to discrepancies between the reference translations in our test set and the translations stored in the TM. We believe that this issue

[Figure 4. Evaluation results for English-to-French translation, broken down by FMS range. Four panels plot BLEU, METEOR, TER and inverse F-score for the tm, xml and direct outputs over the FMS ranges 0–50 (1963 segments), 50–60 (779), 60–70 (621), 70–80 (537), 80–90 (537), 90–100 (375) and 100 (165).]

affects all FMS ranges, albeit to a lower extent for non-perfect matches. Unfortunately, the exact impact cannot be ascertained without human evaluation.

We observe a significant drop-off in translation quality for the direct output below FMS 50%. This suggests that sentences with such low FMS should be translated either by a human translator from scratch, or by an SMT system trained on different/more data.

Our system (i.e. the xml setup) clearly outperforms the direct SMT translation for FMS between 80 and 100, and has comparable performance between FMS 70 and 80. Below FMS 70, the SMT backend has the best performance. Although these results are positive, we still need to investigate why our system has poor performance at lower FMS ranges. Theoretically, it should outperform the SMT backend across all ranges, as its output is generated by supplying the SMT backend with good pre-translated fragments. The inverse F-score graph suggests that this is due to worse lexical choice, but only manual evaluation can provide us with clues for solving the issue.

The discrepancy between the results in the inverse F-score graph and the other metrics suggests that the biggest problem for our system is producing output in the expected word order.

5. Future Work

There are a number of possible directions for improvement that can be explored.

As mentioned earlier, we plan to integrate our system with a full-featured open-source or commercial TM product that will supply the TM matches and translations. We expect this to improve our results, as the quality of the TM matches will better correspond to the reported FMS.

Such an integration will also be the first necessary step to perform a user study evaluating the effect of the use of our system on post-editing speeds. We expect the findings of such a study to show a significant increase in throughput that will significantly reduce the costs of translation for large-scale projects.

It would be interesting to also conduct a user study where our system is used to additionally mark up the segments that need to be edited in the final SMT translation. We expect this to provide additional speedup to the post-editing process. Such a study will require tight integration between our system and a CAT tool, and the modular design we presented will facilitate this significantly.

The proposed treatment of meta-tags is currently very rudimentary and may be extended with additional features and to handle additional types of tags. The design of our system currently allows the meta-tag-handling tool to be developed independently, thus giving the user the choice of using a different meta-tag tool for each type of data they work with.

In addition, the reordering tool needs to be developed further, with emphasis on properly handling situations where the appropriate position of an input-sentence segment cannot be reliably established. In general, further research is needed into the reordering errors introduced by the SMT system into otherwise good translations.

6. Conclusions

In this paper, we presented a novel modular approach to the utilisation of Translation Memory data to improve the quality of Statistical Machine Translation.

The system we developed uses precise sub-tree-based alignments to reliably determine and mark up correspondences between an input sentence and a TM-suggested translation, which ensures the utilisation of the high-quality translation data stored in the TM database. An SMT backend then translates the marked-up input sentence to produce a final translation with improved quality.

Our evaluation shows that the system presented in this paper significantly improves the quality of SMT output when using TM matches with FMS above 80, and produces results on par with the pure SMT output for FMS between 70 and 80. TM matches with FMS under 70 seem to provide insufficient reordering information and too few matches to improve on the SMT output. Still, further investigation is needed to properly diagnose the drop in quality for FMS below 70. We expect further improvements to the reordering functionality of our system to result in higher-quality output even for lower FMS ranges.

Acknowledgements

This research is funded under the 7th Framework Programme of the European Commission within the EuroMatrixPlus project (grant № 231720). The data used for evaluation was generously provided by Symantec Ireland.

References

Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgements. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pp. 65–72. Ann Arbor, MI.

Biçici, Ergun and Marc Dymetman. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing '08), ed. Alexander F. Gelbukh, pp. 454–465. Vol. 4919 of Lecture Notes in Computer Science. Haifa, Israel: Springer Verlag.

Heyn, Matthias. 1996. Integrating Machine Translation into Translation Memory Systems. In Proceedings of the EAMT Machine Translation Workshop, TKE '96, pp. 113–126. Vienna, Austria.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the Demo and Poster Sessions of the 45th Annual Meeting of the Association for Computational Linguistics (ACL '07), pp. 177–180. Prague, Czech Republic.

Levenshtein, Vladimir I. 1965. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR, 163 (4): 845–848. [Reprinted in: Soviet Physics Doklady, 10: 707–710.]

Och, Franz Josef and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29 (1): 19–51.

Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318. Philadelphia, PA.

Simard, Michel and Pierre Isabelle. 2009. Phrase-based Machine Translation in a Computer-assisted Translation Environment. In The Twelfth Machine Translation Summit (MT Summit XII), pp. 120–127. Ottawa, ON, Canada.

Smith, James and Stephen Clark. 2009. EBMT for SMT: A New EBMT–SMT Hybrid. In Proceedings of the 3rd International Workshop on Example-Based Machine Translation (EBMT '09), eds. Mikel L. Forcada and Andy Way, pp. 3–10. Dublin, Ireland.

Snover, Matthew, Bonnie J. Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA '06), pp. 223–231. Cambridge, MA.

Tinsley, John. 2010. Resourcing Machine Translation with Parallel Treebanks. PhD thesis, School of Computing, Dublin City University. Dublin, Ireland.

Zhechev, Ventsislav. 2010. Automatic Generation of Parallel Treebanks: An Efficient Unsupervised System. Lambert Academic Publishing.

Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation

Chi-kiu LO and Dekai WU
HKUST
Human Language Technology Center
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
{jackielo, dekai}@cs.ust.hk

Abstract

We present results of an empirical study on evaluating the utility of machine translation output by assessing the accuracy with which human readers are able to complete semantic role annotation templates. Unlike the widely-used lexical and n-gram based or syntactic based MT evaluation metrics, which are fluency oriented, our results show that using semantic role labels to evaluate the utility of MT output achieves higher correlation with human judgments on adequacy. In this study, human readers were employed to identify the semantic role labels in the translation. For each role, the filler is considered an accurate translation if it expresses the same meaning as that annotated in the gold standard reference translation. Our SRL based f-score evaluation metric has a far higher correlation coefficient with the human judgement on adequacy than either BLEU or the syntactic tree based MT evaluation metric STM. Our results strongly indicate that using semantic role labels for MT evaluation can be significantly more effective and better correlated with human judgement on adequacy than BLEU and STM.

Introduction

In this paper we show that evaluating machine translation quality by assessing the accuracy of human performance in reconstructing the semantic frames from the MT output has a higher correlation with human judgment on translation adequacy than (1) the widely-used lexical n-gram precision based MT evaluation metric BLEU (Papineni et al., 2002), as well as (2) the best-known syntactic tree precision based MT evaluation metric STM (Liu and Gildea, 2005). At the same time, unlike some highly labor intensive evaluation metrics, such as HTER (Snover et al., 2006), our proposed semantic metric only requires simple and minimal instructions to the human judges involved in the evaluation cycle.

We argue that neither n-gram based metrics like BLEU nor syntax-based metrics like STM adequately capture the similarity in meaning between the machine translation and the reference translation, which ultimately is essential for translations to be useful.

First, n-gram based metrics assume that "good" translations share the same lexical choices with the reference translation. While the BLEU score performs well in capturing translation fluency, Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality. The underlying reason is that lexical similarity does not adequately reflect the similarity in meaning.

Second, just like n-gram based metrics such as BLEU, syntax-based metrics are still more fluency-oriented than adequacy/accuracy-oriented. While STM addresses the failure of BLEU in evaluating translation grammaticality, a grammatical translation can nonetheless achieve a high STM score even if it contains errors arising from confusion of semantic roles. Syntactic structure similarity still inadequately reflects similarity of meaning.

As MT systems improve, the shortcomings of lexical n-gram based and syntax-based evaluation metrics are becoming more apparent.

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 52-60, COLING 2010, Beijing, August 2010.

Figure 1: Example of semantic frames in Chinese input, English reference translation and MT output.

State-of-the-art MT systems are often able to output translations containing roughly the correct words and being almost grammatical, but not expressing meaning that is close to the source input. We adopt at the outset the principle that a good translation is one from which human readers may successfully understand at least the basic event structure, "who did what to whom, when, where and why" (Pradhan et al., 2004), which represents the most important meaning of the source utterances. Our objective is to evaluate how well the most essential semantic information is being captured by machine translation systems, from the user's point of view.

In this paper, we describe in detail the methodology that underlies the new semantic machine translation evaluation metrics we are developing. We present the results of the study on evaluating machine translation utility by measuring the accuracy with which human readers are able to complete semantic role annotation templates. Last but not least, we show that our proposed evaluation metric has a higher correlation with human judgment on adequacy.

Related Work

Semantic models in SMT

Much recent work has been done on applying different semantic models to statistical machine translation. Word sense disambiguation (WSD) models combine a wide range of context features into a single lexical choice prediction, as in the work of Carpuat and Wu (2007), Chan et al. (2007), and Giménez and Màrquez (2008a). In particular, Phrase Sense Disambiguation (PSD), a generalization of the WSD approach, automatically acquires fully phrasal translation lexicons and provides a context-dependent probability distribution over the possible translation candidates for any given phrasal lexicon (Carpuat and Wu, 2007).

Another recent research direction on semantic SMT is applying semantic role labeling models. Semantic role labeling (SRL) is the task of identifying the semantic predicate-argument structures within a sentence. Semantic role labels represent an abstract level of understanding in meaning. There is an increasing availability of large parallel corpora annotated with semantic role information, in particular in the work of Palmer et al. (2005) and Fung et al. (2006).

Fung et al. (2006) also report the F-score of the best monolingual shallow semantic parser and the accuracy of the best cross-lingual semantic verb frame argument mappings. Recent work by Wu and Fung (2009a) and Wu and Fung (2009b) has begun to apply SRL to statistical machine translation, using a semantic reordering model based on SRL that successfully returns a better translation with fewer semantic role confusion errors.

With the recent rise of work applying semantic models to statistical machine translation, there is a high demand for MT evaluation metrics that are directly sensitive to the semantic improvement made. We believe evaluating machine translation utility based on semantic roles should reflect semantic improvement better than current widely-used automated n-gram precision based MT evaluation metrics like BLEU, or fluency-oriented syntactic MT evaluation metrics like STM.

The example in Figure 1 is labeled with semantic roles in the Propbank convention. (src) shows a fragment of a typical Chinese source sentence drawn from the newswire genre of the evaluation corpus. (ref) shows the corresponding fragment of the English reference translation. MT1, MT2 and MT3 show the three corresponding fragments of the machine translation output from three different MT systems.

A relevant subset of the semantic roles and predicates has been annotated in these fragments using the PropBank convention of OntoNotes. In the Chinese source sentence there are two main verbs, marked PRED. The first verb (glossed "cease") has three arguments: one in an ARG experiencer role (glossed "the complete range of SK-II products"), one in an ARGM-LOC location role (glossed "in mainland China"), and one in an ARGM-EXT extent role (glossed "for almost two months"). The second verb (glossed "resumed") also has three arguments: two in ARG agent roles (glossed "the complete range of SK-II products which sales had ceased in mainland China for almost two months" and "sales") and one in an ARGM-ADV role (glossed "until then").

In the corresponding English target, there are also two main verbs marked PRED. The first verb "ceased" has three arguments: one in an ARG experiencer role ("their sales"), one in an ARGM-LOC role ("in mainland China"), and one in an ARGM-TMP temporal role ("for almost two months"). The second verb "resumed" also has three arguments: two in ARGM-TMP temporal roles ("until after their sales ceased in mainland China for almost two months" and "now") and one in an ARG experiencer role ("sales of the complete range of SK-II products").

Similarly, the first two MT outputs are also annotated with semantic roles in the PropBank convention. Since there is no verb appearing in the third MT output, no predicate-argument structure is annotated.

STM: syntax-based MT evaluation

Liu and Gildea (2005) proposed to use syntactic features in MT evaluation and developed the subtree metric STM, which is based on the similarity of the syntax tree of the MT output and that of the reference. It is the first proposed metric that incorporates syntactic features in MT evaluation and underlies all the other recently proposed syntactic MT evaluation metrics.

STM is a precision based metric that captures the fraction of the subtrees at each specific depth of the MT output syntax tree which also appear in the reference syntax tree. The fractions for the different depths are then averaged with an arithmetic mean:

  STM = (1/D) Σ_{n=1..D} [ Σ_{t ∈ subtrees_n(MT)} count_clip(t) / Σ_{t ∈ subtrees_n(MT)} count(t) ]

where D is the maximum depth of subtrees considered, count(t) denotes the number of times subtree t appears in the MT output's syntax tree, and count_clip(t) denotes the clipped number of times t is found in the reference's syntax tree: each subtree in the reference will only be found once.

Figure 2 shows the syntax tree of a reference translation and that of the corresponding MT output. For this example, we set the maximum depth of subtree considered to 4. There are seven depth-1 subtrees in the MT output: S, NP, VP, PRP, V,

NP and PRP, of which only six appear in the reference (S, NP, VP, PRP, V and NP). Note that the found count of PRP should be 1 rather than 2, because there is only one PRP in the reference translation syntax tree. For depth 2, there are four subtrees in the MT output (S → NP VP, NP → PRP, VP → V NP, and NP → PRP), of which three appear in the reference (S → NP VP, NP → PRP and VP → V NP). Similarly, one out of two depth-3 subtrees and zero out of one depth-4 subtrees in the MT output are found in the reference. Therefore, the final STM score for this example is (6/7 + 3/4 + 1/2 + 0/1) / 4 ≈ 0.53.

Figure 2: Example for the computation of STM.

HTER: a non-automated MT evaluation metric

Human-targeted Translation Edit Rate (HTER), in the work of Snover et al. (2006), is a non-automatic machine translation evaluation metric based on the number of edits required to correct the translation hypotheses. A human annotator edits each MT hypothesis so that it is meaning-equivalent with the reference translation, emphasizing the minimum possible number of edits. The Translation Edit Rate (TER) is then calculated using the human-edited translation as a targeted reference for the MT hypothesis.

HTER is highly labor intensive in the evaluation process. The human annotators are not only required to understand the meaning expressed in the reference translation and the machine translation, but are also required to propose the minimum possible number of edits to the translation hypotheses. With such heavy-duty human decision requirements, the cost of evaluation is enormously increased, bottlenecking the evaluation cycle. Instead, we believe that any human decisions in the evaluation cycle should be reduced to be as simple as possible.

MT evaluation metric based on semantic role overlap

Giménez and Màrquez introduced ULC, a new automatic MT evaluation metric in which a series of linguistic features are combined together. One of those linguistic features is shallow semantic similarity based on semantic role overlap. The semantic role overlap metric calculates the lexical overlap between semantic roles of the same type in the machine translation output and the corresponding reference translations, and then takes the average lexical overlap over all semantic role types. Despite the fact that the metric shows an improved correlation with human judgment of translation quality (Giménez and Màrquez, 2008; Callison-Burch et al., 2007; 2008), it is not commonly used in large-scale MT evaluation campaigns. The reason may lie in the high time cost.

Semantic role translation accuracy

To evaluate the semantic utility of machine translation output, we conduct a comparative analysis on the Propbank annotation templates completed by the human readers in the machine translation output versus the reference translation.

We believe it is important to first focus on developing simple measures to evaluate machine translation utility that make use of human extraction of role information. It is necessary to first understand the upper bounds of human performance on this task, as a foundation for better design of efficient automated metrics.

Evaluation corpus

The sentences of the evaluation corpus are randomly drawn from the newswire genre of the DARPA GALE program evaluation. For each Chinese input sentence, there is one corresponding English reference translation and there are three state-of-the-art machine translation systems' outputs. The Chinese source and the English reference are annotated with gold standard semantic role labels in Propbank style.
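As a worked illustration of the subtree metric STM of Liu and Gildea (2005), the clipped per-depth counting can be sketched in a few lines. This is a hedged sketch only: the tuple tree encoding, the helper names, and the rule that a depth-n subtree is counted only when all its branches reach the full depth are our own assumptions, not taken from the paper.

```python
from collections import Counter

def sig(tree, depth):
    """Signature of `tree` truncated to `depth` levels, or None if the tree
    is shallower than `depth` (only full-depth subtrees are counted here).
    A tree is a (label, [children]) pair; a leaf is (label, [])."""
    label, children = tree
    if depth == 1:
        return label
    if not children:
        return None
    child_sigs = tuple(sig(c, depth - 1) for c in children)
    if None in child_sigs:
        return None
    return (label, child_sigs)

def all_sigs(tree, depth):
    """Depth-`depth` subtree signatures rooted at every node of `tree`."""
    s = sig(tree, depth)
    if s is not None:
        yield s
    for child in tree[1]:
        yield from all_sigs(child, depth)

def stm(mt_tree, ref_tree, max_depth=3):
    """Arithmetic mean over depths of the clipped fraction of MT subtrees
    also found in the reference tree."""
    fractions = []
    for n in range(1, max_depth + 1):
        mt = Counter(all_sigs(mt_tree, n))
        ref = Counter(all_sigs(ref_tree, n))
        total = sum(mt.values())
        # clip each MT subtree count by its count in the reference
        clipped = sum(min(c, ref[s]) for s, c in mt.items())
        fractions.append(clipped / total if total else 0.0)
    return sum(fractions) / max_depth
```

On a toy pair of trees mirroring the running example, the depth-1 fraction comes out as 6/7 precisely because the clipped count of the duplicated PRP is 1.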

Figure 3: Example given to human annotators demonstrating how to label the semantic frames.

Table 1: List of semantic roles that human judges are requested to label.

  Label       | Event | Label                | Event
  Actor       | who   | Temporal             | when
  Action      | did   | Location             | where
  Experiencer | what  | Other adverbial arg. | why / how
  Patient     | whom  |                      |

Reconstruction of semantic frames in MT output

Four groups of bilingual Chinese-English human annotators are employed to conduct the analysis. One group of them is given the reference translation; this sanity check serves as the control condition of the analysis. Each of the other three groups is given one set of the machine translation system output. The four groups are all disjoint, such that no annotator annotates more than one sentence from an MT/reference set, to avoid contamination in the annotators' judgments. To reduce the effect of personal bias on the annotations, each sentence is annotated by at least two human annotators. The results are reported as the average among the annotators.

With the aim of evaluating machine translation utility from a user standpoint, we have simplified the Propbank annotation into a more intuitive event structure, i.e. "who did what to whom, when, where, why and how". Since the layman annotators find it difficult to distinguish between the "why" and "how" event types, we have combined the "why" and "how" events into one "Other adverbial argument" label. Human annotators are given simple and minimal instructions on what to label, and two examples demonstrating how to label. Table 1 shows the list of labels annotators are requested to annotate. Figure 3 shows the example shown to the human annotators on how to label semantic frames.

After reconstruction of the semantic frames, the annotated machine translation outputs are distributed to another disjoint group of three monolingual human judges. The human judges are required to match each predicate in the reference translation with those annotated in the MT output. Then, for each matched predicate, they are required to judge whether each of the associated arguments in the reference translation is translated and annotated in the MT output. Translations of the semantic frames are judged Correct if they express the same meaning as that of the reference translations or the original source input. Translations of the semantic frames are judged Incorrect if they express meaning(s) that belongs in other arguments. Translations of the semantic frames may also be judged Partial if only part of the meaning is correctly expressed. Extra meaning in the semantic frames will not be penalized unless it belongs in another argument. The partially correct category is designed to facilitate a finer-grained measurement of the translation utility.

SRL based evaluation metric

Based on the comparative matrices collected from the human judges, a precision/recall analysis of the accuracy of the reconstructed semantic frames, reflecting the utility of each machine translation system, can be done.

Table 2: SRL annotation of the example MT1 output in Figure 1, and the human judgement on translation correctness of each argument.

  SRL  | Reference                                                                   | MT1                                                   | Decision
  PRED | ceased                                                                      | -                                                     | not match
  PRED | resumed                                                                     | resume                                                | match
  ARG  | -                                                                           | sk - ii the sale of products in the mainland of China | incorrect
  ARG  | sales of complete range of SK - II products                                 | sales                                                 | partial
  TMP  | Until after their sales had ceased in mainland China for almost two months  | So far , nearly two months                            | partial
  TMP  | now                                                                         | -                                                     | incorrect

The precision, recall and f-score of the SRL based MT evaluation metric are defined in terms of the translation accuracy of predicate-argument structures:

  precision = (C_MT + w_partial · P_MT) / T_MT
  recall    = (C_ref + w_partial · P_ref) / T_ref
  f-score   = 2 · precision · recall / (precision + recall)

Here C_MT and P_MT are the weighted numbers of correctly and partially correctly translated predicates and arguments counted against the MT output, C_ref and P_ref are the corresponding counts against the reference, and T_MT and T_ref are the weighted total numbers of predicates, core arguments and adjunct arguments of the matched predicates in the MT output and in the reference. C_MT + w_partial · P_MT is the sum of the portions of correctly or partially correctly translated predicate-argument structures in the MT output and can be viewed as the true positives for precision; C_ref + w_partial · P_ref plays the same role for recall.

Note that w_pred, w_core and w_adj are the weights for the matched predicate, the core arguments and the adjunct arguments. These weights can be viewed as the importance of the meanings in the different categories of semantic roles. In this very first preliminary study we have set them all to 1, and we expect that tuning these weights can further increase the correlation of the evaluation metric with human judgment of translation utility. w_partial is the weight for partially correctly translated arguments; in this experiment we have arbitrarily set it to 0.5.

If all the reconstructed semantic frames in the MT output are completely identical to the gold standard annotation in the reference translation, and all the arguments in the reconstructed frames express the same meaning as the corresponding arguments in the reference translations, the f-score of the SRL based MT evaluation metric will be equal to 1.

Experiment and Results

Table 2 shows the SRL annotation of MT1 by one of the annotators for the example in Figure 1, together with the human judgement on translation correctness of each argument.
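The scoring can be sketched as follows. This is a hedged reading of the definitions: the per-predicate bookkeeping, the class and field names, and the use of a single shared numerator for precision and recall are our simplifications, with w_core = w_adj = 1 and w_partial = 0.5 as illustrative settings.

```python
from dataclasses import dataclass

@dataclass
class MatchedPredicate:
    """Judged argument counts for one predicate matched between MT and reference."""
    correct_core: int    # core arguments judged Correct
    partial_core: int    # core arguments judged Partial
    correct_adj: int     # adjunct arguments judged Correct
    partial_adj: int     # adjunct arguments judged Partial
    mt_core: int         # total core arguments of this predicate in the MT output
    mt_adj: int          # total adjunct arguments in the MT output
    ref_core: int        # total core arguments in the reference
    ref_adj: int         # total adjunct arguments in the reference

def srl_f_score(preds, w_core=1.0, w_adj=1.0, w_partial=0.5):
    """F-score over judged predicate-argument structures (a sketch)."""
    hit = sum(w_core * (p.correct_core + w_partial * p.partial_core) +
              w_adj * (p.correct_adj + w_partial * p.partial_adj)
              for p in preds)
    mt_total = sum(w_core * p.mt_core + w_adj * p.mt_adj for p in preds)
    ref_total = sum(w_core * p.ref_core + w_adj * p.ref_adj for p in preds)
    precision = hit / mt_total if mt_total else 0.0
    recall = hit / ref_total if ref_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

A partially correct argument contributes half of a correct one to the true-positive mass, so a hypothesis with many Partial judgments is rewarded more than one with the same number of Incorrect judgments.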

The predicate "ceased" in the reference translation did not match any predicate annotated in MT1, while the predicate "resumed" matched the predicate "resume" annotated in MT1. The ARGM-TMP argument "Until after their sales had ceased in mainland China for almost two months" in the reference translation is partially translated to the ARGM-TMP argument "So far, nearly two months" in MT1; the ARG argument "sales of the complete range of SK-II products" in the reference translation is partially translated to the ARG argument "sales" in MT1; and the ARGM-TMP argument "now" in the reference translation is missing in MT1. The SRL based f-score of this example is computed from these judgments; the final sentence-level SRL based MT evaluation metric of MT1 is the f-score averaged over all annotators.

Table 3 shows the results of the SRL based MT evaluation metric averaged over all annotators and all sentences. Our results show that the evaluation metric can successfully distinguish the translation utility of the human translation and the three MT systems; at the system level, MT1 provides the most accurate translation.

Table 3: SRL based MT evaluation averaged over all annotators and all sentences (precision, recall and f-score for the Reference and for MT1, MT2 and MT3).

Inter-annotator Agreement

We measured the inter-annotator agreement on two tasks: role identification and role classification. The standard f-score is used to measure the agreement on SRL annotation, as in Brants (2000). For role identification, the agreement is counted on the matching of the word spans of the annotated arguments, with a tolerance of one word in mismatch. The tolerance is designed for the fact that annotators are not consistent in handling the articles or punctuation at the beginning or the end of the annotated arguments. We report the agreement rate on SRL annotations in role identification of the reference translation and that on the MT output. For role classification, in addition to the requirement of matching word spans as in the role identification task, the agreement is counted on the matching of the semantic role labels within two aligned word spans; again we report the agreement rates on the reference translation and on the MT output.

The results show that, with such minimal training, the layman annotators perform consistently in identifying the semantic structure in both the reference translation and the MT output. The results also suggest that the layman annotators have problems with role confusion, and we believe that a slightly more detailed explanation of the role labels may help to clear up the confusion.

Correlation with human judgments on translation adequacy

We used the Spearman rank correlation coefficient to measure the correlation of the evaluation metrics with the human judgment on adequacy at the sentence level, and took the average over the whole data set. The human judgment on adequacy was obtained by showing all three MT outputs together with the Chinese source input to a human reader. The human reader was instructed to order the sentences from the three MT systems according to the accuracy of meaning in the translations. For the MT output, we ranked the sentences from the three MT systems according to the raw scores of the evaluation metrics. The STM scores are calculated based on the syntax trees of the reference and the MT output, parsed by the Charniak parser (Charniak, 2000). Table 4 shows the raw scores of the example in Figure 1 under our proposed SRL based evaluation metric, sentence-level BLEU and sentence-level STM, the corresponding ranks assigned to each of the systems, and the human ranks on adequacy.

The Spearman rank correlation coefficient can be calculated using the following simplified equation:

  ρ = 1 − (6 Σ d_i²) / (n (n² − 1))

where d_i is the difference between the ranks assigned to system i by the evaluation metric and by the human judgment, and n is the number of systems.
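The simplified Spearman equation, which assumes distinct ranks with no ties, is a one-liner in code (a small illustrative helper, not part of the paper's toolchain):

```python
def spearman_rho(metric_ranks, human_ranks):
    """Simplified Spearman rank correlation, valid when ranks have no ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    if len(metric_ranks) != len(human_ranks):
        raise ValueError("rank lists must have equal length")
    n = len(metric_ranks)
    d_squared = sum((m - h) ** 2 for m, h in zip(metric_ranks, human_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Identical rankings give 1.0 and a fully reversed ranking gives -1.0, matching the interpretation of the coefficient's range described in the text.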

Table 4: Sentence-level SRL based f-score (averaged over annotators), sentence-level BLEU and sentence-level STM, with the rank assigned by each metric and the human rank on adequacy, for the example in Figure 1.

  System | MT output
  Src    | (Chinese source sentence)
  Ref    | Until after their sales had ceased in mainland China for almost two months, sales of the complete range of SK - II products have now been resumed.
  MT1    | So far , nearly two months sk - ii the sale of products in the mainland of China to resume sales.
  MT2    | So far , in the mainland of China to stop selling nearly two months of SK - products sales resumed.
  MT3    | So far the sale in the mainland of China for nearly two months of SK - II line of products.

The range of possible values of the correlation coefficient is [−1, 1], where 1 means the systems are ranked in the same order as by the human judgment and −1 means the systems are ranked in the reverse order. A higher value of ρ indicates that the ranking by the evaluation metric is more similar to the human judgment.

Our results show that the proposed SRL based evaluation metric has a higher correlation with the human judgment on adequacy than either the BLEU or STM metrics. Table 5 compares the average sentence-level correlation for our proposed SRL based evaluation metric, BLEU and STM. The correlation coefficient for STM is significantly better than that for BLEU, but still falls far short of our SRL based metric.

Table 5: Average sentence-level correlation with human judgment for the evaluation metrics (SRL based evaluation, BLEU, STM).

Conclusions and Future Work

We presented results of an empirical study on evaluating the utility of MT output by assessing the accuracy with which human readers are able to complete the SRL templates. The SRL based f-score evaluation metric we proposed provides an intuitive picture of how much information of the original source input machine translation users can extract by reading the MT output. Compared to HTER, where the human decision load is heavy and requires advance knowledge of how to fix the translation with minimum change, only minimal instructions need to be given to the human readers in our proposed metric. The human readers need not necessarily be translation experts.

We evaluated the proposed SRL based metric against human judgment on adequacy using the Spearman correlation coefficient. The proposed SRL based evaluation metric was found to be significantly better correlated with human judgment on adequacy than either BLEU or the syntax-based evaluation metric STM. Our results show that using SRL in semantic MT evaluation is a highly promising direction for further research.

Our current direction is to discriminatively tune the weights within the SRL based evaluation metric so as to further increase the correlation of the metric with human judgment. Our other main avenue of current work is to construct automated metrics approximating the

evaluation method described here, which provides an upper bound for automated SRL-based metrics. With the improving performance of shallow semantic parsers, we believe that the proposed evaluation metric could be further developed into inexpensive automatic MT evaluation metrics.

Acknowledgments

This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under GALE contracts, and by Hong Kong Research Grants Council (RGC) research grants. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

References

T. Brants. 2000. Inter-annotator agreement for a German newspaper corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000).

Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL 2006.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. 2007. (Meta-)evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Marine Carpuat and Dekai Wu. 2007. Improving Statistical Machine Translation using Word Sense Disambiguation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Conference on Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, June 2007.

Y.S. Chan, H.T. Ng and D. Chiang. 2007. Word sense disambiguation improves statistical machine translation. In 45th Annual Meeting of the Association for Computational Linguistics.

Pascale Fung, Zhaojun Wu, Yongsheng Yang and Dekai Wu. 2006. Automatic Learning of Chinese English Semantic Structure Mapping. In IEEE Spoken Language Technology Workshop (SLT 2006).

Jesús Giménez and Lluís Màrquez. Discriminative Phrase Selection for Statistical Machine Translation. Learning Machine Translation.

Jesús Giménez and Lluís Màrquez. 2007. Linguistic features for automatic evaluation of heterogeneous MT systems. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic. Association for Computational Linguistics.

Jesús Giménez and Lluís Màrquez. 2008. A smorgasbord of features for automatic MT evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, OH. Association for Computational Linguistics.

Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation. Association for Computational Linguistics.

D. Liu and D. Gildea. 2005. Syntactic features for evaluation of machine translation. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

Martha Palmer, Daniel Gildea and Paul Kingsbury. 2005. The Proposition Bank: an Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1).

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics.

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin and Dan Jurafsky. 2004. Shallow Semantic Parsing Using Support Vector Machines. In Proceedings of NAACL-HLT 2004.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas.

Dekai Wu and Pascale Fung. 2009a. Can Semantic Role Labeling Improve SMT? In Proceedings of the 13th Annual Conference of the EAMT, Barcelona, Spain, May 2009.

Dekai Wu and Pascale Fung. 2009b. Semantic Roles for SMT: A Hybrid Two-Pass Model. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics.

Arabic morpho-syntactic feature disambiguation in a translation context

Ines Turki Khemakhem, Salma Jamoussi, Abdelmajid Ben Hamadou
MIRACL Laboratory, ISIM Sfax, Pôle Technologique
[email protected], [email protected], [email protected]

Abstract

Morphological analysis and disambiguation are crucial stages in a variety of natural language processing applications such as machine translation, especially when languages with complex morphology are concerned, such as Arabic. Arabic is a highly flexional language, in that the same root can lead to various forms according to its context. In this paper, we present a system which disambiguates the output of a morphological analyzer for Arabic. The Arabic morphological analyzer used produces the set of all possible morphological analyses for each word, among which there is a unique correct syntactic feature. We choose the correct features using the features generated by the morphological analyzer for the French language on the other side. To obtain this data, we used the results of word alignment trained with GIZA++ (Och and Ney, 2003).

1 Introduction

Arabic is characterized by a rich morphology. Due to the fact that the Arabic script usually does not encode short vowels, the degree of morphological ambiguity is very high. In addition to being inflected for gender and number, words can be attached to various clitics: the conjunction "و" (and), the definite article "ال" (the), prepositions (e.g. "ب" (by/with), "ل" (for), "ك" (as)), and object pronouns (e.g. "هم" (their/them)). The morphological analysis of a word consists of determining morphological information about each word, such as part-of-speech (i.e., noun, verb, particle, etc.), voice, gender, number, information about the clitics, etc. Morphological analysis and disambiguation are crucial pre-processing steps for a variety of natural language processing applications, from search and information extraction to machine translation. For languages with complex morphology these are nontrivial processes.

Arabic words are often ambiguous in their morphological analysis. This is due to Arabic's rich system of affixation and clitics and the omission of disambiguating short vowels. The problem is that many words have different meanings depending on their diacritization. This leads to ambiguity when processing data for natural language processing applications such as machine translation. This property has an important implication for statistical modeling of the Arabic language.

In this paper, we present a novel morphology preprocessing technique for Arabic. We exploit the Arabic-French morphology alignment to choose the correct morpho-syntactic feature produced by a morphological analyzer.

This paper is organized as follows: section 2 gives a brief description of some related work on morphological disambiguation. Section 3 presents the morphological analyzer Morph2 used for Arabic texts, which is able to recognize word composition and to provide more specific morphological information about it. We present in section 4 some problems in contextual morpho-syntactic feature choice; in the remainder of that section we discuss the complexity of Arabic morphology and the challenge of morphological disambiguation. Section 5 gives a short overview of the data and tools used for our Arabic word morpho-syntactic feature disambiguation and shows the experimental details of our system. Finally, section 6 presents some conclusions.
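The alignment-based disambiguation idea can be pictured with a minimal sketch: among the analyses proposed for an Arabic word, keep the one whose features agree best with the features of the French word that GIZA++ aligned to it. The feature names and dictionary encoding below are illustrative assumptions, not the actual output format of the analyzer or of GIZA++.

```python
def choose_analysis(candidate_analyses, french_features):
    """Among the Arabic analyses proposed by the morphological analyzer,
    pick the one whose morpho-syntactic features (POS, gender, number, ...)
    agree best with those of the aligned French word (a sketch)."""
    def agreement(analysis):
        # count how many feature/value pairs also hold for the French side
        return sum(1 for feat, value in analysis.items()
                   if french_features.get(feat) == value)
    return max(candidate_analyses, key=agreement)

# Hypothetical example: an ambiguous Arabic form analyzable as a singular
# noun or a plural verb, aligned to a plural French verb.
candidates = [
    {"pos": "noun", "gender": "masc", "number": "sg"},
    {"pos": "verb", "number": "pl", "person": "3"},
]
aligned_french = {"pos": "verb", "number": "pl"}
```

Here `choose_analysis(candidates, aligned_french)` selects the verb analysis, since it shares two feature values with the French side against zero for the noun reading.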

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 61–65, COLING 2010, Beijing, August 2010.

2 Related work

Morphological analysis and disambiguation are crucial pre-processing steps for a variety of natural language processing applications. Previous research has focused on disambiguating the output of a morphological analyzer. Hajic (2000) is the first to use a dictionary as a source of possible morphological analyses (and hence tags) for an inflected word form. He convincingly shows that for five Eastern European languages with complex inflection, plus English, using a morphological analyzer improves the performance of a tagger. He concludes that for highly inflectional languages "the use of an independent morphological dictionary is the preferred choice [over] more annotated data". He redefines the tagging task as a choice among the tags proposed by the dictionary, using a log-linear model trained on specific ambiguity classes for individual morphological features. Hajic (2000) thus demonstrates convincingly that morphological disambiguation can be aided by a morphological analyzer which, given a word without any context, provides the set of all its possible morphological tags.

The only work on Arabic tagging that uses a corpus for training and evaluation, (Diab et al., 2004), does not use a morphological analyzer. Diab et al. (2004) perform tokenization, POS tagging and base phrase chunking using an SVM based learner.

The Morphological Analysis and Disambiguation of Arabic (MADA) system is described in (Habash and Rambow, 2005). The basic approach used in MADA is inspired by the work of Hajic (2000) on tagging morphologically rich languages, extended to Arabic. Habash and Rambow (2005) use SVM classifiers for individual morphological features and a simple combining scheme for choosing among the competing analyses proposed by the dictionary.

3 Arabic word segmenter

Arabic is a morphologically complex language. Compared with French, an Arabic word can sometimes correspond to a whole French sentence. Example: the Arabic word "أتتذكروننا" corresponds in French to the sentence "Est-ce que vous vous souvenez de nous" (in English: "Do you remember us").

The aim of a morphological analysis step is to recognize word composition and to provide specific morphological information about it. For example, the word "يعرفون" (in French: connaissent, in English: they know) is the result of the concatenation of the prefix "ي" indicating the present tense and the suffix "ون" indicating the masculine plural, around the verb "عرف" (in French: connaître, in English: to know). The morphological analyzer determines for each word the list of all its possible morphological features.

In Arabic, some conjugated verbs or inflected nouns can have the same orthographic form due to the absence of vowels. Example: the non-voweled Arabic word "فصل" can be a verb in the past "فَصَلَ" (he dismissed), a masculine noun "فَصْلٌ" (chapter / season), or a concatenation of the coordinating conjunction "فَ" (then) with the verb "صل" (imperative of the verb "bind").

In this work, in order to handle the morphological ambiguities, we decided to use MORPH2 (Belguith et al., 2006), an Arabic morphological analyzer developed at the Miracl laboratory¹. MORPH2 is based on a knowledge-based computational method. It accepts as input an Arabic text, a sentence or a word. Its morphological disambiguation and analysis method is based on five steps:

• A tokenization process is applied in a first step. It consists of two sub-steps. First, the text is divided into sentences using the system Star (Belguith et al., 2005), an Arabic text tokenizer based on the contextual exploration of punctuation marks and coordinating conjunctions. The second sub-step detects the different words in each sentence.

• A morphological preprocessing step aims to extract the clitics agglutinated to the word. A filtering process is then applied to check whether the remaining word is a particle, a number, a date, or a proper noun.

¹ http://www.miracl.rnu.tn

• An affixal analysis is then applied to determine all possible affixes and roots. It aims to identify the basic elements belonging

to the constitution of a word (the root and the affixes, i.e. prefix, infix and suffix).

• The morphological analysis step consists of determining for each word all its possible morpho-syntactic features (i.e., part of speech, gender, number, time, person, etc.). Morpho-syntactic feature detection is made up of three stages. The first stage identifies the part-of-speech of the word (i.e. verb "فعل", noun "اسم", particle "أداة" and proper noun "اسم علم"). The second stage extracts for each part-of-speech a list of its morpho-syntactic features. A filtering of these feature lists is made in the third stage.

• Vocalization and validation step: each handled word is fully vocalized according to its morpho-syntactic features determined in the previous step.

In our method, each Arabic word in the Arabic data is replaced by its segmented form, where the stem, clitics and affixes are tagged with their morphological classes (e.g. proclitic, prefix, stem, suffix and enclitic). For example, the word "فعرفناهم" (in French: "et nous les avons connu", in English: "and we have known them") is the result of the concatenation of the proclitic "فَ" (then, a coordinating conjunction), the stem "عرف", the suffix "نا" for the present masculine plural, and the enclitic "هم" (the masculine plural possessive pronoun). So, the word "فعرفناهم" will be replaced by:

"ف_proclitic عرف_stem نا_suffix هم_enclitic"

4 Problems in contextual morpho-syntactic feature choice

As mentioned in section 1, ambiguities in Arabic words are mainly caused by the absence of short vowels; thus a word can have different meanings. There are also the usual homographs of uninflected words, with or without the same pronunciation, which have different meanings and usually different POS's. For example, the word "ذهب" can mean "gold" or "go" in English. In Arabic there are four categories of words: noun, proper noun, verb and particle. The absence of short vowels can cause ambiguities within the same category or across different categories. For example, the word "بعد" corresponds to many categories (Table 1).

Meaning of the word "بعد" | Category
after | Particle
remoteness | Noun
remove | Verb
go away | Verb

Table 1. Different meanings of the word "بعد"

Arabic uses a diverse system of prefixes, suffixes, and pronouns that are attached to words (Soudi, 2007). In fact, the Arabic word "وضع" can be a verb in the past ("he filed"), a masculine noun ("state"), or a concatenation of the coordinating conjunction "و" (and) with the verb "ضع" (imperative of "file"). For this reason, correct morphological analysis is required to resolve structural ambiguities in Arabic sentences.

5 Word morpho-syntactic feature disambiguation

5.1 Training corpus

The training corpus used in this work is an Arabic-French bitext aligned at the sentence level. Each Arabic word in the Arabic data is replaced by its segmented form. On the other side, the French corpus is part-of-speech (POS) tagged using the TreeTagger tool (Schmid, 1994), which annotates text with part-of-speech and lemma information.

5.2 Alignment model

The alignment model was trained with GIZA++ (Och and Ney, 2003), which implements the most typical IBM and HMM alignment models. The alignment model used consists of IBM-1, HMM, IBM-3 and IBM-4.

5.3 Using TreeTagger for Arabic word morpho-syntactic feature disambiguation

To pre-process the Arabic data, we use the MORPH2 morphological analyzer (Belguith et al., 2006). A sample output of the morphological analyzer is shown in Figure 1.

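Concretely, the segmented form used throughout sections 3 and 5.1 can be rendered as follows. This is a minimal sketch assuming a MORPH2-style clitic segmentation (lists of proclitics, a stem, suffixes and enclitics) is already available as input; MORPH2 itself is not reimplemented here.

```python
# Sketch: render a clitic-segmented Arabic word in the "token_class"
# format used for alignment. The segmentation input is assumed to come
# from a MORPH2-style analyzer (not reimplemented here).

def segmented_form(proclitics, stem, suffixes, enclitics):
    """Return the word as space-separated 'token_class' units."""
    parts = [(p, "proclitic") for p in proclitics]
    parts.append((stem, "stem"))
    parts += [(s, "suffix") for s in suffixes]
    parts += [(e, "enclitic") for e in enclitics]
    return " ".join(f"{tok}_{cls}" for tok, cls in parts)

# The paper's example: فعرفناهم = ف + عرف + نا + هم
print(segmented_form(["ف"], "عرف", ["نا"], ["هم"]))
# ف_proclitic عرف_stem نا_suffix هم_enclitic
```

Run on the paper's example, this reproduces the corpus form shown above for "فعرفناهم".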
[Figure 1 image: possible MORPH2 analyses for the word "بعد" — not reproduced]

Figure 1. Possible analyses for the word "بعد"

The obtained output consists of the set of all possible morphological analyses for each word, among which there is a unique correct analysis. One needs to select the right meaning by looking at the context. Given the highly inflectional nature of Arabic, resolving ambiguities is syntactically harder within the same category. We choose the correct output using the features generated by TreeTagger applied to the French corpus.

To obtain this correct feature, we needed to match data in the segmented Arabic corpus to the lexeme and feature representation output by TreeTagger. The matching included the results of the word alignment between the segmented Arabic corpus and the part-of-speech tagged French corpus.

Example: the word "فعرفناهم" (in French: "et nous les avons connu", in English: "and we have known them") is segmented as:

"ف_proclitic عرف_stem نا_suffix هم_enclitic"

The part-of-speech tagged French sentence is:

"et_KON nous_PRO:PER les_PRO:PER avons_VER:pres connu_VER:pper"

The result of the alignment between these two sentences is:

(a) ف_proclitic عرف_stem نا_suffix هم_enclitic

(b) Et  nous    les     avons    connu
    KON PRO:PER PRO:PER VER:pres VER:pper

Figure 2. (a) Output of the MORPH2 morphological analysis: segmented Arabic sentence; (b) French translation and its alignment with the segmented morphological analysis

The stem "عرف" is aligned with the word "connu", the past participle of the verb "connaître" (in English: "known"). We can deduce that the part of speech of the stem "عرف" is verb.

The morpho-syntactic features provided by TreeTagger are verb, proper noun, noun, adjective, adverb, conjunction, pronoun, preposition, etc. The morpho-syntactic feature of an Arabic word aligned with a French word tagged as adjective or noun is replaced by the morpho-syntactic feature noun: "اسم". The morpho-syntactic features adverb, conjunction and preposition are replaced by the Arabic morpho-syntactic feature particle: "أداة".

We can attest that using the morpho-syntactic features provided by the part-of-speech tagged corpus on the other side can resolve the ambiguity of the morpho-syntactic feature of the Arabic word proposed by a morphological analyser, especially for agglutinative and inflectional languages.

5.4 Experimental results

In our experiments, on the entire corpus, the MORPH2 morphological analyzer makes 1152 errors (27%). Table 2 shows the results obtained with the morphological analyzer MORPH2 alone (BASELINE) and the results obtained with the Arabic morphology-French alignment (treetagger-to-morph2), where the alignment is used to choose a correct

morpho-syntactic feature produced by the morphological analyzer MORPH2.

System | Accuracy (%)
BASELINE | 73%
treetagger-to-morph2 | 88%

Table 2. Results of treetagger-to-morph2 compared against BASELINE on the task of POS tagging of Arabic text

Thus one can observe that, for Arabic, treetagger-to-morph2 outperforms the BASELINE tagger, with a significant absolute difference of 15% in tagging accuracy. The errors encountered result from confusing nouns with verbs or particles, and vice versa. This is caused by the presence of homographs of Arabic words, which have different meanings and different POS's.

6 Conclusion

Morphological disambiguation of Arabic is a difficult task which involves, in theory, thousands of possible tags. In this paper, we presented a system which disambiguates the output of a morphological analyzer for Arabic, a morphologically rich language for which morphological analysis and disambiguation are crucial stages in a variety of natural language processing applications.

We first applied an Arabic word segmentation step, using the Arabic morphological analyzer MORPH2, to improve the alignment models. Then we used the TreeTagger tool, which annotates text with part-of-speech and lemma information, so that each word of the French corpus is agglutinated to its part-of-speech (POS). The sentence alignment between the Arabic and French corpora was trained with GIZA++. The core idea is to remove the morpho-syntactic ambiguity of the Arabic words in the MORPH2 output by using the part of speech of the corresponding aligned French word. We showed that imposing Arabic-French alignment dependent constraints on the possible sequences of analyses improves morphological disambiguation.

Future work will focus on taking advantage of this technique: we are interested in using word morpho-syntactic feature disambiguation to build an efficient French-Arabic translation system.

References

Belguith L. and Chaâben N. 2006. Analyse et désambiguïsation morphologiques de textes arabes non voyellés. Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles, Leuven, Belgique, 493-501.

Belguith L., Baccour L. and Mourad G. 2005. Segmentation des textes arabes basée sur l'analyse contextuelle des signes de ponctuation et de certaines particules. Actes de la 12ème Conférence annuelle sur le Traitement Automatique des Langues Naturelles, 451-456.

Diab M., Hacioglu K. and Jurafsky D. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL04), Boston, MA.

Habash N. and Rambow O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 573-580.

Hajic J. 2000. Morphological tagging: Data vs. dictionaries. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL'00), Seattle, WA.

Och F. J. and Ney H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1): 19-51.

Och F. J. and Ney H. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4): 417-449.

Schmid H. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, 44-49.

Soudi A., Bosch A. and Neumann G. 2007. Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

A Discriminative Approach for Dependency Based Statistical Machine Translation

Sriram Venkatapathy Rajeev Sangal LTRC, IIIT-Hyderabad LTRC, IIIT-Hyderabad [email protected] [email protected]

Aravind Joshi Karthik Gali1 University of Pennsylvania Talentica [email protected] [email protected]

Abstract

In this paper, we propose a dependency based statistical system that uses discriminative techniques to train its parameters. We conducted experiments on an English-Hindi parallel corpus. The use of syntax (dependency trees) allows us to address the large word-reorderings between English and Hindi, and discriminative training allows us to use rich feature sets, including linguistic features that are useful in the machine translation task. We present results of the experimental implementation of the system.

1 Introduction

Syntax based approaches for Machine Translation (MT) have gained popularity in recent times because of their ability to handle long distance reorderings (Wu, 1997; Yamada and Knight, 2002; Quirk et al., 2005; Chiang, 2005), especially for divergent language pairs such as English-Hindi (or English-Urdu). Languages such as Hindi are also known for their rich morphology and long distance agreement of features of syntactically related units. The morphological richness can be handled by employing techniques that factor the lexical items into morphological factors. This strategy is also useful in the context of English-Hindi MT (Bharati et al., 1997; Bharati et al., 2002; Ananthakrishnan et al., 2008; Ramanathan et al., 2009), where only very limited parallel corpora are available and breaking words into smaller units helps in reducing sparsity. In order to handle phenomena such as long-distance word agreement and to achieve accurate generation of target language words, the inter-dependence between the factors of syntactically related words needs to be modelled effectively.

Some of the limitations of syntax based approaches such as (Yamada and Knight, 2002; Quirk et al., 2005; Chiang, 2005) are that (1) they do not offer flexibility for adding linguistically motivated features, and (2) it is not possible to use morphological factors in these approaches. In a recent work (Shen et al., 2009), linguistic and contextual information was effectively used in the framework of a hierarchical machine translation system; four linguistic and contextual features are used for accurate selection of translation rules. In our approach, in contrast, linguistically motivated features can be defined that directly affect the prediction of various elements in the target during the translation process. These features use syntactic labels and collocation statistics in order to allow effective training of the model.

Some of the other approaches related to our model are the Direct Translation Model 2 (DTM2) (Ittycheriah and Roukos, 2007), the End-to-End Discriminative Approach to MT (Liang et al., 2006) and Factored Translation Models (Koehn and Hoang, 2007). In DTM2, a discriminative trans-

¹ This work was done at LTRC, IIIT-Hyderabad, while he was a masters student, until July 2008.

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 66–74, COLING 2010, Beijing, August 2010.

lation model is defined in the setting of a phrase based translation system, and the features are optimized globally. In contrast to their approach, we define a discriminative model for translation in the setting of a syntax based machine translation system. This allows us to use both the power of a syntax based approach and the power of a large feature space during translation. In our approach, the weights are optimized in order to achieve an accurate prediction of the individual target nodes and their relative positions.

We propose an approach for syntax based statistical machine translation which effectively models the following aspects of language divergence:

• Word-order variation, including the long-distance reordering which is prevalent between language pairs such as English-Hindi and English-Japanese.

• Generation of word-forms in the target language by predicting the word and its factors. During prediction, the inter-dependence of the factors of the target word form with the factors of syntactically related words is considered.

To accomplish this goal, we visualize the problem of MT as a transformation from a morphologically analyzed source syntactic structure to a target syntactic structure (see Figure 1)¹. The transformation is factorized into a series of mini-transformations, which we address as features of the transformation. The features denote the various linguistic modifications of the source structure needed to obtain the target syntactic structure. Examples of features are the lexical translation of a particular source node, the ordering at a particular source node, etc. These features can be entirely local to a particular node in the syntactic structure or can span syntactically related entities. More about the features (or mini-transformations) is explained in section 3.

The transformation of a source syntactic structure is scored by taking a weighted sum of its features². Let τ represent the transformation of the source syntactic structure s; the score of the transformation is computed as in Equation 1:

score(τ|s) = Σ_i w_i · f_i(τ, s)    (1)

In Equation 1, the f_i's are the various features of the transformation and the w_i's are the weights of the features. The strength of our approach lies in the flexibility it offers in incorporating linguistic features that are useful in the task of machine translation. These features are also known as prediction features, as they map from source language information to the information in the target language that is being predicted.

During decoding of a source sentence, the goal is to choose the transformation that has the highest score. The source syntactic structure is traversed in a bottom-up fashion and the target syntactic structure is simultaneously built. We used a bottom-up traversal while decoding because it builds a contiguous sequence of nodes for the subtrees during traversal, enabling the application of a wide variety of language models.

In the training phase, the task is to learn the weights of the features. We use an online large-margin training algorithm, MIRA (Crammer et al., 2005), for learning the weights. The weights are locally updated at every source node during the bottom-up traversal of the source structure. For training the translation model, an automatically word-aligned parallel corpus is used. We used GIZA++ (Och and Ney, 2003) along with the growing heuristics to word-align the training corpus.

The basic factors of the word used in our experiments are root, part-of-speech, gender, number and person. In Hindi, common nouns and verbs have gender information, whereas English does not contain that information. Apart from the basic factors, we also consider the role information provided by labelled dependency parsers. For computing the dependency tree on the source side, we used the Stanford parser (Klein and Manning, 2003) in the experiments presented in this paper³.

¹ Note that the target structure contains only the target factors. An accurate and deterministic morphological generator combines these factors to produce the target word form.
² The features can be either binary-valued or real-valued.
³ The Stanford parser gives both the phrase-structure tree as well as dependency relations for a sentence.
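Equation 1 is a plain linear model over transformation features. As a minimal sketch (the feature names below are hypothetical placeholders, not the full feature set of section 3):

```python
# Sketch of Equation 1: score(tau|s) = sum_i w_i * f_i(tau, s).
# Feature names and values are illustrative placeholders only.

def score(weights, features):
    """Linear score of a candidate transformation tau of source s.

    weights:  dict mapping feature name -> weight w_i
    features: dict mapping feature name -> value f_i(tau, s)
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

w = {"lex_translation": 2.0, "reorderPostags": 0.5}
f = {"lex_translation": 1.0, "reorderPostags": 1.0, "unseen_feature": 1.0}
print(score(w, f))  # 2.5 (features without a learnt weight contribute 0)
```

Features absent from the weight vector simply contribute nothing, which mirrors how unseen prediction features behave in a linear model.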

[Figure 1 diagram: the source dependency tree of "Ram paid a visit to Shyam", with factored nodes such as paid/VBD (root=pay, tense=PAST, gnp=x3sg), Ram/NNP (root=Ram, gnp=x1sg, role=subj), visit/NN (root=visit, gnp=x3sg, role=obj), to/TO (root=to, role=vmod), a/DT (role=nmod) and Shyam/NNP (role=pmod), transformed into the factored target Hindi nodes root=mila (tense=PAST, gnp=m3sg), root=raam (gnp=m1sg), root=se (gnp=x3sg) and root=shyaam (gnp=m1sg)]

Figure 1: Transformation from source structure to target language

The function words, such as prepositions and auxiliary verbs, largely express the grammatical roles/functions of the content words in the sentence. In fact, in many agglutinative languages these words are attached to the content word to form one word form. In this paper, we also conduct experiments where we begin by grouping the function words with their corresponding content words. These groups of words are called local word groups, and in these cases the function words are considered as factors of the content words. Section 2 explains more about the local word groups in English and Hindi.

2 Local Word Groups

Local word groups (LWGs) (Bharati et al., 1998; Vaidya et al., 2009) consist of a content word and its associated function words. Local word grouping reduces a sentence to a sequence of content words, with the case-markers and tense-markers acting as their factors. For example, consider the English sentence 'People of these islands have adopted Hindi as a means of communication'. Here 'have adopted' is a LWG with root 'adopt' and tense markers 'have ed'. Another example of a LWG is 'of communication', where 'communication' is the root and 'of' is the case-marker. It is to be noted that local word grouping is different from chunking, where more than one content word can be part of a chunk. We obtain local word groups in English by processing the output of the Stanford parser. In Hindi, the function words always appear immediately after the content word⁴, and simple pattern matching suffices to obtain the LWGs. The rules applied are (1) VM (RB|VAUX)+ and (2) N.* IN.

3 Features

There are three types of transformation features explored by us: (1) local features, (2) syntactic features and (3) contextual features. In this section, we describe each of these categories of features, representing different aspects of the transformation, with examples.

3.1 Local Features

The local features capture aspects of the local transformation of an atomic treelet in the source structure to an atomic treelet in the target language. An atomic treelet is a semantically non-decomposable group of one or more nodes in the syntactic structure. It usually contains only one node, except in the case of multi-word expressions (MWEs). Figure 2 presents examples of local transformations.

Some of the local features used in our experiments are (1) dice coefficient, (2) dice coefficient of roots, (3) dice coefficient of null translations, (4) treelet translation probability, (5) gnp-gnp pair, (6) preposition-postposition pair, (7) tense-tense pair, (8) part-of-speech fertility, etc. Dice coefficients and treelet translation probabilities are measures that express the statistical co-occurrence of the atomic treelets.

⁴ Case-markers in Hindi are called postpositions.

[Figure 2 diagram: local transformations of the atomic treelets of Figure 1, e.g. paid/VBD (root=pay, tense=PAST, gnp=x3sg, role=X) mapped to the target node (root=mila, tense=PAST, gnp=m3sg), and Ram/NNP (root=Ram, gnp=x1sg, role=subj) and visit/NN (root=visit, gnp=x3sg, role=obj) mapped to (root=raam, gnp=m1sg)]

Figure 2: Local transformations

3.2 Syntactic Features

The syntactic features are used to model the difference in the word orders of the two languages. At every node of the source syntactic structure, these features define the changes in the relative order of the children during the process of transformation. They heavily use source information such as the part-of-speech tags and syntactic roles of the source nodes. One of the features used is reorderPostags. This feature captures the change in the relative positions of the children with respect to their parent during the tree transformation. An example of this feature for the transformation given in Figure 1 is shown in Figure 3.

[Figure 3 diagram: a reorderPostags rule mapping the source children order (NNP, TO) under VB to the target order (NNP, IN) under VB]

Figure 3: Syntactic feature - reorderPostags

The feature reorderPostags is in the form of a complete transfer rule. To handle cases where the left-hand side of reorderPostags does not match the syntactic structure of the source tree, simpler feature functions are used to qualify the various reorderings. Instead of using POS tags, feature functions can be defined that use syntactic roles. Apart from the above feature functions, we can also have features that compute the score of a particular order of children using syntactic language models (Gali and Venkatapathy, 2009; Guo et al., 2008). Different features can be defined that use different levels of information pertaining to the atomic treelet and its children.

3.3 Contextual Features

Contextual features model the inter-dependence of the factors of nodes connected by dependency arcs. These features are used to enable access to global information for the prediction of target nodes (words and their factors). One of the features, diceCoeffParent, relates the parent of a source node to the corresponding target node (see Figure 4).

[Figure 4 diagram: source nodes x1 (parent), x2, x3, x4; a dice feature links the parent x1 of x2 to the target node y generated from x2]

Figure 4: Use of contextual (parent) information of x2 for generation of y

The use of this feature is expected to address the limitations of using atomic treelets as the basic units, in contrast to phrase based systems, which consider arbitrary sequences of words as units to encode the local contextual information. In our case, we relate the target treelet to the contextual information of the source treelet using feature functions rather than using larger units. Similar features are used to connect the context of a source node to the target node.

Various feature functions are defined to handle the interaction between the factors of syntactically related treelets. In Hindi, gender-number-person agreement is a factor that depends on the gender-number-person factors of the syntactically related treelets. The rules being learnt here are simple. However, more complex interactions can also be handled through features such as prep_Tense, where the case-marker in the target is linked to the tense of the parent verb.

4 Decoding

The goal is to compute the most probable target sentence given a source sentence. First, the source sentence is analyzed using a morphological analyzer⁵, a local word grouper (see section 2) and a dependency parser. Given the source structure, the task of the decoding algorithm is to choose the transformation that has the maximum score.

⁵ http://www.cis.upenn.edu/~xtag/

69 The dependency tree of the source language lead to the target gold prediction for the source sentence is traversed in a bottom-up fashion for node. In such cases where the gold transforma- building the target language structure. At every tion is unreachable, the weights are not updated source node during the traversal, the local trans- at all for the source node as it might cause erro- formation is first computed. Then, the relative or- neous weight updates. We conducted our exper- der of its children is then computed using the syn- iments by considering both the cases, (1) Identi- tactic features. This results in a target structure fying source nodes with unreachable transforma- associated with the subtree rooted at the particular tions, and (2) Updating weights for all the source node. The target structure associated with the root nodes (till a maximum iteration limit). The num- node of the source structure is the result of the best ber of iterations on the entire corpus can also be transformation of the entire source structure. fixed. Typically, two iterations have been found to Hence, the task of computing the best transfor- be sufficient to train the model. mation of the entire source structure is factorized The dependency tree is traversed in a bottom-up into the tasks of computing the best transforma- fashion and the weights are updated at each source tions of the source treelets. The equation for com- node. puting the score of a transformation, Equation 1, can be modified as Equation 2 given below. 6 Experiments and Results

score(τ | s) = Σ_r Σ_i w_i · f_i(τ_r, r)    (2)

where τ_r is the local transformation of the source treelet r. The best transformation τ̂ of the source sentence s is

τ̂ = argmax_τ score(τ | s)    (3)

5 Training Algorithm

The goal of the training algorithm is to learn the feature weights from the word-aligned corpus. For word-alignment, we used the IBM Model 5 implemented in GIZA++ along with the growing heuristics (Koehn et al., 2003). The gold atomic treelets in the source and their transformations are obtained by mapping the source nodes to the target using the word-alignment information. This information is stored in the form of transformation tables that are used for the prediction of target atomic treelets, prepositions and other factors. The transformation tables are pruned in order to limit the search and eliminate redundant information. For each source element, only the top few entries are retained in the table. This limit ranges from 3 to 20. For our experiments, we use two growing heuristics, GROW-DIAG-FINAL-AND and GROW-DIAG-FINAL, as they cover the most words on both sides of the parallel corpora.

The important aspects of the translation model proposed in this paper have been implemented. Some of the components that handle word insertions and non-projective transformations have not yet been implemented in the decoder, and should be considered beyond the scope of this paper. The focus of this work has been to build a working syntax-based statistical machine translation system, which can act as a platform for further experiments on similar lines. The system would be available for download at http://shakti.iiit.ac.in/~sriram/vaanee.html. To evaluate this experimental system, a restricted set of experiments is conducted. The experiments are conducted on the English-Hindi language pair using a corpus in the tourism domain containing 11300 sentence pairs.6

6 DIT-TOURISM corpus

6.1 Training

6.1.1 Configuration

For training, we used the DIT-TOURISM-ALIGN-TRAIN dataset, which is the word-aligned dataset of 11300 sentence pairs. The word-alignment is done using the GIZA++ (Och and Ney, 2003) toolkit, and then growing heuristics are applied. We used an online large-margin algorithm, MIRA (McDonald and Pereira, 2006; Crammer et al., 2005), for updating the weights. During parameter optimization, it is sometimes impossible to achieve the gold transformation for a node because the pruned transformation tables may not
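The paper does not reproduce the update equations of the online large-margin learner it uses. As a rough illustration only, the following is a minimal single-constraint MIRA-style update over sparse feature vectors; the function and feature names are hypothetical, not taken from the paper, and the authors' actual training code may differ.

```python
def mira_update(w, feats_gold, feats_pred, loss, C=1.0):
    """Single-constraint MIRA step: make the smallest change to w (capped
    by aggressiveness C) so the gold structure outscores the current
    prediction by a margin of `loss`."""
    keys = set(feats_gold) | set(feats_pred)
    diff = {k: feats_gold.get(k, 0.0) - feats_pred.get(k, 0.0) for k in keys}
    margin = sum(w.get(k, 0.0) * v for k, v in diff.items())
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return w  # gold and prediction share identical features
    tau = min(C, max(0.0, (loss - margin) / norm_sq))
    out = dict(w)
    for k, v in diff.items():
        out[k] = out.get(k, 0.0) + tau * v
    return out
```

Repeating this step at a source node until the gold transformation is predicted (or an attempt limit is hit) corresponds to the "maximum update attempts" parameter discussed in the training experiments below.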

contain the reference transformation.

The training of the model can be performed under different configurations. The configurations that we used for the training experiments are given in Table 1.

Table 1: Training Configuration
  Number of training sentences               500
  Iterations on corpus                       1-2
  Parameter optimization algorithm           MIRA
  Beam size                                  1-20
  Maximum update attempts at source node     1-4
  Unreachable updates                        False
  Size of transformation tables              3

6.2 Results

For the complete training, the number of sentences that should be used for the best performance of the decoder should be the complete set. In this paper, we have conducted experiments considering 500 training sentences to observe the best training configuration.

At a source node, the weight vector is iteratively updated till the system predicts the gold transformation. We conducted experiments by fixing the maximum number of update attempts. A source node where the gold transformation is not achieved even after the maximum update limit is termed an update failure. A source node where the gold transformation is achieved without making any updates is known as a correct prediction.

At some of the source nodes, it is not possible to arrive at the gold target transformation because of the limited size of the training corpus. At such nodes, we have avoided doing any weight update. As the desired transformation is unachievable, any attempt to update the weight vector would cause noisy weight updates.

We observe various parameters to check the effectiveness of the training configuration. One of the parameters (which we refer to as 'updateHits') computes the number of successful updates (S) performed at the source nodes in contrast to the number of failed updates (F). Successful updates result in the prediction of the transformation that is the same as the reference transformation. A failed update does not result in the correct prediction even after the maximum iteration limit (see section 6.1.1) is reached. At some of the source nodes, the reference transformations are unreachable (U). The goal is to choose the configuration that has the least average number of failed updates (F), because it implies that the model has been learnt effectively.

Table 2: Training statistics - effect of beam size
(K: beam size, m: maximum update attempts, P: correct predictions, S: successful updates, F: failed updates, U: unreachable)
        K    m    P      S      F    U
  1.    1    4    1680   2692   84   4081
  2.    5    4    1595   2786   75   4081
  3.    10   4    1608   2799   49   4081
  4.    20   4    1610   2799   47   4081

From Table 2, we can see that a bigger beam size leads to better training of the model. The beam size was varied between 1 and 20, and the number of update failures (F) was observed to be least at K=20.

Table 3: Training statistics - effect of maximum update attempts
        K    m    P      S      F    U
  1.    20   1    1574   2724   158  4081
  2.    20   2    1598   2767   91   4081
  3.    20   4    1610   2799   47   4081

In Table 3, we can see that a higher limit on the maximum number of update attempts results, as expected, in fewer update failures. A much higher value of m is not preferable, because the training makes noisy updates at difficult nodes, i.e., the nodes where the target transformation is reachable in theory but is unreachable given the set of features.

Now, we examine the effect of the number of iterations on the quality of the model.

Table 4: Training statistics - effect of number of iterations (i)
        K    i    P      S      F    U
  1.    1    1    1680   2692   84   4081
  2.    1    2    1679   2694   83   4081

In Table 4, we can observe that the number of iterations over the data has no effect on the quality of the model. This implies that the model is adequately learnt after one pass through the data. This is possible because of the multiple update attempts allowed at every node; hence, the weights are updated at a node till the model prediction is consistent with the gold transformation.

Based on the above observations, we consider configuration 4 in Table 2 for the decoding experiments.

Now, we present some of the top feature weights learnt by the best configuration. The weights convey that important properties of transformation are being learnt well. Table 5 presents the weights of the features 'diceRoot', 'diceRootChildren' and 'diceRootParent'.

Table 5: Weights of dice coefficient based features
  Feature                                    Weight
  dice                                       75.67
  diceChildren                               540.31
  diceParent                                 595.94
  treelet translation probability (ttp) 1    0.77
  treelet translation probability (ttp) 2    389.62

We see that the dice coefficient based local and contextual features have a positive impact on the selection of correct transformations. A feature that uses a syntactic language model to compute the perplexity per word has a negative weight of -1.115.

Table 6 presents the top-5 entries of contextual features that describe the translation of the source argument 'nsubj' using contextual information (the 'tense' of its parent).

Table 6: Top weights of a contextual feature: preposition+Tense-postposition
  Feature                            Weight
  roleTenseVib:nsubj+NULL NULL       44.194196513246
  roleTenseVib:nsubj+has VBN ne      14.4541356715382
  roleTenseVib:nsubj+VBD ne          10.9241093097953
  roleTenseVib:nsubj+VBP meM         6.14149937079584
  roleTenseVib:nsubj+VBP NULL        4.76795730621754

Table 7 presents the top entries of the ordering relative-position feature. In this feature, the relative position (left or right) of the head and the child is captured. For example, a feature 'relPos:amod-NN', if active, conveys that an argument with the role 'amod' is at the left of a head word with POS tag 'NN'.

Table 7: Top weights of relPos feature
  Feature            Weight
  relPos:amod-NN     6.70
  relPos:NN-appos    1.62
  relPos:lrb-NN      1.62

6.3 Decoding

We computed the translation accuracies using two metrics, (1) BLEU score (Papineni et al., 2002), and (2) Lexical Accuracy (or F-Score), on a test set of 30 sentences. We compared the accuracy of the experimental system (Vaanee) presented in this paper with Moses (a state-of-the-art translation system) and Shakti (a rule-based translation system7) under similar conditions (using a development set to tune the models). The rule-based system considered is a general-domain system tuned to the tourism domain. The best BLEU score for Moses on the test set is 0.118, and the best lexical accuracy is 0.512. The best BLEU score for Shakti is 0.054, and the best lexical accuracy is 0.369.

7 http://shakti.iiit.ac.in/

In comparison, the best BLEU score of Vaanee is 0.067, while the best lexical accuracy is 0.445. As observed, the decoding results of the experimental system mentioned here are not yet comparable to the state of the art. The main reasons for the low translation accuracies are:

1. Poor quality of the dataset
The dataset currently available for the English-Hindi language pair is noisy. This is an extremely large limiting factor for a model which uses rich linguistic information within the statistical framework.

2. Low parser accuracy

The parser accuracy on the English-Hindi dataset is low, the reasons being (1) noise, (2) length of sentences, and (3) the wide scope of the tourism domain.

3. Word insertions not implemented yet

4. Non-projectivity not yet handled

5. BLEU is not an appropriate metric
BLEU is not an appropriate metric (Ananthakrishnan et al.) for measuring the translation accuracy into Indian languages.

6. Model is context-free as far as target words are concerned: selection depends on children but not on parents and siblings
This point concerns the decoding algorithm. The current algorithm is greedy while choosing the best translation at every source node. It first explores the K-best local transformations at a source node. It then makes a greedy selection of the predicted subtree based on its overall score after considering the predictions at the child nodes, and the relative position of the local transformation with respect to the predictions at the child nodes. The problem in this approach is that an error once made at a lower level of the tree is propagated to the top, causing more mistakes. A computationally reasonable solution to this problem is to maintain a K-best list of predicted subtrees corresponding to every source node. This allows rectification of a mistake made at any stage.

The system, however, performs better than the rule-based system. As observed earlier, the right type of information is being learnt by the model, and the approach looks promising. The limitations expressed here shall be addressed in the future.

7 Conclusion

In this work, we presented a syntax-based dependency model to effectively handle problems in translation from English to Indian languages such as (1) large word order variation, and (2) accurate generation of word-forms in the target language by predicting the word and its factors. The model that we have proposed has the flexibility of adding rich linguistic features.

An experimental version of the system has been implemented, which is available for download at http://shakti.iiit.ac.in/~sriram/vaanee.html. This can serve as a platform for future research in syntax-based statistical machine translation from English to Indian languages. We also plan to perform experiments using this system between European languages in future.

The performance of the implemented translation system is not yet comparable to state-of-the-art results, primarily for two reasons: (1) the poor quality of available data, because of which our model, which uses rich linguistic information, does not perform as expected, and (2) components for word insertion and non-projectivity handling are yet to be implemented in this version of the system.

References

Ananthakrishnan, R., Pushpak Bhattacharyya, M. Sasikumar, and Ritesh Shah. Some issues in automatic evaluation of English-Hindi MT: more blues for BLEU. ICON-2007.

Ananthakrishnan, R., Jayprasad Hegde, Pushpak Bhattacharyya, and M. Sasikumar. 2008. Simple syntactic and morphological processing can help English-Hindi statistical machine translation. In Proceedings of IJCNLP-2008.

Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev Sangal. 1997. Anusaaraka: Machine translation in stages. A Quarterly in Artificial Intelligence, NCST, Bombay (renamed as CDAC, Mumbai).

Bharati, Akshar, Medhavi Bhatia, Vineet Chaitanya, and Rajeev Sangal. 1998. Paninian grammar framework applied to English. South Asian Language Review, (3).

Bharati, Akshar, Rajeev Sangal, Dipti M. Sharma, and Amba P. Kulkarni. 2002. Machine translation activities in India: A survey. In Proceedings of the Workshop on Survey on Research and Development of Machine Translation in Asian Countries.

Chiang, David. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263-270, Ann Arbor, Michigan, June.

Crammer, K., R. McDonald, and F. Pereira. 2005. Scalable large-margin online learning for structured classification. Technical report, University of Pennsylvania.

Gali, Karthik and Sriram Venkatapathy. 2009. Sentence realisation from bag of words with dependency constraints. In Proceedings of HLT-NAACL 2009, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 19-24, Boulder, Colorado, June.

Papineni, Kishore, Salim Roukos, Todd Ward, and W.-J. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 313-318, Philadelphia, PA, July.

Quirk, Chris, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: Syntactically informed phrasal SMT. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 271-279, Ann Arbor, Michigan, June.

Guo, Yuqing, Josef van Genabith, and Haifeng Wang. 2008. Dependency-based n-gram models for general purpose sentence realisation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 297-304, Manchester, UK, August.

Ittycheriah, Abraham and Salim Roukos. 2007. Direct translation model 2. In Proceedings of HLT-NAACL 2007, pages 57-64, Rochester, New York, April.

Klein, Dan and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423-430, Sapporo, Japan, July.

Koehn, Philipp and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868-876.

Koehn, P., F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology Conference 2003 (HLT-NAACL 2003), Edmonton, Canada, May.

Liang, P., A. Bouchard-Cote, D. Klein, and B. Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of COLING/ACL 2006.

McDonald, R. and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL 2006.

Och, F. J. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Ramanathan, Ananthakrishnan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: Addressing the crux of the fluency problem in English-Hindi SMT. In Proceedings of ACL-IJCNLP 2009.

Shen, Libin, Jinxi Xu, Bing Zhang, Spyros Matsoukas, and Ralph Weischedel. 2009. Effective use of linguistic and contextual information for statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 72-80, Singapore, August.

Vaidya, Ashwini, Samar Husain, Prashanth Reddy, and Dipti M. Sharma. 2009. A karaka based annotation scheme for English. In Proceedings of CICLing 2009.

Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-404.

Yamada, Kenji and Kevin Knight. 2002. A decoder for syntax-based statistical MT. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 303-310, Philadelphia, Pennsylvania, USA, July.

Improved Language Modeling for English-Persian Statistical Machine Translation

Mahsa Mohaghegh, Massey University, School of Engineering and Advanced Technology, [email protected]
Abdolhossein Sarrafzadeh, Unitec, Department of Computing, [email protected]
Tom Moir, Massey University, School of Engineering and Advanced Technology, [email protected]

Abstract

As interaction between speakers of different languages continues to increase, the ever-present problem of language barriers must be overcome. For the same reason, automatic language translation (Machine Translation) has become an attractive area of research and development. Statistical Machine Translation (SMT) has been used for translation between many language pairs, the results of which have shown considerable success. The focus of this research is on the English/Persian language pair. This paper investigates the development and evaluation of the performance of a statistical machine translation system by building a baseline system using subtitles from Persian films. We present an overview of previous related work in English/Persian machine translation, and examine the available corpora for this language pair. We finally show the results of the experiments of our system using an in-house corpus and compare the results we obtained when building a language model with different-sized monolingual corpora. Different automatic evaluation metrics like BLEU, NIST and IBM-BLEU were used to evaluate the performance of the system on half of the corpus built. Finally, we look at future work by outlining ways of getting highly accurate translations as fast as possible.

1 Introduction

Over the 20th century, international interaction, travel and business relationships have increased enormously. With the entrance of the World Wide Web effectively connecting countries together over a giant network, this interaction reached a new peak. In the area of business and commerce, the vast majority of companies simply would not work without this global connection. However, with this vast global benefit comes a global problem: the language barrier. As the international connection barriers continually break down, the language barrier becomes a greater issue. The English language is now the world's lingua franca, and non-English speaking people are faced with the problem of communication, and limited access to resources in English.

Machine translation is the process of using computers for translation from one human language to another (Lopez, 2008). This is not a recent area of research and development. In fact, machine translation was one of the first applications of natural language processing, with research work dating back to the 1950s (Cancedda, Dymetman, Foster, & Goutte, 2009). However, due to the complexity and diversity of human language, automated translation is one of the hardest problems in computer science, and significantly successful results are uncommon.

There are a number of different approaches to machine translation. Statistical Machine Translation (SMT), however, seems to be the preferred approach of many industrial and academic research laboratories (Schmidt, 2007). The advantages of SMT compared to rule-based approaches lie in their adaptability to different domains and languages: once a functional

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 75-82, COLING 2010, Beijing, August 2010.

system exists, all that has to be done in order to make it work with other language pairs or text domains is to train it on new data.

Research work on statistical machine translation systems began in the early 1990s. These systems, which are based on phrase-based approaches, operate using parallel corpora - huge databases of corresponding sentences in two languages - and employ statistics and probability to learn by example which translation of a word or phrase is most likely correct. The translation moves directly from source language to target language with no intermediate transfer step. In recent years, such phrase-based MT approaches have become popular because they generally show better translation results. One major factor for this development is the growing availability in recent years of large monolingual and bilingual text corpora for a number of languages.

The focus of this paper is on statistical machine translation for the English/Persian language pair. The statistical approach has only been employed in several experimental translation attempts for this language pair, and is still largely undeveloped. This project is considered to be a challenge for several reasons. Firstly, the Persian language structure is very different in comparison to English; secondly, there has been little previous work done for this language pair; and thirdly, effective SMT systems rely on very large bilingual corpora, which are not readily available for the English/Persian language pair.

1.1 The Persian Language

The Persian language, or Farsi as it is also known, belongs to the Indo-European language family and is one of the more dominant languages in parts of the Middle East. It is in fact the most widely spoken language in the Iranian branch of the Indo-Iranian languages, being the official language of Iran (Persia) and also spoken in several countries including Iran, Tajikistan and Afghanistan. There also exist large groups and communities in Iraq, the United Arab Emirates, the People's Democratic Republic of Yemen, Bahrain, and Oman, not to mention communities in the USA.

Persian uses a script that is written from right to left. It has similarities with Arabic, but has an extended alphabet and different words and/or pronunciations from Arabic. During its long history, the language has been influenced by other languages such as Arabic, Turkish and even European languages such as English and French. Today's Persian contains many words from these languages, and in some cases words from other languages still follow the grammar of their original language, particularly in building plural, singular or different verb forms. Because of the special and different nature of the Persian language compared to other languages like English, the design of SMT systems for Persian requires special considerations.

1.2 Related Work

Several MT systems have already been constructed for the English/Persian language pair. One such system is the Shiraz project (Amtrup et al., 2000). The Shiraz MT system is an MT prototype that translates text one way from Persian to English. The project began in 1997 and the final version was delivered in 1999. The Shiraz corpus is a 10 MB manually constructed, bilingually tagged Persian-to-English dictionary of about 50,000 words, developed using on-line material for testing purposes in a project at New Mexico State University. The system also comprises its own syntactic parser and morphological analyzer, and is focused on news stories as its translation domain.

Another English/Persian system was developed by Saedi, Motazadi, and Shamsfard (2009). This system, called PEnTrans, is a bidirectional text translator, comprising two main modules (PEnT1 and PEnT2) which translate in opposite directions (PEnT1 from English to Persian; PEnT2 from Persian to English). PEnT1 employs a combination of both corpus-based and extended dictionary approaches, and PEnT2 uses a combination of rule, knowledge and corpus-based approaches. PEnTrans introduced a new WSD method with a hybrid measure which evaluates different word senses in a

sentence and scores them according to their condition in the sentence, together with the placement of other words in that sentence.

ParsTranslator is a machine translation system built to translate English to Persian text. It was first released for public use in mid-1997, the latest update being the PTran version of April 2004. ParsTran takes as input English text typed or from a file. The latest version is able to operate on over 1.5 million words and terminologies in English. It covers 33 fields of science, and is a growing translation service, with word banks being continually reviewed and updated; it is available at: http://www.ParsTranslator.Net/eng/index.htm.

Another English-to-Persian MT system is the rule-based system developed by Faili and Ghassem-Sani (2005). This system was based on tree adjoining grammar (TAG), and was later improved by implementing trained decision trees as a word sense disambiguation module.

The Transonics Spoken Dialogue Translator is also partially a statistically based machine translation system. The complete system itself operates using a speech-to-text converter, statistical language translation, and subsequent text-to-speech conversion. The actual translation unit operates in two modes: in-domain and out-of-domain. A classifier attempts to assign a concept to an utterance. If the object to be translated is within the translation domain, the system is capable of significantly accurate translations. Where the object is outside the translation domain, the SMT method is used. Transonics is a translation system for a specific domain (medical: doctor-to-patient interviews), and only deals with question/answer situations (Ettelaie et al., 2005).

Another speech-to-speech English/Persian machine translation system is suggested by Xiang et al. They present an unsupervised training technique to alleviate the problem of the lack of bilingual training data by taking advantage of available source language data (Xiang, Deng, & Gao, 2008).

However, there was no large parallel text corpus available at the time of development for both of these systems. For its specific domain, the Transonics translation system relied on a dictionary approach for translation, using a speech corpus rather than a parallel corpus. Their statistical translation approach was merely used as a backup system.

Mohaghegh et al. (2009) presented the first such attempt to construct a parallel corpus from BBC news stories. This corpus is intended to be an open corpus in which more text may be added as it is collected. This corpus was used to construct a prototype for the first statistical machine translation system. The problems encountered, especially with the process of alignment, are discussed in this research (Mohaghegh & Sarrafzadeh, 2009).

Most of these systems have largely used a rule-based approach, and their BLEU scores on a standard data set have not been published. Nowadays, however, most large companies employ the statistical translation approach, using exceedingly large amounts of bilingual data (aligned sentences in two languages). A good example of this is perhaps the most well-known Persian/English MT system: Google Translate, which recently released an option for this language pair. Google's MT system is based on the statistical approach, and was made available online as a BETA version in June 2009.

2 Corpus Development for Persian

A corpus is defined as a large compilation of written text or audible speech transcript. Corpora, both monolingual and bilingual, have been used in various applications in computational linguistics and machine translation.

A parallel corpus is effectively two corpora in two different languages, comprising sentences and phrases accurately translated and aligned together phrase to phrase. When used in machine translation systems, parallel corpora must be of a very large size - billions of sentences - to be effective. It is for this reason that the Persian language poses some difficulty. There is an acute shortage of digitally stored linguistic material, and few parallel online documents, making the construction of a parallel Persian corpus extremely difficult.

There are a few parallel Persian corpora that do exist. These vary in size, and in the domains they cover. One such corpus is FLDB1, which is a linguistic corpus consisting of approximately 3 million words in ASCII format. This corpus

was developed and released by Assi (1997) at the Institute for Humanities and Cultural Studies. This corpus version was updated in 2005, in the 1256 character code page, and named PLDB2. This new updated version contains more than 56 million words, and was constructed with contemporary literary books, articles, magazines, newspapers, laws and regulations, and transcriptions of news, reports, and telephone speeches, for lexicography purposes.

Several corpus construction efforts have been made based on the online Hamshahri newspaper archives. These include Ghayoomi (2004), with 6 months of Hamshahri archives yielding a corpus of 6.5 million words, and Darrudi, Hejazi, and Oroumchian (2004), with 4 years' worth of archives yielding a 37 million-word corpus.

The 'Peykareh' or 'Text Corpus' is a corpus of 38 million words developed by Bijankhan et al., available at: http://ece.ut.ac.ir/dbrg/bijankhan/, and comprises newspapers, books, magazine articles, and technical books, together with transcriptions of dialogs, monologues, and speeches, for language modeling purposes.

The Shiraz corpus (Amtrup et al., 2000) is a bilingual tagged corpus of about 3000 aligned Persian/English sentences, also collected from the Hamshahri newspaper online archive and manually translated at New Mexico State University.

Another corpus, TEP (Tehran English-Persian corpus), available at: http://ece.ut.ac.ir/NLP/resources.htm, consists of 21,000 subtitle files obtained from www.opensubtitles.org. Subtitle pairs of multiple versions of the same movie were extracted, a total of about 1,200; Itamar and Itai (2008) then aligned the files using their proposed dynamic programming method. This method operates by using the timing information contained in subtitle files so as to align the text accurately. The end product yielded a parallel corpus of approximately 150,000 sentences, which has 4,100,000 tokens in Persian and 4,400,000 tokens in English.

Finally, the European Language Resources Association (ELRA), available at: http://catalog.elra.info/product_info.php?products_id=1111, has constructed a corpus which consists of about 3,500,000 English and Persian words aligned at sentence level, giving approximately 100,000 sentences distributed over 50,021 entries. The corpus was originally constructed with SQL Server, but is presented as an Access-type file. The format for the files is Unicode. This corpus consists of several different domains, including art, culture, idioms, law, literature, medicine, poetry, politics, proverbs, religion, and science; it is available for sale online.

3 Statistical Machine Translation

3.1 General

Statistical machine translation (SMT) can be defined as the process of maximizing the probability of a sentence s in the source language matching a sentence t in the target language. In other words, "given a sentence s in the source language, we seek the sentence t in the target language such that it maximizes P(t | s), which is called the conditional probability or chance of t happening given s" (Koehn et al., 2007). It is also referred to as the most likely translation. This can be more formally written as shown in equation (1):

    argmax_t P(t | s)    (1)

Using Bayes' rule, shown in equation (2), we can write equation (1) for the most likely translation as shown in equation (3):

    P(t | s) = P(t) * P(s | t) / P(s)    (2)

    argmax_t P(t | s) = argmax_t P(t) * P(s | t)    (3)

where t is the target sentence and s is the source sentence. P(t) is the target language model and P(s | t) is the translation model. The argmax operation is the search, which is done by a so-called decoder, a component of a statistical machine translation system.

3.2 Statistical Machine Translation Tools

There are a number of implementations of subtasks and algorithms in SMT, and even software tools that can be used to set up a fully featured, state-of-the-art SMT system.

78 Moses (Koehn, et al., 2007) is an open-source Microsoft’s bi-lingual sentence aligner statistical machine translation system which developed by (Moore, 2002). allows one to train translation models using The next step we plan to take involves the GIZA++ (Och & Ney, 2004).for any given construction of a statistical prototype based on language pair for which a parallel corpus exists. the largest available English/Persian parallel This tool was used to build the baseline system corpus extracted from the domain of movie discussed in this paper. MOSES uses a beam subtitles. This domain was chosen because the search algorithm where the translated output maximum number of words that can be sentence is generated left to right in form of displayed as a subtitle on the screen is between hypotheses. Beam-search is an efficient search 10- 12 which means both training and decoding algorithm which quickly finds the highest will be a lot faster. Building a parallel corpus probability translation among the exponential for any domain is generally the most time number of choices. consuming process as it depends on the The search begins with an initial state where availability of parallel text. But the domain of no foreign input words are translated and no subtitling makes it easier to get the source English output words have been generated. New language in the form of scripts and the target states are created by extending the English language in the form of subtitles in many output with a phrasal translation of that covers different languages. some of the foreign input words not yet translated. The algorithm can be used for exhaustively searching through all possible translations when data gets very large. The search can be optimized by discarding hypotheses that cannot be part of the path to the best translation. Furthermore, by comparing states, one can define a beam of good hypotheses and prune out hypotheses that fall out of this beam (Dean & Ghemawat, 2008).

3.3 Building a Baseline SMT System

To build a good baseline system it is important to build a sentence-aligned parallel corpus which is spell-checked and grammatically correct in both the source and target languages. The alignment of words or phrases turns out to be the most difficult problem SMT faces. Words and phrases in the source and target languages normally differ in where they are placed in a sentence; words that appear on one language side may be dropped on the other; and one English word may have as its counterpart a longer Persian phrase, and vice versa. The accuracy of SMT relies heavily on the existence of large amounts of data, commonly referred to as a parallel corpus. The first step taken was therefore to develop the parallel corpus. This corpus is intended to be an open corpus, to which more text can be added as it is collected. Sentences were aligned using Microsoft's bilingual sentence aligner (Moore, 2002).

Figure 1. A typical SMT system.

A language model (LM) is usually trained on large amounts of monolingual data in the target language to ensure the fluency of the language into which the sentence is being translated. Language modeling is used not only in machine translation but also in many other natural language processing applications, such as speech recognition, part-of-speech tagging, parsing and information retrieval. A statistical language model assigns probabilities to sequences of words and tries to capture the properties of a language. The language model for this study was trained on the BBC Persian News corpus and also on an in-house corpus from different genres. The SRILM toolkit (Stolcke, 2002) was used to train a 5-gram LM for the experiments.
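The role of the language model can be illustrated with a toy bigram model. A real LM, such as the 5-gram SRILM model used here, adds smoothing and backoff; the sketch below is maximum-likelihood only, and the training sentences are invented.

```python
from collections import Counter

# Toy maximum-likelihood bigram language model; real LMs (e.g. the
# 5-gram SRILM model used in the paper) add smoothing and backoff.
def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    def prob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for u, v in zip(words[:-1], words[1:]):
            if unigrams[u] == 0:
                return 0.0            # unseen context; real LMs back off
            p *= bigrams[(u, v)] / unigrams[u]   # P(v | u)
        return p
    return prob

lm = train_bigram_lm(["he goes home", "he goes to school"])
print(lm("he goes home") > lm("home goes he"))   # fluent order is more probable
```

This is exactly the property the decoder exploits: among candidate outputs, word orders seen in fluent target-language text receive higher P(t).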

4 Experiments and Results

4.1 Experiment setup

We used Moses, a phrase-based SMT development tool, to construct our machine translation system. This included n-gram language models trained with the SRI language modeling tool, the GIZA++ alignment tool, the Moses decoder and the script that induces phrase-based translation models from word-based ones.

Table 1. Size of the test and training sets (EN: English, FA: Farsi)

Test No.             EN/FA 1   EN/FA 2   EN/FA 3
Test sentences       817       1011      2343
Training sentences   864       1066      7005

4.2 Performance evaluation metrics

A great deal of research has been done in the field of automatic machine translation evaluation. Human evaluations of machine translation are thorough but expensive: they can take months to finish and involve human labor that cannot be reused. This is the main motivation behind automatic machine translation evaluation, which is quick, inexpensive, and language independent. One of the most popular metrics is BLEU (BiLingual Evaluation Understudy), developed at IBM. The central idea behind the BLEU metric is that the closer a machine translation is to a professional human translation, the better it is. NIST is another automatic evaluation metric, with the following primary differences compared to BLEU: text pre-processing, a gentler length penalty, information-weighted n-gram counts and selective use of n-grams (Li, Callison-Burch, Khudanpur, & Thornton, 2009).

4.3 Discussion and analysis of the results

In this study, Moses was used to establish a baseline system. This system was trained and tested on three in-house corpora, the first of 817 sentences, the second of 1,011 sentences, and the third of 2,343 sentences. The available data was split into a training and a test set. Microsoft's bilingual sentence aligner (Moore, 2002) was used to align the corpus and training sets.

Evaluation results from these experiments are presented in Tables 2, 3 and 4. As expected, BLEU scores improved as the size of the corpus increased. The BLEU scores themselves were quite low; however, this was expected given the small size of the corpus. We plan to update and increase the corpus size in the near future, which should yield more satisfactory results.

Table 2. Results obtained using language model size=864

LM=864            BLEU     NIST     IBM-BLEU
Corpus size 817   0.1061   1.8218   0.0060
Corpus size 1011  0.0882   1.5338   0.0050
Corpus size 2343  0.0806   1.7364   0.0067

Table 3. Results obtained using language model size=1066

LM=1066           BLEU     NIST     IBM-BLEU
Corpus size 817   0.0920   1.6838   0.0060
Corpus size 1011  0.0986   1.5301   0.0050
Corpus size 2343  0.1127   1.6961   0.0069

Table 4. Results obtained using language model size=7005

LM=7005           BLEU     NIST     IBM-BLEU
Corpus size 817   0.0805   1.6721   0.0063
Corpus size 1011  0.0888   1.5512   0.0051
Corpus size 2343  0.1148   1.7554   0.0071
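For reference, BLEU combines modified n-gram precision with a brevity penalty. The simplified, single-reference, sentence-level sketch below uses only unigrams and bigrams and no smoothing; the official metric uses n-grams up to 4 and corpus-level statistics.

```python
import math
from collections import Counter

# Simplified sentence-level BLEU: modified n-gram precision (n = 1, 2)
# combined with a brevity penalty.
def bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(clipped / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # identical: 1.0
```

The clipping step is what makes the precision "modified": a candidate cannot inflate its score by repeating a reference word more often than the reference itself contains it.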

Aligning was also performed manually to help improve the results. As the corpus size increased, we performed various experiments, such as increasing the language model size in each instance.

The first test was performed on a corpus of 817 sentences in Persian with the same number of aligned English translations. In this instance, the training set used was 864 sentences. The results of this translation were evaluated using three evaluation metrics (BLEU, NIST, and IBM-BLEU). An excerpt from the output of this first experiment is shown in Figure 3(a).

The second test comprised a corpus of 1,011 sentences, with a training set of 1,066 sentences. As can be seen, the evaluation metric results improved.

The same experiment was repeated a third time, this time with an even larger corpus of 2,343 sentences and a training set of 7,005 sentences. The results can be seen in Table 4. The results obtained in this test were close to those of the previous test, apart from a small increase in BLEU score. It must be noted that BLEU is only a tool for comparing different MT systems, so an increase in BLEU score does not necessarily mean an increase in translation accuracy.

The performance of the baseline English-Persian SMT system was evaluated by computing BLEU, IBM-BLEU and NIST (Li et al., 2009) scores against different sizes of the sentence-aligned corpus and different sizes of the training set. Tables 2, 3 and 4 show the results obtained using corpora of 817, 1,011 and 2,343 sentences respectively. The language model size was varied from 864 to 1,066 and finally to 7,005 sentences.

Moreover, as shown in Table 3, using a corpus of 1,011 sentences and a language model of 1,066 sentences produces better results, as can clearly be seen from the graph in Figure 3(b). Finally, increasing the corpus size to 2,343 sentences, with a language model constructed from 7,005 sentences, produced the best translation results, as shown in Figure 3(c) and Table 4. This data shows that an increased corpus size will yield improved translation quality, but only as long as the size of the language model is proportional to the corpus size. The literature notes that the size of the corpus, although important, does not have as great an effect as the combination of corpus and language model in the domain of translation (Ma & Way, 2009). In the Persian language, some problems and difficulties arise due to natural-language ambiguities, anaphora resolution, idioms and differences in the types and symbols used for punctuation. These issues had to be resolved before any attempt at SMT could be made. Needless to say, the better the alignment, the better the results of the translation.

Figure 3. (a) Results obtained using training size=864; (b) results obtained using training size=1066; (c) results obtained using training size=7005. (Bar charts of BLEU, NIST and IBM-BLEU scores for corpus sizes 817, 1011 and 2343.)

5 Future work

Despite the fact that, compared to other language pairs, the available parallel corpora for the English/Persian language pair are significantly smaller, the future of statistical machine translation for this language pair looks promising. We have been able to procure several very large bilingual corpora, which we intend to combine with the open corpus used in the original tests. With a much larger bilingual corpus, we expect to produce significantly higher evaluation metric scores. Our immediate future work will consist of combining these corpora, addressing the task of corpus alignment, and continuing the use of a web crawler to obtain further bilingual text.

6 Conclusion

This paper presented an overview of some of the work on English/Persian MT systems done to date, and reported a set of experiments in which our SMT system was applied to the Persian language using a relatively small corpus. The first part of this work was to test how well our system translates from Persian to English when trained on the available corpora, and to spot and try to resolve problems with the process and the output produced. According to the results we obtained, it was concluded that a corpus of much greater size would be required to produce satisfactory results. Our experience with the smaller corpus shows that, for a large corpus, a significant amount of work will be required in aligning sentences.

References

Amtrup, J., et al. (2000). Persian-English machine translation: An overview of the Shiraz project. Computing Research Laboratory, New Mexico State University.
Assi, S. (1997). Farsi linguistic database (FLDB). International Journal of Lexicography, 10(3), 5.
Cancedda, N., Dymetman, M., Foster, G., & Goutte, C. (2009). A statistical machine translation primer.
Darrudi, E., Hejazi, M., & Oroumchian, F. (2004). Assessment of a modern Farsi corpus.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
Ettelaie, E., Gandhe, S., Georgiou, P., Knight, K., Marcu, D., Narayanan, S., et al. (2005). Transonics: A practical speech-to-speech translator for English-Farsi medical dialogues.
Faili, H., & Ghassem-Sani, G. (2005). Using a decision tree approach for ambiguity resolution in machine translation.
Itamar, E., & Itai, A. (2008). Using movie subtitles for creating a large-scale bilingual corpora.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation.
Li, Z., Callison-Burch, C., Khudanpur, S., & Thornton, W. (2009). Decoding in Joshua. The Prague Bulletin of Mathematical Linguistics, 91, 47-56.
Lopez, A. (2008). Statistical machine translation.
Ma, Y., & Way, A. (2009). Bilingually motivated domain-adapted word segmentation for statistical machine translation.
Mohaghegh, M., & Sarrafzadeh, A. (2009). An analysis of the effect of training data variation in English-Persian statistical machine translation.
Moore, R. (2002). Fast and accurate sentence alignment of bilingual corpora. Lecture Notes in Computer Science, 135-144.
Och, F., & Ney, H. (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4), 417-449.
Saedi, C., Motazadi, Y., & Shamsfard, M. (2009). Automatic translation between English and Persian texts.
Schmidt, A. (2007). Statistical machine translation between new language pairs using multiple intermediaries.
Stolcke, A. (2002). SRILM - an extensible language modeling toolkit.
Xiang, B., Deng, Y., & Gao, Y. (2008). Unsupervised training for Farsi-English speech-to-speech translation.

Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations

Thoudam Doren Singh and Sivaji Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University
[email protected], [email protected]

Abstract

The present work reports the development of Manipuri-English bidirectional statistical machine translation systems. In the English-Manipuri statistical machine translation system, the role of the suffixes and dependency relations on the source side and of case markers on the target side is identified as an important translation factor. A parallel corpus of 10,350 sentences from the news domain is used for training, and the system is tested with 500 sentences. Using the proposed translation factors, the translation quality is improved, as indicated by a baseline BLEU score of 13.045 and a factored BLEU score of 16.873. Similarly, for the Manipuri-English system, the role of case markers and POS tag information on the source side and of suffixes and dependency relations on the target side is identified as a useful translation factor. The case markers and suffixes are responsible not only for determining the word classes but also for determining the dependency relations. Using these translation factors, the translation quality is improved, as indicated by a baseline BLEU score of 13.452 and a factored BLEU score of 17.573. Further, subjective evaluation indicates an improvement in the fluency and adequacy of both factored SMT outputs over the respective baseline systems.

1 Introduction

Manipuri has few resources for NLP-related research and development activities. Manipuri is a less privileged Tibeto-Burman language spoken by approximately three million people, mainly in the state of Manipur in India, as well as in its neighboring states and in the countries of Myanmar and Bangladesh. Some of the distinctive features of this language are tone, agglutinative verb morphology and the predominance of aspect over tense, along with a lack of grammatical gender, number and person. Other features are verb-final word order, i.e., Subject-Object-Verb (SOV) order, and extensive suffixation with more limited prefixation. In Manipuri, identification of most word classes and sentence types is based on the markers. All sentences, except interrogatives, end with one of the mood markers, which may or may not be followed by an enclitic. Basic sentence types in Manipuri are determined through illocutionary mood markers, all of which are verbal inflectional suffixes, with the exception of the interrogatives that end with an enclitic. Two important problems in applying statistical machine translation (SMT) techniques to English-Manipuri bidirectional MT systems are: (a) the wide syntactic divergence between the language pair, and (b) the richer morphology and case marking of Manipuri compared to English. The first problem manifests itself in poor word order in the output translations, while the second leads to incorrect inflections and case marking. The output Manipuri sentences of the English-Manipuri system suffer badly when morphology and case markers are incorrect in this free-word-order and morphologically rich language.

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 83–91, COLING 2010, Beijing, August 2010.

The parallel corpora used are in the news domain and have been collected, cleaned and aligned (Singh et al., 2010b) from the Sangai Express newspaper website www.thesangaiexpress.com, which is available in both Manipuri and English. Collection was done on a daily basis, covering the period from May 2008 to November 2008, since there is no repository.

2 Related Works

Koehn and Hoang (2007) developed a framework for statistical translation models that tightly integrates additional morphological, syntactic, or semantic information. Statistical machine translation with scarce resources using morpho-syntactic information is discussed in (Nießen and Ney, 2004), which introduces sentence-level restructuring transformations that aim at the assimilation of word order in related sentences and at the exploitation of the bilingual training data by explicitly taking into account the interdependencies of related inflected forms, thereby improving translation quality. Popovic and Ney (2006) discussed SMT with a small amount of bilingual training data. Case markers and morphology are used to address the crux of fluency in the English-Hindi SMT system of (Ramanathan et al., 2009). Work on translating from rich to poor morphology using a factored model is reported in (Avramidis and Koehn, 2008); in this method of enriching the input, the case agreement for nouns, adjectives and articles is mainly defined by the syntactic role of each phrase, and resolution of verb conjugation is done by identifying the person of a verb and using the linguistic information tag. A Manipuri-to-English example-based machine translation system on the news domain is reported in (Singh and Bandyopadhyay, 2010a). For this, POS tagging, morphological analysis, NER and chunking are applied to the parallel corpus for phrase-level alignment. Chunks are aligned using a dynamic programming "edit-distance style" alignment algorithm. The translation process initially looks for an exact match in the parallel example base and returns the retrieved target output; otherwise, the maximal-match source sentence is identified. For a word-level mismatch, the unmatched words in the input are either translated from the lexicon or transliterated. Unmatched phrases are looked up in the phrase-level parallel example base; the target phrase translations are identified and then recombined with the retrieved output. An English-Manipuri SMT system using morpho-syntactic and semantic information is reported in (Singh and Bandyopadhyay, 2010c); in this system, the role of the suffixes and dependency relations on the source side and of case markers on the target side is identified as an important translation factor.

3 Syntactic Reordering

This is a preprocessing step applied to the input English sentences of the English-Manipuri SMT system. The program for syntactic reordering uses the parse trees generated by the Stanford parser1 and applies a handful of reordering rules written using the Perl module Parse::RecDescent. By doing this, the SVO order of English is changed to the SOV order of Manipuri, and post-modifiers are converted to pre-modifiers. The basic difference of Manipuri phrase order compared to English is handled by reordering the input sentence following the rule (Rao et al., 2000):

    S Sm V Vm O Om Cm → C'm S'm S' O'm O' V'm V'

where
    S: Subject
    O: Object
    V: Verb
    Cm: Clause modifier
    X': corresponding constituent in Manipuri, where X is S, O, or V
    Xm: modifier of X

There are two reasons why the syntactic reordering approach improves over the baseline phrase-based SMT system (Wang et al., 2007). One obvious benefit is that the word order of the transformed source sentence is much closer to that of the target sentence, which reduces the reliance on the distortion model to perform reordering during decoding. Another potential benefit is that the alignment between the two sides will be of higher quality because of fewer "distortions" between the source and the target, so that the resulting phrase table of the reordered system should be better. However, a counter-argument is that the reordering is very error-prone, so that the added noise in the reordered data may actually hurt the alignments and hence the phrase tables.

1 http://nlp.stanford.edu/software/lex-parser.shtml
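The reordering rule above can be illustrated with a toy constituent-level transform. The constituent labels follow the rule, but the example input and the flat (label, text) representation are invented; a real implementation walks Stanford parse trees.

```python
# Toy illustration of the Rao et al. (2000) reordering rule
# S Sm V Vm O Om Cm -> C'm S'm S' O'm O' V'm V'.
# Constituents are given as (label, text) pairs; a real system would
# extract them from a Stanford parse tree.
ORDER = ["Cm", "Sm", "S", "Om", "O", "Vm", "V"]   # target (Manipuri) slot order

def reorder(constituents):
    by_label = dict(constituents)
    # emit each constituent present in the sentence, in target order
    return " ".join(by_label[lbl] for lbl in ORDER if lbl in by_label)

# SVO English -> SOV order expected by Manipuri
print(reorder([("S", "Ibomcha"), ("V", "kicks"), ("O", "the ball")]))
# -> Ibomcha the ball kicks
```

The point of the transform is visible even in this sketch: the verb moves to the end and clause modifiers move to the front, so the distortion model has far less reordering left to do at decoding time.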

4 Morphology

The affixes are the determining factor of the word class in Manipuri. In this agglutinative language the number of verbal suffixes is greater than that of nominal suffixes. Works on Manipuri morphology are found in (Singh and Bandyopadhyay, 2006) and (Singh and Bandyopadhyay, 2008). In this language, a verb must minimally consist of a verb root and an inflectional suffix. A noun may optionally be affixed by derivational morphemes indicating gender, number and quantity; further, a noun may be prefixed by a pronominal prefix which indicates its possessor. Words in Manipuri consist of stems or bound roots with suffixes (from one to ten suffixes), prefixes (only one per word) and/or enclitics.

(a) ইব োমচো-না ব োল-দ ু কোওই
    Ibomcha-na ball-du kao-i
    Ibomcha-nom ball-distal kick
    'Ibomcha kicks the ball.'

(b) ব োল-দ ু ইব োমচো-না কোওই
    Ball-du Ibomcha-na kao-i
    ball-distal Ibomcha-nom kick
    'Ibomcha kicks the ball.'

The identification of the subject and object in both sentences is done by the suffixes না (na) and দ ু (du), as shown in examples (a) and (b). The case markers convey the right meaning during translation, although the most acceptable order of a Manipuri sentence is SOV. To produce a good translation output, all the morphological forms of a word and its translations would have to be available in the training data, with every word appearing with every possible suffix; this would require very large training data. By learning the general rules of morphology, the amount of training data needed can be reduced. Separating lemma and suffix allows the system to learn more about the different possible word formations.

Table 1 gives some examples of the inflected forms of a person name and the corresponding English meanings. The Manipuri stemmer separates case markers such as –নো (-na), -দগী (-dagi), -স ু (-su), -গী (-gi) and -গো (-ga) from surface forms, so that "ব োম" (Tom) on the Manipuri side matches "Tom" on the English side, helping to overcome data sparseness. Enclitics in Manipuri fall into six categories: determiners, case markers, the copula, mood markers, inclusive/exclusive and pragmatic peak markers, and attitude markers. The role and meaning of an enclitic differ based on the context.

Table 1: Some of the inflected forms of names in Manipuri and their corresponding English meanings

Manipuri   Gloss      English Meaning
ব োম্নো      Tom-na     by Tom
ব োমদগী     Tom-dagi   from Tom
ব োমস ু      Tom-su     Tom also
ব োমগী      Tom-gi     of Tom
ব োমগো      Tom-ga     with Tom

5 Factored Model of Translation

Using the factored approach, a tighter integration of linguistic information into the translation model is done for two reasons2:

- Translation models that operate on more general representations, such as lemmas instead of surface forms of words, can draw on richer statistics and overcome the data sparseness problem caused by limited training data.
- Many aspects of translation can best be explained at a morphological, syntactic or semantic level. Having such information available to the translation model allows the direct modeling of these aspects. For instance, reordering at the sentence level is mostly driven by general syntactic principles, local agreement constraints show up in morphology, etc.

5.1 Combination of Components in the Factored Model

A factored translation model is the combination of several components, including the language model, the reordering model, translation steps and generation steps, in a log-linear model3:

    p(e | f) = (1/Z) exp( Σ_i λ_i h_i(e, f) )    (1)

Z is a normalization constant that is ignored in practice. To compute the probability of a translation e given an input sentence f, we have to evaluate each feature function h_i.

2 http://www.statmt.org/moses/?n=Moses.FactoredModels
3 http://www.statmt.org/moses/?n=Moses.FactoredModels
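The log-linear combination of equation (1) can be sketched as follows. Since Z is constant for a given input, comparing candidates only needs the weighted feature sum. The two feature functions and the weights below are invented stand-ins for real translation-model and language-model components.

```python
# Sketch of the log-linear model of equation (1): the decoder score is
# sum_i lambda_i * h_i(e, f); Z is constant for a given input f, so it
# can be ignored when comparing candidate outputs e.
def loglinear_score(e, f, features, weights):
    return sum(w * h(e, f) for h, w in zip(features, weights))

# Two toy feature functions (stand-ins for translation/LM components):
features = [lambda e, f: -abs(len(e.split()) - len(f.split())),  # length match
            lambda e, f: float(e.endswith("."))]                  # well-formedness
weights = [0.7, 0.3]                                              # the lambda_i

print(loglinear_score("birds are flying.", "ucheksing payri", features, weights))
```

In the actual system the weights λ_i are not hand-set as here but tuned with minimum error rate training, as the text goes on to describe.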

The feature weight λ_i in the log-linear model is determined using the minimum error rate training method (Och, 2003). For a translation step component, each feature function h_t is defined over the phrase pairs (f_j, e_j), given a scoring function τ:

    h_t(e, f) = Σ_j τ(f_j, e_j)    (2)

For the generation step component, each feature function h_g, given a scoring function γ, is defined over the output words e_k only:

    h_g(e, f) = Σ_k γ(e_k)    (3)

5.2 Stanford Dependency Parser

The dependency relations used in the experiment are generated by the Stanford dependency parser (de Marneffe and Manning, 2008). This parser uses 55 relations to express the dependencies among the various words in a sentence. The dependencies are all binary relations: a grammatical relation holds between a governor and a dependent. These relations form a hierarchical structure with the most general relation at the root. There are various argument relations like subject, object, objects of prepositions and clausal complements; modifier relations like adjectival, adverbial, participial and infinitival modifiers; and other relations like coordination, conjunct, expletive and punctuation.

Let us consider the example "Sources said that Tom was shot by police". The Stanford parser produces the dependency relations nsubj(said, sources) and agent(shot, police); thus, sources|nsubj and police|agent are the factors used. "Tom was shot by police" forms the object of the verb "said". The Stanford parser represents these dependencies with the help of a clausal complement relation (ccomp) which links "said" with "shot", and uses the complementizer relation (complm) to introduce the subordinating conjunction. Figure 1 shows the dependency relation graph of the sentence.

Figure 1. Dependency relation graph of the sentence "Sources said that Tom was shot by police" generated by the Stanford parser (relations: nsubj, ccomp, complm, nsubjpass, auxpass, agent).

5.3 Factorization approach of the English-Manipuri SMT system

Manipuri case markers are decided by the dependency relations and aspect information of English. Figure 2 shows the translation factors used in the translation between English and Manipuri.

(i) Tomba drives the car.
    ত াম্বনা কারদু ত ৌই
    Tomba-na car-du thou-i
    (Tomba) (the car) (drives)

    Tomba|empty|nsubj drive|s|empty the|empty|det car|empty|dobj

A subject requires a case marker in a clause with a perfective form such as –না (na). This can be represented as:

    suffix + dependency relation → case marker
    s|empty + empty|dobj → না (na)

(ii) Birds are flying.
    উচেকশিং পাইশর
    ucheksing payri
    (birds) (are flying)

    Bird|s|nsubj are|empty|aux fly|ing|empty

Thus, the English-Manipuri factorization consists of:

- a lemma to lemma translation factor [i.e., Bird → উচেক (uchek)]
- a suffix + dependency relation → suffix factor [i.e., s + nsubj → শিং (sing)]
- a lemma + suffix → surface form generation factor [i.e., উচেক (uchek) + শিং (sing) → উচেকশিং (ucheksing)]
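The factorization steps listed above can be sketched as two table lookups followed by surface-form generation. The tiny factor tables below hold only the example from the text and are otherwise hypothetical; in the real system these mappings are learned from the factored parallel corpus.

```python
# Sketch of the English-Manipuri factored steps: lemma -> lemma translation,
# (suffix, dependency relation) -> Manipuri suffix, then lemma + suffix ->
# surface form (generation step).
LEMMA = {"bird": "উচেক"}            # uchek
SUFFIX = {("s", "nsubj"): "শিং"}    # sing

def translate_factored(lemma, suffix, deprel):
    m_lemma = LEMMA[lemma.lower()]
    m_suffix = SUFFIX.get((suffix, deprel), "")   # no marker if pair unseen
    return m_lemma + m_suffix                     # generation: concatenate

print(translate_factored("Bird", "s", "nsubj"))   # -> উচেকশিং (ucheksing)
```

Because lemma and suffix are translated by separate factors, the model can produce a surface form (here উচেকশিং) even if that exact form never occurred in the training data, which is precisely the data-sparseness argument made in Section 5.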

Figure 2. English to Manipuri translation factors (input: word, lemma, suffix; output: word, lemma, case marker, dependency relation).

5.4 Factorization approach of the Manipuri-English SMT system

Manipuri case markers are responsible for identifying the dependency relations and aspect information of English. Figure 3 shows the translation factors used in the translation between Manipuri and English. The Manipuri-English factorization consists of:

- Translation factor: lemma to lemma [e.g., উচেক (uchek) → Bird]
- Translation factor: suffix + POS → dependency relation + POS + suffix [e.g., শিং (sing) + NN → nsubj + NN + s]
- Generation factor: lemma + POS + dependency relation + suffix → surface form [e.g., উচেক (uchek) + NN + nsubj + শিং (sing) → উচেকশিং (ucheksing)]

Figure 3. The Manipuri-English translation factors (input: word, lemma, POS, suffix/case marker; output: word, lemma, POS, dependency relation, suffix).

5.5 Syntactically enriched output

High-order sequence models (just like n-gram language models over words) are used in order to support the syntactic coherence of the output (Koehn and Hoang, 2007). Adding a part-of-speech factor on the output side and exploiting it with 7-gram sequence models (as shown in Figure 4) results in minor improvements in BLEU score.

Figure 4. By generating additional linguistic factors on the output side (words with a 3-gram model, parts of speech with a 7-gram model), high-order sequence models over these factors support the syntactic coherence of the output.

6 Experimental Setup

A number of experiments have been carried out using the factored translation framework and incorporating linguistic information. The toolkits used in the experiments are:

- The Stanford Dependency Parser4 was used (i) to generate the dependency relations and (ii) for syntactic reordering of the input English sentences using the Parse::RecDescent module.
- The Moses5 toolkit (Koehn, 2007) was used for training with GIZA++6, for decoding, and for minimum error rate training (Och, 2003) for tuning.
- The SRILM7 toolkit (Stolcke, 2002) was used to build language models, with 10,350 Manipuri sentences for the English-Manipuri system and four and a half million English wordforms collected from the news domain for the Manipuri-English system.
- The English morphological analyzer morpha8 (Minnen et al., 2001) was used for the English side.

4 http://nlp.stanford.edu/software/lex-parser.shtml
5 http://www.statmt.org/moses/
6 http://www.fjoch.com/GIZA++.html
7 http://www.speech.sri.com/projects/srilm
8 ftp://ftp.informatics.susx.ac.uk/pub/users/johnca/morph.tar.gz

The stemmer from the Manipuri morphological analyzer (Singh and Bandyopadhyay, 2006) was used for the Manipuri side, and the Manipuri POS tagger (Singh et al., 2008) was used to tag the POS (parts-of-speech) factors of the input Manipuri sentences.

7 Evaluation

7.1 English-Manipuri SMT System

The evaluation of the machine translation systems developed in the present work is done in two ways: automatic scoring against a reference translation, and subjective evaluation, as discussed in (Ramanathan et al., 2009).

Evaluation metrics:
- NIST (Doddington, 2002): a higher score means a better translation, measured by the precision of n-grams.
- BLEU (Papineni et al., 2002): this metric gives the precision of n-grams with respect to the reference translation, combined with a brevity penalty.

Table 3 shows the BLEU and NIST scores of the system using the proposed translation factors. Table 4 shows the BLEU and NIST scores of the English-Manipuri SMT systems using lexicalized and syntactic reordering.

Table 3. Evaluation scores of the English-Manipuri SMT system using various translation factors

Model                                  BLEU     NIST
Baseline (surface)                     13.045   4.25
Lemma + Suffix                         15.237   4.79
Lemma + Suffix + Dependency Relation   16.873   5.10

Table 4. Evaluation scores of the English-Manipuri SMT system using lexicalized and syntactic reordering

Model                Reordering    BLEU     NIST
Baseline (surface)   -             13.045   4.25
Surface              Lexicalized   13.501   4.32
Surface              Syntactic     14.142   4.47

Table 2 shows the corpus statistics used in the experiment. The corpus is annotated with the proposed factors. The following models were developed for the experiment.

Table 2. Training, development and testing corpus statistics

              No. of sentences   No. of words
Training      10350              296728
Development   600                16520
Test          500                15204

Baseline: the model is developed using the default settings in Moses.

Lemma + Suffix: this model uses lemma and suffix factors on the source side and lemma and suffix on the target side, for lemma-to-lemma and suffix-to-suffix translation, with a generation step from lemma plus suffix to surface form.

Lemma + Suffix + Dependency Relation: lemma, suffix and dependency relations are used on the source side. The translation steps are (a) lemma to lemma and (b) suffix + dependency relation to suffix, and the generation step is lemma + suffix to surface form.

Input/output of the English-Manipuri SMT system:

(1a) Input: Going to school is obligatory for students.
Reference: স্কুল ে重পা ছাত্রশিংগী ত ৌদ য়াদ্রবা মচ ৌশন | (School chatpa shatra-sing-gi touda yadraba mathouni.)
Baseline output: স্কুল মচ ৌ ে重পা ওই ছাত্র (school mathou chatpa oy shatra); gloss: school duty going is student.
Syntactic reordering output: ছাত্র স্কুল ে重পা ত ৌদ য়াদ্রবা (shatra school chatpa touda yadraba); gloss: student school going compulsory.
Dependency output: ছাত্রশিং স্কুল ে重পা মচ ৌশন (shatrasing schoolda chatpa mathouni); gloss: students going to the school is duty.

(1b) Input: Krishna has a flute in his hand.
Reference: কৃ ষ্ণগী খুত্তা ত ৌশদ্র অমা লল | (Krishna-gi khut-ta toudri ama lei.)
Syntactic reordering output: কৃ ষ্ণ লল খু重 অমা ত ৌশদ্র (Krishna lei khut ama toudri); gloss: Krishna has a hand flute.
Dependency output: কৃ ষ্ণগী লল ত ৌশদ্র অমা খুত্তা (krishnagi lei toudri ama khutta); gloss: Krishna has a flute in his hand.

One of the main aspects required for the fluency of a sentence is agreement. Certain words have to match in gender, case, number, person, etc. within a sentence. The rules of agreement are language dependent and are closely linked to the morphological structure of the language. Subjective evaluations on 100 sentences have been performed for fluency and adequacy by two judges. Fluency measures how well formed the sentences are at the output, and adequacy measures the closeness of the output sentence to the reference translation. Table 5 and Table 6 show the adequacy and fluency scales used for evaluation, and Table 7 shows the scores of the evaluation.

Level   Interpretation
4       Full meaning is conveyed
3       Most of the meaning is conveyed
2       Poor meaning is conveyed
1       No meaning is conveyed
Table 5. Adequacy scale

Level   Interpretation
4       Flawless with no grammatical error
3       Good output with minor errors
2       Disfluent, ungrammatical, with correct phrase
1       Incomprehensible
Table 6. Fluency scale

System                Sentence length   Fluency   Adequacy
Baseline              <=15 words        1.95      2.24
Baseline              >15 words         1.49      1.75
Reordered             <=15 words        2.58      2.75
Reordered             >15 words         1.82      1.96
Dependency Relation   <=15 words        2.83      2.91
Dependency Relation   >15 words         1.94      2.10
Table 7. Fluency and adequacy scores on a sentence-length basis for the English-Manipuri SMT system

7.2 Manipuri-English SMT System

The system uses the corpus statistics shown in Table 2. The corpus is annotated with the proposed factors. The following models are developed for the experiment. The baseline and lemma+suffix systems follow the same factors as English-Manipuri.

Lemma + Suffix + POS: Lemma, suffix and POS are used on the source side. The translation steps are (a) lemma to lemma and (b) suffix + POS to POS + suffix + dependency relation, and the generation step is lemma + suffix + POS + dependency relation to surface form.

Model                  BLEU     NIST
Baseline (surface)     13.452   4.31
Lemma + Suffix         16.137   4.89
Lemma + Suffix + POS   17.573   5.15
Table 8. Evaluation scores of the Manipuri-English SMT system using various translation factors

Table 8 shows the BLEU and NIST scores of the Manipuri-English systems using the different factors. Table 9 shows the scores of using lexicalized reordering and a POS language model.

Model                             BLEU     NIST
Baseline + POS LM                 14.341   4.52
Baseline + Lexicalized            13.743   4.46
Baseline + Lexicalized + POS LM   14.843   4.71
Table 9. Evaluation scores of the Manipuri-English SMT system using lexicalized reordering and a POS language model

Input/Output of Manipuri-English SMT:

(2a) Input: স্কুল ে重পা ছাত্রশিংগী ত ৌদ য়াদ্রবা মচ ৌশন |
     gloss: School chatpa shatra-sing-gi touda yadraba mathouni.
     Going to school is obligatory for students.
     Baseline output: school going to the students important
     Lexicalized Reordered output: school going important to the students
     Lemma+Suffix+POS+lexicalized reordered output: School going important to the students

(2b) Input: কৃ ষ্ণগী খুত্তা ত ৌশদ্র অমা লল |
     gloss: Krishna-gi khut-ta toudri ama lei.
     Krishna has a flute in his hand.
     Baseline output: Krishna is flute and hand
     Lexicalized Reordered output: Krishna flute has his hand
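The automatic scores above are BLEU and NIST, which are corpus-level metrics (Papineni et al., 2002; Doddington, 2002). As a rough illustration of the idea behind BLEU, here is a simplified, add-one-smoothed sentence-level sketch of our own; it is not the exact scoring script used for the tables.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU sketch: geometric mean of modified
    n-gram precisions (add-one smoothed) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        matched = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append((matched + 1) / (total + 1))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, `bleu("Krishna is flute and hand".split(), "Krishna has a flute in his hand".split())` scores well below a candidate that matches the reference word order, which is exactly the behaviour the factored and reordered systems are trying to exploit.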

     Lemma+Suffix+POS+lexicalized reordered output: Krishna has flute his hand

By considering the lemma along with the suffix and POS factors, the fluency and adequacy of the output are better addressed, as shown by the sample inputs and outputs (2a) and (2b), over the baseline system. Using the Manipuri stemmer, the case markers and suffixes are taken into account for different possible word forms, thereby helping to overcome the data sparseness problem. Table 10 shows the adequacy and fluency scores of the evaluation.

System                 Sentence length   Fluency   Adequacy
Baseline               <=15 words        1.93      2.31
Baseline               >15 words         1.51      1.76
Reordered              <=15 words        2.48      2.85
Reordered              >15 words         1.83      1.97
Lemma + Suffix + POS   <=15 words        2.86      2.92
Lemma + Suffix + POS   >15 words         2.01      2.11
Table 10. Fluency and adequacy scores on a sentence-length basis for the Manipuri-English SMT system

Input: Khamba pushed the stone with a lever.
খম্বনো জম্ফত্নো নংু অদ ু ইল্লম্মী |
Outputs:
Syntactic Reordered: খম্ব নংু জম্ফ重 অদ ু ইল্লল্ল |
Khamba nung jamfat adu illi
gloss: Khamba stone the lever push
Dependency: খম্বনো নংু অদ ু জম্ফত্নো ইল্লল্ল |
Khambana nung adu jamfatna illi
gloss: Khamba the stone pushed with lever

By the use of the semantic relation, নো (na) is attached to খম্ব (Khamba), which makes the meaning খম্বনো "by Khamba" instead of just খম্ব "Khamba".

Input: Suddenly the woman burst into tears.
খঙব ৌদনো বমৌ অদনু ো মল্লি ল্লিন্থরকই |
Outputs:
Syntactic Reordered: নিু ী থনু ো ল্লিরোংগো কপ্পী |
Nupi thuna pirang-ga kappi
gloss: woman soon tears cry
Dependency: অথ ু দো নিু ীদ ু কপ্লম্মী |
Athubada nupidu kaplammi
gloss: suddenly the woman cried

Here, নিু ী (nupi) is suffixed by দু (du) to produce নিু ীদ ু "the woman" instead of just নিু ী "woman".

Subjective evaluations on 100 sentences have been performed for fluency and adequacy. In the process of subjective evaluation, sentences were judged on fluency, adequacy and the number of errors in case marking/morphology. It is observed that poor word order makes the baseline output almost incomprehensible, while lexicalized reordering solves the problem correctly along with the parts-of-speech language model (POS LM). A statistical significance test is performed to judge whether a change in score that comes from a change in the system reflects a change in overall translation quality. It is found that all the differences are significant at the 99% level.

8 Discussion

The factored approach using the proposed factors shows improved fluency and adequacy of the Manipuri output for the English-Manipuri system, as shown in Table 6. Using the Stanford-generated relations shows an improvement in terms of fluency and adequacy for shorter sentences over the longer ones. The factored approach of the Manipuri-English SMT system also shows improved BLEU and NIST scores using the proposed factors, as shown in Table 8, as well as gains in fluency and adequacy scores, as shown in Table 10.

9 Conclusion

A framework for a Manipuri and English bidirectional SMT system using the factored model is experimented with, with the goal of improving the translation output and reducing the amount of training data. The output of the translation is improved by incorporating morphological information and semantic relations through tighter integration. The systems are evaluated using the automatic scoring techniques BLEU and NIST. Subjective evaluation of the systems is done to assess fluency and adequacy. Fluency and adequacy are also better addressed for shorter sentences than for longer ones using semantic relations. The improvement is statistically significant.

References

Avramidis, E. and Koehn, P. 2008. Enriching morphologically poor languages for Statistical Machine Translation. Proceedings of ACL-08: HLT.

Callison-Burch, C., Osborne, M. and Koehn, P. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. In Proceedings of EACL-2006.

Doddington, G. 2002. Automatic evaluation of Machine Translation quality using n-gram co-occurrence statistics. In Proceedings of HLT 2002, San Diego, CA.

Koehn, P., and Hoang, H. 2007. Factored Translation Models. In Proceedings of EMNLP-2007.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the ACL 2007 Demo and Poster Sessions, pages 177-180, Prague.

de Marneffe, M.-C. and Manning, C. 2008. Stanford Typed Dependencies Manual.

Minnen, G., Carroll, J., and Pearce, D. 2001. Applied Morphological Processing of English. Natural Language Engineering, 7(3), pages 207-223.

Nießen, S., and Ney, H. 2004. Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information. Computational Linguistics, 30(2), pages 181-204.

Och, F. 2003. Minimum error rate training in Statistical Machine Translation. Proceedings of ACL.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th ACL, Philadelphia, PA.

Popovic, M., and Ney, H. 2006. Statistical Machine Translation with a small amount of bilingual training data. 5th LREC SALTMIL Workshop on Minority Languages.

Ramanathan, A., Choudhury, H., Ghosh, A., and Bhattacharyya, P. 2009. Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 800-808.

Rao, D., Mohanraj, K., Hegde, J., Mehta, V. and Mahadane, P. 2000. A practical framework for syntactic transfer of compound-complex sentences for English-Hindi Machine Translation. Proceedings of KBCS 2000, pages 343-354.

Singh, Thoudam D., and Bandyopadhyay, S. 2006. Word Class and Sentence Type Identification in Manipuri Morphological Analyzer. Proceedings of Modeling and Shallow Parsing of Indian Languages (MSPIL) 2006, IIT Bombay, pages 11-17, Mumbai, India.

Singh, Thoudam D., and Bandyopadhyay, S. 2008. Morphology Driven Manipuri POS Tagger. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP-08) Workshop on Natural Language Processing of Less Privileged Languages (NLPLPL) 2008, pages 91-98, Hyderabad, India.

Singh, Thoudam D., and Bandyopadhyay, S. 2010a. Manipuri-English Example Based Machine Translation System. International Journal of Computational Linguistics and Applications (IJCLA), ISSN 0976-0962, pages 147-158.

Singh, Thoudam D., Singh, Yengkhom R., and Bandyopadhyay, S. 2010b. Manipuri-English Semi Automatic Parallel Corpora Extraction from Web. In Proceedings of the 23rd International Conference on the Computer Processing of Oriental Languages (ICCPOL 2010) - New Generation in Asian Information Processing, San Francisco Bay, CA, USA, pages 45-48.

Singh, Thoudam D., and Bandyopadhyay, S. 2010c. Statistical Machine Translation of English-Manipuri using Morpho-Syntactic and Semantic Information. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, Colorado, USA. (To appear)

Stolcke, A. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, September.

Wang, C., Collins, M., and Koehn, P. 2007. Chinese syntactic reordering for statistical machine translation. Proceedings of EMNLP-CoNLL.

A Discriminative Syntactic Model for Source Permutation via Tree Transduction

Maxim Khalilov and Khalil Sima'an
Institute for Logic, Language and Computation
University of Amsterdam
{m.khalilov, k.simaan}@uva.nl

Abstract

A major challenge in statistical machine translation is mitigating the word order differences between source and target strings. While reordering and lexical translation choices are often conducted in tandem, source string permutation prior to translation is attractive for studying reordering using hierarchical and syntactic structure. This work contributes an approach for learning source string permutation via transfer of the source syntax tree. We present a novel discriminative, probabilistic tree transduction model, and contribute a set of empirical upperbounds on translation performance for English-to-Dutch source string permutation under sequence and parse tree constraints. Finally, the translation performance of our learning model is shown to significantly outperform the state-of-the-art phrase-based system.

1 Introduction

From its beginnings, statistical machine translation (SMT) has faced a word reordering challenge that has a major impact on translation quality. While standard mechanisms embedded in phrase-based SMT systems, e.g. (Och and Ney, 2004), deal efficiently with word reordering within a limited window of words, they are still not expected to handle all possible reorderings that involve words beyond this relatively narrow window, e.g., (Tillmann and Ney, 2003; Zens and Ney, 2003; Tillman, 2004). More recent work handles word order differences between source and target languages using hierarchical methods that draw on Inversion Transduction Grammar (ITG), e.g., (Wu and Wong, 1998; Chiang, 2005). In principle, the latter approach explores reordering defined by the choice of swapping the order of sibling subtrees under each node in a binary parse-tree of the source/target sentence.

An alternative approach aims at minimizing the need for reordering during translation by permuting the source sentence as a pre-translation step, e.g., (Collins et al., 2005; Xia and McCord, 2004; Wang et al., 2007; Khalilov, 2009). In effect, the translation process works with a model for source permutation (s → s') followed by a translation model (s' → t), where s and t are source and target strings and s' is the target-like permuted source string. In how far source permutation can reduce the need for reordering in conjunction with translation is an empirical question.

In this paper we define source permutation as the problem of learning how to transfer a given source parse-tree into a parse-tree that minimizes the divergence from target word-order. We model the tree transfer τ_s → τ_{s'} as a sequence of local, independent transduction operations, each transforming the current intermediate tree τ_{s'_i} into the next intermediate tree τ_{s'_{i+1}}, with τ_{s'_0} = τ_s and τ_{s'_n} = τ_{s'}. A transduction operation merely permutes the sequence of n > 1 children of a single node in an intermediate tree, i.e., unlike previous work, we do not binarize the trees. The number of permutations is factorial in n, and learning a sequence of transductions for explaining a source permutation can be computationally rather challenging (see (Tromble and Eisner, 2009)). Yet,

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 92-100, COLING 2010, Beijing, August 2010.

from the limited perspective of source string permutation (s → s'), another challenge is to integrate a figure of merit that measures in how far s' resembles a plausible target word-order.

We contribute solutions to these challenging problems. Firstly, we learn the transduction operations using a discriminative estimate of P(π(α_x) | N_x, α_x, context_x), where N_x is the label of node (address) x, N_x → α_x is the context-free production under x, π(α_x) is a permutation of α_x, and context_x represents a surrounding syntactic context. As a result, this constrains {π(α_x)} only to those permutations found in the training data, and it conditions the transduction application probability on its specific contexts. Secondly, in every sequence s'_0 = s, ..., s'_n = s' resulting from tree transductions, we prefer those local transductions on τ_{s'_{i-1}} that lead to a source string permutation s'_i that is closer to target word order than s'_{i-1}; we employ s' language model probability ratios as a measure of word order improvement.

In how far does the assumption of source permutation provide any window for improvement over a phrase-based translation system? We conduct experiments on translating from English into Dutch, two languages which are characterized by a number of systematic divergences between them. Initially, we conduct oracle experiments with varying constraints on source permutation to set upperbounds on performance relative to a state-of-the-art system. Translating the oracle source string permutation (obtained by untangling the crossing alignments) offers a large margin of improvement, whereas the oracle parse tree permutation provides a far smaller improvement. A minor change to the latter to also permute constituents that include words aligned with NULL offers further improvement, yet lags behind bare string permutation. Subsequently, we present translation results using our learning approach, and exhibit a significant improvement in BLEU score over the state-of-the-art baseline system. Our analysis shows that syntactic structure can provide important clues for reordering in translation, especially for dealing with long distance cases found in, e.g., English and Dutch. Yet, tree transduction by merely permuting the order of sister subtrees might turn out insufficient.

2 Baseline: Phrase-based SMT

Given a word-aligned parallel corpus, phrase-based systems (Och and Ney, 2002; Koehn et al., 2003) work with (in principle) arbitrarily large phrase pairs (also called blocks) acquired from word-aligned parallel data under a simple definition of translational equivalence (Zens et al., 2002). The conditional probabilities of one phrase given its counterpart are interpolated log-linearly together with a set of other model estimates:

  ê_1^I = argmax_{e_1^I} { Σ_{m=1}^{M} λ_m h_m(e_1^I, f_1^J) }   (1)

where each feature function h_m refers to a system model, and the corresponding λ_m refers to the relative weight given to this model. A phrase-based system employs feature functions for a phrase pair translation model, a language model, a reordering model, and a model to score translation hypotheses according to length. The weights λ_m are usually optimized for system performance (Och, 2003) as measured by BLEU (Papineni et al., 2002). Two reordering methods are widely used in phrase-based systems.

Distance-based. A simple distance-based reordering model, the default in the Moses system, is the first reordering technique under consideration. This model provides the decoder with a cost linear in the distance between words that should be reordered.

MSD. A lexicalized block-oriented data-driven reordering model (Tillman, 2004) considers three different orientations: monotone (M), swap (S), and discontinuous (D). The reordering probabilities are conditioned on the lexical context of each phrase pair, and decoding works with a block sequence generation process with the possibility of swapping a pair of blocks.

3 Related Work on Source Permutation

The integration of linguistic syntax into SMT systems offers a potential solution to the reordering problem. For example, syntax is successfully integrated into hierarchical SMT (Zollmann and
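The log-linear decision rule of Equation 1 amounts to picking the hypothesis with the highest weighted sum of feature scores. A minimal sketch of that selection step follows; the feature names, weights, and scores are invented for illustration and do not come from the paper.

```python
def loglinear_score(feature_scores, weights):
    """Weighted sum over models: sum_m lambda_m * h_m(e, f)  (cf. Eq. 1)."""
    return sum(weights[m] * h for m, h in feature_scores.items())

def best_hypothesis(hypotheses, weights):
    """Return the hypothesis e maximizing the log-linear interpolation."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Illustrative lambdas and (log-domain) model scores, not tuned values.
weights = {"tm": 0.5, "lm": 0.3, "reord": 0.1, "len": 0.1}
hyps = [
    {"text": "hyp A", "features": {"tm": -2.0, "lm": -1.5, "reord": -0.2, "len": -0.1}},
    {"text": "hyp B", "features": {"tm": -1.0, "lm": -2.5, "reord": -0.4, "len": -0.1}},
]
best = best_hypothesis(hyps, weights)
```

In practice the lambdas are not set by hand but tuned against BLEU, e.g. with minimum error rate training (Och, 2003), as the section notes.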

Venugopal, 2006). Similarly, the tree-to-string syntax-based transduction approach offers a complete translation framework (Galley et al., 2006).

The idea of augmenting SMT by a reordering step prior to translation has often been shown to improve translation quality. Clause restructuring performed with hand-crafted reordering rules for German-to-English and Chinese-to-English tasks is presented in (Collins et al., 2005) and (Wang et al., 2007), respectively. In (Xia and McCord, 2004; Khalilov, 2009) word reordering is addressed by exploiting syntactic representations of source and target texts.

Other reordering models provide the decoder with multiple word orders. For example, the MaxEnt reordering model described in (Xiong et al., 2006) provides a hierarchical phrasal reordering system integrated within a CKY-style decoder. In (Galley and Manning, 2008) the authors present an extension of the famous MSD model (Tillman, 2004) able to handle long-distance word-block permutations. More recently, (PVS, 2010) presents an effective application of data mining techniques to syntax-driven source reordering for MT.

Recently, Tromble and Eisner (2009) define source permutation as learning source permutations; the model works with a preference matrix for word pairs, expressing preference for their two alternative orders, and a corresponding weight matrix that is fit to the parallel data. The huge space of permutations is then structured using a binary synchronous context-free grammar (Binary ITG) with O(n^3) parsing complexity, and the permutation score is calculated recursively over the tree, at every node accumulating the relative differences between the word-pair scores taken from the preference matrix. Application to German-to-English translation exhibits some performance improvement.

Our work is in the general learning direction taken in (Tromble and Eisner, 2009) but differs both in defining the space of permutations, using local probabilistic tree transductions, as well as in the learning objective, aiming at scoring permutations based on a log-linear interpolation of a local syntax-based model with a global string-based (language) model.

4 Pre-Translation Source Permutation

Given a word-aligned parallel corpus, we define the source string permutation as the task of learning to unfold the crossing alignments between sentence pairs in the parallel corpus. Let a source-target sentence pair s → t be given with word alignment set a between their words. Unfolding the crossing instances in a should lead to as monotone an alignment a' as possible between a permutation s' of s and the target string t. Conducting such a "monotonization" on the parallel corpus gives two parallel corpora: (1) a source-to-permutation parallel corpus (s → s') and (2) a source permutation-to-target parallel corpus (s' → t). The latter corpus is word-aligned automatically again and used for training a phrase-based translation system, while the former corpus is used for training our model for pre-translation source permutation via parse tree transductions.

Figure 1: Example of crossing alignments and long-distance reordering using a source parse tree.

In itself, the problem of permuting the source string to unfold the crossing alignments is computationally intractable (see (Tromble and Eisner, 2009)). However, different kinds of constraints can be made on unfolding the crossing alignments in a. A common approach in hierarchical SMT is to assume that the source string has a binary parse tree, and the set of eligible permutations is defined by binary ITG transductions on this tree. This defines permutations that can be obtained only by at most inverting pairs of children under nodes of the source tree. Figure 1 exhibits a long-distance reordering of the verb in English-to-Dutch translation: inverting the order of the children under the VP node would unfold the crossing alignment.
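The monotonization step can be sketched compactly: reorder the source positions by the target positions they are aligned to, so that crossing links are unfolded. This is a simplification of our own (the paper describes a left-to-right unfolding scan, and real alignments can be many-to-many); unaligned words here simply stay next to their left neighbour.

```python
def monotonize(src_len, alignment):
    """Return a permutation of source positions ordered by the (minimum)
    target position each source word is aligned to, unfolding crossing
    alignment links. alignment: set of (src_pos, tgt_pos) links.
    Unaligned source words inherit a position just after their left
    neighbour (a simplifying assumption, not the paper's exact rule)."""
    keys = {}
    for s, t in alignment:
        keys[s] = min(t, keys.get(s, t))
    order = []
    last = -1.0
    for s in range(src_len):
        last = keys.get(s, last + 0.001)
        order.append((last, s))
    return [s for _, s in sorted(order)]
```

Applying the returned permutation to the source side of the corpus yields the s → s' corpus; re-aligning s' with t then gives the (near-)monotone s' → t corpus described above.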

4.1 Oracle Performance

As has been shown in the literature (Costa-jussà and Fonollosa, 2006; Khalilov and Sima'an, 2010; Wang et al., 2007), source and target text monotonization leads to a significant improvement in terms of translation quality. However, it is not known how many alignment crossings can be unfolded under different parse tree conditions. In order to gauge the impact of corpus monotonization on translation system performance, we trained a set of oracle translation systems, which create target sentences that follow the source language word order using the word alignment links and various constraints.

Figure 2: Reordering example. (a) Word alignment. (b) Parse tree and corresponding alignment. (c) Word alignment and ADJP span.

The set-up of our experiments and corpus characteristics are detailed in Section 5. Table 1 reports translation scores of the oracle systems. Notice that all the numbers are calculated on the re-aligned corpora. Baseline results are provided for informative purposes.

Source          BLEU    NIST
baseline-dist   24.04   6.29
baseline-MSD    24.04   6.28
oracle-string   27.02   6.51
oracle-tree     24.09   6.30
oracle-span     24.95   6.37
Table 1: Translation scores of oracle systems.

String permutation. The first oracle system under consideration is created by traversing the string from left to right and unfolding all crossing alignment links (we call this system oracle-string). For example, in Figure 2(a), the oracle-string system generates the string "do so gladly", swapping the words "do" and "gladly" without considering the parse tree. The first line of the table shows the performance of the oracle-string system with monotone source and target portions of the corpus.

Oracle under tree constraint. We use a syntactic parser that provides n-ary constituency parses for the English source sentences. Now we constrain the unfolding of crossing alignments to only those alignment links which agree with the structure of the source-side parse tree, and consider only the constituents which include aligned tokens. Unfolding a crossing alignment is modeled as permuting the children of a node in the parse tree. We refer to this oracle system as oracle-tree. For the example provided in Figure 2(b), there is no way to construct a monotonized version of the sentence, since the word "so" is aligned to NULL and impedes swapping the order of VB and ADJP under the VP.

Oracle under relaxed tree constraint. The oracle-tree system does not permute the words which are both (1) not found in the alignment and (2) spanned by the sub-trees sibling to the reordering constituents. Now we introduce a relaxed version of the parse tree constraint: the order of the children of a node is permuted when the node covers the reordering constituents and also when the frontier contains leaf nodes aligned with NULL (oracle-span). For example, in Figure 2(c) the English word "so" is not aligned but, according to the relaxed version, must move together with the word "gladly" since they share a parent node (ADJP).

The main conclusion which can be drawn from the oracle results is that there is a possibility for a relatively big (≈3 BLEU points) improvement with complete unfolding of crossing alignments, and a very limited one (≈0.05 BLEU points) with the same done under the parse tree constraint. A tree-based system that allows for permuting unaligned words that are covered by a dominating parent node shows more improvement in terms of BLEU and NIST scores (≈0.9 BLEU points).

The gap between oracle-string and oracle-tree performance is due to alignment crossings which

cannot be unfolded under trees (illustrated in Figure 3), but possibly also due to parse and alignment errors.

Figure 3: Example of an alignment crossing that does not agree with the parse tree.

4.2 Source Permutation via Syntactic Transfer

Given a parallel corpus with string pairs s → t with word alignment a, we create a source permuted parallel corpus s → s' by unfolding the crossing alignments in a: this is done by scanning the string s from left to right and moving words involved in crossing alignments to positions where the crossing alignments are unfolded. The source strings s are parsed, leading to a single parse tree τ_s per source string.

Our model aims at learning from the source permuted parallel corpus s → s' a probabilistic optimization argmax_{π(s)} P(π(s) | s, τ_s). We assume that the set of permutations {π(s)} is defined through a finite set of local transductions over the tree τ_s. Hence, we view the permutations leading from s to s' as a sequence of local tree transductions τ_{s'_0} → ... → τ_{s'_n}, where s'_0 = s and s'_n = s', and each transduction τ_{s'_{i-1}} → τ_{s'_i} is defined using a tree transduction operation that at most permutes the children of a single node in τ_{s'_{i-1}}, as defined next.

A local transduction τ_{s'_{i-1}} → τ_{s'_i} is modelled by an operation that applies to a single node with address x in τ_{s'_{i-1}}, labeled N_x, and may permute the ordered sequence of children α_x dominated by node x. This constitutes a direct generalization of the ITG binary inversion transduction operation. We assign a conditional probability to each such local transduction:

  P(τ_{s'_i} | τ_{s'_{i-1}}) ≈ P(π(α_x) | N_x → α_x, C_x)   (2)

where π(α_x) is a permutation of α_x (the ordered sequence of node labels under x) and C_x is a local tree context of node x in tree τ_{s'_{i-1}}. One wrinkle in this definition is that the number of possible permutations of α_x is factorial in the length of α_x. Fortunately, the source permuted training data exhibits only a fraction of the possible permutations, even for longer α_x sequences. Furthermore, by conditioning the probability on local context, the general applicability of the permutation is restrained.

Given this definition, we define the probability of the sequence of local tree transductions τ_{s'_0} → ... → τ_{s'_n} as

  P(τ_{s'_0} → ... → τ_{s'_n}) = Π_{i=1}^{n} P(τ_{s'_i} | τ_{s'_{i-1}})   (3)

The problem of calculating the most likely permutation under this transduction model is made difficult by the fact that different transduction sequences may lead to the same permutation, which demands summing over these sequences. Furthermore, because every local transduction conditions on the local context of an intermediate tree, this quickly risks becoming intractable (even when we use packed forests). In practice we take a pragmatic approach and greedily select at every intermediate point τ_{s'_{i-1}} → τ_{s'_i} the single most likely local transduction that can be conducted on any node of the current intermediate tree τ_{s'_{i-1}}, using an interpolation of the term in Equation 2 with string probability ratios as follows:

  P(π(α_x) | N_x → α_x, C_x) × P(s'_i) / P(s'_{i-1})

The rationale behind this log-linear interpolation is that our source permutation approach aims at finding the optimal permutation s' of s that can serve as input for a subsequent translation model. Hence, we aim at tree transductions that are syntactically motivated and that also lead to improved string permutation. In this sense, the tree transduction definitions can be seen as an efficient and

syntactically informed way to define the space of possible permutations.

We estimate the string probabilities P(s'_i) using 5-gram language models trained on the s' side of the source permuted parallel corpus s → s'. We estimate the conditional probability P(π(α_x) | N_x → α_x, C_x) using a Maximum-Entropy framework, where feature functions are defined to capture the permutation as a class, the node label N_x and its head POS tag, the child sequence α_x together with the corresponding sequence of head POS tags, and other features corresponding to different contextual information.

We were particularly interested in those linguistic features that motivate reordering phenomena from the syntactic and linguistic perspective. The features that were used for training the permutation system are extracted for every internal node of the source tree that has more than one child:

- Local tree topology. Sub-tree instances that include the parent node and the ordered sequence of child node labels.
- Dependency features. Features that determine the POS tag of the head word of the current node, together with the sequence of POS tags of the head words of its child nodes.
- Syntactic features. Three binary features from this class describe: (1) whether the parent node is a child of a node annotated with the same syntactic category, (2) whether the parent node is a descendant of a node annotated with the same syntactic category, and (3) whether the current subtree is embedded in a "SENT-SBAR" sub-tree. The latter feature intends to model the divergence in word order in relative clauses between Dutch and English, which is illustrated in Figure 1.

For combinatory models we use the following notations: M4 = Σ_{i ∈ {NP, VP, SENT, SBAR}} M_i and M2 = Σ_{i ∈ {VP, SENT}} M_i.

5 Experiments and results

The SMT system used in the experiments was implemented within the open-source MOSES toolkit (Koehn et al., 2007). The standard training and weight tuning procedures which were used to build our system are explained in detail on the MOSES web page (footnote 1). The MSD model was used together with a distance-based reordering model. Word alignment was estimated with the GIZA++ tool (footnote 2) (Och, 2003), coupled with mkcls (footnote 3) (Och, 1999), which allows for statistical word clustering for better generalization. A 5-gram target language model was estimated using the SRI LM toolkit (Stolcke, 2002) and smoothed with modified Kneser-Ney discounting. We use the Stanford parser (footnote 4) (Klein and Manning, 2003) as the source-side parsing engine. The parser was trained on the English treebank set provided with 14 syntactic categories and 48 POS tags. The evaluation conditions were case-sensitive and included punctuation marks. For Maximum Entropy modeling we used the maxent toolkit (footnote 5).

Data. The experiment results were obtained using the English-Dutch corpus of the European Parliament Plenary Session transcription (EuroParl). Training corpus statistics can be found in Table 2.

                          Dutch    English
Sentences                 1.2 M    1.2 M
Words                     32.9 M   33.0 M
Average sentence length   27.20    27.28
Vocabulary                228 K    104 K
Table 2: Basic statistics of the English-Dutch EuroParl training corpus.

In initial experiments we piled all feature functions into a single model. Preliminary results showed that system performance increases if the set of patterns is split into partial classes conditioned on the current node label. Hence, we trained four separate MaxEnt models for the categories with a potentially high number of crossing alignments, namely VP, NP, SENT, and SBAR.

The development and test datasets were randomly chosen from the corpus and consisted of

Footnotes:
1 http://www.statmt.org/moses/
2 code.google.com/p/giza-pp/
3 http://www.fjoch.com/mkcls.html
4 http://nlp.stanford.edu/software/lex-parser.shtml
5 http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html

500 and 1,000 sentences, respectively. Both were provided with one reference translation.

Results
Evaluation of the system performance is twofold. In the first step, we analyze the quality of the reordering method itself. In the next step we look at the automatic translation scores and evaluate the impact which the choice of reordering strategy has on translation quality. In both stages of evaluation, the results are contrasted with the performance shown by the standard phrase-based SMT system (baseline) and with oracle results.

Source reordering analysis
Table 3 shows the parameters of the reordered system, allowing us to assess the effectiveness of the reordering permutations, namely: (1) the total number of crossings found in the word alignment (#C), (2) the size of the resulting phrase table (PT), and (3) the BLEU, NIST, and WER scores obtained using the monotonized parallel corpus (oracle) as a reference. All the numbers are calculated on the re-aligned corpora. Calculations are done on the basis of a 100,000-line extraction from the corpus [6] and the corresponding alignment matrix.

                            Scores per corpus
System        #C       PT      BLEU    NIST    WER
Oracle
  string      54.6K    48.4M   -       -       -
  tree        187.3K   30.3M   71.73   17.01   16.77
  span        146.9K   33.0M   73.41   17.11   15.73
Baselines
  baselines   187.0K   29.8M   71.70   17.07   16.55
Category models
  M_NP        188.9K   29.7M   71.63   17.07   16.52
  M_VP        168.1K   29.8M   73.17   17.16   15.99
  M_SENT      171.0K   29.8M   73.08   17.08   16.10
  M_SBAR      188.6K   29.8M   72.89   16.90   16.41
Combinatory models
  M_4         193.2K   29.1M   70.98   16.85   16.78
  M_2         165.4K   29.9M   73.07   16.92   15.88

Table 3: Main parameters of the tree-based reordering system.

Analysis
The number of crossings found in the word alignment intersection, together with the BLEU/NIST/WER scores estimated on reordered vs. monotonized data, reports the effectiveness of the reordering algorithm. The baseline rows show the number of alignment crossings found in the original (unmonotonized) corpus. The big gap between the number of crossings per corpus found in the oracle-string system [7] and in the baseline systems demonstrates the possible reduction of the system's non-monotonicity. The difference in the number of crossings and in the BLEU/NIST/WER scores between oracle-span and the best performing MaxEnt model (namely, M_2) shows the level of performance of the prediction module. The number of distinct phrase translation pairs in the translation table implicitly reveals the generalization capabilities of the translation system, since a smaller table simplifies the translation task. On the other hand, an increased number of shorter phrases can add noise to the reordered data and make decoding more complex. Hence, the size of the phrase table itself cannot be considered a robust indicator of translation potential.

Translation scores
The evaluation results for the development and test corpora are reported in Table 4. They include two baseline configurations (dist and MSD) and oracle results, and contrast them with the performance shown by different combinations of single-category tree-based reordering models. In the original table, the best scores within each experimental section are highlighted in grey.

                 Dev      Test
System           BLEU     BLEU    NIST
baseline dist    23.88    24.04   6.29
baseline MSD     24.07    24.04   6.28
oracle-string    26.28    27.02   6.50
oracle-tree      23.84    24.09   6.30
oracle-span      24.79    24.95   6.35
M_NP             23.79    23.81   6.27
M_VP             24.16    24.55   6.29
M_SENT           24.27    24.56   6.32
M_SBAR           23.99    24.12   6.27
M_4              23.50    23.86   6.29
M_2              24.28    24.64   6.33

Table 4: Experimental results.
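As a concrete reading of the #C statistic reported above, crossing links can be counted directly from a word alignment. The sketch below is a minimal illustration, assuming alignments are represented as (source index, target index) pairs; the paper does not spell out its exact counting procedure, so this is an approximation.

```python
def alignment_crossings(links):
    """Count crossing link pairs: links (i1, j1) and (i2, j2) cross
    when i1 < i2 but j1 > j2 (a quadratic-time, illustrative version
    of a #C-style statistic)."""
    links = sorted(links)
    return sum(
        1
        for a in range(len(links))
        for b in range(a + 1, len(links))
        if links[a][0] < links[b][0] and links[a][1] > links[b][1]
    )

# Monotone alignments yield 0; one inverted pair yields 1 crossing.
print(alignment_crossings([(0, 0), (1, 2), (2, 1)]))  # 1
```

A fully monotonized (oracle) corpus would drive this count toward zero, which is exactly the gap the #C column is meant to expose.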

[6] A smaller portion of the corpus is used for analysis in order to reduce evaluation time.
[7] The number of crossings for the oracle configuration is not zero since this parameter is calculated on the re-aligned corpus.

Table 4 shows that three of the six MaxEnt reordering systems outperform the baseline systems by about 0.5–0.6 BLEU points, which is statistically significant [8]. The combination of the VP, NP, SENT, and SBAR models (M_4) does not show good performance, possibly due to the increased sparseness of reordering patterns. However, the system that considers only the M_VP and M_SENT models (M_2) achieves a 0.62 BLEU score gain over the baseline configurations.

The main conclusion which can be drawn from the analysis of Tables 3 and 4 is that there is an evident correlation between the characteristics of the reordering system and the performance demonstrated by the translation system trained on the corpus with a reordered source part.

Example
Figure 4 exemplifies a sentence that presumably benefits from the monotonization of the source part of the parallel corpus. The example demonstrates a pervading syntactic distinction between English and Dutch: the reordering of the verb-phrase subconstituents VP NP PP within the relative clause into PP NP VP.

6 Conclusions and future work

We introduced a tree-based reordering model that aims at monotonizing the word order of the source and target languages as a pre-translation step. Our model avoids complete generalization of reordering instances by using tree contexts and limiting the permutations to data instances. From a learning perspective, our work shows that navigating a large space of intermediate tree transformations can be conducted effectively using both the source-side syntactic tree and a language model of the idealized (target-like) source-permuted language.

We have shown the potential for translation quality improvement when target sentences are created following the source language word order (≈ 3 BLEU points over the standard phrase-based SMT) and under the parse tree constraint (≈ 0.9 BLEU points). As can be seen from these results, our model exhibits competitive translation performance compared with the standard distance-based and lexical reordering.

The gap between the oracle and our system's results leaves room for improvement. We intend to study extensions of the current tree transfer model to narrow this performance gap. As a first step we are combining isolated models for concrete syntactic categories and aggregating more features into the MaxEnt model. Algorithmic improvements, such as beam search and chart parsing, could allow us to apply our method to full parse forests as opposed to a single parse tree.

[8] All statistical significance calculations are done for a 95% confidence interval and 1,000 resamples, following the guidelines from (Koehn, 2004).
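The resampling procedure cited in the footnote (Koehn, 2004) can be sketched as follows. This is a simplified paired version that compares resampled sums of per-sentence scores; a faithful implementation would recompute corpus-level BLEU on each resample rather than summing sentence scores.

```python
import random

def paired_bootstrap(scores_a, scores_b, resamples=1000, seed=0):
    """Estimate how often system A beats system B when the test set is
    resampled with replacement (cf. Koehn, 2004). scores_a/scores_b are
    per-sentence quality scores of the two systems on the same test set."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentence indices
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / resamples
```

A result of, say, 0.97 would mean system A wins on 97% of resamples, clearing a 95% confidence threshold.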

(a) Original parse tree. (b) Reordered parse tree.

Src:          that ... to lead the Commission during the next five-year term
Ref.:         dat ... om de komende vijf jaar de Commissie te leiden
Baseline MSD: dat ... om het voortouw te nemen in de Commissie tijdens de komende vijf jaar
Rrd src:      that ... during the next five-year term the Commission to lead
M_2:          dat ... om de Commissie tijdens de komende vijf jaar te leiden

(c) Translations.

Figure 4: Example of tree-based monotonization.
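The monotonization in Figure 4 can be illustrated with a toy permutation of children on the source parse. This is purely illustrative: the labels and the hard-coded pattern are hypothetical, whereas the actual system predicts permutations with the MaxEnt models described above.

```python
def reorder_relative_clause(children):
    """Toy version of the Figure 4 pattern: inside a relative clause,
    permute verb-phrase subconstituents [V, NP, PP] into [PP, NP, V] so
    that the English source order matches the Dutch target order."""
    labels = [label for label, _ in children]
    if labels == ["V", "NP", "PP"]:
        return [children[2], children[1], children[0]]
    return children  # leave other configurations untouched

src = [("V", "to lead"),
       ("NP", "the Commission"),
       ("PP", "during the next five-year term")]
print(" ".join(words for _, words in reorder_relative_clause(src)))
# during the next five-year term the Commission to lead
```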

References

D. Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL'05, pages 263–270.

M. Collins, P. Koehn, and I. Kučerová. 2005. Clause restructuring for statistical machine translation. In Proceedings of ACL'05, pages 531–540.

M. R. Costa-jussà and J. A. R. Fonollosa. 2006. Statistical machine reordering. In Proceedings of HLT/EMNLP'06, pages 70–76.

M. Galley and Ch. D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of EMNLP'08, pages 848–856.

M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of COLING/ACL'06, pages 961–968.

M. Khalilov and K. Sima'an. 2010. Source reordering using maxent classifiers and supertags. In Proceedings of EAMT'10, pages 292–299.

M. Khalilov. 2009. New statistical and syntactic models for machine translation. Ph.D. thesis, Universitat Politècnica de Catalunya, October.

D. Klein and C. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL'03, pages 423–430.

Ph. Koehn, F. Och, and D. Marcu. 2003. Statistical phrase-based machine translation. In Proceedings of HLT-NAACL'03, pages 48–54.

Ph. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of ACL'07, pages 177–180.

Ph. Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP'04, pages 388–395.

F. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of ACL'02, pages 295–302.

F. Och and H. Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

F. Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of ACL'99, pages 71–76.

F. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL'03, pages 160–167.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL'02, pages 311–318.

A. PVS. 2010. A data mining approach to learn reorder rules for SMT. In Proceedings of NAACL/HLT'10, pages 52–57.

A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Proceedings of SLP'02, pages 901–904.

C. Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL'04, pages 101–104.

C. Tillmann and H. Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):93–133.

R. Tromble and J. Eisner. 2009. Learning linear ordering problems for better translation. In Proceedings of EMNLP'09, pages 1007–1016.

C. Wang, M. Collins, and Ph. Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of EMNLP-CoNLL'07, pages 737–745.

D. Wu and H. Wong. 1998. Machine translation with a stochastic grammatical channel. In Proceedings of ACL-COLING'98, pages 1408–1415.

F. Xia and M. McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of COLING'04, pages 508–514.

D. Xiong, Q. Liu, and S. Lin. 2006. Maximum entropy based phrase reordering model for statistical machine translation. In Proceedings of ACL'06, pages 521–528.

R. Zens and H. Ney. 2003. A comparative study on reordering constraints in statistical machine translation. In Proceedings of ACL'03, pages 144–151.

R. Zens, F. Och, and H. Ney. 2002. Phrase-based statistical machine translation. In Proceedings of KI: Advances in Artificial Intelligence, pages 18–32.

A. Zollmann and A. Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of NAACL'06, pages 138–141.

HMM Word-to-Phrase Alignment with Dependency Constraints

Yanjun Ma and Andy Way
Centre for Next Generation Localisation
School of Computing
Dublin City University
{yma,away}@computing.dcu.ie

Abstract

In this paper, we extend the HMM word-to-phrase alignment model with syntactic dependency constraints. The syntactic dependencies between multiple words in one language are introduced into the model in a bid to produce coherent alignments. Our experimental results on a variety of Chinese–English data show that our syntactically constrained model can lead to as much as a 3.24% relative improvement in BLEU score over current HMM word-to-phrase alignment models on a Phrase-Based Statistical Machine Translation system when the training data is small, and a comparable performance to IBM model 4 on a Hiero-style system with larger training data. An intrinsic alignment quality evaluation shows that our alignment model with dependency constraints leads to improvements in both precision (by 1.74% relative) and recall (by 1.75% relative) over the model without dependency information.

1 Introduction

Generative word alignment models, including the IBM models (Brown et al., 1993) and HMM word alignment models (Vogel et al., 1996), have been widely used in various types of Statistical Machine Translation (SMT) systems. This widespread use can be attributed to their robustness and high performance, particularly on large-scale translation tasks. However, the quality of the alignment yielded by these models is still far from satisfactory even with significant amounts of training data; this is particularly true for radically different languages such as Chinese and English.

The weakness of most generative models often lies in their incapability of addressing one-to-many (1-to-n), many-to-one (n-to-1) and many-to-many (m-to-n) alignments. Some research directly addresses m-to-n alignment with phrase alignment models (Marcu and Wong, 2002). However, these models are unsuccessful largely due to intractable estimation (DeNero and Klein, 2008). Recent progress in better parameterisation and approximate inference (Blunsom et al., 2009) can only raise the performance of these models to a level similar to the baseline, where bidirectional word alignments are combined with heuristics and subsequently used to induce translation equivalence (e.g. (Koehn et al., 2003)). The most widely used word alignment models, such as IBM models 3 and 4, can only model 1-to-n alignment; these models are often called "asymmetric" models. IBM models 3 and 4 model 1-to-n alignments using the notion of "fertility", which is associated with a "deficiency" problem despite its high performance in practice.

On the other hand, the HMM word-to-phrase alignment model tackles 1-to-n alignment problems with simultaneous segmentation and alignment while maintaining the efficiency of the models. Therefore, this model sets a good example of addressing the tradeoff between modelling power and modelling complexity. This model can also be seen as a more generalised

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 101–109, COLING 2010, Beijing, August 2010.

case of the HMM word-to-word model (Vogel et al., 1996; Och and Ney, 2003), since this model can be reduced to an HMM word-to-word model by restricting the generated target phrase length to one. One can further refine existing word alignment models with syntactic constraints (e.g. (Cherry and Lin, 2006)). However, most research focuses on the incorporation of syntactic constraints into discriminative alignment models. Introducing syntactic information into generative alignment models has been shown to be more challenging, mainly due to the absence of appropriate modelling of syntactic constraints and the "inflexibility" of these generative models.

In this paper, we extend the HMM word-to-phrase alignment model with syntactic dependencies by presenting a model that can incorporate syntactic information while maintaining the efficiency of the model. This model is based on the observation that in 1-to-n alignments, the n words bear some syntactic dependencies. Leveraging such information in the model can potentially further aid the model in producing more fine-grained word alignments. The syntactic constraints are specifically imposed on the n words involved in 1-to-n alignments, which is different from the cohesion constraints (Fox, 2002) as explored by Cherry and Lin (2006), where knowledge of cross-lingual syntactic projection is used. As our model is a syntactic extension of the open-source MTTK implementation (Deng and Byrne, 2006) of the HMM word-to-phrase alignment model, its source code will also be released as open source in the near future.

The remainder of the paper is organised as follows. Section 2 describes the HMM word-to-phrase alignment model. In section 3, we present the details of the incorporation of syntactic dependencies. Section 4 presents the experimental setup, and section 5 reports the experimental results. In section 6, we draw our conclusions and point out some avenues for future work.

2 HMM Word-to-Phrase Alignment Model

In HMM word-to-phrase alignment, a sentence e is segmented into a sequence of consecutive phrases: e = v_1^K, where v_k represents the kth phrase in the target sentence. The assumption that each phrase v_k generated as a translation of one single source word is consecutive is made to allow efficient parameter estimation. Similarly to word-to-word alignment models, a variable a_1^K is introduced to indicate the correspondence between target phrase indices and source word indices: k → i = a_k, indicating a mapping from a target phrase v_k to a source word f_{a_k}. A random process φ_k specifies the number of words in each target phrase, subject to the constraint J = Σ_{k=1}^{K} φ_k, implying that the total number of words in the phrases agrees with the target sentence length J.

The insertion of target phrases that do not correspond to any source words is also modelled by allowing a target phrase to be aligned to a non-existent source word f_0 (NULL). Formally, to indicate whether each target phrase is aligned to NULL or not, a set of indicator functions ε_1^K = {ε_1, ..., ε_K} is introduced (Deng and Byrne, 2008): if ε_k = 0, then NULL → v_k; if ε_k = 1, then f_{a_k} → v_k.

To summarise, an alignment a in an HMM word-to-phrase alignment model consists of the following elements:

    a = (K, φ_1^K, a_1^K, ε_1^K)

The modelling objective is to define a conditional distribution P(e, a | f) over these alignments. Following (Deng and Byrne, 2008), P(e, a | f) can be decomposed into a phrase count distribution (1) modelling the segmentation of a target sentence into phrases (P(K | J, f) ∝ η^K, with the scalar η controlling the length of the hypothesised phrases), a transition distribution (2) modelling the dependencies between the current link and the previous links, and a word-to-phrase translation distribution (3) modelling the degree to which a word and a phrase are translations of each other.

    P(e, a | f) = P(v_1^K, K, a_1^K, ε_1^K, φ_1^K | f)
                = P(K | J, f)                                        (1)
                  × P(a_1^K, ε_1^K, φ_1^K | K, J, f)                 (2)
                  × P(v_1^K | a_1^K, ε_1^K, φ_1^K, K, J, f)          (3)

The word-to-phrase translation distribution (3) is formalised as in (4):

    P(v_1^K | a_1^K, ε_1^K, φ_1^K, K, J, f) = Π_{k=1}^{K} p_v(v_k | ε_k · f_{a_k}, φ_k)    (4)

Note here that we assume that the translation of each target phrase is conditionally independent of the other target phrases given the individual source words.

If we assume that each word in a target phrase is translated with a dependence on the previously translated word in the same phrase given the source word, we derive the bigram translation model as follows:

    p_v(v_k | f_{a_k}, ε_k, φ_k) = p_{t1}(v_k[1] | ε_k, f_{a_k}) Π_{j=2}^{φ_k} p_{t2}(v_k[j] | v_k[j−1], ε_k, f_{a_k})

where v_k[1] is the first word in phrase v_k, v_k[j] is the jth word in v_k, p_{t1} is a unigram translation probability and p_{t2} is a bigram translation probability. The intuition is that the first word in v_k is translated first from f_{a_k}, and the translation of each remaining word v_k[j] in v_k from f_{a_k} is dependent on the translation of the previous word v_k[j−1] from f_{a_k}. The use of a bigram translation model can address the coherence of the words within the phrase v_k, so that the quality of phrase segmentation can be improved.

3 Syntactically Constrained HMM Word-to-Phrase Alignment Models

3.1 Syntactic Dependencies for Word-to-Phrase Alignment

As a proof of concept, we performed dependency parsing on the GALE gold-standard word alignment corpus using Maltparser (Nivre et al., 2007) [1]. We find that in 1-to-2 Chinese–English (ZH–EN) word alignment (one Chinese word aligned to two English words), 82.54% of the consecutive English words and 77.46% of the non-consecutive English words have syntactic dependencies. For English–Chinese (EN–ZH) word alignment, we observe that 75.62% of the consecutive Chinese words and 71.15% of the non-consecutive Chinese words have syntactic dependencies. Our model represents an attempt to encode these linguistic intuitions.

[1] http://maltparser.org/

3.2 Component Variables and Distributions

We constrain the word-to-phrase alignment model with a syntactic coherence model. Given a target phrase v_k consisting of φ_k words, we use the dependency label r_k between the words v_k[1] and v_k[φ_k] to indicate the level of coherence. The dependency labels are a closed set obtained from dependency parsers; e.g. using Maltparser, we have 20 dependency labels for English and 12 for Chinese in our data. Therefore, we have an additional variable r_1^K associated with the sequence of phrases v_1^K to indicate the syntactic coherence of each phrase, defining P(e, a | f) as below:

    P(r_1^K, v_1^K, K, a_1^K, ε_1^K, φ_1^K | f) = P(K | J, f)
        P(a_1^K, φ_1^K, ε_1^K | K, J, f) P(v_1^K | a_1^K, ε_1^K, φ_1^K, K, J, f)
        P(r_1^K | a_1^K, ε_1^K, φ_1^K, v_1^K, K, J, f)    (5)

The syntactic coherence distribution (5) is simplified as in (6):

    P(r_1^K | a_1^K, ε_1^K, φ_1^K, v_1^K, K, J, f) = Π_{k=1}^{K} p_r(r_k; ε, f_{a_k}, φ_k)    (6)

Note that the coherence of each target phrase is conditionally independent of the coherence of the other target phrases given the source words f_{a_k} and the number of words φ_k in the current phrase. We name the model in (5) the SSH model, where SSH abbreviates Syntactically constrained Segmental HMM, given the fact that the HMM word-to-phrase alignment model is a Segmental HMM (SH) model (Ostendorf et al., 1996; Murphy, 2002).

As our syntactic coherence model utilises syntactic dependencies, which require the presence of at least two words in the target phrase v_k, we therefore model the cases φ_k = 1 and φ_k ≥ 2

separately. We rewrite (6) as follows:

    p_r(r_k; ε, f_{a_k}, φ_k) =
        { p_{φ_k=1}(r_k; ε, f_{a_k})   if φ_k = 1
        { p_{φ_k≥2}(r_k; ε, f_{a_k})   if φ_k ≥ 2

where p_{φ_k=1} defines the syntactic coherence when the target phrase contains only one word (φ_k = 1) and p_{φ_k≥2} defines the syntactic coherence of a target phrase composed of multiple words (φ_k ≥ 2). We define p_{φ_k=1} as follows:

    p_{φ_k=1}(r_k; ε, f_{a_k}) ∝ p_n(φ_k = 1; ε, f_{a_k})

where the coherence of the target phrase (word) v_k is defined to be proportional to the probability of target phrase length φ_k = 1 given the source word f_{a_k}. The intuition behind this model is that the syntactic coherence is strong iff the probability of the source word f_{a_k} having fertility φ_k = 1 is high.

For p_{φ_k≥2}, which measures the syntactic coherence of a target phrase consisting of two or more words, we use the dependency label r_k between the words v_k[1] and v_k[φ_k] to indicate the level of coherence. A distribution over the values r_k ∈ R = {SBJ, ADJ, ...} (R is the set of dependency types for a specific language) is maintained as a table for each source word, associated with all the possible lengths φ ∈ {2, ..., N} of the target phrase it can generate; e.g. we set N = 4 for ZH–EN alignment and N = 2 for EN–ZH alignment in our experiments.

Given a target phrase v_k containing φ_k (φ_k ≥ 2) words, it is possible that there are no dependencies between the first word v_k[1] and the last word v_k[φ_k]. To account for this fact, we introduce an indicator function ϕ as below:

    ϕ(v_k[1], φ_k) = { 1   if v_k[1] and v_k[φ_k] have syntactic dependencies
                     { 0   otherwise

We can thereafter introduce a distribution p_ϕ(ϕ), where p_ϕ(ϕ = 0) = ζ (0 ≤ ζ ≤ 1) and p_ϕ(ϕ = 1) = 1 − ζ, with ζ indicating how likely it is that the first and final words in a target phrase do not have any syntactic dependencies. We can set ζ to a small number to favour target phrases satisfying the syntactic constraints, and to a larger number otherwise. The introduction of this variable enables us to tune the model towards our different end goals. We can now define p_{φ_k≥2} as:

    p_{φ_k≥2}(r_k; ε, f_{a_k}) = p(r_k | ϕ; ε, f_{a_k}) p_ϕ(ϕ)

where we insist that p(r_k | ϕ; ε, f_{a_k}) = 1 if ϕ = 0 (the first and last words in the target phrase do not have syntactic dependencies), to reflect the fact that in most arbitrary consecutive word sequences the first and last words do not have syntactic dependencies; otherwise p(r_k | ϕ; ε, f_{a_k}) is used as is when ϕ = 1 (the first and last words in the target phrase have syntactic dependencies).

3.3 Parameter Estimation

The Forward-Backward Algorithm (Baum, 1972), a version of the EM algorithm (Dempster et al., 1977), is specifically designed for unsupervised parameter estimation of HMM models. The Forward statistic α_j(i, φ, ε) in our model can be calculated recursively over the trellis as follows:

    α_j(i, φ, ε) = { Σ_{i′,φ′,ε′} α_{j−φ}(i′, φ′, ε′) p_a(i | i′, ε; I) }
                   p_n(φ; ε, f_i) η p_{t1}(e_{j−φ+1} | ε, f_i)
                   Π_{j′=j−φ+2}^{j} p_{t2}(e_{j′} | e_{j′−1}, ε, f_i) p_r(r_k; ε, f_i, φ)

which sums up the probabilities of every path that could lead to the cell ⟨j, i, φ⟩. Note that the syntactic coherence term p_r(r_k; ε, f_i, φ) can be added into the Forward procedure efficiently. Similarly, the Backward statistic β_j(i, φ, ε) is calculated over the trellis as below:

    β_j(i, φ, ε) = Σ_{i′,φ′,ε′} β_{j+φ′}(i′, φ′, ε′) p_a(i′ | i, ε′; I)
                   p_n(φ′; ε′, f_{i′}) η p_{t1}(e_{j+1} | ε′, f_{i′})
                   Π_{j′=j+2}^{j+φ′} p_{t2}(e_{j′} | e_{j′−1}, ε′, f_{i′}) p_r(r_k; ε′, f_{i′}, φ′)

Note also that the syntactic coherence term p_r(r_k; ε′, f_{i′}, φ′) can be integrated into the Backward procedure efficiently.
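For intuition, the recursion above reduces, for φ = 1 and with the ε and coherence terms dropped, to the standard HMM forward pass. A minimal sketch of that word-to-word special case, with toy probability tables (not the paper's parameterisation):

```python
def forward(obs, states, init, trans, emit):
    """Plain HMM forward pass (the word-to-word, phi = 1 special case).
    trans[i][i_prev] plays the role of the jump probability p_a(i | i'),
    emit[i][e] the translation probability; all tables are toy dicts."""
    # alpha[j][i] = P(e_1..e_{j+1}, state at j+1 is i)
    alpha = [{i: init[i] * emit[i][obs[0]] for i in states}]
    for e in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            i: sum(prev[ip] * trans[i][ip] for ip in states) * emit[i][e]
            for i in states
        })
    return alpha  # sum(alpha[-1].values()) gives P(e | f)
```

The segmental version extends each step to consume a phrase of φ words at once, multiplying in the phrase-length, bigram-translation, and (in the SSH model) coherence terms, exactly as in the recursion above.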

Posterior probabilities can then be calculated from the Forward and Backward probabilities.

3.4 EM Parameter Updates

The Expectation step accumulates fractional counts using the posterior probabilities for each parameter during the Forward-Backward passes, and the Maximisation step normalises the counts in order to generate updated parameters.

The E-step for the syntactic coherence model proceeds as follows:

    c(r′; f, φ′) = Σ_{(f,e)∈T} Σ_{i,j,φ: f_i=f} γ_j(i, φ, ε = 1) δ(φ, φ′) δ(ϕ_j(e, φ), r′)

where γ_j(i, φ, ε) is the posterior probability that a target phrase e_{j−φ+1}^{j} is aligned to the source word f_i, and ϕ_j(e, φ) is the syntactic dependency label between e_{j−φ+1} and e_j. The M-step performs normalisation, as below:

    p_r(r′; f, φ′) = c(r′; f, φ′) / Σ_r c(r; f, φ′)

The other component parameters can be estimated in a similar manner.

4 Experimental Setup

4.1 Data

We built the baseline word alignment and Phrase-Based SMT (PB-SMT) systems using existing open-source toolkits for the purposes of fair comparison. A collection of GALE data (LDC2006E26) consisting of 103K sentence pairs (2.9 million English running words) was first used as a proof of concept ("small"), and FBIS data containing 238K sentence pairs (8 million English running words) was added to construct a "medium" scale experiment. To investigate the intrinsic quality of the alignment, a collection of parallel sentences (12K sentence pairs) for which we have manually annotated word alignment was added to both the "small" and "medium" scale experiments. Multiple-Translation Chinese Part 1 (MTC1) from the LDC was used for Minimum Error-Rate Training (MERT) (Och, 2003), and MTC2, 3 and 4 were used as development test sets. Finally, the test set from the NIST 2006 evaluation campaign was used as the final test set.

The Chinese data was segmented using the LDC word segmenter. The maximum-entropy-based POS tagger MXPOST (Ratnaparkhi, 1996) was used to tag both the English and Chinese texts. The syntactic dependencies for both English and Chinese were obtained using the state-of-the-art Maltparser dependency parser, which achieved 84% and 88% labelled attachment scores for Chinese and English respectively.

4.2 Word Alignment

The GIZA++ (Och and Ney, 2003) implementation of IBM Model 4 (Brown et al., 1993) is used as the baseline for word alignment. Model 4 is incrementally trained by performing 5 iterations of Model 1, 5 iterations of HMM, 3 iterations of Model 3, and 3 iterations of Model 4. We compared our model against the MTTK (Deng and Byrne, 2006) implementation of the HMM word-to-phrase alignment model. The model training includes 10 iterations of Model 1, 5 iterations of Model 2, and 5 iterations of HMM word-to-word alignment, followed, for ZH–EN alignment, by 20 iterations of HMM word-to-phrase alignment (5 iterations each for phrase lengths 2, 3 and 4 with a unigram translation probability, and phrase length 4 with a bigram translation probability), and, for EN–ZH, by 5 iterations (phrase length 2 with a uniform translation probability). This configuration was empirically established as the best for Chinese–English word alignment. To allow for a fair comparison between IBM Model 4 and the HMM word-to-phrase alignment models, we also restrict the maximum fertility in IBM model 4 to 4 for ZH–EN and 2 for EN–ZH (the default in GIZA++ is 9 for both directions). The "grow-diag-final" heuristic described in (Koehn et al., 2003) is used to derive the refined alignment from the bidirectional alignments.

4.3 MT system

The baseline in our experiments is a standard log-linear PB-SMT system. With the word alignment obtained using the method described in

section 4.2, we perform phrase extraction using the heuristics described in (Koehn et al., 2003), Minimum Error-Rate Training (MERT) (Och, 2003) optimising the BLEU metric, a 5-gram language model with Kneser-Ney smoothing (Kneser and Ney, 1995) trained with SRILM (Stolcke, 2002) on the English side of the training data, and MOSES (Koehn et al., 2007) for decoding. A Hiero-style decoder, Joshua (Li et al., 2009), is also used in our experiments. All significance tests are performed using approximate randomisation (Noreen, 1989) at p = 0.05.

5 Experimental Results

5.1 Alignment Model Tuning

In order to find the value of ζ in the SSH model that yields the best MT performance, we used three development test sets with a PB-SMT system trained on the small data condition. Figure 1 shows the results on each development test set using different configurations of the alignment models. For each system, we obtain the mean of the BLEU scores (Papineni et al., 2002) on the three development test sets, and derive the optimal value ζ = 0.4, which we use hereafter for final testing. It is worth mentioning that while IBM model 4 (M4) outperforms the other models, including the HMM word-to-word (H) and word-to-phrase (SH) alignment models, in our current setup, using the default IBM model 4 setting (maximum fertility 9) yields an inferior performance (by as much as 8.5% relative) compared to the other models.

[Figure 1: BLEU scores on the development test sets (MTC2, MTC3, MTC4) using the PB-SMT system, for the alignment systems M4, H, SH, and SSH with ζ ranging from 0.05 to 0.6.]

             PB-SMT             Hiero
System    small    medium    small    medium
H         0.1440   0.2591    0.1373   0.2595
SH        0.1418   0.2517    0.1372   0.2609
SSH       0.1464   0.2518    0.1356   0.2624
M4        0.1566   0.2627    0.1486   0.2660

Table 1: Performance of the PB-SMT and Hiero systems using different alignment models on the NIST06 test set.

5.2 Translation Results

Table 1 shows the performance of the PB-SMT and Hiero systems using a small amount of data for alignment model training on the NIST06 test set. For the PB-SMT system trained on the small data set, using SSH word alignment leads to a 3.24% relative improvement over SH, which is statistically significant. SSH also leads to a slight gain over the HMM word-to-word alignment model (H). However, when the PB-SMT system is trained on larger data sets, there are no significant differences between SH and SSH. Additionally, both the SH and SSH models underperform H on the medium data condition, indicating that the performance of the alignment model tuned on the PB-SMT system with small training data does not carry over to PB-SMT systems with larger training data (cf. Figure 1). IBM model 4 demonstrates stronger performance than the other models for both the small and medium data conditions.

For the Hiero system trained on the small data set, no significant differences are observed between SSH, SH and H. On the larger training set, we observe that SSH alignment leads to better performance than SH, and both SH and SSH alignments achieve higher translation quality than H. Note that while IBM model 4 outperforms the other models on the small data condition, the difference between IBM model 4 and SSH is not statistically significant on the medium data condition. It is also worth pointing out that the SSH model yields a significant improvement over IBM model 4 with the default fertility setting, indicating that varying the fertility limit in IBM model 4 has a significant impact on translation quality.

In summary, the SSH model, which incorporates syntactic dependencies into the SH model, achieves consistently better performance than

106 ZH–EN EN–ZH the SH model. Both SH and SSH lead to gains PR PR H 0.5306 0.3752 0.5282 0.3014 over H for both ZH–EN and EN–ZH directions, SH 0.5378 0.3802 0.5523 0.3151 while gains in the EN–ZH direction appear to be SSH 0.5384 0.3807 0.5619 0.3206 more pronounced. IBM model 4 achieves signif- M4 0.5638 0.3986 0.5988 0.3416 icantly higher P over other models while the gap Table 2: Intrinsic evaluation of the alignment us- in R is narrow. ing different alignment models Relating Table 2 to Table 1, we observe that the HMM word-to-word alignment model (H) can still achieve good MT performance despite SH in both PB-SMT and Hiero systems under the lower P and R compared to other mod- both small and large data conditions. For a els. This provides additional support to previ- PB-SMT system trained on the small data set, ous findings (Fraser and Marcu, 2007b) that the the SSH model leads to significant gains over intrinsic quality of word alignment does not nec- the baseline SH model. The results also en- essarily correlate with the performance of the re- tail an observation concerning the suitability of sulted MT system. different alignment models for different types of SMT systems; trained on a large data set, 5.4 Alignment Characteristics our SSH alignment model is more suitable to a Hiero-style system than a PB-SMT system, In order to further understand the characteristics as evidenced by a lower performance compared of the alignment that each model produces, we to IBM model 4 using a PB-SMT system, and investigated several statistics of the alignment re- a comparable performance compared to IBM sults which can hopefully reveal the capabilities model 4 using a Hiero system. and limitations of each model.

5.3 Intrinsic Evaluation 5.4.1 Pairwise Comparison In order to further investigate the intrinsic qual- Given the asymmetric property of these align- ity of the word alignment, we compute the Preci- ment models, we can evaluate the quality of the sion (P), Recall (R) and F-score (F) of the align- links for each word and compare the alignment ments obtained using different alignment mod- links across different models. For example, in els. As the models investigated here are asym- ZH–EN word alignment, we can compute the metric models, we conducted intrinsic evalua- links for each Chinese word and compare those tion for both alignment directions, i.e. ZH–EN links across different models. Additionally, we word alignment where one Chinese word can be can compute the pairwise agreement in align- aligned to multiple English words, and EN–ZH ing each Chinese word for any two alignment word alignment where one English word can be models. Similarly, we can compute the pairwise aligned to multiple Chinese words. agreement in aligning each English word in the Table 2 shows the results of the intrinsic eval- EN–ZH alignment direction. uation of ZH–EN and EN–ZH word alignment For ZH–EN word alignment, we observe that on a small data set (results on the medium data the SH and SSH models reach a 85.94% agree- set follow the same trend but are left out due ment, which is not surprising given the fact that to space limitations). Note that the P and R SSH is a syntactic extension over SH, while IBM are all quite low, demonstrating the difficulty of model 4 and SSH reach the smallest agreement Chinese–English word alignment in the news do- (only 65.09%). We also observe that there is a main. For the ZH–EN direction, using the SSH higher agreement between SSH and H (76.64%) model does not lead to significant gains over SH than IBM model 4 and H (69.58%). This can be in P or R. 
For the EN–ZH direction, the SSH attributed to the fact that SSH is still a form of model leads to a 1.74% relative improvement in HMM model while IBM model 4 is not. A simi- P, and a 1.75% relative improvement in R over lar trend is observed for EN–ZH word alignment.
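The agreement figures above reduce to a per-word set comparison. A minimal sketch in Python (the dict-of-link-sets representation and the exact agreement criterion, identical link sets per source word, are our assumptions; the paper does not spell them out):

```python
def agreement(model_a, model_b):
    """Fraction of source words for which two asymmetric alignment
    models produce the identical set of target links.

    model_a, model_b: dicts mapping a source-word position to the set
    of target-word positions it aligns to (empty set = NULL link).
    """
    words = set(model_a) | set(model_b)
    if not words:
        return 0.0
    same = sum(1 for w in words
               if model_a.get(w, set()) == model_b.get(w, set()))
    return same / len(words)

# Toy example (hypothetical data): two models agreeing on 2 of 3 source words.
sh_links  = {0: {0}, 1: {1, 2}, 2: set()}
ssh_links = {0: {0}, 1: {1},    2: set()}
```

Averaging this quantity over a corpus yields agreement percentages of the kind reported above.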

                ZH–EN                                 EN–ZH
                           1-to-n                                1-to-n
       1-to-0   1-to-1   con.     non-con.   1-to-0   1-to-1   con.     non-con.
HMM    0.3774   0.4693   0.0709   0.0824     0.4438   0.4243   0.0648   0.0671
SH     0.3533   0.4898   0.0843   0.0726     0.4095   0.4597   0.0491   0.0817
SSH    0.3613   0.5092   0.0624   0.0671     0.3990   0.4835   0.0302   0.0872
M4     0.2666   0.5561   0.0985   0.0788     0.3967   0.4850   0.0592   0.0591

Table 3: Alignment types using different alignment models
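The four columns per direction in Table 3 correspond to a simple classification of each source word's link set; a minimal sketch (the list-of-target-positions representation is an assumption):

```python
def link_type(targets):
    """Classify the links of one source word, as tallied in Table 3.

    targets: the target-word positions the source word is aligned to.
    Returns '1-to-0', '1-to-1', '1-to-n (con.)' or '1-to-n (non-con.)'.
    """
    ts = sorted(targets)
    if len(ts) == 0:
        return '1-to-0'      # aligned to the empty word "NULL"
    if len(ts) == 1:
        return '1-to-1'
    # n >= 2 target words are consecutive iff they form a contiguous range
    if ts[-1] - ts[0] == len(ts) - 1:
        return '1-to-n (con.)'
    return '1-to-n (non-con.)'
```

Tallying these labels over all source words of a corpus gives the proportions in Table 3.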

5.4.2 Alignment Types

Again, by taking advantage of the asymmetric property of these alignment models, we can compute different types of alignment links. For both ZH–EN (EN–ZH) alignment, we divide the links for each Chinese (English) word into 1-to-0, where the Chinese (English) word is aligned to the empty word "NULL" in English (Chinese); 1-to-1, where the Chinese (English) word is aligned to exactly one word in English (Chinese); and 1-to-n, where the Chinese (English) word is aligned to n (n ≥ 2) words in English (Chinese). For 1-to-n links, depending on whether the n words are consecutive, we distinguish consecutive (con.) and non-consecutive (non-con.) 1-to-n links.

Table 3 shows the alignment types in the medium data track. We can observe that for ZH–EN word alignment, both SH and SSH produce far more 1-to-0 links than Model 4. It can also be seen that Model 4 tends to produce more consecutive 1-to-n links than non-consecutive 1-to-n links. The SSH model, on the other hand, tends to produce more non-consecutive 1-to-n links than consecutive ones. Compared to SH, SSH tends to produce more 1-to-1 links than 1-to-n links, indicating that adding syntactic dependency constraints biases the model towards producing 1-to-n links only when the n words follow the coherence constraint, i.e. the first and last word in the chunk have a syntactic dependency. For example, among the 6.24% consecutive ZH–EN 1-to-n links produced by SSH, 43.22% follow the coherence constraint, compared to just 39.89% for SH. These properties can have significant implications for the performance of our MT systems, given that we use the grow-diag-final heuristic to derive the symmetrised word alignment from the bidirectional asymmetric word alignments.

6 Conclusions and Future Work

In this paper, we extended the HMM word-to-phrase alignment model to handle syntactic dependencies. We found that our model was consistently better than the model without syntactic dependencies according to both intrinsic and extrinsic evaluation. Our model is shown to be beneficial to PB-SMT under a small data condition and to a Hiero-style system under a larger data condition.

As to future work, we firstly plan to investigate the impact of parsing quality on our model, and the use of different heuristics to combine word alignments. Secondly, the syntactic coherence model itself is very simple, in that it only covers the syntactic dependency between the first and last word in a phrase. Accordingly, we intend to extend this model to cover more sophisticated syntactic relations within the phrase. Furthermore, given that we can construct different MT systems using different word alignments, multiple-system combination can be conducted to avail of the advantages of the different systems. We also plan to compare our model with other alignment models, e.g. that of Fraser and Marcu (2007a), and to test this approach on more data and on different language pairs and translation directions.

Acknowledgements

This research is supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University. Part of the work was carried out at Cambridge University Engineering Department with Dr. William Byrne. The authors would also like to thank the anonymous reviewers for their insightful comments.

References

Baum, Leonard E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8.

Blunsom, Phil, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of ACL-IJCNLP 2009, pages 782–790, Singapore.

Brown, Peter F., Stephen A. Della-Pietra, Vincent J. Della-Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Cherry, Colin and Dekang Lin. 2006. Soft syntactic constraints for word alignment through discriminative training. In Proceedings of COLING-ACL 2006, pages 105–112, Sydney, Australia.

Dempster, Arthur, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

DeNero, John and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of ACL-08: HLT, Short Papers, pages 25–28, Columbus, OH.

Deng, Yonggang and William Byrne. 2006. MTTK: An alignment toolkit for statistical machine translation. In Proceedings of HLT-NAACL 2006, pages 265–268, New York City, NY.

Deng, Yonggang and William Byrne. 2008. HMM word and phrase alignment for statistical machine translation. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507.

Fox, Heidi. 2002. Phrasal cohesion and statistical machine translation. In Proceedings of EMNLP 2002, pages 304–311, Philadelphia, PA, July.

Fraser, Alexander and Daniel Marcu. 2007a. Getting the structure right for word alignment: LEAF. In Proceedings of EMNLP-CoNLL 2007, pages 51–60, Prague, Czech Republic.

Fraser, Alexander and Daniel Marcu. 2007b. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE ICASSP, volume 1, pages 181–184, Detroit, MI.

Koehn, Philipp, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, pages 48–54, Edmonton, AB, Canada.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007, pages 177–180, Prague, Czech Republic.

Li, Zhifei, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of WMT 2009, pages 135–139, Athens, Greece.

Marcu, Daniel and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of EMNLP 2002, pages 133–139, Philadelphia, PA.

Murphy, Kevin. 2002. Hidden semi-Markov models (segment models). Technical report, UC Berkeley.

Nivre, Joakim, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.

Noreen, Eric W. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley-Interscience, New York, NY.

Och, Franz and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Och, Franz. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003, pages 160–167, Sapporo, Japan.

Ostendorf, Mari, Vassilios V. Digalakis, and Owen A. Kimball. 1996. From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5):360–378.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311–318, Philadelphia, PA.

Ratnaparkhi, Adwait. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP 1996, pages 133–142, Somerset, NJ.

Stolcke, Andreas. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901–904, Denver, CO.

Vogel, Stefan, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of COLING 1996, pages 836–841, Copenhagen, Denmark.

New Parameterizations and Features for PSCFG-Based Machine Translation

Andreas Zollmann    Stephan Vogel
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
{zollmann,vogel}@cs.cmu.edu

Abstract

We propose several improvements to the hierarchical phrase-based MT model of Chiang (2005) and its syntax-based extension by Zollmann and Venugopal (2006). We add a source-span variance model that, for each rule utilized in a probabilistic synchronous context-free grammar (PSCFG) derivation, gives a confidence estimate in the rule based on the number of source words spanned by the rule and its substituted child rules, with the distributions of these source span sizes estimated during training time.

We further propose different methods of combining hierarchical and syntax-based PSCFG models, by merging the grammars as well as by interpolating the translation models.

Finally, we compare syntax-augmented MT, which extracts rules based on target-side syntax, to a corresponding variant based on source-side syntax, and experiment with a model extension that jointly takes source and target syntax into account.

1 Introduction

The Probabilistic Synchronous Context Free Grammar (PSCFG) formalism suggests an intuitive approach to modeling the long-distance and lexically sensitive reordering phenomena that often occur across language pairs considered for statistical machine translation. As in monolingual parsing, nonterminal symbols in translation rules are used to generalize beyond purely lexical operations. Labels on these nonterminal symbols are often used to enforce syntactic constraints in the generation of bilingual sentences and imply conditional independence assumptions in the statistical translation model. Several techniques have recently been proposed to automatically identify and estimate parameters for PSCFGs (or related synchronous grammars) from parallel corpora (Galley et al., 2004; Chiang, 2005; Zollmann and Venugopal, 2006; Liu et al., 2006; Marcu et al., 2006).

In this work, we propose several improvements to the hierarchical phrase-based MT model of Chiang (2005) and its syntax-based extension by Zollmann and Venugopal (2006). We add a source span variance model that, for each rule utilized in a probabilistic synchronous context-free grammar (PSCFG) derivation, gives a confidence estimate in the rule based on the number of source words spanned by the rule and its substituted child rules, with the distributions of these source span sizes estimated during training (i.e., rule extraction) time. We further propose different methods of combining hierarchical and syntax-based PSCFG models, by merging the grammars as well as by interpolating the translation models. Finally, we compare syntax-augmented MT, which extracts rules based on target-side syntax, to a corresponding variant based on source-side syntax, and experiment with a model extension based on source and target syntax.

We evaluate the different models on the NIST large-resource Chinese-to-English translation task.

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 110–117, COLING 2010, Beijing, August 2010.

2 Related work

Chiang et al. (2008) introduce structural distortion features into a hierarchical phrase-based model, aimed at modeling nonterminal reordering given source span length, by estimating for each possible source span length ℓ a Bernoulli distribution p(R | ℓ), where R takes value one if reordering takes place and zero otherwise. Maximum-likelihood estimation of the distribution amounts to simply counting the relative frequency of nonterminal reorderings over all extracted rule instances that incurred a substitution of span length ℓ. In a more fine-grained approach, they add a separate binary feature ⟨R, ℓ⟩ for each combination of reordering truth value R and span length ℓ (where all ℓ ≥ 10 are merged into a single value), and then tune the feature weights discriminatively on a development set. Our approach differs from Chiang et al. (2008) in that we estimate one source span length distribution for each substitution site of each grammar rule, resulting in unique distributions for each rule, estimated from all instances of the rule in the training data. This enables our model to condition reordering range on the individual rules used in a derivation, and even allows it to distinguish between two rules r1 and r2 that both reorder arguments with identical mean span lengths ℓ, but where the span lengths encountered in extracted instances of r1 are all close to ℓ, whereas span length instances for r2 vary widely.

Chen and Eisele (2010) propose a hybrid approach between hierarchical phrase-based MT and a rule-based MT system, reporting improvements over each individual model on an English-to-German translation task. Essentially, the rule-based system is converted to a single-nonterminal PSCFG, and hence can be combined with the hierarchical model, another single-nonterminal PSCFG, by taking the union of the rule sets and augmenting the feature vectors, adding zero values for rules that only exist in one of the two grammars. We face the challenge of combining the single-nonterminal hierarchical grammar with a multi-nonterminal syntax-augmented grammar; thus one hierarchical rule typically corresponds to many syntax-augmented rules. The SAMT system used by Zollmann et al. (2008) adds hierarchical rules separately to the syntax-augmented grammar, resulting in a backbone grammar of well-estimated hierarchical rules supporting the sparser syntactic rules. They allow the model preference between hierarchical and syntax rules to be learned from development data by adding an indicator feature to all rules, which is one for hierarchical rules and zero for syntax rules. However, no empirical comparison is given between the purely syntax-augmented and the hybrid grammar. We aim to fill this gap by experimenting with both models, and further refine the hybrid approach by adding interpolated probability models to the syntax rules.

Chiang (2010) augments a hierarchical phrase-based MT model with binary syntax features representing the source and target syntactic constituents of a given rule's instantiations during training, thus taking source and target syntax into account while avoiding the data-sparseness and decoding-complexity problems of multi-nonterminal PSCFG models. In our approach, the source- and target-side syntax directly determines the grammar, resulting in a nonterminal set derived from the labels underlying the source- and target-language treebanks.

3 PSCFG-based translation

Given a source language sentence f, statistical machine translation defines the translation task as selecting the most likely target translation e under a model P(e | f), i.e.:

    ê(f) = argmax_e P(e | f) = argmax_e Σ_{i=1}^{m} h_i(e, f) λ_i

where the argmax operation denotes a search through a structured space of translation outputs in the target language, the h_i(e, f) are bilingual features of e and f and monolingual features of e, and the weights λ_i are typically trained discriminatively to maximize translation quality (based on automatic metrics) on held-out data, e.g., using minimum-error-rate training (MERT) (Och, 2003).

In PSCFG-based systems, the search space is structured by automatically extracted rules that model both translation and re-ordering operations.
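For illustration, the linear model above can be sketched as follows; real decoders search a structured derivation space rather than enumerating candidates, so this exhaustive version is purely expository (the data layout is our assumption):

```python
def best_candidate(candidates, weights):
    """Return the output e maximizing sum_i lambda_i * h_i(e, f)
    over an explicit list of candidates.

    candidates: list of (output, feature_vector) pairs, where
    feature_vector holds the feature values h_i(e, f).
    weights: the lambda_i, e.g. as tuned by MERT.
    """
    def score(features):
        return sum(lam * h for lam, h in zip(weights, features))
    return max(candidates, key=lambda c: score(c[1]))[0]
```

Changing the weights changes which candidate wins, which is exactly the degree of freedom MERT exploits during tuning.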

Most large-scale systems approximate the search above by simply searching for the most likely derivation of rules, rather than searching for the most likely translated output. There are efficient algorithms to perform this search (Kasami, 1965; Chappelier and Rajman, 1998) that have been extended to efficiently integrate n-gram language model features (Chiang, 2007; Venugopal et al., 2007; Huang and Chiang, 2007; Zollmann et al., 2008; Petrov et al., 2008).

In this work we experiment with PSCFGs that have been automatically learned from word-aligned parallel corpora. PSCFGs are defined by a source terminal set (source vocabulary) T_S, a target terminal set (target vocabulary) T_T, a shared nonterminal set N, and rules of the form X → ⟨γ, α, w⟩ where

- X ∈ N is a labeled nonterminal referred to as the left-hand side of the rule,
- γ ∈ (N ∪ T_S)* is the source side of the rule,
- α ∈ (N ∪ T_T)* is the target side of the rule,
- w ∈ [0, ∞) is a non-negative real-valued weight assigned to the rule; in our model, w is the exponential function of the inner product of features h and weights λ.

3.1 Hierarchical phrase-based MT

Building upon the success of phrase-based methods, Chiang (2005) presents a PSCFG model of translation that uses the bilingual phrase pairs of phrase-based MT as a starting point to learn hierarchical rules. For each training sentence pair's set of extracted phrase pairs, the set of induced PSCFG rules can be generated as follows: First, each phrase pair is assigned a generic X-nonterminal as left-hand side, making it an initial rule. We can now recursively generalize each already obtained rule (initial or including nonterminals)

    N → f_1 … f_m / e_1 … e_n

for which there is an initial rule

    M → f_i … f_u / e_j … e_v

where 1 ≤ i < u ≤ m and 1 ≤ j < v ≤ n, to obtain a new rule

    N → f_1^{i-1} X_k f_{u+1}^m / e_1^{j-1} X_k e_{v+1}^n

where e.g. f_1^{i-1} is short-hand for f_1 … f_{i-1}, and where k is an index for the nonterminal X that indicates the one-to-one correspondence between the new X tokens on the two sides (it is not in the space of word indices like i, j, u, v, m, n). The recursive form of this generalization operation allows the generation of rules with multiple nonterminal pairs.

Chiang (2005) uses features analogous to the ones used in phrase-based translation: a language model neg-log probability, a 'rule given source-side' neg-log probability, a 'rule given target-side' neg-log probability, source- and target-conditioned 'lexical' neg-log probabilities based on word-to-word co-occurrences (Koehn et al., 2003), as well as rule, target word, and glue operation counters. We follow Venugopal and Zollmann (2009) to further add a rareness penalty

    1 / count(r)

where count(r) is the occurrence count of rule r in the training corpus, allowing the system to learn penalization of low-frequency rules, as well as three indicator features firing if the rule has one, two unswapped, and two swapped nonterminal pairs, respectively. (Penalization or reward of purely lexical rules can be indirectly learned by trading off these features with the rule counter feature.)

3.2 Syntax Augmented MT

Syntax Augmented MT (SAMT) (Zollmann and Venugopal, 2006) extends Chiang (2005) to include nonterminal symbols from target-language phrase structure parse trees. Each target sentence in the training corpus is parsed with a stochastic parser to produce constituent labels for target spans. Phrase pairs (extracted from a particular sentence pair) are assigned left-hand-side nonterminal symbols based on the target-side parse tree constituent spans. Phrase pairs whose target side corresponds to a constituent span are assigned that constituent's label as their left-hand-side nonterminal. If the target side of the phrase pair is not spanned by a single constituent in the corresponding parse tree, we use the labels of subsuming, subsumed, and neighboring parse tree constituents to assign

an extended label of the form C1+C2, C1/C2, or C2\C1 (the latter two being motivated by the operations in combinatory categorial grammar (CCG) (Steedman, 2000)), indicating that the phrase pair's target side spans two adjacent syntactic categories (e.g., she went: NP+VB), a partial syntactic category C1 missing a C2 at the right (e.g., the great: NP/NN), or a partial C1 missing a C2 at the left (e.g., great wall: DT\NP), respectively. The label assignment is attempted in the order just described, i.e., assembling labels based on '+' concatenation of two subsumed constituents is preferred, as smaller constituents tend to be more accurately labeled. If no label is assignable by any of these three methods, a default label 'FAIL' is assigned.

In addition to the features used in hierarchical phrase-based MT, SAMT introduces a relative-frequency estimated probability of the rule given its left-hand-side nonterminal.

4 Modeling Source Span Length of PSCFG Rule Substitution Sites

Extracting a rule with k right-hand-side nonterminal pairs, i.e., substitution sites (from now on called an order-k rule), by the method described in Section 3 involves k+1 phrase pairs: one phrase pair used as the initial rule and k phrase pairs that are sub-phrase-pairs of the first and are replaced by nonterminal pairs. Conversely, during translation, applying this rule amounts to combining k hypotheses from k different chart cells, each represented by a source span and a nonterminal, to form a new hypothesis and file it into a chart cell. Intuitively, we want the source span lengths of these k+1 chart cells to be close to the source-side lengths of the k+1 phrase pairs from the training corpus that were involved in extracting the rule. Of course, each rule generally was extracted from multiple training corpus locations, with different involved phrase pairs of different lengths. We therefore model k+1 source span length distributions for each order-k rule in the grammar. Ignoring the discreteness of source span length for the sake of easier estimation, we assume the distribution to be log-normal. This is motivated by the fact that source span length is positive and that we expect its deviation between instances of the same rule to be greater for long phrase pairs than for short ones.

We can now add k̂+1 features to the translation framework, where k̂ is the maximum number of PSCFG rule nonterminal pairs, in our case two. Each feature is computed during translation time. Ideally, it should represent the probability of the hypothesized rule given the respective chart cell span length. However, as each competing rule underlies a different distribution, this would require a Bayesian setting in which priors over distributions are specified. In this preliminary work we take a simpler approach: based on the rule's span distribution, we compute the probability that a span length no likelier than the one encountered was generated from the distribution. This probability thus yields a confidence estimate for the rule. More formally, let μ be the mean and σ the standard deviation of the logarithm of the span length random variable X concerned, and let x be the span length encountered during decoding. Then the computed confidence estimate is given by

    P(|ln(X) − μ| ≥ |ln(x) − μ|) = 2 * Z(−|ln(x) − μ| / σ)

where Z is the cumulative density function of the normal distribution with mean zero and variance one.

The confidence estimate is one if the encountered span length is equal to the mean of the distribution, and decreases as the encountered span length deviates further from the mean. The severity of that decline is determined by the distribution variance: the higher the variance, the less a deviation from the mean is penalized.

Mean and variance of the log source span length are sufficient statistics of the log-normal distribution. As we extract rules in a distributed fashion, we use a straightforward parallelization of the online algorithm of Welford (1962), with the improvement by West (1979), to compute the sample variance over all instances of a rule.
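The confidence estimate and the online statistics can be sketched as follows. This is a minimal illustration: the merge step is the standard parallel combination of Welford-style (n, mean, M2) summaries, which we assume is what the parallelized extraction computes; function names and data layout are ours.

```python
import math

def _phi(z):
    """Standard normal CDF Z, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def span_confidence(x, mu, sigma):
    """Confidence 2 * Z(-|ln(x) - mu| / sigma) that a span at least as
    unlikely as x was drawn from the rule's log-normal distribution."""
    if sigma == 0.0:
        return 1.0 if math.log(x) == mu else 0.0
    return 2.0 * _phi(-abs(math.log(x) - mu) / sigma)

def welford(span_lengths):
    """Welford's online algorithm over log span lengths.
    Returns (n, mean, M2); M2 / (n - 1) is the sample variance."""
    n, mean, m2 = 0, 0.0, 0.0
    for s in span_lengths:
        n += 1
        d = math.log(s) - mean
        mean += d / n
        m2 += d * (math.log(s) - mean)
    return n, mean, m2

def merge(a, b):
    """Combine two (n, mean, M2) summaries from disjoint data shards."""
    n = a[0] + b[0]
    if n == 0:
        return 0, 0.0, 0.0
    d = b[1] - a[1]
    mean = a[1] + d * b[0] / n
    m2 = a[2] + b[2] + d * d * a[0] * b[0] / n
    return n, mean, m2
```

Merging shard summaries yields exactly the statistics of a single pass over all instances, so each extraction worker can keep one summary per rule substitution site and combine them afterwards.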

5 Merging a Hierarchical and a Syntax-Based Model

While syntax-based grammars allow for more refined statistical models and guide the search by constraining substitution possibilities in a grammar derivation, grammar sizes tend to be much greater than for hierarchical grammars. Therefore the average occurrence count of a syntax rule is much lower than that of a hierarchical rule, and thus estimated probabilities are less reliable.

We propose to augment the syntax-based "rule given source side" and "rule given target side" distributions with hierarchical counterparts obtained by marginalizing over the left-hand-side and right-hand-side rule nonterminals. For example, the hierarchical equivalent of the "rule given source side" probability is obtained by summing occurrence counts over all rules that have the same source and target terminals and substitution positions but possibly differ in the left- and/or right-hand-side nonterminal labels, divided by the sum of occurrence counts of all rules that have the same source-side terminals and source-side substitution positions. Similarly, an alternative rareness penalty based on the combined frequency of all rules with the same terminals and substitution positions is obtained.

Using these syntax and hierarchical features side by side amounts to interpolation of the respective probability models in log-space, with minimum-error-rate training (MERT) determining the optimal interpolation coefficient. We also add the respective models interpolated with coefficient 0.5 in probability space as additional features to the system.

We further experiment with adding hierarchical rules separately to the syntax-augmented grammar, as proposed in Zollmann et al. (2008), with the respective syntax-specific features set to zero. A 'hierarchical-indicator' feature is added to all rules, which is one for hierarchical rules and zero for syntax rules, allowing the joint model to trade off hierarchical against syntactic rules. During translation, the hierarchical and syntax worlds are bridged by glue rules, which allow monotonic concatenation of hierarchical and syntactic partial sentence hypotheses. We separate the glue feature used in hierarchical and syntax-augmented translation into a glue feature that only fires when a hierarchical rule is glued, and a distinct glue feature firing when gluing a syntax-augmented rule.

6 Extension of SAMT to a bilingually parsed corpus

Syntax-based MT models have been proposed based both on target-side syntactic annotations (Galley et al., 2004; Zollmann and Venugopal, 2006) and on source-side annotations (Liu et al., 2006). Syntactic annotations for both the source and target language are available for popular language pairs such as Chinese–English. In this case, our grammar extraction procedure can be easily extended to impose both source and target constraints on the eligible substitutions simultaneously.

Let N_f be the nonterminal label that would be assigned to a given initial rule when utilizing the source-side parse tree, and N_e the assigned label according to the target-side parse. Then our bilingual model assigns 'N_f + N_e' to the initial rule. The extraction of complex rules proceeds as before. The number of nonterminals in this model, based on a source-model label set of size s and a target label set of size t, is thus given by st.

7 Experiments

We evaluate our approaches by comparing translation quality according to the IBM-BLEU (Papineni et al., 2002) metric on the NIST Chinese-to-English translation task, using MT04 as the development set to train the model parameters λ, and MT05, MT06 and MT08 as test sets.

We perform PSCFG rule extraction and decoding using the open-source "SAMT" system (Venugopal and Zollmann, 2009), using the provided implementations for the hierarchical and syntax-augmented grammars. For all systems, we use the bottom-up chart parsing decoder implemented in the SAMT toolkit with a reordering limit of 15 source words, and correspondingly extract rules from initial phrase pairs of maximum source length 15. All rules have at most two nonterminal symbols, which must be non-consecutive on the source side, and rules must contain at least

one source-side terminal symbol.

For parameter tuning, we use the L0-regularized minimum-error-rate training tool provided by the SAMT toolkit.

The parallel training data comprises 9.6M sentence pairs (206M Chinese words, 228M English words). The source and target language parses for the syntax-augmented grammar were generated by the Stanford parser (Klein and Manning, 2003).

The results are given in Table 1. The source span models (indicated by +span) achieve small test set improvements of 0.15 BLEU points on average for the hierarchical and 0.26 BLEU points for the syntax-augmented system, but these are not statistically significant.

Augmenting a syntax-augmented grammar with hierarchical features ("Syntax+hiermodels") results in average test set improvements of 0.5 BLEU points. These improvements are not statistically significant either, but persist across all three test sets. This demonstrates the benefit of more reliable feature estimation. Further adding the hierarchical rules to the grammar ("Syntax+hiermodels+hierrules") does not yield additional improvements.

The use of bilingual syntactic parses ('Syntax/src&tgt') turns out to be detrimental to translation quality. We assume this is due to the huge number of nonterminals in these grammars and the great amount of badly estimated low-occurrence-count rules. Perhaps merging this grammar with a regular syntax-augmented grammar could yield better results.

We also experimented with a source-parse based model ('Syntax/src'). While not able to match the translation quality of its target-based counterpart, the model still outperforms the hierarchical system on all test sets.

8 Conclusion

We proposed several improvements to the hierarchical phrase-based MT model of Chiang (2005) and its syntax-based extension by Zollmann and Venugopal (2006). We added a source span length model that, for each rule utilized in a probabilistic synchronous context-free grammar (PSCFG) derivation, gives a confidence estimate in the rule based on the number of source words spanned by the rule and its substituted child rules, resulting in small improvements for hierarchical phrase-based as well as syntax-augmented MT.

We further demonstrated the utility of combining hierarchical and syntax-based PSCFG models and grammars.

Finally, we compared syntax-augmented MT, which extracts rules based on target-side syntax, to a corresponding variant based on source-side syntax, showing that target syntax is more beneficial, and unsuccessfully experimented with a model extension that jointly takes source and target syntax into account.

Hierarchical phrase-based MT suffers from spurious ambiguity: a single translation for a given source sentence can usually be accomplished by many different PSCFG derivations. This problem is exacerbated by syntax-augmented MT with its thousands of nonterminals, and made even worse by its joint source-and-target extension. Future research should apply the work of Blunsom et al. (2008) and Blunsom and Osborne (2008), who marginalize over derivations to find the most probable translation rather than the most probable derivation, to these multi-nonterminal grammars.

All source code underlying this work is available under the GNU Lesser General Public License as part of the 'SAMT' system at: www.cs.cmu.edu/~zollmann/samt

Acknowledgements

This work is in part supported by NSF under the Cluster Exploratory program (grant NSF 0844507), and in part by the US DARPA GALE program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or DARPA.

References

Blunsom, Phil and Miles Osborne. 2008. Probabilistic inference for machine translation. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 215–223, Morristown, NJ, USA. Association for Computational Linguistics.
System                                    | Dev (MT04) | MT05  | MT06  | MT08  | TestAvg | Time
------------------------------------------|------------|-------|-------|-------|---------|-----
Hierarchical                              | 38.63      | 36.51 | 33.26 | 25.77 | 31.85   | 14.3
Hier+span                                 | 39.03      | 36.44 | 33.29 | 26.26 | 32.00   | 16.7
Syntax                                    | 39.17      | 37.17 | 33.87 | 26.81 | 32.62   | 59
Syntax+hiermodels                         | 39.61      | 37.74 | 34.30 | 27.30 | 33.11   | 68.4
Syntax+hiermodels+hierrules               | 39.69      | 37.56 | 34.66 | 26.93 | 33.05   | 34.6
Syntax+span+hiermodels+hierrules          | 39.81      | 38.02 | 34.50 | 27.41 | 33.31   | 39.6
Syntax/src+span+hiermodels+hierrules      | 39.62      | 37.25 | 33.99 | 26.44 | 32.56   | 20.1
Syntax/src&tgt+span+hiermodels+hierrules  | 39.15      | 36.92 | 33.70 | 26.24 | 32.29   | 17.5

Table 1: Translation quality in % case-insensitive IBM-BLEU (i.e., brevity penalty based on closest reference length) for different systems on Chinese-English NIST-large translation tasks. ‘TestAvg’ shows the average score over the three test sets. ‘Time’ is the average decoding time per sentence in seconds on one CPU.
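As a quick arithmetic check, the "TestAvg" column is the unweighted mean of the three test-set scores (MT05, MT06, MT08). A small sketch using the figures reported in Table 1:

```python
# Recompute the 'TestAvg' column of Table 1 from the per-set BLEU scores.
bleu = {
    "Hierarchical":      (36.51, 33.26, 25.77),
    "Syntax":            (37.17, 33.87, 26.81),
    "Syntax+hiermodels": (37.74, 34.30, 27.30),
}
for system, scores in bleu.items():
    avg = round(sum(scores) / len(scores), 2)
    print(system, avg)  # matches the TestAvg column: 31.85, 32.62, 33.11
```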

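The rule-shape constraints described earlier (at most two source-side nonterminals, which must be non-consecutive, and at least one source-side terminal) can be expressed as a simple filter. This is an illustrative reconstruction, not the SAMT extractor itself; the predicate `is_nt` is a hypothetical helper:

```python
# Sketch of the source-side rule constraints (assumption: nonterminals
# are distinguishable from terminals by a caller-supplied predicate).
def source_side_ok(symbols, is_nonterminal):
    nt_positions = [i for i, s in enumerate(symbols) if is_nonterminal(s)]
    if len(nt_positions) > 2:
        return False                              # at most two nonterminals
    if any(b - a == 1 for a, b in zip(nt_positions, nt_positions[1:])):
        return False                              # nonterminals must not be adjacent
    return len(nt_positions) < len(symbols)       # at least one source terminal

is_nt = lambda s: s.startswith("X")               # hypothetical nonterminal marker
print(source_side_ok(["X1", "de", "X2"], is_nt))  # True
print(source_side_ok(["X1", "X2", "de"], is_nt))  # False: adjacent nonterminals
print(source_side_ok(["X1"], is_nt))              # False: no terminal
```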
Blunsom, Phil, Trevor Cohn, and Miles Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2).

Chappelier, J.C. and M. Rajman. 1998. A generalized CYK algorithm for parsing stochastic CFG. In Proceedings of Tabulation in Parsing and Deduction (TAPD), pages 133-137, Paris.

Chen, Yu and Andreas Eisele. 2010. Hierarchical hybrid translation between English and German. In Hansen, Viggo and Francois Yvon, editors, Proceedings of the 14th Annual Conference of the European Association for Machine Translation, pages 90-97. EAMT.

Chiang, David, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 224-233, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chiang, David. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2).

Chiang, David. 2010. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1443-1452, Uppsala, Sweden, July. Association for Computational Linguistics.

Galley, Michael, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Huang, Liang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Kasami, T. 1965. An efficient recognition and syntax-analysis algorithm for context-free languages. Technical report, Air Force Cambridge Research Lab.

Klein, Dan and Christopher Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Koehn, Philipp, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Liu, Yang, Qun Liu, and Shouxun Lin. 2006. Tree-to-string alignment template for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

Marcu, Daniel, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Sydney, Australia.

Och, Franz J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Petrov, Slav, Aria Haghighi, and Dan Klein. 2008. Coarse-to-fine syntactic machine translation using language projections. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Steedman, Mark. 2000. The Syntactic Process. MIT Press.

Venugopal, Ashish and Andreas Zollmann. 2009. Grammar based statistical MT on Hadoop: An end-to-end toolkit for large scale PSCFG based MT. The Prague Bulletin of Mathematical Linguistics, 91:67-78.

Venugopal, Ashish, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419-420.

West, D. H. D. 1979. Updating mean and variance estimates: an improved method. Commun. ACM, 22(9):532-535.

Zollmann, Andreas and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, HLT/NAACL.

Zollmann, Andreas, Ashish Venugopal, Franz J. Och, and Jay Ponte. 2008. A systematic comparison of phrase-based, hierarchical and syntax-augmented statistical MT. In Proceedings of the Conference on Computational Linguistics (COLING).

Deep Syntax Language Models and Statistical Machine Translation

Yvette Graham, NCLT, Dublin City University ([email protected])
Josef van Genabith, CNGL, Dublin City University ([email protected])

Abstract

Hierarchical Models increase the reordering capabilities of MT systems by introducing non-terminal symbols to phrases that map source language (SL) words/phrases to the correct position in the target language (TL) translation. Building translations via discontiguous TL phrases increases the difficulty of language modeling, however, introducing the need for heuristic techniques such as cube pruning (Chiang, 2005), for example. An additional possibility to aid language modeling in hierarchical systems is to use a language model that models the fluency of words not using their local context in the string, as in traditional language models, but instead using the deeper context of a word. In this paper, we explore the potential of deep syntax language models, providing an interesting comparison with the traditional string-based language model. We include an experimental evaluation that compares the two kinds of models independently of any MT system, to investigate the possible potential of integrating a deep syntax language model into Hierarchical SMT systems.

1 Introduction

In Phrase-Based Models of Machine Translation, all phrases consistent with the word alignment are extracted (Koehn et al., 2003), with shorter phrases needed for high coverage of unseen data and longer phrases providing improved fluency in target language translations. Hierarchical Models (Chiang, 2007; Chiang, 2005) build on Phrase-Based Models by relaxing the constraint that phrases must be contiguous sequences of words, allowing a short phrase (or phrases) nested within a longer phrase to be replaced by a non-terminal symbol, forming a new hierarchical phrase. Traditional language models use the local context of words to estimate the probability of the sentence, and introducing hierarchical phrases that generate discontiguous sequences of TL words increases the difficulty of computing language model probabilities during decoding, requiring sophisticated heuristic language modeling techniques (Chiang, 2007; Chiang, 2005).

Leaving aside heuristic language modeling for a moment, the difficulty of integrating a traditional string-based language model into the decoding process in a hierarchical system highlights a slight incongruity between the translation model and language model in Hierarchical Models. According to the translation model, the best way to build a fluent TL translation is via discontiguous phrases, while the language model can only provide information about the fluency of contiguous sequences of words. Intuitively, a language model that models fluency between discontiguous words may be well-suited to hierarchical models. Deep syntax language models condition the probability of a word on its deep context, i.e. words linked to it via dependency relations, as opposed to preceding words in the string. During decoding in Hierarchical Models, words missing a context in the string, due to being preceded by a non-terminal, might however be in a dependency relation with a word that is already present in the string, and

Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 118-126, COLING 2010, Beijing, August 2010.

this context could add useful information about the fluency of the hypothesis as it is constructed. In addition, using the deep context of a word provides a deeper notion of fluency than the local context provides on its own, and this might be useful to improve such things as lexical choice in SMT systems. Good lexical choice is very important, and the deeper context of a word, if available, may provide more meaningful information and result in better lexical choice. Integrating such a model into a Hierarchical SMT system is not straightforward, however, and we believe that before embarking on this it is worthwhile to evaluate the model independently of any MT system. We therefore provide an experimental evaluation of the model and, in order to provide an interesting comparison, we evaluate a traditional string-based language model on the same data.

2 Related Work

The idea of using a language model based on deep syntax is not new to SMT. Shen et al. (2008) use a dependency-based language model in a string-to-dependency-tree SMT system for Chinese-English translation, using information from the deeper structure about dependency relations between words, in addition to the position of the words in the string, including information about whether context words were positioned on the left or right of a word. Bojar and Hajič (2008) use a deep syntax language model in an English-Czech dependency tree-to-tree transfer system, and include three separate bigram language models: a reverse, direct and joint model. The model in our evaluation is similar to their direct bigram model, but is not restricted to bigrams.

Riezler and Maxwell (2006) use a trigram deep syntax language model in German-English dependency tree-to-tree transfer to re-rank decoder output. The language model of Riezler and Maxwell (2006) is similar to the model in our evaluation, but differs in that it is restricted to a trigram model trained on LFG f-structures. In addition, as language modeling is not the main focus of their work, they provide little detail on the language model they use, except to say that it is based on "log-probability of strings of predicates from root to frontier of target f-structure, estimated from predicate trigrams in English f-structures" (Riezler and Maxwell, 2006). An important property of LFG f-structures (and deep syntactic structures in general) was possibly overlooked here. F-structures can contain more than one path of predicates from the root to a frontier that include the same ngram, and this occurs when the underlying graph includes unary branching followed by branching with arity greater than one. In such cases, the language model probability as described in Riezler and Maxwell (2006) is incorrect, as the probability of these ngrams will be included multiple times. In our definition of a deep syntax language model, we ensure that such duplicate ngrams are omitted in training and testing. In addition, Wu (1998) uses a bigram deep syntax language model in a stochastic inversion transduction grammar for English to Chinese. None of the related research we discuss here has included an evaluation of the deep syntax language model it employs in isolation from the MT system, however.

3 Deep Syntax

The deep syntax language model we describe is not restricted to any individual theory of deep syntax. For clarity, however, we restrict our examples to LFG, which is also the deep syntax theory we use for our evaluation. The Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982; Kaplan, 1995; Bresnan, 2001; Dalrymple, 2001) functional structure (f-structure) is an attribute-value encoding of bi-lexical labeled dependencies, such as subject, object and adjunct for example, with morpho-syntactic atomic attributes encoding information such as mood and tense of verbs, and person, number and case for nouns. Figure 1 shows the LFG f-structure for the English sentence "Today congress passed Obama's health care bill."1

Encoded within the f-structure is a directed graph, and our language model uses a simplified acyclic unlabeled version of this graph. Figure 1(b) shows the graph structure encoded within the f-structure of Figure 1(a). We discuss the simplification procedure later in Section 5.

1 Morpho-syntactic information / atomic features are omitted from the diagram.

(a) f-structure:

  [ PRED pass
    SUBJ [ PRED congress ]
    OBJ  [ PRED bill
           SPEC [ POSS [ PRED Obama ] ]
           MOD  [ PRED care
                  MOD [ PRED health ] ] ]
    ADJ  [ PRED today ] ]

(b) dependency graph:

  pass -> today, congress, bill
  bill -> obama, care
  care -> health

Figure 1: “Today congress passed Obama’s health care bill.”
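The graph of Figure 1(b) can be written as a simple head map, from which the trigram conditioning contexts of the deep syntax language model (head of the head, then the head, with end symbols at the leaves) fall out directly, matching the factorization of Figure 2(a). This is an illustrative sketch, not the authors' code; the root's context degenerates to the start symbol alone:

```python
# Figure 1(b) as a word -> head map; "pass" is the root, headed by <s>.
HEAD = {
    "pass": "<s>", "today": "pass", "congress": "pass", "bill": "pass",
    "obama": "bill", "care": "bill", "health": "care",
}
has_children = set(HEAD.values())

def context(word):
    """Deep-syntax trigram context: (head of the head, head)."""
    head = HEAD[word]
    if head == "<s>":
        return ("<s>",)          # the root is conditioned on <s> alone
    return (HEAD.get(head, "<s>"), head)

events = [(w, context(w)) for w in HEAD]
# Leaf nodes additionally generate an end symbol in their own context.
events += [("</s>", (HEAD[w], w)) for w in HEAD if w not in has_children]
for w, ctx in events:
    print("p(", w, "|", *ctx, ")")
```

For example, the event for obama comes out as p(obama | pass bill), exactly the factor shown for it in Figure 2(a).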

4 Language Model

We use a simplified approximation of the deep syntactic structure, d_e, that encodes the unlabeled dependencies between the words of the sentence, to estimate a deep syntax language model probability. Traditional string-based language models combine the probability of each word in the sentence, w_i, given its preceding context, the sequence of words from w_1 to w_{i-1}, as shown in Equation 1.

  p(w_1, w_2, ..., w_l) = \prod_{i=1}^{l} p(w_i | w_1, ..., w_{i-1})    (1)

In a similar way, a deep syntax language model probability combines the probability of each word in the structure, w_i, given its context within the structure, the sequence of words from w_r, the head of the sentence, to w_{m(i)}, as shown in Equation 2, with the function m used to map the index of a word in the structure to the index of its head.2

  p(d_e) = \prod_{i=1}^{l} p(w_i | w_r, ..., w_{m(m(i))}, w_{m(i)})    (2)

In order to combat data sparseness, we apply the Markov assumption, as is done in traditional string-based language modeling, and simplify the probability by only including a limited length of history when estimating the probability of each word in the structure. For example, a trigram deep syntax language model conditions the probability of each word on the sequence of words consisting of the head of the head of the word followed by the head of the word, as follows:

  p(d_e) = \prod_{i=1}^{l} p(w_i | w_{m(m(i))}, w_{m(i)})    (3)

In addition, similar to string-based language modeling, we add a start symbol, <s>, at the root of the structure and end symbols, </s>, at the leaves, to include the probability of a word being the head of the sentence and the probability of words occurring as leaf nodes in the structure. Figure 2(a) shows an example of how a trigram deep syntax language model probability is computed for the example sentence in Figure 1(a).

2 We refer to the lexicalized nodes in the dependency structure as words; alternatively, the term predicate can be used.

5 Simplified Approximation of the Deep Syntactic Representation

We describe the deep syntactic structure, d_e, as an approximation since a parser is employed to automatically produce it and there is therefore no certainty that we use the actual/correct deep syntactic representation for the sentence. In addition, the function m requires that each node in the structure has exactly one head; however, structure-sharing can occur within deep syntactic structures, resulting in a single word legitimately having two heads. In such cases we use a simplification of the graph in the deep syntactic structure. Figure 3 shows an f-structure in which the subject

of both like, be and president is hillary. In our simplified structure, the dependency relations between be and hillary and between president and hillary are dropped. We discuss how we do this later in Section 6. Similar to our simplification for structure sharing, we also simplify structures that contain cycles, by discarding edges that cause loops in the structure.

(a) Deep Syntax LM:

  p(e) ≈ p(pass | <s>) * p(today | <s> pass) * p(</s> | pass today)
       * p(congress | <s> pass) * p(</s> | pass congress)
       * p(bill | <s> pass) * p(obama | pass bill) * p(</s> | bill obama)
       * p(care | pass bill) * p(health | bill care) * p(</s> | care health)

(b) Traditional LM:

  p(e) ≈ p(today | <s>) * p(congress | <s> today) * p(passed | today congress)
       * p(obama | congress passed) * p(' | passed obama) * p(s | obama ')
       * p(health | ' s) * p(care | s health) * p(bill | health care)
       * p(. | care bill) * p(</s> | bill .)

Figure 2: Example Comparison of Deep Syntax and Traditional Language Models

  [ PRED like
    SUBJ 1:[ PRED Hillary ]
    XCOMP [ PRED be
            SUBJ [1]
            XCOMP-PRED [ PRED president
                         SUBJ [1]
                         ADJ [ PRED at
                               OBJ [ PRED U.N.
                                     SPEC [ PRED the ] ] ] ] ] ]

  simplified graph:  like -> hillary, be;  be -> president;  president -> at;  at -> U.N.

Figure 3: "Hillary liked being president at the U.N."

6 Implementation

SRILM (Stolcke, 2002) can be used to compute a language model from ngram counts (the -read option of the ngram-count command). Training the language model therefore simply requires accurately extracting counts from the deep-syntax-parsed training corpus. To simplify the structures to acyclic graphs, nodes are labeled with an increasing index number via a depth-first traversal. This allows each arc causing a loop in the graph, or argument sharing, to be identified by a simple comparison of index numbers, as the index number of its start node will be greater than that of its end node. The algorithm we use to extract ngrams from the dependency structures is straightforward: we simply carry out a depth-first traversal of the graph to construct paths of words that stretch from the root of the graph to words

at the leaves, and then extract the required order of ngrams from each path. As mentioned earlier, some ngrams can belong to more than one path. Figure 4 shows an example structure containing unary branching followed by binary branching, in which the sequence of symbols and words "<s> agree with point and" belongs both to the path ending in "two </s>" and to the path ending in "three </s>". In order to ensure that only distinct ngrams are extracted, we assign each word in the structure a unique id number and include this in the extracted ngrams. Paths are split into ngrams, and duplicate ngrams resulting from their occurrence in more than one path are discarded. It is also possible for ngrams to legitimately be repeated in a deep structure, and in such cases we do not discard these ngrams. Legitimately repeating ngrams are easily identified, as the id numbers attached to words will be different.

  [ PRED agree
    SUBJ [ PRED nobody ]
    XCOMP [ PRED with
            OBJ [ PRED point
                  COORD and
                  ADJ { [ PRED two ], [ PRED three ] } ] ] ]

  graph:  agree -> nobody, with;  with -> point;  point -> and;  and -> two, three

Figure 4: "Nobody agreed with points two and three."

7 Deep Syntax and Lexical Choice in SMT

Correct lexical choice in machine translation is extremely important, and PB-SMT systems rely on the language model to ensure that, when two phrases are combined with each other, the model can rank combined phrases that are fluent higher than less fluent combinations. Conditioning the probability of each word on its deep context has the potential to provide a more meaningful context than the local context within the string. A comparison of the probabilities of individual words in the deep syntax model and the traditional language model in Figure 2 clearly shows this. For instance, let us consider how the language model in a German-to-English SMT system is used to help rank the following two translations: today congress passed ... and today convention passed ... (the word Kongress in German can be translated into either congress or convention in English). In the deep syntax model, the important competing probabilities are (i) p(congress | pass) and (ii) p(convention | pass), where (i) can be interpreted as the probability of the word congress modifying pass when pass is the head of the entire sentence and, similarly, (ii) the probability of the word convention modifying pass when pass is the head of the entire sentence. In the traditional string-based language model, the equivalent competing probabilities are (i) p(congress | today), the probability of congress following today when today is the start of the sentence, and (ii) p(convention | today), the probability of convention following today when today is the start of the sentence, showing that the deep syntax language model is able to use more meaningful context for good lexical choice when estimating the probability of the words congress and convention compared to the traditional language model.

In addition, the deep syntax language model will encounter fewer data sparseness problems for some words than a string-based language model. In many languages, words occur that can legitimately be moved to different positions within the string without any change to the dependencies between words. For example, sentential adverbs in English can legitimately change position in a sentence without affecting the underlying dependencies between words. The word today in "Today congress passed Obama's health bill"

can appear as "Congress passed Obama's health bill today" and "Congress today passed Obama's health bill". Any sentence in the training corpus in which the word pass is modified by today will result in a bigram being counted for the two words, regardless of the position of today within each sentence.

In addition, some surface-form words, such as auxiliary verbs for example, are not represented as predicates in the deep syntactic structure. For lexical choice, it is not really the choice of auxiliary verbs that is most important, but rather the choice of an appropriate lexical item for the main verb (that belongs to the auxiliary verb). Omitting auxiliary verbs during language modeling could aid good lexical choice, by focusing on the choice of a main verb without the effect of whatever auxiliary verb is used with it.

For some words, however, the probability in the string-based language model provides as good if not better context than the deep syntax model, but only for the few words that happen to be preceded by words that are important to their lexical choice, and this reinforces the idea that SMT systems can benefit from using both a deep syntax and a string-based language model. For example, the probability of bill in Figures 2(a) and 2(b) is computed in the deep syntax model as p(bill | <s> pass) and in the string-based model using p(bill | health care), and for this word the local context seems to provide more important information than the deep context when it comes to lexical choice. The deep model nevertheless adds some useful information, as it includes the probability of bill being an argument of pass when pass is the head of a sentence.

In traditional language modeling, the special start symbol <s> is added at the beginning of a sentence so that the probability of the first word appearing as the first word of a sentence can be included when estimating the probability. With similar motivation, we add a start symbol to the deep syntactic representation so that the probability of the head of the sentence occurring as the head of a sentence can be included. For example, p(be | <s>) will have a high probability, as the verb be is the head of many sentences of English, whereas p(colorless | <s>) will have a low probability, since it is unlikely to occur as the head. We also add end symbols at the leaf nodes in the structure to include the probability of these words appearing at that position in a structure. For instance, a noun followed by its determiner, such as p(</s> | attorney a), would have a high probability compared to a conjunction followed by a verb, p(</s> | and be).

8 Evaluation

We carry out an experimental evaluation to investigate the potential of the deep syntax language model we describe in this paper, independently of any machine translation system. We train a 5-gram deep syntax language model on 7M English f-structures, and evaluate it by computing perplexity and ngram coverage statistics on a held-out test set of parsed fluent English sentences. In order to provide an interesting comparison, we also train a traditional string-based 5-gram language model on the same training data and test it on the same held-out test set of English sentences. A deep syntax language model comes with the obvious disadvantage that any data it is trained on must be in-coverage of the parser, whereas a string-based language model can be trained on any available data of the appropriate language. Since parser coverage is not the focus of our work, we eliminate its effects from the evaluation by selecting the training and test data for both the string-based and deep syntax language models on the basis that they are in fact in-coverage of the parser.

8.1 Language Model Training

Our training data consists of English sentences from the WMT09 monolingual training corpus, with a sentence length range of 5-20 words, that are in coverage of the parsing resources (Kaplan et al., 2004; Riezler et al., 2002), resulting in approximately 7M sentences. Preparation of training and test data for the traditional language model consisted of tokenization and lowercasing. Parsing was carried out with XLE (Kaplan et al., 2002) and an English LFG grammar (Kaplan et al., 2004; Riezler et al., 2002). The parser produces a packed representation of all possible parses according to the LFG grammar, and we select only the single best parse for language model training by means of a disambiguation model (Kaplan et

Corpus                 | Tokens | Ave. Tokens per Sent. | Vocab
-----------------------|--------|-----------------------|------
strings                | 138.6M | 19                    | 345K
LFG lemmas/predicates  | 118.4M | 16                    | 280K

Table 1: Language model statistics for the string-based and deep syntax language models; statistics are for string tokens and LFG lemmas for the same set of 7.29M English sentences.
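The per-sentence averages in Table 1 are consistent with the reported corpus totals divided by the 7.29M training sentences:

```python
# Sanity check of the 'Ave. Tokens per Sent.' column of Table 1.
sentences = 7.29e6
for name, tokens in [("strings", 138.6e6), ("LFG lemmas/predicates", 118.4e6)]:
    print(name, round(tokens / sentences))  # 19 and 16, as in the table
```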
We there- at 56.89%, DS-eos at 53.67%, SB-eos-punc at fore also compute statistics for each model when 53.45%. For 4-gram and 5-gram coverage a sim- end of sentence markers are omitted from training ilar coverage ranking is seen, but with DS-eos and testing. 4 In addition, since the vast majority (4gram at 17.17%, 5gram at 3.59%) and SB-eos- of punctuation is not represented as predicates in punc (4gram at 20.24%, 5gram at 5.76%) swap- LFG f-structures, we also test the string-based lan- ping rank position. guage model when punctuation has been removed. 8.4 Discussion 8.3 Results Table 2 shows perplexity scores and ngram cover- Ngram coverage statistics for the DS-eos and age statistics for each order and type of language SB-eos-punc models provide the fairest com- model. Note that perplexity scores for the string- parison, with the deep syntax model achiev- based and deep syntax language models are not ing higher coverage than the string-based model directly comparable because each model has a dif- for bigrams (+0.87%) and trigrams (+0.22%), ferent vocabulary. Although both models train on marginally lower coverage coverage of unigrams an identical set of sentences, the data is in a dif- (-0.02%) and lower coverage of 4-grams (-3.07%) ferent format for each model, as the string-based and 5-grams (2.17%) compared to the string- based model. 3 test2006.en and test2007.en Perplexity scores for the deep syntax model 4When we include end of sentence marker probabilities we also include them for normalization, and omit them from when eos symbols are included are low (79 for the normalization when their probabilities are omitted. 5gram model) and this is caused by eos markers

in the test set in general being assigned relatively high probabilities by the model, and since several occur per sentence, the perplexity increases when they are omitted (453 for the 5-gram model).

             1-gram         2-gram         3-gram         4-gram         5-gram
             cov.    ppl    cov.    ppl    cov.    ppl    cov.    ppl    cov.    ppl
SB-eos       99.61%  1045   92.83%   297   56.89%   251   23.32%   268   7.19%   279
SB-eos-punc  99.58%  1357   91.57%   382   53.45%   327   20.24%   348   5.76%   360
DS-eos       99.56%  1005   92.44%   422   53.67%   412   17.17%   446   3.59%   453
SB+eos       99.63%   900   93.09%   227   58.75%   194   25.48%   207   8.35%   215
DS+eos       99.70%   211   94.71%    77   64.71%    73   29.86%    78   8.75%    79

Table 2: Ngram coverage and perplexity (ppl) on the held-out test set. Note: DS = deep syntax, SB = string-based, eos = end-of-sentence markers, punc = punctuation.

Tables 3 and 4 show the most frequently encountered trigrams in the test data for each type of model. A comparison shows how different the two models are and highlights the potential of the deep syntax language model to aid lexical choice in SMT systems. Many of the most frequently occurring trigram probabilities for the deep syntax model are for arguments of the main verb of the sentence, conditioned on the main verb, and including such probabilities in a system could improve fluency by using information, explicitly in the model, about which words are in a dependency relation together.

In addition, a frequent trigram in the held-out data is be also, where the word also is a sentential adverb modifying be. Trigrams for sentential adverbs are likely to be less affected by data sparseness in the deep syntax model than in the string-based model, which could result in the deep syntax model improving fluency with respect to combinations of main verbs and their modifying adverbs.

The most frequent trigram in the deep syntax test set is and be, in which the head of the sentence is the conjunction and with argument be. In this type of syntactic construction in English, it is often the case that the conjunction and the verb are distant from each other in the sentence, for example: Nobody was there except the old lady and without thinking we quickly left (where was and and are in a dependency relation). Using a deep syntax language model could therefore improve lexical choice for such words, since they are too distant for a string-based model.

3-gram           No. Occ.   Prob.
and be              42      0.1251
be this             21      0.0110
must we             19      0.0347
would i             19      0.0414
be in               17      0.0326
be that             14      0.0122
be debate the       13      0.0947
be debate           13      0.0003
can not             12      0.0348
and president       11      0.0002
would like          11      0.0136
would be            11      0.0835
be also             10      0.0075

Table 3: Most frequent trigrams in the test set for the deep syntax model.

3-gram                    No. Occ.   Prob.
mr president ,               40      0.5385
this is                      25      0.1877
by the european              20      0.0014
the european union           18      0.1096
it is                        16      0.1815
the european parliament      15      0.0252
would like to                15      0.4944
i would                      15      0.0250
that is                      14      0.1094
i would like                 14      0.0335
and gentlemen ,              13      0.1005
ladies and gentlemen         13      0.2834
we must                      12      0.0120
should like to               12      0.1304
i should like                11      0.0089
, ladies and                 11      0.5944
, it is                      10      0.1090

Table 4: Most frequent trigrams in the test set for the string-based model.

9 Conclusions

We presented a comparison of a deep syntax language model and a traditional string-based language model. Results showed that the deep syntax language model achieves similar ngram coverage to the string-based model on a held-out test set. We highlighted the potential of integrating such a model into SMT systems for improving lexical choice, by using a deeper context for the probabilities of words than a string-based model provides.

References

Bojar, Ondřej, Jan Hajič. 2008. Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio.

Bresnan, Joan. 2001. Lexical-Functional Syntax. Blackwell, Oxford.

Chiang, David. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 263-270, Ann Arbor, Michigan.

Chiang, David. 2007. Hierarchical Phrase-Based Models of Translation. Computational Linguistics, 33(2).

Dalrymple, Mary. 2001. Lexical Functional Grammar. Academic Press, San Diego, CA; London.

Kaplan, Ronald M., Joan Bresnan. 1982. Lexical Functional Grammar, a Formal System for Grammatical Representation. In J. Bresnan, editor, The Mental Representation of Grammatical Relations, 173-281, MIT Press, Cambridge, MA.

Kaplan, Ronald M. 1995. The Formal Architecture of Lexical Functional Grammar. In Formal Issues in Lexical Functional Grammar, ed. Mary Dalrymple, pages 7-28, CSLI Publications, Stanford, CA.

Kaplan, Ronald M., Tracy H. King, John T. Maxwell. 2002. Adapting Existing Grammars: the XLE Experience. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.

Kaplan, Ronald, Stefan Riezler, Tracy H. King, John T. Maxwell, Alexander Vasserman. 2004. Speed and Accuracy in Shallow and Deep Stochastic Parsing. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Meeting, Boston, MA.

Koehn, Philipp, Franz Josef Och, Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the Human Language Technology and North American Chapter of the Association for Computational Linguistics Conference, pages 48-54.

Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Tenth Machine Translation Summit.

Koehn, Philipp, Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 868-876.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, demonstration session.

Riezler, Stefan, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell, Mark Johnson. 2002. Parsing the Wall Street Journal using Lexical Functional Grammar and Discriminative Estimation Techniques (grammar version 2005). In Proceedings of the 40th ACL, Philadelphia.

Riezler, Stefan, John T. Maxwell III. 2006. Grammatical Machine Translation. In Proceedings of HLT-ACL, pages 248-255, New York.

Shen, Libin, Jinxi Xu, Ralph Weischedel. 2008. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In Proceedings of ACL-08: HLT, pages 577-585.

Stolcke, Andreas. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado.

Wu, Dekai, Hongsing Wong. 1998. Machine Translation with a Stochastic Grammatical Channel. In Proceedings of the 36th ACL and 17th COLING, Montreal, Quebec.
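As an aside on the coverage statistic used throughout Section 8: the figures in Table 2 are the proportion of n-grams in the test set that also occur in the training data. SRILM reports these numbers directly; the sketch below only illustrates the calculation itself, on hypothetical toy sentences (the data and the "</s>" marker here are illustrative, not the actual Europarl sets or SRILM's internals).

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(train_sents, test_sents, n, eos=True):
    """Fraction of test-set n-gram tokens already seen in training.

    When eos is True, an end-of-sentence marker is appended to every
    sentence, mirroring the +eos / -eos conditions in Table 2.
    """
    def prep(sents):
        return [list(s) + ["</s>"] if eos else list(s) for s in sents]

    seen = {g for s in prep(train_sents) for g in ngrams(s, n)}
    test = [g for s in prep(test_sents) for g in ngrams(s, n)]
    return sum(g in seen for g in test) / len(test) if test else 0.0

# Toy illustration (hypothetical data):
train = [["we", "must", "act"], ["we", "must", "vote"]]
test = [["we", "must", "act"]]
print(coverage(train, test, 2, eos=True))   # all 3 test bigrams seen -> 1.0
print(coverage(train, test, 3, eos=False))  # the single trigram is seen -> 1.0
```

Appending the eos marker adds one extra n-gram per test sentence, which is why the +eos and -eos coverage rows in Table 2 differ even for the same underlying text.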

Author Index

Bandyopadhyay, Sivaji, 83
Bangalore, Srinivas, 34
Ben hamadou, Abdelmajid, 61
Cancedda, Nicola, 1
Cao, Hailong, 28
Du, Jinhua, 19
Dymetman, Marc, 1
Finch, Andrew, 28
Gali, Karthik, 66
Graham, Yvette, 118
Jamoussi, Salma, 61
Jiang, Jie, 19
Joshi, Aravind, 66
Khalilov, Maxim, 92
Kolachina, Prasanth, 34
Kolachina, Sudheer, 34
Lo, Chi-kiu, 52
Ma, Yanjun, 101
Mohaghegh, Mahsa, 75
Moir, Tom, 75
Nivre, Joakim, 10
PVS, Avinesh, 34
Saers, Markus, 10
Sangal, Rajeev, 66
Sarrafzadeh, Abdolhossein, 75
Sima’an, Khalil, 92
Singh, Thoudam Doren, 83
Sumita, Eiichiro, 28
Turki Khemakhem, Ines, 61
van Genabith, Josef, 43, 118
Venkatapathy, Sriram, 34, 66
Vogel, Stephan, 110
Way, Andy, 19, 101
Wu, Dekai, 10, 52
Zhechev, Ventsislav, 43
Zollmann, Andreas, 110