Head-modifier Relation based Non-lexical Reordering Model for Phrase-Based Translation

Shui Liu1, Sheng Li1, Tiejun Zhao1, Min Zhang2, Pengyuan Liu3
1School of Computer Science and Technology, Harbin Institute of Technology
{liushui,lisheng,tjzhao}@mtlab.hit.edu.cn
2Institute for Infocomm Research
[email protected]
3Institute of Computational Linguistics, Peking University
[email protected]

Abstract

Phrase-based statistical MT (SMT) is a milestone in MT. However, the translation model in phrase-based SMT is structure free, which greatly limits its reordering capacity. To address this issue, we propose a non-lexical head-modifier based reordering model on the word level by utilizing the constituent based parse tree of the source side. Our experimental results on the NIST Chinese-English benchmarking data show that, with a very small model, our method significantly outperforms the baseline by 1.48% BLEU score.

1 Introduction

Syntax has been successfully applied to SMT to improve translation performance. Research in applying syntax information to SMT has been carried out in two aspects. On the one hand, the syntax knowledge is employed by directly integrating the syntactic structure into the translation rules, i.e. syntactic translation rules. In this perspective, the reordering of the target translation is modeled by the syntax structure explicitly. Chiang (2005), Wu (1997) and Xiong (2006) learn the syntax rules using formal grammars, while more research is conducted to learn syntax rules with the help of linguistic analysis (Yamada and Knight, 2001; Graehl and Knight, 2004). However, there are some challenges to these models. Firstly, the linguistic analysis is far from perfect. Most of these methods require an off-the-shelf parser to generate the syntactic structure, which makes the translation results sensitive to parsing errors to some extent. To tackle this problem, n-best parse trees and parsing forests (Mi and Huang, 2008; Zhang, 2009) have been proposed to relieve the error propagation brought by linguistic analysis. Secondly, some translation rules which violate the boundary of linguistic analysis are also useful in these models (DeNeefe et al., 2007; Cowan et al., 2006). Thus, a tradeoff needs to be found between linguistic sense and formal sense.

On the other hand, instead of using syntactic translation rules, some previous work attempts to learn the syntactic knowledge separately and then integrate that knowledge into the original framework as a constraint. Marton and Resnik (2008) utilize linguistic analysis derived from the parse tree to constrain the translation in a soft way. By doing so, this approach addresses the challenges brought by linguistic analysis through the log-linear model in a soft way.

Starting from the state-of-the-art phrase-based model Moses (Koehn et al., 2007), we propose a head-modifier relation based reordering model and use the proposed model as a soft syntax constraint in the phrase-based translation framework. Compared with most previous soft constraint models, we study the way to utilize the constituent based parse tree structure by mapping the parse tree to sets of head-modifier relations for phrase reordering. In this way, we build a word level reordering model instead of a phrasal/constituent level model. In our model, with the help of the alignment and the head-modifier dependency relationships in the source side, the reordering type of each target word with an aligned word in the source side is identified as one of several pre-defined reordering types. With these reordering types, the reordering of phrases in translation is estimated on the word level.


Fig 1. A Constituent-based Parse Tree

2 Baseline

Moses, a state-of-the-art phrase-based SMT system, is used as our baseline system. In Moses, given the source language f and target language e, the decoder searches for:

e_best = argmax_e p(e | f) · p_LM(e) · ω^length(e)   (1)

where p(e|f) is computed using the phrase translation model, distortion model and lexical reordering model, p_LM(e) is computed using the language model, and ω^length(e) is the word penalty model. Among the above models, there are three reordering-related components: the language model, the lexical reordering model and the distortion model. The language model can reorder the local target words within a fixed window in an implied way. The lexical reordering model and the distortion model tackle the reordering problem between adjacent phrases on the lexical level and the alignment level. Besides these reordering models, the decoder imposes distortion pruning constraints to encourage the decoder to translate the leftmost uncovered word in the source side first and to limit the reordering within a certain range.
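As a worked example of Equation (1), the following minimal Python sketch combines the same three components in log space to pick the best candidate. This is only an illustration of the formula, not Moses code; the candidate set, probabilities and the ω value are made up.

```python
import math

def baseline_score(candidates, omega=0.8):
    """Pick the best candidate under e_best = argmax_e p(e|f) * p_LM(e) * omega^len(e).

    `candidates` maps a candidate translation (tuple of words) to
    (p(e|f), p_LM(e)); scores are combined in log space for stability.
    """
    best, best_logscore = None, float("-inf")
    for words, (p_tm, p_lm) in candidates.items():
        logscore = math.log(p_tm) + math.log(p_lm) + len(words) * math.log(omega)
        if logscore > best_logscore:
            best, best_logscore = words, logscore
    return best, best_logscore

# Toy usage with invented probabilities for two candidate translations.
candidates = {
    ("economic", "development"): (0.04, 0.010),
    ("development", "of", "economy"): (0.05, 0.002),
}
print(baseline_score(candidates))
```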

3 Model

In this paper, we utilize the constituent parse tree of the source language to enhance the reordering capacity of the translation model. Instead of directly employing parse tree fragments (Bod, 1992; Johnson, 1998) in reordering rules (Huang and Knight, 2006; Liu, 2006; Zhang and Jiang, 2008), we map the trees to sets of head-modifier dependency relations (Collins, 1996), which can be obtained from the constituent based parse tree with the help of head rules (Bikel, 2004).

3.1 Head-modifier Relation

According to Klein and Manning (2003) and Collins (1999), there are two shortcomings in the n-ary Treebank grammar. Firstly, the grammar is too coarse for parsing: the rules in different contexts always have different distributions. Secondly, the rules learned from the training corpus cannot cover the rules in the testing set. Currently, the state-of-the-art parsing algorithms (Klein and Manning, 2003; Collins, 1999) decompose the n-ary Treebank grammar into sets of head-modifier relationships. The parsing rules in these algorithms are constructed in the form of finer-grained binary head-modifier dependency relationships. Fig. 2 presents an example of the head-modifier based dependency tree mapped from the constituent parse tree in Fig. 1.
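To illustrate the mapping described above, the sketch below walks a toy constituent tree with a toy head-rule table and emits (modifier, head) pairs. Both the tree and the rules are invented for illustration; the paper itself uses the head-finding rules of Bikel (2004) on Stanford parser output.

```python
# Constituent nodes are (label, [children]); leaves are (POS, word).
TOY_TREE = ("S",
            [("NP", [("NN", "investment")]),
             ("VP", [("VB", "keeps"),
                     ("NP", [("JJ", "rapid"), ("NN", "growth")])])])

# Toy head rules: for each parent label, the first matching child label is the head.
HEAD_RULES = {"S": ["VP", "NP"], "VP": ["VB", "NP"], "NP": ["NN", "JJ"]}

def head_word(node):
    """Return the lexical head (POS, word) of a constituent node."""
    label, rest = node
    if isinstance(rest, str):          # leaf: (POS, word)
        return node
    for cand in HEAD_RULES.get(label, []):
        for child in rest:
            if child[0] == cand:
                return head_word(child)
    return head_word(rest[0])          # fall back to the leftmost child

def extract_dependencies(node, deps=None):
    """Collect (modifier_word, head_word, parent_label) head-modifier relations."""
    label, rest = node
    if deps is None:
        deps = []
    if isinstance(rest, str):
        return deps
    head = head_word(node)
    for child in rest:
        child_head = head_word(child)
        if child_head != head:         # non-head children modify the head
            deps.append((child_head[1], head[1], label))
        extract_dependencies(child, deps)
    return deps

print(extract_dependencies(TOY_TREE))
# -> [('investment', 'keeps', 'S'), ('growth', 'keeps', 'VP'), ('rapid', 'growth', 'NP')]
```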


Fig. 2. Head-modifier Relationships with Aligned Translation

Moreover, there are several reasons why we adopt the head-modifier structured tree as the main frame of our reordering model. Firstly, the dependency relationships can reflect some underlying binary long distance dependency relations in the source side; thus, the binary dependency structure suffers less from the long distance reordering constraint. Secondly, in the head-modifier relation, we can utilize not only the context of the dependency relation in the reordering model, but also some well-known and proven helpful context (Johnson, 1998) of the constituent based parse tree. Finally, the head-modifier relationship is a mature and widely adopted device in full parsing.

3.2 Head-modifier Relation Based Reordering Model

Before elaborating the model, we define some notions for easier understanding. S = <f1, f2, …, fn> is the source sentence; T = <e1, e2, …, em> is the target sentence; AS = {aS(i) | 1 ≤ aS(i) ≤ n}, where aS(i) represents that the ith word in the source sentence is aligned to the aS(i)th word in the target sentence; AT = {aT(i) | 1 ≤ aT(i) ≤ n}, where aT(i) represents that the ith word in the target sentence is aligned to the aT(i)th word in the source sentence; D = {(d(i), r(i)) | 0 ≤ d(i) ≤ n} is the head-modifier relation set of the words in S, where d(i) represents that the ith word in the source sentence is the modifier of the d(i)th word in the source sentence under relationship r(i); O = <o1, o2, …, om> is the sequence of the reordering types of the words in the target language. The reordering model probability is P(O | S, T, D, A).

Relationship: in this paper, we not only use the constituent label as in Collins (1996), but also use some well-known context in parsing to define the head-modifier relationship r(.), including the POS of the modifier m, the POS of the head h, the dependency direction d, the parent label of the dependency l, the grandparent label of the dependency p, and the POS of the adjacent sibling of the modifier s. Thus, the head-modifier relationship can be represented as a 6-tuple <m, h, d, l, p, s>.

Table 1. Relations Extracted from Fig. 2. (The individual relation entries r(1)-r(7) are not recoverable here.)

In Table 1, there are 7 relationships extracted from the source head-modifier based dependency tree shown in Fig. 2. Please notice that, in this paper, each source word has a corresponding relation.

Reordering type: there are 4 reordering types for target words with a linked word in the source side in our model: R = {rm1, rm2, rm3, rm4}. The reordering type of the target word aS(i) is defined as follows:

- rm1: the position of the ith word's head is less than i (d(i) < i) in the source language, while the position of the word aligned to i is less than aS(d(i)) (aS(i) < aS(d(i))) in the target language;
- rm2: the position of the ith word's head is less than i (d(i) < i) in the source language, while the position of the word aligned to i is larger than aS(d(i)) (aS(i) > aS(d(i))) in the target language;
- rm3: the position of the ith word's head is larger than i (d(i) > i) in the source language, while the position of the word aligned to i is larger than aS(d(i)) (aS(i) > aS(d(i))) in the target language;
- rm4: the position of the ith word's head is larger than i (d(i) > i) in the source language, while the position of the word aligned to i is less than aS(d(i)) (aS(i) < aS(d(i))) in the target language.

Fig. 3. An example of the reordering types in Fig. 2

Fig. 3 shows examples of all the reordering types. In Fig. 3, the reordering type is labeled at the target word aligned to the modifier: for example, the reordering type rm1 belongs to the target word "scale". Please note that, in general, these four types of reordering can be divided into 2 categories: the target word order of rm2 and rm4 is identical with the source word order, while rm1 and rm3 represent the swapped order of the source. In practice, there are some special cases that cannot be classified into any of the defined reordering types: the head and the modifier in the source link to the same word in the target. In such cases, rather than define new reordering types, we classify these special cases into the four defined types: if the head is to the right of the modifier in the source, we classify the reordering type as rm2; otherwise, we classify it as rm4.

Probability estimation: we adopt maximum likelihood (ML) based estimation in this paper. In ML estimation, in order to avoid the data sparseness problem brought by lexicalization, we discard the lexical information in the source and target language:

P(O | S, T, D, A) ≈ ∏_{i=1}^{m} P(o_i | -, -, r(aT(i)))   (2)

where o_i ∈ {rm1, rm2, rm3, rm4} is the reordering type of the ith word in the target language. To get a non-zero probability, additive smoothing (Chen and Goodman, 1998) is used:

P(o_i | -, -, r(aT(i))) = ( F(o_i, r(aT(i))) + α ) / ( Σ_{o∈R} F(o, r(aT(i))) + α·|O| )
                       = ( F(o_i, m_aT(i), h_aT(i), d_aT(i), l_aT(i), p_aT(i), s_aT(i)) + α ) / ( Σ_{o∈R} F(o, m_aT(i), h_aT(i), d_aT(i), l_aT(i), p_aT(i), s_aT(i)) + α·|O| )   (3)

where F(.) is the frequency of the statistical event in the training corpus. For a given set of dependency relationships mapped from the constituent tree, the reordering type of the ith word is confined to two types: it is either one of rm1 and rm2, or one of rm3 and rm4. Therefore, |O| = 2 instead of |O| = 4 in (2). The parameter α is an additive factor to prevent zero probability. It is computed as:

α = 1 / ( C · Σ_{o∈R} F(o, m_aT(i), h_aT(i), d_aT(i), l_aT(i), p_aT(i), s_aT(i)) )   (4)

where C is a constant parameter (C = 5 in this paper). The additive parameter α is thus an adaptive parameter decreasing with the size of the statistical space; by doing this, the data sparseness problem can be relieved.
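To make the reordering types and the smoothed estimate in Equations (2)-(4) concrete, the following Python sketch classifies a target word into rm1-rm4 from the source head position and the alignment, and computes the additively smoothed probability from raw counts. The function names, the toy counts and the dict-based count store are illustrative assumptions, not part of the paper's implementation.

```python
def reordering_type(i, d_i, a_s):
    """Classify the target word aligned to source word i into rm1..rm4.

    i    : position of the modifier in the source sentence (1-based)
    d_i  : position of its head in the source sentence, d(i)
    a_s  : dict mapping source positions to aligned target positions, a_S(.)
    """
    if d_i < i:                                   # head precedes modifier in source
        return "rm1" if a_s[i] < a_s[d_i] else "rm2"
    else:                                         # head follows modifier in source
        return "rm3" if a_s[i] > a_s[d_i] else "rm4"

def smoothed_prob(o, relation, counts, C=5):
    """Additively smoothed P(o | relation) following Equations (3) and (4).

    counts[(o, relation)] holds F(o, relation); |O| = 2 because a fixed
    head direction only admits two of the four reordering types.
    """
    total = sum(counts.get((other, relation), 0)
                for other in ("rm1", "rm2", "rm3", "rm4"))
    alpha = 1.0 / (C * total) if total > 0 else 1.0   # Eq. (4), with a guard for unseen relations
    return (counts.get((o, relation), 0) + alpha) / (total + alpha * 2)

# Toy example: relation encoded as the 6-tuple <m, h, d, l, p, s>.
rel = ("NN", "VV", "left", "NP", "IP", "JJ")
counts = {("rm2", rel): 30, ("rm1", rel): 10}          # made-up frequencies
print(reordering_type(i=3, d_i=1, a_s={1: 2, 3: 4}))   # head left, monotone order -> rm2
print(smoothed_prob("rm2", rel, counts))
```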

4 Applying the Model in the Decoder

Our decoding algorithm is exactly the same as (Koehn, 2004). In the translation procedure, the decoder keeps extending new phrases without overlapping, until all source words are translated. In this procedure, the order of the target words is fixed; that is, once a hypothesis is generated, the order of its target words cannot be changed in the future. Taking advantage of this feature, instead of computing a totally new reordering score for a newly generated hypothesis, we merely calculate the reordering score of the newly extended part of the hypothesis. Thus, in decoding, to compute the reordering score, the reordering type of each target word in the newly extended phrase needs to be identified.

The method to identify the reordering types in decoding is given in Fig. 4. According to the definition of reordering, the reordering type of a target word is identified by the direction of the head-modifier dependency on the source side, the alignment between the source side and the target side, and the relative translated order of the word pair under the head-modifier relationship. The direction of the dependency and the alignment can be obtained from the input sentence and the phrase table, while the relative translation order needs to be recorded during decoding. A word index is employed to record this order. The index is constructed as a true/false array: the index of a source word is set to true when the word has been translated. With the help of this index, the reordering type of every word in the phrase can be identified.

1: Input: alignment array AT; Start, the start position of the phrase in the source side; head-modifier relation d(.); source word index C, where C[i]=true indicates that the ith word in the source has been translated
2: Output: reordering type array O, which holds the reordering type of each word in the target phrase
3: for i = 1, |AT| do
4:   P ← aT(i) + Start
5:   if (d(P) … [the remainder of lines 5-15, which assign O[i] one of rm1-rm4 based on the direction of d(P) and the coverage index C, is not recoverable here]
16:   end if
17:  end if
18:  C[P] ← true  // update word index
19: end for
Fig. 4. Identifying the Reordering Types of a Newly Extended Phrase

After all the reordering types in the newly extended phrase are identified, the reordering score of the phrase can be computed using equation (3).

5 Preprocessing the Alignment

In Fig. 4, the word index is used to identify the reordering type of the translated target words. Actually, in order to use the word index without ambiguity, the alignment in the proposed algorithm needs to satisfy some constraints.

Firstly, every word in the source must have an aligned word in the target side, because, in the decoding procedure, if the head word is not covered by the word index, the algorithm cannot distinguish between the case where the head word will never be translated and the case where the head word has simply not been translated yet. Furthermore, in decoding, as shown in Fig. 4, the index of a source word is set to true only when there is a word in the target linked to it; thus, the index of a source word without an aligned word in the target would never be set to true.

Fig. 5. A Complicated Example of Alignment in the Head-modifier based Reordering Model

Secondly, if the head word has more than one aligned word in the target, different alignments may result in different reordering types.
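Since the middle of Fig. 4 (lines 5-15) is lost here, the following Python sketch reconstructs the identification step from the reordering type definitions in Section 3.2: during left-to-right decoding, if the head's source position is already covered (C[head]=true), its translation precedes the current target word, which determines which of the two admissible types applies. This is our reading of the procedure, not the paper's original pseudocode, and all names are illustrative.

```python
def identify_reordering_types(a_t, start, d, covered):
    """Assign a reordering type to each target word of a newly extended phrase.

    a_t     : list, a_t[i] = source offset (within the phrase) aligned to target word i
    start   : start position of the phrase in the source sentence
    d       : dict, d[p] = source position of the head of source word p
    covered : boolean list over source positions; True = already translated
    Returns the list O of reordering types and updates `covered` in place.
    """
    types = []
    for i in range(len(a_t)):
        p = a_t[i] + start                    # source position of the current word
        head = d[p]
        if head < p:                          # head precedes modifier in the source
            # If the head is already translated, the target order is monotone (rm2),
            # otherwise the pair is swapped in the target (rm1).
            types.append("rm2" if covered[head] else "rm1")
        else:                                 # head follows modifier in the source
            types.append("rm3" if covered[head] else "rm4")
        covered[p] = True                     # update the word index
    return types

# Toy usage: a two-word phrase starting at source position 2,
# with heads d = {3: 1, 4: 3} and source word 1 already translated.
covered = [None, True, False, False, False]   # 1-based positions 1..4
print(identify_reordering_types(a_t=[1, 2], start=2, d={3: 1, 4: 3}, covered=covered))
```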

In order to explain the alignment preprocessing, the following notions are defined: if there is a link between the source word fj and the target word ei, let l(ei, fj) = 1, otherwise l(ei, fj) = 0; the source word fj ∈ F1-to-N iff Σi l(ei, fj) > 1, such as the source word f2 in Fig. 5; the source word fj ∈ FNULL iff Σi l(ei, fj) = 0, such as the source word f4 in Fig. 5; the target word ei ∈ E1-to-N iff Σj l(ei, fj) > 1, such as the target word e1 in Fig. 5.

In preprocessing, there are 3 types of operation: DiscardLink(fj), BorrowLink(fj) and FindAnchor(ei).

DiscardLink(fj): for a word fj in the source with more than one aligned word in the target, i.e. fj ∈ F1-to-N, we keep l(en, fj) = 1 only for the target word en = argmax_i p(ei | fj), where p(ei | fj) is estimated as in (Koehn et al., 2003), and set l(ei, fj) = 0 for the rest of the words linked to fj.

BorrowLink(fj): for a word fj in the source without an aligned word in the target, i.e. fj ∈ FNULL, we let l(ei, fj) = 1, where ei is the target word aligned to the source word nearest to fj; when two equally near source words both have aligned words in the target side, we select the alignment of the left word first.

FindAnchor(ei): for a word ei in the target with more than one aligned word in the source, i.e. ei ∈ E1-to-N, we select the word fm = argmax_j p(ei | fj) aligned to ei as its anchor word to decide the reordering type of ei, where p(ei | fj) is estimated as in (Koehn et al., 2003). For the rest of the words aligned to ei, we set their word indexes to true in the update step of decoding (line 18 of Fig. 4).

With these operations, the required alignment can be obtained by preprocessing the original alignment, as shown in Fig. 6.

1: Input: set of alignments A between target language e and source language f
2: Output: the 1-to-1 alignment required by the model
3: foreach fi ∈ F1-to-N do
4:   DiscardLink(fi)
5: end for
6: foreach fi ∈ FNULL do
7:   BorrowLink(fi)
8: end for
9: foreach ei ∈ E1-to-N do
10:   FindAnchor(ei)
11: end for
Fig. 6. Alignment Preprocessing Algorithm

Fig. 7. An Example of Alignment Preprocessing

An example of preprocessing the alignment in Fig. 5 is shown in Fig. 7: firstly, the DiscardLink(f2) operation discards the link between f2 and e1 in (a); then the link between f4 and e3 is established by the operation BorrowLink(f4) in (b); at last, FindAnchor(e3) selects f2 as the anchor word of e3 in the source in (c). After the preprocessing, the reordering type of e3 can be identified. Furthermore, in decoding, when the decoder scans over e2, it sets the word indexes of f3 and f4 to true. In this way, never-true word indexes in decoding are avoided.
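The three operations can be sketched in a few lines of Python. The representation of the alignment as a set of (target, source) pairs and the probability tables are illustrative assumptions; the lexical probabilities p(e|f) and p(f|e) would in practice come from tables estimated as in (Koehn et al., 2003).

```python
def preprocess_alignment(links, n_src, p_e_given_f, p_f_given_e):
    """Return a cleaned link set plus an anchor map, following Fig. 6.

    links        : set of (i, j) pairs meaning target word i is linked to source word j
    n_src        : number of source words (positions 1..n_src)
    p_e_given_f  : dict {(i, j): p(e_i | f_j)} used by DiscardLink
    p_f_given_e  : dict {(i, j): p(f_j | e_i)} used by FindAnchor
    """
    links = set(links)

    # DiscardLink: a source word keeps only its most probable target link.
    for j in range(1, n_src + 1):
        tgt = [i for (i, jj) in links if jj == j]
        if len(tgt) > 1:
            best = max(tgt, key=lambda i: p_e_given_f.get((i, j), 0.0))
            links -= {(i, j) for i in tgt if i != best}

    # BorrowLink: an unaligned source word borrows the link of its nearest
    # aligned source neighbour (ties broken to the left).
    aligned_src = {j for (_, j) in links}
    for j in range(1, n_src + 1):
        if j not in aligned_src:
            neighbours = sorted(aligned_src, key=lambda k: (abs(k - j), k))
            if neighbours:
                e_i = next(i for (i, jj) in links if jj == neighbours[0])
                links.add((e_i, j))

    # FindAnchor: a target word with several source links picks one anchor word.
    anchors = {}
    for i in {i for (i, _) in links}:
        src = [j for (ii, j) in links if ii == i]
        anchors[i] = max(src, key=lambda j: p_f_given_e.get((i, j), 0.0))
    return links, anchors
```

In this sketch the non-anchor links are kept only so that the decoder can set their source word indexes to true during the update step, as in line 18 of Fig. 4.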

6 Training the Reordering Model

Before training, we obtain the required alignment by the alignment preprocessing described above. Then we train the reordering model with this alignment: from the first word to the last word in the target side, the reordering type of each word is identified. In this procedure, we skip the words without alignment in the source. Finally, all the statistical events required in equation (3) are added to the model.

In our model, there are 20,338 kinds of relations with reordering probabilities, which is much smaller than most phrase level reordering models trained on the FBIS corpus. Table 1 shows the distribution of the different reordering types in the training model.

Type of Reordering | Percentage %
rm1 | 3.69
rm2 | 27.61
rm3 | 20.94
rm4 | 47.75
Table 1: Percentage of different reordering types in the model

From Table 1, we can conclude that the reordering types rm2 and rm4 are preferred in reordering: they account for nearly 3/4 of the total number of reordering types and are identical with the word order of the source. The statistics indicate that most of the word order does not change in our head-modifier reordering view. This may explain why models with limited reordering capacity (Wu, 1997; Xiong, 2006; Koehn et al., 2003) can reach a certain level of performance.

7 Experiments and Discussion

7.1 Experiment Settings

We perform the Chinese-to-English translation task on the NIST MT-05 test set, and use NIST MT-02 as our tuning set. The FBIS corpus is selected as our training corpus, which contains 7.06M Chinese words and 9.15M English words. We use GIZA++ (Och and Ney, 2000) to align the corpus. A 4-gram language model is trained on the Xinhua portion of the English Gigaword corpus (181M words). All models are tuned on BLEU, and evaluated on both BLEU and NIST scores.

To map from the constituent trees to sets of head-modifier relationships, we first use the Stanford parser (Klein, 2003) to parse the source side of the FBIS corpus, and then use the head-finding rules in (Bikel, 2004) to get the head-modifier dependency sets.

In our system, there are 7 groups of features:
1. Language model score (1 feature)
2. Word penalty score (1 feature)
3. Phrase model scores (5 features)
4. Distortion score (1 feature)
5. Lexical RM scores (6 features)
6. Number of each reordering type (4 features)
7. Scores of each reordering type (4 features, computed by equation (3))

Among these feature groups, the first 5 groups belong to the baseline model, while the remaining two groups of scores are related to our model.

In decoding, we drop all the OOV words and use the default settings in Moses: distortion limit 6, beam-width 1/100000, stack size 200, and at most 50 phrases for each span.

7.2 Results and Discussion

We take the replicated Moses system as our baseline. Table 2 shows the results of our model. In the table, the Baseline model includes feature groups 1, 2, 3 and 4; the Baseline_rm model is the Baseline model with feature group 5; the H-M model is the Baseline model with feature groups 6 and 7; the H-M_rm model is the Baseline_rm model with feature groups 6 and 7.

Model | BLEU% | NIST
Baseline | 27.06 | 7.7898
Baseline_rm | 27.58 | 7.8477
H-M | 28.47 | 8.1491
H-M_rm | 29.06 | 8.0875
Table 2: Performance of the systems on NIST-05 (BLEU-4, case-insensitive)

From Table 2, we can conclude that our reordering model is very effective. After adding feature groups 6 and 7, the performance is improved by 1.41% and 1.48% BLEU score respectively. Our reordering model is also more effective than the lexical reordering model in Moses: 1.41% BLEU score is gained by adding our reordering model to the Baseline model, while 0.48 is gained by adding the lexical reordering model to the Baseline model.

threshold | KOR | BLEU | NIST
≥1 | 20,338 | 29.06 | 8.0875
≥2 | 13,447 | 28.83 | 8.3658
≥3 | 10,885 | 28.64 | 8.0350
≥4 | 9,518 | 28.94 | 8.1002
≥5 | 8,577 | 29.18 | 8.1213
Table 3: Performance on NIST-05 with different relation frequency thresholds (BLEU-4, case-insensitive)

Although our model is non-lexical, the data sparseness problem still affects its performance: in the reordering model, nearly half of the relations occur less than three times. To investigate this, we computed the frequency of the relationships in our model and evaluated our full H-M model with different frequency thresholds. In Table 3, a relation is added into the reordering model only when its frequency is not less than the threshold; KOR is the number of relation types in the reordering model.

Table 3 shows that, in our model, many relations occur only once. However, these low-frequency relations can improve the performance of the model according to the experimental results. Although low frequency statistical events always do harm to the parameter estimation in ML, the model can estimate more events in the test corpus with the help of low frequency events. These two factors affect the experimental results in opposite directions: we consider this to be the reason why the results do not simply increase or decrease with an increasing frequency threshold. According to the results, the model without a frequency threshold achieves the highest BLEU score. The performance then drops quickly when the frequency threshold is set to 2, because many events cannot be estimated by the smaller model. Although, in the model without a frequency threshold, some probabilities are overestimated by events which occur only once, the size of the model affects the performance to a larger extent. When the frequency threshold increases above 3, the size of the model reduces slowly, which makes the overestimation problem the more important factor affecting performance. From these results, we can see the potential of our model: if it suffered less from the data sparseness problem, the performance should be further improved, which is to be verified in the future.

8 Related Work and Motivation

There are several studies on adding linguistic analysis to MT in a "soft constraint" way. Most of them are based on constituents in the parse tree. Chiang (2005) and Marton and Resnik (2008) explored constituent match/violation in hiero; Xiong (2009a) added constituent parse tree based linguistic analysis into the BTG model; Xiong (2009b) added source dependency structure to BTG; Zhang (2009) added a tree-kernel to the BTG model. All these studies show promising results. Making soft constraints is an easy and efficient way of adding linguistic analysis to formal-sense SMT models.

In modeling reordering, most previous studies work on the phrase level. In Moses, lexical reordering is modeled on adjacent phrases. In (Wu, 1996; Xiong, 2006), the reordering is also modeled on adjacent translated phrases. In hiero, the reordering is modeled on the segments of the unmotivated translation rules. The tree-to-string models (Yamada et al., 2001; Liu et al., 2006) model reordering on phrases with syntax representations. All these studies show excellent performance, while there have been few studies on word level models in recent years. This is because, we consider, the alignment in word level models is complex, which limits the reordering capacity of word level models.

However, our work opens a new direction in reordering: by utilizing the decomposed dependency relations mapped from the parse tree as a soft constraint, we propose a novel head-modifier relation based word level reordering model. The word level reordering model is built on a phrase-based SMT framework. Thus, the task of finding the proper position of a translated word is converted into scoring the reordering of the translated words, which relaxes the tension between complex alignment and word level reordering in MT.

9 Conclusion and Future Work

Experimental results show that our head-modifier relationship based model is effective with respect to the baseline (an improvement of 1.48% BLEU score), even with a limited model size and simple parameter estimation. In the future, we will try more sophisticated smoothing methods or a maximum entropy based reordering model. We will also study the performance with a larger distortion constraint, such as a distortion limit over 15, or even the performance without the distortion model.

10 Acknowledgement

The work of this paper is funded by the National Natural Science Foundation of China (grant no. 60736014), the National High Technology Research and Development Program of China (863 Program) (grant no. 2006AA010108), and Microsoft Research Asia IFP (grant no. FY09-RES-THEME-158).

References

Daniel M. Bikel. 2004. On the Parameter Space of Generative Lexicalized Statistical Parsing Models. Ph.D. thesis, Univ. of Pennsylvania.

Rens Bod. 1992. Data Oriented Parsing (DOP). In Proceedings of COLING-92.

S. F. Chen and J. Goodman. 1996. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 310-318.

David Chiang. 2005. A Hierarchical Phrase-based Model for SMT. ACL-05, 263-270.

David Chiang. 2007. Hierarchical Phrase-based Translation. Computational Linguistics, 33(2):201-228.

M. Collins. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of ACL-96, 184-191.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, Univ. of Pennsylvania.

Brooke Cowan, Ivona Kucerova, and Michael Collins. 2006. A Discriminative Model for Tree-to-Tree Translation. In Proceedings of EMNLP.

S. DeNeefe, K. Knight, W. Wang, and D. Marcu. 2007. What Can Syntax-based MT Learn from Phrase-based MT? In Proceedings of EMNLP-CoNLL.

J. Graehl and K. Knight. 2004. Training Tree Transducers. In Proceedings of HLT-NAACL 2004.

Liang Huang, Kevin Knight, and Aravind Joshi. 2006. Statistical Syntax-Directed Translation with Extended Domain of Locality. In Proceedings of the 7th AMTA.

Mark Johnson. 1998. PCFG Models of Linguistic Tree Representations. Computational Linguistics, 24:613-632.

Mark Johnson. 2002. The DOP Estimation Method is Biased and Inconsistent. Computational Linguistics, 28:71-76.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of ACL-03, 423-430.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-based Translation. In Proceedings of HLT-NAACL.

Philipp Koehn. 2004. A Beam Search Decoder for Phrase-Based Translation Models. In Proceedings of AMTA-2004, Washington.

Philipp Koehn, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007.

Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-String Alignment Template for Statistical Machine Translation. In Proceedings of ACL 2006.

Yuval Marton and Philip Resnik. 2008. Soft Syntactic Constraints for Hierarchical Phrase-based Translation. ACL-08, 1003-1011.

Haitao Mi and Liang Huang. 2008. Forest-based Translation Rule Extraction. EMNLP-08, 206-214.

Franz Josef Och and Hermann Ney. 2000. Improved Statistical Alignment Models. In Proceedings of ACL 38.

Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. ACL-08, 577-585.

Dekai Wu. 1996. A Polynomial-Time Algorithm for Statistical Machine Translation. In Proceedings of ACL-1996.

Dekai Wu. 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3):377-403.

Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation. In Proceedings of COLING-ACL 2006.

Deyi Xiong, Min Zhang, Aiti Aw, and Haizhou Li. 2009a. A Syntax-Driven Bracketing Model for Phrase-Based Translation. ACL-09, 315-323.

Deyi Xiong, Min Zhang, Aiti Aw, and Haizhou Li. 2009b. A Source Dependency Model for Statistical Machine Translation. MT-Summit 2009.

Kenji Yamada and K. Knight. 2001. A Syntax-based Statistical Translation Model. ACL-01, 523-530.

Hui Zhang, Min Zhang, Haizhou Li, Aiti Aw, and Chew Lim Tan. 2009. Forest-based Tree Sequence to String Translation Model. ACL-09, 172-180.

Min Zhang, Hongfei Jiang, Ai Ti Aw, Haizhou Li, Chew Lim Tan, and Sheng Li. 2008. A Tree Sequence Alignment-based Tree-to-Tree Translation Model. ACL-HLT-08, 559-567.

Andreas Zollmann. 2005. A Consistent and Efficient Estimator for the Data-Oriented Parsing Model. Journal of Automata, Languages and Combinatorics, 10:367-388.
