
Synchronous Dependency Insertion Grammars: A Grammar Formalism for Syntax Based Statistical MT

Yuan Ding and Martha Palmer
Department of Computer and Information Science, University of Pennsylvania
Philadelphia, PA 19104, USA
{yding, mpalmer}@linc.cis.upenn.edu

Abstract

This paper introduces a grammar formalism specifically designed for syntax-based statistical machine translation. The synchronous grammar formalism we propose in this paper takes into consideration the pervasive structure divergence between languages, which many other synchronous grammars are unable to model. A Dependency Insertion Grammar (DIG) is a generative grammar formalism that captures word order phenomena within the dependency representation. Synchronous Dependency Insertion Grammars (SDIG) is the synchronous version of DIG, which aims at capturing structural divergences across the languages. While both DIG and SDIG have comparatively simpler mathematical forms, we prove that DIG nevertheless has a generation capacity weakly equivalent to that of CFG. By making a comparison to TAG and Synchronous TAG, we show how such formalisms are linguistically motivated. We then introduce a probabilistic extension of SDIG. We finally evaluate our current implementation of a simplified version of SDIG for syntax based statistical machine translation.

1 Introduction

Dependency grammars have a long history and have played an important role in machine translation (MT). The early uses of dependency structures in machine translation tasks mainly fall into the category of transfer based MT, where the dependency structure of the source language is first analyzed, then transferred to the target language by using a set of transduction rules or a transfer lexicon, and finally the linear form of the target language sentence is generated.

While the above approach seems plausible, the transfer process demands intense human effort in creating a working transduction rule set or a transfer lexicon, which largely limits the performance and application domain of the resultant machine translation system.

In the early 1990s, (Brown et al., 1993) introduced the idea of statistical machine translation, where the word-to-word translation probabilities and sentence reordering probabilities are estimated from a large set of parallel sentence pairs. Because it has the advantage of leveraging large parallel corpora, the statistical MT approach outperforms the traditional transfer based approaches in tasks for which adequate parallel corpora are available (Och, 2003). However, a major criticism of this approach is that it is void of any internal representation for syntax or semantics.

In recent years, hybrid approaches, which aim at applying statistical learning to structured data, began to emerge. Syntax based statistical MT approaches began with (Wu, 1997), who introduced a polynomial-time solution for the alignment problem based on synchronous binary trees. (Alshawi et al., 2000) extended the tree-based approach by representing each production in parallel dependency trees as a finite-state transducer. (Yamada and Knight, 2001, 2002) model translation as a sequence of operations transforming a syntactic tree in one language into the string of the second language.

The syntax based statistical approaches have been faced with the major problem of pervasive structural divergence between languages, due to both systematic differences between languages (Dorr, 1994) and the vagaries of loose translations in real corpora. While we would like to use syntactic information in both languages, the problem of non-isomorphism grows when trees in both languages are required to match.

(Hajic et al., 2002) allows for limited non-isomorphism in that n-to-m matching of nodes in the two trees is permitted. However, even after extending this model by allowing cloning operations on subtrees, (Gildea, 2003) found that parallel trees over-constrained the alignment problem, and achieved better results with a tree-to-string model using one input tree than with a tree-to-tree model using two.

To allow the syntax based machine translation approaches to work as a generative process, certain isomorphism assumptions have to be made. Hence a reasonable question to ask is: to what extent should the grammar formalism, which we choose to represent the syntactic structures, assume isomorphism between the structures of the two languages?

At the same time, grammar theoreticians have proposed various generative synchronous grammar formalisms for MT, such as Synchronous Context Free Grammars (S-CFG) (Wu, 1997) or Synchronous Tree Adjoining Grammars (S-TAG) (Shieber and Schabes, 1990). Mathematically, generative synchronous grammars share many good properties with their monolingual counterparts, such as CFG or TAG (Joshi and Schabes, 1992). If such a synchronous grammar could be learnt from parallel corpora, the MT task would become a mathematically clean generative process.

However, the problem of inducing a synchronous grammar from empirical data has never been solved. Consider, for example, Synchronous TAGs (Shieber and Schabes, 1990), which were introduced primarily for semantics but were later also proposed for translation. From a formal perspective, Syn-TAGs characterize the correspondences between languages by a set of synchronous elementary tree pairs. While examples show that this formalism does capture certain cross language structural divergences, there is not, to our knowledge, any successful statistical learning method to learn such a grammar from empirical data. We believe that this is due to the limited ability of Synchronous TAG to model structure divergences. This observation will be discussed later in Section 5.

We studied the problem of learning synchronous syntactic sub-structures (parallel dependency treelets) from unaligned parallel corpora in (Ding and Palmer, 2004). At the same time, we would like to formalize a synchronous grammar for syntax based statistical MT. The necessity of a well-defined formalism and certain limitations of the current existing formalisms motivate us to design a new synchronous grammar formalism with the following properties:
1. Linguistically motivated: it should be able to capture most language phenomena, e.g. complicated word orders such as "wh" movement.
2. Without the unrealistic word-to-word isomorphism assumption: it should be able to capture structural variations between the languages.
3. Mathematically rigorous: it should have a well defined formalism and a proven generation capacity, preferably context free or mildly context sensitive.
4. Generative: it should be "generative" in a mathematical sense. This property is essential for the grammar to be used in statistical MT. Each production rule should have its own probability, which will allow us to decompose the overall translation probability.
5. Simple: it should have a minimal number of different structures and operations so that it will be learnable from the empirical data.

In the following sections of this paper, we introduce a grammar formalism that satisfies the above properties: Synchronous Dependency Insertion Grammar (SDIG). Section 2 gives an informal look at the desired capabilities of a monolingual version, Dependency Insertion Grammar (DIG), by addressing the problems with previous dependency grammars. Section 3 gives the formal definition of the DIG and shows that it is weakly equivalent to Context Free Grammar (CFG). Section 4 shows how DIG is linguistically motivated by making a comparison between DIG and Tree Adjoining Grammar (TAG). Section 5 specifies the Synchronous DIG and Section 6 gives the probabilistic extension of SDIG.

2 Issues with Dependency Grammars

2.1 Dependency Grammars and Statistical MT

According to (Fox, 2002), dependency representations have the best phrasal cohesion properties across languages: the percentage of head crossings per chance is 12.62% and that of modifier crossings per chance is 9.22%. Observing this fact, it is reasonable to propose a formalism that handles language transfer based on dependency structures.

What is more, if a formalism based on dependency structures is made possible, it will have the nice property of being simple, as expressed in the following table:

                    CFG    TAG    DG
    Node #          2n     2n     n
    Lexicalized?    NO     YES    YES
    Node types      2      2      1*
    Operation types 1      2      1*
    (*: will be shown later in this paper)
    Figure 1.

The simplicity of a grammar is very important for statistical modeling, i.e. when it is being learned from the corpora and when it is being used in machine translation decoding: we don't need to condition the probabilities on two different node types or operations.

At the same time, dependency grammars are inherently lexicalized in that each node is one word. Statistical parsers (Collins, 1999) showed performance improvement by using bilexical probabilities, i.e. probabilities of word pair occurrences, which is what dependency grammars model explicitly.

2.2 A Generative Grammar?

Why do we want the grammar for statistical MT to be generative? First of all, generative models have long been studied in the machine learning community, which provides us with mathematically rigorous algorithms for training and decoding. Second, CFG, the most popular formalism for describing natural language phenomena, is generative. Certain ideas and algorithms can be borrowed from CFG if we make the formalism generative.

While there has been much previous work in formalizing dependency grammars and in their application to the parsing task, until recently (Joshi and Rambow, 2003), little attention had been given to the issue of making the proposed dependency grammars generative. And in machine translation tasks, although using dependency structures is an old idea, little effort has been made to propose a formal grammar which views the composition and decomposition of dependency trees as a generative process from a formal perspective.

There are two reasons for this fact: (1) "pure" dependency trees do not have nonterminals. The standard solution to this problem was introduced as early as (Gaifman, 1965), who proposed adding syntactic categories to each node on the dependency tree. (2) There is a deeper problem with dependency grammar formalisms, as observed by (Rambow and Joshi, 1997): in the dependency representation, it is hard to handle complex word order phenomena without resorting to global word order rules, which make the grammar no longer generative. This will be explored in the next subsection (2.3).

2.3 Non-projectivity

Non-projectivity has long been a major obstacle for anyone who wants to formalize dependency grammar. When we draw projection lines from the nodes in a dependency tree to a linear representation of the sentence, if we cannot do so without having one or more projection lines going across at least one of the arcs of the dependency tree, we say the dependency tree is non-projective. A typical example of non-projectivity is "wh" movement, which is illustrated below.

[Figure 2: a non-projective dependency tree illustrating "wh" movement]

Our solution for this problem is given in Section 4; in the next section we will first give the formal definition of the monolingual Dependency Insertion Grammar.

3 The DIG Formalism

3.1 Elementary Trees

Formally, a Dependency Insertion Grammar is defined as a six tuple (C, L, A, B, S, R). C is a set of syntactic categories and L is a set of lexical items. A is a set of Type-A trees and B is a set of Type-B trees (defined later). S is a set of the starting categories of the sentences. R is a set of word order rules local to each node of the trees.

Each node in the DIG has three fields. A node consists of:
1. One lexical item
2. One corresponding category
3. One local word order rule.

We define two types of elementary trees in DIG: Type-A trees and Type-B trees. Both types of trees have one or more nodes, and one of the nodes in an elementary tree is designated as the head of the elementary tree.

Type-A trees are also called "root lexicalized trees". They roughly correspond to the α trees in TAG. Type-A trees have the following properties:
1. The root is lexicalized.
2. The root is designated as the head of the tree.
3. Any lexicalized node can take a set of unlexicalized nodes as its arguments.
4. The local word order rule specifies the relative order between the current node and all its immediate children, including the unlexicalized arguments.

Here is an example of a Type-A elementary tree for the verb "like". Note that the head node is marked with (@), and that the placement of the dependency arcs reflects the relative order between the parent and all its immediate children.

[Figure 3: the Type-A elementary tree for "like"]

Type-B trees are also called "root unlexicalized trees". They roughly correspond to the β trees in TAG and have the following properties:
1. The root is the ONLY unlexicalized node.
2. One of the lexicalized nodes is designated as the head of the tree.
3. Similar to Type-A trees, each node also has a word order rule that specifies the relative order between the current node and all its immediate children.

Here is an example of a Type-B elementary tree for the adverb "really":

[Figure 4: the Type-B elementary tree for "really"]

3.2 The Unification Operation

We define only one type of operation for any DIG derivation: unification.

Unification Operation: when an unlexicalized node and a head node have the same category, they can be merged into one node.

This specifies that an unlexicalized node cannot be unified with a non-head node, which guarantees limited complexity when a unification operation takes place. After unification:
1. If the resulting tree is a Type-A tree, its root becomes the new root;
2. If the resulting tree is a Type-B tree, the root node involved in the unification operation becomes the new root.

Here is one example of the unification operation, which adjoins the adverb "really" to the verb "like":

[Figure 5: unifying the "really" Type-B tree with the "like" Type-A tree]

Note that for the above unification operation the dependency tree on the right hand side is just one of the possible resultant dependency trees. The strings generated by the full set of possible resultant dependency trees should all be viewed as the language L(DIG) generated by the DIG grammar.

Also note that the definition of DIG is preserved through the unification operation, as we have:
1. (Type-A) (unify) (Type-A) = (Type-A)
2. (Type-A) (unify) (Type-B) = (Type-A)
3. (Type-B) (unify) (Type-B) = (Type-B)
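The following sketch makes Sections 3.1 and 3.2 concrete. It is our illustration, not code from the paper: the Node class, the encoding of the local word order rule as a list with an "@" marker for the node's own position, and the in-place merge are all assumptions made for the example. The paper leaves the set of possible merged word orders open; this sketch simply picks one.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        cat: str                    # syntactic category, e.g. "V", "N", "adv"
        lex: Optional[str] = None   # lexical item; None = unlexicalized
        head: bool = False          # designated head of its elementary tree
        children: List["Node"] = field(default_factory=list)
        # Local word order rule: "@" is this node; other entries name
        # children by category, numbered when a category repeats.
        order: List[str] = field(default_factory=list)

    def unify(unlex: Node, head: Node) -> Node:
        """Merge an unlexicalized node with a head node of the same category.
        The two nodes become one; we mutate the unlexicalized node and
        reuse it as the merged node."""
        assert unlex.lex is None and head.head and unlex.cat == head.cat
        h_at, u_at = head.order.index("@"), unlex.order.index("@")
        unlex.lex, unlex.head = head.lex, True
        unlex.children = head.children + unlex.children
        # Splice the two local order rules around the head's "@" marker;
        # this yields just ONE of the orders the grammar permits.
        unlex.order = (head.order[:h_at] + unlex.order[:u_at] + ["@"]
                       + unlex.order[u_at + 1:] + head.order[h_at + 1:])
        return unlex

    def linearize(node: Node) -> str:
        """Read a sentence off a dependency tree using the local order rules."""
        parts, used = [], {}
        for item in node.order:
            if item == "@":
                parts.append(node.lex if node.lex else "[%s]" % node.cat)
            else:
                cat = item.rstrip("0123456789")
                i = used.get(cat, 0)
                used[cat] = i + 1
                child = [c for c in node.children if c.cat == cat][i]
                parts.append(linearize(child))
        return " ".join(parts)

    # Type-A tree for "like": lexicalized root (the head) with two
    # unlexicalized [N] argument slots (subject and object).
    subj = Node(cat="N", order=["@"])
    obj = Node(cat="N", order=["@"])
    like = Node(cat="V", lex="like", head=True,
                children=[subj, obj], order=["N0", "@", "N1"])

    # Type-B tree for "really": the root is the only unlexicalized node;
    # the lexicalized "really" node is the designated head.
    really = Node(cat="adv", lex="really", head=True, order=["@"])
    b_root = Node(cat="V", children=[really], order=["adv", "@"])

    # Type-A (unify) Type-B = Type-A: merge the B-tree's unlexicalized
    # root with the head of the "like" tree (Figure 5).
    merged = unify(b_root, like)
    print(linearize(merged))   # -> [N] really like [N]

    # Filling an argument slot is the same operation in the other direction:
    # an unlexicalized [N] slot unifies with the head of another Type-A tree.
    unify(subj, Node(cat="N", lex="John", head=True, order=["@"]))
    print(linearize(merged))   # -> John really like [N]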

3.3 Comparison to Other Approaches

There are two major differences between our dependency grammar formalism and that of (Joshi and Rambow, 2003):
1. We only define one unification operation, whereas (Joshi and Rambow, 2003) defined two operations: substitution and adjunction.
2. We introduce the concept of "heads" in the DIG so that the derivation complexity is significantly smaller.

3.4 Proof of Weak Equivalence between DIG and CFG

We prove the weak equivalence between DIG and CFG by first showing that the language that a DIG generates is a subset of one that a CFG generates, i.e. L(DIG) ⊆ L(CFG), and then showing that the opposite also holds: L(CFG) ⊆ L(DIG).

3.4.1 L(DIG) ⊆ L(CFG)

The proof is given constructively. First, for each Type-A tree, we "insert" a "waiting for Type-B tree" argument at each possible slot underneath it with the category B. This process is shown below:

[Figure 6: inserting "waiting for Type-B tree" sites into a Type-A tree]

Then we "flatten" the Type-A tree to its linear form according to the local word order rules, which decide the relative ordering between the parent and all its children at each of the nodes. We get:

    NT{A.C_H} → NT{B.C_H} w_0 NT{C_0} w_1 NT{B.C_H} ⋯ w_i NT{C_j} ⋯ w_n NT{B.C_H}

where:
- w_0 ⋯ w_n are the strings of lexical items;
- NT{A.C_H} is the nonterminal created for this Type-A tree, and C_H is the category of the head (root);
- NT{C_j} is the nonterminal for each argument category;
- NT{B.C_H} is the nonterminal for each "Type-B site".

Similarly, for each Type-B tree with root category C_R we can create "Type-B sites" under its head node, so we have:

    NT{RB.C_R} → w_0 NT{B.C_H} ⋯ w_i ⋯ NT{B.C_H} w_n

Then we create the productions to take arguments:

    NT{C} → NT{A.C}

and the production rules to take Type-B trees:

    NT{B.C} → NT{RB.C} NT{B.C}
    NT{B.C} → NT{B.C} NT{RB.C}

Hence, a DIG can be converted to a CFG.
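To make the constructive step concrete, here is the conversion for the "like" Type-A tree of Figure 3 — our illustration, not the paper's. We assume head category V, two [N] argument slots in subject-verb-object order, a Type-B site at every slot, and an empty production so that unused Type-B sites can disappear (the construction needs some such convention to terminate); "really"'s own Type-B sites are omitted for brevity:

    NT{A.V}  → NT{B.V} NT{N} NT{B.V} like NT{B.V} NT{N} NT{B.V}
    NT{V}    → NT{A.V}
    NT{RB.V} → really                    (from the Type-B tree of Figure 4)
    NT{B.V}  → NT{RB.V} NT{B.V}  |  NT{B.V} NT{RB.V}  |  ε

With NT{N} → NT{A.N} and Type-A trees for "John" and "Mary", this CFG derives e.g. "John really like Mary", matching the unification example of Section 3.2.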

3.4.2 L(CFG) ⊆ L(DIG)

It is known that any context free grammar can be converted to Greibach Normal Form, where each production has the form A → aV*, where a is a terminal and V* is a (possibly empty) string over the set V of nonterminals. We simply construct a corresponding Type-A dependency tree as follows:

[Figure 7: the Type-A tree corresponding to a Greibach Normal Form production]
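For instance (our illustration, not the paper's Figure 7): the GNF production A → a B C becomes a Type-A tree whose lexicalized root is the head a with category A, carrying two unlexicalized argument nodes, with a local word order rule that keeps the head leftmost:

    A → a B C   ↦   a[A]@ with argument nodes [B] and [C], word order rule: @ B C

Unifying the Type-A trees built from the productions for B and C into these slots mirrors a leftmost GNF derivation step for step, so every string of the CFG is derivable in the DIG.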

4 Comparing DIG with TAG

A Tree Adjoining Grammar (TAG) is defined as a five tuple (Σ, NT, I, A, S), where Σ is a set of terminals, NT is a set of nonterminals, I is a finite set of finite initial trees (α trees), A is a finite set of auxiliary trees (β trees), and S is a set of starting symbols. The TAG formalism defines two operations: substitution and adjunction.

A TAG derives a phrase-structure tree, called the "derived tree", and at the same time, in each step of the derivation process, two elementary trees are connected through either the substitution or the adjunction operation. Hence we also have a "derivation tree", which represents the syntactic and/or logical relations between the elementary trees. Since each elementary tree of TAG has exactly one lexical node, we can view the derivation tree as a "Deep Syntactic Representation" (DSynR). This representation closely resembles the dependency structure of the sentence.

Here we show how DIG models the different operations of TAG and hence handles word order phenomena gracefully. We categorize the TAG operations into three different types: substitution, non-predicative adjunction, and predicative adjunction.

- Substitution. We model the TAG substitution operation by having the embedded tree replace the non-terminal that is in accordance with its root. An example of this type is the substitution of an NP.

[Figure 8a: substitution in TAG]
[Figure 8b: substitution through DIG unification]

- Non-predicative adjunction. In TAG, this type of operation includes all adjunctions where the embedded tree does not contain a predicate, i.e. the root of the embedded tree is not an S. For example, the trees for adverbs have root VP and are adjoined to non-terminal VPs in the matrix tree.

[Figure 9a: non-predicative adjunction in TAG]
[Figure 9b: non-predicative adjunction through DIG unification, adjoining really[adv]@ to Like[V]@ with arguments John[N] and [N]]

- Predicative adjunction. This type of operation adjoins an embedded tree which contains a predicate, i.e. with root S, to the matrix tree. A typical example is the sentence: "Who does John think Mary likes?" This example is non-projective and involves "wh" movement. In the TAG sense, the tree for "does John think" is adjoined to the matrix tree for "Who Mary likes". This category of operation has an interesting property: the dependency relation between the embedded tree and the matrix tree is inverted. That is, if tree T1 is adjoined to T2, then in non-predicative adjunction T1 depends on T2, but in predicative adjunction T2 depends on T1. In the above example, the tree with "likes" depends on the tree with "think".

[Figure 10a: "wh" movement through the TAG (predicative) adjunction operation]

Our solution is quite simple: when we construct the grammar, we invert the arc that points to a predicative clause. Despite the fact that the resulting dependency trees have certain arcs inverted, we are still able to use localized word order rules and derive the desired sentence with the simple unification operation, as shown below:

[Figure 10b: "wh" movement through unification]

Since TAG is mildly context sensitive, and we have shown in Section 3 that DIG is context free, we are not claiming that the two grammars are weakly or strongly equivalent. Also, please note that DIG does not handle all non-projectivity issues, due to its CFG-equivalent generation capacity.

5 Synchronous DIG

5.1 Definition

(Wu, 1997) introduced synchronous binary trees and (Shieber and Schabes, 1990) introduced synchronous tree adjoining grammars, both of which view the translation process as a synchronous derivation process over parallel trees. Similarly, with our DIG formalism, we can construct a Synchronous DIG (SDIG) by synchronizing both the structures and the operations in the two languages and ensuring synchronous derivations.

Properties of SDIG:
1. The roots of the source and target language trees are aligned and have the same category.
2. All the unlexicalized nodes of both trees are aligned and have the same category.
3. The two heads of both trees are aligned and have the same category.

Synchronous Unification Operation: by the above properties of SDIG, we can show that unification operations are synchronized in both languages. Hence we can have synchronous unification operations.

5.2 The Isomorphism Assumption

So how is SDIG different from other synchronous grammar formalisms? As we know, a synchronous grammar derives both the source and the target languages through a series of synchronous derivation steps. For any tree-based synchronous grammar, the synchronous derivation creates two derivation trees, one for each language, which have isomorphic structure. Thus a synchronous grammar assumes a certain isomorphism between the two languages, which we refer to as the "isomorphism assumption".

Now we examine the isomorphism assumptions of S-CFG and S-TAG:
- For S-CFG, the substitutions for all the nonterminals need to be synchronous. Hence the isomorphism assumption of S-CFG is an isomorphic phrasal structure.
- For S-TAG, all the substitution and adjunction operations need to be synchronous, and the derivation trees of both languages are isomorphic. The derivation tree of a TAG is roughly equivalent to a dependency tree. Hence the isomorphism assumption of S-TAG is an isomorphic dependency structure.

As shown by real translation tasks, both of these assumptions fail due to structural divergences between the languages. On the other hand, SDIG does NOT assume word level isomorphism or isomorphic dependency trees, since in the SDIG sense the parallel dependency trees are in fact the "derived" form rather than the "derivation" form. In other words, SDIG assumes that the isomorphism lies deeper than the dependency structure: it is the derivation tree of DIG that is isomorphic.

The following "pseudo-translation" example illustrates how SDIG captures structural divergence between the languages. Suppose we want to translate:
- [Source] The girl kissed her kitty cat.
- [Target] The girl gave a kiss to her cat.

Note that neither S-CFG nor S-TAG is able to handle such structural divergence. However, when we view each of the two sentences as derived from three elementary trees in DIG, we can have a synchronous derivation, as shown below:

[Figure 11: a synchronous derivation of the pseudo-translation example from three elementary tree pairs]
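A toy rendering of that synchronous derivation, under heavy simplification: each elementary tree pair is flattened to a pair of string templates whose numbered slots stand for the aligned unlexicalized nodes (SDIG property 2). The dictionary layout, slot names, and string encoding are our assumptions for illustration only; a real SDIG operates on trees, not strings.

    # The three elementary tree pairs behind Figure 11 (flattened).
    rules = {
        "KISS-PAIR": ("X1 kissed X2", "X1 gave a kiss to X2"),  # divergent pair
        "GIRL-PAIR": ("the girl", "the girl"),
        "CAT-PAIR":  ("her kitty cat", "her cat"),
    }

    def derive(symbol, bindings=None):
        """Synchronously expand one elementary tree pair: the same
        unification happens at the aligned slot on both sides at once."""
        src, tgt = rules[symbol]
        for slot, child_symbol in (bindings or {}).items():
            child_src, child_tgt = derive(child_symbol)
            src = src.replace(slot, child_src)
            tgt = tgt.replace(slot, child_tgt)
        return src, tgt

    print(derive("KISS-PAIR", {"X1": "GIRL-PAIR", "X2": "CAT-PAIR"}))
    # -> ('the girl kissed her kitty cat', 'the girl gave a kiss to her cat')

The point of the example: the source and target dependency trees are not isomorphic ("kissed" vs. "gave a kiss to"), but the derivation trees — one KISS-PAIR node with two children — are.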

6 The Probabilistic Extension to SDIG and Statistical MT

The major reason to construct an SDIG is to have a generative model for syntax based statistical MT. Relying on the assumption that the derivation tree of DIG represents the probability dependency graph, we can build a graphical model which captures the following two statistical dependencies:
1. Probabilities of elementary tree unification (in the target language);
2. Probabilities of elementary tree transfer (between the languages), i.e. the probability of two elementary trees being paired.

[Figure 12: two isomorphic derivation trees, ET-f1 ... ET-f4 and ET-e1 ... ET-e4, with dotted arcs for the conditional dependencies]

The above graph shows two isomorphic derivation trees for the two languages. ET stands for elementary tree, and the dotted arcs denote the conditional dependence assumptions. Under the above model, the best translation is:

    e* = argmax_e P(f | e) P(e)

where

    P(f | e) = ∏_i P(ET(f)_i | ET(e)_i)

and

    P(e) = ∏_i P(ET(e)_i | Parent(ET(e)_i)).

Hence we have a PSDIG (Probabilistic Synchronous Dependency Insertion Grammar). Given the dynamic programming property of the above graphical model, an efficient polynomial time Viterbi decoding algorithm can be constructed.
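A minimal sketch of how this model scores one synchronous derivation, assuming the two probability tables have already been estimated; the table layout and the toy numbers are ours, not the paper's:

    import math

    def derivation_log_prob(derivation, transfer_p, unify_p):
        """log P(f|e) + log P(e) for one synchronous derivation, where
        derivation is a list of (parent_e, et_e, et_f) triples, one per
        elementary tree pair in the isomorphic derivation trees."""
        logp = 0.0
        for parent_e, et_e, et_f in derivation:
            logp += math.log(transfer_p[(et_f, et_e)])   # P(ET(f)_i | ET(e)_i)
            logp += math.log(unify_p[(et_e, parent_e)])  # P(ET(e)_i | Parent(ET(e)_i))
        return logp

    # Toy tables for the pseudo-translation example (made-up probabilities;
    # the root's "parent" is None).
    transfer_p = {("kissed", "gave a kiss to"): 0.4,
                  ("girl", "girl"): 0.9,
                  ("kitty cat", "cat"): 0.5}
    unify_p = {("gave a kiss to", None): 0.1,
               ("girl", "gave a kiss to"): 0.3,
               ("cat", "gave a kiss to"): 0.2}

    derivation = [(None, "gave a kiss to", "kissed"),
                  ("gave a kiss to", "girl", "girl"),
                  ("gave a kiss to", "cat", "kitty cat")]
    print(derivation_log_prob(derivation, transfer_p, unify_p))

Because the score factors over the nodes of the derivation tree, a Viterbi search can maximize it bottom-up with dynamic programming, which is what makes the polynomial time decoder mentioned above possible.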

7 Current Implementation

To test our ideas, we implemented the above synchronous grammar formalism in a Chinese-English machine translation system. The actual implementation of the synchronous grammar used in the system is a scaled-down version of the SDIG introduced above, in which all the word categories are treated as one. The reason for this simplification is that word category mappings across languages are not straightforward; defining the word categories so that they are consistent between the languages is a major goal of our future research.

The uni-category version of the SDIG is induced using the algorithm in (Ding and Palmer, 2004), which is a statistical approach to extracting parallel dependency structures from large scale parallel corpora. An example is given below. We can construct the parallel dependency trees as shown in Figure 13a; the expected output of the above approach is shown in Figure 13b, where (e) stands for an empty node trace.
- [English] I have been here since 1947.
- [Chinese] Wo 1947 nian yilai yizhi zhu zai zheli.
  (gloss: I year since always live in here)

[Figure 13a: Input (parallel dependency trees)]
[Figure 13b: Output (5 parallel elementary tree pairs)]

We built a decoder for the model in Section 6 for our machine translation system. The decoder is based on a polynomial time decoding algorithm for fast non-isomorphic tree-to-tree transduction (unpublished at the time of this paper).

We use an automatic syntactic parser (Collins, 1999; Bikel, 2002) to produce parse trees for the parallel unaligned sentences. The parser was trained using the Penn English/Chinese Treebanks. We then used the algorithm in (Xia, 2001) to convert the phrasal structure trees into dependency trees.

The following table shows the statistics of the datasets we used (genre, number of sentence pairs, number of Chinese/English words, type, and usage):

    Dataset   Xinhua      FBIS        NIST
    Genre     News        News        News
    Sent #    56263       21003       206
    Chn W#    1456495     522953      26.3 average
    Eng W#    1490498     658478      32.5 average
    Type      unaligned   unaligned   multi-reference
    Usage     training    training    testing
    Figure 14

The training set consists of Xinhua newswire data from LDC and the FBIS data. We filtered both datasets to ensure parallel sentence pair quality. We used the development test data from the 2001 NIST MT evaluation workshop as our test data for MT system performance. In the testing data, each input Chinese sentence has 4 English translations as references, so that the output of the MT system can be evaluated using the Bleu and NIST machine translation evaluation software.
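For today's readers, multi-reference Bleu scoring of the kind described above can be reproduced with the sacrebleu package. This stand-in tooling and the toy strings are our assumptions, not what was actually run (the paper used the original NIST/Bleu evaluation scripts):

    import sacrebleu  # pip install sacrebleu; a modern stand-in for the NIST scripts

    # One hypothesis per input sentence; one list per reference set, so
    # four lists would mirror the 4-reference NIST 2001 dev-test setup.
    hypotheses = ["the girl gave a kiss to her cat"]
    references = [["the girl gave her cat a kiss"],
                  ["the girl kissed her cat"]]
    print(sacrebleu.corpus_bleu(hypotheses, references).score)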

             1-gram   2-gram   3-gram   4-gram
    NIST:    4.3753   4.9773   5.0579   5.0791
    BLEU:    0.5926   0.3417   0.2060   0.1353
    Figure 15

The above table shows the cumulative Bleu and NIST n-gram scores for our current implementation, with a final Bleu score of 0.1353 at an average input sentence length of 26.3 words.

In comparison, (Yamada and Knight, 2002), a phrasal structure based statistical MT system for Chinese to English translation, reported Bleu scores of 0.099 to 0.102 for short sentences (less than 14 words).

Please note that the Bleu/NIST scorers, while based on n-gram matching, do not model syntax during evaluation, which means a direct comparison between a syntax based MT system and a string based statistical MT system using these scorers tends to favor the string based systems.

We believe that our results can be improved using a more sophisticated machine translation pipeline with separate components that handle specific language phenomena such as named entities. Larger training corpora should also be helpful.

8 Conclusion

Finally, let us review whether the proposed SDIG formalism achieves the goals we set up in Section 1 of this paper for a grammar formalism for statistical MT applications:
1. Linguistically motivated: DIG captures word order phenomena within the CFG domain.
2. SDIG drops the unrealistic word-to-word isomorphism assumption and is able to capture structural divergences.
3. DIG is weakly equivalent to CFG.
4. DIG and SDIG are generative grammars.
5. Both have simple formalisms, with only one type of node and one type of operation.

9 Future Work

We observe from our testing results that the current simplified uni-category version of SDIG suffers from various grammatical errors, both in grammar induction and in decoding. Our future work should therefore focus on word category consistency between the languages so that a full-fledged version of SDIG can be used.

10 Acknowledgements

Our thanks go to Aravind Joshi, Owen Rambow, Dekai Wu and all the anonymous reviewers of the previous versions of this paper, who gave us invaluable advice, suggestions and feedback.

References

Hiyan Alshawi, Srinivas Bangalore, and Shona Douglas. 2000. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics, 26(1):45-60.

Daniel M. Bikel. 2002. Design of a multi-lingual, parallel-processing statistical parsing engine. In Proceedings of HLT 2002.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263-311.

Michael John Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.

Yuan Ding and Martha Palmer. 2004. Automatic learning of parallel dependency treelet pairs. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04).

Bonnie J. Dorr. 1994. Machine translation divergences: a formal description and proposed solution. Computational Linguistics, 20(4):597-633.

Heidi J. Fox. 2002. Phrasal cohesion and statistical machine translation. In Proceedings of EMNLP-02, pages 304-311.

Daniel Gildea. 2003. Loosely tree based alignment for machine translation. In Proceedings of ACL-03.

Jan Hajic, et al. 2002. Natural language generation in the context of machine translation. Summer workshop final report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore.

Aravind Joshi and Owen Rambow. 2003. A formalism for dependency grammar based on Tree Adjoining Grammar. In Proceedings of the First International Conference on Meaning-Text Theory (MTT 2003), June 2003.

Aravind K. Joshi and Yves Schabes. 1992. Tree-adjoining grammars and lexicalized grammars. In Maurice Nivat and Andreas Podelski, editors, Tree Automata and Languages. Elsevier Science.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL-03, pages 160-167.

Owen Rambow and Aravind Joshi. 1997. A formal look at dependency grammars and phrase structures. In Leo Wanner, editor, Recent Trends in Meaning-Text Theory, pages 167-190.

Stuart M. Shieber and Yves Schabes. 1990. Synchronous tree-adjoining grammars. In Proceedings of the 13th COLING, pages 253-258, August 1990.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403.

Fei Xia. 2001. Automatic Grammar Generation from Two Different Perspectives. Ph.D. thesis, University of Pennsylvania, Philadelphia.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of ACL-01.

Kenji Yamada and Kevin Knight. 2002. A decoder for syntax-based statistical MT. In Proceedings of ACL-02.