Synchronous Dependency Insertion Grammars: A Grammar Formalism for Syntax Based Statistical MT

Yuan Ding and Martha Palmer
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
{yding, mpalmer}@linc.cis.upenn.edu

Abstract

This paper introduces a grammar formalism specifically designed for syntax-based statistical machine translation. The synchronous grammar formalism we propose in this paper takes into consideration the pervasive structure divergence between languages, which many other synchronous grammars are unable to model. Dependency Insertion Grammar (DIG) is a generative grammar formalism that captures word order phenomena within the dependency representation. Synchronous Dependency Insertion Grammar (SDIG) is the synchronous version of DIG, which aims at capturing structural divergences across languages. While both DIG and SDIG have comparatively simple mathematical forms, we prove that DIG nevertheless has a generation capacity weakly equivalent to that of CFG. By making a comparison to TAG and Synchronous TAG, we show how such formalisms are linguistically motivated. We then introduce a probabilistic extension of SDIG. Finally, we evaluate our current implementation of a simplified version of SDIG for syntax based statistical machine translation.

1 Introduction

Dependency grammars have a long history and have played an important role in machine translation (MT). The early use of dependency structures in machine translation tasks mainly falls into the category of transfer based MT, where the dependency structure of the source language is first analyzed, then transferred to the target language by using a set of transduction rules or a transfer lexicon, and finally the linear form of the target language sentence is generated. While this approach seems plausible, the transfer process demands intense human effort in creating a working transduction rule set or a transfer lexicon, which largely limits the performance and application domain of the resultant machine translation system.

In the early 1990s, (Brown et al., 1993) introduced the idea of statistical machine translation, where the word-to-word translation probabilities and sentence reordering probabilities are estimated from a large set of parallel sentence pairs. By leveraging large parallel corpora, the statistical MT approach outperforms the traditional transfer based approaches in tasks for which adequate parallel corpora are available (Och, 2003). However, a major criticism of this approach is that it is void of any internal representation for syntax or semantics.

In recent years, hybrid approaches, which aim at applying statistical learning to structured data, began to emerge. Syntax based statistical MT approaches began with (Wu, 1997), who introduced a polynomial-time solution for the alignment problem based on synchronous binary trees. (Alshawi et al., 2000) extended the tree-based approach by representing each production in parallel dependency trees as a finite-state transducer. (Yamada and Knight, 2001, 2002) model translation as a sequence of operations transforming a syntactic tree in one language into the string of the second language.

The syntax based statistical approaches have been faced with the major problem of pervasive structural divergence between languages, due to both systematic differences between languages (Dorr, 1994) and the vagaries of loose translations in real corpora. While we would like to use syntactic information in both languages, the problem of non-isomorphism grows when trees in both languages are required to match.

To allow the syntax based machine translation approaches to work as a generative process, certain isomorphism assumptions have to be made. Hence a reasonable question to ask is: to what extent should the grammar formalism we choose to represent syntactic language transfer assume isomorphism between the structures of the two languages? (Hajic et al., 2002) allows for limited non-isomorphism in that n-to-m matching of nodes in the two trees is permitted. However, even after extending this model by allowing cloning operations on subtrees, (Gildea, 2003) found that parallel trees over-constrained the alignment problem, and achieved better results with a tree-to-string model using one input tree than with a tree-to-tree model using two.

At the same time, grammar theoreticians have proposed various generative synchronous grammar formalisms for MT, such as Synchronous Context Free Grammars (S-CFG) (Wu, 1997) or Synchronous Tree Adjoining Grammars (S-TAG) (Shieber and Schabes, 1990). Mathematically, generative synchronous grammars share many good properties with their monolingual counterparts such as CFG or TAG (Joshi and Schabes, 1992). If such a synchronous grammar could be learned from parallel corpora, the MT task would become a mathematically clean generative process.

However, the problem of inducing a synchronous grammar from empirical data was never solved. For example, Synchronous TAGs, proposed by (Shieber and Schabes, 1990), were introduced primarily for semantics but were later also proposed for translation. From a formal perspective, Syn-TAGs characterize the correspondences between languages by a set of synchronous elementary tree pairs. While examples show that this formalism does capture certain cross language structural divergences, there is not, to our knowledge, any successful statistical learning method to learn such a grammar from empirical data. We believe that this is due to the limited ability of Synchronous TAG to model structure divergences. This observation will be discussed later in Section 5.

We studied the problem of learning synchronous syntactic sub-structures (parallel dependency treelets) from unaligned parallel corpora in (Ding and Palmer, 2004). At the same time, we would like to formalize a synchronous grammar for syntax based statistical MT. The necessity of a well-defined formalism and certain limitations of the current existing formalisms motivate us to design a new synchronous grammar formalism with the following properties:

1. Linguistically motivated: it should be able to capture most language phenomena, e.g. complicated word orders such as "wh" movement.
2. Without the unrealistic word-to-word isomorphism assumption: it should be able to capture structural variations between the languages.
3. Mathematically rigorous: it should have a well defined formalism and a proven generation capacity, preferably context free or mildly context sensitive.
4. Generative: it should be "generative" in a mathematical sense. This property is essential for the grammar to be used in statistical MT. Each production rule should have its own probability, which will allow us to decompose the overall translation probability (a sketch of this decomposition is given after this list).
5. Simple: it should have a minimal number of different structures and operations so that it will be learnable from the empirical data.
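To illustrate property 4, the decomposition we have in mind can be sketched as follows; this is only an illustrative factorization that treats rule applications as independent, not the exact probabilistic model, which is developed in Section 6. If a derivation D builds a sentence pair by applying production rules (elementary structures) r_1, ..., r_n, then

    P(D) = \prod_{i=1}^{n} P(r_i)

so the overall translation probability decomposes into per-rule probabilities, each of which can be estimated separately from data.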
In the following sections of this paper, we introduce a grammar formalism that satisfies the above properties: Synchronous Dependency Insertion Grammar (SDIG). Section 2 gives an informal look at the desired capabilities of the monolingual version, Dependency Insertion Grammar (DIG), by addressing the problems with previous dependency grammars. Section 3 gives the formal definition of DIG and shows that it is weakly equivalent to Context Free Grammar (CFG). Section 4 shows how DIG is linguistically motivated by making a comparison between DIG and Tree Adjoining Grammar (TAG). Section 5 specifies the Synchronous DIG and Section 6 gives the probabilistic extension of SDIG.

2 Issues with Dependency Grammars

2.1 Dependency Grammars and Statistical MT

According to (Fox, 2002), dependency representations have the best phrasal cohesion properties across languages: the percentage of head crossings per chance is 12.62% and that of modifier crossings per chance is 9.22%. Observing this fact, it is reasonable to propose a formalism that handles language transfer based on dependency structures.

What is more, if a formalism based on dependency structures is made possible, it will have the nice property of being simple, as expressed in the following table:

                     CFG    TAG    DG
    Node #           2n     2n     n
    Lexicalized?     NO     YES    YES
    Node types       2      2      1*
    Operation types  1      2      1*
    (*: will be shown later in this paper)

Figure 1.

The simplicity of a grammar is very important for statistical modeling: when the grammar is being learned from the corpora and when it is being used in machine translation decoding, we do not need to condition the probabilities on two different node types or operations.

At the same time, dependency grammars are inherently lexicalized in that each node is one word. Statistical parsers (Collins, 1999) showed performance improvement by using bilexical probabilities, i.e. probabilities of word pair occurrences. This is what dependency grammars model explicitly.
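As a concrete illustration (our notation, not the parsing model of (Collins, 1999)), the bilexical probability of a dependency edge from a head word w_h to a modifier word w_m can be estimated by relative frequency over a parsed corpus:

    P(w_m \mid w_h) = \frac{Count(w_h, w_m)}{Count(w_h)}

Since every node in a dependency tree is a word, such word-pair probabilities attach directly to the edges of the tree, with no intervening non-lexical nodes.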
2.2 A Generative Grammar?

Why do we want the grammar for statistical MT to be generative? First of all, generative models have long been studied in the machine learning community, which provides us with mathematically rigorous algorithms for training and decoding. Second, CFG, the most popular formalism for describing natural language phenomena, is generative. Certain ideas and algorithms can be borrowed from CFG if we make the formalism generative.

While there has been much previous work in formalizing dependency grammars and in their application to the parsing task, until recently (Joshi and Rambow, 2003) little attention had been given to dependency grammars as generative grammars. The next section gives the definition of the monolingual Dependency Insertion Grammar.

3 The DIG Formalism

3.1 Elementary Trees

Formally, the Dependency Insertion Grammar is defined as a six-tuple (C, L, A, B, S, R). C is a set of syntactic categories and L is a set of lexical items. A is a set of Type-A trees and B is a set of Type-B trees (defined later). S is a set of the starting categories of the sentences. R is a set of word order rules local to each node of the trees.
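To make the six-tuple concrete, the sketch below shows one possible way to encode a DIG in Python. It is only an illustration of the definition above, not an implementation from the paper: the class and field names are ours, Type-A and Type-B trees are distinguished only by a tag (their definitions come later in the paper), and the word order rules in R are reduced to ordered left/right dependent lists on each node.

    from dataclasses import dataclass, field
    from typing import List, Optional, Set

    @dataclass
    class Node:
        # One dependency tree node: a syntactic category from C plus a lexical item from L.
        category: str                                       # element of C, e.g. "VB"
        lexeme: Optional[str] = None                        # element of L, e.g. "saw"; None marks an open slot
        left: List["Node"] = field(default_factory=list)    # dependents ordered to the left of the head
        right: List["Node"] = field(default_factory=list)   # dependents ordered to the right of the head

    @dataclass
    class ElementaryTree:
        # An elementary tree of the grammar; tree_type records membership in A or B.
        tree_type: str                                       # "A" or "B"
        root: Node

    @dataclass
    class DIG:
        # A Dependency Insertion Grammar as the six-tuple (C, L, A, B, S, R).
        categories: Set[str]                                 # C: syntactic categories
        lexicon: Set[str]                                    # L: lexical items
        type_a_trees: List[ElementaryTree]                   # A: Type-A elementary trees
        type_b_trees: List[ElementaryTree]                   # B: Type-B elementary trees
        start_categories: Set[str]                           # S: categories that may root a sentence
        # R is folded into the per-node left/right ordering above rather than stored separately.

    # A tiny Type-A tree for a transitive verb, with two open noun slots:
    saw = ElementaryTree(tree_type="A",
                         root=Node(category="VB", lexeme="saw",
                                   left=[Node(category="NN")],
                                   right=[Node(category="NN")]))

Folding the word order rules into per-node ordered dependent lists is a design shortcut for this sketch; the paper keeps R as a separate component of the tuple.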