Dependency Grammar Induction via Bitext Projection Constraints
Kuzman Ganchev, Jennifer Gillenwater and Ben Taskar
Department of Computer and Information Science
University of Pennsylvania, Philadelphia PA, USA
{kuzman,jengi,taskar}@seas.upenn.edu

Abstract

Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poor languages. The wide availability of parallel text and accurate parsers in English has opened up the possibility of grammar induction through partial transfer across bitext. We consider generative and discriminative models for dependency grammar induction that use word-level alignments and a source language parser (English) to constrain the space of possible target trees. Unlike previous approaches, our framework does not require full projected parses, allowing partial, approximate transfer through linear expectation constraints on the space of distributions over trees. We consider several types of constraints that range from generic dependency conservation to language-specific annotation rules for auxiliary verb analysis. We evaluate our approach on Bulgarian and Spanish CoNLL shared task data and show that we consistently outperform unsupervised methods and can outperform supervised learning for limited training data.

1 Introduction

For English and a handful of other languages, there are large, well-annotated corpora with a variety of linguistic information ranging from named entity to discourse structure. Unfortunately, for the vast majority of languages very few linguistic resources are available. This situation is likely to persist because of the expense of creating annotated corpora that require linguistic expertise (Abeillé, 2003). On the other hand, parallel corpora between many resource-poor languages and resource-rich languages are ample, motivating recent interest in transferring linguistic resources from one language to another via parallel text. For example, several early works (Yarowsky and Ngai, 2001; Yarowsky et al., 2001; Merlo et al., 2002) demonstrate transfer of shallow processing tools such as part-of-speech taggers and noun-phrase chunkers by using word-level alignment models (Brown et al., 1994; Och and Ney, 2000).

Alshawi et al. (2000) and Hwa et al. (2005) explore transfer of deeper syntactic structure: dependency grammars. Dependency and constituency grammar formalisms have long coexisted and competed in linguistics, especially beyond English (Mel’čuk, 1988). Recently, dependency parsing has gained popularity as a simpler, computationally more efficient alternative to constituency parsing and has spurred several supervised learning approaches (Eisner, 1996; Yamada and Matsumoto, 2003a; Nivre and Nilsson, 2005; McDonald et al., 2005) as well as unsupervised induction (Klein and Manning, 2004; Smith and Eisner, 2006). The dependency representation has been used for language modeling, textual entailment and machine translation (Haghighi et al., 2005; Chelba et al., 1997; Quirk et al., 2005; Shen et al., 2008), to name a few tasks.

Dependency grammars are arguably more robust to transfer, since syntactic relations between aligned words of parallel sentences are better conserved in translation than phrase structure (Fox, 2002; Hwa et al., 2005). Nevertheless, several challenges to accurate training and evaluation from aligned bitext remain: (1) partial word alignment due to non-literal or distant translation; (2) errors in word alignments and source language parses; and (3) grammatical annotation choices that differ across languages and linguistic theories (e.g., how to analyze auxiliary verbs and conjunctions).

In this paper, we present a flexible learning framework for transferring dependency grammars via bitext using the posterior regularization framework (Graça et al., 2008). In particular, we address challenges (1) and (2) by avoiding commitment to an entire projected parse tree in the target language during training. Instead, we explore formulations of both generative and discriminative probabilistic models where projected syntactic relations are constrained to hold approximately and only in expectation. Finally, we address challenge (3) by introducing a very small number of language-specific constraints that disambiguate arbitrary annotation choices.

We evaluate our approach by transferring from an English parser trained on the Penn treebank to Bulgarian and Spanish. We evaluate our results on the Bulgarian and Spanish corpora from the CoNLL X shared task. We see that our transfer approach consistently outperforms unsupervised methods and, given just a few (2 to 7) language-specific constraints, performs comparably to a supervised parser trained on a very limited corpus (30-140 training sentences).

2 Approach

At a high level, our approach is illustrated in Figure 1(a). A parallel corpus is word-level aligned using an alignment toolkit (Graça et al., 2009) and the source (English) side is parsed using a dependency parser (McDonald et al., 2005). Figure 1(b) shows an aligned sentence pair example where dependencies are perfectly conserved across the alignment. An edge from English parent p to child c is called conserved if word p aligns to word p′ in the second language, c aligns to c′ in the second language, and p′ is the parent of c′. Note that we are not restricting ourselves to one-to-one alignments here; p, c, p′, and c′ can all also align to other words.

Figure 1: (a) Overview of our grammar induction approach via bitext: the source (English) is parsed and word-aligned with the target; after filtering, projected dependencies define constraints over target parse tree space, providing weak supervision for learning a target grammar. (b) An example word-aligned sentence pair with perfectly projected dependencies.
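To make the projection step concrete, the following minimal sketch (with hypothetical data structures; the paper does not show the authors' actual pipeline) collects the set of projected target-side edges for one sentence pair. An edge of a candidate target parse is then conserved exactly when it appears in this set.

```python
def project_edges(src_edges, alignment):
    """Project source dependency edges through a word alignment.

    src_edges: iterable of (p, c) pairs, source parent -> child indices.
    alignment: iterable of (src_i, tgt_i) index pairs; alignments may be
               many-to-many, matching the paper's definition.
    Returns the set C_x of directed target-side edges (p', c') such that
    some source edge (p, c) has p aligned to p' and c aligned to c'.
    """
    aligned = {}  # source index -> set of aligned target indices
    for s, t in alignment:
        aligned.setdefault(s, set()).add(t)

    c_x = set()
    for p, c in src_edges:
        # p and c may each align to several target words; every aligned
        # (p', c') combination yields a candidate projected edge.
        for p_t in aligned.get(p, ()):
            for c_t in aligned.get(c, ()):
                c_x.add((p_t, c_t))
    return c_x


# Toy usage: source edges barks->dog and dog->the, one-to-one alignment.
assert project_edges([(2, 1), (1, 0)],
                     [(0, 0), (1, 1), (2, 2)]) == {(2, 1), (1, 0)}
```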
After filtering to identify well-behaved sentences and high-confidence projected dependencies, we learn a probabilistic parsing model using the posterior regularization framework (Graça et al., 2008). We estimate both generative and discriminative models by constraining the posterior distribution over possible target parses to approximately respect projected dependencies and other rules which we describe below. In our experiments we evaluate the learned models on dependency treebanks (Nivre et al., 2007).

Unfortunately, the sentence in Figure 1(b) is highly unusual in its amount of dependency conservation. To get a feel for the typical case, we used off-the-shelf parsers (McDonald et al., 2005) for English, Spanish and Bulgarian on two bitexts (Koehn, 2005; Tiedemann, 2007) and compared several measures of dependency conservation. For the English-Bulgarian corpus, we observed that 71.9% of the edges we projected were edges in the corpus, and we projected on average 2.7 edges per sentence (out of 5.3 tokens on average). For Spanish, we saw conservation of 64.4% and an average of 5.9 projected edges per sentence (out of 11.5 tokens on average).

As these numbers illustrate, directly transferring information one dependency edge at a time is unfortunately error-prone, for two reasons. First, parser and word alignment errors cause much of the transferred information to be wrong. We deal with this problem by constraining groups of edges rather than single edges. For example, in some sentence pair we might find 10 edges that have both end points aligned and can be transferred. Rather than requiring our target language parse to contain each of the 10 edges, we require that the expected number of edges from this set be at least 10η, where η is a strength parameter. This gives the parser freedom to have some uncertainty about which edges to include, or alternatively to choose to exclude some of the transferred edges.

A more serious problem for transferring parse information across languages is structural differences and grammar annotation choices between the two languages, for example in the treatment of auxiliary verbs and reflexive constructions. Hwa et al. (2005) also note these problems and solve them by introducing dozens of rules to transform the transferred parse trees. We discuss these differences in detail in the experimental section and use our framework to introduce a very small number of rules to cover the most common structural differences.

3 Parsing Models

We explored two parsing models: a generative model used by several authors for unsupervised induction, and a discriminative model used for fully supervised training.

The discriminative parser is based on the edge-factored model and features of the MSTParser (McDonald et al., 2005). The parsing model defines a conditional distribution pθ(z | x) over each projective parse tree z for a particular sentence x, parameterized by a vector θ. The probability of any particular parse is

    pθ(z | x) ∝ ∏_{z ∈ z} e^{θ·φ(z,x)},    (1)

where z is a directed edge contained in the parse tree z and φ is a feature function. In the fully supervised experiments we run for comparison, parameter estimation is performed by stochastic gradient ascent on the conditional likelihood function, similar to maximum entropy models or conditional random fields. One needs to be able to compute expectations of the features φ(z, x) under the distribution pθ(z | x).
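Since the model in Eq. (1) is edge-factored, the unnormalized log-probability of a candidate tree is just a sum of per-edge scores. The sketch below illustrates this decomposition, representing θ and φ as dictionaries for simplicity (these representations are illustrative, not MSTParser's actual API). The normalizer, and the feature expectations needed for gradient ascent, require summing over all projective trees, which can be done with Eisner's dynamic programming algorithm (not shown).

```python
def tree_log_score(tree_edges, sentence, theta, phi):
    """Unnormalized log-probability of a parse under Eq. (1).

    tree_edges: the directed edges z making up the parse tree.
    sentence:   the input sentence x.
    theta:      weight vector, here a dict mapping feature name -> weight.
    phi:        edge feature function: (edge, sentence) -> dict of counts.
    Because the model is edge-factored, the score decomposes as a sum
    over edges, so exp() of this value is proportional to p(z | x).
    """
    return sum(
        theta.get(feat, 0.0) * count
        for edge in tree_edges
        for feat, count in phi(edge, sentence).items()
    )
```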
4 Posterior Regularization

Posterior regularization (Graça et al., 2008) is an estimation framework that incorporates side-information into unsupervised problems in the form of linear constraints on posterior expectations. In grammar transfer, our basic constraint is of the form: the expected proportion of conserved edges in a sentence pair is at least η (the exact proportion we used was 0.9, which was determined using unlabeled data as described in Section 5). Specifically, let Cx be the set of directed edges projected from English for a given sentence x; then, given a parse z, the proportion of conserved edges is

    f(x, z) = (1/|Cx|) ∑_{z ∈ z} 1(z ∈ Cx),

and the expected proportion of conserved edges under the distribution pθ(z | x) is the expectation E_{pθ}[f(x, z)].
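As a minimal sketch (reusing the edge representation from the projection sketch above), the constraint feature f(x, z) is a simple normalized count:

```python
def conserved_proportion(tree_edges, c_x):
    """f(x, z): fraction of the projected edges C_x present in parse z.

    tree_edges: set of directed (parent, child) edges of a candidate parse z.
    c_x:        set of projected edges for this sentence (see project_edges).
    """
    if not c_x:
        return 0.0  # no projected edges; the constraint is vacuous
    return sum(1 for edge in tree_edges if edge in c_x) / len(c_x)
```

During training, posterior regularization then requires the expectation of this quantity under pθ(z | x) to be at least η, rather than forcing any single projected edge into the parse.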