
Dependency Grammar Induction via Bitext Projection Constraints

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar
Department of Computer and Information Science
University of Pennsylvania, Philadelphia, PA, USA
{kuzman,jengi,taskar}@seas.upenn.edu

Abstract

Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poor languages. The wide availability of parallel text and accurate parsers in English has opened up the possibility of grammar induction through partial transfer across bitext. We consider generative and discriminative models for dependency grammar induction that use word-level alignments and a source language parser (English) to constrain the space of possible target trees. Unlike previous approaches, our framework does not require full projected parses, allowing partial, approximate transfer through linear expectation constraints on the space of distributions over trees. We consider several types of constraints that range from generic dependency conservation to language-specific annotation rules for auxiliary verb analysis. We evaluate our approach on Bulgarian and Spanish CoNLL shared task data and show that we consistently outperform unsupervised methods and can outperform supervised learning for limited training data.

1 Introduction

For English and a handful of other languages, there are large, well-annotated corpora with a variety of linguistic information ranging from named entities to discourse structure. Unfortunately, for the vast majority of languages very few linguistic resources are available. This situation is likely to persist because of the expense of creating annotated corpora that require linguistic expertise (Abeillé, 2003). On the other hand, parallel corpora between many resource-poor languages and resource-rich languages are ample, motivating recent interest in transferring linguistic resources from one language to another via parallel text. For example, several early works (Yarowsky and Ngai, 2001; Yarowsky et al., 2001; Merlo et al., 2002) demonstrate transfer of shallow processing tools such as part-of-speech taggers and noun-phrase chunkers by using word-level alignment models (Brown et al., 1994; Och and Ney, 2000).

Alshawi et al. (2000) and Hwa et al. (2005) explore transfer of deeper syntactic structure: dependency grammars. Dependency and constituency grammar formalisms have long coexisted and competed in linguistics, especially beyond English (Mel'čuk, 1988). Recently, dependency parsing has gained popularity as a simpler, computationally more efficient alternative to constituency parsing and has spurred several supervised learning approaches (Eisner, 1996; Yamada and Matsumoto, 2003a; Nivre and Nilsson, 2005; McDonald et al., 2005) as well as unsupervised induction (Klein and Manning, 2004; Smith and Eisner, 2006). Dependency representation has been used for language modeling, textual entailment and machine translation (Haghighi et al., 2005; Chelba et al., 1997; Quirk et al., 2005; Shen et al., 2008), to name a few tasks.

Dependency grammars are arguably more robust to transfer since syntactic relations between aligned words of parallel sentences are better conserved in translation than phrase structure (Fox, 2002; Hwa et al., 2005). Nevertheless, several challenges to accurate training and evaluation from aligned bitext remain: (1) partial word alignment due to non-literal or distant translation; (2) errors in word alignments and source language parses; (3) grammatical annotation choices that differ across languages and linguistic theories (e.g., how to analyze auxiliary verbs, conjunctions).

In this paper, we present a flexible learning

framework for transferring dependency grammars via bitext using the posterior regularization framework (Graça et al., 2008). In particular, we address challenges (1) and (2) by avoiding commitment to an entire projected parse tree in the target language during training. Instead, we explore formulations of both generative and discriminative probabilistic models where projected syntactic relations are constrained to hold approximately and only in expectation. Finally, we address challenge (3) by introducing a very small number of language-specific constraints that disambiguate arbitrary annotation choices.

We evaluate our approach by transferring from an English parser trained on the Penn treebank to Bulgarian and Spanish. We evaluate our results on the Bulgarian and Spanish corpora from the CoNLL X shared task. We see that our transfer approach consistently outperforms unsupervised methods and, given just a few (2 to 7) language-specific constraints, performs comparably to a supervised parser trained on a very limited corpus (30–140 training sentences).

2 Approach

At a high level our approach is illustrated in Figure 1(a). A parallel corpus is word-level aligned using an alignment toolkit (Graça et al., 2009) and the source (English) is parsed using a dependency parser (McDonald et al., 2005). Figure 1(b) shows an aligned sentence pair example where dependencies are perfectly conserved across the alignment. An edge from English parent p to child c is called conserved if word p aligns to word p′ in the second language, c aligns to c′ in the second language, and p′ is the parent of c′. Note that we are not restricting ourselves to one-to-one alignments here; p, c, p′, and c′ can all also align to other words. After filtering to identify well-behaved sentences and high confidence projected dependencies, we learn a probabilistic parsing model using the posterior regularization framework (Graça et al., 2008). We estimate both generative and discriminative models by constraining the posterior distribution over possible target parses to approximately respect projected dependencies and other rules which we describe below. In our experiments we evaluate the learned models on dependency treebanks (Nivre et al., 2007).

Unfortunately the sentence in Figure 1(b) is highly unusual in its amount of dependency conservation. To get a feel for the typical case, we used off-the-shelf parsers (McDonald et al., 2005) for English, Spanish and Bulgarian on two bitexts (Koehn, 2005; Tiedemann, 2007) and compared several measures of dependency conservation. For the English-Bulgarian corpus, we observed that 71.9% of the edges we projected were edges in the corpus, and we projected on average 2.7 edges per sentence (out of 5.3 tokens on average). For Spanish, we saw conservation of 64.4% and an average of 5.9 projected edges per sentence (out of 11.5 tokens on average).

As these numbers illustrate, directly transferring information one dependency edge at a time is unfortunately error prone for two reasons. First, parser and word alignment errors cause much of the transferred information to be wrong. We deal with this problem by constraining groups of edges rather than a single edge. For example, in some sentence pair we might find 10 edges that have both end points aligned and can be transferred. Rather than requiring our target language parse to contain each of the 10 edges, we require that the expected number of edges from this set is at least 10η, where η is a strength parameter. This gives the parser freedom to have some uncertainty about which edges to include, or alternatively to choose to exclude some of the transferred edges.

A more serious problem for transferring parse information across languages is structural differences and grammar annotation choices between the two languages, for example in the handling of auxiliary verbs and reflexive constructions. Hwa et al. (2005) also note these problems and solve them by introducing dozens of rules to transform the transferred parse trees. We discuss these differences in detail in the experimental section and use our framework to introduce a very small number of rules to cover the most common structural differences.
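To make the edge-projection and conservation statistics above concrete, the following is a minimal sketch of one way to compute them; the data structures (parses as child-to-parent maps, alignments as sets of index pairs) and helper names are illustrative assumptions, not the exact filtering code used in our pipeline.

    def projected_edges(src_parse, alignment):
        """Project source dependency edges through a word alignment.

        src_parse: dict mapping each source child index -> its parent index.
        alignment: set of (src_index, tgt_index) pairs (may be many-to-many).
        Returns target-side directed edges (tgt_parent, tgt_child) obtained by
        projecting every source edge whose endpoints are both aligned.
        """
        edges = set()
        for child, parent in src_parse.items():
            for p_src, p_tgt in alignment:
                if p_src != parent:
                    continue
                for c_src, c_tgt in alignment:
                    if c_src == child:
                        edges.add((p_tgt, c_tgt))
        return edges

    def conserved_fraction(src_parse, tgt_parse, alignment):
        """Fraction of projected edges that also appear in the target parse."""
        proj = projected_edges(src_parse, alignment)
        if not proj:
            return None  # nothing to measure for this sentence pair
        kept = sum(1 for (p, c) in proj if tgt_parse.get(c) == p)
        return kept / len(proj)

Averaging conserved_fraction over a parsed, aligned bitext yields per-corpus conservation figures of the kind reported above (71.9% for English-Bulgarian, 64.4% for English-Spanish).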

Figure 1: (a) Overview of our grammar induction approach via bitext: the source (English) is parsed and word-aligned with target; after filtering, projected dependencies define constraints over target parse tree space, providing weak supervision for learning a target grammar. (b) An example word-aligned sentence pair with perfectly projected dependencies.

3 Parsing Models

We explored two parsing models: a generative model used by several authors for unsupervised induction and a discriminative model used for fully supervised training.

The discriminative parser is based on the edge-factored model and features of the MSTParser (McDonald et al., 2005). The parsing model defines a conditional distribution pθ(z | x) over each projective parse tree z for a particular sentence x, parameterized by a vector θ. The probability of any particular parse is

$$p_\theta(\mathbf{z} \mid \mathbf{x}) \propto \prod_{z \in \mathbf{z}} e^{\theta \cdot \phi(z, \mathbf{x})} \qquad (1)$$

where z is a directed edge contained in the parse tree z and φ is a feature function. In the fully supervised experiments we run for comparison, parameter estimation is performed by stochastic gradient ascent on the conditional likelihood function, similar to maximum entropy models or conditional random fields. One needs to be able to compute expectations of the features φ(z, x) under the distribution pθ(z | x). A version of the inside-outside algorithm (Lee and Choi, 1997) performs this computation. Viterbi decoding is done using Eisner's algorithm (Eisner, 1996).

We also used a generative model based on the dependency model with valence (Klein and Manning, 2004). Under this model, the probability of a particular parse z and a sentence with part of speech tags x is given by

$$p_\theta(\mathbf{z}, \mathbf{x}) = p_{root}(r(\mathbf{x})) \cdot \Big( \prod_{z \in \mathbf{z}} p_{\neg stop}(z_p, z_d, v_z)\, p_{child}(z_p, z_d, z_c) \Big) \cdot \Big( \prod_{x \in \mathbf{x}} p_{stop}(x, left, v_l)\, p_{stop}(x, right, v_r) \Big) \qquad (2)$$

where r(x) is the part of speech tag of the root of the parse tree z, z is an edge from parent zp to child zc in direction zd, either left or right, and vz indicates valency: false if zp has no other children further from it in direction zd than zc, true otherwise. The valencies vr/vl are marked as true if x has any children on the left/right in z, false otherwise.
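As an illustration of how Equation 2 scores a parse under the valence model, consider the sketch below; the parameter tables and tree encoding are hypothetical stand-ins chosen for readability, and the valence conventions follow the definitions in the preceding paragraph.

    import math

    def dmv_log_prob(tags, parse, params):
        # tags:   list of POS tags for the sentence x.
        # parse:  dict mapping child index -> parent index; the root has parent -1.
        # params: hypothetical probability tables, e.g.
        #   params['root'][tag], params['child'][(ptag, dir, ctag)],
        #   params['stop'][(tag, dir, valence_bool)].
        root = next(c for c, p in parse.items() if p == -1)
        logp = math.log(params['root'][tags[root]])
        for c, p in parse.items():
            if p == -1:
                continue
            d = 'left' if c < p else 'right'
            # v_z: true iff p has another child farther away than c in direction d
            v = any(pp == p and ((d == 'left' and cc < c) or
                                 (d == 'right' and cc > c))
                    for cc, pp in parse.items())
            logp += math.log(1.0 - params['stop'][(tags[p], d, v)])   # p_not_stop
            logp += math.log(params['child'][(tags[p], d, tags[c])])  # p_child
        for i in range(len(tags)):
            for d in ('left', 'right'):
                # v_l / v_r: true iff token i has any child in direction d
                v = any(pp == i and ((d == 'left' and cc < i) or
                                     (d == 'right' and cc > i))
                        for cc, pp in parse.items())
                logp += math.log(params['stop'][(tags[i], d, v)])
        return logp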

4 Posterior Regularization

Graça et al. (2008) introduce an estimation framework that incorporates side-information into unsupervised problems in the form of linear constraints on posterior expectations. In grammar transfer, our basic constraint is of the form: the expected proportion of conserved edges in a sentence pair is at least η (the exact proportion we used was 0.9, which was determined using unlabeled data as described in Section 5). Specifically, let Cx be the set of directed edges projected from English for a given sentence x; then, given a parse z, the proportion of conserved edges is $f(\mathbf{x}, \mathbf{z}) = \frac{1}{|C_\mathbf{x}|} \sum_{z \in \mathbf{z}} \mathbf{1}(z \in C_\mathbf{x})$, and the expected proportion of conserved edges under distribution p(z | x) is

$$E_p[f(\mathbf{x}, \mathbf{z})] = \frac{1}{|C_\mathbf{x}|} \sum_{z \in C_\mathbf{x}} p(z \mid \mathbf{x}).$$

The posterior regularization framework (Graça et al., 2008) was originally defined for generative unsupervised learning. The standard objective is to minimize the negative marginal log-likelihood of the data: $\hat{E}[-\log p_\theta(\mathbf{x})] = \hat{E}[-\log \sum_{\mathbf{z}} p_\theta(\mathbf{z}, \mathbf{x})]$ over the parameters θ (we use Ê to denote expectation over the sample sentences x). We typically also add a standard regularization term on θ, resulting from a parameter prior −log p(θ) = R(θ), where p(θ) is Gaussian for the MSTParser models and Dirichlet for the valence model.

To introduce supervision into the model, we define a set Qx of distributions over the hidden variables z satisfying the desired posterior constraints in terms of linear equalities or inequalities on feature expectations (we use inequalities in this paper):

$$Q_\mathbf{x} = \{ q(\mathbf{z}) : E_q[f(\mathbf{x}, \mathbf{z})] \le \mathbf{b} \}.$$
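Since f factors over edges, the expectation Ep[f(x, z)] requires only posterior edge marginals, which the inside-outside algorithm supplies. A minimal sketch, assuming the marginals are already given as a dictionary:

    def expected_conserved_proportion(edge_marginals, conserved):
        """E_p[f(x, z)] = (1/|C_x|) * sum over conserved edges of p(edge | x).

        edge_marginals: dict mapping a directed edge (parent, child) to its
                        posterior probability of appearing in the parse.
        conserved:      set C_x of edges projected from the source parse.
        """
        if not conserved:
            return 0.0
        return sum(edge_marginals.get(e, 0.0) for e in conserved) / len(conserved)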

Basic Uni-gram Features:
  xi-word, xi-pos
  xi-word
  xi-pos
  xj-word, xj-pos
  xj-word
  xj-pos

Basic Bi-gram Features:
  xi-word, xi-pos, xj-word, xj-pos
  xi-pos, xj-word, xj-pos
  xi-word, xj-word, xj-pos
  xi-word, xi-pos, xj-pos
  xi-word, xi-pos, xj-word
  xi-word, xj-word
  xi-pos, xj-pos

In Between POS Features:
  xi-pos, b-pos, xj-pos

Surrounding Word POS Features:
  xi-pos, xi-pos+1, xj-pos-1, xj-pos
  xi-pos-1, xi-pos, xj-pos-1, xj-pos
  xi-pos, xi-pos+1, xj-pos, xj-pos+1
  xi-pos-1, xi-pos, xj-pos, xj-pos+1

Table 1: Features used by the MSTParser. For each edge (i, j), xi-word is the parent word and xj-word is the child word, analogously for POS tags. The +1 and -1 denote preceding and following tokens in the sentence, while b denotes tokens between xi and xj.
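As a hedged illustration of how templates like those in Table 1 expand into feature strings for a candidate edge (the string naming scheme here is ours, not MSTParser's, and only a representative subset of templates is shown):

    def edge_features(words, tags, i, j):
        """Feature strings for a candidate edge (i, j) = (parent, child)."""
        feats = [
            # basic uni-gram and bi-gram templates
            'pw=%s' % words[i], 'pp=%s' % tags[i],
            'cw=%s' % words[j], 'cp=%s' % tags[j],
            'pw,pp,cw,cp=%s,%s,%s,%s' % (words[i], tags[i], words[j], tags[j]),
            'pp,cp=%s,%s' % (tags[i], tags[j]),
        ]
        # in-between POS features: one per token strictly between i and j
        lo, hi = min(i, j), max(i, j)
        for b in range(lo + 1, hi):
            feats.append('pp,bp,cp=%s,%s,%s' % (tags[i], tags[b], tags[j]))
        # one surrounding-POS template, guarding sentence boundaries
        if i + 1 < len(tags) and j - 1 >= 0:
            feats.append('pp,pp+1,cp-1,cp=%s,%s,%s,%s' %
                         (tags[i], tags[i + 1], tags[j - 1], tags[j]))
        return feats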

In this paper, for example, we use the conserved-edge-proportion constraint as defined above. The marginal log-likelihood objective is then modified with a penalty for deviation from the desired set of distributions, measured by the KL-divergence from the set Qx, $KL(Q_\mathbf{x} \,\|\, p_\theta(\mathbf{z}|\mathbf{x})) = \min_{q \in Q_\mathbf{x}} KL(q(\mathbf{z}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}))$. The generative learning objective is to minimize:

$$\hat{E}[-\log p_\theta(\mathbf{x})] + R(\theta) + \hat{E}[KL(Q_\mathbf{x} \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}))].$$

For discriminative estimation (Ganchev et al., 2008), we do not attempt to model the marginal distribution of x, so we simply have the two regularization terms:

$$R(\theta) + \hat{E}[KL(Q_\mathbf{x} \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}))].$$

Note that the idea of regularizing moments is related to the generalized expectation criteria algorithm of Mann and McCallum (2007), as we discuss in the related work section below. In general, the objectives above are not convex in θ. To optimize these objectives, we follow an Expectation Maximization-like scheme. Recall that standard EM iterates two steps: an E-step computes a probability distribution over the model's hidden variables (posterior probabilities), and an M-step updates the model's parameters based on that distribution. The posterior-regularized EM algorithm leaves the M-step unchanged, but involves projecting the posteriors onto a constraint set after they are computed for each sentence x:

$$\arg\min_{q} \; KL(q(\mathbf{z}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})) \quad \text{s.t.} \quad E_q[f(\mathbf{x}, \mathbf{z})] \le \mathbf{b}, \qquad (3)$$

where pθ(z|x) are the posteriors. The new posteriors q(z) are used to compute sufficient statistics for this instance and hence to update the model's parameters in the M-step for either the generative or discriminative setting.

The optimization problem in Equation 3 can be efficiently solved in its dual formulation:

$$\arg\min_{\lambda \ge 0} \; \mathbf{b}^\top \lambda + \log \sum_{\mathbf{z}} p_\theta(\mathbf{z} \mid \mathbf{x}) \exp\{-\lambda^\top f(\mathbf{x}, \mathbf{z})\}. \qquad (4)$$

Given λ, the primal solution is given by q(z) = pθ(z | x) exp{−λ⊤f(x, z)}/Z, where Z is a normalization constant. There is one dual variable per expectation constraint, and we can optimize them by projected gradient descent, similar to log-linear model estimation. The gradient with respect to λ is given by b − Eq[f(x, z)], so it involves computing expectations under the distribution q(z). This remains tractable as long as features factor by edge, $f(\mathbf{x}, \mathbf{z}) = \sum_{z \in \mathbf{z}} f(\mathbf{x}, z)$, because that ensures that q(z) will have the same form as pθ(z | x). Furthermore, since the constraints are per instance, we can use an incremental or online version of EM (Neal and Hinton, 1998), where we update the parameters θ after the posterior-constrained E-step on each instance x.
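A minimal sketch of the projected-gradient solver for the dual in Equation 4 follows; the callable standing in for the inside-outside computation, as well as the step size and iteration count, are assumptions of this illustration rather than the exact settings we used.

    import numpy as np

    def project_posteriors(expectations_under_q, b, steps=50, rate=0.1):
        """Projected gradient descent on the dual (Equation 4) of Equation 3.

        expectations_under_q: callable lam -> E_q[f(x, z)] as a vector, where
            q(z) is proportional to p_theta(z | x) exp(-lam . f(x, z)); for
            edge-factored f this expectation comes from running inside-outside
            on the reweighted edge scores.
        b: vector of bounds for the constraints E_q[f(x, z)] <= b.
        """
        lam = np.zeros(len(b))
        for _ in range(steps):
            grad = b - expectations_under_q(lam)      # gradient of the dual
            lam = np.maximum(0.0, lam - rate * grad)  # project onto lam >= 0
        return lam

The returned λ defines the projected posteriors q(z) ∝ pθ(z | x) exp{−λ⊤f(x, z)} used to compute sufficient statistics in the constrained E-step.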

5 Experiments

We conducted experiments on two languages, Bulgarian and Spanish, using each of the parsing models. The Bulgarian experiments transfer a parser from English to Bulgarian, using the OpenSubtitles corpus (Tiedemann, 2007). The Spanish experiments transfer from English to Spanish using the Spanish portion of the Europarl corpus (Koehn, 2005). For both corpora, we performed word alignments with the open source PostCAT (Graça et al., 2009) toolkit. We used the Tokyo tagger (Tsuruoka and Tsujii, 2005) to POS tag the English tokens, and generated parses using the first-order model of McDonald et al. (2005) with projective decoding, trained on sections 2-21 of the Penn treebank with dependencies extracted using the head rules of Yamada and Matsumoto (2003b). For Bulgarian we trained the Stanford POS tagger (Toutanova et al., 2003) on the BulTreeBank corpus from CoNLL X.

              Discriminative model                       Generative model
              Bulgarian                  Spanish         Bulgarian                  Spanish
              no rules  2 rules  7 rules  no rules  3 rules  no rules  2 rules  7 rules  no rules  3 rules
  Baseline    63.8      72.1     72.6     67.6      69.0     66.5      69.1     71.0     68.2      71.3
  Post.Reg.   66.9      77.5     78.3     70.6      72.3     67.8      70.7     70.8     69.5      72.8

Table 2: Comparison between transferring a single tree of edges and transferring all possible projected edges. The transfer models were trained on 10k sentences of length up to 20; all models were tested on CoNLL train sentences of up to 10 words. Punctuation was stripped at train time.

The Spanish Europarl data was POS tagged with the FreeLing language analyzer (Atserias et al., 2006). The discriminative model used the same features as the MSTParser, summarized in Table 1.

In order to evaluate our method, we use a baseline inspired by Hwa et al. (2005). The baseline constructs a full parse tree from the incomplete and possibly conflicting transferred edges using a simple random process. We start with no edges and try to add edges one at a time, verifying at each step that it is possible to complete the tree. We first try to add the transferred edges in random order, then for each orphan node we try all possible parents (both in random order). We then use this full labeling as supervision for a parser. Note that this baseline is very similar to the first iteration of our model, since for a large corpus the different random choices made in different sentences tend to smooth each other out. We also tried to create rules for the adoption of orphans, but the simple rules we tried added bias and performed worse than the baseline we report. Table 2 shows attachment accuracy of our method and the baseline for both language pairs under several conditions. By attachment accuracy we mean the fraction of words assigned the correct parent. The experimental details are described in this section. Link-left baselines for these corpora are much lower: 33.8% and 27.9% for Bulgarian and Spanish respectively.

5.1 Preprocessing

Preliminary experiments showed that our word alignments were not always appropriate for syntactic transfer, even when they were correct for translation. For example, the English "bike/V" could be translated in French as "aller/V en vélo/N", where the word "bike" would be aligned with "vélo". While this captures some of the semantic information shared by the two languages, we have no expectation that the noun "vélo" will have a similar syntactic behavior to the verb "bike". To prevent such false transfer, we filter out alignments between incompatible POS tags. In both language pairs, filtering out noun-verb alignments gave the biggest improvement.

Both corpora also contain sentence fragments, either because of question responses or fragmented speech in movie subtitles, or because of voting announcements and similar formulaic sentences in the parliamentary proceedings. We overcome this problem by filtering out sentences that do not have a verb as the English root or for which the English root is not aligned to a verb in the target language. For the subtitles corpus we also remove sentences that end in an ellipsis or contain more than one comma. Finally, following Klein and Manning (2004), we strip out punctuation from the sentences. For the discriminative model this did not affect results significantly but improved them slightly in most cases. We found that the generative model gets confused by punctuation and tends to predict that periods at the end of sentences are the parents of words in the sentence.

Our basic model uses constraints of the form: the expected proportion of conserved edges in a sentence pair is at least η = 90%.¹

¹ We chose η in the following way: we split the unlabeled parallel text into two portions. We trained models with different η on one portion and ran them on the other portion. We chose the model with the highest fraction of conserved constraints on the second portion.

5.2 No Language-Specific Rules

We call the generic model described above "no-rules" to distinguish it from the language-specific constraints we introduce in the sequel. The no rules columns of Table 2 summarize the performance in this basic setting. Discriminative models outperform the generative models in the majority of cases. The left panel of Table 3 shows the most common errors by child POS tag, as well as by true parent and guessed parent POS tag. Figure 2 shows that the discriminative model continues to improve with more transfer-type data up to at least 40 thousand sentences.

Figure 2: Learning curve of the discriminative no-rules transfer model on Bulgarian bitext, testing on CoNLL train sentences of up to 10 words.

5.3 Annotation guidelines and constraints

Using the straightforward approach outlined above is a dramatic improvement over the standard link-left baseline (and the unsupervised generative model, as we discuss below); however, it doesn't have any information about the annotation guidelines used for the testing corpus. For example, the Bulgarian corpus has an unusual treatment of nonfinite clauses. Figure 4 shows an example. We see that the "da" is the parent of both the verb and its object, which is different from the treatment in the English corpus.

We propose to deal with these annotation dissimilarities by creating very simple rules. For Spanish, we have three rules. The first rule sets main verbs to dominate auxiliary verbs. Specifically, whenever an auxiliary precedes a main verb, the main verb becomes its parent and adopts its children; if there is only one main verb it becomes the root of the sentence; main verbs also become parents of pronouns, adverbs, and common nouns that directly precede auxiliary verbs. By adopting children we mean that we change the parent of transferred edges to be the adopting node. The second Spanish rule states that the first element of an adjective-noun or noun-adjective pair dominates the second; the first element also adopts the children of the second element. The third and final Spanish rule sets all prepositions to be children of the first main verb in the sentence, unless the preposition is a "de" located between two noun phrases. In this latter case, we set the closest noun in the first of the two noun phrases as the preposition's parent.

Figure 3: A Spanish example where an auxiliary verb dominates the main verb.

For Bulgarian the first rule is that "da" should dominate all words until the next verb and adopt their noun, preposition, particle and adverb children. The second rule is that auxiliary verbs should dominate main verbs and adopt their children. We have a list of 12 Bulgarian auxiliary verbs. The "seven rules" experiments add rules for 5 more words similar to the rule for "da", specifically "qe", "li", "kakvo", "ne", "za". Table 3 compares the errors for different linguistic rules. When we train using the "da" rule and the rules for auxiliary verbs, the model learns that main verbs attach to auxiliary verbs and that "da" dominates its nonfinite clause. This causes an improvement in the attachment of verbs, and also drastically reduces words being attached to verbs instead of particles. The latter is expected because "da" is analyzed as a particle in the Bulgarian POS tagset. We see an improvement in root/verb confusions since "da" is sometimes erroneously attached to the following verb rather than being the root of the sentence.

The rightmost panel of Table 3 shows a similar analysis when we also use the rules for the five other closed-class words. We see an improvement in attachments in all categories, but no qualitative change is visible. The reason for this is probably that these words are relatively rare, but by encouraging the model to add an edge, a rule also rules out incorrect edges that would cross it. Consequently we are seeing improvements not only directly from the constraints we enforce but also indirectly through the types of edges that tend to get ruled out.
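To illustrate how such a rule can operate on transferred edges, here is a schematic rendering of the Bulgarian auxiliary-verb rule; the representation and the rule's exact scope are simplifying assumptions for illustration, not the implementation used in our system.

    def apply_aux_rule(tags, edges, aux_indices):
        """Illustrative rewrite for the rule "auxiliary verbs should dominate
        main verbs and adopt their children".

        tags:        list of POS tags for the target sentence.
        edges:       set of projected (parent, child) index pairs.
        aux_indices: token positions of auxiliary verbs (from a closed list).
        """
        new_edges = set(edges)
        for aux in aux_indices:
            # find a main verb the auxiliary is currently attached under
            for parent, child in list(new_edges):
                if child == aux and tags[parent] == 'V':
                    main = parent
                    # reverse the edge: the auxiliary dominates the main verb
                    new_edges.discard((main, aux))
                    new_edges.add((aux, main))
                    # the auxiliary adopts the main verb's other children
                    for p, c in list(new_edges):
                        if p == main and c != aux:
                            new_edges.discard((p, c))
                            new_edges.add((aux, c))
        return new_edges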

Figure 4: An example where transfer fails because of different handling of reflexives and nonfinite clauses. The alignment links provide correct glosses for Bulgarian words. "Bih" is a conditional marker while "se" is a reflexive marker.

No Rules:
  child POS  acc(%)  errors   |  true/guess parent POS  errors
  V          65.2    2237     |  T/V                    2175
  N          73.8    1938     |  V/V                    1305
  P          58.5    1705     |  N/V                    1112
  R          70.3    961      |  root/V                 555

Two Rules:
  child POS  acc(%)  errors   |  true/guess parent POS  errors
  N          78.7    1572     |  N/V                    938
  P          70.2    1224     |  V/V                    734
  V          84.4    1002     |  V/N                    529
  R          79.3    670      |  N/N                    376

Seven Rules:
  child POS  acc(%)  errors   |  true/guess parent POS  errors
  N          79.3    1532     |  N/V                    1116
  P          75.7    998      |  V/V                    560
  R          69.3    993      |  V/N                    507
  V          86.2    889      |  N/N                    450

Table 3: Top 4 discriminative parser errors by child POS tag and true/guess parent POS tag in the Bulgarian CoNLL train data of length up to 10. Training with no language-specific rules (left); two rules (center); and seven rules (right). POS meanings: V verb, N noun, P pronoun, R preposition, T particle. Accuracies are by child or parent truth/guess POS tag.

Figure 5: Comparison to parsers with supervised estimation and transfer. Top: Generative. Bottom: Discriminative. Left: Bulgarian. Right: Spanish. The transfer models were trained on 10k sentences all of length at most 20; all models were tested on CoNLL train sentences of up to 10 words. The x-axis shows the number of examples used to train the supervised model. Boxes show first and third quartile, whiskers extend to max and min, with the line passing through the median. Supervised experiments used 30 random samples from CoNLL train.

5.4 Generative parser

The generative model we use is a state of the art model for unsupervised parsing and is our only fully unsupervised baseline. As smoothing we add a very small backoff probability of 4.5 × 10⁻⁵ to each learned parameter. Unfortunately, we found generative model performance was disappointing overall. The maximum unsupervised accuracy it achieved on the Bulgarian data is 47.6% with initialization from Klein and Manning (2004), and this result is not stable. Changing the initialization parameters, training sample, or maximum sentence length used for training drastically affected the results, even for samples with several thousand sentences. When we use the transferred information to constrain the learning, EM stabilizes and achieves much better performance. Even setting all parameters equal at the outset does not prevent the model from learning the dependency structure of the aligned language. The top panels in Figure 5 show the results in this setting. We see that performance is still always below the accuracy achieved by supervised training on 20 annotated sentences. However, the improvement in stability makes the algorithm much more usable. As we shall see below, the discriminative parser performs even better than the generative model.

5.5 Discriminative parser

We trained our discriminative parser for 100 iterations of online EM with a Gaussian prior variance of 100. Results for the discriminative parser are shown in the bottom panels of Figure 5. The supervised experiments are given to provide context for the accuracies. For Bulgarian, we see that without any hints about the annotation guidelines, the transfer system performs better than an unsupervised parser,

comparable to a supervised parser trained on 10 sentences. However, if we specify just the two rules for "da" and verb conjugations, performance jumps to that of training on 60-70 fully labeled sentences. If we have just a little more prior knowledge about how closed-class words are handled, performance jumps above the 140 fully labeled sentence equivalent.

We observed another desirable property of the discriminative model. While the generative model can get confused and perform poorly when the training data contains very long sentences, the discriminative parser does not appear to have this drawback. In fact we observed that as the maximum training sentence length increased, the parsing performance also improved.

6 Related Work

Our work most closely relates to Hwa et al. (2005), who proposed to learn generative dependency grammars using Collins' parser (Collins, 1999) by constructing full target parses via projected dependencies and completion/transformation rules. Hwa et al. (2005) found that transferring dependencies directly was not sufficient to get a parser with reasonable performance, even when both the source language parses and the word alignments are performed by hand. They adjusted for this by introducing on the order of one or two dozen language-specific transformation rules to complete target parses for unaligned words and to account for diverging annotation rules. Transferring from English to Spanish in this way, they achieve 72.1%, and transferring to Chinese they achieve 53.9%.

Our learning method is very closely related to the work of Mann and McCallum (2007; 2008), who concurrently developed the idea of using penalties based on posterior expectations of features not necessarily in the model in order to guide learning. They call their method generalized expectation constraints or alternatively expectation regularization. In this volume, Druck et al. (2009) use this framework to train a dependency parser based on constraints stated as corpus-wide expected values of linguistic rules. The rules select a class of edges (e.g. auxiliary verb to main verb) and require that the expectation of these be close to some value. The main difference between this work and theirs is the source of the information (a linguistic informant vs. cross-lingual projection). Also, we define our regularization with respect to inequality constraints (the model is not penalized for exceeding the required model expectations), while they require moments to be close to an estimated value. We suspect that the two learning methods could perform comparably when they exploit similar information.

7 Conclusion

In this paper, we proposed a novel and effective learning scheme for transferring dependency parses across bitext. By enforcing projected dependency constraints approximately and in expectation, our framework allows robust learning from noisy, partially supervised target sentences, instead of committing to entire parses. We show that discriminative training generally outperforms generative approaches even in this very weakly supervised setting. By adding easily specified language-specific constraints, our models begin to rival strong supervised baselines for small amounts of data. Our framework can handle a wide range of constraints, and we are currently exploring richer syntactic constraints that involve conservation of multiple edge constructions as well as constraints on conservation of surface length of dependencies.

Acknowledgments

This work was partially supported by an Integrative Graduate Education and Research Traineeship grant from the National Science Foundation (NSFIGERT 0504487), by ARO MURI SUBTLE W911NF-07-1-0216, and by the European Projects AsIsKnown (FP6-028044) and LTfLL (FP7-212578).

References

A. Abeillé. 2003. Treebanks: Building and Using Parsed Corpora. Springer.

H. Alshawi, S. Bangalore, and S. Douglas. 2000. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics, 26(1).

J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proc. LREC, Genoa, Italy.

P. F. Brown, S. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1994. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

C. Chelba, D. Engle, F. Jelinek, V. Jimenez, S. Khudanpur, L. Mangu, H. Printz, E. Ristad, R. Rosenfeld, A. Stolcke, and D. Wu. 1997. Structure and performance of a dependency language model. In Proc. Eurospeech.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

G. Druck, G. Mann, and A. McCallum. 2009. Semi-supervised learning of dependency parsers using generalized expectation criteria. In Proc. ACL.

J. Eisner. 1996. Three new probabilistic models for dependency parsing: an exploration. In Proc. CoLing.

H. Fox. 2002. Phrasal cohesion and statistical machine translation. In Proc. EMNLP, pages 304-311.

K. Ganchev, J. Graça, J. Blitzer, and B. Taskar. 2008. Multi-view learning over structured and non-identical outputs. In Proc. UAI.

J. Graça, K. Ganchev, and B. Taskar. 2008. Expectation maximization and posterior constraints. In Proc. NIPS.

J. Graça, K. Ganchev, and B. Taskar. 2009. PostCAT - posterior constrained alignment toolkit. In The Third Machine Translation Marathon.

A. Haghighi, A. Ng, and C. Manning. 2005. Robust textual inference via graph matching. In Proc. EMNLP.

R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311-325.

D. Klein and C. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. ACL.

P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

S. Lee and K. Choi. 1997. Reestimation and best-first parsing algorithm for probabilistic dependency grammar. In WVLC-5, pages 41-55.

G. Mann and A. McCallum. 2007. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proc. ICML.

G. Mann and A. McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proc. ACL, pages 870-878.

R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proc. ACL, pages 91-98.

I. Mel'čuk. 1988. Dependency Syntax: Theory and Practice. SUNY Press.

P. Merlo, S. Stevenson, V. Tsang, and G. Allaria. 2002. A multilingual paradigm for automatic verb classification. In Proc. ACL.

R. M. Neal and G. E. Hinton. 1998. A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer.

J. Nivre and J. Nilsson. 2005. Pseudo-projective dependency parsing. In Proc. ACL.

J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. EMNLP-CoNLL.

F. J. Och and H. Ney. 2000. Improved statistical alignment models. In Proc. ACL.

C. Quirk, A. Menezes, and C. Cherry. 2005. Dependency treelet translation: syntactically informed phrasal SMT. In Proc. ACL.

L. Shen, J. Xu, and R. Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proc. ACL.

N. Smith and J. Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proc. ACL.

J. Tiedemann. 2007. Building a multilingual parallel subtitle corpus. In Proc. CLIN.

K. Toutanova, D. Klein, C. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. HLT-NAACL.

Y. Tsuruoka and J. Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proc. HLT/EMNLP.

H. Yamada and Y. Matsumoto. 2003a. Statistical dependency analysis with support vector machines. In Proc. IWPT, pages 195-206.

H. Yamada and Y. Matsumoto. 2003b. Statistical dependency analysis with support vector machines. In Proc. IWPT.

D. Yarowsky and G. Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proc. NAACL.

D. Yarowsky, G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proc. HLT.
