Three Dependency-and-Boundary Models for Grammar Induction
Valentin I. Spitkovsky (Stanford University and Google Inc.) [email protected]
Hiyan Alshawi (Google Inc., Mountain View, CA 94043) [email protected]
Daniel Jurafsky (Stanford University, Stanford, CA 94305) [email protected]

Abstract

We present a new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation. We build on three intuitions that are explicit in phrase-structure grammars but only implicit in standard dependency formulations: (i) Distributions of words that occur at sentence boundaries — such as English determiners — resemble constituent edges. (ii) Punctuation at sentence boundaries further helps distinguish full sentences from fragments like headlines and titles, allowing us to model grammatical differences between complete and incomplete sentences. (iii) Sentence-internal punctuation boundaries help with longer-distance dependencies, since punctuation correlates with constituent edges. Our models induce state-of-the-art dependency grammars for many languages without special knowledge of optimal input sentence lengths or biased, manually-tuned initializers.

1 Introduction

Natural language is ripe with all manner of boundaries at the surface level that align with hierarchical syntactic structure. From the significance of function words (Berant et al., 2006) and punctuation marks (Seginer, 2007; Ponvert et al., 2010) as separators between constituents in longer sentences — to the importance of isolated words in children's early vocabulary acquisition (Brent and Siskind, 2001) — word boundaries play a crucial role in language learning. We will show that boundary information can also be useful in dependency grammar induction models, which traditionally focus on head rather than fringe words (Carroll and Charniak, 1992).

Figure 1: A partial analysis of our running example: the sentence "The check is in the mail." (tagged DT NN VBZ IN DT NN), with "The check" bracketed as the Subject and "the mail" as the Object.

Consider the example in Figure 1. Because the determiner (DT) appears at the left edge of the sentence, it should be possible to learn that determiners may generally be present at left edges of phrases. This information could then be used to correctly parse the sentence-internal determiner in "the mail". Similarly, the fact that the noun head (NN) of the object "the mail" appears at the right edge of the sentence could help identify the noun "check" as the right edge of the subject NP. As with jigsaw puzzles, working inwards from boundaries helps determine sentence-internal structures of both noun phrases, neither of which would be quite so clear if viewed separately.

Furthermore, properties of noun-phrase edges are partially shared with the prepositional- and verb-phrase units that contain these nouns. Because typical head-driven grammars model valence separately for each class of head, however, they cannot see that the left fringe boundary of the verb phrase, "The check", is shared with that of its daughter, "check". Neither of these insights is available to traditional dependency formulations, which could learn from the boundaries of this sentence only that determiners might have no left-dependents and that nouns might have no right-dependents.

We propose a family of dependency parsing models that are capable of inducing longer-range implications from sentence edges than just the fertilities of their fringe words. Our ideas conveniently lend themselves to implementations that can reuse much of the standard grammar induction machinery, including efficient dynamic programming routines for the relevant expectation-maximization algorithms.
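To make intuition (i) and the Figure 1 reasoning concrete, the sketch below tallies which word classes occur at the left and right edges of sentences in a tiny POS-tagged sample. It is an illustration of ours, not the authors' code; the toy corpus and the function name are invented for this example.

```python
from collections import Counter

def edge_class_counts(tagged_sentences):
    """Count how often each word class occurs at the left and right
    edge of a sentence (intuition (i): boundary distributions
    resemble constituent edges)."""
    left, right = Counter(), Counter()
    for sent in tagged_sentences:
        if sent:
            left[sent[0][1]] += 1    # class of the first word
            right[sent[-1][1]] += 1  # class of the last word
    return left, right

# A made-up POS-tagged sample, including the running example.
corpus = [
    [("The", "DT"), ("check", "NN"), ("is", "VBZ"),
     ("in", "IN"), ("the", "DT"), ("mail", "NN")],
    [("A", "DT"), ("reply", "NN"), ("is", "VBZ"),
     ("on", "IN"), ("the", "DT"), ("desk", "NN")],
]

left, right = edge_class_counts(corpus)
print(left)   # DT dominates left edges:  evidence that DT starts phrases
print(right)  # NN dominates right edges: evidence that NN ends phrases
```

Counts gathered at sentence edges in this way are the kind of boundary signal that can then be carried over to sentence-internal phrases such as "the mail".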
2 The Dependency and Boundary Models

Our models follow a standard generative story for head-outward automata (Alshawi, 1996a), restricted to the split-head case[1] (see below), over lexical word classes {c_w}: first, a sentence root c_r is chosen, with probability P_ATTACH(c_r | ⋄; L); ⋄ is a special start symbol that, by convention (Klein and Manning, 2004; Eisner, 1996), produces exactly one child, to its left. Next, the process recurses. Each (head) word c_h generates a left-dependent with probability 1 − P_STOP(· | L; ...), where the dots represent additional parameterization on which the decision may be conditioned. If the child is indeed generated, its identity c_d is chosen with probability P_ATTACH(c_d | c_h; ...), influenced by the identity of the parent c_h and possibly other parameters (again represented by dots). The child then generates its own subtree recursively and the whole process continues, moving away from the head, until c_h fails to generate a left-dependent. At that point, an analogous procedure is repeated to c_h's right, this time using stopping factors P_STOP(· | R; ...). All parse trees derived in this way are guaranteed to be projective and can be described by split-head grammars.

[1] Unrestricted head-outward automata are strictly more powerful (e.g., they recognize the language a^n b^n in finite state) than the split-head variants, which process one side before the other.
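The generative story above translates almost directly into code. The following sketch is a minimal illustration, not the authors' implementation: it keys stop decisions on the head's class, direction, and adjacency (one simple instance of the "additional parameterization" the dots stand for), and it assumes plain dictionaries p_attach and p_stop supplied by the caller.

```python
import random

ROOT = "<ROOT>"  # stands in for the special start symbol, which by
                 # convention generates exactly one child, to its left

def generate_subtree(head, p_attach, p_stop, classes):
    """Head-outward, split-head generation: the head spawns left-dependents
    until a stop decision succeeds, then, independently, right-dependents."""
    tree = {"head": head, "L": [], "R": []}
    for side in ("L", "R"):
        adj = True  # adjacent until the first child on this side is generated
        # keep generating with probability 1 - p_stop(...)
        while random.random() >= p_stop[(head, side, adj)]:
            weights = [p_attach[(c, head, side)] for c in classes]
            child = random.choices(classes, weights=weights)[0]
            tree[side].append(generate_subtree(child, p_attach, p_stop, classes))
            adj = False  # later children on this side are non-adjacent
    return tree

def generate_tree(p_attach, p_stop, classes):
    """Choose the sentence root as the start symbol's sole left child,
    then let it recurse head-outward."""
    weights = [p_attach[(c, ROOT, "L")] for c in classes]
    root = random.choices(classes, weights=weights)[0]
    return generate_subtree(root, p_attach, p_stop, classes)
```

Under DBM-1 (introduced below), the stop table would instead be keyed on direction, adjacency, and the class of the current fringe word; the recursion itself is unchanged.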
Instances of these split-head automata have been heavily used in grammar induction (Paskin, 2001b; Klein and Manning, 2004; Headden et al., 2009, inter alia), in part because they allow for efficient implementations (Eisner and Satta, 1999, §8) of the inside-outside re-estimation algorithm (Baker, 1979). The basic tenet of split-head grammars is that every head word generates its left-dependents independently of its right-dependents. This assumption implies, for instance, that words' left- and right-valences — their numbers of children to each side — are also independent. But it does not imply that descendants that are closer to the head cannot influence the generation of farther dependents on the same side. Nevertheless, many popular grammars for unsupervised parsing behave as if a word had to generate all of its children (to one side) — or at least their count — before allowing any of these children themselves to recurse. For example, Klein and Manning's (2004) dependency model with valence (DMV) could be implemented as both a head-outward and a head-inward automaton. (In fact, arbitrary permutations of siblings to a given side of their parent would not affect the likelihood of the modified tree, with the DMV.) We propose to make fuller use of split-head automata's head-outward nature by drawing on information in partially-generated parses, which contain useful predictors that, until now, had not been exploited even in featurized systems for grammar induction (Cohen and Smith, 2009; Berg-Kirkpatrick et al., 2010).

Some of these predictors, including the identity — or even the number (McClosky, 2008) — of already-generated siblings, can be prohibitively expensive in sentences above a short length k. For example, they break certain modularity constraints imposed by the charts used in O(k^3)-optimized algorithms (Paskin, 2001a; Eisner, 2000). However, in bottom-up parsing and training from text, everything about the yield — i.e., the ordered sequence of all already-generated descendants, on the side of the head that is in the process of spawning off an additional child — is not only known but also readily accessible. Taking advantage of this availability, we designed three new models for dependency grammar induction.

2.1 Dependency and Boundary Model One

DBM-1 conditions all stopping decisions on adjacency and the identity of the fringe word c_e — the currently-farthest descendant (edge) derived by head c_h in the given head-outward direction (dir ∈ {L, R}):

    P_STOP(· | dir; adj, c_e).

In the adjacent case (adj = T), c_h is deciding whether to have any children on a given side: a first child's subtree would be right next to the head, so the head and the fringe words coincide (c_h = c_e). In the non-adjacent case (adj = F), these will be different words and their classes will, in general, not be the same.[2] Thus, non-adjacent stopping decisions will be made independently of a head word's identity. Therefore, all word classes will be equally likely to continue to grow or not, for a specific proposed fringe boundary. For example, production of "The check is" involves two non-adjacent stopping decisions on the left: one by the noun check and one by the verb is, both of which stop after generating a first child. In DBM-1, both decisions are therefore conditioned on the same fringe class, DT, through the shared factor P_STOP(· | L; F, DT), as shown in Figure 2.

[2] Fringe words also differ from other standard dependency features (Eisner, 1996, §2.3): parse siblings and adjacent words.

Figure 2: Our running example — the simple sentence "The check is in the mail ." (tagged DT NN VBZ IN DT NN, with root symbol ⋄) and its unlabeled dependency parse structure's probability, as factored by DBM-1; the comments (// VBZ, // NN, // IN) specify the heads associated with the non-adjacent stopping probability factors:

    P = (1 − P_STOP(⋄ | L; T))                                  [P_STOP(⋄ | L; T) = 0 by convention]
        × P_ATTACH(VBZ | ⋄; L)
        × (1 − P_STOP(· | L; T, VBZ)) × P_ATTACH(NN | VBZ; L)
        × (1 − P_STOP(· | R; T, VBZ)) × P_ATTACH(IN | VBZ; R)
        × P_STOP(· | L; F, DT)                                  // VBZ
        × P_STOP(· | R; F, NN)                                  // VBZ
        × (1 − P_STOP(· | L; T, NN))^2 × P_ATTACH(DT | NN; L)^2
        × (1 − P_STOP(· | R; T, IN)) × P_ATTACH(NN | IN; R)
        × P_STOP(· | R; T, NN)^2
        × P_STOP(· | L; F, DT)^2                                // NN
        × P_STOP(· | L; T, IN)
        × P_STOP(· | R; F, NN)                                  // IN
        × P_STOP(· | L; T, DT)^2 × P_STOP(· | R; T, DT)^2
        × P_STOP(⋄ | L; F) × P_STOP(⋄ | R; T)                   [each = 1 by convention]

2.2 Dependency and Boundary Model Two

…the quantity Revenue: $3.57 billion, the time 1:11am, and the like — tend to be much shorter than complete sentences. The new root-attachment factors could further track that incomplete sentences generally lack verbs, in contrast to other short sentences, e.g., Excerpts follow:, Are you kidding?, Yes, he did., It's huge., Indeed it is., I said, 'NOW?', "Absolutely," he said., I am waiting., Mrs. Yeargin declined., McGraw-Hill was outraged., "It happens.", I'm OK, Jack., Who cares?, Never mind., and so on. All other attachment probabilities P_ATTACH(c_d | c_h; dir) remain unchanged, as in DBM-1. In practice, comp can indicate the presence of sentence-final punctuation.

2.3 Dependency and Boundary Model Three

DBM-3 adds further conditioning on punctuation …
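Returning to DBM-1, the factored probability in Figure 2 is a plain product of attachment factors, keyed by child, head, and direction, and stopping factors, keyed by direction, adjacency, and fringe class. The sketch below recomputes that product for the running example. The numerical values are placeholders of our own (in practice they would be estimated, e.g., by inside-outside re-estimation), and the three start-symbol stop factors, which the figure fixes by convention so that each contributes a factor of one, are omitted.

```python
import math

# Hypothetical DBM-1 parameters for the running example
# "The/DT check/NN is/VBZ in/IN the/DT mail/NN ." (values are made up).
P_ATTACH = {  # (child, head, dir) -> attachment probability
    ("VBZ", "<ROOT>", "L"): 0.4,
    ("NN", "VBZ", "L"): 0.5, ("IN", "VBZ", "R"): 0.3,
    ("DT", "NN", "L"): 0.6, ("NN", "IN", "R"): 0.7,
}
P_STOP = {  # (dir, adj, fringe_class) -> probability of stopping
    ("L", True, "VBZ"): 0.2, ("R", True, "VBZ"): 0.3,
    ("L", False, "DT"): 0.8, ("R", False, "NN"): 0.7,
    ("L", True, "NN"): 0.4, ("R", True, "NN"): 0.9,
    ("L", True, "IN"): 0.95, ("R", True, "IN"): 0.25,
    ("L", True, "DT"): 0.9, ("R", True, "DT"): 0.9,
}

# The factors of Figure 2, in the figure's order (start-symbol factors omitted).
factors = [
    P_ATTACH[("VBZ", "<ROOT>", "L")],      # root attachment
    1 - P_STOP[("L", True, "VBZ")], P_ATTACH[("NN", "VBZ", "L")],
    1 - P_STOP[("R", True, "VBZ")], P_ATTACH[("IN", "VBZ", "R")],
    P_STOP[("L", False, "DT")],            # "is" stops on the left; fringe is DT
    P_STOP[("R", False, "NN")],            # "is" stops on the right; fringe is NN
    (1 - P_STOP[("L", True, "NN")]) ** 2,  # both nouns take a left child ...
    P_ATTACH[("DT", "NN", "L")] ** 2,      # ... namely a determiner
    1 - P_STOP[("R", True, "IN")], P_ATTACH[("NN", "IN", "R")],
    P_STOP[("R", True, "NN")] ** 2,
    P_STOP[("L", False, "DT")] ** 2,       # both nouns stop on the left; fringe is DT
    P_STOP[("L", True, "IN")],
    P_STOP[("R", False, "NN")],            # "in" stops on the right; fringe is NN
    P_STOP[("L", True, "DT")] ** 2, P_STOP[("R", True, "DT")] ** 2,
]

print(math.prod(factors))  # the DBM-1 probability of the parse in Figure 2
```

Note how the two non-adjacent left stops by the nouns collapse into a single squared factor keyed only on the fringe class DT; this sharing across head classes is exactly what DBM-1's conditioning is designed to exploit.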