Dynamic Programming for Linear-Time Incremental Parsing
Liang Huang (USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, [email protected])
Kenji Sagae (USC Institute for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292, [email protected])

Abstract

Incremental parsing techniques such as shift-reduce have gained popularity thanks to their efficiency, but there remains a major problem: the search is greedy and only explores a tiny fraction of the whole space (even with beam search), as opposed to dynamic programming. We show that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging "equivalent" stacks based on feature values. Empirically, our algorithm yields up to a five-fold speedup over a state-of-the-art shift-reduce dependency parser with no loss in accuracy. Better search also leads to better learning, and our final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.

1 Introduction

In terms of search strategy, most parsing algorithms in current use for data-driven parsing can be divided into two broad categories: dynamic programming, which includes the dominant CKY algorithm, and greedy search, which includes most incremental parsing methods such as shift-reduce.[1] Both have pros and cons: the former performs an exact search (in cubic time) over an exponentially large space, while the latter is much faster (in linear time) and is psycholinguistically motivated (Frazier and Rayner, 1982), but its greedy nature may suffer from severe search errors, as it only explores a tiny fraction of the whole space even with a beam.

Can we combine the advantages of both approaches, that is, construct an incremental parser that runs in (almost) linear time, yet searches over a huge space with dynamic programming? Theoretically, the answer is negative, as Lee (2002) shows that context-free parsing can be used to compute matrix multiplication, where sub-cubic algorithms are largely impractical.

We instead propose a dynamic programming algorithm for shift-reduce parsing which runs in polynomial time in theory, but in linear time (with beam search) in practice. The key idea is to merge equivalent stacks according to feature functions, inspired by Earley parsing (Earley, 1970; Stolcke, 1995) and generalized LR parsing (Tomita, 1991). However, our formalism is more flexible and our algorithm more practical. Specifically, we make the following contributions:

• theoretically, we show that for a large class of modern shift-reduce parsers, dynamic programming is in fact possible and runs in polynomial time as long as the feature functions are bounded and monotonic (which almost always holds in practice);

• practically, dynamic programming is up to five times faster (with the same accuracy) than conventional beam search on top of a state-of-the-art shift-reduce dependency parser;

• as a by-product, dynamic programming can output a forest encoding exponentially many trees, out of which we can draw better and longer k-best lists than beam search can;

• finally, better and faster search also leads to better and faster learning. Our final parser achieves the best (unlabeled) accuracies that we are aware of in both English and Chinese among dependency parsers trained on the Penn Treebanks. Being linear-time, it is also much faster than most other parsers, even with a pure Python implementation.

For convenience of presentation and experimentation, we will focus on shift-reduce parsing for dependency structures in the remainder of this paper, though our formalism and algorithm can also be applied to phrase-structure parsing.

[1] McDonald et al. (2005b) is a notable exception: the MST algorithm is exact search but not dynamic programming.
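As a schematic preview of the stack-merging idea (our illustration only, not the paper's actual algorithm, which is developed later and keeps back-pointers so that merged states still encode a forest): if two states at the same step agree on the queue position j and on every feature value in f(j, S), every future action receives the same cost in both, so only the best-scoring state per signature needs to be kept. All names below are ours.

# Schematic preview of merging "equivalent" stacks (our illustration; the
# real algorithm additionally keeps back-pointers to encode a forest).

def merge(states, f):
    """Keep one best state per feature signature. States are
    (step, j, stack, cost) tuples; f(j, stack) is the feature function,
    so states with equal signatures score identically under all future
    actions. We assume the best state is the one with the highest
    accumulated dot-product cost (Section 2)."""
    best = {}
    for (step, j, stack, cost) in states:
        sig = (step, j, tuple(f(j, stack)))
        if sig not in best or cost > best[sig][3]:
            best[sig] = (step, j, stack, cost)
    return list(best.values())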
2 Shift-Reduce Parsing

2.1 Vanilla Shift-Reduce

Shift-reduce parsing performs a left-to-right scan of the input sentence, and at each step chooses one of two actions: either shift the current word onto the stack, or reduce the top two (or more) items at the end of the stack (Aho and Ullman, 1972). To adapt it to dependency parsing, we split the reduce action into two cases, re↶ and re↷, depending on which of the two items becomes the head after reduction. This procedure is known as "arc-standard" (Nivre, 2004), and has been engineered to achieve state-of-the-art parsing accuracy in Huang et al. (2009), which is also the reference parser in our experiments.[2]

More formally, we describe a parser configuration by a state ⟨j, S⟩, where S is a stack of trees s0, s1, ... (with s0 the top tree), and j is the queue head position (the current word q0 is wj). At each step, we choose one of three actions:

1. sh: move the head of the queue, wj, onto stack S as a singleton tree;
2. re↶: combine the top two trees on the stack, s0 and s1, and replace them with the tree s1↶s0;
3. re↷: combine the top two trees on the stack, s0 and s1, and replace them with the tree s1↷s0.

Note that the shorthand notation t↶t′ denotes the new tree formed by "attaching tree t as the leftmost child of the root of tree t′". This procedure can be summarized as the deductive system in Figure 1, where ℓ is the step, which denotes the number of actions accumulated and by which states are organized, and c is the cost. The shift cost ξ and the reduce costs λ and ρ are:

    ξ = w · f_sh(j, S)              (1)
    λ = w · f_re↶(j, S|s1|s0)       (2)
    ρ = w · f_re↷(j, S|s1|s0)       (3)

The parser runs in linear time, as there are exactly 2n−1 steps for a sentence of n words.

            input:  w0 ... wn−1

    axiom   0 : ⟨0, ε⟩ : 0

            ℓ : ⟨j, S⟩ : c
    sh      ────────────────────────────   (j < n)
            ℓ+1 : ⟨j+1, S|wj⟩ : c + ξ

            ℓ : ⟨j, S|s1|s0⟩ : c
    re↶     ────────────────────────────
            ℓ+1 : ⟨j, S|s1↶s0⟩ : c + λ

            ℓ : ⟨j, S|s1|s0⟩ : c
    re↷     ────────────────────────────
            ℓ+1 : ⟨j, S|s1↷s0⟩ : c + ρ

    goal    2n−1 : ⟨n, s0⟩ : c

Figure 1: Deductive system of vanilla shift-reduce.

As an example, consider the sentence "I saw Al with Joe" in Figure 2. At step (4), we face a shift-reduce conflict: either combine "saw" and "Al" in a re↷ action (5a), or shift "with" (5b). To resolve this conflict, there is a cost c associated with each state, so that we can pick the best one (or the best few, with a beam) at each step. Costs are accumulated at each step: as shown in Figure 1, the actions sh, re↶, and re↷ have respective costs ξ, λ, and ρ, which are dot products of the weights w and the features extracted from the state and the action.

    input: "I saw Al with Joe"

    step  action  stack            queue
    0     -                        I ...
    1     sh      I                saw ...
    2     sh      I saw            Al ...
    3     re↶     I↶saw            Al ...
    4     sh      I↶saw Al         with ...
    5a    re↷     (I↶saw)↷Al       with ...
    5b    sh      I↶saw Al with    Joe

Figure 2: A trace of vanilla shift-reduce. After step (4), the parser branches off into (5a) or (5b).

[2] There is another popular variant, "arc-eager" (Nivre, 2004; Zhang and Clark, 2008), which is more complicated and less similar to the classical shift-reduce algorithm.
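The transition system above is small enough to sketch directly. The following is a minimal, greedy rendition of it in Python (our sketch, not the reference parser): it keeps a single state instead of a beam, `score` stands in for the dot products ξ, λ, ρ of Eqs. (1)–(3), and `re_left`/`re_right` denote re↶/re↷.

# A minimal sketch of vanilla arc-standard shift-reduce (our illustration).
# Trees are (head_word, children) pairs; greedy choice replaces beam search.

def parse(words, score):
    """score(action, j, stack) -> float stands in for the costs of
    Eqs. (1)-(3); higher is better."""
    j, stack, cost = 0, [], 0.0              # state <j, S> with accumulated cost c
    for step in range(2 * len(words) - 1):   # exactly 2n-1 actions (Fig. 1)
        actions = []
        if j < len(words):                   # sh is legal while the queue is non-empty
            actions.append("sh")
        if len(stack) >= 2:                  # both reduces need two trees on the stack
            actions += ["re_left", "re_right"]
        act = max(actions, key=lambda a: score(a, j, stack))
        cost += score(act, j, stack)
        if act == "sh":                      # push w_j as a singleton tree
            stack.append((words[j], []))
            j += 1
        else:
            s0 = stack.pop()
            s1 = stack.pop()
            if act == "re_left":             # s1 <~ s0: s1 becomes leftmost child of s0
                stack.append((s0[0], [s1] + s0[1]))
            else:                            # s1 ~> s0: s0 becomes rightmost child of s1
                stack.append((s1[0], s1[1] + [s0]))
    return stack[0], cost                    # goal: <n, s0> (Fig. 1)

With words = "I saw Al with Joe".split() and a trained score function, the branch at step (4) of Figure 2 is decided by whether re_right (re↷) outscores sh; a beam-search version would instead keep the best few states per step rather than committing to one action.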
2.2 Features

We view features as "abstractions" or (partial) observations of the current state, which is an important intuition for the development of dynamic programming in Section 3. Feature templates are functions that draw information from the feature window (see Tab. 1(b)), consisting of the top few trees on the stack and the first few words on the queue. For example, one such feature template f100 = s0.w ◦ q0.t is a conjunction of two atomic features s0.w and q0.t, capturing the root word of the top tree s0 on the stack and the part-of-speech tag of the current head word q0 on the queue. See Tab. 1(a) for the list of feature templates used in the full model. Feature templates are instantiated for a specific state. For example, at step (4) in Fig. 2, the above template f100 will generate the feature instance

    (s0.w = Al) ◦ (q0.t = IN).

More formally, we denote f to be the feature function, such that f(j, S) returns a vector of feature instances for state ⟨j, S⟩. To decide which action is the best for the current state, we perform a three-…

(a) Feature templates f(j, S), where qi = wj+i:

    (1)  s0.w                   s0.t                   s0.w ◦ s0.t
         s1.w                   s1.t                   s1.w ◦ s1.t
         q0.w                   q0.t                   q0.w ◦ q0.t
    (2)  s0.w ◦ s1.w            s0.t ◦ s1.t
         s0.t ◦ q0.t            s0.w ◦ s0.t ◦ s1.t
         s0.t ◦ s1.w ◦ s1.t     s0.w ◦ s1.w ◦ s1.t
         s0.w ◦ s0.t ◦ s1.w     s0.w ◦ s0.t ◦ s1.w ◦ s1.t
    (3)  s0.t ◦ q0.t ◦ q1.t     s1.t ◦ s0.t ◦ q0.t
         s0.w ◦ q0.t ◦ q1.t     s1.t ◦ s0.w ◦ q0.t
    (4)  s1.t ◦ s1.lc.t ◦ s0.t  s1.t ◦ s1.rc.t ◦ s0.t
         s1.t ◦ s0.t ◦ s0.rc.t  s1.t ◦ s1.lc.t ◦ s0.w
         s1.t ◦ s1.rc.t ◦ s0.w  s1.t ◦ s0.w ◦ s0.lc.t
    (5)  s2.t ◦ s1.t ◦ s0.t

(b) The feature window:

    ← stack                  queue →
    ...  s2   s1   s0   |   q0   q1   ...

Table 1: (a) feature templates f(j, S); (b) the feature window.
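To make the abstraction concrete, here is a hedged sketch of feature instantiation and action scoring, covering only a few of the templates in Tab. 1(a); the tuple encoding and the names `extract` and `action_cost` are ours.

from collections import defaultdict

# Sketch of feature instantiation and scoring (our simplification, covering
# only a few templates of Tab. 1(a)). Stack trees are (head_index, children)
# pairs, so the root word/tag of a tree can be looked up by index.

def extract(j, stack, words, tags):
    """A cut-down f(j, S): return a list of feature instances."""
    feats = []
    if stack:
        i = stack[-1][0]                       # root of s0, the top tree
        feats += [("s0.w", words[i]), ("s0.t", tags[i])]
        if j < len(words):                     # the template f100 = s0.w ◦ q0.t
            feats.append(("s0.w◦q0.t", words[i], tags[j]))
    if j < len(words):                         # atomic features of q0 = w_j
        feats += [("q0.w", words[j]), ("q0.t", tags[j])]
    return feats

weights = defaultdict(float)                   # the weight vector w (trained)

def action_cost(action, j, stack, words, tags):
    """Cost of an action, cf. Eqs. (1)-(3): a dot product of w with the
    feature instances, here realized by conjoining each instance with
    the action identity."""
    return sum(weights[(action,) + feat]
               for feat in extract(j, stack, words, tags))

For instance, at step (4) of Figure 2 (j = 3, s0 rooted at "Al"), extract yields the instance ("s0.w◦q0.t", "Al", "IN"), matching the example above.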