
Transition-Based Dependency Parsing with Heuristic Backtracking

Jacob Buckman♣  Miguel Ballesteros♦  Chris Dyer♠♣
♣School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
♦NLP Group, Pompeu Fabra University, Barcelona, Spain
♠Google DeepMind, London, UK
[email protected], [email protected], [email protected]

Abstract

We introduce a novel approach to the decoding problem in transition-based parsing: heuristic backtracking. This algorithm uses a series of partial parses on the sentence to locate the best candidate parse, using confidence estimates of transition decisions as a heuristic to guide the starting points of the search. This allows us to achieve a parse accuracy comparable to beam search, despite using fewer transitions. When used to augment a Stack-LSTM transition-based parser, the parser shows an unlabeled attachment score of up to 93.30% for English and 87.61% for Chinese.

1 Introduction

Transition-based parsing, one of the most prominent dependency parsing techniques, constructs a dependency structure by reading words sequentially from the sentence and making a series of local decisions (called transitions) which incrementally build the structure. Transition-based parsing has been shown to be both fast and accurate; the number of transitions required to fully parse the sentence is linear in the number of words in the sentence.

In recent years, the field has seen dramatic improvements in the ability to correctly predict transitions. Recent models include the greedy Stack-LSTM model of Dyer et al. (2015) and the globally normalized feed-forward networks of Andor et al. (2016). These models output a local decision at each transition point, so searching the space of possible paths to the predicted parse is an important component of high-accuracy parsers.

One common search technique is beam search (Zhang and Clark, 2008; Zhang and Nivre, 2011; Bohnet and Nivre, 2012; Zhou et al., 2015; Weiss et al., 2015; Yazdani and Henderson, 2015). In beam search, a fixed number of candidate transition sequences are generated, and the highest-scoring sequence is chosen as the answer. One downside to beam search is that it often results in a significant number of wasted predictions: a constant number of beams is explored at all points throughout the sentence, leading to some unnecessary exploration towards the beginning of the sentence and potentially insufficient exploration towards the end.

One way this problem can be mitigated is by using a dynamically-sized beam (Mejia-Lavalle and Ramos, 2013). With this technique, at each step we prune all beams whose scores fall below some value s, where s is calculated from the distribution of scores of the available beams. Common pruning methods are removing all beams below some percentile, or removing any beam that scores below some constant percentage of the highest-scoring beam; a rough sketch of such a pruning step is given below.

Another approach to this issue is given by Choi and McCallum (2013), who introduced selectional branching: an initial greedy parse is performed, and confidence estimates on each prediction are then used to spawn additional beams. Relative to standard beam search, this reduces the average number of predictions required to parse a sentence, resulting in a speed-up.
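The exact pruning rule of Mejia-Lavalle and Ramos (2013) is not reproduced here; the following is only a minimal, illustrative sketch of a percentile-style dynamic beam. The names (`prune_beams`, `keep_fraction`) and the (score, state) beam layout are our own assumptions, not the cited implementation.

```python
# Sketch of a dynamically-sized beam: after each step, drop beams whose scores
# fall below a threshold derived from the current score distribution.
# (Illustrative only; the cited work's exact rule may differ.)

def prune_beams(beams, keep_fraction=0.5):
    """Keep the best-scoring beams; drop those below the chosen percentile.

    beams: list of (score, state) pairs, where scores are log probabilities.
    """
    if len(beams) <= 1:
        return list(beams)
    ranked = sorted(beams, key=lambda b: b[0], reverse=True)
    cutoff_rank = max(1, int(round(len(ranked) * keep_fraction)))
    threshold = ranked[cutoff_rank - 1][0]   # score at the chosen percentile
    return [b for b in ranked if b[0] >= threshold]
```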


In this paper, we introduce heuristic backtracking, which expands on the ideas of selectional branching by integrating a search strategy based on a heuristic function (Pearl, 1984): a function which estimates the future cost of taking a particular decision. When paired with a good heuristic, heuristic backtracking maintains the property of reducing wasted predictions, but allows us to explore the space of possible transition sequences more fully than selectional branching does. In this paper, we use a heuristic based on the confidence of transition predictions.

We also introduce a further optimization: heuristic backtracking with cutoff. Since heuristic backtracking produces results incrementally, it is possible to stop the search early once we have found an answer that we believe to be the gold parse, saving time proportional to the number of backtracks remaining.

We compare the performance of these decoding algorithms with the Stack-LSTM parser (Dyer et al., 2015), and achieve slightly higher accuracy than beam search in significantly less time.

2 Transition-Based Parsing With Stack-LSTM

Our starting point is the model described by Dyer et al. (2015); we refer to the original work for details. The parser implements the arc-standard algorithm (Nivre, 2004) and therefore makes use of a stack and a buffer. In Dyer et al. (2015), the stack and the buffer are encoded with Stack-LSTMs, and a third sequence, the history of actions taken by the parser, is encoded with another Stack-LSTM. The three encoded sequences form the parser state p_t, defined as

    p_t = \max\{0, W[s_t; b_t; a_t] + d\},    (1)

where W is a learned parameter matrix, b_t, s_t and a_t are the Stack-LSTM encodings of the buffer, the stack and the history of actions, and d is a bias term. The output p_t (after a component-wise rectified linear unit (ReLU) nonlinearity (Glorot et al., 2011)) is then used to compute the probability of the parser action at time t as

    p(z_t | p_t) = \frac{\exp(g_{z_t}^\top p_t + q_{z_t})}{\sum_{z' \in \mathcal{A}(S,B)} \exp(g_{z'}^\top p_t + q_{z'})},    (2)

where g_z is a column vector representing the (output) embedding of the parser action z, and q_z is a bias term for action z. The set \mathcal{A}(S, B) represents all the valid transition actions that may be taken in the current state. The objective function is

    \mathcal{L}_\theta(w, z) = \sum_{t=1}^{|z|} \log p(z_t | p_t),    (3)

where z refers to the sequence of parse transitions.

3 Heuristic Backtracking

Using the Stack-LSTM parsing model of Dyer et al. (2015) to predict each decision greedily yields very high accuracy; however, it can only explore one path, and it can therefore be improved by conducting a larger search over the space of possible parses. To do this, we introduce a new algorithm, heuristic backtracking. We also introduce a novel cutoff approach to further increase speed.

3.1 Decoding Strategy

We model the space of possible parses as a tree, where each node represents a certain parse state (with complete values for stack, buffer, and action history). Transitions connect nodes of the tree, and leaves of the tree represent final states.

During the first iteration, we start at the root of the tree and greedily parse until we reach a leaf. That is, for each node, we use the Stack-LSTM model to calculate scores for each transition (as described in Section 2), and then execute the highest-scoring transition, generating a child node upon which we repeat the procedure. Additionally, we save an ordered list of the transition scores and calculate the confidence of the node (as described in Section 3.2).

When we reach the leaf node, we backtrack to the location that is most likely to fix a mistake. To find this, we look at all explored nodes that still have at least one unexplored child, and choose the node with the lowest heuristic confidence (see Section 3.2). We rewind our stack, buffer, and action history to that state, and execute the highest-scoring transition from that node that has not yet been explored. At this point we are again in a fully-unexplored node, and can greedily parse just as before until we reach another leaf.

Once we have generated b leaves, we score them and return the transition sequence leading up to the highest-scoring leaf as the answer. Just as in previous studies (Collins and Roark, 2004), we use the sum of the log probabilities of all individual transitions as the overall score for the parse.
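To make the greedy step concrete, here is a minimal sketch of how a single transition could be scored and chosen from the parser state of Eqs. 1 and 2. The numpy layout, the parameter names (W, d, G, q), and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative scoring step for Eqs. 1-2 (a sketch, not the authors' code).
# W, d are the affine parameters of Eq. 1; G holds one action embedding per row
# and q the per-action biases of Eq. 2. s_t, b_t, a_t are the Stack-LSTM
# encodings of stack, buffer, and action history.

def score_transitions(W, d, G, q, s_t, b_t, a_t, valid_actions):
    """Return log-probabilities over the legal transitions A(S, B) (Eq. 2)."""
    x = np.concatenate([s_t, b_t, a_t])
    p_t = np.maximum(0.0, W @ x + d)        # Eq. 1: rectified parser state
    logits = G @ p_t + q                    # one raw score per action
    logits = logits[valid_actions]          # valid_actions: indices of legal actions
    logits = logits - logits.max()          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return dict(zip(valid_actions, log_probs))

def greedy_transition(scores):
    """Pick the highest-scoring legal transition (one step of the greedy descent)."""
    return max(scores, key=scores.get)
```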


[Figure 1: Visualization of various decoding algorithms: (a) Beam Search, (b) Dynamic Beam Search, (c) Selectional Branching, (d) Heuristic Backtracking. Each panel shows which parse states n_i^j the method explores.]
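The search loop of Section 3.1, corresponding to panel (d) of Figure 1, can be sketched roughly as follows. The node dictionary, the callbacks (`score_fn`, `apply_fn`, `is_final_fn`, `confidence_fn`), and the bookkeeping are our own assumptions about how the procedure could be wired up; `confidence_fn` stands in for the heuristic of Section 3.2, and non-final states are assumed to always have at least one legal transition.

```python
# Rough sketch of heuristic backtracking (Section 3.1); not the authors' code.

def heuristic_backtracking(root_state, score_fn, apply_fn, is_final_fn,
                           confidence_fn, num_leaves):
    """Generate up to `num_leaves` candidate parses and return the best one.

    score_fn(state)  -> [(log_prob, transition), ...] sorted best-first
    apply_fn(state, transition) -> successor state
    is_final_fn(state) -> True when the parse is complete
    confidence_fn(node) -> heuristic confidence H(n) (Section 3.2)
    """
    candidates = []   # (total log-prob, transition sequence) for each leaf
    open_nodes = []   # explored nodes that still have unexplored children

    def make_node(state, path_score, history):
        # next_rank is u_n (0-based): rank of the first unexplored child.
        return {"state": state, "path_score": path_score, "history": history,
                "options": score_fn(state), "next_rank": 0}

    def expand(node):
        """Take the best unexplored transition at `node`; return the new child."""
        rank = node["next_rank"]
        node["next_rank"] += 1
        if rank == 0 and len(node["options"]) > 1:
            open_nodes.append(node)                 # first visit; siblings remain
        elif rank > 0 and node["next_rank"] >= len(node["options"]):
            open_nodes.remove(node)                 # last unexplored child just used
        log_prob, transition = node["options"][rank]
        return make_node(apply_fn(node["state"], transition),
                         node["path_score"] + log_prob,
                         node["history"] + [transition])

    def descend(node):
        """Greedy descent until a final state (a leaf) is reached."""
        while not is_final_fn(node["state"]):
            node = expand(node)
        candidates.append((node["path_score"], node["history"]))

    descend(make_node(root_state, 0.0, []))         # first, fully greedy parse
    while len(candidates) < num_leaves and open_nodes:
        # Backtrack to the open node with the lowest heuristic confidence (Sec. 3.2).
        descend(min(open_nodes, key=confidence_fn))
    return max(candidates, key=lambda c: c[0])
```

With `num_leaves = b` this produces the b-leaf search described above; each call to `descend` starting at depth i makes at most 2l - i expansions, which is where the prediction bound discussed in Section 3.3 comes from.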

3.2 Calculating Error Likelihood

Let n indicate a node, which consists of a stack, a buffer, and an action history. We may refer to a specific node as n_i^j, which means it has i actions in its action history and it is part of the history of the jth leaf (and possibly subsequent leaves). Let the function T(n) represent a sorted vector containing all possible transitions from n, and S(n) a sorted vector containing the scores of all of these transitions, as log probabilities. We index the scores in order of value, so T_1(n) is the highest-scoring transition and S_1(n) is its score, T_2(n) is the second-highest-scoring transition, and so on. Let u_n indicate the ranking of the transition leading to the first unexplored child of a node n. Also, let V(n) represent the total score of all nodes in the history of n, i.e. the sum of the scores of the individual transitions that allowed us to get to n.

To calculate the confidence of an individual node, Choi and McCallum (2013) simply use the score margin, the difference in probability between the top-scoring transition and the second-highest-scoring transition: C(n) = S_1(n) - S_2(n). In selectional branching, the only states for which the confidence is relevant are the states in the first greedy parse, i.e. states n_i^1 for all i. For heuristic backtracking, we wish to generalize this to any state n_i^j, for all i and j. We do this in the following way:

    H(n_i^j) = \big( V(n_i^1) - V(n_i^j) \big) + \big( S_{u_{n_i^j} - 1}(n_i^j) - S_{u_{n_i^j}}(n_i^j) \big)    (4)

Intuitively, this formula means that the node explored first is the node that will yield a parse that scores as close to the greedy choice as possible: the first term ensures that it has a history of good choices, and the second term ensures that the new child node being explored will be nearly as good as the prior child.
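Under the same assumed node layout as the search-loop sketch above, Eq. 4 could be computed roughly as follows. The helper name, the `greedy_path_scores` table (holding V(n_i^1) for every prefix length i of the first greedy parse), and the 0-based rank convention are illustrative assumptions.

```python
# Illustrative reading of Eq. 4; not the authors' code.
# greedy_path_scores[i] is assumed to hold V(n_i^1), the score of the first
# (greedy) parse truncated to i transitions, for every i up to 2l.

def heuristic_confidence(node, greedy_path_scores):
    """Heuristic confidence H(n_i^j); lower values are backtracked to first."""
    i = len(node["history"])                     # depth of the node
    u = node["next_rank"]                        # rank of first unexplored child
    scores = [lp for lp, _ in node["options"]]   # S(n), sorted best-first
    # Nodes eligible for backtracking always have u >= 1 (at least one explored child).
    history_gap = greedy_path_scores[i] - node["path_score"]   # V(n_i^1) - V(n_i^j)
    sibling_gap = scores[u - 1] - scores[u]      # prior child vs. next unexplored child
    return history_gap + sibling_gap
```

Under this reading, for nodes on the first greedy parse (j = 1) the history gap is zero and the sibling gap is S_1(n) - S_2(n), so H reduces to the selectional-branching confidence C(n) of Choi and McCallum (2013).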

3.3 Number of Predictions

As discussed earlier, we use the number of predictions made by the model as a proxy for speed; execution speed may vary based on system and algorithmic implementation, but prediction count gives a good estimate of the overall work done by the algorithm.

Consider a sentence of length l, which requires at most 2l transitions with the greedy decoder (Nivre, 2004). The number of predictions required by heuristic backtracking with b leaves is guaranteed to be less than or equal to that of a beam search with b beams. When doing a beam search, the first transition requires 1 prediction, and every subsequent transition requires 1 prediction per beam, or b predictions. This results in a total of b(2l - 1) + 1 predictions.

When doing heuristic backtracking, the first greedy search requires 2l predictions. Every subsequent backtrack requires a number of predictions that depends on its target: backtracking to n_i^j requires 2l - (i + 1) predictions. Note that 0 < i < 2l, so each backtrack requires at most 2l - 1 predictions. Therefore, the maximum total number of predictions is 2l + (b - 1)(2l - 1) = b(2l - 1) + 1.

However, note that on average there are significantly fewer. Assuming that all parts of a sentence have approximately equal score distributions, the average backtrack will be to a node where i = l, reducing the predictions per backtrack by roughly 50%.

An intuitive understanding of this difference can be gained by viewing the graphs of the various decoding methods in Figure 1. Beam search has many nodes which never yield children that reach an end-state; dynamic beam search has fewer, but still several. Selectional branching has none, but suffers from the restriction that every parse candidate can be no more than one decision away from the greedy parse. With heuristic backtracking there is no such restriction, and yet every node explored is directly useful for generating a candidate parse.
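As a worked example of these counts, using only the formulas above (the function names are ours):

```python
# Worked example of the prediction counts in Section 3.3.

def beam_search_predictions(l, b):
    """Beam search: b(2l - 1) + 1 predictions."""
    return b * (2 * l - 1) + 1

def heuristic_backtracking_worst_case(l, b):
    """Worst case: one greedy pass (2l) plus b - 1 backtracks of 2l - 1 each;
    this equals b(2l - 1) + 1."""
    return 2 * l + (b - 1) * (2 * l - 1)

def heuristic_backtracking_average_case(l, b):
    """Rough average if backtracks land near the middle of the sentence (i = l),
    so each backtrack costs 2l - (l + 1) predictions."""
    return 2 * l + (b - 1) * (2 * l - (l + 1))

if __name__ == "__main__":
    l, b = 20, 8   # a 20-word sentence, 8 candidate parses
    print(beam_search_predictions(l, b))              # 313
    print(heuristic_backtracking_worst_case(l, b))    # 313 (same worst case)
    print(heuristic_backtracking_average_case(l, b))  # 173 (roughly half the backtrack cost)
```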
3.4 Early Cutoff

Another inefficiency inherent to beam search is the fact that all b beams are always fully explored. Since the beams are calculated in parallel, this is inevitable. With heuristic backtracking, however, the beams are calculated incrementally, which gives us the opportunity to cut off the search at any point. In order to leverage this into more efficient parsing, we constructed a second Stack-LSTM model, which we call the cutoff model. The cutoff model uses a single Stack-LSTM (2 layers and 300 dimensions) that takes as input the sequence of parser states (see Eq. 1) and outputs a boolean variable predicting whether the entire parse is correct or incorrect.

To train the cutoff model, we used stochastic gradient descent over the training set. For each training example, we first parse it greedily using the Stack-LSTM parser. Then, for as long as the parse has at least one mistake, we pass it to the cutoff model as a negative training example. Once the parse is completely correct, we pass it to the cutoff model as a positive training example. The loss function we use is

    \mathcal{L}_\theta = -\log p(t | s),    (5)

where s is the LSTM-encoded vector and t is the truth (parse correct/incorrect).

When decoding using early cutoff, we follow exactly the same procedure as for normal heuristic backtracking, but after every candidate parse is generated, we use it as input to our cutoff model. When our cutoff model judges our selection to be correct, we stop backtracking and return it as the answer. If we make b attempts without finding a correct parse, we follow the same procedure as before.
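The early-cutoff variant can be sketched as a thin wrapper around the search above, assuming an incremental version of it that yields one candidate parse at a time. The stream interface and `cutoff_model_says_correct` (standing in for the trained cutoff model) are hypothetical, not the authors' implementation.

```python
# Sketch of heuristic backtracking with early cutoff (Section 3.4).
# candidate_stream is assumed to yield (score, transitions) pairs one at a time,
# starting with the greedy parse, so at least one candidate is always produced.

def decode_with_cutoff(candidate_stream, cutoff_model_says_correct, num_leaves):
    """Return the first candidate the cutoff model accepts; otherwise fall back
    to the best of the first `num_leaves` candidates, as in Section 3.1."""
    seen = []
    for score, transitions in candidate_stream:
        if cutoff_model_says_correct(transitions):
            return score, transitions        # early exit: remaining backtracks are saved
        seen.append((score, transitions))
        if len(seen) >= num_leaves:
            break
    return max(seen, key=lambda c: c[0])
```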

4 Experiments and Results

To test the effectiveness of heuristic backtracking, we compare it with other decoding techniques: greedy and beam search (both already explored by Dyer et al. (2015)), dynamic beam search (Mejia-Lavalle and Ramos, 2013), and selectional branching (Choi and McCallum, 2013). We then try heuristic backtracking (see Section 3.1) and heuristic backtracking with cutoff (see Section 3.4). Note that beam search was not used for early-update training (Collins and Roark, 2004); we use the same greedy training strategy for all models and only change the decoding strategy.

We tested the performance of these algorithms on the English SD and Chinese CTB data, using exactly the same settings as Dyer et al. (2015), with pretrained embeddings and part-of-speech tags. A single model was trained using the techniques described in Section 2 and used as the transition model for all decoding algorithms. Each decoding technique was tested with varying numbers of beams; as b increased, both the predictions per sentence and the accuracy trended upwards. The results are summarized in Table 1; we report results only for the highest-accuracy b on the development set for each technique (the development sets are used to set the model parameters; results on the development sets are similar to those obtained on the test sets). We also report the results of the cutoff model in Table 2. The same greedily-trained model as above was used to generate candidate parses and confidence estimates for each transition, and the cutoff model was then trained to use these confidence estimates to discriminate between correctly-parsed and incorrectly-parsed sentences.

Table 1: UAS and LAS of various decoding methods. Pred/Sent refers to the number of predictions made by the Stack-LSTM per sentence.

English
Decoding                  Pred/Sent   UAS      LAS
Greedy (Dyer et al.)      47.92       93.04%   90.87%
Beam Search               542.09      93.32%   91.19%
Dynamic Beam Search       339.42      93.32%   91.19%
Sel. Branching            59.66       93.24%   91.12%
Heur. Backtr.             198.03      93.30%   91.18%
Heur. Backtr. w/ Cutoff   108.32      93.27%   91.16%

Chinese
Decoding                  Pred/Sent   UAS      LAS
Greedy (Dyer et al.)      53.79       87.31%   85.88%
Beam Search               815.65      87.62%   86.17%
Dynamic Beam Search       282.32      87.62%   86.17%
Sel. Branching            91.51       87.53%   86.08%
Heur. Backtr.             352.30      87.61%   86.16%
Heur. Backtr. w/ Cutoff   162.37      87.60%   86.15%

Table 2: Test-set accuracy of the cutoff model on English and Chinese.

Language   Cutoff Accuracy
English    72.43%
Chinese    75.18%

5 Discussion

In Table 1 we see that in both English and Chinese, the best heuristic backtracking performs approximately as well as the best beam search, while making less than half the predictions. This supports our hypothesis that heuristic backtracking can perform at the same level as beam search, but with increased efficiency.

Dynamic beam search also performed as well as full beam search, despite demonstrating a reduction in predictions on par with that of heuristic backtracking. Since the implementation of dynamic beam search is very straightforward for systems which have already implemented beam search, we believe this will prove to be a useful finding.

Heuristic backtracking with cutoff outperformed greedy decoding and reduced the number of predictions by an additional 50%. However, it increased accuracy slightly less than full heuristic backtracking. We believe this difference could be mitigated with an improved cutoff model; as can be seen in Table 2, the cutoff model was only able to discriminate between correct and incorrect parses around 75% of the time. Also, note that while predictions per sentence were low, the overall runtime was increased due to running the cutoff LSTM multiple times per sentence.

6 Related Work

Heuristic backtracking is most similar to the work of Choi and McCallum (2013), but is distinguished from theirs by allowing new beams to be initialized from any point in the parse, rather than only from points in the initial greedy parse. Heuristic backtracking also bears similarity to greedy best-first search (Pearl, 1984), but is unique in that it guarantees that b candidate solutions will be found within b(2l - 1) + 1 predictions. Our work also relates to beam-search parsers (Zhang and Clark, 2008, inter alia).

7 Conclusions

We have introduced a novel decoding algorithm, called heuristic backtracking, and presented evidence that it performs at the same level as beam search for decoding, while being significantly more efficient. We have demonstrated this for both English and Chinese, using a parser with strong results with a greedy decoder. We expect that heuristic backtracking could be applied to any other transition-based parser with similar benefits.

We plan on experimenting with various heuristics and cutoff models, such as adapting the attention-based models of Bahdanau et al. (2014) to act as a guide for both the heuristic search and the cutoff.

Acknowledgments

Miguel Ballesteros was supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).

References
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. CoRR, abs/1603.06042.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1455–1465, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jinho D. Choi and Andrew McCallum. 2013. Transition-based dependency parsing with selectional branching. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1052–1062, Sofia, Bulgaria, August. Association for Computational Linguistics.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 334–343. The Association for Computer Linguistics.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proc. AISTATS.

Manuel Mejia-Lavalle and Cesar Geovani Pereyra Ramos. 2013. Beam search with dynamic pruning for artificial intelligence hard problems. In Proceedings of the 2013 International Conference on Mechatronics, Electronics and Automotive Engineering, November.

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.

Judea Pearl. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley.
David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China, July. Association for Computational Linguistics.

Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In CoNLL, pages 142–152. ACL.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 562–571, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 188–193, Stroudsburg, PA, USA. Association for Computational Linguistics.

Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1213–1222, Beijing, China, July. Association for Computational Linguistics.
