Dependency Grammar Induction with a Neural Variational Transition-Based Parser
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Bowen Li, Jianpeng Cheng, Yang Liu, Frank Keller
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
{bowen.li, jianpeng.cheng, [email protected], [email protected]

Abstract

Dependency grammar induction is the task of learning dependency syntax without annotated training data. Traditional graph-based models with global inference achieve state-of-the-art results on this task but they require O(n^3) run time. Transition-based models enable faster inference with O(n) time complexity, but their performance still lags behind. In this work, we propose a neural transition-based parser for dependency grammar induction, whose inference procedure utilizes rich neural features with O(n) time complexity. We train the parser with an integration of variational inference, posterior regularization and variance reduction techniques. The resulting framework outperforms previous unsupervised transition-based dependency parsers and achieves performance comparable to graph-based models, both on the English Penn Treebank and on the Universal Dependency Treebank. In an empirical comparison, we show that our approach substantially increases parsing speed over graph-based models.

Introduction

Grammar induction is the task of deriving plausible syntactic structures from raw text, without the use of annotated training data. In the case of dependency parsing, the syntactic structure takes the form of a tree whose nodes are the words of the sentence, and whose arcs are directed and denote head-dependent relationships between words. Inducing such a tree without annotated training data is challenging because of data sparseness and ambiguity, and because the search space of potential trees is huge, making optimization difficult.

Most existing approaches to dependency grammar induction have used inference over graph structures and are based either on the dependency model with valence (DMV) of Klein and Manning (2004) or the maximum spanning tree algorithm (MST) for dependency parsing by McDonald, Petrov, and Hall (2011). State-of-the-art representatives include LC-DMV (Noji, Miyao, and Johnson 2016) and Convex-MST (Grave and Elhadad 2015). Recently, researchers have also introduced neural networks for feature extraction in graph-based models (Jiang, Han, and Tu 2016; Cai, Jiang, and Tu 2017).

Though graph-based models achieve impressive results, their inference procedure requires O(n^3) time complexity. Meanwhile, features in graph-based models must be decomposable over substructures to enable dynamic programming. In comparison, transition-based models allow faster inference with linear time complexity and richer feature sets. Although relying on local inference, transition-based models have been shown to perform well in supervised parsing (Kiperwasser and Goldberg 2016; Dyer et al. 2015). However, unsupervised transition parsers are not well-studied. One exception is the work of Rasooli and Faili (2012), in which search-based structure prediction (Daumé III 2009) is used with a simple feature set. However, there is still a large performance gap compared to graph-based models.

Recently, Dyer et al. (2016) proposed recurrent neural network grammars (RNNGs), a probabilistic transition-based model for constituency trees. RNNG can be used either in a generative way as a language model or in a discriminative way as a parser. Cheng, Lopez, and Lapata (2017) use an autoencoder to integrate discriminative and generative RNNGs, yielding a reconstruction process with parse trees as latent variables and enabling the two components to be trained jointly on a language modeling objective. However, their work uses observed trees for training and does not study unsupervised learning.

In this paper, we make a more radical departure from the existing literature in dependency grammar induction, by proposing an unsupervised neural variational transition-based parser. Specifically, we first modify the transition actions in the original RNNG into a set of arc-standard actions for projective dependency parsing, and then build a dependency variant of the model of Cheng, Lopez, and Lapata (2017). Although this approach performs well for supervised parsing, when applied in an unsupervised setting, the performance decreases dramatically (see Experiments for details). We hypothesize that this is because the parser is fairly unconstrained without prior linguistic knowledge (Naseem et al. 2010; Noji, Miyao, and Johnson 2016). Therefore, we augment the model with posterior regularization, allowing us to seamlessly integrate linguistic knowledge in the shape of a small number of universal linguistic rules. In addition, we propose a novel variance reduction method for stabilizing neural variational inference with discrete latent variables. This yields the first known model that makes it possible to use posterior regularization for neural variational inference with discrete latent variables. When evaluating on the English Penn Treebank and on eight languages from the Universal Dependency (UD) Treebank, we find that our model with posterior regularization outperforms the best unsupervised transition-based dependency parser (Rasooli and Faili 2012), and approaches the performance of graph-based models. We also show how a weak form of supervision can be integrated elegantly into our framework in the form of rule expectations. Finally, we present empirical evidence for the complexity advantage of transition-based models: our model attains a large speed-up compared to a state-of-the-art graph-based model. Code and Supplementary Material are available.[1]

[1] https://github.com/libowen2121/VI-dependency-syntax

Background

RNNG is a top-down transition system originally proposed for constituency parsing and generation. There are two variants: the discriminative RNNG and the generative RNNG. The discriminative RNNG takes a sentence as input, and predicts the probability of generating a corresponding parse tree from the sentence. The model uses a buffer to store unprocessed terminal words and a stack to store partially completed syntactic constituents. It then follows top-down transition actions to shift words from the buffer to the stack to construct syntactic constituents incrementally.

The discriminative RNNG can be modified slightly to formulate the generative RNNG, an algorithm for incrementally producing trees and sentences in a generative fashion. In the generative RNNG, there is no buffer of unprocessed words, but there is an output buffer for storing words that have been generated. Top-down actions are then specified to generate words and tree non-terminals in pre-order. Though not able to parse on its own, a generative RNNG can be used for language modeling as long as parse trees are sampled from a known distribution.

We modify the transition actions in the original RNNG into a set of arc-standard actions for projective dependency parsing. In the discriminative modeling case, the action space includes:

• SHIFT fetches the first word in the buffer and pushes it onto the top of the stack.
• LEFT-REDUCE adds a left arc between the top two words of the stack and merges them into a single construct.
• RIGHT-REDUCE adds a right arc between the top two words of the stack and merges them into a single construct.

In the generative modeling case, the SHIFT operation is replaced by a GEN operation:

• GEN generates a word and adds it to the stack and the output buffer.
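To make the arc-standard action set above concrete, the following Python sketch shows how the three discriminative actions manipulate a stack and a buffer to build a projective dependency tree. It is purely illustrative: the names (ParserState, shift, left_reduce, right_reduce) are ours rather than from the paper's released code, and the neural scoring of actions is omitted.

class ParserState:
    """Minimal arc-standard parser state over word indices (illustrative only)."""

    def __init__(self, words):
        self.buffer = list(range(len(words)))  # indices of unprocessed words
        self.stack = []                        # indices of partially built subtrees
        self.arcs = []                         # (head, dependent) pairs

    def shift(self):
        # SHIFT: move the first word of the buffer onto the top of the stack
        self.stack.append(self.buffer.pop(0))

    def left_reduce(self):
        # LEFT-REDUCE: the top of the stack becomes the head of the second-top word
        dep = self.stack.pop(-2)
        self.arcs.append((self.stack[-1], dep))

    def right_reduce(self):
        # RIGHT-REDUCE: the second-top word becomes the head of the top word
        dep = self.stack.pop()
        self.arcs.append((self.stack[-1], dep))

    def is_terminal(self):
        # A complete parse leaves an empty buffer and a single root on the stack
        return not self.buffer and len(self.stack) == 1


# Example: parse "the cat sleeps" with a hand-written action sequence
state = ParserState(["the", "cat", "sleeps"])
for action in ["shift", "shift", "left_reduce", "shift", "left_reduce"]:
    getattr(state, action)()
print(state.arcs, state.is_terminal())  # [(1, 0), (2, 1)] True: cat->the, sleeps->cat

In the generative case, SHIFT would instead be a GEN step that emits the next word onto both the stack and the output buffer, as described above.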
Methodology

To build our dependency grammar induction model, we follow Cheng, Lopez, and Lapata (2017) and propose a dependency-based, encoder-decoder RNNG. This model includes (1) a discriminative RNNG as the encoder for mapping the input sentence into a latent variable, which for the grammar induction task is a sequence of parse actions for building the dependency tree; (2) a generative RNNG as the decoder for reconstructing the input sentence based on the latent parse actions. The training objective is the likelihood of the observed input sentence, which is reformulated as an evidence lower bound (ELBO), and solved with neural variational inference. The REINFORCE algorithm (Williams 1992) is utilized to handle discrete latent variables in optimization. Overall, the encoder and decoder are jointly trained, inducing latent parse trees or actions from only unlabelled text data. To further regularize the space of parse trees with a linguistic prior, we introduce posterior regularization into the basic framework. Finally, we propose a novel variance reduction technique to train our posterior regularized framework more effectively.

Encoder

We formulate the encoder as a discriminative dependency RNNG that computes the conditional probability p(a|x) of the transition action sequence a given the observed sentence x. The conditional probability is factorized over time steps, and parameterized by a transitional state embedding v:

p(a \mid x) = \prod_{t=1}^{|a|} p(a_t \mid v_t)    (1)

where v_t is the transitional state embedding of the encoder at time step t. The encoder is the actual component for parsing at run time.
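As a minimal illustration of the factorization in Eq. (1), the Python sketch below accumulates per-step action log-probabilities into the log-probability of a full transition sequence. The action_distribution stub is an assumption made for illustration: it stands in for the neural softmax over actions computed from the state embedding v_t, and is not the model's actual parameterization.

import math

ACTIONS = ["SHIFT", "LEFT-REDUCE", "RIGHT-REDUCE"]

def action_distribution(state_embedding):
    # Placeholder for p(a_t | v_t); a real encoder scores actions from v_t with a softmax.
    return {"SHIFT": 0.5, "LEFT-REDUCE": 0.25, "RIGHT-REDUCE": 0.25}

def log_prob_action_sequence(actions, state_embeddings):
    # Eq. (1): log p(a | x) = sum_t log p(a_t | v_t)
    total = 0.0
    for a_t, v_t in zip(actions, state_embeddings):
        total += math.log(action_distribution(v_t)[a_t])
    return total

actions = ["SHIFT", "SHIFT", "LEFT-REDUCE", "SHIFT", "LEFT-REDUCE"]
dummy_states = [None] * len(actions)  # stands in for v_1, ..., v_|a|
print(log_prob_action_sequence(actions, dummy_states))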
Decoder

The decoder is a generative dependency RNNG that models the joint probability p(x, a) of a latent transition action sequence a and an observed sentence x. This joint distribution can be factorized into a sequence of action and word (emitted by GEN) probabilities, which are parameterized by a transitional state embedding u:

p(x, a) = p(a)\, p(x \mid a) = \prod_{t=1}^{|a|} p(a_t \mid u_t)\, p(x_t \mid u_t)^{\mathbb{I}(a_t = \mathrm{GEN})}    (2)

where \mathbb{I} is an indicator function and u_t is the state embedding at time step t. The features and the modeling details of both the encoder and the decoder can be found in the Supplementary Material.

Training Objective

Consider a latent variable model in which the encoder infers the latent transition actions (i.e., the dependency structure) and the decoder reconstructs the sentence from these actions.
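The objective described in the Methodology section, an ELBO optimized with REINFORCE, can be sketched schematically as follows. This is a toy Monte Carlo estimate of the standard bound E_q[log p(x, a) - log q(a|x)], with hand-picked distributions standing in for the encoder q(a|x) and the decoder p(x, a); it is our simplification, not the paper's implementation, which additionally applies posterior regularization and a novel variance reduction technique.

import math
import random

# Toy latent "action sequences" and stand-in model scores (assumed, for illustration).
CANDIDATE_PARSES = ["parse-A", "parse-B"]
Q = {"parse-A": 0.7, "parse-B": 0.3}                 # encoder q(a | x)
LOG_P_JOINT = {"parse-A": -3.0, "parse-B": -5.0}     # decoder log p(x, a)

def estimate_elbo(num_samples=1000):
    """Monte Carlo estimate of E_{q(a|x)}[log p(x, a) - log q(a|x)]."""
    total = 0.0
    for _ in range(num_samples):
        a = random.choices(CANDIDATE_PARSES,
                           weights=[Q[c] for c in CANDIDATE_PARSES])[0]
        # Per-sample learning signal; REINFORCE scales the encoder's
        # grad log q(a|x) by this quantity, usually minus a baseline
        # to reduce the variance of the gradient estimate.
        total += LOG_P_JOINT[a] - math.log(Q[a])
    return total / num_samples

print(estimate_elbo())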