Unsupervised Recurrent Neural Network Grammars
Yoon Kim†   Alexander M. Rush†   Lei Yu³   Adhiguna Kuncoro‡,³   Chris Dyer³   Gábor Melis³
†Harvard University   ‡University of Oxford   ³DeepMind

arXiv:1904.03746v6 [cs.CL] 5 Aug 2019

(Work done while the first author was an intern at DeepMind. Code available at https://github.com/harvardnlp/urnng.)

Abstract

Recurrent neural network grammars (RNNG) are generative models of language which jointly model syntax and surface structure by incrementally generating a syntax tree and sentence in a top-down, left-to-right order. Supervised RNNGs achieve strong language modeling and parsing performance, but require an annotated corpus of parse trees. In this work, we experiment with unsupervised learning of RNNGs. Since directly marginalizing over the space of latent trees is intractable, we instead apply amortized variational inference. To maximize the evidence lower bound, we develop an inference network parameterized as a neural CRF constituency parser. On language modeling, unsupervised RNNGs perform as well as their supervised counterparts on benchmarks in English and Chinese. On constituency grammar induction, they are competitive with recent neural language models that induce tree structures from words through attention mechanisms.

1 Introduction

Recurrent neural network grammars (RNNGs) (Dyer et al., 2016) model sentences by first generating a nested, hierarchical syntactic structure which is used to construct a context representation to be conditioned upon for upcoming words. Supervised RNNGs have been shown to outperform standard sequential language models, achieve excellent results on parsing (Dyer et al., 2016; Kuncoro et al., 2017), better encode syntactic properties of language (Kuncoro et al., 2018), and correlate with electrophysiological responses in the human brain (Hale et al., 2018). However, these all require annotated syntactic trees for training. In this work, we explore unsupervised learning of recurrent neural network grammars for language modeling and grammar induction.

The standard setup for unsupervised structure learning is to define a generative model p_θ(x, z) over observed data x (e.g. a sentence) and unobserved structure z (e.g. a parse tree, a part-of-speech sequence), and to maximize the log marginal likelihood log p_θ(x) = log Σ_z p_θ(x, z). Successful approaches to unsupervised parsing have made strong conditional independence assumptions (e.g. context-freeness) and employed auxiliary objectives (Klein and Manning, 2002) or priors (Johnson et al., 2007). These strategies imbue the learning process with inductive biases that guide the model to discover meaningful structures while allowing tractable algorithms for marginalization; however, they come at the expense of language modeling performance, particularly compared to sequential neural models that make no independence assumptions.

Like RNN language models, RNNGs make no independence assumptions. Instead, they encode structural bias through operations that compose linguistic constituents. The lack of independence assumptions contributes to the strong language modeling performance of RNNGs, but makes unsupervised learning challenging. First, marginalization is intractable. Second, the biases imposed by the RNNG are relatively weak compared to those imposed by models like PCFGs. There is little pressure for non-trivial tree structure to emerge during unsupervised RNNG (URNNG) learning.

In this work, we explore a technique for handling intractable marginalization while also injecting inductive bias. Specifically, we employ amortized variational inference (Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014) with a structured inference network. Variational inference lets us tractably optimize a lower bound on the log marginal likelihood, while employing a structured inference network encourages non-trivial structure. In particular, a conditional random field (CRF) constituency parser (Finkel et al., 2008; Durrett and Klein, 2015), which makes significant independence assumptions, acts as a guide on the generative model to learn meaningful trees through regularizing the posterior (Ganchev et al., 2010).

We experiment with URNNGs on English and Chinese and observe that they perform well as language models compared to their supervised counterparts and standard neural LMs. In terms of grammar induction, they are competitive with recently-proposed neural architectures that discover tree-like structures through gated attention (Shen et al., 2018). Our results, along with other recent work on joint language modeling/structure learning with deep networks (Shen et al., 2018, 2019; Wiseman et al., 2018; Kawakami et al., 2018), suggest that it is possible to learn generative models of language that model the underlying data well (i.e. assign high likelihood to held-out data) and at the same time induce meaningful linguistic structure.

Figure 1: Overview of our approach. The inference network q_φ(z | x) (left) is a CRF parser which produces a distribution over binary trees (shown in dotted box). B_ij are random variables for the existence of a constituent spanning the i-th and j-th words, whose potentials are the output from a bidirectional LSTM (the global factor ensures that the distribution is only over valid binary trees). The generative model p_θ(x, z) (right) is an RNNG which consists of a stack LSTM (from which actions/words are predicted) and a tree LSTM (to obtain constituent representations upon REDUCE). Training involves sampling a binary tree from q_φ(z | x), converting it to a sequence of shift/reduce actions (z = [SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, SHIFT, REDUCE] in the above example), and optimizing the log joint likelihood log p_θ(x, z).

2 Unsupervised Recurrent Neural Network Grammars

We use x = [x_1, ..., x_T] to denote a sentence of length T, and z ∈ Z_T to denote an unlabeled binary parse tree over a sequence of length T, represented as a binary vector of length 2T − 1. Here 0 and 1 correspond to SHIFT and REDUCE actions, explained below.¹ Figure 1 presents an overview of our approach.

¹The cardinality of Z_T ⊂ {0, 1}^(2T−1) is given by the (T − 1)-th Catalan number, |Z_T| = (2T − 2)! / (T!(T − 1)!).
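To make this encoding concrete, the short Python sketch below (our illustration, not part of the paper's code release) computes |Z_T| with the Catalan formula from the footnote and checks that a 0/1 action vector encodes a valid binary tree, using the example action sequence from Figure 1 for a sentence of length T = 4.

```python
from math import comb

SHIFT, REDUCE = 0, 1

def num_binary_trees(T):
    """|Z_T| = (2T-2)! / (T! (T-1)!), the (T-1)-th Catalan number."""
    return comb(2 * T - 2, T - 1) // T

def is_valid_tree(z):
    """A sequence of SHIFTs (0) and REDUCEs (1) encodes a binary tree iff every
    REDUCE has at least two constituents to pop and exactly one constituent remains."""
    n_constituents = 0
    for a in z:
        if a == SHIFT:
            n_constituents += 1
        else:
            if n_constituents < 2:
                return False
            n_constituents -= 1          # pop two constituents, push their composition
    return n_constituents == 1

z = [SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, SHIFT, REDUCE]   # the tree in Figure 1
assert len(z) == 2 * 4 - 1 and is_valid_tree(z)
assert num_binary_trees(4) == 5                            # |Z_4| = Catalan(3)
```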
2.1 Generative Model

An RNNG defines a joint probability distribution p_θ(x, z) over sentences x and parse trees z. We consider a simplified version of the original RNNG (Dyer et al., 2016) by ignoring constituent labels and only considering binary trees. The RNNG utilizes an RNN to parameterize a stack data structure (Dyer et al., 2015) of partially-completed constituents to incrementally build the parse tree while generating terminals. Using the current stack representation, the model samples an action (SHIFT or REDUCE): SHIFT generates a terminal symbol, i.e. a word, and shifts it onto the stack;² REDUCE pops the last two elements off the stack, composes them, and shifts the composed representation onto the stack.

²A better name for SHIFT would be GENERATE (as in Dyer et al. (2016)), but we use SHIFT to emphasize the similarity with shift-reduce parsing.

Formally, let S = [(0, 0)] be the initial stack. Each item of the stack will be a pair, where the first element is the hidden state of the stack LSTM, and the second element is an input vector, described below. We use top(S) to refer to the top pair in the stack. The push and pop operations are defined imperatively in the usual way. At each time step, the next action z_t (SHIFT or REDUCE) is sampled from a Bernoulli distribution parameterized in terms of the current stack representation. Letting (h_prev, g_prev) = top(S), we have

    z_t ∼ Bernoulli(p_t),    p_t = σ(w⊤ h_prev + b).

Subsequent generation depends on z_t:

• If z_t = 0 (SHIFT), the model first generates a terminal symbol via sampling from a categorical distribution whose parameters come from an affine transformation and a softmax,

    x ∼ softmax(W h_prev + b).

Then the generated terminal is shifted onto the stack using a stack LSTM,

    h_next = LSTM(e_x, h_prev),
    push(S, (h_next, e_x)),

where e_x is the word embedding for x.

• If z_t = 1 (REDUCE), we pop the last two elements off the stack,

    (h_r, g_r) = pop(S),    (h_l, g_l) = pop(S),

and obtain a new representation that combines the left/right constituent representations using a tree LSTM (Tai et al., 2015; Zhu et al., 2015),

    g_new = TreeLSTM(g_l, g_r).

Note that we use g_l and g_r to obtain the new representation instead of h_l and h_r.³ We then update the stack using g_new,

    (h_prev, g_prev) = top(S),
    h_new = LSTM(g_new, h_prev),
    push(S, (h_new, g_new)).

The generation process continues until an end-of-sentence symbol is generated. The parameters θ of the generative model are w, b, W, b, and the parameters of the stack/tree LSTMs. For a sentence x = [x_1, ..., x_T] of length T, the binary parse tree is given by the binary vector z = [z_1, ..., z_(2T−1)].⁴ The joint log likelihood decomposes as a sum of terminal/action log likelihoods.
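To make the control flow of this generative story concrete, here is a minimal PyTorch sketch of the sampling loop. It is a sketch under simplifying assumptions, not the authors' implementation (which is linked in the footnote above): the binary tree LSTM is replaced by a tanh over a linear layer, an LSTM cell state is carried alongside each h on the stack, actions that would produce an invalid tree are forced rather than sampled, and generation stops after a fixed number of terminals rather than on an end-of-sentence symbol. All class, method, and variable names are ours.

```python
import torch
import torch.nn as nn


class SimpleRNNG(nn.Module):
    """Unlabeled, binary-tree RNNG generator following the equations above (simplified)."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # e_x
        self.stack_lstm = nn.LSTMCell(dim, dim)      # h_next = LSTM(input, h_prev)
        self.compose = nn.Linear(2 * dim, dim)       # stand-in for TreeLSTM(g_l, g_r)
        self.action_mlp = nn.Linear(dim, 1)          # p_t = sigma(w^T h_prev + b)
        self.word_mlp = nn.Linear(dim, vocab_size)   # x ~ softmax(W h_prev + b)
        self.dim = dim

    @torch.no_grad()
    def sample(self, max_terminals=10):
        """Sample a (words, actions) pair; actions has length 2*max_terminals - 1."""
        zero = torch.zeros(1, self.dim)
        S = [((zero, zero), zero)]          # initial stack S = [(0, 0)], plus an LSTM cell state
        words, actions = [], []
        n_shift = n_reduce = 0
        while not (n_shift == max_terminals and n_reduce == n_shift - 1):
            (h_prev, c_prev), _ = S[-1]                 # (h_prev, g_prev) = top(S)
            if len(S) < 3:                              # fewer than two constituents: must SHIFT
                z_t = 0
            elif n_shift == max_terminals:              # all words emitted: must REDUCE
                z_t = 1
            else:
                p_t = torch.sigmoid(self.action_mlp(h_prev))
                z_t = int(torch.bernoulli(p_t).item())  # z_t ~ Bernoulli(p_t)
            actions.append(z_t)
            if z_t == 0:   # SHIFT: generate a word, embed it, push it onto the stack
                probs = torch.softmax(self.word_mlp(h_prev), dim=-1)
                x = torch.multinomial(probs.squeeze(0), 1)
                e_x = self.embed(x).view(1, self.dim)
                S.append((self.stack_lstm(e_x, (h_prev, c_prev)), e_x))
                words.append(int(x.item()))
                n_shift += 1
            else:          # REDUCE: pop two constituents, compose them, push the result
                _, g_r = S.pop()
                _, g_l = S.pop()
                g_new = torch.tanh(self.compose(torch.cat([g_l, g_r], dim=-1)))
                (h_prev, c_prev), _ = S[-1]
                S.append((self.stack_lstm(g_new, (h_prev, c_prev)), g_new))
                n_reduce += 1
        return words, actions


model = SimpleRNNG(vocab_size=100, dim=32)
words, z = model.sample(max_terminals=4)    # len(z) == 7 == 2*4 - 1
```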
In the supervised case where the ground-truth z is available, we can straightforwardly perform gradient-based optimization to maximize the joint log likelihood log p_θ(x, z). In the unsupervised case, the standard approach is to maximize the log marginal likelihood,

    log p_θ(x) = log Σ_(z' ∈ Z_T) p_θ(x, z').

However, this summation is intractable because z_t fully depends on all previous actions [z_1, ..., z_(t−1)]. Even if this summation were tractable, it is not clear that meaningful latent structures would emerge given the lack of explicit independence assumptions in the RNNG (e.g. it is clearly not context-free). We handle these issues with amortized variational inference.

2.2 Amortized Variational Inference

Amortized variational inference (Kingma and Welling, 2014) defines a trainable inference network φ that parameterizes q_φ(z | x), a variational posterior distribution, in this case over parse trees z given the sentence x.
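As stated in the abstract, learning maximizes the evidence lower bound (ELBO). In this standard amortized variational inference setup, the generic form of the bound follows from applying Jensen's inequality to the log marginal likelihood above, for any choice of q_φ:

    log p_θ(x) = log Σ_(z ∈ Z_T) q_φ(z | x) · [p_θ(x, z) / q_φ(z | x)]
               ≥ E_(q_φ(z|x))[log p_θ(x, z)] + H[q_φ(z | x)] = ELBO(θ, φ; x),

where H[·] denotes the entropy of the variational posterior. Maximizing the first term encourages trees under which the RNNG assigns high joint likelihood, matching the sampling-based training step sketched in Figure 1, while the entropy term discourages the variational posterior from collapsing onto a single tree.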