Unsupervised Recurrent Neural Network Grammars
Yoon Kim† Alexander M. Rush† Lei Yu3 Adhiguna Kuncoro‡,3 Chris Dyer3 Gábor Melis3
†Harvard University ‡University of Oxford 3DeepMind
{yoonkim,srush}@seas.harvard.edu {leiyu,akuncoro,cdyer,melisgl}@google.com
arXiv:1904.03746v6 [cs.CL] 5 Aug 2019

Abstract

Recurrent neural network grammars (RNNG) are generative models of language which jointly model syntax and surface structure by incrementally generating a syntax tree and sentence in a top-down, left-to-right order. Supervised RNNGs achieve strong language modeling and parsing performance, but require an annotated corpus of parse trees. In this work, we experiment with unsupervised learning of RNNGs. Since directly marginalizing over the space of latent trees is intractable, we instead apply amortized variational inference. To maximize the evidence lower bound, we develop an inference network parameterized as a neural CRF constituency parser. On language modeling, unsupervised RNNGs perform as well as their supervised counterparts on benchmarks in English and Chinese. On constituency grammar induction, they are competitive with recent neural language models that induce tree structures from words through attention mechanisms.

1 Introduction

Recurrent neural network grammars (RNNGs) (Dyer et al., 2016) model sentences by first generating a nested, hierarchical syntactic structure which is used to construct a context representation to be conditioned upon for upcoming words. Supervised RNNGs have been shown to outperform standard sequential language models, achieve excellent results on parsing (Dyer et al., 2016; Kuncoro et al., 2017), better encode syntactic properties of language (Kuncoro et al., 2018), and correlate with electrophysiological responses in the human brain (Hale et al., 2018). However, these all require annotated syntactic trees for training. In this work, we explore unsupervised learning of recurrent neural network grammars for language modeling and grammar induction.

The standard setup for unsupervised structure learning is to define a generative model pθ(x, z) over observed data x (e.g. sentence) and unobserved structure z (e.g. parse tree, part-of-speech sequence), and maximize the log marginal likelihood log pθ(x) = log Σ_z pθ(x, z). Successful approaches to unsupervised parsing have made strong conditional independence assumptions (e.g. context-freeness) and employed auxiliary objectives (Klein and Manning, 2002) or priors (Johnson et al., 2007). These strategies imbue the learning process with inductive biases that guide the model to discover meaningful structures while allowing tractable algorithms for marginalization; however, they come at the expense of language modeling performance, particularly compared to sequential neural models that make no independence assumptions.

Like RNN language models, RNNGs make no independence assumptions. Instead they encode structural bias through operations that compose linguistic constituents. The lack of independence assumptions contributes to the strong language modeling performance of RNNGs, but makes unsupervised learning challenging. First, marginalization is intractable. Second, the biases imposed by the RNNG are relatively weak compared to those imposed by models like PCFGs. There is little pressure for non-trivial tree structure to emerge during unsupervised RNNG (URNNG) learning.

In this work, we explore a technique for handling intractable marginalization while also injecting inductive bias. Specifically we employ amortized variational inference (Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014) with a structured inference network.
Variational inference lets us tractably optimize a lower bound on the log marginal likelihood, while employing a structured inference network encourages non-trivial structure. In particular, a conditional random field (CRF) constituency parser (Finkel et al., 2008; Durrett and Klein, 2015), which makes significant independence assumptions, acts as a guide on the generative model to learn meaningful trees through regularizing the posterior (Ganchev et al., 2010).

We experiment with URNNGs on English and Chinese and observe that they perform well as language models compared to their supervised counterparts and standard neural LMs. In terms of grammar induction, they are competitive with recently-proposed neural architectures that discover tree-like structures through gated attention (Shen et al., 2018). Our results, along with other recent work on joint language modeling/structure learning with deep networks (Shen et al., 2018, 2019; Wiseman et al., 2018; Kawakami et al., 2018), suggest that it is possible to learn generative models of language that model the underlying data well (i.e. assign high likelihood to held-out data) and at the same time induce meaningful linguistic structure.

Work done while the first author was an intern at DeepMind. Code available at https://github.com/harvardnlp/urnng

Figure 1: Overview of our approach. The inference network qφ(z | x) (left) is a CRF parser which produces a distribution over binary trees (shown in dotted box). Bij are random variables for existence of a constituent spanning the i-th and j-th words, whose potentials are the output from a bidirectional LSTM (the global factor ensures that the distribution is only over valid binary trees). The generative model pθ(x, z) (right) is an RNNG which consists of a stack LSTM (from which actions/words are predicted) and a tree LSTM (to obtain constituent representations upon REDUCE). Training involves sampling a binary tree from qφ(z | x), converting it to a sequence of shift/reduce actions (z = [SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, SHIFT, REDUCE] in the above example), and optimizing the log joint likelihood log pθ(x, z).

2 Unsupervised Recurrent Neural Network Grammars

We use x = [x1, . . . , xT] to denote a sentence of length T, and z ∈ ZT to denote an unlabeled binary parse tree over a sequence of length T, represented as a binary vector of length 2T − 1. Here 0 and 1 correspond to SHIFT and REDUCE actions, explained below.[1] Figure 1 presents an overview of our approach.

2.1 Generative Model

An RNNG defines a joint probability distribution pθ(x, z) over sentences x and parse trees z. We consider a simplified version of the original RNNG (Dyer et al., 2016) by ignoring constituent labels and only considering binary trees. The RNNG utilizes an RNN to parameterize a stack data structure (Dyer et al., 2015) of partially-completed constituents to incrementally build the parse tree while generating terminals. Using the current stack representation, the model samples an action (SHIFT or REDUCE): SHIFT generates a terminal symbol, i.e. word, and shifts it onto the stack;[2] REDUCE pops the last two elements off the stack, composes them, and shifts the composed representation onto the stack.

[1] The cardinality of ZT ⊂ {0, 1}^(2T−1) is given by the (T − 1)-th Catalan number, |ZT| = (2T − 2)! / (T! (T − 1)!).
[2] A better name for SHIFT would be GENERATE (as in Dyer et al. (2016)), but we use SHIFT to emphasize similarity with shift-reduce parsing.
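To make the action encoding concrete, the following short sketch (a hypothetical helper, not part of the released code) maps a binary tree, written as nested pairs, to the 0/1 SHIFT/REDUCE sequence just described. Applied to the example tree of Figure 1 it reproduces z = [SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, SHIFT, REDUCE].

def tree_to_actions(tree):
    """Map a binary tree (nested 2-tuples with string leaves) to its action
    sequence: 0 = SHIFT (generate a word), 1 = REDUCE (compose top two items).
    A sentence of length T always yields 2T - 1 actions."""
    if isinstance(tree, str):          # leaf: generate the word
        return [0]
    left, right = tree                 # internal node: build both children, then compose
    return tree_to_actions(left) + tree_to_actions(right) + [1]

# The tree from Figure 1: ((x1 (x2 x3)) x4)
z = tree_to_actions((("x1", ("x2", "x3")), "x4"))
print(z)  # [0, 0, 0, 1, 1, 0, 1] = SHIFT, SHIFT, SHIFT, REDUCE, REDUCE, SHIFT, REDUCE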
Formally, let S = [(0, 0)] be the initial stack. Each item of the stack will be a pair, where the first element is the hidden state of the stack LSTM, and the second element is an input vector, described below. We use top(S) to refer to the top pair in the stack. The push and pop operations are defined imperatively in the usual way. At each time step, the next action zt (SHIFT or REDUCE) is sampled from a Bernoulli distribution parameterized in terms of the current stack representation. Letting (hprev, gprev) = top(S), we have

    zt ∼ Bernoulli(pt),   pt = σ(w⊤ hprev + b).

Subsequent generation depends on zt:

• If zt = 0 (SHIFT), the model first generates a terminal symbol via sampling from a categorical distribution whose parameters come from an affine transformation and a softmax,

    x ∼ softmax(W hprev + b).

Then the generated terminal is shifted onto the stack using a stack LSTM,

    hnext = LSTM(ex, hprev),
    push(S, (hnext, ex)),

where ex is the word embedding for x.

• If zt = 1 (REDUCE), we pop the last two elements off the stack,

    (hr, gr) = pop(S),   (hl, gl) = pop(S),

and obtain a new representation that combines the left/right constituent representations using a tree LSTM (Tai et al., 2015; Zhu et al., 2015),

    gnew = TreeLSTM(gl, gr).

Note that we use gl and gr to obtain the new representation instead of hl and hr. We then update the stack using gnew,

    (hprev, gprev) = top(S),
    hnew = LSTM(gnew, hprev),
    push(S, (hnew, gnew)).

The generation process continues until an end-of-sentence symbol is generated. The parameters θ of the generative model are w, b, W, b, and the parameters of the stack/tree LSTMs. For a sentence x = [x1, . . . , xT] of length T, the binary parse tree is given by the binary vector z = [z1, . . . , z2T−1]. The joint log likelihood decomposes as a sum of terminal/action log likelihoods,

    log pθ(x, z) = Σ_{t=1}^{T} log pθ(xt | x<t, z<t) + Σ_{j=1}^{2T−1} log pθ(zj | x<j, z<j),

where the conditioning sets in each term are the words and actions generated so far (indexed over words in the first sum and over actions in the second).

In the supervised case where ground-truth z is available, we can straightforwardly perform gradient-based optimization to maximize the joint log likelihood log pθ(x, z). In the unsupervised case, the standard approach is to maximize the log marginal likelihood,

    log pθ(x) = log Σ_{z′ ∈ ZT} pθ(x, z′).

However this summation is intractable because zt fully depends on all previous actions [z1, . . . , zt−1]. Even if this summation were tractable, it is not clear that meaningful latent structures would emerge given the lack of explicit independence assumptions in the RNNG (e.g. it is clearly not context-free). We handle these issues with amortized variational inference.
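The following structural sketch summarizes one pass of the generative process above. It is not the released implementation: stack_lstm, tree_lstm, word_dist, action_prob, and sample_word are placeholder callables standing in for the parameterized components (stack LSTM, TreeLSTM composition, softmax over words, and σ(w⊤hprev + b)), and end-of-sentence handling and the validity check for REDUCE are simplified.

import random

def generate(action_prob, word_dist, sample_word, stack_lstm, tree_lstm, h0, g0, eos="</s>"):
    """Sample a sentence x and action sequence z from the generative model.
    Each stack item is a pair (h, g): the stack-LSTM hidden state and the
    input vector that produced it."""
    stack = [(h0, g0)]                      # initial stack S = [(0, 0)]
    words, actions = [], []
    while True:
        h_prev, _ = stack[-1]
        # REDUCE is only sensible with at least two completed constituents on the stack
        p_reduce = action_prob(h_prev) if len(stack) > 2 else 0.0
        z_t = 1 if random.random() < p_reduce else 0
        actions.append(z_t)
        if z_t == 0:                        # SHIFT: generate a terminal and push it
            x, e_x = sample_word(word_dist(h_prev))
            if x == eos:                    # stop once the end-of-sentence symbol appears
                return words, actions
            words.append(x)
            stack.append((stack_lstm(e_x, h_prev), e_x))
        else:                               # REDUCE: pop two children and compose them
            h_r, g_r = stack.pop()
            h_l, g_l = stack.pop()
            g_new = tree_lstm(g_l, g_r)
            h_top, _ = stack[-1]
            stack.append((stack_lstm(g_new, h_top), g_new))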
2.2 Amortized Variational Inference

Amortized variational inference (Kingma and Welling, 2014) defines a trainable inference network φ that parameterizes qφ(z | x), a variational posterior distribution, in this case over parse trees z given the sentence x. This distribution is used to form an evidence lower bound (ELBO) on the log marginal likelihood,

    ELBO(θ, φ; x) = E_{qφ(z | x)}[log pθ(x, z) − log qφ(z | x)] ≤ log pθ(x),

which we maximize with respect to both θ and φ. We parameterize qφ(z | x) as a neural CRF constituency parser (Figure 1, left); its conditional independence assumptions act as a form of posterior regularization that limits posterior flexibility of the overly powerful RNNG generative model.[6,7]

The parameterization of span scores is similar to recent works (Wang and Chang, 2016; Stern et al., 2017; Kitaev and Klein, 2018): we add position embeddings to word embeddings and run a bidirectional LSTM over the input representations to obtain the forward [→h1, . . . , →hT] and backward [←h1, . . . , ←hT] hidden states. The score sij ∈ R for a constituent spanning xi to xj is given by

    sij = MLP([→h_{j+1} − →h_i ; ←h_{i−1} − ←h_j]).

Letting B be the binary matrix representation of a tree (Bij = 1 means there is a constituent spanning xi and xj), the CRF parser defines a distribution over binary trees via the Gibbs distribution,

    qφ(B | x) = (1 / ZT(x)) exp( Σ_{i≤j} Bij sij ),

where ZT(x) is the partition function,

    ZT(x) = Σ_{B′ ∈ BT} exp( Σ_{i≤j} B′ij sij ),

and φ denotes the parameters of the inference network (i.e. the bidirectional LSTM and the MLP). Calculating ZT(x) requires a summation over an exponentially-sized set BT ⊂ {0, 1}^(T×T), the set of all binary trees over a length T sequence. However we can perform the summation in O(T³) using the inside algorithm (Baker, 1979), shown in Algorithm 1. This computation is itself differentiable and amenable to gradient-based optimization. Finally, letting f : BT → ZT be the bijection between the binary tree matrix representation and a sequence of SHIFT/REDUCE actions, the inference network defines a distribution over ZT via qφ(z | x) := qφ(f⁻¹(z) | x).

Algorithm 1 Inside algorithm for calculating ZT(x)
1: procedure INSIDE(s)                        ▷ scores sij for i ≤ j
2:   for i := 1 to T do                        ▷ length-1 spans
3:     β[i, i] = exp(sii)
4:   for ℓ := 1 to T − 1 do                    ▷ span length
5:     for i := 1 to T − ℓ do                  ▷ span start
6:       j = i + ℓ                             ▷ span end
7:       β[i, j] = Σ_{k=i}^{j−1} exp(sij) · β[i, k] · β[k + 1, j]
8:   return β[1, T]                            ▷ return partition function ZT(x)

[6] While it has a similar goal, this formulation differs from posterior regularization as formulated by Ganchev et al. (2010), which constrains the distributional family via linear constraints on posterior expectations. In our case, the conditional independence assumptions in the CRF lead to a curved exponential family where the vector of natural parameters has fewer dimensions than the vector of sufficient statistics of the full exponential family. This curved exponential family is a subset of the marginal polytope of the full exponential family, but it is an intersection of both linear and nonlinear manifolds, and therefore cannot be characterized through linear constraints over posterior expectations.
[7] In preliminary experiments, we also attempted to learn latent trees with a transition-based parser (which does not make explicit independence assumptions) that looks at the entire sentence. However we found that under this setup, the inference network degenerated into a local minimum whereby it always generated left-branching trees despite various optimization strategies. Williams et al. (2018) observe a similar phenomenon in the context of learning latent trees for classification tasks. However Li et al. (2019) find that it is possible to use a transition-based parser as the inference network for dependency grammar induction, if the inference network is constrained via posterior regularization (Ganchev et al., 2010) based on universal syntactic rules (Naseem et al., 2010).
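For reference, here is a direct NumPy transcription of Algorithm 1 (a sketch, using 0-based indices rather than the 1-based indices of the pseudocode). It works in probability space for readability; a practical implementation would run in log-space with logsumexp for numerical stability.

import numpy as np

def inside(s):
    """Compute the inside table beta and the partition function Z_T(x)
    for the CRF parser, given span scores s[i, j] for i <= j."""
    T = s.shape[0]
    beta = np.zeros((T, T))
    for i in range(T):                        # length-1 spans
        beta[i, i] = np.exp(s[i, i])
    for length in range(1, T):                # span length
        for i in range(T - length):           # span start
            j = i + length                    # span end
            for k in range(i, j):             # sum over split points
                beta[i, j] += np.exp(s[i, j]) * beta[i, k] * beta[k + 1, j]
    return beta, beta[0, T - 1]               # inside table and Z_T(x)

# Tiny usage example with random span scores for a length-4 sentence.
beta, Z = inside(np.random.default_rng(0).normal(size=(4, 4)))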
2.3 Optimization

For optimization, we use the following variant of the ELBO,

    E_{qφ(z | x)}[log pθ(x, z)] + H[qφ(z | x)],

where H[qφ(z | x)] = E_{qφ(z | x)}[− log qφ(z | x)] is the entropy of the variational posterior. A Monte Carlo estimate for the gradient with respect to θ is

    ∇θ ELBO(θ, φ; x) ≈ (1/K) Σ_{k=1}^{K} ∇θ log pθ(x, z^(k)),

with samples z^(1), . . . , z^(K) from qφ(z | x). Sampling uses the intermediate values calculated during the inside algorithm to sample split points recursively (Goodman, 1998; Finkel et al., 2006), as shown in Algorithm 2. The gradient with respect to φ involves two parts. The entropy term H[qφ(z | x)] can be calculated exactly in O(T³), again using the intermediate values from the inside algorithm (see Algorithm 3).[8] Since each step of this dynamic program is differentiable, we can obtain the gradient ∇φ H[qφ(z | x)] using automatic differentiation.[9] An estimator for the gradient with respect to E_{qφ(z | x)}[log pθ(x, z)] is obtained via the score function gradient estimator (Glynn, 1987; Williams, 1992),

    ∇φ E_{qφ(z | x)}[log pθ(x, z)] = E_{qφ(z | x)}[log pθ(x, z) ∇φ log qφ(z | x)]
                                  ≈ (1/K) Σ_{k=1}^{K} log pθ(x, z^(k)) ∇φ log qφ(z^(k) | x).

The above estimator is unbiased but typically suffers from high variance. To reduce variance, we use a control variate derived from an average of the other samples' joint likelihoods (Mnih and Rezende, 2016), yielding the following estimator,

    (1/K) Σ_{k=1}^{K} (log pθ(x, z^(k)) − r^(k)) ∇φ log qφ(z^(k) | x),

where r^(k) = 1/(K−1) Σ_{j≠k} log pθ(x, z^(j)). This control variate worked better than alternatives such as estimates of baselines from an auxiliary network (Mnih and Gregor, 2014; Deng et al., 2018) or a language model (Yin et al., 2018).

Algorithm 2 Top-down sampling a tree from qφ(z | x)
1: procedure SAMPLE(β)                        ▷ β from running INSIDE(s)
2:   B = 0                                    ▷ binary matrix representation of tree
3:   Q = [(1, T)]                             ▷ queue of constituents
4:   while Q is not empty do
5:     (i, j) = pop(Q)
6:     τ = Σ_{k=i}^{j−1} β[i, k] · β[k + 1, j]
7:     for k := i to j − 1 do                 ▷ get distribution over splits
8:       wk = (β[i, k] · β[k + 1, j]) / τ
9:     k ∼ Cat([wi, . . . , w_{j−1}])          ▷ sample a split point
10:    B_{i,k} = 1, B_{k+1,j} = 1              ▷ update B
11:    if k > i then                           ▷ if left child has width > 1
12:      push(Q, (i, k))                       ▷ add to queue
13:    if k + 1 < j then                       ▷ if right child has width > 1
14:      push(Q, (k + 1, j))                   ▷ add to queue
15:  z = f(B)      ▷ f : BT → ZT maps matrix representation of tree to sequence of actions
16:  return z

Algorithm 3 Calculating the tree entropy H[qφ(z | x)]
1: procedure ENTROPY(β)                       ▷ β from running INSIDE(s)
2:   for i := 1 to T do                        ▷ initialize entropy table
3:     H[i, i] = 0
4:   for ℓ := 1 to T − 1 do                    ▷ span length
5:     for i := 1 to T − ℓ do                  ▷ span start
6:       j = i + ℓ                             ▷ span end
7:       τ = Σ_{u=i}^{j−1} β[i, u] · β[u + 1, j]
8:       for u := i to j − 1 do
9:         wu = (β[i, u] · β[u + 1, j]) / τ
10:      H[i, j] = Σ_{u=i}^{j−1} (H[i, u] + H[u + 1, j] − log wu) · wu
11:  return H[1, T]                            ▷ return tree entropy H[qφ(z | x)]

[8] We adapt the algorithm for calculating tree entropy in PCFGs from Hwa (2000) to the CRF case.
[9] ∇φ H[qφ(z | x)] can also be computed using the inside-outside algorithm and a second-order expectation semiring (Li and Eisner, 2009), which has the same asymptotic runtime complexity but generally better constants.
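The control-variate estimator above only changes the scalar weight that multiplies each ∇φ log qφ(z^(k) | x). A minimal sketch of that weight computation is shown below (the function name and toy numbers are illustrative, not from the released code); the weights are treated as constants, so no gradient flows through them.

import numpy as np

def leave_one_out_weights(joint_logliks):
    """Per-sample weights for the score-function estimator with the
    leave-one-out control variate of Mnih and Rezende (2016):
    weight_k = log p(x, z^(k)) - r^(k), where r^(k) is the average joint
    log-likelihood of the other K - 1 samples."""
    lp = np.asarray(joint_logliks, dtype=float)
    K = lp.shape[0]
    baselines = (lp.sum() - lp) / (K - 1)     # r^(k): average of the other samples
    return lp - baselines

# Example: joint log-likelihoods of K = 4 sampled trees (toy numbers).
print(leave_one_out_weights([-42.3, -40.1, -45.7, -41.0]))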
den units, and the MLP (to produce span scores 3 Experimental Setup sij for i ≤ j) has a single hidden layer with a 3.1 Data ReLU nonlinearity followed by layer normaliza- For English we use the Penn Treebank (Marcus tion (Ba et al., 2016) and dropout of 0.5. We share et al., 1993, PTB) with splits and preprocessing word embeddings between the generative model from Dyer et al.(2016) which retains punctu- and the inference network, and also tie weights ation and replaces singleton words with Berke- between the input/output word embeddings (Press ley parser’s mapping rules, resulting in a vocab- and Wolf, 2016). ulary of 23,815 word types.10 Notably this is Optimization of the model itself required stan- much larger than the standard PTB LM setup from dard techniques for avoiding posterior collapse in 12 Mikolov et al.(2010) which uses 10K types. 11 VAEs. We warm-up the ELBO objective by Also different from the LM setup, we model each linearly annealing (per batch) the weight on the sentence separately instead of carrying informa- conditional prior log pθ(z | xlearning rate equal to 1, 10https://github.com/clab/rnng 11 12 Both versions of the PTB data can be obtained from http: Posterior collapse in our context means that qφ(z | x) al- //demo.clab.cs.cmu.edu/cdyer/ptb-lm.tar.gz. ways produced trivial (always left or right branching) trees. except for the affine layer that produces a distribu- PTB CTB tion over the actions, which has learning rate 0.1. Model PPL F1 PPL F1 Gradients of the generative model are clipped at RNNLM 93.2 – 201.3 – PRPN (default) 126.2 32.9 290.9 32.9 5. The inference network is optimized with Adam PRPN (tuned) 96.7 41.2 216.0 36.1 (Kingma and Ba, 2015) with learning rate 0.0001, Left Branching Trees 100.9 10.3 223.6 12.4 β1 = 0.9, β2 = 0.999, and gradient clipping at Right Branching Trees 93.3 34.8 203.5 20.6 Random Trees 113.2 17.0 209.1 17.4 1. As Adam converges significantly faster than URNNG 90.6 40.7 195.7 29.1 SGD (even with a much lower learning rate), we RNNG 88.7 68.1 193.1 52.3 stop training the inference network after the first RNNG → URNNG 85.9 67.7 181.1 51.9 two epochs. Initial model parameters are sampled Oracle Binary Trees – 82.5 – 88.6 from U[−0.1, 0.1]. The learning rate starts decay- ing by a factor of 2 each epoch after the first epoch Table 1: Language modeling perplexity (PPL) and grammar induction F1 scores on English (PTB) and Chinese (CTB) for at which validation performance does not improve, the different models. We separate results for those that make but this learning rate decay is not triggered for do not make use of annotated data (top) versus those that do (mid). Note that our PTB setup from Dyer et al.(2016) the first eight epochs to ensure adequate training. differs considerably from the usual language modeling setup We use the same hyperparameters/training setup (Mikolov et al., 2010) since we model each sentence inde- for both PTB and CTB. For experiments on (the pendently and use a much larger vocabulary (see §3.1). subset of) the one billion word corpus, we use a chitecture as URNNG’s inference network. For all smaller dropout rate of 0.1. The baseline RNNLM models, we perform early stopping based on vali- also uses the smaller dropout rate. dation perplexity. 
4 Results and Discussion

4.1 Language Modeling

Table 1 shows perplexity for the different models on PTB/CTB. As a language model URNNG outperforms an RNNLM and is competitive with the supervised RNNG.[14] The left branching baseline performs poorly, implying that the strong performance of URNNG/RNNG is not simply due to the additional depth afforded by the tree LSTM composition function (a left branching tree, which always performs REDUCE when possible, is the "deepest" model). The right branching baseline is essentially equivalent to an RNNLM and hence performs similarly. We found PRPN with default hyperparameters (which obtains a perplexity of 62.0 in the PTB setup from Mikolov et al. (2010)) to not perform well, but tuning hyperparameters improves performance.[15] The supervised RNNG performs well as a language model, despite being trained on the joint (rather than marginal) likelihood objective.[16] This indicates that explicit modeling of syntax helps generalization even with richly-parameterized neural models.

[14] For RNNG and URNNG we estimate the log marginal likelihood (and hence, perplexity) with K = 1000 importance-weighted samples, log pθ(x) ≈ log (1/K) Σ_{k=1}^{K} pθ(x, z^(k)) / qφ(z^(k) | x). During evaluation only, we also flatten qφ(z | x) by dividing span scores sij by a temperature term 2.0 before feeding it to the CRF.
[15] Using the code from https://github.com/yikangshen/PRPN, we tuned model size, initialization, dropout, learning rate, and use of batch normalization.
[16] RNNG is trained to maximize log pθ(x, z) while URNNG is trained to maximize (a lower bound on) the language modeling objective log pθ(x).
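The importance-weighted perplexity estimate described in footnote 14 reduces to a log-mean-exp over K importance weights. A minimal sketch, with a hypothetical function name and made-up numbers, is:

import numpy as np

def iw_log_marginal(log_joint, log_q):
    """Importance-weighted estimate of log p(x) from K sampled trees
    z^(1..K) ~ q(z | x):  log (1/K) sum_k p(x, z^(k)) / q(z^(k) | x),
    computed with a stable log-sum-exp."""
    log_w = np.asarray(log_joint, dtype=float) - np.asarray(log_q, dtype=float)
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).sum()) - np.log(log_w.size)

# Toy example with K = 3 samples (numbers are made up).
print(iw_log_marginal([-40.2, -41.0, -39.8], [-5.1, -6.3, -4.9]))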
Encouraged by these observations, we also experiment with a hybrid approach where we train a supervised RNNG first and continue fine-tuning the model (including the inference network) on the URNNG objective (RNNG → URNNG in Table 1).[17] This approach results in nontrivial perplexity improvements, and suggests that it is potentially possible to improve language models with supervision on parsed data. In Figure 2 we show perplexity by sentence length. We find that a standard language model (RNNLM) is better at modeling short sentences, but underperforms models that explicitly take into account structure (RNNG/URNNG) when the sentence length is greater than 10. Table 2 (top) compares our results against prior work on this version of the PTB, and Table 2 (bottom) shows the results on a 1M sentence subset of the one billion word corpus, which is two orders of magnitude larger than PTB. On this larger dataset URNNG still improves upon the RNNLM. We also trained an RNNG (and RNNG → URNNG) on this dataset by parsing the training set with the self-attentive parser from Kitaev and Klein (2018).[18] These models improve upon the RNNLM but not the URNNG, potentially highlighting the limitations of using predicted trees for supervising RNNGs.

Figure 2: Perplexity of the different models grouped by sentence length on PTB.

Table 2: (Top) Comparison of this work as a language model against prior works on sentence-level PTB with preprocessing from Dyer et al. (2016). Note that previous versions of RNNG differ from ours in terms of parameterization and model size. (Bottom) Results on a subset (1M sentences) of the one billion word corpus. PRPN† is the model from Shen et al. (2018), whose hyperparameters were tuned by us. RNNG‡ is trained on predicted parse trees from the self-attentive parser from Kitaev and Klein (2018).

PTB                                                   PPL
KN 5-gram (Dyer et al., 2016)                         169.3
RNNLM (Dyer et al., 2016)                             113.4
Original RNNG (Dyer et al., 2016)                     102.4
Stack-only RNNG (Kuncoro et al., 2017)                101.2
Gated-Attention RNNG (Kuncoro et al., 2017)           100.9
Generative Dep. Parser (Buys and Blunsom, 2015)       138.6
RNNLM (Buys and Blunsom, 2018)                        100.7
Sup. Syntactic NLM (Buys and Blunsom, 2018)           107.6
Unsup. Syntactic NLM (Buys and Blunsom, 2018)         125.2
PRPN† (Shen et al., 2018)                             96.7
This work: RNNLM                                      93.2
This work: URNNG                                      90.6
This work: RNNG                                       88.7
This work: RNNG → URNNG                               85.9

1M Sentences                                          PPL
PRPN† (Shen et al., 2018)                             77.7
RNNLM                                                 77.4
URNNG                                                 71.8
RNNG‡                                                 72.9
RNNG‡ → URNNG                                         72.0

4.2 Grammar Induction

Table 1 also shows the F1 scores for grammar induction. Note that we induce latent trees directly from words on the full dataset.[19] For RNNG/URNNG we obtain the highest scoring tree from qφ(z | x) through the Viterbi inside (i.e. CKY) algorithm.[20] We calculate unlabeled F1 using evalb, which ignores punctuation and discards trivial spans (width-one and sentence spans).[21] Since we compare F1 against the original, non-binarized trees (per convention), F1 scores of models using oracle binarized trees constitute the upper bounds.

We confirm the replication study of Htut et al. (2018) and find that PRPN is a strong model for grammar induction. URNNG performs on par with PRPN on English but PRPN does better on Chinese; both outperform right branching baselines.

[17] We fine-tune for 10 epochs and use a smaller learning rate of 0.1 for the generative model.
[18] To parse the training set we use the benepar_en2 model from https://github.com/nikitakit/self-attentive-parser, which obtains an F1 score of 95.17 on the PTB test set.
[19] Past work on grammar induction usually train/evaluate on short sentences and also assume access to gold POS tags (Klein and Manning, 2002; Smith and Eisner, 2004; Bod, 2006). However more recent works train directly on words (Jin et al., 2018; Shen et al., 2018; Drozdov et al., 2019).
[20] Alternatively, we could estimate arg max_z pθ(z | x) by sampling parse trees from qφ(z | x) and using pθ(x, z) to rerank the output, as in Dyer et al. (2016).
[21] Available at https://nlp.cs.nyu.edu/evalb/. We evaluate with the COLLINS.prm parameter file and the LABELED option equal to 0. We observe that the setup for grammar induction varies widely across different papers: lexicalized vs. unlexicalized; use of punctuation vs. not; separation of train/test sets; counting sentence-level spans for evaluation vs. ignoring them; use of additional data; length cutoff for training/evaluation; corpus-level F1 vs. sentence-level F1; and more. In our survey of twenty or so papers, almost no two papers were identical in their setup. Such variation makes it difficult to meaningfully compare models across papers. Hence, we report grammar induction results mainly for the models and baselines considered in the present work.
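The Viterbi inside (CKY) search mentioned above is Algorithm 1 with sums replaced by maxes plus backpointers. A short sketch follows (an illustrative implementation, not the released code); for evalb-style evaluation the trivial width-one and whole-sentence spans it returns would be discarded.

import numpy as np

def viterbi_tree(s):
    """CKY-style Viterbi inside pass over span scores s[i, j] (0-indexed).
    Returns the best total score and the set of spans of the argmax tree."""
    T = s.shape[0]
    best = np.full((T, T), -np.inf)
    split = np.zeros((T, T), dtype=int)
    for i in range(T):
        best[i, i] = s[i, i]
    for length in range(1, T):
        for i in range(T - length):
            j = i + length
            for k in range(i, j):             # best split point for span (i, j)
                score = s[i, j] + best[i, k] + best[k + 1, j]
                if score > best[i, j]:
                    best[i, j], split[i, j] = score, k
    spans, stack = [], [(0, T - 1)]
    while stack:                              # backtrack the argmax splits
        i, j = stack.pop()
        spans.append((i, j))
        if i < j:
            k = split[i, j]
            stack.extend([(i, k), (k + 1, j)])
    return best[0, T - 1], sorted(spans)

best_score, spans = viterbi_tree(np.random.default_rng(1).normal(size=(5, 5)))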
Table 3: (Left) F1 scores of URNNG against other trees. "Self" refers to another URNNG trained with a different random seed. (Right) Recall of constituents by label for URNNG and PRPN. Recall for a particular label is the fraction of ground truth constituents of that label that were identified by the model (as in Htut et al. (2018)).

Tree     PTB    CTB          Label    URNNG    PRPN
Gold     40.7   29.1         SBAR     74.8%    28.9%
Left     9.2    8.4          NP       39.5%    63.9%
Right    68.3   51.2         VP       76.6%    27.3%
Self     92.3   87.3         PP       55.8%    55.1%
RNNG     55.4   47.1         ADJP     33.9%    42.5%
PRPN     41.0   47.2         ADVP     50.4%    45.1%

Table 5: Metrics related to the generative model/inference network for RNNG/URNNG. For the supervised RNNG we take the "inference network" to be the discriminative parser trained alongside the generative model (see §3.3). Recon. PPL is the reconstruction perplexity based on E_{qφ(z | x)}[log pθ(x | z)], and KL is the Kullback-Leibler divergence. Prior entropy is the entropy of the conditional prior pθ(z | x<t).

                      PTB               CTB
                      RNNG    URNNG     RNNG    URNNG
PPL                   88.7    90.6      193.1   195.7
Recon. PPL            74.6    73.4      183.4   151.9
KL                    7.10    6.13      11.11   8.91
Prior Entropy         7.65    9.61      9.48    15.13
Post. Entropy         1.56    2.28      6.23    5.75
Unif. Entropy         26.07   26.07     30.17   30.17
5 Related Work

There has been much work on incorporating tree structures into deep models for syntax-aware language modeling, both for unconditional (Emami and Jelinek, 2005; Buys and Blunsom, 2015; Dyer et al., 2016) and conditional (Yin and Neubig, 2017; Alvarez-Melis and Jaakkola, 2017; Rabinovich et al., 2017; Aharoni and Goldberg, 2017; Eriguchi et al., 2017; Wang et al., 2018; Gu et al., 2018) cases. These approaches generally rely on annotated parse trees during training and maximize the joint likelihood of sentence-tree pairs. Prior work on combining language modeling and unsupervised tree learning typically embeds soft, tree-like structures as hidden layers of a deep network. Our approach is also related to CRF autoencoders (Ammar et al., 2014) and structured VAEs (Johnson et al., 2016; Krishnan et al., 2017), which have been used previously for unsupervised (Cai et al., 2017; Drozdov et al., 2019; Li et al., 2019) and semi-supervised (Yin et al., 2018; Corro and Titov, 2019) parsing.

6 Conclusion

It is an open question as to whether explicit modeling of syntax significantly helps neural models. Strubell et al. (2018) find that supervising intermediate attention layers with syntactic heads improves semantic role labeling, while Shi et al. (2018) observe that for text classification, syntactic trees only have marginal impact. Our work suggests that at least for language modeling, incorporating syntax either via explicit supervision or as latent variables does provide useful inductive biases and improves performance.

Finally, in modeling child language acquisition, the complex interaction of the parser and the grammatical knowledge being acquired is the object of much investigation (Trueswell and Gleitman, 2007); our work shows that apparently grammatical constraints can emerge from the interaction of a constrained parser and a more general grammar learner, which is an intriguing but underexplored hypothesis for explaining human linguistic biases.

[23] The main time bottleneck is the dynamic computation graph, since the dynamic programming algorithm can be batched (however the latter is a significant memory bottleneck). We manually batch the SHIFT and REDUCE operations as much as possible, though recent work on auto-batching (Neubig et al., 2017) could potentially make this easier/faster.
[24] Many prior works that induce trees directly from words often employ additional heuristics based on punctuation (Seginer, 2007; Ponvert et al., 2011; Spitkovsky et al., 2013; Parikh et al., 2014), as punctuation (e.g. comma) is usually a reliable signal for the start/end of constituent spans. The URNNG still has to learn to rely on punctuation, similar to recent works such as depth-bounded PCFGs (Jin et al., 2018) and DIORA (Drozdov et al., 2019). In contrast, PRPN (Shen et al., 2018) and Ordered Neurons (Shen et al., 2019) induce trees by directly training on corpus without punctuation. We also reiterate that punctuation is used during training but ignored during evaluation (except in Table 4).

Acknowledgments

We thank the members of the DeepMind language team for helpful feedback. YK is supported by a Google Fellowship. AR is supported by NSF Career 1845664.

References
Tree Neural Machine Translation. In Proceedings of ACL. Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree- Alexander M. Rush. 2018. Latent Alignment and Varia- structured Decoding with Doubly-Recurrent Neural Net- tional Attention. In Proceedings of NIPS. works. In Proceedings of ICLR. Andrew Drozdov, Patrick Verga, Mohit Yadev, Mohit Iyyer, Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Con- and Andrew McCallum. 2019. Unsupervised Latent ditional Random Field Autoencoders for Unsupervised Tree Induction with Deep Inside-Outside Recursive Auto- Structured Prediction. In Proceedings of NIPS. Encoders. In Proceedings of NAACL.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Greg Durrett and Dan Klein. 2015. Neural CRF Parsing. In 2016. Layer Normalization. In Proceedings of NIPS. Proceedings of ACL.
James K. Baker. 1979. Trainable Grammars for Speech Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Recognition. In Proceedings of the Spring Conference of Matthews, and Noah A. Smith. 2015. Transition-Based the Acoustical Society of America. Dependency Parsing with Stack Long Short-Term Mem- ory. In Proceedings of ACL. Rens Bod. 2006. An All-Subtrees Approach to Unsupervised Parsing. In Proceedings of ACL. Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent Neural Network Gram- Samuel R. Bowman, Luke Vilnis, Oriol Vinyal, Andrew M. mars. In Proceedings of NAACL. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generat- ing Sentences from a Continuous Space. In Proceedings Ahmad Emami and Frederick Jelinek. 2005. A Neural Syn- of CoNLL. tactic Language Model. Machine Learning, 60:195–227.
James Bradbury and Richard Socher. 2017. Towards Neural Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Machine Translation with Latent Tree Attention. In Pro- 2017. Learning to Parse and Translate Improves Neural ceedings of the 2nd Workshop on Structured Prediction for Machine Translation. In Proceedings of ACL. Natural Language Processing. Jenny Rose Finkel, Alex Kleeman, and Christopher D. Man- Jan Buys and Phil Blunsom. 2015. Generative Incremental ning. 2008. Efficient, Feature-based, Conditional Random Dependency Parsing with Neural Networks. In Proceed- Field Parsing. In Proceedings of ACL. ings of ACL. Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Jan Buys and Phil Blunsom. 2018. Neural Syntactic Gener- Ng. 2006. Solving the Problem of Cascading Errors: Ap- ative Models with Exact Marginalization. In Proceedings proximate Bayesian Inference for Linguistic Annotation of NAACL. Pipelines. In Proceedings of EMNLP.
Jiong Cai, Yong Jiang, and Kewei Tu. 2017. CRF Autoen- Kuzman Ganchev, Joao˜ Grac¸a, Jennifer Gillenwater, and Ben coder for Unsupervised Dependency Parsing. In Proceed- Taskar. 2010. Posterior Regularization for Structured La- ings of EMNLP. tent Variable Models. Journal of Machine Learning Re- search, 11:2001–2049. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Peter Glynn. 1987. Likelihood Ratio Gradient Estimation: Thorsten Brants, Phillipp Koehn, and Tony Robin- An Overview. In Proceedings of Winter Simulation Con- son. 2013. One Billion Word Benchmark for ference. Measuring Progress in Statistical Language Modeling. arXiv:1312.3005. Joshua Goodman. 1998. Parsing Inside-Out. PhD thesis, Harvard University. Danqi Chen and Christopher D. Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Jetic Gu, Hassan S. Shavarani, and Anoop Sarkar. 2018. Top- Proceedings of EMNLP. down Tree Structured Decoding with Syntactic Connec- tions for Neural Machine Translation and Parsing. In Pro- Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, ceedings of EMNLP. and Yoshua Bengio. 2014. On the Properties of Neu- ral Machine Translation: Encoder-Decoder Approaches. John Hale, Chris Dyer, Adhiguna Kuncoro, and Jonathan R. In Proceedings of Eighth Workshop on Syntax, Semantics Brennan. 2018. Finding Syntax in Human Encephalogra- and Structure in Statistical Translation. phy with Beam Search. In Proceedings of ACL.
Jihun Choi, Kang Min Yoo, and Sang goo Lee. 2018. Learn- Serhii Havrylov, German´ Kruszewski, and Armand Joulin. ing to Compose Task-Specific Tree Structures. In Pro- 2019. Cooperative Learning of Disjoint Syntax and Se- ceedings of AAAI. mantics. In Proceedings of NAACL.
Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Phu Mon Htut, Kyunghyun Cho, and Samuel R. Bowman. Hierarchical Multiscale Recurrent Neural Networks. In 2018. Grammar Induction with Neural Language Models: Proceedings of ICLR. An Unusual Replication. In Proceedings of EMNLP.
Caio Corro and Ivan Titov. 2019. Differentiable Perturb-and- Rebecca Hwa. 1999. Supervised Grammar Induction Using Parse: Semi-Supervised Parsing with a Structured Varia- Training Data with Limited Constituent Information. In tional Autoencoder. In Proceedings of ICLR. Proceedings of ACL. Rebecca Hwa. 2000. Sample Selection for Statistical Gram- Jean Maillard and Stephen Clark. 2018. Latent Tree Learn- mar Induction. In Proceedings of EMNLP. ing with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing. In Proceedings of the Workshop on the Rel- Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William evance of Linguistic Structure in Neural Architectures for Schuler, and Lane Schwartz. 2018. Unsupervised Gram- NLP. mar Induction with Depth-bounded PCFG. In Proceed- ings of TACL. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Annotated Corpus of Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. English: The Penn Treebank. Computational Linguistics, 2007. Bayesian Inference for PCFGs via Markov chain 19:313–330. Monte Carlo. In Proceedings of NAACL. Rebecca Marvin and Tal Linzen. 2018. Targeted Syntac- Matthew Johnson, David K. Duvenaud, Alex Wiltschko, tic Evaluation of Language Models. In Proceedings of Ryan P. Adams, and Sandeep R. Datta. 2016. Compos- EMNLP. ing Graphical Models with Neural Networks for Struc- tured Representations and Fast Inference. In Proceedings Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. 2005. of NIPS. Probabilistic CFG with Latent Annotations. In Proceed- ings of ACL. Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2018. Gabor´ Melis, Chris Dyer, and Phil Blunsom. 2018. On the Unsupervised Word Discovery with Segmental Neural State of the Art of Evaluation in Neural Language Models. Language Models. arXiv:1811.09353. In Proceedings of ICLR.
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Rush. 2017. Structured Attention Networks. In Proceed- 2018. Regularizing and Optimizing LSTM Language ings of ICLR. Models. In Proceedings of ICLR.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cer- for Stochastic Optimization. In Proceedings of ICLR. nocky, and Sanjeev Khudanpur. 2010. Recurrent Neural Network Based Language Model. In Proceedings of IN- Diederik P. Kingma and Max Welling. 2014. Auto-Encoding TERSPEECH. Variational Bayes. In Proceedings of ICLR. Andriy Mnih and Karol Gregor. 2014. Neural variational in- Nikita Kitaev and Dan Klein. 2018. Constituency Parsing ference and learning in belief networks. In Proceedings of with a Self-Attentive Encoder. In Proceedings of ACL. ICML.
Dan Klein and Christopher Manning. 2002. A Generative Andriy Mnih and Danilo J. Rezende. 2016. Variational In- Constituent-Context Model for Improved Grammar Induc- ference for Monte Carlo Objectives. In Proceedings of tion. In Proceedings of ACL. ICML.
Rahul G. Krishnan, Uri Shalit, and David Sontag. 2017. Tahira Naseem, Harr Chen, Regina Barzilay, and Mark John- Structured Inference Networks for Nonlinear State Space son. 2010. Using Universal Linguistic Knowledge to Models. In Proceedings of AAAI. Guide Grammar Induction. In Proceedings of EMNLP.
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Graham Neubig, Yoav Goldberg, and Chris Dyer. 2017. Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. On-the-fly Operation Batching in Dynamic Computation What Do Recurrent Neural Network Grammars Learn Graphs. arXiv:1705.07860. About Syntax? In Proceedings of EACL. Vlad Niculae, Andre´ F. T. Martins, and Claire Cardie. 2018. Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Towards Dynamic Computation Graphs via Sparse Latent Proceedings of EMNLP Stephen Clark, and Phil Blunsom. 2018. LSTMs Can Structure. In . Learn Syntax-Sensitive Dependencies Well, But Model- Ankur P. Parikh, Shay B. Cohen, and Eric P. Xing. 2014. ing Structure Makes Them Better. In Proceedings of ACL. Spectral Unsupervised Parsing with Additive Tree Met- rics. In Proceedings of ACL. Bowen Li, Jianpeng Cheng, Yang Liu, and Frank Keller. 2019. Dependency Grammar Induction with a Neural Hao Peng, Sam Thomson, and Noah A. Smith. 2018. Back- Variational Transition-based Parser. In Proceedings of propagating through Structured Argmax using a SPIGOT. AAAI. In Proceedings of ACL.
Zhifei Li and Jason Eisner. 2009. First- and Second-Order Elis Ponvert, Jason Baldridge, and Katrin Erk. 2011. Sim- Expectation Semirings with Applications to Minimum- pled Unsupervised Grammar Induction from Raw Text Risk Training on Translation Forests. In Proceedings of with Cascaded Finite State Methods. In Proceedings of EMNLP. ACL.
Yang Liu, Matt Gardner, and Mirella Lapata. 2018. Struc- Ofir Press and Lior Wolf. 2016. Using the Output Embedding tured Alignment Networks for Matching Sentences. In to Improve Language Models. In Proceedings of EACL. Proceedings of EMNLP. Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Yang Liu and Mirella Lapata. 2018. Learning Structured Abstract Syntax Networks for Code Generation and Se- Text Representations. In Proceedings of TACL. mantic Parsing. In Proceedings of ACL. Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2014. Stochastic Backpropagation and Approximate In- 2018. Learning Neural Templates for Text Generation. In ference in Deep Generative Models. In Proceedings of Proceedings of EMNLP. ICML. Naiwen Xue, Fei Xia, Fu dong Chiou, and Marta Palmer. Yoav Seginer. 2007. Fast Unsupervised Incremental Parsing. 2005. The Penn Chinese Treebank: Phrase Structure An- In Proceedings of ACL. notation of a Large Corpus. Natural Language Engineer- ing, 11:207–238. Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural Language Modeling by Jointly Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and Learning Syntax and Lexicon. In Proceedings of ICLR. William W. Cohen. 2018. Breaking the Softmax Bottle- neck: A High-Rank RNN Language Model. In Proceed- Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron ings of ICLR. Courville. 2019. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In Proceed- Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neu- ings of ICLR. ral Model for General-Purpose Code Generation. In Pro- ceedings of ACL. Haoyue Shi, Hao Zhou, Jiaze Chen, and Lei Li. 2018. On Tree-based Neural Sentence Modeling. In Proceedings of Pengcheng Yin, Chunting Zhou, Junxian He, and Graham EMNLP. Neubig. 2018. StructVAE: Tree-structured Latent Vari- able Models for Semi-supervised Semantic Parsing. In Noah A. Smith and Jason Eisner. 2004. Annealing Tech- Proceedings of ACL. niques for Unsupervised Statistical Language Learning. In Proceedings of ACL. Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefen- stette, and Wang Ling. 2017. Learning to Compose Words Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, into Sentences with Reinforcement Learning. In Proceed- Søren Kaae Sønderby, and Ole Winther. 2016. Ladder ings of ICLR. Variational Autoencoders. In Proceedings of NIPS. Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. Long Short-Term Memory Over Tree Structures. In Pro- 2013. Breaking Out of Local Optima with Count Trans- ceedings of ICML. forms and Model Recombination: A Study in Grammar Induction. In Proceedings of EMNLP.
Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A Min- imal Span-Based Neural Constituency Parser. In Proceed- ings of ACL.
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceed- ings of EMNLP.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree- Structured Long Short-Term Memory Networks. In Pro- ceedings of ACL.
Ke Tran and Yonatan Bisk. 2018. Inducing Grammars with and for Neural Machine Translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Gener- ation.
John C. Trueswell and Lila R. Gleitman. 2007. Learning to Parse and its Implications for Language Acquisition. The Oxford Handbook of Psycholinguistics.
Wenhui Wang and Baobao Chang. 2016. Graph-based De- pendency Parsing with Bidirectional LSTM. In Proceed- ings of ACL.
Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neu- big. 2018. A Tree-based Decoder for Neural Machine Translation. In Proceedings of EMNLP.
Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do Latent Tree Learning Models Identify Meaning- ful Structure in Sentences? In Proceedings of TACL.
Ronald J. Williams. 1992. Simple Statistical Gradient- following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8.