Unsupervised Recurrent Neural Network Grammars

Home , Grammar induction, Unsupervised learning

Yoon Kim† Alexander M. Rush† Lei Yu3 Adhiguna Kuncoro‡,3 Chris Dyer3 Gabor´ Melis3

†Harvard University ‡University of Oxford 3DeepMind {yoonkim,srush}@seas.harvard.edu {leiyu,akuncoro,cdyer,melisgl}@google.com

Abstract The standard setup for unsupervised structure p (x, z) Recurrent neural network grammars (RNNG) learning is to define a generative model θ are generative models of language which over observed data x (e.g. sentence) and unob- jointly model syntax and surface structure by served structure z (e.g. parse tree, part-of-speech incrementally generating a syntax tree and sequence), and maximize the log marginal like- P sentence in a top-down, left-to-right order. lihood log pθ(x) = log z pθ(x, z). Success- Supervised RNNGs achieve strong language ful approaches to unsupervised parsing have made modeling and parsing performance, but re- strong conditional independence assumptions (e.g. quire an annotated corpus of parse trees. In context-freeness) and employed auxiliary objec- this work, we experiment with unsupervised learning of RNNGs. Since directly marginal- tives (Klein and Manning, 2002) or priors (John- izing over the space of latent trees is in- son et al., 2007). These strategies imbue the learn- tractable, we instead apply amortized varia- ing process with inductive biases that guide the tional inference. To maximize the evidence model to discover meaningful structures while al- lower bound, we develop an inference net- lowing tractable algorithms for marginalization; work parameterized as a neural CRF con- however, they come at the expense of language stituency parser. On language modeling, unsu- modeling performance, particularly compared to pervised RNNGs perform as well their supervised counterparts on benchmarks in English sequential neural models that make no indepen- and Chinese. On constituency grammar in- dence assumptions. duction, they are competitive with recent neu- Like RNN language models, RNNGs make no ral language models that induce tree structures independence assumptions. Instead they encode from words through attention mechanisms. structural bias through operations that compose 1 Introduction linguistic constituents. The lack of independence Recurrent neural network grammars (RNNGs) assumptions contributes to the strong language (Dyer et al., 2016) model sentences by first gen- modeling performance of RNNGs, but make unsu- erating a nested, hierarchical syntactic structure pervised learning challenging. First, marginaliza- which is used to construct a context representation tion is intractable. Second, the biases imposed by to be conditioned upon for upcoming words. Su- the RNNG are relatively weak compared to those pervised RNNGs have been shown to outperform imposed by models like PCFGs. There is little standard sequential language models, achieve ex- pressure for non-trivial tree structure to emerge cellent results on parsing (Dyer et al., 2016; Kun- during unsupervised RNNG (URNNG) learning. coro et al., 2017), better encode syntactic proper- In this work, we explore a technique for han- ties of language (Kuncoro et al., 2018), and cor- dling intractable marginalization while also inject- relate with electrophysiological responses in the ing inductive bias. Specifically we employ amor- human brain (Hale et al., 2018). However, these tized variational inference (Kingma and Welling, all require annotated syntactic trees for training. 2014; Rezende et al., 2014; Mnih and Gregor, In this work, we explore unsupervised learning of 2014) with a structured inference network. Varia- recurrent neural network grammars for language tional inference lets us tractably optimize a lower modeling and grammar induction. bound on the log marginal likelihood, while em- Work done while the first author was an intern at DeepMind. ploying a structured inference network encour- Code available at https://github.com/harvardnlp/urnng ages non-trivial structure. In particular, a con-

1105 Proceedings of NAACL-HLT 2019, pages 1105–1117 Minneapolis, Minnesota, June 2 - June 7, 2019. c 2019 Association for Computational Linguistics ditional random field (CRF) constituency parser (Finkel et al., 2008; Durrett and Klein, 2015), which makes significant independence assumptions, acts as a guide on the generative model to learn meaningful trees through regularizing the posterior (Ganchev et al., 2010). We experiment with URNNGs on English and Chinese and observe that they perform well as language models compared to their supervised counterparts and standard neural LMs. In terms of grammar induction, they are competitive with recently-proposed neural architectures that discover tree-like structures through gated attention (Shen et al., 2018). Our results, along with other Figure 1: Overview of our approach. The inference recent work on joint language modeling/structure network qφ(z | x) (left) is a CRF parser which pro- learning with deep networks (Shen et al., 2018, duces a distribution over binary trees (shown in dotted box). B are random variables for existence of a con- 2019; Wiseman et al., 2018; Kawakami et al., ij stituent spanning i-th and j-th words, whose potentials 2018), suggest that it is possible learn generative are the output from a bidirectional LSTM (the global models of language that model the underlying data factor ensures that the distribution is only over valid bi- well (i.e. assign high likelihood to held-out data) nary trees). The generative model pθ(x, z) (right) is an and at the same time induce meaningful linguistic RNNG which consists of a stack LSTM (from which structure. actions/words are predicted) and a tree LSTM (to obtain constituent representations upon REDUCE). Train- 2 Unsupervised Recurrent Neural ing involves sampling a binary tree from qφ(z | x), con- Network Grammars verting it to a sequence of shift/reduce actions, and op- We use x = [x1, . . . , xT ] to denote a sentence of timizing the log joint likelihood log pθ(x, z). length T , and z ∈ ZT to denote an unlabeled bi- representation onto the stack. nary parse tree over a sequence of length T , repre- Formally, let S = [(0, 0)] be the initial stack. sented as a a binary vector of length 2T − 1. Here Each item of the stack will be a pair, where the first 0 and 1 correspond to SHIFT and REDUCE actions, element is the hidden state of the stack LSTM, and explained below.1 Figure1 presents an overview the second element is an input vector, described of our approach. below. We use top(S) to refer to the top pair in 2.1 Generative Model the stack. The push and pop operations are de- fined imperatively in the usual way. At each time An RNNG defines a joint probability distribu- step, the next action z (SHIFT or REDUCE) is sam- tion p (x, z) over sentences x and parse trees t θ pled from a Bernoulli distribution parameterized z. We consider a simplified version of the orig- in terms of the current stack representation. Let- inal RNNG (Dyer et al., 2016) by ignoring con- ting (h , g ) = top(S), we have stituent labels and only considering binary trees. prev prev > The RNNG utilizes an RNN to parameterize a zt ∼ Bernoulli(pt), pt = σ(w hprev + b). stack data structure (Dyer et al., 2015) of partially- Subsequent generation depend on zt: completed constituents to incrementally build the • If zt = 0 (SHIFT), the model first generates a parse tree while generating terminals. Using the terminal symbol via sampling from a categori- current stack representation, the model samples cal distribution whose parameters come from an an action (SHIFT or REDUCE): SHIFT generates affine transformation and a softmax, a terminal symbol, i.e. word, and shifts it onto x ∼ softmax(Wh + b). the stack,2 REDUCE pops the last two elements off prev the stack, composes them, and shifts the composed Then the generated terminal is shifted onto the

1 2T −1 stack using a stack LSTM, The cardinality of ZT ⊂ {0, 1} is given by the (2T −2)! h = LSTM(e , h ), (T − 1)-th Catalan number, |ZT | = T !(T −1)! . next x prev 2 A better name for SHIFT would be GENERATE (as in push(S, (hnext, ex)), Dyer et al.(2016)), but we use SHIFT to emphasize similarity with the shift-reduce parsing. where ex is the word embedding for x. 1106 • If zt = 1 (REDUCE), we pop the last two ele- In the supervised case where ground-truth z ments off the stack, is available, we can straightforwardly perform gradient-based optimization to maximize the joint (hr, gr) = pop(S), (hl, gl) = pop(S), log likelihood log pθ(x, z). In the unsupervised and obtain a new representation that combines case, the standard approach is to maximize the log the left/right constituent representations using a marginal likelihood, tree LSTM (Tai et al., 2015; Zhu et al., 2015), X log p (x) = log p (x, z0). gnew = TreeLSTM(gl, gr). θ θ 0 z ∈ZT Note that we use gl and gr to obtain the new 3 representation instead of hl and hr. We then However this summation is intractable be- update the stack using gnew, cause zt fully depends on all previous actions [z , . . . , z ]. Even if this summation were (h , g ) = top(S), 1 t−1 prev prev tractable, it is not clear that meaningful latent hnew = LSTM(gnew, hprev), structures would emerge given the lack of explicit push(S, (hnew, gnew)). independence assumptions in the RNNG (e.g. it is The generation process continues until an end-of- clearly not context-free). We handle these issues sentence symbol is generated. The parameters θ of with amortized variational inference. the generative model are w, b, W, b, and the pa- 2.2 Amortized Variational Inference rameters of the stack/tree LSTMs. For a sentence Amortized variational inference (Kingma and x = [x1, . . . , xT ] of length T , the binary parse tree Welling, 2014) deﬁnes a trainable inference net- is given by the binary vector z = [z , . . . , z ].4 1 2T −1 work φ that parameterizes qφ(z | x), a variational The joint log likelihood decomposes as a sum of posterior distribution, in this case over parse trees terminal/action log likelihoods, z given the sentence x. This distribution is used to T X form an evidence lower bound (ELBO) on the log log pθ(x, z) = log pθ(xt | x

1108 Algorithm 2 Top-down sampling a tree from qφ(z | x) Algorithm 3 Calculating the tree entropy H[qφ(z | x)] 1: procedure SAMPLE(β) . β from running INSIDE(s) 1: procedure ENTROPY(β) . β from running INSIDE(s) 2: B = 0 . binary matrix representation of tree 2: for i := 1 to T do . initialize entropy table 3: Q = [(1,T )] . queue of constituents 3: H[i, i] = 0 4: while Q is not empty do 4: for l := 1 to T − 1 do . span length 5: (i, j) = pop(Q) 5: for i := 1 to T − l do . span start Pj−1 6: τ = k=i β[i, k] · β[k + 1, j] 6: j = i + l . span end 7: for k := i to j − 1 do . get distribution over splits Pj−1 7: τ = u=i β[i, u] · β[u + 1, j] 8: wk = (β[i, k] · β[k + 1, j])/τ 8: for u := i to j − 1 do 9: k ∼ Cat([wi, . . . , wj−1]) . sample a split point 9: wu = (β[i, u] · β[u + 1, j])/τ j−1 10: Bi,k = 1, Bk+1,j = 1 . update B P 10: H[i, j] = u=i(H[i, u] + H[u + 1, j] 11: if k > i then . if left child has width > 1 11: − log wu) · wu 12: push(Q, (i, k)) . add to queue 12: return H[1,T ] . return tree entropy H[qφ(z | x)] 13: if k + 1 < j then . if right child has width > 1 14: push(Q, (k + 1, j)) . add to queue corpus (Chelba et al., 2013). We randomly sam- 15: z = f(B) . f : BT → ZT maps matrix representation of tree to sequence of actions. ple 1M sentences for training and 2K sentences 16: return z for validation/test, and limit the vocabulary to 30K word types. While still a subset of the full corpus The above estimator is unbiased but typically suf- (which has 30M sentences), this dataset is two or- fers from high variance. To reduce variance, we ders of magnitude larger than PTB. Experiments use a control variate derived from an average of on Chinese utilize version 5.1 of the Chinese Penn the other samples’ joint likelihoods (Mnih and Treebank (CTB) (Xue et al., 2005), with the same Rezende, 2016), yielding the following estimator, splits as in Chen and Manning(2014). Singleton K 1 X words are replaced with a single hUNKi token, re- (log p (x, z(k)) − r(k))∇ log q (z(k) | x), K θ φ φ sulting in a vocabulary of 17,489 word types. k=1 3.2 Training and Hyperparameters (k) 1 P (j) where r = K−1 j6=k log pθ(x, z ). This The stack LSTM has two layers with input/hidden control variate worked better than alternatives size equal to 650 and dropout of 0.5. The tree such as estimates of baselines from an auxiliary LSTM also has 650 units. The inference network network (Mnih and Gregor, 2014; Deng et al., uses a one-layer bidirectional LSTM with 256 hid- 2018) or a language model (Yin et al., 2018). den units, and the MLP (to produce span scores 3 Experimental Setup sij for i ≤ j) has a single hidden layer with a 3.1 Data ReLU nonlinearity followed by layer normaliza- For English we use the Penn Treebank (Marcus tion (Ba et al., 2016) and dropout of 0.5. We share et al., 1993, PTB) with splits and preprocessing word embeddings between the generative model from Dyer et al.(2016) which retains punctu- and the inference network, and also tie weights ation and replaces singleton words with Berke- between the input/output word embeddings (Press ley parser’s mapping rules, resulting in a vocab- and Wolf, 2016). ulary of 23,815 word types.10 Notably this is Optimization of the model itself required stan- much larger than the standard PTB LM setup from dard techniques for avoiding posterior collapse in 12 Mikolov et al.(2010) which uses 10K types. 11 VAEs. We warm-up the ELBO objective by Also different from the LM setup, we model each linearly annealing (per batch) the weight on the sentence separately instead of carrying informa- conditional prior log pθ(z | xlearning rate equal to 1, 10https://github.com/clab/rnng 11 12 Both versions of the PTB data can be obtained from http: Posterior collapse in our context means that qφ(z | x) al- //demo.clab.cs.cmu.edu/cdyer/ptb-lm.tar.gz. ways produced trivial (always left or right branching) trees.

1109 except for the affine layer that produces a distribu- PTB CTB tion over the actions, which has learning rate 0.1. Model PPL F1 PPL F1 Gradients of the generative model are clipped at RNNLM 93.2 – 201.3 – PRPN (default) 126.2 32.9 290.9 32.9 5. The inference network is optimized with Adam PRPN (tuned) 96.7 41.2 216.0 36.1 (Kingma and Ba, 2015) with learning rate 0.0001, Left Branching Trees 100.9 10.3 223.6 12.4 β1 = 0.9, β2 = 0.999, and gradient clipping at Right Branching Trees 93.3 34.8 203.5 20.6 Random Trees 113.2 17.0 209.1 17.4 1. As Adam converges significantly faster than URNNG 90.6 40.7 195.7 29.1 SGD (even with a much lower learning rate), we RNNG 88.7 68.1 193.1 52.3 stop training the inference network after the first RNNG → URNNG 85.9 67.7 181.1 51.9 two epochs. Initial model parameters are sampled Oracle Binary Trees – 82.5 – 88.6 from U[−0.1, 0.1]. The learning rate starts decay- Table 1: Language modeling perplexity (PPL) and ing by a factor of 2 each epoch after the first epoch grammar induction F1 scores on English (PTB) and at which validation performance does not improve, Chinese (CTB) for the different models. Note that our but this learning rate decay is not triggered for PTB setup from Dyer et al.(2016) differs consider- the first eight epochs to ensure adequate training. ably from the usual language modeling setup (Mikolov We use the same hyperparameters/training setup et al., 2010) since we model each sentence indepen- for both PTB and CTB. For experiments on (the dently and use a much larger vocabulary (see §3.1). subset of) the one billion word corpus, we use a chitecture as URNNG’s inference network. For all smaller dropout rate of 0.1. The baseline RNNLM models, we perform early stopping based on vali- also uses the smaller dropout rate. dation perplexity. All models are trained with an end-of-sentence 4 Results and Discussion token, but for perplexity calculation these tokens 4.1 Language Modeling are not counted to be comparable to prior work Table1 shows perplexity for the different models (Dyer et al., 2016; Kuncoro et al., 2017; Buys and on PTB/CTB. As a language model URNNG out- Blunsom, 2018). To be more precise, the inference performs an RNNLM and is competitive with the network does not make use of the end-of-sentence supervised RNNG.14 The left branching baseline token to produce parse trees, but the generative performs poorly, implying that the strong perfor- model is trained to generate the end-of-sentence mance of URNNG/RNNG is not simply due to token after the final REDUCE operation. the additional depth afforded by the tree LSTM 3.3 Baselines composition function (a left branching tree, which We compare the unsupervised RNNG (URNNG) always performs REDUCE when possible, is the against several baselines: (1) RNNLM, a standard “deepest” model). The right branching baseline RNN language model whose size is the same as is essentially equivalent to an RNNLM and hence URNNG’s stack LSTM; (2) Parsing Reading Pre- performs similarly. We found PRPN with de- dict Network (PRPN) (Shen et al., 2018), a neu- fault hyperparameters (which obtains a perplex- ral language model that uses gated attention lay- ity of 62.0 in the PTB setup from Mikolov et al. ers to embed soft tree-like structures into a neu- (2010)) to not perform well, but tuning hyperpa- ral network (and among the current state-of-the-art rameters improves performance.15 The supervised in grammar induction from words on the full cor- RNNG performs well as a language model, despite pus); (3) RNNG with trivial trees (left branching, being trained on the joint (rather than marginal) right branching, random); (4) supervised RNNG likelihood objective.16 This indicates that explicit trained on unlabeled, binarized gold trees.13 Note 14 that the supervised RNNG also trains a discrim- For RNNG and URNNG we estimate the log marginal likelihood (and hence, perplexity) with inative parser qφ(z | x) (alongside the generative K = 1000 importance-weighted samples, log pθ(x) ≈ 1 PK log p(x,z(k)) model pθ(x, z)) in order to sample parse forests log (k) . During evaluation only, we K k=1 qφ(z | x) for perplexity evaluation (i.e. importance sam- also flatten qφ(z | x) by dividing span scores sij by a pling). This discriminative parser has the same ar- temperature term 2.0 before feeding it to the CRF. 15Using the code from https://github.com/yikangshen/ 13We use right branching binarization—Matsuzaki et al. PRPN, we tuned model size, initialization, dropout, learning (2005) find that differences between various binarization rate, and use of batch normalization. 16 schemes have marginal impact. Our supervised RNNG RNNG is trained to maximize log pθ(x, z) while therefore differs the original RNNG, which trains on non- URNNG is trained to maximize (a lower bound on) the lan- binarized trees and does not ignore constituent labels. guage modeling objective log pθ(x).

1110 PTB PPL KN 5-gram (Dyer et al., 2016) 169.3 RNNLM (Dyer et al., 2016) 113.4 Original RNNG (Dyer et al., 2016) 102.4 Stack-only RNNG (Kuncoro et al., 2017) 101.2 Gated-Attention RNNG (Kuncoro et al., 2017) 100.9 Generative Dep. Parser (Buys and Blunsom, 2015) 138.6 RNNLM (Buys and Blunsom, 2018) 100.7 Sup. Syntactic NLM (Buys and Blunsom, 2018) 107.6 Figure 2: Perplexity of the different models grouped by Unsup. Syntactic NLM (Buys and Blunsom, 2018) 125.2 † sentence length on PTB. PRPN (Shen et al., 2018) 96.7 This work: modeling of syntax helps generalization even with RNNLM 93.2 richly-parameterized neural models. Encouraged URNNG 90.6 by these observations, we also experiment with RNNG 88.7 RNNG → URNNG 85.9 a hybrid approach where we train a supervised 1M Sentences PPL RNNG first and continue fine-tuning the model † (including the inference network) on the URNNG PRPN (Shen et al., 2018) 77.7 17 RNNLM 77.4 objective (RNNG → URNNG in Table1). This URNNG 71.8 approach results in nontrivial perplexity improve- RNNG‡ 72.9 ments, and suggests that it is potentially possi- RNNG‡ → URNNG 72.0 ble to improve language models with supervision Table 2: (Top) Comparison of this work as a language on parsed data. In Figure2 we show perplexity model against prior works on sentence-level PTB with by sentence length. We find that a standard lan- preprocessing from Dyer et al.(2016). Note that pre- guage model (RNNLM) is better at modeling short vious versions of RNNG differ from ours in terms of parameterization and model size. (Bottom) Results on sentences, but underperforms models that explic- a subset (1M sentences) of the one billion word corpus. itly take into account structure (RNNG/URNNG) PRPN† is the model from Shen et al.(2018), whose hy- when the sentence length is greater than 10. Ta- perparameters were tuned by us. RNNG‡ is trained on ble2 (top) compares our results against prior work predicted parse trees from Kitaev and Klein(2018). on this version of the PTB, and Table2 (bot- tree from qφ(z | x) through the Viterbi inside (i.e. tom) shows the results on a 1M sentence sub- CKY) algorithm. We calculate unlabeled F1 using set of the one billion word corpus, which is two evalb, which ignores punctuation and discards orders of magnitude larger than PTB. On this trivial spans (width-one and sentence spans).20 larger dataset URNNG still improves upon the Since we compare F1 against the original, non- RNNLM. We also trained an RNNG (and RNNG binarized trees (per convention), F1 scores of → URNNG) on this dataset by parsing the training models using oracle binarized trees constitute the set with the self-attentive parser from Kitaev and upper bounds. 18 Klein(2018). These models improve upon the We confirm the replication study of Htut et al. RNNLM but not the URNNG, potentially high- (2018) and find that PRPN is a strong model for lighting the limitations of using predicted trees for grammar induction. URNNG performs on par supervising RNNGs. with PRPN on English but PRPN does better on 4.2 Grammar Induction Chinese; both outperform right branching base-

Table1 also shows the F1 scores for grammar lines. Table3 further analyzes the learned trees induction. Note that we induce latent trees di- and shows the F1 score of URNNG trees against rectly from words on the full dataset.19 For 20Available at https://nlp.cs.nyu.edu/evalb/. We evaluate RNNG/URNNG we obtain the highest scoring with COLLINS.prm parameter file and LABELED option equal to 0. We observe that the setup for grammar induction 17 We fine-tune for 10 epochs and use a smaller learning varies widely across different papers: lexicalized vs. unlex- rate of 0.1 for the generative model. icalized; use of punctuation vs. not; separation of train/test 18To parse the training set we use the benepar en2 sets; counting sentence-level spans for evaluation vs. ig- model from https://github.com/nikitakit/self-attentive-parser, noring them; use of additional data; length cutoff for train- which obtains an F1 score of 95.17 on the PTB test set. ing/evaluation; corpus-level F1 vs. sentence-level F1; and, 19Past work on grammar induction usually train/evaluate more. In our survey of twenty or so papers, almost no two on short sentences and also assume access to gold POS tags papers were identical in their setup. Such variation makes (Klein and Manning, 2002; Smith and Eisner, 2004; Bod, it difficult to meaningfully compare models across papers. 2006). However more recent works do train directly words Hence, we report grammar induction results mainly for the (Jin et al., 2018; Shen et al., 2018; Drozdov et al., 2019). models and baselines considered in the present work.

1111 Tree PTB CTB Label URNNG PRPN PTB CTB RNNG URNNG RNNG URNNG Gold 40.7 29.1 SBAR 74.8% 28.9% Left 9.2 8.4 NP 39.5% 63.9% PPL 88.7 90.6 193.1 195.7 Right 68.3 51.2 VP 76.6% 27.3% Recon. PPL 74.6 73.4 183.4 151.9 Self 92.3 87.3 PP 55.8% 55.1% KL 7.10 6.13 11.11 8.91 RNNG 55.4 47.1 ADJP 33.9% 42.5% Prior Entropy 7.65 9.61 9.48 15.13 PRPN 41.0 47.2 ADVP 50.4% 45.1% Post. Entropy 1.56 2.28 6.23 5.75 Unif. Entropy 26.07 26.07 30.17 30.17 Table 3: (Left) F1 scores of URNNG against other trees. “Self” refers to another URNNG trained with Table 5: Metrics related to the generative a different random seed. (Right) Recall of constituents model/inference network for RNNG/URNNG. by label for URNNG and PRPN. Recall for a particular For the supervised RNNG we take the “inference label is the fraction of ground truth constituents of that network” to be the discriminative parser trained label that were identified by the model. alongside the generative model (see §3.3). Re- con. PPL is the reconstruction perplexity based on F1 +PP Eqφ(z | x)[log pθ(x | z)], and KL is the Kullback-Leibler PRPN-UP‡ 39.8 45.4 divergence. Prior entropy is the entropy of the con- ‡ PRPN-LM 42.8 42.4 ditional prior pθ(z | xmean- Table 4: PTB F1 scores using the same evaluation setup as Drozdov et al.(2019), which evaluates against ingful way and that there is no posterior collapse binarized trees, counts punctuation and trivial spans, (Bowman et al., 2016). As expected, the entropy and uses sentence-level F1. +PP indicates a post- of the variational posterior is much lower than the processing heuristic which directly attaches trailing entropy of the conditional prior, but there is still punctuation to the root. This does not change URNNG some uncertainty in the posterior. results since it learns to do so anyway. Results with ‡ are copied from Table 1 of Drozdov et al.(2019). 4.4 Syntactic Evaluation other trees (left), and the recall of URNNG/PRPN We perform a syntactic evaluation of the differ- trees against ground truth constituents (right). We ent models based on the setup from Marvin and find that trees induced by URNNG and PRPN Linzen(2018): the model is given two mini- are quite different; URNNG is more sensitive to mally different sentences, one grammatical and SBAR and VP, while PRPN is better at identifying one ungrammatical, and must identify the gram- NP. While left as future work, this naturally sug- matical sentence by assigning it higher probabil- 21 gests a hybrid approach wherein the intersection ity. Table6 shows the accuracy results. Overall of constituents from URNNG and PRPN is used to the supervised RNNG significantly outperforms create a corpus of partially annotated trees, which the other models, indicating opportunities for fur- can be used to guide another model, e.g. via poste- ther work in unsupervised modeling. While the rior regularization (Ganchev et al., 2010) or semi- URNNG does slightly outperform an RNNLM, supervision (Hwa, 1999). Finally, Table4 com- the distribution of errors made from both models pares our results using the same evaluation setup are similar, and thus it is not clear whether the out- as in Drozdov et al.(2019), which differs consid- performance is simply due to better perplexity or erably from our setup. learning different structural biases. 4.5 Limitations 4.3 Distributional Metrics There are several limitations to our approach. Table5 shows some standard metrics related For one, the URNNG takes considerably more to the learned generative model/inference net- time/memory to train than a standard language work. The “reconstruction” perplexity based on model due to the O(T 3) dynamic program in [log p (x | z)] is much lower than regular Eqφ(z | x) θ the inference network, multiple samples to obtain perplexity, and further, the Kullback-Leibler diver- low-variance gradient estimators, and dynamic gence between the conditional prior and the varia- computation graphs that make efficient batching tional posterior, given by 21We modify the publicly available dataset from https:// github.com/BeckyMarvin/LM syneval to only keep sentence q (z | x) log φ , pairs that did not have any unknown words with respect to our Eqφ(z | x) vocabulary, resulting in 80K sentence pairs for evaluation. pθ(z | x

1113 References Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical Multiscale Recurrent Neural Roee Aharoni and Yoav Goldberg. 2017. Towards Networks. In Proceedings of ICLR. String-to-Tree Neural Machine Translation. In Pro- ceedings of ACL. Caio Corro and Ivan Titov. 2019. Differentiable David Alvarez-Melis and Tommi S. Jaakkola. 2017. Perturb-and-Parse: Semi-Supervised Parsing with a Tree-structured Decoding with Doubly-Recurrent Structured Variational Autoencoder. In Proceedings Neural Networks. In Proceedings of ICLR. of ICLR. Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Chris Cremer, Xuechen Li, and David Duvenaud. Conditional Random Field Autoencoders for Unsu- 2018. Inference Suboptimality in Variational Au- pervised Structured Prediction. In Proceedings of toencoders. In Proceedings of ICML. NIPS. Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin- Alexander M. Rush. 2018. Latent Alignment and ton. 2016. Layer Normalization. In Proceedings of Variational Attention. In Proceedings of NIPS. NIPS. Andrew Drozdov, Patrick Verga, Mohit Yadev, Mohit James K. Baker. 1979. Trainable Grammars for Speech Iyyer, and Andrew McCallum. 2019. Unsupervised Recognition. In Proceedings of the Spring Confer- Latent Tree Induction with Deep Inside-Outside Re- ence of the Acoustical Society of America. cursive Auto-Encoders. In Proceedings of NAACL.

Rens Bod. 2006. An All-Subtrees Approach to Unsu- Greg Durrett and Dan Klein. 2015. Neural CRF Pars- pervised Parsing. In Proceedings of ACL. ing. In Proceedings of ACL. Samuel R. Bowman, Luke Vilnis, Oriol Vinyal, An- Chris Dyer, Miguel Ballesteros, Wang Ling, Austin drew M. Dai, Rafal Jozefowicz, and Samy Ben- Matthews, and Noah A. Smith. 2015. Transition- gio. 2016. Generating Sentences from a Continuous Based Dependency Parsing with Stack Long Short- Space. In Proceedings of CoNLL. Term Memory. In Proceedings of ACL. James Bradbury and Richard Socher. 2017. Towards Neural Machine Translation with Latent Tree Atten- Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, tion. In Proceedings of the 2nd Workshop on Struc- and Noah A. Smith. 2016. Recurrent Neural Net- tured Prediction for Natural Language Processing. work Grammars. In Proceedings of NAACL. Jan Buys and Phil Blunsom. 2015. Generative Incre- Ahmad Emami and Frederick Jelinek. 2005. A Neu- mental Dependency Parsing with Neural Networks. ral Syntactic Language Model. Machine Learning, In Proceedings of ACL. 60:195–227. Jan Buys and Phil Blunsom. 2018. Neural Syntactic Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Generative Models with Exact Marginalization. In Cho. 2017. Learning to Parse and Translate Im- Proceedings of NAACL. proves Neural Machine Translation. In Proceedings of ACL. Jiong Cai, Yong Jiang, and Kewei Tu. 2017. CRF Au- toencoder for Unsupervised Dependency Parsing. In Jenny Rose Finkel, Alex Kleeman, and Christopher D. Proceedings of EMNLP. Manning. 2008. Efficient, Feature-based, Condi- tional Random Field Parsing. In Proceedings of Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, ACL. Thorsten Brants, Phillipp Koehn, and Tony Robin- son. 2013. One Billion Word Benchmark for Mea- Jenny Rose Finkel, Christopher D. Manning, and An- suring Progress in Statistical Language Modeling. drew Y. Ng. 2006. Solving the Problem of Cas- arXiv:1312.3005 . cading Errors: Approximate Bayesian Inference for Danqi Chen and Christopher D. Manning. 2014. A Fast Linguistic Annotation Pipelines. In Proceedings of and Accurate Dependency Parser using Neural Net- EMNLP. works. In Proceedings of EMNLP. Kuzman Ganchev, Joao˜ Graça, Jennifer Gillenwater, Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bah- and Ben Taskar. 2010. Posterior Regularization for danau, and Yoshua Bengio. 2014. On the Properties Structured Latent Variable Models. Journal of Ma- of Neural Machine Translation: Encoder-Decoder chine Learning Research, 11:2001–2049. Approaches. In Proceedings of Eighth Workshop on Syntax, Semantics and Structure in Statistical Trans- Peter Glynn. 1987. Likelihood Ratio Gradient Estima- lation. tion: An Overview. In Proceedings of Winter Simu- lation Conference. Jihun Choi, Kang Min Yoo, and Sang goo Lee. 2018. Learning to Compose Task-Specific Tree Structures. Joshua Goodman. 1998. Parsing Inside-Out. PhD the- In Proceedings of AAAI. sis, Harvard University.

1114 Jetic Gu, Hassan S. Shavarani, and Anoop Sarkar. Rahul G. Krishnan, Uri Shalit, and David Sontag. 2017. 2018. Top-down Tree Structured Decoding with Structured Inference Networks for Nonlinear State Syntactic Connections for Neural Machine Transla- Space Models. In Proceedings of AAAI. tion and Parsing. In Proceedings of EMNLP. Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng John Hale, Chris Dyer, Adhiguna Kuncoro, and Kong, Chris Dyer, Graham Neubig, and Noah A. Jonathan R. Brennan. 2018. Finding Syntax in Hu- Smith. 2017. What Do Recurrent Neural Network man Encephalography with Beam Search. In Pro- Grammars Learn About Syntax? In Proceedings of ceedings of ACL. EACL.

Serhii Havrylov, German´ Kruszewski, and Armand Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yo- Joulin. 2019. Cooperative Learning of Disjoint Syn- gatama, Stephen Clark, and Phil Blunsom. 2018. tax and Semantics. In Proceedings of NAACL. LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better. Phu Mon Htut, Kyunghyun Cho, and Samuel R. Bow- In Proceedings of ACL. man. 2018. Grammar Induction with Neural Lan- guage Models: An Unusual Replication. In Pro- Bowen Li, Jianpeng Cheng, Yang Liu, and Frank ceedings of EMNLP. Keller. 2019. Dependency Grammar Induction with a Neural Variational Transition-based Parser. In Rebecca Hwa. 1999. Supervised Grammar Induction Proceedings of AAAI. Using Training Data with Limited Constituent Infor- mation. In Proceedings of ACL. Zhifei Li and Jason Eisner. 2009. First- and Second- Order Expectation Semirings with Applications to Rebecca Hwa. 2000. Sample Selection for Statistical Minimum-Risk Training on Translation Forests. In Grammar Induction. In Proceedings of EMNLP. Proceedings of EMNLP.

Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Yang Liu, Matt Gardner, and Mirella Lapata. 2018. William Schuler, and Lane Schwartz. 2018. Un- Structured Alignment Networks for Matching Sen- supervised Grammar Induction with Depth-bounded tences. In Proceedings of EMNLP. PCFG. In Proceedings of TACL. Yang Liu and Mirella Lapata. 2018. Learning Struc- Mark Johnson, Thomas L. Grifﬁths, and Sharon Gold- tured Text Representations. In Proceedings of water. 2007. Bayesian Inference for PCFGs via TACL. Markov chain Monte Carlo. In Proceedings of NAACL. Jean Maillard and Stephen Clark. 2018. Latent Tree Learning with Differentiable Parsers: Shift-Reduce Matthew Johnson, David K. Duvenaud, Alex Parsing and Chart Parsing. In Proceedings of the Wiltschko, Ryan P. Adams, and Sandeep R. Workshop on the Relevance of Linguistic Structure Datta. 2016. Composing Graphical Models with in Neural Architectures for NLP. Neural Networks for Structured Representations and Fast Inference. In Proceedings of NIPS. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Anno- Kazuya Kawakami, Chris Dyer, and Phil Blunsom. tated Corpus of English: The Penn Treebank. Com- 2018. Unsupervised Word Discovery with Segmen- putational Linguistics, 19:313–330. tal Neural Language Models. arXiv:1811.09353. Rebecca Marvin and Tal Linzen. 2018. Targeted Syn- Yoon Kim, Carl Denton, Luong Hoang, and Alexan- tactic Evaluation of Language Models. In Proceed- der M. Rush. 2017. Structured Attention Networks. ings of EMNLP. In Proceedings of ICLR. Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A 2005. Probabilistic CFG with Latent Annotations. Method for Stochastic Optimization. In Proceed- In Proceedings of ACL. ings of ICLR. Gabor´ Melis, Chris Dyer, and Phil Blunsom. 2018. On Diederik P. Kingma and Max Welling. 2014. Auto- the State of the Art of Evaluation in Neural Lan- Encoding Variational Bayes. In Proceedings of guage Models. In Proceedings of ICLR. ICLR. Stephen Merity, Nitish Shirish Keskar, and Richard Nikita Kitaev and Dan Klein. 2018. Constituency Pars- Socher. 2018. Regularizing and Optimizing LSTM ing with a Self-Attentive Encoder. In Proceedings of Language Models. In Proceedings of ICLR. ACL. Tomas Mikolov, Martin Karaﬁat, Lukas Burget, Jan Dan Klein and Christopher Manning. 2002. A Genera- Cernocky, and Sanjeev Khudanpur. 2010. Recurrent tive Constituent-Context Model for Improved Gram- Neural Network Based Language Model. In Pro- mar Induction. In Proceedings of ACL. ceedings of INTERSPEECH.

1115 Andriy Mnih and Karol Gregor. 2014. Neural varia- Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Ju- tional inference and learning in belief networks. In rafsky. 2013. Breaking Out of Local Optima with Proceedings of ICML. Count Transforms and Model Recombination: A Study in Grammar Induction. In Proceedings of Andriy Mnih and Danilo J. Rezende. 2016. Variational EMNLP. Inference for Monte Carlo Objectives. In Proceed- ings of ICML. Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A Minimal Span-Based Neural Constituency Parser. Tahira Naseem, Harr Chen, Regina Barzilay, and Mark In Proceedings of ACL. Johnson. 2010. Using Universal Linguistic Knowl- edge to Guide Grammar Induction. In Proceedings Emma Strubell, Patrick Verga, Daniel Andor, of EMNLP. David Weiss, and Andrew McCallum. 2018. Vlad Niculae, Andre´ F. T. Martins, and Claire Cardie. Linguistically-Informed Self-Attention for Seman- 2018. Towards Dynamic Computation Graphs via tic Role Labeling. In Proceedings of EMNLP. Sparse Latent Structure. In Proceedings of EMNLP. Kai Sheng Tai, Richard Socher, and Christopher D. Ankur P. Parikh, Shay B. Cohen, and Eric P. Xing. Manning. 2015. Improved Semantic Representa- 2014. Spectral Unsupervised Parsing with Additive tions From Tree-Structured Long Short-Term Mem- Tree Metrics. In Proceedings of ACL. ory Networks. In Proceedings of ACL.

Hao Peng, Sam Thomson, and Noah A. Smith. 2018. Ke Tran and Yonatan Bisk. 2018. Inducing Grammars Backpropagating through Structured Argmax using with and for Neural Machine Translation. In Pro- a SPIGOT. In Proceedings of ACL. ceedings of the 2nd Workshop on Neural Machine Elis Ponvert, Jason Baldridge, and Katrin Erk. 2011. Translation and Generation. Simpled Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Methods. In John C. Trueswell and Lila R. Gleitman. 2007. Learn- Proceedings of ACL. ing to Parse and its Implications for Language Ac- quisition. The Oxford Handbook of Psycholinguis- Oﬁr Press and Lior Wolf. 2016. Using the Output Em- tics. bedding to Improve Language Models. In Proceed- ings of EACL. Wenhui Wang and Baobao Chang. 2016. Graph-based Dependency Parsing with Bidirectional LSTM. In Maxim Rabinovich, Mitchell Stern, and Dan Klein. Proceedings of ACL. 2017. Abstract Syntax Networks for Code Gener- ation and Semantic Parsing. In Proceedings of ACL. Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A Tree-based Decoder for Neural Danilo J. Rezende, Shakir Mohamed, and Daan Wier- Machine Translation. In Proceedings of EMNLP. stra. 2014. Stochastic Backpropagation and Ap- proximate Inference in Deep Generative Models. In Adina Williams, Andrew Drozdov, and Samuel R. Proceedings of ICML. Bowman. 2018. Do Latent Tree Learning Models Yoav Seginer. 2007. Fast Unsupervised Incremental Identify Meaningful Structure in Sentences? In Pro- Parsing. In Proceedings of ACL. ceedings of TACL. Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Ronald J. Williams. 1992. Simple Statistical Gradient- Aaron Courville. 2018. Neural Language Modeling following Algorithms for Connectionist Reinforce- by Jointly Learning Syntax and Lexicon. In Pro- ment Learning. Machine Learning, 8. ceedings of ICLR. Sam Wiseman, Stuart M. Shieber, and Alexander M. Yikang Shen, Shawn Tan, Alessandro Sordoni, and Rush. 2018. Learning Neural Templates for Text Aaron Courville. 2019. Ordered Neurons: Integrat- Generation. In Proceedings of EMNLP. ing Tree Structures into Recurrent Neural Networks. In Proceedings of ICLR. Naiwen Xue, Fei Xia, Fu dong Chiou, and Marta Palmer. 2005. The Penn Chinese Treebank: Phrase Haoyue Shi, Hao Zhou, Jiaze Chen, and Lei Li. 2018. Structure Annotation of a Large Corpus. Natural On Tree-based Neural Sentence Modeling. In Pro- Language Engineering, 11:207–238. ceedings of EMNLP. Noah A. Smith and Jason Eisner. 2004. Annealing Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and Techniques for Unsupervised Statistical Language William W. Cohen. 2018. Breaking the Softmax Learning. In Proceedings of ACL. Bottleneck: A High-Rank RNN Language Model. In Proceedings of ICLR. Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. Lad- Pengcheng Yin and Graham Neubig. 2017. A Syntac- der Variational Autoencoders. In Proceedings of tic Neural Model for General-Purpose Code Gener- NIPS. ation. In Proceedings of ACL.

1116 Pengcheng Yin, Chunting Zhou, Junxian He, and Gra- ham Neubig. 2018. StructVAE: Tree-structured La- tent Variable Models for Semi-supervised Semantic Parsing. In Proceedings of ACL. Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to Compose Words into Sentences with Reinforcement Learning. In Proceedings of ICLR. Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long Short-Term Memory Over Tree Struc- tures. In Proceedings of ICML.

1117