Compound Probabilistic Context-Free Grammars for Grammar Induction
Yoon Kim (Harvard University, Cambridge, MA, USA)
Chris Dyer (DeepMind, London, UK)
Alexander M. Rush (Harvard University, Cambridge, MA, USA)

Abstract

We study a formalization of the grammar induction problem that models sentences as being generated by a compound probabilistic context-free grammar. In contrast to traditional formulations which learn a single stochastic grammar, our context-free rule probabilities are modulated by a per-sentence continuous latent variable, which induces marginal dependencies beyond the traditional context-free assumptions. Inference in this grammar is performed by collapsed variational inference, in which an amortized variational posterior is placed on the continuous variable and the latent trees are marginalized with dynamic programming. Experiments on English and Chinese show the effectiveness of our approach compared to recent state-of-the-art methods for grammar induction from words with neural language models.

Code: https://github.com/harvardnlp/compound-pcfg

1 Introduction

Grammar induction is the task of inducing hierarchical syntactic structure from data. Statistical approaches to grammar induction require specifying a probabilistic grammar (e.g. formalism, number and shape of rules) and fitting its parameters through optimization. Early work found that it was difficult to induce probabilistic context-free grammars (PCFGs) from natural language data through direct methods, such as optimizing the log likelihood with the EM algorithm (Lari and Young, 1990; Carroll and Charniak, 1992). While the reasons for this failure are manifold and not completely understood, two major potential causes are the ill-behaved optimization landscape and the overly strict independence assumptions of PCFGs. More successful approaches to grammar induction have thus resorted to carefully crafted auxiliary objectives (Klein and Manning, 2002), priors or non-parametric models (Kurihara and Sato, 2006; Johnson et al., 2007; Liang et al., 2007; Wang and Blunsom, 2013), and manually engineered features (Huang et al., 2012; Golland et al., 2012) to encourage the desired structures to emerge.

We revisit these issues in light of advances in model parameterization and inference. First, contrary to common wisdom, we find that parameterizing a PCFG's rule probabilities with neural networks over distributed representations makes it possible to induce linguistically meaningful grammars by simply optimizing log likelihood. While the optimization problem remains non-convex, recent work suggests that there are optimization benefits afforded by over-parameterized models (Arora et al., 2018; Xu et al., 2018; Du et al., 2019), and we indeed find that this neural PCFG is significantly easier to optimize than the traditional PCFG.

Second, this factored parameterization makes it straightforward to incorporate side information into rule probabilities through a sentence-level continuous latent vector, which effectively allows different contexts in a derivation to coordinate. In this compound PCFG (a continuous mixture of PCFGs), the context-free assumptions hold conditioned on the latent vector but not unconditionally, thereby obtaining longer-range dependencies within a tree-based generative process.

To utilize this approach, we need to efficiently optimize the log marginal likelihood of observed sentences. While compound PCFGs break efficient inference, if the latent vector is known the distribution over trees reduces to a standard PCFG. This property allows us to perform grammar induction using a collapsed approach where the latent trees are marginalized out exactly with dynamic programming. To handle the latent vector, we employ standard amortized inference using reparameterized samples from a variational posterior approximated by an inference network (Kingma and Welling, 2014; Rezende et al., 2014).

On standard benchmarks for English and Chinese, the proposed approach is found to perform favorably against recent neural network-based approaches to grammar induction (Shen et al., 2018, 2019; Drozdov et al., 2019; Kim et al., 2019).
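Concretely, the collapsed amortized approach described above corresponds to optimizing an evidence lower bound of the form sketched below. The notation is ours ($\theta$ for the generative parameters, $\phi$ for the inference-network parameters) and is meant only as a summary of the preceding description, not as the paper's exact derivation:

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(\mathbf{z} \mid x)}\big[\log p_\theta(x \mid \mathbf{z})\big] \;-\; \mathrm{KL}\big(q_\phi(\mathbf{z} \mid x)\,\|\,p(\mathbf{z})\big), \qquad p_\theta(x \mid \mathbf{z}) = \sum_{t} p_\theta(t \mid \mathbf{z}),$$

where the inner sum ranges over all parse trees whose yield is $x$ and is computed exactly with dynamic programming, and the outer expectation is estimated with a reparameterized sample $\mathbf{z} \sim q_\phi(\mathbf{z} \mid x)$.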
2 Probabilistic Context-Free Grammars

We consider context-free grammars (CFGs) consisting of a 5-tuple $\mathcal{G} = (S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R})$, where $S$ is the distinguished start symbol, $\mathcal{N}$ is a finite set of nonterminals, $\mathcal{P}$ is a finite set of preterminals,[1] $\Sigma$ is a finite set of terminal symbols, and $\mathcal{R}$ is a finite set of rules of the form

$$S \to A, \quad A \in \mathcal{N}$$
$$A \to B\,C, \quad A \in \mathcal{N},\; B, C \in \mathcal{N} \cup \mathcal{P}$$
$$T \to w, \quad T \in \mathcal{P},\; w \in \Sigma.$$

[1] Since we will be inducing a grammar directly from words, $\mathcal{P}$ is roughly the set of part-of-speech tags and $\mathcal{N}$ is the set of constituent labels. However, to avoid issues of label alignment, evaluation is only on the tree topology.

A probabilistic context-free grammar (PCFG) consists of a grammar $\mathcal{G}$ and rule probabilities $\pi = \{\pi_r\}_{r \in \mathcal{R}}$ such that $\pi_r$ is the probability of the rule $r$. Letting $\mathcal{T}_{\mathcal{G}}$ be the set of all parse trees of $\mathcal{G}$, a PCFG defines a probability distribution over $t \in \mathcal{T}_{\mathcal{G}}$ via $p_\pi(t) = \prod_{r \in t_{\mathcal{R}}} \pi_r$, where $t_{\mathcal{R}}$ is the set of rules used in the derivation of $t$. It also defines a distribution over strings of terminals $x \in \Sigma^{*}$ via

$$p_\pi(x) = \sum_{t \in \mathcal{T}_{\mathcal{G}}(x)} p_\pi(t),$$

where $\mathcal{T}_{\mathcal{G}}(x) = \{t \mid \mathrm{yield}(t) = x\}$, i.e. the set of trees $t$ such that $t$'s leaves are $x$. We will use $p_\pi(t \mid x) \triangleq p_\pi(t \mid \mathrm{yield}(t) = x)$ to denote the posterior distribution over latent trees given the observed sentence $x$.
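The sum over $\mathcal{T}_{\mathcal{G}}(x)$ can be computed exactly in polynomial time with the inside algorithm, which is the dynamic program referred to in the introduction. The sketch below is ours: the toy grammar, the symbol names, and the `inside` function are illustrative choices and are not taken from the paper or its released code.

```python
# A minimal sketch of the inside algorithm for computing p_pi(x) exactly by
# dynamic programming, under the rule forms defined above (S -> A, A -> B C,
# T -> w). The toy grammar below is an invented illustration.
from collections import defaultdict

start_rules = {"S1": 1.0}                      # pi_{S -> A}
binary_rules = {                               # pi_{A -> B C}
    "S1": {("NP", "VP"): 1.0},
    "NP": {("D", "N"): 1.0},
    "VP": {("V", "NP"): 1.0},
}
lexical_rules = {                              # pi_{T -> w}
    "D": {"the": 1.0},
    "N": {"dog": 0.5, "cat": 0.5},
    "V": {"saw": 1.0},
}

def inside(sentence):
    """Return p_pi(x), the total probability of all parse trees of x."""
    n = len(sentence)
    # beta[i][j][A] = total probability of A deriving words i..j (inclusive).
    beta = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    # Base case: length-1 spans are covered by preterminal rules T -> w.
    for i, w in enumerate(sentence):
        for T, dist in lexical_rules.items():
            if w in dist:
                beta[i][i][T] = dist[w]
    # Recursive case: combine adjacent spans with binary rules A -> B C.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for A, dist in binary_rules.items():
                for (B, C), p in dist.items():
                    for k in range(i, j):  # split point between B and C
                        beta[i][j][A] += p * beta[i][k][B] * beta[k + 1][j][C]
    # Finish with the start rules S -> A over the whole sentence.
    return sum(p * beta[0][n - 1][A] for A, p in start_rules.items())

print(inside("the dog saw the cat".split()))  # 0.25 under this toy grammar
```

Under this toy grammar the example sentence has a single parse, so the marginal equals that parse's probability, 1.0 x 0.5 x 0.5 = 0.25; with an ambiguous grammar the same routine sums over all parses.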
Parameterization. The standard way to parameterize a PCFG is to simply associate a scalar to each rule $\pi_r$ with the constraint that they form valid probability distributions, i.e. each nonterminal is associated with a fully parameterized categorical distribution over its rules. This direct parameterization is algorithmically convenient since the M-step in the EM algorithm (Dempster et al., 1977) has a closed form. However, there is a long history of work showing that it is difficult to learn meaningful grammars from natural language data with this parameterization (Carroll and Charniak, 1992).[2] Successful approaches to unsupervised parsing have therefore modified the model/learning objective by guiding potentially unrelated rules to behave similarly.

[2] In preliminary experiments we were indeed unable to learn linguistically meaningful grammars with this PCFG.

Recognizing that sharing among rule types is beneficial, we propose a neural parameterization where rule probabilities are based on distributed representations. We associate embeddings with each symbol, introducing input embeddings $\mathbf{w}_N$ for each symbol $N$ on the left side of a rule (i.e. $N \in \{S\} \cup \mathcal{N} \cup \mathcal{P}$). For each rule type $r$, $\pi_r$ is parameterized as follows,

$$\pi_{S \to A} = \frac{\exp(\mathbf{u}_A^\top f_1(\mathbf{w}_S))}{\sum_{A' \in \mathcal{N}} \exp(\mathbf{u}_{A'}^\top f_1(\mathbf{w}_S))}, \qquad \pi_{A \to BC} = \frac{\exp(\mathbf{u}_{BC}^\top \mathbf{w}_A)}{\sum_{B'C' \in \mathcal{M}} \exp(\mathbf{u}_{B'C'}^\top \mathbf{w}_A)}, \qquad \pi_{T \to w} = \frac{\exp(\mathbf{u}_w^\top f_2(\mathbf{w}_T))}{\sum_{w' \in \Sigma} \exp(\mathbf{u}_{w'}^\top f_2(\mathbf{w}_T))},$$

where $\mathcal{M}$ is the product space $(\mathcal{N} \cup \mathcal{P}) \times (\mathcal{N} \cup \mathcal{P})$, and $f_1, f_2$ are MLPs with two residual layers (see appendix A.1 for the full parameterization). We will use $E_{\mathcal{G}} = \{\mathbf{w}_N \mid N \in \{S\} \cup \mathcal{N} \cup \mathcal{P}\}$ to denote the set of input symbol embeddings for a grammar $\mathcal{G}$, and $\lambda$ to refer to the parameters of the neural network used to obtain the rule probabilities. A graphical model-like illustration of the neural PCFG is shown in Figure 1 (left).

[Figure 1: A graphical model-like diagram for the neural PCFG (left) and the compound PCFG (right) for an example tree structure. In the above, $A_1, A_2 \in \mathcal{N}$ are nonterminals, $T_1, T_2, T_3 \in \mathcal{P}$ are preterminals, and $w_1, w_2, w_3 \in \Sigma$ are terminals. In the neural PCFG, the global rule probabilities $\pi = \pi_S \cup \pi_{\mathcal{N}} \cup \pi_{\mathcal{P}}$ are the output from a neural net run over the symbol embeddings $E_{\mathcal{G}}$, where $\pi_{\mathcal{N}}$ is the set of rules with a nonterminal on the left-hand side ($\pi_S$ and $\pi_{\mathcal{P}}$ are similarly defined). In the compound PCFG, we have per-sentence rule probabilities $\pi_{\mathbf{z}} = \pi_{\mathbf{z},S} \cup \pi_{\mathbf{z},\mathcal{N}} \cup \pi_{\mathbf{z},\mathcal{P}}$ obtained from running a neural net over a random vector $\mathbf{z}$ (which varies across sentences) and the global symbol embeddings $E_{\mathcal{G}}$. In this case, the context-free assumptions hold conditioned on $\mathbf{z}$, but they do not hold unconditionally: e.g. when conditioned on $\mathbf{z}$ and $A_2$, the variables $A_1$ and $T_1$ are independent; however, when conditioned on just $A_2$, they are not independent due to the dependence path through $\mathbf{z}$.]

It is clear that the neural parameterization does not change the underlying probabilistic assumptions. The difference between the two is analogous to the difference between count-based vs. feed-forward neural language models, where feed-forward neural language models make the same Markov assumptions as the count-based models but are able to take advantage of shared, distributed representations.

3 Compound PCFGs

A compound probability distribution (Robbins, 1951) is a distribution whose parameters are themselves random variables. These distributions generalize mixture models to the continuous case, for example in factor analysis, which assumes the following generative process,

$$\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x} \sim \mathcal{N}(\mathbf{W}\mathbf{z}, \boldsymbol{\Sigma}).$$

Compound distributions provide the ability to model rich generative processes, but marginalizing over the latent parameter can be computationally expensive unless conjugacy can be exploited.
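For concreteness, the factor-analysis example above can be written as a few lines of sampling code. The dimensions and the loading matrix below are arbitrary illustrative choices, and the closing comment only draws the analogy to the compound PCFG rather than reproducing its exact parameterization.

```python
# A minimal sketch of the factor-analysis compound distribution: the parameter
# of the distribution over x (its mean W z) is itself random because z is
# sampled first. Sizes and matrices are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 4, 8
W = rng.normal(size=(d_x, d_z))        # factor loading matrix
Sigma = 0.1 * np.eye(d_x)              # observation noise covariance

# Generative process: z ~ N(0, I), then x ~ N(W z, Sigma).
z = rng.normal(size=d_z)
x = rng.multivariate_normal(mean=W @ z, cov=Sigma)

# The compound PCFG follows the same pattern: a per-sentence z ~ N(0, I) is
# drawn first, a neural net maps z (together with the symbol embeddings E_G)
# to rule probabilities pi_z, and conditioned on z the sentence is generated
# by an ordinary PCFG with parameters pi_z.
print(x.shape)  # (8,)
```

Conditioned on $\mathbf{z}$, the distribution over $\mathbf{x}$ here is an ordinary Gaussian, just as, conditioned on $\mathbf{z}$, the compound PCFG's distribution over trees is an ordinary PCFG; the extra dependencies arise only once $\mathbf{z}$ is marginalized out.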