Arxiv:1906.10225V9 [Cs.CL] 29 Mar 2020 Mars (PCFG) from Natural Language Data Through Within a Tree-Based Generative Process
Total Page:16
File Type:pdf, Size:1020Kb
Compound Probabilistic Context-Free Grammars for Grammar Induction Yoon Kim Chris Dyer Alexander M. Rush Harvard University DeepMind Harvard University Cambridge, MA, USA London, UK Cambridge, MA, USA [email protected] [email protected] [email protected] Abstract non-parametric models (Kurihara and Sato, 2006; Johnson et al., 2007; Liang et al., 2007; Wang and We study a formalization of the grammar Blunsom, 2013), and manually-engineered fea- induction problem that models sentences as being generated by a compound probabilis- tures (Huang et al., 2012; Golland et al., 2012) to tic context-free grammar. In contrast to encourage the desired structures to emerge. traditional formulations which learn a sin- We revisit these aforementioned issues in light gle stochastic grammar, our grammar’s rule of advances in model parameterization and infer- probabilities are modulated by a per-sentence ence. First, contrary to common wisdom, we continuous latent variable, which induces find that parameterizing a PCFG’s rule probabil- marginal dependencies beyond the traditional ities with neural networks over distributed rep- context-free assumptions. Inference in this grammar is performed by collapsed variational resentations makes it possible to induce linguis- inference, in which an amortized variational tically meaningful grammars by simply optimiz- posterior is placed on the continuous variable, ing log likelihood. While the optimization prob- and the latent trees are marginalized out with lem remains non-convex, recent work suggests dynamic programming. Experiments on En- that there are optimization benefits afforded by glish and Chinese show the effectiveness of over-parameterized models (Arora et al., 2018; our approach compared to recent state-of-the- Xu et al., 2018; Du et al., 2019), and we in- art methods when evaluated on unsupervised deed find that this neural PCFG is significantly parsing. easier to optimize than the traditional PCFG. 1 Introduction Second, this factored parameterization makes it straightforward to incorporate side information Grammar induction is the task of inducing hier- into rule probabilities through a sentence-level archical syntactic structure from data. Statistical continuous latent vector, which effectively allows approaches to grammar induction require specify- different contexts in a derivation to coordinate. ing a probabilistic grammar (e.g. formalism, num- In this compound PCFG—continuous mixture of ber and shape of rules), and fitting its parameters PCFGs—the context-free assumptions hold con- through optimization. Early work found that it was ditioned on the latent vector but not uncondition- difficult to induce probabilistic context-free gram- ally, thereby obtaining longer-range dependencies arXiv:1906.10225v9 [cs.CL] 29 Mar 2020 mars (PCFG) from natural language data through within a tree-based generative process. direct methods, such as optimizing the log like- To utilize this approach, we need to efficiently lihood with the EM algorithm (Lari and Young, optimize the log marginal likelihood of observed 1990; Carroll and Charniak, 1992). While the rea- sentences. While compound PCFGs break effi- sons for the failure are manifold and not com- cient inference, if the latent vector is known the pletely understood, two major potential causes distribution over trees reduces to a standard PCFG. are the ill-behaved optimization landscape and the This property allows us to perform grammar in- overly strict independence assumptions of PCFGs. duction using a collapsed approach where the la- More successful approaches to grammar induction tent trees are marginalized out exactly with dy- have thus resorted to carefully-crafted auxiliary namic programming. To handle the latent vec- objectives (Klein and Manning, 2002), priors or tor, we employ standard amortized inference us- Code: https://github.com/harvardnlp/compound-pcfg ing reparameterized samples from a variational posterior approximated from an inference network constraint that they form valid probability distri- (Kingma and Welling, 2014; Rezende et al., 2014). butions, i.e. each nonterminal is associated with On standard benchmarks for English and Chi- a fully-parameterized categorical distribution over nese, the proposed approach is found to perform its rules. This direct parameterization is algorith- favorably against recent neural approaches to un- mically convenient since the M-step in the EM al- supervised parsing (Shen et al., 2018, 2019; Droz- gorithm (Dempster et al., 1977) has a closed form. dov et al., 2019; Kim et al., 2019). However, there is a long history of work showing that it is difficult to learn meaningful grammars 2 Probabilistic Context-Free Grammars from natural language data with this parameteri- 3 We consider context-free grammars (CFG) con- zation (Carroll and Charniak, 1992). Successful sisting of a 5-tuple G = (S; N ; P; Σ; R) where approaches to unsupervised parsing have therefore S is the distinguished start symbol, N is a finite modified the model/learning objective by guiding set of nonterminals, P is a finite set of pretermi- potentially unrelated rules to behave similarly. nals,1 Σ is a finite set of terminal symbols, and R Recognizing that sharing among rule types is is a finite set of rules of the form, beneficial, we propose a neural parameterization where rule probabilities are based on distributed S ! A; A 2 N representations. We associate embeddings with A ! B C; A 2 N ; B; C 2 N [ P each symbol, introducing input embeddings wN for each symbol N on the left side of a rule (i.e. T ! w; T 2 P; w 2 Σ: N 2 fSg [ N [ P). For each rule type r, πr is A probabilistic context-free grammar (PCFG) parameterized as follows, > consists of a grammar G and rule probabilities exp(uA f1(wS) + bA) πS!A = ; π = fπrgr2R such that πr is the probability of P > 0 A02N exp(uA0 f1(wS) + bA ) the rule r. Letting T be the set of all parse trees of G exp(u> w + b ) G π = BC A BC ; , a PCFG defines a probability distribution over A!BC P > Q B0C02M exp(uB0C0 wA + bB0C0 ) t 2 TG via pπ(t) = r2t πr where tR is the set R > t exp(u f2(wT ) + bw) of rules used in the derivation of . It also defines π = w ; ∗ T !w P > a distribution over string of terminals x 2 Σ via w02Σ exp(uw0 f2(wT ) + bw0 ) X where M is the product space (N[P)×(N[P), pπ(x) = pπ(t); and f1; f2 are MLPs with two residual layers. t2T (x) G Note that we do not use an MLP for rules of the where TG(x) = ft j yield(t) = xg, i.e. the set of type πA!BC , as it did not empirically improve re- trees t such that t’s leaves are x. We will slightly sults. See appendix A.1 for the full parameteriza- abuse notation and use tion. Going forward, we will use EG = fwU j U 2 fSg [ N [ Pg [ fu j V 2 N [ M [ Σg to 1 V [yield(t) = x]pπ(t) denote the set of input/output symbol embeddings pπ(t j x) , pπ(x) for grammar G, and λ to refer to the parameters of the neural network f ; f used to obtain the rule to denote the posterior distribution over the unob- 1 2 probabilities. A graphical model-like illustration served latent trees given the observed sentence x, of the neural PCFG is shown in Figure1 (left). where 1[·] is the indicator function.2 It is clear that the neural parameterization does 2.1 Parameterization not change the underlying probabilistic assump- The standard way to parameterize a PCFG is to tions. The difference between the two is anal- ogous to the difference between count-based vs. simply associate a scalar to each rule πr with the feed-forward neural language models, where feed- 1Since we will be inducing a grammar directly from forward neural language models make the same words, P is roughly the set of part-of-speech tags and N is the set of constituent labels. However, to avoid issues of label Markov assumptions as the count-based models alignment, evaluation is only on the tree topology. but are able to take advantage of shared, dis- 2Therefore when used in the context of a posterior distri- tributed representations. bution conditioned on a sentence x, the variable t does not include the leaves x and only refers to the unobserved non- 3In preliminary experiments we were indeed unable to terminal/preterminal symbols. learn linguistically meaningful grammars with this PCFG. N z c γ N A1 πS π A1 z;S A2 T3 πN EG A T 2 3 π T1 T2 πP z;N EG T1 T2 w1 w2 w3 πz;P w1 w2 w3 Figure 1: A graphical model-like diagram for the neural PCFG (left) and the compound PCFG (right) for an example tree structure. In the above, A1;A2 2 N are nonterminals, T1;T2;T3 2 P are preterminals, w1; w2; w3 2 Σ are terminals. In the neural PCFG, the global rule probabilities π = πS [πN [πP are the output from a neural net run over the symbol embeddings EG , where πN are the set of rules with a nonterminal on the left hand side (πS and πP are similarly defined). In the compound PCFG, we have per-sentence rule probabilities πz = πz;S [ πz;N [ πz;P obtained from running a neural net over a random vector z (which varies across sentences) and global symbol embeddings EG . In this case, the context-free assumptions hold conditioned on z, but they do not hold unconditionally: e.g. when conditioned on z and A2, the variables A1 and T1 are independent; however when conditioned on just A2, they are not independent due to the dependence path through z. Note that the rule probabilities are random variables in the compound PCFG but deterministic variables in the neural PCFG.