Tensors Over Semirings for Latent-Variable Weighted Logic Programs

Esma Balkır (1), Daniel Gildea (2), Shay B. Cohen (1)
(1) ILCC, School of Informatics, University of Edinburgh
(2) Department of Computer Science, University of Rochester

arXiv:2006.04232v1 [cs.CL] 7 Jun 2020

Abstract

Semiring parsing (Goodman, 1999) is an elegant framework for describing parsers by using semiring weighted logic programs. In this paper we present a generalization of this concept: latent-variable semiring parsing. With our framework, any semiring weighted logic program can be latentified by transforming weights from scalar values of a semiring to rank-n arrays, or tensors, of semiring values, allowing the modeling of latent variables within the semiring parsing framework. Semiring is too strong a notion when dealing with tensors, and we have to resort to a weaker structure: a partial semiring.[1] We prove that this generalization preserves all the desired properties of the original semiring framework while strictly increasing its expressiveness.

[1] Our definition of a partial semiring is slightly different from those in the abstract algebra literature, e.g. Steenstrup (1985).

1 Introduction

Weighted Logic Programming (WLP) is a declarative approach to specifying and reasoning about dynamic programming algorithms and chart parsers. WLP is a generalization of bottom-up logic programming where proofs are assigned weights by combining the weights of the axioms used in the proof, and the weight of a theorem is in turn calculated by combining the weights of all its possible proof paths. The combinatorial nature of this procedure makes weighted logic programs highly suitable for specifying dynamic programming algorithms. In particular, Goodman (1999) presents an elegant abstraction for specifying and computing parser values based on WLP where the values could be drawn from any complete semiring. This generalizes the case of Boolean decision problems, probabilistic grammars with Viterbi search and other quantities of interest such as the best derivation or the set of all possible derivations. It is then possible to derive a general formulation of inside and outside calculations in a way that is agnostic to the particular semiring chosen.

Latent variable models have been an important component in the NLP toolbox. The central assumption in latent variable models is that the correlations between observed variables in the training data could be explained by unobserved, hidden variables. Latent variables have been used with grammars such as Probabilistic Context-Free Grammars (PCFGs), where each node in the parse tree is represented using a vector of latent state probabilities that further extend the expressiveness of the grammar (Matsuzaki et al., 2005).

The approach of adding latent variables to formal grammars has proven to be a fruitful one: in the context of PCFG parsing, Matsuzaki et al. (2005) show that latent variable PCFGs (L-PCFGs) perform on par with models hand-annotated with linguistically motivated features. Cohen et al. (2013) report that on the Penn Treebank dataset, L-PCFGs trained with either EM or a spectral algorithm provide a 20% increase in F1 over PCFGs without latent states. Gebhardt (2018) shows that the benefits of latent variables are not limited to PCFGs by successfully enriching both Linear Context-Free Rewriting Systems and Hybrid Grammars with latent variables, and demonstrates their applicability on discontinuous constituent parsing.

Given the usefulness of latent variables, it would be desirable to have a generic inference mechanism for any latent variable grammar. WLPs can represent inference algorithms for probabilistic grammars effectively. However, this does not trivially extend to latent-variable models because latent variables are often represented as vectors, matrices and higher-order tensors, and these taken together no longer form a semiring. This is because in the semiring framework, values for deduction items and for rules must all come from the same set, and the semiring operations must be defined over all pairs of values from this set. This does not allow different grammar nonterminals to be represented by vectors of different sizes. More importantly, it does not allow a rule's value to be a tensor whose dimensionality depends on the rule's arity, as is generally the case in latent variable frameworks.

In this paper we start with a broad interpretation of latent variables as tensors over an arbitrary semiring. While a set of tensors over semirings is no longer a semiring, we prove that if the set of tensors has certain matching dimensions for the set of grammar rules they are assigned to, then they fulfill all the desirable properties relevant for the semiring parsing framework. This paves the way to use WLPs with latent variables, naturally improving the expressivity of the statistical model represented by the underlying WLP. Introducing a semiring framework like ours makes it easier to seamlessly incorporate latent variables into any execution model for dynamic programming algorithms (or software such as Dyna, Eisner et al. 2005, and other Prolog-like/WLP-like solvers).

We focus on CFG parsing; however, the same latent variable techniques can be applied to any weighted deduction system, including systems for parsing TAG, CCG and LCFRS, and systems for Machine Translation (Lopez, 2009). The methods we present for inside and outside computation can be used to learn latent refinements of a specified grammar for any of these tasks with EM (Dempster et al., 1977; Matsuzaki et al., 2005), or used as a backbone to create spectral learning algorithms (Hsu et al., 2012; Bailly et al., 2009; Cohen et al., 2014).

2 Main Results Takeaway

We present a strict generalization of semiring weighted logic programming, with a particular focus on parser descriptions in WLP for context-free grammars. Throughout, we utilize the correspondence between axioms and grammar rules, deductive proofs and grammar derivations, and derived theorems and strings.

We assume that axioms/grammar rules come equipped with weights in the form of tensors over semiring values. The main issue with going from semirings to tensors over semiring values is that these weights need to be well defined, in that any valid derivation should correspond to a sequence of well defined semiring operations. For CFGs, we give a straightforward condition that ensures this is the case. This essentially boils down to making sure that each non-terminal corresponds to a fixed vector space dimension. For example, if $A$ corresponds to a space of $d_1$ dimensions, $B$ to $d_2$ and $C$ to $d_3$, then a rule $A \to B\,C$ would have a tensor weight of dimensions $d_2 \times d_3 \times d_1$.

As long as the weights are well defined, the standard definitions for the value of a grammar derivation and a string according to a semiring weighted grammar extend to the case of tensors of semirings. Weighted logic programming provides the means to declaratively specify an efficient algorithm to obtain these values of interest. In line with Sikkel (1998) and Goodman (1999) we present precise conditions for when a partial-semiring WLP describes a correct parser.

The value of the WLP formulation of parsing algorithms is that it provides a unified fashion in which dynamic programming algorithms can be extracted from the program description. This relies on the ability of a WLP to decompose the value of a proof into a combination of the values of the sub-proofs. Specifically, given a derivation tree, a WLP description automatically provides algorithms for calculating the inside and outside values. We provide analogous algorithms for calculating the inside and outside values for partial-semiring WLPs. Our outside formulation addresses the non-commutative nature of tensors themselves, and could be extended to cases where the underlying semiring is non-commutative using the techniques presented by Goodman (1998).

3 Related Work

"Parsing as deduction" (Pereira and Warren, 1983) is an established framework that allows a number of parsing algorithms to be written as declarative rules and deductive systems (Shieber et al., 1995), and their correctness to be rigorously stated (Sikkel, 1998). Goodman (1999) extended the parsing as deduction framework to arbitrary semirings and showed that various different values of interest could be computed using the same algorithm by changing the semiring. This led to the development of Dyna, a toolkit for declaratively specifying weighted logic programs, allowing concise implementation of a number of NLP algorithms (Eisner et al., 2005).

The semiring characterization of possible values to assign to WLPs gave rise to the formulation of a number of novel semirings. One novel semiring of interest for purposes of learning parameters is the generalized entropy semiring (Cohen et al., 2008), which can be used to calculate the KL-divergence between the distributions of derivations induced by two weighted logic programs. Two other semirings of interest are the expectation and variance semirings introduced by Eisner (2002) and Li and Eisner (2009). These utilize the algebraic structure to efficiently track quantities needed by the expectation-…

… (lhs), and a string $\alpha \in (N \cup \Sigma)^*$ on the right hand side (rhs). We will use $\alpha \Rightarrow \beta$ if $\beta$ could be derived from $\alpha$ with the application of one grammar rule. We will say that a sentence $\sigma \in \Sigma^+$ could be derived from the non-terminal $A$ if $\sigma$ could be generated by starting with $A$ and repeatedly applying rules in $R$ until the right hand side contains only terminals, and denote this as $A \Rightarrow^* \sigma$. We will denote the language that a grammar $G$ defines by $L(G) = \{\sigma \mid S \Rightarrow^* \sigma\}$.

CFG derivations can naturally be represented as trees.
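
To make the dimension-matching condition on rule weights concrete, the following is a minimal sketch of a semiring-agnostic inside computation with tensor weights. It is an illustration under stated assumptions, not the algorithm given in this paper: the toy grammar, the names `DIM`, `BINARY`, `LEXICAL`, `well_defined` and `inside`, and all numeric weights are hypothetical, the grammar is assumed to be in Chomsky normal form, and the underlying semiring is assumed to be commutative.

```python
from itertools import product

# A semiring is passed as (plus, times, zero, one). Two standard examples:
REAL = (lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
VITERBI = (max, lambda a, b: a * b, 0.0, 1.0)

# Hypothetical toy grammar: each nonterminal has a fixed latent dimension.
DIM = {"S": 1, "A": 2, "B": 3}

# A binary rule X -> Y Z carries a tensor indexed [y][z][x],
# i.e. of shape d_Y x d_Z x d_X (illustrative weights only).
BINARY = {
    ("S", "A", "B"): [[[0.1 * (a + 1) * (b + 1)] for b in range(3)]
                      for a in range(2)],
}
# A lexical rule X -> word carries a vector of length d_X.
LEXICAL = {("A", "a"): [0.5, 0.25], ("B", "b"): [0.2, 0.3, 0.5]}


def well_defined():
    """Check that every rule weight matches its nonterminals' dimensions."""
    for (X, Y, Z), T in BINARY.items():
        if len(T) != DIM[Y] or any(len(row) != DIM[Z] for row in T):
            return False
        if any(len(cell) != DIM[X] for row in T for cell in row):
            return False
    return all(len(vec) == DIM[X] for (X, _), vec in LEXICAL.items())


def inside(words, semiring, root="S"):
    """CKY-style inside pass; each chart cell holds a vector of semiring values."""
    plus, times, zero, _one = semiring
    n = len(words)
    chart = {}  # (i, j, X) -> list of length DIM[X]
    for i, w in enumerate(words):
        for (X, word), vec in LEXICAL.items():
            if word == w:
                chart[i, i + 1, X] = list(vec)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (X, Y, Z), T in BINARY.items():
                    left = chart.get((i, k, Y))
                    right = chart.get((k, j, Z))
                    if left is None or right is None:
                        continue
                    out = chart.setdefault((i, j, X), [zero] * DIM[X])
                    # Contract the rule tensor with both child vectors.
                    for x in range(DIM[X]):
                        for y, z in product(range(DIM[Y]), range(DIM[Z])):
                            term = times(times(left[y], right[z]), T[y][z][x])
                            out[x] = plus(out[x], term)
    return chart.get((0, n, root), [zero] * DIM[root])


if __name__ == "__main__":
    print(well_defined())
    print(inside(["a", "b"], REAL))
    print(inside(["a", "b"], VITERBI))
```

Swapping `REAL` for `VITERBI` changes the quantity computed without touching the chart logic, which is the sense in which the computation is agnostic to the semiring. The `well_defined` check mirrors the requirement that a rule's tensor agree with the dimensions fixed for its nonterminals; without it, the contraction in `inside` would index out of range rather than compute a value.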
