Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings

Kevin Gimpel and Noah A. Smith
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{kgimpel,nasmith}@cs.cmu.edu

Abstract

We introduce cube summing, a technique that permits dynamic programming algorithms for summing over structures (like the forward and inside algorithms) to be extended with non-local features that violate the classical structural independence assumptions. It is inspired by cube pruning (Chiang, 2007; Huang and Chiang, 2007) in its computation of non-local features dynamically using scored k-best lists, but also maintains additional residual quantities used in calculating approximate marginals. When restricted to local features, cube summing reduces to a novel semiring (k-best+residual) that generalizes many of the semirings of Goodman (1999). When non-local features are included, cube summing does not reduce to any semiring, but is compatible with generic techniques for solving dynamic programming equations.

1 Introduction

Probabilistic NLP researchers frequently make independence assumptions to keep inference algorithms tractable. Doing so limits the features that are available to our models, requiring features to be structurally local. Yet many problems in NLP—machine translation, parsing, named-entity recognition, and others—have benefited from the addition of non-local features that break classical independence assumptions. Doing so has required algorithms for approximate inference.

Recently cube pruning (Chiang, 2007; Huang and Chiang, 2007) was proposed as a way to leverage existing dynamic programming algorithms that find optimal-scoring derivations or structures when only local features are involved. Cube pruning permits approximate decoding with non-local features, but leaves open the question of how the feature weights or probabilities are learned. Meanwhile, some learning algorithms, like maximum likelihood for conditional log-linear models (Lafferty et al., 2001), unsupervised models (Pereira and Schabes, 1992), and models with hidden variables (Koo and Collins, 2005; Wang et al., 2007; Blunsom et al., 2008), require summing over the scores of many structures to calculate marginals.

We first review the semiring-weighted logic programming view of dynamic programming algorithms (Shieber et al., 1995) and identify an intuitive property of a program called proof locality that follows from feature locality in the underlying probability model (§2). We then provide an analysis of cube pruning as an approximation to the intractable problem of exact optimization over structures with non-local features and show how the use of non-local features with k-best lists breaks certain semiring properties (§3). The primary contribution of this paper is a novel technique—cube summing—for approximate summing over discrete structures with non-local features, which we relate to cube pruning (§4). We discuss implementation (§5) and show that cube summing becomes exact and expressible as a semiring when restricted to local features; this semiring generalizes many commonly-used semirings in dynamic programming (§6).

2 Background

In this section, we discuss dynamic programming algorithms as semiring-weighted logic programs. We then review the definition of semirings and important examples. We discuss the relationship between locally-factored structure scores and proofs in logic programs.

2.1 Dynamic Programming

Many algorithms in NLP involve dynamic programming (e.g., the Viterbi, forward-backward, probabilistic Earley's, and minimum edit distance algorithms).
Dynamic programming (DP) involves solving certain kinds of recursive equations with shared substructure and a topological ordering of the variables.

Shieber et al. (1995) showed a connection between DP (specifically, as used in parsing) and logic programming, and Goodman (1999) augmented such logic programs with semiring weights, giving an algebraic explanation for the intuitive connections among classes of algorithms with the same logical structure. For example, in Goodman's framework, the forward algorithm and the Viterbi algorithm are comprised of the same logic program with different semirings. Goodman defined other semirings, including ones we will use here. This formal framework was the basis for the Dyna programming language, which permits a declarative specification of the logic program and compiles it into an efficient, agenda-based, bottom-up procedure (Eisner et al., 2005).

For our purposes, a DP consists of a set of recursive equations over a set of indexed variables. For example, the probabilistic CKY algorithm (run on sentence w_1 w_2 ... w_n) is written as

  C_{X,i-1,i} = p_{X→w_i}        (1)
  C_{X,i,k} = max_{Y,Z∈N; j∈{i+1,...,k-1}}  p_{X→YZ} × C_{Y,i,j} × C_{Z,j,k}
  goal = C_{S,0,n}

where N is the nonterminal set and S ∈ N is the start symbol. Each C_{X,i,j} variable corresponds to the chart value (probability of the most likely subtree) of an X-constituent spanning the substring w_{i+1}...w_j. goal is a special variable of greatest interest, though solving for goal correctly may (in general, but not in this example) require solving for all the other values. We will use the term "index" to refer to the subscript values on variables (X, i, j on C_{X,i,j}).

Where convenient, we will make use of Shieber et al.'s logic programming view of dynamic programming. In this view, each variable (e.g., C_{X,i,j} in Eq. 1) corresponds to the value of a "theorem," the constants in the equations (e.g., p_{X→YZ} in Eq. 1) correspond to the values of "axioms," and the DP defines quantities corresponding to weighted "proofs" of the goal theorem (e.g., finding the maximum-valued proof, or aggregating proof values). The value of a proof is a combination of the values of the axioms it starts with. Semirings define these values and define two operators over them, called "aggregation" (max in Eq. 1) and "combination" (× in Eq. 1).

Goodman and Eisner et al. assumed that the values of the variables are in a semiring, and that the equations are defined solely in terms of the two semiring operations. We will often refer to the "probability" of a proof, by which we mean a nonnegative R-valued score defined by the semantics of the dynamic program variables; it may not be a normalized probability.

2.2 Semirings

A semiring is a tuple ⟨A, ⊕, ⊗, 0, 1⟩, in which A is a set, ⊕ : A × A → A is the aggregation operation, ⊗ : A × A → A is the combination operation, 0 is the additive identity element (∀a ∈ A, a ⊕ 0 = a), and 1 is the multiplicative identity element (∀a ∈ A, a ⊗ 1 = a). A semiring requires ⊕ to be associative and commutative, and ⊗ to be associative and to distribute over ⊕. Finally, we require a ⊗ 0 = 0 ⊗ a = 0 for all a ∈ A.[1] Examples include the inside semiring, ⟨R≥0, +, ×, 0, 1⟩, and the Viterbi semiring, ⟨R≥0, max, ×, 0, 1⟩. The former sums the probabilities of all proofs of each theorem. The latter (used in Eq. 1) calculates the probability of the most probable proof of each theorem. Two more examples follow.

Viterbi proof semiring. We typically need to recover the steps in the most probable proof in addition to its probability. This is often done using backpointers, but can also be accomplished by representing the most probable proof for each theorem in its entirety as part of the semiring value (Goodman, 1999). For generality, we define a proof as a string that is constructed from strings associated with axioms, but the particular form of a proof is problem-dependent. The "Viterbi proof" semiring includes the probability of the most probable proof and the proof itself. Letting L ⊆ Σ* be the proof language on some symbol set Σ, this semiring is defined on the set R≥0 × L with 0 element ⟨0, ε⟩ and 1 element ⟨1, ε⟩ (where ε denotes the empty proof string). For two values ⟨u_1, U_1⟩ and ⟨u_2, U_2⟩, the aggregation operator returns ⟨max(u_1, u_2), U_{argmax_{i∈{1,2}} u_i}⟩. The combination operator returns ⟨u_1 u_2, U_1.U_2⟩, where U_1.U_2 denotes the string concatenation of U_1 and U_2.[2]

[1] When cycles are permitted, i.e., where the value of one variable depends on itself, infinite sums can be involved. We must ensure that these infinite sums are well defined under the semiring. So-called complete semirings satisfy additional conditions to handle infinite sums, but for simplicity we will restrict our attention to DPs that do not involve cycles.

[2] We assume for simplicity that the best proof will never be a tie among more than one proof. Goodman (1999) handles this situation more carefully, though our version is more likely to be used in practice for both the Viterbi proof and k-best proof semirings.
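To make the semiring-parameterized view of Eq. 1 concrete, the following Python sketch (ours, purely illustrative; the toy grammar, sentence, and all function names are inventions for this example, not part of the paper) runs CKY against whatever aggregation and combination operations it is handed, so the inside and Viterbi semirings reuse the same logic program.

```python
from collections import defaultdict

def cky(sent, rules, lexicon, start_sym, aggregate, combine, zero):
    """Semiring-weighted CKY (Eq. 1): `aggregate` plays the role of max/oplus,
    `combine` the role of x/otimes, and `zero` is the additive identity."""
    n = len(sent)
    C = defaultdict(lambda: zero)
    for i, w in enumerate(sent, start=1):            # C[X, i-1, i] = p(X -> w_i)
        for X, p in lexicon.get(w, {}).items():
            C[X, i - 1, i] = aggregate(C[X, i - 1, i], p)
    for width in range(2, n + 1):                    # build larger spans bottom-up
        for i in range(n - width + 1):
            k = i + width
            for (X, Y, Z), p in rules.items():
                for j in range(i + 1, k):
                    val = combine(combine(p, C[Y, i, j]), C[Z, j, k])
                    C[X, i, k] = aggregate(C[X, i, k], val)
    return C[start_sym, 0, n]                        # the value of goal

# Toy grammar (illustrative only): S -> A A; each word rewrites to A.
rules = {("S", "A", "A"): 1.0}
lexicon = {"time": {"A": 0.4}, "flies": {"A": 0.6}}
sent = ["time", "flies"]

inside = cky(sent, rules, lexicon, "S", lambda a, b: a + b, lambda a, b: a * b, 0.0)
viterbi = cky(sent, rules, lexicon, "S", max, lambda a, b: a * b, 0.0)
print(inside, viterbi)   # identical here, since the toy sentence has a single parse
```

Swapping `max` for `+` is the only change needed to move from Viterbi to inside values, which is the point of the semiring abstraction.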

  Semiring      | A              | Aggregation (⊕)                            | Combination (⊗)     | 0      | 1
  inside        | R≥0            | u_1 + u_2                                  | u_1 u_2             | 0      | 1
  Viterbi       | R≥0            | max(u_1, u_2)                              | u_1 u_2             | 0      | 1
  Viterbi proof | R≥0 × L        | ⟨max(u_1, u_2), U_{argmax_{i∈{1,2}} u_i}⟩  | ⟨u_1 u_2, U_1.U_2⟩  | ⟨0, ε⟩ | ⟨1, ε⟩
  k-best proof  | (R≥0 × L)^{≤k} | max-k(u_1 ∪ u_2)                           | max-k(u_1 ⋆ u_2)    | ∅      | {⟨1, ε⟩}

Table 1: Commonly used semirings. An element in the Viterbi proof semiring is denoted ⟨u_1, U_1⟩, where u_1 is the probability of proof U_1. The max-k function returns a sorted list of the top-k proofs from a set. The ⋆ function performs a cross-product on two k-best proof lists (Eq. 2).
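As a companion to Table 1, here is a small Python sketch (ours; the list contents and k are arbitrary) of the k-best proof semiring's two operations, using max-k and the cross-product ⋆ that Eq. 2 below defines.

```python
import heapq

def max_k(proofs, k):
    """max-k: the k highest-scoring (probability, proof) pairs, sorted descending."""
    return heapq.nlargest(k, proofs, key=lambda pair: pair[0])

def star(u, v):
    """The cross-product of two proof lists (Eq. 2), concatenating proof strings."""
    return [(ui * vj, Ui + Vj) for (ui, Ui) in u for (vj, Vj) in v]

def aggregate(u, v, k):       # aggregation for the k-best proof semiring
    return max_k(u + v, k)

def combine(u, v, k):         # combination for the k-best proof semiring
    return max_k(star(u, v), k)

# Toy proof lists with k = 2; the proof strings are arbitrary symbols.
u = [(0.5, "a"), (0.2, "b")]
v = [(0.6, "c"), (0.3, "d")]
print(aggregate(u, v, 2))     # [(0.6, 'c'), (0.5, 'a')]
print(combine(u, v, 2))       # [(0.3, 'ac'), (0.15, 'ad')]
```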

k-best proof semiring. The "k-best proof" semiring computes the values and proof strings of the k most-probable proofs for each theorem. The set is (R≥0 × L)^{≤k}, i.e., sequences (up to length k) of sorted probability/proof pairs. The aggregation operator ⊕ uses max-k, which chooses the k highest-scoring proofs from its argument (a set of scored proofs) and sorts them in decreasing order. To define the combination operator ⊗, we require a cross-product that pairs probabilities and proofs from two k-best lists. We call this ⋆, defined on two semiring values u = ⟨⟨u_1, U_1⟩, ..., ⟨u_k, U_k⟩⟩ and v = ⟨⟨v_1, V_1⟩, ..., ⟨v_k, V_k⟩⟩ by:

  u ⋆ v = {⟨u_i v_j, U_i.V_j⟩ | i, j ∈ {1, ..., k}}        (2)

Then, u ⊗ v = max-k(u ⋆ v). This is similar to the k-best semiring defined by Goodman (1999). These semirings are summarized in Table 1.

2.3 Features and Inference

Let X be the space of inputs to our logic program, i.e., x ∈ X is a set of axioms. Let L denote the proof language and let Y ⊆ L denote the set of proof strings that constitute full proofs, i.e., proofs of the special goal theorem. We assume an exponential probabilistic model such that

  p(y | x) ∝ ∏_{m=1}^M λ_m^{h_m(x,y)}        (3)

where each λ_m ≥ 0 is a parameter of the model and each h_m is a feature function. There is a bijection between Y and the space of discrete structures that our model predicts.

Given such a model, DP is helpful for solving two kinds of inference problems. The first problem, decoding, is to find the highest scoring proof ŷ ∈ Y for a given input x ∈ X:

  ŷ(x) = argmax_{y∈Y} ∏_{m=1}^M λ_m^{h_m(x,y)}        (4)

The second is the summing problem, which marginalizes the proof probabilities (without normalization):

  s(x) = ∑_{y∈Y} ∏_{m=1}^M λ_m^{h_m(x,y)}        (5)

As defined, the feature functions h_m can depend on arbitrary parts of the input axiom set x and the entire output proof y.

2.4 Proof and Feature Locality

An important characteristic of problems suited for DP is that the global calculation (i.e., the value of goal) depends only on locally factored parts. In DP equations, this means that each equation connects a relatively small number of indexed variables related through a relatively small number of indices. In the logic programming formulation, it means that each step of the proof depends only on the theorems being used at that step, not the full proofs of those theorems. We call this property proof locality. In the statistical modeling view of Eq. 3, classical DP requires that the probability model make strong Markovian conditional independence assumptions (e.g., in HMMs, S_{t-1} ⊥ S_{t+1} | S_t); in exponential families over discrete structures, this corresponds to feature locality.

For a particular proof y of goal consisting of t intermediate theorems, we define a set of proof strings ℓ_i ∈ L for i ∈ {1, ..., t}, where ℓ_i corresponds to the proof of the ith theorem.[3] We can break the computation of feature function h_m into a summation over terms corresponding to each ℓ_i:

  h_m(x, y) = ∑_{i=1}^t f_m(x, ℓ_i)        (6)

This is simply a way of noting that feature functions "fire" incrementally at specific points in the proof, normally at the first opportunity. Any feature function can be expressed this way. For local features, we can go farther; we define a function top(ℓ) that returns the proof string corresponding to the antecedents and consequent of the last inference step in ℓ. Local features have the property:

  h_m^{loc}(x, y) = ∑_{i=1}^t f_m(x, top(ℓ_i))        (7)

Local features only have access to the most recent deductive proof step (though they may "fire" repeatedly in the proof), while non-local features have access to the entire proof up to a given theorem. For both kinds of features, the "f" terms are used within the DP formulation. When taking an inference step to prove theorem i, the value ∏_{m=1}^M λ_m^{f_m(x,ℓ_i)} is combined into the calculation of that theorem's value, along with the values of the antecedents. Note that typically only a small number of f_m are nonzero for theorem i.

[3] The theorem indexing scheme might be based on a topological ordering given by the proof structure, but is not important for our purposes.

When non-local h_m/f_m that depend on arbitrary parts of the proof are involved, the decoding and summing inference problems are NP-hard (they instantiate probabilistic inference in a fully connected graphical model). Sometimes, it is possible to achieve proof locality by adding more indices to the DP variables (for example, consider modifying the bigram HMM Viterbi algorithm for trigram HMMs). This increases the number of variables and hence computational cost. In general, it leads to exponential-time inference in the worst case.

There have been many algorithms proposed for approximately solving instances of these decoding and summing problems with non-local features. Some stem from work on graphical models, including loopy belief propagation (Sutton and McCallum, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et al., 2007), which allows training without inference. Several other approaches used frequently in NLP are approximate methods for decoding only. These include beam search (Lowerre, 1976), cube pruning, which we discuss in §3, integer linear programming (Roth and Yih, 2004), in which arbitrary features can act as constraints on y, and approximate solutions like McDonald and Pereira (2006), in which an exact solution to a related decoding problem is found and then modified to fit the problem of interest.

3 Approximate Decoding

Cube pruning (Chiang, 2007; Huang and Chiang, 2007) is an approximate technique for decoding (Eq. 4); it is used widely in machine translation. Given proof locality, it is essentially an efficient implementation of the k-best proof semiring. Cube pruning goes farther in that it permits non-local features to weigh in on the proof probabilities, at the expense of making the k-best operation approximate. We describe the two approximations cube pruning makes, then propose cube decoding, which removes the second approximation. Cube decoding cannot be represented as a semiring; we propose a more general algebraic structure that accommodates it.

3.1 Approximations in Cube Pruning

Cube pruning is an approximate solution to the decoding problem (Eq. 4) in two ways.

Approximation 1: k < ∞. Cube pruning uses a finite k for the k-best lists stored in each value. If k = ∞, the algorithm performs exact decoding with non-local features (at obviously formidable expense in combinatorial problems).

Approximation 2: lazy computation. Cube pruning exploits the fact that k < ∞ to use lazy computation. When combining the k-best proof lists of d theorems' values, cube pruning does not enumerate all k^d proofs, apply non-local features to all of them, and then return the top k. Instead, cube pruning uses a more efficient but approximate solution that only calculates the non-local factors on O(k) proofs to obtain the approximate top k. This trick is only approximate if non-local features are involved.

Approximation 2 makes it impossible to formulate cube pruning using separate aggregation and combination operations, as the use of lazy computation causes these two operations to effectively be performed simultaneously. To more directly relate our summing algorithm (§4) to cube pruning, we suggest a modified version of cube pruning that does not use lazy computation. We call this algorithm cube decoding. This algorithm can be written down in terms of separate aggregation and combination operations, though we will show it is not a semiring.
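The lazy computation in Approximation 2 is usually implemented as a best-first search over the d-dimensional "cube" of index tuples, after Huang and Chiang (2005). The sketch below (ours, simplified; it ignores non-local feature rescoring, which is exactly what makes the trick approximate in cube pruning) pops at most k cells and pushes O(k·d) candidates instead of scoring all k^d combinations.

```python
import heapq

def lazy_top_k(lists, k):
    """Best-first enumeration over the d-dimensional cube of one-item-per-list
    choices, popping at most k cells rather than scoring all k^d combinations."""
    d = len(lists)

    def score(idx):                           # product of the chosen items' scores
        p = 1.0
        for lst, i in zip(lists, idx):
            p *= lst[i][0]
        return p

    start = (0,) * d
    frontier = [(-score(start), start)]       # max-heap via negated scores
    seen = {start}
    out = []
    while frontier and len(out) < k:
        neg, idx = heapq.heappop(frontier)
        out.append((-neg, "".join(lists[m][i][1] for m, i in enumerate(idx))))
        for m in range(d):                    # push the d successors of this cell
            nxt = idx[:m] + (idx[m] + 1,) + idx[m + 1:]
            if nxt[m] < len(lists[m]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return out

# Two sorted proof lists; the top 2 of the 4 combinations are found lazily.
u = [(0.5, "a"), (0.2, "b")]
v = [(0.6, "c"), (0.3, "d")]
print(lazy_top_k([u, v], 2))                  # [(0.3, 'ac'), (0.15, 'ad')]
```

With sorted lists and purely local (product) scores this enumeration is exact; once non-local features rescore candidate proofs, the monotonicity it relies on no longer holds, which is the sense in which the trick is approximate.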

3.2 Cube Decoding

We formally describe cube decoding, show that it does not instantiate a semiring, then describe a more general algebraic structure that it does instantiate.

Consider the set G of non-local feature functions that map X × L → R≥0.[4] Our definitions in §2.2 for the k-best proof semiring can be expanded to accommodate these functions within the semiring value. Recall that values in the k-best proof semiring fall in A_k = (R≥0 × L)^{≤k}. For cube decoding, we use a different set A_cd defined as

  A_cd = A_k × G × {0, 1},    where A_k = (R≥0 × L)^{≤k}

where the binary variable indicates whether the value contains a k-best list (0, which we call an "ordinary" value) or a non-local feature function in G (1, which we call a "function" value). We denote a value u ∈ A_cd by

  u = ⟨ū, g_u, u_s⟩,    where ū = ⟨⟨u_1, U_1⟩, ⟨u_2, U_2⟩, ..., ⟨u_k, U_k⟩⟩

and where each u_i ∈ R≥0 is a probability and each U_i ∈ L is a proof string.

We use ⊕_k and ⊗_k to denote the k-best proof semiring's operators, defined in §2.2. We let g_0 be such that g_0(ℓ) is undefined for all ℓ ∈ L. For two values u = ⟨ū, g_u, u_s⟩, v = ⟨v̄, g_v, v_s⟩ ∈ A_cd, cube decoding's aggregation operator is:

  u ⊕_cd v = ⟨ū ⊕_k v̄, g_0, 0⟩    if ¬u_s ∧ ¬v_s        (8)

Under standard models, only ordinary values will be operands of ⊕_cd, so ⊕_cd is undefined when u_s ∨ v_s. We define the combination operator ⊗_cd:

  u ⊗_cd v =                                                (9)
    ⟨ū ⊗_k v̄, g_0, 0⟩                   if ¬u_s ∧ ¬v_s,
    ⟨max-k(exec(g_v, ū)), g_0, 0⟩        if ¬u_s ∧ v_s,
    ⟨max-k(exec(g_u, v̄)), g_0, 0⟩        if u_s ∧ ¬v_s,
    ⟨⟨⟩, λz.(g_u(z) × g_v(z)), 1⟩        if u_s ∧ v_s.

where exec(g, ū) executes the function g upon each proof in the proof list ū, modifies the scores in place by multiplying in the function result, and returns the modified proof list:

  g′ = λℓ.g(x, ℓ)
  exec(g, ū) = ⟨⟨u_1 g′(U_1), U_1⟩, ⟨u_2 g′(U_2), U_2⟩, ..., ⟨u_k g′(U_k), U_k⟩⟩

Here, max-k is simply used to re-sort the k-best proof list following function evaluation.

[4] In our setting, g_m(x, ℓ) will most commonly be defined as λ_m^{f_m(x,ℓ)} in the notation of §2.3. But functions in G could also be used to implement, e.g., hard constraints or other non-local score factors.

The semiring properties fail to hold when introducing non-local features in this way. In particular, ⊗_cd is not associative when 1 < k < ∞. For example, consider the probabilistic CKY algorithm as above, but using the cube decoding semiring with the non-local feature functions collectively known as "NGramTree" features (Huang, 2008) that score the string of terminals and nonterminals along the path from word j to word j + 1 when two constituents C_{Y,i,j} and C_{Z,j,k} are combined. The semiring value associated with such a feature is u = ⟨⟨⟩, NGramTree_π, 1⟩ (for a specific path π), and we rewrite Eq. 1 as follows (where ranges for summation are omitted for space):

  C_{X,i,k} = ⊕_cd  p_{X→YZ} ⊗_cd C_{Y,i,j} ⊗_cd C_{Z,j,k} ⊗_cd u

The combination operator is not associative since the following will give different answers:[5]

  (p_{X→YZ} ⊗_cd C_{Y,i,j}) ⊗_cd (C_{Z,j,k} ⊗_cd u)        (10)
  ((p_{X→YZ} ⊗_cd C_{Y,i,j}) ⊗_cd C_{Z,j,k}) ⊗_cd u        (11)

In Eq. 10, the non-local feature function is executed on the k-best proof list for Z, while in Eq. 11, NGramTree_π is called on the k-best proof list for the X constructed from Y and Z. Furthermore, neither of the above gives the desired result, since we actually wish to expand the full set of k^2 proofs of X and then apply NGramTree_π to each of them (or a higher-dimensional "cube" if more operands are present) before selecting the k-best. The binary operations above retain only the top k proofs of X in Eq. 11 before applying NGramTree_π to each of them. We actually would like to redefine combination so that it can operate on arbitrarily-sized sets of values.

[5] Distributivity of combination over aggregation fails for related reasons. We omit a full discussion due to space.

We can understand cube decoding through an algebraic structure with two operations ⊕ and ⊗, where ⊗ need not be associative and need not distribute over ⊕, and furthermore where ⊕ and ⊗ are defined on arbitrarily many operands. We will refer here to such a structure as a generalized semiring.[6] To define ⊗_cd on a set of operands with N′ ordinary operands and N function operands, we first compute the full O(k^{N′}) cross-product of the ordinary operands, then apply each of the N functions from the remaining operands in turn upon the full N′-dimensional "cube," finally calling max-k on the result.

[6] Algebraic structures are typically defined with binary operators only, so we were unable to find a suitable term for this structure in the literature.
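A hedged sketch of cube decoding's n-ary combination follows (ours; the tagging of values as "ordinary" or "function", the toy non-local feature, and k = 2 are illustrative choices). The full cube of ordinary operands is built first, each non-local function is then applied to whole proofs, and only afterwards is max-k taken, matching the intended non-binary behavior described above.

```python
import heapq

K = 2  # illustrative k-best list size

def max_k(proofs):
    return heapq.nlargest(K, proofs, key=lambda pair: pair[0])

def exec_(g, proofs):
    """exec(g, u-bar): multiply each proof's score by the non-local function's value."""
    return [(p * g(U), U) for p, U in proofs]

def combine_cd(operands):
    """n-ary combination: build the full cube of the ordinary operands first, apply
    each function operand to whole proofs, and only then keep the k best."""
    ordinary = [x for kind, x in operands if kind == "ordinary"]
    functions = [g for kind, g in operands if kind == "function"]
    cube = [(1.0, "")]
    for proofs in ordinary:                   # k^{N'} cells of the cube
        cube = [(p * q, U + V) for p, U in cube for q, V in proofs]
    for g in functions:                       # non-local features see entire proofs
        cube = exec_(g, cube)
    return ("ordinary", max_k(cube))

# Toy usage: a non-local feature that strongly rewards proofs containing "ad".
bonus = lambda U: 4.0 if "ad" in U else 1.0
u = ("ordinary", [(0.5, "a"), (0.2, "b")])
v = ("ordinary", [(0.6, "c"), (0.3, "d")])
print(combine_cd([u, v, ("function", bonus)]))
# ('ordinary', [(0.6, 'ad'), (0.3, 'ac')])
```

Taking max-k only at the end is what distinguishes this from applying the binary operator of Eq. 9 left to right, which would prune to k proofs of X before NGramTree-style functions ever see them.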

4 Cube Summing

We present an approximate solution to the summing problem when non-local features are involved, which we call cube summing. It is an extension of cube decoding, and so we will describe it as a generalized semiring. The key addition is to maintain in each value, in addition to the k-best list of proofs from A_k, a scalar corresponding to the residual probability (possibly unnormalized) of all proofs not among the k-best.[7] The k-best proofs are still used for dynamically computing non-local features but the aggregation and combination operations are redefined to update the residual as appropriate.

[7] Blunsom and Osborne (2008) described a related approach to approximate summing using the chart computed during cube pruning, but did not keep track of the residual terms as we do here.

We define the set A_cs for cube summing as

  A_cs = R≥0 × (R≥0 × L)^{≤k} × G × {0, 1}

A value u ∈ A_cs is defined as

  u = ⟨u_0, ū, g_u, u_s⟩,    where ū = ⟨⟨u_1, U_1⟩, ⟨u_2, U_2⟩, ..., ⟨u_k, U_k⟩⟩

For a proof list ū, we use ‖ū‖ to denote the sum of all proof scores, ∑_{i:⟨u_i,U_i⟩∈ū} u_i.

The aggregation operator over operands {u_i}_{i=1}^N, all such that u_{is} = 0, is defined by:[8]

  ⊕_{i=1}^N u_i = ⟨ ∑_{i=1}^N u_{i0} + ‖Res(⋃_{i=1}^N ū_i)‖,  max-k(⋃_{i=1}^N ū_i),  g_0,  0 ⟩        (12)

where Res returns the "residual" set of scored proofs not in the k-best among its arguments, possibly the empty set.

[8] We assume that operands u_i to ⊕_cs will never be such that u_{is} = 1 (non-local feature functions). This is reasonable in the widely used log-linear model setting we have adopted, where weights λ_m are factors in a proof's product score.

For a set of N + N′ operands {v_i}_{i=1}^N ∪ {w_j}_{j=1}^{N′} such that v_{is} = 1 (non-local feature functions) and w_{js} = 0 (ordinary values), the combination operator ⊗ is shown in Eq. 13 (Fig. 1). Note that the case where N′ = 0 is not needed in this application; an ordinary value will always be included in combination.

In the special case of two ordinary operands (where u_s = v_s = 0), Eq. 13 reduces to

  u ⊗ v = ⟨ u_0 v_0 + u_0‖v̄‖ + v_0‖ū‖ + ‖Res(ū ⋆ v̄)‖,  max-k(ū ⋆ v̄),  g_0,  0 ⟩        (14)

We define 0 as ⟨0, ⟨⟩, g_0, 0⟩; an appropriate definition for the combination identity element is less straightforward and of little practical importance; we leave it to future work.

If we use this generalized semiring to solve a DP and achieve goal value of u, the approximate sum of all proof probabilities is given by u_0 + ‖ū‖. If all features are local, the approach is exact. With non-local features, the k-best list may not contain the k-best proofs, and the residual score, while including all possible proofs, may not include all of the non-local features in all of those proofs' probabilities.

  (⊗_{i=1}^N v_i) ⊗ (⊗_{j=1}^{N′} w_j) =
      ⟨ ∑_{B∈P(S)} (∏_{b∈B} w_{b0}) (∏_{c∈S\B} ‖w̄_c‖)
          + ‖Res(exec(g_{v_1}, ... exec(g_{v_N}, w̄_1 ⋆ ··· ⋆ w̄_{N′}) ...))‖,
        max-k(exec(g_{v_1}, ... exec(g_{v_N}, w̄_1 ⋆ ··· ⋆ w̄_{N′}) ...)),  g_0,  0 ⟩        (13)

Figure 1: Combination operation for cube summing, where S = {1, 2, ..., N′} and P(S) is the power set of S excluding ∅.
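To illustrate Eqs. 12 and 14 in the purely local (k-best+residual) case, here is a minimal Python sketch (ours; k, the proof lists, and all names are invented). Each value is a pair of a residual and a k-best proof list, and the invariant that residual plus ‖ū‖ equals the exact inside sum is checked on a toy example.

```python
import heapq

K = 2  # illustrative k-best list size

def max_k(proofs):
    return heapq.nlargest(K, proofs, key=lambda pair: pair[0])

def total(proofs):
    """The sum of the scores in a proof list (the norm of u-bar)."""
    return sum(p for p, _ in proofs)

def aggregate(values):
    """Eq. 12 restricted to ordinary operands: add the residuals, keep the k best
    pooled proofs, and fold everything else into the residual."""
    pooled = [p for _, proofs in values for p in proofs]
    kept = max_k(pooled)
    residual = sum(r for r, _ in values) + total(pooled) - total(kept)
    return (residual, kept)

def combine(u, v):
    """Eq. 14 (two ordinary operands): cross the proof lists, keep the k best,
    and push the rest, plus all residual cross-terms, into the residual."""
    u0, ub = u
    v0, vb = v
    crossed = [(pu * pv, U + V) for pu, U in ub for pv, V in vb]
    kept = max_k(crossed)
    residual = u0 * v0 + u0 * total(vb) + v0 * total(ub) + (total(crossed) - total(kept))
    return (residual, kept)

# Toy check: with local features only, residual + proof-list total is the exact inside sum.
u = (0.0, [(0.5, "a"), (0.2, "b")])
v = (0.1, [(0.6, "c"), (0.3, "d")])
w = combine(u, v)
print(w[0] + total(w[1]))          # (0.0 + 0.7) * (0.1 + 0.9) = 0.7, up to rounding
a = aggregate([u, v])
print(a[0] + total(a[1]))          # (0.0 + 0.7) + (0.1 + 0.9) = 1.7, up to rounding
```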

5 Implementation

We have so far viewed dynamic programming algorithms in terms of their declarative specifications as semiring-weighted logic programs. Solvers have been proposed by Goodman (1999), by Klein and Manning (2001) using a hypergraph representation, and by Eisner et al. (2005). Because Goodman's and Eisner et al.'s algorithms assume semirings, adapting them for cube summing is non-trivial.[9]

[9] The bottom-up agenda algorithm in Eisner et al. (2005) might possibly be generalized so that associativity, distributivity, and binary operators are not required (John Blatz, p.c.).

To generalize Goodman's algorithm, we suggest using the directed-graph data structure known variously as an arithmetic circuit or computation graph.[10] Arithmetic circuits have recently drawn interest in the graphical model community as a tool for performing probabilistic inference (Darwiche, 2003). In the directed graph, there are vertices corresponding to axioms (these are sinks in the graph), ⊕ vertices corresponding to theorems, and ⊗ vertices corresponding to summands in the dynamic programming equations. Directed edges point from each node to the nodes it depends on; ⊕ vertices depend on ⊗ vertices, which depend on ⊕ and axiom vertices.

[10] This data structure is not specific to any particular set of operations. We have also used it successfully with the inside semiring.

Arithmetic circuits are amenable to automatic differentiation in the reverse mode (Griewank and Corliss, 1991), commonly used in back-propagation algorithms. Importantly, this permits us to calculate the exact gradient of the approximate summation with respect to axiom values, following Eisner et al. (2005). This is desirable when carrying out the optimization problems involved in parameter estimation. Another differentiation technique, implemented within the semiring, is given by Eisner (2002).

Cube pruning is based on the k-best algorithms of Huang and Chiang (2005), which save time over generic semiring implementations through lazy computation in both the aggregation and combination operations. Their techniques are not as clearly applicable here, because our goal is to sum over all proofs instead of only finding a small subset of them. If computing non-local features is a computational bottleneck, they can be computed only for the O(k) proofs considered when choosing the best k as in cube pruning. Then, the computational requirements for approximate summing are nearly equivalent to cube pruning, but the approximation is less accurate.

Figure 2: Semirings generalized by k-best+residual. [The figure relates the k-best+residual semiring to the k-best proof semiring (Goodman, 1999), the inside semiring (Baum et al., 1970), the Viterbi proof semiring (Goodman, 1999), the all-proof semiring (Goodman, 1999), and the Viterbi semiring (Viterbi, 1967), via the special cases k = 1, k = ∞, ignoring the residual, and ignoring the proof.]
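A minimal sketch (ours, for the plain inside semiring only, since that keeps ⊕ and ⊗ as ordinary + and ×; class and field names are invented) of the arithmetic-circuit idea: sum and product vertices over axiom leaves, with reverse-mode differentiation giving the gradient of goal with respect to each axiom value.

```python
class Node:
    """A vertex in an arithmetic circuit: an axiom (leaf), a sum node (+), or a product node (*)."""
    def __init__(self, op, children=(), value=0.0):
        self.op, self.children, self.value, self.grad = op, tuple(children), value, 0.0

    def forward(self):
        if self.op == "+":
            self.value = sum(c.forward() for c in self.children)
        elif self.op == "*":
            self.value = 1.0
            for c in self.children:
                self.value *= c.forward()
        return self.value                     # axiom leaves just return their value

def backward(goal):
    """Reverse-mode differentiation: accumulate d(goal)/d(node) into each .grad."""
    order, seen = [], set()
    def topo(n):                              # children before parents
        if id(n) not in seen:
            seen.add(id(n))
            for c in n.children:
                topo(c)
            order.append(n)
    topo(goal)
    goal.grad = 1.0
    for n in reversed(order):                 # parents before children
        for c in n.children:
            if n.op == "+":
                c.grad += n.grad
            elif n.op == "*":
                prod = 1.0
                for other in n.children:
                    if other is not c:
                        prod *= other.value
                c.grad += n.grad * prod

# Toy circuit for goal = p1*q + p2*q, with axiom values as inside probabilities.
p1, p2, q = Node("axiom", value=0.2), Node("axiom", value=0.3), Node("axiom", value=0.5)
goal = Node("+", [Node("*", [p1, q]), Node("*", [p2, q])])
print(goal.forward())   # 0.25
backward(goal)
print(q.grad)           # d(goal)/dq = p1 + p2 = 0.5
```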

6 Semirings Old and New

We now consider interesting special cases and variations of cube summing.

6.1 The k-best+residual Semiring

When restricted to local features, cube pruning and cube summing can be seen as proper semirings. Cube pruning reduces to an implementation of the k-best semiring (Goodman, 1998), and cube summing reduces to a novel semiring we call the k-best+residual semiring. Binary instantiations of ⊗ and ⊕ can be iteratively reapplied to give the equivalent formulations in Eqs. 12 and 13. We define 0 as ⟨0, ⟨⟩⟩ and 1 as ⟨1, ⟨⟨1, ε⟩⟩⟩. The ⊕ operator is easily shown to be commutative. That ⊕ is associative follows from associativity of max-k, shown by Goodman (1998). Showing that ⊗ is associative and that ⊗ distributes over ⊕ are less straightforward; proof sketches are provided in Appendix A. The k-best+residual semiring generalizes many semirings previously introduced in the literature; see Fig. 2.

6.2 Variations

Once we relax requirements about associativity and distributivity and permit aggregation and combination operators to operate on sets, several extensions to cube summing become possible. First, when computing approximate summations with non-local features, we may not always be interested in the best proofs for each item. Since the purpose of summing is often to calculate statistics under a model distribution, we may wish instead to sample from that distribution. We can replace the max-k function with a sample-k function that samples k proofs from the scored list in its argument, possibly using the scores or possibly uniformly at random. This breaks associativity of ⊕. We conjecture that this approach can be used to simulate particle filtering for structured models. Another variation is to vary k for different theorems. This might be used to simulate beam search, or to reserve computation for theorems closer to goal, which have more proofs.

7 Conclusion

This paper has drawn a connection between cube pruning, a popular technique for approximately solving decoding problems, and the semiring-weighted logic programming view of dynamic programming. We have introduced a generalization called cube summing, to be used for solving summing problems, and have argued that cube pruning and cube summing are both semirings that can be used generically, as long as the underlying probability models only include local features. With non-local features, cube pruning and cube summing can be used for approximate decoding and summing, respectively, and although they no longer correspond to semirings, generic algorithms can still be used.

Acknowledgments

We thank three anonymous EACL reviewers, John Blatz, Pedro Domingos, Jason Eisner, Joshua Goodman, and members of the ARK group for helpful comments and feedback that improved this paper. This research was supported by NSF IIS-0836431 and an IBM faculty award.

A k-best+residual is a Semiring

In showing that k-best+residual is a semiring, we will restrict our attention to the computation of the residuals. The computation over proof lists is identical to that performed in the k-best proof semiring, which was shown to be a semiring by Goodman (1998). We sketch the proofs that ⊗ is associative and that ⊗ distributes over ⊕; associativity of ⊕ is straightforward.

For a proof list ā, ‖ā‖ denotes the sum of proof scores, ∑_{i:⟨a_i,A_i⟩∈ā} a_i. Note that:

  ‖Res(ā)‖ + ‖max-k(ā)‖ = ‖ā‖        (15)
  ‖ā ⋆ b̄‖ = ‖ā‖ ‖b̄‖        (16)

Associativity. Given three semiring values u, v, and w, we need to show that (u ⊗ v) ⊗ w = u ⊗ (v ⊗ w). After expanding the expressions for the residuals using Eq. 14, there are 10 terms on each side, five of which are identical and cancel out immediately. Three more cancel using Eq. 15, leaving:

  LHS = ‖Res(ū ⋆ v̄)‖ ‖w̄‖ + ‖Res(max-k(ū ⋆ v̄) ⋆ w̄)‖
  RHS = ‖ū‖ ‖Res(v̄ ⋆ w̄)‖ + ‖Res(ū ⋆ max-k(v̄ ⋆ w̄))‖

If LHS = RHS, associativity holds. Using Eq. 15 again, we can rewrite the second term in LHS to obtain

  LHS = ‖Res(ū ⋆ v̄)‖ ‖w̄‖ + ‖max-k(ū ⋆ v̄) ⋆ w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ⋆ w̄)‖

Using Eq. 16 and pulling out the common term ‖w̄‖, we have

  LHS = (‖Res(ū ⋆ v̄)‖ + ‖max-k(ū ⋆ v̄)‖) ‖w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ⋆ w̄)‖
      = ‖(ū ⋆ v̄) ⋆ w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ⋆ w̄)‖
      = ‖(ū ⋆ v̄) ⋆ w̄‖ − ‖max-k((ū ⋆ v̄) ⋆ w̄)‖

The resulting expression is intuitive: the residual of (u ⊗ v) ⊗ w is the difference between the sum of all proof scores and the sum of the k-best. RHS can be transformed into this same expression with a similar line of reasoning (and using associativity of ⋆). Therefore, LHS = RHS and ⊗ is associative.

Distributivity. To prove that ⊗ distributes over ⊕, we must show left-distributivity, i.e., that u ⊗ (v ⊕ w) = (u ⊗ v) ⊕ (u ⊗ w), and right-distributivity. We show left-distributivity here. As above, we expand the expressions, finding 8 terms on the LHS and 9 on the RHS. Six on each side cancel, leaving:

  LHS = ‖Res(v̄ ∪ w̄)‖ ‖ū‖ + ‖Res(ū ⋆ max-k(v̄ ∪ w̄))‖
  RHS = ‖Res(ū ⋆ v̄)‖ + ‖Res(ū ⋆ w̄)‖ + ‖Res(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖

We can rewrite LHS as:

  LHS = ‖Res(v̄ ∪ w̄)‖ ‖ū‖ + ‖ū ⋆ max-k(v̄ ∪ w̄)‖ − ‖max-k(ū ⋆ max-k(v̄ ∪ w̄))‖
      = ‖ū‖ (‖Res(v̄ ∪ w̄)‖ + ‖max-k(v̄ ∪ w̄)‖) − ‖max-k(ū ⋆ max-k(v̄ ∪ w̄))‖
      = ‖ū‖ ‖v̄ ∪ w̄‖ − ‖max-k(ū ⋆ (v̄ ∪ w̄))‖
      = ‖ū‖ ‖v̄ ∪ w̄‖ − ‖max-k((ū ⋆ v̄) ∪ (ū ⋆ w̄))‖

where the last line follows because ⋆ distributes over ∪ (Goodman, 1998). We now work with the RHS:

  RHS = ‖Res(ū ⋆ v̄)‖ + ‖Res(ū ⋆ w̄)‖ + ‖Res(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖
      = ‖Res(ū ⋆ v̄)‖ + ‖Res(ū ⋆ w̄)‖ + ‖max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄)‖
          − ‖max-k(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖

Since max-k(ū ⋆ v̄) and max-k(ū ⋆ w̄) are disjoint (we assume no duplicates; i.e., two different theorems cannot have exactly the same proof), the third term becomes ‖max-k(ū ⋆ v̄)‖ + ‖max-k(ū ⋆ w̄)‖ and we have

  RHS = ‖ū ⋆ v̄‖ + ‖ū ⋆ w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖
      = ‖ū‖ ‖v̄‖ + ‖ū‖ ‖w̄‖ − ‖max-k((ū ⋆ v̄) ∪ (ū ⋆ w̄))‖
      = ‖ū‖ ‖v̄ ∪ w̄‖ − ‖max-k((ū ⋆ v̄) ∪ (ū ⋆ w̄))‖ .

References

L. E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1).

P. Blunsom and M. Osborne. 2008. Probabilistic inference for machine translation. In Proc. of EMNLP.

P. Blunsom, T. Cohn, and M. Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proc. of ACL.

D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

W. W. Cohen and V. Carvalho. 2005. Stacked sequential learning. In Proc. of IJCAI.

A. Darwiche. 2003. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3).

J. Eisner, E. Goldlust, and N. A. Smith. 2005. Compiling Comp Ling: Practical weighted dynamic programming and the Dyna language. In Proc. of HLT-EMNLP.

J. Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proc. of ACL.

J. R. Finkel, T. Grenager, and C. D. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of ACL.

J. Goodman. 1998. Parsing inside-out. Ph.D. thesis, Harvard University.

J. Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.

A. Griewank and G. Corliss. 1991. Automatic Differentiation of Algorithms. SIAM.

L. Huang and D. Chiang. 2005. Better k-best parsing. In Proc. of IWPT.

L. Huang and D. Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proc. of ACL.

L. Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. of ACL.

M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2).

D. Klein and C. Manning. 2001. Parsing and hypergraphs. In Proc. of IWPT.

T. Koo and M. Collins. 2005. Hidden-variable models for discriminative reranking. In Proc. of EMNLP.

K. Kurihara and T. Sato. 2006. Variational Bayesian grammar induction for natural language. In Proc. of ICGI.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.

R. Levy, F. Reali, and T. Griffiths. 2008. Modeling the effects of memory on human online sentence processing with particle filters. In Advances in NIPS.

B. T. Lowerre. 1976. The Harpy System. Ph.D. thesis, Carnegie Mellon University.

D. J. C. MacKay. 1997. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, Cambridge.

A. F. T. Martins, D. Das, N. A. Smith, and E. P. Xing. 2008. Stacking dependency parsers. In Proc. of EMNLP.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. of EACL.

F. C. N. Pereira and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proc. of ACL, pages 128–135.

D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proc. of CoNLL.

S. Shieber, Y. Schabes, and F. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1-2):3–36.

D. A. Smith and J. Eisner. 2008. Dependency parsing by belief propagation. In Proc. of EMNLP.

N. A. Smith, D. L. Vail, and J. D. Lafferty. 2007. Computationally efficient M-estimation of log-linear structure models. In Proc. of ACL.

C. Sutton and A. McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. In Proc. of ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields.

A. J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Processing, 13(2).

M. Wang, N. A. Smith, and T. Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proc. of EMNLP-CoNLL.