Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings

Kevin Gimpel and Noah A. Smith
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{kgimpel,nasmith}@cs.cmu.edu

Abstract

We introduce cube summing, a technique that permits dynamic programming algorithms for summing over structures (like the forward and inside algorithms) to be extended with non-local features that violate the classical structural independence assumptions. It is inspired by cube pruning (Chiang, 2007; Huang and Chiang, 2007) in its computation of non-local features dynamically using scored k-best lists, but also maintains additional residual quantities used in calculating approximate marginals. When restricted to local features, cube summing reduces to a novel semiring (k-best+residual) that generalizes many of the semirings of Goodman (1999). When non-local features are included, cube summing does not reduce to any semiring, but is compatible with generic techniques for solving dynamic programming equations.

1 Introduction

Probabilistic NLP researchers frequently make independence assumptions to keep inference algorithms tractable. Doing so limits the features that are available to our models, requiring features to be structurally local. Yet many problems in NLP—machine translation, parsing, named-entity recognition, and others—have benefited from the addition of non-local features that break classical independence assumptions. Doing so has required algorithms for approximate inference.

Recently cube pruning (Chiang, 2007; Huang and Chiang, 2007) was proposed as a way to leverage existing dynamic programming algorithms that find optimal-scoring derivations or structures when only local features are involved. Cube pruning permits approximate decoding with non-local features, but leaves open the question of how the feature weights or probabilities are learned. Meanwhile, some learning algorithms, like maximum likelihood for conditional log-linear models (Lafferty et al., 2001), unsupervised models (Pereira and Schabes, 1992), and models with hidden variables (Koo and Collins, 2005; Wang et al., 2007; Blunsom et al., 2008), require summing over the scores of many structures to calculate marginals.

We first review the semiring-weighted logic programming view of dynamic programming algorithms (Shieber et al., 1995) and identify an intuitive property of a program called proof locality that follows from feature locality in the underlying probability model (§2). We then provide an analysis of cube pruning as an approximation to the intractable problem of exact optimization over structures with non-local features and show how the use of non-local features with k-best lists breaks certain semiring properties (§3). The primary contribution of this paper is a novel technique—cube summing—for approximate summing over discrete structures with non-local features, which we relate to cube pruning (§4). We discuss implementation (§5) and show that cube summing becomes exact and expressible as a semiring when restricted to local features; this semiring generalizes many commonly-used semirings in dynamic programming (§6).

2 Background

In this section, we discuss dynamic programming algorithms as semiring-weighted logic programs. We then review the definition of semirings and important examples. We discuss the relationship between locally-factored structure scores and proofs in logic programs.

2.1 Dynamic Programming

Many algorithms in NLP involve dynamic programming (e.g., the Viterbi, forward-backward, probabilistic Earley's, and minimum edit distance algorithms).
Dynamic programming (DP) involves solving certain kinds of recursive equations with shared substructure and a topological ordering of the variables.

Shieber et al. (1995) showed a connection between DP (specifically, as used in parsing) and logic programming, and Goodman (1999) augmented such logic programs with semiring weights, giving an algebraic explanation for the intuitive connections among classes of algorithms with the same logical structure. For example, in Goodman's framework, the forward algorithm and the Viterbi algorithm are comprised of the same logic program with different semirings. Goodman defined other semirings, including ones we will use here. This formal framework was the basis for the Dyna programming language, which permits a declarative specification of the logic program and compiles it into an efficient, agenda-based, bottom-up procedure (Eisner et al., 2005).

For our purposes, a DP consists of a set of recursive equations over a set of indexed variables. For example, the probabilistic CKY algorithm (run on sentence w_1 w_2 ... w_n) is written as

  C_{X,i-1,i} = p_{X→w_i}        (1)
  C_{X,i,k} = max_{Y,Z∈N; j∈{i+1,...,k-1}}  p_{X→YZ} × C_{Y,i,j} × C_{Z,j,k}
  goal = C_{S,0,n}

where N is the nonterminal set and S ∈ N is the start symbol. Each C_{X,i,j} variable corresponds to the chart value (probability of the most likely subtree) of an X-constituent spanning the substring w_{i+1}...w_j. goal is a special variable of greatest interest, though solving for goal correctly may (in general, but not in this example) require solving for all the other values. We will use the term "index" to refer to the subscript values on variables (X, i, j on C_{X,i,j}).

Where convenient, we will make use of Shieber et al.'s logic programming view of dynamic programming. In this view, each variable (e.g., C_{X,i,j} in Eq. 1) corresponds to the value of a "theorem," the constants in the equations (e.g., p_{X→YZ} in Eq. 1) correspond to the values of "axioms," and the DP defines quantities corresponding to weighted "proofs" of the goal theorem (e.g., finding the maximum-valued proof, or aggregating proof values). The value of a proof is a combination of the values of the axioms it starts with. Semirings define these values and define two operators over them, called "aggregation" (max in Eq. 1) and "combination" (× in Eq. 1).

Goodman and Eisner et al. assumed that the values of the variables are in a semiring, and that the equations are defined solely in terms of the two semiring operations. We will often refer to the "probability" of a proof, by which we mean a nonnegative R-valued score defined by the semantics of the dynamic program variables; it may not be a normalized probability.

2.2 Semirings

A semiring is a tuple ⟨A, ⊕, ⊗, 0, 1⟩, in which A is a set, ⊕ : A × A → A is the aggregation operation, ⊗ : A × A → A is the combination operation, 0 is the additive identity element (∀a ∈ A, a ⊕ 0 = a), and 1 is the multiplicative identity element (∀a ∈ A, a ⊗ 1 = a). A semiring requires ⊕ to be associative and commutative, and ⊗ to be associative and to distribute over ⊕. Finally, we require a ⊗ 0 = 0 ⊗ a = 0 for all a ∈ A.[1] Examples include the inside semiring, ⟨R≥0, +, ×, 0, 1⟩, and the Viterbi semiring, ⟨R≥0, max, ×, 0, 1⟩. The former sums the probabilities of all proofs of each theorem. The latter (used in Eq. 1) calculates the probability of the most probable proof of each theorem. Two more examples follow.

Viterbi proof semiring. We typically need to recover the steps in the most probable proof in addition to its probability. This is often done using backpointers, but can also be accomplished by representing the most probable proof for each theorem in its entirety as part of the semiring value (Goodman, 1999). For generality, we define a proof as a string that is constructed from strings associated with axioms, but the particular form of a proof is problem-dependent. The "Viterbi proof" semiring includes the probability of the most probable proof and the proof itself. Letting L ⊆ Σ* be the proof language on some symbol set Σ, this semiring is defined on the set R≥0 × L with 0 element ⟨0, ε⟩ and 1 element ⟨1, ε⟩ (where ε denotes the empty proof string). For two values ⟨u_1, U_1⟩ and ⟨u_2, U_2⟩, the aggregation operator returns ⟨max(u_1, u_2), U_{argmax_{i∈{1,2}} u_i}⟩. The combination operator returns ⟨u_1 u_2, U_1.U_2⟩, where U_1.U_2 denotes the string concatenation of U_1 and U_2.[2]

[1] When cycles are permitted, i.e., where the value of one variable depends on itself, infinite sums can be involved. We must ensure that these infinite sums are well defined under the semiring. So-called complete semirings satisfy additional conditions to handle infinite sums, but for simplicity we will restrict our attention to DPs that do not involve cycles.

[2] We assume for simplicity that the best proof will never be a tie among more than one proof. Goodman (1999) handles this situation more carefully, though our version is more likely to be used in practice for both the Viterbi proof and k-best proof semirings.
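To make the semiring-parameterized view of Eq. 1 concrete, the following Python sketch (ours, purely illustrative; the toy grammar, sentence, and all function names are inventions for this example, not part of the paper) runs CKY against whatever aggregation and combination operations it is handed, so the inside and Viterbi semirings reuse the same logic program.

```python
from collections import defaultdict

def cky(sent, rules, lexicon, start_sym, aggregate, combine, zero):
    """Semiring-weighted CKY (Eq. 1): `aggregate` plays the role of max/oplus,
    `combine` the role of x/otimes, and `zero` is the additive identity."""
    n = len(sent)
    C = defaultdict(lambda: zero)
    for i, w in enumerate(sent, start=1):            # C[X, i-1, i] = p(X -> w_i)
        for X, p in lexicon.get(w, {}).items():
            C[X, i - 1, i] = aggregate(C[X, i - 1, i], p)
    for width in range(2, n + 1):                    # build larger spans bottom-up
        for i in range(n - width + 1):
            k = i + width
            for (X, Y, Z), p in rules.items():
                for j in range(i + 1, k):
                    val = combine(combine(p, C[Y, i, j]), C[Z, j, k])
                    C[X, i, k] = aggregate(C[X, i, k], val)
    return C[start_sym, 0, n]                        # the value of goal

# Toy grammar (illustrative only): S -> A A; each word rewrites to A.
rules = {("S", "A", "A"): 1.0}
lexicon = {"time": {"A": 0.4}, "flies": {"A": 0.6}}
sent = ["time", "flies"]

inside = cky(sent, rules, lexicon, "S", lambda a, b: a + b, lambda a, b: a * b, 0.0)
viterbi = cky(sent, rules, lexicon, "S", max, lambda a, b: a * b, 0.0)
print(inside, viterbi)   # identical here, since the toy sentence has a single parse
```

Swapping `max` for `+` is the only change needed to move from Viterbi to inside values, which is the point of the semiring abstraction.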

  Semiring      | A              | Aggregation (⊕)                            | Combination (⊗)     | 0      | 1
  inside        | R≥0            | u_1 + u_2                                  | u_1 u_2             | 0      | 1
  Viterbi       | R≥0            | max(u_1, u_2)                              | u_1 u_2             | 0      | 1
  Viterbi proof | R≥0 × L        | ⟨max(u_1, u_2), U_{argmax_{i∈{1,2}} u_i}⟩  | ⟨u_1 u_2, U_1.U_2⟩  | ⟨0, ε⟩ | ⟨1, ε⟩
  k-best proof  | (R≥0 × L)^{≤k} | max-k(u_1 ∪ u_2)                           | max-k(u_1 ⋆ u_2)    | ∅      | {⟨1, ε⟩}

Table 1: Commonly used semirings. An element in the Viterbi proof semiring is denoted ⟨u_1, U_1⟩, where u_1 is the probability of proof U_1. The max-k function returns a sorted list of the top-k proofs from a set. The ⋆ function performs a cross-product on two k-best proof lists (Eq. 2).
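As a companion to Table 1, here is a small Python sketch (ours; the list contents and k are arbitrary) of the k-best proof semiring's two operations, using max-k and the cross-product ⋆ that Eq. 2 below defines.

```python
import heapq

def max_k(proofs, k):
    """max-k: the k highest-scoring (probability, proof) pairs, sorted descending."""
    return heapq.nlargest(k, proofs, key=lambda pair: pair[0])

def star(u, v):
    """The cross-product of two proof lists (Eq. 2), concatenating proof strings."""
    return [(ui * vj, Ui + Vj) for (ui, Ui) in u for (vj, Vj) in v]

def aggregate(u, v, k):       # aggregation for the k-best proof semiring
    return max_k(u + v, k)

def combine(u, v, k):         # combination for the k-best proof semiring
    return max_k(star(u, v), k)

# Toy proof lists with k = 2; the proof strings are arbitrary symbols.
u = [(0.5, "a"), (0.2, "b")]
v = [(0.6, "c"), (0.3, "d")]
print(aggregate(u, v, 2))     # [(0.6, 'c'), (0.5, 'a')]
print(combine(u, v, 2))       # [(0.3, 'ac'), (0.15, 'ad')]
```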

k-best proof semiring. The "k-best proof" semiring computes the values and proof strings of the k most-probable proofs for each theorem. The set is (R≥0 × L)^{≤k}, i.e., sequences (up to length k) of sorted probability/proof pairs. The aggregation operator ⊕ uses max-k, which chooses the k highest-scoring proofs from its argument (a set of scored proofs) and sorts them in decreasing order. To define the combination operator ⊗, we require a cross-product that pairs probabilities and proofs from two k-best lists. We call this ⋆, defined on two semiring values u = ⟨⟨u_1, U_1⟩, ..., ⟨u_k, U_k⟩⟩ and v = ⟨⟨v_1, V_1⟩, ..., ⟨v_k, V_k⟩⟩ by:

  u ⋆ v = {⟨u_i v_j, U_i.V_j⟩ | i, j ∈ {1, ..., k}}        (2)

Then, u ⊗ v = max-k(u ⋆ v). This is similar to the k-best semiring defined by Goodman (1999). These semirings are summarized in Table 1.

2.3 Features and Inference

Let X be the space of inputs to our logic program, i.e., x ∈ X is a set of axioms. Let L denote the proof language and let Y ⊆ L denote the set of proof strings that constitute full proofs, i.e., proofs of the special goal theorem. We assume an exponential probabilistic model such that

  p(y | x) ∝ ∏_{m=1}^M λ_m^{h_m(x,y)}        (3)

where each λ_m ≥ 0 is a parameter of the model and each h_m is a feature function. There is a bijection between Y and the space of discrete structures that our model predicts.

Given such a model, DP is helpful for solving two kinds of inference problems. The first problem, decoding, is to find the highest scoring proof ŷ ∈ Y for a given input x ∈ X:

  ŷ(x) = argmax_{y∈Y} ∏_{m=1}^M λ_m^{h_m(x,y)}        (4)

The second is the summing problem, which marginalizes the proof probabilities (without normalization):

  s(x) = ∑_{y∈Y} ∏_{m=1}^M λ_m^{h_m(x,y)}        (5)

As defined, the feature functions h_m can depend on arbitrary parts of the input axiom set x and the entire output proof y.

2.4 Proof and Feature Locality

An important characteristic of problems suited for DP is that the global calculation (i.e., the value of goal) depends only on locally factored parts. In DP equations, this means that each equation connects a relatively small number of indexed variables related through a relatively small number of indices. In the logic programming formulation, it means that each step of the proof depends only on the theorems being used at that step, not the full proofs of those theorems. We call this property proof locality. In the statistical modeling view of Eq. 3, classical DP requires that the probability model make strong Markovian conditional independence assumptions (e.g., in HMMs, S_{t-1} ⊥ S_{t+1} | S_t); in exponential families over discrete structures, this corresponds to feature locality.

For a particular proof y of goal consisting of t intermediate theorems, we define a set of proof strings ℓ_i ∈ L for i ∈ {1, ..., t}, where ℓ_i corresponds to the proof of the ith theorem.[3] We can break the computation of feature function h_m into a summation over terms corresponding to each ℓ_i:

  h_m(x, y) = ∑_{i=1}^t f_m(x, ℓ_i)        (6)

This is simply a way of noting that feature functions "fire" incrementally at specific points in the proof, normally at the first opportunity. Any feature function can be expressed this way. For local features, we can go farther; we define a function top(ℓ) that returns the proof string corresponding to the antecedents and consequent of the last inference step in ℓ. Local features have the property:

  h_m^{loc}(x, y) = ∑_{i=1}^t f_m(x, top(ℓ_i))        (7)

Local features only have access to the most recent deductive proof step (though they may "fire" repeatedly in the proof), while non-local features have access to the entire proof up to a given theorem. For both kinds of features, the "f" terms are used within the DP formulation. When taking an inference step to prove theorem i, the value ∏_{m=1}^M λ_m^{f_m(x,ℓ_i)} is combined into the calculation of that theorem's value, along with the values of the antecedents. Note that typically only a small number of f_m are nonzero for theorem i.

[3] The theorem indexing scheme might be based on a topological ordering given by the proof structure, but is not important for our purposes.

When non-local h_m/f_m that depend on arbitrary parts of the proof are involved, the decoding and summing inference problems are NP-hard (they instantiate probabilistic inference in a fully connected graphical model). Sometimes, it is possible to achieve proof locality by adding more indices to the DP variables (for example, consider modifying the bigram HMM Viterbi algorithm for trigram HMMs). This increases the number of variables and hence computational cost. In general, it leads to exponential-time inference in the worst case.

There have been many algorithms proposed for approximately solving instances of these decoding and summing problems with non-local features. Some stem from work on graphical models, including loopy belief propagation (Sutton and McCallum, 2004; Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), sequential Monte Carlo methods such as particle filtering (Levy et al., 2008), and variational inference (Jordan et al., 1999; MacKay, 1997; Kurihara and Sato, 2006). Also relevant are stacked learning (Cohen and Carvalho, 2005), interpretable as approximation of non-local feature values (Martins et al., 2008), and M-estimation (Smith et al., 2007), which allows training without inference. Several other approaches used frequently in NLP are approximate methods for decoding only. These include beam search (Lowerre, 1976), cube pruning, which we discuss in §3, integer linear programming (Roth and Yih, 2004), in which arbitrary features can act as constraints on y, and approximate solutions like McDonald and Pereira (2006), in which an exact solution to a related decoding problem is found and then modified to fit the problem of interest.

3 Approximate Decoding

Cube pruning (Chiang, 2007; Huang and Chiang, 2007) is an approximate technique for decoding (Eq. 4); it is used widely in machine translation. Given proof locality, it is essentially an efficient implementation of the k-best proof semiring. Cube pruning goes farther in that it permits non-local features to weigh in on the proof probabilities, at the expense of making the k-best operation approximate. We describe the two approximations cube pruning makes, then propose cube decoding, which removes the second approximation. Cube decoding cannot be represented as a semiring; we propose a more general algebraic structure that accommodates it.

3.1 Approximations in Cube Pruning

Cube pruning is an approximate solution to the decoding problem (Eq. 4) in two ways.

Approximation 1: k < ∞. Cube pruning uses a finite k for the k-best lists stored in each value. If k = ∞, the algorithm performs exact decoding with non-local features (at obviously formidable expense in combinatorial problems).

Approximation 2: lazy computation. Cube pruning exploits the fact that k < ∞ to use lazy computation. When combining the k-best proof lists of d theorems' values, cube pruning does not enumerate all k^d proofs, apply non-local features to all of them, and then return the top k. Instead, cube pruning uses a more efficient but approximate solution that only calculates the non-local factors on O(k) proofs to obtain the approximate top k. This trick is only approximate if non-local features are involved.

Approximation 2 makes it impossible to formulate cube pruning using separate aggregation and combination operations, as the use of lazy computation causes these two operations to effectively be performed simultaneously. To more directly relate our summing algorithm (§4) to cube pruning, we suggest a modified version of cube pruning that does not use lazy computation. We call this algorithm cube decoding. This algorithm can be written down in terms of separate aggregation and combination operations, though we will show it is not a semiring.
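The lazy computation in Approximation 2 is usually implemented as a best-first search over the d-dimensional "cube" of index tuples, after Huang and Chiang (2005). The sketch below (ours, simplified; it ignores non-local feature rescoring, which is exactly what makes the trick approximate in cube pruning) pops at most k cells and pushes O(k·d) candidates instead of scoring all k^d combinations.

```python
import heapq

def lazy_top_k(lists, k):
    """Best-first enumeration over the d-dimensional cube of one-item-per-list
    choices, popping at most k cells rather than scoring all k^d combinations."""
    d = len(lists)

    def score(idx):                           # product of the chosen items' scores
        p = 1.0
        for lst, i in zip(lists, idx):
            p *= lst[i][0]
        return p

    start = (0,) * d
    frontier = [(-score(start), start)]       # max-heap via negated scores
    seen = {start}
    out = []
    while frontier and len(out) < k:
        neg, idx = heapq.heappop(frontier)
        out.append((-neg, "".join(lists[m][i][1] for m, i in enumerate(idx))))
        for m in range(d):                    # push the d successors of this cell
            nxt = idx[:m] + (idx[m] + 1,) + idx[m + 1:]
            if nxt[m] < len(lists[m]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return out

# Two sorted proof lists; the top 2 of the 4 combinations are found lazily.
u = [(0.5, "a"), (0.2, "b")]
v = [(0.6, "c"), (0.3, "d")]
print(lazy_top_k([u, v], 2))                  # [(0.3, 'ac'), (0.15, 'ad')]
```

With sorted lists and purely local (product) scores this enumeration is exact; once non-local features rescore candidate proofs, the monotonicity it relies on no longer holds, which is the sense in which the trick is approximate.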

3.2 Cube Decoding

We formally describe cube decoding, show that it does not instantiate a semiring, then describe a more general algebraic structure that it does instantiate.

Consider the set G of non-local feature functions that map X × L → R≥0.[4] Our definitions in §2.2 for the k-best proof semiring can be expanded to accommodate these functions within the semiring value. Recall that values in the k-best proof semiring fall in A_k = (R≥0 × L)^{≤k}. For cube decoding, we use a different set A_cd defined as

  A_cd = A_k × G × {0, 1},    where A_k = (R≥0 × L)^{≤k}

where the binary variable indicates whether the value contains a k-best list (0, which we call an "ordinary" value) or a non-local feature function in G (1, which we call a "function" value). We denote a value u ∈ A_cd by

  u = ⟨ū, g_u, u_s⟩,    where ū = ⟨⟨u_1, U_1⟩, ⟨u_2, U_2⟩, ..., ⟨u_k, U_k⟩⟩

and where each u_i ∈ R≥0 is a probability and each U_i ∈ L is a proof string.

We use ⊕_k and ⊗_k to denote the k-best proof semiring's operators, defined in §2.2. We let g_0 be such that g_0(ℓ) is undefined for all ℓ ∈ L. For two values u = ⟨ū, g_u, u_s⟩, v = ⟨v̄, g_v, v_s⟩ ∈ A_cd, cube decoding's aggregation operator is:

  u ⊕_cd v = ⟨ū ⊕_k v̄, g_0, 0⟩    if ¬u_s ∧ ¬v_s        (8)

Under standard models, only ordinary values will be operands of ⊕_cd, so ⊕_cd is undefined when u_s ∨ v_s. We define the combination operator ⊗_cd:

  u ⊗_cd v =                                                (9)
    ⟨ū ⊗_k v̄, g_0, 0⟩                   if ¬u_s ∧ ¬v_s,
    ⟨max-k(exec(g_v, ū)), g_0, 0⟩        if ¬u_s ∧ v_s,
    ⟨max-k(exec(g_u, v̄)), g_0, 0⟩        if u_s ∧ ¬v_s,
    ⟨⟨⟩, λz.(g_u(z) × g_v(z)), 1⟩        if u_s ∧ v_s.

where exec(g, ū) executes the function g upon each proof in the proof list ū, modifies the scores in place by multiplying in the function result, and returns the modified proof list:

  g′ = λℓ.g(x, ℓ)
  exec(g, ū) = ⟨⟨u_1 g′(U_1), U_1⟩, ⟨u_2 g′(U_2), U_2⟩, ..., ⟨u_k g′(U_k), U_k⟩⟩

Here, max-k is simply used to re-sort the k-best proof list following function evaluation.

[4] In our setting, g_m(x, ℓ) will most commonly be defined as λ_m^{f_m(x,ℓ)} in the notation of §2.3. But functions in G could also be used to implement, e.g., hard constraints or other non-local score factors.

The semiring properties fail to hold when introducing non-local features in this way. In particular, ⊗_cd is not associative when 1 < k < ∞. For example, consider the probabilistic CKY algorithm as above, but using the cube decoding semiring with the non-local feature functions collectively known as "NGramTree" features (Huang, 2008) that score the string of terminals and nonterminals along the path from word j to word j + 1 when two constituents C_{Y,i,j} and C_{Z,j,k} are combined. The semiring value associated with such a feature is u = ⟨⟨⟩, NGramTree_π, 1⟩ (for a specific path π), and we rewrite Eq. 1 as follows (where ranges for summation are omitted for space):

  C_{X,i,k} = ⊕_cd  p_{X→YZ} ⊗_cd C_{Y,i,j} ⊗_cd C_{Z,j,k} ⊗_cd u

The combination operator is not associative since the following will give different answers:[5]

  (p_{X→YZ} ⊗_cd C_{Y,i,j}) ⊗_cd (C_{Z,j,k} ⊗_cd u)        (10)
  ((p_{X→YZ} ⊗_cd C_{Y,i,j}) ⊗_cd C_{Z,j,k}) ⊗_cd u        (11)

In Eq. 10, the non-local feature function is executed on the k-best proof list for Z, while in Eq. 11, NGramTree_π is called on the k-best proof list for the X constructed from Y and Z. Furthermore, neither of the above gives the desired result, since we actually wish to expand the full set of k^2 proofs of X and then apply NGramTree_π to each of them (or a higher-dimensional "cube" if more operands are present) before selecting the k-best. The binary operations above retain only the top k proofs of X in Eq. 11 before applying NGramTree_π to each of them. We actually would like to redefine combination so that it can operate on arbitrarily-sized sets of values.

[5] Distributivity of combination over aggregation fails for related reasons. We omit a full discussion due to space.

We can understand cube decoding through an algebraic structure with two operations ⊕ and ⊗, where ⊗ need not be associative and need not distribute over ⊕, and furthermore where ⊕ and ⊗ are defined on arbitrarily many operands. We will refer here to such a structure as a generalized semiring.[6] To define ⊗_cd on a set of operands with N′ ordinary operands and N function operands, we first compute the full O(k^{N′}) cross-product of the ordinary operands, then apply each of the N functions from the remaining operands in turn upon the full N′-dimensional "cube," finally calling max-k on the result.

[6] Algebraic structures are typically defined with binary operators only, so we were unable to find a suitable term for this structure in the literature.
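A hedged sketch of cube decoding's n-ary combination follows (ours; the tagging of values as "ordinary" or "function", the toy non-local feature, and k = 2 are illustrative choices). The full cube of ordinary operands is built first, each non-local function is then applied to whole proofs, and only afterwards is max-k taken, matching the intended non-binary behavior described above.

```python
import heapq

K = 2  # illustrative k-best list size

def max_k(proofs):
    return heapq.nlargest(K, proofs, key=lambda pair: pair[0])

def exec_(g, proofs):
    """exec(g, u-bar): multiply each proof's score by the non-local function's value."""
    return [(p * g(U), U) for p, U in proofs]

def combine_cd(operands):
    """n-ary combination: build the full cube of the ordinary operands first, apply
    each function operand to whole proofs, and only then keep the k best."""
    ordinary = [x for kind, x in operands if kind == "ordinary"]
    functions = [g for kind, g in operands if kind == "function"]
    cube = [(1.0, "")]
    for proofs in ordinary:                   # k^{N'} cells of the cube
        cube = [(p * q, U + V) for p, U in cube for q, V in proofs]
    for g in functions:                       # non-local features see entire proofs
        cube = exec_(g, cube)
    return ("ordinary", max_k(cube))

# Toy usage: a non-local feature that strongly rewards proofs containing "ad".
bonus = lambda U: 4.0 if "ad" in U else 1.0
u = ("ordinary", [(0.5, "a"), (0.2, "b")])
v = ("ordinary", [(0.6, "c"), (0.3, "d")])
print(combine_cd([u, v, ("function", bonus)]))
# ('ordinary', [(0.6, 'ad'), (0.3, 'ac')])
```

Taking max-k only at the end is what distinguishes this from applying the binary operator of Eq. 9 left to right, which would prune to k proofs of X before NGramTree-style functions ever see them.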

4 Cube Summing

We present an approximate solution to the summing problem when non-local features are involved, which we call cube summing. It is an extension of cube decoding, and so we will describe it as a generalized semiring. The key addition is to maintain in each value, in addition to the k-best list of proofs from A_k, a scalar corresponding to the residual probability (possibly unnormalized) of all proofs not among the k-best.[7] The k-best proofs are still used for dynamically computing non-local features but the aggregation and combination operations are redefined to update the residual as appropriate.

[7] Blunsom and Osborne (2008) described a related approach to approximate summing using the chart computed during cube pruning, but did not keep track of the residual terms as we do here.

We define the set A_cs for cube summing as

  A_cs = R≥0 × (R≥0 × L)^{≤k} × G × {0, 1}

A value u ∈ A_cs is defined as

  u = ⟨u_0, ū, g_u, u_s⟩,    where ū = ⟨⟨u_1, U_1⟩, ⟨u_2, U_2⟩, ..., ⟨u_k, U_k⟩⟩

For a proof list ū, we use ‖ū‖ to denote the sum of all proof scores, ∑_{i:⟨u_i,U_i⟩∈ū} u_i.

The aggregation operator over operands {u_i}_{i=1}^N, all such that u_{is} = 0, is defined by:[8]

  ⊕_{i=1}^N u_i = ⟨ ∑_{i=1}^N u_{i0} + ‖Res(⋃_{i=1}^N ū_i)‖,  max-k(⋃_{i=1}^N ū_i),  g_0,  0 ⟩        (12)

where Res returns the "residual" set of scored proofs not in the k-best among its arguments, possibly the empty set.

[8] We assume that operands u_i to ⊕_cs will never be such that u_{is} = 1 (non-local feature functions). This is reasonable in the widely used log-linear model setting we have adopted, where weights λ_m are factors in a proof's product score.

For a set of N + N′ operands {v_i}_{i=1}^N ∪ {w_j}_{j=1}^{N′} such that v_{is} = 1 (non-local feature functions) and w_{js} = 0 (ordinary values), the combination operator ⊗ is shown in Eq. 13 (Fig. 1). Note that the case where N′ = 0 is not needed in this application; an ordinary value will always be included in combination.

In the special case of two ordinary operands (where u_s = v_s = 0), Eq. 13 reduces to

  u ⊗ v = ⟨ u_0 v_0 + u_0‖v̄‖ + v_0‖ū‖ + ‖Res(ū ⋆ v̄)‖,  max-k(ū ⋆ v̄),  g_0,  0 ⟩        (14)

We define 0 as ⟨0, ⟨⟩, g_0, 0⟩; an appropriate definition for the combination identity element is less straightforward and of little practical importance; we leave it to future work.

If we use this generalized semiring to solve a DP and achieve goal value of u, the approximate sum of all proof probabilities is given by u_0 + ‖ū‖. If all features are local, the approach is exact. With non-local features, the k-best list may not contain the k-best proofs, and the residual score, while including all possible proofs, may not include all of the non-local features in all of those proofs' probabilities.

  (⊗_{i=1}^N v_i) ⊗ (⊗_{j=1}^{N′} w_j) =
      ⟨ ∑_{B∈P(S)} (∏_{b∈B} w_{b0}) (∏_{c∈S\B} ‖w̄_c‖)
          + ‖Res(exec(g_{v_1}, ... exec(g_{v_N}, w̄_1 ⋆ ··· ⋆ w̄_{N′}) ...))‖,
        max-k(exec(g_{v_1}, ... exec(g_{v_N}, w̄_1 ⋆ ··· ⋆ w̄_{N′}) ...)),  g_0,  0 ⟩        (13)

Figure 1: Combination operation for cube summing, where S = {1, 2, ..., N′} and P(S) is the power set of S excluding ∅.
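To illustrate Eqs. 12 and 14 in the purely local (k-best+residual) case, here is a minimal Python sketch (ours; k, the proof lists, and all names are invented). Each value is a pair of a residual and a k-best proof list, and the invariant that residual plus ‖ū‖ equals the exact inside sum is checked on a toy example.

```python
import heapq

K = 2  # illustrative k-best list size

def max_k(proofs):
    return heapq.nlargest(K, proofs, key=lambda pair: pair[0])

def total(proofs):
    """The sum of the scores in a proof list (the norm of u-bar)."""
    return sum(p for p, _ in proofs)

def aggregate(values):
    """Eq. 12 restricted to ordinary operands: add the residuals, keep the k best
    pooled proofs, and fold everything else into the residual."""
    pooled = [p for _, proofs in values for p in proofs]
    kept = max_k(pooled)
    residual = sum(r for r, _ in values) + total(pooled) - total(kept)
    return (residual, kept)

def combine(u, v):
    """Eq. 14 (two ordinary operands): cross the proof lists, keep the k best,
    and push the rest, plus all residual cross-terms, into the residual."""
    u0, ub = u
    v0, vb = v
    crossed = [(pu * pv, U + V) for pu, U in ub for pv, V in vb]
    kept = max_k(crossed)
    residual = u0 * v0 + u0 * total(vb) + v0 * total(ub) + (total(crossed) - total(kept))
    return (residual, kept)

# Toy check: with local features only, residual + proof-list total is the exact inside sum.
u = (0.0, [(0.5, "a"), (0.2, "b")])
v = (0.1, [(0.6, "c"), (0.3, "d")])
w = combine(u, v)
print(w[0] + total(w[1]))          # (0.0 + 0.7) * (0.1 + 0.9) = 0.7, up to rounding
a = aggregate([u, v])
print(a[0] + total(a[1]))          # (0.0 + 0.7) + (0.1 + 0.9) = 1.7, up to rounding
```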

5 Implementation

We have so far viewed dynamic programming algorithms in terms of their declarative specifications as semiring-weighted logic programs. Solvers have been proposed by Goodman (1999), by Klein and Manning (2001) using a hypergraph representation, and by Eisner et al. (2005). Because Goodman's and Eisner et al.'s algorithms assume semirings, adapting them for cube summing is non-trivial.[9]

[9] The bottom-up agenda algorithm in Eisner et al. (2005) might possibly be generalized so that associativity, distributivity, and binary operators are not required (John Blatz, p.c.).

To generalize Goodman's algorithm, we suggest using the directed-graph data structure known variously as an arithmetic circuit or computation graph.[10] Arithmetic circuits have recently drawn interest in the graphical model community as a tool for performing probabilistic inference (Darwiche, 2003). In the directed graph, there are vertices corresponding to axioms (these are sinks in the graph), ⊕ vertices corresponding to theorems, and ⊗ vertices corresponding to summands in the dynamic programming equations. Directed edges point from each node to the nodes it depends on; ⊕ vertices depend on ⊗ vertices, which depend on ⊕ and axiom vertices.

[10] This data structure is not specific to any particular set of operations. We have also used it successfully with the inside semiring.

Arithmetic circuits are amenable to automatic differentiation in the reverse mode (Griewank and Corliss, 1991), commonly used in back-propagation algorithms. Importantly, this permits us to calculate the exact gradient of the approximate summation with respect to axiom values, following Eisner et al. (2005). This is desirable when carrying out the optimization problems involved in parameter estimation. Another differentiation technique, implemented within the semiring, is given by Eisner (2002).

Cube pruning is based on the k-best algorithms of Huang and Chiang (2005), which save time over generic semiring implementations through lazy computation in both the aggregation and combination operations. Their techniques are not as clearly applicable here, because our goal is to sum over all proofs instead of only finding a small subset of them. If computing non-local features is a computational bottleneck, they can be computed only for the O(k) proofs considered when choosing the best k as in cube pruning. Then, the computational requirements for approximate summing are nearly equivalent to cube pruning, but the approximation is less accurate.

Figure 2: Semirings generalized by k-best+residual. [The figure relates the k-best+residual semiring to the k-best proof semiring (Goodman, 1999), the inside semiring (Baum et al., 1970), the Viterbi proof semiring (Goodman, 1999), the all-proof semiring (Goodman, 1999), and the Viterbi semiring (Viterbi, 1967), via the special cases k = 1, k = ∞, ignoring the residual, and ignoring the proof.]
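A minimal sketch (ours, for the plain inside semiring only, since that keeps ⊕ and ⊗ as ordinary + and ×; class and field names are invented) of the arithmetic-circuit idea: sum and product vertices over axiom leaves, with reverse-mode differentiation giving the gradient of goal with respect to each axiom value.

```python
class Node:
    """A vertex in an arithmetic circuit: an axiom (leaf), a sum node (+), or a product node (*)."""
    def __init__(self, op, children=(), value=0.0):
        self.op, self.children, self.value, self.grad = op, tuple(children), value, 0.0

    def forward(self):
        if self.op == "+":
            self.value = sum(c.forward() for c in self.children)
        elif self.op == "*":
            self.value = 1.0
            for c in self.children:
                self.value *= c.forward()
        return self.value                     # axiom leaves just return their value

def backward(goal):
    """Reverse-mode differentiation: accumulate d(goal)/d(node) into each .grad."""
    order, seen = [], set()
    def topo(n):                              # children before parents
        if id(n) not in seen:
            seen.add(id(n))
            for c in n.children:
                topo(c)
            order.append(n)
    topo(goal)
    goal.grad = 1.0
    for n in reversed(order):                 # parents before children
        for c in n.children:
            if n.op == "+":
                c.grad += n.grad
            elif n.op == "*":
                prod = 1.0
                for other in n.children:
                    if other is not c:
                        prod *= other.value
                c.grad += n.grad * prod

# Toy circuit for goal = p1*q + p2*q, with axiom values as inside probabilities.
p1, p2, q = Node("axiom", value=0.2), Node("axiom", value=0.3), Node("axiom", value=0.5)
goal = Node("+", [Node("*", [p1, q]), Node("*", [p2, q])])
print(goal.forward())   # 0.25
backward(goal)
print(q.grad)           # d(goal)/dq = p1 + p2 = 0.5
```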

6 Semirings Old and New

We now consider interesting special cases and variations of cube summing.

6.1 The k-best+residual Semiring

When restricted to local features, cube pruning and cube summing can be seen as proper semirings. Cube pruning reduces to an implementation of the k-best semiring (Goodman, 1998), and cube summing reduces to a novel semiring we call the k-best+residual semiring. Binary instantiations of ⊗ and ⊕ can be iteratively reapplied to give the equivalent formulations in Eqs. 12 and 13. We define 0 as ⟨0, ⟨⟩⟩ and 1 as ⟨1, ⟨⟨1, ε⟩⟩⟩. The ⊕ operator is easily shown to be commutative. That ⊕ is associative follows from associativity of max-k, shown by Goodman (1998). Showing that ⊗ is associative and that ⊗ distributes over ⊕ are less straightforward; proof sketches are provided in Appendix A. The k-best+residual semiring generalizes many semirings previously introduced in the literature; see Fig. 2.

6.2 Variations

Once we relax requirements about associativity and distributivity and permit aggregation and combination operators to operate on sets, several extensions to cube summing become possible. First, when computing approximate summations with non-local features, we may not always be interested in the best proofs for each item. Since the purpose of summing is often to calculate statistics under a model distribution, we may wish instead to sample from that distribution. We can replace the max-k function with a sample-k function that samples k proofs from the scored list in its argument, possibly using the scores or possibly uniformly at random. This breaks associativity of ⊕. We conjecture that this approach can be used to simulate particle filtering for structured models. Another variation is to vary k for different theorems. This might be used to simulate beam search, or to reserve computation for theorems closer to goal, which have more proofs.

7 Conclusion

This paper has drawn a connection between cube pruning, a popular technique for approximately solving decoding problems, and the semiring-weighted logic programming view of dynamic programming. We have introduced a generalization called cube summing, to be used for solving summing problems, and have argued that cube pruning and cube summing are both semirings that can be used generically, as long as the underlying probability models only include local features. With non-local features, cube pruning and cube summing can be used for approximate decoding and summing, respectively, and although they no longer correspond to semirings, generic algorithms can still be used.

Acknowledgments

We thank three anonymous EACL reviewers, John Blatz, Pedro Domingos, Jason Eisner, Joshua Goodman, and members of the ARK group for helpful comments and feedback that improved this paper. This research was supported by NSF IIS-0836431 and an IBM faculty award.

A k-best+residual is a Semiring

In showing that k-best+residual is a semiring, we will restrict our attention to the computation of the residuals. The computation over proof lists is identical to that performed in the k-best proof semiring, which was shown to be a semiring by Goodman (1998). We sketch the proofs that ⊗ is associative and that ⊗ distributes over ⊕; associativity of ⊕ is straightforward.

For a proof list ā, ‖ā‖ denotes the sum of proof scores, ∑_{i:⟨a_i,A_i⟩∈ā} a_i. Note that:

  ‖Res(ā)‖ + ‖max-k(ā)‖ = ‖ā‖        (15)
  ‖ā ⋆ b̄‖ = ‖ā‖ ‖b̄‖        (16)

Associativity. Given three semiring values u, v, and w, we need to show that (u ⊗ v) ⊗ w = u ⊗ (v ⊗ w). After expanding the expressions for the residuals using Eq. 14, there are 10 terms on each side, five of which are identical and cancel out immediately. Three more cancel using Eq. 15, leaving:

  LHS = ‖Res(ū ⋆ v̄)‖ ‖w̄‖ + ‖Res(max-k(ū ⋆ v̄) ⋆ w̄)‖
  RHS = ‖ū‖ ‖Res(v̄ ⋆ w̄)‖ + ‖Res(ū ⋆ max-k(v̄ ⋆ w̄))‖

If LHS = RHS, associativity holds. Using Eq. 15 again, we can rewrite the second term in LHS to obtain

  LHS = ‖Res(ū ⋆ v̄)‖ ‖w̄‖ + ‖max-k(ū ⋆ v̄) ⋆ w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ⋆ w̄)‖

Using Eq. 16 and pulling out the common term ‖w̄‖, we have

  LHS = (‖Res(ū ⋆ v̄)‖ + ‖max-k(ū ⋆ v̄)‖) ‖w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ⋆ w̄)‖
      = ‖(ū ⋆ v̄) ⋆ w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ⋆ w̄)‖
      = ‖(ū ⋆ v̄) ⋆ w̄‖ − ‖max-k((ū ⋆ v̄) ⋆ w̄)‖

The resulting expression is intuitive: the residual of (u ⊗ v) ⊗ w is the difference between the sum of all proof scores and the sum of the k-best. RHS can be transformed into this same expression with a similar line of reasoning (and using associativity of ⋆). Therefore, LHS = RHS and ⊗ is associative.

Distributivity. To prove that ⊗ distributes over ⊕, we must show left-distributivity, i.e., that u ⊗ (v ⊕ w) = (u ⊗ v) ⊕ (u ⊗ w), and right-distributivity. We show left-distributivity here. As above, we expand the expressions, finding 8 terms on the LHS and 9 on the RHS. Six on each side cancel, leaving:

  LHS = ‖Res(v̄ ∪ w̄)‖ ‖ū‖ + ‖Res(ū ⋆ max-k(v̄ ∪ w̄))‖
  RHS = ‖Res(ū ⋆ v̄)‖ + ‖Res(ū ⋆ w̄)‖ + ‖Res(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖

We can rewrite LHS as:

  LHS = ‖Res(v̄ ∪ w̄)‖ ‖ū‖ + ‖ū ⋆ max-k(v̄ ∪ w̄)‖ − ‖max-k(ū ⋆ max-k(v̄ ∪ w̄))‖
      = ‖ū‖ (‖Res(v̄ ∪ w̄)‖ + ‖max-k(v̄ ∪ w̄)‖) − ‖max-k(ū ⋆ max-k(v̄ ∪ w̄))‖
      = ‖ū‖ ‖v̄ ∪ w̄‖ − ‖max-k(ū ⋆ (v̄ ∪ w̄))‖
      = ‖ū‖ ‖v̄ ∪ w̄‖ − ‖max-k((ū ⋆ v̄) ∪ (ū ⋆ w̄))‖

where the last line follows because ⋆ distributes over ∪ (Goodman, 1998). We now work with the RHS:

  RHS = ‖Res(ū ⋆ v̄)‖ + ‖Res(ū ⋆ w̄)‖ + ‖Res(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖
      = ‖Res(ū ⋆ v̄)‖ + ‖Res(ū ⋆ w̄)‖ + ‖max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄)‖
          − ‖max-k(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖

Since max-k(ū ⋆ v̄) and max-k(ū ⋆ w̄) are disjoint (we assume no duplicates; i.e., two different theorems cannot have exactly the same proof), the third term becomes ‖max-k(ū ⋆ v̄)‖ + ‖max-k(ū ⋆ w̄)‖ and we have

  RHS = ‖ū ⋆ v̄‖ + ‖ū ⋆ w̄‖ − ‖max-k(max-k(ū ⋆ v̄) ∪ max-k(ū ⋆ w̄))‖
      = ‖ū‖ ‖v̄‖ + ‖ū‖ ‖w̄‖ − ‖max-k((ū ⋆ v̄) ∪ (ū ⋆ w̄))‖
      = ‖ū‖ ‖v̄ ∪ w̄‖ − ‖max-k((ū ⋆ v̄) ∪ (ū ⋆ w̄))‖ .

References

L. E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1).

P. Blunsom and M. Osborne. 2008. Probabilistic inference for machine translation. In Proc. of EMNLP.

P. Blunsom, T. Cohn, and M. Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proc. of ACL.

D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

W. W. Cohen and V. Carvalho. 2005. Stacked sequential learning. In Proc. of IJCAI.

A. Darwiche. 2003. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3).

J. Eisner, E. Goldlust, and N. A. Smith. 2005. Compiling Comp Ling: Practical weighted dynamic programming and the Dyna language. In Proc. of HLT-EMNLP.

J. Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proc. of ACL.

J. R. Finkel, T. Grenager, and C. D. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of ACL.

J. Goodman. 1998. Parsing inside-out. Ph.D. thesis, Harvard University.

J. Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.

A. Griewank and G. Corliss. 1991. Automatic Differentiation of Algorithms. SIAM.

L. Huang and D. Chiang. 2005. Better k-best parsing. In Proc. of IWPT.

L. Huang and D. Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proc. of ACL.

L. Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. of ACL.

M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2).

D. Klein and C. Manning. 2001. Parsing and hypergraphs. In Proc. of IWPT.

T. Koo and M. Collins. 2005. Hidden-variable models for discriminative reranking. In Proc. of EMNLP.

K. Kurihara and T. Sato. 2006. Variational Bayesian grammar induction for natural language. In Proc. of ICGI.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.

R. Levy, F. Reali, and T. Griffiths. 2008. Modeling the effects of memory on human online sentence processing with particle filters. In Advances in NIPS.

B. T. Lowerre. 1976. The Harpy System. Ph.D. thesis, Carnegie Mellon University.

D. J. C. MacKay. 1997. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, Cambridge.

A. F. T. Martins, D. Das, N. A. Smith, and E. P. Xing. 2008. Stacking dependency parsers. In Proc. of EMNLP.

R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. of EACL.

F. C. N. Pereira and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proc. of ACL, pages 128–135.

D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Proc. of CoNLL.

S. Shieber, Y. Schabes, and F. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1-2):3–36.

D. A. Smith and J. Eisner. 2008. Dependency parsing by belief propagation. In Proc. of EMNLP.

N. A. Smith, D. L. Vail, and J. D. Lafferty. 2007. Computationally efficient M-estimation of log-linear structure models. In Proc. of ACL.

C. Sutton and A. McCallum. 2004. Collective segmentation and labeling of distant entities in information extraction. In Proc. of ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields.

A. J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Processing, 13(2).

M. Wang, N. A. Smith, and T. Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proc. of EMNLP-CoNLL.