
Dynamic programming for parsing and estimation of stochastic unification-based grammars∗

Stuart Geman
Division of Applied Mathematics
Brown University
[email protected]

Mark Johnson
Cognitive and Linguistic Sciences
Brown University
[email protected]

Abstract

Stochastic unification-based grammars (SUBGs) define exponential distributions over the parses generated by a unification-based grammar (UBG). Existing algorithms for parsing and estimation require the enumeration of all of the parses of a string in order to determine the most likely one, or in order to calculate the statistics needed to estimate a grammar from a training corpus. This paper describes a graph-based dynamic programming algorithm for calculating these statistics from the packed UBG parse representations of Maxwell and Kaplan (1995) which does not require enumerating all parses. Like many graphical algorithms, the dynamic programming algorithm's complexity is worst-case exponential, but is often polynomial. The key observation is that by using Maxwell and Kaplan packed representations, the required statistics can be rewritten as either the max or the sum of a product of functions. This is exactly the kind of problem which can be solved by dynamic programming over graphical models.

1 Introduction

Stochastic Unification-Based Grammars (SUBGs) use log-linear models (also known as exponential or MaxEnt models and Markov Random Fields) to define probability distributions over the parses of a unification grammar. These grammars can incorporate virtually all kinds of linguistically important constraints (including non-local and non-context-free constraints), and are equipped with a statistically sound framework for estimation and learning.

Abney (1997) pointed out that the non-context-free dependencies of a unification grammar require stochastic models more general than Probabilistic Context-Free Grammars (PCFGs) and Markov Branching Processes, and proposed the use of log-linear models for defining probability distributions over the parses of a unification grammar. Unfortunately, the maximum likelihood estimator Abney proposed for SUBGs seems computationally intractable since it requires statistics that depend on the set of all parses of all strings generated by the grammar. This set is infinite (so exhaustive enumeration is impossible) and presumably has a very complex structure (so sampling estimates might take an extremely long time to converge).

Johnson et al. (1999) observed that parsing and related tasks only require conditional distributions over parses given strings, and that such conditional distributions are considerably easier to estimate than joint distributions of strings and their parses. The conditional maximum likelihood estimator proposed by Johnson et al. requires statistics that depend on the set of all parses of the strings in the training corpus. For most linguistically realistic grammars this set is finite, and for moderate sized grammars and training corpora this estimation procedure is quite feasible.

However, our recent experiments involve training from the Wall Street Journal Penn Tree-bank, and repeatedly enumerating the parses of its 50,000 sentences is quite time-consuming. Matters are only made worse because we have moved some of the constraints in the grammar from the unification component to the stochastic component. This broadens the coverage of the grammar, but at the expense of massively expanding the number of possible parses of each sentence.

In the mid-1990s unification-based parsers were developed that do not enumerate all parses of a string but instead manipulate and return a "packed" representation of the set of parses. This paper describes how to find the most probable parse and the statistics required for estimating a SUBG from the packed parse set representations proposed by Maxwell III and Kaplan (1995). This makes it possible to avoid explicitly enumerating the parses of the strings in the training corpus.

The methods proposed here are analogues of the well-known dynamic programming algorithms for Probabilistic Context-Free Grammars (PCFGs); specifically the Viterbi algorithm for finding the most probable parse of a string, and the Inside-Outside algorithm for estimating a PCFG from unparsed training data.1 In fact, because Maxwell and Kaplan packed representations are just Truth Maintenance System (TMS) representations (Forbus and de Kleer, 1993), the statistical techniques described here should extend to non-linguistic applications of TMSs as well.

Dynamic programming techniques have been applied to log-linear models before. Lafferty et al. (2001) mention that dynamic programming can be used to compute the statistics required for conditional estimation of log-linear models based on context-free grammars where the properties can include arbitrary functions of the input string. Miyao and Tsujii (2002) (which appeared after this paper was accepted) is the closest related work we know of. They describe a technique for calculating the statistics required to estimate a log-linear parsing model with non-local properties from packed feature forests.

The rest of this paper is structured as follows. The next section describes unification grammars and the Maxwell and Kaplan packed representation. The following section reviews stochastic unification grammars (Abney, 1997) and the statistical quantities required for efficiently estimating such grammars from parsed training data (Johnson et al., 1999). The final substantive section of this paper shows how these quantities can be defined directly in terms of the Maxwell and Kaplan packed representations.

The notation used in this paper is as follows. Variables are written in upper case italic, e.g., X, Y, etc., the sets they range over are written in script, e.g., 𝒳, 𝒴, etc., while specific values are written in lower case italic, e.g., x, y, etc. In the case of vector-valued entities, subscripts indicate particular components.

∗ We would like to thank Eugene Charniak, Miyao Yusuke, Mark Steedman as well as Stefan Riezler and the team at PARC; naturally all errors remain our own. This research was supported by NSF awards DMS 0074276 and ITR IIS 0085940.

1 However, because we use conditional estimation, also known as discriminative training, we require at least some discriminating information about the correct parse of a string in order to estimate a stochastic unification grammar.
2 Maxwell and Kaplan packed representations

This section characterises the properties of unification grammars and the Maxwell and Kaplan packed parse representations that will be important for what follows. This characterisation omits many details about unification grammars and the algorithm by which the packed representations are actually constructed; see Maxwell III and Kaplan (1995) for details.

A parse generated by a unification grammar is a finite subset of a set F of features. Features are parse fragments, e.g., chart edges or arcs from attribute-value structures, out of which the packed representations are constructed. For this paper it does not matter exactly what features are, but they are intended to be the atomic entities manipulated by a dynamic programming parsing algorithm. A grammar defines a set Ω of well-formed or grammatical parses. Each parse ω ∈ Ω is associated with a string of words Y(ω) called its yield. Note that except for trivial grammars F and Ω are infinite.

If y is a string, then let Ω(y) = {ω ∈ Ω | Y(ω) = y} and F(y) = ∪_{ω∈Ω(y)} {f ∈ ω}. That is, Ω(y) is the set of parses of a string y and F(y) is the set of features appearing in the parses of y. In the grammars of interest here Ω(y) and hence also F(y) are finite.

Maxwell and Kaplan's packed representations often provide a more compact representation of the set of parses of a sentence than would be obtained by merely listing each parse separately. The intuition behind these packed representations is that for most strings y, many of the features in F(y) occur in many of the parses Ω(y). This is often the case in natural language, since the same substructure can appear as a component of many different parses.

Packed feature representations are defined in terms of conditions on the values assigned to a vector of variables X. These variables have no direct linguistic interpretation; rather, each different assignment of values to these variables identifies a set of features which constitutes one of the parses in the packed representation. A condition a on X is a function from 𝒳 to {0, 1}. While for uniformity we write conditions as functions on the entire vector X, in practice Maxwell and Kaplan's approach produces conditions whose value depends only on a few of the variables in X, and the efficiency of the algorithms described here depends on this.
A packed representation of a finite set of parses is a quadruple R = (F′, X, N, α), where:

• F′ ⊇ F(y) is a finite set of features,

• X is a finite vector of variables, where each variable Xℓ ranges over the finite set 𝒳ℓ,

• N is a finite set of conditions on X called the no-goods,2 and

• α is a function that maps each feature f ∈ F′ to a condition αf on X.

A vector of values x satisfies the no-goods N iff N(x) = 1, where N(x) = ∏_{η∈N} η(x). Each x that satisfies the no-goods identifies a parse ω(x) = {f ∈ F′ | αf(x) = 1}, i.e., ω is the set of features whose conditions are satisfied by x. We require that each parse be identified by a unique value satisfying the no-goods. That is, we require that:

∀x, x′ ∈ 𝒳, if N(x) = N(x′) = 1 and ω(x) = ω(x′) then x = x′.   (1)

Finally, a packed representation R represents the set of parses Ω(R) that are identified by values that satisfy the no-goods, i.e., Ω(R) = {ω(x) | x ∈ 𝒳, N(x) = 1}.

Maxwell III and Kaplan (1995) describes a parsing algorithm for unification-based grammars that takes as input a string y and returns a packed representation R such that Ω(R) = Ω(y), i.e., R represents the set of parses of the string y. The SUBG parsing and estimation algorithms described in this paper use Maxwell and Kaplan's parsing algorithm as a subroutine.

2 The name "no-good" comes from the TMS literature, and was used by Maxwell and Kaplan. However, here the no-goods actually identify the good variable assignments.
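To make the definition above concrete, here is a small Python sketch of a packed representation and of the set Ω(R) it denotes. It is illustrative only: the class and field names are ours, conditions are represented as ordinary functions from an assignment tuple x to {0, 1}, and parses() recovers Ω(R) by brute-force enumeration of the assignments that satisfy the no-goods, which is exactly the enumeration the dynamic programming algorithms of section 4 are designed to avoid.

    from itertools import product

    class PackedRepresentation:
        """A packed parse representation R = (F', X, N, alpha) (illustrative sketch).

        features : list of feature identifiers (F')
        domains  : list of finite domains, one per variable X_l
        nogoods  : list of conditions (the set N), each a function from an
                   assignment tuple x to 0 or 1
        alpha    : dict mapping each feature f to its condition alpha_f
        """
        def __init__(self, features, domains, nogoods, alpha):
            self.features, self.domains = features, domains
            self.nogoods, self.alpha = nogoods, alpha

        def satisfies_nogoods(self, x):
            # N(x) = product over eta in N of eta(x)
            return all(eta(x) == 1 for eta in self.nogoods)

        def parse(self, x):
            # omega(x) = {f in F' | alpha_f(x) = 1}
            return frozenset(f for f in self.features if self.alpha[f](x) == 1)

        def parses(self):
            # Omega(R) = {omega(x) | x in X, N(x) = 1}: exponential in the
            # number of variables, so only usable for tiny examples.
            return {self.parse(x)
                    for x in product(*self.domains)
                    if self.satisfies_nogoods(x)}

    # Toy example: one boolean variable chooses between two attachments that
    # share the feature 'NP'; both assignments satisfy the (empty) no-goods.
    R = PackedRepresentation(
        features=['NP', 'attach_high', 'attach_low'],
        domains=[(0, 1)],
        nogoods=[],
        alpha={'NP': lambda x: 1,
               'attach_high': lambda x: 1 if x[0] == 1 else 0,
               'attach_low': lambda x: 0 if x[0] == 1 else 1})
    assert len(R.parses()) == 2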
3 Stochastic Unification-Based Grammars

This section reviews the probabilistic framework used in SUBGs, and describes the statistics that must be calculated in order to estimate the parameters of a SUBG from parsed training data. For a more detailed exposition and descriptions of regularization and other important details, see Johnson et al. (1999).

The probability distribution over parses is defined in terms of a finite vector g = (g1, . . . , gm) of properties. A property is a real-valued function of parses Ω. Johnson et al. (1999) placed no restrictions on what functions could be properties, permitting properties to encode arbitrary global information about a parse. However, the dynamic programming algorithms presented here require the information encoded in properties to be local with respect to the features F used in the packed parse representation. Specifically, we require that properties be defined on features rather than parses, i.e., each feature f ∈ F is associated with a finite vector of real values (g1(f), . . . , gm(f)) which define the property functions for parses as follows:

gk(ω) = Σ_{f∈ω} gk(f), for k = 1 . . . m.   (2)

That is, the property values of a parse are the sum of the property values of its features. In the usual case, some features will be associated with a single property (i.e., gk(f) is equal to 1 for a specific value of k and 0 otherwise), and other features will be associated with no properties at all (i.e., g(f) = 0).

This requires properties to be very local with respect to features, which means that we give up the ability to define properties arbitrarily. Note however that we can still encode essentially arbitrary linguistic information in properties by adding specialised features to the underlying unification grammar. For example, suppose we want a property that indicates whether the parse contains a reduced relative clause headed by a past participle (such "garden path" constructions are grammatical but often almost incomprehensible, and alternative parses not including such constructions would probably be preferred). Under the current definition of properties, we can introduce such a property by modifying the underlying unification grammar to produce a certain "diacritic" feature in a parse just in case the parse actually contains the appropriate reduced relative construction. Thus, while properties are required to be local relative to features, we can use the ability of the underlying unification grammar to encode essentially arbitrary non-local information in features to introduce properties that also encode non-local information.

A Stochastic Unification-Based Grammar is a triple (U, g, θ), where U is a unification grammar that defines a set Ω of parses as described above, g = (g1, . . . , gm) is a vector of property functions as just described, and θ = (θ1, . . . , θm) is a vector of non-negative real-valued parameters called property weights. The probability Pθ(ω) of a parse ω ∈ Ω is:

Pθ(ω) = Wθ(ω) / Zθ, where:
Wθ(ω) = ∏_{j=1}^{m} θj^{gj(ω)}, and
Zθ = Σ_{ω′∈Ω} Wθ(ω′).

Intuitively, if gj(ω) is the number of times that property j occurs in ω then θj is the 'weight' or 'cost' of each occurrence of property j and Zθ is a normalising constant that ensures that the probability of all parses sums to 1.

Now we discuss the calculation of several important quantities for SUBGs. In each case we show that the quantity can be expressed as the value that maximises a product of functions or else as the sum of a product of functions, each of which depends on a small subset of the variables X. These are the kinds of quantities for which dynamic programming graphical model algorithms have been developed.
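As a small illustration of (2) and of the weight Wθ(ω) just defined, the following sketch (our own helper names, continuing the Python examples above) attaches a property vector g(f) to each feature and computes gk(ω) by summation and Wθ(ω) by multiplication over a parse's features. In practice one would work with log weights to avoid underflow, but the multiplicative form below mirrors the formulas.

    import math

    def parse_properties(parse, g, m):
        """g_k(omega) = sum over f in omega of g_k(f), k = 1..m  (equation 2)."""
        totals = [0.0] * m
        for f in parse:
            for k, gk_f in enumerate(g[f]):
                totals[k] += gk_f
        return totals

    def parse_weight(parse, g, theta):
        """W_theta(omega) = product over j of theta_j ** g_j(omega)."""
        gs = parse_properties(parse, g, len(theta))
        return math.prod(t ** gj for t, gj in zip(theta, gs))

    # Example: two properties; 'attach_low' carries property 1, 'attach_high'
    # carries property 2, and 'NP' carries no properties at all.
    g = {'NP': (0, 0), 'attach_high': (0, 1), 'attach_low': (1, 0)}
    theta = (0.5, 2.0)
    print(parse_weight({'NP', 'attach_high'}, g, theta))   # 2.0
    print(parse_weight({'NP', 'attach_low'}, g, theta))    # 0.5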
3.1 The most probable parse

In parsing applications it is important to be able to extract the most probable (or MAP) parse ω̂(y) of string y with respect to a SUBG. This parse is:

ω̂(y) = argmax_{ω∈Ω(y)} Wθ(ω)

Given a packed representation (F′, X, N, α) for the parses Ω(y), let x̂(y) be the x that identifies ω̂(y). Since Wθ(ω̂(y)) > 0, it can be shown that:

x̂(y) = argmax_{x∈𝒳} N(x) ∏_{j=1}^{m} θj^{gj(ω(x))}
     = argmax_{x∈𝒳} N(x) ∏_{j=1}^{m} θj^{Σ_{f∈ω(x)} gj(f)}
     = argmax_{x∈𝒳} N(x) ∏_{j=1}^{m} θj^{Σ_{f∈F′} αf(x) gj(f)}
     = argmax_{x∈𝒳} N(x) ∏_{f∈F′} ∏_{j=1}^{m} θj^{αf(x) gj(f)}
     = argmax_{x∈𝒳} N(x) ∏_{f∈F′} ( ∏_{j=1}^{m} θj^{gj(f)} )^{αf(x)}
     = argmax_{x∈𝒳} ∏_{η∈N} η(x) ∏_{f∈F′} hθ,f(x)   (3)

where hθ,f(x) = ∏_{j=1}^{m} θj^{gj(f)} if αf(x) = 1 and hθ,f(x) = 1 if αf(x) = 0. Note that hθ,f depends on exactly the same variables in X as αf does. As (3) makes clear, finding x̂(y) involves maximising a product of functions where each function depends on a subset of the variables X. As explained below, this is exactly the kind of maximisation that can be solved using graphical model techniques.

3.2 Conditional likelihood

We now turn to the estimation of the property weights θ from a training corpus of parsed data D = (ω1, . . . , ωn). As explained in Johnson et al. (1999), one way to do this is to find the θ that maximises the conditional likelihood of the training corpus parses given their yields. (Johnson et al. actually maximise conditional likelihood regularized with a Gaussian prior, but for simplicity we ignore this here). If yi is the yield of the parse ωi, the conditional likelihood of the parses given their yields is:

LD(θ) = ∏_{i=1}^{n} Wθ(ωi) / Zθ(Ω(yi))

where Ω(y) is the set of parses with yield y and:

Zθ(S) = Σ_{ω∈S} Wθ(ω).

Then the maximum conditional likelihood estimate θ̂ of θ is θ̂ = argmax_θ LD(θ).

Now calculating Wθ(ωi) poses no computational problems, but since Ω(yi) (the set of parses for yi) can be large, calculating Zθ(Ω(yi)) by enumerating each ω ∈ Ω(yi) can be computationally expensive. However, there is an alternative method for calculating Zθ(Ω(yi)) that does not involve this enumeration. As noted above, for each yield yi, i = 1, . . . , n, Maxwell's parsing algorithm returns a packed feature structure Ri that represents the parses of yi, i.e., Ω(yi) = Ω(Ri). A derivation parallel to the one for (3) shows that for R = (F′, X, N, α):

Zθ(Ω(R)) = Σ_{x∈𝒳} ∏_{η∈N} η(x) ∏_{f∈F′} hθ,f(x)   (4)

(This derivation relies on the isomorphism between parses and variable assignments in (1)). It turns out that this type of sum can also be calculated using graphical model techniques.

3.3 Conditional Expectations

In general, iterative numerical procedures are required to find the property weights θ that maximise the conditional likelihood LD(θ). While there are a number of different techniques that can be used, all of the efficient techniques require the calculation of conditional expectations Eθ[gk|yi] for each property gk and each sentence yi in the training corpus, where:

Eθ[g|y] = Σ_{ω∈Ω(y)} g(ω) Pθ(ω|y)
        = ( Σ_{ω∈Ω(y)} g(ω) Wθ(ω) ) / Zθ(Ω(y))

For example, the Conjugate Gradient algorithm, which was used by Johnson et al., requires the calculation not just of LD(θ) but also its derivatives ∂LD(θ)/∂θk. It is straight-forward to show:

∂LD(θ)/∂θk = (LD(θ)/θk) Σ_{i=1}^{n} (gk(ωi) − Eθ[gk|yi]).

We have just described the calculation of LD(θ), so if we can calculate Eθ[gk|yi] then we can calculate the partial derivatives required by the Conjugate Gradient algorithm as well.

Again, let R = (F′, X, N, α) be a packed representation such that Ω(R) = Ω(yi). First, note that (2) implies that:

Eθ[gk|yi] = Σ_{f∈F′} gk(f) Pθ({ω : f ∈ ω}|yi).

Note that Pθ({ω : f ∈ ω}|yi) involves the sum of weights over all x ∈ 𝒳 subject to the conditions that N(x) = 1 and αf(x) = 1. Thus Pθ({ω : f ∈ ω}|yi) can also be expressed in a form that is easy to evaluate using graphical techniques:

Zθ(Ω(R)) Pθ({ω : f ∈ ω}|yi) = Σ_{x∈𝒳} αf(x) ∏_{η∈N} η(x) ∏_{f′∈F′} hθ,f′(x)   (5)
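To make (4) and (5) concrete, here is a direct, non-dynamic-programming rendering that simply sums over every assignment x of the packed representation sketched earlier (the function names and the h_theta_f helper are ours). It is exponential in the number of variables and is intended only as a reference implementation against which a dynamic programming version, or the gradient computation used by the Conjugate Gradient step above, can be checked.

    from itertools import product
    import math

    def h_theta_f(f, x, alpha, g, theta):
        """h_{theta,f}(x) = prod_j theta_j ** g_j(f) if alpha_f(x) = 1, else 1."""
        if alpha[f](x) != 1:
            return 1.0
        return math.prod(t ** gj for t, gj in zip(theta, g[f]))

    def partition_function(R, g, theta):
        """Z_theta(Omega(R)) by equation (4): sum over x of N(x) * prod_f h_{theta,f}(x)."""
        total = 0.0
        for x in product(*R.domains):
            if not R.satisfies_nogoods(x):
                continue
            total += math.prod(h_theta_f(f, x, R.alpha, g, theta) for f in R.features)
        return total

    def feature_expectations(R, g, theta):
        """E_theta[g_k | y] for each k, via equations (2) and (5):
        E[g_k|y] = (1/Z) * sum over valid x of W_theta(omega(x)) * g_k(omega(x)).
        """
        Z = partition_function(R, g, theta)
        m = len(theta)
        expectations = [0.0] * m
        for x in product(*R.domains):
            if not R.satisfies_nogoods(x):
                continue
            w = math.prod(h_theta_f(f, x, R.alpha, g, theta) for f in R.features)
            for f in R.features:
                if R.alpha[f](x) == 1:
                    for k in range(m):
                        expectations[k] += g[f][k] * w
        return [e / Z for e in expectations]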
4 Graphical model calculations

In this section we briefly review graphical model algorithms for maximising and summing products of functions of the kind presented above. It turns out that the algorithm for maximisation is a generalisation of the Viterbi algorithm for HMMs, and the algorithm for computing the summation in (5) is a generalisation of the forward-backward algorithm for HMMs (Smyth et al., 1997). Viewed abstractly, these algorithms simplify these expressions by moving common factors over the max or sum operators respectively. These techniques are now relatively standard; the most well-known approach involves junction trees (Pearl, 1988; Cowell, 1999). We adopt the approach described by Geman and Kochanek (2000), which is a straightforward generalization of HMM dynamic programming with minimal assumptions and programming overhead. However, in principle any of the graphical model computational algorithms can be used.

The quantities (3), (4) and (5) involve maximisation or summation over a product of functions, each of which depends only on the values of a subset of the variables X. There are dynamic programming algorithms for calculating all of these quantities, but for reasons of space we only describe an algorithm for finding the maximum value of a product of functions. These graph algorithms are rather involved. It may be easier to follow if one reads Example 1 before or in parallel with the definitions below.

To explain the algorithm we use the following notation. If x and x′ are both vectors of length m then x =ⱼ x′ iff x and x′ disagree on at most their jth components, i.e., xk = x′k for k = 1, . . . , j − 1, j + 1, . . . , m. If f is a function whose domain is 𝒳, we say that f depends on the set of variables d(f) = {Xj | ∃x, x′ ∈ 𝒳, x =ⱼ x′, f(x) ≠ f(x′)}. That is, Xj ∈ d(f) iff changing the value of Xj can change the value of f.

The algorithm relies on the fact that the variables in X = (X1, . . . , Xn) are ordered (e.g., X1 precedes X2, etc.), and while the algorithm is correct for any variable ordering, its efficiency may vary dramatically depending on the ordering, as described below.

Let ℋ be any set of functions whose domains are 𝒳. We partition ℋ into disjoint subsets ℋ1, . . . , ℋn+1, where ℋj is the subset of ℋ that depend on Xj but do not depend on any variables ordered before Xj, and ℋn+1 is the subset of ℋ that do not depend on any variables at all (i.e., they are constants).3 That is, ℋj = {H ∈ ℋ | Xj ∈ d(H), ∀i < j Xi ∉ d(H)} and ℋn+1 = {H ∈ ℋ | d(H) = ∅}.

As explained in section 3.1, there is a set of functions 𝒜 such that the quantities we need to calculate have the general form:

Mmax = max_{x∈𝒳} ∏_{A∈𝒜} A(x)   (6)
x̂ = argmax_{x∈𝒳} ∏_{A∈𝒜} A(x)   (7)

Mmax is the maximum value of the product expression while x̂ is the value of the variables at which the maximum occurs. In a SUBG parsing application x̂ identifies the MAP parse.

The procedure depends on two sequences of functions Mi, i = 1, . . . , n + 1 and Vi, i = 1, . . . , n. Informally, Mi is the maximum value attained by the subset of the functions 𝒜 that depend on one of the variables X1, . . . , Xi, and Vi gives information about the value of Xi at which this maximum is attained.

To simplify notation we write these functions as functions of the entire set of variables X, but usually they depend on a much smaller set of variables. The Mi are real valued, while each Vi ranges over 𝒳i. Let ℳ = {M1, . . . , Mn}. Recall that the sets of functions 𝒜 and ℳ can both be partitioned into disjoint subsets 𝒜1, . . . , 𝒜n+1 and ℳ1, . . . , ℳn+1 respectively on the basis of the variables each Ai and Mi depend on. The definition of the Mi and Vi, i = 1, . . . , n is as follows:

Mi(x) = max_{x′∈𝒳 s.t. x′ =ᵢ x} ∏_{A∈𝒜i} A(x′) ∏_{M∈ℳi} M(x′)   (8)
Vi(x) = argmax_{x′∈𝒳 s.t. x′ =ᵢ x} ∏_{A∈𝒜i} A(x′) ∏_{M∈ℳi} M(x′)

Mn+1 receives a special definition, since there is no variable Xn+1:

Mn+1 = ( ∏_{A∈𝒜n+1} A ) ( ∏_{M∈ℳn+1} M )   (9)

The definition of Mi in (8) may look circular (since ℳ appears in the right-hand side), but in fact it is not. First, note that Mi depends only on variables ordered after Xi, so if Mj ∈ ℳi then j < i. More specifically,

d(Mi) = ( ∪_{A∈𝒜i} d(A) ∪ ∪_{M∈ℳi} d(M) ) \ {Xi}.

Thus we can compute the Mi in the order M1, . . . , Mn+1, inserting Mi into the appropriate set ℳk, where k > i, when Mi is computed.

We claim that Mmax = Mn+1. (Note that Mn+1 and Mn are constants, since there are no variables ordered after Xn). To see this, consider the tree T whose nodes are the Mi, and which has a directed edge from Mi to Mj iff Mi ∈ ℳj (i.e., Mi appears in the right hand side of the definition (8) of Mj). T has a unique root Mn+1, so there is a path from every Mi to Mn+1. Let i ≺ j iff there is a path from Mi to Mj in this tree. Then a simple induction shows that Mj is a function from d(Mj) to a maximisation over each of the variables Xi where i ≺ j of ∏_{i≺j, A∈𝒜i} A.

Further, it is straightforward to show that Vi(x̂) = x̂i (the value x̂ assigns to Xi). By the same arguments as above, d(Vi) only contains variables ordered after Xi, so Vn = x̂n. Thus we can evaluate the Vi in the order Vn, . . . , V1 to find the maximising assignment x̂.

3 Strictly speaking this does not necessarily define a partition, as some of the subsets ℋj may be empty.
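The following Python sketch is our own rendering of (8) and (9) as max-product variable elimination with back-pointers: the table built when each variable is eliminated plays the role of Mi, and the recorded argmax choices play the role of Vi. Factors are given as (scope, function) pairs, so any of the products in (3), (6) or (7) can be expressed this way; nothing in it is specific to SUBGs.

    from itertools import product

    def max_product_eliminate(domains, factors, order=None):
        """Max-product variable elimination with back-pointers.

        domains : dict {variable: iterable of values}
        factors : list of (scope, fn) pairs, where scope is a tuple of
                  variables and fn maps a {variable: value} dict restricted
                  to scope to a non-negative number
        order   : elimination order (defaults to sorted(domains))

        Returns (M_max, x_hat), the maximum of the product of the factors
        and an assignment achieving it.
        """
        order = list(order) if order is not None else sorted(domains)
        pool = [(tuple(scope), fn) for scope, fn in factors]
        back = []                       # (variable, scope of M_i, V_i table)

        for var in order:
            touching = [f for f in pool if var in f[0]]
            pool = [f for f in pool if var not in f[0]]
            scope = tuple(sorted({v for s, _ in touching for v in s} - {var}))
            table, choice = {}, {}
            for values in product(*(domains[v] for v in scope)):
                context = dict(zip(scope, values))
                best_val, best_score = None, float('-inf')
                for xv in domains[var]:         # maximise over X_i
                    context[var] = xv
                    score = 1.0
                    for s, fn in touching:
                        score *= fn({v: context[v] for v in s})
                    if score > best_score:
                        best_val, best_score = xv, score
                table[values] = best_score      # M_i on this context
                choice[values] = best_val       # V_i on this context
            back.append((var, scope, choice))
            # M_i re-enters the pool as a new factor over d(M_i).
            pool.append((scope, lambda ctx, t=table, s=scope: t[tuple(ctx[v] for v in s)]))

        # All remaining factors are constants: M_max is their product (cf. (9)).
        m_max = 1.0
        for _, fn in pool:
            m_max *= fn({})

        # Evaluate V_n, ..., V_1 to recover the maximising assignment x_hat.
        x_hat = {}
        for var, scope, choice in reversed(back):
            x_hat[var] = choice[tuple(x_hat[v] for v in scope)]
        return m_max, x_hat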
Example 1. Let X = {X1, X2, X3, X4, X5, X6, X7} and set 𝒜 = {a(X1, X3), b(X2, X4), c(X3, X4, X5), d(X4, X5), e(X6, X7)}. We can represent the sharing of variables in 𝒜 by means of an undirected graph G𝒜, where the nodes of G𝒜 are the variables X and there is an edge in G𝒜 connecting Xi to Xj iff ∃A ∈ 𝒜 such that both Xi, Xj ∈ d(A). G𝒜 is depicted below.

[Figure: the graph G𝒜, whose nodes are X1, . . . , X7 and whose edges are X1–X3, X2–X4, X3–X4, X3–X5, X4–X5 and X6–X7.]

Starting with the variable X1, we compute M1 and V1:

M1(x3) = max_{x1∈𝒳1} a(x1, x3)
V1(x3) = argmax_{x1∈𝒳1} a(x1, x3)

We now proceed to the variable X2.

M2(x4) = max_{x2∈𝒳2} b(x2, x4)
V2(x4) = argmax_{x2∈𝒳2} b(x2, x4)

Since M1 belongs to ℳ3, it appears in the definition of M3.

M3(x4, x5) = max_{x3∈𝒳3} c(x3, x4, x5) M1(x3)
V3(x4, x5) = argmax_{x3∈𝒳3} c(x3, x4, x5) M1(x3)

Similarly, M4 is defined in terms of M2 and M3.

M4(x5) = max_{x4∈𝒳4} d(x4, x5) M2(x4) M3(x4, x5)
V4(x5) = argmax_{x4∈𝒳4} d(x4, x5) M2(x4) M3(x4, x5)

Note that M5 is a constant, reflecting the fact that in G𝒜 the node X5 is not connected to any node ordered after it.

M5 = max_{x5∈𝒳5} M4(x5)
V5 = argmax_{x5∈𝒳5} M4(x5)

The second component is defined in the same way:

M6(x7) = max_{x6∈𝒳6} e(x6, x7)
V6(x7) = argmax_{x6∈𝒳6} e(x6, x7)
M7 = max_{x7∈𝒳7} M6(x7)
V7 = argmax_{x7∈𝒳7} M6(x7)

The maximum value for the product, M8 = Mmax, is defined in terms of M5 and M7.

Mmax = M8 = M5 M7

Finally, we evaluate V7, . . . , V1 to find the maximising assignment x̂.

x̂7 = V7
x̂6 = V6(x̂7)
x̂5 = V5
x̂4 = V4(x̂5)
x̂3 = V3(x̂4, x̂5)
x̂2 = V2(x̂4)
x̂1 = V1(x̂3)
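Continuing the sketch above, Example 1 can be run through max_product_eliminate directly. The example leaves the functions a, . . . , e unspecified, so the boolean domains and the particular factor values below are ours, chosen only to make the code executable; the elimination order X1, . . . , X7 reproduces the structure of the Mi and Vi computed by hand above.

    # Variables X1, ..., X7; boolean domains for illustration.
    domains = {i: (0, 1) for i in range(1, 8)}

    # Factors with the scopes of Example 1; the values are arbitrary.
    factors = [
        ((1, 3),    lambda v: 1.0 + v[1] * v[3]),            # a(X1, X3)
        ((2, 4),    lambda v: 2.0 if v[2] == v[4] else 0.5),  # b(X2, X4)
        ((3, 4, 5), lambda v: 1.0 + v[3] + v[4] * v[5]),      # c(X3, X4, X5)
        ((4, 5),    lambda v: 1.5 - 0.5 * v[4] * v[5]),       # d(X4, X5)
        ((6, 7),    lambda v: 1.0 + v[6] * (1 - v[7])),       # e(X6, X7)
    ]

    m_max, x_hat = max_product_eliminate(domains, factors, order=range(1, 8))
    print(m_max)   # M_max = M_8 = M_5 * M_7
    print(x_hat)   # the maximising assignment, recovered from the V_i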

We now briefly consider the computational complexity of this process. Clearly, the number of steps required to compute each Mi is a polynomial of order |d(Mi)| + 1, since we need to enumerate all possible values for the argument variables d(Mi) and, for each of these, maximise over the set 𝒳i. Further, it is easy to show that in terms of the graph G𝒜, d(Mj) consists of those variables Xk, k > j, reachable by a path starting at Xj and all of whose nodes except the last are variables that precede Xj.

Since computational effort is bounded above by a polynomial of order |d(Mi)| + 1, we seek a variable ordering that bounds the maximum value of |d(Mi)|. Unfortunately, finding the ordering that minimises the maximum value of |d(Mi)| is an NP-complete problem. However, there are several efficient heuristics that are reputed in the graphical models community to produce good visitation schedules. It may be that they will perform well in the SUBG parsing applications as well.
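The paper does not commit to a particular heuristic, but one simple and widely used choice from the graphical models literature is a greedy minimum-degree ordering over the interaction graph G𝒜. The sketch below is ours and is only meant to show how such a visitation schedule could be computed before calling the elimination routine above.

    def min_degree_order(variables, scopes):
        """Greedy minimum-degree elimination ordering (one common heuristic).

        variables : iterable of variables
        scopes    : iterable of factor scopes (collections of variables)

        Builds the interaction graph (the G_A of Example 1), then repeatedly
        picks the variable with the fewest remaining neighbours, connecting
        its neighbours to each other as it is eliminated.
        """
        neighbours = {v: set() for v in variables}
        for scope in scopes:
            for v in scope:
                neighbours[v] |= set(scope) - {v}
        order, remaining = [], set(variables)
        while remaining:
            v = min(remaining, key=lambda u: len(neighbours[u] & remaining))
            nbrs = neighbours[v] & remaining
            for u in nbrs:
                neighbours[u] |= nbrs - {u}
            order.append(v)
            remaining.remove(v)
        return order

    # For Example 1: min_degree_order(range(1, 8), [s for s, _ in factors])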
5 Conclusion

This paper shows how to apply dynamic programming methods developed for graphical models to SUBGs to find the most probable parse and to obtain the statistics needed for estimation directly from Maxwell and Kaplan packed parse representations, i.e., without expanding these into individual parses. The algorithm rests on the observation that so long as features are local to the parse fragments used in the packed representations, the statistics required for parsing and estimation are the kinds of quantities that dynamic programming algorithms for graphical models can compute. Since neither Maxwell and Kaplan's packed parsing algorithm nor the procedures described here depend on the details of the underlying linguistic theory, the approach should apply to virtually any kind of underlying grammar.

Obviously, an empirical evaluation of the algorithms described here would be extremely useful. The algorithms described here are exact, but because we are working with unification grammars and apparently arbitrary graphical models we cannot polynomially bound their computational complexity. However, it seems reasonable to expect that if the linguistic dependencies in a sentence typically factorize into largely non-interacting cliques then the dynamic programming methods may offer dramatic computational savings compared to current methods that enumerate all possible parses.

It might be interesting to compare these dynamic programming algorithms with a standard unification-based parser using a best-first search heuristic. (To our knowledge such an approach has not yet been explored, but it seems straightforward: the figure of merit could simply be the sum of the weights of the properties of each partial parse's fragments). Because such parsers prune the search space they cannot guarantee correct results, unlike the algorithms proposed here. Such a best-first parser might be accurate when parsing with a trained grammar, but its results may be poor at the beginning of parameter weight estimation, when the parameter weight estimates are themselves inaccurate.

Finally, it would be extremely interesting to compare these dynamic programming algorithms to the ones described by Miyao and Tsujii (2002). It seems that the Maxwell and Kaplan packed representation may permit more compact representations than the disjunctive representations used by Miyao et al., but this does not imply that the algorithms proposed here are more efficient. Further theoretical and empirical investigation is required.

References

Steven Abney. 1997. Stochastic Attribute-Value Grammars. Computational Linguistics, 23(4):597–617.

Robert Cowell. 1999. Introduction to inference for Bayesian networks. In Michael Jordan, editor, Learning in Graphical Models, pages 9–26. The MIT Press, Cambridge, Massachusetts.

Kenneth D. Forbus and Johan de Kleer. 1993. Building Problem Solvers. The MIT Press, Cambridge, Massachusetts.

Stuart Geman and Kevin Kochanek. 2000. Dynamic programming and the representation of soft-decodable codes. Technical report, Division of Applied Mathematics, Brown University.

Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic "unification-based" grammars. In The Proceedings of the 37th Annual Conference of the Association for Computational Linguistics, pages 535–541, San Francisco. Morgan Kaufmann.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Machine Learning: Proceedings of the Eighteenth International Conference (ICML 2001), Stanford, California.

John T. Maxwell III and Ronald M. Kaplan. 1995. A method for disjunctive constraint satisfaction. In Mary Dalrymple, Ronald M. Kaplan, John T. Maxwell III, and Annie Zaenen, editors, Formal Issues in Lexical-Functional Grammar, number 47 in CSLI Lecture Notes Series, chapter 14, pages 381–481. CSLI Publications.

Yusuke Miyao and Jun'ichi Tsujii. 2002. Maximum entropy estimation for feature forests. In Proceedings of Human Language Technology Conference 2002, March.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, California.

Padhraic Smyth, David Heckerman, and Michael Jordan. 1997. Probabilistic Independence Networks for Hidden Markov Models. Neural Computation, 9(2):227–269.