Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 279-286.

Dynamic programming for parsing and estimation of stochastic unification-based grammars∗

Stuart Geman, Division of Applied Mathematics, Brown University, [email protected]
Mark Johnson, Cognitive and Linguistic Sciences, Brown University, [email protected]

∗ We would like to thank Eugene Charniak, Miyao Yusuke, Mark Steedman as well as Stefan Riezler and the team at PARC; naturally all errors remain our own. This research was supported by NSF awards DMS 0074276 and ITR IIS 0085940.

Abstract

Stochastic unification-based grammars (SUBGs) define exponential distributions over the parses generated by a unification-based grammar (UBG). Existing algorithms for parsing and estimation require the enumeration of all of the parses of a string in order to determine the most likely one, or in order to calculate the statistics needed to estimate a grammar from a training corpus. This paper describes a graph-based dynamic programming algorithm for calculating these statistics from the packed UBG parse representations of Maxwell and Kaplan (1995) which does not require enumerating all parses. Like many graphical algorithms, the dynamic programming algorithm's complexity is worst-case exponential, but is often polynomial. The key observation is that by using Maxwell and Kaplan packed representations, the required statistics can be rewritten as either the max or the sum of a product of functions. This is exactly the kind of problem which can be solved by dynamic programming over graphical models.

1 Introduction

Stochastic Unification-Based Grammars (SUBGs) use log-linear models (also known as exponential or MaxEnt models and Markov Random Fields) to define probability distributions over the parses of a unification grammar. These grammars can incorporate virtually all kinds of linguistically important constraints (including non-local and non-context-free constraints), and are equipped with a statistically sound framework for estimation and learning.

Abney (1997) pointed out that the non-context-free dependencies of a unification grammar require stochastic models more general than Probabilistic Context-Free Grammars (PCFGs) and Markov Branching Processes, and proposed the use of log-linear models for defining probability distributions over the parses of a unification grammar. Unfortunately, the maximum likelihood estimator Abney proposed for SUBGs seems computationally intractable since it requires statistics that depend on the set of all parses of all strings generated by the grammar. This set is infinite (so exhaustive enumeration is impossible) and presumably has a very complex structure (so sampling estimates might take an extremely long time to converge).

Johnson et al. (1999) observed that parsing and related tasks only require conditional distributions over parses given strings, and that such conditional distributions are considerably easier to estimate than joint distributions of strings and their parses. The conditional maximum likelihood estimator proposed by Johnson et al. requires statistics that depend on the set of all parses of the strings in the training corpus. For most linguistically realistic grammars this set is finite, and for moderate-sized grammars and training corpora this estimation procedure is quite feasible.

However, our recent experiments involve training from the Wall Street Journal Penn Tree-bank, and repeatedly enumerating the parses of its 50,000 sentences is quite time-consuming. Matters are only made worse because we have moved some of the constraints in the grammar from the unification component to the stochastic component. This broadens the coverage of the grammar, but at the expense of massively expanding the number of possible parses of each sentence.

In the mid-1990s unification-based parsers were developed that do not enumerate all parses of a string but instead manipulate and return a "packed" representation of the set of parses. This paper describes how to find the most probable parse and the statistics required for estimating a SUBG from the packed parse set representations proposed by Maxwell III and Kaplan (1995). This makes it possible to avoid explicitly enumerating the parses of the strings in the training corpus.

The methods proposed here are analogues of the well-known dynamic programming algorithms for Probabilistic Context-Free Grammars (PCFGs); specifically the Viterbi algorithm for finding the most probable parse of a string, and the Inside-Outside algorithm for estimating a PCFG from unparsed training data.¹ In fact, because Maxwell and Kaplan packed representations are just Truth Maintenance System (TMS) representations (Forbus and de Kleer, 1993), the statistical techniques described here should extend to non-linguistic applications of TMSs as well.

¹ However, because we use conditional estimation, also known as discriminative training, we require at least some discriminating information about the correct parse of a string in order to estimate a stochastic unification grammar.
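The abstract's claim that the required statistics reduce to the max or the sum of a product of functions can be made concrete with a small sketch. The following Python fragment (an illustration under assumed factor tables and variable names, not the paper's algorithm) contrasts brute-force evaluation with variable elimination on a chain of two factors over three binary variables; because each factor touches only two variables, elimination gets the same answer without visiting every assignment.

```python
import itertools

# A minimal sketch (not the paper's algorithm): three binary variables
# X1, X2, X3 with factors that each depend on only two adjacent variables,
# as in a chain-structured graphical model.  The nonnegative factor
# tables below are made-up illustrative numbers.
f12 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}  # depends on X1, X2
f23 = {(0, 0): 0.2, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 0.7}  # depends on X2, X3

# Brute force: sum (or max) of the product of factors over all assignments.
brute_sum = sum(f12[x1, x2] * f23[x2, x3]
                for x1, x2, x3 in itertools.product((0, 1), repeat=3))
brute_max = max(f12[x1, x2] * f23[x2, x3]
                for x1, x2, x3 in itertools.product((0, 1), repeat=3))

# Dynamic programming: eliminate X3 first, producing an intermediate
# function of X2 alone, then sum (or max) out the remaining variables.
# Because each factor touches few variables, the work stays polynomial
# even though the number of assignments is exponential in general.
m3_sum = {x2: sum(f23[x2, x3] for x3 in (0, 1)) for x2 in (0, 1)}
dp_sum = sum(f12[x1, x2] * m3_sum[x2] for x1 in (0, 1) for x2 in (0, 1))

# max distributes over products of nonnegative factors, so the same
# elimination order also computes the max of the product.
m3_max = {x2: max(f23[x2, x3] for x3 in (0, 1)) for x2 in (0, 1)}
dp_max = max(f12[x1, x2] * m3_max[x2] for x1 in (0, 1) for x2 in (0, 1))

assert abs(brute_sum - dp_sum) < 1e-12 and abs(brute_max - dp_max) < 1e-12
```

Replacing sum with max is what turns the Inside-style computation into a Viterbi-style one, which is why the same dynamic program serves both the most-probable-parse and the estimation problems.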
Dynamic programming techniques have been applied to log-linear models before. Lafferty et al. (2001) mention that dynamic programming can be used to compute the statistics required for conditional estimation of log-linear models based on context-free grammars where the properties can include arbitrary functions of the input string. Miyao and Tsujii (2002) (which appeared after this paper was accepted) is the closest related work we know of. They describe a technique for calculating the statistics required to estimate a log-linear parsing model with non-local properties from packed feature forests.

The rest of this paper is structured as follows. The next section describes unification grammars and the Maxwell and Kaplan packed representation. The following section reviews stochastic unification grammars (Abney, 1997) and the statistical quantities required for efficiently estimating such grammars from parsed training data (Johnson et al., 1999). The final substantive section of this paper shows how these quantities can be defined directly in terms of the Maxwell and Kaplan packed representations.

The notation used in this paper is as follows. Variables are written in upper case italic, e.g., X, Y, etc., the sets they range over are written in script, e.g., 𝒳, 𝒴, etc., while specific values are written in lower case italic, e.g., x, y, etc. In the case of vector-valued entities, subscripts indicate particular components.

2 Maxwell and Kaplan packed representations

This section characterises the properties of unification grammars and the Maxwell and Kaplan packed parse representations that will be important for what follows. This characterisation omits many details about unification grammars and the algorithm by which the packed representations are actually constructed; see Maxwell III and Kaplan (1995) for details.

A parse generated by a unification grammar is a finite subset of a set F of features. Features are parse fragments, e.g., chart edges or arcs from attribute-value structures, out of which the packed representations are constructed. For this paper it does not matter exactly what features are, but they are intended to be the atomic entities manipulated by a dynamic programming parsing algorithm. A grammar defines a set Ω of well-formed or grammatical parses. Each parse ω ∈ Ω is associated with a string of words Y(ω) called its yield. Note that except for trivial grammars, F and Ω are infinite.

If y is a string, then let Ω(y) = {ω ∈ Ω : Y(ω) = y} and F(y) = ⋃_{ω ∈ Ω(y)} {f : f ∈ ω}. That is, Ω(y) is the set of parses of a string y and F(y) is the set of features appearing in the parses of y. In the grammars of interest here Ω(y) and hence also F(y) are finite.
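As a concrete illustration of Ω(y) and F(y), the small fragment below models parses as frozen sets of feature identifiers; the feature names are hypothetical, invented purely for exposition.

```python
# A hypothetical toy example (not from the paper): parses are modeled as
# frozensets of feature identifiers, and F(y) is the union of the
# features appearing in the parses of y.
omega_y = [
    frozenset({"edge:NP(0,2)", "edge:VP(2,5)", "arc:SUBJ"}),
    frozenset({"edge:NP(0,2)", "edge:VP(2,5)", "arc:OBJ"}),
]

# F(y): every feature that occurs in at least one parse of y.
features_y = frozenset().union(*omega_y)

print(len(omega_y))        # 2 parses in Omega(y)
print(sorted(features_y))  # 4 distinct features in F(y)
# Shared features such as edge:NP(0,2) appear once in F(y) even though
# they occur in several parses -- the intuition behind packing.
```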
Maxwell and Kaplan's packed representations often provide a more compact representation of the set of parses of a sentence than would be obtained by merely listing each parse separately. The intuition behind these packed representations is that for most strings y, many of the features in F(y) occur in many of the parses Ω(y). This is often the case in natural language, since the same substructure can appear as a component of many different parses.

Packed feature representations are defined in terms of conditions on the values assigned to a vector of variables X. These variables have no direct linguistic interpretation; rather, each different assignment of values to these variables identifies a set of features which constitutes one of the parses in the packed representation. A condition a on X is a function from X to {0, 1}. While for uniformity we write conditions as functions on the entire vector X, in practice Maxwell and Kaplan's approach produces conditions whose value depends only on a few of the variables in X, and the efficiency of the algorithms described here depends on this.

A packed representation R consists of a finite set F′ ⊆ F of features, a finite vector X of variables ranging over the finite set 𝒳 of value vectors, a function α that maps each feature f ∈ F′ to a condition α_f on X, and a finite set N of conditions on X called the no-goods. A value vector x satisfies the no-goods iff N(x) = 1, where N(x) = ∏_{η ∈ N} η(x), and each such x identifies a parse ω(x) = {f ∈ F′ : α_f(x) = 1}. We require that each parse be identified by a unique value satisfying the no-goods. That is, we require that:

∀x, x′ ∈ 𝒳: if N(x) = N(x′) = 1 and ω(x) = ω(x′), then x = x′.    (1)

Finally, a packed representation R represents the set of parses Ω(R) that are identified by values that satisfy the no-goods, i.e., Ω(R) = {ω(x) : x ∈ 𝒳, N(x) = 1}.

Maxwell III and Kaplan (1995) describe a parsing algorithm for unification-based grammars that takes as input a string y and returns a packed representation R such that Ω(R) = Ω(y), i.e., R represents the set of parses of the string y. The SUBG parsing and estimation algorithms described in this paper use Maxwell and Kaplan's parsing algorithm as a subroutine.
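The following sketch (assumed data structures, not Maxwell and Kaplan's implementation) spells out these definitions on a one-variable toy example: conditions are 0/1-valued functions of the variable vector, N(x) = 1 when every no-good is satisfied, and each satisfying assignment x picks out the parse ω(x). The brute-force enumeration shown here is exactly what the dynamic programming algorithms of this paper avoid.

```python
import itertools

# A minimal sketch of a packed representation R = (F', X, N, alpha).
# Feature names and the single binary variable are hypothetical; X1
# chooses between a subject and an object reading of the same span.
features = ["edge:NP(0,2)", "edge:VP(2,5)", "arc:SUBJ", "arc:OBJ"]
domains = [(0, 1)]  # one binary variable X1

alpha = {
    "edge:NP(0,2)": lambda x: 1,                    # in every parse
    "edge:VP(2,5)": lambda x: 1,                    # in every parse
    "arc:SUBJ": lambda x: 1 if x[0] == 0 else 0,    # only when X1 = 0
    "arc:OBJ": lambda x: 1 if x[0] == 1 else 0,     # only when X1 = 1
}
no_goods = [lambda x: 1]  # trivially satisfied in this toy example

def parses(features, domains, alpha, no_goods):
    """Enumerate Omega(R) by brute force over variable assignments.

    The paper's dynamic programming algorithms avoid exactly this
    enumeration; it is shown only to make the definitions concrete."""
    result = set()
    for x in itertools.product(*domains):
        if all(ng(x) == 1 for ng in no_goods):  # N(x) = 1
            result.add(frozenset(f for f in features if alpha[f](x) == 1))
    return result

for parse in parses(features, domains, alpha, no_goods):
    print(sorted(parse))  # the two parses omega(x) identified by x
```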
3 Stochastic Unification-Based Grammars

This section reviews the probabilistic framework used in SUBGs, and describes the statistics that must be calculated in order to estimate the parameters of a SUBG from parsed training data. For a more detailed exposition and descriptions of regularization and other important details, see Johnson et al. (1999).