
Accurate Unlexicalized Parsing

Dan Klein and Christopher D. Manning
Computer Science Department, Stanford University
Stanford, CA 94305-9040
[email protected]  [email protected]

Abstract

We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar. Indeed, its performance of 86.36% (LP/LR F1) is better than that of early lexicalized PCFG models, and surprisingly close to the current state-of-the-art. This result has potential uses beyond establishing a strong lower bound on the maximum possible accuracy of unlexicalized models: an unlexicalized PCFG is much more compact, easier to replicate, and easier to interpret than more complex lexical models, and the parsing algorithms are simpler, more widely understood, of lower asymptotic complexity, and easier to optimize.

In the early 1990s, as probabilistic methods swept NLP, parsing work revived the investigation of probabilistic context-free grammars (PCFGs) (Booth and Thomson, 1973; Baker, 1979). However, early results on the utility of PCFGs for parse disambiguation and language modeling were somewhat disappointing. A conviction arose that lexicalized PCFGs (where head words annotate phrasal nodes) were the key tool for high performance PCFG parsing. This approach was congruent with the great success of word n-gram models in speech recognition, and drew strength from a broader interest in lexicalized grammars, as well as demonstrations that lexical dependencies were a key tool for resolving ambiguities such as PP attachments (Ford et al., 1982; Hindle and Rooth, 1993). In the following decade, great success in terms of parse disambiguation and even language modeling was achieved by various lexicalized PCFG models (Magerman, 1995; Charniak, 1997; Collins, 1999; Charniak, 2000; Charniak, 2001).

However, several results have brought into question how large a role lexicalization plays in such parsers. Johnson (1998) showed that the performance of an unlexicalized PCFG over the Penn treebank could be improved enormously simply by annotating each node by its parent category. The Penn treebank covering PCFG is a poor tool for parsing because the context-freedom assumptions it embodies are far too strong, and weakening them in this way makes the model much better. More recently, Gildea (2001) discusses how taking the bilexical probabilities out of a good current lexicalized PCFG parser hurts performance hardly at all: by at most 0.5% for test text from the same domain as the training data, and not at all for test text from a different domain.[1] But it is precisely these bilexical dependencies that backed the intuition that lexicalized PCFGs should be very successful, for example in Hindle and Rooth's demonstration from PP attachment. We take this as a reflection of the fundamental sparseness of the lexical dependency information available in the Penn Treebank. As a speech person would say, one million words of training data just isn't enough. Even for topics central to the treebank's Wall Street Journal text, such as stocks, many very plausible dependencies occur only once, for example stocks stabilized, while many others occur not at all, for example stocks skyrocketed.[2]

[1] There are minor differences, but all the current best-known lexicalized PCFGs employ both monolexical statistics, which describe the phrasal categories of arguments and adjuncts that appear around a head lexical item, and bilexical statistics, or dependencies, which describe the likelihood of a head word taking as a dependent a phrase headed by a certain other word.

[2] This observation motivates various class- or similarity-based approaches to combating sparseness, and this remains a promising avenue of work, but success in this area has proven somewhat elusive, and, at any rate, current lexicalized PCFGs do simply use exact word matches if available, and interpolate with syntactic category-based estimates when they are not.

The best-performing lexicalized PCFGs have increasingly made use of subcategorization[3] of the categories appearing in the Penn treebank. Charniak (2000) shows the value his parser gains from parent-annotation of nodes, suggesting that this information is at least partly complementary to information derivable from lexicalization, and Collins (1999) uses a range of linguistically motivated and carefully hand-engineered subcategorizations to break down wrong context-freedom assumptions of the naive Penn treebank covering PCFG, such as differentiating "base NPs" from noun phrases with phrasal modifiers, and distinguishing sentences with empty subjects from those where there is an overt subject NP. While he gives incomplete experimental results as to their efficacy, we can assume that these features were incorporated because of beneficial effects on parsing that were complementary to lexicalization.

[3] In this paper we use the term subcategorization in the original general sense of Chomsky (1965), for where a syntactic category is divided into several subcategories, for example dividing verb phrases into finite and non-finite verb phrases, rather than in the modern restricted usage where the term refers only to the syntactic argument frames of predicators.

In this paper, we show that the parsing performance that can be achieved by an unlexicalized PCFG is far higher than has previously been demonstrated, and is, indeed, much higher than community wisdom has thought possible. We describe several simple, linguistically motivated annotations which do much to close the gap between a vanilla PCFG and state-of-the-art lexicalized models. Specifically, we construct an unlexicalized PCFG which outperforms the lexicalized PCFGs of Magerman (1995) and Collins (1996) (though not more recent models, such as Charniak (1997) or Collins (1999)).

One benefit of this result is a much-strengthened lower bound on the capacity of an unlexicalized PCFG. To the extent that no such strong baseline has been provided, the community has tended to greatly overestimate the beneficial effect of lexicalization in probabilistic parsing, rather than looking critically at where lexicalized probabilities are both needed to make the right decision and available in the training data. Secondly, this result affirms the value of linguistic analysis for feature discovery. The result has other uses and advantages: an unlexicalized PCFG is easier to interpret, reason about, and improve than the more complex lexicalized models. The grammar representation is much more compact, no longer requiring large structures that store lexicalized probabilities. The parsing algorithms have lower asymptotic complexity[4] and have much smaller grammar constants. An unlexicalized PCFG parser is much simpler to build and optimize, including both standard code optimization techniques and the investigation of methods for search space pruning (Caraballo and Charniak, 1998; Charniak et al., 1998).

[4] O(n^3) vs. O(n^5) for a naive implementation, or vs. O(n^4) if using the clever approach of Eisner and Satta (1999).

It is not our goal to argue against the use of lexicalized probabilities in high-performance probabilistic parsing. It has been comprehensively demonstrated that lexical dependencies are useful in resolving major classes of ambiguities, and a parser should make use of such information where possible. We focus here on using unlexicalized, structural context because we feel that this information has been underexploited and underappreciated. We see this investigation as only one part of the foundation for state-of-the-art parsing which employs both lexical and structural conditioning.

1 Experimental Setup

To facilitate comparison with previous work, we trained our models on sections 2–21 of the WSJ section of the Penn treebank. We used the first 20 files (393 sentences) of section 22 as a development set (devset). This set is small enough that there is noticeable variance in individual results, but it allowed rapid search for good features via continually reparsing the devset in a partially manual hill-climb. All of section 23 was used as a test set for the final model. For each model, input trees were annotated or transformed in some way, as in Johnson (1998). Given a set of transformed trees, we viewed the local trees as grammar rewrite rules in the standard way, and used (unsmoothed) maximum-likelihood estimates for rule probabilities.[5] To parse the grammar, we used a simple array-based Java implementation of a generalized CKY parser, which, for our final best model, was able to exhaustively parse all sentences in section 23 in 1GB of memory, taking approximately 3 sec for average length sentences.[6]

[5] The tagging probabilities were smoothed to accommodate unknown words. The quantity P(tag|word) was estimated as follows: words were split into one of several categories wordclass, based on capitalization, suffix, digit, and other character features. For each of these categories, we took the maximum-likelihood estimate of P(tag|wordclass). This distribution was used as a prior against which observed taggings, if any, were taken, giving P(tag|word) = [c(tag, word) + κ P(tag|wordclass)] / [c(word) + κ]. This was then inverted to give P(word|tag). The quality of this tagging model impacts all numbers; for example the raw treebank grammar's devset F1 is 72.62 with it and 72.09 without it.

[6] The parser is available for download as open source at: http://nlp.stanford.edu/downloads/lex-parser.shtml
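As a concrete illustration, the smoothed tagging estimate of footnote 5 might be sketched as follows. This is our own illustrative code, not the paper's implementation: the particular word classes and the value of κ are assumptions, not values from the paper.

```python
# Sketch of the footnote-5 unknown-word model:
#   P(tag|word) = [c(tag,word) + kappa * P(tag|wordclass)] / [c(word) + kappa]
# which is later inverted to give P(word|tag).

KAPPA = 0.5  # smoothing strength (illustrative; the paper does not state it)

def word_class(word):
    # Coarse character-based classes (capitalization, digits, suffix),
    # in the spirit of the features listed in footnote 5.
    if word[0].isupper():
        return "CAP"
    if any(ch.isdigit() for ch in word):
        return "NUM"
    if word.endswith("ing"):
        return "ING"
    return "LOWER"

def p_tag_given_word(tag, word, c_tag_word, c_word, p_tag_class):
    # Observed taggings are taken against the word-class prior.
    prior = p_tag_class.get((tag, word_class(word)), 0.0)
    return (c_tag_word.get((tag, word), 0) + KAPPA * prior) / \
           (c_word.get(word, 0) + KAPPA)
```

For an unseen word, both counts are zero and the estimate falls back entirely on the class prior P(tag|wordclass), which is the intended behavior for unknown words.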
2 Vertical and Horizontal Markovization

The traditional starting point for unlexicalized parsing is the raw n-ary treebank grammar read from training trees (after removing functional tags and null elements). This basic grammar is imperfect in two well-known ways. First, the category symbols are too coarse to adequately render the expansions independent of the contexts. For example, subject NP expansions are very different from object NP expansions: a subject NP is 8.7 times more likely than an object NP to expand as just a pronoun. Having separate symbols for subject and object NPs allows this variation to be captured and used to improve parse scoring. One way of capturing this kind of external context is to use parent annotation, as presented in Johnson (1998). For example, NPs with S parents (like subjects) will be marked NPˆS, while NPs with VP parents (like objects) will be NPˆVP.

The second basic deficiency is that many rule types have been seen only once (and therefore have their probabilities overestimated), and many rules which occur in test sentences will never have been seen in training (and therefore have their probabilities underestimated – see Collins (1999) for analysis). Note that in parsing with the unsplit grammar, not having seen a rule doesn't mean one gets a parse failure, but rather a possibly very weird parse (Charniak, 1996). One successful method of combating sparsity is to markovize the rules (Collins, 1999). In particular, we follow that work in markovizing out from the head child, despite the grammar being unlexicalized, because this seems the best way to capture the traditional linguistic insight that phrases are organized around a head (Radford, 1988).

Both parent annotation (adding context) and RHS markovization (removing it) can be seen as two instances of the same idea. In parsing, every node has a vertical history, including the node itself, parent, grandparent, and so on. A reasonable assumption is that only the past v vertical ancestors matter to the current expansion. Similarly, only the previous h horizontal ancestors matter (we assume that the head child always matters). It is a historical accident that the default notion of a treebank PCFG grammar takes v = 1 (only the current node matters vertically) and h = ∞ (rule right hand sides do not decompose at all). On this view, it is unsurprising that increasing v and decreasing h have historically helped.

As an example, consider the case of v = 1, h = 1. If we start with the rule VP → VBZ NP PP PP, it will be broken into several stages, each a binary or unary rule, which conceptually represent a head-outward generation of the right hand side, as shown in figure 1. The bottom layer will be a unary over the head declaring the goal: ⟨VP: [VBZ]⟩ → VBZ. The square brackets indicate that the VBZ is the head, while the angle brackets indicate that ⟨X⟩ is an intermediate symbol (equivalently, an active or incomplete state). The next layer up will generate the first rightward sibling of the head child: ⟨VP: [VBZ]... NP⟩ → ⟨VP: [VBZ]⟩ NP. Next, the PP is generated: ⟨VP: [VBZ]... PP⟩ → ⟨VP: [VBZ]... NP⟩ PP. We would then branch off left siblings if there were any.[7] Finally, we have another unary to finish the VP. Note that while it is convenient to think of this as a head-outward process, these are just PCFG rewrites, and so the actual scores attached to each rule will correspond to a downward generation order.

Figure 1: The v = 1, h = 1 markovization of VP → VBZ NP PP PP. (tree diagram omitted)

[7] In our system, the last few right children carry over as preceding context for the left children, distinct from common practice. We found this wrapped horizon to be beneficial, and it also unifies the infinite order model with the unmarkovized raw rules.

Figure 2 presents a grid of horizontal and vertical markovizations of the grammar. The raw treebank grammar corresponds to v = 1, h = ∞ (the upper right corner), while the parent annotation in Johnson (1998) corresponds to v = 2, h = ∞, and the second-order model in Collins (1999) is broadly a smoothed version of v = 2, h = 2. In addition to exact nth-order models, we tried variable-history models similar in intent to those described in Ron et al. (1994).

Figure 2: Markovizations: F1 and grammar size (number of symbols in parentheses).

  Vertical Order           h = 0          h = 1           h ≤ 2           h = 2           h = ∞
  v = 1  No annotation     71.27 (854)    72.5  (3119)    73.46 (3863)    72.96 (6207)    72.62 (9657)
  v ≤ 2  Sel. Parents      74.75 (2285)   77.42 (6564)    77.77 (7619)    77.50 (11398)   76.91 (14247)
  v = 2  All Parents       74.68 (2984)   77.42 (7312)    77.81 (8367)    77.50 (12132)   76.81 (14666)
  v ≤ 3  Sel. GParents     76.50 (4943)   78.59 (12374)   79.07 (13627)   78.97 (19545)   78.54 (20123)
  v = 3  All GParents      76.74 (7797)   79.18 (15740)   79.74 (16994)   79.07 (22886)   78.72 (22002)
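To make the decomposition concrete, here is a minimal sketch (our own illustrative code, not the paper's implementation) of h-th order head-outward binarization for right siblings; left siblings and the wrapped horizon of footnote 7 are omitted for brevity.

```python
def markovize_rule(parent, children, head_index, h=1):
    """Sketch of h-th order head-outward binarization of an n-ary rule.

    Simplified illustration: only right siblings of the head are
    generated, so this reproduces the figure-1 example but not the
    full scheme described in the text.
    """
    head = children[head_index]
    rules = []
    state = f"<{parent}: [{head}]>"
    rules.append((state, [head]))  # unary over the head, declaring the goal
    recent = []
    for sib in children[head_index + 1:]:
        recent = (recent + [sib])[-h:]  # remember only the last h siblings
        new_state = f"<{parent}: [{head}]... {' '.join(recent)}>"
        rules.append((new_state, [state, sib]))
        state = new_state
    rules.append((parent, [state]))  # final unary to finish the constituent
    return rules
```

Running this on VP → VBZ NP PP PP with h = 1 yields the five rules of figure 1; note that the two PP generations share one intermediate state, which is exactly how markovization combats rule sparsity.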

For variable horizontal histories, we did not split intermediate states below 10 occurrences of a symbol. For example, if the symbol ⟨VP: [VBZ]... PP PP⟩ were too rare, we would collapse it to ⟨VP: [VBZ]... PP⟩. For vertical histories, we used a cutoff which included both frequency and mutual information between the history and the expansions (this was not appropriate for the horizontal case because MI is unreliable at such low counts).

Figure 2 shows parsing accuracies as well as the number of symbols in each markovization. These symbol counts include all the intermediate states which represent partially completed constituents. The general trend is that, in the absence of further annotation, more vertical annotation is better – even exhaustive grandparent annotation. This is not true for horizontal markovization, where the variable-order second-order model was superior. The best entry, v = 3, h ≤ 2, has an F1 of 79.74, already a substantial improvement over the baseline.

In the remaining sections, we discuss other annotations which increasingly split the symbol space. Since we expressly do not smooth the grammar, not all splits are guaranteed to be beneficial, and not all sets of useful splits are guaranteed to co-exist well. In particular, while v = 3, h ≤ 2 markovization is good on its own, it has a large number of states and does not tolerate further splitting well. Therefore, we base all further exploration on the v ≤ 2, h ≤ 2 grammar. Although it does not necessarily jump out of the grid at first glance, this point represents the best compromise between a compact grammar and useful markov histories.

Figure 3: Size and devset performance of the cumulatively annotated models, starting with the markovized baseline. The right two columns show the change in F1 from the baseline for each annotation introduced, both cumulatively and for each single annotation applied to the baseline in isolation.

  Annotation               Size    Cumulative F1   Cumulative ΔF1   Indiv. ΔF1
  Baseline (v ≤ 2, h ≤ 2)   7619   77.77           –                –
  UNARY-INTERNAL            8065   78.32           0.55             0.55
  UNARY-DT                  8066   78.48           0.71             0.17
  UNARY-RB                  8069   78.86           1.09             0.43
  TAG-PA                    8520   80.62           2.85             2.52
  SPLIT-IN                  8541   81.19           3.42             2.12
  SPLIT-AUX                 9034   81.66           3.89             0.57
  SPLIT-CC                  9190   81.69           3.92             0.12
  SPLIT-%                   9255   81.81           4.04             0.15
  TMP-NP                    9594   82.25           4.48             1.07
  GAPPED-S                  9741   82.28           4.51             0.17
  POSS-NP                   9820   83.06           5.29             0.28
  SPLIT-VP                 10499   85.72           7.95             1.36
  BASE-NP                  11660   86.04           8.27             0.73
  DOMINATES-V              14097   86.91           9.14             1.42
  RIGHT-REC-NP             15276   87.04           9.27             1.94

3 External vs. Internal Annotation

The two major previous annotation strategies, parent annotation and head lexicalization, can be seen as instances of external and internal annotation, respectively. Parent annotation lets us indicate an important feature of the external environment of a node which influences the internal expansion of that node. On the other hand, lexicalization is a (radical) method of marking a distinctive aspect of the otherwise hidden internal contents of a node which influence the external distribution. Both kinds of annotation can be useful. To identify split states, we add suffixes of the form -X to mark internal content features, and ˆX to mark external features.

To illustrate the difference, consider unary productions. In the raw grammar, there are many unaries, and once any major category is constructed over a span, most others become constructible as well using unary chains (see Klein and Manning (2001) for discussion). Such chains are rare in real treebank trees: unary rewrites only appear in very specific contexts, for example S complements of verbs where the S has an empty, controlled subject. Figure 4 shows an erroneous output of the parser, using the baseline markovized grammar. Intuitively, there are several reasons this parse should be ruled out, but one is that the lower S slot, which is intended primarily for S complements of communication verbs, is not a unary rewrite position (such complements usually have subjects). It would therefore be natural to annotate the trees so as to confine unary productions to the contexts in which they are actually appropriate. We tried two annotations. First, UNARY-INTERNAL marks (with a -U) any nonterminal node which has only one child. In isolation, this resulted in an absolute gain of 0.55% (see figure 3). The same sentence, parsed using only the baseline and UNARY-INTERNAL, is parsed correctly, because the VP rewrite in the incorrect parse ends with an SˆVP-U with very low probability.[8]

Figure 4: An error which can be resolved with the UNARY-INTERNAL annotation (incorrect baseline parse shown; tree diagram omitted).

[8] Note that when we show such trees, we generally only show one annotation on top of the baseline at a time. Moreover, we do not explicitly show the binarization implicit in the horizontal markovization.

Alternately, UNARY-EXTERNAL marked nodes which had no siblings with ˆU. It was similar to UNARY-INTERNAL in solo benefit (0.01% worse), but provided far less marginal benefit on top of other later features (none at all on top of UNARY-INTERNAL for our top models), and was discarded.[9]

[9] These two are not equivalent even given infinite data.

One restricted place where external unary annotation was very useful, however, was at the preterminal level, where internal annotation was meaningless. One distributionally salient tag conflation in the Penn treebank is the identification of demonstratives (that, those) and regular determiners (the, a). Splitting DT tags based on whether they were only children (UNARY-DT) captured this distinction. The same external unary annotation was even more effective when applied to adverbs (UNARY-RB), distinguishing, for example, as well from also. Beyond these cases, unary tag marking was detrimental. The F1 after UNARY-INTERNAL, UNARY-DT, and UNARY-RB was 78.86%.

4 Tag Splitting

The idea that part-of-speech tags are not fine-grained enough to abstract away from specific-word behaviour is a cornerstone of lexicalization. The UNARY-DT annotation, for example, showed that the determiners which occur alone are usefully distinguished from those which occur with other nominal material. This marks the DT nodes with a single bit about their immediate external context: whether there are sisters. Given the success of parent annotation for nonterminals, it makes sense to parent-annotate tags, as well (TAG-PA). In fact, as figure 3 shows, exhaustively marking all preterminals with their parent category was the most effective single annotation we tried. Why should this be useful? Most tags have a canonical category. For example, NNS tags occur under NP nodes (only 234 of 70855 do not, mostly mistakes). However, when a tag somewhat regularly occurs in a non-canonical position, its distribution is usually distinct. For example, the most common adverbs directly under ADVP are also (1599) and now (544). Under VP, they are n't (3779) and not (922). Under NP, only (215) and just (132), and so on. TAG-PA brought F1 up substantially, to 80.62%.

In addition to the adverb case, the Penn tag set conflates various grammatical distinctions that are commonly made in traditional and generative grammar, and from which a parser could hope to get useful information. For example, subordinating conjunctions (while, as, if), complementizers (that, for), and prepositions (of, in, from) all get the tag IN. Many of these distinctions are captured by TAG-PA (subordinating conjunctions occur under S and prepositions under PP), but not all are (both subordinating conjunctions and complementizers appear under SBAR). Also, there are exclusively noun-modifying prepositions (of), predominantly verb-modifying ones (as), and so on. The annotation SPLIT-IN does a linguistically motivated 6-way split of the IN tag, and brought the total to 81.19%. Figure 5 shows an example error in the baseline which is equally well fixed by either TAG-PA or SPLIT-IN. In this case, the more common nominal use of works is preferred unless the IN tag is annotated to allow if to prefer S complements.

Figure 5: An error resolved with the TAG-PA annotation (of the IN tag): (a) the incorrect baseline parse and (b) the correct TAG-PA parse. SPLIT-IN also resolves this error. (tree diagrams omitted)

We also got value from three other annotations which subcategorized tags for specific lexemes. First we split off auxiliary verbs with the SPLIT-AUX annotation, which appends ˆBE to all forms of be and ˆHAVE to all forms of have.[10] More minorly, SPLIT-CC marked conjunction tags to indicate whether or not they were the strings [Bb]ut or &, each of which have distinctly different distributions from other conjunctions. Finally, we gave the percent sign (%) its own tag, in line with the dollar sign ($) already having its own. Together these three annotations brought the F1 to 81.81%.

[10] This is an extended uniform version of the partial auxiliary annotation of Charniak (1997), wherein all auxiliaries are marked as AUX and a -G is added to gerund auxiliaries and gerund VPs.
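For illustration, the TAG-PA transform can be sketched as a recursive tree rewrite. The (label, children) tuple encoding below is our own, not the paper's representation.

```python
# Sketch of TAG-PA: every preterminal tag is annotated with its parent
# category (tag^parent). Trees are (label, children) tuples, where a
# preterminal's children list holds a single word string.

def tag_pa(label, children, parent="ROOT"):
    if len(children) == 1 and isinstance(children[0], str):
        # Preterminal: mark the external context (its parent category).
        return (f"{label}^{parent}", children)
    # Nonterminal: recurse, passing this node down as the parent.
    return (label, [tag_pa(l, c, parent=label) for (l, c) in children])
```

On a toy tree for "the dog barks", DT and NN become DTˆNP and NNˆNP while VBZ becomes VBZˆVP, so the same tag splits by its phrasal context, which is exactly the distributional distinction the text describes.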

5 What is an Unlexicalized Grammar?

Around this point, we must address exactly what we mean by an unlexicalized PCFG. To the extent that we go about subcategorizing POS categories, many of them might come to represent a single word. One might thus feel that the approach of this paper is to walk down a slippery slope, and that we are merely arguing degrees. However, we believe that there is a fundamental qualitative distinction, grounded in linguistic practice, between what we see as permitted in an unlexicalized PCFG as against what one finds and hopes to exploit in lexicalized PCFGs. The division rests on the traditional distinction between function words (or closed-class words) and content words (or open-class or lexical words). It is standard practice in linguistics, dating back decades, to annotate phrasal nodes with important function-word distinctions, for example to have a CP[for] or a PP[to], whereas content words are not part of grammatical structure, and one would not have special rules or constraints for an NP[stocks], for example. We follow this approach in our model: various closed classes are subcategorized to better represent important distinctions, and important features commonly expressed by function words are annotated onto phrasal nodes (such as whether a VP is finite, or a participle, or an infinitive clause). However, no use is made of lexical class words, to provide either monolexical or bilexical probabilities.[11]

[11] It should be noted that we started with four tags in the Penn treebank tagset that rewrite as a single word: EX (there), WP$ (whose), # (the pound sign), and TO (to), and some others, such as WP, POS, and some of the punctuation tags, which rewrite as barely more. To the extent that we subcategorize tags, there will be more such cases, but many of them already exist in other tag sets. For instance, many tag sets, such as the Brown and CLAWS (c5) tagsets, give a separate set of tags to each form of the verbal auxiliaries be, do, and have, most of which rewrite as only a single word (and any corresponding contractions).

At any rate, we have kept ourselves honest by estimating our models exclusively by maximum likelihood estimation over our subcategorized grammar, without any form of interpolation or shrinkage to unsubcategorized categories (although we do markovize rules, as explained above). This effectively means that the subcategories that we break off must themselves be very frequent in the language. In such a framework, if we try to annotate categories with any detailed lexical information, many sentences either entirely fail to parse, or have only extremely weird parses. The resulting battle against sparsity means that we can only afford to make a few distinctions which have major distributional impact. Even with the individual-lexeme annotations in this section, the grammar still has only 9255 states, compared to the 7619 of the baseline model.

6 Annotations Already in the Treebank

At this point, one might wonder as to the wisdom of stripping off all treebank functional tags, only to heuristically add other such markings back in to the grammar. By and large, the treebank's out-of-the-package tags, such as PP-LOC or ADVP-TMP, have negative utility. Recall that the raw treebank grammar, with no annotation or markovization, had an F1 of 72.62% on our development set. With the functional annotation left in, this drops to 71.49%. The v ≤ 2, h ≤ 2 markovization baseline of 77.77% dropped even further, all the way to 72.87%, when these annotations were included.

Nonetheless, some distinctions present in the raw treebank trees were valuable. For example, an NP with an S parent could be either a temporal NP or a subject. For the annotation TMP-NP, we retained the original -TMP tags on NPs, and, furthermore, propagated the tag down to the tag of the head of the NP. This is illustrated in figure 6, which also shows an example of its utility, clarifying that CNN last night is not a plausible constituent and facilitating the otherwise unusual high attachment of the smaller NP. TMP-NP brought the cumulative F1 to 82.25%. Note that this technique of pushing the functional tags down to preterminals might be useful more generally; for example, locative PPs expand roughly the same way as all other PPs (usually as IN NP), but they do tend to have different prepositions below IN.

Figure 6: An error resolved with the TMP-NP annotation: (a) the incorrect baseline parse and (b) the correct TMP-NP parse. (tree diagrams omitted)
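The TMP-NP propagation just described might be sketched as follows. This is an illustrative approximation: we take the rightmost NN* daughter as the head, whereas the real head-finding rules are more involved, and the tuple encoding is our own.

```python
# Sketch of pushing an NP's -TMP function tag down to the tag of its
# head, approximated here as the rightmost NN* daughter.

def propagate_tmp(label, children):
    if children and isinstance(children[0], str):
        return (label, children)  # preterminal: nothing to do
    new_children = [propagate_tmp(l, c) for (l, c) in children]
    if label == "NP-TMP":
        # Find the (approximate) head and mark its tag as temporal.
        for i in range(len(new_children) - 1, -1, -1):
            head_label, head_kids = new_children[i]
            if head_label.startswith("NN"):
                new_children[i] = (head_label + "^TMP", head_kids)
                break
    return (label, new_children)
```

On an NP-TMP over "last night", the head tag NN becomes NNˆTMP, matching the figure-6 trees in which the temporal marking reaches the preterminal level.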

A second kind of information in the original trees is the presence of empty elements. Following Collins (1999), the annotation GAPPED-S marks S nodes which have an empty subject (i.e., raising and control constructions). This brought F1 to 82.28%.

7 Head Annotation

The notion that the head word of a constituent can affect its behavior is a useful one. However, often the head tag is as good (or better) an indicator of how a constituent will behave.[12] We found several head annotations to be particularly effective. First, possessive NPs have a very different distribution than other NPs – in particular, NP → NP α rules are only used in the treebank when the leftmost child is possessive (as opposed to other imaginable uses like for New York lawyers, which is left flat). To address this, POSS-NP marked all possessive NPs. This brought the total F1 to 83.06%. Second, the VP symbol is very overloaded in the Penn treebank, most severely in that there is no distinction between finite and infinitival VPs. An example of the damage this conflation can do is given in figure 7, where one needs to capture the fact that present-tense verbs do not generally take bare infinitive VP complements. To allow the finite/non-finite distinction, and other verb type distinctions, SPLIT-VP annotated all VP nodes with their head tag, merging all finite forms to a single tag VBF. In particular, this also accomplished Charniak's gerund-VP marking. This was extremely useful, bringing the cumulative F1 to 85.72%, a 2.66% absolute improvement (more than its solo improvement over the baseline).

Figure 7: An error resolved with the SPLIT-VP annotation: (a) the incorrect baseline parse and (b) the correct SPLIT-VP parse. (tree diagrams omitted)

8 Distance

Error analysis at this point suggested that many remaining errors were attachment level and conjunction scope. While these kinds of errors are undoubtedly profitable targets for lexical preference, most attachment mistakes were overly high attachments, indicating that the overall right-branching tendency of English was not being captured. Indeed, this tendency is a difficult trend to capture in a PCFG because often the high and low attachments involve the very same rules. Even if not, attachment height is not modeled by a PCFG unless it is somehow explicitly encoded into category labels. More complex parsing models have indirectly overcome this by modeling distance (rather than height).

Linear distance is difficult to encode in a PCFG – marking nodes with the size of their yields massively multiplies the state space.[13] Therefore, we wish to find indirect indicators that distinguish high attachments from low ones. In the case of two PPs following an NP, with the question of whether the second PP is a second modifier of the leftmost NP or should attach lower, inside the first PP, the important distinction is usually that the lower site is a non-recursive base NP. Collins (1999) captures this notion by introducing the notion of a base NP, in which any NP which dominates only preterminals is marked with a -B. Further, if an NP-B does not have a non-base NP parent, it is given one with a unary production. This was helpful, but substantially less effective than marking base NPs without introducing the unary, whose presence actually erased a useful internal indicator – base NPs are more frequent in subject position than object position, for example. In isolation, the Collins method actually hurt the baseline (an absolute cost to F1 of 0.37%), while skipping the unary insertion added an absolute 0.73% to the baseline, and brought the cumulative F1 to 86.04%.

In the case of attachment of a PP to an NP either above or inside a relative clause, the high NP is distinct from the low one in that the already modified one contains a verb (and the low one may be a base NP as well). This is a partial explanation of the utility of verbal distance in Collins (1999).
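The base-NP marking used here (Collins' -B marking, but without the extra unary NP) can be sketched as follows; the tuple tree encoding is our own illustrative choice.

```python
# Sketch of base-NP marking: any NP dominating only preterminals is
# relabeled NP-B, *without* inserting Collins' extra unary NP above it
# (the variant the text finds more effective).

def mark_base_nps(label, children):
    if children and isinstance(children[0], str):
        return (label, children)  # preterminal: leave unchanged
    new_children = [mark_base_nps(l, c) for (l, c) in children]
    if label == "NP" and all(
        kids and isinstance(kids[0], str) for (_, kids) in children
    ):
        label = "NP-B"  # all daughters are preterminals: a base NP
    return (label, new_children)
```

A flat NP like "the dog" becomes NP-B, while a recursive NP such as "the dog of straw" keeps its NP label and only its internal NPs are marked, preserving the internal subject/object frequency cue the text mentions.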
To useful, bringing the cumulative F1 to 85.72%, 2.66% absolute improvement (more than its solo improve- 13The inability to encode distance naturally in a naive PCFG ment over the baseline). is somewhat ironic. In the heart of any PCFG parser, the funda- mental table entry or chart item is a label over a span, for ex- 12This is part of the explanation of why (Charniak, 2000) ample an NP from position 0 to position 5. The concrete use of finds that early generation of head tags as in (Collins, 1999) a grammar rule is to take two adjacent span-marked labels and is so beneficial. The rest of the benefit is presumably in the combine them (for example NP[0,5] and VP[5,12] into S[0,12]). availability of the tags for smoothing purposes. Yet, only the labels are used to score the combination. Length ≤ 40 LPLRF1 Exact CB 0CB Acknowledgements Magerman (1995) 84.9 84.6 1.26 56.6 Collins (1996) 86.3 85.8 1.14 59.9 This paper is based on work supported (in part) this paper 86.9 85.7 86.3 30.9 1.10 60.3 by the National Science Foundation under Grant No. Charniak (1997) 87.4 87.5 1.00 62.1 Collins (1999) 88.7 88.6 0.90 67.1 IIS-0085896.
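Annotations like POSS-NP and SPLIT-VP are deterministic relabelings of treebank trees. As an illustrative sketch only (not the authors' implementation; the tuple tree encoding and the first-verbal-daughter head rule are simplifying assumptions), a SPLIT-VP-style transform might look like:

```python
# Illustrative sketch of a SPLIT-VP-style state split (not the paper's code).
# A tree is (label, children) for internal nodes and (tag, word) for leaves.

FINITE_TAGS = {"VBD", "VBP", "VBZ", "MD"}   # merged to a single VBF mark
VERBAL_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}

def is_preterminal(node):
    """True for (tag, word) leaf nodes."""
    return isinstance(node[1], str)

def split_vp(node):
    """Annotate each VP with its head tag, merging finite tags to VBF.
    Uses a crude head rule (first verbal daughter) for illustration."""
    if is_preterminal(node):
        return node
    label, children = node
    children = [split_vp(c) for c in children]
    if label == "VP":
        head = next((c[0] for c in children if c[0] in VERBAL_TAGS), None)
        if head is not None:
            label = "VP-VBF" if head in FINITE_TAGS else "VP-" + head
    return (label, children)

# Example: a finite VP is relabeled VP-VBF, so the split grammar can learn
# that present-tense verbs do not take bare-infinitive VP complements.
tree = ("S", [("NP", [("NNP", "He")]),
              ("VP", [("VBZ", "buys"), ("NP", [("NN", "stock")])])])
print(split_vp(tree)[1][1][0])   # VP-VBF
```

Because the transform only touches node labels, the annotated trees can be read off into PCFG rules exactly as before; the split simply multiplies the VP state.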

In the case of attachment of a PP to an NP either above or inside a relative clause, the high NP is distinct from the low one in that the already modified one contains a verb (and the low one may be a base NP as well). This is a partial explanation of the utility of verbal distance in Collins (1999). To capture this, DOMINATES-V marks all nodes which dominate any verbal node (V*, MD) with a -V. This brought the cumulative F1 to 86.91%. We also tried marking nodes which dominated prepositions and/or conjunctions, but these features did not help the cumulative hill-climb.

The final distance/depth feature we used was an explicit attempt to model depth, rather than use distance and linear intervention as a proxy. With RIGHT-REC-NP, we marked all NPs which contained another NP on their right periphery (i.e., as a rightmost descendant). This captured some further attachment trends, and brought us to a final development F1 of 87.04%.
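Like the head annotations, DOMINATES-V is a simple bottom-up tree transform. A minimal sketch, again under the assumed (label, children) tuple encoding rather than any real treebank API:

```python
# Illustrative sketch of the DOMINATES-V annotation (not the paper's code).
# A tree is (label, children) for internal nodes and (tag, word) for leaves.

def is_verbal(tag):
    """Verbal preterminals: V* (VB, VBD, VBG, VBN, VBP, VBZ) and MD."""
    return tag.startswith("VB") or tag == "MD"

def dominates_v(node):
    """Return (annotated_node, has_verb): mark with -V every internal node
    that dominates a verbal tag anywhere below it."""
    if isinstance(node[1], str):              # (tag, word) leaf
        return node, is_verbal(node[0])
    label, children = node
    results = [dominates_v(c) for c in children]
    has_v = any(flag for _, flag in results)
    new_label = label + "-V" if has_v else label
    return (new_label, [c for c, _ in results]), has_v

tree = ("S", [("NP", [("DT", "the"), ("NN", "trade")]),
              ("VP", [("VBD", "rose"), ("NP", [("CD", "5"), ("NN", "%")])])])
annotated, _ = dominates_v(tree)
print(annotated[0])   # S-V  (the S dominates the verb "rose")
```

Note that the subject NP stays unmarked while the VP and S gain -V, which is exactly the asymmetry that lets the split grammar distinguish an already-verb-containing (high) attachment site from a verbless (low) one.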
9 Final Results

We took the final model and used it to parse section 23 of the treebank. Figure 8 shows the results. The test set F1 is 86.32% for sentences of ≤ 40 words, already higher than that of early lexicalized models, though of course lower than that of the state-of-the-art parsers.

Figure 8: Results of the final model on the test set (section 23).

Length ≤ 40        LP    LR    F1    Exact  CB    0CB
Magerman (1995)    84.9  84.6  –     –      1.26  56.6
Collins (1996)     86.3  85.8  –     –      1.14  59.9
this paper         86.9  85.7  86.3  30.9   1.10  60.3
Charniak (1997)    87.4  87.5  –     –      1.00  62.1
Collins (1999)     88.7  88.6  –     –      0.90  67.1

Length ≤ 100       LP    LR    F1    Exact  CB    0CB
this paper         86.3  85.1  85.7  28.8   1.31  57.2

10 Conclusion

The advantages of unlexicalized grammars are clear enough – they are easy to estimate, easy to parse with, and time- and space-efficient. However, the dismal performance of basic unannotated unlexicalized grammars has generally rendered those advantages irrelevant. Here, we have shown that, surprisingly, the maximum-likelihood estimate of a compact unlexicalized PCFG can parse on par with early lexicalized parsers. We do not want to argue that lexical selection is not a worthwhile component of a state-of-the-art parser – certain attachments, at least, require it – though perhaps its necessity has been overstated. Rather, we have shown ways to improve parsing, some easier than lexicalization and others orthogonal to it, which could presumably be used to benefit lexicalized parsers as well.

Acknowledgements

This paper is based on work supported (in part) by the National Science Foundation under Grant No. IIS-0085896.

References

James K. Baker. 1979. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550.

Taylor L. Booth and Richard A. Thomson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22:442–450.

Sharon A. Caraballo and Eugene Charniak. 1998. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24:275–298.

Eugene Charniak, Sharon Goldwater, and Mark Johnson. 1998. Edge-based best-first chart parsing. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 127–133.

Eugene Charniak. 1996. Tree-bank grammars. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 1031–1036.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence, pages 598–603.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In NAACL 1, pages 132–139.

Eugene Charniak. 2001. Immediate-head parsing for language models. In ACL 39.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.

Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In ACL 34, pages 184–191.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head-automaton grammars. In ACL 37, pages 457–464.

Marilyn Ford, Joan Bresnan, and Ronald M. Kaplan. 1982. A competence-based theory of syntactic closure. In Joan Bresnan, editor, The Mental Representation of Grammatical Relations, pages 727–796. MIT Press, Cambridge, MA.

Daniel Gildea. 2001. Corpus variation and parser performance. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Donald Hindle and Mats Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19(1):103–120.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613–632.

Dan Klein and Christopher D. Manning. 2001. Parsing with treebank grammars: Empirical bounds, theoretical models, and the structure of the Penn treebank. In ACL 39/EACL 10.

David M. Magerman. 1995. Statistical decision-tree models for parsing. In ACL 33, pages 276–283.

Andrew Radford. 1988. Transformational Grammar. Cambridge University Press, Cambridge.

Dana Ron, Yoram Singer, and Naftali Tishby. 1994. The power of amnesia. In Advances in Neural Information Processing Systems, volume 6, pages 176–183. Morgan Kaufmann.