Automatic Extraction of Subcategorization from Corpora
Ted Briscoe
Computer Laboratory, University of Cambridge
Pembroke Street, Cambridge CB2 3QG, UK
ejb@cl.cam.ac.uk

John Carroll
Cognitive and Computing Sciences, University of Sussex
Brighton BN1 9QH, UK
john.carroll@cogs.susx.ac.uk

Abstract

We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount. [1]

[1] This work was supported by UK DTI/SALT project 41/5808 'Integrated Language Database', CEC Telematics Applications Programme project LE1-2111 'SPARKLE: Shallow PARsing and Knowledge extraction for Language Engineering', and by SERC/EPSRC Advanced Fellowships to both authors. We would like to thank the COMLEX Syntax development team for allowing us access to pre-release data (for an early experiment), and for useful feedback.

1 Motivation

Predicate subcategorization is a key component of a lexical entry, because most, if not all, recent syntactic theories 'project' syntactic structure from the lexicon. Therefore, a wide-coverage parser utilizing such a lexicalist grammar must have access to an accurate and comprehensive dictionary encoding (at a minimum) the number and category of a predicate's arguments, and ideally also information about control with predicative arguments, semantic selection preferences on arguments, and so forth, to allow the recovery of the correct predicate-argument structure. If the parser uses statistical techniques to rank analyses, it is also critical that the dictionary encode the relative frequency of distinct subcategorization classes for each predicate.

Several substantial machine-readable subcategorization dictionaries exist for English, either built largely automatically from machine-readable versions of conventional learners' dictionaries, or manually by (computational) linguists (e.g. the Alvey NL Tools (ANLT) dictionary, Boguraev et al. (1987); the COMLEX Syntax dictionary, Grishman et al. (1994)). Unfortunately, neither approach can yield a genuinely accurate or comprehensive computational lexicon, because both rest ultimately on the manual efforts of lexicographers / linguists and are, therefore, prone to errors of omission and commission which are hard or impossible to detect automatically (e.g. Boguraev & Briscoe, 1989; see also section 3.1 below for an example). Furthermore, manual encoding is labour intensive and, therefore, it is costly to extend it to neologisms, information not currently encoded (such as relative frequency of different subcategorizations), or other (sub)languages. These problems are compounded by the fact that predicate subcategorization is closely associated with lexical sense, and the senses of a word change between corpora, sublanguages and/or subject domains (Jensen, 1991).

In a recent experiment with a wide-coverage parsing system utilizing a lexicalist grammatical framework, Briscoe & Carroll (1993) observed that half of parse failures on unseen test data were caused by inaccurate subcategorization information in the ANLT dictionary. The close connection between sense and subcategorization, and between subject domain and sense, makes it likely that a fully accurate 'static' subcategorization dictionary of a language is unattainable in any case.
Moreover, although Schabes (1992) and others have proposed 'lexicalized' probabilistic grammars to improve the accuracy of parse ranking, no wide-coverage parser has yet been constructed incorporating probabilities of different subcategorizations for individual predicates, because of the problems of accurately estimating them.

These problems suggest that automatic construction or updating of subcategorization dictionaries from textual corpora is a more promising avenue to pursue. Preliminary experiments acquiring a few verbal subcategorization classes have been reported by Brent (1991, 1993), Manning (1993), and Ushioda et al. (1993). In these experiments the maximum number of distinct subcategorization classes recognized is sixteen, and only Ushioda et al. attempt to derive relative subcategorization frequency for individual predicates.

We describe a new system capable of distinguishing 160 verbal subcategorization classes, a superset of those found in the ANLT and COMLEX Syntax dictionaries. The classes also incorporate information about control of predicative arguments and alternations such as particle movement and extraposition. We report an initial experiment which demonstrates that this system is capable of acquiring the subcategorization classes of verbs and the relative frequencies of these classes with comparable accuracy to the less ambitious extant systems. We achieve this performance by exploiting a more sophisticated robust statistical parser which yields complete though 'shallow' parses, a more comprehensive subcategorization class classifier, and a priori estimates of the probability of membership of these classes. We also describe a small-scale experiment which demonstrates that subcategorization class frequency information for individual verbs can be used to improve parsing accuracy.

2 Description of the System

2.1 Overview

The system consists of the following six components which are applied in sequence to sentences containing a specific predicate in order to retrieve a set of subcategorization classes for that predicate (a schematic sketch of the pipeline is given after the list):

1. A tagger, a first-order HMM part-of-speech (PoS) and punctuation tag disambiguator, is used to assign and rank tags for each word and punctuation token in sequences of sentences (Elworthy, 1994).

2. A lemmatizer is used to replace word-tag pairs with lemma-tag pairs, where a lemma is the morphological base or dictionary headword form appropriate for the word, given the PoS assignment made by the tagger. We use an enhanced version of the GATE project stemmer (Cunningham et al., 1995).

3. A probabilistic LR parser, trained on a treebank, returns ranked analyses (Briscoe & Carroll, 1993; Carroll, 1993, 1994), using a grammar written in a feature-based unification grammar formalism which assigns 'shallow' phrase structure analyses to tag networks (or 'lattices') returned by the tagger (Briscoe & Carroll, 1994, 1995; Carroll & Briscoe, 1996).

4. A patternset extractor which extracts subcategorization patterns, including the syntactic categories and head lemmas of constituents, from sentence subanalyses which begin/end at the boundaries of (specified) predicates.

5. A pattern classifier which assigns patterns in patternsets to subcategorization classes or rejects patterns as unclassifiable on the basis of the feature values of syntactic categories and the head lemmas in each pattern.

6. A patternsets evaluator which evaluates sets of patternsets gathered for a (single) predicate, constructing putative subcategorization entries and filtering the latter on the basis of their reliability and likelihood.
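To make the sequencing of these six components concrete, the sketch below shows one way the control flow could be wired together in Python. It is illustrative only, written under the assumption that each component is supplied as a callable: the function name acquire_subcat_entries and the parameters tag, lemmatize, parse_ranked, extract_patterns, classify and evaluate are hypothetical stand-ins, not the actual implementation described in this paper.

```python
# Illustrative sketch only: the six callables stand in for the components
# described above and must be supplied by the caller; none of the names or
# signatures below are taken from the actual system.
def acquire_subcat_entries(sentences, predicate,
                           tag, lemmatize, parse_ranked,
                           extract_patterns, classify, evaluate):
    """Apply the six components in sequence to sentences containing
    `predicate` and return putative subcategorization entries for it."""
    patternsets = []
    for sentence in sentences:
        tagged = tag(sentence)                             # 1. PoS/punctuation tagging
        lemmas = lemmatize(tagged)                         # 2. lemma-tag pairs
        analyses = parse_ranked(lemmas)                    # 3. ranked 'shallow' parses
        patterns = extract_patterns(analyses, predicate)   # 4. patternset extraction
        classes = [classify(p) for p in patterns]          # 5. classify or reject patterns
        patternsets.append([c for c in classes if c is not None])
    return evaluate(patternsets, predicate)                # 6. build and filter entries
```

The per-predicate loop shown here is for exposition only; as described below, the corpus is in fact tagged, lemmatized and parsed just once, with patternsets for all predicates extracted in a single pass.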
For example, building entries for attribute, and given that one of the sentences in our data was (1a), the tagger and lemmatizer return (1b).

(1) a He attributed his failure, he said, to no<blank>one buying his books.
    b he_PPHS1 attribute_VVD his_APP$ failure_NN1 ,_, he_PPHS1 say_VVD ,_, to_II no<blank>one_PN buy_VVG his_APP$ book_NN2

(1b) is parsed successfully by the probabilistic LR parser, and the ranked analyses are returned. Then the patternset extractor locates the subanalyses containing attribute and constructs a patternset. The highest ranked analysis and pattern for this example are shown in Figure 1. [2]

[2] The analysis shows only category aliases rather than sets of feature-value pairs. Ta represents a text adjunct delimited by commas (Nunberg, 1990; Briscoe & Carroll, 1994). Tokens in the patternset are indexed by sequential position in the sentence so that two or more tokens of the same type

Patterns encode the value of the VSUBCAT feature from the VP rule and the head lemma(s) of each argument. In the case of PP arguments, the pattern also encodes the value of PSUBCAT from the PP rule and the head lemma(s) of its complement(s). In the next stage of processing, patterns are classified, in this case giving the subcategorization class corresponding to transitive plus PP with non-finite clausal complement.

The system could be applied to corpus data by first sorting sentences into groups containing instances of a specified predicate, but we use a different strategy since it is more efficient to tag, lemmatize and parse a corpus just once, extracting patternsets for all predicates in each sentence; then to classify the patterns in all patternsets; and finally, to sort and recombine patternsets into sets of patternsets, one set for each distinct predicate containing patternsets of just the patterns relevant to that predicate. The tagger, lemmatizer, grammar and parser have been described elsewhere (see previous references), so we provide only brief relevant details here, concentrating on the description of the components
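As an illustration of the pattern representation and classification step described above, here is a minimal sketch of how the pattern extracted for example (1) and a toy classifier rule for it might be encoded. The field names and the feature values "NP_PP" and "SING" are assumptions made for this illustration only; the actual grammar's VSUBCAT and PSUBCAT values, the pattern data structure, and the classifier's rules are not reproduced here.

```python
# Minimal sketch of a subcategorization pattern and a toy classifier rule.
# Field names and the feature values "NP_PP" / "SING" are hypothetical;
# the real grammar's VSUBCAT/PSUBCAT values and class inventory differ.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Argument:
    category: str                  # syntactic category alias, e.g. "NP" or "PP"
    head: str                      # head lemma of the argument
    psubcat: Optional[str] = None  # PSUBCAT value, recorded for PP arguments only
    comp_heads: List[str] = field(default_factory=list)  # head lemma(s) of a PP's complement

@dataclass
class Pattern:
    verb: str                      # predicate the pattern was extracted for
    vsubcat: str                   # VSUBCAT value from the VP rule
    args: List[Argument]

def classify(pattern: Pattern) -> Optional[str]:
    """Toy stand-in for the pattern classifier: map category and feature
    values to a subcategorization class, or return None (unclassifiable)."""
    cats = [a.category for a in pattern.args]
    if pattern.vsubcat == "NP_PP" and cats == ["NP", "PP"]:
        pp = pattern.args[1]
        if pp.psubcat == "SING":   # hypothetical value marking a non-finite clausal complement
            return "transitive + PP with non-finite clausal complement"
        return "transitive + PP"
    return None

# Pattern for example (1): attribute <his failure> <to no<blank>one buying his books>
p = Pattern(verb="attribute", vsubcat="NP_PP",
            args=[Argument("NP", "failure"),
                  Argument("PP", "to", psubcat="SING", comp_heads=["buy"])])
print(classify(p))   # -> transitive + PP with non-finite clausal complement
```

In the actual system the classifier distinguishes 160 classes and can also reject a pattern as unclassifiable, which the None return value mimics here.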