AUTOMATIC ACQUISITION OF SUBCATEGORIZATION FRAMES FROM UNTAGGED TEXT

Michael R. Brent MIT AI Lab 545 Technology Square Cambridge, Massachusetts 02139 [email protected]

ABSTRACT

This paper describes an implemented program that takes a raw, untagged text corpus as its only input (no open-class dictionary) and generates a partial list of verbs occurring in the text and the subcategorization frames (SFs) in which they occur. Verbs are detected by a novel technique based on the Case Filter of Rouvret and Vergnaud (1980). The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. Five SFs are currently detected and more are planned. Ultimately, I expect to provide a large SF dictionary to the NLP community and to train dictionaries for specific corpora.

1 INTRODUCTION

This paper describes an implemented program that takes an untagged text corpus and generates a partial list of verbs occurring in it and the subcategorization frames (SFs) in which they occur. So far, it detects the five SFs shown in Table 1.

  SF Description               Good Example            Bad Example
  direct object                greet them              *arrive them
  direct object & clause       tell him he's a fool    *hope him he's a fool
  direct object & infinitive   want him to attend      *hope him to attend
  clause                       know I'll attend        *want I'll attend
  infinitive                   hope to attend          *greet to attend

  Table 1: The five subcategorization frames (SFs) detected so far

The SF acquisition program has been tested on a corpus of 2.6 million words of the Wall Street Journal (kindly provided by the Penn Tree Bank project). On this corpus, it makes 5101 observations about 2258 orthographically distinct verbs. False positive rates vary from one to three percent of observations, depending on the SF.

1.1 WHY IT MATTERS

Accurate parsing requires knowing the subcategorization frames of verbs, as shown by (1).

(1) a. I expected [NP the man who smoked] to eat ice-cream
    b. I doubted [NP the man who liked to eat ice-cream]

Current high-coverage parsers tend to use either custom, hand-generated lists of subcategorization frames (e.g., Hindle, 1983), or published, hand-generated lists like the Oxford Advanced Learner's Dictionary of Contemporary English, Hornby and Covey (1973) (e.g., DeMarcken, 1990). In either case, such lists are expensive to build and to maintain in the face of evolving usage. In addition, they tend not to include rare usages or specialized vocabularies like financial or military jargon. Further, they are often incomplete in arbitrary ways. For example, Webster's Ninth New Collegiate Dictionary lists the sense of strike meaning "to occur to", as in "it struck him that ...", but it does not list that same sense of hit. (My program discovered both.)

1.2 WHY IT'S HARD

The initial priorities in this research were:

  * Generality (e.g., minimal assumptions about the text)
  * Accuracy in identifying SF occurrences
  * Simplicity of design and speed

Efficient use of the available text was not a high priority, since it was felt that plenty of text was available even for an inefficient learner, assuming sufficient speed to make use of it.

These priorities had a substantial influence on the approach taken. They are evaluated in retrospect in Section 4.

The first step in finding a subcategorization frame is finding a verb. Because of widespread and productive noun/verb ambiguity, dictionaries are not much use -- they do not reliably exclude the possibility of lexical ambiguity. Even if they did, a program that could only learn SFs for unambiguous verbs would be of limited value. Statistical disambiguators make dictionaries more useful, but they have a fairly high error rate, and degrade in the presence of many unfamiliar words. Further, it is often difficult to understand where the error is coming from or how to correct it. So finding verbs poses a serious challenge for the design of an accurate, general-purpose algorithm for detecting SFs.

In fact, finding main verbs is more difficult than it might seem. One problem is distinguishing main verbs from auxiliaries and participles, as shown below.

(2) a. John has [NP rented furniture]
       (comp.: John has often rented apartments)
    b. John was smashed (drunk) last night
       (comp.: John was kissed last night)
    c. John's favorite activity is watching TV
       (comp.: John's favorite child is watching TV)

In each case the main verb is have or be in a context where most parsers (and statistical disambiguators) would mistake it for an auxiliary and mistake the following word for a participial main verb.

A second challenge to accuracy is determining which verb to associate a given SF occurrence with. Paradoxically, example (1) shows that in general it isn't possible to do this without already knowing the SF. One obvious strategy would be to wait for sentences where there is only one candidate verb; unfortunately, it is very difficult to know for certain how many verbs occur in a sentence. Finding some of the verbs in a text reliably is hard enough; finding all of them reliably is well beyond the scope of this work.

Finally, any system applied to real input, no matter how carefully designed, will occasionally make errors in finding the verb and determining its subcategorization frame. The more times a given verb appears in the corpus, the more likely it is that one of those occurrences will cause an erroneous judgment. For that reason any learning system that gets only positive examples and makes a permanent judgment on a single example will always degrade as the number of occurrences increases. In fact, making a judgment based on any fixed number of examples with any finite error rate will always lead to degradation with corpus size. A better approach is to require a fixed percentage of the total occurrences of any given verb to appear with a given SF before concluding that random error is not responsible for these observations. Unfortunately, determining the cutoff percentage requires human intervention, and sampling error makes classification unstable for verbs with few occurrences in the input. The sampling error can be dealt with (Brent, 1991), but predetermined cutoff percentages still require eye-balling the data. Thus robust, unsupervised judgments in the face of error pose the third challenge to developing an accurate learning system.

1.3 HOW IT'S DONE

The architecture of the system, and that of this paper, directly reflects the three challenges described above. The system consists of three modules:

1. Verb detection: finds some occurrences of verbs using the Case Filter (Rouvret and Vergnaud, 1980), a proposed rule of grammar.
2. SF detection: finds some occurrences of five subcategorization frames using a simple, finite-state grammar for a fragment of English.
3. SF decision: determines whether a verb is genuinely associated with a given SF, or whether instead its apparent occurrences in that SF are due to error. This is done using statistical models of the frequency distributions.

The following two sections describe and evaluate the verb detection module and the SF detection module, respectively; the decision module, which is still being refined, will be described in a subsequent paper. The final two sections provide a brief comparison to related work and draw conclusions.

2 VERB DETECTION

The technique I developed for finding verbs is based on the Case Filter of Rouvret and Vergnaud (1980). The Case Filter is a proposed rule of grammar which, as it applies to English, says that every noun-phrase must appear either immediately to the left of a tensed verb, immediately to the right of a preposition, or immediately to the right of a main verb. Adverbs and adverbial phrases (including days and dates) are ignored for the purposes of case adjacency. A noun-phrase that satisfies the Case Filter is said to "get case" or "have case", while one that violates it is said to "lack case". The program judges an open-class word to be a main verb if it is adjacent to a pronoun or proper name that would otherwise lack case. Such a pronoun or proper name is either the subject or the direct object of the verb.
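As a concrete illustration, the direct-object configuration of the case-adjacency test can be sketched as follows. This is a minimal sketch under my own assumptions: the word lists, function names, and the restriction to object pronouns are illustrative, not the author's implementation.

```python
# Minimal sketch of the Case Filter heuristic (illustrative only; the
# word lists and names below are assumptions, not the paper's code).
# An object pronoun must get case from an immediately preceding
# preposition or main verb; if the word to its left is open-class and
# not a preposition, that word is judged to be a main verb.

OBJ_PRONOUNS = {"me", "him", "us", "them"}
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "from", "to"}
OTHER_CLOSED_CLASS = {"the", "a", "an", "and", "that", "is", "was", "has", "had"}

def is_open_class(word):
    w = word.lower()
    return (w not in PREPOSITIONS
            and w not in OTHER_CLOSED_CLASS
            and w not in OBJ_PRONOUNS)

def detect_main_verbs(tokens):
    """Return indices of words judged to be main verbs because the
    pronoun to their right would otherwise lack case.  Only the
    direct-object configuration is sketched here."""
    verbs = []
    for i in range(len(tokens) - 1):
        if tokens[i + 1].lower() in OBJ_PRONOUNS and is_open_class(tokens[i]):
            verbs.append(i)
    return verbs

tokens = "Mary greeted them warmly".split()
print([tokens[i] for i in detect_main_verbs(tokens)])  # ['greeted']
```

In "She spoke with them", by contrast, the pronoun already gets case from the preposition, so nothing is judged to be a verb; this mirrors why the real technique ignores noun phrases other than case-lacking pronouns and proper names.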

Other noun phrases are not used because it is too difficult to determine their right boundaries accurately.

The two criteria for evaluating the performance of the main-verb detection technique are efficiency and accuracy. Both were measured using a 2.6 million word corpus for which the Penn Treebank project provides hand-verified tags.

Efficiency of verb detection was assessed by running the SF detection module in the normal mode, where verbs were detected using the Case Filter technique, and then running it again with the Penn tags substituted for the verb detection module. The results are shown in Table 2.

  SF                           Occurrences Found   Control   Efficiency
  direct object                3,591               8,606     40%
  direct object & clause       94                  381       25%
  direct object & infinitive   310                 3,597     8%
  clause                       739                 14,144    5%
  infinitive                   367                 11,880    3%

  Table 2: Efficiency of verb detection for each of the five SFs, as
  tested on 2.6 million words of the Wall Street Journal and controlled
  by the Penn Treebank's hand-verified tagging

Note the substantial variation among the SFs: for the SFs "direct object" and "direct object & clause" efficiency is roughly 40% and 25%, respectively; for "direct object & infinitive" it drops to about 8%; and for the intransitive SFs it is under 5%. The reason that the transitive SFs fare better is that the direct object gets case from the preceding verb and hence reveals its presence -- intransitive verbs are harder to find. Likewise, clauses fare better than infinitives because their subjects get case from the main verb and hence reveal it, whereas infinitives lack overt subjects. Another obvious factor is that, for every SF listed above except "direct object", two verbs need to be found -- the matrix verb and the complement verb -- and if either one is not detected then no observation is recorded.

Accuracy was measured by looking at the Penn tag for every word that the system judged to be a verb. Of approximately 5000 verb tokens found by the Case Filter technique, there were 28 disagreements with the hand-verified tags. My system was right in 20 of these, for a 0.24% error-rate beyond the rate using hand-verified tags. Typical disagreements in which my system was right involved verbs that are ambiguous with much more frequent nouns, like mold in "The Soviet Communist Party has the power to shape corporate development and mold it into a body dependent upon it." There were several systematic constructions in which the Penn tags were right and my system was wrong, including constructions like "We consumers are ..." and pseudo-clefts like "what you then do is you make them think ...". (These examples are actual text from the Penn corpus.)

The extraordinary accuracy of verb detection -- within a tiny fraction of the rate achieved by trained human taggers -- and its relatively low efficiency are consistent with the priorities laid out in Section 1.2.

2.1 SF DETECTION

The obvious approach to finding SFs like "V NP to V" and "V to V" is to look for occurrences of just those patterns in the training corpus; but the obvious approach fails to address the attachment problem illustrated by example (1) above. The solution is based on the following insights:

  * Some examples are clear and unambiguous.
  * Observations made in clear cases generalize to all cases.
  * It is possible to distinguish the clear cases from the ambiguous
    ones with reasonable accuracy.
  * With enough examples, it pays to wait for the clear cases.

Rather than take the obvious approach of looking for "V NP to V", my approach is to wait for clear cases like "V PRONOUN to V". The advantages can be seen by contrasting (3) with (1).

(3) a. OK I expected him to eat ice-cream
    b.  * I doubted him to eat ice-cream

More generally, the system recognizes linguistic structure using a small finite-state grammar that describes only that fragment of English that is most useful for recognizing SFs. The grammar relies exclusively on closed-class lexical items such as determiners, prepositions, pronouns, and auxiliary verbs.

The grammar for detecting SFs needs to distinguish three types of complements: direct objects, infinitives, and clauses. The grammars for each of these are presented in Figure 1. Any open-class word judged to be a verb (see Section 2) and followed immediately by a match for one of these complement categories is assigned the corresponding SF.

   <...>             := that? ( <...> | <...> | his | <...> )
   <subj-pron>       := I | he | she | we | they
   <subj-obj-pron>   := you | it | yours | hers | ours | theirs
   <...>             := <...>
   <obj-pron>        := me | him | us | them
   <infinitive>      := to <previously-noted-uninflected-verb>

Figure 1: A non-recursive (finite-state) grammar for detecting certain verbal complements. "?" indicates an optional element. Any verb followed immediately by expressions matching one of the complement categories is assigned the corresponding SF. (Angle-bracketed category names marked "<...>" were lost in the source.)

Any word ending in "ly" or belonging to a list of 25 irregular adverbs is ignored for purposes of adjacency. The notation "?" follows optional expressions. The category <previously-noted-uninflected-verb> is special in that it is not fixed in advance -- open-class non-adverbs are added to it when they occur following an unambiguous modal.[1] This is the only case in which the program makes use of earlier decisions -- literally bootstrapping. Note, however, that ambiguity is possible between mass nouns and uninflected verbs, as in to fish.

Like the verb detection algorithm, the SF detection algorithm is evaluated in terms of efficiency and accuracy. The most useful estimate of efficiency is simply the density of observations in the corpus, shown in the first column of Table 3. The accuracy of SF detection is shown in the second column of Table 3.[2] The most common source of error was purpose adjuncts, as in "John quit to pursue a career in finance", which comes from omitting the in order from "John quit in order to pursue a career in finance". These purpose adjuncts were mistaken for infinitival complements. The other errors were more sporadic in nature, many coming from unusual extrapositions or other relatively rare phenomena.

Once again, the high accuracy and low efficiency are consistent with the priorities of Section 1.2. The throughput rate is currently about ten-thousand words per second on a Sparcstation 2, which is also consistent with the initial priorities. Furthermore, at ten-thousand words per second the current density of observations is not problematic.
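The clear-case strategy can be made concrete with a toy pattern matcher in the spirit of Figure 1. This is a sketch under stated assumptions: the regular expressions, the use of "-ed" endings as a stand-in for the real tensed-verb detection of Section 2, and all names are mine, not the paper's grammar.

```python
import re

# Toy clear-case SF matcher in the spirit of Figure 1 (illustrative:
# the patterns and names are mine, and "-ed" endings stand in for the
# real verb detection of Section 2).  Only unambiguous pronoun-based
# patterns are matched, so ambiguous full NPs never trigger a match.

OBJ_PRON = r"(?:me|him|us|them)"
SUBJ_PRON = r"(?:I|he|she|we|they)"

SF_PATTERNS = [
    ("direct object & infinitive", rf"\b(\w+ed) {OBJ_PRON} to \w+"),
    ("direct object & clause",     rf"\b(\w+ed) {OBJ_PRON} (?:that )?{SUBJ_PRON}"),
    ("direct object",              rf"\b(\w+ed) {OBJ_PRON}\b"),
    ("clause",                     rf"\b(\w+ed) that {SUBJ_PRON}\b"),
    ("infinitive",                 rf"\b(\w+ed) to \w+"),
]

def sf_observation(sentence):
    """Return the first (verb, SF) observation from a clear case,
    or None if the sentence contains no unambiguous pattern."""
    for sf, pattern in SF_PATTERNS:
        m = re.search(pattern, sentence)
        if m:
            return (m.group(1), sf)
    return None

print(sf_observation("I expected him to eat ice-cream"))
```

On "I expected him to eat ice-cream" this records a "direct object & infinitive" observation for expected, while a sentence like (1a), whose object is a full NP, yields no observation at all -- which is exactly the point of waiting for clear cases.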

  SF                           Occurrences Found   % Error
  direct object                3,591               1.5%
  direct object & clause       94                  2.0%
  direct object & infinitive   310                 1.5%
  clause                       739                 0.5%
  infinitive                   367                 3.0%

  Table 3: SF detector error rates as tested on 2.6 million words of
  the Wall Street Journal

[1] If there were room to store an unlimited number of uninflected verbs for later reference then the grammar formalism would not be finite-state. In fact, a fixed amount of storage, sufficient to store all the verbs in the language, is allocated. This question is purely academic, however -- a hash-table gives constant-time average performance.

[2] Error rates were computed by hand verification of 200 examples for each SF using the tagged mode. These are estimated independently of the error rates for verb detection.

3 RELATED WORK

Interest in extracting lexical and especially collocational information from text has risen dramatically in the last two years, as sufficiently large corpora and sufficiently cheap computation have become available. Three recent papers in this area are Church and Hanks (1990), Hindle (1990), and Smadja and McKeown (1990). The latter two are concerned exclusively with collocation relations between open-class words and not with grammatical properties. Church is also interested primarily in open-class collocations, but he does discuss verbs that tend to be followed by infinitives within his mutual information framework.

Mutual information, as applied by Church, is a measure of the tendency of two items to appear near one another -- their observed frequency in nearby positions is divided by the expectation of that frequency if their positions were random and independent. To measure the tendency of a verb to be followed within a few words by an infinitive, Church uses his statistical disambiguator (Church, 1988) to distinguish between to as an infinitive marker and to as a preposition. Then he measures the mutual information between occurrences of the verb and occurrences of infinitives following within a certain number of words. Unlike our system, Church's approach does not aim to decide whether or not a verb occurs with an infinitival complement -- example (1) showed that being followed by an infinitive is not the same as taking an infinitival complement. It might be interesting to try building a verb categorization scheme based on Church's mutual information measure, but to the best of our knowledge no such work has been reported.
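Church's observed-over-expected measure can be sketched numerically. The counts below are invented for illustration; only the formula (observed co-occurrence frequency divided by the frequency expected under independence, on a log scale) comes from the description above.

```python
import math

# Mutual information in the sense described above: the observed
# frequency of the verb-infinitive pair divided by the frequency
# expected if the two were positioned independently.  The counts
# here are invented for illustration.

def mutual_information(pair_count, verb_count, inf_count, n):
    observed = pair_count / n
    expected = (verb_count / n) * (inf_count / n)
    return math.log2(observed / expected)

# A verb seen 1,000 times and infinitives seen 5,000 times among
# 1,000,000 window positions, co-occurring 800 times:
print(round(mutual_information(800, 1000, 5000, 1_000_000), 2))
```

A large positive value means the verb attracts following infinitives far more often than chance would predict; as the text notes, however, this still does not decide whether those infinitives are genuine complements.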

4 CONCLUSIONS

The ultimate goal of this work is to provide the NLP community with a substantially complete, automatically updated dictionary of subcategorization frames. The methods described above solve several important problems that had stood in the way of that goal. Moreover, the results obtained with those methods are quite encouraging. Nonetheless, two obvious barriers still stand on the path to a fully automated SF dictionary: a decision algorithm that can handle random error, and techniques for detecting many more types of SFs.

Algorithms are currently being developed to resolve raw SF observations into genuine lexical properties and random error. The idea is to automatically generate statistical models of the sources of error. For example, purpose adjuncts like "John quit to pursue a career in finance" are quite rare, accounting for only two percent of the apparent infinitival complements. Furthermore, they are distributed across a much larger set of matrix verbs than the true infinitival complements, so any given verb should occur with a purpose adjunct extremely rarely. In a histogram sorting verbs by their apparent frequency of occurrence with infinitival complements, those that have in fact appeared with purpose adjuncts and not true subcategorized infinitives will be clustered at the low frequencies. The distributions of such clusters can be modeled automatically and the models used for identifying false positives.
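One way to make the error-modeling idea concrete is a simple binomial filter: if spurious cues arrive at some small rate, accept an SF for a verb only when its observed count is too high to attribute to chance. This is a sketch of the general idea only; the rates, threshold, and names are my assumptions, not the paper's decision algorithm, which was still being refined.

```python
from math import comb

# Sketch of the decision idea (illustrative; not the paper's algorithm).
# Suppose spurious SF cues (e.g. purpose adjuncts read as infinitival
# complements) occur at rate eps.  A verb observed n times, m of them
# with apparent SF cues, is accepted only if m is unlikely under pure
# error.

def p_at_least(m, n, eps):
    """P(X >= m) for X ~ Binomial(n, eps)."""
    return sum(comb(n, k) * eps ** k * (1 - eps) ** (n - k)
               for k in range(m, n + 1))

def takes_sf(m, n, eps=0.02, alpha=0.05):
    return p_at_least(m, n, eps) < alpha

print(takes_sf(5, 50))  # 5 of 50 cues: hard to explain as error
print(takes_sf(1, 50))  # a single cue: easily explained as error
```

Unlike a fixed cutoff percentage, this kind of test scales its skepticism with the number of occurrences, which is what the clustering-at-low-frequencies argument above calls for.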
The second requirement for automatically generating a full-scale dictionary is the ability to detect many more types of SFs. SFs involving certain prepositional phrases are particularly challenging. For example, while purpose adjuncts (mistaken for infinitival complements) are relatively rare, instrumental adjuncts as in "John hit the nail with a hammer" are more common. The problem, of course, is how to distinguish them from genuine, subcategorized PPs headed by with, as in "John sprayed the lawn with distilled water". The hope is that a frequency analysis like the one planned for purpose adjuncts will work here as well, but how successful it will be, and if successful how large a sample size it will require, remain to be seen.

The question of sample size leads back to an evaluation of the initial priorities, which favored simplicity, speed, and accuracy over efficient use of the corpus. There are various ways in which the high-priority criteria can be traded off against efficiency. For example, consider (2c): one might expect that the overwhelming majority of occurrences of "is V-ing" are genuine progressives, while a tiny minority are copula cases. One might also expect that the occasional copula constructions are not concentrated around any one present participle but rather distributed randomly among a large population. If those expectations are true then a frequency-modeling mechanism like the one being developed for adjuncts ought to prevent the mistaken copulas from doing any harm. In that case it might be worthwhile to admit "is V-ing", where V is known to be a (possibly ambiguous) verb root, as a verb, independent of the Case Filter mechanism.

ACKNOWLEDGMENTS

Thanks to Don Hindle, Lila Gleitman, and Jane Grimshaw for useful and encouraging conversations. Thanks also to Mark Liberman, Mitch Marcus, and the Penn Treebank project at the University of Pennsylvania for supplying tagged text. This work was supported in part by National Science Foundation grant DCR-85552543 under a Presidential Young Investigator Award to Professor Robert C. Berwick.

References

[Brent, 1991] M. Brent. Semantic Classification of Verbs from their Syntactic Contexts: An Implemented Classifier for Stativity. In Proceedings of the 5th European ACL Conference. Association for Computational Linguistics, 1991.

[Church and Hanks, 1990] K. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 1990.

[Church, 1988] K. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the 2nd ACL Conference on Applied NLP. ACL, 1988.

[DeMarcken, 1990] C. DeMarcken. Parsing the LOB Corpus. In Proceedings of the ACL. Association for Computational Linguistics, 1990.

[Gleitman, 1990] L. Gleitman. The structural sources of verb meanings. Language Acquisition, 1(1):3-56, 1990.

[Hindle, 1983] D. Hindle. User Manual for Fidditch, a Deterministic Parser. Technical Report 7590-142, Naval Research Laboratory, 1983.

[Hindle, 1990] D. Hindle. Noun classification from predicate-argument structures. In Proceedings of the 28th Annual Meeting of the ACL, pages 268-275. ACL, 1990.

[Hornby and Covey, 1973] A. Hornby and A. Covey. Oxford Advanced Learner's Dictionary of Contemporary English. Oxford University Press, Oxford, 1973.

[Levin, 1989] B. Levin. English Verbal Diathesis. Lexicon Project Working Papers no. 32, MIT Center for Cognitive Science, MIT, Cambridge, MA, 1989.

[Pinker, 1989] S. Pinker. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA, 1989.

[Rouvret and Vergnaud, 1980] A. Rouvret and J.-R. Vergnaud. Specifying Reference to the Subject. Linguistic Inquiry, 11(1), 1980.

[Smadja and McKeown, 1990] F. Smadja and K. McKeown. Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meeting of the ACL, pages 252-259. ACL, 1990.

[Zwicky, 1970] A. Zwicky. In a Manner of Speaking. Linguistic Inquiry, 2:223-233, 1970.
