
How Verb Subcategorization Frequencies Are Affected By Corpus Choice

Douglas Roland and Daniel Jurafsky
University of Colorado
Department of Linguistics & Institute of Cognitive Science
Boulder, CO 80309-0295
[email protected], [email protected]

Abstract

The probabilistic relation between verbs and their arguments plays an important role in modern statistical parsers and supertaggers, and in psychological theories of language processing. But these probabilities are computed in very different ways by the two sets of researchers. Computational linguists compute verb subcategorization probabilities from large corpora while psycholinguists compute them from psychological studies (sentence production and completion tasks). Recent studies have found differences between corpus frequencies and psycholinguistic measures. We analyze subcategorization frequencies from four different corpora: psychological sentence production data (Connine et al. 1984), written text (Brown and WSJ), and telephone conversation data (Switchboard). We find two different sources for the differences. Discourse influence is a result of how verb use is affected by different discourse types such as narrative, connected discourse, and single sentence productions. Semantic influence is a result of different corpora using different senses of verbs, which have different subcategorization frequencies. We conclude that verb sense and discourse type play an important role in the frequencies observed in different experimental and corpus based sources of verb subcategorization frequencies.

1 Introduction

The probabilistic relation between verbs and their arguments plays an important role in modern statistical parsers and supertaggers (Charniak 1995, Collins 1996/1997, Joshi and Srinivas 1994, Kim, Srinivas, and Trueswell 1997, Stolcke et al. 1997), and in psychological theories of language processing (Clifton et al. 1984, Ferreira & McClure 1997, Garnsey et al. 1997, Jurafsky 1996, MacDonald 1994, Mitchell & Holmes 1985, Tanenhaus et al. 1990, Trueswell et al. 1993).

These probabilities are computed in very different ways by the two sets of researchers. Psychological studies use methods such as sentence completion and sentence production for collecting verb structure probabilities. In sentence completion, subjects are asked to complete a sentence fragment. Garnsey et al. (1997) used a proper name followed by a verb, such as "Debbie remembered ". In sentence production, subjects are asked to write any sentence containing a given verb. An example of this type of study is Connine et al. (1984).

An alternative to these psychological methods is to use corpus data. This can be done automatically with unparsed corpora (Briscoe and Carroll 1997, Manning 1993, Ushioda et al. 1993), from parsed corpora such as Marcus et al.'s (1993) Treebank (Merlo 1994, Framis 1994), or manually, as was done for COMLEX (Macleod and Grishman 1994). The advantage of any of these corpus methods is the much greater amount of data that can be used, and the much more natural contexts. This seems to make it preferable to data generated in psychological studies.

Recent studies (Merlo 1994, Gibson et al. 1996) have found differences between corpus frequencies and experimental measures. This suggests that corpus-based frequencies and experiment-based frequencies may not be interchangeable. To clarify the nature of the differences between various corpora and to find the causes of these differences, we analyzed

psychological sentence production data (Connine et al. 1984), written discourse (Brown and WSJ; Penn Treebank, Marcus et al. 1993), and conversational data (Switchboard; Godfrey et al. 1992). We found that the subcategorization frequencies in each of these sources are different. We performed three experiments to (1) find the causes of general differences between corpora, (2) measure the size of these differences, and (3) find verb specific differences. The rest of this paper describes our methodology and the two sources of subcategorization probability differences: discourse influence and semantic influence.

2 Methodology

For the sentence production data, we used the numbers published in the original Connine et al. paper as well as the original data, which we were able to review thanks to the generosity of Charles Clifton. The Connine data (CFJCF) consists of examples of 127 verbs, each classified as belonging to one of 15 subcategorization frames. We added a 16th category for direct quotations (which appeared in the corpus data but not the Connine data). Examples of these categories, taken from the Brown Corpus, appear in figure 1 below. There are approximately 14,000 verb tokens in the CFJCF data set.

1  [0]          Barbara asked, as they heard the front door close.
2  [PP]         Guerrillas were racing [toward him].
3  [inf-S]      Hank thanked them and promised [to observe the rules].
4  [inf-S]/PP   Labor fights [to change its collar from blue to white].
5  [wh-S]       I know now [why the students insisted that I go to Hiroshima even when I told them I didn't want to].
6  [that-S]     She promised [that she would soon take a few days' leave and visit the uncle she had never seen, on the island of Oyajima -- which was not very far from Yokosuka].
7  [verb-ing]   But I couldn't help [thinking that Nadine and Wally were getting just what they deserved].
8  [perception complement]  Far off, in the dusk, he heard [voices singing, muffled but strong].
9  [NP]         The turtle immediately withdrew into its private council room to study [the phenomenon].
10 [NP][NP]     The mayor of the town taught [them] [English and French].
11 [NP][PP]     They bought [rustled cattle] [from the outlaw], kept him supplied with guns and ammunition, harbored his men in their houses.
12 [NP][inf-S]  She had assumed before then that one day he would ask [her] [to marry him].
13 [NP][wh-S]   I asked [Wisman] [what would happen if he broke out the go codes and tried to start transmitting one].
14 [NP][that-S] But, in departing, Lewis begged [Breasted] [that there be no liquor in the apartment at the Grosvenor on his return], and he took with him the first thirty galleys of Elmer Gantry.
15 [passive]    A cold supper was ordered and a bottle of port.
16 [quotes]     He writes ["Confucius held that in times of stress, one should take short views -- only up to lunchtime."]

Figure 1 - examples of each subcategorization frame from the Brown Corpus

For the BC, WSJ, and SWBD data, we counted subcategorizations using tgrep scripts based on the Penn Treebank. We automatically extracted and categorized all examples of the 127 verbs used in the Connine study. We used the same verb subcategorization categories as the Connine study. There were approximately 21,000 relevant verb tokens in the Brown Corpus, 25,000 relevant verb tokens in the Wall Street Journal Corpus, and 10,000 in Switchboard.
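The tgrep-based extraction can be illustrated with a toy stand-in. The sketch below is not one of the scripts used in the study: the parser, the complement labels, and the frame notation are simplified assumptions. It reads a Penn-Treebank-style bracketing and emits a coarse frame for each verb it finds.

```python
# Toy sketch of Treebank frame extraction (hypothetical stand-in for
# the tgrep scripts described above; labels are simplified).

def parse(s):
    """Parse a Penn-Treebank-style bracketed string into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(i):
        if tokens[i] == "(":
            node, i = [], i + 1
            while tokens[i] != ")":
                child, i = read(i)
                node.append(child)
            return node, i + 1
        return tokens[i], i + 1
    tree, _ = read(0)
    return tree

def frame(vp):
    """Label a VP's complements, e.g. an NP and a PP -> '[NP][PP]'."""
    args = [c[0] for c in vp[2:]
            if isinstance(c, list) and c[0] in ("NP", "PP", "S", "SBAR")]
    return "".join("[%s]" % a for a in args) or "[0]"

def frames(tree, out=None):
    """Collect (verb, frame) pairs from every VP headed by a VB* tag."""
    out = [] if out is None else out
    if isinstance(tree, list):
        if tree[0] == "VP" and isinstance(tree[1], list) \
                and tree[1][0].startswith("VB"):
            out.append((tree[1][1].lower(), frame(tree)))
        for child in tree[1:]:
            frames(child, out)
    return out
```

For example, `frames(parse("(S (NP (NNS Guerrillas)) (VP (VBD raced) (PP (IN toward) (NP (PRP him)))))"))` yields `[("raced", "[PP]")]`, matching frame 2 of figure 1; a real extractor would additionally handle traces, particles, and passives.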

Unlike the Connine data, where all verbs were equally represented, the frequencies of each verb in the corpora varied. For each calculation where individual verb frequency could affect the outcome, we normalized for frequency, and eliminated verbs with fewer than 50 examples. This left 77 out of 127 verbs in the Brown Corpus, 74 in the Wall Street Journal, and only 30 verbs in Switchboard. This was not a problem with the Connine data, where most verbs had approximately 100 tokens.

3 Experiment 1

The purpose of the first experiment is to analyze the general (non-verb-specific) differences between argument structure frequencies in the data sources. In order to do this, the data for each verb in the corpus was normalized to remove the effects of verb frequency. The average frequency of each subcategorization frame was calculated for each corpus. The average frame frequencies for each of the data sources were then compared.

3.1 Results

We found that the three corpora consisting of connected discourse (BC, WSJ, SWBD) shared a common set of differences when compared to the CFJCF sentence production data. There were three general categories of differences between the corpora, and all can be related to discourse type. These categories are:

(1) passive sentences
(2) zero anaphora
(3) quotations

3.1.1 Passive Sentences

The CFJCF single sentence productions had the smallest number of passive sentences. The connected spoken discourse in Switchboard had more passives, followed by the written discourse in the Wall Street Journal and the Brown Corpus.

Data Source          % passive sentences
CFJCF                0.6%
Switchboard          2.2%
Wall Street Journal  6.7%
Brown Corpus         7.8%

Passive is generally used in English to emphasize the undergoer (to keep the topic in subject position) and/or to de-emphasize the identity of the agent (Thompson 1987). Both of these reasons are affected by the type of discourse. If there is no preceding discourse, then there is no pre-existing topic to keep in subject position. In addition, with no context for the sentence, there is less likely to be a reason to de-emphasize the agent of the sentence.

3.1.2 Zero Anaphora

The increase in zero anaphora (not overtly mentioning understood arguments) is caused by two factors. Generally, as the amount of surrounding context increases (going from single sentence to connected discourse), the need to overtly express all of the arguments with a verb decreases.

Data Source          % [0] subcat frame
CFJCF                7%
Wall Street Journal  8%
Brown                13%
Switchboard          18%

Verbs that can describe actions (agree, disappear, escape, follow, leave, sing, wait) were typically used with some form of argument in single sentences, such as:

"I had a test that day, so I really wanted to escape from school." (CFJCF data)

Such verbs were more likely to be used without any arguments in connected discourse, as in:

"She escaped, crawled through the usual mine fields, under barbed wire, was shot at, swam a river, and we finally picked her up in Linz." (Brown Corpus)

In this case, the argument of "escaped" ("imprisonment") was understood from the previous sentence. Verbs of propositional attitude (agree, guess, know, see, understand) are typically used transitively in written corpora and single-sentence production:

"I guessed the right answer on the quiz." (CFJCF)

In spoken discourse, these verbs are more likely to be used metalinguistically, with the previous discourse contribution understood as the argument of the verb:

"I see." (Switchboard)
"I guess." (Switchboard)
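Percentage tables like the ones above can be derived from per-verb frame counts. A minimal sketch of the normalization described in Experiment 1 (the counts and the helper name `mean_frame_freqs` are invented for illustration): each verb's counts are first turned into proportions, so frequent verbs do not dominate the cross-verb average, and verbs under the 50-example cutoff are dropped.

```python
# Sketch of the Experiment 1 normalization: per-verb frame proportions,
# averaged across verbs so each verb contributes equally.

def mean_frame_freqs(counts, min_tokens=50):
    """counts: {verb: {frame: count}} -> {frame: mean proportion}."""
    kept = {v: f for v, f in counts.items() if sum(f.values()) >= min_tokens}
    frames = {fr for f in kept.values() for fr in f}
    avg = {}
    for fr in frames:
        props = [f.get(fr, 0) / sum(f.values()) for f in kept.values()]
        avg[fr] = sum(props) / len(props)
    return avg
```

With the invented input `{"escape": {"[0]": 60, "[PP]": 40}, "see": {"[NP]": 90, "[0]": 10}, "rare": {"[NP]": 3}}`, the verb "rare" is eliminated and the mean [0] share is (0.60 + 0.10) / 2 = 0.35, regardless of how many tokens each verb contributed.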

3.1.3 Quotations

Quotations are usually used in narrative, which is more likely in connected discourse than in an isolated sentence. This difference mainly affects verbs of communication (e.g. answer, ask, call, describe, read, say, write).

Data Source          Percent Direct Quotation
CFJCF                0%
Switchboard          0%
Brown                4%
Wall Street Journal  6%

These verbs are used in corpora to discuss details of the contents of communication:

"Turning to the reporters, she asked, "Did you hear her?"" (Brown)

In single sentence production, they are used to describe the (new) act of communication itself:

"He asked a lot of questions at school." (CFJCF)

We are currently working on systematically identifying indirect quotes in the corpora and the CFJCF data to analyze in more detail how they fit into this picture.

4 Experiment 2

Our first experiment suggested that discourse factors were the primary cause of subcategorization differences. One way to test this hypothesis is to eliminate discourse factors and see if this removes subcategorization differences.

We measure the difference between the way a verb is used in two different corpora by counting the number of sentences (per hundred) where a verb in one corpus would have to be used with a different subcategorization in order for the two corpora to yield the same subcategorization frequencies. This same number can also be calculated for the overall subcategorization frequencies of two corpora to show the overall difference between the two corpora.

Our procedure for measuring the effect of discourse is as follows (illustrated using passive as an example):

1. Measure the difference between the two corpora (WSJ vs CFJCF).

[Figure: % Passive - WSJ vs CFJCF]

2. Remove differences caused by discourse effects (based on BC vs CFJCF). CFJCF has 22% the number of passives that BC has.

[Figure: % Passive - BC vs CFJCF]

We then linearly scale the number of passives found in WSJ to reflect the difference found between BC and CFJCF.

[Figure: % Passive - WSJ (adjusted) vs CFJCF]

3. Re-measure the difference between the two corpora (WSJ vs CFJCF).

4. Amount of improvement = size of discourse effect.

This method was applied to the passive, quote, and zero subcat frames, since these are the ones that show discourse-based differences.
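One way to read the "frames per hundred" measure above is as total variation distance between two frame distributions, scaled to 100, and the linear rescaling of step 2 can be sketched alongside it. The function names and numbers below are illustrative assumptions, not code from the study.

```python
# Sketch of the difference measure and the discourse-effect mapping.
# frames_per_100 is half the summed absolute difference between two
# frame distributions, times 100: the number of sentences per hundred
# that would have to change frame for the distributions to match.
# scale_frame applies the linear rescaling of step 2 (e.g. multiply
# WSJ's passive share by the CFJCF/BC passive ratio), renormalizing.

def frames_per_100(p, q):
    frames = set(p) | set(q)
    return 50.0 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in frames)

def scale_frame(dist, frame, ratio):
    scaled = dict(dist)
    scaled[frame] = scaled.get(frame, 0.0) * ratio
    total = sum(scaled.values())
    return {f: v / total for f, v in scaled.items()}
```

For example, with p = {"[NP]": 0.5, "[0]": 0.5} and q = {"[NP]": 0.7, "[0]": 0.3}, frames_per_100(p, q) is 20.0: twenty sentences per hundred would need a different frame.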

Before the mapping, WSJ has a difference of 17.1 frames/100 overall difference when compared with CFJCF. After the mapping, the difference is only 9.6 frames/100 overall difference. This indicates that 43% of the overall cross-verb differences between these two corpora are caused by discourse effects.

We use this mapping procedure to measure the size and consistency of the discourse effects. A more sophisticated mapping procedure would be appropriate for other purposes, since the verbs with the best matches between corpora are actually made worse by this mapping procedure.

5 Experiment 3

Argument preference was also affected by verb sense. To examine this effect, we took two sample ambiguous verbs, "charge" and "pass". We hand coded them for semantic senses in each of the corpora we used, as follows:

Examples of 'charge' taken from BC:
  accuse:   "His petition charged mental cruelty."
  attack:   "When he charged Mickey was ready."
  money:    "... 20 per cent ... was all he charged the traders."

Examples of 'pass' taken from BC:
  movement: "Blue Throat's men spotted him ... as he passed."
  law:      "The President noted that Congress last year passed a law providing grants ..."
  transfer: "He asked, when she passed him a glass."
  test:     "Those who stayed had to pass tests."

We then asked two questions:

1. Do different verb senses have different argument structure preferences?
2. Do different corpora have different verb sense preferences, and therefore potentially different argument structure preferences?

For both verbs examined (pass and charge) there was a significant effect of verb sense on argument structure probabilities (by χ², p < .001 for 'charge' and p < .001 for 'pass'). The following chart shows a sample of this difference:

                  that  NP  NP PP  passive
charge (accuse)    32    0    24      25

Sample Frames and Senses from WSJ

We then analyzed how often each sense was used in each of the corpora and found that there was again a significant difference (by χ², p < .001 for 'charge' and p < .001 for 'pass').

  BC    22  13  15  4
  WSJ   88  69   1  7
  SWBD  16   0   0

Senses of 'Charge' used in each corpus

  BC    136  32  16  ½  44
  WSJ    11  76  31  8  22
  SWBD    0   5   2  1   0

Senses of 'Pass' used in each corpus

This analysis shows that it is possible for shifts in the relative frequency of each of a verb's senses to influence the observed subcat frequencies.

We are currently extending our study to see if verb senses have constant subcategorization frequencies across corpora. This would be useful for word sense disambiguation and for parsing. If the verb sense is known, then a parser could use this information to help look for likely arguments. If the subcategorization is known, then a disambiguator could use this information to find the sense of the verb. These could be used to bootstrap each other, relying on the heuristic that only one sense is used within any discourse (Gale, Church, & Yarowsky 1992).

6 Evaluation

We had previously hoped to evaluate the accuracy of our treebank-induced subcategorization probabilities by comparing them with the COMLEX hand-coded probabilities (Macleod and Grishman 1994), but we used a different set of subcategorization frames than COMLEX. Instead, we hand checked a random sample of our data for errors.
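The χ² tests reported in Experiment 3 above can be sketched in a few lines; the sense-by-frame contingency table below is invented, and only the Pearson statistic is computed (obtaining a p-value additionally requires the χ² distribution, e.g. from a statistics library).

```python
# Pearson chi-square statistic for a contingency table
# (rows = verb senses, columns = subcategorization frames).
# Illustrative sketch only; the p-value step is omitted.

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat
```

For the invented table [[10, 20], [20, 10]] the statistic is 100/15, about 6.67, which for one degree of freedom is significant at the .01 level.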

The error rate in our data is between 3% and 7% for all verbs excluding 'say' type verbs such as 'answer', 'ask', 'call', 'read', 'say', and 'write'. The error rate is given as a range due to the subjectivity of some types of errors. The errors can be divided into two classes: errors which are due to mis-parsed sentences in Treebank I,¹ and errors which are due to the inadequacy of our search strings in identifying certain syntactic patterns.

Treebank-based errors:
  PP attachment                          1%
  verb+particle vs verb+PP               2%
  NP/adverbial distinction               2%
  misc. mis-parsed sentences             1%
Errors based on our search strings:
  missed traces and displaced arguments  1%
  "say" verbs missing quotes             6%

Error rate by category

In trying to estimate the maximum amount of error in our data, we found cases where it was possible to disagree with the parses/tags given in Treebank. Treebank examples given below include prepositional attachment (1), the verb-particle/preposition distinction (2), and the NP/adverbial distinction (3).

1. "Sam, I thought you [knew [everything]NP [about Tokyo]PP]" (BC)
2. "...who has since moved [on to other methods]PP?" (BC)
3. "Gross stopped [briefly]NP?, then went on." (BC)

Missed traces and displaced argument errors were a result of the difficulty in writing search strings to find arguments that were located to the left of the verb. This is because arbitrary amounts of structure can intervene, especially in the case of traces.

Six percent of the data (overall) was improperly classified due to the failure of our search patterns to identify all of the quote-type arguments which occur in 'say' type verbs. The identification of these elements is particularly problematic due to the asyntactic nature of these arguments, ranging from a sound (He said 'Argh!') to complex sentences. The presence or absence of quotation marks was not a completely reliable indicator of these arguments. This type of error affects only a small subset of the total number of verbs. 27% of the examples of these verbs were mis-classified, always by failing to find a quote-type argument of the verb. Using separate search strings for these verbs would greatly improve the accuracy of these searches.

Our eventual goal is to develop a set of regular expressions that work on flat tagged corpora instead of Treebank parsed structures to allow us to gather information from larger corpora than have been done by the Treebank project (see Manning 1993 and Gahl 1998).

¹ All of our search patterns are based only on the information available in the Treebank I coding system, since the Brown Corpus is only available in this scheme. The error rate for corpora available in Treebank 2 form would have been lower had we used all available information.

7 Conclusion

We find that there are significant differences between the verb subcategorization frequencies generated through experimental methods and corpus methods, and between the frequencies found in different corpora. We have identified two distinct sources for these differences. Discourse influences are caused by the changes in the ways language is used in different discourse types and are to some extent predictable from the discourse type of the corpus in question. Semantic influences are based on the semantic context of the discourse. These differences may be predictable from the relative frequencies of each of the possible senses of the verbs in the corpus. An extensive analysis of the frame and sense frequencies of different verbs across different corpora is needed to verify this. This work is presently being carried out by us and others (Baker, Fillmore, & Lowe 1998). It is certain, however, that verb sense and

discourse type play an important role in the frequencies observed in different experimental and corpus based sources of verb subcategorization frequencies.

Acknowledgments

This project was supported by the generosity of the NSF via NSF IRI-9704046 and NSF IRI-9618838 and the Committee on Research and Creative Work at the graduate school of the University of Colorado, Boulder. Many thanks to Giulia Bencini, Charles Clifton, Charles Fillmore, Susanne Gahl, Michelle Gregory, Uli Held, Paola Merlo, Bill Raymond, and Philip Resnik.

References

Baker, C., Fillmore, C., & Lowe, J.B. (1998) FrameNet. ACL 1998.
Biber, D. (1993) Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19/2, 219-241.
Briscoe, T. and Carroll, J. (1997) Automatic Extraction of Subcategorization from Corpora.
Charniak, E. (1997) Statistical parsing with a context-free grammar and word statistics. Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park.
Clifton, C., Frazier, L., & Connine, C. (1984) Lexical expectations in sentence comprehension. Journal of Verbal Learning and Verbal Behavior, 23, 696-708.
Collins, M. J. (1996) A new statistical parser based on bigram lexical dependencies. In Proceedings of ACL-96, 184-191, Santa Cruz, CA.
Collins, M. J. (1997) Three generative, lexicalised models for statistical parsing. In Proceedings of ACL-97.
Connine, Cynthia, Fernanda Ferreira, Charlie Jones, Charles Clifton and Lyn Frazier. (1984) Verb Frame Preference: Descriptive Norms. Journal of Psycholinguistic Research 13, 307-319.
Ferreira, F., and McClure, K.K. (1997) Parsing of Garden-path Sentences with Reciprocal Verbs. Language and Cognitive Processes, 12, 273-306.
Framis, F.R. (1994) An experiment on learning appropriate selectional restrictions from a parsed corpus. Manuscript.
Gahl, S. (1998) Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus. Proceedings of ACL-98, Montreal.
Gale, W.A., Church, K.W., and Yarowsky, D. (1992) One Sense Per Discourse. DARPA Speech and Natural Language Workshop.
Garnsey, S. M., Pearlmutter, N. J., Myers, E. & Lotocky, M. A. (1997) The contributions of verb bias and plausibility to the comprehension of temporarily ambiguous sentences. Journal of Memory and Language, 37, 58-93.
Gibson, E., Schutze, C., & Salomon, A. (1996) The relationship between the frequency and the processing complexity of linguistic structure. Journal of Psycholinguistic Research 25(1), 59-92.
Godfrey, J., E. Holliman, J. McDaniel. (1992) SWITCHBOARD: Telephone speech corpus for research and development. Proceedings of ICASSP-92, 517-520, San Francisco.
Joshi, A. & B. Srinivas. (1994) Disambiguation of super parts of speech (or supertags): almost parsing. Proceedings of COLING '94.
Juliano, C., and Tanenhaus, M.K. Contingent frequency effects in syntactic ambiguity resolution. In Proceedings of the 15th Annual Conference of the Cognitive Science Society. LEA: Hillsdale, NJ.
Jurafsky, D. (1996) A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20, 137-194.
Lafferty, J., D. Sleator, and D. Temperley. (1992) Grammatical trigrams: A probabilistic model of link grammar. In Proceedings of the 1992 AAAI Fall Symposium on Probabilistic Approaches to Natural Language.
MacDonald, M. C. (1994) Probabilistic constraints and syntactic ambiguity resolution. Language and Cognitive Processes 9, 157-201.
MacDonald, M. C., Pearlmutter, N. J. & Seidenberg, M. S. (1994) The lexical nature of syntactic ambiguity resolution. Psychological Review, 101, 676-703.
Macleod, C. & Grishman, R. (1994) COMLEX Reference Manual Version 1.2. Linguistic Data Consortium, University of Pennsylvania.
Manning, C. D. (1993) Automatic Acquisition of a Large Subcategorization Dictionary from Corpora. Proceedings of ACL-93, 235-242.
Marcus, M.P., Santorini, B. & Marcinkiewicz, M.A. (1993) Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19.2, 313-330.
Marcus, M. P., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. (1994) The Penn Treebank: Annotating predicate argument structure. ARPA Human Language Technology Workshop, Plainsboro, NJ, 114-119.
Merlo, P. (1994) A Corpus-Based Analysis of Verb Continuation Frequencies for Syntactic Processing. Journal of Psycholinguistic Research 23.6, 435-457.
Meyers, A., Macleod, C., and Grishman, R. (1995) COMLEX Syntax 2.0 manual for tagged entries.
Mitchell, D. C. and V. M. Holmes. (1985) The role of specific information about the verb in parsing sentences with local structural ambiguity. Journal of Memory and Language 24, 542-559.
Stolcke, A., C. Chelba, D. Engle, V. Jimenez, L. Mangu, H. Printz, E. Ristad, R. Rosenfeld, D. Wu, F. Jelinek and S. Khudanpur. (1997) Dependency Language Modeling. Center for Language and Speech Processing Research Note No. 24. Johns Hopkins University, Baltimore.
Thompson, S. A. (1987) The Passive in English: A Discourse Perspective. In Channon, Robert & Shockey, Linda (Eds.) In Honor of Ilse Lehiste / Ilse Lehiste Puhendusteos. Dordrecht: Foris, 497-511.
Trueswell, J., M. Tanenhaus and C. Kello. (1993) Verb-Specific Constraints in Sentence Processing: Separating Effects of Lexical Preference from Garden-Paths. Journal of Experimental Psychology: Learning, Memory and Cognition 19.3, 528-553.
Trueswell, J. & M. Tanenhaus. (1994) Toward a lexicalist framework for constraint-based syntactic ambiguity resolution. In C. Clifton, K. Rayner & L. Frazier (Eds.) Perspectives on Sentence Processing. Hillsdale, NJ: Erlbaum, 155-179.
Ushioda, A., Evans, D., Gibson, T. & Waibel, A. (1993) The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora. In Boguraev, B. & Pustejovsky, J. (eds.) SIGLEX ACL Workshop on Acquisition of Lexical Knowledge from Text. Columbus, Ohio, 95-106.
