
Verb Sense and Verb Subcategorization Probabilities

by

Douglas William Roland
B.S., University of Delaware, 1989
M.A., University of Colorado, 1994

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Linguistics, 2001

Copyright © 2001 by Douglas William Roland

This thesis entitled: Verb Sense and Verb Subcategorization Probabilities written by Douglas William Roland has been approved for the Department of Linguistics

______Daniel Jurafsky

______Lise Menn

______Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Abstract

Roland, Douglas William (Ph.D., Linguistics)
Verb Sense and Verb Subcategorization Probabilities
Thesis directed by Associate Professor Daniel S. Jurafsky

This dissertation investigates a variety of problems in psycholinguistics and computational linguistics caused by the differences in verb subcategorization probabilities found between various corpora and experimental data sets. For psycholinguistics, these problems include the practical problem of which frequencies to use for norming psychological experiments, as well as the more theoretical issue of which frequencies are represented in the mental lexicon and how those frequencies are learned. In computational linguistics, these problems include the decreases in the accuracy of probabilistic applications such as parsers when they are used on corpora other than the one on which they were trained.

Evidence is presented showing that different senses of verbs and their corresponding differences in subcategorization, as well as inherent differences between the production of sentences in psychological norming protocols and language use in context, are important causes of the subcategorization frequency differences found between corpora. This suggests that verb subcategorization probabilities should be based on individual senses of verbs rather than the whole verb lexeme, and that “test tube” sentences are not the same as “wild” sentences. Hence, the influences of experimental design on verb subcategorization probabilities should be given careful consideration.

This dissertation will demonstrate a model of how the relationship between verb sense and verb subcategorization can be employed to predict verb subcategorization based on the semantic context preceding the verb in corpus data. The predictions made by the model are shown to be the same as predictions made by human subjects given the same contexts.

For Sumiyo

Acknowledgements

This dissertation is the result of an enormous amount of help, advice, and support from many people. However, unlike the body of the dissertation, where a simple table or graph can adequately convey a message, a page of text in the introduction cannot convey the impact that the people named here have had on this dissertation and on me as a researcher and as a human being.

At the top of the list are my advisor, Dan Jurafsky, my co-advisor, Lise Menn, and the other members of my committee, Alan Bell, Jim Martin, and Tom Landauer. Over the years, they have all given me more than a fair share of time, energy, insight, help, and patience. I can’t even begin to list all they have done.

Other present and former Boulder faculty members have also had a great influence on me, including: Paul Smolensky, who introduced me to computational modeling during my first semester in Boulder, Mike Eisenburg, who showed me the artistic side of programming, Laura Michaelis, and Barbara Fox.

I also owe special gratitude to Susanne Gahl, who read multiple versions of this dissertation and the preceding papers, and provided many useful suggestions for improving both the content and the clarity.

Several people have very kindly contributed their data to this dissertation. These include Charles Clifton, who lent me the original hand-written subject response sheets from the Connine, Ferreira, Jones, Clifton, & Frazier (1984) study, and thus caused me to re-evaluate many of my original thoughts about the causes of the differences between norming study data and other corpus data. Of similar importance was the subject response data from Garnsey, Pearlmutter, Myers, & Lotocky (1997), provided by Susan Garnsey. Chapter 3 would not exist at all if it were not for the data and support provided by Mary Hare, Ken McRae, and Jeff Elman.

Financially, this work was supported in part by: NSF CISE/IRI/Interactive Systems Proposal 9818827, NSF CISE/IRI/Interactive Systems Award IRI-9618838, NSF CISE/IRI/Interactive Systems Award IRI-970406, NSF CISE/IRI/Interactive Systems Award IIS-9733067

Many people have provided helpful feedback either on this dissertation, or on the papers and presentations, such as Roland & Jurafsky (1997), Roland & Jurafsky (1998), Roland & Jurafsky (in press), and Roland et al. (2000), that contain various pieces of the data and analysis in this dissertation. These include Charles Clifton, Charles Fillmore, Adele Goldberg, Mary Hare, Uli Heid, Paola Merlo, Neal Perlmutter, Philip Resnik, Suzanne Stevenson, and the ever-famous anonymous reviewers. This list also includes assorted office-mates and other people associated with Dan’s research group: (in an attempt at chronological order) Bill Raymond, Taimi Metzler, Giulia Bencini, Michelle Gregory, Traci Curl, Mike O’Connell, Cynthia Girand, Noah Coccaro, Chris Riddoch, Beth Elder, Keith Herold, and finally, Dan Gildea and Sameer Pradhan, both of whom provided significant help while I was attempting to tag and parse data from various corpora.

Special thanks to Yoshie Matsumoto for showing me where all the good coffee shops were for getting work done, and setting a pace and spirit during my first year in Boulder which I have attempted (not always successfully) to maintain since. Also, thanks to Faridah Hudson and Linda Nicita, who started with me, and have provided support and camaraderie during the long trip.

Many friends (mostly human, but also one canine and one feline) from the real world also made life a much better place to be: (in order of neighborhood proximity to our kitchen) Rie, Pepper, Emi, Virgil, Hal, Elizabeth, Jeremy, Benny, Raj, Noriko, Wes, Lee Wah, Nanako, Hiroko, Kiyoshi, (and now for a great leap in neighborhood distance) Yoshie, Kazu, Tomoko, and Taro. Also thanks to Imanaka Sensei for ocha and okashi.

I would also like to thank my parents for being parents, a job which included Mom spending a day of her vacation last summer proof-reading the whole dissertation.

None of the help from the people above would mean anything were it not for the support, encouragement, love, and extreme patience shown by my wife Sumi. I couldn’t have done it without her.

Contents

1 Introduction 1
1.1 Overview ...... 1
1.2 The importance of verb subcategorization probabilities in psycholinguistics ...... 1
1.3 The problem with verb subcategorization frequencies ...... 4
1.4 Verb subcategorizations and computational linguistics ...... 11
1.4.1 The importance of verb subcategorization information for statistical parsers ...... 12
1.4.2 Problem: the need to retrain parsers for new domains ...... 13
1.5 Solving the verb subcategorization frequency problem ...... 15
1.5.1 Evidence for the relationship between verb semantics and subcategorization from linguistics ...... 17
1.5.2 Evidence for the relationship between verb semantics and subcategorization from computational linguistics ...... 19
1.5.3 Evidence for the relationship between verb semantics and subcategorization from psycholinguistics ...... 20
1.6 Outline of Chapters ...... 21
1.6.1 Chapter 2 ...... 22
1.6.2 Chapter 3 ...... 23
1.6.3 Chapter 4 ...... 24

2 Subcategorization probability differences in corpora and experiments 25
2.1 Combining sense-based verb subcategorization probabilities with other factors to yield observed subcategorization frequencies ...... 25
2.1.1 Verb senses ...... 26
2.1.2 Probabilistic factors ...... 27
2.2 Experiment – comparing norming studies and corpus data ...... 28
2.2.1 Methodology ...... 28
2.2.1.1 Connine et al. (1984) sentence production study ...... 29
2.2.1.2 Garnsey et al. (1997) sentence completion study ...... 30
2.2.1.3 Brown Corpus ...... 31
2.2.1.4 Wall Street Journal Corpus ...... 32
2.2.1.5 Switchboard Corpus ...... 32
2.2.1.6 Extracting subcategorization probabilities from the corpora ...... 32
2.2.1.7 Measuring differences between corpora ...... 39
2.2.2 Results and discussion: Part 1 - Subcategorization differences resulting from comparing isolated sentence and connected-discourse corpora ...... 41
2.2.2.1 Discourse cohesion ...... 41
2.2.2.2 Other experimental factors – subject animacy ...... 46
2.2.2.3 Conclusion for section 2.2.2 ...... 47
2.2.3 Results and discussion: Part 2 - Subcategorization differences resulting from verb sense differences ...... 48
2.2.3.1 Verbs have different subcategorization frequencies in different corpora ...... 48
2.2.3.2 Verbs have different distributions of sense in different corpora ...... 49
2.2.3.3 Topics provided in norming studies also influence verb sense ...... 51
2.2.3.4 Subcategorization frequencies for each verb sense ...... 51
2.2.3.5 Factors that contribute to stable cross-corpus subcategorization frequencies ...... 52
2.2.3.6 Conclusion for section 2.2.3 ...... 55
2.2.4 Results and discussion: Part 3 – Reducing sense and discourse differences decreases the differences in subcategorization probabilities ...... 56
2.2.5 Conclusion ...... 58
2.3 Experiment – Controlling for discourse type and verb sense to generate stable cross corpus subcategorization frequencies ...... 58
2.3.1 Data ...... 59
2.3.2 Verb Frequency ...... 59
2.3.3 Subcategorization Frequency ...... 60
2.3.3.1 Methodology ...... 60
2.3.3.2 Results ...... 61
2.3.4 Discussion ...... 62
2.3.5 Conclusion ...... 65
2.4 Conclusion ...... 66

3 Predicting verb subcategorization from semantic context using the relationship between verb sense and verb subcategorization 68
3.1 Overview ...... 68
3.2 Psycholinguistic evidence for the effects of verb sense on human sentence processing ...... 68
3.3 Model for predicting subcategorization from semantic context ...... 72
3.3.1 Previous related uses of LSA ...... 74
3.3.2 Details of how LSA is used to measure semantic similarity ...... 79
3.3.3 Corpus (training) data used in model ...... 80
3.4 Experiments ...... 81
3.4.1 Predicting the subcategorizations of the Hare et al. (2001) bias contexts ...... 81
3.4.1.1 Results and discussion ...... 82
3.4.2 Predicting the subcategorizations of corpus bias contexts ...... 87
3.4.2.1 Results and discussion ...... 88
3.4.2.2 Additional analysis using corpus bias contexts ...... 91
3.4.3 Predicting the subcategorizations of examples of ‘admit’ ...... 93
3.4.3.1 Methods ...... 94
3.4.3.2 Results and discussion ...... 95
3.5 Conclusion ...... 97

4 Conclusions and future work 99
4.1 Psycholinguistics ...... 99
4.2 Computational linguistics ...... 100
4.3 Future work ...... 100

Bibliography 101

Appendix A: Subcategorizations and tgrep search strings 106

Appendix B: Stimuli used in Hare et al. (2001) 112

Tables

Table 1: Correlation values from comparisons in Merlo (1994) ...... 5

Table 2: High, middle, and low attachment sites from Gibson & Schuetze (1999)...... 7

Table 3: Correlations (r) and agreement (b) for comparisons for DO and SC subcategorizations from Lapata et al. (2001)...... 9

Table 4: Correlations (r) and agreement (b) for comparisons for NP and 0 subcategorizations from Lapata et al. (2001)...... 9

Table 5: Sample grammar rules (made-up probabilities)...... 12

Table 6: Results from Gildea (2001)...... 14

Table 7: Senses and possible subcategorizations of admit from WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller 1993)...... 16

Table 8: Approximate size of each corpus...... 29

Table 9: Connine et al. (1984) protocol 1 sample prompts and subject responses ...... 29

Table 10: Connine et al. (1984) protocol 2 sample prompts and subject responses ...... 30

Table 11: 127 verbs used from Connine et al. (1984) ...... 30

Table 12: Sentence Completion protocol used to collect subcategorization frequencies by Garnsey et al. (1997)...... 30

Table 13: Example subcategorization-frame probabilities for each of the three subcategorization frame classes (DO-bias, SC-bias, and EQ-bias) of Garnsey et al. (1997)...... 31

Table 14: 127 verbs used from Garnsey et al. (1997)...... 31

Table 15: List of subcategorizations...... 33

Table 16: Examples of each subcategorization frame taken from the Brown Corpus...... 34

Table 17: Examples of each subcategorization frame from the response sheets for the CFJCF data...... 35

Table 18: Raw subcategorization vectors for hear from BC and WSJ...... 40

Table 19: Modified subcategorization vectors for hear from BC and WSJ for use in calculating Chi Square ...... 41

Table 20: Use of passives in each corpus ...... 42

Table 21: The object of follow is only omitted in connected-discourse corpora. (numbers are hand-counted, and indicate % of omitted objects out of all instances of follow) ...... 43

Table 22: Greater use of first person subject in isolated-sentences...... 44

Table 23: Use of VP-internal NPs which are anaphorically related to the subject ...... 44

Table 24: Token/Type ratio for arguments of accept...... 45

Table 25: Subcategorization of worry affected by sentence-completion paradigm ...... 46

Table 26: Uses of worry...... 46

Table 27: Agreement between WSJ and BC data...... 49

Table 28: Differences in distribution of verb senses between BC and WSJ...... 49

Table 29: Examples of common senses of charge and their frequencies...... 50

Table 30: Examples of common senses of jump and their frequencies...... 50

Table 31: Examples of common senses of pass and their frequencies...... 51

Table 32: Uses of pass in different settings in the CFJCF sentence production study...... 51

Table 33: Different senses of charge in WSJ have different subcategorization probabilities. Dominant prepositions are listed in parentheses after the frequency...... 52

Table 34: Improvement in agreement when after controlling for verb sense...... 52

Table 35: Agreement between BC and WSJ data...... 52

Table 36: Differences in distribution of verb sense between BC and WSJ...... 53

Table 37: Examples of common senses of kill...... 53

Table 38: Examples of common senses of stay...... 54

Table 39: Examples of common senses of try...... 54

Table 40: Senses and subcategorizations of kill in WSJ...... 55

Table 41: Improvements in agreement for the verb hear ...... 57

Table 42: 64 verbs chosen for analysis ...... 59

Table 43: Number of verbs out of 64 showing a significant difference in frequency between corpora...... 60

Table 44: Verbs that BNC and Brown both have more of than WSJ...... 60

Table 45: Verbs that WSJ has more of than both Brown and BNC...... 60

Table 46: bias in each corpus...... 62

Table 47: Sentence Completion results from Hare et al. (2001)...... 69

Table 48: Word sense disambiguation results from Schütze (1997)...... 77

Table 49: Sample size for each verb. (Verbs marked with * were used in Hare et al. (2001), but were not used in the experiments in this dissertation due to small sample sizes.)...... 81

Table 50: Average subcategorization frequencies for 15 verbs used in experiment 3.4.1, taken from corpus frequencies reported in Hare et al. (2001)...... 86

Table 51: Sample sentence completion prompts...... 87

Table 52: Average subcategorization frequencies for 15 verbs taken from sentence completion experiment in Hare et al. (2001)...... 87

Table 53: Examples of senses and subcategorizations of admit...... 94

Table 54: Counts and examples of subsenses of the 50 corpus examples of the DO-enter sense of admit...... 97

Table 55: 20 nearest neighbors of grape in the TASA LSA semantic space...... 100

Table 56: Errors not including quote-finding errors for communication verbs ...... 106

Figures

Figure 1: High, middle, and low attachment sites (from Gibson et al. 1996)...... 7

Figure 2: Sample lexicalized tree taken from Charniak (1995)...... 12

Figure 3: Semantic structures for two different syntactic patterns of ‘spray’ (Pinker 1989, page 228)...... 19

Figure 4: Model showing why different corpora have different subcategorization probabilities for the same verb...... 26

Figure 5: Effect of bias context on reading times in ambiguous condition, from Hare et al. (2001). (DA = DO bias, ambiguous condition, SA = SC bias, ambiguous condition)...... 71

Figure 6: Effect of bias context on reading times in unambiguous condition, from Hare et al. (2001). (DU = DO bias, unambiguous condition, SU = SC bias, unambiguous condition)...... 72

Figure 7: Predicting subcategorization from the context preceding the verb...... 73

Figure 8: Use of semantic similarity to predict subcategorization...... 74

Figure 9: Disambiguating the subcategorization of a target using Schütze style clusters in LSA semantic space...... 78

Figure 10: Disambiguating the subcategorization of a target using the subcategorizations of the nearest neighbors...... 79

Figure 11: Average % SC corpus examples in neighborhood of SC bias contexts ...83

Figure 12: Accuracy in predicting the subcategorization bias of the SC bias contexts...... 84

Figure 13: Average % DO corpus examples in neighborhood of DO bias contexts .85

Figure 14: Accuracy in predicting the subcategorization bias of the DO bias contexts...... 85

Figure 15: Average % SC corpus examples in neighborhood of SC target contexts...... 88

Figure 16: Accuracy in predicting the subcategorization bias of the SC corpus contexts...... 89

Figure 17: Average % DO corpus examples in neighborhood of DO corpus contexts ...... 90

Figure 18: Accuracy in predicting the subcategorization bias of the DO bias contexts ...... 91

Figure 19: Comparison of various LSA weighting methods...... 92

Figure 20: Effects of weighting neighborhoods by cosine on accuracy in predicting subcategorization...... 93

Figure 21: Relative frequencies of each type of example in the neighborhood of DO-confess corpus examples...... 95

Figure 22: Relative frequencies of each type of example in the neighborhood of SC-confess corpus examples...... 96

Figure 23: Relative frequencies of each type of example in the neighborhood of DO-enter corpus examples ...... 96

1 Introduction

1.1 Overview

This dissertation will investigate a variety of problems in psycholinguistics and computational linguistics caused by the differences in verb subcategorization probabilities found between various corpora and experimental data sets. For psycholinguistics, these problems include the practical problem of which frequencies to use for norming psychological experiments as well as the more theoretical issue of which frequencies are represented in the mental lexicon and how those frequencies are learned. In computational linguistics, these problems include the decreases in the accuracy of probabilistic applications such as parsers when they are used on corpora other than the one on which they were trained.

Chapter 2 will demonstrate that different senses of verbs and their corresponding differences in subcategorization as well as inherent differences between the production of sentences in psychological norming protocols and language use in context are important causes of the subcategorization frequency differences found between corpora. This leads to two conclusions: 1) verb subcategorization probabilities, for psycholinguistic models and for norming purposes, should be based on individual senses of verbs rather than the whole verb lexeme, and 2) “test tube” sentences are not the same as “wild” sentences, and thus the influences of experimental design on verb subcategorization probabilities should be given careful consideration.

Chapter 3 will demonstrate a computational model, based on Latent Semantic Analysis, of how the relationship between verb sense and verb subcategorization can be employed to predict verb subcategorization based on the semantic context preceding the verb in corpus data. This chapter will also demonstrate that the predictions made by the model are the same as predictions made by human subjects given the same contexts. This will be accomplished by showing that the predictions from the algorithm correspond with parsing decisions made by human subjects in reading time experiments performed by Hare, Elman, & McRae (2001).

1.2 The importance of verb subcategorization probabilities in psycholinguistics

Verb subcategorization probabilities play an important role in recent psycholinguistic theories of human language processing and in computational linguistic applications such as probabilistic parsers. This section will address the role of verb subcategorization probabilities in psycholinguistics, providing examples of both evidence of how verb subcategorization probabilities affect sentence processing and of how various researchers have generated norming materials for their experiments. Section 1.4 will address the role of verb subcategorization probabilities in computational linguistics.

Fodor (1978) argued that the transitivity preferences of verbs affect the processing of sentences containing those verbs. She argues, based on intuition and informant judgment, that sentence (1) is more difficult to understand than sentence (2). The additional difficulty in (1) is attributed to the parser proposing a gap after the verb read at the location marked by (_).

(1) Which book_i did the teacher read (_) to the children from _i?

(2) Which student_i did the teacher go to the concert with _i?

Alternatively, example (3) is more difficult than example (4). This time, the difficulty is attributed to the parser not proposing the filled gap (_) after the verb walk.

(3) Which student_i did the teacher walk (_)_i to the cafeteria?

(4) Which student_i did the teacher walk to the cafeteria with _i?

These patterns of difficulty argue against both theories where the parser always proposes gaps, and theories where the parser never proposes gaps. Fodor argues that the key difference between these sets of examples is that the verb read commonly takes a direct object (DO), so the parser proposes a gap, while the verb walk occurs more commonly without a DO, so no gap is proposed.

Clifton, Frazier, & Connine (1984) provided experimental evidence for the relationship between verb subcategorization expectations and processing difficulty discussed by Fodor. Clifton et al. (1984) relied on a norming study by Connine, Ferreira, Jones, Clifton, & Frazier (1984) for verb bias data. Parsing difficulties were measured by on-line reaction times in grammaticality judgment and secondary task protocols.

Ford, Bresnan, & Kaplan (1982) showed how lexical preference affects the parsing of ambiguous sentences. Subjects were given different ambiguous sentences and asked to choose a meaning for each sentence. The subjects’ interpretations of the sentences changed when different verbs were used in otherwise identical sentences. Examples (5) and (6) show how parse preferences, indicated in parentheses, change when the verb is changed. These changes in parse preference indicate that some information is associated with the verb that influences parsing decisions.

(5) They objected to everyone that they couldn’t hear.
a. They objected to everyone who they couldn’t hear. (55%)
b. They objected to everyone about the fact that they couldn’t hear. (45%)

(6) They signaled to everyone that they couldn’t hear.
a. They signaled to everyone who they couldn’t hear. (10%)
b. They signaled to everyone the fact that they couldn’t hear. (90%)

Trueswell, Tanenhaus, & Kello (1993) showed that the subcategorization bias of the verb affects the parsing difficulties in sentences with the sentential complement (SC) / direct object (DO) ambiguity. In this ambiguity, the noun phrase after the verb can be interpreted either as the direct object of the verb or as the subject of a sentential complement. In example (7), the student is the direct object of the verb accept, while in example (8), the student is the subject of the sentential complement the student wrote the paper.

(7) The teacher accepted the student.

(8) The teacher accepted the student wrote the paper.

If the verb is more frequently used with a direct object (i.e. has a DO bias), then parsing is more difficult in the region after the words the student than it is if the verb is more frequently used with a sentential complement (i.e. has an SC bias). In order to determine the subcategorization bias of different verbs, Trueswell et al. (1993) used a sentence completion task. Subjects were given the initial portion of a sentence, such as John insisted, and asked to complete the sentence. The subjects’ completions were categorized as to whether they had written a direct object, a sentential complement, or some other use. In separate experiments relying on naming latency, self-paced reading times, and eye tracking, they found that subjects had difficulties in sentences with the SC/DO ambiguity when the verbs had a DO bias, but not when the verbs had an SC bias.

Garnsey, Pearlmutter, Myers, & Lotocky (1997) showed that both verb bias and the plausibility of the noun phrase following the verb as a direct object played a role in parsing in the same direct object / sentential complement ambiguity investigated in Trueswell et al. (1993). This confirmed the results of Trueswell et al. (1993), and separated out the effects caused by the plausibility of the noun phrase. As part of this project, Garnsey et al. (1997) performed a much larger sentence completion norming study on a superset of the verbs normed by Trueswell et al. (1993). Garnsey et al. (1997) selected candidate verbs based on the Connine et al. (1984) sentence production study.

MacDonald (1994) proposed that the difficulties in resolving syntactic ambiguities, such as in garden path sentences, were influenced by probabilistic constraints including “the frequencies of the alternative structures of ambiguous verbs.” This claim was supported by showing that the interpretation of reduced relative constructions was related to the degree to which the verb was used intransitively. Reduced relatives formed with highly transitive verbs such as interview, as in (9), are less difficult to process than reduced relatives formed with verbs that are frequently used intransitively, such as race, as in (10).

(9) The homeless people interviewed in the film are exceptionally calm … (MacDonald (1994), taken from Maslin (1991))

(10) # The horse raced past the barn fell. (MacDonald (1994), originally from Bever (1970))

(The # symbol will be used to indicate anomalous examples or garden path examples.)

Jennings, Randall, & Taylor (1997) demonstrated graded effects of verb subcategorization preferences on sentence parsing. They determined the degree of bias for 93 verbs that could take both a direct object and a sentential complement completion, using a sentence completion task in which subjects completed sentences consisting of a determiner, an adjective, an animate noun, and a past tense verb, such as “The old man observed ______.” They used 12 SC bias and 16 DO bias verbs in a cross-modal priming experiment where subjects heard the sentence up to the verb, and then had to name the visually presented prompt of either they, suggesting an SC completion, or them, indicating a DO completion. The naming latency was related to the degree of preference or dispreference of the prompt as a possible continuation.

The papers discussed in this section have illustrated evidence for the role of verb subcategorization frequencies in human sentence processing. It is important to note that all of these papers implicitly assume that a single set of subcategorization probabilities can be defined for each verb. In general, these probabilities in the mental lexicon are assumed to be acquired through exposure to language use.

1.3 The problem with verb subcategorization frequencies

The previous section shows that verb subcategorization frequencies play an important role in human language processing. However, studies such as Merlo (1994), Gibson, Schuetze, & Salomon (1996), and Gibson & Schuetze (1999) have found differences between syntactic and subcategorization frequencies computed from corpora and those computed from psychological experiments. Additionally, Biber and colleagues (Biber, Conrad, & Reppen 1998, Biber 1993, Biber 1988) have found that corpora differ in a wide variety of phenomena, including the use of various syntactic structures. This presents two problems for the psycholinguistic community. On one hand, one must answer the practical question of which verb subcategorization frequencies are the most appropriate ones to use for norming experiments. On the other hand, if processing relies on frequencies, and these frequencies are learned through exposure to language use, which frequencies are actually represented in the lexicon? Norming studies and corpora such as the Brown Corpus have different verb subcategorization frequencies, yet frequencies from both are commonly used to represent language use.

Merlo (1994) compared subcategorization frequency data for a set of verbs taken from psycholinguistic norming studies with corpus subcategorization frequencies for the same verbs. The norming data considered in Merlo (1994) was taken from four separate studies: a sentence production study (Connine et al. 1984) and three sentence completion studies (Garnsey et al. 1997, Holmes, Stowe, & Cupples 1989, and Trueswell et al. 1993). In a sentence production study, subjects are asked to write a sentence using a given verb, while in a sentence completion study, subjects are asked to complete a sentence based on a provided partial sentence, typically a grammatical subject followed by the verb. The corpus data used in her study was taken from the Penn Treebank (Marcus, Santorini, & Marcinkiewicz 1993), and consisted of a combination of Wall Street Journal, MARI radio broadcast transcriptions, and DARPA Air Travel Information System training data where subjects requested flight scheduling information from a reservation system. The comparisons between the corpus data and the Connine data were based on a set of five possible subcategorizations (NP, PP, S, SBAR, SBAR0) and an “other” category, while the comparisons between the corpus data and the other norming studies were based on the categories of NP, SC, and Other.

Table 1 shows the correlation values for the comparisons of the various data sets performed in Merlo (1994). The first column shows the corpora being compared. The second column shows the correlation between the frequencies of the NP subcategorization in each of the two corpora for each of the N verbs available in both corpora. This can be visualized by imagining a graph where each of the verbs is represented by a point in a two-dimensional space where the X axis represents the frequency of the NP subcategorization in one corpus, and the Y axis represents the frequency of the NP subcategorization in the other corpus. If all of the verbs have the same NP frequency in both corpora, then they would all be located on the line Y=X, and the correlation would be 1. For example, there are 36 verbs for which data is available in both the Trueswell and Merlo data sets, and the correlation between the frequency of the NP subcategorization for each of these verbs in the two data sets is .739. The third column shows the correlations between the frequencies of the SC subcategorizations.

Comparison²                NP                  SC³
Trueswell vs. Garnsey⁴     r = .935            r = .916
Trueswell vs. Merlo        r = .739            r = .444
                           F(1,36) = 43.36     F(1,36) = 8.848
                           p < .0001           p = .0052
Holmes vs. Merlo           r = .594            r = .667
                           F(1,21) = 10.883    F(1,21) = 15.990
                           p = .0036           p = .0007
Garnsey vs. Merlo          r = .727            r = .585
                           F(1,48) = 52.723    F(1,48) = 24.503
                           p < .0001           p < .0001
Connine vs. Merlo (<50)    r = .598            r = .751
Connine vs. Merlo (>50)    r = .784            r = .835
                           F(1,24) = 38.326    F(1,24) = 55.258
                           p < .0001           p < .0001

Table 1: Correlation values from comparisons in Merlo (1994).

² Merlo = corpus numbers in (Merlo 1994), Garnsey = Garnsey et al. (1997), Holmes = Holmes et al. (1989), and Trueswell = Trueswell et al. (1993)
³ SC is Merlo’s category “Clause” for all comparisons except the Connine comparison, in which case it is “S” (not “SBAR”)
⁴ Results for this comparison taken from Trueswell et al. (1993), and also appear in Merlo (1994).

Merlo concluded that the norming study data was “not strongly correlated with the corpus counts”, but that it was “appropriate to keep using corpus counts when needed and to continue exploring the possible sources of difference” because corpus probabilities do correlate with experimental evidence in some cases. The important point of this data is that the comparisons between the corpus data and the different norming study data sets have much lower correlations than the comparison between the norming studies done by Trueswell et al. (1993) and Garnsey et al. (1997). This shows that the results produced by two norming studies relying on similar protocols are much more like each other than they are like corpus data.

One must be careful in drawing conclusions about the degree of difference between various corpora based on such data, however. This is because each of the comparisons shown in Table 1 is based on the data for a different set of verbs (note that the values of N range between 21 and 48). One of the conclusions that will be drawn from the data presented in this dissertation is that the degree of subcategorization variability between corpora is related to the choice of which verbs are included in the frequency counts. Thus, if one wants to directly compare the relative degree of subcategorization frequency differences between a series of corpora, one should use the same set of verbs for each case (a difficult objective for both the Merlo paper and this dissertation, in that both rely on fixed data sets from norming studies conducted by other authors).
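The comparisons in Table 1 reduce to computing a Pearson correlation over per-verb frame frequencies. The following is a minimal sketch of that calculation; the verb names and proportions are invented for illustration and are not taken from Merlo (1994).

```python
# Sketch: correlating per-verb NP-frame frequencies across two data sets.
# The verbs and proportions below are invented for illustration only.
from statistics import correlation  # Pearson's r (Python 3.10+)

# Proportion of tokens of each verb that appear with an NP (direct object) frame.
norming_np = {"accept": 0.72, "admit": 0.41, "hear": 0.65, "know": 0.30}
corpus_np = {"accept": 0.58, "admit": 0.25, "hear": 0.70, "know": 0.22}

# Use only the verbs attested in both data sets (this is why N differs across rows).
shared_verbs = sorted(norming_np.keys() & corpus_np.keys())
x = [norming_np[v] for v in shared_verbs]
y = [corpus_np[v] for v in shared_verbs]

print(f"N = {len(shared_verbs)}, r = {correlation(x, y):.3f}")
```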

Other studies have found differences between corpus probabilities and experimental data. Gibson et al. (1996) and Gibson & Schuetze (1999) compare corpus frequencies and experimental data, although they do not address the issue of verb subcategorization probabilities directly. Gibson et al. (1996) found the frequencies of three different structures in the Penn Treebank Brown Corpus and Wall Street Journal Corpus. These structures corresponded to high, middle, and low attachment of an NP with three possible attachment sites. This ambiguity is shown in Figure 1, while examples of these structures are shown in Table 2.

Figure 1: High, middle, and low attachment sites (from Gibson et al. 1996). [Tree diagram: NP1 dominates N1 and a PP; that PP’s object NP2 dominates N2 and a second PP; the second PP’s object is NP3.]

Type of attachment for NP4, with Brown Corpus examples (taken from Gibson & Schuetze 1999):

Low (attached to NP3):
[NP1 strong opposition by [NP2 the coalition of [[NP3 Southern Democrats] and [NP4 conservative Republicans]]]]
[NP1 the running argument about [NP2 the relative merits of [[NP3 Mays] and [NP4 Mickey Mantle]]]]

Middle (attached to NP2):
[NP1 a fine big actor with [[NP2 a great head of [NP3 blond hair]] and [NP4 a good voice]]]
[NP1 correct observance of [[NP2 three hundred major rules of [NP3 ritual]] and [NP4 three thousand minor ones]]]

High (attached to NP1):
[[NP1 a man in [NP2 an occupation of [NP3 high hazard]]] and [NP4 a woman balanced on a knife-edge between death from tuberculosis and recovery]]
[[NP1 the question of [NP2 discrimination in [NP3 housing]]] and [NP4 the part each man present played in it]]

Table 2: High, middle, and low attachment sites from Gibson & Schuetze (1999).

They found that in both corpora, low attachment was most common, followed by middle attachment, followed by high attachment. On the other hand, they found that in both off-line comprehension experiments and reading time experiments, subjects preferred low attachment sites to high, and high to middle, rather than the low-middle-high ordering predicted by the corpus frequencies. Their results are taken to indicate that there are additional factors involved in comprehension (attachment preferences) that are not reflected in production (corpus) data. However, unlike the verb-based studies, the lexical identities of the nouns involved in the attachments were not taken into account. Additionally, attachment preferences for a two-way ambiguity (high and low attachment only) match corpus biases in both English, which has a low attachment preference, and Spanish, which has a high attachment preference (Mitchell, Cuetos, & Corley 1992, Cuetos, Mitchell, & Corley 1996). This suggests that, like the verb subcategorization examples discussed above, noun phrase attachment preferences do correspond to corpus frequencies in general, even though it is not true for the specific case discussed in Gibson et al. (1996) and Gibson & Schuetze (1999). This dissertation will not attempt to find specific causes for the noun phrase attachment differences described here. It is possible that word sense and discourse factors similar to those described in this dissertation may act on noun phrase attachment preferences as well as on verb subcategorization preferences.

Lapata, Keller, & Schulte im Walde (2001) follow up on the Merlo (1994) study by examining verb subcategorization frequencies for two separate ambiguities. These are the SC/DO ambiguity illustrated in examples (11) and (12), where the noun phrase in italics is either the object of the verb or the subject of a complement, and the NP/0 ambiguity illustrated in examples (13) and (14), where the noun phrase in italics is either the object of the verb (and is followed by the subject of the following clause) or the subject of the following clause itself.

(11) The teacher knew the answer to the question.

(12) The teacher knew the answer was false.

(13) As the professor lectured the students the fire-alarm went off.

(14) As the professor lectured the students fell asleep.

Lapata et al. (2001) compared data from the British National Corpus (BNC) (http://info.ox.ac.uk/bnc/index.html) with data from several norming studies. The BNC was chosen since it was felt to be more balanced than the corpus data used in the Merlo (1994) study. The BNC data was either hand labeled (manual), or extracted using one of two automatic algorithms (chunking & parsing). The differences between the different BNC data sets are not relevant to this dissertation, but the results from all three BNC data sets are reported here to provide a range from which to estimate the differences between the BNC data and the other data. The norming studies for the SC/DO ambiguity were Connine et al. (1984), Garnsey et al. (1997), and Trueswell et al. (1993), while for the NP/0 ambiguity, data from the sentence production and sentence completion studies in Pickering, Traxler, & Crocker (2000) and data from the Connine et al. (1984) sentence production studies were compared with the BNC data. Table 3 shows the results for the SC/DO comparisons, while Table 4 shows the results for the NP/0 comparisons. The values in the N columns show the number of verbs for which data was available in the corpus pairing. The r column shows the correlation between the subcategorization frequencies for each of the verbs, and the b column shows the percent of the N verbs that had the same subcategorization preference in both corpora. There were three possible subcategorization preference bins. For the SC/DO comparisons, these were SC-biased, when the SC frequency was at least twice that of the DO frequency, DO-biased when the DO frequency was at least twice that of the SC frequency, and equi-biased otherwise. A similar set of three bins was used for the NP/0 comparison.

                 Garnsey            Trueswell          Connine            BNC manual         BNC chunking
                 N    r    b        N    r    b        N    r    b        N    r    b        N    r    b
Trueswell        49   .91  76%
Connine          15   .87  73%      12   .78  67%
BNC manual       50   .75  62%      24   .69  50%      12   .96  100%
BNC chunking     90   .69  58%      44   .64  52%      21   .54  71%      52   .89  77%
BNC parsing      90   .81  74%      44   .69  61%      24   .74  50%      55   .92  78%      658  .84  93%

Table 3: Correlations (r) and agreement (b) for comparisons for DO and SC subcategorizations from Lapata et al. (2001).

                      Connine            Pickering          Pickering          BNC manual         BNC chunking
                                         production         completion
                      N    r    b        N    r    b        N    r    b        N    r    b        N     r    b
Pickering production  19   .80  68%
Pickering completion  16   .62  38%      69   .70  62%
BNC manual            28   .25  11%      22   .54  68%      16   .55  44%⁵
BNC chunking          12   .11  17%      85   .42  52%      64   .23  39%      67   .26  37%
BNC parsing           64   .61  56%      102  .66  67%      70   .42  40%      66   .54  63%      1862  .43  56%

Table 4: Correlations (r) and agreement (b) for comparisons for NP and 0 subcategorizations from Lapata et al. (2001).

⁵ The values in this cell correct a typographical error in the original paper (M. Lapata, personal communication, September 7, 2001)

The data in Table 3 and Table 4 provides several insights into the differences between verb subcategorization frequencies in different corpora. As is the case with the Merlo data, this data indicates a degree of similarity in verb subcategorization that is much better than would be found by random chance, but suggests that the different corpora do not have identical subcategorization probabilities for each verb. There is a tendency for the comparisons between pairs of norming studies to result in higher correlation and agreement numbers than are found in corpus (BNC) versus norming study comparisons, although the 100% agreement in verb preference assignment between the BNC-manual data and the Connine data in Table 3 is a clear exception to this generalization. The tendency of the norming study versus norming study results to have higher correlations is similar to the tendency found in the Merlo data. However, as in the case of the Merlo data, the different comparisons are based on different sets of verbs due to the fact that the different norming studies use different sets of verbs. This dissertation will argue that different verbs have inherently different degrees of variation between corpora, and that therefore, the different sets of verbs used in each comparison may influence the degree of variation found in each comparison.
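As a concrete illustration of the two-to-one bias bins and the agreement measure b described above, the following is a minimal sketch; the verbs and counts are invented and are not taken from Lapata et al. (2001).

```python
# Sketch of the bias-bin classification and agreement (b) measure described above.
# The verbs and counts are invented for illustration only.

def bias(sc_count, do_count):
    """SC-biased if SC is at least twice DO, DO-biased if the reverse, else equi-biased."""
    if sc_count >= 2 * do_count:
        return "SC"
    if do_count >= 2 * sc_count:
        return "DO"
    return "EQ"

# (SC count, DO count) for each verb in two hypothetical data sources.
source_a = {"admit": (40, 15), "accept": (5, 60), "know": (30, 25)}
source_b = {"admit": (22, 10), "accept": (12, 20), "know": (28, 24)}

shared = source_a.keys() & source_b.keys()
matches = sum(bias(*source_a[v]) == bias(*source_b[v]) for v in shared)
print(f"agreement b = {matches}/{len(shared)} = {matches / len(shared):.0%}")
```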

The results of both the Merlo (1994) and Lapata et al. (2001) studies show that while corpus data and norming study data are not identical, there is a much better than chance resemblance between them. It is difficult to say what degree of variation in subcategorization probabilities is acceptable, since this depends on how the probabilities will be used. All of the comparisons for which p values were provided in Merlo (1994) showed a significant difference. These comparisons had relatively small sample sizes compared to those available from recent large corpora such as the British National Corpus (http://info.ox.ac.uk/bnc/index.html). However, if two corpora are large enough, any type of difference will eventually be significant. This suggests that it is more appropriate to measure the significance of cross corpus differences with respect to whichever task is relevant, rather than by relying on traditional statistical measures of significance. For example, if one is preparing lists of verbs with different subcategorization biases for use in a psycholinguistic experiment, and two different data sources place a verb into a different bias group, then the difference is “significant”, and should be investigated, while any differences which do not cause a verb to be placed into different categories are “not significant”, and can be ignored for that task. Similarly, if the task is using a statistical parser to parse a corpus, then any differences between the training data and the corpus being parsed are “significant” only when they cause the parser accuracy to drop below whatever level of accuracy is acceptable for the task at hand.

Although the Merlo (1994) and Lapata et al. (2001) studies can both be taken to show that there are differences in subcategorization probabilities between different corpora and norming studies, they leave some questions unanswered. Although these studies provide some potential sources for the differences in subcategorization preference between different corpora, it would be useful to know more about why different corpora and norming methods produce different subcategorization probabilities. Any potential causes of differences in probabilities are important because in theories in which probabilities are used in comprehension, these probabilities are considered to be derived from exposure to language. In various experiments, such as those discussed in section 1.2, correlations have been found between sentence processing phenomena and both frequencies from norming studies and frequencies based on corpus data. If corpora, frequently treated as providing a representative view of human language production, yield different frequencies from norming studies, which frequencies actually represent human language production?

Second, Merlo (1994), Gibson et al. (1996), Gibson & Schuetze (1999), and Lapata et al. (2001) describe differences between psychological data and corpus data. However, it is also important to note that the choice of which frequencies to use in psycholinguistic models is not just a choice between corpus data and norming study data. Differences are also found between various corpus sources. Biber and colleagues (Biber et al. 1998, Biber 1993, Biber 1988) have extensively analyzed the differences between various registers of corpora such as fiction and academic writing and have found that many features of corpora differ between registers. The features they discuss range from syntactic, such as the use of passive, that-clauses, and to-clauses, to lexical, such as the distribution of similar-meaning words like big, large, and great, to discourse, such as the relative amount of new and given information. Biber does not, however, specifically address issues relating to verb subcategorization probabilities for individual verbs or the properties of sentence production and completion data, which are of interest in this dissertation. Because corpora also differ in their properties, one must consider not only the differences between norming protocols and corpora, but also the differences between individual corpora.

Third, the Merlo (1994) and Lapata et al. (2001) studies only address a limited set of broad subcategorization possibilities. It is possible that some causes of cross corpus differences might be obscured by these broad subcategorizations, and that a finer-grained set of distinctions might better reveal the sources of difference.

1.4 Verb subcategorizations and computational linguistics

Applications in the field of computational linguistics such as probabilistic parsers also face problems caused by different corpora having different verb subcategorization probabilities. These parsers face significant decreases in performance when they are used in domains other than the one for which they were originally trained. Because of this, they must be retrained for each new domain in which they are used. This section will argue that because verb subcategorization probabilities are an important source of information for these parsers, the differences in verb subcategorization probabilities between corpora play a significant role in the decrease in performance of these parsers.

1.4.1 The importance of verb subcategorization information for statistical parsers

Modern statistical parsers such as Charniak (1997) and Collins (1999) rely on lexicalized probabilistic context free grammars. A probabilistic context free grammar (PCFG) is a context free grammar (CFG) that assigns probabilities to each rule in the grammar. The probabilities used in these rules are generated by a learning algorithm that is trained on a set of pre-parsed training data. Once a set of probabilities is learned, the parser can then be used to parse other data.
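As a rough sketch of how such rule probabilities can be estimated by relative frequency from pre-parsed training data, consider the following; the rule counts are invented for illustration, and this is not the exact training procedure of any particular parser.

```python
# Sketch: maximum-likelihood estimation of PCFG rule probabilities from a treebank.
# Each rule probability is the rule's count divided by the count of its left-hand side.
# The rule counts below are invented for illustration.
from collections import defaultdict

rule_counts = {
    ("S", ("NP", "VP")): 1000,
    ("VP", ("V", "NP")): 400,
    ("VP", ("V", "SBAR")): 150,
    ("VP", ("V",)): 450,
    ("NP", ("DET", "N")): 700,
    ("NP", ("N",)): 300,
}

lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count

rule_probs = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}

for (lhs, rhs), p in sorted(rule_probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```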

In a lexicalized probabilistic context free grammar (LPCFG), the probabilities for each rule depend on the head word of each phrase. Table 5 shows sample grammar rules, while Figure 2 shows a sample tree structure illustrating how head word information is carried up the tree structure.

CFG             Probabilistic CFG       Lexicalized PCFG
S → NP VP       S → NP VP [.05]         S:pulled → NP:dogs VP:pulled [.05]
NP → DET N      NP → DET N [.02]        NP:sleds → DET:the N:sleds [.02]
NP → N N        NP → N N [.03]          NP:dogs → N:sheep N:dogs [.03]
…               …                       …

Table 5: Sample grammar rules (made-up probabilities).

Figure 2: Sample lexicalized tree taken from Charniak (1995). [Bracketed form: [S:pulled [NP:dogs [N:sheep sheep] [N:dogs dogs]] [VP:pulled [V:pulled pulled] [NP:sleds [DET:the the] [N:sleds sleds]] [PP:over [Prep:over over] [NP:rocks [N:rocks rocks]]]]]]

Completely lexicalized PCFGs contain the subcategorization probabilities for each verb. For example, the subcategorization probabilities for the verb pull are represented in all of the rules in the grammar with VP:pulled as the left hand side of the rule. A sample rule, taken from the sentence shown in Figure 2, is shown in example (15).

(15) VP:pulled -> V:pulled NP:sleds PP:over

The right hand side of this rule represents a possible (narrowly defined) subcategorization of the verb pull.

The probabilities for grammar rules used in parsers are generated by using a learning algorithm to induce the probabilities from a set of training data. However, when rules are as specific as the rule in example (15), they are unlikely to occur in a set of training data frequently enough for the learning algorithm to learn the correct probabilities. Because of this, each probabilistic parser must use some approximation to the lexicalized rules. Collins (1999) model 1 approximates the probabilities of the elements on the right hand side of all rules, not just VP rules, by assuming that the probabilities of each of the elements on the right hand side of the rule are independent. Thus, the probability of the rule in example (15) is not estimated from the probability of a verb phrase headed by the verb pull having a daughter noun phrase headed by sleds co-occurring with a prepositional phrase headed by the preposition over, but is instead estimated by finding the separate probability of a verb phrase headed by pulled having a daughter noun phrase headed by sleds with or without any other arguments multiplied by the probability of a verb phrase headed by pulled having a prepositional phrase headed by the preposition over with or without any other arguments. When this sort of approximation is made, the LPCFG no longer contains subcategorization information, since there is no information about how often different arguments actually co-occur with a given verb.
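The independence approximation can be sketched roughly as follows; the probabilities are invented, and this is a simplification of Collins (1999) model 1, which also conditions on direction, distance, and other features omitted here.

```python
# Rough sketch of the independence approximation described above.
# Instead of estimating the joint probability of the full right-hand side
#   VP:pulled -> V:pulled NP:sleds PP:over
# directly (too sparse), each dependent is generated independently given the head.
# All probabilities are invented; Collins (1999) model 1 also conditions on
# direction, distance, and other features that are omitted here.

p_dependent_given_head = {
    ("NP:sleds", "VP:pulled"): 0.02,
    ("PP:over", "VP:pulled"): 0.05,
}

def rule_probability(head, dependents):
    """Approximate P(dependents | head) as a product of per-dependent probabilities."""
    p = 1.0
    for dep in dependents:
        p *= p_dependent_given_head[(dep, head)]
    return p

print(rule_probability("VP:pulled", ["NP:sleds", "PP:over"]))  # 0.02 * 0.05 = 0.001
```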

This approximation results in better parser performance than if the approximation were not made, because it allows the grammar to provide estimates of the probabilities of right hand side combinations which either did not occur in the training data or occurred with a low frequency. However, this also allows the parser to propose unlikely combinations of arguments in a VP, because in real language, the arguments of a verb are not independent of each other. Collins (1999) model 2 explicitly adds verb subcategorization information back into the parser in order to reduce the problem of unlikely subcategorization combinations. Collins (1999, p. 202) reports that adding subcategorization information (model 1 vs. model 2) improves the accuracy of the parser by 10%, when the model is not using distance constraints that he feels are approximating subcategorization information. This difference in performance illustrates the importance of verb subcategorization information in parser performance.

Carroll, Minnen, & Briscoe (1998) demonstrate the utility of verb subcategorization information in an otherwise non-lexicalized parser. They compared the output from two different versions of a non-lexicalized parser (structural rules only). In one case, the parse with the highest probability was chosen. In the other case, the output of the same parser was re-ranked according to subcategorization frequencies based on 10 million words of data from the British National Corpus. They found that the addition of the verb subcategorization information significantly improved the accuracy of the parser.
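One way to picture this kind of re-ranking is sketched below; the parse scores, subcategorization frequencies, and the simple multiplication used to combine them are invented for illustration and are not the actual scoring scheme of Carroll, Minnen, & Briscoe (1998).

```python
# Sketch: re-ranking candidate parses by the corpus frequency of the verb
# subcategorization frame each parse assigns to the verb. All numbers are
# invented; this is not the actual scoring scheme of Carroll, Minnen, & Briscoe (1998).

# Candidate parses for one sentence: (parser probability, frame assigned to the verb).
candidates = [
    (0.012, "NP"),    # verb analyzed as taking a direct object
    (0.010, "SBAR"),  # verb analyzed as taking a sentential complement
]

# Relative frequency of each frame for this verb, estimated from a large corpus.
subcat_freq = {"NP": 0.25, "SBAR": 0.60, "other": 0.15}

reranked = sorted(candidates, key=lambda c: c[0] * subcat_freq[c[1]], reverse=True)
best_prob, best_frame = reranked[0]
print(f"best parse after re-ranking: frame={best_frame}")
```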

1.4.2 Problem: the need to retrain parsers for new domains

One problem with statistical parsers is that it is usually necessary to retrain the statistical algorithms for each new domain. This is because different domains and genres have different statistical properties. This requires new labeled training data for each new genre. Carroll & Rooth (1998) showed that frequency-based parsers work better with ‘imperfect’ automatically generated frequencies from the same genre than with ‘perfect’ frequencies from a different genre. This indicates that the difference in subcategorization probabilities between genres is larger than the amount of error in the automatic parsers which are trained on the same sort of data as that on which they are tested.

Gildea (2001) investigated the effect of choice of corpus training data on the performance of a parser. He used the Collins (1999) model 1 parser with the distance features mentioned above as approximating verb subcategorization information. When the parser was trained on Wall Street Journal training data and run on Brown Corpus data, the performance was significantly lower than when run on WSJ data. When the parser was trained on Brown Corpus data, the performance on Brown Corpus data was significantly higher than when trained on WSJ data. When the parser was trained on a combination of both training sets, the performance on each of the corpora was not much higher than when it was only trained on the training data for that corpus. This indicates both that the training data is corpus specific, and that extra training data from other corpora is of little help to the parser. These results are shown in Table 6.

Training data    Test set    Recall    Precision
WSJ              WSJ         86.1      86.6
WSJ              Brown       80.3      81.0
Brown            Brown       83.6      84.6
WSJ + Brown      Brown       83.9      84.8
WSJ + Brown      WSJ         86.3      86.9

Table 6: Results from Gildea (2001).

Gildea (2001) also shows that lexical bigram statistics used by the parser are corpus specific and “are of no use when attempting to generalize to new training data”. These statistics capture the probabilities of parent/child relationships, such as the probability of a VP headed by pulled having a PP headed by over as a child. This result indicates that the relationships between the head words in the LPCFG rules are corpus specific.

Because the lexical bigram probabilities for a particular VP parent are a reflection of the subcategorization probabilities for that verb, these results suggest that differences in verb subcategorization frequencies between corpora may play an important role in the degradation of parser performance when the parsers are used in new domains. If the factors that are causing different corpora to have different verb subcategorization probabilities can be identified, then two sets of problems can be solved. On one hand, it would solve the problem of which probabilities to use in norming psychological experiments and the problem of what frequencies are represented in the mental lexicon. On the other hand, the need to retrain computational applications such as probabilistic parsers for each new domain could also be reduced.

1.5 Solving the verb subcategorization frequency problem

The previous sections have outlined a series of problems affecting psycholinguistics and computational linguistics, all of which are caused by different corpora and data sources having different verb subcategorization probabilities. This dissertation will provide evidence that a significant source of these probability differences, and therefore a significant cause of these problems, is the fact that different senses of verbs have different subcategorization probabilities. Both probabilistic parsers, such as Charniak (1997) and Collins (1999), and psycholinguistic research, such as Trueswell et al. (1993), Garnsey et al. (1997), and Merlo (1994) base their verb subcategorization probabilities on the verb word form or lexeme (the combination of all forms of a verb) rather than on the individual senses of verbs. Thus, the subcategorization probability for the verb admit traditionally includes all of the uses of admit, whether admit means allow to enter or confess or any other possible sense. However, as Table 7 shows, these different senses do not always have the same subcategorization properties. WordNet sense 1 for the verb admit takes a direct object or a sentential complement, while the other senses take direct objects or prepositional phrases.

Sense 1: admit, acknowledge -- (declare or acknowledge to be true; "He admitted his errors"; "She acknowledged that she might have forgotten")
    - Somebody ----s something
    - Somebody ----s that CLAUSE

Sense 2: admit, allow in, let in -- (allow to enter; grant entry to; "We cannot admit non-members into our club")
    - Somebody ----s somebody
    - Somebody ----s somebody PP

Sense 3: admit, let in, include -- (allow participation in or the right to be part of; permit to exercise the rights, functions, and responsibilities of; "admit someone to the profession"; "She was admitted to the New Jersey Bar")
    - Somebody ----s somebody
    - Somebody ----s somebody PP

Sense 4: accept, admit, take, take on -- (admit into a group or community; "accept students for graduate study"; "We'll have to vote on whether or not to admit a new member")
    - Sam is admitting Sue

Sense 5: admit, allow -- (afford possibility: "This problem admits of no solution"; "This short story allows of several different interpretations")
    - Something is ----ing PP

Sense 6: admit -- (give access or entrance to; "The French doors admit onto the yard")
    - Something is ----ing PP

Sense 7: accommodate, hold, admit -- (have room for; hold without crowding; "This hotel can accommodate 250 guests"; "The theater admits 300 people"; "The auditorium can't hold more than 500 people")
    - Something ----s somebody
    - Something ----s something

Sense 8: admit -- (serve as a means of entrance; "This ticket will admit one adult to the show")
    - Something ----s somebody
    - Something ----s something

Table 7: Senses and possible subcategorizations of admit from WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller 1993).

The problems arise when different corpora or data sources have different distributions of verb senses. When these senses have different subcategorization probabilities, different subcategorization probabilities are found for the verb lexeme in each corpus. For example, the probability of the lexeme admit taking a sentential complement is related to the relative frequency of WordNet sense 1 in the corpus. If sense 1 does not occur, then there will be no instances of the lexeme taking a sentential complement. If sense 6 is common, then the lexeme will have a high percentage of prepositional phrase uses. This dissertation will demonstrate that the problems discussed above can be solved or reduced by basing verb subcategorization probabilities on the verb senses rather than on the verb lexeme. Psycholinguistic models should represent verb senses in the lexicon, norming studies should control for verb sense, and parsers should include verb sense or semantic context information in their language models.

1.5.1 Evidence for the relationship between verb semantics and subcategorization from linguistics

The relationship between verb sense and verb subcategorization is not a new idea. Linguists have long suggested that the lemma or sense of a word is the locus of subcategorization; for example, Green (1974) showed that different senses of the verb run have different subcategorizations. She describes the following two senses of run: "When run occurs as an intransitive verb, with or without a prepositional phrase, it refers to a sort of locomotion." This sense is illustrated in examples (16) and (17), from Green (1974).

(16) John ran fast.

(17) John ran into the room.

"When run occurs before for and a noun denoting an elective office, it refers to a certain sanctioned, goal-oriented activity which doesn't particularly involve any locomotion." This sense is illustrated in example (18), from Green (1974).

(18) Lenore ran for senator.

She argues that these are two different senses because other locomotion verbs can be substituted in examples (16) and (17), but not in examples such as (18).

Since Gruber (1965) and Fillmore (1968), linguists have been trying to show that the syntactic subcategorization of a verb is related to the semantics of its arguments. Thus one might expect a verb meaning enter to have a different set of syntactic properties than a verb meaning confess. Similarly, if two senses of a single verb mean enter and confess, these two senses should have different syntactic properties. For example, the verb admit has (at least) two possible senses, confess and allow to enter. The confess sense of admit, involving communication, allows for a sentential complement expressing propositional content, shown in example (19).

(19) This sentiment was confirmed by Saints' boss Ian Branfoot who admitted [that the attacking Town tactics had taken him by surprise]SC. (BNC)

Alternatively, the allow to enter sense, involving motion, allows for a directional prepositional phrase, shown in (20).

(20) The windows were opened slightly to admit fresh air [into the room]Directional PP. (BNC)

It is also possible for different senses to share a possible subcategorization. Both the confess (21) and the allow to enter (22) senses can appear with a direct object.

(21) It's important to stress that men as well as women can be victims of harassment and sometimes it's harder for them to admit [it]DO. (BNC)

(22) The door opened to admit [Rosa]DO, wearing her customary black. (BNC)

The notion of a semantic base for subcategorization probabilities is consistent with work such as Argaman, Pearlmutter, & Garnsey (1998) and Argaman & Pearlmutter (in press), which shows that verbs and their nominalizations have similar subcategorization preferences.

Previous papers have also proposed links between fine-grained sense distinctions and the syntactic manifestations of these senses. Pinker (1989) proposes that the syntactic structure in which a verb appears is uniquely determined by the semantics of the event being described. In this theory, there are separate semantic representations of the verb for each of the possible syntactic patterns of the verb. The following pair of examples, discussed at length in Goldberg (1995) and Pinker (1989), would probably be classified as belonging to the same gross sense of the verb spray. However, example (23) is generally taken to imply that the wall is completely covered with paint, while in example (24), only a small portion of the wall may actually end up with paint on it. Thus, one might propose that these are actually examples of two different senses of spray, one meaning completely cover, and the other meaning splash.

(23) Bob sprayed the wall with paint.

(24) Bob sprayed paint onto the wall.

Figure 3 shows the relationship between the separate semantic structures proposed by Pinker for ‘spray’ in ‘Bob sprayed the wall with paint’ (bottom) and ‘paint’ in ‘Bob sprayed paint onto the wall’ (top).


Figure 3: Semantic structures for two different syntactic patterns of ‘spray’ (Pinker 1989, page 228).

Although Pinker claims that each subcategorization of a verb has a unique semantic structure, he also claims that many of the differences are so subtle as to require an innate set of semantic representations in order to be able to learn the distinctions.

Lexicographers traditionally provide possible usages for each sense of a verb. An example of this is WordNet (Miller et al. 1993), which provides detailed lists of possible subcategorizations for each sense of a word, but does not provide frequency information for these subcategorizations. The WordNet subcategorization entries for the eight WordNet senses of the verb admit are shown above in Table 7.

Although previous linguistic research has demonstrated that different senses of a verb can have different subcategorization possibilities, it is not clear from this work that verb sense-based subcategorization differences are the actual cause of cross corpus differences in subcategorization frequencies. If nearly all verbs in the corpora have only one sense, or nearly all of the senses have the same distribution of subcategorizations, then different distributions of verb sense cannot provide an adequate explanation of cross corpus subcategorization differences.

1.5.2 Evidence for the relationship between verb semantics and subcategorization from computational linguistics

Given the relationship between verb sense and verb subcategorization described above, it should be possible to use verb subcategorization information to help disambiguate the sense of the verb or verb sense information to predict the subcategorization of the verb. Recently, Bikel (2000) developed an algorithm relying on syntactic and semantic data to simultaneously parse and sense tag text. The goal was to allow sense information to help parsing, and syntactic information to help sense tagging. The success of this combination is unclear. The output of the combined sense/parsing algorithm was no better than an equivalent parsing algorithm with no access to sense information, and the word sense disambiguation results are not directly comparable to most other WSD results due to differences in the nature of the task, so one cannot directly assess whether the WSD results are better than other results which do not take structural information into account.

Other WSD algorithms have also made use of structural information. Lin (1997) uses explicit lexical head / syntactic dependency relationship information to perform word sense disambiguation, while Yarowsky (2000) uses a combination of co-occurrence and dependency information. In these cases, information such as knowing that the word facility is the subject of the verb employ in example (25) is used to determine that facility is an example of WordNet sense 1 (installation) rather than WordNet sense 5 (bathroom).

(25) The new facility will employ 500 of the 600 existing employees. (Lin 1997)

It is not clear to what extent the use of such lexical head information in WSD is actually the inverse of using verb sense information to predict subcategorization. The WSD tasks above rely on the relationship between a particular verb sense and the individual lexical items that fill various slots such as subject or direct object, while the psychological predictions of subcategorization discussed in this dissertation involve the relationship between either individual verbs or verb senses and the existence of various subcategorization possibilities. In other words, the information used by Lin (1997) and Yarowsky (2000) would be useful for predicting what the direct object is, while the task in the psychological experiments is whether there is a direct object at all. Still, this evidence suggests that there is a relationship between verb sense and verb subcategorization.

1.5.3 Evidence for the relationship between verb semantics and subcategorization from psycholinguistics

Since different senses of a verb have different subcategorization probabilities, the semantic context leading up to a verb should affect the likelihood of different senses, and thus the likelihood of different subcategorizations. If this is the case, then (human) parsing decisions should also be affected by the semantic context. This prediction is confirmed by recent work by Hare et al. (2001), who have shown that the semantic context preceding a verb influences the performance of human subjects in both sentence completion and reading time tasks. They selected 20 verbs, each with multiple possible subcategorizations. They prepared a pair of biasing contexts for each verb such that one context biased the verb towards a sense with a direct object (DO) subcategorization, and the other towards a sense with a sentential complement (SC) subcategorization. For example, (26) shows a context that biases the verb admit towards a DO interpretation, while (27) shows a context which biases the verb admit towards an SC interpretation. This work is discussed in detail in chapter 3.

(26) The two freshmen on the waiting list refused to leave the professor’s office until he let them into his class. Finally, he admitted [the students.]

(27) For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. Finally, he admitted [that the students had little chance of succeeding.]

1.6 Outline of Chapters

The previous sections have demonstrated that the differences in verb subcategorization frequencies are an important problem for both psycholinguistics and computational linguistics. Section 1.2 argued that verb subcategorization probabilities play an important role in psycholinguistic theories of processing and that such probabilities are part of the representation of each verb in the mental lexicon. Section 1.2 also demonstrated the need to estimate verb subcategorization probabilities for the purpose of norming the studies that test theories of processing. Section 1.3 showed that the methods used to estimate such probabilities are problematic. Different sets of probabilities are generated in the production of single sentences in either sentence production or sentence completion tasks than are found in connected discourse production such as the Brown corpus and the Wall Street Journal corpus. This results in a theoretical question of which frequencies are represented in the lexicon, and a practical question of which frequencies to use for norming purposes. Section 1.4 presented a related problem in computational linguistics. Probabilistic parsers, which rely on verb subcategorization frequencies, face significant decreases in performance when they are used in new domains. It was argued that cross corpus differences in verb subcategorization frequencies play an important role in this need to retrain parsers for each new domain. Evidence of a relationship between verb subcategorization and verb sense was discussed. Section 1.5.1 described some of the linguistic evidence that suggests that one should expect different senses of a verb to have different subcategorization probabilities. Section 1.5.2 described computational research that has taken advantage of verb subcategorization information to predict verb semantics. Section 1.5.3 described research that provides evidence that humans make use of this relationship when generating verb subcategorization expectations during comprehension. Section 1.5 hypothesized that the cross corpus differences in verb subcategorization probabilities are caused by the failure to take the distribution of individual verb senses into account.

The claim that individual verb senses, not the verb lexeme, are the appropriate locus of verb subcategorization probabilities has several implications:

1) Models of lexical representation need to take verb sense differences into account when modeling subcategorization probabilities. This leads to a prediction that verb sense will affect any phenomena that are affected by verb subcategorization probabilities. Hare et al. (2001) provides evidence of semantic biases towards individual senses of verbs affecting parsing decisions in comprehension.

2) Psychological verb subcategorization norming procedures need to control for any factors which affect verb sense, since these also affect verb subcategorization probabilities.

3) Computational applications such as parsers which take verb sense into account will be more domain independent, and less likely to need retraining for each new domain.

The remainder of this dissertation will be concerned with demonstrating that verb sense differences do cause subcategorization frequency variation, and also with developing and testing a model of how the semantic context preceding a verb influences verb subcategorization predictions.

1.6.1 Chapter 2

Chapter 2 will demonstrate that different senses of verbs and their corresponding differences in subcategorization, as well as inherent differences between the production of sentences in psychological norming protocols and language use in context, are important causes of the subcategorization frequency differences found between corpora. Chapter 2 presents evidence based on two different data sets.

The first investigation in this chapter compares subcategorization probabilities for the 127 verbs in Connine et al. (1984) and the 48 verbs in Garnsey et al. (1997) with subcategorization probabilities for the same verbs, taken from the Penn Treebank parsed versions (Marcus et al. 1993) of the Wall Street Journal Corpus, Brown Corpus (Francis & Kucera 1982), and Switchboard corpus (Godfrey, Holliman, & McDaniel 1992). These three all consist of connected discourse and are available from the Linguistic Data Consortium (http://www.ldc.upenn.edu). Two major causes of verb subcategorization frequency differences are discussed. First, the isolated production tasks used in the norming studies result in discourse-based differences such as decreases in the use of passive and zero anaphora and an increase in the use of “default” referents. Second, semantic biases caused by the sentence production methodologies and topics (such as business) discussed in the corpora result in the different data sources having different distributions of verb senses, which in turn results in different distributions of subcategorizations.

The second investigation in this chapter compares subcategorization probabilities for 64 verbs collected from three corpora of written connected discourse: the Wall Street Journal and Brown corpora used above, and the British National Corpus (http://info.ox.ac.uk/bnc/index.html). This data demonstrates that when discourse type (all three corpora are written connected discourse) and verb sense (the data was hand labeled for verb sense, and only the data for the most common sense was used) are controlled for, cross corpus variation in subcategorization is reduced. The main senses of all but nine verbs have the same transitivity bias in each of the corpora. The remaining differences in transitivity bias are the result of finer-grained sense distinctions than those that were controlled for in the labeling process.

This leads to three conclusions: 1) verb subcategorization probabilities should be based on individual senses of verbs rather than the whole verb lexeme, 2) “test tube” sentences are not the same as “wild” sentences, and thus the influences of experimental design on verb subcategorization probabilities should be given careful consideration, and 3) when sense and discourse type differences are controlled for, cross corpus verb subcategorization differences are reduced.

For computational linguists, the results of this chapter suggest that the need to retrain computational applications for each new corpus would be significantly reduced if verb sense were taken into account. For psycholinguists, the results of this chapter suggest that verb sense and genre/discourse type differences need to be taken into account when designing experiments and norming materials, and that corpus frequencies and psycholinguistic phenomena are more likely to be correlated if verb sense is taken into account.

1.6.2 Chapter 3

Chapter 3 will provide a computational model, based on Latent Semantic Analysis, of the influence of verb sense on verb subcategorization. This model will be used to predict verb subcategorization based on the semantic context preceding the verb in corpus data. The first experiment will show that this model makes the same subcategorization predictions as humans given the same contextual information by modeling the data from Hare et al. (2001). The human subjects predicted an SC completion given an SC bias context, and a DO completion given either a DO bias context or a neutral context. The subjects also preferred DO completions in the absence of a specific bias context, because the verbs in the study are more frequently used with DO completions. In order to model this data, the model in this chapter must predict SC completions given SC bias contexts. When given a DO bias context, the model must either predict a DO completion or make a neutral prediction, since the model can then fall back on the default DO preference of the verbs given a neutral context.
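The general idea behind this kind of prediction can be sketched as follows. This is only an illustrative approximation, not the implementation described in chapter 3: the training contexts below are invented, and scikit-learn's TF-IDF vectors reduced with truncated SVD are used as a stand-in for the LSA space actually built from corpus data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented stand-ins for corpus contexts preceding "admitted", labeled by the
# subcategorization that actually followed the verb in the corpus example.
do_contexts = ["the guard let the students into the class and admitted",
               "the club voted to let the new member enter and admitted"]
sc_contexts = ["the suspect denied everything but finally confessed and admitted",
               "the coach conceded defeat after the loss and admitted"]

docs = do_contexts + sc_contexts
vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2)            # the real model would use far more dimensions
space = svd.fit_transform(vectorizer.fit_transform(docs))

def predict_frame(context):
    """Project a new preceding context into the reduced space and compare it
    to the DO-labeled and SC-labeled training contexts."""
    v = svd.transform(vectorizer.transform([context]))
    do_sim = cosine_similarity(v, space[:len(do_contexts)]).mean()
    sc_sim = cosine_similarity(v, space[len(do_contexts):]).mean()
    return "DO" if do_sim > sc_sim else "SC"

# A context sharing vocabulary with the confess-like examples is expected to come out SC-like.
print(predict_frame("the trail guide finally conceded the problem and admitted"))
```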

The model does predict the SC subcategorization when given the SC bias contexts from Hare et al. (2001), but finds that the DO bias contexts from Hare et al. (2001) are equally similar or dissimilar to both corpus DO and SC examples. This indicates that the DO bias contexts used in Hare et al. (2001) are either not strongly biased towards DO contexts, or are biased towards DO uses that do not occur in the corpus samples used by the model. However, the model can still be considered to be successful, since the model can fall back on using the inherent DO bias of the verbs.

The second experiment will show that the model is able to correctly predict both DO and SC subcategorizations when naturally occurring bias contexts are used, instead of the bias contexts provided in Hare et al. (2001). This shows that the difficulties the model faced in predicting the subcategorizations of the Hare et al. (2001) DO bias contexts are a result of the nature of the bias contexts and not an insensitivity of the model to the relevant semantics.

The third experiment will show that the model is capable of predicting the subcategorization of low frequency senses of verbs, and that the model can distinguish between different subcategorizations of the same sense of a verb. This suggests that the performance of the model on the Hare et al. (2001) DO bias contexts is due to the contexts being of a neutral nature rather than the contexts being biased towards low frequency uses of the verbs.

For computational linguists, the model in this chapter provides a source of subcategorization information that could supplement the information used in statistical parsers. For psycholinguists, this chapter provides a potential model for using semantic context and verb sense to predict verb subcategorizations in human sentence processing. For linguists, this chapter provides a method for attempting to induce the relationships between verb semantics and verb subcategorization patterns discussed in Pinker (1989) and Goldberg (1995).

1.6.3 Chapter 4

Chapter 4 will summarize the contributions of this work and discuss directions for future research.

2 Subcategorization probability differences in corpora and experiments

Chapter 1 presented a series of problems in psycholinguistics and computational linguistics that are caused by different corpora and norming studies having different verb subcategorization probabilities. These problems include the issues of which probabilities are most appropriate for use in norming experiments, which probabilities are represented in the mental lexicon, and the problem of needing to retrain parsers for each new domain. Chapter 1 argued that verb sense-based subcategorization differences were an important source of subcategorization probability variation. This chapter will provide evidence that verb sense differences, along with discourse factors related to the isolated nature of sentence production in the norming studies, are major causes of verb subcategorization frequency differences between data sources. This chapter will also show that as these factors are controlled for, cross corpus variation in subcategorization frequency is reduced.

2.1 Combining sense-based verb subcategorization probabilities with other factors to yield observed subcategorization frequencies

The results in this chapter are explained in terms of a model, shown in Figure 4, in which core subcategorization probabilities for individual senses of a verb are combined with other probabilistic factors to yield the observed subcategorization probabilities. This model is presented as a way to show the relationship between the different factors that influence subcategorization probabilities. It is also intended as a statement that when subcategorization frequency differences are found, one should look for sense and discourse factors as possible causes of the differences.

The usual method for testing such a model would be to perform a regression analysis or to use a training set / test set protocol to determine the appropriate weights for each of the factors. Unfortunately, these methods would require the creation of a large corpus labeled for verb sense and all relevant discourse features. Because this is presently not feasible, the influences of these factors will have to be demonstrated through the analysis of small sets of hand labeled corpus data. Thus, it is impossible to provide an accurate measure of the size of each effect or a measure of the completeness of the list of factors discussed, only an indication of the existence of each factor.


[Figure 4 diagram: for each corpus, the core subcategorization probabilities of each individual sense of a verb are adjusted first for the relative frequency of that sense in the corpus, and then for the effects of discourse factors and other properties of the corpus, yielding the final subcategorization probabilities observed for the verb in that corpus.]

Figure 4: Model showing why different corpora have different subcategorization probabilities for the same verb.
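The first adjustment step in Figure 4 — mixing per-sense subcategorization probabilities according to how often each sense occurs in a corpus — can be written as a simple weighted sum. The sketch below uses invented numbers purely for illustration; the dissertation does not estimate these particular values, and the further discourse adjustments in Figure 4 are omitted.

```python
def observed_subcat_probs(sense_subcat_probs, sense_freqs_in_corpus):
    """Mix core per-sense subcategorization probabilities by the relative
    frequency of each sense in a given corpus (first adjustment in Figure 4)."""
    mixed = {}
    for sense, p_sense in sense_freqs_in_corpus.items():
        for frame, p_frame in sense_subcat_probs[sense].items():
            mixed[frame] = mixed.get(frame, 0.0) + p_sense * p_frame
    return mixed

# Invented per-sense probabilities for two senses of "admit":
core = {
    "confess":        {"NP": 0.4, "Sfin": 0.6},   # DO or sentential complement
    "allow to enter": {"NP": 0.5, "NP PP": 0.5},  # DO or DO plus directional PP
}

# A corpus in which the confess sense dominates makes the lexeme look SC-taking.
business_like = {"confess": 0.9, "allow to enter": 0.1}
print({frame: round(p, 2) for frame, p in observed_subcat_probs(core, business_like).items()})
# {'NP': 0.41, 'Sfin': 0.54, 'NP PP': 0.05}
```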

The verb subcategorization frequencies observed in norming studies and corpus data are the result of language production. Thus, any model explaining the differences between different corpora and norming studies must be a model of how these uses were produced. Because frequent correlations are found between production data, such as corpus frequencies and norming study frequencies, and processing phenomena, such as reading times and garden path reactions, it is anticipated that a similar process takes place during comprehension, where underlying subcategorization expectations based on individual verb senses are modified depending on various factors in the context leading up to the verb. This is not to say that complete agreement between production and comprehension probabilities is expected, since factors such as memory limitations affect comprehension differently than they do production.

2.1.1 Verb senses

The primary claim of this model is that subcategorization probabilities are based on individual senses of each verb rather than on the verb lexeme. The model requires some method for representing the relationship between verb senses and verb subcategorizations. Two different representation systems are used in this dissertation. In this chapter, the uses of each verb are divided into broadly defined senses, and all uses of each sense are treated as having the same subcategorization probabilities. Alternatively, in chapter 3, the experiments rely on a non-discrete model of verb sense where there are no pre-defined senses. The former method is more suited for hand labeling and lexicographic tasks while the latter is more suited for automatic induction tasks. These two systems of representing verb senses make the same predictions under many circumstances. However, the non-discreteness of the latter model is more compatible with the apparent gradations of senses in natural language use. It is the intention of this dissertation to demonstrate the importance of distinguishing between different senses of a verb rather than to verify or falsify any particular model of verb sense.

2.1.2 Probabilistic factors

In the model proposed in this chapter, the observed subcategorization probabilities are the result of various factors combining with subcategorization probabilities for an individual sense of the verb. Many such potential factors have already been described in the literature. Biber (1988), Biber (1993), and Biber et al. (1998) provide an extensive analysis of how the register of the corpus affects factors such as the likelihood of various syntactic structures, collocations, and individual lexical items. The intuition here is that if you know you are reading (for example) a scientific report (or dealing with a corpus of scientific reports), the probability of encountering a passive sentence is much higher than in a fiction document or in conversation, independent of which verb or verb sense is actually used in the sentence.

Another factor that can influence parsing decisions is the specific discourse context preceding a verb. Altmann, Nice, Garnham, & Henstra (1998) show that sentences such as (28) are parsed differently (by humans) depending on the preceding context. When the context specifies that both dogs have been washed, as in (29), subjects expect wash to be followed by a time phrase indicating at which point in time the dog was washed. Sentence (28) seems anomalous in this case. Alternatively, when only one dog was washed, as in (30), subjects are not expecting a time phrase to modify wash, and thus (28) seems normal. In these examples, the sentence completion6 expectations of wash change because of differences in contextual factors rather than because of differences in sense.

(28) He’ll brush the dog he washed tomorrow to make its fur shine again.

(29) # Tom’s got two young dogs and they like playing in the fields. Tom washed one of the dogs yesterday but the other one last week. He’ll brush the dog he washed tomorrow to make its fur shine again.

(30) Tom’s got two young dogs and they like playing in the fields. Tom washed one of the dogs but did not want to bother with the other dog. He’ll brush the dog he washed tomorrow to make its fur shine again.

This dissertation does not attempt to characterize all of the factors that influence subcategorization probabilities, but does provide several new factors that are particularly relevant to the design of psycholinguistic experiments to add to the set of previously proposed factors.

6 Typically, verbs are not considered to subcategorize for time phrases, although the phenomena here do involve the issue of whether tomorrow modifies the verb brush or wash.

2.2 Experiment – comparing norming studies and corpus data

Perhaps the ideal way to demonstrate that verb sense differences were causing subcategorization differences between corpora would be to take a large sense and subcategorization tagged corpus, and perform a regression to find out how much of the differences were accounted for by verb sense. Unfortunately, there is no suitable ready made set of data for such a purpose. Semcor, an extension of WordNet (Miller et al. 1993, Miller 1995), in which a portion of the Brown Corpus is tagged for sense, does provide some information on the relative frequencies of senses. When this data is aligned with the Penn Treebank version of the Brown corpus, a limited amount of sense/subcategorization frequency information can be derived. However, this data suffers from a severe data sparseness problem, with most senses having no more than a handful of examples in the Brown data. Work is underway to provide more precise subcategorization (including semantic frame) and sense data for a set of words in the FrameNet project (Lowe, Baker, & Fillmore 1997, Baker, Fillmore, & Lowe 1998). This type of project will make estimations of the size of the effect of verb sense on verb subcategorization more feasible. In the meantime, it is possible to label limited samples of data to show examples of verb sense and discourse type effects on verb subcategorization.

The purpose of the study in this section is to examine the causes of subcategorization frequency differences found in five corpora. These corpora consist of two psychological norming studies, two corpora of written text, and one corpus of spoken data. These corpora were chosen to represent a wide variety of genres and discourse types. The first section of results will discuss the contributions made by factors related to the differences between the single sentence norming studies and the connected discourse corpora. The second section of results will discuss the contribution of verb sense to the differences in subcategorization frequencies. The third section of results will show that when discourse and sense differences between corpora are reduced, verb subcategorization probabilities become more similar.

2.2.1 Methodology

Five different sources of subcategorization information are compared. Two of these are corpora derived from psychological experiments in which subjects are asked to produce single isolated sentences. These are both widely cited studies, Connine et al. (1984) (CFJCF) and Garnsey et al. (1997) (Garnsey). The three non-experimental corpora used are all on-line corpora that have been tagged and parsed as part of the Penn Treebank project (Marcus et al. 1993): the Brown corpus (BC), the Wall Street Journal corpus (WSJ), and the Switchboard corpus (SWBD). These three all consist of connected discourse and are available from the Linguistic Data Consortium (http://www.ldc.upenn.edu). Both the 127 verbs used in the Connine et al. study and the 48 verbs published from the Garnsey et al. study were investigated. The Connine et al. and Garnsey et al. data sets have nine verbs in common. Table 8 shows the number of tokens of the relevant verbs that were available in each corpus. It also shows whether the sample size for each verb was fixed or frequency dependent. Verb frequency was controlled for in all cross-corpus comparisons.

Corpus    Tokens (verb set)           Examples per verb
CFJCF     5,400 (127 CFJCF verbs)     n ≅ either 29, 39, or 68
Garnsey   5,200 (48 Garnsey verbs)    n ≅ 108
BC        21,000 (127 CFJCF verbs)    0 ≤ n ≤ 2,644
          6,600 (48 Garnsey verbs)
WSJ       25,000 (127 CFJCF verbs)    0 ≤ n ≤ 11,411
          5,700 (48 Garnsey verbs)
SWBD      10,000 (127 CFJCF verbs)    0 ≤ n ≤ 3,169
          4,400 (48 Garnsey verbs)

Table 8: Approximate size of each corpus.

2.2.1.1 Connine et al. (1984) sentence production study

The first set of data used was from a sentence production study done by Connine et al. (1984). Data from the published subcategorization frequencies and the sentences from the original subject response sheets, provided by Charles Clifton, was used. The Connine et al. (1984) study used two slightly different sentence production protocols. In both protocols, subjects were given a list of words and asked to write sentences using them, based on a given topic or setting. Subjects frequently used these topic or setting words in their sentence completions. In the first study, each word always appeared with the same topic (for example the verb beg always appeared with the topic animals/nature, while the verb teach always appeared with the topic sports/games); the words included verbs, nouns, and prepositions; only the verbs were used in the statistics. Table 9 shows samples of the prompts and subject responses from protocol 1.

Prompt                 Actual response
teach (sports/games)   My friend taught me to play racquetball.
fly (work/workers)     The company paid the employees to fly to the seminar.
carry (books/reading)  I remember when Joe carried my books home from school.

Table 9: Connine et al. (1984) protocol 1 sample prompts and subject responses.

In the second study, only verbs were used; each appeared with one of three possible settings, home, school, or downtown, randomly distributed across subjects. Table 10 shows samples of the prompts given to the subjects in protocol 2 and sample responses resulting from the prompts. Note that the provision of topic/subject words does influence the subject responses – 45% of the responses in the second protocol contained the words home, school, or downtown. This tendency was less noticeable in the first protocol, where two topic/subject words were provided.

Prompt              Actual response
forget (home)       Don’t forget your chair at home.
charge (downtown)   “Charge it” is my favorite phrase downtown.
answer (school)     In school, Tom was always the first to answer.

Table 10: Connine et al. (1984) protocol 2 sample prompts and subject responses.

91 verbs were used with the first protocol and 66 verbs in the second. 30 verbs were used in both protocols, so the total number of verbs in at least one protocol was 127. Table 11 shows the 127 verbs from the Connine et al. (1984) study.

advise, agree, allow, answer, approve, ask, attack, attempt, beg, believe, block, buy, call, carry, charge, chase, cheat, check, choose, clean, coach, coax, comfort, continue, copy, criticize, debate, decide, describe, disappear, discuss, dispute, drive, encourage, escape, expect, fight, fly, follow, forget, gore, govern, guard, guess, happen, hear, help, hesitate, hire, hurry, imitate, include, investigate, invite, judge, jump, keep, kick, kill, know, leave, lecture, load, lose, motion, move, notice, object, order, paint, pass, pay, perform, permit, persuade, phone, play, point, position, praise, promise, prompt, protest, pull, push, race, read, realize, refuse, remember, review, revolt, rule, rush, save, say, see, seem, signal, sing, stand, start, stay, stop, store, strike, struggle, study, surrender, swear, swim, talk, teach, tell, think, tire, try, understand, unload, urge, visit, wait, walk, want, watch, worship, write Table 11: 127 verbs used from Connine et al. (1984)

2.2.1.2 Garnsey et al. (1997) sentence completion study

The second set of psychological norming data was taken from a sentence completion study performed by Garnsey et al. (1997). In the sentence completion methodology, subjects are given a sentence fragment and asked to complete it. Garnsey et al. (1997) used this methodology to gather subcategorization frequencies by giving subjects a person’s name followed by the preterite form of the target verb and asking them to complete the sentence. Section 2.2.2.2 will discuss how the provision of a proper noun as the subject influences sentence production. They tested 100 verbs; for each, subjects saw a fragment consisting of a proper name followed by the verb in the preterite form, as shown in Table 12.

Debbie remembered ______.

Table 12: Sentence Completion protocol used to collect subcategorization frequencies by Garnsey et al. (1997).

The sentence completions were then coded by hand into three broad subcategorization categories: Sentential Complement, shown in (31), Direct Object, shown in (32), and Other, shown in (33).

(31) Sentential Complement: George admitted [that he enjoyed taking that psychology experiment]SC.

(32) Direct Object: George admitted [his guilt]NP.

(33) Other: Dennis remarked [about the food]PP.

Garnsey et al. assigned verbs to an SC preference group if they were used at least twice as often with an SC as with a DO. Likewise, verbs were assigned to a DO preference group if the DO use was at least twice as common as the SC use. A third group of verbs, EQ-bias, was created for verbs with less than a 15% difference between the number of SC and DO uses. Each group consisted of 16 verbs for a total of 48 used in their experiments. The remaining verbs were not used. Table 13 shows three sample verbs, one in each class, together with their subcategorization frame preferences:

Verb Class   Verb       Direct Object   Sentential Complement   Other
DO-bias      accepted   98%             1%                      1%
EQ-bias      felt       12%             11%                     77%
SC-bias      admitted   9%              60%                     30%

Table 13: Example subcategorization-frame probabilities for each of the three subcategorization frame classes (DO-bias, SC-bias, and EQ-bias) of Garnsey et al. (1997).
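The grouping criterion can be sketched as below. This is only a reading of the description above: it assumes the 15% difference is measured in percentage points of all completions, and it does not try to settle how any overlap between the criteria was handled.

```python
def bias_class(do, sc, total):
    """Assign a verb to a bias group from its DO and SC completion counts."""
    if sc >= 2 * do:
        return "SC-bias"
    if do >= 2 * sc:
        return "DO-bias"
    if abs(do - sc) / total < 0.15:
        return "EQ-bias"
    return None  # verbs satisfying none of the criteria were not used

print(bias_class(do=98, sc=1, total=100))   # accepted  -> DO-bias
print(bias_class(do=12, sc=11, total=100))  # felt      -> EQ-bias
print(bias_class(do=9, sc=60, total=100))   # admitted  -> SC-bias
```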

The published frequency data for the 48 verbs as well as the data from the original subject response sheets, provided by Susan Garnsey, was used in this analysis.

accept, acknowledge, admit, advocate, announce, argue, assert, assume, believe, claim, concede, conclude, confess, confide, confirm, decide, declare, deny, discover, doubt, emphasize, establish, estimate, fear, feel, figure, guarantee, guess, hear, imply, indicate, insure, know, maintain, predict, print, propose, protest, prove, realize, regret, sense, suggest, suspect, understand, warn, worry, write

Table 14: 48 verbs used from Garnsey et al. (1997)

2.2.1.3 Brown Corpus

The three non-experimental corpora used are all on-line corpora that have been tagged and parsed as part of the Penn Treebank project (Marcus et al. 1993). The Brown corpus is a 1-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc). The texts had all been published in 1961, and the corpus was assembled at Brown University in 1963-1964 (Francis and Kucera 1982). Because the Brown corpus is the only one of the five corpora that was explicitly balanced for genre, and because it has become a standard for on-line corpora, it is often used as a benchmark to compare with the other corpora. The Penn Treebank version of the Brown Corpus was tagged and parsed by the Penn Treebank project, a property that allows examples of particular verb/subcategorization combinations to be extracted automatically.

For the 127 verbs used in the CFJCF study, there were approximately 21,000 relevant verb tokens in the Brown Corpus. For each calculation in this chapter where individual verb frequency could affect the outcome, the results were normalized for frequency, and verbs with fewer than 50 examples were eliminated. This left 77 out of 127 verbs in the Brown Corpus. For the 48 verbs in the Garnsey study, there were approximately 6,600 relevant verb tokens. Of the 48 verb types, only 27 had a frequency greater than 50. Table 16 shows examples from the Brown corpus for each of the 16 Connine categories.

2.2.1.4 Wall Street Journal Corpus

The Wall Street Journal corpus is a 1-million word collection of Dow Jones Newswire stories. This corpus is also a part of the DARPA WSJ-CSR1 corpus. The 1-million word segment used was parsed by the Penn Treebank Project. For the 127 verbs used in the CFJCF study, there were approximately 25,000 relevant verb tokens. There were 74 verbs that had a frequency greater than 50. For the 48 verbs in the Garnsey study, there were approximately 5,700 tokens, with 34 of the verbs having a frequency greater than 50.

2.2.1.5 Switchboard Corpus

Switchboard is a corpus of telephone conversations between strangers, collected in the early 1990s (Godfrey et al. 1992). Only the half of the corpus that was processed by the Penn Treebank project was used; this half consists of 1155 conversations averaging 6 minutes each, for a total of 1.4 million words in 205,000 utterances. For the 127 verbs used in the CFJCF study, there were approximately 10,000 tokens in Switchboard. Only 30 verbs had a frequency greater than 50. For the 48 verbs in the Garnsey study, there were approximately 4,400 tokens, with only 7 of the verbs having a frequency greater than 50.

2.2.1.6 Extracting subcategorization probabilities from the corpora

Deriving subcategorization probabilities from the five corpora involved both automatic scripts and some hand re-coding. The set of complementation patterns is based in part on collaboration with the FrameNet project (http://www.icsi.Berkeley.edu/~framenet, Baker et al. 1998, Lowe et al. 1997). The FrameNet complementation patterns are described in Gahl (1998a), Gahl (1998b), Johnson and Fillmore (2000), and Johnson et al. (2001). These complementation categories are also similar to the ones used in Connine et al. (1984), with some changes in the exact definitions of some categories. The 17 categories used in this experiment are listed in Table 15.

Code            Description
0               Intransitive with no other arguments from this list
PP              Intransitive with prepositional phrase
VPto            Intransitive with to marked infinitive verb phrase
Sforto          Intransitive with for PP and to marked infinitive phrase
Swh             Intransitive with Wh clause
Sfin            Intransitive with finite clause
VPing           Intransitive with gerundive verb phrase
Percep. Compl.  Perception Complement
NP              Transitive
NP NP           Ditransitive
NP PP           Transitive with prepositional phrase
NP VPto         Transitive with a to marked infinitive verb phrase
NP Swh          Transitive with a Wh clause
NP Sfin         Transitive with a finite clause
Quo             Quotation
Passive         Passive
Other           Other

Table 15: List of subcategorizations.

Examples of each category taken from the Brown Corpus and the Connine et al. (1984) study are shown in Table 16 and Table 17. These two tables are provided to give a feel for the differences between the corpus data and the sentence production data, but please note that the examples from the Brown corpus are relatively short examples for ease of display; a random sample would have produced much longer sentences on average. The definitions of these subcategorizations are discussed in detail below.

1  0               He always did when people asked.
2  PP              Guerrillas were racing [toward him].
3  VPto            Hank thanked them and promised [to observe the rules].
4  Sforto          …Papa agreed [with Mama] [to make a joint will …].
5  Swh             I know now [why the students insisted that I go to Hiroshima even when I told them I didn't want to].
6  Sfin            She promised [that she would soon take a few days’ leave and visit the uncle she had never seen, on the island of Oyajima -- which was not very far from Yokosuka].
7  VPing           But I couldn't help [thinking that Nadine and Wally were getting just what they deserved].
8  Percep. Compl.  Far off, in the dusk, he heard [voices singing, muffled but strong].
9  NP              The turtle immediately withdrew into its private council room to study [the phenomenon].
10 [NP NP]         The mayor of the town taught [them] [English and French].
11 [NP PP]         They bought [rustled cattle] [from the outlaw], kept him supplied with guns and ammunition, harbored his men in their houses.
12 [NP VPto]       She had assumed before then that one day he would ask [her] [to marry him].
13 [NP Swh]        I asked [Wisman] [what would happen if he broke out the go codes and tried to start transmitting one].
14 [NP Sfin]       But, in departing, Lewis begged [Breasted] [that there be no liquor in the apartment at the Grosvenor on his return], and he took with him the first thirty galleys of Elmer Gantry.
15 Passive         A cold supper was ordered and a bottle of port.
16 Quotes          He writes [“Confucius held that in times of stress, one should take short views – only up to lunchtime.”]

Table 16: Examples of each subcategorization frame taken from the Brown Corpus.

1  0               The anchorman on Channel 7 performed very well.
2  PP              He disappeared [from sight].
3  VPto            I agreed [to come along].
4  Sforto          I waited for Jenny to bring me my book I left at her house.
5  Swh             It’s hard to realize [how far away Australia really is from the U.S.]
6  Sfin            I believe [it’s time for new furniture].
7  VPing           Many try to escape [doing all their work in college].
8  Percep. Compl.  It was fun to help [Rick study last night].
9  NP              I wrote [a book of recipes].
10 [NP NP]         I bought [my brother] [a sweater] in town.
11 [NP PP]         I teach [basketball] [to young kids].
12 [NP VPto]       I expect [the program] [to be a little more interesting].
13 [NP Swh]        We worshipped [him] [when he lived downtown].
14 [NP Sfin]       She promised [her boyfriend] [that she would stick to her diet].
15 Passive         Politics are sometimes loaded with propaganda.
16 Quotes          (none)

Table 17: Examples of each subcategorization frame from the response sheets for the CFJCF data.

The following section describes the set of subcategorization distinctions used in hand coding data. These definitions were also used in developing the series of regular expression searches and tgrep scripts which were used to compute probabilities for these subcategorization frames from the three syntactically parsed Treebank corpora (BC, WSJ, SWBD). The details of the tgrep search patterns are described in Appendix A. Some categories (in particular the quotation category Quo) were difficult to code automatically and so were re-coded by hand. Since the Garnsey et al. data used a more limited set of subcategorizations, portions of this data were manually recoded into the 17 categories.
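The actual extraction was done with tgrep search patterns over the Treebank parses (the patterns themselves are given in Appendix A). Purely to illustrate the general approach, the rough sketch below — not the patterns used in this study — tallies a few coarse frames from the small parsed WSJ sample that ships with NLTK; note that it makes exactly the kinds of errors discussed below, such as counting temporal NPs as objects.

```python
from nltk.corpus import treebank  # small parsed WSJ sample, used only for illustration

def coarse_frame(vp):
    """Assign a very rough frame label from the top-level sisters of the verb."""
    labels = [c.label().split("-")[0] for c in vp
              if hasattr(c, "label") and not c.label().startswith("VB")]
    has_np, has_pp = "NP" in labels, "PP" in labels
    has_s = "SBAR" in labels or "S" in labels
    if has_np and has_s:
        return "NP Sfin"
    if has_np and has_pp:
        return "NP PP"
    if has_np:
        return "NP"
    if has_s:
        return "Sfin"
    if has_pp:
        return "PP"
    return "0"

def count_frames(verb_forms):
    counts = {}
    for tree in treebank.parsed_sents():
        for vp in tree.subtrees(lambda t: t.label() == "VP"):
            heads = [c for c in vp if hasattr(c, "label")
                     and c.label().startswith("VB") and c[0].lower() in verb_forms]
            if heads:
                frame = coarse_frame(vp)
                counts[frame] = counts.get(frame, 0) + 1
    return counts

print(count_frames({"admit", "admits", "admitted", "admitting"}))
```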

A sample of the automatically labeled data was hand checked for accuracy. The error rate in the data, excluding errors in quotation identification (which were fixed by hand), is between 3% and 7%7. The error rate is given as a range due to the subjectivity of some types of judgments. 2-6% of the error rate was due to mis-parsed sentences in Treebank, including PP attachment errors, argument/adjunct errors, etc. 1% of the error rate was due to inadequacies in the search strings, primarily in locating displaced arguments via the Treebank 1 style notation used in the Brown Corpus data. The sources of error are further discussed in the following sections and in Appendix A.

7 Although the overall error rate of the search patterns is low, it tends to affect some verbs and subcategorizations more than others. Because of this, the appendix provides the search strings to generate subcategorization data from the Treebank corpora rather than the resulting numbers generated by the strings. This allows any potential users of this data to see first hand whether the verbs of interest are prone to error or not.

Transitive (NP)

The NP constituent appears in all transitive categories, including the ditransitive subcategorization. Examples (34) and (35) illustrate transitive and ditransitive uses of verbs. Cases where the NP has been dislocated are also counted, such as in example (36). Note that passivization is counted in a separate category. The NP constituent excludes cases where the NP is not an argument of the verb, such as the time NP in (37). Measure phrases such as in (38) are also excluded. In practice, for the Treebank data, the NP constituent includes all nodes labeled NP that are either lexically filled or marked with a T. Because of this, there were some errors in the Treebank data. A hand checked random sample showed that about 1% of the overall examples involved missed NPs as a result of failing to identify cases of movement. Additionally, there was an overall error rate of about 2% that resulted from cases such as (37) and (38), where the time and measure NPs are (incorrectly) included in the count.

(34) But, darn it all, why should we help [a couple of spoiled snobs who had looked down their noses at us]NP? (Brown)

(35) He suggested that a regrouping of forces might allow [the average voter]NP [a better pull at the right lever for him]NP on election day. (Brown)

(36) She always did before, and showed the utmost confidence in whatever we advised [ ]NP. (Brown)

(37) Senators unanimously approved [Thursday]TIME NP [the bill of Sen. George Parkhouse…]NP (Brown)

(38) We walked [miles]MEASURE NP and saw various shrines and gardens. (Brown) (cf. We walked [the dog]NP)

Prepositional Phrase (PP)

The PP constituent appears in both the PP and NP PP subcategorizations shown in (41) and (42) respectively. This category includes prepositional phrases that are arguments of the verb, but not those that are adjuncts, such as in (43). For the Treebank data, the argument / adjunct distinction in the parse structure was used. Unlike the Treebank data, the Connine et al. data did not distinguish between whether prepositional phrases were used as arguments or adjuncts. Because of this, the relevant portions of the Connine et al. data were recoded to match the Treebank data, resulting in the reclassification of a number of Connine et al. examples from [PP] to [0] and [NP PP] to [NP]. Example (39) illustrates a case which was coded as an instance of a PP under both the Connine et al. coding scheme and the one used in this dissertation, while example (40) illustrates a case which was originally coded as a PP in the Connine et al. scheme, but was recoded as a [0] example because the prepositional phrase was a time phrase.

(39) I had to struggle [with the steak]PP because it was so tough. (Connine et al. data)

(40) We struggled [for hours]Time Phrase. (Connine et al. data)

The data used in this experiment also relies on a distinction between prepositions and particles. Verb particle combinations such as (44) were treated as being separate verbs, and were thus excluded from this study. Within the Treebank data, the PRT and PP tags were used to make this distinction.

(41) He remained there for four years before moving [to Rensselaer Polytechnic Institute in Troy, N.Y.]PP (Brown)

(42) One young girl told me how her mother removed a wart from her finger by soaking a copper penny in vinegar for three days and then painting [the finger]NP [with the liquid]PP several times. (Brown)

(43) She painted [the finger]NP [on Monday]ADJUNCT.

(44) All this was unknown to me, and yet I had dared to ask her [out]Particle for the most important night of the year! (Brown)

Intransitive with ‘to’ marked infinitive verb phrase (VPto)

The VPto constituent can appear either with or without an NP.

(45) However, three of the managers did say that they would agree [to attend the proposed meeting]VPto if all of the other managers decided to attend. (Brown)

(46) He advised [the poor woman]NP [not to appear in court] VPto as what she was charged with T was not in violation of law. (Brown)

‘For’ PP with ‘to’ marked infinitive phrase (Sforto)

The Sforto subcategorization consists of a prepositional phrase, typically with the preposition for, and a to marked infinitive verb phrase.

(47) A few years before his death Papa had agreed [with Mama]PP [to make a joint will with her]VPto in which it would be provided that in the event of the death of either of them an accounting would be made to their children whereby each child would receive a bequest of $ 5000 cash. (Brown)

(48) I’d like [for you]PP [to meet my mother] VPto. (Framenet)

‘Wh’ clause (Swh)

The Wh clause can occur with either transitive or intransitive verb uses. This category includes the traditional Wh words as well as if and whether clauses.

(49) Note where the sun rises and sets, and ask [which direction the prevailing winds and storms come from]Swh.

(50) About five years ago, Handley came to ask [me]NP [if he could see the tattered register]Swh.

Finite Clause (Sfin)

The Sfin constituent consists of a finite clause. This clause can be preceded by a that.

(51) But I insist upon believing [that even when it is lost, it may, like paradise, be regained]Sfin (Brown).

(52) A tribe in ancient India believed [the earth was a huge tea tray resting on the backs of three giant elephants, which in turn stood on the shell of a great tortoise]Sfin (Brown).

Gerundive Verb Phrase (VPing)

The VPing constituent consists of a gerundive verb phrase. The gerundive verb has no (lexically filled) subject. Items with a subject are included in the perception complement category, thus making this a subset of the Framenet VPing category.

(53) He seemed to remember [reading somewhere that Abyssinians had large litters]VPing, and suffered a dismaying vision of the apartment overrun with a dozen kittens. (Brown)

(54) However, he continued [experimenting and lecturing]VPing, publishing the results of his experiments in German and Danish periodicals. (Brown)

Perception Complement (Percep. Compl.)

The Perception Complement constituent consists of an untensed clause with either a bare stem verb or an ing verb. These are referred to as perception complements because they are frequently used with perception verbs such as hear and see. The distinction between the VPing category and this category is that the verb in the perception complement has a (lexically filled) subject. This category is derived from Connine et al. (1984), and is a superset of the Framenet VPbrst category.

(55) Winston had heard [her shaking out the skirt of her new pink silk hostess gown]Percep. Compl. (Brown)

(56) Anyhow, I wasn't surprised, early that morning, to see [Handley himself crossing from Dogtown Common Road to the Back Road]Percep. Compl. (Brown)

Quotation (Quo)

The quotation is a possible constituent for many verbs of communication. This constituent was particularly hard to identify using tgrep search patterns, since not all examples are marked with quotation marks. In addition, when quotation marks are present, they are frequently present for items which are not arguments of the verb in question. Additionally, the content of a quotation can range from an exclamation to a complete sentence, which makes it difficult to predict particular structural patterns. Because of these difficulties, quotations were hand labeled after the normal tgrep labeling process.

(57) [“A fine idea”]QUO, Mike agreed. (Brown)

(58) [“Hey, Mityukh”]QUO, asks one group, [“what are we shouting about?”]QUO (Brown)

Passive

Passive verb uses were automatically identified by tgrep search strings. Passives included both be and get passives. Reduced relative uses of verbs were not included in this study.

(59) American history should clinch the case when Congress is asked to approve. (Brown)

(60) This selection-rejection process takes place as the file is read. (Brown)

Other

The other category does not represent a particular class of verb uses. Instead, frequencies for other in various tables reflect the frequency for the remaining subcategorizations not specifically shown in the table.

2.2.1.7 Measuring differences between corpora

Previous authors such as Trueswell et al. (1993), Merlo (1994), and Lapata et al. (2001) have used correlation values found for a single subcategorization possibility (such as NP) across several verbs in order to measure the differences in subcategorization between corpora. For example, to calculate the correlation value for the NP subcategorization between the Brown Corpus and the WSJ corpus based on the data below, one would use the NP frequencies for hear as well as those for several other verbs. When only two subcategorization possibilities exist, the correlation values for a single subcategorization are sufficient to capture the differences between the corpora, but when multiple subcategorization possibilities exist, separate correlation values are calculated for each subcategorization. This use of the correlation statistic has two disadvantages, however. It cannot be used to measure the difference between the use of a single verb in two corpora. Additionally, when a large number of subcategorization possibilities exist, the correlation values for a single subcategorization may not yield a complete picture of the differences, since the set of verbs under consideration may have very different distributions of one subcategorization, such as NP, but may tend to have more similar distributions of the other subcategorizations (noting that the total of all uses must still add up to 100%).

Because many results in this dissertation involve multiple possible subcategorizations or comparisons between the use of a single verb in various corpora, a different method of measuring differences must be used. Two different statistics were used for comparing the subcategorization probabilities of a verb in different corpora. First, the cosine (Salton & McGill 1983), a standard measure in information retrieval, was used in order to measure the degree of difference between the uses of a verb in two different corpora. The subcategorization frequencies for a verb can be treated as a vector in multidimensional space. This allowed us to use the cosine of the angle between the vectors as a measure of the agreement between the subcategorization frequencies of verbs in different corpora. Table 18 shows the vectors for the verb hear in the Brown corpus and in the Wall Street Journal corpus. Using Formula 1, the cosine of the two vectors shown in Table 18 is 0.98. For non-negative vectors such as these subcategorization frequency vectors, the cosine ranges from 0 (complementary distribution) to 1 (complete agreement). This measure is not affected by sample size, and thus allows for direct comparison of the degree of difference between corpora irrespective of corpus size.

hear    0    PP   Swh   Sfin   Percep. Compl.   NP   NP PP   passive
BC      4    12   3     1      15               47   4       14
WSJ     0    17   3     5      13               56   10      10
Table 18: Raw subcategorization vectors for hear from BC and WSJ.

$$\mathrm{Cosine} = \frac{\sum_{i=1}^{n} x_i\, y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\;\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

Formula 1: Cosine of two vectors, x and y.
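As an illustration, Formula 1 can be spelled out in a few lines of code. The sketch below (in Python, which is not part of the original analysis) recomputes the cosine for the raw hear vectors from Table 18; the ordering of the counts is taken directly from that table.

```python
import math

def cosine(x, y):
    """Cosine of two subcategorization frequency vectors (Formula 1)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Raw counts for 'hear' from Table 18, in the order:
# 0, PP, Swh, Sfin, Percep. Compl., NP, NP PP, passive
brown = [4, 12, 3, 1, 15, 47, 4, 14]
wsj   = [0, 17, 3, 5, 13, 56, 10, 10]

print(round(cosine(brown, wsj), 2))  # prints 0.98, as reported above
```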

Although the cosine value shows the degree of difference between the distribution of the subcategorizations of a verb between two corpora, it does not show whether the difference is significant or not. The Chi Square statistic is used to determine whether the differences in subcategorization are significant. In order to calculate Chi Square, subcategorizations with low frequencies are collapsed into an other category, such as in Table 19. When Chi Square is calculated for this example, the difference between the two corpora is found to be not significant (Chi Square = 4.13, p > .2).

hear    PP   Percep. Compl.   NP   NP PP   passive   Other
BC      12   15               47   4       14        8
WSJ     17   13               56   10      10        8
Table 19: Modified subcategorization vectors for hear from BC and WSJ for use in calculating Chi Square.
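A minimal sketch of the same comparison using a standard contingency-table test is given below; scipy's chi2_contingency is assumed here as a stand-in for whatever statistical software was actually used, and the counts are those of Table 19.

```python
from scipy.stats import chi2_contingency

# Collapsed counts for 'hear' from Table 19:
# columns are PP, Percep. Compl., NP, NP PP, passive, Other
observed = [
    [12, 15, 47, 4, 14, 8],   # Brown corpus
    [17, 13, 56, 10, 10, 8],  # Wall Street Journal
]

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 2))  # roughly 4.13 and p > .2, i.e. not significant
```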

2.2.2 Results and discussion: Part 1 - Subcategorization differences resulting from comparing isolated sentence and connected-discourse corpora

A portion of the subcategorization frequency differences is the result of the inherently different nature of single sentence production used in psychological experiments and connected discourse found in the more natural corpus data. This section will show that the single sentence / connected discourse opposition affects subcategorization through two general mechanisms: the use of discourse cohesion in connected discourse and the use of default referents in null context (isolated sentence production).

2.2.2.1 Discourse cohesion

The first difference between single sentence production and connected discourse involves discourse cohesion. Unlike isolated sentences, a sentence in connected discourse must cohere with the rest of the discourse. Halliday & Hasan (1976) use the notion of cohesion to show why sentences such as “So we pushed him under the other one” sound odd as the start of a conversation. Because a large number of syntactic phenomena such as pronominalization, fronting, deixis, and passivization play a role in discourse coherence, one would expect these syntactic devices to be used differently in connected discourse than in single sentence production. In addition, to the extent that these syntactic phenomena affect subcategorization, one would expect sentences produced in isolation (such as in the Connine et al. and Garnsey et al. experiments) to have different subcategorization probabilities than sentences found in connected discourse, such as in the Brown corpus, the Wall Street Journal corpus, and the Switchboard corpus. Because dislocated arguments and pronominalized arguments were counted in the same categories as their non-dislocated and full NP counterparts, pronominalization and most kinds of movement do not affect subcategorization frequencies. Two syntactic devices that do affect subcategorization frequencies are passivization and zero anaphora.

Passive

The passive in English is generally described as having one of two broad functions: (1) de-emphasizing the identity of the agent and (2) keeping an undergoer topic in subject position (Thompson 1987). Because both of these functions are more relevant for multi-sentence discourse, one would expect that sentences produced in isolation would make less use of passivization. As shown in Table 20, there is a much greater use of the passive in all of the connected discourse corpora than in the isolated sentences from Connine et al.8

Data Source            % passive sentences
Garnsey                —9
CFJCF                  0.6%
Switchboard            2.2%
Wall Street Journal    6.7%
Brown corpus           7.8%
Table 20: Use of passives in each corpus.

Zero Anaphora

Zero anaphora also plays a role in discourse cohesion. Whether an argument of a verb may be omitted depends on factors such as the semantics of the verb, what kind of omission the verb lexically licenses, the definiteness of the argument, and the nature of the context (Fillmore 1969, 1986; Fraser & Ross 1970; Resnik 1996; inter alia). In one common case of zero anaphora, Definite Null Complementation (DNC), “the speaker’s authority to omit a complement exists only within an ongoing discourse in which the missing information can be immediately retrieved from the context” (Fillmore 1986). For example, the verb follow licenses DNC only if the ‘thing followed’ can be recovered from the context, as shown in examples (61) and (62). Because the referent must be recoverable from the context, this type of zero anaphora is unlikely to occur in single sentence production, where the context is limited at best.

(61) The shot reverberated in diminishing whiplashes of sound. Hush followed. (Brown corpus)

(62) Underwriting doesn't get under way until after morning tea at 10 a.m. A two-hour lunch break follows. (WSJ)

The lack of Definite Null Complementation in single sentence production results in single sentence corpora having a lower occurrence of the [0] subcategorization frame. For example, the direct object of the verb follow is often omitted in the connected discourse corpora, but never omitted in the Connine et al. data set. Hand-counting every instance of follow in all four corpora verifies that every case of omission was caused by definite null complementation. The referent is usually in a preceding sentence or a preceding clause of the same sentence. Note that because a proper noun and the preterite form of the verb are provided in the Garnsey et al. (1997) protocol, producing an example of the [0] subcategorization frame would more or less require the subjects to leave a blank response.

8 There were more passives in the written than in the spoken corpora, supporting Chafe (1982).
9 The protocol used in Garnsey et al. (1997) forces the subjects to produce active sentences.

Data Source            % [0] subcategorization frame
Garnsey                —
CFJCF                  0%
Wall Street Journal    5%
Switchboard            11%
Brown                  22%
Table 21: The object of follow is only omitted in connected-discourse corpora. (numbers are hand-counted, and indicate % of omitted objects out of all instances of follow)

Default referents

In connected discourse, the context controls which referents are used as arguments of the verb. In single sentence production tasks, there is no larger context to provide this influence. In the absence of such demands, one might expect the subjects to use a wider variety of arguments with the verbs. On the contrary, the subjects favor a narrow set of default referents – those which are accessible in the experimental context, or which are prototypical arguments of the verb. There are three kinds of biases toward these default referents.

1) First Person Subjects

First, non-zero subjects of single sentence productions were more likely to be I or we than subjects in the types of written connected discourse sampled. Presumably the participants tended to use themselves as the topic of the sentence since in a null context there was no topic under discussion. Table 22 shows that the single sentence production data has a higher use of first person subjects than the written connected discourse data. Note that the Switchboard corpus also has a higher use of first person subjects. This could reflect a tendency for the participants, who are talking to strangers, to use themselves as a topic, given the absence of shared background.

Data Source            % first person subject
Garnsey                —10
CFJCF                  40%
Switchboard            39%
Brown corpus           18%
Wall Street Journal    7%
Table 22: Greater use of first person subject in isolated-sentences.

2) Anaphoric relationship between NPs

Second, VP internal NPs (i.e. NPs which are c-commanded by the verb) in single sentence production are more likely to be anaphorically related to the subject of the verb than are the internal NPs of VPs in connected discourse. This includes cases such as (63) where the embedded NP is co-referential with the subject, and cases such as (64) where the embedded NP and the subject are related by a possession or part-whole relationship. To simplify judgment of relatedness, only co-referential pronouns and traces were counted. Inferentially related NPs were not counted.

(63) Tom_i noticed that he_i was getting taller. (Garnsey et al. data)

(64) Alice_i prayed that her_i daughter wouldn’t die. (Garnsey et al. data)

(65) John_i said he_i will kill his_i teacher. (Connine et al. data)

Table 23 shows how often the subject was anaphorically related to a VP internal NP in a hand-labeled sample of 100 sentences randomly selected from the entirety of each corpus.

Data Source            % related subject/NP
Garnsey                41%
CFJCF                  26%
Wall Street Journal    15%
Brown corpus           12%
Switchboard            8%
Table 23: Use of VP-internal NPs which are anaphorically related to the subject.

By contrast, VP-internal NPs in the natural corpora were more likely to refer to referents other than the subject of the verb. This additional sentence-internal anaphora in the isolated sentences is presumably a strategy for avoiding sentences like (66), which require the creation of an additional referent that is not already present in the context.

(66) Alice prayed that Bob’s daughter wouldn’t die. (made up example)

10 3rd person subjects are provided in the Garnsey et al. (1997) protocol.

Additionally, the subjects and VP-internal NPs are both likely to refer to previously mentioned entities from preceding sentences. Notice that in example (67), taken from the Brown Corpus, the subject He is co-referential with the subjects in the preceding sentences, and the object lightning is not only not co-referential with the subject of the verb notice, but is at least suggested by rain in the previous sentence. Similarly, in example (68) from the WSJ corpus, the object of notice is not co-referential with the subject clients, but both switch and clients are related to entities mentioned in the previous context (both new message and Mr. Straszheim are referentially related to his switch, while smaller clients is mentioned in contrast with sophisticated professional).

(67) But he_i felt no physical discomfort. He_i was only vaguely aware of the sluicing rain. He_i hardly noticed [the blue-green flashes of lightning and the hard claps of thunder]_j. (Brown)

(68) Carrying the new message on the road, Mr. Straszheim meets confrontation that often occurs in inverse proportion to the size of the client. No sophisticated professional expects economists to be right all the time. Some smaller clients_i don't seem to notice [his switch]_j. (WSJ)

3) Prototypical objects

Third, the objects in the single sentence production data were more likely to be prototypical objects. That is, subjects tended to use default, relatively predictable head nouns for the direct objects of verbs. For example, of the 107 Garnsey sentences with the verb accept, 12 (11%) had a direct object whose head noun was award. In fact, 33% of the 107 sentences had a direct object whose head was one of the four most common words: award, fact, job, or invitation. By contrast, the 112 Brown corpus sentences used a far greater variety of objects; it would take 12 different object nouns to account for 33% of the 112 sentences. Furthermore, the most common Brown corpus objects were pronouns (it, them); no common noun occurred more than 3 times in the 112 sentences. A formal metric of argument prototypicality is the token/type ratio. The ratio of the number of object noun tokens to object noun types will be high when a small number of types account for a greater percentage of the tokens. Table 24 below shows that the token/type ratio is much higher for the Garnsey data set than for the Brown corpus.
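As a sketch, the ratio reported in Table 24 below is simply the number of object-noun tokens divided by the number of distinct object-noun types; the head-noun list in this example is hypothetical and only stands in for the hand-coded objects of accept.

```python
def token_type_ratio(head_nouns):
    """Token/type ratio: number of tokens divided by number of distinct types."""
    return len(head_nouns) / len(set(head_nouns))

# Hypothetical head nouns standing in for the hand-coded direct objects of
# 'accept' in one data set; the real ratios are the ones given in Table 24.
objects = ["award", "award", "fact", "job", "invitation", "award"]
print(round(token_type_ratio(objects), 1))  # 1.5 for this toy list
```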

Data Source            token count   type count   Argument token/type ratio
Garnsey                107           54           2.0
CFJCF                  —             —            —
Wall Street Journal    138           105          1.3
Brown corpus           112           86           1.3
Switchboard            15            14           1.1
Table 24: Token/Type ratio for arguments of accept

2.2.2.2 Other experimental factors – subject animacy

The previous sections discussed context effects that distinguish isolated sentence corpora from connected discourse corpora. This section discusses a further experimental bias that is specific to the sentence completion task. In sentence completion, the participants are given a prompt consisting of a syntactic subject as well as a verb. The nature of this syntactic subject can influence the verb subcategorization of the resulting sentence. Indeed this fact explains the single largest mismatch between the Garnsey data set and Brown corpus data. The verb worry was the only verb in these two corpora with an opposite preference between direct object and sentential complement; in Brown worry was more likely to take a direct object, while in the Garnsey data set worry was more likely to take a sentential complement.

Subcategorizations of ‘worry’   % Direct Object   % Sentential Complement
Garnsey                         1%                24%
BC                              14%               4%
Table 25: Subcategorization of worry affected by sentence-completion paradigm.

This reversal in preference was caused by the properties of two of the subcategorization frames of worry. In frame 1 below, worry takes an experiencer as a subject, and subcategorizes for a finite sentence [Sfin]. In frame 2 below, worry takes a stimulus as a subject, and subcategorizes for an [NP]. This alternation is an example of the Causative/Inchoative alternation that some ‘amuse-type’ psych verbs undergo, as discussed in Levin (1993).

#   frame                                example
1   [experiencer] worries [stimulus]     Samantha worried that trouble was coming in waves. (Garnsey)
2   [stimulus] worries [experiencer]     Her words remained with him, worrying him for hours. (BC)
Table 26: Uses of worry.

In the Garnsey protocol, proper names (highly animate) were provided. Specifically, for the verb worry, subjects had to complete a sentence starting with “Samantha worried ______”. This provides a bias towards the first use, since animate subjects are more likely to be experiencers than stimuli. None of the Garnsey examples were completed in such a way that Samantha became the stimulus of the worrying as in the made up example (69). Rather, Samantha was always the experiencer of the worry.

(69) Samantha worried me. (made up example)

All of the sentential complement uses in the Brown corpus data had a human/animate subject, such as in example (70). In the direct object uses, only 30% of the subjects were animate. Examples (71) and (72) show inanimate subjects preceding direct object uses of the verb worry in the Brown Corpus.

(70) But Keys worries that the Metrecal drinker will never make either the psychological or physiological adjustment to the idea of eating smaller portions of food. (Brown)

(71) This had worried her. (Brown)

(72) “The only thing that worries me is how I'm going to prove it”, Eugenia said. (Brown)

It is uncontroversial that the nature of the prompt in a sentence completion experiment affects factors such as whether the sentence will be active or passive. This analysis shows that the nature of the prompt has a more subtle but equally important effect on how subjects will use a verb.

2.2.2.3 Conclusion for section 2.2.2

This section has shown several different ways in which discourse context and experimental design affect observed subcategorization frequencies. These effects suggest that a psychological model of subcategorization probabilities will need to control for such discourse context effects. These contextual effects also have a methodological implication. In cases where performance results from a particular experimental context are being compared with corpus frequencies at large, one should not expect to find direct correspondences between phenomena such as parsing preferences and reading times and corpus frequencies. This is because the experimental results will be influenced by the particular discourse context of the experimental items, while the corpus frequencies will reflect an average of all of the contexts represented in the corpus.

The alternative to comparing experimental performance with corpus frequencies is to compare experimental results with norming data based on similar contexts. One example of this is found in Garnsey et al. (1997), where both the norming and experimental results were based on single sentence production/reading times, and human (grammatical) subjects were used with the verbs in both cases. Even when such a methodology is employed, corpus frequencies can still be used to identify likely candidate verbs for each of the possible subcategorization biases.

The number of possible causes of differences between experimental and corpus verb subcategorization frequencies places an extra burden on researchers attempting to demonstrate a lack of use of probabilistic information in human sentence processing. One must show not only that the probabilistic information is not being used, but also that there are no other plausible causes of a lack of correspondence between frequency data and experimental results. Alternatively, when evidence of a correlation is found, it is doubly strong, in that such a correspondence between frequency information and experimental results indicates that the correlation is sufficiently strong to overcome potential discourse and contextual differences.

The number of possible causes for differences between subcategorization frequencies from various sources also makes it difficult to investigate whether there are any inherent differences between production and comprehension frequencies. On the one hand, one might expect factors such as memory capacity to affect comprehension differently than production, but on the other hand, one would also expect that humans would use a language model based on exposure to production during comprehension.

2.2.3 Results and discussion: Part 2 - Subcategorization differences resulting from verb sense differences

This section will demonstrate that using the verb lexeme rather than individual verb senses as the basis for subcategorization probabilities also results in apparent differences in subcategorization probability between corpora. This section will show that different corpora can yield different subcategorization probabilities. Then, it will show that different corpora contain different senses of verbs. Finally, it will show that it is this different distribution of lemmas or senses that accounts for much of the inter-corpus variability in subcategorization frequencies.

The results in this section are based on a subset of the data used in the previous section. In order to investigate the relationship between verb sense and verb subcategorization, seven verbs were selected from the data used in the previous section. These verbs were hand tagged for semantic sense/lemma using semantic senses provided in WordNet (Miller et al. 1993). Sense categories were collapsed in the few cases where different WordNet senses could not be reliably distinguished. Examples of each sense are shown in the tables below in this section. When there were more than 100 tokens of a verb in a single corpus, 100 randomly selected examples were coded. This sample size was chosen to match the maximum sample size in the psychological corpora. Primarily, the data from the Brown corpus and the Wall Street Journal corpus is compared, since these two corpora had the largest amount of data. Although the data from the other corpora was less plentiful, it still provided useful insights.

2.2.3.1 Verbs have different subcategorization frequencies in different corpora

First, three verbs, pass, charge, and jump, are analyzed. These three were chosen because they had large differences in subcategorization frequencies between the Wall Street Journal corpus and the Brown corpus. Table 27 shows that all three verbs have significant differences in subcategorization frequencies between the Brown corpus and the Wall Street Journal corpus.

Verb     Cosine (all senses combined)   Do BC and WSJ have different subcategorization probabilities?
pass     0.75                           Yes (X2 = 22.2, p < .001)
charge   0.65                           Yes (X2 = 46.8, p < .001)
jump     0.50                           Yes (X2 = 49.6, p < .001)
Table 27: Agreement between WSJ and BC data.

2.2.3.2 Verbs have different distributions of sense in different corpora

Next, the frequency of each sense was measured in each corpus. Each of the verbs showed a significant difference in the distribution of senses between the Brown corpus and the Wall Street Journal corpus, as shown in Table 28. This is consistent with Biber et al. (1998), who note that different genres have different distributions of word senses.

Verb     Do BC and WSJ have different distributions of verb sense?
pass     Yes (X2 = 59.4, p < .001)
charge   Yes (X2 = 35.1, p < .001)
jump     Yes (X2 = 103, p < .001)
Table 28: Differences in distribution of verb senses between BC and WSJ.

Table 29 uses the verb charge to show how the sense distributions are different for a particular verb. The types of topics contained in a corpus influence which senses of a verb are used. Since the Brown corpus contains a balanced variety of topics, while the Wall Street Journal corpus is strongly biased towards business related discussion, one would expect to see more of the business-related senses in the Wall Street Journal corpus. Indeed, the two business-related senses of charge (accuse and bill) are used more frequently in the Wall Street Journal corpus, although they also occur commonly in the Brown corpus, while the attack sense of charge is used only in the Brown corpus. The credit card sense is probably more common in corpora that are more recent than the Brown corpus.

Senses of charge       BC %   WSJ %   Example of the senses of charge
attack (WN 1)11        23%    0%      His followers shouted the old battle cry after him and charged the hill, firing as they ran. (BC)
run (WN 4)             8%     0%      She charged off to the bedrooms. (BC)
appoint (WN 5)         6%     4%      The commission is charged with designing a ten year recovery program. (WSJ)
accuse (WN 2,6,7)      39%    58%     Separately, a Campeau shareholder filed suit, charging Campeau, Chairman Robert Campeau and other officers with violating securities law. (WSJ)
bill (WN 3)            24%    36%     Currently the government charges nothing for such filings. (WSJ)
credit card (WN 12)    0%     2%      Many auto dealers now let buyers charge part or all of their purchase on the American Express card…. (WSJ)
TOTAL                  100%   100%
Table 29: Examples of common senses of charge and their frequencies.

Table 30 shows examples of the senses of the verb jump and their frequencies. There is a clear split in the distributions of the senses. Two senses, leap and attack, occur primarily in the BC data, while the economic sense price jump is used primarily in the WSJ data.

Senses of jump    BC %   WSJ %   Examples of the senses of jump
leap (WN 1,2)     71%    4%      He jumped, and sank to his knees in muddy water. (BC)
attack (WN 3)     11%    2%      “This deal at Las Putas Buenas where the two knife-men jumped you,” said Rourke with interest, “that sounds like it was set up with malice aforethought by the luscious Mrs. Peralta, doesn’t it?” (BC)
price (WN 4)      5%     86%     Holiday Corp. said net income jumped 89%, partly on the strength of record operating income in its gaming division. (WSJ)
other senses      13%    8%
TOTAL             100%   100%
Table 30: Examples of common senses of jump and their frequencies.

Table 31 shows the frequencies for each sense of the verb pass. The go past sense is most common in the BC data, while the legal sense of passing a law is most common in the WSJ data.

11 The senses used in this section are either WordNet senses or combinations of WordNet senses. WordNet sense numbers not listed either were not present in the sample used or were counted in the other category.

Senses of pass          BC %   WSJ %   Examples of the senses of pass
go past (WN 1,2,6,7)    48%    4%      We passed his house and school on the way. (BC)
law (WN 3)              19%    49%     In the senate, several bills are expected to pass without any major conflict or opposition. (BC)
pass time (WN 4)        7%     3%      Phil decided to stay a little longer, and as time passed, it seemed as if the strange little man had never been there, but for the other glass on the table. (BC)
hand to (WN 5)          5%     17%     He asked, when she passed him a glass. (BC)
test (WN 14)            1%     6%      Those who stayed had to pass tests. (BC)
other                   20%    21%
TOTAL                   100%   100%
Table 31: Examples of common senses of pass and their frequencies.

2.2.3.3 Topics provided in norming studies also influence verb sense

Corpus topic also affects verb sense in the isolated sentence corpora. When topics such as home, school, and downtown were provided to the subjects in the Connine et al. (1984) sentence production study, subjects used different senses of the verbs. Table 32 shows how the provided setting influenced the verb sense used. For example, the school setting caused 5 out of 9 subjects to use the test sense of the verb pass. By contrast, the test sense was used only 2 times in 230 examples in the Brown corpus.

Setting     movement   test   pass the buck   Other12
home        6          1      1               2
downtown    5          1      0               3
school      4          5      0               0
Table 32: Uses of pass in different settings in the CFJCF sentence production study

2.2.3.4 Subcategorization frequencies for each verb sense

For each of these three verbs, the subcategorization frequencies for each sense were examined. In each case, the relative frequency of the verb senses in each corpus resulted in a difference in the overall subcategorization frequency for that verb. This is due to each of the senses having separate subcategorization probabilities. Table 33 illustrates that different senses of the verb charge have different subcategorizations (examples of each sense are given in Table 29).

12 Other sense or non-verb use.

Senses of ‘charge’   that-S   NP    NP PP13       passive   Other
appoint              0%       0%    0%            4%        0%
accuse               18%      0%    12% (with)    24%       2%
bill                 0%       9%    24% (for)     1%        1%
credit card          0%       0%    2% (on)       0%        0%
Table 33: Different senses of charge in WSJ have different subcategorization probabilities. Dominant prepositions are listed in parentheses after the frequency.

Further evidence that subcategorization probabilities are based on verb sense is provided by the fact that for two of the verbs, pass and charge, the agreement for the most common sense was significantly better than the agreement for all senses combined. The third verb, jump, also shows improvement, but the single sense value is not significant. This is because the nearly complementary distribution of senses between the corpora results in low sample sizes for one of the corpora whenever only a single sense is taken into consideration. Table 34 shows that the agreement for the most common sense is better than the agreement for all senses combined. The remaining disagreement between the corpora is attributed to context and discourse based subcategorization differences.

Verb     Cosine (all senses combined)   Cosine (most common sense)
pass     0.75                           0.95
charge   0.65                           0.80
jump     0.50                           0.59
Table 34: Improvement in agreement after controlling for verb sense.

2.2.3.5 Factors that contribute to stable cross-corpus subcategorization frequencies

The first three verbs examined in detail (pass, charge, and jump) illustrate how verb sense differences can result in large verb subcategorization differences between corpora. The next three verbs (kill, stay, and try; Table 35) were chosen in contrast to illustrate cases where there is good agreement in overall subcategorization between the Wall Street Journal corpus and the Brown corpus data. This is done as a preliminary effort to see what factors might prevent subcategorization frequencies from changing between corpora.

Verb   Cosine (all senses combined)   Do BC and WSJ have different subcategorization probabilities? (X2)
kill   1.00                           No
stay   1.00                           No
try    1.00                           No
Table 35: Agreement between BC and WSJ data.

13 The set of subcategorization frames used does not take the identity of the preposition into account.

One possible circumstance where one would not predict verb subcategorization to vary between corpora is the case where all of the corpora have either the same single sense or the same combinations of senses. One would expect to find many verbs within a language that had only one common sense. However, Table 36 shows that these three verbs had significantly different distributions of verb senses in the Brown corpus and the WSJ corpus.

Verb   Do BC and WSJ have different distributions of verb sense?
kill   Yes (X2 = 26.9, p < .001)
stay   Yes (X2 = 26.1, p < .001)
try    Yes (X2 = 8.74, p < .025)
Table 36: Differences in distribution of verb sense between BC and WSJ.

Table 37 shows the distribution of senses for the verb kill. The verb kill has a primary sense, cause to die, and a series of other senses that are metaphorical extensions. In the BC data, mainly the literal sense is used, while in the WSJ data, a variety of extensions are also used.

Senses of kill           BC %   WSJ %   Examples of the senses of kill
cause to die (WN 1,4)    99%    74%     To give the patient the wrong type of blood, said the doctor, would likely kill him. (BC)
vote down (WN 2)         1%     13%     Cashiering the entire omnibus bill would probably mean killing any capital-gains cut, too. (WSJ)
kill a deal (WN 11)14    0%     10%     NBC’s interest may revive the deal, which MGM/UA killed last week when the Australian concern had trouble raising cash. (WSJ)
other                    0%     3%
TOTAL                    100%   100%
Table 37: Examples of common senses of kill.

Table 38 shows the distribution of senses for the verb stay. The sense of stay meaning to visit someone, a general sense, is more common in the BC data, while the sense of continuing to work is more common in the WSJ data. In BC, not changing and not moving are equally common, while WSJ favors the not change sense.

14 WordNet sense 11 is closest to the kill a deal sense used in the WSJ data.

Senses of stay             BC %   WSJ %   Examples of senses of stay
not change (WN 1)          33%    54%     If the dollar stays weak, he says, that will add to inflationary pressures in the U.S. and make it hard for the Federal Reserve Board to ease interest rates very much. (WSJ)
not move (WN 2)15          34%    19%     But Hoag had not stayed on the front steps long when Griffith disappeared into the building. (BC)
visit someone (WN 3)       18%    2%      Robbie and Beryl tried their best to persuade her to come and stay with them, and Anne and I have told her she’s more than welcome here, but I think she feels that she might be an imposition, especially as long as our Rosie is still in school. (BC)
stay on at a job (WN 4)    10%    21%     Mr. Leibler, 40, said he will stay on as Amex president and work with Mr. Jones. (WSJ)
other                      5%     4%
TOTAL                      100%   100%
Table 38: Examples of common senses of stay.

Table 39 shows the distribution of senses for the verb try. Both corpora have the same dominant sense, but BC also contains the secondary sense of try out/test. In general, BC is more likely to have higher frequencies of secondary senses than WSJ, unless the second sense is a business sense.

Senses of try            BC %   WSJ %   Examples of senses of try
attempt (WN 1)           85%    96%     And some oil companies are trying to lock in future supplies. (WSJ)
try out/test (WN 2,4)    13%    2%      He tried the doors of the bookcase. (BC)
legal trial (WN 3,5)     2%     2%      The Gortonists were charged with blasphemy and tried for their lives. (BC)
TOTAL                    100%   100%
Table 39: Examples of common senses of try.

Although all three verbs have a different distribution of verb senses between the two corpora, they do not show a difference in subcategorization probabilities between the corpora. This is because the different senses of these verbs actually have similar subcategorization patterns. Table 40 shows that all of the senses of kill have [NP] as their primary subcategorization, although other subcategorizations are possible. Try and stay show similar patterns.

15 I interpreted the difference between sense 1 and sense 2 as meaning that sense 2 involved specifically not making literal physical movement.

Senses of ‘kill’   [0]   [NP]   [NP][PP]   [passive]   other
cause to die       9%    43%    4%         19%         0%
vote down          0%    13%    0%         0%          0%
kill a deal        0%    10%    0%         0%          0%
other              0%    3%     0%         0%          0%
TOTAL              9%    69%    4%         19%         0%
Table 40: Senses and subcategorizations of kill in WSJ.

In the case of kill, the different senses of the verb are metaphoric extensions of the main use of the verb. Senses that are very closely (polysemously or metaphorically) related, like the senses of kill and stay, tend to have similar subcategorization probabilities across corpora. However, not all metaphoric extensions share subcategorization probabilities. For example, the verb jump has two senses related by metonymy, leap and rise in price. While these have similar possible subcategorizations, the actual distribution of these subcategorizations was very different in the Brown corpus and the Wall Street Journal corpus data, due to the discourse circumstances under which each of the senses was used. The information demands in the Wall Street Journal resulted in stock price jumps being given with a distance and stopping point (jumped five eighths to five dollars a share). Alternatively, in the Brown corpus data, jump is more likely to be used to describe a manner of movement, as in (73), in which case it is the type of movement, rather than the starting and ending points of the movement, that is important.

(73) At one side of the stage a dancer jumps excitedly; nearby, another sits motionless, while still another is twirling an umbrella.

2.2.3.6 Conclusion for section 2.2.3

This section has shown that different verb senses can have different subcategorization probabilities. It also showed that different corpora tend to have a different distribution of verb senses, and that these differences in distribution can result in overall subcategorization differences between the corpora. This relationship between verb sense and subcategorization leads to an important methodological caveat as well: psychological models and experimental protocols that rely on verb subcategorization frequencies must also take verb sense into account. This result also suggests that statistical parsers that are sensitive to verb sense might be less likely to need retraining as they are applied to different domains.

The results in this section provoke the important question of how much of the cross-corpus subcategorization frequency variation is due to verb sense differences and how much is due to discourse or other factors. Unfortunately, not only is it difficult to estimate an answer for any particular case, there is also probably no global answer to this question. First, there is no large database of verb sense / subcategorization correspondences for any corpus. Even databases such as the SEMCOR extension of WordNet have insufficient data for making useful generalizations about more than the most frequent verbs. However, projects such as FrameNet (Baker et al. 1998) and NORMA (Gahl & Jurafsky 2000; Roland et al. 2000) have as their goals providing subcategorization/sense frequencies for a sampling of verbs. It would require sense/subcategorization statistics from multiple corpora to make an adequate measure of the size of verb sense based subcategorization frequency differences.

A second problem is that one would expect discourse and verb sense based differences to play different roles depending on which corpora are compared. When corpora such as the Brown corpus and the WSJ corpus are compared, one would expect discourse differences to play a smaller role, in that both corpora are written connected discourse. Alternatively, when single sentence production data, spoken connected discourse, and written connected discourse are compared, one would expect discourse factors to play a larger role, and verb sense differences to play a smaller role, particularly if the topics of discussion were similar between the different data sources. At best, one can only show which factors are causing some particular set of differences. Support for a model where core subcategorization probabilities for each sense of a verb combine with discourse and other factors to yield the observed subcategorization probabilities can only be demonstrated by controlling for the various factors and showing that the differences in subcategorization probabilities are reduced. Such a reduction in subcategorization differences will be demonstrated in the next section.

2.2.4 Results and discussion: Part 3 – Reducing sense and discourse differences decreases the differences in subcategorization probabilities.

The results in this section will provide support for the model introduced in the beginning of the chapter where core subcategorization probabilities for each sense of a verb combine with discourse and other factors to yield the observed subcategorization probabilities. This section will show preliminary evidence that a single sense tends to have a single subcategorization probability vector, when other factors are controlled for. This section will rely on data for the verb hear, which is one of the few verbs that appeared in all five corpora.

The procedure will be to show that the agreement between subcategorization vectors iteratively improves as more factors are controlled for, from a cosine of .88 for agreement between uncontrolled vectors, to a cosine of .99 for agreement between vectors controlled for verb sense as well as discourse context effects.

First, the average agreement between each of the 10 possible pairs of corpora was calculated. For example, these pairings include the Brown corpus and the Wall Street Journal corpus, the Brown corpus and the Connine data set, the Brown corpus and the Garnsey data set, the Brown corpus and the Switchboard corpus, the Wall Street Journal corpus and the Switchboard corpus, and so on. The average agreement (cosine) was .88.
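A minimal sketch of this averaging step is given below; the cosine is the one defined in Formula 1, and only the Brown and WSJ counts from Table 18 are filled in, so the remaining corpora are an assumption the reader would supply from the hand-coded data.

```python
import math
from itertools import combinations

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def average_agreement(vectors):
    """Mean cosine over every pair of corpora (10 pairs for five corpora)."""
    pairs = list(combinations(vectors.values(), 2))
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

# Per-corpus subcategorization count vectors for 'hear'; only the Brown and
# WSJ counts (Table 18) are shown here, and the other three corpora would be
# added in the same way to obtain the .88 average reported in the text.
vectors = {
    "Brown": [4, 12, 3, 1, 15, 47, 4, 14],
    "WSJ":   [0, 17, 3, 5, 13, 56, 10, 10],
}
print(round(average_agreement(vectors), 2))  # 0.98 for this two-corpus example
```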

The ‘isolated-sentence’ effect was controlled for by only comparing pairs of corpora if they were both isolated-sentence corpora or both connected-discourse corpora. Thus, the comparisons were the Garnsey data set vs. the Connine data set, the Brown corpus vs. the Wall Street Journal corpus, the Wall Street Journal corpus vs. the Switchboard corpus, and the Brown corpus vs. the Switchboard corpus. The average agreement improved to .93. Spoken versus written effects were controlled for by comparing only the Brown corpus and the Wall Street Journal corpus. The average agreement improved to .98. Finally, instead of comparing all sentences with hear in the Brown corpus to all sentences with hear in the Wall Street Journal corpus, only sentences which used the single most frequent sense of hear were compared. The average agreement improved to .99. Table 41 shows a schematic of the comparisons. Note that although verb sense is controlled for only in the final step, controlling for sense results in improvement at any point in the chart. For example, the average agreement for all corpora also improves to .89 when sense is controlled for.

Average agreement between all corpora: .88

Comparison between different discourse types? (Single Sentence vs. Connected Discourse)
  Yes: average agreement, Single Sentence vs. Connected Discourse = .84
  No:  average agreement, Single vs. Single or Connected vs. Connected = .93

Comparison between different discourse types? (Written vs. Spoken)
  Yes: average agreement, Written vs. Spoken = .91
  No:  average agreement, Written vs. Written = .98

Control for verb sense: agreement = .99

Table 41: Improvements in agreement for the verb hear.

2.2.5 Conclusion

This section has shown that subcategorization frequency variation is caused by factors including the discourse cohesion effects of natural corpora, the default referent effects of isolated-sentence experiments, the prompt given in sentence production experiments, the effects of different genres on verb sense, and the effect of verb sense on subcategorization. The evidence shows clearly that in clear cases of polysemy, such as the accuse and bill senses of charge, each sense has a different set of subcategorization probabilities. This section has not investigated subtler differences in meaning, such as in load the wagon with hay and load hay into the wagon. Such alternations are usually modeled by one of two theories. Our data is currently unable to distinguish between them. For example, a Lexical Rule account, such as Levin & Hovav (1995), might consider each valence possibility as a distinct lemma; the results merely show that these lemmas would have to be associated with lemma probabilities. An alternative constructional account, such as Goldberg (1995), would include both valence possibilities as part of a single lemma for load, with separate valence probabilities. In the constructional account, the shadings in sense are determined by the combination of lexical meaning and constructional meaning.

These experiments do have a number of implications both for cognitive modeling and for psycholinguistic methodology. This dissertation makes a psychological claim about mental representation: that each lemma contains a vector of probabilistic expectations for its arguments. These results suggest that the observed subcategorization probabilities can be explained by a probabilistic combination of these lemma probabilities with other probabilistic factors. If this is true, it supports models of human language interpretation such as Narayanan & Jurafsky (1998) that similarly rely on the Bayesian combination of different probabilistic sources of lexical and non-lexical knowledge.

2.3 Experiment – Controlling for discourse type and verb sense to generate stable cross corpus subcategorization frequencies

The goal of this experiment is to show that stable cross corpus verb subcategorization frequencies result when discourse type and verb sense are controlled for. The previous experiments in this chapter hinted at such a result, but demonstrated it for only one verb. This experiment will take a much larger number of verbs (64) and control for verb sense primarily by choosing single sense verbs. Discourse type will be controlled for by only using corpora of primarily the same general discourse type (written connected text). This analysis was originally performed to generate a set of verb transitivity biases for use in norming a series of psychological experiments (Gahl et al. 2001).

2.3.1 Data

Data for 64 verbs (shown in Table 42) was collected from three corpora: the British National Corpus (BNC) (http://info.ox.ac.uk/bnc/index.html), the Penn Treebank parsed version of the Brown Corpus (Brown), and the Penn Treebank Wall Street Journal corpus (WSJ) (Marcus et al. 1993). The 64 verbs were chosen on the basis of the requirements of separate psychological experiments, including having a single dominant sense, being easily imageable, and participating in one of several subcategorization alternations. A random sample of 100 examples of each verb was selected from each of the three corpora. When the corpus contained fewer than 100 tokens of the verb, as was frequently the case in the Brown and WSJ corpora, all of the available data was used. This data was coded for several properties: Transitive/Intransitive, Active/Passive, and whether the example involved the major sense of the verb or not. The BNC data was coded entirely by hand, while the Brown and WSJ data was hand coded after a first pass of subcategorization labeling via a tgrep search string algorithm. The same coder labeled the data for all three corpora for any given verb, in order to reduce any problems in inter-coder reliability. The coders for this data included Lise Menn, Susanne Gahl, Daniel Jurafsky, Elizabeth Elder, and Chris Riddoch, as well as the author. A preliminary analysis of the data used in this section is discussed in Roland et al. (2000).

adjust, advance, appoint, arrest, break, burst, carve, crack, crumble, dance, design, dissolve, distract, disturb, drop, elect, encourage, entertain, excite, fight, float, flood, fly, frighten, glide, grow, hang, harden, heat, hurry, impress, jump, kick, knit, lean, leap, lecture, locate, march, melt, merge, mutate, offend, play, pour, race, relax, rise, rotate, rush, sail, shut, soften, spill, stand, study, surrender, tempt, terrify, type, walk, wander, wash, watch
Table 42: 64 verbs chosen for analysis

2.3.2 Verb Frequency

Frequency differences for the target verbs were used as a measure of corpus difference because word frequency is known to vary with corpus genre. One would expect factors such as corpus genre (business for WSJ vs. mixed for BNC and Brown), American vs. British English, and the era the corpus sample was taken in (published in or before 1961 for Brown, published in 1989 for WSJ, published/created in or before 1993 for BNC) to influence word frequency.

The frequencies for each verb were calculated and Chi Square was used to test whether the difference in frequency was significant for each corpus pairing. The number of verbs that showed a significant difference was counted, using p = 0.05 as a cut-off point. This result is shown in Table 43. Although there were verbs that had a significant difference in distribution between the two mixed genre corpora (BNC, Brown), there were more differences in word frequency between the general corpora and the business corpus. The difference between the BNC/Brown comparison and the BNC and Brown vs. WSJ comparison is significant (Chi Square, p < .01).

BNC vs. Brown   BNC vs. WSJ   Brown vs. WSJ
30              46            46
Table 43: Number of verbs out of 64 showing a significant difference in frequency between corpora.
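The text does not spell out how each per-verb frequency test was set up; one plausible reconstruction is a 2x2 contingency table of verb tokens against all remaining tokens in each corpus, sketched below with hypothetical counts.

```python
from scipy.stats import chi2_contingency

def frequency_differs(count_a, size_a, count_b, size_b, alpha=0.05):
    """2x2 test: is a verb's frequency significantly different in corpora A and B?"""
    table = [
        [count_a, size_a - count_a],   # verb tokens vs. other tokens, corpus A
        [count_b, size_b - count_b],   # verb tokens vs. other tokens, corpus B
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return p < alpha

# Hypothetical counts: the verb occurs 210 times in a 1,000,000-token corpus
# and 480 times in a 1,150,000-token corpus.
print(frequency_differs(210, 1_000_000, 480, 1_150_000))  # True
```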

Table 44 shows the list of words that were significantly more frequent in both of the general corpora than they were in the business-oriented corpus. Notice that most of the verbs describe either leisure activities or at least activities and events that are not typical in business settings.

amuse, boil, burst, dance, disturb, entertain, frighten, hang, harden, hurry, impress, knit, lean, paint, play, race, sail, stand, tempt, walk, wander, wash, watch
Table 44: Verbs that BNC and Brown both have more of than WSJ.

Alternatively, when one looks at the words that had a significantly higher frequency in the WSJ corpus than in either of the other corpora (Table 45), one finds predominantly verbs that can describe stock price changes and business transactions.

adjust, advance, crumble, drop, elect, fall, grow, jump, merge, quote, rise, shrink, shut, slip
Table 45: Verbs that WSJ has more of than both Brown and BNC.

2.3.3 Subcategorization Frequency

2.3.3.1 Methodology:

For this experiment, the examples of the 64 verbs from each of the three corpora were coded for transitivity. Any use with a direct object was counted as transitive, and any other use, such as with a prepositional phrase, was counted as intransitive. The transitive category also included cases of movement, such as right dislocation. Passive uses were also included in the transitive category. Examples (74) and (75) illustrate intransitive uses, example (76) illustrates a transitive (and active) use, while examples (77) and (78) illustrate transitive (and passive) uses of the verb drop.

(74) Pretax profits for the year ended 31st March 1991, dropped by 1.24 million… (BNC)

(75) Something dropped to the floor… (BNC)

(76) Lift them from the elbows, and then drop them down to the floor. (BNC)

(77) …plans for an OSF binary interface have been dropped, apparently due to technical difficulties. (BNC)

(78) And even then the poor moribund critters had to be dropped from a great height to get a satisfactory thump… (BNC)

Verb sense was controlled for by only including sentences from the majority sense of the verb in the counts. For example, instances of drop that were phrasal verbs with distinct senses, like drop in or drop off, were not included. Metaphorical extensions of the main sense, such as a company dropping a product line, were included, however. Thus, a broadly defined notion of sense was used, rather than the more narrowly defined word senses used in some on-line word sense resources such as WordNet. This was partly for logistic reasons, since such fine-grained senses are very hard to code, and partly because very narrowly defined senses frequently have only one possible subcategorization. Coding for such senses would have thus biased the experiment strongly toward finding a strong link between sense and subcategorization bias.

Transitivity biases for each of the 64 verbs in each of the three corpora were calculated. The verbs were classed as high transitivity if more than 2/3 of the tokens of the major sense were transitive, low transitivity if more than 2/3 of the tokens of the major sense were intransitive, and as mixed, otherwise. Any token of the verb that was not used in its major sense was removed from consideration. If subcategorization biases were related to verb sense, one would expect the transitivity biases to be stable across corpora once secondary senses are removed from consideration.

2.3.3.2 Results:

Nine of the 64 verbs, shown in Table 46, had a significant shift in transitivity bias. These verbs had a different high/mixed/low transitivity bias in at least one of the three corpora.

Verb        BNC transitivity   Brown transitivity   WSJ transitivity
advance     mixed (48%)        mixed (65%)          low (19%)
crack       mixed (58%)        mixed (58%)          high (86%)
fight       low (29%)          mixed (49%)          high (64%)
float       low (22%)          low (11%)            mixed (44%)
flood       mixed (52%)        high (100%)          high (100%)
relax       low (27%)          low (30%)            mixed (65%)
soften      high (71%)         high (70%)           mixed (43%)
study       high (84%)         mixed (39%)          high (92%)
surrender   mixed (48%)        mixed (39%)          high (73%)
Table 46: Transitivity bias in each corpus.

2.3.4 Discussion:

In general, these shifts in transitivity were a result of the verbs having differences in sense between the corpora such that the senses had different subcategorizations, but were still within the broadly defined ‘main sense’ for that verb.

For seven out of the nine verbs, the shifts in transitivity are a result of differences between the WSJ data and the other data, which in turn reflect the WSJ being biased towards business-specific uses of these verbs. For example, in the BNC and Brown data, advance is a mixture of transitive and intransitive uses, shown in (79) and (80), while intransitive share price changes (81) dominated in the WSJ data.

(79) BNC intransitive: In films, they advance in droves of armour across open fields …

(80) BNC transitive: We have advanced “moral careers” as another useful concept …

(81) WSJ intransitive: Of the 4,345 stocks that changed hands, 1,174 declined and 1,040 advanced.

Crack is used to mean make a sound (82) or break (83) in the Brown and BNC data (both of which have transitive and intransitive uses), while it is more likely to be used to mean enter or dominate a group/market (transitive use) in the WSJ data; see (84) and (85).

(82) Brown intransitive: A carbine cracked more loudly …

(83) Brown intransitive: Use well-wedged clay, free of air bubbles and pliable enough to bend without cracking.

(84) WSJ transitive: But the outsiders haven't yet been able to crack Saatchi's clubby inner circle, or to have significant influence on company strategy.

(85) WSJ transitive: … big investments in “domestic” industries such as beer will make it even tougher for foreign competitors to crack the Japanese market.

Float is generally used as an intransitive verb (86), but must be used transitively when used in a financial sense (87).

(86) Brown intransitive: The ball floated downstream.

(87) WSJ transitive: B.A.T aims to … float its big paper and British retailing businesses via share issues to existing holders.

Relax is generally used intransitively (88), but is used transitively in the WSJ data when discussing the relaxation of rules and credit (89).

(88) BNC intransitive: The moment Joseph stepped out onto the terrace the worried faces of Tran Van Hieu and his wife relaxed with relief.

(89) WSJ transitive: Ford is willing to bid for 100% of Jaguar's shares if both the government and Jaguar shareholders agree to relax the anti-takeover barrier prematurely.

Soften is generally used transitively (90), but is used intransitively in the WSJ data when discussing the softening of prices (91) and (92).

(90) Brown transitive: Hardy would not allow sentiment to soften his sense of the irredeemable pastness of the past, and the eternal deadness of the dead.

(91) WSJ intransitive: A spokesman for Scott says that assuming the price of pulp continues to soften, “We should do well.”

(92) WSJ intransitive: The stock has since softened, trading around $25 a share last week and closing yesterday at $23.00 in national over-the-counter trading.

Surrender is used both transitively (93) and intransitively (94), but must be used transitively when discussing the surrender of particular items such as stocks (95) and (96).

(93) BNC transitive: In 1475 Stanley surrendered his share to the crown…

(94) Brown intransitive: … the defenders, to save bloodshed, surrendered under the promise that they would be treated as neighbors

(95) WSJ transitive: Holders can … surrender their shares at the per-share price of $1,000, plus accumulated dividends of $6.71 a share.

(96) WSJ transitive: … Nelson Peltz and Peter W. May surrendered warrants and preferred stock in exchange for a larger stake in Avery's common shares.

The verb fight is the only verb that has a different transitivity bias in each of the three corpora; with all other verbs, at least two corpora share the same bias. In the WSJ, fight tends to be used transitively, describing action against a specific entity or concept (97). In the other two corpora, there are more descriptions of actions for or against more abstract concepts (98) and (99). In addition, the WSJ differences may further be influenced by a journalistic style practice of dropping the preposition against in the phrase fight against.

(97) WSJ transitive: Los Angeles County Supervisor Kenneth Hahn yesterday vowed to fight the introduction of double-decking in the area.

(98) BNC intransitive: He fought against the United Nations troops in the attempted Katangese secession of nineteen sixty to sixty-two.

(99) Brown intransitive: But he would fight for his own liberty rather than for any abstract principle connected with it -- such as “cause”.

The verb study is generally transitive (100), except in the Brown data, where study is frequently used with a prepositional phrase (101) or to generically describe the act of studying (102). Further investigation is needed to see what might be causing this difference; possible candidates include language change (since Brown is much older than BNC and WSJ), British-American differences, or micro-sense differences.

(100) BNC transitive: A much more useful and realistic approach is to study recordings of different speakers' natural, spontaneous …

(101) Brown intransitive: In addition, Dr. Clark has studied at Rhode Island State College and Massachusetts Institute of Technology.

(102) Brown intransitive: She discussed in her letters to Winslow some of the questions that came to her as she studied alone.

The verb flood is used intransitively more often in the BNC than in the other corpora. The Brown and WSJ uses tend to be transitive non-weather uses of the verb flood (103) and (104), while the BNC uses include more weather uses, which are more likely to be intransitive (105). Further investigation is needed to determine whether this is a result of the BNC discussing weather more often, or a result of which particular grammatical structures are used to describe the weather floods in British and American English.

(103) WSJ transitive: Lawsuits over the harm caused by DES have flooded federal and state courts in the past decade.

(104) Brown transitive: The terrible vision of the ghetto streets flooded his mind.

(105) BNC intransitive: … should the river flood, as he'd observed it did after heavy rain, the house was safe upon its hill.

2.3.5 Conclusion

The goal of the work performed in this section was to find a stable set of transitivity biases for 64 verbs to provide norming data for psychological experiments.

The first result is that 55 out of 64 single sense verbs analyzed did not change in transitivity bias across corpora. This suggests that for the goal of providing transitivity biases for single sense verbs, the influence of American vs. British English and broad based vs. narrow corpora may not be large. One would, however, expect larger cross corpus differences for verbs that are more polysemous than the particular set of verbs examined in this experiment.

The second result is that for the 9 out of 64 verbs that did change in transitivity bias, the shift in transitivity bias was largely a result of subtle shifts in verb sense between the genres present in each corpus. These two results suggest that when verb sense is adequately controlled for, verbs have stable subcategorization probabilities across corpora.

To the extent that stable subcategorization biases were found, this experiment was a success, and provides evidence that controlling for verb sense can reduce the degree of cross corpus variation in subcategorization probabilities. However, there are two important caveats: firstly, the corpora used were of similar discourse type (connected written discourse), a factor which masks the possible contributions of various discourse type factors in subcategorization variation, and secondly, the actual biases found in these corpora are only meaningful to the extent to which the contexts in the experimental task being normed resemble the contexts found in the corpora.

One possible future application of this work is to use the verb frequencies and subcategorization probabilities of multi-sense verbs to measure the degree of difference between corpora.

2.4 Conclusion

The main goal of this chapter is to show that verb sense and verb subcategorization are related, and that problems in psycholinguistics and computational linguistics can be solved by taking this relationship into account and treating different senses of verbs separately. One specific problem that this chapter addressed was the apparent differences in verb subcategorization frequencies found between different corpora and psychological experiments. These differences were found to result from a combination of verb-sense based differences and other contextual factors.

These results suggest that, because of the inherent differences between isolated sentence production and connected discourse, psycholinguists should not use probabilities from one genre to norm experiments in the other. In other words, ‘test-tube’ sentences are not the same as ‘wild’ sentences. These results also show that seemingly innocuous methodological devices, such as beginning sentences-to-be-completed with proper nouns (Debbie remembered…), can have a strong effect on the resulting probabilities. Because of this, norming procedures and experimental procedures should be closely matched so as to reduce the number of unintended factors influencing experimental outcomes. This argues against the concept of a ‘generic’ subcategorization probability for use in norming all psychological experiments.

On the other hand, these results suggest that for cases where corpus data does supply appropriate frequencies, large corpora such as the BNC provide similar subcategorization probabilities to the Brown Corpus when coarse-grained subcategorizations such as transitive vs. intransitive are used. The differences found between American and British English were much smaller than the differences found between balanced corpora and genre-specific corpora such as the Wall Street Journal. It is expected that if more fine-grained subcategorizations are used, such as subdividing according to the choice of preposition in a prepositional phrase, then the differences between subcategorization probabilities from American and British English corpora will increase.

The introduction to this chapter suggested that the ideal experiment for investigating the effects of sense and discourse type on cross corpus verb subcategorization frequency differences would be to label several corpora for a variety of sense and discourse type factors, and perform a regression to find out how much each of the factors affected the subcategorization probabilities from each of the corpora. This task was rejected as being infeasible. Nonetheless, the question still remains – how important is sense, and how important is discourse type?

The results in section 2.2, particularly those in 2.2.4, suggest that sense plays a small role, while factors such as spoken vs. written and single sentence vs. connected discourse play a large role. In contrast, the results in section 2.3 present a data set where the differences in subcategorization probability are almost entirely due to verb sense differences. These results appear to present a contradiction. In fact, sense and discourse type both have the potential to be important factors. If the verbs under investigation have multiple senses, then sense-based subcategorization differences are likely, whereas if most of the verbs have either a single sense or only a single common sense, then sense-based subcategorization differences are much less likely. Similarly, if the data sources under investigation are of a variety of discourse types, then discourse-based subcategorization differences are likely, while if the data sources are of similar discourse types, then discourse-based subcategorization differences are less likely. Because of this, the answer to the question of the relative importance of sense and discourse type depends on how different the corpora are and on which verbs are being investigated.

The results in this chapter suggest that if verb sense and discourse type are controlled for, then stable cross corpus subcategorization frequencies can be obtained. This suggests that verb sense sensitive parsing algorithms would not need to be retrained for each new domain16. It also suggests the plausibility of the mental lexicon containing an underlying set of subcategorization probabilities for each verb sense.

The observed subcategorization probabilities are the result of an interaction between the underlying subcategorization probabilities for each verb sense and factors such as the probability of a particular subcategorization given a particular discourse type, the likelihood of an event occurring in the world, the likelihood of needing to talk about an event, the likelihood of creating a new referent in an experimental context, and the perceived need to produce either a long or a short response in an experiment. This suggests a difficulty in obtaining experimental results that directly reflect the underlying probabilities, since the experimental protocols also influence verb subcategorization expectations.

16 Of course, it depends on the discourse types of the corpora being parsed. However, at present, the most common use of parsers is in parsing written connected discourse.

3 Predicting verb subcategorization from semantic context using the relationship between verb sense and verb subcategorization

3.1 Overview

Chapter 2 showed how various discourse factors combined with core subcategorization probabilities for individual senses of verbs to yield the observed subcategorization probabilities for these verbs in various sets of production (corpus) data. One reason for investigating the sources of these production frequencies is that they are also the basis for probabilities used in comprehension. Theories such as the tuning theory (Mitchell et al. 1995) argue that expectations in sentence processing are based on the frequencies of each structure in previous exposure. This implies that probabilities found in comprehension data should be the same as or similar to the probabilities found in production data. If verb subcategorization probabilities are based on individual senses of verbs, then, by implication, so are comprehension probabilities. This leads to the prediction that different semantic contexts preceding a verb should bias the human parser towards particular senses of verbs, and thus, towards particular subcategorizations of the verb. Recent experiments by Hare et al. (2001) show that the semantic context preceding a verb does in fact influence parsing decisions during comprehension. This chapter will develop a model of how the semantic context preceding a verb influences subcategorization predictions, and will show that the predictions of the model match the predictions made by the human subjects in Hare et al. (2001).

3.2 Psycholinguistic evidence for the effects of verb sense on human sentence processing

The goal of this chapter is to develop a model of how the semantic context preceding a verb influences verb subcategorization predictions. This model will be tested against parsing decisions made by human subjects in Hare et al. (2001). Hare et al. (2001) have shown that the semantic context preceding a verb influences the performance of human subjects in both sentence completion and reading time tasks. Hare et al. (2001) selected 20 verbs, each with multiple possible subcategorizations. They prepared a pair of biasing contexts for each verb such that one context biased the verb towards a sense with a direct object (DO) subcategorization, and the other towards a sense with a sentential complement (SC) subcategorization. Hare et al. (2001) report on two experiments based on these sets of verbs and biasing contexts. The first was a sentence completion experiment. Subjects were given sentence completion prompts preceded by either a DO biasing context, as in (106), an SC biasing context, as in (107), or no biasing context, as in (108). All 20 bias pairs are shown in Appendix B.

(106) (DO bias) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found ______.

(107) (SC bias) The intro psychology students hated having to read the assigned text because it was so boring. They found ______.

(108) (No biasing context) They found ______.

Table 47 shows the sentence completion patterns from Hare et al. (2001) for the 40 subjects in the norming study, combined across all 20 verbs. Subjects were more likely to complete a sentence with a DO when the verb was preceded by a semantic context that favored the DO biased sense of the verb. Similarly, subjects were more likely to complete a sentence with an SC when the verb was preceded by a semantic context that favored an SC interpretation of the verb. When there was no biasing context, the subjects tended to produce DO sentence completions. This reflects the inherent DO bias found in these verbs, and corresponds with the overall DO bias Hare et al. found for these verbs in four different corpora (Brown, WSJ, WSJ87/BLLIP, Switchboard).

Bias Context       DO     SC     Other
DO bias context    70%    20%    10%
SC bias context    20%    65%    15%
No bias context    58%    22%    20%

Table 47: Sentence Completion results from Hare et al. (2001).

In a second experiment, Hare et al. (2001) used self-paced reading to show that the semantic context also affected reading times. Subjects read biasing contexts followed by an SC use of the verb that was either ambiguous (no that) or unambiguous (with that).

(109) (DO bias/ambiguous) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found the book was written poorly and were annoyed that they had spent so much time trying to get it.

(110) (DO bias/unambiguous) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found that the book was written poorly and were annoyed that they had spent so much time trying to get it.

(111) (SC bias/ambiguous) The intro psychology students hated having to read the assigned text because it was so boring. They found the book was written poorly and difficult to understand.

(112) (SC bias/unambiguous) The intro psychology students hated having to read the assigned text because it was so boring. They found that the book was written poorly and difficult to understand.

They found that the reading times were influenced by the biasing context and the ambiguous/unambiguous nature of the target sentence. For example, the noun phrase the book is ambiguous in both (113) and (114), because it can fill the role of either the direct object of the verb found, or the subject of the verb was. However, because the preceding context is different, subjects expect the book to be a DO in example (113). Therefore, it takes longer to read the words following the book than it does in example (114), where the subjects are expecting a sentential complement. The average reading times across all 20 verbs are shown in Figure 5. Note that the second word after the noun phrase has a significantly slower reading time in the DO bias condition than it does in the SC bias condition. This word corresponds to the word written in example (113).

(113) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found the book was written poorly and were annoyed that they had spent so much time trying to get it.

(114) The intro psychology students hated having to read the assigned text because it was so boring. They found the book was written poorly and difficult to understand.

[Plot omitted: reading times (approx. 300-440 ms) for the ambiguous conditions at the positions V, det, noun, dis1, dis2, pdis1, pdis2.]

Figure 5: Effect of bias context on reading times in ambiguous condition, from Hare et al. (2001). (DA = DO bias, ambiguous condition; SA = SC bias, ambiguous condition)

When the complementizer that is present in the sentences, the DO and SC biases also affect reading times. In example (115), which has a DO bias context, subjects are expecting a direct object after the verb. Therefore, the reading times when the subjects get to the complementizer are significantly slower than when there is an SC bias context, as in example (116). Figure 6 shows the cross-verb average reading times when the complementizer that is present. The reading time for the word that is significantly slower when there is a DO bias than when there is an SC bias.

(115) The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. Finally, he admitted that the students had little chance of getting into the course.

(116) For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. Finally, though, he admitted that the students had little chance of succeeding.

[Plot omitted: reading times (approx. 300-380 ms) for the unambiguous conditions at the positions V, that, det, noun, dis1, dis2, pdis1, pdis2.]

Figure 6: Effect of bias context on reading times in unambiguous condition, from Hare et al. (2001). (DU = DO bias, unambiguous condition; SU = SC bias, unambiguous condition)

Both the sentence completion and reading time data from Hare et al. (2001) demonstrate that the semantic context preceding the verb can influence parsing decisions in human sentence processing. This influence is hypothesized to take place by means of the context biasing the parser towards a particular sense of the verb, which, in turn, biases the parser towards a particular subcategorization.

3.3 Model for predicting subcategorization from semantic context

The previous section described psycholinguistic experiments showing how the semantic context preceding a verb influences subcategorization predictions. This section will present a method, illustrated in Figure 7, for modeling such influences.

[Diagram omitted. Figure 7 shows the model as a flowchart: the context leading up to the verb and the verb itself are the input; the context leading up to the verb is compared with the contexts that have led up to previous examples of the verb; and a subcategorization prediction is made based on the subcategorizations of the “most similar” previous examples.]

Figure 7: Predicting subcategorization from the context preceding the verb.

This model is intended as an overall model in which many factors combine with verb sense predictions to yield verb subcategorization predictions. Thus, “most similar” in Figure 7 is intended to include similarity in terms of verb sense as well as other factors such as discourse type, genre type, the thematic fit of the subject of the verb with the various senses of the verb, and the recency of various syntactic patterns.


The model will be implemented by comparing the context leading up to the target verb with the contexts that have preceded corpus (training) examples. The model then guesses whichever subcategorization has semantic contexts that most closely resemble the context preceding the target verb. Latent Semantic Analysis (LSA), described in the next section, is used to measure semantic similarity between the target context and the training examples. The experiments in this chapter will demonstrate that the model used makes the same predictions as human subjects in Hare et al. (2001) based on the same contexts which preceded the verbs in the human subject experiment. The human subjects predicted an SC completion given an SC 74 bias context, and a DO completion given either a DO bias context or a neutral context. The subjects also preferred DO completions in the absence of a specific bias context, because the verbs in the study are more frequently used with DO completions. In order to model this data, the model in this chapter must predict SC completions given SC bias contexts. When given a DO bias context, the model must either predict a DO completion or make a neutral prediction, since the model can then fall back on the default DO preference of the verbs given a neutral context. The goal of the model is not to model a time course of human processing, where subcategorization predictions are revised with each new word. Rather, the model will show that the appropriate subcategorization prediction can be made at the point at which the human subject (or the computational model) has read the verb in the target sentence. Although the model is intended to make the same predictions as humans given the same input based on information gained through previous exposure to language, it is not intention to claim that the exact details of the model are the same mechanisms used by humans.

[Diagram omitted. Figure 8 shows a target sentence to be disambiguated (“The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. Finally, he admitted [????]”) being compared for semantic similarity against 200 corpus examples of contexts preceding DO uses of admit and 200 corpus examples of contexts preceding SC uses of admit, with the closest examples determining the prediction.]

Figure 8: Use of semantic similarity to predict subcategorization.

3.3.1 Previous related uses of LSA

The semantic similarity between the target bias contexts and the corpus examples will be measured using latent semantic analysis (LSA). LSA is a technique in which words or sets of words, such as sentences, are represented as vectors in a high-dimensional semantic space. The semantic space is created by using Singular Value Decomposition (SVD) to decompose a word co-occurrence matrix, which is then reconstructed with a reduced dimensionality. The details of this process are described in Landauer, Foltz, & Laham (1998), Schütze (1998), and Schütze (1997). The use of such semantic spaces for finding the degree of similarity between words or documents was originally developed for document retrieval tasks under the name Latent Semantic Indexing (Deerwester, Dumais, Furnas, Landauer, & Harshman 1990). It has since been used in a variety of tasks to model human judgments. Landauer & Dumais (1997) showed that LSA could be used to choose the appropriate synonym on the standardized TOEFL test with the same degree of accuracy as the average foreign student taking the test. Rehder et al. (1998) used LSA to select appropriate instructional materials for students by comparing essays written by the students with the instructional materials.

LSA has been shown to be effective in word sense disambiguation tasks. The present goal of predicting verb subcategorizations based on the preceding semantic context is a form of word sense disambiguation, because the subcategorizations being predicted are associated with particular senses of the verbs. If there were a one-to-one correspondence between verb subcategorization and verb sense, then the subcategorization prediction task would be identical to the verb sense disambiguation task. Even if there is not a one-to-one relationship, knowing the sense of the verb should help in predicting the subcategorization, as long as the different senses of the verb have different distributions of subcategorizations.

Schütze (1998, 1997) used a method based on SVD to disambiguate ambiguous words and pseudo-words. Schütze prepared a semantic space using a 20,000-by-1,000 word co-occurrence matrix based on 17 months of text from the New York Times News Service. The matrix was decomposed using SVD, and reduced to a 100-dimensional space. Any word within the list of 20,000 words can thus be represented by a 100-dimensional vector in this space.
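To make the dimensionality reduction step concrete, the following sketch builds a reduced semantic space from a toy co-occurrence matrix with a truncated SVD. It is a minimal illustration in Python using numpy, not the implementation used by Schütze or by the LSA tools; the matrix, the dimensionality, and the function name are invented for the example.

    import numpy as np

    def build_semantic_space(cooccurrence, n_dims):
        """Reduce a word-by-context co-occurrence matrix to n_dims dimensions
        using a truncated singular value decomposition."""
        U, s, Vt = np.linalg.svd(cooccurrence, full_matrices=False)
        # Keep only the n_dims largest singular values; each row of the result
        # is the reduced vector for one word.
        return U[:, :n_dims] * s[:n_dims]

    # Toy example: 6 words by 5 contexts, reduced to 2 dimensions.
    counts = np.array([[2, 0, 1, 0, 0],
                       [1, 0, 2, 0, 0],
                       [0, 3, 0, 1, 0],
                       [0, 2, 0, 2, 0],
                       [0, 0, 0, 1, 3],
                       [0, 0, 1, 0, 2]], dtype=float)
    word_vectors = build_semantic_space(counts, n_dims=2)
    print(word_vectors.shape)  # (6, 2)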

The disambiguation algorithm in Schütze (1998) and Schütze (1997) uses the following steps17:

1. Take all examples of the word to be disambiguated in the training set. There were between 1,030 and 21,374 examples of each word in the training set.

2. Prepare a context vector to represent each example. The context vector is created by adding the individual vectors for each word within a window of the target word. Window sizes of between 2 and 50 words on either side of the target were used. No significant benefit was found for including words at a distance greater than 15 words. Log inverse document frequency was used to weight the individual words when the context vector was created. This weighting favors words that occur in a few documents, and lessens the impact of words that occur in many documents, since these are less likely to be semantically important.

3. The context vectors for all of the examples of the target word in the training data are clustered into a small number of clusters using an automatic clustering algorithm. Each cluster represents an automatically generated sense of the target word. Either 2, 5, 10, 20, or 50 clusters were used. A larger number of clusters implies a finer-grained set of sense distinctions. A sense vector is prepared for each cluster: the sense vector is the centroid of the cluster, and the identity of the sense is determined by the majority sense in the cluster.

Words in the test set were then disambiguated as follows:

1. A context vector is prepared for the word to be disambiguated, using the same size window and weighting scheme as used in the training data.

2. The context vector for the word is then compared to the sense vectors created above. The word is then assigned the sense of the closest sense vector.

17 Several variations are reported in the two papers – this description is based on the simplest version.

Schütze tested two types of items. One type of item was pseudo-words. These are artificially created ambiguous words formed by replacing two separate unrelated words (or phrases, in this case) with a single pseudo-word. The disambiguation task is to figure out for each pseudo-word which of the original words or phrases had been at that location. For example, the phrases wide range and consulting firm can be replaced throughout the training and test set with the same pseudo-word. Whenever the pseudo-word appears in the test set, the task is to figure out whether wide range or consulting firm had originally appeared at that location. The advantage of using pseudo-words is that large quantities of training and test data can be automatically generated without the expense of hand labeling the data. The disadvantage is that disambiguating the pseudo-words may be an artificially easy task, since true ambiguous words may not have as clear a meaning distinction.

The other type of item tested by Schütze was true (naturally occurring) ambiguous words. An example would be the word capital, which can either mean investment money or the location of a government. Table 48 shows the word sense disambiguation results from Schütze (1997). This table lists the ambiguous word or pseudo-word phrase that was disambiguated, and the accuracy (percent correct) of the algorithm on the test set.

Pseudo-words                              Accuracy
wide-range / consulting-firm              74%
heart-disease / reserve-board             97%
urban-development / cease-fire            99%
drug-administration / Fernando-valley     98%
economic-development / right-field        100%
national-park / judiciary-committee       95%
Japanese-companies / city-hall            94%
drug-dealers / Paine-Webber               94%
league-baseball / square-feet             93%
Pete-Rose / nuclear-power                 96%

Ambiguous Words                           Accuracy
capital/s                                 90%
interest/s                                91%
motion/s                                  86%
plant/s                                   94%
ruling                                    92%
space                                     85%
suit/s                                    95%
tank/s                                    93%
train/s                                   85%
vessel/s                                  94%

Results based on the version of the algorithm described above, with the following values for the two variables: window size = 50, number of clusters = 10.

Table 48: Word sense disambiguation results from Schütze (1997).

Although the model implemented in this chapter is similar to the procedure used by Schütze (1997), it is also different in several ways. Schütze’s methodology would imply that the way to perform subcategorization prediction would be to take the 200 DO and 200 SC corpus examples of each verb and cluster them in semantic space. This is illustrated in Figure 9. Each cluster would then be assigned a subcategorization based on the majority membership in that cluster. The subcategorization predicted for the target context would be determined by finding which cluster centroid was closest to the vector for the context preceding the verb.

[Diagram omitted: a target context plotted in LSA semantic space among clusters of DO and SC corpus examples.]

Figure 9: Disambiguating the subcategorization of a target using Schütze style clusters in LSA semantic space.

However, this methodology will be modified by not pre-clustering the corpus examples. Instead, the relationships between the corpus examples and the target context will be used directly, and the subcategorization prediction for the target verb will simply be based on the subcategorization of the majority of the nearest N neighbors. This is illustrated in Figure 10. This modification allows the subcategorization to be determined by only the closest, and therefore, most relevant corpus examples. This potentially allows for changes in prediction based on more subtle differences in target contexts than would be possible with the original methodology. While similar target contexts may be closest to the same cluster, they could have different sets of nearest neighbors. This change in methodology also allows the effect of semantic distance on the accuracy of the subcategorization predictions to be investigated.

[Diagram omitted: the same semantic space as Figure 9, with neighborhoods of size 3 and size 10 drawn around the target context; the prediction is the majority subcategorization within the neighborhood.]

Figure 10: Disambiguating the subcategorization of a target using the subcategorizations of the nearest neighbors.
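The nearest-neighbor voting scheme sketched in Figure 10 can be illustrated with the following fragment. This is a simplified Python rendering written for illustration, not code from the dissertation; the function names are invented, and the vectors are assumed to be LSA context vectors computed elsewhere (e.g., with the tools at lsa.colorado.edu).

    import numpy as np

    def cosine(a, b):
        """Cosine of the angle between two LSA vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def predict_subcategorization(target_vec, example_vecs, example_labels, n_neighbors):
        """Predict 'DO' or 'SC' from the labels of the n most similar corpus examples."""
        sims = [cosine(target_vec, v) for v in example_vecs]
        # Rank the corpus examples from most to least similar to the target context.
        ranked = sorted(zip(sims, example_labels), reverse=True)
        votes = [label for _, label in ranked[:n_neighbors]]
        do, sc = votes.count("DO"), votes.count("SC")
        if do == sc:
            return "tie"  # treated as a 50/50 guess in the experiments below
        return "DO" if do > sc else "SC"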

3.3.2 Details of how LSA is used to measure semantic similarity

Semantic similarity between each of the bias contexts and corpus examples is measured in this chapter using the cosine of vectors for each in LSA semantic space. The cosine between the vectors is used rather than the geometric distance between the end points of the vectors because the length of the vector is related to the strength of knowledge about a word or sentence while the direction of the vector is related to the semantics of the word or sentence. The LSA tools used are publicly available at http://lsa.colorado.edu.
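As a small illustration of why the cosine is preferred, the snippet below (an invented toy example, not taken from the LSA tools) scales a vector without changing its direction: the Euclidean distance between the two vectors becomes large, while the cosine still treats them as identical in meaning.

    import numpy as np

    a = np.array([1.0, 2.0, 0.5])
    b = 3.0 * a  # same direction ("meaning"), three times the length ("strength")

    euclidean_distance = np.linalg.norm(a - b)                                   # about 4.58
    cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0

    print(euclidean_distance, cosine_similarity)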

The experiments in this chapter rely on the LSA TASA semantic space with 300 dimensions and “document” weighting. The TASA semantic space is based on the TASA corpus (Zeno, Ivens, Millard, & Duvvuri 1995) of school text book samples, and was chosen out of the available semantic spaces, because it is felt to most closely approximate18 the knowledge of a typical college undergraduate. The size of 300 dimensions was selected because this has been found to be an optimal size in other work. The optimization of this variable was not investigated in this work.

Several options are available for combining the vectors of individual words when creating the vector to represent a whole sentence or larger context. In all experiments in this dissertation, the vectors for each word were weighted by inverse document frequency, as was done by Schütze. This is known as “document” weighting in the LSA tools. A comparison of several weighting methods is described in section 3.4.2.

18 Of course, most language exposure comes from spoken language and TV rather than from reading text books in school, but this space seems to provide reasonable results for the task at hand, and there are no corpora covering the typical language exposure of an individual from birth through college from which a semantic space could be created.

3.3.3 Corpus (training) data used in model

For use in all experiments in this chapter, subcorpora of 200 randomly selected DO reference examples and 200 randomly selected SC reference examples were prepared for 20 verbs. These verbs are shown in Table 49. These verbs and subcategorizations were selected for use in the experiments that model the psychological data from Hare et al. (2001). Each corpus example consisted of the sentence containing the target verb plus the two preceding sentences. The subcorpora were created from the British National Corpus (BNC). The BNC was chosen for its large size (100 million words). Smaller corpora such as TASA (17 million words) (Zeno et al. 1995) and the Brown Corpus (1 million words) (Francis & Kucera 1982) were investigated; however, these did not have a sufficiently large number of examples for all of the verbs. Even with the large BNC corpus, the extraction process described below resulted in some verb/subcategorization combinations having a total count of less than 200.

The subcorpora were created in a multi-step process. First, all sentences containing the target verbs were extracted from the BNC. These sentences were then parsed using the Charniak parser (Charniak 1997) (version 4). The parsed sentences were then prepared for use with the tgrep tools provided with Treebank (Marcus et al. 1993). A series of tgrep scripts were used to extract DO and SC examples for each of the verbs. Because of the relatively high rate of error for some verb/subcategorization combinations in this process, the output was hand-checked, and incorrectly identified examples were removed from the subcorpora. A random sample of 100 examples was re-checked, and indicated an agreement rate of 98% between the initial hand coding and the re-checking. The 200 examples for each subcorpus were randomly selected from the output of the hand-correcting process. Verbs with fewer than 200 corpus examples per subcategorization were not used in the analyses below, reducing the number of verbs from 20 to 15. The final sample size for each verb is indicated in Table 49. Note that, because of the process used to generate these subcorpora, these sample sizes are potentially lower than the actual corpus frequencies of these verbs, due to false-negative errors in the parsing and tgrep search steps.

Verb          DO Sample Size    SC Sample Size
Acknowledge   200               200
Add           200               200
Admit         200               200
Anticipate*   200               183
Bet*          131               65
Claim         200               200
Confirm       200               200
Declare       200               200
Feel          200               200
Find          200               200
Grasp*        200               37
Indicate      200               200
Insert*       200               1
Observe       200               200
Project*      200               8
Recall        200               200
Recognize     200               200
Reflect       200               200
Report        200               200
Reveal        200               200

Table 49: Sample size for each verb. (Verbs marked with * were used in Hare et al. (2001), but were not used in the experiments in this dissertation due to small sample sizes.)

3.4 Experiments

3.4.1 Predicting the subcategorizations of the Hare et al. (2001) bias contexts

The goal of this experiment is to see if the model proposed above can make the same subcategorization predictions as human subjects do, based on the bias contexts used in Hare et al. (2001). The model's subcategorization prediction for a given bias context is determined by whether that bias context is more similar to the DO corpus examples or the SC corpus examples. Because the corpus examples include a variety of senses with varying (DO and SC) subcategorizations, the subcategorization prediction is based on the subcategorizations of only the most semantically similar (as determined by LSA) corpus examples.

This was done as follows. For each of the 15 verbs from Hare et al. (2001) for which 200 DO and 200 SC corpus examples were prepared, the cosines between each of the Hare et al. (2001) bias contexts and the 400 corpus examples for the appropriate verb were calculated using LSA and the TASA semantic space described above. For each bias context, the corpus examples were ranked by cosine from most similar to least similar. For each possible size of neighborhood, the number of SC and DO examples in the neighborhood was calculated. At a neighborhood size of 1, the subcategorization prediction is simply the subcategorization of the most similar corpus example. At a neighborhood size of 10, the subcategorization prediction is whatever the majority subcategorization is of the closest 10 corpus examples. At the largest possible neighborhood size, 400, the subcategorization prediction is split between DO and SC, because the neighborhood includes all 200 DO and all 200 SC examples.
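The per-neighborhood-size calculation just described can be sketched as follows. This is an illustrative Python rendering under the assumption that the LSA vectors for the bias context and the 400 corpus examples have already been computed; the function name is invented.

    import numpy as np

    def sc_proportion_by_neighborhood(target_vec, example_vecs, example_labels):
        """For each neighborhood size n, return the proportion of SC examples
        among the n corpus examples most similar to the target context."""
        sims = [float(np.dot(target_vec, v) /
                      (np.linalg.norm(target_vec) * np.linalg.norm(v)))
                for v in example_vecs]
        # Rank the corpus examples from most to least similar to the target.
        ranked_labels = [label for _, label in sorted(zip(sims, example_labels), reverse=True)]
        proportions, sc_so_far = [], 0
        for n, label in enumerate(ranked_labels, start=1):
            sc_so_far += (label == "SC")
            proportions.append(sc_so_far / n)
        return proportions

    # The prediction at neighborhood size n is SC if proportions[n - 1] > 0.5,
    # DO if it is < 0.5, and a 50/50 guess if it is exactly 0.5.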

3.4.1.1 Results and discussion

This section describes the results of modeling subcategorization predictions based on the biasing contexts from Hare et al. (2001). The results of modeling the SC bias contexts will be discussed first, followed by the results of modeling the DO bias contexts. The SC bias contexts were found to be more similar to corpus SC examples than to corpus DO examples, as expected. Figure 11 shows the percentage of SC examples at each neighborhood size, averaged across the 15 verbs. There are three main points of interest in this graph. First, neighborhoods of fewer than ten or so nearest examples show a high degree of variability in the subcategorization bias. This is due to the small sample size inherent in these neighborhoods. Recall that for an individual verb, at a neighborhood size of 1, the neighborhood is either 100% SC or 100% DO. When a large number of bias contexts is considered, this variability is averaged out. However, in this experiment, there are only 15 DO and 15 SC bias contexts.

The second area of interest is the area with neighborhood sizes of between 10 and 50. This area shows a high concentration of SC corpus examples. This indicates that the most similar corpus examples to the bias contexts are much more likely to be SC examples. This shows that the bias contexts are in fact semantically similar to the contexts that precede corpus SC examples.

The third area of interest is the area with neighborhood sizes between 50 and 400. In this region, the line is fairly flat, indicating a relatively similar frequency of SC and DO examples. This region consists of corpus examples that are not particularly related to the bias contexts, and thus do not show an SC or DO preference. However, this region is still useful in the subcategorization prediction task, because the neighborhoods still contain a majority of SC examples.

If the model were unsuccessful, and found no relationship between the semantic context and subcategorization, one would expect to find an equal number of SC and DO examples at any neighborhood size.

[Plot omitted: percentage of SC corpus examples in the neighborhood of the SC bias contexts (approx. 40-58%) as a function of neighborhood size (0-400).]

Figure 11: Average % SC corpus examples in neighborhood of SC bias contexts.

Figure 11 shows that the subcategorization bias of the corpus examples in the neighborhood of each biasing context can be used to predict subcategorization. Because this graph shows the average of 15 (SC) bias contexts, it does not directly translate to a measure of the accuracy of the model. The accuracy of the model at each neighborhood size was also calculated, based on the percentage of the 15 bias contexts for which the model made the correct prediction at each neighborhood size. If the majority of the examples in the neighborhood were SC examples, then the model was counted as being correct. If the majority of examples were DO examples, then the model was counted as being wrong. If there was a 50/50 split, which is possible at any even-numbered neighborhood size, and guaranteed at the neighborhood size of 400, then the model made a 50/50 guess at the subcategorization of the target context. The accuracy at each neighborhood size is shown in Figure 12. In general, the accuracy of the model increases as the neighborhood size increases. Although the nearest neighbors are the most relevant, and have the strongest bias towards being SC examples, as indicated by Figure 11, the model also benefits from the larger sample size inherent in the larger neighborhoods. The maximal result is approximately 85% accuracy. This is less than the sense disambiguation accuracy achieved in Schütze (1998) and Schütze (1997). However, the results from Schütze provide an upper bound on accuracy for this type of technique, based on the case where there is a one-to-one relationship between sense and subcategorization. For the verbs used in Hare et al. (2001), many of the SC uses can be paraphrased using a DO structure. The significance of the relationship between the DO and SC alternations that are possible with these verbs will be addressed in section 3.4.3.

[Plot omitted: accuracy of the model in predicting the subcategorization of the SC bias sentences (0-1) as a function of neighborhood size (0-400).]

Figure 12: Accuracy in predicting the subcategorization bias of the SC bias contexts.

Although the model was successful at predicting the subcategorization of the SC bias contexts, it was much less successful at predicting the subcategorization of the DO bias examples. However, subsequent experiments will show that this is due to properties of the DO bias contexts, rather than due to a flaw in the model. Figure 13 shows that after the initial instability of the closest neighborhood, due to the small sample size, the neighborhoods at subsequent neighborhood sizes consist of roughly an equal number of SC and DO examples. Additionally, Figure 14 shows that the accuracy is roughly 50%, and drops below 50% as the neighborhoods start to have a majority of SC examples at a neighborhood size larger than 250.

[Plot omitted: percentage of DO corpus examples in the neighborhood of the DO bias contexts (approx. 40-54%) as a function of neighborhood size (0-400).]

Figure 13: Average % DO corpus examples in neighborhood of DO bias contexts.

[Plot omitted: accuracy of the model in predicting the subcategorization of the DO bias sentences (0-0.8) as a function of neighborhood size (0-400).]

Figure 14: Accuracy in predicting the subcategorization bias of the DO bias contexts.

The results of this analysis show that the model proposed in this chapter can predict the correct verb subcategorization for the SC bias contexts in Hare et al. (2001), but not for the DO bias contexts. This is because, within the LSA semantic space used, the DO bias contexts are no more like the contexts preceding the corpus DO examples than like the contexts preceding the corpus SC examples. There are three general classes of possible explanation for this, two of which can be ruled out by the results of the remaining experiments in this chapter.

One possible reason is that the LSA semantic space may not appropriately capture the semantics that precede typical DO uses of the verbs in this experiment. If the LSA semantic space did not appropriately capture the relevant semantics, then one would not expect the DO bias contexts to be any more like the DO corpus examples than any other randomly selected set of corpus examples. However, experiment 3.4.2 will show that the model can successfully predict subcategorizations based on the contexts preceding naturally occurring DO examples found in the BNC for the same verbs.

A second possibility is that the verb senses on which the DO bias contexts in Hare et al. (2001) rely are rare, and are not sufficiently represented in the corpus data on which the model relies. If the DO corpus examples were primarily of different senses than those used in the bias contexts, then one would also expect an inability of the model to predict the correct subcategorization. This is because the model would treat the SC examples and the different sense DO examples as being similarly unrelated to the DO bias contexts. However, experiment 3.4.3 will show that the model is capable of making the correct subcategorization prediction even when the corpus data contains only a small quantity of same-sense DO examples.

The third possibility is that the DO bias contexts used in Hare et al. (2001) are not like the contexts that precede typical DO uses of these verbs, or in other words, the bias contexts are not DO bias contexts, but are “not SC” contexts. If this were the case, one would still have to explain why the subjects in the Hare et al. (2001) experiments produced DO uses in the sentence completion task when they were given the DO bias contexts, and why the subjects behaved as if they were expecting DO uses of the verbs in the reading time experiments after the DO bias contexts. The experiments in this chapter suggest that the DO expectations are not the result of particular contextual biases towards DO uses of these verbs, but are instead caused by inherent DO biases in these verbs. Table 50 shows corpus subcategorization frequencies for these verbs reported in Hare et al. (2001). For all corpora examined, the DO use of the verb is either the most frequent subcategorization, or at least more frequent than the SC use. Because Hare et al. (2001) do not report the breakdown of the Other category, it is not clear whether the DO uses in Switchboard are the most frequent use, or just more frequent than the SC uses.

Corpus         DO     SC     Other
Brown          43%    18%    29%
WSJ            38%    35%    20%
WSJ87/BLLIP    36%    31%    28%
Switchboard    31%    19%    50%

Table 50: Average subcategorization frequencies for 15 verbs used in experiment 3.4.1, taken from corpus frequencies reported in Hare et al. (2001).

Additional evidence comes from the sentence completion data from Hare et al. (2001). Table 52 shows the sentence completion frequencies for the 15 verbs used in experiment 3.4.1. Although the subjects produce DO verb uses after the DO bias context, and SC uses after the SC bias context, it is important to note that they also produce DO uses in a neutral context. Sample sentence completion prompts are shown in Table 51.

Bias context    Sentence completion prompt
No bias         He admitted ______.
DO bias         The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. Finally, he admitted ______.
SC bias         For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. Finally, though, he admitted ______.

Table 51: Sample sentence completion prompts.

Bias Context       DO     SC     Other
No bias context    55%    24%    21%
DO bias context    64%    25%    11%
SC bias context    12%    70%    18%

Table 52: Average subcategorization frequencies for 15 verbs taken from the sentence completion experiment in Hare et al. (2001).

The evidence presented in this section suggests that the LSA-based model is making correct subcategorization predictions based on the bias contexts used in Hare et al. (2001). However, it also suggests that the algorithm used in the model needs to be modified from “choose the subcategorization of the most similar corpus examples (or previously experienced examples, in the case of human sentence processing)” to “choose the subcategorization of the most similar examples, but choose the most frequent sense and subcategorization as a default if the context doesn't provide strong evidence for any other choice”.
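One simple way to state this modified decision rule in code is sketched below; the threshold value, the default, and the function name are illustrative assumptions rather than parameters taken from the dissertation.

    def predict_with_default(neighbor_labels, default="DO", threshold=0.6):
        """Choose the subcategorization of the most similar examples, but fall back
        to the verb's dominant subcategorization when the neighborhood does not
        provide strong evidence either way."""
        sc_share = neighbor_labels.count("SC") / len(neighbor_labels)
        if sc_share >= threshold:
            return "SC"
        if sc_share <= 1.0 - threshold:
            return "DO"
        return default  # weak or conflicting contextual evidence

    print(predict_with_default(["SC"] * 7 + ["DO"] * 3))  # SC
    print(predict_with_default(["SC"] * 5 + ["DO"] * 5))  # DO (the default)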

3.4.2 Predicting the subcategorizations of corpus bias contexts

The previous section showed that the model in this chapter was able to predict the subcategorizations of the SC bias contexts from Hare et al. (2001), but not the subcategorizations of the DO bias contexts. This leaves open the question of whether the model is capable of predicting both SC and DO subcategorizations. This experiment will investigate the ability of the model to predict verb subcategorization based on the contexts preceding naturally occurring corpus examples, in the hope that the contexts preceding corpus examples are somehow different from the artificially constructed contexts preceding the Hare et al. examples. The same 200 DO and 200 SC corpus examples for each verb are used as in the previous experiment. For this experiment, rather than predicting the subcategorization of the experimental bias contexts based on the degree of similarity with the 400 corpus examples, the subcategorization for each of the 400 corpus examples is predicted separately, based on the similarity with 398 of the remaining corpus examples. One example of the opposite subcategorization is removed from the corpus data, so that each target context is compared with 199 DO examples and 199 SC examples. The experiment essentially uses a training set of 398 items with a test set of 1 item, and 400-fold cross-validation.

3.4.2.1 Results and discussion

The results of this experiment show that the model is able to correctly predict the subcategorization of both DO and SC corpus examples. Figure 15 shows the number of SC examples in the neighborhood of each of the SC corpus examples, averaged across 200 examples for each of 15 verbs. As in the previous experiment, there is a peak concentration of SC examples in the neighborhood closest to the target context, and there is a majority of SC examples at any neighborhood size. This indicates that the model can correctly predict the subcategorization of the SC corpus examples.

[Plot omitted: percentage of examples with the same subcategorization in the neighborhood of the SC corpus examples (approx. 40-65%) as a function of neighborhood size (0-400).]

Figure 15: Average % SC corpus examples in neighborhood of SC target contexts.

Figure 16 shows how this distribution of SC and DO examples translates into percent accuracy, based on the same scheme as in experiment 3.4.1, where the model votes for whichever subcategorization is in the majority at a given neighborhood size. Again, the model makes a 50/50 random choice when the neighborhood is evenly split between DO and SC examples. Even though the closest examples are most useful in predicting subcategorization, the accuracy increases as neighborhood size increases. This is the result of a tradeoff between the relevance of the closest examples, and the reduction of noise in the sample as sample size increases with larger neighborhood size.

[Plot omitted: accuracy in predicting the subcategorization of the corpus SC examples (0-100%) as a function of neighborhood size (0-400).]

Figure 16: Accuracy in predicting the subcategorization bias of the SC corpus contexts.

The results for the DO corpus contexts are more complex than the results for the SC corpus examples. This is because the DO data contains many uses of verbs that are similar to the SC uses of the verbs. For example, a sentence containing “he admitted that …” would count as an SC use, while the similar “he admitted the fact that …” would count as a DO use of the verb admit. Figure 17 shows the number of DO examples at each neighborhood size. As expected, there is an initial high concentration of DO examples in the closest neighborhood sizes. However, at a neighborhood size of approximately 150, the neighborhoods start to actually contain a majority of SC examples. This counterintuitive shift is caused by the SC/DO sense confound mentioned above. However, this data does not directly show that the model can distinguish between the DO/SC sense alternations such as the one shown above. This issue will be addressed by experiment 3.4.3, which will show that the model is able to distinguish between the contexts preceding the SC and DO related senses of the verbs.

[Plot omitted: percentage of examples with the same subcategorization in the neighborhood of the DO corpus examples (approx. 48-58%) as a function of neighborhood size (0-400).]

Figure 17: Average % DO corpus examples in neighborhood of DO corpus contexts.

Figure 18 shows the accuracy in predicting the subcategorization of the corpus DO examples. Although the model does predict the correct subcategorization in the closest neighborhoods, the accuracy rapidly drops off to below chance, as the neighborhoods start to include the semantically related SC examples. An additional factor affecting the accuracy in identifying the subcategorizations of the DO examples relative to the accuracy in identifying the subcategorization of the SC examples is that the SC examples are primarily of a single sense of the verb, while the DO examples contain many unrelated senses. This means that the relevant (same sense) training data for SC examples is the whole set of 199 examples, while the relevant training data for the DO examples is only a fraction of the whole set.

[Plot omitted: accuracy in predicting the subcategorization of the corpus DO examples (0-70%) as a function of neighborhood size (0-400).]

Figure 18: Accuracy in predicting the subcategorization bias of the DO corpus contexts.

3.4.2.2 Additional analysis using corpus bias contexts

The corpus examples and contexts from experiment 3.4.2 were also used to test several parameters of the model. All of the experiments in this dissertation rely on inverse document frequency weighting, or “document” weighting19, when combining the LSA vectors of individual words to create the overall vector for the sentence or sentences. However, other methods of combining the vectors are possible. Experiment 3.4.2 was run using four possible methods in order to investigate the effects of the different methods on the accuracy of the model.

The first method, mentioned above, is “document” weighting, where the vectors for each word are weighted for inverse document frequency. In constructing the single vector which represents a bias context or reference context, document weighting favors words in the text that have more potential to discriminate between documents. In part, this lowers the weight of low semantic content words such as function words and increases the weight of words with more semantic content (even if this content is not relevant in discriminating between the DO and SC use of the target verb).

A second method of combining the vectors of the individual words in a text to obtain an overall vector for the whole text is “term” weighting. Unlike in document weighting, the individual vectors are not weighted. This method of combining the vectors for the words in a text allows lower content words to have more of an influence on the overall vector for the text.

19 “Document” and “term” weighting refer to options available in the tools located at http://lsa.colorado.edu.

Additionally, it is possible to normalize the vectors of the individual words in a text before combining them with the vectors for the other words. The length of the vector for each word is proportional to the frequency of that word in the corpus from which the semantic space is prepared. When the vectors for individual words are not normalized, words that occur more frequently in the corpus used to generate the semantic space have a stronger influence on the overall vector for a text. Normalizing the vectors of the individual words has been found to be beneficial in some cases (T. K. Landauer, personal communication, April 2001), since it reduces the influence of word frequency on the combined vector, but also has a side effect of increasing the error in the combined vector when words with a very low frequency are involved. This is due to the normalization increasing the weight of the noise in the vectors of the low frequency words.
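The four combination methods can be summarized in a single sketch. The code below is an illustrative Python reading of the descriptions above (inverse-document-frequency weighting vs. an unweighted sum, with optional normalization of each word vector); the function name and argument layout are assumptions, and the actual computations were performed with the LSA tools at lsa.colorado.edu rather than with this code.

    import numpy as np

    def text_vector(tokens, lexicon, doc_freqs, n_docs,
                    weighting="document", normalize=False):
        """Combine the LSA vectors of the words in `tokens` into one vector.

        weighting="document": weight each word vector by log inverse document frequency.
        weighting="term":     add the word vectors without weighting.
        normalize=True:       scale each word vector to unit length first, removing
                              the influence of word frequency on vector length."""
        dims = len(next(iter(lexicon.values())))
        total = np.zeros(dims)
        for word in tokens:
            if word not in lexicon:
                continue  # skip words outside the semantic space
            v = np.asarray(lexicon[word], dtype=float)
            if normalize:
                v = v / np.linalg.norm(v)
            if weighting == "document":
                v = v * np.log(n_docs / doc_freqs[word])
            total += v
        return total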

All four weighting methods resulted in qualitatively similar results, with document weighting producing slightly better results quantitatively. Figure 19 summarizes the effects of the different weighting methods. All of the methods showed the highest accuracy (prediction of DO and SC examples combined) in relatively close neighborhoods, with accuracy dropping off as the neighborhoods contained less relevant examples. The upper line, document weighting, is the average of the lines shown in Figure 16 and Figure 18 above. Additionally, the individual DO and SC curves for each weighting method similarly match the results shown above for experiment 3.4.2. These results show that the overall implications of the model are not particularly affected by the choice of weighting method, but that document weighting produces slightly more accurate subcategorization predictions.

[Plot omitted: combined DO and SC prediction accuracy (40-70%) as a function of neighborhood size (0-400) for document weighting, term weighting, normalized document weighting, and normalized term weighting.]

Figure 19: Comparison of various LSA weighting methods.

A second variable in the model was also examined using the data from experiment 3.4.2. In the experiments in this dissertation, the determination of subcategorization prediction was based on the majority subcategorization in each given neighborhood. However, this means that the closest example to the target in a neighborhood counts as much as the most distant example in the neighborhood. This is somewhat counter to the intuition inherent in the neighborhood concept that the examples closest to the target are the most relevant for making a subcategorization prediction. In order to investigate this issue, experiment 3.4.2 was rerun, with the model’s predictions being based on a weighted majority vote for each neighborhood size rather than the straight majority of each neighborhood. The votes were weighted by the cosine between the target example and the corpus training examples. Thus, the most semantically similar examples counted more than the less similar examples in predicting the subcategorization.
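As a rough sketch of the two voting schemes (an illustration only, not the actual scripts used for these experiments; the function and variable names are invented here), the prediction for a single target context could be computed as follows, with weighted=False giving the straight majority vote used elsewhere in this chapter:

import numpy as np

def predict_subcat(target_vec, train_vecs, train_labels, k, weighted=False):
    # train_vecs:   (n, d) array of combined LSA vectors for the training contexts
    # train_labels: list of "DO"/"SC" labels, one per training context
    # weighted=False -> straight majority vote over the k nearest contexts
    # weighted=True  -> each vote is weighted by its cosine with the target
    cos = train_vecs @ target_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(target_vec))
    nearest = np.argsort(-cos)[:k]  # indices of the k most similar contexts
    votes = {"DO": 0.0, "SC": 0.0}
    for i in nearest:
        votes[train_labels[i]] += cos[i] if weighted else 1.0
    return max(votes, key=votes.get)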

The results of this experiment are shown in Figure 20. Weighting the vote of each training example does not produce a large change in the results, but it does slow the drop-off in the performance of the model in predicting DO examples as neighborhood size increases. Although weighting the prediction of each neighborhood by cosine does result in a slight improvement, it seems more important to make sure that there is enough training data to ensure that the nearest neighborhoods contain a sufficient number of relevant examples.

Figure 20: Effects of weighting neighborhoods by cosine on accuracy in predicting subcategorization. Accuracy (0%–100%) is plotted against neighborhood size (0–400) for DO and SC examples under both the straight and the cosine-weighted majority vote.

3.4.3 Predicting the subcategorizations of examples of ‘admit’

The previous experiment showed that the model in this chapter can predict the correct subcategorizations for both DO and SC examples of the verbs. However, there are unanswered questions. For the verbs in question, the SC use tends to involve a single sense of the verb, while the DO uses tend to involve multiple senses of the verb. Additionally, the sense that has an SC subcategorization also tends to have a DO subcategorization. The verb admit is typical of these verbs. It has two broadly defined senses: one meaning confess, which allows both DO and SC subcategorizations, and a more loosely defined sense meaning allow to enter. Examples of these senses are provided below in Table 53. One question is how well the model can distinguish between two closely related senses with different subcategorizations, such as the DO-confess and SC-confess uses of the verb admit. A second question is how well the model can perform when there are multiple senses with the same subcategorization. Can the model correctly identify the subcategorization of a use when there are only a small number of examples in the training data? Experiment 3.4.3 will address both of these issues by analyzing the performance of the model based on sense- and subcategorization-labeled data for the verb admit.

3.4.3.1 Methods

For this experiment, the DO and SC subcorpora for the verb admit prepared above were hand-tagged for verb sense. All of the examples were classed as belonging to one of two senses, enter and confess. Because the enter sense only occurred in the DO examples, a total of three possible classifications existed for each example: DO-enter, DO-confess, and SC-confess. Examples (117) and (118) illustrate the SC-confess use of admit. Examples (119) and (120) illustrate the DO-confess use of admit. Examples (121), (122), (123), and (124) illustrate the DO-enter use of admit. DO-enter includes the sense of admitting evidence to court as a metaphoric extension. Table 53 also shows the resulting sample size for each sense/subcategorization combination.

SC-confess (N=200):
(117) Addressing the Supreme Soviet on Sept. 11, Ryzhkov admitted [that the working group had failed to synthesise the rival plans].
(118) You have admitted [that it was I who caused all the evidence to fall into a pattern].

DO-confess (N=150):
(119) Turner admitted [difficulties in motivating his side].
(120) Cocker, of Park Avenue, Teesville, Middlesbrough, also admitted [violent disorder and two further burglaries].

DO-enter (N=50):
(121) The door opened to admit [Rosa], wearing her customary black.
(122) Clearly, the decision to admit [a patient] to hospital must be taken only after very careful consideration.
(123) The Governing body of Somerville College has decided to admit [both men and women] from next year.
(124) It must be remembered, however, that the Order only permits the court to admit [hearsay evidence].

Table 53: Examples of senses and subcategorizations of admit.

This experiment relies on the same protocol as experiment 3.4.2, where each corpus context is compared with the contexts for the other 398 corpus examples. However, the results are separated by sense and subcategorization of the target context, rather than just subcategorization.

3.4.3.2 Results and discussion

Figure 21 shows the relative occurrence of each of the three sense-subcategorization combinations in the neighborhood of the corpus DO-confess examples. Because the three sense-subcategorization combinations have different frequencies, the frequencies at each neighborhood size are normalized, so that 100% represents the number of examples that would be expected in the neighborhood by random chance. This graph shows that the closest neighborhood sizes contain a higher than expected number of DO-confess examples, while at a greater distance, SC-confess examples predominate. DO-enter examples are under-represented at all neighborhood sizes. These results indicate that the examples most closely related to the DO-confess examples are the other DO-confess examples, but the SC-confess examples also have a high (though lower) degree of similarity, and thus predominate at slightly larger semantic distances.
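As a concrete illustration of this normalization (the numbers below are illustrative and use the class sizes from Table 53; the helper function is invented for this example):

def pct_of_expected(observed, class_count, pool_size, neighborhood_size):
    # observed count expressed as a percentage of the count expected by chance
    expected = neighborhood_size * class_count / pool_size
    return 100.0 * observed / expected

# With 50 DO-enter examples in a pool of 400, a neighborhood of 100 items is
# expected to contain 100 * 50/400 = 12.5 DO-enter examples by chance; if only
# 5 are observed, the plotted value is 5 / 12.5 = 40% of the baseline.
print(pct_of_expected(5, 50, 400, 100))  # prints 40.0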

Figure 21: Relative frequencies of each type of example in the neighborhood of DO-confess corpus examples. The percentage of the expected number in the neighborhood (baseline = 100%) is plotted against neighborhood size (0–400) for DO-confess, SC-confess, and DO-enter examples.

Figure 22 shows the distribution of the three sense-subcategorization combinations in the neighborhood of the SC-confess examples. As expected, the SC-confess examples dominate at most neighborhood sizes, and the semantically related, but slightly less similar DO-confess examples maintain a lower level of representation, while the unrelated DO-enter examples are underrepresented at all neighborhood sizes.


Figure 22: Relative frequencies of each type of example in the neighborhood of SC-confess corpus examples. The percentage of the expected number in the neighborhood (baseline = 100%) is plotted against neighborhood size (0–400) for SC-confess, DO-confess, and DO-enter examples.

Figure 23 shows the distribution of the sense-subcategorization combinations in the neighborhood of the DO-enter examples. As expected, the DO-enter examples predominate in the typical close neighborhood peak. At a larger distance, the three sense-subcategorization combinations have roughly equal representation in the neighborhoods. This is because most of the 50 DO-enter examples occurred in close proximity to the other DO-enter examples, and there is no additional supply of less related DO-enter examples to be added in at greater distances.

Figure 23: Relative frequencies of each type of example in the neighborhood of DO-enter corpus examples. The percentage of the expected number in the neighborhood (baseline = 100%) is plotted against neighborhood size (0–400) for DO-enter, DO-confess, and SC-confess examples.

The results of this experiment illustrate two points. One is that the model can distinguish between closely related senses such as SC-confess and DO-confess. The other is that the model can correctly identify the subcategorization of examples even when there are relatively few examples of the relevant sense in the training data, such as in the case of DO-enter. In fact, the task of identifying the subcategorization of the DO-enter examples is based on an even smaller set of relevant training data than is initially apparent. For each of the DO-enter test examples, there are 49 training examples (50 corpus examples less the test example). However, the DO-enter sense of admit includes a variety of metaphorically related uses, and is thus not a highly unified verb sense. Because these uses of admit occur in widely different contexts, the relevant set of training data for each of the test examples is probably much smaller than the whole set of 49 examples. Table 54 shows the frequencies of the different sub-senses of the 50 DO-enter corpus examples.

Sub-senses of ‘admit-enter’ (counts and examples from the BNC):
physical (20): This entrance was protected by a boarded fence and gate sufficiently wide to admit carriages.
court (10): He appealed, submitting that the judge wrongly admitted the evidence.
school (9): Neither Plato nor Aristotle would admit a student to the Academy at Athens if he did not like his face.
hospital (6): All four physicians admit elderly patients into the district hospital’s general medical beds.
membership (5): Areas of contention were brought to light during the summit, however, notably Austria’s objections to the other countries’ reliance on nuclear power, and opposition by Italy to admitting other states (specifically Poland) to the group.

Table 54: Counts and examples of subsenses of the 50 corpus examples of the DO-enter sense of admit.

3.5 Conclusion

The goal of this chapter was to demonstrate that the relationship between the context preceding a verb and the subcategorization of the verb could be used to predict the subcategorization of the verb. Experiment 3.4.1 showed that the model introduced in this chapter could successfully predict the subcategorizations of the SC bias contexts from Hare et al. (2001). However, the model had difficulty with predicting the subcategorizations of the DO bias contexts. This difficulty is attributed to a lack of DO-ness in the bias contexts rather than to a failure of the model. Experiment 3.4.2 showed that the model is capable of predicting both DO and SC subcategorizations when naturally occurring contexts are used rather than the artificially created DO contexts used in Hare et al. (2001). Experiment 3.4.3 showed that the model can make correct subcategorization predictions even when the verb sense involved has a minority representation in the training data, and that the preceding context can be used to distinguish between separate subcategorizations of the same coarse-grained verb sense.

One potential extension of the work in this chapter is to attempt to induce Levin (1993) type verb alternation groups from corpus data. The ability to induce such verb classes plays an important part in arguments over language learnability, an issue discussed extensively in Pinker (1989). A key question in the discussion of language learnability is whether there is enough evidence in the data available to a child learning English to properly learn which verbs do and do not appear in various subcategorizations. The subtle semantic distinctions between different classes of verbs are thought to play a key role in learning, but it is not clear whether there is sufficient semantic evidence available to allow for the induction of the verb alternation patterns. It is possible that over a large corpus, a learning system relying on the mechanisms used in this chapter might be able to induce the necessary distinctions.
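Purely as a sketch of what such an induction system might look like (nothing of the kind was implemented in this dissertation; the function names are invented, sklearn is an arbitrary choice of clustering tool, and whether the resulting clusters would correspond to Levin-style classes is exactly the open question raised above), each verb could be represented by the average LSA vector of its preceding contexts and the verbs then clustered:

import numpy as np
from sklearn.cluster import KMeans

def cluster_verbs(context_vectors, n_classes):
    # context_vectors: dict mapping each verb to a list of combined LSA
    # vectors, one per corpus context in which that verb occurs
    verbs = sorted(context_vectors)
    profiles = np.vstack([np.mean(context_vectors[v], axis=0) for v in verbs])
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(profiles)
    return dict(zip(verbs, labels))  # verb -> induced class id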

A second potential extension of this work is to use the relationship between the context preceding the verb and the subcategorization of the verb to extend the portability of statistical parsers. The introduction to this dissertation argued that the fact that these parsers do not take verb sense into account is at least a partial cause of the drop in performance when they are used on types of data that differ from the training data. If the parsers relied on a complete lexicalized grammar, rather than the simplified versions that they use, verb sense information would be an inherent part of the grammar. However, a parser that relied on a lexicalized grammar with no simplifying assumptions would require an inordinate amount of training data. Much as adding explicit subcategorization information back into the grammar makes up for the information lost through the independence assumptions and thus improves parser performance, adding explicit verb sense information should also make up for verb sense information which is lost through the independence assumptions. It is hoped that LSA-based verb sense disambiguation techniques such as the one used in this dissertation can improve parser performance at less expense than would be required to rely on a complete lexicalized grammar.

4 Conclusions and future work

The introduction to this dissertation posed several problems in psycholinguistics and computational linguistics that were attributed to different corpora and psycholinguistic data sources having different verb subcategorization probabilities. The problems included the reduction in performance faced by statistical parsers when they are used on data other than that for which they were trained, the issue of which verb subcategorization probabilities are represented in the mental lexicon, and the issue of which verb subcategorization probabilities are most appropriate to use in norming psycholinguistic experiments. Chapter 2 demonstrated that individual senses of verbs were a more appropriate locus for verb subcategorization probabilities than the verb lexeme. Chapter 2 also demonstrated that when verb sense and discourse type were controlled for, cross-corpus subcategorization variation was reduced. Chapter 3 demonstrated a model that could make verb subcategorization predictions based on the preceding semantic context, and showed that the predictions made by the model were the same as the predictions made by humans given the same contexts, thus indicating, in conjunction with the results from Hare et al. (2001), that the verb sense / subcategorization relationship plays a role in human sentence processing. The results in this dissertation both provide at least partial answers to the questions posed in the introduction and suggest several areas for future research.

4.1 Psycholinguistics

The evidence from Chapter 2 suggests that the context from which the norming data is taken should resemble the context in which the verb is used in the experiment, to the extent possible. Semantic biases towards different verb senses and a wide variety of discourse factors play a role in producing verb subcategorization expectations. These results also suggest that there really is no such thing as a neutral context for producing verb uses. Common norming methods such as sentence production and sentence completion have their own inherent biases. Subcategorization probabilities taken from corpora also face difficulties. On one hand, the generic average of all uses of a verb in a corpus may or may not correctly represent the properties of that verb as used in a particular experimental context. On the other hand, corpus data also does not necessarily reflect the previous language exposure of typical psychology experiment subjects. These subjects typically were not even born when the data for the Brown Corpus was collected. Subjects are exposed to a much wider variety of language use than the ten million words of textbook text found in the TASA corpus. This poses a problem for selecting a corpus for finding appropriate frequencies, and a problem for selecting a corpus for generating semantic spaces for use in LSA-type models. An example of the potential differences between corpus data and daily life experiences is shown by the list of the twenty closest words to the word grape in a semantic space based on the TASA corpus in Table 55.

huelga, chavez, cesar, pickers, ufwoc, awoc, nfwa, grafters, yuma, migrant, growers, arvin, picketed, nonunion, ufw, campesinos, causa, strikebreakers, afl

Table 55: 20 nearest neighbors of grape in the TASA LSA semantic space.

Given the difficulties in choosing appropriate corpus data for use in modeling human performance, it seems more appropriate to take studies which do not find strong correlations between corpus data and experimental data as reasons to look for more verb-sense and discourse-based differences, rather than as evidence of a lack of use of probabilistic information in human sentence processing.

4.2 Computational linguistics

The implications of this work for computational linguistics are less well defined than the implications for psycholinguistics. This work argues that controlling for verb sense should improve cross-corpus performance of statistical parsers, but it never presents an actual example of improvement as evidence. If the statistical parsers relied on complete lexicalized grammars (without the simplifying assumptions presently used), it is quite likely that controlling for verb sense would not produce better cross-corpus performance. This is because the knowledge of which combinations of lexical arguments each verb takes is largely sense-specific. However, simplifying assumptions, such as the (argument) independence assumptions used in Collins (1999), remove this information from the grammar. Such assumptions are necessary to reduce the quantity of training data needed, but in a trade-off, they induce additional error into the grammar. Verb subcategorization information is also lost through the independence assumptions. Just as adding explicit verb subcategorization information back into the grammar increased the accuracy of the Collins parser given a fixed amount of training data, adding explicit verb sense information should also increase the accuracy, particularly when the parser is used on corpora that are different from the training data. The usefulness of doing this depends on the relative expense of adding sense information versus the expense of expanding the training data until it is large enough that an accurate grammar can be induced without relying on assumptions such as the independence assumption.
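Schematically (this is an illustration of the argument only, not Collins’s actual model nor an implemented parser component; all of the names below are invented for the example), sense-conditioned subcategorization probabilities could enter such a parser as a mixture weighted by a context-based sense prediction:

def frame_prob(frame, verb, p_frame_given_sense, p_sense_given_context):
    # p_frame_given_sense[(verb, sense)][frame]: estimated from sense-tagged data
    # p_sense_given_context[sense]: e.g., produced by an LSA model of the
    # material preceding the verb, as in Chapter 3
    return sum(p_frame_given_sense[(verb, sense)].get(frame, 0.0) * p
               for sense, p in p_sense_given_context.items())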

4.3 Future work

This dissertation suggests several lines of future research. One is identifying other factors that influence verb subcategorization probabilities and cause differences between psychological data and corpus data. A related task is to actually label a large set of data for verb sense and discourse factors, in order to build and test a large-scale model of the relative contributions of each factor to verb subcategorization probability variation. A separate area of research would be to actually attempt to add verb sense information into a statistical parser. The results in Chapter 3 suggest that the information contained in the few sentences before a target verb is helpful in predicting both the sense and subcategorization of the verb, although these results do not demonstrate explicitly that this information would be additional information beyond that which is presently used in the various parsers.

Bibliography

Altmann, G. T. M., van Nice, K. Y., Garnham, A., & Henstra, J. (1998). Late closure in context. Journal of Memory and Language, 38, 459-484.

Argaman, V., Pearlmutter, N., & Garnsey, S. M. (1998). Lexical semantics as a basis for argument structure frequency biases. Poster presented at CUNY Sentence Processing Conference.

Argaman, V., & Pearlmutter, N. J. (in press). Lexical semantics as a basis for argument structure frequency biases. In P. Merlo & S. Stevenson (Eds.), Sentence processing and the lexicon: Formal, computational and experimental perspectives. Amsterdam: John Benjamins.

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL '98) (pp. 86-90). Montreal, Canada.

Bever, T. G. (1970). The cognitive basis for linguistic structure. In J. R. Hayes (Ed.), Cognitive development of language (pp. 279-362). New York: John Wiley.

Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.

Biber, D. (1993). Using register-diversified corpora for general language studies. Computational Linguistics, 19(2), 219-241.

Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Cambridge: Cambridge University Press.

Bikel, D. M. (2000). A Statistical model for parsing and word-sense disambiguation. 2000 Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 155-163). Hong Kong.

Carroll, G., & Rooth, M. (1998). Valence induction with a head-lexicalized PCFG. Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3). Granada.

Carroll, J., Minnen, G., & Briscoe, T. (1998). Can subcategorization help a statistical parser? 6th ACL/SIGDAT Workshop on Very Large Corpora (pp. 118-126). Montreal, Canada.

Chafe, W. (1982). Integration and involvement in speaking, writing, and oral literature. In D. Tannen (Ed.), Spoken and Written Language (pp. 35-53). Norwood, New Jersey: Ablex.

Chafe, W. (1987). Cognitive constraints on information flow. In R. S. Tomlin (Ed.), Coherence and grounding in discourse (pp. 1-16). Amsterdam: Benjamins.

Charniak, E. (1995). Parsing with context free grammars and word statistics (CS-95-28). Providence, Rhode Island: Brown University.

Charniak, E. (1997). Statistical parsing with a context-free grammar and word statistics. AAAI-97 (pp. 598-603). Providence, RI.

Clifton, C., Frazier, L., & Connine, C. (1984). Lexical expectations in sentence comprehension. Journal of Verbal Learning and Verbal Behavior, 23, 696-708.

Collins, M. (1999). Head-driven statistical models for natural language processing. Unpublished doctoral dissertation, University of Pennsylvania.

Connine, C., Ferreira, F., Jones, C., Clifton, C., & Frazier, L. (1984). Verb frame preference: Descriptive norms. Journal of Psycholinguistic Research, 13, 307-319.

Cuetos, F., Mitchell, D. C., & Corley, M. M. B. (1996). Parsing in different languages. In M. Carreiras & J. E. Garcia-Albea & N. Sabastian-Galles (Eds.), Language Processing in Spanish (pp. 145-187). Hillsdale, N.J.: Erlbaum.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391-407.

Fillmore, C. J. (1968). The case for case. In E. W. Bach & R. T. Harms (Eds.), Universals in Linguistic Theory (pp. 1-88). New York: Holt, Rinehart & Winston.

Fillmore, C. J. (1969). Types of lexical information. In F. Kiefer (Ed.), Studies in Syntax and Semantics (pp. 109-137). Dordrecht: Reidel.

Fillmore, C. J. (1986). Pragmatically controlled zero anaphora. Proceedings of the 12th Annual Meeting of the Berkeley Linguistics Society (pp. 95-107). Berkeley, CA.

Fodor, J. (1978). Parsing strategies and constraints on transformations. Linguistic Inquiry, 9, 427-473.

Ford, M., Bresnan, J., & Kaplan, R. M. (1982). A Competence-Based Theory of Syntactic Closure. In J. Bresnan (Ed.), The Mental Representation of Grammatical Relations (pp. 727-796). Cambridge: MIT Press.

Francis, W., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.

Fraser, B., & Ross, J. R. (1970). Idioms and unspecified NP deletion. Linguistic Inquiry, 1, 264-265.

Gahl, S. (1998a). Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus. Proceedings of ACL-98 (pp. 428-432). Montreal.

Gahl, S. (1998b). Automatic extraction of subcorpora for corpus-based dictionary-building. EURALEX '98 Proceedings: Papers submitted to the Eighth EURALEX Conference (pp. 445-452). University of Liège, Belgium.

Gahl, S., & Jurafsky, D. (2000). Coder's manual for the NORMA project on verb alternation biases (Technical Report 00-03). Boulder: University of Colorado Institute of Cognitive Science.

Gahl, S., Menn, L., Ramsberger, G., Jurafsky, D. S., Elder, E., Rewega, M., & Holland, A. L. (2001). Syntactic frame and verb bias in aphasia: Plausibility judgments of undergoer-subject sentences. Poster presented at TENNET, Montreal.

Garnsey, S. M., Pearlmutter, N. J., Myers, E., & Lotocky, M. A. (1997). The contributions of verb bias and plausibility to the comprehension of temporarily ambiguous sentences. Journal of Memory & Language, 37(1), 58-93.

Gibson, E., & Schuetze, C. T. (1999). Disambiguation preferences in noun phrase conjunction do not mirror corpus frequency. Journal of Memory & Language, 40(2), 263-279.

Gibson, E., Schuetze, C. T., & Salomon, A. (1996). The relationship between the frequency and the processing complexity of linguistic structure. Journal of Psycholinguistic Research, 25(1), 59-92.

Gildea, D. (2001). Corpus variation and parser performance. Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (pp. 167-172). Carnegie Mellon University.

Givon, T. (1979). On understanding grammar. New York: Academic Press.

Givon, T. (1984). Syntax: A functional/typological introduction. Amsterdam/Philadelphia: John Benjamins Publishing Company.

Givon, T. (1987). Beyond foreground and background. In R. S. Tomlin (Ed.), Coherence and grounding in discourse (pp. 175-188). Amsterdam: Benjamins.

Godfrey, J., Holliman, E., & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. Proceedings of ICASSP-92 (pp. 517-520). San Francisco.

Goldberg, A. (1995). Constructions. Chicago: University of Chicago Press.

Green, G. (1974). Semantics and syntactic regularity. Bloomington: Indiana University Press.

Gruber, J. (1965). Studies in lexical relations. Ph.D. Dissertation, MIT, Cambridge, MA.

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London/New York: Longman.

Hare, M., Elman, J., & McRae, K. (2001). Sense and structure: Meaning as a determinant of verb categorization preferences. Manuscript submitted for publication.

Holmes, V. M., Stowe, L., & Cupples, L. (1989). Lexical expectations in parsing complement-verb sentences. Journal of Memory and Language, 28, 668-689.

Jennings, F., Randall, B., & Taylor, L. K. (1997). Graded effects of verb subcategory preferences on parsing: Support for constraint-satisfaction models. Language and Cognitive Processes, 12(4), 485-504.

Johnson, C. R., & Fillmore, C. J. (2000). The FrameNet tagset for frame-semantic and syntactic coding of predicate-argument structure. Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL 2000) (pp. 56-62). Seattle WA, April 29-May 4, 2000.

Johnson, C. R., Fillmore, C. J., Wood, E. J., Ruppenhofer, J., Urban, M., Petruck, M. R. L., & Baker, C. F. (2001). The FrameNet Project: Tools for lexicon building, Version 0.7, July 13, 2001: Available through http://www.icsi.berkeley.edu/~framenet/book.html.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Lapata, M., Keller, F., & Schulte im Walde, S. (2001). Verb frame frequency as a predictor of verb bias. Journal of Psycholinguistic Research, 30(4), 419-435.

Levin, B. (1993). English verb classes and alternations. Chicago and London: The University of Chicago Press.

Levin, B., & Hovav, M. R. (1995). Unaccusativity at the syntax-lexical semantics interface. Cambridge: MIT Press.

Lin, D. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. ACL-97 (pp. 64-71). Spain.

Lowe, J. B., Baker, C. F., & Fillmore, C. J. (1997). A frame-semantic approach to semantic annotation. Proceedings of the SIGLEX workshop "Tagging Text with Lexical Semantics: Why, What, and How?" in conjunction with ANLP-97 (pp. 18-24). Washington, D.C.

MacDonald, M. C. (1994). Probabilistic constraints and syntactic ambiguity resolution. Language & Cognitive Processes, 9(2), 157-201.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

Maslin, J. (1991, March 22). Problems of homeless: The individual stories. New York Times, pp. C8.

Merlo, P. (1994). A corpus-based analysis of verb continuation frequencies for syntactic processing. Journal of Psycholinguistic Research, 23(6), 435-47.

Miller, G., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1993). Introduction to WordNet: An on-line lexical database.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.

Mitchell, D. C., Cuetos, F., & Corley, M. M. B. (1992). Statistical versus linguistic determinants of parsing bias: Cross-linguistic evidence. Paper presented at the 5th annual CUNY Conference on Sentence Processing. New York.

Mitchell, D. C., Cuetos, F., Corley, M. M. B., & Brysbaert, M. (1995). Exposure-based models of human parsing: Evidence for the use of coarse-grained (nonlexical) statistical records. Journal of Psycholinguistic Research Special Issue: Sentence processing: I, 24(6), 469-488.

Narayanan, S., & Jurafsky, D. (1998). Bayesian models of human sentence processing. Proceedings of the 20th annual conference of the Cognitive Science Society (pp. 752-757).

Pickering, M. J., Traxler, M. J., & Crocker, M. W. (2000). Ambiguity resolution in sentence processing: Evidence against frequency-based accounts. Journal of Memory and Language, 43, 447-475.

Pinker, S. (1989). Learnability and cognition: The acquisition of argument structure. Cambridge, MA, USA: MIT Press.

Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337-354.

Resnik, P. (1996). Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1-2), 127-159.

Roland, D., & Jurafsky, D. (1997). Computing verbal valence frequencies: corpora versus norming studies. Poster session presented at CUNY sentence processing conference. Santa Monica, CA.

Roland, D., & Jurafsky, D. (1998). How verb subcategorization frequencies are affected by corpus choice. Proceedings of COLING-ACL 1998 (pp. 1117-1121). Montreal, Canada.

Roland, D., & Jurafsky, D. (in press). Verb sense and verb subcategorization probabilities. In P. Merlo & S. Stevenson (Eds.), The Lexical Basis of Sentence Processing: Formal, Computational, and Experimental Issues. Amsterdam: John Benjamins.

Roland, D., Jurafsky, D., Menn, L., Gahl, S., Elder, E., & Riddoch, C. (2000). Verb subcategorization frequency differences between business-news and balanced corpora: The role of verb sense. Proceedings of the Workshop on Comparing Corpora (pp. 28-34). Hong Kong, October 2000.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Schütze, H. (1997). Ambiguity resolution in language learning: Computational and cognitive models (CSLI Lecture Notes No. 71). Stanford, CA: CSLI Publications.

Schütze, H. (1998). Automatic word sense disambiguation. Computational Linguistics, 24(1), 97-123.

Thompson, S. A. (1987). The passive in English: A discourse perspective. In R. Channon & L. Shockey (Eds.), In Honor of Ilse Lehiste/Ilse Lehiste Puhendusteos (pp. 497-511). Dordrecht: Foris.

Trueswell, J. C., Tanenhaus, M. K., & Kello, C. (1993). Verb-specific constraints in sentence processing: Separating effects of lexical preference from garden-paths. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19(3), 528-553.

Yarowsky, D. (2000). Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34(2), 179-186.

Zeno, S. M., Ivens, S. H., Millard, R. T., & Duvvuri, R. (1995). The educator's word frequency guide. Touchstone Applied Science Associates, Inc.


Appendix A: Subcategorizations and tgrep search strings

This section will provide the definitions of each of the subcategorization frames used in Section 2.2 of this dissertation. The subcategorization frequencies were extracted from three corpora from the Penn Treebank: the Brown corpus, the WSJ corpus, and the Switchboard corpus. This information was extracted using a series of perl scripts in conjunction with the tgrep utility. The search strings were designed in such a way that the strings each identify a non-overlapping set of examples, and that the combined set of strings identifies all possible tree structures. Rather than provide linguistic definitions for each of the subcategorizations, and then describe the differences between the theoretical definition of the subcategorization and the practical reality of what actually ended up in each of the counts, it seems more appropriate to start by explaining the actual search patterns.

The search patterns used were based on Treebank I style annotation, since the Brown Corpus was only available in Treebank I style annotation when this work was done. Since that time, a more sophisticated version of tgrep has been released [20], and the Brown Corpus has been annotated with Treebank II style annotation [21]. These improvements should allow for a much more accurate (or at least more elegant) identification of various subcategorizations than was possible with tgrep I and Treebank I. Nonetheless, the patterns and tools used in this dissertation had a low degree of error, once quotations were hand-identified for all communication verbs. This step of hand correction was essential, due to the extreme difficulties involved in identifying such arguments. Table 56 shows the results of an analysis of errors found in a random sample of the output of the search strings. This list does not include errors in identifying quotations for communication verbs, because these were all corrected by hand as part of the data collection process.

Treebank-based errors:
- PP attachment: 1%
- verb+particle vs. verb+PP: 2%
- NP/adverbial distinction: 2%
- miscellaneous mis-parsed sentences: 1%
Search-string-based errors:
- missed traces and displaced arguments: 1%

Table 56: Errors not including quote-finding errors for communication verbs.

[20] http://www.cs.cmu.edu/~dr/Tgrep2/
[21] The Treebank II version of the Brown corpus has not yet been released through the LDC, but is mentioned in Bikel (2000).

This section will present the actual tgrep search strings. The most important thing to note is that these patterns rely on a modified version of the original Treebank corpora. In all cases, the quotation marks in each corpus were replaced with (QUOTE OPEN_Q) and (QUOTE CLOSE_Q), so that the quotation marks could be used as part of the search patterns without interfering with the tgrep syntax. After this substitution was made, the corpora were recompiled using tprep. These search strings were pre-processed in a perl script before being fed into tgrep, with several substitutions. For ease of interpretation and presentation, the strings will be left in the unsubstituted form. $VERB in each string represents a regular expression containing all possible forms of the verb. For example, if the target verb were race, the regular expression would be “race|races|racing|raced”. Other substitutions include the following [22]:

$BE_GET = "is|are|was|were|be|am|been|get|gets|got|gotten|getting|being";

$NOT_PASSIVE="!>(VBN>(VP!NP%(AUX<<$BE_GET))) !>(VBN>(VP!NP%(/VB/<$BE_GET)))!>(VBN>(VP>(VP!

NP%(/VB/<$BE_GET))))!>(VBN>(VP!NP%(VP<(/VB/ <$BE_GET))))"; The total number of each verb was found with the following pattern. Note that the verb can not have a particle as a sister. Verb particle combinations were intentionally removed, as these tend to be idiomatic and have separate semantics that are not equivalent to the other uses of the verb. Also, the verb has to be dominated by a VP. This removes various nominalizations. Examples where the verb was dominated by an NP were identified using separate search patterns described below. VP<(/VB/<$VERB)!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($V ERB$NOT_PASSIVE))!>(S|SINV%S|SBARQ|SQ)!>(S|SINV%QUO TE)!>(S|SINV%/,/)!>(S|SINV%(S|NP

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))>(S|SINV%SBARQ|SQ)!>(S|SINV%QUOTE)!>(S|SINV%/,/))

[22] Please refer to the tgrep manpage for an explanation of the tgrep search syntax.

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))!>(S|SINV%S)!>(S|SINV%QUOTE)>(S|SINV%/,/))

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))!>(S|SINV%S)>(S|SINV%QUOTE))

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))>(S|SINV%(S!(S|SINV%QUOTE))

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))>(S|SINV%(S|NP(S|SINV%QUOTE))

The [PP] subcategorization was identified with the following string. This identifies the prepositional phrases which Treebank counted as arguments, but not those counted as adjuncts. In a hand check of a random sample of all labeled data, about 2% of the cases had a PP labeled as an argument which might better be counted as an adjunct or vice versa.

(VP!NP!NP!

(VP!NP!

(VP!NP!

(VP!NP!NP!NP!

(VP!NP!NP!

(VP!NP!NP!

(VP!NP!NP!NP!NP!

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..VP)!<(/NP/%../NP/|/PP/|S|/S-/|/SBAR/|X))

The [NP] examples were identified through a set of strings that are similar to the [0] strings. Note that traces marked in the Treebank count as NPs.

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))(VP>(S|SINV%S))!>(VP>(S%QUOTE)))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))(VP>(S|SINV%S))>(VP>(S%QUOTE)))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))(VP>(S|SINV%S))!>(VP>(S%QUOTE)))

The [NP NP] examples are identified through the following string. This string produces numerous false positives, particularly with verbs that do not have [NP NP] as a possible subcategorization. This is a result of time NPs appearing as sisters of the verb, as in "I saw him Monday". Although there are certain obvious NPs to look for, such as days of the week, the problematic NPs exhibit a Zipfian distribution. This problem was not corrected, and it affects all NP subcategorization counts, but it induces only a small amount of overall error in the results (approximately 2%).

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%../NP/))

The following string was used for [NP PP]:

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%../PP/)!<(/NP/%../NP/|S|/S-/|/SBAR/|VP|X))

The following strings were used for the [NP VPto] category:

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(S|/S-/<(AUX|TO<

(VP!NP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(/SBAR/!<<,that|0))!<(/NP/%../NP/|/PP/|S|/S-/|VP|X))

The following strings were used for the [NP Sfin] subcategorization:

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(/SBAR/<<,that))!<(/NP/%../NP/|/PP/|S|/S-/|VP|X))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(/SBAR/<<,0))!<(/NP/%../NP/|/PP/|S|/S-/|VP|X))

The following strings were used to identify passive examples:

(VP<(VBN<$VERB)!NP%(AUX<<$BE_GET))

(VP<(VBN<$VERB)!NP%(/VB/<$BE_GET))

(VP<(VBN<$VERB)!NP<(VP%(/VB/<$BE_GET)))

(VP<(VBN<$VERB)!NP%(VP<(/VB/<$BE_GET)))

Examples where the verb was dominated by an NP rather than a VP node were excluded. These cases include, but are not limited to, reduced relatives. The output of these strings was not included in the results in Chapter 2.

(VP!NP<(VB<($VERB$NOT_PASSIVE)))

(VP!NP<(VBD<($VERB$NOT_PASSIVE)))

(VP!NP<(VBG<($VERB$NOT_PASSIVE)))

(VP!NP<(VBN<($VERB$NOT_PASSIVE)))

(VP!NP<(VBP<($VERB$NOT_PASSIVE)))

(VP!NP<(VBZ<($VERB$NOT_PASSIVE)))

Data from the following strings was included in an “other” category. Some of these strings are logical possibilities included to ensure that all corpus examples were found by some string, and they did not necessarily match any corpus examples.

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))!

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))!

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..S|/S-/)!<(/NP/%..(S|/S-/<(AUX|TO<

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(X<<,that|0))!<(/NP/%../NP/|/PP/|S|/S-/|/SBAR/|VP))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(X!<<,that|0))!<(/NP/%../NP/|/PP/|S|/S-/|/SBAR/|VP))

Several types of examples are not correctly identified by the search strings. These bugs were not fixed, as an accuracy/effort trade-off.
1. Quotes without quotation marks cannot be identified at all. This is actually solved by hand-counting all quotes anyway.
2. Time NPs still cause false transitives, ditransitives, etc. This results in approximately a 2% error rate.
3. Pseudoclefts – approximately 26 in the Brown Corpus data used in this dissertation.
4. Comparatives – approximately 21 in the Brown Corpus data used in this dissertation.
5. Tough movement.


Appendix B: Stimuli used in Hare et al. (2001)

1. observe (SC) Trevor's teacher asked him to explain why there had been riots following the election in Bosnia. (Target) He observed (that) the election had probably been rigged / the previous year and that is what caused the problems.

(DO) A United Nations official was sent to Bosnia to keep an eye on the election. (Target) He observed (that) the election had probably been rigged / the previous year, so the UN wanted to make sure it wouldn't happen again.

2. admit (SC) For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. (Target) Finally, though, he admitted (that) the students had little chance of / succeeding.

(DO) The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. (Target) Finally, he admitted (that) the students had little chance of / getting into the course.

3. recall (SC) Mary Anne was happy to have time to sit down and read, but she couldn't locate the book she had started three days earlier. (Target) She recalled (that) the novel was sitting underneath the / magazines on her coffee table, so she got it and sat down on her favorite rocking chair.

(DO) Two people had requested the overdue book, so the librarian agreed to get it for them right away. (Target) She recalled (that) the novel was sitting underneath the / front counter, and that it had actually been returned earlier that day.

4. grasp (SC) As Rhonda lay alone in her bed, she began to understand why Ted had become such a good student lately. (Target) She grasped (that) her friend wanted to make a / good impression on the new teacher.

(DO) Rhonda saw Ted trip at the head of the stairs. (Target) She grasped (that) her friend wanted to make a / fool of himself and had done it on purpose.

5. recognize (SC) Joe had taken his mom's ailing sister into his home, and he wanted to keep her with him even though she wanted to move to a nursing home. (Target) Finally though, he recognized (that) his aunt was sick and her / care would be better at the home.

(DO) When Joe opened the door, he did not immediately know his mom's sister. (Target) Finally though, he recognized (that) his aunt was sick and her / appearance had changed dramatically.

6. indicate (SC) Ken had finally allowed his landlady to rent out his garage space while he was in Europe for a year. (Target) He indicated (that) the car was gone because he / had lent it to his nephew.

(DO) The day care worker asked the little boy to show her which toy he wanted. (Target) He indicated (that) the car was gone because he / had let another kid have it.

7. add (SC) Matthew was complaining to his wife about their kids' ridiculously busy schedule when he thought of one last thing to tell her. (Target) He added (that) their kids were fine just playing / in the local soccer league with their friends and he didn't want them trying out for the traveling team.

(DO) Matthew asked his wife for a pen as the two of them stood in front of the sign-up list for the kids' traveling team. (Target) He added (that) their kids were fine just playing / in the local league last year, so why not let them try out for the traveling team this year.

8. anticipate (SC) Liz and George were reassured by their broker's projections after stock prices fell badly in August. (Target) He anticipated (that) the market was going to fluctuate / but then prices would rise rapidly.

(DO) Unlike many people, George didn't lose any money when stock prices fell badly in August. (Target) He anticipated (that) the market was going to fluctuate / and moved his money into bonds.

9. reflect (SC) Maureen debated whether to park near the visitor’s center and hike in from there to Mt. Shasta, or to try to get a spot in the more crowded lot nearer the base of the peak. (Target) She reflected (that) the mountain might be too far / away if she parked in the first lot, so she took a chance and kept going.

(DO) When Maureen moved into her new apartment, she set up mirrors in her living room to try to get a good view of Mt. Shasta out her front window. (Target) She reflected (that) the mountain might be too far / away to see clearly, but it was worth trying all the same.

10. acknowledge (SC) For the past hour and a half, Susan had been bragging to her friend about her endless patience when it came to dealing with her curious 4-year-old son, even though it was something of a lie. (Target) Finally though, she acknowledged (that) her son bugged her very often because / of his boundless energy and endless questions.

(DO) At the dinner table, Kenny was eager to comment on the plans for the family trip, but his mom paid no attention to him. (Target) Finally though, she acknowledged (that) her son often bugged her because / of his ridiculous off-topic comments, and that she had been ignoring him on purpose.

11. find (SC) The intro psychology students hated having to read the assigned text because it was so boring. (Target) They found (that) the book was written poorly and / difficult to understand.

(DO) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. (Target) They found (that) the book was written poorly and / were annoyed that they had spent so much time trying to get it.

12. bet (SC) Anthony was deeply depressed about the damage to his property caused by the earthquake. (Target) He bet (that) his house was going to be / worth much less than it used to be.

(DO) Anthony had experienced a string of bad luck in the high stakes poker game, but because he was holding such a great hand he decided to stay in, using his property as collateral. (Target) He bet (that) his house was going to be / worth enough to let him stay in the game and to win back his money besides.

13. confirm (SC) Roger’s secretary asked him if he really did want to have tomorrow's meeting in the small conference room that was completely lacking any decent audiovisual equipment. (Target) He confirmed (that) the room was precisely the right / one because there were only going to be five people at the meeting.

(DO) Roger called university classroom reservations to finalize the location of his course for this semester. (Target) He confirmed (that) the room was precisely the right / one because there were only going to be five students in his course.

14. declare (SC) At the meeting, the parents objected strongly to the principal's decision to have yet another long weekend in May. (Target) They declared (that) a holiday was inappropriate because there / were important exams coming up soon.

(DO) Congress was looking for a way to honor the slain civil rights leader. (Target) They declared (that) a holiday was inappropriate because there / were better ways to honor him.

15. reveal (SC) Luke's wife finally asked him why he wasn't concerned about the kids stealing the package he had left on the seat of his car. (Target) He revealed (that) the box had actually been empty / and her pearls were safe in the upstairs closet.

(DO) Bob finally agreed to show Cindy the package that he had hidden under the bed. (Target) He revealed (that) the box had actually been empty / all along and her present was actually in the upstairs closet.

16. claim (SC) After his promotion, John sent a letter to the selection committee thanking them for choosing him. (Target) He claimed (that) the honor made him very happy / and was the most exciting thing that had ever happened to him.

(DO) After he won the competition, John went down to the awards center. (Target) He claimed (that) the honor made him very happy / and was the most exciting thing that had ever happened to him.

17. project (SC) Because the historian expressed concern about the delays that were piling up, the studio executives asked her when she might be delivering the finished product to them. (Target) She projected (that) the documentary would take about two / months longer than originally planned.

(DO) As she began her presentation in the viewing room, the producer asked her assistant to dim the lights. (Target) She projected (that) the documentary would take about two / hours and then she would answer questions.

18. insert (SC) The newspaper editors were arguing intensely and the reporter was having a hard time getting a word in edgewise. (Target) Finally though, she inserted (that) the paper seemed to be falling / apart and radical change was needed.

(DO) While Bob was sweeping the attic, June was getting frustrated at how hard it was to put the musty old documents back into their boxes. (Target) Finally though, she inserted (that) the paper seemed to be falling / apart and that she couldn't put it away without ripping it.

19. feel (SC) Rick was snug inside the cabin, but his horses were outside for the night and that worried him. (Target) He felt (that) the weather might become a problem / as the night wore on.

(DO) Rick was beginning to get a little cold as he climbed the icy mountain. (Target) He felt (that) the weather might become a problem / as time wore on.

20. report (SC) The newscaster had to take a deep breath before he could give details of the deaths at the high school. (Target) He reported (that) the students were caught by surprise / when the gunman appeared out of nowhere.

(DO) Danny loved being hall monitor at his high school because it gave him such a sense of power. (Target) He reported (that) the students were caught by surprise / when he walked into the bathroom and caught them smoking.