
Verb Sense and Verb Subcategorization Probabilities

by

Douglas William Roland
B.S., University of Delaware, 1989
M.A., University of Colorado, 1994

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Linguistics, 2001

Copyright © 2001 by Douglas William Roland

This thesis entitled: Verb Sense and Verb Subcategorization Probabilities written by Douglas William Roland has been approved for the Department of Linguistics

______Daniel Jurafsky

______Lise Menn

______Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Abstract

Roland, Douglas William (Ph.D., Linguistics)
Verb Sense and Verb Subcategorization Probabilities
Thesis directed by Associate Professor Daniel S. Jurafsky

This dissertation investigates a variety of problems in psycholinguistics and computational linguistics caused by the differences in verb subcategorization probabilities found between various corpora and experimental data sets. For psycholinguistics, these problems include the practical problem of which frequencies to use for norming psychological experiments, as well as the more theoretical issue of which frequencies are represented in the mental lexicon and how those frequencies are learned. In computational linguistics, these problems include the decreases in the accuracy of probabilistic applications such as parsers when they are used on corpora other than the one on which they were trained.

Evidence is presented showing that different senses of verbs and their corresponding differences in subcategorization, as well as inherent differences between the production of sentences in psychological norming protocols and language use in context, are important causes of the subcategorization frequency differences found between corpora. This suggests that verb subcategorization probabilities should be based on individual senses of verbs rather than the whole verb lexeme, and that “test tube” sentences are not the same as “wild” sentences. Hence, the influences of experimental design on verb subcategorization probabilities should be given careful consideration.

This dissertation will demonstrate a model of how the relationship between verb sense and verb subcategorization can be employed to predict verb subcategorization based on the semantic context preceding the verb in corpus data. The predictions made by the model are shown to be the same as predictions made by human subjects given the same contexts.

For Sumiyo

Acknowledgements

This dissertation is the result of an enormous amount of help, advice, and support from many people. However, unlike the body of the dissertation, where a simple table or graph can adequately convey a message, a page of text in the introduction cannot convey the impact that the people named here have had on this dissertation and on me as a researcher and as a human being.

At the top of the list are my advisor, Dan Jurafsky, my co-advisor, Lise Menn, and the other members of my committee, Alan Bell, Jim Martin, and Tom Landauer. Over the years, they have all given me more than a fair share of time, energy, insight, help, and patience. I can’t even begin to list all they have done.

Other present and former Boulder faculty members have also had a great influence on me, including: Paul Smolensky, who introduced me to computational modeling during my first semester in Boulder, Mike Eisenburg, who showed me the artistic side of programming, Laura Michaelis, and Barbara Fox.

I also owe special gratitude to Susanne Gahl, who read multiple versions of this dissertation and the preceding papers, and provided many useful suggestions for improving both the content and the clarity.

Several people have very kindly contributed their data to this dissertation. These include Charles Clifton, who lent me the original hand-written subject response sheets from the Connine, Ferreira, Jones, Clifton, & Frazier (1984) study, and thus caused me to re-evaluate many of my original thoughts about the causes of the differences between norming study data and other corpus data. Of similar importance was the subject response data from Garnsey, Pearlmutter, Myers, & Lotocky (1997), provided by Susan Garnsey. Chapter 3 would not exist at all if it were not for the data and support provided by Mary Hare, Ken McRae, and Jeff Elman.

Financially, this work was supported in part by: NSF CISE/IRI/Interactive Systems Proposal 9818827, NSF CISE/IRI/Interactive Systems Award IRI-9618838, NSF CISE/IRI/Interactive Systems Award IRI-970406, NSF CISE/IRI/Interactive Systems Award IIS-9733067

Many people have provided helpful feedback either on this dissertation, or on the papers and presentations, such as Roland & Jurafsky (1997), Roland & Jurafsky (1998), Roland & Jurafsky (in press), and Roland et al. (2000), that contain various pieces of the data and analysis in this dissertation. These include Charles Clifton, Charles Fillmore, Adele Goldberg, Mary Hare, Uli Heid, Paola Merlo, Neal Perlmutter, Philip Resnik, Suzanne Stevenson, and the ever-famous anonymous reviewers. This list also includes assorted office-mates and other people associated with Dan’s research group: (in an attempt at chronological order) Bill Raymond, Taimi Metzler, Giulia Bencini, Michelle Gregory, Traci Curl, Mike O’Connell, Cynthia Girand, Noah Coccaro, Chris Riddoch, Beth Elder, Keith Herold, and finally, Dan Gildea and Sameer Pradhan, both of whom provided significant help while I was attempting to tag and parse data from various corpora.

Special thanks to Yoshie Matsumoto for showing me where all the good coffee shops were for getting work done, and setting a pace and spirit during my first year in Boulder which I have attempted (not always successfully) to maintain since. Also, thanks to Faridah Hudson and Linda Nicita, who started with me, and have provided support and camaraderie during the long trip.

Many friends (mostly human, but also one canine and one feline) from the real world also made life a much better place to be: (in order of neighborhood proximity to our kitchen) Rie, Pepper, Emi, Virgil, Hal, Elizabeth, Jeremy, Benny, Raj, Noriko, Wes, Lee Wah, Nanako, Hiroko, Kiyoshi, (and now for a great leap in neighborhood distance) Yoshie, Kazu, Tomoko, and Taro. Also thanks to Imanaka Sensei for ocha and okashi.

I would also like to thank my parents for being parents, a job which included Mom spending a day of her vacation last summer proof-reading the whole dissertation.

None of the help from the people above would mean anything were it not for the support, encouragement, love, and extreme patience shown by my wife Sumi. I couldn’t have done it without her.

Contents

1 Introduction 1
1.1 Overview ...... 1
1.2 The importance of verb subcategorization probabilities in psycholinguistics ...... 1
1.3 The problem with verb subcategorization frequencies ...... 4
1.4 Verb subcategorizations and computational linguistics ...... 11
1.4.1 The importance of verb subcategorization information for statistical parsers ...... 12
1.4.2 Problem: the need to retrain parsers for new domains ...... 13
1.5 Solving the verb subcategorization frequency problem ...... 15
1.5.1 Evidence for the relationship between verb semantics and subcategorization from linguistics ...... 17
1.5.2 Evidence for the relationship between verb semantics and subcategorization from computational linguistics ...... 19
1.5.3 Evidence for the relationship between verb semantics and subcategorization from psycholinguistics ...... 20
1.6 Outline of Chapters ...... 21
1.6.1 Chapter 2 ...... 22
1.6.2 Chapter 3 ...... 23
1.6.3 Chapter 4 ...... 24

2 Subcategorization probability differences in corpora and experiments 25
2.1 Combining sense-based verb subcategorization probabilities with other factors to yield observed subcategorization frequencies ...... 25
2.1.1 Verb senses ...... 26
2.1.2 Probabilistic factors ...... 27
2.2 Experiment – comparing norming studies and corpus data ...... 28
2.2.1 Methodology ...... 28
2.2.1.1 Connine et al. (1984) sentence production study ...... 29
2.2.1.2 Garnsey et al. (1997) sentence completion study ...... 30
2.2.1.3 Brown Corpus ...... 31
2.2.1.4 Wall Street Journal Corpus ...... 32
2.2.1.5 Switchboard Corpus ...... 32
2.2.1.6 Extracting subcategorization probabilities from the corpora ...... 32
2.2.1.7 Measuring differences between corpora ...... 39
2.2.2 Results and discussion: Part 1 - Subcategorization differences resulting from comparing isolated sentence and connected-discourse corpora ...... 41
2.2.2.1 Discourse cohesion ...... 41
2.2.2.2 Other experimental factors – subject animacy ...... 46
2.2.2.3 Conclusion for section 2.2.2 ...... 47
2.2.3 Results and discussion: Part 2 - Subcategorization differences resulting from verb sense differences ...... 48
2.2.3.1 Verbs have different subcategorization frequencies in different corpora ...... 48
2.2.3.2 Verbs have different distributions of sense in different corpora ...... 49
2.2.3.3 Topics provided in norming studies also influence verb sense ...... 51
2.2.3.4 Subcategorization frequencies for each verb sense ...... 51
2.2.3.5 Factors that contribute to stable cross-corpus subcategorization frequencies ...... 52
2.2.3.6 Conclusion for section 2.2.3 ...... 55
2.2.4 Results and discussion: Part 3 – Reducing sense and discourse differences decreases the differences in subcategorization probabilities ...... 56
2.2.5 Conclusion ...... 58
2.3 Experiment – Controlling for discourse type and verb sense to generate stable cross corpus subcategorization frequencies ...... 58
2.3.1 Data ...... 59
2.3.2 Verb Frequency ...... 59
2.3.3 Subcategorization Frequency ...... 60
2.3.3.1 Methodology ...... 60
2.3.3.2 Results ...... 61
2.3.4 Discussion ...... 62
2.3.5 Conclusion ...... 65
2.4 Conclusion ...... 66

3 Predicting verb subcategorization from semantic context using the relationship between verb sense and verb subcategorization 68
3.1 Overview ...... 68
3.2 Psycholinguistic evidence for the effects of verb sense on human sentence processing ...... 68
3.3 Model for predicting subcategorization from semantic context ...... 72
3.3.1 Previous related uses of LSA ...... 74
3.3.2 Details of how LSA is used to measure semantic similarity ...... 79
3.3.3 Corpus (training) data used in model ...... 80
3.4 Experiments ...... 81
3.4.1 Predicting the subcategorizations of the Hare et al. (2001) bias contexts ...... 81
3.4.1.1 Results and discussion ...... 82
3.4.2 Predicting the subcategorizations of corpus bias contexts ...... 87
3.4.2.1 Results and discussion ...... 88
3.4.2.2 Additional analysis using corpus bias contexts ...... 91
3.4.3 Predicting the subcategorizations of examples of ‘admit’ ...... 93
3.4.3.1 Methods ...... 94
3.4.3.2 Results and discussion ...... 95
3.5 Conclusion ...... 97

4 Conclusions and future work 99
4.1 Psycholinguistics ...... 99
4.2 Computational linguistics ...... 100
4.3 Future work ...... 100

Bibliography 101

Appendix A: Subcategorizations and tgrep search strings 106

Appendix B: Stimuli used in Hare et al. (2001) 112

Tables

Table 1: Correlation values from comparisons in Merlo (1994) ...... 5

Table 2: High, middle, and low attachment sites from Gibson & Schuetze (1999)...... 7

Table 3: Correlations (r) and agreement (b) for comparisons for DO and SC subcategorizations from Lapata et al. (2001)...... 9

Table 4: Correlations (r) and agreement (b) for comparisons for NP and 0 subcategorizations from Lapata et al. (2001)...... 9

Table 5: Sample grammar rules (made-up probabilities)...... 12

Table 6: Results from Gildea (2001)...... 14

Table 7: Senses and possible subcategorizations of admit from WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller 1993)...... 16

Table 8: Approximate size of each corpus...... 29

Table 9: Connine et al. (1984) protocol 1 sample prompts and subject responses ...... 29

Table 10: Connine et al. (1984) protocol 2 sample prompts and subject responses ...... 30

Table 11: 127 verbs used from Connine et al. (1984) ...... 30

Table 12: Sentence Completion protocol used to collect subcategorization frequencies by Garnsey et al. (1997)...... 30

Table 13: Example subcategorization-frame probabilities for each of the three subcategorization frame classes (DO-bias, SC-bias, and EQ-bias) of Garnsey et al. (1997)...... 31

Table 14: 127 verbs used from Garnsey et al. (1997)...... 31

Table 15: List of subcategorizations...... 33

Table 16: Examples of each subcategorization frame taken from the Brown Corpus...... 34

Table 17: Examples of each subcategorization frame from the response sheets for the CFJCF data...... 35

Table 18: Raw subcategorization vectors for hear from BC and WSJ...... 40

Table 19: Modified subcategorization vectors for hear from BC and WSJ for use in calculating Chi Square ...... 41

Table 20: Use of passives in each corpus ...... 42

Table 21: The object of follow is only omitted in connected-discourse corpora. (numbers are hand-counted, and indicate % of omitted objects out of all instances of follow) ...... 43

Table 22: Greater use of first person subject in isolated-sentences...... 44

Table 23: Use of VP-internal NPs which are anaphorically related to the subject ...... 44

Table 24: Token/Type ratio for arguments of accept...... 45

Table 25: Subcategorization of worry affected by sentence-completion paradigm ...... 46

Table 26: Uses of worry...... 46

Table 27: Agreement between WSJ and BC data...... 49

Table 28: Differences in distribution of verb senses between BC and WSJ...... 49

Table 29: Examples of common senses of charge and their frequencies...... 50

Table 30: Examples of common senses of jump and their frequencies...... 50

Table 31: Examples of common senses of pass and their frequencies...... 51

Table 32: Uses of pass in different settings in the CFJCF sentence production study...... 51

Table 33: Different senses of charge in WSJ have different subcategorization probabilities. Dominant prepositions are listed in parentheses after the frequency...... 52

Table 34: Improvement in agreement when after controlling for verb sense...... 52

Table 35: Agreement between BC and WSJ data...... 52

Table 36: Differences in distribution of verb sense between BC and WSJ...... 53

Table 37: Examples of common senses of kill...... 53

Table 38: Examples of common senses of stay...... 54

Table 39: Examples of common senses of try...... 54

Table 40: Senses and subcategorizations of kill in WSJ...... 55

Table 41: Improvements in agreement for the verb hear ...... 57

Table 42: 64 verbs chosen for analysis ...... 59

Table 43: Number of verbs out of 64 showing a significant difference in frequency between corpora...... 60

Table 44: Verbs that BNC and Brown both have more of than WSJ...... 60

Table 45: Verbs that WSJ has more of than both Brown and BNC...... 60

Table 46: bias in each corpus...... 62

Table 47: Sentence Completion results from Hare et al. (2001)...... 69

Table 48: Word sense disambiguation results from Schütze (1997)...... 77

Table 49: Sample size for each verb. (Verbs marked with * were used in Hare et al. (2001), but were not used in the experiments in this dissertation due to small sample sizes.)...... 81

Table 50: Average subcategorization frequencies for 15 verbs used in experiment 3.4.1, taken from corpus frequencies reported in Hare et al. (2001)...... 86

Table 51: Sample sentence completion prompts...... 87

Table 52: Average subcategorization frequencies for 15 verbs taken from sentence completion experiment in Hare et al. (2001)...... 87

Table 53: Examples of senses and subcategorizations of admit...... 94

Table 54: Counts and examples of subsenses of the 50 corpus examples of the DO-enter sense of admit...... 97

Table 55: 20 nearest neighbors of grape in the TASA LSA semantic space...... 100

Table 56: Errors not including quote-finding errors for communication verbs ...... 106

Figures

Figure 1: High, middle, and low attachment sites (from Gibson et al. 1996)...... 7

Figure 2: Sample lexicalized tree taken from Charniak (1995)...... 12

Figure 3: Semantic structures for two different syntactic patterns of ‘spray’ (Pinker 1989, page 228)...... 19

Figure 4: Model showing why different corpora have different subcategorization probabilities for the same verb...... 26

Figure 5: Effect of bias context on reading times in ambiguous condition, from Hare et al. (2001). (DA = DO bias, ambiguous condition, SA = SC bias, ambiguous condition)...... 71

Figure 6: Effect of bias context on reading times in unambiguous condition, from Hare et al. (2001). (DU = DO bias, unambiguous condition, SU = SC bias, unambiguous condition)...... 72

Figure 7: Predicting subcategorization from the context preceding the verb...... 73

Figure 8: Use of semantic similarity to predict subcategorization...... 74

Figure 9: Disambiguating the subcategorization of a target using Schütze style clusters in LSA semantic space...... 78

Figure 10: Disambiguating the subcategorization of a target using the subcategorizations of the nearest neighbors...... 79

Figure 11: Average % SC corpus examples in neighborhood of SC bias contexts ...83

Figure 12: Accuracy in predicting the subcategorization bias of the SC bias contexts...... 84

Figure 13: Average % DO corpus examples in neighborhood of DO bias contexts .85

Figure 14: Accuracy in predicting the subcategorization bias of the DO bias contexts...... 85

Figure 15: Average % SC corpus examples in neighborhood of SC target contexts...... 88

Figure 16: Accuracy in predicting the subcategorization bias of the SC corpus contexts...... 89

Figure 17: Average % DO corpus examples in neighborhood of DO corpus contexts ...... 90

Figure 18: Accuracy in predicting the subcategorization bias of the DO bias contexts ...... 91

Figure 19: Comparison of various LSA weighting methods...... 92

Figure 20: Effects of weighting neighborhoods by cosine on accuracy in predicting subcategorization...... 93

Figure 21: Relative frequencies of each type of example in the neighborhood of DO-confess corpus examples...... 95

Figure 22: Relative frequencies of each type of example in the neighborhood of SC-confess corpus examples...... 96

Figure 23: Relative frequencies of each type of example in the neighborhood of DO-enter corpus examples ...... 96

1 Introduction

1.1 Overview

This dissertation will investigate a variety of problems in psycholinguistics and computational linguistics caused by the differences in verb subcategorization probabilities found between various corpora and experimental data sets. For psycholinguistics, these problems include the practical problem of which frequencies to use for norming psychological experiments as well as the more theoretical issue of which frequencies are represented in the mental lexicon and how those frequencies are learned. In computational linguistics, these problems include the decreases in the accuracy of probabilistic applications such as parsers when they are used on corpora other than the one on which they were trained.

Chapter 2 will demonstrate that different senses of verbs and their corresponding differences in subcategorization as well as inherent differences between the production of sentences in psychological norming protocols and language use in context are important causes of the subcategorization frequency differences found between corpora. This leads to two conclusions: 1) verb subcategorization probabilities, for psycholinguistic models and for norming purposes, should be based on individual senses of verbs rather than the whole verb lexeme, and 2) “test tube” sentences are not the same as “wild” sentences, and thus the influences of experimental design on verb subcategorization probabilities should be given careful consideration.

Chapter 3 will demonstrate a computational model, based on Latent Semantic Analysis, of how the relationship between verb sense and verb subcategorization can be employed to predict verb subcategorization based on the semantic context preceding the verb in corpus data. This chapter will also demonstrate that the predictions made by the model are the same as predictions made by human subjects given the same contexts. This will be accomplished by showing that the predictions from the algorithm correspond with parsing decisions made by human subjects in reading time experiments performed by Hare, Elman, & McRae (2001).

1.2 The importance of verb subcategorization probabilities in psycholinguistics

Verb subcategorization probabilities play an important role in recent psycholinguistic theories of human language processing and in computational linguistic applications such as probabilistic parsers. This section will address the role of verb subcategorization probabilities in psycholinguistics, providing examples of both evidence of how verb subcategorization probabilities affect sentence processing and of how various researchers have generated norming materials for their experiments. Section 1.4 will address the role of verb subcategorization probabilities in computational linguistics.

Fodor (1978) argued that the transitivity preferences of verbs affect the processing of sentences containing those verbs. She argues, based on intuition and informant judgment, that sentence (1) is more difficult to understand than sentence (2). The additional difficulty in (1) is attributed to the parser proposing a gap after the verb read at the location marked by (_).

(1) Which book_i did the teacher read (_) to the children from _i?

(2) Which student_i did the teacher go to the concert with _i?

Alternatively, example (3) is more difficult than example (4). This time, the difficulty is attributed to the parser not proposing the filled gap (_) after the verb walk.

(3) Which student_i did the teacher walk (_)_i to the cafeteria?

(4) Which student_i did the teacher walk to the cafeteria with _i?

These patterns of difficulty argue against both theories where the parser always proposes gaps, and theories where the parser never proposes gaps. Fodor argues that the key difference between these sets of examples is that the verb read commonly takes a direct object (DO), so the parser proposes a gap, while the verb walk occurs more commonly without a DO, so no gap is proposed.

Clifton, Frazier, & Connine (1984) provided experimental evidence for the relationship between verb subcategorization expectations and processing difficulty discussed by Fodor. Clifton et al. (1984) relied on a norming study by Connine, Ferreira, Jones, Clifton, & Frazier (1984) for verb bias data. Parsing difficulties were measured by on-line reaction times in grammaticality judgment and secondary task protocols.

Ford, Bresnan, & Kaplan (1982) showed how lexical preference affects the parsing of ambiguous sentences. Subjects were given different ambiguous sentences and asked to choose a meaning for each sentence. The subjects’ interpretations of the sentences changed when different verbs were used in otherwise identical sentences. Examples (5) and (6) show how parse preferences, indicated in parentheses, change when the verb is changed. These changes in parse preference indicate that some information is associated with the verb that influences parsing decisions.

(5) They objected to everyone that they couldn’t hear.
a. They objected to everyone who they couldn’t hear. (55%)
b. They objected to everyone about the fact that they couldn’t hear. (45%)

(6) They signaled to everyone that they couldn’t hear.
a. They signaled to everyone who they couldn’t hear. (10%)
b. They signaled to everyone the fact that they couldn’t hear. (90%)

Trueswell, Tanenhaus, & Kello (1993) showed that the subcategorization bias of the verb affects the parsing difficulties in sentences with the sentential complement (SC) / direct object (DO) ambiguity. In this ambiguity, the noun phrase after the verb can be interpreted either as the direct object of the verb or as the subject of a sentential complement. In example (7), the student is the direct object of the verb accept, while in example (8), the student is the subject of the sentential complement the student wrote the paper.

(7) The teacher accepted the student.

(8) The teacher accepted the student wrote the paper.

If the verb is more frequently used with a direct object (i.e. has a DO bias), then parsing is more difficult in the region after the words the student than it is if the verb is more frequently used with a sentential complement (i.e. has an SC bias). In order to determine the subcategorization bias of different verbs, Trueswell et al. (1993) used a sentence completion task. Subjects were given the initial portion of a sentence, such as John insisted, and asked to complete the sentence. The subjects’ completions were categorized as to whether they had written a direct object, a sentential complement, or some other use. In separate experiments relying on naming latency, self-paced reading times, and eye tracking, they found that subjects had difficulties in sentences with the SC/DO ambiguity when the verbs had a DO bias, but not when the verbs had an SC bias.

Garnsey, Pearlmutter, Myers, & Lotocky (1997) showed that both verb bias and the plausibility of the noun phrase following the verb as a direct object played a role in parsing in the same direct object / sentential complement ambiguity investigated in Trueswell et al. (1993). This confirmed the results of Trueswell et al. (1993), and separated out the effects caused by the plausibility of the noun phrase. As part of this project, Garnsey et al. (1997) performed a much larger sentence completion norming study on a superset of the verbs normed by Trueswell et al. (1993). Garnsey et al. (1997) selected candidate verbs based on the Connine et al. (1984) sentence production study.

MacDonald (1994) proposed that the difficulties in resolving syntactic ambiguities, such as in garden path sentences, were influenced by probabilistic constraints including “the frequencies of the alternative structures of ambiguous verbs.” This claim was supported by showing that the interpretation of reduced relative constructions was related to the degree to which the verb was used intransitively. Reduced relatives formed with highly transitive verbs such as interview, as in (9), are less difficult to process than reduced relatives formed with verbs that are frequently used intransitively, such as race, as in (10).

(9) The homeless people interviewed in the film are exceptionally calm … (MacDonald (1994), taken from Maslin (1991))

(10) # The horse raced past the barn fell. (MacDonald (1994), originally from Bever (1970))

(The # symbol will be used to indicate anomalous examples or garden path examples.)

Jennings, Randall, & Taylor (1997) demonstrated graded effects of verb subcategorization preferences on sentence parsing. They determined the degree of bias for 93 verbs that could take both a direct object and a sentential complement completion, using a sentence completion task in which subjects completed sentences consisting of a determiner, an adjective, an animate noun, and a past tense verb, such as “The old man observed ______.” They used 12 SC bias and 16 DO bias verbs in a cross-modal priming experiment where subjects heard the sentence up to the verb, and then had to name the visually presented prompt of either they, suggesting an SC completion, or them, indicating a DO completion. The naming latency was related to the degree of preference or dispreference of the prompt as a possible continuation.

The papers discussed in this section have illustrated evidence for the role of verb subcategorization frequencies in human sentence processing. It is important to note that all of these papers implicitly assume that a single set of subcategorization probabilities can be defined for each verb. In general, these probabilities in the mental lexicon are assumed to be acquired through exposure to language use.

1.3 The problem with verb subcategorization frequencies

The previous section shows that verb subcategorization frequencies play an important role in human language processing. However, studies such as Merlo (1994), Gibson, Schuetze, & Salomon (1996), and Gibson & Schuetze (1999) have found differences between syntactic and subcategorization frequencies computed from corpora and those computed from psychological experiments. Additionally, Biber and colleagues (Biber, Conrad, & Reppen 1998, Biber 1993, Biber 1988) have found that corpora differ in a wide variety of phenomena, including the use of various syntactic structures. This presents two problems for the psycholinguistic community. On one hand, one must answer the practical question of which verb subcategorization frequencies are the most appropriate ones to use for norming experiments. On the other hand, if processing relies on frequencies, and these frequencies are learned through exposure to language use, which frequencies are actually represented in the lexicon? Norming studies and corpora such as the Brown Corpus have different verb subcategorization frequencies, yet frequencies from both are commonly used to represent language use.

Merlo (1994) compared subcategorization frequency data for a set of verbs taken from psycholinguistic norming studies with corpus subcategorization frequencies for the same verbs. The norming data considered in Merlo (1994) was taken from four separate studies: a sentence production study (Connine et al. 1984) and three sentence completion studies (Garnsey et al. 1997, Holmes, Stowe, & Cupples 1989, and Trueswell et al. 1993). In a sentence production study, subjects are asked to write a sentence using a given verb, while in a sentence completion study, subjects are asked to complete a sentence based on a provided partial sentence, typically a grammatical subject followed by the verb. The corpus data used in her study was taken from the Penn Treebank (Marcus, Santorini, & Marcinkiewicz 1993), and consisted of a combination of Wall Street Journal, MARI radio broadcast transcriptions, and DARPA Air Travel Information System training data where subjects requested flight scheduling information from a reservation system. The comparisons between the corpus data and the Connine data were based on a set of five possible subcategorizations (NP, PP, S, SBAR, SBAR0) and an “other” category, while the comparisons between the corpus data and the other norming studies were based on the categories of NP, SC, and Other.

Table 1 shows the correlation values for the comparisons of the various data sets performed in Merlo (1994). The first column shows the corpora being compared. The second column shows the correlation between the frequencies of the NP subcategorization in each of the two corpora for each of the N verbs available in both corpora. This can be visualized by imagining a graph where each of the verbs is represented by a point in a two-dimensional space where the X axis represents the frequency of the NP subcategorization in one corpus, and the Y axis represents the frequency of the NP subcategorization in the other corpus. If all of the verbs have the same NP frequency in both corpora, then they would all be located on the line Y=X, and the correlation would be 1. For example, there are 36 verbs for which data is available in both the Trueswell and Merlo data sets, and the correlation between the frequency of the NP subcategorization for each of these verbs in the two data sets is .739. The third column shows the correlations between the frequencies of the SC subcategorizations.

Comparison²                NP                  SC³
Trueswell vs. Garnsey⁴     r = .935            r = .916
Trueswell vs. Merlo        r = .739            r = .444
                           F(1,36) = 43.36     F(1,36) = 8.848
                           p < .0001           p = .0052
Holmes vs. Merlo           r = .594            r = .667
                           F(1,21) = 10.883    F(1,21) = 15.990
                           p = .0036           p = .0007
Garnsey vs. Merlo          r = .727            r = .585
                           F(1,48) = 52.723    F(1,48) = 24.503
                           p < .0001           p < .0001
Connine vs. Merlo (<50)    r = .598            r = .751
Connine vs. Merlo (>50)    r = .784            r = .835
                           F(1,24) = 38.326    F(1,24) = 55.258
                           p < .0001           p < .0001

Table 1: Correlation values from comparisons in Merlo (1994).

² Merlo = corpus numbers in (Merlo 1994), Garnsey = Garnsey et al. (1997), Holmes = Holmes et al. (1989), and Trueswell = Trueswell et al. (1993)
³ SC is Merlo’s category “Clause” for all comparisons except the Connine comparison, in which case it is “S” (not “SBAR”)
⁴ Results for this comparison taken from Trueswell et al. (1993), and also appear in Merlo (1994).

Merlo concluded that the norming study data was “not strongly correlated with the corpus counts”, but that it was “appropriate to keep using corpus counts when needed and to continue exploring the possible sources of difference” because corpus probabilities do correlate with experimental evidence in some cases. The important point of this data is that the comparisons between the corpus data and the different norming study data sets have much lower correlations than the comparison between the norming studies done by Trueswell et al. (1993) and Garnsey et al. (1997). This shows that the results produced by two norming studies relying on similar protocols are much more like each other than they are like corpus data.

One must be careful in drawing conclusions about the degree of difference between various corpora based on such data, however. This is because each of the comparisons shown in Table 1 is based on the data for a different set of verbs (note that the values of N range between 21 and 48). One of the conclusions that will be drawn from the data presented in this dissertation is that the degree of subcategorization variability between corpora is related to the choice of which verbs are included in the frequency counts. Thus, if one wants to directly compare the relative degree of subcategorization frequency differences between a series of corpora, one should use the same set of verbs for each case (a difficult objective for both the Merlo paper and this dissertation, in that both rely on fixed data sets from norming studies conducted by other authors).
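The comparisons in Table 1 reduce to computing a Pearson correlation over per-verb frame frequencies. The following is a minimal sketch of that calculation; the verb names and proportions are invented for illustration and are not taken from Merlo (1994).

```python
# Sketch: correlating per-verb NP-frame frequencies across two data sets.
# The verbs and proportions below are invented for illustration only.
from statistics import correlation  # Pearson's r (Python 3.10+)

# Proportion of tokens of each verb that appear with an NP (direct object) frame.
norming_np = {"accept": 0.72, "admit": 0.41, "hear": 0.65, "know": 0.30}
corpus_np = {"accept": 0.58, "admit": 0.25, "hear": 0.70, "know": 0.22}

# Use only the verbs attested in both data sets (this is why N differs across rows).
shared_verbs = sorted(norming_np.keys() & corpus_np.keys())
x = [norming_np[v] for v in shared_verbs]
y = [corpus_np[v] for v in shared_verbs]

print(f"N = {len(shared_verbs)}, r = {correlation(x, y):.3f}")
```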

Other studies have found differences between corpus probabilities and experimental data. Gibson et al. (1996) and Gibson & Schuetze (1999) compare corpus frequencies and experimental data, although they do not address the issue of verb subcategorization probabilities directly. Gibson et al. (1996) found the frequencies of three different structures in the Penn Treebank Brown Corpus and Wall Street Journal Corpus. These structures corresponded to high, middle, and low attachment of an NP with three possible attachment sites. This ambiguity is shown in Figure 1, while examples of these structures are shown in Table 2.

Figure 1: High, middle, and low attachment sites (from Gibson et al. 1996). [Tree diagram: NP1 dominates N1 and a PP; that PP’s object NP2 dominates N2 and a second PP; the second PP’s object is NP3.]

Type of attachment for NP4, with Brown Corpus examples (taken from Gibson & Schuetze 1999):

Low (attached to NP3):
[NP1 strong opposition by [NP2 the coalition of [[NP3 Southern Democrats] and [NP4 conservative Republicans]]]]
[NP1 the running argument about [NP2 the relative merits of [[NP3 Mays] and [NP4 Mickey Mantle]]]]

Middle (attached to NP2):
[NP1 a fine big actor with [[NP2 a great head of [NP3 blond hair]] and [NP4 a good voice]]]
[NP1 correct observance of [[NP2 three hundred major rules of [NP3 ritual]] and [NP4 three thousand minor ones]]]

High (attached to NP1):
[[NP1 a man in [NP2 an occupation of [NP3 high hazard]]] and [NP4 a woman balanced on a knife-edge between death from tuberculosis and recovery]]
[[NP1 the question of [NP2 discrimination in [NP3 housing]]] and [NP4 the part each man present played in it]]

Table 2: High, middle, and low attachment sites from Gibson & Schuetze (1999).

They found that in both corpora, low attachment was most common, followed by middle attachment, followed by high attachment. On the other hand, they found that in both off-line comprehension experiments and reading time experiments, subjects preferred low attachment sites to high, and high to middle, rather than the low-middle-high ordering predicted by the corpus frequencies. Their results are taken to indicate that there are additional factors involved in comprehension (attachment preferences) that are not reflected in production (corpus) data. However, unlike the verb-based studies, the lexical identities of the nouns involved in the attachments were not taken into account. Additionally, attachment preferences for a two-way ambiguity (high and low attachment only) match corpus biases in both English, which has a low attachment preference, and Spanish, which has a high attachment preference (Mitchell, Cuetos, & Corley 1992, Cuetos, Mitchell, & Corley 1996). This suggests that, like the verb subcategorization examples discussed above, noun phrase attachment preferences do correspond to corpus frequencies in general, even though it is not true for the specific case discussed in Gibson et al. (1996) and Gibson & Schuetze (1999). This dissertation will not attempt to find specific causes for the noun phrase attachment differences described here. It is possible that word sense and discourse factors similar to those described in this dissertation may act on noun phrase attachment preferences as well as on verb subcategorization preferences.

Lapata, Keller, & Schulte im Walde (2001) follow up on the Merlo (1994) study by examining verb subcategorization frequencies for two separate ambiguities. These are the SC/DO ambiguity illustrated in examples (11) and (12), where the noun phrase in italics is either the object of the verb or the subject of a complement, and the NP/0 ambiguity illustrated in examples (13) and (14), where the noun phrase in italics is either the object of the verb (and is followed by the subject of the following clause) or the subject of the following clause itself.

(11) The teacher knew the answer to the question.

(12) The teacher knew the answer was false.

(13) As the professor lectured the students the fire-alarm went off.

(14) As the professor lectured the students fell asleep.

Lapata et al. (2001) compared data from the British National Corpus (BNC) (http://info.ox.ac.uk/bnc/index.html) with data from several norming studies. The BNC was chosen since it was felt to be more balanced than the corpus data used in the Merlo (1994) study. The BNC data was either hand labeled (manual), or extracted using one of two automatic algorithms (chunking & parsing). The differences between the different BNC data sets are not relevant to this dissertation, but the results from all three BNC data sets are reported here to provide a range from which to estimate the differences between the BNC data and the other data. The norming studies for the SC/DO ambiguity were Connine et al. (1984), Garnsey et al. (1997), and Trueswell et al. (1993), while for the NP/0 ambiguity, data from the sentence production and sentence completion studies in Pickering, Traxler, & Crocker (2000) and data from the Connine et al. (1984) sentence production studies were compared with the BNC data. Table 3 shows the results for the SC/DO comparisons, while Table 4 shows the results for the NP/0 comparisons. The values in the N columns show the number of verbs for which data was available in the corpus pairing. The r column shows the correlation between the subcategorization frequencies for each of the verbs, and the b column shows the percent of the N verbs that had the same subcategorization preference in both corpora. There were three possible subcategorization preference bins. For the SC/DO comparisons, these were SC-biased, when the SC frequency was at least twice that of the DO frequency, DO-biased when the DO frequency was at least twice that of the SC frequency, and equi-biased otherwise. A similar set of three bins was used for the NP/0 comparison.

                 Garnsey            Trueswell          Connine            BNC manual         BNC chunking
                 N    r    b        N    r    b        N    r    b        N    r    b        N    r    b
Trueswell        49   .91  76%
Connine          15   .87  73%      12   .78  67%
BNC manual       50   .75  62%      24   .69  50%      12   .96  100%
BNC chunking     90   .69  58%      44   .64  52%      21   .54  71%      52   .89  77%
BNC parsing      90   .81  74%      44   .69  61%      24   .74  50%      55   .92  78%      658  .84  93%

Table 3: Correlations (r) and agreement (b) for comparisons for DO and SC subcategorizations from Lapata et al. (2001).

                      Connine            Pickering          Pickering          BNC manual         BNC chunking
                                         production         completion
                      N    r    b        N    r    b        N    r    b        N    r    b        N     r    b
Pickering production  19   .80  68%
Pickering completion  16   .62  38%      69   .70  62%
BNC manual            28   .25  11%      22   .54  68%      16   .55  44%⁵
BNC chunking          12   .11  17%      85   .42  52%      64   .23  39%      67   .26  37%
BNC parsing           64   .61  56%      102  .66  67%      70   .42  40%      66   .54  63%      1862  .43  56%

Table 4: Correlations (r) and agreement (b) for comparisons for NP and 0 subcategorizations from Lapata et al. (2001).

⁵ The values in this cell correct a typographical error in the original paper (M. Lapata, personal communication, September 7, 2001)

The data in Table 3 and Table 4 provides several insights into the differences between verb subcategorization frequencies in different corpora. As is the case with the Merlo data, this data indicates a degree of similarity in verb subcategorization that is much better than would be found by random chance, but suggests that the different corpora do not have identical subcategorization probabilities for each verb. There is a tendency for the comparisons between pairs of norming studies to result in higher correlation and agreement numbers than are found in corpus (BNC) versus norming study comparisons, although the 100% agreement in verb preference assignment between the BNC-manual data and the Connine data in Table 3 is a clear exception to this generalization. The tendency of the norming study versus norming study results to have higher correlations is similar to the tendency found in the Merlo data. However, as in the case of the Merlo data, the different comparisons are based on different sets of verbs due to the fact that the different norming studies use different sets of verbs. This dissertation will argue that different verbs have inherently different degrees of variation between corpora, and that therefore, the different sets of verbs used in each comparison may influence the degree of variation found in each comparison.
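As a concrete illustration of the two-to-one bias bins and the agreement measure b described above, the following is a minimal sketch; the verbs and counts are invented and are not taken from Lapata et al. (2001).

```python
# Sketch of the bias-bin classification and agreement (b) measure described above.
# The verbs and counts are invented for illustration only.

def bias(sc_count, do_count):
    """SC-biased if SC is at least twice DO, DO-biased if the reverse, else equi-biased."""
    if sc_count >= 2 * do_count:
        return "SC"
    if do_count >= 2 * sc_count:
        return "DO"
    return "EQ"

# (SC count, DO count) for each verb in two hypothetical data sources.
source_a = {"admit": (40, 15), "accept": (5, 60), "know": (30, 25)}
source_b = {"admit": (22, 10), "accept": (12, 20), "know": (28, 24)}

shared = source_a.keys() & source_b.keys()
matches = sum(bias(*source_a[v]) == bias(*source_b[v]) for v in shared)
print(f"agreement b = {matches}/{len(shared)} = {matches / len(shared):.0%}")
```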

The results of both the Merlo (1994) and Lapata et al. (2001) studies show that while corpus data and norming study data are not identical, there is a much better than chance resemblance between them. It is difficult to say what degree of variation in subcategorization probabilities is acceptable, since this depends on how the probabilities will be used. All of the comparisons for which p values were provided in Merlo (1994) showed a significant difference. These comparisons had relatively small sample sizes compared to those available from recent large corpora such as the British National Corpus (http://info.ox.ac.uk/bnc/index.html). However, if two corpora are large enough, any type of difference will eventually be significant. This suggests that it is more appropriate to measure the significance of cross corpus differences with respect to whichever task is relevant, rather than by relying on traditional statistical measures of significance. For example, if one is preparing lists of verbs with different subcategorization biases for use in a psycholinguistic experiment, and two different data sources place a verb into a different bias group, then the difference is “significant”, and should be investigated, while any differences which do not cause a verb to be placed into different categories are “not significant”, and can be ignored for that task. Similarly, if the task is using a statistical parser to parse a corpus, then any differences between the training data and the corpus being parsed are “significant” only when they cause the parser accuracy to drop below whatever level of accuracy is acceptable for the task at hand.

Although the Merlo (1994) and Lapata et al. (2001) studies can both be taken to show that there are differences in subcategorization probabilities between different corpora and norming studies, they leave some questions unanswered. Although these studies provide some potential sources for the differences in subcategorization preference between different corpora, it would be useful to know more about why different corpora and norming methods produce different subcategorization probabilities. Any potential causes of differences in probabilities are important because in theories in which probabilities are used in comprehension, these probabilities are considered to be derived from exposure to language. In various experiments, such as those discussed in section 1.2, correlations have been found between sentence processing phenomena and both frequencies from norming studies and frequencies based on corpus data. If corpora, frequently treated as providing a representative view of human language production, yield different frequencies from norming studies, which frequencies actually represent human language production?

Second, Merlo (1994), Gibson et al. (1996), Gibson & Schuetze (1999), and Lapata et al. (2001) describe differences between psychological data and corpus data. However, it is also important to note that the choice of which frequencies to use in psycholinguistic models is not just a choice between corpus data and norming study data. Differences are also found between various corpus sources. Biber and colleagues (Biber et al. 1998, Biber 1993, Biber 1988) have extensively analyzed the differences between various registers of corpora such as fiction and academic writing and have found that many features of corpora differ between registers. The features they discuss range from syntactic, such as the use of passive, that-clauses, and to-clauses, to lexical, such as the distribution of similar-meaning words like big, large, and great, to discourse, such as the relative amount of new and given information. Biber does not, however, specifically address issues relating to verb subcategorization probabilities for individual verbs or the properties of sentence production and completion data, which are of interest in this dissertation. Because corpora also differ in their properties, one must consider not only the differences between norming protocols and corpora, but also the differences between individual corpora.

Third, the Merlo (1994) and Lapata et al. (2001) studies only address a limited set of broad subcategorization possibilities. It is possible that some causes of cross corpus differences might be obscured by these broad subcategorizations, and that a finer-grained set of distinctions might better reveal the sources of difference.

1.4 Verb subcategorizations and computational linguistics

Applications in the field of computational linguistics such as probabilistic parsers also face problems caused by different corpora having different verb subcategorization probabilities. These parsers face significant decreases in performance when they are used in domains other than the one for which they were originally trained. Because of this, they must be retrained for each new domain in which they are used. This section will argue that because verb subcategorization probabilities are an important source of information for these parsers, the differences in verb subcategorization probabilities between corpora play a significant role in the decrease in performance of these parsers.

1.4.1 The importance of verb subcategorization information for statistical parsers

Modern statistical parsers such as Charniak (1997) and Collins (1999) rely on lexicalized probabilistic context free grammars. A probabilistic context free grammar (PCFG) is a context free grammar (CFG) that assigns probabilities to each rule in the grammar. The probabilities used in these rules are generated by a learning algorithm that is trained on a set of pre-parsed training data. Once a set of probabilities is learned, the parser can then be used to parse other data.
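As a rough sketch of how such rule probabilities can be estimated by relative frequency from pre-parsed training data, consider the following; the rule counts are invented for illustration, and this is not the exact training procedure of any particular parser.

```python
# Sketch: maximum-likelihood estimation of PCFG rule probabilities from a treebank.
# Each rule probability is the rule's count divided by the count of its left-hand side.
# The rule counts below are invented for illustration.
from collections import defaultdict

rule_counts = {
    ("S", ("NP", "VP")): 1000,
    ("VP", ("V", "NP")): 400,
    ("VP", ("V", "SBAR")): 150,
    ("VP", ("V",)): 450,
    ("NP", ("DET", "N")): 700,
    ("NP", ("N",)): 300,
}

lhs_totals = defaultdict(int)
for (lhs, _), count in rule_counts.items():
    lhs_totals[lhs] += count

rule_probs = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}

for (lhs, rhs), p in sorted(rule_probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```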

In a lexicalized probabilistic context free grammar (LPCFG), the probabilities for each rule depend on the head word of each phrase. Table 5 shows sample grammar rules, while Figure 2 shows a sample tree structure illustrating how head word information is carried up the tree structure.

CFG             Probabilistic CFG       Lexicalized PCFG
S → NP VP       S → NP VP [.05]         S:pulled → NP:dogs VP:pulled [.05]
NP → DET N      NP → DET N [.02]        NP:sleds → DET:the N:sleds [.02]
NP → N N        NP → N N [.03]          NP:dogs → N:sheep N:dogs [.03]
…               …                       …

Table 5: Sample grammar rules (made-up probabilities).

Figure 2: Sample lexicalized tree taken from Charniak (1995). [Bracketed form: [S:pulled [NP:dogs [N:sheep sheep] [N:dogs dogs]] [VP:pulled [V:pulled pulled] [NP:sleds [DET:the the] [N:sleds sleds]] [PP:over [Prep:over over] [NP:rocks [N:rocks rocks]]]]]]

Completely lexicalized PCFGs contain the subcategorization probabilities for each verb. For example, the subcategorization probabilities for the verb pull are represented in all of the rules in the grammar with VP:pulled as the left hand side of the rule. A sample rule, taken from the sentence shown in Figure 2, is shown in example (15).

(15) VP:pulled -> V:pulled NP:sleds PP:over

The right hand side of this rule represents a possible (narrowly defined) subcategorization of the verb pull.

The probabilities for grammar rules used in parsers are generated by using a learning algorithm to induce the probabilities from a set of training data. However, when rules are as specific as the rule in example (15), they are unlikely to occur in a set of training data frequently enough for the learning algorithm to learn the correct probabilities. Because of this, each probabilistic parser must use some approximation to the lexicalized rules. Collins (1999) model 1 approximates the probabilities of the elements on the right hand side of all rules, not just VP rules, by assuming that the probabilities of each of the elements on the right hand side of the rule are independent. Thus, the probability of the rule in example (15) is not estimated from the probability of a verb phrase headed by the verb pull having a daughter noun phrase headed by sleds co-occurring with a prepositional phrase headed by the preposition over, but is instead estimated by finding the separate probability of a verb phrase headed by pulled having a daughter noun phrase headed by sleds with or without any other arguments multiplied by the probability of a verb phrase headed by pulled having a prepositional phrase headed by the preposition over with or without any other arguments. When this sort of approximation is made, the LPCFG no longer contains subcategorization information, since there is no information about how often different arguments actually co-occur with a given verb.
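The independence approximation can be sketched roughly as follows; the probabilities are invented, and this is a simplification of Collins (1999) model 1, which also conditions on direction, distance, and other features omitted here.

```python
# Rough sketch of the independence approximation described above.
# Instead of estimating the joint probability of the full right-hand side
#   VP:pulled -> V:pulled NP:sleds PP:over
# directly (too sparse), each dependent is generated independently given the head.
# All probabilities are invented; Collins (1999) model 1 also conditions on
# direction, distance, and other features that are omitted here.

p_dependent_given_head = {
    ("NP:sleds", "VP:pulled"): 0.02,
    ("PP:over", "VP:pulled"): 0.05,
}

def rule_probability(head, dependents):
    """Approximate P(dependents | head) as a product of per-dependent probabilities."""
    p = 1.0
    for dep in dependents:
        p *= p_dependent_given_head[(dep, head)]
    return p

print(rule_probability("VP:pulled", ["NP:sleds", "PP:over"]))  # 0.02 * 0.05 = 0.001
```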

This approximation results in better parser performance than if the approximation were not made, because it allows the grammar to provide estimates of the probabilities of right hand side combinations which either did not occur in the training data or occurred with a low frequency. However, this also allows the parser to propose unlikely combinations of arguments in a VP, because in real language, the arguments of a verb are not independent of each other. Collins (1999) model 2 explicitly adds verb subcategorization information back into the parser in order to reduce the problem of unlikely subcategorization combinations. Collins (1999, p. 202) reports that adding subcategorization information (model 1 vs. model 2) improves the accuracy of the parser by 10%, when the model is not using distance constraints that he feels are approximating subcategorization information. This difference in performance illustrates the importance of verb subcategorization information in parser performance.

Carroll, Minnen, & Briscoe (1998) demonstrate the utility of verb subcategorization information in an otherwise non-lexicalized parser. They compared the output from two different versions of a non-lexicalized parser (structural rules only). In one case, the parse with the highest probability was chosen. In the other case, the output of the same parser was re-ranked according to subcategorization frequencies based on 10 million words of data from the British National Corpus. They found that the addition of the verb subcategorization information significantly improved the accuracy of the parser.
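One way to picture this kind of re-ranking is sketched below; the parse scores, subcategorization frequencies, and the simple multiplication used to combine them are invented for illustration and are not the actual scoring scheme of Carroll, Minnen, & Briscoe (1998).

```python
# Sketch: re-ranking candidate parses by the corpus frequency of the verb
# subcategorization frame each parse assigns to the verb. All numbers are
# invented; this is not the actual scoring scheme of Carroll, Minnen, & Briscoe (1998).

# Candidate parses for one sentence: (parser probability, frame assigned to the verb).
candidates = [
    (0.012, "NP"),    # verb analyzed as taking a direct object
    (0.010, "SBAR"),  # verb analyzed as taking a sentential complement
]

# Relative frequency of each frame for this verb, estimated from a large corpus.
subcat_freq = {"NP": 0.25, "SBAR": 0.60, "other": 0.15}

reranked = sorted(candidates, key=lambda c: c[0] * subcat_freq[c[1]], reverse=True)
best_prob, best_frame = reranked[0]
print(f"best parse after re-ranking: frame={best_frame}")
```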

1.4.2 Problem: the need to retrain parsers for new domains

One problem with statistical parsers is that it is usually necessary to retrain the statistical algorithms for each new domain. This is because different domains and genres have different statistical properties. This requires new labeled training data for each new genre. Carroll & Rooth (1998) showed that frequency-based parsers work better with ‘imperfect’ automatically generated frequencies from the same genre than with ‘perfect’ frequencies from a different genre. This indicates that the difference in subcategorization probabilities between genres is larger than the amount of error in the automatic parsers which are trained on the same sort of data as that on which they are tested.

Gildea (2001) investigated the effect of choice of corpus training data on the performance of a parser. He used the Collins (1999) model 1 parser with the distance features mentioned above as approximating verb subcategorization information. When the parser was trained on Wall Street Journal training data and run on Brown Corpus data, the performance was significantly lower than when run on WSJ data. When the parser was trained on Brown Corpus data, the performance on Brown Corpus data was significantly higher than when trained on WSJ data. When the parser was trained on a combination of both training sets, the performance on each of the corpora was not much higher than when it was only trained on the training data for that corpus. This indicates both that the training data is corpus specific, and that extra training data from other corpora is of little help to the parser. These results are shown in Table 6.

Training data    Test set    Recall    Precision
WSJ              WSJ         86.1      86.6
WSJ              Brown       80.3      81.0
Brown            Brown       83.6      84.6
WSJ + Brown      Brown       83.9      84.8
WSJ + Brown      WSJ         86.3      86.9

Table 6: Results from Gildea (2001).

Gildea (2001) also shows that lexical bigram statistics used by the parser are corpus specific and “are of no use when attempting to generalize to new training data”. These statistics capture the probabilities of parent/child relationships, such as the probability of a VP headed by pulled having a PP headed by over as a child. This result indicates that the relationships between the head words in the LPCFG rules are corpus specific.

Because the lexical bigram probabilities for a particular VP parent are a reflection of the subcategorization probabilities for that verb, these results suggest that differences in verb subcategorization frequencies between corpora may play an important role in the degradation of parser performance when the parsers are used in new domains. If the factors that are causing different corpora to have different verb subcategorization probabilities can be identified, then two sets of problems can be solved. On one hand, it would solve the problem of which probabilities to use in norming psychological experiments and the problem of what frequencies are represented in the mental lexicon. On the other hand, the need to retrain computational applications such as probabilistic parsers for each new domain could also be reduced.

1.5 Solving the verb subcategorization frequency problem

The previous sections have outlined a series of problems affecting psycholinguistics and computational linguistics, all of which are caused by different corpora and data sources having different verb subcategorization probabilities. This dissertation will provide evidence that a significant source of these probability differences, and therefore a significant cause of these problems, is the fact that different senses of verbs have different subcategorization probabilities. Both probabilistic parsers, such as Charniak (1997) and Collins (1999), and psycholinguistic research, such as Trueswell et al. (1993), Garnsey et al. (1997), and Merlo (1994) base their verb subcategorization probabilities on the verb word form or lexeme (the combination of all forms of a verb) rather than on the individual senses of verbs. Thus, the subcategorization probability for the verb admit traditionally includes all of the uses of admit, whether admit means allow to enter or confess or any other possible sense. However, as Table 7 shows, these different senses do not always have the same subcategorization properties. WordNet sense 1 for the verb admit takes a direct object or a sentential complement, while the other senses take direct objects or prepositional phrases.

Sense 1: admit, acknowledge -- (declare or acknowledge to be true; "He admitted his errors"; "She acknowledged that she might have forgotten")
    - Somebody ----s something
    - Somebody ----s that CLAUSE

Sense 2: admit, allow in, let in -- (allow to enter; grant entry to; "We cannot admit non-members into our club")
    - Somebody ----s somebody
    - Somebody ----s somebody PP

Sense 3: admit, let in, include -- (allow participation in or the right to be part of; permit to exercise the rights, functions, and responsibilities of; "admit someone to the profession"; "She was admitted to the New Jersey Bar")
    - Somebody ----s somebody
    - Somebody ----s somebody PP

Sense 4: accept, admit, take, take on -- (admit into a group or community; "accept students for graduate study"; "We'll have to vote on whether or not to admit a new member")
    - Sam is admitting Sue

Sense 5: admit, allow -- (afford possibility: "This problem admits of no solution"; "This short story allows of several different interpretations")
    - Something is ----ing PP

Sense 6: admit -- (give access or entrance to; "The French doors admit onto the yard")
    - Something is ----ing PP

Sense 7: accommodate, hold, admit -- (have room for; hold without crowding; "This hotel can accommodate 250 guests"; "The theater admits 300 people"; "The auditorium can't hold more than 500 people")
    - Something ----s somebody
    - Something ----s something

Sense 8: admit -- (serve as a means of entrance; "This ticket will admit one adult to the show")
    - Something ----s somebody
    - Something ----s something

Table 7: Senses and possible subcategorizations of admit from WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller 1993).

The problems arise when different corpora or data sources have different distributions of verb senses. When these senses have different subcategorization probabilities, different subcategorization probabilities are found for the verb lexeme in each corpus. For example, the probability of the lexeme admit taking a sentential complement is related to the relative frequency of WordNet sense 1 in the corpus. If sense 1 does not occur, then there will be no instances of the lexeme taking a sentential complement. If sense 6 is common, then the lexeme will have a high percentage of prepositional phrase uses. This dissertation will demonstrate that the problems discussed above can be solved or reduced by basing verb subcategorization probabilities on the verb senses rather than on the verb lexeme. Psycholinguistic models should represent verb senses in the lexicon, norming studies should control for verb sense, and parsers should include verb sense or semantic context information in their language models.

1.5.1 Evidence for the relationship between verb semantics and subcategorization from linguistics

The relationship between verb sense and verb subcategorization is not a new idea. Linguists have long suggested that the lemma or sense of a word is the locus of subcategorization; for example, Green (1974) showed that different senses of the verb run have different subcategorizations. She describes the following two senses of run: "When run occurs as an intransitive verb, with or without a prepositional phrase, it refers to a sort of locomotion." This sense is illustrated in examples (16) and (17), from Green (1974).

(16) John ran fast.

(17) John ran into the room.

"When run occurs before for and a noun denoting an elective office, it refers to a certain sanctioned, goal-oriented activity which doesn't particularly involve any locomotion." This sense is illustrated in example (18), from Green (1974).

(18) Lenore ran for senator.

She argues that these are two different senses because other locomotion verbs can be substituted in examples (16) and (17), but not in examples such as (18).

Since Gruber (1965) and Fillmore (1968), linguists have been trying to show that the syntactic subcategorization of a verb is related to the semantics of its arguments. Thus one might expect a verb meaning enter to have a different set of syntactic properties than a verb meaning confess. Similarly, if two senses of a single verb mean enter and confess, these two senses should have different syntactic properties. For example, the verb admit has (at least) two possible senses, confess and allow to enter. The confess sense of admit, involving communication, allows for a sentential complement expressing propositional content, shown in example (19).

(19) This sentiment was confirmed by Saints' boss Ian Branfoot who admitted [that the attacking Town tactics had taken him by surprise]SC. (BNC)

Alternatively, the allow to enter sense, involving motion, allows for a directional prepositional phrase, shown in (20).

(20) The windows were opened slightly to admit fresh air [into the room]Directional PP. (BNC)

It is also possible for different senses to share a possible subcategorization. Both the confess (21) and the allow to enter (22) senses can appear with a direct object.

(21) It's important to stress that men as well as women can be victims of harassment and sometimes it's harder for them to admit [it]DO. (BNC)

(22) The door opened to admit [Rosa]DO, wearing her customary black. (BNC)

The notion of a semantic base for subcategorization probabilities is consistent with work such as Argaman, Pearlmutter, & Garnsey (1998) and Argaman & Pearlmutter (in press), which shows that verbs and their nominalizations have similar subcategorization preferences.

Previous papers have also proposed links between fine-grained sense distinctions and the syntactic manifestations of these senses. Pinker (1989) proposes that the syntactic structure in which a verb appears is uniquely determined by the semantics of the event being described. In this theory, there are separate semantic representations of the verb for each of the possible syntactic patterns of the verb. The following pair of examples, discussed at length in Goldberg (1995) and Pinker (1989), would probably be classified as belonging to the same gross sense of the verb spray. However, example (23) is generally taken to imply that the wall is completely covered with paint, while in example (24), only a small portion of the wall may actually end up with paint on it. Thus, one might propose that these are actually examples of two different senses of spray, one meaning completely cover, and the other meaning splash.

(23) Bob sprayed the wall with paint.

(24) Bob sprayed paint onto the wall.

Figure 3 shows the relationship between the separate semantic structures proposed by Pinker for ‘spray’ in ‘Bob sprayed the wall with paint’ (bottom) and ‘paint’ in ‘Bob sprayed paint onto the wall’ (top).


Figure 3: Semantic structures for two different syntactic patterns of ‘spray’ (Pinker 1989, page 228).

Although Pinker claims that each subcategorization of a verb has a unique semantic structure, he also claims that many of the differences are so subtle as to require an innate set of semantic representations in order to be able to learn the distinctions.

Lexicographers traditionally provide possible usages for each sense of a verb. An example of this is WordNet (Miller et al. 1993), which provides detailed lists of possible subcategorizations for each sense of a word, but does not provide frequency information for these subcategorizations. The WordNet subcategorization entries for the eight WordNet senses of the verb admit are shown above in Table 7.

Although previous linguistic research has demonstrated that different senses of a verb can have different subcategorization possibilities, it is not clear from this work that verb sense-based subcategorization differences are the actual cause of cross corpus differences in subcategorization frequencies. If nearly all verbs in the corpora have only one sense, or nearly all of the senses have the same distribution of subcategorizations, then different distributions of verb sense cannot provide an adequate explanation of cross corpus subcategorization differences.

1.5.2 Evidence for the relationship between verb semantics and subcategorization from computational linguistics

Given the relationship between verb sense and verb subcategorization described above, it should be possible to use verb subcategorization information to help disambiguate the sense of the verb or verb sense information to predict the subcategorization of the verb. Recently, Bikel (2000) developed an algorithm relying on syntactic and semantic data to simultaneously parse and sense tag text. The goal was to allow sense information to help parsing, and syntactic information to help sense tagging. The success of this combination is unclear. The output of the combined sense/parsing algorithm was no better than an equivalent parsing algorithm with no access to sense information, and the word sense disambiguation results are not directly comparable to most other WSD results due to differences in the nature of the task, so one cannot directly assess whether the WSD results are better than other results which do not take structural information into account.

Other WSD algorithms have also made use of structural information. Lin (1997) uses explicit lexical head / syntactic dependency relationship information to perform word sense disambiguation, while Yarowsky (2000) uses a combination of co-occurrence and dependency information. In these cases, information such as knowing that the word facility is the subject of the verb employ in example (25) is used to determine that facility is an example of WordNet sense 1 (installation) rather than WordNet sense 5 (bathroom).

(25) The new facility will employ 500 of the 600 existing employees. (Lin 1997)

It is not clear to what extent the use of such lexical head information in WSD is actually the inverse of using verb sense information to predict subcategorization. The WSD tasks above rely on the relationship between a particular verb sense and the individual lexical items that fill various slots such as subject or direct object, while the psychological predictions of subcategorization discussed in this dissertation involve the relationship between either individual verbs or verb senses and the existence of various subcategorization possibilities. In other words, the information used by Lin (1997) and Yarowsky (2000) would be useful for predicting what the direct object is, while the task in the psychological experiments is whether there is a direct object at all. Still, this evidence suggests that there is a relationship between verb sense and verb subcategorization.

1.5.3 Evidence for the relationship between verb semantics and subcategorization from psycholinguistics

Since different senses of a verb have different subcategorization probabilities, the semantic context leading up to a verb should affect the likelihood of different senses, and thus the likelihood of different subcategorizations. If this is the case, then (human) parsing decisions should also be affected by the semantic context. This prediction is confirmed by recent work by Hare et al. (2001), who have shown that the semantic context preceding a verb influences the performance of human subjects in both sentence completion and reading time tasks. They selected 20 verbs, each with multiple possible subcategorizations. They prepared a pair of biasing contexts for each verb such that one context biased the verb towards a sense with a direct object (DO) subcategorization, and the other towards a sense with a sentential complement (SC) subcategorization. For example, (26) shows a context that biases the verb admit towards a DO interpretation, while (27) shows a context which biases the verb admit towards an SC interpretation. This work is discussed in detail in chapter 3.

(26) The two freshmen on the waiting list refused to leave the professor’s office until he let them into his class. Finally, he admitted [the students.]

(27) For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. Finally, he admitted [that the students had little chance of succeeding.]

1.6 Outline of Chapters

The previous sections have demonstrated that the differences in verb subcategorization frequencies are an important problem for both psycholinguistics and computational linguistics. Section 1.2 argued that verb subcategorization probabilities play an important role in psycholinguistic theories of processing and that such probabilities are part of the representation of each verb in the mental lexicon. Section 1.2 also demonstrated the need to estimate verb subcategorization probabilities for the purpose of norming the studies that test theories of processing. Section 1.3 showed that the methods used to estimate such probabilities are problematic. Different sets of probabilities are generated in the production of single sentences in either sentence production or sentence completion tasks than are found in connected discourse production such as the Brown corpus and the Wall Street Journal corpus. This results in a theoretical question of which frequencies are represented in the lexicon, and a practical question of which frequencies to use for norming purposes. Section 1.4 presented a related problem in computational linguistics. Probabilistic parsers, which rely on verb subcategorization frequencies, face significant decreases in performance when they are used in new domains. It was argued that cross corpus differences in verb subcategorization frequencies play an important role in this need to retrain parsers for each new domain. Evidence of a relationship between verb subcategorization and verb sense was discussed. Section 1.5.1 described some of the linguistic evidence that suggests that one should expect different senses of a verb to have different subcategorization probabilities. Section 1.5.2 described computational research that has taken advantage of verb subcategorization information to predict verb semantics. Section 1.5.3 described research that provides evidence that humans make use of this relationship when generating verb subcategorization expectations during comprehension. Section 1.5 hypothesized that the cross corpus differences in verb subcategorization probabilities are caused by the failure to take the distribution of individual verb senses into account.

The claim that individual verb senses, not the verb lexeme, are the appropriate locus of verb subcategorization probabilities has several implications:

1) Models of lexical representation need to take verb sense differences into account when modeling subcategorization probabilities. This leads to a prediction that verb sense will affect any phenomena that are affected by verb subcategorization probabilities. Hare et al. (2001) provides evidence of semantic biases towards individual senses of verbs affecting parsing decisions in comprehension.

2) Psychological verb subcategorization norming procedures need to control for any factors which affect verb sense, since these also affect verb subcategorization probabilities.

3) Computational applications such as parsers which take verb sense into account will be more domain independent, and less likely to need retraining for each new domain.

The remainder of this dissertation will be concerned with demonstrating that verb sense differences do cause subcategorization frequency variation, and also with developing and testing a model of how the semantic context preceding a verb influences verb subcategorization predictions.

1.6.1 Chapter 2

Chapter 2 will demonstrate that different senses of verbs and their corresponding differences in subcategorization, as well as inherent differences between the production of sentences in psychological norming protocols and language use in context, are important causes of the subcategorization frequency differences found between corpora. Chapter 2 presents evidence based on two different data sets.

The first investigation in this chapter compares subcategorization probabilities for the 127 verbs in Connine et al. (1984) and the 48 verbs in Garnsey et al. (1997) with subcategorization probabilities for the same verbs, taken from the Penn Treebank parsed versions (Marcus et al. 1993) of the Wall Street Journal Corpus, Brown Corpus (Francis & Kucera 1982), and Switchboard corpus (Godfrey, Holliman, & McDaniel 1992). These three all consist of connected discourse and are available from the Linguistic Data Consortium (http://www.ldc.upenn.edu). Two major causes of verb subcategorization frequency differences are discussed. First, the isolated production tasks used in the norming studies result in discourse-based differences such as decreases in the use of passive and zero anaphora and an increase in the use of “default” referents. Second, semantic biases caused by the sentence production methodologies and topics (such as business) discussed in the corpora result in the different data sources having different distributions of verb senses, which in turn results in different distributions of subcategorizations.

The second investigation in this chapter compares subcategorization probabilities for 64 verbs collected from three corpora of written connected discourse: the Wall Street Journal and Brown corpora used above, and the British National Corpus (http://info.ox.ac.uk/bnc/index.html). This data demonstrates that when discourse type (all three corpora are written connected discourse) and verb sense (the data was hand labeled for verb sense, and only the data for the most common sense was used) are controlled for, cross corpus variation in subcategorization is reduced. The main senses of all but nine verbs have the same transitivity bias in each of the corpora. The remaining differences in transitivity bias are the result of finer-grained sense distinctions than those that were controlled for in the labeling process.

This leads to three conclusions: 1) verb subcategorization probabilities should be based on individual senses of verbs rather than the whole verb lexeme, 2) “test tube” sentences are not the same as “wild” sentences, and thus the influences of experimental design on verb subcategorization probabilities should be given careful consideration, and 3) when sense and discourse type differences are controlled for, cross corpus verb subcategorization differences are reduced.

For computational linguists, the results of this chapter suggest that the need to retrain computational applications for each new corpus would be significantly reduced if verb sense were taken into account. For psycholinguists, the results of this chapter suggest that verb sense and genre/discourse type differences need to be taken into account when designing experiments and norming materials, and that corpus frequencies and psycholinguistic phenomena are more likely to be correlated if verb sense is taken into account.

1.6.2 Chapter 3

Chapter 3 will provide a computational model, based on Latent Semantic Analysis, of the influence of verb sense on verb subcategorization. This model will be used to predict verb subcategorization based on the semantic context preceding the verb in corpus data. The first experiment will show that this model makes the same subcategorization predictions as humans given the same contextual information by modeling the data from Hare et al. (2001). The human subjects predicted an SC completion given an SC bias context, and a DO completion given either a DO bias context or a neutral context. The subjects also preferred DO completions in the absence of a specific bias context, because the verbs in the study are more frequently used with DO completions. In order to model this data, the model in this chapter must predict SC completions given SC bias contexts. When given a DO bias context, the model must either predict a DO completion or make a neutral prediction, since the model can then fall back on the default DO preference of the verbs given a neutral context.
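The general idea behind this kind of prediction can be sketched as follows. This is only an illustrative approximation, not the implementation described in chapter 3: the training contexts below are invented, and scikit-learn's TF-IDF vectors reduced with truncated SVD are used as a stand-in for the LSA space actually built from corpus data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented stand-ins for corpus contexts preceding "admitted", labeled by the
# subcategorization that actually followed the verb in the corpus example.
do_contexts = ["the guard let the students into the class and admitted",
               "the club voted to let the new member enter and admitted"]
sc_contexts = ["the suspect denied everything but finally confessed and admitted",
               "the coach conceded defeat after the loss and admitted"]

docs = do_contexts + sc_contexts
vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2)            # the real model would use far more dimensions
space = svd.fit_transform(vectorizer.fit_transform(docs))

def predict_frame(context):
    """Project a new preceding context into the reduced space and compare it
    to the DO-labeled and SC-labeled training contexts."""
    v = svd.transform(vectorizer.transform([context]))
    do_sim = cosine_similarity(v, space[:len(do_contexts)]).mean()
    sc_sim = cosine_similarity(v, space[len(do_contexts):]).mean()
    return "DO" if do_sim > sc_sim else "SC"

# A context sharing vocabulary with the confess-like examples is expected to come out SC-like.
print(predict_frame("the trail guide finally conceded the problem and admitted"))
```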

The model does predict the SC subcategorization when given the SC bias contexts from Hare et al. (2001), but finds that the DO bias contexts from Hare et al. (2001) are equally similar or dissimilar to both corpus DO and SC examples. This indicates that the DO bias contexts used in Hare et al. (2001) are either not strongly biased towards DO contexts, or are biased towards DO uses that do not occur in the corpus samples used by the model. However, the model can still be considered to be successful, since the model can fall back on using the inherent DO bias of the verbs.

The second experiment will show that the model is able to correctly predict both DO and SC subcategorizations when naturally occurring bias contexts are used, instead of the bias contexts provided in Hare et al. (2001). This shows that the difficulties the model faced in predicting the subcategorizations of the Hare et al. (2001) DO bias contexts are a result of the nature of the bias contexts and not an insensitivity of the model to the relevant semantics.

The third experiment will show that the model is capable of predicting the subcategorization of low frequency senses of verbs, and that the model can distinguish between different subcategorizations of the same sense of a verb. This suggests that the performance of the model on the Hare et al. (2001) DO bias contexts is due to the contexts being of a neutral nature rather than the contexts being biased towards low frequency uses of the verbs.

For computational linguists, the model in this chapter provides a source of subcategorization information that could supplement the information used in statistical parsers. For psycholinguists, this chapter provides a potential model for using semantic context and verb sense to predict verb subcategorizations in human sentence processing. For linguists, this chapter provides a method for attempting to induce the relationships between verb semantics and verb subcategorization patterns discussed in Pinker (1989) and Goldberg (1995).

1.6.3 Chapter 4

Chapter 4 will summarize the contributions of this work and discuss directions for future research.

2 Subcategorization probability differences in corpora and experiments

Chapter 1 presented a series of problems in psycholinguistics and computational linguistics that are caused by different corpora and norming studies having different verb subcategorization probabilities. These problems include the issues of which probabilities are most appropriate for use in norming experiments, which probabilities are represented in the mental lexicon, and the problem of needing to retrain parsers for each new domain. Chapter 1 argued that verb sense-based subcategorization differences were an important source of subcategorization probability variation. This chapter will provide evidence that verb sense differences, along with discourse factors related to the isolated nature of sentence production in the norming studies, are major causes of verb subcategorization frequency differences between data sources. This chapter will also show that as these factors are controlled for, cross corpus variation in subcategorization frequency is reduced.

2.1 Combining sense-based verb subcategorization probabilities with other factors to yield observed subcategorization frequencies

The results in this chapter are explained in terms of a model, shown in Figure 4, in which core subcategorization probabilities for individual senses of a verb are combined with other probabilistic factors to yield the observed subcategorization probabilities. This model is presented as a way to show the relationship between the different factors that influence subcategorization probabilities. It is also intended as a statement that when subcategorization frequency differences are found, one should look for sense and discourse factors as possible causes of the differences.

The usual method for testing such a model would be to perform a regression analysis or to use a training set / test set protocol to determine the appropriate weights for each of the factors. Unfortunately, these methods would require the creation of a large corpus labeled for verb sense and all relevant discourse features. Because this is presently not feasible, the influences of these factors will have to be demonstrated through the analysis of small sets of hand labeled corpus data. Thus, it is impossible to provide an accurate measure of the size of each effect or a measure of the completeness of the list of factors discussed, only an indication of the existence of each factor.


[Figure 4 diagram: for each corpus, the core subcategorization probabilities of each individual sense of a verb are adjusted first for the relative frequency of that sense in the corpus, and then for the effects of discourse factors and other properties of the corpus, yielding the final subcategorization probabilities observed for the verb in that corpus.]

Figure 4: Model showing why different corpora have different subcategorization probabilities for the same verb.
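The first adjustment step in Figure 4 — mixing per-sense subcategorization probabilities according to how often each sense occurs in a corpus — can be written as a simple weighted sum. The sketch below uses invented numbers purely for illustration; the dissertation does not estimate these particular values, and the further discourse adjustments in Figure 4 are omitted.

```python
def observed_subcat_probs(sense_subcat_probs, sense_freqs_in_corpus):
    """Mix core per-sense subcategorization probabilities by the relative
    frequency of each sense in a given corpus (first adjustment in Figure 4)."""
    mixed = {}
    for sense, p_sense in sense_freqs_in_corpus.items():
        for frame, p_frame in sense_subcat_probs[sense].items():
            mixed[frame] = mixed.get(frame, 0.0) + p_sense * p_frame
    return mixed

# Invented per-sense probabilities for two senses of "admit":
core = {
    "confess":        {"NP": 0.4, "Sfin": 0.6},   # DO or sentential complement
    "allow to enter": {"NP": 0.5, "NP PP": 0.5},  # DO or DO plus directional PP
}

# A corpus in which the confess sense dominates makes the lexeme look SC-taking.
business_like = {"confess": 0.9, "allow to enter": 0.1}
print({frame: round(p, 2) for frame, p in observed_subcat_probs(core, business_like).items()})
# {'NP': 0.41, 'Sfin': 0.54, 'NP PP': 0.05}
```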

The verb subcategorization frequencies observed in norming studies and corpus data are the result of language production. Thus, any model explaining the differences between different corpora and norming studies must be a model of how these uses were produced. Because frequent correlations are found between production data, such as corpus frequencies and norming study frequencies, and processing phenomena, such as reading times and garden path reactions, it is anticipated that a similar process takes place during comprehension, where underlying subcategorization expectations based on individual verb senses are modified depending on various factors in the context leading up to the verb. This is not to say that complete agreement between production and comprehension probabilities is expected, since factors such as memory limitations affect comprehension differently than they do production.

2.1.1 Verb senses

The primary claim of this model is that subcategorization probabilities are based on individual senses of each verb rather than on the verb lexeme. The model requires some method for representing the relationship between verb senses and verb subcategorizations. Two different representation systems are used in this dissertation. In this chapter, the uses of each verb are divided into broadly defined senses, and all uses of each sense are treated as having the same subcategorization probabilities. Alternatively, in chapter 3, the experiments rely on a non-discrete model of verb sense where there are no pre-defined senses. The former method is more suited for hand labeling and lexicographic tasks while the latter is more suited for automatic induction tasks. These two systems of representing verb senses make the same predictions under many circumstances. However, the non-discreteness of the latter model is more compatible with the apparent gradations of senses in natural language use. It is the intention of this dissertation to demonstrate the importance of distinguishing between different senses of a verb rather than to verify or falsify any particular model of verb sense.

2.1.2 Probabilistic factors

In the model proposed in this chapter, the observed subcategorization probabilities are the result of various factors combining with subcategorization probabilities for an individual sense of the verb. Many such potential factors have already been described in the literature. Biber (1988), Biber (1993), and Biber et al. (1998) provide an extensive analysis of how the register of the corpus affects factors such as the likelihood of various syntactic structures, collocations, and individual lexical items. The intuition here is that if you know you are reading (for example) a scientific report (or dealing with a corpus of scientific reports), the probability of encountering a passive sentence is much higher than in a fiction document or in conversation, independent of which verb or verb sense is actually used in the sentence.

Another factor that can influence parsing decisions is the specific discourse context preceding a verb. Altmann, Nice, Garnham, & Henstra (1998) show that sentences such as (28) are parsed differently (by humans) depending on the preceding context. When the context specifies that both dogs have been washed, as in (29), subjects expect wash to be followed by a time phrase indicating at which point in time the dog was washed. Sentence (28) seems anomalous in this case. Alternatively, when only one dog was washed, as in (30), subjects are not expecting a time phrase to modify wash, and thus (28) seems normal. In these examples, the sentence completion6 expectations of wash change because of differences in contextual factors rather than because of differences in sense.

(28) He’ll brush the dog he washed tomorrow to make its fur shine again.

(29) # Tom’s got two young dogs and they like playing in the fields. Tom washed one of the dogs yesterday but the other one last week. He’ll brush the dog he washed tomorrow to make its fur shine again.

(30) Tom’s got two young dogs and they like playing in the fields. Tom washed one of the dogs but did not want to bother with the other dog. He’ll brush the dog he washed tomorrow to make its fur shine again.

This dissertation does not attempt to characterize all of the factors that influence subcategorization probabilities, but does provide several new factors that are particularly relevant to the design of psycholinguistic experiments to add to the set of previously proposed factors.

6 Typically, verbs are not considered to subcategorize for time phrases, although the phenomena here do involve the issue of whether tomorrow modifies the verb brush or wash.

2.2 Experiment – comparing norming studies and corpus data

Perhaps the ideal way to demonstrate that verb sense differences were causing subcategorization differences between corpora would be to take a large sense and subcategorization tagged corpus, and perform a regression to find out how much of the differences were accounted for by verb sense. Unfortunately, there is no suitable ready made set of data for such a purpose. Semcor, an extension of WordNet (Miller et al. 1993, Miller 1995), in which a portion of the Brown Corpus is tagged for sense, does provide some information on the relative frequencies of senses. When this data is aligned with the Penn Treebank version of the Brown corpus, a limited amount of sense/subcategorization frequency information can be derived. However, this data suffers from a severe data sparseness problem, with most senses having no more than a handful of examples in the Brown data. Work is underway to provide more precise subcategorization (including semantic frame) and sense data for a set of words in the FrameNet project (Lowe, Baker, & Fillmore 1997, Baker, Fillmore, & Lowe 1998). This type of project will make estimations of the size of the effect of verb sense on verb subcategorization more feasible. In the meantime, it is possible to label limited samples of data to show examples of verb sense and discourse type effects on verb subcategorization.

The purpose of the study in this section is to examine the causes of subcategorization frequency differences found in five corpora. These corpora consist of two psychological norming studies, two corpora of written text, and one corpus of spoken data. These corpora were chosen to represent a wide variety of genres and discourse types. The first section of results will discuss the contributions made by factors related to the differences between the single sentence norming studies and the connected discourse corpora. The second section of results will discuss the contribution of verb sense to the differences in subcategorization frequencies. The third section of results will show that when discourse and sense differences between corpora are reduced, verb subcategorization probabilities become more similar.

2.2.1 Methodology

Five different sources of subcategorization information are compared. Two of these are corpora derived from psychological experiments in which subjects are asked to produce single isolated sentences. These are both widely cited studies, Connine et al. (1984) (CFJCF) and Garnsey et al. (1997) (Garnsey). The three non-experimental corpora used are all on-line corpora that have been tagged and parsed as part of the Penn Treebank project (Marcus et al. 1993): the Brown corpus (BC), the Wall Street Journal corpus (WSJ), and the Switchboard corpus (SWBD). These three all consist of connected discourse and are available from the Linguistic Data Consortium (http://www.ldc.upenn.edu). Both the 127 verbs used in the Connine et al. study and the 48 verbs published from the Garnsey et al. study were investigated. The Connine et al. and Garnsey et al. data sets have nine verbs in common. Table 8 shows the number of tokens of the relevant verbs that were available in each corpus. It also shows whether the sample size for each verb was fixed or frequency dependent. Verb frequency was controlled for in all cross-corpus comparisons.

Corpus    Tokens (verb set)           Examples per verb
CFJCF     5,400 (127 CFJCF verbs)     n ≅ either 29, 39, or 68
Garnsey   5,200 (48 Garnsey verbs)    n ≅ 108
BC        21,000 (127 CFJCF verbs)    0 ≤ n ≤ 2,644
          6,600 (48 Garnsey verbs)
WSJ       25,000 (127 CFJCF verbs)    0 ≤ n ≤ 11,411
          5,700 (48 Garnsey verbs)
SWBD      10,000 (127 CFJCF verbs)    0 ≤ n ≤ 3,169
          4,400 (48 Garnsey verbs)

Table 8: Approximate size of each corpus.

2.2.1.1 Connine et al. (1984) sentence production study

The first set of data used was from a sentence production study done by Connine et al. (1984). Data from the published subcategorization frequencies and the sentences from the original subject response sheets, provided by Charles Clifton, was used. The Connine et al. (1984) study used two slightly different sentence production protocols. In both protocols, subjects were given a list of words and asked to write sentences using them, based on a given topic or setting. Subjects frequently used these topic or setting words in their sentence completions. In the first study, each word always appeared with the same topic (for example the verb beg always appeared with the topic animals/nature, while the verb teach always appeared with the topic sports/games); the words included verbs, nouns, and prepositions; only the verbs were used in the statistics. Table 9 shows samples of the prompts and subject responses from protocol 1.

Prompt                 Actual response
teach (sports/games)   My friend taught me to play racquetball.
fly (work/workers)     The company paid the employees to fly to the seminar.
carry (books/reading)  I remember when Joe carried my books home from school.

Table 9: Connine et al. (1984) protocol 1 sample prompts and subject responses.

In the second study, only verbs were used; each appeared with one of three possible settings, home, school, or downtown, randomly distributed across subjects. Table 10 shows samples of the prompts given to the subjects in protocol 2 and sample responses resulting from the prompts. Note that the provision of topic/subject words does influence the subject responses – 45% of the responses in the second protocol contained the words home, school, or downtown. This tendency was less noticeable in the first protocol, where two topic/subject words were provided.

Prompt              Actual response
forget (home)       Don’t forget your chair at home.
charge (downtown)   “Charge it” is my favorite phrase downtown.
answer (school)     In school, Tom was always the first to answer.

Table 10: Connine et al. (1984) protocol 2 sample prompts and subject responses.

91 verbs were used with the first protocol and 66 verbs in the second. 30 verbs were used in both protocols, so the total number of verbs in at least one protocol was 127. Table 11 shows the 127 verbs from the Connine et al. (1984) study.

advise, agree, allow, answer, approve, ask, attack, attempt, beg, believe, block, buy, call, carry, charge, chase, cheat, check, choose, clean, coach, coax, comfort, continue, copy, criticize, debate, decide, describe, disappear, discuss, dispute, drive, encourage, escape, expect, fight, fly, follow, forget, gore, govern, guard, guess, happen, hear, help, hesitate, hire, hurry, imitate, include, investigate, invite, judge, jump, keep, kick, kill, know, leave, lecture, load, lose, motion, move, notice, object, order, paint, pass, pay, perform, permit, persuade, phone, play, point, position, praise, promise, prompt, protest, pull, push, race, read, realize, refuse, remember, review, revolt, rule, rush, save, say, see, seem, signal, sing, stand, start, stay, stop, store, strike, struggle, study, surrender, swear, swim, talk, teach, tell, think, tire, try, understand, unload, urge, visit, wait, walk, want, watch, worship, write Table 11: 127 verbs used from Connine et al. (1984)

2.2.1.2 Garnsey et al. (1997) sentence completion study

The second set of psychological norming data was taken from a sentence completion study performed by Garnsey et al. (1997). In the sentence completion methodology, subjects are given a sentence fragment and asked to complete it. Garnsey et al. (1997) used this methodology to gather subcategorization frequencies by giving subjects a person’s name followed by the preterite form of the target verb and asking them to complete the sentence. Section 2.2.2.2 will discuss how the provision of a proper noun as the subject influences sentence production. They tested 100 verbs; for each, subjects saw a fragment consisting of a proper name followed by the verb in the preterite form, as shown in Table 12.

Debbie remembered ______.

Table 12: Sentence Completion protocol used to collect subcategorization frequencies by Garnsey et al. (1997).

The sentence completions were then coded by hand into three broad subcategorization categories: Sentential Complement, shown in (31), Direct Object, shown in (32), and Other, shown in (33).

(31) Sentential Complement: George admitted [that he enjoyed taking that psychology experiment]SC.

(32) Direct Object: George admitted [his guilt]NP.

(33) Other: Dennis remarked [about the food]PP.

Garnsey et al. assigned verbs to an SC preference group if they were used at least twice as often with an SC as with a DO. Likewise, verbs were assigned to a DO preference group if the DO use was at least twice as common as the SC use. A third group of verbs, EQ-bias, was created for verbs with less than a 15% difference between the number of SC and DO uses. Each group consisted of 16 verbs for a total of 48 used in their experiments. The remaining verbs were not used. Table 13 shows three sample verbs, one in each class, together with their subcategorization frame preferences:

Verb Class   Verb       Direct Object   Sentential Complement   Other
DO-bias      accepted   98%             1%                      1%
EQ-bias      felt       12%             11%                     77%
SC-bias      admitted   9%              60%                     30%

Table 13: Example subcategorization-frame probabilities for each of the three subcategorization frame classes (DO-bias, SC-bias, and EQ-bias) of Garnsey et al. (1997).
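The grouping criterion can be sketched as below. This is only a reading of the description above: it assumes the 15% difference is measured in percentage points of all completions, and it does not try to settle how any overlap between the criteria was handled.

```python
def bias_class(do, sc, total):
    """Assign a verb to a bias group from its DO and SC completion counts."""
    if sc >= 2 * do:
        return "SC-bias"
    if do >= 2 * sc:
        return "DO-bias"
    if abs(do - sc) / total < 0.15:
        return "EQ-bias"
    return None  # verbs satisfying none of the criteria were not used

print(bias_class(do=98, sc=1, total=100))   # accepted  -> DO-bias
print(bias_class(do=12, sc=11, total=100))  # felt      -> EQ-bias
print(bias_class(do=9, sc=60, total=100))   # admitted  -> SC-bias
```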

The published frequency data for the 48 verbs as well as the data from the original subject response sheets, provided by Susan Garnsey, was used in this analysis.

accept, acknowledge, admit, advocate, announce, argue, assert, assume, believe, claim, concede, conclude, confess, confide, confirm, decide, declare, deny, discover, doubt, emphasize, establish, estimate, fear, feel, figure, guarantee, guess, hear, imply, indicate, insure, know, maintain, predict, print, propose, protest, prove, realize, regret, sense, suggest, suspect, understand, warn, worry, write

Table 14: 48 verbs used from Garnsey et al. (1997)

2.2.1.3 Brown Corpus

The three non-experimental corpora used are all on-line corpora that have been tagged and parsed as part of the Penn Treebank project (Marcus et al. 1993). The Brown corpus is a 1-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc). The texts had all been published in 1961, and the corpus was assembled at Brown University in 1963-1964 (Francis and Kucera 1982). Because the Brown corpus is the only one of the five corpora that was explicitly balanced for genre, and because it has become a standard for on-line corpora, it is often used as a benchmark to compare with the other corpora. The Penn Treebank version of the Brown Corpus was tagged and parsed by the Penn Treebank project, a property that allows examples of particular verb/subcategorization combinations to be extracted automatically.

For the 127 verbs used in the CFJCF study, there were approximately 21,000 relevant verb tokens in the Brown Corpus. For each calculation in this chapter where individual verb frequency could affect the outcome, the results were normalized for frequency, and verbs with fewer than 50 examples were eliminated. This left 77 out of 127 verbs in the Brown Corpus. For the 48 verbs in the Garnsey study, there were approximately 6,600 relevant verb tokens. Of the 48 verb types, only 27 had a frequency greater than 50. Table 16 shows examples from the Brown corpus for each of the 16 Connine categories.

2.2.1.4 Wall Street Journal Corpus

The Wall Street Journal corpus is a 1-million word collection of Dow Jones Newswire stories. This corpus is also a part of the DARPA WSJ-CSR1 corpus. The 1-million word segment used was parsed by the Penn Treebank Project. For the 127 verbs used in the CFJCF study, there were approximately 25,000 relevant verb tokens. There were 74 verbs that had a frequency greater than 50. For the 48 verbs in the Garnsey study, there were approximately 5,700 tokens, with 34 of the verbs having a frequency greater than 50.

2.2.1.5 Switchboard Corpus

Switchboard is a corpus of telephone conversations between strangers, collected in the early 1990s (Godfrey et al. 1992). Only the half of the corpus that was processed by the Penn Treebank project was used; this half consists of 1155 conversations averaging 6 minutes each, for a total of 1.4 million words in 205,000 utterances. For the 127 verbs used in the CFJCF study, there were approximately 10,000 tokens in Switchboard. Only 30 verbs had a frequency greater than 50. For the 48 verbs in the Garnsey study, there were approximately 4,400 tokens, with only 7 of the verbs having a frequency greater than 50.

2.2.1.6 Extracting subcategorization probabilities from the corpora

Deriving subcategorization probabilities from the five corpora involved both automatic scripts and some hand re-coding. The set of complementation patterns is based in part on collaboration with the FrameNet project (http://www.icsi.Berkeley.edu/~framenet, Baker et al. 1998, Lowe et al. 1997). The FrameNet complementation patterns are described in Gahl (1998a), Gahl (1998b), Johnson and Fillmore (2000), and Johnson et al. (2001). These complementation categories are also similar to the ones used in Connine et al. (1984), with some changes in the exact definitions of some categories. The 17 categories used in this experiment are listed in Table 15.

Code            Description
0               Intransitive with no other arguments from this list
PP              Intransitive with prepositional phrase
VPto            Intransitive with to marked infinitive verb phrase
Sforto          Intransitive with for PP and to marked infinitive phrase
Swh             Intransitive with Wh clause
Sfin            Intransitive with finite clause
VPing           Intransitive with gerundive verb phrase
Percep. Compl.  Perception Complement
NP              Transitive
NP NP           Ditransitive
NP PP           Transitive with prepositional phrase
NP VPto         Transitive with a to marked infinitive verb phrase
NP Swh          Transitive with a Wh clause
NP Sfin         Transitive with a finite clause
Quo             Quotation
Passive         Passive
Other           Other

Table 15: List of subcategorizations.

Examples of each category taken from the Brown Corpus and the Connine et al. (1984) study are shown in Table 16 and Table 17. These two tables are provided to give a feel for the differences between the corpus data and the sentence production data, but please note that the examples from the Brown corpus are relatively short examples for ease of display; a random sample would have produced much longer sentences on average. The definitions of these subcategorizations are discussed in detail below.

1  0               He always did when people asked.
2  PP              Guerrillas were racing [toward him].
3  VPto            Hank thanked them and promised [to observe the rules].
4  Sforto          …Papa agreed [with Mama] [to make a joint will …].
5  Swh             I know now [why the students insisted that I go to Hiroshima even when I told them I didn't want to].
6  Sfin            She promised [that she would soon take a few days’ leave and visit the uncle she had never seen, on the island of Oyajima -- which was not very far from Yokosuka].
7  VPing           But I couldn't help [thinking that Nadine and Wally were getting just what they deserved].
8  Percep. Compl.  Far off, in the dusk, he heard [voices singing, muffled but strong].
9  NP              The turtle immediately withdrew into its private council room to study [the phenomenon].
10 [NP NP]         The mayor of the town taught [them] [English and French].
11 [NP PP]         They bought [rustled cattle] [from the outlaw], kept him supplied with guns and ammunition, harbored his men in their houses.
12 [NP VPto]       She had assumed before then that one day he would ask [her] [to marry him].
13 [NP Swh]        I asked [Wisman] [what would happen if he broke out the go codes and tried to start transmitting one].
14 [NP Sfin]       But, in departing, Lewis begged [Breasted] [that there be no liquor in the apartment at the Grosvenor on his return], and he took with him the first thirty galleys of Elmer Gantry.
15 Passive         A cold supper was ordered and a bottle of port.
16 Quotes          He writes [“Confucius held that in times of stress, one should take short views – only up to lunchtime.”]

Table 16: Examples of each subcategorization frame taken from the Brown Corpus.

1  0               The anchorman on Channel 7 performed very well.
2  PP              He disappeared [from sight].
3  VPto            I agreed [to come along].
4  Sforto          I waited for Jenny to bring me my book I left at her house.
5  Swh             It’s hard to realize [how far away Australia really is from the U.S.]
6  Sfin            I believe [it’s time for new furniture].
7  VPing           Many try to escape [doing all their work in college].
8  Percep. Compl.  It was fun to help [Rick study last night].
9  NP              I wrote [a book of recipes].
10 [NP NP]         I bought [my brother] [a sweater] in town.
11 [NP PP]         I teach [basketball] [to young kids].
12 [NP VPto]       I expect [the program] [to be a little more interesting].
13 [NP Swh]        We worshipped [him] [when he lived downtown].
14 [NP Sfin]       She promised [her boyfriend] [that she would stick to her diet].
15 Passive         Politics are sometimes loaded with propaganda.
16 Quotes          (none)

Table 17: Examples of each subcategorization frame from the response sheets for the CFJCF data.

The following section describes the set of subcategorization distinctions used in hand coding data. These definitions were also used in developing the series of regular expression searches and tgrep scripts which were used to compute probabilities for these subcategorization frames from the three syntactically parsed Treebank corpora (BC, WSJ, SWBD). The details of the tgrep search patterns are described in Appendix A. Some categories (in particular the quotation category Quo) were difficult to code automatically and so were re-coded by hand. Since the Garnsey et al. data used a more limited set of subcategorizations, portions of this data were manually recoded into the 17 categories.
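The actual extraction was done with tgrep search patterns over the Treebank parses (the patterns themselves are given in Appendix A). Purely to illustrate the general approach, the rough sketch below — not the patterns used in this study — tallies a few coarse frames from the small parsed WSJ sample that ships with NLTK; note that it makes exactly the kinds of errors discussed below, such as counting temporal NPs as objects.

```python
from nltk.corpus import treebank  # small parsed WSJ sample, used only for illustration

def coarse_frame(vp):
    """Assign a very rough frame label from the top-level sisters of the verb."""
    labels = [c.label().split("-")[0] for c in vp
              if hasattr(c, "label") and not c.label().startswith("VB")]
    has_np, has_pp = "NP" in labels, "PP" in labels
    has_s = "SBAR" in labels or "S" in labels
    if has_np and has_s:
        return "NP Sfin"
    if has_np and has_pp:
        return "NP PP"
    if has_np:
        return "NP"
    if has_s:
        return "Sfin"
    if has_pp:
        return "PP"
    return "0"

def count_frames(verb_forms):
    counts = {}
    for tree in treebank.parsed_sents():
        for vp in tree.subtrees(lambda t: t.label() == "VP"):
            heads = [c for c in vp if hasattr(c, "label")
                     and c.label().startswith("VB") and c[0].lower() in verb_forms]
            if heads:
                frame = coarse_frame(vp)
                counts[frame] = counts.get(frame, 0) + 1
    return counts

print(count_frames({"admit", "admits", "admitted", "admitting"}))
```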

A sample of the automatically labeled data was hand checked for accuracy. The error rate in the data, excluding errors in quotation identification (which were fixed by hand), is between 3% and 7%7. The error rate is given as a range due to the subjectivity of some types of judgments. 2-6% of the error rate was due to mis-parsed sentences in Treebank, including PP attachment errors, argument/adjunct errors, etc. 1% of the error rate was due to inadequacies in the search strings, primarily in locating displaced arguments via the Treebank 1 style notation used in the Brown Corpus data. The sources of error are further discussed in the following sections and in Appendix A.

7 Although the overall error rate of the search patterns is low, it tends to affect some verbs and subcategorizations more than others. Because of this, the appendix provides the search strings to generate subcategorization data from the Treebank corpora rather than the resulting numbers generated by the strings. This allows any potential users of this data to see first hand whether the verbs of interest are prone to error or not.

Transitive (NP)

The NP constituent appears in all transitive categories, including the ditransitive subcategorization. Examples (34) and (35) illustrate transitive and ditransitive uses of verbs. Cases where the NP has been dislocated are also counted, such as in example (36). Note that passivization is counted in a separate category. The NP constituent excludes cases where the NP is not an argument of the verb, such as the time NP in (37). Measure phrases such as in (38) are also excluded. In practice, for the Treebank data, the NP constituent includes all nodes labeled NP that are either lexically filled or marked with a T. Because of this, there were some errors in the Treebank data. A hand checked random sample showed that about 1% of the overall examples involved missed NPs as a result of failing to identify cases of movement. Additionally, there was an overall error rate of about 2% that resulted from cases such as (37) and (38), where the time and measure NPs are (incorrectly) included in the count.

(34) But, darn it all, why should we help [a couple of spoiled snobs who had looked down their noses at us]NP? (Brown)

(35) He suggested that a regrouping of forces might allow [the average voter]NP [a better pull at the right lever for him]NP on election day. (Brown)

(36) She always did before, and showed the utmost confidence in whatever we advised [ ]NP. (Brown)

(37) Senators unanimously approved [Thursday]TIME NP [the bill of Sen. George Parkhouse…]NP (Brown)

(38) We walked [miles]MEASURE NP and saw various shrines and gardens. (Brown) (cf. We walked [the dog]NP)

Prepositional Phrase (PP)

The PP constituent appears in both the PP and NP PP subcategorizations shown in (41) and (42) respectively. This category includes prepositional phrases that are arguments of the verb, but not those that are adjuncts, such as in (43). For the Treebank data, the argument / adjunct distinction in the parse structure was used. Unlike the Treebank data, the Connine et al. data did not distinguish between whether prepositional phrases were used as arguments or adjuncts. Because of this, the relevant portions of the Connine et al. data were recoded to match the Treebank data, resulting in the reclassification of a number of Connine et al. examples from [PP] to [0] and [NP PP] to [NP]. Example (39) illustrates a case which was coded as an instance of a PP under both the Connine et al. coding scheme and the one used in this dissertation, while example (40) illustrates a case which was originally coded as a PP in the Connine et al. scheme, but was recoded as a [0] example because the prepositional phrase was a time phrase.

(39) I had to struggle [with the steak]PP because it was so tough. (Connine et al. data)

(40) We struggled [for hours]Time Phrase. (Connine et al. data)

The data used in this experiment also relies on a distinction between prepositions and particles. Verb particle combinations such as (44) were treated as being separate verbs, and were thus excluded from this study. Within the Treebank data, the PRT and PP tags were used to make this distinction.

(41) He remained there for four years before moving [to Rensselaer Polytechnic Institute in Troy, N.Y.]PP (Brown)

(42) One young girl told me how her mother removed a wart from her finger by soaking a copper penny in vinegar for three days and then painting [the finger]NP [with the liquid]PP several times. (Brown)

(43) She painted [the finger]NP [on Monday]ADJUNCT.

(44) All this was unknown to me, and yet I had dared to ask her [out]Particle for the most important night of the year! (Brown)

Intransitive with ‘to’ marked infinitive verb phrase (VPto)

The VPto constituent can appear either with or without an NP.

(45) However, three of the managers did say that they would agree [to attend the proposed meeting]VPto if all of the other managers decided to attend. (Brown)

(46) He advised [the poor woman]NP [not to appear in court] VPto as what she was charged with T was not in violation of law. (Brown)

‘For’ PP with ‘to’ marked infinitive phrase (Sforto)

The Sforto subcategorization consists of a prepositional phrase, typically with the preposition for, and a to marked infinitive verb phrase.

(47) A few years before his death Papa had agreed [with Mama]PP [to make a joint will with her]VPto in which it would be provided that in the event of the death of either of them an accounting would be made to their children whereby each child would receive a bequest of $ 5000 cash. (Brown)

(48) I’d like [for you]PP [to meet my mother] VPto. (Framenet)

‘Wh’ clause (Swh)

The Wh clause can occur with either transitive or intransitive verb uses. This category includes the traditional Wh words as well as if and whether clauses.

(49) Note where the sun rises and sets, and ask [which direction the prevailing winds and storms come from]Swh.

(50) About five years ago, Handley came to ask [me]NP [if he could see the tattered register]Swh.

Finite Clause (Sfin)

The Sfin constituent consists of a finite clause. This clause can be preceded by a that.

(51) But I insist upon believing [that even when it is lost, it may, like paradise, be regained]Sfin (Brown).

(52) A tribe in ancient India believed [the earth was a huge tea tray resting on the backs of three giant elephants, which in turn stood on the shell of a great tortoise]Sfin (Brown).

Gerundive Verb Phrase (VPing)

The VPing constituent consists of a gerundive verb phrase. The gerundive verb has no (lexically filled) subject. Items with a subject are included in the perception complement category, thus making this a subset of the Framenet VPing category.

(53) He seemed to remember [reading somewhere that Abyssinians had large litters]VPing, and suffered a dismaying vision of the apartment overrun with a dozen kittens. (Brown)

(54) However, he continued [experimenting and lecturing]VPing, publishing the results of his experiments in German and Danish periodicals. (Brown)

Perception Complement (Percep. Compl.)

The Perception Complement constituent consists of an untensed clause with either a bare stem verb or an ing verb. These are referred to as perception complements because they are frequently used with perception verbs such as hear and see. The distinction between the VPing category and this category is that the verb in the perception complement has a (lexically filled) subject. This category is derived from Connine et al. (1984), and is a superset of the Framenet VPbrst category.

(55) Winston had heard [her shaking out the skirt of her new pink silk hostess gown]Percep. Compl. (Brown)

(56) Anyhow, I wasn't surprised, early that morning, to see [Handley himself crossing from Dogtown Common Road to the Back Road]Percep. Compl. (Brown)

Quotation (Quo)

The quotation is a possible constituent for many verbs of communication. This constituent was particularly hard to identify using tgrep search patterns, since not all examples are marked with quotation marks. In addition, when quotation marks are present, they are frequently present for items which are not arguments of the verb in question. Additionally, the content of a quotation can range from an exclamation to a complete sentence, which makes it difficult to predict particular structural patterns. Because of these difficulties, quotations were hand labeled after the normal tgrep labeling process.

(57) [“A fine idea”]QUO, Mike agreed. (Brown)

(58) [“Hey, Mityukh”]QUO, asks one group, [“what are we shouting about?”]QUO (Brown)

Passive

Passive verb uses were automatically identified by tgrep search strings. Passives included both be and get passives. Reduced relative uses of verbs were not included in this study.

(59) American history should clinch the case when Congress is asked to approve. (Brown)

(60) This selection-rejection process takes place as the file is read. (Brown)

Other

The other category does not represent a particular class of verb uses. Instead, frequencies for other in various tables reflect the frequency for the remaining subcategorizations not specifically shown in the table.

2.2.1.7 Measuring differences between corpora

Previous authors such as Trueswell et al. (1993), Merlo (1994), and Lapata et al. (2001) have used correlation values found for a single subcategorization possibility (such as NP) across several verbs in order to measure the differences in subcategorization between corpora. For example, to calculate the correlation value for the NP subcategorization between the Brown Corpus and the WSJ corpus based on the data below, one would use the NP frequencies for hear as well as those for several other verbs. When only two subcategorization possibilities exist, the correlation values for a single subcategorization are sufficient to capture the differences between the corpora, but when multiple subcategorization possibilities exist, separate correlation values are calculated for each subcategorization. This use of the correlation statistic has two disadvantages, however. It cannot be used to measure the difference between the use of a single verb in two corpora. Additionally, when a large number of subcategorization possibilities exist, the correlation values for a single subcategorization may not yield a complete picture of the differences, since the set of verbs under consideration may have very different distributions of one subcategorization, such as NP, but may tend to have more similar distributions of the other subcategorizations (noting that the total of all uses must still add up to 100%).

Because many results in this dissertation involve multiple possible subcategorizations or comparisons between the use of a single verb in various corpora, a different method of measuring differences must be used. Two different statistics were used for comparing the subcategorization probabilities of a verb in different corpora. First, the cosine (Salton & McGill 1983), a standard measure in information retrieval, was used in order to measure the degree of difference between the uses of a verb in two different corpora. The subcategorization frequencies for a verb can be treated as a vector in multidimensional space. This allowed us to use the cosine of the angle between the vectors as a measure of the agreement between the subcategorization frequencies of verbs in different corpora. Table 18 shows the vectors for the verb hear in the Brown corpus and in the Wall Street Journal corpus. Using Formula 1, the cosine of the two vectors shown in Table 18 is 0.98. For non-negative vectors such as these subcategorization frequency vectors, the cosine ranges from 0 (complementary distribution) to 1 (complete agreement). This measure is not affected by sample size, and thus allows for direct comparison of the degree of difference between corpora irrespective of corpus size.

hear    0    PP   Swh   Sfin   Percep. Compl.   NP   NP PP   passive
BC      4    12   3     1      15               47   4       14
WSJ     0    17   3     5      13               56   10      10
Table 18: Raw subcategorization vectors for hear from BC and WSJ.

$$\mathrm{Cosine} = \frac{\sum_{i=1}^{n} x_i\, y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\;\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

Formula 1: Cosine of two vectors, x and y.
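As an illustration, Formula 1 can be spelled out in a few lines of code. The sketch below (in Python, which is not part of the original analysis) recomputes the cosine for the raw hear vectors from Table 18; the ordering of the counts is taken directly from that table.

```python
import math

def cosine(x, y):
    """Cosine of two subcategorization frequency vectors (Formula 1)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Raw counts for 'hear' from Table 18, in the order:
# 0, PP, Swh, Sfin, Percep. Compl., NP, NP PP, passive
brown = [4, 12, 3, 1, 15, 47, 4, 14]
wsj   = [0, 17, 3, 5, 13, 56, 10, 10]

print(round(cosine(brown, wsj), 2))  # prints 0.98, as reported above
```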

Although the cosine value shows the degree of difference between the distribution of the subcategorizations of a verb between two corpora, it does not show whether the difference is significant or not. The Chi Square statistic is used to determine whether the differences in subcategorization are significant. In order to calculate Chi Square, subcategorizations with low frequencies are collapsed into an other category, such as in Table 19. When Chi Square is calculated for this example, the difference between the two corpora is found to be not significant (Chi Square = 4.13, p > .2).

hear    PP   Percep. Compl.   NP   NP PP   passive   Other
BC      12   15               47   4       14        8
WSJ     17   13               56   10      10        8
Table 19: Modified subcategorization vectors for hear from BC and WSJ for use in calculating Chi Square.
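A minimal sketch of the same comparison using a standard contingency-table test is given below; scipy's chi2_contingency is assumed here as a stand-in for whatever statistical software was actually used, and the counts are those of Table 19.

```python
from scipy.stats import chi2_contingency

# Collapsed counts for 'hear' from Table 19:
# columns are PP, Percep. Compl., NP, NP PP, passive, Other
observed = [
    [12, 15, 47, 4, 14, 8],   # Brown corpus
    [17, 13, 56, 10, 10, 8],  # Wall Street Journal
]

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 2))  # roughly 4.13 and p > .2, i.e. not significant
```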

2.2.2 Results and discussion: Part 1 - Subcategorization differences resulting from comparing isolated sentence and connected-discourse corpora

A portion of the subcategorization frequency differences is the result of the inherently different nature of single sentence production used in psychological experiments and connected discourse found in the more natural corpus data. This section will show that the single sentence / connected discourse opposition affects subcategorization through two general mechanisms: the use of discourse cohesion in connected discourse and the use of default referents in null context (isolated sentence production).

2.2.2.1 Discourse cohesion

The first difference between single sentence production and connected discourse involves discourse cohesion. Unlike isolated sentences, a sentence in connected discourse must cohere with the rest of the discourse. Halliday & Hasan (1976) use the notion of cohesion to show why sentences such as “So we pushed him under the other one” sound odd as the start of a conversation. Because a large number of syntactic phenomena such as pronominalization, fronting, deixis, and passivization play a role in discourse coherence, one would expect these syntactic devices to be used differently in connected discourse than in single sentence production. In addition, to the extent that these syntactic phenomena affect subcategorization, one would expect sentences produced in isolation (such as in the Connine et al. and Garnsey et al. experiments) to have different subcategorization probabilities than sentences found in connected discourse, such as in the Brown corpus, the Wall Street Journal corpus, and the Switchboard corpus. Because dislocated arguments and pronominalized arguments were counted in the same categories as their non-dislocated and full NP counterparts, pronominalization and most kinds of movement do not affect subcategorization frequencies. Two syntactic devices that do affect subcategorization frequencies are passivization and zero anaphora.

Passive

The passive in English is generally described as having one of two broad functions: (1) de-emphasizing the identity of the agent and (2) keeping an undergoer topic in subject position (Thompson 1987). Because both of these functions are more relevant for multi-sentence discourse, one would expect that sentences produced in isolation would make less use of passivization. As shown in Table 20, there is a much greater use of the passive in all of the connected discourse corpora than in the isolated sentences from Connine et al.8

Data Source            % passive sentences
Garnsey                —9
CFJCF                  0.6%
Switchboard            2.2%
Wall Street Journal    6.7%
Brown corpus           7.8%
Table 20: Use of passives in each corpus.

Zero Anaphora

Zero anaphora also plays a role in discourse cohesion. Whether an argument of a verb may be omitted depends on factors such as the semantics of the verb, what kind of omission the verb lexically licenses, the definiteness of the argument, and the nature of the context (Fillmore 1969, 1986; Fraser & Ross 1970; Resnik 1996; inter alia). In one common case of zero anaphora, Definite Null Complementation (DNC), “the speaker’s authority to omit a complement exists only within an ongoing discourse in which the missing information can be immediately retrieved from the context” (Fillmore 1986). For example, the verb follow licenses DNC only if the ‘thing followed’ can be recovered from the context, as shown in examples (61) and (62). Because the referent must be recoverable from the context, this type of zero anaphora is unlikely to occur in single sentence production, where the context is limited at best.

(61) The shot reverberated in diminishing whiplashes of sound. Hush followed. (Brown corpus)

(62) Underwriting doesn't get under way until after morning tea at 10 a.m. A two-hour lunch break follows. (WSJ)

The lack of Definite Null Complementation in single sentence production results in single sentence corpora having a lower occurrence of the [0] subcategorization frame. For example, the direct object of the verb follow is often omitted in the connected discourse corpora, but never omitted in the Connine et al. data set. Hand-counting every instance of follow in all four corpora verifies that every case of omission was caused by definite null complementation. The referent is usually in a preceding sentence or a preceding clause of the same sentence. Note that because a proper noun and the preterite form of the verb are provided in the Garnsey et al. (1997) protocol, producing an example of the [0] subcategorization frame would more or less require the subjects to leave a blank response.

8 There were more passives in the written than in the spoken corpora, supporting Chafe (1982).
9 The protocol used in Garnsey et al. (1997) forces the subjects to produce active sentences.

Data Source            % [0] subcategorization frame
Garnsey                —
CFJCF                  0%
Wall Street Journal    5%
Switchboard            11%
Brown                  22%
Table 21: The object of follow is only omitted in connected-discourse corpora. (numbers are hand-counted, and indicate % of omitted objects out of all instances of follow)

Default referents

In connected discourse, the context controls which referents are used as arguments of the verb. In single sentence production tasks, there is no larger context to provide this influence. In the absence of such demands, one might expect the subjects to use a wider variety of arguments with the verbs. On the contrary, the subjects favor a narrow set of default referents – those which are accessible in the experimental context, or which are prototypical arguments of the verb. There are three kinds of biases toward these default referents.

1) First Person Subjects

First, non-zero subjects of single sentence productions were more likely to be I or we than subjects in the types of written connected discourse sampled. Presumably the participants tended to use themselves as the topic of the sentence since in a null context there was no topic under discussion. Table 22 shows that the single sentence production data has a higher use of first person subjects than the written connected discourse data. Note that the Switchboard corpus also has a higher use of first person subjects. This could reflect a tendency for the participants, who are talking to strangers, to use themselves as a topic, given the absence of shared background.

Data Source            % first person subject
Garnsey                —10
CFJCF                  40%
Switchboard            39%
Brown corpus           18%
Wall Street Journal    7%
Table 22: Greater use of first person subject in isolated-sentences.

2) Anaphoric relationship between NPs

Second, VP internal NPs (i.e. NPs which are c-commanded by the verb) in single sentence production are more likely to be anaphorically related to the subject of the verb than are the internal NPs of VPs in connected discourse. This includes cases such as (63) where the embedded NP is co-referential with the subject, and cases such as (64) where the embedded NP and the subject are related by a possession or part-whole relationship. To simplify judgment of relatedness, only co-referential pronouns and traces were counted. Inferentially related NPs were not counted.

(63) Tom_i noticed that he_i was getting taller. (Garnsey et al. data)

(64) Alice_i prayed that her_i daughter wouldn’t die. (Garnsey et al. data)

(65) John_i said he_i will kill his_i teacher. (Connine et al. data)

Table 23 shows how often the subject was anaphorically related to a VP internal NP in a hand-labeled sample of 100 sentences randomly selected from the entirety of each corpus.

Data Source            % related subject/NP
Garnsey                41%
CFJCF                  26%
Wall Street Journal    15%
Brown corpus           12%
Switchboard            8%
Table 23: Use of VP-internal NPs which are anaphorically related to the subject.

By contrast, VP-internal NPs in the natural corpora were more likely to refer to referents other than the subject of the verb. This additional sentence-internal anaphora in the isolated sentences is presumably a strategy for avoiding sentences like (66), which require the creation of an additional referent that is not already present in the context.

(66) Alice prayed that Bob’s daughter wouldn’t die. (made up example)

10 3rd person subjects are provided in the Garnsey et al. (1997) protocol.

Additionally, the subjects and VP-internal NPs are both likely to refer to previously mentioned entities from preceding sentences. Notice that in example (67), taken from the Brown Corpus, the subject He is co-referential with the subjects in the preceding sentences, and the object lightning is not only not co-referential with the subject of the verb notice, but is at least suggested by rain in the previous sentence. Similarly, in example (68) from the WSJ corpus, the object of notice is not co-referential with the subject clients, but both switch and clients are related to entities mentioned in the previous context (both new message and Mr. Straszheim are referentially related to his switch, while smaller clients is mentioned in contrast with sophisticated professional).

(67) But he_i felt no physical discomfort. He_i was only vaguely aware of the sluicing rain. He_i hardly noticed [the blue-green flashes of lightning and the hard claps of thunder]_j. (Brown)

(68) Carrying the new message on the road, Mr. Straszheim meets confrontation that often occurs in inverse proportion to the size of the client. No sophisticated professional expects economists to be right all the time. Some smaller clients_i don't seem to notice [his switch]_j. (WSJ)

3) Prototypical objects

Third, the objects in the single sentence production data were more likely to be prototypical objects. That is, subjects tended to use default, relatively predictable head nouns for the direct objects of verbs. For example, of the 107 Garnsey sentences with the verb accept, 12 (11%) had a direct object whose head noun was award. In fact, 33% of the 107 sentences had a direct object whose head was one of the four most common words: award, fact, job, or invitation. By contrast, the 112 Brown corpus sentences used a far greater variety of objects; it would take 12 different object nouns to account for 33% of the 112 sentences. Furthermore, the most common Brown corpus objects were pronouns (it, them); no common noun occurred more than 3 times in the 112 sentences. A formal metric of argument prototypicality is the token/type ratio. The ratio of the number of object noun tokens to object noun types will be high when a small number of types account for a greater percentage of the tokens. Table 24 below shows that the token/type ratio is much higher for the Garnsey data set than for the Brown corpus.
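As a sketch, the ratio reported in Table 24 below is simply the number of object-noun tokens divided by the number of distinct object-noun types; the head-noun list in this example is hypothetical and only stands in for the hand-coded objects of accept.

```python
def token_type_ratio(head_nouns):
    """Token/type ratio: number of tokens divided by number of distinct types."""
    return len(head_nouns) / len(set(head_nouns))

# Hypothetical head nouns standing in for the hand-coded direct objects of
# 'accept' in one data set; the real ratios are the ones given in Table 24.
objects = ["award", "award", "fact", "job", "invitation", "award"]
print(round(token_type_ratio(objects), 1))  # 1.5 for this toy list
```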

Data Source            token count   type count   Argument token/type ratio
Garnsey                107           54           2.0
CFJCF                  —             —            —
Wall Street Journal    138           105          1.3
Brown corpus           112           86           1.3
Switchboard            15            14           1.1
Table 24: Token/Type ratio for arguments of accept

2.2.2.2 Other experimental factors – subject animacy

The previous sections discussed context effects that distinguish isolated sentence corpora from connected discourse corpora. This section discusses a further experimental bias that is specific to the sentence completion task. In sentence completion, the participants are given a prompt consisting of a syntactic subject as well as a verb. The nature of this syntactic subject can influence the verb subcategorization of the resulting sentence. Indeed this fact explains the single largest mismatch between the Garnsey data set and Brown corpus data. The verb worry was the only verb in these two corpora with an opposite preference between direct object and sentential complement; in Brown worry was more likely to take a direct object, while in the Garnsey data set worry was more likely to take a sentential complement.

Subcategorizations of ‘worry’   % Direct Object   % Sentential Complement
Garnsey                         1%                24%
BC                              14%               4%
Table 25: Subcategorization of worry affected by sentence-completion paradigm.

This reversal in preference was caused by the properties of two of the subcategorization frames of worry. In frame 1 below, worry takes an experiencer as a subject, and subcategorizes for a finite sentence [Sfin]. In frame 2 below, worry takes a stimulus as a subject, and subcategorizes for an [NP]. This alternation is an example of the Causative/Inchoative alternation that some ‘amuse-type’ psych verbs undergo, as discussed in Levin (1993).

#   frame                                example
1   [experiencer] worries [stimulus]     Samantha worried that trouble was coming in waves. (Garnsey)
2   [stimulus] worries [experiencer]     Her words remained with him, worrying him for hours. (BC)
Table 26: Uses of worry.

In the Garnsey protocol, proper names (highly animate) were provided. Specifically, for the verb worry, subjects had to complete a sentence starting with “Samantha worried ______”. This provides a bias towards the first use, since animate subjects are more likely to be experiencers than stimuli. None of the Garnsey examples were completed in such a way that Samantha became the stimulus of the worrying as in the made up example (69). Rather, Samantha was always the experiencer of the worry.

(69) Samantha worried me. (made up example)

All of the sentential complement uses in the Brown corpus data had a human/animate subject, such as in example (70). In the direct object uses, only 30% of the subjects were animate. Examples (71) and (72) show inanimate subjects preceding direct object uses of the verb worry in the Brown Corpus.

(70) But Keys worries that the Metrecal drinker will never make either the psychological or physiological adjustment to the idea of eating smaller portions of food. (Brown)

(71) This had worried her. (Brown)

(72) “The only thing that worries me is how I'm going to prove it”, Eugenia said. (Brown)

It is uncontroversial that the nature of the prompt in a sentence completion experiment affects factors such as whether the sentence will be active or passive. This analysis shows that the nature of the prompt has a more subtle but equally important effect on how subjects will use a verb.

2.2.2.3 Conclusion for section 2.2.2

This section has shown several different ways in which discourse context and experimental design affect observed subcategorization frequencies. These effects suggest that a psychological model of subcategorization probabilities will need to control for such discourse context effects. These contextual effects also have a methodological implication. In cases where performance results from a particular experimental context are being compared with corpus frequencies at large, one should not expect to find direct correspondences between phenomena such as parsing preferences and reading times and corpus frequencies. This is because the experimental results will be influenced by the particular discourse context of the experimental items, while the corpus frequencies will reflect an average of all of the contexts represented in the corpus.

The alternative to comparing experimental performance with corpus frequencies is to compare experimental results with norming data based on similar contexts. One example of this is found in Garnsey et al. (1997), where both the norming and experimental results were based on single sentence production/reading times, and human (grammatical) subjects were used with the verbs in both cases. Even when such a methodology is employed, corpus frequencies can still be used to identify likely candidate verbs for each of the possible subcategorization biases.

The number of possible causes of differences between experimental and corpus verb subcategorization frequencies places an extra burden on researchers attempting to demonstrate a lack of use of probabilistic information in human sentence processing. One must show not only that the probabilistic information is not being used, but also that there are no other plausible causes of a lack of correspondence between frequency data and experimental results. Alternatively, when evidence of a correlation is found, it is doubly strong, in that such a correspondence between frequency information and experimental results indicates that the correlation is sufficiently strong to overcome potential discourse and contextual differences.

The number of possible causes for differences between subcategorization frequencies from various sources also makes it difficult to investigate whether there are any inherent differences between production and comprehension frequencies. On the one hand, one might expect factors such as memory capacity to affect comprehension differently than production, but on the other hand, one would also expect that humans would use a language model based on exposure to production during comprehension.

2.2.3 Results and discussion: Part 2 - Subcategorization differences resulting from verb sense differences

This section will demonstrate that using the verb lexeme rather than individual verb senses as the basis for subcategorization probabilities also results in apparent differences in subcategorization probability between corpora. This section will show that different corpora can yield different subcategorization probabilities. Then, it will show that different corpora contain different senses of verbs. Finally, it will show that it is this different distribution of lemmas or senses that accounts for much of the inter-corpus variability in subcategorization frequencies.

The results in this section are based on a subset of the data used in the previous section. In order to investigate the relationship between verb sense and verb subcategorization, seven verbs were selected from the data used in the previous section. These verbs were hand tagged for semantic sense/lemma using semantic senses provided in WordNet (Miller et al. 1993). Sense categories were collapsed in the few cases where different WordNet senses could not be reliably distinguished. Examples of each sense are shown in the tables below in this section. When there were more than 100 tokens of a verb in a single corpus, 100 randomly selected examples were coded. This sample size was chosen to match the maximum sample size in the psychological corpora. Primarily, the data from the Brown corpus and the Wall Street Journal corpus is compared, since these two corpora had the largest amount of data. Although the data from the other corpora was less plentiful, it still provided useful insights.

2.2.3.1 Verbs have different subcategorization frequencies in different corpora

First, three verbs, pass, charge, and jump, are analyzed. These three were chosen because they had large differences in subcategorization frequencies between the Wall Street Journal corpus and the Brown corpus. Table 27 shows that all three verbs have significant differences in subcategorization frequencies between the Brown corpus and the Wall Street Journal corpus.

Verb     Cosine (all senses combined)   Do BC and WSJ have different subcategorization probabilities?
pass     0.75                           Yes (X2 = 22.2, p < .001)
charge   0.65                           Yes (X2 = 46.8, p < .001)
jump     0.50                           Yes (X2 = 49.6, p < .001)
Table 27: Agreement between WSJ and BC data.

2.2.3.2 Verbs have different distributions of sense in different corpora

Next, the frequency of each sense was measured in each corpus. Each of the verbs showed a significant difference in the distribution of senses between the Brown corpus and the Wall Street Journal corpus, as shown in Table 28. This is consistent with Biber et al. (1998), who note that different genres have different distributions of word senses.

Verb     Do BC and WSJ have different distributions of verb sense?
pass     Yes (X2 = 59.4, p < .001)
charge   Yes (X2 = 35.1, p < .001)
jump     Yes (X2 = 103, p < .001)
Table 28: Differences in distribution of verb senses between BC and WSJ.

Table 29 uses the verb charge to show how the sense distributions are different for a particular verb. The types of topics contained in a corpus influence which senses of a verb are used. Since the Brown corpus contains a balanced variety of topics, while the Wall Street Journal corpus is strongly biased towards business related discussion, one would expect to see more of the business-related senses in the Wall Street Journal corpus. Indeed, the two business-related senses of charge (accuse and bill) are used more frequently in the Wall Street Journal corpus, although they also occur commonly in the Brown corpus, while the attack sense of charge is used only in the Brown corpus. The credit card sense is probably more common in corpora that are more recent than the Brown corpus.

Senses of charge       BC %   WSJ %   Example of the senses of charge
attack (WN 1)11        23%    0%      His followers shouted the old battle cry after him and charged the hill, firing as they ran. (BC)
run (WN 4)             8%     0%      She charged off to the bedrooms. (BC)
appoint (WN 5)         6%     4%      The commission is charged with designing a ten year recovery program. (WSJ)
accuse (WN 2,6,7)      39%    58%     Separately, a Campeau shareholder filed suit, charging Campeau, Chairman Robert Campeau and other officers with violating securities law. (WSJ)
bill (WN 3)            24%    36%     Currently the government charges nothing for such filings. (WSJ)
credit card (WN 12)    0%     2%      Many auto dealers now let buyers charge part or all of their purchase on the American Express card…. (WSJ)
TOTAL                  100%   100%
Table 29: Examples of common senses of charge and their frequencies.

Table 30 shows examples of the senses of the verb jump and their frequencies. There is a clear split in the distributions of the senses. Two senses, leap and attack, occur primarily in the BC data, while the economic sense price jump is used primarily in the WSJ data.

Senses of jump    BC %   WSJ %   Examples of the senses of jump
leap (WN 1,2)     71%    4%      He jumped, and sank to his knees in muddy water. (BC)
attack (WN 3)     11%    2%      “This deal at Las Putas Buenas where the two knife-men jumped you,” said Rourke with interest, “that sounds like it was set up with malice aforethought by the luscious Mrs. Peralta, doesn’t it?” (BC)
price (WN 4)      5%     86%     Holiday Corp. said net income jumped 89%, partly on the strength of record operating income in its gaming division. (WSJ)
other senses      13%    8%
TOTAL             100%   100%
Table 30: Examples of common senses of jump and their frequencies.

Table 31 shows the frequencies for each sense of the verb pass. The go past sense is most common in the BC data, while the legal sense of passing a law is most common in the WSJ data.

11 The senses used in this section are either WordNet senses or combinations of WordNet senses. WordNet sense numbers not listed either were not present in the sample used or were counted in the other category.

Senses of pass          BC %   WSJ %   Examples of the senses of pass
go past (WN 1,2,6,7)    48%    4%      We passed his house and school on the way. (BC)
law (WN 3)              19%    49%     In the senate, several bills are expected to pass without any major conflict or opposition. (BC)
pass time (WN 4)        7%     3%      Phil decided to stay a little longer, and as time passed, it seemed as if the strange little man had never been there, but for the other glass on the table. (BC)
hand to (WN 5)          5%     17%     He asked, when she passed him a glass. (BC)
test (WN 14)            1%     6%      Those who stayed had to pass tests. (BC)
other                   20%    21%
TOTAL                   100%   100%
Table 31: Examples of common senses of pass and their frequencies.

2.2.3.3 Topics provided in norming studies also influence verb sense

Corpus topic also affects verb sense in the isolated sentence corpora. When topics such as home, school, and downtown were provided to the subjects in the Connine et al. (1984) sentence production study, subjects used different senses of the verbs. Table 32 shows how the provided setting influenced the verb sense used. For example, the school setting caused 5 out of 9 subjects to use the test sense of the verb pass. By contrast, the test sense was used only 2 times in 230 examples in the Brown corpus.

Setting     movement   test   pass the buck   Other12
home        6          1      1               2
downtown    5          1      0               3
school      4          5      0               0
Table 32: Uses of pass in different settings in the CFJCF sentence production study

2.2.3.4 Subcategorization frequencies for each verb sense

For each of these three verbs, the subcategorization frequencies for each sense were examined. In each case, the relative frequency of the verb senses in each corpus resulted in a difference in the overall subcategorization frequency for that verb. This is due to each of the senses having separate subcategorization probabilities. Table 33 illustrates that different senses of the verb charge have different subcategorizations (examples of each sense are given in Table 29).

12 Other sense or non-verb use.

Senses of ‘charge’   that-S   NP    NP PP13       passive   Other
appoint              0%       0%    0%            4%        0%
accuse               18%      0%    12% (with)    24%       2%
bill                 0%       9%    24% (for)     1%        1%
credit card          0%       0%    2% (on)       0%        0%
Table 33: Different senses of charge in WSJ have different subcategorization probabilities. Dominant prepositions are listed in parentheses after the frequency.

Further evidence that subcategorization probabilities are based on verb sense is provided by the fact that for two of the verbs, pass and charge, the agreement for the most common sense was significantly better than the agreement for all senses combined. The third verb, jump, also shows improvement, but the single sense value is not significant. This is because the nearly complementary distribution of senses between the corpora results in low sample sizes for one of the corpora whenever only a single sense is taken into consideration. Table 34 shows that the agreement for the most common sense is better than the agreement for all senses combined. The remaining disagreement between the corpora is attributed to context and discourse based subcategorization differences.

Verb     Cosine (all senses combined)   Cosine (most common sense)
pass     0.75                           0.95
charge   0.65                           0.80
jump     0.50                           0.59
Table 34: Improvement in agreement after controlling for verb sense.

2.2.3.5 Factors that contribute to stable cross-corpus subcategorization frequencies

The first three verbs examined in detail (pass, charge, and jump) illustrate how verb sense differences can result in large verb subcategorization differences between corpora. The next three verbs (kill, stay, and try; Table 35) were chosen in contrast to illustrate cases where there is good agreement in overall subcategorization between the Wall Street Journal corpus and the Brown corpus data. This is done as a preliminary effort to see what factors might prevent subcategorization frequencies from changing between corpora.

Verb   Cosine (all senses combined)   Do BC and WSJ have different subcategorization probabilities? (X2)
kill   1.00                           No
stay   1.00                           No
try    1.00                           No
Table 35: Agreement between BC and WSJ data.

13 The set of subcategorization frames used does not take the identity of the preposition into account.

One possible circumstance where one would not predict verb subcategorization to vary between corpora is the case where all of the corpora have either the same single sense or the same combinations of senses. One would expect to find many verbs within a language that had only one common sense. However, Table 36 shows that these three verbs had significantly different distributions of verb senses in the Brown corpus and the WSJ corpus.

Verb   Do BC and WSJ have different distributions of verb sense?
kill   Yes (X2 = 26.9, p < .001)
stay   Yes (X2 = 26.1, p < .001)
try    Yes (X2 = 8.74, p < .025)
Table 36: Differences in distribution of verb sense between BC and WSJ.

Table 37 shows the distribution of senses for the verb kill. The verb kill has a primary sense, cause to die, and a series of other senses that are metaphorical extensions. In the BC data, mainly the literal sense is used, while in the WSJ data, a variety of extensions are also used.

Senses of kill           BC %   WSJ %   Examples of the senses of kill
cause to die (WN 1,4)    99%    74%     To give the patient the wrong type of blood, said the doctor, would likely kill him. (BC)
vote down (WN 2)         1%     13%     Cashiering the entire omnibus bill would probably mean killing any capital-gains cut, too. (WSJ)
kill a deal (WN 11)14    0%     10%     NBC’s interest may revive the deal, which MGM/UA killed last week when the Australian concern had trouble raising cash. (WSJ)
other                    0%     3%
TOTAL                    100%   100%
Table 37: Examples of common senses of kill.

Table 38 shows the distribution of senses for the verb stay. The sense of stay meaning to visit someone, a general sense, is more common in the BC data, while the sense of continuing to work is more common in the WSJ data. In BC, not changing and not moving are equally common, while WSJ favors the not change sense.

14 WordNet sense 11 is closest to the kill a deal sense used in the WSJ data.

Senses of stay             BC %   WSJ %   Examples of senses of stay
not change (WN 1)          33%    54%     If the dollar stays weak, he says, that will add to inflationary pressures in the U.S. and make it hard for the Federal Reserve Board to ease interest rates very much. (WSJ)
not move (WN 2)15          34%    19%     But Hoag had not stayed on the front steps long when Griffith disappeared into the building. (BC)
visit someone (WN 3)       18%    2%      Robbie and Beryl tried their best to persuade her to come and stay with them, and Anne and I have told her she’s more than welcome here, but I think she feels that she might be an imposition, especially as long as our Rosie is still in school. (BC)
stay on at a job (WN 4)    10%    21%     Mr. Leibler, 40, said he will stay on as Amex president and work with Mr. Jones. (WSJ)
other                      5%     4%
TOTAL                      100%   100%
Table 38: Examples of common senses of stay.

Table 39 shows the distribution of senses for the verb try. Both corpora have the same dominant sense, but BC also contains the secondary sense of try out/test. In general, BC is more likely to have higher frequencies of secondary senses than WSJ, unless the second sense is a business sense.

Senses of try            BC %   WSJ %   Examples of senses of try
attempt (WN 1)           85%    96%     And some oil companies are trying to lock in future supplies. (WSJ)
try out/test (WN 2,4)    13%    2%      He tried the doors of the bookcase. (BC)
legal trial (WN 3,5)     2%     2%      The Gortonists were charged with blasphemy and tried for their lives. (BC)
TOTAL                    100%   100%
Table 39: Examples of common senses of try.

Although all three verbs have a different distribution of verb senses between the two corpora, they do not show a difference in subcategorization probabilities between the corpora. This is because the different senses of these verbs actually have similar subcategorization patterns. Table 40 shows that all of the senses of kill have [NP] as their primary subcategorization, although other subcategorizations are possible. Try and stay show similar patterns.

15 I interpreted the difference between sense 1 and sense 2 as meaning that sense 2 involved specifically not making literal physical movement.

Senses of ‘kill’   [0]   [NP]   [NP][PP]   [passive]   other
cause to die       9%    43%    4%         19%         0%
vote down          0%    13%    0%         0%          0%
kill a deal        0%    10%    0%         0%          0%
other              0%    3%     0%         0%          0%
TOTAL              9%    69%    4%         19%         0%
Table 40: Senses and subcategorizations of kill in WSJ.

In the case of kill, the different senses of the verb are metaphoric extensions of the main use of the verb. Senses that are very closely (polysemously or metaphorically) related, like the senses of kill and stay, tend to have similar subcategorization probabilities across corpora. However, not all metaphoric extensions share subcategorization probabilities. For example, the verb jump has two senses related by metonymy, leap and rise in price. While these have similar possible subcategorizations, the actual distribution of these subcategorizations was very different in the Brown corpus and the Wall Street Journal corpus data, due to the discourse circumstances under which each of the senses was used. The information demands in the Wall Street Journal resulted in stock price jumps being given with a distance and stopping point (jumped five eighths to five dollars a share). Alternatively, in the Brown corpus data, jump is more likely to be used to describe a manner of movement, as in (73), in which case it is the type of movement, rather than the starting and ending points of the movement, that is important.

(73) At one side of the stage a dancer jumps excitedly; nearby, another sits motionless, while still another is twirling an umbrella.

2.2.3.6 Conclusion for section 2.2.3

This section has shown that different verb senses can have different subcategorization probabilities. It also showed that different corpora tend to have a different distribution of verb senses, and that these differences in distribution can result in overall subcategorization differences between the corpora. This relationship between verb sense and subcategorization leads to an important methodological caveat as well: psychological models and experimental protocols that rely on verb subcategorization frequencies must also take verb sense into account. This result also suggests that statistical parsers that are sensitive to verb sense might be less likely to need retraining as they are applied to different domains.

The results in this section provoke the important question of how much of the cross-corpus subcategorization frequency variation is due to verb sense differences and how much is due to discourse or other factors. Unfortunately, not only is it difficult to estimate an answer for any particular case, there is also probably no global answer to this question. First, there is no large database of verb sense / subcategorization correspondences for any corpus. Even databases such as the SEMCOR extension of WordNet have insufficient data for making useful generalizations about more than the most frequent verbs. However, projects such as FrameNet (Baker et al. 1998) and NORMA (Gahl & Jurafsky 2000; Roland et al. 2000) have as their goals providing subcategorization/sense frequencies for a sampling of verbs. It would require sense/subcategorization statistics from multiple corpora to make an adequate measure of the size of verb sense based subcategorization frequency differences.

A second problem is that one would expect discourse and verb sense based differences to play different roles depending on which corpora are compared. When corpora such as the Brown corpus and the WSJ corpus are compared, one would expect discourse differences to play a smaller role, in that both corpora are written connected discourse. Alternatively, when single sentence production data, spoken connected discourse, and written connected discourse are compared, one would expect discourse factors to play a larger role, and verb sense differences to play a smaller role, particularly if the topics of discussion were similar between the different data sources. At best, one can only show which factors are causing some particular set of differences. Support for a model where core subcategorization probabilities for each sense of a verb combine with discourse and other factors to yield the observed subcategorization probabilities can only be demonstrated by controlling for the various factors and showing that the differences in subcategorization probabilities are reduced. Such a reduction in subcategorization differences will be demonstrated in the next section.

2.2.4 Results and discussion: Part 3 – Reducing sense and discourse differences decreases the differences in subcategorization probabilities.

The results in this section will provide support for the model introduced in the beginning of the chapter where core subcategorization probabilities for each sense of a verb combine with discourse and other factors to yield the observed subcategorization probabilities. This section will show preliminary evidence that a single sense tends to have a single subcategorization probability vector, when other factors are controlled for. This section will rely on data for the verb hear, which is one of the few verbs that appeared in all five corpora.

The procedure will be to show that the agreement between subcategorization vectors iteratively improves as more factors are controlled for, from a cosine of .88 for agreement between uncontrolled vectors, to a cosine of .99 for agreement between vectors controlled for verb sense as well as discourse context effects.

First, the average agreement between each of the 10 possible pairs of corpora was calculated. For example, these pairings include the Brown corpus and the Wall Street Journal corpus, the Brown corpus and the Connine data set, the Brown corpus and the Garnsey data set, the Brown corpus and the Switchboard corpus, the Wall Street Journal corpus and the Switchboard corpus, and so on. The average agreement (cosine) was .88.
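A minimal sketch of this averaging step is given below; the cosine is the one defined in Formula 1, and only the Brown and WSJ counts from Table 18 are filled in, so the remaining corpora are an assumption the reader would supply from the hand-coded data.

```python
import math
from itertools import combinations

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def average_agreement(vectors):
    """Mean cosine over every pair of corpora (10 pairs for five corpora)."""
    pairs = list(combinations(vectors.values(), 2))
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

# Per-corpus subcategorization count vectors for 'hear'; only the Brown and
# WSJ counts (Table 18) are shown here, and the other three corpora would be
# added in the same way to obtain the .88 average reported in the text.
vectors = {
    "Brown": [4, 12, 3, 1, 15, 47, 4, 14],
    "WSJ":   [0, 17, 3, 5, 13, 56, 10, 10],
}
print(round(average_agreement(vectors), 2))  # 0.98 for this two-corpus example
```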

The ‘isolated-sentence’ effect was controlled for by only comparing pairs of corpora if they were both isolated-sentence corpora or both connected-discourse corpora. Thus, the comparisons were the Garnsey data set vs. the Connine data set, the Brown corpus vs. the Wall Street Journal corpus, the Wall Street Journal corpus vs. the Switchboard corpus, and the Brown corpus vs. the Switchboard corpus. The average agreement improved to .93. Spoken versus written effects were controlled for by comparing only the Brown corpus and the Wall Street Journal corpus. The average agreement improved to .98. Finally, instead of comparing all sentences with hear in the Brown corpus to all sentences with hear in the Wall Street Journal corpus, only sentences which used the single most frequent sense of hear were compared. The average agreement improved to .99. Table 41 shows a schematic of the comparisons. Note that although verb sense is controlled for only in the final step, controlling for sense results in improvement at any point in the chart. For example, the average agreement for all corpora also improves to .89 when sense is controlled for.

Average agreement between all corpora: .88

Comparison between different discourse types? (Single Sentence vs. Connected Discourse)
  Yes: average agreement, Single Sentence vs. Connected Discourse = .84
  No:  average agreement, Single vs. Single or Connected vs. Connected = .93

Comparison between different discourse types? (Written vs. Spoken)
  Yes: average agreement, Written vs. Spoken = .91
  No:  average agreement, Written vs. Written = .98

Control for verb sense: agreement = .99

Table 41: Improvements in agreement for the verb hear.

2.2.5 Conclusion

This section has shown that subcategorization frequency variation is caused by factors including the discourse cohesion effects of natural corpora, the default referent effects of isolated-sentence experiments, the prompt given in sentence production experiments, the effects of different genres on verb sense, and the effect of verb sense on subcategorization. The evidence shows clearly that in clear cases of polysemy, such as the accuse and bill senses of charge, each sense has a different set of subcategorization probabilities. This section has not investigated subtler differences in meaning, such as in load the wagon with hay and load hay into the wagon. Such alternations are usually modeled by one of two theories. Our data is currently unable to distinguish between them. For example, a Lexical Rule account, such as Levin & Hovav (1995), might consider each valence possibility as a distinct lemma; the results merely show that these lemmas would have to be associated with lemma probabilities. An alternative constructional account, such as Goldberg (1995), would include both valence possibilities as part of a single lemma for load, with separate valence probabilities. In the constructional account, the shadings in sense are determined by the combination of lexical meaning and constructional meaning.

These experiments do have a number of implications both for cognitive modeling and for psycholinguistic methodology. This dissertation makes a psychological claim about mental representation: that each lemma contains a vector of probabilistic expectations for its arguments. These results suggest that the observed subcategorization probabilities can be explained by a probabilistic combination of these lemma probabilities with other probabilistic factors. If this is true, it supports models of human language interpretation such as Narayanan & Jurafsky (1998) that similarly rely on the Bayesian combination of different probabilistic sources of lexical and non-lexical knowledge.

2.3 Experiment – Controlling for discourse type and verb sense to generate stable cross corpus subcategorization frequencies

The goal of this experiment is to show that stable cross corpus verb subcategorization frequencies result when discourse type and verb sense are controlled for. The previous experiments in this chapter hinted at such a result, but demonstrated it for only one verb. This experiment will take a much larger number of verbs (64) and control for verb sense primarily by choosing single sense verbs. Discourse type will be controlled for by only using corpora of primarily the same general discourse type (written connected text). This analysis was originally performed to generate a set of verb transitivity biases for use in norming a series of psychological experiments (Gahl et al. 2001).

2.3.1 Data

Data for 64 verbs (shown in Table 42) was collected from three corpora: the British National Corpus (BNC) (http://info.ox.ac.uk/bnc/index.html), the Penn Treebank parsed version of the Brown Corpus (Brown), and the Penn Treebank Wall Street Journal corpus (WSJ) (Marcus et al. 1993). The 64 verbs were chosen on the basis of the requirements of separate psychological experiments, including having a single dominant sense, being easily imageable, and participating in one of several subcategorization alternations. A random sample of 100 examples of each verb was selected from each of the three corpora. When the corpus contained fewer than 100 tokens of the verb, as was frequently the case in the Brown and WSJ corpora, all of the available data was used. This data was coded for several properties: Transitive/Intransitive, Active/Passive, and whether the example involved the major sense of the verb or not. The BNC data was coded entirely by hand, while the Brown and WSJ data was hand coded after a first pass of subcategorization labeling via a tgrep search string algorithm. The same coder labeled the data for all three corpora for any given verb, in order to reduce any problems in inter-coder reliability. The coders for this data included Lise Menn, Susanne Gahl, Daniel Jurafsky, Elizabeth Elder, and Chris Riddoch, as well as the author. A preliminary analysis of the data used in this section is discussed in Roland et al. (2000).

adjust, advance, appoint, arrest, break, burst, carve, crack, crumble, dance, design, dissolve, distract, disturb, drop, elect, encourage, entertain, excite, fight, float, flood, fly, frighten, glide, grow, hang, harden, heat, hurry, impress, jump, kick, knit, lean, leap, lecture, locate, march, melt, merge, mutate, offend, play, pour, race, relax, rise, rotate, rush, sail, shut, soften, spill, stand, study, surrender, tempt, terrify, type, walk, wander, wash, watch
Table 42: 64 verbs chosen for analysis

2.3.2 Verb Frequency

Frequency differences for the target verbs were used as a measure of corpus difference because word frequency is known to vary with corpus genre. One would expect factors such as corpus genre (business for WSJ vs. mixed for BNC and Brown), American vs. British English, and the era the corpus sample was taken in (published in or before 1961 for Brown, published in 1989 for WSJ, published/created in or before 1993 for BNC) to influence word frequency.

The frequencies for each verb were calculated and Chi Square was used to test whether the difference in frequency was significant for each corpus pairing. The number of verbs that showed a significant difference was counted, using p = 0.05 as a cut-off point. This result is shown in Table 43. Although there were verbs that had a significant difference in distribution between the two mixed genre corpora (BNC, Brown), there were more differences in word frequency between the general corpora and the business corpus. The difference between the BNC/Brown comparison and the BNC and Brown vs. WSJ comparison is significant (Chi Square, p < .01).

BNC vs. Brown   BNC vs. WSJ   Brown vs. WSJ
30              46            46
Table 43: Number of verbs out of 64 showing a significant difference in frequency between corpora.
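The text does not spell out how each per-verb frequency test was set up; one plausible reconstruction is a 2x2 contingency table of verb tokens against all remaining tokens in each corpus, sketched below with hypothetical counts.

```python
from scipy.stats import chi2_contingency

def frequency_differs(count_a, size_a, count_b, size_b, alpha=0.05):
    """2x2 test: is a verb's frequency significantly different in corpora A and B?"""
    table = [
        [count_a, size_a - count_a],   # verb tokens vs. other tokens, corpus A
        [count_b, size_b - count_b],   # verb tokens vs. other tokens, corpus B
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return p < alpha

# Hypothetical counts: the verb occurs 210 times in a 1,000,000-token corpus
# and 480 times in a 1,150,000-token corpus.
print(frequency_differs(210, 1_000_000, 480, 1_150_000))  # True
```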

Table 44 shows the list of words that were significantly more frequent in both of the general corpora than they were in the business-oriented corpus. Notice that most of the verbs describe either leisure activities or at least activities and events that are not typical in business settings.

amuse, boil, burst, dance, disturb, entertain, frighten, hang, harden, hurry, impress, knit, lean, paint, play, race, sail, stand, tempt, walk, wander, wash, watch
Table 44: Verbs that BNC and Brown both have more of than WSJ.

Alternatively, when one looks at the words that had a significantly higher frequency in the WSJ corpus than in either of the other corpora (Table 45), one finds predominantly verbs that can describe stock price changes and business transactions.

adjust, advance, crumble, drop, elect, fall, grow, jump, merge, quote, rise, shrink, shut, slip
Table 45: Verbs that WSJ has more of than both Brown and BNC.

2.3.3 Subcategorization Frequency

2.3.3.1 Methodology:

For this experiment, the examples of the 64 verbs from each of the three corpora were coded for transitivity. Any use with a direct object was counted as transitive, and any other use, such as with a prepositional phrase, was counted as intransitive. The transitive category also included cases of movement, such as right dislocation. Passive uses were also included in the transitive category. Examples (74) and (75) illustrate intransitive uses, example (76) illustrates a transitive (and active) use, while examples (77) and (78) illustrate transitive (and passive) uses of the verb drop.

(74) Pretax profits for the year ended 31st March 1991, dropped by 1.24 million… (BNC)

(75) Something dropped to the floor… (BNC)

(76) Lift them from the elbows, and then drop them down to the floor. (BNC)

(77) …plans for an OSF binary interface have been dropped, apparently due to technical difficulties. (BNC)

(78) And even then the poor moribund critters had to be dropped from a great height to get a satisfactory thump… (BNC)

Verb sense was controlled for by only including sentences from the majority sense of the verb in the counts. For example, instances of drop that were phrasal verbs with distinct senses, like drop in or drop off, were not included. Metaphorical extensions of the main sense, such as a company dropping a product line, were included, however. Thus, a broadly defined notion of sense was used, rather than the more narrowly defined word senses used in some on-line word sense resources such as WordNet. This was partly for logistic reasons, since such fine-grained senses are very hard to code, and partly because very narrowly defined senses frequently have only one possible subcategorization. Coding for such senses would have thus biased the experiment strongly toward finding a strong link between sense and subcategorization bias.

Transitivity biases for each of the 64 verbs in each of the three corpora were calculated. The verbs were classed as high transitivity if more than 2/3 of the tokens of the major sense were transitive, low transitivity if more than 2/3 of the tokens of the major sense were intransitive, and as mixed, otherwise. Any token of the verb that was not used in its major sense was removed from consideration. If subcategorization biases were related to verb sense, one would expect the transitivity biases to be stable across corpora once secondary senses are removed from consideration.

2.3.3.2 Results:

Nine of the 64 verbs, shown in Table 46, had a significant shift in transitivity bias. These verbs had a different high/mixed/low transitivity bias in at least one of the three corpora.

Verb        BNC transitivity   Brown transitivity   WSJ transitivity
advance     mixed (48%)        mixed (65%)          low (19%)
crack       mixed (58%)        mixed (58%)          high (86%)
fight       low (29%)          mixed (49%)          high (64%)
float       low (22%)          low (11%)            mixed (44%)
flood       mixed (52%)        high (100%)          high (100%)
relax       low (27%)          low (30%)            mixed (65%)
soften      high (71%)         high (70%)           mixed (43%)
study       high (84%)         mixed (39%)          high (92%)
surrender   mixed (48%)        mixed (39%)          high (73%)
Table 46: Transitivity bias in each corpus.

2.3.4 Discussion:

In general, these shifts in transitivity were a result of the verbs having differences in sense between the corpora such that the senses had different subcategorizations, but were still within the broadly defined ‘main sense’ for that verb.

For seven out of the nine verbs, the shifts in transitivity are a result of differences between the WSJ data and the other data, which in turn reflect the WSJ being biased towards business-specific uses of these verbs. For example, in the BNC and Brown data, advance is a mixture of transitive and intransitive uses, shown in (79) and (80), while intransitive share price changes (81) dominated in the WSJ data.

(79) BNC intransitive: In films, they advance in droves of armour across open fields …

(80) BNC transitive: We have advanced “moral careers” as another useful concept …

(81) WSJ intransitive: Of the 4,345 stocks that changed hands, 1,174 declined and 1,040 advanced.

Crack is used to mean make a sound (82) or break (83) in the Brown and BNC data (both of which have transitive and intransitive uses), while it is more likely to be used to mean enter or dominate a group/market (transitive use) in the WSJ data; see (84) and (85).

(82) Brown intransitive: A carbine cracked more loudly …

(83) Brown intransitive: Use well-wedged clay, free of air bubbles and pliable enough to bend without cracking.

(84) WSJ transitive: But the outsiders haven't yet been able to crack Saatchi's clubby inner circle, or to have significant influence on company strategy.

(85) WSJ transitive: … big investments in “domestic” industries such as beer will make it even tougher for foreign competitors to crack the Japanese market.

Float is generally used as an intransitive verb (86), but must be used transitively when used in a financial sense (87).

(86) Brown intransitive: The ball floated downstream.

(87) WSJ transitive: B.A.T aims to … float its big paper and British retailing businesses via share issues to existing holders.

Relax is generally used intransitively (88), but is used transitively in the WSJ data when discussing the relaxation of rules and credit (89).

(88) BNC intransitive: The moment Joseph stepped out onto the terrace the worried faces of Tran Van Hieu and his wife relaxed with relief.

(89) WSJ transitive: Ford is willing to bid for 100% of Jaguar's shares if both the government and Jaguar shareholders agree to relax the anti-takeover barrier prematurely.

Soften is generally used transitively (90), but is used intransitively in the WSJ data when discussing the softening of prices (91) and (92).

(90) Brown transitive: Hardy would not allow sentiment to soften his sense of the irredeemable pastness of the past, and the eternal deadness of the dead.

(91) WSJ intransitive: A spokesman for Scott says that assuming the price of pulp continues to soften, “We should do well.”

(92) WSJ intransitive: The stock has since softened, trading around $25 a share last week and closing yesterday at $23.00 in national over-the-counter trading.

Surrender is used both transitively (93) and intransitively (94), but must be used transitively when discussing the surrender of particular items such as stocks (95) and (96).

(93) BNC transitive: In 1475 Stanley surrendered his share to the crown…

(94) Brown intransitive: … the defenders, to save bloodshed, surrendered under the promise that they would be treated as neighbors

(95) WSJ transitive: Holders can … surrender their shares at the per-share price of $1,000, plus accumulated dividends of $6.71 a share.

(96) WSJ transitive: … Nelson Peltz and Peter W. May surrendered warrants and preferred stock in exchange for a larger stake in Avery's common shares.

The verb fight is the only verb that has a different transitivity bias in each of the three corpora; with all other verbs, at least two corpora share the same bias. In the WSJ, fight tends to be used transitively, describing action against a specific entity or concept (97). In the other two corpora, there are more descriptions of actions for or against more abstract concepts (98) and (99). In addition, the WSJ differences may further be influenced by a journalistic style practice of dropping the preposition against in the phrase fight against.

(97) WSJ transitive: Los Angeles County Supervisor Kenneth Hahn yesterday vowed to fight the introduction of double-decking in the area.

(98) BNC intransitive: He fought against the United Nations troops in the attempted Katangese secession of nineteen sixty to sixty-two.

(99) Brown intransitive: But he would fight for his own liberty rather than for any abstract principle connected with it -- such as “cause”.

The verb study is generally transitive (100), except in the Brown data, where study is frequently used with a prepositional phrase (101) or to generically describe the act of studying (102). Further investigation is needed to see what might be causing this difference; possible candidates include language change (since Brown is much older than BNC and WSJ), British-American differences, or micro-sense differences.

(100) BNC transitive: A much more useful and realistic approach is to study recordings of different speakers' natural, spontaneous …

(101) Brown intransitive: In addition, Dr. Clark has studied at Rhode Island State College and Massachusetts Institute of Technology.

(102) Brown intransitive: She discussed in her letters to Winslow some of the questions that came to her as she studied alone.

The verb flood is used intransitively more often in the BNC than in the other corpora. The Brown and WSJ uses tend to be transitive non-weather uses of the verb flood (103) and (104), while the BNC uses include more weather uses, which are more likely to be intransitive (105). Further investigation is needed to determine whether this is a result of the BNC discussing weather more often, or a result of which particular grammatical structures are used to describe the weather floods in British and American English.

(103) WSJ transitive: Lawsuits over the harm caused by DES have flooded federal and state courts in the past decade.

(104) Brown transitive: The terrible vision of the ghetto streets flooded his mind.

(105) BNC intransitive: … should the river flood, as he'd observed it did after heavy rain, the house was safe upon its hill.

2.3.5 Conclusion

The goal of the work performed in this section was to find a stable set of transitivity biases for 64 verbs to provide norming data for psychological experiments.

The first result is that 55 out of 64 single sense verbs analyzed did not change in transitivity bias across corpora. This suggests that for the goal of providing transitivity biases for single sense verbs, the influence of American vs. British English and broad based vs. narrow corpora may not be large. One would, however, expect larger cross corpus differences for verbs that are more polysemous than the particular set of verbs examined in this experiment.

The second result is that for the 9 out of 64 verbs that did change in transitivity bias, the shift in transitivity bias was largely a result of subtle shifts in verb sense between the genres present in each corpus. These two results suggest that when verb sense is adequately controlled for, verbs have stable subcategorization probabilities across corpora.

To the extent that stable subcategorization biases were found, this experiment was a success, and provides evidence that controlling for verb sense can reduce the degree of cross corpus variation in subcategorization probabilities. However, there are two important caveats: firstly, the corpora used were of similar discourse type (connected written discourse), a factor which masks the possible contributions of various discourse type factors in subcategorization variation, and secondly, the actual biases found in these corpora are only meaningful to the extent to which the contexts in the experimental task being normed resemble the contexts found in the corpora.

One possible future application of this work is to use the verb frequencies and subcategorization probabilities of multi-sense verbs to measure the degree of difference between corpora.

2.4 Conclusion

The main goal of this chapter is to show that verb sense and verb subcategorization are related, and that problems in psycholinguistics and computational linguistics can be solved by taking this relationship into account and treating different senses of verbs separately. One specific problem that this chapter addressed was the apparent differences in verb subcategorization frequencies found between different corpora and psychological experiments. These differences were found to result from a combination of verb-sense based differences and other contextual factors.

These results suggest that, because of the inherent differences between isolated sentence production and connected discourse, psycholinguists should not use probabilities from one genre to norm experiments in the other. In other words, ‘test-tube’ sentences are not the same as ‘wild’ sentences. These results also show that seemingly innocuous methodological devices, such as beginning sentences-to-be-completed with proper nouns (Debbie remembered…), can have a strong effect on the resulting probabilities. Because of this, norming procedures and experimental procedures should be closely matched so as to reduce the number of unintended factors influencing experimental outcomes. This argues against the concept of a ‘generic’ subcategorization probability for use in norming all psychological experiments.

On the other hand, these results suggest that for cases where corpus data does supply appropriate frequencies, large corpora such as the BNC provide similar subcategorization probabilities to the Brown Corpus when coarse-grained subcategorizations such as transitive vs. intransitive are used. The differences found between American and British English were much smaller than the differences found between balanced corpora and genre-specific corpora such as the Wall Street Journal. It is expected that if more fine-grained subcategorizations are used, such as subdividing according to the choice of preposition in a prepositional phrase, then the differences between subcategorization probabilities from American and British English corpora will increase.

The introduction to this chapter suggested that the ideal experiment for investigating the effects of sense and discourse type on cross corpus verb subcategorization frequency differences would be to label several corpora for a variety of sense and discourse type factors, and perform a regression to find out how much each of the factors affected the subcategorization probabilities from each of the corpora. This task was rejected as being infeasible. Nonetheless, the question still remains – how important is sense, and how important is discourse type?

The results in section 2.2, particularly those in 2.2.4, suggest that sense plays a small role, while factors such as spoken vs. written and single sentence vs. connected discourse play a large role. In contrast, the results in section 2.3 present a data set where the differences in subcategorization probability are almost entirely due to verb sense differences. These results appear to present a contradiction. In fact, sense and discourse type both have the potential to be important factors. If the verbs under investigation have multiple senses, then sense-based subcategorization differences are likely, whereas if most of the verbs have either a single sense or only a single common sense, then sense-based subcategorization differences are much less likely. Similarly, if the data sources under investigation are of a variety of discourse types, then discourse-based subcategorization differences are likely, while if the data sources are of similar discourse types, then discourse-based subcategorization differences are less likely. Because of this, the answer to the question of the relative importance of sense and discourse type depends on how different the corpora are and on which verbs are being investigated.

The results in this chapter suggest that if verb sense and discourse type are controlled for, then stable cross corpus subcategorization frequencies can be obtained. This suggests that verb sense sensitive parsing algorithms would not need to be retrained for each new domain16. It also suggests the plausibility of the mental lexicon containing an underlying set of subcategorization probabilities for each verb sense.

The observed subcategorization probabilities are the result of an interaction between the underlying subcategorization probabilities for each verb sense and factors such as the probability of a particular subcategorization given a particular discourse type, the likelihood of an event occurring in the world, the likelihood of needing to talk about an event, the likelihood of creating a new referent in an experimental context, and the perceived need to produce either a long or a short response in an experiment. This suggests a difficulty in obtaining experimental results that directly reflect the underlying probabilities, since the experimental protocols also influence verb subcategorization expectations.

16 Of course, it depends on the discourse types of the corpora being parsed. However, at present, the most common use of parsers is in parsing written connected discourse.

3 Predicting verb subcategorization from semantic context using the relationship between verb sense and verb subcategorization

3.1 Overview

Chapter 2 showed how various discourse factors combined with core subcategorization probabilities for individual senses of verbs to yield the observed subcategorization probabilities for these verbs in various sets of production (corpus) data. One reason for investigating the sources of these production frequencies is that they are also the basis for probabilities used in comprehension. Theories such as the tuning theory (Mitchell et al. 1995) argue that expectations in sentence processing are based on the frequencies of each structure in previous exposure. This implies that probabilities found in comprehension data should be the same as or similar to the probabilities found in production data. If verb subcategorization probabilities are based on individual senses of verbs, then, by implication, so are comprehension probabilities. This leads to the prediction that different semantic contexts preceding a verb should bias the human parser towards particular senses of verbs, and thus, towards particular subcategorizations of the verb. Recent experiments by Hare et al. (2001) show that the semantic context preceding a verb does in fact influence parsing decisions during comprehension. This chapter will develop a model of how the semantic context preceding a verb influences subcategorization predictions, and will show that the predictions of the model match the predictions made by the human subjects in Hare et al. (2001).

3.2 Psycholinguistic evidence for the effects of verb sense on human sentence processing

The goal of this chapter is to develop a model of how the semantic context preceding a verb influences verb subcategorization predictions. This model will be tested against parsing decisions made by human subjects in Hare et al. (2001). Hare et al. (2001) have shown that the semantic context preceding a verb influences the performance of human subjects in both sentence completion and reading time tasks. Hare et al. (2001) selected 20 verbs, each with multiple possible subcategorizations. They prepared a pair of biasing contexts for each verb such that one context biased the verb towards a sense with a direct object (DO) subcategorization, and the other towards a sense with a sentential complement (SC) subcategorization. Hare et al. (2001) report on two experiments based on these sets of verbs and biasing contexts. The first was a sentence completion experiment. Subjects were given sentence completion prompts preceded by either a DO biasing context, as in (106), an SC biasing context, as in (107), or no biasing context, as in (108). All 20 bias pairs are shown in Appendix B.

(106) (DO bias) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found ______.

(107) (SC bias) The intro psychology students hated having to read the assigned text because it was so boring. They found ______.

(108) (No biasing context) They found ______.

Table 47 shows the sentence completion patterns from Hare et al. (2001) for the 40 subjects in the norming study, combined across all 20 verbs. Subjects were more likely to complete a sentence with a DO when the verb was preceded by a semantic context that favored the DO biased sense of the verb. Similarly, subjects were more likely to complete a sentence with an SC when the verb was preceded by a semantic context that favored an SC interpretation of the verb. When there was no biasing context, the subjects tended to produce DO sentence completions. This reflects the inherent DO bias found in these verbs, and corresponds with the overall DO bias Hare et al. found for these verbs in four different corpora (Brown, WSJ, WSJ87/BLLIP, Switchboard).

Bias Context       DO     SC     Other
DO bias context    70%    20%    10%
SC bias context    20%    65%    15%
No bias context    58%    22%    20%

Table 47: Sentence Completion results from Hare et al. (2001).

In a second experiment, Hare et al. (2001) used self-paced reading to show that the semantic context also affected reading times. Subjects read biasing contexts followed by an SC use of the verb that was either ambiguous (no that) or unambiguous (with that).

(109) (DO bias/ambiguous) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found the book was written poorly and were annoyed that they had spent so much time trying to get it.

(110) (DO bias/unambiguous) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found that the book was written poorly and were annoyed that they had spent so much time trying to get it.

(111) (SC bias/ambiguous) The intro psychology students hated having to read the assigned text because it was so boring. They found the book was written poorly and difficult to understand.

(112) (SC bias/unambiguous) The intro psychology students hated having to read the assigned text because it was so boring. They found that the book was written poorly and difficult to understand.

They found that the reading times were influenced by the biasing context and the ambiguous/unambiguous nature of the target sentence. For example, the noun phrase the book is ambiguous in both (113) and (114), because it can fill the role of either the direct object of the verb found, or the subject of the verb was. However, because the preceding context is different, subjects expect the book to be a DO in example (113). Therefore, it takes longer to read the words following the book than it does in example (114), where the subjects are expecting a sentential complement. The average reading times across all 20 verbs are shown in Figure 5. Note that the second word after the noun phrase has a significantly slower reading time in the DO bias condition than it does in the SC bias condition. This word corresponds to the word written in example (113).

(113) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. They found the book was written poorly and were annoyed that they had spent so much time trying to get it.

(114) The intro psychology students hated having to read the assigned text because it was so boring. They found the book was written poorly and difficult to understand.

[Plot omitted: reading times (approx. 300-440 ms) for the ambiguous conditions at the positions V, det, noun, dis1, dis2, pdis1, pdis2.]

Figure 5: Effect of bias context on reading times in ambiguous condition, from Hare et al. (2001). (DA = DO bias, ambiguous condition; SA = SC bias, ambiguous condition)

When the complementizer that is present in the sentences, the DO and SC biases also affect reading times. In example (115), which has a DO bias context, subjects are expecting a direct object after the verb. Therefore, the reading times when the subjects get to the complementizer are significantly slower than when there is an SC bias context, as in example (116). Figure 6 shows the cross-verb average reading times when the complementizer that is present. The reading time for the word that is significantly slower when there is a DO bias than when there is an SC bias.

(115) The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. Finally, he admitted that the students had little chance of getting into the course.

(116) For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. Finally, though, he admitted that the students had little chance of succeeding.

[Plot omitted: reading times (approx. 300-380 ms) for the unambiguous conditions at the positions V, that, det, noun, dis1, dis2, pdis1, pdis2.]

Figure 6: Effect of bias context on reading times in unambiguous condition, from Hare et al. (2001). (DU = DO bias, unambiguous condition; SU = SC bias, unambiguous condition)

Both the sentence completion and reading time data from Hare et al. (2001) demonstrate that the semantic context preceding the verb can influence parsing decisions in human sentence processing. This influence is hypothesized to take place by means of the context biasing the parser towards a particular sense of the verb, which, in turn, biases the parser towards a particular subcategorization.

3.3 Model for predicting subcategorization from semantic context

The previous section described psycholinguistic experiments showing how the semantic context preceding a verb influences subcategorization predictions. This section will present a method, illustrated in Figure 7, for modeling such influences.

[Diagram omitted. Figure 7 shows the model as a flowchart: the context leading up to the verb and the verb itself are the input; the context leading up to the verb is compared with the contexts that have led up to previous examples of the verb; and a subcategorization prediction is made based on the subcategorizations of the “most similar” previous examples.]

Figure 7: Predicting subcategorization from the context preceding the verb.

This model is intended as an overall model in which many factors combine with verb sense predictions to yield verb subcategorization predictions. Thus, “most similar” in Figure 7 is intended to include similarity in terms of verb sense as well as other factors such as discourse type, genre type, the thematic fit of the subject of the verb with the various senses of the verb, and the recency of various syntactic patterns.


The model will be implemented by comparing the context leading up to the target verb with the contexts that have preceded corpus (training) examples. The model then guesses whichever subcategorization has semantic contexts that most closely resemble the context preceding the target verb. Latent Semantic Analysis (LSA), described in the next section, is used to measure semantic similarity between the target context and the training examples. The experiments in this chapter will demonstrate that the model used makes the same predictions as human subjects in Hare et al. (2001) based on the same contexts which preceded the verbs in the human subject experiment. The human subjects predicted an SC completion given an SC 74 bias context, and a DO completion given either a DO bias context or a neutral context. The subjects also preferred DO completions in the absence of a specific bias context, because the verbs in the study are more frequently used with DO completions. In order to model this data, the model in this chapter must predict SC completions given SC bias contexts. When given a DO bias context, the model must either predict a DO completion or make a neutral prediction, since the model can then fall back on the default DO preference of the verbs given a neutral context. The goal of the model is not to model a time course of human processing, where subcategorization predictions are revised with each new word. Rather, the model will show that the appropriate subcategorization prediction can be made at the point at which the human subject (or the computational model) has read the verb in the target sentence. Although the model is intended to make the same predictions as humans given the same input based on information gained through previous exposure to language, it is not intention to claim that the exact details of the model are the same mechanisms used by humans.

[Diagram omitted. Figure 8 shows a target sentence to be disambiguated (“The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. Finally, he admitted [????]”) being compared for semantic similarity against 200 corpus examples of contexts preceding DO uses of admit and 200 corpus examples of contexts preceding SC uses of admit, with the closest examples determining the prediction.]

Figure 8: Use of semantic similarity to predict subcategorization.

3.3.1 Previous related uses of LSA

The semantic similarity between the target bias contexts and the corpus examples will be measured using latent semantic analysis (LSA). LSA is a technique in which words or sets of words, such as sentences, are represented as vectors in a high-dimensional semantic space. The semantic space is created by using Singular Value Decomposition (SVD) to decompose a word co-occurrence matrix, which is then reconstructed with a reduced dimensionality. The details of this process are described in Landauer, Foltz, & Laham (1998), Schütze (1998), and Schütze (1997). The use of such semantic spaces for finding the degree of similarity between words or documents was originally developed for document retrieval tasks under the name Latent Semantic Indexing (Deerwester, Dumais, Furnas, Landauer, & Harshman 1990). It has since been used in a variety of tasks to model human judgments. Landauer & Dumais (1997) showed that LSA could be used to choose the appropriate synonym on the standardized TOEFL test with the same degree of accuracy as the average foreign student taking the test. Rehder et al. (1998) used LSA to select appropriate instructional materials for students by comparing essays written by the students with the instructional materials.

LSA has been shown to be effective in word sense disambiguation tasks. The present goal of predicting verb subcategorizations based on the preceding semantic context is a form of word sense disambiguation, because the subcategorizations being predicted are associated with particular senses of the verbs. If there were a one-to-one correspondence between verb subcategorization and verb sense, then the subcategorization prediction task would be identical to the verb sense disambiguation task. Even if there is not a one-to-one relationship, knowing the sense of the verb should help in predicting the subcategorization, as long as the different senses of the verb have different distributions of subcategorizations.

Schütze (1998, 1997) used a method based on SVD to disambiguate ambiguous words and pseudo-words. Schütze prepared a semantic space using a 20,000-by-1,000 word co-occurrence matrix based on 17 months of text from the New York Times News Service. The matrix was decomposed using SVD, and reduced to a 100-dimensional space. Any word within the list of 20,000 words can thus be represented by a 100-dimensional vector in this space.
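To make the dimensionality reduction step concrete, the following sketch builds a reduced semantic space from a toy co-occurrence matrix with a truncated SVD. It is a minimal illustration in Python using numpy, not the implementation used by Schütze or by the LSA tools; the matrix, the dimensionality, and the function name are invented for the example.

    import numpy as np

    def build_semantic_space(cooccurrence, n_dims):
        """Reduce a word-by-context co-occurrence matrix to n_dims dimensions
        using a truncated singular value decomposition."""
        U, s, Vt = np.linalg.svd(cooccurrence, full_matrices=False)
        # Keep only the n_dims largest singular values; each row of the result
        # is the reduced vector for one word.
        return U[:, :n_dims] * s[:n_dims]

    # Toy example: 6 words by 5 contexts, reduced to 2 dimensions.
    counts = np.array([[2, 0, 1, 0, 0],
                       [1, 0, 2, 0, 0],
                       [0, 3, 0, 1, 0],
                       [0, 2, 0, 2, 0],
                       [0, 0, 0, 1, 3],
                       [0, 0, 1, 0, 2]], dtype=float)
    word_vectors = build_semantic_space(counts, n_dims=2)
    print(word_vectors.shape)  # (6, 2)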

The disambiguation algorithm in Schütze (1998) and Schütze (1997) uses the following steps17:

1. Take all examples of the word to be disambiguated in the training set. There were between 1,030 and 21,374 examples of each word in the training set.

2. Prepare a context vector to represent each example. The context vector is created by adding the individual vectors for each word within a window of the target word. Window sizes of between 2 and 50 words on either side of the target were used. No significant benefit was found for including words at a distance greater than 15 words. Log inverse document frequency was used to weight the individual words when the context vector was created. This weighting favors words that occur in a few documents, and lessens the impact of words that occur in many documents, since these are less likely to be semantically important.

3. The context vectors for all of the examples of the target word in the training data are clustered into a small number of clusters using an automatic clustering algorithm. Each cluster represents an automatically generated sense of the target word. Either 2, 5, 10, 20, or 50 clusters were used. A larger number of clusters implies a finer-grained set of sense distinctions. A sense vector is prepared for each cluster: the sense vector is the centroid of the cluster, and the identity of the sense is determined by the majority sense in the cluster.

Words in the test set were then disambiguated as follows:

1. A context vector is prepared for the word to be disambiguated, using the same size window and weighting scheme as used in the training data.

2. The context vector for the word is then compared to the sense vectors created above. The word is then assigned the sense of the closest sense vector.

17 Several variations are reported in the two papers – this description is based on the simplest version.

Schütze tested two types of items. One type of item was pseudo-words. These are artificially created ambiguous words formed by replacing two separate unrelated words (or phrases, in this case) with a single pseudo-word. The disambiguation task is to figure out for each pseudo-word which of the original words or phrases had been at that location. For example, the phrases wide range and consulting firm can be replaced throughout the training and test set with the same pseudo-word. Whenever the pseudo-word appears in the test set, the task is to figure out whether wide range or consulting firm had originally appeared at that location. The advantage of using pseudo-words is that large quantities of training and test data can be automatically generated without the expense of hand labeling the data. The disadvantage is that disambiguating the pseudo-words may be an artificially easy task, since true ambiguous words may not have as clear a meaning distinction.

The other type of item tested by Schütze was true (naturally occurring) ambiguous words. An example would be the word capital, which can either mean investment money or the location of a government. Table 48 shows the word sense disambiguation results from Schütze (1997). This table lists the ambiguous word or pseudo-word phrase that was disambiguated, and the accuracy (percent correct) of the algorithm on the test set.

Pseudo-words                              Accuracy
wide-range / consulting-firm              74%
heart-disease / reserve-board             97%
urban-development / cease-fire            99%
drug-administration / Fernando-valley     98%
economic-development / right-field        100%
national-park / judiciary-committee       95%
Japanese-companies / city-hall            94%
drug-dealers / Paine-Webber               94%
league-baseball / square-feet             93%
Pete-Rose / nuclear-power                 96%

Ambiguous Words                           Accuracy
capital/s                                 90%
interest/s                                91%
motion/s                                  86%
plant/s                                   94%
ruling                                    92%
space                                     85%
suit/s                                    95%
tank/s                                    93%
train/s                                   85%
vessel/s                                  94%

Results based on the version of the algorithm described above, with the following values for the two variables: window size = 50, number of clusters = 10.

Table 48: Word sense disambiguation results from Schütze (1997).

Although the model implemented in this chapter is similar to the procedure used by Schütze (1997), it is also different in several ways. Schütze’s methodology would imply that the way to perform subcategorization prediction would be to take the 200 DO and 200 SC corpus examples of each verb and cluster them in semantic space. This is illustrated in Figure 9. Each cluster would then be assigned a subcategorization based on the majority membership in that cluster. The subcategorization predicted for the target context would be determined by finding which cluster centroid was closest to the vector for the context preceding the verb.

[Diagram omitted: a target context plotted in LSA semantic space among clusters of DO and SC corpus examples.]

Figure 9: Disambiguating the subcategorization of a target using Schütze style clusters in LSA semantic space.

However, this methodology will be modified by not pre-clustering the corpus examples. Instead, the relationships between the corpus examples and the target context will be used directly, and the subcategorization prediction for the target verb will simply be based on the subcategorization of the majority of the nearest N neighbors. This is illustrated in Figure 10. This modification allows the subcategorization to be determined by only the closest, and therefore, most relevant corpus examples. This potentially allows for changes in prediction based on more subtle differences in target contexts than would be possible with the original methodology. While similar target contexts may be closest to the same cluster, they could have different sets of nearest neighbors. This change in methodology also allows the effect of semantic distance on the accuracy of the subcategorization predictions to be investigated.

[Diagram omitted: the same semantic space as Figure 9, with neighborhoods of size 3 and size 10 drawn around the target context; the prediction is the majority subcategorization within the neighborhood.]

Figure 10: Disambiguating the subcategorization of a target using the subcategorizations of the nearest neighbors.
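The nearest-neighbor voting scheme sketched in Figure 10 can be illustrated with the following fragment. This is a simplified Python rendering written for illustration, not code from the dissertation; the function names are invented, and the vectors are assumed to be LSA context vectors computed elsewhere (e.g., with the tools at lsa.colorado.edu).

    import numpy as np

    def cosine(a, b):
        """Cosine of the angle between two LSA vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def predict_subcategorization(target_vec, example_vecs, example_labels, n_neighbors):
        """Predict 'DO' or 'SC' from the labels of the n most similar corpus examples."""
        sims = [cosine(target_vec, v) for v in example_vecs]
        # Rank the corpus examples from most to least similar to the target context.
        ranked = sorted(zip(sims, example_labels), reverse=True)
        votes = [label for _, label in ranked[:n_neighbors]]
        do, sc = votes.count("DO"), votes.count("SC")
        if do == sc:
            return "tie"  # treated as a 50/50 guess in the experiments below
        return "DO" if do > sc else "SC"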

3.3.2 Details of how LSA is used to measure semantic similarity

Semantic similarity between each of the bias contexts and corpus examples is measured in this chapter using the cosine of vectors for each in LSA semantic space. The cosine between the vectors is used rather than the geometric distance between the end points of the vectors because the length of the vector is related to the strength of knowledge about a word or sentence while the direction of the vector is related to the semantics of the word or sentence. The LSA tools used are publicly available at http://lsa.colorado.edu.
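As a small illustration of why the cosine is preferred, the snippet below (an invented toy example, not taken from the LSA tools) scales a vector without changing its direction: the Euclidean distance between the two vectors becomes large, while the cosine still treats them as identical in meaning.

    import numpy as np

    a = np.array([1.0, 2.0, 0.5])
    b = 3.0 * a  # same direction ("meaning"), three times the length ("strength")

    euclidean_distance = np.linalg.norm(a - b)                                   # about 4.58
    cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0

    print(euclidean_distance, cosine_similarity)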

The experiments in this chapter rely on the LSA TASA semantic space with 300 dimensions and “document” weighting. The TASA semantic space is based on the TASA corpus (Zeno, Ivens, Millard, & Duvvuri 1995) of school text book samples, and was chosen out of the available semantic spaces, because it is felt to most closely approximate18 the knowledge of a typical college undergraduate. The size of 300 dimensions was selected because this has been found to be an optimal size in other work. The optimization of this variable was not investigated in this work.

Several options are available for combining the vectors of individual words when creating the vector to represent a whole sentence or larger context. In all experiments in this dissertation, the vectors for each word were weighted by inverse document frequency, as was done by Schütze. This is known as “document” weighting in the LSA tools. A comparison of several weighting methods is described in section 3.4.2.

18 Of course, most language exposure comes from spoken language and TV rather than from reading text books in school, but this space seems to provide reasonable results for the task at hand, and there are no corpora covering the typical language exposure of an individual from birth through college from which a semantic space could be created.

3.3.3 Corpus (training) data used in model

For use in all experiments in this chapter, subcorpora of 200 randomly selected DO reference examples and 200 randomly selected SC reference examples were prepared for 20 verbs. These verbs are shown in Table 49. These verbs and subcategorizations were selected for use in the experiments that model the psychological data from Hare et al. (2001). Each corpus example consisted of the sentence containing the target verb plus the two preceding sentences. The subcorpora were created from the British National Corpus (BNC). The BNC was chosen for its large size (100 million words). Smaller corpora such as TASA (17 million words) (Zeno et al. 1995) and the Brown Corpus (1 million words) (Francis & Kucera 1982) were investigated; however, these did not have a sufficiently large number of examples for all of the verbs. Even with the large BNC corpus, the extraction process described below resulted in some verb/subcategorization combinations having a total count of less than 200.

The subcorpora were created in a multi-step process. First, all sentences containing the target verbs were extracted from the BNC. These sentences were then parsed using the Charniak parser (Charniak 1997) (version 4). The parsed sentences were then prepared for use with the tgrep tools provided with Treebank (Marcus et al. 1993). A series of tgrep scripts were used to extract DO and SC examples for each of the verbs. Because of the relatively high rate of error for some verb/subcategorization combinations in this process, the output was hand-checked, and incorrectly identified examples were removed from the subcorpora. A random sample of 100 examples was re-checked, and indicated an agreement rate of 98% between the initial hand coding and the re-checking. The 200 examples for each subcorpus were randomly selected from the output of the hand-correcting process. Verbs with fewer than 200 corpus examples per subcategorization were not used in the analyses below, reducing the number of verbs from 20 to 15. The final sample size for each verb is indicated in Table 49. Note that, because of the process used to generate these subcorpora, these sample sizes are potentially lower than the actual corpus frequencies of these verbs, due to false-negative errors in the parsing and tgrep search steps.

Verb          DO Sample Size    SC Sample Size
Acknowledge   200               200
Add           200               200
Admit         200               200
Anticipate*   200               183
Bet*          131               65
Claim         200               200
Confirm       200               200
Declare       200               200
Feel          200               200
Find          200               200
Grasp*        200               37
Indicate      200               200
Insert*       200               1
Observe       200               200
Project*      200               8
Recall        200               200
Recognize     200               200
Reflect       200               200
Report        200               200
Reveal        200               200

Table 49: Sample size for each verb. (Verbs marked with * were used in Hare et al. (2001), but were not used in the experiments in this dissertation due to small sample sizes.)

3.4 Experiments

3.4.1 Predicting the subcategorizations of the Hare et al. (2001) bias contexts

The goal of this experiment is to see if the model proposed above can make the same subcategorization predictions as human subjects do, based on the bias contexts used in Hare et al. (2001). The model's subcategorization prediction for a given bias context is determined by whether that bias context is more similar to the DO corpus examples or the SC corpus examples. Because the corpus examples include a variety of senses with varying (DO and SC) subcategorizations, the subcategorization prediction is based on the subcategorizations of only the most semantically similar (as determined by LSA) corpus examples.

This was done as follows. For each of the 15 verbs from Hare et al. (2001) for which 200 DO and 200 SC corpus examples were prepared, the cosines between each of the Hare et al. (2001) bias contexts and the 400 corpus examples for the appropriate verb were calculated using LSA and the TASA semantic space described above. For each bias context, the corpus examples were ranked by cosine from most similar to least similar. For each possible size of neighborhood, the number of SC and DO examples in the neighborhood was calculated. At a neighborhood size of 1, the subcategorization prediction is simply the subcategorization of the most similar corpus example. At a neighborhood size of 10, the subcategorization prediction is whatever the majority subcategorization is of the closest 10 corpus examples. At the largest possible neighborhood size, 400, the subcategorization prediction is split between DO and SC, because the neighborhood includes all 200 DO and all 200 SC examples.
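The per-neighborhood-size calculation just described can be sketched as follows. This is an illustrative Python rendering under the assumption that the LSA vectors for the bias context and the 400 corpus examples have already been computed; the function name is invented.

    import numpy as np

    def sc_proportion_by_neighborhood(target_vec, example_vecs, example_labels):
        """For each neighborhood size n, return the proportion of SC examples
        among the n corpus examples most similar to the target context."""
        sims = [float(np.dot(target_vec, v) /
                      (np.linalg.norm(target_vec) * np.linalg.norm(v)))
                for v in example_vecs]
        # Rank the corpus examples from most to least similar to the target.
        ranked_labels = [label for _, label in sorted(zip(sims, example_labels), reverse=True)]
        proportions, sc_so_far = [], 0
        for n, label in enumerate(ranked_labels, start=1):
            sc_so_far += (label == "SC")
            proportions.append(sc_so_far / n)
        return proportions

    # The prediction at neighborhood size n is SC if proportions[n - 1] > 0.5,
    # DO if it is < 0.5, and a 50/50 guess if it is exactly 0.5.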

3.4.1.1 Results and discussion

This section describes the results of modeling subcategorization predictions based on the biasing contexts from Hare et al. (2001). The results of modeling the SC bias contexts will be discussed first, followed by the results of modeling the DO bias contexts. The SC bias contexts were found to be more similar to corpus SC examples than to corpus DO examples, as expected. Figure 11 shows the percentage of SC examples at each neighborhood size, averaged across the 15 verbs. There are three main points of interest in this graph. First, neighborhoods of fewer than ten or so nearest examples show a high degree of variability in the subcategorization bias. This is due to the small sample size inherent in these neighborhoods. Recall that for an individual verb, at a neighborhood size of 1, the neighborhood is either 100% SC or 100% DO. When a large number of bias contexts is considered, this variability is averaged out. However, in this experiment, there are only 15 DO and 15 SC bias contexts.

The second area of interest is the area with neighborhood sizes of between 10 and 50. This area shows a high concentration of SC corpus examples. This indicates that the most similar corpus examples to the bias contexts are much more likely to be SC examples. This shows that the bias contexts are in fact semantically similar to the contexts that precede corpus SC examples.

The third area of interest is the area with neighborhood sizes between 50 and 400. In this region, the line is fairly flat, indicating a relatively similar frequency of SC and DO examples. This region consists of corpus examples that are not particularly related to the bias contexts, and thus do not show an SC or DO preference. However, this region is still useful in the subcategorization prediction task, because the neighborhoods still contain a majority of SC examples.

If the model were unsuccessful, and found no relationship between the semantic context and subcategorization, one would expect to find an equal number of SC and DO examples at any neighborhood size.

[Plot omitted: percentage of SC corpus examples in the neighborhood of the SC bias contexts (approx. 40-58%) as a function of neighborhood size (0-400).]

Figure 11: Average % SC corpus examples in neighborhood of SC bias contexts.

Figure 11 shows that the subcategorization bias of the corpus examples in the neighborhood of each biasing context can be used to predict subcategorization. Because this graph shows the average of 15 (SC) bias contexts, it does not directly translate to a measure of the accuracy of the model. The accuracy of the model at each neighborhood size was also calculated, based on the percentage of the 15 bias contexts for which the model made the correct prediction at each neighborhood size. If the majority of the examples in the neighborhood were SC examples, then the model was counted as being correct. If the majority of examples were DO examples, then the model was counted as being wrong. If there was a 50/50 split, which is possible at any even-numbered neighborhood size, and guaranteed at the neighborhood size of 400, then the model made a 50/50 guess at the subcategorization of the target context. The accuracy at each neighborhood size is shown in Figure 12. In general, the accuracy of the model increases as the neighborhood size increases. Although the nearest neighbors are the most relevant, and have the strongest bias towards being SC examples, as indicated by Figure 11, the model also benefits from the larger sample size inherent in the larger neighborhoods. The maximal result is approximately 85% accuracy. This is less than the sense disambiguation accuracy achieved in Schütze (1998) and Schütze (1997). However, the results from Schütze provide an upper bound on accuracy for this type of technique, based on the case where there is a one-to-one relationship between sense and subcategorization. For the verbs used in Hare et al. (2001), many of the SC uses can be paraphrased using a DO structure. The significance of the relationship between the DO and SC alternations that are possible with these verbs will be addressed in section 3.4.3.

[Plot omitted: accuracy of the model in predicting the subcategorization of the SC bias sentences (0-1) as a function of neighborhood size (0-400).]

Figure 12: Accuracy in predicting the subcategorization bias of the SC bias contexts.

Although the model was successful at predicting the subcategorization of the SC bias contexts, it was much less successful at predicting the subcategorization of the DO bias examples. However, subsequent experiments will show that this is due to properties of the DO bias contexts, rather than due to a flaw in the model. Figure 13 shows that after the initial instability of the closest neighborhood, due to the small sample size, the neighborhoods at subsequent neighborhood sizes consist of roughly an equal number of SC and DO examples. Additionally, Figure 14 shows that the accuracy is roughly 50%, and drops below 50% as the neighborhoods start to have a majority of SC examples at a neighborhood size larger than 250.

[Plot omitted: percentage of DO corpus examples in the neighborhood of the DO bias contexts (approx. 40-54%) as a function of neighborhood size (0-400).]

Figure 13: Average % DO corpus examples in neighborhood of DO bias contexts.

[Plot omitted: accuracy of the model in predicting the subcategorization of the DO bias sentences (0-0.8) as a function of neighborhood size (0-400).]

Figure 14: Accuracy in predicting the subcategorization bias of the DO bias contexts.

The results of this analysis show that the model proposed in this chapter can predict the correct verb subcategorization for the SC bias contexts in Hare et al. (2001), but not for the DO bias contexts. This is because, within the LSA semantic space used, the DO bias contexts are no more like the contexts preceding the corpus DO examples than like the contexts preceding the corpus SC examples. There are three general classes of possible explanation for this, two of which can be ruled out by the results of the remaining experiments in this chapter.

One possible reason is that the LSA semantic space may not appropriately capture the semantics that precede typical DO uses of the verbs in this experiment. If the LSA semantic space did not appropriately capture the relevant semantics, then one would not expect the DO bias contexts to be any more like the DO corpus examples than any other randomly selected set of corpus examples. However, experiment 3.4.2 will show that the model can successfully predict subcategorizations based on the contexts preceding naturally occurring DO examples found in the BNC for the same verbs.

A second possibility is that the verb senses on which the DO bias contexts in Hare et al. (2001) rely are rare, and are not sufficiently represented in the corpus data on which the model relies. If the DO corpus examples were primarily of different senses than those used in the bias contexts, then one would also expect an inability of the model to predict the correct subcategorization. This is because the model would treat the SC examples and the different sense DO examples as being similarly unrelated to the DO bias contexts. However, experiment 3.4.3 will show that the model is capable of making the correct subcategorization prediction even when the corpus data contains only a small quantity of same-sense DO examples.

The third possibility is that the DO bias contexts used in Hare et al. (2001) are not like the contexts that precede typical DO uses of these verbs, or in other words, the bias contexts are not DO bias contexts, but are “not SC” contexts. If this were the case, one would still have to explain why the subjects in the Hare et al. (2001) experiments produced DO uses in the sentence completion task when they were given the DO bias contexts, and why the subjects behaved as if they were expecting DO uses of the verbs in the reading time experiments after the DO bias contexts. The experiments in this chapter suggest that the DO expectations are not the result of particular contextual biases towards DO uses of these verbs, but are instead caused by inherent DO biases in these verbs. Table 50 shows corpus subcategorization frequencies for these verbs reported in Hare et al. (2001). For all corpora examined, the DO use of the verb is either the most frequent subcategorization, or at least more frequent than the SC use. Because Hare et al. (2001) do not report the breakdown of the Other category, it is not clear whether the DO uses in Switchboard are the most frequent use, or just more frequent than the SC uses.

Corpus         DO     SC     Other
Brown          43%    18%    29%
WSJ            38%    35%    20%
WSJ87/BLLIP    36%    31%    28%
Switchboard    31%    19%    50%

Table 50: Average subcategorization frequencies for 15 verbs used in experiment 3.4.1, taken from corpus frequencies reported in Hare et al. (2001).

Additional evidence comes from the sentence completion data from Hare et al. (2001). Table 52 shows the sentence completion frequencies for the 15 verbs used in experiment 3.4.1. Although the subjects produce DO verb uses after the DO bias context, and SC uses after the SC bias context, it is important to note that they also produce DO uses in a neutral context. Sample sentence completion prompts are shown in Table 51.

Bias context    Sentence completion prompt
No bias         He admitted ______.
DO bias         The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. Finally, he admitted ______.
SC bias         For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. Finally, though, he admitted ______.

Table 51: Sample sentence completion prompts.

Bias Context       DO     SC     Other
No bias context    55%    24%    21%
DO bias context    64%    25%    11%
SC bias context    12%    70%    18%

Table 52: Average subcategorization frequencies for 15 verbs taken from the sentence completion experiment in Hare et al. (2001).

The evidence presented in this section suggests that the LSA-based model is making correct subcategorization predictions based on the bias contexts used in Hare et al. (2001). However, it also suggests that the algorithm used in the model needs to be modified from “choose the subcategorization of the most similar corpus examples (or previously experienced examples, in the case of human sentence processing)” to “choose the subcategorization of the most similar examples, but choose the most frequent sense and subcategorization as a default if the context doesn't provide strong evidence for any other choice”.
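One simple way to state this modified decision rule in code is sketched below; the threshold value, the default, and the function name are illustrative assumptions rather than parameters taken from the dissertation.

    def predict_with_default(neighbor_labels, default="DO", threshold=0.6):
        """Choose the subcategorization of the most similar examples, but fall back
        to the verb's dominant subcategorization when the neighborhood does not
        provide strong evidence either way."""
        sc_share = neighbor_labels.count("SC") / len(neighbor_labels)
        if sc_share >= threshold:
            return "SC"
        if sc_share <= 1.0 - threshold:
            return "DO"
        return default  # weak or conflicting contextual evidence

    print(predict_with_default(["SC"] * 7 + ["DO"] * 3))  # SC
    print(predict_with_default(["SC"] * 5 + ["DO"] * 5))  # DO (the default)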

3.4.2 Predicting the subcategorizations of corpus bias contexts

The previous section showed that the model in this chapter was able to predict the subcategorizations of the SC bias contexts from Hare et al. (2001), but not the subcategorizations of the DO bias contexts. This leaves open the question of whether the model is capable of predicting both SC and DO subcategorizations. This experiment will investigate the ability of the model to predict verb subcategorization based on the contexts preceding naturally occurring corpus examples, in the hope that the contexts preceding corpus examples are somehow different from the artificially constructed contexts preceding the Hare et al. examples. The same 200 DO and 200 SC corpus examples for each verb are used as in the previous experiment. For this experiment, rather than predicting the subcategorization of the experimental bias contexts based on the degree of similarity with the 400 corpus examples, the subcategorization for each of the 400 corpus examples is predicted separately, based on the similarity with 398 of the remaining corpus examples. One example of the opposite subcategorization is removed from the corpus data, so that each target context is compared with 199 DO examples and 199 SC examples. The experiment essentially uses a training set of 398 items with a test set of 1 item, and 400-fold cross-validation.

3.4.2.1 Results and discussion

The results of this experiment show that the model is able to correctly predict the subcategorization of both DO and SC corpus examples. Figure 15 shows the number of SC examples in the neighborhood of each of the SC corpus examples, averaged across 200 examples for each of 15 verbs. As in the previous experiment, there is a peak concentration of SC examples in the neighborhood closest to the target context, and there is a majority of SC examples at any neighborhood size. This indicates that the model can correctly predict the subcategorization of the SC corpus examples.

[Plot omitted: percentage of examples with the same subcategorization in the neighborhood of the SC corpus examples (approx. 40-65%) as a function of neighborhood size (0-400).]

Figure 15: Average % SC corpus examples in neighborhood of SC target contexts.

Figure 16 shows how this distribution of SC and DO examples translates into percent accuracy, based on the same scheme as in experiment 3.4.1, where the model votes for whichever subcategorization is in the majority at a given neighborhood size. Again, the model makes a 50/50 random choice when the neighborhood is evenly split between DO and SC examples. Even though the closest examples are most useful in predicting subcategorization, the accuracy increases as neighborhood size increases. This is the result of a tradeoff between the relevance of the closest examples, and the reduction of noise in the sample as sample size increases with larger neighborhood size.

[Plot omitted: accuracy in predicting the subcategorization of the corpus SC examples (0-100%) as a function of neighborhood size (0-400).]

Figure 16: Accuracy in predicting the subcategorization bias of the SC corpus contexts.

The results for the DO corpus contexts are more complex than the results for the SC corpus examples. This is because the DO data contains many uses of verbs that are similar to the SC uses of the verbs. For example, a sentence containing “he admitted that …” would count as an SC use, while the similar “he admitted the fact that …” would count as a DO use of the verb admit. Figure 17 shows the number of DO examples at each neighborhood size. As expected, there is an initial high concentration of DO examples in the closest neighborhood sizes. However, at a neighborhood size of approximately 150, the neighborhoods start to actually contain a majority of SC examples. This counterintuitive shift is caused by the SC/DO sense confound mentioned above. However, this data does not directly show that the model can distinguish between the DO/SC sense alternations such as the one shown above. This issue will be addressed by experiment 3.4.3, which will show that the model is able to distinguish between the contexts preceding the SC and DO related senses of the verbs.

[Plot omitted: percentage of examples with the same subcategorization in the neighborhood of the DO corpus examples (approx. 48-58%) as a function of neighborhood size (0-400).]

Figure 17: Average % DO corpus examples in neighborhood of DO corpus contexts.

Figure 18 shows the accuracy in predicting the subcategorization of the corpus DO examples. Although the model does predict the correct subcategorization in the closest neighborhoods, the accuracy rapidly drops off to below chance, as the neighborhoods start to include the semantically related SC examples. An additional factor affecting the accuracy in identifying the subcategorizations of the DO examples relative to the accuracy in identifying the subcategorization of the SC examples is that the SC examples are primarily of a single sense of the verb, while the DO examples contain many unrelated senses. This means that the relevant (same sense) training data for SC examples is the whole set of 199 examples, while the relevant training data for the DO examples is only a fraction of the whole set.

[Plot omitted: accuracy in predicting the subcategorization of the corpus DO examples (0-70%) as a function of neighborhood size (0-400).]

Figure 18: Accuracy in predicting the subcategorization bias of the DO corpus contexts.

3.4.2.2 Additional analysis using corpus bias contexts

The corpus examples and contexts from experiment 3.4.2 were also used to test several parameters of the model. All of the experiments in this dissertation rely on inverse document frequency weighting, or “document” weighting19, when combining the LSA vectors of individual words to create the overall vector for the sentence or sentences. However, other methods of combining the vectors are possible. Experiment 3.4.2 was run using four possible methods in order to investigate the effects of the different methods on the accuracy of the model.

The first method, mentioned above, is “document” weighting, where the vectors for each word are weighted for inverse document frequency. In constructing the single vector which represents a bias context or reference context, document weighting favors words in the text that have more potential to discriminate between documents. In part, this lowers the weight of low semantic content words such as function words and increases the weight of words with more semantic content (even if this content is not relevant in discriminating between the DO and SC use of the target verb).

A second method of combining the vectors of the individual words in a text to obtain an overall vector for the whole text is “term” weighting. Unlike in document weighting, the individual vectors are not weighted. This method of combining the vectors for the words in a text allows lower content words to have more of an influence on the overall vector for the text.

19 “Document” and “term” weighting refer to options available in the tools located at http://lsa.colorado.edu.

Additionally, it is possible to normalize the vectors of the individual words in a text before combining them with the vectors for the other words. The length of the vector for each word is proportional to the frequency of that word in the corpus from which the semantic space is prepared. When the vectors for individual words are not normalized, words that occur more frequently in the corpus used to generate the semantic space have a stronger influence on the overall vector for a text. Normalizing the vectors of the individual words has been found to be beneficial in some cases (T. K. Landauer, personal communication, April 2001), since it reduces the influence of word frequency on the combined vector, but also has a side effect of increasing the error in the combined vector when words with a very low frequency are involved. This is due to the normalization increasing the weight of the noise in the vectors of the low frequency words.
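The four combination methods can be summarized in a single sketch. The code below is an illustrative Python reading of the descriptions above (inverse-document-frequency weighting vs. an unweighted sum, with optional normalization of each word vector); the function name and argument layout are assumptions, and the actual computations were performed with the LSA tools at lsa.colorado.edu rather than with this code.

    import numpy as np

    def text_vector(tokens, lexicon, doc_freqs, n_docs,
                    weighting="document", normalize=False):
        """Combine the LSA vectors of the words in `tokens` into one vector.

        weighting="document": weight each word vector by log inverse document frequency.
        weighting="term":     add the word vectors without weighting.
        normalize=True:       scale each word vector to unit length first, removing
                              the influence of word frequency on vector length."""
        dims = len(next(iter(lexicon.values())))
        total = np.zeros(dims)
        for word in tokens:
            if word not in lexicon:
                continue  # skip words outside the semantic space
            v = np.asarray(lexicon[word], dtype=float)
            if normalize:
                v = v / np.linalg.norm(v)
            if weighting == "document":
                v = v * np.log(n_docs / doc_freqs[word])
            total += v
        return total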

All four weighting methods resulted in qualitatively similar results, with document weighting producing slightly better results quantitatively. Figure 19 summarizes the effects of the different weighting methods. All of the methods showed the highest accuracy (prediction of DO and SC examples combined) in relatively close neighborhoods, with accuracy dropping off as the neighborhoods contained less relevant examples. The upper line, document weighting, is the average of the lines shown in Figure 16 and Figure 18 above. Additionally, the individual DO and SC curves for each weighting method similarly match the results shown above for experiment 3.4.2. These results show that the overall implications of the model are not particularly affected by the choice of weighting method, but that document weighting produces slightly more accurate subcategorization predictions.

[Plot omitted: combined DO and SC prediction accuracy (40-70%) as a function of neighborhood size (0-400) for document weighting, term weighting, normalized document weighting, and normalized term weighting.]

Figure 19: Comparison of various LSA weighting methods.

A second variable in the model was also examined using the data from experiment 3.4.2. In the experiments in this dissertation, the determination of subcategorization prediction was based on the majority subcategorization in each given neighborhood. However, this means that the closest example to the target in a neighborhood counts as much as the most distant example in the neighborhood. This is somewhat counter to the intuition inherent in the neighborhood concept that the examples closest to the target are the most relevant for making a subcategorization prediction. In order to investigate this issue, experiment 3.4.2 was rerun, with the model’s predictions being based on a weighted majority vote for each neighborhood size rather than the straight majority of each neighborhood. The votes were weighted by the cosine between the target example and the corpus training examples. Thus, the most semantically similar examples counted more than the less similar examples in predicting the subcategorization.
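As a rough sketch of the two voting schemes (an illustration only, not the actual scripts used for these experiments; the function and variable names are invented here), the prediction for a single target context could be computed as follows, with weighted=False giving the straight majority vote used elsewhere in this chapter:

import numpy as np

def predict_subcat(target_vec, train_vecs, train_labels, k, weighted=False):
    # train_vecs:   (n, d) array of combined LSA vectors for the training contexts
    # train_labels: list of "DO"/"SC" labels, one per training context
    # weighted=False -> straight majority vote over the k nearest contexts
    # weighted=True  -> each vote is weighted by its cosine with the target
    cos = train_vecs @ target_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(target_vec))
    nearest = np.argsort(-cos)[:k]  # indices of the k most similar contexts
    votes = {"DO": 0.0, "SC": 0.0}
    for i in nearest:
        votes[train_labels[i]] += cos[i] if weighted else 1.0
    return max(votes, key=votes.get)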

The results of this experiment are shown in Figure 20. Weighting the vote of each training example does not produce a large change in the results, but it does slow the drop-off in the performance of the model in predicting DO examples as neighborhood size increases. Although weighting the prediction of each neighborhood by cosine does result in a slight improvement, it seems more important to make sure that there is enough training data to ensure that the nearest neighborhoods contain a sufficient number of relevant examples.

Figure 20: Effects of weighting neighborhoods by cosine on accuracy in predicting subcategorization. Accuracy (0%–100%) is plotted against neighborhood size (0–400) for DO and SC examples under both the straight and the cosine-weighted majority vote.

3.4.3 Predicting the subcategorizations of examples of ‘admit’

The previous experiment showed that the model in this chapter can predict the correct subcategorizations for both DO and SC examples of the verbs. However, there are unanswered questions. For the verbs in question, the SC use tends to involve a single sense of the verb, while the DO uses tend to involve multiple senses of the verb. Additionally, the sense that has an SC subcategorization also tends to have a DO subcategorization. The verb admit is typical of these verbs. It has two broadly defined senses: one meaning confess, which allows both DO and SC subcategorizations, and a more loosely defined sense meaning allow to enter. Examples of these senses are provided below in Table 53. One question is how well the model can distinguish between two closely related senses with different subcategorizations, such as the DO-confess and SC-confess uses of the verb admit. A second question is how well the model can perform when there are multiple senses with the same subcategorization. Can the model correctly identify the subcategorization of a use when there are only a small number of examples in the training data? Experiment 3.4.3 will address both of these issues by analyzing the performance of the model based on sense- and subcategorization-labeled data for the verb admit.

3.4.3.1 Methods

For this experiment, the DO and SC subcorpora for the verb admit prepared above were hand-tagged for verb sense. All of the examples were classed as belonging to one of two senses, enter and confess. Because the enter sense only occurred in the DO examples, a total of three possible classifications existed for each example: DO-enter, DO-confess, and SC-confess. Examples (117) and (118) illustrate the SC-confess use of admit. Examples (119) and (120) illustrate the DO-confess use of admit. Examples (121), (122), (123), and (124) illustrate the DO-enter use of admit. DO-enter includes the sense of admitting evidence to court as a metaphoric extension. Table 53 also shows the resulting sample size for each sense/subcategorization combination.

SC-confess (N=200):
(117) Addressing the Supreme Soviet on Sept. 11, Ryzhkov admitted [that the working group had failed to synthesise the rival plans].
(118) You have admitted [that it was I who caused all the evidence to fall into a pattern].

DO-confess (N=150):
(119) Turner admitted [difficulties in motivating his side].
(120) Cocker, of Park Avenue, Teesville, Middlesbrough, also admitted [violent disorder and two further burglaries].

DO-enter (N=50):
(121) The door opened to admit [Rosa], wearing her customary black.
(122) Clearly, the decision to admit [a patient] to hospital must be taken only after very careful consideration.
(123) The Governing body of Somerville College has decided to admit [both men and women] from next year.
(124) It must be remembered, however, that the Order only permits the court to admit [hearsay evidence].

Table 53: Examples of senses and subcategorizations of admit.

This experiment relies on the same protocol as experiment 3.4.2, where each corpus context is compared with the contexts for the other 398 corpus examples. However, the results are separated by sense and subcategorization of the target context, rather than just subcategorization.

3.4.3.2 Results and discussion

Figure 21 shows the relative occurrence of each of the three sense-subcategorization combinations in the neighborhood of the corpus DO-confess examples. Because the three sense-subcategorization combinations have different frequencies, the frequencies at each neighborhood size are normalized, so that 100% represents the number of examples that would be expected in the neighborhood by random chance. This graph shows that the closest neighborhood sizes contain a higher than expected number of DO-confess examples, while at a greater distance, SC-confess examples predominate. DO-enter examples are under-represented at all neighborhood sizes. These results indicate that the examples most closely related to the DO-confess examples are the other DO-confess examples, but the SC-confess examples also have a high (though lower) degree of similarity, and thus predominate at slightly larger semantic distances.
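As a concrete illustration of this normalization (the numbers below are illustrative and use the class sizes from Table 53; the helper function is invented for this example):

def pct_of_expected(observed, class_count, pool_size, neighborhood_size):
    # observed count expressed as a percentage of the count expected by chance
    expected = neighborhood_size * class_count / pool_size
    return 100.0 * observed / expected

# With 50 DO-enter examples in a pool of 400, a neighborhood of 100 items is
# expected to contain 100 * 50/400 = 12.5 DO-enter examples by chance; if only
# 5 are observed, the plotted value is 5 / 12.5 = 40% of the baseline.
print(pct_of_expected(5, 50, 400, 100))  # prints 40.0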

Figure 21: Relative frequencies of each type of example in the neighborhood of DO-confess corpus examples. The percentage of the expected number in the neighborhood (baseline = 100%) is plotted against neighborhood size (0–400) for DO-confess, SC-confess, and DO-enter examples.

Figure 22 shows the distribution of the three sense-subcategorization combinations in the neighborhood of the SC-confess examples. As expected, the SC-confess examples dominate at most neighborhood sizes, and the semantically related, but slightly less similar DO-confess examples maintain a lower level of representation, while the unrelated DO-enter examples are underrepresented at all neighborhood sizes.


Figure 22: Relative frequencies of each type of example in the neighborhood of SC-confess corpus examples. The percentage of the expected number in the neighborhood (baseline = 100%) is plotted against neighborhood size (0–400) for SC-confess, DO-confess, and DO-enter examples.

Figure 23 shows the distribution of the sense-subcategorization combinations in the neighborhood of the DO-enter examples. As expected, the DO-enter examples predominate in the typical close neighborhood peak. At a larger distance, the three sense-subcategorization combinations have roughly equal representation in the neighborhoods. This is because most of the 50 DO-enter examples occurred in close proximity to the other DO-enter examples, and there is no additional supply of less related DO-enter examples to be added in at greater distances.

Figure 23: Relative frequencies of each type of example in the neighborhood of DO-enter corpus examples. The percentage of the expected number in the neighborhood (baseline = 100%) is plotted against neighborhood size (0–400) for DO-enter, DO-confess, and SC-confess examples.

The results of this experiment illustrate two points. One is that the model can distinguish between closely related senses such as SC-confess and DO-confess. The other is that the model can correctly identify the subcategorization of examples even when there are relatively few examples of the relevant sense in the training data, such as in the case of DO-enter. In fact, the task of identifying the subcategorization of the DO-enter examples is based on an even smaller set of relevant training data than is initially apparent. For each of the DO-enter test examples, there are 49 training examples (50 corpus examples less the test example). However, the DO-enter sense of admit includes a variety of metaphorically related uses, and is thus not a highly unified verb sense. Because these uses of admit occur in widely different contexts, the relevant set of training data for each of the test examples is probably much smaller than the whole set of 49 examples. Table 54 shows the frequencies of the different sub-senses of the 50 DO-enter corpus examples.

Sub-senses of ‘admit-enter’ (counts and examples from the BNC):
physical (20): This entrance was protected by a boarded fence and gate sufficiently wide to admit carriages.
court (10): He appealed, submitting that the judge wrongly admitted the evidence.
school (9): Neither Plato nor Aristotle would admit a student to the Academy at Athens if he did not like his face.
hospital (6): All four physicians admit elderly patients into the district hospital’s general medical beds.
membership (5): Areas of contention were brought to light during the summit, however, notably Austria’s objections to the other countries’ reliance on nuclear power, and opposition by Italy to admitting other states (specifically Poland) to the group.

Table 54: Counts and examples of subsenses of the 50 corpus examples of the DO-enter sense of admit.

3.5 Conclusion

The goal of this chapter was to demonstrate that the relationship between the context preceding a verb and the subcategorization of the verb could be used to predict the subcategorization of the verb. Experiment 3.4.1 showed that the model introduced in this chapter could successfully predict the subcategorizations of the SC bias contexts from Hare et al. (2001). However, the model had difficulty with predicting the subcategorizations of the DO bias contexts. This difficulty is attributed to a lack of DO-ness in the bias contexts rather than to a failure of the model. Experiment 3.4.2 showed that the model is capable of predicting both DO and SC subcategorizations when naturally occurring contexts are used rather than the artificially created DO contexts used in Hare et al. (2001). Experiment 3.4.3 showed that the model can make correct subcategorization predictions even when the verb sense involved has a minority representation in the training data, and that the preceding context can be used to distinguish between separate subcategorizations of the same coarse-grained verb sense.

One potential extension of the work in this chapter is to attempt to induce Levin (1993) type verb alternation groups from corpus data. The ability to induce such verb classes plays an important part in arguments over language learnability, an issue discussed extensively in Pinker (1989). A key question in the discussion of language learnability is whether there is enough evidence in the data available to a child learning English to properly learn which verbs do and do not appear in various subcategorizations. The subtle semantic distinctions between different classes of verbs are thought to play a key role in learning, but it is not clear whether there is sufficient semantic evidence available to allow for the induction of the verb alternation patterns. It is possible that over a large corpus, a learning system relying on the mechanisms used in this chapter might be able to induce the necessary distinctions.
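Purely as a sketch of what such an induction system might look like (nothing of the kind was implemented in this dissertation; the function names are invented, sklearn is an arbitrary choice of clustering tool, and whether the resulting clusters would correspond to Levin-style classes is exactly the open question raised above), each verb could be represented by the average LSA vector of its preceding contexts and the verbs then clustered:

import numpy as np
from sklearn.cluster import KMeans

def cluster_verbs(context_vectors, n_classes):
    # context_vectors: dict mapping each verb to a list of combined LSA
    # vectors, one per corpus context in which that verb occurs
    verbs = sorted(context_vectors)
    profiles = np.vstack([np.mean(context_vectors[v], axis=0) for v in verbs])
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(profiles)
    return dict(zip(verbs, labels))  # verb -> induced class id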

A second potential extension of this work is to use the relationship between the context preceding the verb and the subcategorization of the verb to extend the portability of statistical parsers. The introduction to this dissertation argued that the fact that these parsers do not take verb sense into account is at least a partial cause of the drop in performance when they are used on types of data that differ from the training data. If the parsers relied on a complete lexicalized grammar, rather than the simplified versions that they use, verb sense information would be an inherent part of the grammar. However, a parser that relied on a lexicalized grammar with no simplifying assumptions would require an inordinate amount of training data. Much as adding explicit subcategorization information back into the grammar makes up for the information lost through the independence assumptions and thus improves parser performance, adding explicit verb sense information should also make up for verb sense information which is lost through the independence assumptions. It is hoped that LSA-based verb sense disambiguation techniques such as the one used in this dissertation can improve parser performance at less expense than would be required to rely on a complete lexicalized grammar.

4 Conclusions and future work

The introduction to this dissertation posed several problems in psycholinguistics and computational linguistics that were attributed to different corpora and psycholinguistic data sources having different verb subcategorization probabilities. The problems included the reduction in performance faced by statistical parsers when they are used on data other than that for which they were trained, the issue of which verb subcategorization probabilities are represented in the mental lexicon, and the issue of which verb subcategorization probabilities are most appropriate to use in norming psycholinguistic experiments. Chapter 2 demonstrated that individual senses of verbs were a more appropriate locus for verb subcategorization probabilities than the verb lexeme. Chapter 2 also demonstrated that when verb sense and discourse type were controlled for, cross-corpus subcategorization variation was reduced. Chapter 3 demonstrated a model that could make verb subcategorization predictions based on the preceding semantic context, and showed that the predictions made by the model were the same as the predictions made by humans given the same contexts, thus indicating, in conjunction with the results from Hare et al. (2001), that the verb sense / subcategorization relationship plays a role in human sentence processing. The results in this dissertation both provide at least partial answers to the questions posed in the introduction and suggest several areas for future research.

4.1 Psycholinguistics

The evidence from Chapter 2 suggests that the context from which the norming data is taken should resemble the context in which the verb is used in the experiment, to the extent possible. Semantic biases towards different verb senses and a wide variety of discourse factors play a role in producing verb subcategorization expectations. These results also suggest that there really is no such thing as a neutral context for producing verb uses. Common norming methods such as sentence production and sentence completion have their own inherent biases. Subcategorization probabilities taken from corpora also face difficulties. On one hand, the generic average of all uses of a verb in a corpus may or may not correctly represent the properties of that verb as used in a particular experimental context. On the other hand, corpus data also does not necessarily reflect the previous language exposure of typical psychology experiment subjects. These subjects typically were not even born when the data for the Brown Corpus was collected. Subjects are exposed to a much wider variety of language use than the ten million words of textbook text found in the TASA corpus. This poses a problem for selecting a corpus for finding appropriate frequencies, and a problem for selecting a corpus for generating semantic spaces for use in LSA-type models. An example of the potential differences between corpus data and daily life experiences is shown by the list of the twenty closest words to the word grape in a semantic space based on the TASA corpus in Table 55.

huelga, chavez, cesar, pickers, ufwoc, awoc, nfwa, grafters, yuma, migrant, growers, arvin, picketed, nonunion, ufw, campesinos, causa, strikebreakers, afl

Table 55: 20 nearest neighbors of grape in the TASA LSA semantic space.

Given the difficulties in choosing appropriate corpus data for use in modeling human performance, it seems more appropriate to take studies which do not find strong correlations between corpus data and experimental data as reasons to look for more verb-sense and discourse-based differences, rather than as evidence of a lack of use of probabilistic information in human sentence processing.

4.2 Computational linguistics

The implications of this work for computational linguistics are less well defined than the implications for psycholinguistics. This work argues that controlling for verb sense should improve cross-corpus performance of statistical parsers, but it never presents an actual example of improvement as evidence. If the statistical parsers relied on complete lexicalized grammars (without the simplifying assumptions presently used), it is quite likely that controlling for verb sense would not produce better cross-corpus performance. This is because the knowledge of which combinations of lexical arguments each verb takes is largely sense-specific. However, simplifying assumptions, such as the (argument) independence assumptions used in Collins (1999), remove this information from the grammar. Such assumptions are necessary to reduce the quantity of training data needed, but in a trade-off, they induce additional error into the grammar. Verb subcategorization information is also lost through the independence assumptions. Just as adding explicit verb subcategorization information back into the grammar increased the accuracy of the Collins parser given a fixed amount of training data, adding explicit verb sense information should also increase the accuracy, particularly when the parser is used on corpora that are different from the training data. The usefulness of doing this depends on the relative expense of adding sense information versus the expense of expanding the training data until it is large enough that an accurate grammar can be induced without relying on assumptions such as the independence assumption.
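Schematically (this is an illustration of the argument only, not Collins’s actual model nor an implemented parser component; all of the names below are invented for the example), sense-conditioned subcategorization probabilities could enter such a parser as a mixture weighted by a context-based sense prediction:

def frame_prob(frame, verb, p_frame_given_sense, p_sense_given_context):
    # p_frame_given_sense[(verb, sense)][frame]: estimated from sense-tagged data
    # p_sense_given_context[sense]: e.g., produced by an LSA model of the
    # material preceding the verb, as in Chapter 3
    return sum(p_frame_given_sense[(verb, sense)].get(frame, 0.0) * p
               for sense, p in p_sense_given_context.items())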

4.3 Future work

This dissertation suggests several lines of future research. One is identifying other factors that influence verb subcategorization probabilities and cause differences between psychological data and corpus data. A related task is to actually label a large set of data for verb sense and discourse factors, in order to build and test a large-scale model of the relative contributions of each factor to verb subcategorization probability variation. A separate area of research would be to actually attempt to add verb sense information into a statistical parser. The results in Chapter 3 suggest that the information contained in the few sentences before a target verb is helpful in predicting both the sense and subcategorization of the verb, although these results do not demonstrate explicitly that this information would be additional information beyond that which is presently used in the various parsers.

Bibliography

Altmann, G. T. M., van Nice, K. Y., Garnham, A., & Henstra, J. (1998). Late closure in context. Journal of Memory and Language, 38, 459-484.

Argaman, V., Pearlmutter, N., & Garnsey, S. M. (1998). Lexical semantics as a basis for argument structure frequency biases. Poster presented at CUNY Sentence Processing Conference.

Argaman, V., & Pearlmutter, N. J. (in press). Lexical semantics as a basis for argument structure frequency biases. In P. Merlo & S. Stevenson (Eds.), Sentence processing and the lexicon: Formal, computational and experimental perspectives. Amsterdam: John Benjamins.

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL '98) (pp. 86-90). Montreal, Canada.

Bever, T. G. (1970). The cognitive basis for linguistic structure. In J. R. Hayes (Ed.), Cognitive development of language (pp. 279-362). New York: John Wiley.

Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.

Biber, D. (1993). Using register-diversified corpora for general language studies. Computational Linguistics, 19(2), 219-241.

Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Cambridge: Cambridge University Press.

Bikel, D. M. (2000). A Statistical model for parsing and word-sense disambiguation. 2000 Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 155-163). Hong Kong.

Carroll, G., & Rooth, M. (1998). Valence induction with a head-lexicalized PCFG. Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3). Granada.

Carroll, J., Minnen, G., & Briscoe, T. (1998). Can subcategorization help a statistical parser? 6th ACL/SIGDAT Workshop on Very Large Corpora (pp. 118-126). Montreal, Canada.

Chafe, W. (1982). Integration and involvement in speaking, writing, and oral literature. In D. Tannen (Ed.), Spoken and Written Language (pp. 35-53). Norwood, New Jersey: Ablex.

Chafe, W. (1987). Cognitive constraints on information flow. In R. S. Tomlin (Ed.), Coherence and grounding in discourse (pp. 1-16). Amsterdam: Benjamins.

Charniak, E. (1995). Parsing with context free grammars and word statistics (CS-95-28). Providence, Rhode Island: Brown University.

Charniak, E. (1997). Statistical parsing with a context-free grammar and word statistics. AAAI-97 (pp. 598-603). Providence, RI.

Clifton, C., Frazier, L., & Connine, C. (1984). Lexical expectations in sentence comprehension. Journal of Verbal Learning and Verbal Behavior, 23, 696-708.

Collins, M. (1999). Head-driven statistical models for natural language processing. Unpublished doctoral dissertation, University of Pennsylvania.

Connine, C., Ferreira, F., Jones, C., Clifton, C., & Frazier, L. (1984). Verb frame preference: Descriptive norms. Journal of Psycholinguistic Research, 13, 307-319.

Cuetos, F., Mitchell, D. C., & Corley, M. M. B. (1996). Parsing in different languages. In M. Carreiras & J. E. Garcia-Albea & N. Sabastian-Galles (Eds.), Language Processing in Spanish (pp. 145-187). Hillsdale, N.J.: Erlbaum.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391-407.

Fillmore, C. J. (1968). The case for case. In E. W. Bach & R. T. Harms (Eds.), Universals in Linguistic Theory (pp. 1-88). New York: Holt, Rinehart & Winston.

Fillmore, C. J. (1969). Types of lexical information. In F. Kiefer (Ed.), Studies in Syntax and Semantics (pp. 109-137). Dordrecht: Reidel.

Fillmore, C. J. (1986). Pragmatically controlled zero anaphora. Proceedings of the 12th Annual Meeting of the Berkeley Linguistics Society (pp. 95-107). Berkeley, CA.

Fodor, J. (1978). Parsing strategies and constraints on transformations. Linguistic Inquiry, 9, 427-473.

Ford, M., Bresnan, J., & Kaplan, R. M. (1982). A Competence-Based Theory of Syntactic Closure. In J. Bresnan (Ed.), The Mental Representation of Grammatical Relations (pp. 727-796). Cambridge: MIT Press.

Francis, W., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.

Fraser, B., & Ross, J. R. (1970). Idioms and unspecified NP deletion. Linguistic Inquiry, 1, 264-265.

Gahl, S. (1998a). Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus. Proceedings of ACL-98 (pp. 428-432). Montreal.

Gahl, S. (1998b). Automatic extraction of subcorpora for corpus-based dictionary-building. EURALEX '98 Proceedings: Papers submitted to the Eighth EURALEX Conference (pp. 445-452). University of Liège, Belgium.

Gahl, S., & Jurafsky, D. (2000). Coder's manual for the NORMA project on verb alternation biases (Technical Report 00-03). Boulder: University of Colorado Institute of Cognitive Science.

Gahl, S., Menn, L., Ramsberger, G., Jurafsky, D. S., Elder, E., Rewega, M., & Holland, A. L. (2001). Syntactic frame and verb bias in aphasia: Plausibility judgments of undergoer-subject sentences. Poster presented at TENNET, Montreal.

Garnsey, S. M., Pearlmutter, N. J., Myers, E., & Lotocky, M. A. (1997). The contributions of verb bias and plausibility to the comprehension of temporarily ambiguous sentences. Journal of Memory & Language, 37(1), 58-93.

Gibson, E., & Schuetze, C. T. (1999). Disambiguation preferences in noun phrase conjunction do not mirror corpus frequency. Journal of Memory & Language, 40(2), 263-279.

Gibson, E., Schuetze, C. T., & Salomon, A. (1996). The relationship between the frequency and the processing complexity of linguistic structure. Journal of Psycholinguistic Research, 25(1), 59-92.

Gildea, D. (2001). Corpus variation and parser performance. Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (pp. 167-172). Carnegie Mellon University.

Givon, T. (1979). On understanding grammar. New York: Academic Press.

Givon, T. (1984). Syntax: A functional/typological introduction. Amsterdam/Philadelphia: John Benjamins Publishing Company.

Givon, T. (1987). Beyond foreground and background. In R. S. Tomlin (Ed.), Coherence and grounding in discourse (pp. 175-188). Amsterdam: Benjamins.

Godfrey, J., Holliman, E., & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. Proceedings of ICASSP-92 (pp. 517-520). San Francisco.

Goldberg, A. (1995). Constructions. Chicago: University of Chicago Press.

Green, G. (1974). Semantics and syntactic regularity. Bloomington: Indiana University Press.

Gruber, J. (1965). Studies in lexical relations. Ph.D. Dissertation, MIT, Cambridge, MA.

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London/New York: Longman.

Hare, M., Elman, J., & McRae, K. (2001). Sense and structure: Meaning as a determinant of verb categorization preferences. Manuscript submitted for publication.

Holmes, V. M., Stowe, L., & Cupples, L. (1989). Lexical expectations in parsing complement-verb sentences. Journal of Memory and Language, 28, 668-689.

Jennings, F., Randall, B., & Taylor, L. K. (1997). Graded effects of verb subcategory preferences on parsing: Support for constraint-satisfaction models. Language and Cognitive Processes, 12(4), 485-504.

Johnson, C. R., & Fillmore, C. J. (2000). The FrameNet tagset for frame-semantic and syntactic coding of predicate-argument structure. Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL 2000) (pp. 56-62). Seattle WA, April 29-May 4, 2000.

Johnson, C. R., Fillmore, C. J., Wood, E. J., Ruppenhofer, J., Urban, M., Petruck, M. R. L., & Baker, C. F. (2001). The FrameNet Project: Tools for lexicon building, Version 0.7, July 13, 2001: Available through http://www.icsi.berkeley.edu/~framenet/book.html.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Lapata, M., Keller, F., & Schulte im Walde, S. (2001). Verb frame frequency as a predictor of verb bias. Journal of Psycholinguistic Research, 30(4), 419-435.

Levin, B. (1993). English verb classes and alternations. Chicago and London: The University of Chicago Press.

Levin, B., & Hovav, M. R. (1995). Unaccusativity at the syntax-lexical semantics interface. Cambridge: MIT Press.

Lin, D. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. ACL-97 (pp. 64-71). Spain.

Lowe, J. B., Baker, C. F., & Fillmore, C. J. (1997). A frame-semantic approach to semantic annotation. Proceedings of the SIGLEX workshop "Tagging Text with Lexical Semantics: Why, What, and How?" in conjunction with ANLP-97 (pp. 18-24). Washington, D.C.

MacDonald, M. C. (1994). Probabilistic constraints and syntactic ambiguity resolution. Language & Cognitive Processes, 9(2), 157-201.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

Maslin, J. (1991, March 22). Problems of homeless: The individual stories. New York Times, pp. C8.

Merlo, P. (1994). A corpus-based analysis of verb continuation frequencies for syntactic processing. Journal of Psycholinguistic Research, 23(6), 435-47.

Miller, G., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1993). Introduction to WordNet: An on-line lexical database.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.

Mitchell, D. C., Cuetos, F., & Corley, M. M. B. (1992). Statistical versus linguistic determinants of parsing bias: Cross-linguistic evidence. Paper presented at the 5th annual CUNY Conference on Sentence Processing. New York.

Mitchell, D. C., Cuetos, F., Corley, M. M. B., & Brysbaert, M. (1995). Exposure-based models of human parsing: Evidence for the use of coarse-grained (nonlexical) statistical records. Journal of Psycholinguistic Research Special Issue: Sentence processing: I, 24(6), 469-488.

Narayanan, S., & Jurafsky, D. (1998). Bayesian models of human sentence processing. Proceedings of the 20th annual conference of the Cognitive Science Society (pp. 752-757).

Pickering, M. J., Traxler, M. J., & Crocker, M. W. (2000). Ambiguity resolution in sentence processing: Evidence against frequency-based accounts. Journal of Memory and Language, 43, 447-475.

Pinker, S. (1989). Learnability and cognition: The acquisition of argument structure. Cambridge, MA, USA: MIT Press.

Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337-354.

Resnik, P. (1996). Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1-2), 127-159.

Roland, D., & Jurafsky, D. (1997). Computing verbal valence frequencies: corpora versus norming studies. Poster session presented at CUNY sentence processing conference. Santa Monica, CA.

Roland, D., & Jurafsky, D. (1998). How verb subcategorization frequencies are affected by corpus choice. Proceedings of COLING-ACL 1998 (pp. 1117-1121). Montreal, Canada.

Roland, D., & Jurafsky, D. (in press). Verb sense and verb subcategorization probabilities. In P. Merlo & S. Stevenson (Eds.), The Lexical Basis of Sentence Processing: Formal, Computational, and Experimental Issues. Amsterdam: John Benjamins.

Roland, D., Jurafsky, D., Menn, L., Gahl, S., Elder, E., & Riddoch, C. (2000). Verb subcategorization frequency differences between business-news and balanced corpora: The role of verb sense. Proceedings of the Workshop on Comparing Corpora (pp. 28-34). Hong Kong, October 2000.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Schütze, H. (1997). Ambiguity resolution in language learning: Computational and cognitive models (CSLI Lecture Notes No. 71). Stanford, CA: CSLI Publications.

Schütze, H. (1998). Automatic word sense disambiguation. Computational Linguistics, 24(1), 97-123.

Thompson, S. A. (1987). The passive in English: A discourse perspective. In R. Channon & L. Shockey (Eds.), In Honor of Ilse Lehiste/Ilse Lehiste Puhendusteos (pp. 497-511). Dordrecht: Foris.

Trueswell, J. C., Tanenhaus, M. K., & Kello, C. (1993). Verb-specific constraints in sentence processing: Separating effects of lexical preference from garden-paths. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19(3), 528-553.

Yarowsky, D. (2000). Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34(2), 179-186.

Zeno, S. M., Ivens, S. H., Millard, R. T., & Duvvuri, R. (1995). The educator's word frequency guide. Touchstone Applied Science Associates, Inc.


Appendix A: Subcategorizations and tgrep search strings

This section will provide the definitions of each of the subcategorization frames used in Section 2.2 of this dissertation. The subcategorization frequencies were extracted from three corpora from the Penn Treebank: the Brown corpus, the WSJ corpus, and the Switchboard corpus. This information was extracted using a series of perl scripts in conjunction with the tgrep utility. The search strings were designed in such a way that the strings each identify a non-overlapping set of examples, and that the combined set of strings identifies all possible tree structures. Rather than provide linguistic definitions for each of the subcategorizations, and then describe the differences between the theoretical definition of the subcategorization and the practical reality of what actually ended up in each of the counts, it seems more appropriate to start by explaining the actual search patterns.

The search patterns used were based on Treebank I style annotation, since the Brown Corpus was only available in Treebank I style annotation when this work was done. Since that time, a more sophisticated version of tgrep has been released [20], and the Brown Corpus has been annotated with Treebank II style annotation [21]. These improvements should allow for a much more accurate (or at least more elegant) identification of various subcategorizations than was possible with tgrep I and Treebank I. Nonetheless, the patterns and tools used in this dissertation had a low degree of error, once quotations were hand-identified for all communication verbs. This step of hand correction was essential, due to the extreme difficulties involved in identifying such arguments. Table 56 shows the results of an analysis of errors found in a random sample of the output of the search strings. This list does not include errors in identifying quotations for communication verbs, because these were all corrected by hand as part of the data collection process.

Treebank-based errors:
- PP attachment: 1%
- verb+particle vs. verb+PP: 2%
- NP/adverbial distinction: 2%
- miscellaneous mis-parsed sentences: 1%
Search-string-based errors:
- missed traces and displaced arguments: 1%

Table 56: Errors not including quote-finding errors for communication verbs.

[20] http://www.cs.cmu.edu/~dr/Tgrep2/
[21] The Treebank II version of the Brown corpus has not yet been released through the LDC, but is mentioned in Bikel (2000).

This section will present the actual tgrep search strings. The most important thing to note is that these patterns rely on a modified version of the original Treebank corpora. In all cases, the quotation marks in each corpus were replaced with (QUOTE OPEN_Q) and (QUOTE CLOSE_Q), so that the quotation marks could be used as part of the search patterns without interfering with the tgrep syntax. After this substitution was made, the corpora were recompiled using tprep. These search strings were pre-processed in a perl script before being fed into tgrep, with several substitutions. For ease of interpretation and presentation, the strings will be left in the unsubstituted form. $VERB in each string represents a regular expression containing all possible forms of the verb. For example, if the target verb were race, the regular expression would be “race|races|racing|raced”. Other substitutions include the following [22]:

$BE_GET = "is|are|was|were|be|am|been|get|gets|got|gotten|getting|being";

$NOT_PASSIVE="!>(VBN>(VP!NP%(AUX<<$BE_GET))) !>(VBN>(VP!NP%(/VB/<$BE_GET)))!>(VBN>(VP>(VP!

NP%(/VB/<$BE_GET))))!>(VBN>(VP!NP%(VP<(/VB/ <$BE_GET))))"; The total number of each verb was found with the following pattern. Note that the verb can not have a particle as a sister. Verb particle combinations were intentionally removed, as these tend to be idiomatic and have separate semantics that are not equivalent to the other uses of the verb. Also, the verb has to be dominated by a VP. This removes various nominalizations. Examples where the verb was dominated by an NP were identified using separate search patterns described below. VP<(/VB/<$VERB)!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($V ERB$NOT_PASSIVE))!>(S|SINV%S|SBARQ|SQ)!>(S|SINV%QUO TE)!>(S|SINV%/,/)!>(S|SINV%(S|NP

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))>(S|SINV%SBARQ|SQ)!>(S|SINV%QUOTE)!>(S|SINV%/,/))

[22] Please refer to the tgrep manpage for an explanation of the tgrep search syntax.

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))!>(S|SINV%S)!>(S|SINV%QUOTE)>(S|SINV%/,/))

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))!>(S|SINV%S)>(S|SINV%QUOTE))

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))>(S|SINV%(S!(S|SINV%QUOTE))

(VP!NP<(/VB/!%../NP/|/PP/|S|/S-/|/SBAR/|VP|X)<(/VB/<($VERB$NOT_PASSIVE))>(S|SINV%(S|NP(S|SINV%QUOTE))

The [PP] subcategorization was identified with the following string. This identifies the prepositional phrases which Treebank counted as arguments, but not those counted as adjuncts. In a hand check of a random sample of all labeled data, about 2% of the cases had a PP labeled as an argument which might better be counted as an adjunct or vice versa.

(VP!NP!NP!

(VP!NP!

(VP!NP!

(VP!NP!NP!NP!

(VP!NP!NP!

(VP!NP!NP!

(VP!NP!NP!NP!NP!

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..VP)!<(/NP/%../NP/|/PP/|S|/S-/|/SBAR/|X))

The [NP] examples were identified through a set of strings that are similar to the [0] strings. Note that traces marked in the Treebank count as NPs.

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))(VP>(S|SINV%S))!>(VP>(S%QUOTE)))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))(VP>(S|SINV%S))>(VP>(S%QUOTE)))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))(VP>(S|SINV%S))!>(VP>(S%QUOTE)))

The [NP NP] examples are identified through the following string. This string produces numerous false positives, particularly with verbs that do not have [NP NP] as a possible subcategorization. This is a result of time NPs appearing as sisters of the verb, as in "I saw him Monday". Although there are certain obvious NPs to look for, such as days of the week, the problematic NPs exhibit a Zipfian distribution. This problem was not corrected, and it affects all NP subcategorization counts, but it induces only a small amount of overall error in the results (approximately 2%).

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%../NP/))

The following string was used for [NP PP]:

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%../PP/)!<(/NP/%../NP/|S|/S-/|/SBAR/|VP|X))

The following strings were used for the [NP VPto] category:

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(S|/S-/<(AUX|TO<

(VP!NP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(/SBAR/!<<,that|0))!<(/NP/%../NP/|/PP/|S|/S-/|VP|X))

The following strings were used for the [NP Sfin] subcategorization:

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(/SBAR/<<,that))!<(/NP/%../NP/|/PP/|S|/S-/|VP|X))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(/SBAR/<<,0))!<(/NP/%../NP/|/PP/|S|/S-/|VP|X))

The following strings were used to identify passive examples:

(VP<(VBN<$VERB)!NP%(AUX<<$BE_GET))

(VP<(VBN<$VERB)!NP%(/VB/<$BE_GET))

(VP<(VBN<$VERB)!NP<(VP%(/VB/<$BE_GET)))

(VP<(VBN<$VERB)!NP%(VP<(/VB/<$BE_GET)))

Examples where the verb was dominated by an NP rather than a VP node were excluded. These cases include, but are not limited to, reduced relatives. The output of these strings was not included in the results in Chapter 2.

(VP!NP<(VB<($VERB$NOT_PASSIVE)))

(VP!NP<(VBD<($VERB$NOT_PASSIVE)))

(VP!NP<(VBG<($VERB$NOT_PASSIVE)))

(VP!NP<(VBN<($VERB$NOT_PASSIVE)))

(VP!NP<(VBP<($VERB$NOT_PASSIVE)))

(VP!NP<(VBZ<($VERB$NOT_PASSIVE)))

Data from the following strings was included in an “other” category. Some of these strings are logical possibilities included to ensure that all corpus examples were found by some string, and they did not necessarily match any corpus examples.

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))!

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))!

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..S|/S-/)!<(/NP/%..(S|/S-/<(AUX|TO<

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(X<<,that|0))!<(/NP/%../NP/|/PP/|S|/S-/|/SBAR/|VP))

(VP!NP<(/VB/<($VERB$NOT_PASSIVE))<(/NP/%..(X!<<,that|0))!<(/NP/%../NP/|/PP/|S|/S-/|/SBAR/|VP))

Several types of examples are not correctly identified by the search strings. These bugs were not fixed, as an accuracy/effort trade-off.
1. Quotes without quotation marks cannot be identified at all. This is actually solved by hand-counting all quotes anyway.
2. Time NPs still cause false transitives, ditransitives, etc. This results in approximately a 2% error rate.
3. Pseudoclefts – approximately 26 in the Brown Corpus data used in this dissertation.
4. Comparatives – approximately 21 in the Brown Corpus data used in this dissertation.
5. Tough movement.


Appendix B: Stimuli used in Hare et al. (2001)

1. observe (SC) Trevor's teacher asked him to explain why there had been riots following the election in Bosnia. (Target) He observed (that) the election had probably been rigged / the previous year and that is what caused the problems.

(DO) A United Nations official was sent to Bosnia to keep an eye on the election. (Target) He observed (that) the election had probably been rigged / the previous year, so the UN wanted to make sure it wouldn't happen again.

2. admit (SC) For over a week, the trail guide had been denying any problems with the two high school kids walking the entire Appalachian Trail. (Target) Finally, though, he admitted (that) the students had little chance of / succeeding.

(DO) The two freshmen on the waiting list refused to leave the professor's office until he let them into his class. (Target) Finally, he admitted (that) the students had little chance of / getting into the course.

3. recall (SC) Mary Anne was happy to have time to sit down and read, but she couldn't locate the book she had started three days earlier. (Target) She recalled (that) the novel was sitting underneath the / magazines on her coffee table, so she got it and sat down on her favorite rocking chair.

(DO) Two people had requested the overdue book, so the librarian agreed to get it for them right away. (Target) She recalled (that) the novel was sitting underneath the / front counter, and that it had actually been returned earlier that day.

4. grasp (SC) As Rhonda lay alone in her bed, she began to understand why Ted had become such a good student lately. (Target) She grasped (that) her friend wanted to make a / good impression on the new teacher.

(DO) Rhonda saw Ted trip at the head of the stairs. (Target) She grasped (that) her friend wanted to make a / fool of himself and had done it on purpose.

5. recognize (SC) Joe had taken his mom's ailing sister into his home, and he wanted to keep her with him even though she wanted to move to a nursing home. (Target) Finally though, he recognized (that) his aunt was sick and her / care would be better at the home.

(DO) When Joe opened the door, he did not immediately know his mom's sister. (Target) Finally though, he recognized (that) his aunt was sick and her / appearance had changed dramatically.

6. indicate (SC) Ken had finally allowed his landlady to rent out his garage space while he was in Europe for a year. (Target) He indicated (that) the car was gone because he / had lent it to his nephew.

(DO) The day care worker asked the little boy to show her which toy he wanted. (Target) He indicated (that) the car was gone because he / had let another kid have it.

7. add (SC) Matthew was complaining to his wife about their kids' ridiculously busy schedule when he thought of one last thing to tell her. (Target) He added (that) their kids were fine just playing / in the local soccer league with their friends and he didn't want them trying out for the traveling team.

(DO) Matthew asked his wife for a pen as the two of them stood in front of the sign-up list for the kids' traveling team. (Target) He added (that) their kids were fine just playing / in the local league last year, so why not let them try out for the traveling team this year.

8. anticipate (SC) Liz and George were reassured by their broker's projections after stock prices fell badly in August. (Target) He anticipated (that) the market was going to fluctuate / but then prices would rise rapidly.

(DO) Unlike many people, George didn't lose any money when stock prices fell badly in August. (Target) He anticipated (that) the market was going to fluctuate / and moved his money into bonds.

9. reflect (SC) Maureen debated whether to park near the visitor’s center and hike in from there to Mt. Shasta, or to try to get a spot in the more crowded lot nearer the base of the peak. (Target) She reflected (that) the mountain might be too far / away if she parked in the first lot, so she took a chance and kept going.

(DO) When Maureen moved into her new apartment, she set up mirrors in her living room to try to get a good view of Mt. Shasta out her front window. (Target) She reflected (that) the mountain might be too far / away to see clearly, but it was worth trying all the same.

10. acknowledge (SC) For the past hour and a half, Susan had been bragging to her friend about her endless patience when it came to dealing with her curious 4-year-old son, even though it was something of a lie. (Target) Finally though, she acknowledged (that) her son bugged her very often because / of his boundless energy and endless questions.

(DO) At the dinner table, Kenny was eager to comment on the plans for the family trip, but his mom paid no attention to him. (Target) Finally though, she acknowledged (that) her son often bugged her because / of his ridiculous off-topic comments, and that she had been ignoring him on purpose.

11. find (SC) The intro psychology students hated having to read the assigned text because it was so boring. (Target) They found (that) the book was written poorly and / difficult to understand.

(DO) Allison and her friends had been searching for John Grisham's new novel for a week, but yesterday they were finally successful. (Target) They found (that) the book was written poorly and / were annoyed that they had spent so much time trying to get it.

12. bet (SC) Anthony was deeply depressed about the damage to his property caused by the earthquake. (Target) He bet (that) his house was going to be / worth much less than it used to be.

(DO) Anthony had experienced a string of bad luck in the high stakes poker game, but because he was holding such a great hand he decided to stay in, using his property as collateral. (Target) He bet (that) his house was going to be / worth enough to let him stay in the game and to win back his money besides.

13. confirm (SC) Roger’s secretary asked him if he really did want to have tomorrow's meeting in the small conference room that was completely lacking any decent audiovisual equipment. (Target) He confirmed (that) the room was precisely the right / one because there were only going to be five people at the meeting.

(DO) Roger called university classroom reservations to finalize the location of his course for this semester. (Target) He confirmed (that) the room was precisely the right / one because there were only going to be five students in his course.

14. declare (SC) At the meeting, the parents objected strongly to the principal's decision to have yet another long weekend in May. (Target) They declared (that) a holiday was inappropriate because there / were important exams coming up soon.

(DO) Congress was looking for a way to honor the slain civil rights leader. (Target) They declared (that) a holiday was inappropriate because there / were better ways to honor him.

15. reveal (SC) Luke's wife finally asked him why he wasn't concerned about the kids stealing the package he had left on the seat of his car. (Target) He revealed (that) the box had actually been empty / and her pearls were safe in the upstairs closet.

(DO) Bob finally agreed to show Cindy the package that he had hidden under the bed. (Target) He revealed (that) the box had actually been empty / all along and her present was actually in the upstairs closet.

16. claim (SC) After his promotion, John sent a letter to the selection committee thanking them for choosing him. (Target) He claimed (that) the honor made him very happy / and was the most exciting thing that had ever happened to him.

(DO) After he won the competition, John went down to the awards center. (Target) He claimed (that) the honor made him very happy / and was the most exciting thing that had ever happened to him.

17. project (SC) Because the historian expressed concern about the delays that were piling up, the studio executives asked her when she might be delivering the finished product to them. (Target) She projected (that) the documentary would take about two / months longer than originally planned.

(DO) As she began her presentation in the viewing room, the producer asked her assistant to dim the lights. (Target) She projected (that) the documentary would take about two / hours and then she would answer questions.

18. insert (SC) The newspaper editors were arguing intensely and the reporter was having a hard time getting a word in edgewise. (Target) Finally though, she inserted (that) the paper seemed to be falling / apart and radical change was needed.

(DO) While Bob was sweeping the attic, June was getting frustrated at how hard it was to put the musty old documents back into their boxes. (Target) Finally though, she inserted (that) the paper seemed to be falling / apart and that she couldn't put it away without ripping it.

19. feel (SC) Rick was snug inside the cabin, but his horses were outside for the night and that worried him. (Target) He felt (that) the weather might become a problem / as the night wore on.

(DO) Rick was beginning to get a little cold as he climbed the icy mountain. (Target) He felt (that) the weather might become a problem / as time wore on.

20. report (SC) The newscaster had to take a deep breath before he could give details of the deaths at the high school. (Target) He reported (that) the students were caught by surprise / when the gunman appeared out of nowhere.

(DO) Danny loved being hall monitor at his high school because it gave him such a sense of power. (Target) He reported (that) the students were caught by surprise / when he walked into the bathroom and caught them smoking.