A Computational Study of Lexicalized Noun Phrases in English

A COMPUTATIONAL STUDY OF LEXICALIZED NOUN PHRASES IN ENGLISH

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Carol Jean Godby, B.A., M.A.

* * * * *

The Ohio State University
2002

Dissertation Committee:
Professor Craige Roberts, Adviser
Professor Chris Brew
Professor David Dowty

Approved by
________________________
Adviser
Department of Linguistics

Copyright © Carol Jean Godby 2002

ABSTRACT

Lexicalized noun phrases are noun phrases that function as words. In English, lexicalized noun phrases are often realized as noun-noun compounds such as theater ticket and garbage man, or as adjective-noun phrases such as black market and high school. In specialized or technical subjects, phrases such as urban planning, air traffic control, highway engineering and combinatorial mathematics are conventional names for concepts that are just as important as single-word terms such as adsorbents, hydrology, or aerodynamics. But despite the fact that lexicalized noun phrases represent useful vocabulary and are cited in dictionaries, thesauri and book indexes, the traditional linguistic literature has failed to identify consistent and categorical formal criteria for identifying them.

This study develops and evaluates a linguistically natural computational method for recognizing lexicalized noun phrases in a large corpus of English-language engineering text by synthesizing the insights of studies in traditional linguistics and computational linguistics.

From the scholarship in theoretical linguistics, the analysis adopts the perspective that lexicalized noun phrases represent the names of concepts that are important to a community of speakers and have survived a single context of use. Theoretical linguists have also proposed diagnostic tests for identifying lexicalized noun phrases, many of which can be formalized in a computational study.
From the scholarship in computational linguistics, the analysis incorporates the view that a linguistic investigation can be extended and verified by processing relevant evidence from a corpus of text, which can be evaluated using mathematical models that do not require categorical input. In an engineering text, a small set of linguistic contexts, including professor of, department of or studies in, yields long lists of lexicalized noun phrases, including public safety, abstract state machines, complex systems, computer graphics, and mathematical morphology. The study reported here identifies lexical and syntactic contexts that harbor lexicalized noun phrases and submits them to a machine-learning algorithm that classifies the lexical status of noun phrases extracted from the text.

Results from several evaluations show that the linguistic evidence extracted from the corpus is relevant to the classification of noun phrases in engineering text. Informal evidence from other subject domains suggests that the results can be generalized.

ACKNOWLEDGMENTS

I returned to graduate school nearly 15 years after I dropped out and am now in a position to apply the results of my research in my professional work. These unlikely events converged because I have been fortunate enough to work in an intellectually rich and supportive environment. I am grateful to my managers and colleagues at OCLC, who believe in me and understand the relevance of linguistics for solving the problems presented by large repositories of machine-readable text. My deepest thanks go to Martin Dillon, who invited me to join the Office of Research and gave me an interesting problem to work on. This dissertation is the result. Joan Mitchell and Diane Vizine-Goetz supported my interest in terminology identification and invited me to many professional venues, where I could discuss my work and develop my ideas.
Traugott Koch gave me permission to use a valuable data set that he and his colleagues at Lund University in Sweden collected, without which the project reported here could not have been completed. I am also grateful to the professors and students in the Linguistics Department at Ohio State, who welcomed me back as one of their own. Their collegiality and professionalism made the work go as smoothly as possible. My heartfelt thanks go to my dissertation committee for their help in shaping my research problem into a linguistically sophisticated inquiry. Craige Roberts, my adviser, intervened several times with much-needed encouragement when I was overwhelmed and ready to give up. David Dowty, always a linguist’s linguist, pushed me to higher standards of scholarship than I could have achieved on my own. Chris Brew arrived at Ohio State just in time with the expertise in corpus linguistics that I needed to do this project justice. Special thanks go to my friend Lee Jansen, who did a thorough job of editing the manuscript. Finally, I thank my other old friends who kept me sane as I worked on this project: Mark Bendig, Jeff McKibben, Rosanne Norman, Debbie Stollenwerk, Margaret Thomas and Patricia Weiland. 
VITA

1954 ............ Born – Newport News, Virginia
1977 ............ Bachelor of Arts, German, The University of Delaware
1981 ............ Master of Arts, Linguistics, The Ohio State University
1982 ............ Graduate teaching associate, English as a Second Language, The Ohio State University
1984 ............ Education consultant, Software Productions, Columbus, Ohio
1984 ............ Courseware developer, College of Medicine, The Ohio State University
1988 ............ Senior programmer/analyst, Online Computer Library Center (OCLC), Dublin, Ohio
1990 ............ Systems analyst, OCLC
1992 ............ Associate research scientist, OCLC
1995 ............ Research scientist, OCLC
1999-present .... Senior research scientist, OCLC

PUBLICATIONS

Carol Jean Godby and Ray Reighart. 2001. Terminology identification in a collection of Web resources. In Karen Calhoun and John Riemer, (eds.), CORC: New Tools and Possibilities for Cooperative Electronic Resource Description. pp. 49-66. Binghamton, New York: The Haworth Press.

Carol Jean Godby, Eric Miller and Ray Reighart. 2000. Automatically generated topic maps of World Wide Web resources. The Annual Review of OCLC Research.
Accessible at: <http://www.oclc.org/research/publications/arr/1999/godby/topicmaps.htm>

Anders Ardö, Jean Godby, Andrew Houghton, Traugott Koch, Ray Reighart, Roger Thompson and Diane Vizine-Goetz. 2000. Browsing engineering resources on the Web. In Clare Beghtol, Lynne Howarth, and Nancy Williamson, (eds.), Dynamism and Stability in Knowledge Organisation. pp. 385-390. Würzburg, Germany: Ergon Verlag.

Carol Jean Godby and Ray Reighart. 2000. Using machine-readable text to update the Dewey Decimal Classification. Advances in Classification Research, pp. 21-34.

Carol Jean Godby and Ray Reighart. 1999. The WordSmith indexing system. The Annual Review of OCLC Research. Accessible at: <http://www.oclc.org/research/publications/arr/1998/godby_reighart/wordsmith.htm>

Diane Vizine-Goetz and Carol Jean Godby. 1998. Library classification schemes and access to electronic collections: enhancement of the Dewey Decimal Classification with supplemental vocabulary. Advances in Classification Research, pp. 14-25.

Diane Vizine-Goetz, Carol Jean Godby and Mark Bendig. 1995. Spectrum: A Web-based system for describing Internet resources. Computer Networks and ISDN Systems. 27:985-1002.

FIELDS OF STUDY

Major Field: Linguistics

TABLE OF CONTENTS

Abstract
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Words that Masquerade as Phrases
   1.0. Introduction
   1.1. The automatic identification of noun phrases
      1.1.1. Noun phrases in the information-retrieval task
      1.1.2. Noun-phrase collocations
   1.2. Perspectives from theoretical linguistics
      1.2.1. Syntactic properties of lexicalized
