Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

A COMPUTATIONAL STUDY OF LEXICALIZED NOUN PHRASES IN ENGLISH

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Carol Jean Godby, B.A., M.A.
The Ohio State University
2002

Dissertation Committee:
Professor Craige Roberts, Adviser
Professor Chris Brew
Professor David Dowty

Approved by
Adviser, Department of Linguistics

UMI Number: 3039473

Copyright 2001 by Godby, Carol Jean. All rights reserved.

UMI Microform 3039473
Copyright 2002 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346

Copyright © Carol Jean Godby 2001

ABSTRACT

Lexicalized noun phrases are noun phrases that function as words. In English, lexicalized noun phrases are usually realized as noun-noun compounds such as theater ticket and garbage man, or as adjective-noun phrases such as black market and high school. In specialized or technical subject domains, phrases such as urban planning, air traffic control, highway engineering and combinatorial mathematics represent conventional names for concepts that are just as important to these domains as single-word terms such as adsorbents, hydrology, or aerodynamics. Yet despite the fact that lexicalized noun phrases are frequent enough to be cited in dictionaries and book indexes, the traditional linguistic literature has failed to identify consistent and categorical formal criteria for identifying them.

This study develops and evaluates a linguistically natural computational method for recognizing lexicalized noun phrases in a large corpus of English-language engineering text by synthesizing the insights of studies in traditional linguistics and computational linguistics.
From the scholarship in theoretical linguistics, the analysis adopts the perspective that lexicalized noun phrases represent the names of concepts that are important to a community of speakers and have survived a single context of use. Theoretical linguists have also proposed diagnostic tests for identifying lexicalized noun phrases, many of which can be formalized in a computational study. From the scholarship in computational linguistics, the analysis incorporates the view that a linguistic investigation can be extended and verified by processing relevant evidence from a corpus of text, which can be evaluated using mathematical models that do not require categorical input.

In an engineering text, a small set of linguistic contexts, including professor of, department of, or studies in, yields long lists of lexicalized noun phrases, including public safety, abstract state machines, complex systems, computer graphics, and mathematical morphology. The study reported here identifies lexical and syntactic contexts that harbor lexicalized noun phrases and submits them to a machine-learning algorithm that classifies the lexical status of noun phrases extracted from the text. Results from several evaluations show that this evidence is relevant to the classification, and informal evidence from many other subject domains implies that the results can be generalized.

ACKNOWLEDGMENTS

I returned to graduate school to finish my Ph.D. nearly 15 years after I dropped out, and am now in a position to apply the results of my research in my professional life. This unlikely set of events converged because I have been fortunate enough to work in an intellectually rich and supportive environment.
I am grateful to my managers and colleagues at OCLC, who believe in me and understand the relevance of linguistics for solving the problems presented by large repositories of machine-readable text. My deepest thanks go to Martin Dillon, who invited me to join the Office of Research and gave me an interesting problem to work on. This dissertation is the result. Joan Mitchell and Diane Vizine-Goetz supported my interest in terminology extraction and invited me to many professional venues to discuss my work. Traugott Koch gave me permission to use a valuable data set that he and his colleagues at Lund University in Sweden collected, without which the project reported in my dissertation could not have been completed.

I am also grateful to the professors and students in the Linguistics Department at Ohio State, who welcomed me back as one of their own. Their collegiality and professionalism made the work go as smoothly as possible.

My heartfelt thanks go to my dissertation committee for their help in shaping my research problem into a linguistically sophisticated inquiry. Craige Roberts, my adviser, intervened several times with much-needed encouragement when I was overwhelmed and ready to give up. David Dowty, always a linguist's linguist, pushed me to higher standards of scholarship than I could have achieved on my own. Chris Brew arrived at Ohio State just in time with the expertise in corpus linguistics that I needed to do this project justice.

Special thanks go to Lee Jansen, who did a thorough job of editing the manuscript.

Finally, I thank my old friends, who kept me sane as I worked on this project: Mark Bendig, Jeff McKibben, Rosanne Norman, Debbie Stollenwerk, Margaret Thomas and Patricia Weiland.

VITA

1954 .................... Born - Newport News, Virginia
1977 .................... Bachelor of Arts, German, The University of Delaware
1981 .................... Master of Arts, Linguistics, The Ohio State University
1982 .................... Graduate teaching associate, English as a Second Language, The Ohio State University
1984 .................... Education consultant, Software Productions, Columbus, Ohio
1984 .................... Courseware developer, College of Medicine, The Ohio State University
1988 .................... Senior programmer/analyst, Online Computer Library Center (OCLC), Dublin, Ohio
1990 .................... Systems analyst, OCLC
1992 .................... Associate research scientist, OCLC
1995 .................... Research scientist, OCLC
1999-present ............ Senior research scientist, OCLC

PUBLICATIONS

Carol Jean Godby and Ray Reighart. 2001. Terminology identification in a collection of Web resources. In Karen Calhoun and John Riemer (eds.), CORC: New Tools and Possibilities for Cooperative Electronic Resource Description, pp. 49-66. Binghamton, New York: The Haworth Press.

Carol Jean Godby, Eric Miller and Ray Reighart. 2000. Automatically generated topic maps of World Wide Web resources. The Annual Review of OCLC Research. Accessible at: