Lexical Semantic Analysis in Natural Language Text
Total Page:16
File Type:pdf, Size:1020Kb
Lexical Semantic Analysis in Natural Language Text Nathan Schneider Language Technologies Institute School of Computer Science Carnegie Mellon University ◇ July 28, 2014 Submitted in partial fulfillment of the requirements for the degree of doctor of philosophy in language and information technologies Abstract Computer programs that make inferences about natural language are easily fooled by the often haphazard relationship between words and their meanings. This thesis develops Lexical Semantic Analysis (LxSA), a general-purpose framework for describing word groupings and meanings in context. LxSA marries comprehensive linguistic annotation of corpora with engineering of statistical natural lan- guage processing tools. The framework does not require any lexical resource or syntactic parser, so it will be relatively simple to adapt to new languages and domains. The contributions of this thesis are: a formal representation of lexical segments and coarse semantic classes; a well-tested linguistic annotation scheme with detailed guidelines for identifying multi- word expressions and categorizing nouns, verbs, and prepositions; an English web corpus annotated with this scheme; and an open source NLP system that automates the analysis by statistical se- quence tagging. Finally, we motivate the applicability of lexical se- mantic information to sentence-level language technologies (such as semantic parsing and machine translation) and to corpus-based linguistic inquiry. i I, Nathan Schneider, do solemnly swear that this thesis is my own To my parents, who gave me language. work, and that everything therein is, to the best of my knowledge, true, accurate, and properly cited, so help me Vinken. § ∗ In memory of Charles “Chuck” Fillmore, one of the great linguists. Except for the footnotes that are made up. If you cannot handle jocular asides, please∗ recompile this thesis with the \jnote macro disabled. ii iii Acknowledgments blah blah v 3.1 Introduction ........................ 28 3.2 Linguistic Characterization................ 31 3.3 Existing Resources..................... 38 3.4 Taking a Different Tack .................. 46 3.5 Formal Representation.................. 49 Contents 3.6 Annotation Scheme.................... 51 3.7 Annotation Process .................... 53 3.8 The Corpus......................... 58 3.9 Conclusion......................... 61 Abstracti 4 Noun and Verb Supersenses 63 4.1 Introduction ........................ 64 Acknowledgmentsv 4.2 Background: Supersense Tags.............. 64 4.3 Supersense Annotation for Arabic............ 67 1 Setting the Stage1 4.4 Supersense Annotation for English........... 71 1.1 Task Definition....................... 3 4.5 Conclusion......................... 77 1.2 Approach.......................... 5 1.3 Guiding Principles..................... 6 5 Preposition Supersenses 79 1.4 Contributions ....................... 8 5.1 Background......................... 82 1.5 Organization........................ 8 5.2 Our Approach to Prepositions.............. 88 5.3 Preposition Targets .................... 89 2 General Background: Computational Lexical Semantics 11 5.4 Preposition Tags...................... 92 2.1 The Computational Semantics Landscape . 12 5.5 Conclusion.........................115 2.2 Lexical Semantic Categorization Schemes . 14 2.3 Lexicons and Word Sense Classification . 15 2.4 Named Entity Recognition and Sequence Models . 19 II Automation 117 2.5 Other Topics in Computational Lexical Semantics . 23 6 Multiword Expression Identification 119 6.1 Introduction ........................120 I Representation & Annotation 25 6.2 Evaluation..........................120 6.3 Tagging Schemes......................122 3 Multiword Expressions 27 6.4 Model ............................124 vi vii Computers are getting smarter all the time: scientists tell us that soon they will be able to talk to us. (By “they” I mean “computers”: I doubt scientists will ever be able to talk to us.) 6.5 Results............................130 6.6 Related Work........................140 Dave Barry, Dave Barry’s Bad Habits: “The Computer: Is It Terminal?” 6.7 Conclusion.........................141 7 Full Supersense Tagging 143 7.1 Background: English Supersense Tagging with a Dis- criminative Sequence Model...............144 7.2 Piggybacking off of the MWE Tagger ..........145 CHAPTER 7.3 Experiments: MWEs + Noun and Verb Supersenses . 146 1 7.4 Conclusion.........................153 III Conclusion 155 Setting the Stage 8 Summary of Contributions 157 9 Future Work 159 The seeds for this thesis were two embarrassing realizations.1 The first was that, despite established descriptive frameworks A MWE statistics in PARSEDSEMCOR 161 for syntax and morphology, and many vigorous contenders for rela- A.1 Gappy MWEs........................162 tional and compositional semantics, I did not know of any general- purpose linguistically-driven computational scheme to represent B MWE Annotation Guidelines 165 the contextual meanings of words—let alone resources or algorithms C MWE Patterns by POS 181 for putting such a scheme into practice. My embarrassment deepened when it dawned on me that I knew D Noun Supersense Tagset 183 of no general-purpose linguistically-driven computational scheme for even deciding which groups of characters or tokens in a text E Noun Supersense Annotation Guidelines 187 qualify as meaningful words! Well. Embarrassment, as it turns out, can be a constructive F Verb Supersense Annotation Guidelines 195 Bibliography 203 1In the Passover Haggadah, we read that two Rabbis argued over which sentence should begin the text, whereupon everyone else in the room must have cried that Index 229 four cups of wine were needed, stat. This thesis is similar, only instead of wine, you get footnotes. viii 1 motivator. Hence this thesis, which chronicles the why, the what, tomation problem by decomposing it into subproblems, or tasks; and the how of analyzing (in a general-purpose linguistically-driven NLP tasks with natural language text input include grammatical computational fashion) the lexical semantics of natural language analysis with linguistic representations, automatic knowledge base text. or database construction, and machine translation.2 The latter two § are considered applications because they fulfill real-world needs, The intricate communicative capacity we know as “language” rests whereas automating linguistic analysis (e.g., syntactic parsing) is upon our ability to learn and exploit conventionalized associations sometimes called a “core NLP” task. Core NLP systems are those between patterns and meanings. When in Annie Hall Woody Allen’s intended to provide modular functionality that could be exploited character explains, “My raccoon had hepatitis,” English-speaking for many language processing applications. viewers are instantly able to identify sound patterns in the acoustic This thesis develops linguistic descriptive techniques, an En- signal that resemble sound patterns they have heard before. The glish text dataset, and algorithms for a core NLP task of analyzing patterns are generalizations over the input (because no two acoustic the lexical semantics of sentences in an integrated and general- signals are identical). For one acquainted with English vocabulary, purpose way. My hypothesis is that the approach is conducive to they point the way to basic meanings like the concepts of ‘raccoon’ rapid high-quality human annotation, to efficient automation, and and ‘hepatitis’—we will call these lexical meanings—and the gram- to downstream application. matical patterns of the language provide a template for organizing A synopsis of the task definition, guiding principles, method- words with lexical meanings into sentences with complex meanings ology, and contributions will serve as the entrée of this chapter, (e.g., a pet raccoon having been afflicted with an illness, offered as followed by an outline of the rest of the document for dessert. an excuse for missing a Dylan concert). Sometimes it is useful to contrast denoted semantics (‘the raccoon that belongs to me was 1.1 Task Definition suffering from hepatitis’) and pragmatics (‘. which is why I was unable to attend the concert’) with meaning inferences (‘my pet We3 define Lexical Semantic Analysis (LxSA) to be the task of seg- raccoon’; ‘I was unable to attend the concert because I had to nurse menting a sentence into its lexical expressions, and assigning se- the ailing creature’). Listeners draw upon vast stores of world knowl- mantic labels to those expressions. By lexical expression we mean edge and contextual knowledge to deduce coherent interpretations a word or group of words that, intuitively, has a “basic” meaning of necessarily limited utterances. or function. By semantic label we mean some representation of As effortless as this all is for humans, understanding language production, processing, and comprehension is the central goal of 2I will henceforth assume the reader is acquainted with fundamental concepts, representations, and methods in linguistics, computer science, and NLP.An introduc- the field of linguistics, while automating human-like language be- tion to computational linguistics can be found in Jurafsky and Martin(2009). havior in computers is a primary dream of artificial intelligence. The 3Since the time of Aristotle, fledgling academics have taken to referring them- field of natural language processing (NLP) tackles the language au- selves in the plural in hopes of gaining some measure of dignity. Instead, they were treated as