Learning Compound Noun Semantics

Technical Report UCAM-CL-TR-735
ISSN 1476-2986
Number 735
Computer Laboratory

Diarmuid Ó Séaghdha
December 2008

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/

© 2008 Diarmuid Ó Séaghdha

This technical report is based on a dissertation submitted July 2008 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Corpus Christi College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/

Summary

This thesis investigates computational approaches for analysing the semantic relations in compound nouns and other noun-noun constructions. Compound nouns in particular have received a great deal of attention in recent years due to the challenges they pose for natural language processing systems. One reason for this is that the semantic relation between the constituents of a compound is not explicitly expressed and must be retrieved from other sources of linguistic and world knowledge.

I present a new scheme for the semantic annotation of compounds, describing in detail the motivation for the scheme and the development process. This scheme is applied to create an annotated dataset for use in compound interpretation experiments. The results of a dual-annotator experiment indicate that good agreement can be obtained with this scheme relative to previously reported results and also provide insights into the challenging nature of the annotation task.

I describe two corpus-driven paradigms for comparing pairs of nouns: lexical similarity and relational similarity. Lexical similarity is based on comparing each constituent of a noun pair to the corresponding constituent of another pair.
Relational similarity is based on comparing the contexts in which both constituents of a noun pair occur together with the corresponding contexts of another pair. Using the flexible framework of kernel methods, I develop techniques for implementing both similarity paradigms.

A standard approach to lexical similarity represents words by their co-occurrence distributions. I describe a family of kernel functions that are designed for the classification of probability distributions. The appropriateness of these distributional kernels for semantic tasks is suggested by their close connection to proven measures of distributional lexical similarity. I demonstrate the effectiveness of the lexical similarity model by applying it to two classification tasks: compound noun interpretation and the 2007 SemEval task on classifying semantic relations between nominals.

To implement relational similarity I use kernels on strings and sets of strings. I show that distributional set kernels based on a multinomial probability model can be computed many times more efficiently than previously proposed kernels, while still achieving equal or better performance. Relational similarity does not perform as well as lexical similarity in my experiments. However, combining the two models brings an improvement over either model alone and achieves state-of-the-art results on both the compound noun and SemEval Task 4 datasets.

Acknowledgments

The past four years have been a never-boring mixture of learning and relearning, experiments that worked in the end, experiments that didn’t, stress, relief, more stress, some croquet and an awful lot of running. And at the end a thesis was produced.

A good number of people have contributed to getting me there, by offering advice and feedback, by answering questions and by providing me with hard-to-find papers; these include Tim Baldwin, John Beavers, Barry Devereux, Roxana Girju, Anders Søgaard, Simone Teufel, Peter Turney and Andreas Vlachos.
Andreas also read and commented on parts of this thesis, and deserves to be thanked a second time. Julie Bazin read the whole thing, and many times saved me from my carelessness. Diane Nicholls enthusiastically and diligently participated in the annotation experiments. My supervisor Ann Copestake has been an unfailing source of support, from the beginning – by accepting an erstwhile Sanskritist who barely knew what a probability distribution was for a vague project involving compound nouns and machine learning – to the end – she has influenced many aspects of my writing and presentation for the better. I am grateful to my PhD examiners, Anna Korhonen and Diana McCarthy, for their constructive comments and good-humoured criticism.

This thesis is dedicated to my parents for giving me a good head start, to Julie for sticking by me through good results and bad, and to Cormac, for whom research was just one thing that ended too soon.

Contents

1 Introduction
2 Compound semantics in linguistics and NLP
  2.1 Introduction
  2.2 Linguistic perspectives on compounds
    2.2.1 Compounding as a linguistic phenomenon
    2.2.2 Inventories, integrated structures and pro-verbs: A survey of representational theories
  2.3 Compounds and semantic relations in NLP
    2.3.1 Inventories, integrated structures and pro-verbs (again)
    2.3.2 Inventory approaches
    2.3.3 Integrational approaches
  2.4 Conclusion
3 Developing a relational annotation scheme
  3.1 Introduction
  3.2 Desiderata for a semantic annotation scheme
  3.3 Development procedure
  3.4 The annotation scheme
    3.4.1 Overview
    3.4.2 General principles
    3.4.3 BE
    3.4.4 HAVE
    3.4.5 IN
    3.4.6 ACTOR, INST
    3.4.7 ABOUT
    3.4.8 REL, LEX, UNKNOWN
    3.4.9 MISTAG, NONCOMPOUND
  3.5 Conclusion
4 Evaluating the annotation scheme
  4.1 Introduction
  4.2 Data
  4.3 Procedure
  4.4 Analysis
    4.4.1 Agreement
    4.4.2 Causes of disagreement
  4.5 Prior work and discussion
  4.6 Conclusion
5 Semantic similarity and kernel methods
  5.1 Introduction
  5.2 A similarity-based approach to relational classification
  5.3 Methods for computing noun pair similarity
    5.3.1 Constituent lexical similarity
      5.3.1.1 Lexical similarity paradigms
      5.3.1.2 The distributional model
      5.3.1.3 Measures of similarity and distance
      5.3.1.4 From words to word pairs
    5.3.2 Relational similarity
      5.3.2.1 The relational distributional hypothesis
      5.3.2.2 Methods for token-level relational similarity
      5.3.2.3 Methods for type-level relational similarity
  5.4 Kernel methods and support vector machines
    5.4.1 Kernels
    5.4.2 Classification with support vector machines
    5.4.3 Combining heterogeneous information for classification
  5.5 Conclusion
6 Learning with co-occurrence vectors
  6.1 Introduction
  6.2 Distributional kernels for semantic similarity
  6.3 Datasets
    6.3.1 1,443 Compounds
    6.3.2 SemEval Task 4
  6.4 Co-occurrence corpora
    6.4.1 British National Corpus
    6.4.2 Web 1T 5-Gram Corpus
  6.5 Methodology
  6.6 Compound noun experiments
  6.7 SemEval Task 4 experiments
  6.8 Further analysis
    6.8.1 Investigating the behaviour of distributional kernels
    6.8.2 Experiments with co-occurrence weighting
  6.9 Conclusion
7 Learning with strings and sets
  7.1 Introduction
  7.2 String kernels for token-level relational similarity
    7.2.1 Kernels on strings
    7.2.2 Distributional string kernels
    7.2.3 Application to SemEval Task 4
      7.2.3.1 Method
      7.2.3.2 Results
  7.3 Set kernels for type-level relational similarity
    7.3.1 Kernels on sets
    7.3.2 Multinomial set kernels
    7.3.3 Related work
    7.3.4 Application to compound noun interpretation
      7.3.4.1 Method
      7.3.4.2 Comparison of time requirements
      7.3.4.3 Results
  7.4 Conclusion
8 Conclusions and future work
  8.1 Contributions of the thesis
  8.2 Future work
Appendices
A Notational conventions
B Annotation guidelines for compound nouns

Chapter 1 Introduction

Noun-noun compounds are familiar facts of our daily linguistic lives. To take a simple example from my typical morning routine: each weekday morning I eat breakfast in the living room, while catching up on email correspondence