The Effects of Lexical Resource Quality on Preference Violation Detection


Jesse Dunietz
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

Lori Levin and Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{lsl,jgc}@cs.cmu.edu

Abstract

Lexical resources such as WordNet and VerbNet are widely used in a multitude of NLP tasks, as are annotated corpora such as treebanks. Often, the resources are used as-is, without question or examination. This practice risks missing significant performance gains and even entire techniques.

This paper addresses the importance of resource quality through the lens of a challenging NLP task: detecting selectional preference violations. We present DAVID, a simple, lexical resource-based preference violation detector. With as-is lexical resources, DAVID achieves an F1-measure of just 28.27%. When the resource entries and parser outputs for a small sample are corrected, however, the F1-measure on that sample jumps from 40% to 61.54%, and performance on other examples rises, suggesting that the algorithm becomes practical given refined resources. More broadly, this paper shows that resource quality matters tremendously, sometimes even more than algorithmic improvements.

1 Introduction

A variety of NLP tasks have been addressed using selectional preferences or restrictions, including word sense disambiguation (see Navigli (2009)), semantic parsing (e.g., Shi and Mihalcea (2005)), and metaphor processing (see Shutova (2010)). These semantic problems are quite challenging; metaphor analysis, for instance, has long been recognized as requiring considerable semantic knowledge (Wilks, 1978; Carbonell, 1980). The advent of extensive lexical resources, annotated corpora, and a spectrum of NLP tools presents an opportunity to revisit such challenges from the perspective of selectional preference violations. Detecting these violations, however, constitutes a severe stress-test for resources designed for other tasks. As such, it can highlight shortcomings and allow quantifying the potential benefits of improving resources such as WordNet (Fellbaum, 1998) and VerbNet (Schuler, 2005).

In this paper, we present DAVID (Detector of Arguments of Verbs with Incompatible Denotations), a resource-based system for detecting preference violations. DAVID is one component of METAL (Metaphor Extraction via Targeted Analysis of Language), a new system for identifying, interpreting, and cataloguing metaphors. One purpose of DAVID was to explore how far lexical resource-based techniques can take us. Though our initial results suggested that the answer is "not very," further analysis revealed that the problem lies less in the technique than in the state of existing resources and tools.

Often, it is assumed that the frontier of performance on NLP tasks is shaped entirely by algorithms. Manning (2011) showed that this may not hold for POS tagging – that further improvements may require resource cleanup. In the same spirit, we argue that for some semantic tasks, exemplified by preference violation detection, resource quality may be at least as essential as algorithmic enhancements.

2 The Preference Violation Detection Task

DAVID builds on the insight of Wilks (1978) that the strongest indicator of metaphoricity is the violation of selectional preferences. For example, only plants can literally be pruned. If laws is the object of pruned, the verb is likely metaphorical. Flagging such semantic mismatches between verbs and arguments is the task of preference violation detection.
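As a toy illustration of the task (the preference and category tables below are hand-coded stand-ins, not entries from any real lexical resource), a detector needs only the basic-sense constraint of each verb and a semantic category for each argument head:

```python
# Toy illustration of preference violation detection. All tables here
# are hand-coded stand-ins for real lexical resources: each verb's most
# basic sense imposes a semantic category on its object, and a mismatch
# is flagged as a (likely metaphorical) violation.

# Basic-sense object preferences: only plants can literally be pruned.
OBJECT_PREFERENCE = {"prune": "plant", "drink": "liquid"}

# Coarse semantic categories for a few noun heads.
NOUN_CATEGORY = {"hedge": "plant", "laws": "abstraction", "water": "liquid"}

def violates_preference(verb: str, obj: str) -> bool:
    """Return True if obj cannot satisfy verb's basic-sense preference."""
    wanted = OBJECT_PREFERENCE.get(verb)
    if wanted is None:
        return False  # no recorded preference, so nothing to violate
    return NOUN_CATEGORY.get(obj) != wanted

print(violates_preference("prune", "laws"))   # True: likely metaphorical
print(violates_preference("prune", "hedge"))  # False: literal pruning
```

In DAVID itself, the hand-coded tables are replaced by real resources: VerbNet supplies the constraints and WordNet supplies the noun semantics.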
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 765–770, Sofia, Bulgaria, August 4–9 2013. © 2013 Association for Computational Linguistics

We base our definition of preferences on the Pragglejaz guidelines (Pragglejaz Group, 2007) for identifying the most basic sense of a word as the most concrete, embodied, or precise one. Similarly, we define selectional preferences as the semantic constraints imposed by a verb's most basic sense. Dictionaries may list figurative senses of prune, but we take the basic sense to be cutting plant growth.

"The politician pruned laws regulating plastic bags, and created new fees for inspecting dairy farms."

Verb       | Arg0           | Arg1
pruned     | The politician | laws … bags
regulating | laws           | plastic bags
created    | The politician | new fees
inspecting | -              | dairy farms

Table 1: SENNA's SRL output for the example sentence above. Though this example demonstrates only two arguments, SENNA is capable of labeling up to six.

Several types of verbs were excluded from the task because they have very lax preferences. These include verbs of becoming or seeming (e.g., transform, appear), light verbs, auxiliaries, and aspectual verbs. For the sake of simplifying implementation, phrasal verbs were also ignored.

3 Algorithm Design

To identify violations, DAVID employs a simple algorithm based on several existing tools and resources: SENNA (Collobert et al., 2011), a semantic role labeling (SRL) system; VerbNet, a computational verb lexicon; SemLink (Loper et al., 2007), which includes mappings between PropBank (Palmer et al., 2005) and VerbNet; and WordNet. As one metaphor detection component of METAL's several, DAVID is designed to favor precision over recall. The algorithm is as follows:

1. Run the Stanford CoreNLP POS tagger (Toutanova et al., 2003) and the TurboParser dependency parser (Martins et al., 2011).

2. Run SENNA to identify the semantic arguments of each verb in the sentence using the PropBank argument annotation scheme (Arg0, Arg1, etc.). See Table 1 for example output.

3. For each verb V, find all VerbNet entries for V. Using SemLink, map each PropBank argument name to the corresponding VerbNet thematic roles in these entries (Agent, Patient, etc.). For example, the VerbNet class for prune is carve-21.2-2. SemLink maps Arg0 to the Agent of carve-21.2-2 and Arg1 to the Patient.

4. Retrieve from VerbNet the selectional restrictions of each thematic role. In our running example, VerbNet specifies +int_control and +concrete for the Agent and Patient of carve-21.2-2, respectively.

5. If the head of any argument cannot be interpreted to meet V's preferences, flag V as a violation.

Restriction  | WordNet Synsets
animate      | animate_being.n.01, people.n.01, person.n.01
concrete     | physical_object.n.01, matter.n.01, substance.n.01
organization | social_group.n.01, district.n.01

Table 2: DAVID's mappings between some common VerbNet restriction types and WordNet synsets.

Each VerbNet restriction is interpreted as mandating or forbidding a set of WordNet hypernyms, defined by a custom mapping (see Table 2). For example, VerbNet requires both the Patient of a verb in carve-21.2-2 and the Theme of a verb in wipe_manner-10.4.1-1 to be concrete. By empirical inspection, concrete nouns are hyponyms of the WordNet synsets physical_object.n.01, matter.n.03, or substance.n.04. Laws (the Patient of prune) is a hyponym of none of these, so prune would be flagged as a violation.

4 Corpus Annotation

To evaluate our system, we assembled a corpus of 715 sentences from the METAL project's corpus of sentences with and without metaphors. The corpus was annotated by two annotators following an annotation manual. Each verb was marked for whether its arguments violated the selectional preferences of the most basic, literal meaning of the verb. The annotators resolved conflicts by discussing until consensus.

5 Initial Results

As the first row of Table 4 shows, our initial evaluation left little hope for the technique. With such low precision and F1, it seemed a lexical resource-based preference violation detector was out. When we analyzed the errors in 90 randomly selected sentences, however, we found that most were not due to systemic problems with the approach; rather, they stemmed from SRL and parsing errors and missing or incorrect resource entries.

Error source                   | Frequency
Bad/missing VN entries         | 4.5 (14.1%)
Bad/missing VN restrictions    | 6 (18.8%)
Bad/missing SL mappings        | 2 (6.3%)
Parsing/head-finding errors    | 3.5 (10.9%)
SRL errors                     | 8.5 (26.6%)
VN restriction system too weak | 4 (12.5%)
Confounding WordNet senses     | 3.5 (10.9%)
Endemic errors                 | 7.5 (23.4%)
Resource errors                | 12.5 (39.1%)
Tool errors                    | 12 (37.5%)
Total                          | 32 (100%)

Table 3: Sources of error in 90 randomly selected sentences. For errors that were due to a combination of sources, 1/2 point was awarded to each source. (VN stands for VerbNet and SL for SemLink.)

…tations for non-verbs. The only parser-related error we corrected was a mislabeled noun.

6.2 Correcting Corrupted Data in VerbNet

The VerbNet download is missing several subclasses that are referred to by SemLink or that have been updated on the VerbNet website. Some roles also have not been updated to the latest version, and some subclasses are listed with incorrect IDs. These problems, which caused SemLink mappings to fail, were corrected before reviewing errors from the corpus.

Six subclasses needed to be fixed, all of which were easily detected by a simple script that did not depend on the 90-sentence subcorpus. We therefore expect that few further changes of this type would be needed for a more complete resource refinement effort.

6.3 Corpus-Based Updates to SemLink

Our modifications to SemLink's mappings included adding missing verbs, adding missing roles to mappings, and correcting mappings to more appropriate classes or roles. We also added null mappings in cases where a PropBank argument had no corresponding role in VerbNet. This makes the system's strategy for ruling out mappings more reliable.

No corrections were made purely based on the sample. Any time a verb's mappings were edited, VerbNet was scoured for plausible mappings for every verb sense in PropBank, and any nonsensical mappings were deleted.
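The resource lookup behind the restriction check in step 5 of the algorithm reduces to a hypernym test: an argument head satisfies a restriction such as +concrete only if it is a hyponym of one of the synsets the restriction maps to. A minimal sketch follows; the hypernym graph is a tiny hand-coded fragment standing in for WordNet (the synset names follow WordNet's notation, but the graph itself is illustrative):

```python
# Sketch of DAVID's restriction check (step 5): an argument head meets a
# VerbNet restriction such as +concrete iff it is a hyponym of one of the
# WordNet synsets the restriction maps to. The hypernym graph below is a
# tiny hand-coded fragment standing in for WordNet, not real resource data.

# child synset -> direct hypernym synsets (illustrative fragment only)
HYPERNYMS = {
    "dog.n.01": ["animal.n.01"],
    "animal.n.01": ["physical_object.n.01"],
    "law.n.01": ["rule.n.01"],
    "rule.n.01": ["abstraction.n.01"],
}

# Synsets licensing the +concrete restriction, as listed in the paper.
CONCRETE = {"physical_object.n.01", "matter.n.03", "substance.n.04"}

def hypernym_closure(synset: str) -> set:
    """All synsets reachable via hypernym links, including the start."""
    seen, stack = set(), [synset]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(HYPERNYMS.get(s, []))
    return seen

def violates_concrete(head_synset: str) -> bool:
    """True if the head cannot be interpreted as concrete."""
    return not (hypernym_closure(head_synset) & CONCRETE)

print(violates_concrete("law.n.01"))  # True: laws are abstract, so flagged
print(violates_concrete("dog.n.01"))  # False: a dog is a physical object
```

Against the real WordNet, the same closure can be computed with, e.g., NLTK's `Synset.closure` over hypernym links, and a full check would consider every sense of the head noun, flagging a violation only if none can be interpreted as meeting the restriction.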