Enriching an Explanatory Dictionary with Framenet and Propbank Corpus Examples
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Construction of English-Bodo Parallel Text
International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018 CONSTRUCTION OF ENGLISH -BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRANSLATION Saiful Islam 1, Abhijit Paul 2, Bipul Syam Purkayastha 3 and Ismail Hussain 4 1,2,3 Department of Computer Science, Assam University, Silchar, India 4Department of Bodo, Gauhati University, Guwahati, India ABSTRACT Corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language which exists in machine readable form. The scope of the corpus is endless in Computational Linguistics and Natural Language Processing (NLP). Parallel corpus is a very useful resource for most of the applications of NLP, especially for Statistical Machine Translation (SMT). The SMT is the most popular approach of Machine Translation (MT) nowadays and it can produce high quality translation result based on huge amount of aligned parallel text corpora in both the source and target languages. Although Bodo is a recognized natural language of India and co-official languages of Assam, still the machine readable information of Bodo language is very low. Therefore, to expand the computerized information of the language, English to Bodo SMT system has been developed. But this paper mainly focuses on building English-Bodo parallel text corpora to implement the English to Bodo SMT system using Phrase-Based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have been constructed General and Newspaper domains English-Bodo parallel text corpora. Finally, the quality of the constructed parallel text corpora has been tested using two evaluation techniques in the SMT system. -
Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna
International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-7S, May 2019 Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna At this period corpora were used in semantics and related Abstract: Corpus linguistic will be the one of latest and fast field. At this period corpus linguistics was not commonly method for teaching and learning of different languages. used, no computers were used. Corpus linguistic history Proposed article can provide the corpus linguistic introduction were opposed by Noam Chomsky. Now a day’s modern and give the general overview about the linguistic team of corpus. corpus linguistic was formed from English work context and Article described about the corpus, novelties and corpus are associated in the following field. Proposed paper can focus on the now it is used in many other languages. Alongside was using the corpus linguistic in IT field. As from research it is corpus linguistic history and this is considered as obvious that corpora will be used as internet and primary data in methodology [Abdumanapovna, 2018]. A landmark is the field of IT. The method of text corpus is the digestive modern corpus linguistic which was publish by Henry approach which can drive the different set of rules to govern the Kucera. natural language from the text in the particular language, it can Corpus linguistic introduce new methods and techniques explore about languages that how it can inter-related with other. for more complete description of modern languages and Hence the corpus linguistic will be automatically drive from the these also provide opportunities to obtain new result with text sources. -
20 Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation
Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation JOHANN-MATTIS LIST, Max Planck Institute for the Science of Human History, Germany 20 NATHANIEL A. SIMS, The University of California, Santa Barbara, America ROBERT FORKEL, Max Planck Institute for the Science of Human History, Germany While the amount of digitally available data on the worlds’ languages is steadily increasing, with more and more languages being documented, only a small proportion of the language resources produced are sustain- able. Data reuse is often difficult due to idiosyncratic formats and a negligence of standards that couldhelp to increase the comparability of linguistic data. The sustainability problem is nicely reflected in the current practice of handling interlinear-glossed text, one of the crucial resources produced in language documen- tation. Although large collections of glossed texts have been produced so far, the current practice of data handling makes data reuse difficult. In order to address this problem, we propose a first framework forthe computer-assisted, sustainable handling of interlinear-glossed text resources. Building on recent standardiza- tion proposals for word lists and structural datasets, combined with state-of-the-art methods for automated sequence comparison in historical linguistics, we show how our workflow can be used to lift a collection of interlinear-glossed Qiang texts (an endangered language spoken in Sichuan, China), and how the lifted data can assist linguists in their research. CCS Concepts: • Applied computing → Language translation; Additional Key Words and Phrases: Sino-tibetan, interlinear-glossed text, computer-assisted language com- parison, standardization, qiang ACM Reference format: Johann-Mattis List, Nathaniel A. -
Linguistic Annotation of the Digital Papyrological Corpus: Sematia
Marja Vierros Linguistic Annotation of the Digital Papyrological Corpus: Sematia 1 Introduction: Why to annotate papyri linguistically? Linguists who study historical languages usually find the methods of corpus linguis- tics exceptionally helpful. When the intuitions of native speakers are lacking, as is the case for historical languages, the corpora provide researchers with materials that replaces the intuitions on which the researchers of modern languages can rely. Using large corpora and computers to count and retrieve information also provides empiri- cal back-up from actual language usage. In the case of ancient Greek, the corpus of literary texts (e.g. Thesaurus Linguae Graecae or the Greek and Roman Collection in the Perseus Digital Library) gives information on the Greek language as it was used in lyric poetry, epic, drama, and prose writing; all these literary genres had some artistic aims and therefore do not always describe language as it was used in normal commu- nication. Ancient written texts rarely reflect the everyday language use, let alone speech. However, the corpus of documentary papyri gets close. The writers of the pa- pyri vary between professionally trained scribes and some individuals who had only rudimentary writing skills. The text types also vary from official decrees and orders to small notes and receipts. What they have in common, though, is that they have been written for a specific, current need instead of trying to impress a specific audience. Documentary papyri represent everyday texts, utilitarian prose,1 and in that respect, they provide us a very valuable source of language actually used by common people in everyday circumstances. -
Developing a Large Scale Framenet for Italian: the Iframenet Experience
Developing a large scale FrameNet for Italian: the IFrameNet experience Roberto Basili° Silvia Brambilla§ Danilo Croce° Fabio Tamburini§ ° § Dept. of Enterprise Engineering Dept. of Classic Philology and Italian Studies University of Rome Tor Vergata University of Bologna {basili,croce}@info.uniroma2.it [email protected], [email protected] ian Portuguese, German, Spanish, Japanese, Swe- Abstract dish and Korean. All these projects are based on the idea that English. This paper presents work in pro- most of the Frames are the same among languages gress for the development of IFrameNet, a and that, thanks to this, it is possible to adopt large-scale, computationally oriented, lexi- Berkeley’s Frames and FEs and their relations, cal resource based on Fillmore’s frame se- with few changes, once all the language-specific mantics for Italian. For the development of information has been cut away (Tonelli et al. 2009, IFrameNet linguistic analysis, corpus- Tonelli 2010). processing and machine learning techniques With regard to Italian, over the past ten years are combined in order to support the semi- several research projects have been carried out at automatic development and annotation of different universities and Research Centres. In par- the resource. ticular, the ILC-CNR in Pisa (e.g. Lenci et al. 2008; Johnson and Lenci 2011), FBK in Trento (e.g. Italiano. Questo articolo presenta un work Tonelli et al. 2009, Tonelli 2010) and the Universi- in progress per lo sviluppo di IFrameNet, ty of Rome, Tor Vergata (e.g. Pennacchiotti et al. una risorsa lessicale ad ampia copertura, 2008, Basili et al. -
Integrating Wordnet and Framenet Using a Knowledge-Based Word Sense Disambiguation Algorithm
Integrating WordNet and FrameNet using a knowledge-based Word Sense Disambiguation algorithm Egoitz Laparra and German Rigau IXA group. University of the Basque Country, Donostia {egoitz.laparra,german.rigau}@ehu.com Abstract coherent groupings of words belonging to the same frame. This paper presents a novel automatic approach to In that way we expect to extend the coverage of FrameNet partially integrate FrameNet and WordNet. In that (by including from WordNet closely related concepts), to way we expect to extend FrameNet coverage, to en- enrich WordNet with frame semantic information (by port- rich WordNet with frame semantic information and ing frame information to WordNet) and possibly to extend possibly to extend FrameNet to languages other than FrameNet to languages other than English (by exploiting English. The method uses a knowledge-based Word local wordnets aligned to the English WordNet). Sense Disambiguation algorithm for linking FrameNet WordNet1 [12] (hereinafter WN) is by far the most lexical units to WordNet synsets. Specifically, we ex- ploit a graph-based Word Sense Disambiguation algo- widely-used knowledge base. In fact, WN is being rithm that uses a large-scale knowledge-base derived used world-wide for anchoring different types of seman- from WordNet. We have developed and tested four tic knowledge including wordnets for languages other than additional versions of this algorithm showing a sub- English [4], domain knowledge [17] or ontologies like stantial improvement over previous results. SUMO [22] or the EuroWordNet Top Concept Ontology [3]. It contains manually coded information about En- glish nouns, verbs, adjectives and adverbs and is organized around the notion of a synset. -
Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture
© Copyright Michael Stubbs 1996. This is chapter 1 of Text and Corpus Analysis which was published by Blackwell in 1996. TEXT AND CORPUS ANALYSIS: COMPUTER-ASSISTED STUDIES OF LANGUAGE AND CULTURE MICHAEL STUBBS CHAPTER 1. TEXTS AND TEXT TYPES Attested language text duly recorded is in the focus of attention for the linguist. (Firth 1957b: 29.) The main question discussed in this book is as follows: How can an analysis of the patterns of words and grammar in a text contribute to an understanding of the meaning of the text? I discuss traditions of text analysis in (mainly) British linguistics, and computer-assisted methods of text and corpus analysis. And I provide analyses of several shorter and longer texts, and also of patterns of language across millions of words of text corpora. The main emphasis is on the analysis of attested, naturally occurring textual data. An exclusive concentration on the text alone is, however, not an adequate basis for text interpretation, and this introductory chapter discusses the appropriate balance for text analysis, between the text itself and its production and reception. Such analyses of attested texts are important for both practical and theoretical reasons, and the main reasons can be easily and briefly stated. First, there are many areas of the everyday world in which the systematic and critical interpretation of texts is an important social skill. One striking example occurs in courtrooms, where very important decisions, by juries as well as lawyers, can depend on such interpretations. Second, much of theoretical linguistics over the past fifty years has been based on the study of isolated sentences. -
The Hebrew Framenet Project
The Hebrew FrameNet Project Avi Hayoun, Michael Elhadad Dept. of Computer Science Ben-Gurion University Beer Sheva, Israel {hayounav,elhadad}@cs.bgu.ac.il Abstract We present the Hebrew FrameNet project, describe the development and annotation processes and enumerate the challenges we faced along the way. We have developed semi-automatic tools to help speed the annotation and data collection process. The resource currently covers 167 frames, 3,000 lexical units and about 500 fully annotated sentences. We have started training and testing automatic SRL tools on the seed data. Keywords: FrameNet, Hebrew, frame semantics, semantic resources 1. Introduction frames and their structures in natural language. Recent years have seen growing interest in the task 1.2. FrameNet in Other Languages of Semantic Role Labeling (SRL) of natural language The original FrameNet project has been adapted and text (sometimes called “shallow semantic parsing”). ported to multiple languages. The most active interna- The task is usually described as the act of identify- tional FrameNet teams include the Swedish FrameNet ing the semantic roles, which are the set of semantic (SweFN) covering close to 1,200 frames with 34K LUs properties and relationships defined over constituents (Ahlberg et al., 2014); the Japanese FrameNet (JFN) of a sentence, given a semantic context. with 565 frames, 8,500 LUs and 60K annotated ex- The creation of resources that document the realiza- ample sentences (Ohara, 2013); and FrameNet Brazil tion of semantic roles in natural language texts, such (FN-Br) covering 179 frames, 196 LUs and 12K anno- as FrameNet (Fillmore and Baker, 2010; Ruppenhofer tated sentences (Torrent and Ellsworth, 2013). -
INF5830 Introduction to Semantic Role Labeling
INF5830 Introduction to Semantic Role Labeling Andrey Kutuzov University of Oslo Language Technology Group With thanks to Lilja Øvrelid, Martha Palmer and Dan Jurafsky INF5830 Introduction to Semantic Role Labeling 1(36) Semantic Role Labeling INF5830 Introduction to Semantic Role Labeling 2(36) Introduction Contents Introduction Semantic roles in general PropBank: Proto-roles FrameNet: Frame Semantics Summary INF5830 Introduction to Semantic Role Labeling 2(36) Introduction Semantics I Study of meaning, expressed in language; I Morphemes, words, phrases, sentences; I Lexical semantics; I Sentence semantics; I (Pragmatics: how the context affects meaning). INF5830 Introduction to Semantic Role Labeling 3(36) Introduction Semantics I Linguistic knowledge: meaning I Meaningful or not: I Word { flick vs blick I Sentence { John swims vs John metaphorically every I Several meanings (WSD): I Words { fish I Sentence { John saw the man with the binoculars I Same meaning (semantic similarity): I Word { sofa vs couch I Sentence { John gave Hannah a gift vs John gave a gift to Hannah I Truth conditions: I All kings are male I Molybdenum conducts electricity I Entailment: I Alfred murdered the librarian I The librarian is dead I Participant roles: John is the `giver', Hannah is the `receiver' INF5830 Introduction to Semantic Role Labeling 4(36) Introduction Representing events I We want to understand the event described by these sentences: 1. IBM bought Spark 2. IBM acquired Spark 3. Spark was acquired by IBM 4. The owners of Spark sold it to IBM I Dependency parsing is insufficient. UDPipe will give us simple relations between verbs and arguments: 1. (buy, nsubj, IBM), (buy, obj, Spark) 2. -
Latvian Framenet: Cross-Lingual Issues
96 Human Language Technologies – The Baltic Perspective K. Muischnek and K. Müürisep (Eds.) © 2018 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0). doi:10.3233/978-1-61499-912-6-96 Latvian FrameNet: Cross-Lingual Issues Gunta NESPORE-Bˇ ERZKALNE¯ 1, Baiba SAULITE¯ and Normunds GRUZ¯ ITIS¯ Institute of Mathematics and Computer Science, University of Latvia Abstract. This paper reports the lessons learned while creating a FrameNet- annotated text corpus of Latvian. This is still an ongoing work, a part of a larger project which aims at the creation of a multilayer text corpus, anchored in cross- lingual state-of-the-art representations: Universal Dependencies (UD), FrameNet and PropBank, as well as Abstract Meaning Representation (AMR). For the FrameNet layer, we use the latest frame inventory of Berkeley FrameNet (BFN v1.7), while the annotation itself is done on top of the underlying UD layer. We strictly follow a corpus-driven approach, meaning that lexical units (LU) in Latvian FrameNet are created only based on the annotated corpus examples. Since we are aiming at a medium-sized still general-purpose corpus, an important aspect that we take into account is the variety and balance of the corpus in terms of genres, do- mains and LUs. We have finished the first phase of the FrameNet corpus annota- tion, and we have collected and discuss cross-lingual issues and their possible so- lutions. The issues are relevant for other languages as well, particularly if the goal is to maintain cross-lingual compatibility via BFN. -
Using Frame Semantics in Natural Language Processing
Using Frame Semantics in Natural Language Processing Apoorv Agarwal Daniel Bauer Owen Rambow Dept. of Computer Science Dept. of Computer Science CCLS Columbia University Columbia University Columbia University New York, NY New York, NY New York, NY [email protected] [email protected] [email protected] Abstract the notion of a social event (Agarwal et al., 2010), We summarize our experience using a particular kind of event which involves (at least) FrameNet in two rather different projects two people such that at least one of them is aware in natural language processing (NLP). of the other person. If only one person is aware We conclude that NLP can benefit from of the event, we call it Observation (OBS): for FrameNet in different ways, but we sketch example, someone is talking about someone else some problems that need to be overcome. in their absence. If both people are aware of the event, we call it Interaction (INR): for example, 1 Introduction one person is telling the other a story. Our claim We present two projects at Columbia in which we is that links in social networks are in fact made use FrameNet. In these projects, we do not de- up of social events: OBS social events give rise velop basic NLP tools for FrameNet, and we do to one-way links, and INR social events to two- not develop FramNets for new languages: we sim- way links. For more information, see (Agarwal ply use FrameNet or a FrameNet parser in an NLP and Rambow, 2010; Agarwal et al., 2013a; Agar- application. -
TEI and the Documentation of Mixtepec-Mixtec Jack Bowers
Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec Jack Bowers To cite this version: Jack Bowers. Language Documentation and Standards in Digital Humanities: TEI and the documen- tation of Mixtepec-Mixtec. Computation and Language [cs.CL]. École Pratique des Hauts Études, 2020. English. tel-03131936 HAL Id: tel-03131936 https://tel.archives-ouvertes.fr/tel-03131936 Submitted on 4 Feb 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Préparée à l’École Pratique des Hautes Études Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec Soutenue par Composition du jury : Jack BOWERS Guillaume, JACQUES le 8 octobre 2020 Directeur de Recherche, CNRS Président Alexis, MICHAUD Chargé de Recherche, CNRS Rapporteur École doctorale n° 472 Tomaž, ERJAVEC Senior Researcher, Jožef Stefan Institute Rapporteur École doctorale de l’École Pratique des Hautes Études Enrique, PALANCAR Directeur de Recherche, CNRS Examinateur Karlheinz, MOERTH Senior Researcher, Austrian Center for Digital Humanities Spécialité and Cultural Heritage Examinateur Linguistique Emmanuel, SCHANG Maître de Conférence, Université D’Orléans Examinateur Benoit, SAGOT Chargé de Recherche, Inria Examinateur Laurent, ROMARY Directeur de recherche, Inria Directeur de thèse 1.