Open Linked Data and New Approaches to the Treatment of Linguistic Meaning



TAL 2014: Evolution of Dialogue Interfaces and Open Data
Open Linked Data and New Approaches to the Treatment of Linguistic Meaning
Guido Vetere, IBM Italia - Centro Studi Avanzati
K-Drive: Knowledge Driven Data Exploitation
© 2009 IBM Corporation
20 January 2014, Guido Vetere, IBM - TAL 2014, © 2014 IBM Corporation

The dimension of meaning

Views of lexical semantics: conceptual
Geeraerts D. (2010), Theories of Lexical Semantics, Oxford University Press

Views of lexical semantics: textual
Geeraerts D. (2010), Theories of Lexical Semantics, Oxford University Press

Open linguistic data
  • lexicons
  • thesauri
  • ontologies
  • annotated corpora

Linguistic data and evidence in Watson
  • background knowledge
  • textual evidence

Data-driven linguistic inferences
Q: 1506 What's the name of King Arthur's sword?
"... where Arthur is believed to have thrown his sword, Excalibur, to be received by the Lady of the Lake; ..."
Kateryna Tymoshenko, Alessandro Moschitti and Aliaksei Severyn. "Encoding Semantic Resources in Syntactic Structures for Passage Reranking", accepted to the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2014), Gothenburg, Sweden, 2014.

Linguistic data: the importance of keeping them open
[Image credit: ©2011-2014 Winsord]

Linguistic data for Italian

Senso Comune

(Ever) open problems
[Figure; surviving labels: 19 75]

Semantics, vagueness and ontology
persona (person): subjects, situations
Achille C. Varzi, "Vaghezza e ontologia", in Storia dell'ontologia, edited by Maurizio Ferraris, Bompiani, Milan, 2008

Semantics, vagueness and ontology
giovane (young): subjects, situations
Achille C. Varzi, "Vaghezza e ontologia", in Storia dell'ontologia, edited by Maurizio Ferraris, Bompiani, Milan, 2008

What is in a lexical relation?
WHAT'S IN THIS LINK? giovane (young), immaturo (immature)

Making vagueness precise
K-Drive: Knowledge Driven Data Exploitation. Grant agreement no.: 286348
Panos Alexopoulos, Boris Villazón-Terrazas, Jeff Z. Pan: Towards Vagueness-Aware Semantic Data. URSW 2013

Modelling linguistic concepts
[Diagram; surviving labels: Concept; Entity (Abstract Entity, Concrete Entity); commitment; Physical Reality; Linguistic Reality (Tokens, Types); Meaning; Sense; Word; Lexeme; see also]
G. Vetere, A. Oltramari, I. Chiari, E. Jezek, L. Vieu and F.M. Zanzotto. 2012. «Senso Comune, an Open Knowledge Base for Italian», Revue TAL, Special Issue on "Free Language Resources", vol. 52, n. 3, pp. 217-43

Conclusions
  • Corpora, dictionaries, thesauri and ontologies are crucial to the development of language technologies.
  • Their 'openness' is not only a distribution policy; it also answers requirements of transparency, interoperability and integrability.
  • Integrating linguistic knowledge bases is not trivial, because it involves many philosophical problems.
  • It is important that the models and practices used to distribute and integrate linguistic data manage, in some way, to preserve the nature of the 'sign'.
[Background figure: a cloud of open linguistic resource names, among them Apertium project lexicons, Penn Discourse Treebank, YAGO, plWordNet, Lemon Wiktionary, BabelNet, PropBank, Lexvo.org, Cornetto, SemCor, ConceptNet, GermaNet, FrameNet-related corpora such as the Manually Annotated Sub-Corpus, OmegaWiki, OLAC, OPUS, OntoNotes, ISOcat, TalkBank, PanLex, OLiA, VerbNet, DBpedia, WOLD, PHOIBLE, Wordnet (W3C) and Wordnet (Princeton)]
Guido Vetere
Recommended publications
  • Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
    RANLPStud 2011. Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), 13 September 2011, Hissar, Bulgaria. ISBN 978-954-452-016-8. Designed and Printed by INCOMA Ltd., Shoumen, BULGARIA. Preface: The Recent Advances in Natural Language Processing (RANLP) conference, already in its eighth year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. In 2009, we decided to give an arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop, which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by
  • MASC: the Manually Annotated Sub-Corpus of American English
    MASC: The Manually Annotated Sub-Corpus of American English. Nancy Ide*, Collin Baker**, Christiane Fellbaum†, Charles Fillmore**, Rebecca Passonneau††. *Vassar College, Poughkeepsie, New York, USA; **International Computer Science Institute, Berkeley, California, USA; †Princeton University, Princeton, New Jersey, USA; ††Columbia University, New York, New York, USA. Abstract: To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing
  • Conference Abstracts
    EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION. Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner. MAY 23-24-25, 2012, ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE, ISTANBUL, TURKEY. CONFERENCE ABSTRACTS. Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon. © ELRA – European Language Resources Association. All rights reserved. Title: LREC 2012 Conference Abstracts. Distributed by: ELRA – European Language Resources Association, 55-57, rue Brillat Savarin, 75013 Paris, France. Tel.: +33 1 43 13 33 33, Fax: +33 1 43 13 33 30, www.elra.info and www.elda.org. Copyright by the European Language Resources Association. ISBN 978-2-9517408-7-7, EAN 9782951740877. All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association. Introduction of the Conference Chair, Nicoletta Calzolari: I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues.
  • Background and Context for CLASP
    Background and Context for CLASP. Nancy Ide, Vassar College. The Situation: Standards efforts have been ongoing for over 20 years. Interest and activity mainly in Europe in the 90s and early 2000s. Text Encoding Initiative (TEI), 1987: still ongoing, used mainly by humanities. EAGLES/ISLE: developed standards for morpho-syntax, syntax, sub-categorization, etc. (links on CLASP wiki). Corpus Encoding Standard (now XCES - http://www.xces.org). Main Aspects: harmonization of formats for linguistic data and annotations; harmonization of descriptors in linguistic annotation; these two are often mixed, but need to deal with them separately (see CLASP wiki). Formats, The Past 20 Years: 1987 TEI; myriad of formats; 1994 MULTEXT, CES; ~1996 XML; 2000 ISO TC37 SC4; 2001 LAF model introduced; now LAF/GrAF, ISO standards; myriad of formats. Actually, things are better now: XML use; moves toward common models, especially in Europe; US community seeing the need for interoperability; emergence of common processing platforms (GATE, UIMA) with underlying common models. Resources, 1990 to present: WordNet gains ground as a "standard" LR; Penn Treebank, Wall Street Journal Corpus; [World Wide Web]; British National Corpus; EuroWordNet; [XML]; Comlex; FrameNet; American National Corpus; [Semantic Web]; Global WordNet; more FrameNets; SUMO; VerbNet; PropBank, NomBank; MASC. NLP software: 1994 MULTEXT > LT tools, LT XML; 1995 GATE (Sheffield); 1996, 1998 Alembic Workbench, ATLAS (NIST); 2003 What happened to this?; 200? Callisto, UIMA; Now: GATE
  • Informatics 1: Data & Analysis
    Informatics 1: Data & Analysis. Lecture 12: Corpora. Ian Stark, School of Informatics, The University of Edinburgh. Friday 27 February 2015, Semester 2 Week 6. http://www.inf.ed.ac.uk/teaching/courses/inf1/da Student Survey, Final Day: ESES, the Edinburgh Student Experience Survey, http://www.ed.ac.uk/students/surveys Please log on to MyEd before 1 March to complete the survey. Help guide what we do at the University of Edinburgh, improving your future experience here and that of the students to follow. Ian Stark, Inf1-DA / Lecture 12, 2015-02-27. Lecture Plan: XML. We start with technologies for modelling and querying semistructured data. Semistructured Data: Trees and XML; Schemas for structuring XML; Navigating and querying XML with XPath. Corpora: One particular kind of semistructured data is large bodies of written or spoken text: each one a corpus, plural corpora. Corpora: What they are and how to build them; Applications: corpus analysis and data extraction. Homework: Tutorial Exercises. Tutorial 5 exercises went online earlier this week. In these you use the xmllint command-line tool to check XML validity and run your own XPath queries. Reading: T. McEnery and A. Wilson. Corpus Linguistics. Second edition, Edinburgh University Press, 2001. Chapter 2: What is a corpus and what is in it? (§2.2.2 optional) Photocopied handout, also available from the ITO. Remote Access to DICE: Much coursework can be done on your own machines, but sometimes it’s important to be able to connect to and use DICE systems.
  • The Expanding Horizons of Corpus Analysis
    The expanding horizons of corpus analysis. Brian MacWhinney, Carnegie Mellon University. Abstract: By including a focus on multimedia interactions linked to transcripts, corpus linguistics can vastly expand its horizons. This expansion will rely on two continuing developments. First, we need to develop easily used methods for each of the ten analytic methods we have examined, including lexical analyses, QDA (qualitative data analysis), automatic tagging, language profiles, group comparisons, change scores, error analysis, feedback studies, conversation analysis, and modeling. Second, we need to work together to construct a unified database for language studies and related sciences. This database must be grounded on the principles of open access, data-sharing, interoperability, and integrated structure. It must provide powerful tools for searching, multilinguality, and multimedia analysis. If we can build this infrastructure, we will be able to explore more deeply the key questions underlying the structure and functioning of language, as it emerges from the intermeshing of processes operative on eight major timeframes. 1. Introduction: Corpus linguistics has benefitted greatly from continuing advances in computer and Internet hardware and software. These advances have made it possible to develop facilities such as BNCweb (bncweb.lancs.ac.uk), LDC (Linguistic Data Consortium) online, the American National Corpus (americannationalcorpus.org), TalkBank (talkbank.org), and CHILDES (childes.psy.cmu.edu). In earlier periods, these corpora were limited to written and transcribed materials. However, most newer corpora now include transcripts linked to either audio or video recordings. The development of this newer corpus methodology is facilitated by technology which makes it easy to produce high-quality video recordings of face-to-face interactions.
  • Unit 3: Available Corpora and Software
    Corpus building and investigation for the Humanities: An on-line information pack about corpus investigation techniques for the Humanities Unit 3: Available corpora and software Irina Dahlmann, University of Nottingham 3.1 Commonly-used reference corpora and how to find them This section provides an overview of commonly-used and readily available corpora. It is also intended as a summary only and is far from exhaustive, but should prove useful as a starting point to see what kinds of corpora are available. The Corpora here are divided into the following categories: • Corpora of General English • Monitor corpora • Corpora of Spoken English • Corpora of Academic English • Corpora of Professional English • Corpora of Learner English (First and Second Language Acquisition) • Historical (Diachronic) Corpora of English • Corpora in other languages • Parallel Corpora/Multilingual Corpora Each entry contains the name of the corpus and a hyperlink where further information is available. All the information was accurate at the time of writing but the information is subject to change and further web searches may be required. Corpora of General English The American National Corpus http://www.americannationalcorpus.org/ Size: The first release contains 11.5 million words. The final release will contain 100 million words. Content: Written and Spoken American English. Access/Cost: The second release is available from the Linguistic Data Consortium (http://projects.ldc.upenn.edu/ANC/) for $75. The British National Corpus http://www.natcorp.ox.ac.uk/ Size: 100 million words. Content: Written (90%) and Spoken (10%) British English. Access/Cost: The BNC World Edition is available as both a CD-ROM or via online subscription.
  • English Corpus Linguistics: an Introduction - Charles F
    Cambridge University Press 0521808790 - English Corpus Linguistics: An Introduction - Charles F. Meyer Index More information Index Aarts, Bas, 4, 102 Biber, Douglas, et al. (1999) 14 adequacy, 2–3, 10–11 Birmingham Corpus, 15, 142 age, 49–50 Blachman, Edward, 76–7 Altenberg, Bengt, 26–7 BNC see British National Corpus AMALGAM Tagging Project, 86–7, 89 Brill, Eric, 86 American National Corpus, 24, 84, 142 Brill Tagger, 86–8 American Publishing House for the Blind British National Corpus (BNC), 143 Corpus, 17, 142 annotation, 84 analyzing a corpus, 100 composition, 18, 31t, 34, 36, 38, 40–1, 49 determining suitability, 103–7, 107t copyright, 139–40 exploring a corpus, 123–4 planning, 30–2, 33, 43, 51, 138 extracting information: defining parameters, record keeping, 66 107–9; coding and recording, 109–14, research using, 15, 36 112t; locating relevant constructions, speech samples, 59 114–19, 116f, 118f tagging, 87 framing research question, 101–3 time-frame, 45 future prospects, 140–1 British National Corpus (BNC) Sampler, see also pseudo-titles (corpus analysis 139–40, 143 case study); statistical analysis Brown Corpus, xii, 1, 143 anaphors, 97 genre variation, 18 annotation, 98–9 length, 32 future prospects, 140 research using, 6, 9–10, 12, 42, 98, 103 grammatical markup, 81 sampling methodology, 44 parsing, 91–6, 98, 140 tagging, 87, 90 part-of-speech markup, 81 time-frame, 45 structural markup, 68–9, 81–6 see also FROWN (Freiburg–Brown) tagging, 86–91, 97–8, 111, 117–18, 140 Corpus types, 81 Burges, Jen, 52 appositions, 42, 98 Burnard,
  • A Study of Issues and Techniques for Creating Core Vocabulary Lists for English As an International Language
    A STUDY OF ISSUES AND TECHNIQUES FOR CREATING CORE VOCABULARY LISTS FOR ENGLISH AS AN INTERNATIONAL LANGUAGE, BY C. JOSEPH SORELL. A thesis submitted to Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy, Victoria University of Wellington, 2013. ABSTRACT: Core vocabulary lists have long been a tool used by language learners and instructors seeking to facilitate the initial stages of foreign language learning (Fries & Traver, 1960: 2). In the past, these lists were typically based on the intuitions of experienced educators. Even before the advent of computer technology in the mid-twentieth century, attempts were made to create such lists using objective methodologies. These efforts regularly fell short, however, and – in the end – had to be tweaked subjectively. Now, in the 21st century, this is unfortunately still true, at least for those lists whose methodologies have been published. Given the present availability of sizable English-language corpora from around the world and affordable personal computers, this thesis seeks to fill this methodological gap by answering the research question: How can valid core vocabulary lists for English as an International Language be created? A practical taxonomy is proposed based on Biber’s (1988, 1995) multi-dimensional analysis of English texts. This taxonomy is based on correlated linguistic features and reasonably covers representative spoken and written texts in English. The four-part main study assesses the variance in vocabulary data within each of the four key text types: interactive (face-to-face conversation), academic exposition, imaginative narrative, and general reported exposition. The variation in word types found at progressive intervals in corpora of various sizes is measured using the Dice coefficient, a coefficient originally used to measure species variation in different biotic regions (Dice, 1945).
  • Towards a Global, Multilingual Framenet
    LREC 2020 WORKSHOP. Language Resources and Evaluation Conference, 11–16 May 2020. International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet. PROCEEDINGS. Edited by: Tiago T. Torrent, Collin F. Baker, Oliver Czulo, Kyoko Ohara and Miriam R. L. Petruck. ISBN: 979-10-95546-58-0, EAN: 9791095546580. For more information: European Language Resources Association (ELRA), 9 rue des Cordelières, 75013 Paris, France, http://www.elra.info. © European Language Resources Association (ELRA). These Workshop Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Introduction: This workshop continues the series of International FrameNet Workshops, based on frame semantics and FrameNets around the world, including meetings in Berkeley, California in 2013 (sponsored by the Swedish FrameNet group), in Juiz de Fora, Brazil in 2016 (sponsored by FrameNet Brasil), and in Miyazaki, Japan in 2018 (in conjunction with the LREC conference there). The last of these was specifically designed to bring together two sometimes separate groups of researchers, those concentrating on frame semantics and those concentrating on construction grammar, which Charles J. Fillmore and many of the same colleagues developed in parallel over a period of several decades. The call for papers of the current conference emphasized that the workshop would welcome studies of both theoretical and practical issues, and we are fortunate to have strong contributions of both types, sometimes within a single paper.
  • The Architecture of a Multipurpose Australian National Corpus
    The Architecture of a Multipurpose Australian National Corpus Pam Peters Macquarie University 1. Introduction Collaborative planning by Australian researchers for a large national corpus presents us with quite diverse views on: (a) what a corpus is, (b) what research agendas it should support, (c) which varieties of discourse it should contain, (d) how many languages it could include, (e) how the material might be collected, and (f) what kinds of annotation are needed to add value to the texts. The challenge for us all is to develop a comprehensive plan and architecture for a corpus which will encompass as many research agendas as possible. A modular design which allows independent compilation of segments of the notional whole recommends itself, so long as common systems of metadata and annotation can be set down at the start. 2. What is a Corpus? Divergent understandings of what a corpus is reflect some aspects of the word’s history, and the different academic disciplines which have made use of them. In literary studies since the earlier C18, the Latin word corpus has been used to refer to the body of work by a single author, for example, Shakespeare, or set of them, such as the romantic poets. It implies the total output of those authors, not a sampling. In linguistic science since the 1960s, corpus has referred to a structured collection of texts sampled from various types of discourse, including written and – since the 1980s – spoken as well (e.g., Collins Cobuild). It is thus deliberately heterogenous, and designed to be in some sense ‘representative’ of a standard language or variety of it.
  • Representation Learning for Information Extraction
    Representation Learning for Information Extraction, by Ehsan Amjadian. A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Cognitive Science, Carleton University, Ottawa, Ontario. ©2019 Ehsan Amjadian. Abstract: Distributed representations, predominantly acquired via neural networks, have been applied to natural language processing tasks including speech recognition and machine translation with a success comparable to sophisticated state-of-the-art algorithms. The present thesis offers an investigation of the application of such representations to information extraction. Specifically, I explore the suitability of applying shallow distributed representations to the automatic terminology extraction task, as well as the bridging reference resolution task. I created a dataset as a gold standard for automatic term extraction in the mathematical education domain. I carefully assessed the performance of the existing terminology extraction methods on this dataset. Then, I introduce a novel method for automatic terminology extraction for one word terms, and I evaluate the performance of the novel algorithm in various terminological domains. The introduced algorithm leverages the distributed representation of words from the local and global perspectives to encode syntactic, semantic, association, and frequency information at the same time. Furthermore, this novel algorithm can be trained with a minimal number of data points. I show that the algorithm is robust to the change of domain, and that information can be transferred from one technical domain to another, leveraging what we call anchor words with consistent semantics shared between the domains. As for the bridging reference resolution task, a dataset is built on the letter portion of the Open American National Corpus and I compare the performance of a preliminary method against a majority class baseline.