Representation Learning for Information Extraction


by Ehsan Amjadian

A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Cognitive Science

Carleton University, Ottawa, Ontario

©2019 Ehsan Amjadian

Abstract

Distributed representations, predominantly acquired via neural networks, have been applied to natural language processing tasks, including speech recognition and machine translation, with success comparable to that of sophisticated state-of-the-art algorithms. The present thesis investigates the application of such representations to information extraction. Specifically, I explore the suitability of shallow distributed representations for the automatic terminology extraction task and for the bridging reference resolution task. I create a dataset as a gold standard for automatic term extraction in the mathematics education domain and carefully assess the performance of existing terminology extraction methods on it. I then introduce a novel method for the automatic extraction of single-word terms and evaluate its performance in various terminological domains. The introduced algorithm leverages distributed representations of words from both local and global perspectives, encoding syntactic, semantic, association, and frequency information at the same time, and it can be trained with a minimal number of data points. I show that the algorithm is robust to a change of domain and that information can be transferred from one technical domain to another by leveraging what we call anchor words, words whose semantics remain consistent across the domains. For the bridging reference resolution task, I build a dataset on the letter portion of the Open American National Corpus and compare the performance of a preliminary method against a majority-class baseline.
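As a concrete illustration of the local-global idea described above, the following is a minimal sketch, assuming gensim and scikit-learn; the file names, hyperparameters, and example words are illustrative placeholders, not the thesis's actual configuration. It concatenates a vector trained on a small domain corpus (the local view) with a vector from a large general-purpose model (the global view) and trains a lightweight classifier on a handful of labeled words.

```python
# Illustrative sketch of local-global embeddings for unigram terminology
# extraction. File names, hyperparameters, and example words are
# hypothetical placeholders, not the thesis implementation.
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression

# "Local" embeddings: trained on a small domain corpus (e.g., math education).
domain_sentences = [line.split() for line in open("domain_corpus.txt")]
local = Word2Vec(domain_sentences, vector_size=100, min_count=1).wv

# "Global" embeddings: pretrained on a large general-purpose corpus.
glob = KeyedVectors.load_word2vec_format("general_vectors.bin", binary=True)

def features(word):
    """Concatenate the local and global vectors for one candidate word."""
    return np.concatenate([local[word], glob[word]])

# A handful of labeled examples (1 = term, 0 = non-term) can suffice.
train_words = ["denominator", "quotient", "the", "letter"]
labels = [1, 1, 0, 0]
X = np.array([features(w) for w in train_words])

clf = LogisticRegression().fit(X, labels)
print(clf.predict(np.array([features("numerator")])))  # expected: term (1)
```

Because each candidate word is represented by two complementary views at once, even a small training set can begin to separate domain terms from ordinary words.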
Acknowledgements

The great sacrifices and contributions made by others enriched my journey over the past years. Words cannot do justice to their help and sacrifices; nevertheless, this is an attempt to acknowledge them.

I would like to thank my brother Amir Amjadian and my mother Dr. Khadijeh Eshaghi for their support and sacrifice that first and foremost made this endeavor possible. I could travel, as a result, to the other side of the world in pursuit of knowledge and to contribute to the scientific community. Amir and Khadijeh, without your great help and sacrifice this journey would never have started. Thank you!

I would like to thank my dear wife, Mahsa Raeisi Ardali, for her unconditional support during the writing stage of this dissertation. Her devotion and encouragement strengthened my steps forward.

Immeasurable gratitude goes to my supervisors, Professor Raj Singh and Professor Diana Inkpen. Dr. Singh took me under his wing and made Ottawa feel like home. My eyes were opened and my mind freed by our discussions, his clarity of thought, reasoning, and writing, his vast experience in research, his absolute command of formal and computational semantics and pragmatics, and, above all, his invaluable mentorship and friendship. Dr. Inkpen noticed my passion for natural language processing, machine learning, and deep learning, and nurtured it with her depth and breadth of knowledge in artificial intelligence and computer science, turning thoughts into ideas and ideas into experiments. I could not have asked for better supervision and mentorship.

I would like to thank my thesis committee members, Professor Xiaodan Zhu and Professor Robert West, for the many inspiring discussions and conversations on NLP, information extraction, word embeddings, and high-dimensional data structures, as well as for their great feedback on the present thesis.

Great thanks go to Professor Patrick Drouin for his pioneering work in automatic terminology extraction as well as his invaluable comments on the thesis. I was fortunate to be one of the countless individuals who have been inspired by his work. Many thanks go to Professor Christopher Cox for his invaluable feedback on the thesis, which resulted from a close assessment of the ideas and experiments in the document and led to their further refinement.

I could not have wished for better lab mates and friends than Roxana Barbu and Prasadith Kirinde Gamaarachchige. Prasadith brought an ocean of cutting-edge skills in software engineering and web development to the lab, in addition to his radiating serenity. Roxana made many extended hours of research seem normal and pleasant through her diligence and teamwork, even though we worked on different projects. Thank you both for being such great friends and for all the thought-provoking conversations.

Great thanks go to my friend and colleague Professor Muhammad Rizwan Abid. We had many inspiring conversations from the very beginning of my journey, many of which resulted in great academic work. Roxana and Muhammad both kindly proofread the present document and made many great suggestions that led to its improvement.

A significant portion of the work in automatic terminology extraction comes from years of collaboration with Professor T. Sima Paribakht and Professor Farahnaz Faez, as well as from their contributions before I joined the project. Working with them was an absolute honor and pleasure, and I benefited greatly from their advice and insights.

Great thanks go to Liane Dubreuil for the warm welcome to the department and for removing every administrative obstacle from my path. The present work benefited greatly from the efforts of Joel Baylis and Christopher Genovesi in constructing the bridging reference corpus. I would like to thank Amir Gharavi for all the detailed mathematical discussions as well as for being a great friend over the past years.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 The Tasks to Tackle
  1.2 Automatic Terminology Extraction (ATE)
  1.3 Bridging Reference Resolution
  1.4 Objectives
  1.5 Contributions
    1.5.1 List of Main Contributions of the Present Dissertation
  1.6 Research Questions
  1.7 Publications
  1.8 Thesis Structure
  1.9 Summary

2 Background
  2.1 Automatic Terminology Extraction
    2.1.1 Traditional ATE
    2.1.2 Machine Learning ATE
    2.1.3 Distributed ATE
  2.2 Transfer Learning
  2.3 Domain Adaptation
  2.4 Statistical Modeling of Language
    2.4.1 Introduction
    2.4.2 N-Gram Models
    2.4.3 Vector Space Models
    2.4.4 Word Embeddings
  2.5 Summary

3 Automatic Terminology Extraction Evaluation: A Gold Standard for the Mathematics Education Domain
  3.1 Introduction
  3.2 Related Work
  3.3 Term Extraction Methods and Tools
    3.3.1 AntConc
    3.3.2 Topia
    3.3.3 TermoStat
  3.4 Sketch Engine
  3.5 Corpus
  3.6 TermEvaluator
  3.7 The Annotation Process
  3.8 Results and Analysis
  3.9 Summary

4 Distributed Specificity
  4.1 Introduction
  4.2 Related Work
  4.3 Corpus
  4.4 Methodology
    4.4.1 Specificity Vectors
    4.4.2 Filtering Approach
    4.4.3 Direct Approach
  4.5 Annotation
  4.6 Experiments and Results
  4.7 Summary

5 Cross-Domain Distributed Automatic Terminology Extraction
  5.1 Quantitative Experiments
  5.2 Qualitative Assessments
  5.3 Domain Adaptation for ATE
    5.3.1 Introduction
    5.3.2 Dataset
    5.3.3 Methods
    5.3.4 Experiments and Results
  5.4 Summary

6 Distributed Bridging Reference Resolution
  6.1 Introduction
  6.2 Related Work
  6.3 Data
  6.4 Methodology
  6.5 Experiments & Results
  6.6 Summary

7 Conclusion and Future Research
  7.1 Summary
  7.2 Future Work

Appendices
A Full List of Results in Qualitative Assessments
  A.1 Microbiology
  A.2 Optometry