Ad Hoc and General-Purpose Corpus Construction from Web Sources
Adrien Barbaresi

Total Pages: 16

File Type: PDF, Size: 1020 KB

To cite this version: Adrien Barbaresi. Ad hoc and general-purpose corpus construction from web sources. Linguistics. ENS Lyon, 2015. English. <tel-01167309>

HAL Id: tel-01167309
https://tel.archives-ouvertes.fr/tel-01167309
Submitted on 24 Jun 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution 4.0 International License.

Thesis submitted for the degree of Doctor of the University of Lyon, awarded by the École Normale Supérieure de Lyon. Discipline: Linguistics. Laboratory: ICAR. Doctoral school: Lettres, Langues, Linguistique, Arts (ED 484 3LA). Publicly presented and defended on 19 June 2015 by Adrien Barbaresi.

Title: Construction de corpus généraux et spécialisés à partir du web (Ad hoc and general-purpose corpus construction from web sources)

Thesis committee:
Benoît Habert, ENS Lyon, advisor
Thomas Lebarbé, University of Grenoble 3, examiner and chair
Henning Lobin, University of Gießen, reviewer
Jean-Philippe Magué, ENS Lyon, co-advisor
Ludovic Tanguy, University of Toulouse 2, reviewer

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics.

Acknowledgements

Many thanks to: Benoît Habert and Jean-Philippe Magué; Henning Lobin; Serge Heiden, Matthieu Decorde, Alexei Lavrentev, Céline Guillot, and the people at ICAR; Alexander Geyken, Lothar Lemnitzer, Kay-Michael Würzner, Bryan Jurish, and the team at the Zentrum Sprache of the BBAW; my fellow former members of the board of ENthèSe; Felix Bildhauer and Roland Schäfer; the liberalities of my PhD grant and the working conditions I benefited from (PhD grant at the ENS Lyon, internal research grant at FU Berlin, DFG-funded research grants at the BBAW; I am grateful to the FU Berlin and the BBAW for the substantial computational resources they provided); the S.; my friends and family; Eva Brehm-Jurish for helping my English sound better; Brendan O'Connor for his inspirational LaTeX preamble.

Contents

1 Corpus design before and after the emergence of web data
  1.1 Prologue: "A model of Universal Nature made private" – Garden design and text collections
  1.2 Introductory definitions and typology: corpus, corpus design and text
    1.2.1 What "computational" is to linguistics
    1.2.2 What theory and field are to linguistics
    1.2.3 Why use a corpus? How is it defined? How is it built?
  1.3 Corpus design history: From copy typists to web data (1959-today)
    1.3.1 Possible periodizations
    1.3.2 Copy typists, first electronic corpora and establishment of a scientific methodology (1959-1980s)
    1.3.3 Digitized text and emergence of corpus linguistics, between technological evolutions and cross-breeding with NLP (from the late 1980s onwards)
    1.3.4 Enter web data: porosity of the notion of corpus, and opportunism (from the late 1990s onwards)
  1.4 Web native corpora – Continuities and changes
    1.4.1 Corpus linguistics methodology
    1.4.2 From content oversight to suitable processing: Known problems
  1.5 Intermediate conclusions
    1.5.1 (Web) corpus linguistics: a discipline on the rise?
    1.5.2 Changes in corpus design and construction
    1.5.3 Challenges addressed in the following sections

2 Gauging and quality assessment of text collections: methodological insights on (web) text processing and classification
  2.1 Introduction
  2.2 Text quality assessment, an example of interdisciplinary research on web texts
    2.2.1 Underestimated flaws and recent advances in discovering them
    2.2.2 Tackling the problems: General state of the art
  2.3 Text readability as an aggregate of salient text characteristics
    2.3.1 From text quality to readability and back
    2.3.2 Extent of the research field: readability, complexity, comprehensibility
    2.3.3 State of the art of different widespread approaches
    2.3.4 Common denominator of current methods
    2.3.5 Lessons to learn and possible application scenarios
  2.4 Text visualization, an example of corpus processing for digital humanists
    2.4.1 Advantages of visualization techniques and examples of global and local visualizations
    2.4.2 Visualization of corpora
  2.5 Conclusions on current research trends
    2.5.1 Summary
    2.5.2 A gap to bridge between quantitative analysis and (corpus) linguistics

3 From the start URLs to the accessible corpus document: Web text collection and preprocessing
  3.1 Introduction
    3.1.1 Structure of the Web and definition of web crawling
    3.1.2 "Offline" web corpora: overview of constraints and processing chain
  3.2 Data collection: how is web document retrieval done and which challenges arise?
    3.2.1 Introduction
    3.2.2 Finding URL sources: Known advantages and difficulties
    3.2.3 In the civilized world: "Restricted retrieval"
    3.2.4 Into the wild: Web crawling
    3.2.5 Finding a vaguely defined needle in a haystack: targeting small, partly unknown fractions of the web
  3.3 An underestimated step: Preprocessing
    3.3.1 Introduction
    3.3.2 Description and discussion of salient preprocessing steps
    3.3.3 Impact on research results
  3.4 Conclusive remarks and open questions
    3.4.1 Remarks on current research practice
    3.4.2 Existing problems and solutions
    3.4.3 (Open) questions

4 Web corpus sources, qualification, and exploitation
  4.1 Introduction: qualification steps and challenges
    4.1.1 Hypotheses
    4.1.2 Concrete issues
  4.2 Prequalification of web documents
    4.2.1 Prequalification: Finding viable seed URLs for web corpora
    4.2.2 Impact of prequalification on (focused) web crawling and web corpus sampling
    4.2.3 Restricted retrieval
  4.3 Qualification of web corpus documents and web corpus building
    4.3.1 General-purpose corpora
    4.3.2 Specialized corpora
  4.4 Corpus exploitation: typology, analysis, and visualization
    4.4.1 A few specific corpora constructed during this thesis
    4.4.2 Interfaces and exploitation

5 Conclusion
  5.0.1 The framework of web corpus linguistics
  5.0.2 Put the texts back into focus
  5.0.3 Envoi: Back to the garden

6 Annexes
  6.1 Synoptic tables
  6.2 Building a basic crawler
  6.3 Concrete steps of URL-based content filtering
    6.3.1 Content type filtering
    6.3.2 Filtering adult content and spam
  6.4 Examples of blogs
  6.5 A surface parser for fast phrases and valency detection on large corpora
    6.5.1 Introduction
    6.5.2 Description
    6.5.3 Implementation
    6.5.4 Evaluation
    6.5.5 Conclusion
  6.6 Completing web pages on the fly with JavaScript
  6.7 Newspaper article in XML TEI format
  6.8 Résumé en français

References
Index

List of Tables
1.1 Synoptic typological comparison of specific and general-purpose corpora
3.1 Amount of documents lost in the cleanup steps of DECOW2012
4.1 URLs extracted from DMOZ and Wikipedia
4.2 Crawling experiments for Indonesian
4.3 Naive reference for crawling experiments concerning Indonesian
4.4 5 most frequent languages of URLs taken at random on FriendFeed
4.5 10 most frequent languages of spell-check-filtered URLs gathered on FriendFeed
4.6 10 most frequent languages of URLs gathered on identi.ca
4.7 10 most frequent languages of filtered URLs gathered on Reddit channels and on a combination of channels and user pages
4.8 5 most frequent languages of links seen on Reddit and rejected by the primary language filter
4.9 URLs extracted from search engine queries
4.10 URLs extracted from a blend of social network crawls
4.11 URLs extracted from DMOZ and Wikipedia
4.12 Comparison of the number of different domains in the pages' outlinks for four different sources
4.13 Inter-annotator agreement of three coders for a text quality rating task
4.14 Results of logistic regression analysis for selected features and choices made by rater S
4.15 Evaluation of the model and comparison with the results of Schäfer et al. (2013)
4.16 Most frequent license types
4.17 Most frequent countries (ISO code)
4.18 Various properties of the examined corpora
4.19 Percentage distribution of selected PoS (super)tags on token and type level
6.1 Corpora created during this thesis
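The annexes cover building a basic crawler and URL-based content filtering (content types, adult content and spam). As a rough sketch of the link-extraction and URL-filtering steps such a crawler chains together, here is a minimal stdlib-only example; the blocked-extension list and the sample page are invented for illustration, and this is not the thesis's own code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical content-type prefilter: skip URLs that point to binary files.
BLOCKED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip")

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def is_candidate(url, allowed_schemes=("http", "https")):
    """URL-based prefiltering: keep only http(s) links not ending in a blocked extension."""
    parsed = urlparse(url)
    return (parsed.scheme in allowed_schemes
            and not parsed.path.lower().endswith(BLOCKED_EXTENSIONS))

def extract_candidate_links(html, base_url):
    """One crawler step: parse a fetched page and return filtered outlinks."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [u for u in parser.links if is_candidate(u)]

page = ('<a href="/about.html">About</a> <a href="logo.png">Logo</a> '
        '<a href="https://example.org/blog">Blog</a>')
print(extract_candidate_links(page, "http://example.com/"))
# -> ['http://example.com/about.html', 'https://example.org/blog']
```

A full crawler would wrap this in a fetch loop with a frontier queue, politeness delays, and robots.txt handling, which are omitted here.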
Recommended publications
  • Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
    Master in Cognitive Science and Language, Master's Thesis, September 2018. Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis. Alejandro Ramírez Atrio. Supervisor: Toni Badia. Abstract: In this thesis we study the effect of word reordering as preprocessing for cross-lingual sentiment analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles the one of our source language (English). Our original expectation was that a Long Short-Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with different word order. We hypothesized that the more the word order of any of our target languages resembles the one of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our part-of-speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora, CoStEP and Tatoeba. The results suggest that the bilingual word embeddings we train our Long Short-Term Memory model with do not improve any English word order learning by the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in scores even when applying a random reordering that renders them almost unintelligible, whether classifying between two options (positive, negative) or four (strongly positive, positive, negative, strongly negative).
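POS-pattern reordering of the kind this abstract describes rewrites a tagged target-language sentence so that selected tag sequences follow source-language order. A toy sketch of the mechanism; the single NOUN-ADJ rule below is invented for illustration and is not one of the rule sets from Crego and Mariño or the thesis.

```python
def reorder(tagged, rules):
    """Apply POS-pattern reordering rules left to right.
    `tagged` is a list of (token, pos) pairs; each rule maps a POS pattern
    (tuple of tags) to a permutation of the matched positions."""
    out = list(tagged)
    for pattern, perm in rules:
        i = 0
        while i + len(pattern) <= len(out):
            window = out[i:i + len(pattern)]
            if tuple(pos for _, pos in window) == pattern:
                out[i:i + len(pattern)] = [window[j] for j in perm]
                i += len(pattern)  # skip past the rewritten span
            else:
                i += 1
    return out

# Spanish "la casa roja" (the house red) -> English-like adjective-noun order
rules = [(("NOUN", "ADJ"), (1, 0))]
sentence = [("la", "DET"), ("casa", "NOUN"), ("roja", "ADJ")]
print(reorder(sentence, rules))
# -> [('la', 'DET'), ('roja', 'ADJ'), ('casa', 'NOUN')]
```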
  • Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
    RANLPStud 2011. Proceedings of the Student Research Workshop associated with the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), 13 September 2011, Hissar, Bulgaria. ISBN 978-954-452-016-8. Designed and printed by INCOMA Ltd., Shoumen, Bulgaria. Preface: The Recent Advances in Natural Language Processing (RANLP) conference, already in its eighth year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give an arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop, which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by
  • From CHILDES to TalkBank
    From CHILDES to TalkBank. Brian MacWhinney, Carnegie Mellon University. MacWhinney, B. (2001). New developments in CHILDES. In A. Do, L. Domínguez & A. Johansen (Eds.), BUCLD 25: Proceedings of the 25th annual Boston University Conference on Language Development (pp. 458-468). Somerville, MA: Cascadilla. A similar article appeared as: MacWhinney, B. (2001). From CHILDES to TalkBank. In M. Almgren, A. Barreña, M. Ezeizaberrena, I. Idiazabal & B. MacWhinney (Eds.), Research on Child Language Acquisition (pp. 17-34). Somerville, MA: Cascadilla. Recent years have seen a phenomenal growth in computer power and connectivity. The computer on the desktop of the average academic researcher now has the power of room-size supercomputers of the 1980s. Using the Internet, we can connect in seconds to the other side of the world and transfer huge amounts of text, programs, audio and video. Our computers are equipped with programs that allow us to view, link, and modify this material without even having to think about programming. Nearly all of the major journals are now available in electronic form and the very nature of journals and publication is undergoing radical change. These new trends have led to dramatic advances in the methodology of science and engineering. However, the social and behavioral sciences have not shared fully in these advances. In large part, this is because the data used in the social sciences are not well-structured patterns of DNA sequences or atomic collisions in super colliders. Much of our data is based on the messy, ill-structured behaviors of humans as they participate in social interactions. Categorizing and coding these behaviors is an enormous task in itself.
  • The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality
    The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality. Maria Shvedova [0000-0002-0759-1689], Kyiv National Linguistic University. Abstract: The paper presents the General Regionally Annotated Corpus of Ukrainian, which is publicly available (GRAC: uacorpus.org), searchable online, and counts more than 400 million tokens, representing most genres of written texts. It also features regional annotation, i.e. about 50 percent of the texts are attributed to the different regions of Ukraine or countries of the diaspora. If the author is known, the text is linked to their home region(s); journalistic texts are annotated with regard to the place where the edition is published. This feature distinguishes the GRAC from the majority of general linguistic corpora. Keywords: Ukrainian language, corpus, diachronic evolution, regional variation. 1 Introduction: Currently many major national languages have large universal corpora, known as "reference" corpora (cf. das Deutsche Referenzkorpus, DeReKo) or "national" corpora (a label dating back ultimately to the British National Corpus). These corpora are large, representative of different genres of written language, have a certain depth of (usually morphological and metatextual) annotation, and can be used for many different linguistic purposes. The Ukrainian language has lacked a publicly available linguistic corpus, yet present-day linguistics needs one. Researchers independently compile different corpora of Ukrainian for separate research purposes, with differing size and functionality; as the community lacks a universal tool, a researcher may build their own corpus according to their needs.
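Regional annotation of the kind described here amounts to attaching a region attribute to each text and filtering queries on it. A minimal sketch of that data model; the field names and region labels are hypothetical, not GRAC's actual schema.

```python
# Toy region-annotated corpus: each document carries optional region metadata,
# mirroring the idea that roughly half the texts can be attributed to a region.
corpus = [
    {"text": "...", "genre": "journalism", "region": "Kyiv"},  # place of edition
    {"text": "...", "genre": "fiction",    "region": "Lviv"},  # author's home region
    {"text": "...", "genre": "fiction",    "region": None},    # unattributed
]

def by_region(docs, region):
    """Select documents attributed to the given region."""
    return [d for d in docs if d["region"] == region]

print(len(by_region(corpus, "Lviv")))  # -> 1
```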
  • Proceedings of the Second Workshop on Abusive Language Online (ALW2), pages 1–10, Brussels, Belgium, October 31, 2018
    EMNLP 2018: Second Workshop on Abusive Language Online. Proceedings of the Workshop, co-located with EMNLP 2018, October 31, 2018, Brussels, Belgium. © 2018 The Association for Computational Linguistics. Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL), 209 N. Eighth Street, Stroudsburg, PA 18360, USA. Tel: +1-570-476-8006, Fax: +1-570-476-0860. ISBN 978-1-948087-68-1. Introduction: Interaction amongst users on social networking platforms can enable constructive and insightful conversations and civic participation; however, on many sites that encourage user interaction, verbal abuse has become commonplace. Abusive behavior such as cyberbullying, hate speech, and scapegoating can poison the social climates within online communities. The last few years have seen a surge in such abusive online behavior, leaving governments, social media platforms, and individuals struggling to deal with the consequences. As a field that works directly with computational analysis of language, the NLP community is uniquely positioned to address the difficult problem of abusive language online; encouraging collaborative and innovative work in this area is the goal of this workshop. The first year of the workshop saw 14 papers presented in a day-long program including interdisciplinary panels and active discussion. In this second edition, we have aimed to build on the success of the first year, maintaining a focus on computationally detecting abusive language and encouraging interdisciplinary work. Reflecting the growing research focus on this topic, the number of submissions received more than doubled, from 22 in last year's edition of the workshop to 48 this year.
  • Gold Standard Annotations for Preposition and Verb Sense with Semantic Role Labels in Adult-Child Interactions
    Gold Standard Annotations for Preposition and Verb Sense with Semantic Role Labels in Adult-Child Interactions. Lori Moon (University of Illinois at Urbana-Champaign), Christos Christodoulopoulos (Amazon Research), Cynthia Fisher (University of Illinois at Urbana-Champaign), Sandra Franco (Intelligent Medical Objects, Northbrook, IL, USA), Dan Roth (University of Pennsylvania). Abstract: This paper describes the augmentation of an existing corpus of child-directed speech. The resulting corpus is a gold-standard labeled corpus for supervised learning of semantic role labels in adult-child dialogues. Semantic role labeling (SRL) models assign semantic roles to sentence constituents, thus indicating who has done what to whom (and in what way). The current corpus is derived from the Adam files in the Brown corpus (Brown, 1973) of the CHILDES corpora, and augments the partial annotation described in Connor et al. (2010). It provides labels for both semantic arguments of verbs and semantic arguments of prepositions. The semantic role labels and senses of verbs follow PropBank guidelines (Kingsbury and Palmer, 2002; Gildea and Palmer, 2002; Palmer et al., 2005) and those for prepositions follow Srikumar and Roth (2011). The corpus was annotated by two annotators. Inter-annotator agreement is given separately for prepositions and verbs, and for adult speech and child speech. Overall, across child and adult samples, including verbs and prepositions, the κ score for sense is 72.6; for the number of semantic-role-bearing arguments, 77.4; for identical semantic role labels on a given argument, 91.1; and for the span of semantic role labels, 93.9.
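The κ scores above are chance-corrected agreement figures. As a quick illustration of how such a statistic is computed, here is Cohen's κ for two annotators in plain Python; the labels are invented, and the sketch assumes the annotators are not already in perfect chance agreement (which would make the denominator zero).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Hypothetical role labels from two annotators on six arguments.
a = ["AGENT", "PATIENT", "AGENT", "AGENT", "PATIENT", "AGENT"]
b = ["AGENT", "PATIENT", "PATIENT", "AGENT", "PATIENT", "AGENT"]
print(round(cohens_kappa(a, b), 3))  # -> 0.667
```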
  • Using Morphemes from Agglutinative Languages Like Quechua and Finnish to Aid in Low-Resource Translation
    Using Morphemes from Agglutinative Languages like Quechua and Finnish to Aid in Low-Resource Translation. John E. Ortega (Dept. de Llenguatges i Sistemes Informatics, Universitat d'Alacant, E-03071, Alacant, Spain), Krishnan Pillaipakkamnatt (Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA). Abstract: Quechua is a low-resource language spoken by nearly 9 million persons in South America (Hintz and Hintz, 2017). Yet in recent times there are few published accounts of successful adaptations of machine translation systems for low-resource languages like Quechua. In some cases, machine translations from Quechua to Spanish are inadequate due to error in alignment. We attempt to improve previous alignment techniques by aligning two languages that are similar due to agglutination: Quechua and Finnish. Our novel technique allows us to add rules that improve alignment for the prediction algorithm used in common machine translation systems. 1 Introduction: The NP-complete problem of translating natural languages as they are spoken by humans into machine-readable text is complex, yet it is partially solvable, as shown by the accuracy of machine translations when compared to human translations (Kleinberg and Tardos, 2005). Statistical machine translation (SMT) systems such as Moses require that an algorithm be combined with enough parallel corpora, text from distinct languages that can be compared sentence by sentence, to build phrase translation tables from language models. For many European languages, the translation task of bringing words together in a sequential sentence-by-sentence format for modeling, known as word alignment, is not hard due to the abundance of parallel corpora in large data sets such as Europarl.
  • A Massively Parallel Corpus: the Bible in 100 Languages
    Lang Resources & Evaluation, DOI 10.1007/s10579-014-9287-y. ORIGINAL PAPER. A massively parallel corpus: the Bible in 100 languages. Christos Christodouloupoulos · Mark Steedman. © The Author(s) 2014. This article is published with open access at Springerlink.com. Abstract: We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora. Keywords: Parallel corpus · Multilingual corpus · Comparative corpus linguistics. 1 Introduction: Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the Giza++ system of Och and Ney 2003). Another interesting use of parallel corpora in NLP is projected learning of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al. 2011), the most successful models use sentence-aligned corpora (Yarowsky and Ngai 2001; Das and Petrov 2011).
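Part of what makes the Bible convenient as a parallel corpus is that verse numbering gives sentence-level alignment across translations essentially for free. A minimal sketch of that idea; the verse texts below are shortened and purely illustrative.

```python
# Two translations keyed by verse ID; verse numbering serves as the alignment.
english = {"GEN 1:1": "In the beginning God created the heaven and the earth.",
           "GEN 1:2": "And the earth was without form, and void."}
spanish = {"GEN 1:1": "En el principio creo Dios los cielos y la tierra.",
           "GEN 1:3": "Y dijo Dios: Sea la luz."}

def align_by_verse(src, tgt):
    """Return (verse_id, src_text, tgt_text) for verses present in both versions,
    silently dropping verses missing from either translation."""
    shared = sorted(src.keys() & tgt.keys())
    return [(v, src[v], tgt[v]) for v in shared]

pairs = align_by_verse(english, spanish)
print(len(pairs))   # -> 1 (only GEN 1:1 occurs in both)
print(pairs[0][0])  # -> GEN 1:1
```

Real verse-aligned corpora must also handle merged or split verses across translations, which this sketch ignores.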
  • A Corpus-Based Study of Unaccusative Verbs and Auxiliary Selection
    To be or not to be: A corpus-based study of unaccusative verbs and auxiliary selection. Master's Thesis presented to the Faculty of the Graduate School of Arts and Sciences, Brandeis University, Department of Computer Science. James Pustejovsky, Advisor. In partial fulfillment of the requirements for the Master's degree, by Richard A. Brutti Jr., May 2012. © Copyright by Richard A. Brutti Jr. 2012. All Rights Reserved. Abstract: Since the introduction of the Unaccusative Hypothesis (Perlmutter, 1978), there have been many further attempts to explain the mechanisms behind the division in intransitive verbs. This paper aims to analyze and test some of the theories of unaccusativity using computational linguistic tools. Specifically, I focus on verbs that exhibit split intransitivity, that is, verbs that can appear in both unaccusative and unergative constructions, and on determining the distinguishing features that make this alternation possible. Many formal linguistic theories of unaccusativity involve the interplay of semantic roles and temporal event markers, both of which can be analyzed using statistical computational linguistic tools, including semantic role labelers, semantic parsers, and automatic event classification. I use auxiliary verb selection as a surface-level indicator of unaccusativity in Italian and Dutch, and test various classes of verbs extracted from the Europarl corpus (Koehn, 2005). Additionally, I provide some historical background for the evolution of this distinction, and analyze how my results fit into the larger theoretical framework.
  • Better Web Corpora for Corpus Linguistics and NLP
    Masaryk University, Faculty of Informatics. Better Web Corpora for Corpus Linguistics and NLP. Doctoral Thesis. Vít Suchomel. Brno, Spring 2020. Declaration: Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel. Advisor: Pavel Rychlý. Acknowledgements: I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý, for their problem insight, help with software design and constant encouragement. I am also grateful to my colleagues from the Natural Language Processing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus-driven NLP, and Jan Pomikálek, who helped me to start. I thank my wife Kateřina, who supported me a lot during the writing of this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. Abstract: The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.
  • Book of Abstracts
    Book of Abstracts – AFLiCo JET workshop: Corpora and Representativeness, Thursday 3 and Friday 4 May 2018, Max Weber building (W). Keynote speakers: Dawn Knight and Thomas Egan. More information: bit.ly/2FPGxvb
    Table of contents:
    Day 1 – May 3rd, 2018: Thomas Egan, Some perils and pitfalls of non-representativeness; Daniel Henke, De quoi sont représentatifs les corpus de textes traduits au juste ? Une étude de corpus comparable-parallèle [What exactly are corpora of translated texts representative of? A comparable-parallel corpus study]; Ilmari Ivaska, Silvia Bernardini, Adriano Ferraresi, The comparability paradox in multilingual and multi-varietal corpus research: Coping with the unavoidable; Antonina Bondarenko, Verbless Sentences: Advantages and Challenges of a Parallel Corpus-based Approach; Adeline Terry, The representativeness of the metaphors of death, disease, and sex in a TV show corpus; Julien Perrez, Pauline Heyvaert, Min Reuchamps, On the representativeness of political corpora in linguistic research; Joshua M. Griffiths, Supplementing Maximum Entropy Phonology with Corpus Data; Emmanuelle Guérin, Olivier Baude, Représenter la variation – Revisiter les catégories et les variétés dans le corpus ESLO [Representing variation: revisiting categories and varieties in the ESLO corpus]; Caroline Rossi, Camille Biros, Aurélien Talbot, La variation terminologique en langue de spécialité : pour une analyse à plusieurs niveaux [Terminological variation in specialized language: towards a multi-level analysis].
    Day 2 – May 4th, 2018: Dawn Knight, Representativeness in CorCenCC: corpus design in minoritised languages; Frederick Newmeyer, Conversational corpora: When 'big is beautiful'; Graham Ranger, How to get "along": in defence of an enunciative and corpus-based approach; Thi Thu Trang
  • Unit 3: Available Corpora and Software
    Corpus building and investigation for the Humanities: An on-line information pack about corpus investigation techniques for the Humanities. Unit 3: Available corpora and software. Irina Dahlmann, University of Nottingham.
    3.1 Commonly-used reference corpora and how to find them. This section provides an overview of commonly-used and readily available corpora. It is intended as a summary only and is far from exhaustive, but should prove useful as a starting point to see what kinds of corpora are available. The corpora here are divided into the following categories:
    • Corpora of General English
    • Monitor corpora
    • Corpora of Spoken English
    • Corpora of Academic English
    • Corpora of Professional English
    • Corpora of Learner English (First and Second Language Acquisition)
    • Historical (Diachronic) Corpora of English
    • Corpora in other languages
    • Parallel Corpora/Multilingual Corpora
    Each entry contains the name of the corpus and a hyperlink where further information is available. All the information was accurate at the time of writing, but it is subject to change and further web searches may be required.
    Corpora of General English:
    • The American National Corpus (http://www.americannationalcorpus.org/). Size: the first release contains 11.5 million words; the final release will contain 100 million words. Content: written and spoken American English. Access/Cost: the second release is available from the Linguistic Data Consortium (http://projects.ldc.upenn.edu/ANC/) for $75.
    • The British National Corpus (http://www.natcorp.ox.ac.uk/). Size: 100 million words. Content: written (90%) and spoken (10%) British English. Access/Cost: the BNC World Edition is available as a CD-ROM or via online subscription.