Ad Hoc and General-Purpose Corpus Construction from Web Sources
Adrien Barbaresi
To cite this version: Adrien Barbaresi. Ad hoc and general-purpose corpus construction from web sources. Linguistics. ENS Lyon, 2015. English. <tel-01167309>

HAL Id: tel-01167309
https://tel.archives-ouvertes.fr/tel-01167309
Submitted on 24 Jun 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution 4.0 International License.

Thesis submitted for the degree of Doctor of the Université de Lyon, awarded by the École Normale Supérieure de Lyon
Discipline: Linguistics
ICAR lab, École Doctorale Lettres, Langues, Linguistique, Arts (ED 484 3LA)
Presented and publicly defended on 19 June 2015 by Adrien Barbaresi

Construction de corpus généraux et spécialisés à partir du web
Ad hoc and general-purpose corpus construction from web sources

Thesis supervisor: Benoît Habert

Thesis Committee:
Benoît Habert, ENS Lyon, advisor
Thomas Lebarbé, University of Grenoble 3, chair
Henning Lobin, University of Gießen, reviewer
Jean-Philippe Magué, ENS Lyon, co-advisor
Ludovic Tanguy, University of Toulouse 2, reviewer

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics

Acknowledgements

Many thanks to
Benoît Habert and Jean-Philippe Magué
Henning Lobin
Serge Heiden, Matthieu Decorde, Alexei Lavrentev, Céline Guillot, and the people at ICAR
Alexander Geyken, Lothar Lemnitzer, Kay-Michael Würzner, Bryan Jurish, and the team at the Zentrum Sprache of the BBAW
my fellow former members of the board of ENthèSe
Felix Bildhauer and Roland Schäfer
the liberality of my PhD grant and the working conditions I benefited from [1]
the S.
my friends and family
Eva Brehm-Jurish for helping my English sound better
Brendan O'Connor for his inspirational LaTeX preamble

[1] PhD grant at the ENS Lyon, internal research grant at FU Berlin, DFG-funded research grants at the BBAW. I am grateful to the FU Berlin and the BBAW for the substantial computational resources they provided.

Contents

1 Corpus design before and after the emergence of web data
1.1 Prologue: "A model of Universal Nature made private" – Garden design and text collections
1.2 Introductory definitions and typology: corpus, corpus design and text
1.2.1 What "computational" is to linguistics
1.2.2 What theory and field are to linguistics
1.2.3 Why use a corpus? How is it defined? How is it built?
1.3 Corpus design history: From copy typists to web data (1959–today)
1.3.1 Possible periodizations
1.3.2 Copy typists, first electronic corpora and establishment of a scientific methodology (1959–1980s)
1.3.3 Digitized text and emergence of corpus linguistics, between technological evolutions and cross-breeding with NLP (from the late 1980s onwards)
1.3.4 Enter web data: porosity of the notion of corpus, and opportunism (from the late 1990s onwards)
1.4 Web native corpora – Continuities and changes
1.4.1 Corpus linguistics methodology
1.4.2 From content oversight to suitable processing: Known problems
1.5 Intermediate conclusions
1.5.1 (Web) corpus linguistics: a discipline on the rise?
1.5.2 Changes in corpus design and construction
1.5.3 Challenges addressed in the following sections

2 Gauging and quality assessment of text collections: methodological insights on (web) text processing and classification
2.1 Introduction
2.2 Text quality assessment, an example of interdisciplinary research on web texts
2.2.1 Underestimated flaws and recent advances in discovering them
2.2.2 Tackling the problems: General state of the art
2.3 Text readability as an aggregate of salient text characteristics
2.3.1 From text quality to readability and back
2.3.2 Extent of the research field: readability, complexity, comprehensibility
2.3.3 State of the art of different widespread approaches
2.3.4 Common denominator of current methods
2.3.5 Lessons to learn and possible application scenarios
2.4 Text visualization, an example of corpus processing for digital humanists
2.4.1 Advantages of visualization techniques and examples of global and local visualizations
2.4.2 Visualization of corpora
2.5 Conclusions on current research trends
2.5.1 Summary
2.5.2 A gap to bridge between quantitative analysis and (corpus) linguistics

3 From the start URLs to the accessible corpus document: Web text collection and preprocessing
3.1 Introduction
3.1.1 Structure of the Web and definition of web crawling
3.1.2 "Offline" web corpora: overview of constraints and processing chain
3.2 Data collection: how is web document retrieval done and which challenges arise?
3.2.1 Introduction
3.2.2 Finding URL sources: Known advantages and difficulties
3.2.3 In the civilized world: "Restricted retrieval"
3.2.4 Into the wild: Web crawling
3.2.5 Finding a vaguely defined needle in a haystack: targeting small, partly unknown fractions of the web
3.3 An underestimated step: Preprocessing
3.3.1 Introduction
3.3.2 Description and discussion of salient preprocessing steps
3.3.3 Impact on research results
3.4 Conclusive remarks and open questions
3.4.1 Remarks on current research practice
3.4.2 Existing problems and solutions
3.4.3 (Open) questions

4 Web corpus sources, qualification, and exploitation
4.1 Introduction: qualification steps and challenges
4.1.1 Hypotheses
4.1.2 Concrete issues
4.2 Prequalification of web documents
4.2.1 Prequalification: Finding viable seed URLs for web corpora
4.2.2 Impact of prequalification on (focused) web crawling and web corpus sampling
4.2.3 Restricted retrieval
4.3 Qualification of web corpus documents and web corpus building
4.3.1 General-purpose corpora
4.3.2 Specialized corpora
4.4 Corpus exploitation: typology, analysis, and visualization
4.4.1 A few specific corpora constructed during this thesis
4.4.2 Interfaces and exploitation

5 Conclusion
5.0.1 The framework of web corpus linguistics
5.0.2 Put the texts back into focus
5.0.3 Envoi: Back to the garden

6 Annexes
6.1 Synoptic tables
6.2 Building a basic crawler
6.3 Concrete steps of URL-based content filtering
6.3.1 Content type filtering
6.3.2 Filtering adult content and spam
6.4 Examples of blogs
6.5 A surface parser for fast phrases and valency detection on large corpora
6.5.1 Introduction
6.5.2 Description
6.5.3 Implementation
6.5.4 Evaluation
6.5.5 Conclusion
6.6 Completing web pages on the fly with JavaScript
6.7 Newspaper article in XML TEI format
6.8 Résumé en français (French summary)

References

Index

List of Tables

1.1 Synoptic typological comparison of specific and general-purpose corpora
3.1 Amount of documents lost in the cleanup steps of DECOW2012
4.1 URLs extracted from DMOZ and Wikipedia
4.2 Crawling experiments for Indonesian
4.3 Naive reference for crawling experiments concerning Indonesian
4.4 5 most frequent languages of URLs taken at random on FriendFeed
4.5 10 most frequent languages of spell-check-filtered URLs gathered on FriendFeed
4.6 10 most frequent languages of URLs gathered on identi.ca
4.7 10 most frequent languages of filtered URLs gathered on Reddit channels and on a combination of channels and user pages
4.8 5 most frequent languages of links seen on Reddit and rejected by the primary language filter
4.9 URLs extracted from search engine queries
4.10 URLs extracted from a blend of social network crawls
4.11 URLs extracted from DMOZ and Wikipedia
4.12 Comparison of the number of different domains in the page's outlinks for four different sources
4.13 Inter-annotator agreement of three coders for a text quality rating task
4.14 Results of logistic regression analysis for selected features and choices made by rater S
4.15 Evaluation of the model and comparison with the results of Schäfer et al. (2013)
4.16 Most frequent license types
4.17 Most frequent countries (ISO code)
4.18 Various properties of the examined corpora
4.19 Percentage distribution of selected PoS (super)tags on token and type level
6.1 Corpora created during this thesis