Technological Advances in Corpus Sampling Methodology

Stephen Michael Wattam, BSc (Hons), MRes

Mathematics and Statistics
School of Computing and Communications

Lancaster University

Submitted for the degree of Doctor of Philosophy, August 2015

Abstract

Current efforts in corpus linguistics and natural language processing make heavy use of corpora—large language samples that are intended to describe a wide population of language users. The first modern corpora were manually constructed, transcribed from published texts and other non-digital sources into a machine-readable format. In part due to their hard-won nature, larger corpora have become widely shared and re-used: this has significant benefits for the scientific community, yet also leads to a stagnation of sampling methods and designs. The rise of Web as Corpus (WaC), and the use of computers to author documents, has provided us with the tools needed to build corpora automatically, or with little supervision. This offers an avenue to re-examine and, in places, exceed the limitations of conventional corpus sampling methods. Even so, many corpora are compared against conventional ones due to their status as a de facto gold standard of representativeness. Such practices place undue trust in aging sample designs and the expert opinion therein. In this thesis I argue for the development of new sampling procedures guided less by concepts of linguistic balance and more by statistical sampling theory. This is done by presenting three different areas of potential study, along with exploratory results and publicly-available tools and methods that allow for further investigation. The first of these is an examination of temporal bias in sampling online. I present a preliminary investigation demonstrating the prevalence of such effects, before describing a tool designed to reveal linguistic change over time at a resolution not possible with current tools. Secondly, the sample design of larger general-purpose corpora is inverted in order to relate it to an individual’s experience of language. This takes the form of a census sample of language for a single subject, taken using semi-automated methods, and illustrates how poorly suited some aspects of general-purpose corpora are for questions about individual language use. Finally, a method is presented and evaluated that is able to describe arbitrary sample designs in quantitative terms, and use this description to automatically construct, augment, or repair corpora using the web. This method uses bootstrapping to apply current sampling theory to linguistic research questions, in order to better align the scientific notion of representativeness with the process of retrieving data.

Acknowledgements

This thesis could not have been completed without the support of many people. Through the considerable time it took, many academics and friends have helped answer my questions and allay (or confirm!) my fears. My thanks go out to all members of UCREL, who taught me the ways of the corpus. Foremost amongst these must be my supervisors, Paul Rayson and Damon Berridge. Their direction and encouragement was often the only thing keeping me in the office, and it is hard to overstate their influence. Secondly, my friends, particularly Carl Ellis, Matthew Edwards, John Vidler and John Hardy. Our (sometimes not so) implicit competition has been endlessly motivating. Finally, I wish to thank my parents. It is doubtless my mother’s lesson to question the world that has led me into research, and my father’s work ethic (however diluted during inheritance) that has carried me this far.

Thank you all.

Declaration

The material presented in this thesis is the result of original research by the named author, Stephen Wattam, under the supervision of Dr. Paul Rayson & Prof. Damon Berridge, carried out in collaboration between Mathematics and Statistics and the School of Computing and Communications, Lancaster University. This work was funded by the Economic and Social Research Council. Parts of this thesis are based upon prior publications by the author, listed below—all other sources, if quoted, are attributed accordingly in the body of the text.

• Stephen Wattam, Paul Rayson & Damon Berridge. Document Attrition in Web Corpora: an Exploration. In Corpus Linguistics, 2011.

• Stephen Wattam, Paul Rayson & Damon Berridge. LWAC: Longitudinal Web-as-Corpus Sampling. In 8th Web as Corpus Workshop (WAC-8), 2013.

• Stephen Wattam, Paul Rayson & Damon Berridge. Using Life-Logging to Re-Imagine Representativeness in Corpus Design. In Corpus Linguistics, 2013.

Mr Stephen Wattam ......

Dr Paul Rayson ......

Contents

1 Introduction

2 Background
  2.1 Corpora and Corpus Linguistics
    2.1.1 A Brief History of Modern Corpora
    2.1.2 What Makes a Corpus?
  2.2 Sampling Principles
    2.2.1 Nonprobability Sampling
    2.2.2 Probability Sampling
    2.2.3 Sample Size Estimation
    2.2.4 Sample Design
    2.2.5 Sources of Error in Corpus Construction
    2.2.6 Validity Concerns in Corpus Analysis
  2.3 New Technologies
    2.3.1 Web Corpora
    2.3.2 Intellectual Property
    2.3.3 Sample Design
    2.3.4 Documents of Digital Origin
    2.3.5 Life-Logging
    2.3.6 New Sampling Opportunities
  2.4 Summary

3 Longitudinal Web Sampling
  3.1 Document Attrition
    3.1.1 Methods & Data
    3.1.2 Results
    3.1.3 Discussion
  3.2 LWAC: Longitudinal WaC Sampling
    3.2.1 Design & Implementation
    3.2.2 Performance
  3.3 Summary

4 Proportionality: A Case Study
  4.1 Sample Design
    4.1.1 Difficulties and Disadvantages
  4.2 Technology
    4.2.1 Life Logging
  4.3 Aims and Objectives
    4.3.1 Sampling Policy
  4.4 Method
    4.4.1 Data Sources
    4.4.2 Recording Methods
    4.4.3 Recording Procedure
    4.4.4 Operationalisation, Processing and Analysis
    4.4.5 Coding and Genres
  4.5 Results
    4.5.1 Activities
    4.5.2 Distribution
    4.5.3 Events
    4.5.4 Word Counts
    4.5.5 Comparisons
  4.6 Discussion & Reflection
    4.6.1 Types of data
    4.6.2 Method
    4.6.3 Sampling Period & Validity
    4.6.4 Data and Comparisons
    4.6.5 Validity
    4.6.6 Ethics
    4.6.7 Future Work
  4.7 Summary

5 Describing, Building and Rebuilding
  5.1 Rationale
  5.2 Use Cases
  5.3 Design
    5.3.1 Profiling
    5.3.2 Retrieval
    5.3.3 Measuring Performance
  5.4 Method & Implementation
    5.4.1 Architecture
    5.4.2 Document Search & Retrieval
  5.5 Summary

6 Evaluation & Discussion
  6.1 Evaluation Criteria
  6.2 Performance of Heuristics
    6.2.1 Audience Level
    6.2.2 Word Count
    6.2.3 Genre
  6.3 Performance of Resampling
    6.3.1 Method
    6.3.2 Results
  6.4 Performance of Retriever
    6.4.1 Method
    6.4.2 Results
  6.5 Discussion
  6.6 Summary

7 Conclusions & Review
  7.1 Further Work
    7.1.1 Large-Scale Longitudinal Document Attrition Studies
    7.1.2 LWAC Distribution
    7.1.3 Slimmed-down Personal Corpora
    7.1.4 Augmented Personal Corpora
    7.1.5 Improved ImputeCaT Modules
    7.1.6 Bias Estimation

Bibliography

Appendices
  A LWAC Data Fields
  B Corpus Postprocessing Scripts
  C Lee’s BNC Genres
    C.1 Spoken Genres
    C.2 Written Genres
  D BNC68 Model Accuracy
  E BNC45 Model Evaluation
  F Retrieved Corpus Keywords
    F.1 Under-represented Keywords
    F.2 Over-represented Keywords

List of Figures

3.1 Overall distribution of top level domains
3.2 Differences in the empirical distribution of URL path length
3.3 URL coverage for batch and crawl
3.4 System Architecture
3.5 Download speeds for one worker for various levels of parallelism
3.6 Download speeds for 10,000 real-world HTML documents
3.7 Download rate over time for real-world BootCaT derived dataset

4.1 Data sources and their appropriate capture methods (those in blue were added during preliminaries)
4.2 The live notebook
4.3 A page from the live notebook, detailing the format used
4.4 The audio recorder used
4.5 The second audio index marker
4.6 The availability of data from various sources
4.7 Daily event frequencies
4.8 Daily word counts
4.9 Daily word counts broken down by recipient status
4.10 Event count by source

5.1 Creation of a corpus description
5.2 The influence of a biased population distribution on metadata selection
5.3 The objects involved in forming a corpus description

6.1 Flesch reading ease distribution in the BNC
6.2 BNC68b confusion matrix
6.3 Data flow for evaluation of the sampler
6.4 Word count distribution from the BNC index (n=4,054, breaks at 5k < 80k, max=428,248)
6.5 Log-likelihood scores and 95% confidence intervals for each metadata property during resampling (100 repetitions of 100,000 documents)
6.6 An overview of the retrieval mechanism used in the proof-of-concept implementation
6.7 Prototype/document readability scores
6.8 Prototype/document word counts
6.9 Distribution of document lengths in the written BNC and downloaded corpora
6.10 Prototype genre frequencies and resulting downloaded document frequencies. Shaded area indicates ‘over-downloaded’ categories
6.11 The distribution of differences in genre frequency between prototype and downloaded documents

1 Unique category counts within and between BNC genres
2 Log variance in DMOZ category frequency across and within BNC genres
3 Within-genre DMOZ category frequency variance as a proportion of global variance

List of Tables

2.1 The basic composition of the British and American corpora

3.1 An overview of the corpora selected for study
3.2 Loss from corpus inception to October 21st, 2011

4.1 A summary of variables sampled
4.2 Event frequency by computed genre
4.3 Log word count by genre and table of mean word count per genre (for all 66 genres with over 5 occurrences)
4.4 Percentiles for words-per-event
4.5 Major categories missing from the personal corpus but present in the BNC

5.1 Target Flesch reading ease scores and their equivalent BNC categorisation

6.1 Target Flesch reading ease scores and their equivalent BNC categorisation
6.2 Weighted mean performance (per class) for each classifier
6.3 Distributions of metadata dimensions in the input corpus (BNC index)
6.4 Document counts by genre for prototypes and downloaded documents
6.5 The 20 most over- and under-represented terms relative to the BNC

Chapter 1

Introduction

Kuhn’s view of scientific revolutions[105] holds that only empirical evidence, well applied, may provide the theoretical framework required to progress a field; as the intricacy of theoretical systems evolves, this necessarily requires more accurate data. Like many of the younger social sciences, linguistics faces a difficult dilemma: it must have high-quality empirical data in order to further develop widely-agreed-upon theories, yet its ability to establish the quality of samples is directly dependent upon said theories. Current corpus resources are often fashioned around a small set of exemplar corpora—these have become de facto standards, and their sample designs are heavily re-used. Whilst this approach ensures compatibility and interoperability, it also leaves many sample design decisions unspoken or assumed. Many of these exemplar corpora were built in the days before widespread use of technology such as the internet, before digital document creation became popular. As a result they are structured around designs intended to minimise practical issues surrounding retrieval, coding, and transcription that may not apply in a modern context. These challenges had the effect of prying corpus designs from their ideal statistical principles and into more expert-designed forms[7].

This thesis presents three explorations of this gap: identifying methods and techniques for sampling in unconventional ways, and assessing their worth within the context of existing general-purpose samples for linguistic research. For each, a generalisable method and supporting tools are produced, in order to make these efforts accessible to a wider linguistic audience. At its core is the notion of representativeness, the central argument being that the key to improving the validity of corpus-based scientific research (in linguistics or NLP) is the improvement of the sampling method.

Research Questions

RQ1 How can existing sampling theory from other disciplines be used to extend current general-purpose corpus construction techniques?

RQ2 How can WaC methods be applied to investigate and explore issues of representativeness?


RQ3 How can WaC methods be used to mitigate existing challenges in general-purpose corpus design?

Background (Chapter 2)

One significant issue, which remains a topic for debate, is what goals linguistic samples should have. As the focus in this thesis is on larger, general-purpose corpora, this requires some understanding of common use and existing practice. This chapter begins by reviewing the history of general-purpose corpus building, with a view to constructing a working definition of a corpus using existing terminology from corpus linguistics. These designs are then compared to a number of relevant ‘ideal’ sample designs informed by statistical sampling theory, and their similarities and limitations discussed. The closing portion of the review forms a brief introduction to new technologies available, particularly the Web-as-Corpus paradigm.

Document Attrition (Chapter 3)

As corpora involving web data become more prevalent, the need to understand the temporal properties of web publishing becomes increasingly important to the external validity of any scientific claims made using them. Many others have already observed these changes, though with a focus on technical, rather than linguistic, properties. This chapter presents a motivating investigation into sources of link rot (the tendency for resources online to become unavailable over time) within corpora distributed as lists of URLs. It finds that many corpora suffer significant loss when retrieved using naive methods, across a timescale likely to affect users of current corpora (some widely-used corpora are now decades old). This motivates long-term and in-depth analyses, neither of which I am well placed to cover in this thesis: however, a comprehensive tool was developed in order to perform longitudinal sampling of web data. This tool offers a mechanism for investigating properties as yet unaddressed by the literature, such as changes through time that do not affect availability (such as the editing of articles according to political whim), and network-level properties (such as the response time and location of hosting providers).
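To make the notion of naive retrieval concrete, the short sketch below is my own illustration rather than code from the thesis or from the LWAC tool introduced later; it simply requests each URL in a corpus distributed as a URL list and counts apparent losses (the URLs shown are placeholders):

# Illustrative 'naive retrieval' of a URL-list corpus (a sketch, not the LWAC tool):
# request each URL once and count documents that appear to have been lost.
from urllib.request import Request, urlopen

urls = [
    "https://example.org/article-1.html",     # placeholder URLs
    "https://example.org/removed-page.html",
]

available = 0
lost = 0
for url in urls:
    try:
        # A HEAD request tests availability without downloading the document body.
        with urlopen(Request(url, method="HEAD"), timeout=10) as response:
            available += 1
    except OSError:   # HTTP errors, DNS failures and timeouts all count as loss here
        lost += 1

print(f"{available} available, {lost} lost ({lost / len(urls):.0%} attrition)")

A real attrition study additionally needs to distinguish temporary failures from permanent loss, which is part of what motivates a dedicated longitudinal sampling tool.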

Contributions:

• A minor addition to existing literature on link rot, particularly in linguistic corpora.
• A publicly-accessible tool for longitudinal analysis of web resources, at a resolution and scale not possible with current software.

Personal Corpora (Chapter 4)

The issue of representativeness is examined here with respect to an unconventional research question: how representative is a general-purpose corpus of just one speaker’s experience of a language? This question is valuable primarily because it inverts the sample design commonly used for larger corpora, which cover many people in many contexts but with a low resolution for each individual. Inverting the standard design allows for detection of a number of difficult-to-determine properties, such as what proportions of each genre are used at given times of day, or in different contexts. At the same time, many of the challenges faced when executing such a sample design are novel, or have only been seen outside corpus sampling. Here, too, technological benefits such as the ubiquity of smartphones and speed of logging software yield results that would have been impossible just a few years ago. This section ends with a discussion of the potential uses of such a sample, both of itself (for use in special-purpose corpus construction) and at a wider scale to inform the sampling policy of future efforts (to identify missing components from typical designs).

Contributions:

• A case-study relating general-purpose corpora to one person’s idiolect; the practical and theoretical findings therefrom.

• A semi-automated method for producing personal corpora, including supporting software.

Profile-Guided Corpora (Chapters 5 & 6)

A central theme of this thesis is the research-question-oriented nature of any sample. Without a clear and objective research question, any claim to external validity is subject to the judgement of the reader. These two chapters detail the design and implementation of a method designed to unambiguously specify the important variables in a research design, and then retrieve corpora based upon these designs. This serves to solve a number of significant practical issues with conventional corpus design, and takes sampling from the web in a new direction: away from a corpus of the web itself, and towards a general-purpose source of documents for all purposes. This is accomplished by the encoding of procedures based on expert opinion into pluggable modules that may be used to construct a corpus’ sample design according to the whim of the user. These designs may then be distributed free of any legal restriction and re-used at will. The automated nature of the corpus retrieval process also serves to vastly speed up and simplify corpus construction. Because there is no reliance on a gold-standard corpus, evaluation for this method is particularly difficult. This thesis relies on two mechanisms: first, a white-box examination of each of the stages of the corpus-building process is presented, and its rationality confirmed; secondly, a black-box examination against the source corpus is performed, in order to identify any subjective, unintended bias in the result. Though the results of this evaluation are difficult to generalise to all corpus designs, they indicate a practical level of success for the common design used.

Contributions:

• A method for quantitatively producing, reweighting, and describing corpora from a set of metadata distributions, using bootstrapping (a minimal sketch of this idea follows this list).

• Software tools to summarise corpora into freely-reusable profiles.
• Software tools to construct corpora based upon said profiles, using fully-automated and semi-automated methods.

• A mechanism to mitigate existing ethical and legal issues surrounding corpus distribution.
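As a rough illustration of the profile-and-resample idea summarised above, the sketch below is my own and uses invented genre labels, proportions, and candidate documents; it is not the ImputeCaT implementation described in later chapters. A corpus ‘profile’ can be as simple as a target distribution over one metadata property, and a corpus can then be bootstrapped from a candidate pool to match it:

# Toy sketch of describing a corpus as a distribution over one metadata property
# and bootstrapping a matching corpus from a candidate pool. Genre names,
# proportions and the pool are invented for illustration only.
import random

random.seed(42)

# The 'profile': target share of each genre in the corpus to be built.
target = {"news": 0.5, "fiction": 0.3, "academic": 0.2}

# Candidate documents available for selection, tagged by genre.
pool = {
    "news":     [f"news_doc_{i}" for i in range(200)],
    "fiction":  [f"fiction_doc_{i}" for i in range(50)],
    "academic": [f"academic_doc_{i}" for i in range(20)],
}

def bootstrap_corpus(target, pool, n_docs):
    """Draw genres from the target distribution, then documents (with replacement)."""
    genres = random.choices(list(target), weights=list(target.values()), k=n_docs)
    return [(genre, random.choice(pool[genre])) for genre in genres]

corpus = bootstrap_corpus(target, pool, n_docs=1000)
for genre in target:
    achieved = sum(1 for g, _ in corpus if g == genre) / len(corpus)
    print(f"{genre:>8}: target {target[genre]:.2f}, achieved {achieved:.2f}")

The method evaluated in Chapters 5 and 6 extends this idea to several metadata dimensions and to the retrieval of candidate documents from the web.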

Conclusions (Chapter 7)

This thesis concludes with a review of the methods presented, and an examination of the contexts in which they may be used. The core themes of the thesis are reviewed, before a discussion of major items of further work.

Chapter 2

Background

Large samples of text, known as corpora, are important for many users. The groups I will be focusing on for the purposes of this thesis are particularly interested in scientific samples, that is, in generalising findings from the corpus to a wider population. This includes much work in Natural Language Processing (NLP) and corpus linguistics, and some of what is commonly termed ‘data science’¹. The problem of sampling linguistic data is a well-researched one in the field of corpus linguistics, and a review of linguistics’ approach to corpus construction will form the bulk of this chapter. This portion of the review begins with a historical overview of approaches to corpus construction, before focusing on the problem of defining a corpus in sufficiently concrete terms to suit the rest of the work presented in the thesis. It then focuses on potential threats to validity exposed by these methods, both from the point of view of corpus construction and use. Not to be neglected, however, is the perspective offered by more formal survey sampling, especially as its use across the social sciences can offer insights into solutions for practical problems. In Section 2.2 a number of appropriate sample designs are identified and evaluated with respect to corpus sampling. Finally, in Section 2.3 the first few steps towards improving sampling methodology within corpus linguistics are reviewed, with a focus on web-based corpus construction methods.

¹ Though in most ‘data science’ contexts, sampling is performed as an intrinsic part of the task.

2.1 Corpora and Corpus Linguistics

The field of linguistics is one concerned with description and formalisation of a particularly ethereal social concept. The paucity of philosophical agreement upon the nature of language has led to many different approaches being taken through the years, many of which have accomplished great things in advancing our capacity to reason about, and derive conclusions from, language. The most obvious method for inquiring about the nature of language is to sample real-world use. This process is followed in many other sciences concerned with social phenomena, and

1Though in most ‘data science’ contexts, sampling is performed as an intrinsic part of the task

5 CHAPTER 2. BACKGROUND 6 offers a tried-and-tested methodology for inferring results. It is perhaps unsurprising, then, that this method has been used to a varying degree by many linguists throughout history. In their 2001 book[127], McEnery and Wilson characterise the pre-Chomskian linguistic inquiry as primarily taking this form:

The dominant methodological approach to linguistics immediately prior to Chomsky was based upon observed language use.

They point to a number of studies using systematic analysis of language samples to draw their conclusions, ranging from the late 19th century through to the "modern era" of corpus analysis[90, 141, 170, 42, 188, 182]. They also describe Chomsky’s influence on the field, which led it to shun corpus techniques in the 1950s and turn towards explanatory, rationalist theories of language. These rationalist theories were often verified using experimentation or elicitation, in an effort to gather data that is detailed and reliable. It is arguable just how valid and philosophically defensible these methods are, especially given the nature of language as a part-mental, part-tangible concept. In a pre-digital world, collection and analysis of large-enough-to-be-useful samples of language was extremely difficult, making this rationalist approach a viable alternative. Small samples are fundamentally unable to reveal some of the details examined by the structuralists, and construction and analysis of sufficiently large corpora in a non-digital era would prove a practical hard limit on the power of corpus studies. To frame the focus on rationalist inquiry as an alternative to the empirical is a disservice to both: empiricism had supported the rationalist theories of the 1950s, and would itself go on to be supported in turn as it once again rose to prominence.

The renaissance of corpus linguistics may be attributed primarily to the availability of programmable computing machinery. This offered a solution to the problems of scale encountered during earlier corpus analyses, making empirical data once more viable for detailed inspection of language. This revival, often termed the ‘modern era’ of corpus linguistics, has yielded what we would commonly call a corpus today: a large, machine-readable, annotated collection of texts sampled in order to represent some population of real-world language use. Corpus studies are now widely relied upon across linguistics, and are often the method chosen to test theories derived from structuralist approaches. This may be seen as a validation of their methods, as the two complementary philosophies of scientific inference are once more able to use one another without methodological suspicion.

For the purposes of this section I will be focusing on the design of ‘modern’ corpora with respect to their use in validating linguistic theories, and for training automated NLP and Information Retrieval (IR) systems. For this reason I will be covering mainly general purpose corpora: those that purport to represent such a large population as to cover a whole language for most research questions. This type of corpus is built so as to be useful to many research questions and researchers, in part to dilute practical issues surrounding sampling. This is in contrast to special-purpose corpora, which are designed to represent a restricted context. Special-purpose corpora may be selected according to demographic or linguistic properties, and are typically much smaller. Because of this they are often built for a given study, or by re-sampling a general-purpose corpus.

2.1.1 A Brief History of Modern Corpora

The Brown Corpus of Standard American English[52] is widely regarded as the first of the modern age of corpora. Built in the 1960s, Brown’s corpus was the first electronic-format general purpose corpus and was roughly one megaword in size. It contains 500 samples, each comprising roughly 2,000 words, that were taken to represent a cross-section of works published in the United States in 1961. The proportions and sizes of samples were selected in order to trade off pragmatic concerns with the possible kinds of analysis that could be performed at the time. The ‘Standard’ in its name referred to Kucera and Francis’ intent that the corpus represent their judgement of ‘standard’ English use. Brown became a de facto standard for American English, and the design was carried forward into many other corpora, mostly regional versions designed to be comparable to Brown[78, 87, 77, 162, 35, 17, 128]. In order to maximise the value of comparisons within studies, other general purpose corpora chose to mirror Brown’s sampling policies.

Categories                               Texts in each category
                                         American corpus   British corpus
A  Press: reportage                             44               44
B  Press: editorial                             27               27
C  Press: reviews                               17               17
D  Religion                                     17               17
E  Skills, trades, and hobbies                  36               38
F  Popular lore                                 48               44
G  Belles lettres, biography, essays            75               77
H  Miscellaneous                                30               30
J  Learned and scientific writings              80               80
K  General fiction                              29               29
L  Mystery and detective fiction                24               24
M  Science fiction                               6                6
N  Adventure and western fiction                29               29
P  Romance and love story                       29               29
R  Humour                                        9                9
   Total                                       500              500

Table 2.1: The basic composition of the British and American corpora

The Lancaster-Oslo-Bergen (LOB) corpus[87] was built as a British counterpart to Brown. It uses the same stratification and sampling strategy (with one or two more texts in certain categories) and thus comprises roughly a megaword of British English, as published in 1961. The manual notes:

The matching between the two corpora is in terms of the general categories only. There is obviously no one-to-one correspondence between samples, although the general arrangement of subcategories has been followed wherever possible.

Table 2.1 shows the proportions of texts in each genre, relative to Brown (the ‘American corpus’), as reproduced from the corpus manual.

The London-Lund Corpus[63] was released in 1990, and contains transcribed spoken text, annotated with a number of different markers to indicate intonation, timing and other extra-textual information. The corpus consists of 100 texts, each of 5,000 words, totalling 500,000 running words of spoken British English. The annotation scheme used in LLC is much more in-depth than that in many written corpora, reflecting the different nature of research questions for speech.

Collins’ requirement for a corpus upon which to base their dictionaries spawned the COBUILD and its ‘representative’ subset, the Bank of English[85, 164]. COBUILD uses a slightly different approach to corpus building: that of the monitor corpus. Monitor corpora are continually added to, using a fixed sampling policy but an ongoing sampling process. At the time of writing, the BoE is 650 million words in size (the whole COBUILD family used by Collins is 4.5 billion)[109]. The approach taken by the BoE opens many possibilities for analysis of language over time (something also covered by diachronic/historical corpora using more conventional sampling). Even so, such comparisons are complicated by the irregular additions (cf. diachronic corpora, which contain complete samples for each time), and this is the only major corpus built in this fashion prior to automated web retrieval.

The de facto standard of the day is currently the British National Corpus, which comprises 100 million words of British English[114]. The BNC’s design was influenced heavily by discussions on corpus building that centred around creating a standard, reliable approach to taxonomy, sampling and annotation that occurred around the early nineties. The BNC aims at being a synchronic ‘snapshot’ of British English in the early 1990s. It consists of samples of text up to 45,000 words each, and is deliberately general-purpose, containing a wide range of genres as well as a sizable spoken portion. It was released in 1994, but has since been re-coded and augmented, particularly notably by Lee[111], who constructed a significantly more detailed (and principled) taxonomy for its texts in 2003. Lee’s genre taxonomy is outlined in Appendix C. Lee’s decisions to code genres using this system were partially based on fostering compatibility with the ICE-GB[62] and LOB[87] corpora. The prominence and ‘whole population’ coverage of the BNC’s sampling frame spawned many compatible corpora, though these often show more variation in sample design than the Brown clones. Xiao, in his survey of influential corpora, notes the existence of national reference corpora for American, Polish, Czech, Hungarian, Russian, Hellenic, German, Slovak, Chinese, Croatian, Irish, Norwegian, Kurdish, and many more[186]. An updated version, BNC2014[140], is currently being constructed.

The rise of electronic communications has led to a reduction in the effort required to gather corpus data. This has resulted in a great increase in the number of special-purpose corpora built for specific studies[157, 16, 121]. These corpora are more focused in that their construction methods restrict them to electronic data, yet their large sample size may make them suitable for the study of smaller-scale features in a more general context.

This thesis is focused on sampling using automated, technical mechanisms to overcome some of the challenges facing conventional methods, and as such focuses on web-based methods (known as ‘Web as Corpus’). WaC is concerned with sampling the web itself, as well as constructing samples that are representative of other data, yet are retrieved primarily from the web. This approach, and those using similar methods, are covered below in Section 2.3.

2.1.2 What Makes a Corpus?

The use of general purpose corpora as large monoliths, reused in many studies and systems, has led to much debate over the nature of a corpus. This, as we shall see throughout the thesis, is a question unworthy of simple answers—each purpose will exert certain demands upon corpus design criteria, and any widely-used corpus is likely to be a compromise around these. This section exists primarily to define the term ‘corpus’ as used here: it is unlikely that the definition derived below is universal, however, it is designed to reflect most use-cases, across corpus linguistics and NLP. In order to establish the important traits of corpora, it is wise to have an understanding of the motivation behind their existence. Corpus methods are generally contrasted against two other methods of linguistic investigation: direct elicitation from a language speaker, and directed research into a linguistic feature under controlled conditions. Both of these pose significant scientific challenges—both are reliant on at least one linguist’s intuitive view of language (one that could hardly be said to be representative of most language users), and both require the acquisition of data without its usual context, something that is especially difficult given the varied and context-dependent complexity of language. Corpora provide limited solutions to both of these issues. In the former case, they provide an objective record of linguistic data that is free from all but the initial builders’ influences (which, in the ideal case, may be documented and provided along with the data itself). In the latter, they are as portable as any large volume of text, and may be annotated with context sufficient for a given linguistic task. At its most basic level a corpus is a sample of text. Given research questions surrounding a body of text, it is perhaps necessary only to stipulate that a corpus must be representative of a known population [127, p. 22]:

. . . a body of text which is carefully sampled to be maximally representative of a language or language variety.

Further to this, the modern definition of a corpus has undergone a series of significant refinements thanks largely to the increase in both the ubiquity and power of computing machinery. General-purpose corpora are, with very few exceptions, machine-readable (with an increasing number documenting texts of electronic origin), multi-modal (covering a wide variety of methods of communication and their linguistic features), and annotated with linguistic data. Many authors go further by stating that a corpus should be annotated with information useful to linguistic inquiry, built for a specific purpose or methodology, available for use in other studies, finite in size, or stratified to provide multiple possible analysis methods with internally valid data[19, 28, 113]. Whilst I do not consider many of these to be requirements for a scientifically useful corpus, they may contribute greatly to corpus utility due to their alignment with common methodologies and uses. There is undoubtedly a case for this—the utility of a corpus is often limited by its format—however, the extent to which this applies varies so wildly by purpose that it seems unreasonable to refuse the term ‘corpus’ to resources missing some of the above. This review, then, shall start with the most basic of definitions, that of ‘a body of text sharing some property that may be interrogated for some linguistic information’. This takes into account the claims of generalisability that cannot be made for more haphazard collections of texts (often called libraries or archives with no relevant common features) which are not demarcated by the boundaries of some notional property.

Representativeness

Representativeness is, in effect, the goal of any sample. It is the property that I chose to use as the loose starting condition above, and it is to be maximised by selection of those properties covered below. By virtue of this holism, it is also dependent upon enough factors to be poorly defined in the literature. The concept of representativeness is based strongly on philosophies of inference, and epistemology in general. These are highly dependent on correlation in variables external to those measured by a sample, and any rigorous definition will involve the properties we wish to generalise, the purpose of such generalisation, and the population we wish to generalise about. Users of reference corpora (those general-purpose corpora designed to represent a whole language) may find, for example, that their claims to validity are different to previous studies, simply due to the nature of their research question. Much has been written on the subject of representativeness[23, 180, 179, 113, 145], both in linguistics and in other fields that are dependent upon complex sources of data. As with psychology or sociology, the opacity of the mechanisms that generate language is such that significant philosophical disagreement as to the underlying nature of the data occurs. This disagreement has, in many ways, limited efforts to formalise and reason about representativeness in corpus design: taken quantitatively, one person’s adequate corpus is another’s woefully biased one. The LOB manual hints at the fluid nature of ‘representativeness’ in corpus linguistics, and the degree to which corpus design is expert-guided[87]:

The true “representativeness” of the present corpus arises from the deliberate attempt to include relevant categories and subcategories of texts rather than from blind statistical choice. Random sampling simply ensures that, within the stated guidelines, the selection of individual texts is free of the conscious or unconscious influence of personal taste or preference.

I consider this a false dichotomy: a high quality random sample is essentially the gold standard to which expert designs should be aspiring, and both are aiming for the same notional goal. The key benefit to random sampling is, as observed above, the immunity from ‘unconscious’ bias[12]. This ambiguity could be seen as an argument against the concept of a general purpose corpus, however, current progress indicates that this would be hasty: whilst special purpose corpora are often burdened by fewer procedural and pragmatic difficulties, they are necessarily limited in scope. It is still rational and useful to identify speakers of a language as a homogeneous group at some level, especially for smaller linguistic features. On the other hand, only a sample that properly encompasses a large amount of variation may be used to describe many effects of interest.

With this in mind, many of the studies into representativeness have focused on examining existing corpora, with a view to testing their internal variation against some known re-sample. This approach has been taken most famously by Biber[22], who performed a series of studies in which he compared the content of 100-word samples from the LLC and LOB corpora. Each sample was paired with another from the same text (LOB sample points are 2,000 words each, and LLC’s are 5,000). Biber went on to extract some small-scale linguistic features from each sample, before examining the difference in frequency between each. Biber concluded that existing corpora were sufficient for examination of smaller linguistic features, however, in the process he saw fit to reject the notion of representativeness used here (and, notably, everywhere else) [22, p. 247]:

Language corpora require a different notion of representativeness, making proportional sampling inappropriate in this case. A proportional language corpus would have to be demographically organized. . . because we have no a priori way to determine the relative proportions of different registers in a language.

Biber’s argument is that we should not aim for any degree of proportional representation of linguistic features, for this would produce a corpus that is mostly one kind of text (due to a conjectured Zipfian distribution of text types). One solution to this is to use stratified sampling[150], allowing for control over the sample proportions without entirely sacrificing representativeness. Biber’s presentation of this idea has seemingly led many to reject the common wisdom of sociological sampling, leading corpus building to be seen as a fundamentally new activity, and consequently as something of a black art.

A further aspect muddying the waters of quantitative representativeness assessment is disagreement over how to parsimoniously stratify language. This is in part due to the difficulty in defining a taxonomy for genre[110], which is often the most controlled variable within a general purpose corpus’ sample design. Ultimately, I consider that the answer to this is, as mentioned in the LOB manual, down to the individual research question: a corpus sufficient to represent one feature may easily be massively biased for those with greater variation in the population. In part because Biber’s work was taken early on as a validator of current practices in corpus building, the notion of representativeness has remained an almost entirely philosophical concept within corpus linguistics.
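As a rough illustration of the contrast between proportional and stratified designs discussed above, the following sketch is my own; the text-type labels and Zipf-like weights are invented rather than taken from Biber or from any corpus:

# Toy sketch: under a Zipf-like population of text types, proportional selection
# is dominated by the most common types, whereas a stratified design fixes the
# composition of the sample in advance. Categories and weights are invented.
import random

random.seed(1)

text_types = ["conversation", "news", "fiction", "academic", "legal", "poetry"]
zipf_weights = [1 / rank for rank in range(1, len(text_types) + 1)]

def proportional_sample(k):
    """Select k texts with probability proportional to their population share."""
    return random.choices(text_types, weights=zipf_weights, k=k)

def stratified_sample(quota):
    """Select texts according to an explicit per-stratum quota."""
    return [t for t, n in quota.items() for _ in range(n)]

proportional = proportional_sample(600)
stratified = stratified_sample({t: 100 for t in text_types})   # equal allocation

for t in text_types:
    print(f"{t:>12}: proportional {proportional.count(t):>3}, stratified {stratified.count(t):>3}")

Under the proportional design the most frequent types dominate the sample, whereas the stratified quota fixes the composition at the cost of no longer mirroring population proportions.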

The BNC, though often relied upon as a reference corpus, makes a number of bold assumptions regarding representativeness—to the point of explicitly stating its deviation from being a solid representation of its population [31, p.6]:

There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English.

This implies that the corpus should not be used as a single sample at all, or that large adjustments should be made during analysis to avoid generalisations on purely quantitative bounds. A similarly pragmatic approach is taken to the sampling of written material according to production and reception statistics, with a balance being brought between the two based on publication, library lending, and magazine circulation statistics. Further, some samples were taken purposively [31, p.10]:

Half of the books in the ‘Books and Periodicals’ class were selected at random from Whitaker’s Books in Print 1992. This was to provide a control group to validate the categories used in the other method of selection: the random selection disregarded Domain and Time, but texts selected by this method were classified according to these other features after selection.

The spoken portion contains a ‘demographically sampled’ portion (comprising 50%), so called because it makes an attempt to represent speakers according to their sex, age, and social class. Individuals were randomly distributed, and each provided recordings of their conversations over a two to seven day period. The major limitation noted here is the short period of time, and thus the low probability of capturing certain, rare, interactions on tape. In order to remedy this, half of the spoken component was devoted to genre-based (‘context-governed’) stratification, with the intent being to capture a greater breadth of data.

By contrast to corpus linguistics, NLP has written little on representativeness, focusing more on the parsimony and utility of models than their external validity. A notable exception is the test by Dridharan and Murphy[169], who attempt to assess the impact of corpus quality on distributional semantic models. An illustration of the conceptual split here may be seen in their use of the term ‘quality’ to refer primarily to the form of the data, and the degree to which it is curated manually. This is of particular relevance as their corpora are sourced online, something discussed in Section 2.3.

Size

The size of any sample is a crucial and often hard-to-determine property, and corpora do not differ in this respect. In many ways the question of corpus size is a primary constituent of the representativeness property above, and though it has long been recognised as such there remains little consensus on just how large is large enough. This disagreement is in part because language exhibits a number of properties that make it hard for us to gauge variability in the population, meaning that most sample size estimation methods (which rely on random sampling) are poorly suited. The first of these is the inability to accurately measure the complexity of language, which, as used and applied by humans, has unknown degrees of freedom. This effect applies itself at many levels:

• Lexicon—Vocabulary, even when restricted to a given demographic, time period or person, is difficult to define with any certainty. Many people are capable of recognising entirely novel words, simply by virtue of their context or morphology (and the inverse, e.g. Jabberwocky).

• Syntax—The meaning of words and phrases is heavily context based, but the context affecting each aspect of language is variable and, in some cases, wide-reaching. This makes it hard to determine if we should be sampling single sentences, 2,000-word texts, or the whole thing[75]. This unknown sampling unit drives one of the key trade-offs in corpus design, as a 1-million word corpus comprised of single sentences will be capable of embodying more variation than will one made of two 500,000-word samples.

• Semantics—Variation in our understanding of language in context is poorly understood. This is the driver behind any models of the above, but also affects how we should sample external variables such as socioeconomic factors and even the direct circumstances in which a text is used[165, 171].

This ambiguity produces a tradeoff that has been identified by relatively few in the corpus-building community[46, 94, 67, 8]—that corpora of equivalent size may yield significantly different inferential power due to their differing number of data points. This may be seen as an issue of depth vs. coverage: corpora including large snippets of text are capable of supporting more complex models and deeper inference, at a cost to the generality of their results. There remains disagreement upon the extent to which these two aspects should be traded off, though it is notable that the NLP and linguistics fields vary greatly in their treatment of the data, with linguistics typically focusing on frequency and immediate collocation, and NLP being skewed towards more complex, instrumentalist, models. I would suggest that, realistically speaking, researchers should be examining their experimental design with respect to the size and/or complexity of the features they are working with in order to select a corpus that matches most closely. It is also quite clear that this does not happen: the BNC contains a small number of large samples, yet is often used to analyse small-scale linguistic features.
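To illustrate the depth-versus-coverage point in the simplest possible terms, the following simulation is my own sketch (it assumes NumPy and invents all rates and sizes): each text is given its own underlying rate for a feature, and a fixed one-million-word budget is divided into sampling units of different sizes.

# Toy simulation: how the choice of sampling unit affects the spread of a
# corpus-wide frequency estimate when the total word budget is fixed. Each text
# has its own underlying rate for the feature, so designs built from fewer,
# larger samples see less of the between-text variation. Numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

def estimate_spread(n_texts, words_per_text, reps=1000,
                    base_rate=0.05, between_text_sd=0.02):
    """Standard deviation of the estimated feature rate over `reps` simulated corpora."""
    # Each simulated corpus draws n_texts per-text rates around the population rate.
    rates = np.clip(rng.normal(base_rate, between_text_sd, size=(reps, n_texts)), 0.0, 1.0)
    # Count feature occurrences in each text, then pool into one corpus-wide estimate.
    hits = rng.binomial(words_per_text, rates)
    estimates = hits.sum(axis=1) / (n_texts * words_per_text)
    return estimates.std()

# A fixed budget of 1,000,000 words divided into different sampling units:
for n_texts, words_per_text in [(2000, 500), (500, 2000), (2, 500_000)]:
    spread = estimate_spread(n_texts, words_per_text)
    print(f"{n_texts:>5} samples of {words_per_text:>7,} words -> spread {spread:.5f}")

With these invented numbers the design built from only two large samples produces a far noisier corpus-wide estimate, because so few draws are made from the between-text variation.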

Secondly, establishing the boundaries of a given language is difficult. Users of a general- purpose corpus wish for two conflicting properties to be satisfied—the population must be wide enough to provide a useful set of persons about which to infer (and, more loosely, this should align with those persons we can say informally, for example, ‘speak English’), yet the corpus must provide sufficient coverage to represent that population in the first place. Generally it seems that this problem, though having been recognised, has received too little effort for practical reasons. Issues of balancing demographics in corpus design have typically taken a curiously detached form, that of selecting a sample of language as a proxy for demographics (e.g. selecting the bestsellers over unpopular books or sampling more popular newspapers). This approach has led to a situation where each and every general-purpose corpus carries with it the expert judgement of linguists not only in selecting a wide variety of texts from within the population, but also their socio-linguistic opinions. Sinclair makes this explicit in ‘Developing Linguistic Corpora’, where representativeness is described in entirely subjective terms [185, p.5]:

A corpus that sets out to represent a language or a variety of a language cannot predict what queries will be made of it, so users must be able to refer to its make-up in order to interpret results accurately. In everything to do with criteria, this point about documentation is crucial. So many of our decisions are subjective that it is essential that a user can inspect not only the contents of a corpus but the reasons that the contents are as they are.

This reliance on expert opinion to overcome practical challenges associated with text retrieval has led to the sampling policy being somewhat opaque to end users. Those using a given corpus are not necessarily able to rigorously define about whom they infer a given result. In practice, this manifests as a need to qualify results by reasoning about the likely impact of any ambiguity.

Finally, disagreement on how we should extract features from language samples (i.e. which dimensions of variability are interesting) means that, aside from the immediate and obvious properties such as genre (about which there is arguably less agreement[110, 6, 161]), any efforts to stratify language are met with suspicion. This may lead to researchers performing corpus analysis using informally-subsampled general-purpose corpora, with questionable correspondence between the categories used to select texts and their research goals. It is the author’s opinion that this is unanswerable except for individual studies: in order to know variation in features affecting one’s use of a corpus, it is necessary to define the covariates and evaluation strategy.

Further to these problems of defining the nature of the sample, there is the resultant problem of determining a sample size even where these are known. Taking the first of these issues in the extreme, it may be said that the only ‘fully representative’ corpus must contain all context for each text (something that would include at best an abridged history of the world) in order to satisfy inquiry from many different perspectives. Given the limitations in analysing properties of language beyond a given scale, and the clear impracticality of extending that scale’s upper bound, it seems reasonable to conclude that current corpus efforts are sized so as to be useful for small-scale features, and that our inspection of language is currently somewhat shallow. Taking into consideration problems of proportional, stratified sampling, it seems possible to establish a corpus size and composition that is widely agreed upon. Nonetheless, issues of selecting a sampling unit mean that the resultant corpus may either be a refinement of current efforts (not necessarily a bad thing) or utterly colossal and beyond the capacity of even modern corpus processing systems. Since sample size and composition are intimately related both to one another and to the concepts of representativeness and transferability, this issue will be one of primary importance to the rest of the thesis.
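For comparison, the textbook sample-size calculation for a simple random sample of independent units (my own worked example, not a procedure drawn from the corpora discussed here) estimates the number of units n needed to measure a proportion p to within a margin of error e at a confidence level with critical value z:

    n = z^2 p(1 - p) / e^2

For instance, estimating the relative frequency of a feature expected at p = 0.05 to within e = 0.005 at 95% confidence (z = 1.96) requires n = 1.96^2 × 0.05 × 0.95 / 0.005^2 ≈ 7,300 independent sampling units. The figure is only meaningful once the sampling unit (word, sentence, or text) and the independence assumption are settled, which is precisely what the discussion above identifies as unresolved for corpora.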

Purpose

The reason for sampling a given population is a crucial feature of corpora. Not only does it define the level and type of metadata available (and the arbitrary definitions used therein), the expert selection of sample designs means it has a large influence on the sampling frame used, and thus the validity of any generalisations made using said data. The condition given by McEnery & Wilson[127] here is that a corpus must have been built with some degree of linguistic inquiry in mind, for example, the collected works of Shakespeare would be counted, but not one’s personal book collection (even if it happens to include the complete works of Shakespeare and nothing else). This distinction seems to be little more than a way of stating that one’s ability to infer things from a corpus is relative to its external properties, which is true of any sample. This requirement calls into question two things: the generality of a corpus, and the extent to which it is documented. General-purpose corpora sample a large population, of interest to many users. This requires that they remain fairly unbiased and cover a large amount of variation (which in turn necessitates very large sample sizes). Since they are re-used many times, the quality of their documentation is key to their gross value—each user will require particular variables in order to generalise from the text. Special purpose corpora avoid this challenge by being re-used for less disparate aims (and, generally, less frequently). Because of the relative focus, their documentation is often able to be significantly more detailed, allowing for deeper insight. This approach, however, is not transferable to the sample sizes used in general purpose corpora.

To contextualise this, both examples above merely constitute special-purpose corpora, which offer answers to different research questions. For one, we may answer those about the nature of Shakespeare’s language use, whereas the other allows us to find knowledge about a given person’s literary preferences. It is of no direct consequence that more people are likely to care about the former. If we extend the example to include others’ book collections, the ability to generalise changes: we know significantly more about Shakespeare than many other authors, and it is thus possible to annotate the Shakespeare corpus with details of his life, times when works were written, etc. The same could not be said of a corpus where our aim is to investigate reading habits (even if some people exclusively read Shakespeare).

This ‘purpose’ requirement can be phrased entirely in terms of external variables. A group of texts about which nobody knows anything do not offer any opportunity for inquiry (except about themselves), and so it is reasonable to require documentation of the context in which those texts occur (or any external property that demarcates a homogeneous group). By stating that a corpus must have a defined purpose, we include a set of reasonable assumptions about the corpus and its external properties: a corpus built for study of Italian newspaper text is unlikely to heavily sample The Guardian, for example. In many ways, statement of purpose offers a shorthand for many decisions and assumptions inherent in construction of a large corpus, and a simple way to assess how well matched a corpus may be for another—related—purpose.
It is noteworthy, however, that selecting a corpus by its original purpose greatly complicates any subsampling that is possible—one must be able to subsample texts not only according to external data of interest, but also taking into account the original interactions and assumptions made by the constructors. As mentioned above, it is unlikely that any corpus is capable of being documented ‘fully’ enough to avoid these issues, however, this is a compelling argument for focus on detailed metadata being provided (rather than detailed rationales for sampling), especially in the case of general-purpose corpora—say ‘what’, rather than ‘why’.

Data Format

Some authors stipulate that a modern corpus should be electronic. To generalise this position, they require that it is in some way processable by machines, i.e. that it must be in some way regular. This defines the format of not only the basic textual content, but also the availability of metadata at all levels (category, document, word). Both of these have been addressed to some extent, especially the problem of representing the text itself, which has largely been solved by UTF-8 at the character level, and XML/SGML at the markup level. Standards derived from efforts such as the TEI[80] have some penetration, though there is often still the need to include proprietary extensions if other concerns on this list are to be maintained. Though earlier work on corpus construction focused on data formats[7, 166], with a view to sharing corpora for local analysis, modern approaches tend towards corpora ‘as a service’[71, 49]— providing a front-end to query and analyse text directly whilst still hosting it at the original institution. This approach is taken to mitigate licensing issues, and to work around the high level of technical skill necessary to work with large-scale data. This thesis takes the view that representation is largely a solved problem: UTF-8 and XML are both capable of storing international characters and complex annotations in a way that is easily mined for many uses, and advances in database technology and distributed processing offer many ways to process and retrieve structured data. CHAPTER 2. BACKGROUND 17

Classification

Efforts have been made to address the ambiguity of some often used strata such as text-type and genre. One of the more notable was EAGLES[166], which produced a number of recommendations designed to be applied to new corpora whilst retaining general compatibility with existing designs. To some extent, the form a corpus takes is defined by its content—any meaningful taxonomy is arbitrary and theory-laden. For this reason, recommendations made by EAGLES were driven by a review of the theoretical basis for existing taxonomies and work at the time[166]:

Any recommendation for harmonisation must take into account the needs and nature of the different major contemporary theories.

The need to provide useful metadata is particularly challenging for general-purpose corpora, which have very loosely-defined aims and must maintain a high level of neutrality. Internal text distinctions may be drawn using ‘bottom-up’ (often termed ‘corpus-driven’) methods, however, these are frequently difficult to operationalise. A prime example of this is Biber’s multidimensional analyses of corpora[21], which focus on feature extraction and principal component analysis in order to identify primary dimensions of variation within corpora. Biber’s approach has been held aloft by many as one of the only ‘unbiased’ analyses of variation around, however, this is not the case. Without an authoritative underlying theory of language use by humans, any features used to construct the model will, however neutrally treated thereafter, exert pressure on the results. Given the subtlety with which such analyses extract information (and the difficulty in interpreting resultant factors), this is likely to lead to favouring one set of conclusions over another. This is also true of higher-level techniques such as latent dirichlet allocation (LDA), which is usually trained using token frequencies[25]. These straight frequencies are still entirely bereft of context, and tokenised using procedures that embody particular theoretical decisions such as the importance of punctuation[138] (especially true for languages lacking word delimiters, such as Chinese[184]). Sharoff[160] presents a more transparent clustering methodology that relies on keyword metrics to identify salient features. This provides some connection to other keyword literature, along with a mechanism for inspecting the resultant categories. Arguably, further problems with the approach of finding natural strata of variation lie with sampling and linguistic problems: many of the practical issues surrounding corpus building involve enumeration of easily identifiable groups of texts, something that would be hard to compromise or ‘trade off’ if working from factor loadings. Further, many analyses will find subsampling data difficult when defined in terms of many external variables[6], though this is more of a challenge for the linguistic community (and providers of its tools). Through reasoned examination of existing efforts, Lee[111] was able to re-form the BNC’s classification by adding external data and classifying documents manually. This is arguably the most prominent effort to apply the multi-phase ‘examination and re-appropriation’ method that CHAPTER 2. BACKGROUND 18

Biber and EAGLES recommend. His methodology, though labour-intensive, offers a defensible way to trade off the various interests of users. One key aspect to this is the fact that it was built after the corpus, and thus may take into account common usage when making distinctions between texts—something that Lee relies on in an effort to define categories using ‘external’ definitions (as opposed to Biber’s internal variance measures). This may prove more easy to operationalise in some circumstances, but still involves the subjective expert opinion of someone who may or may not agree with the user’s perspective[6]. In summary, the problem of producing a meaningful and operationalisable taxonomy for corpus organisation is actually one of community agreement: for any given user of a corpus, the task is simpler. This indicates that a transparent, well-documented approach should be taken in order to allow end users to decide upon their level of agreement with the pre-applied categories, or that differing levels of confidence should be applied to aid subsampling.

Dissemination and Collaboration

The ability for multiple researchers to access a corpus is one of the main benefits of corpus methods—corpus-based studies may be replicated and compared with absolute certainty of the empirical aspect of the research. The process of building a large, multi-purpose sample for use by a whole field of research mirrors that used in other fields, many of which have similar problems gathering data. Many examples of this approach exist, such as the British Household Panel Survey and British Crime Survey[173, 76], both of which demand[ed] significant investment over a long period in order to overcome the practical issues of large-scale demographic sampling. Nonetheless, the quality this yields has lead to their widespread use, for example, both are used extensively by governments to assess social policy impact. Although corpora are generally less sensitive on ethical grounds, many legal challenges remain to distributing and using such a large quantity of data. This has historically greatly limited both the source and form of data gathered for corpora, and was one of the reasons behind sampling 2000-word samples (rather than whole texts) in the Brown corpus. Navigating copyright law remains one of the primary tasks of a corpus building effort, though the increasing dependence on digital sources has dulled this somewhat, since they are typically less controlled when being published. Nonetheless, corpora often come bundled with restrictive licensing, something that limits the ability of the wider community to participate in their use. Many countries’ fair use exemptions apply to research, though this may apply only to copies taken for private study, as laid out in the UK’s Copyright, Designs and Patents Act 1988 . The concept of ‘Fair Dealing’ covers many uses of extracts in research, and a recent exemption was added that additionally allows the use of[178]:

. . . automated analytical techniques to analyse text and data for patterns, trends and other useful information.

The heavy re-use of corpora, then, exerts both positive and negative forces upon the scientific community. On one hand, it is necessary to share resources in order to lower costs, increase awareness of rare data, and pool efforts to create better samples. On the other, the ubiquity of a corpus may damage its scientific value, and starve the community of more up-to-date or relevant resources. In the worst case, widespread use of a single corpus by the community may lead to partial circularity of hypothesis derivation and testing. In an ideal world, it would be possible to replicate studies with one’s own samples. As I shall cover throughout this thesis, increasing digitisation of documents (and use of the web) offers a way to make this scenario possible, at least for some corpora and purposes.

Sample Type

Corpora span many types of sample, from simple cross-sectional ones (‘synchronic’ corpora: Brown, BNC) to those that aim to report language use through time (‘diachronic’ corpora: ICE, Longman/Lancaster) and a hybrid of the two, which is designed to follow language use and update on-the-fly (‘monitor’ corpora: BoE, COCA)[52, 30, 62, 172, 85, 37]. Further to this, there are parallel corpora, intended to match texts across some variable such as language[102, 125], and many other designs that combine properties of these to satisfy various sample designs. These approaches represent various use-cases, and are often significantly less ‘general-purpose’ than large synchronic reference corpora such as the BNC. This thesis treats the selection of a sample design as problem-dependent.

A Working Definition

The above discussion illustrates the wide variety of samples that may be called general-purpose corpora, and some of the issues that affect their utility. The focus of this thesis is on mechanisms for making, operationalising and documenting the above choices, and as such the working definition here opts to not apply any clear restrictions. One obvious requirement, however, is that a corpus must be a suitable sample for its intended purpose. This means that the sample design (external variables) and the document classification (internal variables) must be clearly defined in terms of the research questions given. For the purposes of this thesis, a corpus constitutes a body of text that must be:

• Representative of some stated population;
• Sufficiently large to satisfy that representation (for a given set of purposes);
• Explicit in its coverage of external variables;
• Of a regular, machine-readable format.

2.2 Sampling Principles

Corpus building methods are largely based on sampling methods from the social sciences. These methods are well developed, and their formal frameworks specify a number of design choices CHAPTER 2. BACKGROUND 20 that must be made whilst designing a sample. These decisions largely affect the suitability of a corpus for different forms of analysis, and the frameworks they are based on may be used to motivate design of the sampling process itself. Re-examining the original principles of sample selection allows us to use some of the formalisms developed elsewhere to inform judgements on the properties of linguistic samples. The goal of any sample is to present a scaled-down set, containing individuals that represent all variation within the population. Before discussing the implication of various approaches, it’s therefore important to draw a distinction between variables which are controlled by an experimenter (independent) and those that remain free to vary (dependent). This distinction has a large impact on sample design, as it is impossible to draw conclusions about the population by inspecting independent variables. Further, the selection of documents according to these controls often leads to systematic bias in dependent variables. In order to come to conclusions about representativeness and sample quality, it is thus necessary to identify these variables ahead of time. In the case of sampling documents, I will be presuming that users of corpora are primarily interested in inspecting ‘internal’ text features, which are described in terms of ‘external’ document metadata. This guiding principle mirrors the metadata/data dichotomy seen in corpus tools, and should be uncontroversial2. In this thesis, I will often label variables as internal or external based on these criteria, with the implication that external variables are independent, and vice-versa. The ideal sampling scheme for a given population and selection of variables, then, contains variables as controlled specifically for the research question about which one wishes to infer, and maximises coverage of internal document content (and uncontrolled-for metadata values). Taking this principle to extremes yields the maximally-representative census sample: 100% of the population of interest. At this point, any inference is mere observation, and the only potential pitfalls are ones related to whether or not the question itself is worth asking, rather than the validity of its answer. A census still contains theory-laden assertions, however, in the form of its population definition[7]. At the high level, samples may be classified into two main groups: nonprobability, and probability samples[135]. The former of these is primarily guided by a systematic or subjective choice, and the latter has a sampling frame defined by random selection.

2.2.1 Nonprobability Sampling

The selection of data points in a nonprobability sample is performed either by an objective system, subjective reasoning, or some combination of the two. Factors influencing selection are often situational or theoretical, meaning that samples require a greater understanding of the subject to avoid accidental biases. Three main forms of nonprobability sampling are identified by Barnett[12] and Teddlie[174]: availability sampling, purposive sampling, and quota sampling.

2Note that many definitions of genre, text type, etc. make this circular, as they are defined by document content.

Availability Sampling

Also known as convenience sampling, this approach simply takes the most readily available data points. This method has been widely used in the social sciences, with experimenters often using students from their affiliated institutions (an approach used for some very famous studies). Convenience sampling is rightly regarded as extremely unrepresentative of wider populations, particularly in fields (such as psychology) where there are a great number of covariates. If corpus linguists were to work with data sampled as conveniently as those in the Stanford prison experiment3[70], they would merely be reading books from their own bookshelves. There are a number of methods that are designed to adjust for these biases. Birnbaum & Munroe[24] present a model of the bias resulting from unavailable population members after random selection. This approach is difficult to operationalise in a linguistic context as they require enumeration of data points prior to determining whether or not they are available, rather than the more haphazard approach of true convenience sampling. Farrokhi[48] approaches the problem of constructing two groups: a control and treatment group, using a series of predetermined criteria to assign membership. This approach mirrors the comparative nature of many corpus linguistic methods, but does not directly offer a way to produce a more representative corpus for ‘general use’ aside from the principles of using heuristics to guide design. Despite the drawbacks of convenience sampling, such an approach may be appropriate for populations where it is agreed that there is little variation between individuals for the variables of interest: for example, a study which seeks to establish the modal number of eyes humans have might be well served with a very small sample. Smaller linguistic features such as unigram frequencies are likely to be similarly easy to represent.

Purposive Sampling

Purposive sampling describes the case where those constructing the sample make a deliberate and systematic choice of inclusion, based on expert opinion. This approach is primarily effective against well-known sources of bias, and the validity of any sample built using it is entirely dependent on identification of these. Where the expert design encompasses all variables of interest to a study, and where selection has been performed in such a way as to encompass individual data points from across the population, such an approach may result in a high-quality and defensible data set. The main difficulty is in avoiding previously-unknown correlations between the theoretical selection criteria and the study variables. As selection criteria are based on domain knowledge, and heavily theory-laden, it is often difficult to anticipate their interactions with other variables of interest. As theories improve over time, this may also lead to the effect of samples being considered more biased as knowledge of an area improves, gradually reducing the perceived validity of any results. Criteria for selection generally accomplish differing goals, and will suit varying study aims[38] (for example, some of these may be very useful for exploring rare features):

3“The participants were respondents to a newspaper advertisement, which asked for male volunteers to participate in a psychological study of ‘prison life’ in return for payment of $15 per day.”

Heterogeneous • An attempt to cover as much of the population as possible, by selecting data points that are unlike ones already in the sample.

Homogeneous • Data points are selected to be as similar as possible according to given variables. This is suited to in-depth qualitative analysis, being somewhat analogous to merely controlling for more variables (i.e. reducing the population).

Typical Case • Data points are selected according to a theoretical/rational idea of typicality.

Extreme Case • Only those data points regarded as atypical are sampled. May be used to explore reasons for atypicality.

Critical Case • Data points are selected based on previous theories, such that they explain certain hypotheses.

Total Population • An entire sub-population is sampled, due to its ease of access (for example, all members of an organisation, or all publications by a certain author). If no other items are sampled, this merely becomes a census with a very tight population.

Expert • Expert opinion is used to determine inclusion in the sample.

From a corpus construction perspective, where the sampling party is often distinct from the analyst, the complexity of sample inclusion criteria brings with it the need for extensive documentation: without awareness of the expert’s choices, it becomes very difficult to defend any use of the sample. Due to the nuanced nature of their validity, purposive samples are valid only for qualitative analyses, where interactions with the study design are able to be rationalised and explained in context. This is especially the case where a study’s purposive design is borrowed for use in other contexts: for example, the Brown corpus explicitly states its aim to represent ‘standard’ English, yet its sample design has been widely copied[78, 162, 128]. In fields which lack a cohesive, quantitative model of their population, it is often necessary to start from an expert-defined base. Ideally, information from this sample may then yield methodological improvements, and, eventually, an unbiased mechanism for retrieving a statistically representative sample.

Quota Sampling

This is a two-stage design, with a number of sub-populations being identified and then selected on a purposive/availability basis. It is essentially a nonprobabilistic form of stratified sampling: the population is split into mutually exclusive groups, into which data points are placed until each group is ‘full’. This approach is often used to control convenience sampling variation for some important variables, for example, controlling for sex and age whilst performing market research. The BNC’s spoken portion was ‘balanced’ using this method, with a number of bins being allocated by context and speaker information. The COCA corpus was also constructed using this method using data from the web [36, p. 163]:

Using VB.NET (a programming interface and language), we created a script that would check our database to see what sources to query (a particular magazine, academic journal, newspaper, TV transcript, etc.) and how many words we needed from that source for a given year.
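To make the mechanics concrete, the following sketch (in Python, with invented documents, strata and quota sizes; it is not the pipeline described in the quotation above) accepts documents from an available stream only while their stratum remains under its target count:

```python
# Illustrative quota sampling: fill per-stratum bins from whatever documents
# happen to be available, stopping once every quota is met. Strata and quota
# sizes here are hypothetical.
import random

def quota_sample(candidates, quotas, stratum_of):
    """candidates: iterable of documents; quotas: dict stratum -> target size;
    stratum_of: function mapping a document to its stratum label."""
    bins = {s: [] for s in quotas}
    for doc in candidates:
        stratum = stratum_of(doc)
        if stratum in bins and len(bins[stratum]) < quotas[stratum]:
            bins[stratum].append(doc)
        if all(len(b) >= quotas[s] for s, b in bins.items()):
            break  # every bin is full
    return bins

# Hypothetical usage: quotas by source type for one sampling year.
docs = ({"id": i, "source": random.choice(["magazine", "newspaper", "academic"])}
        for i in range(100_000))
sample = quota_sample(docs, {"magazine": 50, "newspaper": 50, "academic": 25},
                      stratum_of=lambda d: d["source"])
print({s: len(b) for s, b in sample.items()})
```

Note that nothing in this procedure is random in the statistical sense: selection within each bin is determined entirely by the order in which documents happen to arrive, which is precisely the criticism levelled at nonprobability designs below.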

Summary of Nonprobability Methods

Corpus sampling methods described above already closely resemble quota sampling, using a lot of expert opinion to define the quotas. Nonprobability sampling, however, is particularly ill suited to statistical analysis and inference, which relies on random selection over uncontrolled variables. Simply, nonprobability sampling techniques are very easy to bias [12, p.19]:

. . . there is no yardstick against which to measure ‘representativeness’ or to assess the propriety or accuracy of estimators based on such a sampling principle.

This opacity leads to limitations in corpus analysis, where the goal is to use large volumes of data as objectively as possible. For any quantitative analysis to be scientifically defensible on empirical (rather than rational) grounds, probability sampling is required.

2.2.2 Probability Sampling

There are many probability-based designs available, and the choice of them is largely dependent on the methods of inquiry that are to be applied to the resulting data. In all cases, the goal is to allow the dependent variables to vary randomly, ultimately allowing for statistical inference if sample sizes are sufficient. Random selection of data points provides both a mechanism for unbiased estimation of parameters such as means, as well as knowledge of variation. The latter of these is the key difference between a probability-based sample and a well-chosen nonprobability one, and is vital to basic hypothesis testing procedures. Further, notions of sample size are entirely based on these measures: power analysis demands some knowledge of variance and effect size, both of which are only meaningful under conditions of random selection.

Random sampling methods identified by Lohr, Barnett, and Teddlie[118, 12, 174] include: simple random sampling, stratified sampling, and multi-stage sampling.

Simple Random Sampling

This approach selects members of a population entirely at random. Each member of the population has a probability of selection that is simply sample size / population size (n/N). SRS is free of any errors that stem from classification, reducing the importance of potentially circular genre definitions and taxonomies. The main disadvantage is that it requires a complete sampling frame: that is, all individuals in the population must be known in order to be randomly selected from. Though corpora often contain randomly sampled portions, for example, where data is enumerated by a publisher’s list or online directory, the lack of a central authoritative index that covers the whole population is usually an impediment to retrieval of a truly representative sample. Essentially, simple random sampling is unsuitable for larger samples in linguistics due to this limitation. Once a corpus has been constructed, it is often possible to use SRS in subsampling approaches. Its unbiased properties also make it useful in bootstrapping, allowing methods to use the full distributional information contained within the corpus[64]. These techniques serve to avoid introduction of further bias, but don’t sidestep any representativeness issues with the original sample.
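As a concrete illustration, the sketch below (Python standard library only; the document identifiers are invented) subsamples an already-enumerated corpus by SRS, where every document shares the inclusion probability n/N:

```python
# Simple random sampling from an enumerated frame: every document has the
# same inclusion probability n / N. The frame here is a hypothetical list of
# document identifiers standing in for a corpus index.
import random

frame = [f"doc_{i:05d}" for i in range(50_000)]   # complete sampling frame
n = 500
subsample = random.sample(frame, n)               # SRS without replacement

print(len(subsample), n / len(frame))             # 500 documents, p = 0.01
```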

Stratified Sampling

Stratification is the process of breaking a random sample into a set of bins, with sizes weighted according to some policy. In the case that the strata are selected in an unbiased manner, this should yield the same sample as SRS. This method can be used to ensure that subpopulations are sampled comprehensively by overestimating their proportions, or by using a different sampling method for that stratum (for example, selecting more people from areas with low population density). Such practices damage representativeness but may be useful for qualitative analysis[175]. Stratum selection should be along real-world subpopulation boundaries, and is usually selected in order to maximise either representativeness (by ensuring that strata are sized according to the population) or statistical power (by ensuring that strata are sized according to the amount of variance in the strata). Stratification is most efficient when strata are homogeneous and well separated. Stratification requires the ability to exhaustively separate the population into mutually exclusive categories, and sample from these categories in a random manner. Both of these are a challenge in corpus linguistics, as removing the first level of ‘no indexing’ often leads to another. Nonetheless, where information silos exist (such as in academic publishing), this approach is particularly well suited[150].

Where auxiliary data exists, statistical benchmarking may be performed by adjusting sample weights according to the proportions seen in this data. This may be applied after sampling itself (otherwise, said auxiliary data is simply used to determine the initial strata sizes), and so is often referred to as ‘post-stratification’. This approach is applicable where ordinary stratification is impossible, for example, where the variables on which to stratify are unknowable at the time of sampling.
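A minimal sketch of post-stratification weighting is given below, assuming hypothetical strata and auxiliary proportions; each document in a stratum is reweighted so that the weighted sample matches the auxiliary data:

```python
# Post-stratification: adjust weights on an already-drawn sample so that its
# stratum proportions match auxiliary population figures (invented here).
from collections import Counter

sample_strata = ["fiction"] * 300 + ["news"] * 600 + ["academic"] * 100
population_props = {"fiction": 0.5, "news": 0.3, "academic": 0.2}  # auxiliary data

counts = Counter(sample_strata)
n = len(sample_strata)
weights = {s: population_props[s] * n / counts[s] for s in counts}

# Every document in stratum s now contributes weights[s] to weighted estimates;
# here fiction is up-weighted and news down-weighted.
print(weights)
```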

Multi-stage Sampling

Multi-stage sampling involves multiple rounds of random sampling with progressively diminishing sampling units. Multi-stage sampling is particularly applicable in cases where all of the data points in the population are accessible through a hierarchical structure. This is clearly the case for smaller linguistic features, but much less so for whole documents (or large extracts). This approach often significantly speeds up the process of enumerating and selecting from a population, and reduces the need to classify things compared to stratified approaches. The increased structure may also be of use to some models, allowing for multi-level modelling approaches during analysis. Because the method relies on excluding large areas at once, it is less representative than SRS for the same sample size, and this may complicate analysis. Cluster sampling, a form of multi-stage sampling that relies on existing organisation of data, has particular applicability to web sampling, where documents are often stored in academic repositories or simply under the same domain, or to sampling from different publishing houses. Evert provides a library metaphor for sampling in which he explains random selection of individual words using a multi-stage approach[46], selecting progressively smaller units at random before returning a single word and repeating the process. Though he focuses on randomness, it would be possible to use this approach with stratification at each level—this mechanism is well suited to situations where auxiliary data may be available, but not at the level of the desired sampling unit.
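The library metaphor translates almost directly into code. The sketch below (Python, with a toy nested structure in place of a real shelf/book/page hierarchy) draws a single word by selecting a progressively smaller unit at random at each stage:

```python
# Multi-stage random selection in the spirit of the library metaphor: choose
# a shelf, then a book, then a page, then a word, and repeat. The 'library'
# is a toy stand-in for a real hierarchy of sampling units.
import random

library = {
    "shelf_A": {"book_1": ["some page of text", "another page here"],
                "book_2": ["a third page", "a fourth page"]},
    "shelf_B": {"book_3": ["yet more text", "the final page"]},
}

def draw_word(library):
    shelf = random.choice(list(library))          # stage 1: primary unit
    book = random.choice(list(library[shelf]))    # stage 2: secondary unit
    page = random.choice(library[shelf][book])    # stage 3: tertiary unit
    return random.choice(page.split())            # stage 4: the word itself

print([draw_word(library) for _ in range(10)])
```

Stratification could be introduced at any stage simply by replacing the uniform random.choice with a weighted draw over known proportions.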

Probability Sampling Summary

Simple random sampling is often seen as the ideal probability sampling design, in that it yields high quality results using statistical methods, without the need for weighting and adjustment of results. It is also by far the hardest method to apply to text due to its incompatibility with the process of accessing the data. Stratified sampling is arguably simpler, in that it allows for testing and definition of strata prior to retrieval of texts, and stratum sizes may be computed from existing corpora in a multi-stage design. Another benefit of stratified approaches is the ability to examine and adjust the distribution of strata according to expert opinion, offering a hybrid design that is able to compensate for known practical issues. This is often used to artificially boost the stratum sizes of minor groups within the population in order to ensure that they are over-represented in the final sample — something that may be desirable if their influence is particularly important to a research question.

2.2.3 Sample Size Estimation

Quantitative sample size estimation is largely based on ensuring that a given hypothesis test can detect an effect of interest. This varies according to a number of parameters, including the type of test used. Any hypothesis test is defined by four parameters[44]:

Probability of Type I Error (α) • The probability of rejecting the null hypothesis when no effect exists in the population. Forms the threshold for the oft-quoted ‘p-value’.

Power (1 − β) • The probability of rejecting the null hypothesis when an effect exists in the population.

Effect Size (d) • The observed change in a parameter necessary to identify an effect with probability 1 − β.

Sample Size (n) • The number of individual units in the sample that are free to vary.

Sample size must be estimated with knowledge of the power of the test to be done, and of the population distribution. A simple binomial test may have its sample size estimated thus:

\[ n = \frac{Z^2\,p(1-p)}{d^2} \]

where Z is selected according to the area under the population distribution corresponding to α, and p is the parameter value within the population (often estimated as p̂). This implies that a reduction in any effect size we wish to detect increases the required sample size (and vice versa), as do increases in the confidence at which we seek to reject H0. This general trend applies to all sample types above. In addition to sample size, the power of a test is also affected by the method used to perform the hypothesis test. Tests which are excessively conservative further increase the probability of missing an effect, lowering power, as does violation of assumptions upon which parameter estimates are based. Current methods for computing sample size are reliant on assumptions of independence, often phrased such that members of the sample must be ‘independent and identically distributed’ (IID). This assumption is violated massively by conventional corpus construction: corpora are sampled at an extract or document level, yet often analysed at a word or phrase level. In the absence of IID samples, sample size calculations are reduced to what Cohen [34, p. 145] termed ‘non-rational bases’ such as past experience or rules-of-thumb. The process of building a corpus that is IID for a given research question is so theory-laden as to be inapplicable to those building general-purpose corpora using the current model (where many users share a large corpus for diverse tasks).
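For reference, the binomial formula above is trivial to evaluate in code; the sketch below computes it for a two-sided test, using illustrative values of p̂ and d rather than figures drawn from any real corpus:

```python
# Binomial sample-size estimate, n = Z^2 * p(1 - p) / d^2, with Z taken as
# the two-sided critical value for a given alpha. Inputs are illustrative.
from math import ceil
from statistics import NormalDist

def binomial_sample_size(p_hat, d, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value of Z
    return ceil(z ** 2 * p_hat * (1 - p_hat) / d ** 2)

# Estimating a feature occurring at roughly 2% of tokens to within +/- 0.5%:
print(binomial_sample_size(p_hat=0.02, d=0.005))   # about 3,012 tokens
```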

The reliance of corpus linguistics (and to some degree NLP) on qualitative methods and rational definitions of parameters such as population and sampling frame make application of existing power analysis to corpus construction challenging. An alternative method, suited to qualitative and mixed-methods approaches (and thus less reliant on the randomness assumptions violated by document-level sampling) is presented by Glaser[58], who suggests that those performing experiments should be guided by ‘saturation’: the point at which no (or negligible) new information is presented by adding new data points to the sample. This mirrors the methods of Good-Turing frequency estimation[61], as used by Biber in his 1993 assessment of representativeness[22]. Saturation-based measures offer a less-formal mechanism for assessing the adequacy of sample sizes, however, they still require definition of a study design prior to assessment of sample size sufficiency. Such methods are also particularly lacking when sampling is nonrandom, as they rely on the discovery rate of new information being unbiased—if retrieval of documents is systematic the rate of new information will plateau, leading to premature conclusions of coverage. Again, without the ability to enumerate the population, it is difficult to insure against such bias.
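Saturation itself is straightforward to monitor mechanically. The sketch below tracks how many previously unseen word types each added document contributes and stops when the discovery rate falls below a threshold; the whitespace tokenisation and the threshold value are placeholder choices, and the caveat above about systematic retrieval applies with full force:

```python
# Saturation check: stop adding documents once the proportion of new word
# types per document drops below a threshold. Tokenisation and threshold are
# deliberately simplistic placeholders.
def saturation_point(documents, threshold=0.01):
    seen = set()
    for i, doc in enumerate(documents, start=1):
        tokens = doc.lower().split()
        new_types = {t for t in tokens if t not in seen}
        seen.update(tokens)
        if tokens and len(new_types) / len(tokens) < threshold:
            return i, len(seen)          # documents consumed, vocabulary size
    return None, len(seen)               # never saturated on this material

docs = ["the cat sat on the mat", "the dog sat on the cat", "the cat sat"]
print(saturation_point(docs, threshold=0.2))
```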

2.2.4 Sample Design

The design of a sample can be broken into a number of stages, each of which informs the next. As we have seen, many of these require theoretical justification or analysis based on the final study design. This forms the framework used elsewhere in the thesis as an ideal case.

Research Question Definition

The definition of a research question is an important first stage to selecting a sample. This is in order to define any theoretical assumptions and the analysis design, both of which have significant impact on the size and form of any sample.

Sample Unit Specification

The unit of analysis should be in agreement with the aims of the research question above, and ideally should match the sampling unit. Where disagreement occurs, this is likely to abrogate the quality of any analysis based on IID assumptions, including power analysis for sample size estimation below.

Variable Selection

Dependent and independent variables should be specified unambiguously, and any relationships between the two should be examined from past experience and existing literature, in order to reduce unanticipated correlations causing bias. Where the research question and intended analysis are already rigorously designed, this task should be fairly transparent to any circularity, however, this becomes more challenging as qualitative elements are included.

Population Definition

The bounds of the sampling frame must be defined in terms of the external variables to be sampled, such that any retrieval efforts can unambiguously use these. A population definition is, in effect, definition of an independent variable which is controlled prior to sampling, and as such its relationship to dependent variables and to the original research question is similar to that of any other variable.

Sample Policy

The manner of sample must be decided based on practical issues, and on the various properties specified above.

Sample Size

Depending on the design and analysis method proposed, this may take the form of a quantitative power analysis, or a plan to implement qualitative controls during construction of the sample. Some a-priori objective criteria should be defined, particularly when a sample is based on qualitative assessment, in order to avoid data dredging (or ‘dredging’ of variables correlated with those to be studied).

The existence of previous studies, literature, and datasets may guide all of the above stages, and the progression from expert-opinion-based sampling to a more quantitative understanding of population variance is made possible by iterating the above: something that happens naturally as a field progresses. It is notable that traditional general-purpose corpus construction efforts are largely unable to specify research questions ahead of time as they are necessarily retrieving data prior to the study even being conceived. It is also the case that most studies include a significant qualitative component, in interpreting the output of quantitative models or in the form of direct reading[146]. It seems likely that such properties pose the biggest challenge to the classic question “how large should a corpus be?”, which requires more rigorous specification of these.

2.2.5 Sources of Error in Corpus Construction

The validity of a sample is based not only on its representativeness for a given question, but also how well that question may be related to reality, and what limitations are imposed by the process of retrieving data. Corpus builders are not blind to these challenges—indeed, most literature on corpus construction devotes a large portion of its content to practical issues[185,7, 166]. The requirements discussed above are satisfiable in a number of ways, each of which will exclude and promote certain uses of the result. Generally, threats to the validity of inquiry based upon these samples may be broken down into three areas:

External Validity • How relevant the sample is to the population about which we wish to infer something.

These are likely to limit the generality of a study, or lead to under/overestimation of its effects.

Internal Validity • How much a study can rely on document annotations and data in order to draw conclusions. Issues here are likely to cause false results.

Practical Issues • Limitations on the mechanics of sampling. These issues may cause either or both of the above.

This section identifies potential issues with corpus sampling methods. The majority of these are practical issues for which fixes must be designed carefully and on a case-by-case basis.

Distributional Issues

This category largely describes the manner in which certain important covariates are selected for sampling, or sampled. Language contains an unknown (but undoubtedly very large) number of possible dimensions to study and compare, and selection of features to describe variation across texts is far from simple. Such covariates may be listed as external metadata (author age, genre) or for linguistic features (prepositions, wh-relative passives). Atkins et al.[7] provide a series of recommendations for covariate selection, which has been used to guide this discussion. The population for whom a corpus describes language use is relatively difficult to describe, as it is largely a self-referential problem. Seemingly, one reason why corpora are not demographically representative is because their selection processes rely on lists that are compiled using data for which it is difficult to determine comparable bounds. The BNC’s policy for selecting written materials, for example, is segmented into the following sources:

• Books, selected from bestsellers, literary prizes, library loans, additional texts;
• Periodicals and magazines (including newspapers);
• Other media (essays, speeches, letters, internal memoranda).

Of the lists used in the books category, they specify the following criteria [30, p. 9]:

Each text randomly chosen was accepted only if it fulfilled certain criteria: it had to be published by a British publisher, contain sufficient pages of text to make its incorporation worthwhile, consist mainly of written text, fall within the designated time limits, and cost less than a set price.

It is often difficult, using other sources of data, to establish the details of an author’s nationality, the age of a text, or the suitability of the time limits[40]. These issues are typically better addressed by more specialist corpora, which are subject to easier-to-determine bounds such as social role, context or text type[104, 142, 107].

Ambiguity in population definition has a number of direct influences on other issues of corpus validity. Selection of stratum sizes, for example, must be based on the relative proportions of language used by the given population. If a population is defined using purely internal (linguistic) properties, this becomes a reflexive and circular task—we end up selecting proportions of language to match proportions seen in language. Use of auxiliary data to augment sampling policies (such as social demographics taken from other, large-scale surveys) must also be matched to the population in question. Where selection of genres is defined by linguistic content, it is necessary to ensure that the frequencies used do not have systematic correlations with features being studied. This is impossible to automate, and must be assumed based on experience and theoretical reasoning. Speech corpora are often specified in a significantly less ambiguous manner (though not always with more proportional sampling). This seems to be due to the direct nature of speech sampling, which involves the person actually performing an utterance (rather than the language use itself being a persistent and concrete artefact). This difference is identified by Leech[115], and will be inspected in greater detail below, as it is a potential major source of disagreement between construction and use. One source of ambiguity when defining a population is inherent ‘measurement error’ in the classification scheme used to delineate the texts. Efforts such as EAGLES[166] identified this issue as a significant one, and much effort has since been spent on minimising the problem of classifying texts. Nonetheless, the difficulty in identifying a widely-agreed-upon classification scheme means that this remains an important design decision. This manifests in two ways. The first is the problem of selecting texts according to external variables about which we may wish to infer findings, such as level of education of the author, nationality, target audience of text, etc. Many of these factors are imposed by existing structures and systems of classification which were developed for some other use (e.g. search engines, library lists). The second is selection and stratification according to externally imposed but textual features such as genre, text type, medium, etc. These two issues are likely to overlap somewhat: those distinctions useful for one research question may prove ambiguous for another. A secondary, yet related, issue is that of metadata completeness: often, texts are sourced from very different places, and come with differing levels of metadata. Normalising and homogenising these metadata is particularly challenging, and may include a loss of resolution. The temporal aspects of texts are a special case of this. Diachronic corpora seek to represent a ‘slice’ through time of language use, and this requires a representative distribution of text age across the stated population. Much of this distribution is defined by the production/consumption balance chosen for the corpus: for a given population, how old are texts used on a day-to-day basis? When sampling online, or sampling from historical texts, merely identifying the age of a text is also challenging. These questions are also important when analysing historical or monitor corpora, where they must be answered in order to allow accurate sub-sampling. Corpora such as EEBO[26] include such wide temporal coverage that their use often requires identification of a historical context.

Practical Issues

Copyright is one major problem with sampling large volumes of text from any source, especially those already available for-profit from large publishers, who are acutely aware that their entire business model is based on controlling access to their intellectual property. This is stated in the original documentation for the Brown corpus as one reason for their choice of sampling unit, and often causes ‘black spots’ in the sampling frame of a corpus, where certain publishers are known for their unwillingness to offer material[126]. As a large number of documents are designed as ephemera, these are often not to be found in archives. As a result, out-of-date ephemera are difficult to obtain retrospectively. Many digital ephemera are targeted at individuals, introducing ethical issues analogous to covert research. For speech data, and written data that is not intended for public dissemination, privacy must be considered as a legal and ethical issue. Any auxiliary data is unlikely to suggest proportions for documents used within a private context (e.g. notes left on a fridge), or those controlled by formal social structures (such as documents used in many offices). Some topics and contexts that ought to be represented are difficult to sample due to the cultural taboo surrounding them. This is especially relevant for speech corpora, since speech is often used for informal transactions, and because it is difficult to separate the sampling procedure from the speaker himself. These issues are particularly pertinent to verbatim recordings in natural settings, such as those used in many spoken corpora. Such recordings, even if taken in public places, may be subject to laws such as the Human Rights Act 1998 which, through guarantees of privacy, may restrict the release of such data to those not present at the time. Such ethical pressures pose a challenge to unbiased sampling, and one that cannot easily be worked around. The issues surrounding covert research are discussed in more detail in Chapter4, Section 4.6.6. Finally, though this is increasingly simple as the world moves towards digital storage, transcription and physical acquisition of data is often slow, difficult, and (thus) expensive. Paper documents must be scanned, digitised, manually corrected and converted into a normal form. Speech must be gathered in an ethically defensible manner, and metadata about the speakers must be gathered. Electronic documents require significant format conversion. These problems are easing: digitisation technology, especially that used for paper documents, has recently progressed significantly due to efforts to preserve historical documents, and libraries’ attempts to digitise older publications. In some cases, scanners are capable of scanning entire books without intervention. Such advances are not without their own challenges, and significant further work is needed to reduce error rates in line with human transcription[89,2, 119]. CHAPTER 2. BACKGROUND 32

Stratification

Though a problem for categorisation and taxonomy selection, the ‘fuzzy edges’ of genre, medium, popularity and other covariates often leads to ambiguity surrounding which category a text should belong to. Some classification schemes apply multiple labels to a text in order to avoid this[161]. This approach may provide a way of producing a more accurate overall classification, but complicates many analyses, particularly quantitative ones relying on regression. Biber and others[115, 22] recommend that sampling should occur in an iterative process, with the contents of a corpus being used as evidence to weight selection of strata from the next version. As Varadi[179, 180] notes, this simply doesn’t happen. Reasons for this may include the shift in classification priorities, and the time required to re-code and align a previous corpus’ annotation structure to that which is to be built. Auxiliary data from social surveys may be used as a rough indicator for this, though in reality no data source exists that can describe, in social terms, the ‘types’ of language used (w.r.t. the primary dimensions of sampling for corpora). Questionnaires, ethnography, and/or direct sampling may be the only ways to establish a ground truth for this, but for now it remains an open research question.

Randomness

The pragmatic issues surrounding corpora do not apply equally to all genres, media, social demographics, or settings. This results in the need to apply large qualitative corrections to the sample design, which restricts the ability to use random sample designs. A prime example of this is the proportions of written and spoken texts in many corpora, and even the proportion of elicited spoken vs. ‘natural’ spoken texts, due to the difficulty in obtaining consent. Another example may be seen in the BNC’s proportion of academic or newspaper texts, both of which comprise a very large proportion of the corpus, yet are read by a relatively small proportion of the population. In many cases, such as web crawling, nonrandom sampling strategies are used due to the lack of an authoritative (or reliable) central index. In others, such strategies may be the result of other practical issues, such as the location of researchers or legal concerns surrounding certain contexts. It’s possible to augment and correct for a lot of the problems that are introduced through nonrandom sampling, for example, Schafer and Bildhauer do this in their web-scraping corpus building tools, which attempt to stratify their samples by top-level-domain after selection[154]. Nonrandom sampling is an unavoidable truth of corpus building in almost all contexts, and with care should not unduly influence the result of a corpus building effort: after all, most fields face similar practical issues. Sample size calculations are a complex topic, particularly where the variance for a given population is difficult to determine. Due to limitations in power analysis methods for mixed- methods designs, many of the sample size judgements required during corpus construction will CHAPTER 2. BACKGROUND 33 necessarily include some expert opinion. LNRE models offer possibly the most advanced generative method for this[8, 47], and can be seen as a progression upon Biber’s variation analyses of the early nineties. The issues with using such models to determine corpus variability sit neatly in two categories:

1. Selection of interesting variables to measure sufficiency (this, in reality, would be part of the corpus documentation: we can say it is sufficiently representative for frequency counts, grammatical analyses up to n tokens, etc. with little risk)

2. Decisions on how much data is enough data. This is fairly easy to approximate with a sufficiently accurate language model, and even simple binomial approximations prove useful in judging the likely frequency of features (see the sketch after this list).
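As an example of the kind of binomial approximation meant in point 2, the sketch below computes the probability that a corpus of N tokens contains at least k occurrences of a feature with an assumed per-token rate p (all figures are invented):

```python
# Probability that a corpus of N tokens contains at least k occurrences of a
# feature with per-token rate p, under a naive binomial model of occurrence.
from math import comb

def prob_at_least(N, p, k):
    p_fewer = sum(comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(k))
    return 1 - p_fewer

# Chance of seeing a one-per-million feature at least 5 times in 10M tokens:
print(round(prob_at_least(10_000_000, 1e-6, 5), 3))   # roughly 0.97
```

Such a figure is of course only as good as the independence assumption behind it, which is precisely the assumption questioned in the discussion of sampling and analysis units below.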

In spite of the practicality of the techniques surrounding measures of internal feature saturation, few currently do such analyses. This is perhaps because the linguistic researcher himself must usually perform the analyses in terms of his own study criteria, rather than the corpus builder. The problem of quantifying variation was tackled by Gries[64], who presents a method for estimating confidence intervals for small-scale features, and uses these to explore the homogeneity of corpora and their internal distinctions. Therein he focuses not on simple frequency variation, but on grammatical properties that, he claims, are more representative of the sort of inquiry made using corpora. He starts by using the Z-score centrality measure, noting the limitation of existing partitioning schemes [64, p.123]:

In order to measure the homogeneity of a corpus with respect to some parameter, it is of course necessary to have divided the corpus into parts whose similarity to each other is assessed.

He then goes on to augment this method by way of bootstrapping, in order to identify regions of maximum homogeneity within a corpus in a ground-up fashion. Such permutation testing offers a data-driven mechanism for identifying genres, at the potential cost of interpretability.
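The flavour of such resampling is easy to convey, though the sketch below is a generic permutation test over per-document feature rates rather than Gries’s exact procedure; the parts, counts and number of permutations are all illustrative:

```python
# Permutation test of corpus homogeneity: is the spread of a feature's
# relative frequency across predefined corpus parts larger than expected if
# documents were assigned to parts at random? Not Gries's exact method.
import random
from statistics import pstdev

def spread(parts):
    return pstdev(hits / max(tokens, 1) for hits, tokens in parts)

def split_into_parts(doc_stats, part_sizes):
    parts, i = [], 0
    for size in part_sizes:
        chunk = doc_stats[i:i + size]
        parts.append((sum(h for h, _ in chunk), sum(t for _, t in chunk)))
        i += size
    return parts

def homogeneity_pvalue(doc_stats, part_sizes, n_perm=2000):
    """doc_stats: list of (feature_hits, token_count) per document."""
    observed = spread(split_into_parts(doc_stats, part_sizes))
    pool, extreme = list(doc_stats), 0
    for _ in range(n_perm):
        random.shuffle(pool)
        if spread(split_into_parts(pool, part_sizes)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)   # small p suggests heterogeneous parts
```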

Sampling / Analysis Unit Mismatch

Any sample is, strictly speaking, best analysed in terms of its sampling unit. Any disagreement leads to a reduction in how free each data point within the sample is to vary[94, 117, 136]. Biber’s initial assessments of language variation in 1988[20] examined the suitability of the 2,000-word sampling unit by splitting each sample and comparing the relative frequency of features in each half. He determined that, if they were the same, then the sampling unit size was sufficient to represent a given feature. This led to the conclusion that corpora were adequate for inspection of ‘small-scale grammatical features’, something that seems to be borne out by the success of computational models in this area (POS tagging, etc.). This issue is often phrased in terms of dispersion[94, 88], i.e. the tendency for a feature to be represented evenly in all documents throughout the corpus. Dispersion is something that manual inspection of concordances and corpus data renders particularly transparent, as it is often possible to see that all instances of a particular idiom are traceable to a given author, or all coverage of a certain topic is from one publication. This is a prime example of the unit of analysis being far smaller than the sampling unit, a problem that is receiving increasing recognition. Evert, with his library metaphor[46], describes a sampling policy for corpus linguistics that avoids this disparity. In it, he describes randomly sampling progressively smaller units in a virtual library, moving from books through pages to sentences. Since many statistical problems arise from the disparity between sampling and analysis units, a particularly problematic instance of this issue is the use of bag-of-words (BOW) models. These model language without taking into account order, and as such would be best applied to a corpus sampled at the word level: any section of text beyond this is going to exhibit ‘clumpiness’ effects that are beyond the comprehension of the model. The prevalence of BOW models is such that many people choose to phrase their objections in terms of the accuracy of binomial models of language frequency[94, 45, 47]. Undoubtedly they have a point—more complex LNRE models are capable of far more useful inferences, however, it’s worth noting that only a model that can integrate the linguistic influence of word 1 upon word 2,000 in a sample will truly justify a 2,000-word sample4.

One part of the problem is caused by a fixation on word frequencies. A given corpus, 100,000 words in length, may comprise just 50 texts. For the purposes of many analyses involving person-to-person variation, it can be said that we really only have 50 data points. Though care is often taken to sample these texts broadly, the idea of targeting a corpus in terms of its word length, rather than the number of samples, leads to extremely poor suitability for analysis of many features. The problem of quite what sample size to choose is a trade-off: if we were to sample single sentences for our 100,000 word corpus, we would soon require thousands of samples, each from a randomly selected and carefully-stratified source. If we wished to perform a complex analysis of narrative structure within those sentences, we’d find there is insufficient data. Except in certain cases, there is a plateau of difficulty for sample size: sampling 2,000 words from a book incurs very similar levels of practical obstacle to sampling 100.
Sampling units should be chosen in accordance with the complexity required by researchers of the time—for example, those working in information retrieval and NLP fields will demand relatively large sample units by comparison to many researchers in linguistics. Corpora should, ideally, be defined in terms of the number of datapoints, rather than words. This is one area where compatibility with existing corpora and techniques is arguably damaging to the final result, and where documentation should be clearer in guiding valid use.

2.2.6 Validity Concerns in Corpus Analysis

One of the primary reasons for using quantitative methods in research is their objectivity and empirical basis. This is especially the case in linguistics, which seeks to generalise about a social property that is difficult to quantify or relate to other users. In addition to the above procedural concerns, manual inspection and summarisation of corpora (or collections of features extracted from corpora) often leads to situations where errors of human judgement may imply certain findings. These cognitive biases are widely recognised in many cases, and have been identified in other fields as common causes of error[84, 116]. Many of these biases are fairly minor, and scientific methods are designed to counteract their effect. Nonetheless, qualitative analysis, poor corpus design, and presentation of certain features once extracted for inspection, can introduce their effects.

4This is a quantitative form of Hoey’s argument that whole texts are necessary to truly describe human expectations of word use[75].

Insensitivity to Sample Size • People are liable to underestimate variation in small samples. Manual inspection of particularly intricate features extracted due to their grammatical form, especially from subsamples of the corpus (such as the spoken part), is likely to qualitatively imply false results[143].

Clustering Illusion • A tendency to find patterns where there are none (apophenia). This is more of an effect when viewing larger data sets, such as when inspecting frequency lists or concordances.

Prosecutor’s Fallacy • Assuming conditional probabilities indicate unconditional ones, and vice-versa. This might be less prevalent, but when conditioning on a certain feature, subsample of the corpus, etc. it is common to assume that the trends identified are indicative (or anti-indicative) of similar trends in the rest of the corpus.

Texas Sharpshooter Fallacy (post-hoc theorising) • Testing hypotheses on the data they were derived from, i.e. inspection then derivation then testing. Also called ‘data dredging’, this is a risk of large-scale community reuse of corpora. A solution to this is to reuse sample designs, but not the actual data itself.

Availability Heuristic • The tendency for people to mentally overestimate the probability of features which they can immediately recall examples for[176, 154]. This effect is likely to occur when examining corpora qualitatively, particularly if features are inspected in response to searches on particular features.

Many of these concerns are difficult to address without fully quantifying or automating analysis stages that are currently performed manually, something that is often a technical challenge. Though reduction of their incidence is the goal of a corpus builder, there is often little that can be done directly prior to analysis. The quantitative stages of corpus analysis are also not free from statistical challenge. Many of these issues surround the use of models making flawed assumptions about variability[94], something that is a direct result of disagreements between sampling unit and analysis unit. One approach to this is advocated by Wallis[181], who proposes using a baseline linguistic feature as a control. This serves to normalise the frequency against which one is comparing during statistical tests, but remains subject to issues of low power due to use of samples that contain too few independent sources. Use of better-informed linguistic models in order to take account of non-orthogonal variation in linguistic features is also possible. One method of doing this is to adjust for (or at least recognise qualitatively) the dispersion of features across samples[65, 66]. Research is needed in order to best use these measures for qualitative interpretation (dispersion is already displayed in some popular tools[4, 147]) and to adjust existing statistical procedures.
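One simple dispersion measure of the kind displayed by such tools is sketched below, in the spirit of Gries’s DP statistic (though not necessarily the exact formulation in the works cited): half the summed absolute difference between each part’s share of the corpus and its share of the feature’s occurrences. The counts are invented.

```python
# A DP-style dispersion score: 0 means the feature is spread across parts in
# proportion to their sizes, values towards 1 mean it is concentrated in few
# parts. Part sizes and feature counts below are invented.
def dispersion_dp(part_sizes, feature_counts):
    total_tokens = sum(part_sizes)
    total_hits = sum(feature_counts)
    expected = [size / total_tokens for size in part_sizes]
    observed = [hits / total_hits if total_hits else 0.0 for hits in feature_counts]
    return 0.5 * sum(abs(e - o) for e, o in zip(expected, observed))

# A feature found only in one of four equally sized parts scores 0.75:
print(dispersion_dp([1000, 1000, 1000, 1000], [40, 0, 0, 0]))
```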

2.3 New Technologies

2.3.1 Web Corpora

The rise in popularity of the web introduced a significant new source of text for linguists. Whilst computerised text has been included in many conventional corpora (in the form of email and other direct communications), the ubiquity of the web offers a source of digital-format documents that now cover many subjects and genres. Due to widespread and diverse use of the web, such corpora span many populations—some efforts focus on describing the web itself, and representing the population of documents online as its own population. Others are able to focus on more general representation, or offer metadata sufficient to retrieve special-purpose corpora. It is certainly the case that any modern corpus claiming to be representative of production or reception should include web data. The first efforts to construct corpora from web data[96] focused on the production of tools and resources to download, clean, and index web data at large enough scales to be used in general purpose corpora. Many of these problems have now been solved to some degree, with the WaC movement producing a number of web-specific corpus tools:

• Crawlers: WaC-specific crawlers have been produced that are able to control their behaviour according to linguistic properties[154], and with the intent to provide an overview of proportions online.

• Boilerplate removal tools: Tools such as JusTexT[139] and BoilerPipe[103] are able to remove the ‘boilerplate’ of websites in order to leave just the content areas.

• Genre classification schemes: Taxonomies have been produced that include the new genres found online, or those new forms of text such as blogs. Some of this work has come from search engine designers, eager to improve their results[131], as well as from linguistics proper[160, 153, 161].

• Retrieval tools: Tools such as BootCaT[14] and search engines such as WebCorp[148] allow access to increasingly large-scale web resources.

The ease with which users may access web data is of particular value: this has allowed for corpora of unprecedented scale such as WaCky[15] and COW[155] (which contain tens of billions of words), as well as for near-instant corpus building and replication.

Many WaC efforts focus on sampling the web itself, in order to represent users’ experience thereof. This approach leads to sample designs that focus on uniform coverage of top-level domains, or particular types of web resource, in a manner similar to that of search engine crawlers. Performing this task in a robust manner is particularly challenging due to the lack of any central index for websites at this level. As such, it is common to oversample and then select documents after the fact using scoring techniques[156, 154]. This remains challenging due to the difficulty of obtaining consistent metadata for the web.

A contrasting approach is to use existing indices to retrieve data, usually by using commercial search engines. This is the approach taken by the BootCaT tool, which relies on repeatedly requesting from the Microsoft Bing search engine URLs that match combinations of seed terms. This could be seen either as an attempt to construct a corpus in the image of the seed (hence the ‘bootstrapping’ in the name), or, under conditions of sufficiently general input, as one to represent the web itself, using the search engine as a ‘lens’ through which to view web documents. This is a distinction drawn by the source of variation: the seed terms themselves, or the web’s ability to return the desired data. The primary purpose of this was to find a ready source of documents, and as such conventional sample designs were still used as gold standards. The performance of such methods was examined by Kilgarriff and Grefenstette[97, p. 343], who state, regarding representativeness:

The Web is not representative of anything else. But neither are other corpora, in any well-understood sense.

This is something I will revisit in Chapters 3 and 6, as WaC methods are examined in more detail.

2.3.2 Intellectual Property

The legality of releasing large volumes of text has always been an issue for corpus builders; however, much of the time this issue has been handled by an intermediary: the publisher, able to provide rights to many documents at once. WaC methods access content by many publishers, often where rights are poorly stated or belong to owners who cannot be contacted (especially in bulk). This issue is complicated further by the international nature of the web, and the style of hyperlinking and content embedding[72]. One approach to this is to limit access to end users, providing an interface which still permits some level of inquiry whilst restricting large-scale access to the whole volume of data. This is the approach taken by BE06[9], which is available through CQPWeb[71] even though it contains data that is nominally under copyright to other parties.

A second approach is to provide lists of URLs from which the original data may be retrieved. This is less legally troublesome than the approach taken by BE06, yet also requires a significant amount of work on behalf of any replication. Finally, corpora such as the Books corpus[60], COCA[37] and the COW corpora[155] are offered in a jumbled form, designed to retain frequency information without the ability to recover full texts. The corpora mentioned here are particularly large, implying that they are unlikely to be used by researchers reading the text directly due to the large number of results for even very specific queries. I consider that these approaches miss the point somewhat, as the ease of retrieval WaC brings makes it possible to perform full scientific replication: retrieving new data conforming to the same sampling policy, rather than repeating the data verbatim.
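One simple way to picture the ‘jumbled’ distribution model is the sketch below, which shuffles sentences across documents so that frequency information survives while the original running texts cannot be reconstructed. This is purely illustrative: the actual scrambling schemes used by the corpora named above are not documented here and may differ.

import random

def jumble(documents, seed=0):
    # documents: a list of documents, each a list of sentence strings.
    # Returns all sentences in a single randomised order, severing the
    # link between a sentence and the document (and position) it came from.
    sentences = [sentence for doc in documents for sentence in doc]
    random.Random(seed).shuffle(sentences)
    return sentences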

2.3.3 Sample Design

The web, and its ease of access, also enables new sample designs. The most obvious of these is the automation of monitor corpus maintenance. This is analogous to the problem of maintaining search engine indices, and as such there is significant literature from that area focused on scheduling revisits to pages and ensuring coverage across domains. These techniques are now used in lexicography to keep dictionary resources up to date[130]. The availability of translated documents online also provides easy sources of data for constructing parallel corpora, a task that can be almost completely automated due to web page markup[149]. The ability to automate retrieval also means that diachronic designs can be automated to some extent. This is explored and developed further in Chapter 3.

2.3.4 Documents of Digital Origin

With the rise of the paperless office5, even documents typically accessed in physical form are authored and stored digitally. This has now extended to almost every form of textual information, and has two main advantages: firstly, there is greater coverage of conventional formats such as books; secondly, there are entirely new opportunities to sample and meaningfully process things that have never been available before, like flyers, video with overlays, etc. The realisation of increasing quantities of data as native-digital objects means that it is possible to take copies with relatively low costs, and without depriving the original owner of the resource for any great length of time. The increased breadth of use also lends itself to multi-modal corpus designs, effectively expanding the coverage of a corpus to better suit its population.

5For an illustration of how well this design principle worked, refer to the contents of your own desk.

2.3.5 Life-Logging

Life-logging is an activity that is focused around gathering, organising, and using a continuous record of the data encountered in everyday life. It has been developed with two main focuses, both of which may lend value to the process of corpus building and sampling:

• Entertainment: Many people have, since the mid-1990s, broadcast significant portions of their life online, something that has risen in popularity to the point of spawning consumerised applications for the purpose (Justin.tv, uShare). Methods used focus on audio-visual broadcasting, as the output is matched closely in format to reality TV.

• Information categorisation and extraction for personal use: This has been the main focus of the academic community (and, in one notable case, DARPA[3]), and has spawned many projects that focus on digesting and operationalising life-logging data. Typically such efforts are less focused on audio-visual data, since it is prohibitively difficult to process.

With the availability of powerful portable devices such as tablets and phones (and wearables such as Google Glass), life-logging techniques that have conventionally been restricted to only a few individuals worldwide, due to technical requirements or practical limitations, are becoming increasingly viable as sources of information for many people. These techniques offer an approach to sampling that, whilst explored by many other fields, is typically seen as expensive and involved. One’s own logged history may be a useful source of data for systems that interact using NLP techniques (in order to mimic one’s dialect and idiomatic language use more closely), or the language proportions of social groups may be more accurately determined for scientific study. The value of such personalisation is already proven in many contexts with more limited interaction methods, for example speech recognition (personalised phonetic models) and web search (Google and others’ personalised results).

Further to the benefits of being able to gather real data more easily, life-logging allows us to peer further into social contexts with less disruption, yielding higher quality data. Where language metadata is needed (rather than verbatim text), many life-logging technologies support discarding of any identifiable information on the fly: there are techniques for storing only irreversibly scrambled audio such that the characteristics of speakers may still be identified[112], or devices with sufficient power can simply store summaries of the events they observe, discarding the data itself. The decrease in ethical sensitivity associated with such measures further reduces the barriers to wider sampling of a population, something that may be used to improve and adapt existing language resources. Such life-logs, even heavily anonymised, offer new opportunities for balancing corpus strata, as well as providing a perspective on use that is more richly annotated with contextual metadata. These methods are reviewed in more detail in Chapter 4.

2.3.6 New Sampling Opportunities

Miniaturisation of technology, and increases in computing power, gradually unveil new opportunities to sample data. At one level, this may make possible voice recognition and transcription of multiple subjects, or analysis of data in-place, without transcription. At another, computationally expensive techniques such as bootstrapping become increasingly applicable to samples large enough to be used for linguistic purposes.

Access to Technology

The ubiquity of technology makes it possible to access a diverse population of language users with little concern for geographic limitations. Submissions of textual data may quickly be acquired via the internet, and populations of people otherwise unrepresented in corpora may be sampled this way. This has also had other effects, such as the tendency for a single conversation online to include speakers from many countries, cultures and demographics (often without the participants even being aware of that fact). One effect of this has been the ability to recruit study participants from across the world at relatively little cost through the use of crowdsourcing platforms such as Amazon AMT[83] or Crowdflower[50]. This widespread ability brings with it a significant number of analytical challenges, many of which are still being worked upon, and many of which, such as the problem of managing inter-annotator agreement with many participants, apply nicely to the problem of sampling quality corpus data in a distributed manner.
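As a small illustration of the agreement problem mentioned above, the sketch below computes Cohen’s kappa for the simplest two-annotator case (settings with many annotators would instead use a generalisation such as Fleiss’ kappa); the function name and example labels are illustrative only.

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators over the same items.
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# cohen_kappa(["N", "N", "V", "N"], ["N", "V", "V", "N"])  ->  0.5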

Indexing and Access

The improved power and utilisation of large-scale computing machinery opens up possibilities for more complex examination and extraction of data from existing lists. A prime example of this is the pooling of data in and around search engines such as Google, which has become a source all of its own for many researchers[95]. Once data is acquired, data warehousing techniques open possibilities for examination using many more covariates than has previously been possible, allowing researchers with very complex research questions to construct meaningful subcorpora with relatively little effort or time overhead. This bulk analysis of heterogeneous data is regarded as a growth area in many circles, though it seems to have been under-utilised in academia due to its low quality compared to traditional scientific samples, and the low replicability that this implies. Whether ‘big enough’ big data is ever useful in a scientific context remains to be seen; however, this is arguably something WaC already embraces.

2.4 Summary

Conventional corpus sampling techniques are, fundamentally, based on those used elsewhere. There is, however, much confusion on the topic, and unwarranted debate over whether or not corpora are comparable to typical samples. Much of this debate has stemmed from tradition, being sparked by the complexity of corpus designs such as Brown’s, which relied heavily on expert opinion and inclusion of text according to a deliberately non-representative policy. This subjectivity is, in reality, a reaction to practical challenges that faced the first corpus construction efforts.

Many of these practical challenges have either changed in nature, or been solved. Digitisation and POS tagging are now largely automatic affairs, and documents selected for inclusion are often born digital. For UK researchers, there is even relief from a number of intellectual property issues. Unfortunately, however, some have proven harder to solve. There are still large challenges in the problems of population definition, and of a widely-agreed-upon taxonomy for genre definition. Additionally, access to documents, even where technically possible, is often restricted by private corporations, or filtered through systems such as search engines that act as black boxes. These remaining challenges still limit the corpus building process, and prevent the use of the most quantitative random sampling techniques. Even those that are partly expert-informed, such as stratified sampling or cluster sampling, require widespread agreement on taxonomy selection. It is unlikely that there will be any one answer to such theoretical issues, as these must be defined relative to the research question.

The Web-as-Corpus movement offers a democratisation of the corpus building process by easing a number of the retrieval and enumeration problems6. These come with a significant challenge of their own, however, in that each user must then perform all of the stages of a corpus construction effort: from unambiguously specifying population parameters to cleaning and preparing the data. This process can, at least, be automated. In so doing, it is possible for researchers to vary their corpus designs and tailor them to meet individual needs. For this to occur, tools and methods must be produced that are able to take design goals and, without significant further effort, produce corpora from them. The role of this thesis is to explore this space, investigating corpus sampling and identifying areas where WaC approaches may mitigate remaining issues. The ultimate goal is to provide methods and tools which allow users to rationalise and operationalise design criteria to produce their own valid and useful corpora.

6One that is, as lamented by Pedersen, rarely realised in practice[137].

Chapter 3

Longitudinal Web Sampling

The open publishing model of the web leads to a far faster turnover of documents compared to most conventional printed sources. In addition, the lifespan of a document is controlled by its author7, rather than by any publishing party, often requiring the continuous provision of funds merely to stay available. The natural conclusion of this is that documents available online are there in a transient, time-limited sense only: in order to uniquely identify a document online, both a URL and a time are needed.

The phenomenon of documents becoming unavailable over time, known as ‘link rot’, has been studied at length by those who seek to index the web, most notably search engine designers. Lately, work has also been done to identify flaws in the longevity of scholarly collections[99]. Document change over time also has important ramifications for those working with web data. Aside from changes in ownership of domain names and site restructuring, many web pages undergo gradual revision and updates, particularly those in the news genres. Again, whilst there has been study of these events for the purposes of crawler design, the linguistic impact is less well known.

As WaC methods become more widely adopted, and other general-purpose corpora seek to include web data, the distribution of documents through time becomes an important sampling issue: the age of information when read, and the longevity of it when written, both affect the veracity and relevance of any conclusions. This is of particular importance if using web data to match another sample design, as such effects must be controlled for at a later date. There is also the possibility that this attrition correlates with genre, meaning that bias creeps into the dataset over time. Current crawler design (and thus the contents of search engine indices) focuses on keeping the most current page at any given time. This may be insufficient for many scientific purposes, which must know the context and contents of each page over time, and do so in a manner comparable to other documents from that time period.

This chapter starts in Section 3.1 with an examination of current open-source corpora, and their likely half-lives online. This is a motivating example to frame the introduction of the

7Though this situation is quickly changing as people increasingly use blogging platforms and social media to publish content.


LWAC tool (Section 3.2), which provides a mechanism for longitudinal sampling of large corpora over time. The suitability of this tool for long-term use is evaluated therein, before a summary (Section 3.3).

3.1 Document Attrition

Those using corpus data are increasingly turning to the web as a source of language data. This is unsurprising given the vast quantities of downloadable data that are readily available online. The Web as Corpus (WaC) paradigm[97] has become popular for compilers of corpora for lexicographic use and replication of standard reference corpora, as well as for studies of specific online varieties of language. Two general models of WaC have emerged. Under the first model, ‘browsing’, corpus compilers collect data from a set of given domains, selecting whole or part texts online and incorporating them into their corpora (e.g. the BE06[9] corpus, comprising material published in paper form but found on the web). The second model, ‘searching’, sees the web through the lens of search engines, and is typically used to compile domain-specific corpora from a set of seed terms which are used to locate web pages for incorporation into a corpus (e.g. the BootCat & WebBootCat tools[14]). In some cases, both approaches are combined, using searches for general language seed terms to produce reference corpora[98].

Collecting corpora online raises legal questions regarding redistribution rights. Consequently, many compilers choose to make data available only through a web interface (with restricted access for fair use) or by distributing URL lists (known as open-source corpora[159]). Online content changes much faster due to the decentralised, open publishing model of the web, which may have a serious impact on two aspects of the WaC paradigm: availability and replicability. As websites change, the URL lists need updating to reflect new locations. Worse, websites or pages may be completely removed, so that the corresponding part of a corpus is no longer available. This attrition of documents through time affects both users of open-source corpora and those attempting to interpret the results of studies using corpora that were built just a few years ago.

Though many studies have looked at the life-cycle of web pages in general, these typically focus on the integrity of websites or specific repositories of information, rather than the documents and the language contained within. Koehler[100, 101], through four years of weekly sampling, found that just 66% of his original seed URLs remained online after a year, with this proportion dropping to around 31% by the end of the study. Koehler started his study in the early days of the web (December 1996) using a relatively small sample of only 361 URLs. His analysis found statistically significant differences by type of page as well as variation across top-level domains (TLDs). Nelson & Allen[134] found that only 3% of documents in digital libraries become unavailable in just over a year. This is perhaps unsurprising given the aim of such projects, but serves to illustrate the degree of heterogeneity between types of document host. Bar-Ilan and Peritz[10] present a slightly different method of study, searching for the same information where URLs became unavailable. Though they primarily investigate the growth of the web in their analysis of the availability of information, they find a similar amount of naïve URL attrition to the above studies after a year. Studies involving academic paper availability mirror open-source corpora in that they use references in favour of original text; however, the centralised administration of academic repositories is notably in contrast to most web resources.
Nevertheless, there has been much work in this area, some of which is comparatively reviewed in a paper by Sanderson et al.[152] These studies, spanning the years 1993 to 2008, illustrate that even institutions charged with keeping an accessible record of information are still subject to rates of attrition in the region of 25–45% over five years. The reasons for this loss are not investigated.

Despite these enquiries, very little work has been carried out to estimate the effects that document attrition has upon corpus content. Sharoff[158] touches upon this in his WaC work, presenting a preliminary analysis of attrition within corpora generated by searching for 4-tuples. Although his studies lasted from one to five months, and contained modest sample sizes (1,000), they indicate a rate of loss that is below that of other studies (just 20% per year in the month-long study), suggesting that the selection of documents may have significant effects upon document attrition rates.

Rather than analysing web resources in isolation from their linguistic uses, I outline a preliminary analysis of what I term ‘document attrition’ relative to a number of corpora of differing construction. This is done in a manner similar to those using open-source corpora: the retrieval of URLs, rather than information.

3.1.1 Methods & Data

In order to measure document attrition across a number of linguistic sources a set of corpora were selected, chosen due to their differing constructions and ages, and downloaded using a process that closely approximates an end user’s view of the web. Statistics on the availability of these documents were then annotated with a series of URL-related variables for analysis.

Data Summary

Data were taken from four open-source corpora (outlined in Table 3.1), each of which consists of a sample of URLs referring to web resources. All of these corpora are cross-sectional, representing data from a short period; however, only the BootCaT-based corpora are built using a script that is likely to sample quickly enough to count as a true point sample in the context of this study. BE06[9] was built as a conventional, hand-selected corpus designed as an update to the LOB[86] and FLOB[77] corpora8. It contains texts from sources published in 2006 but also available online.

8The author wishes to thank Paul Baker for providing the URL list for this study.

Corpus      Date        Size (URLs)   Sample Period   Construction
BE06        2006        473           1 year          Browsing
Delicious   Sep. 2009   630,476       1 month         Browsing
Sharoff     2006        82,257        hours           Searching
BootCaT     Sep. 2011   177,145       hours           Searching

Table 3.1: An overview of the corpora selected for study.

The Delicious corpus represents a sample of links posted to the front page of delicious.com9 during the whole of September 2009. Sharoff’s corpus is the same one used in his 2006 paper on open-source corpora, and was built using modified BootCaT scripts from a series of 4-tuple seed terms selected from the British National Corpus. As with BE06, the month of construction was assumed to be June (half-way through the year), in order to minimise potential error. The BootCaT corpus was built for this study from 4-tuples that are themselves built from the same terms as Sharoff used for his 2006 study, using updated versions of his original scripts10.
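To show what this 4-tuple construction looks like in practice, the sketch below generates random tuple queries from a seed list in the general style of BootCaT-like scripts. The function name, seed words and parameters are illustrative and do not reproduce Sharoff’s actual scripts; the submission of each query to a search engine is deliberately left out.

import itertools
import random

def seed_queries(seed_terms, tuple_size=4, n_queries=10, seed=0):
    # Each query is a random combination of seed terms; in a BootCaT-style
    # pipeline these would then be submitted to a search engine and the
    # returned URLs collected into a corpus.
    combinations = list(itertools.combinations(seed_terms, tuple_size))
    rng = random.Random(seed)
    chosen = rng.sample(combinations, min(n_queries, len(combinations)))
    return [" ".join(terms) for terms in chosen]

# seed_queries(["picture", "considered", "important", "towards",
#               "probably", "available"], tuple_size=4, n_queries=3)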

Downloading Process

The process of recording each document’s status was relatively simple: a small piece of custom software was written to download documents from an open-source corpus at regular intervals. This tool was configured to mimic requests made by common web browsing software in order to emulate a typical user’s visit to the document. Handling SSL, cookies, and referrer links in a similar manner to a user following a bookmark allows the content to be assessed more accurately, avoiding tricks that exploit search engine crawlers and other bots. The tool used here was later developed into a more complete solution for investigating temporal effects on the web, and this is discussed in more detail in Section 3.2.

Taking this user perspective, the notion of document availability becomes slightly more complex. Redirect requests were followed up to a depth of five, as recommended in the HTTP specification and commonly implemented in browsers. Since I do not account for content changing in this experiment, failure was taken as receiving a final HTTP status code other than 200, or a network timeout (60 seconds was the timeout used for DNS and TCP)11. Both metadata and original response details are stored by the download software. This experiment focuses on features of URLs, such as the presence of GET arguments12 in the URL string, which indicate that the resource is likely to be dynamic. These features have been chosen to indicate aspects of web hosting and affiliation that are likely to vary between users with both different reasons for uploading their content, and different degrees of technical expertise.

9A social bookmarking site where users post and exchange links.
10As retrieved from http://corpus.leeds.ac.uk/internet.html
11Timeout errors also occur stochastically due to routing policies, and are impossible to avoid entirely when downloading resources in bulk. The download process was tuned to minimise this source of error.
12Parameters appended to a URL string, typically used to control dynamic scripts.
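The custom download tool itself is not reproduced here, but the sketch below approximates the availability test described above using the Python requests library: browser-like headers, redirects followed to a depth of five, a 60-second timeout, and failure defined as any outcome other than a final 200 status. The header values and function name are illustrative only.

import requests

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0",                    # placeholder browser string
    "Accept": "text/html,application/xhtml+xml",
}

def is_available(url, timeout=60):
    session = requests.Session()
    session.max_redirects = 5                       # depth followed in the study
    try:
        response = session.get(url, headers=BROWSER_HEADERS,
                               timeout=timeout, allow_redirects=True)
        return response.status_code == 200
    except requests.RequestException:               # timeouts, DNS errors, etc.
        return False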

3.1.2 Results

Table 3.2 describes the availability of documents within the corpora, as sampled on the 21st of October 2011. This yields two data points for each corpus: the date of compilation, at which no attrition is assumed, and the sampling date. Document lifetime statistics are calculated assuming exponential decay: both the half-life (t1/2, the time it takes for only half of the original corpus to remain available) and the mean lifetime, τ, are provided.

Corpus      Age (yr.)   Loss    t1/2 (yr.)   τ (yr.)
BE06        5.3         42%     6.5          15.8
Delicious   2.1         7%      17.8         42.4
Sharoff     5.3         34%     8.6          16.4
BootCaT     0.08        0.8%    4.8          20.4

Table 3.2: Loss from corpus inception to October 21st, 2011.
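The sketch below shows one way such lifetime statistics can be derived from a single observed loss figure under a simple exponential survival model, S(t) = exp(−t/τ). It is offered only to make the assumption concrete; the estimates in Table 3.2 may have been computed differently (for example, from a fitted survival model), so this function will not necessarily reproduce them exactly.

import math

def attrition_stats(age_years, loss_fraction):
    # Assuming S(t) = exp(-t / tau): estimate the decay rate from the
    # surviving proportion, then derive half-life and mean lifetime.
    surviving = 1.0 - loss_fraction
    decay_rate = -math.log(surviving) / age_years   # lambda = 1 / tau
    half_life = math.log(2) / decay_rate
    mean_lifetime = 1.0 / decay_rate
    return half_life, mean_lifetime

# e.g. 42% loss over 5.3 years (the BE06 row):
# attrition_stats(5.3, 0.42)  ->  (about 6.7, about 9.7) years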

The large differences in corpus half-life are revealing: the Delicious corpus has significantly lower loss than the others. This is ostensibly owing to its construction: users are likely to bookmark resources that are useful (and hence are well-established, popular sites), in contrast to BootCaT’s uncritical selection or the deliberate document-seeking (rather than information-seeking) represented by BE06. The difference in half-life between Sharoff’s corpus and my own BootCaT-derived one is harder to explain. Though both are large corpora built using similar methods, they were sampled years apart with heavy influence from search engines, which will have updated significantly in this period. It is also possible that 31 days is an insufficient period to achieve an estimate for attrition that is representative of a full year’s loss, implying that a pattern may be evident due to external influences (such as hosting renewals around the commencement of the tax year).

The half-lives of our samples are above those stated in other, more general, studies (Koehler reports a half-life of 2.4 years, for example). This may be the result of bias introduced when deliberately omitting non-document portions of the web, such as navigation pages or images. Another influence is the age of many attrition studies, as it is possible that, with reductions in the price of web hosting, resources simply remain online for longer. The relatively high rate of attrition in BE06 is surprising given that it only features documents that were already in print, which ostensibly reside in archives or the websites of large institutions (shown to be relatively non-volatile in other work). One possible counter-argument is that BE06’s sampling policy was to take documents published in 2006, rather than merely available then, such that these samples have witnessed the initial steep descent of the document survival curve.

The diversity of status codes returned varies significantly between corpora, with older ones showing more intricate and descriptive modes of failure (such as code 410 Gone). Delicious differs from the other corpora, exhibiting codes that are presented by WebDAV13 and similar systems unlikely to be crawled by search engines. Each of the corpora exhibits similar distributions of top-level domains (TLDs), though the large differences in sample size make formal comparison difficult. The overall distribution of the more popular domains is provided in Figure 3.1.

13WebDAV is used for modifiable content, usually the presentation of revision control system data such as subversion repositories.

Figure 3.1: Overall distribution of top level domains.

This indicates that .com dominates the selection across corpora, with .org following. Only BE06 differs from this distribution in its selection of .uk addresses; however, it has been deliberately biased this way so as to represent British English. Other studies have identified statistically significant differences in the rate of attrition between the major TLDs[101]. Though each corpus exhibits a dependence between these TLD groups (chosen to represent the vast majority of each corpus’s content) in a χ2 independence test (χ2 > 33.8; p < 0.01 in all cases), generalised linear models reveal that the nature of this dependence varies greatly between corpora, indicating that this estimate is far too simplistic to represent the real causes of attrition.

The shape of the empirical distribution of URL path length is shown in Figure 3.2. Delicious.com users may be expected to bookmark top-level domains with relative frequency, but the difference between the two BootCaT samples is more subtle, perhaps an effect of search engine changes. The preponderance of introductory or ‘launch’ pages in the Delicious data set may also go some way to explaining the longevity of its content: it seems reasonable to presume that top-level pages remain online for longer (though also perhaps that they change more often). Taking the presence of GET arguments in a URL as an indicator of a page being dynamic, a number of effects may be seen across the corpora.

Figure 3.2: Differences in the empirical distribution of URL path length.

The BE06 corpus had 24% of its links dynamic, exceeding Sharoff’s at 17% and Delicious & BootCaT at 8% and 6% respectively. This difference is probably due to the selection of published documents, since the compiler was seeking specific materials within sites rather than attempting to retain the location of a resource (as with Delicious) or sampling randomly from URL-space (as with BootCaT). Differences between the two BootCaT-based corpora may reflect changing weights within search engine algorithms.
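The URL-level covariates used throughout this section (top-level domain, path length, and the presence of GET arguments) are simple to extract; the sketch below shows one way to do so with Python’s standard library, with the function name and example URL being illustrative only.

from urllib.parse import urlparse

def url_features(url):
    # Extract top-level domain, path depth, and a crude 'dynamic page'
    # indicator (the presence of a query string / GET arguments).
    parsed = urlparse(url)
    host = parsed.hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else host
    path_depth = len([part for part in parsed.path.split("/") if part])
    return {"tld": tld, "path_depth": path_depth, "dynamic": bool(parsed.query)}

# url_features("http://example.org/news/2011/story.html?id=42")
# -> {'tld': 'org', 'path_depth': 3, 'dynamic': True}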

3.1.3 Discussion

These preliminary results indicate that the process of corpus compilation, by introducing deliberate bias into the content (the selection of full documents, filtering of navigation pages, etc.), impacts the observed document attrition rate. These biases have been evidenced by the URL features alone, raising interesting questions about the effect that collection strategies have upon corpus integrity: should the tendencies of different groups of web publishers be factored into sampling strategies for open-source or subject-specific corpora?

The ramifications of these biases for the WaC availability sampling strategy remain an open question: does ‘searching’ for links imply a minimum age, and hence a pre-existing skew towards certain content? There are indications that sampling a cross-section of production, rather than consumption, observes the initial steep decline in document availability that is inherent in most survival distributions. It is possible that these effects are minimised by the WaC approach, and are actually more pronounced in conventional, offline, corpora: search-based sampling may compensate for this effect by weighting reliable and established websites through the algorithms used to rank relevance, though further work is needed to establish the degree to which this occurs. Another possible effect is the disproportionate availability of archived, out-of-copyright documents.

In the best case, where all linguistic areas decay at a similar rate, this merely causes us to question the validity of corpora sampled even a short time ago. Where conclusions depend upon the changing nature of presentation or context, this change is likely to be more pronounced, as websites taken down are unlikely to be re-uploaded without some degree of change. In order to operationalise the changes observed over time, a more detailed understanding of content changes is required that is able to differentiate between mechanisms of attrition, such as site redesigns, boilerplate changes, and total loss of information. To better understand these, a more comprehensive method of sampling is required to inform survival models that are more advanced than simple exponential decay.

3.2 LWAC: Longitudinal WaC Sampling

Many sampling efforts for linguistic data on the web are heavily focused on producing results comparable to conventional corpora. These typically take two forms: those based on URL lists (e.g. from search results, as in BE06[9] and BootCaT[14]), and those formed through crawling (e.g. enTenTen14, UKWaC[49]). Though initial efforts in web-as-corpus (WaC) focused on the former method, many projects are now constructing supercorpora, which may themselves be searched with greater precision than the ‘raw’ web, in line with Kilgarriff’s vision of linguistic search engines[93]. This has led to the proliferation of crawlers such as those used in[155] and WebCorp15[148]. This approach, with its base in a continually-growing supercorpus, parallels the strategy of a monitor corpus[163], and is applicable to linguistic inquiry concerned with diachronic properties[92].

Repeated sampling by crawling, whilst balanced linguistically, omits subtler technical aspects that govern consumption of data online, most notably the user’s impression of its location, as defined by the URL. Low publishing costs online, paired with increasing corporate oversight and reputation management (both personal[120] and professional[122]), have led to a situation where this content is being revised frequently, often without users even noticing. The nature of within-URL change has been studied from a technical perspective by those interested in managing network infrastructure, compiling digital libraries[177], and optimising the maintenance of search engine databases[101]. The needs of these parties are quite distinct from those of corpus researchers, however, since they focus around a best-effort database of information, rather than a dependable longitudinal sample with known margins for error. Current methods for sampling language change online include:

• Crawlers/Revisiting: Continuously crawl pages, adding new ones as links are updated and revisiting old ones according to heuristics;

• Diachronic corpora: Build two separate corpora using the same sample design at different points in time;

• Monitor corpora: Continuously add new material to a corpus as it is created, agnostic of change;

14http://trac.sketchengine.co.uk/wiki/Corpora/enTenTen
15http://www.webcorp.org.uk/live/

• Subsampling supercorpora: Build a large corpus and select documents from it according to a bimodal distribution of document age;

• Feed corpora: Crawl pages on publication date (a form of monitor corpus).

Each of these methods is subject to a number of downsides, making them difficult to use for many scientific purposes, or requiring significant resources and forward planning (thus putting them out of the hands of most researchers). Some issues, such as irregular revisiting of pages and the lack of detail stored about network-level metadata, complicate the process of analysis. Most corpus formats are also not annotated by time, meaning that analysis often requires repeated export and comparison, and tools for this must be produced on a case-by-case basis.

I present here a tool, LWAC, for this form of longitudinal sampling, designed to maximise the comparability of documents downloaded in each sample in terms of their URL rather than content. To accomplish this, a cohort sampling strategy is used, as illustrated in Figure 3.3, to get full coverage over a list of URLs at the expense of sampling new content. This approach allows for reliable re-visits to each member of the sample, and thus the construction of vertically comparable data points, whilst making short-term effects visible by revisiting each link. Such a sample design should repeat each individual sample as quickly as possible, so as to minimise the time differences between documents within. This allows a user to investigate how language may change in relation to technical and social events in a way that mimics the experience of many end users, and offers a useful perspective on many epistemic problems of WaC methods, to determine:

• The portions of web pages that typically change as main content regions;
• The impact of social feedback and user-generated content on page content;
• How censorship, redaction and revision affect website contents;
• Website resource persistence and its relation to linguistic content;
• How institutions’ publishing policies affect reporting of current events.

LWAC is designed to construct longitudinal samples from URL lists, using only commodity hardware. It is designed with ‘full storage’ in mind, that is, recording everything about each HTTP session in such a way that it may later be exported and accessed in a parsimonious manner. LWAC is based around a central corpus store that is indexed both by time and URL. Each sample consists of all datapoints in the corpus, and begins at a specified time: this time is aligned to a sampling interval. Figure 3.3 shows the logical structure of the corpus: the blue bars represent each sample. The primary technical challenge lies in maximising the parallelism of each sample, thus improving the comparability of datapoints therein and allowing for a smaller sampling interval. Samples that overlap their sampling interval will not be stopped: instead the next sample will be delayed until the next ‘round’ interval. This eases analysis by ensuring the integrity and temporal alignment of each sample.
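The sketch below restates this alignment rule in code: the next sample always begins on a ‘round’ interval boundary, and an overrunning sample simply pushes the start to the next boundary rather than being cut short. This is an illustration of the rule as described, not an excerpt from LWAC itself.

import math
import time

def next_sample_start(interval_s, previous_end=None, now=None):
    # Return the next start time aligned to the sampling interval,
    # skipping any boundaries already passed by an overrunning sample.
    now = time.time() if now is None else now
    earliest = max(now, previous_end or 0)
    return math.ceil(earliest / interval_s) * interval_s

# With a one-hour interval, a sample that finishes ten minutes late
# delays the next sample to the following whole hour.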


Figure 3.3: URL coverage for batch and crawl

Documents are downloaded in a random order, so as to avoid systematic variation in sample times across links.

3.2.1 Design & Implementation


Figure 3.4: System Architecture

In order to perform the parallel retrieval stage quickly, LWAC is constructed as a distributed system, with a series of workers performing the page retrieval and passing batches of data to an offline system for further processing. This architecture, shown in Figure 3.4, is extensible until the limit of contention for the storage manager’s network connection is reached, which in practice is encountered at many tens of worker nodes. In addition to the use of multiple workers, each worker performs many parallel requests16.

The storage manager is responsible for all scheduling and data integrity: workers are treated as (relatively) untrusted actors, and allocated batches of links on request. The scheduling system monitors the performance of each worker, and dynamically computes a timeout for work units: exceeding this timeout will see the batch of links re-assigned to another worker. This system allows for great flexibility in workers, which may vary in batch size and connection throughput. The storage manager is also responsible for enforcing atomicity of jobs and of the underlying corpus. This is enforced using a check-out system: workers check out batches of links that remain allocated until they are returned, completed, by the same worker. When a sample is complete, it

16The portion of LWAC used for parallel request management is available separately at https://rubygems.org/gems/blat

is ‘sealed’ on disk to prevent further edits, and the scheduler instructs workers to sleep until the next sample time.

root/
root/database.db
root/state
root/files/sample_id/sample
root/files/sample_id/1/2/3/456

Listing 3.1: Corpus directory structure on-disk

Data storage in the system is split between metadata, stored in an SQL database, and the website sample data itself, which is stored as raw HTTP response data in a versioned structure on disk. The storage format is optimised for large samples, and is nested in order to avoid common filesystem limits (as shown in Listing 3.1). LWAC does not enumerate URLs in memory, meaning there is no hard limit on corpus size: instead it incrementally loads links as requested by the workers. This means that memory usage is limited to the maximum batch size within the system.

Workers imitate the behaviour of end users’ browsers as much as possible, so as to avoid search engine optimisation and user-agent detection tactics. They are able to provide cookies and request headers commensurate with a browser, and follow redirects to similar depths. This behaviour is passed as part of the work unit itself, and so is configured centrally. Workers are able to enforce limits on MIME type and file size, ensuring that documents are not downloaded if they are incompatible with further evaluation stages.

After downloads have occurred, data may be retrieved for analysis in a variety of formats using the included export tool. Export is possible at one of three levels: all data, individual samples (all datapoints at a given time), or individual datapoints (across all times). Metadata is stored about network-level properties such as timeouts and latencies, as well as URL properties (link length, etc.) and HTTP headers. The original request parameters and sample design are also presented in the same structure, allowing corpora to be somewhat self-documenting. Appendix A shows the object hierarchy of page data. Exports are performed by filtering data using a series of expressions. These data are then passed to formatting expressions, which are able to normalise display formats for analysis, before calling a number of pluggable formatting modules. The formats currently supported are:

• CSV format with one row per datapoint or per sample;
• A collection of CSV files, split per sample;
• JSON collections, for import into tools such as MongoDB;
• Arbitrary text output using the ERB template format, supporting server-, sample- and datapoint-level exporting;

• XML, transformed from the original input parameters using XSLT.

Resource Usage

LWAC is designed to handle corpora of unlimited size, and uses a pipelined design to avoid enumerating its corpus members. In practice, the limit on corpus size is imposed by the SQL server used. Memory usage is O(1) for both the storage server (which backs all of its corpus data with the disk) and the client (which only ever loads as many datapoints as there are parallel connections). Ultimate limits are defined by the batch size chosen for workers:

• Server: (clients × batch_size × link_size) + (batch_size × max_resource_size)
• Client: in_progress × max_resource_size (using a disk cache)

Workers use static quantities of disk, equal to the maximum download size of each page multiplied by the batch size. Disk usage for the storage controller is O(N), though the storage format is well suited to deduplication using symlinks to point to previous data.
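To give a feel for these bounds, the sketch below restates them as a function and evaluates one purely hypothetical configuration; the parameter values are chosen for illustration and do not describe any deployment reported in this chapter.

def worst_case_memory(clients, batch_size, link_size, max_resource_size,
                      in_progress):
    # Restates the worst-case memory bounds above (all sizes in bytes).
    server = clients * batch_size * link_size + batch_size * max_resource_size
    client = in_progress * max_resource_size
    return server, client

# e.g. 3 workers, batches of 1,000 links of ~100 bytes each, a 2 MiB
# per-resource cap, and 400 in-flight requests per worker:
# worst_case_memory(3, 1000, 100, 2 * 2**20, 400)
# -> roughly 2.0 GiB for the server bound and 0.8 GiB per client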

3.2.2 Performance

Ultimate performance is dependent on a number of factors:

• The number of worker machines used;
• The number of connections per worker;
• Worker network speeds;
• Latency of DNS and web servers.

The last of these implies gradual degradation of performance as links rot. Performance figures stated here are therefore for the ‘best case’ scenario where links are all available from fast servers.

Artificial

Figure 3.5 shows the download rate for 140KiB HTML files served over 100Mbit ethernet using the Nginx web server. This server configuration was designed to offer high performance yet still be representative of common web deployments17. The performance of parallel downloads is largely constrained by the platform, in this case the Linux 2.6 kernel. Downloads are reliable until roughly 700 parallel connections are used, beyond which requests are dropped locally as half-open sockets are closed. Until this point performance is roughly log-linear, with 100,000 requests taking just 100 seconds with 500 parallel connections. For subsequent tests, 400 connections per worker machine were used.

Selecting the number of worker machines required for a deployment is a tradeoff between many factors, particularly cost and download rate18. As the server is responsible for all co-ordination, the speed improvements seen when adding worker machines are nonlinear, eventually plateauing where the overhead of the batch request equals the difference in time between requests made by n and n + 1 workers.

17Nginx is known for its ability to serve many users at once, and “[a]ccording to Netcraft, nginx served or proxied 21.64% busiest sites in May 2015.” (http://nginx.org/en/)
18Reliability and physical location are also likely to influence results, especially where geography affects the contents of pages.

Figure 3.5: Download speeds for one worker for various levels of parallelism.

Figure 3.6 shows the overall download times for the same dataset when using differing numbers of workers. Workers were all connected to the storage manager via 100Mbit ethernet (minimising contention), and were making requests via the same link to a local test machine. As expected, the improvements soon plateau, though there is a large amount of variance: larger batch sizes and a larger test corpus are expected to show further improvement with n > 6. This plateau may be expected to occur earlier if workers are connected via relatively low-bandwidth links (for example, if they are connected over long distances via the internet or a VPN).

Real-world

In order to test retrieval performance in a real-world scenario, a test deployment was constructed consisting of three clients, each making 400 parallel connections. Each was running Linux 2.6, connected to the storage manager using a 100Mbit ethernet link, and to the internet using Lancaster University’s tier-1 fibre connection. This setup was designed to be within reach of many academic users, who are likely to have bandwidth allocations exceeding the sites that they visit, yet have limited access to hardware. The 228,000 URLs used were sourced using BootCaT, making 4,600 requests to the Bing search engine and retrieving 50 links per request. This resulted in a corpus size of 14.9GiB which, after boilerplate removal, yielded 588 million words.

Figure 3.6: Download speeds for 10,000 real-world HTML documents.

The throughput over time is shown in Figure 3.7. This data was summarised by examining the logs of the storage manager, and thus represents the rate calculated after each worker has checked in its batch, hence the pattern of jagged increases followed by gradual decay. Larger spikes are seen towards the left-hand side due to synchronisation of the check-ins by clients: eventually, variations in batch retrieval times cause these to disperse and smooth out. This deployment can be seen to plateau at roughly 80 links per second (roughly 288,000 per hour), meaning the full 228,000-page corpus was retrieved in roughly 45 minutes. Using this deployment, retrieval of a corpus the size of the BNC would be possible in around eight minutes; a ukWaC[49] copy would take 2.5 hours (or 17 if using the pre-filtered URL list), and DECOW2012[155] would take 24 hours.

Etiquette

Ordinarily, crawlers are expected to adhere to well-acknowledged guidelines regarding page retrieval, revisit times, and link traversal. LWAC ignores these by design, due to a number of factors that differentiate it from standard crawlers. Firstly, the integrity of the returned samples is impossible to validate if they are not reliably retrieved. Any missing data would have to be imputed, reducing its scientific value and complicating the process of analysing results.


Figure 3.7: Download rate over time for real-world BootCaT derived dataset.

Secondly, though this is largely under the control of the user, it is expected that links will not be visited so regularly as to significantly increase load on a given server. Most effects requiring longitudinal analysis take place over a relatively long period of time that does not warrant short sample periods. Finally, LWAC’s link selection is unlikely to include many pages hosted on the same server. This is not guaranteed; however, for web-wide longitudinal analyses, each ‘hit’ by LWAC is not going to result in immediate successive hits to the same server, as it would with a crawler that navigates pages. Where this is violated, it is of course up to the user to mitigate any load to the web server. LWAC’s randomisation of the order of link traversal also helps mitigate this, as it reduces the chances of a set of links co-occurring systematically in each sample. Though it is possible to perform DDoS-style attacks using LWAC, the use-case for which it was designed is unlikely to violate the principles adhered to by existing crawlers.

3.3 Summary

Temporal effects, particularly the loss of content over time, have been well documented for web data. These effects are fairly well understood with respect to common technical variables, and this understanding is widely used to improve crawling for search engine creation. Though link rot (and other page content changes) are evident in web corpora, the effect this has upon the contents of web corpora is far less well known, as few studies have focused on the content of documents.

LWAC is a web corpus construction tool that takes a novel approach to retrieving data across time, using a cohort sampling design. Many more regressor variables are retained than in existing systems, allowing for more in-depth regression, and a cohort sampling design makes for easy inspection using existing survival analysis techniques. The performance of the tool is sufficient for longitudinal sampling of large, general-purpose corpora in the gigaword range. This ability can be scaled during runtime by the addition of worker nodes, if necessary.

Data is presented by LWAC in formats that are designed primarily for quantitative analysis; this is well suited to the large volumes of data produced, which for many tasks will swamp human evaluators. A large number of regressor variables are also stored, making the resultant corpus suitable for studying not only questions regarding the content of pages, but also network issues such as latency and reliability of web resources.

A large-scale study using this tool (something that is beyond the scope of this thesis) would allow for the quantification of effects that, in conventional corpora, have been informally managed. This may be used either to control for such effects when building corpora using web data (for example, to make document age more closely match offline or auxiliary data), or to analyse changes in classification over time: ‘age’ simply becomes another external variable, to be conditioned upon at will. This approach is revisited in Chapters 5 and 6, which provide methods for such parameterisation.

Chapter 4

Proportionality: A Case Study

The degree to which a corpus represents ‘real’ language use is arguably the most important aspect of corpus design. As discussed in Chapter 2, general purpose corpora are typically constructed based on a mix of expert opinion and per-medium data sources such as bestseller lists. The reasons for this approach are both theoretical and practical: sampling individuals in the act of using language is very difficult, and the persistent nature of texts means that they remain available for sampling after such ‘usage events’. This is especially relevant where the corpus designers intend to include older works. Nonetheless, there is a missing empirical link between the expert-guided designs of language proportions within a corpus and the ground truth of an individual’s (or population’s) language use.

This chapter describes a case study in which a census of language use was constructed for a single individual, described by genre and source. This sample design yields very little inferential power about the whole population, but serves to illustrate the disparity between the proportions a general purpose corpus may have, and the language used by an example individual. This more closely follows Hoey’s[75, p.14] model of describing the body of language a single user is exposed to:

Not even the editor of the Guardian reads all the Guardian, I suppose, and certainly only God (and corpus linguists) could eavesdrop on all the many different conversations included in the British National Corpus. On the other hand, the personal ‘corpus’ that provides a language user with their lexical primings is by definition irretrievable, unstudiable and unique.

The extent to which this one subject may be a useful guide for corpus building is not addressed directly, but this is an area that could be expanded using demographic auxiliary data in order to fulfil at least some requirements of Hoey’s personal corpus. Though beyond the scope of this thesis, this approach could be generalised for use with questionnaires and other less-invasive data gathering techniques, or specialised and restricted to a given source of data for use with automated tools (for example, analysing only a user’s web history). In addition to further illustrating existing sampling problems in corpus linguistics, the methods described herein offer some value for further work on lexical priming.


In Section 4.1, I describe the design of the sample being built in the study, focusing on the key differences between the approach taken here and Brown-influenced corpus methods. Section 4.2 examines how the availability and ubiquity of technology has enabled people to mitigate practical issues with data gathering, and surveys the efforts of life-loggers, whose goals often align well with those of this study. Aims and objectives for this case study are covered in Section 4.3, where the sampling policy is specifically stated. Section 4.4 describes how this sampling policy was implemented, and the design decisions affecting the different technologies used to record data. Each data source is identified and categorised according to its persistence and ease of sampling. Later, in Section 4.4.4, this section focuses on operationalisation and post-processing of the recorded data. Results are shown quantitatively in Section 4.5, and the dataset is described with some comparison to the BNC. The data are broken down by linguistic events and by words, and summarised. Section 4.6 analyses these data qualitatively, drawing comparisons against the BNC in relation to the subject's demographics and to the activities performed within the sampling period. This section also addresses how the data inform the method, evaluating key aspects of the data gathering process such as validity and ethical concerns. Finally, the chapter concludes in Section 4.8 with a short review of the methods, and a discussion of how these contribute to the overall themes of the thesis.

4.1 Sample Design

As we have seen in previous chapters, conventional corpus building efforts centre around linguistic variables, and rely largely on expert opinion to balance their socioeconomic variables. This approach was initially selected in order to avoid certain practical problems (many of them caused by technological limitations), though it has also caused others, most notably the difficulty in retrieving metadata about texts post-hoc. Often, the only way to ensure demographic balance is to rely on sources of auxiliary data such as bestseller lists and library loan records that index textual variables by socioeconomic ones. This process of using 'proxy' variables is particularly opaque, and often limits the metadata available to that in the list used for selection. The design used for this chapter's census is instead based on observing 'linguistic events'—informally-demarcated single uses of language—in an ad-hoc manner. This allows the context to be recorded and metadata to be captured without retrospection. In order to accomplish this, detailed logs were kept on everyday activities. This approach is far closer to that used in some special purpose corpora, especially where the restricted domain allows use of automated recording tools or sampling from an existing rich database (such as from an online forum). Sampling in this way offers guaranteed external validity for linguistic proportions, but may be considered just a single data point from the population of language users and cannot be formally generalised to other people. The purpose of this case study is in part to explore the methods that may be used in creating such samples. Of particular interest are variables that are especially challenging to sample:

• The age of a text when 'used';

• Other temporal information, such as the times and days when texts are used;

• The social context of text use;

• Attention paid to a text, and which portions were read;

• Proportions of text types used, especially representation of types that may be missing from other corpora such as greetings, billboards, and product labels.

The ad-hoc sampling approach, applied to all texts, also allows inspection of the proportions of language types used—something that is merely estimated for general purpose corpora. Though the scope of this case study is necessarily limited, a wider 'language use census' for many people would be a robust empirical method for validating claims of representativeness in general purpose corpora. A further advantage of this method is that the population may be rigorously defined, as data on the participants is available to whatever standard is deemed necessary. This contrasts with the use of existing lists or repositories, which have been constructed with differing purposes and levels of documentation.

4.1.1 Difficulties and Disadvantages

Sampling language use ad-hoc involves changes in method that are practically challenging. To build a true general-purpose corpus, one would have to take data from enough people to cover the population required, and each would have to undertake a fairly intrusive procedure to do so. Difficulties encountered when trying to sample large populations are well documented in the social sciences, and many techniques exist to mitigate common biases (such as weighted sampling[91]); however, these are largely beyond the scope of this chapter. It is notable that one solution in sociology has been to share data in a style similar to corpus linguistics, to minimise the costs involved in executing a high quality survey[5]. The increased 'focus on the person' of a primarily social design also raises a number of ethical issues, as it becomes possible to derive information not just about a general group's language preferences, but about an individual's (or a comparison between groups). This risk is the inverse of the design's value, but it is nonetheless worth considering, as many study designs in linguistics and NLP need not work with such sensitive data. This issue is addressed after the discussion of methods and findings, in Section 4.6.6. Further, sampling text as it is used raises methodological challenges—how can we be sure that the language observed is still naturally occurring and valid? All sampling compromises on this to some extent, and the degree to which one values detail over interruption will vary by study design. It is the aim of this chapter to identify the major challenges in this area.

4.2 Technology

Conventional corpus designs were chosen to avoid practical challenges that existed at a time when computerisation was in its infancy. The application of new technologies to the problems of sampling offers a way to mitigate a number of these issues, making alternative designs possible. Two main themes are notable in easing access to text in a usable form:

• Many more documents are now produced and consumed in digital form. This means they are accessible for automatic copying and processing by sampling software, often without any intervention necessary by the user. As techniques for cataloguing, monitoring, and annotating documents improve, these data become richer, often in ways that would benefit an end user (and thus a corpus builder).

• The abundance and ubiquity of portable technology such as smartphones lowers the difficulty of many existing sampling methods such as audio recording or photography. Connectivity of these devices allows for easier movement of data and offloading of processing, even when in physically remote areas.

The former of these is well represented by the Web-as-Corpus movement in corpus linguistics, and many corpora include sections which have been sampled by distributing portable technology (originally tape recorders). Of particular interest, however, is the value that we may extract by exploiting both to form a coherent sequence of events, and using that narrative to inform corpus annotation. One community that has been using such techniques extensively is that associated with life-logging. Life-loggers record, and often catalogue, their own activities and use of many different kinds of resource in everyday life for reasons of posterity, entertainment, or memory augmentation.

4.2.1 Life Logging

A number of technologies have been developed as a result of the life-logging community's interest in multimedia records. Life-logging is an activity that emerged slowly out of the principles of webcam shows and reality TV, and involves recording (and usually broadcasting) continuous information about one's life as it occurs. Though most popular efforts started as a means for providing entertainment, methods used soon diversified and gained the interest of the information retrieval and processing communities. Many projects have been started with an aim to catalogue and operationalise the huge stream of data each person creates, largely with a focus on aiding that person in their daily life, or aiding large organisations (such as defence forces) in management of resources and people.

Life-logging as a distinct activity is often considered to have started with Jennifer Ringley, who started broadcasting her entire life using webcams in 1996 ('JenniCam'; www.jennicam.org, currently down). Her website proved successful for many years, gained significant media coverage, and in many ways can be credited with popularising lifecasting (broadcasting of one's life rather than simply logging it) and helping to define the most recent wave of life-logging efforts. However, she did not start the practice of either life-logging in general, or lifecasting. Life-logging using less technical methods has arguably been performed by millions in the form of diaries. Though informal, the value these can offer as databases is well known to historians, as they offer a narrative structure that is difficult to build from other sources. The case study detailed within this chapter uses such methods for exactly that reason. More formal, detailed forms of diary could properly be called the first life-logs. Of particular note in this area is the 'Dymaxion Chronofile'[53]—Buckminster Fuller's attempt to document his own life, which consists of a series of scrapbooks recording his actions (and documents) every 15 minutes between the years of 1920 and 1983. Buckminster Fuller's efforts captured the last 63 years of his life in extreme, multi-modal detail: capturing all correspondence, bills, personal notes and material such as clippings from newspapers. This detailed record of his life is essentially unmatched in the pre-digital era, and would not be attempted again until the digitisation of many common tasks eased the process of capturing documents. For the first lifecaster, we must turn to Steve Mann. Mann began working on wearable computing in the early 1980s[123], focusing on video recording and head-up displays; however, it is clear from photographs of his equipment that it could hardly be considered unobtrusive enough for sampling purposes (indeed, it would probably prove more troublesome than Buckminster Fuller's 15-minute interruptions). The posterity-oriented and academic streams of life-logging pursue a parallel course from the mid-nineties onwards. The aforementioned success of JenniCam led to a number of copycats, and, eventually, services such as ustream[82] and Justin.tv[81], both of which make lifecasting available to anyone owning a smartphone. The consumer side of life-logging has also been rising in popularity due to products such as Narrative[108] (originally known as 'Memoto', and which emulates Microsoft's SenseCam[74]) and Google's Glass project (which, though not explicitly designed for life-logging, provides hardware well suited to it). There is also an increasing interest in smart watches, which promise to further extend the simplicity of accessing smartphone features in a continuous, unobtrusive manner. Academically, much more research was undertaken on uses of the data gathered. This meant a greater focus not just on multimedia methods and streaming, but also on other sources of data such as documents (digital or paper), location data, etc. Microsoft's Gordon Bell is particularly well known in this area for developing SenseCam[74], which focused on photography and location data, and MyLifeBits[55, 54], which consolidated many different sources. In these cases (and others' similar efforts[79, 41, 39]), the focus was on producing a record of life that could be used by the original subject as a form of super-accurate memory, similar to Vannevar Bush's 'Memex' vision[32].

In one study, mechanisms for de-duplication are examined with a view to improving the interface used when browsing large collections of logged data[56]. Therein, Wang & Gemmell develop a filter based on clustering that identifies not just identical documents, but also those similar enough to carry identical information in slightly different forms (for example, with updated boilerplate features or layouts):

. . . we count both the exact duplicates and near duplicates we find, we can effectively reduce 21% of all documents and 43% of web pages from the viewer to reduce clutter of the user interface.

A similar project, funded by DARPA (but later dropped), was simply called LifeLog[3]. This took a similar approach to MyLifeBits, aggregating many sources of data into a single bundle, organised by time. Unlike MyLifeBits, however, much of the focus was on using information retrieval methods to construct a coherent narrative for a person, which could later be interrogated at a higher level than the collected data itself. Work on 'audio-based personal archives' by Ellis & Lee[43] moves the focus of much of this logging from visual to audio recording, the intent being to create a digital memory of events that is based on the most inconspicuous and unobtrusive recording methods. They present a number of audio analysis techniques, including those for anonymising data without loss of other information, that may be used for speech data. Much of the focus of these projects was on recall and operationalisation—tasks that are particularly difficult and largely ignored by the more entertainment-focused communities. It is unfortunate, however, that their purpose was largely one of narrative creation: many exploration and recall tools rely on a human's ability to interpret the complex data recorded, and this is ill-suited to the more formal sampling approach needed here. One notable exception to this is Deb Roy's project to record the language use of his own child (most famously presented in his TED talk[151]). This involved continuous video recording using a number of cameras in his house. The footage from this was then analysed to identify regions where his child was speaking, and thus develop a corpus for use in language acquisition studies. He used a number of methods to identify interesting video regions that, it was initially hoped, would apply to analysis of the speech portion of this case study's corpus. In reality, however, the conditions under which he recorded audio proved significantly simpler than those in this case study. Because of the continuous nature of life-logging, efforts have been made to use methods that are easy to maintain, self-contained, and covert. Due to this, as well as the original intent of the life-logging process, much of the effort surrounding life-logging focuses on multimedia sources, and how they may best be combined to form a coherent idea of context. Typical sources of data considered include:

• Video recording
Many life-loggers wear systems that are able to continuously record video in the direction they look, and upload this using mobile networking systems.

• Audio recording
Due to its lower obtrusiveness, many efforts surround the analysis of audio logs, and include systems to detect voices and identify events such as making appointments.

• Document storage
With the increase in use of born-digital documents, some of the more holistic life-logging systems record documents as they are read, with a focus on integrating this data into the larger picture. Others scan in physical documents, such as their post, for later retrieval.

Many of the requirements of a life-logging platform (covert operation, comprehensive data management, context identification) overlap notably with the methods used in covert sociological research and, of particular note for our purposes, those constructing spoken language corpora. The distinction drawn here is one of philosophy. Where life-logging focuses on construction of a narrative, contextualising data, more conventional sampling methods have focused on data recording and normalisation in a formal setting, discarding data that is not immediately operationalisable. Notably, one of the methods used to create the spoken portion of the BNC was covert recording, where 124 people were provided with tape recorders[30]:

Recruits who agreed to take part in the project were asked to record all of their conversations over a two to seven day period. The number of days varied depending on how many conversations each recruit was involved in and was prepared to record. Results indicated that most people recorded nearly all of their conversations, and that the limiting factor was usually the number of conversations a person had per day. The placement day was varied, and recruits were asked to record on the day after placement and on any other day or days of the week. In this way a broad spread of days of the week including weekdays and weekends was achieved. A conversation log allowed recruits to enter details of every conversation recorded, and included date, time and setting, and brief details of other participants.

As illustrated by the BNC's demographic balancing of that portion of their corpus, this ability to directly record data from the field overcomes the disadvantages of text-index-oriented methods of document selection, allowing us access to all of the contextual data at the time of text consumption/production (this is particularly advantageous for spoken texts, where production and consumption often occur soon after one another). The cost of this approach is (as felt by all sociological studies) a need to find a sufficiently large and heterogeneous sample of people who are willing to record data about their language use (and for long enough for it to be useful to researchers). This is arguably more difficult in practice than text-index-based methods, and should only be considered at a large scale where the difference is likely to be crucial to a study. Nonetheless, large samples (such as the British Household Panel Survey[173] and the British Crime Survey[76]) exist in sociology as testament to the value such designs may yield, and illustrate that no alternative method exists for many sociological questions. The ability to specify the demographic variables of a sample directly makes techniques using logging particularly applicable to the construction of special-purpose corpora, especially where those corpora are best demarcated along social lines. Indeed, at one extreme of this scale exists the concept of a personal corpus; something that may yield insights or models about a single person's current language usage. Such a resource may one day be particularly valuable in defining how one uses text-based interactive systems (such as the web) or reads content (such as news articles). Such designs exhibit a tradeoff: a decrease in socioeconomic breadth in exchange for an increase in linguistic breadth. It is the intent of this chapter to illustrate the value in this approach, and to investigate methods by which it is possible to construct corpora without undue difficulty through the use of life-logging methods. In practice the distinction between life-logging methods and 'conventional' data recording for research is one of rigour and intent. Life-logging is primarily concerned with capturing all data, often in a best-effort manner, using whatever means is necessary to do so unobtrusively—favouring imperfect but complete records. Conventional data capture techniques, though often overlapping methodologically, focus on specific research questions; seldom recording data that cannot easily be operationalised, and valuing high-quality recording of a few variables over opportunistic recording of many.

4.3 Aims and Objectives

The case study described here is an attempt to assess the extent to which techniques from life-logging may assist corpus builders in creating a demographically-oriented corpus, and to identify whether the subject's distribution of language use supports the assumptions made by designs for reference corpora. It follows an iterative design in order to gradually refine the methods used, focusing on:

• The variables that may practically be recorded about a text (and which must be recorded before they are lost);

• Methods that may be used to sample text unobtrusively, especially how new technologies may be used to assist;

• Methods and tools for operationalising logs after sampling is complete, and how these may ease the process of data gathering itself;

• How to minimise the intrusiveness of capture methods both to the experimenter (meaning that he can capture smaller interactions) and to those around him (meaning the data is more representative).

Ultimately, the differences in sample design contribute to a larger picture that could yield much future work. The issues addressed here are:

• What methods may be used for gathering data in a short-term language census?

• How may the collected data be operationalised?

• What proportions of language are used by the subject; do these support common claims from general purpose corpora?

• How may these methods be used in future to aid those building corpora?

The case study described here is a first effort in exploring a method that may be useful to many fields. Aspects of the life-logging approach to determining corpus properties could be generalised (or relaxed) in order to further reduce the demand on the subject. The scope of this thesis does not permit further study of these methods; however, it is written with a view to clarifying further work into questionnaire-based or electronic elicitation of language proportions.

4.3.1 Sampling Policy

The aim of a personal corpus is to emulate, as closely as possible, a census of observed language. Use of language, in this context, is defined as any conveyance of information, spoken or written, in any quantity. There are no bounds on the context in which it is used, nor on the language itself, as the purpose of the study is to evaluate these very things. Each of these transactions is recorded as a single line in the data set, annotated with the variables recorded. One of the major issues encountered in preliminary tests was annotation for attention and the proportion of each text read; these were recorded along with textual properties to ease operationalisation. Attempts were also made to record sufficient data to retrieve the full text of each transaction. This worked better for some data sources than others, and much of the case study thus concerns itself with (sometimes estimated) word counts. Word counts were chosen as a measure of size due to their use in other corpora, and their applicability to many different media. A number of practical challenges were identified before sampling, and these were borne out by experience:

• Review and production: both should be recorded as fully as possible. Where a text is re-read, or developed and continually re-read, this should be noted as an ongoing process (and accordingly oversampled).

• Short utterances: very short interactions, such as passing greetings, should not be under-recorded. Their inclusion is likely to be one major difference from conventional corpora.

• Oft-reread texts, such as labels, signs and the like: it is debatable whether one actually reads these or merely remembers/recognises them.

4.4 Method

As this case study is exploratory, seeking to drive and refine the methods used for recording all language use, an iterative design was chosen. This saw a number of preliminary sampling periods, with a review after each to identify its strengths and weaknesses. The subject was myself. This was done for a number of reasons. Firstly, legal and ethical issues surrounding recording and review of the data were mitigated by having the analysis performed by a participant in the original conversations. Secondly, iterative review of the methods involved was possible with an internal, 'white box' examination of how data were collected, and what edge cases and procedural difficulties arose. Thirdly, the demographic status and other person-related variables are well known and need no formal elicitation, minimising the time spent on constructing questionnaires and similar instruments.

4.4.1 Data Sources

Before the first iteration of the sampling/review process, all of the possible language data sources used in the everyday life of the subject were informally identified. It became clear that these data sources exhibited properties that could make sampling easier, or less intrusive. They were classified by the methods required to capture their text:

• Persistent
Resources that exist in a format that is immutable and easily retrieved. This covers many physical items such as books, and some broadcast media, as well as notes made in a notebook. Only identifying information must be stored during sampling itself, in some cases merely an ISBN or similar index code.

• Ephemeral
Language data that cannot be accessed after the fact in any way, or may differ by time or context. This most obviously contains speech, but also many websites, things such as billboards that cannot be readily accessed, or to-do lists that get destroyed.

• Digital Origin
Documents that are read, or written, on electronic devices. These may fall into either of the above categories, yet as they may usually be copied with no overhead, it is often simpler to store them at the time of use. Many document types are now digital, as well as the obvious sources such as email or online chat.

This classification was useful in order to minimise the intrusiveness of a collection method, whilst maximising the detail recorded for a given source (ideally to the point of storing verbatim text). In practice, methods were easy to develop for automated recording of digital documents, and many techniques already exist for sampling non-digital persistent and ephemeral sources with scientific levels of accuracy. Sources initially identified by introspection are listed in black in Figure 4.1. The inadequacy of introspective methods for comprehensively identifying data sources soon became apparent during preliminary tests, as the process of recording brought increased awareness of language use, raising a series of edge cases. These were collected and used to augment the list of sources shown in Figure 4.1. As well as covering the set of sources to be gathered, methods of collection were chosen with flexibility in mind in order to cope with unenvisaged sources of data (this flexibility has the unwelcome effect of slowing down analysis later on, and may be undesirable in some use cases). The list is necessarily furnished according to the life of the subject in question—from this study I am unable to assert that it is generalisable to others, though the process of doing so would involve relatively little intrusive sampling. Each of these sources is 'covered' by one or more sampling tools. These tools progressed most during the iterative process, and each was subject to a number of procedural subtleties that were refined throughout the study.

[Figure 4.1 is a two-column table pairing each data source (speech, printed and irregular printed material, email, IRC, notes/journals, web, keyboard input and terminals, SMS, phone calls, music, and files) with its capture method (audio recording, photographs, email tagging, an IRC bot, hand-written notes, a SQUID proxy, terminal logging, SMS and phone call logs, last.fm, and file collection).]

Figure 4.1: Data sources and their appropriate capture methods (those in blue were added during preliminaries)

4.4.2 Recording Methods

Methods used to record data were chosen for a variety of reasons. They must, in sum, cover the sources mentioned above, be unobtrusive both for the subject and those around him, and be sufficiently flexible to cover unforeseen contexts and data sources. These methods can be separated further into two groups. Many methods are capable of recording multiple sources, and serve to form a narrative that describes the metadata of a linguistic event, pointing at another source for the data itself. These methods were chosen to allow for post-hoc sensemaking and narrative creation, something that was added to the experimental procedure after the first iteration indicated how difficult much of the data would be to operationalise. The second set of methods is focused on a single data source, typically requiring little to no manual intervention to record data. Their records are either indexed by time, or by the more flexible methods mentioned above.

Indexing and Overview

Figure 4.2: The live notebook

Journals and Note-taking Two journals were maintained throughout the sample. The former of these was an A6 notebook maintained ‘live’ as events occurred (pictured in Figure 4.2). This was used to store durations of conversations, titles of persistent sources, etc. The second was a nightly (‘dead’) journal, maintained at the end of each day in a narrative style. This blog-like record was intended to reflect in depth on the proportions of text used in each source, and how attentively each linguistic event was engaged in. The writing of the journal itself was not logged by any other methods. It was also possible to attach daily records to this journal, and the process of writing it inserted an opportunity to reflect on the mnemonic codes used during the day. This process is described in context in Section 4.4.3. The live notebook proved to be the primary indexing method for all other sources of data, and its maintenance was the primary overhead of the study. As illustrated in Figure 4.3, each entry in the notebook was eventually reduced to a compressed form that roughly followed one-line-per-event, storing the time each event occurred, any identifying information deemed necessary for later memory of it, and a duration or other index of word counts. Problems of simultaneous events and split attention were solved in the notes by having a start/stop event for ongoing events, and by using the nightly journal to reflect upon each event.
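To illustrate how such compressed entries can be expanded during post-processing, the sketch below assumes a hypothetical 'HHMM label duration-in-minutes' shorthand; the actual notation is the one shown in Figure 4.3 and differs in detail, so the format and values here are assumptions for illustration only.

    # Hypothetical expansion of a compressed live-notebook line into a
    # structured event (the 'HHMM label Nm' shorthand is assumed).
    Event = Struct.new(:time, :label, :minutes)

    def parse_notebook_line(line, year, month, day)
      time, label, minutes = line.match(/\A(\d{4})\s+(.+?)\s+(\d+)m\z/).captures
      Event.new(Time.new(year, month, day, time[0, 2].to_i, time[2, 2].to_i),
                label, minutes.to_i)
    end

    p parse_notebook_line('0930 radio news 15m', 2013, 4, 5)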

Figure 4.3: A page from the live notebook, detailing the format used

Audio Recording Following work on machine listening, the original intent of audio recording was to capture the occurrence and duration of conversations, as well as any smaller interactions that would otherwise be difficult to capture (such as greetings, or thanks when opening doors). Capturing was performed with an Olympus VN713PC dictaphone, storing audio on a suitably sized external card that yielded many days' continuous recording. Provision was made to download recordings each night and store them with the nightly journal; in practice, however, they remained on the recorder until the end of the study. Aligning the recorder's output to the events mentioned in the notebook was a tricky process. Though the recorder itself supports index marks, there is a limit of 99, which was deemed too low for continuous use. Two devices were built to insert absolute silence onto the recording (something that is rare in real life and thus easy to detect programmatically), the latter of which is pictured in Figure 4.5. These clickers were to be pressed at the beginning of each conversation, so that voice activity detection could be performed to estimate the word count of each conversation (or, in an ideal world, extract verbatim text). In practice, the process of tapping the button proved intrusive and, from the perspective of one talking to the subject, suspicious. The mechanism used for the final iteration of the study was far simpler: the recorder's start time was written in the live notebook, and entries therein were keyed by computing the offset between the two times. Though this incurs a minor overhead in coding the data, it also allows for spontaneous conversations without much overhead, something that is particularly important to the study of text type proportions.
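Keying a notebook entry into the continuous recording then reduces to a simple time subtraction, as in the minimal sketch below (the times shown are illustrative, not values from the study).

    # Locate a notebook event within the day's continuous audio recording.
    require 'time'

    recorder_started = Time.parse('2013-04-05 07:42:00')  # noted when recording began
    event_time       = Time.parse('2013-04-05 11:15:00')  # from the live notebook

    offset = (event_time - recorder_started).to_i          # seconds into the file
    puts format('seek to %02d:%02d:%02d',
                offset / 3600, (offset % 3600) / 60, offset % 60)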

Figure 4.4: The audio recorder used

Photographs The primary method of capturing ephemeral, irregularly formatted, non-digital texts was photography using a cameraphone. This method was chosen largely because the ubiquity of smartphones in British society has led to a situation where photographing fairly mundane items is widely unquestioned. The smartphone used, a Motorola Milestone, also stores time and location data in its photographs using EXIF tags (as well as storing photographs sorted by day). This metadata meant that there was often no need to file an entry in the live notebook, and the cameraphone could simply be used in a very unobtrusive, ad-hoc manner. In earlier iterations of the study, it became apparent that the loud shutter noise made by the Android operating system when taking photographs was problematic. Though photography of signs, packets and such remained unchallenged, the attention of people nearby was drawn to the 'weirdo with the cameraphone' all too readily. This was solved partially (and with great difficulty) by disabling the noise, though it was still apparent from posture when a photograph was being taken. There are notably a number of products available that continually take photographs for the purposes of life-logging. These were considered for the study, but their aims are generally to capture each event, rather than specific aspects of selected scenarios. The ability to consciously specify that the subject was more attentive in some situations (and take pictures accordingly) was judged to outweigh the value of having a continuous record (something much more capably performed by the journals).

Targeted Methods

Phone Calls & SMS Messages Both of these are automatically logged by the Android operating system used by the subject, and each was also indexed in the live notebook. The data was extracted using a free application that exported to XML.

Figure 4.5: The second audio index marker

IRC IRC was logged by construction of a bot. This bot accompanies the subject into chatrooms and logs all messages observed, applying a rough human interest model to ignore data encountered when the user is set to 'away'.

Web The SQUID web proxy was configured to log all traffic, and a number of logins were provided—one for each of the subject's internet-enabled devices. The logs from SQUID store all requests, including advertising/tracking calls, downloads of content never read by the user (e.g. CSS and JavaScript), and AJAX calls that partially reload pages. As such, a large amount of processing was necessary to extract URLs from these logs, and to parse the resultant data into a usable format.

Keylogging/Terminal recording Terminals and keyboard input were logged using a custom application that wrapped a terminal, recording the time each character was sent to or received from the shell. Each terminal, when created, started recording to a new log file, storing the time at which it was started and a series of offsets from this time.

Last.fm In earlier iterations it was apparent that lyrics in music were being missed as a source of text—all devices capable of playing music were configured to ‘scrobble’ to the last.fm music service during the sampling period. Though last.fm do not make their data freely available for access, third party tools exist to scrape their website and download detailed logs of tracks listened to.

Files Files were identified in a number of ways.

Some, particularly those on which the subject worked and contributed data, were written down in the journals. This is a precise method of separating what has been read, but incurs a large overhead. Since the subject works entirely on projects and files that reside in a repository managed by a Revision Control System (RCS), the logs from each commit were used to generate a diff, and this was accessed after the sampling period to identify the contributions made. Another policy that may be used is identification of files by unix mtime (modification) or ctime (inode change, often creation); however, this is fraught with inaccuracy, as files are liable to be modified on disk many times whilst being edited, and sampling the differences is likely to happen at haphazard times. Further, this technique would capture many log files and other files that have been edited by processes in which the subject was not involved linguistically. By contrast, commits to an RCS are scheduled around logical additions, and are manually pushed so that only deliberately edited files are stored. Files used whilst on other systems could be uploaded directly to the nightly journal, or stored on a flash drive that was carried specifically for the purpose. In practice this did not occur during the recording period, though experience suggests these contingencies would be necessary if a longer sample were taken.
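A rough sketch of how word counts of produced text can be recovered from such a repository is given below; it assumes Git and illustrative date and author values, whereas the thesis specifies only that some revision control system was used.

    # Approximate words added to tracked files during the sampling period.
    diff = `git log -p --since=2013-04-05 --until=2013-04-19 --author=subject`

    added_words = diff.each_line
                      .select { |l| l.start_with?('+') && !l.start_with?('+++') }
                      .sum    { |l| l[1..-1].split.size }

    puts "approximate words produced in tracked files: #{added_words}"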

Email Emails are, again, stored automatically with sufficient metadata as to make them self-documenting. However, rather than presume all were read in a given day, each was tagged after being read with a label corresponding to the day. At the end of the sampling period, these tags were collected and downloaded in mbox format, from which they were processed by the operationalisation script.
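A hedged sketch of that processing step is given below: it splits a per-day export on the conventional 'From ' separator lines and counts body words per message. The file name is hypothetical, and the naive split glosses over mbox quoting details, but it conveys the shape of the script.

    # Count body words per message in a per-day mbox export (naive 'From ' split).
    def mbox_messages(path)
      File.read(path).split(/^From /).drop(1)
    end

    words = mbox_messages('2013-04-07.mbox').sum do |message|
      _headers, body = message.split("\n\n", 2)
      body.to_s.split.size
    end

    puts "email words read on the 7th: #{words}"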

4.4.3 Recording Procedure

The study is based around a period of continuous sampling using the methods discussed above. For two weeks (in some cases longer), data were captured for each source. This process was structured around a daily routine. Upon waking, and before any language was used, the recorder was turned on and a note of the time at which this occurred was made in the live notebook. Recording was then continued until the end of the day without interruption. The live notebook, smartphone, and flash drive were carried at all times. Since each of these could be backed up (the smartphone even did this automatically), at most a single day's data was at risk. Notebook entries were made as soon as was possible without interrupting the linguistic event being recorded. At the end of the day (immediately prior to sleep), a journal entry was written in the nightly journal, and SQUID logs were uploaded for the day. This journal entry forms a narrative, estimating the time taken and attention paid to items in the live journal for that day, as well as expanding anything that had been written in shorthand-mnemonic form.

4.4.4 Operationalisation, Processing and Analysis

Normalising and operationalising such heterogeneous data without large overheads proved to be a significant problem that was only partially solved, and the data set presented here required manual editing that was possible in part because the analyst is also the subject. This advantage, clearly, cannot be relied upon in other studies, and this part of the method demands the most further study in order to define typical parameters for the many processes that are dependent on human properties. Two main processes were followed in data processing. The first was aggregation and normalisation: each data source was collected and transformed into a one-event-per-line CSV containing a standard set of fields (the selection of which was modelled on Lee's BNC index[110] in order to facilitate comparison). After this normalisation process was complete, data were manually annotated to complete any fields that were not stored in the original metadata. This was largely an objective, uncreative task that simply demanded human reasoning capacity, but it is inevitable that some bias will creep in at this stage. The second stage of processing involved coding text types and roles. This task is altogether more flexible and subject to design errors and bias than many of the normalisation stages, and was thus approached in a manner that was designed beforehand. Since the aim of this study is, in part, to identify text types not seen in other corpora, following an existing taxonomy would necessarily limit the coding of any newly discovered types. True free coding, however, is likely to draw distinctions between text types that are not made in existing taxonomies, rendering them incomparable. The process followed was a hybrid approach: data were freely coded by inspection of the texts, but this was done with deliberate prior knowledge of Lee's classification scheme. The intent was to categorise texts according to Lee's scheme only in so far as they were deemed suitable by the analyst (who is also, lest we forget, the subject). Though this approach was suitable for the aims of this particular study, it is difficult to advocate it for other studies using the sampling techniques described, and its use here should not be taken as such an endorsement.
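The normalisation step can be pictured as every source-specific extractor emitting rows with one shared field set, which are then concatenated into a single event-per-line index. The sketch below assumes field names loosely based on Table 4.1 rather than those of the original scripts, and the sample event is invented.

    # Write one source's events out using the shared, Lee-inspired field set.
    require 'csv'

    FIELDS = %w[source source2 day mode circulation portion_read informal_genre
                medium title author words duration]

    def write_events(path, events)
      CSV.open(path, 'w') do |csv|
        csv << FIELDS
        events.each { |event| csv << FIELDS.map { |field| event[field] } }
      end
    end

    write_events('sms.csv', [
      { 'source' => 'sms', 'day' => 7, 'mode' => 'written',
        'portion_read' => 1.0, 'informal_genre' => 'personal', 'words' => 14 }
    ])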

Human Interest Models

Beyond coding, by far the largest single influence on the data recorded was the human interest model applied. This was created in order to take into account two factors that had become particularly apparent (and notably do not apply in the same manner to conventional corpus designs, where many eyes may cover a whole document in sum):

• Often, only small (usually predictable) portions of a text are used. For example, I have started to read more books than I have finished reading. Generalised, this means that even Brown-styled corpora should favour the start of their texts slightly when selecting excerpts. Some media were more susceptible to this than others, and the automated normalisation tools were built with facilities to take this into account (Amazon are able to monitor which pages of their Kindle ebooks are read, meaning that they possibly hold a large dataset related to this effect).

• Texts, especially broadcast media and speech, were often used whilst also accomplishing a non-textual task (or sometimes both at once, such as talking with the radio on in the background).

Both of these were noted in the journals, and added to the processing toolchain: each data source's normalisation script contained a model to extract the portions of text that were read, and each row of the normalised data format contained an 'attention index', ranging from 0–1, that served as a coefficient of the word count. Though crude, this measure was able to produce approximations for word counts that were in line with the expectations of the subject. It is recognised that this may not hold much scientific value to others wishing to replicate the study, and in general it is necessary to investigate the inter-person variability of these properties in order to create more generalised processing tools.
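In effect, the adjustment is a single multiplication per row, as in the sketch below (field names are assumed, and the example values are illustrative).

    # Scale an event's word count by its attention index to give 'computed words'.
    def computed_words(event)
      (event[:words] * event[:attention]).round
    end

    puts computed_words(words: 442, attention: 0.1)   # a partially-read web page => 44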

Web logs The human interest model for web logging was built by inspection of the web logs and cross-referencing with information about the hosts identified. The normalisation script is concerned primarily with removing material that was downloaded without ever having been viewed (for example, non-text MIME types and advertising), and contains a multi-stage strategy for excluding content, sketched after the list below:

1. Filter only successful requests;

2. Filter only those requests that are of textual MIME types (text/*);

3. Apply a blacklist of advertising websites and file extensions (manually constructed);

4. Discard links where the page was reloaded and the URL is the same as the previous entry in the list (this pattern is often caused by initially connecting to the proxy, for example);

5. Discard non-visible text items (non-body elements if the file is HTML), and remove markup.
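A rough sketch of this exclusion strategy, applied to already-parsed log entries, is given below. The field names, sample entries, and blacklist contents are assumptions, and step 5 (stripping non-visible elements and markup) would operate on the fetched HTML itself rather than on the log.

    # Steps 1-4 of the exclusion strategy over parsed SQUID access-log entries.
    BLACKLIST = /doubleclick|google-analytics|\.css\z|\.js\z/i   # illustrative only

    entries = [
      { status: 200, mime: 'text/html', url: 'http://example.com/article' },
      { status: 200, mime: 'text/html', url: 'http://example.com/article' },  # immediate reload
      { status: 200, mime: 'text/css',  url: 'http://example.com/style.css' }
    ]

    kept         = []
    previous_url = nil

    entries.each do |entry|
      keep = entry[:status].between?(200, 299) &&            # 1. successful requests
             entry[:mime].to_s.start_with?('text/') &&       # 2. textual MIME types
             entry[:url] !~ BLACKLIST &&                     # 3. advertising/extension blacklist
             entry[:url] != previous_url                     # 4. reloads of the previous URL
      kept << entry if keep
      previous_url = entry[:url]
    end

    puts kept.size   # => 1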

Beyond this initial normalisation, a spreadsheet’s LOOKUP function was used to manually assign attention coefficients to the domains, based upon entries in the journals and interesting portions of web pages that follow regular structures. Days were classified as changing at 4am, since there were no points in the data set where the subject was still using the web at this time.

IRC Logs IRC logs were already stored using a limited attention model, based on the principle that, in IRC, conversations are started, live for a short time, then die off as people in the channel return to work. The bot would log (and continue to log) for as long as the subject was talking, stopping 10 minutes after the final utterance. In order to capture a human notion of a day, that is, one demarcated by sleep rather than midnight, the start time of each conversation was used to determine the day its data fell into.
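A sketch of that interest model is given below; the nicknames, timestamps, and message structure are illustrative stand-ins for the bot's actual configuration.

    # Log a message if the subject wrote it, or if fewer than ten minutes have
    # passed since the subject's last utterance in the channel.
    require 'time'

    IDLE_LIMIT = 10 * 60   # seconds

    messages = [
      { time: Time.parse('20:00'), nick: 'subject', text: 'anyone about?' },
      { time: Time.parse('20:04'), nick: 'alice',   text: 'yep'           },
      { time: Time.parse('20:20'), nick: 'alice',   text: 'missed you'    }
    ]

    last_spoke = nil
    logged     = messages.select do |msg|
      last_spoke = msg[:time] if msg[:nick] == 'subject'
      last_spoke && (msg[:time] - last_spoke) <= IDLE_LIMIT
    end

    logged.each { |msg| puts "#{msg[:time].strftime('%H:%M')} <#{msg[:nick]}> #{msg[:text]}" }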


Terminal (Console) logs Terminal data was logged with timestamps on each individual character, and the model was thus responsible for inferring when a command had produced output, and for suggesting which portions of text were still on screen. This was done with a 'timeout' and a 'scrollback'—the former describing a delay that had to be present for the text to have been read (rather than simply scrolling offscreen), and the latter describing the average size of a terminal (and thus the number of rows of text that remain displayed). These parameters were tuned to match the specific data sampled over the period—for example, the recording period included terminal use displaying logs from software that was being developed, and these would produce thousands of lines of output before a pause.
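The sketch below illustrates the two parameters over a hypothetical log structure in which output is grouped into timed chunks; the parameter values and data are assumptions, not those tuned in the study, and the final chunk is left unassessed (it would need the session end time).

    # Keep only output that was followed by a pause, and then only the rows that
    # would still have been on screen when the pause began.
    TIMEOUT    = 5     # seconds of inactivity required before text counts as read
    SCROLLBACK = 50    # rows assumed visible in the terminal window

    def read_lines(chunks)
      chunks.each_cons(2).flat_map do |current, following|
        pause = following[:time] - current[:time]
        pause >= TIMEOUT ? current[:lines].last(SCROLLBACK) : []
      end
    end

    chunks = [
      { time: 0,  lines: Array.new(2000) { 'build output ...' } },  # scrolls past unread
      { time: 1,  lines: ['tests passed'] },                        # followed by a pause
      { time: 30, lines: ['$ '] }
    ]

    puts read_lines(chunks).size   # => 1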

4.4.5 Coding and Genres

Since the case study is in part aiming to identify genre distinctions not found in other general-purpose corpora, it is not possible to select, a priori, an authoritative taxonomy of genres. This problem is further complicated by the mix of spoken and written data in the corpus, something that would usually be more formally separated. Genre distinctions were made using a process that loosely follows the principles of the coding stages of grounded theory[59]. This involves a focus on free coding, and a consideration of all available context and information—something aided by the 'memoing' innate in the design of the journals. Free coding was performed with prior knowledge of Lee's genre categories. This was done in order to deliberately apply distinctions made by Lee where these seemed appropriate, but to retain the flexibility to deviate where portions of the corpus did not fit comfortably within the existing categories. This soft alignment was chosen in order to ease comparison against existing corpora, especially the BNC. No effort was made to adhere to Lee's specification directly, so as to prevent forming a disconnect between the Lee-inspired categories and those that were sufficiently different to form their own. Because of the large volume of data, and the manner in which it was extracted from its original sources, coding was performed using a semi-automated process on a source-by-source basis. This was chosen in part because the source itself refers to how the language was used by the subject, rather than simply the form in which it was available to the corpus builder, and is thus a contextual factor affecting genre distinctions. This would have to be documented in more detail for other studies. For data that were processed and annotated largely automatically, codes were assigned after a systematic review of the data itself, using a spreadsheet to reduce the number of lines according to variables that were assumed to define a given genre (for example, web data was split by domain, and music was assigned by artist). These distinctions were then written into the extraction tools as heuristics, or applied directly before being merged with the main corpus. For data such as images and notebook entries, manual coding was required. This was completed using similar tools, except that it was seldom possible to reduce the data set before coding, and thus it was not possible to apply heuristics to impute data. Data recorded in these formats were often more varied, with many unique or esoteric entries (such as the single tax disc, a format that no longer exists after the UK's move to managing vehicle excise duty electronically). In order to better align the informal genre distinctions with Lee's format, the final document genre distinctions were formed by prefixing the data source, i.e. transforming 'rock' into 'music/rock'. These resultant genres are used herein for describing the corpus, and as the basis for any direct comparison with Lee's BNC index. This process of augmenting manual entries would, if extended to use external data sources (such as the CDDB music information service, now known as 'Gracenote': http://www.gracenote.com), be capable of automatically assigning genres for a number of data sources, greatly reducing the manual intervention required. It is also likely that improvements in computer vision (such as Project Naptha[106]), or applications of existing databases (such as Wikipedia, or the web itself), could be used to extend these methods further to cover images and other data that were manually processed here.
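A sketch of the resulting two-step assignment is given below: a per-source heuristic (here keyed on web domain or music artist, as described above) proposes an informal genre, and the computed genre then prefixes the data source. The lookup tables shown are illustrative, not those built during coding.

    # Compose the 'computed genre' from the data source and an informal genre.
    DOMAIN_GENRES = { 'imgur.com' => 'entertainment/social' }   # illustrative
    ARTIST_GENRES = { 'Some Band' => 'punk' }                   # illustrative

    def computed_genre(event)
      informal = case event[:source]
                 when 'web'   then DOMAIN_GENRES.fetch(event[:domain], 'uncoded')
                 when 'music' then ARTIST_GENRES.fetch(event[:artist], 'uncoded')
                 else event[:informal_genre] || 'uncoded'
                 end
      "#{event[:source]}/#{informal}"
    end

    puts computed_genre(source: 'music', artist: 'Some Band')   # => music/punk
    puts computed_genre(source: 'web',   domain: 'imgur.com')   # => web/entertainment/social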

4.5 Results

These results describe data collected during the third iteration of sampling, which lasted roughly two weeks during April 2013. Figure 4.6 describes the availability of data per day. The period from Friday the 5th to Friday the 19th (inclusive) forms the sampling period described herein (with exceptions noted where made). This was the period for which the live notebook was maintained along with the other intrusive data collection techniques, though notably some methods continued afterwards due to their ease of use. Audio data is missing for the 19th, as sampling was initially intended to end the day before; however, the audio was not used to produce estimated word counts, as other recording methods cover those data sources. Though attempts were made to use the audio data, difficulties in automated processing meant that transcription (or precise word counts) was only possible using manual review (something only the subject could legally perform). Word counts were instead based on an estimated words-per-minute rate. This fifteen-day recording period covers 8,619 linguistic events, encompassing an estimated 980,000 words. In total 3.4GiB of data were collected, though the majority of this figure (2.3GiB, 67%) is audio data. This is roughly in line with other life-logging studies[54]. The overall word count changes massively when the attention model is disabled, rising to roughly 5 million words. As we shall see later, this is largely because the data sources and genres which dominate the data set are particularly heavily adjusted. This raises the question of how we may ensure the accuracy of attention models, something that shall be addressed in the discussion section below. Variables recorded on each linguistic event are based heavily on Lee's BNC index, augmented to take into account the attention index and word count estimates. A summary of these is provided in Table 4.1. The missing column describes the proportion of rows missing this field, either because it is not applicable, or because insufficient data was recorded to complete it with confidence. The variables source2 and duration have missing values exclusively due to non-applicability. As such, they have been integrated into (complete) summary variables for analysis, forming the computed words and computed genre in an effort to unify the measurement of spoken and written material, and increase the specificity of the data source recorded.

Figure 4.6: The availability of data from various sources

Variable (proportion missing): Description

source (-): The data source (email, audio, etc.)
source2 (34.3%): Any origin data from within the source (i.e. the caller in a phone conversation, or the sender of an email)
day (-): Day of the month sampled
mode (-): Spoken or written
circulation status (-): Exact circulation, or 'h', 'm', 'l' following Lee's guidelines
portion read (-): The attention index mentioned above
informal genre (-): Free-coded genre
medium (-): The form of the language used; usually the same as the source
title (-): The title of the document, or an identifier for the text
author (18.0%): The author[s] of the text
author age (97.2%): The age of the author[s]
author sex (97.1%): The sex of the author[s]
author type (-): An indication of the type of entity authoring the text, following Lee's coding for single, multiple, and commercial
produced/consumed (-): Whether, during the transaction, text was produced, consumed, or engaged with interactively
words (2.8%): Authoritative word count
duration (97.2%): Authoritative duration for spoken data
computed words (-): Estimated word count, merging the durations and word counts above
computed genre (-): A composite of the medium and the informal genre, intended as a more specific genre representation

Table 4.1: A summary of variables sampled

4.5.1 Activities

During the sampling period, all work surrounding the personal corpus case study was stopped, in favour of other, ongoing, projects. Other than this intervention, work continued as normal, which for this period involved significant amounts of development on the LWAC download tool (detailed in Chapter 3). This meant that the bulk of each day was taken up with terminal output events, and editing files. Part of this work involved writing Ruby code, patches from which were used to construct token counts of the 'code/ruby' type. Another significant portion of this time was spent documenting the tool, which required production of longer textual documents (rather than code).

During work, the subject listened either to the radio, or to music, accounting for the large volume of broadcast media and music in the corpus. Mornings and evenings also usually featured the radio, and often some form of TV (either streamed online or broadcast).

4.5.2 Distribution

[Figure: bar chart of event counts per day (days 5–19), split into spoken and written events; m=574, sd=222]

Figure 4.7: Daily event frequencies

Figure 4.7 shows the number of events recorded each day. Event counts range from 124 to 969, with much of the variation being expressed in terms of written data. Contrary to expectations, and in part due to large amounts of terminal and web use during software development work, the majority of daily language use is written. This is contrasted by Figure 4.8, which indicates that spoken events are likely to involve greater numbers of words in a single event (the spoken median is 128 words, vs. 27 words for written). This is largely explainable by a penchant for consuming broadcast media (especially spoken radio), which have large amounts of ongoing spoken content.

[Figure: bar chart of word counts per day (days 5–19), split into spoken and written words; m=65786, sd=15721]

Figure 4.8: Daily word counts

As shown in Figure 4.8, the word count broken down by day shows that between 35,000 and 80,000 words were used each day. As we shall see later, this is mainly explainable by way of broadcast media and web page views. It is notable that weekends fall upon days 6, 7, 13 and 14. There is no significant pattern by day. Before this study it was conjectured that the vast majority of all language used would be spoken. Figure 4.9 shows the breakdown of words used by the subject, by others, or in a manner inseparable for the purposes of the linguistic event (such as conversation). Overall, 85% of words were consumed (95% of events), 15% were interactive (just 4% of events) and 0.5% were produced (0.5% of events). This indicates that linguistic events involving some level of interaction yield more words, either by form or duration, something backed up by median frequencies (55 words for production, 27 for consumption, and 80 for interaction events).

4.5.3 Events

The proportions described in Figure 4.10 are likely a very individual trait of the subject, affected not only by daily routine, but also by the specific work being undertaken at the time of the sample (the implications of this for scientific validity are discussed below in Section 4.6.6). Particularly curious is the large number of terminal, web (squid), and music events. These are partially explainable by the work being completed at the time of the study, as the subject was developing software, listening to music whilst doing so, and using many web references as documentation. The subject also uses a GUI on his computer that mandates use of terminals for many tasks (this is doubtless atypical of many populations).

Figure 4.9: Daily word counts broken down by recipient status (interactive, consumed and produced words per day, days 5–19)

Figure 4.10: Event count by source (squid, music, terminal, notes, file, irc, email, photos, sms, call)

[Chart: event count by computed genre]

Computed Genre               Events    Computed Genre               Events
web/entertainment/social       2598    speech/informal                  21
web/music                       708    web/comedy/social                21
display/output                  627    email/orders                     20
music/punk                      587    speech/service                   19
web/reference                   421    web/search                       19
web/shopping                    388    email/spam                       17
display/interactive             285    music/blues                      17
web/academic                    280    speech/greeting                  16
web/tech                        218    web/fitness                      16
web/outdoors                    180    web/tech/news                    15
web/video streaming             167    speech/info                      14
irc/informal/discussion         159    web/tech/shopping                14
music/guitar                    147    web/reference/academic           13
web/academic/reference          142    TV/news                          12
web/tech/reference              116    radio/music                      12
web/news                        105    radio/news                       12
file/ruby code                   82    sign/info                        11
music/metal                      73    web/art                          11
music/pop                        68    web/music/shopping               11
music/rock                       67    music/folk/metal                  9
web/hosting                      54    music/progressive                 9
file/documentation               49    email/technical                   8
web/comedy                       49    sign/info/instruction             8
web/social                       49    speech/ephemera                   8
web/product support              48    web/banking                       8
product/packaging                43    music/country/rock                7
web/tech/academic                40    music/fusion                      7
TV/comedy                        39    sign/product info                 7
web/outdoors/reference           36    book/info                         6
email/academic                   34    sign/advertising                  6
email/advertising                28    TV/documentary                    5
web/entertainment                28    flyer/advertising                 5
email/personal                   26    notes/note                        5
file/config                      24    product/info                      5
email/business                   22    speech/parting                    5
speech/discussion                22    web/advertising                   5
TV/entertainment                 22    web/conspiracy                    5
items < 5                       180

Table 4.2: Event frequency by computed genre

Table 4.2 shows a breakdown of event frequencies by computed genre (which includes the source). This shows a somewhat predictable Zipfian distribution of genres surrounding a number of themes:

• Daily events, such as reading social media websites;

• Short events, such as consuming single programs of media (especially songs, due to additive effects with the above);

• Any of the above that particularly surround the personal interests of the subject.

Clearly, the last of these is the most obvious and desirable effect of building a corpus using this method. The overwhelming prevalence of web/entertainment/social events may be explained by one website, imgur.com, which displays a single image per page (along with socially-contributed captions) and thus demands many page reloads for relatively little content. The mean words-per-event for each genre is listed in Table 4.3. The embedded graph shows the log word count for each, displaying a typical Zipfian distribution but with an unusually fat tail that drops off sharply at the five-word level. This would seem to indicate the minimum length necessary to convey information without presupposing context (e.g. branding, packaging and signage).

4.5.4 Word Counts

Word counts were established either directly or by assuming an 80 words-per-minute speech rate. This rate is a conservative estimate according to some literature—Yuan et al. [187] estimate almost twice this, however, they do so for continuous conversation, and much of the corpus’ spoken data comes from media with interruptions such as TV or music. The music included in the corpus typically has speech rates falling between 40 and 70 WPM, a figure easy to calculate using song lengths and lyric sheets.

Table 4.4 shows that the distribution of word count per event is heavily skewed towards the lower end, with the median being 27 words per event. This figure also reflects the degree to which the attention model affects results—with that calculation disabled, the median word count is affected significantly by the largest source of data, web pages, becoming 442 words. Most pages are, even compared to other types of event in the corpus, read only partially (often graphical clues indicate which portions to read, especially since most people revisit websites more than they visit new ones). Figure 4.9 shows that the majority of all words are consumed, either passively as the result of receiving broadcast media or as part of a conversation. The vast majority of the ‘produced’ word count is covered by writing documentation for LWAC (a tool detailed in Chapter 3), and it would seem likely that most people’s non-interactive production would instead be spoken.


Genre                       Mean Words    Genre                       Mean Words
radio/news                        4183    web/music/shopping                 120
book/info                         1950    music/metal                        119
TV/news                           1876    music/guitar                       118
radio/music                       1693    file/ruby code                     117
speech/informal                   1390    web/reference/academic             103
TV/Entertainment                  1000    web/outdoors/reference             100
TV/comedy                          949    web/academic/reference              95
speech/discussion                  930    web/shopping                        91
TV/entertainment                   760    web/reference                       90
email/personal                     575    speech/ephemera                     90
speech/service                     381    music/punk                          86
web/social                         286    web/entertainment                   83
web/product support                278    web/tech/shopping                   81
web/comedy/social                  266    web/outdoors                        77
music/blues                        248    file/config                         75
email/academic                     247    music/fusion                        71
file/documentation                 238    web/banking                         54
speech/info                        225    web/academic                        50
web/tech/news                      194    web/fitness                         45
web/tech/reference                 185    web/hosting                         36
web/news                           184    email/technical                     33
music/folk/metal                   183    display/output                      32
web/music                          173    web/entertainment/social            26
music/progressive                  156    email/spam                          17
speech/greeting                    150    display/interactive                 16
email/advertising                  149    web/video streaming                 16
music/pop                          147    web/search                          15
email/orders                       140    sign/product info                   10
web/art                            139    product/packaging                    8
irc/informal/discussion            132    sign/info/instruction                8
music/rock                         130    sign/info                            6
email/business                     129    sign/advertising                     6
web/tech                           124    web/comedy                           3

Table 4.3: Log word count by genre and table of mean word count per genre (for all 66 genres with over 5 occurrences)

Percentile        5%     10%    20%    50%     70%     90%      95%
Words per event   0.00   0.00   8.90   27.85   74.00   217.00   354.88

Table 4.4: Percentiles for words-per-event.
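To make the estimation procedure concrete, the following minimal sketch (in Ruby, with hypothetical method and variable names that are not taken from the study's tooling) shows how a duration-based word count might be derived under the 80 WPM assumption described above.

    # Estimate the word count of a spoken event from its duration, assuming a
    # fixed speech rate. 80 WPM is the conservative default used in this
    # chapter; music can be given a lower rate (e.g. 40-70 WPM, derived from
    # song lengths and lyric sheets).
    def estimate_word_count(duration_seconds, words_per_minute: 80)
      (duration_seconds / 60.0 * words_per_minute).round
    end

    estimate_word_count(30 * 60)                    # => 2400 (a 30-minute radio programme)
    estimate_word_count(210, words_per_minute: 55)  # => 193  (a 3.5-minute song)

Where the attention model discussed later in this chapter applies, such raw counts would then be down-weighted accordingly.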

4.5.5 Comparisons

At a large scale, there are some obvious differences between the personal corpus and the BNC’s selection. One of the most striking of these is the rate of broadcast media consumption, which forms a full 50% of the subject’s spoken word count, yet only 10% of the BNC’s. There is a fair argument to say that many people of a similar age, who grew up surrounded by online streaming services and easy access to cheap receiving devices, will share this disparity. A less compelling difference is exhibited in the rate of spoken conversation—our subject’s spoken data was 20% conversational, whereas the BNC contains 40% conversational data (as a proportion of its spoken component). The degree to which this applies to others is difficult to estimate, since (contrary to the verbiage of this thesis) the subject may simply be unusually laconic.

Arguments for the diversity of texts within corpora are driven only vaguely by the quantitative findings of this study due to its restricted inter-person relevance. Nonetheless, some trends were identified which could reasonably be expected to extend to many others in a similar culture. Inspecting the selection of genres yields a diversity not covered in many corpora. Though the BNC is notably quite good at this, covering many brochures and other incidental texts such as unpublished letters or school essays, it hardly represents the number of times thanks were exchanged for holding a door open, bottles, cans and other labels were read (particularly addresses on mail), or a tax disc was checked. It would seem that a corpus purporting to cover a large population would be awash with weirder texts, especially as those less academic than the subject here are likely to have relatively higher proportions of them in their corpora.

A particularly compelling argument is made for the inclusion of song lyrics. These are a large portion of the corpus collected here (163,288 words), and often spawn sayings, memes and linguistic affectations that can last decades in popular culture (“will you do the Fandango?”). It is also notable that this is one of the easier things to sample, and music use by the population is meticulously documented and studied by the industry already (often in publicly available lists). In the case of music, use of English as a lingua franca would see its influence spread far beyond the bounds of a national corpus (as well as absorbing influences from other national flavours of the language).

Genres Absent from the Corpora

There are a number of genres absent from the personal corpus that are listed in the BNC. Those absences that are unlikely to change if the sampling period is extended indicate lifestyle differences and, in sum, suggest the likely representativeness of the BNC relative to the subject. Where a genre is seen as likely to be represented given sufficient time, its absence is evidence that the sampling period was insufficient (something that the BNC’s breadth mitigates). Comparisons were performed between the list of composite source/genres given in Table 4.2 and Lee’s genre listings. Where any overlap was judged to have occurred, a genre was discounted entirely.

Medium        Genre Category
Spoken (S_)   classroom, courtroom, demonstratn, interview*, lect*, sermon
Written (W_)  humanities_arts, ac* (all but tech_engin), biography, essay*, fict_poetry, hansard, newsp_*_sport, pop_lore, religion

Table 4.5: Major categories missing from the personal corpus but present in the BNC.

The 11 (of 24) top-level categories missing entirely are summarised in Table 4.5. Of those spoken, only interviews and lectures are likely to ever be encountered by the subject during a longer sampling period, and the former of these would require a sampling period of years for most subjects. Written material is more comprehensively covered, with 15 of 46 genres missing. Whilst newspaper articles are absent from the corpus in their original form, many articles of the same genres were read online (with the exception of sports articles).

As expected, the genres identified within the personal corpus are an imperfect subset of those featured in the BNC. Following Lee’s index, there are no magazines, religious texts, newspapers, dramatic scripts, school or university essays, texts on the subject of commerce/economics, or academic texts in many fields. Newspapers alone comprise 9.9% of the BNC, implying that caution should be exercised before applying any findings from the BNC to any single subject. Any comparison should also take into account the size differential—the BNC is roughly 100 times the size, and it is reasonable to presume that many of these genres would be covered by a 100-person replication of the two-week sampling procedure. The strength of this assumption is something the current experimental design is incapable of evaluating.

Demographic Changes

A number of these genre differences are specifically down to the demographic coverage of the corpus (or lack thereof). Though these differences are to be expected due to differences in corpus design, they illustrate the fallacy of applying results obtained for a population to any individual therein without qualification. These categories have been selected because they are not only missing from the personal corpus, but are likely to remain missing for the subject, regardless of the sampling duration used. Such missing categories include:

• School work (and thus many essays);

• Anything relating to religion (sermons, religious texts, etc. are missing entirely, yet are among the most popular books for the wider population);

• Business-oriented documents (due to the subject’s status as a PhD student);

• Medical texts;

• Many academic subjects’ materials.

Temporal Changes

The BNC is now ageing, and some changes are expected that likely generalise to enough of the BNC’s population to make a significant difference to its representativeness. Clearly, the personal corpus described here does not rigorously explore these; rather, it is a heuristic for how such changes may affect the representativeness of a larger corpus in real-world terms. Since the BNC pre-dates ubiquitous technology such as smartphones (and, largely, the web), it clearly omits what has been shown to be the largest single source of data in the personal corpus. The severity of this impact is doubtless affected by the technical experience and interests of the subject, who is both relatively young and technically capable compared to the wider population; however, this trend is widely acknowledged and set to continue.

Arguable is the extent to which other genres have been co-opted by the web, suggesting that it is being used more as a transmission medium than as a conventional genre. The depth of this analysis is defined by one’s purpose, though it is notable that the breakdown of text genres from web sources in the personal corpus extends Lee’s ‘email’ category into 8 sub-categories, all of which overlap significantly with those used elsewhere. This pattern is repeated for the web (the personal corpus’ largest source of data), which yields 68 genres with over 5 entries [25]. Some genres have arisen that do not so easily map onto the original BNC design. Email spam is one, presumably contained within the ‘email’ category. Another is the use of social networks, which has no direct correspondence in the BNC genre listing, and could be thought of as a form of very informal news, or very informal social interaction in line with public debate (pub_debate) or conversation (conv).

It is unfortunate that the BNC, even with Lee’s annotation, lacks sufficient demographic detail to identify where the subject falls within the BNC sampling frame. This would allow approximate weights to be applied in order to rebalance the BNC as an augmentation of the personal corpus. Particularly interesting are instances where the hierarchical categorisation of the BNC indicates that both corpora have sampled from the same sources, yet the resulting within-category proportions differ. This is true of many of the news sources, where there was a strong bias towards tech. news, and academic sources, which were focused on a single area. We might reasonably expect this effect to be true for many sub-populations, who may prefer a given religion or set of recreational activities.

[25] I have excluded genres with low counts as these are unlikely to deserve their own category in Lee’s classification, though a suitably scaled general-purpose corpus is likely to include many.

4.6 Discussion & Reflection

The quantitative data presented above are of limited utility to others, since the degree to which they describe the subject’s life (rather than some general linguistic trend) is unknown. Without further sampling, or detailed linguistic auxiliary data, we cannot begin to generalise from the above in a useful manner. This does not mean, however, that we cannot reason rationally about how transferable those results are—it is more unlikely that a member of the general population is a software developer than it is that they watch TV, or listen to BBC Radio 4 in the mornings. Moreover, since the purpose of the study is to assess the viability of methods, these may be seen as significantly more transferable than the data they have collected (this distinction is blurred by the inclusion of a human interest model, however).

4.6.1 Types of data

The form each corpus takes is rather variable due to the different sampling methods used. For some sources, e.g. books, the personal corpus is likely to contain more partial samples (as books are seldom read in their entirety within one ‘linguistic event’). Those sources that are read piecemeal are also likely to occur multiple times, something that would not be tolerated in conventional corpus designs.

4.6.2 Method

The process of gathering data was optimised for low intrusiveness, and largely proved practical to pursue over the medium to long term. Maintenance of the live notebook was the major interruption to everyday language use, and this was gradually optimised into a shorthand format that demanded less time in the field, yet more annotation to extract data. Where the subject is someone other than the analyst, the format of the journals would need to be solidified ahead of time to ensure all properties are identifiable.

On reflection, the mnemonic effects of the notebook were most useful in operationalising the data—the narrative they created placed all other data items in context, and the detailed notes in the nightly journal contributed to an episodic memory that aided the coding process. Again, this process of reflection is unlikely to be available to an analyst without explicit insertion of an interview phase where the two may meet to resolve possible misconceptions. The nightly journal, whilst useful during the study as a place to write down easily-forgotten details of language events, was less useful than anticipated during analysis. In reality, the richness of the narrative one quickly notes down of an evening proved difficult to operationalise, meaning that it was largely of use only as auxiliary data to augment the live notebook’s event-by-event format. Some sources of data proved significantly easier to normalise into a workable format than others.

Squid logs proved particularly difficult to process, largely for technical reasons, as extensive whitelists and multiple manual inspection phases were necessary to disregard advertising and AJAX requests. A possible solution to this would be to sample web data closer to where it is consumed, for example through the use of a browser plugin. Initial plans were to use Voice Activity Detection (or full automated transcription) to estimate word counts from verbatim audio recordings. In practice the diversity of unwanted noise effects (background noise, positioning of the microphone, overheard conversations) rendered this practically impossible. Even rudimentary VAD algorithms work with frequency detection strategies, and many were liable to detect things such as music as continuous speech.

Of particular interest to the methodology is the inclusion of a human interest model in the processing stages. Without this model, the data massively overestimates the word counts of many data sources (some more than others), and it is the opinion of the subject that the model improves the plausibility of the data. Generalising the method requires generalising this model, or narrowing down the number of sources it must cover, either by using more selective data gathering methods or by restricting the scope of the sample. Each of the data sources mentioned is burdened with its own empirical concerns that must be considered when designing a human interest model, and general solutions for many of these may be particularly complex. For example, models for websites using large graphical elements as well as text must take into account Gestalt principles as well as models of how humans read text, and the particular interests of the subject. One way this problem may be solved is through detailed elucidation of particular features and interests via a questionnaire or laboratory tasks, in addition to the data gathering methods described here. Clearly, though, relating these to a usefully-complex model of human interest will demand further research.

One of the larger challenges in gathering data from so many sources (and in so many forms) is the amount of work needed to operationalise it. This was done both manually (in the case of complex data such as photographs and notebook entries) and automatically (for the majority of sources). Often, some degree of manual correction or annotation was necessary even where automated processing was used. These processing tools were bespoke, and would need to be generalised if the method is to become viable for gathering further samples. The tooling used is summarised in Appendix B.

A number of questions were raised during the sampling period surrounding the limits of the data that should be captured. The solutions used were taken from the subject’s own judgement of what constituted language ‘use’ (since this case study is predicated upon recording that opinion); however, further work would be needed to establish answers to these in the general case, and other researchers may be guided by more principled linguistic theory. The attention models used to narrow down input have already been mentioned, however, they apply at a very specific level—often, shorter texts such as signs, brands and labels have particularly familiar or conventional text placements, styles, and shapes (some media, such as road signs, deliberately accentuate these features to aid recognition). It is unclear at what point a text is ‘read’ rather than just recognised in the periphery of one’s vision.
The solution used in this study was one of internal vocalisation—if the text was recognised sufficiently well to repeat or act upon it mentally, it was classed as read. This means it is often possible to say that a text source had just a few words read, when in fact a number of boilerplate features were skipped over because of their position and style. An extension of this problem is that of re-reading texts that are being written. The degree to which this occurs is likely extremely variable by person and task; it proves to be a particularly complex form of the above problem and is hard to measure (or even subjectively assess). In this case study no attempt was made to compensate for this effect. Detailed study using some kind of attention measuring system (for example, eye tracking) may allow for deliberate use of attention coefficients greater than one (for word counts) or deliberate repetition of data in the final corpus.

The suitability of the attention system used in this case study is also debatable. The current method of using a coefficient works only for estimating word counts, and does not favour certain regions of the final data over others. Additionally, it conflates two effects—that of reading only a small proportion of the source text, and that of paying little attention to the text (one may also consider using a second text at the same time a third form of inattention). Annotation of simultaneous events was fairly simple in the notebooks; however, estimation of the distribution of one’s attention was done only at a low resolution (the codes used practically equate to ratios of 80/20, 50/50, 20/80 and 100/0, with only a few exceptions).

Since the aim of this study was to assess genre proportions, significant amounts of effort were saved by not converting all sources into verbatim text. This is largely an issue of post-processing, as many methods were digital and thus yielded verbatim text with ease, and those that did not have well-established transcription procedures. Of particular note is the importance of being able to apply a regional attention model to extract (or weight) the most-read portions of a text, something that (as noted above) was not attempted here due to a focus on proportions and word counts.
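As an illustration of the coefficient scheme discussed above, the sketch below (Ruby; the names and the exact coefficient values are assumptions for illustration, not the scheme used in the study) applies the coarse notebook attention codes to a raw word count.

    # Illustrative attention coefficients corresponding to the coarse ratio
    # codes mentioned above (100/0, 80/20, 50/50, 20/80). Values are assumed.
    ATTENTION = { "100/0" => 1.0, "80/20" => 0.8, "50/50" => 0.5, "20/80" => 0.2 }

    # Down-weight a raw word count by the attention paid to the source. Note
    # that this single coefficient conflates partial reading with divided
    # attention, as discussed in the text.
    def attended_words(raw_word_count, code)
      (raw_word_count * ATTENTION.fetch(code)).round
    end

    attended_words(442, "20/80")   # => 88: a long web page that was only skimmed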

4.6.3 Sampling Period & Validity

It seems clear both from personal reflection and examination of the data, which strongly favours certain work-related activities over others, that a two-week sampling period is hardly enough to represent even an individual’s language use. There are a number of obvious reasons for this, particularly egregious examples being:

• Some language sources are usually read slowly, such as books read at bedtime. A sampling period that covers an individual’s various literary interests would have to be very long indeed.

• Life is strongly periodic, and though attempts were made to cover the weekly cycle of work and weekend, many events occur annually (either due to seasons or social convention). There are no Christmas songs in my corpus.

• Many more obscure items were represented only once in the corpus (such as advertising on vehicles or documentation on things such as legal forms). These features may not be periodic, but they are temporally rare.

• A person’s behaviour is likely to be ‘chunky’, following one pattern for a time and then deviating suddenly (for example, during holidays or upon a job change). Significant effort would have to be given to careful description of duration and circumstance in order to make confident generalisations from the data captured using this method.

It is my opinion that this method faces two major challenges which must be balanced in order to achieve some degree of scientific validity. The former, and most addressable, of these is the difficulty of sampling in the field: the more obtrusive a method, the higher the quality of the data gathered, but the shorter any practical sampling period is likely to be. The latter, as already discussed, is the problem of generalising between people in order to create a corpus of inter-person language use that retains the same details discussed here. Given the relative paucity of metadata within many general-purpose corpora and the periodic/temporal complexity of life, it seems reasonable to suggest that an ideal mix would be better formed from long-term (multi-year) samples using low-resolution sampling techniques. This may entail, for example, discarding the on-line journals and manual photography in favour of automated life-log photography and a nightly journal only. Following sufficient repetition of short-duration studies, it would be possible to be more confident in the results of longer-term, lower-resolution samples, as the areas with less uncertainty may be attended to less intrusively.

4.6.4 Data and Comparisons

As mentioned in the quantitative section above, there are some large differences between the corpus gathered here and the BNC (used as an example of a general-purpose corpus that purports to cover the subject). It is clear that a number of these are down to individual variation (the technical and musical genre bias); however, other disparities are of an altogether more ambiguous status. The overall word count of the BNC, and other similar corpora, is called into question by the size of this corpus. In two weeks, a single person used almost a million words whilst focusing heavily on just a few subjects and roles. Given the sheer number of people within the British population, their demographic variability, and the dubious extent to which even these two weeks represent our subject’s use of language accurately, it seems preposterous to suggest that a mere 100 million (or even a billion) words would sufficiently represent language use for almost any purpose.

As previously discussed, the different selection of genres indicates that the BNC’s selection of materials is an imperfect superset of the subject’s use of language. It is reassuring that many of the genres absent from the BNC are minor ephemera (signs, tax discs, etc.); however, there is a significant underrepresentation of others (mainly technical and broadcast media). This underrepresentation is partially due to the subject’s demographics, something that we would expect to skew the relative proportions of the corpus, and partially due to the age of the BNC, something that is particularly undesirable.

4.6.5 Validity

The design of this experiment is subject to a number of challenges to validity. Perhaps the most severe of these is the extremely personal nature of the corpus itself, which renders verification of the data all but impossible except through elicitation and subjective judgement. This is to some degree a property of all case studies, especially those seeking to experiment with methodology. A strong case has been made in many fields (and by the inaccuracy of certain assumptions made during preliminary tests within this study) for the fallibility of subjective opinion, and this is partially the reasoning behind a census methodology—it is a simpler (and hopefully less controversial) task to mechanically and objectively record each linguistic event than it is to estimate their size. The main subjective component of the method used here lies before the linguistic events themselves are recorded: judging when a text is read, rather than unthinkingly ‘seen’. The assumption made for this case study is hopefully uncontroversial enough to be accepted for a majority of purposes: after all, it seems unlikely that we will be able to develop an objective and meaningful threshold for this.

The primary challenge to validity due to subjective reasoning occurs after the data is captured. This is where issues lying beyond the scope of this thesis arise—development of a robust and generalisable human interest model being a major one that has been shown to make a large difference to the results. The model used for this study is deliberately and knowingly subjective, and would not be applicable to any replication effort.

From another perspective, the data set described here is difficult to relate to existing literature due to its orthogonal sampling structure—the whole corpus represents a single (albeit very rich) data point in most other corpus designs, and this robs us of quantitative knowledge of how it relates to other data sources. This question of generalisability has been addressed above by a rational comparison with corpora of known demographic coverage. A better method still would be to extract only those texts from a corpus that match the demographics of the subject described here. Unfortunately, this is not possible to any useful degree given the limited information on the users of texts within conventional corpora, and so a more detailed comparison is again prevented by the lack of known-similar data. These limits on comparison to others are both lifted by restricting the data types being covered, especially where those types are easy to sample. It would be possible, for example, to build a special-purpose web history corpus in which to contextualise a user’s web history, and use this to impute their position in larger corpora with greater-than-zero confidence.

4.6.6 Ethics

The increased resolution of data pertaining to a single individual renders the methods discussed here ethically sensitive. This sensitivity is increased further if continuous recording of audio or video is used, though, as mentioned above, such data was not integrated into my analysis. Future developments of the methods described may use questionnaires or other less-invasive methods as sources of auxiliary data. These would be targeted to a particular study design and need not cover the full set of language uses, mitigating any ethical concerns by limiting the descriptive power of the raw data itself. A number of technical measures are also possible that may mitigate this issue—some of these have been developed by Ellis & Lee working on the Machine Listening project [43], who irreversibly scramble their audio recordings in such a way that VAD algorithms may still run. A further option is streaming of data to a remote server, which can process, summarise, and discard verbatim data on-the-fly to prevent any possible information security breaches.

Particular sources of data raise various ethical and legal issues. These are generally centred around the presumption of privacy, which applies to most direct personal interaction, as well as some more formal communications (such as email) [144]. This relationship is described by Herrera [73]:

Covert research is also not, on reflection, so much like the conversations that we casually engage in. On the contrary, covert research raises a unique problem: those closest and most vital to the ‘conversation’ are valuable only so long as they know less about it than anyone else.

Many of the ethical objections posited in sociological texts apply to the misrepresentation of a field worker, leading to an asymmetric relationship between the field worker and other participants, and necessitating various abrogations of personal and professional ethics. These issues already affect life-logging for personal reasons, something that is differentiated largely by how widely the data is disseminated. Any use of covert methods should always be justified in terms of the importance of the research, the lack of alternatives, and the sensitivity of the data to be gathered. Clearly, recording audio will require a higher standard of justification on each of these counts, and the counterarguments below are written with both approximate (recording genre, setting, etc.) and verbatim methods in mind:

• Eliciting consent would disrupt linguistic content, something that is minor for large interactions but makes representation of shorter utterances and conversations impossible. As one aim of the sampling design used here is to capture a more empirical dataset, any attempts to elicit consent must be less disruptive than those used to build other corpora.

• Without a survey of all language used, it is impossible to assess the validity of hypotheses in the study.

• Language data recorded is unlikely to be directly personally identifiable, or culturally/commercially/governmentally sensitive. This will depend on the status and role of the researcher.

• Results may be reported without release of the data to third parties. The lack of generalisability from having a single data point means that there is little value in releasing the data anyway.

• The study does not centre around sociological issues, and my participation in events will not be subject to judgement or inquiry. Simply, social issues are tangential to the content of the study (though it is notable that some readers may find more interest in them than the author intended).

The BNC, when building its speech corpus, followed a procedure whereby consent was sought, but only after each conversation [31, p. 21]:

All conversations were recorded as unobtrusively as possible, so that the material gathered approximated closely to natural, spontaneous speech. In many cases the only person aware that the conversation was being taped was the person carrying the recorder.

For each conversational exchange the person carrying the recorder told all participants they had been recorded and explained why. Whenever possible this happened after the conversation had taken place. If any participant was unhappy about being recorded the recording was erased. During the project around 700 hours of recordings were gathered.

Though this is a higher standard of consent than was sought for this study, we feel this is offset by the fact that our data is private and reviewed only by the subject himself. This justification is reinforced by the Human Rights Act (1998), which guarantees privacy to prevent the release of unsolicited recordings, but places no restriction on the use of such data by a participant in the original conversation.

4.6.7 Future Work

In the long term, it is hoped that a greater understanding of the above may contribute to:

• Methods for augmenting and rebalancing corpora by identifying the position of a subject within a larger corpus’ population;

• A greater understanding of variance in terms of the populations being studied.

From a sample of just one person, it is possible to use auxiliary data from existing sources to operationalise and reason about inter-person variability. This may be done by cross-referencing a subject’s demographic variables with those from an existing corpus, placing them in context and allowing comparison of his linguistic data to other groups (or to those within a given similarity). This technique can also be used to impute data from partially-sampled sources, creating a personal corpus by re-weighting existing samples.

Unfortunately, many existing corpora are unsuitable for this process due to the limited availability of metadata (something that is also an issue for those constructing ‘informal’ subsets). Methodologically, it is possible to generalise the data-gathering procedures mentioned in this case study either through reduction of the population covered (the technique used here, in extremis), or through reduction of the linguistic fields covered (the technique used more typically in special-purpose corpora). A combination of these two approaches may well lead to the technique being used for many questions. Careful application of elicitation strategies such as questionnaires, or source-specific tools like web usage monitors, may be able to produce sufficient auxiliary data to resample larger corpora; however, these methods would need justification from repeated studies such as the one described here.

4.7 Summary

It is clear that there are significant differences between the language experienced by the subject here and the proportions of any BNC-like corpus. These differences are partially explainable by temporal effects (such as the rise of the web and the portable availability of broadcast media) and partially by demographics. The non-demographic differences imply that the BNC would apply poorly to the population as a whole, if used as a sample of language exposure.

There are a number of methodological challenges that continue to prevent application of the methods described here to larger populations. These are due either to the problems inherent in sampling multi-modal data in the first place (digitising photographs or transcribing audio), or to the processing required to transform data into a usable form (models of attention). It is hoped that further work will be able to develop these methods in order to mitigate these issues, at least for certain applications. However, the utility of these methods is retained under less ambitious conditions, restricting some combination of the domain (for example, only sampling web histories or work-day language use) and the population (i.e. only people very similar to myself). The practical issues encountered during sampling are largely minor, and modern portable technology proved decisive in making capture of hitherto-unseen (or at least widely ignored) sources of text possible in the field. The census design lends an empirical justification to the inclusion of these genres in larger general-purpose corpora; however, we are unable to formally generalise with confidence to demographics other than that of the subject covered.

Of particular note, and perhaps greatest generalisability, is the number of words used throughout the sampling period. Almost a million words were input and output by a single individual over a two-week period. Since it is reasonable to deduce that this sampling is at best representative only of a normal working week for a single individual, it is rational to suppose that a corpus purporting to cover the whole population of a country demands a sample size far greater than many existing corpora. This would seem to form an argument that it is simply not cost-effective to build a conventional general-purpose corpus large enough to represent large populations, suggesting that the focus should lie on those built for more specialist purposes, or those sampled non-probabilistically. Here, the extreme size of web corpora may be well justified, so long as they can be shown to have been sampled rigorously. This suggestion is reinforced by the obvious variability in the proportions of texts taken from each source, which will vary greatly by lifestyle. Indeed, it seems that the measure of one’s lifestyle by text source is a particularly direct way of characterising inter-person variability, and further work may focus on this problem as a way of simplifying the methods described here (for example, through the use of cluster sampling).

Of interest to this thesis is the high level of detail that may be captured using this method. The census design affords full coverage of a given area about which we may wish to generalise, and the detail it is possible to capture this way makes the data ideal as a ‘seed’ for constructing larger corpora with comparable properties.
The complementary nature of the sample design here offers a way to mitigate a number of problems surrounding identification of metadata distributions in larger corpora, something that will ultimately increase corpus quality overall.

Chapter 5

Describing, Building and Rebuilding

We have seen, in previous chapters, a number of scientific and practical issues that affect current corpus collection methods. Many of these surround the stratification and selection of documents for corpora, particularly when applied to specific research questions. This has typically been a problem for those building ‘general purpose’ corpora due to the large variety of resulting studies. Johansson, in the LOB manual [87], notes:

The text must be coded in such a way that it can be used maximally efficiently in linguistic research. As the possible uses are many and difficult to foresee, the main guiding principle has been to produce a faithful representation of the text with as little loss of information as possible.

This chapter details the design and implementation of a method for performing a number of common corpus building workflows which allow per-use adaptation of these coding stages. These workflows allow for the application of stratified sample designs whilst working with existing concepts in corpus linguistics, allowing for replication and progression of current corpus designs. The chapter begins in Section 5.1 with an outline of the workflows involved, detailing the rationale behind each and the value that formalising the process may yield. Section 5.2 then summarises the use cases the method supports. Section 5.3 outlines the design of the method, specifying the processes involved and comparing them both to existing procedures and to the original workflows. Section 5.4 describes the mechanisms used to implement these designs, and covers the architecture of the software used to test the method. It discusses the algorithms used, as well as the variables selected for the test system and the strengths and weaknesses of these as a mechanism for evaluating the method described earlier. This section then specifies the technical details of the implementation, and provides details of the approximation and optimisation methods used. Finally, Section 5.5 relates the design and implementation of the method back to the aims detailed in Section 5.1, drawing comparisons again to the capabilities of other existing systems.


5.1 Rationale

Existing methods of study in corpus linguistics centre themselves around re-use of large, known datasets. This, as detailed in Chapters 2–4, is potentially the cause of many biases—replication is unlikely to use a comparable dataset, and many studies test their hypotheses on the same dataset from which they derived the initial idea. This re-use of data is common to many other fields (such as sociology [129] and computer vision [68]) where raw data is fundamentally difficult to acquire; however, this need not be the case: the internet offers a source of new text with little barrier to retrieval, offering access to a model closer to that used by many other sciences, where datasets are built with a specific research question in mind, and may be replicated from the original population without relying on the same verbatim data.

Corpora built from online sources are nothing new, and they are largely constructed with ease by automated tools that are able to emulate existing corpus proportions. The ‘conventional’ method of sampling data is to specify parameters external to the text, such as medium or context, which are (with the exception of the relationship specified in the alternate hypothesis) uncorrelated with its content. The process of relating these specifications to the location of documents to be selected is performed, in lieu of a census of such metadata in the real world, largely through expert opinion. Often this means reliance on partial indices such as bestseller lists or library loan records, or simply the fixing of the proportions used (as in the demographic portion of the BNC spoken data).

The lack of availability of consistent metadata online has led the authors of web corpus sampling tools largely to use an intensive query system, providing values for linguistic variables (such as example n-grams) and returning ‘similar’ documents through the use of search engines—either pre-existing consumer ones such as Google or Yahoo, or those built specifically for linguistic purposes. Where a study is interested in frequencies that are correlated with these terms (something that is often difficult to establish even hypothetically), this approach is statistically fallacious. The value of this approach is in the high dimensionality of the input data—it is deemed unlikely that any one variable of interest is ‘truly’ correlated.

In order to apply sampling of external variables to the web, it is necessary to algorithmically approximate the process expert corpus builders follow in translating a desired document’s properties (‘from genre x, of length y’) into its location (‘genre x is found in libraries, transcribe a random extract’). This process is particularly difficult online due to the inconsistent nature of metadata, and the limited scope of the indexes available; however, there is great value in moving the assumptions of expert opinion into the sampling tool:

• Reliability—Algorithmic representations can be precisely repeated. Though it may be necessary to change and improve heuristics over time (in much the same manner as any body of expert opinion is likely to iteratively improve), it is possible to identify and evaluate these changes to place samples in historical context.

• Documentation—Because of the above, the use of each script becomes a source of documentation for the corpus, detailing not just what is in it (and the sampling proportions of each) but also the policies used to decide that. This level of documentation allows for comparison against the research question for which the corpus is to be used at a much more useful level.

• Distribution—Codification of expert opinion allows contribution from around the world in an open-source model. This essentially democratises the expert portions of corpus building by parameterising them.

• Repetition—Reducing the human input required to take a sample allows faster repetition, which in turn allows samples with different sampling units. Fast heuristics would allow for a virtual implementation of the library metaphor [46], retrieving a unique document for each word (or any other sub-document sampling unit) and eliminating the need to account for dispersion in corpus analyses.

One of the reasons the parameterisation of expert opinion has not been attempted is the high dimensionality of many corpora—it is difficult to usefully specify the contents of a corpus and retrieve them without conforming to a very complex set of interactions. This is a well-known problem in Bayesian statistics, and is often solved by the application of Markov chain Monte Carlo (MCMC) techniques [168].

MCMC sampling methods iteratively construct a Markov chain of samples from easy-to-determine distributions, the sum of which (and thus the result of the chain) converges towards the desired distribution. This approach can be used to reduce the dimensionality of requests for web data whilst ensuring that the overall corpus still approaches the desired properties.

5.2 Use Cases

The use of external metadata in description of a corpus allows a number of different uses for the method described here:

• Construction: Manually specifying the distributions of each external variable allows the corpus to be built from scratch according to a given sampling policy.

• Distribution: Distributing only the corpus description document simplifies copyright issues and potentially reduces the amount of data transferred, relying on the end user to reconstruct a corpus and allowing known bounds of variance in its properties to be measured.

• Rescaling: A corpus may be profiled and rebuilt or augmented to resize it whilst retaining the same sampling policy. This is especially valuable where a corpus needs to be augmented with new data, but where there is no reason to discard the original documents.

• Replication: Replication studies are able to use the same corpus description but operate on new data.

• Repair: It is possible to repair a corpus that has missing documents (e.g. due to link rot) by resampling to augment the corpus until the overall distribution is similar.

• Anonymisation: Distribution of the metadata only allows otherwise-sensitive corpora to be disseminated.

5.3 Design

The method proposed here is built around two processes, each of which may be implemented in a number of ways (including through manual processing):

1. Corpus profiling, which produces a metadata-only description of the corpus as a multivariate distribution;

2. Targeted retrieval, attempting to produce a corpus with the same distribution.

The latter of these may be accomplished most easily by a process of bootstrapping: using Monte Carlo methods to sample from a conditional distribution. These samples can then be sought in a number of ways, such as manual (or crowdsourced) selection of documents, or the use of existing search tools.
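These two processes can be summarised as a simple loop, sketched below in Ruby. All names here (profile, the sampler and seeker callables, build_corpus) are illustrative placeholders rather than the interface of the implementation described later in this chapter.

    # Schematic outline of the profile-and-retrieve bootstrap. The sampler and
    # seeker are passed in as callables so that any of the strategies discussed
    # below (marginal or full conditional sampling; search- or crawl-based
    # retrieval) can be substituted.

    # 1. Profiling: reduce a seed corpus to a metadata-only description, here
    #    simply a list of points in metadata space.
    def profile(seed_corpus)
      seed_corpus.map { |doc| doc[:metadata] }
    end

    # 2. Targeted retrieval: repeatedly draw a 'prototype' from the description
    #    and seek a real document matching it, until the corpus is large enough.
    def build_corpus(description, target_size, sampler:, seeker:)
      corpus = []
      while corpus.size < target_size
        prototype = sampler.call(description)
        document  = seeker.call(prototype)
        corpus << document unless document.nil?   # retrieval may fail; resample
      end
      corpus
    end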

5.3.1 Profiling

The process of constructing a corpus description is outlined in Figure 5.1. This may be started either from a seed corpus, or from direct user design.

Figure 5.1: Creation of a corpus description, either from a seed corpus via feature extractors (f(x), g(x), h(x), ...) or from manual user design

For direct use, the user would specify their salient dimensions, and the distribution for each. This is essentially a description of the sampling policy—where variables are discrete and nominal (such as genre labels), this would take the form of a table of desired proportions for each level. Where variables are continuous, a probability density function is defined. Note that this method does not, without prohibitive complexity, allow for specification of interactions between variables, leaving it unable to represent, for example, differing genres within the spoken and written portions of a corpus.

Profiling through the use of a ‘seed’ corpus begins with specification of the salient dimensions, the data for which are then read from the seed (by the f() and g() functions in the figure). As each document is read, the corpus description may come to contain information not only on the marginal distributions of each variable, but also on the effects of conditioning on one or more values. Note that, whichever method is used, this stage is merely a metadata description task. The corpus description document itself contains no more information than the metadata of the corpus from which it is built—for specifications involving full data on interactions this reduces to a list of points in vector space. For a corpus designed only by specifying marginal distributions, it constitutes a distribution for each. An example profile, compatible with the implementation used in this thesis, is shown later in Listing 5.1.

Those specifying variables to describe must, as with all sampling, be mindful of potential systematic correlations with variables of interest to any given research question. The value of the approach presented here is in documentation—it is possible for a researcher to eyeball the variables (and potentially their distributions) and determine whether or not a corpus is correctly conditioned for use in inferring about a given variable. This cannot be said for bootstrapping systems that use internal variables, which seek to copy corpus contents at the risk of varying their metadata.
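For the manually designed case, a marginal-only description might be represented along the following lines. This is purely an illustrative sketch with assumed values; the format actually accepted by the implementation is shown later in Listing 5.1.

    # An assumed, marginal-only corpus description: one distribution per salient
    # dimension, with no interaction information. Discrete variables are tables
    # of desired proportions; continuous variables are given as density functions.
    description = {
      genre:  { "web/news" => 0.3, "web/reference" => 0.2, "email" => 0.5 },
      medium: { "written"  => 0.9, "spoken" => 0.1 },
      # Words-per-document as an exponential density with mean 500 (illustrative).
      words:  ->(x) { x < 0 ? 0.0 : Math.exp(-x / 500.0) / 500.0 }
    }

A description built from a seed corpus would instead store, for each document, a point in this metadata space, allowing the conditional sampling described in the next section.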

5.3.2 Retrieval

Retrieval in accordance with the complex distributions specified in a corpus description is challenging in two ways:

• Samples must be taken from the corpus description in accordance with a complex, empirically-determined distribution;

• Any combination of variables sampled from the distribution must then be sought based on its metadata alone.

The former of these is difficult because many seed corpora may lack sufficient data to specify their distributions, and because of the high dimensionality of the distributions in question. These issues may be addressed using techniques such as Gibbs, slice, or rejection sampling [57, 133]. The sampling of ‘prototype’ documents from the metadata distribution is possible regardless of the manner of its construction (however, those specified as marginal distributions will lack any interactions). Here we assume a full specification, since this is the more comprehensive case. The following subsections describe these processes in more detail.

Sampling Metadata

Sampling from the seed corpus’ profile produces a ‘prototype’ document, which has values of metadata fields that, over time, hold the same distribution as those in the seed corpus. The process of constructing a new corpus is one of continually producing these ‘prototypes’ conditional on the values of metadata already sampled, and then retrieving texts matching each. To perform this resampling, we must be able to sample from the distribution of each variable conditional on all of the others, which requires that the corpus description contains information on interactions between values. This is only practically the product of describing an existing set of documents, as manually filling in the values would be prohibitively time-consuming. Two methods were implemented for selection of the prototype, representing two of the most popular use cases.

• Simple selection from marginals. This is the best case possible where a corpus lacks interaction information, and follows the joint distribution of the metadata properties across the whole corpus. Despite its relative simplicity it may be suitable for some corpus designs, and is analogous to the BNC’s balancing of spoken corpora.

• Full conditional selection. This is implemented by recursively selecting a random sequence of metadata types, and then conditioning on a value drawn from the distribution of that type conditional on the values selected for those previous to it.

It is additionally possible to sample conditioning on fewer variables (yet randomly selecting them each time) in order to emulate the behaviour of a ‘blocked’ Gibbs sampler [168]. The former of the two methods above, whilst faster and simpler, fails to take into account any interactions between metadata, and is included since it is the only method capable of running on a very simply specified corpus description. The experiments in this thesis use full conditional selection, as they have the luxury of a corpus description built from a full seed corpus.
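A minimal sketch of full conditional selection is given below, under the assumption that the corpus description stores the seed corpus' metadata as a list of points (one hash per document); this is an illustrative reading of the process, not the implementation itself.

    # Draw a 'prototype' document by visiting the metadata fields in a random
    # order and, for each, sampling a value from its distribution conditional
    # on the values already chosen (an empirical, Gibbs-style conditional draw).
    def sample_prototype(points)
      prototype = {}
      points.first.keys.shuffle.each do |field|
        candidates = points.select do |point|
          prototype.all? { |k, v| point[k] == v }   # condition on choices so far
        end
        prototype[field] = candidates.sample[field] # draw from the conditional
      end
      prototype
    end

    seed_description = [
      { genre: "web/news", medium: "written" },
      { genre: "web/news", medium: "written" },
      { genre: "TV/news",  medium: "spoken"  }
    ]
    sample_prototype(seed_description)  # e.g. { medium: "written", genre: "web/news" }

Because each value is drawn only from points consistent with the choices made so far, repeated prototypes reproduce the interactions present in the seed corpus rather than merely its marginals.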

Seeking the Prototype Document

The main practical problem of sampling according to the original corpus’ distribution is now framed as an information retrieval task—seeking a document that has the same metadata as the one we sampled from the seed corpus’ profile. Repeating this ‘sample and seek’ behaviour forms the bootstrapping portion of the method, converging on the input distribution as the output size grows. This retrieval task is challenging on its own, but errors in performing it have specific ramifications for this sampling process:

• Dimensionality: As the number of metadata properties in the prototype increases, the search space from which a document must be selected increases exponentially [26]. This has severe ramifications for rejection sampling techniques, which become intractable where the ratio of the search space to the area under the target distribution is high.

• Selection of dimensions: Metadata types are selected for research reasons and do not necessarily correspond easily to existing online indexes or retrieval methods. This means that specifically seeking a document with one value of metadata may be a challenging task.

• Error in selection: In part because of the above point, documents retrieved are unlikely to be a perfect match against the prototype. Errors in multivariate selection will follow the population distribution conditional on those metadata values being sought: something that will introduce systematic (yet measurable) bias in the dataset (the one, extremely unlikely, exception being where the population distribution happens to be the same as the target).

The retrieval of a prototype, then, is limited to approaches that take into account not only the value of a metadata property to be sought, but also the nature of the source (in this case the internet) and any interactions that are likely to expand the search space to intractable proportions. It should also, of course, minimise error in selection.

Figure 5.2: The influence of a biased population distribution on metadata selection (frequency/density curves for the population, the target, and the resulting documents, plotted against metadata value).

If the retrieved document is a perfect match to the prototype, the output set of documents will converge to the target distribution. If it is imperfect but randomly so, there will be an increase in variance in the output distribution. If it is imperfect and the population is not a superset of the input distribution (or, in the worst case, does not overlap with the target at all), then the results will converge to a distribution with bias following the population distribution. In terms of Figure 5.2, we rely on the blue distribution falling entirely within the red one.

It is unlikely that the search space for documents in the population will always encompass the range of the input corpus. In such cases the population will bias the output corpus, something that is also evident in existing bootstrapping tools such as BootCaT[14]. The rejection sampling approach used here, however, makes it possible to measure the difference between the document selected for output and the prototype, providing a measure of error for the sampling method in a manner inaccessible to conventional bootstrapping. This essentially controls for population and search engine effects that are otherwise conflated with the distribution of the input data.

Instead of sampling a corpus of the web, we are sampling one from it, on the assumption that the web is already a poorly-indexed supercorpus of documents that are diverse enough to satisfy the majority of research questions. In terms of Figure 5.2, we assume that the distribution in red has a very high variance (not just in one dimension as illustrated, but also in interaction with other metadata). Clearly, the case in which this assumption is weakest is the replication of web corpora, and it is perhaps strongest for speech or other particularly specialist sources such as medical notes.

There are multiple possible approaches to this retrieval task, which mirror the web corpus building approaches of searching and sorting. Each type of metadata may use one or many of these, depending on their relationship to the structure of the web. They are listed here in descending order of ‘direct applied expert opinion’: the first item in the list relatively dispassionately seeks similarity, and the last uses human judgement directly to identify documents.

i. Searching. Using similar methods to BootCaT and other seed-based corpora, it is possible to identify documents conforming to certain metadata values by using a search engine. This encounters minor difficulties in that it is often necessary to transform an ostensive definition (e.g. the domain from which a text is drawn) into an intensional one (i.e. some keywords or other features typifying the domain) in order to fit the format desired by search engines. If the method detailed here is used only for metadata types which are retrieved by example (the intensional definition above), then this technique for document-seeking is functionally identical to BootCaT’s approach.

ii. Directed Crawling. Another technique widely used in existing tools is crawling. When downloading documents using a spider, it is common to identify the next set of links (the ‘fringe’) and then select from the fringe using a suitability metric. The problem of directing a spider is then one of relating the features visible about a link (link text, URL features, position in text, etc.) to the value of a document that will be gathered. This will often involve similar issues to the searching method: it is necessary to examine document content and compare it to some ideal of each metadata value in order to form the link. Many spiders currently focus on document ‘goodness’ by simple definitions of information content or URL features[154, 49, 18]—these produce samples that are designed to approach simple random samples (SRS) in order to approximate the population of sites over which they run whilst exploiting the hyperlinked structure of the web.
The approach required to deliberately skew this search is arguably an easier problem: a spider may follow rules based on keywords or site features, rather than modelling the content of documents in great detail.

The starting point of each spider may itself be determined by using a search engine. This is common practice, used to direct spiders to relevant starting points; however, the ratio of crawling to searching used for any given selection may greatly change the output.

iii. Existing Indices. The use of existing indices for certain types of metadata is a main technique used in conventional corpus construction. Online, however, this avenue is often neglected (with the notable exception of WebCorp[148] and Barbaresi’s examination of indices as seed URL sources[11]) despite the existence of many subject-specific indices and general-purpose manually-curated web directories. By aligning the taxonomies used in metadata to those in an existing web directory, it becomes possible to use the contents of the directory to access documents authoritatively categorised by humans (or to start crawling from these). This method is subject to errors regarding the age of content on web directories (the popularity of which is waning in favour of search engine technology), as well as those introduced by human categorisers whose aims may not be aligned to the definitions required for a given research question. Additionally, the granularity of each category is fixed and cannot reasonably be changed. Nonetheless, such an approach is close to that used in conventional corpora and is easily defended, providing that the directory is well trusted. Notably, these directories are already used by search engines to inform their results[167].

iv. Manual Selection. Crowdsourcing services offer a cheap mechanism for retrieving documents by requesting that humans manually seek a document fitting the prototype metadata. This method has severe volume and speed limitations not present in the others; however, the accuracy of a human search is likely to be far greater, minimising overdispersion in the output set. Where a corpus must be duplicated as accurately as possible (yet there is less importance placed on scale or speed of rebuilding), this is most likely a good choice for almost all metadata properties. The major disadvantage of this method is that the speed of lookup is greatly reduced, bringing the corpus building process closer to that used by a conventional corpus. The main difference here is that the weight of expertise required to select documents is handed off to those constructing the corpus definition, rather than those searching the indices.

Candidate Selection

Since each retrieved document may be missing metadata in some dimensions, it is necessary to infer the values for these from the document itself. This inference may use external variables such as HTTP headers or meta-tags in the header of the document, or internal ones that are assessed using some measure unrelated to the variables of interest to those constructing the corpus. It is performed by a number of heuristics, and may be thought of as an inverse function to the profile generating functions shown in Figure 5.1.

Documents may be ranked by similarity in vector space only after normalising each metadata distance metric (otherwise, distance metrics liable to output large numbers would be apportioned a larger importance in overall fit). This method also allows imposition of weights to specify which dimensions must fit more accurately.
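To make this ranking concrete, the following is a minimal sketch of weighted, normalised distance scoring in Ruby; the field names, lambdas and weighting scheme are illustrative assumptions rather than ImputeCaT’s actual configuration.

# Hypothetical sketch: rank candidates by a weighted sum of per-field
# distance metrics, each already normalised to the range [0, 1].
def overall_distance(prototype, candidate, metrics, weights)
  metrics.sum do |field, metric|
    weights.fetch(field, 1.0) * metric.call(prototype[field], candidate[field])
  end
end

def select_best(prototype, candidates, metrics, weights = {})
  candidates.min_by { |c| overall_distance(prototype, c, metrics, weights) }
end

# Example metrics: exact-match distances for two categorical fields; a
# finer-grained metric could be substituted per metadata type.
metrics = {
  'Aud Level' => ->(a, b) { a == b ? 0.0 : 1.0 },
  'GENRE'     => ->(a, b) { a == b ? 0.0 : 1.0 }
}

Raising the weight for a given field makes mismatches in that dimension more costly, which is how a corpus designer can specify which dimensions must fit more accurately.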

Iterating to Minimise Error

It is necessary to iterate an arbitrary number of times to retrieve a full corpus. If each prototype document is matched closely, a simple repetition of the above will return a corpus that converges to the desired distribution, however, as mentioned in Section 5.3.2, this approach may incorporate bias from the population. There are two possible approaches to working with biased populations:

• Feedback from output to target distribution: Adjust the target distribution to reduce the probability for those documents already selected.

• Rejection sampling: Select a large number of candidate documents and then identify the document (or subset of documents) that best fits the prototype.

The former of these approaches may seem ideal; however, it will only be effective if sufficient documents have already been selected for each conditional distribution, and if the population distribution is only slightly biased. It will also slow down the seeking stage for each document and, in the event that the population yields no perfect documents, stop it completely. It is a stricter and more reliable approach, but also vastly reduces the practicality of any system using it. Variations on this theme form the basis of many MCMC methods. MCMC algorithms are generally used where it is difficult to estimate marginals; this is not the case for our corpus data, where the marginals are well known (or manually specified).

The latter, rejection sampling, is much more viable, as the clustered and hyperlinked nature of the internet makes retrieval of many similar documents only slightly more difficult than retrieval of just one. The (hyper-)volume of the search space increases greatly as dimensions are added. This is largely mitigated by the directed document hunting approach above, which seeks to minimise the distance between each dimension’s sampling distribution and the target. Simply put, the better the heuristics are (measured in terms of precision/recall), the more tractable any selection is and the fewer documents need selecting. The number of documents that need selecting is proportional to the ratio of the area under the joint distribution of the heuristics’ selections to that under the target distribution:

\[ \mathrm{numdocs} \;\propto\; \int \prod_{d} \mathrm{heuristic}_d \;\Big/\; \int \prod_{d} \mathrm{target}_d \]

This presumes that each of the dimensions selected is entirely orthogonal, but serves to illustrate the curse of dimensionality that applies to the search space.

Where a target is defined in terms only of its marginals (as is likely when constructing a corpus from scratch), the interactions observed will be defined by the source population (in this case online), and the sample will exhibit conditional distributions that are peculiar to the internet regardless of its marginal distributions. This approach relaxes the original sampling criteria applied to the seed corpus too, making replication far simpler at the expense of any best-effort approach to minimising bias across multiple dimensions.

As the population distribution for each dimension is unknown (especially if one considers interactions), it is impossible to know to what degree it is being undersampled. This is another source of bias; however, it is one that is clearly labelled as such: those using a heuristic would do well to know its theoretical underpinnings and the methods it uses to select data (just as any corpus user should read the manual).

The main benefit of rejection sampling is that we are able to degrade the performance of the algorithm gracefully. If, having sampled n documents, we wish to relax the conditions of a match with the prototype, this will have only a minor effect on the corpus (increasing its variance and applying a small bias in the direction of the difference between the population and target modes). The degree to (and conditions under) which this relaxation occurs must be determined according to the particular demands of the application for which a sample is being built.

5.3.3 Measuring Performance

A system built with contingency for accepting certain levels of error is of very little use without some method for measuring it in operationalisable terms. A number of properties of the system are measurable, each yielding information that is useful under different circumstances:

Heuristic Quality. The quality of each heuristic may be measured according to its ability to return documents satisfying the one metadata value it is seeking. This may be expressed in terms of precision and recall, or an aggregate thereof such as an F-score. The choice of evaluation metric will vary according to the type of metadata and its availability online: heuristics operating in particularly large search spaces (such as those primarily using search engines) will place a greater weight on precision, whereas those seeking data that is rare online will favour recall. Since heuristics are, ideally, already the results of well-established methods in linguistics, this score and its desirable range should be established empirically prior to use.

The F-score may also be generated only for the best documents returned, providing a score for the whole ‘batch’ of documents returned by a heuristic. This measure is likely to be more representative of stateful retrieval methods (i.e. those that crawl); however, it conflates the statistics used for measuring document similarity with those used to retrieve them.

Ideally, heuristics form pluggable, reliable modules with known degrees of error that may be ignored when using the method as a source of linguistic data. The degree to which this applies is defined by how well each is aligned with a source of data: keyword-based heuristics, for example, are well aligned with the search systems of search engines, but poorly aligned with the strengths and weaknesses of humans seeking a document type.

The value a heuristic yields is a trade-off between how closely it describes the document and how accurately it is possible to retrieve a document from the source.
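For reference, the precision (P), recall (R) and balanced F-score referred to above follow their standard definitions:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R} \]

where TP, FP and FN count the true positives, false positives and false negatives for the metadata value being sought.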

Document Distance. It is necessary to identify differences between documents’ metadata values both in order to select them from the set returned by each heuristic and to identify overall deviance from the target distribution. The former is computed after documents have been retrieved by a search method but prior to their selection; the latter after selection.

Boolean selection criteria (are the metadata in this document equal to those of the prototype?) offer a way to guarantee convergence to the target distribution. They are an ideal case which will be tractable only in certain circumstances. Relaxing the similarity requirement demands a finer-grained distance metric for each metadata value. The distance for each candidate document may then be measured and a ‘best fit’ selection made. Alternatively, upper bounds may be set on suitable distances from the prototype for each metadata type, leading to the selection of all documents which satisfy those criteria.

Having selected candidate documents using heuristics, it is possible to estimate and then sum the distance each has to its intended value from the prototype. The sum of squares of each of these yields the overall error for the final corpus, analogous to the mean squared error:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \sum_{d \in D} \left( P^d_i - C^d_i \right)^2 \]

where $P^d_i$ is the value of metadata type $d \in D$ of prototype document $i$, $C_i$ is the equivalent selected document that was ultimately added to the results set (with $C^d_i$ its value for type $d$), and $n$ is the number of documents selected.

5.4 Method & Implementation

In order to test the design outlined in Section 5.3, a software tool was produced. The role of this tool is to:

1. ‘Read’ a seed corpus to construct a description of its metadata;

2. Store this metadata as a description that is transferable and portable;

3. Replicate the input distribution of metadata by sampling from the web.

This forms the basis of the ‘replication’, ‘rescaling’ and ‘distribution’ scenarios specified in Section 5.2, as well as covering a superset of the features required for some simpler workflows (such as ‘repair’). This section details the design and architecture of the solution produced.

5.4.1 Architecture

ImputeCaT is a small collection of command-line tools written in the Ruby programming language (https://www.ruby-lang.org/en/). It is structured around two main executables: the first of these is the Corpus ProfileR (hereafter ‘cpr’), which is responsible for reading a corpus, extracting its metadata, and serialising the resultant distribution to disk. The second reads, resamples, and attempts to retrieve documents matching an existing corpus profile—as it is seeking a gold standard, it is named the Golden Retriever (or simply ‘gr’).

The separation of the tools at this stage offers several advantages. Firstly, the serialised corpus distribution acts as a file format, allowing a corpus description to be passed between users. Secondly, it offers a ‘DMZ’ approach to information security: there is no mechanism for any (potentially sensitive or confidential) information from the original corpus to be used by the retrieval tool, except that which has been explicitly extracted. Finally, this approach permits easy parallel execution of tools using a corpus description for read-only tasks.

Currently, both tools read descriptions of metadata types (and other resources) from a ‘profile’ file. This provides corpus-design-specific configuration data that would otherwise be impractical to enter through the command line (for example, mappings from field names to metadata types).

Document Description

Documents are the sampling unit of ImputeCaT, and are used as atoms in its sampling algorithms. Each document is represented as a key-value store of metadata ‘names’ (assigned using the profile description) and their respective values, along with the content being represented, stored as a plain-text string. This structure is used even where a document lacks text: building a corpus description from metadata will create a corpus with many documents, yet these will be populated solely by metadata fields and will have empty content elements. Documents are mutable, and as they pass through the system they are liable to be edited in order to refine and change their contents according to a number of processes.
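A minimal sketch of such a structure follows; the class name, fields and accessors are illustrative assumptions rather than ImputeCaT’s actual source.

# Hypothetical sketch of a mutable document object: a key-value store of
# metadata plus plain-text content (which may be empty when a corpus is
# described from metadata alone).
class Document
  attr_accessor :metadata, :content, :meta

  def initialize(metadata = {}, content = '')
    @metadata = metadata
    @content  = content
    @meta     = {}      # retrieval context, e.g. HTTP headers or spider state
  end

  def [](key)
    @metadata[key]
  end

  def []=(key, value)
    @metadata[key] = value
  end
end

doc = Document.new('GENRE' => 'W_newsp_brdsht_nat_arts', 'Word Total' => 1335)
doc['Aud Level'] = 'med'   # documents are mutable as they pass through the system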

Corpus Description

The corpus description process is structured around gradually adding to a ‘corpus’ object. This object contains each document represented as a vector of metadata values ($d \in D$; $m \in d$, $m \in M$), along with marginal distributions for each metadata type (which are stored for optimisation purposes). The distributions used for each metadata type are arbitrary, and may be controlled in a similar manner to priors in Bayesian inference in order to relax or modify the results.

Currently two systems exist for estimating the empirical distribution of the data: the simpler of these represents discrete data by simply storing frequencies for each value observed. A continuous form of this represents the distribution as a sum of Gaussian kernels—this is designed to disperse the observations and provide a more meaningful overview of the input data.


Figure 5.3: The objects involved in forming a corpus description (CSV metadata and manual input are mapped, field by field, through a field map into the corpus object).

These distributions are specified in the profile document provided to ‘cpr’. Each distribution object supports sampling at a given value x, returning P[x | seed corpus data], and a random function that returns a random number following the distribution in question. The discrete distribution implementation uses roulette sampling to implement this, and the smoothed continuous distribution identifies at random one of the contributing Gaussian functions and then samples from it using the Box-Muller transform[29].

The process of reading a corpus to describe it is one of loading document metadata from files. As illustrated in Figure 5.3, the values of metadata may be loaded from a corpus in CSV format, or from a manual description of the metadata required. In addition to the metadata field values for each document, a mapping from CSV field names to metadata distributions is required, in order to define how each item of data is to be stored and manipulated.

It is possible at this stage to select distributions that artificially manipulate the metadata of a corpus, allowing users to ‘filter’ or skew an existing corpus without manually editing each file prior to insertion. This could be used to produce stochastic subcorpora, where the probability of undesirable documents is not eliminated but merely reduced, in effect rebalancing the corpus on-the-fly. This is akin to two-phase stratified sampling, yet offers possibilities beyond simple weighting of each input document.

The selection of distributions is performed using the profile file, in a human-readable format. This section of the profile currently maps CSV field names to instances of the distribution objects in question, as shown in Listing 5.1. The parameter passed to the SmoothedGaussianDistribution constructor defines the standard deviation of the Gaussian distributions used, causing slight variation in sampling around each point.

PROFILE = {

  # ...

  # CSV fields mapped to their distribution type
  fields: {
    'GENRE'      => Impute::DiscreteDistribution.new(),
    'Word Total' => Impute::SmoothedGaussianDistribution.new(30),
    'Aud Level'  => Impute::DiscreteDistribution.new(),
    'Language'   => Impute::DiscreteDistribution.new(),
  },
}

Listing 5.1: Field mappings in the BNC corpus profile.
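The distribution classes named in Listing 5.1 can be pictured with a minimal sketch such as the following; only the class names are taken from the listing, while the internals (the observe, probability and sample methods) are illustrative assumptions rather than ImputeCaT’s actual implementation.

# Hypothetical sketch of the two distribution objects named in Listing 5.1.
# Each supports querying a probability at a value and drawing a random value.
module Impute
  class DiscreteDistribution
    def initialize
      @freqs = Hash.new(0)
    end

    def observe(value)
      @freqs[value] += 1
    end

    def probability(value)              # P[value | seed corpus data]
      @freqs[value].to_f / @freqs.values.sum
    end

    def sample                          # roulette-wheel selection
      target = rand * @freqs.values.sum
      @freqs.each { |v, f| return v if (target -= f) <= 0 }
      @freqs.keys.last
    end
  end

  class SmoothedGaussianDistribution
    def initialize(sigma)
      @sigma  = sigma
      @points = []
    end

    def observe(value)
      @points << value
    end

    def sample
      mu = @points.sample               # pick a contributing kernel at random
      # Box-Muller transform for a standard normal deviate
      z = Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
      mu + @sigma * z
    end
  end
end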

This approach allows the application of weights and control over the resampling process in a manner that is explicit and more efficient than implementing a different sampling algorithm for each modification. As features are extracted from the seed corpus, they are assembled into a single document object and stored in the corpus. A corpus is, then, a set of documents with an accompanying set of distribution objects that describe the marginal distributions of each dimension of metadata.

It is possible to design a corpus that has no documents, and is simply a description of desired marginal distributions; however, this approach will limit the power of the rest of the system, as it is unable to model any interactions between metadata types. For certain selections of metadata types this may not be a strong assumption, but many of the metadata types that can be extracted from existing corpora are designed to be useful to human analysts and are thus non-orthogonal.

It is also possible to import a corpus from text only, by running classifiers and heuristics to extract important metadata during the import. This method requires access to the full text of the input corpus, but has the effect of making any errors symmetric with those of the rejection sampling phase of the retrieval tool.

Resampling

Selecting random document from # ...
 - dims: 4, docs: 4054
 - dims: 3, docs: 1652 (Aud Level == med)
 - dims: 2, docs: 51 (GENRE == W_newsp_brdsht_nat_arts)
 - dims: 1, docs: 1 (Word Total == 1335)

Listing 5.2: Output from the conditional resampler, run on the BNC corpus.

Resampling is the process of producing a document with metadata values that are in some way related to an existing corpus object’s distribution of documents. ImputeCaT implements three different resampling algorithms:

• Marginal: This sampler selects metadata values at random from each marginal distribution, without conditioning upon any values. This will ignore interaction effects between metadata and thus cause overdispersion in the sampled documents; however, it is possible to run this method on a corpus with no documents.

• Conditional: The full conditional sampler selects a metadata dimension at random, selects a value from its distribution, and then produces a sub-corpus conditioned on this value. This process is repeated until no dimensions remain, at which point the metadata values used are returned (as shown in Listing 5.2). Since a sub-corpus must be constructed at each iteration (in order to rebuild the remaining variables’ marginal distributions), this process is possible only on corpora with knowledge of individual documents.

• Partial Conditional: This sampler follows the same basic ‘selection and conditioning’ iteration as the full conditional sampler, yet stops after a set number of iterations. It behaves in a manner similar to that of a blocked Gibbs sampler, producing documents that converge to a similar distribution to the input yet with greater dispersion.

Due to the smoothing applied to continuous distributions in a corpus, it is possible to sample at random a value that will return no documents for the conditioning phase of either conditional resampling mechanism. The solution to this is to provide a window width to the method, within which all documents will be selected, effectively providing a reverse function for the smoothing method. Since the standard deviation of each Gaussian kernel making up the smoothed distribution is known, the window size can be set in such a way as to guarantee a given probability of selection: for example, a value of 1.645σ would result in a 95% chance of selecting at least one document. Such effects are more likely the more input distributions are artificially processed for editorial reasons. Note that this method is a compromise to make sampling tractable: actual Gaussian kernels extend to ±∞, meaning we would otherwise need to consider every point in the distribution.

The ‘gr’ tool uses the full conditional sampler exclusively: this provides the most accurate documents under the assumption that the ‘seed’ corpus is already fairly complete.
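A minimal sketch of the full conditional resampling loop with such a window follows; the document and corpus representations are illustrative assumptions (documents as hashes, values drawn empirically from the current sub-corpus), not ImputeCaT’s actual API.

# Hypothetical sketch of full conditional resampling: repeatedly pick a
# remaining metadata dimension, draw a value from its marginal over the
# current sub-corpus, then condition the sub-corpus on that value.
def conditional_resample(corpus_docs, dimensions, window: 0.0)
  prototype = {}
  docs = corpus_docs
  dimensions.shuffle.each do |dim|
    value = docs.map { |d| d[dim] }.sample   # draw from the current (conditional) marginal
    prototype[dim] = value
    # For continuous dimensions, accept documents within +/- window of the
    # sampled value (e.g. window = 1.645 * sigma for smoothed distributions).
    docs = docs.select do |d|
      d[dim].is_a?(Numeric) ? (d[dim] - value).abs <= window : d[dim] == value
    end
  end
  prototype
end

# Example usage (assumed field names):
# conditional_resample(corpus_docs, ['GENRE', 'Aud Level', 'Word Total'], window: 50)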

5.4.2 Document Search & Retrieval

After loading the corpus description and profile from disk, the process of iteratively retrieving documents begins. There are a number of different approaches to this task (as detailed in Section 5.3.2): ‘gr’ uses one that is designed to minimise interactivity during execution, in order to permit construction of arbitrarily large corpora with little effort. The process of rebuilding a corpus is one of iterating the algorithm below (sketched in code after the list) until the size requirements are met:

1. Sample a document from the seed corpus, known as the ‘prototype’;

2. Using a set of pre-defined search methods, select a set of candidate documents that is a superset of the prototype;

3. Impute missing metadata values using context from the search system and document contents using a series of feature extractors;

4. Identify and select the best candidate document by minimising distance in vector space (rejection sampling).
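A minimal sketch of this loop, assuming pluggable search-method and heuristic objects (the method names sample_prototype, retrieve, impute and distance are illustrative assumptions rather than ImputeCaT’s actual API):

# Hypothetical sketch of the rebuild loop: sample a prototype, gather
# candidates, impute missing metadata, keep the closest match.
def rebuild(seed_corpus, search_methods, heuristics, target_size)
  output = []
  while output.size < target_size
    prototype  = seed_corpus.sample_prototype                           # step 1
    candidates = search_methods.flat_map { |s| s.retrieve(prototype) }  # step 2
    candidates.each do |doc|                                            # step 3
      heuristics.each { |field, h| doc[field] ||= h.impute(doc) }
    end
    best = candidates.min_by do |doc|                                   # step 4 (rejection sampling)
      heuristics.sum { |field, h| h.distance(prototype[field], doc[field]) }
    end
    output << best if best
  end
  output
end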

The approach of scoring each candidate document after retrieval is arguably unnecessary if other retrieval methods are used: for example, manually selecting documents could prove so accurate for some metadata types that it need not be verified automatically (though some annotator agreement measures may result in a similar process to the above). This rejection sampling method was selected in order to relax the precision requirements for each of the search methods. In practice, each search method is likely to oversample somewhat from a population distribution which is unlikely to fit the seed corpus—the number of candidate documents selected ought to be proportional to the difference between the seed distribution and the population one. Output is in the form of plaintext, along with stand-off documentation providing details of the metadata for each document. Part of this metadata contains the prototype’s details, in order to compute fit offline.

Search Methods

Methods used for searching are implemented as pluggable objects that accept a document to search for (and the names of any metadata field types they respect) and follow some procedure for retrieving candidate documents. Search methods need not retrieve documents by searching for all metadata fields (though ideal, this is essentially not possible online); however, they should maximise variation in dimensions which they do not deliberately target, in order to increase coverage of the search space. Use of multiple overlapping search strategies in the construction of each candidate set ensures that each dimension of metadata is given an equal chance of being selected correctly.

It is desirable that these search methods cover as much of the search space as possible, such that they do not apply the bias illustrated in Figure 5.2. Nonetheless, selection of different search methods and heuristics will apply a set of assumptions that must ideally be aligned with the use of the resultant corpus: this is no different to any other corpus construction effort, except that this design explicitly enumerates and codifies this stage. In the ideal case users will be able to select from a library of methods, each based upon empirically-validated theories—application of these would be able to adjust a corpus from one designer’s set of assumptions to another, in order to ensure that it best fits a given use.

As documents are selected, their context must be passed on for use by heuristics and other processes. This is accomplished via a special ‘meta’ entry in the document object, which stores details such as HTTP headers and information on how to continue spidering from the document. Some of the heuristics use these metadata for classification.

A web spidering system, Mechanize (http://wwwsearch.sourceforge.net/mechanize/), is available as a module for use with any retrieval methods. Since it requires seed URLs, it does not constitute a retrieval method itself; however, it is the sole library used to retrieve and annotate web data, leaving behind a handle to the state of the spider in each document—using this it is possible to spider arbitrarily from any document that is currently in the system. This behaviour is vital to the retrieval of documents that are correct in only some dimensions, for example, from websites with the correct type and topic but lacking an appropriate word count. Essentially, this mechanism allows other methods to exploit the clustered nature of the web to their advantage in seeking data.

i. Search. Through the Azure web services API (https://azure.microsoft.com/en-us/), the Bing search engine can be used to retrieve URLs. These may then be retrieved directly, or used as the starting point for a spider (see ‘Hybrid’ below). The Bing search engine allows for multiple search parameters, including language, selection by top-level domain, and filtering using a number of heuristics to identify text types (for example, searching for the format of email headers). These techniques allow for lookup of genres, though offer little control over some other features such as document length, which are typically of less interest to the average web user. Nonetheless, for metadata which is searchable, the method of seeking a prototype offers a way to cut through the distortion applied by the search algorithms (though it remains subject to bias in their spidering algorithms). The search mechanism in ‘gr’ operates by detecting languages from metadata, searching the appropriate ‘market’, and using a series of pre-computed keyword lists to find matching genres. Keywords for the system are passed as an argument in the corpus profile, and can thus be changed per-design to fit the classification used. Both of the corpora used for testing in this thesis use Lee’s BNC classification, for which log-likelihood-based keywords were computed. Keywords are identified for use as search terms, weighted by log-likelihood values.

ii. Directory Search. By using an existing web index, it is possible to map taxonomies directly to sets of links. This has the distinct advantage of reliability and replicability, yet significantly reduces the potential pool of documents for selection. The ability to use different indices (or search them) based upon another dimension of metadata makes this method suitable for use with very special-purpose corpora. For example, one may search Project Gutenberg (a large repository of free ebooks) where the text type is ‘book’, yet return to Bing search or DMOZ (a.k.a. the Open Directory Project, available at http://www.dmoz.org/) for less specialist documents. Some of these resources may also be used offline: for example, complete dumps of Wikipedia and DMOZ are available and are used directly by ImputeCaT to find URLs.

iii. Hybrid. The hybrid approach uses a search term or directory entry as a set of seed URLs from which to spider. The most naïve implementation, implemented here, simply spiders to a certain depth from each search URL. It is also possible to select URLs to follow based on the prototype’s values, either to seek a document directly or to maximise variation across non-controlled-for dimensions. In his examination of different sources of URLs, Barbaresi identifies a lack of diversity in search engine results relative to sources such as DMOZ[11], whilst also noting that such directories underperform when applied to certain languages. The prototype-seeking behaviour detailed here has the benefit of being able to switch between such approaches depending on language, maximising coverage for a given URL source.

iv. Existing Fringe. This mechanism simply searches the existing set of documents, left over from previous searches, for the closest match. As all candidate documents are pre-ranked, this method requires no action. It is equivalent to policies in genetic algorithm selection that retain portions of the population between generations. Each relevant document in the fringe may also be used as the starting point for a spider, relying on the clustered nature of the web to maximise precision of retrieval.

Post-processing and Context

After download, candidate documents must be converted to plaintext. Processing modules are chainable, meaning that it is possible for each search method to construct a standard pipeline to normalise any contextual information and perform tasks such as boilerplate removal prior to document ranking and imputation using the heuristics.

i. Boilerplate Removal. Boilerplate removal is performed by JusText[139]. Though currently this uses the bundled English stoplist, it would be possible to use the document context language attribute to automatically select a stoplist depending upon the input document.

Heuristics

Hailing from multiple sources, documents are unlikely to be annotated with sufficient metadata to permit comparison to the prototype. It is thus often necessary to impute the values of metadata from whatever external variables are available (for example, meta-tags or HTTP headers) and from the document content itself. Unlike search strategies, heuristics need not consider the practical concerns of data sources, and are thus simpler algorithms, often comprising simple classifiers or counts. Nonetheless, error in these routines sums with other sources of dispersion: mechanisms for measuring the overall error of corpus documents (detailed below in Section 5.4.2) conflate these two sources.

Heuristics are also used to determine the ‘most similar’ document to the prototype: each heuristic object provides two methods, one to impute the value of a metadata key based on the document contents, and one to measure the distance between two values. For many heuristics the latter is a simple normalised difference between two numbers.

i. Genre. Genres are imputed through the use of a keyword-based classifier. This is one of the more challenging dimensions to impute due to its complexity, the importance of accurate detection (and thus corpus replication), and the ambiguity of classifications in many genre taxonomies.

The representation of genre used within ImputeCaT was selected to align with human-usable taxonomies, and as such is largely aligned with the BNC index[110]. Selecting this taxonomy burdens any execution of the tool with various assumptions about the data and the desired output, aligned with Lee’s genre distinctions (and any inadequacies thereof). This is a perfect example of the value of explicitly selecting and stating one’s design through the corpus profiling tool.

The genre heuristic in this implementation uses keyword data to classify texts based on their content. This is a rudimentary but reliable approach that does not require any consistency in the web data retrieved (but also thus ignores any useful meta-information therein). Rather than simply assuming the taxonomy used elsewhere, it would be possible to represent genre arbitrarily, either as a set of entropy measures against certain features, or in a manner more fitting of the source data. This approach is valuable where the source documents are less well defined, or where particular control is needed over the composition of a corpus (such as frequency profiles for certain words). One particular use of this approach would be the modelling of under-resourced languages using bottom-up detection of features, as in Sharoff’s 2007 study[160]. As with all such approaches, however, the reliance on automatic feature detection strips away a large portion of the utility of genre definitions, in that they are often less easily understood by humans. There is some argument for this approach, however, when compared to the often context-based labels applied to the BNC.

The classifier in ImputeCaT is tasked with accurate identification of classes that are not particularly easy to separate by linguistic features. One possible reason for this is their tendency to conflate topic with form: W_email and W_hansard are largely stylistic differentiations compared to W_religion and W_ac_medicine. The resultant classifier, following a multinomial naïve Bayes approach[124], is described in more detail in Chapter 6.

The difficulty with which genre is classified makes a case for careful selection of metadata, and illustrates a compromise: often, the most operationalisable and useful metadata selection will not be easily understood by computers, and yet more synthetic features extracted from texts are often difficult to seek in an online environment that is designed specifically for human use.
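As a rough illustration of this kind of keyword-based genre classification, the following is a minimal multinomial naïve Bayes sketch; the tokenisation, add-one smoothing and training structure are assumptions for illustration, not ImputeCaT’s actual classifier.

# Hypothetical sketch: multinomial naive Bayes over token counts, with
# add-one smoothing, trained from { genre => [texts] }.
class GenreClassifier
  def initialize(training)          # training: { 'W_email' => ['text', ...], ... }
    @priors = {}
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }
    total_docs = training.values.sum(&:size).to_f
    @vocab = training.values.flatten.flat_map { |t| tokens(t) }.uniq
    training.each do |genre, texts|
      @priors[genre] = texts.size / total_docs
      texts.each { |t| tokens(t).each { |w| @counts[genre][w] += 1 } }
    end
  end

  def classify(text)
    @priors.keys.max_by do |genre|
      denom = @counts[genre].values.sum + @vocab.size
      Math.log(@priors[genre]) +
        tokens(text).sum { |w| Math.log((@counts[genre][w] + 1.0) / denom) }
    end
  end

  private

  def tokens(text)
    text.downcase.scan(/[a-z']+/)
  end
end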

Flesch reading ease score    BNC Audience level        Standard deviation
63                           Low                       14.4
55                           Med                       12.7
47                           High                      12.4
82                           (unclassified, speech)    20.6

Table 5.1: Target Flesch reading ease scores and their equivalent BNC categorisation.

ii. Audience Level. Reading ease metrics are used to estimate the ‘Audience Level’ classification provided by Lee. This takes the form of the Flesch reading ease score[51], quantised through selection of the nearest target value as in Table 5.1.

Distance is normalised on this score by dividing the distance between classifications by the maximum distance possible between all classifications. The score is thus output in terms of distance between categories: a comparison of any ‘low’ score to any ‘high’ one will be equal regardless of the degree to which the original value is low or high. Note that this categorisation is a design choice on behalf of the corpus builder: it is also possible to code the reading ease measure as a continuous value, which would offer a more nuanced measurement of similarity.

iii. Word Count. Words are counted by simply splitting the plaintext string on whitespace and counting the parts. Whilst this imputation algorithm is trivial, normalising the distance metric is not: the normalised word count distance is produced by artificially imposing an upper bound beyond which the distance is simply registered as 1.
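A minimal sketch of this word count heuristic and its capped, normalised distance; the cap value of 10,000 is the one discussed in Section 6.2.2, and the function names are illustrative.

# Hypothetical sketch: impute word count by splitting on whitespace, and
# normalise the distance by capping at an upper bound (registered as 1).
WORD_COUNT_CAP = 10_000

def impute_word_count(text)
  text.split.size
end

def word_count_distance(a, b)
  [(a - b).abs / WORD_COUNT_CAP.to_f, 1.0].min
end

word_count_distance(1_335, 2_100)   # => 0.0765
word_count_distance(500, 50_000)    # => 1.0 (difference exceeds the cap)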

Measurement of Error

Notably, by scoring each document according to its target, it is possible to measure overall error for each metadata property as the corpus is being built. In a unidimensional context this greatly simplifies the process of minimising error (since it is possible to adjust the seed distribution to compensate for errors on-the-fly), however, such comparisons would be intractable when applied to the intricate conditional distributions found in real-world corpora. As documents are selected, it is possible to compute their sampling error by taking the difference between metadata values relative to the prototype. This difference will be in units defined per metadata type, following a distribution that, in the ideal case, is uniform. The error surface is, in reality, as intricate as the original distribution of documents: there is no way to compute errors for all conditions within the corpus without simply comparing documents. To operationalise measures of total distance, therefore, we may summarise the error in a number of ways:

• Mean Error: This is simply the mean of all errors for each metadata item, and provides a coarse-grained measurement of the overall bias for a given dimension. This measure is analogous to measuring the area under the difference between marginal distributions of the input and output distributions for a given metadata dimension.

• Mean Squared Error: A sum of the squares of differences between each document and its prototype, this provides a dispersion measure that does not include direction. This measure is commonly used to assess the quality of sampling; however, it may not adequately represent the difficulties of sampling data online.

• Uniformity of Residuals: Correlation between prototype and document metadata values evidences the responsiveness of search methods, and is testable using Mann-Whitney’s U.

If the candidate document selection function is weighted, greater mean standard error should be observed on those dimensions with lower weights.

5.5 Summary

This chapter has detailed the design and implementation of a method that aims to extract and summarise corpus data in a human-readable format. This data can be modified in a number of ways before forming the seed of a new corpus, which uses data sampled from arbitrary sources to construct a clone with known bounds for error. This approach is beneficial compared to more conventional bootstrapping as it permits inspection and adjustment of metadata at a level relevant to the use a corpus is likely to be put to: the external metadata. This also allows us to inspect and manage bias somewhat, under the assumption that we are able to translate the meaning of metadata in the context of the seed corpus to another source of documents (such as a search engine).

This mechanism is a workaround for Moravec’s paradox[132]: a compromise between the low-level, unambiguous features needed to represent a corpus accurately to a computer and the high-level, biased mechanisms of selection online, which are oriented towards providing a commercial service to humans. Careful selection of said metadata is thus crucial to ensure that this balance is respected in the resultant corpus, particularly with respect to any aspects that should be studied. It is the intention of the design detailed here to make that process explicit, replicable, and repeatable: where this is not possible, the sources of error are documented.

The implementation detailed herein constitutes a single workflow, the dissemination and replication of a corpus. This workflow was selected as it is largely a superset of the others, demanding accurate description and automated identification, retrieval, and assessment of potential documents in a ‘turnkey’ manner. The use of searching and keyword lists in document retrieval bears comparison to BootCaT and other ‘seed’ based approaches—indeed, the method here presents a generalisation of that used in BootCaT. If using keyword search for retrieval, it is possible to emulate the behaviour of BootCaT by ensuring that:

• The input categories (for which keyword lists exist) are the source of the keywords used;
• Heuristics must impute their category selection based only upon the same raw data as the above;

• Only the search-based data sources are used;
• There is only one dimension, i.e. the one which is reliant on keywords for retrieval.

In addition to these, the framing of an input corpus in terms of human-usable metadata means that a measure of retrieval error is possible for ImputeCaT, something that must be performed from the resultant keywords in those systems working bottom-up (though ultimately these constitute the same process, there is value in translating it through a human-readable lens).

Chapter 6

Evaluation & Discussion

The evaluation of corpus construction is a particularly tricky area: without a gold standard or real-world task to frame the results, it is often difficult to tell quantitatively which differences between corpora constitute improvements. The way seed-based methods work offers an ideal source of gold standard data; nonetheless, only experience and repeated use can reveal the manner of the differences identified, and their impact on experimental results. This evaluation will use this reflexive method to identify overall responsiveness to the seed corpus, along with an examination of individual components of the system as a method for revealing specific strengths and weaknesses.

This evaluation is based around one of the tasks identified in Chapter 5—that of rebuilding a corpus from scratch based upon a seed’s proportions. This task is effectively a super-task of reconstructing or repairing a corpus, and is equivalent to many scenarios involving disseminating sensitive corpora with minor restrictions on the metadata dimensions used.

This chapter begins with a description of the use cases covered, and a detailed rationale of how these form testable, objective research questions. Section 6.1 details the two main approaches to evaluation: those applied to individual components, and those applied to the results of running the implementation as a whole. Following that, each stage of the corpus building process is evaluated in a white-box fashion, starting with the heuristics used to classify documents (Section 6.2), moving on to the process of identifying ‘target’ prototype documents in vector space (Section 6.3), and then retrieving documents from the web itself (Section 6.4). Finally, the implications of each component’s functionality for the final process, and what the results tell us about the state of the web as a document source for resampling BNC-like corpora, are discussed in Section 6.5.

6.1 Evaluation Criteria

The approach used in this evaluation is one of treating the system to be tested as a method of replicating the features of a corpus, passing statistical properties of the input corpus through to the output. The success of this method is dependent on the structure of the input corpus and the structure of the population new documents are drawn from, and as such this evaluation is just a single datapoint in the space of possible research questions. If testing with another seed corpus, one must change the search strategy modules to fit, changing the overall results of the tool. Essentially, this is a manifestation of ‘garbage in, garbage out’: moving search strategies into the implementation merely makes this more explicit. For that reason, this evaluation uses an extremely popular corpus, the BNC, to ensure some degree of transferability.

One method of testing this is to see the output corpus as a model of the input, given certain assumptions that include the selection of metadata types and search strategies: providing those assumptions hold, much of the variation in the input corpus ought to be explained by variation in the output. Measuring differences between the two allows us to identify specific areas of poor fit, which can then be improved by altering the pluggable modules to fit the ground truth. The purpose of this evaluation is to test both the search modules used for the particular corpus given, and the overall method: if no set of reasonable assumptions can be found, this is an indication either that the population of documents online is fundamentally different to that of the input corpora tested, or that the method presented here is incapable by design of identifying appropriate documents. Differentiating between these two is not possible quantitatively.

This evaluation will centre around replication of a large general-purpose input corpus that is sampled at the document level (that is, each datapoint is a document rather than a single word or sentence). This design has been chosen as it represents the most common current design, and because metadata at the document level are consistently both meaningful to humans and well researched. It is also similar to the approach taken by existing search methods for building corpora, making results indicative of search engine behaviour. Research questions surrounding this method run to:

• Components: How valid are the assumptions of each of the retrieval methods and heuristics selected?

• Overall Application: For general-purpose input corpora, to what extent (and in what manner) does the output corpus resemble the input ‘seed’ corpus?

• Residual Variance: What consistent features remain variable between the output and input corpora, i.e. what data cannot be sought online using the heuristics/search methods selected?

The error of each heuristic component contributes to the excess dispersion in the output corpus. By design these modules are unambitious, relying on existing methods and tools already tested in the literature; however, their performance upon the data used here will be evaluated in order to better explain sources of error. As this system remains a proof of concept, the selection of these modules is limited.

6.2 Performance of Heuristics

The heuristics selected for this evaluation are formed around Lee’s BNC index[111]. These were chosen because of their alignment to operationalisable, human-level metadata and the existence of multiple corpora with this level of annotation. There are two main approaches to populating the corpus description using these heuristics: either read the seed corpus’ contents and classify each data point, or read a list of metadata from an existing index. The latter approach is used in this evaluation, since it is applicable to corpora with partially-missing data (such as the personal corpus data resulting from data gathering in Chapter 4).

The accuracy of the classifiers listed here is responsible for minimising excess dispersion relative to the input corpus. The nature of their residual error will also apply bias to the resulting data set. Since many of these heuristics surround operationalising a corpus, a large body of research exists for classifying and extracting useful dimensions from texts. The heuristics presented here are proof-of-concept only, and it is expected that the design of the heuristics used for a study would be selected to match the theoretical basis of any analysis.

In most corpus designs, word count would be considered a measure of the size of the corpus (rather than a property of its constituents). The method evaluated here is capable of retrieving IID samples at different levels, and demands a different selection of heuristics and metadata when operating at the word or sentence level. Document-level metadata are both high-level enough to be distributed for confidential corpora and descriptive enough to enable accurate retrieval (by contrast, word or part-of-speech frequencies would reveal much of the contents of the original corpus, which may not be desirable). They are also aligned with storage and indexing, which are both often performed at the document level online.

6.2.1 Audience Level

The BNC User’s Reference Guide [30, p. 14] describes audience level thus:

. . . a subjective assessment of the text’s technicality or difficulty.

This is expanded upon slightly by Lee [110, p. 68], saying:

Audience level, on the other hand, is an estimate (by the compilers) of the level of difficulty of the text, or the amount of background knowledge of its subject matter which is assumed.

One approach to modelling this complexity is to use word lists such as those used in education; however, this is difficult to apply to such disparate topics without extensive compilation of such lists. A simpler approach uses metrics computed from the morphology of words to form a readability score. Such metrics are already widely used in word processing software and for designing documents for public consumption.

The classifier used for evaluation is based on the Flesch reading ease score[51]. This score is widely used, easy to compute, and correlates with the simple ‘low/med/high’ classification used in the BNC metadata:

\[ \mathrm{readability}_F = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \]

In order to establish the meaning of the BNC audience level categories in terms of this score, means were computed from the BNC World data. These means, shown in Table 6.1 (a reproduction of Table 5.1), were used by a simple classification algorithm that selects the category with the nearest mean value. This naïve method yields the accuracy indicated in the ‘naïve acc.’ column of Table 6.1, indicating that the readability score is not an ideal measure of the audience level.

Mean readability_F    BNC Audience level        Standard deviation    Naïve acc.
63                    Low                       14.4                  65.9%
55                    Med                       12.7                  25.0%
47                    High                      12.4                  66.3%
82                    (unclassified, speech)    20.6                  -

Table 6.1: Target Flesch reading ease scores and their equivalent BNC categorisation.
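A minimal sketch of this nearest-mean classification, assuming a simple vowel-group heuristic for syllable counting (an approximation for illustration, not the thesis’s own implementation):

# Hypothetical sketch: Flesch reading ease plus nearest-mean audience level.
# The unclassified speech category is omitted for simplicity.
CATEGORY_MEANS = { 'Low' => 63, 'Med' => 55, 'High' => 47 }   # from Table 6.1

def syllables(word)
  groups = word.downcase.scan(/[aeiouy]+/).size
  groups.zero? ? 1 : groups            # rough approximation of syllable count
end

def flesch_reading_ease(text)
  words     = text.split
  sentences = [text.split(/[.!?]+/).reject(&:empty?).size, 1].max
  syls      = words.sum { |w| syllables(w) }
  206.835 - 1.015 * (words.size.to_f / sentences) - 84.6 * (syls.to_f / words.size)
end

def audience_level(text)
  score = flesch_reading_ease(text)
  CATEGORY_MEANS.min_by { |_, mean| (score - mean).abs }.first
end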

Figure 6.1: Flesch reading ease distribution in the BNC (kernel density estimates at bandwidths 3.29, 6.58 and 13.16, with group means marked, plotted against Flesch reading ease scores from 0 to 100).

Rather than continue to use the categorisation used in the BNC, each text was scored for readability and the raw Flesch reading ease score used as its point within that dimension. The ‘cpr’ profiling tool then simply represents the underlying distribution as a continuous one (rather than artificially discretised into ‘high’, ‘medium’ and ‘low’).

The standard deviation of categories within Table 6.1 provides us with a convenient source for the bandwidth of the smoothing function used to identify the empirical distribution of the reading ease scores. Using graphical methods (illustrated in Figure 6.1) the value of σ/2 was chosen to represent the distribution: this offers a tradeoff between accurately portraying the distribution and permissively selecting documents online. This method also clarifies the difficulties associated with attempting to accurately classify audience level based on readability metrics.

Where a text is unavailable, knowledge of the standard deviation of each of the readability categories in the BNC allows us to produce an approximation of the distribution of readability scores. This approach (as well as the smoothing above) illustrates the value of manually modifying the text distributions during the design phase. As we are using a proxy variable to impute the reading ease, any classifier accuracy issues are deferred to the assumption that ‘reading ease’ is something we, as corpus designers, care about.

6.2.2 Word Count

Word counts are, for our purposes, trivial to compute. There are two challenges to operationalising word counts, however. The first pertains to metadata and other boilerplate within files: files must be processed prior to word counts being performed. This means (like many other tools) that we must post-process candidate files even if we elect to discard their contents. Secondly, it is necessary to normalise deviance measures when comparing documents: this poses a particular challenge as word counts are open-ended.

The method used here was to enforce an arbitrary limit, above which any distances in the word count dimension register as 1. Because the sampling method does not primarily rely on gradient descent, this is fairly safe; however, selection of the threshold does impact the importance assigned to word count (vs. some other metadata dimension): too high and it will undervalue differences in word count. The arbitrary limit selected for the BNC was 10,000. This is sufficiently large that any difference greater than it may be considered ‘absolute’; that is, if there is a difference in length of over 10,000 words, we can reasonably conclude that the lengths differ enough to be describing different texts. In practice (as we shall discuss in Section 6.4) most documents fall short of this, and the limit could be reduced.

It is worth noting the difference in philosophy this method embodies compared to many corpus studies: the word count here is a property of each document, and not a measure of corpus size. Unlike many methods, we are sampling at the same level at which we presume analysis will take place.

6.2.3 Genre

Genre is arguably the most important single stratification applied to a general-purpose corpus (even forming the bulk of the definition of its name), yet it remains one of the more nebulous terms. Here we follow Lee’s definitions once more, commensurate with the BNC index that serves as the source of much of the metadata for our corpus definition file. Due to the complexity of genre taxonomies, and thus the relative difficulty of classifying documents therein, there is significant motive to take the same approach as audience level above: using a self-defined or simpler taxonomy to simplify imputation or selection. On the other hand, the BNC classification is well-known and likely to be used in any corpus augmentation task (e.g. “I want this subset of the BNC, but more of it.”). We describe here an approach using the existing BNC classifications for this reason, with a discussion of the implications of other options below. As this is a proof-of-concept system, the accuracy of the classifiers used need not be cutting-edge. This relaxed requirement is reinforced by the notion that any errors are ‘sane’, falling along such ambiguities as identified by Lee [111, p. 11]:

It may be the case that the actual content/topic of these linguistics-related texts makes them seem less like social science texts than arts or applied science texts...

This ‘sanity of error’ is enforced by using a distance measure that is based on rank correlation of token type frequencies, allowing two categories to be judged on a scale more meaningful than simple boolean equality. This luxury may not be available for all potential discrete classification systems (such as those read from external indices), and is not a requirement of convergence to the input distribution. The distinction between genres is often somewhat variable: some, such as W_newsp_brdsht_nat_*, are drawn according to the subject of texts, whereas others fall along lines of context or what others may term ‘text type’ (W_email vs. W_essay_sch). This makes it particularly difficult to identify a model that linearises features within the dataset consistently enough for many classification algorithms to work well. Lee’s taxonomy is partially hierarchical, with many of the more detailed categories harbouring both this ‘text type’ distinction as well as one regarding topic. Both were retained here as they align to two distinct processes when sampling: the text type is an indication of (roughly) where a document may be found, and the topic regards more which document to select.

Classification The problem of identifying BNC genre from free text is a simple 70-way classification task34. In practice, the subjective nature of the taxonomy and the large number of classes makes this a challenging endeavour. This is a clear illustration of the difficulties encountered using the ImputeCaT method: whilst it may not be necessary to find documents online with great accuracy, there must exist some method of discriminating between them in a useful way.

34 Such that a 70-way classification task may be described as simple at all.

In his 2007 paper on web genres, Sharoff [160] fits a number of classifiers to the BNC, using a variety of genre taxonomies. Therein he reports success using a supervised learning approach to classify ‘grouped’ versions of the BNC taxonomy: one aligned to the EAGLES [166] text typology recommendations, and one 10-way grouping based on unsupervised methods. An alternative method would be to use entirely ground-up techniques such as factor analysis or LDA. These mechanisms will extract maximally homogeneous groupings from text, but at the expense of predictable alignment with human understanding (and thus any research questions or search engine designs). Since part of the intent here is to retain a parsimonious corpus profile document, this heuristic makes use of the original categories from the BNC taxonomy. Nonetheless, Sharoff’s study was used as a starting point for the methodology, in line with the goal of using established methods for heuristics. The ambiguity and external nature of some of the distinctions drawn in the BNC index causes problems here, and represents a common dichotomy in linguistics: a human-readable and intuitive classification is rarely easy to model quantitatively (to the point where we are often simply unable to).

Ultimately, the classifiers detailed here used the naïve Bayes approach and were fit using WEKA [69]. In accordance with the findings of McCallum & Nigam, the multinomial model was used over the Bernoulli due to its ability to maintain classification accuracy in models with large numbers of dimensions [124]. This also unexpectedly yielded the benefit that it was much faster to fit. For the use-case presented here, absolute accuracy of the classifier matters little: we care only that positively-classified cases are reliable for any one pass over the set of candidate documents, not that every candidate is selected. Because of this, and because the number of classes is high enough to indicate that errors are unlikely to occur in the direction of the current class, the evaluations below focus on the precision of each method. As ever, the classifier performance reveals as much about the taxonomy of the input data as it does the classification method. Presented here are six models that cover different subsets of the BNC:

• BNC70: A classifier built using all 70 categories of both written and spoken BNC portions.

• BNC69a: This classifier omits S_conv, which was deemed unreasonably general in its content and thus difficult to retrieve.

• BNC69b: The BNC70 classifier, omitting W_misc due to its breadth.

• BNC68: Omitting both W_misc and S_conv.

• BNC46: Written portions of the BNC, containing W_misc.

• BNC45: Written portions of the BNC, without W_misc.
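
The classifiers above were fit in WEKA; purely as an illustrative sketch under that description, an equivalent multinomial naïve Bayes pipeline, 10-fold cross-validation, and weighted per-class precision can be written with scikit-learn as follows. The texts and genres lists are hypothetical stand-ins for the BNC documents and their Lee genre codes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical stand-in data: in practice these would be the BNC texts and
# their genre labels, read from the corpus index.
texts = ["the patient presented with acute gastric symptoms"] * 20 \
      + ["celebrity gossip and lifestyle tips for the weekend"] * 20
genres = ["W_ac_medicine"] * 20 + ["W_pop_lore"] * 20

# Token counts feeding a multinomial model, mirroring the design described above.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold cross-validation, as used to produce Table 6.2.
predicted = cross_val_predict(model, texts, genres, cv=10)

print("accuracy: ", accuracy_score(genres, predicted))
print("precision:", precision_score(genres, predicted,
                                     average="weighted", zero_division=0))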

The decision to omit sections of the BNC was deemed a necessary trade-off to achieve functional performance: for the purposes of this evaluation, the actual subset used is only of importance to the generality of the results (rather than their veracity). Likewise, the priority is not to produce a genre classifier, but merely to use one within the larger method: simply, it is more important that the classifier is functional than that it is general. Table 6.2 shows the resultant accuracy figures for the models trained on the BNC subsets mentioned. 10-way cross-validation was used to estimate the statistics, in an effort to reduce the potential for overfitting to a fixed test/training set. The statistics are formed from a weighted mean of each class’s performance against all others.

Corpus    Accuracy   Precision   AUC
BNC45     70.7%      0.721       0.901
BNC46     63.7%      0.666       0.877
BNC68     71.5%      0.715       0.913
BNC69a    65.9%      0.685       0.893
BNC69b    72.1%      0.732       0.918
BNC70     66.6%      0.689       0.899

Table 6.2: Weighted mean performance (per class) for each classifier.

These figures reveal that spoken conversation (S_conv) has the most deleterious effect on classifier accuracy, indicating that it overlaps most with other categories. Whilst it was expected that W_misc would also do this, the performance improvement shown in BNC69b over BNC68 indicates that W_misc contains some distinct features that are best retained. Though using a different design of classifier, overall performance is roughly in line with Sharoff’s BNC-based classifiers35 [160]. The 45- and 46-class models are written-only equivalents, and are listed here due to their use in the retrieval task below.

Figure 6.2: BNC69b confusion matrix

There are also potential arguments for the omission of many other categories from the taxonomy used here, however. Figure 6.2 shows the confusion matrix for the BNC69b classifier— note that some jitter has been applied to the plot to accentuate the overplotted items. This shows a rough smattering of difficult-to-classify classes, though a more detailed inspection of the per-class tests on the model indicates that only two of these perform with AUC values below 0.7: W_fict_drama and W_essay_univ, which are classified in a way indistinguishable from chance. Some classes with distinct lexical features (such as W_email and S_sermon) are classified particularly accurately.

35 Attempts to fit SVM-based models using RBF kernels led to models with accuracies of approximately 51% (AUC 0.74) for the 68-class model.

The error surface described in Figure 6.2 will contribute to the resulting error of the system, and as such further increase the variance of the output corpus. Again, this is analogous to the coding stage of any conventional corpus building project, but with the distinct advantage that the propensity for classification error is well-known. As shown, there are no instances where a genre is consistently mis-classified, indicating that errors will not systematically bias any output to a great degree. The full accuracy data is presented in Appendix D.

One final note should cover the generality of the methods used here. As with all portions of the heuristics, the choice of classifier is heavily theory-laden, and should conform to the underlying data’s nature as much as possible. Classification of genre is a particularly awkward aspect since it relies on imputing fairly complex external information (such that it can then be worked with by the software) from linguistic content, something that, as a rule, should be avoided for the sake of statistical validity. Notably, the EAGLES guidelines on text typology[166] warn:

The classification of texts into different genres seems to have been mostly achieved through external criteria rather than through internal, linguistic criteria.

The manner of retrieval and the nature of the web offer some avenues for working with external data and closing this gap: it is firstly possible to retrieve documents according to an existing genre directory, all but guaranteeing a fit. Further, HTML and HTTP metadata and stylistic properties of web documents (or those which are closely linked) may indicate genre as effectively as (or more effectively than) linguistic content. These are all places where further research dovetails with the methods described here to increase overall accuracy. Further work would be necessary to validate methods of reliably extracting and using this data, especially where it is patchy or difficult to extract. Such efforts would prove particularly fruitful, however, where the seed corpus originates online, allowing the profiling and heuristic classification stages to operate using the same implementation.

6.3 Performance of Resampling

The accuracy of the resampling process depends largely on two user-controlled properties:

• The amount of smoothing applied to continuous metadata in the corpus. Since values are selected from the smoothed corpus values, it is possible to select values that are non-identical to the input corpus. This is generally a desirable property, and the kernel function used to smooth the input data is user-defined and has known statistical properties.

• The number of documents selected. As this increases, the overall distribution of the output corpus converges to that of the corpus description file.

The intricacy of the input distribution largely defines the necessary size of the output corpus: an input corpus that is defined only in terms of simple marginal distributions is simpler to reproduce, that is, it contains less information, than one built using the full conditional distribution of a large, general corpus such as the BNC. The question of when to stop sampling documents is related to the problem of corpus size in general: the output corpus is conceptually a model of the input corpus, and should contain enough data to be representative of the relationships therein whilst following the same guide. This is best assessed by measuring the uniformity of residuals. A suitable end condition for many uses of the output would thus be a combination of corpus size and residual uniformity (notably it is possible to constantly balance the uniformity of residuals during the rejection sampling phase too). The evaluation here establishes the resampling algorithms’ ability to produce copies of the input corpus with uniform residuals, and establishes a ‘best case’ baseline against which the results of document retrieval may be compared. The research questions answered by the analysis here are:

1. Does the resampler converge on the same distribution as its input?

2. How many ‘target’ documents are required to converge upon the input distribution (at some known probability)?

6.3.1 Method

Evaluation of the selection method is possible separately from document retrieval, as the input distribution is known. This means that, unlike a bootstrapping operation, we can rely on model deviance measures (rather than heuristics such as autocorrelation) to indicate the point at which we have sufficient data to represent the input distribution. Since documents are retrieved and compared against their ‘target’ as selected by this process, it is thus possible to guarantee conformance to some property of the input distribution, provided retrieval is performed accurately. The eventual error for the corpus is a sum of these two deviances.

[Figure: data-flow diagram in which the input corpus profile feeds the resampler, producing prototype documents and their metadata; the retrieval stage (X) is omitted, and the output corpus is passed to the evaluator.]

Figure 6.3: Data flow for evaluation of the sampler.

The data flow outlined in Figure 6.3 is identical to that used for the final selection, with the omission of a retrieval stage at X. This means the evaluation will be performed under the assumption that the retrieval process is always able to find a suitable document. The operation performed by the evaluator is essentially a comparison between two distributions, and may be performed using any number of algorithms, with the one requirement being that it can practically be executed after each document is retrieved (note that for the purposes of this evaluation, such a requirement is less critical). Evaluation methods provided in the prototype implementation are:

• Mean Error: For each dimension, the sum of the deviance from each document to its target value is divided by the number of documents in the collection. This provides an asymmetric form of the commonly-used Mean Squared Error.

• Distribution Comparison: A commonly used method for comparing empirical distributions, this is computed by taking the differences between the cumulative density of each dimension’s data and is thus less sensitive to order than MSE methods. Typically these methods are able to provide confidence bounds related to statistical significance.

• Linear Modelling: A linear model is constructed and fit with parameters according to the input dimensions. If this model proves statistically equivalent to the null model, then the residuals are considered uniform.
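
As a minimal sketch of the first of these evaluators, assuming documents and their prototypes are represented as parallel lists of dicts of numeric metadata (the names here are illustrative only):

def mean_error(documents, targets, dimension):
    """Mean absolute deviance from each document to its target in one dimension.

    `documents` and `targets` are parallel lists of metadata dicts; the sum of
    per-document deviances is divided by the number of documents, as described
    in the first evaluation method above.
    """
    deviance = sum(abs(doc[dimension] - target[dimension])
                   for doc, target in zip(documents, targets))
    return deviance / len(documents)

# e.g. mean_error(output_corpus, prototypes, "reading_ease")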

For the purposes of this evaluation, a distribution comparison method, log-likelihood, was selected due to its known statistical bounds and wide use in corpus linguistics36. The major disadvantage of using this method, that it is harder to directly relate to the content of each text, is not relevant at this stage of analysis. Log-likelihood comparison methods are already widely used in corpus linguistics for keywording and corpus comparison, and measure significance (using Wilks’ theorem [183]), rather than effect size (as with many information-theory-based deviance measures such as MI and Jaccard). This evaluation makes use of such properties in that we desire to know at what point the output distribution is, to a given standard of probability, drawn from the same population as the input. This comparison of the output and input distributions was repeated as documents were re-sampled, noting the point at which the two cease to be significantly different. This process should establish both RQs:

1. Load input corpus;

2. Repeatedly sample from the input corpus, inserting documents into the output corpus;

3. After each document, compare the output distribution to the input.

Note that this method treats each metadata dimension of the input corpus as an independent distribution, effectively evaluating the marginal (rather than full joint) distributions of the sample. This was done to permit comparison where obtaining a sample large enough to provide sufficient statistical power is largely impossible—even in the BNC, there are seldom many documents that share all metadata values. This has the effect of weakening the test such that the ‘Conditional’ sampler is indistinguishable from the ‘Marginal’ sampler (as mentioned in Section 5.4.1).

36 It is worth noting that the log-likelihood statistic is just one of a large number of valid goodness-of-fit tests, the main contender being the two-sample Kolmogorov-Smirnov test.
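
The three-step procedure above can be sketched as below. The comparison uses the standard log-likelihood (G²) goodness-of-fit statistic computed per metadata dimension against the input corpus’s bin proportions; the data structures, threshold handling, and stopping rule are illustrative simplifications rather than the actual implementation.

import math
import random
from collections import Counter

def log_likelihood(observed: Counter, expected_props: dict) -> float:
    """G^2 statistic of observed bin counts against expected bin proportions."""
    n = sum(observed.values())
    stat = 0.0
    for category, proportion in expected_props.items():
        obs = observed.get(category, 0)
        expected = n * proportion
        if obs > 0 and expected > 0:
            stat += 2 * obs * math.log(obs / expected)
    return stat

def resample_until_converged(input_corpus, dimension, threshold, minimum=100):
    """Steps 2 and 3: sample documents at random (with replacement) from the
    input corpus, stopping once the output's distribution over `dimension` is
    no longer significantly different from the input's.

    `threshold` is the chi-squared critical value appropriate to the chosen
    confidence level and the dimension's degrees of freedom.
    """
    input_counts = Counter(doc[dimension] for doc in input_corpus)
    total = sum(input_counts.values())
    expected_props = {cat: count / total for cat, count in input_counts.items()}

    output, observed = [], Counter()
    while True:
        doc = random.choice(input_corpus)   # random selection keeps the output unbiased
        output.append(doc)
        observed[doc[dimension]] += 1
        if len(output) >= minimum and log_likelihood(observed, expected_props) < threshold:
            return output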

6.3.2 Results

The evaluation here focuses on categorical variables with more than five members per bin, due to the asymptotic nature of the log-likelihood statistic making it unreliable below this number. As such, the word count was grouped into 17 bins, and the true BNC readability category was used rather than the (continuous) Flesch reading ease score. This results in the following variables, which must converge prior to similarity being determined:

• Genre: The genre category, as fit using the classifier heuristic described above, using all 70 categories;

• Words: The word count, binned to form bins with no fewer than five texts in each;

• AudLvl: Audience level, categorised as ‘high’, ‘med’, and ‘low’;

• Mode: Spoken or written.

The word count is binned irregularly due to the low frequency of files with more than 80,000 words listed in the index. This results in a distribution shown by the histogram in Figure 6.4, with the minimum bin frequency being 8. Other variables were left unmodified. The convergence rate of each of the variables is, as mentioned above, dependent on the information content of the input bins’ frequencies (due to the complexity of the sample) and the number of bins with fewer than five members (due to log-likelihood converging slowly to the χ2 distribution). Each variable’s convergence may thus be appropriately expressed in terms of how long the average bootstrapping process takes to converge upon the distribution to a given confidence value, relative to the size of the input. A basic description of the bins used for each input metadata item is presented in Table 6.3. Medians and interquartile ranges were used as centrality and deviance measures due to the Zipfian nature of the bin distributions (a feature expected to manifest in most corpora). The ‘ID’ column gives the ‘index of dispersion’—a normalised measure of the variability in the distribution’s bins, computed by dividing the dispersion measure (inter-quartile range) by the centrality measure (median).

Metadata   |bins|   n_bin < 10    median(n_bin)   IQR(n_bin)   ID
Genre      71       16 (22.5%)    26              49           1.88
Words      18       2 (11.1%)     117.5           359.25       3.05
AudLvl     4        0 (0%)        869             317.5        0.36
Mode       2        0 (0%)        2027            1117         0.55

Table 6.3: Distributions of metadata dimensions in the input corpus (BNC index)
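
The ‘ID’ column of Table 6.3 can be reproduced with a small sketch such as the following (the exact quartile interpolation used to produce the table may differ):

import numpy as np

def index_of_dispersion(bin_counts):
    """Inter-quartile range of the bin frequencies divided by their median."""
    q1, median, q3 = np.percentile(np.asarray(bin_counts), [25, 50, 75])
    return (q3 - q1) / median

# e.g. for the 'Genre' row: a median of 26 and an IQR of 49 gives 49/26, roughly 1.88.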

[Figure: histograms of the Mode, Genre, word count bin and Audience Level frequencies in the BNC index.]

Figure 6.4: Word count distribution from the BNC index (n = 4,054; breaks every 5k up to 80k; max = 428,248)

The figures in Table 6.3 represent common cases for real-world corpora:

• ‘Mode’ is a simple binary classification with very different group sizes;

• ‘AudLvl’ is a simple classification with a relatively ‘flat’ distribution;

• ‘Words’ is a bimodally distributed measure with a large amount of variation but also very large numbers in each bin. This is a good representation of the counts of many linguistic features;

• ‘Genre’ is moderately heterogeneous, yet has a distribution with many low-frequency items in it.

We may reason from the index of dispersion that convergence will occur faster for ‘AudLvl’ and ‘Mode’ than for ‘Genre’ and ‘Words’ — the ordering of the latter two largely being determined by the conservative estimation of log-likelihood for low-frequency bins. Figure 6.5 displays the log-likelihood score of the output distribution as it is built. These graphs clearly illustrate the relative complexity of the input distributions, as well as the downsides of estimating log-likelihood measures for low-frequency results. After an initial burn-in period, all dimensions gradually converge upon the target distribution. For any input distribution with low counts, the similarity measure will not be accurate until these lowest bins are filled—marked on the graph roughly as the point where the output distribution is equal in size to the input.

[Figure: four panels (Mode, Genre, Word Count, Audience Level) showing mean log-likelihood and 95% confidence intervals against the out/in corpus ratio, with reference lines at ll = 6.68 and n = 4055.]

Figure 6.5: Log-likelihood scores and 95% confidence intervals for each metadata property during resampling (100 repetitions of 100,000 documents)

The two simpler distributions, ‘Mode’ and ‘AudLvl’, peak at levels well below the 95% threshold chosen for the log-likelihood statistic. In the case where the ‘target’ input distribution is defined by profiling an existing corpus, this implies that we could properly build a corpus smaller than the input (though unfortunately this is not rigorous without also knowing the relationship of the original input corpus to the population, as it is itself generated from that sample). The relatively confident estimates for these parameters at a 1:1 input:output ratio back up the rational expectation that large numbers of documents in few bins will be easy to replicate. A marginally more complex distribution of metadata in the form of the ‘Words’ dimension shows, again, fast convergence, requiring an output distribution roughly twice the size of the input to replicate with 97.5% confidence. By far the most complex input data, ‘Genre’, takes much longer to converge, requiring a ratio of roughly 12:1. This result is largely a function of the low bin frequencies and the large proportion of bins that have values below the threshold at which log-likelihood may be relied upon to provide a moderately-conservative estimate. The speed of convergence reflects the variability that is permitted in any subsequent retrieval process: fixing the ‘Mode’, for example, will lead to a corpus with known proportions of spoken and written text, but variability otherwise according to the population from which they are sourced. If a particularly large corpus is available with a desirable sample design, restricting only one dimension merely refines the original sample37.

The process of random resampling detailed here is in many ways uninteresting: after all, the experiment detailed above merely selects items based upon existing metadata. The value in the above technique lies in its mode of use:

• The distribution converged upon is arbitrary, and may be defined in the absence of an input corpus (or as a modification thereof). This allows relaxation of a corpus description’s requirements to reduce the necessary oversampling ratio above, or the addition of specific metadata items where the source of documents is particularly easy to sample according to certain properties.

• Halting the resampling process at any stage, due to the random selection method used, leads to an unbiased output corpus (though it may well be imprecise).

• When starting with a manually-designed corpus (which has bin sizes only relevant to one another), convergence can only be achieved where the output corpus size is greater than or equal to that of the input corpus. Aside from fitting issues where bin sizes are small, corpora fit using the same distribution shape should converge at the same size (i.e. not relative to the input corpus size).

The relationship between the log-likelihood scores of the two corpora describes the confidence one may have in the original, input, corpus: an input corpus with very few documents in each ‘bin’ is less powerful when used to answer questions using data in that bin, and so will require a greater number of documents in the output corpus to achieve confidence at the same effect size. This is a useful effect, as it ensures that samples are never output that are not in some way large enough to be confident about their distribution, thus enforcing a minimum sample size. Unfortunately, if this comparison is based on duplication of an existing sample (such as in this case) such an adjustment is non-rigorous, as it is based upon the sample parameters, which have an unknown relationship to the population parameters they estimate. Essentially, whilst this is a heuristic indicating internal validity, it does not ensure external validity. A notable property of the resampler is that, even if the confidence that the output corpus has ‘converged’ to a given level is very low (i.e. the log-likelihood score is above a chosen threshold), the output corpus is unbiased relative to its input due to random selection. In addition, the convergence of a corpus with a simpler design than the source corpus will occur at a number of documents lower than that of the input corpus, making it possible to resample an existing corpus using a design with simpler metadata properties. In this case, the ‘simplicity’ of the design is defined by the information in the design distribution relative to the information content of the source corpus. For example, if our experiment did not care about genre, it would be possible to sample texts directly from word count and audience level, producing a corpus with fewer texts yet the same coverage of those two dimensions.

37 Of course, such a situation is difficult if the corpus from which documents are drawn is not tagged with the relevant metadata, something unlikely if that dimension was already part of the sample design.

6.4 Performance of Retriever

The purpose of the resampling process above is to present well-defined document prototypes to a retrieval stage. There are many potential sources for document retrieval according to this prototype, including the original input corpus, a separate ‘super-corpus’, or, ideally a separate population (that is itself not a sample).

6.4.1 Method

The proof-of-concept evaluated here implements a mechanism similar to that of BootCaT[14], using a search engine to retrieve results based on key n-grams. This mechanism was selected due to the pertinence of its method to other corpus building efforts—there is no technical limitation imposed by the software tool, and users are free to select any other source of data or mechanism of retrieval. The similarity of the retrieval system described here to BootCaT, along with the use of concrete prototype documents, allows us to inspect the ‘measurement error’ of using search engines to retrieve data based on key terms. This evaluation is, therefore, primarily an assessment of that bias and a preliminary assessment of its sources.

[Figure: the input corpus profile feeds the resampler, producing prototype documents; a retriever (Bing) draws candidates from the internet, which are filtered by the classifiers and selection stage into an output supercorpus and final output corpus.]

Figure 6.6: An overview of the retrieval mechanism used in the proof-of-concept implementation.

The retrieval process (as summarised in Figure 6.6) is performed iteratively using the components evaluated in the previous sections:

1. Sample a prototype document from the corpus profile;

2. Retrieve candidate links using the Microsoft Bing search engine38;

3. Download, remove boilerplate (using JusText[139]), and classify each document using heuristics from the corpus profile;

4. Measure distance in the resulting vector space according to each heuristic’s value.

For this data set, both the direct links from Bing and the first level of hyperlinks in each document were downloaded. Spidering was performed in order to ensure that the sampled area of metadata is a supersample of the desired distribution, as well as to maximise the chances of finding suitable points in other, uncontrolled, dimensions. Documents with fewer than 100 words were discarded as they were unlikely to be classified accurately. This could be worked around by selecting 100-word samples from larger texts if necessary. For this evaluation, the written portions of the BNC were used as the seed corpus. Keywords used for search were generated using log-likelihood scores, and no cutoff was used: instead, the random selection algorithm was weighted by the resulting score. The spoken and W_misc categories were omitted for a number of reasons. Spoken data was omitted on the conjecture that it is difficult to find online39, leading to a predictable gap in the resulting corpus. W_misc was omitted due to the need for an accurate classification step, and because keyword-based retrieval requires that said keywords are highly representative of a given genre. Both of these issues are potentially solvable in future work, yet lie outside the scope of this thesis. The retriever must, as far as possible, return documents according to all prototype metadata dimensions; that is, it must perform the equivalent of an agglomerative query. This is a particular challenge for search engines due to their generality. Since the Bing API lacks tools to filter by any of the three dimensions used in this evaluation, the primary focus was on genre—word count and reading ease were both free to vary. Language was specified as en_GB.
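
An approximate sketch of steps 1-4 and the filtering just described is given below. jusText is used for boilerplate removal as in the actual pipeline, while search_bing and classify_genre are hypothetical callables standing in for the Bing API wrapper and the genre classifier of Section 6.2.3; the structure, rather than the exact interfaces, is the point.

import requests
import justext

MIN_WORDS = 100  # documents shorter than this are discarded, as described above

def fetch_text(url: str) -> str:
    """Download a page and strip boilerplate paragraphs with jusText."""
    response = requests.get(url, timeout=30)
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

def retrieve_candidates(keywords, search_bing, classify_genre):
    """One retrieval pass: query by keywords, download, filter, and classify."""
    candidates = []
    for url in search_bing(" ".join(keywords)):
        text = fetch_text(url)
        if len(text.split()) < MIN_WORDS:
            continue                      # too short to classify reliably
        candidates.append({
            "url": url,
            "text": text,
            "genre": classify_genre(text),   # heuristic from the corpus profile
            "words": len(text.split()),
        })
    return candidates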

6.4.2 Results

In total, 72,540 documents were downloaded after 6,510 requests to Bing. After boilerplate removal using JusText[139] and discarding of short documents, 55,790 documents remained in the corpus. This sample size is ≈21 times the input corpus. Estimating using graphical methods (as in Section 6.3), this should be sufficient to provide an appropriately distributed sample of genres. The worst case is that each Bing search returns very few documents with the desired genre: for example, if each search returns only a single document of interest then this ratio is reduced to ≈3n.

Before inspecting the results from the retriever, the limitations of this method should be noted. Firstly, the ideal retrieval mechanism should target all dimensions of the prototype document provided (in this case genre, reading ease, and word count). Bing does not index the latter two of these, leaving them to vary naturally. This variation interacts with the genre itself, and so it is not possible to generalise from the observations here to the wider web. The distributions shown in Figures 6.7 and 6.8 are, however, indicative of the documents found via Bing when searching for BNC terms. Upholding the assumption that the written BNC represents general-purpose language use, this is an indication of users’ exposure to documents online. From the point of view of one sampling documents to form a supercorpus, it is desirable that the uncontrolled dimensions are widely and evenly dispersed across the vector space, such that any subsequent rejection sampling is simplified.

38 https://datamarket.azure.com/dataset/bing/searchweb
39 Retrieval of transcriptions may be facilitated by searching for genre-specific features in addition to keywords, and this form of specialism is a potential avenue for improving the accuracy of all genres.

Figure 6.7: Prototype/document readability scores

Figure 6.7 shows the distribution of reading ease scores across the collected documents, the y axis showing the values as downloaded from the web. The red line is plotted with a gradient of 0.5, and shows the ideal zone where documents should lie if retrieved perfectly. The readability dimension (Figure 6.7) is fairly well dispersed, indicating that there are no documents fundamentally unreachable using the search engine method. Though this dimension is the simplest, both the prototype and retrieved documents have similar standard deviations (16.2 and 17.9 respectively) and means (57.5 and 57.4 respectively). There is very little correlation between prototype and retrieved values, with a correlation coefficient of just 0.06. Generally, this indicates that most distributions of readability are ‘covered’ online, and the one used in the BNC is easily represented.

Figure 6.8: Prototype/document word counts

Word counts are altogether more complex. The word count is distributed in the BNC in a far more complex manner (Figure 6.9), and the method of sampling can be reasonably expected to impact this more than readability. Figure 6.8 shows that the distribution of word counts online (conditioned on genre) is far from that in the BNC: the line of intended fit fails to intersect with longer documents. The distribution plots in Figure 6.9 display the obvious missing portion of the word count target distribution. This is a clear example of the bias of online documents, which tend to be paginated even where the content is particularly long (a trend that is exacerbated by advertising-based funding models). The bimodal distribution of the BNC’s word counts can be attributed to interactions with ‘domain’, with W_world_affairs and W_imaginative forming the bulk of the longer texts.

[Figure: overlaid density curves of document length for the BNC and downloaded corpora (N = 2642, bandwidth = 4284).]

Figure 6.9: Distribution of document lengths in the written BNC and downloaded corpora

     Genre                           Prototype   Downloaded   TP    TPR (%)
1    W_ac_humanities_arts            2180        1089         181   8.30
2    W_ac_medicine                   987         1162         415   42.05
3    W_ac_nat_science                684         1258         160   23.39
4    W_ac_polit_law_edu              4424        1467         590   13.34
5    W_ac_soc_science                3112        1074         290   9.32
6    W_ac_tech_engin                 466         1267         129   27.68
7    W_admin                         195         1038         23    11.79
8    W_advert                        856         4501         273   31.89
9    W_biography                     2979        1736         221   7.42
10   W_commerce                      2952        2058         464   15.72
11   W_email                         46          882          3     6.52
12   W_essay_school                  1           871          0     0
13   W_essay_univ                    246         23           0     0
14   W_fict_poetry                   815         460          42    5.15
15   W_fict_prose                    10487       1844         888   8.47
16   W_hansard                       161         114          39    24.22
17   W_institut_doc                  572         1952         131   22.90
18   W_instructional                 111         941          54    48.65
19   W_letters_personal              124         611          9     7.26
20   W_letters_prof                  227         396          9     3.96
21   W_newsp_brdsht_nat_arts         764         3470         147   19.24
22   W_newsp_brdsht_nat_commerce     792         484          100   12.63
23   W_newsp_brdsht_nat_editorial    212         909          23    10.85
24   W_newsp_brdsht_nat_misc         1544        896          45    2.91
25   W_newsp_brdsht_nat_report       854         925          73    8.55
26   W_newsp_brdsht_nat_science      235         284          6     2.55
27   W_newsp_brdsht_nat_social       881         259          6     0.68
28   W_newsp_brdsht_nat_sports       639         503          97    15.18
29   W_newsp_other_arts              216         1345         49    22.69
30   W_newsp_other_commerce          91          466          18    19.78
31   W_newsp_other_report            756         933          52    6.88
32   W_newsp_other_science           327         196          3     0.92
33   W_newsp_other_social            904         705          41    4.54
34   W_newsp_other_sports            26          540          4     15.38
35   W_newsp_tabloid                 45          482          2     4.44
36   W_news_script                   479         121          3     0.63
37   W_non_ac_humanities_arts        1610        1665         193   11.99
38   W_non_ac_medicine               384         1605         142   36.98
39   W_non_ac_nat_science            1438        1534         258   17.94
40   W_non_ac_polit_law_edu          2891        1428         463   16.02
41   W_non_ac_soc_science            2210        1516         134   6.06
42   W_non_ac_tech_engin             2578        807          513   19.90
43   W_pop_lore                      3371        6918         662   19.64
44   W_religion                      838         1942         400   47.73

Table 6.4: Document counts by genre for prototypes and downloaded documents.

[Figure: scatter plot of downloaded document frequency against prototype frequency for each genre.]

Figure 6.10: Prototype genre frequencies and resulting downloaded document frequencies. Shaded area indicates ‘over-downloaded’ categories.

[Figure: density of the per-genre difference between prototype and downloaded frequencies (N = 44, bandwidth = 383.6).]

Figure 6.11: The distribution of differences in genre frequency between prototype and downloaded documents.

Genre was the only parameter explicitly controlled-for by the search engine retrieval method. Precise counts for each of the prototype and retrieved genres are provided in Table 6.4. Figure 6.10 shows the prototype vs. downloaded genre frequencies. Those genres above the line are under-represented in the downloaded documents, relative to the prototype documents40. When selecting a full corpus using the search engine lookup method, differences between the prototype and downloaded document classes should be minimal—when selecting a supercorpus, care must be taken to ensure that items in the top-left of the plot have sufficient frequencies for the intended purpose.

40 This measure is distinct from precision/recall in that it does not care which individual documents account for the frequencies, only that the search has retrieved appropriate proportions of each.

Evidence for the ‘directedness’ of genre-based retrieval using n-grams is given by the dispersion of data around the red line. Rank correlation between prototype and retrieved documents is 0.45, and a Mann-Whitney test is not significant (U = 786.5; p ≈ 0.13 > 0.05), indicating that errors in retrieval are loosely centred around the prototype’s frequencies. This is reinforced by inspecting the density plot for differences between category frequencies, shown in Figure 6.11. Treating the searching process as a classifier allows us to examine the expectation of retrieving a genre successfully. 7,355 documents were returned with the desired genre, leading to a true positive rate of 13.2%. This implies retrieval results close to the worst case of one relevant document per search, and explains the low correlation between predicted and resultant class types. Note that the number of useful documents is above this thanks to re-examination of those previously returned. Per-class retrieval rates are displayed in the right-most column of Table 6.4. There is correlation between the size of a genre and its true positive rate (ρ = 0.29), and inspection of the table indicates that performance is highly variable. Particularly low retrieval rates (below 5%) were reported for ten categories. Many of these are very ‘general’, defined by their context rather than topic: such categories both resist keyword-based retrieval using search engines, and make classification itself difficult once retrieved. The notable exceptions are the _science and _social news categories, retrieval rates of which are likely to be affected by the anachronistic nature of the keyword source (the BNC). Notably, though retrieval rates were low for these categories, documents were available in all cases; they were simply not the result of that particular search iteration. Though it is not possible to separate classifier error from retrieval error without construction of a gold standard corpus (something that lies beyond the scope of this work), earlier evaluations of the classifier used (Section 6.2.3) show that classifier error is not strongly biased. This allows us to conclude that evidence of significant bias in the resultant data is largely a product of the searching process, rather than document classification. Finally, keywords were computed relative to the original input corpus using log likelihood. These offer insight into the specific biases of the data source and a handle on qualitative differences between sampling processes. The top 200 keywords41 were inspected for the whole supercorpus relative to the seed corpus. Overrepresented tokens were scored separately to under-represented ones in accordance with Baron et al. [13], and the lists were free-coded for prominent themes. The 20 most over- and under-represented terms are displayed in Table 6.5. The full lists are included in Appendix F. Overrepresented terms may be grouped into a number of categories. The first of these is technical terms: those that are side-effects of the tokeniser or otherwise are related specifically to the web. This includes features such as protocols (HTTP), and features of URLs (co, org), as well as terms such as email and posted which have simply become more prevalent since the BNC was produced. It is difficult to say whether or not a more recent seed corpus would eliminate most of these from the keyword list.

41 I recognise that inspection of the top n keywords is not a hugely sound methodology, but it should prove indicative for these purposes.

Over-sampled
term      original    retrieved    loglik
p         29217       8839489      1977900.5
s         17778       1903037      357245.2
she       290860      916731       136365.1
gt        111         535127       134335.1
com       172         465891       116044.3
www       2           448772       113981.6
amp       668         432633       102780.7
lt        792         425436       99921.8
had       359670      1439973      94877.1
h         8094        555374       90772.7
t         9078        557616       87217.1
her       277189      1070672      80897.8
2015      25          293434       74145.0
2014      16          254997       64517.4
he        523177      2629585      59109.8
was       722425      3866921      58399.0
cent      34353       44410        49457.4
the       5021789     33346038     47673.9
be        524709      2771036      45819.9
it        727487      4058706      45306.4

Under-sampled
term            original    retrieved    loglik
she’d           7338        4050         -17407.3
darlington      5559        1521         -16657.0
ec              6091        5531         -11220.7
solicitors      2135        1086         -5237.0
swindon         2182        1409         -4825.7
kinnock         1375        264          -4466.7
theda           838         32           -3297.3
nizan           756         6            -3146.1
cnaa            742         1            -3140.1
guil            755         43           -2886.5
maastricht      1165        593          -2856.9
oesophageal     957         279          -2820.0
lexical         1106        514          -2809.1
robyn           1200        707          -2767.5
lamont          1184        708          -2712.6
risc            1013        416          -2689.4
knitting        1509        1463         -2668.9
athelstan       1060        516          -2645.3
privatisation   1104        665          -2521.1
leapor          626         18           -2502.2

Table 6.5: The 20 most over- and under-represented terms relative to the BNC.

Temporal features dominate the rest of the list, with many terms related to date or proper nouns (Obama, Nike). Dates feature prominently: 2001-2012 are all mentioned, along with 1991 and 1988. Genetics is also mentioned (gene, mutation), along with education (university, pupils) and government (government, council). The remaining terms are pronouns or function words. These, along with the large number of quantities and references to time/date, imply consistency with other WaC datasets, that is, an overrepresentation of news online[169]. This central theme is interesting to contrast against the BNC, which contains a large quantity of news material already, indicating that news genres are even more prevalent online, or that news media (or perhaps just online news) has become more extreme (that is, less similar to other written genres) than the bulk of the BNC data. Features that are significantly under-represented predominantly fall into the category of proper nouns, and these are mostly political and from the UK (Kinnock, Gummer, Heseltine). Since the search engine was instructed to search the ‘GB’ marketplace, many of these are most likely explained by temporal effects—the prevalence of these keywords also points at the strength of news within the BNC, along with places and organisations that featured in the news at the time (TUC, Maastricht, DTI). There is also a strong theme of legal terms (offeror, solicitor, arbitrage), which points at a clear under-representation of legal documents returned by search engines. This is likely a technical limitation of the retriever, which avoids files in PDF format. Some technical terms are also presented in the keyword list due to the speed of progress in that field: 80486 and microcomputer both showing their age.

Finally, there are a number of gastric terms (oesophageal, pylori, duodenal), which point to a very specific set of documents missing in the returned data. These terms are from an academic context that, like legal documents, is under-represented for technical reasons; however, it is also likely that the specificity of this category indicates an over-representation within the BNC to some degree. The relative obscurity of the under-represented terms and the predictable, largely temporal, nature of the over-represented ones is particularly encouraging, indicating that the web is a suitable source for many general-purpose (or current-affairs-focused) applications.
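
For reference, the per-term scoring behind Table 6.5 can be sketched as a standard two-corpus log-likelihood, signed by the direction of the difference in relative frequency; this is the general keywording calculation rather than the exact script used to produce the table.

import math

def keyword_loglik(freq_seed: int, size_seed: int,
                   freq_web: int, size_web: int) -> float:
    """Signed log-likelihood for one term across the seed and retrieved corpora.

    Positive scores indicate over-representation in the retrieved (web) data
    relative to the seed corpus; negative scores indicate under-representation.
    """
    combined = (freq_seed + freq_web) / (size_seed + size_web)
    expected_seed = size_seed * combined
    expected_web = size_web * combined
    ll = 0.0
    if freq_seed > 0:
        ll += 2 * freq_seed * math.log(freq_seed / expected_seed)
    if freq_web > 0:
        ll += 2 * freq_web * math.log(freq_web / expected_web)
    sign = 1 if (freq_web / size_web) >= (freq_seed / size_seed) else -1
    return sign * ll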

6.5 Discussion

The evaluations above form a white-box exploration of the characteristics of components in a context similar to many real-world demands. Overall, performance is encouraging: though there are significant challenges to the accuracy of classification and retrieval, combination of the three main stages is able to retrieve a corpus with only minor changes from its target input. Firstly, the performance of classifiers when operating on text must be established relative to the research question. In this case, that was particularly tricky for a number of practical reasons which shadow the classical manual corpus construction issues:

• Linear separability of dimensions such as readability was insufficient to create a classifier with good precision/recall scores;

• The complexity of some sampling criteria requires the use of black-box models that are themselves unpredictable, and require training using auxiliary data.

Both of these are common cases when dealing with ‘human’, research-aligned sampling criteria, and both are rightly recognised as key areas for improvement in text processing. Any deficiencies in real-world performance of this classification process must be assessed on a case-by-case basis: for certain sampling criteria, classification is made trivial by the existence of metadata (for example the recipient fields in emails) or existing mechanisms online for directed retrieval. This limited classification capacity comes with a silver lining, however: unlike other bootstrapping approaches, it is possible to separate and independently evaluate the performance of any classifier used. In turn this makes it possible to re-use others’ research in this area and make incremental improvements over time. Secondly, classification accuracy is improved by application to well-specified problems of limited scope: a priority that aligns well with the core values of a good sampling scheme. The evaluation shown above indicates that errors in the models chosen here are largely unbiased, meaning they only minimally impact on the utility of a supercorpus. Examination of the BNC categories used here also reveals the loose nature of classification therein: many categories are defined according to fairly fluid concepts of varying specificity, and many overlap significantly. This is evidenced by the relative performance of classifiers built with minor changes to their classes, and by the keywords used by the Bayesian models. Further work is required to define a useful yet strictly-defined set of genre categories—this may be best performed by those creating corpora if no community-wide agreement can be made on salient categorisation schemes.

The bootstrapping process, separated into a process of building a prototype corpus, has been shown to converge within a timeframe that is compatible with automated retrieval. For the BNC example outlined here, roughly 12 times the input corpus size is required to converge with 95% confidence. This coefficient is based on the complexity of the input corpus, making the technique better suited to simpler corpora, or those with lower variance parameter spaces (such as special-purpose corpora, or corpora built primarily around small numbers of parameters). The more complex each document, the more difficult it is to retrieve. This relationship is unfortunate, as it is more likely that a greater number of documents will be required in order to construct a corpus. One way around this is to sample using a different sampling unit. The method presented here is capable of word-, paragraph-, and sentence-level retrieval, something that also brings significant methodological benefits by reducing or eliminating the need for statistical corrections to compensate for the dependent nature of words.

Automated retrieval is the part of the method with the greatest number of practical challenges. The method used here is deliberately comparable to the bootstrapping approach used by BootCaT[14], in part to use the prototype corpus approach for an evaluation of search engine retrieval methods. The evaluation here focuses on genre-based retrieval. This is of particular salience to many linguistic research questions, and is well aligned to search engine indexing behaviour—an ideal automated retriever would go far beyond this to use document metadata and different sources of text (such as academic repositories). Keyword-based search engine retrieval is shown, in the case of the BNC, to have roughly central error—this is encouraging, though it is unclear how well this holds for other corpora. Of particular interest is the application of the method to special-purpose corpora, which are (by design) strongly biased. It should be possible to use the evaluation methods presented here to gradually reduce this retrieval error, refining search-based techniques in the process. We have also observed the systematic over-sampling of certain genres. This implies that automated retrieval must advance in order to sample many corpus designs, or that it must be complemented by manual methods where documents are difficult to locate. The exact thresholds of ‘difficulty’ there must be determined on operational grounds of cost and time: for large-scale corpus building operations, a partially manual approach may be the best method of ensuring quality whilst still leveraging the benefits of the quantitative profiling and resampling approaches. This method provides a mechanism for inspecting and evaluating these biases, and repetition with different corpus designs would reveal how pervasive they are. This is a key difference between the method presented here and existing web-based corpus retrieval, which is largely monolithic.

Ultimately, this evaluation has shown the method to be largely suitable for constructing general purpose corpora online, in a manner similar to existing tools. Insights gleaned from the examination of each component indicate that it is also suitable for certain special-purpose corpora, and that its output may be generalisable to subsamples of the BNC. Quite how far it is possible to ‘push the envelope’ of genre distribution is unclear, though this must be established through repetition and comparison to many known targets. Currently, the ability to do that is hampered by the state of classification and retrieval technology, though neither of those are intractably difficult to improve, and both yield value to many users. Ultimately, even if a semi-manual document selection method is used for the final stage, this method is still significantly faster and easier than full manual selection. The BNC-based corpus sampled here, for example, was retrieved automatically over the course of roughly a week, with minimal user interaction. The primary value of this method is the decoupling of retrieval and resampling mechanisms from the underlying data. This allows proportions of data to be designed manually without providing a ‘seed corpus’, meaning it is possible to construct corpora automatically simply from a sample design.

6.6 Summary

This chapter, and the one preceding it, have specified and tested a generalisation of many web-based corpus building systems. The intent is that this method is split into modules which are easily tested in isolation and which represent separable concerns within the corpus-building process. This evaluation has focused on a common case of constructing a corpus based on an existing set of metadata, following the stages of building a transferable profile for the BNC, resampling a set of prototype documents, and then retrieving a corpus using the Bing search engine. A white-box approach provides some insight into the transferability and utility of each stage in a real-world context, and the selection of corpus and retrieval methods (in line with many existing methods) provides some transferability.

The classification accuracy for the parameters chosen in the BNC was variable: the task of classifying genres is a challenge for current machine learning methods, and this seems set to continue, especially where fine-grained categories are used. This challenge alone implies that the method is better suited to special-purpose corpora, or problems that do not require controlling for genre42.

42 Though doubtless very few such problems exist.

Bootstrapping to retrieve prototype documents is predictable, and indicates that corpus sizes are not intractably large for modestly complex input distributions. The complexity (and thus the required number of documents) increases massively with the addition of new dimensions, meaning again that simpler designs are better suited to quick retrieval. Manual simplification of the distribution (such as binning continuous distributions like word counts) offers a way to control this manually whilst managing losses in accuracy.

The final, retrieval, stage is particularly challenging. The retrieval mechanism here exhibits similar properties to that used in BootCaT, mainly the over-representation of news sources, and this seems unlikely to change for other inputs. The prototype-based approach, however, offers a mechanism for further investigation into the bias of the retrieval mechanism, and there is significant further work necessary to evaluate different methods. Investigation of retrieval errors, repeated with different study designs, could eventually be used to build up a picture of search engines’ suitability for a given task.

For the BNC-based task outlined here, the final output corpus was largely a suitable supercorpus. The next stage for any user of this corpus would be to discard any documents with more error than is deemed damaging for a given study design, something I am unable to do here without assuming a large number of theoretical conditions. Areas under-represented in the retrieved data are largely technical—these could be retrieved by sourcing data from places other than a general-purpose search engine.

Ultimately, this method was able to produce a BNC-like corpus quickly and with minimal end-user interaction, in a manner that is described fully as a single, human-readable file. This opens up opportunities to repeat sampling runs and produce sample designs without having to rely on existing corpora as a source of ‘seed’ terms, and offers a way to evaluate the corpus construction process with known margins for error. The externally-defined nature of the corpus definition also relieves users of many issues surrounding dissemination of corpora, easing documentation, licensing, and bandwidth requirements.

Chapter 7

Conclusions & Review

This thesis has examined the problem of improving corpus sampling from a number of perspectives. This chapter will serve to summarise these, before offering an overview of significant areas for further research. Chapter 2 provided an overview of current corpus construction techniques from two main perspectives: the former is that of corpus linguistics itself, which has debated issues of representativeness largely from the point of view of improving existing corpora. The latter is that of sampling in other fields (still largely the social sciences), which is able to provide a more theoretical framework. Comparisons between these two yield a number of issues with current corpora, largely focusing on their population definition, and the relationship between expert opinion and demographic frequency. New recording technologies, and sources of data such as the web, offer ways to make some progress on both of these. Chapter 3 presents a motivating study of link rot in open-source corpora, finding large variations in document half-life using a URL-seeking method. This illustrates that temporal effects have a significant impact on the availability of data, and that such effects vary according to source. Much literature exists that has related these to layout features (such as the role of navigation and landing pages), yet the impact this has upon linguistic content is less well understood. In order to make such investigation possible, a cohort sampling approach is necessary, performed at a high resolution over a long period of time. The LWAC tool automates this process, providing a mechanism to construct large-scale corpora that are indexed by time as well as URL. Key to the utility of such a tool is its performance, which was shown to be adequate to construct corpora in the millions of documents, sampled daily. This wealth of data enables analysis not only of ‘simple’ availability, but also changes through time due to editorial processes or site redesigns. This may allow sociolinguistic analysis surrounding specific events, or a more general picture of long-term language change through time. It is hoped that the latter of these can be used as auxiliary data to inform stratification of future general-purpose corpora (or at least their web-based components).


Chapter 4 presents a case study detailing a short-term linguistic census of a single subject. The sample design used therein is a 'narrow and deep' design, orthogonal to the 'broad and shallow' coverage of the large national corpora. This makes it particularly useful for judging rationally how well such data apply to individuals, and offers a way to retrieve metadata not found using conventional data gathering methods. Such a detailed sample also provides a mechanism for reasoning about problems of corpus size: the data observed in just two weeks for a single subject yielded almost a million words. Assuming significant inter-person variation for a given linguistic feature, this implies that a population of 60 million people cannot be well represented by a corpus of just a few million words. Quantitative knowledge of inter-person variation is needed to operationalise this, but such data could be gathered using some of the automation methods presented.

Such methods were largely taken from lifelogging, a field chosen due to its focus on unobtrusive, best-effort, and wide-ranging sampling techniques. These proved to be significantly more troublesome than expected: though automation was assistive, much manual correction was necessary to operationalise the data, including the need for a human-interest model to correct summary statistics that were gathered at a low resolution.

Chapters 5 and 6 cover a method for 'profiling' a corpus purely based on independent variable specification. This profile has many potential uses, both as human documentation for a corpus and as a machine-readable descriptor, and constitutes a way to gradually refine corpus designs over time. The system is designed to extend and generalise the approach taken by BootCaT in order to make its selection criteria easier to operationalise. It is modular, allowing those building a corpus to re-use algorithms and taxonomies in a simple manner: essentially, the aim is that the tool should work using the same concepts and distinctions as the desired sample design. Monte Carlo techniques were used as a method for operationalising the profile, and were shown to achieve convergence to the desired distribution within a tractable period for a BNC-like corpus of written text. The classifiers required to identify documents for final selection were shown to be workable, but their performance was borderline for the more complex dimensions such as genre; this is in part due to the ambiguity in the test data's taxonomy, and in part due to the difficulty of large-scale classification tasks. Such classification tasks are of wide utility, however, implying that improvement in this area will not require study specific to corpus construction. Retrieval of documents according to the prototypes was shown to be perhaps the biggest challenge. This is in part due to the classic corpus sampling problem of being unable to locate documents in a uniform and reliable manner, and in part due to the biases seen online. A number of methods were presented to work around these issues, such as manual selection or use of manually-curated web indices, and evaluation of these remains as further work.
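As a minimal illustration of the Monte Carlo step described above (and not of ImputeCaT's own implementation), the following Ruby sketch draws document 'slots' from an invented target genre distribution and stops once the running sample is close to the target; the genre labels, proportions, and convergence threshold are all assumptions made for illustration.

# Invented target proportions over three of Lee's genres.
target = { 'W_fict_prose' => 0.55, 'W_ac_soc_science' => 0.25, 'W_newsp_tabloid' => 0.20 }

# Draw one label from a categorical distribution.
def draw(dist)
  r = rand
  dist.each { |label, p| return label if (r -= p) <= 0 }
  dist.keys.last
end

counts = Hash.new(0)
n      = 0
tvd    = 1.0
while n < 50_000 && (n < 1_000 || tvd > 0.01)
  counts[draw(target)] += 1
  n += 1
  # total variation distance between the running sample and the target
  tvd = 0.5 * target.sum { |label, p| (p - counts[label].to_f / n).abs }
end
puts "stopped after #{n} draws, total variation distance #{tvd.round(4)}"

In the full method each drawn prototype would additionally carry values for every other profiled dimension, which is where the dimensionality pressure discussed in Chapter 6 arises.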

To revisit the problem of representativeness, this thesis has taken the view that any such question must be framed in terms of a larger study design. This is not to say that general-purpose corpora are impossible to create: sampling larger populations is such a challenging and resource-intensive endeavour that it will most likely always have to be offloaded onto a third party. The aim, then, is to provide a general-purpose corpus that is uniformly representative for a wide, and well-specified, set of research questions. This corpus should be sampled using random sampling techniques, and this thesis' tools have been designed to inform stratification efforts with this in mind. Such a corpus should also be unambiguously and comprehensively documented: assumptions should be made explicit in corpus documentation so that they may be compared against those made by researchers, and the limitations of the intended use of a corpus should be stated in quantitative terms. The contributions within this thesis have been targeted to provide some insight into the underlying population that is not possible using current techniques.

In accordance with RQ1, a number of sampling methods have been rendered accessible to corpus linguistics by the tools presented here. Longitudinal sampling is now available as a strictly regulated cohort design with many quantitative regressors, and quantified, operationalisable stratification is possible through use of the corpus profiling tools in Chapters 5 and 6. WaC methods are used in two ways: both as a way of constructing corpora outright, and as a mechanism to inform other corpora. The former of these, the focus of RQ2, is addressed by Chapters 3 and 4, both of which use the web as an easy source of information to construct corpora that are able to inform design choices for more conventional approaches. Chapters 5 and 6 focus on being able to operationalise any insights from this, providing an opportunity to incrementally improve sampling for general-purpose corpora (the focus of RQ3). Specifically, the contributions of this thesis are:

• Tools for sampling documents from the web according to a cohort sampling, longitudinal design;

• A method for sampling 'personal corpora' as language is used, and a discussion of the implications this has for conventional corpus building;

• A method for resampling corpus data stratified according to user-selected linguistic variables, and representing the distribution therefrom;

• Software tools for automatically reconstructing corpora from designs described using said profiles.

7.1 Further Work

As much of this thesis has been exploratory, there is a wealth of further work necessary to maximise the value therein. While each chapter contains numerous notes in context, this section offers a high-level summary of the major avenues of inquiry.

7.1.1 Large-Scale Longitudinal Document Attrition Studies

A large-scale longitudinal sample of WaC sources is one obvious extension to the work in Chapter 3 (indeed, one such study was underway, but due to technical failures the results were not ready for publication here). Such a study would be able to answer questions not only on the rate of decay (as has been addressed elsewhere), but also on the shape of the decay curve, hitherto presumed to be exponential for most types of data. Further to this, there are many web page changes that do not constitute 'link rot', in that the original information may still be present, but are still interesting for various linguistic and social questions: for example, how often documents are changed in response to news events, or have their boilerplate changed in order to affect the context in which they are presented. Ultimately, developing a linguistic understanding of change through time will require significant work; however, a single large sample could provide useful information to those seeking to use large existing web or monitor corpora.
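As a small illustration of how the shape question could be probed, the sketch below (in Ruby, with invented availability figures) computes the half-life implied at several observation points under an assumed exponential decay; systematic drift in the implied half-life would indicate a non-exponential curve.

# Invented availability figures: fraction of an initial cohort of URLs still
# resolving after t days.
survival = { 30 => 0.95, 90 => 0.86, 180 => 0.74, 365 => 0.55 }

survival.each do |days, fraction|
  rate      = -Math.log(fraction) / days   # decay rate if S(t) = exp(-rate * t)
  half_life = Math.log(2) / rate
  puts "observed at day #{days}: implied half-life #{half_life.round} days"
end
# Roughly constant implied half-lives are consistent with exponential decay;
# a systematic drift would point to a differently shaped curve.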

7.1.2 LWAC Distribution

LWAC’s distributed design lends it particular powers in distinguishing web access issues from a geographical perspective. With the low price of virtual machine rent, a network of LWAC clients could easily be constructed to access a number of websites from differing locations. Such a configuration would reveal geographical changes made by site administrators for reasons of editorial control, censorship, and interference by service providers (Phorm[33], for example). This would open an avenue for investigation of a number of social factors, particularly as the (time-series) results could be compared and correlated with many real-world regressors, such as the incidence of political events. Such approaches are already being widely used in social media analysis[1, 27], though the high number of societal covariates renders such studies difficult to validate.

7.1.3 Slimmed-down Personal Corpora

A natural progression of the work in Chapter 4 is its repetition with more subjects. This is troublesome in part due to the in-depth coverage of language sampled, something that is likely to be less than critical for many research questions (such as those primarily concerned with 'standard' day-to-day interactions, or simply with language used at a computer). There are many mechanisms which could be used to establish a provisional model of demographic linguistic variation based on partial sampling of each subject, perhaps elective and smartphone-based, down to the level of simple questionnaires. This dataset could be added to gradually, and used to guide large diachronic sampling efforts as a form of auxiliary data.

This would require work to formalise and proceduralise the data gathering methods presented, in order to discount those with the least 'effort efficiency' and maximise automation.

The continuing ubiquity of smartphone technology offers a vehicle through which to distribute such methods and collect the resulting data.

Operationalising the results of such sampling is also an open question, though such work would largely require the application of existing stratification and re-weighting techniques. This would allow an experimenter to select a small sample demographic, and then use a general-purpose corpus to augment it with larger quantities of linguistic information in a representative manner. In order for these methods to be useful at a large scale, they should be informed by further work on power analysis, which is currently in its infancy in corpus linguistics.
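For the re-weighting step, something as simple as post-stratification would be a starting point. The following sketch (with invented strata and proportions) computes per-stratum weights as the ratio of an assumed population proportion to the observed sample proportion.

# Assumed population and observed sample proportions for three age strata.
population = { '18-30' => 0.25, '31-50' => 0.35, '51+' => 0.40 }
sample     = { '18-30' => 0.50, '31-50' => 0.30, '51+' => 0.20 }

weights = population.to_h { |stratum, p_pop| [stratum, p_pop / sample[stratum]] }
weights.each { |stratum, w| puts format('%-5s weight = %.2f', stratum, w) }
# Per-subject observations are multiplied by their stratum weight before
# aggregation, so over-sampled strata count for less and vice versa.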

7.1.4 Augmented Personal Corpora

Metadata distributions from personal corpus retrieval may be used as input to the ImputeCaT sampling techniques. Further, a corpus designed according to this method would be simpler to distribute, and would be scalable without requiring long-term data gathering from subjects. As corpus proportions are encapsulated in the corpus description, they may be inspected and modified quantitatively. This allows for incorporation of data from other profiles, in order to balance data demographically, or from entirely exogenous sources (for example, document publication dates as taken from LWAC). Control over these may offer viable approaches to retargeting a corpus for a given task, or to updating it to suit changes in the population. One of the major challenges here lies in improving the classification tools used within ImputeCaT so that metadata on a personal corpus may be accurately retrieved. This is perhaps simpler for web history data and documents of digital origin, though manual retrieval of documents may be able to expand this to other published work.
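As an illustration of how proportions from different sources might be combined, the sketch below mixes two genre distributions under a user-chosen weight and renormalises; the categories, proportions, and weighting scheme are assumptions, not part of ImputeCaT.

# Mix two genre distributions with weight `weight_a` on the first.
def mix(dist_a, dist_b, weight_a)
  keys  = dist_a.keys | dist_b.keys
  mixed = keys.to_h do |k|
    [k, weight_a * dist_a.fetch(k, 0.0) + (1 - weight_a) * dist_b.fetch(k, 0.0)]
  end
  total = mixed.values.sum
  mixed.transform_values { |v| v / total }   # renormalise
end

personal = { 'W_email' => 0.4, 'W_pop_lore' => 0.5, 'W_ac_soc_science' => 0.1 }
profile  = { 'W_pop_lore' => 0.3, 'W_fict_prose' => 0.6, 'W_ac_soc_science' => 0.1 }

p mix(personal, profile, 0.5)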

7.1.5 Improved ImputeCaT Modules

The performance of the ImputeCaT corpus profile retrieval tool is currently constrained primarily by two factors:

• The accuracy of classifiers used to identify document distributions and class membership during profile generation and candidate document selection;

• The specificity of retrieval methods.

The former of these is the subject of much ongoing research, and can be expected to progress gradually due to its utility in many areas. It remains challenging, however, due to the use of web data. Appendix E shows a preliminary evaluation of the BNC45 classifier used in this thesis as applied to web data. The design of ImputeCaT is such that such improvements may be leveraged easily to provide continuing accuracy gains.

The latter is more challenging: though ultimately corpora can be retrieved manually (incurring only the error seen in inter-annotator scores for a given taxonomy [161]), the scalability of automated corpus construction is borne largely of its ability to retrieve documents online in an unsupervised manner. This is currently limited by the search-engine-based retrieval mechanism demonstrated here. Many options exist, such as the use of web directories, supercorpora, semi-supervised topic modelling, or crowdsourcing. Each of these should be evaluated relative to typical corpus construction tasks in order to determine its suitability for given study designs: indeed, it is likely that some exhibit biases such that they are unable to replicate certain samples.

7.1.6 Bias Estimation

Finally, the convergence-based design of profile-guided retrieval offers the first few avenues through which to explore the bias of sources. Currently, bias in retrieval methods is conflated with classification error, something that is also far from negligible. The approach of seeking a series of known document properties, however, leads to the obvious method of simply measuring how 'difficult' it is to retrieve a document given certain parameters (ideally, measuring this difficulty for a typical human). Differences between this difficulty-of-retrieval metric and an equivalent measure based on human performance should yield the biases in online sources. Such a measure would be highly dependent on many unstable variables, however, such as the retrieval methods used, the sample design, and temporal or contextual factors. Nonetheless, this offers a real-world take on representativeness without reliance on variance and homogeneity measures. As ever, these methods are dependent upon finding a human analogue task that acts as a well-agreed-upon baseline: ultimately, without such a research question, the question of representativeness is unanswerable.
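A crude version of such a difficulty-of-retrieval measure might simply be the number of candidate documents examined per accepted document in each stratum, compared against an assumed human baseline, as in the sketch below (all figures invented).

# Invented retrieval counts: candidates examined and accepted per stratum,
# plus an assumed human baseline (candidates a person would need to inspect).
examined = { 'W_ac_medicine' => 820, 'W_pop_lore' => 60, 'W_fict_prose' => 45 }
accepted = { 'W_ac_medicine' => 10,  'W_pop_lore' => 12, 'W_fict_prose' => 15 }
baseline = { 'W_ac_medicine' => 20.0, 'W_pop_lore' => 5.0, 'W_fict_prose' => 4.0 }

examined.each_key do |stratum|
  difficulty = examined[stratum].to_f / accepted[stratum]
  puts format('%-15s difficulty %.1f (%.1fx the human baseline)',
              stratum, difficulty, difficulty / baseline[stratum])
end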

C’est fini!

Bibliography

[1] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Predicting flu trends using Twitter data. In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on, pages 702–707, April 2011.

[2] B. Alex, C. Grover, E. Klein, and R. Tobin. Digitised historical text: Does it have to be mediocre? In Proceedings of KONVENS, pages 401–409, 2012.

[3] A. L. Allen. Dredging up the past: Lifelogging, memory, and surveillance. The University of Chicago Law Review, pages 47–74, 2008.

[4] L. Anthony. AntConc (version 3.2.2) [Computer software]. Tokyo, Japan: Waseda University, 2011.

[5] UK Data Archive. http://www.data-archive.ac.uk/. Accessed 2015-08-19.

[6] G. Aston. Text categories and corpus users: A response to David Lee. Language learning and technology, 5(3):73–76, 2001.

[7] S. Atkins, J. Clear, and N. Ostler. Corpus design criteria. Literary and linguistic computing, 7(1):1–16, 1992.

[8] R. H. Baayen and F. J. Tweedie. Sample-size invariance of LNRE model parameters: Problems and opportunities. Journal of Quantitative Linguistics, 5(3):145–154, 1998.

[9] P. Baker. The BE06 corpus of British English and recent language change. International journal of corpus linguistics, 14(3):312–337, 2009.

[10] J. Bar-Ilan and B. Peritz. The life span of a specific topic on the web. Scientometrics, 46(3):371–382, 1999.

[11] A. Barbaresi. Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources. EACL 2014, 2014.

[12] V. Barnett. Sample survey principles and methods. Edward Arnold London, 1991.

[13] A. Baron, P. Rayson, and D. Archer. Word frequency and key word statistics in corpus linguistics. Anglistik, 20(1):41–67, 2009.


[14] M. Baroni and S. Bernardini. Bootcat: Bootstrapping corpora and terms from the web. In Proceedings of LREC, volume 4. Citeseer, 2004.

[15] M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43(3):209–226, 2009.

[16] M. Baroni and M. Ueyama. Building general-and special-purpose corpora by web crawling. In The 13th NIJL international symposium, language corpora: Their compilation and application, pages 31–40, 2006.

[17] L. Bauer. Manual of information to accompany the Wellington corpus of written New Zealand English. Department of Linguistics, Victoria University of Wellington, 1993.

[18] V. Belikov, N. Kopylov, A. Piperski, V. Selegey, and S. Sharoff. Big and diverse is beautiful: A large corpus of Russian to study linguistic variation. In Proceedings of Web as Corpus Workshop (WAC-8), Lancaster, 2013.

[19] G. R. Bennett. Using corpora in the language learning classroom: Corpus linguistics for teachers. University of Michigan Press Michigan, 2010.

[20] D. Biber. Variation in speech and writing. Chicago: University of Chicago, 1988.

[21] D. Biber. On the complexity of discourse complexity: A multidimensional analysis. Discourse Processes, 15(2):133–163, 1992.

[22] D. Biber. Representativeness in corpus design. Literary and linguistic computing, 8(4):243–257, 1993.

[23] D. Biber. Using register-diversified corpora for general language studies. Computational linguistics, 19(2):219–241, 1993.

[24] Z. W. Birnbaum and M. G. Sirken. Bias due to non-availability in sampling surveys. Journal of the American Statistical Association, 45(249):98–111, 1950.

[25] D. M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, Apr. 2012.

[26] J. M. Blum. Early English Books Online. Reference Reviews, 16(4):5–5, 2002.

[27] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

[28] L. Bowker and J. Pearson. Working with specialized language: a practical guide to using corpora. Routledge, 2002.

[29] G. E. P. Box and M. E. Muller. A note on the generation of random normal deviates. The Annals of Mathematical Statistics, 29(2):610–611, 06 1958.

[30] L. Burnard. Users reference guide British National Corpus version 1.0. 1995.

[31] L. Burnard. Users’ reference guide for the British National Corpus, 1995.

[32] V. Bush. As we may think. SIGPC Note., 1(4):36–44, Apr. 1979.

[33] R. Clayton. The Phorm ‘Webwise’ system. Light Blue Touchpaper, 4, 2008.

[34] J. Cohen. Statistical power analysis for the behavioral sciences (rev). Lawrence Erlbaum Associates, Inc, 1977.

[35] P. Collins and P. Peters. The Australian corpus project. Kytö et al, pages 103–121, 1988.

[36] M. Davies. The 385+ million word corpus of contemporary American English: Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2):159–190, 2009.

[37] M. Davies. The corpus of contemporary American English as the first reliable monitor corpus of English. Literary and linguistic computing, page fqq018, 2010.

[38] K. J. Devers and R. M. Frankel. Study design in qualitative research—2: Sampling and data collection strategies. Education for health, 13(2):263–271, 2000.

[39] J.-P. Dittrich, M. Salles, and S. Karaksashian. iMeMex: A platform for personal dataspace management. In SIGIR PIM Workshop, pages 22–29, 2006.

[40] S. Dollinger. Oh Canada! towards the corpus of early Ontario English. Language and Computers, 55(1):7–25, 2006.

[41] S. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I’ve seen: a system for personal information retrieval and re-use. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 72–79. ACM, 2003.

[42] H. S. Eaton. Semantic frequency list for English, French, German, and Spanish: A correlation of the first six thousand words in four single-language frequency lists. University of Chicago Press, 1940.

[43] D. P. Ellis and K. Lee. Minimal-impact audio-based personal archives. In Proceedings of the the 1st ACM Workshop on Continuous Archival and Retrieval of Personal Experiences, CARPE’04, pages 39–47, New York, NY, USA, 2004. ACM.

[44] P. D. Ellis. The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press, 2010.

[45] S. Evert. A simple LNRE model for random character sequences. In Proceedings of JADT, volume 2004, 2004.

[46] S. Evert. How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik, 54(2):177–190, 2006.

[47] S. Evert and M. Baroni. zipfr: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 29–32. Association for Computational Linguistics, 2007.

[48] F. Farrokhi and A. Mahmoudi-Hamidabad. Rethinking convenience sampling: Defining quality criteria. Theory and practice in language studies, 2(4):784–792, 2012.

[49] A. Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google, pages 47–54, 2008.

[50] T. Finin, W. Murnane, A. Karandikar, N. Keller, J. Martineau, and M. Dredze. Annotating named entities in twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages 80–88, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[51] R. Flesch. A new readability yardstick. Journal of applied psychology, 32(3):221, 1948.

[52] W. Francis and H. Kucera. Brown corpus of standard American English. Brown University, Providence, RI, 1961.

[53] R. B. Fuller. R. Buckminster Fuller papers. M1090. Dept. of Special Collections, Stanford University Libraries, Stanford, Calif. Available at http://www.oac.cdlib.org/findaid/ark:/13030/tf109n9832/.

[54] J. Gemmell, G. Bell, and R. Lueder. Mylifebits: A personal database for everything. Commun. ACM, 49(1):88–95, Jan. 2006.

[55] J. Gemmell, G. Bell, R. Lueder, S. Drucker, and C. Wong. MyLifeBits: fulfilling the Memex vision. In Proceedings of the tenth ACM international conference on Multimedia, pages 235–238. ACM, 2002.

[56] J. Gemmell and Z. Wang. Clean living: Eliminating near-duplicates in lifetime personal storage. Technical Report MSR-TR-2006-30, Microsoft Research, 2006.

[57] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337–348, 1992.

[58] B. G. Glaser. The constant comparative method of qualitative analysis. Social Problems, 12(4):pp. 436–445, 1965.

[59] B. G. Glaser. Emergence vs forcing: Basics of grounded theory analysis. Sociology Press, 1992.

[60] Y. Goldberg and J. Orwant. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 1, pages 241–247, 2013.

[61] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.

[62] S. Greenbaum and G. Nelson. The international corpus of English (ICE) project. World Englishes, 15(1):3–15, 1996.

[63] S. Greenbaum and J. Svartvik. The London-Lund Corpus of Spoken English. Lund University Press, 1990.

[64] S. T. Gries. Exploring variability within and between corpora: some methodological considerations. Corpora, 1(2):109–151, 2006.

[65] S. T. Gries. Dispersions and adjusted frequencies in corpora. International journal of corpus linguistics, 13(4):403–437, 2008.

[66] S. T. Gries. Dispersions and adjusted frequencies in corpora: further explorations. Corpus linguistic applications: current studies, new directions, pages 197–212, 2010.

[67] S. T. Gries. Corpus data in usage-based linguistics what’s the right degree of granularity for the analysis. Cognitive linguistics: Convergence and expansion, 32:237, 2011.

[68] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.

[69] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.

[70] C. Haney, C. Banks, and P. Zimbardo. Interpersonal dynamics in a simulated prison. International Journal of Criminology and Penology, 1:69–97, 1973.

[71] A. Hardie. CQPweb—combining power, flexibility and usability in a corpus analysis tool. International journal of corpus linguistics, 17(3):380–409, 2012.

[72] C. Hemming and M. Lassi. Copyright and the web as corpus. Retrieved October 27, 2010.

[73] C. D. Herrera. Two arguments for ‘covert methods’ in social research. The British Journal of Sociology, 50(2):331–343, 1999.

[74] S. Hodges, L. Williams, E. Berry, S. Izadi, J. Srinivasan, A. Butler, G. Smyth, N. Kapur, and K. Wood. Sensecam: A retrospective memory aid. In UbiComp 2006: Ubiquitous Computing, pages 177–193. Springer, 2006.

[75] M. Hoey. Lexical priming. Wiley Online Library, 2005.

[76] M. Hough, P. Mayhew, and G. Britain. The British crime survey: first report. HM Stationery Office, 1983.

[77] M. Hundt, A. Sand, and R. Siemund. Manual of information to accompany the Freiburg-LOB Corpus of British English (’FLOB’). Albert-Ludwigs-Universität Freiburg, 1998.

[78] M. Hundt, A. Sand, and P. Skandera. Manual of Information to Accompany the Freiburg-Brown Corpus of American English (’Frown’). Albert-Ludwigs-Universität Freiburg, 1999.

[79] D. Huynh, D. R. Karger, and D. Quan. Haystack: A platform for creating, organizing and visualizing information using RDF. In Semantic Web Workshop, 2002.

[80] N. M. Ide and C. M. Sperberg-McQueen. The TEI: History, goals, and future. In Text Encoding Initiative, pages 5–15. Springer, 1995.

[81] Justin.tv, Inc. Watch live video. http://web.archive.org/web/*/http://www.jennicam.org/. Accessed 2014-05-16.

[82] Ustream, Inc. Ustream — the leading HD streaming video platform. http://www.ustream.tv/. Accessed 2014-05-16.

[83] P. G. Ipeirotis. Analyzing the Amazon mechanical turk marketplace. XRDS: Crossroads, The ACM Magazine for Students, 17(2):16–21, 2010.

[84] A. N. Jain and A. E. Cleves. Does your model weigh the same as a duck? Journal of computer-aided molecular design, 26(1):57–67, 2012.

[85] T. Järvinen. Annotating 200 million words: the bank of English project. In Proceedings of the 15th conference on Computational linguistics - Volume 1, COLING ’94, pages 565–568, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.

[86] S. Johansson. The LOB corpus of British English texts: presentation and comments. ALLC journal, 1(1):25–36, 1980.

[87] S. Johansson. The tagged LOB corpus: User’s manual. 1986.

[88] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.

[89] A. Joulain-Jay. Coping with errors: Estimating the impact of OCR errors on corpus linguistic analysis of historical newspapers. In ICAME36: Words, Words, Words–Corpora and Lexis, may 2015.

[90] F. W. Kaeding. Häufigkeitswörterbuch der deutschen Sprache. Selbstverlag des Herausgebers, 1897.

[91] G. Kalton. Introduction to survey sampling, volume 35. Sage, 1983.

[92] A. Kehoe. Diachronic linguistic analysis on the web with WebCorp. Language and Computers, 55(1):297–307, 2006.

[93] A. Kilgarriff. Linguistic search engine. In proceedings of Workshop on Shallow Processing of Large Corpora (SProLaC 2003), pages 53–58, 2003.

[94] A. Kilgarriff. Language is never, ever, ever, random. Corpus linguistics and linguistic theory, 1(2):263–276, 2005.

[95] A. Kilgarriff. Googleology is bad science. Computational linguistics, 33(1):147–151, 2007.

[96] A. Kilgarriff and G. Grefenstette. Web as corpus. In Proceedings of Corpus Linguistics 2001, pages 342–344. Corpus Linguistics. Readings in a Widening Discipline, 2001.

[97] A. Kilgarriff and G. Grefenstette. Introduction to the special issue on the web as corpus. Computational linguistics, 29(3):333–347, 2003.

[98] A. Kilgarriff, S. Reddy, J. Pomikálek, and A. Pvs. A corpus factory for many languages. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC-2011), Valletta, Malta. Citeseer, 2010.

[99] M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, and R. Tobin. Scholarly context not found: One in five articles suffers from reference rot. PLoS ONE, 9(12):e115253, 12 2014.

[100] W. Koehler. Web page change and persistence — a four-year longitudinal study. Journal of the American Society for Information Science and Technology, 53(2):162–171, 2002.

[101] W. Koehler. A longitudinal study of web pages continued: a consideration of document persistence. Information Research, 9(2), 2004.

[102] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79–86. Citeseer, 2005.

[103] C. Kohlschütter. Boilerplate: Removal and fulltext extraction from HTML pages. URL: http://code.google.com/p/boilerpipe.

[104] K. Kucera. The Czech national corpus: principles, design, and results. Literary and linguistic computing, 17(2):245–257, 2002.

[105] T. S. Kuhn. The structure of scientific revolutions. Chicago and London, 1970.

[106] K. Kwok. Project Naptha. http://projectnaptha.com/. Accessed 2014-05-16.

[107] M. Kytö. Manual to the diachronic part of the Helsinki Corpus of English Texts: Coding conventions and lists of source texts. University of Helsinki, 1993.

[108] M. Källström. Narrative Clip – a wearable, automatic lifelogging camera. http://getnarrative.com/. Accessed 2014-05-16.

[109] Collins Learning. The Collins corpus. http://www.collins.co.uk/page/The+Collins+Corpus, jul 2015.

[110] D. Y. Lee. Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. 2001.

[111] D. Y. Lee. The BNC index. http://www.uow.edu.au/~dlee/home/corpus_resources.htm, 2003. Excel spreadsheet.

[112] K. Lee and D. P. Ellis. Voice activity detection in personal audio recordings using autocorrelogram compensation. In INTERSPEECH 2006: ICSLP: Proceedings of the Ninth International Conference on Spoken Language Processing: September 17-21, 2006, Pittsburgh, Pennsylvania, USA, pages 1970–1973. ISCA, 2006.

[113] G. Leech. Corpora and theories of linguistic performance. Directions in corpus linguistics, pages 105–122, 1992.

[114] G. Leech. 100 million words of English. English Today, 9(01):9–15, 1993.

[115] G. Leech. New resources, or just better old ones? The holy grail of representativeness. Language and Computers, 59(1):133–149, 2006.

[116] M. D. Lieberman and W. A. Cunningham. Type I and type II error concerns in fMRI research: re-balancing the scale. Social cognitive and affective neuroscience, 4(4):423–428, 2009.

[117] J. Lijffijt, T. Nevalainen, T. Saily, P. Papapetrou, K. Puolamaki, and H. Mannila. Significance testing of word frequencies in corpora. Digital Scholarship in the Humanities, 2014.

[118] S. Lohr. Sampling: design and analysis. Cengage Learning, 2009.

[119] W. Lund and E. Ringger. Error correction with in-domain training across multiple OCR system outputs. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 658–662, Sept 2011.

[120] M. Madden and A. Smith. Reputation management and social media. 2010.

[121] C. Mair. Tracking ongoing grammatical change and recent diversification in present-day standard English: the complementary role of small and large corpora. Language and Computers, 55(1):355–376, 2006.

[122] R. Malaga. Web-based reputation management systems: Problems and suggested solutions. Electronic Commerce Research, 1:403–417, 2001.

[123] S. Mann. Wearable, tetherless, computer-mediated reality (with possible future applications to the disabled). Technical Report 260, 1994.

[124] A. McCallum, K. Nigam, et al. A comparison of event models for naïve Bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Citeseer, 1998.

[125] A. McEnery and R. Xiao. Parallel and comparable corpora: What is happening. Incorporating Corpora. The Linguist and the Translator, pages 18–31, 2007.

[126] T. McEnery and A. Hardie. Research ethics in corpus linguistics. In CL2011 Abstracts, jul 2011. Conference Presentation.

[127] T. McEnery and A. Wilson. Corpus linguistics: An introduction. Edinburgh University Press, 2001.

[128] T. McEnery and Z. Xiao. The Lancaster corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study. Religion, 17:3–4, 2004.

[129] K. McGrath and J. Waterton. British social attitudes 1983-1986 panel survey, 1986.

[130] F. Meunier, S. De Cock, G. Gilquin, and M. Paquot. A Taste for Corpora: In honour of Sylviane Granger. Studies in Corpus Linguistics. John Benjamins Publishing Company, 2011.

[131] S. Meyer zu Eissen and B. Stein. Genre classification of web pages. In S. Biundo, T. Frühwirth, and G. Palm, editors, KI 2004: Advances in Artificial Intelligence, volume 3238 of Lecture Notes in Computer Science, pages 256–269. Springer Berlin Heidelberg, 2004.

[132] H. Moravec. Mind children: The future of robot and human intelligence. Harvard University Press, 1988.

[133] R. M. Neal. Slice sampling. Ann. Statist., 31(3):705–767, 06 2003.

[134] M. Nelson and B. Allen. Object persistence and availability in digital libraries. D-Lib Magazine, 8(1):1082–9873, 2002.

[135] J. Neyman. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):pp. 558–625, 1934.

[136] M. Paquot and Y. Bestgen. Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. Language and Computers, 68(1):247–269, 2009.

[137] T. Pedersen. Empiricism is not a matter of faith. Computational Linguistics, 34(3):465–470, 2008.

[138] J. Pentheroudakis, D. Bradlee, and S. Knoll. Tokenizer for a natural language processing system, Aug. 15 2006. US Patent 7,092,871.

[139] J. Pomikálek. justext: Heuristic based boilerplate removal tool. http://code.google.com/p/justext. Accessed September 2014.

[140] Cambridge University Press. Cambridge announce new spoken corpus. http://www.cambridge.org/about-us/news/spoken-bnc2014-project-announcement/, jul 2014.

[141] W. T. Preyer. The Mind of the Child: The development of the intellect, volume 2. Appleton, 1889.

[142] A. Przepiórkowski, R. L. Górski, B. Lewandowska-Tomaszyk, and M. Lazinski. Towards the national corpus of Polish. In LREC, 2008.

[143] M. Rabin et al. Inference by believers in the law of small numbers. Institute of Business and Economic Research, 2000.

[144] B. Rampton, J. Channell, P. Rea-Dickens, C. Roberts, and J. Swann. BAAL recommendations on good practice in applied linguistics, 1994.

[145] R. Rapp. Using word familiarities and word associations to measure corpus representative- ness. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), 2014.

[146] P. Rayson. From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4):519–549, 2008.

[147] P. Rayson. Wmatrix: a web-based corpus processing environment. 2008.

[148] A. Renouf. Webcorp: providing a renewable data source for corpus linguists. Language and Computers, 48(1):39–58, 2003.

[149] P. Resnik and N. A. Smith. The web as a parallel corpus. Computational Linguistics, 29(3):349–380, 2003.

[150] P. Rossi, J. Wright, and A. Anderson. Handbook of Survey Research. Quantitative Studies in Social Relations. Elsevier Science, 2013.

[151] D. Roy. The birth of a word. http://www.ted.com/talks/deb_roy_the_birth_of_a_word. Accessed 2014-05-16.

[152] R. Sanderson, M. Phillips, and H. Van de Sompel. Analyzing the persistence of referenced web resources with Memento. Arxiv preprint arXiv:1105.3459, 2011.

[153] M. Santini, A. Mehler, and S. Sharoff. Riding the rough waves of genre on the web. In A. Mehler, S. Sharoff, and M. Santini, editors, Genres on the Web: Computational Models and Empirical Studies. Springer, Berlin/New York, 2010.

[154] R. Schäfer, A. Barbaresi, and F. Bildhauer. Focused web corpus crawling. EACL 2014, page 9, 2014.

[155] R. Schäfer and F. Bildhauer. Building large corpora from the web using a new efficient tool chain. In LREC Proc., volume 8, 2012.

[156] R. Schäfer and F. Bildhauer. Web corpus construction. Synthesis Lectures on Human Language Technologies, 6(4):1–145, 2013.

[157] C. Shaoul and C. Westbury. A reduced redundancy Usenet corpus. 2013.

[158] S. Sharoff. Creating general-purpose corpora using automated search engine queries. WaCky, pages 63–98, 2006.

[159] S. Sharoff. Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4):435–462, 2006.

[160] S. Sharoff. Classifying web corpora into domain and genre using automatic feature identification. In Proceedings of the 3rd Web as Corpus Workshop, pages 83–94, 2007.

[161] S. Sharoff. Approaching genre classification via syndromes. In F. Formato and A. Hardie, editors, Corpus Linguistics 2015 Abstract Book, pages 306–308, jul 2015.

[162] S. V. Shastri. The Kolhapur corpus of Indian English and work done on its basis so far. ICAME journal, 12:15–26, 1988.

[163] J. Sinclair. Reflection on computer corpora in English language research. Computer Corpora in English Language Research, pages 1–6, 1982.

[164] J. Sinclair. Looking Up: An account of the COBUILD Project in lexical computing. Collins London, 1987.

[165] J. Sinclair. Corpus, concordance, collocation. Oxford University Press, 1991.

[166] J. Sinclair. Preliminary recommendations on corpus typology. Technical report, EAGLES (Expert Advisory Group on Language Engineering Standards), 1996.

[167] P. Sloan. How to scale Mt. Google. Business 2.0, pages 54–55, may 2007.

[168] A. F. M. Smith and G. O. Roberts. Bayesian computation via the Gibbs sampler and related markov chain Monte Carlo methods. Journal of the Royal Statistical Society. Series B (Methodological), 55(1):pp. 3–23, 1993.

[169] S. Sridharan and B. Murphy. Modeling word meaning: Distributional semantics and the corpus quality-quantity trade-off. In Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon, COLING, pages 53–68, 2012.

[170] W. Stern. Psychology of early childhood up to the sixth year (trans. by A. Barwell from the 3rd edition). New York: Holt, page 557, 1924.

[171] M. Stubbs. Collocations and semantic profiles: On the cause of the trouble with quantitative studies. Functions of Language, 2(1):23–55, 1995-01-01T00:00:00.

[172] D. Summers. Longman/Lancaster English language corpus–criteria and design. International Journal of Lexicography, 6(3):181–208, 1993.

[173] M. Taylor and the Economic and Social Research Council (ESRC) UK Research Centre on Micro-Social Change. British Household Panel Survey User Manual: Volumes B1-B5-Codebooks. 1996.

[174] C. Teddlie and F. Yu. Mixed methods sampling: A typology with examples. Journal of Mixed Methods Research, 1(1):77–100, 2007.

[175] J. Trost. Statistically nonrepresentative stratified sampling: A sampling technique for qualitative studies. Qualitative Sociology, 9(1):54–57, 1986.

[176] A. Tversky and D. Kahneman. Availability: A heuristic for judging frequency and probability. Cognitive psychology, 5(2):207–232, 1973.

[177] D. C. Tyler and B. McNeil. Librarians and link rot: a comparative analysis with some methodological considerations. portal: Libraries and the Academy, 3(4):615–632, 2003.

[178] UK Intellectual Property Office. Exceptions to copyright: Research, oct 2014.

[179] T. Váradi. Corpus linguistics - linguistics or language engineering? In Information Society Multi-Conference Proceedings Language Technologies, pages 1–5, 2000.

[180] T. Váradi. The linguistic relevance of corpus linguistics. In Proceedings of the Corpus Linguistics 2001 Conference. UCREL Technical Papers, volume 13, pages 587–593, 2001.

[181] S. Wallis. That vexed problem of choice. London: Survey of English Usage, UCL. Retrieved, 14, 2013.

[182] M. P. West. A general service list of English words: with semantic frequencies and a supplementary word-list for the writing of popular science and technology. Longmans, Green, 1953.

[183] S. S. Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, 1938.

[184] D. Wu and P. Fung. Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. In Proceedings of the Fourth Conference on Applied Natural Language Processing, ANLC ’94, pages 180–181, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.

[185] M. Wynne. Developing linguistic corpora: a guide to good practice. Oxbow Books, 2005.

[186] Z. Xiao. Well-known and influential corpora. A. Lüdeling & M. Kytö (eds) Corpus Linguistics: An International Handbook, 1:383–457, 2008.

[187] J. Yuan, M. Liberman, and C. Cieri. Towards an integrated understanding of speaking rate in conversation. In INTERSPEECH, 2006.

[188] G. K. Zipf. Human behavior and the principle of least effort. 1949.

Appendices

A LWAC Data Fields

Data {
  server : {
    links             : [1, 2, 3]
    complete_sampl... : 2
    complete_samples  : [0, 1]
    next_sample_date  : 1366381980
    current_sample_id : 1
    config : {
      storage : {
        root            : corpus
        state_file      : state
        sample_subdir   : samples
        sample_filename : sample
        files_per_dir   : 1000
        database : {
          filename          : corpus/links.db
          table             : links
          transaction_limit : 100
          pragma : {
            locking_mode : EXCLUSIVE
            cache_size   : 20000
            synchronous  : 0
            temp_store   : 2
          }
          fields : {
            id  : id
            uri : uri
          }
        }
      }
      sampling_policy : {
        sample_limit     : 2
        sample_time      : 60
        sample_alignment : 0
      }
      client_policy : {
        dry_run          : false
        fix_encoding     : true
        target_encoding  : UTF-8
        encoding_options : {
          invalid           : replace
          undef             : replace
          universal_newline : true
        }
        max_body_size : 20971520
        mimes : {
          policy      : whitelist
          ignore_case : true
          list        : ["\^text\\/?.*\$"]
        }
        curl_workers : {
          max_redirects     : 5
          useragent         : "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US ...
          enable_cookies    : true
          verbose           : false
          follow_location   : true
          timeout           : 60
          connect_timeout   : 10
          dns_cache_timeout : 10
          ftp_response_t... : 10
        }
      }
      client_management : {
        time_per_link     : 5
        empty_client_b... : 60
        delay_overesti... : 10
      }
      server : {
        interfaces   : [{:interface => "localhost", :port => 27400}]
        service_name : downloader
      }
      logging : {
        progname : Server
        logs : {
          default : {
            dev   : STDOUT
            level : info
          }
          file_log : {
            dev   : logs/server.log
            level : info
          }
        }
      }
    }
    version : 0.2.0b
  }
  sample : {
    id                : 1
    start_time        : 2013-04-19 15:33:10 +0100
    end_time          : 2013-04-19 15:33:11 +0100
    complete          : true
    open              : false
    size              : 3
    duration          : 1.406624844
    start_time_s      : 1366381990
    end_time_s        : 1366381991
    size_on_disk      : 214259.0
    last_contiguou... : 3
    dir               : corpus/samples/1
    path              : corpus/samples/1/sample
  }
  datapoint : {
    id        : 3
    uri       : http://google.co.uk
    dir       : corpus/samples/1/0
    path      : corpus/samples/1/0/3
    client_id : LOCAL3_7ba2f8cd03d79efbbaa4b1c561759c6e
    error     :
    headers : {
      Location          : http://www.google.co.uk/
      Content_Type      : text/html; charset=UTF-8
      Date              : Fri, 19 Apr 2013 14:33:10 GMT
      Expires           : -1
      Cache_Control     : private, max-age=0
      Server            : gws
      Content_Length    : 221
      X_XSS_Protection  : 1; mode=block
      X_Frame_Options   : SAMEORIGIN
      Set_Cookie        : NID=67=B7dOglOF9YR3BvNje7Xgy_FAHcHIgJMW3HGm9HYI...
      P3P               : CP="This is not a P3P policy! See http://www.go...
      Transfer_Encoding : chunked
    }
    head : HTTP/1.1 301 Moved Permanently\nLocation: http:/...
    body :

B Corpus Postprocessing Scripts

The post-processing tools presented here are largely tasked with transforming the disparate formats produced by the different recording methods into a standard CSV format based on Lee's BNC index. Where appropriate, fields such as the medium and word count were computed on-the-fly. In addition to these methods, annotation was aided by manually creating lookup tables to impute metadata for some fields: for example, musical genre is easily and reliably determined by artist.
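A minimal sketch of this lookup-table imputation is given below; the file names and column layout are assumptions, and the input log is assumed to contain an (initially empty) genre column.

require 'csv'

# artist_genres.csv: artist,genre -- the manually-built lookup table
genre_by_artist = CSV.read('artist_genres.csv', headers: true)
                     .to_h { |row| [row['artist'], row['genre']] }

rows = CSV.read('music_log.csv', headers: true)
rows.each { |row| row['genre'] ||= genre_by_artist[row['artist']] }

CSV.open('music_log_annotated.csv', 'w') do |out|
  out << rows.headers
  rows.each { |row| out << row }
end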

Files Files not sourced from a version control system were stored each day. The processing script for these uses the diff tool to compute patches between each version, outputting the word counts to a CSV file broken down per-day.
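A sketch of this diff-based counting is shown below; the snapshot paths are hypothetical, and only lines added in the newer version are counted.

require 'open3'

# Count the words on lines added between two daily snapshots of a file.
def words_added(old_path, new_path)
  diff, _status = Open3.capture2('diff', old_path, new_path)
  diff.lines
      .select { |l| l.start_with?('> ') }   # lines present only in the newer version
      .sum    { |l| l[2..].split.size }
end

puts words_added('snapshots/2013-04-18/notes.txt', 'snapshots/2013-04-19/notes.txt')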

Phone Calls Calls were recorded in CSV format by a smartphone application. This script simply reformatted the CSV to be compatible with other records. Due to the low number, calls were tagged by hand using records from the two notebooks.

Emails Emails were processed from mbox format by normalising their character sets and parsing their headers to extract timestamps, addressees, and subjects. Email contents were then exported to an intermediate CSV format for further processing (such as removal of replies and word counting), and were then transformed into the same format as the other data.
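The sketch below illustrates this kind of processing without assuming any particular mail library: it splits the mbox on 'From ' separators, pulls a few headers, drops quoted reply lines, and writes word counts to CSV. File names and the choice of headers are assumptions.

require 'csv'

messages = File.read('mail.mbox').split(/^From /).drop(1)

CSV.open('emails.csv', 'w') do |csv|
  csv << %w[date to subject words]
  messages.each do |msg|
    header_block, body = msg.split(/\r?\n\r?\n/, 2)   # headers end at first blank line
    headers = header_block.to_s.scan(/^(Date|To|Subject): (.*)$/).to_h
    words   = body.to_s.lines.reject { |l| l.start_with?('>') }.join.split.size
    csv << [headers['Date'], headers['To'], headers['Subject'], words]
  end
end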

IRC Logs The bot that monitored IRC rooms already contained a timeout, and would stop listening ten minutes after the last activity by the user of interest. The first 20 characters of the first message beginning a conversation were taken as a reminder of the topic, and word counts were exported to the unified CSV format.

Music (last.fm) A third-party script was used to scrape the entirety of the subject’s music-listening history from the ‘scrobbling’ service last.fm. This was returned in TSV format, containing the time at which each track was listened to, along with identifying information. This was converted into the standard CSV format. Later stages computed word counts by automatically searching for song lyrics using a web-based API.

SMS SMS records were exported in a similar manner to phone calls, and simply converted into the normalised CSV format.

Web logs (SQUID) SQUID logs were filtered to identify documents which would have been rendered by the subject’s browser. This was done by filtering on:

• Username (the subject used a different username for each device);
• MIME type (in order to remove images, CSS, and other non-readable formats);
• URL patterns (in order to remove advertising, AJAX calls, and analytics);
• Return code.

This yielded a list of URLs accessed by each device for each day. These were then retrieved and processed further using Nokogiri to remove boilerplate and normalise the character set in order to obtain a word count. Results from this stage were then transformed into the standard CSV format. An additional parameter, the time between page reloads, was extracted but proved not to be useful due to the low number of reloaded pages.
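A compressed sketch of this pipeline is given below. The SQUID log layout, username, and filter patterns are assumptions, and the boilerplate removal shown (dropping script, style, and nav elements with Nokogiri) is a much cruder approximation than the processing actually used.

require 'nokogiri'
require 'open-uri'
require 'csv'

KEEP_MIME = %r{\Atext/html}
SKIP_URL  = %r{doubleclick|google-analytics|/ads?/}

# Native SQUID log fields (assumed): time elapsed client code/status bytes
# method url user hierarchy mime
rows = File.foreach('access.log').filter_map do |line|
  time, _elapsed, _client, code_status, _bytes, _method, url, user, *_rest, mime = line.split
  next unless user == 'subject_laptop' && code_status.to_s.include?('200') &&
              mime =~ KEEP_MIME && url !~ SKIP_URL
  [Time.at(time.to_f), url]
end

CSV.open('web_reading.csv', 'w') do |csv|
  csv << %w[time url words]
  rows.each do |time, url|
    begin
      doc = Nokogiri::HTML(URI.open(url).read)
      doc.css('script, style, nav').remove   # crude boilerplate removal
      csv << [time, url, doc.text.split.size]
    rescue StandardError
      next   # page no longer retrievable
    end
  end
end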

Terminals The script tool was used to log terminal output. This logs every instruction provided to the terminal, and as such the main post-processing task was to resolve the meaning of control characters by passing the output through a terminal emulator. In order to estimate the proportion of fast-moving output read by the subject, timestamps were used to gauge when terminal output had stopped for more than 60 seconds, at which point the last 40 lines of output were taken as a text. This scrollback length was chosen based upon the geometry of the subject's screen and constituted a rough average of window size. Word counts from these texts were then computed for insertion into the normalised CSV format.
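The sketch below illustrates the pause-and-scrollback heuristic, assuming a typescript and timing file as produced by script -t; it omits the terminal-emulation step used to resolve control characters.

SCROLLBACK = 40    # lines assumed visible on the subject's screen
PAUSE      = 60.0  # seconds of inactivity treated as 'reading time'

data       = File.binread('typescript')   # raw output logged by script
offset     = 0
lines_seen = []
texts      = []

File.foreach('timingfile') do |entry|     # "delay bytes" per output chunk
  delay, bytes = entry.split.map(&:to_f)
  texts << lines_seen.last(SCROLLBACK).join("\n") if delay > PAUSE && lines_seen.any?
  chunk   = data[offset, bytes.to_i].to_s
  offset += bytes.to_i
  lines_seen.concat(chunk.split("\n"))
end

texts.each { |t| puts t.split.size }       # rough word count per captured text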

C Lee’s BNC Genres

C.1 Spoken Genres

Genre                    | Words    | %age   | Big Genre         | Files
S_brdcast_discussn       | 757317   | 7.3%   | Broadcast (10.2%) | 53
S_brdcast_documentary    | 41540    | 0.4%   | Broadcast (10.2%) | 10
S_brdcast_news           | 261278   | 2.5%   | Broadcast (10.2%) | 12
S_classroom              | 429970   | 4.2%   |                   | 58
S_consult                | 138011   | 1.3%   |                   | 128
S_conv                   | 4206058  | 40.7%  |                   | 153
S_courtroom              | 127474   | 1.2%   |                   | 13
S_demonstratn            | 31772    | 0.3%   |                   | 6
S_interview              | 123816   | 1.2%   | Interviews (9.1%) | 13
S_interview_oral_history | 815540   | 7.9%   | Interviews (9.1%) | 119
S_lect_commerce          | 15105    | 0.1%   | Lectures (2.9%)   | 3
S_lect_humanities_arts   | 50827    | 0.5%   | Lectures (2.9%)   | 4
S_lect_nat_science       | 22681    | 0.2%   | Lectures (2.9%)   | 4
S_lect_polit_law_edu     | 50881    | 0.5%   | Lectures (2.9%)   | 7
S_lect_soc_science       | 159880   | 1.5%   | Lectures (2.9%)   | 13
S_meeting                | 1377520  | 13.3%  |                   | 132
S_parliament             | 96239    | 0.9%   |                   | 6
S_pub_debate             | 283507   | 2.7%   |                   | 16
S_sermon                 | 82287    | 0.8%   |                   | 16
S_speech_scripted        | 200234   | 1.9%   | Speeches (6.4%)   | 26
S_speech_unscripted      | 464937   | 4.5%   | Speeches (6.4%)   | 51
S_sportslive             | 33320    | 0.3%   |                   | 4
S_tutorial               | 143199   | 1.4%   |                   | 18
S_unclassified           | 421554   | 4.1%   |                   | 44
TOTAL                    | 10334947 | 100.0% |                   | 909

C.2 Written Genres

Genre                        | Words    | %age   | Big Genre                              | Files
W_ac_humanities_arts         | 3321867  | 3.8%   | Academic Prose (17.7%)                 | 87
W_ac_medicine                | 1421933  | 1.6%   | Academic Prose (17.7%)                 | 24
W_ac_nat_science             | 1111840  | 1.3%   | Academic Prose (17.7%)                 | 43
W_ac_polit_law_edu           | 4640346  | 5.3%   | Academic Prose (17.7%)                 | 186
W_ac_soc_science             | 4247592  | 4.9%   | Academic Prose (17.7%)                 | 138
W_ac_tech_engin              | 686004   | 0.8%   | Academic Prose (17.7%)                 | 23
W_admin                      | 219946   | 0.3%   |                                        | 12
W_advert                     | 558133   | 0.6%   |                                        | 60
W_biography                  | 3528564  | 4.0%   |                                        | 100
W_commerce                   | 3759366  | 4.3%   |                                        | 112
W_email                      | 213045   | 0.2%   |                                        | 7
W_essay_sch                  | 146530   | 0.2%   | Non-printed Essays (0.3%)              | 7
W_essay_univ                 | 65388    | 0.1%   | Non-printed Essays (0.3%)              | 4
W_fict_drama                 | 45757    | 0.1%   | Fiction (18.6%)                        | 2
W_fict_poetry                | 222451   | 0.3%   | Fiction (18.6%)                        | 30
W_fict_prose                 | 15926677 | 18.2%  | Fiction (18.6%)                        | 432
W_hansard                    | 1156171  | 1.3%   |                                        | 4
W_institut_doc               | 546261   | 0.6%   |                                        | 43
W_instructional              | 436892   | 0.5%   |                                        | 15
W_letters_personal           | 52480    | 0.1%   | Letters (0.2%)                         | 6
W_letters_prof               | 66031    | 0.1%   | Letters (0.2%)                         | 11
W_misc                       | 9140957  | 10.5%  |                                        | 500
W_news_script                | 1292156  | 1.5%   |                                        | 32
W_newsp_brdsht_nat_arts      | 351811   | 0.4%   | Broadsheet National Newspapers (3.5%)  | 51
W_newsp_brdsht_nat_commerce  | 424895   | 0.5%   | Broadsheet National Newspapers (3.5%)  | 44
W_newsp_brdsht_nat_editorial | 101742   | 0.1%   | Broadsheet National Newspapers (3.5%)  | 12
W_newsp_brdsht_nat_misc      | 1032943  | 1.2%   | Broadsheet National Newspapers (3.5%)  | 95
W_newsp_brdsht_nat_reportage | 663355   | 0.8%   | Broadsheet National Newspapers (3.5%)  | 49
W_newsp_brdsht_nat_science   | 65293    | 0.1%   | Broadsheet National Newspapers (3.5%)  | 29
W_newsp_brdsht_nat_social    | 81895    | 0.1%   | Broadsheet National Newspapers (3.5%)  | 36
W_newsp_brdsht_nat_sports    | 297737   | 0.3%   | Broadsheet National Newspapers (3.5%)  | 24
W_newsp_other_arts           | 239258   | 0.3%   | Regional Newspapers (6.4%)             | 15
W_newsp_other_commerce       | 415396   | 0.5%   | Regional Newspapers (6.4%)             | 17
W_newsp_other_report         | 2717444  | 3.1%   | Regional Newspapers (6.4%)             | 39
W_newsp_other_science        | 54829    | 0.1%   | Regional Newspapers (6.4%)             | 23
W_newsp_other_social         | 1143024  | 1.3%   | Regional Newspapers (6.4%)             | 37
W_newsp_other_sports         | 1027843  | 1.2%   | Regional Newspapers (6.4%)             | 9
W_newsp_tabloid              | 728413   | 0.8%   | Tabloids (0.8%)                        | 6
W_non_ac_humanities_arts     | 3751865  | 4.3%   | Non-academic Prose (19.1%)             | 111
W_non_ac_medicine            | 498679   | 0.6%   | Non-academic Prose (19.1%)             | 17
W_non_ac_nat_science         | 2508256  | 2.9%   | Non-academic Prose (19.1%)             | 62
W_non_ac_polit_law_edu       | 4477831  | 5.1%   | Non-academic Prose (19.1%)             | 93
W_non_ac_soc_science         | 4187649  | 4.8%   | Non-academic Prose (19.1%)             | 128
W_non_ac_tech_engin          | 1209796  | 1.4%   | Non-academic Prose (19.1%)             | 123
W_pop_lore                   | 7376391  | 8.5%   |                                        | 211
W_religion                   | 1121632  | 1.3%   |                                        | 35
TOTAL                        | 87284364 | 100.0% |                                        | 3144

D BNC68 Model Accuracy

Correctly Classified Instances       2426      71.4791 %
Incorrectly Classified Instances      968      28.5209 %
Kappa statistic                         0.7029
Mean absolute error                     0.0084
Root mean squared error                 0.0912
Relative absolute error                29.6594 %
Root relative squared error            76.7327 %
Total Number of Instances            3394

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.352    0.021    0.384      0.352   0.368      0.708     W_non_ac_soc_science
0.917    0.001    0.846      0.917   0.88       0.956     W_ac_medicine
0.8      0.004    0.5        0.8     0.615      0.926     W_instructional
0.706    0.001    0.75       0.706   0.727      0.988     W_newsp_other_commerce
0.76     0.001    0.864      0.76    0.809      0.953     S_speech_scripted
0        0        0          0       0          0.459     W_fict_drama
0.875    0        0.933      0.875   0.903      0.964     S_pub_debate
0        0        0          0       0          0.895     S_brdcast_documentary
1        0        1          1       1          1         W_email
0.811    0.009    0.793      0.811   0.801      0.982     S_meeting
0.73     0.011    0.676      0.73    0.702      0.879     W_biography
0.606    0.017    0.614      0.606   0.61       0.854     W_ac_soc_science
0        0        0          0       0          0.79      S_demonstratn
0.545    0.001    0.667      0.545   0.6        0.949     W_letters_prof
1        0.005    0.6        1       0.75       0.998     W_newsp_brdsht_nat_sports
0.744    0.003    0.762      0.744   0.753      0.887     W_ac_nat_science
0.5      0.001    0.75       0.5     0.6        0.829     S_tutorial
0.425    0.013    0.463      0.425   0.443      0.741     W_ac_humanities_arts
0.523    0.019    0.487      0.523   0.504      0.777     W_non_ac_humanities_arts
0.774    0.009    0.586      0.774   0.667      0.968     S_brdcast_discussn
0.783    0.003    0.621      0.783   0.692      0.926     W_ac_tech_engin
0.922    0.009    0.618      0.922   0.74       0.995     W_newsp_brdsht_nat_arts
0.917    0.001    0.786      0.917   0.846      0.999     W_newsp_brdsht_nat_editorial
0.821    0.007    0.561      0.821   0.667      0.935     W_newsp_other_report
0.714    0.008    0.472      0.714   0.568      0.881     W_religion
0.111    0        1          0.111   0.2        0.982     W_newsp_brdsht_nat_social
0.429    0.001    0.6        0.429   0.5        0.83      S_lect_polit_law_edu
0.714    0        0.833      0.714   0.769      0.841     W_essay_school
0.733    0.003    0.5        0.733   0.595      0.957     W_newsp_other_arts
0.969    0.001    0.912      0.969   0.939      0.997     W_news_script
0        0        0          0       0          0.461     W_essay_univ
0.627    0.013    0.427      0.627   0.508      0.934     S_speech_unscripted
0.417    0.003    0.357      0.417   0.385      0.817     W_admin
0.457    0.008    0.623      0.457   0.528      0.736     W_non_ac_polit_law_edu
0.667    0.001    0.571      0.667   0.615      0.813     W_newsp_tabloid
0.833    0        1          0.833   0.909      0.999     W_letters_personal
1        0        0.8        1       0.889      1         W_hansard
0.765    0.004    0.481      0.765   0.591      0.928     W_non_ac_medicine
0.909    0.001    0.909      0.909   0.909      0.999     W_newsp_brdsht_nat_commerce
0.972    0.008    0.946      0.972   0.959      0.986     W_fict_prose
0.651    0.009    0.475      0.651   0.549      0.911     W_institut_doc
0.211    0.005    0.571      0.211   0.308      0.927     W_newsp_brdsht_nat_misc
0.705    0.014    0.763      0.705   0.733      0.876     W_pop_lore
0.276    0.001    0.667      0.276   0.39       0.989     W_newsp_brdsht_nat_science
0.891    0.003    0.906      0.891   0.898      0.991     S_interview_oral_history
0.769    0.002    0.625      0.769   0.69       0.949     S_lect_soc_science
0.833    0        0.833      0.833   0.833      0.908     S_parliament
0.889    0.004    0.348      0.889   0.5        0.936     W_newsp_other_sports
0.615    0        1          0.615   0.762      0.894     S_interview
0.25     0.001    0.333      0.25    0.286      0.846     S_sportslive

0.914    0.002    0.936      0.914   0.925      0.996     S_consult
0.9      0.002    0.771      0.9     0.831      0.997     W_fict_poetry
0.324    0.006    0.364      0.324   0.343      0.842     W_newsp_other_social
1        0.001    0.889      1       0.941      1         S_sermon
1        0        0.929      1       0.963      1         S_courtroom
0.667    0.001    0.5        0.667   0.571      0.823     S_lect_commerce
0.136    0.003    0.353      0.136   0.197      0.862     S_unclassified
0.959    0.001    0.967      0.959   0.963      0.978     W_non_ac_tech_engin
0.98     0.013    0.522      0.98    0.681      0.994     W_newsp_brdsht_nat_report
0.661    0.002    0.848      0.661   0.743      0.975     W_advert
0.75     0.003    0.474      0.75    0.581      0.939     S_brdcast_news
0.25     0        0.5        0.25    0.333      0.702     S_lect_humanities_arts
0.5      0        1          0.5     0.667      0.85      S_lect_nat_science
0.705    0.008    0.76       0.705   0.731      0.883     W_commerce
0.348    0        1          0.348   0.516      0.988     W_newsp_other_science
0.887    0.013    0.567      0.887   0.692      0.94      W_non_ac_nat_science
0.71     0.009    0.825      0.71    0.763      0.861     W_ac_polit_law_edu
0.793    0.003    0.821      0.793   0.807      0.973     S_classroom
0.715    0.008    0.724      0.715   0.704      0.913     (Weighted Avg.)

E BNC45 Model Evaluation

Though trained on BNC data, the majority of the data classified here is written directly for the web. The impact this has upon classifier performance is not known, and insufficient gold-standard data exists, tagged at sufficient resolution, to perform a direct evaluation. Instead, data indexed by the Open Directory Project (DMOZ) was used as a measure of systematic bias. The classification scheme used in DMOZ is significantly more diverse than that used in Lee's BNC index, covering a wider range of topics to a finer level of granularity. As its goal is to represent a greater portion of the population than simply British English users, this is unsurprising; however, it is organised hierarchically, and thus has larger-scale categories which are more compatible.

For each of the documents retrieved in the final corpus, an attempt was made to identify its DMOZ-equivalent category. This was performed by matching the URL or, failing such a detailed pattern, the domain name, against 699,385 URLs taken from the publicly-accessible DMOZ archive. For each BNC category (as assigned by the BNC45 model used in the retrieval task), the homogeneity of DMOZ categories was measured at each level of the hierarchy. Categories below those which consist only of a single uppercase letter were concatenated, in order to remove some 'A-Z' listing categories. The full set of 55,790 files downloaded for the BNC corpus comparison in Chapter 6 were used as source data. Of these, 777 matched full-URL categories from the DMOZ classification, with a further 32,697 matching on domain only.
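A sketch of the per-level homogeneity measurement is given below; the input file, its columns, and the use of mean unique-category counts as the summary statistic are assumptions for illustration.

require 'csv'
require 'set'

# dmoz_matches.csv is assumed to hold one row per matched document:
# url, bnc_genre (from the BNC45 classifier), dmoz_path (e.g. Top/Arts/Music)
matches   = CSV.read('dmoz_matches.csv', headers: true)
max_level = 14

(0..max_level).each do |level|
  per_genre = Hash.new { |h, k| h[k] = Set.new }
  matches.each do |row|
    truncated = row['dmoz_path'].split('/').first(level + 1).join('/')
    per_genre[row['bnc_genre']] << truncated
  end
  sizes = per_genre.values.map(&:size)
  puts "level #{level}: mean unique DMOZ categories per BNC genre = " \
       "#{(sizes.sum.to_f / sizes.size).round(1)}"
end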

[Figure: "Unique DMOZ categories per level" — frequency of unique categories plotted against DMOZ tree level (0–14).]

Figure 1: Unique category counts within and between BNC genres.

The number of unique categories in the DMOZ overall is shown in Figure 1. The black lines illustrate the number of unique DMOZ categories at each level of the tree; the blue lines show the number of unique categories within each BNC genre, as classified by the BNC45 classifier model. Figure 2 shows log-transformed variance in a similar manner. Under ideal circumstances, where the BNC45 classifier is in perfect agreement with one level of the DMOZ across all categories, the variance remaining within each classifier-assigned genre will drop to zero at a given tree level. Though this occurs for some values at a tree depth of around 10 for most genres, this is well beyond the bulk of the DMOZ classification, indicating that few DMOZ categories descend beyond this depth anyway.

[Figure: 'Variance in DMOZ Category Frequency' — log(variance) (y axis) plotted against DMOZ tree level, 0–14 (x axis).]

Figure 2: Log variance in DMOZ category frequency across and within BNC genres

[Figure: 'Within BNC variance as Proportion of Total' — BNC variance / DMOZ variance (y axis) plotted against DMOZ tree level, 0–14 (x axis).]

Figure 3: Within-genre DMOZ category frequency variance as a proportion of global variance

Figure 3 shows the proportion of variance lying within each BNC genre. This displays a small reduction around a tree depth of five, mirroring the steepening of the curve in Figure 2. This is then eclipsed by the relatively poor fit of deeper categories (as would be expected), before dropping off due to the lower frequency of DMOZ categories beyond a tree depth of approximately ten. These plots provide tentative evidence that the best agreement between the DMOZ classification and the BNC45 classifier exists at a tree depth of around five. They also indicate that the DMOZ classifications are potentially a useful tool for guiding retrieval of data or, given sufficient human alignment, for constructing gold-standard data for further classifier evaluation.
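The proportion plotted in Figure 3 can be derived from the per-genre and overall variances sketched above. Again this is only an illustrative sketch, reusing the hypothetical category_variances helper from the previous block, not a reproduction of the code used to produce the figure.

def within_genre_proportions(genre_categories):
    """Each genre's within-genre variance as a proportion of the
    global variance at one DMOZ tree level (the quantity in Figure 3)."""
    overall, within = category_variances(genre_categories)
    if overall == 0:
        # No spread in category frequencies at this level.
        return {genre: float('nan') for genre in within}
    return {genre: var / overall for genre, var in within.items()}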

F Retrieved Corpus Keywords

F.1 Under-represented Keywords
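Each row in the listing below appears to consist of a numeric identifier, the keyword itself, its frequency in the reference corpus and in the retrieved corpus (labelled 'ref' and 'corp' at the head of the listing in Appendix F.2), and a log-likelihood score by which the rows are sorted. For reference, the sketch below shows the standard two-corpus log-likelihood keyness statistic widely used for such comparisons; the corpus totals are placeholders, since they are not listed in this appendix, and the sketch is illustrative rather than the code used to produce these scores.

import math

def log_likelihood(freq_ref, freq_corp, n_ref, n_corp):
    """Two-corpus log-likelihood (G2) keyness statistic.

    freq_ref / freq_corp are a word's frequencies in the reference
    and study corpora; n_ref / n_corp are the corpus totals, which
    are placeholders here as they are not given in this appendix.
    """
    expected_ref = n_ref * (freq_ref + freq_corp) / (n_ref + n_corp)
    expected_corp = n_corp * (freq_ref + freq_corp) / (n_ref + n_corp)
    ll = 0.0
    for observed, expected in ((freq_ref, expected_ref),
                               (freq_corp, expected_corp)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll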

(Continues on next page)

46513 she'd 7338 4050 17407.2730038145 32039 darlington 5559 1521 16656.975047457 20600 ec 6091 5531 11220.7494337932 7882 solicitors 2135 1086 5236.9941851402 35028 swindon 2182 1409 4825.7478738986 11991 kinnock 1375 264 4466.7414274941 99221 theda 838 32 3297.2583100254 289711 nizan 756 6 3146.0766292512 9792 cnaa 742 1 3140.1387559493 89224 guil 755 43 2886.4701888914 30750 maastricht 1165 593 2856.8794199197 67028 oesophageal 957 279 2820.0060616266 28560 lexical 1106 514 2809.1229318242 121272 robyn 1200 707 2767.4801890584 16530 lamont 1184 708 2712.6352767427 92552 risc 1013 416 2689.3520456627 10850 knitting 1509 1463 2668.8854589916 208041 athelstan 1060 516 2645.2960327523 37406 privatisation 1104 665 2521.089781504 339175 leapor 626 18 2502.1514682347 12424 heseltine 803 206 2445.4799531759 154162 pylori 1105 727 2422.2250112204 78345 d'you 704 126 2318.6575844909 100322 sparc 837 330 2252.8796969432 1305 eec 1072 805 2198.891976148 3890 polytechnics 603 59 2181.1750965425 218938 endill 515 1 2175.6275965367 219284 burun 551 27 2131.6415037429 165631 ronni 538 21 2113.9699608352 3068 polytechnic 1070 881 2087.3407463415 86271 morrissey 748 264 2085.9920809487 79902 mips 766 302 2061.7914941895 23181 cranston 769 310 2054.4855423266 58927 syntactic 875 502 2041.5769743531 55871 modigliani 594 91 2012.1916535193 63999 minton 747 310 1976.0245535251 112332 pascoe 672 199 1971.8225088017 88734 shiona 479 9 1949.3640749749 68040 icl 742 329 1917.4414898386 219465 yanto 447 2 1875.5783123003 224397 jaq 449 7 1838.635498068 321530 molla 548 104 1784.342938255 100493 sunsoft 476 34 1782.8397134693 200139 nordern 425 2 1782.2321490091 198080 gedge 469 30 1775.0432169105 120042 leonora 614 201 1751.3847573564 13036 mcallister 723 383 1744.6193691341 112072 souness 510 81 1716.8770520779 35818 lothian 987 982 1716.7954395472


23055 mansell 687 338 1707.404764739 222032 rincewind 408 4 1690.8598482008 223122 penry 480 61 1675.4189061189 171430 ceau 447 32 1673.8527612241 20800 teesside 808 611 1651.2799548777 154167 gastrin 594 222 1626.988736276 120040 fenella 461 55 1624.0391557818 3602 leas 667 355 1606.3860682303 127314 nme 625 288 1592.4733285733 55736 belinda 697 424 1584.6112543979 113025 wilko 399 13 1584.4726764552 100339 osf 730 500 1569.2416891438 230633 colonic 830 724 1566.1873434091 176323 seb 653 353 1562.5809771303 91562 svqs 373 3 1551.8570783571 198167 fabia 448 63 1539.3362497451 99517 microsystems 798 677 1530.3946570334 88661 mungo 573 236 1519.6943478497 64599 fitzalan 497 129 1509.2052824261 41898 ashdown 531 179 1501.5847848396 143128 duodenal 654 397 1488.2766124975 19550 enquired 765 640 1478.9111867089 205786 jessamy 397 30 1478.5548495311 36758 leith 792 704 1477.8719975854 42943 branson 683 467 1469.4588482299 63830 infinitive 695 494 1466.6099058801 214319 defries 357 5 1466.5398249071 171431 escu 417 54 1451.4003095108 56155 alexei 732 589 1446.4547951509 227972 eqn 413 54 1435.3702594557 12604 delors 455 104 1424.0277425677 25028 strathclyde 612 354 1422.8597632183 30718 ici 768 704 1406.6951837265 100494 svr4 408 57 1403.3723114021 89375 belville 367 23 1391.5615431422 162631 vitor 448 108 1384.9508357511 212864 gazzer 358 20 1370.866846458 83921 hcima 338 7 1370.5857061186 30675 gatt 553 276 1366.6957208534 204325 nicolo 418 82 1351.9953348697 64128 oesophagus 482 167 1351.8110548252 89167 insp 517 234 1326.0393963434 22974 sotheby 690 589 1318.5620825021 113081 tbsp 545 287 1318.2932737856 211191 gurder 311 1 1309.193884181 220408 nicandra 311 1 1309.193884181 30966 mitterrand 607 420 1298.2946195841 74666 bernice 668 557 1293.8313649678 11921 dhss 414 96 1291.4604814421


49219 fergie 513 245 1289.6338862787 13069 multiparty 489 211 1276.0217318113 12455 hurd 646 537 1253.412044175 204555 onie 364 52 1247.5268807116 290811 halling 324 19 1235.6557250555 110698 mcleish 384 80 1226.5621745476 12112 gnp 620 502 1220.9023762858 211184 grimma 282 1 1186.0766079215 110489 klerk 501 280 1182.1804988214 211183 nomes 330 36 1177.0476740713 1307 creggan 311 20 1176.4824205129 86136 langbaurgh 286 4 1174.9216059885 88135 flavia 402 125 1163.7265994083 965 wea 459 217 1158.3296903891 27533 althusser 436 180 1155.406530404 13560 devlin 579 461 1150.8067219508 196846 zuwaya 275 2 1146.1439044569 43523 masai 406 138 1145.2658257751 100293 80486 315 30 1143.2007640625 203317 ranulf 402 135 1138.0806329191 100508 usl 511 332 1126.8645036598 12423 reshuffle 548 410 1126.238370085 24106 mellor 584 491 1125.8081025956 34450 microcomputer 438 198 1123.9210848097 36230 yugoslav 606 545 1122.8966251638 204175 lorton 332 52 1120.3759344914 125186 tuppe 264 1 1109.6693542408 82632 ianthe 344 72 1097.7093519398 30710 hercegovina 300 28 1091.4192057629 62217 culley 341 72 1086.0976778548 18310 rostov 444 228 1084.9782914496 30955 redundancies 493 323 1082.8469772842 38606 fundholding 257 1 1079.9578585363 29221 reuter 585 530 1079.1762758644 118499 crilly 284 19 1070.4257931008 46470 abberley 271 10 1068.521720694 24143 computerised 494 335 1067.1344745558 176647 russon 261 5 1061.4388841553 54065 cegb 259 5 1053.0107519231 97772 drily 383 142 1051.7399333255 85747 edouard 505 369 1050.7270744768 174770 huy 404 177 1048.5066722277 79209 bgs 308 46 1047.8682927086 42681 ludens 277 19 1041.5782685994 117068 portadown 362 119 1031.3165314635 37515 arbitrage 418 211 1028.4454918124 325812 cnut 432 238 1025.574997898 30716 tncs 284 29 1021.7800058902 204638 tallis 413 212 1009.3800047482


202535 emmie 285 34 1004.0246917932 16278 wlr 378 156 1001.8271338618 130325 dalgliesh 317 70 999.833889947 30469 cpsu 384 172 988.6703818644 17548 bukharin 312 69 983.7342080563 14910 ilp 401 208 975.9452369735 48098 aycliffe 361 141 974.7024133205 40144 esrc 352 130 967.8095347718 28389 lavatory 486 389 963.1579597726 197755 oesophagitis 269 30 956.615630936 113026 cantona 317 87 949.109509815 71858 gummer 278 42 943.9689603487 100311 cose 358 151 941.6891715476 230487 pepsinogen 241 12 931.2350322132 16694 bureaux 364 167 928.9601467817 142192 tentacle 420 272 927.6028504556 31295 ucta 252 22 923.9788003166 63629 watercolour 405 251 912.9842445472 40753 fmln 271 44 908.7521685175 36079 csce 299 78 906.8142603144 63752 deictic 251 25 905.9048565845 124430 dougal 325 115 905.6316821957 37335 olivetti 350 155 904.8511439067 159601 kaifu 229 10 893.2329107526 85434 feargal 214 2 887.7610373665 3882 privatised 403 270 875.7061082432 214698 aggie 332 141 871.1997114017 255460 coeliac 317 118 869.3918871822 146887 papillae 346 167 866.3030844695 49006 maisie 385 243 860.620143417 193744 titford 209 3 857.9590687749 60370 batty 445 375 856.7158972126 67150 faecal 447 381 854.937945579 222033 twoflower 206 2 853.8954139195 87436 kirov 370 218 853.2919716053 2840 tuc 421 324 852.3602885807 211560 thorfinn 262 50 852.1517211822 40349 banbury 405 290 851.4526208835 91430 nixdorf 242 30 847.8073083532 203970 rohmer 347 180 844.5022929563 34465 ferranti 263 54 842.6589670675 85371 d'arcy 349 185 841.9196603618 293188 rowbottom 233 23 841.8836704678 16735 offeror 361 208 840.7302666109 208341 trunchbull 210 6 839.6446811848 118184 linfield 292 95 834.4097265869 41170 teesdale 291 95 830.7217480438 79657 tranmere 341 177 829.6870793315 12021 conveyancing 298 107 826.6361729786


59081 phonological 387 269 825.8683863476 253190 ribber 208 7 824.512667839 31777 dti 415 334 819.9543644193 112460 4oz 239 34 819.676276513


F.2 Over-represented Keywords

(Continues on next page)

key ref corp loglik 4038 p 29217 8839489 1977900.52660189 2642 s 17778 1903037 357245.225266159 2433 she 290860 916731 136365.145796708 67356 gt 111 535127 134335.149383536 22924 com 172 465891 116044.313798295 157415 www 2 448772 113981.587985684 80073 amp 668 432633 102780.694483923 16480 lt 792 425436 99921.7604419157 176 had 359670 1439973 94877.1125714251 7917 h 8094 555374 90772.7204225253 8230 t 9078 557616 87217.0878646232 2449 her 277189 1070672 80897.8098257682 34384 2015 25 293434 74145.0411888397 60062 2014 16 254997 64517.3774052297 35 he 523177 2629585 59109.7836268183 259 was 722425 3866921 58398.9848473108 1610 cent 34353 44410 49457.429430098 13 the 5021789 33346038 47673.9177485614 122 be 524709 2771036 45819.9483379022 119 it 727487 4058706 45306.4107823409 7957 u 2941 234264 40360.2550871008 34722 online 190 148152 35540.504089824 98 to 2117911 13674213 31847.9642452881 7853 m 14132 363108 31792.7042610156 2199 labour 24780 36860 31665.7807527362 1291 n 7761 261109 29021.9005887854 128 but 348119 1867976 27735.7691112495 299 been 215570 1057790 27255.5238390625 10458 posted 607 123889 26385.6717154975 56 would 186520 897188 25753.7355294291 1608 1 77239 978909 24438.8702405685 4503 d 12872 302164 23978.6651970909 89961 org 9 94015 23741.5263324715 104 there 221631 1133637 22956.0637299044 24361 2013 19 91340 22928.7780286913 22278 center 468 104270 22485.1195892446 2406 percent 405 102171 22390.9537410234 244 which 305003 1669688 21504.230066248 9098 ve 376 97206 21366.6173723388 10689 2 944 111027 21327.5105395567 13236 don 1526 120813 20772.3084171785 4712 st 11601 264992 20367.3173630447 63711 okay 882 103794 19940.7618264792 59149 2012 11 78368 19741.7609552155 3211 looked 28796 74087 19281.2997082316 34383 2010 69 78278 19073.9471988043 67120 schmitt 30 74663 18569.1510139359 165501 php 3 72195 18289.9134997198


10562 1 1972 117238 18084.869271963 7183 l 10694 239228 17929.6496922032 1425 said 160817 811561 17816.3916444048 38450 2011 20 71070 17775.8523503162 28274 cc 417 83270 17677.2543087766 970 could 118255 561778 17214.51662021 691 e 19888 338706 16940.0657328758 11586 didn't 28683 79868 16894.6020010591 1609 per 46167 163448 16812.9087221618 8312 san 2118 113626 16731.9238727728 9237 yesterday 17367 34066 16722.9599022924 926 o 5057 155788 16142.2339058857 30214 ll 106 67899 16120.686072945 6482 1989 13001 20490 15749.065910893 60267 2008 28 61093 15155.3558535457 6548 1990 14475 26662 14961.6923175115 604 britain 21299 53550 14772.1968860858 100838 internet 93 62023 14758.9493420349 5272 web 557 74479 14720.5005460966 19552 he'd 9328 10738 14674.3186132746 500 programme 13805 25052 14501.660183452 60273 2009 9 57136 14380.0783642769 2359 behaviour 11168 16520 14342.9884519028 660 de 12927 242351 14234.094036467 60107 2007 20 55224 13759.7057816889 1367 don't 48437 189354 13675.3414229932 3758 0 17536 287984 13465.4378973914 390 they 277024 1600880 13360.2249742561 21397 hon 10573 16156 13183.6322758252 4511 f 7287 168372 13117.8556405327 59395 2006 28 52657 13020.2117033705 100348 pdf 37 52027 12766.2031917794 679 city 19308 301208 12716.2086777529 263 towards 23488 68694 12693.5852502438 2526 thought 42519 163617 12537.0481890215 5339 11 14007 242791 12528.7242375897 5537 him 133314 694782 12520.0006062516 636 were 257618 1488620 12431.8508863345 13281 outlet 475 62718 12362.9761916747 22127 program 3836 118764 12357.1580425946 1734 voice 22261 64254 12317.8303111905 38710 mg 1261 77194 12056.7911606805 3058 2 63505 716529 11830.6515447561 535 then 107971 546791 11756.8535537536 6277 mr 57650 250805 11700.309549642 121 not 349209 2105913 11668.2232539898 33971 henderson 584 61940 11600.2415631191 807 site 7610 162921 11553.6664967085 10 his 370597 2253100 11443.1279539566


12118 winter 5346 133826 11402.2194786503 45693 canberra 135 49830 11368.7506560405 156587 didn 11 45042 11286.2971583163 6564 1991 12138 24532 11283.2139628911 177882 obama 1 44277 11230.9905180448 876 felt 23559 74566 10952.6456937835 12545 la 3933 112066 10847.0098233304 525 up 166593 923692 10814.9990579084 9493 eyes 25703 85465 10784.4115455171 34382 2005 92 45927 10733.194703689 10442 3 457 55234 10677.3043738712 19474 couldn't 11705 24034 10675.3034475423 10412 4 661 58927 10517.5992586374 19676 wasn't 13931 33075 10504.5740980383 3061 3 47731 556230 10427.7818452401 315997 quot 1 40954 10386.8210967182 78778 2004 16 41719 10384.4971469053 1330 your 84463 882537 10244.0991813595 26 social 38710 155064 10195.018962811 79913 email 44 42060 10181.9625267388 10413 7 461 53174 10167.4050217064 170747 ppr 14 40385 10069.679878736 469 if 182163 1033843 10063.1350643265 6379 1988 9975 19015 9924.600514704 1672 seemed 20315 62756 9908.5948487351 1088 council 24729 84884 9672.9216994253 10511 5 1299 66756 9627.4731896512 1587 re 10477 183397 9626.7350517409 38486 gene 2143 79768 9535.0963094793 4925 york 7594 149997 9527.1136248179 2364 down 69869 337787 9437.5290515232 1234 university 12516 204204 9429.2467017681 3776 10 28789 367849 9415.7296093792 826 too 55204 252771 9310.2154072228 10558 6 403 48127 9277.7939492494 4292 c 24941 328772 9223.1099889139 22242 behavior 77 39241 9183.8412990765 49199 smiled 6729 9348 9129.0893556203 4770 12 17211 250936 9127.8801433412 6378 pupils 7208 10927 9053.63488411 3833 ii 8284 154732 9035.601558152 10345 8 549 50188 9024.5414769981 10288 9 661 52420 9018.0418626443 8771 minister 22806 78103 8967.1700836324 59059 2003 58 37636 8942.1587389756 12875 hadn't 6604 9218 8923.3445141712 8689 herself 14960 41473 8877.7035019803 569 should 82486 418714 8877.454754204 957 although 36764 151610 8875.7351446572


1138 back 76751 383996 8865.3435531673 59033 2002 44 36752 8845.1427105891 2024 government 55614 259021 8776.2058424921 1022 only 121269 661568 8738.5798372518 426 rather 35119 143849 8655.9326349766 1347 perhaps 26568 99037 8520.0066543793 457 out 156597 892503 8388.3678176077 8610 esq 83 36179 8370.4439729923 35516 mutations 443 45119 8361.0360806355 4862 defense 182 38921 8345.3349838027 191891 google 2 32506 8225.0168264559 3474 al 6869 132669 8153.6832193832 7161 roll 2574 78930 8146.8559815148 1167 our 67056 700837 8145.8879198004 6148 door 20029 67573 8142.1111855015 42310 doesn 4 32244 8129.7762181112 38451 2001 115 36152 8121.6972912228 53661 color 111 36005 8114.4114237714 22243 labor 172 37642 8097.2629513604 1388 face 30417 121508 8075.4720441463 129 i 453418 3843181 8033.1887731992 87 have 338692 2111469 8001.4133918609 1944 guard 2581 78146 7979.7650943651 2776 party 37214 159656 7916.7781976436 6229 toward 1081 54056 7689.9141870756 3069 5 39444 449040 7687.4323025625 39 that 820530 5452708 7679.7313980869 12977 california 2245 71040 7525.2406735645 1108 no 159426 925114 7442.6277855745 7931 w 5683 114255 7442.6218753895 17710 hills 2349 71704 7372.7568449712 39294 francisco 767 47081 7361.8811444692 3115 canada 2382 71880 7319.0022306031 394 concerned 13320 38612 7314.3035696557 445 into 131868 748012 7312.103479753 419 over 110970 614181 7289.9067057829 112998 nike 62 30939 7230.2878518921 1340 wouldn't 9557 22830 7143.9056159501 6543 1987 8179 17405 7137.0274816702 79894 info 227 34627 7025.5104321713 2046 british 30950 130196 7020.4316304055 1170 centre 18661 65336 6975.5912596027 1638 might 47608 225872 6967.900401166 87801 lauren 116 31339 6924.6266495593 1318 you're 18489 64728 6912.7355965327 1031 asked 26993 109262 6894.2195257067 81650 eu 44 28894 6869.6952758842 7405 award 14084 43540 6859.4835182684 3193 14 11433 172454 6775.470323999


7910 j 9417 151092 6752.4100753252 1919 round 22633 86734 6748.6428618275 1345 13 10837 165186 6639.8471085503 2580 colour 9095 22091 6637.9822289633
