The Written British National Corpus 2014: Design, Compilation and Analysis

Abi Hawtin
ESRC Centre for Corpus Approaches to Social Science
Department of Linguistics and English Language
Lancaster University

A thesis submitted to Lancaster University for the degree of Doctor of Philosophy in Linguistics

December 2018

Table of Contents

Declaration
Abstract
List of Tables
List of Figures
List of Appendices
Acknowledgements

Chapter 1: Introduction
1.1 The British National Corpus 2014
1.2 Distinguishing between spoken and written language
1.3 Justifying the Written BNC2014 project
1.3.1 Introduction
1.3.2 The British National Corpus (BNC1994)
1.3.3 The enduring popularity of the BNC1994
1.3.4 Other corpora of written British English
1.3.4.1 The Brown Family
1.3.4.2 The Bank of English
1.3.4.3 ukWaC
1.3.4.4 ICE-GB
1.3.5 Summary and justification for the project
1.4 The project team and my ownership of the research
1.4.1 The Written BNC2014 project team
1.4.2 Ownership of research
1.5 Copyright and Permissions
1.5.1 Introduction
1.5.2 Copyright law
1.5.3 Expert opinions
1.5.4 Definitions
1.5.5 Conclusion
1.6 Research aims and the structure of the thesis

Chapter 2: Contemporary National Corpora
2.1 Introduction
2.2 The Corpus de référence du français contemporain (CRFC)
2.3 The Czech National Corpus (SYN2015)
2.4 The Thai National Corpus (TNC)
2.5 The American National Corpus (ANC)
2.6 Corpus of Contemporary American English (COCA)
2.7 Deutsches Referenzkorpus (DeReKo)
2.8 Conclusion

Chapter 3: Creating Representative and Comparable Corpora
3.1 Introduction
3.2 Corpus Representativeness
3.2.1 Defining representativeness
3.2.2 Sampling
3.2.2.1 Population definition
3.2.2.2 Random sampling
3.2.2.3 Proportional sampling
3.2.2.4 Balance
3.2.2.5 Sample size
3.2.2.6 Number of samples
3.2.2.7 Corpus size
3.2.3 Representativeness is not possible
3.2.4 The BNC1994
3.2.5 Conclusion
3.3 Comparability in corpus design
3.3.1 Introduction
3.3.2 Methods for creating and testing comparable corpora
3.3.2.1 Creating comparable corpora using web crawling
3.3.2.2 Testing corpus comparability
3.3.3 The Brown Family
3.3.4 ARCHER
3.3.5 Conclusion
3.4 Representativeness vs. Comparability
3.4.1 The Problem
3.4.2 The Solution

Chapter 4: Designing the Written BNC2014 Sampling Frame
4.1 Introduction
4.2 Classifying texts in corpora
4.2.1 Introduction
4.2.2 Genre Theory
4.2.2.1 Introduction
4.2.2.2 Approaches to genre
4.2.3 Genre, register, style, text type – some definitions
4.2.3.1 Introduction
4.2.3.2 Genre
4.2.3.3 Register
4.2.3.4 Style
4.2.3.5 Text type
4.2.3.6 Terminology in this thesis
4.2.4 Text classification in previous national corpora
4.2.4.1 The Brown Family
4.2.4.2 The Corpus de référence du français contemporain (CRFC)
4.2.4.3 The British National Corpus 1994
4.2.4.4 Summary
4.2.5 Text classification in the Written BNC2014
4.3 Design of the sampling frame

