The Written British National Corpus 2014: Design, Compilation and Analysis

Abi Hawtin
ESRC Centre for Corpus Approaches to Social Science
Department of Linguistics and English Language
Lancaster University

A thesis submitted to Lancaster University for the degree of Doctor of Philosophy in Linguistics
December 2018

Table of Contents

Table of Contents
Declaration
Abstract
List of Tables
List of Figures
List of Appendices
Acknowledgements

Chapter 1: Introduction
  1.1 The British National Corpus 2014
  1.2 Distinguishing between spoken and written language
  1.3 Justifying the Written BNC2014 project
    1.3.1 Introduction
    1.3.2 The British National Corpus (BNC1994)
    1.3.3 The enduring popularity of the BNC1994
    1.3.4 Other corpora of written British English
      1.3.4.1 The Brown Family
      1.3.4.2 The Bank of English
      1.3.4.3 ukWaC
      1.3.4.4 ICE-GB
    1.3.5 Summary and justification for the project
  1.4 The project team and my ownership of the research
    1.4.1 The Written BNC2014 project team
    1.4.2 Ownership of research
  1.5 Copyright and Permissions
    1.5.1 Introduction
    1.5.2 Copyright law
    1.5.3 Expert opinions
    1.5.4 Definitions
    1.5.5 Conclusion
  1.6 Research aims and the structure of the thesis

Chapter 2: Contemporary National Corpora
  2.1 Introduction
  2.2 The Corpus de référence du français contemporain (CRFC)
  2.3 The Czech National Corpus (SYN2015)
  2.4 The Thai National Corpus (TNC)
  2.5 The American National Corpus (ANC)
  2.6 Corpus of Contemporary American English (COCA)
  2.7 Deutsches Referenzkorpus (DeReKo)
  2.8 Conclusion

Chapter 3: Creating Representative and Comparable Corpora
  3.1 Introduction
  3.2 Corpus Representativeness
    3.2.1 Defining representativeness
    3.2.2 Sampling
      3.2.2.1 Population definition
      3.2.2.2 Random sampling
      3.2.2.3 Proportional sampling
      3.2.2.4 Balance
      3.2.2.5 Sample size
      3.2.2.6 Number of samples
      3.2.2.7 Corpus size
    3.2.3 Representativeness is not possible
    3.2.4 The BNC1994
    3.2.5 Conclusion
  3.3 Comparability in corpus design
    3.3.1 Introduction
    3.3.2 Methods for creating and testing comparable corpora
      3.3.2.1 Creating comparable corpora using web crawling
      3.3.2.2 Testing corpus comparability
    3.3.3 The Brown Family
    3.3.4 ARCHER
    3.3.5 Conclusion
  3.4 Representativeness vs. Comparability
    3.4.1 The Problem
    3.4.2 The Solution

Chapter 4: Designing the Written BNC2014 Sampling Frame
  4.1 Introduction
  4.2 Classifying texts in corpora
    4.2.1 Introduction
    4.2.2 Genre Theory
      4.2.2.1 Introduction
      4.2.2.2 Approaches to genre
    4.2.3 Genre, register, style, text type: some definitions
      4.2.3.1 Introduction
      4.2.3.2 Genre
      4.2.3.3 Register
      4.2.3.4 Style
      4.2.3.5 Text type
      4.2.3.6 Terminology in this thesis
    4.2.4 Text classification in previous national corpora
      4.2.4.1 The Brown Family
      4.2.4.2 The Corpus de référence du français contemporain (CRFC)
      4.2.4.3 The British National Corpus 1994
      4.2.4.4 Summary
    4.2.5 Text classification in the Written BNC2014
  4.3 Design of the sampling frame


