The BNC Handbook Exploring the British National Corpus with SARA
Total Page:16
File Type:pdf, Size:1020Kb
The BNC Handbook Exploring the British National Corpus with SARA Guy Aston and Lou Burnard September 1997 ii Preface The British National Corpus is a collection of over 4000 samples of modern British English, both spoken and written, stored in electronic form and selected so as to reflect the widest possible variety of users and uses of the language. Totalling over 100 million words, the corpus is currently being used by lex- icographers to create dictionaries, by computer scientists to make machines ‘understand’ and produce natural language, by linguists to describe the English language, and by language teachers and students to teach and learn it — to name but a few of its applications. Institutions all over Europe have purchased the BNC and installed it on computers for use in their research. However it is not necessary to possess a copy of the corpus in order to make use of it: it can also be consulted via the Internet, using either the World Wide Web or the SARA software system, which was developed specially for this purpose. The BNC Handbook provides a comprehensive guide to the SARA software distributed with the corpus and used on the network service. It illustrates some of the ways in which it is possible to find out about contemporary English usage from the BNC, and aims to encourage use of the corpus by a wide and varied public. We have tried as far as possible to avoid jargon and unnecessary technicalities; the book assumes nothing more than an interest in language and linguistic problems on the part of its readers. The handbook has three major parts. It begins with an introduction to the topic of corpus linguistics, intended to bring the substantial amount of corpus- based work already done in a variety of research areas to the non-specialist reader’s attention. It also provides an outline description of the BNC itself. The bulk of the book however is concerned with the use of the SARA search program. This part consists of a series of detailed task descriptions which (it is hoped) will serve to teach the reader how to use SARA eVectively, and at the same time stimulate his or her interest in using the BNC. There are ten tasks, each of which introduces a new group of features of the software and of the corpus, of roughly increasing complexity. At the end of each task there are suggestions for further related work. The last part of the handbook gives a summary overview of the SARA program’s commands and capabilities, intended for reference purposes, details of the main coding schemes used in the corpus, and a select bibliography. The BNC was created by a consortium led by Oxford University Press, together with major dictionary publishers Longman and Chambers, and research centres at the Universities of Lancaster and Oxford, and at the British Library. Its creation was jointly funded by the Department of Trade and Industry and iv the Science and Research Council (now EPSRC), under the Joint Framework for Information Technology programme, with substantial investment from the commercial partners in the consortium. Reference information on the BNC and its creation is available in the BNC Users’ Reference Guide (Burnard 1995), which is distributed with the corpus, and also from the BNC project’s web site at http://info.ox.ac.uk/bnc. This handbook was prepared with the assistance of a major research grant from the British Academy, whose support is gratefully acknowledged. Thanks are also due to Tony Dodd, author of the SARA search software, to Phil Champ, for assistance with the SARA help file, to Jean-Daniel Fekète and Sebastian Rahtz for help in formatting, and to our respective host institutions at the Universities of Oxford and Bologna. Our greatest debt however is to Lilette, for putting up with us during the writing of it. contents v Contents I: Corpus linguistics and the BNC 1 1 Corpuslinguistics ........................ 4 1.1 Whatisacorpus? .................... 4 1.2 Whatcanyougetoutofacorpus? . 5 1.3 Howhavecorporabeenused?. 10 1.3.1 Whatkindsofcorporaexist? . 10 1.3.2 Someapplicationareas . 12 1.3.3 Collocation.................. 13 1.3.4 Contrastivestudies . 15 1.3.5 NLPapplications . 18 1.3.6 Languageteaching . 19 1.4 How should a corpus be constructed? . 21 1.4.1 Corpusdesign ................ 21 1.4.2 Encoding, annotation, and transcription . 24 2 TheBritishNationalCorpus . 28 2.1 HowtheBNCwasconstructed . 28 2.1.1 Corpusdesign ................ 28 2.1.2 Encoding, annotation, and transcription . 33 2.2 UsingtheBNC:somecaveats. 36 2.2.1 Sourcematerials . 37 2.2.2 Sampling,encoding,andtaggingerrors . 38 2.2.3 WhatisaBNCdocument?. 39 2.2.4 Miscellaneousproblems. 40 3 Futurecorpora.......................... 42 II: Exploring the BNC with SARA 45 1 Oldwordsandnewwords.... ........... ..... 48 1.1 The problem: finding evidence of language change . 48 1.1.1 Neologismanddisuse . 48 1.1.2 Highlightedfeatures . 48 1.1.3 Beforeyoustart. 49 1.2 Procedure ........................ 49 1.2.1 StartingSARA ................ 49 1.2.2 Loggingon.................. 49 1.2.3 Gettinghelp ................. 50 1.2.4 QuittingSARA........... ..... 50 1.2.5 Using Phrase Query to find a word: ‘cracks- men’ ..................... 51 vi contents 1.2.6 Viewing the solutions: display modes . 51 1.2.7 Obtaining contextual information: the Source andBrowseoptions. 52 1.2.8 Changing the defaults: the User preferences dialoguebox ................. 54 1.2.9 Comparing queries: ‘cracksmen’ and ‘cracks- man’ ..................... 57 1.2.10 Viewing multiple solutions: ‘whammy’ . 58 1.3 Discussion and suggestions for further work . 60 1.3.1 Caveats .................... 60 1.3.2 Somesimilarproblems . 61 2 Whatismorethanonecorpus? . 63 2.1 Theproblem:relativefrequencies . 63 2.1.1 ‘Corpus’indictionaries. 63 2.1.2 Highlightedfeatures . 63 2.1.3 Beforeyoustart. 64 2.2 Procedure ........................ 64 2.2.1 Finding word frequencies using Word Query 64 2.2.2 Looking for more than one word form using WordQuery ................. 65 2.2.3 Definingdownloadcriteria . 66 2.2.4 Thinning downloaded solutions . 68 2.2.5 Savingandre-openingqueries . 70 2.3 Discussion and suggestions for further work . 72 2.3.1 Phrase Query or Word Query? . 72 2.3.2 Somesimilarproblems . 72 3 Whenisajarnotadoor?..................... 74 3.1 The problem: words and their company . 74 3.1.1 Collocation.................. 74 3.1.2 Highlightedfeatures . 75 3.1.3 Beforeyoustart. 75 3.2 Procedure ........................ 76 3.2.1 UsingtheCollocationoption . 76 3.2.2 Investigating collocates using the Sort option . 78 3.2.3 Printingsolutions. 81 3.2.4 Investigating collocations without download- ing: are men as handsome as women are beautiful? ................... 82 3.3 Discussion and suggestions for further work . 83 contents vii 3.3.1 Thesignificanceofcollocations . 83 3.3.2 Somesimilarproblems . 85 4 Aquerytoofar ......................... 86 4.1 Theproblem:variationinphrases . 86 4.1.1 A first example: ‘the horse’s mouth’ . 86 4.1.2 Asecondexample: ‘abridgetoofar’ . 86 4.1.3 Highlightedfeatures . 86 4.1.4 Beforeyoustart. 87 4.2 Procedure ........................ 87 4.2.1 Looking for phrases using Phrase Query . 87 4.2.2 Looking for phrases using Query Builder . 91 4.3 Discussion and suggestions for further work . 95 4.3.1 Looking for variant phrases . 95 4.3.2 Somesimilarproblems . 95 5 Do people ever say ‘you can say that again’? . 98 5.1 Theproblem: comparingtypesoftexts . 98 5.1.1 Spokenandwrittenvarieties . 98 5.1.2 Highlightedfeatures . 98 5.1.3 Beforeyoustart. 99 5.2 Procedure ........................ 99 5.2.1 Identifying text type by using the Source option 99 5.2.2 Finding text-type frequencies using SGML Query ....................100 5.2.3 Displaying SGML markup in solutions . 103 5.2.4 Searching in specific text-types: ‘good heav- ens’inrealandimaginedspeech . 105 5.2.5 Comparing frequencies in diVerent text-types 107 5.3 Discussion and suggestions for further work . 108 5.3.1 Investigating other explanations: combining attributes ... ........... ..... 108 5.3.2 Somesimilarproblems . 110 6 Domensaymauve? ....................... 112 6.1 The problem: investigating sociolinguistic variables . 112 6.1.1 Comparingcategoriesofspeakers . 112 6.1.2 ComponentsofBNCtexts . 112 6.1.3 Highlightedfeatures . 114 6.1.4 Beforeyoustart. 115 6.2 Procedure ........................ 115 6.2.1 Searchinginspokenutterances . 115 viii contents 6.2.2 Using Custom display format . 116 6.2.3 Comparing frequencies for diVerent types of speaker: maleandfemale‘lovely’ . 117 6.2.4 Investigating other sociolinguistic variables: ageand‘goodheavens’ . 123 6.3 Discussion and suggestions for further work . 125 6.3.1 Sociolinguistic variables in spoken and writ- tentexts ................... 125 6.3.2 Somesimilarproblems . 126 7 ‘Madonna hits album’ — did it hit back? . 129 7.1 Theproblem:linguisticambiguity . 129 7.1.1 TheEnglishofheadlines . 129 7.1.2 Particular parts of speech, particular portions oftexts .................... 129 7.1.3 Highlightedfeatures . 130 7.1.4 Beforeyoustart. 131 7.2 Procedure ........................ 131 7.2.1 Searching for particular parts-of-speech with POSQuery:‘hits’asaverb. 131 7.2.2 Searching within particular portions of texts withQueryBuilder . 133 7.2.3 Searching within particular portions of par- ticulartext-typeswithCQLQuery . 135 7.2.4 Displaying and sorting part-of-speech codes . 137 7.2.5 Re-using the text of one query in another: ‘hits’asanoun . 138 7.2.6 Investigating colligations using POS collating 139 7.3 Discussion and suggestions for further work . 140 7.3.1 Usingpart-of-speechcodes . 140 7.3.2 Somesimilarproblems . 141 8 Springingsurprisesonthearmchairlinguist . 143 8.1 The problem: intuitions about grammar . 143 8.1.1 Participant roles and syntactic variants . 143 8.1.2 Inflectionsandderivedforms . 144 8.1.3 Variationinorderanddistance . 145 8.1.4 Highlightedfeatures . 146 8.1.5 Beforeyoustart. 146 8.2 Procedure ........................ 147 contents ix 8.2.1 Designing patterns using Word Query: forms of‘spring’. 147 8.2.2 UsingPatternQuery . 150 8.2.3 Varying order and distance between nodes in QueryBuilder . 150 8.2.4 Checking precision with the Collocation and Sortoptions.