The BNC Handbook Exploring the British National Corpus with SARA

Total Page:16

File Type:pdf, Size:1020Kb

The BNC Handbook Exploring the British National Corpus with SARA The BNC Handbook Exploring the British National Corpus with SARA Guy Aston and Lou Burnard September 1997 ii Preface The British National Corpus is a collection of over 4000 samples of modern British English, both spoken and written, stored in electronic form and selected so as to reflect the widest possible variety of users and uses of the language. Totalling over 100 million words, the corpus is currently being used by lex- icographers to create dictionaries, by computer scientists to make machines ‘understand’ and produce natural language, by linguists to describe the English language, and by language teachers and students to teach and learn it — to name but a few of its applications. Institutions all over Europe have purchased the BNC and installed it on computers for use in their research. However it is not necessary to possess a copy of the corpus in order to make use of it: it can also be consulted via the Internet, using either the World Wide Web or the SARA software system, which was developed specially for this purpose. The BNC Handbook provides a comprehensive guide to the SARA software distributed with the corpus and used on the network service. It illustrates some of the ways in which it is possible to find out about contemporary English usage from the BNC, and aims to encourage use of the corpus by a wide and varied public. We have tried as far as possible to avoid jargon and unnecessary technicalities; the book assumes nothing more than an interest in language and linguistic problems on the part of its readers. The handbook has three major parts. It begins with an introduction to the topic of corpus linguistics, intended to bring the substantial amount of corpus- based work already done in a variety of research areas to the non-specialist reader’s attention. It also provides an outline description of the BNC itself. The bulk of the book however is concerned with the use of the SARA search program. This part consists of a series of detailed task descriptions which (it is hoped) will serve to teach the reader how to use SARA eVectively, and at the same time stimulate his or her interest in using the BNC. There are ten tasks, each of which introduces a new group of features of the software and of the corpus, of roughly increasing complexity. At the end of each task there are suggestions for further related work. The last part of the handbook gives a summary overview of the SARA program’s commands and capabilities, intended for reference purposes, details of the main coding schemes used in the corpus, and a select bibliography. The BNC was created by a consortium led by Oxford University Press, together with major dictionary publishers Longman and Chambers, and research centres at the Universities of Lancaster and Oxford, and at the British Library. Its creation was jointly funded by the Department of Trade and Industry and iv the Science and Research Council (now EPSRC), under the Joint Framework for Information Technology programme, with substantial investment from the commercial partners in the consortium. Reference information on the BNC and its creation is available in the BNC Users’ Reference Guide (Burnard 1995), which is distributed with the corpus, and also from the BNC project’s web site at http://info.ox.ac.uk/bnc. This handbook was prepared with the assistance of a major research grant from the British Academy, whose support is gratefully acknowledged. Thanks are also due to Tony Dodd, author of the SARA search software, to Phil Champ, for assistance with the SARA help file, to Jean-Daniel Fekète and Sebastian Rahtz for help in formatting, and to our respective host institutions at the Universities of Oxford and Bologna. Our greatest debt however is to Lilette, for putting up with us during the writing of it. contents v Contents I: Corpus linguistics and the BNC 1 1 Corpuslinguistics ........................ 4 1.1 Whatisacorpus? .................... 4 1.2 Whatcanyougetoutofacorpus? . 5 1.3 Howhavecorporabeenused?. 10 1.3.1 Whatkindsofcorporaexist? . 10 1.3.2 Someapplicationareas . 12 1.3.3 Collocation.................. 13 1.3.4 Contrastivestudies . 15 1.3.5 NLPapplications . 18 1.3.6 Languageteaching . 19 1.4 How should a corpus be constructed? . 21 1.4.1 Corpusdesign ................ 21 1.4.2 Encoding, annotation, and transcription . 24 2 TheBritishNationalCorpus . 28 2.1 HowtheBNCwasconstructed . 28 2.1.1 Corpusdesign ................ 28 2.1.2 Encoding, annotation, and transcription . 33 2.2 UsingtheBNC:somecaveats. 36 2.2.1 Sourcematerials . 37 2.2.2 Sampling,encoding,andtaggingerrors . 38 2.2.3 WhatisaBNCdocument?. 39 2.2.4 Miscellaneousproblems. 40 3 Futurecorpora.......................... 42 II: Exploring the BNC with SARA 45 1 Oldwordsandnewwords.... ........... ..... 48 1.1 The problem: finding evidence of language change . 48 1.1.1 Neologismanddisuse . 48 1.1.2 Highlightedfeatures . 48 1.1.3 Beforeyoustart. 49 1.2 Procedure ........................ 49 1.2.1 StartingSARA ................ 49 1.2.2 Loggingon.................. 49 1.2.3 Gettinghelp ................. 50 1.2.4 QuittingSARA........... ..... 50 1.2.5 Using Phrase Query to find a word: ‘cracks- men’ ..................... 51 vi contents 1.2.6 Viewing the solutions: display modes . 51 1.2.7 Obtaining contextual information: the Source andBrowseoptions. 52 1.2.8 Changing the defaults: the User preferences dialoguebox ................. 54 1.2.9 Comparing queries: ‘cracksmen’ and ‘cracks- man’ ..................... 57 1.2.10 Viewing multiple solutions: ‘whammy’ . 58 1.3 Discussion and suggestions for further work . 60 1.3.1 Caveats .................... 60 1.3.2 Somesimilarproblems . 61 2 Whatismorethanonecorpus? . 63 2.1 Theproblem:relativefrequencies . 63 2.1.1 ‘Corpus’indictionaries. 63 2.1.2 Highlightedfeatures . 63 2.1.3 Beforeyoustart. 64 2.2 Procedure ........................ 64 2.2.1 Finding word frequencies using Word Query 64 2.2.2 Looking for more than one word form using WordQuery ................. 65 2.2.3 Definingdownloadcriteria . 66 2.2.4 Thinning downloaded solutions . 68 2.2.5 Savingandre-openingqueries . 70 2.3 Discussion and suggestions for further work . 72 2.3.1 Phrase Query or Word Query? . 72 2.3.2 Somesimilarproblems . 72 3 Whenisajarnotadoor?..................... 74 3.1 The problem: words and their company . 74 3.1.1 Collocation.................. 74 3.1.2 Highlightedfeatures . 75 3.1.3 Beforeyoustart. 75 3.2 Procedure ........................ 76 3.2.1 UsingtheCollocationoption . 76 3.2.2 Investigating collocates using the Sort option . 78 3.2.3 Printingsolutions. 81 3.2.4 Investigating collocations without download- ing: are men as handsome as women are beautiful? ................... 82 3.3 Discussion and suggestions for further work . 83 contents vii 3.3.1 Thesignificanceofcollocations . 83 3.3.2 Somesimilarproblems . 85 4 Aquerytoofar ......................... 86 4.1 Theproblem:variationinphrases . 86 4.1.1 A first example: ‘the horse’s mouth’ . 86 4.1.2 Asecondexample: ‘abridgetoofar’ . 86 4.1.3 Highlightedfeatures . 86 4.1.4 Beforeyoustart. 87 4.2 Procedure ........................ 87 4.2.1 Looking for phrases using Phrase Query . 87 4.2.2 Looking for phrases using Query Builder . 91 4.3 Discussion and suggestions for further work . 95 4.3.1 Looking for variant phrases . 95 4.3.2 Somesimilarproblems . 95 5 Do people ever say ‘you can say that again’? . 98 5.1 Theproblem: comparingtypesoftexts . 98 5.1.1 Spokenandwrittenvarieties . 98 5.1.2 Highlightedfeatures . 98 5.1.3 Beforeyoustart. 99 5.2 Procedure ........................ 99 5.2.1 Identifying text type by using the Source option 99 5.2.2 Finding text-type frequencies using SGML Query ....................100 5.2.3 Displaying SGML markup in solutions . 103 5.2.4 Searching in specific text-types: ‘good heav- ens’inrealandimaginedspeech . 105 5.2.5 Comparing frequencies in diVerent text-types 107 5.3 Discussion and suggestions for further work . 108 5.3.1 Investigating other explanations: combining attributes ... ........... ..... 108 5.3.2 Somesimilarproblems . 110 6 Domensaymauve? ....................... 112 6.1 The problem: investigating sociolinguistic variables . 112 6.1.1 Comparingcategoriesofspeakers . 112 6.1.2 ComponentsofBNCtexts . 112 6.1.3 Highlightedfeatures . 114 6.1.4 Beforeyoustart. 115 6.2 Procedure ........................ 115 6.2.1 Searchinginspokenutterances . 115 viii contents 6.2.2 Using Custom display format . 116 6.2.3 Comparing frequencies for diVerent types of speaker: maleandfemale‘lovely’ . 117 6.2.4 Investigating other sociolinguistic variables: ageand‘goodheavens’ . 123 6.3 Discussion and suggestions for further work . 125 6.3.1 Sociolinguistic variables in spoken and writ- tentexts ................... 125 6.3.2 Somesimilarproblems . 126 7 ‘Madonna hits album’ — did it hit back? . 129 7.1 Theproblem:linguisticambiguity . 129 7.1.1 TheEnglishofheadlines . 129 7.1.2 Particular parts of speech, particular portions oftexts .................... 129 7.1.3 Highlightedfeatures . 130 7.1.4 Beforeyoustart. 131 7.2 Procedure ........................ 131 7.2.1 Searching for particular parts-of-speech with POSQuery:‘hits’asaverb. 131 7.2.2 Searching within particular portions of texts withQueryBuilder . 133 7.2.3 Searching within particular portions of par- ticulartext-typeswithCQLQuery . 135 7.2.4 Displaying and sorting part-of-speech codes . 137 7.2.5 Re-using the text of one query in another: ‘hits’asanoun . 138 7.2.6 Investigating colligations using POS collating 139 7.3 Discussion and suggestions for further work . 140 7.3.1 Usingpart-of-speechcodes . 140 7.3.2 Somesimilarproblems . 141 8 Springingsurprisesonthearmchairlinguist . 143 8.1 The problem: intuitions about grammar . 143 8.1.1 Participant roles and syntactic variants . 143 8.1.2 Inflectionsandderivedforms . 144 8.1.3 Variationinorderanddistance . 145 8.1.4 Highlightedfeatures . 146 8.1.5 Beforeyoustart. 146 8.2 Procedure ........................ 147 contents ix 8.2.1 Designing patterns using Word Query: forms of‘spring’. 147 8.2.2 UsingPatternQuery . 150 8.2.3 Varying order and distance between nodes in QueryBuilder . 150 8.2.4 Checking precision with the Collocation and Sortoptions.
Recommended publications
  • Talk Bank: a Multimodal Database of Communicative Interaction
    Talk Bank: A Multimodal Database of Communicative Interaction 1. Overview The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can to extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web- based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes.
    [Show full text]
  • Collection of Usage Information for Language Resources from Academic Articles
    Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD.
    [Show full text]
  • From CHILDES to Talkbank
    From CHILDES to TalkBank Brian MacWhinney Carnegie Mellon University MacWhinney, B. (2001). New developments in CHILDES. In A. Do, L. Domínguez & A. Johansen (Eds.), BUCLD 25: Proceedings of the 25th annual Boston University Conference on Language Development (pp. 458-468). Somerville, MA: Cascadilla. a similar article appeared as: MacWhinney, B. (2001). From CHILDES to TalkBank. In M. Almgren, A. Barreña, M. Ezeizaberrena, I. Idiazabal & B. MacWhinney (Eds.), Research on Child Language Acquisition (pp. 17-34). Somerville, MA: Cascadilla. Recent years have seen a phenomenal growth in computer power and connectivity. The computer on the desktop of the average academic researcher now has the power of room-size supercomputers of the 1980s. Using the Internet, we can connect in seconds to the other side of the world and transfer huge amounts of text, programs, audio and video. Our computers are equipped with programs that allow us to view, link, and modify this material without even having to think about programming. Nearly all of the major journals are now available in electronic form and the very nature of journals and publication is undergoing radical change. These new trends have led to dramatic advances in the methodology of science and engineering. However, the social and behavioral sciences have not shared fully in these advances. In large part, this is because the data used in the social sciences are not well- structured patterns of DNA sequences or atomic collisions in super colliders. Much of our data is based on the messy, ill-structured behaviors of humans as they participate in social interactions. Categorizing and coding these behaviors is an enormous task in itself.
    [Show full text]
  • The British National Corpus Revisited: Developing Parameters for Written BNC2014 Abi Hawtin (Lancaster University, UK)
    The British National Corpus Revisited: Developing parameters for Written BNC2014 Abi Hawtin (Lancaster University, UK) 1. The British National Corpus 2014 project The ESRC Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and Cambridge University Press are working together to create a new, publicly accessible corpus of contemporary British English. The corpus will be a successor to the British National Corpus, created in the early 1990s (BNC19941). The British National Corpus 2014 (BNC2014) will be of the same order of magnitude as BNC1994 (100 million words). It is currently projected that the corpus will reach completion in mid-2018. Creation of the spoken sub-section of BNC2014 is in its final stages; this paper focuses on the justification and development of the written sub- section of BNC2014. 2. Justification for BNC2014 There are now many web-crawled corpora of English, widely available to researchers, which contain far more data than will be included in Written BNC2014. For example, the English TenTen corpus (enTenTen) is a web-crawled corpus of Written English which currently contains approximately 19 billion words (Jakubíček et al., 2013). Web-crawling processes mean that this huge amount of data can be collected very quickly; Jakubíček et al. (2013: 125) note that “For a language where there is plenty of material available, we can gather, clean and de-duplicate a billion words a day.” So, with extremely large and quickly-created web-crawled corpora now becoming commonplace in corpus linguistics, the question might well be asked why a new, 100 million word corpus of Written British English is needed.
    [Show full text]
  • Comprehensive and Consistent Propbank Light Verb Annotation Claire Bonial,1 Martha Palmer2 1U.S
    Comprehensive and Consistent PropBank Light Verb Annotation Claire Bonial,1 Martha Palmer2 1U.S. Army Research Laboratory, 2University of Colorado, Boulder 12800 Powder Mill Rd., Adelphi, MD, USA E-mail: [email protected], [email protected] Abstract Recent efforts have focused on expanding the annotation coverage of PropBank from verb relations to adjective and noun relations, as well as light verb constructions (e.g., make an offer, take a bath). While each new relation type has presented unique annotation challenges, ensuring consistent and comprehensive annotation of light verb constructions has proved particularly challenging, given that light verb constructions are semi-productive, difficult to define, and there are often borderline cases. This research describes the iterative process of developing PropBank annotation guidelines for light verb constructions, the current guidelines, and a comparison to related resources. Keywords: light verb constructions, semantic role labeling, NLP resources 2. PropBank Background 1. Introduction PropBank annotation consists of two tasks: sense annotation and role annotation. The PropBank lexicon The goal of PropBank (Palmer et al., 2005) is to supply provides a listing of the coarse-grained senses of a verb, consistent, general-purpose labeling of semantic roles noun, or adjective relation, and the set of roles associated across different syntactic realizations. With over two with each sense (thus, called a “roleset”). The roles are million words from diverse genres, the benchmark listed as argument numbers (Arg0 – Arg6) and annotated corpus supports the training of automatic correspond to verb-specific roles. For example: semantic role labelers, which in turn support other Natural Language Processing (NLP) areas, such as Offer-01 (transaction, proposal): machine translation.
    [Show full text]
  • 3 Corpus Tools for Lexicographers
    Comp. by: pg0994 Stage : Proof ChapterID: 0001546186 Date:14/5/12 Time:16:20:14 Filepath:d:/womat-filecopy/0001546186.3D31 OUP UNCORRECTED PROOF – FIRST PROOF, 14/5/2012, SPi 3 Corpus tools for lexicographers ADAM KILGARRIFF AND IZTOK KOSEM 3.1 Introduction To analyse corpus data, lexicographers need software that allows them to search, manipulate and save data, a ‘corpus tool’. A good corpus tool is the key to a comprehensive lexicographic analysis—a corpus without a good tool to access it is of little use. Both corpus compilation and corpus tools have been swept along by general technological advances over the last three decades. Compiling and storing corpora has become far faster and easier, so corpora tend to be much larger than previous ones. Most of the first COBUILD dictionary was produced from a corpus of eight million words. Several of the leading English dictionaries of the 1990s were produced using the British National Corpus (BNC), of 100 million words. Current lexico- graphic projects we are involved in use corpora of around a billion words—though this is still less than one hundredth of one percent of the English language text available on the Web (see Rundell, this volume). The amount of data to analyse has thus increased significantly, and corpus tools have had to be improved to assist lexicographers in adapting to this change. Corpus tools have become faster, more multifunctional, and customizable. In the COBUILD project, getting concordance output took a long time and then the concordances were printed on paper and handed out to lexicographers (Clear 1987).
    [Show full text]
  • Università Degli Studi Di Macerata
    UNIVERSITÀ DEGLI STUDI DI MACERATA Dipartimento di Studi Umanistici – Lingue, Mediazione, Storia, Lettere, Filosofia Corso di Laurea Magistrale in Lingue Moderne per la Comunicazione e la Cooperazione Internazionale (ClasseLM-38) Traduzione per laComunicazione Internazionale – inglese -mod. B STRUMENTI E TECNOLOGIE PER LA TRADUZIONESPECIALISTICA 1 What is a corpus? Some (authoritative) definitions • “a collection of naturally-occurring language text, chosen to characterize a state or variety of a language” (Sinclair, 1991:171) • “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to be used for linguistic analysis” (Francis, 1992:7) • “a closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria” (Engwall, 1992:167) • “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery & Wilson, 1996:23) • “a collection of (1) machine-readable (2) authentic texts […] which is (3) sampled to be (4) representative of a particular language or language variety” (McEnery et al., 2006:5) What is / is not a corpus…? • A newspaper archive on CD-ROM? The answer is • An online glossary? always “NO” • A digital library (e.g. Project (see Gutenberg)? definition) • All RAI 1 programmes (e.g. for spoken TV language) Corpora vs. web •Corpora: – Usually stable •searches can be replicated – Control over contents •we can select the texts to be included, or have control over selection strategies – Ad-hoc linguistically-aware software to investigate them •concordancers can sort / organise concordance lines • Web (as accessed via Google or other search engines): – Very unstable •results can change at any time for reasons beyond our control – No control over contents •what/how many texts are indexed by Google’s robots? – Limited control over search results •cannot sort or organise hits meaningfully; they are presented randomly Click here for another corpus vs.
    [Show full text]
  • Exploring Methods and Resources for Discriminating Similar Languages
    Exploring Methods and Resources for Discriminating Similar Languages Marco Lui♥♣, Ned Letcher♥, Oliver Adams♥, Long Duong♥♣, Paul Cook♥ and Timothy Baldwin♥♣ ♥ Department of Computing and Information Systems The University of Melbourne ♣ NICTA Victoria [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract The Discriminating between Similar Languages (DSL) shared task at VarDial challenged partici- pants to build an automatic language identification system to discriminate between 13 languages in 6 groups of highly-similar languages (or national varieties of the same language). In this paper, we describe the submissions made by team UniMelb-NLP, which took part in both the closed and open categories. We present the text representations and modeling techniques used, including cross-lingual POS tagging as well as fine-grained tags extracted from a deep grammar of English, and discuss additional data we collected for the open submissions, utilizing custom- built web corpora based on top-level domains as well as existing corpora. 1 Introduction Language identification (LangID) is the problem of determining what natural language a document is written in. Studies in the area often report high accuracy (Cavnar and Trenkle, 1994; Dunning, 1994; Grefenstette, 1995; Prager, 1999; Teahan, 2000). However, recent work has shown that high accu- racy is only achieved under ideal conditions (Baldwin and Lui, 2010), and one area that needs further work is accurate discrimination between closely-related languages (Ljubesiˇ c´ et al., 2007; Tiedemann and Ljubesiˇ c,´ 2012). The problem has been explored for specific groups of confusable languages, such as Malay/Indonesian (Ranaivo-Malancon, 2006), South-Eastern European languages (Tiedemann and Ljubesiˇ c,´ 2012), as well as varieties of English (Lui and Cook, 2013), Portuguese (Zampieri and Gebre, 2012), and Spanish (Zampieri et al., 2013).
    [Show full text]
  • FERSIWN GYMRAEG ISOD the National Corpus of Contemporary
    FERSIWN GYMRAEG ISOD The National Corpus of Contemporary Welsh Project Report, October 2020 Authors: Dawn Knight1, Steve Morris2, Tess Fitzpatrick2, Paul Rayson3, Irena Spasić and Enlli Môn Thomas4. 1. Introduction 1.1. Purpose of this report This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups. 1.2. Licence The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool (for links to all tools, refer to section 10 of this report). When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged (see 1.3). § To access the corpus visit: www.corcencc.org/explore § To access the GitHub site: https://github.com/CorCenCC o GitHub is a cloud-based service that enables developers to store, share and manage their code and datasets.
    [Show full text]
  • ESP Corpus Construction: a Plea for a Needs-Driven Approach Bâtir Des Corpus En Anglais De Spécialité : Plaidoyer Pour Une Approche Fondée Sur L’Analyse Des Besoins
    ASp la revue du GERAS 68 | 2015 Varia ESP corpus construction: a plea for a needs-driven approach Bâtir des corpus en anglais de spécialité : plaidoyer pour une approche fondée sur l’analyse des besoins Hilary Nesi Electronic version URL: http://journals.openedition.org/asp/4682 DOI: 10.4000/asp.4682 ISSN: 2108-6354 Publisher Groupe d'étude et de recherche en anglais de spécialité Printed version Date of publication: 23 October 2015 Number of pages: 7-24 ISSN: 1246-8185 Electronic reference Hilary Nesi, « ESP corpus construction: a plea for a needs-driven approach », ASp [Online], 68 | 2015, Online since 01 November 2016, connection on 02 November 2020. URL : http:// journals.openedition.org/asp/4682 ; DOI : https://doi.org/10.4000/asp.4682 This text was automatically generated on 2 November 2020. Tous droits réservés ESP corpus construction: a plea for a needs-driven approach 1 ESP corpus construction: a plea for a needs-driven approach Bâtir des corpus en anglais de spécialité : plaidoyer pour une approche fondée sur l’analyse des besoins Hilary Nesi 1. Introduction 1 A huge amount of corpus data is now available for English language teachers and learners, making it difficult for them to decide which kind of corpus is most suited to their needs. The corpora that they probably know best are the large, publicly accessible general corpora which inform general reference works and textbooks; the ones that aim to cover the English language as a whole and to represent many different categories of text. The 100 million word British National Corpus (BNC), created in the 1990s, is a famous example of this type and has been used extensively for language learning purposes, but at 100 million words it is too small to contain large numbers of texts within each of its subcomponents, and it is now too old to represent modern usage in the more fast-moving fields.
    [Show full text]
  • Setting up for Corpus Lexicography
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE Setting up for Corpus Lexicography Adam Kilgarriff,* Jan Pomikalek,* Miloš Jakubíček*‡, & Pete Whitelock† *Lexical Computing Ltd; ‡Masaryk University, Brno; †Oxford University Press Keywords: corpora, corpus lexicography, web crawling, dependency parsing. Abstract There are many benefits to using corpora. In order to reap those rewards, how should someone who is setting up a dictionary project proceed? We describe a practical experience of such „setting up‟ for a new Portuguese-English, English-Portuguese dictionary being written at Oxford University Press. We focus on the Portuguese side, as OUP did not have Portuguese resources prior to the project. We collected a very large (3.5 billion word) corpus from the web, including removing all unwanted material and duplicates. We then identified the best tools for Portuguese for lemmatizing and parsing, and undertook the very large task of parsing it. We then used the dependency parses, as output by the parser, to create word sketches (one page summaries of a word‟s grammatical and collocational behavior). We plan to customize an existing system for automatically identifying good candidate dictionary examples, to Portuguese, and add salient information about regional words to the word sketches. All of the data and associated support tools for lexicography are available to the lexicographer in the Sketch Engine corpus query system. 1. Introduction There are a number of ways in which corpus technology can support lexicography, as described in Rundell & Kilgarriff (2011). It can make it more accurate, more consistent, and faster. But how might those potential benefits pan out in an actual project? If starting from a blank sheet of paper, how should one proceed? In this paper we describe such an exercise.
    [Show full text]
  • The Manually Annotated Sub-Corpus
    Case study: The Manually Annotated Sub-Corpus Nancy Ide Abstract This case study describes the creation process for the Manually Anno- tated Sub-Corpus (MASC), a 500,000 word subset of the Open American National Corpus (OANC). The corpus includes primary data from a balanced selection of 19 written and spoken genres, all of which is annotated for almost 20 varieties of linguistic phenomena at all levels. All annotations are either hand-validated or manually-produced. MASC is unique in that it is fully open and free for any use, including commercial use. 1 Introduction This case study describes the creation process for the Manually Annotated Sub- Corpus (MASC), which is a subset of the Open American National Corpus (OANC). The OANC is itself a subset of the American National Corpus (ANC). Each of these corpora represents a distinct evolutionary stage in our approach to corpus- building and delivery that reflect adaptations to both changing community needs and advances in best practices for creating and representing linguistically annotated corpora. We therefore describe the procedures involved in producing the ANC and OANC before focusing on MASC, which is the jewel in the crown of corpora pro- duced by the ANC project. 2 Background: The ANC The ANC was motivated by developers of major linguistic resources such as FrameNet [2] and Nomlex [24], who had been extracting usage examples from Nancy Ide Vassar College, Poughkeepsie, New York 12604-0520 USA e-mail: [email protected] 1 2 Nancy Ide the 100 million-word British National Corpus (BNC), the largest corpus of English across several genres that was available at the time.
    [Show full text]