Cambridge Handbook of English Corpus Linguistics Chapter 2: Computational Tools and Methods for Corpus Compilation and Analysis1
Total Page:16
File Type:pdf, Size:1020Kb

Load more
Recommended publications
-
Unit 1: Introduction
Corpus building and investigation for the Humanities: An on-line information pack about corpus investigation techniques for the Humanities Unit 1: Introduction David Evans, University of Nottingham 1.1 What a corpus is A corpus is defined here as a principled collection of naturally occurring texts which are stored on a computer to permit investigation using special software. A corpus is principled because texts are selected for inclusion according to pre-defined research purposes. Usually texts are included on external rather than internal criteria. For example, a researcher who wants to investigate metaphors used in university lectures will attempt to collect a representative sample of lectures across a number of disciplines, rather than attempting to collect lectures that include a lot of figurative language. Most commercially available corpora are made up of samples of a particular language variety which aim to be representative of that variety. Here are some examples of some of the different types of corpora and how they represent a particular variety: General corpora An example of a general corpus is the British National Corpus which “… aims to represent the universe of contemporary British English [and] to capture the full range of varieties of language use.” (Aston & Burnard 1998: 5). As a result of this aim the corpus is very large (containing some 100 million words) and contains a balance of texts from a wide variety of different domains of spoken and written language. Large general corpora are sometimes referred to as reference corpora because they are often used as a baseline against which judgements about the language varieties held in more specialised corpora can be made. -
Significant Concordance and Co-Occurrence in Quantitative
Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis Janneke Rauscher1 Leonard Swiezinski2 Martin Riedl2 Chris Biemann2 (1) Johann Wolfgang Goethe University Gruneburgplatz¨ 1, 60323 Frankfurt am Main, Germany (2) FG Language Technology, Dept. of Computer Science Technische Universitat¨ Darmstadt, 64289 Darmstadt, Germany [email protected], [email protected], {riedl,biem}@cs.tu-darmstadt.de Abstract Although the number of research projects in Dig- ital Humanities is increasing at fast pace, we still We present CoocViewer, a graphical analy- observe a gap between the traditional humanities sis tool for the purpose of quantitative lit- scholars on the one side, and computer scientists on erary analysis, and demonstrate its use on the other. While computer science excels in crunch- a corpus of crime novels. The tool dis- ing numbers and providing automated processing plays words, their significant co-occurrences, and contains a new visualization for signif- for large amounts of data, it is hard for the com- icant concordances. Contexts of words and puter scientist to imagine what research questions co-occurrences can be displayed. After re- form the discourse in the humanities. In contrast to viewing previous research and current chal- this, humanities scholars have a hard time imagining lenges in the newly emerging field of quan- the possibilities and limitations of computer technol- titative literary research, we demonstrate how ogy, how automatically generated results ought to CoocViewer allows comparative research on be interpreted, and how to operationalize automatic literary corpora in a project-specific study, and how we can confirm or enhance our hypothe- processing in a way that its unavoidable imperfec- ses through quantitative literary analysis. -
PDF of Volume 33
BULLETIN OF THE INTERNATIONAL ORGANIZATION FOR SEPTUAGINT AND COGNATE STUDIES Volume 33 Fall, 2000 Minutes of the loses Meeting, Boston Treasurer's Report 6 News and Notes 8 Record of Work Published or in Progress 11 Varia 24 Review of Hatch and Redpath's Concordance to the Septuagint, 2nd edition 31 Johann Cook Web Review: The Christian Classics Ethereal Library, at a!. 36 Frederick Knobloch A Note on the Syntactic Analysis of Greek Translations and Compositions 39 T. P. Hutchinson and D. Cairns How to Analyse and Translate the Idiomatic Phrase 1l"l~ '0 47 Takamitsu Muraoka Evaluating Lexical Consistency in the Old Greek Bible 53 Martha L. Wade Translation Technique in LXX Genesis and Its Implications for the NETS Version 76 Robert J. V. Hiebert PROGRAM FOR THE loses MEETING BULLETIN of the loses IN BOSTON, NOVEMBER 20-23, 1999 Published Annually Each Fall by THE INTERNATIONAL ORGANIZATION FOR Sunday, November 21 SEPTUAGINT AND COGNATE STUDIES OFFICERS AND EXECUTIVE COMMITTEE 1:00-3:30 pm H-Room 103 President Honorary Presidents Albert Pietersma, University of Toronto, Presiding Johan Lust John Wm. Wevers (Toronto) Katholieke Univ. leuven Albert Pietersma (Toronto) St. Michielsstraat' 2 Eugene Ulrich (Notre Dame) Alison Salvesen, Oxford University 8-3000 Leuven Belgium Jacob of Edessa 's Version of the Books of Samuel and the [email protected] Textual Criticism ofthe Septuagint Immediate Past President Wee President Leonard J. Greenspoon Benjamin G. Wright Jan Willem Van Henten, University of Amsterdam Creighton University lehigh University Omaha, NE 68178 U.SA Bethlehem, PA 18015 U.SA The Honorary Decreefor Simon the Maccabee (1 Macc 14:25- [email protected] [email protected] 49) in its Hellenistic Context Editor NETS Co-chairs Theodore A Bergren Albert Pietersma (Toronto) Harry F. -
A Sense Discrimination Engine for English, Chinese and Other Languages
Sketch Engine: a sense discrimination engine for English, Chinese and other languages Adam Kilgarriff1, Pavel Rychlý2, Simon Smith3, Chu-Ren Huang4, Yiching Wu5, Cecilia Lin3 [email protected] 1. Background: corpora and concordances Analysis of text and spoken language, for the purposes of second language teaching, dictionary making and other linguistic applications, used to be based on the intuitions of linguists and lexicographers. The compilation of dictionaries and thesauri, for example, required that the compiler read very widely, and record the results of his efforts – the definitions and different senses of words – on thousands, or millions of index cards. Today’s approach to linguistic analysis generally involves the use of linguistic corpora: large databases of spoken or written language samples, defined by Crystal (1991) as “A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language”. Numerous large corpora have been assembled for English, including the British National Corpus (BNC) and the Bank of English. Dictionaries published by the Longman Group are based on the 100 million word BNC, and corpora are routinely used by computational linguists in tasks such as machine translation and speech recognition. The BNC is an example of a balanced corpus, in that it attempts to represent a broad cross-section of genres and styles, including fiction and non-fiction, books, periodicals and newspapers, and even essays by students. Transcriptions of spoken data are also included; and this is the corpus that is used with the English Sketch Engine. -
The BNC Handbook Exploring the British National Corpus with SARA
The BNC Handbook Exploring the British National Corpus with SARA Guy Aston and Lou Burnard September 1997 ii Preface The British National Corpus is a collection of over 4000 samples of modern British English, both spoken and written, stored in electronic form and selected so as to reflect the widest possible variety of users and uses of the language. Totalling over 100 million words, the corpus is currently being used by lex- icographers to create dictionaries, by computer scientists to make machines ‘understand’ and produce natural language, by linguists to describe the English language, and by language teachers and students to teach and learn it — to name but a few of its applications. Institutions all over Europe have purchased the BNC and installed it on computers for use in their research. However it is not necessary to possess a copy of the corpus in order to make use of it: it can also be consulted via the Internet, using either the World Wide Web or the SARA software system, which was developed specially for this purpose. The BNC Handbook provides a comprehensive guide to the SARA software distributed with the corpus and used on the network service. It illustrates some of the ways in which it is possible to find out about contemporary English usage from the BNC, and aims to encourage use of the corpus by a wide and varied public. We have tried as far as possible to avoid jargon and unnecessary technicalities; the book assumes nothing more than an interest in language and linguistic problems on the part of its readers. -
Latent Semantic Analysis, Corpus Stylistics and Machine Learning
Latent Semantic Analysis, Corpus stylistics and Machine Learning Stylometry for Translational and Authorial Style Analysis: The Case of Denys Johnson-Davies’ Translations into English A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy by Mohammed Al Batineh May, 2015 © Copyright by Mohammed S. Al-Batineh All Rights Reserved Dissertation written by Mohammed Al Batineh BA., Yarmouk University, Jordan, 2008 MA., Yarmouk University, Jordan, 2010 APPROVED BY __________________________, Chair, Doctoral Dissertation Committee Dr. Françoise Massardier-Kenney (advisor) __________________________, Members, Doctoral Dissertation Committee Dr. Carol Maier __________________________, Dr. Gregory M. Shreve __________________________, Dr. Jonathan I. Maletic __________________________, Dr. Katherine Rawson ACCEPTED BY __________________________, Interim Chair, Modern and Classical Language Studies Dr. Keiran J Dunne __________________________, Dean, College of Arts and Sciences Dr. James L. Blank TABLE OF CONTENTS LIST OF FIGURES ........................................................................................... viii LIST OF TABLES ............................................................................................... ix DEDICATION ...................................................................................................... x ABSTRACT ........................................................................................................ xii CHAPTER