Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture

© Copyright Michael Stubbs 1996. This is chapter 1 of Text and Corpus Analysis which was published by Blackwell in 1996. TEXT AND CORPUS ANALYSIS: COMPUTER-ASSISTED STUDIES OF LANGUAGE AND CULTURE MICHAEL STUBBS CHAPTER 1. TEXTS AND TEXT TYPES Attested language text duly recorded is in the focus of attention for the linguist. (Firth 1957b: 29.) The main question discussed in this book is as follows: How can an analysis of the patterns of words and grammar in a text contribute to an understanding of the meaning of the text? I discuss traditions of text analysis in (mainly) British linguistics, and computer-assisted methods of text and corpus analysis. And I provide analyses of several shorter and longer texts, and also of patterns of language across millions of words of text corpora. The main emphasis is on the analysis of attested, naturally occurring textual data. An exclusive concentration on the text alone is, however, not an adequate basis for text interpretation, and this introductory chapter discusses the appropriate balance for text analysis, between the text itself and its production and reception. Such analyses of attested texts are important for both practical and theoretical reasons, and the main reasons can be easily and briefly stated. First, there are many areas of the everyday world in which the systematic and critical interpretation of texts is an important social skill. One striking example occurs in courtrooms, where very important decisions, by juries as well as lawyers, can depend on such interpretations. Second, much of theoretical linguistics over the past fifty years has been based on the study of isolated sentences. This approach has some severe limitations and it is important to show what can be learned by studying patterns of language across texts and corpora. 1.1. THE DATA By text, I mean an instance of language in use, either spoken or written: a piece of language behaviour which has occurred naturally, without the intervention of the 2 linguist. This excludes examples of language which have been invented by a linguist merely to illustrate a point in a linguistic theory. Examples of real instances of language in use, might include: a conversation, a lecture, a sermon, an advert, a recipe, a newspaper article, a scientific research paper, a novel, a school textbook, and so on. The list is open-ended, and probably endless. In chapter 2, I discuss in more detail the advantages of basing linguistic description and theory on attested instances of language use. One brief point of terminology. There is considerable variation in how terms such as text and discourse are used in linguistics. Sometimes this terminological variation signals important conceptual distinctions, but often it does not, and terminological debates are usually of little interest. These distinctions in terminology and concept will only occasionally be relevant for my argument, and when they are, I will draw attention to them (e.g. in chapter 7.2). It is an important feature of the book that all the data which are analysed in any detail are attested in this sense. Where this is not absolutely clear from the context, I will use the following conventions (see also section on Data Conventions and Terminology) to mark the source of examples: [A] actual, authentic, attested data [M] modified data [I] invented, intuitive, introspective data. Almost all examples in the book are [A]. Occasionally, it is convenient to modify (for example, to abbreviate) or to invent an example, and I will always warn the reader when I have done this. Possibly the most central problem in contemporary linguistics is: What counts as data? This question has major implications for descriptive and theoretical linguistics. What precisely is the evidence on which theories of syntax and semantics are based? But questions of what counts as evidence, corroboration or proof are also important in applied linguistics, and I will discuss in detail texts from the mass media, education and courtrooms. In such cases, linguists have a responsibility to say how confident they are that their analyses are correct 'beyond reasonable doubt'. 1.2. THE ORGANIZATION OF THE BOOK Chapters 1 to 3 discuss the concepts of text, text type and genre, with particular reference to the British neo-Firthian tradition of linguistics, as it has been developed by Halliday and Sinclair. This tradition has led to major grammars and dictionaries of English, and to significant advances in methods of computer-assisted text and corpus analysis. Chapters 4 to 8 provide analyses of texts and text corpora. One descriptive focus of these chapters is texts and text types which are public and/or authoritative: two widely circulated speeches (chapter 4), the summing-up by a judge in a criminal trial (chapter 5), two school textbooks (chapter 6), statements by public figures (chapter 7), and various corpus data (chapters 7 and 8). These chapters progress from little things to big 3 things: from short texts (of a few hundred words) to long texts (whole books of tens of thousands of words) from a comparison of one text with another, to a comparison of patterns in texts with patterns in corpora and from single texts to text corpora: that is, collections of texts of millions of words in length. This progression is important for two reasons. First, it should facilitate the use of the book in teaching. These chapters introduce various methods of computer-assisted text and corpus analysis: in particular, the use of concordances for studying patterns of language use. Later chapters also introduce other methods for studying the most frequent and characteristic syntactic constructions and lexical collocations in which words occur. Second, it facilitates a discussion of criteria for text analysis. These include: the need to analyse not only short text fragments, but also whole long texts; and the need for the stylistic analysis of individual texts to be based on comparisons with other texts and with corpus data which represent (however imperfectly) the language. The main descriptive and theoretical arguments in the book concern findings about language which could not conceivably result from intuitive data in the form of invented sentences, but which require large quantities of corpus data. (See especially chapters 2, 7, 8.) 1.3. A SIMPLE MODEL: TEXTS, PRODUCTION AND RECEPTION This opening chapter makes some introductory points about the relations between texts, text types and social institutions, and gives some examples of different studies. A problem which crops up in various forms is: Where is the meaning of a text located? Is the meaning inside the text itself? Or inside the mind of the person who makes sense of it? Or is it in the speech community somewhere? – perhaps in the form of a consensus interpretation on which we could all agree? Suppose I receive a legal document, which I read, but do not fully understand. I might ring up a friend who is a lawyer and read a problematic sentence over the phone. My friend might then explain to me what this sentence means. I will have read it, but the lawyer will have interpreted it. So was the meaning in the text? Possibly, although it seems to require the knowledge of the lawyer to get the meaning out. And note the ambiguity of the word read. It sometimes refers merely to the words on the page: I have read a sentence aloud over the telephone. And it sometimes means "read and understand", as when I say that I have read a good book on holiday. Here is a very simple model of the relation between text and context. In common sense terms, it seems clear that the meaning of a text depends on at least three things: the language of the text itself, who produced it, and who is responding to it. By the language itself I mean the words actually spoken or written, and their patterns of lexical collocations and of syntactic and rhetorical structures. Some meanings are created by the words themselves and their observable interrelationships, and several chapters below are concerned with analysing such patterns. But some meanings depend on our knowledge of the point of view of the author. We may interpret things quite differently, depending on when and where and by whom the language was produced, for example, depending on the status or authority of the speaker or writer: a person in the street, an 4 expert, a government minister, a government committee, an author of a school textbook, a judge summing up in court, and so on. And some meanings are brought to the text by readers or listeners: according to their specialist knowledge, their cultural assumptions, or their familiarity with other related texts. Readers and listeners also have different points of view, and respond to texts in different ways. The text-production-reception model emphasizes immediately that meanings and interpretations are mediated by social institutions. It is difficult to discuss examples of language, say from a textbook or a judicial summing-up, without referring to the status of speakers and listeners in institutions such as education and the law. This simple view of different factors in the meaning of texts is reflected in the history of literary criticism over the last hundred years or so. (Eagleton, 1983, provides a good discussion of this history.) There was a period when the main focus was on the author: his or her biography and social background were felt to be crucial to an understanding of a text. This tradition is still represented in the essays of literary reference books. An essay on Wordsworth, for example, might combine comments on the themes of his poetry, with a brief account of his life and references to contemporary historical events, such as the French Revolution.

Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture

Construction of English-Bodo Parallel Text

Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna

Enriching an Explanatory Dictionary with Framenet and Propbank Corpus Examples

20 Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation

Linguistic Annotation of the Digital Papyrological Corpus: Sematia

Latvian Framenet: Cross-Lingual Issues

TEI and the Documentation of Mixtepec-Mixtec Jack Bowers

Natural Language Processing, Sentiment Analysis and Clinical Analytics

Improving Statistical Machine Translation Efficiency by Triangulation

Proceedings of the 55Th Annual Meeting of the Association For

Helping Language Learners Get Started with Concordancing

Words Stemming Based on Structural and Semantic Similarity