e-Content Submission to INFLIBNET

Subject Name: Linguistics
Paper Name: Corpus Linguistics
Paper Coordinator Name & Contact:
Module Name: #05: Corpus Type: Genre of Text
Module ID:
Content Writer (CW) Name:
Email id:
Phone:
Pre-requisites:
Objectives:
Keywords:

E-Text √, Self Learn √, Self Assessment √, Learn More √, Story Board √

Contents
5.0 Introduction
5.1 Why Classify Corpora?
5.2 Genre of Text
5.2.1 Text Corpus
5.2.2 Speech Corpus
5.2.3 Spoken Corpus
5.3 Conclusion
Assessment & Evaluation
Resources & Links

5.0 INTRODUCTION

For the last fifty years or so, corpus linguistics has been attested as one of the mainstays of linguistics for various reasons. At various points in time, scholars have discussed the methods of generating corpora, the techniques of processing them, and the use of information from corpora in linguistic work, from mainstream linguistics to applied linguistics and language technology. In general, however, these discussions ignore an important issue, namely the classification of corpora, although scholars sporadically attempt to discuss the form, formation, and function of corpora of various types. People avoid this issue because it is difficult to classify corpora within a single frame or type. Any scheme that attempts to put various corpora within a single frame is destined to turn out unscientific and unreliable.

Digital corpora are designed to be used in various kinds of linguistic work. Sometimes they are used for general linguistic research and application; at other times they are utilised for work in language technology and computational linguistics. The general assumption is that a corpus developed for one type of work is not of much use for work of other types. This assumption is false, in the sense that a corpus developed for a specific kind of work can be used just as fruitfully for many other kinds of work.
Therefore, it is better to assume that the function and utilisation of a corpus are multidimensional and multidirectional. For instance, a corpus developed for compiling a dictionary may also be used for writing grammar books, developing language teaching materials, and writing reference books. For such reasons, people are often hesitant to classify corpora under any scheme.

5.1 WHY CLASSIFY CORPORA?

Each corpus is developed following some methods of language representation, text collection, and text application. These make a corpus distinct from others in form, content, feature, and function. Taking these factors into consideration, we propose to classify corpora into various types based on factors relating to their form, content, and utilisation. Systematic classification of corpora provides language users with the following advantages, which cannot be achieved in any other way.
(a) Language users can easily identify appropriate data from texts of suitable areas and domains of language use.
(b) Linguists can select the particular corpus they think useful for their work. For this they do not need to grope in the dark.
(c) Dictionary makers wanting to compile dictionaries need not be confused about the selection of a corpus. They can select general and special corpora if they have prior information about the types of corpus they need for their work.
(d) For application-specific requirements, people can try general as well as special corpora without toiling hard in the labyrinth of corpora.
(e) Terminologists can use special corpora to extract the lexical information necessary for collecting jargon as well as scientific and technical terms.
(f) Domain-specific investigators can retrieve the necessary linguistic data and information from special corpora. An investigator wanting to study the normal speech patterns of native speakers can access a general speech corpus rather than other corpora.
(g) If corpora are not classified, users have to refer to all corpus types before selecting the required one.
This consumes much time, energy, and labour because of the internal complexities involved.
(h) Classification of corpora enhances the speed and accuracy of comparative studies across corpus types. If speech and text corpora are kept separate, comparative studies between the two become robust and effective.
(i) Classification of corpora makes it easier to compare the data stored in each corpus. We can systematically observe the similarities and differences between the two types.
(j) If corpora are mixed up, comparative study becomes complicated and observations become defective.

Keeping such advantages in mind, we present here a tentative scheme for the classification of corpora. In this context, the following are the most important factors:
(a) The minimum conditions that a collection of language data needs to fulfil to be considered a corpus before it is put to classification, and
(b) The identity of corpora of ordinary language use, which should be kept separate from corpora recorded in artificial language use.

Both factors need to be kept in balance. If the criteria proposed below are considered adequate, we assume that considerable progress has been made, because there are large collections of language data called corpora that do not meet these conditions. There are also some corpora that record special and artificial language samples. Besides, the branch of corpus linguistics is developing rapidly. As a result, regular norms and assumptions are revised in quick succession. Therefore, the classification of corpora is made maximally flexible to meet such unstable conditions.

Digital corpora are of various types with regard to texts, languages, modes of data sampling, methods of generation, manners of processing, nature of utilisation, etc. For instance,
• A corpus may contain samples of written texts, while another may contain samples of spoken texts.
• A corpus may preserve text samples from present-day language, while others may store samples compiled from age-old texts and ancient documents.
• A corpus may be monolingual, collecting data from a single language, while others are bilingual, including texts from two languages, or multilingual, including samples from more than two languages.
• Texts included in a corpus may be collected from a particular source, from a whole range of sources belonging to a particular field, or across the fields and subjects of a language.
• Text samples may be obtained from newspapers, magazines, journals, periodicals, and other similar forms.
• Text samples may also be compiled from extracts of impromptu conversations, spontaneous dialogues, made-up monologues, or interactive discourses of varying lengths, etc.

This implies that numerous needs and factors control the content, type, and use of a corpus. It also signifies that the kinds of texts included, as well as the combination of various text types, may vary among corpora. Taking all these issues into consideration, we broadly classify corpora based on the following criteria:
(a) Corpus Type: Genre of text,
(b) Corpus Type: Nature of data,
(c) Corpus Type: Type of text,
(d) Corpus Type: Purpose of design, and
(e) Corpus Type: Nature of application.

Fig. 1: Classification of Corpus (Corpus Classification Criteria: Genre of Text, Nature of Data, Type of Text, Purpose of Design, Nature of Application)

In the following sections, the first type of corpus is discussed briefly with reference to the corpora developed so far in various languages of the world. The remaining four types are discussed in the next four modules with adequate examples and explanations.

5.2 GENRE OF TEXT

Following the criterion ‘Genre of Text’, language corpora may be classified into three broad types, namely, Text Corpus, Speech Corpus, and Spoken Corpus.
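The practical benefit of this three-way split, picking a suitable corpus without inspecting every collection, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the record type `CorpusRecord`, its field names, the list `CATALOGUE`, and the function `select_by_genre` are hypothetical and do not belong to any existing corpus tool. The three entries use the corpus names and content labels as they are read from Fig. 2 of this module.

```python
from dataclasses import dataclass
from enum import Enum


class Genre(Enum):
    """The three genre-based corpus types named in this module."""
    TEXT = "Text Corpus"
    SPEECH = "Speech Corpus"
    SPOKEN = "Spoken Corpus"


@dataclass(frozen=True)
class CorpusRecord:
    """Hypothetical catalogue entry; the field names are illustrative only."""
    name: str
    genre: Genre
    content: str


# Assumption: names and content labels follow the entries of Fig. 2.
CATALOGUE = [
    CorpusRecord("Australian Speech Corpus", Genre.SPEECH, "speech samples"),
    CorpusRecord("London-Lund Spoken Corpus", Genre.SPOKEN, "transcribed speech"),
    CorpusRecord("TDIL Indian language Corpus", Genre.TEXT, "written texts"),
]


def select_by_genre(catalogue, genre):
    """Return only the catalogue entries that belong to the requested genre."""
    return [record for record in catalogue if record.genre is genre]


if __name__ == "__main__":
    # A user interested in written texts consults only the text corpora.
    for record in select_by_genre(CATALOGUE, Genre.TEXT):
        print(record.name, "-", record.content)
```

A real catalogue would carry the other four criteria as well (nature of data, type of text, purpose of design, nature of application); they are left out here to keep the sketch short.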
Fig. 2: Genre of Text (Speech Corpus: speech samples, Australian Speech Corpus; Spoken Corpus: transcribed speech, London-Lund Spoken Corpus; Text Corpus: written texts, TDIL Indian language Corpus)

5.2.1 Text Corpus

A Text Corpus, by virtue of its genre, contains only language data collected from various written, printed, published, and electronic sources. In the case of printed materials, it collects texts from published books, papers, journals, magazines, periodicals, notices, circulars, documents, reports, manifestos, advertisements, bulletins, placards, festoons, etc. In the case of non-published materials, it collects texts from personal letters, personal diaries, written family records, old manuscripts, ancient legal deeds and wills, etc. Thus, samples of various texts obtained from both published and non-published sources constitute the central body of a text corpus.

Some examples of text corpora are the British National Corpus, the Bank of English, the American National Corpus, the Australian Corpus of English, the Wellington Corpus of Written New Zealand English, the Brown Corpus, the LOB Corpus, the Kolhapur Corpus of Indian English, the FLOB Corpus, the Bank of Swedish, the TDIL Corpus of Indian Languages, and others. These corpora are built from written texts. In the early years of corpus generation there was little scope for including text samples from digital sources in a corpus, since such samples were not easily available. However, the situation has changed greatly within the last few years. Now we can find a huge amount of written text from various digital sources to include in a text corpus. There are many web sites, home pages, web pages, and other internet sources from where we can collect data for generating a corpus of written texts. Moreover, there are electronic journals and newsletters of various types from where texts