eContent Submission to INFLIBNET
Subject Name: Linguistics
Module Name: 05: Corpus Type: Genre of Text

Contents

5.0 Introduction
5.1 Why Classify Corpora?
5.2 Genre of Text
5.2.1 Text Corpus
5.2.2 Speech Corpus
5.2.3 Spoken Corpus
5.3 Conclusion

Assessment & Evaluation Resources & Lin

5.0 INTRODUCTION

For the last fifty years or so, corpus linguistics has been recognised as one of the mainstays of linguistics. At various points of time, scholars have discussed the methods of generating corpora, the techniques of processing them, and the use of information from corpora in linguistic work ranging from mainstream linguistics to applied linguistics and language technology. These discussions, however, often ignore an important aspect, namely the classification of corpora, although scholars sporadically discuss the form, formation, and function of corpora of various types. People avoid this issue because it is difficult to classify corpora within a single frame or type. Any scheme that attempts to put all corpora within a single frame is bound to turn out to be unscientific and unreliable.

Digital corpora are designed to be used in various kinds of linguistic work. Sometimes they are used for general linguistic research and application; at other times they are utilised for work in language technology and computational linguistics. The general assumption is that a corpus developed for one type of work is not of much use for work of other types. This assumption is false: a corpus developed for a specific kind of work can be used just as fruitfully for many others. It is therefore better to assume that the function and utilisation of a corpus are multidimensional and multidirectional. For instance, a corpus developed for compiling a dictionary may also be used for writing grammar books, developing language teaching materials, and writing reference books. For such reasons people are often hesitant to classify corpora under any scheme.

5.1 WHY CLASSIFY CORPORA?

Each corpus is developed following some methods of language representation, text collection, and text application. These make a corpus distinct from others in form, content, feature, and function. Taking these factors into consideration, we propose to classify corpora into various types based on their form, content, and utilisation. Systematic classification of corpora provides language users with the following advantages, which are difficult to achieve in any other way.

(a) Language users can easily identify appropriate data from texts of suitable areas and domains of language use.
(b) Linguists can select the particular corpus they think useful for their work. For this they do not need to grope in the dark.
(c) Dictionary makers need not be confused about the selection of a corpus. They can select general and special corpora if they have prior information about the types of corpus they need for their work.
(d) For application-specific requirements, people can try general as well as special corpora without toiling hard in the labyrinth of corpora.
(e) Terminologists can use special corpora to extract the lexical information necessary for the collection of jargon as well as scientific and technical terms.
(f) Domain-specific investigators can retrieve necessary linguistic data and information from special corpora. An investigator wanting to study the normal speech patterns of native speakers can access a general speech corpus rather than other corpora.
(g) If corpora are not classified, users have to consult all corpus types before selecting the required one. This consumes much time, energy, and labour.
(h) Classification of corpora enhances the speed and accuracy of comparative studies across corpus types. If speech and text corpora are kept separate, comparative studies between the two become more robust and effective.
(i) Classification of corpora makes it easier to compare the data stored in each corpus. We can systematically observe similarities and differences between corpus types.
(j) If corpora are mixed up, comparative study becomes complicated and observations become defective.

Keeping such advantages in mind, we present here a tentative scheme of classification of corpora. In this context, the following are the most important factors:

(a) The minimum conditions that a collection of language data must fulfil to be considered a corpus before it is put to classification, and
(b) The need to keep corpora of ordinary language use separate from corpora that record artificial language use.

A balance needs to be maintained between the two factors. If the criteria proposed below are considered adequate, we assume that considerable progress has been made, because there are large collections of language data called corpora that do not meet these conditions. There are also some corpora that record special and artificial language samples. Besides, corpus linguistics is developing rapidly, and as a result regular norms and assumptions are revised in quick succession. The classification of corpora is therefore kept maximally flexible to cope with such unstable conditions.

Digital corpora are of various types with regard to texts, languages, modes of data sampling, methods of generation, manners of processing, nature of utilisation, etc. For instance,

• A corpus may contain samples of written texts, while another may contain samples of spoken texts.
• A corpus may preserve text samples from present-day language, while others may store samples compiled from age-old texts and ancient documents.
• A corpus may be monolingual, collecting data from a single language; bilingual, including texts from two languages; or multilingual, including samples from more than two languages.
• Texts included in a corpus may be collected from a particular source, from a whole range of sources belonging to a particular field, or across the fields and subjects of a language.
• Text samples may be obtained from newspapers, magazines, journals, periodicals, and other similar forms.
• Text samples may also be compiled from extracts of impromptu conversations, spontaneous dialogues, made-up monologues, or interactive discourses of varying lengths, etc.

This implies that numerous needs and factors control the content, type, and use of a corpus. It also signifies that the kinds of text included, as well as the combination of text types, may vary among corpora. Taking all these issues into consideration, we broadly classify corpora based on the following criteria:

(a) Corpus Type: Genre of text,
(b) Corpus Type: Nature of data,
(c) Corpus Type: Type of text,
(d) Corpus Type: Purpose of design, and
(e) Corpus Type: Nature of application.

Fig. 1: Classification of Corpus (classification criteria: Genre of Text, Nature of Data, Type of Text, Purpose of Design, Nature of Application)

In the following sections, the first type of corpus is discussed briefly with reference to the corpora developed so far in various languages of the world. The remaining four types are discussed in the next four modules with adequate examples and explanations.
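The practical benefit of classification claimed above, namely that users can pick a suitable corpus without consulting every corpus, can be illustrated with a small catalogue. The following is only a minimal sketch: the class `CorpusRecord`, the function `select`, and the field names are hypothetical names introduced here, and only the genre of each corpus (as given in Fig. 2) is filled in; the remaining four criteria are left empty because their values are discussed in later modules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorpusRecord:
    """One catalogue entry; each optional field mirrors one classification criterion."""
    name: str
    genre: str                                   # (a) "text", "speech", or "spoken"
    nature_of_data: Optional[str] = None         # (b) left unset in this sketch
    type_of_text: Optional[str] = None           # (c) left unset in this sketch
    purpose_of_design: Optional[str] = None      # (d) left unset in this sketch
    nature_of_application: Optional[str] = None  # (e) left unset in this sketch


def select(catalogue, **criteria):
    """Return the records whose fields equal every criterion that is passed in."""
    return [
        record for record in catalogue
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# Genre values follow Fig. 2 of this module; nothing else is asserted about the corpora.
CATALOGUE = [
    CorpusRecord("TDIL Corpus of Indian Languages", genre="text"),
    CorpusRecord("Australian Speech Corpus", genre="speech"),
    CorpusRecord("London-Lund Corpus", genre="spoken"),
]

for record in select(CATALOGUE, genre="spoken"):
    print(record.name)
```

A user who needs transcribed speech can thus query one field instead of inspecting every corpus, which is the saving of time and labour described in point (g).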

5.2 GENRE OF TEXT

Following the criterion ‘Genre of Text’, language corpora may be classified into three broad types, namely Text Corpus, Speech Corpus, and Spoken Corpus.

Fig. 2: Genre of Text (Speech Corpus: speech samples, e.g. Australian Speech Corpus; Spoken Corpus: transcribed speech, e.g. London-Lund Spoken Corpus; Text Corpus: written texts, e.g. TDIL Indian Language Corpus)

5.2.1 Text Corpus

A text corpus, by virtue of its genre, contains only language data collected from written, printed, published, and electronic sources. In the case of printed materials, it collects texts from published books, papers, journals, magazines, periodicals, notices, circulars, documents, reports, manifestos, advertisements, bulletins, placards, festoons, etc. In the case of unpublished materials, it collects texts from personal letters, personal diaries, written family records, old manuscripts, ancient legal deeds and wills, etc. Thus, samples of texts obtained from both published and unpublished sources constitute the central body of a text corpus. Some examples of text corpora are the Australian Corpus of English, the Wellington Corpus of Written New Zealand English, the LOB Corpus, the Kolhapur Corpus of Indian English, the FLOB Corpus, the Bank of Swedish, the TDIL Corpus of Indian Languages, and others. All these corpora are built from written texts.

In the early years of corpus generation there was little scope for including text samples from digital sources in a corpus, since such samples were not easily available. However, the situation has changed greatly within the last few years. We can now find huge amounts of written text from various digital sources to include in a text corpus. There are many websites, home pages, and other web pages from which we can collect data for generating a corpus of written texts. Moreover, there are electronic journals and newsletters of various types from which text samples can be collected. The following diagram (Fig. 3) presents a sample of a written text from the Kolhapur Corpus of Indian English (KCIE).

**[txt. a01**]
0010A01 **<<*3Politics of Job Reservations*0**>> $**[begin leader comment, begin
0020A01 underscoring**] *3^The Bihar Government did not foresee or forestall
0030A01 the complications that_ followed its decision to_ reserve jobs for
0031A01 backward
0040A01 classes. ^The present violence in the State has raised the controversy
0050A01 over the criterion for backwardness whether it should be caste or
0060A01 economic conditions.*0**[end underscoring, end leader comment**]
0070A01 $^WHY has the Bihar Government*'s decision to_ reserve jobs for backward
0080A01 classes led to a violent outburst? ^It is not such an original idea
0090A01 that it should have triggered demonstrations and riots or attracted allIndia

Fig. 3: Example of a text corpus (KCIE) (Source: ICAME http://www.hit.uib.no/icame/koleks.html)
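A sample of this kind can be separated into its reference codes and running text before any further analysis. The sketch below assumes, purely from the look of the sample in Fig. 3, that every line of text is introduced by a code made of four digits, one capital letter, and two digits (such as 0040A01); the function name `split_by_reference` is hypothetical, and the other markup symbols in the sample are left untouched because their meaning is not described in this module.

```python
import re

# Assumed shape of a line reference, inferred only from the sample in Fig. 3:
# four digits, one capital letter, two digits (for example 0040A01).
LINE_REF = re.compile(r"\b(\d{4}[A-Z]\d{2})\b")


def split_by_reference(raw):
    """Return (reference, text) pairs from a sample whose lines have run together."""
    parts = LINE_REF.split(raw)
    # With one capturing group, re.split yields
    # [prefix, ref1, text1, ref2, text2, ...]; the prefix is ignored here.
    return [
        (parts[i], parts[i + 1].strip())
        for i in range(1, len(parts) - 1, 2)
    ]


sample = (
    "0040A01 classes. ^The present violence in the State has raised the controversy "
    "0050A01 over the criterion for backwardness whether it should be caste or"
)

for reference, text in split_by_reference(sample):
    print(reference, "|", text)
```

Keeping the reference with each stretch of text makes it possible to cite an exact location in the corpus when an example is quoted.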

There is a debate regarding the inclusion in a text corpus of texts written to be delivered in speech (i.e., oration). A similar debate arises over the status of texts used in scripts and plays. Should we include samples from these sources in a text corpus? It is a difficult question, since it is almost impossible to decide definitively to which group these texts actually belong.

If we take into consideration the basic linguistic modality used in generating these texts, we find that they have a right to be included in a text corpus. Read-out writings, lectures delivered in seminars, notes dictated in class or office, etc., although meant for listening, are actually composed following the general norms of writing. Moreover, such texts, although delivered in spoken form, do not have the features of normal dialogue or conversation. A public speech like “Dear ladies and gentlemen! It is a great delight to inform you that the government has decided to implement mass literacy programme for the benefit of the nation” does not contain the features typical of impromptu speech. It is quite rational to include it in a text corpus, since it is generated first in written form. A written text may be read out, and its expression may change with the change of medium, but it remains primarily a written text.

On the contrary, if we take the purpose of composition into consideration, we may argue that these texts should belong to a speech corpus only. The general argument is that these texts are composed not for reading but for speaking. The scripts composed for films and plays are made in such a way that they are suitable for the characters to communicate verbally. Similarly, lectures composed for public oration are made in such a way that they are suitable for open verbal delivery before an audience. Therefore, these texts should not be put within a text corpus. A similar argument holds for notes dictated and delivered in class. However, before we take any decision regarding the actual status of such texts, we need to analyse them from various angles, with serious consideration of the linguistic and nonlinguistic factors interlinked with these events.

A few years ago, written texts in English, German, Spanish, French, and other advanced languages were easily available in huge amounts on the internet, but texts written in Indian languages were rarely found on the net. In fact, due to certain orthographic problems relating to Indian scripts, written texts in Indian languages were very difficult to procure from the cyber world. Because of this technological snag, Indian corpus linguists have not been in a position to generate a text corpus by quickly collecting data from web sources. However, the situation is changing rapidly. At present, some resources are indeed available on the net for Indian languages, thanks to the work done in putting Indian texts in 'cyberia'.

In this context, some people are interested in including text samples taken from personal emails in a text corpus. We are cautious in this regard. We argue that texts composed in personal emails should not be included in a general text corpus, since samples derived from these sources possess certain features that are hardly observed in regular imaginative and informative writing. Email texts are skewed and deviate greatly from the actual form and texture of general written texts. We should therefore identify a special category, namely ‘Email Corpus’, where such texts are preserved for special types of investigation and analysis.

5.2.2 Speech Corpus

A speech corpus, in contrast to a text corpus, contains text samples obtained from verbal interactions. Technically, a speech corpus refers to texts that are available in audio form (Sasaki 2003: 91). That is, the speakers involved in a speech corpus should behave in an oral mode. An important type of speech corpus is the ‘experimental corpus’, which is assembled for studying fine details of spoken language. Such a corpus is small in size and is produced by asking informants to read out passages in an anechoic chamber. A speech corpus is a collection of spoken data typically recorded in a specific setting, for a specific purpose, by specific users. For instance, the speech corpus SpeechDat-Car is designed for developing an interactive system for direct consumer application. Usually such a speech corpus lacks the richness of linguistic features normally found in regular spoken texts.

At the time of developing a speech corpus, it is kept in mind that samples should be natural, informal, conversational, and impromptu. By default, a speech corpus is entitled to contain samples of private and personal talks, formal and informal discussions, debates, instant talks, impromptu analysis, casual speech, face-to-face conversations, telephonic conversations, dialogues, monologues, online dictations, instant addressing, etc. There is no scope for external involvement, since the aim of a speech corpus is to display the basic characteristics of a speech act in the most faithful manner (Chafe 1982).

A speech corpus, for example, may contain text samples from various types of speech events occurring in regular life and living, such as common talks, telephonic exchanges, casual speeches, proceedings of courts, interrogations at police stations, quarrels on roads, bargaining at markets, talks in social functions, festivals and celebrations, exchanges in classrooms, gossip among friends at malls, love talks between lovers, curtain lectures of couples, etc. Texts collected from such sources properly attest the actual form and nature of normal speech. Some examples of authentic speech corpora are the London-Lund Corpus of Spoken English, American Speech Corpus, Edinburgh University Speech Corpus of English, Korean Speech Corpus, Cantonese Speech Database, Dutch and Flemish Speech Database, Machine-Readable Corpus of Spoken English, Dialogue Diversity Corpus, West Point Arabic Speech Corpus, Smart-Kom Multimodal Corpus, Speech Corpus of London Teenagers, etc.

The two most important questions relating to speech corpus generation are the following:

(a) How should a speech corpus be designed and developed, and
(b) The language of which community or group will it represent?

These are tricky questions with no straightforward answers, particularly in the Indian context. To solve the problem of representing the speech of a community or group, we propose to place emphasis on generating a speech corpus for each language variety, including the standard and the regional ones. Practical constraints like lack of financial support, technical know-how, trained manpower, linguistic motivation, social inspiration, political encouragement, etc. may stand as barriers to such projects in the Indian context. Therefore, considering the facilities and conditions available, we argue for developing a speech corpus first for the standard variety of each national language included in the Eighth Schedule of the Indian Constitution. Priority may be shifted to other language varieties after corpora have been generated for each Indian national language.

The next question relates to several issues: from which sections, sectors, and domains should speech data be collected? Experts have furnished various arguments about this issue. According to some scholars (Sinclair 1991: 132), speech samples should be taken from those sources and domains that are considered standard and are accepted by most people of the speech community. For instance, texts from news broadcasting and telecasting, and language used in official and formal situations, in court proceedings, in college and university lectures, in classroom teaching, etc. may be included in a speech corpus (Uhmann 2001: 377).

The reason behind selecting texts from these sources is that such samples reveal the actual standard form of the spoken version of a language. Moreover, analysis of these standard speech data will yield almost all the salient features of the spoken form. If required, the data may also be used in the classroom for teaching language learners discourse patterns in spoken interaction and the pronunciation of sounds, words, and sentences. The corpus may equally be useful for teaching the language to foreign learners.

However, these arguments are strongly contested by Leech (1993) and others. According to them, if a speech corpus is designed with data of the standard form only, there will be no scope for variety in the corpus, and it will fail to represent the numerous varieties normally found in regular speech. It is therefore not logical to generalise the speech habits of an entire speech community from a small set of text samples collected from the standard spoken form. If we do this, we will not only fail to account for the peculiarities observed in people's different speech patterns, but will also deprive a large number of common speakers of representation in the corpus.

Therefore, speech text samples should be taken from all possible domains of spoken interaction to represent people from all walks of life, irrespective of profession, education, class, ethnicity, age, and sex. The corpus will contain equal amounts of data spoken by children, by students of schools and colleges, by workers in offices, courts, and businesses, and by people of various other professions. Similarly, text samples will come from interrogations conducted at police stations, debates held in parliaments, quarrels taking place in markets and on roads, etc. In sum, the language spoken by all varieties of people should have proportional representation in it. Only then will it reflect the internal form and nature of speech through maximum representation (Eggins 1994: 109).

In our view, both formal and informal speech data should be included in a speech corpus to make it maximally representative. Formal speech will include texts from radio and television newscasts, public announcements, audio advertisements, dialogic interactions, interviews, verbal surveys, pre-recorded dialogues, and scripts of films and plays, while informal data will include samples of texts obtained from various verbal interactions casually enacted in the regular course of life and living. Equal representation of both kinds of speech text will thus make a speech corpus balanced, non-skewed, and properly representative.
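Whether a collection is in fact balanced in this sense can be checked by measuring how much of the data each category contributes. The following is a minimal sketch under our own assumptions: each sample is a dictionary carrying a transcribed `text` and some descriptive attributes, size is measured crudely in whitespace-separated words, and the three samples are made up for illustration only. The function name `representation` is hypothetical.

```python
from collections import Counter


def representation(samples, attribute):
    """Share of word tokens contributed by each value of one attribute."""
    totals = Counter()
    for sample in samples:
        # Crude size measure: whitespace-separated words in the transcription.
        totals[sample[attribute]] += len(sample["text"].split())
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {value: count / grand_total for value, count in totals.items()}


# Invented samples, used only to show the calculation.
samples = [
    {"register": "formal", "age_group": "adult", "text": "the court is now in session"},
    {"register": "informal", "age_group": "teen", "text": "oh shut up"},
    {"register": "informal", "age_group": "adult", "text": "how much for these"},
]

for value, share in representation(samples, "register").items():
    print(value, round(share, 2))
```

The same function can be applied to any demographic attribute (age group, profession, and so on), so that a skew towards one group of speakers is noticed before the corpus is released.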

A speech corpus should be made in such a way that it balances demographic and contextual variety. While demographic variety accounts for the age, gender, profession, birthplace, education, economic condition, ethnicity, etc. of speakers, contextual variety accounts for all types of variation observed in speech events taking place at different times and places, and with different agents and events. A speech corpus made in this way faithfully represents the actual nature and form of a speech event. It thus builds a basis for repetitiveness and diversion, two important features of speech, and provides reliable information and clues for proper analysis and interpretation of discourse. The following diagram (Fig. 4) shows a sample from COLT (Corpus of London Teenagers), which is reproduced to show how a speech corpus is designed and developed.

Sharon: Oh don’t start on me you know, saying I can’t be there on Tuesday! (...)
Susie: I said nothing. [I’m talking about me!]
Sharon: [ laugh ] Don’t start because I’ll, I’ll smash your face in! (...)
Sharon: I say, I’ve got friends
Susie: laugh (...)
Sharon: and I’m gonna make them come over and I’m gonna make them beat the shit out of you! (...)
Susie: Oh shut up!

Fig. 4: Speech Corpus (Stenström, Andersen and Hasund 2002: 203)

For convenience of understanding, let us assume that we want to develop a speech corpus for standard spoken Hindi. Let us also assume that it should preserve all the salient features required for a speech corpus to be maximally balanced, representative, and useful for studying Hindi speech. The question now is how we are going to develop such a speech corpus. In this case, the methods employed for other languages may be useful (Hary 2003), with necessary changes. Normally, the methods used for collecting spoken texts in digital form include the following stages:

Stage 1: Recording spoken texts on digital tape recorders.
Stage 2: Recording spoken interactions on videotape.
Stage 3: Transcribing spoken texts into written form.
Stage 4: Transcribing spoken texts in the notation of the International Phonetic Alphabet (IPA).
Stage 5: Annotating texts with phonetic, orthographic, grammatical, demographic, and contextual information.
Stage 6: Preparing detailed data about the extralinguistic information relating to spoken texts and interactants.
Stage 7: Preparing a detailed glossary of spoken texts.
Stage 8: Translating texts into another widely known language.
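The eight stages listed above can be mirrored in the record kept for each sample, so that it is always clear which stages a sample has passed through. This is only an illustrative sketch: the class `SpokenSample`, its field names, and the file path are hypothetical, and the assignment of one field to each stage is our own simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpokenSample:
    """One recorded sample; each field corresponds to one collection stage."""
    audio_path: str                                            # Stage 1
    video_path: Optional[str] = None                           # Stage 2
    orthographic: Optional[str] = None                         # Stage 3
    ipa: Optional[str] = None                                  # Stage 4
    annotations: Dict[str, str] = field(default_factory=dict)  # Stage 5
    metadata: Dict[str, str] = field(default_factory=dict)     # Stage 6
    glossary: Dict[str, str] = field(default_factory=dict)     # Stage 7
    translation: Optional[str] = None                          # Stage 8

    def pending_stages(self) -> List[int]:
        """Stage numbers whose field is still empty (Stage 1 is always present)."""
        checks = [
            (2, self.video_path), (3, self.orthographic), (4, self.ipa),
            (5, self.annotations), (6, self.metadata),
            (7, self.glossary), (8, self.translation),
        ]
        return [stage for stage, value in checks if not value]


# Hypothetical sample: recorded and orthographically transcribed, nothing more.
sample = SpokenSample(audio_path="recordings/sample_001.wav", orthographic="namaste")
print(sample.pending_stages())
```

Keeping the audio reference and every derived layer in one record also supports the later requirement that one should be able to return from a transcription to the original speech data.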

It is normally argued that informal and impromptu speech is the most important variety of all, because it most closely represents the core of a language. An informal speech corpus, in principle, contains texts from informal and impromptu conversations. It reveals the characteristic features of speech in a reliable and lively way that probably no other variety can.

The controversies relating to the selection of speech text samples need urgent clarification. We are not sure how a speech is judged to be impromptu or identified as informal. These questions need to be addressed before we actually tag speech samples with such labels. Nor are we sure whether one composes texts for oral delivery, for silent reading, or for both. The truth is that informal and impromptu speech is the most difficult and expensive kind of data to acquire, and is highly complicated to classify and manage. Complexities are also involved in the transcription of speech, since there is hardly any consensus about transcription conventions.

The method and standard proposed by Greenbaum and Quirk (1990) while developing the London-Lund Speech Corpus of English were greatly revised and modified at the time of developing the Swedish Speech Corpus, the Chinese Speech Corpus, the Speech Corpus of American English, and the Hebrew Speech Corpus. As a result, we have no definite guideline to follow for collecting speech data. However, the present trend of corpus research implies that linguists are at liberty to select the type and amount of speech data independently, taking into consideration the needs of the specific research and the application potential of the work.

5.2.3 Spoken Corpus

The term Spoken Corpus is used carefully to distinguish it from a speech corpus. A spoken corpus is, in principle, a technical extension of a speech corpus. It does contain texts of a spoken language, but in a different mode and formation: text samples in a spoken corpus are stored in written form, transcribed directly from spoken texts. Sometimes it is also tagged with various annotations relating to the normal utterance of speech. Some examples of spoken corpora include the Lancaster/IBM Spoken English Corpus, the Emotional Prosody Speech and Transcripts Corpus, the London-Lund Corpus, the Wellington Corpus of Spoken New Zealand English, the International Corpus of English, etc. In these corpora, speech texts are transcribed and preserved in written form without being altered at the time of transcription.

Spoken corpora may be annotated with phonetic transcriptions. If a spoken corpus is preserved both as sound waves and as a transcribed version, a single text exists in two versions, which yields a special kind of parallel corpus. Although not many examples of phonetically transcribed spoken corpora exist, they are a useful addition to the class of annotated corpora for linguists who lack the technological expertise for analysing recorded speech (McEnery and Wilson 1996: 26). The diagram below (Fig. 5) gives a sample of a transcribed spoken corpus from the London-Lund Corpus (LLC).

10 1 1 B 11 ((of ^Spanish)) . graph\ology# /
20 1 1 A 11 ^w=ell# . /
30 1 1 A 11 ((if)) did ^y/ou _set _that# /
40 1 1 B 11 ^well !J\oe and _I# /
50 1 1 B 11 ^set it betw\een _us# /
60 1 1 B 11 ^actually! Joe 'set the :p\aper# /
70 1 1 B 20 and *((3 to 4 sylls))* /
80 1 1 A 11 *^w=ell# /

Fig. 5: Example of a spoken corpus (LLC) (Source: ICAME http://www.hit.uib.no/icame/lolueks.html)
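Records of the kind shown in Fig. 5 can be read mechanically once their layout is fixed. The sketch below rests on assumptions drawn only from the visible sample: each record is taken to consist of three numeric fields, a one-letter speaker label, one more numeric field, and then the transcribed stretch ending in a slash. The meaning of the numeric fields is not given in this module, so they are matched but not interpreted; the names `RECORD` and `parse_record` are hypothetical.

```python
import re

# Assumed record layout, inferred from the sample only:
# number number number SPEAKER number transcription /
RECORD = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+([A-Z])\s+(\d+)\s+(.*?)\s*/\s*$"
)


def parse_record(line):
    """Return the speaker label and transcription of one record, or None."""
    match = RECORD.match(line)
    if match is None:
        return None
    return {"speaker": match.group(4), "text": match.group(6)}


# Raw strings keep the backslashes of the prosodic notation intact.
lines = [
    r"20 1 1 A 11 ^w=ell# . /",
    r"30 1 1 A 11 ((if)) did ^y/ou _set _that# /",
    r"40 1 1 B 11 ^well !J\oe and _I# /",
]

for line in lines:
    record = parse_record(line)
    print(record["speaker"], "|", record["text"])
```

Because the closing slash is anchored to the end of the line, a slash inside the transcription (as in the second record) is kept as part of the prosodic notation rather than treated as the end of the record.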

Despite the wide experience gained in the compilation and annotation of text corpora, the work of generating and annotating spoken corpora has not become simpler. Spoken texts involve many aspects that need extra care at the time of text collection and annotation. The transient nature of spoken texts is offered as an explanation for the complexities involved in collecting them. Even capturing spoken texts is not a trivial task, as it involves various issues of demography, linguistics, and technology.

Once audio data are collected and stored in digital form, the texts must be transcribed in both orthographic and phonetic forms before they can be utilised. Processing of spoken texts then involves text segmentation, orthographic annotation, prosodic annotation, part-of-speech tagging, lemmatisation, parsing, etc., all of which are built upon the transcription of speech texts. The problems often encountered in processing spoken texts are the following:

(a) Experience of working with text corpora has only marginal value in dealing with the idiosyncrasies found in a spoken text corpus.
(b) Since little experience and knowledge are available for the transcription of spoken texts, it is necessary to develop benchmarking procedures, techniques, and guidelines for speech text transcription.
(c) Tools for automatic, supervised, or semi-supervised transcription of spoken texts need to be designed for all languages.
(d) Systems and methods should be developed for applying annotation to spoken texts in a uniform manner across all speech varieties.
(e) Schemes for transcribing spoken texts have to be designed in such a way that it is possible to revert easily from the transcribed version of a text to the speech data.
(f) Standards of annotation developed for the spoken corpus of one language may be customised to cater to the needs of spoken corpora of other languages.

Because of the complexities involved in compilation and annotation, the spoken corpus has brought linguists and speech technologists onto one platform. Ideally, a spoken corpus addresses the needs of both groups, although there are conflicts of interest. For example, a recording of spontaneous conversation in a noisy environment is highly interesting and useful data for linguists, but it appears to be useless to researchers in speech recognition and speaker identification. Given below (Fig. 6) is an annotated spoken corpus, tagged with features of spontaneous speech and syntax.

Orthographic version of a spoken text:

Good morning. More news about the Reverend Sun Myung Moon, founder of the Unification church, who's currently in jail for tax evasion: he was awarded an honorary degree last week by

Annotated version of the spoken text:

A01 2 (_( In_IN Perspective_NP )_)
A01 3 (_( Rosemary_NP Hartill_NP )_)
A01 5 ^ good_JJ morning_NN ._. ^ more_AP news_NN about_IN the_ATI
A01 5 Reverend_NPT Sun_NP Myung_NP Moon_NP ,_, founder_NN
A01 6 of_IN the_ATI Unification_NNP church_NN ,_, who_WP 's_BEZ
A01 6 currently_RB in_IN jail_NN for_IN tax_NN evasion_NN :_:

Fig. 6: Lancaster/IBM Spoken Tagged English Corpus (Source: ICAME: http://www.hit.uib.no/icame/lanspeks.html)
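The annotated version in Fig. 6 attaches a grammatical tag to each word with an underscore, which makes it straightforward to recover word and tag pairs and to count how often each tag occurs. The sketch below assumes only what the sample shows: tokens are separated by spaces, the last underscore in a token separates word from tag, and tokens without an underscore (such as the caret) are markers to be skipped. The function name `parse_tagged` is hypothetical, and the reference prefix of each line (for example A01 5) is assumed to have been removed beforehand.

```python
from collections import Counter


def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs, skipping untagged markers."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so that the tag is always the final part.
        word, separator, tag = token.rpartition("_")
        if separator and word and tag:
            pairs.append((word, tag))
    return pairs


line = "^ good_JJ morning_NN ._. ^ more_AP news_NN about_IN the_ATI"
pairs = parse_tagged(line)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs)
print(tag_counts.most_common(3))
```

Counts of this kind are the starting point for the frequency-based studies of spoken language that an annotated spoken corpus is meant to support.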

5.3 CONCLUSION

It is well known that speech is historically prior to text (Halliday 1987). We know that speech is primary while writing is secondary, for the reasons that children acquire speech first and that illiterate people use language without having the skills of reading and writing. Thus, the primacy of speech over text shows that speech is the basic medium of linguistic expression, both in how language evolved and in how children acquire it.

All these arguments are furnished here to substantiate our claim that, since spoken and written texts are characteristically different from each other in form, function, and composition, corpora developed from these two types of text should not be merged to produce a general corpus. Rather, each type of text should be kept in a separate corpus, so that its future use in linguistic studies and applications is more fruitful and trouble-free.

SUGGESTED READINGS

Chafe, W. (1982) Integration and involvement in speaking, writing, and oral literature. In: Tannen, D. (Ed.) Spoken and Written Language: Exploring Orality and Literacy. Norwood, New Jersey: Ablex Publishing Corporation. Pp. 35-53.
Eggins, S. (1994) An Introduction to Systemic Functional Linguistics. London: Pinter Publishers.
Greenbaum, S. and R. Quirk (1990) A Student’s Grammar of the English Language. London: Longman.
Halliday, M.A.K. (1987) Spoken and Written Modes of Meaning, Comprehending Oral and Written Language. San Diego, CA: Academic Press.
Hary, B.H. (Ed.) (2003) Corpus Linguistics and Modern Hebrew. Tel Aviv: Tel Aviv University Press.
Leech, G. (1993) Corpus annotation schemes. Literary and Linguistic Computing. 8(4): 275-281.
McEnery, A. and M. Oakes (1996) Sentence and word alignment in the CRATER project. In: Thomas, J. and M. Short (Eds.) Using Corpora for Language Research. London: Longman. Pp. 211-233.
Sasaki, M. (2003) The writing system of an artificial language: for efficient orthographic processing. Journal of Universal Language. 4(1): 91-112.
Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stenström, A-B., G. Andersen, and I.K. Hasund (2002) Trends in Teenage Talk: Corpus Compilation, Analysis and Findings. Amsterdam: John Benjamins Publishing Company.
Uhmann, S. (2001) Some arguments for the relevance of syntax to same-sentence self-repair in everyday German conversation. In: Selting, M. and E. Couper-Kuhlen (Eds.) Studies in Interactional Linguistics. Amsterdam/Philadelphia: John Benjamins. Pp. 373-404.