2.4. Purpose of collecting a corpus
Each corpus is designed, collected and maintained to serve its designer's purpose. The general purposes for creating corpora can be categorized as follows:
1) Research: description of the language in use.
2) Business:
a) publishing: dictionaries and reference books,
b) teaching: evaluating curricula, improving methods of teaching, producing teaching materials,
c) translation: producing translator's aids.
A detailed description of the various purposes, illustrated with examples of research, is presented in Part 4.
2.5. Storing a corpus
Collections of written texts are stored in digital electronic form, which is why corpora are sometimes called machine-readable. Information technology allows users to access the data quickly via user-friendly interfaces. A spoken corpus has to be recorded as sound and then transcribed into text, which loses phonetics and prosody. The written form of a spoken corpus is convenient for lexical and syntactic analysis of the spoken language. Nowadays it still seems impossible to analyse a spoken corpus in the form in which it is stored, because sound concordance programmes have not been invented yet. There is a need for a sound concordance programme and a sound annotation tool, and they will be developed sooner or later, as speech technology is now developing quite fast.
Written or spoken texts have to be transformed into digital data. Each language has its characteristic alphabet, but computers understand only numbers. Unicode provides a unique number for every character in every language, so it can be used both to describe a language's characters and to store texts in that language. To learn more about Unicode, visit http://www.unicode.org/ . The Polish diacritics, with their Unicode codes and names, are the following:
00D3 Ó Latin capital letter O with acute
00F3 ó Latin small letter O with acute
0104 Ą Latin capital letter A with ogonek
0105 ą Latin small letter A with ogonek
0106 Ć Latin capital letter C with acute
0107 ć Latin small letter C with acute
0118 Ę Latin capital letter E with ogonek
0119 ę Latin small letter E with ogonek
0141 Ł Latin capital letter L with stroke
0142 ł Latin small letter L with stroke
0143 Ń Latin capital letter N with acute
0144 ń Latin small letter N with acute
015A Ś Latin capital letter S with acute
015B ś Latin small letter S with acute
0179 Ź Latin capital letter Z with acute
017A ź Latin small letter Z with acute
017B Ż Latin capital letter Z with dot above
017C ż Latin small letter Z with dot above
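The correspondence between characters, code points and official Unicode names can be checked directly with Python's standard unicodedata module, as this short sketch shows:

```python
import unicodedata

# Print each Polish diacritic with its hexadecimal code point and
# its official Unicode character name.
for ch in "ÓóĄąĆćĘęŁłŃńŚśŹźŻż":
    print(f"{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")
```

Running this reproduces the table above, e.g. `0141  Ł  LATIN CAPITAL LETTER L WITH STROKE`.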
Plain text is the core of any corpus in written form, but it is hard to get at, as reading from a screen is not as convenient as reading from paper. There is a need to add various markers, e.g. semantic or syntactic ones, that facilitate access to the data. Whatever is done to the corpus, the original plain text of each corpus must be safely preserved. Original plain sound corpora should be preserved even more carefully, because they must wait for new tools before they can be exploited.
2.6. Annotations
2.6.1. Leech’s maxims
Plain text is called unannotated. Annotations, that is, extra information added to the text, make it possible to retrieve and search the data faster and in a purposeful way.
In 1993 Leech described seven maxims about annotations:
1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus. At times this can be a simple process, for example removing the underscore and every character after it, so that "Claire_NP1 collects_VVZ shoes_NN2" would become "Claire collects shoes". However, the prosodic annotation of the London-Lund corpus is interspersed within words: "g/oing", for example, indicates a rising pitch on the first syllable of the word "going", meaning that the original words cannot be so easily reconstructed.
2. It should be possible to extract the annotations by themselves from the text. This is the flip side of maxim 1. Taking points 1 and 2 together, the annotated corpus should allow the maximum flexibility for manipulation by the user.
3. The annotation scheme should be based on guidelines which are available to the end user. Most corpora have a manual which contains full details of the annotation scheme and guidelines issued to the annotators. This enables the user to understand fully what each instance of annotation represents without resorting to guesswork, and in cases of ambiguity to understand why a particular annotation decision was made at that point. You might want to look briefly at an example of the guidelines for part-of-speech annotation of the BNC corpus http://www.comp.lancs.ac.uk/computing/users/eiamjw/claws/claws7.html although this page has restricted access.
4. It should be made clear how and by whom the annotation was carried out. A corpus may be annotated manually, either by a single person or by a number of different people; alternatively the annotation may be carried out automatically by a computer program whose output may or may not be corrected by human beings.
5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool. Any act of corpus annotation is, by definition, also an act of interpretation, either of the structure of the text or of its content.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles. For example, parsed corpora often adopt a basic context-free phrase structure grammar rather than implementing a narrower specific grammatical theory such as Chomsky's Principles and Parameters framework.
7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
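The tag-removal example from maxim 1, together with its flip side in maxim 2, can be sketched in a few lines of Python. This is a minimal illustration for the underscore tagging format only (the function names are our own); the London-Lund prosodic case shows that not every scheme is reversible this way.

```python
import re

def strip_tags(annotated: str) -> str:
    """Remove word_TAG annotations, reverting to the raw text (maxim 1)."""
    return re.sub(r"_\S+", "", annotated)

def extract_tags(annotated: str) -> list[str]:
    """Extract the annotations by themselves (maxim 2)."""
    return re.findall(r"_(\S+)", annotated)

print(strip_tags("Claire_NP1 collects_VVZ shoes_NN2"))    # Claire collects shoes
print(extract_tags("Claire_NP1 collects_VVZ shoes_NN2"))  # ['NP1', 'VVZ', 'NN2']
```

Taken together, the two functions give the user the flexibility that maxims 1 and 2 call for: the raw text and the annotation layer can each be recovered independently.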
Although there are no fixed standards for annotations, some conventions have developed “through practical consensus”. Kahrel, Barnett and Leech (1997) pointed out: “Standardization of annotation practices can ensure that an annotated corpus can be used to its greatest potential”. Standards may be developed on two levels: standard encoding of corpora and annotations, and standard annotation of corpora.
2.6.2. Types of annotations
Although there are no clear and obligatory standards, there are several types of annotation that have proved useful in various corpora. McEnery and Wilson (2001) developed the following classification:
- Part-of-speech annotation
- Lemmatisation
- Parsing
- Semantic annotation
- Discoursal and text linguistic annotation
  o Pragmatic/stylistic
- Phonetic transcription
- Prosody
- Problem-oriented tagging
Garside, Leech and McEnery (1997) also suggest orthographic annotation. However, orthography is the graphic representation of a text, not its linguistic interpretation. In some cases the use of italics may distinguish a linguistic function.
What follows are examples of annotations listed above.
Part-of-speech annotation
Part-of-speech annotation identifies and marks the part of speech of each word in the text.
Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2; '&POS; feet&NN2; ,&PUN; suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP; better&AV0; dig&VVB; out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC; polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI; playing&VVG; in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;
The codes used above are:
AJ0: general adjective
AT0: article, neutral for number
AV0: general adverb
AVP: prepositional adverb
CJC: co-ordinating conjunction
CJS: subordinating conjunction
CJT: that conjunction
DPS: possessive determiner
DT0: singular determiner
NN0: common noun, neutral for number
NN1: singular common noun
NN2: plural common noun
NP0: proper noun
POS: genitive marker
PNP: pronoun
PRF: of
PRP: preposition
PUN: punctuation
TO0: infinitive to
VBI: be
VM0: modal auxiliary
VVB: base form of lexical verb
VVD: past tense form of lexical verb
VVG: -ing form of lexical verb
VVI: infinitive form of lexical verb
VVN: past participle form of lexical verb
Source: McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
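Annotation in this word&TAG; convention is easy to process by machine. The sketch below (assuming the format shown above; the function name is our own) splits such text into word and tag pairs; ambiguous portmanteau tags such as NN1-NP0 are kept whole.

```python
import re

def parse_tagged(text: str) -> list[tuple[str, str]]:
    """Split 'word&TAG;' annotation into (word, tag) pairs."""
    return re.findall(r"(\S+?)&([A-Z0-9-]+);", text)

sample = "Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1;"
for word, tag in parse_tagged(sample):
    print(word, tag)
```

This illustrates Leech's maxims 1 and 2 in practice: from the pairs one can rebuild either the raw text or the tag sequence alone.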
Lemmatisation
Lemmatisation groups the variant forms of a lexeme (a unit of meaning), such as its -ed and -ing forms, under a single lemma. For example, goes, going, gone and went all belong to the lemma of go. Not many corpora are lemmatised. Examples by McEnery are available at: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
See the example below. The first column contains text references, the second the part of speech, the third the actual words from the text, and the fourth the lemmatised words.
N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510i - AT the the
N12:0510j - NN1c problem problem
N12:0510k - IF for for
N12:0510m - DD221 a a
N12:0510n - DD222 few few
N12:0510p - NNT2 seconds second
N12:0520a - CC and and
N12:0520b - VVDv thought think
N12:0520c - IO of of
N12:0520d - AT1 a a
N12:0520e - NNc means means
N12:0520f - IIb by by
N12:0520g - DDQr which which
N12:0520h - PPH1 it it
N12:0520i - VMd might may
N12:0520j - VB0 be be
N12:0520k - VVNt solved solve
N12:0520m - YF +. -
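One use of such a lemmatised corpus is collecting all attested forms of each lemma. The sketch below (our own illustration, assuming the four-column line format shown above) groups the word column under the lemma column:

```python
from collections import defaultdict

def group_by_lemma(rows: list[str]) -> dict[str, list[str]]:
    """Collect word forms under their lemma from four-column
    'reference - TAG word lemma' lines."""
    lemmas = defaultdict(list)
    for row in rows:
        ref, _, tag, word, lemma = row.split()
        lemmas[lemma].append(word)
    return dict(lemmas)

rows = [
    "N12:0510h - VVDv studied study",
    "N12:0520b - VVDv thought think",
    "N12:0520k - VVNt solved solve",
]
print(group_by_lemma(rows))
```

Over a whole corpus this yields, for each lemma such as go, the full set of variants (goes, going, gone, went) actually found in the texts.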
Parsing
Full parsing provides a detailed analysis of the structure of sentences. Skeleton parsing uses a limited set of syntactic constituent types and ignores, for example, the internal structure of certain constituent types. Study the examples by McEnery at: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
Full parsing
[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns the_AT1 [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+]NN/JJ&] heel_NN ,_, [Fr[Nq which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn teamed_VBN Vn] [R up_RP R] [P with_INW [Np[JJ/JJ/NN& pointed_JJ ,_, [JJ- squared_JJ JJ-] ,_, [NN+ and_CC chisel_NN NN+]JJ/JJ/NN&] toes_NNS Np]P]Tn]Fr]Ns] ._. S]
This example was taken from the Lancaster-Leeds treebank available at McEnery's website http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
The syntactic constituent structure is indicated by nested pairs of labelled square brackets, and the words have part-of-speech tags attached to them. The syntactic constituent labels used are:
& whole coordination
+ subordinate conjunct, introduced
- subordinate conjunct, not introduced
Fr relative phrase
JJ adjective phrase
Ncs noun phrase, count noun singular
Np noun phrase, plural
Nq noun phrase, wh- word
Ns noun phrase, singular
P prepositional phrase
R adverbial phrase
S sentence
Tn past participle phrase
Vn verb phrase, past participle
Vzb verb phrase, third person singular to be
Vzp verb phrase, passive third person singular
Skeleton parsing
[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1 N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2 [P of_IO [N that_DD1 victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0 seen_VVN [P in_II [N the_AT international_JJ community_NNJ [P of_IO [N scholars_NN2 N]P] [Fr that_CST [V has_VHZ graduated_VVN here_RL today_RT V]Fr]N]P]V]S+] ._.
This example was taken from the Spoken English Corpus.
The two examples are similar, but in skeleton parsing all noun phrases are simply labelled with the letter N, whereas in full parsing several types of noun phrase are distinguished according to features such as plurality. The only constituent labels used in the skeleton parsing example are:
Fr relative clause
N noun phrase
P prepositional phrase
S& 1st main conjunct of a compound sentence
S+ 2nd main conjunct of a compound sentence
V verb phrase
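Because the constituent structure is encoded purely in labelled brackets, a simple script can check that a parse annotation is well formed. The sketch below (our own rough check, assuming labels abut their brackets as in the examples above; it counts brackets rather than parsing nesting) compares opening and closing brackets per label:

```python
import re

def constituent_counts(parsed: str) -> dict[str, tuple[int, int]]:
    """Count opening '[LABEL' and closing 'LABEL]' occurrences per
    constituent label in a bracketed parse annotation."""
    opens = re.findall(r"\[([A-Za-z&+/-]+)", parsed)
    closes = re.findall(r"([A-Za-z&+/-]+)\]", parsed)
    labels = set(opens) | set(closes)
    return {lab: (opens.count(lab), closes.count(lab)) for lab in labels}

# First conjunct of the skeleton-parsed sentence above
sample = ("[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 "
          "university_NNL1 N]P]N]P] [N this_DD1 charter_NN1 N] "
          "[V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]S&]")
counts = constituent_counts(sample)
print(counts)
assert all(o == c for o, c in counts.values())  # every label is balanced
```

Such automatic consistency checks are useful precisely because, as noted below, human parsing tends to be inconsistent.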
Automatic parsing is not as effective as part-of-speech annotation, and human post-editing of its output is necessary. However, human parsing is inconsistent, particularly when ambiguities occur.
Semantic marking
Based on: McEnery at: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2semant.htm
The example below (Wilson 1996) is intended to give the reader an idea of the types of categories used in semantics:
And 00000000
the 00000000
soldiers 23231000
platted 21072000
a 00000000
crown 21110400
of 00000000
thorns 13010000
and 00000000
put 21072000
it 00000000
on 00000000
his 00000000
head 21030000
and 00000000
they 00000000
put 21072000
on 00000000
him 00000000
a 00000000
purple 31241100
robe 21110321
The numeric codes stand for:
00000000 Low content word (and, the, a, of, on, his, they, etc.)
13010000 Plant life in general
21030000 Body and body parts
21072000 Object-oriented physical activity (e.g. put)
21110321 Men's clothing: outer clothing
21110400 Headgear
23231000 War and conflict: general
31241100 Colour
The semantic categories are represented by 8-digit numbers. The scheme above is based on that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three top-level categories, which are themselves subdivided, and so on.
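The glossary of codes can serve directly as a lookup table, and the hierarchical structure shows up as shared digit prefixes. A minimal sketch (the dictionary holds only the codes from the example above; the function name is our own):

```python
# Semantic field glossary from the example above (after Schmidt 1993)
SEMANTIC_FIELDS = {
    "00000000": "Low content word",
    "13010000": "Plant life in general",
    "21030000": "Body and body parts",
    "21072000": "Object-oriented physical activity",
    "21110321": "Men's clothing: outer clothing",
    "21110400": "Headgear",
    "23231000": "War and conflict: general",
    "31241100": "Colour",
}

def decode(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Resolve (word, code) pairs to (word, semantic field) pairs."""
    return [(w, SEMANTIC_FIELDS.get(c, "unknown")) for w, c in tagged]

# The hierarchy shows in shared prefixes: the two clothing codes,
# 21110321 and 21110400, branch off the same 2111... node.
print(decode([("crown", "21110400"), ("robe", "21110321")]))
```

Grouping codes by prefix in this way lets a researcher query the corpus at any level of the hierarchy, from broad top-level categories down to fine subdivisions.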
Discoursal and text linguistic annotation
Source: McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2discour.htm
Discourse tags
Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags", including categories such as:
"apologies", e.g. sorry, excuse me
"greetings", e.g. hello
"hedges", e.g. kind of, sort of thing
"politeness", e.g. please
"responses", e.g. really, that's right
Despite their potential role in the analysis of discourse, these kinds of annotation have never become widely used, possibly because the linguistic categories are context-dependent and their identification in texts is a greater source of dispute than for other linguistic phenomena. Thus, annotations at the levels of discourse and text are rarely used.
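A crude surface-matching sketch shows why such annotation is contested: cue phrases can be spotted mechanically, but the categories are context-dependent, so matching alone is not reliable annotation. The phrase lists below are hypothetical, only loosely based on the categories quoted above, and the function is our own illustration, not Stenström's method.

```python
# Hypothetical cue-phrase lists; a real discourse tagger would need
# context, not just surface matching.
DISCOURSE_PHRASES = {
    "apology": ["sorry", "excuse me"],
    "greeting": ["hello"],
    "hedge": ["kind of", "sort of thing"],
    "politeness": ["please"],
    "response": ["really", "that's right"],
}

def discourse_tags(utterance: str) -> list[str]:
    """Return discourse categories whose cue phrases occur in the
    utterance (surface match only, for illustration)."""
    low = utterance.lower()
    return [cat for cat, phrases in DISCOURSE_PHRASES.items()
            if any(p in low for p in phrases)]

print(discourse_tags("Sorry, could you pass the salt please"))
```

Note how easily this misfires: "really" can be a response marker or a plain adverb, which is exactly the kind of ambiguity that makes discourse-level annotation disputed.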
Anaphoric annotation
Cohesion is the vehicle by which elements in a text are linked together, through the use of pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in English" (1976) was considered a turning point in linguistics, as it was the most influential account of cohesion. Anaphoric annotation is the marking of pronoun reference; the pronoun system of a language can only be fully described by reference to large amounts of empirical data, in other words, corpora. At present anaphoric annotation can only be carried out by human analysts; indeed, one of its aims is to produce data on which computer programs can be trained to carry out the task. There are only a few instances of corpora which have been anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an example of which is given below: