2.4. Purpose of Collecting a Corpus

Each corpus is designed, collected and maintained to serve its designer’s purpose. The general purposes for creating corpora can be categorized as follows:

1) Research: description of the language in use.
2) Business:
   a) publishing dictionaries and reference books
   b) teaching: evaluating curricula, improving methods of teaching, producing teaching materials
   c) translation: producing translator’s aids

A detailed description of the various purposes, with examples of research, is presented in Part 4.

2.5. Storing a corpus

Collections of written texts are stored in digital form, which is why corpora are sometimes called machine-readable. Information technology allows users to access the data quickly via user-friendly interfaces. A spoken corpus has to be recorded as sound and then transcribed into text, which leaves out phonetics and prosody. The written form of a spoken corpus is convenient for lexical and syntactic analysis of the spoken language. It still seems to be impossible to analyse a spoken corpus in the form in which it is recorded, because sound concordance programmes have not been invented yet. There is a need for a sound concordance programme and a sound annotation tool, and they are likely to be developed sooner or later, as speech technology is developing quite fast.

Written or spoken texts have to be transformed into digital data. Each language has its characteristic alphabet. Unicode is a standard that provides codes for storing texts in these languages. To learn more about Unicode visit http://www.unicode.org/ . Computers understand numbers only, so Unicode provides a unique number for every character in all languages. The names and codes of the Polish diacritics are the following:

00D3  Ó  Latin capital letter O with acute
00F3  ó  Latin small letter o with acute
0104  Ą  Latin capital letter A with ogonek
0105  ą  Latin small letter a with ogonek
0106  Ć  Latin capital letter C with acute
0107  ć  Latin small letter c with acute
0118  Ę  Latin capital letter E with ogonek
0119  ę  Latin small letter e with ogonek
0141  Ł  Latin capital letter L with stroke
0142  ł  Latin small letter l with stroke
0143  Ń  Latin capital letter N with acute
0144  ń  Latin small letter n with acute
015A  Ś  Latin capital letter S with acute
015B  ś  Latin small letter s with acute
0179  Ź  Latin capital letter Z with acute
017A  ź  Latin small letter z with acute
017B  Ż  Latin capital letter Z with dot above
017C  ż  Latin small letter z with dot above
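The codes listed for the Polish diacritics can be checked with a few lines of Python. This is only a minimal sketch that uses the standard unicodedata module; the function name describe is chosen here for illustration.

```python
import unicodedata

# The eighteen Polish letters with diacritics.
POLISH_DIACRITICS = "ÓóĄąĆćĘęŁłŃńŚśŹźŻż"

def describe(char):
    """Return (hexadecimal code point, character, official Unicode name)."""
    return (f"{ord(char):04X}", char, unicodedata.name(char))

for ch in POLISH_DIACRITICS:
    code, char, name = describe(ch)
    print(code, char, name)
```

Each printed line gives the number that the computer stores, the character itself and its Unicode name.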

Plain text is the core of any written corpus, but it is hard to work with directly, as reading from a screen is not as convenient as reading from paper. Various markers, e.g. semantic or syntactic, need to be added to facilitate access to the data. Whatever is done to the corpus, its original plain text must be kept safe. Original sound recordings should be protected even more carefully, because they need to wait for new tools.

2.6. Annotations

2.6.1. Leech’s maxims

Plain text is called unannotated. Annotations, that is, extra information, are added so that the data can be searched and retrieved faster and for a specific purpose.

In 1993 Leech described seven maxims about annotations:

1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus. At times this can be a simple process – for example removing every character after an underscore e.g. "Claire_NP1 collects_VVZ shoes_NN2" would become "Claire collects shoes". However, the prosodic annotation of the London-Lund corpus is interspersed within words – for example "g/oing" indicates a rising pitch on the first syllable of the word "going", meaning that the original words cannot be so easily reconstructed.

2. It should be possible to extract the annotations by themselves from the text. This is the flip side of maxim 1. Taking points 1 and 2 together, the annotated corpus should allow the maximum flexibility for manipulation by the user.

3. The annotation scheme should be based on guidelines which are available to the end user. Most corpora have a manual which contains full details of the annotation scheme and guidelines issued to the annotators. This enables the user to understand fully what each instance of annotation represents without resorting to guesswork, and in cases of ambiguity to understand why a particular annotation decision was made at that point. You might want to look briefly at an example of the guidelines for part-of-speech annotation of the BNC corpus http://www.comp.lancs.ac.uk/computing/users/eiamjw/claws/claws7.html although this page has restricted access.

4. It should be made clear how and by whom the annotation was carried out. A corpus may be annotated manually, either by a single person or by a number of different people; alternatively the annotation may be carried out automatically by a computer program whose output may or may not be corrected by human beings.

5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool. Any act of corpus annotation is, by definition, also an act of interpretation, either of the structure of the text or of its content.

6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles. For example, parsed corpora often adopt a basic context-free phrase structure grammar rather than implementing a narrower specific grammatical theory such as Chomsky's Principles and Parameters framework.

7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
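Maxims 1 and 2 can be illustrated with the underscore example quoted in maxim 1. The sketch below assumes that every tag is attached to its word with an underscore and that no underscore occurs inside a word; the function names are chosen here for illustration.

```python
import re

TAGGED = "Claire_NP1 collects_VVZ shoes_NN2"

def strip_tags(text):
    """Maxim 1: remove each underscore and the tag after it to recover the raw text."""
    return re.sub(r"_\S+", "", text)

def extract_tags(text):
    """Maxim 2: keep only the annotations, in text order."""
    return re.findall(r"_(\S+)", text)

print(strip_tags(TAGGED))    # Claire collects shoes
print(extract_tags(TAGGED))  # ['NP1', 'VVZ', 'NN2']
```

Annotation that is interspersed within words, as in the London-Lund example "g/oing", cannot be removed by such a simple rule.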

Although there are no fixed standards for annotations, some conventions have been developed “through practical consensus”. Kahrel, Barnett and Leech (1997) pointed out: “Standardization of annotation practices can ensure that an annotated corpus can be used to its greatest potential”. Standards may be developed on two levels: standard encoding of corpora and annotation, and standard annotation of corpora.

2.6.2. Types of annotations

Although there are no clear and obligatory standards, different types of annotation have proved to be useful in various corpora. McEnery and Wilson (2001) developed the following classification:
- Part-of-speech annotation
- Lemmatisation
- Parsing
- Semantic
- Discoursal and text linguistic annotation
  o Pragmatic/stylistic
- Phonetic transcription
- Prosody
- Problem-oriented tagging

Garside, Leech, McEnery (1997) also suggest orthographic annotation. However, orthography is the graphic representation of a text – not its linguistic interpretation. In some cases using italics may distinguish a linguistic function.

What follows are examples of annotations listed above.

Part-of-speech annotation

Part-of-speech annotation identifies and marks the part of speech of each word.

Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2; '&POS; feet&NN2; ,&PUN; suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP; better&AV0; dig&VVB; out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC; polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI; playing&VVG; in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;

The codes used above are:
AJ0: general adjective
AT0: article, neutral for number
AV0: general adverb
AVP: prepositional adverb
CJC: co-ordinating conjunction
CJS: subordinating conjunction
CJT: that conjunction
DPS: possessive determiner
DT0: singular determiner
NN0: common noun, neutral for number
NN1: singular common noun
NN2: plural common noun
NP0: proper noun
POS: genitive marker
PNP: pronoun
PRF: of
PRP: preposition
PUN: punctuation
TO0: infinitive to
VBI: be
VM0: modal auxiliary
VVB: base form of lexical verb
VVD: past tense form of lexical verb
VVG: -ing form of lexical verb
VVI: infinitive form of lexical verb
VVN: past participle form of lexical verb

Source: McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
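A tagged text of this kind can be read by a program and, for instance, counted by tag. The following is a minimal sketch, assuming that every token has the shape word&TAG; and that a hyphenated tag such as NN1-NP0 lists two candidate tags; the function names are illustrative only.

```python
import re
from collections import Counter

SAMPLE = "Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2;"

def parse_tagged(text):
    """Return (word, [tags]) pairs; a hyphenated tag is split into its candidates."""
    return [(word, tag.split("-"))
            for word, tag in re.findall(r"(\S+?)&([A-Z0-9-]+);", text)]

def tag_frequencies(pairs):
    """Count tags, using only the first candidate of each token."""
    return Counter(tags[0] for _, tags in pairs)

pairs = parse_tagged(SAMPLE)
print(pairs[0])                # ('Perdita', ['NN1', 'NP0'])
print(tag_frequencies(pairs))
```

Counting only the first candidate is a simplification made for this sketch; a researcher may prefer to treat ambiguous tags separately.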

Lemmatisation

A lexeme is a unit of meaning. The variants of a lexeme, such as the forms ending in -ed or -ing, make up its lemma. For example, goes, going, gone and went belong to the lemma of go. Not many corpora are lemmatised. Examples by McEnery are available at: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm

See the example below. The first column contains text references, the second the part of speech, the third the actual words from the text, and the fourth the lemmatised words.

N12:0510g  -  PPHS1m  He        he
N12:0510h  -  VVDv    studied   study
N12:0510i  -  AT      the       the
N12:0510j  -  NN1c    problem   problem
N12:0510k  -  IF      for       for
N12:0510m  -  DD221   a         a
N12:0510n  -  DD222   few       few
N12:0510p  -  NNT2    seconds   second
N12:0520a  -  CC      and       and
N12:0520b  -  VVDv    thought   think
N12:0520c  -  IO      of        of
N12:0520d  -  AT1     a         a
N12:0520e  -  NNc     means     means
N12:0520f  -  IIb     by        by
N12:0520g  -  DDQr    which     which
N12:0520h  -  PPH1    it        it
N12:0520i  -  VMd     might     may
N12:0520j  -  VB0     be        be
N12:0520k  -  VVNt    solved    solve
N12:0520m  -  YF      +.        -
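A lemmatised corpus in this layout makes it possible to collect all the word forms that belong to one lemma. The sketch below assumes that each row consists of five whitespace-separated fields (reference, a hyphen, tag, word, lemma); the function name is illustrative only.

```python
from collections import defaultdict

ROWS = """\
N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510p - NNT2 seconds second
N12:0520b - VVDv thought think
N12:0520i - VMd might may
N12:0520k - VVNt solved solve
"""

def forms_by_lemma(rows):
    """Map each lemma to the set of word forms found for it."""
    index = defaultdict(set)
    for line in rows.splitlines():
        reference, _, tag, word, lemma = line.split()
        index[lemma].add(word.lower())
    return dict(index)

index = forms_by_lemma(ROWS)
print(index["think"])  # {'thought'}
print(index["may"])    # {'might'}
```

With such an index a search for the lemma think can also retrieve thought, which a search of the plain text would miss.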

Parsing

Full parsing provides detailed analysis of the structure of sentences. Skeleton parsing uses a limited set of syntactic constituent types and ignores, for example, the internal structure of certain constituent types. Study the examples by McEnery at: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm

Full parsing

[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns the_AT1 [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+]NN/JJ&] heel_NN ,_, [Fr[Nq which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn teamed_VBN Vn] [R up_RP R] [P with_INW [Np[JJ/JJ/NN& pointed_JJ ,_, [JJ- squared_JJ JJ-] ,_, [NN+ and_CC chisel_NN NN+]JJ/JJ/NN&] toes_NNS Np]P]Tn]Fr]Ns] ._. S]

This example was taken from the Lancaster-Leeds treebank, available at McEnery’s website http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm

The syntactic constituent structure is indicated by nested pairs of labelled square brackets, and the words have part-of-speech tags attached to them. The syntactic constituent labels used are:
&    whole coordination
+    subordinate conjunct, introduced
-    subordinate conjunct, not introduced
Fr   relative phrase
JJ   adjective phrase
Ncs  noun phrase, count noun singular
Np   noun phrase, plural
Nq   noun phrase, wh- word
Ns   noun phrase, singular
P    prepositional phrase
R    adverbial phrase
S    sentence
Tn   past participle phrase
Vn   verb phrase, past participle
Vzb  verb phrase, third person singular to be
Vzp  verb phrase, passive third person singular

Skeleton Parsing

[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1 N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2 [P of_IO [N that_DD1 victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0 seen_VVN [P in_II [N the_AT international_JJ community_NNJ [P of_IO [N scholars_NN2 N]P] [Fr that_CST [V has_VHZ graduated_VVN here_RL today_RT V]Fr]N]P]V]S+] ._.

This example was taken from the Spoken English Corpus.

The two examples are similar, but in the case of skeleton parsing all noun phrases are simply labelled with the letter N, whereas in the example of full parsing there are several types of noun phrases which are distinguished according to features such as plurality. The only constituent labels used in the skeleton parsing example are:
Fr  relative clause
N   noun phrase
P   prepositional phrase
S&  1st main conjunct of a compound sentence
S+  2nd main conjunct of a compound sentence
V   verb phrase
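The labelled brackets can be processed automatically, for example to list every constituent with the words it covers. The sketch below is only an illustration, written for the notation of the examples above: an opening bracket is followed by its label, a closing bracket is preceded by the same label, and words carry their tags after an underscore. The function name and the short test fragment are chosen here.

```python
import re

# An opening "[Label", a closing "Label]", or a tagged word such as "charter_NN1".
TOKEN = re.compile(r"\[(?P<open>[^\s\[\]]+)|(?P<close>[^\s\[\]]+)\]|(?P<word>[^\s\[\]]+)")

def constituents(parse):
    """Return (label, words) for each constituent, in the order the brackets close."""
    stack, found = [], []
    for match in TOKEN.finditer(parse):
        if match.group("open"):
            stack.append((match.group("open"), []))
        elif match.group("close"):
            if not stack or stack[-1][0] != match.group("close"):
                raise ValueError("unmatched bracket: " + match.group("close"))
            label, words = stack.pop()
            found.append((label, words))
            if stack:
                stack[-1][1].extend(words)  # the words also belong to the parent
        else:
            word = match.group("word").rsplit("_", 1)[0]
            if stack:
                stack[-1][1].append(word)
    if stack:
        raise ValueError("unclosed bracket: " + stack[-1][0])
    return found

FRAGMENT = "[N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]"
for label, words in constituents(FRAGMENT):
    print(label, " ".join(words))
```

Because the function also checks that every closing label matches its opening label, it can serve as a simple consistency check on a parsed text.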

Automatic parsing is not as effective as part-of-speech annotation, and human post-editing of parsing is necessary. However, human parsing is inconsistent, particularly when ambiguities occur.

Semantic marking

Based on: McEnery at: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2semant.htm

The example below (Wilson 1996) is intended to give the reader an idea of the types of categories used in semantics:

And       00000000
the       00000000
soldiers  23241000
platted   21072000
a         00000000
crown     21110400
of        00000000
thorns    13010000
and       00000000
put       21072000
it        00000000
on        00000000
his       00000000
head      21030000
and       00000000
they      00000000
put       21072000
on        00000000
him       00000000
a         00000000
purple    31241100
robe      21110321

The numeric codes stand for:
00000000  Low content word (and, the, a, of, on, his, they etc)
13010000  Plant life in general
21030000  Body and body parts
21072000  Object-oriented physical activity (e.g. put)
21110321  Men's clothing: outer clothing
21110400  Headgear
23231000  War and conflict: general
31241100  Colour

The semantic categories are represented by 8-digit numbers - the one above is based on that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three top level categories, which are themselves subdivided, and so on.
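Semantic codes become useful once a program can translate them into category names. The following minimal sketch assumes that words and codes simply alternate in the text; it uses only part of the key given above, and the function names are illustrative.

```python
# Part of the key given above; codes missing from it are reported as "not in key".
CATEGORIES = {
    "00000000": "Low content word",
    "13010000": "Plant life in general",
    "21030000": "Body and body parts",
    "21072000": "Object-oriented physical activity",
    "21110321": "Men's clothing: outer clothing",
    "21110400": "Headgear",
    "31241100": "Colour",
}

TAGGED = "a 00000000 crown 21110400 of 00000000 thorns 13010000"

def word_code_pairs(text):
    """Split alternating word/code tokens into (word, code) pairs."""
    tokens = text.split()
    return list(zip(tokens[0::2], tokens[1::2]))

def content_words(text):
    """Return (word, category) for every word that is not a low content word."""
    return [(word, CATEGORIES.get(code, "not in key"))
            for word, code in word_code_pairs(text)
            if code != "00000000"]

print(content_words(TAGGED))
# [('crown', 'Headgear'), ('thorns', 'Plant life in general')]
```

The same lookup makes it easy to retrieve, for example, all words in a corpus that were marked as headgear.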

Discoursal and text linguistic annotation

Source: McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2discour.htm

Discourse tags

Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags". They included categories such as:
"apologies" e.g. sorry, excuse me
"greetings" e.g. hello
"hedges" e.g. kind of, sort of thing
"politeness" e.g. please
"responses" e.g. really, that's right

Despite their potential role in the analysis of discourse these kinds of annotation have never become widely used, possibly because the linguistic categories are context-dependent and their identification in texts is a greater source of dispute than other forms of linguistic phenomena. Thus, annotations at the levels of discourse and text are rarely used.

Anaphoric annotation

Cohesion is the vehicle by which elements in text are linked together, through the use of pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in English" (1976) was considered to be a turning point in linguistics, as it was the most influential account of cohesion. Anaphoric annotation is the marking of pronoun reference – our pronoun system can only be realised and understood by reference to large amounts of empirical data, in other words, corpora. Anaphoric annotation can only be carried out by human analysts, since one of the aims of the annotation is to train computer programs with this data to carry out the task. There are only a few instances of corpora which have been anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an example of which is given below:

A039 1 v (1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9 Charlotte_N1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0 rid_VVN of_IO [N 3

Phonetic transcription

Phonetic transcription can be done only by trained and skilled humans, not by computers. Thus, it is costly and time-consuming.

Prosody

Prosody describes the stress, intonation and rhythm of a sentence. Examples by McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm

The example below is taken from the London-Lund corpus:

1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# . /
1 8 15 1480 1 1 A 20 *((4 sylls))* /
1 8 14 1490 1 1 B 11 *I ^w\on't have one th/anks#* - - - /
1 8 14 1500 1 1 A 11 ^aren't you .going to sit d/own# - /
1 8 14 1510 1 1 B 11 ^[/\m]# - /
1 8 14 1520 1 1 A 11 ^have my _coffee in p=eace# - - - /
1 8 14 1530 1 1 B 11 ^quite a nice .room to !s\it in ((actually))# /
1 8 14 1540 1 1 B 11 *^\isn't* it# /
1 5 15 1550 1 1 A 11 *^y/\es#* - - - /

The codes used in this example are:
#      end of tone group
^      onset
/      rising nuclear tone
\      falling nuclear tone
/\     rise-fall nuclear tone
_      level nuclear tone
[]     enclose partial words and phonetic symbols
.      normal stress
!      booster: higher pitch than preceding prominent syllable
=      booster: continuance
(( ))  unclear
* *    simultaneous speech
-      pause of one stress unit

Judgments on prosody are subjective and inconsistent, just like those on discourse.

Problem-oriented tagging

According to Leech’s maxims, every researcher can add annotations relevant to their research. It is important to keep this category in mind; however, it is impossible to make any general statements about it.

2.6.3. Initiatives and projects in the standardization of corpus annotation

There have been several projects aimed at the standardisation of corpus annotation. Some of the more prominent ones have been carried out at UCREL (University Centre for Computer Corpus Research on Language), Lancaster University. Information about current and previous projects can be found at: http://www.comp.lancs.ac.uk/computing/research/ucrel/projects.html

CLAWS (Constituent-Likelihood Automatic Word-Tagging System) is a system for assigning to each word in a text an unambiguous indication of the grammatical class to which the word belongs. The system consists of five separate stages of tagging:
1. A pre-editing phase – preparing a text for tagging, partly automatic, partly manual (PREEDIT)
2. Tag assignment – adding a set of possible tags to each word without looking at the context (WORDTAG)
3. Idiom tagging – looking at specific contexts to limit the tags (IDIOMTAG)
4. Tag disambiguation – assigning the most probable tag to words that received more than one tag at stage 2 (CHAINPROBS)
5. A post-editing phase – a manual process in which errors made by the computer are corrected, followed by a reformatting stage (LOBFORMAT)
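The idea behind stages 2 and 4 can be shown with a toy example. The sketch below is not the CLAWS program: the lexicon and the probabilities are invented for illustration, unknown words receive a placeholder tag, and each tag is chosen greedily from the previous tag only, which is a simplification of choosing the most probable tag.

```python
# Hypothetical mini-lexicon: every tag a word form may carry, context ignored (stage 2).
LEXICON = {
    "the": ["AT0"],
    "her": ["DPS", "PNP"],
    "boots": ["NN2", "VVZ"],
    "polish": ["VVB", "NN1"],
}

# Invented probabilities that a tag follows the previous tag (stage 4).
TRANSITION = {
    ("START", "VVB"): 0.6, ("START", "NN1"): 0.4,
    ("VVB", "DPS"): 0.7, ("VVB", "PNP"): 0.3,
    ("DPS", "NN2"): 0.9, ("DPS", "VVZ"): 0.1,
    ("AT0", "NN2"): 0.9, ("AT0", "VVZ"): 0.1,
}

def assign_tags(words):
    """Stage 2: attach all candidate tags; unknown words get the placeholder UNC."""
    return [(word, LEXICON.get(word.lower(), ["UNC"])) for word in words]

def disambiguate(candidates):
    """Stage 4 (simplified): keep the tag most probable after the previous tag."""
    previous = "START"
    result = []
    for word, tags in candidates:
        best = max(tags, key=lambda tag: TRANSITION.get((previous, tag), 0.0))
        result.append((word, best))
        previous = best
    return result

print(disambiguate(assign_tags("polish her boots".split())))
# [('polish', 'VVB'), ('her', 'DPS'), ('boots', 'NN2')]
```

The words polish, her and boots each have two candidate tags after stage 2; the context decides which one remains.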

TEI

The Text Encoding Initiative (TEI) is sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing, and the Association for Computers and the Humanities. Its aim is to develop standardized formats for machine-readable texts. TEI uses a document markup language known as SGML (Standard Generalized Markup Language). For more information about SGML visit http://www.w3.org/MarkUp/SGML/

The corpus encoding standard is available at http://www.cs.vassar.edu/CES/

3. Examples of corpora

3.1. Internet resources with monolingual corpora

Rapid development in corpus linguistics, in parallel with the development of communication technologies, has resulted in a large number of websites with links to various corpora. Building a corpus and researching it is still a time-consuming business, so existing corpora are carefully maintained and protected. Creating a website with links to corpora, however, is easy, and it takes seconds to move a website from one server to another. Therefore, the aim of this chapter is to present existing corpora in many languages together with links to them. The author is aware that by the time the book is complete some of the links will have disappeared, and that by the time the reader gets the text plenty of them will be inactive; but others, even better and more complete, will have been launched. So, instead of getting annoyed with an inactive link, Dear Reader, use a search engine (such as google.com, kartoo.com, iboogie.com, hotboot.com) and type a key word: “corpus linguistics”, “multilingual corpora”, “language corpus”, and you will find the latest news about corpora. An extensive list of corpora is available at: http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/corpora/list/index2.html

Examples of corpora in various languages:

1. English
- The Bank of English – written and spoken English (used extensively by researchers and for the COBUILD series of English language books)
- The BNC (British National Corpus) – written and spoken British English (used extensively by researchers, and by the Oxford University Press, Chambers and Longman publishing houses)
- CANCODE (Cambridge Nottingham Corpus of the Discourse of English) – spoken British English (used extensively by researchers and Cambridge University Press)
- ICE (International Corpus of English) – international varieties of spoken and written English (most of the corpus is not yet available)
- Brown University Corpus & LOB (Lancaster-Oslo-Bergen) Corpus – parallel corpora of written texts (but now rather outdated)
- London-Lund Corpus (Survey of English Usage) – spoken British English (used very extensively by researchers, but it is now quite old)
- Santa Barbara Corpus – spoken American English (most of the corpus is not yet available)
- Hong Kong Corpus of Spoken English (still being compiled, 1 million out of the target 1.5 million words have been collected so far)
- ICAME (International Computer Archive of Modern English) – a centre which aims to coordinate and facilitate the sharing of computer-based corpora
- Translational English Corpus (CTIS) – the first and only computerised corpus of translated English in the world (currently approaching 10 million words). It has spearheaded the development of a unique research methodology which has informed the work of several PhD students and various research programmes around the world. Online access: http://ronaldo.cs.tcd.ie/tec/jnlp/

2. French
- Association des Bibliophiles Universels – various French literary works. Online access: http://cedric.cnam.fr/ABU/
- American and French Research on the Treasury of the French Language (ARTFL) – 150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap). Available at: http://humanities.uchicago.edu/ARTFL/ARTFL.html

3. German
- COSMAS Corpus, available at: http://corpora.ids-mannheim.de/~cosmas/. Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection http://www.ids-mannheim.de/kt/corpora.shtml
- NEGRA Corpus – Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Available free of charge to academics. 20,000 sentences, tagged, and with syntactic structures. Online access: http://www.coli.uni-sb.de/sfb378/negra-corpus/

4. Polish
- PWN corpus, available at: http://www.pwn.com.pl
- PELCRA Corpora
- IPI PAN corpus www.korpus.pl
- Spoken corpus collected by Agnieszka Otwinowska-Kasztelanic
- Polish Virtual Library (Polska Biblioteka Internetowa), can be accessed from: http://www.pbi.edu.pl/

5. Russian
- Library of Russian Internet Libraries – various literary works. Online access: http://www.orc.ru/~patrikey/liblib/enauth.htm

6. Spanish and Portuguese
- TychoBrahe Parsed Corpus of Historical Portuguese – over a million words of Portuguese from different historical periods, some of it morphologically analysed/tagged. Available free of charge at: http://www.ime.usp.br/~tycho/corpus/
- Folha de S. Paulo newspaper – 4 annual CD ROMs with full text. Online access: http://www.publifolha.com.br/
- COMPARA – Portuguese-English parallel corpus. (In general, various resources at Linguateca site). Available at: http://www.linguateca.pt/COMPARA/Welcome.html

3.2. Multilingual and parallel

- OPUS – An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals. Online access: http://logos.uio.no/opus/
- Searchable Canadian Hansard French-English parallel texts (1986-1993). Online access: http://www-rali.iro.umontreal.ca/TransSearch/TS-simple-uen.cgi
- European Union web server – Parallel text in all EU languages. Online access: http://europa.eu.int/
- TELRI CD-ROMs. Online access: http://www.telri.bham.ac.uk/cdrom.html
- Parallel and other text in central and eastern European languages. Online access: http://stp.ling.uu.se/~corpora/

What follows is an overview of information about parallel corpora that is available on the web.

- ACL SIGLEX Parallel Corpora. Online access: http://www.clres.com/paracorp.html

A collection of links to publicly available parallel corpora. The collection is maintained by Ken Litkowski of the ACL Special Interest Group on the Lexicon.

- The BAF Corpus. Online access: http://www-rali.iro.umontreal.ca/arc-a2/BAF/

An aligned French-English corpus containing approximately 450,000 words per language from different sources. Supplied by Michel Simard of the Laboratoire de Recherche Appliquée en Linguistique Informatique in Montreal, Canada (in French).

- INTERSECT: Parallel Corpora and Contrastive Linguistics. Online access: http://www.brighton.ac.uk/edusport/languages/html/intersect.html

A project at the University of Brighton, United Kingdom in which parallel texts in French and English are being constructed and analysed.

- The Lingua Project. Online access: http://spraakbanken.gu.se/lb/pedant/parabank/node4.html

An excellent description of the Lingua Parallel Concordancing Project which aims at managing a multilingual corpus to ease students' and teachers' work in second language learning. 11 organisations from 6 different countries participate in this project.

- Linguistic Data Consortium. Online access: http://www.ldc.upenn.edu/

The Linguistic Data Consortium at the University of Pennsylvania, USA supplies a big parallel corpus of United Nations texts in English, French and Spanish.

- Michael Barlow's Parallel Corpora Page. Online access: http://www.ruf.rice.edu/~barlow/para.html

An overview page about the global research in parallel corpora. Michael Barlow also maintains a general Corpus Linguistics page at: http://www.ruf.rice.edu/~barlow/corpus.html

3.3. Other Internet resources – potential corpora

- Project Gutenberg
- The Oxford Text Archive
- CETH - Directory of Electronic Text Centers
- The EServer: Accessible Online Publishing
- IPL Online Text Collection (Internet Public Library)
- BiVir - Virtual Library (Universal Literature in Galician language)
- Galician Virtual Library (Galician Literature)
- Vercial Project (Portuguese Literature)
- The WWW Virtual Library: Linguistics
- The applied linguistics WWW virtual library
- The World Lecture Hall
- Virtual Reference Collection

Conclusion

Many free and commercial resources are available on the Internet, but there is still a need for more.

Part 2. Tools for computerizing a corpus

Corpora are machine-readable; nowadays they are all stored as digital data. In order to make use of them, we need tools for searching, annotating and analysing. Such tools must have two main features: they have to be user-friendly for researchers, that is, linguists, and they have to be effective in handling textual (not numerical) data. Creating such tools is a challenge for teams of linguists, software analysts and designers, and it is a part of computational linguistics.

On the one hand, a language has to be described in a way that can be stored in digital form. This is not easy, considering the complexity of language, its flexibility and ambiguity, the unlimited number of possible utterances, change over time and the unlimited creativity of its users. What is more, linguistic researchers have to identify their needs clearly in order to provide programmers with clear guidelines for the project design. On the other hand, the interface has to be user-friendly and intuitive even for an inexperienced computer user, and access to the data has to be fast and reliable. Handling large collections of texts in various languages, annotating them, and then searching and analysing them according to different research questions is a challenge for software analysts.

As some readers may not have any background in computer science, let me remind them that the only thing a computer can do is perform instructions, nothing else. Rundell and Stock (1992) stated that software empowers linguists to “focus their creative energies to doing what machines cannot do”. In this part various tools for extracting linguistic information from corpora are presented: concordance programmes, parsers, taggers and other tools used in corpus linguistics.

4. Concordance programs

4.1. The nature of words

Studying the environments: linguistic and computerised

Nature of texts

Linguistic terms Computer-readable form

Letters Characters Words Words Phrases Lines Clauses Sentences Paragraphs

The logical organization is the same but the physical form is different.

Concordance programmes make it possible to see the context of an individual keyword. A keyword (node word) is a string of characters.

Concordance programmes find all occurrences of the keyword in the corpus and present the results of the search in an appropriate format.

There are various display options available:
- Variable length KWIC (Keyword in Context)
- Sentence context
- Paragraph context
- Whole text browsing
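The KWIC display can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular concordance program: it measures the context in characters, matches whole words regardless of case, and the function name kwic and the sample sentence are chosen here.

```python
import re

def kwic(text, keyword, width=30):
    """Return one display line per occurrence, with the keyword aligned in the middle."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

TEXT = "The cat sat. Then the dog barked at the cat."
for line in kwic(TEXT, "the", width=15):
    print(line)
```

Changing the width parameter gives the variable length context; sentence or paragraph context would require splitting the text into those units first.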

Concordance software is usually insensitive to case; for example, the and The are recognized as the same word. Multiple-keyword patterns use a wild card symbol (*), which stands for any sequence of characters before or after the letters typed. For example, the keyword part* gives the following results: parted, partiality, participated, particulars, partly, party, etc.
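A wild card search of this kind can be imitated by translating the pattern into a regular expression. The sketch below assumes that * stands for any run of letters or digits within one word and that case is ignored; the function names are illustrative only.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a keyword containing * into a case-insensitive whole-word regex."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + r"\w*".join(parts) + "$", re.IGNORECASE)

def match_words(pattern, words):
    """Return the distinct matching word forms, lower-cased and sorted."""
    regex = wildcard_to_regex(pattern)
    return sorted({word.lower() for word in words if regex.match(word)})

WORDS = ["Parted", "party", "depart", "partly", "apart", "part", "Party"]
print(match_words("part*", WORDS))  # ['part', 'parted', 'partly', 'party']
print(match_words("*part", WORDS))  # ['apart', 'depart', 'part']
```

Lower-casing the results shows the case insensitivity: Party and party are reported as one word.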

Sorting options are the following:
- The sequence in which the lines occurred in the original texts
- The keyword, if wild cards are used
- The words to the right or to the left of the keyword (e.g. without a direct object on the right or with a direct object)
- User-allocated category (e.g. nouns or verbs on the right)

Output options are as follows:
- Direct output to a printer
- Output to a file

Other options:
- Line numbers
- Text reference
- Frequency of collocations
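Two of these options, sorting by the neighbouring words and counting collocations, can be sketched as follows. The example is only an illustration under simple assumptions: the text is already split into words, a concordance line is a triple of left context, keyword and right context, and all function names are chosen here.

```python
from collections import Counter

def concordance(tokens, keyword):
    """Return (left context, keyword, right context) triples in text order."""
    target = keyword.lower()
    return [(tokens[:i], token, tokens[i + 1:])
            for i, token in enumerate(tokens) if token.lower() == target]

def sort_by_right(lines):
    """Sort lines alphabetically by the words to the right of the keyword."""
    return sorted(lines, key=lambda line: [t.lower() for t in line[2]])

def sort_by_left(lines):
    """Sort lines by the words to the left, nearest word first."""
    return sorted(lines, key=lambda line: [t.lower() for t in reversed(line[0])])

def collocates(lines, span=1):
    """Count the words found within `span` positions of the keyword."""
    counts = Counter()
    for left, _, right in lines:
        counts.update(t.lower() for t in left[-span:] + right[:span])
    return counts

TOKENS = "she put the crown on his head and they put a robe on him".split()
LINES = concordance(TOKENS, "put")
print([line[2][0] for line in sort_by_right(LINES)])  # ['a', 'the']
print(collocates(LINES, span=1))
```

Sorting by the right context brings together lines in which the keyword is followed by the same word, which is what makes recurrent patterns visible.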

4.2. Examples of software

Here are some examples of software:

a. Wordsmith tools – demo 3.0. The demo version, with a limited number of output lines, is free. http://www1.oup.co.uk/elt/catalogue/Multimedia/WordSmithTools3.0/download.html
   Wordsmith tools demo 4.0 http://www.lexically.net/wordsmith/version4/
   i. Concord
   ii. Word list
   iii. Keyword
b. Microconcord demo http://www.liv.ac.uk/~ms2928/software/ Demo is free.
c. MonoConc – Demo http://www.camsoftpartners.co.uk/monoconc.htm http://www.athel.com/mono.html#monopro
d. Simple concordance program http://www.textworld.com/scp/ Free.
e. AntConc http://www.antlab.sci.waseda.ac.jp/ Free.

4.3. Internet concordancers

1. WebCorp http://www.webcorp.org.uk/index.html
2. Glossanet http://glossa.fltr.ucl.ac.be/
3. KwicFinder http://www.kwicfinder.com/KWiCFinder.html
4. WebConc http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi?sprache=en&art=google
5. Lexware Culler http://82.182.103.45/lexware/concord/culler.html

5. Taggers

a. Brill’s part of speech tagger. Online access: http://www.cs.jhu.edu/~brill/
b. Free CLAWS WWW trial service. Online access: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
c. Free trial service at Connexor. Online access: http://www.connexor.com/demos/index.html

6. Parsers

a. Link grammar. Online access: http://www.link.cs.cmu.edu/link/
b. Apple pie parser. Online access: http://nlp.cs.nyu.edu/app/
c. ENGCG: Constraint Grammar Parser of English. Online access: http://www.lingsoft.fi/cgi-bin/engcg

7. Other tools

7.1. NLP tools

Here are examples of various NLP (Natural Language Processing) tools available on the Internet:

Paul Nation RANGE Online access: http://www.vuw.ac.nz/lals/staff/paul-nation/nation.aspx

Kolokacje - - http://www.mimuw.edu.pl/polszczyzna/

Software Tools http://lingo.lancs.ac.uk/devotedto/corpora/software.htm

TextLadder http://www.readingenglish.net/software/ordinarylicense.htm

Tools for Natural Language Processing

Multilingual Corpus Toolkit – on a CD

Paul Nation VocabProfile http://www.er.uqam.ca/nobel/r21270/cgi-bin/webfreqs/web_vp.cgi http://www.er.uqam.ca/nobel/r21270/cgi-bin/webfreqs/vp_research.html

WordCruncher Natural language toolkit http://linguistlist.org/issues/15/15-610.html#1

TACT: Textual Analysis Computing Tools. http://www.chass.utoronto.ca/tact/TACT/tact0.html http://www.indiana.edu/~letrs/help-services/QuickGuides/about-tact.html

TACTWeb http://tactweb.humanities.mcmaster.ca/tactweb/doc/tact.htm

Text analysis software http://etext.lib.virginia.edu/textual.html

Provalis Research for analysis http://www.simstat.com/home.html

WORDSTAT v4.0 - Content Analysis & Text Mining Module for Simstat and QDA Miner

Protan - content analysis http://www.psor.ucl.ac.be/protan/protanae.html

7.2. Potential tools for NLP

SIMSTAT v2.5 - Statistical software

QDA Miner v1.0 - Qualitative Data Analysis software

ORIANA v2.0 - Statistical analysis for circular data

MVSP v3.1 - MultiVariate Statistical Program (PCA, correspondence, cluster, etc.)

PracticeMill v1.2 (for teachers, trainers, & students)

ITALASSI v1.1 - Interaction Viewer FREE!

STATITEM v1.2 - Classical Item Analysis Module

EASY FACTOR ANALYSIS v4.0 - Principal Component/Factor Analysis

7.3 Tools for Data Driven Learning - DDL

Tom Cobb The Compleat Lexical Tutor http://www.corpus-linguistics.de/ddl/ddl.html free.


Recommended publications