Methods and Tools for Corpus Lexicography
Ole Norling-Christensen, København
Abstract

A survey is given of some technical aspects of the theory and practice of building and using text corpora for dictionary making. The survey builds on recent, especially Anglo-Saxon, literature, as well as on the experience of the editorial team of The Danish Dictionary.

1. Introduction

Work on the mainly corpus-based, six-volume dictionary of contemporary Danish[1] was initiated in September 1991. Since then, a corpus of 40 million words (i.e. tokens) covering all kinds of general language has been collected, and each of the c. 40,000 text samples has been annotated according to a rather elaborate text typology. In parallel, methods for the reuse of existing lexical sources have been developed, and a database, The Word Bank (Duncker, forthcoming), of morphological, morphosyntactic, semantic and contextual information on more than 300,000 words (i.e. lemmas) has been extracted and constructed from the machine-readable versions of some standard printed dictionaries, supplemented with corpus evidence. Work on word class tagging of the corpus is (September 1993) in its initial phase, as is the writing of dictionary entries.

2. Types of corpus - types of tool

As pointed out by the Text Encoding Initiative (TEI 1993: 3), the term language corpus is used to mean a number of rather different things. However, for TEI, as well as for the purpose of this paper, the only distinguishing feature of a corpus that really matters is that its components have been selected or structured according to some conscious set of design criteria. A similar corpus definition is given by Atkins et al. (1992: 1) who, partly building on earlier work by Quemada, distinguish four types of machine-readable text collection:

[1] Den Danske Ordbog, hereinafter called DDO.
Proceedings of NODALIDA 1993, pages 187-196

- archive: a repository of readable electronic texts not linked in any coordinated way;
- electronic text library (ETL): a collection of electronic texts in a standardized format with certain conventions relating to content etc., but without rigorous selectional constraints;
- corpus: a subset of an ETL, built according to explicit design criteria for a specific purpose; and
- subcorpus: a subset of a corpus, either a static component of a complex corpus or a dynamic selection from a corpus during on-line analysis.

This distinction mirrors an increasing degree of refinement of the textual material; and as different tools are needed for the different levels of refinement, as well as for getting from one level to the next, the distinction proves useful for classifying corpus tools as well. It is thus evident that computational tools are needed not only for corpus analysis, but also for the creation and preprocessing of corpora. And the more information, in the form of linguistic as well as extralinguistic annotation according to the design criteria, that is added to the text samples of the corpus during preprocessing, the more sophisticated the analyses that can be made.[2] Obviously, the tools used for preprocessing and the tools used for analysis must be compatible, by which is meant, among other things, that the analysis tools must conform to the format of the preprocessed texts and be able to make use of the complementary information of the annotations. Useful guidelines for defining such formats can be found in the SGML standard as it is utilized by the Text Encoding Initiative (TEI P2, 1992-).

2.1. From text archive to text library

The text archive may include all kinds of word processor and typesetter files.
Changing them into the standardized format of a text library implies not only homogenizing the character set, but also deciding which of the features represented by formatting codes in the files should be preserved, and which should not. It will, for instance, hardly be relevant to keep information on specific typefaces; on the other hand, it might be useful to keep some generic information on e.g. headlines and emphasis (leaving out information on whether the emphasis was represented by italics, underlining, boldface or small caps). For this job, text converting programs are needed. They may be supplied as functions of the source word processing program; but if this is not known or not available, rather much text-specific programming may be needed. At this stage, each text should also be annotated with at least the directly available bibliographical information, preferably in the form of an SGML header element. One may also choose even now to include text typological information like genre, subject and sender-receiver relationship, sociological information like sex, age and education of the author(s), and linguistic information on e.g. language variants. Whether these annotations are made now or during the subsequent phase of corpus making, they have to be made in a standardized form and, whenever possible, with their values taken from a limited set of options. Only then will they be useful for computational processing. Some kind of syntax and content checking device is therefore needed during the process of annotation.

[2] One should, however, keep in mind that tagging a corpus according to e.g. a grammatical theory is likely to make the corpus less usable for testing other theories. Further, there is the risk of circularity: the evidence you get out of your corpus may be, more or less, only what you yourself put into it.
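Such a syntax and content checking device can be very simple in principle: each annotation field is matched against a closed set of permitted values. The following sketch illustrates the idea; the field names and value sets are invented for illustration and are not those actually used by The Danish Dictionary.

```python
# Minimal sketch of a syntax-and-content checker for header annotations.
# Field names and value sets are hypothetical, chosen only to illustrate
# the principle of restricting annotations to a limited set of options.

ALLOWED = {
    "sex": {"male", "female", "unknown"},
    "medium": {"book", "newspaper", "television", "radio"},
    "language_variant": {"standard", "regional"},
}

def check_annotation(annotation: dict) -> list:
    """Return a list of error messages; an empty list means the annotation is valid."""
    errors = []
    for field, value in annotation.items():
        if field not in ALLOWED:
            errors.append(f"unknown field: {field}")
        elif value not in ALLOWED[field]:
            errors.append(f"illegal value {value!r} for field {field!r}")
    return errors

print(check_annotation({"sex": "female", "medium": "telefax"}))
# → ["illegal value 'telefax' for field 'medium'"]
```

Because every value is drawn from a fixed inventory, annotations checked this way can later be used directly as selectional criteria in corpus building.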
2.2. From text library to corpus

Making one or more corpora out of a text library implies selecting, in a balanced way, samples of the library texts in order to fit the specific purpose of the corpus. If information on the text types etc. of the library is available in standardized, electronic form, such selection may be done more or less automatically. If this is not the case, it is now time to make these annotations. For checking the balance according to different criteria (i.e. annotations) and combinations of criteria, a statistical tool is needed for computing the relative sizes of the texts belonging to different classes. In addition to the annotations related to selectional criteria, the scope of which will typically vary from the entire corpus down to an entire text sample, the corpus design criteria may also define kinds of additional information that must be added at lower levels, the scope being individual sentences, phrases, words, or even parts of words. The object of these kinds of annotation will normally be some kind of computational and/or computer-aided analysis; consequently, the annotation system has to be thoroughly formalized.

3. A standard format for corpus samples[3]

The international standard SGML (ISO 8879, 1986) for the generic description of textual structures and for marking up texts accordingly is used by The Danish Dictionary for describing and tagging not only the dictionary but also the corpus. Readers who are not familiar with SGML and with terms like DTD, element, attribute and entity reference may consult the brilliant introduction in (TEI 1990: 9-32). For the corpus, an SGML document type CorpusEntry has been defined. It provides a suitable form for the registration of the necessary (extralinguistic) information about the text, as well as a means for unambiguous tagging of those (linguistic) features of the text proper that we have decided to represent in the corpus.

[3] An earlier version of this section was part of Norling-Christensen 1992.
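The statistical balance check mentioned above amounts to summing token counts per value of a classification criterion and reporting each class's share of the whole. A minimal sketch, with invented sample data and field names:

```python
from collections import defaultdict

# Sketch of a corpus balance check: token counts are summed per value of a
# chosen classification criterion and reported as proportions of the total.
# The sample records and their fields are invented for illustration.

samples = [
    {"medium": "book", "genre": "fiction", "tokens": 12000},
    {"medium": "newspaper", "genre": "news", "tokens": 6000},
    {"medium": "book", "genre": "non-fiction", "tokens": 2000},
]

def relative_sizes(samples, criterion):
    """Proportion of tokens per class of the given criterion."""
    totals = defaultdict(int)
    for s in samples:
        totals[s[criterion]] += s["tokens"]
    grand_total = sum(totals.values())
    return {cls: tokens / grand_total for cls, tokens in totals.items()}

print(relative_sizes(samples, "medium"))
# → {'book': 0.7, 'newspaper': 0.3}
```

Running the same function over several criteria (and combinations of them) shows at a glance where the sampled material deviates from the design targets.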
Each sample (element CorpusEntry) consists of a Header, which contains information on the kind and provenance of the text, and the Text proper. In the language of SGML:

<!DOCTYPE CorpusEntry [
  <!ELEMENT CorpusEntry (Header, Text) >
]>

3.1. The Header

For designing the Header and deciding which information types should form part of it, we found much inspiration in Atkins (1992). The Header of each corpus entry is divided into two main parts, viz. information on the source (SourceInfo) and information on the text sample proper (TextDescription).

SourceInfo consists of an unambiguous identification (TextGroup/TextNumber), notes on restrictions of use imposed by the supplier of the text (private or confidential texts), information on those who produced the text (LanguageUser = authors, speakers), and on title, publisher, date of origination, and location (e.g. page number). There is one element LanguageUser for each person involved in the production of a given text sample; especially in spoken language there will usually be more than one. The element describes the person's name, role (e.g. interviewer or interviewee), sex, education, occupation, year and place of birth, and language variant (i.e. standard or regional Danish).

The element TextDescription gives an account of the language type (general or special), whether the text is written or spoken, and public (reception) or private (production), the age relation between sender and receiver of the text (adult-adult, adult-juvenile, adult-child, juvenile-adult, etc.), medium (book, newspaper, television etc.), genre, subject field, and the size of the sample.

The full structure and contents of the header can be explained in the following way. A question mark (?) after an element name means that the element is optional, i.e. it is present only if it is relevant and if the information in question is known. The plus (+) after LanguageUser means that there may be one or more of these elements in a single header.
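The SGML occurrence indicators have direct analogues in most typed data models: "?" corresponds to an optional field, "+" to a non-empty list. The following sketch models a few of the header elements described above in Python terms; the attribute names are simplified and only a subset of the fields is shown.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the CorpusEntry Header in Python terms: the SGML "?" indicator
# becomes an Optional field, the SGML "+" indicator a non-empty list.
# Only a few illustrative fields are modelled; names are simplified.

@dataclass
class LanguageUser:
    name: str
    role: Optional[str] = None   # "?": e.g. "interviewer" or "interviewee"
    sex: Optional[str] = None    # "?": present only if known

@dataclass
class Header:
    text_group: str              # unambiguous identifier of a text group
    text_number: int             # serial number inside the text group
    language_users: list         # "+": one or more LanguageUser elements

h = Header("TG001", 1, [LanguageUser("N.N.", role="interviewee")])
print(h.language_users[0].role)
# → interviewee
```

Such a model is useful when corpus software needs to validate or query headers after they have been parsed out of the SGML source.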
Header
  SourceInfo
    TextGroup       Unambiguous identifier of a group of (related) corpus entries
    TextNumber      Serial number inside the text group
    Restriction?
      RestrictA     Proper names in text must be altered: "Y[es]"/"N[o]"
      RestrictB     Text must only be used for the dictionary: "Y[es]"/"N[o]"
      Expiration    Expiration of Restriction B, e.g. "1998"
    LanguageUser+   (one element for each author/speaker)
      Role?         Esp. when more language users are involved; e.g.