<<

Encoding standards for large text resources: The Text Encoding Initiative Nancy Ide

LaboratoirE Parole et Langage Department of Computer Science CNRS/Universitd de Provence Vassar College 29, Avenue R.obert Schuman Poughkeepsie, New York 12601 (U.S.A.) 13621 Aix-en-Provence Cedex 1 (France)

e-maih ide@cs, vassar, edu

Abstract. The Text Encoding Initiative (TEl) is an considerable work to develop adequately large and international project established in 1988 to develop appropriately constituted textual resources still remains. guidelines for the preparation and interchange of The demand for extensive reusability of large text electronic texts for research, and to satisfy a broad range collections in turn requires the development of of uses by the language industries more generally. The standardized encoding formats for this data. It is no longer need for standardized encoding practices has become realistic to distribute data in ad hoc formats, since the inxreasingly critical as the need to use and, most eflbrt and resources required to clean tip and reformat the importantly, reuse vast amounts of electronic text has data for local use is at best costly, and in many cases dramatically increased for both research and industry, in prohibitive. Because much existing and potentially particular for natural language processing. In January available data was originally formatted R)r the purposes 1994, the TEl isstled its Guidelines for the Fmcoding and of printing, the information explicitly represented in the hiterehange of Machine-Readable Texts, which provide encoding concerns a imrticular physical realization of a standardized encoding conventions for a large range of text rather than its logical strttcture (which is of greater text types and features relevant for a broad range of interest for most NLP applications), and the applications. correspondence between the two is often difficult or impossihle to Establish without substantial work. Keywords. Encoding, markup, large text resources, Further, as data become more and more available and tile corpora, SGML. USE of large text collections become more central to NLP research, general and publicly awdlable software to 1. Introduction manipt, late tile texts is being developed which, to he itself reusable, also requires the existence of a standard The past few years have seen a burst of activity in the encoding format. development of statistical methods which, applied to A standard encoding format adequate for representing massive text data, have in turn enabled the dcvelopnmnt textual data for NLP research must be (1) capable of of increasingly comprehensive and robust models of representing the different kinds of information across the language structure and use. Such models are increasingly spectrum of text types and languages potentially of recognized as an inwduable resource for natural langu:lge interest to tile NLP research community, including prosE, processing (NLP) tasks, inch,ding machine translation. technical documents, newspapers, verse, drama, letters, The upsurge of interest in empricial methods for dictionaries, lexicons, etc.; (2) capable of representing language modelling has led inevitably to a need for different levels of information, including not only massive collections of texts of all kinds, including text physical characterstics and logical structure (as well as collections which span genre, register, spoken and other more complex phenomena such as intra- and inter- written data, etc., as well as domain- or application- textual references, aligtunent of parallel elements, etc.), specific collections, and, especially, multi-lingual but also interpretive or analytic annotation which may be collections with parallel translations. In tile latter half of added to the data (for exainple, markup for part of speech, the 1980's, very few appropriate or adequately large text syntactic structure, Etc.); (3) application independent, that collections existed for use in computational linguistics is, it must provide the required flexibility and generality research, especially for languages other than English. to enable, possibly siumltaneously, the explicit encoding Consequently, several efforts to collect and disseminate of potentially disparate types of information withiu thc large mono- and multi-lingual text collections have been same text, as well as accomodate all potential types of recently established, including the ACL Data Collection processing. The development of such a suitably flexible Initiative (ACL/DCI), the European Corpus Initiative and cornprehensivE encoding system is a substantial (ECI), which has developed a multilingual, partially intellectual task, demanding (just to start) the parallel corpus, the U.S. Linguistic Data Consortium development of suitably complex models for the wirious (LDC), RELATOR and MULTEXT in EuropE, etc. (see text types as well as an overall model of text and ,'m Arm.strong-Warwick, 1993). It is widely recognized that architecture for the encoding scheme that is to embody it. such efforts constitute only a beginning for the necessary data collection and dissemination efforts, and that

574 2. The Text Encoding Initiative encoding schemes, which usually could only be used for the data for which they were designed. In many cases, In 1988, the Text Encoding Initiative ('I'EI) was there had been no prior analysis of the required categories established as an international co-operative research and features and the relations among them for a given project to develop a general and tlexible set of guidelines text type, in the light of real and potential processing and for the preparation and interchange of electronic texts. analytic needs. The TEl has motiwlted and accomplished The TEI is jointly sponsored by the Association lot the substantial intellectual task of completing this Computers and the Hmnanities, the Association for analysis for a large number of text types, and provides Computational Linguistics, and the Association for encoding conventions based upon it for describing tile Literary and Linguistic Computing. The proiect has had physical and logical structure of many classes of texts, as major support from the I_I.S. National Endowment for well as features particular to a given text type or not tile Humanities (NEH), 1)irectorate XIII of the conventionally rep,csented in typogral)hy. The TEl Cmmnission of tile F.tlropean Colmnunities (CIJ.C/I)(J - Guidelines also cover COlllnlOll text encoding problems, XIII), tile Andrew W. Mellon Foundation, and tile Social including intra- and inter-textual cross reference, Science and tlumanities Research Council of Canada. demarcation of arbitrary text segments, alignment of parallel elements, overlapping hierarchies, etc. In In January 1994, the "I'E]~ issued its Guidelines for the Encoding and hiterchange of Machine-Readable Texts, addition, they provide conventions for linking texts to acoustic and visual data. which provide standardized encoding conventions for a large range of text types and features relevant for a broad The TEI's specific achievements include: range of applications, inchlding natural language processing, infommtion retrieval, hypertext, electronic 1. a determination that the Standard Generalized publishing, various forms of literary and historical Markup I4mguage (SGML) is tile framework for analysis, lexicography, etc. The Guidelines are intended development of the Gnklelines; to apply to texts, written or spoken, in any natural 2. the specification of restrictions on and language, of any date, in any genre or text type, without recommendations for SGML use that best serves restriction vn form or content. They treat both the needs of interchange, as well as enables maximal continuous materials (rttnning text) and discontinuous generality and flexibility in order to serve the widest materials such as dictionaries and linguistic corpora. As possible range of research, develol)ment , and such, the TEl Guidelines answer the fundamental needs of application needs; a wide range of users: researcher~'; in cmnputational 3. analysis and identification of categories and features linguistics, the humanities, sciences, and social for encoding textual data, at many levels of detail; sciences; lmblishers; librarians and those concerned 4. specification of a set of general text structure generally with document retrieval and storage; as well as del'inititms that is effective, flexible, and extensihle; the growing language technology community, which is 5. specification of a method for in-file documentation amassing sttbstantial multi-lingual, multi-modal corpora of electronic texts compatible with library of spoken and written texts and lexicons in order to cataloging conventi(ms, which can he used to trace advance research in ht, man hmguage understanding, the history of the texts and thus assist in production, and translation. authenticating their provenance and the modifications they have undergone; "File rules and recommendations made in the 'I'I:A 6. specification of encoding conventions for special Guidelines conform to the ISO 8879, which defines the kinds of texts (n" text features, including: Standard Generalized Markup 1.,anguage, and IS() 646, which defines a standard seven-hit character set in terms a. char~lcter sets of wt,ich tile recommendations on character-level b. language COlp(+rit interchange are formulated, l SGMI, is an increasingly c. general linguistics widely recognized international markup standard which d. dictionaries has been adopted by the US Department of Defense, the e. terminological data Coinmission of Et,ropcan Communities, and ntmlerous f. spoken texts publishers and holders of large public databases. g. hypermedkl h. literary prose 2.1. Overview i. verse j. drama Prior to the establishment of the TEI, most projects k. historical source materials involving the capture and electronic representation of 1. text critical apparatus texts and other linguistic data developed their own 3. Basic architecture of the TEl scheme

1 For more extensive discussion of tim project's history, 3.1. General architecture rationale, and design principles see Tlil internal The 'I'EI Guidelines are built on the assumption that documents EDP1 and EDP2 (awdlable fi'om the TEl) and there is a common core of textual features shared by Ide and Sperberg-McQueen (1994) and Burnard and vMually all lexts, beyond which many different elements Sperberg-McQueen (1994), both forthcoming in a special can be encoded. Therefore, the Guidelines provide an triple issue on the TEl in Con,puters and the extensible framework containing a common core (~l Humanities.

575 features, a choice of frameworks or bases, and a wide 3. dntma variety of optional additions for specific applications or 4. transcribed speech text types. The encoding process is seen as incremental, 5. letters or memos so that additional markup may be easily inserted in the 6. dictionary entries text. 7. terminological entries 8. language corpora and collections Because the TEl is an SGML application, a TEI conformant document must be described by a document Tbe first seven are intended for documents which are type definition (DTD), which defines tags and provides a predominantly composed of one type of text; the last is BNF grammar description of the allowed structural provided for use with texts which combine these basic relationships among them. A TEl DTD is composed of tagsets. Additional base tag sets will be provided in tile the core tagsets, a single base tagset, and any number of flrtnre. user selected additional tagsets, built up according to a set of rules documented in tile TEI Guidelines. Each TEl base tagset determines the basic structure of all the documents with which it is to be used. Specifically, At the highest level, all TEI documents conform to a it defines the components of text elements, combined as common model. The basic unit is a text, that is, any described above. Almost all the TEI bases defined are single document or stretch of natural language rcgm'ded as similar in their basic structure (although they can vary if a self-contained unit for processing purposes. The necessary). However', they differ in their components: for association of such a unit with a header describing it as a example, the kind of sub-elements likely to appear bibliographic entity is regarded as a single TEI element. within the divisions of a dictionary will be entirely Two variations on this basic structure are defined: a different from those likely to appear within the divisions collection of TEl elements, or a variety of composite of a letter or a novel. To accomodate this variety, the texts. The first is appropriate for large disparate constituents of all divisions of a TEl text element arc not collections of independent texts such as language corpora defined explicitly, but in terms of SGML parameter or collections of unrelated papers in an archive. The entities, which behave similar to a variable declaration in second applies to cases such as the complete works of a a programming language: the effect of using them bere is given author, which might be regarded simultaneously as that each base tag set can provide its own specific a single text in its own right and as a series of definition for the constituents of texts, which can be independent texts. modified hy the user if desired.

It is often necessary to encode more than one view of a 3.3. The core tagsets text--for example, the pbysical and the lingtfistic or the formal and the rhetorical. One of the essential features of Two core tagsets are available to all TEl documents the TEl Guidelines is that they offer the possibility to unless explicitly disabled. Tile first defines a large encode many different views of a text, simultaneously if number of elements which may appear in any kind of necessary. A disadvantage of SGML is that it uses a document, and which coincide more or less with the set document model consisting of a single hierarchical of discipline-independent textual features concerning structure. Often, different views of a text define multiple, which consensus has been reached. The second defines the possibly overlapping hierarchies (for example, the header, providing something analogous to an electronic physical view of a print version of a text, consisting of title page for the electronic text. pages sub-divided into physical lines, and the logical Tire core tagsel common to all TEl bases provides means view consisting of, for example, paragraphs sub-divided of encoding with a reasonable degree of sophistication the into sentences) which are not readily accomodated by following list of textual features: SGML's document model. The TEI has identified sever:d possible solutions to this problem in addition to 1. Paragraphs SGML's concurrent structures mechanism, which, 2. Segmentation, for' example into ortbographic because of the processing complexity it involves, is not sentences. a thoroughly satisfactory alternative. 3. Lists of various kinds, including glossaries and indexes The TEI Guidelines provide sophisticated mechanisms for 4. Typographically highlighted phrases, whether linking and alignment of elements, both within a given tmqualified or used to mark linguistic emphasis, text and between texts, as well as links to data not in the foreign words, titles etc. form of ASCII text such as sound and images. Much of 5. Quoted phrases, distinguishing direct speech, the TEI work on linkage was accomplished in quotation, terms and glosses, cited phrases etc. collaboration with those working on the 6. Names, numbers and meastnes, dates and times, and Hypermedia/Time-based Document Structuring Language similar data-like phrases. (HyTime), recently adopted as an SGML-based 7. Basic editorial changes (e.g. correction of apparent international standard for hypermedia structures. errors; regularization and normalization; additions, 3.2. The TEI base tagsets deletions and omissions) 8. Simple links and cross references, providing basic Eight distinct TEI base tagsets are proposed: hypertextual features. 9. Pre-existing or generated annotation and indexing 1, prose 2. verse

576 10. Passages of verse or drama, distinguishing for was produced, expansions or formal definitions for any example speakers, stage directions, verse lines, codcbooks used in analyzing the text, the sctting and stanzaic units, etc. identity el participants within it, etc. The amount of 11. Bibliographic citations, adequate for most encoding in a header depends both on the nature and the commonly used bibliographic packages, in either a intended USE of the text. At oae extreme, an encoder may free or a tightly structured format provide only a bibliographic identification of the text. 12. Simple or complex referencing systems, not At the other, encoders wishing to ensure that their texts necessarily dependent on the existing SGML can be used for the widest range of applications can structure. provide a level of detailed documentation approximating to the kind most often supplied in tile form of a manual. There are few documents which do not exhibit some of these features, and none of these features is particularly A colic(lion of "I'L:I headers can also be regardcd as a restricted to any one kind of document. In most cases, distinct document, and an auxiliary 1)'I'13 is provided tt~ additional more specialized tagsets are provided tit encode support interchange of headers alone, for example, aspects of these features in more detail, but the elements bEtweEn libraries or archives. defined in this core should be adequate fro" most applications most of the time. 3.5. Additional tagsets

Features are categorized within the TEl scheme based on A number of optional additional tagsets are defined by tile shared attributes. The TEl encoding scheme also uses it Guidelines, inlcuding tagscts for special application areas classification system based upon structural properties of such as alignment and linkage of text segments to form the elements, that is, their position within the SGMI~ hypertexts; a wide range of other analytic elements and document structure. Elements which can appear at the attributes; a tassel for detailed manuscril)l transcription same position within a document are regarded as forming and another for the recording of an electronic variorum a model class: for example, the class phrase includes all modelled on the tradition,'d critical apparatus; tagscts fc)r elements which can appear within paragraphs but not the detailed encoding of names and dates; abstractions spanning them, the class chunk includes all elements such as netw¢~rks, graphs or trees; nmthematic-d formulae which cannot appear within paragraphs (e.g., paragraphs), and tables Etc. etc. A class inter is also defined for elements such as lists, which can appear either within or between chunk In addition to these application-specific specialized elements. taSSelS, a general purpose tagset based tm feature structme notation is proposed fnr the encoding of entirely Classes may have super- and sub-classes, and properties abstract intcrpretati(ms of a text, either in parallel or (notably, associated attributes) may be inherited. For embedded within it. Using this mechanism, encnders can example, reflecting the needs of many TEI users to treat define arbitrarily complex bundles or sets of features texts both as documents and as input to databases, a snb- identified in a text. The syntax defined by the Guidelines class of phrase called data is defined to include data-like formalizes the way in which such features are encoded and featnres such as names of persons, places or provides for a detailed specification of legal feature organizations, numbers and dates, abbreviations and wdue/pair combinations and rules (a feature system measures. The formal definition of classes in the SGMI, declaration) determining, for example, the implication of syntax used to express the TEI scheme makes it possible under-specified or defaulted features. A related set of for users of the schente to extend it in a simple and additional elements is also provided for the encoding ~1 controlled way: new elements may be added into existing degrees of uncertainty or ambiguity in the encoding of a classes, and existing elements renamed or undefined, text. without any need for extensive revision ~t tile TEl document type definitions. A user of Ihe TEl scheme may combine as many or as few additional tagsets as suit Isis or her needs. The 3.4. The TEl header existence of tagsets for particular application areas in tile Guidelines reflects, to some extent, accidents of history: The TEl |leader is believed to he the first systcmatic no claim to systematic or encyclopedic coverage is attempt to provide in-file documentation of electronic implied. It is expected that new tagscts will be defined its texts. The TEl header allows tbr the definition of a full a part of the continued work of the "I'EI and in related AACR2-compatible hihliographic description for the projects. For example, tile l~ltropean project MULTEXT, electronic text, covering all of the lbllowing: in collaboration with EAGLIiS, will develop a 1. the electronic document itself specialized Corpus Encoding Standard for NLP 2. sources from which the document was dcrivcd applications based on the "I'E1 Gnidelines.2 3. encoding system 4. revision history

The TEI header allows for a large ammmt of structured or unstructured information under the above headings, including both traditional bibliographic material which can be directly translated into an equivalent MARC catalogue record, as well as descriptive inlormation such as the languages it uses and the situation within which it 2 See Idc and Vdronis, MULTEXT : Multilingual Text Tools and Corpora, in this vohnne.

577 4. Information about the TEI International Organization for Standards, 1SO 8879: information Processing--Text and Office Systems-- To obtain further information about the TEI or to obtain Standard Generalized (SGML). copies of the TEl Guidelines, contact one of the TEI ISO (1986). editors: International Organization for Standards, ISO/IEC DIS 10744: Hypermedia/Time-based Document C. M. Sperberg-McQueen, (Editor in Chief) Structuring Languagc (Hytime). ISO (1992). University of Illinois at Chicago (M/C 135) Sperberg-McQueen, C.M., Burnard, L., Guidelines for Computer Center Electronic Text Encoding and Interchange, 1940 W. Taylor St. Text Encoding Initiative, Chicago and Oxlbrd, Chicago, Illinois 60612-7352 US 1994. [email protected] +I (312) 413-0317 Fax: +1 (312) 996-6834 Appendix: TEl Guidelines Contents Lou Burnard, (European Editor) Part I: Introduction Oxford University Computing Service 1. About These Guidelines 13 Banbury Road, Oxford OX26NN, UK 2. Concise Summary of SGML [email protected] 3. Structure of the TEl Document Type Declarations +44 (865) 273200 Fax: +44 (865) 273275 Part I1: Core "Fags and General Rules 4. Characters and Character Sets Tim TEI also maintains a publicly-accessible ListServ 5. The TEl I leader list, [email protected] or TEI - 6. Tags Available in All TEl DTDs L@uicvm. bitnet. Part Ill: Base Tag Sets 7. Base Tag Set for Prose 8. Base Tag Set for Verse Acknowledgments -- The TEI has been funded by tile 9. Base Tag Set for Drama U.S. National Endowment for the Humanities (NEH), 1 0. Base Tag Set for Transcriptions of Spoken Texts Directorate XIII of the Commission of the European I 1. Base Tag Set for Letters and Memoranda Communities (CEC/DG-XIII), the Andrew W. Mellon 12. Base 'rag Set for Printed Dictionaries Foundation, and the Social Science and Humanities 1 3. Base Tag Set for Terminological Data Research Council of Canada. Some material in this paper 14. Base Tag Set for Language Corpora and Collections has been adapted from TEl documents written by various 1 5. User-Defined Base Tag Sets TEl participants. Part IV: Additional "Fag Sets 16. Segmentation and Migmnent 17. Simple Analytic Mechanisms References 1 g. Feature Structure Analysis 19. Certainty Armstrong-Warwick, S. Acquisition and Exploitation of 20. Mannseripts, Analytic Bibliography, and Physical Description of tile Source Text Textual Resources for NLP. Proceedings of the 21. Text Criticism and Apparatus International Conference on Ruilding and Sharing of 22. Additional Tags for Names and Dates Very Larrge Knowledge Bases '93, Tokyo, Japan, 23. Graphs, Digraphs, and Trees December, 1993, 59-68. 24. Graphics, Figures, and Ilhlslrations Burnard, L., Sperberg-McQueen, C.M. TEI Design 25. I:ormulae and Tables Principles. Computers and the Ilumanitites Special 26. Additional Tags Ibr the TEl l leader Issue on the Text Encoding Initiative, 28, 1-3 Part V: Auxiliary Document Types (1994), (to appear). 27. Structured Ileader Bryan, M. SGML: An Author's Guide, Addison-Wesley 28. Writing System Declaration Publishing Company, New York (1988). 29. Feature System Declaration Coombs, J.H., Renear, A.It., and DeRose, S.J. Markup 30. "Fag Set Declaration systems and the filttlre of scholarly text processing. Part VI: Technical Topics Communications of the ACM 30, 11 (1987), 933- 31. TEl Conlbrmance 947. 32. Modifying T[il DTDs Goldfarb, C.F. The SGMI, Ilandbook, Clarendon Press, 33. l,oeal Installation and Support of TEl Markup Oxtord (1990). 34. Use of TEI Encoding Scheme in Interchange van Herwijnen, E. Practical SGML, Kluwer Academic 35. Relationship of TEl to Other Standards Publishers, Boston (1991). 36. Markup for Non-llierarchical Phenomena 37. Algorithm fi~r Recognizing Canonical References Ide, N., Sperberg-McQueen, C.M. The Text Encoding Initiative: History and Background.Computers and Part VII: Alphabetical Reference List of Tags and Classes the Humanitites Special issue on the Text Part VIII: P,et~rence Material Encoding Initiative, 28, 1-3 (1994), (to appear). 38. Full TEl Document Type Declarations Ide, N., Vdronis, J. (Eds.) Computers and the 39. Standard Writing System Declarations tlumanitites Special issue on the Text Encoding 40. Feature System Declaration for Basic Grammatical Initiative, 28, 1-3 (1994), (to appear). Annotation 41, Sample Tag Set l)echnation 42. Formal Grammar for tire TEl-Interchange-Format Subset of SGML

578