Encoding standards for large text resources: The Text Encoding Initiative

Nancy Ide
Department of Computer Science, Vassar College, Poughkeepsie, New York 12601 (U.S.A.)
Laboratoire Parole et Langage, CNRS/Université de Provence, 29, Avenue Robert Schuman, 13621 Aix-en-Provence Cedex 1 (France)
e-mail: ide@cs.vassar.edu

Abstract. The Text Encoding Initiative (TEI) is an international project established in 1988 to develop guidelines for the preparation and interchange of electronic texts for research, and to satisfy a broad range of uses by the language industries more generally. The need for standardized encoding practices has become increasingly critical as the need to use and, most importantly, reuse vast amounts of electronic text has dramatically increased for both research and industry, in particular for natural language processing. In January 1994, the TEI issued its Guidelines for the Encoding and Interchange of Machine-Readable Texts, which provide standardized encoding conventions for a large range of text types and features relevant for a broad range of applications.

Keywords. Encoding, markup, large text resources, corpora, SGML.

1. Introduction

The past few years have seen a burst of activity in the development of statistical methods which, applied to massive text data, have in turn enabled the development of increasingly comprehensive and robust models of language structure and use. Such models are increasingly recognized as an invaluable resource for natural language processing (NLP) tasks, including machine translation.

The upsurge of interest in empirical methods for language modelling has led inevitably to a need for massive collections of texts of all kinds, including text collections which span genre, register, spoken and written data, etc., as well as domain- or application-specific collections and, especially, multi-lingual collections with parallel translations. In the latter half of the 1980's, very few appropriate or adequately large text collections existed for use in computational linguistics research, especially for languages other than English. Consequently, several efforts to collect and disseminate large mono- and multi-lingual text collections have recently been established, including the ACL Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), which has developed a multilingual, partially parallel corpus, the U.S. Linguistic Data Consortium (LDC), RELATOR and MULTEXT in Europe, etc. (see Armstrong-Warwick, 1993). It is widely recognized that such efforts constitute only a beginning for the necessary data collection and dissemination efforts, and that considerable work to develop adequately large and appropriately constituted textual resources still remains.

The demand for extensive reusability of large text collections in turn requires the development of standardized encoding formats for this data. It is no longer realistic to distribute data in ad hoc formats, since the effort and resources required to clean up and reformat the data for local use is at best costly, and in many cases prohibitive. Because much existing and potentially available data was originally formatted for the purposes of printing, the information explicitly represented in the encoding concerns a particular physical realization of a text rather than its logical structure (which is of greater interest for most NLP applications), and the correspondence between the two is often difficult or impossible to establish without substantial work. Further, as data become more and more available and the use of large text collections becomes more central to NLP research, general and publicly available software to manipulate the texts is being developed which, to be itself reusable, also requires the existence of a standard encoding format.
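The contrast between print-oriented and logical encoding can be made concrete with a small constructed sketch; the tag names below are illustrative only and are not taken from the TEI Guidelines:

    <!-- Print-oriented: records only how the text looked on the page -->
    <center><bold><size pts="14">1. Introduction</size></bold></center>

    <!-- Logical, SGML-style: records what each span of text is;
         rendering is left entirely to the application -->
    <div type="section" n="1">
      <head>Introduction</head>
      <p>The past few years have seen a burst of activity ...</p>
    </div>

From the logical form, an NLP application can recover sections, headings, and paragraphs directly; from the print-oriented form, it must infer them from font sizes and spacing, which is exactly the costly reconstruction work described above.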
A standard encoding format adequate for representing textual data for NLP research must be (1) capable of representing the different kinds of information across the spectrum of text types and languages potentially of interest to the NLP research community, including prose, technical documents, newspapers, verse, drama, letters, dictionaries, lexicons, etc.; (2) capable of representing different levels of information, including not only physical characteristics and logical structure (as well as other more complex phenomena such as intra- and inter-textual references, alignment of parallel elements, etc.), but also interpretive or analytic annotation which may be added to the data (for example, markup for part of speech, syntactic structure, etc.), as the sketch below illustrates; and (3) application independent, that is, it must provide the required flexibility and generality to enable, possibly simultaneously, the explicit encoding of potentially disparate types of information within the same text, as well as accommodate all potential types of processing. The development of such a suitably flexible and comprehensive encoding system is a substantial intellectual task, demanding (just to start) the development of suitably complex models for the various text types as well as an overall model of text and an architecture for the encoding scheme that is to embody it.
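As a rough illustration of point (2), a single SGML-style text can simultaneously carry logical structure, analytic annotation, and alignment of parallel elements. The attribute names here (ana for a part-of-speech code, corresp for a translation link) are used in the spirit of such schemes, but the fragment is a constructed sketch rather than a quotation from the Guidelines:

    <!-- English sentence with word-level part-of-speech annotation,
         aligned with its French translation via corresp -->
    <s id="en.1" corresp="fr.1">
      <w ana="DET">The</w> <w ana="NOUN">treaty</w> <w ana="VERB">holds</w>
    </s>
    <s id="fr.1" corresp="en.1" lang="fr">
      <!-- &eacute; is a character entity, keeping the file within
           a seven-bit character set for interchange -->
      <w>Le</w> <w>trait&eacute;</w> <w>tient</w>
    </s>

An application interested only in logical structure can ignore the ana and corresp attributes, while a tagger or aligner can exploit them; this coexistence of disparate information in one text is precisely what application independence requires.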
2. The Text Encoding Initiative

In 1988, the Text Encoding Initiative (TEI) was established as an international co-operative research project to develop a general and flexible set of guidelines for the preparation and interchange of electronic texts. The TEI is jointly sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The project has had major support from the U.S. National Endowment for the Humanities (NEH), Directorate XIII of the Commission of the European Communities (CEC/DG-XIII), the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada.

In January 1994, the TEI issued its Guidelines for the Encoding and Interchange of Machine-Readable Texts, which provide standardized encoding conventions for a large range of text types and features relevant for a broad range of applications, including natural language processing, information retrieval, hypertext, electronic publishing, various forms of literary and historical analysis, lexicography, etc. The Guidelines are intended to apply to texts, written or spoken, in any natural language, of any date, in any genre or text type, without restriction on form or content. They treat both continuous materials (running text) and discontinuous materials such as dictionaries and linguistic corpora. As such, the TEI Guidelines answer the fundamental needs of a wide range of users: researchers in computational linguistics, the humanities, sciences, and social sciences; publishers; librarians and those concerned generally with document retrieval and storage; as well as the growing language technology community, which is amassing substantial multi-lingual, multi-modal corpora of spoken and written texts and lexicons in order to advance research in human language understanding, production, and translation.

The rules and recommendations made in the TEI Guidelines conform to ISO 8879, which defines the Standard Generalized Markup Language (SGML), and to ISO 646, which defines a standard seven-bit character set in terms of which the recommendations on character-level interchange are formulated.1 SGML is an increasingly widely recognized international markup standard which has been adopted by the ...

... encoding schemes, which usually could only be used for the data for which they were designed. In many cases, there had been no prior analysis of the required categories and features and the relations among them for a given text type, in the light of real and potential processing and analytic needs. The TEI has motivated and accomplished the substantial intellectual task of completing this analysis for a large number of text types, and provides encoding conventions based upon it for describing the physical and logical structure of many classes of texts, as well as features particular to a given text type or not conventionally represented in typography. The TEI Guidelines also cover common text encoding problems, including intra- and inter-textual cross-reference, demarcation of arbitrary text segments, alignment of parallel elements, overlapping hierarchies, etc. In addition, they provide conventions for linking texts to acoustic and visual data.

The TEI's specific achievements include:

1. a determination that the Standard Generalized Markup Language (SGML) is the framework for development of the Guidelines;
2. the specification of restrictions on and recommendations for SGML use that best serve the needs of interchange, as well as enable maximal generality and flexibility in order to serve the widest possible range of research, development, and application needs;
3. analysis and identification of categories and features for encoding textual data, at many levels of detail;
4. specification of a set of general text structure definitions that is effective, flexible, and extensible;
5. specification of a method for in-file documentation of electronic texts compatible with library cataloging conventions, which can be used to trace the history of the texts and thus assist in authenticating their provenance and the modifications they have undergone (a minimal sketch of such documentation appears after this list);
6. specification of encoding conventions for special kinds of texts or text features, including:
   a. character sets
   b. language corpora
   c. general linguistics
   d. dictionaries
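To make item 5 concrete, the in-file documentation takes the form of a header prefixed to the electronic text. The following is a minimal, hypothetical sketch in the spirit of the TEI header; the element inventory and content are simplified rather than quoted from the Guidelines:

    <teiHeader>
      <fileDesc>
        <!-- bibliographic description, parallel to a library catalog record -->
        <titleStmt>
          <title>A sample corpus text: an electronic transcription</title>
        </titleStmt>
        <publicationStmt>
          <distributor>Hypothetical Text Archive</distributor>
          <date>1994</date>
        </publicationStmt>
        <sourceDesc>
          <p>Transcribed from the printed edition of 12 March 1990.</p>
        </sourceDesc>
      </fileDesc>
      <revisionDesc>
        <!-- modification history, supporting authentication of provenance -->
        <change>
          <date>1993-11-02</date>
          <item>Part-of-speech annotation added.</item>
        </change>
      </revisionDesc>
    </teiHeader>

The file description parallels library cataloging conventions (title, publication, source), while the revision history records the modifications the text has undergone.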