The Text Encoding Initiative: Flexible and Extensible Document Encoding
David T. Barnard and Nancy M. Ide

Technical Report 96-396
ISSN 0836-0227-96-396
Department of Computing and Information Science
Queen's University
Kingston, Ontario, Canada K7L 3N6
December 1995

Abstract

The Text Encoding Initiative is an international collaboration aimed at producing a common encoding scheme for complex texts. The diversity of the texts used by members of the communities served by the project led to a large specification, but the specification is structured to facilitate understanding and use. The requirement for generality is sometimes in tension with the requirement to handle specialized text types. The texts that are encoded often can be viewed or interpreted in several different ways. While many electronic documents can be encoded in very simple ways, some documents and some users will tax the limits of any fixed scheme, so a flexible extensible encoding is required to support research and to facilitate the reuse of texts.

1 Introduction

Computerized processing of structured documents is an important theme in information science [1, 13]. To take full advantage of progress in this field it is necessary to have convenient electronic representations of structured documents. The publication of Guidelines for Electronic Text Encoding and Interchange [14] by the Text Encoding Initiative (TEI) [16] was the result of several years of collaborative effort by members of the humanities and linguistics research and academic community in Europe, North America and Asia to derive a common encoding scheme for complex textual structures [11].

The TEI was conceived at a meeting held in 1987 at Vassar College in Poughkeepsie, New York. The Vassar meeting drew together researchers involved in the collection, organization, and analysis of electronic texts for linguistic and humanistic research.
Some of those present were involved in research programs that included the production of large corpora of texts, while others had written innovative pieces of software to process electronic texts in various ways. There was a shared view that the enormous variety of mutually incomprehensible encoding schemes in use was a hindrance to research. To facilitate the interchange and reuse of texts, and to give guidance to those creating new electronic texts, some new scheme was required that could synthesize the best ideas and aspects of existing schemes.

The Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing jointly sponsored a project to devise a new encoding scheme. Funding was obtained from the National Endowment for the Humanities (USA), Directorate XIII of the Commission of the European Communities, the Social Sciences and Humanities Research Council of Canada, the Andrew W. Mellon Foundation, and the various organizations (companies, government agencies, universities and institutes) where the participants worked. The work of the project was carried out by committees dealing with specialized topics, and coordinated by a Steering Committee, a technical review committee, and two editors.

Information about the Initiative appears primarily in the Guidelines [14] and on the TEI's World Wide Web page [16]. The first three numbers of the 1995 volume of Computers and the Humanities deal with various aspects of the TEI; these have been reprinted in monograph form [12].

The new encoding scheme was required to:

  - adequately represent all the textual features needed for research,
  - be simple, clear and concrete,
  - be easy for researchers to use without special-purpose software,
  - allow the rigorous definition and efficient processing of texts,
  - provide for user-defined extensions, and
  - conform to existing and emerging standards.

The Standard Generalized Markup Language (SGML) [9, 10, 7] was a relatively new standard when the project began. Although SGML was thus not in wide use within the communities served by the project, it was chosen as the basis for the new encoding scheme. The TEI encoding scheme is a complex application of SGML.

To promote acceptability of the scheme, it was necessary to ensure that:

  - a common core of textual features could be easily shared,
  - additional specialist features could be easily added,
  - multiple parallel encodings of the same feature were possible,
  - users could decide how rich the markup in a document would be, with a very small minimal requirement, and
  - there should be adequate documentation of the text and the markup of any electronic text.

Here is a simple document encoded using the TEI scheme.

    <!DOCTYPE tei.2 system 'tei2.dtd' [
    <!ENTITY % TEI.prose 'INCLUDE'>
    <!ENTITY english.wsd system 'teien.wsd' SUBDOC>
    ]>
    <tei.2>
      <teiHeader>
        <fileDesc>
          <titleStmt><title>Short document.</title></titleStmt>
          <publicationStmt><p>Unpublished.</p></publicationStmt>
          <sourceDesc><p>Electronic form is original.</p></sourceDesc>
        </fileDesc>
        <profileDesc>
          <langUsage><language id=eng wsd=english.wsd usage='100'></language></langUsage>
        </profileDesc>
      </teiHeader>
      <text>
        <body>
          <p>A very short TEI document.</p>
        </body>
      </text>
    </tei.2>

The first line of this document identifies the document type by pointing to an external definition. The next two lines, enclosed by the square brackets on lines 1 and 4, contain definitions that override those in the external definition; their purpose here is to include the definitions of the tag set for prose documents, and to indicate that the document being encoded is written in English. From line 5 through to the end of the document, there are two parts. The teiHeader contains information describing the electronic document; this is akin to an extended bibliographic description.
The text contains the encoded text itself; in this case, there is a single line of text. Realistic documents have these same constituent parts: a reference to the TEI definitions, some local selections of parts of those definitions, a header, and the text itself.

The remainder of this paper addresses three aspects of the TEI Guidelines. In the next section we describe the structure of the definitions, a flexible collection whose parts can be selected as required for a document. We then discuss the tension between generality and specificity in this definition that is intended to serve well for many document classes. The last aspect we address is the need to represent multiple views of a single electronic text.

2 The Structure of the TEI Specifications

From the beginning of the project it was clear that the specification produced by the TEI would be very large. It was also clear that most documents would only make use of some parts of the specification. It was thus necessary to structure the specifications in such a way that users could select those parts that apply to a specific document. The mechanism for making this selection should be easy to understand and use, and it should not require any formalism or software beyond what is provided in SGML.

The final specification arranges features to be encoded into several categories. Each set of features is encoded using a set of SGML tags. The DTD is put together in such a way that sets of tags can be included in the DTD or excluded from it, and thus the tags are allowed in a document or prohibited, respectively [15].

These are core features that appear in all documents:

  - provision for character sets to be used in documents
  - tags for specifying the TEI header, including the file description and encoding description
  - tags for specifying basic elements, including paragraphs, emphasis, quotations, lists, notes, simple pointers, and references
  - simple text structure, including front and back matter and possibly nested divisions

Two sets of tags cover the core features. The first contains the tags for specifying the TEI header. The second contains tags for basic elements like paragraphs. The defining aspect of the core is that it appears in all bases.

There is a base tag set for each major document category that can be encoded. These are:

  - prose: contains only the core features
  - verse: structure within lines, rhyme, metrical structure
  - drama: cast lists, performances, stage directions
  - transcriptions of speech: utterances, pauses, temporal information
  - print dictionaries: structure of entries, grammatical and typographical information
  - terminological databases: terms and definitions

Each document explicitly selects one of these base tag sets. Each of these bases includes the core features.

Some documents are more complex; they contain material that is from more than one of these bases. For example, a critical essay that included lengthy quotations from poems under discussion would require both prose and verse structures. There are mechanisms for combining base tag sets in a single document; the details can be found in the Guidelines.

Tag sets for additional features can be used with any of the base types. These are included as necessary, depending on the information to be represented in the encoded text.
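
The critical-essay case can be sketched in the style of the short document shown earlier. This is only an illustration, not text taken from the Guidelines: the parameter entity names TEI.mixed and TEI.verse are assumptions modelled on TEI.prose, and the Guidelines remain the authority on how bases are actually combined. The % sign marks each declaration as an SGML parameter entity.

```sgml
<!DOCTYPE tei.2 system 'tei2.dtd' [
  <!-- Assumed name: switch that lets elements from several bases appear together -->
  <!ENTITY % TEI.mixed 'INCLUDE'>
  <!-- Prose base, for the body of the essay -->
  <!ENTITY % TEI.prose 'INCLUDE'>
  <!-- Assumed name: verse base, for the quoted poems -->
  <!ENTITY % TEI.verse 'INCLUDE'>
]>
```

The design point is that the selection stays entirely inside the document's own declaration subset, so no formalism or software beyond SGML is needed to read or change it.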

The additional tag sets include:

  - linking, segmentation and alignment: extended pointers, segments, anchors
  - simple analytic mechanisms: linguistic segments, spans of text
  - feature structures: designed for linguistic analysis but useful for recording any values of any named features detected in the text
  - certainty and responsibility: who tagged this and how certain is the encoding?
  - transcription of primary sources: deletions, illegibility
  - critical apparatus: sources for the text, attaching witnesses to spans of text
  - names and dates
  - graphs, networks and trees: encoding data structures to represent information about text
  - tables, formulae and graphics
  - language corpora: collections of text, sources

For example, to encode a technical paper, one would choose the prose base.
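
As a hedged sketch of such a selection, suppose (purely for illustration) that the technical paper also contains cross-references and tables or formulae. The declaration subset might then look like the fragment below. The entity names TEI.linking and TEI.figures are assumptions modelled on TEI.prose, not names confirmed by this paper, and should be checked against the Guidelines before use.

```sgml
<!DOCTYPE tei.2 system 'tei2.dtd' [
  <!-- Base tag set: prose, which brings the core features with it -->
  <!ENTITY % TEI.prose   'INCLUDE'>
  <!-- Assumed name: additional tag set for linking, segmentation and alignment -->
  <!ENTITY % TEI.linking 'INCLUDE'>
  <!-- Assumed name: additional tag set for tables, formulae and graphics -->
  <!ENTITY % TEI.figures 'INCLUDE'>
]>
```

Tag sets that the document does not mention presumably keep whatever default the external definition gives them, which is how the tags they define stay prohibited in the document.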