SCIENTIFIC PAPER UDC 004.439:004.55XML UDC 519.76
TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION
Tomaž Erjavec* Jožef Stefan Institute, Department of Knowledge Technologies Ljubljana, Slovenia
Abstract: The paper introduces the Text Encoding Initiative Guidelines, a formal specification and accompanying documentation defining a vocabulary of XML elements, meant for the annotation of texts for scholarly purposes. TEI is widely used for encoding a wide variety of text types, for example complex texts in digital libraries. The paper describes the history and organisation of the Text Encoding Initiative and the structure of the Guidelines, and gives several use cases. It then moves on to the question of internationalisation: while TEI elements can describe text in any language, the Guidelines themselves are nevertheless in English. However, TEI offers several possibilities for translating (parts of) the Guidelines. We discuss them, and offer a new, resource-light alternative.
Keywords: XML, TEI, Text encoding standardisation, internationalisation.
INFOtheca, № 1, vol XI, April 2010 3a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION
1. Introduction

Digitally stored text has been, for a long time, viewed only as a digital representation of its printed image, and was closely bound to the software that created it. Such text was thus not suitable for task-based machine processing and could become quickly obsolescent. To address these concerns the ISO standard SGML was created, which allowed platform and output independent means of representing text, by allowing users to define their own tagsets, suitable for encoding arbitrary text types. The most widely known application of SGML is without doubt HTML, the Hypertext markup language, used to encode documents for display in Web browsers. But neither SGML nor HTML were suited as a broad basis for encoding arbitrary texts for arbitrary purposes: SGML was too complex, while HTML defined only a small vocabulary of elements, mostly visually oriented.

These were the reasons why a new standard was created by the World Wide Web Consortium, the Extensible Markup Language, XML.1

XML is a subset of SGML, and as such, unlike HTML, specifies a means to define arbitrary tagsets, while being much simpler than SGML, so that it became much easier to write software to process it. Since its inception, XML has become a true success story as the interchange format for digital data, with a host of software and related standards to support its authoring, validation, and transformations.

Of these related standards we should mention XSLT,2 the transformation language for XML documents. It is possible to write scripts in XSLT that transform XML documents into other, differently structured documents. As XML is primarily used as a storage and interchange format, XSLT enables its transformation into more directly usable formats, e.g., HTML for Web browsing.

XML, as was SGML, is a meta-language: it does not define a particular tagset, but only provides a means to define tagsets, and how a tagset is defined depends on the type of data under consideration and the envisaged usage scenarios. While developers are free to define their own tagsets, or, more precisely, XML schemas that specify the tags (XML elements and attributes) and the valid interrelations between tags, it is in general better to make use of pre-defined standard schemas, as they have been carefully designed, are well documented and maintained, ensure interoperability and can be accompanied by dedicated software.

By now there exist a large number of standard XML schemas, for areas as diverse as mobile phone applications, representing mathematical formulas, music notation and technical documentation.

But what about the field of humanities texts, i.e. texts that are meant to be the subject of humanities research, regardless of what text type they are, what language they are written in, what period they come from, or what methods are used in their analysis? To date, there is only one standard (or, better, recommendation) that attempts to cover this vast area, namely the Text Encoding Initiative.

2. The Text Encoding Initiative

The Text Encoding Initiative (TEI) developed a community standard for representing digital texts in a way that is both powerful and responsive to the research needs of humanities scholars. TEI can be characterised as:3
– a freely available set of guidelines for encoding humanities texts using XML,
– an international consortium that exists to support the maintenance and development of the guidelines, and
– a community of projects and individuals who use the TEI Guidelines.

Like many standardization efforts, the TEI faces the challenges inherent in such a project. How can the many disciplines and communities within the humanities domain find common ground in a single encoding language? How do we agree on the level of detail that is necessary or appropriate in describing our textual materials? How do we reconcile the advantages to be gained by consistency and agreement with the need for individual specialization? How do we handle the truly idiosyncratic and unexpected?

Unlike many standardization efforts, the TEI addresses these and similar questions by explicitly accommodating variation and debate within its technical framework. The TEI Guidelines are designed to be both modular and customizable, so that specific projects can choose the relevant portions of the TEI and ignore the rest, and can also if necessary create extensions of the TEI language to describe facets of the text which the TEI does not yet address. Because the TEI itself is complex, the customization process is not entirely trivial, but it is designed to be as straightforward as possible.

2.1. What is TEI used for?

The predominant use of the TEI Guidelines is in creating large digital library collections, which provide access to large quantities of textual material, often focusing on rare or fragile materials which would otherwise be inaccessible. The text encoding in collections like these emphasizes features that will be of immediate, general use for searching and retrieval: information such as bibliographic data, basic text structure, and metadata such as subject keywords to help users find materials of interest.

The TEI Guidelines are also used by more specialized research projects to represent smaller, more thematically focused collections of texts, often organized around a single genre, author, period, or country (or a combination of these). Projects of this type often use more detailed markup to represent particular features of the text that are relevant to the specific collection or important for the specific scholarly audience being served. For instance, a collection focusing on an author whose writings are an important source of information about famous contemporaries might usefully encode all references to names and works, perhaps including links to more detailed indexes of biographical or critical information. Similarly, an electronic edition of a particular author or work might well include a representation of variant readings, authorial revisions, editorial emendations, and similar editorial information. Some collections of this sort are intended to serve very specific research goals, such as linguistic analysis, or as the basis for a dictionary, and in these cases the markup may be very highly specialized.

The uses described above are all some species of publication: the goal is to create a digital collection that can be used online by some public audience of greater or lesser extent, perhaps limited to a small community or to paying subscribers, perhaps open to the general public. More rarely, the TEI is also used by individuals to create digital representations of textual materials to support their own personal research, in forms which may or may not be published. Where the scope and purpose of the larger collections may be determined by audience and funding, in the case of an individual’s work the constraints are more personal and professional: the encoded material might serve as a private research tool, or might develop into the equivalent of a digital monograph that represents the author’s analysis of a set of texts. The use of the TEI Guidelines in these cases may be as detailed as the author finds useful, limited only by time, energy, imagination, and the constraints of usefulness.

1 http://www.w3.org/XML/
2 http://en.wikipedia.org/wiki/XSL_Transformations
3 This section closely follows the excellent introduction to the TEI given in http://www.wwp.brown.edu/encoding/seminars/tei.html
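As a rough illustration of the more detailed markup described above, where references to names and works are encoded and linked to an index, the following Python sketch reads a tiny TEI-style paragraph and groups the tagged references by their link target. The sample sentence, the `ref` values and the helper name `collect_references` are invented for this sketch; only the element names `name` and `title` and the TEI P5 namespace are taken from TEI itself, and a real project would work against its own schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample paragraph: two mentions of one person and one work title.
SAMPLE = f"""<p xmlns="{TEI_NS}">In this letter <name type="person" ref="#p1">Ana Novak</name>
recommends <title ref="#w1">An Invented Treatise</title>, which
<name type="person" ref="#p1">Novak</name> had read earlier.</p>"""


def collect_references(xml_text):
    """Group tagged names and titles by the value of their ref attribute."""
    root = ET.fromstring(xml_text)
    index = {}
    for tag in ("name", "title"):
        for element in root.iter(f"{{{TEI_NS}}}{tag}"):
            # Normalise whitespace so line breaks in the source do not matter.
            text = " ".join("".join(element.itertext()).split())
            index.setdefault(element.get("ref"), []).append((tag, text))
    return index


if __name__ == "__main__":
    for target, mentions in collect_references(SAMPLE).items():
        print(target, mentions)
```

Grouping by the link target rather than by the surface form is one possible design choice: it lets differently spelled mentions of the same person end up under a single index entry.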
2.2. Learning TEI

The TEI web site4 is a good source of general information about the TEI, and is also the place to find the TEI Guidelines. However, the TEI web site is only one of many sources of useful information about how to use the TEI and how to understand its significance. Although there is no single source that can give a complete picture, there is now a growing literature on the role of text encoding in scholarly work. The bibliography5 of Brown University gives some useful sources, but probably the best way to get an initial grounding in the TEI is to attend a workshop, a number of which are given each year, mostly in the U.S.A. and U.K., and are advertised on the TEI web site.

For those who need a more detailed understanding of the actual encoding process, workshops are a good start, but they are not sufficient. Learning the TEI is like learning a language: it has a fairly extensive vocabulary and a complex range of usage. An introductory workshop gives a good understanding of what the language is for and of its basic terms, but it needs to be followed with both practice and a detailed exploration of how the language is really used. Having a project to work on (a set of documents that are of interest) is a great help. To gain a detailed knowledge of the TEI Guidelines as you encode, there are several things one can do in addition to reading the Guidelines themselves. The TEI maintains the TEI-L mailing list, on which questions, even if asked by beginners, will almost invariably receive helpful answers. The list is also archived, and the answers to a number of common (and less common) questions can be found in the TEI-L archives; the range of opinions and approaches can also give a valuable sense of how different kinds of projects use the TEI. A good way of learning TEI is also to look at the work of actual text encoding projects. Many projects have documentation and some have exceptionally good documentation that explains the rationale behind their encoding decisions and the criteria by which they recognize and encode specific textual features. Many are also happy to share sample encoded texts, which can be a very useful way of getting a more detailed view of the encoding landscape.

3. History of TEI6

The TEI began in 1987 at a meeting at Vassar College, which brought together a diverse group of scholars from many different disciplines and representing leading professional societies, libraries, archives, and projects in a number of countries in Europe, North America, and Asia. The initial phase of their work resulted in the release of the first draft (known as ‘P1’) of the Guidelines in 1990. A second phase immediately began and released its results throughout 1990–1993. Then, after another round of revisions, extensions, and supplements, the first official version of the Guidelines (‘P3’) was released in 1994. As more scholars became acquainted with the Guidelines, comments, corrections, and requests for extensions arrived from around the world. In the end there were nearly 200 scholars from many disciplines, professions, and countries in the core group that was developing the TEI Guidelines.

In 2000 the TEI Consortium was established. It is an international membership organization, which now maintains, continues developing, and promotes the TEI. The goal of establishing the TEI Consortium was to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization.

4 http://www.tei-c.org/
5 http://www.wwp.brown.edu/encoding/seminars/readings.html
6 For a more detailed account of the TEI history see http://www.tei-c.org/About/history.xml on which this section is also based.
Following the establishment of the TEI Consortium, a critical priority was the release of an XML version of the TEI Guidelines, updating P3, itself based on SGML, to enable users to work with the emerging XML toolset. The P4 version of the Guidelines was published in 2002. It was essentially an XML version of P3, making no substantive changes to the constraints expressed in the schemas apart from those necessitated by the shift from SGML to XML, and changing only corrigible errors identified in the prose of the P3 Guidelines. However, given that P3 had by this time been in steady use since 1994, it was clear that a substantial revision of its content was necessary, and work began immediately on the P5 version of the Guidelines. This was planned as a thorough overhaul, involving a public call for features and new development in a set of crucial areas including character encoding, graphics, manuscript description, and the language in which the TEI Guidelines themselves are written. The P5 version of the Guidelines was released at the end of 2007.

The impact of the TEI on digital scholarship has been enormous. Today, the TEI is internationally recognized as a critically important tool, both for the long-term preservation of electronic data, and as a means of supporting effective usage of such data in many subject areas. It is the encoding scheme of choice for the production of critical and scholarly editions of literary texts, for scholarly reference works and large linguistic corpora, and for the management and production of detailed metadata associated with electronic text and cultural heritage collections of many types.

The TEI’s recommendations have been endorsed by many organizations, including the US National Endowment for the Humanities, the UK’s Arts and Humanities Research Board, the Modern Language Association, the European Union’s Expert Advisory Group for Language Engineering Standards, and many other agencies around the world that fund or promote digital library and electronic text projects. Recognizing its importance in the emerging digital library community, the Library of Congress has produced guidelines for best practice in applying the TEI metadata recommendations for interoperability with other standards.

The success of the TEI has also gone a long way to ensuring that our cultural heritage will be brought forward into the emerging new networked world, and made broadly available to students, scholars, and the wider public.

4. The TEI Guidelines

The TEI Guidelines for Electronic Text Encoding and Interchange define and document a markup language for representing the structural, renditional, and conceptual features of texts. They focus (though not exclusively) on the encoding of documents in the humanities and social sciences, and in particular on the representation of primary source materials for research and analysis. These guidelines are expressed as a modular, extensible XML schema, accompanied by detailed documentation, and are published under an open-source license.

The TEI Guidelines are available in several formats: the printed and bound copies can be purchased, while the online HTML and PDF versions are available for free from the TEI home page. It is also possible to download the schemas, the XML source files of the Guidelines, documentation etc. as zip packages from the TEI SourceForge site; more detailed instructions for accessing and using these materials are available from the TEI homepage.

The TEI Guidelines are themselves written in XML and follow Donald Knuth’s literate programming paradigm, where the same document contains the formal specifications of XML schemas and the accompanying documentation (prose text).
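To make the literate-programming idea more concrete, the Python sketch below reads a simplified, hand-written element specification in the style of the TEI documentation elements, in which a prose description, a formal content model and a usage example sit side by side in one XML element. The specification shown is a cut-down illustration written for this sketch, not a copy of the actual source of the Guidelines, and the function name `split_spec` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Simplified, invented specification: prose, formal declaration and example
# are all children of the same elementSpec element.
SPEC = f"""<elementSpec xmlns="{TEI_NS}" xmlns:rng="{RNG_NS}" ident="sp" module="core">
  <desc>contains an individual speech in a performance text</desc>
  <content>
    <rng:zeroOrMore><rng:ref name="speaker"/></rng:zeroOrMore>
    <rng:oneOrMore><rng:ref name="p"/></rng:oneOrMore>
  </content>
  <exemplum>
    <egXML xmlns="http://www.tei-c.org/ns/Examples">
      <sp><speaker>A</speaker><p>Hello.</p></sp>
    </egXML>
  </exemplum>
</elementSpec>"""


def split_spec(xml_text):
    """Separate the documentation from the formal part of one specification."""
    spec = ET.fromstring(xml_text)
    desc = spec.find(f"{{{TEI_NS}}}desc")
    content = spec.find(f"{{{TEI_NS}}}content")
    # Names of the patterns that the RELAX NG content model refers to.
    refs = [ref.get("name") for ref in content.iter(f"{{{RNG_NS}}}ref")]
    return {
        "element": spec.get("ident"),
        "module": spec.get("module"),
        "prose": desc.text.strip(),
        "refers_to": refs,
        "has_example": spec.find(f"{{{TEI_NS}}}exemplum") is not None,
    }


if __name__ == "__main__":
    print(split_spec(SPEC))
```

Because documentation and declaration live in one source, a processor can generate both the reference pages and the schema from it; the sketch only shows the first step, pulling the two kinds of information apart.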
The TEI Guidelines define several hundred elements and attributes for marking up documents of any kind. Each definition has the following components:
– a prose description,
– a formal declaration, expressed using a special-purpose XML vocabulary defined in the Guidelines in combination with elements taken from the ISO schema language RELAX NG7, and
– usage examples.

For illustration, Figure 1 gives the HTML view of the definition of the TEI [...]

4.1. TEI Modules

Each chapter of the Guidelines presents a group of related elements, and also defines a corresponding set of declarations, called a module. All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind [...]
Module         Identifier                        Chapter  Chapter title
[...]
nets           Graphs, Networks, and Trees       19       Graphs, Networks, and Trees
spoken         Transcribed Speech                8        Transcriptions of Speech
tagdocs        Documentation Elements            22       Documentation Elements
tei            TEI Infrastructure                1        The TEI Infrastructure
textcrit       Text Criticism                    12       Critical Apparatus
textstructure  Default Text Structure            4        Default Text Structure
transcr        Transcription of Primary Sources  11       Representation of Primary Sources
verse          Verse                             6        Verse

Table 1. TEI Modules

4.2. Constructing a TEI XML schema

To determine that an XML document is valid (as opposed to merely well-formed), its structure must be checked against a schema. For a valid TEI document, this schema must be a conformant TEI schema.

The specification for a conformant TEI schema is itself a TEI document, using elements from the module for Documentation Elements. Such a document is informally referred to as an ODD document, from the design goal originally formulated for the system: ‘One Document Does it all’. Stylesheets for maintaining and processing ODD documents are maintained by the TEI, and the Guidelines are also written as such a document.

A TEI-conformant schema gives a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies.

An ODD document can be processed by an ODD processor, which will generate an appropriate XML schema from this set of declarations, expressed using any of the standard schema languages:
− the XML DTD language (part of XML itself),
− the ISO RELAX NG language,
− the W3C Schema language,8
− or in principle any other adequately powerful schema language.

These output schemas can then be used by any XML processor such as a validator or editor to validate or otherwise process documents.

The TEI site provides an ODD processor called Roma, which gives a Web interface that helps in the process of creating a customized TEI schema.

5. Three Examples of TEI texts

While the previous sections have introduced TEI in broad strokes, this section gives some concrete examples from our own work on Slovenian texts.

All the texts discussed below are written as TEI compliant XML documents: some older ones in TEI P4, while those recently completed in TEI P5. The delivery mechanism is different for different cases, depending on the particular usage scenario, but in all cases rests on XSLT stylesheets, which take the source TEI and transform it into a use-oriented format, usually HTML suitable for viewing in a Web browser.

For the three examples discussed below, we first briefly introduce the texts, or, rather, projects in the scope of which they were compiled, then illustrate the encoding on a small example, and finally present how the XML source of the texts was converted for actual use of the resource.

5.3. The eZISS library

The eZISS digital library9, being developed in cooperation between the Scientific Research Centre of the Slovenian Academy of Sciences and Arts and the Jožef Stefan Institute, offers selected, typically older Slovenian texts with integrated facsimiles, transcriptions and scholarly commentary, in some cases including audiovisual recordings. The editions are intended for lasting public use, and are freely accessible on the world-wide Web for browsing, as well as downloading in source TEI or derived HTML. Abiding by the Creative Commons10 licensing provisions, the editions can be also distributed to others.

The eZISS editions use TEI schemas which include modules for Manuscript Descriptions, Transcription of Primary Sources and Textual Criticism, as well as other modules, e.g. for Performance Texts.

The markup in the various editions varies considerably, as the texts are very different in structure and thus need different TEI elements for their description. To briefly illustrate the kind of markup used in our editions, Figure 2 contains a short excerpt from the eZISS edition of the Škofja Loka Passion Play, the oldest performance text in Slovenian, written at the beginning of the 18th century.

Figure 2. A speech from the eZISS edition of the Škofja Loka Passion Play.

The excerpt shows one speech (TEI element [...]) a passage of the text and [...] added ( [...]

[...] and the Jožef Stefan Institute is a learner’s dic- [...]

8 http://www.w3.org/XML/Schema
9 http://nl.ijs.si/e-zrc/
10 http://creativecommons.org
11 http://nl.ijs.si/jaslo/
The schema for jaSlo dictionary uses the TEI P4 module for Dictionaries, and the start of an example entry is given in Figure 3. As can be seen the entry first contains the Japanese headword ( [...]

The dictionary is offered on-line via a Web search interface. The interface is implemented as a Perl CGI service, which, according to the [...]