Text Encoding Initiative Guidelines and Their Localisation

SCIENTIFIC PAPER UDC 004.439:004.55XML UDC 519.76

TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

Tomaž Erjavec* Jožef Stefan Institute, Department of Knowledge Technologies Ljubljana, Slovenia

Abstract: The paper introduces the Text Encoding Initiative Guidelines, a formal specification and accompanying documentation defining a vocabulary of XML elements, meant for the annotation of texts for scholarly purposes. TEI is widely used for the encoding of e.g. complex texts in digital libraries, for a wide variety of text types. The paper describes the history and organisation of the Text Encoding Initiative, the structure of the Guidelines and gives several use cases. It then moves on to the question of internationalisation: while TEI elements can describe text in any language, the Guidelines themselves are nevertheless in English. However, TEI offers several possibilities of how to translate (parts of) the Guidelines. We discuss them, and offer a new, resource light alternative.

Keywords: XML, TEI, Text encoding standardisation, internationalisation.

*[email protected]

INFOtheca, № 1, vol XI, April 2010 3a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

1. Introduction XML, as was SGML, is a meta-language: Digitally stored text has been, for a long it does not define a particular tagset, but only time, viewed only as a digital representation of provides a means to define tagsets, and how a its printed image, and was closely bound to the tagset is defined depends on the type of data un- software that created it. Such text was thus not der consideration and the envisaged usage sce- suitable for task-based machine processing and narios. While developers are free to define their could become quickly obsolescent. To address own tagsets, or, more precisely, XML schemas these concerns the ISO standard SGML was that specify the tags (XML elements and attri- created, which allowed platform and output in- butes) and the valid interrelations between tags, dependent means of representing text, by allow- it is in general better to make use of pre-defined ing users to define their own tagsets, suitable for standard schemas, as they have been carefully encoding arbitrary text types. The most widely designed, are well documented and maintained, known application of SGML is without doubt ensure interoperability and can be accompanied HTML, the Hypertext markup language, used to by dedicated software. encode documents for display in Web browsers. By now there exist a large number of stan- But neither SGML nor HTML were suited as a dard XML schemas, for areas as diverse as broad basis for encoding arbitrary texts for arbi- mobile phone applications, representing math- trary purposes: SGML was too complex, while ematical formulas, music notation and technical HTML defined only a small vocabulary of ele- documentation. ments, mostly visually oriented. But what about the field of humanities texts, These were the reasons why a new standard i.e. texts that are meant to be the subject of hu- was created by the World Wide Web Consor- manities research, regardless of what text type tium, the Extensible Markup Language, XML.1 they are, what language they are written in, what period they come from, or what methods are XML is a subset of SGML, and as such, unlike used in their analysis? To date, there is only one HTML, specifies a means to define arbitrary tag- standard (or, better, recommendation) that at- sets, while being much simpler than SGML, so tempts to cover this vast area, namely the Text that it became much easier to write software to Encoding Initiative. process it. Since its inception, XML has become a true success story as the interchange format for 2. The Text Encoding Initiative digital data, with a host of software and related The Text Encoding Initiative (TEI) developed standards to support its authoring, validation, and a community standard for representing digital transformations. texts in a way that is both powerful and respon- Of these related standards we should mention sive to the research needs of humanities scholars. 2 XSLT, the transformation language for XML TEI can be characterised as:3 documents. It is possible to write scripts in XSLT – a freely available set of guidelines for en- that transform XML documents into other, differ- coding humanities texts using XML, ently structured documents. As XML is primarily – an international consortium that exists to used as a storage and interchange format, XSLT support the maintenance and development of the enables its transformation into more directly us- guidelines, and able formats, e.g., HTML for Web browsing. 3 This section closely follows the excellent introduction 1 http://www.w3.org/XML/ to the TEI given in http://www.wwp.brown.edu/encoding/ 2 http://en.wikipedia.org/wiki/XSL_Transformations seminars/tei.html

4a INFOtheca, № 1, vol XI, April 2010 tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

– a community of projects and individuals texts, often organized around a single genre, who use the TEI Guidelines. author, period, or country (or a combination of Like many standardization efforts, the TEI these. Projects of this type often use more de- faces the challenges inherent in such a project. tailed markup to represent particular features of How can the many disciplines and communi- the text that are relevant to the specific collection ties within the humanities domain find common or important for the specific scholarly audience ground in a single encoding language? How do being served. For instance, a collection focus- we agree on the level of detail that is necessary or ing on an author whose writings are an important appropriate in describing our textual materials? source of information about famous contempo- How do we reconcile the advantages to be gained raries might usefully encode all references to by consistency and agreement with the need for names and works, perhaps including links to individual specialization? How do we handle the more detailed indexes of biographical or critical truly idiosyncratic and unexpected? information. Similarly, an electronic edition of Unlike many standardization efforts, the TEI a particular author or work might well include addresses these and similar questions by explic- a representation of variant readings, authorial itly accommodating variation and debate within revisions, editorial emendations, and similar its technical framework. The TEI Guidelines are editorial information. Some collections of this designed to be both modular and customizable, sort are intended to serve very specific research goals, such as linguistic analysis, or as the basis so that specific projects can choose the relevant for a dictionary, and in these cases the markup portions of the TEI and ignore the rest, and can may be very highly specialized. also if necessary create extensions of the TEI The uses described above are all some spe- language to describe facets of the text which the cies of publication: the goal is to create a digi- TEI does not yet address. Because the TEI itself tal collection that can be used online by some is complex, the customization process is not en- public audience of greater or lesser extent—per- tirely trivial, but it is designed to be as straight- haps limited to a small community or to paying forward as possible. subscribers, perhaps open to the general public. More rarely, the TEI is also used by individuals 2.1. What is TEI used for? to create digital representations of textual ma- The predominant use of the TEI Guidelines is terials to support their own personal research, in creating large digital library collections, which in forms which may or may not be published. provide access to large quantities of textual ma- Where the scope and purpose of the larger col- terial, often focusing on rare or fragile materials lections may be determined by audience and which would otherwise be inaccessible. The text funding, in the case of an individual’s work the encoding in collections like these emphasizes constraints are more personal and professional: features that will be of immediate, general use the encoded material might serve as a private for searching and retrieval: information such research tool, or might develop into the equiva- as bibliographic data, basic text structure, and lent of a digital monograph that represents the metadata such as subject keywords to help users author’s analysis of a set of texts. The use of find materials of interest. the TEI Guidelines in these cases may be as de- The TEI Guidelines are also used by more tailed as the author finds useful–limited only by specialized research projects to represent small- time, energy, imagination, and the constraints of er, more thematically focused collections of usefulness.

INFOtheca, № 1, vol XI, April 2010 5a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

2.2. Learning TEI TEI is also to look at the work of actual text en- The TEI web site4 is a good source of general coding projects. Many projects have documen- information about the TEI, and is also the place tation and some have exceptionally good docu- to find the TEI Guidelines. However, the TEI mentation that explains the rationale behind their web site is only one of many sources of useful encoding decisions and the criteria by which they information about how to use the TEI and how recognize and encode specific textual features. to understand its significance. Although there is Many are also happy to share sample encoded no single source that can give a complete picture, texts, which can be a very useful way of getting a there is now a growing literature on the role of more detailed view of the encoding landscape. text encoding in scholarly work. The bibliogra- phy5 of the Brown University gives some useful 3. History of TEI6 sources, but probably the best way to get an ini- The TEI began in 1987 at a meeting at Vassar tial grounding in the TEI is to attend a workshop, College, which brought together a diverse group a number of which are given each year, mostly of scholars from many different disciplines and in the U.S.A. and U.K., and are advertised on the representing leading professional societies, li- TEI web site. braries, archives, and projects in a number of For those who need a more detailed under- countries in Europe, North America, and Asia. standing of the actual encoding process, work- The initial phase of their work resulted in the shops are a good start, but they are not sufficient. release of the first draft (known as ‘P1’) of the Learning the TEI is like learning a language – it Guidelines in 1990. A second phase immediately has a fairly extensive vocabulary and a complex began and released its results throughout 1990– range of usage. An introductory workshop gives 1993. Then, after another round of revisions, a good understanding of what the language is for extensions, and supplements, the first official and of its basic terms, but it needs to be followed version of the Guidelines (‘P3’) was released in with both practice and a detailed exploration of 1994. As more scholars became acquainted with how the language is really used. Having a project the Guidelines, comments, corrections, and re- to work on – a set of documents that are of inter- quests for extensions arrived from around the est – is a great help. To gain a detailed knowledge world. In the end there were nearly 200 scholars of the TEI Guidelines as you encode, there are from many disciplines, professions, and coun- several things one can do in addition to reading tries in the core group that was developing the the Guidelines themselves. The TEI maintains TEI Guidelines. the TEI-L mailing list, on which questions, even In 2000 the TEI Consortium was established. if asked by beginners, will almost invariably re- It is an international membership organization, ceive helpful answers. The list is also archived, which now maintains, continues developing, and and the answers to a number of common (and less promotes the TEI. The goal of establishing the common) questions can be found in the TEI-L ar- TEI Consortium was to maintain a permanent chives; the range of opinions and approaches can home for the TEI as a democratically constitut- also give a valuable sense of how different kinds ed, academically and economically independent, of projects use the TEI. A good way of learning self-sustaining, non-profit organization.

4 http://www.tei-c.org/ 6 For a more detailed account of the TEI history see http:// 5 http://www.wwp.brown.edu/encoding/seminars/readings. www.tei-c.org/About/history.xml on which this section is html also based.

6a INFOtheca, № 1, vol XI, April 2010 tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

Following the establishment of the TEI Con- Engineering Standards, and many other agencies sortium, a critical priority was the release of an around the world that fund or promote digital li- XML version of the TEI Guidelines, updating P3, brary and electronic text projects. Recognizing its itself based on SGML, to enable users to work importance in the emerging digital library com- with the emerging XML toolset. The P4 version munity, the Library of Congress has produced of the Guidelines was published in 2002. It was guidelines for best practice in applying the TEI essentially an XML version of P3, making no metadata recommendations for interoperability substantive changes to the constraints expressed with other standards. in the schemas apart from those necessitated by The success of the TEI has also gone a long the shift from SGML to XML, and changing only way to ensuring that our cultural heritage will corrigible errors identified in the prose of the P3 be brought forward into the emerging new net- Guidelines. However, given that P3 had by this worked world, and made broadly available to the time been in steady use since 1994, it was clear students, scholars, and the wider public. that a substantial revision of its content was necessary, and work began immediately on the P5 4. The TEI Guidelines version of the Guidelines. This was planned as The TEI Guidelines for Electronic Text En- a thorough overhaul, involving a public call for coding and Interchange define and document a features and new development in a set of cru- markup language for representing the structural, cial areas including character encoding, graph- renditional, and conceptual features of texts. ics, manuscript description, and the language in They focus (though not exclusively) on the en- which the TEI Guidelines themselves are written. coding of documents in the humanities and so- The P5 version of the Guidelines was released at cial sciences, and in particular on the representation of primary source materials for research the end of 2007. and analysis. These guidelines are expressed as a The impact of the TEI on digital scholarship modular, extensible XML schema, accompanied has been enormous. Today, the TEI is interna- by detailed documentation, and are published un- tionally recognized as a critically important tool, der an open-source license. both for the long-term preservation of electron- The TEI Guidelines are available in several ic data, and as a means of supporting effective formats: the printed and bound copies can be usage of such data in many subject areas. It is purchased, while the online HTML and PDF ver- the encoding scheme of choice for the produc- sions are available for free from the TEI home tion of critical and scholarly editions of literary page. It is also possible to download the sche- texts, for scholarly reference works and large mas, the XML source files of the Guidelines, linguistic corpora, and for the management and documentation etc. as zip packages from the TEI production of detailed metadata associated with SourceForge site; more detailed instructions for electronic text and cultural heritage collections accessing and using these materials are available of many types. from the TEI homepage. The TEI’s recommendations have been en- The TEI Guidelines are themselves writ- dorsed by many organizations, including the US ten in XML and follow Donald Knuth’s literate National Endowment for the Humanities, the programming paradigm, where the same docu- UK’s Arts and Humanities Research Board, the ment contains the formal specifications of XML Modern Language Association, the European schemas and the accompanying documentation Union’s Expert Advisory Group for Language (prose text).

INFOtheca, № 1, vol XI, April 2010 7a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

The TEI Guidelines define several hundred 4.1. TEI Modules elements and attributes for marking up docu- Each chapter of the Guidelines presents a ments of any kind. Each definition has the fol- group of related elements, and also defines a lowing components: corresponding set of declarations, called a mod- – a prose description, ule. All the definitions are collected together in – a formal declaration, expressed using a the reference sections provided as an appendix. special-purpose XML vocabulary defined in the Formal declarations for a given chapter are col- Guidelines in combination with elements taken lected together within the corresponding module. from the ISO schema language RELAX NG7, and For convenience, each element is assigned to a – usage examples. single module, typically for use in some specific For illustration, Figure 1 gives the HTML application area, or to support a particular kind view of the definition of the TEI - ele of usage. A module is thus simply a convenient way of grouping together a number of associated ment. It specifies that the element belongs to the element declarations. In the simple case, a TEI module for Transcription of Primary Sources, schema is made by combining together a small and gives a hyperlink to the full text of the chap- number of modules. ter explaining this module (c.f. next section). The Table 1 lists all the modules defined by the prose description states that the element “groups TEI Guidelines P5. one or more deletions with one or more addi- Chapter of the Guide- tions when the combination is to be regarded as Name of the Formal public lines where the module a single intervention in the text.” The definition module identifier is defined further documents which attributes the element Analysis and 17 Simple Analytic analysis uses, which TEI model the element is used by, Interpretation Mechanisms which elements it can contain, the formal schema Certainty and 21 Certainty, Precision, certainty definition, an example of usage, and notes. Uncertainty and Responsibility 3 Elements Available in core Common Core All TEI Documents Metadata for Lan- corpus 15 Language Corpora guage Corpora dictionaries Print Dictionaries 9 Dictionaries drama Performance Texts 7 Performance Texts Tables, Formulae, 14 Tables, Formulæ, figures Figures and Graphics Character and 5 Representation of gaiji Glyph Documenta- Non-standard Charac- tion ters and Glyphs header Common Metadata 2 The TEI Header iso-fs Feature Structures 18 Feature Structures Linking, Segmenta- 16 Linking, Segmenta- linking tion, and Alignment tion, and Alignment Manuscript 10 Manuscript Descrip- msdescription Figure 1. Definition of the TEI element Description tion Names, Dates, Peo- 13 Names, Dates, Peo- namesdates 7 http://www.relaxng.org/ ple, and Places ple, and Places

8a INFOtheca, № 1, vol XI, April 2010 tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

Graphs, Networks, 19 Graphs, Networks, tions, expressed using any of the standard sche- nets and Trees and Trees ma languages: 8 Transcriptions of − the XML DTD language (part of XML itself), spoken Transcribed Speech Speech − the ISO RELAX NG language, Documentation 22 Documentation 8 tagdocs − the W3C Schema language, Elements Elements − or in principle any other adequately powerful tei TEI Infrastructure 1 The TEI Infrastructure schema language. textcrit Text Criticism 12 Critical Apparatus These output schemas can then be used by Default Text any XML processor such as a validator or editor textstructure 4 Default Text Structure Structure to validate or otherwise process documents. Transcription of 11 Representation of The TEI site provides an ODD processor called transcr Primary Sources Primary Sources Roma, which gives a Web interface that helps in verse Verse 6 Verse the process of creating a customized TEI schema. Table 1. TEI Modules 5. Three Examples of TEI texts 4.2. Constructing a TEI XML schema While the previous sections have introduced To determine that an XML document is valid TEI in broad strokes, this section gives some (as opposed to merely well-formed), its structure concrete examples from our own work on Slo- must be checked against a schema. For a valid venian texts. TEI document, this schema must be a confor- All the texts discussed below are written as mant TEI schema. TEI compliant XML documents: some older The specification for a conformant TEI sche- ones in TEI P4, while those recently completed ma is itself a TEI document, using elements from in TEI P5. The delivery mechanism is different the module for Documentation Elements. Such for different cases, depending on the particular a document is informally referred to as an ODD usage scenario, but in all cases rests on XSLT document, from the design goal originally formu- stylesheets, which take the source TEI and translated for the system: ‘One Document Does it all’. form it into a use-oriented format, usually HTML Stylesheets for maintaining and processing ODD suitable for viewing in a Web browser. documents are maintained by the TEI, and the For the three examples discussed below, we Guidelines are also written as such a document. first briefly introduce the texts, or, rather, projects A TEI-conformant schema gives a specific in the scope of which they were compiled, then illustrate the encoding on a small example, and combination of TEI modules, possibly also in- finally present how the XML source of the texts cluding additional declarations that modify the was converted for actual use of the resource. element and attribute declarations contained by each module, for example to suppress or rename 5.3. The eZISS library some elements. The same system may also be The eZISS digital library9, being developed used to specify a schema which extends the TEI in cooperation between the Scientific Research by adding new elements explicitly, or by refer- Centre of the Slovenian Academy of Sciences ence to other XML vocabularies. and Arts and the Jožef Stefan Institute, offers An ODD document can be processed by an ODD processor, which will generate an appro- 8 http://www.w3.org/XML/Schema priate XML schema from this set of declara- 9 http://nl.ijs.si/e-zrc/

INFOtheca, № 1, vol XI, April 2010 9a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION selected, typically older Slovenian texts with in- tegrated facsimiles, transcriptions and scholarly commentary, in some cases including audiovisu- al recordings. The editions are intended for last- ing public use, and are freely accessible on the world-wide Web for browsing, as well as down- loading in source TEI or derived HTML. Abiding by the Creative Commons10 licensing provisions, the editions can be also distributed to others. The eZISS editions use TEI schemas which include modules for Manuscript Descriptions, Transcription of Primary Sources and Textual Criticism, as well as other modules, e.g. for Per- formance Texts. The markup in the various editions varies con- siderably, as the texts are very different in structure and thus need different TEI elements for their description. To briefly illustrate the kind of markup used in our editions, Figure 2 contains a short excerpt from the eZISS edition of the Љkofja Loka Passion Play, the oldest performance text in Slove- th Figure 2. A speech from the eZISS edition of the nian, written at the beginning of the 18 century. Škofja Loka Passion Play. The excerpt shows one speech (TEI element ) from the play, spoken () by the Devil. For viewing, the editions are transformed into Note that the attribute xml:id gives a unique iden- HTML, using XSLT stylesheets. The older (TEI tifier to this speaker and that xml:lang marks the P4) editions each had a dedicated stylesheet to language in which the speaker is referred to as Lat- transform it into HTML, while editions written in in, “lat” being the three letter ISO 639 code for this TEI P5 have a much simpler dedicated stylesheet language. Note also that “Diabolus.” is marked as that only converts the (complex) source XML highlighted, with the required rendering as under- into a simplified, although still TEI compliant lined. The speech then contains eight lines (), XML. This XML is then transformed to HTML each one given the running number of the line in by the standard TEI XSLT stylesheets, available the play (attribute n) and a unique identifier. form the TEI site. The first two lines also contain a substitu- tion (, see also Figure 1 for the defini- 5.4. The jaSlo dictionary tion), where the scribe (the full description of the The on-line Japanese-Slovene Dictionary scribe is given in the TEI header, and the value jaSlo,11 being developed in cooperation between of the attribute hand refers to this description) the Arts Faculty of the University of Ljubljana crossed out () a passage of the text and and the Jožef Stefan Institute is a learner’s dic- added () some other text. The place attri- tionary meant for Slovenian students of the bute specifies where the addition was made: in Japanese language and currently contains about the first line inline, in the next above the line. 10,000 entries.

10 http://creativecommons.org 11 http://nl.ijs.si/jaslo/

10a INFOtheca, № 1, vol XI, April 2010 tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION

The schema for jaSlo dictionary uses the TEI The dictionary is offered on-line via a Web P4 module for Dictionaries, and the start of an ex- search interface. The interface is implemented ample entry is given in Figure 3. As can be seen the as a Perl CGI service, which, according to the entry first contains the Japanese headword (

), which is further subdivided into the ary, returns the relevant entries, and transforms word written three orthographies: in Romaji (trans- the XML into HTML for display. literation into Latin script), in Hiragana (phonetic Japanese script) and in Kanji (Chinese ideograms). The grammar group () gives the part- 5.5. The JOS corpora 12 of-speech of the word (Verb, class 5) and its sub- The JOS project is developing Slovene categorisation (intransitive), followed by three linguistically annotated corpora and associated inflected forms of the verb, which determine its resources meant to facilitate developments in inflectional behaviour. The translation gives three Human Language Technologies for the Slovene translation equivalents in Slovene. Follows an ex- language. Current results include the JOS mor- ample, in Japanese and translated into Slovene. phosyntactic specifications, two word-level man- Lexical entries have further information as well: ually annotated corpora, and two Web services. the difficulty level of the word, a reference to the The developed resources are available under the year and chapter of the textbook where the word Creative Commons licences. is first introduced definitions (mostly for proper The JOS corpora contain sampled paragraphs nouns), etymology, cross references, etc. from the Slovene reference corpus FidaPLUS,13 annotated with context-disambiguated morphosyntactic descriptions and lemmas. The jos100k corpus contains 100,000 words and has been ex- tensively manually validated, while jos1M contains 1 million words with partially hand validated annotations. The corpora can serve as training sets for trainable part-of-speech taggers and lem- matisers for Slovene. The corpora are encoded in TEI P5, with the schema using the modules for Metadata for Language Corpora, Linguistic Analysis, and Feature Structures with some JOS extensions. Figure 4 shows the first part of an annotated sentence () from the jos100k corpus. The sentence contains words (), punctuation marks (), and whitespace (). Each word is annotated for its morphosyntactic description (MSD), and the word lemma. The MSDs are compact strings, which have a direct decom- position into features structures, defined in the
Figure 3. Start of a lexical entry in the jaSlo 12 http://nl.ijs.si/jos/ dictionary 13 http://www.fidaplus.net/
INFOtheca, № 1, vol XI, April 2010 11a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION
TEI header. So, for example, the MSD “Zk- multilingual, and its geographical reach increas- mer” decomposes into “zaimek vrsta=kazalni es every year. However, the complex encoding of spol=moљki љtevilo=ednina sklon=rodilnik”, texts at which the TEI excels requires a close un- or, in English “Pronoun Type=demonstrative derstanding of the available elements (number- Gender=masculine Number=singular Case= ing over 500), and non-English speakers are at a genitive”. considerable disadvantage in learning and using the Guidelines. The TEI is therefore working to produce a working architecture for 1. internationalisation of the TEI source; 2. internationalised stylesheets for delivery of the reference section of TEI guidelines; 3. translations of the TEI reference documentation. This work is well advanced for six “large” languages, supported by a grant from the ALLC. However, it is unlikely that the Guidelines will be translated any time soon into languages with a small number of speakers, such as Slovene. The complete printed Guidelines have over 1300 pages, with the reference section, giving the precise descriptions and definitions of all the elements, etc. having almost 500 pages. The precise XML encoding of texts is also a Figure 4. The start of the annotated sentence specialist area, so the potential readership of the “Tistega večera sem preveč popil, zgodilo se je…” translated Guidelines into Slovene would most from the jos100k corpus. likely remain small. In spite of it being unreason- able to expect full or reference translation of the The corpora are available in source XML and Guidelines into every language, there are, how- as tabular files, converted with an XSLT script, ever, some areas where a small effort produces which are more suitable for tagger and lemma- already useful results. tiser training. As mentioned, the JOS project also For Slovene point 2 above has been partial- offers a Web service to MSD tag and lemmatise ly completed, so that the TEI XSLT stylesheets uploaded Slovene texts, with the two tools hav- that produce HTML (also performing tasks such ing been trained on the JOS corpora. as producing the table of contents, and splitting large TEI documents into multiple HTML files) 6. Internationalisation have their strings translated, e.g. “Table of Con- The TEI Guidelines have been widely ad- tents” or “Next” in the HTML become “Kazalo” opted by projects and institutions in many coun- and “Naslednji”. tries in Europe, North America, and Asia, and are We also translated the names for all TEI ele- used for encoding texts in dozens of languages. ments into Slovene; this does not mean that the The TEI community is broadly international and element names in XML are translated, but that
12a INFOtheca, № 1, vol XI, April 2010 tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION the element definitions are given glosses in Slo- the translations of all the element names that are vene, as shown in Figure 5. As can be seen, the used in the document. structure of an element specification is here quite Figure 6 gives a part of the TEI header for simple: the element “teiHeader” belongs to the the jos100k corpus, which also contains the tag module “header”, and has an English gloss “TEI usage part of the header. The jos100k header is Header” with the Slovene translation “kolofon written both in English and in Slovene (distin- TEI”. It should be noted that it is in fact the mi- guishing the languages with the xml:lang attri- nority of TEI elements that do have an English bute), so that the XSLT stylesheet inserts into gloss, as most elements, like “sponsor”, have a HTML the appropriate language of the header name that is itself a gloss. content, as well as the appropriate language of the element glosses. Such HTML files are then used to present each edition either in English or in Slovene.
Figure 5. Using Documentation Elements for localisation of TEI element glosses into Slovene.
The file containing such abbreviated specification for all TEI elements together with Slovene translations is available on the Web14 in the hope that other languages will be added in the future. While such glosses could be somewhat inter- esting as an addition to the actual Guidelines, we use them in a different way in our editions. We wrote an XSLT stylesheet that converts the TEI header into simple HTML, but substitutes the names of the elements by their glosses, in Eng- lish or Slovene, or any other language that would be added to the specification file. This enables the translation of the document metadata; as a typical TEI header will also contain information about tag usage in the document, it also gives
14 http://nl.ijs.si/tei/localise/teiLocalise-sl.xml
INFOtheca, № 1, vol XI, April 2010 13a tomaž erjavec – TEXT ENCODING INITIATIVE GUIDELINES AND THEIR LOCALISATION
English glosses and their translations into Slovene is freely available on the Web, and can be used for adding the translations into other languages. As can be seen from the paper, the TEI Guide- lines are a large and rather complex document, so a legitimate question is whether it is worth the time and effort it takes to learn and use this recommendation – it is not better for particular projects to develop their own encoding, perfect- ly suited to their needs? The answer, based on our own experience is not straightforward. For simple texts, which will be used by a particular individual, or inside a particular institution, they answer might well be “no”. However, if a substantial investment of time is required to annotate the texts, making them valuable and worthy of long-term preservation, and possibly usable in more ways than one, and especially if the texts are to be made generally available in their source form, then TEI most likely is the answer: the markup of the texts can be formally validated, it is by definition well-documented, and the projects can make use of the substantial (and increas- ing) software that is associated with TEI. Figure 6. The jos100k TEI header transformed into HTML with English or Slovene element name glosses and header content. Web References Below we give the more important URLs already 7. Conclusions mentioned in the text; we do not list classical bibliographic references, which, however, can be found by The paper introduced the Text Encoding Ini- following the links given below. All URLs have been tiative, an international effort to develop a com- accessed on 2009-09-10. munity standard for representing digital texts in XML – Extensible Markup Language: http://www. a way that is both powerful and responsive to the w3.org/XML/ research needs of humanities scholars. We dis- ISO Relax NG – Schema language for validating cussed the TEI organisation, its history, Guide- XML documents: http://www.relaxng.org/ lines, and schema construction, and gave three XSLT – XSL Transformations: http://en.wikipedia. examples from our own work, were the TEI org/wiki/XSL_Transformations encoding scheme is used. The paper concluded TEI Consortium Web Site: http://www.tei-c.org/ with a discussion of the internationalisation of eZISS – a digital library of Slovenian Literary texts: http://nl.ijs.si/e-zrc/ the TEI Guidelines. jaSlo - an on-line Japanese-Slovene Learner’s Dic- We presented a “light” approach to localisa- tionary: http://nl.ijs.si/jaslo/ tion, appropriate for smaller languages, where the JOS – Slovene Language Corpora for HLT Research: glosses of the TEI elements are translated, and can http://nl.ijs.si/jos/ be then used e.g. for a HTML presentation of TEI TEI element name translations into Slovene: headers; the XML file with all the elements, their http://nl.ijs.ssi/tei/localise/teiLocalise-sl.xml
14a INFOtheca, № 1, vol XI, April 2010