Replacement Draft for TEI P5/CH

1 ODD SUBSET

1.1 Languages and Character Sets

The documents which users of these Guidelines may wish to encode encompass all kinds of material, potentially expressed in the full range of written and spoken human languages, including the extinct, the non-existent, and the conjectural. Because of this wide scope, special attention has been paid to two aspects of the representation of linguistic information that are often taken for granted: language identification and character encoding.

Even within a single document, material in many different languages may be encountered. Human culture, and the texts which embody it, is intrinsically multilingual, and shows no sign of ceasing to be so. Traditional philologists and modern computational linguists alike work in a polyglot world, in which code-switching (in the linguistic sense) and accurate representation of differing language systems constitute the norm, not the exception. The current increased interest in studies of linguistic diversity, most notably in the recording and documentation of endangered languages, is one aspect of this long-standing tradition. Because of their historical importance, the needs of endangered and even extinct languages must be taken into account when formulating Guidelines and recommendations such as these.

Beyond the sheer number and diversity of human languages, it should be remembered that in their written forms they may deploy a huge variety of scripts or writing systems. These scripts are in turn composed of smaller units, which for simplicity we term here characters. A primary goal when encoding a text should be to capture enough information for subsequent users to identify correctly the language, the script, and the constituent characters.

In this chapter we address this requirement, and propose recommended mechanisms to indicate the languages, scripts and characters used in a document or a part thereof. Identification of language is dealt with in 1.1.1 Language identification.
In summary, it recommends the use of pre-defined identifiers for a language where these are available, as they increasingly are, in part as a result of the twin pressures of an increasing demand for language-specific software and an increased interest in language documentation. Where such identifiers are not available or not standardized, these Guidelines recommend a way of documenting language identifiers and their significance, in the same way as other metadata is documented in the TEI Header.

Standardization of the means available to represent characters and scripts has moved on considerably since the publication of the first version of these Guidelines. At that time, it was essential to explicitly document the characters and encoded character sets used by almost any digital resource if it was to have any chance of being usable across different computer platforms or environments, but this is no longer the case. With the availability of the Unicode standard, almost 100,000 different characters representing almost all of the world’s current writing systems are available and usable in any XML processing environment without formality.

Nevertheless, however large the number of standardized characters, there will always be a need to encode documents which use non-standard characters and glyphs, particularly but not exclusively in historical material. Furthermore, the full potential of Unicode is still not yet realised in all software which users of the Guidelines are likely to encounter. The second part of this chapter therefore discusses in some detail the concepts and practice underlying this standard, and also introduces the methods available for extending beyond it, which are more fully discussed in chapter «CEW06».

1.1.1 Language identification

Identification of the language a document or part thereof is written in is a crucial requirement for many envisioned usages of an electronic document.
The TEI therefore accommodates this need in the following way:

• A global attribute lang is defined for all TEI elements. Its value identifies the language used.
• The TEI Header has a section set aside for information about the languages used in a document; for details see «5.4.2 Language Usage».

The value of the attribute lang identifies the language using a coded value. For maximal compatibility with existing processes, modelling this value in the following way is recommended (this parallels the modelling of xml:lang):

• The identifier for the language should be constructed as in RFC 3066 or its successor. This same identifier has to be used to identify the <language> element in the TEI header.

The current draft of Tags for Identifying Languages proposes the following mechanism for constructing an identifier (tag) for languages, as administered by the Internet Assigned Numbers Authority (IANA): the tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script and a region for this language, each possibly followed by a variant subtag.

• The identifier consists of at least one ‘primary’ subtag; it may be followed by one or more ‘extended’ subtags.
• Languages are identified by a language subtag, which may be a two-letter code taken from ISO 639-1 or a three-letter code taken from ISO 639-2.
• ISO 639-2 reserves the range ‘qaa’ through ‘qtz’ for private use. These codes should be used for non-registered language subtags.
• A single-letter primary subtag "x" indicates that the whole language tag is privately used.
• Extended language subtags must begin with the letter "s". They must follow the primary subtag and precede subtags that define other properties of the language. The order is significant.
• Four-character subtags are interpreted as script identifiers taken from ISO 15924.
• Region subtags can be either two-letter country codes taken from ISO 3166 (with exceptions) or three-digit codes from the UN Standard Country Codes for Statistical Use.
• Variant subtags may follow any of the above, but must precede private use extensions.
• Private use extensions are separated from the other subtags by the single-letter subtag "x", which must be followed by at least one subtag. They may consist of several subtags separated with "-", but may not exceed a length of 32 characters.

• – de (German)
  – ja (Japanese)
  – zh (Chinese)
• – zh-Hant (Traditional Chinese)
  – en-Latn (English written in Latin script)
  – sr-Cyrl (Serbian written with Cyrillic script)
• – zh-Hans-CN (Simplified Chinese for the PRC)
  – sr-Latn-891 (Serbian, Latin script, Serbia and Montenegro)
• – zh-SG (Chinese for Singapore)
  – de-DE (German for Germany)
• – zh-CN (Chinese in China, no script given)
  – zh-Latn (Chinese transcribed in the Latin script)
• – de-CH-x-phonebook (phonebook collation for Swiss German)
  – zh-s-min (Min sub-language of Chinese)
  – zh-s-min-s-nan-Hant-CN (Southern variant of Min sublanguage as used in China, written with traditional characters)
  – zh-Latn-x-pinyin (Chinese transcribed in the Latin script using the Pinyin system)

It should be noted that the capitalization given here follows established convention (e.g. capital letters for country codes, small letters for language codes), but RFC 3066 does not ascribe any meaning to differences in capitalization.

As can be seen, both RFC 3066 and ISO 639-2 provide extensions that can be employed by private convention. The constructs mentioned above can thus be used to generate identifiers for any language, past or present, used in any area of the world.
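The construction rules listed above can be sketched as a small parser. The following Python fragment is only an illustration and not part of the draft: the names LanguageTag and parse_language_tag are hypothetical, the treatment of extended subtags (a singleton "s" followed by one subtag) is inferred from the examples, the 32-character limit is assumed to apply to the private use subtags joined by hyphens, and no subtag is checked against the ISO or IANA code lists.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LanguageTag:
    """Parts of a language tag; all names here are illustrative only."""
    language: str = ""
    extended: list = field(default_factory=list)
    script: str = ""
    region: str = ""
    variants: list = field(default_factory=list)
    private: list = field(default_factory=list)


def parse_language_tag(tag: str) -> LanguageTag:
    """Split a tag into its parts following the rules sketched in the text."""
    subtags = tag.split("-")
    if not all(re.fullmatch(r"[A-Za-z0-9]+", s) for s in subtags):
        raise ValueError(f"empty or non-alphanumeric subtag in {tag!r}")

    result = LanguageTag()

    # Everything after the first singleton "x" is a private use extension.
    lowered = [s.lower() for s in subtags]
    if "x" in lowered:
        cut = lowered.index("x")
        subtags, private = subtags[:cut], subtags[cut + 1:]
        if not private:
            raise ValueError('"x" must be followed by at least one subtag')
        # Assumption: the limit applies to the hyphen-joined private part.
        if len("-".join(private)) > 32:
            raise ValueError("private use extension exceeds 32 characters")
        result.private = [s.lower() for s in private]
        if not subtags:
            return result  # a primary "x": the whole tag is private

    primary, rest = subtags[0], subtags[1:]
    if not re.fullmatch(r"[A-Za-z]{2,3}", primary):
        raise ValueError(f"primary subtag {primary!r} is not 2 or 3 letters")
    result.language = primary.lower()

    i = 0
    # Extended language subtags: "s" followed by one subtag (assumed form).
    while i < len(rest) and rest[i].lower() == "s":
        if i + 1 >= len(rest):
            raise ValueError('"s" must be followed by a subtag')
        result.extended.append(rest[i + 1].lower())
        i += 2
    # A four-letter subtag is read as a script identifier.
    if i < len(rest) and re.fullmatch(r"[A-Za-z]{4}", rest[i]):
        result.script = rest[i].capitalize()
        i += 1
    # Two letters or three digits are read as a region.
    if i < len(rest) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", rest[i]):
        result.region = rest[i].upper()
        i += 1
    # Whatever remains before the private part is treated as variants.
    result.variants = [s.lower() for s in rest[i:]]
    return result
```

Because capitalization carries no meaning, the sketch normalizes each part to the conventional form, so that parse_language_tag("ZH-hant-cn") and parse_language_tag("zh-Hant-CN") yield the same result.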
If such private extensions are used within the context of the TEI, they should be documented within the <language> element of the TEI header, which might also provide a prose description of the language described by the language tag.

While language, region and script can be adequately identified using this mechanism, there is only very rough provision for expressing a dimension of time for the language of a document; the codes provided (e.g. "grc" for "Greek, Ancient (to 1453)" in ISO 639-2) might not reflect the segments appropriate for the text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in RFC 3066 and relate that to a <date> or <dateRange> in the corresponding <language> section of the TEI header. Equivalences to language identifiers defined by other authorities can be given in the <language> section as well, but no formal mechanism for doing so has been defined.

The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the lang attribute, including all elements and all attributes where a language might apply. This excludes all attributes for which a non-textual data type has been specified, for example tokens, boolean values or predefined value lists.

References

Phillips, Addison, and Mark Davis. Tags for Identifying Languages. 2004-04-08. Internet Draft, proposed revision for RFC 3066. http://xml.coverpages.org/draft-phillips-langtags-02a.txt

Cover, Robin. Language Identifiers in the Markup Context. http://xml.coverpages.org/languageIdentifiers.html

Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler (Second Edition), and Francois Yergeau (Third Edition). Extensible Markup Language (XML) 1.0 (Third Edition). W3C Recommendation, 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/

1.1.2 Characters and Character Sets

All document encoding has to do with representing one thing by another in an agreed and systematic way. Applied to the
