<<

TUGboat, Volume 19 (1998), No. 4 403

Language Support

Cyrillic encodings for LATEX2ε multi-language documents . Berdnikov, . Lapko, . Kolodin, A. Janishevsky, A. Burykin

Abstract This paper describes X2 and the T2A, T2B, T2C encodings designed to support Cyrillic writing sys- tems for the multi-language mode of LATEX2ε.The encoding X2 is the “Cyrillic glyph container” which canbeusedtoinsertintoLATEX2ε documents text fragments from all modern Cyrillic writings, but it does not strictly obey all the rules of LATEX2ε. The encodings T2A, T2B and T2C are the “true” LATEX2ε encodings which satisfies all the require- ments of the LATEX2ε kernel, but as a result three encodings are necessary to support the whole variety of languages based on the Cyrillic . These restrictions of the LATEX2ε kernel, the specific features of Cyrillic writing systems and the basic principles used to create the encodings X2 and T2A, T2B, T2C are considered. This project supports all the Cyrillic writing systems known to us, although the majority of the accented letters need to constructed using internal TEX tools. The X2 encoding was approved at CyrTUG- 97 — the annual conference of Russian-speaking TEX users and was previously presented at the EuroTEX- 98 Conference. The encodings X2, T2A, T2B and T2C were intensively discussed on the cyrtex-t2 mailing list. 404 TUGboat, Volume 19 (1998), No. 4

1 Introduction • T2A, T2B, T2C — the encodings which strictly A The project (originally named T2) to produce - satisfy the requirements necessary for LTEX2ε codings necessary to support modern Cyrillic lan- multi-language mode. With these encodings it is safe to mix different languages inside your guages in LATEX2ε multi-language mode was initi- ated at the TUG-96 Conference in Dubna. this pa- documents and to use large pieces of text with- per presents the results of that project. Although out any problems. The price is that it is neces- some minor corrections could still appear as new in- sary to have three encodings (and an enormous formation about minor Cyrillic writings appear, the number of fonts each in agreement with the en- kernel of the project seems to be stable. The encod- coding conventions of the European Computer ing support includes: Modern fonts) to support the whole variety of Cyrillic . Some Cyrillic languages are • LHfont font collection version 3.2 — the Com- supported by one or two encodings from the puter Modern Cyrillic fonts and European T2∗ encoding family, some are supported by all Computer Modern Cyrillic fonts created by three encodings. The encodings are in agree- O.Lapko, ment with the LATEX2ε multi-language mode • T2enc macro package created by . Volovich and encoding paradigm. Although there are no and . Lemberg — the input and output en- obstacles to the use of these encodings for orig- coding and font definitions necessary for the inal Cyrillic texts, native users may have differ- LATEX2ε packages fontenc and inputenc, ent preferences; • the hyphenation patterns in encoding-indepen- • LR0 and LR1 — the encodings which combine dent style: ashyphen by A. Slepuhin, ruhyphen the OT1 encoding for 0 – 127 and Cyrillic let- by . Vulis, lvhyphen by M. Vorontsova and ters and symbols from the leading languages of . Lvovski, znhyphen by S. Znamenskii, the Former for 128 – 255. (The • rusbabel package created by V. Volovich and encoding LR0 contains the Russian letters only, W. Lemberg to support Cyrillics (based on new the encoding LR1 contains in addition the most encodings) in babel. frequently used national letters.) These en- The β-versions of these packages are on the codings are as close as possible to X2 and TEXLive 3 CD-ROM distribution. The final versions T2A/T2B/T2C and are designed mainly for are available on CTAN.1 non-LATEX formats based on the CM font fam- Support for Cyrillics includes the following en- ily and the original Plain TEX. There is some codings: hope (not confirmed at this moment) that LR0 • X2 — the Cyrillic glyph container which con- and LR1 may become the inter-platform and tains all the glyphs necessary to support mod- inter-format standard for representing Russian ern Cyrillic writing. It does not obey all the letters inside TEX (as ASCII is the standard for English); specifications that LATEX2ε requires for an en- coding with the prefix , but as a result it • LR — local encodings necessary to support is enough to have just one encoding to insert individual Cyrillic languages. The T2∗ encod- in LATEX2ε documents the characters, words, ings are intended for the multi-language mode names, bibliography references, short sentences, of LATEX2ε and for this reason may be not well citations, etc., specific for all modern Cyrillic suited to bilingual documents or to the prefer- writings without too large an increase in the ences of the native users. It is definitely not number of fonts used for this purpose (.., this the task of the T2 working group, but that of encoding is a tool which enables -writing the national TEX User Groups, to organize local people to occasionally use Cyrillics in their doc- encodings that are useful for their language; uments). The price is that some Cyrillic letters • TS2 — the encoding containing accents, non- do not exist as a separate glyphs but must be symbols, etc., necessary for Cyrillic writ- composed from pieces (accents and modifiers) ings which are outside X2 and T2∗ encodings. contained in X2, and the user must obey some This encoding is under consideration, and with rules of safety (described in section 6) since X2 great probability all necessary additional glyphs does not satisfy all the requirements obligatory could be added to TS1 encoding. (The latter   for T n encodings; case has the advantage that it prevents the in- 1 /fonts/cyrillic, /languages/hyphenation/ruhyphen, crease of the number of fonts needed to support and /macros/latex/contrib/supported/t2. multi-language mode in LATEX2ε); TUGboat, Volume 19 (1998), No. 4 405

• LWN — the encoding which generalizes the ation symbols, etc.) can be mixed with the WNCyr font family by adding new Cyrillic glyphs from the corresponding Tn font. letters and new substitution pairs (ligatures) Sn — Symbol encodings. The style of fonts in Sn based on ASCII Latin input. It is suitable for encoding need not be synchronized with that Latin-writing users who use Cyrillics only occa- of Tn fonts. These encodings are used for sionally. (This encoding is still under develop- arbitrary symbols, ‘dingbats’, ornaments, frame ment and is not described here.); elements, etc. • T5 encoding(s) to support Old Slavonic, Gla- An — Encodings for special Applications (not cur- golitic, , etc., writings. The rently used). project to develop these encodings has just E∗ — Experimental encodings but those intended for started and its discussion is outside the scope wide distribution (currently used for the ET5 of this publication. proposal for Vietnamese). ∗ — Local, unregistered encodings (for example, 2LATEX2ε system of encodings the LR0, LR1 and LRn encodings mentioned The following types of encodings are recognised by above). the LAT Project:2 E OM∗ — 7 bit Mathematics encodings.   n — essentially 7 bit ‘old’ encodings. Typically M∗ — 8 bit Mathematics encodings. these will be small modifications of the origi- — Unknown (or unclassified) encoding. nal TEX encoding, OT1 (for example, OT4, a variant for Polish). 3 Specifications for Tn and Xn Tn —8bitTextEncodings.Tn encodings are encodings the main text encodings that LATEX uses. They There are two main restrictions to be fulfilled before have some essential technical restrictions to an encoding may be considered as an encoding with enable multilingual documents with standard the prefix ‘T’ satisfying the requirements of the TEX: (a) they should have the basic Latin al- LATEX2ε kernel: phabet, the digits and punctuation symbols in • the \lccode–\uccode pairs should be the same the ASCII positions, (b) they should be con- as they are in the LATEX2ε kernel (i.e., as they structed so that they are compatible with the are in the T1 encoding); lowercase code used by T1. Further discussion • of the technical requirements for Tn encod- the Latin characters and symbols: !, ’, (, ings is given in section 3. ), *, +, ,, -, ., /, :, ;, =, ?, [, ], ‘, |, @ (questionable), 0–9, A–, a–z should be at   X n — other 8 bit Text Encodings (eXtended, or the positions corresponding to ASCII, and the eXtra, or X=Non Latin). Sometimes it may be symbols produced by the ligatures --, ---, ‘‘, necessary, or convenient, to produce an encod- ’’ (at arbitrary positions). ing that does not meet the restrictions placed If the encoding requires the redefinition of the values on the Tn encodings. Essentially arbitrary \lccode–\uccode, or if it does not contain the text encodings may be registered as Xn, but necessary Latin characters in the ASCII positions, it is the responsibility of the maintainers of the it will produce undesirable effects in some situations encoding to clearly document any restrictions inside LAT X2 and will make use of the encoding on the use of the encoding. E ε incompatible with the general multi-language mode.   TS n — Text Symbol Encodings. Encodings of The reasons for such restrictions are explained in symbols that are designed to match a cor- detail in [1]. responding text encoding (for example, para- graph signs, alternative digit forms, etc.). The Although the LAT X Team’s technical specifications   E font style of fonts in TS n encoding will or- for Xn encodings are less restrictive than those for dinarily be changed in parallel with that of ‘ordinary’ text encodings, there are corresponding   the fonts in T n encoding using NFSS mecha- restrictions on their use, and some desirable proper-   nisms. As a result, at any moment the TS n ties for them to have. In particular: font style is compatible with the Tn font and • If the encoding does not have Latin letters in the glyphs from TSn font (accents, punctu- ASCII slots then the users must take care not 2 The following text is slightly adapted from a post by to enter such text, otherwise ‘random’ incorrect David Carlisle to the mailing list cyrtex-t2. output will be produced, with no warning from 406 TUGboat, Volume 19 (1998), No. 4

the LATEX system. Also, care must be taken especially taking into account the \lccode–\uccode with ‘moving’ text that is generated internally restrictions. So it is necessary to accept some within LATEX (such as cross references), which principles of selection which enable us to decrease may fail if the encodings change; the number of Cyrillic glyphs included in X2: • To reduce the problems with cross reference in- 1. The X2 encoding follows the LATEX2ε agree- formation, the LATEX maintainers strongly rec- ments about \lccode–\uccode not to produce ommend that at least the digits and ‘com- garbage for the headings, table of contents, mon’ punctuation characters are placed in their hyphenations inside paragraphs, arguments of ASCII slots; \uppercase and \lowercase; • If the encoding uses a lowercase table that is 2. All glyphs used in publishing for some language incompatible with the lowercase table of T1, are included in X2 if they cannot be constructed then it is not possible to mix this encoding and as accented letters or letters with additional aTn encoding within a single paragraph, and modifiers using TEX commands. Variant glyphs obtain correct hyphenation with standard TEX. for are also included in X2 If the Xn encoding does not use a lowercase ta- if there is some free space and if different ble that is compatible with that of T1, the package languages use different variants; supporting this encoding should ensure that encod- 3. The X2 encoding includes all punctuation sym- ing switches only happen between paragraphs (or bols, digits, mathematical symbols, accents, hy- that hyphenation is suppressed when temporarily phens, dashes, etc., needed to form the full set switching to the new encoding). It should be noted of symbols necessary for Cyrillic typography; that this restriction on the lowercase table applies 4. The additional Cyrillic letters which are used in only to systems using standard TEX(version3and the PC 866 and MS Windows 1251 code pages later). Using ε-TEX version 2 will remove the need are included in X2 even if they are accented for this restriction as the hyphenation system has forms; been improved — it will use a suitable lowercase ta- 5. Glyphs which are not used now but which were ble for each language (the table will be stored along used at some stage in the 20th century may with each language hyphenation table), and surely be included if there are good reasons to do it deals not at all with the system. so (as, for example, with the old Russian and 4“Cyrillic glyph container” — the X2 Bulgarian letters); encoding 6. Glyphs which were used in old Cyrillic texts be- The encoding X2 should include all the glyphs neces- fore 1900 (Old Slavonic, Church Slavonic, Gla- golitic, old phonetic symbols, etc.) should be sary to represent in LATEX2ε documents containing texts from stable Cyrillic languages. The basis of moved to a separate glyph container. There X2 is the (since it is the main lan- could also be an additional glyph container to guage used for publication in Cyrillic). Taking ac- collect the exotic glyphs used in some contem- count of the variety of old Cyrillic texts, only those porary Cyrillic texts; modern alphabets which are still in use are included 7. When jettisoning accented letters it is neces-

sary to take into account that they may be nec-

Ü Ý in X2. The exceptions are the characters X/ , / ,

essary for hyphenation patterns for some lan- Þ Z/ which were used in Russian and Bulgarian texts at the beginning of the 20th century. guages (if such patterns have been created or The X2 encoding is designed so that by com- if there is a chance that they will be created bining "00 – "7f from OT1 and "80 – "ff from X2 sometime). For example, accented letters for one can construct an encoding which is adequate to Russian, Ukrainian and Belorussian, Kazakh, support the most common Cyrillic languages. This Tatar, and Bashkir are included in X2; permits use of X2 as the base Cyrillic encoding 8. When deciding whether to jettison an accented for a variety of TEXformats(Plain, AMS-TEX, letter that is used in a language supported by BLUETEX, LATEX 2.09, etc.) as well as LATEX2ε. LR1, one must keep in mind that only the CM (This local encoding is called LR1 below. The accents are available in that encoding; design aim for LR1 was to select glyphs required by 9. The following priorities are used when the ac- the most widely-used languages and to put them cented letters or letters with simple modifiers into the 128 – 255 section of X2.) are thrown away: (0) letters which are easily Unfortunately the full set of glyphs including constructed by the internal command \accent accented letters is too big to fit into 256 slots, (so that the letters using accents available in TUGboat, Volume 19 (1998), No. 4 407

CM fonts have lower significance); (1) letters lower comma are also used as accents. The lower which contain a centered below the let- accents will be constructed using TEX commands

ter (cedilla, ogonek, dot, ) and are easily from the upper accents available in X2.  constructed using a command similar to \ in The accents  ("12)and ("13)areusedas Plain TEX; (2) letters which contain a horizon- stresses in Serbian; there is no letter in any Cyrillic tal stroke positioned symmetrically; (3) letters language where these symbols are used as “normal”

which require special alignment of accents and accents.  modifiers; The quasi-letters ³ (, "27), (double

10. Accents and modifiers used in Cyrillic are in- apostrophe, "22)and­ (, "0d)areused cluded in X2 even if all accented forms are in- like letters in some languages but do not have cluded in X2 for some other reasons (an exam- uppercase and lowercase forms (i.e., for these letters

the uppercase form is just the same as the lowercase â ple is Cyrillic used for É and ); form). 11. Latin letters or glyphs which are similar to some Single quotes are not used in Cyrillic writing, Latin letter (used in Macedonian, Kurdish, etc.) and for this reason there is no need to keep single are placed at the same positions as the Latin French quotes. Instead, in their place, the angle

letters are in ASCII. Among other things, this ¯ brackets Æ ( )and ( ) are provided. Angle increases the number of languages supported by "0e "0f brackets are used in Cyrillic typography, and it is the LR1 encoding; good if their style is changed in parallel with the 12. Whenever it is possible, glyphs (ASCII, accents, style of other symbols. special symbols, etc.) are placed at the same The Cyrillic breve “ ”("14)isaveryfamous positions as they are placed in T1 encoding. glyph (it is even included in the Adobe and Word-

The X2 encoding is shown in Fig. 1. The Rus- Perfect Cyrillic fonts). Although all letters with this

ß à ÿ ì ø É é â ² sian letters ü – , – (except and ) are placed accent ( / , / ) are included in X2, it is included

in the only region in the encoding table where 32 as a special glyph as well. ¬

consecutive letter positions are available — i.e., po- Cedilla “ «”("0b) and ogonek “ ”("0c)areused

á E

sitions "c0 – "df and "e0 – "ff. The Russian letters by some letters already included in X2 (†, , ). ø ì and are placed at the end of the block "80 – These letters have variant forms where cedilla could "9c and "a0 – "bc which simplifies the ordering of be oriented to the left or to the right depending

non-Russian letters. Latin letters and letters similar on the user’s taste. Also, some applications use

å ‚ Ê to Russian letters are placed as in ASCII. Letters ogonek instead of descender for ‰, , , ,etc. used in other Cyrillic alphabets are grouped into The availability of cedilla and ogonek in X2 makes the parts "80 – "ff and "00 – "7f of the encoding it possible to satisfy these needs.

table according to the “popularity” of the corre- Percentage zero “ ”("18) is included as a useful

sponding languages (to satisfy the requirements of idea borrowed from the T1 encoding and EC fonts: ± the LR1 encoding). They are placed in free posi- this symbol is used to convert ‘±’into‘ ’and A tions reserved by LTEX2ε for letters in some quasi- ‘±’. alphabetic order. The old Russian and Bulgarian Punctuation ligatures, i.e., the symbols pro- letters are placed at the end of the block of letters duced by the abbreviations -- (endash, "15), --- in "00 – "7f. (emdash, "16), ‘‘ (opening English quotes, "10), ’’ Accents and modifiers are placed in X2 at "00 – (closing English quotes, "11)areusedinthesame "1f; those also used by T1 are placed at the same manner and are placed at the same position as in positions as in T1. The same is true for additional T1, as is - (the hyphen used for hanging Hyphen- symbols produced by the ligatures --, ’’,etc.The ation, "7f). It is worth noting that the --- punctuation symbols, digits, mathematical symbols, (emdash, "16) corresponds to Cyrillic emdash which etc., are placed as they are positioned in ASCII. A (following the traditions of Russian typography) is

special case is made of the symbols íîïù úûwhich much shorter than that glyph in Latin-encoded CM are essential for Russian typography. These symbols and EC fonts.

are placed in "80 – "ff at the positions reserved for There are the special cross-modifiers: horizon-  symbols, to guarantee the correctness of the LR1 tal “ ”at"17, grave-diagonal “ ”at"19 and acute-

encoding. diagonal “”at"1a which are used to construct from

ä ð á ± † ¦ Some accents (macron, dot) can be used as pieces the letters €/ , / , / and / used in lower accents as well for systems. some minor Cyrillic languages. (These letters are In some specific cases the upper comma ("1b)and included as separate glyphs in T2A/T2B/T2C, but

408 TUGboat, Volume 19 (1998), No. 4

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

¡ ¢ £ ¤ ¥ ¦ §

¼Ü ¼Ü

¨ © ª « ¬ ­ Æ ¯

       

½Ü ½Ü

      

! " # ° ± ² ³

¾Ü ¾Ü

´ µ ¶ · ¸ ¹ º »

¼ ½ ¾ ¿ 4 5 6 7

¿Ü ¿Ü

8 9 : ; < = > ?

@ A B C D E G

4Ü 4Ü

À Á Â Ã Ä Å Æ

È É Ê Ë Ì Í Î Ï

5Ü 5Ü

X Y Z [ \ ] ^ _

` a b c d e f g

6Ü 6Ü

i Ð Ñ Ò Ó

Ô Õ Ö × Ø Ù Ú Û

7Ü 7Ü

Ü Ý Þ ß | } ~ 

€  ‚ ƒ „ † ‡

8Ü 8Ü

ˆ ‰ Š ‹ Œ  Ž DZ

à á â ã ä å æ ç

9Ü 9Ü

è é ê ë ì í î ï

¡ ¢ £ ¤ ¥ ¦ §

AÜ AÜ

¨ © ª « ¬ ­ ® ¯

° ± ² ³ ð ñ ò ó

BÜ BÜ

ô õ ö ÷ ø ù ú û

ü ý þ ÿ Ä Å Æ Ç

CÜ CÜ

È É Ê Ë Ì Í Î Ï

Ð Ñ Ò Ó Ô Õ Ö ×

DÜ DÜ

Ø Ù Ú Û Ü Ý Þ ß

à á â ã ä å æ ç

EÜ EÜ

è é ê ë ì í î ï

ð ñ ò ó ô õ ö ÷

FÜ FÜ

ø ù ú û ü ý þ ÿ

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

Figure 1: The “Cyrillic glyph container” X2 TUGboat, Volume 19 (1998), No. 4 409

unfortunately the limit of 256 characters prevents emdash which is shorter than that in OT1 and including them in X2.) T1);

The positions "1c/"1d and "1e/"1f are used 6. The ff-ligatures (ff, fi, fl, ffi, ffl) are included at

  for the exotic letters / and / used by some mi- "1b – "1f as in T1 to keep the full set of Latin nor Cyrillic alphabets. Although following LATEX2ε glyphs necessary for typography; rules these positions should be used for symbols,not 7. The standard accents and symbols, and Cyril- for letters,wehavemadethemexceptionsfromthe lic-specific accents and symbols used in "00 – A severe LTEX2ε requirements. Formally speaking, "1a of X2, are reproduced in T2∗ at the same A the LTEX2ε requirements are not violated since the positions except the cross-modifiers which are \lccode–\uccode data for these positions conserve not necessary (the letters with cross-modifiers the original values. Instead of the explicit use of are included as separate glyphs); the \lccode–\uccode mechanism, the lowercase – 8. The Russian alphabet is reproduced similar to uppercase conversion is performed by the LAT X2ε E X2; \MakeUppercase and \MakeLowercase transforma-

tions using the list \@uclclist of identifiers. As a 9. The symbols specific to Cyrillic typography (í

result these “letters” cannot be used in hyphenation îïù úû) are reproduced in "9d, "9e, "9f, "bd, patterns and they break the automatic hyphenation "be and "bf, similar to X2; whenever they appear in a word. The gain is that 10. Positions "80 – "9b and "a0 – "bb are used for even exotic Cyrillic texts could be created (if neces- national Cyrillic letters (these are the only po- sary) using X2 only and without additional encod- sitions that differ from the T2∗ encodings since ings and fonts. all other codes are already fixed as described above); 5“True”LAT X2 encodings T2A, T2B, E ε 11. To prevent an increase in the number of en- T2C codings up to infinity, the accented letters for ThebasefeaturesoftheT2∗ encodings are deter- Cyrillic languages which do not have the hy- mined by the LATEX2ε specifications for Tn encod- phenation tables in TEX format (and rarely will ings and by the already created X2 encoding. Some have in future) are not included; more requirements are added due to the necessity to 12. Equivalent letters occupy just the same posi- ∗ keep the fonts in the T1, X2 and T2 encodings com- tions in all the T2∗ encodings whenever possi- patible. For example, it is necessary to keep similar ble (i.e., some glyphs may be absent in some glyphs at the same positions in all fonts whenever it encodings, but if the glyph is included in an en- is possible. As a result the following basic principles coding, it occupies the same position as in the appear. other encodings). ∗ 1. The set of T2 encodings supports the full set The encodings can therefore be decomposed of modern Cyrillic languages, each Cyrillic lan- into the following regions: guage is supported at least by one encoding T2∗ – the accent, non-ASCII punctuation and so that there is no necessity to mix encodings ligature symbols ("00 – "1f), for some languages; – the ASCII-encoded Latin letters, digits, 2. Cyrillic letters occupy the positions reserved punctuation and mathematical symbols, in LAT X2 for letters, Cyrillic symbols — po- E ε etc. ("20 – "7f), sitions reserved for symbols, Cyrillic accents —

positions reserved for accents; – non-letter symbols at "9d – "9f and "bd – úû "bf (íî ï ù ), 3.LettersincludedinT2∗ follow the LATEX2ε con- vention about uppercase and lowercase letters – specific (uppercase and lowercase) Cyrillic (i.e., have the same \uccode–lccode assign- letters ("80 – "9b, "a0 – "bb), ments as in T1); – uppercase and lowercase Russian letters 4. ASCII glyphs (Latin letters, digits, mathemat- ("c0 – "df, "e0 – "ff, "9c, "bc). ical and punctuation symbols, etc.) are placed The accent part (see Fig. 2) is copied from X2 with at "20 – "7f; minor changes: 5. The symbols produced by the ligatures --, ---, a) the exotic letters ("1c – "1f) and the up- ‘‘, ’’ are included at the same positions as per comma accent ("1b) are substituted by they are in T1 and X2 (it is worth noting the ff-ligatures “ff ”, “fi”, “fl”, “ffi”, “ffl”, that, as in X2, the emdash glyph is the Cyrillic

410 TUGboat, Volume 19 (1998), No. 4

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

¡ ¢ £ ¤ ¥ ¦ §

¼Ü ¼Ü

¨ © ª « ¬ ­ Æ ¯

       

½Ü ½Ü

      

a) Accents, ligatures, special symbols, etc.

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

! " # ° ± ² ³

¾Ü ¾Ü

´ µ ¶ · ¸ ¹ º »

¼ ½ ¾ ¿ 4 5 6 7

¿Ü ¿Ü

8 9 : ; < = > ?

@ A B C D E F G

4Ü 4Ü

À Á Â Ã Ä Å Æ Ç

È É Ê Ë Ì Í Î Ï

5Ü 5Ü

X Y Z [ \ ] ^ _

` a b c d e f g

6Ü 6Ü

h i j k Ð Ñ Ò Ó

Ô Õ Ö × Ø Ù Ú Û

7Ü 7Ü

Ü Ý Þ ß | } ~ 

b) ASCII glyphs

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

ü ý þ ÿ Ä Å Æ Ç

CÜ CÜ

È É Ê Ë Ì Í Î Ï

Ð Ñ Ò Ó Ô Õ Ö ×

DÜ DÜ

Ø Ù Ú Û Ü Ý Þ ß

à á â ã ä å æ ç

EÜ EÜ

è é ê ë ì í î ï

ð ñ ò ó ô õ ö ÷

FÜ FÜ

ø ù ú û ü ý þ ÿ

c) Russian letters

Figure 2: Common parts in T2A/T2B/T2C TUGboat, Volume 19 (1998), No. 4 411

b) the cross-modifiers are substituted by the Uzbek, Ukrainian, Hanty-Obskii, Hanty-Sur- compound-word-mark symbol3 ("17), the gut, Gipsi, Chechen, Chuvash, Crimean-Tatar; dotless-i ("19)andthedotless-j ("1a) like T2B: Abaza, Avar, Agul, Adyghei, Aleut, Al- it is in T1. tai, Balkar, Belorussian, Bulgarian, Buryat, The ASCII-encoded Latin letters, digits, punctu- Gagauz, Dargin, Dolgan, Dungan, Ingush, Itel- ation and mathematical symbols at "20 – "7f are men, Kabardino-Cherkess, Kalmyk, Karakal- placed exactly as in the T1 encoding as shown in pak, Karachaevskii, Karelian, Ketskii, Kirgiz, Fig. 2. Komi-Zyrian, Komi-Permyak, Koryak, Kumyk, The Russian alphabet letters and non-letter Kurdian, Lak, Lezgin, Mansi, Mari-Valley, Mol- symbols at "9d – "9f, "bd – "bf are just copied from davian, Mongolian, Mordvin-Moksha, Mord- the corresponding positions in the X2 encoding (see vin-Erzya, Nanai, Nganasan, Negidal, Nenets, Fig. 2). Nivh, Nogai, Oroch, Russian, Rutul, Selkup, The component which is different in T2A/T2B/ Tabasaran, Tadjik, Tatar, Tati, Teleut, To- T2C/ are the national letters at "80 – "9b and "a0 – falar, Tuva, Turkmen, Udyghei, Uigur, Ulch, "bb, which are shown in Fig. 3 and Table 1. Al- Khakass, Hanty-Vahovskii, Hanty-Kazymskii, though tried to fulfill the requirement that the Hanty-Obskii, Hanty-Surgut, Hanty-Shurysha- equivalent letters should occupy the same positions rskii, Gipsi, Chechen, Chukcha, Shor, Evenk, in all encodings, it was impossible to completely ful- Even, Enets, Eskimo, Yukagir, Crimean Tatar,

fill this requirement. Fortunately there are only two Yakut;

k Å Ñ exceptions: the letters Ã/ and / are placed at T2C: Abkhazian, Bulgarian, Gagauz, Karelian, "87/"a7 and "9b/"bb in T2A, but at "88/"a8 and Komi-Zyrian, Komi-Permyak, Kumyk, Mansi, "99/"b9 in T2B. All other letters and symbols in Moldavian, Mordvin-Moksha, Mordvin-Erzya, T2A, T2B and T2C (and in the accent part "00 – Nanai, Orok (Uilta), Negidal, Nogai, Oroch, "1a, symbol/digit part "20 – "3f, "5b – "5f, "7b – Russian, Saam, Old-Bulgarian, Old-Russian, "7f and lower part "80 – "ff of X2) have fixed po- Tati, Teleut, Hanty-Obskii, Hanty-Surgut, sitions. Evenk, Crimean Tatar. A summary of the languages covered by T2A/T2B/T2C is shown below. The encoding T2A 6 The Cyrillic glyph container X2 versus contains the leading languages sorted by using sta- Cyrillic encodings T2A, T2B, T2C tistical data on populations. The encoding T2B con- As was already specified, there are two main require- tains the majority of the remaining languages. Fi- ments essential to the reliable working of LATEX2ε nally, the encoding T2C contains several languages in multi-language mode: with exotic letters which do not fit into T2A or • we must keep the \lccode – \uccode table as T2B: Abkhazian, Orok (Uilta), Saam (Lappish), in T1, Old-Bulgarian, Old-Russian. Due to the intersec- • tions between Cyrillic alphabets some languages are we must keep the ASCII encoding for positions supported by two or three encodings simultaneously: 32 – 127. Both requirements are fulfilled by the T2A/T2B/ T2A: Abaza, Avar, Agul, Adyghei, Azerbaidzan, T2C encodings and hence they can be safely mixed Altai, Balkar, Bashkir, Belorussian, Bulgar- with the Latin encodings OT1 and T1 inside a doc- ian, Buryat, Gagauz, Dargin, Dungan, Ingush, ument. The encoding X2 conserves the \lccode – Kabardino-Cherkess, Kazah, Kalmyk, Karakal- \uccode values but does not contain these ASCII pak, Karachaevskii, Karelian, Kirgiz, Komi- glyphs. As a result it may cause problems and unex- A Zyrian, Komi-Permyak, Kumyk, Lak, Lez- pected effects inside LTEX2ε documents if the user gin, Macedonian, Mari-Mountain, Mari-Val- is not careful enough. So, why do we need X2 when ley, Moldavian, Mongolian, Mordvin-Moksha, we have T2A/T2B/T2C? Mordvin-Erzya, Nogai, Oroch, Osetin, Rus- The reason is that the requirement to keep all sian, Rutul, Serbian, Tabasaran, Tadjik, Tatar, the ASCII glyphs is very restrictive — it leaves only 4 Tati, Teleut, Tofalar, Tuva, Turkmen, Udmurt, 61 positions for non-ASCII letters. To fit all Cyril- lic letters into Tn encodings requires three tables 3 The empty character with the zero thickness and the height equal to 1ex used in EC fonts and T1 encoding for 4 For Cyrillic encodings it is even more restrictive: it is special applications — such as hyphenating compound words, necessary to keep 32 base Russian letters in each encoding breaking down ligatures, creating accents to be placed over as well since they are encountered in almost all Cyrillic the invisible space between two letters. alphabets.

412 TUGboat, Volume 19 (1998), No. 4

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

€  ‚ ƒ „ † ‡

8Ü 8Ü

ˆ ‰ Š ‹ Œ  Ž DZ

à á â ã ä å æ ç

9Ü 9Ü

è é ê ë ì í î ï

¡ ¢ £ ¤ ¥ ¦ §

AÜ AÜ

¨ © ª « ¬ ­ ® ¯

° ± ² ³ ð ñ ò ó

BÜ BÜ

ô õ ö ÷ ø ù ú û

a) T2A encoding

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

€  ‚ ƒ „ † ‡

8Ü 8Ü

ˆ ‰ Š ‹ Œ  Ž DZ

à á â ã ä å æ ç

9Ü 9Ü

è é ê ë ì í î ï

¡ ¢ £ ¤ ¥ ¦ §

AÜ AÜ

¨ © ª « ¬ ­ ® ¯

° ± ² ³ ð ñ ò ó

BÜ BÜ

ô õ ö ÷ ø ù ú û

b) T2B encoding

ܼ»Ü8 ܽ»Ü9 ܾ»ÜA Ü¿»ÜB Ü4»ÜC Ü5»ÜD Ü6»ÜE Ü7»ÜF

€  ‚ ƒ „ † ‡

8Ü 8Ü

ˆ ‰ Š ‹ Œ  Ž DZ

à á â ã ä å æ ç

9Ü 9Ü

è é ê ë ì í î ï

¡ ¢ £ ¤ ¥ ¦ §

AÜ AÜ

¨ © ª « ¬ ­ ® ¯

° ± ² ³ ð ñ ò ó

BÜ BÜ

ô õ ö ÷ ø ù ú û

c) T2C encoding

Figure 3: The national Cyrillic letters in T2A/T2B/T2C TUGboat, Volume 19 (1998), No. 4 413

Code T2A T2B T2C "80/"a0 ghe-upturned ghe-with-descender hcrossed -with-tail "81/"a1 ghe-hcrossed ghe-hcrossed + "82/"a2 (T+h with tail) ghe-with-descender te-with-descender "83/"a3 (T+h) ghe-with-tail ghe-with-tail "84/"a4 h-special h-special h-special "85/"a5 -with-descender zhe-with-descender -with-descender "86/"a6 -with-descender -phonetical er-gravecrossed "87/"a7 zet zet "88/"a8 i-with-umlaut lje -with-descender "89/"a9 -with-descender ka-with-descender ka-with-descender "8a/"aa ka-with-left-poker -with-descender el-with-descender "8b/"ab ka-vcrossed ka-with-tail ka-hcrossed "8c/"ac a+e el-with-tail el-with-tail "8d/"ad en-with-descender en-with-descender en-with-descender "8e/" en+ghe en+ghe em-with-tail "8f/"af S en-with-tail en-with-tail "90/"b0 o-barred o-barred o-barred (fita) "91/"b1 -with-descender es-acutecrosseds abkhazian-ch "92/"b2 u-with-cyrbreve u-with-cyrbreve abkhazian-ch with descender "93/"b3 Y-special Y-special () "94/"b4 Y-special-hcrossed ha-hcrossed "95/"b5 ha-with-descender ha-with-descender ha-with-descender "96/"b6 tse-macedonian ha-with-tail tse-macedonian "97/"b7 -vcrossed che-with-left-descender abkhazian-ha "98/"b8 che-with-descender che-with-descender che-with-descender "99/"b9 e-ukrainian en-with-right-tail "9a/"ba shwa shwa shwa "9b/"bb nje epsilon big

Table 1: The national Cyrillic letters in encodings T2A/T2B/T2C to achieve the coverage of X2: the problem is that ASCII in 32 – 127, but they do have the standard, most characters in the Tn tables are the same. compatible \lccode–\uccode values. And to support each such Tn encoding it is neces- Currently the LAT X Team only supports the sary to have a separate font class like the EC fonts. E Tn encodings and TS encodings, while the support To keep such enormous numbers of fonts is of an Xn encoding is entirely the responsibility of too large a price for people who use Cyrillics only the designer of the encoding.5 For the “glyph con- occasionally. On the other hand, if all the Cyrillic tainer” X2 such support should include listing some glyphs are put into just one table without the Latin simple rules which should be followed by Users so letters in 32 – 127, but in a way that satisfies the as to avoid strange and undesirable effects. The X2 \lccode–\uccode requirements, one table and one encoding does not contain Latin ASCII letters but font class is enough provided the user obeys some only digits, punctuation and mathematical symbols, elementary rules of safety. This “economy mode” is etc., therefore the rules should guarantee that text implemented by X2. containing Latin ASCII letters is never used with the X2 encoding: There is a similar situation for Old Slavonic characters and some other encodings which are only occasionally used by normal users. To resolve this problem, “glyph containers” like X2 could again 5 It seems that there should be support for such “glyph be helpful. The “glyph container” encodings Xn container” encodings by the LATEX Team as well (such sup-   port should include the registration procedure for glyph con- should be an intermediate case between T n and tainers and maintenance of the official list of exceptions where the “free style” Xn: such encodings do not have the glyph container encodings produce undesirable results). 414 TUGboat, Volume 19 (1998), No. 4

1. If you use ASCII Latin letters in the text part of (d) the commands that write, explicitly or your document, some Latin encoding (i.e., OT1 implicitly, text to external files, which may or T1) must be active for this piece of your text; be loaded outside the X2 encoding: 2. The following commands should be used only \addtocontents, \addcontentsline, outside the range where the X2 encoding is \glossary, \index, \part, active (or should be preceded by the explicit \chapter, \section, \subsection, specification of some “Latin” encoding) because \subsubsection, \paragraph, they may implicitly include the Latin text in \subparagraph, \caption. your document: Similarly, if such a floating command includes \part, \chapter, \section, Latin letters and the resulting object may ap- \subsection, \subsubsection, pear inside the range where X2 is active, some \paragraph, \subparagraph, \caption, Latin encoding (i.e., OT1 or T1) should be acti- \tableofcontents, \listoftables, vated explicitly before the Latin-encoded text. \listoffigures, \ref, \pageref, \cite, 4. Just the same requirement holds for all com- \item, \labelenumxx, \labelitemxx, mands which can occasionally insert Cyrillic or \makelabel, \numberline, \thechapter, Latin text where the Latin encodings OT1/T1 \thesection, \thesubsection, \date, or the “Cyrillic glyph container” encoding X2 \today. are active. For example, you should be care- Exactly the same requirement is needed for ful with the definition and re-definition of the all user-defined macros (or those loaded from following commands: external packages) that have ASCII Latin text (a) commands which create automatically in their body but without explicit specification generated text used by other commands: of the “Latin” encodings OT1 or T1 for this \labelenumxx, \labelitemxx, text; \makelabel, \numberline, \thechapter, 3. When you deal with floating objects (or moving \thesection, \thesubsection, \today, arguments), you should not rely on the assump- (b) commands which are used in international tion that the Cyrillic letters are used by LAT X E LATEX to define language-specific names: when the floating material is inserted into the \abstractname, \appendixname, document. For example, if Cyrillic letters are \alsoname, \ccname, \chaptername, used inside some command defining the floating \contentsname, \enclname, object, the encoding X2 should be activated ex- \headtoname, \figurename, \indexname, plicitly in front of Cyrillic letters even if X2 is \listfigurename, \listtablename, active at the point where the command is is- \notesname, \pagename, \partname, sued. Among such commands are: \prefacename, \seename, \tablename, (a) the floating environments: (c) commands which implicitly define headers, \begin{table}– \end{table}, footers, margin remarks, etc., and/or im- \begin{figure}– \end{figure}, plicitly write something into external files: \begin{table*}– \end{table*}, \part, \chapter, \section, \begin{figure*}– \end{figure*}, \subsection, \subsubsection, (b) the commands that define floating text \paragraph, \subparagraph, \caption, explicitly: \addtocontents, \addcontentsline, \author, \title, \date, \address, \glossary, \index, \name, \signature, \telephone, (d) commands which create floating text and \footnote, \footnotetext, \thanks, floating environments: \marginpar, \markboth, \markright, \author, \title, \date, \address, \bibitem, \topfigrule, \botfigrule, \name, \signature, \telephone, \dblfigrule, \footnoterule, \footnote, \footnotetext, \thanks, (c) the commands that define headers, foot- \marginpar, \markboth, \markright, ers, margin remarks, etc., implicitly: \bibitem, \topfigrule, \botfigrule, \part, \chapter, \section, \dblfigrule, \footnoterule, \subsection, \subsubsection, \begin{table}– \end{table}, \paragraph, \subparagraph, \caption, \begin{figure}– \end{figure}, TUGboat, Volume 19 (1998), No. 4 415

\begin{table*}– \end{table*}, a table for direct text coding, and the status of \begin{figure*}– \end{figure*}, T2∗ as the modern Cyrillic encodings. In struc- (e) macros and user-defined commands which tured markup, the ambiguity would be addressed by may be expanded unintentionally inside assigning two symbolic names for each glyph (say, or outside X2: \yat/\semisft and \/\obarred) and only us- ing the semantically correct one to code texts. \def, \newcommand, \newcommand*, Some preliminary information about exotic \renewcommand, \renewcommand*, glyphs and pure phonetic symbols has been provided \providecommand, \providecommand*, by linguists studying some minor writing systems. \newenvironment, \newenvironment*, These letters and symbols are not currently included \renewenvironment, in X2 and T2∗ at all. The reason for not including \renewenvironment*, the glyphs at this stage is that the writing systems \newtheorem, \ProvideTextCommand, are very unstable and are subject to change from \ProvideTextCommandDefault, publication to publication. There is no justification \AtBeginDocument, \AtEndDocument, for including such symbols in the version of X2 and \AtEndClass, \AtEndPackage, T2∗ proposed as a standard until the situation be- \DeclareRobustCommand, comes stable. , \DeclareTextCommand It seems that all the specific Cyrillic glyphs used . \DeclareTextCommandDefault in modern Cyrillic alphabets are included in X2 and ∗ 7 The weak points of X2 and T2∗ in one of the T2 , but there is also a chance that some minor is omitted. There is also ∗ The X2 and T2 encodings do not contain accented a chance that some linguists suggest a new alphabet letters, and (for some languages) this throws the for some minor language using their own glyphs user back on the \accent primitive which prevents not available in X2. Until this happens we can construction of correct hyphenation tables and - consider X2 and T2A/T2B/T2C as comprehensive stroys kerning pairs. The encodings (especially glyph collections for modern Cyrillic texts (although X2) are also overloaded (to some extent) with rare not very comfortable and not specifically adjusted glyphs, which arise from the attempt to collect all for intensive Cyrillic writing). Cyrillic glyphs in one table.

There are the cross-modifiers (horizontal stroke 8 Some remarks about the TS2 encoding  “ ”, vertical stroke “ ”, diagonal strokes “ ”and The TS2 is expected to be the collection of accents “ ”) which are included in X2 but are absent in and special symbols which are necessary for Cyrillic T2A/T2B/T2C. Although there is a great chance typography, but which are not included into the that these glyphs will be included in TS2 (see encodings X2 and T2A/T2B/T2C for some reasons section 8), their status at this stage of the project is 6 (i.e., TS2 is the encoding supplementary to X2 and

undefined. Similarly, there are the title forms for T2n as TS1 is supplementary to T1).

k Å Ñ the letters Ã/ and / which were included in For typographical reasons, ‘wide’ versions of previous (intermediate) versions of X2 but are now some accents — macron, tilde, breve, etc. — are de- excluded for some reason. sirable. These versions would be used for extra wide

Another disadvantage of minor importance is letters: as compared with the , Cyril-

Ü à °

that there are two glyphs (X/ and / )which lic has a far higher proportion of wide letters. Such Ü correspond to logically different letters: X/ stands wide versions of the accents are good candidates for

for Saam semisoft sign and for old Russian yat, a TS2 encoding. Similarly, the lowercase/uppercase ° and à/ stands for o-barred and old Russian fita. variants of cedilla, ogonek and the accents absent in Although graphically these symbols are similar, they T1 and TS1 may make useful additions to TS2.

are different logically.

k Å Ñ The letters Ã/ and / used in some

This situation can be accepted taking into ac- Ü Cyrillic languages are actually ligatures “ Ë+ ”

count the status of X2 as a glyph table rather than Ü and “Í+ ”. As well as the uppercase and lowercase 6 forms there is also a title form for these letters:

The title form is a combination of the uppercase “ Ë”

Ë ü or “Í” and the bowl from the lowercase “ ”. These glyphs the combination of the uppercase form for “ ”or

are used for first letters in titles, etc., where the first letter is ü “ Í” and the bowl for the lowercase “ ”. This form capital and other letters are in lowercase mode. For example, is used for titles where the first letter is capital there is the title form “Ij” for the ligature “IJ”/“ij” used in Dutch. (Note, that the title form “Ij” is absent in T1 and while the other letters are ordinary (a similar effect TS1 encodings.) occursfor‘IJ’usedinDutch).Suchtitleletters 416 TUGboat, Volume 19 (1998), No. 4

should be placed in TS2 and shared by the X2 and Dutch Organization for Scientific Research (NWO T2∗ encodings. grant No 07 – 30 – 007). To construct some exotic letters from pieces, References

special modifiers are necessary: horizontal stroke  “ ”, vertical stroke “ ”, diagonal strokes “ ”and [1]A.Berdnikov,O.Lapko,M.Kolodin,A.Jani-

“ ”. The diagonal strokes are used only for letters shevsky, A. Burykin. The encoding paradigm

A



ñ Ð ð

Ñ/ (Enetz) and / (Saam, or Lappish). Verti- in LTEX2ε and the projected X2 encoding

  

f Î Ú cal strokes are used only for letters F/ and / for Cyrillic texts. Proceedings of EuroTEX-98, which have become obsolete since modern Azerbai- Saint-Malo, 1998.

jan writing is based on the Latin alphabet. Horizon- [2]A.Berdnikov,O.Lapko,M.Kolodin,A.Jani- ¡

tal strokes are used in several Cyrillic letters (/ , shevsky, A. Burykin. Alphabets necessary for

g ä ð G/ , / , etc.). There are serious reasons for keep- various Cyrillic writing systems (towards X2 ing these modifiers in TS2: there are still minor lan- and T2 encodings). Proceedings of EuroTEX- guages for which alphabets based on Cyrillic could 98, Saint-Malo, 1998. be proposed. The availability of these modifiers in [3] A. Berdnikov, O. Grineva. Some problems with TS2 would support such developments without the accents in TEX: letters with multiple accents necessity to include more glyphs in the X2 and T2∗ and accents varying for uppercase/lowercase encodings. letters. Proceedings of EuroTEX-98,Saint- It is still a question how, and whether, the TS2 Malo, 1998. encoding should be realized. Taking into account [4] K. P´ıˇska, Cyrillic Alphabets, in: Proceedings that there are only a few glyphs really necessary of TUG’96, eds. M. Burbank and C. Thiele, for it and that there are several positions in TS1 pp. 1 – 7, JINR, Dubna; TUGboat 17(2), reserved for future extension of this encoding, it pp. 92 – 98. may be a good decision just to combine these two [5] O. Lapko, Full Cyrillic: How many languages, encodings. in: Proceedings of TUG ’96, eds. M. Burbank and C. Thiele, pp. 164 – 170, JINR, Dubna; 9 Acknowledgments TUGboat 17(2), pp. 174 – 180. There are many people who have contributed to this [6] K. Kenneth. The languages of the world. Lon- project, and it is difficult to list all of them in this don, Henley. section. Among the people who contributed the es- [7] M. Ruhlen. A Guide to the languages of the sential components are Mikhail Grinchuk, Vladimir world. Stanford University, 1975. Volovich, Werner Lemberg, Frank Mittelbach, J¨org [8] C. F. Voegelin, F. M. Voegelin. Classification Knappen, Michel Goossens, Andrew Slepuhin, but and Index of the World Languages. Academic the list is not restricted to these names only. Press, 1977. We are especially thankful to Vladimir - [9] WWW page by Karel P´ıˇska: http://www-hep. lovich and Werner Lemberg for their work on fzu.cz/~piska/ macro support for the X2 and T2∗ encodings and [10] Ethnologue Database: ftp://ftp.std.com/ to the members of the mailing list CyrTEX-T2 who discussed enthusiastically the X2 and T2∗ obi/Ethnologue/eth.Z problems. (To subscribe to this list send email  to [email protected] with the command: A. Berdnikov, O. Lapko, subscribe cyrtex-t2 your-e-mail-address.) M. Kolodin, A. Janishevsky, A. Burykin We are grateful to Robin Fairbairns for his time Institute of Analytical spent polishing the text of our papers submitted to Instrumentation the EuroTEX-98 conference where the preliminary Rizskii pr. 26, 198103 results of this project (namely, the X2 encoding) was St. Petersburg, presented for the first time, and to Michel Goossens [email protected] for his efforts to organize the support for Russian participants at EuroTEX-98. Finally, although most work on this project was done on a voluntary basis (as it is traditional for TEX community), it is worth mentioning that part of the research was supported by a grant from the