
TUGBOAT

Volume 17, Number 2 / June 1996 1996 Annual Meeting Proceedings

Opening Address
91 Michel Goossens / Opening Words by President

Fonts
92 Karel Píška / Cyrillic Alphabets
99 Jörg Knappen / The dc fonts 1.3: Move towards stability and completeness
102 Fukui Rei / TIPA: system for processing phonetic symbols in LaTeX
115 A.S. Berdnikov / Computer Modern Typefaces as Multiple Master Fonts
120 A.S. Berdnikov / VFComb 1.3 - the program which simplifies virtual font management

Encoding and Multilingual Support
126 Yannis Haralambous / ΩTimes and ΩHelvetica fonts under development: Step One
147 Richard J. Kinch / Extending TeX for Unicode
161 L. N. Znamenskaya and S. V. Znamenskii / Russian encoding plurality problem and a new Cyrillic font set
166 Peter A. Ovchenkov / Cyrillic TeX files: interplatform portability
172 Michael M. Vinogradov / A user-friendly multi-function TeX interface based on Multi-Edit
174 Olga . Lapko / Full Cyrillic: How many languages?

TeX Systems
181 John Plaice and Yannis Haralambous / The latest developments in Ω
184 Dag Langmyhr / StarTeX - a TeX for beginners
191 Gabriel Valiente Feruglio / Do journals honor LaTeX submissions?
200 Sergei V. Znamenskii and Denis . Leinartas / A new approach to the TeX-related programs: A user-friendly interface
204 Ivan G. Vsesvetsky / The strait gate to TeX
206 Laurent Siebenmann / DVI-based electronic publication
215 Kees van der Laan / BLUe’s format - the off-off alternative

Graphics
222 Kees van der Laan / Turtle graphics and TeX - a child can do it
229 A.S. Berdnikov, .A. Grineva and S.B. Turtia / Some useful macros which extend the LaTeX picture environment

News & Announcements
90 Mimi Burbank / Production notes
233 Calendar

TUG Business
234 Institutional members

Advertisements
235 TeX consulting and production services

TeX Users Group

Memberships and Subscriptions
TUGboat (ISSN 0896-3207) is published quarterly by the TeX Users Group, Flood Building, 870 Market Street, #801; San Francisco, CA 94102, U.S.A.

1996 dues for individual members are as follows:
Ordinary members: $55
Students: $35

Membership in the TeX Users Group is for the calendar year, and includes all issues of TUGboat for the year in which membership begins or is renewed. Individual membership is open only to named individuals, and carries with it such rights and responsibilities as voting in the annual election.

TUGboat subscriptions are available to organizations and others wishing to receive TUGboat in a name other than that of an individual. Subscription rates: $70 a year, including air mail delivery.

Periodicals postage paid at San Francisco, CA, and additional mailing offices. Postmaster: Send address changes to TUGboat, TeX Users Group, 1850 Union Street, #1637, San Francisco, CA 94123, U.S.A.

Institutional Membership
Institutional Membership is a means of showing continuing interest in and support for both TeX and the TeX Users Group. For further information, contact the TUG office.

Board of Directors
Donald Knuth, Grand Wizard of TeX-arcana†
Michel Goossens, President∗
Judy Johnson∗, Vice President
Mimi Jett∗, Treasurer
Sebastian Rahtz∗, Secretary
Barbara Beeton
Karl Berry
Mimi Burbank
Michael Ferguson
Peter Flynn
George Greenwade
Yannis Haralambous
Jon Radel
Tom Rokicki
Norm Walsh
Jiří Zlatuška
Raymond Goucher, Founding Executive Director†
Hermann Zapf, Wizard of Fonts†
∗ member of executive committee
† honorary

Addresses
All correspondence, payments, parcels, etc.:
TeX Users Group, 1850 Union Street, #1637, San Francisco, CA 94123 USA
If you are visiting:
TeX Users Group, Flood Building, 870 Market Street, #801, San Francisco, CA 94102, USA

Telephone: +1 415 982-8449
Fax: +1 415 982-8559

Electronic Mail (Internet)
General correspondence: [email protected]
Submissions to TUGboat: [email protected]

TeX is a trademark of the American Mathematical Society.

TUGboat © Copyright 1996, TeX Users Group
Permission is granted to make and distribute verbatim copies of this publication or of individual items from this publication provided the copyright notice and this permission notice are preserved on all copies.
Permission is granted to copy and distribute modified versions of this publication or of individual items from this publication under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this publication or of individual items from this publication into another language, under the above conditions for modified versions, except that this permission notice may be included in translations approved by the TeX Users Group instead of in the original English.
Some individual authors may wish to retain traditional copyright rights to their own articles. Such articles can be identified by the presence of a copyright notice thereon.

Printed in U.S.A.

1996 Annual Meeting Proceedings
TeX Users Group
Seventeenth Annual Meeting
Dubna, Russia, July 28–August 2, 1996

COMMUNICATIONS OF THE TEX USERS GROUP
TUGBOAT EDITOR: BARBARA BEETON
PROCEEDINGS EDITORS: MIMI BURBANK, CHRISTINA THIELE

VOLUME 17, NUMBER 2 • JUNE 1996
SAN FRANCISCO • CALIFORNIA • U.S.A.

Production Notes

What’s different this year?

My goodness, much! This is the second annual TUG meeting to be held outside North America, and the first meeting that takes place in a country with a non-Latin alphabet.

This is also the first (and probably the last) year to attempt to present the Proceedings issue of TUGboat before the actual meeting. We have had varying success in this regard, and will most likely return to our former schedule in 1997, if only because a pre-conference publication schedule does not allow authors and editors sufficient time to properly fine-tune the final version of a text. As a result of the delay in receiving articles from some authors, there are articles in this issue which have not been reviewed. Russian typographic styles differ from those of TUGboat, and these differences may be noted in a variety of articles in this issue.

Many of our speakers present material that is actually “work-in-progress” and is in some cases dependent upon discussions and input from others in a particular area of expertise who attend the annual meetings. Therefore, some of the material to be presented at the conference in Dubna will be appearing in future issues of TUGboat. Look in the next issue for a more detailed report of the TUG’96 Conference in Dubna and list of participants.

Of the total number of articles submitted, fourteen were by Russian authors, thirteen were by authors whose primary language was not English, and two articles were submitted by North American authors, surely a tribute to the international flavor of TUG.1

Macros

During the three-month process of editing these proceedings, we had the opportunity to use at least three different versions of LaTeX2ε macros. Look in the next issue for an article by Robin Fairbairns on the current state of the TUGboat style files.

Only five articles were submitted as plain TeX source; the others were submitted as LaTeX2ε.

Fonts

A large area of focus at the TUG’96 meeting will be on “languages”, “encoding” and “fonts”. The issue is set primarily in Computer Modern (or DC, version 1.3) fonts, using Malyshev’s BaKoMa PostScript Type1 versions.2 The Ω article by Haralambous necessitated the creation of proper .tfm files from .afm files provided by the author. The article by O. Lapko used special Cyrillic fonts, a result of the ongoing Russian Cyrillic font project. The wncyr fonts developed at the University of Washington and distributed by the American Mathematical Society were also widely used in this issue.

Owing to the wide variation in Cyrillic fonts (the two mentioned above are far from being the only ones in existence) and methods for using them, many of the Cyrillic examples were included in the form of PostScript graphics prepared by the authors, so that additional fonts would not have to be installed and incorporated into the TUGboat styles for just one use.

Output

The editor will have to confess to many problems during the editing and production of this issue and any remaining errors are mine (MB). Without help from my co-editor, Christina Thiele, and all of the production team members, final output would have been impossible. Final output was prepared at SCRI on an IBM RS6000 running AIX, using the Web2C implementation of TeX. Output was printed on a QMS 680 print system at 600 dpi.

⋄ Mimi Burbank
Supercomputer Computations Research Institute
Florida State University
Tallahassee, FL 32306–4502
[email protected]

⋄ Christina Thiele
15 Wiltshire Circle
Nepean, Ontario K2J 4K9, Canada
[email protected]

1 What an experience for someone whose first language is ’merkan and who speaks English as a ‘second language’.
2 Jörg Knappen’s article on page 99 required the upgrade from 1.2 to 1.3 for this issue.

90 TUGboat, Volume 17 (1996), No. 2 - Proceedings of the 1996 Annual Meeting

Opening Words by the President

Michel Goossens CERN, Geneva, Switzerland [email protected]

A lot has happened in those twelve months since the last TUG Conference (the 16th) in St. Petersburg (Florida). We now know that the revolution in the area of electronic documents is here to stay. It becomes more and more commonplace to find individuals connecting from home computers to the Internet, and even in Europe “going global” starts to become affordable, with most PTT’s now offering ISDN lines at prices comparable to normal telephone connections and many cable operators providing Internet services.

TeX users worldwide are finally beginning to profit fully from these electronic wonders, and can now download everything they need from a CTAN site via the Internet, or put one of several CD-ROMs which appeared over the last twelve months (one for MS-DOS by NTG, one containing the CTAN archives by DANTE, and a TDS-based plug-and-play one for Unix by TUG, GUTenberg and UKTUG) in their CD-drive. Efforts to regularly update these CD-ROMs and to extend the target domain to Microsoft Windows (NT and 95) and Macintosh are already underway, so that I have good hopes that dealing with TeX will one day become almost as simple as running Word, WordPerfect, or other easy-to-install commercial products. At the same time translation programs between LaTeX sources and various electronic hyper-formats, such as HTML and Acrobat, have been further improved. It is thus fair to say that LaTeX users are certainly not the worst placed to fully profit in an almost effortless manner from both the typographic excellence of the TeX engine and the easy integration of their documents in the global information hyperhighway. Therefore, I would like to thank all individuals who have worked hard to develop these important tools and I can only hope that they will contribute to make TeX better known in the PC commodity market, where the action in the world of electronic publishing (and computing in general) will be more and more concentrated in the future.

I am writing these words a few weeks before the start of TUG’96, the 17th TUG Annual Meeting, taking place in the Joint Institute for Nuclear Research (JINR) in Dubna (Russia). It is only the second time that TUG’s Annual Meeting has taken place outside of North America (the first time was in 1993, in Aston, Birmingham, United Kingdom). Moreover, this time we are moving beyond the English-speaking world altogether, as we are guests in the heart of Russia, the largest country in the world, spanning eleven time zones, and where the majority of the 150 million inhabitants speak Russian, which is written using the non-Latin Cyrillic alphabet. I sincerely thank CyrTUG for the invitation to come to Russia, and JINR for helping with the technical organization of the Conference. It shows that TeX is truly international and knows no borders, and that it is an ideal vehicle to promote friendship, and foster cultural exchanges and scientific collaboration.

The subject area covered by the papers at this Conference shows the diversity of the TeX culture in the world, more particularly in Central and Eastern Europe. When reading the articles, one may sometimes be struck by a “funny” or “unexpected” not-very-English-sounding expression, but this merely shows the true richness of all contributions. It underlines the fact that various approaches exist to attain a certain goal, and reading about these solutions often opens up new horizons. And then there are the “classics”, like , the dc fonts, “Blue”, whose steady improvements we have been following over the years, as well as the efforts of the LaTeX3-team providing us each semester with a more robust LaTeX2ε, and the promises of eTeX, as announced at TUG’95. All this makes it clear that TeX is more alive than ever, and looking for a seat on the front row when , hypersurfing, and virtual (ir)reality take over the world.

Let me conclude by expressing my gratitude to the TUGboat Production Team, especially Mimi and Christina, for their tremendous effort on the present proceedings. I also want to thank all authors, for their continued support by communicating their work to the wider TeX community, thus increasing the pool of TeX-pertise available to everybody. And, last but not least, let me mention the support of DANTE, GUTenberg, NTG, and UKTUG, who donated funds to the TUG Bursary or otherwise contributed in kind to the Conference.


Cyrillic Alphabets

Karel Píška
Institute of Physics, Academy of Sciences
180 40 Prague, Czech Republic
[email protected], [email protected]
URL: http://www-hep.fzu.cz/~piska/

Abstract
A collection of Cyrillic-based language alphabets is presented. The contribution contains data about more than 50 languages using the Cyrillic script. A “Unicode-like” coded font is used for the rendering of the Cyrillic texts. The aim is to take part in creating a universal Cyrillic font for TeX and the Ω project and to further help languages using Cyrillic join the TeX community.

Introduction

Cyrillic-based alphabets are (or have been) used by nations in the Russian Federation, and a number of nations in Europe and Asia, including many nations of the former USSR now beyond the Russian border. The article aims to present a list of currently existing written languages using the Cyrillic script and having a codified literary form.

Most of the encoding systems for Cyrillic used in Russia, , Belarus, and also the UCS/Unicode [5] standard are based on the O . They contain a continuous ordered code sequence only for Russian letters. Other characters are non-standard: they are missing or they are coded “accidentally”. I will call these characters “additional” (relative to the usual computer encoding standards!). Many Cyrillic alphabets were borrowed from the Russian alphabet. We can consider their “non-Russian” letters as “additional” or “new”; often they were created (appended) as “new” characters. Of course, the previous assertion is not true for languages which have traditionally used Cyrillic script – Belarusian, Ukrainian, Serbian, Macedonian and Bulgarian.

One of my most important sources has been the book [7] (1960). Unfortunately this unique book may be obsolete today. I would be very grateful for corrections and remarks and also references to other sources, especially if the reader is an expert in any language. Please contact me by email. More information about languages can be found on my WWW Home Page (e.g., complete alphabetical orders). And please overlook my lack of knowledge of English.

Language Names and Codes

The ISO standard 639-2 (1993) and the Ethnologue base eth (1990) contain the English names of languages and also their three-letter codes. Another source of language names I have used is Webster’s Dictionary (1989). Unfortunately the English and international terminology is not stabilized. When I began translation from Russian I could not find unambiguous names for languages in English. On the other hand, the Russian names are fixed in most cases.

One example, “ KA ( )”, is evident in Russian – but I have not selected the best examples from the following variants: Adyghe, Adyge, Adygey, Adygei, Adighe, Circassian, Lower Circassian, West Circassian, Kinkh, Kjkax, Cherkes. Therefore, I have decided to present one (rarely two) language name(s) in Russian and one (maximally two) name(s) in English (often selected “arbitrarily”). The ISO and eth language codes are also shown in the table of languages which use Cyrillic. If the first code (ISO) is not defined or the codes are different then the second code (eth) is presented.

Real Font

The B5 font family (borrowing from the Computer Modern) is a bank of Cyrillic glyphs corresponding to the Ω project (Haralambous and Plaice, 1995). The proposed encoding of a real 8-bit font is based on Unicode (ISO-IEC 10646-1, 1993), more exactly, "04xx mod "100. Thus the character codes are well defined and standardized. This can simplify communication between authors supporting and improving the fonts.

The table of the b5r12 font (Computer Modern Cyrillic Roman 12 point) is on the last page of the


article. I repeat: the real font is only a bank of glyphs and cannot be used autonomously in a simple and effective way.

Virtual Font

A virtual font was created for the present article to enable access to Cyrillic letters using ASCII characters. For creating .tfm files for virtual fonts, the program VFComb (Berdnikov and Turtia, 1995) was used. It allows the definition (or redefinition) of mapping, ligature and kerning data once for all font sizes, and then merges them with metric information of the real fonts (reading proper list files). It is necessary to mention that every font in TeX (real or virtual) can contain no more than 256 characters, so using fonts with many characters is complicated or impossible. This is a good reason for introducing Ω, the 16-bit extension of TeX. The virtual font used in this article combines a real font with Unicode-like encoding (mod "100), a font with alternative glyphs (located separately) and several characters from the original CM (e.g., parentheses). The way of referencing the “I’s” is shown in the following example.

A segment from the .tbf file (input file for VFComb)

(LIGTABLE
(LABEL C I)
(LIG C 1 O 006)
(LIG C 2 O 007)
(LIG C 3 O 300)
(LIG C 4 O 342)
(LIG C 5 O 344)
(STOP)
(LABEL C i)
(LIG C 1 O 022)
(LIG C 2 O 211)
(LIG C 4 O 343)
(LIG C 5 O 345)
(STOP)

results

‘Ii’ => Ии % “Standard” ‘I’
‘I1i1’ => Іі % Ukrainian/Belarusian ‘I’
‘I2i2’ => Її % Ukrainian ‘Ї’
‘I3’ => Ӏ % Caucasian aspiration sign “Ӏ”
‘I4i4’ => Ӣӣ % Tadzhik ‘I’ with stress
‘I5i5’ => Ӥӥ

Cyrillic Character Set and Unicode

The ISO/IEC 10646-1/Unicode (1993 E) covers most of the letters used in current living written languages that use a Cyrillic alphabet (in my opinion). I would like to add the following comments:
• I have no data about other characters; for example, punctuation marks, special signs and other symbols.
• I don’t present information about additional characters not in current use.
• Old Cyrillic is omitted and is not a subject of inquiry in this paper.
• Regarding the variant forms: more alternative glyphs may be stored in a font bank and then selected to depict a particular character. This problem is solvable in TeX.
• Regarding letters with diacritics: there are important differences in the three distinct applications of diacritical marks (with possible disagreement in different languages).
1. The accented symbol denotes a distinct letter, as opposed to the same symbol without an accent, and it may even be positioned independently in the alphabet.
2. An accent can be used to modify the symbols representing vowels and consonants: for example, vowels can be marked for length or nasalization, consonants can be marked for palatalization. The presence of the accent when writing is significant but, unlike the above item, the combination does not constitute a new or special letter, and therefore would be alphabetized in the same position as the letter without such a diacritic.
3. An accent is used to mark stress. These “stressed” letters are not part of the writing system but are, nevertheless, necessary for entries in dictionaries and textbooks. A few examples illustrate the use of stress marks (above, right or below): [stress-mark examples not recoverable]

Alphabetical Orders and Sort

Most languages using Cyrillic in Russia and the former USSR have adopted words from Russian, either in the original form (especially proper names) or with modifications, and their alphabets include all Russian letters. The few exceptions are Ukrainian, Belarusian, Moldavian^cyr and Abkhazian.

Alphabetical orders of distinct languages may be different. “Additional” letters have been appended to the end or may occur in the middle of alphabets. Two letters may be located in the


opposite order. And then the order of similar or even identical words in dictionaries or indexes may be different.

Examples (1960, 1990)^1,2: Russian / Ukrainian
[Cyrillic example word lists not recoverable]

1 Referee’s note: The Ukrainian Academy of Sciences changed the official order of the Ukrainian alphabet in 1991 (or thereabouts), and the soft sign ь is no longer the last letter of the alphabet.
2 Author’s note: Reworking and reprinting of all the dictionaries of any language will not be easy. I will keep this example to demonstrate “real life” changes.

Correspondence Cyrillic vs. Latin

Many languages now written in Cyrillic used Latin-like alphabets in the 1930s (e.g., Tatar or Kazakh). Several languages have used both Latin and Cyrillic alphabets; at the last count these included Serbo-Croatian, Kurdish, Moldavian, and Azerbaijani. Several nations are preparing projects to migrate from Cyrillic to Latin. The alphabetical orders for Cyrillic and Latin are different but I am sure it will be possible to define algorithms for automatic transliteration, to use common hyphenation patterns, and to compile and print texts from the one source, in either alphabet, to produce for a reader the script with which she/he is familiar.

Cyrillic Letters and Symbols

The table contains the Cyrillic characters defined in the Unicode standard. Russian letters (used in most alphabets) and old Cyrillic letters and symbols are omitted in the list. Corresponding symbolic names of characters can be found in [5, 6]. It would take too long to present them here.
Example: CYRILLIC CAPITAL LETTER IO is the Unicode name for "0401 => Ё.

Explanatory notes and comments

cyr: The Cyrillic-alphabet languages marked in this way also use other alphabets (usually Latin-like).

Languages using the following letters are unknown (to me):
1. Ӂ ӂ ("04C1 "04C2)
2. for Ӳ ӳ ("04F2 "04F3) I have two candidates, 2 letters undefined in Unicode: a tilde-accented letter (glyph lost) in Chuvash and the variant Uu in Karachay-Balkar.
The confusion perhaps may be in my sources or in Unicode.

Unicode codes  Letters  Languages

"0401 "0451  Ё ё  Many languages use the letter Ё. The list of the languages in which Ё is not used is shorter: Ukrainian, Bulgarian, Serbo-Croatian^cyr, Macedonian, Kurdish^cyr, Moldavian^cyr, Azerbaijani, Abkhazian, Abazin(?)
"0402 "0452  Ђ ђ  Serbo-Croatian^cyr
"0403 "0453  Ѓ ѓ  Macedonian
"0404 "0454  Є є  Ukrainian
"0405 "0455  Ѕ ѕ  Macedonian
"0406 "0456  І і  Ukrainian, Belarusian, Kazakh, Khakass, Komi (Zyrian), Komi-Permyak
"0407 "0457  Ї ї  Ukrainian
"0408 "0458  Ј ј  Serbo-Croatian^cyr, Macedonian, Azerbaijani, Altaic (Oirot)
"0409 "0459  Љ љ  Serbo-Croatian^cyr, Macedonian
"040A "045A  Њ њ  Serbo-Croatian^cyr, Macedonian
"040B "045B  Ћ ћ  Serbo-Croatian^cyr
"040C "045C  Ќ ќ  Macedonian
"040D "045D       (This position shall not be used)
"040E "045E  Ў ў  Belarusian, Uzbek, Dungan
"040F "045F  Џ џ  Serbo-Croatian^cyr, Macedonian, Abkhazian
"0410.."042F      uppercase Russian
"0430.."044F      lowercase Russian
"0460.."0486      Old Cyrillic
"0490 "0491  Ґ ґ  Ukrainian (now used again!)
"0492 "0493  Ғ ғ  Tadzhik, Uzbek, Uighur, Kazakh, Azerbaijani, Khakass, (Bashkir), (Karakalpak)
             variant Gg  Bashkir, Karakalpak
"0494 "0495  Ҕ ҕ  Yakut (Sakha), Abkhazian, Eskimo (Yuit)^cyr
"0496 "0497  Җ җ  Uighur, Turkmen, Tatar, Kalmyk, Dungan
"0498 "0499  Ҙ ҙ  Bashkir
             variant Zz  Bashkir
"049A "049B  Қ қ  Tadzhik, Uzbek, Uighur, Kazakh, Karakalpak, Abkhazian
             variant Kk
"049C "049D  Ҝ ҝ  Azerbaijani
"049E "049F  Ҟ ҟ  Abkhazian
"04A0 "04A1  Ҡ ҡ  Bashkir
"04A2 "04A3  Ң ң  Uighur, Kazakh, Turkmen, Kirghiz, Tatar, Bashkir, Khakass, Tuva (Soyot), Kalmyk, Dungan
"04A4 "04A5  Ҥ ҥ  Altaic (Oirot), Yakut (Sakha), Mari-low
"04A6 "04A7  Ҧ ҧ  Abkhazian
"04A8 "04A9  Ҩ ҩ  Abkhazian
"04AA "04AB  Ҫ ҫ  Chuvash, Bashkir
             variant Ss  Bashkir
"04AC "04AD  Ҭ ҭ  Abkhazian
"04AE "04AF  Ү ү  Uighur, Kazakh, Turkmen, Kirghiz, Azerbaijani, Tatar, Bashkir, Tuva (Soyot), Yakut (Sakha), Mongolian^cyr, Buryat, Kalmyk, Dungan
"04B0 "04B1  Ұ ұ  Kazakh
"04B2 "04B3  Ҳ ҳ  Tadzhik, Uzbek, Karakalpak, Abkhazian, Eskimo (Yuit)^cyr
             variant Xx
"04B4 "04B5  Ҵ ҵ  Abkhazian
"04B6 "04B7  Ҷ ҷ  Tadzhik, Abkhazian
"04B8 "04B9  Ҹ ҹ  Azerbaijani
"04BA "04BB  Һ һ  Kurdish^cyr, Uighur, Kazakh, Azerbaijani, Tatar, Bashkir, Yakut (Sakha), Buryat, Kalmyk
"04BC "04BD  Ҽ ҽ  Abkhazian
"04BE "04BF  Ҿ ҿ  Abkhazian
"04C0        Ӏ    Abazin, Adyge, Kabardian-Circassian, Avar(ic), Lezgin, Lak(i), Dargwa, Tabasaran, Chechen, Ingush
"04C1 "04C2  Ӂ ӂ  ???
"04C3 "04C4  Ӄ ӄ  Khanty-Vakhi, Chukcha, Eskimo (Yuit)^cyr, Koryak (Nymylan)
"04C5 "04C6       (This position shall not be used)
"04C7 "04C8  Ӈ ӈ  Khanty (Ostyak), Chukcha, Eskimo (Yuit)^cyr, Koryak (Nymylan)
"04C9 "04CA       (This position shall not be used)
"04CB "04CC  Ӌ ӌ  Khakass
"04D0 "04D1  Ӑ ӑ  Chuvash
"04D2 "04D3  Ӓ ӓ  Mari-high, Khanty (Ostyak), (Kalmyk)
"04D4 "04D5  Ӕ ӕ  Ossetic
"04D6 "04D7  Ӗ ӗ  Chuvash
"04D8 "04D9  Ә ә  Kurdish^cyr, Uighur, Kazakh, Turkmen, Azerbaijani, Tatar, Bashkir, Kalmyk, Khanty (Ostyak), Abkhazian, Dungan
"04DA "04DB  Ӛ ӛ  Khanty (Ostyak)
"04DC "04DD  Ӝ ӝ  Udmurt (Votyak)
"04DE "04DF  Ӟ ӟ  Udmurt (Votyak)
"04E0 "04E1  Ӡ ӡ  Abkhazian
"04E2 "04E3  Ӣ ӣ  Tadzhik
"04E4 "04E5  Ӥ ӥ  Udmurt (Votyak)
"04E6 "04E7  Ӧ ӧ  Kurdish^cyr, Altaic (Oirot), Khakass, Mari-low, Mari-high, Udmurt (Votyak), Komi (Zyrian), Komi-Permyak, Khanty-Vakhi, (Kalmyk)
"04E8 "04E9  Ө ө  Uighur, Kazakh, Turkmen, Kirghiz, Azerbaijani, Tatar, Bashkir, Tuva (Soyot), Yakut (Sakha), Mongolian^cyr, Buryat, Kalmyk, Khanty (Ostyak)
"04EA "04EB  Ӫ ӫ  Khanty (Ostyak)
"04EC "04ED       (This position shall not be used)
"04EE "04EF  Ӯ ӯ  Tadzhik
"04F0 "04F1  Ӱ ӱ  Altaic (Oirot), Khakass, Mari-low, Mari-high, Khanty-Vakhi, (Kalmyk)
"04F2 "04F3  Ӳ ӳ  ???
"04F4 "04F5  Ӵ ӵ  Udmurt (Votyak)
"04F6 "04F7       (This position shall not be used)
"04F8 "04F9  Ӹ ӹ  Mari-high

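The capital-I ligature rules from the .tbf segment in the Virtual Font section can be cross-checked against the Unicode table above. The following Python sketch is only an illustration: it assumes that an octal ligature target n denotes the code "0400 + n (the "04xx mod "100 rule given for the real font), and the names parse_ligatures, slot_to_unicode and expand are hypothetical helpers, not part of VFComb. Only the rules labelled C I are used; the lowercase targets O 022 and O 211 do not fit that assumption and are left out.

```python
import re

# Capital-I ligature rules copied from the .tbf segment in the Virtual Font section.
TBF_SEGMENT = """
(LABEL C I)
(LIG C 1 O 006)
(LIG C 2 O 007)
(LIG C 3 O 300)
(LIG C 4 O 342)
(LIG C 5 O 344)
(STOP)
"""

# Assumption: slot n of the font holds the Unicode character "0400 + n.
CYRILLIC_BASE = 0x0400


def parse_ligatures(segment):
    """Return {(label_char, next_char): slot} for the LIG rules in a segment."""
    rules = {}
    label = None
    for line in segment.splitlines():
        line = line.strip()
        label_match = re.fullmatch(r"\(LABEL C (\S)\)", line)
        if label_match:
            label = label_match.group(1)
            continue
        lig_match = re.fullmatch(r"\(LIG C (\S) O ([0-7]+)\)", line)
        if lig_match and label is not None:
            # The target is written in octal ("O"), so parse it with base 8.
            rules[(label, lig_match.group(1))] = int(lig_match.group(2), 8)
        elif line == "(STOP)":
            label = None
    return rules


def slot_to_unicode(slot):
    """Map an 8-bit font slot to the character at "0400 + slot."""
    return chr(CYRILLIC_BASE + slot)


def expand(text, rules):
    """Replace each letter+digit pair that has a rule by its Cyrillic character."""
    def replace(match):
        pair = (match.group(1), match.group(2))
        if pair in rules:
            return slot_to_unicode(rules[pair])
        return match.group(0)
    return re.sub(r"([A-Za-z])([0-9])", replace, text)


RULES = parse_ligatures(TBF_SEGMENT)
for (letter, digit), slot in sorted(RULES.items()):
    code = CYRILLIC_BASE + slot
    print(f'{letter}{digit} -> slot {slot:03o} -> "{code:04X} {chr(code)}')
```

Under this assumption the five targets land on "0406, "0407, "04C0, "04E2 and "04E4, which agrees with the comments in the results listing (Ukrainian/Belarusian, Ukrainian, Caucasian, Tadzhik) and with the corresponding rows of the table.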

Languages Using Cyrillic Script

The following table contains a short overview of Cyrillic-alphabet languages and their “additional” letters (according to ‘usual’ standard encoding systems). ISO (ISO Committee Draft 639-2, 1993) and eth (Ethnologue, 1990) are two different three-letter language codes. The order is “quasi-linguistic-geographical-historical” (“cognate” or “neighbouring” nations being together).

Columns: language, codes (ISO/eth). [The Russian names and the additional-character glyphs are not recoverable.]

Indo-European Group

Slavic Languages
• Russian  rus
• Ukrainian  ukr
• Belarusian  bel/ruw
• Bulgarian  bul/blg
• Serbo-Croatian^cyr  src
• Macedonian  mac/mkj

Iranian Languages
• Ossetic  oss/ose
◦ Kurdish^cyr  kur
• Tadzhik  tgk/pet

Romance Languages
◦ Moldavian^cyr  mol

Altaic Group

Turkic Languages
• Uzbek  uzb
• Uighur  uig
• Kazakh  kaz
• Turkmen  tuk/tck
• Kirghiz  kir/kdo
◦ Azerbaijani  aze
• Tatar  tat/ttr
• Bashkir  bak/bxk
• Karachay-Balkar  krc
• Kumyk  ksk
• Nogay  nog
• Karakalpak  kaa/kac
• Altaic (Oirot)  alt
• Khakass  kjh
• Tuva (Soyot)  tyv/tun
• Chuvash  chv/cju
• Yakut (Sakha)  sah/ukt

Mongolian Languages
• Mongolian^cyr  mon/khk
• Buryat  bua/mnb
• Kalmyk  kgz

Tungusic-Manchu Languages
• Evenki (Tungus)  evn
• Even (Lamut)  eve
• Nanai (Gold)  gld

Uralic Group

Finno-Ugric Languages
• Mari-low  mal
• Mari-high  mrj
• Mordvin-Erzya  myv
• Mordvin-Moksha  mdf
• Udmurt (Votyak)  udm
• Komi (Zyrian)  kpv
• Komi-Permyak  koi
• Mansi (Vogul)  mns
• Khanty-Vakhi  kca
• -Kazim
• -Shurishkar

Samoyedic Languages
• Nenets (Yurak)  yrk
• Selkup  sel/sak

Caucasian Languages
• Abkhazian  abk
• Abazin  abq
• Adyge  ady
• Kabardian-Circassian  kab
• Avar(ic)  ava/avr
• Lezgin  lez
• Lak(i)  lbe
• Dargwa  dar
• Tabasaran  tab
• Chechen  /cjc
• Ingush  inh

Sino-Tibetan Group
• Dungan

Paleo-Asiatic Languages
• Chukcha  ckt
• Eskimo (Yuit)^cyr  ess
• Koryak (Nymylan)  kpy
• Nivkh (Gilyak)

Explanatory notes
◦  the language is migrating (again!) to Latin writing
(Ё)  a character not used today (partially or entirely)
(= x)  variant not preferred
(‘x’=)  a former variant
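The per-language sets of “additional” letters can be derived from the Unicode table by inverting it. The Python sketch below assumes only a handful of rows copied from that table; ROWS is deliberately incomplete, and letters_by_language is a hypothetical helper name. It does not reproduce the variant glyphs or the bracketed remarks of the original column.

```python
from collections import defaultdict

# A few rows copied from the Unicode table in this article:
# (uppercase code, lowercase code, languages using the letter).
ROWS = [
    (0x0404, 0x0454, ["Ukrainian"]),
    (0x0406, 0x0456, ["Ukrainian", "Belarusian", "Kazakh", "Khakass",
                      "Komi (Zyrian)", "Komi-Permyak"]),
    (0x0407, 0x0457, ["Ukrainian"]),
    (0x040E, 0x045E, ["Belarusian", "Uzbek", "Dungan"]),
    (0x0490, 0x0491, ["Ukrainian"]),
    (0x04B0, 0x04B1, ["Kazakh"]),
]


def letters_by_language(rows):
    """Invert the table: language -> list of upper/lower letter pairs in code order."""
    index = defaultdict(list)
    for upper, lower, languages in sorted(rows, key=lambda row: row[0]):
        for language in languages:
            index[language].append(chr(upper) + chr(lower))
    return dict(index)


INDEX = letters_by_language(ROWS)
for language in ("Ukrainian", "Belarusian", "Kazakh"):
    print(language, " ".join(INDEX[language]))
```

The result is only as complete as the rows entered; a language such as Russian, which needs no additional letters from these rows, simply does not appear in the index.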


Conclusion

I think the Cyrillic portion of the Unicode Mapping Table covers quite a good character set for most of the languages using the Cyrillic alphabet today. On the other hand UCS/Unicode does not offer a sophisticated solution for alphabetical ordering. And I am afraid that the situation with special symbols, particularities, and multiple accents for dictionaries and textbooks, etc., will be more complicated.

Acknowledgements

I would like to thank Prof. Donald E. Knuth for providing TeX, METAFONT and the Computer Modern Fonts, Michel Goossens for important documents about languages and the Unicode standard, Yannis Haralambous, Olga Lapko and other authors for METAFONT sources of Cyrillic fonts, Aleksandr Berdnikov for the program VFComb (see his article on VFComb in this issue), and Mimi Burbank for help with the English text.

References

[1] ISO Committee Draft 639-2. Code for the representation of names of languages. Part 2: Alpha-3 code. International Information Centre for Terminology (INFOTERM), Wien 1993. (preliminary draft, not published)
[2] Ethnologue Database, ftp://ftp.std.com/obi/Ethnologue/eth.Z, 19 February 1990.
[3] Webster’s Encyclopedic Unabridged Dictionary of the English Language, Portland House, New York, 1989.
[4] Yannis Haralambous, John Plaice, ‘Ω + Virtual METAFONT = Unicode + Typography’, Cahiers GUTenberg no. 21, juin 1995.
[5] International Organization for Standardization. “Information technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane”, ISO/IEC 10646-1 : 1993, (First edition, 1993-05-01), Geneva, 1993, (Unicode).
[6] ftp://unicode.org/pub/MappingTables/UnicodeDataCurrent.txt.Z, 21 May 1996.
[7] [Russian-language book; authors and title not recoverable], 1960.
[8] A.S. Berdnikov, S.B. Turtia: VFComb - a program for design of virtual fonts. Proceedings of the Ninth European TeX Conference, Arnhem 1995.

b5r12 Font Table (Computer Modern Roman 12 point)
Modern Cyrillic Part of ISO 10646-1/Unicode

[Font table: 32 rows ´00x to ´37x (˝0x to ˝Fx) by 8 columns ´0 to ´7 (˝8 to ˝F); glyph cells not recoverable]

Comments

Alternative glyphs referenced in the article, Kk Xx Zz Ss Uu (Gg), are located in the additional separate font.
The old Cyrillic part ("60.."86) is not used in the article and has not been completed.

98 TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting

TIPA: A System for Processing Phonetic Symbols in LATEX

Fukui Rei Department of Asian and Pacific , Institute of Cross-Cultural Studies, Faculty of Letters, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, TOKYO 113 Japan [email protected]

Introduction

TIPA [1] is a system for processing IPA (International Phonetic Alphabet) symbols in LaTeX. It is based on TSIPA [2], but both the METAFONT source code and the LaTeX macros have been thoroughly rewritten, so that it can be considered a new system.

Among the many features of TIPA, the following are new as compared with TSIPA or any other existing system for processing IPA symbols.

• A new 256 character encoding for phonetic symbols ('T3'), which includes all the symbols and diacritics found in the recent versions of IPA and some non-IPA symbols.
• Complete support of LaTeX2ε.
• Roman, slanted, bold, bold extended and sans serif font styles.
• Easy input method in the IPA environment.
• Extended macros for accents and diacritics. [3]
• A flexible system of macros for 'tone letters'.
• An optional package (vowel.sty) for drawing vowel diagrams. [4]
• A slightly modified set of fonts that go well when used with Times Roman and Helvetica fonts.

Footnotes:

1 TIPA stands for TeX IPA or Tokyo IPA. The primary ftp site in which the latest version of TIPA is placed is ftp://tooyoo.L.u-tokyo.ac.jp/pub/TeX/tipa, and it is also mirrored onto the directory fonts/tipa of the CTAN archives.

2 TSIPA was made in 1992 by Kobayashi Hajime, Fukui Rei and Shirakawa Shun. It is available from a CTAN archive. One problem with TSIPA was that symbols already included in OT1, T1 or Math fonts were excluded, because of the limitation of its 128 character encoding. As a result, a string of phonetic representation often had to be composed of symbols from different fonts, disabling the possibility of automatic inter-word kerning. Also, too many symbols had to be realized as macros.

3 These macros are now defined in a separate file called 'exaccent.sty' in order for the authors of other packages to be able to make use of them. The idea of separating these macros from other ones was suggested by Frank Mittelbach.

4 This package (vowel.sty) can be used independently from the TIPA package. Documentation is also made separately in 'vowel.tex' so that no further mention will be made here.

TIPA Encoding

Selection of symbols

The selection of TIPA phonetic symbols [5] was made based on the following works.

• Phonetic Symbol Guide [9] (henceforth abbreviated as PSG).
• The official IPA charts of the '49, '79, '89 and '93 versions.
• Recent articles published in the JIPA [6], such as "Report on the 1989 Kiel Convention" [6], "Further report on the 1989 Kiel Convention" [7], "Computer Codes for Phonetic Symbols" [3], "Council actions on revisions of the IPA" [8], etc.
• An unpublished paper by J. C. Wells: "Computer-coding the IPA: a proposed extension of SAMPA" [10].
• Popular textbooks on phonetics.

(In the first and third items, the bracketed 5 and the 6 following "JIPA" are footnote marks; all other bracketed numbers in this section refer to the reference list.)

More specifically, TIPA contains all the symbols, including diacritics, defined in the '79, '89 and '93 versions of IPA. In the case of the '49 version of IPA, which is described in the Principles [5], there are too many obsolete symbols, and only those symbols that had some popularity, at least for some time or for some group of people, are included.

Besides IPA symbols, TIPA also contains symbols that are useful for the following areas of phonetics and linguistics.

• Symbols used in American phonetics.
• Symbols used in the historical study of Indo-European languages, including accents.
• Symbols used in the phonetic description of languages in East Asia.
• Diacritics used in 'extIPA Symbols for Disordered Speech' [4] and 'VoQS (Voice Quality Symbols)' [1].

It should also be noted that TIPA includes all the necessary elements of 'tone letters', enabling all the theoretically possible combinations of the tone letter system. In the recent publication of the International Phonetic Association, tone letters are admitted as an official way of representing tones, but the treatment of tone letters is quite insufficient in that only a limited number of combinations is allowed. This is apparently due to the fact that there has been no 'portable' way of combining symbols that can be used across various computer environments. Therefore TeX's productive system of macros is an ideal tool for handling a system like tone letters.

In the process of writing METAFONT source code for TIPA phonetic symbols there have been many problems besides the one with the selection of symbols. One such problem was that sometimes the exact shape of a symbol was unclear. For example, the shapes of symbols such as Stretched C and Curly-tail J differ according to sources. This is partly due to the fact that the IPA has been continuously revised for the past few decades, and partly due to the fact that different ways of computerizing phonetic symbols on different systems have resulted in a diversity of shapes.

Although there is no definite answer to such a problem yet, it seems to me that it is a privilege of those working with METAFONT to have a systematic way of controlling the shapes of phonetic symbols.

Footnotes:

5 In the case of TSIPA, the selection of symbols was based on "Computer coding of the IPA: Supplementary Report" [2].

6 Journal of the International Phonetic Association.

Encoding

The 256 character encoding of TIPA is now officially called the 'T3' encoding. [7] In deciding this new encoding, care was taken to harmonize with other existing encodings, especially with the T1 encoding. The ease of inputting phonetic symbols was also taken into consideration, in such a way that frequently used symbols can be input with a small number of keystrokes.

Table 1 shows the layout of the T3 encoding.

Table 1: Layout of the T3 encoding

    '00x-'03x   Accents and diacritics
    '04x-'05x   Punctuation marks
    '06x-'07x   Basic IPA symbols I (vowels); diacritics, etc.
    '10x-'13x   Basic IPA symbols II; diacritics, etc.
    '14x-'17x   Pct.; Basic IPA symbols III (lowercase letters); Diacr.
    '20x-'23x   Tone letters and other suprasegmentals
    '24x-'27x   Old IPA, non-IPA symbols
    '30x-'33x   Extended IPA symbols; Gmn.
    '34x-'37x   Basic IPA symbols IV; Gmn.

    Pct. = Punctuation marks, Diacr. = Diacritics, Gmn. = Symbols for Germanic.

The basic structure of the encoding found in the first half of the table (character codes '000-'177) is based on normal text encodings (ASCII, OT1 and T1), in that the sectioning of this area into several groups, such as the section for accents and diacritics, the section for punctuation marks, the section for numerals, and the sections for uppercase and lowercase letters, is basically the same as in these encodings.

Note also that the T3 encoding contains not only phonetic symbols but also the usual punctuation marks that are used with phonetic symbols, and in such cases the same codes are assigned as in the normal text encodings. However, it is a matter of trade-off to decide which punctuation marks are to be included. For example, ':' and ';' might have been preserved in T3, but ':' has traditionally been used as a substitute for the length mark 'ː', so I decided to exclude ':' in favor of the ease of inputting the length mark by a single keystroke.

The encoding of the section for accents and diacritics is closely related to T1, in that the accents commonly included in T1 and T3 have the same encoding.

The sections for numerals and uppercase letters are filled with phonetic symbols that are used frequently in many languages, because numerals and uppercase letters are usually not used as phonetic symbols. The assignments made here are used as the 'shortcut characters', which will be explained in the section entitled "Ordinary phonetic symbols" (page 105).

Footnote:

7 In a discussion with the LaTeX2ε team it was suggested that the 128 character encoding used in WSUIPA would be referred to as the OT3 encoding.
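The single-keystroke trade-off described above can be sketched as follows. This is a minimal illustration, assuming the tipa package is installed; the transcribed word is an arbitrary choice for the example, and the macro names \textesh and \textlengthmark are the long-form equivalents as I understand them from the package.

```latex
\documentclass{article}
\usepackage{tipa}
\begin{document}
% IPA environment: 'S' is the shortcut for esh, ':' gives the length mark.
\textipa{[Si:p]}

% Normal text environment: the same transcription spelled with macro names.
[\textesh i\textlengthmark p]
\end{document}
```

Both lines should print the same transcription; the first simply needs far fewer keystrokes.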


Table 2: TIPA shortcut characters

    ASCII   :  ;  "
    TIPA    ː  ˑ  ˈ

    ASCII   0  1  2  3  4  5  6  7  8  9
    TIPA    ʉ  ɨ  ʌ  ɜ  ɥ  ɐ  ɒ  ɤ  ɵ  ɘ

    ASCII   @  A  B  C  D  E  F  G  H  I
    TIPA    ə  ɑ  β  ɕ  ð  ɛ  ɸ  ɣ  ɦ  ɪ

    ASCII   J  K  L  M  N  O  P  Q  R  S
    TIPA    ʝ  ʁ  ʎ  ɱ  ŋ  ɔ  ʔ  ʕ  ɾ  ʃ

    ASCII   T  U  V  W  X  Y  Z  |
    TIPA    θ  ʊ  ʋ  ɯ  χ  ʏ  ʒ  ǀ

As for the section for uppercase letters in the usual text encoding, a series of discussions among the members of the ling-tex mailing list revealed that there seems to be a certain amount of consensus on what symbols are to be assigned to each code. For example, they were almost unanimous on assignments such as ɑ for A, β for B, ð for D, ʃ for S, θ for T, etc. For more details, see table 2.

The encoding of the section for numerals was more difficult than the above case. One of the possibilities was to assign symbols based on the resemblance of shapes. One can easily think of assigning look-alike symbols to 3, 6, etc. But the resemblance of shape alone does not serve as a criterion for all the assignments. So I decided to assign basic vowel symbols to this section. [8] Fortunately the resemblance of shape is to some extent maintained, as is shown in table 2.

The encoding of the section for lowercase letters poses no problem since they are all used as phonetic symbols. Only one symbol, namely 'g', needs some attention because its shape should be 'ɡ', rather than 'g', as a phonetic symbol. [9]

The second half of the table (character codes '200-'377) is divided into four sections. The first section is devoted to the elements of tone letters and other suprasegmental symbols.

Among the remaining three sections the last section, '340-'377, contains more basic symbols than the other two sections. This is a result of assigning the same character codes as the latin-1 (ISO8859-1) and T1 encodings to the symbols that are commonly included in TIPA, latin-1 and T1 encoded fonts. [10] These are the cases of æ, ø, œ, ç and þ. Within each section, symbols are arranged largely in alphabetical order.

For a table of the T3 encoding, see Appendix C (page 114).

Footnotes:

8 This idea was influenced by the above mentioned article by J. C. Wells [10].

9 But the alternative shape 'g' is preserved in another section and can be used as \textg.

10 This is based on a suggestion by Jörg Knappen.

TIPA fonts

This version of TIPA includes two families of IPA fonts, tipa and xipa. The former family of fonts is for normal use with LaTeX, and the latter family is intended to be used with 'times.sty' (PSNFSS). They all have the same T3 encoding as explained in the previous section.

• tipa
    Roman: tipa8, tipa9, tipa10, tipa12, tipa17
    Slanted: tipasl8, tipasl9, tipasl10, tipasl12
    Bold extended: tipabx8, tipabx9, tipabx10, tipabx12
    Sans serif: tipass8, tipass9, tipass10, tipass12, tipass17
    Bold: tipab10
• xipa
    Roman: xipa10
    Slanted: xipasl10
    Bold: xipab10
    Sans serif: xipass10

All these fonts are made by METAFONT, based on the Computer Modern font series. In the case of the xipa series, parameters are adjusted so as to look fine when used with Times Roman (in the cases of xipa10, xipasl10, xipab10) and Helvetica (in the case of xipass10).

Usage

Declaration of TIPA package

In order to use TIPA, first declare the TIPA package in the preamble of a document.

    \documentclass{article}
    \usepackage{tipa}

Encoding options

The above declaration uses OT1 as the default text encoding. If you want to use TIPA symbols with T1, specify the option 'T1'.

    \documentclass{article}
    \usepackage[T1]{tipa}

If you want to use a more complex form of encoding, declare the use of the fontenc package yourself and specify the option 'noenc'. In this case the option 'T3', which represents the TIPA encoding, must be included as an option to the fontenc package. For example, if you want to use TIPA and the University of Washington Cyrillic (OT2) with the T1 text encoding, the following command will do this.


    \documentclass{article}
    \usepackage[T3,OT2,T1]{fontenc}
    \usepackage[noenc]{tipa}

By default, TIPA includes the fontenc package internally, but the option noenc suppresses this.

Using TIPA with PSNFSS

In order to use TIPA with times.sty, declare the use of times.sty before declaring the tipa package.

    \documentclass{article}
    \usepackage{times}
    \usepackage{tipa}

Font description files T3ptm.fd and T3phv.fd are automatically loaded by the above declaration.

Other options

TIPA can be extended by the options tone and extra.

If you want to use the optional package for 'tone letters', add the 'tone' option to the \usepackage command that declares the tipa package.

    \usepackage[tone]{tipa}

And if you want to use diacritics for extIPA and VoQS, specify the 'extra' option.

    \usepackage[extra]{tipa}

Finally, there is one more option called 'safe', which is used to suppress definitions of some possibly 'dangerous' commands of TIPA.

    \usepackage[safe]{tipa}

More specifically, the following commands are suppressed by declaring the safe option. The function of each command is explained later.

• \s (equivalent to \textsyllabic)
• \* (already defined in plain TeX)
• \|, \:, \;, \! (already defined in LaTeX)

Input Commands for Phonetic Symbols

Ordinary phonetic symbols

TIPA phonetic symbols can be input in the following two ways.

1. Input macro names in the normal text environment.
2. Input macro names or shortcut characters within the following groups or environment.
   • \textipa{...} [11]
   • {\tipaencoding ...}
   • \begin{IPA} ... \end{IPA}

(These groups and environment will henceforth be referred to as the IPA environment.)

A shortcut character refers to a single character that is assigned to a specific phonetic symbol and that can be directly input from an ordinary keyboard. In TIPA fonts, the character codes for numerals and uppercase letters in the normal ASCII encoding are assigned to such shortcut characters, because numerals and uppercase letters are usually not used as phonetic symbols. Additional shortcut characters for symbols such as æ, œ, ø may also be used if you are using a T1 encoded font and an appropriate input system for it.

The following pair of examples shows the same transcription of an English word, input by the two methods mentioned above.

    Input1:  [\textsecstress\textepsilon kspl\textschwa\textprimstress ne\textsci\textesh\textschwa n]
    Output1: [ˌɛkspləˈneɪʃən]
    Input2:  \textipa{[""Ekspl@"neIS@n]}
    Output2: [ˌɛkspləˈneɪʃən]

It is apparent that inputting in the IPA environment is far easier than in the normal text environment. Moreover, although the outputs of the above examples look almost the same, they are not, strictly speaking, identical. This is because in the IPA environment automatic kerning between symbols is enabled, as is illustrated by the following pair of examples.

    Input1:  v\textturnv v w\textsca w y\textturny y [\textesh]
    Output1: vʌv wᴀw yʎy [ʃ]   (no kerning between symbols)
    Input2:  \textipa{v2v w\textsca w yLy [S]}
    Output2: vʌv wᴀw yʎy [ʃ]   (with automatic kerning)

Table 2 shows most of the shortcut characters together with the corresponding characters in the ASCII encoding.

Naming of phonetic symbols

Every TIPA phonetic symbol has a unique symbol name, such as Turned A, Hooktop B, Schwa. [12] Each symbol also has a corresponding control sequence name, such as \textturna, \texthtb, \textschwa. The name used as a control sequence is usually an abbreviated form of the corresponding symbol name with the prefix \text. The conventions used in the abbreviation are as follows.

• Suffixes and endings such as '-ive', '-al', '-ed' are omitted.

Footnotes:

11 I personally prefer a slightly shorter name like \ipa rather than \textipa, but this command was named after the general convention of LaTeX2ε. The same can be said of all the symbol names beginning with \text.

12 The naming was based on the literature listed in the section entitled "Selection of symbols" (page 102). Users of TSIPA should be careful because TIPA's naming is slightly modified from that of TSIPA.


• 'right', 'left' are abbreviated to r, l respectively.
• For 'small capital' symbols, the prefix sc is added.
• A symbol with a hooktop is abbreviated as ht...
• A symbol with a curly-tail is abbreviated as ct...
• A 'crossed' symbol is abbreviated as cr...
• A ligature is abbreviated as ...lig.
• For an old version of a symbol, the prefix O is added.

Note that the prefix O (old) should be given as an uppercase letter.

Table 3 shows some examples of correspondence between symbol names and control sequence names.

Table 3: Naming of TIPA symbols

    Symbol name            Macro name         Symbol
    Turned A               \textturna         ɐ
    Glottal Stop           \textglotstop      ʔ
    Right-tail D           \textrtaild        ɖ
    Small Capital G        \textscg           ɢ
    Hooktop B              \texthtb           ɓ
    Curly-tail C           \textctc           ɕ
    Crossed H              \textcrh           ħ
    Old L-Yogh Ligature    \textOlyoghlig
    Beta                   \textbeta          β

Ligatures

Just as symbols such as the double quotation marks, the en dash, the em dash, fi and ff are realized as ligatures in TeX by inputting ``, '', --, ---, fi, ff, two of the TIPA symbols, namely Secondary Stress and Double Pipe, and the double quotation marks [13] can be input as ligatures in the IPA environment.

    Input:  \textipa{" "" | || `` ''}
    Output: ˈ ˌ ǀ ǁ “ ”

Special macros \*, \;, \: and \!

TIPA defines \*, \;, \: and \! as special macros in order to easily input phonetic symbols that do not have a shortcut character as explained above. Before explaining how to use these macros, it is necessary to note that they are primarily intended to be used by linguists who usually do not care about things in math mode. They can be 'dangerous' in that they override existing LaTeX commands used in math mode. So if you want to preserve the original meaning of these commands, declare the option 'safe' in the preamble.

The macro \* is used in three different ways. First, when this macro is followed by one of the letters f, k, r, t or w, it results in a turned symbol. [14]

    Input:  \textipa{\*f \*k \*r \*t \*w}
    Output: ɟ ʞ ɹ ʇ ʍ

Secondly, when this macro is followed by one of the letters j, n, h, l or z, it results in a frequently used symbol that otherwise has no easy way to be input.

    Input:  \textipa{\*j \*n \*h \*l \*z}
    Output: ʝ ɲ ħ ɬ ɮ

Thirdly, when this macro is followed by letters other than the above cases, they are turned into the symbols of the default text font. This is useful in the IPA environment to select symbols temporarily from the normal text font.

    Input:  \textipa{\*A dOg, \*B k\ae{}t, ma\super{\*{214}}}
    Output: A dɔg, B kæt, ma²¹⁴

The remaining macros \;, \: and \! are used to make small capital symbols, retroflex symbols, and implosives or clicks, respectively.

    Input:  \textipa{\;B \;E \;A \;H \;L \;R}
    Output: ʙ ᴇ ᴀ ʜ ʟ ʀ
    Input:  \textipa{\:d \:l \:n \:r \:s \:z}
    Output: ɖ ɭ ɳ ɽ ʂ ʐ
    Input:  \textipa{\!b \!d \!g \!j \!G \!o}
    Output: ɓ ɗ ɠ ʄ ʛ ʘ

Punctuation marks

The following punctuation marks and text symbols that are normally included in the text encoding are also included in the T3 encoding, so that they can be directly input in the IPA environment.

    Input:  \textipa{! ' ( ) * + , - . \/ = ? [ ] `}
    Output: ! ' ( ) * + , - . / = ? [ ] `

All the other punctuation marks and text symbols that are not included in T3 need to be input with the prefix \* explained in the last section when they appear in the IPA environment.

    Input:  \textipa{\*; \*: \*@ \*\# \*\$ \*\& \*\% \*\{ \*\}}
    Output: ; : @ # $ & % { }

Footnotes:

13 Although TIPA fonts do not include the double quotation mark symbols, a negative value of kerning is automatically inserted between ` and `, ' and ', so that the same results can be obtained as in the case of the normal text font.

14 This idea was pointed out by Jörg Knappen.

Accents and diacritics

Table 4 shows how to input accents and diacritics in TIPA with some examples. Here again, there are two kinds of input methods: one for the normal text environment, and the other for the IPA environment.

Table 4: Examples of inputting accents (output glyphs not reproduced)

    Input in the normal text environment    Input in the IPA environment
    \'a                                     \'a
    \"a                                     \"a
    \~a                                     \~a
    \r{a}                                   \r{a}
    \textsyllabic{m}                        \s{m}
    \textsubumlaut{a}                       \"*a
    \textsubtilde{a}                        \~*a
    \textsubring{a}                         \r*a
    \textdotacute{e}                        \.'e
    \textgravedot{e}                        \`.e
    \textacutemacron{a}                     \'=a
    \textcircumdot{a}                       \^.a
    \texttildedot{a}                        \~.a
    \textbrevemacron{a}                     \u=a

In the IPA environment, most of the accents and diacritics can be input more easily than in the normal text environment, especially in the cases of subscript symbols that are normally placed over a symbol and in the cases of combined accents, as shown in the table.

As can be seen from the above examples, most of the accents that are normally placed over a symbol can be placed under a symbol by adding an * to the corresponding accent command in the IPA environment.

The advantage of the IPA environment is further exemplified by the all-purpose accent \|, which is used as a macro prefix to provide shortcut inputs for the diacritics that otherwise have to be input by lengthy macro names. Table 5 shows examples of such accents. Note that the macro \| is also 'dangerous' in that it has already been defined as a math symbol of LaTeX. So if you want to preserve the original meaning of this macro, declare the 'safe' option in the preamble.

Table 5: Examples of the accent prefix \| (output glyphs not reproduced)

    Input in the normal text environment    Input in the IPA environment
    \textsubbridge{t}                       \|[t
    \textinvsubbridge{t}                    \|]t
    \textsublhalfring{a}                    \|(a
    \textsubrhalfring{a}                    \|)a
    \textroundcap{k}                        \|c{k}
    \textsubplus{o}                         \|+o
    \textraising{e}                         \|'e
    \textlowering{e}                        \|`e
    \textadvancing{o}                       \|<o
    \textretracting{a}                      \|>a
    \textovercross{e}                       \|x{e}
    \textsubw{k}                            \|w{k}
    \textseagull{t}                         \|m{t}

Finally, examples of words with complex accents that are input in the IPA environment are shown below.

    Input:  \textipa{*\|c{k}\r*mt\'om *bhr\'=at\=er}
    Output: *k̑m̥tóm *bhrā́tēr

For a full list of accents and diacritics, see Appendix A.

Superscript symbols

In the normal text environment, superscript symbols can be input by a macro called \textsuperscript, which has been newly introduced in the recent version of LaTeX2ε. This macro takes one argument, which can be either a symbol or a string of symbols, and can be nested.

Since the name of this macro is too long, TIPA provides an abbreviated form of this macro called \super.

    Input1:  t\textsuperscript h k\textsuperscript w
             a\textsuperscript{bc}
             a\textsuperscript{b\textsuperscript{c}}
    Output1: tʰ kʷ aᵇᶜ, and a with superscript b carrying a further superscript c
    Input2:  \textipa{t\super{h} k\super{w}
             a\super{bc} a\super{b\super{c}}}
    Output2: the same four forms as Output1

These macros automatically select the correct size of superscript font no matter what size of text font is used.

Tone letters

TIPA provides a flexible system of macros for 'tone letters'. A tone letter is represented by a macro called '\tone', which takes one argument consisting of a string of numbers ranging from 1 to 5. These numbers denote pitch levels, 1 being the lowest and 5 the highest. Within this range, any combination is allowed and there is no limit on the length of the combination.

As an example of the usage of the tone letter macro, the four tones of Chinese are shown below.

    Input:  \tone{55}ma ``mother'',
            \tone{35}ma ``hemp'',
            \tone{214}ma ``horse'',
            \tone{51}ma ``scold''
    Output: ˥ma “mother”, ˧˥ma “hemp”, ˨˩˦ma “horse”, ˥˩ma “scold”
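The tone letter macro can be tried out with a complete document along the following lines. This is a minimal sketch, assuming the tipa package with its 'tone' option as described in the section on options; the five-level contour on the last line is a made-up string, included only to illustrate that longer combinations are accepted.

```latex
\documentclass{article}
\usepackage[tone]{tipa}  % the 'tone' option loads the tone letter macros
\begin{document}
% The four tones from the example above.
\tone{55}ma ``mother'', \tone{35}ma ``hemp'',
\tone{214}ma ``horse'', \tone{51}ma ``scold''

% A hypothetical longer contour: any string of digits 1 to 5 is allowed.
\tone{15351}
\end{document}
```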


Table 6: Examples of font switching (each input prints fəˈnɛtɪks in the style named)

    Roman
        \textipa{f@"nEtIks}
    Slanted
        \textipa{\slshape f@"nEtIks} or
        \textipa{\textsl{f@"nEtIks}} or
        \textsl{\textipa{f@"nEtIks}}
    Bold extended
        \textipa{\bfseries f@"nEtIks} or
        \textipa{\textbf{f@"nEtIks}} or
        \textbf{\textipa{f@"nEtIks}}
    Sans Serif
        \textipa{\sffamily f@"nEtIks} or
        \textipa{\textsf{f@"nEtIks}} or
        \textsf{\textipa{f@"nEtIks}}

How easy is it to input phonetic symbols?

Let us briefly estimate here how easy (or difficult) it is to input phonetic symbols with TIPA in terms of the number of keystrokes.

The following table shows statistics for all the phonetic symbols that appear in the '93 version of the IPA chart (diacritics and symbols for suprasegmentals excluded). It is assumed here that each symbol is input within the IPA environment and the safe option is not specified.

    keystrokes     number   examples
    1              65       a, b, A, B, etc.
    2              2
    3              30
    5              1
    more than 5    7

As is shown in the table, about 92% of the symbols (97 out of 105) can be input within three keystrokes.

Changing font styles

This version of TIPA includes five styles of fonts, i.e. roman, slanted, bold, bold extended and sans serif. These styles can be switched in much the same way as in the normal text fonts (see table 6).

The bold fonts are usually not used within the standard LaTeX class packages, so if you want to use them, it is necessary to use the low-level font selection commands of LaTeX2ε.

    Input:  {\fontseries{b}\selectfont abcdefg \textipa{ABCDEFG}}
    Output: abcdefg ɑβɕðɛɸɣ (in bold)

Note also that slanting of TIPA symbols should work correctly even in the cases of combined accents and in the cases of symbols made up by macros.

    Input:  \textsl{\textipa{\'{\"{\u*{e}}}}}
    Output: a slanted e carrying the combined accents
    Input:  \textsl{\textdoublebaresh}
    Output: a slanted double-barred esh (this symbol is composed by a macro)

Internal commands

Some of the internal commands of TIPA are defined without the letter @ in order to allow a user to extend the capability of TIPA.

\ipabar

Some TIPA symbols such as \textbarb and \textcrtwo are defined by using an internal macro command \ipabar. This command is useful when you want to make barred or crossed symbols not defined in TIPA.

This command requires the following five parameters to control the position of the bar.

• #1 the symbol to be barred
• #2 the height of the bar (in dimen)
• #3 bar width
• #4 left kern added to the bar
• #5 right kern added to the bar

Parameters #3, #4, #5 are to be given as a scaling factor to the width of the symbol, which is equal to 1 if the bar has the same width as the symbol in question. For example, the following command defines a barred b in which the bar position in the y-coordinate is .5ex and the width of the bar is slightly larger than that of the letter b.

    % Barred B
    \newcommand\textbarb{%
      \ipabar{{\tipaencoding b}}%
      {.5ex}{1.1}{}{}}

Note that the parameters #4 and #5 can be left blank if the value is equal to 0.

The next example declares a barred c in which the bar width is a little more than half as large as the letter c and which has the same size of kerning at the right.

    % Barred C
    \newcommand\textbarc{%
      \ipabar{{\tipaencoding c}}%
      {.5ex}{.55}{}{.55}}

More complex examples with the \ipabar command are found in T3enc.def.

\tipaloweraccent, \tipaupperaccent

These two commands are used in the definitions of TIPA accents and diacritics. They are special forms of the commands \loweraccent and \upperaccent that are defined in exaccent.sty. The difference between the commands with the prefix tipa and the ones without it is that the former commands select accents from a T3 encoded font while the latter ones do so from the current text font.

These commands take two parameters, the code of the accent (as a decimal, octal or hexadecimal number) and the symbol to be accented, as shown below.

    Input:  \tipaupperaccent{0}{a}
    Output: à

Optionally, these commands can take an extra parameter to adjust the vertical position of the accent. Such an adjustment is sometimes necessary in the definition of a nested accent. The next example shows TIPA's definition of the 'Circumflex Dot Accent'.

    % Circumflex Dot Accent
    \newcommand\textcircumdot[1]%
      {\tipaupperaccent[-.2ex]{2}%
        {\tipaupperaccent[-.1ex]{10}{#1}}}

This definition states that a dot accent is placed over a symbol, reducing the vertical distance between the symbol and the dot by .1ex, and that a circumflex accent is placed over the dot, with the distance between the two accents reduced by .2ex.

If you want to make a combined accent not included in TIPA, you can do so fairly easily by using these two commands together with the optional parameter. For more examples of these commands, see tipa.sty and extraipa.sty.

\tipaLoweraccent, \tipaUpperaccent

These two commands differ from the two commands explained above in that the first parameter should be a symbol (or any other thing, typically an \hbox), rather than the code of the accent. They are special cases of the commands \Loweraccent and \Upperaccent, and the difference between the two pairs of commands is the same as before.

The next example makes a schwa an accent.

    Input:  \tipaUpperaccent[.2ex]%
              {\lower.8ex\hbox{%
                \textipa{\super@}}}{a}
    Output: a with a small schwa placed over it

Acknowledgments

First of all, many thanks are due to the co-authors of TSIPA, Kobayashi Hajime and Shirakawa Shun. Kobayashi Hajime was the main font designer of TSIPA. Shirakawa Shun worked very hard in deciding the encoding, checking the shapes of symbols and writing the Japanese version of the document. TIPA would have been impossible without TSIPA.

I would also like to thank Jörg Knappen, whose insightful comments helped greatly in many ways in the development of TIPA. I was also helped and encouraged by Christina Thiele, Martin Haase, Kirk Sullivan and many other members of the ling-tex mailing list.

At the last stage of the development of TIPA Frank Mittelbach gave me precious comments on how to incorporate various TIPA commands into the NFSS. I would also like to thank Barbara Beeton, who kindly read over the preliminary draft of this document and gave me useful comments.

References

[1] Martin J. Ball, John Esling, and Craig Dickson. VoQS: Voice Quality Symbols. 1994.

[2] John Esling. Computer coding of the IPA: Supplementary report. Journal of the International Phonetic Association, 20(1):22–26, 1990.

[3] John H. Esling and Harry Gaylord. Computer codes for phonetic symbols. Journal of the International Phonetic Association, 23(2):83–97, 1993.

[4] ICPLA. extIPA Symbols for Disordered Speech. 1994.

[5] IPA. The Principles of the International Phonetic Association, 1949.

[6] IPA. Report on the 1989 Kiel Convention. Journal of the International Phonetic Association, 19(2):67–80, 1989.

[7] IPA. Further report on the 1989 Kiel Convention. Journal of the International Phonetic Association, 20(2):22–24, 1990.

[8] IPA. Council actions on revisions of the IPA. Journal of the International Phonetic Association, 23(1):32–34, 1993.

[9] Geoffrey K. Pullum and William A. Ladusaw. Phonetic Symbol Guide. The University of Chicago Press, 1986.

[10] John C. Wells. Computer-coding the IPA: a proposed extension of SAMPA. Revised draft 1995 04 28, 1995.

Appendix A: List of TIPA Symbols

For each symbol the following information is shown: (1) the symbol, (2) the input method in the normal text environment (and a shortcut method that can be used within the IPA environment in parentheses), (3) the name of the symbol.

Vowels and Consonants

    a    a                        Lower-case A


\textturna (5)            Turned A
\textscripta (A)          Script A
\textturnscripta (6)      Turned Script A
\ae                       Ash
\textsca (\;A)            Small Capital A ¹⁵
\textturnv (2)            Turned V ¹⁶
b                         Lower-case B
\textsoftsign             Soft Sign
\texthardsign             Hard Sign
\texthtb (\!b)            Hooktop B
\textscb (\;B)            Small Capital B
\textcrb                  Crossed B
\textbarb                 Barred B
\textbeta (B)             Beta
c                         Lower-case C
\textbarc                 Barred C
\texthtc                  Hooktop C
\v{c}                     C Wedge
\c{c}                     C Cedilla
\textctc (C)              Curly-tail C
\textstretchc             Stretched C ¹⁷
d                         Lower-case D
\textcrd                  Crossed D
\textbard                 Barred D
\texthtd (\!d)            Hooktop D
\textrtaild (\:d)         Right-tail D
\textctd                  Curly-tail D
\textdzlig                D-Z Ligature
\textdctzlig              D-Curly-tail Z Ligature
\textdyoghlig             D-Yogh Ligature
\textctdctzlig            Curly-tail D-Curly-tail Z Ligature
\dh (D)                   Eth
e                         Lower-case E
\textschwa (@)            Schwa
\textrhookschwa           Right-hook Schwa
\textreve (9)             Reversed E
\textsce (\;E)            Small Capital E
\textepsilon (E)          Epsilon
\textcloseepsilon         Closed Epsilon
\textrevepsilon (3)       Reversed Epsilon
\textrhookrevepsilon      Right-hook Reversed Epsilon
\textcloserevepsilon      Closed Reversed Epsilon
f                         Lower-case F
\textg (g)                Lower-case G
\textbarg                 Barred G
\textcrg                  Crossed G
\texthtg (\!g)            Hooktop G
g (\textg)                Text G
\textscg (\;G)            Small Capital G
\texthtscg (\!G)          Hooktop Small Capital G
\textgamma (G)            Gamma
\textbabygamma            Baby Gamma
\textramshorns (7)        Ram's Horns
h                         Lower-case H
\texthvlig                H-V Ligature
\textcrh                  Crossed H
\texthth (H)              Hooktop H
\texththeng               Hooktop Heng
\textturnh (4)            Turned H
\textsch (\;H)            Small Capital H
i                         Lower-case I
\i                        Undotted I
\textbari (1)             Barred I
\textiota                 Iota
\textlhti                 Left-hooktop I ¹⁸
\textlhtlongi             Left-hooktop Long I
\textvibyi                Viby I ¹⁹
\textraisevibyi           Raised Viby I
\textsci (I)              Small Capital I
j                         Lower-case J
\j                        Undotted J
\textctj (J)              Curly-tail J ²⁰
\textscj (\;J)            Small Capital J
\v{\j}                    J Wedge
\textbardotlessj          Barred Dotless J
\textObardotlessj         Old Barred Dotless J
\texthtbardotlessj (\!j)  Hooktop Barred Dotless J ²¹
k                         Lower-case K
\texthtk                  Hooktop K
\textturnk (\*k)          Turned K
l                         Lower-case L

¹⁵ This symbol is fairly common among Chinese phoneticians.
¹⁶ In PSG this symbol is called 'Inverted V' but it is apparently a mistake.
¹⁷ The shape of this symbol differs according to the sources. In PSG and recent articles in JIPA, it is 'stretched' toward both the ascender and descender regions and the whole shape looks like a thick staple. In the old days, however, it was stretched only toward the ascender and the whole shape looked more like a stretched c.
¹⁸ The four symbols \textlhti, \textlhtlongi, \textvibyi and \textraisevibyi are mainly used among Chinese linguists. These symbols are based on "det svenska landsmålsalfabetet" and were introduced to China by Bernhard Karlgren. The original shapes of these symbols were in italic, as was always the case with "det svenska landsmålsalfabetet". It seems that the Chinese linguists who wanted to continue to use these symbols in IPA changed their shapes to upright.
¹⁹ I call this symbol 'Viby I', based on the following description by Bernhard Karlgren: "Une voyelle très analogue à [symbol] se rencontre dans certains dial. suédois; on l'appelle 'i de Viby'." (Études sur la phonologie chinoise, 1915–26, p. 295)
²⁰ In the official IPA charts of '89 and '93, this symbol has a dish serif on top of the stem, rather than the normal sloped serif found in the letter j. I found no reason why it should have a dish serif here, so I changed it to a normal sloped serif.
²¹ In PSG the shape of this symbol differs slightly. Here I followed the shape found in IPA '89, '93.


\textltilde (\|~l)        L with Tilde
\textbarl                 Barred L
\textbeltl                Belted L
\textrtaill (\:l)         Right-tail L
\textlyoghlig             L-Yogh Ligature
\textOlyoghlig            Old L-Yogh Ligature
\textscl (\;L)            Small Capital L
\textlambda               Lambda
\textcrlambda             Crossed Lambda
m                         Lower-case M
\textltailm (M)           Left-tail M (at right)
\textturnm (W)            Turned M
\textturnmrleg            Turned M, Right Leg
n                         Lower-case N
\textnrleg                N, Right Leg
\~n                       N with Tilde
\textltailn               Left-tail N (at left)
\ng (N)                   Eng
\textrtailn (\:n)         Right-tail N
\textctn                  Curly-tail N
\textscn (\;N)            Small Capital N
o                         Lower-case O
\textbullseye (\!o)       Bull's Eye
\textbaro (8)             Barred O
\o                        Slashed O
\oe                       O-E Ligature
\textscoelig (\OE)        Small Capital O-E Ligature
\textopeno (O)            Open O
\textturncelig            Turned C (Open O)-E Ligature
\textomega                Omega
\textscomega              Small Capital Omega
\textcloseomega           Closed Omega
p                         Lower-case P
\textwynn                 Wynn
\textthorn (\th)          Thorn
\texthtp                  Hooktop P
\textphi (F)              Phi
q                         Lower-case Q
\texthtq                  Hooktop Q
\textscq (\;Q)            Small Capital Q ²²
r                         Lower-case R
\textfishhookr (R)        Fish-hook R
\textlonglegr             Long-leg R
\textrtailr (\:r)         Right-tail R
\textturnr (\*r)          Turned R
\textturnrrtail (\:R)     Turned R, Right Tail
\textturnlonglegr         Turned Long-leg R
\textscr (\;R)            Small Capital R
\textinvscr (K)           Inverted Small Capital R
s                         Lower-case S
\v{s}                     S Wedge
\textrtails (\:s)         Right-tail S (at left)
\textesh (S)              Esh
\textdoublebaresh         Double-barred Esh
\textctesh                Curly-tail Esh
t                         Lower-case T
\texthtt                  Hooktop T
\textlhookt               Left-hook T
\textrtailt (\:t)         Right-tail T
\texttctclig              T-Curly-tail C Ligature
\texttslig                T-S Ligature
\textteshlig              T-Esh Ligature
\textturnt (\*t)          Turned T
\textctt                  Curly-tail T
\textcttctclig            Curly-tail T-Curly-tail C Ligature
\texttheta (T)            Theta
u                         Lower-case U
\textbaru (0)             Barred U
\textupsilon (U)          Upsilon
\textscu (\;U)            Small Capital U
v                         Lower-case V
\textscriptv (V)          Script V
w                         Lower-case W
\textturnw (\*w)          Turned W
x                         Lower-case X
\textchi (X)              Chi
y                         Lower-case Y
\textturny (L)            Turned Y
\textscy (Y)              Small Capital Y
\textvibyy                Viby Y ²³
z                         Lower-case Z
\textcommatailz           Comma-tail Z
\v{z}                     Z Wedge
\textctz                  Curly-tail Z
\textrevyogh              Reversed Yogh
\textrtailz (\:z)         Right-tail Z
\textyogh (Z)             Yogh
\textctyogh               Curly-tail Yogh
\textcrtwo                Crossed 2
\textglotstop (P)         Glottal Stop
\textraiseglotstop        Superscript Glottal Stop
\textbarglotstop          Barred Glottal Stop
\textinvglotstop          Inverted Glottal Stop
\textcrinvglotstop        Crossed Inverted Glottal Stop
\textrevglotstop (Q)      Reversed Glottal Stop
\textbarrevglotstop       Barred Reversed Glottal Stop

²² Suggested by Prof. S. Tsuchida for Austronesian languages in Taiwan. In PSG 'Female Sign' and 'Uncrossed Female Sign' (pp. 110–111) are noted for pharyngeal stops, as proposed by Trager (1964). Also I'm not sure about the difference between an epiglottal plosive and a pharyngeal stop.
²³ See explanations in footnote 19.


\textpipe (|)             Pipe
\textdoublebarpipe        Double-barred Pipe
\textdoublebarslash       Double-barred Slash
\textdoublepipe (||)      Double Pipe
!                         Exclamation Point

Suprasegmentals

\textprimstress (")       Vertical Stroke (Superior)
\textsecstress ("")       Vertical Stroke (Inferior)
\textlengthmark (:)       Length Mark
\texthalflength (;)       Half-length Mark
\textvertline             Vertical Line
\textdoublevertline       Double Vertical Line
\textbottomtiebar (\t*{}) Bottom Tie Bar
\textglobfall             Downward Diagonal Arrow
\textglobrise             Upward Diagonal Arrow
\textdownstep             Down Arrow ²⁴
\textupstep               Up Arrow

Accents and Diacritics

\`e                                  Grave Accent
\'e                                  Acute Accent
\^e                                  Circumflex Accent
\~e                                  Tilde
\"e                                  Umlaut
\H{e}                                Double Acute Accent
\v{e}                                Wedge
\u{e}                                Breve
\=e                                  Macron
\.e                                  Dot
\c{e}                                Cedilla
\textpolhook{e} (\k{e})              Polish Hook (Ogonek Accent)
\textdoublegrave{e} (\H*e)           Double Grave Accent
\textsubgrave{e} (\`*e)              Subscript Grave Accent
\textsubacute{e} (\'*e)              Subscript Acute Accent
\textsubcircum{e} (\^*e)             Subscript Circumflex Accent
\textroundcap{g} (\|c{g})            Round Cap
\textacutemacron{a} (\'=a)           Acute Accent with Macron
\textvbaraccent{a}                   Vertical Bar Accent
\textdoublevbaraccent{a}             Double Vertical Bar Accent
\textgravedot{e} (\`.e)              Grave Dot Accent
\textdotacute{e} (\'.e)              Dot Acute Accent
\textcircumdot{a} (\^.a)             Circumflex Dot Accent
\texttildedot{a} (\~.a)              Tilde Dot Accent
\textbrevemacron{a} (\u=a)           Breve Macron Accent
\textringmacron{a} (\r=a)            Ring Macron Accent
\textacutewedge{s} (\v's)            Acute Wedge Accent
\textdotbreve{a}                     Dot Breve Accent
\textsubbridge{t} (\|[t)             Subscript Bridge
\textinvsubbridge{d} (\|]t)          Inverted Subscript Bridge
\textsubsquare{n}                    Subscript Square
\textsubrhalfring{o} (\|)o)          Subscript Right Half-ring ²⁵
\textsublhalfring{o} (\|(o)          Subscript Left Half-ring
\textsubw{k} (\|w{k})                Subscript W
\textoverw{g}                        Over W
\textseagull{t} (\|m{t})             Seagull
\textovercross{e} (\|x{e})           Over-cross
\textsubplus{\textopeno} (\|+O)      Subscript Plus ²⁶
\textraising{\textepsilon} (\|'E)    Raising Sign
\textlowering{e} (\|`e)              Lowering Sign
\textadvancing{u} (\|<u)             Advancing Sign
\textretracting{\textschwa} (\|>@)   Retracting Sign
\textsubtilde{e} (\~*e)              Subscript Tilde
\textsubumlaut{e} (\"*e)             Subscript Umlaut
\textsubring{u} (\r*u)               Subscript Ring
\textsubwedge{e} (\v*e)              Subscript Wedge
\textsubbar{e} (\=*e)                Subscript Bar
\textsubdot{e} (\.*e)                Subscript Dot
\textsubarch{e}                      Subscript Arch
\textsyllabic{m} (\s{m})             Syllabicity Mark
\textsuperimposetilde{t} (\|~{t})    Superimposed Tilde
t\textcorner                         Corner
t\textopencorner                     Open Corner
\textschwa\rhoticity                 Rhoticity
b\textceltpal                        Celtic Palatalization Mark
k\textlptr                           Left Pointer
k\textrptr                           Right Pointer
p\textrectangle                      Rectangle ²⁷

²⁴ The shapes of \textdownstep and \textupstep differ according to sources. Here I followed the shapes found in the recent IPA charts.
²⁵ The diacritics \textsubrhalfring and \textsublhalfring can be placed after a symbol by inputting, for example, [e\textsubrhalfring{}].
²⁶ Diacritics such as \textsubplus, \textraising, \textlowering, \textadvancing and \textretracting can be placed after a symbol by inputting, for example, [e\textsubplus{}].
²⁷ This symbol is used among Japanese linguists as a diacritical symbol indicating no audible release (the IPA corner symbol), because the corner symbol is used to indicate pitch accent in Japanese.


\texttoptiebar{gb} (\t{gb})                       Top Tie Bar
\textrevapostrophe                                Reversed Apostrophe
.                                                 Period
\texthooktop                                      Hooktop
\textrthook                                       Right Hook
\textpalhook                                      Palatalization Hook
p\textsuperscript{h} (p\super h)                  Superscript H
k\textsuperscript{w} (k\super w)                  Superscript W
t\textsuperscript{j} (t\super j)                  Superscript J
t\textsuperscript{\textgamma} (t\super G)         Superscript Gamma
d\textsuperscript{\textrevglotstop} (d\super Q)   Superscript Reversed Glottal Stop
d\textsuperscript{n} (d\super n)                  Superscript N
d\textsuperscript{l} (d\super l)                  Superscript L

Tone letters

The tones illustrated here are only a representative sample of what is possible. For more details see the section entitled "Tone Letters" (page 107).

\tone{55}     Extra High Tone
\tone{44}     High Tone
\tone{33}     Mid Tone
\tone{22}     Low Tone
\tone{11}     Extra Low Tone
\tone{51}     Falling Tone
\tone{15}     Rising Tone
\tone{45}     High Rising Tone
\tone{12}     Low Rising Tone
\tone{454}    High Rising Falling Tone

Diacritics for extIPA, VoQS

In order to use the diacritics listed in this section, it is necessary to specify the option 'extra' in the preamble (see the section entitled "Other options" on page 105). Note also that some of the diacritics are defined by using symbols from fonts other than TIPA, so that they may not look quite satisfactory and/or may not be slanted (e.g. \whistle{s}).

\spreadlips{s}          Left Right Arrow
\overbridge{v}          Overbridge
\bibridge{n}            Bibridge
\subdoublebar{t}        Subscript Double Bar
\subdoublevert{f}       Subscript Double Vertical Line
\subcorner{v}           Subscript Corner
\whistle{s}             Up Arrow
\sliding{\ipa{Ts}}      Right Arrow
\crtilde{m}             Crossed Tilde
\dottedtilde{a}         Dotted Tilde
\doubletilde{s}         Double Tilde
\partvoiceless{n}       Parenthesis + Ring
\inipartvoiceless{n}    Parenthesis + Ring
\finpartvoiceless{n}    Parenthesis + Ring
\partvoice{s}           Parenthesis + Subwedge
\inipartvoice{s}        Parenthesis + Subwedge
\finpartvoice{s}        Parenthesis + Subwedge
\sublptr{J}             Subscript Left Pointer
\subrptr{J}             Subscript Right Pointer

B  Symbols not included in TIPA

There are about 40 symbols that appear in PSG but are not included or defined in TIPA (ordinary capital letters, Greek letters and punctuation marks excluded). Most of these symbols were proposed by someone but have never, or only rarely, been adopted by other people.

Some of these symbols can be realized by writing appropriate macros, while others cannot be realized without resorting to METAFONT.

This section discusses these problems by classifying such symbols into three categories, as shown below.

1. Symbols that can be realized at TeX's macro level and/or by using symbols from other fonts.
2. Symbols that can be imitated at TeX's macro level and/or by using symbols from other fonts (but may not look quite satisfactory).
3. Symbols that cannot be realized at all without creating a new font.

The following table shows symbols that belong to the first category. For each symbol, an example of the input method and its output is also given. Note that barred or crossed symbols can be easily made by TIPA's \ipabar macro.

Script Lower-case F
    {\itshape f}
Barred Small Capital I
    \ipabar{\textsci}{.5ex}{1.1}{}{}
Barred J
    \ipabar{j}{.5ex}{1.1}{}{}
Crossed K
    \ipabar{k}{1.2ex}{.6}{}{.4}
Barred Open O
    \ipabar{\textopeno}{.5ex}{.6}{.4}{}
Barred Small Capital Omega
    \ipabar{\textscomega}{.5ex}{1.1}{}{}
Barred P


    \ipabar{p}{.5ex}{1.1}{}{}
Half-barred U
    \ipabar{u}{.5ex}{.5}{}{.5}
Barred Small Capital U
    \ipabar{\textscu}{.5ex}{1.1}{}{}
Null Sign
    $\emptyset$
Double Slash
    /\kern-.25em/
Triple Slash
    /\kern-.25em/\kern-.25em/
Pointer (Upward)
    k\super{\tiny$\wedge$}
Pointer (Downward)
    k\super{\tiny$\vee$}
Superscript Arrow
    k\super{\super{$\leftarrow$}}

Symbols that belong to the second category are shown below. Note that slashed symbols can in fact be easily made by a macro. For example, a slashed b can be made by \ipaclap{b}{/}. The reason why slashed symbols are not included in TIPA is as follows: first, a simple overlapping of a symbol and a slash does not always result in a good shape, and secondly, it doesn't seem significant to devise fine-tuned macros for symbols which were created essentially for typewriters.

Right-hook A
Slashed B
Slashed C
Slashed D
Small Capital Delta
Right-hook E
Right-hook Epsilon
Small Capital F
Female Sign
Uncrossed Female Sign
Right-hook Open O
Slashed U
Slashed W

And finally, symbols that cannot be realized at all are as follows.

• Reversed Turned Script A
• A-O Ligature
• Inverted Small Capital A
• Small Capital A-O Ligature
• D with Upper-left Hook
• Hooktop H with Rightward Tail
• Heng
• Hooktop J
• Front-bar N
• Inverted Lower-case Omega
• Reversed Esh with Top Loop
• T with Upper Left Hook
• Turned Small Capital U
• Bent-tail Yogh
• Turned 2
• Turned 3

C  TIPA encoding (T3)

[Table: the T3 encoding chart, with rows '00x to '37x and columns '0 to '7; the glyphs are not reproduced here.]
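The input methods listed in the appendices above can be tried in a small test file. The following is only a sketch: it assumes a LaTeX2e installation in which the tipa package and its fonts are available, and the macro name \mycircumdot is a hypothetical name chosen here so that it cannot clash with the definition that tipa.sty already provides. All other commands and arguments are copied from the lists above.

```latex
% Minimal test file for a few inputs taken from the symbol lists.
% Assumption: the tipa package and its fonts are installed.
\documentclass{article}
\usepackage{tipa}

% The combined-accent definition quoted in the text, repeated under a
% hypothetical name (\mycircumdot) to avoid redefining tipa's own macro.
\newcommand\mycircumdot[1]{%
  \tipaupperaccent[-.2ex]{2}{\tipaupperaccent[-.1ex]{10}{#1}}}

\begin{document}

% Long command names, usable in the normal text environment.
\textschwa\ \textesh\ \textturnv\ \textglotstop

% The same four symbols through the shortcuts of the IPA environment.
\textipa{@ S 2 P}

% A first-category symbol built with \ipabar (Barred Small Capital I).
\ipabar{\textsci}{.5ex}{1.1}{}{}

% A second-category imitation: slashed b made with \ipaclap.
\ipaclap{b}{/}

% The user-defined combined accent applied to the letter a.
\mycircumdot{a}

\end{document}
```

The first two lines of the document body should print the same four symbols, which is a quick way to check that the shortcut characters are active only inside \textipa.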

Computer Modern Typefaces as the Multiple Master Fonts

A.S. Berdnikov
Institute of Analytical Instrumentation
Rizsky pr. 26, 198103 St.Petersburg, Russia

O.A. Grineva
Institute of Analytical Instrumentation
Rizsky pr. 26, 198103 St.Petersburg, Russia

Introduction

Several years ago Adobe Inc. announced its new Multiple Master font format, which enabled one to vary the font characteristics smoothly (say, weight from light to black, width from condensed to expanded, etc.) and to create a unique font which suits the user's demands. Like many other "new" inventions¹ in computer-assisted typography, the roots of this idea can be found inside TeX, namely in the Computer Modern font family created by D. Knuth in 1977 – 1985 [1]. These fonts are parametrized using 62 (!!!) parameters, most of which are independent. It can easily be seen that this exceeds the flexibility of any multiple master font which has been created up to now, or even will be created by somebody in the future.

¹ For example, Microsoft Word 6.0 was announced as the first program which enables one to mark some place in the text by a special marker and then to refer to its position in the form "see page . . . ".

The METAFONT source code and the parameters for the Computer Modern series were created using the advice of such professional font designers as Hermann Zapf, Matthew Carter, Charles Bigelow and others. The METAFONT source code is designed so that the parameter files are separated from the main source code; all the parameters, together with the relations between them, are fully documented [1], and font variations (provided that they have a name different from the original CM fonts) are encouraged by the author. The parametric representation of the parameters for the canonical Computer Modern typefaces created by John Sauter and Karl Berry (sauter fonts [2]) and, on a different basis, by Jörg Knappen and Norbert Schwarz in the European Computer Modern typefaces [3], enables one to vary the font size smoothly over a wide range without losing the high quality of the resulting fonts. All these facts provide the basis of the LaTeX macros mff.sty, which enable one to treat Computer Modern typefaces like multiple master fonts.

The mff.sty macros follow the ideas implemented in the MFPIC package and allow specification of new fonts dynamically within a LaTeX document without dealing with the details of METAFONT programming and without manual manipulation of each of the 62 parameters used in the METAFONT source files. As with MFPIC, the first pass of LaTeX creates the METAFONT source file (substituting the font dummy for the user-defined font), the METAFONT source file is processed by METAFONT, and at the second pass the generated font is used to format the document properly.

The user can vary the font shape smoothly between the CMR, CMBX, CMSL, CMSS, CMTT and CMFF font families, specify the weight, width, height and contrast of the output font independently, and in addition he/she can play with the character characteristics so that the resulting output does not look like the canonical Computer Modern typefaces at all. Although the package was originally created for internal purposes, to facilitate the investigation of the possibilities hidden inside the Computer Modern source code, it can be useful for professional typographic purposes too.

Font series with the arbitrary design size

Suppose that there is a smooth approximation for Computer Modern Roman (CMR) which enables one to calculate the METAFONT font parameters for an arbitrary design size even with TeX's weak arithmetical capabilities. The desired font size is specified as the input parameter, all the internal calculations of the font parameters are performed by TeX, and the result is a METAFONT ready-to-run font header file for a new font. When this new font header file is processed by METAFONT, it can


be used in TeX documents like other generic TeX fonts.

This operation is performed by the command

\FontMFF[fntscaled]{fntcmd}{filename}{size}

The command fntcmd will switch to the desired font, as is done by the LaTeX commands \bf, \sl, \sf, etc. The file filename.mf will contain the METAFONT source for the new font (please, never use a font name which is already used in CM or DC or some other standard fonts!). The parameter size specifies the design size of the new font, and the optional parameter fntscaled specifies the additional scaling of this font in the LaTeX document. Using the commands described in the following sections the user can vary the shape of the font characters over a wide range.

[Figure 1: CMSS series. Sample glyphs (A B C a b c, Q R S q r s) in rows labelled 0, 1/3, 2/3 and 1.]

It is interesting to compare the possibilities of this simplest form of parametrization of CMR fonts with the PostScript vector fonts. The nearly proportional change of the font dimensions with respect to the magnification parameter is the analog of the linear scaling of PostScript fonts. The non-linear dependence of the inter-character spacing on the font size imitates the tracking mechanism implemented in PostScript fonts (which is not taken into account in most cases by text processors). The fact that the ratio height/width is a non-constant (and non-linear) function of the font size is a serious advantage of these pseudo-CMR fonts in comparison with linearly scaled PostScript fonts, since it enables one to make the font proportions more suitable for the human eye (it is well known that for good recognition by the eye, small letters should be more expanded and have greater inter-character spacing).

There are at least two ready-to-use font approximations available from CTAN. The first one is the METAFONT sauter font package [2]. It uses smooth functions composed from constant, linear and quadratic pieces which are constructed so that for the canonical font sizes they produce nearly the same *.mf files as the ones used by the original Computer Modern typefaces. Although the latest version of sauter is dated 1992, and in 1995 the parameters of the Computer Modern fonts were again slightly changed, it still seems to be the most reliable source of fonts with intermediate design sizes.

The other approximation is realized in the dc and tc fonts by Jörg Knappen and Norbert Schwarz [3]. It is based on cubic splines (Lagrange cubic splines or canonical cubic splines) using the parameters of the Computer Modern typefaces as the base points. Although piecewise-cubic functions generally produce good quality approximations, this is not so with the data extracted from the Computer Modern METAFONT files. The plots of the parameters with respect to the design size are "noisy" functions with some abrupt jumps, since these parameters were selected manually to optimize the font shape, not the mathematical plots. As a result the smooth cubic approximations have parasitic local minima and maxima and do not work far outside the range of design sizes used as the base data points. The dc and tc fonts with intermediate font sizes are visually good even with these "mathematical" defects, but for some specific reasons these defects could work badly when implemented in mff.sty.

The first version of mff.sty was based on piecewise-linear and piecewise-cubic (Lagrange spline) functions using the Computer Modern typefaces as the reference data. To eliminate the parasitic local minima and maxima, some data points were slightly changed, and new data points were added to guarantee good behaviour of the approximating expressions outside the range 5pt – 17.28pt. The current version of mff.sty is based on the sauter-type approximation with some modifications (especially for the cmff and cmfib fonts).² As an option, the piecewise-linear and piecewise-cubic approximations based on the dc data could be included in further versions of mff.sty, while the modified Computer Modern data used in the intermediate versions have become more or less obsolete.

² The reason to use sauter was the following: it is not a good idea to modify the original Computer Modern parameters at will.
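As a hedged illustration of the piecewise-linear scheme mentioned above, the value of one font parameter at an intermediate design size can be written as follows. The symbols s_i (tabulated design sizes) and p_i (the parameter value at size s_i) are notation introduced here, and the numbers in the worked line are purely illustrative; they are not taken from mff.sty or from the Computer Modern parameter files.

```latex
% Piecewise-linear interpolation of a single parameter p between two
% neighbouring tabulated design sizes s_i <= s <= s_{i+1}.
\[
  p(s) \;=\; p_i + \frac{s - s_i}{s_{i+1} - s_i}\,\bigl(p_{i+1} - p_i\bigr),
  \qquad s_i \le s \le s_{i+1}.
\]
% Worked example with hypothetical data: s_i = 10, s_{i+1} = 12,
% p_i = 20, p_{i+1} = 23, and the requested size s = 11.
\[
  p(11) \;=\; 20 + \frac{11 - 10}{12 - 10}\,(23 - 20) \;=\; 20 + 0.5 \cdot 3 \;=\; 21.5 .
\]
```

Between two base points such an interpolant cannot leave the interval spanned by p_i and p_{i+1}, which is one way to see why a linear scheme avoids the parasitic local extrema described for the cubic approximations.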


[Figure 2: CMTT series. Sample glyphs (A B C a b c, Q R S q r s) in rows labelled 0, 1/3, 2/3 and 1.]

[Figure 3: Font Modifications. Sample glyphs (A B C I J K, a b c i j k) for fonts I, II and III.]

Mixture of independent fonts

The Computer Modern fonts roman, bold, slanted, sans serif, typewriter, funny, dunhill, and quotation use the same METAFONT programs but with different values for the font parameters. For each font series it is possible to construct smooth approximations (similar to the CMR font approximation mentioned in the previous section) which enable creation of a METAFONT header file with an arbitrary design size.

Suppose that there are such smooth approximations, and that it is possible to calculate for some fixed design size the parameters of legal CMR, CMBX, CMTT and CMSS fonts. All these fonts are generated by METAFONT without errors, and it can be supposed that the weighted sum of the parameters corresponding to these fonts can also be processed by METAFONT without error messages; at least we can expect it with good probability.

The CMTT fonts have nearly rectangular serifs and nearly no contrast between thin and thick lines, and are a little compressed in the vertical direction. The CMSS fonts have no serifs at all, their width is less than that of the CMR fonts, and, although they also have no contrast, the thickness of their lines is greater as compared with CMTT. Other fonts have their own specific features, but in spite of this fact they can be "added" together, at least mathematically. The resulting font is no longer CMR, CMBX, CMTT, etc., but is something intermediate with a unique shape.

The CMBX fonts can be subdivided (to some extent artificially) into two independent font series: one for "boldness" (i.e., weight) and one for "extension" (i.e., width). This makes the total scheme more closely related to the NFSS realized in LaTeX2ε. So in mff.sty the CMBX font sequence is decomposed into CMB′ (fonts which are as bold as CMBX and as wide as CMR, but which are different from the standard CMB10 font) and CMX (fonts which are as wide as CMBX and as bold as CMR).

The mixture of fonts is performed by the command \MFFcompose{α1}{α2}...{α6}, where α1 – α6 are some numerical values. The value α1 corresponds to CMB, α2 to CMX, α3 to CMSS (sans serif font), α4 to CMTT (typewriter font), α5 to CMFIB ("Fibonacci" font), and α6 to CMFF (funny font). If some parameter has the numerical value $p_{\mathrm{cmr}}$ for the CMR font, $p_{\mathrm{cmb}}$ for the CMB font with the same design size, etc., the mixture value is calculated as

\[
\begin{aligned}
p^{*} = p_{\mathrm{cmr}}
  &+ \alpha_1 (p_{\mathrm{cmb}} - p_{\mathrm{cmr}}) + \alpha_2 (p_{\mathrm{cmx}} - p_{\mathrm{cmr}}) \\
  &+ \alpha_3 (p_{\mathrm{cmss}} - p_{\mathrm{cmr}}) + \alpha_4 (p_{\mathrm{cmtt}} - p_{\mathrm{cmr}}) \\
  &+ \alpha_5 (p_{\mathrm{cmfib}} - p_{\mathrm{cmr}}) + \alpha_6 (p_{\mathrm{cmff}} - p_{\mathrm{cmr}})
\end{aligned}
\]

For example, with α3 = 1/3 and all other coefficients equal to zero, every parameter moves one third of the way from its CMR value towards its CMSS value. The formula also enables one to make the font "less bold" than CMR, or "more bold" and "more extended" than CMBX, by assigning values which are less than 0 or greater than 1, and to create "mutant" combinations of nearly incompatible font families. In Fig. 1 the result of a mixture of cmr10 and cmss10 is shown, and in Fig. 2 a similar series is constructed for cmr10 and cmtt10.

Manual font modification

In addition to the weighted mixture of font ingredients it is possible to vary some font parameters which are responsible for specific effects.

The inclination of the characters depends on the single font parameter slant#, and its value can be set explicitly as a ratio ∆y/∆x or as the inclination angle (specified in degrees).

The width of the CMR font can be varied by mixing with the CMX font, but it also can be


scaled explicitly by some factor specified by the user. Although this scaling skips some fine tuning of the font parameters, it can be advantageous to use it in some situations. For example, if we deal with CMTT fonts or CMSS fonts, mixing with CMX also results in some variation of the character shapes, which can have an undesirable effect.

Similarly, the weight, i.e., the "boldness" of the characters, which can be controlled by mixing with CMB, can also be defined as a direct scaling of the font parameters which are responsible for this effect. The weight of "thin lines" and "thick lines" can be scaled independently, and in addition the user can specify the contrast explicitly, i.e., the ratio between thick and thin strokes of the characters.

The height of the vertical elements of the characters can be varied by the user with greater flexibility. That is, it is possible to scale independently

• the general height and the depth of the characters;
• the height of the capital characters, brackets, digits, etc., and the ascenders of characters like 'b', 't';
• the depth of commas and the descenders of characters like 'Q', 'y';
• the height of digits and the position of the horizontal bar for mathematical signs like +, −.

If several height factors are specified, their effect is combined. For example, the font cmdunh10 can be derived exactly from cmr10 by proper specification of all these factors.

Finally, special scaling factors can be used to produce special effects:

• scale the fine connection between thin and thick lines in 'h', 'm', 'n';
• scale the thickness of sharp corners in letters 'A', 'V', 'w';
• scale the dots in 'i', ':' and the bulbs in 'a', 'c';
• scale the curvature of the serif footnotes;

and logical flags can control the level of ligatures and switch on/off square dots, sans serif mode, monospace mode, etc.

Fig. 3 demonstrates examples of such modifications:

• font (I) is the standard cmr10 scaled by a factor of 1.8;
• font (II) differs from cmr10 by scaling the width by 0.8, the height by 1.2 and the width of bold lines by 1.5;
• font (III) differs from cmr10 by scaling the width by 1.2, the height by 0.8, the ascenders and descenders by 1.25, the width of bold lines by 0.8 and the width of thin lines by 2.4.

All the commands which perform these modifications are described in detail in the mff.sty manual.

Automatic check of font parameters

Each specification of mff.sty parameters produces a unique font which belongs to the CMR/MF (Computer Modern Roman Master Font) font family, with a unique name specified by the user. Not all of these fonts are pleasant, and not every variation of the parameters results in a font which is well distinguished from the others. But at least it is an interesting toy for font maniacs.

There is a list of mutual relations between font parameters which are assumed implicitly in the METAFONT programs for the Computer Modern typefaces [1]. Although in reality most METAFONT source files violate these conditions, it is safer if the font parameters calculated by mff.sty satisfy them. The command \MFFcheck sets the mode in which these conditions are checked and the variable values are corrected if necessary. Nevertheless, several interesting effects can be achieved only without such automatic correction. The mode of automatic checking can be switched off by the command \MFFcheck.

Switch to other font classes

The LaTeX2ε NFSS classifies TeX font families in a way which is different from the logical structure of the METAFONT programs. That is, the italic and small caps fonts are in the same family roman, together with the bold and slanted fonts, although they are produced by different driver files. Similarly, roman, sans serif and typewriter fonts are different families, while they are generated by the same METAFONT script and their parameters can be varied so that one family is smoothly converted to another family.

In mff.sty there is no sharp boundary between roman, bold, slanted, typewriter, sans serif, quotation, funny and dunhill fonts: each font is smoothly converted to another one, while the italic and small caps fonts are quite different. The mff.sty macros assign different classes to these fonts to distinguish such differences from the font families used in NFSS. The following font classes can be used:

CMR     Computer Modern Roman;
CMTI    Computer Modern Text Italic;
CMCSC   Computer Modern Small Caps;


CMRZ    CMZ Computer Modern Roman/Cyrillic by N. Glonti and A. Samarin;
CMRIZ   CMZ Computer Modern Text Italic/Cyrillic;
CMCCSC  CMZ Computer Modern Small Caps/Cyrillic;
LHR     LH Computer Modern Roman/Cyrillic by O. Lapko and A. Khodulev;
LHTI    LH Computer Modern Text Italic/Cyrillic;
LHCSC   LH Computer Modern Small Caps/Cyrillic.

The interface for DC fonts [3] is under development.

The set of font classes can be extended easily when the METAFONT program is based on the same set of parameters as the Computer Modern fonts: the only thing to do is to specify the macro which writes the font identifier value and the operator generate with the corresponding file name.

Acknowledgements

The authors would like to express their warmest thanks to Kees van der Laan for the organization and realization of the Euro-Bus project, which enabled the Russian delegation to take part in EuroTeX-95, and to Ph. Taylor for his activity in breaking down the barriers between West and East. This research was partially supported by a grant from the Dutch Organization for Scientific Research (NWO grant No 07-30-007).

References

[1] Donald E. Knuth. Computer Modern Typefaces (Computers & Typesetting series). Addison-Wesley, 1986.
[2] John Sauter. Building Computer Modern fonts. TUGboat, 7 (1986), pp. 151–152.
[3] Jörg Knappen. The release 1.2 of the Cork encoded DC fonts and the text companion symbol fonts. Proceedings of the 9th EuroTeX Conference, Arnhem, 1995.
[4] A. Khodulev and I. Mahovaya. On TeX experience in MIR Publishers. Proceedings of the 7th EuroTeX Conference, Prague, 1992.
[5] O. Lapko. MAKEFONT as a part of CyrTUG–EmTeX package. Proceedings of the 8th EuroTeX Conference, Gdańsk, 1994.

TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting 119 VFComb 1.3 — the program which simplifies the virtual font management

A.S. Berdnikov Institute of Analytical Instrumentation Rizsky pr. 26, 198103 St.Petersburg, Russia [email protected]

S.B. Turtia Institute of Analytical Instrumentation Rizsky pr. 26, 198103 St.Petersburg, Russia [email protected]

The MS DOS program VFComb simplifies the design of virtual fonts [1].¹ Its main purpose was to facilitate the integration of CM fonts with the Cyrillic LL fonts created by O. Lapko and A. Khodulev [2, 3], but it can be used for other applications too. It takes as input .tfm files (converted to ASCII form by TFtoPL) and ASCII data files created by the user, and produces a .vpl file as output (the .vpl file can be converted later to a virtual font using VPtoVF). The characteristic feature of the program is that it can assemble the ligature tables and metric information from various fonts and combine them with user-defined metric information and ligature/kerning data. VFComb supports the full syntax of .pl and .vpl files as defined by D.E. Knuth and adds new commands, such as symbolic variables and conditional operators, which simplify the creation and debugging of virtual fonts.

¹ It is assumed that you are familiar with virtual fonts. If not, it is highly recommended that you read [1] before proceeding further.

The previous version 1.2 (the first version distributed far outside the home computer) is described in [4]. That version has only a Russian manual, which prevents its wide distribution in the TeX community. The current version has an English manual, and apart from this “new feature” many additional improvements have been added. The main features of the program are described below, while the complete information can be found in the manual.

The program will be put on the CTAN archives following TUG’96 together with the source code and will be available to the TeX community on a freeware basis. Since this MS DOS program is written in Borland Pascal and uses some specific features of this language, it is hardly portable to any other platform “as is”, but it is not too difficult to transfer it to portable ANSI C (volunteers are welcome).

Virtual fonts for TeX formats with national alphabets

Everything which can be done by VFComb could also be done by explicit use of PL and VPL file syntax (and everything which can be done manually in .pl and .vpl files can be done with VFComb), but some typical operations with virtual fonts are easier to perform with its help than by manual editing of .pl and .vpl files. The typical problem of this type is the adaptation of standard TeX formats to national alphabets. This problem is especially important for Cyrillic alphabets, since most Cyrillic letters cannot be created as a combination of Latin (English) letters with some accents.

The standard solution of this problem is to combine the English part taken from the Computer Modern family with national fonts which extend the Computer Modern family and which contain the national symbols in the upper part of the ASCII table (codes 128–255). The best way to do it is to create a virtual font whose lower part refers to the original CM fonts and whose upper part refers to the national fonts; this is just the way recommended by D. Knuth [1]. The advantage of this approach is that changes in the CM fonts and in the national fonts can be kept separate. In addition, it saves disk space, since it is not necessary to keep two copies of each Computer Modern character: one in the original CM font, which is necessary for the original TeX formats, and the second one as the lower part of the combined national font.

The combination of the lower part of one font and the upper part of another font, or even the joining of all the characters from one font and all


the characters from another font (provided that no character code is encountered twice) can be done by VFComb using a few commands. If some characters are to be discarded from the font or moved to different positions of the ASCII table, this does not make the command script more complicated. The resulting virtual font contains the proper metric information for each character, borrowed from the source metric information, and the correct mapping of the characters into the individual real fonts.

A similar operation can also be performed by the program TFMerge (IHEP TeXware, Protvino), but VFComb performs additional operations. That is, in addition to joining metric and ligature/kerning information from each font into one virtual font, it is necessary to add cross-ligature and cross-kerning information for the pairs of characters taken from different fonts. VFComb enables one to add metric, ligature and kerning data taken from its script file (which makes the original script a little more complicated). The important feature is that this additional data can contain variables and logical structures, so that one can generate the whole CM family of virtual typefaces with national characters using one and the same pseudo-program written in the VFComb command language.

Besides the operations described above, VFComb is capable of performing the following operations if the user requests them in the script:

• discard the ligature tables of some real fonts;
• include in the virtual font the full ligature table of the real font;
• include in the virtual font only those characters which are declared explicitly in user-defined data, and discard the elements of the ligature tables which correspond to non-included characters of the real font;
• automatically add to the virtual font the characters which are not included explicitly by the user but which are joined with the already included characters through ligature table data, or by the specifications NEXTLARGER and VARCHAR.

These features allow creation of the desired virtual fonts for national alphabets with less effort and with more reliability than by manual manipulation of .pl and .vpl files.

Virtual fonts for colored printing

Another problem is the application of virtual fonts to multicolored printing. Suppose that it is necessary to print text where different characters have different colors. From the TeX compiler's point of view this means that the characters with different colors are assigned to different fonts, and it is a task for the dvi driver to decide how to print these fonts in the desired colors.

A colored print is assembled from overlapped sheets, where each sheet of text or graphics is printed in an individual monocolor pass. To make the templates for monocolor printing it is necessary to organize the output of the .dvi file so that in one pass only yellow characters are printed, in another pass only blue characters are printed, etc., while the characters which have the green color are printed twice, in blue as well as in yellow. The easiest way to teach the dvi driver how to do this is to create different subdirectories with virtual fonts, one subdirectory for each elementary color. A virtual font file placed in the subdirectory for yellow printing (which corresponds to the yellow fonts) refers to the actual *.pk files if and only if the yellow color is assigned to the character; otherwise it refers to an empty character. The subdirectories for the other colors are organized similarly. When the yellow printing is performed, the dvi driver is configured so that it takes the virtual fonts from the “yellow” subdirectory, and for the output in other colors the dvi driver is reconfigured correspondingly.

If the empty characters are simply mapped into a dummy font, the dvi driver behaves incorrectly: the characters in the dummy font have zero size, and this means that the next character after the empty character is shifted to the left (as compared with the desired behaviour) by a distance equal to the width of the skipped character. To prevent this effect it is necessary to insert into the virtual font explicit dvi commands which move the current output position to the right by the width of the skipped character. VFComb performs this operation with a single command: the user assigns the attribute NULLCHARACTER to the corresponding real font, and for the characters of this font the empty mapping is performed instead of the mapping to the real font characters.

Substitution of CM fonts instead of PostScript fonts for DVI Viewers

The next problem where the use of virtual fonts is advantageous is the visualization of a document which was compiled using PostScript fonts. Generally, the screen viewer cannot process the PostScript characters, and it is necessary to remap the PostScript font characters into some .pk font which can be displayed by the viewer, say, some typefaces from the Computer Modern family.
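The two operations described above (taking codes 0–127 from one real font and codes 128–255 from another, and replacing the characters of a NULLCHARACTER font by a pure horizontal move) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VFComb's algorithm: the Glyph record, the function names, the tuple form of the commands and all width values are hypothetical. SELECTFONT, SETCHAR and MOVERIGHT are named after the corresponding VPL MAP commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    font: str     # real font that supplies the character
    code: int     # character code inside that real font
    width: float  # advance width, in design-size units


def combine_halves(low_font, low_widths, high_font, high_widths):
    """Build a virtual font: codes 0-127 from low_font, 128-255 from high_font."""
    virtual = {}
    for code, width in low_widths.items():
        if code < 128:
            virtual[code] = Glyph(low_font, code, width)
    for code, width in high_widths.items():
        if 128 <= code <= 255:
            virtual[code] = Glyph(high_font, code, width)
    return virtual


def map_commands(glyph, null_fonts):
    """Return MAP-style commands for one virtual character.

    Characters of a font listed in null_fonts are not printed; instead the
    output position moves right by their width, so that the characters
    which follow stay where they would be in the full print.
    """
    if glyph.font in null_fonts:
        return [("MOVERIGHT", glyph.width)]
    return [("SELECTFONT", glyph.font), ("SETCHAR", glyph.code)]


# Hypothetical widths (not real cmr10/llr10 metrics).
cm = {65: 0.750, 66: 0.708, 200: 0.9}  # code 200 is skipped: not in 0-127
ll = {65: 0.7, 131: 0.625}             # code 65 is skipped: not in 128-255
vf = combine_halves("cmr10", cm, "llr10", ll)

# A pass in which the Cyrillic font must not be printed.
for code in sorted(vf):
    print(code, map_commands(vf[code], null_fonts={"llr10"}))
```

The same width-preserving move is what keeps a screen preview aligned when one font is displayed in place of another with different widths; only the size of the correction changes.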


Such remapping can be performed using the virtual font mechanism, but if it is done without special precautions the screen view can be far from the printed output. The reason is that the CM characters have widths different from those of the PostScript font (the fact that they have a different graphical image is not so essential). As in the previous case, the screen output will be shifted to the left by the difference between the width of the PostScript character and that of the CM character if no special precautions are taken. To produce correct output, it is necessary to add to the virtual font explicit dvi commands which correct the current output position.

To make the corresponding virtual font automatically, VFComb allows the user to specify for the real fonts two .pl files with metric information: the first one is for the nominal characters which are used by TeX to compile the .dvi file (in our case it is the PostScript .afm file converted to .tfm format), and the second one is for the real characters which are used when the .dvi file is displayed (or printed). If such information is specified by the user, the commands which properly correct the current output position are inserted into the virtual font.

This operation works if both fonts have the same coding scheme, namely, if the characters used by TeX and the characters used by the dvi viewer have the same code values. If not, remapping inside an already-mapped font is required, and this could result in a very complicated scheme of virtual font generation. To solve this problem, it is assumed that the correct metric information for the “true” font (i.e., for the font used in the compilation of the .dvi file) is already available. Special operators in VFComb enable the user to load this information and to correct the character widths accordingly.

Other features

The other features incorporated in VFComb 1.3 are:

• improved syntax for VFComb commands;
• improved logical operators;
• implementation of string variables in VFComb script files;
• specification of variable values on the command line among other parameters;
• specification of the names of the real and virtual fonts inside VFComb scripts instead of on the command line;
• correction of some bugs, including the obligatory conversion to uppercase of all input characters together with the font names;
• automatic computation of the metric information for characters composed from user-defined dvi commands.

All of these improvements make the program easier to use. For example, after implementation of the new features, the generation of the virtual fonts for the LL/LH family (used in the CyrTUG Cyrillic version of TeX/LaTeX/AMS-TeX) is performed using just one VFComb script file instead of multiple header files (see the example below).

Comparison with FontInst

There is another package for manipulating virtual fonts, namely FontInst by Alan Jeffrey. Although FontInst and VFComb have many similar features, these tools are designed to solve different tasks.

The main purpose of FontInst is to create new font families for LaTeX2ε using existing PostScript fonts. FontInst contains high-level operators which enable one to perform this task in several commands, and it is written entirely in TeX, which guarantees its high portability. In addition, FontInst contains special TeX macro commands which enable the user to do nearly anything with virtual fonts without direct editing of .vpl files; but in this case the “program” written in FontInst commands may be comparable in length to the .vpl file.

VFComb is designed to solve different problems: it was designed mainly to simplify the integration of new alphabets into TeX so that the Latin part of the Computer Modern typefaces is left unchanged. It cannot read .afm files and it cannot create .fd files. If you wish to use it to create virtual fonts for PostScript, you can do so, but you need some additional utilities (afm2tfm) and many more manual operations than you need with FontInst. Similarly, you can make virtual fonts like those created by VFComb using FontInst as well, but this will require manual manipulations which are comparable to the direct editing of .vpl files. Although one can expect more convergence between FontInst and VFComb in the future, currently these programs do not intersect in this respect. If you wish to create virtual fonts for font families derived from PostScript, please use FontInst, and you will save a lot of time. If you wish to make virtual fonts which solve any of the problems described in this paper, VFComb may be more suitable.

Example

The syntax of VFComb commands is similar to that of .vpl files. It means that it is more suitable


for computers than for people. The following example, extracted from the file lhfonts.tbf used to generate the full family of Cyrillic virtual fonts, demonstrates the VFComb syntax.

Suppose that the program VFComb is called as

    vfcomb lhfonts.tbf font=lhr10

which means that the script file lhfonts.tbf is used as a source of VFComb commands, and that the user asks to generate the virtual font lhr10, which is a combination of the Latin part taken from cmr10 and the Cyrillic part taken from llr10. The fact that the virtual font has the name lhr10 and that it is composed from cmr10 and llr10 is described inside the script file lhfonts.tbf. Actually the command line shown above specifies only that the string variable font is defined with the initial value lhr10 before the script file lhfonts.tbf is processed.

The head of the file lhfonts.tbf contains the commands

    (IF-DEF font)
    (VARIABLE (STRING FONT @V font))
    (ELSE)
    (MESSAGE Variable FONT is not defined)
    (HALTPGM)
    (ENDIF)

These commands check that the user did not forget to specify the variable font=... on the command line, and assign its value to the string variable FONT (uppercase and lowercase letters are distinguished in variable names, and the prefix @V means the value of the string variable).

Similarly, the commands

    (IF-DEF type)
    (VARIABLE (STRING TYPE @V type))
    (ELSE)
    (VARIABLE (STRING TYPE NEW))
    (MESSAGE TYPE = NEW is assumed)
    (ENDIF)

analyze the contents of the variable type if it is specified on the command line, and assign the default value NEW if there is no expression type=... on the command line.

The commands

    (VTITLE CyrTUG freeware LH font family)
    (OUTPUT @V FONT)
    (HEADER (FONT D 1))
    (MAPFONT D 0 (LOWPART))
    (MAPFONT D 1 (HIGHPART))

specify the title of the virtual font and the name of the output virtual font. The header of the virtual font is similar to that of the font D 1 if there are no other commands which modify the contents of the header. The MAPFONT commands state that two fonts are used to create the virtual font, where the characters 0–127 are taken from one font and the characters 128–255 are taken from the other font.

The actual names of the real fonts ‘0’ and ‘1’ are specified after the following analysis of the contents of the variable FONT:

    (IFS-CASE @F @V FONT)
    (CASE LHR)
    (MAPFONT D 0 (FONTNAME @+ CMR @P @V FONT))
    (MAPFONT D 1 (FONTNAME @+ LLR @P @V FONT))
    (HEADER (FAMILY LHR))
    (VARIABLE (BYTE FLKERN 0))
    (BREAK)
    (CASE LHTI)
    (MAPFONT D 0 (FONTNAME @+ CMTI @P @V FONT))
    (MAPFONT D 1 (FONTNAME @+ LLTI @P @V FONT))
    (HEADER (FAMILY LHTI))
    (VARIABLE (BYTE FLKERN 1))
    (BREAK)
    ......
    (ELSE)
    (MESSAGE Unknown font family @F @V FONT)
    (HALTPGM)
    (ENDIF)

Here the prefix commands @F @V FONT extract the non-digital component of the font name, which is analysed by the CASE operators (note that the comparison of text strings by the operator IFS-CASE does not distinguish between uppercase and lowercase letters). If the non-digital font component is equal to LHR (as it is in our case for font=lhr10), the first font gets the name CMR..., and the second font gets the name LLR..., where the dots are replaced by the font design size: the prefix @P @V FONT extracts the value ‘10’ from lhr10 (just as @F @V FONT extracts ‘lhr’ from the same value), and the prefix @+ performs the concatenation of two strings. In the same command block, the header field FAMILY gets the value LHR, and the byte variable FLKERN gets the value 0 (it is used later to construct the additional entries of the ligature table which correspond to the ligature/kerning data for characters taken from different real fonts). The font names LHTI... are analyzed similarly, and analogous commands (skipped here) are specified for every legal font family.

The following commands specify the symbolic names for some Cyrillic letters which are used later in the ligature tables. Note that the coding scheme specified here depends on the value of the expression type=... given on the command line:

    (IFS-EQ @V TYPE OLD)
    (VARIABLE
    (BYTE CYR_open_quote D 243)
    (BYTE CYR_close_quote D 244)


    (BYTE CYR_Number D 245)
    )
    (ELSE) (COMMENT TYPE = NEW)
    (VARIABLE
    (BYTE CYR_open_quote D 250)
    (BYTE CYR_close_quote D 251)
    (BYTE CYR_Number D 252)
    )
    (ENDIF)
    (VARIABLE
    (BYTE CYR_GHE D 131)
    (BYTE CYR_ghe D 163)
    (BYTE CYR_ER D 144)
    (BYTE CYR_er D 224)
    ......
    )

The following commands analyze the value of the variable FONT and assign values to the real variable u# and to the logical variable monospace, which are used to construct the ligature data for the pairs of characters taken from different fonts:

    (IFS-CASE @V FONT)
    (CASE LHR5)
    (COMMENT Font LHR5)
    (VAR
    (REAL u# A/ A/ R 12.5 R 36 R 5)
    (BYTE monospace D 0)
    )
    (BREAK)
    (CASE LHR6)
    (COMMENT Font LHR6)
    (VAR
    (REAL u# A/ A/ R 14 R 36 R 6)
    (BYTE monospace D 0)
    )
    (BREAK)
    ...
    (ENDIF)

The arithmetic expressions in these commands are constructed using the “Polish notation” structure. That is, the prefix A/ is the division of the two arguments which follow it (A+ is addition, A- is subtraction, A* is multiplication and R is a real number), and each of these arguments can itself be an arithmetic expression starting with an arithmetic prefix. For example, the value of the variable u# for the font LHR5 is equal to A/ A/ R 12.5 R 36 R 5 = ((12.5/36)/5) = 0.069444444444.

Finally, the commands which specify the additional ligature data for the pairs of characters taken from different fonts are added:

    (IF-EQ V monospace D 0)
    (IF-CASE V FLKERN)
    (CASE D 0) (COMMENT KERN Roman)
    (VARIABLE
    (REAL k# A* R -0.5 V u#)
    (REAL kk# A* R -1.5 V u#)
    (REAL kkk# A* R -2.0 V u#)
    ......
    )
    (LIGTABLE
    (LABEL V CYR_GHE)
    (KRN C . V kk#)
    (KRN C , V kk#)
    (KRN C : V kk#)
    (KRN C ; V kk#)
    (STOP)
    )
    (LIGTABLE
    (LABEL V CYR_ER)
    (KRN C . V kk#)
    )
    .....
    (BREAK)
    (CASE D 1) (COMMENT KERN Italic)
    ......
    (ENDIF)
    (ENDIF)

The syntax of these commands is similar to that of .vpl files, except that variable values are used (some variables are calculated from the variable u# defined above) and the logical operators decide what data should be included depending on the current values of the variables FLKERN and monospace.

Acknowledgements

All new improvements of VFComb (except the English manual) are the result of the contacts and discussions which were held during the EuroTeX-95 meeting. So we would like to thank Kees van der Laan for his giant efforts to organise the visit to EuroTeX-95 by the delegation from Russia and for his patient attention to his Russian colleagues before, during and after EuroTeX-95.

It is not so easy to recall all the participants of this conference whose opinions made an impact on the preparation of the new version of VFComb. Among other persons we would like to thank Phil Taylor and S. Znamensky for their valuable suggestions which led to improvements in the program. We would also like to thank O.A. Lapko, S.A. Strelkov and I.A. Makhovaya for their efforts spent on the Cyrillic TeX project, which actually inspired our work.

This research was partially supported by a grant from the Dutch Organization for Scientific Research (NWO grant No 07-30-007).

References

[1] D. Knuth. “Virtual Fonts: More Fun for Grand Wizards”, TUGboat 11 (1990), No. 1, pp. 13–23.
[2] A. Khodulev, I. Mahovaya. “On TeX experience of MIR Publishers”, Proceedings of the 7th EuroTeX Conference, Prague, 1992.
[3] O. Lapko. “MAKEFONT as a part of CyrTUG–EmTeX package”, Proceedings of the 8th EuroTeX Conference, Gdańsk, 1994.
[4] A.S. Berdnikov, S.B. Turtia. “VFComb, a program for design of virtual fonts”, Proceedings of the 9th EuroTeX Conference, Arnhem, 1995.
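As a supplement to the example section above, two details of the script can be illustrated with a short Python sketch: the prefix (“Polish notation”) arithmetic built from A+, A-, A*, A/, R and V, and the splitting of a font name such as lhr10 into its non-digit part and its design size, as done by the prefixes @F and @P. The sketch is an assumption-laden illustration and not VFComb's parser: the function names eval_prefix and split_font_name are hypothetical, the variable table is a plain dictionary, and the exact rules VFComb applies to unusual font names are not known from the text.

```python
import operator
import re

# Arithmetic prefixes named in the text; each takes the two operands that follow.
OPS = {"A+": operator.add, "A-": operator.sub,
       "A*": operator.mul, "A/": operator.truediv}


def eval_prefix(tokens, variables=None):
    """Evaluate a prefix expression such as 'A/ A/ R 12.5 R 36 R 5'."""
    variables = variables or {}

    def parse(pos):
        token = tokens[pos]
        if token == "R":       # a real constant follows
            return float(tokens[pos + 1]), pos + 2
        if token == "V":       # a variable name follows
            return variables[tokens[pos + 1]], pos + 2
        if token in OPS:       # operator applied to the next two operands
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return OPS[token](left, right), pos
        raise ValueError(f"unexpected token {token!r}")

    value, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after expression")
    return value


def split_font_name(name):
    """Split 'lhr10' into its non-digit part and its design size."""
    match = re.fullmatch(r"(\D+)(\d+)", name)
    if match is None:
        raise ValueError(f"cannot split font name {name!r}")
    return match.group(1), match.group(2)


u = eval_prefix("A/ A/ R 12.5 R 36 R 5".split())       # (12.5 / 36) / 5
kk = eval_prefix("A* R -1.5 V u#".split(), {"u#": u})  # -1.5 * u#
family, size = split_font_name("lhr10")
print(u, kk, "CMR" + size, "LLR" + size)
```

A recursive-descent reader fits this notation because every operator has exactly two operands, so no parentheses or precedence rules are needed.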

TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting 125 ΩTimes and ΩHelvetica Fonts Under Development: Step One

Yannis Haralambous Atelier Fluxus Virus, 187, rue Nationale, F-59800 Lille, France [email protected]

John Plaice D´epartement d’informatique, Universit´e Laval, Ste-Foy (Qu´ebec) Canada G1K 7P4 [email protected]

The Truth Is Out There
Chris Carter, The X-Files (1993)

Introduction

ΩTimes and ΩHelvetica will be public domain virtual Times- and Helvetica-like fonts based upon real PostScript fonts, which we call “Glyph Containers”. They will contain all necessary characters for typesetting efficiently (that is, with TeX quality) in all languages and systems using the Latin, Greek, Cyrillic, Arabic, Hebrew and Tifinagh alphabets and their derivatives. All Unicode characters will be covered, although the set of glyphs of our Ω fonts will not be limited to these; after all, our goal is high-quality typography, which requires more glyphs than would be required for mere information interchange. Other alphabets will follow (the obvious first candidates are Coptic, Armenian and Georgian) as well as mathematical symbols, dingbats, etc.

Why PostScript instead of METAFONT? METAFONT is the ultimate tool for font development. Extending Computer Modern fonts to the Unicode encoding is still one of our goals. We have started developing a set of fonts we call “Unicode Computer Modern” fonts, using techniques such as Virtual METAFONT (virtual fonts are created directly by METAFONT using the gftotxt utility for text output). Nevertheless, we realized that writing METAFONT code for some thousands of characters (including code for typewriter style) is a tremendous task, which will take several years.

So we have decided to take a small break from METAFONTing, and to develop in a limited time period PostScript fonts that will cover a maximum number of languages and will give the TeX community a good reason to switch to Ω.

Why Times and Helvetica? First of all because, after Computer Modern, they are the most widely used fonts in the TeX community. Many journals and publishers request that their texts be typeset in Times; Helvetica (especially the bold series) is often used as a titling font. Like Computer Modern, Times is a very neutral font that can be used in a wide range of documents, ranging from poetry to technical documentation...

It would surely be more fun to prepare a Bembo- or Stempel Garamond-like font for the serifs part and a Gill Sans- or Univers-like one for the sans-serifs part; but these can hardly be used in the scientific/technical area, and that's perhaps where TeX (and hence potentially Ω) is used the most.

When will the Ω fonts be finished? The development of the ΩTimes and ΩHelvetica fonts is divided into four steps:

1. Drawing of PostScript outlines and packaging of Glyph Container fonts.
2. Development of virtual code, based on the real fonts of step one.
3. Kerning of virtual fonts.
4. Development of LaTeX code and Ω Translation Processes necessary for the use of these fonts.

For the time being (June 1996) we have done the biggest part of step one, and this is what we present in this paper. We hope to have finished with steps two, three and four before the next teTeX CD-ROM in December 1996.

We want your support! Please keep in mind:

    The choice and shapes of glyphs presented in this paper are only a first attempt. We need your feedback to improve them, so that you can use them efficiently.

In the tables we present only Times family and medium series fonts (except in the case of Tifinagh, which is also presented in the Helvetica family). Up-to-date tables of the remaining fonts can be consulted on our Ω WWW server

    http://www.ens.fr/omega


You can also retrieve the PostScript code from the same address, or from

    ftp://ftp.ens.fr/pub/tex/yannis/omega

General remarks on the fonts

To prevent confusion, the word “font” in this section is meant in the sense of the PostScript Type 1 font structure, and not of TeX text or math fonts: the fonts we describe in this paper will never be used directly for typesetting. Their raison d’être is to provide glyphs for the virtual Unicode+Typography Ω fonts which we will develop in steps two–four. Hence, there is no need to look in the tables for an ‘é’: this character will be assembled by the virtual font, using the glyphs of the letter ‘e’ and of the acute accent.

The same holds for the letter ‘c’ with cedilla: it can be assembled out of the two corresponding glyphs. However, this is not true for letters with ogonek: the shape of the ogonek changes when it gets attached to the letter; that is why you find letters with ogonek in the Glyph Containers and not letters with cedilla. In Fig. 1 the reader can see examples of letters in italic style, carrying an ogonek accent.

Figure 1: Italic-style letters with ogonek (Polish and Lithuanian)

In Fig. 2 the reader can find the general structure of the fonts:¹ on the left, the 16-bit Unicode+Typography virtual font; on the right, a certain number of Glyph Containers, that is, 8-bit PostScript fonts.

The reader will notice that a certain number of glyphs are repeated in the different Glyph Containers. This is because we want to minimize the number of Glyph Containers used for a single-alphabet text. For example, accents for all fonts are stored in Glyph Container “Common”; theoretically, to produce an acute-accented Latin letter and an acute-accented Cyrillic letter, one would use three Glyph Containers: one for the accent, and one for each alphabet. To avoid this, we store all accents relevant to a given alphabet in the alphabet's Glyph Container. The same method is used for shapes that are similar in the different alphabets: the Latin, Greek and Cyrillic alphabets share the letter ‘A’; the Latin, IPA and Cyrillic alphabets share the letter ‘a’.²

The “Common” Glyph Container

The “Common” Glyph Container, shown in Table 1, contains glyphs that will be used potentially in conjunction with all alphabets. These are described in the following subsections.

Punctuation, digits, editorial marks. Special care has been taken to distinguish between “typographical” punctuation and “typewriter/computer terminal-derived” punctuation: compare the typographical double quotes “ ” and the straight ‘ASCII’ ones ".

This table covers all Unicode punctuation marks from the ASCII and ISO 8859-1 tables as well as from the general punctuation table (0x2010–0x2046). We have not included a few punctuation marks specific to a single alphabet: Arabic asterisk, inverted comma and semicolon, Hebrew colon, Greek upper dot. These will be found in the corresponding Glyph Containers.

In Fig. 3 the reader can see how regular curly braces have been transformed into square brackets with quill, by simply reflecting the central part of the brace.

Commonly adopted derived symbols. Symbols like A use Latin alphabet letters but are used in many non-Latin-alphabet based languages. Symbol № is an even stranger example: although the glyph ‘N’ does not exist in Cyrillic, this symbol is used mainly in Cyrillic-alphabet languages.

¹ In the figure, the reader will notice the real font “Adobe Zapf Dingbats”. In fact, the glyphs of this font have become Unicode characters (0x2701–0x27be) and we see no reason to redraw them since this font is widely available. Hence the virtual 16-bit font will also point to the standard Adobe Zapf Dingbats font.

² We will use a totally different approach when dvips and Adobe Acrobat are able to use 16-bit PostScript fonts. Instead of having many ‘small’ PostScript fonts and one ‘big’ virtual font, we will use a single ‘big’ PostScript font, in Unicode+Typography encoding. In that font, accented letters and repeated identical shapes will be obtained by the PostScript font technique of composite characters. This will allow Acrobat users to select and copy Unicode-encoded text directly from the document window. Since the two techniques (16-bit + 8-bit real, vs. 16-bit composite real) will share the same .tfm metrics for characters, it will be possible to convert .dvi files from one format to the other, so that files obtained today by the first method will be “Acrobat-ready” later, whenever Acrobat switches to Unicode (hopefully soon!).


[Figure 2 diagram: the 16-bit ΩTimes virtual font (Unicode encoding plus characters needed for typography) points to the following 8-bit PostScript fonts: OmegaTimesCommon, OmegaTimesLatin, OmegaTimesIPA, OmegaTimesGreek, OmegaTimesCyrillic, OmegaTimesArabicOne, …, OmegaTimesHebrew, OmegaTimesTifinagh, …, and Adobe ZapfDingbats.]

Figure 2: General structure of the ΩTimes fonts (idem for ΩHelvetica)
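The structure shown in Fig. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the container names come from the figure, but the `Slot` type, the `VIRTUAL_FONT` excerpt, the two helper functions and every slot position are hypothetical and do not reproduce the real virtual font. The sketch shows a 16-bit code resolving either to a single predrawn glyph (a letter with ogonek) or to two glyphs taken from different Glyph Containers (‘c’ with cedilla).

```python
from typing import Dict, List, NamedTuple, Set


class Slot(NamedTuple):
    """One glyph in an 8-bit Glyph Container (hypothetical helper type)."""
    container: str  # name of the 8-bit PostScript font
    position: int   # 0..255 inside that font


# Hypothetical excerpt of the 16-bit virtual font.
# All positions are invented for illustration only.
VIRTUAL_FONT: Dict[int, List[Slot]] = {
    0x0063: [Slot("OmegaTimesLatin", 0x63)],      # c
    0x00E7: [Slot("OmegaTimesLatin", 0x63),       # c with cedilla:
             Slot("OmegaTimesCommon", 0xB8)],     #   letter + shared accent
    0x0105: [Slot("OmegaTimesLatin", 0x85)],      # a with ogonek: predrawn
    0x03B1: [Slot("OmegaTimesGreek", 0x61)],      # Greek alpha
}


def resolve(code: int) -> List[Slot]:
    """Return the 8-bit glyphs that make up one 16-bit character."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("the virtual font is 16-bit")
    if code not in VIRTUAL_FONT:
        raise KeyError(f"U+{code:04X} is not in this excerpt")
    return VIRTUAL_FONT[code]


def containers_needed(text: str) -> Set[str]:
    """Names of the Glyph Containers a piece of text would touch."""
    return {slot.container for ch in text for slot in resolve(ord(ch))}
```

A helper such as `containers_needed` makes it easy to see how many 8-bit fonts a given text would pull in, which is the quantity the authors say they want to keep small for single-alphabet text.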

Figure 3: The left and right square brackets with quill were drawn by reflecting the central part of regular curly braces

In Fig. 4 the reader can see our small tribute to the GNU foundation: the “copyleft” symbol. We hope that this character will soon be included in the letter-like symbols section of Unicode.

Q. Why not use the \reflectbox macro to reflect the “copyright” glyph into a “copyleft” one?

A. Because we want to treat “copyleft” as a separate character in the virtual font, which may be searched inside a .dvi or PDF file. In other cases, such as the reflected ‘B’ used in the ABBA logo and the cobar construction in Algebraic Topology, one can


[Figure 6 labels: Letter ‘O’, Letter ‘C’, Letter ‘e’]

Figure 6: How shapes ‘C’, ‘O’ and ‘e’ were used for the design of the latin and cyrillic capital letter shwa

Figure 4: The GNU “copyleft” symbol

Figure 5: The estimated symbol was drawn inside a perfect circle

easily use PostScript-manipulation macros without damage.

Finally, in Fig. 5 the reader can see how the estimated symbol fits inside a perfect circle (in red, for readers of a color version of this paper).

Currency symbols. Among currencies having proprietary symbols, there are strong and weaker ones. The strong ones have made it into ISO 8859-1 (you-know-who made it even into ASCII itself, and is used by a well-known typesetting language to enter math mode...), the remaining ones have found their home in the currency symbols section of Unicode. We have included them all (even the Thai currency symbol ฿, which looks suspiciously Latin) in the “Common” Glyph Container. Note that the symbol ₣ for French Franc is virtually unknown in France...

Diacritics. The zone 0xa0–0xe2 of the “Common” Glyph Container is dedicated to combining diacritics. These diacritics are supposed to be useful for more than one alphabet; whenever a diacritic belongs specifically to one alphabet, it has been included only in the corresponding Glyph Container (this is the case, for example, of Vietnamese double accents, Greek spirit+accent combinations, and Slavonic accents). Thanks to the diacritics in the “Common” Glyph Container we will be able to construct all Latin Extended Additional Unicode characters (0x1e00–0x1ef9) virtually, by combining them with letters from the “Latin” Glyph Container. This Unicode region covers Welsh, Vietnamese and transcriptions of Indic and other languages.

The “Latin” Glyph Container

The “Latin” Glyph Container, shown in Table 2, contains glyphs of letters for Latin alphabet languages. These are described in the following subsections.

Letters for Latin alphabet languages. All glyphs necessary to typeset Western and Central European, Nordic, Baltic, African languages, Vietnamese and Zhuang.

Some African characters are derived from the International Phonetic Alphabet. It is a fascinating challenge to design uppercase and italic-style forms for these characters (for example, see in Fig. 6 the


Figure 7: The African letter f with hook in roman upper- and lowercase, as well as in italic lowercase form

Figure 9: Italic letter ‘y’ with umlaut accent followed by the Dutch ligature ‘ij’ in Times italic, Helvetica medium uppercase and Helvetica bold uppercase form

Figure 10: The Vietnamese small letter o with horn was designed using the glyphs of the ring accent, the apostrophe and letter ‘o’

Figure 8: Capital letter b with topbar (which shares the same glyph as Cyrillic capital letter be), small letter b with topbar and Cyrillic small letter be

uppercase version of letter © which derives from the phonetic letter shwa, and in Fig. 7 the straight uppercase and straight lowercase versions of an African letter that becomes a standard f in italic style).

Sometimes, although a Greek form (or a form derived from Greek) is used for the lowercase, the uppercase does not follow the Greek model: for example, the uppercase of African letters ™ and ≠ (the former is 100% Greek, while the latter looks more like a “phonetic gamma”) are ä and ç.

Inversely, sometimes a Greek form is used for the uppercase only: ô is the uppercase form of the integral-like π, a character derived from the IPA, for which one can hardly imagine an obvious uppercase form; taking a Greek letter for that purpose is the easiest solution.

It is quite interesting how the lowercase of the Latin letter Å differs from that of the Cyrillic letter sharing the same glyph (see Fig. 8).

Special attention has been paid to the notorious Dutch ligature ‘ij’, in italic Times lowercase form and in Helvetica uppercase form (see Fig. 9). In the former case, we have connected the two letters, creating an intentional confusion with a ‘y’ letter with umlaut accent. In Fig. 9 the reader can compare these two constructions. On the other hand, the shape of the uppercase sans serif ‘IJ’ is derived from that of the letter ‘U’: this is common Dutch titling practice.

The characters ^ and ~ are used in Breton. They are not (yet) included in Unicode, and have been brought to our attention by Jacques André.

Finally, in Fig. 10 the reader can see how the “horn” of Vietnamese letters has been designed, using graphical elements from the font: in this case, the ring accent and the apostrophe.

Ligatures. Besides the “standard five” ligatures ff, fi, fl, ffi, ffl, we have included ligatures for the case where:
• the second or third letter is an ‘i’ with ogonek: fį, ffį, useful in Lithuanian;
• the second or third letter is a stroked ‘l’: fł, ffł, useful in Polish;
• the second or third letter is a ‘j’: fj, ffj;
• instead of an ‘i’ one has an ‘ij’ ligature: fĳ, ffĳ, useful in Dutch;
• instead of ‘f’ one has a long ‘s’: ſſ, ſi, ſl, ſſi, ſſl. We haven’t (yet) included any ‘f’ + ‘long s’ combinations.


Figure 11: French ligatures ‘st’ and ‘ct’ in Times and Helvetica fonts

Figure 13: Three closely-related glyphs: German ‘sz’, IPA ‘beta’ (not a Unicode character), Greek lowercase ‘beta’

Figure 14: Different types of ‘gamma’: the first one from the Greek alphabet, the others from the IPA

Finally we couldn’t resist the temptation of making “French” ligatures ‘st’ and ‘ct’, of course, both in Times and Helvetica (!) font families. These ligatures are well-known because of their use (in the Garamond typeface) in the Pléiade book collection. One can argue about their reason for being in Times (and especially in Helvetica) style, which has absolutely no historical background... Consider them an experiment, and trust us for not making them automatic in default text mode!

Diacritics. Accents 0xf8–0xff are used for Vietnamese only. Diacritics occupying positions 0xe8–0xf6 are shared with the “Common” Glyph Container.

The “IPA” Glyph Container

The “IPA” Glyph Container, shown in Table 3, contains a collection of glyphs needed for the International Phonetic Alphabet. They have been found in different sources: Unicode encoding, literature on phonetics (in particular we covered the complete table of characters of the French classic Initiation à la phonétique, by J.-M.-C. Thomas, L. Bouquiaux and F. Cloarec-Heiss, PUF, 1976). Only a small number of these characters are contained in the Unicode encoding. We would be grateful for any feedback from scholars on additional characters or corrections of the existing ones.

Since IPA is so... international, we have assumed that one can encounter phonetic insertions in text written in any language. Hence, we have included all glyphs needed, including lowercase Latin alphabet letters, and small capitals, the latter being included only whenever these are significantly different from their lowercase counterparts: including, for example, small caps ‘s’, ‘x’, ‘z’ would be useless since they are indistinguishable from lowercase ‘s’, ‘x’, ‘z’.

This point deserves some explanation: one should not confuse small capitals used for text and IPA small capitals. The former are a stylistic enhancement of text; they appear only in words entirely in small capitals style; their height is not equal to the x-height of the font, generally they are slightly higher. The latter are phonetic characters used together with authentic lowercase letters: they must have exactly the same height.

In Fig. 12 the reader can see some small caps we have designed for the ΩTimes IPA Glyph Container. It may not be obvious from the figure, but stroke widths of the small caps are exactly the same as the ones of lowercase letters (see lowercase letter ‘a’ to compare, as well as glyph ‘v’, which is used to represent both a lowercase and a small caps ‘v’). On the second line of the figure, one can see the same letters obtained as ‘fake’ small capitals.

We had a lot of fun designing these characters. In some cases, they just had to look a little different from their Greek counterparts: for example, in Fig. 13 the reader can see how the ‘phonetic beta’ has been inspired by the German ‘sz’, and not by the ‘real’ Greek beta; in Fig. 14 we present a collection


Figure 12: On the first line, lowercase letters ‘a’ and ‘v’ and specially designed IPA small capitals; on the second line, uppercase letters reduced 68%

Figure 15: Two IPA symbols of different origin: inverted letter ‘f’ and dotless ‘j’ with stroke

Figure 16: Which one is the best ‘l with retroflex hook+ezh’ ligature? (Three proposals)

of gamma-like glyphs, as well as the ‘real’ Greek letter gamma.

In some cases we were not very sure about the origin of some IPA characters and made several attempts: for example, in Fig. 15 one can see two symbols with a superficial resemblance: an inverted ‘f’ and a dotless stroked ‘j’. Which one is used in phonetics? The choice is left to the user.

Finally there was one case where we had to solve a real design problem: the one of the ‘l with retroflex hook’ and ‘ezh’ ligature (not a Unicode character). In most real-life examples we had the opportunity to see, the letter ‘ezh’ was sadly squeezed so that its tail remains higher than the retroflex hook of the ‘l’; we find it bad typographical practice to squeeze the ‘ezh’ and propose three solutions (only one of which of course will survive in the final Glyph Container). In Fig. 16 the reader can see: (a) the tail of ‘ezh’ merged with the retroflex hook of the ‘l’, (b) the retroflex hook of the ‘l’ continuing deep enough for it to be seen under the tail of ‘ezh’, (c) the retroflex hook rising higher so that it fits between the tail of ‘ezh’ and the baseline. In cases (a) and (c) the risk may be that the ‘l with retroflex hook’ be taken for an ‘ordinary l’ (compare ù, ¯ and ö); in case (b) we are going too deep under the baseline...

Anyhow, we expect your feedback to resolve this issue.

The “Greek” Glyph Container

The “Greek” Glyph Container, shown in Table 4, contains glyphs needed for the Greek language, ancient and modern. The problem with most


Figure 18: The inverted ‘iota with circumflex’ (not a Unicode character) used in 19th-century modern Greek printing

Figure 17: Two forms of Greek circumflex accent: the first used in modern Greek typesetting, the second one used in a number of scholarly typefaces

Figure 20: The different variant forms of Greek letters: beta, theta, phi, rho, kappa

commercial Greek fonts is that they are either made for modern Greek use, or for ancient Greek use by non-Greek scholars. There is a third possibility: fonts made for ancient Greek use by Greek scholars. In this Glyph Container we have tried to satisfy all three categories of users. Of course, this font can be used both for monotonic and polytonic text (there is a straight monotonic accent, while the acute one can be used as well). But we have gone even farther, by including two versions of the circumflex accent, shown in Fig. 17: the tilde-like one, used in Greece, and the cap-like one used in scholarly Western editions.3

3 This brings us to a nice typographical joke: the Greek “circumflex” accent is either “tilde”-like or “cap”-like, but never actually... “circumflex”-like!

Faithful to the first TUGboat paper by one of the authors (Haralambous and Thull, 1989), we have included the inverted iota with circumflex accent, found in certain 19th-century editions of modern Greek (see Fig. 18).

Version 1.0 of the Unicode standard contained uppercase versions of Greek numerals, just as roman numerals were provided in upper- and lowercase; in ISO 10646 these characters were removed, apparently by decision of the Greek delegation. We find this action absurd (there are many examples of uppercase numerals in literature) and have of course included the relevant glyphs in the Glyph Container. In fact, we have included all known variants of Greek numerals: stigma, digamma, qoppa and sampi (see Fig. 19, next page), and the upper and lower numeral signs.

Greek letters are also very common in mathematics and physics. Some variant forms are used for different purposes in these fields (rho with curved or straight tail, open/closed phi, open/closed theta, curly or straight kappa). To these we add a variant form used in regular text: the initial and medial beta (b vs. 1). All variant forms of lowercase letters are shown in Fig. 20.

There is also a variant form of the letter sigma: the so-called “lunate” sigma. This character is used in some scholarly editions to avoid the distinction between ordinary medial and final sigmas. In Fig. 21 we compare it with Latin letter ‘c’: the lowercase of lunate sigma has no bulb and the upper one, no serif


Figure 19: Collection of glyphs used for Greek numerals; on the first line: lowercase stigma, digamma (2 variants), qoppa (2 variants), sampi; on the second line: uppercase stigma, digamma, qoppa, sampi

Figure 21: The glyphs of Greek lunate sigma and Latin letter ‘c’ (in upper- and lowercase)

(and hence the Greek and Latin letters are identical in the Helvetica family).

The reader may ask why on positions 0x80–0x8d and 0xa0–0xad of the Glyph Container table we have accented letters, while on positions 0x90–0x9d there are only accents and no letters. The answer is very simple: accents on Greek letters sometimes change shape according to the width of the letter. Greek vowels can be roughly divided into three width classes: narrow ones (the iota), wide ones (the omega) and medium ones (all the remaining). On rows 8 and a we have placed accents for narrow and wide letters, respectively; on row 9 we have placed the ordinary accents. And since rows 8 and a have been made for unique vowels, we have preferred to include complete accent+letter combinations (letters being aliases, of course) so that the virtual font doesn’t need to make the construction.

The same reason justifies positions 0x5e and 0x5f: the dieresis (dialytika, in Greek) does not have the same width, depending on whether it is placed on an iota or an upsilon.

Finally, in the table there are also the specifically Greek (round ones), the upper dot, and the two forms of “subscript” iota: for lowercase letters (ypogegrammeni) and for uppercase ones (prosgegrammeni).

The “Cyrillic” Glyph Container

The “Cyrillic” Glyph Container, shown in Table 5, contains glyphs needed for all languages using the Cyrillic alphabet, whether European or Asian. Characters for pre-Lenin Russian (fita, , ) have also been included, as well as “modernized” versions of Slavonic characters. As in the case of the Latin ‘st’, ‘ct’ ligatures, one may argue the necessity of modernizing Slavonic script, especially when it comes to drawing the Helvetica version. It happens that these characters have been included in Unicode (like the Coptic ones), and this is already reason enough to draw the font; it is up to the user to actually adopt the “modern” font, or use beautiful traditional Slavonic typefaces.

What was fascinating when designing Slavonic letters was their relation to Greek ones. In Fig. 22 we compare Slavonic and Greek Xi and Psi: at a first glance, Slavonic Xi bears a resemblance to Greek lowercase xi, it is quite surprising that


Figure 22: Comparing Slavonic letters Xi and Psi with the corresponding Greek ones

the Slavonic xi is reflected with respect to the Greek one. Uppercase Psi letters are identical, and lowercase Slavonic psi has a heavier stem (with serif, while Greek lowercase letters never have serifs). Uppercase Slavonic Omega + is a magnified lowercase omega (where we have placed a serif on the central stem). For the lower part of the Slavonic / we have used an inverted Psi: this is a small designer’s secret allowing better integration of the letter in the Times (resp. Helvetica) font family.

Another design initiative of ours was to use graphic elements from the Serbian letter !1, for the Asian Cyrillic åú, §¥, Øø. We expect user feedback to validate or deny this choice.

The “Arabic” Glyph Container

The “Arabic” Glyph Container, shown in Tables 6–8, contains the glyphs that are necessary to typeset in any language using the Arabic script.4 The design of these Ω Arabic fonts doesn’t actually have much in common with the Times and Helvetica families; in fact, we have used popular modern designs, which can be used both for technical and literary text, and which allow easy readability in small sizes. Fat and thin stroke width, ascender height and descender depth have all been calculated with respect to the corresponding parameters of the Latin/Greek/Cyrillic fonts.

4 We have also included undotted versions of letters ba, fa, qaf, for typesetting of old manuscripts.

In Fig. 23 the reader can see the metrical relationship between the Latin, Arabic and Hebrew fonts: we want these characters to fit with one another, so that multilingual texts using the three scripts will produce typographically acceptable results.

Concerning diacritics we have included all vowels, and combinations with shadda and hamza, as well as some special cases: madda, wesla, vertical fatha, vertical fatha + shadda, and other diacritics used in Arabic spellings of African languages, Kurdish, Baluchi, Kashmiri, Uigur and Kazakh.5

5 We have not included diacritics used for the Qur’ān, as we believe that these should be printed using very specific traditional typefaces; nevertheless, the diacritics provided are sufficient for excerpts from the Qur’ān.

Esthetic Ligatures. We have included a very small number of esthetic ligatures (fewer than 150): ba-like letters followed by a final noon-like, initial fa-like letters followed by a -like, an initial lam-meem ligature, and the llah ligature with and without vertical fatha + shadda. We are not convinced that heavy ligaturing of these fonts, in the manner of traditional Naskhi fonts, would be esthetically judicious. Nevertheless, we are open to any suggestion on possible ligatures that might be added.

The “Tifinagh” Glyph Container

The “Tifinagh” Glyph Container, shown in Tables 9 and 10, contains the glyphs needed for typesetting the Tamazight (alias Berber) language. A complete description of the Berber TeX system is given by Haralambous [2].

The glyphs shown on the tables warrant some explanation. Tifinagh script has always been written in a non-serif style. On table 9 we show a “Helvetica” version of the script, in the sense that everything has been done to bring these glyphs closer to the Latin/Cyrillic/Greek Helvetica types. This approach is quite safe, and (in all modesty) the result should not surprise any speaker of Berber.

On the other hand, table 10 shows a 100% experimental font! The main idea was to say: “What would Tifinagh letters look like today if they had followed the same evolution as Latin ones?” See Fig. 24 for a closer comparison of some Tifinagh letters in both Helvetica and Times styles. Admittedly, the result seems weird, and a lot of corrections have to be made before these drawings actually can be called Tifinagh glyphs... Nevertheless, the Berber language is being typeset more and more in its traditional script, often together with other scripts. As a result, the problem of homogenization with Western typographical traditions must be faced; the style of the Times typeface is one of the first challenges.


Figure 23: Capital height, x-height, baseline and descender depth of Latin, Arabic and Hebrew letters

Figure 24: On line 1: Tifinagh letters in Helvetica style, with shapes identical to Greek and Latin ones; on line 2: idem. but shapes become stranger; on lines 3 and 4: the same letters, in Times style, using “evolutional logic”

The “Hebrew” Glyph Container

The “Hebrew” Glyph Container, shown in Table 11, contains the glyphs needed for Hebrew, Yiddish (in both the Russian and the American YIVO spelling) and Ladino. The font design is based on a modern Hebrew typeface that is very popular in Israel. Once again we have adapted the stroke widths and character dimensions of Latin/Greek/Cyrillic glyphs. In Fig. 23, the reader can compare the size and weight of Hebrew letters with those of Latin and Arabic.6

6 It should be noted that the height of Hebrew letters is exactly halfway between the heights of uppercase and lowercase Latin letters.

We have not included Masoretic signs in the font, because we believe that the Bible should be typeset in traditional “square” fonts. Nevertheless, we have included letters with dagesh which appear only in a few isolated positions in the Bible, as well as the inverted nun. We have also included two


different forms of the lamed-aleph ligature (with and without left stroke), used in older texts.

References

[1] Haralambous, Yannis and Klaus Thull. “Typesetting modern Greek with 128-character codes.” TUGboat 10,3 (1989), pages 354–359.

[2] Haralambous, Yannis. Un système TeX berbère. Actes de la table ronde internationale « Phonologie et notation usuelle dans le domaine berbère ». Paris: Institut National des Langues et Civilisations Orientales, 1993. [Forthcoming in the Cahiers GUTenberg thematic issue on Semitic scripts.]

[Glyph chart for Table 1: rows "2x to "Ex, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 1: Tentative OmegaTimesCommon Glyph Container Table (June 10, 1996).

[Glyph chart for Table 2: rows "2x to "Fx, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 2: Tentative OmegaTimesLatin Glyph Container Table (June 10, 1996).

[Glyph chart for Table 3: rows "2x to "Fx, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 3: Tentative OmegaTimesIPA Glyph Container Table (June 10, 1996).

[Glyph chart for Table 4: rows "3x to "Bx, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 4: Tentative OmegaTimesGreek Glyph Container Table (June 10, 1996).

[Glyph chart for Table 5: rows "2x to "Ex, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 5: Tentative OmegaTimesCyrillic Glyph Container Table (June 10, 1996).

[Glyph chart for Table 6: rows "2x to "Fx, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 6: Tentative OmegaTimesArabicOne Glyph Container Table (June 10, 1996).

[Glyph chart for Table 7: rows "2x to "Fx, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 7: Tentative OmegaTimesArabicTwo Glyph Container Table (June 10, 1996).

[Glyph chart for Table 8: rows "2x, "8x and "9x, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 8: Tentative OmegaTimesArabicThree Glyph Container Table (June 10, 1996).

[Glyph chart for Table 9: rows "2x to "6x, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 9: Tentative OmegaHelveticaTifinagh Glyph Container Table (June 10, 1996).

[Glyph chart for Table 10: rows "2x to "6x, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 10: Tentative OmegaTimesTifinagh Glyph Container Table (June 10, 1996).

[Glyph chart for Table 11: rows "2x to "7x, columns "x0 to "xF; the glyph shapes are not reproduced here.]

Table 11: Tentative OmegaTimesHebrew Glyph Container Table (June 10, 1996).

Extending TeX for Unicode

Richard J. Kinch 6994 Pebble Beach Ct Lake Worth FL 33467 USA Telephone (561) 966-8400 FAX (561) 966-0962 [email protected] URL: http://styx.ios.com/~kinch

Abstract

TeX began its “childhood” with 7-bit-encoded fonts, and has entered adolescence with 8-bit encodings such as the Cork standard. Adulthood will require TeX to embrace 16-bit encoding standards such as Unicode. Omega has debuted as a well-designed extension of the TeX formatter to accommodate Unicode, but much new work remains to extend the fonts and DVI translation that make up the bulk of a complete TeX implementation. Far more than simply doubling the width of some variables, such an extension implies a massive reorganization of many components of TeX. We describe the many areas of development needed to bring TeX fully into multi-byte encoding. To describe and illustrate these areas, we introduce the TrueTeX® Unicode edition, which implements many of the extensions using the Windows Graphics Device Interface and TrueType scalable font technology.

Integrating TeX and Unicode

You cannot use TeX for long without discovering that character encoding is a big, messy issue in every implementation. The promise of Unicode, a 16-bit character-encoding standard [15, 14], is to clean up the mess and simplify the issues.

While Omega [5, 13] has upgraded TeX-to-DVI translation to handle Unicode [3], the fonts and DVI-to-device translators are far too entrenched in narrow encodings to be easily upgraded. This paper will develop the concepts needed to create Unicode TeX fonts and DVI translators, and exhibit our progress in the TrueTeX Unicode edition.

A fully Unicode-capable TeX brings many substantial benefits:
• TeX will work smoothly with non-TeX fonts. While TeX already has a degree of access to 8-bit PostScript and TrueType fonts, there are many limitations that Unicode can eliminate.
• TeX will eliminate the last vestiges of its deep-seated bias for the English language and US versions of multilingual platforms like Windows. It will adapt freely and instantaneously to other languages, not just in the documents produced, but in its run-time messages and user interface. This flexibility is crucial to quality software, especially to a commercial product in an international marketplace.
• With access to Unicode fonts, the natural ability of TeX to process the large character sets of the Asian continent will be realized. Methods such as the Han unification will be accessible.
• TeX will install with fewer font and driver files. Many 8-bit fonts will fit into one 16-bit font, and in systems like Windows, which treat fonts as a system-wide resource, fewer fonts are an advantage. Only one application will be needed to translate from Unicode DVI to output device.
• TeX documents will convert to other portable forms (like PDF, OpenDoc, or HTML) and will work with Windows OLE, without tricks and without pain.
• Computer Modern and other TeX fonts will be usable in non-TeX Unicode applications. The 8-bit encoding problems have broken Computer Modern on every variety of Microsoft Windows.
• When 16-bit encodings overcome the resistance of the past (and we have every reason to hope that they will), TeX will play a continuing role in software of the future, instead of becoming an antique.

Claiming these promises involves some trouble along the way, but without 16 bits to use for encoding, we will never have a solid solution.

TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting 147 Richard J. Kinch

Let us survey in the rest of this paper what is needed to achieve these various aspects of integrating TEX and Unicode.

Omega and Unicode

The goal of our work has been to create a Unicode-capable DVI translator, and to reorganize the TEX fonts into a Unicode encoding. TEX itself (that is, the formatter) already has a Unicode successor, namely Omega.

The chief advance of Omega is that it generalizes the TEX formatter to handle wider encodings. What Omega is less concerned with is the DVI format (which has always provided for wider encodings, up to 32 bits), the encoding of the existing TEX fonts, and the translation of .dvi files for output devices. In fact, Omega has side-stepped DVI translation altogether with its extended xdvicopy translation, whereby Omega operates within the old environment of 8-bit TEX fonts and the old DVI translators. Since Unicode rendering is supported on Windows but not on other popular TEX platforms (UNIX, DOS, etc.), a devotion to Omega’s portability requires that Omega use the old fonts and DVI translators. Lacking any compulsion to extend the DVI translators for Unicode, the Omega project has justifiably invested most of its effort into earlier stages of the typesetting process [4, page 426].

Our Unicode TEX fonts and Unicode DVI translator, while having a natural connection to Omega, are capable of connecting TEX82 to Unicode as well. Through the mechanism of virtual fonts [11], TEX can access Unicoded fonts while using its old 8-bit encodings itself.

What’s the fuss?

Wishing for 16 bits of Unicode sounds like, hey presto, we just widen some integer types, double some constants, and type “make” somewhere very, very high in a directory tree. The task is far from this simple for several reasons:

One, TEX and TEXware are full of 256-member tables which enumerate all code points. These would have to grow to 65,536 members. While Haralambous and Plaice want us to agree that this is “impossible” for practical reasons [6], they assume that we are not going to re-implement the 8-bit-encoded software for sparse arrays. Applying sparse-array techniques to manage per-character data will avoid an impossible increase in execution time and/or memory, although it will require an initial extra effort to upgrade the software.

Two, these tables have to be stored in files, and we need to carefully and deliberately extend the file formats to handle the extensions. Not only could we come up with a bad design that limits us unnecessarily in the future, but all the old TEXware has to be upgraded, and then we have to port the upgrades.

Three, we must rationalize the old ad-hoc character sets into a big union set. Just cataloging and managing this data is a large task: many tens of thousands of items, where we used to have only hundreds before. Some degree of database management tools must be applied to get the codes into a form which we can compile into software; it is not enough to just type in some array initializers here and there.

Encoding standards are necessarily incomplete or imprecise in some aspects, and none fit the TEX enterprise. While many of the Unicode math symbols were taken from TEX, many of the TEX characters are missing from Unicode. But Unicode is about the closest encoding to TEX math that we can expect from an unspecialized encoding, and with Unicode we gain a powerful connection to multilingual character sets.

Extending Computer Modern to Unicode

A “rational” encoding establishes a mapping of character names to unique integers, and this mapping does not vary from font to font. The Computer Modern fonts were not encoded rationally. For example, code 0x7b is overloaded with about 8 different characters, and character dotlessi appears in different codes in different fonts. Given an 8-bit limit on encoding, this was inevitable. But this makes for many troubles; moving up to a rational, 16-bit encoding is a clean solution.

Computer Modern is also “incomplete” in the sense that if you made a table having on one axis the list of all the specific styles (Roman, Italic, typewriter, sans serif, etc.) and on the other axis all the characters in all the fonts (A–Z, punctuation, diacritics, math symbols, etc.), the table would have lots of holes when it came to what METAFONT source exists. Commercial text fonts have all of these holes filled, or at least the regions populated in the table are rectangular. In Computer Modern the regions are randomly shaped.

Furthermore, the character axis of this (very large) imaginary table is missing many characters considered important in non-TEX encodings. For example, ANSI characters like florin, perthousand,

148 TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting Extending TEXforUnicode

cent, currency, yen, brokenbar, etc., are not implemented in Computer Modern. Certain of these symbols can be unapologetically composed from existing Computer Modern symbols: a Roman multiply or divide would come from the math symbol font, trademark from the T and M of a smaller optical size, and so on. But many other characters will just have to be autographed anew in METAFONT (at least one related work is in progress [6]).

The job of extending Computer Modern to be a rational and complete set of fonts first requires that we reorganize the existing characters into a clean, 16-bit encoding. Then we are in a strong position to fill in the missing characters.

We not only want to give TEX access to 16-bit-encoded fonts, we also want the converse: non-TEX applications to have access to the Computer Modern fonts in TrueType form. This mandates adherence to the Unicode standard wherever possible, and an organized method to manage the non-Unicode characters in Computer Modern.

Here is a list of the components we consider essential to a Unicode rationalization of Computer Modern. In this list we take a different approach from Haralambous’ Unicode Computer Modern project [7], which is aimed at producing virtual fonts which resolve to 8-bit .pk fonts from METAFONT. Our aim is a set of Unicoded TrueType fonts.

• A METAFONT-to-outline converter, a very difficult although not impossible task, as illustrated in MetaFog [9].

• A database system to treat the converted Computer Modern glyphs as atoms, for input to a TrueType font-builder.

• A database of character names which covers all the characters in the TEX fonts and in Unicode. We call the grand union TEX character set TeXunion, in the same way that we denote the Unicode characters as the Unicode character set. (We will use Small Caps to indicate a formal set.) TeXunion contains 1108 characters by the present inventory. Producing this database involves some work because there are no standards for TEX character names (that is, single-word alphanumeric names such as are used in PostScript encodings). The standard Unicode character descriptions are lengthy phrases instead of single words, making them unjoinable to the TEX names. For example, the Unicode standard provides the verbose entry for the code 0x00ab, “left-pointing double angle quotation mark”. The PostScript names¹ in common use come from an assortment of sources, and they exhibit inconsistencies, conflicts, and ambiguities which frustrate computation of set projections and joins.

¹ There is an attempt at standardization in PostScript-style names from the Association for Font Information Interchange (AFII), but the standard is proprietary and on paper only. The names are serial numbers as opposed to abbreviated descriptions.

• A database of TEX encodings, which tells which characters appear in which TEX fonts. TEX uses a gumbo of no less than 28 (!) distinct encodings (Table 1). This number may come as a surprise, but has been hidden by the web of METAFONT source files. The sum of all TEX encodings constitutes a database of 3415 name-to-code pairs, each name being taken from the 1108 members of TeXunion. We designate this set of encodings (that is, a set of mappings of names to integers) as TeXencod.

As if the miasmic fog of encoding conventions were not confused enough, small-caps fonts present still more encoding problems. They represent an axis of variation that is hardly defined in the usual set of font parameters. We must consider small-caps characters to be different from their corresponding parents. If this is not done, then there is no way to compose virtual small-caps fonts from their lowercase counterparts, because we would have no way of knowing which characters are to shrink (a jumbled set of letters and accented letters) and which do not (punctuation and all the rest). Thus, for each encoding used anywhere by a small-caps font, we must make a duplicate small-caps version (altering the lowercase character names to small-caps names) of the encoding in the list of all the encodings. Thus we have a csc2 for roman2, t1csc for t1, and so on.

To produce these duplicate encodings, we need a rule to convert lowercase names (both letters a–z and diacritical letters) to small-caps lowercase names and back. We have simply been appending “sc” to the name (this works because there are no collisions with names that happen already to end in “sc”). For example, a small-caps letter “a” is “asc”.

Adobe has been appending “small” to their names (this often causes character names to exceed a traditional limit of 15 characters in length), as in the MacExpert encoding [2]. This is done in an irregular manner by appending to the uppercase character names (for example,


the Adobe small-caps for aacute is Aacutesmall, while ours is aacutesc); apparently someone mistook appearance for semantics. Furthermore, Adobe has a small-caps version of bare diacritics in their MacExpert encoding, although the diacritical character name is irregularly changed to an initial capital (for example, Adobe small-caps for acute is Acutesmall).

The LATEX T1 encoding, which was supposed to have been uniform for all DC fonts, also has an irrational aspect, in that the T1 encoding is overloaded when it is applied to both lowercase and small-caps fonts. Somewhere in the LATEX macros is buried something tantamount to another small-caps encoding of T1, which indicates which codes are letters or diacritics. Another related limitation of T1 encoding is the lack of small-caps accents.

A wealth of code positions does not exempt Unicode from pecksniffian absurdities. The Unicode committee will not provide encodings for a small-caps alphabet (small-caps being a matter of typography and not information content), although they provide encodings for some small-caps characters (which appear in older encoding standards subsumed by Unicode).

Table 1: TEX Single-Byte Encodings (TeXencod). Covers all Computer Modern, AMS math symbol, and Euler fonts. Each item maps a set of 128 or 256 character names to integers.

Name      Description
csc0      TEX caps and small caps (ligs = 0)
csc1      TEX caps and small caps (ligs = 1)
euex      AMS Euler Big Operators
eufb      AMS Euler Bold†
eufm      AMS Euler Fraktur
eur       AMS Euler
eus       AMS Euler Script
lasy      LATEX symbols
lcircle   LATEX circles
line      LATEX lines
logo      METAFONT logo
manfnt    TEXbook symbols
mathex2   TEX math extension
mathit1   TEX math italic (ligs = 1)
mathit2   TEX math italic (ligs = 2)
mathsy1   TEX math symbols (ligs = 1)
mathsy2   TEX math symbols (ligs = 2)
msam      AMS symbol set A
msbm      AMS symbol set B
roman0    TEX Roman (ligs = 0)
roman1    TEX Roman (ligs = 1)
roman2    TEX Roman (ligs = 2)
t1        LATEX NFSS T1 encoding
t1csc     T1 with small caps
texset0   TEX “texset” encoding (ligs = 0)
textit0   TEX text italic (ligs = 0)
textit2   TEX text italic (ligs = 2)
title2    TEX 1-inch capitals (ligs = 2)
† A superset of eufm with two extra chars

• A rationalization of the TEX character sets into their largest common subsets (Table 2). This represents the relations between the TEX encodings and the character subsets as organized in Knuth’s METAFONT sources. The first item in each entry of Table 2 gives the TEX encoding as given in Table 1, known by the .mf source file used to generate the font; the remaining items are the common subsets generated via the METAFONT source files of the same names.

The relation set forth in Table 2 is not refined for the distinctions regarding the ligature setting. Certain of Knuth’s encodings appear overqualified, namely, mathex2, mathit{12}, and mathsy{12} do not vary with the ligs setting, although it is specified in the METAFONT driver file.²

² Also, the comments at the top of romsub.mf are in error about what happens when ligs = 2. Apparently, no one has tried any other ligs setting!

These decompositions of the various TEX encodings may be considered close to the “greatest common” subsets, although we do not require a full decomposition here. To be completely decomposed, the {012}-numbered items on the right should be further decomposed into the unnumbered common set and the various numbered differential sets. The sets on the right column of Table 2 we will use below as the set known as TeXpages. We have not yet made the effort to elaborate the members of each TeXpages set, which is needed to compute the remaining work to complete the style axes of Computer Modern.

• A database giving the mapping of TEX fonts to their encodings as known above (Table 3). The table below lists TEX font names and their encoding name; an N indicates a wildcard for any optical point size integer, excluding sizes of the same style already matched earlier in the table. If a new optical size for a font name is not in this table, the presumption should be


that its encoding ought to be the lowest optical size in the table of the same name. The wild card “*” matches any suffix, such as variations on style or optical size, for names which do not match higher in the table. We designate the set {cmb10, ..., eusm*} as TeXfonts.

Table 2: TeXencod Decomposition into TeXpages. Set romanlsc is romanl with small-caps semantics; it does not actually appear in Computer Modern. Braced digits indicate factored suffixes.

TeXencod Member   Covering TeXpages Members
csc0              accent0 cscspu greeku punct romand romanp romanu romspu romsub0 romanlsc
csc1              accent12 comlig cscspu greeku punct romand romanp romanu romspu romsub1 romanlsc
mathex2           bigdel bigop bigacc
mathit{12}        romanu itall greeku greekl italms olddig romms
mathsy{12}        calu symbol
roman0            accent0 greeku punct romand romanl romanp romanu romspl romspu romsub0
roman1            accent12 comlig greeku punct romand romanl romanp romanu romspl romspu romsub1
roman2            accent12 comlig greeku punct romand romanl romanp romanu romlig romspl romspu
texset            punct romand romanl romanp romanu tset tsetsl
textit0           accent0 greeku itald itall italp italsp punct romanu romspu romsub0
textit2           accent12 comlig greeku itald italig itall italp italsp punct romanu romspu
msam              calu asymbols
msbm              calu bsymbols xbbold
...               (... and so on for the rest ...)

Table 3: Mapping of TeXfonts to TeXencod. N indicates an optical point size; asterisk a suffix wildcard.

TEX Font   Encoding   TEX Font   Encoding
cmb10      roman2     cmbsy5     mathsy1
cmbsyN     mathsy2    cmbxN      roman2
cmbxsl10   roman2     cmbxti10   textit2
cmcscN     csc1       cmdunh10   roman2
cmexN      mathex2    cmff10     roman2
cmfi10     textit2    cmfib8     roman2
cminch     title2     cmitt10    textit0
cmmi5      mathit1    cmmiN      mathit2
cmmib5     mathit1    cmmibN     mathit2
cmmr10     mathit2    cmmb10     mathit2
cmr5       roman1     cmrN       roman2
cmslN      roman2     cmsltt10   roman0
cmssN      roman2     cmssbx10   roman2
cmssdc10   roman2     cmssiN     roman2
cmssq8     roman2     cmssqi8    roman2
cmsy5      mathsy1    cmsyN      mathsy2
cmtcsc10   csc0       cmtexN     texset0
cmtiN      textit2    cmttN      roman0
cmu10      textit2    cmvtt10    roman2
lasy*      lasy       lcircle*   lcircle
line*      line       logo*      logo
manfnt     manfnt     msam10     msam
msbm10     msbm       dccscN     t1csc
dctcscN    t1csc      dc*        t1
euex*      euex       eufb*      eufb
eufm*      eufm       eurb*      eur
eurm*      eur        eusb*      eus
eusm*      eus

• A fuzzy-matching operator which, when joining, selecting, and projecting the above databases, can resolve the redundancy, synonyms, and ambiguities in the character names and their composition. Here is an inventory of issues known to date:

– bar (vertical bar) vs. brokenbar (vertical broken bar)
– macron vs. overscore
– minus vs. hyphen vs. endash vs. sfthyphen vs. dash
– grave vs. quoteleft in code 0x60
– space (0x20) vs. nbspace (0xa0) vs. visiblespace vs. spaceopenbox vs. spaceliteral
– rubout in code 0x7f
– ring vs. degree
– dotaccent vs. periodcentered vs. middot vs. dotmath; Zdotaccent vs. Zdot, etc.
– quotesingle vs. quoteright
– slash vs. virgule
– star vs. asterisk
– oneoldstyle vs. one, etc.
– diamondmath vs. diamond vs. lozenge
– openbullet vs. degree
– nabla vs. gradient


– cwm vs. compoundwordmark (a T1 problem) vs. zeronobreakspace
– perzero vs. zeroinferior vs. perthousandzero para perthousand
– slash vs. suppress vs. polishlslash, as in Lslash and Lsuppress
– Ng vs. Eng (and ng vs. eng)
– hungarumlaut vs. umlaut, as in Ohungarumlaut vs. Oumlaut
– dbar vs. thorn
– tilde vs. asciitilde
– tcaron transmogrifies to tcomma, et al.
– mu vs. mu1 vs. micro, code 0xb5
– Dslash vs. Dmacron, code 0x0110; dslash vs. dmacron, code 0x0111
– florin (not in Unicode) vs. fscript, code 0x0192
– fraction vs. fraction1 vs. slashmath, code 0x2215
– circleR vs. registered, circlecopyrt vs. copyright
– arrowboth vs. arrowlongboth
– aleph vs. alef vs. alephmath
– Ifractur (eus, mathsy1, mathsy2) vs. Ifraktur (Unicode) vs. Rfractur (eus, mathsy1, mathsy2, and Unicode); the spelling should uniformly be “fraktur”
– smile vs. smileface vs. invsmileface vs. Unicode 0x263A (unnamed)
– Omega vs. ohm, Omegainv vs. mho
– names not starting with letters: 0script (0x2134), 2bar (0x01bb)

We represent these items in a text file having the following format: Each line of the file gives character name synonyms, one group of synonyms per line. Any of the names on one line are synonyms, and can be freely exchanged. For example, the line “visiblespace spaceliteral” means that the character names visiblespace and spaceliteral are completely equivalent names. (The former was used in the TEX DC fonts [10], while the latter was the PostScript name used in the Lucida Sans Unicode TrueType font of Windows NT.)

A special case of “synonym” is the Unicode fall-back. This is a code number which is a “synonym” for TeXunion members not in Unicode, and is our assignment of the Unicode “private zone” codes for the misfits. For example, the line “ff 0xf001 0xfb01” (Microsoft fonts have an undocumented usage like this) means that the character ff (which is a ligature not to be found in Unicode) carries a recommended private-zone code assignment of 0xf001 or 0xfb01. One or more such recommended codes may appear, in order of preference. In resolving a private-zone conflict, a font-building program may take the recommended codes in order until a non-conflicting code is found. Only after the recommended codes are exhausted should the program make a random private-zone assignment. Codes may be given in decimal, octal (leading 0), or hex (leading 0x) formats. Programs using these tables take care to distinguish character codes (which contain only hexadecimal digits if starting with 0x, otherwise only octal digits if starting with a leading zero, otherwise only decimal digits) from names (anything else, including names which start with digits). Lucida Sans Unicode contains some names like “2500” (for code position 0x2500); if this presents a problem we might have to prefix a letter to these names.

A name may appear in more than one synonym group, although such groups do not join within the matching algorithm. The first name in any group is the “canonical” name. The canonical name is the name which should be output by programs which compute set operations on the encoding sets. This helps to achieve a “filtered” result which does not contain troublesome synonyms. For example, if the synonym file contained the lines:

    joseph jose yosef josephus 0xfb10
    joseph joe joey

the names jose, yosef, and josephus would have fall-back code 0xfb10, the names joe and joey would have no fall-back code, and all the names above would invariably be transformed to joseph on output.

The TrueTEX filter accessory program, joincode, performs a relational join on two font encoding sets, making a new encoding. It resolves the issues of a given synonym file according to the rules we have stated.

• A database of non-TEX encodings, which tells which characters appear in various encoding standards such as ANSI or Unicode. This list presently constitutes a database of 3523 entries from a set of 1814 characters. Some of the common examples of commercial importance today are given in Tables 4 and 5. Having these sets allows us to export virtual fonts for any of


the encodings represented, so we can virtualize non-Unicode, non-TEX fonts.

Table 4: Some Non-TEX Encodings.

Name       Description
ase        Adobe PostScript “AdobeStandardEncoding”, the built-in encoding of many Type 1 fonts
belleek    Belleek [8] scheme for LATEX T1 encoding on 8-bit TrueType
belleekc   Belleek with small-caps
latin1     Latin 1 (ISO 8859-1)
latin1ps   Adobe PostScript “ISOLatin1Encoding” (which is not ISO Latin-1!)
mac        Macintosh
macexp     Adobe PostScript “MacExpert” encoding (used by Acrobat in PDF [2]), containing ligatures, small caps, fractions, typesetting niceties
mre³       Adobe PostScript “MacintoshRomanEncoding” (used by Acrobat in PDF), a subset of “mac”, which omits some math characters and the Apple trademark
pdfdoc     Adobe PostScript “PDFDocEncoding” (used by Acrobat in PDF), an ad-hoc encoding used in PDF outline entries, text annotations, and Info dictionary strings, consisting of a remapped set = {ase ∪ mre ∪ wae}
unicode    16-bit Unicode (Windows NT and 95, AT&T Plan 9)

Table 5: Some Encodings Used in Windows.

Name       Description
wae        Adobe “WinAnsiEncoding” (used in Acrobat in PDF), “winansi” with bullets in the .notdef positions, some semantic synonyms
winansi    Windows ANSI 8-bit (US/Western Europe code page) (Includes certain non-ANSI characters in 0x80–0x9f range of codes.)
winansiu   Windows ANSI Unicode (US/Western Europe code page) (Same characters as are present in the winansi encoding, except the non-ANSI characters are in their Unicode positions.)
winmultu   Windows Multi-Lingual Unicode (Windows 95/NT) (655 characters supporting all Latin alphabets, Greek, Cyrillic, OEM screen characters.)
winNNNN    Windows (for code pages numbered NNNN)

• A TrueType-font-builder that takes the converted outlines from various TEX fonts, organizes sets of them based on an output encoding, and builds binary TrueType fonts from the reorganized glyph data.

A sub-tool for the font-builder incorporates a redundancy-elimination feature that allows you to specify a table listing which characters in a given TEX font may be taken from other TEX fonts without repeating a costly METAFONT glyph conversion. One example of such a redundancy is how DC fonts largely replicate the Computer Modern fonts; it would be a waste of effort to convert the glyphs twice. Another example is that many font variants are slanted versions of the upright face, and the geometric slant is easily applied to an already-converted glyph rather than slanting in METAFONT and repeating the glyph conversion. This technique is also used to compose accents and letters for “purely” accented characters (where the accent and letter do not overlap), since the MetaFog conversion is applied only to the accent part of such glyphs, allowing the redundant letter conversion to be done only once.

Another sub-tool builds these redundancy tables by comparing the encoding tables for sets of fonts against a target font. For example, a DC font combines punctuation and symbol characters spread across several Computer Modern fonts.

In our system, we actually produce textual versions of the binary fonts and convert them to Type 1 and TrueType formats with separate tools. This allows a general conversion to be


optimized for the ultimate binary format. For example, Type 1 glyphs require knot-pivoting, followed by combing, to insert extrema tangent points. The hinting methods also differ between Type 1 and TrueType.

Once a binary version of a font is prepared, containing all the glyphs, a re-encoding tool (TrueTEX accessory program ttf_edit [17], which is a stack-oriented TrueType font encoding editor) must be applied to finish the font for real-world use. The re-encoding stage not only re-encodes, but can optionally adjust the metrics and kerning information. By making these aspects “afterthoughts” we can fine-tune fonts without going back into the detailed conversion process. The re-encoding stage can also upgrade any 8-bit-encoded TrueType font to an arbitrary Unicode encoding, which is important since many commercial font editors can only output 8-bit TrueType fonts.

• A notion of what TEX fonts we want to convert. If we consider the DC fonts a good target, we come up with quite a list (Table 6).

Table 6: DC fonts in LATEX (Rev. 1/95). The “funny” fonts (Fibonacci Roman, etc.) are omitted. This list is made by examining names in *.fd from the LATEX distribution.

Font     5 6 7 8 9 10 12 17
dcb      ••••• • • •
dcbx     ••••• • •
dcbxsl   ••••• • •
dcbxti   •••
dccsc    •••
dcitt    •••••
dcr      ••••• • • •
dcsl     •••••
dcsltt   •• • •
dcss     •••••
dcssbx   •
dcssdc   •
dcssi    •••••
dctcsc   •••
dcti     ••• • • •
dctt     •• • •
dcu      ••• • • •

Rationalizing TEX fonts in Unicode sets

Let us consider the shuffling and dealing needed to reorganize Computer Modern into a Unicode encoding. With the luxury of thousands of code positions, we can un-do the “scattering” of characters amongst the TEX fonts. For example, the math italic set (mathit{12}) contains the regular (not italic) lowercase Greek letters. Conversely, we are going to have to scatter a few TEX fonts that happened to combine dissimilar styles into one 7-bit font, such as the math symbol fonts (mathsy{12}) which contain calligraphic capitals.

In set-theoretic terms, the rationalization task involves the following steps:

• Begin with the union of all TEX characters, the set we have called TeXunion. Remember that this is the set of character names, not the glyphs themselves.

• Partition this union set into the largest subsets which do not cross encodings. This partitioning is a set of proper subsets of TeXunion; we call this set TeXpages. For example, all the uppercase letters A–Z make such a subset. A combination of all upper- and lowercase letters A–Z and a–z does not, because the small-caps fonts do not contain the lowercase letters. These subsets are equivalent to Knuth’s Computer Modern METAFONT “program” files, because this was the highest level of source file nesting in which he did not make conditional the generation of characters.

• Encode the TEX character union set for a new, universal 16-bit encoding. That is, we invent a mapping of TeXunion members to unique 16-bit integer codes. Most of the members of TeXunion appear in Unicode and so have a natural encoding already determined. For the TeXunion members not in Unicode (which includes all the small-caps letters), we shall promulgate (by fiat) assignments to the Unicode private-zone codes. We designate this subset of TeXunion as TeXpzone; this subset finds its concrete representation in the private-zone codes expressed in the character synonym table. We have the relation

    (Unicode ∪ TeXpzone) ⊃ TeXunion,

since Unicode contains many characters not used in TEX. We can compute a mapping:

    TeXunion ↦ (Unicode ∪ TeXpzone)

by matching character names from the left to the names on the right; in this way we arrive at a Unicode code number for each TEX character. We designate this mapping (a 16-bit number for each member of TeXunion) as TeXeru (“TEX encodings rationalized to Unicode”, rhymes with “kangaroo”). This


mapping is the key result of rationalizing Computer Modern into Unicode.

• We can think again of a large table having, on one axis, the TEX fonts, on the other, the members of TeXpages, and bullets wherever Computer Modern implements METAFONT glyphs. This table will be sparsely and irregularly populated. The sparseness reflects the fact that the TEX fonts cover a wide range of characters, while the variations in style mostly are typographic distinctions on alphabets and punctuation; in other words, TEX provides more than a few symbol sets, in contrast to the simple ANSI/Symbol set distinction in 8-bit Windows fonts. The irregularity in this imaginary table results from the ad-hoc arrangement of TeXpages among TeXfonts. The sparseness is not a deficiency, but we ought to have some goal in mind for the rational extension of Computer Modern and the other TEX fonts to populate areas of this table for the sake of regularity. Realizing this goal would require drafting of new METAFONT code and translation to outlines. This is similar to how the DC font project extended Computer Modern to the rational T1 encoding.

• At this stage we are ready to determine a list of actual Computer Modern Unicode fonts which will cover TeXfonts. While Haralambous retained 8-bit PK fonts as the actual fonts for virtual Unicode fonts [7, 6], we will create a converse realization, namely Unicode TrueType fonts as the actual fonts for virtual CM, T1, or UT1 encodings.

Using the imaginary table just described, we can take the union of row subsets such that columns are not overlapped. If we want to maintain stylistic uniformity within individual fonts, we merge rows subject to a personal decision as to which rows “belong together” in a stylistic sense. On the other hand, if we want to minimize the number of actual fonts and don’t mind different styles in a single font, we can make a “knapsack” optimization to pack the rows as tightly as possible. (Indeed, if we discard the Unicode conformity, we could put all of Computer Modern into a single Unicode font!) In any case, this collapsing of rows involves imprecise judgments to arrive at an optimized reduced set of fonts.

Implicit in this reduction is the factoring of wildcarded optical sizes that was introduced in TeXfonts; we call this reduced set of fonts TeXinUni, which will have a similar wildcarding to its parent TeXfonts, but fewer members in parallel with the reduction.

• To produce each real Unicode font (a member of TeXinUni), we assemble the glyphs and metrics from TeXfonts and install them via TeXeru into each code of the mapped-to TeXinUni member.

• We must finally produce Omega virtual fonts (that is, .xvp files) which will map 8-bit DVI codes from the old TEX fonts into TeXeru codes in TeXinUni members. For this we use the TrueTEX metric exporter to generate an .xvp file, and XVPtoXVF to convert this to an .xvf file; the .xfm file also produced contains the same information as the METAFONT .tfm and may be discarded if only TEX82 is to be used for formatting.

Generating Unicode virtual fonts for non-TEX fonts

Let us consider a converse task: instead of converting single-byte-encoded Computer Modern fonts into Unicode fonts, let us assume we have a Unicode font in TrueType form, and want to make it usable with TEX or Omega. To use a font, TEX (and Omega) require a .tfm (or .xfm) metric file and a .vf (or .xvf) virtual font file. The virtual font is necessary only if a remapped encoding or composition is needed (usually the case).

To generate metric and virtual font files for Unicode fonts in Windows, TrueTEX provides a File + Export Metrics item which takes the user through several steps which illustrate the elements of such a translation:

• First, the user selects the font from the Windows standard font-selection dialog. For example, in Windows, standard fonts include Arial (a Helvetica clone), Courier, and Times New Roman (a Times Roman clone), together with their bold and/or italic variations. Windows will also install other TrueType fonts or (with Adobe Type Manager) Type 1 fonts. After the user selects a font, TrueTEX has a “font handle” with which it can access all the geometric information needed to calculate global and per-character metric quantities for the font.

• Second, the user must select names for the output .xvp file. Font names in Windows are verbose strings containing several words (such as, “Times New Roman Regular (TrueType)”), while TEX insists on a single-word alphabetic name; therefore TrueTEX selects a TEX-style name for the font based on the TrueType file

TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting 155 Richard J. Kinch

name (such as “times”). The metric output in this case would be a file times.xvp.

• Third, the user must specify the “input” and “output” encodings for the virtual font. The input encoding is the encoding of the actual font, typically ANSI or Unicode. The output encoding is the virtual encoding which the user desires to construct for TeX’s point of view, and is typically a member of TeXencod. TrueTeX uses encodings in the form of .cod files (each line contains a hex code field followed by the character name field) or .afm (Adobe Font Metric [1]) files.⁴ To select an encoding, the user selects an item from a list which TrueTeX presents, each item giving a description of the encoding (for example, “TeX Roman with ligs = 2” corresponds to the roman2.cod file). The user can also browse for encoding files instead of selecting from the canned list. The user can edit custom encoding files (which are just text files in afm format), and thereby gains complete flexibility of input versus output encoding in the virtual fonts, including automatic production of composites for missing input characters. The ttf_edit [17] program originates afm encoding tables from existing TrueType fonts, allowing maximal compatibility with randomly-encoded fonts.

⁴ .afm files need not contain metrics; they can simply define an encoding for a dummy font, using only the C, CH, and N fields of the CharMetrics table. We thus maintain compatibility with other afm-reading software and avoid inventing yet another file format.

• User input is now complete. TrueTeX begins analysis of the information provided by forming in-memory encoding tables from the encoding files (using sparse-array techniques to manage large, sparse code ranges). TrueTeX sorts and indexes the tables for fast content-addressability by either code or character name, assembles the global metrics in xvp terms for the font, and visits each input code in the font to build a table of per-character metrics and ligatures. xvp file building may now begin.

• For each output character name, TrueTeX determines if an exact match exists to an input name, and thus to an input code. If there is such an exact match, TrueTeX emits xvp commands which give the character’s metrics and which re-map the TeX PUT/SET commands to the Unicode positions.

• For output characters which have no exact match to input characters, TrueTeX invokes the “composition engine.” If the composition engine can compose or substitute a glyph for the character, it emits the xvp commands for the metrics and other actions. If the composition engine is “stumped”, TrueTeX emits an xvp comment to note that the character is unencoded.

Note that virtual fonts can use more than one input font to produce a virtual output font. This would allow, for example, a text font, an expert font (such as might contain ligatures), and a symbol font (such as might contain math symbols) to contribute to a single TeX Unicode font. Another use for this technique would be the assembling of a Unicode font from the old bit-mapped PK fonts. TrueTeX supports all of the TeX and Omega-extended virtual font mechanisms, but does not yet directly support multiple input fonts for metric export (or bit-mapped fonts, for that matter), although an expert user can merge multiple virtual-property-list files to produce such a virtual font.

Composing missing characters

The Unicode and TeXunion sets are disjoint, but the virtual font mechanism allows users to create virtual TeXunion characters missing from a Unicode font with various composition or substitution methods. TrueTeX uses this technique to create completely populated virtual fonts when the underlying TrueType fonts are missing accented characters or ligatures.

To allow for easy upgrading, the “composition engine” in TrueTeX uses a user-modifiable script in a PostScript-like language to control the composition and substitution process. By changing the script, the user can add new composition methods or substitution rules, or adapt the methods to various typographic conventions. TrueTeX, using its own mini-PostScript interpreter, interprets this script at metric-export time, which means that users who speak PostScript and know a bit of font design can customize the composer. A good script yields a much better TeX virtual font, since commercial fonts are typically missing characters that TeX considers important, and the script can fill in most of the missing pieces.

When the composition script receives control from TrueTeX, all the encoding and metric information for the font and input and output encodings are defined as PostScript arrays and dictionaries. The standard composition script in TrueTeX implements the following techniques:

• Accent-plus-letter composition: if the name implies that the character is an accented letter, the script decomposes the name into

156 TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting Extending TEXforUnicode

the letter and accent components, and (when the letter and accent exist in the input font) uses the geometric information and typographic conventions to overlay the accent onto the letter in a virtual accented character, as shown in Table 7. Since the composer has detailed geometric information on the glyph shape, which is more elaborate than the bounding-box metrics TeX uses, it can do a careful job of placing accents.

Table 7: Composable Accent Characters.

Name            Position        Example
umlaut          top-center      ö
acute           top-center      ó
breve           top-center      ŏ
caron¹          top-center      ǒ
cedilla         bottom-center   o̧
circumflex      top-center      ô
comma           top-right       Ľ
dieresis        top-center      ö
dotaccent       top-center      ȯ
grave           top-center      ò
hungarumlaut    top-center      ő
macron          top-center      ō
ogonek¹         bottom-right
period          top-center      ȯ
ring²           top-center

¹ For D/d/L/l, changes to comma at top-right.
² Accent is not present in the Computer Modern fonts used in this portable document.

• Ligature composition from sequence of letters: A ligature character (not to be confused with the ligature rules of the exported TeX metrics, a different topic) such as the T1 character “SS” will not usually exist in a TrueType font. Code positions for ligatures are not part of the Unicode standard,⁵ so even common ligatures are often not present. The composition script forms these by concatenating the component letters within the bounding box of the TeX character. This is applied to the ligatures: ff, fi, fl, ffi, ffl, IJ, ij, SS, Æ, æ, Œ, and œ.

⁵ Unicode does contain ligatures which are phonetic letters in certain languages, but this does not include typographic ligatures such as the f-ligatures.

• Remapping of certain names: The synonym table shows the problems of character names which are not standardized. An ad-hoc section of the composition script fixes up any ambiguities by recognizing ambiguous names and making the appropriate substitutions.

• Future extensions: Several more exotic composition methods are possible for an upgraded composition script. A novel idea is an “emergency fall-back” character generator: as a last resort for a missing character, the script could have a table of low-resolution renderings for all of TeXunion, consisting of (for example) an 8 × 16 dot-matrix or a plotter-style stick font; virtual font commands would render these with rules. In this method we could render any TeX document in a crude but accurate fashion without any fonts or \special’s at all!⁶

⁶ This could solve the problem of rendering TeX documents in HTML browsers.

Another idea would use the ability of virtual fonts to call upon the TeX \special command. A Bézier curve special could draw and fill glyphs without the need for operating system support for fonts. This would not be efficient, and hinting would be missing on low-resolution devices, but it would place all the scalable font information in an XVF file.

Projecting ligature rules into TrueType fonts

The TrueType fonts in Windows do not supply any ligature rules such as are contained in Computer Modern. To export metrics containing the usual TeX ligature rules, TrueTeX considers the rules in Table 8 when exporting the global vpl (xvp) metrics, when the target ligatures exist in the font, or when the ligatures can be produced by the composition engine.

Supporting metric export formats

TrueTeX supports both .vpl (TeX Virtual Property List) and .xvp (Omega Extended Virtual Property List) file formats when exporting font metrics. This is more than merely a variation in format; when exporting to .vpl format, the encodings are truncated to 8 bits, so that the composition process for missing characters will likely be more intensive. TeXware programs VPtoVF and XVPtoXVF translate the property list files to their respective .tfm, .xfm, .vf, and .xvf binary formats for use with TeX and Omega. TrueTeX launches the appropriate translator after the property-list export is complete.

TrueTeX metric export also supports the older .pl (TeX Property List) metric file format, and the companion program PLtoTF, should it be needed


Table 8: Ligature Rules Applied to Exported Fonts.

First        Second       Result          Description

Dashes
hyphen       hyphen       endash          -- to endash
endash       hyphen       emdash          endash- to emdash

Shortcuts to national symbols
comma        comma        quotedblbase    ,, to quotedblbase
less         less         guillemotleft   << to left guillemot
greater      greater      guillemotright  >> to right guillemot
exclam       quoteleft    exclamdown      !‘ to exclamdown
question     quoteleft    questiondown    ?‘ to questiondown

F-ligatures
f            f            ff              ff to ff ligature
f            i            fi              fi to fi ligature
f            l            fl              fl to fl ligature
ff           i            ffi             ffi to ffi ligature
ff           l            ffl             ffl to ffl ligature

Paired single quotes to double quotes
quoteleft    quoteleft    quotedblleft    ‘‘ to quotedblleft
quoteright   quoteright   quotedblright   ’’ to quotedblright
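The rules of Table 8 can be read as a pair-substitution table keyed by character name. The following is a minimal Python sketch under stated assumptions: the names `LIGATURE_RULES`, `exportable_rules`, and `apply_ligatures` are illustrative and are not part of TrueTeX, the filter keeps a rule only when its result glyph is available (a simplification of the export conditions described in the text), and the left-to-right rescanning only approximates what a TeX formatter does with the exported ligature program.

```python
# Table 8 as a mapping from (first, second) character names to the result name.
LIGATURE_RULES = {
    ("hyphen", "hyphen"): "endash",
    ("endash", "hyphen"): "emdash",
    ("comma", "comma"): "quotedblbase",
    ("less", "less"): "guillemotleft",
    ("greater", "greater"): "guillemotright",
    ("exclam", "quoteleft"): "exclamdown",
    ("question", "quoteleft"): "questiondown",
    ("f", "f"): "ff",
    ("f", "i"): "fi",
    ("f", "l"): "fl",
    ("ff", "i"): "ffi",
    ("ff", "l"): "ffl",
    ("quoteleft", "quoteleft"): "quotedblleft",
    ("quoteright", "quoteright"): "quotedblright",
}


def exportable_rules(available):
    """Keep only rules whose result glyph exists in the font or can be composed.

    `available` is a set of character names (hypothetical input; a real
    exporter would derive it from the font and the composition engine).
    """
    return {pair: lig for pair, lig in LIGATURE_RULES.items() if lig in available}


def apply_ligatures(names, rules):
    """Apply pair rules left to right, re-examining each new ligature.

    After a substitution the result is compared with its left neighbour and
    with the next input name, so that f + f + i becomes ff + i and then ffi.
    """
    out = []
    for name in names:
        out.append(name)
        while len(out) >= 2 and (out[-2], out[-1]) in rules:
            second = out.pop()
            first = out.pop()
            out.append(rules[(first, second)])
    return out


if __name__ == "__main__":
    print(apply_ligatures(["f", "f", "i"], LIGATURE_RULES))
    print(apply_ligatures(["hyphen", "hyphen", "hyphen"], LIGATURE_RULES))
    # A font that lacks the ffi glyph keeps "ff" followed by "i".
    limited = exportable_rules({"ff", "fi", "fl"})
    print(apply_ligatures(["f", "f", "i"], limited))
```

Keeping the rules in a plain dictionary makes the export condition a simple filter: a rule is dropped when its result name is neither present in the input font nor producible by composition.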

for use with older TeX software. In this case the user can specify only an input font encoding, and the property list reflects this encoding as applied to the TrueType font selected, without virtual remapping.

While exporting .xvp files will connect the TrueTeX previewer to Windows’ Unicode fonts, the TeX82 formatter requires .tfm metric files, not Omega .xfm files. If the output font has an 8-bit encoding, the resulting virtual font is nevertheless compatible with the original TeX formatter’s 8-bit character codes, and will not require the Omega formatter. To create a .tfm file for such a font, a TrueTeX filter program xvptovpl truncates the virtual codes in the .xvp file and produces a truncated .vpl file, and via VPtoVF, a .tfm file for use with TeX. The .vf file created in this process is discarded, since it does not properly map the 8-bit characters to Unicode. The .xfm and .tfm files produced by this process will contain the same information; the .xfm format is needed only if Omega is to be used. TrueTeX uses the .xvf file to map the 8-bit TeX characters to Unicode positions.

Implementing sparse metric tables

In implementing programs which use metric data, we must take care to apply sparse-matrix techniques to avoid enormous memory demands from nearly-empty font-metric tables. Sixteen-bit encoded fonts are typically sparsely populated. For example, the Windows NT text fonts contain about 650 characters each; most codes are in the range 0–0x2ff, with some symbols in the 0x2000 vicinity and a few odd characters in the private zone at 0xf001 or 0xfb01. We would expect such a segmented locality in a typical font.

One technique is to use a segmented table with binary-search lookup; this is close to the method used in TrueType fonts. A hash table for the two-byte keys may be used instead of the binary search. Segmented tables will require the least storage, at the expense of a possible hashing performance problem in the event of degenerate tables. Since all Unicode-capable operating systems are advanced enough to support virtual memory, the performance risk does not justify the memory savings.

TrueTeX uses a 2-level pointer technique: metrics for a 16-bit code table consist of a table of 256 pointers to metric tables with 256 entries each. In this way a typical font having perhaps 6 or 7 contiguous code populations has very little wasted space. The worst-case and best-case performance are both acceptable. The lookup time is accelerated by avoiding a null-pointer test by pointing unencoded pages to a dummy table of zero-width metrics.

A time-bomb of a problem looms with Omega’s .xfm format [12], which recklessly ignores any sparse-array issues. An .xfm file, like the older .tfm format, keeps an unpacked char_info array which spans the smallest to largest codes (that is,


the interval [bc,ec]). Since the Unicode private-zone population typically extends through 0xfb00, practically all fonts will have a char_info of about 126 KB, almost all wasted space. This will result in a typical .xfm file being 130 KB, instead of about 4 KB that a simple sparse-array technique would provide. It is imperative that we upgrade the .xfm format and xfm-reading programs to better handle sparse encodings.⁷

⁷ The .vf and .xvf virtual font formats have always packed sparse per-character information, so they need no such attention.

Summary

Let us review the areas of development needed to extend all of TeX to Unicode:

• Extending the .tfm file format and its run-time forms to large, sparsely populated fonts, without sacrificing backward compatibility and without exploding file lengths. We have seen that the .xfm format can represent the information, although it needs improvement for sparsely populated fonts.

• Extending Computer Modern and other meta-fonts to fully populate the appropriate Unicode positions. A complete Unicode text font requires about 500 symbols. While it is unlikely that all the styles of Computer Modern will receive the attention to fill the tables completely, we can at least insert legible placeholders.

• Creating a formal database of TeX character names, joinable to the Unicode official names. Some standard is necessary for any development, and there is no reason to favor anyone’s favorite names. What is crucial is that the registry be initiated now, so that TeX software authors have an early start on making interchangeable fonts and documents. While TeX users cannot themselves dictate the standard Unicode names, we can at least make a stand-in version for our own use (since none seem to exist at present), and if an acceptable set of Unicode names comes along, we can adopt it later.

• Creating a formal database of all 28 TeX font encodings and their 31 greatest common subsets, joinable to Unicode and other encoding standards. The TrueTeX tables are available to all for examination and use [16]. Once these are improved by public usage and scrutiny, they should be adopted as a formal standard for TeX.

• Identifying and assigning non-Unicode TeX character names to the Unicode private zone, thereby promoting inter-operability of Unicode TeX implementations. These should be concretely represented in a synonym table, which also needs to be published. There are many potential conflicts, and taking counsel from as many different users of TeX as possible is the only way to maximize the compatibility of the result. The Omega project has already started to stake out claims on the virgin Unicode real estate [4, Table 4, page 425] for Ω-Babel. There are no doubt synonyms and ambiguities outside our own experience; one can only hope there is sufficient room for all interested parties.

• Extending DVI translators to accommodate the extended .tfm, .vf, and .dvi formats, including sparse-array techniques for efficient run-time performance. The Omega project has issued this call to “DVI-ware developers” [4, page 426, Conclusion], although with surprising aplomb for the implications. We hereby respond with our implementation in TrueTeX, and invite others to build on our experience.

• Promulgating the ongoing TeXeru, the TeX-in-Unicode mapping, based on a seasoned registry. An ongoing authority for additions and corrections will be vital. This authority will be responsible for registering new TeX character names and avoiding Unicode conflicts.

• Changing the plain TeX and LaTeX macros to accommodate the 16-bit encoding extensions, while maintaining backward compatibility from a single source. This is a tall order, and one we have not touched.

• Extending \special handling for 16-bit character sets. Now we open up the carousing command of TeX to a whole new vista of revelry, with the gift of tongues. This is another item that we shall put off for now.

• Implementing TeX and DVI translation user interfaces in selectable languages. While Omega processes in Unicode, it talks to the user in the old 8-bit fashion. Perhaps it is a bit much to expect a Web2C TeX change file to incorporate wchar_t and other Unicode constructs of the C programming language. But DVI translators for Windows NT and other Unicode-capable platforms should have this designed in from the start. A properly designed application can be reimplemented for another language by any non-programmer who knows the application and can translate the messages; no programming or recompilation is required.


• Establishing provisions for orderly extension of the fonts and character sets, so that new TeX fonts, characters, and encodings may be incorporated into later versions. Having made the effort to retrofit METAFONT output for an altogether different way of encoding, we would hope that font designers recognize the shortcomings of the 7- and 8-bit encodings and keep the promises of Unicode in mind.

• Rationalizing the TeX fonts into orthogonal styles and weights (such as cmmb10 and cmmr10, for example). As TeX users we don’t care about this, but if Computer Modern is to be accepted in non-TeX applications, the style axes will have to be fully varied and populated along the conventional ranges.

• Providing a means for creating virtual fonts for non-TeX Unicode fonts in TrueType or Type 1 format. Although this capability is available now only in the commercial TrueTeX Unicode edition, a new TeXware stand-alone tool could interpret TrueType font files (or whatever typeface technology is supporting Unicode rendering) and join the encoding and other information into an .xvp file.

References

[1] Adobe Systems Incorporated. Portable Document Format Reference Manual. Reading, Mass.: Addison-Wesley, 1993.

[2] Adobe Systems Incorporated. Adobe Font Metrics (AFM) File Format Specification, Version 4.1, October 1995. [Published in PDF and PostScript form at ftp://ftp.adobe.com.]

[3] Fairbairns, Robin. “Omega — Why Bother with Unicode?” TUGboat 16,3 (1995), pages 325–328.

[4] Haralambous, Yannis, John Plaice, and Johannes Braams. “Never Again Active Characters! Ω-Babel.” TUGboat 16,4 (1995), pages 418–427.

[5] Haralambous, Yannis. “Ω, a TeX Extension Including Unicode and Featuring Lex-like Filtering Processes.” Proceedings of the Eighth European TeX Conference, Gdańsk, Poland, 1994, pages 153–166. [There is an Omega Web page at http://www.ens.fr/omega. See also the Omega FTP site [12].]

[6] Haralambous, Yannis, and John Plaice. “Ω + Virtual METAFONT = Unicode + Typography (First Draft).” ftp://nef.ens.fr/pub/tex/yannis/omega/cernomeg.ps.gz, January 1996.

[7] Haralambous, Yannis. “The Unicode Computer Modern Project.” [A document link on the Omega Web page [5].]

[8] Kinch, Richard. “Belleek: TeX Virtual T1-Encoded Fonts for Windows TrueType.” Available on the author’s Web site as a LaTeX document in belleek.zip. [The Belleek software is an implementation of T1-encoded fonts using TrueType scaling technology under Microsoft Windows. Belleek consists of TrueType fonts which render elements of METAFONT glyphs in scalable form, plus TeX virtual fonts which remap and compose T1 characters from these elements.]

[9] Kinch, Richard. “MetaFog: Converting METAFONT Shapes to Contours.” TUGboat 16,3 (1995), pages 233–243.

[10] Knappen, Jörg. “The European Computer Modern Fonts.” CTAN file: tex-archive/fonts/dc/mf/dcdoc.tex.

[11] Knuth, Donald. “Virtual Fonts: More Fun for Grand Wizards.” TUGboat 11,1 (1990), pages 13–24.

[12] Plaice, John, and Yannis Haralambous. “Draft Documentation for the Ω System.” ftp://nef.ens.fr/pub/tex/yannis/omega/first.tex, February 1995. [See also the Omega Web page [5].]

[13] Plaice, John. “Progress in the Ω Project.” TUGboat 15,3 (1994), pages 320–324.

[14] The Unicode Consortium, Inc. http://www.unicode.org. [This site holds files listing codes and descriptive phrases. There are no sample character images, and there are no PostScript names. A monograph gives paper images [15].]

[15] Addison-Wesley Publishing Company. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volumes I and II.

[16] The encodings (consisting at present of 46 encoding sets containing over 9000 pairs, represented in .afm format) herein listed in Tables 1, 4, and 5, are available via the author’s Web site.

[17] The programs ttf_edit and joincode for DOS, Windows, and Linux are available via the author’s Web site.

Russian Encoding Plurality Problem and a New Cyrillic Font Set

L.N. Znamenskaya and S.V. Znamenskii
Krasnoyarsk State University, Svobodnyi prospekt 79, 660041 Krasnoyarsk, Russia

Abstract

Running TeX with Cyrillic on a network is a problem. The various widespread Cyrillic coding tables under DOS, UNIX and other operating systems are incompatible. ASCII Russian text imported from a different system usually becomes completely unreadable. A new set of fonts, TDS, and some other tools offer a solution to the problem for eastern European users of Cyrillic typesetting.

TeX has become one of the best known means of communication between scientific people. To solve the problem of plural incompatible Russian TeX systems, the Russian Foundation for Basic Research (RFBR) proposed the idea of creating a standard non-commercial Russian TeX distribution. Therefore, half a year ago the new “Russian TeX” project was begun under RFBR support. An important feature of the project is to determine the best system which is able to work in a LAN, with various client platforms and operating systems.

The new TDS (TeX Directory Structure) standard gives us the perfect base for such a system. The problem we find here is specific to the Cyrillic-based languages. It is the Russian encoding plurality problem. For example, there exist several widely-used Russian coding tables under UNIX. Even Microsoft® uses completely different coding tables for Russian text under DOS and Windows® on the same PC. At the same time, in different directories on CTAN, we can find METAFONT sources for Cyrillic fonts with the same name cmrz10 but with different character codes for the Russian letter “A”.

Fonts

The first thing we had to do was to select an available Cyrillic extension of the standard TeX font set and fix new names so that the name of each font reflects its coding table. As soon as we found that the CyrTUG LH fonts were not available free of charge for the non-commercial RFBR distribution, we asked N. Glonty and A. Samarin for permission to use their fonts, as they are the first and the most widely used TeX fonts in Russia. After a period of a month and a half, we received their very kind and gracious permission to use or modify the fonts or their sources for the RFBR distribution, and we very much appreciate such a generous decision. Unfortunately, we could not wait so long, and by that point in time the development of the new Russian extension of the CM TeX font family was at the kerning stage. So it happened that we obtained an extra Russian extension of the CM font family.

We tried to realise the following aims in this new font set:

• to keep the original CM font sources unchanged, so that the extension sources can input them and thereby provide appropriate Latin text when typesetting with the new fonts;
• to make the text and letters more familiar to the Russian eye, while keeping the character of the traditional CM fonts;
• to make letter darkness in text more uniform;
• to make all CM source based fonts, including concrete, available for Russian typesetting;
• to avoid possible low-resolution font-creation errors causing problems while using automatic font generation; and
• to lay the foundation for future support of all Cyrillic-based alphabets of the Russian people.

We used CM macros, fragments of CM code and a bit of cmcyr code. The LH font family was used just for comparison in the first stage. When the new fonts were almost ready it was decided to compare their typesetting quality with one of the best sources of widely distributed fonts: the Samarin and Glonty Cyrillic fonts. A large mathematical paper was printed at 10 and 12 points on a 600 dpi HP LaserJet 4 printer, the same text in two copies printed with different font sets. There was a blank page in each copy for experts to write their opinion. The RFBR experts (physicists and mathematicians) compared the two, and determined that both Russian font families are of the same good quality.


What should we do with the new font names? The first idea was to use the fontname scheme. In this way, we made the name of an extended 8-bit font much too different from the name of the corresponding standard 7-bit CM font. As a result users would have problems while adopting new styles and using the TeX primitive font selection commands. To reduce such problems we decided to create a font name from RF (Russian Font + Russian Foundation); to use the third char (digit) in the name to point to the coding table, and to end by using the same char sequence as that used by the corresponding CM font. One can see the examples in the tables.

The empty boxes in the font tables will be filled by other Cyrillic letters in the next version of the fonts. It is impossible to support all Cyrillic-based languages with the same 8-bit coding table, since the number of different letters is more than 256. The project is working on a coding table which would allow typesetting in more than sixty Cyrillic-based languages with the use of accents or virtual fonts or \charsubdef. The list of languages to be supported in such a way contains all of the Cyrillic-based languages of Russia.

Russian encoding plurality problem

We need to support the typical situation of an entire TeX file system residing on a server, with clients working under different operating systems using various Russian encodings. The main problem is to select the appropriate procedure for inputting TeX files with any encoding.

Our way to solve this problem is to create an executable which would recognize the Cyrillic coding of a file in the correct way, and then recode it automatically to conform to the local coding.

Why not? Anybody who reads Russian can easily recover the text in the right coding from the same text in the wrong coding. But as soon as we try to look more carefully at the problem, we see multiple problems.

The coding tables one-to-one correspondence as a part of the problem. If a binary file is occasionally recoded as Cyrillic, it is useful to have the capability of recovering an accidentally-converted file. Networking forces the problem to be more difficult: multiple conversions must preserve the original information. We cannot see a way to solve this problem without the additional difficulty of creating a proper conversion algorithm. It is natural to preserve the first 128 positions of the ASCII code table. In the last 128 positions, we have to put a one-to-one correspondence between each set of coding tables in a consistent way.

Unfortunately, this is not possible. The set of symbols in this part of the coding tables differs very much from one table to another. Therefore we have to permit a character to change meaning during conversions. We try to considerably decrease the set of possible meaning changes. The desirable solution is to split the set of all possible char meanings into 128 equivalence classes such that any conversion can change the symbol meaning only inside its equivalence class. This is also impossible. Some of the meanings will necessarily be found in different classes and the best thing we can do is to use the less valuable meanings for such a mess. You can see the summary of the various available information on Cyrillic coding tables [1]–[8] and our proposals on the one-to-one table correspondence in a huge table below. In this table the numbers 0, 1, 2, 3, 4, 5 respectively denote ISO8859-5, CP1251, PC866, KOI-8, MacOS, and PC855.

The problem of other Cyrillic languages. There are more than 60 Cyrillic-based languages and some of them still do not have settled coding tables. Most of the files contain a lot of non-text commands. There is a lot of software which puts non-ASCII chars into a file, and the program has to distinguish, as far as is possible, the right Cyrillic words from the combinations of such symbols.

We therefore cannot use only the char set information of the file to discover the coding table of a document. Another problem we see is that some coding tables use the same char set. As we need to get the right solution for a short file, it is also inadequate just to count the number of each letter appearing in the text. A more precise instrument would be to count the number of each combination of two letters appearing in the document.

This effective approach requires more than 128 kilobytes of memory for intermediate data storage. The natural algorithm to perform a proper statistical analysis of this data includes repeated computation of logarithms and is not fast enough, especially on a PC. How can we get an acceptable result in a simple and fast way?

The next idea was to select two sets of possible strings of length 2: the set, A, of frequently-appearing Cyrillic text bicharacter strings and a set, U, of commonly unused Cyrillic text bicharacter strings. The executable counts the numbers $N_A$ and $N_U$ of strings from A and U, respectively, appearing in the file. The number $C = (N_A - N_U)/(N_A + N_U)$ will show whether this file looks like Cyrillic text or not. Such a number can


where the meaning where the meaning 235 box drawings light up and right 23 box drawings down single and right double 14 left single quotation mark 0145 cyrillic capital letter 235 box drawings light up and left 23 right half block 14 right single quotation mark 0145 cyrillic capital letter 235 box drawings double up and left 23 box drawings down single and left double 14 left double quotation mark 0145 cyrillic capital letter 235 box drawings double up and right 23 left half block 14 right double quotation mark 0145 cyrillic capital letter byelorussian-ukrainian i 235 box drawings double down and left 3 top half integral 14 double low-9 quotation mark 01245 cyrillic capital letter yi 4 pound sign 23 box drawings up single and right double 235 box drawings light down and right 0145 cyrillic capital letter 1 single low-9 quotation mark

23 box drawings up double and right single 0145 cyrillic capital letter Table 2: the symbols look more-or-less like

23 box drawings up single and left double left/right coma quotation 0145 cyrillic capital letter

23 box drawings up double and left single 0145 cyrillic capital letter

23 bullet operator 0145 cyrillic capital letter

23 box drawings vertical single and right double where the meaning 0145 cyrillic capital letter 3 greater-than or equal to *) 23 box drawings vertical double and right single 01245 cyrillic capital letter ukrainian ie 0145 cyrillic small letter dje 3 division sign *) 23 box drawings vertical single and left double 01245 cyrillic capital letter 0145 cyrillic small letter gje 3 less-than or equal to *) 23 box drawings vertical double and left single 01245 cyrillic small letter ukrainian ie 0145 cyrillic small letter dze 3 almostequalto*) 23 box drawings down single and horizontal double 01245 cyrillic small letter short u 0145 cyrillic small letter byelorussian-ukrainian i

3 bottom half integral Table 3: pc855/pc866 splittings 01245 cyrillic small letter yi *) for this coding table

23 box drawings down double and horizontal single 0145 cyrillic small letter je

23 box drawings up single and horizontal double 0145 cyrillic small letter lje

23 box drawings up double and horizontal single 0145 cyrillic small letter nje where the meaning 23 box drawings vertical single and horizontal double 23 box drawings down double and left single 0145 cyrillic small letter tshe 145 left-pointing double angle quotation mark 23 box drawings vertical double and horizontal single 23 box drawings down double and right single 0145 cyrillic small letter kje 145 right-pointing double angle quotation mark 23 full block *) 4 less-than or equal to *) 0145 cyrillic small letter dzhe 235 box drawings double vertical and left 1 single left-pointing angle quotation mark 235 box drawings light vertical and right 14 cyrillic capital letter 4 greater-than or equal to *) 235 box drawings double vertical and right 235 box drawings light vertical and left 1 single right-pointing angle quotation mark 14 cyrillic small letter ghe with upturn

Table 1: the non-russian letters Table 4: the symbols look more-or-less *) for this coding table like left/right angle quotation *) for this coding table

TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting 163 L.N. Znamenskaya and S.V. Znamenskii

be computed for each known coding table, and the largest value must point to the right coding table. It seems to be fast, easy and effective, because the most frequently used combinations of two characters (less than 5% of all combinations) give more than 50% of the bicharacter substrings in Russian text, and approximately half of all possible combinations are practically never used in Russian. The "only" problem remaining is to select the sets A and U properly.

    where    the meaning
    3        superscript two
    01245    numero sign
    23       middle dot *)
    0145     section sign
    25       lower half block *)
    134      copyright sign
    235      box drawings light vertical
    14       not sign
    235      box drawings light vertical and horizontal
    14       registered sign
    235      box drawings light down and left
    14       plus-minus sign
    235      box drawings double horizontal
    14       micro sign
    235      box drawings double vertical
    14       pilcrow sign
    235      box drawings light down and horizontal
    14       en dash
    235      box drawings light up and horizontal
    14       dash
    235      box drawings light horizontal
    14       dagger
    235      box drawings double down and right
    14       bullet
    235      light shade
    14       horizontal ellipsis
    235      box drawings double down and horizontal
    14       trademark sign
    4        not equal to
    235      box drawings double up and horizontal
    1        double dagger
    4        infinity
    235      box drawings double vertical and horizontal
    1        not used
    4        increment
    235      upper half block
    1        per mille sign
    012345   no-break space
    4        division sign *)
    235      medium shade
    1        broken bar
    4        latin small letter f with hook
    235      dark shade
    1        middle dot *)
    4        almost equal to *)
    235      black square
    1        not used
    5        full block *)
    1234     degree sign
    234      square root
    015      soft hyphen
    3        lower half block *)
    1245     currency sign

Table 5: other symbols
*) for some coding tables

How we selected A and U

A great help for us was the unique Gilyarovskii and Grivnin book [9], with text samples in most of the languages. We had to turn the samples into computer files in order to count biletter appearance numbers. A new problem then arose: what should we do with non-Russian letters?

There are no fixed coding tables for most of the languages. We also do not know of any other attempts to use a Russian keyboard and special TEX commands for typesetting most of the Cyrillic languages of Russia, [...] and Alaska. For each of the languages which use non-Russian letters, we have made two files: the first file has a char representation of the non-Russian letters, mostly according to the tables above, and the second file has more readable Russian letter sequences following the slash char (such as K for "K as in beak" or KC for "K as in desk") and makes maximal use of the standard TEX accent control sequences. For the Russian language, we used three different subject topics and a dictionary with 51924 words. Each of the other languages was represented by a single file. We obtained 109 files for 64 languages.

We cannot be certain that other people will use the same codes or sequences for non-Russian letters. Therefore, while counting the biletter strings for each file, we assign all letters with unknown codes to one group, identify all ASCII non-letters and assign them to another group, and assign all Latin letters unusable in Cyrillic text to a separate group. After counting, we selected the biletter strings which did not appear in any file. They composed the set U, with 695 elements.

The selection of the set A was more difficult. After several attempts we arrived at the following algorithm. For each couple of letters and each file, the logarithm of the "relative frequency" was computed. To avoid infinity, we changed zero frequencies to a small non-zero value, as if the biletter string appeared once in a file twice as long. Then we found the sums over all the files and used them for

selection. The most frequent 314 couples consist of only Russian letters, and almost every word contains at least one such biletter string. We had to avoid the effects of possible usage of other TEX names for non-Russian letters, or of other coding tables which may agree with our coding table only in its Russian part. Therefore we used only 306 of these couples, leaving out the biletter strings which our special notations for non-Russian letters could produce.

In this way, the Cyrillic coding recognition algorithm was finished.

Availability

The METAFONT sources of the RF font family and the sources of the Cyrillic coding recognition algorithm will be available from the RFBR TEX server via anonymous ftp: ftp.tex.math.ru.

Acknowledgements

This work was inspired and supported by the Russian Foundation for Basic Research, grant 96-07-89406.

References

[1] A. Chernov. Registration of a Cyrillic Character Set. RFC 1489, RELCOM Development Team, July 1993.
[2] J. Reynolds, J. Postel. Assigned Numbers. RFC 1700, USC/Information Sciences Institute, October 1994.
[3] T. Greenwood, J. H. Jenkins. ISO 8859-5 (1988) to Unicode. Unicode Inc. January 1995.
[4] M. Suignard, L. Hoerth. cp1251 WinCyrillic to Unicode table. Unicode Inc. March 1995.
[5] M. Suignard, L. Hoerth. cp10007 MacCyrillic to Unicode table. Unicode Inc. March 1995.
[6] M. Suignard, L. Hoerth. cp855 DOSCyrillic to Unicode table. Unicode Inc. March 1995.
[7] M. Suignard, L. Hoerth. cp866 DOSCyrillicRussian to Unicode table. Unicode Inc. March 1995.
[8] P. Edberg. MacOS Ukrainian [to Unicode]. Unicode Inc. April 1995.
[9] R. S. Gilyarovskii, V. S. Grivnin. Opredelitel' yazykov mira po pis'mennostyam [Identifier of the world's languages by their scripts]. 3rd edition, revised and enlarged. Moscow: Nauka.
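The recognition statistic described in the article above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' executable: the sets FREQUENT and UNUSED are tiny placeholders for the real sets A (306 strings) and U (695 strings), which the text does not list; Python codec names stand in for the known coding tables; and returning 0 when no string from either set occurs is our own convention. The last function illustrates the zero-frequency rule used when selecting A.

```python
import math

# Toy placeholders for the sets A and U (hypothetical; the real sets
# hold 306 and 695 biletter strings and are not listed in the text).
FREQUENT = {"ст", "то", "ол", "но", "пр", "ов"}
UNUSED = {"ъь", "ьъ", "ыы", "йъ"}

# Python codec names standing in for the known coding tables (assumption).
CODING_TABLES = ["cp866", "cp1251", "koi8_r", "iso8859_5"]


def bigrams(text):
    """Return all overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def cyrillic_score(text, frequent=FREQUENT, unused=UNUSED):
    """Compute C = (N_A - N_U) / (N_A + N_U) for already decoded text."""
    pairs = bigrams(text)
    n_a = sum(1 for p in pairs if p in frequent)
    n_u = sum(1 for p in pairs if p in unused)
    if n_a + n_u == 0:
        return 0.0  # our convention: no evidence either way
    return (n_a - n_u) / (n_a + n_u)


def guess_coding_table(data, tables=CODING_TABLES):
    """Return the coding table under which the bytes score highest."""
    return max(
        tables,
        key=lambda name: cyrillic_score(data.decode(name, errors="replace")),
    )


def log_relative_frequency(count, total):
    """Logarithm of the relative frequency of one biletter string.

    `total` is the number of biletter strings in the file. A zero count
    is treated as one occurrence in a file twice as long, so that the
    logarithm stays finite."""
    if count == 0:
        return math.log(1 / (2 * total))
    return math.log(count / total)
```

With realistic sets A and U, `guess_coding_table` would be called on the raw bytes of a file; the toy sets here only suffice for very short demonstrations.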

Cyrillic TEX files: interplatform portability

Peter A. Ovchenkov Russia, 125047, Miusskaya sq. 4, M. V. Keldysh Institute of Applied Mathematics [email protected]

Abstract We consider the problems of Cyrillic encoding that arise when working with different operating systems. The proposed solutions can simplify the administration and support of TEX on different platforms. The related problem of encoding Cyrillic TEX fonts is also discussed.

Preliminary notes

A number of new Cyrillic encodings have appeared recently. Maybe that's all? But no: the process needs further consideration. Why, indeed, ...is Bill Gates better than me? I would like to suggest trying a Cyrillic encoding for TEX fonts. If you don't like it, there are almost as many TEX fonts as Cyrillic encodings. If you multiply these two numbers, you can see fantastic potential for encoding suggestions.

I would also like to talk about native Russian typesetting, not transliteration. The examples I will use demonstrate only Russian letters; I do not have enough experience to talk about letters specific to other Cyrillic alphabets.

Problem

First of all, for historical reasons, many Cyrillic encodings were developed for different types of computers (and different operating systems). To simplify, I will speak only about computers with 8 bits/byte and with Latin characters encoded as US-ASCII. For such machines, at least four Cyrillic encodings exist. For computers running DOS, the most widely used encoding is the so-called alternative encoding (fig. 1). Its popularity here is because the pseudographic symbols have the same codes as those for extended ASCII. For MS Windows, there is the original Microsoft encoding that "honored Bill Gates" (fig. 2). On UNIX machines two encodings coexist: KOI-8 (fig. 3) and ISO 8859-5 (fig. 4). ISO 8859-5 is the official standard (GOST). Macintoshes use the ISO 8859-5 standard. That's enough without Cyrillic in EBCDIC-like encodings.

Figure 1: Cyrillic encoding "alternative" (code page 866).

Figure 2: Cyrillic encoding in MS Windows (code page 1251).

Let us consider the following problem: you need to process TEX files on computers with different Cyrillic encodings. Don't worry about your .tex files: (i) there are enough transcoder programs, and (ii) there are enough program tools (like text editors, etc.) that are outside TEX's control. But

what about .dvi files and style files? Oh, this is the chief problem for the TEX administrator in a heterogeneous network. He (or she) must:

1. define as letters (catcode 11) the Cyrillic letters for all encodings in use (during format generation)
2. transcode style files that contain Cyrillic (this would preclude sharing LATEX styles via an electronic network)
3. replace fonts with virtual fonts

The first two operations are simple, but if you forget to transcode a style file, for example, you get unusual output, and in some cases the mistake may be very difficult to locate.

Remapping fonts into virtual fonts is also not without problems: for example, what do you do with the kerning between punctuation (codes less than 127) and Cyrillic letters (codes greater than 128)? The use of full 8-bit fonts leads to incompatibility of .dvi files prepared on different platforms and dramatically increases the number of fonts.

Figure 3: Cyrillic encoding KOI8.

Figure 4: Cyrillic encoding ISO 8859-5.

Input Flow Transformation

Let us consider a hypothetical 8-bit font: the first half holds Latin symbols (symbols with codes 00h–7Fh), punctuation, digits, etc., and the second half contains Cyrillic letters. Let us denote the font encoding by $E_f$. For TEX (the binary executable program) we can define a transformation $\kappa$ such that $E_e \xrightarrow{\kappa} E_i$, where $E_e$ is the character encoding that TEX "sees" as input flow (i.e., the "native" encoding of a (soft)hardware system), and $E_i$ is the TEX internal encoding.

If we define $E_i \equiv E_f$, we can use the same fonts during editing and previewing on one machine type, and printing on another one. (This is obvious for systems with Latin typesetting, but not for Cyrillic!) We will then have that a .dvi file (created on a PC) can be converted by dvips on a SPARCstation without problems.

In the input flow you can refer directly to a symbol in the "internal" encoding (in $E_i$ representation). This fact allows one to create style files that can be used on different platforms simultaneously.

Input Flow Transformation in emTEX

In emTEX, Eberhard Mattes suggests a simple way to set up the $\kappa$ transformation. This is the creation of a TEX Code Page, which is then included in the precompiled format file.

The transformation is described in the .mtc file. Below is the beginning of the file for the transformation from the alternative to the experimental encoding (compare figures 1 and 5):

    %
    % alt_abs.mtc
    %
    ^^80 ^^c0
    ^^81 ^^c1
    ^^82 ^^c2
    ^^83 ^^c3
    ^^84 ^^c4
    ^^85 ^^c5
    ^^86 ^^c6
    ^^87 ^^c7
    ^^88 ^^c8
    ^^89 ^^c9
    ^^8a ^^ca
    ^^8b ^^cb
    ^^8c ^^cc
    ^^8d ^^cd
    ^^8e ^^ce
    ^^8f ^^cf

Here, the first column represents the character code that TEX sees in the input flow, and the second is the one used inside TEX. In the second column you can write TEX commands instead of character codes.

Next the .mtc file is compiled into the .tcp file:

    C:\EMTEX\DATA> maketcp -c -8 alt_abs.mtc
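The effect of such a code page on the input flow can be mimicked outside TEX. The following Python sketch is an illustration only: it parses lines of the simple form `^^xx ^^yy` shown above and ignores the possibility, mentioned in the text, of writing TEX commands in the second column. The function names are our own and are not part of emTEX, and codes that are not listed are assumed to pass through unchanged.

```python
import re

# The sixteen pairs shown in the text, regenerated programmatically:
# external codes 80h-8Fh map to internal codes C0h-CFh.
MTC_SAMPLE = "%\n% alt_abs.mtc\n%\n" + "\n".join(
    f"^^{0x80 + i:02x} ^^{0xC0 + i:02x}" for i in range(16)
)

PAIR = re.compile(r"\^\^([0-9a-f]{2})\s+\^\^([0-9a-f]{2})")


def parse_mtc(text):
    """Build a 256-byte translation table from `^^xx ^^yy` lines.

    Text after a percent sign is treated as a comment (assumption)."""
    table = list(range(256))  # identity for codes that are not listed
    for raw in text.splitlines():
        line = raw.split("%", 1)[0].strip()
        if not line:
            continue
        match = PAIR.fullmatch(line)
        if match is None:
            raise ValueError(f"unsupported line: {raw!r}")
        table[int(match.group(1), 16)] = int(match.group(2), 16)
    return bytes(table)


def to_internal(table, data):
    """Apply the transformation kappa: external bytes to internal codes."""
    return data.translate(table)
```

With the sixteen pairs above, the bytes 80h 81h 82h (the first three capital letters in the alternative encoding) become C0h C1h C2h, while ASCII codes are left alone.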

The code table is turned on while the format file is being created:

    C:\EMTEX\LATEX\BASE> latex -i -8 \
       -c alt_abs -mt15000 -mp65500 latex.ltx

There are some other options that you can find in the emTEX documentation.

Figure 5: Experimental font encoding ($E_i \equiv E_f$).

Input flow transformation in UNIX-like systems

On UNIX-like systems, TEX is traditionally compiled from WEB sources. In many cases this is done via CWEB. Here it seems there is no simple way, as there is for emTEX. One is forced to program; but for UNIX you can always find a C compiler (at least).

While TEX is translated from WEB to C, patches from the file ctex.ch are applied. You can create your own variant of this file with minor additions.

First we remap the input flow (the example below illustrates the remapping from the ISO 8859-5 encoding into the experimental one; see figures 4 and 5):

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % [2.23] Allow any character as input.
    % (Remapping by ptr)
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    @x
    for i:=0 to @'37 do xchr[i]:=' ';
    for i:=@'177 to @'377 do xchr[i]:=' ';
    @y
    for i:=0 to @'37 do xchr[i]:=chr(i);
    for i:=@'177 to @'237 do xchr[i]:=chr(i);
    for i:=@'240 to @'357 do
      xchr[i+16]:=chr(i);
    for i:=@'360 to @'377 do
      xchr[i-64]:=chr(i);
    xchr[@'205]:=chr(@'241);{\YO}
    xchr[@'245]:=chr(@'361);{\yo}
    @z

Then we substitute printable symbols for the non-printable ones (i.e., change what TEX substitutes on the terminal display and writes in the .log file):

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % [?.??] Modify characters cannot be
    % printed -- Ptr.
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    @x
    @<Character |k| cannot be printed@>=
      (k<" ")or(k>"~")
    @y
    @<Character |k| cannot be printed@>=
      (k<" ")or((k>"~")and(k<@'300)
      and(k<>@'270)and(k<>@'271))
    @z

After that you can build TEX in the normal way.

Fonts Encoding

But there is one small thing: now we have as many font encodings as machine ones. Now's the time to find a uniform Cyrillic font encoding.

The problem connected with letters of national alphabets in Europe was solved in part by the unification of TEX fonts. The arguments, pro and contra, connected with the Cork scheme can be found in various TUGboats, or in the LATEX Companion.

There is no more room for Cyrillic in the Cork encoding. That's evident. But the experience of the dc-fonts is good (in my opinion). So I think that a good solution is an encoding similar to that in figure 5. The first half is equivalent to the first half of the Cork encoding, and Cyrillic is placed in the second half. Why is there no ISO- or alternative-like encoding here? This is due to the LATEX2ε requirement that the difference between the codes for upper- and lowercase of the same letter be 32, with a sequence of letters beginning at code 128 or 192.

There is no evident preference for any machine encoding of Cyrillic, but with an encoding as seen in figure 5 we can avoid some of the problems with LATEX2ε.

Cyrillic in Style Files

TEX allows direct references to the internal TEX encoding (^^hh, where hh is a pair of hex digits). This fact can be used for writing files for the precompiled format or for styles: such files depend only upon the font encoding, not upon the Cyrillic encoding of the computer.

For precompiled format files we can write definitions for Russian letters and then use this file on every computer:

    \def\letter#1#2{%
      \catcode`#1=11\catcode`#2=11%
      \uccode`#1=`#1\lccode`#1=`#2%
      \uccode`#2=`#1\lccode`#2=`#2%
    }
    \language=\l@russian
    \letter{^^c0}{^^e0}%
    \letter{^^c1}{^^e1}%
    \letter{^^c2}{^^e2}%
    \letter{^^c3}{^^e3}%

Or, we can do the same with styles:

    \addto\captionsrussian{%
      \def\prefacename{% Predislovie (preface)
        ^^cf^^f0^^e5^^e4^^e8^^f1^^eb^^ee^^e2%
        ^^e8^^e5}%
      \def\refname{% Literatura (references)
        ^^cb^^e8^^f2^^e5^^f0^^e0^^f2^^f3^^f0^^e0}%
      \def\abstractname{% Annotatsiya (abstract)
        ^^c0^^ed^^ed^^ee^^f2^^e0^^f6^^e8^^ff}%
      \def\bibname{% Bibliografiya (bibliography)
        ^^c1^^e8^^e1^^eb^^e8^^ee^^e3^^f0%
        ^^e0^^f4^^e8^^ff}%
      \def\chaptername{^^c3^^eb^^e0^^e2^^e0}% Glava (chapter)
      \def\appendixname{% Prilozhenie (appendix)
        ^^cf^^f0^^e8^^eb^^ee^^e6^^e5^^ed^^e8^^e5}%
      \def\contentsname{% Soderzhanie (contents)
        ^^d1^^ee^^e4^^e5^^f0^^e6^^e0^^ed^^e8^^e5}%
      \def\listfigurename{% Spisok risunkov (list of figures)
        ^^d1^^ef^^e8^^f1^^ee^^ea %
        ^^f0^^e8^^f1^^f3^^ed^^ea^^ee^^e2}%
      \def\listtablename{% Spisok tablits (list of tables)
        ^^d1^^ef^^e8^^f1^^ee^^ea %
        ^^f2^^e0^^e1^^eb^^e8^^f6}%
      \def\indexname{^^c8^^ed^^e4^^e5^^ea^^f1}% Indeks (index)
      \def\figurename{% Risunok (figure)
        ^^d0^^e8^^f1^^f3^^ed^^ee^^ea}%
      \def\tablename{% Tablitsa (table)
        ^^d2^^e0^^e1^^eb^^e8^^f6^^e0}%
      \def\partname{^^d7^^e0^^f1^^f2^^fc}% Chast' (part)
      \def\enclname{^^e2^^ea^^eb.}% vkl. (encl.)
      \def\ccname{^^e8^^e7}% iz
      \def\headtoname{^^e2}% v (to)
      \def\pagename{% stranitsa (page)
        ^^f1^^f2^^f0^^e0^^ed^^e8^^f6^^e0}%
      \def\seename{^^f1^^ec.}% sm. (see)
      \def\alsoname{^^f1^^ec.~^^f2^^e6.}% sm. tzh. (see also)
    }

Of course, this is not very readable, but this is done by a style developer, and only once for every machine-specific encoding of Cyrillic fonts.

One note: this mechanism cannot be applied to hyphenation patterns. TEX doesn't expect the $E_i$ symbol representation (^^hh) inside the \patterns command.
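Typing such ^^hh strings by hand is error-prone, so they are better generated. The Python sketch below is an illustration under one assumption that the examples above suggest but the text does not state: for the basic Russian letters, the experimental encoding of figure 5 uses the same positions (C0h to FFh) as code page 1251, so Python's cp1251 codec can supply the codes. Characters below 80h are copied unchanged; both function names are hypothetical.

```python
def to_hat_notation(text, codec="cp1251"):
    """Rewrite the non-ASCII characters of text as TeX ^^hh sequences.

    `codec` supplies the 8-bit code of each character; cp1251 is assumed
    to agree with the experimental encoding for the codes C0h-FFh."""
    pieces = []
    for byte in text.encode(codec):
        if byte < 0x80:
            pieces.append(chr(byte))        # ASCII stays readable
        else:
            pieces.append(f"^^{byte:02x}")  # lowercase hex digits, as TeX requires
    return "".join(pieces)


def caption_definition(macro, text):
    """Return one \\def line for a caption macro (name given without backslash)."""
    return f"\\def\\{macro}{{{to_hat_notation(text)}}}%"
```

For example, `caption_definition("chaptername", "Глава")` reproduces the \chaptername line shown above.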


A User-Friendly Multi-Function TEX Interface Based on Multi-Edit

Michael M. Vinogradov Institute of Economic Forecasting, Moscow [email protected]

Abstract The aim of this paper is to describe a set of Multi-Edit macros created for easy processing of TEX files. Note that the standard version of Multi-Edit has some tools for working with TEX files, but these tools are not very convenient. Therefore we have created our own set of macros, which allows us to process the whole file or to compile and view any selected block of text, such as a complicated formula. Also, there are additional tools for handling some standard situations.

On-Screen Compile Menu

The Compile menu is activated by pressing the corresponding key. This menu includes:

1. Compiling for formats: Plain-TEX, AMS-TEX, LATEX or AMS-LATEX. One can also use various compilers, for example, the 286-compiler and the big-compiler. These menu items are realized by calls of the corresponding batch files. Automatic processing can be used to find compile errors;
2. Variants of a view. In particular, one can use a view with manual input of options;
3. Variants of printing, similar to the view ones. Various output devices can be used, such as dot matrix and laser printers;
4. Additional menu items; for example, one can include a call of a Russian spell-checker.

It is easy to add, modify or delete menu items by using standard Multi-Edit tools.

Automatic error handling

Errors identified during compiling are handled in the following way:

– The cursor is positioned at the location of the first error found;
– The corresponding error message appears in the upper line of the window;
– The log file displayed in the OUTPUT window points to the corresponding error message, which shows only the first two lines. If this is insufficient, then the cursor can be moved to the OUTPUT window by pressing <F11> or <Alt+Esc>. In this case the OUTPUT window will grow to seven lines. To move the cursor back to the text window, press <F11> or <Alt+Esc> again, and the OUTPUT window will shrink to two lines.

To find the next compile error, simply press the <NxtErr> key.

Block Compile

To compile and view a portion of a text (for example, a complicated formula), select the fragment and use a special macro, invoked by pressing the corresponding combination of keys. Various TEX formats (Plain-TEX, AMS-TEX, LATEX or AMS-LATEX) may be used here as well. The selected block may contain user-defined TEX macros. These macros should be placed at the beginning of the file, with the line

    %%% End of Leading Block

after them. The number of % signs is not important. All definitions found above this line will be added to the selected block when it is compiled. If this line is absent, no macros and definitions will be added. If additional definitions are placed in another file and then included via either \input or \documentstyle, etc., these commands should also be placed above the percent-marked line just mentioned.

If one needs to omit some commands, then the corresponding lines should be marked

    %%% Skip line

For example, the typical beginning of a LATEX file may be marked as follows:

    \documentstyle[12pt,fleqn,draft]{article}
    \newcommand\smbl{{\rm smbl}}
    \newcommand\al{alpha}
    \input footline.tex %%% Skip line
    \begin{document}
    %%% End of Leading Block


After the file has been compiled, the logfile contain- ing the selected block and error messages appears in the OUTPUT window.
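The way a leading block is combined with a selected fragment can be sketched as follows. This Python fragment is not the Multi-Edit macro itself, whose source is not shown here; it only illustrates the marking rules described above. The function name and the exact matching rules (any number of percent signs, optional spaces, the skip marker at the end of a line) are assumptions.

```python
import re

END_OF_LEADING_BLOCK = re.compile(r"^\s*%+\s*End of Leading Block\s*$")
SKIP_LINE = re.compile(r"%+\s*Skip line\s*$")


def build_block_file(source, selected_block):
    """Return the text to compile for one selected block.

    Lines above the end-of-leading-block marker are prepended, except
    those marked as skip lines; without a marker nothing is prepended."""
    lines = source.splitlines()
    marker = next(
        (i for i, line in enumerate(lines) if END_OF_LEADING_BLOCK.match(line)),
        None,
    )
    if marker is None:
        leading = []
    else:
        leading = [line for line in lines[:marker] if not SKIP_LINE.search(line)]
    return "\n".join(leading + [selected_block])
```

Whatever must follow the block for a given format (for instance a closing \end{document}) is left to the caller in this sketch.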

Additional macros

Additional possibilities can be divided into two types: macros that are useful for all TEX users and those useful for Russian TEX users only.

For all TEX users.

1. A database can be used to insert some TEX commands and macros. This database contains a set of commonly used macros for Plain-TEX, AMS-TEX and LATEX. Commands may be inserted at the cursor position or so as to surround a selected block.
2. Users can locate opening and closing parentheses or curly brackets, as well as check for any missing opening or closing brackets.

For Russian TEX users.

1. For those whose English-language skills need assistance, there is a special database containing more than five hundred basic sentences often used when writing English mathematical papers. The idea is adapted from Sosinsky. A part of this book is included in the HELP system to explain how to use the database.
2. Should the Russian TEX user forget to switch the keyboard, the file can still be processed without re-typing the text in the proper register.
3. Capitalization of a word, line or block can be changed to either upper- or lowercase. This macro works independently of the alphabet in use (Russian or English).

Conclusion

Most of these abilities can also be added to Multi-Edit for Windows. Thus we offer the TEX user a friendly multi-function Windows interface.

References Sosinsky, A. How to Write a Mathematical Paper in English.

Full Cyrillic: How Many Languages?

Olga G. Lapko Russia, 129820, Moscow Pervy Rizhsky per., 2 Mir Publishers [email protected]

A Brief History of Cyrillic

Slavonic writing was invented by St. Cyril and St. Methodius. Now there are two well-known Slavonic scripts: Glagolitic and Cyrillic.

Historians are not sure whether the author of both was St. Cyril, or whether the Cyrillic script was created by St. Methodius while St. Cyril invented the Glagolitic alphabet. In any case, in this paper we will deal with the alphabet nowadays called "Cyrillic."

The birthday of Cyrillic is considered to be the end of May 863. May 24 was declared by UNESCO as the day of Cyrillic. By coincidence, the first conference of the Cyrillic TEX Users Group, CyrTUG, was held May 24–25, 1991.

The Cyrillic alphabet is based on the Greek alphabet. There were 43 letters in this alphabet. Up until the beginning of the 20th century, four additional letters existed; these are absent in the modern Russian alphabet: [...].

Nowadays Cyrillic script is used not only by Slavonic people, but also by other nations of the former USSR. Historically, many of these nations used other scripts. Some Soviet republics, such as the Central Asian republics, [...], and the Russian autonomous republics, used the Arabic script. In [...], the old Mongolian vertical script was used in the [...], and the Dzayapandin vertical script was used in the Kalmyk language. Soon after the October Revolution many languages started to use the Latin alphabet with additional letters. In Abkhasia, the Georgian alphabet was used for a few years. At the end of the 20s and in the 30s, almost all languages of the USSR changed from using Latin script to Cyrillic. In many languages new letters were created (see fig. 1).

Outside Russia, Cyrillic is used in Bulgaria, Serbia, Macedonia and Mongolia. The Bulgarian language now uses only letters of the modern Russian alphabet, but earlier [...] and [...] were also used. In Macedonian and Serbian the following additional letters are used: [...], with [...] in Macedonian only, and [...] in Serbian. The Mongolian language uses Russian letters with two additions: [...].

One may also find Cyrillic letters used in scripts based on the Latin alphabet. Examples are the Chinese languages Y, Lahu, Lisu, Myao and Juang, as well as several African languages.

History of the full Cyrillic font project

The LHFONTS package (1) was created as a part of the CyrTUG-EmTEX package, which is distributed among Russian and non-Russian users who use Cyrillic. LHFONTS offers the LH Cyrillic font family; these fonts are based on the WNCYR fonts of the CYRILLIC package, which is part of AMS-TEX.

  (1) This package was originally named MAKEFONT, but was renamed to avoid confusion with the utility of the same name on the 4AllTEX CD-ROM, produced by the Nederlandstalige TEX Gebruikersgroep.

The main task of the LHFONTS package was to create a Cyrillic font family that extends the standard text fonts of Computer Modern and also corresponds to Russian typesetting traditions.

At first this package offered two more or less popular encoding schemes (2): Alternative, an 8-bit Latin-Russian font encoding analogous to MS-DOS's code page 866, mainly used by Russian MS-DOS users; and Washington or WNCYR, a 7-bit encoding for typesetting with transliteration, which is mainly used by non-Russian users.

  (2) The package also offered a virtual encoding: a rather conditional 7-bit encoding which combines Cyrillic and Latin fonts.

The Cyrillic character encodings are described in special files: lbcoding.mf (Alternate encoding) and wncoding.mf (WNCYR encoding). One can choose between these files by changing the value of one of the following variables: altcoding, vfcoding or wncoding. These variables also determine the font layout:

altcoding  Standard Computer Modern in the lower part of the table, plus Russian letters and additional punctuation marks in the upper one; Alternate encoding: encoding file lbcoding.mf

vfcoding  Russian letters and added punctuation marks in the upper part of the table, for subsequent combining with Computer Modern in a virtual font; Alternate encoding: encoding file lbcoding.mf

wncoding  Cyrillic letters for Slavonic languages with the necessary input ligatures, and standard and additional punctuation marks in the lower part of the table; WNCYR encoding: encoding file wncoding.mf

The values of these variables are determined in the driver files (ld??font.mf) by default. The header of these files (for example, ldrmfont.mf) contains the following lines:

    if unknown wncoding: wncoding:=0; fi
    if unknown vfcoding: vfcoding:=0; fi
    ...
    altcoding:=1-wncoding-vfcoding;
    if wncoding<>0: input wncoding;
    else: input lbcoding; fi
    ...

The variables altcoding, vfcoding and wncoding may be set by hand in the file header or at the start of the METAFONT run.

The LHFONTS package contains font headers named lh*.mf and ll*.mf.

The files lh*.mf (56 files) are virtually identical to cm*.mf except for the last line:

    generate

that is, the standard Computer Modern driver file was changed to the analogous file for the LH fonts. These file headers generate a full 8-bit Latin-Cyrillic font.

The files ll*.mf (also 56 files) contain only the following line:

    vfcoding=1; input ;

thus the command vfcoding:=1; sets generation of Russian letters and punctuation marks only.

Since the WNCYR encoding was an optional encoding in this package, the files wn*.mf were not created, but the documentation explains how to create fonts with the WNCYR encoding using a similar one-line header file as follows:

    wncoding=1; input ;

The LH Cyrillic font family offers typesetting in Russian and other languages using the Russian part of the Cyrillic alphabet: that is the Virtual and Alternate encodings. The WNCYR encoding also offers typesetting of modern Slavonic texts using Cyrillic and of 19th century Russian text. But there are a lot of languages which use the Cyrillic alphabet with added letters. The following sections discuss this problem.

The Global Cyrillic font as the material for Ω project

Some time ago the multilingual project Ω was started. One of the authors of Ω, Yannis Haralambous, began to create a full Cyrillic font. He offered this font to CyrTUG for further work.

The data for this font was taken from the Cyrillic part of the Unicode table (3). But this table still does not cover all Cyrillic letters; some old Cyrillic letters and national letters are missing. Probably the full assortment of accented vowels, which are not included in Unicode, is also necessary.

  (3) Unicode: International Standard ISO/IEC 10646-1, first edition, 1993.05.01. Information Technology: Universal Multiple Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane (Table 11, Row 04, Cyrillic).

The font created during this work had more than 256 letters and marks. This font assortment should be further extended and improved. During the testing of this font, I created shortened variants, or split it into a few fonts. The two Cyrillic Unicode fonts were extended with glyphs for characters not included in Unicode and created by the methods described in this paper (see the Appendix).

The methods of partial font creation may usefully improve the economy of use of the computer's memory. One may create a big 256-letter (Unicode) font and then use a virtual font to achieve the necessary encoding. Alternatively, one may create the font immediately in its required encoding.

How TEX helps METAFONT. Creation of coding and ligature-kerning tables

To create a font, METAFONT needs program descriptions for letters (a lot of them), information about letter codes, and kerning and ligature data.

Now there are a few well-known encodings of Cyrillic. They differ in which characters they hold and in what order (see the Appendix). So, for every encoding, a separate file is needed.

Since TEX cannot use a font containing ligature or kerning information relating to external characters, we cannot use the same table, with all Cyrillic letters, for every font: we must create a separate table for each encoding. There are five tables for the different font shapes of the text fonts of the Computer Modern family; they are included in the driver files. We must create the same number of tables for every encoding.

TUGboat, 17, Number 2 — Proceedings of the 1996 Annual Meeting 175 Olga G. Lapko

Furthermore, for each font and each encoding we must create a header file (the Computer Modern family has 56 text fonts).

As you can see, the number of files will be very large, and they will often duplicate each other.

The best solution is to create three files which contain all the information that could potentially be duplicated. The first one contains a table with all the Cyrillic glyphs and signs and all supported encodings. The second one contains data on ligatures and kerning for the complete font repertoire. The third file contains the table of font names and sizes and the necessary command lines for every font header file. The data for every font in every encoding would be taken from these files and the necessary header files created. All these files are created by TEX. TEX can also create the file containing all uccodes, lccodes and mathcodes for a given font.

Preparing font headers

As mentioned above, we use the parameters of the Computer Modern text fonts for creating Cyrillic fonts. First, the header files for the LHFONTS package were copied from header files of the Computer Modern family, changing only one line (see the section entitled "History of the full Cyrillic font project"). To avoid unnecessary duplication, we create header files which load the necessary Computer Modern header file, substituting the standard driver file with the driver of LH Cyrillic fonts. This task, for example, was solved in the Polish fonts package, in which there is a special, tricky file fik_mik.mf which substitutes standard drivers with Polish ones. One of the authors of this file, Boguslav Yackovski, allowed us to use this file in the LHFONTS package.

Now it is necessary to create header files for font creation which include one or a few lines only.

The EmTEX package supports a command operator in its MFJob program, which enables one to write the necessary short commands for the METAFONT run. By using this, we can avoid creating a lot of files containing only the line:

    input fik_mik_; use_driver;

For font generation on other platforms we must create these files in any case. For quick generation of the files I used the file dcstdedt.tex from DCFONTS. The original file includes the table with all font names and font sizes. We modified the file, making it possible to add the line (the \mainfontspecific macro) which can switch on the variables vfcoding:=1; or wncoding:=1; when necessary, and the parameter for a few fonts which switches on the necessary shape. To specify the use of different encodings we must change the font names, so the first two letters are changed to the macro \fonttwoletters; these two letters are set by the user according to the necessary encoding at the start of TEX's run. A fragment of such a file for font headers is shown below:

    % by default full Cyrillic Font (Unicode)
    % is generated
    \ifx\mainfontspecific\undefined
    \def\mainfontspecific{vfcoding:=1;}\fi
    \ifx\fonttwoletters\undefined
    \edef\fonttwoletters{uc}\fi
    \long\def\FontsToBeGenerated{
    \tablevalues %
    ( ... 8 9 10 ... 17.28[17] ) %

    \makefont\fonttwoletters r %
    ( ... 8 9 10 ... 17.28[17] )()
    ...
    \makefont\fonttwoletters tt %
    ( ... 8 9 10 ... )(specific:=0;)
    }

By default this file, together with the file of encoding data, creates a set of file headers and the encoding for the global Cyrillic font in Unicode (see fig. 1).

Preparing the encoding file and files of ligature/kerning tables

The encoding files are created from the file which contains the table of all Cyrillic glyphs and signs and all well-known (or at least necessary) encodings.

    % by default full Cyrillic Font (Unicode)
    % is generated
    \ifx\fonttwoletters\undefined
    \def\fonttwoletters{uc}\fi
    \def\nolettercode{*}
    \long\def\CodesToBeGenerated{
    \tablevalues ( uc lh wn ... )
    \makecod CYR_A CYRA ( 10 80 41[A] ... )
    \makecod CYR_BE CYRB ( 11 81 41[B] ... )
    ...
    \makecod CYR_LJE CYRLJE ( 09 * 01[LJ] ... )
    }

We can see that this table is analogous to the previous one. Macros analogous to the macros for creating font headers were used for creating the encoding files.

Now we must create tables of ligatures and kerns. In the Computer Modern fonts these tables are in the font driver files. In the LH fonts the tables are in separate files. As we said above, we need to create five tables of ligatures and kernings for text fonts: 1) for roman and sans serif shape; 2) for italic shape; 3) for caps and small caps shape, for which two tables are actually necessary, separately for the


uppercase and small caps letters; and 4) for large fonts like cminch.

The Cyrillic font is very large, but one may see that almost all Cyrillic letters can be sorted into a few shape groups. We determined 14 groups for uppercase letters, 14 groups for lowercase letters and 17 groups for italic letters in the Cyrillic font. The letters in these groups sometimes repeat the shape (or contour) of Latin ones, so one may use the Computer Modern table as a base.

The file of kerning and ligature data retains the shape grouping of letters, so every new letter will be added to an appropriate group. At the beginning of each group we place a "typical" Latin letter as a comment. When a new letter appears it may be added to the necessary group:

    \writeLig{if wn:}
    \writeLig{ ligtable CYR_ZE: "1"=:CYR_ZHE,
      "H"=:CYR_ZHE, "h"=:CYR_ZHE;} % "Z"
    \writeLig{fi}

    \Ligtab
    %A
    \Letter{CYR_A} \Letter{CYR_A_acute}
    \Letter{CYR_LIT_YUS} ...
    ...
    %b
    \Letter{CYR_HARD_SIGN}\Letter{CYR_YATZ} ...
    %R
    \WriteLig{if serifs:}
    \Letter{CYR_BIG_YUS} ...
    \WriteLig{fi}
    %
    %O
    \Kern{CYR_O}{k#}\Kern{CYR_O_lcomma}{k#}
    \Kern{CYR_O_acute}{k#} ...
    \Kern{CYR_ABKH_O}{k#}
    %
    ...
    \EndLigtab

We may create a font for kern testing by taking the characteristic examples from these letter groups (see Appendix).

To create the necessary ligature and kerning pair tables, we use data from the encoding file which was created just before them. At present the ligatures are used in the WNCYR encoding only, without any changes from the original Washington State University fonts.

In addition to the tables of ligature/kerning, TEX creates a uccode/lccode/mathcode file and a file ???cod.tex, which is used by russianb.ldf (footnote 4).

4 The Russian language-specific file for the Babel system.

How METAFONT generates only necessary letters

TEX has thus created the files necessary for encoding and for the ligature and kerning pair tables. Now METAFONT must generate only the glyphs required for the given encoding.

From the very beginning the LH font family supported different encodings. Since Cyrillic letters occupy different places in different coding schemes, explicit character codes in the character descriptions (beginchar command) have been replaced with their symbolic names. For example, the description of the lowercase Cyrillic letter 'a' starts with:

    cmchar "Lowercase Russian letter a";
    beginchar(CYR_a,9.25u#,x_height#,0);
    ...

In the description of uppercase letters we added the line if lower_case: ... fi for redefinition of a code in the font "Small Caps":

    cmchar "Uppercase Russian letter A";
    beginchar(CYR_A,13u#,cap_height#,0);
    if lower_case: charcode:=CYR_a; fi
    ...

In plain METAFONT the definition of the beginchar command has the following lines:

    def beginchar(expr c,w_sharp,h_sharp,d_sharp) =
    begingroup
    charcode:=if known c: byte c else: 0 fi;
    ...
    enddef;

which means that a letter with an unrecognized code number is set to position '0'.

For the LH fonts, a letter or a sign whose code is not recognized must be skipped, so the beginchar command is redefined in the following way:

    let plain_beginchar=beginchar;

    def beginchar(expr c,w_sharp,h_sharp,d_sharp) =
    iff known c: %
    plain_beginchar(c,w_sharp,h_sharp,d_sharp);
    enddef;

What needs to be done when it is necessary to create the Cyrillic font only, but the letters and signs of standard Computer Modern have code numbers in beginchar, so they are always determined? The three variables mentioned in the section on the History of the Full Cyrillic Font Project determined three different encodings. Now they may switch on the following:


altcoding   Standard Computer Modern in the lower part plus Cyrillic letters and added punctuation marks in the necessary encoding in the upper one

vfcoding    Cyrillic letters and added punctuation marks in the upper, lower, or in both parts of the table, for later combining with the Latin part in a virtual font or for full Cyrillic font creation (for example Unicode)

wncoding    this set was not changed: Cyrillic letters for Slavonic languages with the necessary ligatures, standard and additional punctuation marks in the lower part of the table; WNCYR encoding

Now the encoding files, lbcoding.mf or wncoding.mf, switched on by these variables, set the necessary selection of Cyrillic letters and their encoding. In fact, the file wncoding.mf for the WNCYR encoding was not changed. The file lbcoding.mf switches the necessary encoding. When we have the necessary files, we can create the font.
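The difference between the plain METAFONT beginchar and the redefined one can be modelled outside METAFONT. The following is a minimal Python sketch, not the LH fonts code: the glyph names are taken from the text, while the function names and the sample code values are assumptions made for illustration only.

```python
# Illustrative model only; the code values used below are assumed sample data,
# not the real LH encoding tables.
from typing import Dict, List


def plain_positions(glyphs: List[str], codes: Dict[str, int]) -> Dict[str, int]:
    """Plain METAFONT behaviour: a glyph with an unknown code lands at position 0."""
    return {name: codes.get(name, 0) for name in glyphs}


def lh_positions(glyphs: List[str], codes: Dict[str, int]) -> Dict[str, int]:
    """Redefined behaviour: a glyph with an unknown code is skipped entirely."""
    return {name: codes[name] for name in glyphs if name in codes}


if __name__ == "__main__":
    glyphs = ["CYR_A", "CYR_a", "CYR_LJE"]
    sample_codes = {"CYR_A": 0x80, "CYR_a": 0xA0}  # CYR_LJE has no code here
    print(plain_positions(glyphs, sample_codes))
    print(lh_positions(glyphs, sample_codes))
```

With the plain rule every glyph lacking a code would be drawn at position 0, each one overwriting the last; with the skipping rule the font contains only the glyphs that the chosen encoding actually names.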

Appendix

Figure 1: Unicode encoding; Cyrillic part

Since the Latin part is unchanged and uses the TEX encoding scheme, the next examples show only the Cyrillic part of the font.


Figure 2: Cyrillic letters which are not included in Unicode

Figure 3: MS DOS cp866 encoding

Figure 4: Washington encoding

Figure 5: KOI-8 (Unix platform) encoding

Figure 6: ISO 8859-5 encoding

Figure 7: Apple Macintosh encoding

Figure 8: Windows 1251 encoding

Figure 9: Example of national encoding: Tatar encoding
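Each encoding shown in the figures corresponds to one column of the glyph table described in the section on preparing the encoding file. The following Python sketch shows how such a table can be split into one code map per encoding. It is an illustration under stated assumptions, not the TEX macros of the package: the column names (uc, lh, wn) and the two sample rows follow the fragment in the text, the asterisk stands for a glyph that has no code in an encoding, and reading the entries as hexadecimal numbers is an assumption.

```python
# Sketch of splitting one glyph/encoding table into per-encoding code maps.
# Reading the table entries as hexadecimal is an assumption of this example.
from typing import Dict, List, Tuple

NO_CODE = "*"  # plays the role of \nolettercode in the text

ENCODINGS: List[str] = ["uc", "lh", "wn"]
TABLE: List[Tuple[str, List[str]]] = [
    ("CYR_A", ["10", "80", "41"]),
    ("CYR_LJE", ["09", NO_CODE, "01"]),
]


def codes_for(encoding: str) -> Dict[str, int]:
    """Return {glyph name: code} for one encoding, omitting glyphs without a code."""
    column = ENCODINGS.index(encoding)
    result: Dict[str, int] = {}
    for glyph, entries in TABLE:
        entry = entries[column]
        if entry != NO_CODE:
            result[glyph] = int(entry, 16)
    return result


if __name__ == "__main__":
    for name in ENCODINGS:
        print(name, codes_for(name))
```

Keeping all encodings in one table means that a new glyph is added in a single place, and a glyph that an encoding does not hold simply does not appear in that encoding's map.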


StarTEX—A TEX for Beginners

Dag Langmyhr Department of Informatics University of Oslo Norway [email protected]

Abstract

This article describes StarTEX, a new TEX format for students writing their first report and other novice users. Its aim is to provide a simpler and more robust tool for users with no previous knowledge of TEX and LATEX.

The problem

Students taking courses at our department are required to write short project reports, and LATEX [2] has been the preferred tool. Several years' experience has, however, shown us that LATEX is not ideal for this.

This project report is the first encounter most students have with LATEX, and they face many problems:

• The major problem is the error messages. They are very terse at best, and since they are sometimes produced by LATEX and at other times by TEX, understanding the messages requires reasonably good knowledge of both systems. Most students tend to look only at the line numbers when examining their error logs.

• LATEX is not very robust; trivial syntax errors can cause a serious burst of confusing error messages, like when you forget a \\ prior to \hline in an array environment.

  You can also experience undesired effects if you use the commands incorrectly, for instance if you write

      \abstract{text}

  rather than the correct

      \begin{abstract}
      text
      \end{abstract}

  This error produces no error message, but will cause the whole article to be set in a smaller font.

• LATEX does not hide the primitive commands of TEX, making it possible for the users to access them accidentally. For example, one of our users defined a macro for her name:

      \def \else {Else Hansen}

  This error alone produced more than 100 error messages.

• LATEX uses ten special characters: #, $, %, &, ~, ^, _, {, } and \. Users need to remember that these characters are special, and they must learn which commands are necessary to produce them if they are required in the text. Fewer special characters would be an advantage.

• The command notation \xxx used in LATEX often causes problems with the space following it.

• LATEX has borrowed its error recovery philosophy from plain TEX: the user is expected to manually correct each detected error to allow LATEX to proceed. The problem with this approach is that you will get many confusing error messages if you do not correct the error properly.

  None of our students use this interactive recovery facility; they either restart after having discovered the first error, or they let the processing run to completion without any interaction. An automatic error recovery scheme like that employed by compilers would be a great benefit for these users.

• LATEX provides a mixture of structural mark-up commands as well as visual mark-up. The advantage is that experienced users can achieve the visual appearance they desire; the disadvantage is that less experienced users, particularly those who have used other document processing tools, spend too much of their time trying to coerce LATEX into producing exactly the layout they think is proper.

• LATEX is a large system and running it is not as fast as for instance plain TEX. For instance, a 2 1/2 page sample document takes from 2.8 seconds on a Sun SparcStation20 to 8.7 seconds


on a Silicon Graphics Indy. Since novice users tend to process their documents very frequently to remove errors or test the effect of a feature, execution times do matter, even in this range.

The requirements

All these problems indicate that LATEX in its present form is not the tool we want for our students, at least not for their first report. We want a document processing program with the following properties:

• It must be based on TEX to achieve the desired quality in mathematical formulae.

• It should use a different notation for its mark-up commands; one which causes less confusion concerning spaces and has fewer special characters.

• It must hide all the internal TEX commands; this is the only safe way to avoid students using them accidentally.

• It must be small and easy to understand, so that it may easily be adapted to the particular needs of each installation.

• It should contain structural mark-up commands only, and no visual mark-up.

• It should be robust.

• It should produce better error messages. If possible, no messages from TEX should ever appear. If this is impossible, error messages from TEX should be preceded by a message produced by the new tool.

• Since most students tend to just disregard all messages about under- and over-full boxes, it should try to reduce the number of such messages.

• It should run in nonstop mode and use automatic error recovery to detect as many genuine errors as possible.

• It should be as fast as plain TEX.

• The command handling should be insensitive to uppercase and lowercase. This is not an important issue, but case confusion has caused problems for some.

The solution

StarTEX was designed in an attempt to achieve the goals mentioned above. The name was chosen to indicate that it was a Starters' TEX.

StarTEX is a new format, and is thus a simple cousin of AMS-TEX [5] and LATEX. It is built on top of the plain TEX [1] commands.

The notation

At the EuroTEX conference in Arnhem in September last year, Philip Taylor [6] proposed a different notation for (LA)TEX commands: <xxx> rather than \xxx.

I decided to use this notation in StarTEX as it solves many of our problems:

• Spaces following the command are no longer a problem. There is no need for special rules like "When a space comes after a control word, it is ignored by TEX." [1, p. 8].

• Only one special character is needed: <. The characters #, $, %, &, ~, ^, _, {, } and \ can be defined to be just ordinary characters.

• The command name may contain almost any character, not just letters.

• The scheme is easy to implement: all that is required is to make < an active character, and let the corresponding command regard everything up to the following > as a parameter.

• Since all commands are called through this interface, it is easy to make all internal TEX commands invisible.

• It is easy to check whether the user command is defined, and provide suitable error recovery if it is not.

• It is easy to \lowercase the user command, thus making the command handling insensitive to case.

• This command notation is the same as in HTML [4] with which many students are familiar.

I could have used any bracketing symbol pair, like [xxx] or {xxx} or /xxx\, but I chose <xxx> because it resembles HTML and because < and > are not used very frequently.

Command parameters

A few commands need a parameter to specify non-printing matter like a file name or a label. I chose to use square brackets for this, as in

    [label]

Using a special notation indicates more clearly that the parameter is not to be typeset.

The command set

The set of available StarTEX commands was chosen with the following aims in mind:

• There should be sufficient commands for writing a student report, but otherwise there should be as few commands as possible.

• There should be no commands for visual mark-up, only structural specifications.


• The commands should have a form that makes them easy to check for errors, and to automatically recover from the errors.

Most of the StarTEX commands are listed in table 1 with their LATEX counterparts.

Paragraph separation

It was decided to use <p> to separate paragraphs, as in HTML. Even though the blank line used by (LA)TEX is easier to type, it does cause problems with indentation of the paragraph following an environment like a list. Using <p> alleviates this problem.

Another advantage of using the notation <p> is that it can be employed as a line separator (like \\ in LATEX) in environments where the concept of paragraph makes little sense, as in the <title> or <author> environments. This provides a double benefit: a special command for line breaking is no longer necessary, and using <p> in a <title> environment is now legal.

Font selection

A few commands for font selection are necessary, but my belief is that <b> (for bold text), <i> (for italic) and <tt> (for typewriter text) form a sufficient set of commands. The commands may of course be nested to provide, for instance, italic typewriter text.

Some might argue that these commands are visual rather than structural, and that the HTML approach of providing a wider selection of structural commands like <dfn> for definitions, <em> for emphasis, <kbd> for keyboard input and <samp> for literal characters, is more logical. My own experience is that there are seldom enough definitions to suit my needs, so I will for instance use a specification like <strong> when I really want to indicate a reserved word in a programming language. Providing a few simple type changing commands is simpler.

PostScript figures

Since nearly all figures used in LATEX documents at our department are PostScript files, it seems reasonable to specialize the interface for this. The notation

    <psfig>[file name]caption text</psfig>

was chosen as only two keywords were necessary. All figures are automatically scaled and they float to the top of the current or following page.

Tables

The notation for tables was also chosen to be as simple as possible, and to ease error detection and recovery. Only very regular tables are catered for, but this is the price one has to pay for a simple notation.

A table is a complex structure, with entries in columns within rows inside the table, but a notation was found which will seldom give grouping errors:

    <table>caption text
    <row>text<col>text...
    <row>text<col>text...
    :
    </table>

Every <row> starts a new row, and each <col> starts another column. The text prior to the first row is regarded as the table caption.

The number of columns is determined automatically. All columns are centered, and a grid of horizontal and vertical rules is always added. For example, the code

    <table>
    A small table sample
    <row> <b>Index</b> <col> <b>Data</b>
    <row> 12 <col> 199
    <row> 17 <col> 0
    </table>

will generate the table shown as table 2.

Table 2: A small table sample

    Index   Data
    12      199
    17      0

Document styles

All documents need some adaptation to conform to a particular style. I propose to let the user decide this by stating

    <style>[style file]

The style file is written in plain TEX and contains the necessary definitions and modifications. Since the user has no visual mark-up commands at his or her disposal, all design decisions are made by the style designer. This makes it easier to have all reports conform to the approved standard.

My hope is that each site using StarTEX will develop styles of their own. These styles should be comprehensive, so the user should only have to specify that one style. For instance, our style ifi-report defines

• the page size (A4 paper),
• Norwegian format of <today> and <now>,
• Norwegian translations of fixed texts like "Figure" and "Table",
• the page headers and footers, and
• various minor typographic details.

Cross references

StarTEX uses more or less the same mechanisms for cross references as LATEX. Interesting sections, figures and tables are given a label using the <label> command, which may then be referenced using the <ref> command.

                      StarTEX                                 LATEX
Document bounds       <body>text</body>                       \begin{document}text\end{document}
Document style        <style>[style file]                     \documentclass{style file}
Document head         <title>text</title>                     \title{text}
                      <author>text</author>                   \author{text}
                      text                                    \date{text}
Font change           <b>text</b>                             \textbf{text}
                      <i>text</i>                             \textit{text}
                      <tt>text</tt>                           \texttt{text}
Paragraph break       <p>                                     ⟨blank line⟩
Mathematical formula  formula                                 \(formula\)
                      formula                                 \[formula\]
Sectioning            text                                    \section{text}
                      text                                    \subsection{text}
                      text                                    \subsubsection{text}
                      text                                    \paragraph{text}
Itemized list         ...                                     \begin{itemize}
                      :                                       \item ...
                                                              :
                                                              \end{itemize}
Enumerated list       ...                                     \begin{enumerate}
                      :                                       \item ...
                                                              :
                                                              \end{enumerate}
Description list      text ...                                \begin{description}
                      :                                       \item[text] ...
                                                              :
                                                              \end{description}
PostScript figure     <psfig>[file name]caption text</psfig>  \begin{figure}
                                                              \caption{caption text}
                                                              \begin{center}
                                                              \epsfig{file=file name,...}
                                                              \end{center}
                                                              \end{figure}
Table                 <table>caption text                     \begin{table}
                      <row>text<col>text...                   \caption{caption text}
                      <row>text<col>text...                   \begin{center}
                      :                                       \begin{tabular}{|c|...}\hline
                      </table>                                text& text& ...\\ \hline
                                                              text& text& ...\\ \hline
                                                              :
                                                              \end{tabular}
                                                              \end{center}
                                                              \end{table}
Footnote              text                                    \footnote{text}
Unformatted text      text                                    \begin{verbatim}text\end{verbatim}
Cross