Extending TeX for Unicode

Richard J. Kinch 6994 Pebble Beach Ct Lake Worth FL 33467 USA Telephone (561) 966-8400 FAX (305) 644-6978 [email protected] http://www.truetex.com

Abstract

TeX began its "childhood" with 7-bit-encoded fonts, and has entered adolescence with 8-bit encodings such as the Cork standard. Adulthood will require TeX to embrace 16-bit encoding standards such as Unicode. Omega has debuted as a well-designed extension of the TeX formatter to accommodate Unicode, but much new work remains to extend the fonts and DVI translation that make up the bulk of a complete TeX implementation. Far more than simply doubling the width of some variables, such an extension implies a massive reorganization of many components of TeX. We describe the many areas of development needed to bring TeX fully into multi-byte encoding. To describe and illustrate these areas, we introduce the TrueTeX® Unicode edition, which implements many of the extensions using the Windows Graphics Device Interface and TrueType scalable font technology.

Integrating TeX and Unicode

You cannot use TeX for long without discovering that character encoding is a big, messy issue in every implementation. The promise of Unicode, a 16-bit character-encoding standard [15, 14], is to clean up the mess and simplify the issues.

While Omega [5, 13] has upgraded TeX-to-DVI translation to handle Unicode [3], the fonts and DVI-to-device translators are far too entrenched in narrow encodings to be easily upgraded. This paper will develop the concepts needed to create Unicode TeX fonts and DVI translators, and exhibit our progress in the TrueTeX Unicode edition.

A fully Unicode-capable TeX brings many substantial benefits:

• TeX will work smoothly with non-TeX fonts. While TeX already has a degree of access to 8-bit PostScript and TrueType fonts, there are many limitations that Unicode can eliminate.

• TeX will eliminate the last vestiges of its deep-seated bias for the English language and US versions of multilingual platforms like Windows. It will adapt freely and instantaneously to other languages, not just in the documents produced, but in its run-time messages and user interface. This flexibility is crucial to quality software, especially to a commercial product in an international marketplace.

• With access to Unicode fonts, the natural ability of TeX to process the large character sets of the Asian continent will be realized. Methods such as the Han unification will be accessible.

• TeX will install with fewer font and driver files. Many 8-bit fonts will fit into one 16-bit font, and in systems like Windows, which treat fonts as a system-wide resource, fewer fonts are an advantage. Only one application will be needed to translate from Unicode DVI to output device.

• TeX documents will convert to other portable forms (like PDF, OpenDoc, or HTML) and will work with Windows OLE, without tricks and without pain.

• Computer Modern and other TeX fonts will be usable in non-TeX Unicode applications. The 8-bit encoding problems have broken Computer Modern on every variety of Microsoft Windows.

• When 16-bit encodings overcome the resistance of the past (and we have every reason to hope that they will), TeX will play a continuing role in software of the future, instead of becoming an antique.

Claiming these promises involves some trouble along the way, but without 16 bits to use for encoding, we will never have a solid solution.

TUGb oat, Volume 0 (1996), No. 0 | Proceedings of the 1996 Annual Meeting 1001 Richard J. Kinch

Let us survey in the rest of this paper what is needed to achieve these various aspects of integrating TeX and Unicode.

Omega and Unicode

The goal of our work has been to create a Unicode-capable DVI translator, and to reorganize the TeX fonts into a Unicode encoding. TeX itself (that is, the formatter) already has a Unicode successor, namely Omega.

The chief advance of Omega is that it generalizes the TeX formatter to handle wider encodings. What Omega is less concerned with is the DVI format (which has always provided for wider encodings, up to 32 bits), the encoding of the existing TeX fonts, and the translation of .dvi files for output devices. In fact, Omega has side-stepped DVI translation altogether with its extended xdvicopy translation, whereby Omega operates within the old environment of 8-bit TeX fonts and the old DVI translators. Since Unicode rendering is supported on Windows but not on other popular TeX platforms (UNIX, DOS, etc.), a devotion to Omega's portability requires that Omega use the old fonts and DVI translators. Lacking any compulsion to extend the DVI translators for Unicode, the Omega project has justifiably invested most of its effort into earlier stages of the typesetting process [4, page 426].

Our Unicode TeX fonts and Unicode DVI translator, while having a natural connection to Omega, are capable of connecting TeX82 to Unicode as well. Through the mechanism of virtual fonts [11], TeX can access Unicoded fonts while using its old 8-bit encodings itself.

What's the fuss?

Wishing for 16 bits of Unicode sounds like, hey presto, we just widen some integer types, double some constants, and type "make" somewhere very, very high in a directory tree. The task is far from this simple for several reasons:

One, TeX and TeXware are full of 256-member tables which enumerate all code points. These would have to grow to 65,536 members. While Haralambous and Plaice want us to agree that this is "impossible" for practical reasons [6], they assume that we are not going to re-implement the 8-bit-encoded software for sparse arrays. Applying sparse-array techniques to manage per-character data will avoid an impossible increase in execution time and/or memory, although it will require an initial extra effort to upgrade the software.

Two, these tables have to be stored in files, and we need to carefully and deliberately extend the file formats to handle the extensions. Not only could we come up with a bad design that limits us unnecessarily in the future, but all the old TeXware has to be upgraded, and then we have to port the upgrades.

Three, we must rationalize the old ad-hoc character sets into a big union set. Just cataloging and managing this data is a large task: many tens of thousands of items, where we used to have only hundreds before. Some degree of database management tools must be applied to get the codes into a form which we can compile into software; it is not enough to just type in some array initializers here and there.

Encoding standards are necessarily incomplete or imprecise in some aspects, and none fit the TeX enterprise. While many of the Unicode math symbols were taken from TeX, many of the TeX characters are missing from Unicode. But Unicode is about the closest encoding to TeX math that we can expect from an unspecialized encoding, and with Unicode we gain a powerful connection to multilingual character sets.

Extending Computer Modern to Unicode

A "rational" encoding establishes a mapping of character names to unique integers, and this mapping does not vary from font to font. The Computer Modern fonts were not encoded rationally. For example, code 0x7b is overloaded with about 8 different characters, and character dotlessi appears in different codes in different fonts. Given an 8-bit limit on encoding, this was inevitable. But this makes for many troubles; moving up to a rational, 16-bit encoding is a clean solution.

Computer Modern is also "incomplete" in the sense that if you made a table having on one axis the list of all the specific styles (Roman, Italic, typewriter, sans serif, etc.) and on the other axis all the characters in all the fonts (A-Z, punctuation, diacritics, math symbols, etc.), the table would have lots of holes when it came to what METAFONT source exists. Commercial text fonts have all of these holes filled, or at least the regions populated in the table are rectangular. In Computer Modern the regions are randomly shaped.

Furthermore, the character axis of this (very large) imaginary table is missing many characters considered important in non-TeX encodings. For example, ANSI characters like florin, perthousand, cent, currency, yen, brokenbar, etc., are not implemented in Computer Modern. Certain of these symbols can be unapologetically composed from existing Computer Modern symbols: a Roman multiply or divide would come from the math symbol font, trademark from the T and M of a smaller optical size, and so on. But many other characters will just have to be autographed anew in METAFONT (at least one related work is in progress [6]).

The job of extending Computer Modern to be a rational and complete set of fonts first requires that we reorganize the existing characters into a clean, 16-bit encoding. Then we are in a strong position to fill in the missing characters.

We not only want to give TeX access to 16-bit-encoded fonts, we also want the converse: non-TeX applications to have access to the Computer Modern fonts in TrueType form. This mandates adherence to the Unicode standard wherever possible, and an organized method to manage the non-Unicode characters in Computer Modern.

Here is a list of the components we consider essential to a Unicode rationalization of Computer Modern. In this list we take a different approach from Haralambous' Unicode Computer Modern project [7], which is aimed at producing virtual fonts which resolve to 8-bit .pk fonts from METAFONT. Our aim is a set of Unicoded TrueType fonts.

• A METAFONT-to-outline converter, a very difficult although not impossible task, as illustrated in MetaFog [9].

• A database system to treat the converted Computer Modern glyphs as atoms, for input to a TrueType font-builder.

• A database of character names which covers all the characters in the TeX fonts and in Unicode. We call the grand union TeX character set TeXunion, in the same way that we denote the Unicode characters as the Unicode character set. (We will use Small Caps to indicate a formal set.) TeXunion contains 1108 characters by the present inventory. Producing this database involves some work because there are no standards for TeX character names (that is, single-word alphanumeric names such as are used in PostScript encodings). The standard Unicode character descriptions are lengthy phrases instead of single words, making them unjoinable to the TeX names. For example, the Unicode standard provides the verbose entry for the code 0x00ab, "left-pointing double angle quotation mark". The PostScript names¹ in common use come from an assortment of sources, and they exhibit inconsistencies, conflicts, and ambiguities which frustrate computation of set projections and joins.

  ¹ There is an attempt at standardization in PostScript-style names from the Association for Font Information Interchange (AFII), but the standard is proprietary and on paper only. The names are serial numbers as opposed to abbreviated descriptions.

• A database of TeX encodings, which tells which characters appear in which TeX fonts. TeX uses a gumbo of no fewer than 28 (!) distinct encodings (Table 1). This number may come as a surprise, but has been hidden by the web of METAFONT source files. The sum of all TeX encodings constitutes a database of 3415 name-to-code pairs, each name being taken from the 1108 members of TeXunion. We designate this set of encodings (that is, a set of mappings of names to integers) as TeXencod.

  Table 1: TeX Single-Byte Encodings (TeXencod). Covers all Computer Modern, AMS math symbol, and Euler fonts. Each item maps a set of 128 or 256 character names to integers.

    Name      Description
    csc0      TeX caps and small caps (ligs = 0)
    csc1      TeX caps and small caps (ligs = 1)
    euex      AMS Euler Big Operators
    eufb†     AMS Euler Fraktur Bold
    eufm      AMS Euler Fraktur
    eur       AMS Euler
    eus       AMS Euler Script
    lasy      LaTeX symbols
    lcircle   LaTeX circles
    line      LaTeX lines
    logo      METAFONT logo
    manfnt    TeXbook symbols
    mathex2   TeX math extension
    mathit1   TeX math italic (ligs = 1)
    mathit2   TeX math italic (ligs = 2)
    mathsy1   TeX math symbols (ligs = 1)
    mathsy2   TeX math symbols (ligs = 2)
    msam      AMS symbol set A
    msbm      AMS symbol set B
    roman0    TeX Roman (ligs = 0)
    roman1    TeX Roman (ligs = 1)
    roman2    TeX Roman (ligs = 2)
    t1        LaTeX NFSS T1 encoding
    t1csc     T1 with small caps
    texset0   TeX "texset" encoding (ligs = 0)
    textit0   TeX text italic (ligs = 0)
    textit2   TeX text italic (ligs = 2)
    title2    TeX 1-inch capitals (ligs = 2)

    † A superset of eufm with two extra chars.

  As if the miasmic fog of encoding conventions were not confused enough, small-caps fonts present still more encoding problems. They represent an axis of variation that is hardly defined in the usual set of font parameters. We must consider small-caps characters to be different from their corresponding parents. If this is not done, then there is no way to compose virtual small-caps fonts from their lowercase counterparts, because we would have no way of knowing which characters are to shrink (a jumbled set of letters and accented letters) and which are not (punctuation and all the rest). Thus, for each encoding used anywhere by a small-caps font, we must make a duplicate small-caps version (altering the lowercase character names to small-caps names) of the encoding in the list of all the encodings. Thus we have a csc2 for roman2, t1csc for t1, and so on.

  To produce these duplicate encodings, we need a rule to convert lowercase names (both letters a-z and diacritical letters) to small-caps lowercase names and back. We have simply been appending "sc" to the name (this works because there are no collisions with names that happen already to end in "sc"). For example, a small-caps letter "a" is "asc".

  Adobe has been appending "small" to their names (this often causes character names to exceed a traditional limit of 15 characters in length), as in the MacExpert encoding [2]. This is done in an irregular manner by appending to the uppercase character names (for example, the Adobe small-caps for aacute is Aacutesmall, while ours is aacutesc); apparently someone mistook appearance for semantics. Furthermore, Adobe has a small-caps version of bare diacritics in their MacExpert encoding, although the diacritical character name is irregularly changed to an initial capital (for example, Adobe small-caps for acute is Acutesmall).

  The LaTeX T1 encoding, which was supposed to have been uniform for all DC fonts, also has an irrational aspect, in that the T1 encoding is overloaded when it is applied to both lowercase and small-caps fonts. Somewhere in the LaTeX macros is buried something tantamount to another small-caps encoding of T1, which indicates which codes are letters or diacritics. Another related limitation of T1 encoding is the lack of small-caps accents.

  A wealth of code positions does not exempt Unicode from pecksniffian absurdities. The Unicode committee will not provide encodings for a small-caps alphabet (small-caps being a matter of typography and not information content), although they provide encodings for some small-caps characters (which appear in older encoding standards subsumed by Unicode).

• A rationalization of the TeX character sets into their largest common subsets (Table 2). This represents the relations between the TeX encodings and the character subsets as organized in Knuth's METAFONT sources. The first item in each entry of Table 2 gives the TeX encoding as given in Table 1, known by the .mf source font file used to generate the font; the remaining items are the common subsets generated via the METAFONT source files of the same names.

  Table 2: TeXencod Decomposition into TeXpages. Set romanlsc is romanl with small-caps semantics; it does not actually appear in Computer Modern. Braced digits indicate factored suffixes.

    TeXencod
    Member       Covering TeXpages Members
    csc0         accent0 cscspu greeku punct romand romanp romanu
                 romspu romsub0 romanlsc
    csc1         accent12 comlig cscspu greeku punct romand romanp
                 romanu romspu romsub1 romanlsc
    mathex2      bigdel bigop bigacc
    mathit{12}   romanu itall greeku greekl italms olddig romms
    mathsy{12}   calu symbol
    roman0       accent0 greeku punct romand romanl romanp romanu
                 romspl romspu romsub0
    roman1       accent12 comlig greeku punct romand romanl romanp
                 romanu romspl romspu romsub1
    roman2       accent12 comlig greeku punct romand romanl romanp
                 romanu romlig romspl romspu
    texset       punct romand romanl romanp romanu tset tsetsl
    textit0      accent0 greeku itald itall italp italsp punct romanu
                 romspu romsub0
    textit2      accent12 comlig greeku itald italig itall italp italsp
                 punct romanu romspu
    msam         calu asymbols
    msbm         calu bsymbols xbbold
    ...          (... and so on for the rest ...)

  The relation set forth in Table 2 is not refined for the distinctions regarding the ligature setting. Certain of Knuth's encodings appear overqualified: mathex2, mathit{12}, and mathsy{12} do not vary with the ligs setting, although it is specified in the METAFONT driver file.²

  ² Also, the comments at the top of romsub.mf are in error about what happens when ligs = 2. Apparently, no one has tried any other ligs setting!

  These decompositions of the various TeX encodings may be considered close to the "greatest common" subsets, although we do not require a full decomposition here. To be completely decomposed, the {012}-numbered items on the right should be further decomposed into the unnumbered common set and the various numbered differential sets. The sets in the right column of Table 2 we will use below as the set known as TeXpages. We have not yet made the effort to elaborate the members of each TeXpages set, which is needed to compute the remaining work to complete the style axes of Computer Modern.

• A database giving the mapping of TeX fonts to their encodings as known above (Table 3). The table lists TeX font names and their encoding name; an N indicates a wildcard for any optical point size integer, excluding sizes of the same style already matched earlier in the table. If a new optical size for a font name is not in this table, the presumption should be that its encoding ought to be that of the lowest optical size in the table of the same name. The wildcard "*" matches any suffix, such as variations on style or optical size, for names which do not match higher in the table. We designate the set {cmb10, ..., eusm*} as TeXfonts.

  Table 3: Mapping of TeXfonts to TeXencod. N indicates an optical point size; asterisk a suffix wildcard.

    TeX Font   Encoding    TeX Font   Encoding
    cmb10      roman2      cmbsy5     mathsy1
    cmbsyN     mathsy2     cmbxN      roman2
    cmbxsl10   roman2      cmbxti10   textit2
    cmcscN     csc1        cmdunh10   roman2
    cmexN      mathex2     cmff10     roman2
    cmfi10     textit2     cmfib8     roman2
    cminch     title2      cmitt10    textit0
    cmmi5      mathit1     cmmiN      mathit2
    cmmib5     mathit1     cmmibN     mathit2
    cmmr10     mathit2     cmmb10     mathit2
    cmr5       roman1      cmrN       roman2
    cmslN      roman2      cmsltt10   roman0
    cmssN      roman2      cmssbx10   roman2
    cmssdc10   roman2      cmssiN     roman2
    cmssq8     roman2      cmssqi8    roman2
    cmsy5      mathsy1     cmsyN      mathsy2
    cmtcsc10   csc0        cmtexN     texset0
    cmtiN      textit2     cmttN      roman0
    cmu10      textit2     cmvtt10    roman2
    lasy*      lasy        lcircle*   lcircle
    line*      line        logo*      logo
    manfnt     manfnt      msam10     msam
    msbm10     msbm        dccscN     t1csc
    dctcscN    t1csc       dc*        t1
    euex*      euex        eufb*      eufb
    eufm*      eufm        eurb*      eur
    eurm*      eur         eusb*      eus
    eusm*      eus

• A fuzzy-matching operator which, when joining, selecting, and projecting the above databases, can resolve the redundancy, synonyms, and ambiguities in the character names and their composition. Here is an inventory of issues known to date:

  - bar (vertical bar) vs. brokenbar (vertical broken bar)
  - macron vs. overscore
  - minus vs. hyphen vs. endash vs. sfthyphen vs. dash
  - grave vs. quoteleft in code 0x60
  - space (0x20) vs. nbspace (0xa0) vs. visiblespace vs. spaceopenbox vs. spaceliteral
  - rubout in code 0x7f
  - ring vs. degree
  - dotaccent vs. periodcentered vs. middot vs. dotmath; Zdotaccent vs. Zdot, etc.
  - quotesingle vs. quoteright
  - slash vs. virgule
  - star vs. asterisk
  - oneoldstyle vs. one, etc.
  - diamondmath vs. diamond vs. lozenge


  - openbullet vs. degree
  - nabla vs. gradient
  - cwm vs. compoundwordmark (a T1 problem) vs. zeronobreakspace
  - perzero vs. zeroinferior vs. perthousandzero para perthousand
  - slash vs. suppress vs. polishlslash, as in Lslash and Lsuppress
  - Ng vs. Eng (and ng vs. eng)
  - hungarumlaut vs. umlaut, as in Ohungarumlaut vs. Oumlaut
  - dbar vs. thorn
  - tilde vs. asciitilde
  - tcaron transmogrifies to tcomma, et al.
  - mu vs. mu1 vs. micro, code 0xb5
  - Dslash vs. Dmacron, code 0x0110; dslash vs. dmacron, code 0x0111
  - florin (not in Unicode) vs. fscript, code 0x0192
  - fraction vs. fraction1 vs. slashmath, code 0x2215
  - circleR vs. registered, circlecopyrt vs. copyright
  - arrowboth vs. arrowlongboth
  - aleph vs. alef vs. alephmath
  - Ifractur (eus, mathsy1, mathsy2) vs. Ifraktur (Unicode) vs. Rfractur (eus, mathsy1, mathsy2, and Unicode); the spelling should uniformly be "fraktur"
  - smile vs. smileface vs. invsmileface vs. Unicode 0x263A (unnamed)
  - Omega vs. ohm, Omegainv vs. mho
  - names not starting with letters: 0script (0x2134), 2bar (0x01bb)

  We represent these items in a text file having the following format: each line of the file gives character name synonyms, one group of synonyms per line. Any of the names on one line are synonyms, and can be freely exchanged. For example, the line "visiblespace spaceliteral" means that the character names visiblespace and spaceliteral are completely equivalent names. (The former was used in the TeX DC fonts [10], while the latter was the PostScript name used in the Lucida Sans Unicode TrueType font of Windows NT.)

  A special case of "synonym" is the Unicode fall-back. This is a code number which is a "synonym" for TeXunion members not in Unicode, and is our assignment of the Unicode "private zone" codes for the misfits. For example, the line "ff 0xf001 0xfb01" (Microsoft fonts have an undocumented usage like this) means that the character ff (which is a ligature not to be found in Unicode) carries a recommended private-zone code assignment of 0xf001 or 0xfb01. One or more such recommended codes may appear, in order of preference. In resolving a private-zone conflict, a font-building program may take the recommended codes in order until a non-conflicting code is found. Only after the recommended codes are exhausted should the program make a random private-zone assignment. Codes may be given in decimal, octal (leading 0), or hex (leading 0x) formats. Programs using these tables take care to distinguish character codes (which contain only hexadecimal digits if starting with 0x, otherwise only octal digits if starting with a leading zero, otherwise only decimal digits) from names (anything else, including names which start with digits). Lucida Sans Unicode contains some names like "2500" (for code position 0x2500); if this presents a problem we might have to prefix a letter to these names.

  A name may appear in more than one synonym group, although such groups do not join within the matching algorithm. The first name in any group is the "canonical" name. The canonical name is the name which should be output by programs which compute set operations on the encoding sets. This helps to achieve a "filtered" result which does not contain troublesome synonyms. For example, if the synonym file contained the lines:

    joseph jose yosef josephus 0xfb10
    joseph joe joey

  the names jose, yosef, and josephus would have fall-back code 0xfb10, the names joe and joey would have no fall-back code, and all the names above would invariably be transformed to joseph on output.

  The TrueTeX filter accessory program, joincode, performs a relational join on two font encoding sets, making a new encoding. It resolves the issues of a given synonym file according to the rules we have stated.

• A database of non-TeX encodings, which tells which characters appear in various encoding standards such as ANSI or Unicode. This list presently constitutes a database of 3523 entries from a set of 1814 characters. Some of the common examples of commercial importance today are given in Tables 4 and 5. Having these sets allows us to export virtual fonts for any of the encodings represented, so we can virtualize non-Unicode, non-TeX fonts.

  Table 4: Some Non-TeX Encodings.

    Name       Description
    ase        Adobe PostScript "AdobeStandardEncoding", the built-in
               encoding of many Type 1 fonts
    belleek    Belleek [8] scheme for LaTeX T1 encoding on 8-bit TrueType
    belleekc   Belleek with small-caps
    latin1     Latin 1 (ISO 8859-1)
    latin1ps   Adobe PostScript "ISOLatin1Encoding" (which is not ISO
               Latin-1!)
    mac        Macintosh
    macexp     Adobe PostScript "MacExpert" encoding (used by Acrobat in
               PDF [2]), containing ligatures, small caps, fractions,
               typesetting niceties
    mre3       Adobe PostScript "MacintoshRomanEncoding" (used by Acrobat
               in PDF), a subset of "mac", which omits some math
               characters and the Apple trademark
    pdfdoc     Adobe PostScript "PDFDocEncoding" (used by Acrobat in PDF),
               an ad-hoc encoding used in PDF outline entries, text
               annotations, and Info dictionary strings, consisting of a
               remapped set = {ase ∪ mre ∪ wae}
    unicode    16-bit Unicode (Windows NT and 95, AT&T Plan 9)

  Table 5: Some Encodings Used in Windows.

    Name       Description
    wae        Adobe "WinAnsiEncoding" (used in Acrobat in PDF), "winansi"
               with bullets in the .notdef positions, some semantic
               synonyms
    winansi    Windows ANSI 8-bit (US/Western Europe code page) (Includes
               certain non-ANSI characters in 0x80-0x9f range of codes.)
    winansiu   Windows ANSI Unicode (US/Western Europe code page) (Same
               characters as are present in the winansi encoding, except
               the non-ANSI characters are in their Unicode positions.)
    winmultu   Windows Multi-Lingual Unicode (Windows 95/NT) (655
               characters supporting all Latin alphabets, Greek, Cyrillic,
               OEM screen characters.)
    winNNNN    Windows (for code pages numbered NNNN)

• A TrueType-font-builder that takes the converted outlines from various TeX fonts, organizes sets of them based on an output encoding, and builds binary TrueType fonts from the reorganized glyph data.

  A sub-tool for the font-builder incorporates a redundancy-elimination feature that allows you to specify a table listing which characters in a given TeX font may be taken from other TeX fonts without repeating a costly METAFONT glyph conversion. One example of such a redundancy is how DC fonts largely replicate the Computer Modern fonts; it would be a waste of effort to convert the glyphs twice. Another example is that many font variants are slanted versions of the upright face, and the geometric slant is easily applied to an already-converted glyph rather than slanting in METAFONT and repeating the glyph conversion. This technique is also used to compose accents and letters for "purely" accented characters (where the accent and letter do not overlap), since the MetaFog conversion is applied only to the accent part of such glyphs, allowing the redundant letter conversion to be done only once.

  Another sub-tool builds these redundancy tables by comparing the encoding tables for sets of fonts against a target font. For example, a DC font combines punctuation and symbol characters spread across several Computer Modern fonts.
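  The comparison performed by such a sub-tool can be pictured with a small sketch. This is only an illustration under stated assumptions, not the TrueTeX tool itself: the function name build_redundancy_table, the toy font names fontA and fontB, and the representation of an encoding as a plain set of character names are all invented for the example, and synonym resolution is assumed to have been done beforehand.

```python
from typing import Dict, Iterable, Mapping, Optional, Set


def build_redundancy_table(
    target: Iterable[str],
    sources: Mapping[str, Set[str]],
) -> Dict[str, Optional[str]]:
    """Map each target character name to a donor font, or None.

    The donor is the first source font, in the order given, whose
    encoding already contains the name. None means no source font has
    the character, so its glyph would still need its own conversion.
    """
    table: Dict[str, Optional[str]] = {}
    for name in target:
        table[name] = next(
            (font for font, names in sources.items() if name in names),
            None,
        )
    return table


if __name__ == "__main__":
    # Toy data only: hypothetical fonts, not real Computer Modern contents.
    sources = {
        "fontA": {"A", "exclam"},
        "fontB": {"exclam", "less"},
    }
    target = ["A", "exclam", "less", "aacute"]
    for name, donor in build_redundancy_table(target, sources).items():
        print(name, "<-", donor if donor else "(convert anew)")
```

  In this sketch the order of the source fonts acts as a preference order, so a character present in several fonts is taken from the earliest one listed. A real tool would presumably also record slanting or accent composition, which this example leaves out.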


Rationalizing TEX fonts in Unico de sets Table 6: DC fonts in LATEX (Rev. 1/95). The \funny" fonts (Fibonacci Roman, etc.) are Let us consider the shu²ing and dealing needed to omitted. This list is made by examining names in reorganize Computer Modern into a Unicode encod- *.fd from the LATEX distribution. ing. With the luxury of thousands of code positions, we can un-do the \scattering" of characters amongst Font 5 6 7 8 9 10 12 17 the TEX fonts. For example, the math italic set dcb ² ² ² ² ² ² ² ² (mathit 12 ) contains the regular (not italic) low- dcbx ercase Gfreegk letters. Conversely, we are going to ² ² ² ² ² ² ² dcbxsl have to scatter a few T X fonts that happened to ² ² ² ² ² ² ² E dcbxti combine dissimilar styles into one 7-bit font, such as ² ² ² dccsc the math symbol fonts (mathsy 12 ) which contain ² ² ² f g dcitt calligraphic capitals. ² ² ² ² ² dcr In set-theoretic terms, the rationalization task ² ² ² ² ² ² ² ² dcsl involves the following steps: ² ² ² ² ² dcsltt ² ² ² ² Begin with the union of all TEX characters, the dcss ² ² ² ² ² ² set we have called TeXunion . Remember that dcssbx ² this is the set of character names, not the glyphs dcssdc themselves. dcssi ² ² ² ² ² ² dctcsc Partition this union set into the largest subsets ² ² ² ² dcti which do not cross encodings. This partitioning ² ² ² ² ² ² dctt is a set of proper subsets of TeXunion; we ² ² ² ² dcu call this set TeXpages. For example, all ² ² ² ² ² ² the uppercase letters A{Z make such a subset. A combination of all upper- and lowercase letters A{Z and a{z do not, because the In our system, we actually produce textual small-caps fonts do not contain the lowercase versions of the binary fonts and convert them letters. These subsets are equivalent to Knuth's to Type 1 and TrueType formats with separate Computer Modern METAFONT \program" tools. This allows a general conversion to be ¯les, because this was the highest level of source optimized for the ultimate binary format. 
For ¯le nesting in which he did not make conditional example, Type 1 glyphs require knot-pivoting, the generation of characters. following by combing, to insert extrema tangent Encode the TEX character union set for a new, ² points. The hinting methods also di®er between universal 16-bit encoding. That is, we invent Type 1 and TrueType. a mapping of TeXunion members to unique Once a binary version of a font is prepared, 16-bit integer codes. Most of the members containing all the glyphs, a re-encoding tool of TeXunion appear in Unicode and so (TrueTEX accessory program ttf_edit [17], have a natural encoding already determined. which is a stack-oriented TrueType font encod- For the TeXunion members not in Unicode ing editor) must be applied to ¯nish the font (which includes all the small-caps letters), we for real-world use. The re-encoding stage not shall promulgate (by ¯at) assignments to the only re-encodes, but can optionally adjust the Unicode private-zone codes. We designate this metrics and kerning information. By making subset of TeXunion as TeXpzone ; this subset these aspects \afterthoughts" we can ¯ne-tune ¯nds its concrete representation in the private- fonts without going back into the detailed con- zone codes expressed in the character synonym version process. The re-encoding stage can also table. We have the relation upgrade any 8-bit-encoded TrueType font to an (Unicode TeXpzone) TeXunion , [ ¾ arbitrary Unicode encoding, which is important since Unicode contains many characters not since many commercial font editors can only used in TEX. We can compute a mapping: output 8-bit TrueType fonts. TeXunion (Unicode TeXpzone ) 7! [ A notion of what T X fonts we want to convert. ² E by matching character names from the left If we consider the DC fonts a good target, we to the names on the right; in this way we come up with quite a list (Table 6). arrive at a Unicode code number for each TEX

1008 TUGb oat, Volume 0 (1996), No. 0 | Proceedings of the 1996 Annual Meeting Extending TEX for Unicode

character. We designate this mapping (a 16-bit number for each member of TeXunion) as TeXeru ("TEX encodings rationalized to Unicode", rhymes with "kangaroo"). This mapping is the key result of rationalizing Computer Modern into Unicode.

• We can think again of a large table having, on one axis, the TEX fonts, on the other, the members of TeXpages, and bullets wherever Computer Modern implements METAFONT glyphs. This table will be sparsely and irregularly populated. The sparseness reflects the fact that the TEX fonts cover a wide range of characters, while the variations in style mostly are typographic distinctions on alphabets and punctuation; in other words, TEX provides more than a few symbol sets, in contrast to the simple ANSI/Symbol set distinction in 8-bit Windows fonts. The irregularity in this imaginary table results from the ad-hoc arrangement of TeXpages among TeXfonts. The sparseness is not a deficiency, but we ought to have some goal in mind for the rational extension of Computer Modern and the other TEX fonts to populate areas of this table for the sake of regularity. Realizing this goal would require drafting of new METAFONT code and translation to outlines. This is similar to how the DC font project extended Computer Modern to the rational T1 encoding.

• At this stage we are ready to determine a list of actual Computer Modern Unicode fonts which will cover TeXfonts. While Haralambous retained 8-bit PK fonts as the actual fonts for virtual Unicode fonts [7, 6], we will create a converse realization, namely Unicode TrueType fonts as the actual fonts for virtual CM, T1, or UT1 encodings.

  Using the imaginary table just described, we can take the union of row subsets such that columns are not overlapped. If we want to maintain stylistic uniformity within individual fonts, we merge rows subject to a personal decision as to which rows "belong together" in a stylistic sense. On the other hand, if we want to minimize the number of actual fonts and don't mind different styles in a single font, we can make a "knapsack" optimization to pack the rows as tightly as possible. (Indeed, if we discard the Unicode conformity, we could put all of Computer Modern into a single Unicode font!) In any case, this collapsing of rows involves imprecise judgments to arrive at an optimized reduced set of fonts.

  Implicit in this reduction is the factoring of wildcarded optical sizes that was introduced in TeXfonts; we call this reduced set of fonts TeXinUni, which will have a similar wildcarding to its parent TeXfonts, but fewer members in parallel with the reduction.

• To produce each real Unicode font (a member of TeXinUni), we assemble the glyphs and metrics from TeXfonts and install them via TeXeru into each code of the mapped-to TeXinUni member.

• We must finally produce Omega virtual fonts (that is, .xvp files) which will map 8-bit DVI codes from the old TEX fonts into TeXeru codes in TeXinUni members. For this we use the TrueTEX metric exporter to generate an .xvp file, and XVPtoXVF to convert this to an .xvf file; the .xfm file also produced contains the same information as the METAFONT .tfm and may be discarded if only TEX82 is to be used for formatting.

Generating Unicode virtual fonts for non-TEX fonts

Let us consider a converse task: instead of converting single-byte-encoded Computer Modern fonts into Unicode fonts, let us assume we have a Unicode font in TrueType form, and want to make it usable with TEX or Omega. To use a font, TEX (and Omega) require a .tfm (or .xfm) metric file and a .vf (or .xvf) virtual font file. The virtual font is necessary only if a remapped encoding or composition is needed (usually the case).

To generate metric and virtual font files for Unicode fonts in Windows, TrueTEX provides a File + Export Metrics item which takes the user through several steps which illustrate the elements of such a translation:

• First, the user selects the font from the Windows standard font-selection dialog. For example, in Windows, standard fonts include Arial (a Helvetica clone), Courier, and Times New Roman (a Times Roman clone), together with their bold and/or italic variations. Windows will also install other TrueType fonts or (with Adobe Type Manager) Type 1 fonts. After the user selects a font, TrueTEX has a "font handle" with which it can access all the geometric information needed to calculate global and per-character metric quantities for the font.

• Second, the user must select names for the output .xvp file. Font names in Windows are verbose strings containing several words (such as "Times New Roman Regular (TrueType)"), while TEX insists on a single-word alphabetic name; therefore TrueTEX selects a TEX-style name for the font based on the TrueType file name (such as "times"). The metric output in this case would be a file times.xvp.

• Third, the user must specify the "input" and "output" encodings for the virtual font. The input encoding is the encoding of the actual font, typically ANSI or Unicode. The output encoding is the virtual encoding which the user desires to construct for TEX's point of view, and is typically a member of TeXencod. TrueTEX uses encodings in the form of .cod files (each line contains a hex code field followed by the character name field) or .afm (Adobe Font Metric [2]) files.4 To select an encoding, the user selects an item from a list which TrueTEX presents, each item giving a description of the encoding (for example, "TEXRoman with ligs = 2" corresponds to the roman2.cod file). The user can also browse for encoding files instead of selecting from the canned list. The user can edit custom encoding files (which are just text files in afm format), and thereby gains complete flexibility of input versus output encoding in the virtual fonts, including automatic production of composites for missing input characters. The ttf_edit [17] program originates afm encoding tables from existing TrueType fonts, allowing maximal compatibility with randomly-encoded fonts.

• User input is now complete. TrueTEX begins analysis of the information provided by forming in-memory encoding tables from the encoding files (using sparse-array techniques to manage large, sparse code ranges). TrueTEX sorts and indexes the tables for fast content-addressability by either code or character name, assembles the global metrics in xvp terms for the font, and visits each input code in the font to build a table of per-character metrics and ligatures. xvp file building may now begin.

• For each output character name, TrueTEX determines if an exact match exists to an input name, and thus to an input code. If there is such an exact match, TrueTEX emits xvp commands which give the character's metrics and which re-map the TEX PUT/SET commands to the Unicode positions.

• For output characters which have no exact match to input characters, TrueTEX invokes the "composition engine." If the composition engine can compose or substitute a glyph for the character, it emits the xvp commands for the metrics and other actions. If the composition engine is "stumped", TrueTEX emits an xvp comment to note that the character is unencoded.

4 .afm files need not contain metrics; they can simply define an encoding for a dummy font, using only the C, CH, and N fields of the CharMetrics table. We thus maintain compatibility with other afm-reading software and avoid inventing yet another file format.

Note that virtual fonts can use more than one input font to produce a virtual output font. This would allow, for example, a text font, an expert font (such as might contain ligatures), and a symbol font (such as might contain math symbols) to contribute to a single TEX Unicode font. Another use for this technique would be the assembling of a Unicode font from the old bit-mapped PK fonts. TrueTEX supports all of the TEX and Omega-extended virtual font mechanisms, but does not yet directly support multiple input fonts for metric export (or bit-mapped fonts, for that matter), although an expert user can merge multiple virtual-property-list files to produce such a virtual font.

Composing missing characters

The Unicode and TeXunion sets overlap only partially (most, but not all, members of TeXunion appear in Unicode), but the virtual font mechanism allows users to create virtual TeXunion characters missing from a Unicode font with various composition or substitution methods. TrueTEX uses this technique to create completely populated virtual fonts when the underlying TrueType fonts are missing accented characters or ligatures.

To allow for easy upgrading, the "composition engine" in TrueTEX uses a user-modifiable script in a PostScript-like language to control the composition and substitution process. By changing the script, the user can add new composition methods or substitution rules, or adapt the methods to various typographic conventions. TrueTEX, using its own mini-PostScript interpreter, interprets this script at metric-export time, which means that users who speak PostScript and know a bit of font design can customize the composer. A good script yields a much better TEX virtual font, since commercial fonts are typically missing characters that TEX considers important, and the script can fill in most of the missing pieces.

When the composition script receives control from TrueTEX, all the encoding and metric information for the font and input and output encodings are defined as PostScript arrays and dictionaries.
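The name-matching pass of the export steps (exact match first, then the composition engine, then an "unencoded" note) can be sketched roughly as follows. This is only an illustration in Python, not TrueTEX's code: the function names, the tuple-based plan that is returned, and the suffix-based splitting of accented names are assumptions made for the example, and no real xvp syntax is emitted.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical list of accent-name suffixes (names as in Adobe-style glyph naming).
ACCENTS = ("dieresis", "acute", "grave", "circumflex", "caron", "cedilla",
           "breve", "macron", "ring", "ogonek", "dotaccent", "hungarumlaut")


def split_accented(name: str) -> Optional[Tuple[str, str]]:
    """Split a name such as 'odieresis' into ('o', 'dieresis'), if it looks accented."""
    for accent in ACCENTS:
        if name.endswith(accent) and len(name) > len(accent):
            return name[: -len(accent)], accent
    return None


def plan_virtual_font(input_enc: Dict[int, str],
                      output_enc: Dict[int, str]) -> List[tuple]:
    """For each output code, decide how the virtual character would be produced.

    input_enc  maps codes of the actual font to character names.
    output_enc maps codes of the virtual (TEX-side) encoding to character names.
    """
    # Index the input encoding by name; the first code seen for a name wins.
    by_name: Dict[str, int] = {}
    for code in sorted(input_enc):
        by_name.setdefault(input_enc[code], code)

    plan: List[tuple] = []
    for out_code in sorted(output_enc):
        name = output_enc[out_code]
        if name in by_name:
            # Exact name match: remap the output code to the input code.
            plan.append((out_code, "map", by_name[name]))
            continue
        parts = split_accented(name)
        if parts and parts[0] in by_name and parts[1] in by_name:
            # Both letter and accent exist in the input font: compose them.
            plan.append((out_code, "compose", by_name[parts[0]], by_name[parts[1]]))
        else:
            # The composer is "stumped": only record that the name is unencoded.
            plan.append((out_code, "unencoded", name))
    return plan
```

A real exporter would turn each plan entry into property-list commands and would also need the glyph geometry to position the accent; the sketch stops at the decision itself.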


The standard composition script in TrueTEX implements the following techniques:

• Accent-plus-letter composition: if the name implies that the character is an accented letter, the script decomposes the name into the letter and accent components, and (when the letter and accent exist in the input font) uses the geometric information and typographic conventions to overlay the accent onto the letter in a virtual accented character, as shown in Table 7. Since the composer has detailed geometric information on the glyph shape, which is more elaborate than the bounding-box metrics TEX uses, it can do a careful job of placing accents.

• Ligature composition from sequence of letters: A ligature character (not to be confused with the ligature rules of the exported TEX metrics, a different topic) such as the T1 character "SS" will not usually exist in a TrueType font. Code positions for ligatures are not part of the Unicode standard,5 so even common ligatures are often not present. The composition script forms these by concatenating the component letters within the bounding box of the TEX character. This is applied to the ligatures: ff, fi, fl, ffi, ffl, IJ, ij, SS, Æ, æ, Œ, and œ.

• Remapping of certain names: The synonym table shows the problems of character names which are not standardized. An ad-hoc section of the composition script fixes up any ambiguities by recognizing ambiguous names and making the appropriate substitutions.

• Future extensions: Several more exotic composition methods are possible for an upgraded composition script. A novel idea is an "emergency fall-back" character generator: as a last resort for a missing character, the script could have a table of low-resolution renderings for all of TeXunion, consisting of (for example) an 8 × 16 dot-matrix or a plotter-style stick font; virtual font commands would render these with rules. In this method we could render any TEX document in a crude but accurate fashion without any fonts or \special's at all!6

  Another idea would use the ability of virtual fonts to call upon the TEX \special command. A Bézier curve special could draw and fill glyphs without the need for operating system support for fonts. This would not be efficient, and hinting would be missing on low-resolution devices, but it would place all the scalable font information in an XVF file.

5 Unicode does contain ligatures which are phonetic letters in certain languages, but this does not include typographic ligatures such as the f-ligatures.

6 This could solve the problem of rendering TEX documents in HTML browsers.

Table 7: Composable Accent Characters.

Name          Position       Example
umlaut        top-center     ö
acute         top-center     ó
breve         top-center     ŏ
caron¹        top-center     ǒ
cedilla       bottom-center  o̧
circumflex    top-center     ô
comma         top-right      Ľ
dieresis      top-center     ö
dotaccent     top-center     ȯ
grave         top-center     ò
hungarumlaut  top-center     ő
macron        top-center     ō
ogonek        bottom-right   ¹
period        top-center     ȯ
ring          top-center     ²

¹ For D/d/L/l, changes to comma at top-right.
² Accent is not present in the Computer Modern fonts used in this portable document.

Projecting ligature rules into TrueType fonts

The TrueType fonts in Windows do not supply any ligature rules such as are contained in Computer Modern. To export metrics containing the usual TEX ligature rules, TrueTEX considers the rules in Table 8 when exporting the global vpl (xvp) metrics, when the target ligatures exist in the font, or when the ligatures can be produced by the composition engine.

Supporting metric export formats

TrueTEX supports both .vpl (TEX Virtual Property List) and .xvp (Omega Extended Virtual Property List) file formats when exporting font metrics. This is more than merely a variation in format; when exporting to .vpl format, the encodings are truncated to 8 bits, so that the composition process for missing characters will likely be more intensive. TEXware programs VPtoVF and XVPtoXVF translate the property list files to their respective .tfm, .xfm, .vf, and .xvf binary formats for use with


Table 8: Ligature Rules Applied to Exported Fonts.

First       Second      Result          Description

Dashes
hyphen      hyphen      endash          -- to endash
endash      hyphen      emdash          endash- to emdash

Shortcuts to national symbols
comma       comma       quotedblbase    ,, to quotedblbase
less        less        guillemotleft   << to left guillemot
greater     greater     guillemotright  >> to right guillemot
exclam      quoteleft   exclamdown      !` to exclamdown
question    quoteleft   questiondown    ?` to questiondown

F-ligatures
f           f           ff              ff to ff ligature
f           i           fi              fi to fi ligature
f           l           fl              fl to fl ligature
ff          i           ffi             ff ligature + i to ffi ligature
ff          l           ffl             ff ligature + l to ffl ligature

Paired single quotes to double quotes
quoteleft   quoteleft   quotedblleft    `` to quotedblleft
quoteright  quoteright  quotedblright   '' to quotedblright
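To make the effect of the rules in Table 8 concrete, the following Python sketch applies them to a sequence of character names. It is an illustration under stated assumptions rather than a description of TrueTEX or of the TEX formatter internals: the names `LIGATURE_RULES`, `exportable_rules`, and `apply_ligatures` are invented here, and "available" simply means the set of target names that exist in the font or can be composed.

```python
from typing import Dict, Iterable, List, Set, Tuple

Rule = Tuple[str, str]

# The (first, second) -> result pairs of Table 8, by character name.
LIGATURE_RULES: Dict[Rule, str] = {
    ("hyphen", "hyphen"): "endash",
    ("endash", "hyphen"): "emdash",
    ("comma", "comma"): "quotedblbase",
    ("less", "less"): "guillemotleft",
    ("greater", "greater"): "guillemotright",
    ("exclam", "quoteleft"): "exclamdown",
    ("question", "quoteleft"): "questiondown",
    ("f", "f"): "ff",
    ("f", "i"): "fi",
    ("f", "l"): "fl",
    ("ff", "i"): "ffi",
    ("ff", "l"): "ffl",
    ("quoteleft", "quoteleft"): "quotedblleft",
    ("quoteright", "quoteright"): "quotedblright",
}


def exportable_rules(available: Set[str]) -> Dict[Rule, str]:
    """Keep only the rules whose target ligature is available in the font."""
    return {pair: result for pair, result in LIGATURE_RULES.items()
            if result in available}


def apply_ligatures(names: Iterable[str], rules: Dict[Rule, str]) -> List[str]:
    """Scan left to right; a ligature result may combine again with the next name."""
    out: List[str] = []
    for name in names:
        if out and (out[-1], name) in rules:
            out[-1] = rules[(out[-1], name)]
        else:
            out.append(name)
    return out
```

With the full rule set, the names f, f, i collapse to ffi and three hyphens collapse to emdash; if the font offers only ff, fi, and fl, the filtered rules leave f, f, i as ff followed by i.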

TEX and Omega. TrueTEX launches the appropriate translator after the property-list export is complete.

TrueTEX metric export also supports the older .pl (TEX Property List) metric file format, and the companion program PLtoTF, should it be needed for use with older TEX software. In this case the user can specify only an input font encoding, and the property list reflects this encoding as applied to the TrueType font selected, without virtual remapping.

While exporting .xvp files will connect the TrueTEX previewer to Windows' Unicode fonts, the TEX82 formatter requires .tfm metric files, not Omega .xfm files. If the output font has an 8-bit encoding, the resulting virtual font is nevertheless compatible with the original TEX formatter's 8-bit character codes, and will not require the Omega formatter. To create a .tfm file for such a font, a TrueTEX filter program xvptovpl truncates the virtual codes in the .xvp file and produces a truncated .vpl file, and via VPtoVF, a .tfm file for use with TEX. The .vf file created in this process is discarded, since it does not properly map 8-bit characters to Unicode. The .xfm and .tfm files produced by this process will contain the same information; the .xfm format is needed only if Omega is to be used. TrueTEX uses the .xvf file to map the 8-bit TEX characters to Unicode positions.

Implementing sparse metric tables

In implementing programs which use metric data, we must take care to apply sparse-matrix techniques to avoid enormous memory demands from nearly-empty font-metric tables. Sixteen-bit encoded fonts are typically sparsely populated. For example, the Windows NT text fonts contain about 650 characters each; most codes are in the range 0–0x2ff, with some symbols in the 0x2000 vicinity and a few odd characters in the private zone at 0xf001 or 0xfb01. We would expect such a segmented locality in a typical font.

One technique is to use a segmented table with binary-search lookup; this is close to the method used in TrueType fonts. A hash table for the two-byte keys may be used instead of the binary search. Segmented tables will require the least storage, at the expense of a possible hashing performance problem in the event of degenerate tables. Since all Unicode-capable operating systems are advanced enough to support virtual memory, the performance risk does not justify the memory savings.

TrueTEX uses a 2-level pointer technique: the metrics for a 16-bit code table consist of a table of 256 pointers to metric tables with 256 entries each. In this way a typical font having perhaps 6 or 7 contiguous code populations has very little wasted space. The worst-case and best-case performance are both acceptable. The lookup time is accelerated by avoiding a null-pointer test by pointing unencoded pages to a dummy table of zero-width metrics.

A time-bomb of a problem looms with Omega's .xfm format [12], which recklessly ignores any sparse-array issues. An .xfm file, like the older .tfm format, keeps an unpacked char_info array which spans the smallest to largest codes (that is, the interval [bc, ec]). Since the Unicode private-zone population typically extends through 0xfb00, practically all fonts will have a char_info of about 126 KB, almost all wasted space. This will result in a typical .xfm file being 130 KB, instead of about 4 KB that a simple sparse-array technique would provide. It is imperative that we upgrade the .xfm format and xfm-reading programs to better handle sparse encodings.7

7 The .vf and .xvf virtual font formats have always packed sparse per-character information, so they need no such attention.

Summary

Let us review the areas of development needed to extend all of TEX to Unicode:

• Extending the .tfm file format and its run-time forms to large, sparsely populated fonts, without sacrificing backward compatibility and without exploding file lengths. We have seen that the .xfm format can represent the information, although it needs improvement for sparsely populated fonts.

• Extending Computer Modern and other meta-fonts to fully populate the appropriate Unicode positions. A complete Unicode text font requires about 500 symbols. While it is unlikely that all the styles of Computer Modern will receive the attention to fill the tables completely, we can at least insert legible placeholders.

• Creating a formal database of TEX character names, joinable to the Unicode official names. Some standard is necessary for any development, and there is no reason to favor anyone's favorite names. What is crucial is that the registry be initiated now, so that TEX software authors have an early start on making interchangeable fonts and documents. While TEX users cannot themselves dictate the standard Unicode names, we can at least make a stand-in version for our own use (since none seem to exist at present), and if an acceptable set of Unicode names comes along, we can adopt it later.

• Creating a formal database of all 28 TEX font encodings and their 31 greatest common subsets, joinable to Unicode and other encoding standards. The TrueTEX tables are available to all for examination and use [16]. Once these are improved by public usage and scrutiny, they should be adopted as a formal standard for TEX.

• Identifying and assigning non-Unicode TEX character names to the Unicode private zone, thereby promoting inter-operability of Unicode TEX implementations. These should be concretely represented in a synonym table, which also needs to be published. There are many potential conflicts, and taking counsel from as many different users of TEX as possible is the only way to maximize the compatibility of the result. The Omega project has already started to stake out claims on the virgin Unicode real estate [4, Table 4, page 425] for Ω-Babel. There are no doubt synonyms and ambiguities outside our own experience; one can only hope there is sufficient room for all interested parties.

• Extending DVI translators to accommodate the extended .tfm, .vf, and .dvi formats, including sparse-array techniques for efficient run-time performance. The Omega project has issued this call to "DVI-ware developers" [4, page 426, Conclusion], although with surprising aplomb for the implications. We hereby respond with our implementation in TrueTEX, and invite others to build on our experience.

• Promulgating the ongoing TeXeru, the TEX-in-Unicode mapping, based on a seasoned registry. An ongoing authority for additions and corrections will be vital. This authority will be responsible for registering new TEX character names and avoiding Unicode conflicts.

• Changing the plain TEX and LATEX macros to accommodate the 16-bit encoding extensions, while maintaining backward compatibility from a single source. This is a tall order, and one we have not touched.

• Extending \special handling for 16-bit character sets. Now we open up the carousing command of TEX to a whole new vista of revelry, with the gift of tongues. This is another item that we shall put off for now.

• Implementing TEX and DVI translation user interfaces in selectable languages. While Omega processes in Unicode, it talks to the user in the old 8-bit fashion. Perhaps it is a bit much to expect a Web2C TEX change file to incorporate wchar_t and other Unicode constructs of the C programming language. But DVI translators for Windows NT and other Unicode-capable platforms should have this designed in from the start. A properly designed application can be reimplemented for another language by any non-programmer who knows the application and can translate the messages; no programming or recompilation is required.

• Establishing provisions for orderly extension of the fonts and character sets, so that new TEX fonts, characters, and encodings may be incorporated into later versions. Having made the effort to retrofit METAFONT output for an altogether different way of encoding, we would hope that font designers recognize the shortcomings of the 7- and 8-bit encodings and keep the promises of Unicode in mind.

• Rationalizing the TEX fonts into orthogonal styles and weights (such as cmmb10 and cmmr10, for example). As TEX users we don't care about this, but if Computer Modern is to be accepted in non-TEX applications, the style axes will have to be fully varied and populated along the conventional ranges.

• Providing a means for creating virtual fonts for non-TEX Unicode fonts in TrueType or Type 1 format. Although this capability is available now only in the commercial TrueTEX Unicode edition, a new TEXware stand-alone tool could interpret TrueType font files (or whatever typeface technology is supporting Unicode rendering) and join the encoding and other information into an .xvp file.

References

[1] Adobe Systems Incorporated. Portable Document Format Reference Manual. Reading, Mass.: Addison-Wesley, 1993.

[2] Adobe Systems Incorporated. Adobe Font Metrics (AFM) File Format Specification, Version 4.1, October 1995. [Published in PDF and PostScript form at ftp://ftp.adobe.com.]

[3] Fairbairns, Robin. "Omega---Why Bother with Unicode?" TUGboat 16,3 (1995), pages 325–328.

[4] Haralambous, Yannis, John Plaice, and Johannes Braams. "Never Again Active Characters! Ω-Babel." TUGboat 16,4 (1995), pages 418–427.

[5] Haralambous, Yannis. "Ω, a TEX Extension Including Unicode and Featuring Lex-like Filtering Processes." Proceedings of the Eighth European TEX Conference, Gdańsk, Poland, 1994, pages 153–166. [There is an Omega Web page at http://www.ens.fr/omega. See also the Omega FTP site [12].]

[6] Haralambous, Yannis, and John Plaice. "Ω + Virtual METAFONT = Unicode + Typography (First Draft)." ftp://nef.ens.fr/pub/tex/yannis/omega/cernomeg.ps.gz, January 1996.

[7] Haralambous, Yannis. "The Unicode Computer Modern Project." [A document link on the Omega Web page [5].]

[8] Kinch, Richard. "Belleek: TEX Virtual T1-Encoded Fonts for Windows TrueType." Available on the author's Web site as a LATEX document in belleek.zip. [The Belleek software is an implementation of T1-encoded fonts using TrueType scaling technology under Microsoft Windows. Belleek consists of TrueType fonts which render elements of METAFONT glyphs in scalable form, plus TEX virtual fonts which remap and compose T1 characters from these elements.]

[9] Kinch, Richard. "MetaFog: Converting METAFONT Shapes to Contours." TUGboat 16,3 (1995), pages 233–243.


[10] Knappen, Jörg. "The European Computer Modern Fonts." CTAN file: tex-archive/fonts/dc/mf/dcdoc.tex.

[11] Knuth, Donald. "Virtual Fonts: More Fun for Grand Wizards." TUGboat 11,1 (1990), pages 13–24.

[12] Plaice, John, and Yannis Haralambous. "Draft Documentation for the Ω System." ftp://nef.ens.fr/pub/tex/yannis/omega/first.tex, February 1995. [See also the Omega Web page [5].]

[13] Plaice, John. "Progress in the Ω Project." TUGboat 15,3 (1994), pages 320–324.

[14] The Unicode Consortium, Inc. http://www.unicode.org. [This site holds files listing codes and descriptive phrases. There are no sample character images, and there are no PostScript names. A monograph gives paper images [15].]

[15] Addison-Wesley Publishing Company. The Unicode Standard: Worldwide Character Encoding, Version 1.0, Volumes I and II.

[16] The encodings (consisting at present of 46 encoding sets containing over 9000 pairs, represented in .afm format) herein listed in Tables 1, 4, and 5, are available via the author's Web site.

[17] The programs ttf_edit and joincode for DOS, Windows, and Linux are available via the author's Web site.
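As a supplement to the section "Implementing sparse metric tables", the two-level pointer technique with a shared dummy page can be sketched in Python as below. This is a minimal sketch under assumptions, not TrueTEX's implementation: the class and field names are invented, Python references stand in for pointers, and the metric record is reduced to width, height, and depth.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Metric:
    width: float = 0.0
    height: float = 0.0
    depth: float = 0.0


PAGE_SIZE = 256
ZERO = Metric()
# One shared, immutable page of zero-width metrics for all unencoded pages.
DUMMY_PAGE: Sequence[Metric] = (ZERO,) * PAGE_SIZE


class SparseMetrics:
    """256 page references, each to 256 metrics, covering codes 0..0xFFFF."""

    def __init__(self) -> None:
        self.pages: List[Sequence[Metric]] = [DUMMY_PAGE] * 256

    def set(self, code: int, metric: Metric) -> None:
        if not 0 <= code <= 0xFFFF:
            raise ValueError("code outside the 16-bit range")
        hi, lo = code >> 8, code & 0xFF
        if self.pages[hi] is DUMMY_PAGE:
            # Allocate a real page only when a code in it is first populated.
            self.pages[hi] = [ZERO] * PAGE_SIZE
        self.pages[hi][lo] = metric

    def get(self, code: int) -> Metric:
        # No null test: unencoded pages resolve to the dummy page.
        return self.pages[code >> 8][code & 0xFF]

    def allocated_pages(self) -> int:
        return sum(1 for page in self.pages if page is not DUMMY_PAGE)
```

A font populated around 0x00–0x2ff, 0x2000, and 0xf001 or 0xfb01 would allocate only a handful of the 256 possible pages, while every lookup remains two index operations.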
