The TEX Font Panel
Total Page:16
File Type:pdf, Size:1020Kb
The TEX Font Panel Nelson H. F. Beebe (chair) Center for Scientific Computing University of Utah Department of Mathematics, 322 INSCC 155 S 1400 E RM 233 Salt Lake City, UT 84112-0090 USA Email: [email protected], [email protected], [email protected], [email protected] (Internet) URL: http://www.math.utah.edu/~beebe Telephone: +1 801 581 5254 FAX: +1 801 585 1640, +1 801 581 4148 Introduction them, changing the character set has huge ramifica- tions for the computing industry and for worldwide The TUG’2001 Font Panel convened on Thursday, business data processing, data exchange, and record August 16, 2001, with members William Adams, keeping. Nelson H. F. Beebe (chair), Barbara Beeton, Hans Fortunately, a particular encoding scheme called Hagen, Alan Hoenig, and Ross Moore, with active UTF-8 makes it possible for files encoded in pure participation by several attendees in the audience. ASCII to also be Unicode in UTF-8 encoding, easing The list of topics that was projected on the screen the transition to the new character set. makes up the sectional headings in what follows, and Up to version 2.0 in 1996, the Unicode character the topics are largely independent. repertoire could be fit into a table of 216 = 65 536 en- Any errors or omissions in this article are solely tries. Version 3.0 in 2000 increased the count to over the fault of the panel chair. a million, although just under 50 000 are assigned Unicode and tabulated in the book. Version 3.2 in 2002 has just over 95 000 assigned. Consortium members The work of the Unicode Consortium, begun in hold the view that 20 or 21 bits per character (just 1988, and first reported on for the TEX community over two million) may ultimately be necessary by in a TUGboat article [9], has reached version 3.0 the time all historical scripts have been covered. of the Unicode Standard [29]. Version 3.1 appeared Despite the Consortium’s warning that the col- about the time of the TUG’2001 conference, and lection was expected to grow, several vendors did version 3.1.1 shortly thereafter. Unicode is a proper not pay attention, and prematurely adopted 16-bit subset of the ISO/IEC 10646 Universal Character entities to hold Unicode characters. Set Standard [14], but publication of the latter lags. Thus, the C language data type, wchar t, intro- Unicode defines a character set that is intended duced in 1989 Standard C [7, 13, 28], is implemented ultimately to cover all of the world’s writing sys- as a 16-bit unsigned integer in many C and C++ tems. Its first 128 entries are identical to the ASCII compilers, with a companion function library that character set (dating from 1964) used by most of the also has this limitation. world’s computers. Even worse, the popular Java programming There is a very active Unicode technical dis- language is defined in terms of an underlying virtual cussion e-mail list: send subscription requests to machine [23, 24], already implemented in hardware, [email protected]. The list is archived whose instructions are permanently designed for 16- at http://www.unicode.org/mail-arch/. bit characters. Unicode conferences are held twice a year, These 16-bit limitations can be overcome by with the twentieth in late January 2002; see representation of Unicode values with variable num- http://www.math.utah.edu/pub/tex/bib/ bers of bytes, as was done with the UTF-8 encoding. index-table-u.html#unicode for a bibliography Unfortunately, the opportunity to simplify character of publications about Unicode. processing significantly by having fixed-size units is Since most programming languages, operating tragically lost. systems, file systems, and even computer I/O and In the panel chair’s view, these design errors CPU chips, have character knowledge designed into will rank with the infamous ASCII/EBCDIC split 220 TUGboat, Volume 22 (2001), No. 3 — Proceedings of the 2001 Annual Meeting The TEX Font Panel in 1964, with IBM System/360 adopting EBCDIC, keys, so software like t1disasm (from Lee Het- and everyone else (by about 1980) adopting ASCII, herington’s and Eddie Kohler’s t1utils package, with enormous economic costs, and user confusion, available at ftp://ctan.tug.org/tex-archive/ that lasted for decades. fonts/utilities/t1utils) can readily disassem- Newer operating systems are already designed ble a font. to use Unicode as the native character set, and Disassembly reveals essentially a table of num- vendors of older ones are migrating in that direction bered (not named) subroutines, Subrs, each con- through UTF-8 encoding. taining positioning commands, and calls to other Of course, jumping from a 256-character set subroutines, plus a table of character definitions, to one with potentially millions of characters poses CharStrings, indexed by character name. Each en- an almost impossible problem for font vendors. It try of CharStrings also consists of positioning and will be a very long time before the Unicode font drawing commands, and calls to the numbered sub- repertoire is adequate. Current systems with native routines. Unicode support generally provide only a subset Because subroutine numbers could be con- of characters, and then sometimes only in low- structed dynamically, it is in general not possible to resolution screen bitmaps. Bitstream for a while identify which of the numbered subroutines can be offered their Cyberbit Unicode font, but in July omitted, but a DVI driver could drop unused entries 2001, withdrew it without explanation. from the CharStrings table. This is transparent to Thanks to fine work by fellow TUG members font rendering software, since the entries are named, Yannis Haralambous and John Plaice [26], TEX has rather than numbered. been extended to fully support Unicode. Their It was reported by a reviewer that computation system is known as Ω (Omega), and it has been of subroutine numbers is in practice not done in available on the annual TEX Live CD-ROM distribu- existing Type 1 and Type 2 Compact Font Format tions since at least version 5 in 2000. Development (CFF) fonts, so perhaps it is safe to drop subroutines has not been as rapid as end users might like, but that are not explicitly called. it must be understood that this is a hugely complex Recent versions of Tom Rokicki’s dvips driver problem, and the Ω designers have been proceeding are capable of subsetting PostScript Type 1 outline very carefully, cognizant of other TEX developments fonts, as can Adobe Acrobat Distiller and Ghost- such as pdfTEX, ε-TEX, and N T S, in addition to script’s ps2pdf. the evolution of the Unicode Standards. However, this subsetting introduces new prob- lems. What if the DVI file also included PostScript Mathematics fonts figures which themselves used fonts? Subsetting Fonts for mathematics are a substantial problem, might remove characters needed by those figures. because, among the more than twenty thousand It is infeasible, or unreliable, for the DVI driver fonts on the market, only a handful have a remotely to attempt to examine an included figure file to adequate repertoire of mathematical glyphs. These determine its font requirements, because far too fonts are almost the only choices: Computer Con- many PostScript producing programs fail to con- crete, Computer Modern, Informal Math, Lucida, form to Adobe’s Document Structuring Conventions MathTime, PA Math, PX, Palatino Math, Pandora, that would otherwise clearly, and simply, record and TX. the file’s font needs. Those conventions are clearly While it is, of course, possible to use an existing described in the first two editions of the PostScript mathematics font with any other text font, the re- Language Reference Manual [1, Appendix C] [3, sults are rarely visually successful. For some careful Appendix G], but were ominously dropped from the studies of this, see Hoenig’s book [11, Chapter 10]. third edition [6]. They are, however, documented at the Adobe Web site among the technical notes Font subsetting collected at http://partners.adobe.com/asn/ DVI drivers for virtually all devices, other than Post- developer/technotes/postscript.html, in the Script, subset the fonts that they include in their file http://partners.adobe.com/asn/developer/ output streams: descriptions of unused characters pdfs/tn/5001.DSC_Spec.pdf. are simply omitted. Each Type 1 font contains a special 24-bit Doing this for PostScript Type 1 out- (0 ... 16 777 215) unsigned number, the UniqueID, line fonts has proved considerably more trouble- which is intended to allow printing devices to cache some. These fonts are generally encrypted, but bitmaps of rendered fonts between jobs. A million Adobe has published the encryption algorithm and of these numbers are reserved for private use, and TUGboat, Volume 22 (2001), No. 3 — Proceedings of the 2001 Annual Meeting 221 Nelson H. F. Beebe (chair) the rest are allocated to font vendors on request. some may even support a user-defined font substi- A subsetted font is a different font, because it lacks tution file. some characters, and so must be assigned a UniqueID The PANOSE system [8] is a font classifica- from the private use area. A random choice from tion system that assigns numeric values in 0 ... 15 this area would mean a one-in-a-million chance of for ten font attributes (family, serif style, weight, confusion between fonts in a printer. proportion, . ). The system is further described Regrettably, several versions of Adobe’s own at http://www.w3.org/Fonts/Panose/pan2.html, Acrobat Distiller, and most versions of Ghostscript http://www.w3.org/Printing/stevahn.html, and (until the panel chair, who is a long-time Ghost- http://www.agfamonotype.com/print_manu/pan1.