Unicode and math, a combination whose time has come — Finally!

Barbara Beeton
American Mathematical Society

Abstract

To technical publishers looking at ways to provide mathematical content in electronic form (Web pages, e-books, etc.), fonts are seen as an “f-word”. Without an adequate complement of symbols and alphabetic type styles available for direct presentation of mathematical expressions, the possibilities are limited to such workarounds as .gif and .pdf files, either of which limits the flexibility of presentation.

The STIX project (Scientific and Technical Information eXchange), representing a consortium of scientific and technical publishers and scientific societies, has been trying to fill this gap. Starting with a comprehensive list of symbols used in technical publishing, drawn from the fonts of consortium members and from other sources such as the public entity sets for SGML listed in ISO Technical Report 9573-13, a proposal was made to the Unicode Technical Committee to add more math symbols and variant alphabets to Unicode. Negotiations have been underway since mid-1997 (the wheels of standards organizations grind exceedingly slowly), but things are beginning to happen.

This paper shares the latest information on the progress of additional math symbols in Unicode, and the plans for making fonts of these symbols freely available to anyone who needs them.

Introduction

The composition of mathematics has never been straightforward; it has always required special fonts above and beyond the alphabetic complement required for text. Even if an author makes an effort to describe mathematical concepts and relationships in words, there comes a point where symbols become necessary for both clarity and conciseness. In some fields (for example, symbolic logic), the use of notation has expanded to such a degree that it is nearly impossible to express concepts clearly in ordinary words; symbols convey the desired meaning much more directly. The situation might be compared to that of two literate Chinese from different areas meeting, and communicating by writing rather than in their different spoken dialects. There is sometimes just no reasonable substitute for a common notation.

Although symbols form a large and important part of written mathematics, mainly indicating operations, relations, and other similar concepts, alphabets are also co-opted from their role of representing ordinary language to provide the notation for mathematical constants, variables and functions: the things operated on. The number of different alphabets used in some documents appears to be limited only by what is available or by the capacity of the typesetting system (manual, mechanized or electronic). Only numerals seem to denote more or less the same kinds of concepts in both ordinary prose and mathematical notation. Needless to say, font foundries have never been overly eager to provide an unlimited supply of new symbols of arcane design and often intricate production requirements.

Complicating this situation is the fact that the audience for typeset mathematics is relatively small. If the number of mathematicians clamoring for competently printed material in their subject were anywhere near the number of readers of novels or sports magazines, or if these mathematicians had budgets matching those of major advertising agencies, font foundries could muster much greater interest in doing this sort of work.

Terminology

When one looks at a printed page, one sees that it is constructed from many small elements. There are letters, digits, punctuation, symbols, . . .

176 TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting Unicode and math, a combination whose time has come — Finally!

One might think to refer to all of these as characters. The dictionary [9] definition of character is, in part,

    character ... 1. A sign or token placed upon an object as an indication of some special fact, as ownership or origin; a mark, brand, or stamp. 2. Hence: (a) a graphic symbol of any sort; esp., a graphic symbol employed in recording language, as a letter. (b) Writing; printing. (c) ...

Clear enough? Well, not quite.

In standardese, a term can have only one meaning. The basic ISO (International Organization for Standardization) definition [5] is

    character A member of a set of elements used for the organisation, control, or representation of data.

Thus the term character cannot be used in an ISO standard with any other meaning.

Another relevant term is code; from the same dictionary [9]:

    code ... 3. A system of signals for communication by telegraph, flags, etc. (. . . ); also, a system of words or other symbols arbitrarily used to represent words; as, a secret code.

This is the term adopted to identify the system by which data is stored in a computer memory, and the individual elements are known as coded characters, or characters for short. Different computer coding systems use different bit patterns to represent the same character; for example, the letter A would have a different code in ASCII, BCD, EBCDIC, ISO 646, ISO 8859-1, etc., but in each of these systems, A is still considered the same character. If it is in a context that might (in print) be represented in italic or boldface, that makes no difference; the same code is used for all.

But an A in a font or on a printed page is not (by this system) a character, and an italic A is different from a boldface A, and so on. The term adopted in standardese [3] for such an element is glyph:

    glyph A recognizable abstract graphic symbol which is independent of any specific design.

Thus it is clear that one code may represent many different glyphs. The reverse is also true: while the word “file” is spelled with four letters, and coded as four characters, when printed with a font that has ligatures, only three glyphs are used.

The association between characters and glyphs is referred to as mapping. What is important here is that in neither direction is the mapping between a coded character and a glyph one-to-one; it may be one-to-many, or many-to-one. While this is not only adequate, but even admirable, when dealing with text, for mathematics it can introduce serious ambiguity.

The alphanumeric soup of standardized codes has already been mentioned. Consider the history of codes used for computer input.

Although quite a few different models of digital computer architecture have been devised, very few have been based on a wider range of possibilities for the smallest unit other than zero and one, on and off. (This provides the rationale for naming the bit, “binary digit”.) Different combinations of bits, in strings of predefined length, designate characters. The number of bits in such a string is the limiting factor in how many characters can comprise a code.

Very early codes contained six bits (64 characters), just enough for a single-case (latin) alphabet, ten digits, five arithmetical operators (+, -, *, / and =), the punctuation required to format real numbers and accounting data, and a number of control codes to support interaction with a Teletype machine. The following symbol complement, a variant of the BCD code, was available on a punched paper tape device used in the 1970s at AMS; symbols in the second row had to be preceded by an upshift and followed by a downshift, as they were piggy-backed onto other characters.

    . , - / * # & $
    : ; + = @ ( ) < > ’ "

This was sufficient to support Fortran, Cobol, and other antique programming languages, but not a direct visual representation of mathematical expressions.

ASCII, in its original form, had seven bits and 128 characters, which could accommodate a lowercase alphabet and more symbols. (This is the code under which TeX was first implemented.) ISO 646 is the “international” version of ASCII. A key principle was, and is, that once a character gets into a code, it is never removed, so the current 8-bit ASCII is backward compatible with the 7-bit version, at least insofar as what can be encoded.

Other codes have been promulgated by manufacturers, national standards bodies, and the ISO. Until the mid-1980s, these codes were used almost exclusively to support processing of programming languages and natural languages. Whatever symbols were included were necessary to specify programming operations, not the symbolic representation of scientific disciplines, and typically, except for
the few symbols and punctuation characters that were already present in six-bit codes, symbolic characters were typically segregated in programming language-specific codes such as the one for APL.

Some language codes already exceeded the typical eight-bit capacity of 256 elements. It is impossible, for example, to fit in all the accented and variant letters of the alphabet needed to represent all the languages based on the Latin alphabet. And codes for Japanese and Chinese had to accommodate the nearly 10,000 characters used to publish newspapers, or, preferably, the 50,000 characters or more found in literary works.

The development of a multi-byte code, ISO 10646 (originally a two-byte code), undertook to combine in a single code all existing national and commercial codes. Computer manufacturers and other commercial organizations dependent on computer technology became dissatisfied with the progress of the ISO working group responsible for standardizing codes, and, in 1988, formed the Unicode Consortium for the purpose of creating a unified international code standard on which new multinational computer technology could be based. The ISO old guard was joined or replaced by the Unicode members, and since 1991 Unicode and ISO 10646 have been parallel.

The content and structure of Unicode

In the Unicode 2.0 manual [7], the section Design goals identifies some of the gaps in coverage by existing codes.

    When the Unicode project began in 1988, groups most affected by the lack of a consistent international character standard included the publishers of scientific and mathematical software, newspaper and book publishers, bibliographic information services, and academic researchers. . . . The explosive growth of the Internet has added to the demand for a character set standard that can be used all over the world.

The first iteration of Unicode included “characters from all major international standards published before December 31, 1990”. One of these was the SGML standard [2], which contained a sizeable list of mathematical and technical symbols in its original Annex A (this list was later moved to a technical report [4]). Other sources included “bibliographic standards used in libraries . . . , the most prominent national standards, and various industry standards in very common use”.

In the Unicode 3.0 manual [8], only one reference can be unambiguously associated with math symbols: ISO 6862, Information and documentation - Mathematics character set for bibliographic information interchange (no explicit references are shown in the Unicode 2.0 manual). Many of the symbols listed in the annex to the SGML standard don’t appear in Unicode. More about this later.

There are several design principles especially relevant to the designation of math symbols as characters [8]:

• The Unicode Standard encodes characters, not glyphs.
• Characters have well-defined semantics.
• The Unicode Standard encodes plain text.

The implication is that the meaning of each character is distinct, so that the representation when interchanged or typeset will be unambiguous. More about this later as well.

Unicode is organized into segments of 65,536 characters called planes. The first of these, plane 0, is the basic multilingual plane (BMP). Within this plane, characters with common characteristics are grouped into blocks, usually of 256 characters. The first full block is equivalent to Latin 1, with the first half comprising 7-bit ASCII. The code for any character assigned to the BMP can be represented by 16 bits: a two-byte, or two-octet, code. The formal representation of such a code is “U+xxxx”, where xxxx is a string of four hexadecimal digits.

Within the BMP, these blocks are occupied by symbols:

• U+2000–206F: General punctuation
• U+2070–209F: Subscripts and superscripts
• U+20A0–20CF: Currency symbols
• U+20D0–20FF: Combining diacritical marks for symbols
• U+2100–214F: Letterlike symbols
• U+2150–218F: Number forms
• U+2190–21FF: Arrows
• U+2200–22FF: Mathematical Operators
• U+2300–23FF: Miscellaneous technical (in Unicode 2.0, U+2380–23FF are unassigned, reserved for later additions)
• U+2400–243F: Control pictures
• U+2440–245F: Optical Character Recognition
• U+2460–24FF: Enclosed alphanumerics
• U+2500–257F: Box drawing
• U+2580–259F: Block elements
• U+25A0–25FF: Geometric shapes
• U+2600–267F: Miscellaneous symbols
• U+2700–27BF: Dingbats
• U+27C0–27FF: (unassigned)
• U+2800–28FF: Braille patterns (added in Unicode 3.0)
• U+2900–2DFF: (unassigned)

Symbols that were part of earlier codes are kept with those codes in other blocks; if a code already existed, the character was not duplicated.

A segment of the BMP has been set aside for private use, where characters may be assigned which are not formally included in Unicode but for which an agreement exists between sending and receiving users.

Up to now, most character assignments are in the BMP, with the intention that they be easily accessed. However, for less frequently occurring characters, work has begun to populate Plane 1. This is another area with relevance that will become obvious later.

Identifying symbols required for math typesetting

Early in 1997, a group of scientific and technical societies and publishers banded together under the name STIPub (Scientific and Technical Information Publishers) to address matters of common interest. The founding members of this group were

• American Chemical Society
• American Mathematical Society
• American Institute of Physics
• American Physical Society
• Elsevier Science, Inc.
• Institute of Electrical and Electronics Engineers

One topic of growing concern to the STIPub members was how best to move into the Internet age, to make use of the World Wide Web as an adjunct, if not the new centerpiece, of their publishing efforts. A major obstacle facing Web publication was, and is, the pitifully inadequate symbol set available with HTML, and its lack of support for the two-dimensional positioning of mathematical notation. Several attempts at providing some support for this material had been brushed aside as successive releases of the HTML Recommendation (a Recommendation is the World Wide Web Consortium’s (W3C) equivalent of an international standard) added features to improve the visual presentation and control of document layout.

It was understood that a future version of most browsers would include “Unicode support”. Although it is unclear exactly what is meant by this, an obvious course of action was to make certain that Unicode coverage for math and technical notation is complete.

A working group for Scientific and Technical Information eXchange (STIX) was formed with the charter to identify the required symbol complement and get the missing elements incorporated into Unicode. A first step was to collect from the STIX participants and other sources lists of symbols currently in use, and to reduce this to two subcollections: symbols already in Unicode and symbols not in Unicode. Information was gathered from the following sources, in addition to the STIX members:

• the entity sets of ISO TR 9573-13 [4]; electronic files were provided by the editor, Anders Berglund
• fonts designed to be used with TeX: Computer Modern, AMSFonts, Lucida New Math, lasysym, St. Mary Road, wasysym
• Wolfram Research (Mathematica)
• Justin Ziegler’s LaTeX3 project report [10]
• Taco Hoekwater (for Kluwer Academic Publishers)
• Jörg Knappen (for Springer Verlag)
• Paul Topping, Design Science, Inc. (MathType)
• the ISO Z language standard (ISO CD 13568)
• various requests for specific symbols identified through AMS technical support and the newsgroup comp.text.

More than 2200 distinct symbols were identified.

The next step was to determine which were already included in Unicode. As already mentioned, not all symbols are located in the blocks designated for symbols; some have codes in other blocks, including Latin 1, Greek, and even among the CJK (Chinese, Japanese and Korean) symbols and punctuation. About half of the symbols in the collection were found in the Unicode 2.0 manual [7]; the remainder were assigned provisional identifiers in the Unicode private use area, and a table was constructed, listing the following for each symbol:

• its ID
• a possible cross-reference to an ID for another symbol of similar shape or meaning
• the AFII (Association for Font Information Interchange) glyph identifier
• the entity name, TeX code, or other identifying information for each contributor
• a brief description

Figure 1: First (arrows) and third (binary relations) of seven symbol tables in the December 1998 version of the Unicode math proposal

This completed the first half of the project. The second, more difficult, half remained: putting the request into a form that would be acceptable to the Unicode Technical Committee (UTC).

Preparing the Unicode proposal

It is a UTC requirement that any request for encoding must be accompanied by a sample image of the requested character. This posed some problems. For many of the symbols to be submitted, no fonts were available. We solved that problem in a rudimentary way by creating GIF images at a very low resolution; for symbols we did have in fonts, this was accomplished using latex2html, and for others, bitmaps were created by hand and packaged as GIFs. The resulting images, packaged in HTML tables, were rough but recognizable. For the initial version of the proposal, the order of symbols in the tables was the same as that in our master list, by reference ID; the full collection of tables and text (about 70 files in all) was sent to the UTC in March 1998.

The arrangement of symbols was semi-random, and individual symbols were hard to find, so at the UTC’s request, the proposal was reorganized and the first revision delivered in December 1998. Figure 1 shows two of the seven tables in this first revision.

Once the symbols were rearranged, the presence of similar shapes, considered by the UTC to be possible duplicates, became obvious.

The UTC takes its job seriously [8, p. 17]:

    The Unicode Standard avoids duplicate encoding of characters by unifying them within scripts across languages; characters that are equivalent in form are given a single code.
    Common letters, punctuation marks, symbols, and diacritics are given one code each, regardless of language, . . .

With respect to math symbols, this means that if two symbols look very much alike, unless there is very strong documentation to support the contention that they have different meanings, only one will be assigned a code. Thus, for example, ≤ and ≦ might be considered equivalent, if they had not already both been accepted into Unicode. Some UTC members feel that the original inclusion of such pairs was a mistake, and they are determined not to repeat it. Knowledge of that fact guided the organization of the math proposal, and helped to determine what kind of documentation would be needed.

The first rearrangement of the symbols was into groups that roughly coincided with existing Unicode blocks: arrows, “traditional” math symbols, geometric shapes, etc. The “traditional” symbols were classified further into groups that corresponded to their functions: large operators, binary operators, binary relations, delimiters, etc. After some preliminary discussions with a member of the UTC, we decided to structure our proposal in blocks corresponding to these functional groups, with the symbols arranged in the same general order as similar ones already present in Unicode.

We were also advised to eliminate any “duplicates” of existing characters, but since slight variations do often have different meanings in mathematical exposition, we decided to keep everything for the first round, and refine on the basis of specific directives from the UTC. However, we did at this stage identify symbols with similar shapes, and possibly equivalent meanings, and flagged them in our master list to indicate the need for additional documentation.

The UTC requested at the outset that symbols used not in math, but in fields such as chemistry, astronomy, engineering and phonetics, be omitted, to be requested separately by representatives of those disciplines. This change was made, eliminating more than a hundred symbols.

In the first round of the proposal, we included three alphabets (script, blackboard bold and fraktur) as well as a list of alphabets required for mathematical exposition. Some additional alphabetic inclusions were letters that occur in an unusual orientation (e.g., ` ) and variant forms of several Greek letters (including ϵ, the straight-backed epsilon) which were missing from the Unicode complement. Although we felt that the case for including these alphabets was strong, opposition from members of the UTC was stronger; in the next iteration, the alphabets were removed to a separate proposal.

Refining the proposal

Adjusting the content of the list of symbols to make it acceptable to the UTC took several iterations over the course of two years. Attendance at the meetings where the proposal was discussed proved to be essential, given the scope of the project. Two UTC members were assigned to help refine the proposal, cast it in the form required by the ISO working group (WG2) in charge of character coding standards, and prepare the text that will appear in the published Unicode manual.

The first task was to overcome the reluctance of some members of the UTC to believe that there could actually be more than a thousand math and technical symbols not already in Unicode. The reorganized proposal, generally following the ordering of symbols already present, clarified the situation by making it relatively easy to compare the new material to the existing Unicode.

Quite a few symbols consisted of a base symbol plus a cancellation. This could be a long or short slash or vertical stroke, a backwards slash, or a double vertical stroke. Although two lengths of the forward slash and vertical stroke appear in Unicode among the combining diacritics, it was decided that the longer versions would be designated as the proper cancellation markers for math, leaving the actual shape of the cancelled symbol as a font issue.

Special attention was paid to symbols that look similar, that differ for example in the number

Several decisions were made in order to minimize the number of codes to be assigned. These were the most important:

• Any symbols cancelled by a vertical or slanted stroke should be constructed from a base symbol and a combining mark; this eliminated the entire cancelled alphabet used by physicists.
• A “variant selector” (VS) would be provided to allow for shape variants that ordinarily represent personal preference or house style and not differences in meaning. Except where needed to provide a base character for cancellation, in which case a code would be assigned, only the most common variant of such a symbol would be assigned, and the specified variant represented by the code for the base symbol plus the VS. A list of such variants appears in figure 2.

Figure 2: Symbols constructed using the Variant Selector

Documentation for the symbols that were most likely to be controversial was sought in published material. A request for citations was presented on
the AMS Web site during the summer and early fall of 1998. Although the response was not as great as hoped for, some useful references were obtained. The ideal citation contained several similar-appearing symbols on a single page, in context, preferably with definitions of one or more of the symbols in the text. More than a hundred pages of such examples were copied, annotated, assigned reference IDs, indexed, and provided to the two UTC members responsible for advancing the symbols proposal. This mass of data proved its worth more than once, when it was possible to cite a particular sample in answer to a challenge.

Late in 1999, the content of the symbols proposal was agreed, codes were assigned, and a final version of the proposal was prepared for the spring 2000 meeting of WG2. The proposal forwarded to WG2 places material into the following blocks:

• U+2000–206F: General punctuation (9 new codes)
• U+20D0–20FF: Combining diacritical marks for symbols (4 codes)
• U+2100–214F: Letterlike symbols (16 codes)
• U+2190–21FF: Arrows (12 codes)
• U+2200–22FF: Mathematical operators (14 codes)
• U+2300–23FF: Miscellaneous technical (26 codes)
• U+2400–243F: Control pictures
• U+25A0–25FF: Geometric shapes (8 codes)
• U+2900–297F: Supplemental arrows (new; 128 codes)
• U+2980–29FF: Miscellaneous math symbols (new; 114 codes)
• U+2A00–2AFF: Supplemental math operators (new; 246 codes)

In addition, a few characters were added to other areas; in all, 584 new codes have been assigned.

After much discussion, the proposition was reluctantly accepted that the same letter from different alphabets has different meanings within a single document, and thus these different alphabets deserve to be coded for use only in mathematical notation. The example used to clinch the argument was the contrast between these two equations:

\[ \mathcal{H} = \int d\tau\,(\varepsilon E^2 + \mu H^2) \]
\[ H = \int d\tau\,(\varepsilon E^2 + \mu H^2) \]

The first is the Hamiltonian formula well known in physics; the second is an unremarkable integral equation.

These alphabets are needed for proper composition of mathematics:

• lightface upright Latin, Greek and digits
• boldface upright Latin, Greek and digits
• lightface italic Latin, Greek and digits
• boldface italic Latin, Greek and digits
• script
• fraktur
• bold fraktur
• open-face (blackboard bold) including digits
• lightface upright sans serif Latin and digits
• lightface italic sans serif Latin
• boldface upright sans serif Latin, Greek, and digits
• boldface italic sans serif Latin and Greek
• monospace Latin and digits

Except for the lightface upright letters and digits, which are to be encoded using the base Unicodes (ASCII for the Latin letters and digits), the alphanumerics are to be placed in a tightly packed block (U+D400–D7FF) in plane 1, so that they can be used for math (most likely via entity names in MathML), but will be very difficult to access for other purposes.

The math alphanumerics block has been incorporated into a larger proposal for plane 1, and its schedule is slightly behind that of the symbols proposal. A “final” version is now in preparation, and will be forwarded to WG2 for their fall meeting.

Assuming that the required three ISO ballots are favorable, the new codes should be a formal part of Unicode and ISO 10646 by

Commissioning a font

Late in 1999, even before the fate of the Unicode math proposals was known, STIPub issued a Request for Proposal to a number of font suppliers. This RFP requested bids for creating a set of fonts compatible with Times and incorporating all the symbols and alphabets identified by the STIX project, suitable both for use in Web browsers and in print. The resulting font set is to be “made available for general use under license, but free of charge, with the aim of easing and fostering the uninhibited flow, exchange, and linking of scientific information.” [6]

Proposals were received from four potential suppliers, and comments from a fifth, which refrained from proposing because of prior commitments. As of this writing, a probable supplier has been chosen, and negotiations are proceeding toward a contract.

More details will be available on this topic at the time of the Oxford TUG meeting.
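
For anyone testing a font or a browser against these decisions, the two central ideas (separate codes for the same letter in different math alphabets, and cancelled symbols built from a base plus a combining mark) can be checked with Python's standard `unicodedata` module. This is a hedged sketch: the code points are those in the Unicode data shipped with Python, which postdates this paper, and the helper names are made up for the illustration. The choice of U+0338 as the long slash used for cancellation is likewise an assumption based on its standard name, not something quoted from the proposal.

```python
import unicodedata

# Code points from the Unicode data bundled with Python (later than the
# proposal described here); helper names are hypothetical.
SCRIPT_H = "\u210B"      # script H, a letterlike symbol in the BMP
ITALIC_H = "\U0001D43B"  # italic H from the plane 1 math alphanumerics


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (plane N)' for a single character."""
    cp = ord(ch)
    return f"U+{cp:04X} {unicodedata.name(ch)} (plane {cp >> 16})"


def same_plain_letter(a: str, b: str) -> bool:
    """True if both characters fold to the same text under NFKC."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def cancel(base: str) -> str:
    """Append COMBINING LONG SOLIDUS OVERLAY (U+0338) to a base symbol."""
    return base + "\u0338"


if __name__ == "__main__":
    print(describe(SCRIPT_H))
    print(describe(ITALIC_H))
    # Distinct codes, yet both fold to the ordinary letter H.
    print(SCRIPT_H != ITALIC_H, same_plain_letter(SCRIPT_H, ITALIC_H))
    # '=' plus the combining slash is canonically equivalent to U+2260.
    print(unicodedata.normalize("NFC", cancel("=")) == "\u2260")
```

The compatibility folding in `same_plain_letter` is exactly the information a plain-text search would discard, which is why the two H's need separate codes if the two equations above are to stay distinguishable.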

Remaining tasks

The Unicode manual contains extensive text describing the proper use of the character codes, as a guide to programmers. Particular attention is paid to processing of context dependencies, combining codes and the like. Since mathematics will be realized in a document as a combination of coding and markup, and not all mathematical symbols are interpreted in the same way, instructions are needed. The creation of a technical report is an open action item on the Unicode docket; the text of this report will ultimately be incorporated into the Unicode manual.

The symbols from chemistry, astronomy, engineering and phonetics that were excluded by request of the UTC have been left for consideration by the organizations that submitted them.

Mathematical notation is not static. Authors continually devise new symbols and ascribe new meanings to existing ones. The complement of symbols requested from Unicode was frozen in mid-1998; a few additions were made only to achieve consistency in the base set of symbols affected by combining codes, namely the VS and marker. About 50 additional symbols used in math, physics and theoretical have been identified since then. Documentation must be completed for these, and the formal request presented to the UTC for their addition.

The glyph complement was frozen somewhat later than the Unicode complement, but this too remains to be addressed.

It has not been determined how these, or further additions, are to be handled. Nonetheless, I am still actively collecting citations for new notation, and welcome contributions.

Acknowledgments

Three UTC members were heavily involved in advancing this project: Murray Sargent of Microsoft, formerly a practicing physicist; Ken Whistler of Sybase, the UTC archivist and general editor of Unicode 3.0; and Asmus Freytag, the UTC font specialist, who is responsible for typesetting the Unicode manuals and ISO 10646. Their knowledge and experience of character code standards have proved invaluable, and the project’s success owes much to their thoughtful assistance. Ken in particular is able to explain in non-expert terms the UTC’s requirements and the historical background affecting specific decisions. From the other direction, he can quickly assess the evidence for support of an item and, if convinced that it has sufficient merit, can convince the rest of the committee that it should be included. Thus, for the math proposal, the strategy was to provide sufficient evidence for a symbol to convince Ken, and then let him persuade the committee using arguments they would find convincing.

The contributors to the symbol collection were always willing to provide additional information.

Neil Soiffer of Wolfram Research attended several UTC meetings to explain the uses of several “letter-like symbols” that have special significance in Mathematica.

Patrick Ion, co-chair of the W3C MathML working group, took my place at several UTC meetings when questions were expected that would best be answered by a practicing mathematician, such as, “Do you know for a fact that [some particular symbol] is used, and are you sure it isn’t the same as [some other symbol]?” His presence and his answers successfully conveyed the importance of this project to the mathematical community.

Thanks to all for their efforts.

For more information

The history of the STIX project is recorded at the Web site http://www.ams.org/STIX/.

References

[1] Association for Font Information Interchange, International Glyph Register, Volume 1: Alphabetic scripts and symbols, Rochester, 1993.
[2] International Organization for Standardization, ISO 8879:1986, Information Processing — Text and office systems — Standard Generalized Markup Language (SGML), Geneva, 1986.
[3] International Organization for Standardization, ISO 9541:1991, Information Technology — Font Information Interchange — Part 1: Architecture, Geneva, 1991.
[4] International Organization for Standardization, ISO 9573-13:1991, Information Technology — SGML Support Facilities — Techniques for using SGML — Part 13: Public entity sets for mathematics and science, Geneva, 1991.
[5] International Organization for Standardization, ISO 10646-1:1993, Information Technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane, Geneva, 1992.


[6] “STIPUB Request for Proposals for Scientific and Technical Fonts for the STIX Project”, November 16, 1999 (unpublished).
[7] The Unicode Consortium, The Unicode Standard, Version 2.0, Addison-Wesley Developers Press, Reading, MA, 1996.
[8] The Unicode Consortium, The Unicode Standard, Version 3.0, Addison Wesley Longman, Inc., Reading, MA, 2000.
[9] Webster’s New Collegiate Dictionary, G. & C. Merriam, Springfield, MA, 1959.
[10] Justin Ziegler, Technical Report on Math Font Encodings, available from CTAN (http://www.tug.org/tex-archive/info/ltx3pub/l3d007.tex and supporting files in the same directory), 1994.

