The STIX Project — from Unicode to Fonts

The STIX Project | From Unicode to fonts Barbara Beeton American Mathematical Society 201 Charles Street Providence, RI 02904-2294 USA bnb at ams dot org Abstract The goal of the STIX project is to provide fonts usable with other existing tools to make it possible to communicate mathematics and similar technical material in a natural way on the World Wide Web. This has involved two major efforts: enlarging Unicode to recognize the symbols of the mathematical language, and creating the fonts necessary to convert encoded texts into readable images. This ten-year effort is finally resulting in fonts that can actually be used for the intended purpose. 1 Introduction: What is Unicode? puter keyboards. The second half of block 00 is According to the Unicode manual, the original goal known as \Latin 1", and includes many accented of the effort was \to unify the many hundreds of con- letters found in western European languages, as well flicting ways to encode characters, replacing them as additional punctuation marks and control char- with a single, universal standard." acters. Unicode is thus an encoding system capable of The next few blocks contain: representing all the world's languages in a way that • the Greek and Cyrillic alphabets; will enable any person to interact with a computer • a collection of diacritics to be used to compose in his own language. Nearly all modern computer accented letters not accommodated by Latin 1; operating systems are based on Unicode. • Hebrew; The three principal components of Unicode are • Arabic; the character, the block and the plane. A character • the scripts for many of the languages of India is the smallest unit, carrying a semantic value. A and southeast Asia. character may represent a letter, a digit, or some Each script is allotted a half or full block as needed. other symbol or function. Blocks from 10 to 1F accommodate more lan- A block consists of 256 characters | the num- guage scripts, including extensions for Latin and ber of characters that can be addressed by eight bi- Greek. Except for very basic symbols such as plus nary digits, addressed as 00{FF. A plane is com- (+) or asterisk (*), non-language characters aren't posed of 256 blocks, for a total of 65,536 characters; included until blocks beginning at 20. there are 17 planes for a capacity of 1,114,112 in all [4, p. 2]. 2 Who is responsible for Unicode, and how The first plane, Plane 0, is referred to as the do things get added? Basic Multilingual Plane (BMP); if a piece of soft- Unicode was developed and is maintained by the ware claims to support Unicode, it should be able Unicode Technical Committee (UTC) an arm of the to access every character in the BMP. Unicode Consortium. Members of the consortium Characters are assigned to blocks with the most include most computer hardware manufacturers and heavily used given the lowest addresses; assignments software vendors. To align Unicode with ISO 10646, are made in half-block (128-byte) chunks. the standard on which hardware and software are The first half of block 00 is the basic character actually based, the UTC works closely with the stan- set known as ASCII (the formal name is \C0 Con- dardization subcommittee for coded character sets trols and Basic Latin"). This contains the upper- of the International Organization for Standardiza- and lowercase Latin alphabet, ten digits, various tion. punctuation marks, and a number of control func- The UTC members are individuals with various tions; it is the set of characters found on most com- areas of expertise. Most have a strong background in TUGboat, Volume 28 (2007), No. 3 | Proceedings of the 2007 Annual Meeting 299 Barbara Beeton computer software. Many are skilled as well in lan- 2190 Arrows 21FF guages and linguistic-related areas. However, there 219 21A 21B 21C 21D 21E 21F 0 ← ↖ 1 ⇐ ⇠ are very few practicing physical scientists. 2190 21A0 21B0 21C0 21D0 21E0 21F0 1 ↑ ↗ 2 ⇑ ⇡ If something isn't in Unicode, there is a stan- 2191 21A1 21B1 21C1 21D1 21E1 21F1 dard proposal form. This asks for a number of items: 2 → ↙ 3 ⇒ ⇢ 2192 21A2 21B2 21C2 21D2 21E2 21F2 ↓ ↘ 4 ⇓ ⇣ • the repertoire of characters being requested, in- 3 2193 21A3 21B3 21C3 21D3 21E3 21F3 cluding character names; 4 ↔ % 5 ⇔ U 2194 21A4 21B4 21C4 21D4 21E4 21F4 • the context in which the proposed characters 5 ↵ 6 F V 2195 21A5 21B5 21C5 21D5 21E5 21F5 are used; 6 ' 7 G W 2196 21A6 21B6 21C6 21D6 21E6 21F6 • references to authoritative published sources 7 ( 8 H X where the characters have been used; 2197 21A7 21B7 21C7 21D7 21E7 21F7 8 ) 9 I Y • relationships the proposed characters bear to 2198 21A8 21B8 21C8 21D8 21E8 21F8 9 * : J Z characters already encoded; 2199 21A9 21B9 21C9 21D9 21E9 21F9 A + ; K ⇪ • contact information for the supplier of a com- 219A 21AA 21BA 21CA 21DA 21EA 21FA B , < L puterized font to be used in printing the stan- 219B 21AB 21BB 21CB 21DB 21EB 21FB C - = M dard; 219C 21AC 21BC 21CC 21DC 21EC 21FC D . > N • names and addresses of contacts within national 219D 21AD 21BD 21CD 21DD 21ED 21FD / ? ⇞ or user organizations. E 219E 21AE 21BE 21CE 21DE 21EE 21FE F 0 @ ⇟ The on-line description of the proposal review pro- 219F 21AF 21BF 21CF 21DF 21EF 21FF cess warns that 182 The Unicode Standard 5.0, Copyright © 1991-2006 Unicode, Inc. All rights reserved. • international standardization requires a signifi- cant effort on the part of the submitter; Figure 1: When the STIX project began, positions 21EB{21FF were empty. Copyright Unicode, used by • it frequently takes years to move from an initial permission. proposal to final standardization; • submitters should be prepared to become in- 2200 Mathematical Operators 22FF volved in the process. 220 221 222 223 224 225 226 227 228 229 22A 22B 22C 22D 22E 22F 0 ∀ ∠ 6 E ≠ e ⊀ ¤ ¯ In the case of the STIX proposal, all these warnings 2200 2210 2220 2230 2240 2250 2260 2270 2280 2290 22A0 22B0 22C0 22D0 22E0 22F0 were true, in spades. 1 ∑ 7 F ≡ f ⊁ ¥ ° 2201 2211 2221 2231 2241 2251 2261 2271 2281 2291 22A1 22B1 22C1 22D1 22E1 22F1 2 ∂ − ! 8 ≒ W g ⊂ ¦ ± 3 Initial conditions 2202 2212 2222 2232 2242 2252 2262 2272 2282 2292 22A2 22B2 22C2 22D2 22E2 22F2 3 ∃ ∓ ∣ 9 H X h ⊃ ⊣ § ² In 1997, when the STIX project began, Unicode was 2203 2213 2223 2233 2243 2253 2263 2273 2283 2293 22A3 22B3 22C3 22D3 22E3 22F3 at version 2.0. It contained several blocks of interest 4 # ∴ : I ≤ i w ⊤ ¨ ³ 2204 2214 2224 2234 2244 2254 2264 2274 2284 2294 22A4 22B4 22C4 22D4 22E4 22F4 for mathematics: 5 ∅ $ ∵ ≈ J ≥ j x ⊕ ⊥ © ⋅ • combining diacritics (first half of block 03 for 2205 2215 2225 2235 2245 2255 2265 2275 2285 2295 22A5 22B5 22C5 22D5 22E5 22F5 6 ∆ % ∶ < K ≦ k ⊆ text; the last three 16-cell columns of block 20 2206 2216 2226 2236 2246 2256 2266 2276 2286 2296 22A6 22B6 22C6 22D6 22E6 22F6 7 ∇ ∗ ∧ ∷ = L ≧ l ⊇ ⊗ for diacritics used with symbols) 2207 2217 2227 2237 2247 2257 2267 2277 2287 2297 22A7 22B7 22C7 22D7 22E7 22F7 • Greek (last half of block 03) 8 ∈ ∨ . > M ] { 2208 2218 2228 2238 2248 2258 2268 2278 2288 2298 22A8 22B8 22C8 22D8 22E8 22F8 • arrows (last half of block 21) (Figure 1) 9 ∉ ∩ / ? N ^ | ª 2209 2219 2229 2239 2249 2259 2269 2279 2289 2299 22A9 22B9 22C9 22D9 22E9 22F9 • mathematical operators (block 22) (Figure 2) A √ ∪ 0 @ O _ ≺ ⊊ « • miscellaneous technical (first half of block 23) 220A 221A 222A 223A 224A 225A 226A 227A 228A 229A 22AA 22BA 22CA 22DA 22EA 22FA B ∋ ∫ 1 A P ` ≻ ⊋ ¬ • geometric shapes (last half of block 25) 220B 221B 222B 223B 224B 225B 226B 227B 228B 229B 22AB 22BB 22CB 22DB 22EB 22FB C ∼ ≌ Q a o None of these blocks was entirely full at that time. 220C 221C 222C 223C 224C 225C 226C 227C 228C 229C 22AC 22BC 22CC 22DC 22EC 22FC D ∝ 3 B R b p ¡ ® 4 Character 6= glyph 220D 221D 222D 223D 224D 225D 226D 227D 228D 229D 22AD 22BD 22CD 22DD 22ED 22FD E í ∞ ∮ 4 C S ≮ q ¢ Unicode encodes characters. Each character has a 220E 221E 222E 223E 224E 225E 226E 227E 228E 229E 22AE 22BE 22CE 22DE 22EE 22FE F ∏ ∟ 5 D T ≯ r £ ⊿ designated, well-defined meaning. It appears in the 220F 221F 222F 223F 224F 225F 226F 227F 228F 229F 22AF 22BF 22CF 22DF 22EF 22FF Unicode charts as a representative glyph, or image. The Unicode Standard 5.0, Copyright © 1991-2006 Unicode, Inc. All rights reserved. 185 However, since the purpose of Unicode is to convey meaning, the shape of the glyph may vary. To take Figure 2: Math operators in Unicode; at the start a trivial example, in text, an \A" has the same code of the STIX project, the last code assigned was 22F1. whether it is upright Roman, italic (A), bold (A), Copyright Unicode, used by permission. 300 TUGboat, Volume 28 (2007), No. 3 | Proceedings of the 2007 Annual Meeting The STIX Project | From Unicode to fonts or sans serif (A). Similarly, an accented \é"can be track of them. IDs were assigned in order of acces- represented either by one code or by a combination sion, rather than by shape, usage, or other rational of the letter \e" and the combining diacritic “´”. system. This is not true for math notation, however. Because of the large number of characters be- The same letter in different styles (italic, script, ing requested, the UTC invited a representative of Fraktur, bold, .

The STIX Project — from Unicode to Fonts

ST.36 Page: 3.36.1

Transport and Map Symbols Range: 1F680–1F6FF

Typesetting Classical Greek Philology Could Not ﬁnd Anything Really Suitable for Her

Proposal to Add U+2B95 Rightwards Black Arrow to Unicode Emoji

Package Mathfont V. 1.6 User Guide Conrad Kosowsky December 2019 [email protected]

Writing Mathematical Expressions in Plain Text – Examples and Cautions Copyright © 2009 Sally J

Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress

The File Cmfonts.Fdd for Use with Latex2ε

Surviving the TEX Font Encoding Mess Understanding The

The Unicode Standard 5.2 Code Charts

Geometry and Art LACMA | | April 5, 2011 Evenings for Educators

L2/14-274 Title: Proposed Math-Class Assignments for UTR #25