Infrastructure for High-Quality Arabic Typesetting

Infrastructure for high-quality Arabic typesetting Yannis Haralambous Département Informatique, ENST Bretagne CS 83 818, 29 238 BREST Cedex 3 France yannis dot haralambous (at) enst dash bretagne dot fr 1 Introduction • metadata; This paper presents what we consider to be the • etc. 1 ideal (or at least a first step towards the ideal) We will see how — due to the internal structure infrastructure for typesetting in the Arabic script. of the Arabic writing system — the concept of text- This infrastructure is based on four tools which have eme happens to be particularly useful for text in the partly been presented elsewhere: Arabic script. 1. the concept of texteme; There is no need to present OpenType and 2. OpenType (or AAT ) fonts. In fact in this pa- AAT fonts; the reader can consult [20] or the corre- per we will talk about “super-OpenType” which sponding Web pages at Microsoft and Apple. consists in using static OpenType substitutions Ω2 modules have been presented in [14]. The and positionings in dynamic typesetting; well-known technique of Ω1 Translation Processes 3. Ω2 modules: transformations applied to the (ΩTPs) is applied at a later stage of text process- horizontal node list before entering the main ing inside Ω2: just before the end_graf procedure, loop; which is called when the complete list of nodes of a 4. an extended version of TEX’s line-breaking paragraph has been stored in memory, before we en- graph for dynamic typesetting. ter into the line breaking engine which will examine Textemes have been presented in [12] and [13]. this list of nodes and insert active nodes at potential They are atomic units of text extending the notion line breaks. of character. A texteme is a collection of key-value Ω2 will output the horizontal list of nodes, in pairs, some of which are mandatory (but may have XML. Since we are using textemes, this horizon- an empty value) and others optional or freely de- tal list contains both traditional types of nodes as finable by the user. Mandatory keys are “charac- well as texteme nodes. External processes will then ter” (a Unicode position), “font” (a font identifier) transform these XML data and the result is again and “glyph” (a glyph identifier in the given font). read by Ω2 and replaces the original horizontal list. The latter two keys are, in some sense, the com- Besides the nodes of the horizontal list, Ω2 also includes in the XML data global information such as mon part between textemes and TEX’s “character nodes” (quoted because in fact a “character node” font name mapping, the current language, etc. contains only glyph-related information). Hence, a In the next section we will describe the fourth first application of textemes is to add character data tool, namely the extended TEX graph. to “character nodes”. In fact there are other types 2 Dynamic typesetting and of information which we also add to textemes: an extended graph • hyphenation: instead of using discretionary When T X processes a horizontal list, inserts ac- nodes, we add hyphenation-related information E tive nodes for potential breakpoints, calculates bad- into textemes; nesses for each arc of the graph and finally finds • color; the shortest path (where we consider badness as a • horizontal and vertical offset of glyph (leaving “distance”), glyphs do not change: the abstract box “width × (depth + height)” unchanged); • if glyphs are given by “character nodes” then they are static; 1 The author has written several papers on the typesetting of Arabic: [6, 7, 8, 9, 10, 15, 16], and has developed • if they are given by discretionary nodes, then Arabic systems using three different methods: a preproces- we have two possibilities: when there is no line sor (1990), intelligent ligatures (ArabiTEX, using TEX--XET, break we have a single node list (possibly con- 1992) and Ω1 Translation Processes (Ω1 distribution and Al- taining one or more glyphs) and when there is Amal¯ , 1994). The system described in this paper will be the fourth (and hopefully the last) Arabic system developed by a break we have two entirely different lists, the the author. “pre-break” and the “post-break”. The choice TUGboat, Volume 27 (2006), No. 2 — Proceedings of the 2006 Annual Meeting 167 Yannis Haralambous depends entirely on the fact whether we break the next big bang and beyond, for only ten lines of or not; text . a perspective which would delight Douglas • if they are given by ligature nodes, once again Adams if he was still with us. we have two possibilities: when the ligature is This is why dynamic typesetting requires a not broken then we have a single static glyph. strategy. Even if choices of glyph variants are mu- When the ligature is broken we return to “char- tually independent, one has to define rules to limit acters” (or at least to something which is a bit the number of glyph combinations. closer to the concept of character, even though For example, a realistic strategy could be the it is not exactly a character) and apply the main following: loop again to the two parts (before and after 1. classify glyphs into N width classes (the higher the break), which sometimes results in new lig- N the finer the result will be, but the more atures. But once again each node list obtained calculations we will have to do); that way is unique. 2. whenever we have to choose between glyphs in In all three cases glyphs either do not change the same class, make a single choice, in a ran- or their change depends only on a line break close dom manner; to them. Dynamic typesetting is a method of typeset- 3. if we have many choices for which the sum of the ting where glyphs can change during the process widths of classes is relatively constant, choose of line breaking, for reasons which may depend on a single one, randomly. macrotypographic properties such as justification of In other words, we restrict the number of choices the line or of the entire paragraph, or more global to those that give us different widths on the word phenomena like glyphs on subsequent lines touching level. Whenever we have different glyph combina- each other or to avoid rivers, etc. tions producing the same global width, we use a Dynamic typography was applied by Gutenberg random generator to choose a single combination. in his Bibles. He was systematically applying liga- This strategy is useful when there is no semantic tures to optimize justification on the line level. It overload. One can imagine refined versions where is very useful for writing systems using words but the combination chosen is not entirely random but not allowing hyphenation, like Hebrew (where some is based on more-or-less strict criteria (for example: letters have large versions without semantic over- use ligatures preferably at the end of words, or do load, these letters have been used mostly at the end not use specific variants in the same word, etc.). of lines, when printers realized that they are facing The strategy we have described is based solely justification problems) or Arabic. on width criteria. But when many different glyph To perform dynamic typesetting with Ω2 we versions are designed the chances that some of them are extending the graph of badnesses so that we are in conflict (for example, may touch each other) can have many arcs between two given nodes, each heavily increases. Due again to combinatorial rea- one corresponding to a given width (“width” in the sons, we can’t expect the font designer to anticipate sense of glue, that is a triple — ideal width, maximal all possible conflicts between glyphs. We need a tool stretch, maximal shrink) and to its badness. which will test each combination (for example, test Using an extended graph means applying the whether glyphs are touching) and eventually add an same principle of optimized typesetting on the para- additional penalty to the corresponding arc of the graph level to paragraphs where glyphs (or glyph graph. groups) have alternative forms (in the case of groups This tool can work on an interline level so that we call them “ligatures”). Performing the calculation the calculation of global badness is more complex of shortest path on such a graph means that the so- than just summing up the badnesses of individual lution will be the best possible paragraph, chosen arcs. The extended TEX graph used by Ω2 will use among all possible combinations of alternate forms binary arithmetic on flags to add additional bad- of glyphs. nesses to given path choices. Alas, such a calculation can explode combina- torially. Imagine a paragraph of ten lines, each con- Up to now, very few fonts exist with extremely many taining 60 glyphs, that is 600 glyphs in total. Imag- variants (one of which is, for example, Zapfino which ine each glyph having two variant forms, that is a has ten different ‘d’ letters) and in the case of calli- total of three choices for each glyph. No ligatures. graphic fonts the document’s author (who becomes That would already make 3600 ≈ 1.87·10286 possible a “calligrapher”) will probably be more interested combinations of glyphs, enough for running Ω2 until in (manually) choosing a beautiful combination of 168 TUGboat, Volume 27 (2006), No. 2 — Proceedings of the 2006 Annual Meeting Infrastructure for high-quality Arabic typesetting glyphs than in having the absolutely best justification by leaving the choice of glyphs to the machine. The case of Arabic is different: ligatures are much more common (especially in traditional writing styles) and calligraphers have a long tradition of using them in the frame of a justification-oriented strategy (see [3]).

Load more