
First applications of Ω: Greek, Arabic, Khmer, Poetica, ISO 10646/UNICODE, etc.

Yannis Haralambous, Centre d'Études et de Recherche sur le Traitement Automatique des Langues, Institut National des Langues et Civilisations Orientales. Private address: 187, rue Nationale, 59800 Lille, France. Yanni .Haralambous@univ-lillel.

John Plaice, Département d'informatique, Université Laval, Ste-Foy, Québec, Canada G1K 7P4. pl aice@ ft . ul aval .ca

Abstract. In this paper a few applications of the current implementation of Ω are given. These applications concern typesetting problems that cannot be solved by TeX (and consequently by no other typesetting system known to the authors). They cover a wide range, going from calligraphic (Adobe Poetica), to plain Dutch/Portuguese/Turkish typesetting, to vowelized Arabic, fully diacriticized scholarly Greek, or decently kerned Khmer. For every chosen example, the particular difficulties, either typographical or TeXnical (or both), are explained, and a short glance at the methods used by Ω to solve the problem is given. A few problems Ω cannot solve are mentioned, as challenges for future Ω versions.

On Greek, ancient and modern (but rather ancient than modern)

Diacritics against kerning. It is in general expected of educated men and women to know Greek letters. Already in college, having used θ for angles, γ for acceleration, and π to calculate the area of a round apple pie of given radius, we are all familiar with these letters, just as with the Latin alphabet. But the Greek language, in particular the ancient one, needs more than just letters to be written. Two kinds of diacritics are used, namely accents (acute, grave and circumflex), and breathings (smooth and rough) which are placed on vowels; breathings are also placed on the consonant rho.

Every word has at most one accent¹, and 99.9% of Greek words have exactly one accent. Every word starting with a vowel has exactly one breathing². It follows that writing in Greek involves much more accentuation than any Latin-alphabet language, with the obvious exception of Vietnamese.

¹ Sometimes an accent is transported from a word to the preceding one: ἄνθρωπός τις instead of ἄνθρωπος, τις, so that typographically a word has more than one accent.

² With one exception: the letters ρρ are often written ῤῥ, when inside a word: πόῤῥω.

How does TeX deal with Greek diacritics? If the traditional approach of the \accent primitive had been taken, then we would have practically no hyphenation (which in turn would result in disastrous over/underfulls, since Greek can easily have long words like ὠτορινολαρυγγολόγος), no kerning, and a cumbersome input, involving one or two macros for every word.

The first approach, originated by Silvio Levy (1988), was to use TeX's ligatures ('dumb' ones first, 'smart' ones later on) to obtain accented letters out of combinations of codes representing breathings (> and <), accents (', ` and ~ or =) and the letters themselves. In this way, one writes >'a to get ἄ. This approach solved the problem of hyphenation and of cumbersome input.

Nevertheless, this approach fails to solve the kerning problem. Let's take the very common case of the article τὸ (tau followed by the letter omicron); in almost all fonts there is a kerning instruction between these letters, obviously because of invariant characteristics of their shapes. Suppose now that omicron is accented, and that one writes t`o to get tau followed by omicron with grave accent. What TeX sees is a 't' followed by a grave accent. No kerning can be defined for these two characters, because we have no idea what may follow after the grave accent (it can very well be an iota, and usually there is no kerning between tau and iota). When the letter omicron arrives, it is too late; TeX has already forgotten that there was a tau before the grave accent.

A solution to this problem would be to write diacritics after vowels ("post-positive notation"). But

this contradicts the visual characteristics of diacritics in the upper case, since these are placed on the left of uppercase letters: Ἔαρ could hardly be transliterated E>'ar. And, after all, TeX should be able to do proper Greek typesetting, however the letters and diacritics are input.

TUGboat, Volume 15 (1994), No. 3, Proceedings of the 1994 Annual Meeting

Ω solves this problem by using an appropriate chain of Ω Translation Processes (ΩTPs), a notion explained in Plaice (1994). As an example, consider the word ἔαρ:

1. Suppose the user wishes to input his/her text in 7-bit ASCII; he/she will type >'ear, and this is already ISO-646, so that no particular input translation is needed. Another choice would be to use some Greek input encoding, such as ISO-8859-7 or ΕΛΟΤ; then he/she may as well type >'εαρ or >έαρ (the reason for the absurd complication of being forced to type either >'ε or >έ to obtain ἔ is that "modern Greek" encodings have taken the easy way out and feature only one accent, as if the Greek language was born in 1982, year of the hasty and politically motivated spelling reform). The first ΩTP will send these codes to the appropriate 16-bit codes in ISO 10646/UNICODE: 0x1f14 for ἔ, 0x03b1 for α, and 0x03c1 for ρ.

2. Once Ω knows what characters it is dealing with (Ω's default internal encoding is precisely ISO-10646), it will hyphenate using 16-bit patterns.

3. Finally, an appropriate ΩTP will send Greek ISO 10646/UNICODE codes to a 16-bit virtual font (see next section to find out why we need 16 bits), built up from one or more 8-bit fonts. This font contains kerning instructions, applied in a straightforward manner, since we are now dealing with only three codes: ἔ, α and ρ. No auxiliary codes interfere anymore.

4. xdvicopy³ will de-virtualize the dvi file and return a new dvi file using only 8-bit fonts, compatible with every decent dvi driver.

³ In the name of this program, which is an extended version of Peter Breitenlohner's dvicopy, 'x' stands for 'extended', not for 'X-Window'.

By separating tasks, hyphenation becomes more natural (for TeX one has to use patterns including auxiliary codes ', `, = etc.). Furthermore, an additional problem has been solved en passant: the primitives \lefthyphenmin and \righthyphenmin apply to characters of \catcode 12. To obtain hyphenation between clusters involving auxiliary codes, we have to declare these codes as 'letter-like characters'. For example, take the word ἔαρ, written as >'ear: the codes > and ' must be considered as letter-like (non-trivial \lccode) to allow hyphenation; but this means that for TeX, ἔαρ has 5 letters instead of 3, and hence even if we ask \lefthyphenmin=3, we will still get the word hyphenated as ἔ-αρ!! Ω solves this problem by hyphenating after the translation has been done (in this case >'e or >'ε or >έ → ἔ).

Dactyls, spondees and 16-bit fonts. Scholarly editions of Greek texts are slightly more complicated than plain ones⁴, one of the add-ons being a third level of diacritization: syllable lengths.

⁴ After all, scholars have been studying Greek text for more than 2,000 years now.

One reads in Betts and Henry (1989, p. 254): "Greek poetry was composed on an entirely different principle from that employed in English. It was not constructed by arranging stressed syllables in patterns, nor with a system of rhymes. Greek poets employed a number of different metres, all of which consist of a certain fixed arrangement of long and short syllables". Long and short syllables are denoted by the diacritics macron and breve. These diacritics are placed between the letter and the regular diacritics, if any (except in the case of uppercase letters, where they are placed over the letter while regular diacritics are placed to its left).⁵

⁵ These additional diacritics are also used for a different purpose: in prose, placed on letters alpha, iota or upsilon, they indicate if they are pronounced long or short (this time we are talking about letters and not about syllables).

The famous first two verses of the Odyssey

    Ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ
    πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσε

form hexameters. These consist of six feet: four dactyls or spondees, a dactyl and a spondee or trochee (see again Betts and Henry 1989 for more details). One could write the text without accents or breathings to make the metre more apparent:

    [the same two verses, marked with macrons and breves only]

or one could decide to typeset all types of diacritics:

    [the same two verses, with accents, breathings, macrons and breves]

Having paid a fortune to acquire the machine that sits between the keyboard and the screen, one could expect that hyphenation of text and kerning between letters remain the same, despite the constantly growing number of diacritics. Actually this isn't possible for TeX: there are exactly 345 combinations of Greek letters with accents, breathings, syllable lengths and subscript iota; TeX can handle at most 256 characters in a font. Therefore Ω is necessary for hyphenation of Greek text, whenever syllable lengths are typeset.

In this case, things do not run as smoothly as in the previous section: although 345 is a small number compared to 65,536 (= 2^16), ISO-people decided that

there is not room enough for all combinations of accents, breathings and syllable lengths.⁶

⁶ Nevertheless, they included lower and upper alpha, iota and upsilon with macron and breve, probably for the reasons explained in the previous footnote. However, combining diacritics must be used to code letters with macron/breve and additional diacritics.

Whenever ISO 10646/UNICODE comes too short for our needs, we use the private zone. Just as in the TV serial The Twilight Zone, in the private zone anything can happen. In the case of Ω all operations remain internal, so that we have absolute freedom of defining characters: according to the 1993-05-01 version of ISO-10646-1, the private zone consists of characters 0xe000 to 0xfffd (of group 0, that is the 16-bit part of ISO-10646-1), a total of 8,190 positions.

Ω will treat the input like in the previous section, but letters with macron and breve diacritics will occupy internal positions in the private zone. The rest of the treatment will be exactly the same. As for the transliterated input, one could take ^ and _ to denote macron and breve (after having changed their catcodes so that they do not interfere with math operators), or any other combination of 7-bit or 8-bit codes.

A dream that may come true. As stated by the first author in his Cork talk, back in 1990, his dream was (and still is) to draw a Greek font based on the famous « Grecs du Roi » typeface by Claude Garamont, engraved in 1544-46 for King François I. This typeface was designed after a manuscript of Ἄγγελος Βεργίκιος, a Cretan, calligrapher and reader of Greek at the French Court, in the beginning of the XVI century. There are 1327 different types, most of them ligatures of two or more letters (sometimes entire words). One can read in Nationale (1990) that "this typeface is the most precious piece of the collection [of the French National Printing House]", certainly not the least of honors! Ω is the ideal platform for typesetting with this font, since it would need only an additional ΩTP to convert plain Greek input into ligatures. The author hopes to have finished (or, on a more realistic basis, to have brought to a decent level) this project in time for the International Symposium "Greek Letters, From […] to Gutenberg and from […] to Y", which will take place in Athens in late Spring 1995, organized by the Greek Font Society.⁷

⁷ For additional information on the Symposium, contact Ἑταιρεία Ἑλληνικῶν Τυπογραφικῶν Στοιχείων, Ἑλλανίκου 39-41, 116 35 Ἀθήνα, or Michael S. Macrakis, 24 Fieldmont Road, Belmont, MA 02178-2607, USA.

Arabic, or "The Art of separating tasks"

Plain Arabic, quick and clean and elegant. Arabic typesetting is a beautiful compromise between Western typesetting techniques (finite number of types, repeated ad infinitum) and Arabic calligraphy (infinite number of arbitrarily complex ligatures). We can subdivide Arabic ligatures into two categories: (a) mandatory ones: letter connections and the special lam-alif (ل + ا → لا), and (b) optional ones, used for esthetic reasons.

The second category of ligatures corresponds to our good old 'fi', 'fl', etc. They depend on the font design and on the degree of artistic quality of a document. The first author has made a thorough classification of esthetic ligatures of the Egyptian typecase (see Haralambous 1992, reprinted in Haralambous 1993c). Here is an example of the ligaturing process of the word تحمل, following Egyptian typographical traditions:

    ت ح م ل (letters not connected);
    تحمل (only mandatory ligatures, connecting letters);
    […] (esthetic ligature between the first two letters);
    […] (esthetic ligature between the first three letters).

To produce more than 1,500 possible ligatures of two, three or four letters, three 256-character tables were necessary. Each ligature is constructed by superposition of small pieces. Once TeX knows which characters to take, and from which font, it only needs to superpose them (no moving around is necessary). The problem is to recognize the existence of a ligature and to find out which characters are needed. This process is highly font-dependent. A different font (for example in Kufic or Nastaliq style) may have an entirely different set of ligatures, or none at all (like the plain font, in which the two words […] are written, that is widely used in computer typesetting because of its readability); nevertheless, the mandatory ligatures remain strictly the same, whatever font is used.

Up to now, there are three solutions to the problem of mandatory Arabic ligatures:

First, by K. Lagally (1992), is to use TeX macros for detecting and applying mandatory ligatures (we say: "to do the contextual analysis"). This process is cumbersome and long. It is highly dependent on the font encoding and the macros used can interfere with other TeX macro packages. All in all, it is not the natural way to treat a phenomenon which is a low-level fundamental characteristic of the Arabic script.


Second, by the first author (Haralambous 1993b), is to use "smart" TeX font ligatures (together with TeX--XeT, the bi-directional version of TeX); this process is philosophically more natural, since contextual analysis is done behind the scenes, on the very lowest level, namely the one of the font itself. It does not depend on the font encoding, since every font may use its own set of ligatures. The disadvantage lies in the number of ligatures needed to accomplish the task: about 7,000! The situation becomes tragic when one wants to use a dozen Arabic fonts on the same page: TeX will load 7,000 probably strictly identical ligatures for each font. You need more than BigTeX to do that.

Third, also by the first author (Haralambous 1993b), is to use a preprocessor. The advantage is that the contextual analysis is done by a utility dedicated to this task, with all possible bells and whistles (for example adding variable length connecting strokes, also known as 'keshideh'); it is quick and uses only a very small amount of memory. Unfortunately there are the classical disadvantages of having a preprocessor treat a document before TeX: one file may \input another file, from any location of your net, and there is no way to know in advance which files will be read, and hence have to be preprocessed; preprocessor directives can interfere with TeX macros; there is no nesting between them and this can easily produce errors with respect to TeX grouping operations, etc.

None of these methods can be applied for large scale real-life Arabic production: in all cases, the Arabic script is treated as a 'puzzle to solve' and, inevitably, TeX's performance suffers.

We use ΩTPs to give a natural solution to the problem. Consider once again the example of the word تحمل:

First, تحمل is read by Ω, either in Latin transliteration (tHml) or in ISO-8859-6, or ASMO, or Macintosh Arabic, or any decent Arabic script input encoding.

The first ΩTP converts this input into ISO 10646/UNICODE codes for Arabic letters: 0x062a (ت), 0x062d (ح), 0x0645 (م), 0x0644 (ل). ISO 10646/UNICODE being a logical way of codifying Arabic letters, and not a graphical one, there is no information on their contextual forms (isolated, initial, medial, final).

The second ΩTP sends these codes to the private zone, where we have (internally) reserved positions for the combinations of Arabic characters and contextual forms. Once this is done, Ω knows the form of each character.

The third ΩTP simply translates these codes to a 16-bit standard Arabic TeX font encoding (this is a minor operation: the private zone being located at the end of the 16-bit table, we move the whole block near to the beginning of the table). If the font has no esthetic ligatures, we are done: Ω will send the results of the last ΩTP to the DVI file, and produce تحمل. On the other hand, if there are still esthetic ligatures (as in […]) then these will be included in the font, as 'smart' ligatures. Since the font table can have as many as 65,536 characters, there is plenty of room for small character parts to be combined.⁸

⁸ If these esthetic ligatures are used in several fonts, it might be possible that we have the same problem of overloading Ω's font memory; in this case, we can always write a fourth ΩTP, which would systematically make the 'esthetic analysis' out of the contextually analyzed codes.

What we have achieved is that the fundamental process of contextual analysis is done by background machinery (just like TeX hyphenates and breaks paragraphs into lines), and that the optional esthetic refinements are handled exclusively by the font (in analogy to Roman fonts having more ligatures than typewriter ones, etc.).

Vowelized Arabic (things get harder). In plain contemporary Arabic, only consonants and long vowels are written; short vowels have to be guessed by the reader, using the context (the same consonants with different short vowels can be understood as a verb, a noun, an adjective etc.). When it is essential to specify short vowels, small diacritics are added over or under the letters. Besides short vowels, there are also diacritics for doubling consonants, for indicating the absence of vowel, and for the glottal stop (like in "Oh-oh"): counting all possible combinations, we obtain 14 signs. These diacritics can give TeX a hard time, since they have to be coded between consonants, and hence interfere in the contextual analysis algorithm: for example, suppose that TeX is about to typeset letter x, which is the last letter of a word, followed by a period. Having read the period, TeX knows that the letter has to be of final form (one of the 7,000 ligatures has to be <medial x> + <.> → <final x> <.>). Now suppose that the letter is immediately followed by a short vowel, which in our case is necessarily placed between the letter x and the period. TeX's smart ligatures cannot go two positions backwards; when TeX discovers the period after the short vowel it is too late to convert the medial x into a final one.

Fortunately, ΩTPs are clever enough to be able to calculate letter forms, whatever the diacritics surrounding them (which is exactly the attitude of the

human typesetter, who first typesets the letters, and then adds the corresponding diacritics).

Nevertheless, ΩTPs are not perfect, and there are problems which cannot be solved, even by the most judicious ΩTP: for example, the positioning of diacritics. We all know that TeX (and hence also Ω, which is no more than a humble extension of TeX) places all elements on a page by using boxes. Unfortunately, the placement of diacritics requires more information than just the height, width, depth and italic correction of a character; in some cases, a real insight into the shape of the character itself and the surrounding characters is necessary (think of ligatures constructed vertically out of four letters, each one having its own diacritic).

This problem can easily be solved for an (esthetically) non-ligatured font: counting all possible letters (not to forget Farsi, Urdu, Sindhi, Kirghiz, Uigur and other languages using variants of Arabic letters), in all possible forms, one may end up with a figure of no more than a thousand glyphs. Combining these glyphs with the 14 diacritical signs would result in no more than 14,000 font positions, a figure well under the 65,536 character limit. Since the private ISO 10646/UNICODE zone (8,190 positions) is not big enough to handle so many characters, we would use an additional ΩTP to send combinations of letter forms and diacritics to codes in the output font encoding. The advantage of this method is that every diacritic can be placed individually (assuming a minute is needed to find the ideal position for a diacritic, the font can be completed in four weeks of steady work), or one can use QDTEXVPL methods to place the diacritics automatically, and then make the necessary corrections.

Unfortunately the number of necessary font positions grows astronomically when we consider 2- or 3-letter ligatures, where each letter can have a different diacritic. One of the future challenges of the Ω project will be to analyze Arabic script characters and find the necessary parameters to determine diacritic positioning, just like D.E. Knuth did for mathematical typesetting. It should be noted that despite the huge complexity of this task, we remain in the strict domain of Arabic typography, which is after all no more than a reduced and simplified version of Arabic calligraphy.

Multiscript languages: "do you read Vietnamese?" Both Westerners and Arabs had the (not so democratic) habit of imposing their alphabet on nations they conquered (either militarily, religiously, culturally or technologically). So it happens that we can all read Vietnamese (but not understand it) just as any Arab can read Malay and Sindhi, and not understand a word (except perhaps for some Arabic words which accompanied the alphabet in its journey).

In some cases, more than one script remained in use for the same language and attempts are made to clarify and standardize the equivalences between letters of these scripts, in order to provide an efficient algorithm.

The first author has worked on two cases involving the Arabic script: Berber and Comorian.

Berber: a language with three scripts and two writing directions. The Berber language (Berbers call it "Tamazight") can be written in Arabic, Latin or native ("Tifinagh") script. The first author has developed, under the guidance of Salem Chaker,⁹ a Tifinagh font in METAFONT. Here is a small sample of Berber text, written in left-to-right Tifinagh:

    [Berber text sample in left-to-right Tifinagh]

Follows the same text in right-to-left Tifinagh:

    [the same sample in right-to-left Tifinagh]

and in Arabic script:

    [the same sample in Arabic script]

⁹ Director of the Berber Studies Department of the National Institute of Oriental Languages of Paris.

The encoding of these fonts is such that one can output the same TeX input in Latin transliteration, left-to-right Tifinagh, right-to-left Tifinagh or Arabic.


All one needs to change is a macro at the beginning. Accomplishing this feature for Latin and Tifinagh was more-or-less straightforward; not so for Arabic. Unfortunately, that font has all the problems of plain Arabic fonts: it needs more than 7,500 ligatures to do the contextual analysis, and is overloaded: there is no longer room to add a single character, an annoyance for a language that is still under the process of standardization.

Another source of difficulties is the fact that the equivalences between Latin, Tifinagh and Arabic are not immediate. Some short vowels are written in the Latin text, but not in the Arabic and Tifinagh ones. Moreover, double consonants are written explicitly in Latin and Tifinagh, while they are written as a single consonant with a special diacritic in Arabic. And perhaps the most difficult problem is to make every Berber writer feel "at home", regardless of the script he/she uses: one should not have the impression that one script is privileged over the others!

Finally, the last problem (not a minor one when it comes to real-life production) is that we need a special Arabic font for Berber, because of the different input transliteration: for example, while in plain Arabic transliteration we use 'v' for […] and '[…]' for […], in Berber we are forced to use 'g' for the former and 'c' for the latter. There are two supplementary letters used for Berber in Arabic script: […] and […]; these letters are also used in Sindhi and Pashto, so that the glyphs are already covered by the general Arabic TeX system; but in Berber, they have to be transliterated as 'j' and 'z', because of the equivalences with the Latin alphabet. This forces us to use a different transliteration scheme than the one for plain Arabic, and hence (because of TeX's inability to clearly separate input and output encoding) to use a differently encoded TeX output font. Imagine you are typesetting a book in both Berber and Arabic; you will need two graphically identical fonts for every style, point size, weight and font family, each one with more than 7,000 ligatures. And we are just talking of (esthetically) non-ligatured fonts!

Ω solves this problem by using the same output fonts for Berber and plain Arabic. We just need to replace the first ΩTP of the translation chain: the one that converts raw input into ISO 10646/UNICODE codes. Berber linguists can feel free to invent/introduce new characters or diacritics; as long as they are included in the ISO 10646/UNICODE table we will simply have to make a slight change to the first ΩTP (and if these signs are not yet in ISO 10646/UNICODE we will use the private zone).

Comorian: African Latin versus Arabic. A similar situation has occurred in the small islands of the Comores, between Madagascar and the African continent. Both the Latin alphabet (with a few additions taken from African languages) and the Arabic one are used. Because of the many sounds that must be distinguished, one has to use diacritics together with Arabic letters. These diacritics look like Arabic diacritics (for practical reasons) but are not used in the same way; in fact, they are part of the letters, just like the dots are part of plain Arabic letters.

Once again, the situation can easily be handled by an ΩTP. While it is still not clear what should be proposed for insertion into ISO 10646/UNICODE (this proposal, made by Ahmed-Chamanga, a member of the Institute of Oriental Languages in Paris, is now circulating from Ministries to Educational and Religious Institutions, and is being tested on natives of all educational levels), Comorians can already use Ω for typesetting, and upgrade the transliteration scheme on the fly.

Khmer

As pointed out in Haralambous (1993a), the Khmer script uses clusters of consonants, subscript consonants, vowels and diacritics. Inside a cluster, these parts have to be moved around by TeX to be positioned correctly. It follows that TeX must use \kern instructions between individual parts of a cluster. Because of these, there is no kerning anymore. Suppose that two characters need to be kerned (the Khmer glyphs are written here as A and B), and suppose that the consonant A is (logically) followed by a subscript consonant S, which is (graphically) placed under this letter. For TeX, A is not immediately followed by B anymore, and so no kerning between these letters will be applied; nevertheless, graphically they still are adjacent, and hence still need to be kerned.

Ω uses the sledgehammer solution to solve this problem: we define a 'big' (virtual) Khmer font, containing all currently known clusters. As already mentioned in Haralambous (1993a), approximately 4,000 codes would be sufficient for this purpose. And of course one could always use the traditional TeX methods to form exceptional clusters, not contained in that font.

As in Arabic, we deal with Khmer's complexity by separating tasks. A first ΩTP will convert the input, in whatever input method the user has chosen, to ISO 10646/UNICODE Khmer codes (actually there aren't any ISO 10646/UNICODE Khmer codes yet, but the first author has submitted a Khmer encoding proposal to the appropriate ISO committees, and expects that soon some step will be taken in that direction; for the moment we will use once again the private zone). A second ΩTP will analyze these codes contextually and will send groups of them to the appropriate cluster codes. The separation of tasks is essential for allowing multiple input methods, without redefining each time the contextual analysis, which is after all a basic characteristic of the Khmer script. Ω will


send the result of the second ΩTP to the dvi file, using kerning information stored in the (virtual) font. Finally, xdvicopy will de-virtualize the dvi file and create a new file using exclusively characters from the original 8-bit Khmer font, described in Haralambous (1993a).

Adobe Poetica

Poetica is a chancery script font, designed by Robert Slimbach and released by Adobe Systems Inc. Quoting Adobe's promotional material: "The Poetica typeface is modeled on chancery handwriting scripts developed during the Italian Renaissance. Elegant and straightforward, chancery writing is recognizable as the basis for italic typefaces and as the foundation of modern calligraphy. Robert Slimbach has captured the vitality and grace of this writing style in Poetica. Characteristic of the chancery hand is the common use of flourished letterforms, ligatures and variant characters to embellish an otherwise formal script. To capture the variety of form and richness of this hand, Slimbach has created alternate and swash character sets in his virtuoso Poetica design, which includes a diverse collection of these letterforms."

Technically, the Poetica package consists of 21 PostScript fonts: Chancery I-IV, Expert, Small Caps, Small Caps Alternate, Lowercase Alternates I-II, Lowercase Beginnings I-II, Lowercase Endings I-II, Ligatures, Swash Caps I-IV, Initial Swash Caps, Ampersands, Ornaments. The ones of particular interest for us are Alternate, Beginnings, Endings and Ligatures, since characters from these can be chosen automatically by Ω. The user just types plain text, possibly using a symbol to indicate degrees of alternation. An ΩTP converts the input into characters of a (virtual) 16-bit font, including the characters of all Poetica components. Using several ΩTPs and changing them on the fly will allow the user to choose the number of ligatures he/she will obtain in the output. It will also allow us to go farther than Adobe, and define kerning pairs between characters from different Poetica components.

See Fig. 1 for a sample of text typeset in Poetica.

ISO 10646/UNICODE and beyond

It is certainly not a trivial task to fit together characters from different scripts and to obtain an optically homogeneous result. Often the esthetics inherent to different writing systems do not allow sufficient manipulation to make them 'look alike'; it is not even clear whether this should be tried in the first place: suppose you take Hebrew and Armenian, and modify the letter shapes until they resemble one another sufficiently, to our Western eyes. It is not clear if Armenian would still look like Armenian, or Hebrew like Hebrew; furthermore one should not neglect the psychological effects of switching between scripts (and all other changes the switch of scripts implies: language, culture, state of mind, idiosyncrasy, background): the more the scripts differ, the easier this transition can be made.

The only safe thing we can do with types from different origins is to balance stroke widths so that the global gray density of the page is homogeneous (no "holes" inside the text, whenever we change scripts).

These remarks concern primarily scripts that have significantly different esthetics. There is one case, though, where one can apply all possible means of uniformization, and letters can be immediately recognized as belonging to the same font family: the LGCAI group (LGCAI stands for "Latin, Greek, West/East Cyrillic, African, Vietnamese and IPA"). Very few types cover the entire group: Computer Modern is one of them (not precisely the most beautiful), Unicode Lucida another (a nice Latin font but rather a failure in lowercase Greek); there are Times fonts for all members of this group, but there is no guarantee that they belong to the same Times style, similarly for Helvetica and Courier. Other adaptations have been tried as well and it is to be expected that the success (?) of Windows NT will lead other foundries into "extending" their typefaces to the whole group.¹⁰

¹⁰ As a native African pointed out to the first author, this will also mean that Africans will find themselves in the sad and paradoxical situation of having (a) fonts for their languages, (b) computers, since Western universities send all their old equipment to the Third World, but (c) no electricity for running them and using the fonts...

Fortunately, TeX/Ω users can already now typeset in the whole LGCAI range, in Computer Modern¹¹ (after adding a few characters and correcting some others, where needed). The 16-bit font tables of Ω allow:

1. hyphenation patterns using arbitrary characters of the group;

2. the possibility of avoiding frequent font changes, for example when switching from Turkish to Welsh, to Vietnamese, to Ukrainian, to Hawsa;

3. potential kerning between all characters.

¹¹ Definitely, sooner or later some institution or company will take the highly praisable initiative of sponsoring the development of other METAFONT typefaces; the authors would like to encourage this idea.

But Ω can go even beyond that: one can include different styles in the same (virtual) font; looking at the ISO 10646/UNICODE table, one sees that LGCAI characters, together with all possible style-dependent dingbats and […], fit in 6

TUGboat, Volume 15 (1994), No. 3, Proceedings of the 1994 Annual Meeting. First applications of Ω: Greek, Arabic, Khmer, Poetica, ISO 10646/UNICODE, etc.

Figure 1: Sample of the Poetica typeface.

rows (1,536 characters). This means that, at least theoretically, a virtual Ω font (which has 65,536 positions) can contain up to 42 (!) style variations¹² of the whole LGCAI group, for example Italic, Bold, Small Caps, Sans Serif, Typewriter and all possible combinations [a total of 24 = (Roman or Italic) × (normal or bold) × (plain or small caps¹³) × (Serif or Sans Serif or Typewriter)]. Defining kerning pairs between all different styles will avoid the use and misuse of \/ (italic correction) and give a better appearance to mixed-style text.

¹² The authors would like to emphasize the fact that it is by no means necessary to produce all styles out of the same METAFONT code, as is done for Computer Modern. As a matter of fact the font we are talking about can very well be a mixture of Times, Helvetica, Courier; the advantage of having them inside the same structure is that we can define kerning pairs between characters of different styles.

¹³ The figure is actually less than 18, since only lowercase small caps are needed...

It will be quite an experience to make such a font, since many African and IPA characters have no uppercase or italic or small caps style defined yet; see Jörg Knappen's paper on African fonts (Knappen 1993) for the description of a few problematic examples and the solutions he proposes.

Dutch, Portuguese, Turkish: the easy way

These three languages (and maybe others?) have at least one thing in common: they need fonts with a slightly different ligature table than the one in the […]. Dutch typesetting uses the notorious 'ij' ligature (think of the names of people very well known to us: Dijkstra, Eijkhout, van Dijk, or the name of the town Nijmegen or the lake IJselmeer); this ligature appears in the Cork encoding (as well as in ISO 10646/UNICODE), but until now there was no user-transparent means of obtaining it. In Ω you just need to place the macro \inputtranslation{dutchij} into the expansion of the Dutch-switching macro; according to Ω syntax (described in Plaice 1994) this ΩTP can be written as simply as

    in: 1
    out: 2
    expressions:
    'I''J' => @"0132;
    'i''j' => @"0133;
    . => \1;

where @"0132 and @"0133 are the characters 'Ĳ' and 'ĳ' in ISO 10646/UNICODE.

'f' + 'i' or 'f' + 'ı'

Portuguese and Turkish typesetting do not use the ligatures 'fi', ..., 'ffl' (in Turkish there is an obvious reason for this: the Turkish alphabet uses both letters 'i' and 'ı', and hence it would be impossible to know if 'fi' stands for 'f' + 'i' or 'f' + 'ı'). This is a major problem for TEX, since the only solution that would retain a natural input would be to use new fonts; and defining a complete new set of fonts (either virtual or real), just to avoid 5 ligatures, is more trouble than benefit. Ω solves that problem easily; of course, it is not possible to 'disable' a ligature, since the latter arrives at the very last step, namely inside the font. It follows that we must cheat in some way; the natural way is to place an invisible character between 'f' and 'i'; in ISO 10646/UNICODE there is precisely such a character, namely 0x200B (ZERO WIDTH SPACE); this operation could be done by an ΩTP line of the type 'f''i' => "f" @"200b "i" for each ligature. This character would then be sent to the Cork table's 'compound mark' character, which was defined for that very reason.

A still better way to do this would be to define a second 'f' in the output font table, which would not form a ligature with 'f', 'i', or 'l'. This would give the font the possibility of applying a kern between

the two letters, and counterbalance the effect of the missing ligature (after all, if a font is designed to use a ligature between 'f' and 'i', a non-ligatured 'fi' pair would look rather strange and could need some correction).

Other applications: Ars Longa, Vita Brevis

In this paper we have chosen only a few applications of Ω, out of personal and highly subjective criteria. Almost every script/language can take advantage in one way or another of the possibilities of internal translation processes and 16-bit tables. For example, the first author presents in this same conference his pre-processor Indica, for Indic languages (languages of the Indian subcontinent (except Urdu), Tibetan and Sanskrit). Indica will be rewritten as a set of ΩTPs; in this way we will eliminate all problems of preprocessing TEX code.

All in all, the development of 16-bit typesetting will be a fascinating challenge in the next decade, and TEX/Ω can play an important rôle, because of its academic support, openness, portability, and non-commercial spirit.

References

Betts, G. and A. Henry. Teach yourself Ancient Greek. Hodder and Stoughton, Kent, 1989.

Haralambous, Y. "Typesetting the Holy Qur'ān with TEX". In Proceedings of the 2nd International Conference on Multilingual Computing - Arabic and […] (Durham). 1992.

Haralambous, Y. "The Khmer script tamed by the Lion (of TEX)". In Proceedings of the 14th TEX Users Group Annual Meeting (Aston, Birmingham). 1993a.

Haralambous, Y. "Nouveaux systèmes arabes TEX du domaine public". In Comptes-Rendus de la Conférence « TEX et l'écriture arabe » (Paris). 1993b.

Haralambous, Y. "Typesetting the Holy Qur'ān with TEX". In Comptes-Rendus de la Conférence « TEX et l'écriture arabe » (Paris). 1993c.

Knappen, J. "Fonts for Africa". TUGboat 14 (2), 104-106, 1993.

Lagally, K. "ArabTEX". In Proceedings of the 7th European TEX Conference (Prague). 1992.

Levy, S. "Using Greek Fonts with TEX". TUGboat 9 (1), 20-24, 1988.

Imprimerie Nationale. Les caractères de l'Imprimerie Nationale. Imprimerie Nationale, Éditions, Paris, France, 1990.

Plaice, J. "Progress in the Ω Project". Presented at the 15th TUG Annual Conference, Santa Barbara, 1994, and published in these Proceedings.
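
As an illustration of the two translation processes discussed for Dutch and for Portuguese/Turkish, the following minimal Python sketch emulates their effect on a Unicode string. It is not ΩTP code and says nothing about how Ω implements its filters: the function names dutch_ij and break_f_ligatures are hypothetical, and the rule "insert a ZERO WIDTH SPACE after every 'f' that is followed by 'f', 'i' or 'l'" is our assumption for covering the five ligatures with a single pattern instead of one line per ligature.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE, the invisible character named in the text


def dutch_ij(text: str) -> str:
    """Emulate the Dutch rules: 'IJ' -> U+0132, 'ij' -> U+0133.

    Every other character passes through unchanged, which is the role
    of the catch-all rule at the end of the translation process.
    """
    return text.replace("IJ", "\u0132").replace("ij", "\u0133")


def break_f_ligatures(text: str) -> str:
    """Insert ZWSP after each 'f' followed by 'f', 'i' or 'l'.

    Assumption: one lookahead pattern stands in for the separate
    per-ligature rules ('ff', 'fi', 'fl', 'ffi', 'ffl').
    """
    return re.sub(r"f(?=[fil])", "f" + ZWSP, text)


if __name__ == "__main__":
    for word in ("Dijkstra", "IJselmeer", "Nijmegen"):
        print(word, "->", dutch_ij(word))
    # repr() makes the otherwise invisible separator visible
    print(repr(break_f_ligatures("difficil")))
```

The sketch only rewrites the character stream; as the text explains, the choice of glyphs and any compensating kern remain the business of the font.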
