Low-Level Devanāgarī Support for Omega - Adapting Devnag Yannis Haralambous, John Plaice
To cite this version: Yannis Haralambous, John Plaice. Low-level Devanāgarī Support for Omega - Adapting devnag. TUGboat, TeX Users Group, 2002, Proceedings of the 2002 Annual Meeting, 23 (1), pp.50-56. hal-02112001

HAL Id: hal-02112001
https://hal.archives-ouvertes.fr/hal-02112001
Submitted on 26 Apr 2019

Yannis Haralambous
Département Informatique
École Nationale Supérieure des Télécommunications de Bretagne
BP. 832, 29285 Brest, France
[email protected]
http://omega.enstb.org/yannis

John Plaice
School of Computer Science and Engineering
The University of New South Wales
UNSW Sydney NSW 2052, Australia
[email protected]
http://www.cse.unsw.edu.au/school/people/info/plaice.html

Abstract

This paper presents tools (OTPs and macros) for typesetting languages that use the Devanāgarī script (Hindi, Sanskrit, Marathi). These tools are based on the Omega typesetting system and use fonts from devnag, a package developed by Frans Velthuis in 1991. We describe these new OTPs in detail, to give the reader insight into Omega techniques and to allow him/her to further adapt these tools to his/her own environment (input method, font), and even to other Indic languages.
सारांश [The original repeats the abstract here in Hindi, typeset in Devanāgarī.]

Introduction

One of the first Indic language support packages for TeX was devnag, developed by Frans Velthuis in 1991.[1] At that time it was necessary to use a preprocessor to convert Hindi or Sanskrit text, written in a form legible to humans, into data legible to TeX. This preprocessor allowed the use of an ASCII transcription, and performed the contextual analysis inherent to the Devanāgarī script, as well as pre-hyphenation (by explicitly inserting hyphenation points). The preprocessor was necessary for two main reasons:

1. A Sanskrit font contains over 300 glyphs when ligatures are taken into account.
2. The TFM and VF languages are not powerful enough to build all the necessary glyphs out of a font of 256 characters.

[1] A second system for processing Devanāgarī was created by Charles Wikner. It has important features lacking in Velthuis's devnag system, but unlike the latter it did not address the setting of Hindi text. The general design of the system (Metafont plus preprocessor) was identical to that of Velthuis.

Using a preprocessor has many disadvantages, mainly because it has to read not plain text but LaTeX code. It also has to avoid treating command and environment names as Devanāgarī text. So the preprocessor must be clever enough to distinguish text from commands, i.e., content from markup. It is well known that, in the case of TeX, this is practically impossible unless the preprocessor is TeX itself (as the notorious saying goes: "only TeX can read TeX").

So much for computing in the 20th century. Nowadays we have other means of processing information, and the concept of an (external) preprocessor is obsolete. In fact, the same operations are done inside Omega, a successor of TeX. Processing text internally has the crucial advantage of allowing the processor to distinguish precisely what is content and what is markup (at least as precisely as TeX itself does).

This makes it much easier to treat properties inherent to writing systems: one only needs to concentrate on the linguistic and typographical properties of the script, and need not think about what to "do with LaTeX commands" in the data stream.

Furthermore, there is an efficiency issue: with Omega there is only one source file, namely the TeX file (and not a pre-TeX file and a TeX file); one doesn't need to care about preprocessor directives; the system will not fail because of a new LaTeX environment that is unknown to devnag; and mathematics and other similar constructions do not interfere with Devanāgarī preprocessing.

Contextual analysis of Devanāgarī script has, at last, become a fundamental property of the system, independent of macros and packages.

Unicode and Devanāgarī

The 20-bit information interchange encoding Unicode (www.unicode.org) has tables for all Indic writing systems, based on a common scheme (so that phonetically equivalent letters are placed at the same relative positions in each table). The first of these tables (positions 0900-097F, see Table 1) covers Devanāgarī.

For historical reasons (compatibility with legacy encodings) the Unicode approach to Devanāgarī is quite awkward: it is partly logical and partly graphical. For example, there are separate positions for independent and dependent versions of vowels: when encoding text one has to choose whether a given vowel is dependent or independent, although this clearly derives from contextual analysis, as in Velthuis' transcription, where both versions of a vowel have the same input transcription.

On the other hand, this method is not applied to the consonant ra; indeed, placing a ra before a cluster of consonants is graphically represented by a mark on the last of the consonants (compare क्त kta and र्क्त rkta). This mark is not provided in the Unicode table, and hence application of this feature is left to the rendering engine.

Nevertheless, despite its weaknesses, Unicode is very important because it ensures compatibility between devices all around the world: a text written in Devanāgarī and encoded in Unicode can be processed (read, printed, analyzed) on every machine or piece of software that is Unicode compliant.

Omega fulfils Unicode compliance, and the system described in this paper is designed in such a way that Unicode-encoded texts can be processed just as well as texts encoded in Velthuis' transcription.

Installation and Usage

The Omega low-level support[2] of Devanāgarī consists of eight OTPs (Omega Translation Processes) and a small file with macros:

velthuis2unicode.otp
hindi-uni2cuni.otp
hindi-uni2cuni2.otp
hindi-cuni2font.otp
hindi-cuni2font2.otp
hindi-cuni2font3.otp
sanskrit-uni2cuni.otp
sanskrit-cuni2font.otp
odev.sty

OTP files have to be converted to binary form (*.ocp) and placed in a directory where Omega expects to find them.

To typeset text in Devanāgarī, use the commands \hindi or \sanskrit (depending on the language of your choice) inside a group, and key in the text in Velthuis' transcription (see Table 1, taken from Velthuis' devnag documentation[3]). For example,

    {\hindi kulluu, acaanak, \sanskrit kulluu, acaanak}

will produce कुल्लू, अचानक, कुल्लू, अचानक् (in the printed original, the Hindi and Sanskrit renderings of kulluu use different glyphs for the conjunct ll).

[2] We call it "low-level" because there is no standard LaTeX3-compliant high-level language support interface yet. We do not yet know how languages and their properties will be managed in LaTeX3, and therefore do not attempt to introduce yet another syntax for switching to Hindi, Sanskrit, or Marathi. Instead, we temporarily use a devnag-like syntax: simple commands \hindi and \sanskrit, which have to be placed inside groups, as in the good old days of plain TeX.
[3] To be found on CTAN, language/devanagari/distrib/manual.tex.

Description of the OTPs

This description is somewhat technical and demands some knowledge of both Omega and the Devanāgarī script. The reader can find more information on the former on the Omega Web site[4] and on the latter in books about the Devanāgarī script. In particular, there is a very nice introduction to the contextual features of the script in the Unicode book[5] (Section 9.1).

[4] http://omega.enstb.org
[5] The Unicode Standard, Version 3.0, Addison Wesley, Reading, Massachusetts, 2000.

velthuis2unicode.otp

In this OTP we convert Velthuis' input transcription into Unicode. It is a quite short OTP (about 80 lines), with lines of the type:

    `z' => @"095B @"094D ;
    `a'`a' => @"0906 ;

On the second line, the pair of letters aa of Velthuis' transcription is sent to Unicode character @"0906 (independent vowel "aa").

... consonantic cluster. This is done by lines of the type:

    {CONSONANT} {VIRAMA} {INITI} => @"093F \1 \2 @"097D ;

where we have a consonant and virama followed by a 'short i' vowel. In this case we place Unicode character @"093F followed by the consonant. On similar lines, we have n-tuples (n ≤ 7) of consonants and vi-