Dadtex — a Full Arabic Interface
Total Page:16
File Type:pdf, Size:1020Kb
DadTEX — A full Arabic interface Mustapha Eddahibi, Azzeddine Lazrek and Khalid Sami Department of Computer Science, Faculty of Science, University Cadi Ayyad P.O. Box 2390, Marrakesh, Morocco Phone: +212 24 43 46 49 Fax: +212 24 43 74 09 lazrek (at) ucam dot ac dot ma http://www.ucam.ac.ma/fssm/rydarab Abstract This paper presents a TEX-based way to localize LATEX documents for natural languages with a script based on the Arabic alphabet. So, native speakers of such natural languages who do not know English can use and understand LATEX. TOA Tyr Th¤ - AR @¡ yt§ .ryhK QwOn dySnt A\n ®A Ab§r` Aqm @¡ |r`§ :Pl §@l yt§ ©@ r¯ .TOA Tyr Tl dm Pn r§r TyAk §r`t h¶r\n ty Ar Am`tF TyAk T§zyl¯ Tl Am`tF w`yWts§ ¯ Arb ¢dq§ A Cr Yl T`yC dySn w ¤Ð QwO Yl wO¤ .¨l}¯ 1 Introduction encoding. To allow direct data entry for partic- ular letters of Latin languages, a set of packages Although the typesetting system TEX was originally (inputenc, Babel,6 . ) have been proposed. The designed especially for composition of mathemati- 7 cal documents, it has become a very fine system for application encTEX [5], compatible with the stan- typesetting documents in many other fields. The dard 8-bit TEX, allows the encoding of input/output Arabic alphabet–based scripts are some of the lin- in UTF-8. Initially, Arabic document composition was per- guistic contexts where the use of TEX has been pro- gressively adapted. Many projects, including Arab- formed through Arabic/Latin transliterations. That is, the input of Arabic text was done with Latin TEX1 [4], MlTEX,2 Omega3 [3], ε-TEX,4 Aleph, and others, as multilingual systems, have been interested characters corresponding to Arabic alphabet glyphs. in typesetting documents containing simple Arabic The glyph transliterations are accompanied by the text. The system RyDArab5 has been dedicated to application of a set of contextual operations to find the composition of mathematical expressions in var- suitable glyphs. This is necessary because, besides ious notations used in Arabic, depending on the cul- its right-to-left writing direction, Arabic writing is tural context. cursive and letters have several shapes according to their positions in the word: initial, median, fi- At first, TEX was intended for document com- nal, isolated (e.g., ). The Omega and position in English. It was based on the ASCII ArabTEX systems were interested in direct Arabic 1 ArabTEX is a multilingual computer typesetting sys- text input; i.e., text is directly typed in an Ara- tem developed by Klaus Lagally: http://www.informatik. bic text editor. This operation is performed using uni-stuttgart.de/ifi/bs/research/arab_e.html transliteration tables to translate Arabic text from 2 MlTEX was developed by Michael Ferguson 3 Omega is a 16-bit enhanced version of TEX developed by the text editor encoding to the encoding used by the John Plaice and Yannis Haralambous: http://omega.enstb. typesetting system. org/ 4 ε-TEX is an extended TEX including the features of TEX--XET, developed as part of the NT S project by Peter 6 Babel provides multilingual support for TEX. It’s de- Breitenlohner under the aegis of DANTE e.V. during 1992: veloped by Johannes L. Braams: http://www.ctan.org/ http://www.ctan.org/tex-archive/systems/e-tex/ tex-archive/macros/latex/required/babel/babel.pdf 5 RyDArab is a system for typesetting mathematics in 7 encTEX allows full UTF-8 processing in standard 8-bit an Arabic notation: http://www.ucam.ac.ma/fssm/rydarab/ TEX. It’s developed by Petr Olˇs´ak: http://www.olsak.net/ system/zip/rydarab.zip enctex.html 154 TUGboat, Volume 27 (2006), No. 2 — Proceedings of the 2006 Annual Meeting DadTEX — A full Arabic interface Until now, however, there is no system that With DadTEX,10 such problems can be avoided by allows composing Arabic documents completely in using only Arabic text and minimizing bidirectional Arabic. So, the development of a mechanism com- text in source documents. pletely based on TEX to do so was an open ques- The RyDArab system has already made some tion. Therefore, we developed an interface, called steps in the way of TEX localization, by adding com- DadTEX,8 that allows creation of LATEX documents mands for automatic date generation with Arabic in Arabic. The goal was to allow Arabic users, who month names and Arabic numbers. Further steps to don’t know English, to compose LATEX documents be done in this field include exploiting the linguis- containing only Arabic letters, with the addition of tic and regional properties from the system settings control characters like \, $, . and of course with and getting notational preferences (such as units, Latin text for bilingual documents. currency, . ). Mathematical documents in an Arabic nota- tion, composed with RyDArab and CurExt,9 can so be typeset directly. In order to save the seman- 2 DadTEX system tic meaning of mathematical commands and envi- 2.1 Process ronments, it was natural to translate the RyDArab and CurExt macros. Of course, those translations At the beginning of this work, we thought of creat- are not enough. In TEX, besides the commands for ing an application to convert Arabic text in a source mathematical objects, there are many commands document into its transliterated version using the that specify the mathematical content of the text, same correspondences as those used in the translit- such as the theorem environment. erations. The program would read the source text With these translations, Arabic users now hope character by character and translate each element that one need not use commands like \chapter, in into its equivalent. When a control character is met, a standard article. Clearly, if commands are also lo- the program would operate differently, by seeking in calized, it will be easy to understand their meaning; a commands dictionary the equivalent of this com- further, customization of command names allows mand. T X to be tailored to local pedagogical approaches. This mechanism is similar to the one used in E 11 Also, using English based commands with Ara- FarsiTEX [1]. FarsiTEX translates Persian text bic text is a source of ambiguity. Some problems to transliterated text via an application program with TEX source editing with bidirectional lines are: ftx2tex. Commands are written with Latin charac- ters but are rendered from right to left. For example, Text selection: selections can be made either in log- \begin{document} is written as {document}begin\. ical or visual mode. Selections in logical mode This method, based on the use of an applica- are accompanied by discontinuities which is a tion to translate a localized source file, presents sev- direct consequence of bidirectionality and that eral drawbacks. For instance, it is not direct; it leave the user feeling uncomfortable, while vis- generates a supplementary file to be processed in- ual mode selections lead to content ambiguity. stead of the original source file. So, the compila- Character positions: in a bidirectional line, the end tion time is increased. Likewise, additional memory and beginning of a run of Arabic text are vis- and free space are required. Therefore, the method ually in the same place. adopted in DadTEX is based on a direct use of TEX through translation of commands into Arabic. The Semantic ambiguity: characters without explicit source file is compiled directly. This makes it pos- direction inherit the direction of the preceding sible to avoid the use of a supplemental application text. This misleads the reader of the source that could eventually limit the use of DadTEX across about the meaning of the expression. E.g., the platforms. command \catcode‘\ =11 will be displayed Such command name translation is not suffi- as the incorrect \catcode‘\11= . cient when using some systems, such as RyDArab. 8 Dad is an Arabic letter. Its sound doesn’t exist in many languages, showing the Arabic language’s phonetic precision. 10 DadTEX is developed within the framework of global In the Arabic literature, the Arabic language is often called project “Al-khawarizmy”, acting in the general field of local- the Dad language. ization of e-documents and tools: http://www.ucam.ac.ma/ 9 CurExt is a variable-sized curved symbols type- fssm/rydarab/system/zip/dadtex.zip setting system: http://www.ucam.ac.ma/fssm/rydarab/ 11 FarsiTEX is a Persian/English bidirectional typesetting system/zip/curext.zip system: http://www.farsitex.org/ TUGboat, Volume 27 (2006), No. 2 — Proceedings of the 2006 Annual Meeting 155 Mustapha Eddahibi, Azzeddine Lazrek and Khalid Sami Indeed, RyDArab is based on transliteration of indi- vidual letter-based alphabetical symbols [2]. There- fore, it becomes necessary to redefine the correspon- dences using Arabic characters. This operation has to take into account the particular encoding used. Various systems of numbers are used in Arabic alphabet-based writings: the Arabic numbers no- tation system called Arabic or occidental numbers, used in North African Arabic countries; Arabic-Indic numbers, used in Middle Eastern Arabic countries; and Persian numbers, used in Iran. In TEX, num- bers are used to control document component sizes, an action’s frequency, etc. In this work, only the syntactic aspect of numbers will be discussed, the semantic would be in the scope of a future DadTEX version. The method used in DadTEX is not specific to Arabic language, or languages with an Arabic alphabet–based script. It can be generalized on one hand to any right-to-left writing system, and on the other hand to any non-ASCII encoded documents. This system would alleviate the learning burden im- posed on non-English speaking people.