The Arabi system: TeX writes in Arabic and Farsi
Youssef Jabri
École Nationale des Sciences Appliquées, Oujda, Morocco
yjabri (at) ensa dot univ-oujda dot ac dot ma

TUGboat, Volume 27 (2006), No. 2, Proceedings of the 2006 Annual Meeting, p. 147

Abstract

In this paper, we present a newly arrived package on CTAN that provides Arabic script support for TeX without the need for an external preprocessor. The Arabi package adds one of the last major multilingual typesetting capabilities to Babel by adding support for the Arabic and Farsi languages. Other languages using the Arabic script should also be more or less easily implementable.

Arabi comes with many good-quality free fonts, Arabic and Farsi, and may also use commercial fonts. It supports several input encodings (namely the 8-bit CP-1256 and ISO-8859-6, and Unicode UTF-8) and can typeset classical Arabic poetry.

The package is distributed under the LaTeX Project Public License (LPPL), and has the LPPL maintenance status "author-maintained". It can be used freely (including commercially) to produce beautiful texts that mix Arabic, Farsi and Latin (or other) characters.

[Arabic version of the abstract; the Arabic text is not recoverable from the extracted source.]

1 Introduction

The development of Arabi (footnote 1) was a response to the absence of a package that manipulates the Arabic script and fulfills the following requirements:

1. LaTeX2e and Babel compliant, this combination of format and package being, in our opinion, the most widely used when mixing different languages.
2. The possibility of using 8-bit input text, including already existing Arabic texts, on different systems.
3. Able to use existing beautiful Arabic fonts, both commercial and free.
4. Free (as in freedom), meaning a license like the GNU GPL or LPPL.

Arabi comes with an extensive user manual; this article gives a general overview of the system.

Footnote 1: The name of the package should not be misunderstood. It is not designed to support only the Arabic language, but all languages that use the Arabic script. Technically speaking, for Babel, they will all be considered as dialects of Arabic.

2 Typesetting Arabic with TeX: the existing possibilities

TeX and the Arabic script have a long history.

One might imagine that enabling TeX to write in both directions, Right-to-Left (R2L) and Left-to-Right (L2R), with an Arabic font suffices to typeset Arabic with TeX.

Unfortunately, although such an extended TeX may perhaps be used to typeset an R2L language like Hebrew, this is far from sufficient for a complex script like Arabic, where the shapes of the glyphs depend on the context and may take many forms (at least four forms for the majority of Arabic characters, even in the simplest cases (footnote 2)).

Many early attempts have been made; they all relied on a preprocessor that does the contextual analysis (also known as the shaping algorithm).

One attempt, not widely known, due to Terry Regier from the University of California, Berkeley, and dating from December 1990, relied on the famous macros of D. Knuth and P. MacKay:

    %The lines below are from Knuth and MacKay
    % TUGboat vol.8, #1, page 14.
    \font\revrm=xbmc10 \hyphenchar\revrm=-1
    \catcode`\|=\active
    \def|#1|{{\revrm \reflect#1\empty\tcelfer}}
    \def\reflect#1#2\tcelfer{\ifx#1\empty\else%
    \reflect#2\tcelfer#1\fi}

to do the reflection, after a preprocessor has done a rough contextual analysis.

The pioneering work by Knuth and MacKay [11] made things much better! They implemented the TeX bidirectional algorithm (which is unrelated to the Unicode Bidirectional Algorithm; the latter implicitly chooses the directions of the text) and added four primitives to TeX: \beginL, \endL, \beginR and \endR.

Some early attempts were also carried out by Y. Haralambous, who used the new extended engine TeX--XeT. This includes the non-free Al-Amal (footnote 3) (1992, [6]), and the free ArabiTeX (footnote 4) (April 1995).

The most widely used system at present is probably K. Lagally's ArabTeX [13]. It is a package for writing Arabic in several languages using the Arabic script. It consists of a TeX macro package and one Arabic Naskhi-like font. ArabTeX will run with Plain TeX and LaTeX, and works with any TeX engine, because it uses its own bidirectional algorithm. So, no preprocessor is needed! This makes it a little slow, but with today's computer power this is not really a problem. Its real drawback lies in the fact that the macros apparently depend heavily on the glyphs of the font it uses, making it practically impossible to use any other fonts that may be available to the user.

For courageous users, there also exist two more powerful systems:

• Ω (Omega) by Y. Haralambous and J. Plaice, and
• XeTeX by J. Kew, if you have the right system and the right fonts.

Footnote 2: Through typographical simplifications. Some aspects of traditional Arabic typography are described in [5].
Footnote 3: We did not review it, as it was not available to the public as far as we know.
Footnote 4: The source and a DOS executable of the preprocessor were available through the French TUG.

3 Arabic script specifics

The Arabic script is one of the most widely used scripts on earth. It dominates in Arabic countries, of course, but has a special place for all Muslims because it is the script used to write the Koran, the holy book of Muslims.

The Arabic script, like most other Semitic scripts, is written from Right-to-Left.

Another important aspect of the Arabic script is that no hyphenation is needed, or allowed at all. So, no hyphenation patterns are needed for any language that uses the Arabic script. In very old Arabic documents, words could be split after a non-connecting character, while characters that connect were never split. In modern Arabic, hyphenation is forbidden completely. This makes it more difficult to get justification when long words occur at the end of a line, but Arabic is also cursive and has (in modern fonts mimicking the handwritten forms) a special character called kashida or tatweel (keshideh is a Farsi word that means stretch) that may be used between adjoining characters to make the word longer.

[Example: one Arabic word set at four increasing widths by inserting more and more kashidas; the glyphs are not recoverable from the extracted source.]

3.1 The Arabic alphabet

The Arabic alphabet is caseless, but most letters have either two or four forms. The different forms are used according to the letter's position in the word (initial, medial, final and isolated). In its basic form, the alphabet consists of:

• 28 consonants (29 if we count the hamza). But these 28 characters can easily give rise to more than 1000 glyphs per font if all ligatures are present!

• Seven diacritical marks specifying the vowels. They are not used in typical Arabic texts but appear in poetry, textbooks for people learning the Arabic language, and some religious texts. They can be typed and then, at the moment of compiling the document, either included or omitted according to the author's wish! The three basic ones are called fatha, damma and kasra; the sukun is used for the absence of vowels; and there are three tanwin forms written by doubling the three basic ones.

Table 1: Some characters' contextual forms

Isolated | Initial | Medial | Final
[three rows of Arabic glyphs, not recoverable from the extracted source]

The vowel marks are written somewhat like accents in the Latin script. In the original illustration of the marks, the drawn line represents the baseline, with the vowels that appear above the line being typeset above letters, while those below the line are typeset below letters.

3.2 Arabic typography

This aspect of Arabic merits much investigation, and much can be said about it. But in order not to be too lengthy, we will just cite three points.

In the classical Arabic literature, there are no typographical styles like bold, italic, etc.

[...]

fonts, using the (quite limited) ligature possibilities of METAFONT.

This second point is the whole secret of Arabi's compatibility with most available packages. We tried to shorten the TeX coding that deals with the specifics of the Arabic script as much as we could, to avoid possible conflicts and clashes with other code.

The system is also compatible with all other formats, such as plain or ConTeXt. This too is because the whole contextual analysis is done in the fonts!

4.1 Input and font encodings

Typesetting Arabic and Farsi texts with TeX implies the use of special input and output encodings, so we need to use the standard packages inputenc and fontenc.

We use two special font encodings. For Arabic, we use LAE, for Local Arabic Encoding, while for Farsi we use LFE, which stands for Local Farsi Encoding. These two encodings are not final. Some character positions may change, and some empty slots will be