A Non-Standard Application of ArabTeX: Generating Sorted Indices∗

Klaus Lagally
Institut für Informatik, Universität Stuttgart
Breitwiesenstraße 20-22, 70565 Stuttgart, Germany
January 1, 2011

∗ Submitted to Cahiers GUTenberg

Abstract

ArabTeX, a macro package extending TeX and LaTeX, has primarily been designed to support the typesetting of texts using the Arabic and other right-to-left scripts. However, due to the high flexibility of the TeX macro mechanism, we can also apply ArabTeX to support some more general data processing tasks, provided the input data contain a suitably chosen symbolic markup. We present some techniques we used to generate sorted indices in the context of a multi-lingual dictionary.

1 Introduction

ArabTeX [?, ?], a macro package for TeX [?] and LaTeX [?, ?, ?] originally designed to support the typesetting of texts in Arabic script, has recently been extended in several respects. Originally it offered its own private ASCII encoding, modelled after the standard ZDMG transliteration [?, ?], as the only means of inputting Arabic text; meanwhile several other encoding standards, such as ASMO 449 [?, ?] and ISO 8859-6 [?, ?], are also supported, and can even peacefully coexist within the same document. Likewise, other Semitic languages can be handled, e.g., vowelized Hebrew in several common encodings [?]; and the processing of Syriac is presently hampered only by the fact that we know of no Syriac font available in the public domain. One may even, should the need arise, typeset in Ugaritic cuneiform.

This surprising versatility made ArabTeX an obvious candidate to be considered in an ongoing project of building a multi-lingual dictionary, and first results look promising, even though the project is still far from completion [?, ?]. For a sample page see Figure 1.
Building a dictionary amounts to more than just producing a printed version; in fact the input data can be considered a database, and may be evaluated along many different lines, even if not all possible applications are known at this stage. It may pay off to choose an encoding that, in addition to supporting the printing job, will allow us to capture all relevant information available now, without interfering with the printing process and without precluding further evaluations. The obvious solution is some kind of symbolic markup, denoting the structure and, where available, also the meaning of segments of the data, without having to decide now on the details of further processing. This idea has been advocated for quite some time, e.g., within LaTeX [?], SGML [?, ?], and more recently, HTML [?].

Though these proposals are related, we observe some differences which, in our opinion, are important for the application at hand. Even though the manuals say otherwise, both SGML and HTML appear primarily oriented towards the visual presentation of text, and require a complete definition of the syntax and the semantics of the markup statements occurring in a given document class. This may be very useful if we want to check formal consistency and completeness, and it also enables us to decouple the collection of data from the rendering process, but it makes extending the notation for additional information a major task. On the other hand, due to the TeX macro feature, the TeX/LaTeX markup is completely open-ended and, apart from the primitive TeX commands, has no fixed meaning at all. The interpretation of a control sequence needs to be known only at the time of actual processing, and can be bound to quite different actions for different processing tasks according to additional requirements. We shall present some examples in the sequel.
2 Symbolic Markup

To demonstrate what we mean we present a short excerpt of the encoded input data for the example shown in Figure 1.

    \alemma {qAbUs} JA 1886 (1) 460. \see \ar {qwA_tUs} (ib.)

    \alemma {qAbIl} \glemma {k'aphloc} \from \syr {qpIl'}
    ZDMG 1897 (51) 470. der Kleinh"andler, Speisewirth:
    \ar {mi_tl insAn _dAhib fI al-sUq `inda al-qAbIl
    ya^sum al-^siwA'| wa-al.tabI_h}
    "`Wie ein Mensch wel\-cher auf dem Markte bei [dem] Speisewirth
    vorbeigeht und den Duft der gekochten und gebratenen
    Speisen riecht."'

    \alemma {qAtismA} \glemma {k'ajisma} pl. \ar {qAtismAt}
    GRAF VERZ. 86 "`Kathisma in der Psalmeneinleitung"'.
    \var \ar {qA.tsmA} (pl. \ar {qA.ssmAt}), \ar {kAtsmA}.

Figure 1: Example of multi-lingual text

As we can see, the data are structured as a sequence of primary and secondary entries, each providing information about some Arabic word that is supposed to originate directly or indirectly from Greek. A primary entry refers to the Greek origin directly and also reports some additional information; a secondary entry refers to some primary entry and usually is just a writing variant. The markup command \ar indicates that its argument is Arabic text encoded in the ArabTeX standard ASCII encoding; \alemma in addition identifies an Arabic lemma. Likewise we have \gr for Greek text in the GreekTeX encoding [?, ?]; \glemma denotes a Greek lemma. The markup adds some logical structure to the entries and supplies semantic information, but does not yet fix the intended processing and/or rendering; to do this we have to provide some macro definitions for the occurring control sequences before the actual processing run.
For producing the printed version this looks as follows:

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Arabic: call ArabTeX macros
    \let \ar \RL
    \def \alemma #1{\item [{\setnashbf \<#1>}]}
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Greek: use GreekTeX encoding and fonts
    \font \grfont = kdgr10
    \font \grbfont = kdbf10
    \def \gr #1{{\grfont #1}}
    \def \glemma #1{{\grbfont #1}}
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Hebrew: ArabTeX transliteration
    % use ArabTeX Hebrew mode
    \def \heb #1{\sethebrew \RL{#1}\setverb }
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Syriac: ArabTeX transliteration
    % temporary solution: mapped to Hebrew
    \def \syr #1{{\it (syr. \heb {#1})\/}} % print a single Syriac word
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % miscellaneous markup commands
    \def \see {\unskip \ $\Rightarrow$ }
    \def \from {\unskip \ $<$ }
    \def \like {\unskip \ $\approx$ }
    \def \var {{\it var. }}
    \def \cod {{\it cod. }}
    \def \key #1{\relax }

We have assumed that the ArabTeX commands and a Greek font are available, and we have also assumed that the data are included in a LaTeX "description list" environment, formatted in two columns.

Observe that the markup for an Arabic lemma and for an Arabic comment are different; they play different roles and are also rendered differently. The same is true for Greek lemmata and Greek explanations (none in the example). We did not differentiate between the various Latin-script languages occurring but could do so easily.

3 Producing a Greek Index

The dictionary in its present form enables the user to look up some Arabic word assumed to have a Greek origin, but gives no assistance at all with the task of finding the Arabic versions of a Greek term. To do this we need an index on the Greek terms leading at least to the relevant main entries.
Now our data as given are not in a format that could be sorted easily on the Greek lemmata, and we probably will not be able to find a sorting routine that can process the GreekTeX input notation that we used. So we will have to do some preprocessing ourselves: we have to extract the Greek lemmata and transform them into a format that can be sorted by the software at hand, without losing the connection to the main entries.
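The extraction step can be illustrated by rebinding the markup commands before a separate processing run. The following is only a minimal sketch, not the definitions actually used for the dictionary: the names \idxfile, \currentlemma and \gentry and the file name greek.idx are hypothetical, and the lemma arguments are assumed to contain only ordinary characters that survive expansion inside \write.

```tex
% Sketch only: \idxfile, \currentlemma, \gentry and greek.idx are
% hypothetical names introduced for illustration.
\newwrite\idxfile
\immediate\openout\idxfile=greek.idx

% No Arabic lemma has been seen yet.
\def\currentlemma{}

% Instead of typesetting the Arabic lemma, remember it as the key
% of the enclosing main entry.
\def\alemma#1{\def\currentlemma{#1}}

% Instead of typesetting the Greek lemma, write one record that
% pairs it with the remembered Arabic lemma.
\def\glemma#1{%
  \immediate\write\idxfile{\string\gentry{#1}{\currentlemma}}}
```

Under these assumptions, the second entry of the excerpt in Section 2 would produce the record \gentry{k'aphloc}{qAbIl} in greek.idx. The Greek lemma is still in the GreekTeX input notation at this point, so converting it into a sortable key remains a separate step.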
