
PassiveTEX: from XML to PDF

Michel Goossens, CERN, IT Division, CH-1211 Genève 23, Switzerland

Sebastian Rahtz, Oxford University Computing Services, Oxford, United Kingdom

Abstract

This article introduces PassiveTEX, a library of TEX macros based on xmltex that processes XML documents containing XSL formatting objects and generates PDF or DVI output. We show examples of XML sources marked up using the TEI, DocBook, and MathML DTDs.

Introduction

New information becomes available on the Internet every day, and increasingly material is becoming available in XML. However, when one wants to print the information shown on the screen, one is faced with several shortcomings, the most important being the low typographic quality of the result and the problem of how to serialize a ‘web’ of pages onto a linear output medium. In this article we explain how a source document marked up in XML can be typeset nicely by a two-step procedure that combines a transformation of the XML source into XSL formatting objects and the direct interpretation of these XML objects with Sebastian Rahtz’ PassiveTEX, a library of TEX macros based on David Carlisle’s xmltex [4].

The choice of TEX as the typesetting engine is justified by the fact that TEX handles mathematics in a natural way, can manage several languages simultaneously (hyphenation, typographic rules, multiple encodings and alphabets, right-left typesetting, etc.), and has a good model for marking up complex tabular information. Moreover, TEX can easily handle very long documents, such as large software manuals, or bills for hundreds of thousands of telephone subscribers.

In the first part of this article we use as an example some literary texts, including poems by the French poet Paul Verlaine and the Russian poet Alexander Blok, marked up according to the TEI DTD (Text Encoding Initiative [3], Lite version). In the second part we show an example of a more technical document, including a couple of simple math formulae, using the DocBook [14] and MathML [17] DTDs.

The present article was prepared in XML using TEI markup. It was not actually typeset with PassiveTEX, but transformed into LATEX with an XSLT transformation stylesheet and typeset using the ltugproc LATEX class file.

XSL formatting objects

The Extensible Stylesheet Language (XSL) is a language for describing page designs. For any given class of arbitrarily structured XML documents or data files, an XSL stylesheet allows you to express how the content contained in the XML should be presented. The stylesheet indicates how source elements should be styled, laid out, and paginated in some presentation form. In this article we only address issues related to high quality typesetting (using TEX). Our XML examples can be transformed into other representations (see [7] for more discussion), in particular HTML, and XSL stylesheets targeting HTML are available for both the TEI [12] and DocBook [15] markup schemes.

An XSL processor reads an XML source document together with an XSL stylesheet, and produces the presentation of the given XML content according to the instructions in the stylesheet. The preparation of the presentation form proceeds in two steps. First, a result tree is constructed from the XML input source tree (tree transformation step). Second, the result tree is interpreted to produce the formatted results (formatting step), which can be presented on output media such as a computer screen, a WAP display, on paper, in speech, etc.

The result tree can be very different from the structure of the source tree, since parts from the

222 TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting PassiveTEX: from XML to PDF

input can be deleted from the result tree, while supplementary information (e.g., table of contents) can be generated. The tree transformation step also has to add the information necessary to format the result tree.

The nodes of the result tree are formatting objects, whose semantics are expressed in terms of a catalog of formatting object classes defined in the XSL Specification [24]. They denote typographic abstractions such as page, paragraph, table, list, etc. Fine control over the presentation of these abstractions is provided by a set of formatting properties, such as those controlling alignments, color, fonts, spacing, writing-mode, hyphenation, etc. XSL offers a rich range of formatting objects, as one of its aims is to cover the semantic functionality of both the Document Style Semantics and Specification Language (DSSSL) [10] and Cascading Stylesheets (CSS) [16] models as completely as possible.

PassiveTEX: implement XSL using TEX

In the previous section we explained how an input XML document can be transformed into a new XML document containing a result tree of XSL formatting objects. How do we go about rendering those objects with a real typesetting engine, and getting high-quality printout? One way is to use TEX itself. PassiveTEX is a library of TEX macros which can typeset the XSL formatting objects. In particular, PassiveTEX combined with the pdfTEX variant of TEX generates high-quality PDF files in a single operation. PassiveTEX allows one to choose TEX as the formatter for XML, while hiding the details of its operation from the user.

PassiveTEX derives from and builds on xmltex [4], a TEX package by David Carlisle, providing the core XML parser and UTF8 handler. Ideas and TEX code are also inherited from earlier work by Sebastian Rahtz on his JadeTEX package, which processed the output of the Jade DSSSL processor’s TEX backend, and already used a catalogue of Unicode/TEX mappings.

Issues in using TEX

Since PassiveTEX is based on TEX one can rapidly develop and test new high level code, knowing that TEX will take care of the typesetting details. Moreover, TEX is a well-understood, robust, and free page formatter, where good support for fonts, graphics inclusion, hyperlinks, etc., is already available. One can also profit from TEX’s mature handling of language issues, including hyphenation, and its high-quality math rendering. Finally, pdfTEX allows direct generation of high-quality PDF.

There are, however, some drawbacks associated with using TEX as typesetting engine. Firstly, we are constrained to use TEX’s page makeup model, and have to force the XSL formatting objects to fit that model, which is not always straightforward. Moreover, since PassiveTEX is actually layered over LATEX it is too easy to allow things to fall through and take LATEX defaults.

On a more practical level, TEX macro writing is obscure and difficult and thus the system is not transparent for most users. And, last but not least, TEX is large and monolithic, and unsuited to embedding in other applications.

Users of the system with a TEX background can find it confusing that TEX no longer does all the work, which is now split between the stylesheet and the formatter; the four important points to remember are:

• no explicit use is made of LATEX’s high-level constructs, in particular there are no sections, lists, cross-references, bibliographies, etc.;
• all vertical and horizontal space is explicit in the specification;
• page and line breaking is left to TEX: the rest is up to the stylesheet;
• XSL formatting objects use XML syntax, so that the underlying character set is Unicode. By default, entities are mapped to their Unicode position.

PassiveTEX will switch to using the Unicode-based TEX variant soon (Omega [8]), to handle non-Latin material more naturally.

Two extensions are needed for practical use. The first is support of MathML; this is largely in place (it is simply passed through as-is by an XSL stylesheet), but needs some tuning. In particular, the intricacies of equation numbers remain to be dealt with properly. Second, we need to support Scalable Vector Graphics (SVG) [21] XML code somehow (e.g., by direct interpretation, translation into MetaPost [9], or pre-processing). Moreover, SVG fragments need to be recognized directly to perform in-line graphical functions (e.g., setting text at an angle). No work at all has been done on this.

Support for XSL formatting objects

Not all parts of the XSL specification are fully supported. The XSL formatting objects which are implemented fairly well cover the page specifications, blocks, inline sequences, lists, graphics inclusion, floats, font properties, and links. Not so well implemented at present are all the table properties, and object margins, borders, and padding. In order to properly implement these, we have to put every paragraph into


an explicit TEX box ‘just in case’, and this becomes slow and clumsy. Tables, too, are treated rather differently in XSL, with many properties assigned at the cell level.

Parts of the specification that are not yet implemented at all include bidi handling, backgrounds, aural properties, and a large number of assorted properties.

A small example

The set of XSL stylesheets for the TEI DTD (tei-fo [12]) specifies how the various XML elements can be transformed into XSL formatting objects (see Section 7.6.6 of [7] for a more detailed discussion). Without further comments, the following code example shows the transformations for a paragraph (p element, lines 1 to 6) and emphasized material (emph element, lines 7 to 11).

[Listing: XSLT templates for the p and emph elements, lines 1 to 11; markup not preserved]

A few words about xmltex

As PassiveTEX uses xmltex to read and interpret the XML source, it is appropriate to say a few words about it. xmltex is a non-validating (and not 100% conforming) namespace-aware XML parser implemented in TEX. xmltex reads the encoding declaration and tries to implement it; in particular it handles UTF8. It reads the DTD subset and expands entities properly, and it uses namespaces to load modules of TEX code to process elements (for instance for MathML, see the following example). xmltex offers limited access to child and parent elements to guide the interpretation.

The workhorse of an xmltex package file is the \XMLelement command, whose four parameters indicate how TEX will handle a given element, its attributes and its content. The following example shows how a few simple elements of the MathML namespace are handled.

 1 \DeclareNamespace{m}
 2   {http://www.w3.org/1998/Math/MathML}
 3
 4 \XMLelement{m:math}
 5   {}
 6   {\begin{equation}}
 7   {\end{equation}}
 8
 9 \XMLelement{m:mn}
10   {}
11   {\xmlgrab}
12   {\mathrm{#1}}
13
14 \XMLelement{m:msqrt}
15   {}
16   {\xmlgrab}
17   {\sqrt{#1}}

Lines 1 and 2 declare the prefix m to be used for referring to elements in the given namespace. Lines 4 to 7 declare that the start and end tag of a math element are transformed into the opening and closing of an equation environment. Lines 9 to 12 define how a MathML mn (number) element is typeset. Its contents are stored (grabbed, see line 11) and then injected as parameter of a \mathrm to be typeset in roman. Similarly, lines 14 to 17 show how a square root (msqrt) element and its content are transformed into LATEX’s \sqrt and its argument.

To fine-tune the output, xmltex supports pro[…] directly.

The biggest problem at present in using xmltex to write PassiveTEX is that it uses TEX grouping to make sure each element is handled in the right namespace. Because TEX does not allow for nested grouping, each element is caught in a group; while \aftergroup can help a bit, this is the first hurdle for anyone trying to extend the package.

An example of TEI markup typeset with PassiveTEX

In this section we use an XML source file that contains a collection of literature of the French poet Paul Verlaine, the Russian poet Alexander Blok, the English novelist Thomas Hardy, and the Finnish author Aleksis Kivi. It is marked up according to the TEI Lite DTD. The master file poemstei-utf8 follows:

[Listing poemstei-utf8, lines 1 to 22; only fragments of the markup and the text content survive]
 6 xml">
10 ]>
15 Various languages and scripts
16 Collected by Michel Goossens
18 Transcription in TEI.2 markup
19 Michel Goossens
22 Distributed by mg


[Listing poemstei-utf8, lines 23 to 51; only the text content survives]
27 English
28 Finnish
29 French
30 Russian
38 Languages and scripts
40 Collected by Michel Goossens
41 July 2000
45 &Verlaine;
46 &Blok;
47 &Hardy;
48 &Kihlaus;

This XML source document uses the UTF8 encoding, since we want to mix various alphabets (in this case Latin and Cyrillic). UTF8 is a Unicode-based encoding [13] that can encode each character in Unicode’s base plane with at most three bytes, and each character in its sixteen supplementary planes (which together allow for more than one million characters) with four bytes. Documents that […]

To typeset this document we first transform the XML into XSL formatting objects using an XSLT stylesheet, as explained previously. Then, this in[…] fotex. […]

The file style.xsl is an XSL stylesheet that contains the XSLT transformations needed to transform the input XML file infile.xml into XSL formatting objects. The resulting intermediate XML document fotex.xml can be interpreted by applications that can render XSL formatting objects. In our case we use PassiveTEX.

Below we show part of the beginning of the XML file fotex.xml that corresponds to the first page of the collection of poems of Verlaine transformed using Sebastian Rahtz’ TEI XSL stylesheets [12].

[Listing fotex.xml, lines 1 to 54; only attribute lines and text content survive]
 1 ...
 3 font-family="Times Roman"
 4 font-size="10pt">
 9 Poems in various languages and scripts
14 Collected by Michel Goossens
17 July 2000
22 id="N115"
23 text-align="start"
24 font-size="14pt"
25 text-indent="-3em"
26 font-weight="bold"
27 space-after="3pt"
28 space-before.optimum="9pt">
29 1. Paul Verlaine, Fêtes galantes (1869)
40 1.1.
41 Clair de lune
53 ...

At the beginning of the file (not shown) the various page masters are defined (margins, height, width, etc.), and the static page information (running headers, footers) is initialized. Then, the construction of formatted output pages starts. We show what happens at the start of page 1 (compare with


Figure 1: Partial input source of French (left) and Russian (right) documents

the output shown in the upper left corner of Figure 3). The fo:flow element contains the material that has to be typeset and cut into individual pages by the typesetting engine. We see that the default document font is Times Roman at 10 pt.

Lines 6 to 19 typeset a block of centered material. In particular, lines 7–11 take care of the document title that is typeset in 16 point bold, lines 12–15 correspond to the author’s name typeset in 14 point italic, while lines 16–18 typeset the date in 14 point roman. All this information is obtained from the content of the titlepage element of the XML master source file poemstei-utf8, shown previously (lines 36–42).

The remaining text is in the external files Verlainetei-utf8.xml (entity reference on line 46 of poemstei-utf8), which contains the poems of Paul Verlaine in French, a fragment of which is seen in the left half of Figure 1, and Bloktei-utf8.xml (entity reference on line 47 of poemstei-utf8), which contains the poems of Alexander Blok in Russian, a fragment of which is seen in the right half of Figure 1.

The author and the title of the collection of poems (specified as the content of the head element of a div1 element on line 3 in Figure 1) are handled on lines 21 to 30 in the formatting objects output, where a block for the first level title is created. Note that the section number (1. on line 29) is generated by the XSLT application. Similarly, the title of an individual poem (specified as the content of the head element of a div2 element in Figure 1, e.g., on line 6) is handled on lines 32 to 42. Once more, the number (1.1 on line 40) is generated automatically. Then, the stanza and the lines inside the stanza of the poem (the lg and l elements in Figure 1) are treated. For instance, the stanza on lines 8–13 in Figure 1 corresponds to the block on lines 44–52 in the formatting objects output; in particular, the input line 8 becomes the block on lines 47–50. More details about the various XSL formatting objects and their attributes can be found in the XSL Specification [24].

The second stage of the process (typesetting the XSL formatting objects with LATEX) uses xmltex, which is called by running LATEX on the intermediate file fotex.tex containing the following three lines:

1 \def\xmlfile{fotex.xml}
2 \input xmltex.tex
3 \end{document}

Line 1 specifies the file that has to be interpreted by xmltex, before it takes control on line 2. The result of treating the file poemstei-utf8.xml with the two-step procedure outlined is shown in Figure 2. […] xmltex, where we had to define for each XML element and its attributes how it should be rendered […]

Besides PassiveTEX, FOP [2], originally developed by James Tauber, and presently supported by the Apache Project, is another application, written in Java, that transforms XSL formatting objects into PDF. FOP does not yet handle many aspects of fine typography as well as TEX. Commercial products targeting the same task of producing excellent typographic output from XSL formatting objects have also been announced (RenderX, ArborText’s Epic), but are not yet available.

DocBook markup and some words about mathematics

Up to now we have discussed a document marked up according to the Text Encoding Initiative (TEI) XML schema. In this section we show that other markup schemes can also be used, in particular DocBook [14], which we will combine with MathML [17], W3C’s XML DTD for mathematics.

DocBook is an XML (originally SGML) DTD which is especially well-suited for marking up technical documents. It is used extensively for preparing software reference guides and computer equipment manuals.

The DocBook DTD or schema contains hundreds of elements to mark up clearly and explicitly the different components of an electronic document (book, article, reference guide, etc.), not only displaying its hierarchical structure but also indicating the semantic meaning of the various elements. The structure of the DTD is optimized to allow for customization, thus making it relatively straightforward to add or eliminate certain elements or attributes, to change the content model for certain structural groups, or to restrict the value that given attributes can take.

An example of DocBook markup

We are not going to cover the DocBook vocabulary in any detail (see Walsh’s book [14] or his DocBook Web page). We shall only consider an example where we used the DocBook markup schema. A file dbxmml.xml, whose first part is reproduced below, contains various elements of the DocBook vocabulary. It will allow us to test the typesetting paradigm based on PassiveTEX outlined for the TEI schema previously.

[Listing dbxmml.xml, lines 1 to 47; only fragments of the markup and the text content survive]
 8 xmltex">
17 %;
18 ]>
21 A Docbook document featuring a
22 few formulae
23 Michel
24 Goossens
26 Wednesday, 18 July 2000
29 This XML document is marked up according the
30 DocBook schema It shows a few elements of the
31 DocBook vocabulary, as well as a couple of
32 examples of mathematical expressions where
33 we used MathML markup.
38 The Docbook model
40 DocBook is an XML model
42 for marking up technical
44 documents. It is particularly well-suited for
45 software reference guides and computer
46 equipment manuals.

The DTD internal subset (lines 4–18) contains the general entity definitions of LaTeX, PassiveTeX, TeX, and xmltex (lines 5 to 9), and on lines 10 and 11 parameter entity definitions specifying the content models for the equation.content and inlineequation.content elements, which can contain one or more math elements. Lines 12–13 define the mathml parameter entity that identifies the system file containing the MathML DTD, which is loaded on line 17. Line 14 specifies the namespace of the math element. The remaining lines have markup using the DocBook vocabulary, and their meaning should be rather straightforward to understand.

An example of MathML presentation markup [17] follows.

Figure 2: Typesetting a collection of literature marked up according to the TEI, using XSL formatting objects
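The tree transformation step behind Figure 2, in which TEI elements are turned into XSL formatting objects, can be illustrated with a small Python sketch. It is not the TEI stylesheet used for the article, and a real run would use an XSLT processor: the mapping table, the property values, and the function name `tei_to_fo` are assumptions chosen for illustration. Only the `fo` namespace URI is taken from the article.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"  # namespace of the fo: elements
ET.register_namespace("fo", FO_NS)

# Hypothetical mapping from TEI element names to (FO element, properties).
RULES = {
    "p": ("block", {"text-align": "justify", "space-before": "6pt"}),
    "emph": ("inline", {"font-style": "italic"}),
}


def tei_to_fo(node):
    """Build the result-tree node for one source node (tree transformation step)."""
    name, props = RULES.get(node.tag, ("block", {}))
    out = ET.Element(f"{{{FO_NS}}}{name}", props)
    out.text = node.text
    for child in node:
        fo_child = tei_to_fo(child)
        fo_child.tail = child.tail  # keep the text that follows the child element
        out.append(fo_child)
    return out


source = ET.fromstring("<p>Votre âme est un <emph>paysage</emph> choisi</p>")
result = tei_to_fo(source)
print(ET.tostring(result, encoding="unicode"))
```

The printed result is an `fo:block` containing an `fo:inline`, which is the kind of intermediate document that the formatting step then has to typeset. Keeping the rules in a table rather than in code loosely mirrors how a stylesheet separates the mapping from the traversal.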

Figure 3: Typesetting a collection of literature marked up according to the TEI, using xmltex
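Typesetting directly with xmltex, as in Figure 3, means stating for each element what TEX code it should produce. The Python sketch below mimics that idea for the three MathML elements covered by the \XMLelement declarations in the article (math, mn, msqrt). It is only an analogy under simplifying assumptions: the function name `to_tex` and the rule table are hypothetical, every element is treated as if its content were grabbed and passed as #1, and unknown elements simply pass their content through.

```python
import xml.etree.ElementTree as ET

M = "{http://www.w3.org/1998/Math/MathML}"

# Each rule wraps the already converted content of an element,
# in the spirit of \xmlgrab followed by a macro body that uses #1.
RULES = {
    M + "math": lambda body: "\\begin{equation}" + body + "\\end{equation}",
    M + "mn": lambda body: "\\mathrm{" + body + "}",
    M + "msqrt": lambda body: "\\sqrt{" + body + "}",
}


def to_tex(node):
    """Convert an element to a TeX string; unknown elements pass content through."""
    body = (node.text or "").strip()
    for child in node:
        body += to_tex(child) + (child.tail or "").strip()
    rule = RULES.get(node.tag)
    return rule(body) if rule else body


doc = ET.fromstring(
    '<m:math xmlns:m="http://www.w3.org/1998/Math/MathML">'
    "<m:msqrt><m:mn>2</m:mn></m:msqrt></m:math>"
)
print(to_tex(doc))
```

For this input the sketch prints `\begin{equation}\sqrt{\mathrm{2}}\end{equation}`, the nesting one would expect from the three declarations.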

[Listing: MathML presentation markup example, lines 1 to 39; only the text content survives]
 2 A MathML example
 4 A MathML formula can be typeset inline, as here
 6 E=m
 7 c2
 8 , Einstein’s famous equation.
12 A mathematical equation can also be typeset in
13 display mode using DocBook’s
14 informalequation
15 element, as is shown in the following example
16 containing a matrix:
22 A
23 =
27 x
28 y
31 z
32 w
37 .

On lines 5–8 we use the inlineequation element, which contains a mathematical equation to be displayed inline, whereas lines 19 to 39 show an informalequation element, which contains mathematical material (a matrix equation) to be typeset in display mode. The typeset result (with PassiveTEX, see below) is shown following the heading A MathML example in the lower left hand image of Figure 4.

The DocBook example and PassiveTEX

We use the same two-step procedure here with DocBook as outlined previously in the context of the TEI: transform the XML file into XSL formatting objects and then typeset with PassiveTEX. For the transformation of the XML DocBook markup into XSL formatting objects we use XSL stylesheets developed by Norman Walsh [15]. We customized the stylesheet somewhat to handle the mathematics and to add templates for a few elements. The customized stylesheet foplus.xsl follows.

[Listing foplus.xsl; only part of the opening tag survives]
4 version="1.0"
5 xmlns:fo="http://www.w3.org/1999/XSL/Format"
6 xmlns:m="http://www.w3.org/1998/Math/MathML">

[…] MathML element m:math and the whole tree structure it supports have to be copied unchanged to the result tree. This means that the MathML elements […]

[Figure 4: page images of the typeset DocBook example “A Docbook document featuring a few formulae”, with the headings “The Docbook model”, “MathML and mathematics”, “Presentation markup”, “Content markup”, “A MathML example”, and “Bibliographic references”]
[XML98] Consortium, Tim Bray, Jean Paoli, and Michael Sperberg-McQueen. basic content elements apply, reln (deprecated), fn (deprecated for externally Extensible (XML) 1.0 [http://www.w3.org/TR/REC-xml/]. See also the defined functions), interval, inverse, sep, condition, annotated version of the XML specification [http://www.xml.com/axml/axml.html].. declare, lambda, compose, ident. [WALSH99] Norman Walsh and Leonard Muelner. Docbook. The Definitive Guide..O’Reilly & Associates, arithmetic, algebra and logic quotient, exp, factorial, divide, max and min, minus, plus, Inc.. Copyright © 1999. 1-56592-580-7. You can also consult the power, rem, times, root, gcd, and, or, xor, not, implies, forall, online version of the DocBook reference guide [http://www.oasis-open.org/docbook/.html] and download the exists, abs, conjugate, arg (MathML 2.0), real (MathML 2.0), Docbook DTD and XSL stylesheets [http://nwalsh.com/docbook/index.html].. imaginary (MathML 2.0), lcm (MathML 2.0). relations eq, neq, gt, lt, geq, leq, equivalent (MathML 2.0), approx (MathML [MATHML99] World Wide Web Consortium, Patrick Ion, and Robert Miner. 2.0). Mathematical Markup Language (MathML[tm]) 1.01 Specification [http://www.w3.org/TR/REC-MathML/]. .

Figure 4: Typeset result of an article marked up according to the DocBook schema
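
As a minimal sketch, the two formulas shown in Figure 4 might be marked up along the following lines. The fragment is an illustration only and is not taken from the authors' source file: the wrapping examples element is hypothetical, the square-bracket values of open and close are an assumption (the figure does not show which fences were used), and the way MathML is embedded inside DocBook's informalequation depends on a DTD customization that is not described here.

```xml
<?xml version="1.0"?>
<!-- Hypothetical wrapper element, used only to keep the fragment well formed -->
<examples>

  <!-- Presentation markup, inline formula: E = mc^2 -->
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <mrow>
      <mi>E</mi>
      <mo>=</mo>
      <mrow>
        <mi>m</mi>
        <msup><mi>c</mi><mn>2</mn></msup>
      </mrow>
    </mrow>
  </math>

  <!-- Presentation markup, display formula inside DocBook's informalequation.
       The fence characters chosen for open and close are an assumption. -->
  <informalequation>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mrow>
        <mi>A</mi>
        <mo>=</mo>
        <mfenced open="[" close="]">
          <mtable>
            <mtr><mtd><mi>x</mi></mtd><mtd><mi>y</mi></mtd></mtr>
            <mtr><mtd><mi>z</mi></mtd><mtd><mi>w</mi></mtd></mtr>
          </mtable>
        </mfenced>
      </mrow>
    </math>
  </informalequation>

  <!-- Content markup for the same matrix equation: the structure states
       that A equals a 2 x 2 matrix, with no rendering information -->
  <informalequation>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply>
        <eq/>
        <ci>A</ci>
        <matrix>
          <matrixrow><ci>x</ci><ci>y</ci></matrixrow>
          <matrixrow><ci>z</ci><ci>w</ci></matrixrow>
        </matrix>
      </apply>
    </math>
  </informalequation>

</examples>
```

The presentation version only records how the symbols are arranged (rows, superscripts, fences), whereas the content version uses the eq, matrix, and matrixrow elements from the content element list, so that a computer algebra program could in principle evaluate it.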

TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting 231 Michel Goossens and Sebastian Rahtz

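[Figure 5 diagram; the layout was lost in extraction. Recoverable labels: XML at the centre, together with DTD, XSL, DSSSL, and CSS; conversion paths xmltex (direct interpretation), X2L (via XSL), X2F (via XSL), X2F (Jade), X2H (via XSL), X2H (Jade), and L2H (LaTeX2HTML, TeX4ht); routes marked A), B), and C); formatters PassiveTeX (xmltex), JadeTeX, and LaTeX (macros, styles), with XSL FOs and TeX; output stages DVI, pdfTeX, DVIPDF, DVIPS, FOP, PS, PDF, and (X)HTML with CSS; direct visualization (techexplorer, WebEQ); other programs (FM, WP, MSW); targets: browsers, printers, multimedia, Web, books, audio-video, Internet.]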

Figure 5: XML as the central part of a document strategy for the Web


browsers). They indicate programs to transform existing LaTeX source documents into XML (using one or more standard DTDs) to store the information for archiving purposes. The vertical ellipse in the centre represents other editing tools, such as Adobe's FrameMaker [1] and Corel's WordPerfect [6], that allow or are expected to allow import/export of XML documents. Thus, XML genuinely becomes the central element in a global strategy for managing electronic documents by allowing information to be stored, saved, shared, and used by different applications on all computer platforms. PassiveTeX can be truly considered as a much-needed complement to XML for bringing typographic excellence to the Web, just as Knuth with TeX introduced the same excellence to the emerging electronic printing industry a generation ago.

References

[1] Adobe. FrameMaker 6.0. http://www.adobe.com/products/framemaker.
[2] Apache XML Project. FOP, XSL Formatting Object Processor in Java. http://xml.apache.org/fop/
[3] Lou Burnard and C.M. Sperberg-McQueen. TEI Guidelines for Electronic Text Encoding and Interchange. http://etext.lib.virginia.edu/tei.html
[4] David Carlisle. xmltex: A non validating (and not 100% conforming) namespace aware XML parser implemented in TeX. Can be downloaded from CTAN in the directory macros/xmltex/.
[5] James Clark. xt, an implementation in Java of XSL Transformations. http://www.jclark.com/xml/xt.html
[6] Corel. WordPerfect Office 2000. http://www.corel.com/Office2000 (Microsoft Windows) and http://linux.corel.com/products/wpo2000_linux (Linux).
[7] Michel Goossens and Sebastian Rahtz. The LaTeX Web Companion. Addison-Wesley, Reading, 1999.
[8] Yannis Haralambous and John Plaice. The latest developments in Omega. TUGboat, 17 (2), pages 181-183, June 1996. (See also http://www.gutenberg.eu.org/omega/).
[9] John Hobby. A user's manual for MetaPost. Computer Science Technical Report 162, AT&T Bell Laboratories, 1992.
[10] International Organization for Standardization. Information Technology -- Processing Languages -- Document Style Semantics and Specification Language (DSSSL). First edition, 1996. International Standard ISO/IEC 10179:1996.
[11] Sebastian Rahtz. PassiveTeX. http://users.ox.ac.uk/~rahtz/passivetex/
[12] Sebastian Rahtz. XSL stylesheets for TEI XML. http://users.ox.ac.uk/~rahtz/tei/
[13] The Unicode Consortium. The Unicode Standard, Version 3.0. Addison-Wesley, Reading, 2000.
[14] Norman Walsh and Leonard Muellner. DocBook. The Definitive Guide. O'Reilly & Associates, Inc., Sebastopol, USA, 1999. See also http://nwalsh.com/docbook/index.html.
[15] Norman Walsh. DocBook XSL Stylesheets. http://nwalsh.com/docbook/xsl/index.html
[16] World Wide Web Consortium. Håkon Wium Lie, Bert Bos, Chris Lilley and Ian Jacobs (editors). Cascading Style Sheets, level 2. http://www.w3.org/TR/REC-CSS2.
[17] World Wide Web Consortium. Patrick Ion and Robert Miner (editors). Mathematical Markup Language (MathML[tm]) 1.01 Specification. http://www.w3.org/TR/REC-MathML/.
[18] World Wide Web Consortium, David C. Fallside (editor). XML Schema Part 0: Primer (W3C Working Draft). http://www.w3.org/TR/xmlschema-0.
[19] World Wide Web Consortium, Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn (editors). XML Schema Part 1: Structures (W3C Working Draft). http://www.w3.org/TR/xmlschema-1.
[20] World Wide Web Consortium, Paul V. Biron, Ashok Malhotra (editors). XML Schema Part 2: Datatypes (W3C Working Draft). http://www.w3.org/TR/xmlschema-2.
[21] World Wide Web Consortium, Jon Ferraiolo (editor). Scalable Vector Graphics (SVG) 1.0 Specification (W3C Working Draft). http://www.w3.org/TR/SVG.
[22] World Wide Web Consortium. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen (editors). Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/REC-xml. An annotated version of the specification is at http://www.xml.com/axml/axml.html.
[23] World Wide Web Consortium, James Clark (editor). XSL Transformations (XSLT), Version 1.0 (W3C Recommendation 16 November 1999). http://www.w3.org/TR/
[24] World Wide Web Consortium, Stephen Deach (editor). Extensible Stylesheet Language (XSL), Version 1.0 (W3C Working Draft). http://www.w3.org/TR/WD-xsl.

Michel Goossens Sebastian Rahtz
