PassiveTEX: from XML to PDF
Michel Goossens CERN, IT Division, CH-1211 Gen`eve 23, Switzerland [email protected]
Sebastian Rahtz Oxford University Computing Services, Oxford, United Kingdom [email protected]
Abstract
This article introduces PassiveTEX, a library of TEX macros based on xmltex, that processes XML documents containing XSL formatting objects and generates PDF or DVI output. We show examples of typesetting XML sources marked up using the TEI, DocBook, and MathML DTDs.
Introduction The present article was prepared in XML us- ing TEI markup. It was not actually typeset with New information becomes available on the Inter- PassiveT X, but transformed into LAT Xwithan net every day, and increasingly material is becoming E E XSLT transformation stylesheet and typeset using available in XML. However, when one wants to print the ltugproc LAT Xclassfile. the information on the screen one is faced with sev- E eral shortcomings, the more important being the low typographic quality of the result or the problem of XSL formatting objects how to serialize a ‘tree’ of pages onto a linear output medium. In this article we explain how a source doc- The Extensible Stylesheet Language (XSL)isalan- ument marked up in XML can be typeset nicely by guage for describing page designs. For any given a two-step procedure that combines a transforma- class of arbitrarily structured XML documents or tion of the XML source into XSL formatting objects data files, an XSL stylesheet allows you to express and the direct interpretation of these XML objects how the content contained in the XML should be with Sebastian Rahtz’ PassiveTEX, a variant of TEX presented. The stylesheet indicates how source ele- based on David Carlisle’s xmltex [4]. ments should be styled, laid out, and paginated in The choice of TEX as the typesetting engine is some presentation form. In this article we only ad- justified by the fact that TEX handles mathematics dress issues related to high quality typesetting (using in a natural way, can manage several languages si- TEX). Our XML examples can be transformed into multaneously (hyphenation, typographic rules, mul- other representations (see [7] for more discussion), tiple encodings and alphabets, right-left typesetting, in particular HTML,andXSL stylesheets targeting etc.), and has a good model for marking up com- HTML are available for both the TEI [12] and Doc- plex tabular information. Moreover, TEX can easily Book [15] markup schemes. handle very long documents, such as large software An XSL processor reads an XML source docu- manuals, or bills for hundreds of thousands of tele- ment together with an XSL stylesheet, and produces phone subscribers. the presentation of the given XML content according In the first part of this article we use as an ex- to the instructions in the stylesheet. The prepara- ample some literary texts, including poems by the tion of the presentation form proceeds in two steps. French poet Paul Verlaine and the Russian poet First, a result tree is constructed from the XML in- Alexander Blok, marked up according to the TEI put source tree (tree transformation step). Second, DTD (Text Encoding Initiative [3], Lite version). In the result tree is interpreted to produce the format- the second part we show an example of a more tech- ted results (formatting step), that can be presented nical document, including a couple of simple math on output media such as a computer screen, a WAP formulae, using the DocBook [14] and MathML [17] display, on paper, in speech, etc. DTDs. The result tree can be very different from the structure of the source tree, since parts from the
222 TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting PassiveTEX: from XML to PDF
input can be deleted from the result tree, while sup- There are, however, some drawbacks associated plementary information (e.g., table of contents) can with using TEX as typesetting engine. Firstly, we are be generated. The tree transformation step also has constrained to use TEX’s page makeup model, and to add the information necessary to format the re- have to force the XSL formatting objects to fit that sult tree. model, which is not always straightforward. More- Thenodesoftheresulttreeareformatting ob- over, since PassiveTEX is actually layered over LATEX jects, whose semantics are expressed in terms of a it is too easy to allow things to fall through and take catalog of formatting object classes defined in the LATEXdefaults. XSL Specification [24]. They denote typographic ab- On a more practical level, TEX macro writing stractions such as page, paragraph, table, list, etc. is obscure and difficult and thus the system is not Fine control over the presentation of these abstrac- transparent for most programmers. And, last but tions is provided by a set of formatting properties, not least, TEX is large and monolithic, and unsuited such as those controlling alignments, color, fonts, to embedding in other applications. spacing, writing-mode, hyphenation etc. XSL offers Users of the system with a TEX background a rich range of formatting objects as one of its aims is can find it confusing to understand that TEXno to cover the semantic functionality of both the Doc- longer does all the work, which is now split betwen ument Style Semantics and Specification Language the stylesheet and the formatter; the four important (DSSSL) [10] and Cascading Stylesheets (CSS)[16] points to remember are: models as completely as possible. • No explicit use is made of LATEX’s high-level constructs, in particular there are no sections, PassiveTEX: implement XSL using TEX lists, cross-references, bibliographies, etc. In the previous section we explained how an input • all vertical and horizontal space is explicit in XML document can be transformed into a new XML the specification; document containing a result tree of XSL format- • page and line breaking is left to TEX: the rest ting objects. How do we go about rendering those is up to the stylesheet; objects with a real typesetting engine, and getting • XSL formatting objects use XML syntax, so that high-quality printout? One way is to use TEXit- the underlying character set is Unicode. By de- self. PassiveTEX is a library of TEXmacroswhich fault, entities are mapped to their Unicode po- can typeset the XSL formatting objects. In particu- sition. lar, PassiveT X combined with the pdfT Xvariant E E PassiveTEX will switch to using the Unicode- of T X generates high-quality PDF files in a single E based TEX variant soon (Omega [8]), to handle non- operation. PassiveTEX allows one to choose TEXas Latin material more naturally. the formatter of choice for XML, while hiding the Two extensions are needed for practical use. details of its operation from the user. The first is support of MathML; this is largely in PassiveTEX derives from and builds on xmltex place (it is simply passed through as-is by an XSL [4], a TEX package by David Carlisle, providing the stylesheet), but needs some tuning. In particu- core XML parser and UTF8 handler. Ideas and TEX lar, the intricacies of equation numbers remain to code are also inherited from earlier work by Sebas- be dealt with properly. Second, we need to sup- tian Rahtz on his JadeTEX package, which imple- port Scalable Vector Graphics (SVG)[21]XML code mented the output of the Jade DSSSL processor’s somehow (e.g., by direct interpretation, translation TEX backend, and already used a catalogue of Uni- into MetaPost [9], or pre-processing). Moreover, code/TEX mappings. SVG fragments need to be recognized directly to per- form in-line graphical functions (e.g., setting text at Issues in using TEX Since PassiveTEXisbased an angle). No work at all has been done on this. on TEX one can rapidly develop and test new high level code, knowing that TEX will take care of Support for XSL formatting objects Not all the typesetting details. Moreover, TEX is a well- parts of the XSL specification are fully supported. understood, robust, and free page formatter, where The XSL formatting objects which are implemented good support for fonts, graphics inclusion, hyper- fairly well cover the page specifications, blocks, in- links, etc., is already available. One can also profit line sequences, lists, graphics inclusion, floats, font from TEX’s mature handling of language issues, in- properties, and links. Not so well implemented at cluding hyphenation, and its high-quality math ren- present are all the table properties, and object mar- dering. Moreover, pdfTEX allows direct generation gins, borders, and padding. In order to properly im- of high-quality PDF. plement these, we have to put every paragraph into
TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting 223 Michel Goossens and Sebastian Rahtz
an explicit TEX box ‘just in case’, and this becomes 12 {\mathrm{#1}} slow and clumsy. Tables, too, are treated rather dif- 13 ferently in XSL, with many properties assigned at 14 \XMLelement{m:msqrt} 15 {} the cell level. 16 {\xmlgrab} Parts of the specification that are not yet im- 17 {\sqrt{#1}} plemented at all include bidi handling, backgrounds, Lines 1 and 2 declare the prefix m to be used for re- aural properties, and a large number of assorted ferring to elements in the given namespace. Lines properties. 4 to 7 declare that the start and end tag of a math AsmallexampleThe set of XSL stylesheets for element are transformed into the opening and clos- the TEI DTD (tei-fo [12]) specify how the various ing of an equation environment. Lines 9 to 12 de- XML elements can be transformed into XSL format- fine how a MathML mn (number) element is typeset. ting objects (see Section 7.6.6 of [7] for a more de- Its contents are stored (grabbed, see line 11) and tailed discussion). Without further comments, the then injected as parameter of a \mathrm to be type- following code example shows the transformations set in roman. Similarly, lines 14 to 17 show how asquareroot(msqrt) element and its content are for a paragraph (p element, lines 1 to 6) and em- A phasized material (emph element, lines 7 to 11). transformed into LTEX’s \sqrt and its argument. To fine-tune the output, xmltex supports pro- 1
224 TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting PassiveTEX: from XML to PDF
23 Below we show part of the beginning of the 24 XML file fotex.xml that corresponds to the first 25
TUGboat, Volume 21 (2000), No. 3 — Proceedings of the 2000 Annual Meeting 225
Michel Goossens and Sebastian Rahtz