Generating Multiple Outputs from Ω John Plaice, Yannis Haralambous
To cite this version:
John Plaice, Yannis Haralambous. Generating Multiple Outputs from Ω. TUGboat, TeX Users Group, 2003, Proceedings of EuroTeX 2003, 24 (3), pp.512-518. hal-02112933
HAL Id: hal-02112933 https://hal.archives-ouvertes.fr/hal-02112933 Submitted on 27 Apr 2019
John Plaice School of Computer Science and Engineering The University of New South Wales UNSW SYDNEY NSW 2052, Australia [email protected] http://www.cse.unsw.edu.au/~plaice
Yannis Haralambous Département Informatique École Nationale Supérieure des Télécommunications de Bretagne CS 83 818, 29238 Brest Cédex, France [email protected] http://omega.enstb.org/yannis
Abstract
In this paper, we describe how to generate multiple outputs (DVI, PostScript, PDF, XML, ...) from the same Ω document. The Ω engine is augmented with a library for manipulating multidimensional contexts. Each macro can be defined in multiple versions, and macros can thereby adapt to differing contexts. Macros can be specialized for several different output formats, without changing the overall structure. As a result, the same document can be used to easily produce different output formats, with appropriate specializations for each of them, without having to make any changes to the document itself.
Résumé
In this article we describe the process of generating multiple outputs (DVI, PostScript, PDF, XML, ...) from the same Ω document. The Ω engine has been equipped with a library of subroutines dedicated to manipulating multidimensional contexts. TeX macros can be specialized according to the output format without changing their overall structure. Thus the same document can, without the slightest modification, easily produce different output formats with the appropriate specializations.
Introduction
We present in this paper a new approach to generating typeset and structural material from Ω in a number of different output formats. This approach generalizes the existing approaches of DVI postprocessors capable of interpreting DVI \special's, specialized modifications to the typesetting engine, judicious use of alternate versions of macros, and external interpreters of subsets of LaTeX.

Key to this new approach is the introduction in Ω of versioned macros and versioned ΩTPs that can adapt their behavior to a dynamically running tree-structured context that permeates the entire typesetting process. As a result, when a text is to be typeset for a new output format, new versions of macros can be written at any level, without changing the existing macros, thereby minimizing the amount of additional work to be undertaken.

Versioned macros and ΩTPs have ramifications well beyond the structural issues involved in generating material for different output formats: versioning the typesetting process also provides a high-level interface for multilingual typesetting, an issue that has hindered the development of the Ω system since its inception. See the paper presented at TUG 2003, with Chris Rowley [3], for a detailed discussion.

However, it is not sufficient simply to be able to generate different versions of macros and ΩTPs; the TeX document model is very simple, and the one-pass document manipulation approach, analogous to the Pascal language in which it was written, built into the software acts like a straitjacket when one wishes to pass as input or to generate as output significantly different document structures.

Therefore, at least three additional components need to be added to Ω in order for it to be fully adaptable to different formats. First is the ability to directly apply ΩTPs and other filters to the input stream, even before, and possibly bypassing, the macro processing stage.
512 TUGboat, Volume 24 (2003), No. 3 — Proceedings of EuroTEX 2003 Generating Multiple Outputs from Ω
Second is the ability to directly apply ΩTPs to the output stream, possibly without even generating DVI output. Third is to supply general hooks that allow the user to manipulate internal document structure, and not simply horizontal and vertical boxes.

In this article, we present the work that we have undertaken towards these goals. We begin with a brief background, describing what we consider to be the main contributions of existing extensions of the general TeX framework (not just to the TeX engine), and show how these different approaches all contribute to a better understanding of the general problem of generating different outputs from the same files.

The model for contexts that we have adopted was developed by Paul Swoboda in his PhD thesis [7]. It is the most highly developed presentation of intensional versioning, an approach to the development of software variants first proposed by the first author and William W. Wadge [6]. We give a discussion of intensional programming and versioning, then give a detailed presentation of contexts and context operators.

We then show how contexts have been integrated into Ω. To do this, new Ω primitives are introduced for creating and using different versions of macros, and for changing and manipulating the runtime context. In addition, means for having versions of internal and external ΩTPs are defined.

These technical sections are followed by a discussion of how the internals of the Ω engine should be reorganized to facilitate the generation of multiple outputs.

TeX and its Extensions
The TeX document model supposes that a stream of text, interspersed with control sequences, is to be transformed into a series of pages, each of which is a vertical box that contains other boxes, either vertical or horizontal. Each page is generated into DVI output, in the process losing some of the information contained in the page.

The boxes are generated on the fly. Although certain items can be stored for later use in registers, accessed much as one would in assembly programming, TeX's document model essentially consists of the following processes:
• transforming streams of characters into streams of typeset glyphs and boxes (main loop);
• building math lists from TeX math, then transforming the lists into streams of typeset glyphs (math mode);
• transforming streams of typeset glyphs, with inserted hyphenation points, into streams of horizontal boxes, corresponding to lines (paragrapher);
• building boxes, corresponding to tables, from alignment specifications;
• building pages from streams of boxes and glue (page builder).

TeX's operation is undertaken in one pass, and it is very difficult, if not impossible, to manipulate intermediate data structures as they are being built.

The different extensions to TeX and the different DVI postprocessors have all taken different approaches, which is quite normal given their divergent aims.

First are the DVI postprocessors, dvips (generating PostScript) and dvipdfm (generating PDF). Each of these programs transforms DVI output, augmented with DVI \special's specifically designed for use with that program and generated by TeX through its macro mechanism, into the relevant output format.

The main advantage of this approach is that it encourages modularity, in the sense that the typesetter is separate from the pretty-printer. However, one can only put into \special's information that is made available to the user. Information about intermediate data structures is not directly available, so it can only be approximated.

Second is the LaTeX2HTML approach. This tool does not do typesetting; rather, it reorganizes the structure of the text into HTML. It does not use the TeX engine, but itself parses a large (reasonable) subset of LaTeX. For parts that cannot be directly translated into HTML, such as mathematics, it generates small LaTeX files, calls LaTeX, then dvips, then transforms them into PNG files. Although LaTeX2HTML is a useful tool, in its current form it will never have access to TeX's internal data structures, since it never calls TeX.

Third, also for generating HTML, is TeX4ht, which produces HTML files that resemble DVI pages generated by TeX. TeX4ht is also standalone, but it does use TeX for parsing and typesetting the input. It makes use of extensive DVI \special's.

Fourth are the extensions to the TeX engine, namely e-TeX, pdfTeX, and Ω. The e-TeX extensions focus mainly on improving the macro expansion facilities. They do not change the typesetting, but do provide the very useful ability to reparse an input sequence.

The pdfTeX extensions are two-fold. First is some experimental work simulating some of Peter Karow's hz program. Second, more commonly used, are the extensions to generate PDF directly rather than DVI. In addition to its new pretty-printer for TeX pages, pdfTeX provides built-in mechanisms, using whatsit nodes, for generating such things as PDF forms and margin items.

Although pdfTeX is practical, in the sense that one can quickly generate PDF files from a TeX file, the fact that all of the functionality is hard-coded limits the possibility for extending the same system. For example, the current pdfTeX does not allow EPS files to be included in the PDF files that it generates.

The Ω extensions are of a more general nature.
In the Ω model, before glyph selection, the character stream to be typeset is segmented and processed by a series of filters, each reading from standard input and writing to standard output. Once all of the filters are applied, the stream is passed to the standard TeX character-level typesetter.

In addition, Ω also includes an experimental pretty-printer for MathML and XML. One of the goals of this work is to provide the means for recovering structure, particularly in mathematics expressions, in TeX and LaTeX files that do not have perfect markup. However,

A context is a specific point in a multidimensional space, i.e., given a dimension, the context will return a value for that dimension. The simplest contexts are dictionaries (lists of attribute-value pairs). A natural generalization is what will be used in this paper: the values themselves can be contexts, resulting in a tree-structured context. The set of contexts is furnished with a partial order ⊑ called a refinement relation.

For example, to describe Australian English, we could use the context:
is given by the following syntax:

\[ C ::= \bot \mid A \mid \Omega \mid \langle B; L \rangle \tag{1} \]
\[ B ::= \epsilon \mid \alpha \mid \omega \mid v \tag{2} \]
\[ L ::= \emptyset \mid d{:}C + L \tag{3} \]

where d, v ∈ S.

There are three special contexts:
• ⊥ is the empty context (also called vanilla);
• A is the minimally defined context, just more defined than the empty one;
• Ω is the maximally defined context, more defined than all other contexts.

The normal case is that there is a base value B, along with a context list (L for short), which is a set of dimension-context pairs. We write δL for the set of dimensions of L.

A sequence of dimensions is called a compound dimension. It can be used as a path into a context. Formally:

\[ D ::= \cdot \mid d{:}D \tag{4} \]

If C is a context, C(D) is the subtree of C whose root is reached by following the path D from the root of C:

\[ C(\cdot) = C \tag{5} \]
\[ \langle B; d{:}C' + L \rangle (d{:}D) = C'(D) \tag{6} \]

As with contexts, there are three special base values:
• ε is the empty base value;
• α is the minimally defined base value, just more defined than the empty base value;
• ω is the maximally defined base value, more defined than all others.

The normal case is that a base value is simply a scalar.

To the set C, we add an equivalence relation ≡ and a refinement relation ⊑. We begin with the equivalence relation:

\[ \bot \equiv \langle \epsilon; \emptyset \rangle \tag{7} \]
\[ A \equiv \langle \alpha; \emptyset \rangle \tag{8} \]
\[ \Omega \equiv \Bigl\langle \omega; \sum_{d \in S} d{:}\Omega \Bigr\rangle \tag{9} \]
\[ \frac{L_0 \equiv_L L_1}{\langle B; L_0 \rangle \equiv \langle B; L_1 \rangle} \tag{10} \]

Thus, ⊥ and A are notational conveniences, while Ω cannot be reduced. The normal case supposes an equivalence relation ≡_L over context lists:

\[ \emptyset \equiv_L d{:}\bot \tag{11} \]
\[ d{:}\langle B; L + L' \rangle \equiv_L d{:}\bigl( \langle B; L \rangle + \langle B; L' \rangle \bigr) \tag{12} \]
\[ L \equiv_L \emptyset + L \tag{13} \]
\[ L \equiv_L L + L \tag{14} \]
\[ L + L' \equiv_L L' + L \tag{15} \]
\[ L + (L' + L'') \equiv_L (L + L') + L'' \tag{16} \]

The + operator is idempotent, commutative, and associative. Now we can define the partial order over entire contexts:

\[ \bot \sqsubseteq C \tag{17} \]
\[ C \sqsubseteq \Omega \tag{18} \]
\[ \frac{C \neq \bot}{A \sqsubseteq C} \tag{19} \]
\[ \frac{C_0 \equiv C_1}{C_0 \sqsubseteq C_1} \tag{20} \]
\[ \frac{B_0 \sqsubseteq_B B_1 \qquad L_0 \sqsubseteq_L L_1}{\langle B_0; L_0 \rangle \sqsubseteq \langle B_1; L_1 \rangle} \tag{21} \]

which supposes a partial order ⊑_B over base values:

\[ \epsilon \sqsubseteq_B B \tag{22} \]
\[ B \sqsubseteq_B B \tag{23} \]
\[ B \sqsubseteq_B \omega \tag{24} \]
\[ \frac{B \neq \epsilon}{\alpha \sqsubseteq_B B} \tag{25} \]
\[ \frac{v_0, v_1 \in S_i \qquad v_0 \sqsubseteq_i v_1}{v_0 \sqsubseteq_B v_1} \tag{26} \]

The last rule states that if v_0 and v_1 belong to the same set S_i and are comparable according to the partial order ⊑_i, then that order is subsumed for refinement purposes.

The partial order over contexts also supposes a partial order ⊑_L over context lists:

\[ \emptyset \sqsubseteq_L L \tag{27} \]
\[ \frac{L_0 \equiv_L L_1}{L_0 \sqsubseteq_L L_1} \tag{28} \]
\[ \frac{C_0 \sqsubseteq C_1}{d{:}C_0 \sqsubseteq_L d{:}C_1} \tag{29} \]
\[ \frac{L_0 \sqsubseteq_L L_1 \qquad L_0' \sqsubseteq_L L_1'}{L_0 + L_0' \sqsubseteq_L L_1 + L_1'} \tag{30} \]

Rule 30 ensures that the + operator defines the least upper bound of two context lists.

Context and Version Domains. When doing intensional programming, we work with sets of contexts, called context domains, written $\mathcal{C}$. There is one operation on context domains, namely the best-fit. Given a context domain $\mathcal{C}$ of existing contexts and a requested context $C_{req}$, the best-fit context is defined by:

\[ \mathrm{best}(\mathcal{C}, C_{req}) = \max \{ C \in \mathcal{C} \mid C \sqsubseteq C_{req} \} \tag{31} \]

If the maximum does not exist, there is no best-fit context.

Typically, we will be versioning something, an object of some type. This is done using versions, which are simply (C, object) pairs. Version domains $\mathcal{V}$ then become functions mapping contexts to objects. The best-fit object in a version domain is given by:

\[ \mathrm{best}_O(\mathcal{V}, C_{req}) = \mathcal{V}(\mathrm{best}(\mathrm{dom}\, \mathcal{V}, C_{req})) \tag{32} \]
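The best-fit definitions (31) and (32) can be illustrated with a small Python sketch. It is not the Ω library: it assumes a deliberately simplified refinement test in which a base value is either empty (`None`) or a plain string compared by equality, the special values A, Ω, α, ω and the user-defined orders ⊑_i are left out, and a missing dimension is treated as ⊥ in the spirit of rule (11). The names `Context`, `subtree`, `refines`, `best`, `best_object` and the dimensions `output` and `lang` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Context:
    """A context <B; L>: a base value plus a dimension -> subcontext mapping."""
    base: Optional[str] = None  # None stands for the empty base value
    dims: Dict[str, "Context"] = field(default_factory=dict)


BOTTOM = Context()  # the empty ("vanilla") context


def subtree(ctx: Context, path: Tuple[str, ...]) -> Context:
    """C(D): follow a compound dimension; a missing dimension yields BOTTOM."""
    for d in path:
        ctx = ctx.dims.get(d, BOTTOM)
    return ctx


def refines(c0: Context, c1: Context) -> bool:
    """Simplified test for c0 ⊑ c1 (assumption: bases compare by equality)."""
    if c0.base is not None and c0.base != c1.base:
        return False
    return all(refines(sub, c1.dims.get(d, BOTTOM)) for d, sub in c0.dims.items())


def best(domain: List[Context], requested: Context) -> Optional[Context]:
    """Best-fit context: the maximum of {C in domain | C ⊑ requested}, if any."""
    candidates = [c for c in domain if refines(c, requested)]
    for m in candidates:
        if all(refines(c, m) for c in candidates):
            return m
    return None  # no maximum, hence no best-fit context


def best_object(versions: List[Tuple[Context, str]], requested: Context) -> Optional[str]:
    """Best-fit object of a version domain given as (context, object) pairs."""
    chosen = best([c for c, _ in versions], requested)
    if chosen is None:
        return None
    return next(obj for c, obj in versions if c is chosen)


# Hypothetical version domain: a generic version and a PDF-specific one.
pdf = Context(dims={"output": Context("pdf")})
english = Context(dims={"lang": Context("en")})
versions = [(BOTTOM, "generic"), (pdf, "pdf-specific")]

request = Context(dims={"output": Context("pdf"), "lang": Context("en")})
xml_request = Context(dims={"output": Context("xml")})
ambiguous = versions + [(english, "english-specific")]
```

With these definitions, `best_object(versions, request)` selects the PDF-specific version, `best_object(versions, xml_request)` falls back to the generic one, and `best_object(ambiguous, request)` returns `None` because the PDF and English versions are incomparable, so the candidate set has no maximum.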
Context Operators. Context operators allow one to selectively modify contexts. Their syntax is similar to that of contexts.

\[ \mathit{Cop} ::= C \mid [\mathit{Pop}; \mathit{Bop}; \mathit{Lop}] \tag{33} \]
\[ \mathit{Pop} ::= -- \mid \epsilon \tag{34} \]

d:(∅ Cop) + (L Lop), d ∉ δL

Context operators can also be applied to context operators. There are two cases:

\[ [\mathit{Pop}; \mathit{Bop}_0; \mathit{Lop}_0]\; [\epsilon; \mathit{Bop}_1; \mathit{Lop}_1] = \bigl[ \mathit{Pop};\ (\mathit{Bop}_0\; \mathit{Bop}_1);\ (\mathit{Lop}_0\; \mathit{Lop}_1) \bigr] \tag{46} \]
\[ [\mathit{Pop}; \mathit{Bop}_0; \mathit{Lop}_0]\; [--; \mathit{Bop}_1; \mathit{Lop}_1] = \bigl[ --;\ (\mathit{Bop}_0\; \mathit{Bop}_1);\ (\mathit{Lop}_0 \setminus (\delta \mathit{Lop}_0 - \delta \mathit{Lop}_1))\; \mathit{Lop}_1 \bigr] \tag{47} \]

Now that we have given the formal syntax and semantics of contexts, version domains, and context operations, we can move on to typesetting.

The Running Context in Ω
As is standard, the abstract syntax is simpler than the concrete syntax, which offers richer possibilities to facilitate entry. Here is the concrete syntax for contexts:

    C ::= <>    Empty context
        | ~~    Minimum context
        | ^^    Maximum context
        |

This context is initialized at the beginning of an Ω run with the values of environment variables and command-line parameters. Once it is set, it can be changed as follows:

    \contextset{Cop}

Adapting to the Context
During execution, there are three mechanisms for Ω to modify its behavior with respect to the current context: (1) versioned execution flow, (2) versioned macros, and (3) versioned ΩTPs.

Execution Flow. The new \contextchoice primitive is used to change the execution flow:

    \contextchoice{{Cop1}=>{exp1},
                   ...
                   {Copn}=>{expn}
                  }
Depending on the current context C, one of the expressions exp_i will be selected and expanded. The

    \vdef{Cop}\controlsequence args{definition}

If the current context is C, then this definition defines the C Cop version of \controlsequence. The scoping of definitions is the same as for TeX.

This approach is upwardly compatible with the TeX macro expansion process. The standard TeX definition:

    \def\controlsequence args{definition}

is simply equivalent to

    \vdef{<>}\controlsequence args{definition}

i.e., it defines the empty version of a control sequence.

As stated above, during expansion the best-fit definition of \controlsequence, with respect to the current context, will be expanded whenever it is encountered. It is also possible to expand a particular version of a control sequence, by using:

    \vexp{Cop}\controlsequence

ΩTPs and ΩTP-lists. Beyond the ability to manipulate larger data structures than does TeX, Ω allows the user to apply a series of filters to the input, each reading from standard input and writing to standard output. Each of the filters is called an ΩTP (Ω Translation Process), and a series of filters is called an ΩTP-list.

There are two kinds of ΩTPs: internal and external. Internal ΩTPs are finite state machines written in an Ω-specific language, which are compiled before being interpreted by the Ω engine. External ΩTPs are stand-alone programs, reading from standard input and writing to standard output, like Unix filters.

Internal and external ΩTPs handle context differently. For external ΩTPs, the context information can be passed on through an additional parameter to the system call invoking the external ΩTP:

    program -context=context

Internal ΩTPs have been modified so that every instruction can be preceded by a context tag. Using the simplest syntax, this becomes:

    <

previous sections clearly facilitate the development of software that is more flexible, in the sense that if new parameters are added to a system, new code only needs to be written for those parts affected directly by the new parameters.

But it is still not clear how these mechanisms will help solve this problem of having multiple input and output formats for use with the same typesetting system, in particular, inside Ω.

We outline the solution here, since at the time of writing, we have not yet finalized the syntax.

Essentially, all of Ω's internal data structures will be made directly accessible to the user. These include at least:
• streams of characters;
• streams of glyphs;
• math lists;
• input to each ΩTP application;
• output from each ΩTP application;
• paragraphs;
• tables;
• pages;
• other kinds of boxes.

For each of these data structures, a canonical serialization and deserialization will be specified. For each of the algorithms that can be applied to these data structures, a means for applying the algorithms from the user level will be defined as well.

As a result, it will be possible to completely manipulate what we might call the "canonical input" and the "canonical output" of the typesetter. Then inputting from different formats and outputting to different formats becomes much simpler. For input from a specific input format, an ΩTP-list must be defined to translate that input format into the "canonical input". For output to a specific output format, an ΩTP-list must be defined to translate from the "canonical output" to the output format. These ΩTP-lists may well be parameterized by the current context to achieve specific results.
This approach is suitable for generic content that is found in all input and output files, but what about elements that are relevant for only specific kinds of input or output? For these elements, it is the versioned macros that will come into play. Some macros will have versions defined to deal with these elements only when the right conditions occur in the context.

Conclusions
Given the successful experimental work in generating MathML and XML directly from the Ω engine, we consider that the approach presented in this paper is both elegant and doable. In fact, we consider that this approach makes it possible to integrate into a single framework most of the existing extensions to TeX, and its original view that a document should be transformed into a DVI file that is in turn converted to (for example) PostScript.

However, the implications go well beyond just integrating existing work, even if that goal is both laudable and desirable. New possibilities will arise, since the user will no longer be forced to use the TeX document model, in which a document is just a series of pages. Ω then becomes usable for typesetting small pieces of text at a time, say to generate EPS files on demand for the uses of other applications, such as online multilingual mapping tools or air traffic control systems.

For a piece of software to survive from one generation to the next, it must be able to adapt to the arising needs of coming times, and continually provide new possibilities. We believe that the approach presented here will ensure the long-term viability of Ω, hence of the TeX community.

References
[1] Donald Knuth. Computers and Typesetting. 5 volumes, Addison-Wesley, 1986.
[2] Omega Typesetting and Document Processing System. http://omega.enstb.org
[3] John Plaice, Yannis Haralambous and Chris Rowley. A multidimensional approach to typesetting. TUGboat 24(1), 2003, Proceedings of the TUG Annual Meeting, pp. 105–114.
[4] John Plaice and Joey Paquet. Introduction to intensional programming. In Intensional Programming I, World-Scientific, Singapore, 1996.
[5] John Plaice, Paul Swoboda and Ammar Alammar. Building intensional communities using shared contexts. In Distributed Communities on the Web, LNCS 1830:55–64, Springer-Verlag, 2000.
[6] John Plaice and William W. Wadge. A new approach to version control. IEEE-TSE 19(3):268–276, 1993.
[7] Paul Swoboda. A Formalization and Implementation of Distributed Intensional Programming. PhD thesis, University of New South Wales, Sydney, Australia, 2003.
[8] Extensible Markup Language (XML). http://www.w3c.org/XML
[9] Neomega Typesetting System. http://neomega.web.cse.unsw.edu.au.