Generating Multiple Outputs from Ω John Plaice, Yannis Haralambous

To cite this version:

John Plaice, Yannis Haralambous. Generating Multiple Outputs from Ω. TUGboat, TeX Users Group, 2003, Proceedings of EuroTeX 2003, 24 (3), pp. 512-518. hal-02112933

HAL Id: hal-02112933 https://hal.archives-ouvertes.fr/hal-02112933 Submitted on 27 Apr 2019


John Plaice
School of Computer Science and Engineering
The University of New South Wales
UNSW SYDNEY NSW 2052, Australia
http://www.cse.unsw.edu.au/~plaice

Yannis Haralambous
Département Informatique
École Nationale Supérieure des Télécommunications de Bretagne
CS 83 818, 29238 Brest Cédex, France
http://omega.enstb.org/yannis

Abstract

In this paper, we describe how to generate multiple outputs (DVI, PostScript, PDF, XML, ...) from the same Ω document. The Ω engine is augmented with a library for manipulating multidimensional contexts. Each macro can be defined in multiple versions, and macros can thereby adapt to differing contexts. Macros can be specialized for several different output formats, without changing the overall structure. As a result, the same document can be used to easily produce different output formats, with appropriate specializations for each of them, without having to make any changes to the document itself.

Résumé

In this article, we describe the process of generating multiple outputs (DVI, PostScript, PDF, XML, ...) from the same Ω document. The Ω engine has been equipped with a library of subroutines dedicated to manipulating multidimensional contexts. TeX macros can be specialized according to the output format without changing their overall structure. Thus the same document can, without the slightest modification, easily produce different output formats with the appropriate specializations.

Introduction

We present in this paper a new approach to generating typeset and structural material from Ω in a number of different output formats. This approach generalizes the existing approaches of DVI postprocessors capable of interpreting DVI \special's, specialized modifications to the typesetting engine, judicious use of alternate versions of macros, and external interpreters of subsets of LaTeX.

Key to this new approach is the introduction in Ω of versioned macros and versioned ΩTPs that can adapt their behavior to a dynamically running tree-structured context that permeates the entire typesetting process. As a result, when a text is to be typeset for a new output format, new versions of macros can be written at any level, without changing the existing macros, thereby minimizing the amount of additional work to be undertaken.

Versioned macros and ΩTPs have ramifications well beyond the structural issues involved in generating material for different output formats: versioning the typesetting process also provides a high-level interface for multilingual typesetting, an issue that has hindered the development of the Ω system since its inception. See the paper presented at TUG 2003, with Chris Rowley [3], for a detailed discussion.

However, it is not sufficient simply to be able to generate different versions of macros and ΩTPs; the TeX document model is very simple, and the one-pass document manipulation approach (analogous to the Pascal language in which it was written) built into the software acts like a straitjacket when one wishes to pass as input or to generate as output significantly different document structures.

Therefore, at least three additional components need to be added to Ω in order for it to be fully adaptable to different formats. First is the ability to directly apply ΩTPs and other filters to the input stream, even before, and possibly bypassing, the macro processing stage.

512 TUGboat, Volume 24 (2003), No. 3 — Proceedings of EuroTEX 2003 Generating Multiple Outputs from Ω

Second is the ability to directly apply ΩTPs to the output stream, possibly without even generating DVI output. Third is to supply general hooks that allow the user to manipulate internal document structure, and not simply horizontal and vertical boxes.

In this article, we present the work that we have undertaken towards these goals. We begin with a brief background, describing what we consider to be the main contributions of existing extensions of the general TeX framework (not just to the TeX engine), and show how these different approaches all contribute to a better understanding of the general problem of generating different outputs from the same files.

The model for contexts that we have adopted was developed by Paul Swoboda in his PhD thesis [7]. It is the most highly developed presentation of intensional versioning, an approach to the development of software variants first proposed by the first author and William W. Wadge [6]. We give a discussion of intensional programming and versioning, then give a detailed presentation of contexts and context operators.

We then show how contexts have been integrated into Ω. To do this, new Ω primitives are introduced for creating and using different versions of macros, and for changing and manipulating the runtime context. In addition, means for having versions of internal and external ΩTPs are defined.

These technical sections are followed by a discussion of how the internals of the Ω engine should be reorganized to facilitate the generation of multiple outputs.

TeX and its Extensions

The TeX document model supposes that a stream of text, interspersed with control sequences, is to be transformed into a series of pages, each of which is a vertical box that contains other boxes, either vertical or horizontal. Each page is generated into DVI output, in the process losing some of the information contained in the page.

The boxes are generated on the fly. Although certain items can be stored for later use in registers, accessed much as one would in assembly programming, TeX's document model essentially consists of the following processes:

• transforming streams of characters into streams of typeset glyphs and boxes (main loop);
• building math lists from TeX math, then transforming the lists into streams of typeset glyphs (math mode);
• transforming streams of typeset glyphs, with inserted hyphenation points, into streams of horizontal boxes, corresponding to lines (paragrapher);
• building boxes, corresponding to tables, from alignment specifications;
• building pages from streams of boxes and glue (page builder).

TeX's operation is undertaken in one pass, and it is very difficult, if not impossible, to manipulate intermediate data structures as they are being built.

The different extensions to TeX and the different DVI postprocessors have all taken different approaches, which is quite normal given their divergent aims.

First are the DVI postprocessors, dvips (generating PostScript) and dvipdfm (generating PDF). Each of these programs transforms DVI output, augmented with DVI \special's, specifically designed for use with that program and generated by TeX through its macro mechanism, into the relevant output format.

The main advantage of this approach is that it encourages modularity, in the sense that the typesetter is separate from the pretty-printer. However, one can only put into \special's information that is made available to the user. Information about intermediate data structures is not directly available, so it can only be approximated.

Second is the LaTeX2HTML approach. This tool does not do typesetting; rather, it reorganizes the structure of the text into HTML. It does not use the TeX engine, but itself parses a large (reasonable) subset of LaTeX. For parts that cannot be directly translated into HTML, such as mathematics, it generates small LaTeX files, calls LaTeX, then dvips, then transforms them into PNG files. Although LaTeX2HTML is a useful tool, in its current form it will never have access to TeX's internal data structures, since it never calls TeX.

Third, also for generating HTML, is TeX4ht, which produces HTML files that resemble DVI pages generated by TeX. TeX4ht is also standalone, but it does use TeX for parsing and typesetting the input. It makes use of extensive DVI \special's.

Fourth are the extensions to the TeX engine, namely e-TeX, pdfTeX, and Ω. The e-TeX extensions focus mainly on improving the macro expansion facilities. They do not change the typesetting, but do provide the very useful ability to reparse an input sequence.

The pdfTeX extensions are two-fold. First is some experimental work simulating some of Peter Karow's hz program. Second, more commonly used, are the extensions to generate PDF directly rather than DVI. In addition to its new pretty-printer for TeX pages, pdfTeX provides built-in mechanisms, using whatsit nodes, for generating such things as PDF forms and margin items.

Although pdfTeX is practical, in the sense that one can quickly generate PDF files from a TeX file, the fact that all of the functionality is hard-coded limits the possibility for extending the same system. For example, the current pdfTeX does not allow EPS files to be included in the PDF files that it generates.

The Ω extensions are of a more general nature.


In the Ω model, before glyph selection, the character stream to be typeset is segmented and processed by a series of filters, each reading from standard input and writing to standard output. Once all of the filters are applied, the stream is passed to the standard TeX character-level typesetter.

In addition, Ω also includes an experimental pretty-printer for MathML and XML. One of the goals of this work is to provide the means for recovering structure, particularly in mathematics expressions, in TeX and LaTeX files that do not have perfect markup. However, this functionality is not in any way integrated with the ΩTP mechanism.

From this discussion, one can start to elucidate what is needed. One should be able to have, in a single system:

• internal data structures corresponding to page, paragraph, line, etc., that can be explicitly manipulated by the user;
• input mechanisms not based on the macro language that generate these data structures;
• more advanced macro-processing and other programming languages to manipulate these data structures;
• mechanisms to pretty-print the data structures in multiple output formats;
• limiting the output-format-specific additions to the internals of the typesetting engine.

For this whole approach to work, it is important that as one moves from one mechanism to another, changing input or output formats and what is expected of the typesetting engine, the whole framework be sufficiently flexible that one does not have to completely reprogram everything. In other words, the system must adapt to a complex context with many parameters. The next few sections focus in detail on how to deal with context in Ω; they are followed by a discussion of the points raised in the above wishlist.

Intensional Programming

Intensional programming [4] is a form of computing that supposes that there is a multidimensional context, and that all programs are capable of adapting themselves to this context. The context is pervasive, and can simultaneously affect the behavior of a program at the lowest, highest and middle layers.

When an intensional program is running, there is a current context. This context is initialized upon launching the program from the values of environment variables, from explicit parameters, and possibly from active context servers. The current context can be modified during execution, either explicitly through the program's actions, or implicitly, through changes at an active context server.

A context is a specific point in a multidimensional space, i.e., given a dimension, the context will return a value for that dimension. The simplest contexts are dictionaries (lists of attribute-value pairs). A natural generalization is what will be used in this paper: the values themselves can be contexts, resulting in a tree-structured context. The set of contexts is furnished with a partial order ⊑ called a refinement relation.

For example, to describe Australian English, we could use the context:

    <script:<…> + lang:<… + dialect:<…>>>

where script and lang are called dimensions, and lang:dialect is called a compound dimension. See below for more details.

During execution, the current context can be queried, dimension by dimension, and the program can adapt its behavior accordingly. In addition, if the programming language supports it, then contextual conditional expressions and blocks can be defined, in which the most relevant case, with respect to the current context and according to the partial order, is chosen among the different possibilities.

In addition, any entity can be defined in multiple versions, which are mappings from contexts to objects. Whenever an identifier designating an entity appears in an expression or a statement, then the most relevant version of that entity, with respect to the current context, is chosen. This is called the variant substructure principle. The general approach is called intensional versioning [6].

The ISE programming language [5] was the first language combining both intensional programming and versioning. It is based on the procedural scripting language Perl, and it has greatly facilitated the creation of multidimensional Web pages. Similar experimental work has been undertaken under the supervision of the first author with C, C++, Java, and Eiffel. And, when combined with a context server (see Paul Swoboda's PhD thesis [7]), it becomes possible for several documents or programs to be immersed in the same context.

Structuring the Context

We use the same notation to designate contexts and versions of entities. This section has three subsections. First, we define contexts and the refinement relation. Then, we define version domains, which hold versioned entities. Finally, we define context operators, which are used to change from context to context. In the following section, we will show how all of these are to be used.

Contexts and Refinement. Let $\{(S_i, \sqsubseteq_i)\}_i$ be a collection of sets of ground values, each with its own partial order. Let $S = \bigcup_i S_i$. Then the set of contexts $\mathbf{C}$ ($\ni C$) over $S$

is given by the following syntax:

\[ C ::= \bot \mid A \mid \Omega \mid \langle B; L \rangle \tag{1} \]
\[ B ::= \epsilon \mid \alpha \mid \omega \mid v \tag{2} \]
\[ L ::= \emptyset \mid d{:}C + L \tag{3} \]

where $d, v \in S$.

There are three special contexts:

• $\bot$ is the empty context (also called vanilla);
• $A$ is the minimally defined context, just more defined than the empty one;
• $\Omega$ is the maximally defined context, more defined than all other contexts.

The normal case is that there is a base value $B$, along with a context list ($L$ for short), which is a set of dimension-context pairs. We write $\delta L$ for the set of dimensions of $L$.

A sequence of dimensions is called a compound dimension. It can be used as a path into a context. Formally:

\[ D = \cdot \mid d{:}D \tag{4} \]

If $C$ is a context, $C(D)$ is the subtree of $C$ whose root is reached by following the path $D$ from the root of $C$:

\[ C(\cdot) = C \tag{5} \]
\[ \langle B; d{:}C' + L \rangle (d{:}D) = C'(D) \tag{6} \]

As with contexts, there are three special base values:

• $\epsilon$ is the empty base value;
• $\alpha$ is the minimally defined base value, just more defined than the empty base value;
• $\omega$ is the maximally defined base value, more defined than all others.

The normal case is that a base value is simply a scalar.

To the set $\mathbf{C}$, we add an equivalence relation $\equiv$, and a refinement relation $\sqsubseteq$. We begin with the equivalence relation:

\[ \bot \equiv \langle \epsilon; \emptyset \rangle \tag{7} \]
\[ A \equiv \langle \alpha; \emptyset \rangle \tag{8} \]
\[ \Omega \equiv \Bigl\langle \omega; \sum_{d \in S} d{:}\Omega \Bigr\rangle \tag{9} \]
\[ \frac{L_0 \equiv_L L_1}{\langle B; L_0 \rangle \sqsubseteq \langle B; L_1 \rangle} \tag{10} \]

Thus, $\bot$ and $A$ are notational conveniences, while $\Omega$ cannot be reduced. The normal case supposes an equivalence relation $\equiv_L$ over context lists:

\[ \emptyset \equiv_L d{:}\bot \tag{11} \]
\[ d{:}\langle B; L + L' \rangle \equiv_L d{:}\bigl( \langle B; L \rangle + \langle B; L' \rangle \bigr) \tag{12} \]
\[ L \equiv_L \emptyset + L \tag{13} \]
\[ L \equiv_L L + L \tag{14} \]
\[ L + L' \equiv_L L' + L \tag{15} \]
\[ L + (L' + L'') \equiv_L (L + L') + L'' \tag{16} \]

The $+$ operator is idempotent, commutative, and associative. Now we can define the partial order over entire contexts:

\[ \bot \sqsubseteq C \tag{17} \]
\[ C \sqsubseteq \Omega \tag{18} \]
\[ \frac{C \neq \bot}{A \sqsubseteq C} \tag{19} \]
\[ \frac{C_0 \equiv C_1}{C_0 \sqsubseteq C_1} \tag{20} \]
\[ \frac{B_0 \sqsubseteq_B B_1 \qquad L_0 \sqsubseteq_L L_1}{\langle B_0; L_0 \rangle \sqsubseteq \langle B_1; L_1 \rangle} \tag{21} \]

which supposes a partial order $\sqsubseteq_B$ over base values:

\[ \epsilon \sqsubseteq_B B \tag{22} \]
\[ B \sqsubseteq_B B \tag{23} \]
\[ B \sqsubseteq_B \omega \tag{24} \]
\[ \frac{B \neq \epsilon}{\alpha \sqsubseteq_B B} \tag{25} \]
\[ \frac{v_0, v_1 \in S_i \qquad v_0 \sqsubseteq_i v_1}{v_0 \sqsubseteq_B v_1} \tag{26} \]

The last rule states that if $v_0$ and $v_1$ belong to the same set $S_i$ and are comparable according to the partial order $\sqsubseteq_i$, then that order is subsumed for refinement purposes.

The partial order over contexts also supposes a partial order $\sqsubseteq_L$ over context lists:

\[ \emptyset \sqsubseteq_L L \tag{27} \]
\[ \frac{L_0 \equiv_L L_1}{L_0 \sqsubseteq_L L_1} \tag{28} \]
\[ \frac{C_0 \sqsubseteq C_1}{d{:}C_0 \sqsubseteq_L d{:}C_1} \tag{29} \]
\[ \frac{L_0 \sqsubseteq_L L_1 \qquad L_0' \sqsubseteq_L L_1'}{L_0 + L_0' \sqsubseteq_L L_1 + L_1'} \tag{30} \]

Rule 30 ensures that the $+$ operator defines the least upper bound of two context lists.

Context and Version Domains. When doing intensional programming, we work with sets of contexts, called context domains, written $\mathcal{C}$. There is one operation on context domains, namely the best-fit. Given a context domain $\mathcal{C}$ of existing contexts and a requested context $C_{req}$, the best-fit context is defined by:

\[ \mathrm{best}(\mathcal{C}, C_{req}) = \max \{ C \in \mathcal{C} \mid C \sqsubseteq C_{req} \} \tag{31} \]

If the maximum does not exist, there is no best-fit context.

Typically, we will be versioning something, an object of some type. This is done using versions, simply $(C, \mathit{object})$ pairs. Version domains $\mathcal{V}$ then become functions mapping contexts to objects. The best-fit object in a version domain is given by:

\[ \mathrm{best}_O(\mathcal{V}, C_{req}) = \mathcal{V}(\mathrm{best}(\mathrm{dom}\,\mathcal{V}, C_{req})) \tag{32} \]
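The refinement relation and the best-fit operation can be made concrete with a small model. The following Python sketch is only an illustration, not part of Ω: it assumes that a context ⟨B; L⟩ is stored as a `(base, subs)` pair, that ground values are ordered only by equality (a flat order), and it leaves out the special values α, ω, A and Ω. The helper names (`ctx`, `subtree`, `refines`, `best`, `best_object`) and the sample values "Latin", "English", "Australian" and "French" are hypothetical.

```python
EPS = None                      # stands in for the empty base value (epsilon)
BOTTOM = (EPS, {})              # the empty ("vanilla") context


def ctx(base=EPS, **subs):
    """Build a context <base; subs>; subs maps dimension name -> context."""
    return (base, dict(subs))


def subtree(c, path):
    """C(D), rules (5)-(6): follow a compound dimension given as a list."""
    for d in path:
        c = c[1].get(d, BOTTOM)  # an absent dimension acts like d:BOTTOM (rule 11)
    return c


def refines(c0, c1):
    """c0 refines-to c1 (rule 21), assuming a flat order on base values."""
    b0, l0 = c0
    b1, l1 = c1
    if b0 is not EPS and b0 != b1:      # rules (22)-(23) only
        return False
    return all(refines(sub, l1.get(d, BOTTOM)) for d, sub in l0.items())


def best(domain, requested):
    """Rule (31): the maximum of the contexts in domain below requested."""
    candidates = [c for c in domain if refines(c, requested)]
    for m in candidates:
        if all(refines(c, m) for c in candidates):
            return m
    return None                 # no maximum, hence no best-fit context


def best_object(versions, requested):
    """Rule (32): versions is a list of (context, object) pairs."""
    chosen = best([c for c, _ in versions], requested)
    for c, obj in versions:
        if chosen is not None and c == chosen:
            return obj
    return None


requested = ctx(script=ctx("Latin"),
                lang=ctx("English", dialect=ctx("Australian")))
versions = [
    (BOTTOM, "generic"),
    (ctx(lang=ctx("English")), "English"),
    (ctx(lang=ctx("English", dialect=ctx("Australian"))), "Australian English"),
    (ctx(lang=ctx("French")), "French"),
]
print(best_object(versions, requested))   # prints: Australian English
```

In this model the French version is never a candidate, because its base value at `lang` differs from the requested one, while the generic and English versions are candidates that lose to the more refined Australian English version.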


Context Operators. Context operators allow one to selectively modify contexts. Their syntax is similar to that of contexts.

\[ C_{op} ::= C \mid [P_{op}; B_{op}; L_{op}] \tag{33} \]
\[ P_{op} ::= {-}{-} \mid \epsilon \tag{34} \]
\[ B_{op} ::= - \mid B \tag{35} \]
\[ L_{op} ::= \emptyset_{L_{op}} \mid d{:}C_{op} + L_{op} \tag{36} \]

A context operator is applied to a context to transform it into another context. (It can also be used to transform a context operator into another; see below.) The $-$ operator removes the current base value, while the $--$ operator in $P_{op}$ is used to clear all dimensions not explicitly listed at that level.

Now we give the semantics for $C\;C_{op}$, the application of context operator $C_{op}$ to context $C$, written here as juxtaposition:

\[ C_0\;C_1 = C_1 \tag{37} \]
\[ \Omega\;C_{op} = \mathit{error} \tag{38} \]
\[ \langle B; L \rangle\;[{-}{-}; B_{op}; L_{op}] = \bigl\langle B; L \setminus (\delta L - \delta L_{op}) \bigr\rangle\;[\epsilon; B_{op}; L_{op}] \tag{39} \]
\[ \langle B; L \rangle\;[\epsilon; B_{op}; L_{op}] = \bigl\langle (B\;B_{op}); (L\;L_{op}) \bigr\rangle \tag{40} \]

The general case consists of replacing the base value and replacing the context list. First, the base value:

\[ B\;{-} = \epsilon \tag{41} \]
\[ B_0\;B_1 = B_1 \tag{42} \]

Now, the context list:

\[ L\;\emptyset_{L_{op}} = L \tag{43} \]
\[ (d{:}C + L)\;(d{:}C_{op} + L_{op}) = d{:}(C\;C_{op}) + (L\;L_{op}) \tag{44} \]
\[ L\;(d{:}C_{op} + L_{op}) = d{:}(\emptyset\;C_{op}) + (L\;L_{op}), \quad d \notin \delta L \tag{45} \]

Context operators can also be applied to context operators. There are two cases:

\[ [P_{op}; B_{op_0}; L_{op_0}]\;[\epsilon; B_{op_1}; L_{op_1}] = \bigl[ P_{op}; (B_{op_0}\;B_{op_1}); (L_{op_0}\;L_{op_1}) \bigr] \tag{46} \]
\[ [P_{op}; B_{op_0}; L_{op_0}]\;[{-}{-}; B_{op_1}; L_{op_1}] = \bigl[ {-}{-}; (B_{op_0}\;B_{op_1}); (L_{op_0} \setminus (\delta L_{op_0} - \delta L_{op_1}))\;L_{op_1} \bigr] \tag{47} \]

Now that we have given the formal syntax and semantics of contexts, version domains, and context operations, we can move on to typesetting.

The Running Context in Ω

As is standard, the abstract syntax is simpler than the concrete syntax, which offers richer possibilities to facilitate entry. Here is the concrete syntax for contexts:

    C   ::= <>                    Empty context
          | ~~                    Minimum context
          | ^^                    Maximum context
          | <val>                 Base value
          | <L>                   Subversions
          | <val+L>               Base & subversions
    val ::= ~                     Minimum value
          | ^                     Maximum value
          | string                Normal value
    L   ::= dim:C [+ dim:C]*
    dim ::= string

As for the context operation, here is the syntax:

    Cop    ::= C                  Replace the context
             | []                 No change
             | [val op]           Change base
             | [Lop]              Change subversions
             | [val op+Lop]       Change base & subs
    val op ::= -                  Clear base
             | val                New value
             | --                 Clear subversions
             | val+--             New base, clear subs
             | ---                Clear base & subs
    Lop    ::= dim:Cop [+ dim:Cop]*

In Ω, the current context is given by:

    \contextshow{}

If D is a compound dimension, then the subversion at dimension D is given by:

    \contextshow{D}

while the base value at dimension D is given by:

    \contextbase{D}

This context is initialized at the beginning of an Ω run with the values of environment variables and command-line parameters. Once it is set, it can be changed as follows:

    \contextset{Cop}

Adapting to the Context

During execution, there are three mechanisms for Ω to modify its behavior with respect to the current context: (1) versioned execution flow, (2) versioned macros, and (3) versioned ΩTPs.

Execution Flow. The new \contextchoice primitive is used to change the execution flow:

    \contextchoice{
      {Cop1}=>{exp1},
      ...
      {Copn}=>{expn}
    }
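To see how the operator semantics of rules (37) to (45) behave, which is what `\contextset{Cop}` relies on, here is a minimal Python sketch. It is a model under stated assumptions, not Ω code: contexts are `(base, subs)` pairs as in the abstract syntax, the maximal context Ω of rule (38) is not modelled, and an operator that does not mention the base is assumed to leave it unchanged (as the concrete forms `[]` and `[Lop]` suggest). The names `Op`, `KEEP`, `CLEAR` and `apply_op`, the `output` dimension, and all sample values are hypothetical.

```python
from dataclasses import dataclass, field

EPS = None                 # the empty base value
BOTTOM = (EPS, {})         # the empty context
KEEP = object()            # assumed marker: operator leaves the base unchanged
CLEAR = object()           # models the "-" base operator


@dataclass
class Op:
    """A context operator [Pop; Bop; Lop]."""
    clear_others: bool = False                 # True models Pop = "--"
    base: object = KEEP                        # KEEP, CLEAR, or a new base value
    subs: dict = field(default_factory=dict)   # dimension -> Op or context


def apply_op(c, op):
    """Apply operator op to context c, returning a new context."""
    if not isinstance(op, Op):       # rule (37): a plain context replaces c
        return op
    base, subs = c
    if op.clear_others:              # rule (39): drop dimensions not listed
        subs = {d: s for d, s in subs.items() if d in op.subs}
    if op.base is CLEAR:             # rule (41)
        base = EPS
    elif op.base is not KEEP:        # rule (42)
        base = op.base
    new_subs = dict(subs)            # untouched dimensions carry over (rule 43)
    for d, sub_op in op.subs.items():
        # rule (44) when d is present, rule (45) when it is absent
        new_subs[d] = apply_op(subs.get(d, BOTTOM), sub_op)
    return (base, new_subs)


current = (EPS, {
    "script": ("Latin", {}),
    "lang": ("English", {"dialect": ("Australian", {})}),
})

# Change only the dialect, leaving every other dimension alone.
change_dialect = Op(subs={"lang": Op(subs={"dialect": ("British", {})})})
changed = apply_op(current, change_dialect)

# Clear every dimension except the (hypothetical) output dimension.
only_output = Op(clear_others=True, subs={"output": Op(base="PDF")})
reset = apply_op(current, only_output)
```

After these calls, `changed` still has `script` set to Latin and `lang` set to English but with the new dialect, while `reset` contains only the `output` dimension; `current` itself is not modified.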


Depending on the current context $C$, one of the expressions $exp_i$ will be selected and expanded. The one chosen will correspond to the best-fit context among $\{C\;C_{op_1}, \ldots, C\;C_{op_n}\}$ (see the discussion above of Context and Version Domains).

Macros. The Ω macro expansion process has been extended so that any control sequence can have multiple, simultaneous versions, at the same scoping level. Whenever \controlsequence is expanded, the most relevant, i.e. the best-fit, definition, with respect to the current context, is expanded.

A version of a control sequence is defined as follows:

    \vdef{Cop}\controlsequence args{definition}

If the current context is $C$, then this definition defines the $C\;C_{op}$ version of \controlsequence. The scoping of definitions is the same as for TeX.

This approach is upwardly compatible with the TeX macro expansion process. The standard TeX definition:

    \def\controlsequence args{definition}

is simply equivalent to

    \vdef{<>}\controlsequence args{definition}

i.e., it defines the empty version of a control sequence.

As stated above, during expansion the best-fit definition of \controlsequence, with respect to the current context, will be expanded whenever it is encountered. It is also possible to expand a particular version of a control sequence, by using:

    \vexp{Cop}\controlsequence

ΩTPs and ΩTP-lists. Beyond the ability to manipulate larger data structures than does TeX, Ω allows the user to apply a series of filters to the input, each reading from standard input and writing to standard output. Each of the filters is called an ΩTP (Ω Translation Process), and a series of filters is called an ΩTP-list.

There are two kinds of ΩTPs: internal and external. Internal ΩTPs are finite state machines written in an Ω-specific language, and are compiled before being interpreted by the Ω engine. External ΩTPs are stand-alone programs, reading from standard input and writing to standard output, like Unix filters.

Internal and external ΩTPs handle context differently. For external ΩTPs, the context information can be passed on through an additional parameter to the system call invoking the external ΩTP:

    program -context=context

Internal ΩTPs have been modified so that every instruction can be preceded by a context tag. Using the simplest syntax, this becomes:

    <> pattern => expression

When an internal ΩTP is being interpreted, an instruction is only examined if its context tag (defaulting to the empty context) is less than the current running context.

When ΩTPs and ΩTP-lists are being declared in Ω, the \contextchoice operator can be used to build versioned ΩTP-lists. These will be particularly useful for multilingual typesetting. See [3] for more details.

The Internals of the Typesetting Engine

The versioned macros and ΩTPs presented in the previous sections clearly facilitate the development of software that is more flexible, in the sense that if new parameters are added to a system, new code only needs to be written for those parts affected directly by the new parameters.

But it is still not clear how these mechanisms will help solve this problem of having multiple input and output formats for use with the same typesetting system, in particular, inside Ω.

We outline the solution here, since at the time of writing, we have not yet finalized the syntax.

Essentially, all of Ω's internal data structures will be made directly accessible to the user. These include at least:

• streams of characters;
• streams of glyphs;
• math lists;
• input to each ΩTP application;
• output from each ΩTP application;
• paragraphs;
• tables;
• pages;
• other kinds of boxes.

For each of these data structures will be specified a canonical serialization and deserialization. For each of the algorithms that can be applied to these data structures, a means for applying the algorithms from the user level will be defined as well.

As a result, it will be possible to completely manipulate what we might call the "canonical input" and the "canonical output" of the typesetter. Then inputting from different formats and outputting to different formats becomes much simpler. For input from a specific input format, an ΩTP-list must be defined to translate that input format into the "canonical input". For output to a specific output format, an ΩTP-list must be defined to translate from the "canonical output" to the output format. These ΩTP-lists may well be parameterized by the current context to achieve specific results.
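The idea of an output ΩTP-list selected by the current context can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration, not the Ω design, whose syntax the text says is not yet finalized: contexts are flat dictionaries (the simplest kind mentioned earlier), filters are plain string-to-string functions, and a version domain picks the best-fit filter list. The class `VersionDomain`, the functions `escape_markup`, `wrap_paragraph` and `run_otp_list`, and the dimensions `output` and `flavour` are all hypothetical.

```python
def refines(c0, c1):
    """Flat-dictionary refinement: every pair of c0 also appears in c1."""
    return all(c1.get(d) == v for d, v in c0.items())


class VersionDomain:
    """Maps contexts to objects and answers best-fit queries."""

    def __init__(self):
        self.versions = []                    # list of (context, object)

    def define(self, context, obj):
        self.versions.append((dict(context), obj))

    def best_fit(self, current):
        candidates = [(c, o) for c, o in self.versions if refines(c, current)]
        for c, obj in candidates:
            if all(refines(other, c) for other, _ in candidates):
                return obj                    # c is the maximum candidate
        return None                           # no maximum, no best fit


def escape_markup(text):
    """A toy filter: escape characters that are special in XML."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def wrap_paragraph(text):
    """A toy filter: wrap the text in a paragraph element."""
    return "<p>" + text + "</p>"


def run_otp_list(filters, text):
    """Apply each filter in turn, like a pipeline of Unix filters."""
    for f in filters:
        text = f(text)
    return text


output_lists = VersionDomain()
output_lists.define({}, [])                                   # default: pass through
output_lists.define({"output": "xml"}, [escape_markup])
output_lists.define({"output": "xml", "flavour": "html"},
                    [escape_markup, wrap_paragraph])

html_context = {"output": "xml", "flavour": "html", "lang": "English"}
dvi_context = {"output": "dvi"}

html_result = run_otp_list(output_lists.best_fit(html_context), "a < b")
dvi_result = run_otp_list(output_lists.best_fit(dvi_context), "a < b")
```

Here `html_result` is `"<p>a &lt; b</p>"` and `dvi_result` is the unchanged `"a < b"`. Adding a new output format in this model means defining one more version, without touching the existing ones, which mirrors the intent stated for versioned macros and ΩTP-lists.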


This approach is suitable for generic content that is found in all input and output files, but what about elements that are relevant for only specific kinds of input or output? For these elements, it is the versioned macros that will come into play. Some macros will have versions defined to deal with these elements only when the right conditions occur in the context.

Conclusions

Given the successful experimental work in generating MathML and XML directly from the Ω engine, we consider that the approach presented in this paper is both elegant and doable. In fact, we consider that this approach makes it possible to integrate into a single framework most of the existing extensions to TeX, and its original view that a document should be transformed into a DVI file that is in turn converted to (for example) PostScript.

However, the implications go well beyond just integrating existing work, even if that goal is both laudable and desirable. New possibilities will arise, since the user will no longer be forced to use the TeX document model, in which a document is just a series of pages. Ω then becomes usable for typesetting small pieces of text at a time, say to generate EPS files on demand for the uses of other applications, such as online multilingual mapping tools or air traffic control systems.

For a piece of software to survive from one generation to the next, it must be able to adapt to the arising needs of coming times, and continually provide new possibilities. We believe that the approach presented here will ensure the long-term viability of Ω, hence of the TeX community.

References

[1] Donald Knuth. Computers and Typesetting. 5 volumes, Addison-Wesley, 1986.
[2] Omega Typesetting and Document Processing System, http://omega.enstb.org
[3] John Plaice, Yannis Haralambous and Chris Rowley. A multidimensional approach to typesetting. TUGboat 24(1), 2003, Proceedings of the TUG Annual Meeting, pp. 105–114.
[4] John Plaice and Joey Paquet. Introduction to intensional programming. In Intensional Programming I, World-Scientific, Singapore, 1996.
[5] John Plaice, Paul Swoboda and Ammar Alammar. Building intensional communities using shared contexts. In Distributed Communities on the Web, LNCS 1830:55–64, Springer-Verlag, 2000.
[6] John Plaice and William W. Wadge. A new approach to version control. IEEE-TSE 19(3):268–276, 1993.
[7] Paul Swoboda. A Formalization and Implementation of Distributed Intensional Programming. PhD thesis, University of New South Wales, Sydney, Australia, 2003.
[8] Extensible Markup Language (XML), http://www.w3c.org/XML
[9] Neomega Typesetting System. http://neomega.web.cse.unsw.edu.au.
