Extending Gnu Troff to Produce Html Through the Technique of Next Event Simulation
Total Page:16
File Type:pdf, Size:1020Kb
G. MULLEY and W.LEMBERG: EXTENDING GNU TROFF TOPRODUCE HTML EXTENDING GNU TROFF TOPRODUCE HTML THROUGH THE TECHNIQUE OF NEXT EVENT SIMULATION GAIUS MULLEY and WERNER LEMBERG School of Computing,University of Glamorgan, CF37 1DL, UK E-Mail: [email protected] Kl. Beurhausstr.1,44137 Dortmund, Germany E-Mail: [email protected] Abstract: This paper reports on a technique used to generate accurate HTML output from GNU Troff. GNU Troff is a typesetting package which reads plain text mixed with formatting commands and produces formatted output. It supports a number of devices and nowsupports the production of HTML.The paper discusses the design of the HTML device driver grohtml and modifications made to GNU Troff. The front end program troff wasmodi- fied to maintain a reduced state machine which is examined each time a glyph is passed to the back end device driver(post-grohtml). Anychange in system state between the production of twoglyphs results in a sequence of events being passed to the device driver. There is a direct correspondence between this technique and creating ascript for a next event simulation queue. Furthermore the device driverreconstructs the system state and for- mats the HTML according to state changes caused when processing the event queue. This technique works well, as it minimises the state information passed from front end to back end device driverwhilst still preserving the high levellayout of the text. Using this technique GNU Troffeffectively translates input source into another mark-up language and thus this technique could be extended to translate GNU Troffdocuments into anyofthe OpenOffice supported formats. Troffhas been in use for three decades nowand is still actively used by authors. Troff’sbiggest use, however, istoformat manual pages for GNU/Linux and other UNIX likeoperating systems. Introducing this facility into GNU Troffprovides users with the ability to translate legacy documents into HTML and in the future to a format supported by OpenOffice. Ke ywords: groff, troff, grohtml, HTML, simulation. table [Lesk, 1976a], picture [Kernighan, 1991, Wyk, 1982] and equation handling [Kernighan, 1977, INTRODUCTION Kernighan, 1976] (refer, tbl, pic and eqn respec- tively). These programs were executed in a pipeline GNU Troffisareimplementation of the program and troff transformed the heavily preprocessed troff which is available on the UNIX operating sys- formatting commands and text into device dependent tem. The original troff waswritten in PDP-11 output [Kernighan, 1978]. Although the input to assembly language by Joe Ossanna in 1973. Tw o troff wasdevice independent it was also very low years later it was rewritten in C and afterwards it leveland therefore users were encouraged to use went through a series of revisions until Joe Ossanna macro sets when producing documents [Lesk, died in 1977. 1976b]. During the 1980s the internals were mod- estly revised and a number of newmacro sets were Brian Kernighan continued troff development for the next 15 years and it is testament to the original written [Allman, 1980, Smith, 1980]. The macro design that the input language remained the same. sets provided freedom in document styling (similar Much of the original documentation receivedminor to modern HTML style sheets) and theyincluded: arbitrary style headers and footers; arbitrary style changes as new troff releases were issued and therefore Joe’sname was retained on these manu- footnotes; automatic sequence numbering for para- als [Ossanna, 1992]. graphs, sections, etc; multiple column output; dynamic font and point-size control; arbitrary hori- Brian Kernighan modified troff so that it could zontal and vertical local motions at anypoint; math- produce output for a number of different typesetting ematical bracket construction, and line drawing devices, while at the same time retaining the same functions [Ossanna, 1992]. input language specification. The input language wassorobust that a number of preprocessors were During the 1980s troff wasextended to handle written to provide reference [Tuthill, 1986], manydifferent devices, this was accomplished by I.J. of SIMULATION Vol. 6 No 7-8 37 ISSN 1473-804x online, 1473-8031 print G. MULLEY and W.LEMBERG: EXTENDING GNU TROFF TOPRODUCE HTML splitting the task of troff into twocomponents. commands to format abstracts, titles, footnotes, para- The front end ditroff produced device indepen- graphs, lists, hanging indents etc. Consequently, dent code and the back end device driverwhich sim- translating troff input source into HTML cannot be ply translated the device independent code into the achievedbywriting a device driverwhich simply target device commands. reads ditroff input and produces HTML output. Consider the diagram in figure 1 which shows the James Clark begantowork on the GNU implemen- keycomponents of a simple groff command line tation of the troff family of tools in 1989. This was invocation together with a synopsis of the data pass- to be a completely newimplementation of all the ing though the pipeline. In figure 1 the troff input preprocessors, the ditroff program and the macro consists of requests or calls to macros (lines prefixed sets. The first release of groff (version 0.3.1) with a period) and text. Troffinput might also occurred in June 1990 and it included a replacement include text with escapes, for example the word for ditroff, eqn, tbl and pic.Italso included a program can be typeset by temporarily reducing the replacement for the me macros and the man macros. point size by 1 and altering the font to Courier.The The replacement programs were mostly written in word and escapes can be encoded as \s-1\fCpro- C++ and often supported extensions and removed gram\s+1\fP.Infigure 1 the final output is Post- various static data size limitations. Since 1999 the Script (the default device) and the invocation also groff package has had newmaintainers and has includes the ms macro package. Figure 2 shows how undergone active dev elopment. It supports the com- aPostScript printer interprets the output from Figure mon macro sets associated with troff (man, me, ms, 1. mdoc and mm). However groff also provides a modern macro set (mom)and also provides newpre- processors and support for colour. Abasic title DESCRIPTION OF THE PROBLEM 1. Heading at level1 1.1. Heading at level2 Troffwas widely used in the 1970s and 1980s. All UNIX documentation, release notes for both the 1.1.1. Heading at level3 First paragraph body AT&T and BSD UNIX variants was written in troff.Manypapers in the CACM and Software Figure 2: displaying the PostScript output Practice and Experience journals were also produced Notice howthe ditroff output only knows which using troff.Eventoday all UNIX and GNU/Linux font is to be used, the font size and position that the manual pages are written in troff using the man glyph should be placed. Another example that an macro set and some authors advocate using troff HTML device drivermust translate accurately is above other WYSIWYG tools to typeset their shown in figure 3. books [Tanenbaum, 1997, Stevens, 1998, Schaffter, 2004]. The start of an indented paragraph example in which line 1, line 2 and line 3 are Groffprovides compatibility with troff as well as vertically aligned. manymodern enhancements, image handling, colour .LP and limited pdf mark capability.Clearly the addition .IP once of an HTML device driverwould be useful. line 1 .IP twice line 2 GNU Troffcopied the original design of UNIX Troff .IP threefold ,which maps the input source onto a physical device line 3 through the use of a device independent language. Figure 3: examples of indented paragraphs Through this language it effectively plots each glyph at Cartesian coordinate position on a page. The dif- Here we see from the output shown in figure 4 that ficulty in translating troff source into HTML is the ms macro set diverted the indented paragraph exacerbated by the fact that both HTML and troff label parameter once, twice and threefold into a source are mark-up languages. As the GNU Troff macro. It then tests to see whether the macro width and original UNIX Troffpackages rely heavily on the is greater than a default length and if so it breaks the pipeline principle it has naturally led to the pre- line before starting the indented paragraph. Clearly processors (pic, tbl and eqn)translating their high the HTML device driverneeds to cope with these levelcommands onto much lower level troff com- constructs, as theyare also heavily used in manual mands. In turn, this has meant that much of the high pages. levelinformation (for example where a table starts and ends) is lost. Furthermore, using anyofthe macro packages will result in manylow lev el I.J. of SIMULATION Vol. 6 No 7-8 38 ISSN 1473-804x online, 1473-8031 print G. MULLEY and W.LEMBERG: EXTENDING GNU TROFF TOPRODUCE HTML Troffsource ditroffintemediate language postscript output .TL xTps %!PS-Adobe-3.0 Abasic title xres 72000 %%Creator: groff .NH 1 xinit %%DocumentNeed Heading at level1 troff-ms p1 grops %%+font Times-Ro .NH 2 xfont 6 TB %%DocumentSupp Heading at level2 f6 %%Pages: 1 .LP s12 H123k V 123k %%PageOrder: Asc First paragraph tA %%Orientation: body wh3000 %%EndComments tbasic Figure 1: Simple title, heading and paragraph Forexample HTML has a restricted set of mathemat- The start of an indented paragraph example in which line 1, line 2 and line ical glyphs and it is restricted in its ability to accu- 3are vertically aligned. once line 1 rately position these glyphs [Musciano, 1998]. Groff twice in contrast can be used to typeset mathematical for- line 2 mulae and users will at least expect the groff HTML threefold line 3 device drivertoreproduce the formulae as an image. Figure 7 shows Troffinput, which describes the defi- Figure 4: result of processing figure 3 with groff -ms nite integral.