G. MULLEY and W. LEMBERG: EXTENDING GNU TROFF TO PRODUCE HTML

EXTENDING GNU TROFF TO PRODUCE HTML THROUGH THE TECHNIQUE OF NEXT EVENT SIMULATION

GAIUS MULLEY, WERNER LEMBERG

School of Computing, University of Glamorgan, CF37 1DL, UK
E-Mail: [email protected]

Kl. Beurhausstr. 1, 44137 Dortmund, Germany
E-Mail: [email protected]

Abstract: This paper reports on a technique used to generate accurate HTML output from GNU Troff. GNU Troff is a typesetting package which reads plain text mixed with formatting commands and produces formatted output. It supports a number of devices and now supports the production of HTML. The paper discusses the design of the HTML device driver grohtml and modifications made to GNU Troff. The front end program troff was modified to maintain a reduced state machine which is examined each time a glyph is passed to the back end device driver (post-grohtml). Any change in system state between the production of two glyphs results in a sequence of events being passed to the device driver. There is a direct correspondence between this technique and creating a script for a next event simulation queue. Furthermore the device driver reconstructs the system state and formats the HTML according to state changes caused when processing the event queue. This technique works well, as it minimises the state information passed from front end to back end device driver whilst still preserving the high level layout of the text. Using this technique GNU Troff effectively translates input source into another mark-up language and thus this technique could be extended to translate GNU Troff documents into any of the OpenOffice supported formats. Troff has been in use for three decades now and is still actively used by authors. Troff's biggest use, however, is to format manual pages for GNU/Linux and other UNIX like operating systems.
Introducing this facility into GNU Troff provides users with the ability to translate legacy documents into HTML and in the future to a format supported by OpenOffice.

Keywords: groff, troff, grohtml, HTML, simulation.

Published in the International Journal of SIMULATION Vol. 6 No 7-8, ISSN 1473-804x online, 1473-8031 print, 2005

INTRODUCTION

GNU Troff is a reimplementation of the program troff which is available on the UNIX operating system. The original troff was written in PDP-11 assembly language by Joe Ossanna in 1973. Two years later it was rewritten in C and afterwards it went through a series of revisions until Joe Ossanna died in 1977.

Brian Kernighan continued troff development for the next 15 years and it is testament to the original design that the input language remained the same. Much of the original documentation received minor changes as new troff releases were issued and therefore Joe's name was retained on these manuals [Ossanna, 1992].

Brian Kernighan modified troff so that it could produce output for a number of different typesetting devices, while at the same time retaining the same input language specification. The input language was so robust that a number of preprocessors were written to provide reference [Tuthill, 1986], table [Lesk, 1976a], picture [Kernighan, 1991, Wyk, 1982] and equation handling [Kernighan, 1977, Kernighan, 1976] (refer, tbl, pic and eqn respectively). These programs were executed in a pipeline and troff transformed the heavily preprocessed formatting commands and text into device dependent output [Kernighan, 1978]. Although the input to troff was device independent it was also very low level and therefore users were encouraged to use macro sets when producing documents [Lesk, 1976b]. During the 1980s the internals were modestly revised and a number of new macro sets were written [Allman, 1980, Smith, 1980]. The macro sets provided freedom in document styling (similar to modern HTML style sheets) and they included: arbitrary style headers and footers; arbitrary style footnotes; automatic sequence numbering for paragraphs, sections, etc; multiple column output; dynamic font and point-size control; arbitrary horizontal and vertical local motions at any point; mathematical bracket construction, and line drawing functions [Ossanna, 1992].

During the 1980s troff was extended to handle many different devices; this was accomplished by splitting the task of troff into two components. The front end ditroff produced device independent code and the back end device driver simply translated the device independent code into the target device commands.

James Clark began to work on the GNU implementation of the troff family of tools in 1989. This was to be a completely new implementation of all the preprocessors, the ditroff program and the macro sets. The first release of groff (version 0.3.1) occurred in June 1990 and it included a replacement for ditroff, eqn, tbl and pic. It also included a replacement for the me macros and the man macros. The replacement programs were mostly written in C++ and often supported extensions and removed various static data size limitations. Since 1999 the groff package has had new maintainers and has undergone active development. It supports the common macro sets associated with troff (man, me, ms, mdoc and mm). However groff also provides a modern macro set (mom) and also provides new preprocessors and support for colour.

DESCRIPTION OF THE PROBLEM

Troff was widely used in the 1970s and 1980s. All UNIX documentation and release notes for both the AT&T and BSD UNIX variants were written in troff. Many papers in the CACM and Software Practice and Experience journals were also produced using troff. Even today all UNIX and GNU/Linux manual pages are written in troff using the man macro set and some authors advocate using troff above other WYSIWYG tools to typeset their books [Tanenbaum, 1997, Stevens, 1998, Schaffter, 2004].

Groff provides compatibility with troff as well as many modern enhancements, image handling, colour and limited pdf mark capability. Clearly the addition of an HTML device driver would be useful.

GNU Troff copied the original design of UNIX Troff, which maps the input source onto a physical device through the use of a device independent language. Through this language it effectively plots each glyph at a Cartesian coordinate position on a page. The difficulty in translating troff source into HTML is exacerbated by the fact that both HTML and troff source are mark-up languages. As the GNU Troff and original UNIX Troff packages rely heavily on the pipeline principle it has naturally led to the preprocessors (pic, tbl and eqn) translating their high level commands onto much lower level troff commands. In turn, this has meant that much of the high level information (for example where a table starts and ends) is lost. Furthermore, using any of the macro packages will result in many low level commands to format abstracts, titles, footnotes, paragraphs, lists, hanging indents etc. Consequently, translating troff input source into HTML cannot be achieved by writing a device driver which simply reads ditroff input and produces HTML output.

Consider the diagram in Figure 1 which shows the key components of a simple groff command line invocation together with a synopsis of the data passing through the pipeline. In Figure 1 the troff input consists of requests or calls to macros (lines prefixed with a period) and text. Troff input might also include text with escapes, for example the word program can be typeset by temporarily reducing the point size by 1 and altering the font to Courier. The word and escapes can be encoded as

    \s-1\fCprogram\s+1\fP

In Figure 1 the final output is PostScript (the default device) and the invocation also includes the ms macro package. Figure 2 shows how a PostScript printer interprets the output from Figure 1.

    Troff source          ditroff intermediate language   PostScript output
    .TL                   x T ps                          %!PS-Adobe-3.0
    A basic title         x res 72000                     %%Creator: groff
    .NH 1                 x init                          %%DocumentNeed
    Heading at level 1    p1                              %%+font Times-Ro
    .NH 2                 x font 6 TB                     %%DocumentSupp
    Heading at level 2    f6                              %%Pages: 1
    .LP                   s12                             %%PageOrder: Asc
    First paragraph       H123k V 123k                    %%Orientation:
    body                  tA                              %%EndComments
                          wh3000
                          tbasic

    Pipeline: Troff source -> troff -ms -> ditroff intermediate language -> grops -> PostScript output

    Figure 1: Simple title, heading and paragraph

    A basic title
    1. Heading at level 1
    1.1. Heading at level 2
    1.1.1. Heading at level 3
    First paragraph body

    Figure 2: displaying the PostScript output

Notice how the ditroff output only knows which font is to be used, the font size and the position that the glyph should be placed. Another example that an HTML device driver must translate accurately is shown in Figure 3.

    The start of an indented paragraph example
    in which line 1, line 2 and line 3 are
    vertically aligned.
    .LP
    .IP once
    line 1
    .IP twice
    line 2
    .IP threefold
    line 3

    Figure 3: examples of indented paragraphs

Here we see from the output shown in Figure 4 that the ms macro set diverted the indented paragraph label parameter once, twice and threefold into a macro. It then tests to see whether the macro width is greater than a default length and if so it breaks the line before starting the indented paragraph. Clearly the HTML device driver needs to cope with these constructs, as they are also heavily used in manual pages.

    The start of an indented paragraph example
    in which line 1, line 2 and line 3 are
    vertically aligned.
    once       line 1
    twice      line 2
    threefold
               line 3

    Figure 4: result of processing Figure 3 with groff -ms

Furthermore many documents will include encapsu-

    An example of encapsulated postscript

    Figure 6: result of processing Figure 5 with groff -ms

A further problem which needs to be addressed is the limited number of glyphs available in HTML. For example HTML has a restricted set of mathematical glyphs and it is restricted in its ability to accurately position these glyphs [Musciano, 1998].
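The earlier observation that the ditroff output records only a font, a size and a position for each glyph can be made concrete with a minimal Python sketch. This is not code from grohtml. It assumes a simplified subset of the intermediate commands shown in Figure 1 (f, s, H, V and t), and it ignores the horizontal advance that typesetting a word would cause. The class PlacedWord, the function read_positions, and every font number and coordinate in the sample are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlacedWord:
    """One word as a naive back end sees it (hypothetical record type)."""
    text: str
    font: int   # font number taken from the last 'f' command
    size: int   # point size taken from the last 's' command
    x: int      # absolute horizontal position from the last 'H' command
    y: int      # absolute vertical position from the last 'V' command


def read_positions(lines):
    """Interpret a simplified subset of ditroff-style commands.

    Only f (font), s (size), H and V (absolute position) and t (typeset
    a word) are handled; every other command is skipped.
    """
    font = size = x = y = 0
    words = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cmd, arg = line[0], line[1:].strip()
        if cmd == "f":
            font = int(arg)
        elif cmd == "s":
            size = int(arg)
        elif cmd == "H":
            x = int(arg)
        elif cmd == "V":
            y = int(arg)
        elif cmd == "t":
            words.append(PlacedWord(arg, font, size, x, y))
        # Other commands (x, p, w, h, ...) carry no text and are ignored.
    return words


# Hypothetical input: the numbers are invented, not taken from Figure 1.
sample = [
    "x T ps",
    "x init",
    "p1",
    "f6",
    "s12",
    "H1000",
    "V2000",
    "tHeading",
    "f5",
    "s10",
    "H1000",
    "V4000",
    "tFirst",
]

for word in read_positions(sample):
    print(word)
```

Each record holds a font, a size and a page position and nothing else. Nothing marks the first word as a heading or the second as paragraph text, which is the high level information an HTML driver has to obtain by other means.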