TUGboat, Volume 35 (2014), No. 1 71

Parsers in TeX and using CWEB for general pretty-printing

Alexander Shibakov

In this article I describe a collection of TeX macros and a few simple programs called SPLinT that enable the use of the standard parser and scanner generator tools, bison and flex, to produce very general parsers and scanners coded as TeX macros. SPLinT is freely available from http://ctan.org/pkg/splint and http://math.tntech.edu/alex.

Introduction

The need to process formally structured languages inside TeX documents is neither new nor uncommon. Several graphics extensions for TeX (and LaTeX) have introduced a variety of small specialized languages for their purposes that depend on simple (and not so simple) interpreters coded as TeX macros. A number of pretty-printing macros take advantage of different parsing techniques to achieve their goals (see [Go], [Do], and [Wo]).

Efforts to create general and robust parsing frameworks inside TeX go back to the origins of TeX itself. A well-known BASIC subset interpreter, BASIX (see [Gr]), was written as a demonstration of the flexibility of TeX as a programming language and a showcase of TeX's ability to handle a variety of abstract data structures. On the other hand, a relatively recent addition to the LaTeX toolbox, l3regex (see [La]), provides a powerful and very general way to perform regular expression matching in LaTeX, which can be used (among other things) to design parsers and scanners.

Paper [Go] contains a very good overview of several approaches to parsing and tokenizing in TeX and outlines a universal framework for parser design using TeX macros. In an earlier article (see [Wo]), Marcin Woliński describes a parser creation suite paralleling the technique used by CWEB (CWEB's 'grammar' is hard-coded into CWEAVE, whereas Woliński's approach is more general). One commonality between these two methods is a highly customized tokenizer (or scanner) used as the input to the parser proper. Woliński's design uses a finite automaton as the scanner engine with a 'manually' designed set of states. No backing-up mechanism was provided, so matching, say, the longest input would require some custom coding (it is, perhaps, worth mentioning here that a backup mechanism is all one needs to turn any regular language scanner into a general CWEB-type parser). The scanner in [Go] was designed mainly with efficiency in mind and thus relies on a number of very clever techniques that are highly language-specific (out of necessity). Since TeX is a system well-suited for typesetting technical documents, pretty-printing texts written in formal languages is a common task and is also one of the primary reasons to consider a parser written in TeX.

The author's initial motivation for writing the macros described in this article grew out of the desire to fully document a few embedded microcontroller projects that contain a mix of C code, Makefiles, linker scripts, etc. While the majority of code for such projects is written in C, superbly processed by CWEB itself, some crucial information resides in the kinds of files mentioned above and can only be handled by CWEB's verbatim output (with some minimal postprocessing, mainly to remove the #line directives left by CTANGLE).

Parsing with TeX vs. others

Naturally, using TeX in isolation is not the only way to produce pretty-printed output. The CWEB system for writing structured documentation uses TeX merely as a typesetting engine, while handing over the job of parsing and preprocessing the user's input to a program built specifically for that purpose. Sometimes, however, a paper or a book written in TeX contains a few short examples of programs written in another programming language. Using a system such as CWEB to process these fragments is certainly possible (although it may become somewhat involved) but a more natural approach would be to create a parser that can process such texts (with some minimal help from the user) entirely inside TeX itself. As an example, pascal (see [Go]) was created to pretty-print Pascal programs using TeX. It used a custom scanner for a subset of standard Pascal and a parser, generated from an LL(1) Pascal grammar by a parser generator called parTeX. Even if CWEB or a similar tool is used, there may still be a need to parse a formal language inside TeX. One example would be the use of CWEB to handle a language other than C.

Before I proceed with the description of the tool that is the main subject of this paper, allow me to pause for just a few moments to discuss the wisdom (or lack thereof) of laying the task of parsing formal texts entirely on TeX's shoulders. In addition to using an external program to preprocess a TeX document, some recent developments allow one to implement a parser in a language 'meant for such tasks' inside an extension of TeX. We are speaking of course about LuaTeX (see [Ha]), which essentially implements an entirely separate interface to TeX's typesetting mechanisms and data structures in Lua (see [Lu]), 'grafted' onto a TeX extension.

Although I feel nothing but admiration for the LuaTeX developers, and completely share their desire to empower TeX by providing a general purpose programming language on top of its internal mechanisms, I would like to present three reasons to avoid taking advantage of LuaTeX's impressive capabilities for this particular task.

First, I am unaware of any standard tools for generating parsers and scanners in Lua (of course, it would be just as easy to use the approach described here to create such tools). At this point in time, it is just as easy to coax standard parser generators into outputting parsers in TeX as it is to make them output Lua.

Second, I am a great believer in generating 'archival quality' documents: standard TeX has been around for almost three decades in a practically unchanged form, an eternity in the software world. The parser produced using the methods outlined in this paper uses standard (plain) TeX exclusively. Moreover, if the grammar is unchanged, the parser code itself (i.e. its semantic actions) is very readable, and can be easily modified without going through the whole pipeline (bison, flex, etc.) again. A full record of the grammar is left with the generated parser and scanner, so even if the more 'volatile' tools, such as bison and flex, become incompatible with the package, the parser can still be utilized with TeX alone. Perhaps the following quote by D. Knuth (see [DEK2]) would help to reinforce this point of view: "Of course I do not claim to have found the best solution to every problem. I simply claim that it is a great advantage to have a fixed point as a building block."

Finally, the idea that TeX is somehow unsuitable for such tasks may have been overstated. While it is true that TeX's macro language lacks some of the expressive ability of its 'general purpose' brethren, it does possess a few qualities that make it quite adept at processing text (it is a typesetting language, after all!). Among these features are: a built-in hashing mechanism (accessible through the \csname...\endcsname and \string primitives) for storing and accessing control sequence names and creating associative arrays; a number of tools and data structures for comparing and manipulating strings (token registers, the \ifx primitive, various expansion primitives: \edef, \expandafter and the like); and even string matching and replacement (using delimited parameters in macros). TeX notoriously lacks a good (i.e. efficient and easy to use) framework for storing and manipulating arrays and lists (see the discussion of list macros in Appendix D of The TeXbook and in [Gr]) but this limitation is readily overcome by putting some extra care into one's macros.

Languages, grammars, parsers, and TeX

Or ...
  Tokens and tables keep macros in check.
  Make 'em with bison, use WEAVE as a tool.
  Add TeX and CTANGLE, and C to the pool.
  Reduce 'em with actions, look forward, not back.
  Macros, productions, recursion and stack!
    Computer generated (most likely)

The goal of the software described in this article, SPLinT (Simple Parsing and Lexing in TeX, or, in the tradition of GNU, SPLinT Parses Languages in TeX), is to give a macro writer the ability to use standard parser/scanner generator tools to produce TeX macros capable of parsing formal languages.

Let me begin by presenting a 'bird's eye view' of the setup and the workflow one would follow to create a new parser with this package. To take full advantage of this software, two external programs (three if one counts a C compiler) are required: bison and flex (see [Bi] and [Pa]), the parser and scanner generators, respectively. Both are freely available under the terms of the General Public License version 3 or higher and are standard tools included in practically every modern GNU/Linux distribution. Versions that run under a number of other operating systems exist as well.

While the software allows the creation of both parsers and scanners in TeX, the steps involved in making a scanner are very similar to those required to generate a parser, so only the parser generation will be described below.

Setting the semantic actions aside for the moment, one begins by preparing a generic bison input file, following some simple guidelines. Not all bison options are supported (the most glaring omission is the ability to generate a general LR (glr) parser, but this may be added in the future); in every other respect it is an ordinary bison grammar. In some cases, a bison grammar may already exist and can be turned into a TeX parser with just a few (or none!) modifications and a new set of semantic actions (written in TeX, of course). As a matter of example, the grammar used to pretty-print bison grammars in CWEB that comes with this package was adopted (with very minor modifications, mainly to create a more logical presentation in CWEB) from the original grammar used by bison itself.


Once the grammar has been debugged (using a combination of bison's own impressive debugging facilities and the debugging features supported by the macros in the package), it is time to write the semantic actions for the syntax-directed translation (see [Ah]). These are ordinary TeX macros written using a few simple conventions listed below. First, the actions themselves will be executed inside a large \ifcase statement (this is not always the case, see the discussion of 'optimization' below, but it would be better to assume that it is); thus, care must be taken to write the macros so that they can be 'skipped' by TeX's scanning mechanism. Second, instead of using bison's $n syntax to access the value stack, a variety of \yy p macros are provided. Finally, the 'driver' (a small C program, see below) provided with the package merely cycles through the actions to output TeX macros, so one has to use one of the C macros provided with the package to output TeX in a proper form. One such macro is TeX_, used as TeX_("{TeX tokens}");.

The next step is the most technical, and the one most in need of automation. A Makefile provided with the package shows how such automation can be achieved. The newly generated parser (the '.c-file' produced by bison) is #include'd in (yes, included, not merely linked to) a special 'driver' file. No modifications to the driver file or the bison-produced parser are necessary; all one has to do is call a C compiler with an appropriately defined macro (see the Makefile for details). The resulting executable is then run, which produces a .tex file that contains the macros necessary to use the freshly-minted parser in TeX. This short brush with a C compiler is the only time one ventures outside of the world of pure TeX to build a parser with this software (not counting the one needed to create the accompanying scanner if one is desired). It is possible to add a 'plugin' to bison to create a 'TeX output mode' but at the moment the 'lazy' version seems to be sufficient.

Now \input this file into your TeX document along with the macros that come with the package and voilà! You have a brand new parser in TeX! A full-featured parser for the bison input file format is included, and can be used as a template. For smaller projects, it might help to take a look at the examples portion of the package.

The discussion above glosses over a few important details that anybody who has experience writing 'ordinary' (i.e. non-TeX) parsers in bison would be eager to find out. Let us now discuss some of these details.

Data structures for parsing

A surprisingly modest amount of machinery is required to make a bison-generated parser 'tick'. In addition to the standard arithmetic 'bag of tricks' (integer addition, multiplication and conditionals), some basic integer and string array (or closely related list and stack) manipulation is all that is needed.

Parser tables and stack access 'in the raw' are normally hidden from the parser designer but creating lists and arrays is standard fare for most semantic actions. The bison parser supplied with the package does not use any techniques that are more sophisticated than simple token register operations. Representing and accessing arrays this way (see Appendix D of The TeXbook or the \concat macro in the package) is simple and intuitive but computationally expensive. The computational costs are not prohibitive, though, as long as the arrays are kept short. In the case of large arrays that are read often, it pays to use a different mechanism. One such technique (used also in [Go], [Gr], and [Wo]) is to 'split' the array into a number of control sequences (creating an associative array of token sequences called, for example, \array[n], where n is an index value). This approach is used with the parser and scanner tables (which tend to be quite large) when the parser is 'optimized' (more about this later). Once again, it is possible to write the parser semantic actions without this (slightly unintuitive and cumbersome to implement) machinery.

This covers most of the routine computations inside semantic actions; all that is left is a way to 'tap' into the stack automaton built by bison using an interface similar to the special $n variables utilized by the 'genuine' bison parsers (i.e. written in C or any other target language supported by bison). This role is played by the several varieties of \yy p command sequences (for the sake of completeness, p stands for one of (n), [name], ]name[ or n; here n is a string of digits, and a 'name' is any name acceptable as a symbolic name for a term in bison). Instead of going into the minutiae of the various flavors of \yy-macros, let me just mention that one can get by with only two 'idioms' and still be able to write parsers of arbitrary sophistication: \yy(n) can be treated as a token register containing the value of the n-th term of the rule's right hand side, n > 0. The left hand side of a production is accessed through \yyval. A convenient shortcut is \yy0{⟨TeX material⟩}, which will expand the ⟨TeX material⟩ inside the braces. Thus, a simple way to concatenate the values of the first two production terms is \yy0{\the\yy(1)\the\yy(2)}.
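The way TeX_ strings turn into TeX material can be pictured with a toy C stand-in (invented code, far simpler than the real driver shipped with the package; only the documented convention of writing '/' for '\' inside the quoted string is reproduced, and the result is collected in a buffer rather than written to a file):

```c
#include <string.h>

/* A toy stand-in for the package's TeX_ macro: the real driver cycles
   through the semantic actions and writes them out as TeX macros.
   Here each '/' in the argument becomes '\' and the translated text
   is appended to a buffer standing in for the generated .tex file.  */
static char tex_out[256];

static void tex_emit(const char *s)
{
    size_t n = strlen(tex_out);
    for (; *s != '\0' && n + 1 < sizeof tex_out; s++)
        tex_out[n++] = (*s == '/') ? '\\' : *s;
    tex_out[n] = '\0';
}

#define TeX_(s) tex_emit(s)
```

After TeX_("/yy0{/the/yy(1)}"); the buffer holds the TeX action \yy0{\the\yy(1)}.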


The included bison parser can also be used to provide support for 'symbolic names', analogous to bison's $[name] syntax, but this requires a bit more effort on the user's part in order to initialize such support. It could make the parser more readable and maintainable, however.

Naturally, a parser writer may need a number of other data abstractions to complete the task. Since these are highly dependent on the nature of the processing the parser is supposed to provide, we refer the interested reader to the parsers included in the package as a source of examples of such specialized data structures.

Pretty-printing support with formatting hints

The scanner 'engine' is propelled by the same set of data structures and operations that drive the parser automaton: stacks, lists and the like. Table manipulation happens 'behind the scenes' just as in the case of the parser. There is also a stack of 'states' (more properly called subautomata) that is manipulated by the user directly, where the access to the stack is coded as a set of macros very similar to the corresponding C functions in the 'real' flex scanners. The 'handoff' from the scanner to the parser is implemented through a pair of registers: \yylval, a token register containing the value of the returned token, and \yychar, a \count register that contains the numerical value of the token to be returned.

Upon matching a token, the scanner passes one crucial piece of information to its user: the character sequence representing the token just matched (\yytext). This is not the whole story, though: three more token sequences are made available to the parser writer whenever a token is matched.

The first of these is simply a 'normalized' version of \yytext (called \yytextpure). In most cases it is a sequence of TeX tokens with the same character codes as the one in \yytext but with their category codes set to 11. In cases when the tokens in \yytext are not (character code, category code) pairs, a few simple conventions are followed, explained elsewhere. This sequence is provided merely for convenience and its typical use is to generate a key for an associative array.

The other two sequences are special 'stream pointers' that provide access to the extended scanner mechanism in order to implement passing of 'formatting hints' to the parser without introducing any changes to the original grammar, as explained below.

Unlike strict parsers employed by most compilers, a parser designed for pretty-printing cannot afford being too picky about the structure of its input ([Go] calls such parsers 'loose'). As a way of simple illustration, an isolated identifier, such as 'lg_integer', can be a type name, a variable name, or a structure tag (in a language like C, for example). If one expects the pretty-printer to typeset this identifier in a correct style, some context must be supplied, as well. There are several strategies a pretty-printer can employ to get hold of the necessary context. Perhaps the simplest way to handle this, and to reduce the complexity of the pretty-printing algorithm, is to insist on the user providing enough context for the parser to do its job. For short examples like the one above, this is an acceptable strategy. Unfortunately, it is easy to come up with longer snippets of grammatically deficient text that a pretty-printer should be expected to handle. Some pretty-printers, such as the one employed by CWEB and its ilk (WEB, FWEB), use a very flexible bottom-up technique that tries to make sense of as large a portion of the text as it can before outputting the result (see also [Wo], which implements a similar algorithm in LaTeX).

The expectation is that this algorithm will handle the majority of the cases, with the remaining few left for the author to correct. The question is, how can such a correction be applied?

CWEB itself provides two rather different mechanisms for handling these exceptions. The first uses direct typesetting commands (for example, @+ and @/ for cancelling and introducing a line break, respectively) to change the typographic output.

The second (preferred) way is to supply hidden context to the pretty-printer. Two commands, @; and @[...@], are used for this purpose. The former introduces a 'virtual semicolon' that acts in every way like a real one except it is not typeset (it is not output in the source file generated by CTANGLE, either, but this has nothing to do with pretty-printing, so I will not mention CTANGLE anymore). For instance, from the parser's point of view, if the preceding text was parsed as a 'scrap' of type exp, the addition of @; will make it into a 'scrap' of type stmt in CWEB's parlance. The latter construct, @[...@], is used to create an exp scrap out of whatever happens to be inside the brackets.

This is a powerful tool at one's disposal. Stylistically, this is the right way to handle exceptions, as it forces the writer to emphasize the logical structure of the formal text.
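In use, the two commands might look like this in a hypothetical CWEB section (the names are invented; only the placement of @; and @[...@] matters):

```cweb
@ A hypothetical fragment: the macro call expands to a complete
statement, so it is followed by a virtual semicolon, and the
bracketed text is forced to be an exp scrap.
@c
CHECK_ARGS(argc)@; /* virtual semicolon: parsed as ';', nothing typeset */
if (@[ VALID(p) @]) process(p);
```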


If the pretty-printing style is changed extensively later, the texts with such hidden contexts should be able to survive intact in the final document (as an example, using a break after every statement in C may no longer be considered appropriate, so any forced break introduced to support this convention would now have to be removed, whereas @;'s would simply quietly disappear into the background).

The same hidden context idea has another important advantage: with careful grammar fragmenting (facilitated by CWEB's or any other literate programming tool's 'hypertext' structure) and a more diverse hidden context (or even arbitrary hidden text) mechanism, it is possible to use a strict parser to parse incomplete language fragments. For example, the productions that are needed to parse C's expressions form a complete subset of the parser. If the grammar's 'start' symbol is changed to expression (instead of the translation-unit as it is in the full C grammar), a variety of incomplete C fragments can now be parsed and pretty-printed. Whenever such granularity is still too 'coarse', carefully supplied hidden context will give the pretty-printer enough information to adequately process each fragment. A number of such sub-parsers can be tried on each fragment (this may sound computationally expensive; however, in practice, a carefully chosen hierarchy of parsers will finish the job rather quickly) until a correct parser produces the desired output.

This somewhat lengthy discussion brings us to the question directly related to the tools described in this article: how does one provide typographical hints or hidden context to the parser?

One obvious solution is to build such hints directly into the grammar. The parser designer can, for instance, add new tokens (terminals, say, BREAK_LINE) to the grammar and extend the production set to incorporate the new additions. The risk of introducing new conflicts into the grammar is low (although not entirely non-existent, due to the lookahead limitations of LR(1) grammars) and the changes required are easy, although very tedious, to incorporate. In addition to being labor intensive, this solution has two other significant shortcomings: it alters the original grammar and hides its logical structure, and it 'bakes' the pretty-printing conventions into the language structure (making 'hidden' context much less 'stealthy').

A much better approach involves inserting the hints at the lexing stage and passing this information to the parser as part of the token 'values'. The hints themselves can masquerade as characters ignored by the scanner (white space, for example) and be preprocessed by a specially designed input routine. The scanner then simply passes on the values to the parser.

The difficulty lies in synchronizing the token production with the parser. This subtle complication is very familiar to anyone who has designed TeX's output routines: the parser and the lexer are not synchronous, in the sense that the scanner might be reading several tokens (in the case of general LR(n) parsers) ahead of the parser before deciding on how to proceed (the same way TeX can consume a whole paragraph's worth of text before exercising its page builder).

If we simple-mindedly let the scanner return every hint it has encountered so far, we may end up feeding the parser the hints meant for the token that appears after the fragment the parser is currently working on. In other words, when the scanner 'backs up' it must correctly back up the hints as well.

This is exactly what the scanner produced by the tools in this package does: along with the main stream of tokens meant for the parser, it produces two hidden streams (called the \format stream and the \stash stream) and provides the parser with two strings (currently only strings of digits are used, although arbitrary sequences of TeX tokens can be used as pointers) with the promise that all the 'hints' between the beginning of the corresponding stream and the point labelled by the current stream pointer appeared among the characters up to, and possibly including, the ones matched as the current token. The macros to extract the relevant parts of the streams (\yyreadfifo and its cousins) are provided for the convenience of the parser designer. The interested reader can consult the input routine macros for the details of the internal representation of the streams.

In the interest of full disclosure, let me point out that this simple technique introduces a significant strain on TeX's computational resources: the lowest level macros, the ones that handle character input and are thus executed (sometimes multiple times) for every character in the input stream, are rather complicated and therefore slow. Whenever the use of such streams is not desired, a simpler input routine can be written to speed up the process (see \yyinputtrivial for a working example of such a macro).

The parser function

To achieve such a tight integration with bison, its parser template, yyparse(), was simply translated into TeX using the following well-known method.


Given the code (where goto's are the only means of branching but can appear inside conditionals):

  label A: ...
    [more code ...]
    goto C;
    [more code ...]
  label B: ...
    [more code ...]
    goto A;
    [more code ...]
  label C: ...
    [more code ...]

one way to translate it into TeX is to define a set of macros (call them \labelA, \labelAtail and so forth for clarity) that end in \next (a common name for this purpose). Now, \labelA will implement the code that comes between label A: and goto C;, whereas \labelAtail is responsible for the code after goto C; and before label B: (provided no other goto's intervene, which can always be arranged). The conditional preceding goto C; can now be written in TeX as

  \if(condition)
    \let\next=\labelC
  \else
    \let\next=\labelAtail
  \fi

where (condition) is an appropriate translation of the corresponding condition in the code being translated (usually, one of '=' or '≠'). Further details can be extracted from the TeX code that implements these functions, where the corresponding C code is presented alongside the macros that mimic its functionality.

Debugging

If the tools in the package are used to create medium to high complexity parsers, the question of debugging will come up sooner or later. The grammar design stage of this process can utilize all the excellent debugging facilities provided by bison and flex (reporting of conflicts, output of the automaton, etc.). The Makefiles supplied with the package will automatically output all the debugging information the corresponding tool can provide. Eventually, when all the conflicts are ironed out and the parser begins to process input without performing any actions, it becomes important to have a way to see 'inside' the parsing process. Since the processing performed by the generated parser is done in several stages, the debugging may become rather involved.

All the debugging features are activated by using various \iftrace... conditionals, as well as \ifyyinputdebug and \ifyyflexdebug (for example, to look at the parser stack, one would set \tracestackstrue). When all of the conditionals are activated, a lot of output is produced. At this point it is important to narrow down the problem area and only activate the debugging features relevant to any errant behaviour exhibited by the parser. Most of the debugging features built into ordinary bison parsers (and flex scanners) are available.

In general, debugging parsers and scanners (and debugging in general) is a very deep topic that may require a separate paper (or maybe a book!) all by itself, so I will simply leave it here and encourage the reader to experiment with the included parsers to learn the general operational principles behind the parsing automaton. One needs to be aware that, unlike the 'real' C parsers, the TeX parser has to deal with more than simply straight text. So if it looks like the parser (or the scanner) absolutely has to accept the (rejected) input displayed on the screen, just remember that an 'a' with category code 11 and an 'a' with category code 12 look the same on the terminal, while TeX and the parser/scanner may treat them as completely different characters (this behavior itself can be fine-tuned by changing \yyinput).

Speeding up the parser

By default, the generated parser and scanner keep all of their tables in separate token registers. Each stack is kept in a single macro. Thus, every time a table is accessed, it has to be expanded, making the table access latency linear in the size of the table. The same holds for stacks and the action 'switches', of course. While keeping the parser tables (which are constant) in token registers does not have any better rationale than saving control sequence memory (the most abundant memory in TeX), this way of storing stacks does have an advantage when multiple parsers come into play simultaneously. All one has to do to switch from one parser to another is to save the state by renaming the stack control sequences accordingly.
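The access-latency point can be pictured in C terms (invented code, not part of the package): a table kept in one sequence must be scanned past its earlier entries, while a table 'split' into individual cells, like the \array[n] technique mentioned earlier, is indexed directly.

```c
#include <stdlib.h>
#include <string.h>

/* Illustration of the two storage schemes.  A table kept in a single
   'macro' is modeled as one space-separated string: reaching entry n
   means skipping n separators, so the cost grows with n.  A table
   'split' into one cell per name is modeled as a plain array, where
   entry n is reached in constant time.                              */

static int entry_linear(const char *table, int n)
{
    while (n-- > 0)
        table = strchr(table, ' ') + 1;   /* skip one entry per step */
    return atoi(table);
}

static int entry_direct(const int *cells, int n)
{
    return cells[n];
}
```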

When the parser and scanner are 'optimized' (by saying \def\optimization{5}, for example), all these control sequences are 'spread over' the appropriate associative arrays (by creating a number of control sequences that look like \array[n], where n is the index, as explained above). While it is certainly possible to optimize only some of the parsers (if your document uses multiple) or even only some parts of a given parser (or scanner), the details of how to do this are rather technical and are left for the reader to discover by reading the examples supplied with the package. At least at the beginning it is easier to simply set the highest optimization level and use it consistently throughout the document.

Use with CWEB

Since the macros in the package were designed to support pretty-printing of languages other than C in CWEB, it makes sense to spend a few paragraphs on this particular application. The CWEB system consists of two weakly related programs: CWEAVE and CTANGLE. The latter extracts the C portion of the user's input, and outputs a C file after an appropriate rearrangement of the various sections of the code. The task of CWEAVE is very different and arguably more complex: not only does it have to be aware of the general 'hierarchy' of the various subsections of the program to create cross references, an index, etc., it also has to understand enough of the C code in order to pretty-print it. Whereas CTANGLE simply separates the program code from the programmer's documentation, rearranges it and outputs the original program text (with added #line directives and simple C comments that can be easily removed in postprocessing if necessary), the output of CWEAVE bears very little resemblance to the original program. It might sound a bit exaggerated but CWEAVE's processing is 'one-way': it would be difficult or even impossible to write software that 'undoes' the pretty-printing performed by CWEAVE.

There is, however, a loophole that allows one to use CWEB with practically any language, and pretty-print the results, if an appropriate 'filter' is available. The saving grace comes in the form of CWEB's verbatim output: any text inside @= and @> undergoes some minimal processing (mainly to 'escape' dangerous TeX characters such as '$') and is put inside \vb{...} by CWEAVE.

The macros in the package take advantage of this feature by collecting all the text inside \vb groups and trying to parse it. If the parsing pass is successful, pretty-printed output is produced; if not, the text is output in 'typewriter' style.

With languages such as bison's input script, an additional complication has to be dealt with: most of the time the actions are written in C, so it makes sense to use CWEAVE's C pretty-printer to typeset the action code. Most material outside of \vb groups is therefore assumed to be C code and is carefully collected and 'cleaned up' by the macros included in the package.

For the purposes of documenting the TeX parser, one additional feature of CWEAVE is taken advantage of: the text inside double quotes, "...", is treated similarly to the verbatim portion of the input (this can be viewed as a 'local' version of the verbatim sections). Moreover, CWEAVE allows one to treat a function name (or nearly any identifier) as a TeX macro. These two features are enough to implement pretty-printing of semantic actions in TeX. The macros will turn an input string such as, e.g., 'TeX_( "\\relax" );' into its pretty-printed form (for the sake of convenience, the string above would actually be written as 'TeX_( "/relax" );', as explained in the manual for the package). See the documentation that comes with the package and the bison language pretty-printer implementation for any additional details.

An example

As an example, let us walk through the development process of a simple parser. Since the language itself is not of any particular importance, a simple grammar for additive expressions was chosen. The example, with a detailed description and all the necessary files, is included in the examples directory. The Makefile there allows one to type, say, make step1 to produce all the files needed in the first step of this short walk-through. Finally, make docs will produce a pretty-printed version of the grammar, the regular expressions, and the test TeX file, along with detailed explanations of every stage.

As the first step, one creates a bison input file (expp.y) and a similar input for flex (expl.l). A typical fragment of expp.y looks like the following:

  value:
    expression {TeX_("/yy0{/the/yy(1)}");}
  ;

The scanner's regular expression section, in its entirety, is:

  [ \f\n\t\v]  {TeX_("/yylexnext");}
  {id}         {
    TeX_("/yylexreturnval{IDENTIFIER}");}
  {int}        {
    TeX_("/yylexreturnval{INTEGER}");}
  [+*()]       {TeX_("/yylexreturnchar");}
  .            {
    TeX_("/iftracebadchars");
    TeX_("  /yycomplain{%%");
    TeX_("  invalid character(s): %%");
    TeX_("  /the/yytext}");
    TeX_("/fi");
    TeX_("/yylexreturn{$undefined}");
  }


  TeX_(" /the/yytext}");
  TeX_("/fi");
  TeX_("/yylexreturn{$undefined}");
}

Once the files have been prepared and debugged, the next step is to generate the 'driver' files, ptabout and ltabout. For the parser 'driver', this is done with

bison expp.y -o expp.c
gcc -DPARSER_FILE=\
  \"examples/expression/expp.c\" \
  -o ptabout ../../mkeparser.c

The first line generates the parser from the bison input file that was prepared in the first step, and the next line uses this file to produce a 'driver'. If the included Makefile is used, the file mkeparser.c is generated automatically; otherwise one has to make sure that it exists and resides in the appropriate directory first. It has no external dependencies and can be freely moved to any place that is convenient. Next, run ptabout and ltabout to produce the automata tables:

ptabout --optimize-actions ptab.tex
ltabout --optimize-actions ltab.tex

Now, look inside expression.sty for a way to include the parser in your own documents, or simply \input it in your own TEX file. Executing make test.tex will produce a test file for the new parser. This is it!

Acknowledgment

The author would like to thank the editors, Barbara Beeton and Karl Berry, for a number of valuable suggestions and improvements to this article.

References

[Ah] Alfred V. Aho et al., Compilers: Principles, Techniques, and Tools, Pearson Education, 2006.
[Bi] Charles Donnelly and Richard Stallman, Bison, The Yacc-compatible Parser Generator, The Free Software Foundation, 2013. http://www.gnu.org/software/bison/
[DEK1] Donald E. Knuth, The TEXbook, Addison-Wesley, Reading, Massachusetts, 1984.
[DEK2] Donald E. Knuth, The future of TEX and METAFONT, TUGboat 11 (4), p. 489, 1990. http://tug.org/TUGboat/tb11-4/tb30futuretex.pdf
[Do] Jean-luc Doumont, Pascal pretty-printing: An example of "preprocessing with TEX", TUGboat 15 (3), 1994 — Proceedings of the 1994 TUG Annual Meeting. http://tug.org/TUGboat/tb15-3/tb44doumont.pdf
[Er] Sebastian Thore Erdweg and Klaus Ostermann, Featherweight TEX and Parser Correctness, Proceedings of the Third International Conference on Software Language Engineering, pp. 397–416, Springer-Verlag, Berlin, Heidelberg, 2011.
[Fi] Jonathan Fine, The \CASE and \FIND macros, TUGboat 14 (1), pp. 35–39, 1993. http://tug.org/TUGboat/tb14-1/tb38fine.pdf
[Go] Pedro Palao Gostanza, Fast scanners and self-parsing in TEX, TUGboat 21 (3), 2000 — Proceedings of the 2000 Annual Meeting. http://tug.org/TUGboat/tb21-3/tb68gost.pdf
[Gr] Andrew Marc Greene, BASIX — An interpreter written in TEX, TUGboat 11 (3), 1990 — Proceedings of the 1990 TUG Annual Meeting. http://tug.org/TUGboat/tb11-3/tb29greene.pdf
[Ha] Hans Hagen, LuaTEX: Halfway to version 1, TUGboat 30 (2), pp. 183–186, 2009. http://tug.org/TUGboat/tb30-2/tb95hagen-.pdf
[Ie] R. Ierusalimschy et al., Lua 5.1 Reference Manual, Lua.org, August 2006. http://www.lua.org/manual/5.1/
[La] The l3regex package: Regular expressions in TEX, The LATEX3 Project. http://www.ctan.org/pkg/l3regex
[Pa] Vern Paxson et al., Lexical Analysis With Flex, for Flex 2.5.37, July 2012. http://flex.sourceforge.net/manual/
[Wo] Marcin Woliński, Pretprin — A LATEX2ε package for pretty-printing texts in formal languages, TUGboat 19 (3), 1998 — Proceedings of the 1998 TUG Annual Meeting. http://tug.org/TUGboat/tb19-3/tb60wolin.pdf

Alexander Shibakov
Dept. of Mathematics
Tennessee Tech. University
Cookeville, TN
http://math.tntech.edu/alex
