TUGboat, Volume 35 (2014), No. 1 71

Parsers in TeX and using CWEB for general pretty-printing

Alexander Shibakov

In this article I describe a collection of TeX macros and a few simple programs called SPLinT that enable the use of the standard parser and scanner generator tools, bison and flex, to produce very general parsers and scanners coded as TeX macros. SPLinT is freely available from http://ctan.org/pkg/splint and http://math.tntech.edu/alex.

Introduction

The need to process formally structured languages inside TeX documents is neither new nor uncommon. Several graphics extensions for TeX (and LaTeX) have introduced a variety of small specialized languages for their purposes that depend on simple (and not so simple) interpreters coded as TeX macros. A number of pretty-printing macros take advantage of different parsing techniques to achieve their goals (see [Go], [Do], and [Wo]).

Efforts to create general and robust parsing frameworks inside TeX go back to the origins of TeX itself. A well-known BASIC subset interpreter, BASIX (see [Gr]), was written as a demonstration of the flexibility of TeX as a programming language and a showcase of TeX's ability to handle a variety of abstract data structures. On the other hand, a relatively recent addition to the LaTeX toolbox, l3regex (see [La]), provides a powerful and very general way to perform regular expression matching in LaTeX, which can be used (among other things) to design parsers and scanners.

Paper [Go] contains a very good overview of several approaches to parsing and tokenizing in TeX and outlines a universal framework for parser design using TeX macros. In an earlier article (see [Wo]), Marcin Woliński describes a parser creation suite paralleling the technique used by CWEB (CWEB's 'grammar' is hard-coded into CWEAVE, whereas Woliński's approach is more general). One commonality between these two methods is a highly customized tokenizer (or scanner) used as the input to the parser proper. Woliński's design uses a finite automaton as the scanner engine with a 'manually' designed set of states. No backing-up mechanism was provided, so matching, say, the longest input would require some custom coding (it is, perhaps, worth mentioning here that a backup mechanism is all one needs to turn any regular language scanner into a general CWEB-type parser). The scanner in [Go] was designed mainly with efficiency in mind and thus relies on a number of very clever techniques that are highly language-specific (out of necessity). Since TeX is a system well-suited for typesetting technical documents, pretty-printing texts written in formal languages is a common task and is also one of the primary reasons to consider a parser written in TeX.

The author's initial motivation for writing the macros described in this article grew out of the desire to fully document a few embedded microcontroller projects that contain a mix of C code, Makefiles, linker scripts, etc. While the majority of code for such projects is written in C, superbly processed by CWEB itself, some crucial information resides in the kinds of files mentioned above and can only be handled by CWEB's verbatim output (with some minimal postprocessing, mainly to remove the #line directives left by CTANGLE).

Parsing with TeX vs. others

Naturally, using TeX in isolation is not the only way to produce pretty-printed output. The CWEB system for writing structured documentation uses TeX merely as a typesetting engine, while handing over the job of parsing and preprocessing the user's input to a program built specifically for that purpose. Sometimes, however, a paper or a book written in TeX contains a few short examples of programs written in another programming language. Using a system such as CWEB to process these fragments is certainly possible (although it may become somewhat involved) but a more natural approach would be to create a parser that can process such texts (with some minimal help from the user) entirely inside TeX itself. As an example, pascal (see [Go]) was created to pretty-print Pascal programs using TeX. It used a custom scanner for a subset of standard Pascal and a parser, generated from an LL(1) Pascal grammar by a parser generator called parTeX. Even if CWEB or a similar tool is used, there may still be a need to parse a formal language inside TeX. One example would be the use of CWEB to handle a language other than C.

Before I proceed with the description of the tool that is the main subject of this paper, allow me to pause for just a few moments to discuss the wisdom (or lack thereof) of laying the task of parsing formal texts entirely on TeX's shoulders. In addition to using an external program to preprocess a TeX document, some recent developments allow one to implement a parser in a language 'meant for such tasks' inside an extension of TeX. We are speaking of course about LuaTeX (see [Ha]), which essentially implements an entirely separate interface to TeX's typesetting mechanisms and data structures in Lua (see [Lu]), 'grafted' onto a TeX extension.

Although I feel nothing but admiration for the LuaTeX developers, and completely share their desire to empower TeX by providing a general purpose programming language on top of its internal mechanisms, I would like to present three reasons to avoid taking advantage of LuaTeX's impressive capabilities for this particular task.

First, I am unaware of any standard tools for generating parsers and scanners in Lua (of course, it would be just as easy to use the approach described here to create such tools). At this point in time, it is just as easy to coax standard parser generators into outputting parsers in TeX as it is to make them output Lua.

Second, I am a great believer in generating 'archival quality' documents: standard TeX has been around for almost three decades in a practically unchanged form, an eternity in the software world. The parser produced using the methods outlined in this paper uses standard (plain) TeX exclusively. Moreover, if the grammar is unchanged, the parser code itself (i.e. its semantic actions) is very readable, and can be easily modified without going through the whole pipeline (bison, flex, etc.) again. A full record of the grammar is left with the generated parser and scanner, so even if the more 'volatile' tools, such as bison and flex, become incompatible with the package, the parser can still be utilized with TeX alone. Perhaps the following quote by D. Knuth (see [DEK2]) would help to reinforce this point of view: "Of course I do not claim to have found the best solution to every problem. I simply claim that it is a great advantage to have a fixed point as a building block."

Finally, the idea that TeX is somehow unsuitable for such tasks may have been overstated. While it is true that TeX's macro language lacks some of the expressive ability of its 'general purpose' brethren, it does possess a few qualities that make it quite adept at processing text (it is a typesetting language, after all!). Among these features are: a built-in hashing mechanism (accessible through the \csname...\endcsname and \string primitives) for storing and accessing control sequence names and creating associative arrays; a number of tools and data structures for comparing and manipulating strings (token registers, the \ifx primitive, various expansion primitives: \edef, \expandafter and the like); and even string matching and replacement (using delimited parameters in macros). TeX notoriously lacks a good (i.e. efficient and easy to use) framework for storing and manipulating arrays and lists (see the discussion of list macros in Appendix D of The TeXbook and in [Gr]) but this limitation is readily overcome by putting some extra care into one's macros.

Languages, grammars, parsers, and TeX

Or ...
  Tokens and tables keep macros in check.
  Make 'em with bison, use WEAVE as a tool.
  Add TeX and CTANGLE, and C to the pool.
  Reduce 'em with actions, look forward, not back.
  Macros, productions, recursion and stack!
    Computer generated (most likely)

The goal of the software described in this article, SPLinT (Simple Parsing and Lexing in TeX, or, in the tradition of GNU, SPLinT Parses Languages in TeX), is to give a macro writer the ability to use standard parser/scanner generator tools to produce TeX macros capable of parsing formal languages.

Let me begin by presenting a 'bird's eye view' of the setup and the workflow one would follow to create a new parser with this package. To take full advantage of this software, two external programs (three if one counts a C compiler) are required: bison and flex (see [Bi] and [Pa]), the parser and scanner generators, respectively. Both are freely available under the terms of the General Public License version 3 or higher and are standard tools included in practically every modern GNU/Linux distribution. Versions that run under a number of other operating systems exist as well.

While the software allows the creation of both parsers and scanners in TeX, the steps involved in making a scanner are very similar to those required to generate a parser, so only the parser generation will be described below.

Setting the semantic actions aside for the moment, one begins by preparing a generic bison input file, following some simple guidelines. Not all bison options are supported (the most glaring omission is the ability to generate a general LR (glr) parser, but this may be added in the future); in every other respect it is an ordinary bison grammar. In some cases, a bison grammar may already exist and can be turned into a TeX parser with just a few (or none!) modifications and a new set of semantic actions (written in TeX, of course). As a matter of example, the grammar used to pretty-print bison grammars in CWEB that comes with this package was adopted (with very minor modifications, mainly to create a more logical presentation in CWEB) from the original grammar used by bison itself.


Once the grammar has been debugged (using a combination of bison's own impressive debugging facilities and the debugging features supported by the macros in the package), it is time to write the semantic actions for the syntax-directed translation (see [Ah]). These are ordinary TeX macros written using a few simple conventions listed below. First, the actions themselves will be executed inside a large \ifcase statement (this is not always the case, see the discussion of 'optimization' below, but it would be better to assume that it is); thus, care must be taken to write the macros so that they can be 'skipped' by TeX's scanning mechanism. Second, instead of using bison's $n syntax to access the value stack, a variety of \yy p macros are provided. Finally, the 'driver' (a small C program, see below) provided with the package merely cycles through the actions to output TeX macros, so one has to use one of the C macros provided with the package to output TeX in a proper form. One such macro is TeX_, used as TeX_("{TeX tokens}");.

The next step is the most technical, and the one most in need of automation. A Makefile provided with the package shows how such automation can be achieved. The newly generated parser (the '.c-file' produced by bison) is #include'd in (yes, included, not merely linked to) a special 'driver' file. No modifications to the driver file or the bison-produced parser are necessary; all one has to do is call a C compiler with an appropriately defined macro (see the Makefile for details). The resulting executable is then run, which produces a .tex file that contains the macros necessary to use the freshly-minted parser in TeX. This short brush with a C compiler is the only time one ventures outside of the world of pure TeX to build a parser with this software (not counting the one needed to create the accompanying scanner if one is desired). It is possible to add a 'plugin' to bison to create a 'TeX output mode' but at the moment the 'lazy' version seems to be sufficient.

Now \input this file into your TeX document along with the macros that come with the package and voilà! You have a brand new parser in TeX! A full-featured parser for the bison input file format is included, and can be used as a template. For smaller projects, it might help to take a look at the examples portion of the package.

The discussion above glosses over a few important details that anybody who has experience writing 'ordinary' (i.e. non-TeX) parsers in bison would be eager to find out. Let us now discuss some of these details.

Data structures for parsing

A surprisingly modest amount of machinery is required to make a bison-generated parser 'tick'. In addition to the standard arithmetic 'bag of tricks' (integer addition, multiplication and conditionals), some basic integer and string array (or closely related list and stack) manipulation is all that is needed.

Parser tables and stack access 'in the raw' are normally hidden from the parser designer but creating lists and arrays is standard fare for most semantic actions. The bison parser supplied with the package does not use any techniques that are more sophisticated than simple token register operations. Representing and accessing arrays this way (see Appendix D of The TeXbook or the \concat macro in the package) is simple and intuitive but computationally expensive. The computational costs are not prohibitive, though, as long as the arrays are kept short. In the case of large arrays that are read often, it pays to use a different mechanism. One such technique (used also in [Go], [Gr], and [Wo]) is to 'split' the array into a number of control sequences (creating an associative array of token sequences called, for example, \array[n], where n is an index value). This approach is used with the parser and scanner tables (which tend to be quite large) when the parser is 'optimized' (more about this later). Once again, it is possible to write the parser semantic actions without this (slightly unintuitive and cumbersome to implement) machinery.

This covers most of the routine computations inside semantic actions; all that is left is a way to 'tap' into the stack automaton built by bison using an interface similar to the special $n variables utilized by the 'genuine' bison parsers (i.e. written in C or any other target language supported by bison). This role is played by the several varieties of \yy p command sequences (for the sake of completeness, p stands for one of (n), [name], ]name[ or n; here n is a string of digits, and a 'name' is any name acceptable as a symbolic name for a term in bison). Instead of going into the minutiae of the various flavors of \yy-macros, let me just mention that one can get by with only two 'idioms' and still be able to write parsers of arbitrary sophistication: \yy(n) can be treated as a token register containing the value of the n-th term of the rule's right hand side, n > 0. The left hand side of a production is accessed through \yyval. A convenient shortcut is \yy0{⟨TeX material⟩}, which will expand the ⟨TeX material⟩ inside the braces. Thus, a simple way to concatenate the values of the first two production terms is \yy0{\the\yy(1)\the\yy(2)}.
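The way TeX_ strings turn into TeX material can be pictured with a toy C stand-in (invented code, far simpler than the real driver shipped with the package; only the documented convention of writing '/' for '\' inside the quoted string is reproduced, and the result is collected in a buffer rather than written to a file):

```c
#include <string.h>

/* A toy stand-in for the package's TeX_ macro: the real driver cycles
   through the semantic actions and writes them out as TeX macros.
   Here each '/' in the argument becomes '\' and the translated text
   is appended to a buffer standing in for the generated .tex file.  */
static char tex_out[256];

static void tex_emit(const char *s)
{
    size_t n = strlen(tex_out);
    for (; *s != '\0' && n + 1 < sizeof tex_out; s++)
        tex_out[n++] = (*s == '/') ? '\\' : *s;
    tex_out[n] = '\0';
}

#define TeX_(s) tex_emit(s)
```

After TeX_("/yy0{/the/yy(1)}"); the buffer holds the TeX action \yy0{\the\yy(1)}.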


The included bison parser can also be used to provide support for 'symbolic names', analogous to bison's $[name] syntax, but this requires a bit more effort on the user's part in order to initialize such support. It could make the parser more readable and maintainable, however.

Naturally, a parser writer may need a number of other data abstractions to complete the task. Since these are highly dependent on the nature of the processing the parser is supposed to provide, we refer the interested reader to the parsers included in the package as a source of examples of such specialized data structures.

Pretty-printing support with formatting hints

The scanner 'engine' is propelled by the same set of data structures and operations that drive the parser automaton: stacks, lists and the like. Table manipulation happens 'behind the scenes' just as in the case of the parser. There is also a stack of 'states' (more properly called subautomata) that is manipulated by the user directly, where the access to the stack is coded as a set of macros very similar to the corresponding C functions in the 'real' flex scanners. The 'handoff' from the scanner to the parser is implemented through a pair of registers: \yylval, a token register containing the value of the returned token, and \yychar, a \count register that contains the numerical value of the token to be returned.

Upon matching a token, the scanner passes one crucial piece of information to its user: the character sequence representing the token just matched (\yytext). This is not the whole story, though: three more token sequences are made available to the parser writer whenever a token is matched.

The first of these is simply a 'normalized' version of \yytext (called \yytextpure). In most cases it is a sequence of TeX tokens with the same character codes as the one in \yytext but with their category codes set to 11. In cases when the tokens in \yytext are not (character code, category code) pairs, a few simple conventions are followed, explained elsewhere. This sequence is provided merely for convenience and its typical use is to generate a key for an associative array.

The other two sequences are special 'stream pointers' that provide access to the extended scanner mechanism in order to implement passing of 'formatting hints' to the parser without introducing any changes to the original grammar, as explained below.

Unlike strict parsers employed by most compilers, a parser designed for pretty-printing cannot afford being too picky about the structure of its input ([Go] calls such parsers 'loose'). As a way of simple illustration, an isolated identifier, such as 'lg_integer', can be a type name, a variable name, or a structure tag (in a language like C, for example). If one expects the pretty-printer to typeset this identifier in a correct style, some context must be supplied, as well. There are several strategies a pretty-printer can employ to get hold of the necessary context. Perhaps the simplest way to handle this, and to reduce the complexity of the pretty-printing algorithm, is to insist on the user providing enough context for the parser to do its job. For short examples like the one above, this is an acceptable strategy. Unfortunately, it is easy to come up with longer snippets of grammatically deficient text that a pretty-printer should be expected to handle. Some pretty-printers, such as the one employed by CWEB and its ilk (WEB, FWEB), use a very flexible bottom-up technique that tries to make sense of as large a portion of the text as it can before outputting the result (see also [Wo], which implements a similar algorithm in LaTeX).

The expectation is that this algorithm will handle the majority of the cases, with the remaining few left for the author to correct. The question is, how can such a correction be applied?

CWEB itself provides two rather different mechanisms for handling these exceptions. The first uses direct typesetting commands (for example, @+ and @/ for cancelling and introducing a line break, respectively) to change the typographic output.

The second (preferred) way is to supply hidden context to the pretty-printer. Two commands, @; and @[...@], are used for this purpose. The former introduces a 'virtual semicolon' that acts in every way like a real one except it is not typeset (it is not output in the source file generated by CTANGLE, either, but this has nothing to do with pretty-printing, so I will not mention CTANGLE anymore). For instance, from the parser's point of view, if the preceding text was parsed as a 'scrap' of type exp, the addition of @; will make it into a 'scrap' of type stmt in CWEB's parlance. The latter construct, @[...@], is used to create an exp scrap out of whatever happens to be inside the brackets.

This is a powerful tool at one's disposal. Stylistically, this is the right way to handle exceptions, as it forces the writer to emphasize the logical structure of the formal text.
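In use, the two commands might look like this in a hypothetical CWEB section (the names are invented; only the placement of @; and @[...@] matters):

```cweb
@ A hypothetical fragment: the macro call expands to a complete
statement, so it is followed by a virtual semicolon, and the
bracketed text is forced to be an exp scrap.
@c
CHECK_ARGS(argc)@; /* virtual semicolon: parsed as ';', nothing typeset */
if (@[ VALID(p) @]) process(p);
```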


If the pretty-printing style is changed extensively later, the texts with such hidden contexts should be able to survive intact in the final document (as an example, using a break after every statement in C may no longer be considered appropriate, so any forced break introduced to support this convention would now have to be removed, whereas @;'s would simply quietly disappear into the background).

The same hidden context idea has another important advantage: with careful grammar fragmenting (facilitated by CWEB's or any other literate programming tool's 'hypertext' structure) and a more diverse hidden context (or even arbitrary hidden text) mechanism, it is possible to use a strict parser to parse incomplete language fragments. For example, the productions that are needed to parse C's expressions form a complete subset of the parser. If the grammar's 'start' symbol is changed to expression (instead of the translation-unit as it is in the full C grammar), a variety of incomplete C fragments can now be parsed and pretty-printed. Whenever such granularity is still too 'coarse', carefully supplied hidden context will give the pretty-printer enough information to adequately process each fragment. A number of such sub-parsers can be tried on each fragment (this may sound computationally expensive; however, in practice, a carefully chosen hierarchy of parsers will finish the job rather quickly) until a correct parser produces the desired output.

This somewhat lengthy discussion brings us to the question directly related to the tools described in this article: how does one provide typographical hints or hidden context to the parser?

One obvious solution is to build such hints directly into the grammar. The parser designer can, for instance, add new tokens (terminals, say, BREAK_LINE) to the grammar and extend the production set to incorporate the new additions. The risk of introducing new conflicts into the grammar is low (although not entirely non-existent, due to the lookahead limitations of LR(1) grammars) and the changes required are easy, although very tedious, to incorporate. In addition to being labor intensive, this solution has two other significant shortcomings: it alters the original grammar and hides its logical structure, and it 'bakes' the pretty-printing conventions into the language structure (making 'hidden' context much less 'stealthy').

A much better approach involves inserting the hints at the lexing stage and passing this information to the parser as part of the token 'values'. The hints themselves can masquerade as characters ignored by the scanner (white space, for example) and be preprocessed by a specially designed input routine. The scanner then simply passes on the values to the parser.

The difficulty lies in synchronizing the token production with the parser. This subtle complication is very familiar to anyone who has designed TeX's output routines: the parser and the lexer are not synchronous, in the sense that the scanner might be reading several tokens (in the case of general LR(n) parsers) ahead of the parser before deciding on how to proceed (the same way TeX can consume a whole paragraph's worth of text before exercising its page builder).

If we simple-mindedly let the scanner return every hint it has encountered so far, we may end up feeding the parser the hints meant for the token that appears after the fragment the parser is currently working on. In other words, when the scanner 'backs up' it must correctly back up the hints as well.

This is exactly what the scanner produced by the tools in this package does: along with the main stream of tokens meant for the parser, it produces two hidden streams (called the \format stream and the \stash stream) and provides the parser with two strings (currently only strings of digits are used, although arbitrary sequences of TeX tokens can be used as pointers) with the promise that all the 'hints' between the beginning of the corresponding stream and the point labelled by the current stream pointer appeared among the characters up to, and possibly including, the ones matched as the current token. The macros to extract the relevant parts of the streams (\yyreadfifo and its cousins) are provided for the convenience of the parser designer. The interested reader can consult the input routine macros for the details of the internal representation of the streams.

In the interest of full disclosure, let me point out that this simple technique introduces a significant strain on TeX's computational resources: the lowest level macros, the ones that handle character input and are thus executed (sometimes multiple times) for every character in the input stream, are rather complicated and therefore slow. Whenever the use of such streams is not desired, a simpler input routine can be written to speed up the process (see \yyinputtrivial for a working example of such a macro).

The parser function

To achieve such a tight integration with bison, its parser template, yyparse(), was simply translated into TeX using the following well-known method.


Given the code (where goto's are the only means of branching but can appear inside conditionals):

  label A: ...
    [more code ...]
    goto C;
    [more code ...]
  label B: ...
    [more code ...]
    goto A;
    [more code ...]
  label C: ...
    [more code ...]

one way to translate it into TeX is to define a set of macros (call them \labelA, \labelAtail and so forth for clarity) that end in \next (a common name for this purpose). Now, \labelA will implement the code that comes between label A: and goto C;, whereas \labelAtail is responsible for the code after goto C; and before label B: (provided no other goto's intervene, which can always be arranged). The conditional preceding goto C; can now be written in TeX as

  \if(condition)
    \let\next=\labelC
  \else
    \let\next=\labelAtail
  \fi

where (condition) is an appropriate translation of the corresponding condition in the code being translated (usually, one of '=' or '≠'). Further details can be extracted from the TeX code that implements these functions, where the corresponding C code is presented alongside the macros that mimic its functionality.

Debugging

If the tools in the package are used to create medium to high complexity parsers, the question of debugging will come up sooner or later. The grammar design stage of this process can utilize all the excellent debugging facilities provided by bison and flex (reporting of conflicts, output of the automaton, etc.). The Makefiles supplied with the package will automatically output all the debugging information the corresponding tool can provide. Eventually, when all the conflicts are ironed out and the parser begins to process input without performing any actions, it becomes important to have a way to see 'inside' the parsing process. Since the processing performed by the generated parser is done in several stages, the debugging may become rather involved.

All the debugging features are activated by using various \iftrace... conditionals, as well as \ifyyinputdebug and \ifyyflexdebug (for example, to look at the parser stack, one would set \tracestackstrue). When all of the conditionals are activated, a lot of output is produced. At this point it is important to narrow down the problem area and only activate the debugging features relevant to any errant behaviour exhibited by the parser. Most of the debugging features built into ordinary bison parsers (and flex scanners) are available.

In general, debugging parsers and scanners (and debugging in general) is a very deep topic that may require a separate paper (or maybe a book!) all by itself, so I will simply leave it here and encourage the reader to experiment with the included parsers to learn the general operational principles behind the parsing automaton. One needs to be aware that, unlike the 'real' C parsers, the TeX parser has to deal with more than simply straight text. So if it looks like the parser (or the scanner) absolutely has to accept the (rejected) input displayed on the screen, just remember that an 'a' with category code 11 and an 'a' with category code 12 look the same on the terminal, while TeX and the parser/scanner may treat them as completely different characters (this behavior itself can be fine-tuned by changing \yyinput).

Speeding up the parser

By default, the generated parser and scanner keep all of their tables in separate token registers. Each stack is kept in a single macro. Thus, every time a table is accessed, it has to be expanded, making the table access latency linear in the size of the table. The same holds for stacks and the action 'switches', of course. While keeping the parser tables (which are constant) in token registers does not have any better rationale than saving control sequence memory (the most abundant memory in TeX), this way of storing stacks does have an advantage when multiple parsers come into play simultaneously. All one has to do to switch from one parser to another is to save the state by renaming the stack control sequences accordingly.
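The access-latency point can be pictured in C terms (invented code, not part of the package): a table kept in one sequence must be scanned past its earlier entries, while a table 'split' into individual cells, like the \array[n] technique mentioned earlier, is indexed directly.

```c
#include <stdlib.h>
#include <string.h>

/* Illustration of the two storage schemes.  A table kept in a single
   'macro' is modeled as one space-separated string: reaching entry n
   means skipping n separators, so the cost grows with n.  A table
   'split' into one cell per name is modeled as a plain array, where
   entry n is reached in constant time.                              */

static int entry_linear(const char *table, int n)
{
    while (n-- > 0)
        table = strchr(table, ' ') + 1;   /* skip one entry per step */
    return atoi(table);
}

static int entry_direct(const int *cells, int n)
{
    return cells[n];
}
```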

When the parser and scanner are 'optimized' (by saying \def\optimization{5}, for example), all these control sequences are 'spread over' the appropriate associative arrays (by creating a number of control sequences that look like \array[n], where n is the index, as explained above). While it is certainly possible to optimize only some of the parsers (if your document uses multiple) or even only some parts of a given parser (or scanner), the details of how to do this are rather technical and are left for the reader to discover by reading the examples supplied with the package. At least at the beginning it is easier to simply set the highest optimization level and use it consistently throughout the document.

Use with CWEB

Since the macros in the package were designed to support pretty-printing of languages other than C in CWEB, it makes sense to spend a few paragraphs on this particular application. The CWEB system consists of two weakly related programs: CWEAVE and CTANGLE. The latter extracts the C portion of the user's input, and outputs a C file after an appropriate rearrangement of the various sections of the code. The task of CWEAVE is very different and arguably more complex: not only does it have to be aware of the general 'hierarchy' of the various subsections of the program to create cross references, an index, etc., it also has to understand enough of the C code in order to pretty-print it. Whereas CTANGLE simply separates the program code from the programmer's documentation, rearranges it and outputs the original program text (with added #line directives and simple C comments that can be easily removed in postprocessing if necessary), the output of CWEAVE bears very little resemblance to the original program. It might sound a bit exaggerated but CWEAVE's processing is 'one-way': it would be difficult or even impossible to write software that 'undoes' the pretty-printing performed by CWEAVE.

There is, however, a loophole that allows one to use CWEB with practically any language, and pretty-print the results, if an appropriate 'filter' is available. The saving grace comes in the form of CWEB's verbatim output: any text inside @= and @> undergoes some minimal processing (mainly to 'escape' dangerous TeX characters such as '$') and is put inside \vb{...} by CWEAVE.

The macros in the package take advantage of this feature by collecting all the text inside \vb groups and trying to parse it. If the parsing pass is successful, pretty-printed output is produced; if not, the text is output in 'typewriter' style.

With languages such as bison's input script, an additional complication has to be dealt with: most of the time the actions are written in C, so it makes sense to use CWEAVE's C pretty-printer to typeset the action code. Most material outside of \vb groups is therefore assumed to be C code and is carefully collected and 'cleaned up' by the macros included in the package.

For the purposes of documenting the TeX parser, one additional feature of CWEAVE is taken advantage of: the text inside double quotes, "...", is treated similarly to the verbatim portion of the input (this can be viewed as a 'local' version of the verbatim sections). Moreover, CWEAVE allows one to treat a function name (or nearly any identifier) as a TeX macro. These two features are enough to implement pretty-printing of semantic actions in TeX. The macros will turn an input string such as, e.g., 'TeX_( "\\relax" );' into its pretty-printed form (for the sake of convenience, the string above would actually be written as 'TeX_( "/relax" );', as explained in the manual for the package). See the documentation that comes with the package and the bison language pretty-printer implementation for any additional details.

An example

As an example, let us walk through the development process of a simple parser. Since the language itself is not of any particular importance, a simple grammar for additive expressions was chosen. The example, with a detailed description and all the necessary files, is included in the examples directory. The Makefile there allows one to type, say, make step1 to produce all the files needed in the first step of this short walk-through. Finally, make docs will produce a pretty-printed version of the grammar, the regular expressions, and the test TeX file, along with detailed explanations of every stage.

As the first step, one creates a bison input file (expp.y) and a similar input for flex (expl.l). A typical fragment of expp.y looks like the following:

  value:
    expression {TeX_("/yy0{/the/yy(1)}");}
  ;

The scanner's regular expression section, in its entirety, is:

  [ \f\n\t\v]  {TeX_("/yylexnext");}
  {id}         {
    TeX_("/yylexreturnval{IDENTIFIER}");}
  {int}        {
    TeX_("/yylexreturnval{INTEGER}");}
  [+*()]       {TeX_("/yylexreturnchar");}
  .            {
    TeX_("/iftracebadchars");
    TeX_("  /yycomplain{%%");
    TeX_("  invalid character(s): %%");
    TeX_("  /the/yytext}");
    TeX_("/fi");
    TeX_("/yylexreturn{$undefined}");
  }


  TeX_(" /the/yytext}");
  TeX_("/fi");
  TeX_("/yylexreturn{$undefined}");
}

Once the files have been prepared and debugged, the next step is to generate the 'driver' files, ptabout and ltabout. For the parser 'driver', this is done with

bison expp.y -o expp.c
gcc -DPARSER_FILE=\
  \"examples/expression/expp.c\" \
  -o ptabout ../../mkeparser.c

The first line generates the parser from the bison input file that was prepared in the first step, and the next line uses this file to produce a 'driver'. If the included Makefile is used, the file mkeparser.c is generated automatically; otherwise one has to make sure that it exists and resides in the appropriate directory first. It has no external dependencies and can be freely moved to any place that is convenient. Next, run ptabout and ltabout to produce the automata tables:

ptabout --optimize-actions ptab.tex
ltabout --optimize-actions ltab.tex

Now, look inside expression.sty for a way to include the parser in your own documents, or simply \input it in your own TEX file. Executing make test.tex will produce a test file for the new parser. This is it!

Acknowledgment

The author would like to thank the editors, Barbara Beeton and Karl Berry, for a number of valuable suggestions and improvements to this article.

References

[Ah] Alfred V. Aho et al., Compilers: Principles, Techniques, and Tools, Pearson Education, 2006.
[Bi] Charles Donnelly and Richard Stallman, Bison, The Yacc-compatible Parser Generator, The Free Software Foundation, 2013. http://www.gnu.org/software/bison/
[DEK1] Donald E. Knuth, The TEXbook, Addison-Wesley, Reading, Massachusetts, 1984.
[DEK2] Donald E. Knuth, The future of TEX and METAFONT, TUGboat 11 (4), p. 489, 1990. http://tug.org/TUGboat/tb11-4/tb30futuretex.pdf
[Do] Jean-luc Doumont, Pascal pretty-printing: An example of "preprocessing with TEX", TUGboat 15 (3), 1994 — Proceedings of the 1994 TUG Annual Meeting. http://tug.org/TUGboat/tb15-3/tb44doumont.pdf
[Er] Sebastian Thore Erdweg and Klaus Ostermann, Featherweight TEX and Parser Correctness, Proceedings of the Third International Conference on Software Language Engineering, pp. 397–416, Springer-Verlag, Berlin, Heidelberg, 2011.
[Fi] Jonathan Fine, The \CASE and \FIND macros, TUGboat 14 (1), pp. 35–39, 1993. http://tug.org/TUGboat/tb14-1/tb38fine.pdf
[Go] Pedro Palao Gostanza, Fast scanners and self-parsing in TEX, TUGboat 21 (3), 2000 — Proceedings of the 2000 Annual Meeting. http://tug.org/TUGboat/tb21-3/tb68gost.pdf
[Gr] Andrew Marc Greene, BASIX — An interpreter written in TEX, TUGboat 11 (3), 1990 — Proceedings of the 1990 TUG Annual Meeting. http://tug.org/TUGboat/tb11-3/tb29greene.pdf
[Ha] Hans Hagen, LuaTEX: Halfway to version 1, TUGboat 30 (2), pp. 183–186, 2009. http://tug.org/TUGboat/tb30-2/tb95hagen-.pdf
[Ie] R. Ierusalimschy et al., Lua 5.1 Reference Manual, Lua.org, August 2006. http://www.lua.org/manual/5.1/
[La] The l3regex package: Regular expressions in TEX, The LATEX3 Project. http://www.ctan.org/pkg/l3regex
[Pa] Vern Paxson et al., Lexical Analysis With Flex, for Flex 2.5.37, July 2012. http://flex.sourceforge.net/manual/
[Wo] Marcin Woliński, Pretprin — A LATEX2ε package for pretty-printing texts in formal languages, TUGboat 19 (3), 1998 — Proceedings of the 1998 TUG Annual Meeting. http://tug.org/TUGboat/tb19-3/tb60wolin.pdf

Alexander Shibakov
Dept. of Mathematics
Tennessee Tech. University
Cookeville, TN
http://math.tntech.edu/alex
