
A Backtracking LR Algorithm for Parsing Ambiguous Context-Dependent Languages

Adrian D. Thurston and James R. Cordy

School of Computing, Queen's University, Kingston, ON, Canada
{thurston, cordy}@cs.queensu.ca

Abstract

Parsing context-dependent computer languages requires an ability to maintain and query data structures while parsing for the purpose of influencing the parse. Parsing ambiguous computer languages requires an ability to generate a parser for arbitrary context-free grammars. In both cases we have tools for generating parsers from a grammar. However, languages that have both of these properties simultaneously are much more difficult to parse. Consequently, we have fewer techniques. One approach to parsing such languages is to endow traditional LR systems with backtracking. This is a step towards a working solution; however, there are a number of problems. In this work we present two enhancements to a basic backtracking LR approach which enable the parsing of computer languages that are both context-dependent and ambiguous. Using our system we have produced a fast parser for C++ that is composed of strictly a scanner, a name lookup stage and a parser generated from a grammar augmented with semantic actions and semantic 'undo' actions. Language ambiguities are resolved by prioritizing grammar declarations.

Copyright © 2006 Adrian D. Thurston and James R. Cordy. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

1 Introduction

To successfully parse modern programming languages such as C++, Java and C# requires an ability to handle context dependencies. We must look up the meaning of an identifier to determine what kind of symbol we are dealing with before we proceed to parse the identifier. The established practice for dealing with this problem is the "lexical feedback hack." During forward parsing, semantic actions are responsible for maintaining lookup tables. The lexical analyzer is then responsible for querying the type of an identifier before sending it to the parser. We have many tools which support this method of parsing.

A different classification criterion is ambiguity. To parse context-free languages that are ambiguous requires an ability to pursue at least one parse given multiple potential parses. Again, we have many tools for generating parsers from ambiguous grammars. The techniques available to use include GLR, Earley parsing, and generalized recursive descent.

Languages that are both context-dependent and ambiguous are considerably more difficult to parse than languages with just one of these properties. They require that our facilities for considering alternatives accommodate our need to maintain and query global state. For example, we may pursue one potential parse, manipulating global state while doing so, only to discover that we have made the wrong guess and we must give up and try an alternative parse. Before we can, however, our manipulation of the global state must be abandoned. Since the nature of these manipulations is determined by the semantics of the language to be parsed, they must be programmed by the user of the parser generator. They cannot be declared by the user and generated automatically by the parser generator.

C++ is an example of a language that has context dependencies and which is ambiguous. We very rarely find C++ parsers to be generated from a grammar.

One approach to parsing these languages is to convert a standard LR parser into a backtracking parser. Examples of such systems include BtYacc [4], Basil [20], Ratatosk [15] and Lark [7]. There are a number of advantages to pursuing a backtracking LR approach. The parser will inherit the speed of LR parsing, the simplicity and power of the bottom-up semantic action model, and the ease-of-use of backtracking, giving us a natural ability to handle ambiguities.

There are two problems with simply endowing an LR parser with backtracking which make it difficult to apply the approach to the parsing of ambiguous context-dependent languages. First, existing backtracking LR systems only backtrack at the level of parsing. They do not elevate backtracking to the level of semantic actions. We therefore cannot backtrack over any attempted parse that has modified the global state in preparation for handling context dependencies. Secondly, with these systems it is difficult or impossible for the user to specify which potential parses should be preferred when the grammar contains ambiguities.

In this work we have solved these two problems. The forward parsing phase is free to manipulate the global state because our backtracker invokes semantic undo actions during backtracking. Semantic undo actions can be used to revert the effects of the forward phase. Secondly, we have devised a method of ordering the attempts of conflicting actions to achieve a user-controlled and predictable parse of an ambiguous grammar.

This approach has been successfully used to write a grammar-based C++ parser in a straightforward manner. Reduction actions are free to change the global data structures that are used to determine the type of identifiers. This may entail pushing a namespace to the declaration stack, or inserting a class name into a dictionary. Immediately below a reduction action which modifies global state, the disciplined programmer is responsible for implementing the reverse of the reduction action, for example popping the declaration stack or removing an item from a dictionary. This allows the parser to correctly backtrack.

At strategic points, such as statement boundaries, parse trees may be committed and non-reversible actions executed. In these non-reversible actions, which we call final actions, the user may perform permanent tasks such as constructing an AST or printing the result of the parse. Finally, rules for resolving C++ language ambiguities are implemented by ordering mutually ambiguous productions in the order in which they should be tried.

In the next section we discuss the various grammar-based approaches to generating parsers for ambiguous context-dependent languages, including existing backtracking LR systems. In Section 3 we describe our enhancement to backtracking LR which allows ambiguities and context dependencies to co-exist. In Section 4 we describe our enhancement which puts control of the backtracking strategy in the hands of the user. In Section 5 we show how our parsing algorithm can be applied to C++.

2 Related Work

2.1 GLR

The generalized LR parsing method [10] is one approach to parsing ambiguous context-dependent languages. Due to inherent parallelism, use of GLR relies on post-processing of the parse trees.

In the course of building a parse table, standard LR parser generators will emit an error upon discovering shift-reduce or reduce-reduce conflicts. Some parser generators may choose one action by default and produce code that runs, but does not necessarily work as intended. Other generators may simply fail to proceed.

GLR parser generators will accept any grammar and will always produce a working parser regardless of the number of conflicts contained in it. At run time, the generated parser will take conflicts in stride; when encountering multiple actions on a single arc of the parse table, it will simultaneously take all actions. From then on all potential parses are parsed in lockstep. Since parsing in lockstep requires multiple stack instances, much research has gone into managing a shared stack which conserves space and computation time, making the approach much more practical.

The GLR method can be applied very successfully to the parsing of ambiguous languages, but we experience problems when we introduce context dependencies. The need to maintain type information while concurrently pursuing multiple parses requires that we also maintain multiple copies of the global data structures which store the type information. It may be possible to extend the idea of automatic parse forest sharing to the global context dependency state. After all, the parse forest is itself a global state. No work in this area is known. However, if we consider that the structure of the global state information is dependent on the language being parsed, it seems doubtful that automatic sharing of context dependency information is a task that can be moved to the parser generator. C++ has a unique and complicated namespace structure, accompanied by many nontrivial name lookup rules. Lookup of template specializations requires an implementation of the type system. Models of the C++ namespace have been produced [18], though these exist only for the purpose of human understanding.

Rather than attempt to parse in a single pass, the common approach to parsing ambiguous context-dependent languages with a GLR parser is to attempt all possible parses irrespective of context dependencies. The tokenizer will yield tokens which simultaneously carry multiple types [1] and the GLR parser will acquire all possible interpretations. Following the parse, or at strategic points, disambiguation rules [11, 21] eliminate the parse forests which are illegal according to context dependency rules.

2.2 Generalized Top-Down

Generalized top-down parsing with full backtracking is a very flexible parsing method that can be applied to ambiguous languages. When provisions are made for handling left recursion, a wide range of parsing tasks can be implemented. The TXL [2] programming language, a language designed for prototyping and manipulating language descriptions, tools and applications, contains a parser which implements the generalized top-down parsing method. It allows the definition of arbitrary context-free grammars, according to which it will parse input using a top-down parser with full backtracking. This parsing strategy has been shown to be very useful in language design and software renovation tasks.

A key advantage of this method is that it puts the user in control of the parsing strategy when the grammar is ambiguous. The preferred order in which to attempt to parse mutually ambiguous alternatives can be specified locally. This is advantageous for grammar composition tasks in software engineering [3]. The innermost backtracking strategy makes it easy for the user to predict the result of the parse.

Definite clause grammars (DCGs) [17] are a syntactic shorthand for producing parsers with Prolog clauses which represent the input with difference lists. Prolog-based parsing is a very expressive parsing technique which can be considered a generalized top-down parsing method. Prolog's backtracking guarantees that all possible grammar derivations are tested. Prolog clauses may be embedded in grammar productions acting as syntactic or semantic predicates.

It is conceivable to enhance a generalized top-down parser with semantic actions for maintaining global state and semantic undo actions for reverting changes to the global state when the parser must backtrack, though work in this area is not known to us.

The primary disadvantage of this parsing approach is that it can result in very long parse times. Full backtracking often induces redundant reparsing when grammar alternatives contain common prefixes. Packrat parsing [6] attempts to solve this problem by memoization of parse trees.
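The redundant reparsing caused by common prefixes, and the effect of memoizing results by input position, can be seen in a small sketch. It is written in Python purely for illustration: the toy grammar, the names `make_parser`, `expr` and `stmt`, and the call counter are assumptions of this example and are not taken from any of the cited systems. It caches recognizer results keyed on input position rather than full parse trees.

```python
from functools import lru_cache

def make_parser(tokens, memoize):
    """Build a backtracking recognizer for a toy grammar (hypothetical example):
         stmt -> expr ';' | expr '.'
         expr -> 'n' '+' expr | 'n'
    Returns the stmt function and a counter of real expr invocations."""
    calls = {"expr": 0}

    def expr(pos):
        # Returns the end position of an expr starting at pos, or None.
        calls["expr"] += 1
        if pos < len(tokens) and tokens[pos] == "n":
            if pos + 1 < len(tokens) and tokens[pos + 1] == "+":
                end = expr(pos + 2)
                if end is not None:
                    return end
            return pos + 1
        return None

    if memoize:
        # Cache results by input position, so a shared prefix is parsed once.
        expr = lru_cache(maxsize=None)(expr)

    def stmt(pos):
        # Both alternatives begin with expr; the second one re-derives it.
        for terminator in (";", "."):
            end = expr(pos)
            if end is not None and end < len(tokens) and tokens[end] == terminator:
                return end + 1
        return None

    return stmt, calls

tokens = ["n", "+", "n", "+", "n", "."]
plain, plain_calls = make_parser(tokens, memoize=False)
memo, memo_calls = make_parser(tokens, memoize=True)
print(plain(0), plain_calls["expr"])  # prints: 6 6
print(memo(0), memo_calls["expr"])    # prints: 6 3
```

Without the cache the shared prefix `expr` is derived once per alternative; with it, each input position is parsed only once, at the price of keeping the cached results in memory.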

Another approach to improving the performance of generalized top-down parsing is described in [9]. In this work, grammars which have the follow-determinism property exhibit improved parsing performance because the follow sets can be used to prune the search space.

ANTLR [16] is another parsing tool which manages the tradeoff between parsing power and performance. Parsers generated by ANTLR normally have the LL(k) property, with k > 1. In recent versions it supports LL(*) which allows k to roam and eliminates the need to explicitly set k. Since many languages in use have LL properties, this is often sufficient. However, for cases when it is not, ANTLR is able to revert to a generalized top-down method. Should the LL(*) method fail, the parser automatically enters into a full backtracking mode with memoization.

2.3 Backtracking LR

Like GLR parser generators, a backtracking LR parser generator will accept any grammar and will always emit a working parser. Upon encountering a conflict, the run time system will try the first action, remembering the choice that was made, and continue parsing in a single thread. Later on, should a parse error be encountered, it will undo its parsing up to the most recent choice point, then try the next possibility.

Where a standard top-down parser with full backtracking will revert to the innermost choice point with respect to the grammar, a backtracking LR parser will revert to the rightmost, topmost choice point with respect to the LR stack. Such a strategy will eventually try all possible parses.

The primary advantage of backtracking LR parsers is that they retain the speed of LR parsing when the grammar is deterministic. If backtracking can be kept to a minimum, the cost of processing nondeterministic grammars need not be prohibitive. Also, a backtracking parser will always yield a single parse on a single pass. Yielding a single parse is a usual requirement of programming language processors.

Merrill [14] describes changes made to Yacc to support backtracking for the purpose of parsing C++. Power and Malloy [19] suggest using a backtracking LR parser for the initial deployment of a C++ grammar due to the language's incompatibility with standard LR parsing and the ease with which backtracking LR allows complicated grammars to be deployed from their specification.

Backtracking LR is not without its drawbacks. Some problems come with the territory. In [10], it is shown that it is easy to write a grammar that exhibits exponential behaviour when given to a backtracking LR parser generator. Our parsing method also succumbs to such grammars. Users must themselves guard against producing poorly performing grammars. Also, hidden left recursion causes us problems and must be avoided.

Other problems we aim to fix in this work. An inability to backtrack over semantic actions which modify global state and an inability to control the parsing of ambiguous constructs are two problems which make it difficult to apply backtracking LR in practice. These problems are discussed in more detail in the following sections.

2.4 BtYacc and Basil

BtYacc is a backtracking LR parser generator derived from Berkeley Yacc. When a BtYacc parser proceeds without encountering any conflicts, regular reduction actions are executed. Since no backtracking is possible these actions can have side effects. We refer to these as final actions. When the parser encounters a conflict in the parse tables, it goes into trial parsing mode where it stops executing final actions, but continues to execute a second class of actions, which are specified differently in the grammar. We refer to these as trial actions. Since the reductions that the trial actions are associated with may be undone, and when this happens there is no way to revert the effects of these actions, trial actions cannot have any side effects.

When facing a shift-reduce conflict a BtYacc parser will always choose to shift first, then reduce. This choice of action ordering makes it difficult for the user to control the relative priority of mutually ambiguous productions. For example, when a BtYacc parser built with the following grammar is given the input a b, the policy of shifting first will always yield the parse S -> a b, regardless of the user's intentions.

    S -> A B    A -> a
    S -> a b    B -> b      (1)

The Basil parser generator leaves the ordering of conflicts up to the programmer. The ordering of shift and reduce actions is not in any way dependent on the grammar. If one writes an ambiguous grammar, there is more work to be done before a desired parse strategy is attained.

BtYacc allows the programmer to invoke a commit command from reduction actions. When one issues a commit command the entire parse is committed to. This is a coarse-grained solution that is suitable for committing the parse when a particular token is seen.

2.5 Elsa

Elsa [13] is a C++ parser produced with the Elkhound GLR system. Elsa demonstrates a successful application of the post-parse disambiguation approach. It parses input regardless of the meaning of C++ identifiers, then later rejects the returned parse trees which do not satisfy the C++ name lookup rules.

2.6 Keystone

Keystone [12] is a C++ parser written using BtYacc. Keystone suffers from problems related to BtYacc supporting only trial and final actions. Since the effect of trial actions cannot be undone, they are unable to modify the global state and therefore cannot perform tasks such as changing the current name scope so that subsequent parsing can look up names correctly.

    namespace ns1
    {
        template <class T> struct C {
            struct D;
            static int f( int )
                { return 0; }
        };
    }

    namespace ns2
    {
        struct E
        {
            E( int ) {}
        };
    }

    ns2::E g( ns1::C< ns2::E >::f(1) );
    ns2::E h( ns1::C< ns2::E >::D );

Figure 1: C++ code demonstrating a need to backtrack over global state modifications.

The program in Figure 1 demonstrates why simply supporting trial and final actions is insufficient for parsing C++. The second-last line of the example is a declaration of an object g of type E that is initialized with the value that f returns. The last line is a declaration of a function h that returns an object of type E and has one unnamed parameter of type D. To distinguish between these requires examining the meaning of the symbols f and D. Once the parser arrives at the initial open parenthesis, it must enter into a trial parse mode before it can decide upon the nature of the declaration. This trial parsing must continue past an unknown number of tokens until the f and D symbols are parsed and their meaning deciphered.

To properly look up f and D we must be aware of the qualifying scope; in both cases this is C. The scanner cannot perform this task because it is unable to correctly parse template parameters. The class template C may indeed have template specializations and the template parameters must be used to look these up. If we try to communicate the qualifying scope using some form of attribute transfer in semantic actions we are foiled by the fact that name lookup happens in the scanning stage, before tokens are passed to the parser. The scanner could simply cheat and peek at the stack to try and guess the correct context, but it has no sense of what to expect on the stack.

The most sensible and straightforward way to propagate qualification information from the parser to the name lookup stage is for the parser to maintain the qualification information in a global variable and for the lookup stage to consult this variable when needed. But since we are in trial parse mode and any of our parsing may get undone, we are forbidden from modifying global variables.

We therefore require a parsing model that allows us to parse both in the forward and backwards direction over semantic actions which modify the global state. The first contribution of this work, described in Section 3, addresses this need.

3 Backtracking Semantic Actions

To permit parsers to backtrack over semantic actions which modify the global state, we have introduced the notion of semantic undo actions. In all, the semantic action known to the user of Yacc has been specialized into three types: trial, undo and final actions. Each is appropriate for a particular kind of task.

3.1 Trial Actions

Trial actions are always executed immediately upon a reduction. They are appropriate for executing actions which will affect future parsing. The user is free to make any modifications to the global state which can be reverted. In the context of programming languages, this usually involves tasks such as inserting or deleting dictionary items, attaching or detaching list items, or pushing or popping stack items. Should enabling the reversibility of an action require saving some data, as in the case of popping from a stack, it can be stored in the data element representing the reduced tree node.

Since we may need to unparse the reduced node, we always preserve the children of a reduction. This results in a stack of parse trees. In many applications preserving the entire parse tree is wasteful. In Section 3.4 we describe commit declarations, which give the parser hints as to when it is allowed to free parse tree nodes. When nodes are freed at regular intervals, the cost of preserving reduced data is marginal.

3.2 Undo Actions

Undo actions are used for reverting side effects of trial actions. They are invoked as the parse loop undoes parsing. This happens when the parser encounters an error and must backtrack to the most recent decision. Within the unparsing loop one item is popped from the top of the parse stack. If the node is a token, the token undo action is executed and it is pushed back to the input stream. If the node is a nonterminal, the nonterminal's undo action is executed, the node is discarded and the children of the node are pushed onto the parse stack. In both cases, if the recently popped node contained an alternate action then unparsing terminates and forward parsing resumes, with the initial action to take guided by the previous choice which was stored in the popped node.

3.3 Final Actions

Final actions are executed when a reduction can never be undone. They are free to make irreversible changes to the global state and should perform all work that is not required for the parser to produce a correct parse. This could be writing out the result of the parse, building an AST or freeing memory.

The execution of final actions is triggered by commit operations, which are described in the next section. Following a commit operation, if there are no backtracking decision points remaining, then all pending final actions are invoked up to the top of the parse stack and all children underneath the nodes of the parse stack are freed. This reduces the parser's memory usage to that of a standard LR parser.

3.4 Declarative Commit Points

Programming language ambiguities are often localized. For example, in C++ once the parsing of a statement completes, any alternative parses of the statement's tokens can be discarded. Discarding alternatives drastically improves parser performance by eliminating fruitless reparsing, expediting the execution of final actions, reducing the parser's memory usage, and enabling it to report erroneous input in a prompt and accurate manner.

We allow the user to declare localized commit points within a grammar. When the parser arrives at a commit point, it deletes any alternatives within the commit point's scope. Alternatives deeper in the stack are not affected, allowing the user to delete alternatives which are of no interest, while preserving earlier alternatives that are still plausible.

There are two forms of commit points: reduction-based and shift-based. A reduction-based commit point is associated with an entire grammar production. When the production is initially reduced, all alternatives embedded underneath the new nonterminal are deleted. A shift-based commit point is embedded into a production someplace before the end. When the first character of the grammar item to the right is shifted, the grammar items to the left are committed.

The need for shift-based commit points arises due to the block structure of programming languages. For example, it is desirable to be able to commit the signature of a function definition when we begin to recognize the statements contained in it.

    definition -> type name ( param_list )
                  commit { definition_list }

Without shift-based commit declarations we would be required to restructure our grammar if we wanted to commit the function signature before entering the body.

Note, however, that until a character which follows an entire production is recognized, it is not guaranteed that the production will be reduced, which means that shift-based commit declarations may affect alternate parses. They should therefore be used ahead of relatively unambiguous language constructs.

A commit declaration does not guarantee that the tree underneath it will be freed because earlier alternatives may still exist. To avoid redundant work, an implementation of this technique should optimize the commit operation such that it only incurs a cost when alternatives exist within the scope of the commit.

4 User-Controlled Parsing Strategy

The second contribution of this work addresses the need for user-controlled parsing of ambiguous language constructs. We have improved the backtracking LR approach by giving users localized, grammar-based control of the order in which conflicting LR actions should be attempted. To accomplish this we have devised an algorithm which traverses the parser's state tables and assigns an ordering to the actions such that a generalized top-down backtracking strategy is emulated. This algorithm is shown in Figure 2. By emulating a generalized top-down approach we give the user the ability to specify mutually ambiguous productions in the order in which the parser should attempt to parse them. We also ensure that the parser will prefer the longest possible match of sequences.

Following the construction of the parser state tables, our algorithm traverses both the state tables and the grammar productions in parallel. The order of the traversal is guided by a top-down interpretation of the grammar productions. We start with a nonterminal as our goal and consider each production of the nonterminal in succession. As we move along a production's right-hand side we recurse on nonterminals before we proceed past them. When we visit a shift transition we assign a time to it if one has not already been given. Following the traversal down the right-hand side of a production, we find the transitions which contain the reduction actions of the production instance and assign a time to each reduction action if one has not already been given.

To limit the traversal and guarantee termination we visit a parse table and production state pair only once. This is accomplished by inserting the dot item of the production state into the parse table state and proceeding with the pair only when the dot item previously did not exist. Note that this yields dot sets identical to those computed during the LR parse table construction.

    orderState( tabState, prodState, time ):
      if not tabState.dotSet.find( prodState.dotID )
        tabState.dotSet.insert( prodState.dotID )
        tabTrans = tabState.findMatchingTransition( prodState.getTransition() )

        if tabTrans is NonTerminal:
          for production in tabTrans.nonTerm.prodList:
            orderState( tabState, production.startState, time )

            for all expandToState in tabTrans.expandToStates:
              for all followTrans in expandToState.transList
                reduceAction = findAction( production.reduction )
                if reduceAction.time is unset:
                  reduceAction.time = time++
                end
              end
            end
          end
        end

        shiftAction = tabTrans.findAction( shift )
        if shiftAction.time is unset:
          shiftAction.time = time++
        end

        orderState( tabTrans.toState, prodTrans.toState, time )
      end
    end

    orderState( parseTable.startState, startProduction.startState, 1 )

Figure 2: Ordering shifts and reduces to emulate a generalized top-down strategy.

The following grammar demonstrates the ability of our algorithm to properly order mutually ambiguous productions and to prefer the longest match of a sequence. The corresponding parse tables are given in Figure 3.

    S -> AB b       AB -> AB a
    S -> a A AB     AB -> AB b      (2)
    A -> A a        AB ->
    A ->

The timestamps assigned by our action ordering algorithm are shown in each transition action. There are two conflict points. The first is in the transition leaving the start state on the input character a. This transition will first induce a reduction of AB, then a shift. This represents the pursuit of the first production of S. The second conflict represents the choice between extending the A sequence and beginning the AB sequence when we are matching the second production of S. In this case the parser first shifts and reduces A to pursue a longest match of A.

Figure 3: LALR(1) parse tables of grammar (2). The unique timestamp assigned to each action is shown. The resulting action ordering emulates a top-down parsing strategy. Transition labels from the state diagram (whose initial state is marked START):

    a / SR(AB-1, 29)                AB / S(24)          b / SR(AB-2, 25)
    EOF / R(S-2, 30)                a / SR(A-1, 17), R(AB-3, 21)
    b / R(AB-3, 22)                 A / S(16)           EOF / R(AB-3, 23)
    a / R(A-2, 13)                  b / R(A-2, 14)      a / R(AB-3, 1), S(12)
    EOF / R(A-2, 15)                a / R(AB-2, 5)      b / R(AB-2, 6)
    b / S(4)                        EOF / R(S-1, 11)    AB / S(3)
    a / SR(AB-1, 7)                 b / R(AB-3, 2)

4.1 Parsing Example

Figure 4 shows the run-time behaviour of the backtracking LR parser generated from grammar (2) when run on the input a b a. Normally, an LR parser discards the nodes it has popped off of the stack during a reduction. Since we must be prepared to backtrack we preserve these nodes as children of the newly reduced node. These children nodes are shown underneath their parent in the middle column. Though there are none in this example, use of commit declarations to clear the retry points causes these nodes to be freed.

The retry point (r:2) is recorded in the reduced node AB of the first reduction. When the unparsing loop arrives at this retry point it transfers it to the first input symbol and resumes forward parsing. The forward parsing loop will then read the retry point and shift instead of reduce.

    Action      Stack                 Input
                                      a b a EOF
    reduce      AB(r:2)               a b a EOF
    shift       AB(r:2) a             b a EOF
    reduce      AB                    b a EOF
                  AB(r:2) a
    shift       AB b                  a EOF
                  AB(r:2) a
    reduce      AB                    a EOF
                  AB b
                    AB(r:2) a
    shift       AB a                  EOF
                  AB b
                    AB(r:2) a         ERROR
    unshift     AB                    a EOF
                  AB b
                    AB(r:2) a
    unreduce    AB b                  a EOF
                  AB(r:2) a
    unshift     AB                    b a EOF
                  AB(r:2) a
    unreduce    AB(r:2) a             b a EOF
    unshift     AB(r:2)               a b a EOF
    unreduce                          a(r:2) b a EOF
    shift       a                     b a EOF
    reduce      a A                   b a EOF
    reduce      a A AB                b a EOF
    shift       a A AB b              a EOF
    reduce      a A AB                a EOF
                      AB b
    shift       a A AB a              EOF
                      AB b
    reduce      a A AB                EOF
                      AB a
                        AB b

Figure 4: Parsing of the string a b a, according to grammar (2). Since in this case we must be prepared to backtrack, we preserve popped nodes. Commit declarations can be used to clear retry points, which in turn causes these preserved nodes to be freed.

4.2 Out-of-Order Parse Correction

Unfortunately it is possible to find grammars whose mutually ambiguous productions will not be parsed in order. As it turns out, the parse strategy we aim for is reserved only for true top-down parsers. LR parsers attempt to parse common production prefixes in parallel. This allows parsers to run very fast, but it can inhibit us from achieving a top-down strategy because it shuffles the order of backtracking decisions by delaying the selection of productions. For example, consider the following grammar.

    S -> F b x      F -> a
    S -> F          F -> a b c      (3)
    S -> F b c

When given the input string a b c, a generalized top-down parser will attempt the following derivations. Note that it branches on the possible derivations of S first, then branches on the possible derivations of F.

    S -> F( a ) b x         fail
    S -> F( a b c ) b x     fail
    S -> F( a )             fail
    S -> F( a b c )         accept

Our backtracking LR parser does not yield the same parse. Since all three S productions have a common prefix F, the prefix will be parsed once for all productions. The parser will branch on the possible derivations of F first, then later branch on the derivations of S. This out-of-order branching causes an out-of-order parse. When we trace the parser's behaviour, we find that it first reduces F -> a, then succeeds in matching S -> F b c.

    S -> F( a ) b x         fail
    S -> F( a )             fail
    S -> F( a ) b c         accept

The offending LR state tables are shown in the first part of Figure 5.

Figure 5: LALR(1) parse tables before and after adding unique empty productions that force the parser to select on the possible derivations of S before selecting on the possible derivations of F. Transition labels from the two state diagrams (initial states marked START):

    Before:
    a / S(1)            b / R(F-1, 2), S(4)     c / SR(F-2, 5)
    EOF / R(F-1, 3)     F / S(8)                b / S(9)
    x / SR(S-1, 10)     EOF / R(S-2, 12)        c / SR(S-3, 13)

    After:
    a / R(U1-1, 1), R(U2-1, 14), R(U3-1, 19)
    U1 / S(2)           a / S(3)                b / R(F-1, 4), S(6)
    EOF / R(F-1, 5)     c / SR(F-2, 7)          F / S(10)
    b / S(11)           x / SR(S-1, 12)         U2 / S(15)
    a / S(16)           F / SR(S-2, 17)         U3 / S(20)
    a / S(21)           F / S(22)               b / S(23)
    c / SR(S-3, 24)

Fortunately we are able to solve this problem easily. When we find our input to be parsed out of order with respect to our grammar, we can force a correct order by introducing unique empty productions at the beginning of the productions which are parsed out of order. The unique empty productions will cause an immediate reduce conflict before any inner productions are reduced, effectively allowing us to force the slower top-down parsing approach in a localized manner. We can change grammar (3) to the following and achieve the same parsing strategy as a top-down parser.

    S -> U1 F b x   F -> a          U1 ->
    S -> U2 F       F -> a b c      U2 ->       (4)
    S -> U3 F b c                   U3 ->

The second part of Figure 5 shows the LR tables after forcing an initial selection on S. An ability to force a branch point is very useful when unioning grammars because it frees us from analyzing how the LR state tables interact. The cost of forcing a branch point lies in increasing the number of states and lengthening parse times. However, we do so only locally, and only when required.
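The top-down derivation order that grammar (4) is meant to reproduce can be checked with a small generalized top-down recognizer for grammar (3). This is a sketch in Python and not the backtracking LR implementation described in this paper; the names `GRAMMAR`, `derive`, `derive_seq` and `parse` are hypothetical, and the sketch assumes a grammar without left recursion.

```python
# Grammar (3); alternatives are listed in the order they should be tried.
GRAMMAR = {
    "S": [["F", "b", "x"], ["F"], ["F", "b", "c"]],
    "F": [["a"], ["a", "b", "c"]],
}

def derive(symbol, tokens, pos):
    """Yield (tree, end) for each way symbol matches tokens[pos:end], in grammar order."""
    if symbol not in GRAMMAR:                       # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:                     # branch on this nonterminal first
        for children, end in derive_seq(rhs, tokens, pos):
            yield (symbol, children), end

def derive_seq(rhs, tokens, pos):
    """Yield (children, end) for each way the symbol sequence rhs matches."""
    if not rhs:
        yield [], pos
        return
    for first, mid in derive(rhs[0], tokens, pos):  # then on the symbols inside it
        for rest, end in derive_seq(rhs[1:], tokens, mid):
            yield [first] + rest, end

def parse(tokens):
    """Return the first derivation of S that consumes all tokens, or None."""
    for tree, end in derive("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None

print(parse(["a", "b", "c"]))  # prints: ('S', [('F', ['a', 'b', 'c'])])
```

On the input a b c the sketch rejects the two derivations through S -> F b x and the derivation S -> F( a ) before accepting S -> F( a b c ), which is the order listed for the generalized top-down parser above.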

4.3 Semantic Conditions and Error Recovery

An advantage of our approach is that it affords simple implementations of semantic conditions and error recovery. Should a reduction action detect that a parse violates a semantic condition, it can invoke the backtracker and the parser will move on to an alternative parse.

With a simple enhancement to the state table generator in the form of an any* token which repeatedly matches input tokens up to a termination point, it is possible to implement a well-known error handling technique. The following error handler consumes input until the input stream and parser go back into a stable state where correct parsing may resume.

    statement -> U1 for_block
    statement -> U1 while_block
    ...
    statement -> U2 any* ;

5 Case Study: C++

To validate our ideas we have applied them to the parsing of C++. The C++ language has a reputation of being very difficult to parse using grammar-based techniques. Many C++ compilers use a hand-written recursive descent approach, including GCC, OpenC++ and OpenWatcom.

Our parser is composed strictly of a scanner, a name lookup routine inserted between the scanner and parser, and a grammar. Some productions are accompanied by trial, undo and/or final actions. Backtracking performance is improved with a small number of commit declarations that we associate with C++ declarations, statements and the opening of block structures.

5.1 Use of Semantic Undo Actions

We use semantic undo actions to revert the effects of trial actions which manipulate the C++ name hierarchy and prepare for name lookups by posting name qualifications. An example is given in Figure 6. These empty nonterminals open and close a C++ declaration. They are used to initialize the data structure into which we collect information about the declaration. This information will be used when we record the declaration in the C++ name hierarchy. During forward parsing, we push a fresh instance of the structure to a stack when opening a declaration and pop the structure when closing a declaration. During unparsing we revert these actions.

    globals {
        Stack templDecl;
        Stack declData;
    };

    declaration_start:
    try {
        declData.push( DeclarationData() );
        declData.top().init();
        declData.top().isTemplate =
            templDecl.top();
    }
    undo {
        declData.pop();
    };

    nonterm declaration_end {
        DeclarationData declData;
    };

    declaration_end:
    try {
        $$->declData = declData.pop();
    }
    undo {
        declData.push( $$->declData );
    };

Figure 6: Semantic actions which wrap declarations.

In all, semantic undo actions are relatively sparse. In our C++ grammar there are 576 productions, 61 of which have semantic undo actions. Many of them are concerned with removing items from dictionaries or popping items from stacks and are similar.

5.2 Resolving Ambiguities

C++ has a number of ambiguities documented in the language standard [5]. These ambiguities can be resolved according to the standard by utilizing the parsing strategy of our backtracking LR algorithm. In the remainder of this section we describe how we have implemented the resolution of each ambiguity.
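The pairing of trial and undo actions shown in Figure 6 can be mimicked outside a parser generator. The sketch below is a Python illustration under stated assumptions, not the implementation used in our system: `UndoLog`, `mark` and `backtrack` are hypothetical names, plain dictionaries stand in for `DeclarationData`, and the `init()` call of Figure 6 is omitted. Each trial action is run immediately and its undo action is recorded, so that backtracking can revert the actions in reverse order.

```python
class UndoLog:
    """Records an undo callback for every trial action that has run."""

    def __init__(self):
        self._undos = []

    def run(self, trial, undo):
        trial()                   # forward parsing: execute the trial action
        self._undos.append(undo)  # remember how to revert it

    def mark(self):
        """Return a position in the log to which we can later backtrack."""
        return len(self._undos)

    def backtrack(self, mark):
        # Unparsing: revert actions in reverse order, back to the mark.
        while len(self._undos) > mark:
            self._undos.pop()()


templ_decl = [False]  # stands in for the templDecl stack of Figure 6
decl_data = []        # stands in for the declData stack of Figure 6


def declaration_start(log):
    def trial():
        decl_data.append({"is_template": templ_decl[-1]})

    def undo():
        decl_data.pop()

    log.run(trial, undo)


def declaration_end(log):
    node = {}  # plays the role of $$ in Figure 6

    def trial():
        node["decl_data"] = decl_data.pop()

    def undo():
        decl_data.append(node["decl_data"])

    log.run(trial, undo)
    return node


log = UndoLog()
start = log.mark()
declaration_start(log)
inside = log.mark()
node = declaration_end(log)
log.backtrack(inside)  # reverts declaration_end only
log.backtrack(start)   # reverts declaration_start as well
```

After the final call the global stacks are back in their initial state, which is the property that lets the parser abandon a failed alternative safely. A semantic condition could be layered on top by having a trial action signal failure and the driver call `backtrack` in response; that driver is not shown here.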

5.2.1 Ambiguity 1: Section 6.8

There is an ambiguity between declaration statements and expression statements. To resolve this ambiguity, we follow the rule that any statement that can be interpreted as a declaration is a declaration. We program this by specifying the declaration statement production ahead of the expression statement production.

    struct C {};
    void f(int a)
    {
        C(a)[5];      // declaration
        C(a)[a=1];    // expression
    }

    statement: declaration_statement commit;
    statement: expression_statement commit;

5.2.2 Ambiguity 2: Section 8.2, Para 1

There is an ambiguity between a function declaration with a redundant set of parentheses around the parameter declaration and an object declaration with an initialization using a function-style cast expression. Again, we apply the rule that any program text that can be a declaration is a declaration. Therefore we must prefer the function declaration. The resolution of this ambiguity is handled automatically by our parsing strategy, because parameter specifications are innermost relative to object initializations.

    struct C {};
    int f(int a)
    {
        C x(int(a));    // function declaration
        C y(int(1));    // object declaration
    }

    init_declarator:
        declarator initializer_opt;
    declarator:
        ptr_operator_seq_opt declarator_id array_or_param_seq_opt;

    array_or_param_seq_opt:
        array_or_param_seq_opt array_or_param;
    array_or_param_seq_opt: ;

    array_or_param:
        '[' constant_expression_opt ']';
    array_or_param:
        '(' parameter_declaration_clause ')'
        cv_qualifier_seq_opt exception_spec_opt;

    initializer_opt: '=' initializer_clause;
    initializer_opt: '(' expression ')';
    initializer_opt: ;

5.2.3 Ambiguity 3: Section 8.2, Para 2

In contexts where we can accept either a type-id or an expression, there is an ambiguity between an abstract function declaration with no parameters and a function-style cast. The resolution is that any program text which can be a type-id is a type-id. We program this by specifying the productions which derive type-ids ahead of the productions which derive expressions.

    template class D {};
    int f()
    {
        sizeof(int());     // sizeof type-id
        sizeof(int(1));    // sizeof expression

        D l;    // type-id argument
        D l;    // expression argument
    }

    unary_expression: KW_Sizeof '(' type_id ')';
    unary_expression: KW_Sizeof unary_expression;

    template_argument: type_id;
    template_argument: assignment_expression;

5.2.4 Ambiguity 4: Section 8.2, Para 7

In contexts which accept both abstract declarators and named declarators there is an ambiguity between an abstract function declaration with a single abstract parameter and an object declaration with a redundant set of parentheses. This arises in function parameter lists. The resolution is to consider the text as an abstract function declaration with a single abstract parameter. We program this by specifying abstract declarators ahead of named declarators.

    struct C {};
    void f(int (C));    // anon function ptr param
    void f(int (x));    // variable parameter

    parameter_declaration:
        decl_specifier_seq param_declarator_opt
        parameter_init_opt;

    param_declarator_opt: abstract_declarator;
    param_declarator_opt: declarator;
    param_declarator_opt: ;
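All of the resolutions above rely on the same principle: when two alternatives can match the same text, the one listed first wins. The following Python sketch illustrates that principle for Ambiguity 1 only. It is a deliberately tiny toy, not our C++ grammar: the recognizers `is_declaration` and `is_expression`, the `TYPE_NAMES` set standing in for name lookup, and the token patterns they accept are all assumptions made for illustration.

```python
import re

TYPE_NAMES = {"C"}  # hypothetical stand-in for the name lookup stage


def tokenize(text):
    # Identifiers, integers, or any single non-space character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


def _cast_prefix(toks):
    # type-name '(' identifier ')' '[' ... with room for at least "x ] ;"
    return (len(toks) >= 8 and toks[0] in TYPE_NAMES and toks[1] == "("
            and toks[2].isidentifier() and toks[3] == ")" and toks[4] == "[")


def is_declaration(toks):
    # Toy rule: T ( name ) [ integer ] ;
    return (_cast_prefix(toks) and len(toks) == 8
            and toks[5].isdigit() and toks[6:] == ["]", ";"])


def is_expression(toks):
    # Toy rule: T ( name ) [ one or more tokens ] ;
    return _cast_prefix(toks) and toks[-2:] == ["]", ";"]


# Order matters: the declaration alternative is listed first.
STATEMENT_ALTERNATIVES = [
    ("declaration_statement", is_declaration),
    ("expression_statement", is_expression),
]


def classify(text):
    toks = tokenize(text)
    for name, matches in STATEMENT_ALTERNATIVES:
        if matches(toks):
            return name  # first success wins; later alternatives are skipped
    return None


print(classify("C(a)[5];"))
print(classify("C(a)[a=1];"))
```

Note that `is_expression` also accepts the first statement, so the ambiguity is real in this toy; it is only the ordering of the alternatives that selects the declaration reading.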

5.3 Parsing Speed

Our parsing method is competitively fast, though meaningful timings are difficult to obtain because there are no C++ parsers which perform exactly the same amount of work as ours. Admittedly, our parser is not complete; we do just enough work to obtain a nearly-correct parse. We have not implemented the complete type and expression evaluation systems, which both require considerable effort to implement. These are necessary for looking up template specializations. This in turn affects our ability to properly look up names in some contexts.

Nevertheless, we give a timing of our untuned and incomplete prototype and a timing of GCC on the same file to give a general sense that our method is suitable for practical tasks. On a 2.4 GHz Intel processor, our parser handles a 1.3 MB preprocessed file belonging to the Mozilla repository in 0.154 seconds. On the same file, the g++ 3.3.5 compiler reported that the sum of scanning, parsing and name lookup took 0.980 seconds. This information was obtained with the -ftime-report option.

6 Future Work

The problem of detecting out-of-order parses and eliminating them by inserting unique empty productions is a task that we leave up to the user. It would be desirable to have a static analysis which was able to detect out-of-order parses and automatically correct the problem by inserting unique empty productions where appropriate. In initial investigations we found this to be a difficult problem, closely related to the detection of ambiguities in context-free grammars, which has been shown to be an undecidable problem [8].

An alternate strategy for guaranteeing that no out-of-order parses are possible might be to begin by inserting unique empty productions at the beginning of every production, then later eliminate those which are unnecessary. Maintaining in-order parsing may be easier than detecting out-of-order parsing.

When we insert unique empty productions at the beginning of every production we guarantee that no input is parsed out of order. This causes a backtracking LR parser to behave like a generalized top-down parser. This idea was tested with our C++ parser. We inserted unique empty productions at the beginning of every production which did not contain left recursion, whether direct, indirect or hidden. The resulting parser produced the same output, but performance slowed by a factor of 20, and there was an increase in the number of states by a factor of 2.

If automatic detection of out-of-order parses proves too difficult or unnecessary, it may be worthwhile to pursue methods for analyzing an explicitly specified pair of ambiguous productions for potential out-of-order parses. This would ensure that unique empty productions are added only when necessary.

7 Conclusion

In this work we describe two enhancements to a backtracking LR parsing approach which enable the parsing of languages that are both context-dependent and ambiguous.

We introduce a new class of semantic actions for reverting changes made to the global state, which we call undo actions. These actions are straightforward to program and permit the parser to backtrack over areas of input text which require preparations for handling context dependencies. Declarative commit points can be used to eliminate fruitless backtracking and improve performance in a localized manner.

Secondly, we assign an ordering to conflicting shift and reduce actions that causes the parser to emulate the parsing strategy of a generalized top-down parser for many grammars. In cases where common prefixes inhibit the desired top-down strategy, unique empty productions can be inserted at the beginning of productions to force a localized top-down approach. This will guarantee that the parser attempts to parse mutually ambiguous productions in the order in which they are given. Using our method, we can apply a top-down backtracking strategy where needed for resolving ambiguities, while retaining the speed of LR parsing for sections of the grammar which are deterministic.
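The insertion of unique empty productions referred to above is a mechanical grammar rewrite. The following Python sketch shows one way it might be automated for a chosen set of nonterminals; the function name `add_unique_empties`, the `targets` parameter and the list-of-pairs grammar encoding are assumptions made for illustration, and only direct left recursion is detected. Indirect and hidden left recursion would need a separate analysis, and clashes with existing nonterminals named U1, U2 and so on are not handled.

```python
def add_unique_empties(productions, targets):
    """Prefix each production of a nonterminal in `targets` with a fresh U<n>.

    `productions` is an ordered list of (lhs, rhs) pairs. Directly
    left-recursive productions are copied unchanged.
    """
    rewritten, empties = [], []
    for lhs, rhs in productions:
        if lhs in targets and not (rhs and rhs[0] == lhs):
            unique = f"U{len(empties) + 1}"
            rewritten.append((lhs, [unique] + list(rhs)))
            empties.append((unique, []))  # U<n> -> (empty)
        else:
            rewritten.append((lhs, list(rhs)))
    return rewritten + empties


# Grammar (3), one production per entry, in the order given in the text.
GRAMMAR_3 = [
    ("S", ["F", "b", "x"]),
    ("S", ["F"]),
    ("S", ["F", "b", "c"]),
    ("F", ["a"]),
    ("F", ["a", "b", "c"]),
]

for lhs, rhs in add_unique_empties(GRAMMAR_3, targets={"S"}):
    print(lhs, "->", " ".join(rhs))
```

Restricting `targets` to S yields the productions of grammar (4); passing every nonterminal corresponds to the exhaustive strategy discussed in Section 6, at the cost in states and parse time reported there.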

Acknowledgments

The authors wish to thank Nigel Horspool and Terence Parr for their assistance in understanding the relation of our work to other methods. This work is supported by the Natural Sciences and Engineering Research Council of Canada.

About the Authors

Adrian Thurston is a Ph.D. candidate at Queen's University working in the Software Technology Laboratory under the supervision of James Cordy. He completed his M.Sc. also at Queen's and his B.Math (Computer Science) at the University of Waterloo. Adrian's research interests include parsing technology, source transformation and programming languages.

James Cordy is the Director of the School of Computing and Professor of Computing and Electrical and Computer Engineering at Queen's University. From 1995 to 2000 he was Vice President and Chief Research Scientist at Legasys Corporation, a software technology company specializing in legacy software system analysis and renovation. Dr. Cordy is a founding member of the Software Technology Laboratory at Queen's University and winner of the 1994 ITRC Innovation Excellence award and the 1995 ITRC Chair's Award for Entrepreneurship in Technology Innovation for his work there. He serves on a range of software engineering conference committees and has recently co-chaired several conferences and workshops including CASCON 2005. Dr. Cordy is an IBM Faculty Fellow and has been awarded IBM Faculty Innovation Awards in both 2004 and 2005.

References

[1] John Aycock and R. Nigel Horspool. Schrodinger's token. Software: Practice and Experience, 31(8):803–814, July 2001.

[2] James R. Cordy. The TXL source transformation language. Science of Computer Programming, 61(3):190–210, August 2006.

[3] Thomas R. Dean, James R. Cordy, Andrew J. Malton, and Kevin A. Schneider. Agile parsing in TXL. Journal of Automated Software Engineering, 10(4):311–336, October 2003.

[4] Chris Dodd and Vadim Maslov. Backtracking Yacc, 2006. http://www.siber.com/btyacc/.

[5] International Organization for Standardization. ISO/IEC 14882:1998: Programming languages - C++. American National Standards Institute, First edition, September 1998.

[6] Bryan Ford. Packrat parsing: simple, powerful, lazy, linear time. In Proceedings of the seventh ACM SIGPLAN international conference on Functional programming (ICFP'02), pages 36–47, New York, NY, USA, 2002. ACM Press.

[7] Josef Grosch. Lark - An LALR(2) parser generator with backtracking. Technical Report 32, CoCoLab - Datenverarbeitung, September 2002.

[8] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979.

[9] Adrian Johnstone and Elizabeth Scott. Generalised recursive descent parsing and follow determinism. In Compiler Construction: 7th International Conference (CC'98), volume 1383 of Lecture Notes in Computer Science. Springer-Verlag, 1998.

[10] Adrian Johnstone, Elizabeth Scott, and Giorgios Economopoulos. Generalised parsing: Some costs. In Compiler Construction: 13th International Conference (CC'04), volume 2985 of Lecture Notes in Computer Science, page 89, April 2004.

[11] Paul Klint and Eelco Visser. Using filters for the disambiguation of context-free grammars. In G. Pighizzini and P. San Pietro, editors, Proc. ASMICS Workshop on Parsing Theory, pages 1–20, October 1994.

[12] Brian A. Malloy, Tanton H. Gibbs, and James F. Power. Decorating tokens to facilitate recognition of ambiguous language constructs. Software: Practice and Experience, 33(1):19–39, 2003.

[13] Scott McPeak and George C. Necula. Elkhound: A fast, practical GLR parser generator. In Compiler Construction: 13th International Conference (CC'04), volume 2985 of Lecture Notes in Computer Science, April 2004.

[14] Gary H. Merrill. Parsing non-LR(k) grammars with Yacc. Software, Practice and Experience, 23(8):829–850, 1993.

[15] Torben Mogensen. Ratatosk: A parser generator and scanner generator for Gofer, 1993. ftp://ftp.diku.dk/pub/diku/dists/Ratatosk.tar.Z.

[16] Terence J. Parr and Russell W. Quong. ANTLR: A predicated LL(k) parser generator. Software, Practice and Experience, 25(7):789–810, 1995.

[17] Fernando C. N. Pereira and David H. D. Warren. Definite clause grammars for language analysis - A survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13(3):231–278, 1980.

[18] James F. Power and Brian A. Malloy. Symbol table construction and name lookup in ISO C++. In Proceedings of the International Conference on the Technology of Object-Oriented Languages and Systems (TOOLS'00), pages 57–68, November 2000.

[19] James F. Power and Brian A. Malloy. Exploiting metrics to facilitate grammar transformation into LALR format. In Proceedings of the 2001 ACM Symposium on Applied Computing (SAC'01), pages 636–640, New York, NY, USA, 2001. ACM Press.

[20] Michael Spencer. Basil: A backtracking LR parser generator, 2006. http://www.lazycplusplus.com/basil/.

[21] Mark G. J. van den Brand, Jeroen Scheerder, Jurgen Vinju, and Eelco Visser. Disambiguation filters for scannerless generalized LR parsers. In Compiler Construction: 11th International Conference (CC'02), volume 2304 of Lecture Notes in Computer Science, pages 143–158, Grenoble, France, April 2002.
