A Backtracking LR Algorithm for Parsing Ambiguous Context-Dependent Languages
Adrian D. Thurston and James R. Cordy
School of Computing
Queen’s University
Kingston, ON, Canada
{thurston, cordy}@cs.queensu.ca

Abstract

Parsing context-dependent computer languages requires an ability to maintain and query data structures while parsing for the purpose of influencing the parse. Parsing ambiguous computer languages requires an ability to generate a parser for arbitrary context-free grammars. In both cases we have tools for generating parsers from a grammar. However, languages that have both of these properties simultaneously are much more difficult to parse. Consequently, we have fewer techniques. One approach to parsing such languages is to endow traditional LR systems with backtracking. This is a step towards a working solution; however, there are a number of problems. In this work we present two enhancements to a basic backtracking LR approach which enable the parsing of computer languages that are both context-dependent and ambiguous. Using our system we have produced a fast parser for C++ that is composed of strictly a scanner, a name lookup stage and a parser generated from a grammar augmented with semantic actions and semantic ‘undo’ actions. Language ambiguities are resolved by prioritizing grammar declarations.

Copyright © 2006 Adrian D. Thurston and James R. Cordy. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

1 Introduction

To successfully parse modern programming languages such as C, Java and C# requires an ability to handle context dependencies. We must look up the meaning of an identifier to determine what kind of symbol we are dealing with before we proceed to parse the identifier. The established practice for dealing with this problem is the “lexical feedback hack.” During forward parsing, semantic actions are responsible for maintaining lookup tables. The lexical analyzer is then responsible for querying the type of an identifier before sending it to the parser. We have many tools which support this method of parsing.

A different language classification criterion is ambiguity. To parse context-free languages that are ambiguous requires an ability to pursue at least one parse given multiple potential parses. Again, we have many tools for generating parsers from ambiguous grammars. The techniques available to use include GLR, Earley parsing, and generalized recursive descent.

Languages that are both context-dependent and ambiguous are considerably more difficult to parse than languages with just one of these properties. They require that our facilities for considering alternatives accommodate our need to maintain and query global state. For example, we may pursue one potential parse, manipulating global state while doing so, only to discover that we have made the wrong guess and we must give up and try an alternative parse. Before we can, however, our manipulation of the global state must be abandoned. Since the nature of these manipulations is determined by the semantics of the language to be parsed, they must be programmed by the user of the parser generator. They cannot be declared by the user and generated automatically by the parser generator.

C++ is an example of a language that has context dependencies and which is ambiguous. We very rarely find C++ parsers to be generated from a grammar.

One approach to parsing these languages is to convert a standard LR parser into a backtracking parser. Examples of such systems include BtYacc [4], Basil [20], Ratatosk [15] and Lark [7]. There are a number of advantages to pursuing a backtracking LR approach. The parser will inherit the speed of LR parsing, the simplicity and power of the bottom-up semantic action model, and the ease-of-use of backtracking, giving us a natural ability to handle ambiguities.

There are two problems with simply endowing an LR parser with backtracking which make it difficult to apply the approach to the parsing of ambiguous context-dependent languages. First, existing backtracking LR systems only backtrack at the level of parsing. They do not elevate backtracking to the level of semantic actions. We therefore cannot backtrack over any attempted parse that has modified the global state in preparation for handling context dependencies. Secondly, with these systems it is difficult or impossible for the user to specify which potential parses should be preferred when the grammar contains ambiguities.

In this work we have solved these two problems. The forward parsing phase is free to manipulate the global state because our backtracker invokes semantic undo actions during backtracking. Semantic undo actions can be used to revert the effects of the forward phase. Secondly, we have devised a method of ordering the attempts of conflicting actions to achieve a user-controlled and predictable parse of an ambiguous grammar.

This approach has been successfully used to write a grammar-based C++ parser in a straightforward manner. Reduction actions are free to change the global data structures that are used to determine the type of identifiers. This may entail pushing a namespace to the declaration stack, or inserting a class name into a dictionary. Immediately below a reduction action which modifies global state, the disciplined programmer is responsible for implementing the reverse of the reduction action, for example popping the declaration stack or removing an item from a dictionary. This allows the parser to correctly backtrack.

At strategic points, such as statement boundaries, parse trees may be committed and non-reversible actions executed. In these non-reversible actions, which we call final actions, the user may perform permanent tasks such as constructing an AST or printing the result of the parse. Finally, rules for resolving C++ language ambiguities are implemented by ordering mutually ambiguous productions in the order in which they should be tried.

In the next section we discuss the various grammar-based approaches to generating parsers for ambiguous context-dependent languages, including existing backtracking LR systems. In Section 3 we describe our enhancement to backtracking LR which allows ambiguities and context dependencies to co-exist. In Section 4 we describe our enhancement which puts control of the backtracking strategy in the hands of the user. In Section 5 we show how our parsing algorithm can be applied to C++.

2 Related Work

2.1 GLR

The generalized LR parsing method [10] is one approach to parsing ambiguous context-dependent languages. Due to inherent parallelism, use of GLR relies on post-processing of the parse trees.

In the course of building a parse table, standard LR parser generators will emit an error upon discovering shift-reduce or reduce-reduce conflicts. Some parser generators may choose one action by default and produce code that runs, but does not necessarily work as intended. Other generators may simply fail to proceed. GLR parser generators will accept any grammar and will always produce a working parser regardless of the number of conflicts contained in it. At run time, the generated parser will take conflicts in stride; when encountering multiple actions on a single arc of the parse table, it will simultaneously take all actions. From then on all potential parses are parsed in lockstep. Since parsing in lockstep requires multiple stack instances, much research has gone into managing a shared stack which conserves space and computation time, making the approach much more practical.

The GLR method can be applied very successfully to the parsing of ambiguous languages, but we experience problems when we introduce context dependencies. The need to maintain type information while concurrently pursuing multiple parses requires that we also maintain multiple copies of the global data structures which store the type information.

It may be possible to extend the idea of automatic parse forest sharing to the global context dependency state. After all, the parse tree is itself a global state. No work in this area is known. However, if we consider that the structure of the global state information is dependent on the language being parsed, it seems doubtful that automatic sharing of context dependency information is a task that can be moved to the parser generator. C++ has a unique and complicated namespace structure, accompanied by many nontrivial name lookup

2.2 Generalized Top-Down

Generalized top-down parsing with full backtracking is a very flexible parsing method that can be applied to ambiguous languages. When provisions are made for handling left recursion, a wide range of parsing tasks can be implemented. The TXL [2] programming language, a language designed for prototyping and manipulating language descriptions, tools and applications, contains a parser which implements the generalized top-down parsing method. It allows the definition of arbitrary context-free grammars, according to which it will parse input using a top-down parser with full backtracking. This parsing strategy has been shown to be very useful in language design and software renovation tasks.

A key advantage of this method is that it puts the user in control of the parsing strategy when the grammar is ambiguous. The preferred order in which to attempt to parse mutually ambiguous alternatives can be specified locally. This is advantageous for grammar composition tasks in software engineering [3]. The innermost backtracking strategy makes it easy for the user to predict the result of the parse.

Definite clause grammars (DCGs) [17] are a syntactic shorthand for producing parsers with Prolog clauses which represent the input with difference lists. Prolog-based parsing is a very expressive parsing technique which can be considered a generalized top-down parsing method.
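The idea that the order of alternatives controls the result of an ambiguous parse can be illustrated with a minimal sketch of a top-down parser with full backtracking. This is not the TXL implementation or a DCG; the toy grammar, the token kinds, and the names GRAMMAR, parse, parse_sequence and first_parse are all hypothetical, chosen only for illustration. The sketch makes no provision for left recursion, so a left-recursive grammar would not terminate.

```python
# Hypothetical toy grammar. Each nonterminal maps to an ORDERED list of
# alternatives; earlier alternatives are tried (and therefore preferred) first.
# "ID * ID ;" is ambiguous here: it matches both "decl" and "expr".
GRAMMAR = {
    "stmt": [["decl"], ["expr"]],
    "decl": [["ID", "*", "ID", ";"]],
    "expr": [["ID", "*", "ID", ";"], ["ID", "+", "ID", ";"]],
}


def parse(symbol, tokens, pos, grammar):
    """Lazily yield (tree, next_pos) for each way `symbol` matches at `pos`.

    Results come out in the order the alternatives are listed, so asking the
    generator for another result is what performs the backtracking.
    """
    if symbol not in grammar:
        # Terminal: matches only if the next token kind is identical.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for alternative in grammar[symbol]:
        for children, end in parse_sequence(alternative, tokens, pos, grammar):
            yield (symbol, children), end


def parse_sequence(symbols, tokens, pos, grammar):
    """Yield (list_of_trees, next_pos) for each way to match `symbols` in order."""
    if not symbols:
        yield [], pos
        return
    head, rest = symbols[0], symbols[1:]
    for tree, mid in parse(head, tokens, pos, grammar):
        for trees, end in parse_sequence(rest, tokens, mid, grammar):
            yield [tree] + trees, end


def first_parse(start, tokens, grammar):
    """Return the most preferred parse that consumes all tokens, or None."""
    for tree, end in parse(start, tokens, 0, grammar):
        if end == len(tokens):
            return tree
    return None


if __name__ == "__main__":
    # Ambiguous input: the first listed alternative of "stmt" wins.
    print(first_parse("stmt", ["ID", "*", "ID", ";"], GRAMMAR))
    # "decl" fails on "+", so the parser backtracks and tries "expr".
    print(first_parse("stmt", ["ID", "+", "ID", ";"], GRAMMAR))
```

Swapping the two alternatives of "stmt" makes the same ambiguous input parse as an expression instead, which is the kind of local, user-visible control over ambiguity that this section describes.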