A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time
Total Page:16
File Type:pdf, Size:1020Kb
A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time Richard A. Frost and Rahmatullah Hafiz School of Computer Science, University of Windsor 401 Sunset Avenue, Windsor, Ontario Canada ON N9B3P4 [email protected] ABSTRACT is closely related to the structure of the grammar of the lan- guage to be processed, and 5) in functional programming, Top-down backtracking language processors are highly higher-order functions, called parser combinators, can be modular, can handle ambiguity, and are easy to implement defined so that language processors can be implemented as with clear and maintainable code. However, a widely-held, executable specifications of grammars, and in Logic Pro- and incorrect, view is that top-down processors are in- gramming, Definite Clause Grammars (DCGs) can be used herently exponential for ambiguous grammars and cannot to the same effect. accommodate left-recursive productions. It has been known However, a naive implementation of top-down process- for many years that exponential complexity can be avoided ing, results in the processor repeating much of its work. In by memoization, and that left-recursive productions can be the worst case, this results in exponential complexity for accommodated through a variety of techniques. However, highly-ambiguous grammars, even for recognition which is until now, memoization and techniques for handling left known to be polynomial. In addition, a naive implemen- recursion have either been presented independently, or else tation cannot accommodate left-recursive grammar produc- attempts at their integration have compromised modularity tions as the top-down search would result in infinite descent. and clarity of the code. This paper contains an informal description of a new top-down parsing algorithm which accommodates ambigu- Categories and Subject Descriptors ity and left recursion in polynomial time. The algorithm is described using the notation of set theory, and informal D.3.3 [Programming Languages]: Language Constructs proofs of termination and complexity are provided. Results and Features — Control structures; Processors — Pars- of an implementation in programming language Haskell are ing; F.4.2 [Mathematical Logic and Formal Lan- presented, more details of which are available in a techni- guages]:Grammars and Other Rewriting Systems-grammar cal report available from the School of Computer Science types; parsing;I.2.7 [Artificial Intelligence]: Natural- at the University of Windsor [2]. A formal description of language Processing-language models; language parsing the algorithm together with more-detailed proofs of partial and understanding correctness, termination and complexity is in preparation. Keywords 1.2 Misconceptions Top-down parsing, left-recursion, memoization, backtrack- ing, parser combinators. For many years it was assumed that the exponential com- plexity of top-down recognition of ambiguous sentences was inevitable. However, in 1991, Norvig [13] showed 1 INTRODUCTION that polynomial complexity for top-down recognizers, built in LISP, could be achieved by use of memoization in which Top-down backtracking language processors have a num- the results of each step of the process are stored in a memo ber of advantages compared to other methods: 1) they are table and made use of by subsequent steps. general and can be used to implement ambiguous gram- It is also widely believed that top-down language pro- mars, 2) they are easy to implement in any language which cessors cannot accommodate left-recursive productions. supports recursion. Associating semantic rules with the re- Using the search terms "top-down" and "left-recursion" on cursive functions that implement the syntactic productions Google returns over 14,000 hits. Review of the results rules of the grammar is straightforward, 3) they are highly shows the extent to which it continues to be assumed that modular (Koskomies [8]) and components can be tested in- left-recursion must be eliminated (by rewriting the gram- dependently and easily reused, 4) the structure of the code mars) before top-down processing can be used. However, ACM SIGPLAN Notices 46 Vol. 41 (5), May 2006 such rewriting is not strictly necessary, and several re- also requires that for each non-terminal N, which has searchers have proposed ways in which left-recursion can a left-recursive alternative 1) a function is added to the be accommodated: parser which places a special token N at the front of the input to be Recognized, 2) a DCG corresponding 1. Kuno [9] appears to have been the first to use the length to the rule N ::= N is added to the parser, and 3) the of the input to force termination of left-recursive de- new DCG is invoked after the left-recursive DCG has scent in top-down processing. The minimal lengths of been called. The approach accommodates explicit left- the strings generated by the grammar on the continu- recursion and maintains modularity. An extension to ation stack are added and when their sum exceeds the it also accommodates hidden left recursion which can length of the remaining input, expansion of the current occur when the grammar contains rules with empty non-terminal is terminated. However, Kuno’s method right-hand sides. The shortcoming of Nederhof and is exponential in the worst case. Koster’s approach is that it is exponential in the worst 2. Shiel [14] noticed the relationship between top-down case and that the resulting code is less clear as it and Chart parsing and developed an approach in which contains additional production rules and code to insert procedures corresponding to non-terminals are called the special tokens. with an extra parameter indicating how many termi- 6. Lickman [11] has developed a technique by which pure nals they should read from the input. When a proce- functional monadic parser combinators can be modi- dure corresponding to a rule defining a non-terminal fied to accommodate left recursion. The method is n is applied, the value of this extra parameter is parti- based on an idea put forward by Wadler in an un- tioned into smaller values which are then passed to the published paper in which he claimed that fixed points component procedures on the right of the rule defining could be used to accommodate left recursion. Lick- n. The processor backtracks when a procedure defin- man fleshes out Wadler’s idea by providing a formal ing a non-terminal is applied with the same parameter mathematical justification of termination. The method to the same input position. The method terminates for involves constructing a fixed-point combinator for the left-recursion but is exponential in the worst case. set monad and then using this function to build an 3. Leermakers [10] has developed a functional approach efficient fixed-point combinator for the parser monad to memoized parsing which avoids the left-recursion (again based on an idea by Wadler). Lickman has also problem through "recursive ascent" rather than a top- developed a program which automatically generates down search process. Although maintaining polyno- parsers in the pure functional programming language mial complexity, the approach compromises modular- Haskell from the BNF specification of the grammar. ity and clarity of the code. The method accommodates left recursion whilst main- taining modularity and clarity of the code. However, 4. In earlier work, one of the authors of this paper no- it has exponential complexity. ticed that rewriting left-recursive recognizers to non- left-recursive form is relatively simple but that rewrit- 7. Johnson [6] has developed a method by which mem- ing attributed grammars (which contain semantic as oized top-down parser combinators can accommodate well as syntactic rules) can be very difficult. To avoid left recursion in the impure-functional programming this difficulty, a method was developed in which non- language Scheme. The basic idea is to use the CPS, left-recursive recognizers are used as guards to pre- continuation-passing style, of programming so that the vent non-termination of the left-recursive executable parser computes multiple results, for ambiguous cases, attribute grammars which they guard [4]. However, incrementally. Johnson demonstrates how CPS can the method is exponential in the worst case. be integrated with memoization so that polynomial complexity and termination with left recursion can be 5. Nederhof and Koster [12] have developed a method achieved with top-down parsing. Surprisingly, John- called "cancellation" parsing in which grammar rules son’s paper has not been widely cited and his approach are translated into DCG rules such that each DCG does not appear to have been used by others. One ex- non-terminal is given a "cancellation set" as an extra planation for this could be that the approach is some- argument. Every time that a new non-terminal is what convoluted and extending it to return packed rep- derived in the expansion of a rule, this non-terminal resentations of parse trees, as in Tomita’s Chart parser is added to the cancellation set and the resulting set [27], could be too complicated. is passed on to the next symbol in the expansion. If a non-terminal is derived which is already in the set 8. Camarao, Figueiredo, and Oliveiro [1] claim to have then the parser backtracks. This technique prevents built a monadic combinator compiler generator called non-termination of left-recursion. However, by itself, Mimico which accommodates left recursion. However it would miss certain parses. Therefore, the method it does not handle ambiguous grammars. ACM SIGPLAN Notices 47 Vol. 41 (5), May 2006 1.3 Left-Recursion? is used to fail a parse branch when that branch contains a cycle introduced through left recursion. This is similar to There are two reasons why we want to implement left- the approaches proposed by Kuno [9], Shiel [14] and Lick- recursive grammars: firstly, it is often easier to add attribute man [11].