
A Cross-Language Examination of Using Automatic Memoization

Benjamin Cassell
University of Waterloo
[email protected]

Abstract

This paper demonstrates the key abilities of varying languages to support functional recursive descent parsing through automatic memoization. It begins by summarizing the most pertinent works that have dealt with using automatic memoization for the purposes of top-down parsing. Next, this paper provides an in-depth examination of a variety of popular modern programming languages to determine their support for the easy implementation of an automatic memoization mechanism that is sufficiently powerful to aid recursive descent parsing. In particular, two primary types of solutions are examined throughout the various languages: macro-based solutions, and first-class citizen function-based solutions in combination with mutation.

1 Introduction

Memoization has been an important technique in various computationally-intensive environments for decades. Michie, in one of the earliest published works to examine the role of memoization in machine learning, described it as a method akin to the way that humans learn, and extolled the virtues of using memoization to improve performance in computationally-intensive applications [19]. Due to the ability of memoization to prevent unnecessary work on input that has been previously examined, it is a critical component in the successful reduction of computational complexity for a variety of parsing techniques. This review chooses to focus on the benefits of memoization for functional top-down backtracking parsers, which can be greatly improved in a simple, natural way with the intelligent definition of memoized functions [20].

Interestingly, recursive descent parsers gain more than just improved performance from memoization. In fact, recursive descent parsers, when combined with the proper techniques and sufficient memoization abilities, are proven to be able to parse left-recursive languages [18]. This surprising increase in ability provided by memoization is an incredibly potent argument in favour of the technique.

It is important to consider, however, that manual implementation of memoization in a complex parser can consume a considerable amount of time and effort. Regardless of the potential benefits, memoization that is tedious for the programmer and results in unreadable and messy code is far less useful. This alone often serves as a barrier to the use of memoization in fields that require rapid development or prototyping [13].

This is where automatic memoization becomes useful: it allows for all the benefits of memoization with minimal coding effort and loss of maintainability. For parsing, this is particularly important as the grammar of the parser becomes highly complicated or convoluted (for example, the notoriously complicated grammar for the C++ programming language). But this raises the question: is automatic memoization available to coders in most situations? More importantly, is the automatic memoization powerful enough to support context-free parsing (in other words, does it properly memoize recursive functions)?

The contributions of this paper are as follows:

• A brief summary of the automatic memoization techniques introduced by Norvig [20] and Johnson [18] in their respective works, and how they benefit parsing.

• An examination of multiple popular modern programming languages to determine the possibility (and simplicity) of implementing general-purpose automatic memoization in them, which in turn allows for the usage of the parsing techniques demonstrated by Norvig and Johnson.

• A sample implementation of a subset of Scheme in C#, and a working implementation of Norvig and

Johnson's memoized parsing algorithms built on top of it.

• A discussion of the feature differences between the languages that make automatic memoization possible or difficult.

The remainder of this work is structured as follows: Section 2 gives a brief overview of the Johnson and Norvig articles that serve as the basis for this review. Section 3 explores implementations of automatic memoization for use in parsing across a variety of popular modern languages, and section 4 discusses the findings of this paper about the availability of automatic memoization for context-free parsing. Section 5 discusses possible future work, and section 6 concludes the paper.

2 Background and Related Work

As mentioned in section 1, the concept of memoization in computer science is not a new one. The idea of memoizing the answers to well-known problems is ancient, and the application of memoization in computing has been studied since at least 1968, when Michie published an article in Nature [19] examining memoization's usefulness within the context of machine learning algorithms. Since then, a great body of work has focused on using memoization to improve everything from thread scheduling [10] and proof systems [5] to multimedia applications [4].

More recently however, the concept of automatic memoization has become increasingly important. Automatic memoization can be thought of as the implementation of a higher-level memoization paradigm, which allows a programmer to provide memoization benefits to any given function with minimal effort (usually by producing memoized versions of functions using a generic memoizer function, or by marking functions as memoizable using dedicated statements or other language properties).

Automatic memoization is highly valuable to parsing as it allows for rapid prototyping without losing high performance. It allows programmers to focus their efforts on optimizing important special cases [20], and it also prevents code bases from becoming bloated and unmaintainable [13, 14]. For recursive descent parsing in particular, automatic memoization techniques allow functions to continue resembling grammar production rules, even when memoized. This prevents grammar rules from becoming cluttered, and the relationship between a recursive function and its associated grammar rule from becoming blurred.

2.1 Parsing with Automatic Memoization

Although automatic memoization predates Peter Norvig's 1991 work (for example, with a sample produced by Abelson and Sussman six years earlier [3]), it was the first major work to examine automatic memoization within the scope of parsing context-free languages [20]. Norvig demonstrates the complexities associated with providing automatic memoization for functional recursive descent parsing: because grammar rules can be recursive, simply producing a lookup-function generator can result in improperly memoized recursive functions.

Norvig provides an implementation of automatically memoized parsing in Lisp, using the setf function to circumvent issues with recursive function calls. Norvig adds a generic memoizer call on top of functions representing production rules in a sample grammar. Calls to these rules are re-bound to the memoized versions, even in recursive (and mutually recursive) cases. Norvig additionally demonstrates the utility of making the automatic memoization function flexible enough to allow different types of hashing and equality testing on arguments, with a simple change to the hashing and equality algorithms changing the time complexity of a sample parser from quartic to cubic.

Norvig also shows that a similar implementation for automatic memoization would be effective for parsing in Scheme, either through the use of the set! function or through macros. Finally, Norvig demonstrates that other languages, for instance Pascal, could implement automatic memoization through constructs such as a memo keyword (given that the language's parser can be changed). This review contends that this method is too heavy-handed to be useful. As such, we concentrate on demonstrating which languages support automatic memoization through already-provided features and not through explicitly modifying the language core.

2.2 Extending Memoized Parsing

Most typical implementations of recursive descent parsing suffer from similar problems, and foremost among these is the inability of minimally-implemented top-down parsers to recognize left-recursive grammars. Given a left-recursive grammar, naive recursive descent parsers fail to terminate and, without proper supplementation, automatic memoization does nothing to change this. In his 1995 work, Johnson shows how to solve this problem using automatic memoization in combination with continuation-passing style (CPS) [18].

CPS is employed by Johnson to reverse the flow of computation in a top-down parser, resulting in values that travel downwards through the computation instead of being pulled upwards from the bottom. Additionally, Johnson memoizes on arguments to the parser and associates them with lists of continuations. This creates a system in which no grammar function is evaluated on

the same arguments more than once (subsequent evaluations are passed forwards to the continuations memoized in the lookup table). In the case of left recursion, this implies that the recursive cycle is broken after at most one recursive call, preventing non-termination.

Johnson provides a full implementation of both the automatic memoizer and CPS-based parser in Scheme, affirming Norvig's previous conjectures about its feasibility [20]. Johnson requires some manipulation to be able to define mutually-recursive production rules (due to what he refers to as a vacuous deficiency in the Scheme specification), and this highlights an important limitation on languages that do not allow properly mutually-recursive definitions: such languages need to support some combination of function predeclarations or first-class citizen functions with lambda expressions to directly allow for easy implementation of recursive descent parsers. Other more inconvenient methods may be available to languages without these features, for example defining a set of values up front and operating over interpretations of said values, but this defies some of the spirit of functional recursive descent parsing, which is to have functions that resemble the production rules of the grammar.

3 Exploration of Automatic Memoization

This section examines implementations of automatic memoization in various popular modern programming languages. It highlights the techniques used to achieve automatic memoization, whether or not it is sufficient for use in functional context-free parsing, and the benefits and difficulties of the implementation or language. For every language listed here, unless otherwise noted, a functioning code sample is included with the source code that accompanies this paper.

3.1 C#

The most comprehensive examination of automatic memoization performed by this paper was conducted using the C# programming language. This, and all other examinations of .NET languages (C++.NET and VB.NET), were conducted using the .NET 4.0 language SDK and run-time environment.

Automatic memoization in C# has been attempted by others before with success, for example by Wes Dyer in 2007 [11]. Like these previous implementations, the C# version of automatic memoization employed by this work uses the .NET Func templated type, which is a first-class representation of a function (albeit one with specifically typed arguments and a specifically typed return value). The Memoize function takes in a similarly templated Func object, as well as an object implementing the IDictionary interface to use as a memo (allowing for different kinds of caching algorithms, including custom ones). The function returns a lambda expression representing the memoized version of the input function. The implementation of this function can be seen in listing 1.

Of note is that the function must check whether or not the input key is null, as IDictionary types in .NET languages do not operate on null keys. Furthermore, this function only operates on functions that take a single input for ease of implementation. There are, however, several methods that can be used to extend the number of parameters accepted by this function. These methods are discussed in section 4.3. The this keyword qualifying the function fn is syntactic sugar, and defines Memoize as an extension method of the type Func. With this definition, all objects of type Func accepting one value and returning one value can invoke Memoize as if it were a method belonging to their class definition.

Having also defined a templated ConsCell data type representing two values joined by a cons function (as in Lisp or Scheme, based on an implementation by Dustin Campbell [9]), as well as the higher-order functions implemented by Norvig and Johnson in their works, a C# parsing function representing the production rule (S → NP VP) in Johnson's example grammar would resemble the sample in listing 2. The implementations of the higher-order alt and seq functions can be seen in listings 3 and 4. Note that the vacuous macro from the Johnson work is explicitly in-lined, as C# does not support macro expansion. Furthermore, as with Memoize, the seq function is defined as an extension method.

The full implementation of the Norvig parser can be seen in the C# source code that accompanies this work and is available upon request. It also includes a fully functioning example of Johnson's CPS-based parser using automatic memoization written in C#. It is not included in this paper due to space constraints.

3.2 Lisp, Scheme and Racket

As described in section 2, automatic memoization that supports context-free parsing is readily available in both Lisp and Scheme, as shown by Norvig [20] and Johnson [18]. Sample code is provided in Racket using Johnson's Scheme method (which is textually identical, as Racket is just a super-set of the older PLT Scheme). Interestingly, Lisp, Scheme and Racket all support macro expansion, meaning an alternative (but likely inferior) automatic memoization solution using macros that would be able to support context-free parsing is viable.

3.3 Visual Basic .NET

Because VB.NET is a completely Microsoft-defined language (unlike C++.NET), it is able to work natively with

static Func<T1, T2> Memoize<T1, T2>(this Func<T1, T2> fn,
                                    IDictionary<T1, T2> memo)
{
    return key =>
    {
        if (key == null)
            return fn(key);
        T2 value;
        if (memo.TryGetValue(key, out value))
            return value;
        value = fn(key);
        memo.Add(key, value);
        return value;
    };
}

Listing 1: Automatic Memoization in C#
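The usage pattern that makes a memoizer like listing 1 safe for recursive functions is the re-binding trick from section 2.1: the label through which recursive calls resolve is re-bound to the memoized wrapper. A minimal Python sketch of this idea (an illustration written for this summary, with hypothetical names; it is not from the paper's accompanying source code):

```python
# Generic memoizer: returns a wrapper that consults the memo first.
def memoize(fn, memo):
    def wrapper(key):
        if key not in memo:
            memo[key] = fn(key)
        return memo[key]
    return wrapper

def fib(n):
    return n if n < 2 else fib(n - 2) + fib(n - 1)

# Re-binding the name "fib" is the crucial step: the recursive calls
# inside the body now resolve to the memoized wrapper, so every
# subproblem is computed only once.
fib = memoize(fib, {})

print(fib(50))  # 12586269025
```

Without the final re-binding, the wrapper would cache only top-level calls while the internal recursion bypassed the memo entirely, which is exactly the failure mode that Norvig's setf-based re-binding avoids.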

static Func<ConsCell<string>, ConsCell<ConsCell<string>>> S =
    Memoize<ConsCell<string>, ConsCell<ConsCell<string>>>(
        input => NP.Seq(VP)(input),
        new Dictionary<ConsCell<string>, ConsCell<ConsCell<string>>>());

Listing 2: Sample Grammar Rule (S → NP VP)

static Func<ConsCell<string>, ConsCell<ConsCell<string>>> Alt(
    this Func<ConsCell<string>, ConsCell<ConsCell<string>>> a,
    Func<ConsCell<string>, ConsCell<ConsCell<string>>> b)
{
    return input =>
    {
        return a(input).Union(b(input));
    };
}

Listing 3: Sample Higher Order alt Function

static Func<ConsCell<string>, ConsCell<ConsCell<string>>> Seq(
    this Func<ConsCell<string>, ConsCell<ConsCell<string>>> a,
    Func<ConsCell<string>, ConsCell<ConsCell<string>>> b)
{
    return input =>
    {
        return Reduce<ConsCell<ConsCell<string>>, ConsCell<ConsCell<string>>>(
            Union, null, b.Map(a(input)));
    };
}

Listing 4: Sample Higher Order seq Function
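The behaviour of alt and seq is easier to see in a compact Python sketch, treating a parser as a function from an input sequence to the set of unconsumed suffixes (an illustrative reconstruction of the combinators' semantics, not code from the paper; the mini-grammar below is hypothetical):

```python
# A parser maps a tuple of tokens to the set of possible remainders.
def word(w):
    def parse(inp):
        return {inp[1:]} if inp[:1] == (w,) else set()
    return parse

# alt: union of the results of both alternatives.
def alt(a, b):
    return lambda inp: a(inp) | b(inp)

# seq: run b on every remainder produced by a, collecting all results.
def seq(a, b):
    return lambda inp: {rest2 for rest1 in a(inp) for rest2 in b(rest1)}

NP = alt(word("john"), word("mary"))
VP = word("runs")
S = seq(NP, VP)  # S -> NP VP

# A complete parse leaves the empty remainder in the result set.
print(() in S(("john", "runs")))  # True
print(S(("runs", "john")))        # empty set: no parse
```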

most of the complex .NET types that C# has access to. This makes automatic memoization supporting context-free parsing very straight-forward to implement, by directly translating the C# version into VB.NET line by line. This translation can be seen in listing 5, as well as a sample application on the Fibonacci sequence. It has all the same associated benefits and drawbacks as the C# version, and can also be turned into an extension method if desired with a small addition.

3.4 C++.NET and C++11

Although it is possible to create a version of automatic memoization that can support context-free parsing in C++.NET using .NET constructs, it would be much more useful if there was a generic version that could also be used without any changes inside C++11. This code would be much more portable and thus applicable across a much greater variety of systems. Happily, such a compromise exists. The sample in listing 6 shows a C++ memoize function that properly handles the automatic memoization of recursive functions in both environments.

Like the C# implementation, the sample given here is templated and accepts a single-argument function, but could be easily modified using the methods in section 4.3 to accept functions with more generic amounts of arguments. It could also be modified to accept a collection-style object like the .NET automatic memoization implementations, instead of simply using the STL map object. Because C++ supports macro expansion through the C preprocessor, a more limited alternative solution could also be constructed using macros, in a manner similar to the solution demonstrated in section 3.8.

3.5 Python

In Python, automatic memoization that handles recursive functions can be provided in a variety of ways. One particularly clean way, as suggested in various online resources [17], is to create a memoized class which, when initialized, attaches an empty associative list and an input function to itself. When the object is invoked, it performs the appropriate memoized lookup. A sample of this implementation style, which is able to handle arbitrary parameters and types, is given in listing 7.

3.6 JavaScript

JavaScript is often an unfairly-maligned language among programmers. Implementations of automatic memoization that can handle context-free parsing in JavaScript are surprisingly simple, elegant and easy to understand. The implementation of memoize given in listing 8 is an example provided by Colin Ihrig [15]. It is functionally equivalent to the C# example, but as with the Python example can support functions with arbitrary numbers and types of arguments.

3.7 Java

Whereas most other languages examined in this work succeeded in some way or another, Java falls short, unfortunately. Although a solution for direct automatic memoization exists, as shown by Tom White's 2003 implementation [24], it does not properly memoize recursive functions. The technique employed by White is to create a memoization class, which has an invocation handler for other interfaces. The programmer subsequently creates an object which adheres to a desired interface, and memoizes it by passing it to the memoization class. Any function calls to interface-bound members of the returned object are then routed through the memoization class first.

Because only the first call to the object passes through the memoized layer, recursive calls and calls to other functions are not properly memoized. To make matters worse, White's solution is likely the best automatic memoization available for Java in its current release. Java does not treat functions as first-class citizens, does not offer lambda expressions, and does not natively support macro expansion. Because of these limitations, there is likely no easily accessible Java solution for true automatic memoization that is appropriate for use in context-free parsing. Of course, a manual solution is always possible, and probably not burdensome enough to warrant avoiding Java as a language if other circumstances support its usage.

3.8 C

Automatic memoization in C is tricky. While C does provide function pointers, they are little more than the memory addresses that functions live at post-compilation and provide little of use in terms of automatic memoization. What C does provide, on the other hand, is the powerful C preprocessor, which can be used to provide automatic memoization macros as demonstrated in listing 9. This is a very basic but functional example, and something more user-friendly and generic could almost certainly be created given more time and thought. For the sake of brevity, the hash table implementation is excluded from this document.

The memoize macro takes in several values, the first of which is a hash table initialization function. This function in turn accepts a pointer to the hash table (allowing it to be initialized on the stack or allocating new memory with malloc if the pointer is null), as well as the

Public Function Memoize(Of T1, T2)(fn As Func(Of T1, T2),
        memo As IDictionary(Of T1, T2)) As Func(Of T1, T2)
    Return (Function(key As T1) As T2
                If (IsNothing(key)) Then
                    Return fn(key)
                End If
                Dim value As T2
                If (memo.TryGetValue(key, value)) Then
                    Return value
                End If
                value = fn(key)
                memo.Add(key, value)
                Return value
            End Function)
End Function

Dim Fib As Func(Of UInteger, ULong) =
    Memoize(Function(n As UInteger) As ULong
                If (n < 2) Then
                    Return n
                End If
                Return Fib(n - 2) + Fib(n - 1)
            End Function, New Dictionary(Of UInteger, ULong)())

Listing 5: Automatic Memoization in VB.NET

template <typename ArgType, typename ReturnType>
function<ReturnType(ArgType)> memoize(function<ReturnType(ArgType)> fn)
{
    map<ArgType, ReturnType>* memo = new map<ArgType, ReturnType>();
    return ([=](ArgType key) -> ReturnType {
        if (memo->find(key) == memo->end())
            (*memo)[key] = fn(key);
        return (*memo)[key];
    });
}

function<unsigned long(unsigned)> fib;
fib = memoize<unsigned, unsigned long>([&](unsigned n) -> unsigned long {
    return ((n < 2) ? n : fib(n - 2) + fib(n - 1));
});

Listing 6: Automatic Memoization in C++

class Memoize:
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}

    def __call__(self, *args):
        if args not in self.memo:
            self.memo[args] = self.fn(*args)
        return self.memo[args]

def Fib(n):
    if n < 2:
        return n
    return Fib(n - 2) + Fib(n - 1)

Fib = Memoize(Fib)

Listing 7: Automatic Memoization in Python
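As a side note (an addition to this summary, not part of the original text), Python 3.2 and later also ship an automatic memoizer in the standard library: functools.lru_cache. Because decorating re-binds the function name at definition time, recursive calls are routed through the cache automatically, which is exactly the property required for parsing:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache, like the class-based sample
def fib(n):
    if n < 2:
        return n
    return fib(n - 2) + fib(n - 1)

print(fib(100))  # 354224848179261915075
```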

function memoize(func) {
    var memo = {};
    var slice = Array.prototype.slice;
    return function() {
        var args = slice.call(arguments);
        if (args in memo)
            return memo[args];
        else
            return (memo[args] = func.apply(this, args));
    };
}

function fib(n) {
    if (n < 2)
        return n;
    return fib(n - 2) + fib(n - 1);
}

fib = memoize(fib);

Listing 8: Automatic Memoization in JavaScript
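One caveat of the sample above (an observation added here, foreshadowing section 4.4) is the cache key: `args in memo` implicitly stringifies the argument array, so distinct argument lists with the same string form share a cache entry. A Python sketch keyed on the tuple of arguments avoids that collision:

```python
def memoize(fn):
    memo = {}
    def wrapper(*args):
        # Tuples hash by value, so 1 and "1" get distinct cache entries,
        # unlike string-coerced keys where both would map to "1".
        if args not in memo:
            memo[args] = fn(*args)
        return memo[args]
    return wrapper

@memoize
def add(a, b):
    return a + b

print(add(1, 2))      # 3
print(add("1", "2"))  # 12 (string concatenation, cached separately)
```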

byte size and number of keys that will be stored in the hash table. The next arguments accepted by memoize are hash table functions to determine if a key exists in the hash table and to return a value based on its key. It further accepts the name of the key, the type of value being returned by the function, and the number of values the hash table should retain. When successful, the memoize macro statically allocates a hash table pointer to heap-initialized memory for the function being memoized. Calls to the function are then checked for hash table residence before the remainder of the function is executed.

Because the C implementation relies on macros, it also requires the use of an additional memo_return macro, which replaces any return statements that should have their values memoized. The memo_return macro accepts the function used to set values in the hash table, the name of the key in the function, and the expression whose value should be returned. This macro expands to the appropriate statements that hash and return the value of the expression. Although the memoize and memo_return macros sound complicated, in practice they are very easy to use and provide the full benefits of automatic memoization.

3.9 Haskell

Due to time constraints, this work has not performed an in-depth examination of the Haskell language for its compatibility with automatic memoization, and thus no accompanying source code is provided. That said, there is no reason to believe that Haskell should not be able to support automatic memoization which is appropriate for use in context-free parsing, as the Template Haskell language extension provides even more powerful and expressive macro support than the C preprocessor.

Other sources have claimed moderate success using Template Haskell to create automatically memoized functions in Haskell that support recursion [8], and although these results have not been independently confirmed by this work, the scope of the claims being made is entirely reasonable. In the cited example, Mike Burrell reports that a first pass at an automatic memoization macro that supports recursive functions works as intended, but consumes far too much memory. As such, further work would be needed to ensure that memory use does not become a concern when automatically memoizing functions for the purposes of context-free parsing in Haskell.

4 Discussion

Almost all of the languages examined in this work demonstrate similar abilities which allow for the types of declarations necessary to implement automatic memoization and recursive descent parsers with functions that directly parallel the production rules of a grammar. These language features, roughly summarized, are:

• Function predeclarations or lambda expressions, allowing for the definition of mutually-recursive functions representing grammar rules.

• First-class functions and the ability to re-bind a label to a newly generated function or value (mutation).

• If mutation or higher-order functions are lacking, sufficient preprocessor power to be able to add memoization boilerplate code wrapping the contents of a function.

Despite most of the discussed languages sharing some or all of these features, there are a few language-specific properties that render several of the explored languages less attractive for usage in recursive descent parsing.

4.1 Tail Call Optimization

One particularly pressing problem is the issue of tail and sibling call optimization. A successful parse using a recursive descent parser can require a significant amount of recursive calls to complete (CPS style, for example, can potentially recurse without unwinding any stack frames until final completion), and if an environment is incapable of properly providing tail call optimization this can drastically limit the usability of recursive descent parsers implemented in that environment. That said, such optimizations may also harm a memoized application if they eliminate tail recursion altogether: if a tail-recursive call is improperly optimized out (for example, turned into a modified loop), this could prevent an application from taking advantage of memoized results.

Most of the languages examined in this work have either poor or no support for tail call optimization. While languages like Scheme and Haskell provide explicit, well-defined and predictable tail call optimization, many languages like Java do not. For directorial, historical or other reasons, some languages do not support tail call optimization and never will without a significant change in development focus and principles. Python's creator, as an example, is known to be particularly resistant to the idea of adding tail call optimization to Python's language specification [22, 23].

Potentially even worse than knowing that tail call optimization will not happen in a memoization-friendly way is being unsure of whether or not it will. Although many implementations of C allow for some form of tail or sibling call optimization, in gcc (for example) this decision

#define MAX_FIB 16384 // The maximum size to use for the hash table

#define memoize(init_fn, contains_fn, get_fn, key_name, ret_type, num_memo_vals) \
    static hashtable* memo = NULL;                                                \
    ret_type memo_ret;                                                            \
    if (memo == NULL)                                                             \
        memo = init_fn(NULL, sizeof(ret_type), num_memo_vals);                    \
    if (contains_fn(memo, key_name))                                              \
        return get_fn(memo, key_name);

#define memo_return(set_fn, key_name, ret_expr) \
    memo_ret = ret_expr;                        \
    set_fn(memo, key_name, memo_ret);           \
    return memo_ret;

unsigned long fib(unsigned int n)
{
    memoize(hashtable_init, hashtable_contains, hashtable_get, n,
            unsigned long, MAX_FIB);
    if (n < 2)
        return n;
    memo_return(hashtable_set, n, fib(n - 2) + fib(n - 1));
}

Listing 9: Automatic Memoization in C

is left to the compiler (at levels of optimization O2 and higher) [1, 2], which can result in highly variable optimization decisions.

Func<int, int> Fib = null;
Fib = (n => n > 1 ?
    Fib(n - 1) + Fib(n - 2) : n);

Listing 10: Stack Overflow in C#

Func<int, int> Fib = null;
Fib = (n => n > 1 ?
    Fib(n - 2) + Fib(n - 1) : n);

Listing 11: Insignificant Change, Significant Gains

Managed environments suffer as well. All .NET languages compile into an intermediate language called MSIL, which is just-in-time (JIT) compiled by the .NET run-time environment. None of these MSIL compilations include the .tail instruction that induces tail call optimization [7]. Worse still, the JIT compiler for .NET programs may or may not dynamically add .tail instructions at program run-time, ensuring that programmers do not know whether or not their programs will overflow the stack. The code sample in listing 10 is a very simple C# lambda expression that can cause stack overflows even when memoized. Here, assuming that the JIT compiler will perform the appropriate optimization leads to failure cases.

The sample code in listing 10 was run for about 10,655 recurses before it crashed. This number was not static, and slightly different values were obtained with varying stability based on the performance of the JIT compiler. This sample further highlights an important aspect of programming recursive functions in languages that do not provide adequate tail call optimization: simply reversing the order that the recursive calls to Fib are made in (shown in listing 11) reduces the use of the call stack by half for equivalent inputs. It can be generalized from this that recursively programmed functions should be intelligently catered towards early and shallow termination when possible, depending on environmental limitations.

Despite the grim prognosis for tail call optimization in a large subset of languages, some other languages are improving. JavaScript, which is becoming substantially more popular thanks to the adoption of HTML5, traditionally did not support tail call optimization. With the introduction of the newest JavaScript standard however, this is reported to be changing [6]. Even with a complete lack of tail call optimization, it is still possible to provide the illusion of optimized tail recursion through the use of trampolining in languages that support higher-order functions [16]. Other techniques can also allow

for the illusion of trampolining in languages that do not cleanly support first-class citizen functions (for example, with function pointers, setjmp and longjmp in C).

4.2 Macro Support

Lack of macro support can limit the development of simple automatic memoization, or at least make it more difficult. Currently, VB.NET does not support macro expansion, and neither does C#, by design choice [12]. Regardless, both languages support automatic memoization through their implementations of first-class functions and lambda expressions. Other languages, like C and Haskell, have a much easier time supporting automatic memoization through the use of macro expansion than through any other means.

As mentioned in section 3.7, Java currently lacks support for functions as first-class citizens. Because it also lacks macro expansion capabilities, Java is incapable of easily providing true automatic memoization support, and is perhaps, as a result, an inappropriate language choice for recursive descent parsing. This is due to change with the release of Java 8, however, which is expected to introduce both functions as first-class citizens and lambda expressions [21].

One potential solution for macro-less languages is to write macros using a text-based macro implementation from another language, passing any necessary source code through the external macro handler before compiling or interpreting. This, however, is clunky and inconvenient - a poor substitute for language-native support of paradigms enabling automatic memoization.

4.3 Variadic Argument Support

Although variadic argument support - the ability for functions in a language to accept and work on variable amounts and types of parameters - is not required for enabling context-free parsing through automatic memoization, it is highly convenient and prevents the programmer from having to define multiple memoization functions for varying numbers and types of arguments. VB.NET, C++.NET, C++11 and C#, the languages whose examples in this paper did not use variadic arguments, did not omit them for lack of capability. All of these languages support some form of variadic arguments, as does Java. C# and VB.NET both have the ability to operate over parameter lists (params and ParamArray respectively), and accepting a parameter list of type object would allow a user to pass in arbitrary numbers of arguments of any type. C++.NET supports similar functionality through templated array lists in function definitions. Lastly, C++11 supports a generic Args... type, which is nearly identical to the .NET parameter lists in practice.

An alternative strategy for a language that does not support variadic arguments can come in different forms: In an object-oriented language, a wrapper class containing variable amounts of elements can be used as the sole parameter to functions (for example, the tuple type available in C++ and all .NET languages). In a language where all types inherit from a common object type, collections containing the object type can be used to emulate this as well. Templating the memoization function is also an effective strategy for accepting generic argument types. Finally, function overloading can be used to create multiple memoization functions accepting different amounts of parameters up to a sensible finite limit. These overloaded functions can simulate true variadic functions for most useful purposes.

4.4 Limitations of Automatic Memoization

It should be noted that there are several important limitations to automatic memoization. First and foremost, automatic memoization is only useful if the results of a function are deterministic based on input, or failing this, if results that are simply "good enough" are sufficient. Automatic memoization of procedures that can return varying values (for example, based on user input) is most likely not useful. See section 5 for a discussion of how automatic memoization can be taken advantage of in a system where values are acceptable if they are simply "good enough".

Automatic memoization can also incur penalties in terms of both computation time and space if used ineffectively. A poor caching strategy can result in inefficient lookups and slow performance. Mistakes in how items are cached can result in degenerate behaviour (in some languages, for example, mistakenly indexing on the addresses of objects instead of a stringified representation of an arbitrary object will cause all lookups to result in misses). Furthermore, automatic memoization can actually slow down fast-running functions and should not be applied without careful consideration.

Finally, unless there is a way to flush an automatically memoized function's cache, it can continue to grow and consume space indefinitely. This is easy enough to circumvent by forcing the programmer to include an externally referenced cache object when memoizing a function, but that also implies that the burden of knowing how and when to flush the cache is on the programmer. A programmer must fully examine whether or not it is useful to spend memory memoizing results: Ignoring the closed form of the algorithm, a Fib function that will only ever be used to list terms in successive order should not be memoized, as it can be implemented using constant space. On the other hand, if the Fib function is to be accessed arbitrarily, memoization will provide significant wins.
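To make these points concrete, the ideas above (variadic arguments, an externally referenced cache that the caller is responsible for flushing, and the Fib example) can be sketched in Python, one of the capable languages examined in this work. The memoize helper and its cache attribute below are illustrative, not the paper's C# implementation:

```python
def memoize(func, cache=None):
    """Memoize a variadic function. The caller may supply an external
    dict-like cache so that it can be inspected, shared, or flushed."""
    if cache is None:
        cache = {}

    def memoized(*args):
        # The argument tuple is the cache key, so arguments must be
        # hashable values; keying on object identity instead would
        # cause the degenerate all-misses behaviour described above.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    memoized.cache = cache  # expose the cache for external flushing
    return memoized


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


fib = memoize(fib)   # rebinding the name routes recursive calls through the cache
print(fib(30))       # 832040, computed in linear rather than exponential time
fib.cache.clear()    # the burden of deciding when to flush falls on the caller
```

Rebinding the name fib is what lets the recursive calls themselves hit the cache, which is the "memoize recursive functions properly" requirement for recursive descent parsing.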

5 Future Work

This research has opened up several interesting possibilities for me in future work. The first does not stem directly from automatic memoization for context-free parsing, but rather from the mini-recreation of Scheme in C#. The recreation used for the purposes of this paper was very ad hoc and limited. It was also not as well-engineered as it could have been. Having learned some important lessons during the implementation of this project, I would like to re-examine creating an elegant and intelligent solution that better represents Scheme inside C# and that simplifies ad-hoc development and prototyping.

A second possibility for future work, which is directly related to context-free parsing, is adding automatic parallelization to the Norvig and Johnson parsers. With the provided C# implementation, it should be fairly straightforward to parallelize certain aspects of parsing. In particular, the alt function could spawn two threads which simultaneously handle both parsing paths. A thread that fails or succeeds could use futures or raise events to inform other thread groups of acquired information. In the case of constructing unions (as is required for both seq and alt), the unions could be built in parallel by adding results to thread-safe collections.

The .NET environment makes integrating parallelization into automatically memoized context-free parsers especially easy: The implementation of automatic memoization provided by this work in C# already allows for a generic object adhering to the interface IDictionary to be used for memoization. Because .NET provides a thread-safe dictionary type (ConcurrentDictionary) that implements this interface, making the memoization thread-safe, and thus parallelization-compatible, is as easy as passing an instance of ConcurrentDictionary in when memoizing functions. Automatically parallelizing the parsing techniques in this implementation would be a large step towards making it as powerful and useful as similar parallelized parsing libraries available in languages like Racket.

One final possibility for future work is examining the applications of automatic memoization in cloud and internet-based computing. As previously mentioned in section 4, when results that are simply "good enough" are considered acceptable, programmers gain lenience on the functions and procedures to which they may apply memoization. One could imagine a cloud-based ad serving system in which a programmer could memoize ad information based on consumer preferences. In such a system, the advertisement data could be stored along with datestamps, and the automatic memoization function could be modified to take the staleness of the returned information into account when serving it. If the cached information is considered sufficiently fresh or otherwise "acceptable", it would be returned. Otherwise, the cached copy would be updated to a value fetched from a remote location.

6 Conclusion

Recursive descent parsers are straightforward, effective, and have the added benefit of being defined such that their functions closely resemble the grammar rules that they represent. Unfortunately, typical recursive descent parsers can be very slow, and in naive implementations they do not terminate on left-recursive grammars. Automatic memoization can fix both of these issues, drastically improving performance and providing termination on left-recursive grammars using Johnson's 1995 CPS techniques.

It has been shown in this work that a significant number of modern languages support adequate facilities to provide automatic memoization for context-free parsing (primarily, the ability to memoize recursive functions properly). These languages include C, C#, JavaScript, VB.NET, Racket, Python, and all modern versions of C++. Although Haskell was not directly examined, there is sufficient evidence to reasonably expect it to behave similarly. These languages achieve this automatic memoization in two primary ways: either through lambda expressions and first-class citizen functions combined with mutation, or through the use of macro expansion.

Unfortunately, some languages, even though they may support some form of automatic memoization, cannot currently support automatic memoization that handles recursive descent parsing. Java is one such language, although it should be updated in the near future to have full support for recursive automatic memoization. Furthermore, there are scenarios where automatic memoization can be harmful if used carelessly, and as such it should be approached with careful consideration.

Automatic memoization is a fascinating topic, with a wide array of applications in and beyond context-free parsing. It is heartening to see that the kinds of techniques it requires are becoming a regular focus in most programming languages. With potential applications from this work in terms of cloud-based programming as well as the automatic parallelization of context-free parsing, one hopes that automatic memoization continues to be increasingly considered as an appropriate and effective method for improving the performance and power of algorithms and applications.
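The staleness-aware caching proposed for the cloud ad-serving scenario in the future work discussion can be sketched as follows. This is a minimal illustration in Python, not part of the paper's implementation; the memoize_with_ttl helper, the fetch_ad stand-in, and the 60-second freshness threshold are all hypothetical:

```python
import time


def memoize_with_ttl(func, max_age_seconds):
    """Memoize func, but treat cached entries older than max_age_seconds
    as stale and recompute them. A sketch of serving cached results only
    while they are considered "good enough"."""
    cache = {}  # args -> (timestamp, value)

    def memoized(*args):
        now = time.time()
        if args in cache:
            stored_at, value = cache[args]
            if now - stored_at <= max_age_seconds:
                return value        # cached copy is still fresh enough
        value = func(*args)         # stale or missing: fetch anew
        cache[args] = (now, value)
        return value

    return memoized


# Hypothetical stand-in for a remote ad-server lookup.
def fetch_ad(consumer_id):
    return "ad-for-%s" % consumer_id


get_ad = memoize_with_ttl(fetch_ad, max_age_seconds=60.0)
print(get_ad("alice"))  # computed and cached
print(get_ad("alice"))  # served from the cache while fresh
```

In a real system the staleness threshold would be tuned per data source, trading freshness of advertisement data against the cost of remote fetches.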

7 Acknowledgements

I would like to thank Professor Prabhakar Ragde in addition to my supervisor Tim Brecht, as well as my other colleagues and friends at the University of Waterloo. I would additionally like to thank all the authors whose code samples and insights acted as the base for this work. Without their contributions this would not have been possible.

References

[1] GCC: Options that control optimization. http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.

[2] Tail recursion. http://c2.com/cgi/wiki?TailRecursion, January 2012.

[3] Abelson, H., and Sussman, G. J. Structure and Interpretation of Computer Programs. MIT Press, Cambridge, Massachusetts, 1985.

[4] Alvarez, C., Corbal, J., and Valero, M. Fuzzy memoization for floating-point multimedia applications. IEEE Transactions on Computers 54, 7 (2005), 922–927.

[5] Beame, P., Impagliazzo, R., Pitassi, T., and Segerlind, N. Memoization and DPLL: Formula caching proof systems. In Proceedings of 18th IEEE Annual Conference on Computational Complexity (July 2003).

[6] Benvie, B. JavaScript (ES6) has proper tail calls. http://bbenvie.com/articles/2013-01-06/JavaScript-ES6-Has-Tail-Call-Optimization, January 2013.

[7] Broman, D. Enter, leave, tailcall hooks part 2: Tall tales of tail calls. http://blogs.msdn.com/b/davbr/archive/2007/06/20/enter-leave-tailcall-hooks-part-2-tall-tales-of-tail-calls.aspx, June 2007.

[8] Burrell, M. Memoizing Haskell. http://www.onjava.com/pub/a/onjava/2003/08/20/memoization.html, April 2007.

[9] Campbell, D. Building data out of thin air. http://diditwith.net/2008/01/01/BuildingDataOutOfThinAir.aspx, January 2008.

[10] Cui, H., Wu, J., Che Tsai, C., and Yang, J. Stable deterministic multithreading through schedule memoization. In Proceedings of OSDI 2010 (October 2010).

[11] Dyer, W. Function memoization. http://blogs.msdn.com/b/wesdyer/archive/2007/01/26/function-memoization.aspx, January 2007.

[12] Gunnerson, E. Why doesn't C# support #define macros? http://blogs.msdn.com/b/csharpfaq/archive/2004/03/09/86979.aspx, March 2004.

[13] Hall, M., and Mayfield, J. Improving the performance of AI software: Payoffs and pitfalls in using automatic memoization. In Proceedings of Sixth International Symposium on Artificial Intelligence (Monterrey, Mexico, September 1993).

[14] Hall, M., and McNamee, J. P. Improving software performance with automatic memoization. Johns Hopkins APL Technical Digest 18, 2 (1997), 254–260.

[15] Ihrig, C. Implementing memoization in JavaScript. http://www.sitepoint.com/implementing-memoization-in-javascript/, August 2012.

[16] Jack, S. Bouncing on your tail. http://blog.functionalfun.net/2008/04/bouncing-on-your-tail.html, April 2008.

[17] Jason. Re: What is memoization and how can I use it in Python? http://stackoverflow.com/a/1988826, July 2010.

[18] Johnson, M. Memoization in top-down parsing. Computational Linguistics 21, 3 (1995), 405–417.

[19] Michie, D. Memo functions and machine learning. Nature 218 (1968), 19–22.

[20] Norvig, P. Techniques for automatic memoization with applications to context-free parsing. Computational Linguistics 17, 1 (1991), 91–98.

[21] Oracle. Lambda expressions. http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html, October 2013.

[22] van Rossum, G. Final words on tail calls. http://neopythonic.blogspot.com.au/2009/04/final-words-on-tail-calls.html, April 2009.

[23] van Rossum, G. Tail recursion elimination. http://neopythonic.blogspot.com.au/2009/04/tail-recursion-elimination.html, April 2009.

[24] White, T. Memoization in Java using dynamic proxy classes. http://www.onjava.com/pub/a/onjava/2003/08/20/memoization.html, August 2003.
