THE THRIFT PARSER

by KADE PHILLIPS
s.b. electrical engineering and computer science, massachusetts institute of technology, mmxviii

submitted to the department of ELECTRICAL ENGINEERING and COMPUTER SCIENCE

in partial fulfillment of the requirements for the degree of

MASTER OF ENGINEERING in ELECTRICAL ENGINEERING and COMPUTER SCIENCE

at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

september mmxx

Copyright 2020 Kade Phillips. All rights reserved.

The author hereby grants to mit permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

author : Kade Phillips, Dep’t of Electrical Engineering and Computer Science; 14 August 2020

certified by : Dennis Freeman, Professor, Education Officer & Thesis Supervisor; 14 August 2020

certified by : Adam Hartz, Senior Lecturer & Thesis Co-Supervisor; 14 August 2020

accepted by : Katrina LaCurts, Chair of the Master of Engineering Thesis Committee

THE THRIFT PARSER

by KADE PHILLIPS, s.b.

Submitted to the Department of Electrical Engineering and Computer Science on 14 August 2020 in partial fulfillment of the requirements for the degree of master of engineering in electrical engineering and computer science

abstract

This thesis describes the thrift algorithm, a new parsing algorithm that handles ambiguous grammars, is highly parallelizable, provides fine control over the structure of the output parse trees, and is easily extensible, particularly for the purpose of context-sensitive parsing or on-the-fly syntax changes.

Supervised by Dennis Freeman, Professor of Electrical Engineering, and Adam Hartz, Senior Lecturer.

TABLE OF CONTENTS

Front Matter

Title and Copyright
Table of Contents
Acknowledgements
Preface

Background

1 Parsers and Context-free Grammars
3 Parser Generators and Metalanguages
5 Classification of Parsing Algorithms

The thrift Parser

7 Introduction
8 Thrift Grammars and their Graphs
11 The thrift Algorithm and ast Construction
15 Examples of Parsing with the thrift Algorithm

Analysis and Further Work

19 Performance
20 Optimization
21 Use Cases
22 Further Work

Back Matter

Bibliography
Colophon

ACKNOWLEDGEMENTS

• Thanks goes first to my family, who have loved and supported me come rain or shine.
• I’d like to thank Adam and Denny, who have been not only sagacious advisors but also wonderful mentors.
• The staff of 6.01 changed my life. Although I didn’t know it at the time, applying to be an la was the best decision I’ve ever made.
• I feel special appreciation for Kika Arias, Sam Briasco-Stewart, Jeremy Kaplan, James Pollard, and Jeremy Wright. You all have my deepest love and gratitude.

advised by

DENNIS FREEMAN, ph.d.

ADAM HARTZ, m.eng.

PREFACE

The subject of my thesis changed radically over the course of my degree. The original proposal I submitted was for an interactive system to automatically generate educational drills, but it quickly became apparent that this idea was too nebulous. In response, I decided to change the subject of my thesis to digital typesetting. There I was in my element. Typography and type design were already interests of mine, and digital typesetting offered a blend of the technical and the artistic that was very refreshing.

Denny, Adam, and I wanted a typesetting system that (1) handled text, mathematics, and vector graphics in a uniform way, (2) allowed general-purpose programming, and (3) was tailored to our most common use cases. Delivering such a system – or at least a prototype – was my ultimate goal.

A new digital typesetting system is a potentially massive undertaking; Knuth’s one-year sabbatical for TEX famously took seven years. At the very least it means the creation of a brand new hybrid programming-typesetting language, and so that’s where I began.* The new language was eventually dubbed Quoin.

* While I was working on the new language, I also took time to explore several other topics that fall under the purview of digital typesetting. Unfortunately, those topics are not discussed in this thesis.

The first month of the project was spent reading about layout models used in graphic design to inform the design of the new system. Another month was spent learning programming language theory and sketching out the type system of Quoin. Then it came time to write some code.

I decided to write my own parser for Quoin, for three reasons:

• I wanted a parser that would tell me when I had written an ambiguous grammar and show me all possible parses,

• I wanted the output of the parser to be usable directly, which meant the concrete parse tree needed to be as close to an ast as possible, and

• there was a chance that the language would allow users to change or introduce syntax on-the-fly, or might otherwise not be context-free, and so the parser needed to be extensible.

(There was actually a fourth reason, as well: I wanted the experience of writing another parser.)

Work on the parser took far more time than I had estimated,* and I was eventually forced to accept that the overall project was not only ambitious, but overly ambitious. This was the third and final time the subject of my thesis changed – and I narrowed the scope of the paper to just the parser. It was the only tangible product that was complete, but happily, it was sufficiently self-contained and interesting to merit a paper by itself.

Insofar as I am aware, the thrift parser is novel. Although the techniques underlying its design are the bread and butter of a computer scientist, the algorithm as a working whole required tackling problems with non-obvious solutions. That said, the fundamental idea is meant to be simple, and I hope that the paper is accessible as a result.

* There were seven drafts altogether. All the ways I learned not to write a parser would make for an interesting paper in their own right.

PART 1 • BACKGROUND

1.1 parsers and context-free grammars

Parsing is an essential step of the compilation and interpretation of computer programs. For the most part, the parsers used in compilers and interpreters recognize context-free languages,1 but there are exceptions.2,3 The output of a parser is (by definition) a concrete parse tree, whereas the syntactic structure of a computer program forms an abstract syntax tree, or ast. A concrete parse tree usually reflects the structure of the context-free grammar, or cfg, that produced it, and may not constitute a meaningful ast. Since the latter is more useful for program analysis, tree rewriting may be necessary after a parser has processed the text of a program.

note: Parsing is often preceded by lexing; in this paper, I will refer to the input of a parser as both text and a token list interchangeably, with the understanding that one might be converted into the other.

There exist cfgs that cannot be used with all parsing algorithms. Although it is true that there are many grammars for a given context-free language, context-free languages are usually specified in terms of a particular cfg, and so one’s choice of parsing algorithm may depend not only on performance or convenience but also the language to be parsed. Two properties of cfgs are of particular note.

First: some cfgs contain non-terminals that are left-recursive, meaning that the non-terminal appears as the leftmost symbol in one of its own productions (direct left-recursion) or as the leftmost symbol after several substitutions (indirect left-recursion). Left-recursion poses a problem for many parsers; using them to parse cfgs with left-recursive non-terminals requires the transformation of the cfgs into equivalent grammars without left-recursion.4 This causes the resulting concrete parse trees to differ, and

1 For a formal treatment of context-free languages, see the second chapter of Introduction to the Theory of Computation by Michael Sipser, published in 2013.
2 In ansi C, the code (A)*B might be interpreted as either multiplication A*B or a typecast (A)(*B), and semantic analysis is necessary to resolve the ambiguity, making the C programming language context-sensitive. The majority of the C programming language, however, is context-free. See The C Programming Language by Brian Kernighan and Dennis Ritchie, published in 1988.
3 Coq allows user-defined notation to be introduced with various levels of precedence and associativity, making its specification language context-sensitive as well. See the section titled Syntax extensions and notation scopes in the Coq reference manual, whose link is given in the bibliography.
4 See Removing Left Recursion from Context-Free Grammars by Robert Moore, published in 2000.

tree rewriting becomes necessary to undo the transformation (in addition to any other tree rewriting that may be necessary).

Below and to the left is an example of a direct left-recursion; on the right is an example of an indirect left-recursion. (The relevant productions have been rubricated.)

A → A x | y B z        A → x A | B y | z
B → u B v | C w        B → C u v | w
C → y C z | x | ε      C → A x
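The transformation that footnote 4 refers to replaces each rule A → A α | β with A → β A′ and A′ → α A′ | ε. A minimal sketch of the direct case in Python follows; the function name and the list-of-symbols representation are my own, and indirect left-recursion would require additional substitution steps first.

```python
def remove_direct_left_recursion(nt, productions):
    """Transform A -> A alpha | beta into A -> beta A' and
    A' -> alpha A' | epsilon (epsilon written as the empty list)."""
    recursive = [p[1:] for p in productions if p and p[0] == nt]   # the alphas
    others = [p for p in productions if not (p and p[0] == nt)]    # the betas
    if not recursive:
        return {nt: productions}
    fresh = nt + "'"
    return {
        nt: [beta + [fresh] for beta in others],
        fresh: [alpha + [fresh] for alpha in recursive] + [[]],
    }
```

Applied to the direct example above, A → A x | y B z becomes A → y B z A′ with A′ → x A′ | ε, which is exactly the change in tree shape that later has to be undone by tree rewriting.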

Second: some texts may be ambiguous with respect to a particular cfg, meaning that they can be produced in more than one way; the grammar is then said to also be ambiguous. For example, consider the grammar

S → S S | x

which can produce the text xxxx in 5 ways:

[Figure: the five parse trees of xxxx]

In fact, using this grammar, the number of ways that a length-n string can be produced is the (n−1)th Catalan number.5 The Catalan numbers grow exponentially, and so we conclude: any parser that reports every possible way an ambiguous text might be produced will necessarily have a worst-case time complexity that is at least exponential in the length of the input.6 However, some parsing algorithms are designed to run in linear time, which precludes their ability to return multiple parses for an

5 The Catalan numbers are cataloged in the On-Line Encyclopedia of Integer Sequences as a000108. For the asymptotic behavior of the Catalan numbers, see the entry titled Catalan Number in Wolfram MathWorld. Links to both are given in the bibliography.
6 It is possible for a parsing algorithm to run in polynomial time and report multiple results for an ambiguous text so long as not every result is reported. Alternatively, a parsing algorithm might return a compressed representation of the parse (with a size that is polynomial in the length of the input) from which any particular result might be extracted; see the papers by Frost, Hafiz, and Callaghan listed in the bibliography.

ambiguous text. Linear time complexity can be achieved by giving the rules of the grammar differing precedence or more simply by refusing to deal with ambiguous grammars at all.
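The Catalan growth is easy to check numerically. The sketch below counts the parse trees of xⁿ under a binary-branching grammar such as S → S S | x by summing over where the top-level S → S S splits the string (a memoized dynamic program; the function name is my own):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_count(n):
    """Number of parse trees deriving the string x^n with S -> S S | x."""
    if n == 1:
        return 1  # the single tree S -> x
    # S -> S S splits x^n into a nonempty left part and a nonempty right part
    return sum(parse_count(k) * parse_count(n - k) for k in range(1, n))
```

Here parse_count(4) is 5, and the sequence parse_count(1), parse_count(2), ... reproduces the Catalan numbers 1, 1, 2, 5, 14, 42, ...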

1.2 parser generators and metalanguages

Parsers can be written by hand, but they can also be written by machine (and frequently are). A program that generates the source of a parser is called, fittingly, a parser generator; these programs take as their input a text file containing the formal specification of a grammar and produce as output the source of a parser that is capable of parsing that particular grammar. This source can then be included in an application, compiled separately and linked in, and so on.

Context-free grammars are formally specified using a metalanguage so that they can be read by a computer. I will briefly describe three common metalanguages: Backus-Naur form (bnf), extended Backus-Naur form (ebnf), and the metalanguage used for parsing expression grammars (pegs). Just as ambiguity and left-recursion in a language may influence one’s choice of parser, the desire to use a particular metalanguage may influence one’s choice of parser generator.

Bnf is similar to the mathematical notation used for cfgs, except ::= is used instead of → and symbols are written with quotes (if they are terminals) or angular brackets around them (if they are non-terminals). A grammar in bnf then is a collection of derivation rules of the form

<non-terminal symbol> ::= sequence of symbols | sequence of symbols | ...

Sometimes in the course of their work mathematicians will write every production of a cfg as its own rule (so that the same non-terminal symbol might appear on the left-hand side of multiple rules), but often a more succinct notation is used, where all the productions with the same left-hand side are grouped together on a line, separated by vertical bars (as in the examples above and in the rule above). Bnf uses this latter convention only.

Extended Backus-Naur form augments bnf with notation to indicate grouping (using parentheses), optionality (using square brackets), and repetition (using braces). Although the same things can be accomplished in bnf, they require writing additional productions. For example, the bnf grammar on the left and the ebnf grammar on the right have the same language:

<phrase> ::= “g” <os> <al> <punct>        phrase = “g”, {“o”},
<os>     ::= “o” <os> | “”                         [“a”,“l”],
<al>     ::= “a” “l” | “”                          (“!” | “?”) ;
<punct>  ::= “!” | “?”

This example also shows some of the syntactic differences between bnf and ebnf: symbols are separated by commas, non-terminal symbols no longer appear in angular brackets, rules can span multiple lines, and rules must be terminated by semicolons, to name a few.

Parsing expression grammars are formal grammars similar to cfgs in most respects, except for the fact that they inherently cannot be ambiguous and, if well-formed, cannot have any instances of left-recursion.7 As the name implies, a peg is a collection of parsing rules rather than a collection of productions. The interpretation is reversed: a cfg says how strings may be built up, whereas a peg says which reductions are allowed so that strings can be broken down.

The standardized syntax used to write pegs resembles a combination of the notations used for cfgs and regular expressions. (Insofar as I am aware, there is not a separate name for the metalanguage of pegs itself.) Each rule in a peg has the form

non-terminal symbol ← parsing expression

The parsing expressions are constructed inductively: a single terminal symbol is itself an expression, as is a non-terminal symbol; larger expressions can be formed by combining expressions into sequences or collections of ordered choices (written as concatenation without a separator and concatenation with / as a separator, respectively). Ordering the choices is what makes a peg unambiguous – the preferred parse is the correct parse.

7 See Parsing Expression Grammars: a Recognition-based Syntactic Foundation by Bryan Ford, published in 2004. It is unknown whether or not pegs can recognize all context-free languages: “A final open problem is the relationship and inter-convertibility of cfgs and pegs. Birman proved that TS and gTS can simulate any deterministic pushdown automata, implying that pegs can express any deterministic lr-class context-free language. There is informal evidence, however, that a much larger class of cfgs might be recognizable with pegs, including many cfgs for which no conventional linear-time parsing algorithm is known. It is not even proven yet that cfls exist that cannot be recognized by a peg, though recent work in lower bounds on the complexity of general cfg parsing and matrix product shows at least that general cfg parsing is inherently super-linear”, ibid.

Expressions can also be modified with the *, +, and ? quantifiers, which are inherited from regular expression syntax, but in pegs these match greedily rather than non-deterministically. To compensate, pegs make it possible to perform positive and negative lookahead by using the operators & and ! respectively.
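These behaviors are easy to observe in code. Below is a toy sketch of peg-style recognizers of my own devising (not a real peg library): each recognizer maps a text and a position to a new position, or to None on failure.

```python
def lit(s):
    """Match the literal string s."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def seq(*parts):
    """Match each part in order."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*alts):
    """Ordered choice (/): commit to the first alternative that succeeds."""
    def parse(text, pos):
        for alt in alts:
            result = alt(text, pos)
            if result is not None:
                return result
        return None
    return parse

def star(part):
    """Greedy repetition (*): match as often as possible, never backtrack."""
    def parse(text, pos):
        while True:
            result = part(text, pos)
            if result is None or result == pos:
                return pos
            pos = result
    return parse

def neg(part):
    """Negative lookahead (!): succeed without consuming iff part fails."""
    return lambda text, pos: pos if part(text, pos) is None else None
```

The ordered choice choice(lit("a"), lit("ab")) applied to "ab" matches only "a" – a cfg alternation would permit both parses, but the peg commits to the first.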

Notably, none of the three metalanguages we have discussed allows one to specify which non-terminals ought to become nodes of the concrete parse tree and which are dummy symbols. Providing greater control over the shape of concrete syntax trees is perhaps the most prominent feature of thrift grammars, which will be introduced in part two.

1.3 classification of parsing algorithms

Parsing algorithms can be broadly separated into two categories: top-down and bottom-up.

The family of top-down parsing algorithms includes ll parsers, recursive descent parsers (which include packrat, tail recursive, and Pratt parsers), and combinatory parsers. Top-down parsers work by considering how the productions of the grammar must be applied to derive the token stream starting from the grammar’s start symbol (which is the root of the concrete parse tree). Since trees are ordinarily drawn upside-down, the root of the tree is at the top, and so a top-down parser builds branches downwards, filling out the tree as it sees tokens.

The family of bottom-up parsing algorithms includes lr parsers (of which there are many named variants), precedence parsers, bounded context parsers, recursive ascent parsers, pika parsers, and chart parsers (such as cyk parsers and the Earley parser); lr and precedence parsers are members of a subclass of bottom-up algorithms called shift-reduce parsers. Bottom-up parsers work by starting at what will ultimately be the lower left-hand side of the tree. As a bottom-up parser moves along a token list, it builds disconnected portions of the tree, and whenever it has seen enough to deduce which production must have derived the fragments thus far, it joins them up, growing the tree upward.

Which parsing algorithms can handle ambiguity? Which algorithms can handle left-recursion without pre-parse grammar rewriting and post-parse tree rewriting? Which algorithms are most efficient, and for what kinds of input? Unfortunately, knowing which category a parsing algorithm belongs to is not sufficient to answer these questions. However, there are

some general trends (for example, a randomly chosen naïve top-down parsing algorithm probably does not handle left-recursion) and sometimes unequivocal judgements can be made; for example, shift-reduce parsers as a rule cannot handle ambiguity.

The thrift parsing algorithm – the subject of this paper – is based on combinatory parsing techniques, and so is a member of the top-down family of parsers. Combinatory parsers can straddle the boundary between parsers and parser generators, because the combinators from which a parser is built are native code objects and the structure of the parser usually reflects the structure of the grammar. Rather than generating the parser as a separate step (by reading in a specification and emitting source), programmers can write the parser directly in much the same way they’d write the specification.

For the same reasons, combinatory parsers are also well-suited to runtime changes. In fact, combinatory parsers are not restricted to parsing context-free grammars. Jeremy Kaplan, in his Master of Engineering thesis, presented a programming language and interpreter designed for education in which syntax can be added on-the-fly to serve a variety of pedagogical purposes; the ability was implemented using a combinatory parser.8

8 See An Interpreter for a Novice-Oriented Programming Language with Runtime Macros by Jeremy Kaplan, published in 2017.

PART 2 • THE THRIFT PARSER

note: Although I will refer to my own implementation of the algorithm as the thrift parser, any descriptions of the implementation ought to be read with the understanding that all of the techniques and features mentioned in this paper are part of the thrift parsing algorithm in general.

2.1 introduction

Thrift is an acronym for threaded inductive-finite-graph traversal, which is a rough description of how the parsing algorithm works: to parse a text with a cfg, the algorithm maps the grammar to an inductively-built finite graph and then parsing is performed by threads which traverse this graph.

The thrift parsing algorithm combines the functions of a parser generator and a parser, taking both a cfg and a token stream and returning one or more concrete parse trees. When a text is ambiguous with respect to a grammar, the thrift parsing algorithm returns every possible parse.

Grammars for the thrift parsing algorithm must be described in a particular format, and I will refer to objects in this format as thrift grammars. Thrift grammars encode cfgs but also specify which productions will become nodes of the resulting concrete parse tree.

2.2 thrift grammars and their graphs

Thrift grammars are created inductively using five constructors.

Grammar = Token (constraint : Either TokenValue TokenClass)
        | Sequence (label : Maybe String) (prods : List Grammar)
        | Alternation (label : Maybe String) (prods : List Grammar)
        | Repetition (label : Maybe String) (quant ∈ {*, +, ?}) (prod : Grammar)
        | Recursion (index : PositiveInteger)

For example, if ebnf were modified to allow token classes as terminal symbols in addition to token values, the grammar that would be expressed in ebnf by

arguments = identifier, {“,”, identifier} ;
assignment = identifier, “=”, (integer | arguments) ;

would be, in pseudocode,

Sequence “assignment” [
    (Token identifier)
    (Token “=”)
    (Alternation none [
        (Token integer)
        (Sequence “arguments” [
            (Token identifier)
            (Repetition none * (Sequence none [
                (Token “,”)
                (Token identifier)
            ]))
        ])
    ])
]

It is possible to represent thrift grammars more succinctly and in ways that resemble other metasyntax notations. In my own implementation, I defined infix operators that allow one to write the grammar above as

args = tok(:identifier) > (tok(“,”) > tok(:identifier)) * (0..Inf)
expr = tok(:identifier) > tok(“=”) > (tok(:integer) | args)
args.label = “arguments”
expr.label = “assignment”
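For readers who prefer runnable code to pseudocode, the five constructors can also be mirrored with Python dataclasses. This is a sketch in my own notation (my implementation uses infix operators instead), shown building the assignment grammar from above:

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# One class per constructor; a label of None means "no ast node".
@dataclass
class Token:
    constraint: str  # here: a token value or a token-class name

@dataclass
class Sequence:
    label: Optional[str]
    prods: List["Grammar"]

@dataclass
class Alternation:
    label: Optional[str]
    prods: List["Grammar"]

@dataclass
class Repetition:
    label: Optional[str]
    quant: str  # one of "*", "+", "?"
    prod: "Grammar"

@dataclass
class Recursion:
    index: int  # De Bruijn-style index of an enclosing constructor

Grammar = Union[Token, Sequence, Alternation, Repetition, Recursion]

# The assignment grammar from the running example:
assignment = Sequence("assignment", [
    Token("identifier"),
    Token("="),
    Alternation(None, [
        Token("integer"),
        Sequence("arguments", [
            Token("identifier"),
            Repetition(None, "*",
                       Sequence(None, [Token(","), Token("identifier")])),
        ]),
    ]),
])
```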

For every grammar there is an associated directed graph made by mapping each constructor of the grammar to a graph constructor. For the Token constructor – which takes a token value or a token class – the corresponding graph is a single node.

[Figure: a single node labeled “Token (value | class)”]

For the Sequence constructor – which takes an optional label and a list of productions – the corresponding graph is a chain of the graphs of the productions, bookended by Begin and End nodes.

[Figure: Begin [label] → Subgraph → ··· → Subgraph → End [label]]

For the Alternation constructor – which takes an optional label and a list of productions – the corresponding graph consists of Begin and End nodes with outgoing and incoming edges, respectively, to the graphs of the productions.

[Figure: Begin [label] with an outgoing edge to each Subgraph, and an edge from each Subgraph to End [label]]

The graph corresponding to the Repetition constructor – which takes an optional label, a quantifier, and a production – consists of a Begin node, an Epsilon node, the graph of the production, an Epsilon node, and an End node, chained in that order. If the quantifier is * or ?, there is an additional edge from the Begin node to the End node. If the quantifier is * or +, there is a backedge from the second Epsilon node to the first.

[Figure: Begin [label] → ε → Subgraph → ε → End [label], with an edge from Begin to End (for * or ?) and a backedge from the second ε to the first (for * or +)]

The index in the Recursion constructor works much like a De Bruijn index. It is a positive integer that refers to an enclosing constructor. For example, the index 1 in

Sequence none [Alternation none [(Token any) (Recursion 1)]] refers to the Alternation; if the index were 2, the index would be referring to the Sequence.

The graph corresponding to the Recursion constructor consists of just two nodes: a Jump node and a Return node. The target of the Jump node is the first node in the graph of the Grammar referenced by the Recursion.

[Figure: a Jump node followed by a Return node, with an edge from the Jump node to its Target]

2.3 the thrift algorithm and ast construction

The thrift parser works by maintaining a queue of threads, each of which explores a plausible parse. A thread consists of exactly four things:

a location, which is a pointer to a graph node,

a token index, which is an integer indexing into a list of tokens,

a path, which is a list of labels, token indices, and exit symbols,

and a stack, which is a list of labels and pointers to graph nodes.

In practice, the stack is compressed by replacing runs of empty labels (Nones) by an integer indicating the length of the run. Empty labels are never appended to the path.
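The run-length compression of the stack might look like the following sketch (function name mine):

```python
def compress(stack):
    """Replace each run of empty labels (None) with its length.
    Other entries (labels, node pointers) pass through unchanged;
    since they are never integers, the run lengths are unambiguous."""
    out = []
    for item in stack:
        if item is None:
            if out and isinstance(out[-1], int):
                out[-1] += 1          # extend the current run
            else:
                out.append(1)         # start a new run
        else:
            out.append(item)
    return out
```

For instance, [None, "bookended", None, None] compresses to [1, "bookended", 2].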

The thrift parser returns a (possibly empty) list of the threads that found a successful parse. The abstract syntax tree of a complete parse can be constructed straightforwardly from a thread’s path; a thread’s stack is used for bookkeeping to implement recursion.

The first stage of the thrift parsing algorithm is to map the input grammar to its graph. Since the mapping was explained in section 2.2 and the procedure is straightforward, we will move on to the second stage.

Besides a graph G and a token list tokenlist, the graph traversal also takes a boolean partial. If partial is true, then successful parses are not required to consume every token in tokenlist.

Graph Traversal (graph G, token list tokenlist, boolean partial)
 1  queue ← new circular queue
 2  completed ← new list
 3  progenitor ← new thread (G.first, 0, new list, new list)
 4  add progenitor to queue

The algorithm halts when every thread has died or completed. Threads do not interact, and so the main loop can be parallelized.

 5  while queue.length > 0
 6      thread ← next from queue
 7      node ← thread.location
 8      alive ← true

In the first half of a step, the computer takes action dependent on the graph node at which a thread is located.

 9      case node.type
10          when Epsilon Node
11              do nothing
12
13          when Begin Node
14              if node.label ≠ None
15                  append node.label to thread.path
16              end
17              push node.label to thread.stack
18          when End Node
19              if node.label ≠ None
20                  append exit to thread.path
21              end
22              pop from thread.stack
23
24          when Jump Node
25              push node.successor to thread.stack
26          when Return Node
27              pop from thread.stack
28
29          when Token Node
30              if thread.tokenindex ≥ tokenlist.length
31                  match ← false
32              else
33                  token ← tokenlist[thread.tokenindex]
34                  case node.constraint.type
35                      when Token Value
36                          match ← (token.value = node.constraint)
37                      when Token Class
38                          match ← (token.class = node.constraint)
39                  end
40              end
41              if match
42                  append thread.tokenindex to thread.path
43                  thread.tokenindex ← thread.tokenindex + 1
44              else
45                  alive ← false
46              end
47      end

In the second half of a step, the computer determines which graph node a thread will visit next and updates the location of the thread accordingly.

48      if not alive
49          remove thread from queue
50
51      else if node.type = Jump Node
52          thread.location ← node.target
53
54      # returning from a jump
55      else if thread.stack is not empty and thread.stack.top is a Node
56          thread.location ← thread.stack.top
57
58      else if node has no successors
59          if (thread.tokenindex = tokenlist.length) or partial
60              add thread to completed
61          end
62          remove thread from queue
63
64      else if node has exactly one successor
65          thread.location ← node.successor
66
67      # node has multiple successors
68      else
69          thread.location ← node.successors[0]
70          for each successor in node.successors[1. . .]
71              newthread ← duplicate of thread
72              newthread.location ← successor
73              add newthread to queue
74          end
75      end
76  end
77  return completed
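The thread queue and the duplication at branch points can be seen in miniature in the following executable sketch. It is drastically simplified relative to the pseudocode above: nodes are plain dicts, every node is a Token node, and the path, stack, labels, and jumps are all omitted, so it behaves like a breadth-first nfa simulation rather than the full algorithm.

```python
from collections import deque

def traverse(first, tokens, partial=False):
    """Run threads, here just (location, token index) pairs, over a
    graph whose nodes are dicts {"token": value, "next": [nodes]}."""
    queue = deque([(first, 0)])
    completed = []
    while queue:
        node, i = queue.popleft()
        if i >= len(tokens) or tokens[i] != node["token"]:
            continue                      # the match fails; the thread dies
        i += 1
        if not node["next"]:              # no successors: check for completion
            if i == len(tokens) or partial:
                completed.append(i)
        for succ in node["next"]:         # one thread per successor
            queue.append((succ, i))
    return completed

# A toy graph accepting the token sequences x y w and x z w:
n4 = {"token": "w", "next": []}
n2 = {"token": "y", "next": [n4]}
n3 = {"token": "z", "next": [n4]}
n1 = {"token": "x", "next": [n2, n3]}
```

When the thread at n1 reaches the branch, it is duplicated once per successor, exactly as in lines 68–74 of the pseudocode; the thread on the wrong branch simply dies at its token match.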

The third and final stage is the construction of parse trees from completed threads, using the procedure below. The tree returned by the procedure is structured as a list whose elements are tokens (leaves) and other lists (subtrees).

Every list except the top-level list also has a label, which is a string stored as the zeroth element of the list.

Ast Construction (token list tokenlist, list path)
 1  append exit to path
 2  path index ← 0
 3  gather ← λ
 4      set assignment scope of path index to be non-local
 5      tree ← new list
 6      loop
 7          if path index = path.length
 8              halt with error “missing closing delimiter”
 9
10          else if path[path index] = exit
11              return tree from lambda
12
13          else if path[path index] is an Integer
14              tokenindex ← path[path index]
15              append tokenlist[tokenindex] to tree
16
17          else
18              label ← path[path index]
19              path index ← path index + 1
20              subtree ← gather()
21              prepend label to subtree
22              append subtree to tree    # not concatenation
23          end
24          path index ← path index + 1
25      end
26  end
27  ast ← gather()
28  if path index ≠ path.length - 1
29      halt with error “extra closing delimiter”
30  end
31  return ast
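The procedure above translates almost line-for-line into Python. In the sketch below (names my own), a sentinel object stands in for the exit symbol and nonlocal implements the shared path index:

```python
EXIT = object()  # sentinel standing in for the exit symbol

def ast_construction(tokenlist, path):
    path = list(path) + [EXIT]
    i = 0  # the path index

    def gather():
        nonlocal i
        tree = []
        while True:
            if i == len(path):
                raise ValueError("missing closing delimiter")
            elif path[i] is EXIT:
                return tree
            elif isinstance(path[i], int):
                tree.append(tokenlist[path[i]])  # a leaf token
            else:
                label = path[i]
                i += 1
                subtree = gather()
                subtree.insert(0, label)   # the label is the zeroth element
                tree.append(subtree)       # append, not concatenate
            i += 1

    ast = gather()
    if i != len(path) - 1:
        raise ValueError("extra closing delimiter")
    return ast
```

With the token list ["l", "m", "r"] and the path ["bookended", 0, 1, 2, EXIT], this returns [["bookended", "l", "m", "r"]], matching the first worked example in section 2.4.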

2.4 examples of parsing with the thrift algorithm

Consider the following thrift grammar, which produces a token for the letter m which is then bookended by zero or more l/r pairs (which we might think of as parentheses).

Alternation none [
    (Token “m”)
    (Sequence “bookended” [
        (Token “l”)
        (Recursion 2)
        (Token “r”)
    ])
]

In roughly equivalent ebnf, this might be written as

bookended = “l”, expr, “r” ;
expr = “m” | bookended ;

As mentioned, this grammar produces token lists of the form

m    l m r    l l m r r    l l l m r r r    ...

Its corresponding graph looks like



[Figure: the graph of the grammar, with nodes numbered 1–9: a Begin node (1) with edges to a Token “m” node (2) and to a Begin “bookended” node (3); node 3 leads through Token “l” (4), Jump (5), Return (6), Token “r” (7), and End “bookended” (8); nodes 2 and 8 both lead to the End node (9); the target of the Jump node is node 1]

The nodes of the graph have been annotated with the numbers 1–9 so that we can reference them as we walk through an example of parsing with the thrift algorithm. Our input token list will be l m r.

We begin with a single thread whose location is Node 1, whose token index is 0 (pointing to l), and whose path and stack are empty.

On its first step, because Node 1 is a Begin node without a label, we append None to the stack (but do not append to the path). Because Node 1 is an Alternation, the thread duplicates itself and then moves to Node 2. The copy moves to Node 3.

On the thread’s second step, because Node 2 is a Token node, it attempts a match. The thread’s token index is still 0, pointing to the token l in the input token list, but Node 2 has the value m, and so the thread dies.

The other thread then runs. This time, the Begin node has a label, and so we append bookended to the path and the stack, and finish the step by moving to Node 4.

There we match against l, and the match succeeds, so the token index is advanced to 1 and 0 is appended to the path. At this point, the path is [bookended, 0] and the stack is [None, bookended].

On the next step, because Node 5 is a Jump node, we append Node 6 to the stack and then jump to Node 1.

Again, Node 1 is a Begin node without a label, and so we append None to the stack. Now the stack looks like [None, bookended, Node 6, None]. The thread then duplicates itself and moves to Node 2. The copy moves to Node 3. The copy will eventually die, so we’ll ignore it.

At Node 2 we match against m, and the match succeeds (since the token index is 1, pointing to m in the input token list), so the token index is advanced to 2 and 1 is appended to the path. The path is now [bookended, 0, 1], and we finish the step by moving to Node 9.

Node 9 is an End node, and so the top of the stack (None) is popped off, but the node does not have a label, and so exit is not appended to the path. The top of the stack is now Node 6, which is a sign that we ought to return from a jump, and so we relocate to Node 6.

At Node 6, we simply pop the Node 6 off the top of the stack.

At Node 7, we match against r, and the match succeeds, so the token index is advanced to 3 and 2 is appended to the path. The path is now [bookended, 0, 1, 2]; the stack is [None, bookended].

Node 8 is an End node with a label, and so exit is appended to the path and the top of the stack is popped. The path is now [bookended, 0, 1, 2, exit] and the stack is just [None].

Node 9 is an End node without a label, and so the path is not modified, but the top of the stack is popped off, leaving it empty. There are no outgoing edges, and so we check that the token index, 3, is equal to the length of the input list. It is, and so the thread is marked as completed.

Finally, the ast construction procedure converts the path into a parse tree: [[bookended l m r]].

As a second example (to highlight the relationship between labels in thrift grammars and the resulting parse trees), consider the grammar

Sequence none [
    (Token “x”)
    (Sequence none [
        (Token “y”)
        (Alternation “interior” [
            (Repetition none * (Token “z”))
            (Recursion 2)
        ])
        (Token “y”)
    ])
]

or, in roughly equivalent ebnf,

interior = {“z”} | subexp ;
subexp = “y”, interior, “y” ;
expr = “x”, subexp ;

This grammar produces token lists of the form

xyy        xyzy        xyzzy        ···
xyyyy      xyyzyy      xyyzzyy      ···
xyyyyyy    xyyyzyyy    xyyyzzyyy    ···
···        ···         ···          ···

Note that the native grammar is capable of expressing something that ebnf is not: although we want the subexp production, we do not want it labelled; instead, we want the alternation labelled. Thrift productions that are labelled become nodes of the ast, whereas unlabelled productions do not.

For instance, if the input token list were xyyzzyy, the resulting parse tree would be [ • x y [ interior y [ interior z z ] y ] y ]


and not, for example, [ • x [ • y [ interior y [ interior [ • z z ] y ] y ] ] ].


(Bullets in the flattened trees indicate empty labels; in well-formed output, only the root of the tree lacks a label.)

PART 3 • ANALYSIS AND FURTHER WORK

3.1 performance

The performance of the thrift parser is strongly dependent on the grammar it is given; as mentioned in section 1.1, for ambiguous grammars, the time complexity can be exponential in the length of the input (though it is not necessarily that bad in practice). For unambiguous grammars, the thrift parser runs in something close to linear time, assuming thread duplication is fast.

The tables below show statistics from parsing four grammars with a variety of inputs. The threads column indicates the total number of threads that ran over the course of the parse, the steps column indicates the total number of steps taken by all threads, the avg steps/thread column indicates the average lifetime of each thread, and the avg steps/character column indicates the average number of execution steps altogether per input character. Inputs that (correctly) cause the parse to fail are in red.

expression ·= “(” expression “)” | “x” | “(” “(” “x” “)”

Input          Input Length    Threads    Steps     Avg Steps/Thread    Avg Steps/Character
x                         1          3        7                  2.3                    7.0
(((x)))                   7          9       52                  5.8                    7.4
(((x))                    6          9       50                  5.6                    8.3
(((x)                     5          9       44                  4.9                    8.8
(((x)))]                  8          9       52                  5.8                    6.5
(((x))]                   7          9       50                  5.6                    7.1
(((x)]                    6          9       44                  4.9                    7.3
((...(x)...)            512        515    4 351                  8.4                    8.5
((...(x)...)          2 048      2 051   17 407                  8.5                    8.5
((...(x)...)          8 192      8 195   69 631                  8.5                    8.5

expression ·= “(” expression “)” | “[” expression “]” | “x”

Input          Input Length    Threads    Steps     Avg Steps/Thread    Avg Steps/Character
[(x)]                     5          7       29                  4.1                    5.8
[(x])                     5          7       23                  3.3                    4.6
[...[x]...]           8 191      8 193   45 052                  5.5                    5.5

expression ·= (“(” expression “)” | “x”)+

Input          Input Length    Threads    Steps     Avg Steps/Thread    Avg Steps/Character
(x)(x)                    6         12       58                  4.8                    9.7
((x)(x))                  8         15       78                  5.2                    9.8
((x))(x)                  8         15       76                  5.1                    9.5
((x)x)                    6         12       60                  5.0                   10.0
(xx)(xx)...           8 192     14 338   73 734                  5.1                    9.0

expression ·= “x” expression “z” | “y”*

Input          Input Length    Threads      Steps       Avg Steps/Thread    Avg Steps/Character
xz                        2          5          21                    4.2                   10.5
xyz                       3          6          28                    4.7                    9.3
xyyz                      4          7          35                    5.0                    8.8
xyyyz                     5          8          42                    5.3                    8.4
xy...yz                  16         21         133                    6.3                    8.3
xy...yz                 256        259       1 799                    6.9                    7.0
xy...yz               4 096      4 099      28 679                    7.0                    7.0
xy...yz              65 536     65 539     458 759                    7.0                    7.0
xy...yz           1 048 576  1 048 579   7 340 039                    7.0                    7.0

3.2 optimization

The biggest performance overhead that a thrift implementation can have is likely thread creation, since threads are spawned so frequently. In fact, for a naïve implementation that copies the contents of a thread’s path and stack, the tests above will run in quadratic time as measured by a wall clock, because – although the number of threads spawned is linear in the length of the input – the cost of copying the contents of a thread’s path and stack grows linearly with the length of the input parsed thus far.

However, threads could be copied in constant time if a thread’s path and stack were actually just pointers to two shared data structures. I suggest using two prefix trees,* one for paths and one for stacks. Although I have not written and tested a proof of concept, I expect this strategy is sufficient

* This could be seen as a borrowing from glr parsers, although I was unaware of this at the time I was struck by the idea.

to get an implementation to run in linear time as measured by a wall clock, living up to the promising asymptotic behavior of Avg Steps/Character.
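As a sketch of the idea (hypothetical code, not tied to any particular implementation), a path or stack can be represented as a pointer into a shared parent-linked structure, so that copying a thread copies one reference rather than a whole list:

```python
class Cell:
    # One node of a shared prefix tree; each thread keeps only a
    # pointer to its leaf, so branches share their common prefix.
    __slots__ = ("value", "parent")
    def __init__(self, value, parent=None):
        self.value, self.parent = value, parent

def push(leaf, value):   # O(1): grow this thread's branch
    return Cell(value, leaf)

def pop(leaf):           # O(1): step back toward the root
    return leaf.value, leaf.parent

def materialize(leaf):   # O(n), but only needed once, at the end
    out = []
    while leaf is not None:
        out.append(leaf.value)
        leaf = leaf.parent
    return out[::-1]

shared = push(push(None, "bookended"), 0)   # common prefix [bookended, 0]
a = push(shared, 1)                         # one thread's path
b = push(shared, 2)                         # a duplicate that diverged
```

Duplicating a thread is then a single pointer copy, and divergent suffixes branch off the shared prefix, exactly as in a prefix tree.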

[Figure: the stacks of several threads (stack a, stack b, stack c) stored as branches of a common prefix tree.]

The drawback to using the prefix trees is that threads become much less independent. Without prefix trees and left-recursion handling (discussed in section 3.4), threads do not interact whatsoever after creation; the use of either requires shared memory, and parallelization becomes more difficult as a result.

3.3 use cases

The primary purpose of the thrift parser is to be a tool for programming language development, which is reflected in its three design goals: that it support ambiguity (which has been touched on already), that its output require a minimal amount of post-processing – producing asts directly if possible (which is the reason for grammar labels), and that it be extensible. This last point perhaps deserves further explanation.

It is easy to perform context-sensitive parsing with only a small number of modifications to the thrift algorithm; two approaches in particular come to mind. First, since the graphs traversed by threads are native code objects that can be created – and modified – at runtime, it is possible to change syntax part-way through the processing of a token stream. Second, threads can also be given additional state, which can be used to make their behavior hysteretic.

This is not to say the thrift parser is only usable for language development or experimentation. It can be used for generic parsing tasks. Because of

its similarities to Kaplan’s parser,8 the thrift parsing algorithm may also have applications in education.

The applications that the thrift parser is least suited for are likely to be applications where performance is critical. Because the parser does not perform lookahead – it optimistically spawns threads and waits to find out that a path is a dead end, instead – the constant factor of its runtime complexity can be poor, even when it has the same asymptotic complexity as other parsing methods.

3.4 further work

In the thrift parser, left-recursion occurs when there are zero-weight cycles in a directed graph. A zero-weight cycle is a cycle of edge-connected nodes (including both the forward-edges from jump nodes to their return nodes as well as the back-edges from jump nodes to their targets) that does not contain any Token nodes. Threads in these cycles can loop indefinitely without being forced to match tokens (that is, without making forward progress or dying), and can continue to spawn new threads as they do so.
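Such cycles can be found with an ordinary graph search restricted to edges that do not consume a token. A sketch follows, assuming a hypothetical adjacency-list format {node: [(successor, consumes_token), ...]}:

```python
def zero_weight_cycle_nodes(graph):
    # Keep only edges that do not consume a token, then report every
    # node that can reach itself through those edges alone.
    free = {n: [s for s, consumes in edges if not consumes]
            for n, edges in graph.items()}
    on_cycle = set()
    for start in graph:
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for succ in free.get(node, []):
                if succ == start:
                    on_cycle.add(start)
                elif succ not in seen:
                    seen.add(succ)
                    stack.append(succ)
    return on_cycle
```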

The version of the thrift parser presented in this paper does not terminate when presented with a grammar containing left-recursion. What follows is a sketch of how to add support for left-recursive grammars.

• It is straightforward to find the zero-weight cycles of a graph after it has been constructed; these need to be precomputed.

• When a thread enters a zero-weight cycle for the first time, it should record its current location (the entry node) and then register a new thread group. The membership of each thread group is maintained in a global table; every thread keeps a record of the names of the groups to which it belongs. (There may be several zero-weight cycles that share one or more nodes; a new thread group will need to be created when a thread enters each of them.) Whenever a thread duplicates itself, each copy adds itself to the groups that the original belongs to.

• When a thread is about to revisit an entry node, it first consults the global table. If other members of the relevant thread group are active, the thread stalls. When there are no longer active members of the

8 See An Interpreter for a Novice-Oriented Programming Language with Runtime Macros by Jeremy Kaplan, published in 2017.

relevant thread group, meaning that the other members have either died or are also waiting, then the thread does one of two things:

• If the members of the group collectively made forward progress – meaning that the length of the longest path among all the group’s threads, including those that died, is greater now than when this thread last visited the entry node – then the thread proceeds.

• Otherwise, the thread kills itself, as it has no hope of making forward progress.
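The bookkeeping sketched in the bullets above might look like the following (a hypothetical structure; the names are mine, not the thesis’s):

```python
class ThreadGroupTable:
    # Global table from the sketch above: tracks, per thread group,
    # how many members are still running and the longest path length
    # recorded so far (the measure of forward progress).
    def __init__(self):
        self.active = {}    # group id -> count of running members
        self.best = {}      # group id -> longest path length seen

    def join(self, gid):
        self.active[gid] = self.active.get(gid, 0) + 1
        self.best.setdefault(gid, 0)

    def leave(self, gid, path_len):   # called on death or stall
        self.active[gid] -= 1
        self.best[gid] = max(self.best[gid], path_len)

    def may_proceed(self, gid, last_seen):
        # A stalled thread proceeds only when no member is running and
        # the group collectively made forward progress; otherwise it
        # should kill itself.
        return self.active[gid] == 0 and self.best[gid] > last_seen
```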

This definition of forward progress may not be a sufficient condition to bound unproductive left-recursion, or it may be too strict and prevent threads from exploring viable parses. The first step toward supporting left-recursion will be writing a proof that these criteria are correct, or, if they are not, finding another condition that is actually both sufficient and permissive.

There are significant rewards for this work, however. With support for left-recursion, most if not all of the remaining need for tree rewriting is obviated, even when the target language contains a mix of left- and right- associative operators.

Besides handling left-recursion, future work includes the optimization mentioned in section 3.2. Other work may include quality-of-life features, such as an ancillary function to build a grammar given a list of operators with associativity and precedence.
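As an illustration of what such a helper might look like (entirely hypothetical; the names and output format are mine), one could emit one ebnf rule per precedence level, with the recursion direction chosen by associativity:

```python
def expression_grammar(levels, atom='"x"'):
    # levels: [(operators, "left" | "right")], loosest binding first.
    # Emits one rule per precedence level; left-associative levels come
    # out left-recursive and thus rely on the machinery of section 3.4.
    lines = []
    for i, (ops, assoc) in enumerate(levels):
        this = f"e{i}"
        tighter = f"e{i + 1}" if i + 1 < len(levels) else atom
        alts = " | ".join(f'"{op}"' for op in ops)
        if assoc == "left":
            lines.append(f"{this} = {this} , ({alts}) , {tighter} | {tighter} ;")
        else:
            lines.append(f"{this} = {tighter} , ({alts}) , {this} | {tighter} ;")
    return "\n".join(lines)
```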

BIBLIOGRAPHY

1 Introduction to the Theory of Computation. Michael Sipser, 2013. Published by Cengage Learning.

2 The C Programming Language. Brian Kernighan and Dennis Ritchie, 1988. Published by Prentice Hall.

3 The Coq Reference Manual. Coq Development Team, 2020. https://coq.inria.fr/distrib/current/refman/user-extensions/syntax-extensions.html

4 Removing Left Recursion from Context-Free Grammars. Robert Moore, 2000. Published by the Association for Computational Linguistics.

5 Catalan numbers. The On-Line Encyclopedia of Integer Sequences. https://oeis.org/A000108

Catalan Number. Stanley and Weisstein. https://mathworld.wolfram.com/CatalanNumber.html

6 A new top-down parsing algorithm to accommodate ambiguity and left recursion in polynomial time. Frost and Hafiz, 2006. Published by the Association for Computing Machinery.

Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars. Frost, Hafiz, and Callaghan, 2007. Published by the Association for Computational Linguistics.

Parser Combinators for Ambiguous Left-Recursive Grammars. Frost, Hafiz, and Callaghan, 2008. Published by the Association for Computing Machinery.

7 Parsing Expression Grammars: a recognition-based syntactic foundation. Bryan Ford, 2004. Published by the Association for Computing Machinery.

8 An Interpreter for a Novice-Oriented Programming Language with Runtime Macros. Jeremy Kaplan, 2017. Published by the Massachusetts Institute of Technology.

COLOPHON

This document was set in Andada, a typeface designed by Carolina Giovagnoli for Huerta Tipográfica. Text layout and rendering were done with the XeTeX typesetting engine.

An electronic version of this document, along with its source code, is available at

kadephillips.dev/thesis