An Optimal Tabular Parsing Algorithm

AN OPTIMAL TABULAR PARSING ALGORITHM MarkJan Nederhof University of Nijmegen Department of Computer Science Toerno oiveld ED Nijmegen The Netherlands markjancskunnl Abstract A rst attempt to solve this problem is to use predic tive LR PLR parsing PLR parsing allows simulta In this pap er we relate a number of parsing algorithms neous pro cessing of a common prex provided that which have b een developed in very dierent areas of the lefthand sides of the rules are the same However parsing theory and which include deterministic algo in case we have eg the rules A and B 1 2 rithms tabular algorithms and a parallel algorithm where again is not the empty string but now A B We show that these algorithms are based on the same then PLR parsing will not improve the eciency We underlying ideas therefore go one step further and discuss extended LR By relating existing ideas we hop e to provide an op ELR and commonprex CP parsing which are al p ortunity to improve some algorithms based on features gorithms capable of simultaneous pro cessing of all com of others A second purp ose of this pap er is to answer a mon prexes ELR and CP parsing are the foundation question which has come up in the area of tabular pars of tabular parsing algorithms and a parallel parsing al ing namely how to obtain a parsing algorithm with the gorithm from the existing literature but they have not prop erty that the table will contain as little entries as b een describ ed in their own right p ossible but without the p ossibility that two entries To the b est of the authors knowledge the various represent the same sub derivation parsing algorithms mentioned ab ove have not b een dis cussed together in the existing literature The main Introduction purp ose of this pap er is to make explicit the connec tions b etween these algorithms Leftcorner LC parsing is a parsing strategy which has b een used in dierent guises in various areas of com A second purp ose of this pap er is to show that CP puter science Deterministic LC parsing with k symbols and ELR parsing are obvious solutions to a problem of of lo okahead can handle the class of LCk grammars tabular parsing which can b e describ ed as follows For Since LC parsing is a very simple parsing technique and each parsing algorithm working on a stack there is a at the same time is able to deal with left recursion it is realisation using a parse table where the parse table often used as an alternative to topdown TD parsing allows sharing of computation b etween dierent search which cannot handle left recursion and is generally less paths For example Tomitas algorithm can b e seen ecient as a tabular realisation of nondeterministic LR parsing cmp-lg/9405025 26 May 94 Nondeterministic LC parsing is the foundation of a At this p oint we use the term state to indicate the very ecient parsing algorithm related to Tomitas symbols o ccurring on the stack of the original algo algorithm and Earleys algorithm It has one disad rithm which also o ccur as entries in the parse table vantage however which b ecomes noticeable when the of its tabular realisation grammar contains many rules whose righthand sides In general p owerful algorithms working on a stack b egin with the same few grammars symbols eg lead to ecient tabular parsing algorithms provided the grammar can b e handled almost deterministically A j j 1 2 In case the stack algorithm is very nondeterministic for a certain grammar however sophistication which in where is not the empty string After an LC parser creases the number of states may lead to an increasing has recognized the rst symbol X of such an it will number of entries in the parse table of the tabular re as next step predict all aforementioned rules This alization This can b e informally explained by the fact amounts to much nondeterminism which is detrimental that each state represents the computation of a number b oth to the timecomplexity and the spacecomplexity of sub derivations If the number of states is increased then it is inevitable that at some p oint some states Supp orted by the Dutch Organisation for Scientic Re represent an overlapping collection of sub derivations search NWO under grant notation of rules A A with the same which may lead to work b eing rep eated during parsing 1 2 lhs is often simplied to A j j Furthermore the parse forest a compact representa 1 2 A rule of the form A is called an epsilon rule tion of all parse trees which is output by a tabular algorithm may in this case not b e optimally dense We assume grammars do not have epsilon rules unless stated otherwise We conclude that we have a tradeo b etween the case The relation P is extended to a relation on V V that the grammar allows almost deterministic parsing as usual The reexive and transitive closure of is and the case that the stack algorithm is very nondeter denoted by ministic for a certain grammar In the former case so We dene B A if and only if A B for some phistication leads to less entries in the table and in the The reexive and transitive closure of is denoted by latter case sophistication leads to more entries pro and is called the leftcorner relation vided this sophistication is realised by an increase in We say two rules A and B have a com the number of states This is corrob orated by empirical 1 2 mon prex if and for some data from which deal with tabular LR parsing 1 1 2 2 1 and where 2 As we will explain CP and ELR parsing are more A recognition algorithm can b e sp ecied by means deterministic than most other parsing algorithms for of a pushdown automaton A T Alph Init Fin many grammars but their tabular realizations can which manipulates congurations of the form v never compute the same sub derivation twice This rep where Alph is the stack constructed from left resents an optimum in a range of p ossible parsing algo to right and v T is the remaining input rithms The initial conguration is Init w where Init This pap er is organized as follows First we discuss Alph is a distinguished stack symbol and w is the input nondeterministic leftcorner parsing and demonstrate The steps of an automaton are sp ecied by means of the how common prexes in a grammar may b e a source of relation Thus v v denotes that v bad p erformance for this technique is obtainable from v by one step of the automaton Then a multitude of parsing techniques which ex The reexive and transitive closure of is denoted by hibit b etter treatment of common prexes is dis The input w is accepted if Init w Fin cussed These techniques including nondeterministic where Fin Alph is a distinguished stack symbol PLR ELR and CP parsing have their origins in theory of deterministic parallel and tabular parsing Subse LC parsing quently the application to parallel and tabular parsing is investigated more closely For the denition of leftcorner LC recognition we need stack symbols items of the form A Further we briey describ e how rules with empty where A is a rule and Remember that righthand sides complicate the parsing pro cess we do not allow epsilon rules The informal meaning The ideas describ ed in this pap er can b e generalized of an item is The part b efore the dot has just b een to headdriven parsing as argued in recognized the rst symbol after the dot is to b e rec We will take some lib erty in describing algorithms ognized next For technical reasons we also need the from the existing literature since using the original de items S S and S S where S is a fresh scriptions would blur the similarities of the algorithms symbol Formally to one another In particular we will not treat the use of lo okahead and we will consider all algorithms work LC y I fA j A P A S g ing on a stack to b e nondeterministic We will only y describ e recognition algorithms Each of the algorithms where P represents the augmented set of rules consist can however b e easily extended to yield parse trees as ing of the rules in P plus the extra rule S S a sideeect of recognition Algorithm Leftcorner LC LC A T I Init Fin Init S S Fin The notation used in the sequel is for the most part S S Transitions are allowed according to the standard and is summarised b elow following clauses A contextfree grammar G T N P S consists of two nite disjoint sets N and T of nonterminals and B C av terminals resp ectively a start symbol S N and a B C A a v y nite set of rules P Every rule has the form A where there is A a P such that A C where the lefthand side lhs A is an element from N A a av A a v and the righthand side rhs is an element from V B C A v where V denotes N T P can also b e seen as a B C D A v relation on N V y where there is D A P such that D C We use symbols A B C to range over N symbols B A A v B A v a b c to range over T symbols X Y Z to range over V symbols to range over V and v w x in the The conditions using the leftcorner relation to range over T We let denote the empty string The rst and third clauses together form a feature which is called topdown TD ltering TD ltering makes sure for some grammar G with LC parsing for some gram that sub derivations that are b eing computed b ottom mar G which results after applying a transformation up may eventually grow into sub derivations with the re called leftfactoring quired ro ot TD ltering is not necessary for a correct Leftfactoring consists of replacing

An Optimal Tabular Parsing Algorithm

A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages

Efficient Recursive Parsing

Advanced Parsing Techniques

Verifiable Parse Table Composition for Deterministic Parsing*

Basics of Compiler Design

Deterministic Parsing of Languages with Dynamic Operators

Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice

An Efficient Context-Free Parsing Algorithm for Natural Languages1

Comp 411 Principles of Programming Languages Lecture 3 Parsing

Head-Driven Transition-Based Parsing with Top-Down Prediction

9 Deterministic Top-Down Parsing

The CYK Algorithm Informatics 2A: Lecture 20