
Practical Earley Parsing and the SPARK Toolkit

by

John Daniel Aycock
B.Sc., University of Calgary, 1993
M.Sc., University of Victoria, 1998

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Computer Science

We accept this dissertation as conforming to the required standard

Dr. R. N. Horspool, Supervisor (Department of Computer Science)

Dr. J. H. Jahnke, Departmental Member (Department of Computer Science)

Dr. M.-A. D. Storey, Departmental Member (Department of Computer Science)

Dr. K. F. Li, Outside Member (Department of Electrical and Computer Engineering)

Dr. T. A. Proebsting, External Examiner (Microsoft Research)

© John Daniel Aycock, 2001
University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.

Supervisor: Dr. R. N. Horspool

Abstract

Domain-specific, "little" languages are commonplace in computing. So too is the need to implement such languages; to meet this need, we have created SPARK (Scanning, Parsing, And Rewriting Kit), a toolkit for little language implementation in Python, an object-oriented scripting language. SPARK greatly simplifies the task of little language implementation. It requires little code to be written, and accommodates a wide range of users — even those without a background in theory. Our toolkit is seeing increasing use on a variety of diverse projects. SPARK was designed to be easy-to-use with few limitations, and relies heavily on Earley's general parsing algorithm internally, which helps in meeting these design goals. Earley's algorithm, in its standard form, can be hard to use; indeed, experience with SPARK has highlighted several problems with the practical use of Earley's algorithm. Our research addresses and provides solutions for these problems, making some significant improvements to the implementation and use of Earley's algorithm.

First, Earley's algorithm suffers from the performance problem. Even under optimum conditions, a standard Earley parser is burdened with overhead. We extend directly-executable parsing techniques for use in Earley parsers, the results of which run in time comparable to the much-more-specialized LALR(1) parsing algorithm.

Second is what we call the delayed action problem. General parsers like Earley must, in the worst case, read the entire input before executing any semantic actions associated with the grammar rules. We attack this problem in two ways. We have identified conditions under which it is safe to execute semantic actions on the fly during recognition; as a side effect, this has yielded space savings of over 90% for some grammars. The other approach to the delayed action problem deals with the difficulty of handling context-dependent tokens. Such tokens are easy to handle using what we call "Schrödinger's tokens," a superposition of token types.

Finally, Earley parsers are complicated by the need to process grammar rules with empty right-hand sides. We present a simple, efficient way to handle these empty rules, and prove that our new method is correct. We also show how our method may be used to create a new type of LR(0) automaton which is ideally suited for use in Earley parsers.

Our work has made Earley parsing faster and more space-efficient, turning it into an excellent candidate for practical use in many applications.

Examiners:

Dr. R. N. Horspool, Supervisor (Department of Computer Science)

Dr. J. H. Jahnke, Departmental Member (Department of Computer Science)

Dr. M.-A. D. Storey, Departmental Member (Department of Computer Science)

Dr. K. F. Li, Outside Member (Department of Electrical and Computer Engineering)

Dr. T. A. Proebsting, External Examiner (Microsoft Research)

Table of Contents

Abstract ii

Table of Contents iv

List of Tables ix

List of Figures x

Acknowledgments xiii

1 Introduction 1

2 SPARK: Scanning, Parsing, And Rewriting Kit 5
   2.1 Model of a Compiler 6
   2.2 The Framework 10
       2.2.1 Lexical Analysis 11
       2.2.2 Syntax Analysis 14
       2.2.3 Semantic Analysis 18
       2.2.4 Evaluation 20
   2.3 Inner Workings 23
       2.3.1 Reflection 23
       2.3.2 GenericScanner 24
       2.3.3 GenericParser 25
       2.3.4 GenericASTBuilder 28
       2.3.5 GenericASTTraversal 28
       2.3.6 GenericASTMatcher 29
       2.3.7 Design Patterns 30
       2.3.8 Class Structure 30
   2.4 Summary 31

3 Languages, Grammars, and Earley Parsing 32
   3.1 Languages and Grammars 32
   3.2 Earley Parsing 35
   3.3 Summary 39

4 Directly-Executable Earley Parsing 40
   4.1 DEEP: a Directly-Executable Earley Parser 41
       4.1.1 Observations 41
       4.1.2 Basic Organization 42
       4.1.3 Earley Set Representation 43
       4.1.4 Adding Earley Items 48
       4.1.5 Sets Containing Items which are Sets Containing Items 48
       4.1.6 Implementation Rumination 53
       4.1.7 A Deeper Look at Implementation 59
   4.2 Evaluation 62
   4.3 Improvements 65
   4.4 Related Work 67
   4.5 Future Work 68
   4.6 Summary 70

5 Schrödinger's Token 71
   5.1 Schrödinger's Token 73
   5.2 Alternative Techniques 75
       5.2.1 Lexical Feedback 75
       5.2.2 Enumeration of Cases 76
       5.2.3 Language Superset 77
       5.2.4 Manual Token Feed 77
       5.2.5 Synchronization Symbols 78
       5.2.6 Oracles 79
       5.2.7 ... 80
       5.2.8 Discussion of Alternatives 80
   5.3 Implementation 81
       5.3.1 Programmer Support 81
       5.3.2 Parser Tool Support 82
       5.3.3 Schrödinger's Tokens and SPARK 84
   5.4 Applications 85
       5.4.1 Domain-specific Languages 85
       5.4.2 Fuzzy Parsing 87
       5.4.3 Whitespace-optional Languages 88
   5.5 Summary 89

6 Early Action in an Earley Parser 91
   6.1 Safe Earley Sets 92
   6.2 Practical Implications 95
       6.2.1 Construction of Partial Parse Trees 96
       6.2.2 Space Savings 97
   6.3 Empirical Results 98
   6.4 Previous Work 104
   6.5 Future Work 105
   6.6 Summary 106

7 Running Earley on Empty 107
   7.1 The Problem of ε 108
   7.2 An "Ideal" Solution 110
   7.3 Proof of Correctness 110
   7.4 Precomputation and Representation 115
   7.5 Summary 120

8 Conclusion 121

References 124

A Sample SPARK Specification 132

List of Tables

2.1 SPARK classes, by functionality 10

2.2 Trace of SimpleScanner 13

3.1 Notation summary 35

6.1 Grammar and corpora characteristics 99
6.2 Earley sets containing final items 100
6.3 Safe sets 100
6.4 Mean window size 101
6.5 Mean set and item retention 101
6.6 Flavors of Python grammar 104
6.7 Python grammar results 104

List of Figures

2.1 Compiler model 7
2.2 Abstract syntax tree (AST) 8
2.3 AST construction 16
2.4 Concrete syntax tree 19
2.5 Pattern covering of AST 22
2.6 SPARK's class structure 31

3.1 Earley sets for the ambiguous grammar E → E+E | n and the input n+n 39

4.1 Pseudocode for directly-executable Earley items 44
4.2 Memory layout for DEEP 47
4.3 Adding Earley items 49
4.4 Partial LR(0) DFA for G_E 51
4.5 Earley sets for the expression grammar G_E, parsing the input n+n 52
4.6 Earley sets for the expression grammar G_E, parsing the input n+n, encoded using LR(0) DFA states 53
4.7 Pseudocode for directly-executable DFA states 54
4.8 Threaded code in the current Earley set, S_i, during processing 56
4.9 Timings for the expression grammar, G_E 63
4.10 Timings for the ambiguous grammar S → SSx | x 64
4.11 Difference between SHALLOW and Bison timings for Java 1.1 grammar 66
4.12 Performance impact of partial interpretation of G_E 69

5.1 Ideal token sequence and (simplified) grammar 72
5.2 Token sequence and grammar, using Schrödinger's tokens 74
5.3 Partial Lex specification using Schrödinger's tokens 82
5.4 Pseudocode for (a) the LALR(1) parsing algorithm, and (b) conceptual modifications for Schrödinger's token support 83
5.5 Schrödinger's tokens for parsing key-value pairs 86
5.6 Fuzzy parsing of C++ using Schrödinger's tokens 88
5.7 Schrödinger's tokens for parsing Fortran 89
5.8 Schrödinger's tokens for parsing C++ template syntax 89

6.1 Window on Earley sets 96
6.2 Local ambiguity in C++ 97
6.3 Pseudocode for construction of partial parse trees 97
6.4 Saving space 98
6.5 Mean set retention and input size 102
6.6 Converting EBNF iteration into BNF 103

7.1 An unadulterated Earley parser rejects the valid input a 109
7.2 An Earley parser accepts the input a, using our modification to Predictor 111
7.3 LR(0) automaton for the grammar in Figure 7.1 116
7.4 LR(0) ε-DFA 118
7.5 Pseudocode for processing Earley set S_i using an LR(0) ε-DFA 119

Acknowledgments

Where to begin?

Nigel Horspool has been steadily trying to teach me since 1996 what this whole research thing is all about. Hopefully some of the lessons have sunk in. As my supervisor, Nigel has been involved in discussions regarding the material in this thesis since the beginning, and some ideas presented here are attributable to him. Specifically, he wisely insisted on DEEP executing parent sets, and skipping terminal code after it had been executed once. The grammar in Figure 7.1 was derived from an example he supplied which broke my first attempt to fix empty rules. In addition, he pointed out why the grammar was causing problems, requested mean item retention data, and asked what happened when empty rules met useless nonterminals.

A variety of papers comprise this thesis. I have received innumerable helpful comments on them from Nigel Horspool, Shannon Jaeger, and anonymous referees from Software — Practice and Experience, Information Processing Letters, CC 2001, and IPC7. Shannon Jaeger and Jim Uhl quickly tackled the proofreading task after the thesis was complete, for which I am extremely grateful.

Many have kindly contributed bits of information. Jon Bentley and Mary Shaw told me the etymology of "little languages"; Chris Verhoef described the uses of general parsers in software reengineering; Friwi Schröer gave me some help with ACCENT; Thilo Ernst supplied an update on TRAP's status.

Users of SPARK have always been forthcoming with comments, commendations, and complaints. Through their feedback, I have been shown clever ways to apply SPARK that I never would have discovered otherwise. I would especially like to thank Rick White for discovering the bug with empty rules.

I will miss my morning coffee runs with Mike Zastre, in which we have shared joys, sorrows, and devised strategies for handling unruly supervisors. Mike helped me find a coherent design for Figure 2.3, and he and his wife Susanne Reul verified the translation of a key part of Schrödinger's paper. I would also like to thank my officemate Kelvin Yeow, who has politely refrained from complaining while I wrote my thesis. This, despite the fact that the "i" key on my keyboard keeps jamming and making a loud squeaking noise every single time I press it.

Most importantly, I would like to thank my family. Shannon, Melissa, and Amanda all made sacrifices during the last few years, and it is to them that I dedicate this work.

Chapter 1

Introduction

The development of the first high-level programming language was heralded not with a bang, but a whimper. This language, Zuse's Plankalkül, was devised in 1945 but unpublished and unknown until the 1970s [14, 41, 60]. By that time, high-level language development was in full swing. Languages such as ALGOL, APL, Fortran, LISP, and Simula gave an increasing amount of variety as well as establishing entirely new programming paradigms. In contrast, there are relatively few big, general-purpose languages developed now. Those that do appear, such as Java¹ and C#, tend to be little more than variations on earlier themes, despite the fanfare that typically accompanies their arrival. The major thrust of compiler research is now efficient implementation of programming languages rather than language development per se [45]. What is underappreciated is the ubiquity of language in computing to describe

¹Java is a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries.

smaller, more specific areas. Configuration files, HTML documents, shell scripts, network protocols — all are little structured languages, yet may lack the generality and features of full-blown programming languages.

About 1980, Mary Shaw coined the term "little language" to describe this phenomenon [18]; the term was later popularized by Jon Bentley [17]. Although the preferred label now is "domain-specific language," the idea is the same. Shaw used the term to draw attention to the fact that, at the time, only a small amount of effort was spent in little language design compared to larger languages, yet the effects of a bad little language design could be disproportionately high [85]. Could we not just design a single, "perfect" little language, and re-use it for all application domains? Some think so — Shivers [86] presents an alternative to little languages and a Scheme-based implementation framework. Tcl was also developed to address the proliferation of little languages [75]. However, the reality is that no convergence is likely in the foreseeable future; new little language designs still debut frequently.

Of course, both design and implementation techniques can be used across the spectrum of languages. Whether writing an interpreter for a little language, or compiling a little language into another language, compiler techniques can be used. In many cases, an extremely fast compiler is not needed, especially if the input programs tend to be small. Instead, issues can predominate such as compiler development time, maintainability of the compiler, and the ability to easily add new language features. Such prototyping is the strong suit of Python [15], an object-oriented scripting language. When this work began in 1998, what Python did not have was a tool

to support implementation of little languages from start to finish, which led to our development of SPARK, the Scanning, Parsing, And Rewriting Kit.²

SPARK is an object-oriented framework supporting compilation of little languages. SPARK is easy to use, even by nonspecialists, and is being applied to an increasing number of areas. Roughly half of the objects that SPARK supplies rely internally on Earley's parsing algorithm [30, 31]. It is a general algorithm, capable of using any context-free grammar — most parsing algorithms in practical use today only handle various subsets of unambiguous grammars.

Experience with SPARK has demonstrated a number of practical problems with the use of Earley's algorithm. Our research addresses these problems: token interpretation, parsing speed, on-the-fly execution of semantic actions, and handling of grammar rules with empty right-hand sides. These results are generally applicable outside of the context of SPARK; some of this work has already appeared, and more is to appear, as separate papers.

The remainder of this thesis is organized in the following manner. We begin by presenting SPARK, both in terms of its usage and its internal workings. We then give some formal definitions, and a precise specification of Earley's algorithm in which we characterize some of the algorithm's problems. From there, we present our research work:

• Directly-executable Earley parsing. This improves performance of Earley's algorithm to the point where it is comparable to the less general, faster LALR(1) parsing algorithm.

²Another Python-based tool for compilation of domain-specific languages, TRAP [32, 33], was announced in 1999. It required a user-initiated compiler build phase, used a less powerful parsing method, and had considerably more complicated semantics. However, TRAP is no longer being actively developed [34].

• Schrödinger's tokens, a technique for easily handling context-dependent tokens in conjunction with general parsers like Earley. Applications include little languages as well as programming languages which have been traditionally hard to parse, such as PL/I.

• Early actions, in which we establish conditions under which semantic actions can be executed in an Earley parser during recognition. This is shown to dramatically reduce the parser's run time space consumption.

• An improved way to process empty rules in Earley parsers, allowing simplification of the algorithm. This leads to the construction of parsing automata which are tailored for efficient use in an Earley parser.

Finally, we conclude this thesis with some directions for future work.

Chapter 2

SPARK: Scanning, Parsing, And Rewriting Kit³

SPARK is a framework for compilation of little languages that plays many rôles in the research we will later present. SPARK has incited us to look at the problems we will describe; it has acted as a testbed for some of our solutions; it will be the beneficiary of some of our results. First unveiled in 1998, SPARK is now on its sixth release. In that time, SPARK has received a favorable mention in print [28] and has been used in a wide variety of projects, a selection of which are listed below. Our projects are denoted by a circle; those done by others are bulleted. Other people's projects without citations were communicated to us via email.

o Compiling Guide [69], a web programming language

³An earlier version of this work appeared in [7].

o Compiling a subset of Java

o Experimental type inferencing for Python [8]

o Bytecode decompilation (now maintained by others)

• VHDL parsing

• Extraction of embedded program documentation

• GUI building

• Linux⁴ kernel configuration system [80]

• Interfacing with IRAF (astronomical software) [98]

• Fortran interface description [29]

• Producing syntax charts from a grammar

• Domain-specific extensions to Python

In this chapter we introduce SPARK and show how its design motivated our research work.

2.1 Model of a Compiler

Like most nontrivial pieces of software, compilers are generally broken down into more manageable modules, or phases. The design issues involved and the details of each phase are too numerous to discuss here in depth; there are many excellent books on the subject, such as [3] and [6].

⁴Linux is a trademark of Linus Torvalds.

[Figure 2.1: Compiler model. The input flows through Scanning (producing a token list), Parsing (producing an AST), Semantic Analysis (producing an annotated AST), and finally Evaluation.]

We begin with a simple model of a compiler having only four phases, as shown in Figure 2.1:

1. Scanning, or lexical analysis. Breaks the input stream into a list of tokens. For example, the expression "2 + 3 * 5" can be broken up into five tokens: number, +, number, *, number. The values 2, 3, and 5 are attributes associated with the corresponding number token.

2. Parsing, or syntax analysis. Ensures that a list of tokens has valid syntax according to a grammar — a set of rules that describes the syntax of the language. For the above example, a typical expression grammar would be:

       expr ::= expr + term
       expr ::= term
       term ::= term * factor
       term ::= factor
       factor ::= number

   [Figure 2.2: Abstract syntax tree (AST) for "2 + 3 * 5": a + node with children number (2) and *; the * node has children number (3) and number (5).]

In English, this grammar's rules say that an expression can be an expression plus a term, an expression may be a term by itself, and so on. Intuitively, the symbol on the left-hand side of ::= may be thought of as a variable for which

the symbols on the right-hand side may be substituted [97]. Symbols that don't appear on any left-hand side — like +, *, and number — correspond to the tokens from the scanner.

The result of parsing is an abstract syntax tree (AST), which represents the input program. For "2 + 3 * 5," the AST would look like the one in Figure 2.2.

3. Semantic analysis. Traverses the AST one or more times, collecting information and checking that the input program had no semantic errors. In a typical programming language, this phase would detect things like type conflicts, redefined identifiers, mismatched function parameters, and numerous other errors. The information gathered may be stored in a global symbol table, or attached as attributes to the nodes of the AST itself.

4. Evaluation. This phase may directly interpret the program, or output code in C or assembly which would implement the input program. Evaluation may be implemented by another traversal of the AST, or by matching patterns in the AST. The value of expressions as simple as those in the example grammar could be computed on the fly in this phase.

Each phase performs a well-defined task, and passes a data structure on to the next phase; Grune et al. [45] refer to this as a "broad compiler." Note that information only flows one way, and that each phase runs to completion before the next one starts.⁵ This is in contrast to oft-used techniques which have a symbiosis between scanning and parsing, where not only may several phases be working concurrently, but a later phase may send some feedback to modify the operation of an earlier phase. Certainly all little language compilers won't fit this model, but it is extremely clean and elegant for those that do. The main function of the compiler, for instance, distills into three lines of Python code which reflect the compiler's structure:

    f = open(filename)
    evaluate(semantic(parse(scan(f))))
    f.close()

Unlike Gaul, the rest of this chapter is only in two parts.⁶ First, we will examine each of the four phases, showing how our framework can be used to implement the

⁵Wortman suggests some anecdotal evidence indicating that parts of production compilers may be moving towards a similar model [102].
⁶With apologies to Cæsar [24].

Phase               Class
Lexical analysis    GenericScanner
Syntax analysis     GenericParser
Syntax analysis     GenericASTBuilder
Semantic analysis   GenericASTTraversal
Evaluation          GenericASTTraversal
Evaluation          GenericASTMatcher

Table 2.1: SPARK classes, by functionality. Some classes are potentially useful for more than one phase.

little expression language above. Following this will be a discussion of some of the inner workings of the framework’s classes, where the reliance of SPARK on Earley parsing will become evident.

2.2 The Framework

A common theme throughout this framework is that the user should have to do as little work as possible. For each phase, our framework supplies a class which performs most of the work; these are summarized in Table 2.1. The user's job is simply to create subclasses which customize the framework. As the code implementing our running example is distributed throughout this chapter, it can be difficult to gauge factors such as code size and the consistency of SPARK's interface. Appendix A gathers the code together for one implementation, without comment.

2.2.1 Lexical Analysis

Lexical analyzers, or scanners, are typically implemented in one of two ways. The first is to write the scanner by hand; this may still be the method of choice for very small languages, or where use of a tool to generate scanners automatically is not possible. The second method is to use a scanner generator tool, like Lex [67, 68], which takes a high-level description of the permitted tokens, and produces a finite state machine which implements the scanner.

Finite state machines are equivalent to regular expressions; in fact, one typically uses regular expressions to specify tokens to scanner generators! Since Python has regular expression support, it is natural to use them to specify tokens. (As a case in point, the Python module "tokenize" has regular expressions to tokenize Python programs.) So GenericScanner, our generic scanner class, requires a user to create a subclass of it in which they specify the regular expressions that the scanner should look for. Furthermore, an "action" consisting of arbitrary Python code can be associated with each regular expression — this is typical of scanner generators, and allows work to be performed based on the type of token found.

Below is a simple scanner to tokenize expressions. The parameter to the action routines is a string containing the part of the input that was matched by the regular expression.

    class SimpleScanner(GenericScanner):
        def __init__(self):
            GenericScanner.__init__(self)

        def tokenize(self, input):
            self.rv = []
            GenericScanner.tokenize(self, input)
            return self.rv

        def t_whitespace(self, s):
            r' \s+ '

        def t_op(self, s):
            r' \+ | \* '
            self.rv.append(Token(type=s))

        def t_number(self, s):
            r' \d+ '
            t = Token(type='number', attr=s)
            self.rv.append(t)

A few words about the syntax and semantics of Python are in order. This code defines the class SimpleScanner, a subclass of GenericScanner. All methods have an explicit self parameter; __init__ is the class' constructor, and it is responsible for invoking its superclass' constructor if necessary. Methods may optionally begin with a documentation string ("docstring" in Python parlance) which is ignored by Python's interpreter but, unlike a regular comment, is retained and accessible at run time. A method which is empty (save for an optional documentation string) has no effect when executed.

Object instantiation uses the same syntax as function calls: in the above code, Token objects are being created. Both object instantiation and function invocation can make use of "keyword arguments," which permit actual and formal parameters to be associated by name rather than by their position in the argument list. Some final minutiae: [] is the empty list, and an "r" prefixing a string denotes a "raw" string in which backslash characters are not treated as escape sequences.

Input   Method          Token Added
2       t_number        number (attribute 2)
space   t_whitespace
+       t_op            +
space   t_whitespace
3       t_number        number (attribute 3)
space   t_whitespace
*       t_op            *
space   t_whitespace
5       t_number        number (attribute 5)

Table 2.2: Trace of SimpleScanner.

Returning to the scanner itself, each method whose name begins with "t_" is an action; the regular expression for the action is placed in the method's documentation string. (The reason for this unusual design, using reflection, is explained in Section 2.3.1.)

When the tokenize method is called, a list of Token instances is returned, one for each operator and number found. The code for the Token class is omitted; it is a simple container class with a type and an optional attribute. White space is skipped by SimpleScanner, since its action code does nothing. Any unrecognized characters in the input are matched by a default pattern, declared in the action GenericScanner.t_default. This default method can of course be overridden in a subclass. A trace of SimpleScanner on the input "2 + 3 * 5" is shown in Table 2.2.

Scanners made with GenericScanner are extensible, meaning that new tokens may be recognized simply by subclassing. To extend SimpleScanner to recognize floating-point number tokens is easy:

    class FloatScanner(SimpleScanner):
        def __init__(self):
            SimpleScanner.__init__(self)

        def t_float(self, s):
            r' \d+ \. \d+ '
            t = Token(type='float', attr=s)
            self.rv.append(t)

How are these classes used? Typically, all that is needed is to read in the input program, and pass it to an instance of the scanner:

    def scan(f):
        input = f.read()
        scanner = FloatScanner()
        return scanner.tokenize(input)

Here, the entire input is read at once with the read method. Once the scanner is done, its result is sent to the parser for syntax analysis.
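Purely for illustration, here is a minimal sketch of the omitted Token container class, consistent with the keyword-argument calls above (the field handling is an assumption, not SPARK's actual code):

    # Hypothetical sketch of Token; SPARK's real class may differ.
    class Token:
        def __init__(self, type, attr=None):
            self.type = type    # e.g. 'number', '+', '*'
            self.attr = attr    # optional attribute, e.g. the matched text

        def __repr__(self):
            return 'Token(%r, %r)' % (self.type, self.attr)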

2.2.2 Syntax Analysis

The outward appearance of GenericParser, our generic parser class, is similar to that of GenericScanner.

A user starts by creating a subclass of GenericParser, containing special methods which are named with the prefix "p_". These special methods encode grammar rules in their documentation strings; the code in the methods are actions which get executed when one of the associated grammar rules is recognized by GenericParser.

The expression parser subclass is shown below. Here, the actions are building the AST for the input program. AST is also a simple container class: each instance of AST corresponds to a node in the tree, with a node type and possibly child nodes. The grammar's start symbol is passed to the constructor. In the code, ExprParser's constructor assigns a default value to its start symbol argument so that it may be changed later by a subclass.

    class ExprParser(GenericParser):
        def __init__(self, start='expr'):
            GenericParser.__init__(self, start)

        def p_expr_1(self, args):
            ' expr ::= expr + term '
            return AST(type=args[1], left=args[0], right=args[2])

        def p_expr_2(self, args):
            ' expr ::= term '
            return args[0]

        def p_term_1(self, args):
            ' term ::= term * factor '
            return AST(type=args[1], left=args[0], right=args[2])

        def p_term_2(self, args):
            ' term ::= factor '
            return args[0]

        def p_factor_1(self, args):
            ' factor ::= number '
            return AST(type=args[0])

        def p_factor_2(self, args):
            ' factor ::= float '
            return AST(type=args[0])

ExprParser builds the AST from the bottom up. Figure 2.3 shows the AST in Figure 2.2 being built, and the sequence in which ExprParser's methods are invoked.

[Figure 2.3: AST construction. A table of the methods called (p_factor_1, p_term_2, p_term_1, p_expr_2, p_expr_1) and the partial AST after each call.]

The "args" passed in to the actions are based on a similar idea used by Yacc [53, 68], a prevalent parser generator tool. Each symbol on a rule's right-hand side has an attribute associated with it. For token symbols like +, this attribute is the token itself. All other symbols' attributes come from the return values of actions which, in the above code, means that they are subtrees of the AST.

The index into args comes from the position of the symbol in the rule's right-hand side. In the running example, the call to p_expr_1 has len(args) == 3: args[0] is expr's attribute, the left subtree of + in the AST; args[1] is +'s attribute, the token +; args[2] is term's attribute, the right subtree of + in the AST.

The routine to use this subclass is straightforward:

    def parse(tokens):
        parser = ExprParser()
        return parser.parse(tokens)

Although omitted for brevity, ExprParser can be subclassed to add grammar rules and actions, the same way the scanner was subclassed.

Writing actions to build ASTs for large languages can be tedious. An alternative is to use the GenericASTBuilder class instead of GenericParser, which automatically constructs the tree:

    class AnotherExprParser(GenericASTBuilder):
        def __init__(self, AST, start='expr'):
            GenericASTBuilder.__init__(self, AST, start)

        def p_expr_1(self, args):
            ' expr ::= expr + term '
        def p_expr_2(self, args):
            ' expr ::= term '

        def p_term_1(self, args):
            ' term ::= term * factor '
        def p_term_2(self, args):
            ' term ::= factor '

        def p_factor_1(self, args):
            ' factor ::= number '
        def p_factor_2(self, args):
            ' factor ::= float '

(A more abbreviated way to express this may be found in Section 2.3.3.) The constructor is passed the AST class, so GenericASTBuilder knows how to instantiate AST nodes.

By default, GenericASTBuilder constructs a concrete syntax tree which, as Figure 2.4 shows, faithfully reflects the structure of the grammar. Depending on the node type being built, one of two methods is invoked to construct the node: GenericASTBuilder.terminal or GenericASTBuilder.nonterminal. The user may override these with methods which shape an AST rather than a concrete syntax tree.
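As an illustration only, a subclass might flatten unit rules such as expr ::= term while the tree is being built; the hook signatures below are assumptions based on the description above, not code taken from SPARK:

    # Hypothetical sketch; the terminal/nonterminal signatures are assumed.
    class FlatteningBuilder(AnotherExprParser):
        def terminal(self, token):
            # make each token a leaf node
            return AST(type=token.type)

        def nonterminal(self, type, args):
            # collapse unit rules like "expr ::= term" so the result
            # resembles an AST instead of a concrete syntax tree
            if len(args) == 1:
                return args[0]
            return GenericASTBuilder.nonterminal(self, type, args)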

After syntax analysis is complete, the parser has produced an AST, and verified that the input program adheres to the grammar rules. Next, the input's meaning must be checked by the semantic analyzer.

2.2.3 Semantic Analysis

Semantic analysis is performed by traversing the AST. Rather than spread code to traverse an AST all over the compiler, we have a single base class, GenericASTTraversal, which knows how to walk the tree. Subclasses of GenericASTTraversal supply methods which get called depending on what type of node is encountered.

[Figure 2.4: Concrete syntax tree for "2 + 3 * 5". The root expr has children expr, +, and term; each number leaf is reached through a chain of expr, term, and factor nodes, faithfully following the grammar.]

To determine which method to invoke, GenericASTTraversal will first look for a method with the same name as the node type (augmented by the prefix "n_"), then will fall back on an optional default method if no more specific method is found. Of course, GenericASTTraversal can supply many different traversal algorithms. We have found three useful: preorder, postorder, and a pre/postorder combination. (The latter allows methods to be called both on entry to, and exit from, a node.)

For example, say that we want to forbid the mixing of floating-point and integer numbers in our expressions, raising an exception if such mixing occurs:

    class TypeCheck(GenericASTTraversal):
        def __init__(self, ast):
            GenericASTTraversal.__init__(self, ast)
            self.postorder()

        def n_number(self, node):
            node.exprType = 'number'
        def n_float(self, node):
            node.exprType = 'float'

        def default(self, node):
            # this handles + and * nodes
            leftType = node.left.exprType
            rightType = node.right.exprType
            if leftType != rightType:
                raise 'Type error.'
            node.exprType = leftType

We have found semantic checking code easier to write and understand by taking the (admittedly less efficient) approach of making multiple traversals of the AST — each pass performs a single task. TypeCheck is invoked from a small glue routine:

    def semantic(ast):
        TypeCheck(ast)
        #
        # Any other GenericASTTraversal classes
        # for semantic checking would be
        # instantiated here...
        #
        return ast

After this phase, we have an AST for an input program that is lexically, syntactically, and semantically correct — but that does nothing. The final phase, evaluation, remedies this.

2.2.4 Evaluation

As already mentioned, the evaluation phase can traverse the AST and implement the input program, either directly through interpretation, or indirectly by emitting some code.

Our expressions, for instance, can be easily interpreted. Below, int and float are built-in functions which convert strings to integers and floating-point numbers, respectively.

    class Interpreter(GenericASTTraversal):
        def __init__(self, ast):
            GenericASTTraversal.__init__(self, ast)
            self.postorder()
            print ast.value

        def n_number(self, node):
            node.value = int(node.attr)
        def n_float(self, node):
            node.value = float(node.attr)

        def default(self, node):
            left = node.left.value
            right = node.right.value

            if node.type == '+':
                node.value = left + right
            else:
                node.value = left * right

An alternative is to use the GenericASTMatcher class. Here, patterns to look for in the AST are specified in a linearized tree notation, which looks remarkably like grammar rules. GenericASTMatcher determines a way to cover the AST with these patterns, then executes actions associated with the chosen patterns. For example, the code below also interprets our expressions. The AST covering is shown in Figure 2.5.

[Figure 2.5: Pattern covering of AST. The + node and the * subtree, along with their number children, are each covered by one of the patterns below.]

    class AnotherInterpreter(GenericASTMatcher):
        def __init__(self, ast):
            GenericASTMatcher.__init__(self, 'V', ast)
            self.match()
            print ast.value

        def p_number(self, node):
            ' V ::= number '
            node.value = int(node.attr)
        def p_float(self, node):
            ' V ::= float '
            node.value = float(node.attr)

        def p_add(self, node):
            ' V ::= + ( V V ) '
            node.value = node.left.value + node.right.value
        def p_multiply(self, node):
            ' V ::= * ( V V ) '
            node.value = node.left.value * node.right.value

The patterns specified may be arbitrarily complex, so long as all the nodes specified in the pattern are adjacent in the AST. To match both + and * nodes, for instance, this method could be added:

    def p_addmul(self, node):
        ' V ::= + ( V * ( V V ) ) '
        node.value = node.left.value + \
                     node.right.left.value * \
                     node.right.right.value

2.3 Inner Workings

2.3.1 Reflection

Extensibility presents some interesting design challenges. The generic classes in the framework, without any modifications made to them, must be able to divine all the information and actions contained in their subclasses, subclasses that didn't exist when the generic classes were created. Fortunately, an elegant mechanism exists in Python to do just this: reflection. Reflection refers to the ability of a Python program to query and modify itself at run time (this feature is also present in other languages, like Java and Smalltalk). Consider, for example, our generic scanner class. GenericScanner searches itself and its subclasses at run time for methods that begin with the prefix "t_." These methods are the scanner's actions. The regular expression associated with the actions is specified using a well-known method attribute that can be queried at run time — the method's documentation string.

This wanton abuse of documentation strings can be rationalized. Documentation strings are a method of associating meta-information — comments — with a section of code. Our framework is an extension of that idea. Instead of comments intended for humans, however, we have meta-information intended for use by our framework. As the number of reflective Python applications grows, it may be worthwhile to add more formal mechanisms to Python to support this task. Coincidentally, this has just happened with the most recent release of Python.
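As a concrete illustration of the mechanism (a sketch under assumed names, not SPARK's actual code), reflection in modern Python can collect every "t_" method and its docstring pattern at run time:

    # Illustrative sketch: find "t_" action methods in an instance's
    # class hierarchy and pull each regular expression from the
    # method's docstring.
    def gather_actions(scanner):
        actions = []
        for name in dir(scanner):
            if name.startswith('t_'):
                method = getattr(scanner, name)
                actions.append((name, method.__doc__, method))
        return actions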

2.3.2 GenericScanner

Internally, GenericScanner works by constructing a single regular expression which is composed of all the smaller regular expressions it has found in the action methods' documentation strings. Each component regular expression is mapped to its action using Python's symbolic group facility.

Unfortunately, there is a small snag. Python follows the Perl semantics [96] for regular expressions rather than the POSIX semantics [51], which means it follows the "first then longest" rule — the leftmost part of a regular expression that matches is always taken, rather than using the longest match. In the above example, if

GenericScanner were to order the regular expression so that "\d+" appeared before "\d+ \. \d+", then the input 123.45 would match as the number 123, rather than the floating-point number 123.45. To work around this, GenericScanner makes two guarantees:

1. A subclass’ patterns will be matched before any in its parent classes.

2. The default pattern for a subclass, if any, will be matched only after all other patterns in the subclass have been tried.
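The interaction between pattern ordering and "first then longest" matching can be pictured with a small sketch (simplified relative to SPARK's internals, which must also honor the two guarantees above):

    import re

    # Each component pattern becomes a named group; the group name maps
    # a match back to its action. Subclass patterns are placed first so
    # that they win under Perl-style "first then longest" matching.
    patterns = [('t_float', r'\d+\.\d+'),   # from the subclass
                ('t_number', r'\d+')]       # from the parent class
    combined = re.compile('|'.join('(?P<%s>%s)' % (name, pat)
                                   for name, pat in patterns))

    m = combined.match('123.45')
    print(m.lastgroup)   # 't_float': 123.45 is not split into 123 and .45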

One obvious change to GenericScanner is to automate the building of the list of tokens — each "t_" method could return a list of tokens which would be appended to the scanner's list of tokens. The reason this is not done is because it would limit potential applications of GenericScanner. For example, in one compiler we used a subclass of GenericScanner as a preprocessor which returned a string; another scanner class then broke that string into a list of tokens.

2.3.3 GenericParser

GenericParser is actually more powerful than was alluded to in Section 2.2.2. At the cost of greater coupling between methods, actions for similar rules may be combined together rather than having to duplicate code — our original version of ExprParser is shown below. For clarity, we use Python's triple-quoted strings which allow a string to span multiple lines.

    class ExprParser(GenericParser):
        def __init__(self, start='expr'):
            GenericParser.__init__(self, start)

        def p_expr_term(self, args):
            '''
                expr ::= expr + term
                term ::= term * factor
            '''
            return AST(type=args[1], left=args[0], right=args[2])

        def p_expr_term_2(self, args):
            '''
                expr ::= term
                term ::= factor
            '''
            return args[0]

        def p_factor(self, args):
            '''
                factor ::= number
                factor ::= float
            '''
            return AST(type=args[0])

Taking this to extremes, if a user is only interested in parsing and doesn't require an AST, ExprParser could be written:

    class ExprParser(GenericParser):
        def __init__(self, start='expr'):
            GenericParser.__init__(self, start)

        def p_rules(self, args):
            '''
                expr ::= expr + term
                expr ::= term
                term ::= term * factor
                term ::= factor
                factor ::= number
                factor ::= float
            '''

In theory, GenericParser could use any parsing algorithm for its engine. However, we chose the Earley parsing algorithm [30, 31] which has several nice properties for this application [46]:

1. It is one of the most general algorithms known; it can parse all context-free grammars whereas the more popular LL and LR techniques cannot. This is important for easy extensibility; a user should ideally be able to subclass a parser without worrying about properties of the resulting grammar.

2. It generates all its information at run time, rather than having to precompute sets and tables. Since the grammar rules aren't known until run time, this is just as well!

Unlike most other parsing algorithms, Earley's method parses ambiguous grammars. Ambiguity can present a problem since it is not clear which actions should be invoked. When this occurs, GenericParser calls GenericParser.resolve to choose between the possible input derivations. Users may override this method to implement their own behaviour.
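For example, a subclass might resolve ambiguities by always preferring the first candidate; this is a hypothetical sketch, since the exact form of the argument passed to resolve is not described here:

    # Hypothetical sketch of overriding ambiguity resolution; the
    # argument is assumed to be a list of candidate rules.
    class PickyExprParser(ExprParser):
        def resolve(self, candidates):
            # arbitrarily take the first candidate; a real application
            # might consult operator precedence instead
            return candidates[0]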

To accommodate a variety of possible parsing algorithms (including the one we used), GenericParser only makes one guarantee with respect to when the rules’ actions are executed. A rule’s action is executed only after all the attributes on the rule’s right-hand side are fully computed. This condition is sufficient to allow the correct construction of ASTs.

There are other general parsing algorithms besides Earley's algorithm. In particular, generalized LR (GLR) parsing [89] would be another candidate for use in SPARK. However, we used Earley parsing in preference to GLR parsing for the following technical and non-technical reasons:

1. Earley parsing has better worst-case performance, in terms of computational complexity.

2. GLR parsing will not work with all context-free grammars without modification [89], whereas Earley parsing does.

3. GLR parsers typically require a “compiler build” phase, which we wanted to avoid in order to enhance SPARK’s ease of use. (Although this can be done lazily at parse time [47].)

4. Having implemented and worked with both types of parser, we find Earley parsers simpler to implement and reason about than GLR parsers.

2.3.4 GenericASTBuilder

GenericASTBuilder works by hijacking GenericParser's operation. The action associated with each "p_" method is re-routed to an internal GenericASTBuilder method which performs tree construction.

Experience with SPARK has shown that GenericASTBuilder is an excellent labor-saving device. For example, we used it in our project that decompiled bytecode. However, it can sometimes be difficult to specify exactly how to transform a concrete syntax tree into an AST. In practice, one often ends up using an AST design which is tolerable but not ideal, simply because it is easier to construct with GenericASTBuilder. A means of improving on this situation is the topic of future work.

2.3.5 GenericASTTraversal

GenericASTTraversal is the least unusual of the generic classes. It could be argued that its use of reflection is superfluous, and that the same functionality could be achieved by having its subclasses provide a method for every type of AST node; these methods could call a default method themselves if necessary.

The problems with this non-reflective approach are threefold. First, it introduces a maintenance issue: any additional node types added to the AST require all GenericASTTraversal's subclasses to be changed. Second, it forces the user to do more work, as methods for all node types must be supplied; our experience, especially for semantic checking, is that only a small set of node types will be of interest for a given subclass. Third, some node types may not map nicely into Python method names — we prefer to use node types that reflect the little language's syntax, like +, and it isn't possible to have methods named "n_+."⁷ This latter point is where it is useful to have GenericASTTraversal reflectively probe a subclass and automatically invoke the default method.
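The dispatch logic can be pictured as follows; this is an illustrative sketch, not SPARK's actual code:

    # Node types with a matching "n_<type>" method use it; all others,
    # including types like "+" that cannot appear in a method name,
    # fall back to the default method.
    def dispatch(visitor, node):
        method = getattr(visitor, 'n_' + node.type, None)
        if method is not None:
            method(node)
        elif hasattr(visitor, 'default'):
            visitor.default(node)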

2.3.6 GenericASTMatcher

GenericASTMatcher currently operates using a Graham/Glanville code generator [42]. The input AST is linearized using a preorder tree traversal, retaining structural information by insertion of balanced parentheses. For example, the AST in Figure 2.2 would be represented as

+ ( number * ( number number ) )
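Such a linearization is straightforward to produce. The sketch below assumes nodes expose a list of children (an assumption for generality; the two-child AST nodes above would need a trivial adapter):

    # Illustrative preorder linearization with balanced parentheses.
    def linearize(node):
        parts = [node.type]
        children = getattr(node, 'children', [])
        if children:
            parts.append('(')
            for child in children:
                parts.extend(linearize(child))
            parts.append(')')
        return parts

    # ' '.join(linearize(ast)) gives "+ ( number * ( number number ) )"
    # for the AST of Figure 2.2.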

GenericParser is then used to parse the linearized AST using the grammar (i.e., the patterns specified in the "p_" methods) supplied by the user.

We note that Earley parsing has been applied to Graham/Glanville code generation before. Christopher et al. [26] concluded that Earley's algorithm solved all of the extant problems with the unadulterated Graham/Glanville technique, but with an enormous execution time compared to more naïve code generation algorithms. One future possibility would be to exchange the Graham/Glanville engine for a more sophisticated one, using (for example) bottom-up rewrite systems [39]. Another interesting idea would be to allow more general patterns, akin to those in the XML Path Language [101].

⁷Not directly, anyway...

2.3.7 Design Patterns

Although developed independently, the use of reflection in our framework is arguably a speciahzation of the Reflection pattern [23]. We speculate that there are m any other design patterns where reflection can be exploited. To illustrate, GenericASTTraver­ sal wound up somewhere between the Default Visitor [74] and Reflection patterns, although it was originally inspired by the Visitor pattern [40]. Two other design patterns can be applied to our framework too. First, the entire framework could be organized explicitly as a Pipes and Filters pattern [23]. Second, the generic classes could support interchangeable algorithms via the Strategy pat­ tern [40]; parsing algorithms, in particular, vary widely in their characteristics, so allowing different algorithms could be a boon to an advanced user.

2.3.8 Class Structure

Figure 2.6 shows the class structure of the user-visible classes in SPARK, along with their key methods.

[Figure 2.6: SPARK's class structure. GenericScanner (error, tokenize); GenericParser (error, parse, resolve); GenericASTTraversal (preorder, postorder, prune); GenericASTBuilder (terminal, nonterminal); GenericASTMatcher (match).]

Due to the generality and flexibility of Earley's algorithm, many classes have grown to depend on GenericParser. Our experience was that this dependence tended to amplify the problems with Earley's algorithm, which we discuss in the next chapter.

2.4 Summary

SPARK is a framework we have developed to build compilers in Python. It uses reflection and design patterns to produce compilers which can be easily extended using traditional object-oriented methods. Many of the objects supplied by SPARK rely internally on Earley's parsing algorithm.

Chapter 3

Languages, Grammars, and Earley Parsing

In this chapter we formalize some ideas about languages and grammars, and present Earley’s parsing algorithm. Over half of the classes supplied by SPARK rely on Earley’s algorithm.

3.1 Languages and Grammars

A language is a set of strings over a finite alphabet [70]; in the context of parsing, we may intuitively think of this alphabet as the set of tokens that a scanner may return. If we denote the alphabet as Σ, then various finite and infinite languages may be formed by taking subsets of Σ*, the Kleene closure of Σ. For example, if Σ = {a, b}, then some languages over Σ would be {b}, {a, b}, {b, ab, aab, aaab}, and {a, aa, aaa, ...}. The empty string is denoted by ε, and the length of a string α is written |α|.

Different categories of languages exist, and one prevalent classification scheme is the Chomsky hierarchy [25, 48]. In particular, we are interested in Type 2 languages: context-free languages, or CFLs. To define what a CFL is, though, we must first define what a grammar is.

Any language may be described by an infinite number of grammars. By way of analogy, one may think of a language as being like the game of chess. The rules of chess may be described an infinite number of ways — rules in English, rules in French, rules in Russian, and so on — but they all describe the same game. And indeed, a grammar primarily consists of a set of rules. Given a grammar G, the language generated by that grammar is denoted L(G). A CFL is simply any language generated by a context-free grammar (CFG). Formally, a CFG G is a quadruple (N, Σ, R, S), where

    N is a set of nonterminal symbols,
    Σ is a set of terminal symbols, Σ ∩ N = ∅,
    R is a set of grammar rules, R ⊆ N × (N ∪ Σ)*, and
    S ∈ N is a start symbol.

N, Σ, and R are all finite sets [48, 62]. A grammar rule (A, α) is typically written A → α (although grammar rules in SPARK use ::= rather than → for pragmatic reasons).

CFGs are usually written informally as a set of grammar rules. Unless a start symbol is explicitly given, the first rule is conventionally the "start rule," meaning that the nonterminal on its left-hand side is the grammar's start symbol. Also, when several grammar rules have a common left-hand side, such as A → α and A → β, they may be written using the shorthand form A → α | β. We will be assuming the use of augmented grammars in the remainder of our work. An augmented grammar is a simple extension of a CFG which adds a new start symbol S′, S′ ∉ N, and a new rule S′ → S.

Standard notation [3] is used when discussing grammars in the abstract. Briefly, lowercase letters represent terminal symbols, uppercase letters early in the alphabet are nonterminal symbols, and uppercase letters late in the alphabet can be either terminals or nonterminals. Greek letters denote strings of zero or more terminal and nonterminal symbols. More concrete instances of grammars extend these conventions: punctuation characters (e.g., parentheses) are taken to be terminal symbols, and meaningful words are used for both terminal and nonterminal symbols. We have enclosed terminal symbols in quotes where the type of symbol is not immediately evident from the context.

The application of grammar rules is captured in the notion of derivation. Given a grammar rule A → α, we may replace the occurrence of A with α in the string βAγ. When this happens, βAγ is said to derive βαγ in one step, written βAγ ⇒ βαγ. If the leftmost nonterminal is replaced, then this is a leftmost derivation, as in AAA ⇒_L aAA; one may have a rightmost derivation as well. A sequence of derivation steps can be applied to a string of symbols, which is summarized by the notation ⇒* (derives in zero or more steps) and ⇒+ (derives in one or more steps). In a more general sense, the derivation of an input refers to the entire sequence of derivation steps taken to derive an input w in L(G) from the start symbol S.

    a, b, ...     terminal symbols
    A, B, ...     nonterminal symbols
    ..., Y, Z     terminal or nonterminal symbols
    α, β, ...     strings of zero or more terminal and nonterminal symbols
    ε             empty string
    |α|           length of the string α
    A → α         grammar rule
    ⇒             derives (in one step)
    ⇒*            derives in zero or more steps
    ⇒+            derives in one or more steps
    ⇒_L           leftmost derivation
    ⇒_R           rightmost derivation

Table 3.1: Notation summary.

Finally, a CFG G is ambiguous if there are two or more distinct leftmost derivations for some string w in L(G) [70].⁸ For example, the grammar S → SS | x is ambiguous because the leftmost derivations

    S ⇒_L SS ⇒_L SSS ⇒_L xSS ⇒_L xxS ⇒_L xxx

and

    S ⇒_L SS ⇒_L xS ⇒_L xSS ⇒_L xxS ⇒_L xxx

exist for the string xxx. Much of the notation presented here is summarized in Table 3.1.

3.2 Earley Parsing

Parsing algorithms work backwards, in a way. Given a CFG G and an input w ∈ Σ*, a parser's job is to answer two questions:

⁸Or equivalently, two or more distinct rightmost derivations [3].

1. Is w ∈ L(G)? This decision task is referred to as recognition.

2. If w ∈ L(G), then what sequences of derivation steps were used to derive w?

Earley's parsing algorithm is one of a family of general parsing algorithms which can parse using any context-free grammar, including ambiguous grammars.⁹ In the past, general parsers have not been widely used in compilers due to efficiency concerns: all other things being equal, the more general parsing algorithms tend to be slower due to extra overhead [46]. However, this is becoming less of a concern with increases in processor speed and memory capacity. There are now a number of parser generators and other tools using these general algorithms [7, 84, 92, 95], as well as approaches to making the algorithms faster [9, 11, 44, 71].

General parsing algorithms have some advantages. No "massaging" of a context-free grammar is required to make it acceptable for use in a general parser, as is required by more efficient algorithms like the LALR(1) algorithm used in Yacc [53, 68]. Using a general parser thus reduces programmer development time, eliminates a source of potential bugs, and lets the grammar reflect the input language rather than the limitations of a compiler tool. General algorithms also work for ambiguous grammars, unlike their more efficient counterparts. Some programming language grammars, such as those for Pascal, C, and C++, contain areas of ambiguity. For some tasks ambiguous grammars may be deliberately constructed, such as a grammar which describes multiple dialects of a language for use in software reengineering [91], or a grammar for Graham/Glanville

⁹In the remainder of this thesis, context-free grammars should be assumed unless otherwise stated.

code generation [42].

The primary objection to general parsing algorithms, then, is not one of functionality. The problems with general parsing algorithms are twofold:

1. Speed. General parsing algorithms are not as efficient as more specialized algorithms.

2. Lack of programmer control. General parsing algorithms must, in general, read and verify their entire input first [46]; this behaviour is contrary to that of specialized algorithms. With a general parsing algorithm, semantic actions associated with grammar rules may not be executed until after the input is recognized, eliminating a favorite compiler implementation trick: altering the operation of the scanner and parser on the fly. We refer to this as the "delayed action problem."

Our research has addressed these problems with respect to Earley’s algorithm. Before we can elaborate on our solutions to these problems, however, a description of Earley’s algorithm is necessary.

Earley’s algorithm works by building a sequence of sets, sometimes called Earley sets. Given an input ai 0 2 ... a„, the parser builds n 4- 1 sets: one initial Earley set So, and one Earley set Si for each input symbol ai. An Earley set contains Earley items, which consist of three parts: a grammar rule; a position in the grammar rule’s right-hand side indicating how much of that rule has been seen, denoted by a dot (•); a pointer back to some previous “parent” Earley set. For instance, the Earley item [A —>• a • Bb, 12] indicates that the parser has seen the first symbol of the grammar 3 8

rule A → aBb, and points back to Earley set S₁₂. We use the term "core Earley item" to refer to an Earley item less its parent pointer: A → a • Bb in the above example. The three steps below are applied to Earley items in Sᵢ until no more can be added; this constructs Sᵢ and primes Sᵢ₊₁.

Scanner. If [A → ⋯ • b ⋯, j] is in Sᵢ and aᵢ₊₁ = b, add [A → ⋯ b • ⋯, j] to Sᵢ₊₁.

Predictor. If [A → ⋯ • B ⋯, j] is in Sᵢ, add [B → •α, i] to Sᵢ for all rules B → α in G.

Completer. If a "final" Earley item [A → ⋯ •, j] is in Sᵢ, add [B → ⋯ A • ⋯, k] to Sᵢ for all Earley items [B → ⋯ • A ⋯, k] in Sⱼ.

An Earley item is added to an Earley set only if it is not already present in the Earley set. The initial set S₀ holds the single Earley item [S' → •S, 0] prior to Earley set construction, and the final Earley set must contain [S' → S•, 0] upon completion in order for the input string to be accepted. Figure 3.1 shows an example of Earley parser operation. The Earley algorithm may employ lookahead to reduce the number of Earley items in each Earley set, but we have found the version of the algorithm without lookahead suitable for our purposes. We also restrict our attention to input recognition rather than parsing proper. Construction of parse trees in Earley's algorithm is done after recognition is complete, based on information retained by the recognizer, so this division may be done without loss of generality.

S₀:                 S₁ (after n):      S₂ (after +):      S₃ (after n):
S' → •E, 0          E → n •, 0         E → E + •E, 0      E → n •, 2
E → •E + E, 0       S' → E •, 0        E → •E + E, 2      E → E + E •, 0
E → •n, 0           E → E • +E, 0      E → •n, 2          E → E • +E, 2
                                                          S' → E •, 0

Figure 3.1: Earley sets for the ambiguous grammar E → E + E | n and the input n + n. Emboldened items are ones which correspond to the input's derivation.
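To make the three steps concrete, the following recognizer is a minimal sketch in Python. It is our own illustrative code, not SPARK's implementation: a rule is a (lhs, rhs) pair, an Earley item is a (rule, dot, parent) tuple, and an ε-free grammar is assumed.

# A minimal Earley recognizer sketch (illustrative, not SPARK's code).
# A rule is (lhs, rhs); an Earley item is (rule, dot, parent).
# Assumes an epsilon-free grammar.
def recognize(grammar, start, tokens):
    nonterminals = set(lhs for (lhs, rhs) in grammar)
    n = len(tokens)
    sets = [[] for i in range(n + 1)]

    def add(item, k):
        if item not in sets[k]:          # add only if not already present
            sets[k].append(item)

    gamma = ("GAMMA", (start,))          # augmented start rule S' -> S
    add((gamma, 0, 0), 0)

    for i in range(n + 1):
        for item in sets[i]:             # sets[i] may grow while we iterate
            (lhs, rhs), dot, parent = item
            if dot < len(rhs) and rhs[dot] in nonterminals:
                # PREDICTOR: expand the nonterminal into the current set
                for rule in grammar:
                    if rule[0] == rhs[dot]:
                        add((rule, 0, i), i)
            elif dot < len(rhs):
                # SCANNER: move the dot over a matching terminal
                if i < n and tokens[i] == rhs[dot]:
                    add(((lhs, rhs), dot + 1, parent), i + 1)
            else:
                # COMPLETER: move the dot over lhs back in the parent set
                for (plhs, prhs), pdot, pp in sets[parent]:
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        add(((plhs, prhs), pdot + 1, pp), i)

    return (gamma, 1, 0) in sets[n]

# The grammar of Figure 3.1: E -> E + E | n, recognizing "n + n".
g = [("E", ("E", "+", "E")), ("E", ("n",))]
print(recognize(g, "E", ["n", "+", "n"]))   # prints True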

3.3 Summary

Reading formal definitions is only slightly more exciting than watching paint peel. That notwithstanding, general parsing algorithms, and Earley's algorithm in particular, suffer from two problems: the speed problem and the delayed action problem.

Chapter 4

Directly-Executable Earley

Parsing¹⁰

SPARK is somewhat unusual in that it uses Earley's algorithm, and can operate using arbitrary context-free grammars. Most parsers in use today are only capable of handling subsets of context-free grammars: LL, LR, LALR. And with good reason — efficient linear-time algorithms for parsing these subsets are known. For LR parsers, dramatic speed improvements have been obtained by producing hard-coded, or directly-executable, parsers [13, 20, 50, 76, 77]. These directly-executable LR parsers implement the parser for a given grammar as a specialized program, rather than using a typical table-driven approach. Unfortunately, these techniques do not directly apply to general parsing algorithms. LR parsers, being specialized algorithms, can rely on the grammar being unambiguous and the

¹⁰An earlier version of this work appeared in [10].

existence of a single stack within the parser, for instance. These same helpful simplifying assumptions cannot be made in a general parser. In this chapter we extend directly-executable techniques for use in Earley's general parsing algorithm, to produce and evaluate what we believe is the first directly-executable Earley parser. The speed of our parsers is shown to be comparable to deterministic parsers produced by Bison, an implementation of Yacc.

4.1 DEEP: a Directly-Executable Earley Parser

4.1.1 Observations

There are a number of observations about Earley’s algorithm which can be made. By themselves, they seem obvious, yet taken together they shape the construction of our directly-executable parser.

Observation 1. Additions are only ever made to the current and next Earley sets,¹¹ Sᵢ and Sᵢ₊₁.

Observation 2. The Completer does not recursively look back through Earley sets: it only considers a single parent Earley set, Sⱼ.

Observation 3. The Scanner looks at each Earley item exactly once, and this is the only place where the dot may be moved due to a terminal symbol.

¹¹To avoid confusion later, we use the unfortunately awkward terms "Earley set" and "Earley item" throughout this chapter.

Observation 4. Movement of the dot due to a nonterminal symbol only takes place when the Completer looks back at a parent Earley set.

Observation 5. Movement of the dot over a symbol, be it terminal or nonterminal, is a common, unifying operation on Earley items.

Observation 6. Earley items added by the Predictor all have the same parent, i.

Observation 5, especially, sparked the notion that it was possible to have a directly-executable Earley parser. But at what level of granularity?

4.1.2 Basic Organization

The contents of an Earley set depend on the input and are not known until run time: we cannot realistically precompute one piece of directly-executable code for every possible Earley set. We can assume, however, that the grammar is known prior to run time, so we begin by considering how to generate one piece of directly-executable code per Earley item. Even within an Earley item, not everything can be precomputed. In particular, the value of the parent pointer will not be known ahead of time. Given two otherwise identical Earley items [A → α • β, j] and [A → α • β, k], the invariant part is the core Earley item. The code for a directly-executable Earley item, then, is actually code for the core Earley item; the parent pointer is maintained as data. A directly-executable Earley item may be represented as the tuple

(code for A → α • β, parent)

the code for which is structured as shown in Figure 4.1. Each terminal and nonterminal symbol is represented by a distinct number; the variable sym can thus contain either type of symbol. Movement over a terminal symbol is a straightforward

implementation of the Scanner step, but movement over a nonterminal is complicated by the fact that there are two actions that may take place upon reaching an Earley item [A → ⋯ • B ⋯, j], depending on the context:

1. If encountered when processing the current Earley set, the Predictor step should be run.

2. If encountered in a parent Earley set (i.e., the Completer step is running), then movement over the nonterminal may occur according to Observation 4. In this case, sym cannot be a terminal symbol, so the predicate ISTERMINAL() is used to distinguish these two cases.

The code for final Earley items calls the code implementing the parent Earley set, after replacing sym with the nonterminal symbol to move the dot over. By Observation 2, no stack is required, as the call depth is limited to one. Again, this should only be executed if the current set is being processed, necessitating the ISTERMINAL().

4.1.3 Earley Set Representation

An Earley set in the directly-executable representation is conceptually an ordered sequence of (code, parent) tuples followed by one of the special tuples:

(end of current Earley set code, −1)
(end of parent Earley set code, −1)

[A → ⋯ • a ⋯, j] (movement over a terminal):

    if (sym == a) {
        add [A → ⋯ a • ⋯, j] to Sᵢ₊₁
    }
    goto next Earley item

[A → ⋯ • B ⋯, j] (movement over a nonterminal):

    if (ISTERMINAL(sym)) {
        foreach rule B → α {
            add [B → •α, i] to Sᵢ
        }
    } else if (sym == B) {
        add [A → ⋯ B • ⋯, j] to Sᵢ
    }
    goto next Earley item

[A → ⋯ •, j] (final Earley item):

    if (ISTERMINAL(sym)) {
        saved_sym = sym
        sym = A
        call code for Earley set Sⱼ
        sym = saved_sym
    }
    goto next Earley item

Figure 4.1: Pseudocode for directly-executable Earley items.

The code at the end of a parent Earley set simply contains a return, to match the call made by a final Earley item. Reaching the end of the current Earley set is complicated by bookkeeping operations to prepare for processing the next Earley set. The parent pointer is irrelevant for either of these. In practice, our DEEP implementation splits the tuples. DEEP's Earley sets are in two parts: a list of addresses of code, and a corresponding list of parent pointers. The two parts are separated in memory by a constant amount, so that knowing the memory location of one half of a tuple makes it a trivial calculation to find the other half.

Having a list of code addresses for an Earley set makes it possible to implement the action "goto next Earley item" with direct threaded code [16]. With threaded code, each directly-executable Earley item jumps directly to the beginning of the code for the next Earley item, rather than first returning to a dispatching loop or incurring function call overhead. Not only does threaded code proffer speed advantages by avoiding the call/return mechanism [55], but it can work better with branch prediction hardware on modern CPUs [35]. We implement this in a reasonably portable fashion using the first-class labels in GNU C.¹² How is an Earley item added to an Earley set in this representation? First, recall that an Earley item is only placed in an Earley set if it does not already appear there. We will use the term "appending" to denote an Earley item being placed into an Earley set; "adding" is the process of determining if an Earley item should be appended to an Earley set. (We discuss adding more in Section 4.1.4.)

¹²This is an extension to ANSI C.

Using this terminology, appending an Earley item to an Earley set is done by dynamically generating threaded code. We also dynamically modify the threaded code to exchange one piece of code for another in two instances:

1. When the current Earley set is fully processed, “end of current Earley set” must be exchanged for “end of parent Earley set.”

2. Observation 3 implies that once the Scanner has looked at an Earley item, any code looking for terminal symbols is superfluous. By modifying the threaded code, DEEP skips the superfluous code on subsequent invocations.

Appending leaves DEEP with a thorny problem of memory management. Observation 1 says that Earley items — threaded code plus separate parent pointers — can be appended to one of two Earley sets. We also maintain an array whose iᵗʰ entry is a pointer to the code for Earley set Sᵢ, for implementation of call, giving us a total of five distinct, dynamically-growing memory areas.

Instead of complex, high-overhead memory management, we have the operating system assist us by memory-mapping oversized areas of virtual memory. This is an efficient operation because the operating system will not allocate the virtual memory pages until they are used. We can also protect key memory pages so that an exception is caused if DEEP should exceed its allocated memory, absolving us from performing bounds checking when appending. This arrangement is shown in Figure 4.2, which also demonstrates how the current and next Earley sets alternate between memory areas.

[Figure: memory layout diagram, including an index to the Earley sets and the dynamically-growing memory areas.]

Figure 4.2: Memory layout for DEEP. S₂ and S₃ are the current and next Earley sets, respectively; the shaded areas are protected memory pages.

4.1.4 Adding Earley Items

As mentioned, adding an Earley item to an Earley set entails checking to ensure that it is not already present. Earley suggested using an array indexed by the parent pointer, each entry of which would be a list of Earley items to search [31]. Instead, we note that core Earley items may be enumerated, yielding finite, relatively small numbers.¹³ A core Earley item's number may be used to index into a bitmap to quickly check the presence or absence of any Earley item with that core. When two or more Earley items exist with the same core, but different parent pointers, we construct a radix tree [59] for that core Earley item — a binary tree whose branches are either zero or one — which keeps track of which parent pointers have been seen. Radix trees have two nice properties:

1. Insertion and lookup, the only operations required, are simple.

2. The time complexity of radix tree operations during execution is log i, where i is the number of tokens read, thus growing slowly even with large inputs.

To avoid building a radix tree for a core Earley item until absolutely necessary, we cache the first parent pointer until we encounter a second Earley item with the same core. An example of adding Earley items is given in Figure 4.3.
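The scheme can be sketched in Python (illustrative names and structure only; DEEP's actual implementation is the C code of Section 4.1.7): a bitmap entry per core Earley item number, a cached first parent, and a radix tree built only when a second, different parent appears.

# Sketch of the adding scheme: bitmap plus lazily-built radix trees.
# One entry per core Earley item number; names are illustrative.
class AddChecker:
    def __init__(self, ncores):
        self.seen = [False] * ncores   # the bitmap: one bit per core item
        self.lazy = [None] * ncores    # cached first parent pointer
        self.tree = [None] * ncores    # radix tree root, built on demand

    def add(self, core, parent):
        """Return True if (core, parent) is new and should be appended."""
        if not self.seen[core]:
            self.seen[core] = True     # first item with this core
            self.lazy[core] = parent
            return True
        if self.tree[core] is None:
            if self.lazy[core] == parent:
                return False           # duplicate of the cached item
            self.tree[core] = {}       # second parent seen: build the tree
            self._insert(core, self.lazy[core])
        return self._insert(core, parent)

    def _insert(self, core, parent):
        """Walk the parent's bits, low to high; True if the entry is new."""
        node = self.tree[core]
        while parent:
            node = node.setdefault(parent & 1, {})
            parent >>= 1
        if node.get("end"):
            return False
        node["end"] = True
        return True

# In the spirit of Figure 4.3: append core #4 with two different parents.
ac = AddChecker(32)
print(ac.add(4, 3))   # True  -- first parent is cached
print(ac.add(4, 7))   # True  -- radix tree built, both parents inserted
print(ac.add(4, 3))   # False -- already present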

4.1.5 Sets Containing Items which are Sets Containing Items

Earley parsing has a deep relationship with its contemporary, LR parsing [30]. Here we look at LR(0) parsers — LR parsers with no lookahead. As with all LR parsers,

¹³To be precise, the number of core Earley items is Σ_{A→α ∈ G} (|α| + 1).

[Figure: bitmap and radix tree roots in states (a), (b), and (c).]

Figure 4.3: Adding Earley items: (a) initial state; (b) after appending Earley item #4 with its first parent; (c) after appending Earley item #4 with a second, different parent. In the radix tree, a circle (•) indicates an absent entry, and a square (■) indicates an existing entry; "X" denotes a "don't care" entry.

an LR(0) parser's recognition is driven by a deterministic finite automaton (DFA) which is used to decide when the right-hand side of a grammar rule has been seen. A DFA state corresponds to a set of LR(0) items, and an LR(0) item is exactly the same as a core Earley item. How is an LR(0) DFA constructed? Consider a nondeterministic finite automaton (NFA) for LR(0) parsing, where each NFA state contains exactly one LR(0) item. A transition is made from [A → ⋯ • X ⋯] to [A → ⋯ X • ⋯] on the symbol X, and from [A → ⋯ • B ⋯] to [B → •α] on ε; the start state is [S' → •S]. This NFA may then be converted into the LR(0) DFA using standard methods [3].¹⁴ The conversion from NFA to DFA yields, as mentioned, DFA states which are sets of LR(0) items. Within each LR(0) set, the items may be logically divided into kernel items (the initial item and items where the dot is not at the beginning) and nonkernel items (all other items) [3]. We explicitly represent this logical division by splitting each LR(0) DFA state into two states (at most), leaving us with an almost-deterministic automaton, the split LR(0) DFA.

Consider the expression grammar G_E:

S' → E        T → T * F      F → −F
E → E + T     T → T / F      F → +F
E → E − T     T → F          F → n
                             F → (E)

Figure 4.4 shows a partial LR(0) DFA for G_E. In the original LR(0) DFA, states 0 and 1, 2 and 10, 18 and 19, 24 and 25 were merged together. Returning to Earley parsing, the core Earley items in an Earley set may be represented using one or more states in an LR(0) DFA [71]. The problem with doing so is

¹⁴Of course, there are much more efficient ways of constructing the LR(0) DFA…

Figure 4.4: Partial LR(0) DFA for G_E. Shading denotes start states.

that keeping track of which parent pointers and LR(0) items belong together results in a complex, inelegant implementation: tuples containing lists inside lists. However, we realized as a result of Observation 6 that the Predictor really just corresponds to making a transition to a "nonkernel" state in the LR(0) DFA. Pursuing this idea, we represent Earley items in DEEP as the tuples

(code for LR(0) DFA state, parent)

Figure 4.5 shows some Earley sets for G_E; Figure 4.6 shows the same Earley sets recoded using the LR(0) DFA states. Through this new representation, we gain most of the efficiency of using an LR(0) DFA as the basis of an Earley parser, but with the benefit of a particularly simple representation and implementation. The prior discussion in this section regarding DEEP still holds, except the directly-executable code makes transitions from one

S₀: S' → •E, 0;  E → •E + T, 0;  E → •E − T, 0;  E → •T, 0;  T → •T * F, 0;  T → •T / F, 0;  T → •F, 0;  F → •n, 0;  F → •−F, 0;  F → •+F, 0;  F → •(E), 0

S₁: F → n •, 0;  T → F •, 0;  E → T •, 0;  T → T • * F, 0;  T → T • / F, 0;  S' → E •, 0;  E → E • + T, 0;  E → E • − T, 0

S₂: E → E + •T, 0;  T → •T * F, 2;  T → •T / F, 2;  T → •F, 2;  F → •n, 2;  F → •−F, 2;  F → •+F, 2;  F → •(E), 2

S₃: F → n •, 2;  T → F •, 2;  E → E + T •, 0;  T → T • * F, 2;  T → T • / F, 2;  S' → E •, 0;  E → E • + T, 0;  E → E • − T, 0

Figure 4.5: Earley sets for the expression grammar G_E, parsing the input n + n.

S₀: 0, 0;  1, 0

S₁: 3, 0;  8, 0;  9, 0;  10, 0;  2, 0

S₂: 19, 0;  18, 2

S₃: 3, 2;  8, 2;  24, 2;  25, 0;  2, 0

Figure 4.6: Earley sets for the expression grammar G_E, parsing the input n + n, encoded using LR(0) DFA states.

LR(0) DFA state to another instead of from one Earley item to another. Figure 4.7 restates the pseudocode from Figure 4.1, using DFA states. There are three major changes to note. First, a state may have either one or two transitions for a given symbol, as a result of splitting DFA states. Second, the explicit Predictor code is gone; Earley items that would have been added by the Predictor are now clustered together in the nonkernel DFA states. Third, the "goto" is no longer shown in the pseudocode, but is still present in the final code. A DFA state may have a combination of the pseudocode shown — movement over a terminal as well as final Earley items, for instance — and the "goto" must always fall at the end of a state's code. We give an example to show how the pseudocode fits together in the next section.

4.1.6 Implementation Rumination

DEEP parsers are generated by a 650-line Python script. This script takes a grammar as input, produces the LR(0) DFA states, and emits the states’ corresponding C code.

For G_E, the LR(0) DFA consists of twenty-five states. What follows is the

Movement over a terminal (one transition):

    if (sym == a) {
        add (s_k, j) to Sᵢ₊₁
    }

Movement over a terminal (two transitions):

    if (sym == a) {
        add (s_k, j) to Sᵢ₊₁
        add (s_nk, i + 1) to Sᵢ₊₁
    }

Movement over a nonterminal (one transition):

    if (sym == B) {
        add (s_k, j) to Sᵢ
    }

Movement over a nonterminal (two transitions):

    if (sym == B) {
        add (s_k, j) to Sᵢ
        add (s_nk, i) to Sᵢ
    }

s_x contains final Earley items [A₁ → ⋯ •, j₁], [A₂ → ⋯ •, j₂], …:

    if (ISTERMINAL(sym)) {
        saved_sym = sym
        sym = A₁
        call code for Earley set S_j₁
        sym = A₂
        call code for Earley set S_j₂
        …
        sym = saved_sym
    }

Figure 4.7: Pseudocode for directly-executable DFA states. The parent pointer of s_x is j; kernel and nonkernel states are denoted k and nk respectively.

annotated C code for five of these states.

state0:
    if (ISTERMINAL(sym)) {
        switch (sym) {
        }
        REPLACEME(&&state0nt);
        DISPATCH();
    }
state0nt:
    switch (sym) {
    case NT_E:
        GEN_CUR(&&state2, 2, MYPARENT());
        break;
    }
    DISPATCH();

Each state has two main labels associated with it. The first is the entry point initially used when processing the current Earley set; the second is used when the symbol in sym is a nonterminal. We will call this the terminal code and nonterminal code, respectively. The REPLACEME macro dynamically modifies the threaded code to skip the terminal code upon subsequent processing of the state. By "skip the terminal code" we mean that the terminal code, whose entry point here is state0, is only used until such time as it sees a terminal symbol. Thereafter, the nonterminal code is directly entered at state0nt. The argument to REPLACEME is the address of a label, which is supplied by the && operator.

At first glance, it appears that the ISTERMINAL guarding the terminal code is superfluous, because the terminal code cannot be invoked in the first place if the symbol in sym is a nonterminal. This is almost always the case. However, Figure 4.8 helps

[Figure: the threaded code array for Sᵢ. Already-processed entries point to nonterminal code, entries not yet processed point to terminal code; an arrow marks the entry currently being processed.]

Figure 4.8: Threaded code in the current Earley set, Sᵢ, during processing.

illustrate the problem. Midway through processing Sᵢ, the parser could encounter a state containing a final Earley item [A → ⋯ •, i] which would force it to call Earley set Sᵢ as the parent set, meaning that sym would hold the nonterminal symbol A. This is not an issue for states in Sᵢ that have already been processed, but it necessitates the guard for states in Sᵢ that have not yet been processed.

Back to the code: DISPATCH causes execution to go to the next piece of threaded code. NT_ is the prefix used to encode nonterminal symbols, and MYPARENT simply returns the parent pointer. GEN_CUR is a macro which adds a state to the current Earley set, Sᵢ; it is supplied with the label and integral number of the state to add, as well as the desired parent pointer.

As is obvious, the code generated is somewhat naïve. We have assumed that a C compiler with a reasonable optimizer is being used, which can clean up some of the detritus, like the empty switch statement. This is in contrast to earlier work on directly-executable LR parsers [50, 76] which expended considerable effort on low-level optimizations.

state1:
    if (ISTERMINAL(sym)) {
        switch (sym) {
        case 'n':
            GEN_NEXT(&&state3, 3, MYPARENT());
            break;
        case '-':
            GEN_NEXT(&&state4, 4, MYPARENT());
            GEN_NEXT(&&state5, 5, curstate + 1);
            break;
        (Cases for + and ( omitted for brevity.)
        }
        REPLACEME(&&state1nt);
        DISPATCH();
    }
state1nt:
    switch (sym) {
    case NT_F:
        GEN_CUR(&&state8, 8, MYPARENT());
        break;
    case NT_T:
        GEN_CUR(&&state9, 9, MYPARENT());
        break;
    case NT_E:
        GEN_CUR(&&state10, 10, MYPARENT());
        break;
    }
    DISPATCH();

State 1, State 0's nonkernel companion, has nontrivial terminal code. GEN_NEXT is an analogue to GEN_CUR, adding a state to the next Earley set, Sᵢ₊₁. The terminal code contains instances where both one and two transitions are made on a single symbol.

state2:
    if (ISTERMINAL(sym)) {
        switch (sym) {
        case T_EOF:
            GEN_NEXT(&&state11, 11, MYPARENT());
            break;
        }
        REPLACEME(&&state2nt);
        DISPATCH();
    }
state2nt:
    switch (sym) {
    }
    DISPATCH();

state11:
    if (ISTERMINAL(sym)) {
        switch (sym) {
        }
        ACCEPT();
        retaddr = NULL;
        REPLACEME(&&state11nt);
        DISPATCH();
    }
state11nt:
    switch (sym) {
    }
    DISPATCH();

These two states deal with the end of file. State 2's terminal code adds State 11 to Sᵢ₊₁, which will execute the ACCEPT macro upon processing and thus exit the parser. The unreachable code following the ACCEPT is left for the compiler to optimize away.

state3:
    if (ISTERMINAL(sym)) {
        switch (sym) {
        }
        CALL(MYPARENT(), NT_F, &&state3L1);
state3L1:
        retaddr = NULL;
        REPLACEME(&&state3nt);
        DISPATCH();
    }
state3nt:
    switch (sym) {
    }
    DISPATCH();

Finally, State 3 contains a call back to a parent Earley set. The CALL macro manages the bookkeeping information shown in Figure 4.7's pseudocode. One extra label must be generated to follow each CALL in order to have an address to return to. The variable retaddr stores the return address, as is probably apparent, but also acts as a flag indicating whether or not a CALL is in progress. This flag information is used when the end of the current state is reached, to determine if a call return must be executed.

4.1.7 A Deeper Look at Implementation

While the implementation description in the last section should be sufficient to recreate DEEP, some detail has been hidden behind macros. For completeness if not clarity, we discuss the definition of those macros in this section. To make the presentation as comprehensible as possible, we have abstracted slightly away from C syntax.

Internally, there are two frequently-used pointers. One is a pointer into the array of threaded code for the current Earley set, and is called cp. The other, ap, points to the corresponding parent pointer.¹⁵ The two pointers' values are separated by a constant value, ARGOFFSET.

That said, the mechanisms for dynamically modifying code and fetching the parent pointer are quite simple:

#define REPLACEME(x)   *cp = x
#define MYPARENT()     *ap

The dispatching and call/return definitions are also relatively straightforward. Note that state is an index to the Earley sets: the iᵗʰ entry of state points to Earley set Sᵢ.

#define DISPATCH()
    ap = ap + 1
    cp = cp + 1
    goto **cp

#define CALL(i, arg, ret)
    retaddr = ret
    save_sym = sym
    save_cp = cp
    sym = arg
    cp = state[i]
    ap = cp + ARGOFFSET
    goto **cp

¹⁵ap is an abbreviation of "argument pointer."

#define RETURN()
    sym = save_sym
    cp = save_cp
    ap = cp + ARGOFFSET
    goto *retaddr

By far the ugliest definition is that of GEN_CUR. First, some background. The variable genptr points to the location where the next piece of threaded code should be placed. For adding Earley items, a bitmap and a set of radix trees are used, as described in Section 4.1.4. Two of each are required — one for adding to the current state, one for adding to the next — and so the bitmaps (bm) and radix tree roots (roots) are actually two-dimensional arrays. The value of the variable bank is used to alternate between the two entries of the arrays. Finally, the present function performs a radix tree search, creating the appropriate radix tree entry if it does not already exist.

#define GEN_CUR(label, stateno, parent)
    if (!BIT_ISSET(&bm[bank], stateno)) {
        BIT_SET(&bm[bank], stateno)
        *genptr = label
        *(genptr + ARGOFFSET) = parent
        genptr = genptr + 1
        *genptr = &&endofstate
        roots[bank][stateno].tree = NULL
        roots[bank][stateno].lazy = parent
    } else if (!present(stateno, parent, bank)) {
        *genptr = label
        *(genptr + ARGOFFSET) = parent
        genptr = genptr + 1
        *genptr = &&endofstate
    }

The definition of GEN_NEXT is analogous and has been omitted.

4.2 Evaluation

We compared DEEP with three different parsers:

1. ACCENT, an Earley parser generator [84].

2. A standard implementation of an Earley parser [49].

3. Bison, the GNU incarnation of Yacc. (Bison is table-driven.)

All parsers were implemented in C, used the same (flex-generated) scanner, and were compiled with gcc version 2.7.2.3 using the -O flag. Timings were conducted on a 200 MHz Pentium with 64 MB of RAM running Debian GNU/Linux version 2.1. The times shown are the sum of user and system times reported by the time command.

Figure 4.9 shows the performance of all four on G_E. As expected, Bison is the fastest, but DEEP is a close second, markedly faster than the other Earley parsers. In Figure 4.10 the parsers (less Bison) operate on an extremely ambiguous grammar. Again, DEEP is far faster than the other Earley parsers. The performance curves themselves are typical of the Earley algorithm, whose time complexity is O(n) for most LR(k) grammars, O(n²) for unambiguous grammars, and O(n³) in the worst case [31].

[Figure: timing plot; x-axis "Number of Tokens" (0-50000), y-axis time in seconds, with curves for ACCENT, Earley, DEEP, and Bison.]

Figure 4.9: Timings for the expression grammar, G_E.

[Figure: timing plot; x-axis "Number of Tokens" (0-220), y-axis time in seconds, with curves for ACCENT, Earley, and DEEP.]

Figure 4.10: Timings for the ambiguous grammar S → …

4.3 Improvements

Next, we tried DEEP on the Java 1.1 grammar [43], which consists of 350 grammar rules.¹⁶ Suffice it to say that only the extremely patient would be content to wait while gcc compiled and optimized this monster. The generated parser was over 19,000 lines of C code, 98% of which were in a single function. To make DEEP practical, its code size had to be reduced. Applying Observation 3, code looking for terminal symbols may only be executed once during the lifetime of a given Earley item. In contrast, an Earley item's nonterminal code may be invoked many times. To decrease the code size, we excised the directly-executable code for terminal symbols, replacing it with a single table-driven interpreter which interprets the threaded code. Nonterminal code is still directly-executed, when the Completer calls back to a parent Earley set. This change reduced compile time by over 90%. Interpretation allowed us to trivially make another improvement, which we call "Earley set compression." Often, states in the LR(0) DFA have no transitions on nonterminal symbols; the corresponding directly-executable code is a no-op which can consume both time and space. The interpreter looks for such cases and removes those Earley items, since they cannot contribute to the parse. We think of Earley set compression as a space optimization, because only a negligible performance improvement resulted from its implementation. The version of DEEP using partial interpretation and Earley set compression is

"■^Tliis number refers to grammar rules in Backus-Naur form, obtained after transforming the grammar from [43] in the manner they prescribe. 6 6

[Figure: plot of the timing difference between SHALLOW and Bison; x-axis "Number of Tokens" (0-30000), y-axis time difference in seconds.]

Figure 4.11: Difference between SHALLOW and Bison timings for the Java 1.1 grammar, parsing 3350 Java source files from JDK 1.2.2 and Java Cup v10j.

called SHALLOW. Figure 4.11 compares the performance of SHALLOW and Bison on realistic Java inputs. SHALLOW can be two to five times slower, although the difference amounts to only fractions of a second — a difference unlikely to be noticed by end users.

One typical improvement to Earley parsers is the use of lookahead. Earley suggested that the Completer should employ lookahead, but this was later shown to be a poor choice [22]. Instead, it was demonstrated that the use of one-token lookahead by the Predictor yielded the best results [22]. This "prediction lookahead"

avoids placing Earley items into an Earley set that are obvious dead ends, given the subsequent input. However, the LR(0) DFA naturally clusters together the Predictor's output. Where prediction lookahead would avoid generating many Earley items in a standard Earley parser, it would only avoid one Earley item in SHALLOW if all the predicted dead ends fell within a single LR(0) DFA state. We instrumented SHALLOW to track Earley items that were unused, in the sense that they never caused more Earley items to be added to any Earley set, and were not final Earley items. Averaging over the Java corpus, 16% of the Earley items were unused. Of those, prediction lookahead could remove at most 19%; Earley set compression removed 76% of unused Earley items, in addition to pruning away other Earley items. We conjecture that prediction lookahead is of limited usefulness in an Earley parser using any type of LR automaton.

4.4 Related Work

Appropriately, the first attempt at direct execution of an Earley parser was made by Earley himself [30]. For a subset of the CFGs which his algorithm recognized in linear time, he proposed an algorithm to produce a hardcoded parser. Assuming the algorithm worked and scaled to practically-sized grammars — Earley never implemented it — it would only work for a subset of CFGs, and it possessed unresolved issues with termination. The only other reference to a directly-executable "Earley" parser we have found is Leermakers' recursive ascent Earley parser [64, 65, 66]. He provides a

heavily-recursive functional formulation which, like Earley's proposal, appears not to have been implemented. Leermakers argues that the directly-executable LR parsers which influenced our work are really just manifestations of recursive ascent parsers [65], but he also notes that he uses "recursive ascent Earley parser" to denote parsers which are not strictly Earley ones [66, page 147]. Indeed, his algorithm suffers from a problem handling cyclic grammar rules, a problem not present in Earley's algorithm (and consequently not present in our Earley parsers). Using deterministic parsers as an efficient basis for general parsing algorithms was suggested by Lang in 1974 [63], and has been applied in Earley parsers [71] and Earley-like parsers [5, 21, 93]. However, none have explored the benefits of using an almost-deterministic automaton and exploiting Earley's ability to simulate nondeterminism.

4.5 Future Work

By going from DEEP to SHALLOW, we arrived at a parser suited for practical use. This came at a cost, however: as shown in Figure 4.12, the partially-interpreted SHALLOW is noticeably slower than the fully-executable DEEP. Even with the slowdown, SHALLOW's timings are still comparable to Bison's. One area of future work is determining how to retain DEEP's speed while maintaining the practicality of SHALLOW. On the other side of the time/space coin, we have yet to investigate ways to reduce the size of generated parsers. Comparing the sizes of stripped executables,

[Figure: timing plot comparing SHALLOW, DEEP, and Bison; x-axis "Number of Tokens" (0-55000), y-axis time in seconds.]

Figure 4.12: Performance impact of partial interpretation of G_E.

SHALLOW parsers for G_E and Java were 1.5 and 9.0 times larger, respectively, than the corresponding Bison parsers. Since the worst space increase reported for directly-executable LR parsers¹⁷ was a factor of four [76], clearly we have some work to do. Additionally, we have not yet explored the possibility of using optimizations based on grammar structure. One such example is elimination of unit rules,¹⁸ grammar rules such as A → B with only a single nonterminal on the right-hand side [46]. Techniques like this have been employed with success in other directly-executable parsers [50, 77].

^"References [20, 50, 76, 77] were taken into account. ’■®Also called chain rule elimination. 70

Will DEEP ever make its way into SPARK? The answer is a definite "not yet." The work described in this chapter gives an idea of how fast Earley's algorithm can operate, and is interesting in that respect. However, we have not yet reconciled the amount of precomputation required to generate a DEEP parser with the requirements of SPARK, namely that the parser perform all of its work at run time. This is another area of future study.

4.6 Summary

Directly-executable LR parsing techniques can be extended for use in general parsing algorithms such as Earley's algorithm. The result is a directly-executable Earley parser which is substantially faster than standard Earley parsers, to the point where it is comparable with LALR(1) parsers produced by Bison.

Chapter 5

Schrödinger's Token¹⁹

From performance, we now shift our focus to the delayed action problem. Recall that the delayed action problem arises because Earley parsers must, in general, read the entire input before executing semantic actions. As a result, the parser cannot feed back information to the scanner during recognition, which leaves the scanner with no idea how to handle context-dependent tokens. As we discuss in this chapter, the delayed action problem can be addressed not by fixing the parsing algorithm, but by exploiting it.

Usually, the type of a token that the scanner returns — the class of words that the token describes — is clear-cut. Problems arise, however, when a token's type depends upon context information that the scanner is not privy to. An example of this problem comes from PL/I, where statements like the following are legal [3, 37]:

IF IF = THEN THEN THEN = IF

¹⁹An earlier version of this work appeared in [12].

Program text:          IF   IF   =   THEN   THEN   THEN   =   IF

Ideal token sequence:  if   ID   =   ID     then   ID     =   ID

stmt → ifstmt | asgnstmt
ifstmt → "if" expr "then" stmt
asgnstmt → "ID" "=" expr
expr → "ID" "=" "ID"

Figure 5.1: Ideal token sequence and (simplified) grammar.

Ideally, the scanner should divine which uses of "if" and "then" correspond to keywords, and which correspond to identifiers; this would allow the parser to use grammar rules which directly reflect the structure of the language (PL/I, in this case). This ideal token sequence and grammar are shown in Figure 5.1. Unfortunately, the necessary context information to produce this token sequence is not directly available to the scanner.

In PL/I, this problem arises because keywords like "if" are not reserved, and can be used in other contexts; most modern languages have avoided this particular difficulty. And while compilation of PL/I is no longer a hot topic, this same problem still arises when compiling little languages. More generally, we do not restrict ourselves to keywords, and consider any situation where a token's type is context-dependent. One language may be embedded within another, as SQL can be embedded within C, C++, or COBOL; the interpretation of tokens may vary greatly between an SQL construct and the host language! There may also be many dialects of a single language which software re-engineering

tools are obliged to recognize [91].

In this chapter we discuss a simple technique for handling context-dependent tokens, its implementation, and some applications. We have used this technique in SPARK with excellent results. For comparison purposes, alternative techniques are also presented.

5.1 Schrödinger's Token

In 1935, Erwin Schrödinger published a paper on quantum mechanics in which he posed the exaggerated case of the now-famous Schrödinger's Cat [83]. A cat and a flask of lethal acid share accommodation in a steel chamber, along with some radioactive material. If an atom in the radioactive material decays — there is a 50% chance of this happening — the flask is broken, and the global cat population is decreased by one. However, the decay of the atom, and the fate of the cat, are unknown until the chamber is opened: effectively, the cat exists in a superposition of "alive" and "dead" states until that time. This idea of superposition is a faithful model of the token interpretation problem. In lieu of context information, a token does not have a unique type but instead has a superposition of types: it is all of its possible types simultaneously, until the parser peeks inside the chamber and resolves the token's type with context information. We use the term "Schrödinger's token" to refer to a token which has a superposition of types. Returning to our PL/I example, Figure 5.2 shows the new token sequence from

Program text:    IF        IF        =   THEN        THEN        THEN        =   IF

Token sequence:  {if, ID}  {if, ID}  =   {then, ID}  {then, ID}  {then, ID}  =   {if, ID}

stmt → ifstmt | asgnstmt
ifstmt → "if" expr "then" stmt
asgnstmt → "ID" "=" expr
expr → "ID" "=" "ID"

Figure 5.2: Token sequence and grammar, using Schrödinger's tokens. The shaded token types indicate which interpretation is eventually used by the parser.

the scanner. "If" and "then" are now Schrödinger's tokens, since they can be either a keyword or an identifier; the "=" token still has a unique type.²⁰ Notice that the grammar is unchanged from the "ideal" grammar in Figure 5.1 — the use of Schrödinger's tokens is thus programmer-friendly, in that no modifications to the grammar are required, and therein lies a problem.

Physical superposition embodies²¹ nondeterminism, something which current computers are not particularly adept at. Parsing algorithms commonly used for compilers, such as the LALR(1) algorithm in Yacc [53, 68], are deterministic and cannot generally cope with the fact that input involving a Schrödinger's token may (temporarily) not have a unique parse. Instead, we use more general parsing techniques such as generalized LR parsing [89] or Earley's algorithm [30, 31], which effectively simulate nondeterminism if necessary to handle ambiguity in the grammar. We note that the Schrödinger's token technique has been used in the past by

²⁰One could argue that "=" also has a dual interpretation, as a comparison operator or an assignment operator.
²¹Or, in the cat's case, possibly disembodies…

computational linguists [99, 72], who are accustomed to using more general parsing techniques. Specifically, the technique was used in natural language parsing to model the case where a single word can belong to many different parts of speech. To the best of our knowledge, computational linguists have no name for this technique, nor has it been applied outside the area of natural language.

5.2 Alternative Techniques

There are a number of other techniques²² for dealing with context-dependent tokens. We present them here in a parser-independent manner but, in practice, many of them involve more programmer effort than is initially apparent. This is because techniques requiring grammar modification often trigger an orgy of grammar rewriting in order to make a modified grammar palatable to a particular parsing engine.

5.2.1 Lexical Feedback

The scanner can be made aware of context if the parser and scanner share some state. The parser uses this shared state to communicate context information to the scanner; this is referred to as lexical feedback [53, 68]. For example, if we were to use lexical feedback to have the scanner return a "then" token rather than an "ID" token, we would add actions to the "ifstmt" rule in Figure 5.1:

²²Some of these techniques are folklore. Consequently, not all the citations in this section refer to original sources.

ifstmt → "if" expr { expectKeyword := true }
         "then" { expectKeyword := false }
         stmt

The scanner would return a "then" or an "ID" token based on the value of the "expectKeyword" flag. Lexical feedback has a number of problems in practice. It couples the scanner and parser tightly: not only do they share state, but the parser and scanner must operate in lock-step. The scanner cannot, for example, tokenize the entire input in a tight loop, or operate in a separate thread of execution, because at any moment the parser may direct it to change how it interprets the input. Additionally, the programmer must fully understand how and when the parser handles tokens, otherwise subtle bugs may be introduced.

5.2.2 Enumeration of Cases

If the scanner insists upon returning the wrong token type in the wrong context, the grammar can be modified such that the "wrong" token type is accepted as well as the right token type: the programmer enumerates all the valid cases in the grammar [38, 27]. In our example, the problem is that the scanner can return an "if" or "then" token, when we really want an "ID" token. We replace all occurrences of "ID" with a new nonterminal that accepts all valid token types that we might see from the scanner:

stmt → ifstmt | asgnstmt
ifstmt → "if" expr "then" stmt
asgnstmt → id "=" expr
expr → id "=" id
id → "ID" | "if" | "then"

Besides these grammar modifications, the correct addition of a new keyword requires changes to logically unrelated grammar rules.

5.2.3 Language Superset

An alternative approach to handling context-dependent tokens is for the scanner to never attempt to distinguish between them, and to always return the most generic token type. The grammar can then be rewritten to accept a superset of the original language. For our running example, this would look like:

stmt → ifstmt | asgnstmt
ifstmt → "ID" expr "ID" stmt
asgnstmt → "ID" "=" expr
expr → "ID" "=" "ID"

The problem, of course, is that eventually an "ifstmt" must have its generic "ID" tokens examined, to ensure that the first one is "if" and the second is "then." This can be done in actions associated with the grammar rules, or deferred to later semantic processing. In either case, the programmer must perform work that the parser was supposed to do in the first place! Furthermore, the intent of the grammar rules is now hidden, making maintenance difficult.

5.2.4 Manual Token Feed

The "manual token feed" method is analogous to manually feeding paper to a printer one sheet at a time. The parser is modified so that when it would normally flag an error, it instead calls a programmer-supplied subroutine with the offending token. The subroutine can then feed the parser an alternate token, and resume parsing. The process of feeding in alternate tokens can continue until the parse succeeds, basically creating a backtracking parser. However, as Kanze points out in his paper describing the technique [54], there are a number of situations where it doesn't work, due to the parser occasionally being in an inopportune state when an error is noted.

5.2.5 Synchronization Symbols

Tarhio [88] proposed a technique using what he called "synchronization symbols." As with the language superset approach, the scanner always returns a generic type for context-dependent tokens. The difference is that the parser injects new tokens — synchronization symbols — into the token stream on the fly, that supply context information. This achieves the same result as lexical feedback, but without tying the parser to the scanner. Using this technique, and the synchronization symbols "if keyword," "then keyword," and "identifier," our grammar would be:

stmt → ifstmt | asgnstmt
ifstmt → if expr then stmt
asgnstmt → id "=" expr
expr → id "=" id
if → sync "if keyword"
then → sync "then keyword"
id → sync "identifier"
sync → "ID" { insert appropriate synchronization symbol, based on context }

A number of assumptions must be made for synchronization symbols to work: the token stream must be mutable; actions associated with grammar rules must be executed by the parser at the earliest convenience, to insert the synchronization symbols in time; the parser must not have looked ahead at later tokens already. And, as with

most of these methods, the grammar must be modified.

5.2.6 Oracles

Abrahams [1] describes a compiler in which PL/I was parsed using a relatively weak method — LL(1) parsing. Here, neither the scanner nor the parser were powerful enough to collect all context information by themselves. When the parser required context information that was not otherwise available, it invoked an oracle, which would reprocess the input to supply the missing information.

However, one could just as easily devise an arrangement where an oracle is invoked by the scanner, giving the scanner enough context information to return the appropriate token type. In practice, oracles may not always be feasible due to the performance impact of reprocessing input and the complexity of constructing a correct oracle.

5.2.7 Scannerless Parsing

Clearly, if the central problem behind context-dependent tokens is a communication barrier between the scanner and parser, then merge them together to remove the barrier! This is the promise of scannerless parsing [82, 94]. The programmer supplies a single grammar which describes both lexical and syntactic rules, making it easy to express the context of tokens. There are tradeoffs, however. The programmer must supply a grammar which specifies the legal placement of whitespace and comments. Besides being error-prone, this does not decrease the grammar size nor increase its readability. The other concern is efficiency, because scannerless parsing uses more powerful pushdown automata to handle individual characters, rather than finite-state automata. This latter point is hard to judge, as scannerless parsing tools are not available, and no timing results have been published.

5.2.8 Discussion of Alternatives

Use of Schrödinger's tokens is closest in spirit to enumeration of cases. The two methods accomplish the same goal, albeit with different tradeoffs: enumeration of cases leaves the scanner unchanged and modifies the grammar, the opposite of Schrödinger's tokens. More broadly, Schrödinger's tokens suffer from none of the problems of alternative techniques, such as extensive grammar modification and limiting assumptions placed on the scanner and parser's operation.

5.3 Implementation

Implementation of Schrödinger's tokens can be divided into two logical parts: support required in parsing tools, and support by the user of those parsing tools, the programmer.

5.3.1 Programmer Support

Given a parser generator which supports Schrödinger's tokens, the programmer need

only supply a scanner which produces said tokens. This can be done in a hand-crafted scanner as well as using scanner generator tools. Figure 5.3 gives a partial Lex [68, 67] specification for our running example. A distinct token type, "SCHRODINGER," is returned to the parser to signal the appearance of a Schrödinger's token. The set of possible types is stored in a global array for use by the parser; for pedagogical reasons, a maximum of two superpositioned types is imposed. The usual disambiguation rules for preferring the longest match are applied.

How is the overlap between token definitions determined? In our experience, this can usually be done by inspection without much difficulty in the typical case where a single lexeme corresponds to multiple token types. However, it is possible to construct a scanner generator tool which would look for such cases automatically.

%%

=           return '=';
if          return schrodinger(IF, ID);
then        return schrodinger(THEN, ID);
[a-zA-Z]+   return ID;

%%

extern int types[2];

int schrodinger(int type1, int type2)
{
    types[0] = type1;
    types[1] = type2;
    return SCHRODINGER;
}

Figure 5.3: Partial Lex specification using Schrödinger's tokens.

5.3.2 Parser Tool Support

Parser generators must have appropriate support for Schrödinger's tokens. We emphasize that this is a one-time tool modification, invisible to the end user/programmer. For expository purposes, we begin with Yacc [53, 68]. Yacc uses an LALR(1) parsing algorithm, a flavor of shift/reduce parsing. In shift/reduce parsing, the parser shifts values onto a stack until a valid right-hand side of some grammar rule is seen atop the stack, at which time the parser reduces, popping values from the stack. The decision as to what parsing action to take is controlled by a deterministic automaton, whose transitions Yacc encodes in a table. At its core, Yacc's parsing algorithm is a loop, indexing into a table based on the topmost stack value and the current input symbol, and performing the action specified in the table entry. Figure 5.4a shows this

(a)

a = input()
while (true) {
    s = top_of_stack()
    switch (action[s, a]) {
    case shift s':      …
    case reduce A → α:  …
    case accept:        …
    case error:         …
    }
}

(b)

a = input()
while (true) {
    s = top_of_stack()
    foreach t ∈ a.types {
        switch (action[s, t]) {
        case shift s':      …
        case reduce A → α:  …
        case accept:        …
        case error:         …
        }
    }
}

Figure 5.4: Pseudocode for (a) the LALR(1) parsing algorithm, and (b) conceptual modifications for Schrödinger's token support.

algorithm in pseudocode form. As shown in Figure 5.4b, supporting Schrödinger's tokens is a matter of performing an action for each of the superpositioned token types. Unfortunately, this is where LALR(1) parsing and Schrödinger's tokens part ways. Even a simple example such as the one in Figure 5.2 would require the parser to be in two distinct automaton states simultaneously, which is not possible in a deterministic automaton. Where Yacc's deterministic LALR(1) algorithm would fail, a general parsing algorithm like Earley's algorithm or generalized LR (GLR) parsing [89] simulates nondeterminism, effectively running multiple parsers in parallel. A GLR parser, for instance, employs the algorithm in Figure 5.4a. However, instead of the single parsing

stack that LALR(1) parsers have, a GLR parser has a set of parsing stacks: conceptually, one stack for each derivation currently being considered. A GLR parser must thus use the LALR(1) algorithm for each of its stack tops. Regardless of the general parsing algorithm used, the required change is the same as outlined above. Any operation that depends on a token's type must be repeated for all the superpositioned types in a Schrödinger's token. In the Earley and GLR parsers we have examined, very few lines of code need modification.
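As an illustration of how small the change is, here is a sketch in Python of an Earley Scanner step extended for Schrödinger's tokens; the item representation follows the earlier recognizer sketch, and all names here are our own, not any particular tool's code.

# Sketch: the Scanner step of an Earley parser, with the single type
# test repeated over every superpositioned type.  Items are
# ((lhs, rhs), dot, parent) tuples, as in the earlier recognizer sketch.
def scan(current_set, next_set, token_types):
    # token_types: the set of possible types of the current input token
    for (lhs, rhs), dot, parent in current_set:
        if dot < len(rhs) and rhs[dot] in token_types:
            next_set.append(((lhs, rhs), dot + 1, parent))

# A plain token has one type; a Schrodinger's token has several.
s0 = [(("ifstmt", ("if", "expr")), 0, 0), (("asgn", ("ID", "=")), 0, 0)]
s1 = []
scan(s0, s1, {"if", "ID"})   # Schrodinger's token: keyword or identifier
print(len(s1))               # both items advance: prints 2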

5.3.3 Schrödinger's Tokens and SPARK

Depending on the parser's design, no changes at all may be required to support Schrödinger's tokens. In an object-based parser, such as the Earley parser we use in

SPARK, tokens are black boxes. The parser invokes a comparison method within each token to discern information about its type, shown here as pseudocode:

class Token:
    def compare(self, type):
        if type == self.type:
            return True
        return False

SPARK's GenericParser uses this comparison method exclusively to probe the type of a token. All the programmer must do is to modify a token's comparison method to respond "yes" when the parser compares the token to any of the superpositioned token types. This requires no parser modification, only a change to the token's comparison method:

class Token:
    def compare(self, type):
        if type in self.types:
            return True
        return False

In this parsing model, a one-line change to the token's comparison method is sufficient for Schrödinger's tokens to work.
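For concreteness, a Schrödinger's token for this model might be built as follows. This is a sketch with our own class and attribute names, not SPARK's distributed code.

# Sketch: a token in a superposition of types.  The parser probes it
# only through compare(), so no parser changes are needed.
class SchrodingerToken:
    def __init__(self, attr, types):
        self.attr = attr      # the lexeme, e.g. "then"
        self.types = types    # every possible interpretation

    def compare(self, type):
        return type in self.types

# The PL/I example: "then" may be the keyword or an identifier.
tok = SchrodingerToken("then", ["then", "ID"])
print(tok.compare("then"), tok.compare("ID"))   # True True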

5.4 Applications

Schrödinger's tokens have many uses. We present three application areas: domain-specific languages, fuzzy parsing, and whitespace-optional languages.

5.4.1 Domain-specific Languages

Domain-specific languages are often designed and implemented in an ad hoc fashion, making the use of traditional compiler techniques difficult, in part due to context-dependent tokens.

Configuration files are one example. Although they vary greatly in complexity, a simple format involves only key-value pairs, as in the /etc/resolv.conf file from our workstation:

search csc.uvic.ca
nameserver 142.104.96.1
nameserver 142.104.6.1

On every line, the first word is the key, and the remaining words on the line are the value. If we were to process this using compiler tools, it would be reasonable to consider making each key a reserved word. This way, the grammar would reflect the

Program text:    search           nameserver            142.104.96.1

Token sequence:  {search, value}  {nameserver, value}   value

Figure 5.5: Schrödinger's tokens for parsing key-value pairs.

structure of the language, and appropriate actions could easily be associated with each different key:

searchstmt → "search" "value" { action for search }
nameserverstmt → "nameserver" "value" { action for nameserver }

However, nothing except good taste prevents us from re-using a key's name as a value, perhaps changing the first input line to "search nameserver." Figure 5.5 shows how the scanner can create Schrödinger's tokens to easily handle this case. Some other domain-specific languages, and examples of how Schrödinger's tokens apply to their implementation:

• Command-line arguments. On UNIX²³ systems, commands like find and expr allow complicated expressions as command-line arguments. Keywords are not reserved and may appear as arguments.

• Text-based network protocols. A number of network protocols, such as FTP [78], HTTP [19], and SMTP [79], use little languages for client-server communication. Again, keywords are not reserved.

• Programming languages. Some modern domain-specific programming languages have context-dependent tokens. We experienced this when re-implementing

²³UNIX is a registered trademark of The Open Group in the United States and other countries.

Guide [69] using compiler tools; the initial implementation was an ad hoc Perl script. The data description language ASN.1 [52] also makes provision for non-reserved keywords that are distinguished by their context.

5.4.2 Fuzzy Parsing

Koppler [61] defined "fuzzy parsing" to be parsing which only recognizes part of an input. Fuzzy parsers are useful for software re-engineering and tools which only look for certain features in their input. A fuzzy parser could extract all the data structure definitions from a program, for instance, or find all public class members in a Java program.

A fuzzy parser operates by skipping tokens until it sees a specified "anchor symbol," a sentinel token that indicates the start of the input sequence that the fuzzy parser recognizes. After parsing this input sequence, the fuzzy parser reverts to skipping tokens again. From an engineering perspective, a fuzzy parser is looking for a signal amidst noise. Figure 5.6 shows how this idea can be applied to create a fuzzy parser that locates class definitions in C++ programs, for the purposes of discovering the class hierarchy (an example from Koppler's paper). The scanner makes every token a Schrödinger's token with a superposition of two types: the token's actual type that would be returned normally, and "noise." The grammar can then skip uninteresting input tokens by interpreting them as noise. The scanner, once configured to return Schrödinger's tokens in this manner, can be re-used without modification for any fuzzy parsing of C++. Fuzzy parsers constructed

Program text:    foo          ;           class            bar          {           static

Token sequence:  {ID, noise}  {;, noise}  {class, noise}   {ID, noise}  {{, noise}  {static, noise}

Figure 5.6: Fuzzy parsing of C++ using Schrödinger's tokens.

with Schrödinger's tokens and a general parser are more powerful than those described by Koppler, because the "signal" they look for need not begin with an anchor symbol.
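A scanner wrapper producing such two-type superpositions could be as simple as the following sketch; the token representation and names are illustrative, not any tool's actual API.

# Sketch: wrap a token stream for fuzzy parsing, making every token a
# Schrodinger's token whose possible types are its real type and "noise".
class SchrodingerToken:
    def __init__(self, attr, types):
        self.attr = attr
        self.types = types
    def compare(self, type):
        return type in self.types

def fuzzify(tokens):
    for type, attr in tokens:              # (type, lexeme) pairs
        yield SchrodingerToken(attr, [type, "noise"])

# Example: the start of the class definition from Figure 5.6.
for tok in fuzzify([("ID", "foo"), (";", ";"), ("class", "class")]):
    print(tok.attr, tok.types)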

5.4.3 Whitespace-optional Languages

Schrödinger's tokens were envisaged as a means of representing one piece of text that may have multiple token types. However, there are some languages where even locating token boundaries is a Herculean task. The classic example is Fortran, where whitespace is optional, and scanning is tricky [81, 36]. For example, the partial input "DO57I=" may correspond to two or four tokens, depending on the context. (Two abutted identifiers or integers are assumed to be invalid.) Figure 5.7 suggests how to address this using Schrödinger's tokens. There are two distinct token sequences; the shorter of the two sequences is padded out with a special "null" type. These null tokens must be ignored by the parser, which can be accomplished by grammar modifications that permit an "ID" to be followed by zero or more null tokens. This idea may be used to handle lexical ambiguity between subranges and real number representations in some languages, when it is unclear if "1..2" denotes a range of values or two adjacent real numbers. As shown in Figure 5.8, this idea also makes

Program DO 57 I text “ir “in “I ri It 1 Token do II INT II ID 1 II II 1 sequence ID II null II null 1 -lU Figure 5.7; Schrodinger’s tokens for parsing Fortran. The token interpretation is not shown due to lack of sufficient context.

Program list < list < int » foo text ------It------If Token > > I! null ID < ID < ID sequence

Figure 5.8: Schrodinger’s tokens for parsing C ++ template sjmtax. it straightforward to resolve the ambiguity in C++ between the right shift operator

“> > ” and nested template parameters (C++ template parameter lists are terminated by the “> ” symbol) [87].
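The padding step is easy to sketch; again this is our own illustration rather than SPARK code. It superposes two candidate type sequences position by position, padding the shorter reading with "null":

    def superpose(seq1, seq2, null='null'):
        # seq1, seq2 are lists of token types for the two readings of
        # the same text; pad the shorter reading with the null type.
        n = max(len(seq1), len(seq2))
        seq1 = seq1 + [null] * (n - len(seq1))
        seq2 = seq2 + [null] * (n - len(seq2))
        return [set(pair) for pair in zip(seq1, seq2)]

    # The Fortran text "DO 57 I" as in Figure 5.7: one identifier
    # "DO57I" versus the three tokens do, INT, ID.
    print(superpose(['ID'], ['do', 'INT', 'ID']))
    # -> [{'ID', 'do'}, {'INT', 'null'}, {'ID', 'null'}]
    # (order of types within each set may vary)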

5.5 Summary

A superposition of token types (a Schrodinger's token) represents situations where a token's interpretation is dependent on context. Using general parsing algorithms, Schrodinger's tokens have five major advantages compared to other methods for addressing the same problem:

1. No grammar modification is required; the grammar can accurately reflect the language being parsed, enhancing readability and maintenance.

2. Robustness: use of Schrodinger's tokens does not require the programmer to understand the parser's internal mechanisms. As well, no assumptions are made as to when the parser performs any actions that are associated with grammar rules.

3. Accurate modelling of context-dependent tokens, capturing the scanner's uncertainty with respect to token types.

4. No tight coupling between scanner and parser. This again simplifies maintenance, and allows scanners and parsers to be seen as interchangeable software components.

5. Support for Schrodinger's tokens may require as little as a one-line code change.

Chapter 6

Early Action in an Earley Parser

In the worst case, an Earley parser has to read its entire input prior to constructing any parse trees; this is the delayed action problem. Semantic actions associated with grammar rules are not executed until parse trees are built (explicitly or implicitly), so no action happens on the fly. This prevents any immediate action from being taken during input recognition, such as file inclusion or macro expansion.

However, it has been recognized for some time that this worst case, having to read the entire input, does not typically occur in programming language grammars; rather, they tend to exhibit only local ambiguities [63]. Exploiting this idea, we present conditions under which an Earley parser can build parse trees during recognition, giving a solution to the delayed action problem as well as yielding space savings.

6.1 Safe Earley Sets

Earley's algorithm pursues all possible derivations of its input at once. From the parser's point of view, it is alternating between two states. There are unambiguous parts of the input's derivation, where the parser is keeping track of only a single derivation sequence. There are also ambiguous parts, where the parser is tracking multiple derivation sequences. At the extremes are inputs whose derivation is completely unambiguous, and inputs that are totally ambiguous.

What characteristics does an Earley set have when recognizing an unambiguous part of the derivation? To help answer this question, we define an Earley set $S_i$ as safe if it has the following properties:

Property 1. $S_i$ must contain at least one final Earley item.

Property 2. For every final Earley item $[A \to X_1 X_2 \cdots X_n \bullet, p] \in S_i$, there exists no other Earley item $[B \to \alpha X_n \bullet \beta, q] \in S_i$.

Property 3. No final Earley item $[A \to \bullet, p]$ exists in $S_i$. This is a special case of Property 2, because there is effectively a symbol (albeit the empty string) before the $\bullet$.

Property 4. A total ordering $\prec$ exists on the final Earley items in $S_i$. Given two final Earley items $I_1 = [A \to \alpha \bullet, p]$ and $I_2 = [B \to X_1 X_2 \cdots X_n \bullet, q]$, we say $I_1 \prec I_2$ if and only if $X_n \Rightarrow^* \alpha$. Combined with Property 2, this precludes Earley items produced by cycles of the form $A \Rightarrow^+ A$ from being part of a safe set.

Without loss of generality, we assume that all Earley items have been generated for $S_i$ before its candidacy as a safe set is considered. (In other words, Scanner, Predictor, and Completer have finished running.) A final Earley item is not guaranteed to be part of a derivation of the input: it could be an eventual dead-end path. However, this is not the case in a safe set. We prove this property of safe sets, then show that this implies that a safe set corresponds to an unambiguous point in the parse.

Theorem 1. Final Earley items in a safe Earley set $S_i$ are part of some derivation of the input.

Proof. Given a final Earley item $I = [A \to X_1 X_2 \cdots X_n \bullet, p] \in S_i$, there are two cases to consider. (The third possibility, where the right-hand side of the rule is empty, is precluded by Property 3.)

Case 1. $X_n$ is a terminal symbol. By Property 2, there can be no other way for the input symbol $a_i$ to be consumed. $I$ must therefore be part of a derivation of the input.

Case 2. $X_n$ is a nonterminal symbol. $I$ must have been added to $S_i$ by Completer, which means that there must also be a final Earley item $I' = [X_n \to Y_1 Y_2 \cdots Y_n \bullet, q]$ in $S_i$. What can $Y_n$ be? Again, there are two cases:

Case 2a. $Y_n$ is a terminal. This is Case 1; $I'$ must be part of some derivation. As $I'$ is directly responsible for $I$ being in $S_i$, and there can be no other way for $X_n$ to be consumed (by Property 2), $I$ must also be part of some derivation.

Case 2b. $Y_n$ is a nonterminal. We have arrived back at Case 2! Property 4's total ordering of final Earley items ensures that there is an ordered sequence of final Earley items in $S_i$. This sequence must terminate with a final Earley item that has a terminal symbol before the dot, otherwise Completer could not have been invoked for any item in $S_i$ (recall that empty rules are not present in safe sets). Employing the same reasoning as in Case 2a, $I'$ and $I$ must be part of some derivation.

I is therefore always part of some input derivation. □

Corollary 1. If Earley set $S_i$ is safe, then it corresponds to an unambiguous part of the input's derivation.

Proof. For an ambiguity to exist, there must be two or more derivations of a sentence. In all the above cases, there is a unique way to derive each terminal and nonterminal, so no ambiguity can coexist with a safe Earley set. □

We observe that Property 4 follows from the other three properties. This is captured in the following lemma.

Lemma 1. If Properties 1-3 hold for an Earley set $S_i$, then Property 4 holds.

Proof. We look at the ways that a final item may be added to Earley set $S_i$:

1. Added by Scanner (running on $S_{i-1}$). If Property 2 holds for $S_i$, there can only be one such final item.

2. Added by Completer. If Property 3 holds for $S_i$, then the only way to cause Completer to run initially on $S_i$ is by virtue of Scanner having added a final item to $S_i$. Thereafter, only other final items added to $S_i$ by Completer can cause Completer to run on $S_i$ again.

What happens when Completer adds final items to $S_i$? The answer depends on how many final items it adds. If Completer adds one new final item each time, then Property 4 is preserved, because there is a distinct ordered sequence. As mentioned, cycles are prevented if Property 2 holds.

On the other hand, if Completer adds several final items, then the dot must have been moved over the same nonterminal symbol to create those final items. This would result in a violation of Property 2. □

Lemma 1 implies that it is sufficient to verify Properties 1-3 to ascertain whether a given Earley set is a safe set.

6.2 Practical Implications

A safe Earley set is straightforward to detect during recognition of the input, and finding one has two important ramifications.

Figure 6.1: Window on Earley sets. $S_i$ is the last safe set; $S_j$ is being considered for candidacy as a safe set. (Partial parse trees have been constructed up to $S_i$; Earley sets past $S_j$ have not been built yet.)

6.2.1 Construction of Partial Parse Trees

By Theorem 1, recognizing a safe set means that the set corresponds to a single derivation path. It is therefore possible to enumerate all derivation paths prior to that point, since we know we cannot be in the middle of a local ambiguity. In a compiler-compiler, we would typically want to execute actions associated with a single derivation of the input. Upon reaching a safe set, we can choose a derivation path and resolve any local ambiguities prior to the safe set by various means: heuristics, calling user-defined routines, disambiguating rules [2], or more elaborate means [56].

In practice, a window effectively exists on the Earley sets, as seen in Figure 6.1. The trailing edge of the window is at the last safe set; the leading edge is at the set currently being considered for safety. All possible partial parse trees have been built prior to the window. Earley sets in the window contain information about partial parse trees we look forward to building upon discovering another safe set, since this is the only condition under which the trailing edge of the window moves.

Intuitively, one can think of the window expanding in two circumstances. First, the window may expand when all of the symbols in a rule have not yet been seen. In $A \to abc$, for instance, we would have to see all of $a$, $b$, and $c$ before we would have a final item to consider. Second, ambiguity causes window expansion, since ambiguity necessarily entails multiple input derivations. In ambiguous C++ declarations, the window would expand to cover the Earley sets involved in the locally-ambiguous declaration; only after the end of the declaration could the trailing window edge move forward. This is sketched in Figure 6.2.

Figure 6.2: Local ambiguity in C++. Is "foo (*p)" a declaration or a function invocation? (Semantic actions have been executed for the partial parse trees already constructed; the input to the right of the window is unprocessed.)

Figure 6.3 gives some pseudocode which shows how Earley's algorithm would be modified to incorporate construction of partial parse trees. The variable lastsafe would have an initial value of zero.

    constructSet(S_i)
    if issafe(S_i):
        constructTrees(lastsafe, i)
        lastsafe = i

Figure 6.3: Pseudocode for construction of partial parse trees.
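The issafe test of Figure 6.3 only needs to check Properties 1-3, since Property 4 then follows by Lemma 1. A minimal sketch in Python; the item representation and the tuple layout are our own assumptions, not SPARK internals:

    def is_safe(earley_set):
        # Items are assumed to be (lhs, rhs, dot, parent) tuples,
        # with rhs a tuple of grammar symbols.
        final = [item for item in earley_set if item[2] == len(item[1])]
        if not final:                       # Property 1
            return False
        for lhs, rhs, dot, parent in final:
            if len(rhs) == 0:               # Property 3: no empty rules
                return False
            for other in earley_set:        # Property 2: no other item
                if other == (lhs, rhs, dot, parent):
                    continue                # with the dot just after
                _, orhs, odot, _ = other    # the same symbol
                if odot > 0 and orhs[odot - 1] == rhs[-1]:
                    return False
        return True

With items kept this way the check is quadratic in the size of the set; indexing items by the symbol before their dot would make it linear.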

6.2.2 Space Savings

Recognizing a safe set admits a simple way to save space during parsing, whose formal justification comes from the following theorem [4].

Theorem 2 (Aho and Ullman). $[A \to \alpha \bullet \beta, i] \in S_j$ if and only if $\alpha \Rightarrow^* a_{i+1} \cdots a_j$ and there are strings $\gamma$ and $\delta$ such that $S \Rightarrow^* \gamma A \delta$ and $\gamma \Rightarrow^* a_1 \cdots a_i$.

In the case of a safe set, we are dealing with final Earley items: $[A \to \alpha\beta \bullet, i] \in S_j$. Then $\alpha\beta \Rightarrow^* a_{i+1} \cdots a_j$ and, by Corollary 1, we know that this can be the only way of doing so. As a result, the Earley sets $S_{i+1}, \ldots, S_{j-1}$ can never be referenced again and may be deleted. This is shown graphically in Figure 6.4.

Figure 6.4: Saving space. The shaded set is a safe set. ($\alpha\beta$ spans the input $a_{i+1} \cdots a_j$ between Earley sets $S_i$ and $S_j$.)
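In an implementation that keeps the Earley sets in a list, the deletion itself is trivial. A sketch, under the assumption (ours) that sets is that list and last_safe indexes the previous safe set:

    def discard_unreachable(sets, last_safe, j):
        # Sets strictly between the previous safe set and the new
        # safe set S_j can never be referenced again (Theorem 2 and
        # Corollary 1), so release them for garbage collection.
        for k in range(last_safe + 1, j):
            sets[k] = None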

6.3 Empirical Results

We applied our technique to grammars for five widely-used programming languages: Java, Modula-2, Oberon-2, Perl, and Python. These grammars have extremely different characteristics. The Java grammar is unambiguous, not even sporting a dangling-else problem. At the other end of the spectrum, the Perl grammar makes heavy use of Yacc's conflict resolution mechanisms to avoid almost five hundred shift/reduce conflicts (we retained all these conflicts in the Perl grammar when we ran our experiments). The grammars and their corpora are summarized in Table 6.1.

                          Java            Modula-2            Modula-2 (ISO)
    Grammar               Java 1.1        m2c(a) derivative   ISO/IEC IS 10514
    BNF Rules             350             249                 388
    LALR(1) Conflicts(b)  0               0                   2 s/r, 21 r/r
    Corpus Size (Files)   3350            607                 607
    Corpus Sources        JDK 1.2.2       Coco/R 1.5          Coco/R 1.5
                          Java Cup v10j   PMOS 2.2            PMOS 2.2
                                          Ulm Modula-2 3.068  Ulm Modula-2 3.068

                          Oberon-2        Perl                Python
    Grammar               [73]            Perl 5.6.0          Python 1.5.2
    BNF Rules             197             189                 271
    LALR(1) Conflicts     3 s/r           485 s/r             10 s/r
    Corpus Size (Files)   580             645                 1039
    Corpus Sources        OOC 001009      BioPerl 0.6.2       Python 1.5.2
                          VisualOberon    Catalog 1.02
                          001115          Crail 0.6
                                          FreeWRL 0.14
                                          Perl 5.6.0

    (a) By C. Boldyreff; in the comp.compilers archive.
    (b) "s/r" stands for a shift/reduce conflict, "r/r" for a reduce/reduce conflict.

Table 6.1: Grammar and corpora characteristics.

As shown in Table 6.2, 69% of the Earley sets for the corpora of each language contained final items, on average. This represents the absolute best case: even using an oracle, actions could only be executed at these points. The rather high percentage for Modula-2 reflects a large number of ε-rules in key places: statically, the Modula-2 grammar contains 34 ε-rules compared to only seven for the ISO Modula-2 grammar.

                      Total Sets   Sets with Final Items    %
    Java              2390614      1415818                 59
    Modula-2           506789       462955                 91
    Modula-2 (ISO)     506789       363262                 72
    Oberon-2           722371       482049                 67
    Perl              3590153      2180437                 61
    Python            1074560       675859                 63

Table 6.2: Earley sets containing final items.

Of those Earley sets with final items, how many were safe sets? Table 6.3 answers this question. Unfortunately, it raises others. We had expected Perl's ambiguous grammar to present the biggest challenge to safe set detection, yet we found the second-highest percentage of safe sets for Perl! We conjecture that this may be due to the grammar structure of the various languages, but we have no conclusive proof of this as yet.

                      Sets with Final Items   Safe Sets    %
    Java              1415818                  919965     65
    Modula-2           462955                   62303     13
    Modula-2 (ISO)     363262                   54674     15
    Oberon-2           482049                  102370     21
    Perl              2180437                  682709     31
    Python             675859                  147979     22

Table 6.3: Safe sets.

The mean window size is a measure which indicates how often, on average, we would be able to execute semantic actions using our technique. As Table 6.4 shows, this number is quite low for Java, Oberon-2, and Perl; the result is tolerable for the other languages. The consistency of the results for like grammars is interesting: note the two Modula-2 grammars. We noticed the same phenomenon for the Python grammar and a transformed version of it (which we will describe shortly) that had a mean window size of 30.0.

                      Mean Window Size
    Java                   4.1
    Modula-2              39.2
    Modula-2 (ISO)        39.4
    Oberon-2              11.9
    Perl                   7.5
    Python                26.7

Table 6.4: Mean window size.

                      Mean Set Retention (%)   Mean Item Retention (%)
    Java                  12                       13
    Modula-2              26                       26
    Modula-2 (ISO)        24                       23
    Oberon-2              18                       18
    Perl                  59                       70
    Python                 9                        9

Table 6.5: Mean set and item retention.

Finally, the mean set retention measures the ratio between Earley sets that cannot be discarded and the total number of sets, at each point in the parse. Table 6.5 gives the results. The mean set retention appears to be a good indicator of memory savings by the Earley parser. The memory savings from deleting Earley sets during parsing is realized, ultimately, by the fact that there are fewer Earley items in memory. Therefore, we have also gathered data on mean item retention, the number of Earley items kept at each point in the parse; it correlates well with mean set retention. We caution, however, that the mean item retention may vary depending on the parser. For example, techniques such as prediction lookahead, Earley set compression, and implementing Earley sets using LR states can all affect the number of items in an Earley set. (We used an unadulterated Earley parser when gathering mean item retention data.) We observe that employing one of the above techniques and pruning out some Earley items may cause more safe sets to emerge.

Figure 6.5: Mean set retention and input size. (The plot shows set retention percentage against the number of input tokens, up to 30000.)

For Java, the mean set retention means that, on average, we can delete 88% of Earley sets during parsing. Comparing this to the number of tokens in the Java input, Figure 6.5 shows that the space savings increase rapidly with the size of the input.

One nagging question: why is the mean set retention so high for Perl? We hypothesize that this is due to the use of right-recursive rules in the Perl grammar, that is, rules of the form $A \to \alpha A$. Such rules are discouraged in grammars for LR-family parsers because they force the parser to keep far more information on its stack than the equivalent left-recursive rules.

To demonstrate the ill effects caused by right-recursive rules, we performed an experiment. The Python grammar, as distributed, is written using extended BNF (EBNF) [100], which we convert automatically to BNF via a Python script. The EBNF iteration construct, denoted by braces, is used throughout the Python grammar as a notational shorthand: $A \to \alpha\{\beta\}\gamma$ means that zero or more occurrences of $\beta$ are allowed. This construct may be translated into BNF in at least three ways, shown in Figure 6.6; we used the form in Figure 6.6a to produce the above results.

    a) Left recursive,      b) Left recursive       c) Right recursive
       no ε-rules
    A  → αγ                 A  → αA′γ               A  → αA′γ
    A  → αA′γ               A′ → A′β                A′ → βA′
    A′ → β                  A′ → ε                  A′ → ε
    A′ → A′β

Figure 6.6: Converting EBNF iteration into BNF.

We generated two additional grammars for Python by applying the different conversions for iteration. Table 6.6 gives the characteristics of these grammars, repeating the previous Python grammar specifics for comparison purposes. Using these grammars to parse the Python corpus clearly shows the dramatic negative effect caused by right recursion (Table 6.7), confirming our hypothesis.

                          Python         Python (lrec)   Python (rrec)
    Grammar               Python 1.5.2   Python 1.5.2    Python 1.5.2
    BNF Rules             271            215             215
    LALR(1) Conflicts     10 s/r         6 s/r           44 s/r
    Iteration Conversion  a              b               c

Table 6.6: Flavors of Python grammar.

                              Python   Python (lrec)   Python (rrec)
    Mean Window Size          26.7     30.0            84.7
    Mean Set Retention (%)     9       12              92
    Mean Item Retention (%)    9       12              95

Table 6.7: Python grammar results.
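For concreteness, here is a sketch of what conversion (a) of Figure 6.6 amounts to in Python. The function name and argument conventions are hypothetical; the actual conversion script is not reproduced here.

    def remove_iteration(lhs, alpha, beta, gamma, fresh):
        # Convert A -> alpha { beta } gamma using form (a) of
        # Figure 6.6: left recursive, with no epsilon-rules.
        # alpha, beta, gamma are lists of grammar symbols;
        # lhs and fresh are nonterminal names.
        return [
            (lhs,   alpha + gamma),            # A  -> alpha gamma
            (lhs,   alpha + [fresh] + gamma),  # A  -> alpha A' gamma
            (fresh, beta),                     # A' -> beta
            (fresh, [fresh] + beta),           # A' -> A' beta
        ]

    # For example, A -> a { b } c:
    rules = remove_iteration('A', ['a'], ['b'], ['c'], "A'")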

6.4 Previous Work

The only prior work in this area seems to have been by Earley himself [30]. He described a means of determining if a final item belongs to the input's derivation, using k symbols of lookahead in the best case and an elaborate bookkeeping scheme in the worst case. Reclamation of space was treated as a separate issue: Earley gave criteria for determining when a given item is no longer necessary, and proposed that a garbage collector could be built on this basis (he never did so, however). In contrast to Earley's proposals, our method is simple and unifies both areas.

6.5 Future Work

The problem remaining with construction of partial parse trees is one of usability. If we incorporated this scheme into SPARK's parser, the user would have no concrete idea as to when their semantic actions would be executed. This can be problematic; for example, say that variable information needs to be entered into a symbol table immediately after a declaration is parsed. How can the user verify that this will happen, and will continue happening even if the grammar is changed? One possible approach would be to devise some way for the user to mark such critical spots in the grammar.

Another way would be to separate semantic concerns from parsing entirely, and move to a more powerful formalism. One such formalism is attribute grammars [57]. In an attribute grammar, semantic actions are associated with grammar rules. These semantic actions are not arbitrary code, though; they specify the computation of attributes in such a way that they can be understood by an "attribute evaluator." This evaluator computes dependencies amongst attributes, and orders their computation appropriately. Using attribute grammars in SPARK would effectively remove the need for semantic analysis using GenericASTTraversal, which can be seen as a crude, manual way of accomplishing the same task.

Returning to the future, one straightforward extension to the work in this chapter would be to employ Follow set information when looking for a safe Earley set. The Follow set of a nonterminal $A$, $Follow(A)$, is the set of all terminals $a$ such that $S \Rightarrow^* \alpha A a \beta$ [3]. We speculate that this might give the same improvement as SLR(1) parsing (which uses Follow sets) does over LR(0) parsing (which doesn't).

6.6 Summary

A safe Earley set exhibits four properties which are easy to check for during the parser's operation. Discovering a safe set allows partial parse trees to be constructed during recognition of the input, and also identifies Earley sets that can be deleted. Our results on programming language grammars show that this last point can result in enormous space savings.

Chapter 7

Running Earley on Empty

It is embarrassing to admit that a bug deserves credit for the work in this chapter. Like most parsing algorithms, Earley parsers suffer from additional complications when handling grammars containing ε-rules, i.e., rules of the form $A \to \epsilon$, which crop up frequently in grammars of practical interest. Earley's descriptions of his algorithm carefully describe the problem, which we just as carefully ignored when implementing SPARK's Earley parser. Failing to handle ε-rules properly causes an Earley parser to erroneously reject certain inputs, surely enough to raise the flag of suspicion. Yet this bug in SPARK sat unnoticed for over a year, through three public releases! Subsequent testing showed that it was remarkably difficult to trigger this bug under normal circumstances. This led us to study the nature of the interaction between Earley parsing and ε-rules more closely, and arrive at a straightforward remedy to the problem.

7.1 The Problem of ε

In terms of implementation, Earley sets are built in increasing order as the input is read. Also, each set is typically represented as a list of items, as suggested by Earley [30, 31]. This set representation is particularly convenient, because the list of items acts as a "work queue" when building the set: items are examined in order, applying Scanner, Predictor, and Completer as necessary; items added to the set are appended onto the end of the list.

At any given point $i$ in the parse, then, we have two partially-constructed sets. Scanner may add items to $S_{i+1}$, and $S_i$ may have items added to it by Predictor and Completer. It is this latter possibility, adding items to $S_i$, which causes grief with ε-rules.

When Completer processes an item $[A \to \bullet, j]$ which corresponds to the ε-rule $A \to \epsilon$, it must look through $S_j$ for items with the dot before an $A$. Unfortunately, for ε-rule items, $j$ is always equal to $i$ (ε-rule items can only be added to an Earley set by Predictor, which always bestows added items with the parent pointer $i$); Completer is thus looking through the partially-constructed set $S_i$. Since implementations process items in $S_i$ in order, if an item $[B \to \cdots \bullet A \cdots, k]$ is added to $S_i$ after Completer has processed $[A \to \bullet, j]$, Completer will never add $[B \to \cdots A \bullet \cdots, k]$ to $S_i$. In turn, items resulting directly and indirectly from $[B \to \cdots A \bullet \cdots, k]$ will be omitted too. This effectively prunes potential derivation paths, which can cause correct input to be rejected. Figure 7.1 gives an example of this happening.

    Grammar:        S0:                S1:
    S' → S          [S' → •S, 0]       [A → a•, 0]
    S  → AAAA       [S → •AAAA, 0]     [S → A•AAA, 0]
    A  → a          [A → •a, 0]        [S → AA•AA, 0]
    A  → E          [A → •E, 0]        [A → •a, 1]
    E  → ε          [E → •, 0]         [A → •E, 1]
                    [A → E•, 0]        [E → •, 1]
                    [S → A•AAA, 0]     [A → E•, 1]
                                       [S → AAA•A, 0]

Figure 7.1: An unadulterated Earley parser rejects the valid input a. Missing items in S0 sound the death knell for this parse.

Two methods of handling this problem have been proposed. Grune and Jacobs

aptly summarize one approach:

‘The easiest way to handle this mare’s nest is to stay calm and keep running the Predictor and Completer in turn until neither has anything more to add.’ [46, page 159]

Aho and Ullman [4] specify this method in their presentation of Earley parsing, and it is used by ACCENT [84], a compiler-compiler which generates Earley parsers.

The other approach was suggested by Earley [30, 31]. He proposed having Completer note that the dot needed to be moved over $A$, then looking for this whenever future items were added to $S_i$. For efficiency's sake, the collection of nonterminals to watch for should be stored in a data structure which allows fast access. We used this method initially to fix the bug in SPARK.

In our opinion, neither approach is very satisfactory. Repeatedly processing $S_i$, or parts thereof, involves a lot of activity for little gain. It also complicates our directly-executable Earley parser, which expects to process the contents of $S_i$ in one pass. On the other hand, Earley's solution requires an extra, dynamically-updated data structure and the unnatural mating of Completer with the adding of items. Ideally, we want a solution which retains the elegance of Earley's algorithm, only processes items in $S_i$ once, and has no run time overhead from updating a data structure.

7.2 An “Ideal” Solution

Our solution involves a simple modification to Predictor, based on the idea of nullability. A nonterminal $A$ is said to be nullable if $A \Rightarrow^* \epsilon$; terminal symbols, of course, can never be nullable. The nullability of nonterminals in a grammar may be easily precomputed using well-known techniques [6, 37]. Using this notion, our Predictor can be stated as follows (the second sentence is our modification):

If $[A \to \cdots \bullet B \cdots, j]$ is in $S_i$, add $[B \to \bullet \alpha, i]$ to $S_i$ for all rules $B \to \alpha$. If $B$ is nullable, also add $[A \to \cdots B \bullet \cdots, j]$ to $S_i$.

In other words, we eagerly move the dot over a nonterminal if that nonterminal can derive ε and effectively "disappear." Using the grammar from the ill-fated Figure 7.1, Figure 7.2 demonstrates how our modified Predictor fixes the problem.

    S0:                 S1:
    [S' → •S, 0]        [A → a•, 0]
    [S → •AAAA, 0]      [S → A•AAA, 0]
    [A → •a, 0]         [S → AA•AA, 0]
    [A → •E, 0]         [S → AAA•A, 0]
    [S → A•AAA, 0]      [S → AAAA•, 0]
    [E → •, 0]          [S' → S•, 0]
    [A → E•, 0]         [A → •a, 1]
    [S → AA•AA, 0]      [A → •E, 1]
    [S → AAA•A, 0]      [E → •, 1]
    [S → AAAA•, 0]      [A → E•, 1]
    [S' → S•, 0]

Figure 7.2: An Earley parser accepts the input a, using our modification to Predictor.
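The nullability precomputation is a standard fixed-point iteration. A minimal sketch in Python, as our illustration rather than SPARK's internals; rules are assumed to be (lhs, rhs) pairs with rhs a list of symbols:

    def nullable_nonterminals(rules):
        # A nonterminal is nullable if some rule rewrites it entirely
        # to nullable symbols; iterate until no new ones qualify.
        nullable = set()
        changed = True
        while changed:
            changed = False
            for lhs, rhs in rules:
                if lhs not in nullable and \
                   all(sym in nullable for sym in rhs):
                    nullable.add(lhs)
                    changed = True
        return nullable

    # For the grammar of Figure 7.1, every nonterminal is nullable:
    grammar = [("S'", ['S']), ('S', ['A', 'A', 'A', 'A']),
               ('A', ['a']), ('A', ['E']), ('E', [])]
    print(nullable_nonterminals(grammar))   # {'E', 'A', 'S', "S'"}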

7.3 Proof of Correctness

Our solution is correct in the sense that it produces exactly the same items in an Earley set $S_i$ as would Earley's method of handling ε-rules. In the following proof, we write $E(S_i)$ and $E'(S_i)$ to denote the contents of Earley set $S_i$ as computed by Earley's method and our method, respectively. Earley's Scanner, Predictor, and Completer steps are denoted by $E_S$, $E_P$, and $E_C$; ours are $E'_S$, $E'_P$, and $E'_C$. We begin with some results which will be used later:

Lemma 2. Let $I$ be the Earley item $[A \to \alpha \bullet, i]$. If $I \in S_i$, then $\alpha \Rightarrow^* \epsilon$.

Proof. There are two cases, depending on the length of $\alpha$.

Case 1. $|\alpha| = 0$. The lemma is trivially true, because $\alpha$ must be $\epsilon$.

Case 2. $|\alpha| > 0$. The key observation here is that the parent pointer of an Earley item indicates the Earley set where the item first appeared. In other words, the item $[A \to \bullet \alpha, i]$ must also be present in $S_i$, and because both it and $I$ are in $S_i$, it means that $\alpha$ must have made its debut and been recognized without consuming any terminal symbols. Therefore $\alpha \Rightarrow^* \epsilon$. □

Lemma 3. If $[B \to \bullet \alpha, i] \in E'(S_i)$ and $\alpha \Rightarrow^* \epsilon$, then $[B \to \alpha \bullet, i]$ will be added to $E'(S_i)$ during processing.

Proof. We again look at the length of $\alpha$:

Case 1. $|\alpha| = 0$. True, because $\alpha = \epsilon$.

Case 2. $|\alpha| > 0$. Let $m = |\alpha|$. Then $\alpha = X_1 X_2 \cdots X_m$. Because $\alpha \Rightarrow^* \epsilon$, none of the constituent symbols of $\alpha$ can be terminals. Furthermore, $X_l \Rightarrow^* \epsilon$, $1 \le l \le m$. When processing $[B \to \bullet X_1 X_2 \cdots X_m, i]$, $E'_P$ would add $[B \to X_1 \bullet X_2 \cdots X_m, i]$, whose later processing by $E'_P$ would add $[B \to X_1 X_2 \bullet \cdots X_m, i]$, and so on until $[B \to X_1 X_2 \cdots X_m \bullet, i]$ (also known as $[B \to \alpha \bullet, i]$) had been added to $E'(S_i)$. □

Next, we establish containment properties of $E(S_i)$ and $E'(S_i)$. Our approach works backwards, in a way: we pick some arbitrary Earley item $I$ produced by one algorithm, and determine the preconditions which must have existed for $I$ to be present. Assuming those same preconditions, we show that the other algorithm must also produce $I$. We assume that Earley sets $S_0, S_1, \ldots, S_{i-1}$ are identical for both algorithms, and that set $S_i$ is the first set in which $E(S_i)$ and $E'(S_i)$ differ.

Lemma 4. $E(S_i) \subseteq E'(S_i)$.

Proof. Assume that there is some Earley item $I \in E(S_i)$ which is not contained in $E'(S_i)$. We can classify $I$ according to the position of the dot:

Case 1. The dot is at the beginning. $I$ must be of the form $[A \to \bullet \alpha, i]$ (this includes ε-rules, where $\alpha = \epsilon$). If $I$ is the initial item $[S' \to \bullet S, 0]$, then $I$ is definitely present in $E(S_i)$ and $E'(S_i)$. Otherwise, $I$ can only have been added to $E(S_i)$ by $E_P$, which must have processed some earlier item $I' \in E(S_i)$, where $I' = [B \to \cdots \bullet A \cdots, j]$. However, both $E_P$ and $E'_P$ add $I$ when processing $I'$, so if $I' \in E'(S_i)$, then $I$ must be in both $E(S_i)$ and $E'(S_i)$.

Case 2. The dot is not at the beginning, and there is a terminal symbol to the left of the dot. Here, $I$ must look like $[A \to \cdots a \bullet \cdots, j]$, and must have been added as a result of $E_S$ processing Earley set $S_{i-1}$. But because $E_S$ is the same as $E'_S$, $I$ must again be in both $E(S_i)$ and $E'(S_i)$.

Case 3. The dot is not at the beginning, and there is a nonterminal symbol to the left of the dot. This is the most interesting case. $I$ must be $[A \to \cdots B \bullet \cdots, k]$, and must have arisen by $E_C$ processing an earlier item in $E(S_i)$, $I' = [B \to \alpha \bullet, j]$.

If $j \ne i$, then $I'$ would have to be in $E(S_i)$ and $E'(S_i)$; $E_C$ and $E'_C$ operate the same way on $S_j$.

If $j = i$ instead, then $\alpha \Rightarrow^* \epsilon$ by Lemma 2. Consequently, the item $I'' = [A \to \cdots \bullet B \cdots, k]$ is also in $E(S_i)$. This is mirrored in our algorithm: if $I'' \in E'(S_i)$, then $E'_P$ will add the item $[B \to \bullet \alpha, i]$, which will eventually cause $I'$ to be added to $E'(S_i)$ by Lemma 3.

Now consider the order in which $I''$ and $I'$ appear:

Case 3a. $I''$ is present when $I'$ is processed. In this case, $E_C$ and $E'_C$ work the same way, and $I$ is added to both $E(S_i)$ and $E'(S_i)$.

Case 3b. $I''$ is not present when $I'$ is processed. In $E$, the fact that the dot needs to be moved over $B$ would have been dynamically recorded; $I$ would be added when $I''$ is processed. The dot would be moved in $E'$ when processing $I''$ also, by $E'_P$, because $B$ is nullable.

$E(S_i) \subseteq E'(S_i)$ by contradiction, since no $I$ can be chosen which is in $E(S_i)$ but not in $E'(S_i)$. □

Lemma 5. $E(S_i) \supseteq E'(S_i)$.

Proof. We take the same tack as before: posit the existence of an Earley item $I \in E'(S_i)$ which is not in $E(S_i)$. We have the same three cases when examining the position of $I$'s dot; the first two cases are identical, save for exchanging $E$ and $E'$ in the text, and they have been elided.

Case 3. The dot is not at the beginning, and there is a nonterminal symbol to the left of the dot. $I$ must be $[A \to \cdots B \bullet \cdots, k]$, and was added to $E'(S_i)$ one of two ways. Looking at the nullability of $B$:

Case 3a. $B$ is not nullable. $I$ is thus added by $E'_C$ processing $I' = [B \to \alpha \bullet, j] \in E'(S_i)$. As $B$ is not nullable, $\alpha$ cannot derive $\epsilon$ and therefore at least one terminal symbol is consumed: $j \ne i$. If $I' \in E(S_i)$, then $I$ must be in $E(S_i)$ because $E_C$ and $E'_C$ work identically on $S_j$.

Case 3b. $B$ is nullable. One possibility is that $I$ may be added by $E'_C$ processing $I'$. If $j \ne i$, then this degenerates to Case 3a above. If $j = i$ instead, then $I'' = [A \to \cdots \bullet B \cdots, k] \in E'(S_i)$; this would also be the situation if $I$ had been added the alternate way, by $E'_P$. If $I''$ were in $E(S_i)$, $E_P$ would add $[B \to \bullet \alpha, i]$ to $E(S_i)$, among others. $I'$ must inevitably be added to $E(S_i)$ too (if not, then Earley's algorithm would have failed by omitting the derivation $B \Rightarrow^* \epsilon$), processing of which would cause $I$ to be added to $E(S_i)$.

As before, no $I$ can be chosen which is in $E'(S_i)$ but not in $E(S_i)$, proving $E(S_i) \supseteq E'(S_i)$ by contradiction. □

Theorem 3. $E(S_i) = E'(S_i)$.

Proof. This is a direct result of Lemmas 4 and 5. Our solution thus produces the same result as Earley's. □

We note that this proof holds even if the grammar contains "useless" nonterminal symbols [46] that cannot derive a string of terminal symbols. Useless nonterminals should be considered not nullable, as they can never derive ε, and are therefore not a party to any issues Earley's parsing algorithm has with ε-rules.

7.4 Precomputation and Representation

The Earley items added by our modified Predictor can be precomputed. Moreover, this precomputation can take place in such a way as to produce a new variant of LR(0) DFA which is very well suited to Earley parsing. Figure 7.3 shows the LR(0) automaton for the grammar in Figure 7.1. Recall that in Section 4.1.5, we gave a way to split the LR(0) states to create our split LR(0) DFA, a convenient structuring of the state machine for use in an Earley parser.

Figure 7.3: LR(0) automaton for the grammar in Figure 7.1.

We observe that some DFA states must always appear together in an Earley set. This idea is captured in the theorem below; we present the theorem in terms of the LR(0) DFA rather than our split DFA for the time being. The function $GOTO(L, A)$ returns the LR(0) state reached if a transition on $A$ is made from state $L$.

Theorem 4. If an LR(0) item $I = [A \to \bullet \alpha]$ is contained in LR(0) state $L \in S_i$ and $\alpha \Rightarrow^* \epsilon$, then $GOTO(L, A)$ must also be in $S_i$.

Proof. As part of an Earley item, $I$ must have the parent pointer $i$ because the dot is at the beginning of the item. By Lemma 3, the Earley item $[A \to \alpha \bullet, i]$ will be added to $S_i$. As a result of Completer processing this item (whose parent pointer is $i$, making Completer look "back" at the current set $S_i$), transitions will be attempted on $A$ for every LR(0) state in $S_i$. Thus $GOTO(L, A)$ must be added to $S_i$. □

We can treat Theorem 4 as the basis of a closure algorithm, combining LR(0) states that always appear together, iterating until no more state mergers are possible. For the LR(0) automaton in Figure 7.3, the resulting "LR(0) ε-DFA" states would be

    {0, 1, 2, 4, 5, 6, 7}    {1}    {4, 2, 5, 6, 7}    {2}
    {5, 2, 6, 7}             {3}    {6, 2, 7}          {7}

The LR(0) ε-DFA is drawn in Figure 7.4. Of course, the ε-DFA states can be split into kernel and nonkernel items as in Section 4.1.5, to form a split LR(0) ε-DFA.

Figure 7.4: LR(0) ε-DFA.

Pseudocode for an Earley parser which would use the LR(0) ε-DFA is given in Figure 7.5. This code assumes that Earley items in $S_i$ and $S_{i+1}$ are implemented as worklists, and that the items themselves are (state, parent) pairs. The goto function returns the state transitions made in the split ε-DFA, given a state and grammar symbol. There can be at most two of these transitions, one to a kernel state $k$ and one to a nonkernel state $nk$; the absence of a transition is denoted by the value $\Lambda$. The completed function supplies a list of grammar rules completed within a given state. Recall that an Earley item is only added to an Earley set if it is not already present.

    foreach (state, parent) in S_i:
        (k, nk) ← goto(state, a_{i+1})
        if k ≠ Λ: add (k, parent) to S_{i+1}
        if nk ≠ Λ: add (nk, i+1) to S_{i+1}

        if parent = i: continue

        foreach A → α in completed(state):
            foreach (pstate, pparent) in S_parent:
                (k, nk) ← goto(pstate, A)
                if k ≠ Λ: add (k, pparent) to S_i
                if nk ≠ Λ: add (nk, i) to S_i

Figure 7.5: Pseudocode for processing Earley set $S_i$ using an LR(0) ε-DFA.

Can the original problem with empty rules recur when using the ε-DFA in an Earley parser? Happily, it can't. Recall that the problem with which we began this chapter was the addition of an Earley item $[B \to \cdots \bullet A \cdots, k]$ to $S_i$ after Completer processed $[A \to \bullet, i]$: the dot never got moved over the $A$ even though $A \Rightarrow^* \epsilon$. With the ε-DFA, say that state $l$ contains the troublesome item $[B \to \cdots \bullet A \cdots]$. All items $[A \to \bullet \alpha]$ must be in $l$ too. If $A$ is nullable, then there has to be some value of $\alpha$ such that $A \Rightarrow \alpha \Rightarrow^* \epsilon$, and the dot must be moved over the $A$ in state $l$ by Theorem 4.

7.5 Summary

Implementations of Earley's parsing algorithm can easily handle ε-rules using the simple modification to Predictor outlined here. Precomputation yields a new variant of LR(0) state machine suitable for use in DEEP and other Earley parsers based on finite automata.

Chapter 8

Conclusion

The lines of inquiry we have pursued over the last few years first presented themselves in a serendipitous manner. SPARK was created to fill a need unrelated to our research; Earley's algorithm best matched our design constraints for SPARK; our experience using SPARK highlighted the problems with Earley's algorithm and brought a number of research problems to light. The contributions of our work are sevenfold:

1. We demonstrated how to construct a directly-executable Earley parser, the first of its kind. The resulting parsers run in time comparable to that of the much-less-general LALR(1) algorithm, even on non-toy grammars. This is particularly important as it suggests that direct execution may be viable for other types of general parsers.

2. For Earley parsers that use LR automata, we gathered data which indicates that prediction lookahead is not as valuable a strategy as our Earley set compression. This prunes unneeded Earley items so that the parser uses less space at run time.

3. A technique for handling context-dependent tokens, Schrodinger's tokens, was presented, along with applications of the technique. Some awkward real languages, like PL/I, can be handled with this technique. Schrodinger's tokens are not limited to Earley parsing, but can be used with any general parsing algorithm. Our experience using Schrodinger's tokens in SPARK has been very positive.

4. We specified conditions under which semantic actions can be executed in an Earley parser prior to the completion of recognition. While the practical value of this to the end user has not yet been ascertained, we showed that the related space savings could be quite remarkable — over 90% in some cases.

5. We discovered a simple, efficient way to handle the problem with empty rules in Earley parsers, and proved it correct. Our solution to the problem lets Earley sets be processed in one pass, with no dynamic modification of data structures, giving a faster and more robust implementation.

6. A novel, easy-to-use representation for Earley sets was given, for use with Earley parsers that employ LR(0) automata. This permits simpler implementation of automata-based Earley parsers.

7. We gave a way of transforming a standard LR(0) automaton to make it more useful in an Earley parser, which can be combined with the above-mentioned representation. Again, this produces a simpler, more efficient implementation of Earley's algorithm.

We think that much of this work can be expanded to other types of general parsing algorithms. Of particular interest is the possibility of constructing other types of directly-executable general parsers. Also, we would like to find other new automata that are tailored for general parsers, whose use simplifies the parsing algorithm.

When Earley wrote his Ph.D. over thirty years ago, computing conditions were very different than they are now. Unfortunately, Earley's algorithm seems to be synonymous with "slow" insofar as the programming language community is concerned. We hope that our work helps to dispel this notion, and that it stimulates some renewed interest in general parsing techniques. In retrospect, what we find interesting from a research point of view is that we were able to take a very general algorithm, address its shortcomings, and make it suitable for practical use. Given the abundance of computing power now available, the time may be ripe to re-examine old, powerful, general algorithms that have been left by the wayside.

References

[1] P. W. Abrahams. The CIMS PL/I compiler. In Proceedings of the SIGPLAN Symposium on Compiler Construction, pages 107-116, August 1979.

[2] A. V. Aho, S. C. Johnson, and J. D. Ullman. Deterministic parsing of ambiguous grammars. Communications of the ACM, 18(8):441-452, August 1975.

[3] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[4] A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation, and Compiling, Volume 1: Parsing. Prentice-Hall, 1972.

[5] M. A. Alonso, D. Cabrero, and M. Vilares. Construction of efficient generalized LR parsers. In Proceedings of the Second International Workshop on Implementing Automata, pages 131-140, 1997.

[6] A. W. Appel. Modern Compiler Implementation in Java. Cambridge University Press, 1998.

[7] J. Aycock. Compiling little languages in Python. In Proceedings of the 7th International Python Conference, pages 69-77, 1998.

[8] J. Aycock. Aggressive type inference. In Proceedings of the 8th International Python Conference, pages 11-20, 2000.

[9] J. Aycock and N. Horspool. Faster generalized LR parsing. In Proceedings of the 8th International Conference on Compiler Construction, pages 32-46, 1999.

[10] J. Aycock and N. Horspool. Directly-executable Earley parsing. In Proceedings of the 10th International Conference on Compiler Construction, pages 229-243, 2001.

[11] J. Aycock, N. Horspool, J. Janousek, and B. Melichar. Even faster generalized LR parsing. Acta Informatica, 2001. To appear.

[12] J. Aycock and R. N. Horspool. Schrodinger's token. Software — Practice and Experience, 31(8):803-814, 2001.

[13] D. T. Barnard and J. R. Cordy. SL parses the LR languages. Computer Languages, 13(2):65-74, 1988.

[14] F. L. Bauer and H. Wössner. The "Plankalkül" of Konrad Zuse: A forerunner of today's programming languages. Communications of the ACM, 15(7):678-685, July 1972.

[15] D. Beazley. Python Essential Reference. New Riders, 1999.

[16] J. R. Bell. Threaded code. Communications of the ACM, 16(6):370-372, June 1973.

[17] J. Bentley. Little languages. In More Programming Pearls, pages 83-100. Addison-Wesley, 1988.

[18] J. Bentley, September 2000. Personal communication.

[19] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext transfer protocol — HTTP/1.0, 1996. RFC 1945.

[20] A. Bhamidipaty and T. A. Proebsting. Very fast YACC-compatible parsers (for very little effort). Software — Practice and Experience, 28(2):181-190, February 1998.

[21] S. Billot and B. Lang. The structure of shared forests in ambiguous parsing. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 143-151, 1989.

[22] M. Bouckaert, A. Pirotte, and M. Snelling. Efficient parsing algorithms for general context-free parsers. Information Sciences, 8:1-26, 1975.

[23] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal. Pattern-Oriented Software Architecture: A System of Patterns. Wiley, 1996.

[24] G. Julius Caesar. Cæsar's Commentaries on the Gallic and Civil Wars. Harper & Brothers, 1855. Translation.

[25] N. Chomsky. On certain formal properties of grammars. Information and Control, 2:137-167, 1959.

[26] T. W. Christopher, P. J. Hatcher, and R. C. Kukuk. Using dynamic programming to generate optimized code in a Graham-Glanville style code generator. In Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, pages 25-36, 1984.

[27] C. Clark. Keywords: special identifier idioms. ACM SIGPLAN Notices, 34(12):18-23, 1999.

[28] P. F. Dubois. Cafe Dubois. Computing in Science & Engineering, 1(6):71, November/December 1999. Scientific Programming column.

[29] P. F. Dubois. Pyfort Reference Manual. Lawrence Livermore National Laboratory, sixth edition, 2000.

[30] J. Earley. An Efficient Context-Free Parsing Algorithm. PhD thesis, Carnegie-Mellon University, August 1968.

[31] J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94-102, February 1970.

[32] T. Ernst. The TRAP generic compiler prototyping system. Unpublished paper, 1999.

[33] T. Ernst. TRAPping Modelica with Python. In Proceedings of the 8th International Conference on Compiler Construction, pages 288-291, 1999. Tool demo.

[34] T. Ernst, January 2001. Personal communication.

[35] M. A. Ertl and D. Gregg. Hardware support for efficient interpreters: Fast indirect branches (draft), 2000.

[36] S. I. Feldman. Implementation of a portable Fortran 77 compiler using modern tools. In Proceedings of the SIGPLAN Symposium on Compiler Construction, pages 98-106, 1979.

[37] C. N. Fischer and R. J. LeBlanc, Jr. Crafting a Compiler. Benjamin/Cummings, 1988.

[38] G. A. Fisher, Jr. and M. Weber. LALR(1) parsing for languages without reserved words. ACM SIGPLAN Notices, 14(11):26-30, 1979.

[39] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering a simple, efficient code generator generator. In ACM Letters on Programming Languages and Systems, volume 1(3), pages 213-226, 1992.

[40] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns. Addison- Wesley, 1995.

[41] W. K. Giloi. Konrad Zuse's Plankalkül: The first high-level, "non von Neumann" programming language. IEEE Annals of the History of Computing, 19(2):17-24, 1997.

[42] R. S. Glanville and S. L. Graham. A new method for compiler code generation. In 5th Annual ACM Symposium on Principles of Programming Languages, pages 231-240, 1978.

[43] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. Addison- Wesley, 1996.

[44] S. L. Graham, M. A. Harrison, and W. L. Ruzzo. An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, 2(3):415-462, 1980.

[45] D. Grune, H. E. Bal, C. J. H. Jacobs, and K. G. Langendoen. Modern Compiler Design. Wiley, 2000.

[46] D. Grune and C. J. H. Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood, 1990.

[47] J. Heering, P. Klint, and J. Rekers. Incremental generation of parsers. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 179-191, 1989.

[48] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[49] R. N. Horspool, 2000. Personal communication.

[50] R. N. Horspool and M. Whitney. Even faster LR parsing. Software — Practice and Experience, 20(6):515-535, 1990.

[51] IEEE. IEEE Standard for Information Technology — Portable Operating System Interface (POSIX) — Part 2: Shell and Utilities, 1993. Volumes 1 and 2.

[52] International Telecommunication Union. Specification of Abstract Syntax Notation One (ASN.1), 1993.

[53] S. C. Johnson. YACC — yet another compiler compiler. UNIX Programmer’s Manual, 7th Edition, 2B, 1978.

[54] J. Kanze. Handling ambiguous tokens in LR-parsers. ACM SIGPLAN Notices, 24(6):49-54, 1989.

[55] P. Klint. Interpretation techniques. Software — Practice and Experience, 11:963-973, 1981.

[56] P. Klint and E. Visser. Using filters for the disambiguation of context-free grammars. In Proceedings of the ASMICS Workshop on Parsing Theory, pages 1-20, 1994.

[57] D. E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127-145, 1968. Errata in [58].

[58] D. E. Knuth. Semantics of context-free languages: Correction. Mathematical Systems Theory, 5(l):95-96, 1971.

[59] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, second edition, 1998.

[60] D. E. Knuth and L. T. Pardo. The early development of programming languages. In N. Metropolis, J. Howlett, and G.-C. Rota, editors, A History of Computing in the Twentieth Century, pages 197-273. Academic Press, 1980.

[61] R. Koppler. A systematic approach to fuzzy parsing. Software — Practice and Experience, 27(6):637-649, 1997.

[62] D. C. Kozen. Automata and Computability. Springer-Verlag, 1997.

[63] B. Lang. Deterministic techniques for efficient non-deterministic parsers. In J. Loeckx, editor, Automata, Languages, and Programming (LNCS #14), pages 255-269. Springer-Verlag, 1974.

[64] R. Leermakers. A recursive ascent Earley parser. Information Processing Letters, 41:87-91, 1992.

[65] R. Leermakers. Recursive ascent parsing: from Earley to Marcus. Theoretical Computer Science, 104:299-312, 1992.

[66] R. Leermakers. The Functional Treatment of Parsing. Kluwer Academic, 1993.

[67] M. E. Lesk and E. Schmidt. Lex — a lexical analyzer generator. UNIX Programmer's Manual, 7th Edition, 2B, 1975.

[68] J. R. Levine, T. Mason, and D. Brown. Lex & Yacc. O'Reilly & Associates, second edition, 1992.

[69] M. R. Levy. Web programming in Guide. Software — Practice and Experience, 28(11):1581-1603, 1998.

[70] H. R. Lewis and C. H. Papadimitriou. Elements of the Theory of Computation. Prentice-Hall, 1981.

[71] P. McLean and R. N. Horspool. A faster Earley parser. In Proceedings of the International Conference on Compiler Construction (CC ’96), pages 281-293, 1996.

[72] R. Milne. Lexical ambiguity resolution in a deterministic parser. In S. I. Small, G. W. Cottrell, and M. K. Tanenhaus, editors, Lexical Ambiguity Resolution, pages 45-71. Morgan Kaufmann, 1988.

[73] H. Mössenböck. Object-Oriented Programming in Oberon-2. Springer-Verlag, 1993.

[74] M. E. Nordberg III. Variations on the visitor pattern. In Collected Papers from the PLoP ’96 and EuroPLoP ’96 Conferences. Washington University, 1997. Technical Report WUCS-97-07.

[75] J. K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley, 1994.

[76] T. J. Pennello. Very fast LR parsing. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, volume 21(7) of ACM SIGPLAN Notices, pages 145-151, 1986.

[77] P. Pfahler. Optimizing directly executable LR parsers. In Compiler Compilers, Third International Workshop, CC ’90, pages 179-192. Springer-Verlag, 1990.

[78] J. Postel and J. Reynolds. File transfer protocol (FTP), 1985. RFC 959.

[79] J. B. Postel. Simple mail transfer protocol, 1982. RFC 821.

[80] E. S. Raymond. The CML2 language: Python implementation of a constraint-based interactive configurator. In Proceedings of the 9th International Python Conference, 2001. To appear.

[81] A. H. J. Sale. The classification of FORTRAN statements. Computer Journal, 14(1):10-12, 1971.

[82] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) parsing of programming languages. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 170-178, 1989.

[83] E. Schrödinger. Die gegenwärtige Situation in der Quantenmechanik. Die Naturwissenschaften, 23:807-812, 823-828, 844-849, 1935. Translation in Trimmer [90].

[84] F. W. Schröer. The ACCENT compiler compiler, introduction and reference. Technical Report 101, German National Research Center for Information Technology, June 2000.

[85] M. Shaw, January 2001. Personal communication.

[86] O. Shivers. A universal scripting framework. In Concurrency and Parallelism, Programming, Networking, and Security, pages 254-265. Springer-Verlag, 1996.

[87] B. Stroustrup. The C++ Programming Language. Addison-Wesley, third edition, 1997.

[88] J. Tarhio. LR parsing of some ambiguous grammars. Information Processing Letters, 14(3):101-103, 1982.

[89] M. Tomita. Efficient Parsing for Natural Languages. Kluwer Academic, 1986.

[90] J. D. Trimmer. The present situation in quantum mechanics: a translation of Schrödinger's "cat paradox" paper. Proceedings of the American Philosophical Society, 124(5):323-338, 1980.

[91] M. van den Brand, A. Sellink, and C. Verhoef. Current parsing technologies in software renovation considered harmful. In International Workshop on Program Comprehension, pages 108-117, 1998.

[92] A. van Deursen. Introducing ASF+SDF using the λ-calculus as example. In Executable Language Definitions. PhD thesis, University of Amsterdam, 1994.

[93] M. Vilares Ferro and B. A. Dion. Efficient incremental parsing for context-free languages. In Proceedings of the 5th IEEE International Conference on Computer Languages, pages 241-252, 1994.

[94] E. Visser. Scannerless generalized-LR parsing. Technical Report P9707, University of Amsterdam Programming Research Group, 1997.

[95] T. A. Wagner and S. L. Graham. Incremental analysis of real programming languages. In Proceedings of the 1997 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 31-43, June 1997.

[96] L. Wall, T. Christiansen, and R. L. Schwartz. Programming Perl. O'Reilly & Associates, second edition, 1996.

[97] F. W. Weingarten. Translation of Computer Languages. Holden-Day, 1973.

[98] R. L. White and P. Greenfield. Using Python to modernize astronomical software. In Proceedings of the 8th International Python Conference, pages 103-109, 2000.

[99] T. Winograd. Understanding Natural Language. Academic Press, 1972.

[100] N. Wirth. What can we do about the unnecessary diversity of notation for syntactic definitions? Communications of the ACM, 20(11):822-823, 1977.

[101] World Wide Web Consortium. XML Path Language (XPath), Version 1.0. W3C Recommendation (16 November 1999), J. Clark and S. DeRose, editors.

[102] D. B. Wortman. Compiling at 1000 MHz and beyond. In Systems Implementation 2000, pages 194-206. Chapman & Hall, 1998.

Appendix A

Sample SPARK Specification

In Chapter 2 we used a little arithmetic expression language as a running example. This appendix collects the code together for one implementation of that language.

import sys

filename = sys.argv[1]
f = open(filename)
evaluate(semantic(parse(scan(f))))
f.close()

class SimpleScanner(GenericScanner):
    def __init__(self):
        GenericScanner.__init__(self)

    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv

    def t_whitespace(self, s):
        r' \s+ '

    def t_op(self, s):
        r' \+ | \* '
        self.rv.append(Token(type=s))

    def t_number(self, s):
        r' \d+ '
        t = Token(type='number', attr=s)
        self.rv.append(t)

class FloatScanner(SimpleScanner):
    def __init__(self):
        SimpleScanner.__init__(self)

    def t_float(self, s):
        r' \d+ \. \d+ '
        t = Token(type='float', attr=s)
        self.rv.append(t)

def scan(f):
    input = f.read()
    scanner = FloatScanner()
    return scanner.tokenize(input)

class ExprParser(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_term(self, args):
        '''
            expr ::= expr + term
            term ::= term * factor
        '''
        return AST(type=args[1], left=args[0], right=args[2])

    def p_expr_term_2(self, args):
        '''
            expr ::= term
            term ::= factor
        '''
        return args[0]

    def p_factor(self, args):
        '''
            factor ::= number
            factor ::= float
        '''
        return AST(type=args[0])

def parse(tokens):
    parser = ExprParser()
    return parser.parse(tokens)

class TypeCheck(GenericASTTraversal):
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.postorder()

    def n_number(self, node):
        node.exprType = 'number'

    def n_float(self, node):
        node.exprType = 'float'

    def default(self, node):
        # this handles + and * nodes
        leftType = node.left.exprType
        rightType = node.right.exprType
        if leftType != rightType:
            raise 'Type error.'
        node.exprType = leftType

def semantic(ast):
    TypeCheck(ast)
    #
    # Any other GenericASTTraversal classes
    # for semantic checking would be
    # instantiated here...
    #
    return ast

return ast 136

class Interpreter(GenericASTMatcher):
    def __init__(self, ast):
        GenericASTMatcher.__init__(self, 'V', ast)
        self.match()
        print ast.value

    def p_number(self, node):
        ' V ::= number '
        node.value = int(node.attr)

    def p_float(self, node):
        ' V ::= float '
        node.value = float(node.attr)

    def p_add(self, node):
        ' V ::= + ( V V ) '
        node.value = node.left.value + node.right.value

    def p_multiply(self, node):
        ' V ::= * ( V V ) '
        node.value = node.left.value * node.right.value

def evaluate(ast):
    Interpreter(ast)

Selected Publications

J. Aycock, N. Horspool, J. Janousek, and B. Melichar. Even Faster Generalized LR Parsing. To appear in Acta Informatica (accepted February 2001).

J. Aycock and N. Horspool. Schrodinger's Token. Software — Practice and Experience 31, 8 (2001), pp. 803-814.

J. Aycock and N. Horspool. Directly-Executable Earley Parsing. CC 2001 — 10th International Conference on Compiler Construction (LNCS 2027), Springer-Verlag, 2001, pp. 229-243.

J. Aycock and N. Horspool. Simple Generation of Static Single-Assignment Form. CC 2000 — 9th International Conference on Compiler Construction (LNCS 1781), Springer-Verlag, 2000, pp. 110-124.

J. Aycock. Aggressive Type Inference. Proceedings of the 8th International Python Conference, 2000, pp. 11-20.

J. Aycock and M. Levy. An Architecture for Easy Web Page Updating. ACM Crossroads 6.2, 1999, pp. 15-18.

J. Aycock and N. Horspool. Faster Generalized LR Parsing. CC '99 — 8th International Conference on Compiler Construction (LNCS 1575), Springer-Verlag, 1999, pp. 32-46.

J. Aycock. Compiling Little Languages in Python. Proceedings of the 7th International Python Conference, 1998, pp. 69-77.

J. Aycock. Converting Python Virtual Machine Code to C. Proceedings of the 7th International Python Conference, 1998, pp. 43-50.