Thesis This Thesis Has Been Submitted to the Phd School of the Faculty of Science, University of Copenhagen, Denmark
Stream Processing Using Grammars and Regular Expressions

Ulrik Terp Rasmussen
DIKU, Department of Computer Science
University of Copenhagen, Denmark
September 25, 2016

PhD Thesis
This thesis has been submitted to the PhD School of the Faculty of Science, University of Copenhagen, Denmark

Abstract

In this dissertation we study regular expression based parsing and the use of grammatical specifications for the synthesis of fast, streaming string-processing programs.

In the first part we develop two linear-time algorithms for regular expression based parsing with Perl-style greedy disambiguation. The first algorithm operates in two passes in a semi-streaming fashion, using a constant amount of working memory and an auxiliary tape storage which is written in the first pass and consumed by the second. The second algorithm is a single-pass and optimally streaming algorithm which outputs as much of the parse tree as is semantically possible based on the input prefix read so far, and resorts to buffering as many symbols as are required to resolve the next choice. Optimality is obtained by performing a PSPACE-complete pre-analysis on the regular expression.

In the second part we present Kleenex, a language for expressing high-performance streaming string processing programs as regular grammars with embedded semantic actions, and its compilation to streaming string transducers with worst-case linear-time performance. Its underlying theory is based on transducer decomposition into oracle and action machines, and a finite-state specialization of the streaming parsing algorithm presented in the first part. In the second part we also develop a new linear-time streaming parsing algorithm for parsing expression grammars (PEG), which generalize the regular grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm reformulated using least fixed points and evaluated using an instance of the chaotic iteration scheme by Cousot and Cousot.
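To make the notion of greedy disambiguation concrete, the following is a minimal sketch and not one of the linear-time algorithms of this dissertation: it is a naive backtracking parser that can take exponential time in the worst case. The tuple encoding of regular expressions, the names `parses` and `greedy_parse`, and the bit convention (0 for the left alternative or one more iteration, 1 for the right alternative or leaving the loop) are assumptions chosen for illustration.

```python
# Illustrative sketch only: naive backtracking, NOT the thesis algorithms.
# Regular expressions are nested tuples (hypothetical encoding):
#   ('eps',), ('lit', c), ('seq', e1, e2), ('alt', e1, e2), ('star', e)

def parses(e, s, i):
    """Yield (end_index, bits) for parses of s[i:], in greedy priority order."""
    tag = e[0]
    if tag == 'eps':
        yield i, ''
    elif tag == 'lit':
        if i < len(s) and s[i] == e[1]:
            yield i + 1, ''
    elif tag == 'seq':
        for j, b1 in parses(e[1], s, i):
            for k, b2 in parses(e[2], s, j):
                yield k, b1 + b2
    elif tag == 'alt':
        # Greedy: try the left alternative (bit 0) before the right (bit 1).
        for j, b in parses(e[1], s, i):
            yield j, '0' + b
        for j, b in parses(e[2], s, i):
            yield j, '1' + b
    elif tag == 'star':
        # Greedy: prefer one more iteration (bit 0) over stopping (bit 1).
        for j, b1 in parses(e[1], s, i):
            if j > i:  # skip empty iterations so the recursion terminates
                for k, b2 in parses(e, s, j):
                    yield k, '0' + b1 + b2
        yield i, '1'

def greedy_parse(e, s):
    """Return the bit code of the first (greedy) parse of all of s, or None."""
    for j, bits in parses(e, s, 0):
        if j == len(s):
            return bits
    return None

# The RE (ab + a)(a + b)* in the tuple encoding above.
RE = ('seq',
      ('alt', ('seq', ('lit', 'a'), ('lit', 'b')), ('lit', 'a')),
      ('star', ('alt', ('lit', 'a'), ('lit', 'b'))))

print(greedy_parse(RE, "aba"))  # 0001
```

On the input `aba`, the string is ambiguous under this expression; the greedy strategy commits to the left alternative `ab` first and only falls back to `a` if that choice cannot be completed.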
Resumé

In this dissertation we study parsing with regular expressions and the use of grammatical specifications for the synthesis of fast, streaming programs for string processing.

In the first part we develop two algorithms for linear-time parsing with regular expressions, with greedy resolution of ambiguities in the style of Perl. The first algorithm consists of two passes that run in a semi-streaming fashion using constant-size working memory, together with an auxiliary tape storage that is written by the first pass and read by the second. The second algorithm consists of a single pass and is optimally streaming in the sense that it outputs as much of the parse tree as is semantically possible given the prefix of the input read so far. The algorithm falls back to buffering as many input symbols as are necessary to resolve the next choice. Optimality is obtained by means of a PSPACE-complete pre-analysis of the regular expression.

In the second part we present Kleenex, a language for expressing high-performance, streaming string-processing programs as regular grammars with embedded semantic actions, and its compilation to streaming string transducers with worst-case linear-time performance. The underlying theory is based on the decomposition of transducers into oracle and action machines, and on a specialization of the streaming parsing algorithm from the first part as a finite-state machine. In the second part we also develop a new linear-time, streaming parsing algorithm for parsing expression grammars (PEG), which generalize the regular grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm that is reformulated using least fixed points and computed by an instance of Cousot and Cousot's chaotic iteration.

Contents

Contents  iv
List of Figures  vi
Preface  ix
1 Introduction  1
2 Regular Expression Based Parsing  3
    2.1 Regular Expressions In Theory  4
    2.2 Regular Expressions In Practice  8
    2.3 Regular Expression Based Parsing  11
    2.4 Parsing as Transduction  14
    2.5 Recognition, Matching and Parsing Techniques  17
    2.6 Our Contributions  21
    2.7 Conclusions and Perspectives  23
3 Grammar Based Stream Processing  25
    3.1 Context-Free Grammars  26
    3.2 Syntax-Directed Translation Schemes  30
    3.3 Parsing Expression Grammars  32
    3.4 String Processing Methods  35
    3.5 Our Contributions  37
    3.6 Conclusions and Perspectives  38
Bibliography  41
Paper A Two-Pass Greedy Regular Expression Parsing  51
    A.1 Introduction  52
    A.2 Symmetric NFA Representation of Parse Trees  53
    A.3 Greedy parsing  56
    A.4 NFA-Simulation with Ordered State Sets  57
    A.5 Lean-log Algorithm  59
    A.6 Evaluation  61
    A.7 Related work  63
    Bibliography  67
Paper B Optimally Streaming Greedy Regular Expression Parsing  69
    B.1 Introduction  70
    B.2 Preliminaries  72
    B.3 Augmented Automata  73
    B.4 Disambiguation  75
    B.5 Optimal Streaming  76
    B.6 Coverage  77
    B.7 Algorithm  79
    B.8 Example  83
    B.9 Related and Future Work  87
    Bibliography  89
Paper C Kleenex: High-Performance Stream Processing  91
    C.1 Introduction  92
    C.2 Transducers  96
    C.3 Kleenex  99
    C.4 Streaming Simulation  103
    C.5 Determinization  108
    C.6 Implementation and Benchmarks  113
    C.7 Use Cases  119
    C.8 Discussion  122
    C.9 Conclusions  126
    Bibliography  129
Paper D PEG Parsing Using Tabling and Dynamic Analysis  137
    D.1 Introduction  138
    D.2 Parsing Formalism  141
    D.3 Tabulation of Operational Semantics  146
    D.4 Streaming Parsing with Tables  149
    D.5 Algorithm  154
    D.6 Evaluation  159
    D.7 Discussion  163
    D.8 Conclusion  165
    D.9 Proofs  166
    Bibliography  173

List of Figures

2.1 Terms with flattening aba for REs a(a + b)* and (ab + a)(a + b)*  14
2.2 A bit-coded parsing transducer for the RE (ab + a)(a + b)*  16
3.1 A context-free grammar.  27
3.2 A parse tree and the corresponding derivation.  28
3.3 Regular CFG and its parsing transducer.  29
3.4 Example SDT for reformulating simple English phrases.  31
3.5 Greedy parse of the string a man who was happy, using SDT from Figure 3.4.  31
3.6 Example of a simple PEG.  32
3.7 A PEG parse tree for the string (0+1)+46.  33
A.1 aNFA construction schema.  55
A.2 Comparisons using very simple iteration expressions.  64
A.3 Comparison using a backtracking worst case expression, and its reversal.  65
A.4 Comparison using various e-mail expressions.  66
B.1 Example automaton for the RE (a + b)?b  74
B.2 Example of streaming algorithm on RE (aaa + aa)?.  85
B.3 Example of streaming algorithm on (aa)?(za + zb) + a?z(a + b).  86
C.1 Kleenex program with transducer, oracle and action machines.  98
C.2 Path tree example.  106
C.3 SST constructed from the oracle machine in Figure C.1.  112
C.4 flip_ab run on lines with average length 1000.  116
C.5 patho2 run on lines with average length 1000.  117
C.6 Inserting separators in random numbers of average length 1000.  118
C.7 Throughput when parsing 250 MiB random IRC data.  119
C.8 Benchmark for program csv_project3.  120
C.9 JSON to SQL benchmark.  121
C.10 Apache Log to JSON benchmark.  122
C.11 ISO time stamps to JSON benchmark.  123
D.1 Example of prefix tables and online expansion.  153
D.2 Parsing algorithm.  156

Preface

This dissertation has been submitted to the PhD School of Science, Faculty of Science, University of Copenhagen, in partial fulfillment of the degree of PhD at the Department of Computer Science (DIKU).

The dissertation is written as a synopsis of four enclosed research papers, including three peer-reviewed conference papers and one as yet unpublished manuscript. Chapter 1 presents a brief introduction to the two topics of this dissertation. Chapters 2 and 3 each give a more comprehensive overview of the respective topic, including an outline of the area of research, the main problems to be solved, and my contribution in relation to existing work in the literature. Each chapter concludes with a brief outline of the perspectives for future work.

I could not have written this dissertation alone, so at this point I would like to take the opportunity to thank the people who have helped me along the way. First of all, the material presented here is the result of close collaboration with my coauthors, to whom I would like to express my sincere gratitude. To Fritz Henglein, my supervisor, thank you for giving me both enormous freedom in my research and expert guidance when needed. Your passion and never-ending spirit has been