Kleene Meets Church
FACULTY OF SCIENCE
UNIVERSITY OF COPENHAGEN

Kleene Meets Church
Ordered Finite Action Transducers for High-Performance Stream Processing

Kristoffer Aalund Søholm
Sebastian Paaske Tørholm

July 17, 2015

Abstract

Efficient methods for regular string matching are well known, while the problem of regular string transduction (rewriting) is less explored. We introduce ordered finite action transducers (OFAT), building on the theory of streaming string transducers (SST) to produce an efficient two-phase transducer with good streaming behavior that allows for execution of arbitrarily complex actions along the parsed path. We describe Kleenex, a programming language for expressing string transductions, and introduce a formalization to an OFAT with a limited subset of actions for which we can prove a worst-case linear transduction time for fixed-size automata. We also describe repg, an implementation of the Kleenex language, and its compilation process. In use cases we achieve good performance characteristics compared both to similar tools such as DReX and Ragel and to related tools such as RE2, PCRE, and other regular expression libraries.

Thesis supervisor: Fritz Henglein

Acknowledgements

We would like to thank our thesis supervisor Fritz Henglein, as well as Niels Bjørn Bugge Grathwohl and Ulrik Rasmussen, Ph.D. students associated with the KMC project, for their excellent supervision and support during the entire project. We would also like to thank Mathias Bundgaard Svensson, Rasmus Wriedt Larsen, and René Løwe Jacobsen for their help with proofreading our thesis.

Contents

Contents
List of Figures
List of Tables
List of Theorems
1 Introduction
2 Preliminaries
    2.1 Regular expressions
    2.2 Finite automata
    2.3 Ordered finite transducers
    2.4 Path trees
    2.5 Streaming string transducers
    2.6 Symbolic automata
    2.7 Action automata
3 The Kleenex Language
    3.1 Overview
    3.2 Core language
    3.3 The Kleenex language
    3.4 Expressivity of Kleenex
    3.5 Time complexity bounds
4 Implementation
    4.1 Compilation
    4.2 Runtime
    4.3 Pipeline implementation
    4.4 Action implementation
    4.5 Optimizations
    4.6 Correctness
5 Evaluation
    5.1 Benchmarks
    5.2 Use cases
    5.3 Comparison
6 Conclusion
    6.1 Related work
    6.2 Future work
    6.3 Closing remarks
7 References
A Appendix

List of Figures

1.1 Example of an INI-file
1.2 An example of a simple INI parser written in Kleenex
1.3 An example of a more complex INI transformation written in Kleenex
2.1 Example of an NFST
2.2 Example of an NFST whose transduction is not single-valued
2.3 Example of an OFT
2.4 The path tree resulting from simulating the OFT in Figure 2.3 on the input aaaa
2.5 The path tree from Figure 2.4, with valuations annotated on each node
2.6 The update that takes place on Figure 2.4 if another a is consumed
2.7 Continuing from Figure 2.6, we consume two more as
2.8 Demonstration of removal of deterministic states from a path tree
2.9 Example of an OFT to SST conversion
2.10 Example of an SST that performs a swapping operation
2.11 Example of an OFAT that replicates the functionality of Figure 2.10
2.12 Example of an oracle FST and action automaton
3.1 A Kleenex program, and its desugared version
3.2 The desugaring function for converting Kleenex actions into core actions
3.3 The desugaring function for converting Kleenex terms into core terms
3.4 The desugaring function for converting regular expressions into Kleenex terms
3.5 The desugaring function for desugaring suppressed core terms into core terms
3.6 The oracle and action automata corresponding to the grammar from Figure 3.1
4.1 Kleenex compilation flowchart
4.2 The data type representing the Kleenex AST
4.3 Data type for recursive µ-terms
4.4 Recursive conversion of µ-terms to an OFT
4.5 Visualizations of odd_even from Figure 3.1
4.6 Example code generated by repg
4.7 Diagram of the Kleenex runtime
4.8 Common pattern of Kleenex programs
5.1 flip_ab run on a file with an average line length of 1000 characters
5.2 patho2 run on a file with an average line length of 1000 characters
5.3 The program text for the rot13 program
5.4 Benchmark of rot13
5.5 Benchmark of thousand_sep
5.6 The program text for the csv_project3 program
5.7 Benchmarks for csv_project3
5.8 The program text for the iso_datetime_to_json program
5.9 Throughput for the program iso_datetime_to_json
5.10 csv_project3 with and without suppressed bitcode optimization
5.11 Bitcode vs. bytecode implementation of csv_project3
5.12 Bitcode vs. bytecode implementation of rot13
5.13 Bitcode vs. bytecode implementation of thousand_sep
5.14 Program text for the program ini2json
5.15 Throughput of the program ini2json
5.16 Man-in-the-middle attack on HTML pages with forms
5.17 The apache_log program, which parses and annotates Apache log files
5.18 Throughput comparison on the apache_log program
5.19 Lexer for the instructional programming language PL/0
5.20 Syntax highlighter for Kleenex
5.21 The program irc, which parses the IRC protocol
5.22 Throughput of the program irc
5.23 Kleenex versions of the programs presented in the DReX paper from POPL 2015 [1]
5.24 Throughput of the program drex_swap-bibtex on 2 MB of input data
5.25 Execution time of the program drex_swap-bibtex on 2 MB of input data
5.26 The Ragel implementation of flip_ab
A.1 The intermediate language definition
A.2 The code used for syntax highlighting all Kleenex code in this paper

List of Tables

2.1 The operations on regular expressions and their standard interpretation on sets of strings. Here E^n means E concatenated with itself n times.
2.2 The extended regular expression syntactical elements, and their translation, T, to basic regular expressions
5.1 Version numbers of tools used for benchmarking

List of Theorems

1 Definition (Regular expression [3])
2 Definition (Nondeterministic finite automaton)
3 Definition (Deterministic finite automaton [3])
4 Definition (Nondeterministic finite state transducer)
5 Definition (Ordered finite transducer)
6 Definition (Functional semantics of OFTs)
7 Definition (Path tree)
1 Theorem
1 Corollary
8 Definition (Streaming String Transducer [10])
9 Definition (Functional semantics of SSTs)
10 Definition (OFT to SST determinization)
2 Theorem
11 Definition (Ordered finite action transducer)
12 Definition (Functional semantics of OFATs)
13 Definition (Kleenex computational environment)
14 Definition (Action)
15 Definition (Core action)
16 Definition (Kleenex core terms and syntax)
17 Definition (Well-formedness of Kleenex programs)
18 Definition (Grammar)
19 Definition (Right-regular grammar with actions)
3 Theorem
20 Definition (Kleenex core semantics)
21 Definition (Kleenex actions)
22 Definition (Kleenex terms)
23 Definition (Kleenex desugaring)
4 Theorem
24 Definition (Kleenex semantics)
5 Theorem
1 Conjecture
6 Theorem
7 Theorem

Chapter 1

Introduction

Regular expressions are widely used for practical programming: they provide a convenient and succinct way of extracting data from a wide variety of structured (or even less structured) text. Methods for compiling regular expressions down to efficient deterministic finite automata are well known, and theoretically