Programming Using Automata and Transducers

University of Pennsylvania ScholarlyCommons
Publicly Accessible Penn Dissertations, 2015

Loris D'Antoni, University of Pennsylvania

Recommended Citation
D'Antoni, Loris, "Programming Using Automata and Transducers" (2015). Publicly Accessible Penn Dissertations. 1677. https://repository.upenn.edu/edissertations/1677

Abstract

Automata, the simplest model of computation, have proven to be an effective tool in reasoning about programs that operate over strings. Transducers augment automata to produce outputs and have been used to model string and tree transformations such as natural language translations. The success of these models is primarily due to their closure properties and decidable procedures, but good properties come at the price of limited expressiveness. Concretely, most models only support finite alphabets and can only represent small classes of languages and transformations.

We focus on addressing these limitations and bridge the gap between the theory of automata and transducers and complex real-world applications: Can we extend automata and transducer models to operate over structured and infinite alphabets? Can we design languages that hide the complexity of these formalisms? Can we define executable models that can process the input efficiently?

First, we introduce succinct models of transducers that can operate over large alphabets and design BEX, a language for analysing string coders. We use BEX to prove the correctness of UTF and BASE64 encoders and decoders. Next, we develop a theory of tree transducers over infinite alphabets and design FAST, a language for analysing tree-manipulating programs. We use FAST to detect vulnerabilities in HTML sanitizers, check whether augmented reality taggers conflict, and optimize and analyze functional programs that operate over lists and trees. Finally, we focus on laying the foundations of stream processing of hierarchical data such as XML files and program traces. We introduce two new efficient and executable models that can process the input in a left-to-right linear pass: symbolic visibly pushdown automata and streaming tree transducers. Symbolic visibly pushdown automata are closed under Boolean operations and can specify and efficiently monitor complex properties for hierarchical structures over infinite alphabets. Streaming tree transducers can express and efficiently process complex XML transformations while enjoying decidable procedures.

Degree Type: Dissertation
Degree Name: Doctor of Philosophy (PhD)
Graduate Group: Computer and Information Science
First Advisor: Rajeev Alur
Keywords: Automata, Programming languages, String transformations, Symbolic automata, Transducers, Tree transformations
Subject Categories: Computer Sciences

PROGRAMMING USING AUTOMATA AND TRANSDUCERS
Loris D'Antoni
A DISSERTATION in Computer Science
Presented to the Faculties of the University of Pennsylvania in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy
2015

Supervisor of Dissertation: Rajeev Alur, Zisman Family Professor of Computer and Information Science
Graduate Group Chairperson: Lyle H. Ungar, Professor of Computer and Information Science

Dissertation Committee
Chaired by Sampath Kannan, Henry Salvatori Professor of Computer and Information Science
Benjamin C. Pierce, Henry Salvatori Professor of Computer and Information Science
Val Tannen, Professor of Computer and Information Science
Margus Veanes (External), Senior Researcher at Microsoft Research

To Christian, Marinella, Monica, and Stefano

Acknowledgements

The material described in this dissertation is the culmination of an extended collaboration with Rajeev Alur and Margus Veanes, with additional contributions from Benjamin Livshits and David Molnar. I am lucky to count these talented individuals as colleagues and friends, and I want to acknowledge their contributions to this dissertation.

Throughout my time at Penn, my adviser Rajeev Alur has granted me freedom to pursue the projects I believed in and great support when I needed to brainstorm, work on a proof, or simply decide on what to do next. Thank you for encouraging, nurturing, and providing help along this academic journey. To NSF, thank you for the Expeditions in Computing grant CCF 1138996, which funded most of my research.

Sampath Kannan, Benjamin Pierce, Val Tannen, and Margus Veanes have all graciously agreed to serve on my dissertation committee. I am grateful for the time they devoted to reading both my proposal and this dissertation. Their many insightful comments and suggestions greatly improved it.

Mariangiola Dezani-Ciancaglini introduced me to research when I was an undergraduate nearly 8 years ago and was the first person to truly believe in my capabilities, for which I am eternally grateful. Margus Veanes mentored me for two productive internships at Microsoft Research Redmond during the summers of 2012 and 2013. Since then, Margus has been a great colleague, friend, and research collaborator.

To the formal methods, programming languages, and verification students at Penn, thank you for creating a friendly and stimulating research environment. To the Graduate Student in Engineering Group, thank you for all the social events and pleasant distractions from work.
To Adam, thank you for being an honest and supportive friend. To my family, especially my parents Marinella and Stefano, thank you for supporting me and letting me pursue what I believed in. To my brother Christian, thank you for helping me push my potential to the limit. To Monica, thank you for being with me and making my life complete.

Contents

Acknowledgements
Abstract
List of tables
List of figures

1 Introduction
  1.1 Preamble
  1.2 Automata, languages, and program properties
  1.3 Transducers, transformations, and programs
  1.4 Limitations of existing models
  1.5 Contributions
    1.5.1 Foundational results
    1.5.2 Language design and implementation
    1.5.3 Applications
  1.6 Acknowledgements

I Transducer-based programming languages

2 BEX: a language for verifying string coders
  2.1 Introduction
    2.1.1 Challenges in verifying real-world coders
    2.1.2 Contributions
  2.2 Verifying BASE64 in BEX
  2.3 Symbolic extended finite automata and transducers
    2.3.1 Preliminaries
    2.3.2 Model definitions
    2.3.3 From BEX to S-EFTs
    2.3.4 Cartesian S-EFAs and S-EFTs
    2.3.5 Monadic S-EFAs and S-EFTs
  2.4 Properties of symbolic extended finite automata
  2.5 Equivalence of symbolic extended finite transducers
    2.5.1 Equivalence of S-EFTs is undecidable
    2.5.2 Equivalence of Cartesian S-EFTs is decidable
  2.6 Composition of symbolic extended finite transducers
    2.6.1 S-EFTs are not closed under composition
    2.6.2 A practical algorithm for composing S-EFTs
  2.7 Experiments and applications
