Separating Key Concerns in Query Processing: Set Orientation, Physical Data Independence, and Parallelism

Dissertation approved by the Department of Computer Science of the Technische Universität Kaiserslautern for conferral of the academic degree Doktor der Ingenieurwissenschaften (Dr.-Ing.), submitted by Dipl.-Inf. Sebastian Bächle

Dean of the Department of Computer Science: Prof. Dr. Arnd Poetzsch-Heffter

Examination committee: Chair: Jun.-Prof. Dr. rer. nat. Roland Meyer; Reviewers: Prof. Dr.-Ing. Dr. h. c. Theo Härder, Prof. Dr.-Ing. Wolfgang Lehner

Date of the scientific defense: December 14, 2012

Acknowledgements

After crossing a finish line, one should not miss the opportunity to look back and revive the memory of all the good things that happened along the way. During my time in Kaiserslautern, I had the privilege to work with many smart, creative, and valuable people, who supported my work and enriched my life in manifold ways.

First and foremost, I want to express my deepest gratitude to my advisor Prof. Theo Härder. His offer to join the DBIS research group opened the door for me to an experience of personal and professional development from which I will benefit for the rest of my life. He gave me the freedom to pursue my own research, and I could always count on his support and learn from his great experience. At this point, I also want to thank Prof. Wolfgang Lehner and Jun.-Prof. Roland Meyer for accepting the roles of second examiner and head of the examination board, respectively. Their feedback and interest in my work before, during, and after my defense have been a great compliment to me.

Due to the great cooperative culture and personal atmosphere in Kaiserslautern, I can look back at a long list of colleagues and friends from whom I have learned and who shared their valuable thoughts with me. First, I want to thank the whole team of the XTC project, especially Christian Matthias, Andreas Weiner, and Ou Yi. Starting my research work in collaboration with you was a great experience. My special thanks go to my friend Karsten Schmidt for his consistent support in all areas of professional and personal life. His thoughts and advice greatly helped me to improve my work. It is my pleasure to thank further members of the "old" and the "new" Teeecke generation, namely Andreas Bühmann, Jürgen Göres, Volker Höfner, Thomas Jörg, Daniel Schall, and Boris Stumm, for the many insightful and entertaining discussions about research and real life. Furthermore, I appreciate the advice and experience that Prof. Stefan Deßloch shared with me.


My former students, the "Brackiteers" Max Bechtold, Martin Hiller, Roxana Gomez, Caetano Sauer, and Henrique Valer, passionately contributed improvements and extensions to Brackit, the prototype implementation used in this thesis. I had a great time with all of them.

Outside the university, the strong and unconditional support of my parents Magda and Walter and my brother Matthias helped me to keep my balance during the hard times of this journey. I am blessed to have a family who always gives me the opportunity and the confidence to go my own way.

Finally, I owe my deepest gratitude to my beloved wife Susanne. She always believed in me, and her patience and encouragement gave me the strength to stay focused during all the ups and downs. Knowing such a wonderful person at my side is invaluable.

Sebastian Bächle

Homburg, February 2013

Abstract

Declarative query languages are the most convenient and most productive abstraction for interacting with complex data management systems. While the developer can focus on the application logic, the compiler takes care of translating and optimizing a query for efficient execution. Today, applications increasingly call for declarative data management for many novel storage designs and system architectures. The realization of a query processing system for every new kind of storage, language, or data model is a complex and time-consuming task. It requires considerable effort to design, implement, test, and optimize a compiler which utilizes the system optimally. A large part of this work is devoted to porting and adapting proven algorithms and optimizations from existing solutions.

This thesis studies the design of a compiler and runtime infrastructure for consolidating these development efforts. It aims at a decoupled organization of the main concerns of every query processing system:

Set orientation is a key concept for efficiently processing large amounts of data. We develop an intermediate representation for compiling queries and scripts with arbitrary nestings. It is based on the idea of composing higher-order functions into relational-style processing pipelines, which allow us to apply common set-oriented optimizations independently of the concrete data model used.

Physical data independence is mandatory for building a portable compiler and runtime. Our approach generally abstracts from physical aspects to cover a wide range of structured and semi-structured data models. For efficiency, we present compilation techniques for tailoring and optimizing a query for a concrete platform.

Parallelism is crucial for exploiting modern hardware architectures. We present a novel push-based operator model, which uses divide-and-conquer and self-scheduling techniques for creating and controlling parallelism dynamically at runtime.

Contents

Acknowledgements

1. Introduction
   1.1. Language-supported Data Processing
   1.2. Motivation
   1.3. Contributions
   1.4. Limitations
   1.5. Outline

2. Anatomy of a Data Programming Language
   2.1. Data Model
      2.1.1. Values
      2.1.2. Types and Schema
   2.2. Basic Language Features
      2.2.1. Composition and Decomposition of Data
      2.2.2. Core Operations
      2.2.3. Function Calls
   2.3. Bulk Processing
      2.3.1. Transformation and Filtering
      2.3.2. Sorting, Grouping, and Aggregation
      2.3.3. Joining and Combining
      2.3.4. Composition of Operators
   2.4. Data Manipulation
      2.4.1. Update Queries
      2.4.2. Immutability
   2.5. Runtime Aspects
      2.5.1. Evaluation Model
      2.5.2. Side Effects
      2.5.3. Error Handling

3. Extended XQuery
   3.1. Data Model
      3.1.1. Items and Sequences
      3.1.2. Properties and Accessors
      3.1.3. Types
      3.1.4. Additional Concepts
   3.2. Expressions
      3.2.1. FLWOR Expressions
      3.2.2. Filter Expressions
      3.2.3. Path Expressions
      3.2.4. Quantified Expressions
   3.3. Evaluation Context
      3.3.1. Static Context
      3.3.2. Dynamic Context
   3.4. Scripting

4. Hierarchical Query Plan Representation
   4.1. Requirements
   4.2. Query Representation
      4.2.1. Comprehensions
      4.2.2. AST-based Query Representation
      4.2.3. FLWOR Pipelines
      4.2.4. Runtime View
   4.3. Compiler Architecture
      4.3.1. Compilation Pipeline
      4.3.2. Plan Generation

5. Pipeline Optimization
   5.1. Generalized Bind Operator
   5.2. Join Processing
      5.2.1. Join Recognition
      5.2.2. Pipeline Reshaping
      5.2.3. Join Groups
   5.3. Pipeline Lifting
      5.3.1. 4-way Left Join
      5.3.2. Lifting Nested Joins
   5.4. Join Trees
   5.5. Aggregation

6. Data Access Optimization
   6.1. Generic Data Access
   6.2. Storage-specific Data Access
      6.2.1. Native Operations
      6.2.2. Eager Value Coercion
      6.2.3. Path Processing
   6.3. Bulk Processing
      6.3.1. Twig Patterns
      6.3.2. Multi-bind Operator
      6.3.3. Indexes

7. Parallel Operator Model
   7.1. Speedup vs. Scaleup
   7.2. Parallel Nested Loops
      7.2.1. Data Partitioning
      7.2.2. Task Scheduling
   7.3. Operator Sinks
      7.3.1. Parallel Data Flow Graphs
      7.3.2. Fan-out Sinks
      7.3.3. Fan-in Sinks
      7.3.4. Join Sink
   7.4. Performance Considerations
      7.4.1. Partitioning
      7.4.2. Buffer Memory
      7.4.3. Process Management

8. Evaluation
   8.1. Experimental Setup
   8.2. Main-memory Processing
      8.2.1. Workload
      8.2.2. Pipeline Optimization
      8.2.3. Competitors
   8.3. XML Processing
      8.3.1. Access Optimization
      8.3.2. Scalability
      8.3.3. Competitors
   8.4. Relational Data
      8.4.1. Workload
      8.4.2. Data Access Optimization
      8.4.3. Comparison with RDBMS
   8.5. Parallel Processing
      8.5.1. Workload
      8.5.2. Filter and Transform Query
      8.5.3. Group and Aggregate Query
      8.5.4. Join Query
      8.5.5. Scalability
      8.5.6. XMark Benchmark
   8.6. Evaluation Summary

9. Related Work
   9.1. A Short History of Query Languages
   9.2. Related Languages and Data Models
      9.2.1. Lorel
      9.2.2. UnQL
      9.2.3. TQL
      9.2.4. Object Query Language (ODMG)
      9.2.5. Rule-based Object Query Language
      9.2.6. SQL:1999 and SQL:2003
   9.3. Data Processing Languages
      9.3.1. JSON and …
      9.3.2. Pig Latin
      9.3.3. LINQ
      9.3.4. Database-backed Programming Languages
   9.4. Extensible Data Processing Platforms
      9.4.1. Compiler Infrastructures
      9.4.2. Database Languages
   9.5. XQuery Compiler
      9.5.1. Iterative Compilers
      9.5.2. Set-oriented Compilers

10. Summary and Future Work
   10.1. Conclusions
   10.2. Outlook and Future Work

A. Translation of Operator Pipelines

B. Suspend in Chained Sinks

C. Benchmark Queries
   C.1. XMark
   C.2. TPC-H
   C.3. TPoX

Bibliography

1. Introduction

The term data management has classically centered on businesses and their information needs. In this setting, most data is homogeneously structured, and data management is aligned with standardized processes, usually in the form of sequences of short and well-defined business transactions. With data volumes and workloads that grew fast but still at manageable rates, relational databases long formed the perfect foundation for almost any kind of data-intensive application.

The internet changed this situation substantially. The scope of data management was and is being widened, reinterpreted, and redefined by the rise of new applications with new types of data and workloads. Accompanied by tremendously increasing data volumes, new hardware, and mandatory connectivity and availability, this acts like a centrifugal force upon the industry. It has led to the emergence of a whole zoo of new system designs, languages, and APIs, each optimized to store and process data for certain types of applications. The days of the ubiquitous relational database management system seem to be over. In fact, relational systems are falling out of favor and diminishing to a solution for their own niche of well-structured business data.

For developers of data management systems, this brave new world quickly turns out to be a curse rather than a blessing. Proven query processing concepts must be ported to every new platform, but without SQL and the relational model as given building blocks, they lack two of the most successful abstractions for reducing the complexity of managing large amounts of data. Furthermore, increasing numbers of paradigms, patterns, and APIs must be coordinated in the application code. Both academia and industry are therefore challenged to contain the growing complexity and consolidate the achievements of decades.


1.1. Language-supported Data Processing

Database management systems (DBMS) play a key role in the ecosystem of data-intensive applications. However, in the face of the great diversity of technologies today, applications increasingly have to be built around highly specialized systems without knowing whether or not they will still be appropriate as the application evolves. A second, widespread limitation of current DBMSs is their focus on plain query processing. They are not designed as a platform for running multi-step data processes. Extension mechanisms like user-defined functions/procedures (UDF/UDP), ECA engines (Event, Condition, Action), and external library routines help, but in most cases developers will write separate programs and scripts to implement application logic and data analysis jobs, or just to provide the glue code for simple query and update sequences. This is not only cumbersome for ad-hoc tasks, it may also be less efficient if data must be transferred between the program and the DBMS.

In practice, both limitations require system architects and developers to carefully consider the properties and capabilities of each subsystem very early in the development process to choose the right mix of technologies. An appealing approach in the current situation is therefore to look at data processing from the opposite perspective, i.e., no longer bottom-up from the point of view of a particular data store, but top-down from the point of view of the application. Application design is then primarily driven by the application's logical view on data and not by technical aspects of a certain platform. Ideally, one does not need to decide how to map application data to a specific storage model until all application logic is defined.

The growing number of available data formats and systems calls for a suitable abstraction to implement data processing tasks without having to cope with proprietary or low-level aspects of a specific data management system. A unifying platform providing the ease and expressiveness of declarative query processing and naturally supporting the flexibility and scalability needs of modern data processing through proven database technology seems to be the new gold standard. The benefit of such an abstraction would be huge. Among others, it increases productivity, simplifies testing and tool development, and reduces the coupling between applications and backend systems.


From an abstract point of view, a language for processing both structured and semi-structured data needs to fulfill some key requirements. First, it needs a set of primitives to compose and decompose data, i.e., to construct and access data of different granules and shapes. Second, it must conveniently support a common set of processing tasks present in almost any data-intensive application. The quadriga of operations is filter, transform, combine, and aggregate, which implies support for basic features like arithmetics, Boolean logic, comparisons, etc. Third, all language constructs must be composable for implementing data-intensive multi-step processes and, of course, the language must support extensions to make custom functionality available. In essence, this resembles a marriage of basic programming and query processing.

Aside from the difficulties of designing a language for data programming, its success is clearly tied to the ability to translate it into meaningful operations on the underlying data, which utilize the data management system and its processing capabilities effectively. In other words, as the burden of dealing with system-specific and thereby also performance-critical aspects is moved away from the developer to the compiler, effective compilation and optimization techniques for the particular target platform are needed. However, the greater the diversity of target platforms, the fewer common features can be abstracted and exploited by a compiler. Aiming solely for the least common denominator of all systems is certainly the wrong approach, too. This work strives for a pragmatic solution for a reasonably large and practically relevant set of target platforms.

1.2. Motivation

The literature knows dozens of categorization schemes for programming languages. A language can be declarative or imperative, follow a functional philosophy or a procedural one, imply a structured or object-oriented style, etc. Real-world languages often incorporate traits of several paradigms at the same time, which makes it impossible to give a well-defined taxonomy. The actual difference between languages is best explained by the level of abstraction they provide.


At this point, it may seem difficult to characterize a certain class of languages as data programming languages, because any program, independent of the language it is written in, consumes, processes, and produces some form of data. In this thesis, the term denotes languages which provide a level of abstraction according to the following definition:

"A data programming language is a declarative query and scripting language for processing and manipulating structured and semi-structured data in various formats and in a broad spectrum of data processing systems."

The role of XQuery in this thesis should be understood only as that of a widely-known data programming language, which provides tailored syntax, data abstractions, type concepts, and auxiliary functionality for querying and processing XML data. In that sense, XQuery serves us as a concrete front-end language, which embodies the general concepts of a data programming language. In the following, we will contour the intention of a data programming language and substantiate the central points of the above definition.

Queries and Scripts

Conventional programming languages focus on algorithms and data structures, which rely on a properly-specified flow of control. In contrast, query languages provide a data-centric and result-oriented perspective without any notion of intermediate state or execution order. Hence, it is feasible to look at a query as a function which returns a result for a given input data set. In that sense, a functional program, or a query respectively, is a formal description of a computational goal, which has to be realized by the language's runtime.

Scripts or batches of queries are handy for automating tasks consisting of multiple steps. Typically, individual steps build on each other, e.g., input data is filtered first, then cleansed, transformed, combined, and finally aggregated. The result is returned to the application or written out to stable storage. Often it is possible to express everything concisely in a single query, but sometimes it is not, e.g., because the query language cannot integrate external tools for processing intermediate results. Sometimes it is just inconvenient to write queries which refer multiple times to the same data.


The essence of scripting-style query processing is the ability to chain operations and share intermediate results. In contrast to complex, event-driven workflows, most data processing scripts have a rather simple, linear structure and can run autonomously without waiting for additional user input. To some degree, this resembles functional programming languages, which model problems, i.e., programs, as series of function calls. Intermediate results appear here in the form of function arguments and local variables.

For most scripting tasks, it is sufficient to support simple sequences of query statements. The result of each query qi can be assigned to a variable vi so that it can be used as input by any following statement qj with j > i. Accordingly, we intend to support scripts of the following form:

   v1 := q1;
   v2 := q2;
   ...
   vn := qn;

The restriction of scripts to linear sequences of statements is a concession to simplify reasoning about and optimization of data flows in a script. The abandonment of non-linear scripting constructs does not imply that the language is incapable of carrying out more complex tasks. Conditional branching, for example, can be realized by decomposing a script into separate parts; loop constructs and recursion can be encapsulated in recursive, user-defined functions.
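As an illustration, such a linear script might chain a load, a filter, and an aggregation step as follows. This is only a sketch in the assignment notation introduced above; the collection name and the severity attribute echo the log example introduced below and are not prescribed by the language:

   $logs     := collection("log.txt");
   $critical := for $e in $logs
                where $e/@severity = "critical"
                return $e;
   $report   := count($critical);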

Semi-structured Data

The nature of semi-structured data is not precisely defined in the literature. The long list of describing attributes ranges from structural properties like complex, nested, and irregular to classifications like schema-less, self-describing, partially schema-conformant, and "fairly structured but rapidly evolving" [Abi97, AQM+97, Bun97]. One gets an intuitive understanding by explaining what is not considered semi-structured. Structured data is widely seen as synonymous with relational data, i.e., well-specified collections of flat and uniformly-typed tuples, which can be queried and processed in a structured fashion. In contrast to this, unstructured data refers to data which does not follow a particular structure and, thus, requires statistical analysis and information retrieval techniques. Examples are text in natural language and multimedia content like audio, video, and images. Semi-structured data lies somewhere between both extremes, but the transitions are blurry.

The Web is a tireless producer of incredibly large amounts of semi-structured data in the form of news feeds, data interchange messages, structured documents, log data, web pages, etc. Accordingly, one could characterize semi-structured data as collections of data items with a common super-structure but with individual diversity in the form of irregular, incomplete, and sometimes deeply-nested structures. The excerpt of an XML log file in Figure 1.1 gives an example of semi-structured data. All log entries share a common base structure, but individual entries differ in size and content.

Super-structures lend themselves as anchors for locating data of interest. Common properties like the timestamps and IP addresses in the sample log data allow for systematic filtering and grouping of related objects. Structurally heterogeneous parts are better matched against patterns to extract information. In the worst case, the actual shape of the data is a priori completely unknown and one must fall back on full-text search and similar techniques. The XQuery query in Figure 1.2, for example, uses the basic structure of log entries to filter them by date and severity, but uses a path expression as a pattern to search for message strings in the contents. The resulting overview reports for each system are shown in Figure 1.3.

Formats, Systems, and Platforms

The processing of semi-structured data reveals substantial similarity to relational query processing, with explicit filter, sort, and aggregation steps as shown in the example query above. Accordingly, it is worthwhile to reuse the algorithms and techniques from this area to keep the number of additional concepts to a minimum. Nevertheless, a query processor has to go some new ways beyond the classic relational path to cope with structurally complex and heterogeneous data. The goal of a data programming language is to find a compromise between data abstraction, language constructs, and typical processing needs that is applicable to various formats, data stores, and platforms.


[XML markup lost in extraction. The figure shows four log entries carrying tstamp and severity attributes, each with a src element holding an IP address (192.168.12.31, 192.168.14.132, 192.168.12.201, 192.168.12.31) and message/action content such as "possible attack detected", lock user "admin" for domain "evil-empire.org", "finished system update: no update required", "service not responding", restarting system service "clustermgnt", and ssh authentication failure: user "admin"; attempt: 5.]

Figure 1.1.: Example of semi-structured log data in XML format.

(: the element names in the two constructors were lost in extraction;
   <info> and <report> are plausible reconstructions :)
for $e in collection("log.txt")
let $src := $e/src
let $info := <info>{$e//msg/text()}</info>
where $e/@tstamp > "2012-04-01+00:00:000"
  and $e/@severity = ("critical", "high", "mid")
order by $e/@severity
group by $src
return <report>{ $info }</report>

Figure 1.2.: Sample XQuery for XML log data.


[Report markup lost in extraction. The query output lists, per system, messages such as: lock user "admin" for domain "evil-empire.org"; authentication failure: user "admin"; attempt: 5; restarting system service "clustermgnt".]

Figure 1.3.: Output of sample query on XML log data.

[Figure graphics lost in extraction. Three arrangements of an XML-capable query engine and a data store are depicted: (a) transform and load, (b) wrap and extend, (c) extract and process.]

Figure 1.4.: Query processing across different data sources.

Today's relational DBMSs are sophisticated experts for processing tabular data residing in their own data store. Traditionally, they provide only limited support for data in foreign structures and formats. If an application needs to process external, possibly heterogeneous data, e.g., in the form of plain text files, CSV data, XML, or binary files, it requires considerable effort to make use of the processing capabilities of an SQL-based DBMS.

Conceptually, there exist three basic strategies for joint processing of external data and data in a DBMS, as depicted in Figure 1.4. If the query engine is embedded in the DBMS, either all data needs to be converted into an appropriate relational representation and imported into the system, or the query engine interacts with external data through a mix of adapters, extension modules, and stored procedures. In practice, however, it is often simpler to process all data outside the DBMS with a dedicated ETL tool (Extract, Transform, Load), a shell script, or custom application code.


A data programming language should be deployable in all three scenarios depicted. It abstracts from a concrete data format and allows data processing tasks to be expressed at an abstract level without the need to take low-level and system-specific aspects into account. Ideally, a platform-independent compiler kernel will focus on the optimization of storage-independent query logic and delegate the final mapping from the logical view on data to its physical representation, e.g., in a DBMS, to platform-aware extensions. Such extensions may not only implement basic data access routines, but also delegate subqueries or entire queries to the respective data sources, thus maximizing the potential performance gains of native processing. Uncovered operations or otherwise lacking functionality of the storage layer need to be filled in by generic variants implemented in the language runtime.

1.3. Contributions

This thesis studies the realization of a portable compiler and runtime platform for processing structured and semi-structured data. Concretely, it addresses the compilation of an extended variant of XQuery as a well-suited candidate for a real cross-platform query and data programming language. Although XQuery is primarily intended to query XML data, it is nevertheless a suitable foundation for this work, because it is capable of representing, interpreting, and processing different kinds of information from diverse sources. The main parts cover the translation and evaluation of basic language concepts like bulk operations, loop nestings, and data access primitives.

Special consideration is given to three central challenges: set-oriented processing of structured and semi-structured data in heterogeneous formats, support of platform-specific features, and automated parallel processing. Accordingly, most of the findings and concepts developed are not inherently specific to XQuery/XML and can be generalized.

An additional aspect of this work is the extensive evaluation of the former concepts, e.g., inside a native XML database management system, where XQuery/XML are first-class citizens. Besides providing deeper insights into the application and effectiveness of certain compilation techniques, this part puts the consideration of environmental aspects like external memory in the center of interest.


In summary, the main contributions of this thesis are:

• A retargetable compiler framework and runtime environment for an XQuery-based data programming language with a clear-cut separation of language-inherent aspects from physical aspects of the underlying data store

• A hierarchical intermediate representation for queries and script-like processes inspired by functional paradigms

• A novel push-based operator design for dynamic parallelization of queries based on divide-and-conquer and work stealing techniques

• Experimental evaluation of all concepts presented in various settings, including a full-fledged prototype of a native XML database management system

The major intention of this work is the development of architectural strategies and design principles rather than a specific system or algorithm. Nevertheless, we will sometimes need to discuss certain aspects of the latter in detail, because it helps to clarify the rationale behind design decisions or because common algorithms come with substantial implications for the runtime, which must be taken into account.

1.4. Limitations

This work does not claim that the presented compilation techniques always yield optimal solutions across all logical and physical aspects of a query. A holistic approach might have the potential to outperform some of the presented optimization strategies, but it would be highly complex and therefore of doubtful practical relevance. Furthermore, this thesis does not proclaim XQuery as well-suited for every data processing task. Instead, it tries to explicitly disclose use cases which do not fit well into the picture, like some applications for scientific data. Also, this work does not attempt to provide a solution for schema mapping and information integration in general.

To avoid any misconception, some topics should also be mentioned which are related to this thesis but not covered, because they are considered orthogonal or out of focus:


• General programming language design and compiler techniques (e.g., typing, memory management, or code generation in general)

• Statistics-based and cost-based query optimization as well as techniques for automatic tuning and robustness of query plans

• Distributed data and distributed query processing

• Machine-specific optimization (e.g., memory, I/O, SIMD) and specialized hardware (GPU, FPGA, etc.)

• Intrinsics of parallel shared-memory architectures like scheduling, core-to-core communication, and memory management

Whenever necessary or helpful for presenting or understanding the material in this thesis, however, some of these aspects will be commented on, but without intending to discuss an issue exhaustively.

1.5. Outline

Chapter 2 introduces design principles and core language constructs for declarative query processing and scripting. Thereafter, Chapter 3 gives an overview of XQuery, which will be used in this thesis as a concrete instance of the abstract data programming language defined before. In Chapter 4, we present a hierarchical query representation which is derived from the functional core of the abstract language. We also outline the individual query rewriting and optimization phases of the compilation process. A detailed discussion of the central compilation stages addressed by this thesis follows in Chapters 5 and 6. They present techniques and rewriting rules for optimizing queries with respect to set-oriented and physical aspects, respectively. Chapter 7 introduces a novel push-based operator model, which allows data-parallel sections of a query to be exploited automatically. The concepts and techniques developed in this thesis are empirically evaluated in Chapter 8. Related work is reviewed in Chapter 9. Finally, this thesis closes with a summary and an outlook on future work in Chapter 10.

2. Anatomy of a Data Programming Language

For a better understanding of the various challenges addressed in this thesis, we start with an introduction of the ingredients of a data programming language. The primary goal of this chapter is to provide insight into central aspects like the basic data model, core operations, and bulk processing logic rather than to give a full language specification or syntax. Thus, the presentation stays at an abstract level, but with references to the XQuery language specification as a source for concrete specifications. In Chapter 3, we will see the concepts presented here applied in a more specific fashion, geared toward XML support.

2.1. Data Model

The heart of every query language lies in its underlying data model. In the classic definition, a data model consists of three parts: a set of types and structures, a set of operations, and a set of constraints [Cod80].

The types and structures available in a data model determine its expressiveness, i.e., the ability to represent information in a concise and logically coherent form. Practically, they define the shape of data and thereby influence the design of applications and physical storage. To work with instances of types and structures, the data model must define operations on them. All operations must be closed within the data model, i.e., all values produced must also be valid instances of the data model. A few basic primitives are usually sufficient to provide the logical foundation for building convenient APIs and query languages at a higher level of abstraction.


[Diagram lost in extraction. It shows a type hierarchy with Value at the root and the four kinds Map, Array, Primitive, and Functor below it; Primitive subsumes Double, Float, Decimal, Integer, String, Boolean, Date, etc.]

Figure 2.1.: Value types in a data programming language.

Constraints enforce well-formedness and integrity of data in a data model. Generally, they are divided into two classes. Value-space constraints restrain the domains of primitive values like integers, strings, etc. Structural constraints specify the range of valid combinations of structures and the relationships between them. Some data models also build on combinations of both.

For a data programming language, it is of utmost importance that the underlying data model conceptually suits the common shapes of data found in programming languages and serialization formats. Otherwise, the language loses too much of its advantages, and expensive user-defined mechanisms are required to convert the input to a suitable representation beforehand.

2.1.1. Values

A value is the logical unit or granule of information in the data model. An overview of the basic value kinds for modeling semi-structured data at different levels of abstraction is given in Figure 2.1.


Primitive Values

Atomic units of data are strings, Boolean values, integers, floating-point numbers, etc. In principle, many kinds of values may be considered primitives when necessary. Query languages, for example, often include support for dates, timestamps, and binary data. As this thesis builds upon XQuery, we stick to the atomic value types defined in [W3C04b].

Complex Values

Complex values are composites of primitive values and other composites, which represent abstract concepts in applications like accounts, log records, and orders. The only two composition types required in our abstract language are array and map. An array is an ordered sequence of values. Individual values may be of heterogeneous types and may occur multiple times in an array. A map represents an ordered mapping of primitive values, called keys, to other values. Keys are required to be distinct, and the order of all keys in a map is stable. The enforcement of primitive keys is a concession to facilitate efficient implementations of map values. In cases where the application requires map values with non-primitive keys, one can define primitive surrogate keys for complex values.

Some query languages for semi-structured data include bags and sets as explicit, order-insensitive composite types [ORS+08]. Practically, they can be considered special cases of arrays, where the conceptual differences primarily become relevant when constructing new values. In this sense, arrays could also be interpreted as a special case of maps with densely ascending integers as keys. Because of their practical relevance and performance-relevant implications, however, it seems wise to treat arrays as basic building blocks, too.

Composite types may be arbitrarily nested, but the nesting must not be cyclic. For example, an array cannot be a field value of itself. In other words, complex values must form a tree of values; pointers are not supported. Note that, depending on the context, a complex value may be considered a value of its own or a composite of individual values. An XML element, for example, can be referred to as an individual node and as the representative of a whole XML subtree.


Functors

A functor or function value is a callable function object. It embodies a function definition or algorithm that can be applied to a given set of arguments, and it can be used like any other value type, e.g., as a field value in arrays and maps. The arguments and the return value of a function can be values of any type, including other functors.

The explicit representation of function values enhances the flexibility and extensibility of the language, but does not introduce higher computational expressiveness. It is also not a typical requirement in a query language, but it is of high practical value in a data programming language. Higher-order functions can, for example, be exploited to plug custom functionality into standard query constructs. In combination with complex values, they allow related data and operations to be encapsulated similar to object-oriented programming languages [Weg87].
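As an illustration of function values, the following XQuery 3.0 sketch binds an inline function to a variable and applies it like any other value; the function name and numbers are made up for the example:

   let $discount := function($price) { $price * 0.9 }
   for $p in (100, 250, 40)
   return $discount($p)   (: yields 90, 225, 36 :)

The functor bound to $discount could just as well be stored as a field value in a map or passed as an argument to another function.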

Null

Incomplete or unknown information is represented by the special value null. Its use may or may not be tolerated in individual situations or data structures. In comparisons and predicates, null is used to implement a three-valued logic [Cod90].

Identity

Neither primitive nor complex values have an identity. Primitive values are considered equal when they represent the same value of the same basic unit type. Equality of complex values is defined recursively: two complex values are considered equal when they are "deep equal", i.e., when their sub-components are pair-wise equal. Identity concepts at a higher level of abstraction, like primary keys or node identity in XML, can be achieved by adding and enforcing explicit or implicit identity values in a structure.
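XQuery mirrors this layering directly: fn:deep-equal implements the recursive value equality described above, while the is operator tests the higher-level node identity of XML. A small sketch:

   deep-equal(<a><b/></a>, <a><b/></a>)   (: true: both trees are pair-wise equal :)
   let $n := <a/> return $n is <a/>       (: false: two distinct node identities :)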

Relationships

A complex value stands in a natural parent-of relationship to the individual parts it consists of. In conjunction with the ordering of child values, this relationship forms the foundation for any operation in the data model. Further structural relationships can be inductively defined whenever necessary.

The parent-of relationship is asymmetric, e.g., it is possible to get the values of a map by their keys, but it is not possible to obtain all maps referring to a particular value. If reversely-directed relationships are needed, e.g., if values must be able to refer to a complex value as their parent, they must be modeled by adding value-based identity and foreign-key concepts to the parent and the children, respectively. Alternatively, the relationship can be emulated with an operation which returns the parent value by selecting the value that contains the respective value as a child.

2.1.2. Types and Schema

The presence of a type system offers, as in any language, numerous advantages. Most importantly, it serves as an abstraction to reduce complexity within the language itself and to simultaneously increase the modularity of software written in that language. Besides, static and dynamic type checking help to ensure the safety and correctness of a program and enable compilers to reason about data and exploit this information for optimization.

Types appear as an additional dimension of values. Every value has an annotated type, which becomes an integral part of the value itself. A type system defines a hierarchical is-a relationship among types. This facilitates techniques like polymorphic subtyping and implicit conversion of values, e.g., for function calls [Lie87, LS88]. The basic type hierarchy is identical to the value hierarchy shown in Figure 2.1. Extension points allow the introduction of additional application-specific or platform-specific subtypes through domain restrictions and structural constraints for primitive values and composite values, respectively.

Extending the Type Hierarchy

To leverage the typing information of a specific data model like the relational model or XDM, it must be tightly integrated into the type hierarchy. The straightforward solution models the building blocks of a concrete data model as subtypes of the existing primitive and complex types. But if a new type should reflect a completely distinct kind of data, a more intrusive integration strategy must be pursued. The new type is then modeled as a separate top-level branch in the type hierarchy. Values thereof appear as separate types, even if they embody the same composition types and primitive values as types in other branches. The extended XQuery data model presented in Chapter 3, for example, incorporates with XML and JSON two completely different abstractions of semi-structured data. Conceptually, however, XQuery treats all values as sequences and thus operates only on array composites.

Technically, both extension strategies are feasible, because types are solely relevant in the context of type-sensitive operations. At the physical level, a single implementation of maps and arrays is sufficient to represent values of any type. Ideally, however, the compiler is equipped with tailored implementations for concrete abstractions (e.g., XML) and leverages the capabilities of specialized storages.

Schema Information

The integrity of a value is not automatically guaranteed. Assume a type that represents relational tables as two-dimensional arrays of primitive values. For values of this type, we should enforce consistent sizes and value types across rows and columns. Normally, such conditions will have to be checked when a value is instantiated. However, application-level aspects like the types of column values and foreign-key constraints are usually out of reach of the instantiating operation. Informally, we denote such a set of domain-specific type definitions and invariants as a schema.

In many query languages, especially those for homogeneously structured data, schema information is mandatory and often an integral part of the storage layout. Invariants like uniqueness constraints and foreign keys may then be eagerly exploited to derive better execution plans. In the field of semi-structured data, however, proper schema descriptions are less common. In most scenarios, e.g., ad-hoc data analysis, schema knowledge is encoded directly in the data processing logic. Schema descriptions and schema languages are adopted merely in standardized application domains or large applications. Consequently, a query processor should be able to leverage complete or partial schema information when available, but must primarily deal well with schema-less data.


This work focuses on query compilation and evaluation without explicit schema information, but the presented techniques can at any time be refined and complemented with the integration, verification, and exploitation of schema knowledge. For our purposes, we assume that the burden of ensuring data integrity and the validity of queries is put on the developer.

2.2. Basic Language Features

The core features of a data programming language can be grouped into three categories: creation and manipulation of data, primitives for computations and calculations, and extensibility aspects.

2.2.1. Composition and Decomposition of Data

Almost every operation in a query involves the inspection, interpretation, and combination of data. Accordingly, the creation, composition, and decomposition of primitive and complex values are key tasks.

Constructors

Operations for creating primitive and complex values are called constructors. In a query, constructors are invoked explicitly via function calls and dedicated language constructs, or implicitly when a value is computed as an intermediate result. If the type of a newly created value needs to obey certain constraints, the constructor must also perform the respective consistency checks.

Primitive values are either parsed directly or converted from a text or binary representation, e.g., for literals in the query string or for results of standard operations like arithmetics, string operations, type conversions, etc. Complex types may be composed from any given set of primitive and complex argument values. In every case, however, complex values must be based on the composition types array and map.

In principle, the same kinds of values can be created through an arbitrary number of different constructor functions. Multiple constructors for a particular value kind may be provided, e.g., to support syntactical variants, to enforce certain type properties, or to expose performance- and storage-relevant properties.


Value Type   XQuery/XDM                  JSON
Primitive    1                           1
             3.14159e0                   3.14159e0
             "a string"                  "a string"
             xs:boolean("true")          true
             xs:date("2012-07-21")
Null value   ()                          null
Array        (1,"2",3.0)                 [1,"2",3.0]
Map          element e {                 { "e" : {
               attribute a { 1 },            "attrs" : [{ "a" : 1 }],
               element c { "a child" },      "chlds" : [
               text { "some text" }            { "c" : { "attrs" : [ ],
             }                                           "chlds" : ["a child"] } },
                                               "some text"
                                             ] } }

Table 2.1.: Examples of data constructors for XQuery/XDM and JSON.

Whenever feasible, examples in this thesis use the simple syntax of the data serialization format JSON [Cro06] for representing array and map values. Arrays are specified as comma-separated lists of values enclosed in square brackets []. Maps are specified as comma-separated lists of key/value pairs enclosed in curly braces {}. Keys and values are separated by a colon (:). Table 2.1 exemplifies syntax variants for constructing primitive and complex values in XQuery/XDM and JSON.

Path Operations

The counterparts to constructors are functions and other built-in facilities for decomposing complex values, i.e., for accessing their individual parts. When combined, they virtually reflect navigation through the tree-like structure of complex values. Field accesses are therefore commonly called path operations.


As the data model consists of only two composition types, there are solely two corresponding types of access operations. Both arrays and maps support associative field access, by position and by key, respectively.

The fields of an array can be accessed with the binary operator [[]]. The first parameter is the respective array, the second is the position of the array field. In typical array syntax, the second argument is enclosed inside the square brackets (the notation with two square brackets was chosen to avoid confusion with XQuery predicates and filters), i.e., x[[y]] returns the y-th element of the array x. As usual, the numbering starts with 0, so that ["a","b"][[0]] will return "a".

A mapped value can be obtained through the binary operator =>, which uses the second argument as a lookup key for the map provided as the first argument. The expression {x:2,y:3}=>y, for example, yields the value 3. Left precedence and infix notation foster convenient chaining to navigate arbitrarily deep paths of the form x1=>x2=>...=>xn. Accordingly, {point:{x:2,y:3}}=>point=>y yields the y value 3 of the given point record.

Advanced Path Operations

As said, accesses to map values and array fields mimic a drill-down into a tree-like structure. Generally, they are sufficient to explore arbitrary nestings and relationships between values. However, if a query needs to evaluate more elaborate search patterns, it quickly gets complicated to express them as a combination of simple down steps and supporting predicates. Very likely, the evaluation of complex compound patterns has an impact on performance, too.

To facilitate a crisp notation of complex search patterns and to open the door for better-performing search routines, front-end languages can add further access operations, as is the case in XQuery and its subset XPath. For navigating XML trees, they define 7 forward axes like child, descendant, and following-sibling, and 5 backward axes like parent, ancestor, and preceding [W3C10a, W3C07]. In the same way, path operations can be enriched with support for wildcards, type matching, etc.

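To illustrate how such axes shorten pattern matching, the following hedged XPath sketch locates, for the log data of Chapter 1, the severity of every entry whose message mentions a failure; the element names are assumptions:

   //msg[contains(., "failure")]/parent::*/@severity

Expressing the same backward navigation with simple down steps alone would require restating the whole path from the root and correlating the results.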


Auxiliaries

Sometimes it is necessary to inspect unknown structures in a query. To complete the picture, we therefore make use of two auxiliary functions, entries() and length(), which return the list of key/value pairs of a map and the number of elements of an array, respectively.

At the physical level, both constructors and path operations must operate on a common representation. However, if multiple data sources and different storages are used within the same query, there are always pairs of data construction operations and data access operations that belong together. In such cases, a query processor may utilize a generic access interface and appropriate adapters for each storage. The performance penalty of adapters may be avoided if operations can be safely confined to a particular storage. However, this requires careful analysis of the data flow in a query, as detailed in Chapter 6.
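A brief sketch of the two auxiliaries in the notation used above (the exact result representation of entries() is an assumption, as the text does not fix one):

   entries({x:2,y:3})    (: yields the key/value pairs [["x",2],["y",3]] :)
   length([1,"2",3.0])   (: yields 3 :)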

2.2.2. Core Operations

As in any query language, a comprehensive tool set of core operations like arithmetics, comparisons, conditional branching, and type conversion is necessary to carry out meaningful tasks. Without going into details, we assume common standard functionality to be available as defined in XQuery/XPath [W3C10a, W3C11a].

Coercion

The working principles of basic operations on primitive values, e.g., arithmetics, are folklore. However, their application to non-primitive values requires further specification. For example, a comparison a>b is well-defined if a and b are numeric values, but its semantics is unclear if one or both of them are complex values.

If an operation is considered invalid or forbidden for a particular combination of value types, an error should be raised. Alternatively, a complex argument value can be coerced to a primitive value for which the respective operation is defined [Abi97]. For example, if a and b in the above example were the two arrays [1,2,3] and [2,3,4], one could coerce both arrays transparently to the sums of their elements, yielding the valid comparison 6>9.


Clearly, automatic argument conversion is primarily a comfort feature and not a technical necessity. A front-end language may specify meaningful coercion rules for certain operations and value types, but may also raise typing errors instead. Again, we assume in this work intuitive type conversion and coercion rules as in XQuery/XPath [W3C10a, W3C11a].

Platform Specifics

Another peculiarity of core operations arises at the implementation level. The support of heterogeneous target platforms requires consistent treatment of all system-specific aspects like arithmetic overflows, rounding, character sets, character encodings, byte ordering, etc. XQuery refers here to widely-used industry standards to define the expected behavior of operations on built-in primitives [W3C04b, W3C11a].

2.2.3. Function Calls

The support of built-in and user-defined functions is of essential utility. Built-in or system-defined functions primarily implement basic or low-level functionality like I/O, string manipulation, date operations, etc. Additionally, built-in functions may be used to make proprietary capabilities of the target platform accessible. User-defined functions encapsulate parameterized queries and utility operations, which allows for better code reuse and maintainability.

A function can take any number of input parameters and produces a single result value. When a function is called, the runtime is allowed but not required to handle the arguments in a non-strict fashion. For large input arguments, the query processor can then employ memory-efficient pipelining techniques instead of computing each argument entirely beforehand. In addition, lazy argument handling also brings the freedom to use arbitrary forms of directly and mutually recursive functions. For performance reasons, a function implementation may take advantage of lazy evaluation and compute large results incrementally, too.

declare function sum-greater-3-rec($val, $pos, $end, $acc) {
  if ($pos > $end) then
    $acc
  else if ($val[$pos] <= 3) then
    sum-greater-3-rec($val, $pos + 1, $end, $acc)
  else if (empty($acc)) then
    sum-greater-3-rec($val, $pos + 1, $end, $val[$pos])
  else
    sum-greater-3-rec($val, $pos + 1, $end, $acc + $val[$pos])
}

Figure 2.2.: Recursive function to sum up all values greater than 3.

2.3. Bulk Processing

The language features described so far are already very powerful. Conditional branching and recursive functions yield a Turing-complete language, which allows arbitrary computations to be performed. However, this is not yet sufficient for effective data processing. The handling of serious data volumes depends on efficient algorithms and careful resource management. Hence, a compiler must be able to reason about a query, its use of data, and its runtime properties. It must recognize data flows, patterns, and dependencies to assemble filter, join, and grouping algorithms into an efficient execution plan.

It is difficult, if not impossible, to extract the actual intention of a query from arbitrary functions. Even the most basic query pattern is hard to identify without appropriate support from the language itself. Consider the function in Figure 2.2. It uses a non-trivial recursion to compute the sum of all elements in the input sequence $val that are greater than 3. It requires sophisticated analysis to recognize that the parameters $pos and $end are used to scan sequentially through the input sequence, and that the parameter $acc is the accumulator variable holding the current sum. Obviously, the excessive use of recursive functions is barely efficient and often cumbersome to write, too.

Dedicated language constructs for expressing data-intensive processes without recursion can fill this gap. The explicit specification of bulk operations and data flows simplifies analysis, optimization, and code generation dramatically.
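For contrast, a declarative bulk construct states the same computation in a form whose filter-and-aggregate pattern a compiler can recognize immediately. A sketch in XQuery (with one hedge: sum() returns 0 rather than the empty sequence when no element qualifies, so the corner case differs slightly from Figure 2.2):

   sum(for $v in $val
       where $v > 3
       return $v)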

Internally, bulk operations look like higher-order functions whose special role is known to the compiler. This avoids the need to introduce completely different language concepts and simplifies the compilation process.

A bulk processing function operates in a relational fashion over sets of tuple-like structures. We assume here the simplest form of relational structures, i.e., two-dimensional arrays of values. Each function accepts at least a two-dimensional array as input and produces a two-dimensional array as output. This makes the mapping to efficient tuple-oriented techniques and algorithms straightforward.

Without intending to provide an exhaustive list of all advisable bulk operations, this work focuses on three groups of standard operations, as listed in Table 2.2. In the following, we will denote bulk functions as operators to emphasize their special role as internal query constructs and use upper-case names to distinguish them from normal functions. An upper-case type parameter T stands for a whole list of columns, i.e., a function [T]→[[S]] takes an array with columns t1, t2, ..., tn and returns an array of arrays with columns s1, s2, ..., sm.

2.3.1. Transformation and Filtering

The first group of operators transforms and filters a given input relation. These functions can be characterized as encapsulations of looping primitives, which iteratively process the input relation row by row.

The ForEach operator works similar to the map pattern in functional programming. It computes a sequence of Cartesian products where one input can be functionally dependent on the other. The function bind is applied to each tuple t in the input relation in to produce a second relation, which is used to compute the cross product with the respective input tuple. Figure 2.3 illustrates the working principle of ForEach for a simple bind function, which maps the first column value of a tuple [x,y] to a relation of two tuples [x+1] and [x+2].

The relative order between input tuples is preserved in the output, reflecting a nested-loops-like evaluation strategy. However, if a query permits permutations of tuples in the output relation, the query processor may exploit this, e.g., for parallelism, and produce a randomly ordered output.
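Although ForEach is an internal operator rather than surface syntax, its semantics can be pictured as a nested FLWOR iteration. A sketch, with bind standing for whatever expression derives the dependent relation:

   for $t in $in          (: outer loop over the input relation :)
   for $s in bind($t)     (: bind produces a relation per input tuple :)
   return ($t, $s)        (: pair each input tuple with each bound tuple :)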


Operator            Parameters                  Return Type
ForEach             in    [[T]]                 [[T,S]]
                    bind  [T]→[[S]]
Project             in    [[T]]                 [[S]]
                    proj  [T]→[S]
Select              in    [[T]]                 [[T]]
                    pred  [T]→bool
OrderBy             in    [[T]]                 [[T]]
                    cmp   [T]×[T]→bool
GroupBy             in    [[T]]                 [[S]]
                    grp   [T]→int
                    agg   [[T]]→[S]
Count               in    [[T]]                 [[T,int]]
Join / LeftJoin     in    [[T]]                 [[T,S,R]]
                    left  [T]→[[T,S]]
                    right [T]→[[T,R]]
                    pred  [T,S]×[T,R]→bool
Concat              in    [[T]]                 [[T,S]]
                    left  [T]→[[T,S]]
                    right [T]→[[T,S]]
Union / Intersect   in    [[T]]                 [[T,S]]
                    left  [T]→[[T,S]]
                    right [T]→[[T,S]]
                    cmp   [T,S]×[T,S]→bool

Table 2.2.: Overview of bulk processing operators.

The bind mechanism can be used to feed additional data into the output relation. In a chain of bulk functions, ForEach will therefore often play the role of an input provider. The operators Project and Select are equivalent to projection and selection in classic relational settings. The former strips specific columns from the input relation. The latter discards tuples from the input relation that do not fulfill the given predicate. Note that Select can be seen as a specialized version of ForEach. Examples for both are given in Figure 2.4.


in        ForEach(in,bind)
1 "c"     1 "c" 2
2 "b"     1 "c" 3
3 "a"     2 "b" 3
          2 "b" 4
          3 "a" 4
          3 "a" 5

with bind := [x,y] → [[x+1],[x+2]]

Figure 2.3.: Exemplified output of ForEach operator.

in        Project(in,proj)    Select(in,pred)
1 "c"     1
2 "b"     2                   2 "b"
3 "a"     3                   3 "a"

with proj := [x,y] → [x]
     pred := [x,y] → x > 1

Figure 2.4.: Exemplified output of Project and Select.
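Under the assumptions of the previous sketches, the first operator group can be written down in a few lines of Haskell (lower-case names are used to fit the language; the list-based implementation is illustrative, not the one used later in this thesis):

    forEach :: [[a]] -> ([a] -> [[a]]) -> [[a]]
    forEach rel bind = concat [ map (t ++) (bind t) | t <- rel ]

    project :: [[a]] -> ([a] -> [a]) -> [[a]]
    project rel proj = map proj rel

    select :: [[a]] -> ([a] -> Bool) -> [[a]]
    select rel p = filter p rel

    -- e.g., forEach [[1],[2]] (\(x:_) -> [[x+1],[x+2]])
    --   == [[1,2],[1,3],[2,3],[2,4]]

Note how the list comprehension in forEach preserves the relative input order, matching the nested-loops-like evaluation strategy described above.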

2.3.2. Sorting, Grouping, and Aggregation

The second group of operators hosts sort and aggregation functions. In contrast to the first group, they operate on a per-relation basis and evaluate relationships and correlations between individual rows. Unsurprisingly, the sort operator OrderBy is identical to its relational counterpart. It reorders the tuples in the input relation according to the ordering scheme determined by the comparison function cmp. Without loss of generality, we assume that cmp evaluates a less-or-equal relationship between tuples. An example of OrderBy is given in Figure 2.5. In a language like SQL, which is originally based on the order-agnostic relational model, a sort operator can basically employ any sort algorithm. In array and map values, however, order is a fundamental aspect. Hence, OrderBy must implement a stable sort to guarantee deterministic results. Similarly to ForEach, this requirement can be relaxed if a query tolerates non-stable sorting.


in        OrderBy(in,cmp)
1 "c"     3 "a"
2 "b"     2 "b"
3 "a"     1 "c"

with cmp := [x1,y1]×[x2,y2] → y1 ≤ y2

Figure 2.5.: Example of OrderBy.
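A minimal sketch of OrderBy on top of the previous definitions, assuming a less-or-equal comparison function as in the text; Data.List.sortBy is a stable merge sort, so rows that compare as equivalent keep their relative input order:

    import Data.List (sortBy)

    orderBy :: [[a]] -> ([a] -> [a] -> Bool) -> [[a]]
    orderBy rel leq = sortBy (\x y -> if leq x y then LT else GT) rel

    -- elements are reordered only when the comparator yields GT,
    -- so equivalent rows are never swapped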

The GroupBy operator divides the input relation into separate partitions and aggregates each of the latter to a single output tuple. The partitioning scheme is defined by the given grouping function2. The function agg aggregates the tuples in each non-empty partition. To obtain deterministic results, we require again that the relative order of input tuples is preserved within each partition. The ordering of the aggregated tuples in the output relation, however, is implementation-dependent to permit efficient grouping algorithms. GroupBy is a generalized variant of standard grouping operators for plain relational data. As in SQL, the tuples within each partition are only aggregated column-wise. However, the aggregation function itself is not limited to primitive aggregates like count, sum, min, max, etc. Any aggregation function yielding a primitive or non-primitive aggregate value is allowed. A very simple aggregation function, for example, could combine all column values into a single array. Figure 2.6 demonstrates the output of GroupBy for a simple summation function agg1 and a concatenation function agg2, which are both defined as specializations of the higher-order aggregation function foldl [Hut99]. Note that both aggregation functions obey the mentioned standard behavior of a grouping operation. The columns of a partition are aggregated separately, and the column that served as grouping key is reduced to a single representative value. The operator Count enumerates a relation and attaches to each tuple its position in the input relation. The numbering is dense and starts with 1. An example is shown in Figure 2.7.

2For illustration purposes, Table 2.2 assumes numbered partitions and a grouping function [T]→int, which assigns a tuple to a specific partition.


in      GroupBy(in,grp,agg1)    GroupBy(in,grp,agg2)
1 1     1 4                     1 [1,3]
2 4     2 7                     2 [4,3]
1 3     3 2                     3 [2]
3 2
2 3

with grp          := [x,y] → x
     agg1         := [[x,y]] → foldl(add-right,[∅,0],[x,y])
     agg2         := [[x,y]] → foldl(concat-right,[∅,[]],[x,y])
     add-right    := [x1,y1]×[x2,y2] → [x2,y1+y2]
     concat-right := [x1,[y]]×[x2,y2] → [x2,[y,y2]]

Figure 2.6.: Examples of aggregation types in GroupBy.
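A sketch of GroupBy under the same assumptions; partitions preserve the relative input order of their tuples, while the output order (here: order of first key appearance) is one of the implementation-dependent choices mentioned above. The helper names are ours:

    groupByOp :: [[a]] -> ([a] -> Int) -> ([[a]] -> [a]) -> [[a]]
    groupByOp rel grp agg = [ agg [ t | t <- rel, grp t == k ] | k <- keys ]
      where
        keys = dedup (map grp rel)
        dedup []     = []
        dedup (x:xs) = x : dedup (filter (/= x) xs)

    -- e.g., summation over the second column, keyed by the first:
    -- groupByOp [[1,1],[2,4],[1,3],[3,2],[2,3]] head
    --           (\p -> [head (head p), sum (map (!! 1) p)])
    --   == [[1,4],[2,7],[3,2]]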

in      Count(in)
"a"     "a" 1
"b"     "b" 2
"c"     "c" 3

Figure 2.7.: Example of Count.

In some situations, the compiler presented in Chapter 4 will use Count just to assign each tuple a unique label, which can be used later on as a sort or grouping key. In this case, an implementation may optimize the labeling process in favor of parallelism and permit "holes".
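For completeness, a list-based sketch of Count's dense numbering (restricted to integer rows here so the position fits the column type; a "holey" labeling variant would merely drop the density guarantee):

    count :: [[Int]] -> [[Int]]
    count rel = [ t ++ [i] | (t, i) <- zip rel [1..] ]

    -- count [[7],[8],[9]] == [[7,1],[8,2],[9,3]]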

2.3.3. Joining and Combining

The last group of bulk operators provides means to combine two relations in various ways. In contrast to the aforementioned operators, however, none of the operators in this group is essential. Each one can be replaced by combinations of other operators that yield the same result. However, because efficient algorithms are usually indispensable for queries which combine data, the operators in this group are included to enable the use of optimized algorithms and to simplify the compilation process.


The operators Join and LeftJoin implement standard join functionality3, but in a more general manner. The table functions left and right are consecutively applied to the tuples of the input relation in to produce the left and right join inputs, respectively. The predicate function decides which pairs of tuples from these inputs will be combined to an output tuple. In case of LeftJoin, every tuple from the left input will be included in the result. If a tuple does not find a join partner in the right input, it is padded with null values to ensure homogeneous tuple width in the output relation. As always, the input order is preserved in the output if not stated otherwise. Order precedence is given to the left input, which reflects a nested-loops-style evaluation strategy. Note that this does not imply that implementations cannot use other join algorithms like hash join or merge join. Examples for both join variants are given in Figure 2.8. The input relation in consists of a single empty tuple. Accordingly, the input functions left and right, which return the static relations left-in and right-in, respectively, are invoked only once. These two relations are joined with a simple equality predicate over the first column values. It is important to emphasize that the computation of multiple left and right join inputs, one for each input tuple, is a fundamental difference to the structure of a classic relational join. Effectively, Join and LeftJoin compute sequences of joins and concatenate the results. However, typical situations will often allow the query processor to take advantage of constant input functions. For example, if the right join input is functionally independent of the input relation, as in Figure 2.8, a hash join implementation must load the hash table for the right input only once. In contrast, functionally dependent inputs, as in Figure 2.9, require rebuilding the hash table for each input tuple. Section 5.2 addresses joins and related optimization strategies in more detail. As their names suggest, Concat, Union, and Intersect cover further standard operations on relations. Like the join operators, they operate on two relations, which are step-wise computed for each tuple in the given input. As always, input order is preserved if not stated otherwise. Illustrations are given in Figure 2.10.

3Inner joins and left joins are particularly interesting for the material in this work. Further kinds of outer joins and self joins are left out for brevity, but not for practical or technical reasons.


in      left-in     right-in
[]      1 "a"       3 "x"
        2 "b"       1 "y"
        3 "c"       1 "z"

Join(in,left,right,pred)    LeftJoin(in,left,right,pred)
1 "a" 1 "y"                 1 "a" 1 "y"
1 "a" 1 "z"                 1 "a" 1 "z"
3 "c" 3 "x"                 2 "b" ∅  ∅
                            3 "c" 3 "x"

with pred  := [x1,y1]×[x2,y2] → x1=x2
     left  := [x] → left-in
     right := [x] → right-in

Figure 2.8.: Exemplified output of Join and LeftJoin.
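The per-tuple recomputation of the join inputs can be captured in a compact sketch (rows are lists of Maybe values so that Nothing can serve as the null padding; the explicit padding width rw is an assumption of this sketch, not part of the operator signature in Table 2.2):

    joinOp :: [[a]] -> ([a] -> [[a]]) -> ([a] -> [[a]])
           -> ([a] -> [a] -> Bool) -> [[a]]
    joinOp rel left right p =
      concat [ [ l ++ r | l <- left t, r <- right t, p l r ] | t <- rel ]

    leftJoinOp :: Int -> [[Maybe v]] -> ([Maybe v] -> [[Maybe v]])
               -> ([Maybe v] -> [[Maybe v]])
               -> ([Maybe v] -> [Maybe v] -> Bool) -> [[Maybe v]]
    leftJoinOp rw rel left right p = concat
      [ [ l ++ r | l <- left t
                 , r <- let ms = [ r' | r' <- right t, p l r' ]
                        in if null ms then [replicate rw Nothing] else ms ]
      | t <- rel ]

A constant right function corresponds to the situation of Figure 2.8, where a hash-based implementation could build its hash table once; a dependent right function forces a rebuild per input tuple as in Figure 2.9.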

2.3.4. Composition of Operators

The consistent use of relational structures in operators facilitates their composition to accommodate typical query patterns. A composition can be established by chaining operators via output/input parameters, e.g., to form sequences of scan, filter, and transformation steps, or in the form of nestings inside joins, unions, etc. The summation function from Figure 2.2 can be rewritten with operators as shown in Figure 2.11. In contrast to the recursive solution, a compiler can easily deduce that the data flows from inside out within this composition. Starting with the empty tuple, the auxiliary function transpose converts the input array to a single-column relation, which is also the result of the respective ForEach function. Select filters this result by evaluating the predicate first-greater-than-3 for each tuple. GroupBy assigns the remaining tuples to the partition 0 by using the constant partitioning function zero, and finally computes the sum of all values in this partition. The equivalence to a relational query plan is obvious. By drawing on relational query optimization, we can therefore employ similar rewriting techniques for operator compositions. Select filters should be moved upwards in a chain to reduce the size of intermediate results, a filtered Cartesian product should be replaced by a join, etc. However, to perform such restructuring, we must first complete the picture and expose hidden dependencies in compositions of higher-order bulk functions.


in      Join(in,left,right,pred)
1       1 2 2
2       2 3 3

left([1])    left([2])    right([1])    right([2])
1 2          2 3          1 2           2 3
1 3          2 4          1 0           2 1
1 4          2 5

with pred  := [x1,y1]×[x2,y2] → y1=y2
     left  := [x] → [[x,x+1],[x,x+2],[x,x+3]]
     right := [x] → [[x,x+1],[x,x-1]]

Figure 2.9.: Join with functionally-dependent input functions.

So far, we tacitly assumed that composition knowledge is encapsulated in supplementary functions like first-greater-than-3. The predicate function expects an input relation with rows having at least one column and compares the value at position 0 of a given input tuple against the constant 3. Similarly, the function sum-first accumulates only the values at position 0. Any reordering of operators may easily break such dependencies. To get rid of auxiliary functions and hard-coded tuple positions, we introduce a naming scheme for columns as it is natural in relational settings. If every column in the output relation of a function is labeled so that the consuming operator can access individual columns by referring to the respective label, the compiler can easily trace dependencies during optimization. In Figure 2.12, we labeled the single output columns of ForEach and Select with $a and $a' as indicated by the suffixes [$a] and [$a'], respectively. The two unary helper functions sum-first and first-greater-than-3 were replaced with parameterized versions.


in      left-in     right-in    Concat(in,left,right)
[]      1 "a"       3 "x"       1 "a"
        2 "b"       2 "b"       2 "b"
        3 "c"       1 "a"       3 "c"
                                3 "x"
                                2 "b"
                                1 "a"

Union(in,left,right,cmp)    Intersect(in,left,right,cmp)
1 "a"                       1 "a"
2 "b"                       2 "b"
3 "c"
3 "x"

with cmp   := [x1,y1]×[x2,y2] → x1=x2 ∧ y1=y2
     left  := [x] → left-in
     right := [x] → right-in

Figure 2.10.: Exemplified output of Concat, Union, and Intersect.

The summation function is replaced by sum($a) and the predicate is replaced by the function call greater-than($a,3). The output column of GroupBy is labeled with $a' instead of $a to indicate that the values in this column are not the original values produced by ForEach, but aggregates thereof. Effectively, a column label is a variable binding for a specific column position, and a reference to it translates into a respective access operation on the current input tuple. As will be shown in Section 4.2.1, the abstraction mechanisms of the lambda calculus will provide the right tool for defining symbolic column names and modeling the dependencies between individual operators. However, recall that we introduced dedicated operators only as an internal representation of common query constructs and data flows. At the syntax level, we aim for dedicated language constructs which allow bulk operations to be expressed in a concise and natural fashion. In our front-end language XQuery, this part is greatly accomplished by FLWOR expressions.

declare function sum-greater-3-bulk($val)
{
  GroupBy(
    Select(
      ForEach([], transpose($val)),
      first-greater-than-3),
    zero,
    sum-first)
}

Figure 2.11.: Bulk function to sum up all values greater than 3.

declare function sum-greater-3-bulk($val)
{
  GroupBy[$a'](
    Select[$a](
      ForEach[$a]([], transpose($val)),
      greater-than($a,3)),
    zero,
    sum($a))
}

Figure 2.12.: Use of named column positions in bulk operations.

2.4. Data Manipulation

In the elementary data model, data manipulations reflect modifications of map and array values in the form of insert, update, and delete operations. An insert adds new key/value pairs and fields to maps and arrays, respectively. An update replaces mapped values and array fields. A delete shrinks composite values by removing key/value pairs or array fields. Because the effect of these basic update primitives is obvious, we can omit a detailed presentation. However, a detailed specification of update operations is necessary in the context of a concrete front-end language or data model, e.g., the XQuery Update Facility [W3C06b].


Conceptually, data manipulation requires clarification at two different levels. It must be specified how updates are applied and propagated to persistent storage and how queries and scripts cope with mutable data.

2.4.1. Update Queries

Declarative update statements enable users to evaluate a query for specifying how data should be updated. The canonical example in SQL is the withdrawal of money from a bank account. The account is identified by the account number and its current balance is used to compute the update value: UPDATE accounts SET balance = balance - 100 WHERE account_no = 270612. Update queries like this must be processed with care because the update itself may interfere with the predicates evaluated. Otherwise, updates have ambiguous semantics or can result in uncontrollable behavior at runtime. This is independent of the language, the shape of data, and the kind of updates performed. Thus, appropriate precautions must be taken in a data programming language, too.

Snapshot Semantics

To circumvent such problems, update queries must be carried out in two phases. In the first phase, only the query part of the update is evaluated to compute the necessary update operations. In the second phase, the latter are applied. This prevents a declarative insert like the following from ending in an infinite loop of insertions: INSERT INTO entries SELECT account_no, max(no)+1, -100, current-date() FROM entries WHERE account_no = 270612. The splitting of update queries into two phases is also referred to as snapshot semantics, because the updates are computed on a snapshot of the data. Clearly, this property is mandatory for specifying any meaningful update.
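The two phases can be pictured as a tiny evaluation skeleton (the names and types are illustrative only): the query phase merely computes a list of pending update operations against the current snapshot, and only the apply phase installs them.

    data Update k v = Put k v | Delete k

    runUpdateQuery :: (store -> [Update k v])           -- phase 1: plan on snapshot
                   -> (store -> [Update k v] -> store)  -- phase 2: apply
                   -> store -> store
    runUpdateQuery plan apply db = apply db (plan db)

Because plan never sees the effects of apply, a self-referencing insert like the one above terminates after enumerating the snapshot.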


Transactions

Transactions are the second major concern of update processing, particularly in database systems. A transaction is a programming abstraction that shields the users of a database system from reliability and consistency issues arising from concurrency4, failures, and distribution [GR93]. Broken down into key words, transactions guarantee atomicity, consistency, isolation, and durability for queries and updates. Conceptually, transactions do not affect the outcome of queries and updates5. They solely provide a transparent execution context, which can span multiple queries and updates. Consideration or even enforcement of transactional properties is therefore out of the scope of a language specification. Nevertheless, efficiency and correctness concerns usually require a tight integration of transaction processing and query evaluation.

2.4.2. Immutability

For the time a query is evaluated, each value read or created must be logically immutable. This guarantees that query predicates hold once they are evaluated and, thus, it prevents chaos. In principle, immutability need not be enforced explicitly. In the first place, it is a direct consequence of two-phased update processing or follows from the general absence of update primitives. In fact, snapshot semantics will guarantee that even those values that will be modified in the update phase appear to be immutable in the preceding query phase. Immutability is a valuable property in many other ways, too. It simplifies reasoning about data and enables the compiler to perform various kinds of optimizations like constant folding and memoization. For parallel processing, it is paramount as it rules out race conditions6. Concurrency and transactions (see 2.4.1) are orthogonal aspects on a concrete target platform.

4Note that concurrency control mechanisms must not be confused with the aforementioned snapshot semantics of update queries.
5Practically, optimizations like non-serializable isolation levels and savepoints relax this property in favor of faster response times and higher throughput.
6In fact, immutability of data is the major reason for the current renaissance of functional principles in the context of highly parallel architectures.


Note that, to benefit from immutability in scripts, the first phase of update processing must be extended over all subqueries, i.e., all updates specified therein are deferred until the end of the script. Obviously, this is a rather rigid restriction, but it is necessary to avoid arbitrarily complex analysis of which intermediate results are affected.

2.5. Runtime Aspects

After having described the general structures of a data programming language, it must be clarified how queries and scripts are actually evaluated. In the following, we focus on aspects that may influence the outcome of a query. Concrete evaluation techniques and technicalities are the subject of other chapters.

2.5.1. Evaluation Model

In principle, every part of a query, from simple arithmetic to function calls, is treated as an expression, which yields a result value. An expression may consist of several subexpressions, and even individual statements in a script are subexpressions in a sequence of expressions. Accordingly, queries and scripts form trees of expressions, which are evaluated recursively. Individual expressions may be evaluated in an eager or a lazy fashion. With eager evaluation, the result value is computed and materialized instantly, whereas with lazy evaluation, a placeholder value mimics the materialized result and computes it incrementally on demand [LS88, Lie87]. The query processor is free to choose the best evaluation strategy for an expression with respect to CPU consumption and memory overhead. The query processor is in principle also free to evaluate the expression tree in arbitrary order or even in parallel – as long as the respective expressions are independent of the surrounding context. The context of an expression is an implicit environment of variable bindings, which influences its outcome. Variable bindings refer to user-specified input parameters, but also reflect results of preceding statements, function arguments, and, primarily, the lambda abstractions in operator compositions as mentioned in Section 2.3. As will be shown, modeling of this context is key to scalable query processing.

Thus, large parts of this thesis will primarily focus on how to organize the dynamic environment and how to evaluate context-dependent expressions. Inside expression trees, sections with basic operations like data accesses, function calls, and general computations can be arbitrarily interleaved with bulk processing operators. Clearly, the high data volumes intended to be processed by the latter demand special treatment. To leverage relational bulk algorithms, an operator composition is treated like a pipeline of relational operators. Within the whole query, however, the composition acts as a single expression. The localization of bulk processing logic offers various possibilities for optimized processing. At runtime, the representation as higher-order functions can be completely dissolved, e.g., to pipeline intermediate tuples between operators. Depending on the shape of a query, the input data, and available system resources, a composition can be processed as a pull-based operator tree or a push-based data-flow graph, sequentially or in parallel, with or without materialization of intermediate results.

2.5.2. Side Effects

Functions that directly or indirectly interact with the system or external resources may cause side effects or deliver indeterministic results. This is critical for the same reasons as mutable values are. Indeterminism can result in manifold problems, including silent errors and crashes. Unfortunately, it is not possible to prevent all side effects and indeterminism in practice. For example, if a query or a script reads input from a file or another uncontrolled resource, then there is no guarantee that this input is stable and can be read repeatedly. Therefore, functions causing or relying on side effects should be avoided whenever possible. In consequence, the direct use of algorithms that rely on mutable data structures is restricted or at least discouraged. Functional programming languages face exactly the same problems and have developed strategies to cope with side effects. The most obvious solution is the specification of an evaluation order for certain constructs, which allows the developer to reason about the occurrence of side effects. Pure functional languages do not permit side effects by design and help themselves with monadic type constructs, which encapsulate indeterminism in a proper functional value (see Section 4.2.1). Internally, monadic values enforce a specific ordering of side-effecting actions.


The application of strict execution policies deprives a query language of many optimizations, including the most important ones: operator reordering, lazy evaluation, and parallelism. As a compromise, we establish a few basic rules from which happens-before relationships can be derived, which in turn allow reasoning about program execution. Generally, a query term may be evaluated in any order that seems appropriate to the query processor. The summation a+b, for example, may evaluate operand a before b, but also b before a. This rule applies to all kinds of terms except the ones described in the following.

Conditional Branching

Language constructs which implement conditional branching like an if-then-else must always evaluate in the same sequential order. In a term if (c) then a else b, the condition c is evaluated first and, depending on the outcome of c, either a or b is evaluated next. It is neither allowed to evaluate the branches before the condition, nor is it allowed to evaluate both of them. The same rationale applies to all supported variations of conditional execution like switch-case.

Conjunction and Disjunction

Boolean expressions consisting of conjunctive and disjunctive clauses are evaluated sequentially, but the evaluation stops as soon as the final truth value is fixed. The conjunction a and b evaluates a first and b only if a yields true. The disjunction a or b evaluates a first, too, but b only if a yields false.

Statements

The individual subqueries in a query script are evaluated in their order of appearance, i.e., a statement vi:=qi is evaluated before the following statement vi+1:=qi+1. Effectively, this makes it possible to serialize most practically relevant cases where side-effect-afflicted I/O occurs, e.g., when writing and reading intermediate results to and from disk, respectively.


Lazy Evaluation

Despite the above evaluation rules, there are still some aspects left open which influence whether and how often a side-effecting action will take place. Namely, it must be clarified what it actually means if one expression is said to be evaluated before another. The evaluation process embodies two main sources of uncertainty. If an expression is evaluated in a lazy fashion, it may happen that a suspected side effect has not yet occurred while another expression is evaluated that is supposed to happen logically after the side-effecting one. Related to this problem is the question of whether or not intermediate results are materialized. If the result of an expression which produces side effects is not materialized but recomputed every time it is required, e.g., in a recursive function, then the respective side effect can occur multiple times. Because there exists no general solution to this problem except insisting on eager evaluation with materialization for all intermediate results, we simply negate the naïve assumption that an expression is evaluated exactly once and in its entirety. Instead, the developer is required to wrap a critical expression in a special construct, e.g., strict{write-to-file(...)}, which instructs the compiler to evaluate it eagerly and only once, and to materialize the result.

Optimization

With regard to side effects, optimizations conducted by the compiler can lead to unexpected results. Reordering of operators is one of them. It causes problems if, e.g., two operators in a chain of operators are swapped, whereby the formerly first one causes side effects on which the second one depends. Other major optimizations that may silently suppress side effects are the removal of unused subexpressions, the replacement of expressions with output-equivalent alternatives, and early exits, i.e., the partial evaluation of query parts, which stops as soon as the final result can be constructed. Last but not least, parallelism is another source of indeterminism that jeopardizes all chances to reason about side effects. A general ban of the mentioned and similar optimizations is not an option, because optimization is one of the central strengths of a declarative language. Furthermore, functions with side effects are rarely necessary in data processing and should be avoided anyway.

Therefore, the compiler is allowed to generally assume that a function does not cause side effects unless it is annotated accordingly by the developer. In the latter case, the compiler must not attempt optimizations which reorder or suppress the occurrence of side effects with respect to the strict evaluation of a non-optimized query. Passive but side-effect-dependent sections must be annotated similarly to avoid reordering, too. Although this does not determine the outcome of queries with side effects in general – the language remains declarative and is not based on a fully specified execution model – it allows the developer at least to reason about the query with respect to the canonical evaluation strategy of a specific target platform, which, for example, evaluates operators in a pipelined nested-loops style. In a certain sense, annotating a function as a source of indeterminism taints a – potentially large – part of the query, which cannot profit from advanced optimization or parallelism. Thus, a front-end language may alternatively offer a special language construct which prohibits optimization of critical regions, e.g., restricted{...}.

2.5.3. Error Handling

Compilation and evaluation of a query may lead to various kinds of static and dynamic errors. Most of them are related to typing issues, e.g., when a function is called with illegal argument values. Without delving into details, we assume the compiler to perform the respective type checks at runtime and statically, if possible. Other kinds of runtime errors concern a broad spectrum reaching from unavailable resources and dynamic failures (e.g., division by zero) to constraint violations in a specific data model. Errors are always reported immediately and lead to the abortion of a query. However, expressions may be nested in try-catch constructs, as in programming languages, to enable automatic recovery from runtime errors, e.g., to work around partially invalid inputs. If the compiler restructures and optimizes a query, it may happen that some errors will not be raised by the optimized version, because the erroneous situation does not occur anymore. For example, an illegal comparison in a search predicate might not be evaluated in an optimized query, because the data fulfilling the query predicate is accessed directly using an additional index.

Without restraining optimization efforts, we resort here to the rule set of XQuery [W3C10c], which allows non-detected errors in optimized queries as long as occurring errors are not purposely suppressed. The other way around, i.e., optimizations that yield errors which would not occur in the unoptimized query, is not allowed.

3. Extended XQuery

XQuery is a query language for XML data that was specified by the W3C almost a decade ago to consolidate various efforts in declarative XML processing [W3C98b, W3C98c, RCF00, W3C10a]. In the course of several revisions, the language grew into a versatile functional programming language, but still has its major strengths in XML bulk processing. For this thesis, the most important aspect of XQuery is the ability to specify query patterns and control flows in a data- and platform-independent manner. In combination with the composition features of a functional language, this makes XQuery a decent demonstration candidate for the working principles and compilation aspects of a data programming language. This chapter presents the key aspects of XQuery and introduces a few extensions, which help to draw a more complete picture of our vision. To illustrate the conceptual independence of data processing concepts and physical data representation, this work augments XQuery with data structures and operation primitives for the competing serialization format JSON [Cro06]. With a small syntax extension, we also uncover already built-in scripting capabilities.

3.1. Data Model

The XQuery and XPath Data Model (XDM) [W3C11b] is built around the family of XML standards, e.g., [W3C06a, W3C04a, W3C04b]. As a result, XDM is a relatively complex foundation for a query language but, fortunately, most subtleties of XML can be confined to node construction, schema validation, and serialization. For this thesis, it is sufficient to concentrate on the main aspects of XDM and show the correspondences to the basic data model described in the previous chapter.


[Figure 3.1 depicts the hierarchy of items and sequences: a Sequence consists of Items; an Item is either a Node (of kind Element, Attribute, Text, Document, Comment, Proc-Instr, or Namespace), a Function, an Array, an Object, or an atomic value of type xs:anyAtomicType (e.g., xs:untypedAtomic, xs:QName, xs:string, xs:boolean, xs:double, xs:float, xs:decimal, xs:integer, xs:date).]

Figure 3.1.: Items and sequences in the extended XQuery data model.

3.1.1. Items and Sequences

The central concept in XDM is the notion of items and sequences as illustrated in Figure 3.1. Sequences are ordered lists of zero or more items. Conceptually, they are array composites with the speciality that nestings of sequences are always flattened, i.e., it is not possible to build hierarchical structures of sequences. Single items are treated as equivalent to a sequence of one item. Because the empty sequence () is also used to represent missing data, it serves as the null value in XDM. Originally, items are distinguished into three classes. Atomic items are XDM's notion of primitive values like strings and numerics. Node items represent the nodes of an XML document tree. Function items are objects representing XQuery functions, which can be evaluated for a set of arguments.

In our extended version, array items and object items reflect the respective data structures from the JSON data model.

XML

Node items are the standard abstraction for representing structured and semi-structured data in XML. Nodes appear in seven different kinds, e.g., as elements, attributes, or text, and have various kind-specific properties like a name, a type, a typed value, a text value, a set of children, and a parent. Furthermore, nodes have an identity, which also has implications for XML processing. Altogether, these properties are defined in the XML Infoset specification [W3C04a], which specifies how XML nodes can be composed to model data as XML documents or fragments. Conceptually, a node item reduces to a complex composite of array and map values, but choosing a concrete composition form is not necessary for our discussion. Also unnecessary for this thesis, and thus generally omitted, are advanced XML specifics like serialization, whitespace handling, and XML namespaces. For details, we refer to the respective standard documents [W3C11c, W3C09].

Objects and Arrays

Object items and array items are extended variants of the respective structures in JSON, which, in turn, represent incarnations of the two fundamental composition types map and array of Section 2.1.1. For a seamless integration with common idioms of XQuery, array items are indexed with atomic integer values, while objects are keyed by QNames. QNames are the standard atomic type for object names in XDM. Furthermore, nodes and sequences are legal values, which de facto extends the basic JSON model. Null and the JSON value types boolean, string, and numeric are mapped to the empty sequence and atomic value types, respectively. A basic constructor syntax for static JSON items was already exemplified in Table 2.1. Examples for dynamically computed objects and arrays are given in Table 3.1.


Value Type   Examples
Array        [ 1+1, substring("banana", 3, 5), () ]
               evaluates to [ 2, "nana", null ]
             [ (1 to 5), [ "x", true() ] ]
               evaluates to [ (1,2,3,4,5), [ "x", xs:boolean(1) ] ]
             [ =(1 to 5), 19.54, false() ]
               evaluates to [ 1, 2, 3, 4, 5, 19.54, xs:boolean(0) ]
Object       { "a": 1, 'b' : 2, c : 3 }
               evaluates to { "a" : 1, "b" : 2, "c" : 3 }
             { a : concat('f','oo') , b : [1,2][[1]] }
               evaluates to { "a" : "foo", "b" : [2] }
             let $r := { x:1, y:2 } return { $r, z:3 }
               evaluates to { "x" : 1, "y" : 2, "z" : 3 }
             { x : 1, y : 2, z : 3 }{z,y}
               evaluates to { "z" : 3, "y" : 2 }

Table 3.1.: Dynamic construction of array and object items.

3.1.2. Properties and Accessors

Various kinds of relationships exist between items and sequences. The name of an XML element node, for example, is an atomic value. Both can be part of several sequences, and the element can also be parent, child, and sibling of other nodes in an XML fragment. XDM defines many of these relationships explicitly in the form of properties. Operations in XDM are simple accessors for sequences and node item properties. In total, 17 accessors cover the logical aspects of a node. The central ones are accessors for structural properties (attributes, children, parent) and accessors for value and type properties (node-name, string-value, typed-value, type-name, node-kind). The other accessors cover miscellaneous XML-related properties (e.g., document-uri, is-id, etc.).


In XPath/XQuery, accessors are either exposed as functions (e.g., fn:name()) or as part of higher-level axis step expressions. The latter evaluate whole combinations of item properties. The axis step expression child::revenue, for example, fetches the child list of the context node through its children accessor and then uses the node-kind and node-name accessors to filter this list for element nodes having the name revenue. Of course, query processors need not strictly follow this procedure and may implement more efficient algorithms for evaluating paths in XML trees. JSON objects and arrays each provide only a single accessor, which they inherited from the underlying composition type. Maps expose the property entries for enumerating the key/value pairs. Arrays expose their field values via the property fields. Note that this is already sufficient to implement the access operations presented in Section 2.2.1, but, as with XML nodes, the query processor may realize most of them with more efficient strategies like associative lookups.
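A sketch of this accessor-based interpretation of an axis step like child::revenue (the Node record is an assumption for illustration; a real engine would navigate its native storage layout instead):

    data Kind = Element | Attribute | Text deriving (Eq, Show)

    data Node = Node { nodeKind :: Kind
                     , nodeName :: String
                     , children :: [Node] }

    -- child::name = children accessor + node-kind/node-name filters
    childStep :: String -> Node -> [Node]
    childStep name ctx =
      [ c | c <- children ctx, nodeKind c == Element, nodeName c == name ]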

3.1.3. Types

A central design goal for XDM was to support decent XML processing both for properly typed, schema-conform data and for schema-less or unvalidated XML. The resulting type hierarchy therefore needed to embody the original type system from XML Schema. In addition to that, XDM was required to model data, i.e., nodes, atomics, and recently also functions, in a way that allows the definition of a concise and consistent query language. As a result, the type system of XDM is characterized by a dualism created by merging two distinct type hierarchies, the item hierarchy and the XML Schema hierarchy, into one.

XML Schema

The XML Schema part of the hierarchy is rooted at xs:anyType. It is shown in Figure 3.2. Notably, the branch xs:anyAtomicType overlaps with the item type hierarchy. The dualism requires using the merged type hierarchy from the appropriate perspective. The item perspective comes close to the type systems of standard programming languages and is the default in XQuery. The XML Schema perspective comes into play when XML needs to be validated against a schema definition, which may be the case before, during, or after the actual query.


[Figure 3.2 depicts the XML Schema part of the type hierarchy: xs:anyType sits at the root (next to Item); it subsumes xs:anySimpleType and user-defined complex types; below these sit xs:untyped and xs:anyAtomicType as well as the list types xs:IDREFS, xs:NMTOKENS, and xs:ENTITIES and user-defined list and union types.]

Figure 3.2.: XML Schema part of the dual XDM type system.

After successful validation, XML nodes are labeled with a corresponding schema type annotation, which effectively determines their (sub-)type in the item perspective. The special types xs:untyped and xs:untypedAtomic serve as default types for unvalidated XML nodes and their content, respectively. As a convenience feature, values of the latter type are subject to implicit conversion rules, which simplify working with data originating from textual XML representations. As in the item type perspective, every major group in XML Schema has extension points for user-defined types. The shared branch of atomic types is interpreted in both perspectives as a hierarchy of primitive values, which supports subtyping by domain restriction.

Sequence Types

On top of the item perspective, XQuery uses the notion of sequence types for referring to the type of results, function arguments, etc.


A sequence type is a pair of an item type and an optional occurrence indicator ?, *, or +, meaning zero-or-one, zero-or-many, and one-or-many, respectively. The special type empty-sequence() represents the empty sequence. Sequence type matching is a mechanism for ensuring the type of a sequence. A function foo(xs:date,xs:int*)→ xs:string+, for example, may only be invoked with two arguments, the first argument being an atomic value of type xs:date or any subtype of xs:date and the second argument being a sequence of atomic values of type xs:int or any subtype of xs:int. Furthermore, the function must only return a non-empty sequence of xs:string values. The respective type validation must be performed both statically and dynamically. Typing errors raise exceptions in the runtime system.
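The occurrence-indicator part of sequence type matching boils down to a simple cardinality check, as the following sketch illustrates (item type checks are omitted; the names are ours):

    data Occurrence = ExactlyOne | ZeroOrOne | ZeroOrMany | OneOrMany

    matchesOccurrence :: Occurrence -> [item] -> Bool
    matchesOccurrence ExactlyOne xs = length xs == 1
    matchesOccurrence ZeroOrOne  xs = length xs <= 1
    matchesOccurrence ZeroOrMany _  = True
    matchesOccurrence OneOrMany  xs = not (null xs)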

3.1.4. Additional Concepts

To simplify common situations, XQuery defines three additional concepts for items and sequences. For the sake of completeness, we name them here, but refer to the language specification for details:

Document Order A total, stable ordering of all node items in a query. Within XML fragments, document order is defined as the order in which the nodes appear in the serialized representation.

Atomization A coercion mechanism for converting arbitrary sequences into sequences of atomic values. It is primarily applied to the arguments of arithmetics, comparisons, function calls, etc.

Effective Boolean Value An implicit truth value of items and sequences for convenient use in logical and conditional expressions. It is based on common programming language idioms, e.g., a non-empty string has the effective Boolean value true and the empty sequence has, as representation of the null value, the effective Boolean value false.

3.2. Expressions

A query is a tree of expressions that is evaluated to a sequence of items. The spectrum of expression types reaches from basic ones like literals, arithmetics, and function calls, to XQuery-specific FLWOR expressions and expressions derived thereof, like filter expressions and path expressions.

A full overview of all built-in expression types of XQuery 3.0 is given in Table 3.2. As mentioned in Chapter 2, equivalents to the basic expression types can be found in almost any programming language. FLWOR-based expressions, however, are the language-specific equivalent to bulk operator compositions and serve fundamental data processing tasks. As such, they play a central role in this thesis and will be introduced briefly in the following.

3.2.1. FLWOR Expressions

FLWOR expressions are the language's backbone. The name stems from their general structure – a sequence of clauses, i.e., composable language primitives for iterating, filtering, reordering, and grouping of data. The standard clauses are for, let, where, order by, and return. In the current working draft for version 3.0 of XQuery [W3C10c], they are accompanied by the clauses count, window, and group by1. Inside a FLWOR expression, the individual clauses operate on a logical tuple stream. A tuple is a set of named variables, which are bound to XDM sequence values by individual clauses and visible to all following clauses within the same FLWOR expression. Each sequence of clauses is terminated by a single return clause, which reduces the tuple stream to a single XDM sequence – the result of the FLWOR expression. For illustration, consider the sample query and the corresponding expression tree shown in Figure 3.3. The query consists of two nested FLWOR expressions with for-loop clauses. The outer loop binds the values 1, 2, and 3 successively to a variable $a and evaluates for each binding the nested FLWOR expression, which itself loops with a run variable $b over the sequence (2,3,4) to compute {$a,$b} pairs. The let clause binds the sum $a+$b for each pair that fulfills the where predicate $a=$b to a variable $c. Finally, the return clause produces the output by concatenating all values bound to $c. Accordingly, we obtain the result (4,6). The correspondence between FLWOR clauses and bulk operators is obvious. The for, let, and where clauses map to the ForEach and Select operators, respectively.

1All XQuery examples in this thesis are based on the syntax and language features of the W3C XQuery 3.0 working draft from December 13, 2011.


Expression Type              Examples
Literal                      11   1.34e10   "a string"
Variable Reference           $var
Parenthesized Expression     (...)
Context Item Expression      .
Static Function Call         fn:substring("brothers", 3, 6)
Named Function Reference     fn:substring#3
Inline Function Expression   function($a) as xs:integer { $a*$a }
Filter Expression            (19,81,11,11)[. > 11]
Dynamic Function Call        $fun("foo","bar")
Path Expression              $n//person/address:address_t/city[2]
                             $n/parent::*/attribute::id
                             $number!sqrt(.)
Sequence Expression          3, foo("bar"), $x+1
Range Expression             1 to 500
Node Sequence Combination    (,) union ()   (and intersect, except)
Arithmetic                   3 + 4.8   (and -, *, div, idiv, mod)
String Concatenation         "foot" || "ball"
Value Comparison             3 eq 1   (and ne, lt, le, gt, ge)
General Comparison           (3,1,5)=(9,5)   (and !=, <, <=, >, >=)
Node Comparison              1 is 1   (and <<, >>)
And/Or Expression            "yes" and "no"   false() or true()
Node Constructor             {"con" || "tent"}
                             attribute "id" { 340 }
                             text { "some value" }
FLWOR Expression             for $p in doc("parts.xml")
                             where $p/price > 5000 return $p/@id
(Un-)ordered Expression      unordered {...}   ordered {...}
Conditional Expression       if ($i>10) then "y" else "n"
Switch Expression            switch ($i) case "I" return 1
                             case "II" ... default return -1
Quantified Expression        some $x in $y satisfies check($x)
                             every $x in $y satisfies check($x)
Try/Catch Expression         try { calc() } catch * { "oops" }
Instance Of Expression       5 instance of xs:integer
Typeswitch Expression        typeswitch ($node)
                             case $n as element(*,address_t) return $n/city
                             case ... default return "unsupported type"
Cast Expression              "1.3" cast as xs:double
Castable Expression          "1.3" castable as xs:double
Constructor Function         xs:boolean("true")
Treat Expression             $nums treat as xs:integer+
Validate Expression          validate strict { doc('data.xml') }

Table 3.2.: Overview of expression types in XQuery 3.0.

for $a in (1,2,3)
return for $b in (2,3,4)
       where $a=$b
       let $c := $a+$b
       return $c

⇓

[Expression tree: the root flwor node carries a "for $a" clause over the literal sequence (1,2,3) and a "return" clause; the latter holds a nested flwor node with a "for $b" clause over (2,3,4), a "where" clause with the comparison $a = $b, a "let $c" clause with the sum $a + $b, and a "return" clause with the variable reference $c.]

Figure 3.3.: Expression tree for nested FLWOR expression.

The return clause reflects a combination of Project and GroupBy. Variable bindings directly translate to tuple positions in the intermediate relations.
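To make the correspondence tangible, the query of Figure 3.3 can be replayed with the list-based operator sketches from Section 2.3 (an illustration only; the actual compilation is the subject of Chapter 4):

    result :: [Int]
    result = map last                                           -- return $c
           $ forEach (select (forEach (forEach [[]] (\_ -> [[a] | a <- [1,2,3]]))
                                      (\_ -> [[b] | b <- [2,3,4]]))
                             (\t -> t !! 0 == t !! 1))          -- where $a=$b
                     (\t -> [[t !! 0 + t !! 1]])                -- let $c := $a+$b
      where
        forEach rel bind = concat [ map (t ++) (bind t) | t <- rel ]
        select rel p     = filter p rel

    -- result == [4,6], matching the query result (4,6)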

3.2.2. Filter Expressions

Filter expressions allow an item sequence to be filtered by a given positional or general predicate. The filter expression ("a","b","a")[1], for example, selects the first item "a" from the input sequence, whereas the predicate in ("a","b","a")[.="a"] matches both "a" string items, yielding the sequence ("a","a"). Intrinsically, filter expressions are syntactic sugar for plain FLWOR expressions as shown in Figure 3.4.


e1[e2]  ⇔  let $tmp := e1
           let $fs:last := fn:count($tmp)
           for $fs:dot at $fs:pos in $tmp
           let $p := e2
           return if (fs:is-numeric($p))
                  then if ($p eq $fs:pos) then $fs:dot else ()
                  else if ($p) then $fs:dot else ()

Figure 3.4.: Equivalence of filter expressions and FLWOR expressions.

The initial expression e1 is evaluated once and its result is bound to a temporary variable $tmp. The size of the result is bound to an auxiliary variable $fs:last. The second expression e2 is the filter predicate. It is evaluated for each item in $tmp by iterating over the input sequence with a run variable $fs:dot and a position variable $fs:pos. The variables $fs:dot, $fs:last, and $fs:pos define the implicit evaluation context for evaluating e2 (see also Section 3.3). The return clause consists of a two-staged conditional expression, which simply returns the item currently bound to $fs:dot if the predicate is fulfilled and emits the empty sequence () otherwise.

3.2.3. Path Expressions

Path expressions have probably attracted the most attention of all expression types, because they are the building block for navigating XML trees. Syntactically, they have the form e1/e2 with e1 and e2 being individual subexpressions. Their special role in XML processing is owed to the fact that the second expression evaluates in the context of the first, i.e., e2's result is dependent on the outcome of e1. This is achieved by interpreting path expressions as composites of a FLWOR expression and the two step expressions e1 and e2 as shown in Figure 3.5.


e1/e2  ⇔  fs:ddo (
            let $tmp := e1
            let $fs:last := fn:count($tmp)
            for $fs:dot at $fs:pos in $tmp
            return e2
          )

Figure 3.5.: Equivalence of path expressions and FLWOR expressions.

The evaluation of path expressions is similar to the aforementioned filter expressions, except that the sequences obtained by evaluating e2 are part of the result. The FLWOR is wrapped in a call of the auxiliary function fs:ddo(), which enforces some XML-specific ordering and typing properties like distinct-document-order on the result sequence. XQuery 3.0 introduces the map operator ! as a generalized variant of the traditional path operator /. A path expression e1!e2 is then evaluated as shown before, except that the map operator does not imply the use of the fs:ddo() function. In practice, most step expressions are so-called axis steps, which allow path expressions to be used for pattern-based navigation in complex XML structures. As already shown in Section 3.1, axis steps basically consist of an axis specifier and a node test, which evaluate the XDM accessor functions for a given node. The context node is the implicitly bound variable $fs:dot. Optionally, an axis step may also have a trailing list of predicates enclosed in '[]' for further filtering the list of qualified result nodes. The predicates in axis steps are treated similarly to filter expressions, except for subtle differences in the treatment of positional predicates when the axis specifier navigates in reverse document order.

3.2.4. Quantified Expressions

Quantified expressions yield a truth value for a predicate over a collection of values. A some quantified expression iterates with a variable over a sequence and yields true if the predicate holds for any value in the sequence. In an every quantified expression, the predicate must hold for all values in the sequence.


some var in expr            some(
satisfies predicate    ⇔      for var in expr
                              return predicate
                            )

every var in expr           every(
satisfies predicate    ⇔      for var in expr
                              return if (predicate)
                                     then xs:boolean(1)
                                     else xs:boolean(0)
                            )

Figure 3.6.: Interpretation of quantified expressions as FLWORs.

The standard considers quantified expressions as independent expression types [W3C10b]. However, they can also be emulated by simple FLWOR expressions as shown in Figure 3.6. A single for loop iterates over the input sequence and the return expression evaluates the respective predicate. The actual quantification is finally performed by passing the result of the FLWOR as argument to a helper function.

For the every quantifier, one must take care that the predicate does not evaluate to the empty sequence, because it would be discarded from the result even though it should cause the whole quantification to evaluate to false. Accordingly, the respective predicate in Figure 3.6 is wrapped in an if-then-else as a guard against empty sequences.

A query processor is allowed to evaluate quantified expressions in arbitrary order and to short-circuit the evaluation as soon as the truth value is determined. In the FLWOR representation, this obvious optimization is no longer visible, but for the sake of simplicity, we assume that quantified expressions will always be converted to corresponding FLWORs. With regard to efficiency, we can also safely assume that the compiler understands the semantics of the employed helper functions and introduces appropriate short-circuiting techniques.


3.3. Evaluation Context

Some expressions and FLWOR clauses depend on variables and other external factors which affect their result or their behavior. This external environment is divided into a static and a dynamic part. The plain variable reference expression $a, for example, evaluates to the value currently bound to the respective variable. The FLWOR clause order by $a refers to the tuple position associated with the variable $a to get the sorting key for incoming tuples, and it depends on the static order empty declaration indicating whether to treat empty sorting keys as the greatest or least possible value in comparisons.

3.3.1. Static Context

The static context contains information for the static analysis and compilation of a query like namespace declarations, imported libraries, function signatures, and schema types. Aside from this, it contains default options for runtime operations like orderings, URI resolution, etc. Most importantly, the static context contains all global and external variables and provides default values for the implicit context variables. The implicit context consists of the artificial context item variable $fs:dot, its position in a context sequence ($fs:pos), and the size of the latter ($fs:last). In principle, they are plain variables in the static context, which are overridden in desugared filter and path expressions as shown in Sections 3.2.2 and 3.2.3. However, they cannot be accessed like normal variables. They can only be referred to via the context item expression '.' and the special functions fn:position() and fn:last(), respectively.

3.3.2. Dynamic Context

The dynamic context is the environment in which expressions evaluate individual subexpressions. Besides platform- and runtime-specific settings like available resources and the current date, it consists of all dynamically-bound variables in the current scope. Accordingly, simple expressions like arithmetics or comparisons evaluate subexpressions in the same context they are evaluated in, whereas variable-binding expressions or FLWOR clauses modify the dynamic context beforehand.


In a FLWOR expression, the variables currently bound by for, let, window, and count clauses can be referenced in all expressions of subsequent clauses. Within the for loops in Figure 3.3, e.g., individual subexpressions like the comparison and the summation are evaluated multiple times, but each time within a different iteration, i.e., dynamic context. Similarly, each step in a path expression provides a context node to the next step by iteratively binding its local result to the implicit context variables $fs:dot, $fs:pos, and $fs:last. Non-FLWOR-based expressions which bind variables are, e.g., typeswitch expressions, quantified expressions, and try-catch expressions.

3.4. Scripting

XQuery is not a scripting language in the sense of an operating system shell or even a general-purpose scripting language. It was designed as a functional-style query language and therefore lacks an explicit control flow and does not support mutable data structures. Nevertheless, the language accommodates the simpler form of data processing scripts discussed in Chapter 2 very well. Recall that they consist of a simple sequence of statements, which bind the result to a variable for further reuse. XQuery's equivalent to a statement is a let binding. A let clause evaluates an expression, i.e., a query, binds the result to a variable, and continues with the next clause in the FLWOR. Accordingly, a script is simply a sequence of let bindings in the same FLWOR expression. As a convenient alternative to the standard FLWOR syntax, we may therefore enrich XQuery with a special syntax, which translates a sequence of ';'-terminated statements into a single FLWOR expression as exemplified in Figure 3.7. A statement of the form $v := e; is directly translated to let $v := e. Statements without variable assignment like the function call write-file(...); are complemented with an anonymous variable. The result of a script FLWOR is simply the result of the last statement. Note again that statements are a simple syntax extension to simplify the writing of data processing tasks already covered by regular XQuery. It is neither a subset of nor an equivalent to the XQuery Scripting Extension 1.0 [W3C10d], which defines an explicit control flow and weakens immutability guarantees by introducing explicit side effects.


$start := now();
$log := fn:collection("log.txt");
$report := generate-report($log);
write-file('reports.txt', $report);
for $i in search-incident($report)
return send-report('[email protected]', $i);
$end := now();
"Duration: " || ($end - $start) || "ms.";

⇔

let $start := now()
let $log := fn:collection("log.txt")
let $report := generate-report($log)
let $v0 := write-file('reports.txt', $report)
let $v1 := for $i in search-incident($report)
           return send-report('[email protected]', $i)
let $end := now()
let $v2 := "Duration: " || ($end - $start) || "ms."
return $v2

Figure 3.7.: FLWOR-based statement syntax for scripting.

4. Hierarchical Query Plan Representation

The translation of a query, or respectively a script, into an executable query plan is a complex process. The initial query string is parsed into an internal representation, which is subject to various transformations and optimizations, and finally translated into an executable form. Therefore, a suitable intermediate representation is essential for the compilation process. In the following, we present the theoretical foundation of a compiler which takes advantage of a hierarchical top-down representation for modeling nested evaluation scopes and variable bindings in a query.

4.1. Requirements

Before introducing the top-down query representation, we first motivate a catalog of requirements that should be covered:

Conciseness The query representation should reflect the query intent in a crisp and consistent form to facilitate both query rewriting and standard compilation techniques (simplification, dead-code elimination, etc.).

Portability To support multiple platforms, query representation and rewriting rules should abstract from platform- and machine-specific aspects (e.g., CPU registers, caches, memory) to the greatest possible extent, or at least up to the final stages before the platform-specific translation into an executable query plan.

Set-orientation Support for set-oriented optimization and translation is mandatory for data processing. The query representation should reflect set-oriented aspects in a way that allows the optimizer to perform common and data-specific optimizations. Chances for parallel query execution should be maximized and easy to exploit. However, all this should not interfere with other parts of a query and should still permit a decent representation of non-bulk-oriented sections.

Abstract Data Access To maximize the portability of compiler rules, there should be a clear-cut separation of query logic and data access primitives. Data access operations should primarily be treated as abstract operations to support the translation to both generic implementations (e.g., adapters) and efficient native alternatives. Furthermore, storage-specific operations and indexes should be easy to integrate, too.

Extensibility An extensible query representation enables not only support for native storage, but also for specialized operations and algorithms like XML twig joins [BKS02]. With a customizable compiler kernel, specific functionality should only require minimal adjustments of the compilation process – ideally, in the form of a few rewrite rules and small utility functions.

With these design goals in mind, the first step towards an appropriate query representation is to look at bulk operators and variable bindings from a more abstract perspective. As anticipated in Section 2.3.4, we want to abstract referential dependencies to variable bindings with the help of lambda abstractions. The following section introduces the theoretical foundation for operator compositions and rewriting rules, which finally leads to a sound and flexible query representation.

4.2. Query Representation

In contrast to most other query languages, our data programming language is not based on an operator algebra, but on a functional specification, where variable bindings and arbitrarily nested scopes play a central role. A typical bottom-up data flow graph of set-oriented operators is therefore not ideal for representing a query. It is too difficult to keep track of variable dependencies if variables are bound at the bottom nodes of a query plan, but referenced at higher levels in nested expressions and subqueries, which imply a local top-down view. Instead, a consistent, hierarchical top-down representation is preferable because it naturally reflects nestings and variable scopes, which notably simplifies the analysis of variable dependencies. Furthermore, a proper tree structure simplifies routines for pattern matching and rewriting rules.

4.2.1. Comprehensions

In functional programming, lambda abstractions occur in various syntactical forms. For implementing variables, i.e., dynamic bindings of values to names, many languages support constructs like the let-clause bindings in XQuery. A simple let, however, binds a variable only once in a scope and cannot reflect dynamic computation. Therefore, some languages use lambda abstractions for realizing list comprehensions, a syntactic structure for representing iterative computations on lists as functional values [ASS85]. In the functional programming language Haskell, for example, the list comprehension [x|x<-[1..9],x<=3] represents the list [1,2,3], which is derived by iterating over the integers 1,2,...,9 and picking all values less than or equal to 3 to construct the result. In each iteration, the variable x serves as a symbolic name for the current value in the integer sequence. A closer look at this comprehension reveals that it already incorporates functionality similar to the bulk operators ForEach and Select: it enumerates and filters data. Conceptually, a list comprehension has the basic structure [ t | q ] with a term t and a qualifier q [Wad90]. A qualifier is either the empty qualifier Λ, a generator of the form x←u, a filter b, or a composition of qualifiers (p; q). A comprehension is interpreted by applying the following rules¹:

1Function applications are written in the classic functional syntax of the original paper without parentheses around the function arguments. The term foo a b, for example, could also be written as foo(a, b) and denotes the application of a function foo on two arguments a and b.

61 4. Hierarchical Query Plan Representation

a) [ t | Λ ] = unit t
b) [ t | x←u ] = map (λx → t) u
c) [ t | b ] = if b then [ t ] else [ ]
d) [ t | (p; q) ] = join [ [ t | q ] | p ]

The function unit in rule a) is a constructor, i.e., for the empty qualifier Λ, [ t | Λ ] creates a list of the single value t. In a generator x←u, x denotes a variable and u denotes a list value. Rule b) says accordingly that the higher-order function map creates a list by applying the function defined by the term t to each element x of the input list u. The third rule produces a singleton list [t] or the empty list [ ] for true and false Boolean-valued terms b, respectively. Finally, rule d) recursively normalizes compositions of qualifiers to the three other cases; the function join concatenates the resulting lists into a single list. Altogether, these rules provide a framework for composing iterative computations on list values, which use variables as the glue between generators and terms.
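To see the rules at work, the Haskell comprehension from above can be desugared by hand along rules a)–d). The following sketch is our own illustration; the names unit and join are chosen to match the text:

  -- Desugaring [x | x <- [1..9], x <= 3] with rules a)-d):
  -- rule d) splits generator and filter, rule b) maps over the generator,
  -- rule c) turns the filter into unit/empty, rule a) supplies unit.
  unit :: a -> [a]
  unit x = [x]

  join :: [[a]] -> [a]
  join = concat

  desugared :: [Int]
  desugared = join (map (\x -> if x <= 3 then unit x else []) [1 .. 9])
  -- desugared == [1,2,3]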

Monads

A generalization of list comprehensions, so-called monad comprehensions, will allow us to represent all bulk operators and compositions thereof as a clear-cut concept. The great utility of comprehensions for representing queries was recognized in the literature more than twenty years ago [TW89, Feg98, Gru99]. A monad itself is a type concept built around a set of axioms, the so-called monad laws, which allow computations to be encapsulated in values. Let id be the identity function and let f ◦ g denote the composition of two functions f and g; then the three functions map, unit, and join form a monad if the following conditions hold [Wad90]:

1) map id = id
2) map (g ◦ f) = map g ◦ map f
3) map f ◦ unit = unit ◦ f
4) map f ◦ join = join ◦ map (map f)
5) join ◦ unit = id
6) join ◦ map unit = id
7) join ◦ join = join ◦ map join

62 4.2. Query Representation

To support filters as qualifiers, a monad needs the additional function zero, which creates the empty monad, e.g., the empty list monad [ ]. It must support the following supplementary rules:

8) map f ◦ zero = zero ◦ g
9) join ◦ zero = zero
10) join ◦ map zero = zero
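For the list monad, these functions are easy to write down, and the laws can be spot-checked on examples. The following Haskell fragment is a sanity check of ours, not part of the formal development:

  -- List instances of the monad functions used in the text.
  unit :: a -> [a]
  unit x = [x]

  join :: [[a]] -> [a]
  join = concat

  zero :: a -> [b]      -- ignores its argument, hence rule 8) holds trivially
  zero _ = []

  -- spot checks of laws 5), 6), and 9) on sample values
  law5, law6, law9 :: Bool
  law5 = (join . unit) [1, 2, 3] == ([1, 2, 3] :: [Int])
  law6 = (join . map unit) [1, 2, 3] == ([1, 2, 3] :: [Int])
  law9 = (join . zero) () == ([] :: [Int])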

Monads play a key role in functional languages because they allow modeling I/O, mutable state, nondeterminism, etc. in a functional setting. The theory behind and the applications of monadic constructs are a wide research field, and we refer to the literature for further details. For our discussion it is sufficient to note that monadic comprehensions are not limited to lists. They can model bags, sets, and any other type τ for which the respective functions unit_τ, map_τ, join_τ, and zero_τ obeying the above rules can be defined [Gru99]. In fact, it is possible to cover the functionality of all bulk operators by defining monads which operate on two-dimensional arrays, i.e., lists of lists. The derivation of monadic representations of ForEach, Project, and Select follows straightforwardly from the design ideas of list comprehensions. In a respective monad array², solely the four monad functions must be adapted to operate on two-dimensional arrays instead of lists. The constructor function unit_array² takes a tuple, i.e., an array, and produces a relation of this tuple. The map function map_array² applies a transformation function to each input tuple and produces the cross product of the resulting value and the input tuple. The function join_array² concatenates relations. The neutral element in this monad is the empty relation [[]], created by zero_array². For simplicity, we assume here that tuples in a relation need not be of the same size; a tuple with fewer elements is considered equivalent to a tuple in which all further elements are padded with null values. In case of Project, the actual elimination of columns happens in the final term t. Bulk operators that operate on a per-relation basis, i.e., OrderBy, GroupBy, and Count, are a bit more difficult, because they do not incorporate the simple nested-loops-like behavior of the others. They must be defined as individual monads, which also operate on two-dimensional array values.

63 4. Hierarchical Query Plan Representation

The key idea of comprehensions which compute values over whole relations, e.g., aggregates, is based on a suitable definition of the map_τ function. To give an idea of how an appropriate function definition may look, assume a simple monad sum for summing up the values in a 2-dimensional array of the form 1×n. It can be specified as

zero_sum x = [[0]],
unit_sum x = [x],
map_sum f g = foldr_sum (λx xs → (f x) +_sum xs) [[0]] g,
join_sum x = foldr_sum (++_sum) [[0]] x.

The zero element in this monad is the 2-dimensional array [[0]]. Intuitively, it represents the sum aggregation of the values of zero 1-dimensional arrays. The constructor function unit_sum takes a value, which in this monad is a 1-dimensional array, and wraps it to create a 1×1-dimensional array. The functions map_sum and join_sum employ the higher-order function foldr, a standard operator for realizing structural recursion on list types [Hut99], to enumerate, transform, and combine the fields of a given array. For a binary infix operator ⊕, the term foldr ⊕ [0] [[1], [2], [3]], for example, evaluates to [1] ⊕ ([2] ⊕ ([3] ⊕ [0])). The actual aggregation is performed by the summation function² +_sum, which adds two 1×1-dimensional arrays:

[[x]] +_sum [[y]] = [[x + y]].

The sum monad is now ready to be used. For example, the sum of all row values greater than 2 in the table [[1], [2], [3], [4]] can be written as the comprehension

[ x | x←[[1], [2], [3], [4]]; x > [2] ]_sum.

The evaluation of the comprehension follows directly from the appli- cation of the general comprehension rules and the respective functions defined for this monad:

²Note that we use the more intuitive infix notation for the binary function +_sum.

64 4.2. Query Representation

  [ x | x←[[1], [2], [3], [4]]; x > [2] ]_sum
= join_sum [ [ x | x > [2] ]_sum | x←[[1], [2], [3], [4]] ]_sum
= join_sum [ if x > [2] then [[x]] else [[0]] | x←[[1], [2], [3], [4]] ]_sum
= join_sum [ foldr (λx xs → (if x > [2] then [[x]] else [[0]]) +_sum xs) [[0]] [[1], [2], [3], [4]] ]_sum
= join_sum [ if [1] > [2] then [[1]] else [[0]] +_sum
    (if [2] > [2] then [[2]] else [[0]] +_sum
      (if [3] > [2] then [[3]] else [[0]] +_sum
        (if [4] > [2] then [[4]] else [[0]] +_sum [[0]]))) ]_sum
= join_sum [[[0]], [[0]], [[3]], [[4]], [[0]]]
= foldr_sum (++_sum) [[0]] [[[0]], [[0]], [[3]], [[4]], [[0]]]
= [[7]]

The use of the structurally recursive function foldr allows modeling any kind of monad that aggregates values. Because it is parameterized to take any suitable aggregation function, it is independent of the actual kind of values, i.e., it also operates nicely on tuples of semi-structured items and sequences. With foldr it is also possible to sort and enumerate a relation, as required for OrderBy and Count, respectively. An alternative solution, which does not need nested comprehensions, was presented in [JW07].
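For concreteness, the following Haskell sketch of ours implements the sum aggregation in exactly this foldr style, with plain lists standing in for the monadic machinery; unitSum and plusSum correspond to unit_sum and +_sum:

  -- The sum monad's primitives on one-column tables ([[Int]]):
  unitSum :: Int -> [[Int]]                  -- wrap a row value into a 1x1 table
  unitSum x = [[x]]

  plusSum :: [[Int]] -> [[Int]] -> [[Int]]   -- +_sum: add two 1x1 tables
  plusSum [[x]] [[y]] = [[x + y]]
  plusSum _ _ = [[0]]

  -- [ x | x <- t; x > [2] ]_sum: fold the rows, keeping only values > 2
  sumGreater2 :: [[Int]] -> [[Int]]
  sumGreater2 =
    foldr (\[x] acc -> (if x > 2 then unitSum x else [[0]]) `plusSum` acc) [[0]]

  -- sumGreater2 [[1],[2],[3],[4]] == [[7]]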

The monadic representation of the third group of operators (i.e., Join, Concat, Union) follows from their nature as combinations of the other operators. For example, a simple join can be replaced by a cross product and a selection, which corresponds to a monad with two consecutive generators and a filter.


Composing Monads

Because all bulk monads operate on relational-structured data, they can be composed like bulk operators. For example, the composite function sum-greater-3-bulk of Figure 2.12 can be represented by the monad expression:

[ [ $a | $a←$val; $a > 3; $g0←[[0]]; $g0 = $g ]_sum$a | $g←[[0], [1], ...] ]_array²

In the array² comprehension, the generator $g←[[0], [1], ...] produces tuples of dummy partition identifiers for the grouping step and evaluates the nested comprehension for each of them. The generator $a←$val in the sum_$a comprehension enumerates input tuples from $val and binds them successively to the variable $a. Then, the filter $a > 3 eliminates all tuples where $a is less than or equal to three. Next, the generator $g0←[[0]] attaches 0 as grouping column $g0 to all tuples. The filter comparison $g0 = $g eliminates all tuples except those in the current grouping partition identified by the value in column $g. Finally, the sum_$a comprehension sums up the $a column values of the remaining tuples. Note that, because the bulk function aggregates the input relation as a whole and not partitions of it, the grouping column $g0 is static and, hence, the comparison filter will eliminate all tuples except those of the partition group $g = 0. Accordingly, we could omit the outer array² monad completely. However, for the purpose of demonstration, we included the dummy partitioning step to exemplify a basic pattern for data partitioning and grouping in comprehensions.

To summarize, we can state that the working principles of operators and variable bindings can be explained with the theoretical construct of monads. For the compilation process, however, a monadic specification of each operator is not practical. Instead, we must find a representation which is easy to handle during the compilation process, but which can be mapped at any time to the theoretical comprehension model, e.g., to justify rewriting rules.
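The partition-and-aggregate pattern is easy to mimic with ordinary Haskell list comprehensions, which may help to see the structure; the data and names below are our own, with Haskell's sum standing in for the sum monad:

  -- Outer generator over (dummy) partition ids, one nested, filtered
  -- sum comprehension per partition -- the structure of the expression above.
  sumPerGroup :: [(Int, Int)] -> [Int]   -- rows of (group id, value)
  sumPerGroup rows =
    [ sum [ v | (g', v) <- rows, v > 3, g' == g ]   -- nested "sum" comprehension
    | g <- [0, 1] ]                                 -- outer generator over groups

  -- sumPerGroup [(0, 2), (0, 5), (1, 7)] == [5, 7]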


Compiling Monads

As shown in the previous section, monads are perfectly suitable for representing bulk operators and variable bindings in an abstract way. In addition, monads also provide an algebraic kernel for our query compiler. With the monad laws, one can derive transformation rules to reorganize and simplify comprehensions in various ways, which in turn serve as a solid basis for optimizing operator compositions. For example, [Wad90] states two useful equivalences for implementing predicate pushdown. For two Boolean-valued terms b and c, it holds that

[ t | b; c ] = [ t | ( b ∧ c ) ],

and if the Boolean qualifier b is independent of the variables bound by qualifier q, the following equation holds:

[ t | q; b ] = [ t | b; q ].

With these rules, conjunctive predicates in a query can be split into separate filters, which can then be individually pushed close to the binding qualifiers they refer to, i.e., the inputs of the query. Further sophisticated rewrite rules can be formulated inductively by applying the equivalence rules of the monad axioms or by showing the output equivalence of two comprehensions with respect to the respective monad functions. For the systematic application of such rewrite rules, the query optimizer can employ any of the well-known rule-based, heuristics-based, or statistics-based search strategies.

Besides performing structural optimizations, it is crucial to derive efficient execution plans from the abstract comprehension model. Conceptually, comprehensions reflect nested loops over relations, because they are defined on the higher-order functions map and foldr, which are specified as iterative and recursive operations, respectively. The key to efficiency is the implicit independence of individual loops. Under the premise that side effects do not occur, the order in which loops are processed does not affect the result of most comprehensions, because each variable binding is independent and only valid in the scope of nested qualifiers. Practically, this allows the compiler to transform and unnest loops to align the nested, inherently sequential evaluation process with set-oriented algorithms and physically advantageous patterns.
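Returning to the two pushdown equivalences above, both can be observed directly with Haskell list comprehensions; the following spot check (example data is our own) mirrors the filter-merge and filter-hoisting rules:

  -- [ t | b; c ] = [ t | b ∧ c ]: consecutive filters merge into a conjunction.
  mergeOk :: Bool
  mergeOk = [x | x <- [1 .. 9 :: Int], even x, x > 4]
         == [x | x <- [1 .. 9], even x && x > 4]

  -- [ t | q; b ] = [ t | b; q ]: a filter independent of the generator's
  -- variable may be evaluated before the generator.
  hoistOk :: Int -> Bool
  hoistOk y = [x | x <- [1 .. 9 :: Int], y > 0]
           == [x | y > 0, x <- [1 .. 9]]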


4.2.2. AST-based Query Representation

The comprehension model is a powerful foundation for the query compiler, but it does not directly lead to a concrete representation for queries. To obtain a query representation which, on the one hand, reflects the hierarchical nesting of variable-binding and variable-resolving expressions and, on the other hand, profits from set-oriented algorithms, bulk operations must be modeled appropriately. The goal is to incorporate the algebraic power of comprehensions in a composable, hierarchical representation.

The qualifier lists of comprehensions, which are evaluated from left to right and normalized to nested comprehensions, already yield a top-down view. However, bulk operators, as we defined them, can only be composed in nestings, which effectively form a tree with an implied bottom-up data flow. The solution is a redesign of the operators in continuation-passing style (CPS) [Rey72, SJ75]: operators get a continuation function as an additional parameter and, instead of returning their result directly, they return the result obtained by applying the given function to the former result. The conversion of the sample function sum-greater-3-bulk to CPS-style operators is shown in Figure 4.1. In the CPS version, the previously nested ForEach has become the top-most function, which returns as result the application of Select on its own output relation. Select, in turn, filters the input as before, but applies GroupBy to the filtered relation and returns the result. The former top-most GroupBy is now at the inner-most position and applies its result to a function End, which terminates the continuation passing and simply returns the given input as result.

Note that the conversion to CPS-style operators is solely of a structural nature. The change preserves the general functionality of each operator, but inverts the composition logic. Logically, everything still fits the abstract comprehension model; in fact, monad types have been shown to be equivalent to CPS [Wad90].

The CPS representation delivers the desired top-down style for bulk-oriented operators, so that they can be easily integrated into the standard expression tree. Everything is represented in its most natural form – as a tree of scoped expressions. Hence, the compiler operates practically on a hierarchical, AST-like query representation, just like compilers for conventional programming languages. The initial expression tree, i.e., the AST of the parsed query, merely has to be brought into a form where all variable-binding expressions form trees of properly nested variable scopes.

declare function sum-greater-3-bulk($val) {
  GroupBy[$a'](
    Select[$a](
      ForEach[$a]( [], transpose($val) ),
      greater-than($a,3) ),
    zero, sum($a))
}
⇓
declare function sum-greater-3-bulk($val) {
  ForEach[$a]( [], transpose($val),
    Select[$a]( greater-than($a,3),
      GroupBy[$a']( zero, sum($a), End ) ) )
}

Figure 4.1.: Conversion of bulk operators to CPS.
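The inversion becomes apparent in a small functional sketch. In the following Haskell fragment (our own simplification: a stage is a function over streams of tuples, and the continuation is simply the next stage), ForEach is the top-most function, exactly as in Figure 4.1:

  type Tuple = [Int]
  type Stage = [Tuple] -> [Tuple]      -- an operator over streams of tuples

  -- Each operator takes its continuation ("next") as an extra parameter and
  -- returns next applied to its own output.
  forEach :: (Tuple -> [Tuple]) -> Stage -> Stage
  forEach bind next = next . concatMap bind

  select :: (Tuple -> Bool) -> Stage -> Stage
  select p next = next . filter p

  end :: Stage
  end = id                             -- terminates the continuation passing

  -- ForEach feeds Select, Select feeds End -- a top-down composition:
  pipeline :: Stage
  pipeline = forEach (\t -> [t ++ [a] | a <- [1 .. 4]])
                     (select (\t -> last t > 3) end)
  -- pipeline [[]] == [[4]]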


flwor
  for $a in (1,2,3)
  return flwor
    for $b in (2,3,4)
    where $a=$b
    let $c := $a+$b
    return $c
⇓
pipe: Start → ForBind$a (1,2,3) → ForBind$b (2,3,4) → Select ($a=$b)
      → LetBind$c ($a+$b) → End ($c)

Figure 4.2.: Conversion of a FLWOR expression to an operator pipeline.

4.2.3. FLWOR Pipelines

To compile XQuery, the initial AST must be reshaped into a form where all variable bindings in the query reflect nested variable scopes. Targets of the rewriting process are all individual FLWORs as well as nestings of them. The rest of the query is left intact³. Figure 4.2 illustrates the rewriting of the expression tree⁴. Listing 1 shows the core loop of the rewriting process, which successively replaces all FLWOR expressions.

³For brevity, we focus the discussion on FLWORs only. The derivation of correctly nested AST representations for the few other variable-binding expression types like typeswitch and try-catch is trivial because they do not imply iterative behavior.
⁴In the remainder of this thesis, illustrations of trivial subtrees like arithmetics and comparisons will be collapsed to a single expression node to save space.


Listing 1 Introduction of FLWOR pipelines in the AST.
1: function introduce pipelines(ast)
2:   while ast contains FLWOR do
3:     flwor ← find FLWOR(ast)
4:     pipe ← create pipeline(flwor)
5:     ast ← replace node(ast, flwor, pipe)
6:   end while
7:   return ast
8: end function

Every flwor expression in the AST is replaced by a pipe expression with a right-deep pipeline of operators. The actual pipeline begins with an artificial Start operator, which provides the initial input. In a loop over all clauses, an operator is introduced for each clause, which provides its output to the operator created for the following clause. At the end, a special End operator terminates the pipeline and binds the result of the return-clause expression for each input tuple to an anonymous variable. With the exception of the terminating End, every operator node always feeds its output to its right-most child, which is the next operator in the pipeline. Accordingly, the subtree of the right-most child virtually represents the continuation function. During this rewriting process, a first and very small simplification can be performed on the fly: if the return clause itself evaluates a FLWOR, it can be directly integrated into the current pipeline. Respective pseudocode for the pipeline construction is given in Listing 2. The mapping function map to operator in line 19 instantiates the corresponding operator AST nodes for non-return clauses. To simplify the AST, the variable-binding clauses for, let, and window, which actually translate to ForEach operators with respective bind functions, are rendered as the operators ForBind, LetBind, and WindowBind, respectively. In the case of ForBind, the implicit bind function produces a tuple for each item in the sequence obtained by evaluating the attached binding expression. If the for clause specifies a position variable, the bind function produces an array of 2-tuples with the current item in the first position and its index in the second position. Consequently, the ForBind binds two variables in this case.


Listing 2 Creation of FLWOR pipelines.
1: function create pipeline(flwor)
2:   pipe ← create node(PIPE)
3:   prev ← create node(START)
4:   append child(pipe, prev)
5:   clauses ← children(flwor)
6:   while clauses is not empty do
7:     clause ← remove first(clauses)
8:     if clause is RETURN then
9:       expr ← first child(clause)
10:      if expr is FLWOR then
11:        // continue pipeline with nested FLWOR
12:        clauses ← children(expr)
13:      else
14:        end ← create node(END)
15:        append child(end, expr)
16:        append child(prev, end)
17:      end if
18:    else
19:      op ← map to operator(clause)
20:      append child(prev, op)
21:      prev ← op
22:    end if
23:  end while
24:  return pipe
25: end function

The bind function of a LetBind produces a single tuple for the whole binding sequence.

WindowBind operators use the most complex bind functions. They produce a relation, which reflects a sliding or tumbling window over the binding sequence. Besides the sequence of items in the current window, the tuples can consist of several other variable bindings, e.g., the first and last items in the current window.
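Spelled out in code, the bind functions look roughly as follows; the Haskell sketch below uses our own simplified value and tuple types, and omits the window bind for brevity:

  -- A tuple is a list of values; a value is a single item or a sequence.
  data Value = Item Int | Seq [Int] deriving (Eq, Show)
  type Tuple = [Value]

  -- ForBind: one output tuple per item of the binding sequence.
  forBind :: Tuple -> [Int] -> [Tuple]
  forBind t items = [ t ++ [Item x] | x <- items ]

  -- ForBind with a position variable: binds the item and its 1-based index.
  forBindPos :: Tuple -> [Int] -> [Tuple]
  forBindPos t items = [ t ++ [Item x, Item i] | (x, i) <- zip items [1 ..] ]

  -- LetBind: a single output tuple binding the whole sequence.
  letBind :: Tuple -> [Int] -> [Tuple]
  letBind t items = [ t ++ [Seq items] ]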


4.2.4. Runtime View

At runtime, the static structure of the expression tree is augmented with the notion of intermediate state, which reflects the values currently bound to variables. With respect to performance, the realization of variable bindings, i.e., the dynamic context, is certainly the most crucial aspect. One option is to treat the dynamic context literally as a set of variables, which is accessible to each expression and updated whenever necessary, i.e., when a new value is bound to a variable. The second option is to represent each specific state during processing as a separate and immutable dynamic context; each binding of a value to a variable then creates a new copy of the current context. A set-oriented compiler like the one developed in this thesis requires the explicit representation of each individual state to compute the result of a FLWOR expression interleaved for multiple or all iterations. Although this implies some overhead for materializing the dynamic context, the performance and scalability advantages usually outweigh the costs. Furthermore, a mutable dynamic context inhibits parallelism, which is another reason why we build exclusively on explicit but immutable context tuples for each context change.

In the default case, the expression tree is evaluated – as in most XQuery processors – in a sequential fashion. The result is computed bottom-up, i.e., subexpressions are evaluated before the actual result sequence is built. Intermediate results are usually not materialized, to save main memory. Instead, most expressions evaluate to a lazy sequence, which employs a pull-based iterator concept to compute individual result items on demand [BBB+09]. In general, this makes even the processing of very large results very space efficient.

The dynamic context is stored in context tuples, which are passed between expressions and operators. A tuple [1,2], for example, can represent a dynamic context with the variable bindings $a=1 and $b=2. Variable reference expressions like the operands of the summation expression $a+$b then evaluate to the respective variable values in the given context tuple. In analogy to procedural languages, a context tuple reflects the stack at a specific point in time.


ForBind$a:  [$a]       = [1], [2], [3]
ForBind$b:  [$a,$b]    = [1,2], [1,3], [1,4], [2,2], [2,3], [2,4], [3,2], [3,3], [3,4]
Select:     [$a,$b]    = [2,2], [3,3]
LetBind$c:  [$a,$b,$c] = [2,2,4], [3,3,6]
End:        [$a,$b,$c] = [2,2,4], [3,3,6]

Figure 4.3.: Output of the pipeline operators of Figure 4.2.

The fundamental difference between normal expressions and bulk-oriented operators is that the former are evaluated for a single context tuple at a time and produce a result value, whereas the latter consume and produce streams or arrays of context tuples. The pipe expression at the top of an operator pipeline mediates between the two processing modes. It passes the single context tuple on to the Start operator, which feeds it as a single-tuple relation into the pipeline, and builds the final result sequence by merging the anonymous return variable values in the output tuples of the final End. The output relations of the pipeline operators of Figure 4.2 are shown in Figure 4.3.

Last but not least, function calls are realized with the very same mechanisms. Within a function, parameters appear like references to variables that are bound by the caller. The only difference between function calls and normal expressions is that functions only have access to their parameters and not to the variables bound in the scope where the function is called. To call a function, the argument expressions are evaluated and the resulting arguments are passed on to the function as a normal context tuple. The result value is returned to the caller. If necessary, dynamic type checking is performed on both the arguments and the result.
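A minimal sketch of this mechanism, with our own simplified types, shows how a variable reference reduces to a positional access on the context tuple:

  data Value = I Int | S String deriving (Eq, Show)
  type Tuple = [Value]          -- the dynamic context as a positional "stack"

  -- A compiled variable reference is just an index into the context tuple
  -- (positions are assigned at compile time, see Section 4.3.2).
  varRef :: Int -> Tuple -> Value
  varRef pos tuple = tuple !! pos

  -- $a+$b with $a at position 0 and $b at position 1:
  addAB :: Tuple -> Value
  addAB t = case (varRef 0 t, varRef 1 t) of
    (I a, I b) -> I (a + b)
    _          -> error "type error at runtime"

  -- addAB [I 1, I 2] == I 3, i.e., the context tuple [1,2] from above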

4.3. Compiler Architecture

The flexibility and portability ambitions of a data processing platform come at a certain cost. Monolithic solutions, which tightly integrate physical aspects of the storage into the compilation process, can perform optimizations which are out of reach for a more general approach. Nevertheless, a modular compiler with a set of solid core algorithms and optimizations, complemented with storage-specific extensions, is able to achieve competitive performance, too. For example, recent successes of the language-independent compiler infrastructure LLVM demonstrate well that modular architectures can achieve top results when they are equipped with strong platform-specific code generators [LA04]. Besides, such infrastructures excel in categories like portability, stability, and time-to-market, because a large fraction of the core optimizations is required in every specialized compiler, too.

The main objective of our compiler architecture is a clear-cut separation of set-oriented aspects from physical aspects. The optimization of the former is given priority in the compilation process, because wrong decisions here easily result in query plans that are orders of magnitude slower. Nevertheless, physical aspects increasingly come into focus on modern hardware platforms, too. For example, storage subsystems and even whole DBMSs are today already tuned for data cache and instruction cache locality.

4.3.1. Compilation Pipeline

The compilation process consists of several rewriting stages and the final assembly of the executable plan. An overview of the rewriting process is given in Figure 4.4. The initial expression tree is created directly by the parser and processed by individual rewriting stages, which annotate and transform it for the final translation into an executable plan. In the following, we give a brief walk-through of the whole process. The highlighted stages are the focus of this work and are covered in detail in Section 4.2.3 and the following chapters, respectively.

In the first step, syntactic sugar and functional redundancy in the initial expression tree are normalized. For example, quantified expressions are converted to FLWOR expressions as shown in Section 3.2.4. However, in contrast to other compilers, not everything is strictly normalized to the basic language constructs of the XQuery Core model [W3C10b]. Among others, this would break up all path expressions into sequences of FLWOR expressions, and, as we will see in Chapter 6, it is beneficial to preserve data access operations like path expressions and filter expressions in the first place.


Parser → Normalization → Static Typing → Simplification → Pipelining
→ Pipeline Optimization → Data Access Optimization → Function Optimization
→ Parallelization → Distribution → Translator

Figure 4.4.: The query rewriting pipeline.

The next stage performs static typing. The query is checked for type errors and all expressions are annotated with type information for data-type-specific optimizations in subsequent steps. The simplification stage is intended for standard pruning operations like the removal of unused variables, constant folding, and dead-code elimination. Furthermore, the type information derived in the previous stage is used to simplify expressions that can be confined to certain types, and to eliminate otherwise mandatory runtime checks, e.g., for function arguments and return values. In the following stage, FLWOR expressions are transformed into operator pipelines as explained in Section 4.2.3. The rewriting stage itself is rather simple, because it merely brings the expression tree into a different form.


After bringing all FLWORs into the nested pipeline form, various pipeline-local optimizations like early filtering, join rewriting, and sort elimination are performed. Furthermore, nestings of pipelines are fused to maximize the potential of set-oriented optimizations. Details on the central rewriting rules are given in Chapter 5.

The following stage takes care of physical data accesses, i.e., path expressions and related operations. Naturally, the rewritings performed in this stage heavily depend on the target platform, but in essence, this stage introduces platform-specific "expressions", which are later compiled into native operations instead of generic storage adapter calls. As the selection of DB-specific optimizations presented in Chapter 6 illustrates, data access optimization takes place in this stage at multiple granularities and in multiple dimensions.

The next stage optimizes function calls. It is primarily important for user-defined functions, because each declared function is itself a query. Accordingly, the opportunities for optimization are high. Non-recursive functions, for example, can be inlined to avoid the overhead of function invocation and to increase the chances for further optimization. The function call is replaced in the expression tree by the function body, and references to function arguments are replaced by the corresponding argument expressions. However, if the statically available typing information is not sufficient to guarantee correct types of arguments and results, additional code must be generated to discover typing errors at runtime. In the normal case, the respective checks are integrated into the dynamic function call mechanism. Recursive functions can benefit from other well-known compiler techniques like tail-call optimization [ASS85]. Higher-order functions and partial function application can draw from the compilation of functional languages [PL92].

To improve parallel processing, the next stage allows enhancing the expression tree with additional information like the maximum degree of parallelism for operations or suitable buffer sizes. Note that our parallel execution framework presented in Chapter 7 exploits query-inherent data parallelism automatically at runtime and does not need a structural reorganization of the expression tree itself.

Finally, we envision a stage for taking care of the special requirements of distributed environments. These, however, have not been studied in the context of this thesis, but the field of MapReduce-based languages [BEG+11, ORS+08] suggests itself as a starting point for future work.


4.3.2. Plan Generation

The final expression tree is compiled into an executable query plan in a single top-down pass. The compilation is straightforward, as the pseudocode in Listing 3 shows. For each expression type, there is a concrete implementation, which is initialized with the compiled subexpressions and, if present, information annotated in the AST. The translation process is that simple because each subexpression, i.e., each subtree in the AST, is self-contained and does not have to obey dependencies other than plain variable references.

Listing 3 Translation of expressions.
1: function compile expression(ast)
2:   if ast is LITERAL then
3:     expr ← compile literal(ast)
4:   else if ast is AND then
5:     expr ← compile and expression(ast)
6:   else if ast is OR then
7:     expr ← compile or expression(ast)
8:   else if ... then
9:     ...
10:  end if
11:  return expr
12: end function

13: function compile and expression(ast)
14:   f ← child(0, ast)
15:   first ← compile expression(f)
16:   s ← child(1, ast)
17:   second ← compile expression(s)
18:   expr ← create and expression(first, second)
19:   return expr
20: end function

Variable bindings are handled completely by the respective binding expression or operator. A variable must be declared before subexpressions in the visibility scope of the variable are compiled, and it must be undeclared again when the variable goes out of scope, i.e., after the respective subexpressions have been compiled. A variable table keeps track of all declared variables and assigns them a binding position in the corresponding scope. In subexpressions with a variable dependency, a simple lookup in the variable table is sufficient to resolve the corresponding position in context tuples. Accordingly, a variable reference translates at runtime into a simple array access in a context tuple. The special scope variables $fs:dot, $fs:last, and $fs:pos are treated like normal variable declarations. Expressions modifying this part of the context (e.g., filter expressions) declare them as special variables, which are resolved during compilation of expressions referring to these elements. Listing 4 illustrates the variable binding and resolution process for filter expressions and the context size expression last().

Listing 4 Binding and resolution of variables.
1: function compile filter expression(ast)
2:   input ← child(0, ast)
3:   e ← compile expression(input)
4:   predicate ← child(1, ast)
5:   bind(table, fs:dot)
6:   bind(table, fs:last)
7:   bind(table, fs:position)
8:   p ← compile expression(predicate)
9:   bind i ← unbind(table, fs:position)
10:  bind l ← unbind(table, fs:last)
11:  bind p ← unbind(table, fs:dot)
12:  expr ← create filter expression(e, p, bind i, bind l, bind p)
13:  return expr
14: end function

15: function compile context size expression(ast)
16:   name ← variable name(fs:last)
17:   position ← lookup(table, name)
18:   expr ← create tuple access(position)
19:   return expr
20: end function


(a) Pull-based pipeline:
    pipe ──next()→ Opn ──next()→ ... ──next()→ Op1 ──next()→ Start

(b) Push-based pipeline:
    pipe ──output()→ Op1 ──output()→ ... ──output()→ Opn ──output()→ Return

type Cursor {                 type Sink {
  void open()                   void begin()
  Tuple next()                  void output(Tuple t)
  void close()                  void end()
}                             }

Figure 4.5.: Comparison of pull-based and push-based pipelines.

The compilation of operator pipelines is almost identical to the compilation of expressions. Beginning at the Start operator of the corresponding pipe expression, the pipeline is compiled in a recursive top-down pass along the output edges. The binding and resolution mechanism for variables is also identical. Pseudocode for the compilation step is found in Listing 17 and Listing 18 in Appendix A.

Note that the actual realization of an operator pipeline is independent of the compilation process. The system can employ either a pull-based or a push-based operator model, as depicted in Figure 4.5(a) and Figure 4.5(b), respectively. The pull-based model is widely used in all kinds of database systems, because it matches the bottom-up perspective of conventional query plans. The push-based model is almost identical, but suits the top-down perspective better.

In the pull-based model, each operator is realized as a cursor, which implements a simple open-next-close interface [Gra94]. The client, in our case the compiled pipe expression, initializes the cursor pipeline by passing the start tuple to the Start operator. Then, it iteratively retrieves the result by calling next() on the cursor of the last operator in the pipeline, which propagates the call through the individual pipeline stages. The push-based model works in the exact opposite direction. Each operator is represented as a sink, which receives a tuple stream as input and emits a tuple stream to another sink. The pipe expression emits the start tuple to the first operator in the pipeline by calling output() and collects the results, which are emitted by the final operator in the pipeline.
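The duality of the two models can be sketched in a few lines; the following Haskell fragment (our own rendering, with a pure cursor and an effectful sink) shows one Select stage in both styles:

  -- Pull-based: a stage wraps its input cursor and is polled from above.
  newtype Cursor t = Cursor { next :: Maybe (t, Cursor t) }

  selectPull :: (t -> Bool) -> Cursor t -> Cursor t
  selectPull p c = Cursor $ case next c of
    Nothing                  -> Nothing
    Just (t, c') | p t       -> Just (t, selectPull p c')
                 | otherwise -> next (selectPull p c')

  -- Push-based: a stage receives tuples and emits them to the next sink.
  type Sink t = t -> IO ()

  selectPush :: (t -> Bool) -> Sink t -> Sink t
  selectPush p out t = if p t then out t else pure ()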

5. Pipeline Optimization

The conversion of FLWORs to operator pipelines is the enabling step for improving the scalability aspects of a query, because it opens the door for various set-oriented rewritings of the operator tree. Generally, the compiler can employ any prevalent optimization from the relational world, like projection, predicate merging, and sort elimination. As long as the final stream of output tuples is not affected, operators can be reordered, merged, or replaced to obtain a better performing pipeline. Besides the fact that a pipeline reflects, in principle, an upside-down version of a relational query plan, the rewriter must additionally take care of the order sensitivity inherent to the data model. However, in many data processing scenarios, users may safely override strict order preservation to increase query performance.

The following discussion concentrates on optimization rules for join processing and aggregation, which are special to or particularly effective in the context of nested top-down pipelines. But beforehand, some general aspects of rewriting rules and binding operators will be introduced with the most obvious pipeline optimization: predicate pullup – the equivalent of predicate pushdown in a bottom-up query plan.

5.1. Generalized Bind Operator

Due to the hierarchical nesting of variable-binding operators and dependent expressions, the realization of predicate pullup is quite simple. A simple look at the ancestors of a Select operator in the AST reveals the operators which bind the variables on which the predicate expression depends. This information is sufficient to decide whether and to where the filter can be pulled up in the pipeline to reduce the number of intermediate tuples. In Figure 5.1, for example, the predicate $a>1 depends only on variable $a and, thus, the Select can be placed directly under the ForBind that makes $a available in the local scope.


pipe: Start → ForBind$a (1,2,3) → ForBind$b (2,3,4)
      → LetBind$c ($a+$b) → Select ($a>1) → End ($c)
⇒
pipe: Start → ForBind$a (1,2,3) → Select ($a>1)
      → ForBind$b (2,3,4) → LetBind$c ($a+$b) → End ($c)

Figure 5.1.: Predicate pullup in the operator pipeline.

Consideration of such variable dependencies is crucial for almost every rewriting rule. As the example shows, the proper nesting of scopes in the AST makes it easy to keep track of variable dependencies. The source of a binding can be resolved by looking at ancestor nodes; uses of a binding can be found by traversing the subtree of the binding node. The notation of rewrite rules, however, gets cumbersome, because many rules will refer to structural relationships with wildcard patterns like "the ancestor that binds variable $a". To allow for a crisp notation of rewrite rules, we can make use of the fact that most operators in the AST reduce to specializations of ForEach. Except for Project, OrderBy, GroupBy, and Count, all operators virtually bind zero or more variables and emit for each input tuple the Cartesian product with an input-dependent relation, thereby preserving the relative order of input tuples. Commonly, these operators can be referred to with a generalized operator Bind$X:$V, characterized as a ForEach-based operator which binds additional variables $X = $x1,...,$xn and depends on variables $V = $v1,...,$vm to compute for each input tuple a binding relation for emitting the Cartesian product.

Because the Cartesian product on relations is associative, a sequence of two operators Bind$X:$V and Bind$Y:$W exposes, when considered as a whole, the same properties as a single one. Hence, it is valid to merge them into a single operator Bind$X,$Y:$V∪$W. In rewriting rules, we can use this property to collapse sequences of ForEach-based operators as a match for a single Bind operator. Accordingly, the rewriting rule for pulling up Select predicates can be formulated as shown in Rule 1. Note that the rule is applicable in Figure 5.1 because the two consecutive operators ForBind$b and LetBind$c can be matched together as Bind$b,$c:∅.

Rule 1 Pullup of Select predicate.

  $x1,...,$xn ⇒ $X      $v1,...,$vm ⇒ $V
  Bind$X:$V ⇒ Opx       pred independent of $X
  ──────────────────────────────────────────────
  Opx(Select(pred, out)) ⇒ Select(pred, Opx(out))
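A rewriting along Rule 1 is a simple pattern match on the operator tree. The following Haskell sketch uses our own toy AST, and freeVars is assumed to return the variables an expression references; it pulls a Select above an independent ForBind or LetBind:

  type Var = String
  data Expr = Expr { freeVars :: [Var] }   -- only the dependency info matters here

  data Op = Start Op
          | ForBind Var Expr Op            -- Bind operators: variable, binding
          | LetBind Var Expr Op            -- expression, and output pipeline
          | Select Expr Op
          | End Expr

  -- Rule 1: a Select whose predicate is independent of the variable bound by
  -- its parent Bind operator is pulled above that operator.
  pullUp :: Op -> Op
  pullUp (ForBind v e (Select p out))
    | v `notElem` freeVars p = pullUp (Select p (ForBind v e out))
  pullUp (ForBind v e out)   = ForBind v e (pullUp out)
  pullUp (LetBind v e (Select p out))
    | v `notElem` freeVars p = pullUp (Select p (LetBind v e out))
  pullUp (LetBind v e out)   = LetBind v e (pullUp out)
  pullUp (Select p out)      = Select p (pullUp out)
  pullUp (Start out)         = Start (pullUp out)
  pullUp (End e)             = End e

One pass moves a Select past one binding operator at a time; iterating pullUp to a fixpoint places the predicate directly under the operator that binds the last variable it depends on, as in Figure 5.1.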

The reordering of binding operators within pipelines – remember that Select is also a Bind operator – is advantageous in many situations, e.g., for bringing a pipeline into an appropriate form for join rewriting or for extracting expensive, loop-invariant let-bound variables out of a for loop. However, two consecutive Bind operators may only be swapped under the following conditions:

1. The nested Bind is independent of the variables bound by the parent operator.

2. The nested Bind does not shadow variables on which the parent depends, i.e., it does not re-bind values to variables on which the parent depends¹.

3. The swapping does not influence the ordering of output tuples. This is the case when at least one of the two Bind operators emits at most one tuple per input tuple, which suggests particularly Select and LetBind operators as potential candidates for reordering. Alternatively, it is allowed to swap two Bind operators if the user specified that the (partial) order of tuples need not be preserved, or if the reordering does not affect the final outcome of the pipeline. The latter property, however, is very difficult to prove in the general case.

4. Neither of the two operators causes side effects or produces otherwise nondeterministic results.

¹Implementations may circumvent this problem by internally assigning unique variable names.

The correctness of pipeline rewritings can be shown with equivalences in the monad comprehension model. For example, the correctness of Rule 1 is given by the respective equivalence rule mentioned in Section 4.2.1. Usually, the common sense behind a rewriting rule is evident and, because this thesis puts emphasis on the realization aspects of a data programming language, we abstain from formal proofs for the rewriting rules presented.

5.2. Join Processing

The looping nature of ForEach-based operators frequently results in pipelines which compute expensive Cartesian products. Efficient join support is therefore crucial to fight the data explosion of nested loops. To get familiar with the representation of joins in top-down pipelines, consider again the pipeline of Figure 4.2. Two consecutive ForBind operators compute the cross product of the sequences (1,2,3) and (2,3,4), which is filtered afterwards with a join-like equality predicate. Obviously, the direct nesting of the two ForBinds can be interleaved with a Join as shown in Figure 5.2. The Join node has three children. The right-most child is, as for all operators, the pipeline to which the output is fed. The first and the second child produce the left and right join input relations, respectively, i.e., they reflect the join parameter functions left and right. The equality predicate of the Select has been split: the operand expressions have become the join keys, which are located under the End terminators of the left and right input pipelines for correct scoping. The comparison type is a property of the Join node, indicated by the = subscript in the example.

pipe: Start → ForBind$a (1,2,3) → Join=
      left:  End ($a)
      right: ForBind$b (2,3,4) → End ($b)
      out:   LetBind$c ($a+$b) → End ($c)

Figure 5.2.: Introduction of a basic join in the pipeline of Figure 4.2.

The join is computed exactly as described in Section 2.3.3. The tuples of the input relation [[1],[2],[3]] are successively fed to the join input pipelines, and their respective outputs are pairwise matched against the join predicate. Notably, the left join input is a trivial pass-through, which only provides the left join key for the input tuple. In turn, the right join input binds the sequence (2,3,4) and emits three tuples per input tuple, including their join keys. Without further assistance, the computation of the join will exhibit the same nested-loops behavior as the nested ForBind operators, because the Join operator actually computes and concatenates three results of three separate joins – one join per input tuple. Hence, even if the equi join were implemented as a hash join, the hash table for the inner table, i.e., the right input, would have to be rebuilt for each input tuple, which effectively would make a hash join even more costly than simple nested loops. However, the right input is independent of the variable $a, which allows pushing the outer ForBind down to the left join input branch as shown in Figure 5.3. Now only one join is computed – one for the single input tuple emitted by Start. If the equi join is then implemented as a hash join, the initially nested binding sequence (2,3,4) has to be processed only once to build the hash table.

pipe: Start → Join=
      left:  ForBind$a (1,2,3) → End ($a)
      right: ForBind$b (2,3,4) → End ($b)
      out:   LetBind$c ($a+$b) → End ($c)

Figure 5.3.: Join with two independent inputs.

5.2.1. Join Recognition

In contrast to SQL, the front-end language XQuery cannot express joins explicitly. Accordingly, the optimizer must detect join semantics in a pipeline, i.e., the filtering of a Cartesian product, by itself. Without loss of generality, a Cartesian product can be confined to a nesting of two arbitrary but independent Bind operators. Independent means that the inner Bind is independent of the variables bound by the outer Bind. If the Cartesian product is immediately followed by a Select with a comparison predicate Θ ∈ {=, <, >, <=, ...} and if the two operand expressions disjointly depend on variables bound either by the inner or by the outer Bind, then this operator sequence has join semantics. Rule 2 and Rule 3 show the respective rewriting patterns, which are illustrated in Figure 5.4.

Rule 2 Introduction of Join.

  $x1,...,$xn ⇒ $X      $v1,...,$vm ⇒ $V
  $y1,...,$yl ⇒ $Y      $w1,...,$wk ⇒ $W
  Bind$X:$V ⇒ Opx       Bind$Y:$W ⇒ Opy
  Opy independent of $X
  k1 independent of $Y   k2 independent of $X
  Θ ∈ {=, <, >, <=, ...}
  ─────────────────────────────────────────────────────────────
  Opx(Opy(Select(k1 Θ k2, out))) ⇒ Opx(JoinΘ(End(k1), Opy(End(k2)), out))


Bind$X:$V                          Bind$X:$V
  Bind$Y:$W                  ⇒       JoinΘ
    Select (k1 Θ k2)                   left:  End (k1)
                                       right: Bind$Y:$W → End (k2)
⇒
JoinΘ
  left:  Bind$X:$V → End (k1)
  right: Bind$Y:$W → End (k2)

Figure 5.4.: Application of join rewriting rules 2 and 3.

Rule 3 Push-down of independent bindings to left join input.

  $x1,...,$xn ⇒ $X      $v1,...,$vm ⇒ $V
  Bind$X:$V ⇒ Opx       JoinΘ ⇒ Opj
  right independent of $X
  k2 independent of $X
  Θ ∈ {=, <, >, <=, ...}
  ─────────────────────────────────────────────────────────────
  Opx(Opj(left(End(k1)), right(End(k2)), out)) ⇒
      Opj(Opx(left(End(k1))), right(End(k2)), out)

The sample query in Figure 4.2 contains the simplest pipeline constellation that fits the join rule pattern. In the Select predicate, the left-hand operand, the variable reference $a, depends solely on the outer ForBind$a, whereas the right-hand operand $b depends solely on the inner ForBind$b. Furthermore, the binding sequence of the inner ForBind$b is independent of $a, too.


ForBind$a (1,2,3)                 ForBind$a (1,2,3)
  ForBind$b (2,3,4)        ⇒        LetBind$c (-$a)
    LetBind$c (-$a)                   ForBind$b (2,3,4)
      Select ($a=$b+$c)                 Select ($a-$c=$b)

Figure 5.5.: Pipeline reshaping for join processing.

Join=
  left:  ForBind$a (1,2,3) → LetBind$c (-$a) → End ($a-$c)
  right: ForBind$b (2,3,4) → End ($b)

Figure 5.6.: Join in reshaped operator pipeline.

5.2.2. Pipeline Reshaping

Figure 5.5 shows a more difficult example of join semantics. In the left expression tree, both variable references in the right predicate operand are bound after $a, but $c directly depends on variable $a. Thus, a join rewriting cannot be performed. However, there is the chance to rewrite the query to the expression tree shown on the right-hand side of Figure 5.5. Now, there is a proper separation of bindings and references, which can be assigned to the left and the right input of the join in Figure 5.6, respectively.

Obviously, reshaping a pipeline to match the join pattern can become arbitrarily complex, especially if an arbitrary join predicate has to be adjusted as in the example above. Note also that it was only possible here to move the critical variable binding $c because it was let-bound and because the binding expression is side-effect-free. Of course, join semantics may also be embodied in all FLWOR-based or FLWOR-like expressions, e.g., in filter expressions of the general form s2[e2 Θ e1], where e2 is a predicate over the inner context provided by s2 and e1 is a predicate over the context of a surrounding loop over an external sequence s1. In these cases, we may rewrite the respective expression to its equivalent FLWOR representation and compute the join in the pipeline. In summary, we can state that the presented join rewriting rule is capable of covering a wide range of typical join scenarios, but there is no guarantee that hidden join semantics will always be detected.

5.2.3. Join Groups

Some aspects of XQuery make join processing conceptually more complex than in SQL. First, there is the strong emphasis on tuple order, which is not preserved automatically by many standard algorithms. Second, comparisons in XQuery follow subtle typing and type conversion rules – a concession to working smoothly with untyped and semi-structured data. In consequence, a join operator has to cope with untyped data and mixed-type sequences if the types of the operands cannot be statically determined. As a result of these restrictions, a sort-merge join is only useful if the operands can be determined beforehand to be of a single type, and if the output order can be ignored or is compensated by an explicit sort afterwards. More typically, join algorithms will keep one input (the inner) in a lookup table, e.g., a hash table or a sorted index, and probe it with the other input (the outer). Further implementation details can be found in [RSF06].

An important aspect of lookup-table-based algorithms is the possibility to reuse a lookup table. Reconsider the initial join rewriting in Figure 5.2. The empty left input pipeline implied a rebuilding of the lookup table for each incoming tuple – although the right input and, thus, the lookup table was not affected. In Figure 5.3, we fixed the issue by pushing the ancestor operators, from which the right input is independent, down to the left input.

ForBind$a (ea)
  Join=
    left:  ForBind$b (eb) → End ($b)
    right: Join=
             left:  ForBind$c (ec) → End ($a)
             right: ForBind$d (ed) → End ($d)
             out:   End ($a+$c)

Figure 5.7.: Potential inefficiency of nested joins.

In more complex situations or in case of deep nestings of both FLWOR and non-FLWOR expressions, it is not possible to push all unrelated binding operators to the left input. Then again, the lookup table may have to be rebuilt more often than necessary.

Consider the example in Figure 5.7. Let all binding expressions ea, eb, ec, and ed be free of variable dependencies. The top-most operator ForBind$a potentially binds multiple values to $a, which leads to multiple evaluations of the top-most join. Because the third branch of the nested join depends on $a, ForBind$a cannot be pushed down to the left input branch. Hence, the nested join, i.e., the right input of the parent join, must be recomputed for each iteration of $a, which, in turn, triggers a rebuild of the lookup table, although the right input of the nested join is not affected by the changed binding of $a.

Clearly, the above situation is unsatisfactory. Therefore, the prototype developed in this thesis keeps a loaded lookup table until a recomputation is actively triggered or until the query has finished. The trigger mechanism is transparently embedded in the tuple flow. For each join, a Count operator is introduced directly after the lowest ancestor that binds a variable upon which the right join input depends. Every change of the counter variable then indicates a new binding situation. Accordingly, the join must simply check for each input tuple whether or not the counter variable changed, i.e., whether or not the lookup table must be recomputed.
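The trigger check inside a lookup-table-based join can be sketched as follows: the table is rebuilt only when the join-group counter in the incoming tuple changes. The Haskell fragment below is our own simplification; types and parameter names are hypothetical:

  import qualified Data.Map.Strict as M

  type Tuple = [Int]

  -- Probe-side loop of an equi join that memoizes its lookup table per join
  -- group: grp extracts the Count variable, lk/rk the join keys, and right
  -- computes the right input for the current bindings.
  joinWithGroups :: (Tuple -> Int)          -- join-group counter ($z)
                 -> (Tuple -> Int)          -- left key
                 -> (Tuple -> Int)          -- right key
                 -> (Tuple -> [Tuple])      -- right input pipeline
                 -> [Tuple] -> [Tuple]
  joinWithGroups grp lk rk right = go Nothing M.empty
    where
      go _ _ [] = []
      go g table (t : ts)
        | Just (grp t) /= g =              -- counter changed: rebuild table
            let table' = M.fromListWith (++) [ (rk r, [r]) | r <- right t ]
            in  probe table' t ++ go (Just (grp t)) table' ts
        | otherwise = probe table t ++ go g table ts
      probe table t = [ t ++ r | r <- M.findWithDefault [] (lk t) table ]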


ForBind$a (ea)
  Count$z
    Join=
      left:  ForBind$b (eb) → End ($b)
      right: Join=
               left:  ForBind$c (ec) → End ($a)
               right: ForBind$d (ed) → End ($d)
               out:   End ($a+$c)

Figure 5.8.: Use of trigger variable for join table rebuilding.

Figure 5.8 renders the situation for the example. An additional operator Count$z was introduced directly after ForBind$a, and the top-most join was annotated accordingly to trigger a rebuild of the lookup table on each change of $z. For the nested join, such a trigger is not necessary, because its right join input is independent of foreign variables. Hence, the lookup table of the nested join is computed only once and reused when the parent join rebuilds its lookup table. From an abstract viewpoint, the trigger mechanism allows sharing intermediate results between a sequential group of individual nested iterations. We call these join groups and refer to the respective rewriting rule accordingly as join group demarcation:

Rule 4 Join group demarcation.

  JoinΘ(left, right, out) ⇒ Op1      Bind$a ⇒ Op2
  $a ⇒ latest-bound variable on which right depends
  Θ ∈ {=, <, >, <=, ...}
  ─────────────────────────────────────────────────────────────
  Op1 ⇒ Op1$join-grp      Op2(out) ⇒ Op2(Count$join-grp(out))


Note that this optimization principle directly applies to the similarly structured operators Concat, Union, and Intersect, too. If either the left, the right, or both inputs are stable throughout multiple iterations (i.e., groups of consecutive input tuples), they can be memoized to reduce computation overhead.

5.3. Pipeline Lifting

So far, we only considered for-bound variables, which hold the single items of a binding sequence. However, XQuery also allows binding whole item sequences to variables, e.g., with let bindings as exemplified by the query in Figure 5.9. After rewriting the corresponding expression tree, we obtain a nesting of two pipelines as depicted in Figure 5.10. The operator LetBind$c evaluates a pipe expression for the rewritten FLWOR expression of the initial binding for each incoming context tuple and binds the entire result to variable $c. Accordingly, it produces the [$a,$c] tuples [1,(2,3,4)], [2,(3,4)], [3,(4)], and [4,()].

for $a in 1 to 4
let $c := for $b in 2 to 4
          where $a < $b
          return $b
return ($a, $c)

Figure 5.9.: Nested FLWOR in let binding.

Technically, tuples of sequences instead of single items are sometimes disruptive but rarely a problem. Small to medium-sized sequences can be stored inline in a tuple; bindings of large sequences may be realized as pointers to the respective sequences, which may reside in memory or on external storage. Remember that it is often also possible to bind just a lazy sequence, i.e., a closure that will compute its items on demand. In many cases, it is nevertheless reasonable to compute let-bound sequences directly within the pipeline, e.g., because it offers more options for optimizations, or because cheap lazy sequences that simply flow through the pipeline can cause significant load skew during parallel processing – the single process at the pipeline end does all the real work.
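A minimal sketch of such a lazy binding, assuming a simple Java shape for sequences (the Supplier-based design is an illustration, not the prototype's actual sequence interface):

import java.util.Iterator;
import java.util.function.Supplier;

// A lazy sequence: binding it costs almost nothing; its items are
// computed only when the sequence is actually iterated.
class LazySequence<T> implements Iterable<T> {
    private final Supplier<Iterator<T>> generator;

    LazySequence(Supplier<Iterator<T>> generator) {
        this.generator = generator;
    }

    @Override
    public Iterator<T> iterator() {
        return generator.get();  // evaluation is deferred until this point
    }
}

A LetBind could hand such a closure downstream unevaluated – which is exactly why all the real work may then pile up at the single process at the pipeline end.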


Figure 5.10.: Nested FLWOR pipelines.

The nested join in Figure 5.10 reveals a fairly common pattern in XQuery. Outer join semantics is expressed with a nested FLWOR, because the language does not provide an explicit join primitive. Unfortunately, initialized pipelines and, thus, also the lookup tables built for joins are lost between evaluations of a nested FLWOR, because an expression is, in contrast to an operator, evaluated only for a single context tuple at a time. In general, pipeline sharing across multiple evaluations is viable, but violates the principle that expression evaluations are independent and evaluation order is only partially specified, i.e., expressions can be evaluated in parallel. Hence, the problem of multiple join evaluations must be solved here differently.

5.3.1. 4-way Left Join

To overcome the potential inefficiency of FLWOR nestings, we perform pipeline lifting: In several rewriting steps, we “lift” pipelines nested in LetBind operators and integrate them into the higher-level pipeline. In the first step, we apply Rule 5, which converts the LetBind with the nested pipe expression to a left join as shown in Figure 5.11.


Figure 5.11.: Conversion of let-binding to 4-way LeftJoin with post-join pipeline.

Rule 5 LetBind to LeftJoin conversion.

End((true())) ⇒ end1
End((true())) ⇒ end2

LetBindvar(pipe(Op(End(e))), out) ⇒ LeftJoin=(end1, Op(end2), LetBindvar(e, GroupBy(End)), out)

Note that the rendered LeftJoin= operator has four children. The first two children are the two join input pipelines, the third child is a special post-join pipeline, and the last child is the normal output pipeline. The post-join pipeline implements a post-processing step, which restores the grouping semantics of the rewritten LetBind. The principle is as follows: Each group of left-join matches, i.e., the join result constructed from one tuple of the left input and the matching tuples from the right input, is fed to the post-join pipeline before the result is passed on to the output. For $a=1, these are the tuples [1,2], [1,3], and [1,4]; for $a=2, these are the tuples [2,3] and [2,4]; and for $a=3, it is the tuple [3,4]. For $a=4, there is no join partner in the right input; thus, the left-join tuple [4,()] does not pass through the post-join pipeline.


Conceptually, the 4-way LeftJoin is a macro, which maps to the original 3-way left-join operator as depicted in Figure 5.12. In the expanded version, a counter variable $grp is introduced to define the virtual post-join partitions described above. The optional nature of the post-join pipeline is realized by a Concat operator, which, dependent on whether or not a left input tuple found a join partner, processes or skips the post pipeline fragment, respectively. We introduce the 4-way representation not only for simplifying the AST, but also because it is more easily translated into an efficient query plan.

The most interesting part of the lifted section is the GroupBy step at the end of the post-join pipeline. In the AST, the grouping function for GroupBy is expressed as a sequence of child expressions, which evaluate to primitive, SQL-like grouping keys. In the 4-way left join, the GroupBy does not specify any grouping keys, which means that all incoming tuples will belong to the same partition. In the normalized 3-way left join, the count variable $grp serves as grouping key. The aggregation of non-grouping columns in a partition is specified by additional operator properties. The default aggregation type for all columns is specified by the property dft. In the 4-way case, it is single, which means that one column value is picked as aggregate for all column values. The default aggregation type is overridden for the $b column, which is aggregated by combining all values to a single sequence as indicated by the property seq:$b. Further details about the specification of aggregates will be given in Section 5.5.

Putting it all together, the left join in Figure 5.11 emits tuples of the form [$a,$b,$c]2. Assuming that aggregation type single picks the column value of the first tuple in a partition, the respective output tuples are [2,2,(3,4)], [3,3,4], and [4,4,()]. Except for the additional binding column $b, this output is equivalent to the original [$a,$c] output tuples of LetBind$c in Figure 5.10. Logically, variable $b is now out of scope and therefore does not influence subsequent operations. Physically, the additional column is harmless ballast, but it can be dropped for efficiency.

2XQuery’s grouping clause simply re-binds aggregated columns to the same variable names.


Figure 5.12.: Expansion of 4-way LeftJoin macro.


Figure 5.13.: Trivial join nesting after lifting.

5.3.2. Lifting Nested Joins

After converting the nested pipeline to the right input of a left join, we can employ standard join group demarcation to effectively prevent continuous rebuilding of the lookup table in the nested join. However, considering the trivial shape of the introduced left join allows further simplification. The left join input is empty; it solely echoes the respective input tuple. Hence, the two join branches can be swapped without tampering with the output order. However, swapping the inputs of a left join requires us to convert the topmost join of the right input into a left join. Of course, the input grouping for the post-join pipeline must also be observed. The post-join pipeline is therefore moved to the left input side, turning the converted 3-way left join into a 4-way left join as shown in Figure 5.13.

If we consider a join as just another representation form of nested loops, the performed input swapping effectively “lifts” the nested iteration, i.e., the right join input, to the outer iteration. Accordingly, Rule 6 is called left-join lifting.


Rule 6 Left-join lifting.

LeftJoin= ⇒ Opj
JoinΘ ⇒ Opj2
End((true())) ⇒ end1
End((true())) ⇒ end2
Θ ∈ {=, <, >, <=, ...}

Opj(end1, Join(left, right, end2), post, out) ⇒ Opj(LeftJoin(left, right, post, end1), end2, End, out)

Cleaning Up

Left-join lifting leaves the AST with a nesting of two left joins, where the parent is a trivial left join against an empty right input. As a last step, this trivial left join is removed from the pipeline with Rule 7.

Rule 7 Trivial left-join removal.

LeftJoin= ⇒ Opj
End((true())) ⇒ end1
End((true())) ⇒ end2

Opj(Right(end1), end2, End, out) ⇒ Right(out)

The final AST for the sample query is shown in Figure 5.14. The major difference from the unlifted version is that the result items of the let-bound FLWOR expression are now directly computed within the upper-level operator pipeline and not in a separate pipe expression. Furthermore, the final AST efficiently exploits the join semantics present in the original nested FLWOR.

5.4. Join Trees

Queries combining data from more than two sources are very common in data management, especially in the context of relational DBMS, where several tables are frequently joined via foreign-key references. In the standard bottom-up representation of relational systems, joins over multiple inputs are, as shown in Figure 5.15, usually modeled as binary join trees distinguished as either left-deep, right-deep, or irregular, i.e., bushy.


Figure 5.14.: Final pipeline after removal of trivial left join.

Figure 5.15.: Bottom-up style join trees: (a) left-deep, (b) right-deep, (c) bushy.


Finding the optimal arrangement of joins has been a multi-faceted challenge since the late 1970s [SAC+79]. It is dependent on various factors such as input cardinalities, join selectivity, I/O, parallelism, etc. In the following, we present basic rewriting rules for transforming linear join sequences into nested forms, which reflect bushy bottom-up join trees. With these rules, advanced join ordering algorithms can be transferred to the top-down representation. To get started, consider the XQuery given in Figure 5.16. It binds 4 inputs successively to the for-bound variables $a, $b, $c, and $d, respectively, and filters the resulting tuple stream with join-like equality predicates. Clearly, the O(n^4) complexity of the naïve nested loops cries for efficient join support. For the purpose of demonstration, we assume again that the expressions e1, e2, e3, and e4 are independent of variable bindings. Note that we also ignore in the following that this particular query actually searches for matches where $a=$b=$c=$d.

for $a in e1
for $b in e2
for $c in e3
for $d in e4
where $a=$b and $b=$c and $c=$d
return e5

Figure 5.16.: Join cascade of for-bound inputs.

After splitting the predicate into multiple where clauses and applying predicate pullup and join rewriting, we obtain the typical right-deep operator pipeline shown in Figure 5.17. Initially, the left join input branch is always the empty pipeline, i.e., each join actually joins the right join input with the (shared) common input. In the bottom-up view, this effectively reflects the join order of a left-deep join tree. For conversion into a bushy join tree, the for-bound input sources must be moved to the independent left and right join input branches of the joins. This is achieved in two steps. The first step is similar to the join input pushdown Rule 3. For two consecutive joins, Rule 8 pushes the upper join down to the left join input branch of the lower join as shown in Figure 5.18.


Figure 5.17.: Pipeline with cascade of join operators.

Of course, the right join input must not be dependent on variables bound by the parent. Like the general join input pushdown rule, this rewriting actually only changes the size of the left join input and the number of evaluations of the second join. Accordingly, the join order remains the same as before.

Rule 8 Input join pushdown.

$x1, ..., $xn ⇒ $X
$v1, ..., $vm ⇒ $V
$y1, ..., $yl ⇒ $Y
$w1, ..., $wk ⇒ $W
JoinΘ $X:$V ⇒ Opj1
JoinΘ $Y:$W ⇒ Opj2
right2 independent of $X
Θ ∈ {=, <, >, <=, ...}

Opj1(left, right, Opj2(left1, right2, out)) ⇒ Opj2(Opj1(left, right, left1), right2, out)

Rule 9 applies the symmetric idea to the output of two consecutive joins and pulls a join into the right join input of its parent. It is illustrated in Figure 5.19. Note that, in contrast to Rule 8, the join order has now changed because the inputs e3 and e4 are joined first.


Figure 5.18.: First step of join tree rewriting.

Rule 9 Output join pullup.

$x1, ..., $xn ⇒ $X
$v1, ..., $vm ⇒ $V
$y1, ..., $yl ⇒ $Y
$w1, ..., $wk ⇒ $W
JoinΘ $X:$V ⇒ Opj1
JoinΘ $Y:$W ⇒ Opj2
End(k) ⇒ end
$x1, ..., $xi bound by left, $xi+1, ..., $xn bound by right
left2 independent of $x1, ..., $xi
right2 independent of $x1, ..., $xn
Θ ∈ {=, <, >, <=, ...}

Opj1(left, Right(end), Opj2(left2, right2, out)) ⇒ Opj1(left, Opj2(Right(left2), right2, end), out)


Figure 5.19.: Second step of join tree rewriting.

Finally, by applying Rule 3 twice, the AST reflects a fully balanced join tree as shown in Figure 5.20. In summary, Rules 8 and 9 provide basic mechanisms for rearranging the order precedence of multiple joins. But again, their benefit is highly dependent on data characteristics and system properties and requires statistics, which are not considered in this work. Join input swapping can also greatly improve query performance. However, it must be applied only if the output order may be changed, too. If this is the case, the rewriting is trivial because the two join input branches are guaranteed to be independent.


Figure 5.20.: Third step of join tree rewriting.

5.5. Aggregation

The default semantics of grouping clauses is regularly a source of severe performance problems. Even queries over moderate data volumes are likely to exhaust available main memory, because the values of non-grouping variables are simply concatenated to sequences. Furthermore, analytical workloads frequently compute standard aggregates on these sequences manually afterwards, which results in additional overhead. Direct integration of tailored aggregation schemes in GroupBy operators is therefore an important aspect.

Figure 5.21 exemplifies a typical situation. A tuple stream with several variable bindings is grouped by one variable, and the aggregated sequences of non-grouping variables are afterwards reduced by the functions count(), sum(), and avg().

for $a in e0
let $b := e1
let $c := e2
group by $b
return [ count($a), sum($c), avg($c) ]

Figure 5.21.: Default aggregation of GroupBy.

Remarkably, $c is reduced twice, i.e., the potentially huge aggregated sequence must be processed by both sum() and avg(). The optimizer takes advantage here of the flexible specification of column aggregate types introduced in Section 5.3.1. Instead of aggregating all non-grouping variables to sequences, it changes the default aggregation type from seq to the cheaper single and introduces manual overrides for those columns which are actually used after the grouping step. Thanks to the hierarchical AST, all usages of non-grouping variables can be found with a simple scan of the operator output subtree. Note that changing the default aggregation type from seq to single is not essential for this optimization. Alternatively, unnecessary grouping overhead can be completely avoided by dropping all non-grouping variables which are not referenced after the grouping.

If a variable like $a in the example is referenced as argument of a supported aggregation function (here: count()), a respective aggregate variable (e.g., $acnt) is introduced as additional output binding to the GroupBy operator, which is computed space- and time-efficiently during the grouping phase.

Figure 5.22.: Optimized aggregation in GroupBy. [The GroupBy now uses dft:single with the aggregate bindings cnt:$a → $acnt, sum:$c → $csum, and avg:$c → $cavg; the Array constructor references $acnt, $csum, and $cavg.]

The referring aggregate function is replaced by a reference to the new variable. If a variable is not used in an aggregation function, but in an arbitrary expression, the default grouping behavior is restored with a respective binding for the aggregation type seq. The demand-driven binding of variable aggregates can be applied to a single non-grouping variable more than once. In the optimized GroupBy of Figure 5.22, two aggregate variables, $csum and $cavg, were introduced for variable $c.
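A sketch of the single-pass evaluation enabled by these aggregate bindings (hypothetical Java; the aggregator shape is illustrative, not the prototype's actual operator code):

// One pass over a partition computes cnt:$a -> $acnt, sum:$c -> $csum,
// and avg:$c -> $cavg, instead of materializing the sequence of $c
// values and reducing it twice afterwards.
class GroupAggregates {
    private long count;   // -> $acnt (each tuple binds one value of $a)
    private double sum;   // -> $csum

    void add(double c) {  // called once per tuple in the partition
        count++;
        sum += c;
    }

    long acnt() { return count; }
    double csum() { return sum; }
    double cavg() { return sum / count; }  // avg($c) shares the state of sum($c)
}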

6. Data Access Optimization

Despite careful data flow optimization within operator pipelines, query performance will not be competitive if the compiler cannot exploit the physical capabilities of the target platform. Hence, storage-specific compilation has been a major concern for the design of the compiler framework. In fact, it is one of its biggest strengths. Almost every aspect is open for platform-optimized code, because the compiler does not presume a specific data layout. The effectiveness of optimizations is necessarily tightly coupled with the concrete system. Some platforms will profit more from tailored data access operations than others, but typical use cases, query patterns, and shapes of semi-structured data suggest storage designs with comparable capabilities and predictable performance profiles. The presented strategies for exploiting platform-specific capabilities are therefore justified by referring to properties and performance characteristics of relational DBMS, native XML-DBMS, and common implementation strategies.

6.1. Generic Data Access

Like any portable architecture, our compiler depends on generic default implementations for all data access operations. As explained in Sections 2.2 and 2.1.2, the basic composition types array and map serve as building blocks upon which a concrete front-end language can define its own abstractions of data, including respective types, constructors, and access operations. Especially with regard to the latter, data abstractions can and should be picked up by the compiler to support efficient operations. Therefore, we must employ wrapper and adapter techniques to abstract from the physical data organization and allow plugging virtually any kind of storage1 and data source into the system.

1The term storage refers in this context to any kind of disk-resident or memory-resident data structure without any implications regarding database-specific functionality like persistence or transactions.


type Node {
  QName name()
  Atomic value()
  Type type()
  Node parent()
  Node first-child()
  Node last-child()
  Node prev-sibling()
  Node next-sibling()
  ...
}

(a) Node interface (b) Elementary navigation

Figure 6.1.: Operations of XML node interface.

In the concrete case of XQuery, the compiler will rely on an interface-based design for abstracting from XDM data types such as sequence, item, and node. Adapter interfaces must provide all basic operations for implementing data-specific query logic in a storage-agnostic manner. XPath expressions, for example, are typically processed as iterative traversals of the XML tree. Accordingly, the interface of an XML node will at least need to expose DOM-like operations as depicted in Figure 6.1 for navigating the tree structure. Note that, without considering efficiency at this point, every storage will be able to support an adapter for these simple operations. Accordingly, we can safely assume that primitive navigation will work as a default option in every system and, therefore, do not go into further details [BBB+09].

More interesting is how we can cope with the drawback of hiding all opportunities for the use of efficient native operations behind an interface. The natural performance penalty of a wrapper-based approach depends on three factors, which will vary between different storages. The main source of inefficiency is the overhead of translating operations between different abstractions. For example, consider the interface operation next-sibling(), which returns the right sibling of an XML node in the tree. It will translate to a cheap pointer dereference operation for a linked tree structure in main memory, but may translate to an expensive scan if the XML tree is stored externally in a relational table.

The second source of inefficiency is indirectly caused by a mismatch in the granularity of operations exposed by the wrapper and supported by the storage layer. In the example above, the navigation step matched the lightweight pointer structure of the in-memory tree, but was too fine-grained to justify a bulk-oriented scan operation. For a path expression //a/b/c, however, a scan operation might be adequate to retrieve all matching nodes at once. That aspect is not necessarily symmetric. While node-wise tree traversal is usually prohibitively expensive on external storage, navigation in main memory is still quite efficient.

The third source of inefficiency results from poor consideration of locality effects. The order and the volumes in which data is accessed by a query will considerably affect performance. Again, this is particularly true for data on external storage, but even operations on in-memory data will be penalized by poor locality. Data locality is especially difficult to address in a portable design because it is a physical property of the system. However, the assumption that logically related data, e.g., all nodes in an XML fragment, is also physically clustered is generally a good heuristic. Still, the nodes of a linked XML tree might also be randomly scattered in memory.

To summarize, we can identify the following key challenges:

Access Granularity The granularity of data accesses performed by a query should be aligned to the capabilities and performance characteristics of the underlying storage.

Access Economy The total number of data accesses should be minimized because every operation will incur some overhead.

Access Locality The locality of operations should be optimized with respect to the physical layout of the data.

In the following, we will investigate common query situations and show how physical data access can be optimized with respect to the above challenges. The optimization strategies pursued will rely on the following principles:


Simplify Complex data access sequences (e.g., path expressions) should be simplified to equivalent but cheaper operations.

Batch Groups and sequences of related data accesses should always be mapped to coarse-grained alternatives.

Short-circuit Superfluous checks and additional operations for special cases should be short-circuited whenever possible.

Choose The cheapest alternative should be chosen if there are several ways to realize an operation or a sequence of operations (e.g., by the use of an index).

6.2. Storage-specific Data Access

The largest fraction of data access operations belongs to navigation routines, which match structural patterns against complex values. The pattern matching itself is a complex process that makes heavy use of relatively fine-grained operations. Accordingly, there is a huge saving potential if navigation routines are shortened. Consider the evaluation of the single-step path expression ./item through the basic adapter interface. To find the target elements, the naïve navigation algorithm will iterate over the children of the context node by repeatedly calling next-sibling() and perform the matching by calling name() and kind() for each node. Executing the same logic directly on the storage is likely to be much faster. This optimization can be seen as some sort of predicate pushdown.
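For illustration, the naive evaluation through the generic interface might look as follows (Java sketch; the interface mirrors Figure 6.1 with camel-cased method names, and the node kind is simplified to an enum):

import java.util.ArrayList;
import java.util.List;

// Simplified node abstraction in the spirit of Figure 6.1.
interface Node {
    String name();
    String value();      // simplified: string value of the node
    Kind kind();
    Node firstChild();
    Node nextSibling();
}

enum Kind { ELEMENT, ATTRIBUTE, TEXT }

// Naive evaluation of ./item: one interface call per visited node
// plus one name/kind check per candidate.
class ChildStep {
    static List<Node> evaluate(Node ctx, String qname) {
        List<Node> result = new ArrayList<>();
        for (Node c = ctx.firstChild(); c != null; c = c.nextSibling()) {
            if (c.kind() == Kind.ELEMENT && qname.equals(c.name())) {
                result.add(c);
            }
        }
        return result;
    }
}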

6.2.1. Native Operations

In the first place, native navigation routines dramatically reduce the code path for the matching loop. Additionally, most storage implementations will be able to leverage additional knowledge about the data, which is not available to the generic runtime. For example, a storage might know the location of child nodes with a particular name or that only one child node will qualify at most. This information can be used by the storage to guide the search or to stop as soon as the final result is fixed.


type Node {
  ...
  Sequence child-elements(QName name)
  Sequence desc-elements(QName name)
  ...
}

Figure 6.2.: Supplementary operations of XML node interface.

The performance gain achieved by offloading query logic to the storage varies not only between different systems but also between different data sets. For example, if the navigated XML subtree is very small, the improvement will also be rather small, but large instances or huge document collections can profit a lot.

Enriched Adapter Interfaces

For frequent query patterns, it is advisable to extend the adapter interface with operations that map the query logic directly to operations on the storage. XPath-based navigation of XML trees, for example, suggests dedicated navigation operations for the important child and descendant axes as shown in Figure 6.2. Compiling step expressions to make use of these operations is trivial. The disadvantage is that storage interfaces are bloated and that similar logic must be implemented in each adapter. However, it is also feasible to treat the implementation of supplementary operations as optional and automatically fall back on standard navigation routines, as sketched below.

At this point, it should be emphasized that respective optimizations are not restricted to XML data. In XQuery, the sequence interface will, for example, profit from operations for positional access to support filter operations like $myseq[last()]. Front-end languages for other data models will similarly suggest useful operations for supporting query logic.
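The optional-with-fallback idea can be expressed, e.g., with a default method (Java sketch reusing the Node and ChildStep shapes from above; not the prototype's actual interface):

import java.util.List;

// Storages that can evaluate child steps natively override
// childElements(); all others inherit the generic fallback.
interface NavigableNode extends Node {
    default List<Node> childElements(String qname) {
        return ChildStep.evaluate(this, qname);  // generic navigation fallback
    }
}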

Direct Embedding

The alternative to the extension of adapter interfaces is a mapping of operations and sequences of operations to custom expressions that compile directly to native operations of the respective storage.

The corresponding AST rewriting logic, the implementations of the new expression types, and everything else necessary can be conveniently packaged in a storage-specific compiler module. The efficiency of native processing can thus be leveraged without affecting other storage modules. The rewriting of expressions to native operations requires the ability to confine data access operations at compile time to a particular storage. In the case of a path expression $a/b/c, for example, the compiler must be able to trace back the variable reference $a to the respective data source, which is typically a function call like fn:doc('data.xml'). Due to the hierarchical scopes in the AST, the analysis is usually simple, and most queries will access data from a single storage anyway. However, if a query accesses multiple data sources, situations may arise in which it is not possible to statically confine operations to a particular storage. In this case, the only option is to make the distinction dynamically and dispatch the operation to the respective storage. But depending on the complexity and the frequency of the operation, the additional runtime overhead can dominate the gains.

Data-specific Simplification

If data source analysis is able to deliver additional information about the data queried, even more powerful rewritings can be performed. Especially the processing of structured data will benefit from the availability of schema information, because navigation steps reduce to structured field accesses rather than structural pattern matching. As an example, consider a database of customer records. The logical and the physical representation of a Customer record may be completely different. Depending on the abstraction principles of the front-end language, it can be logically represented, e.g., as a JSON object or an XML fragment, but physically stored in a relational table, a text file, or a structured value in main memory. For accessing a component of a record, the storage logic will need to perform a lookup, e.g., in a hash table, to map the field name to the corresponding array field or offset. For large data sets, this overhead accumulates and costs performance. With schema information, however, lookups can be performed once at compile time or at the beginning of the query, so that native operations can access the fields directly.
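The effect can be sketched as follows (hypothetical Java; the schema map is an assumption standing in for real schema metadata):

import java.util.Map;

// Schema-based pre-compilation: the name-to-position lookup happens
// once at compile time; each record access is then a plain array index.
class FieldAccess {
    private final int offset;

    // compile time: resolve the field name against the schema
    FieldAccess(Map<String, Integer> schema, String fieldName) {
        this.offset = schema.get(fieldName);  // e.g., "name" -> 1
    }

    // runtime: direct positional access, no per-record hash lookup
    Object evaluate(Object[] record) {
        return record[offset];
    }
}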


Figure 6.3.: Pre-compilation of navigational access for structured data. [The string-based lookups $c=>ID and $c=>name in the Select predicate are replaced by the positional accesses field[$c,0] and field[$c,1].]

Figure 6.4.: Common examples for value coercion: (a) comparisons, e.g., $person/age > 21, (b) constructors, (c) function arguments, e.g., fn:max($p/price).

Figure 6.3 illustrates the pre-compilation of field accesses for a structured Customer record, rendered in the query as a JSON object. The string-based map lookups have been replaced by storage-specific expressions, which directly access the field values through positional access to the underlying array structure.

6.2.2. Eager Value Coercion

The importance of value coercion for processing semi-structured data is regularly underestimated. As the examples in Figure 6.4 show, it is an important aspect of many frequent and therefore performance-critical operations. In the context of operations that coerce their arguments or subexpressions, it is beneficial to perform the coercion eagerly as part of the navigation routines that yield the respective values. It shortens the code path, effectively reduces expensive data access, and prevents the creation of temporary objects. Note that eager coercion can be realized with both extended adapter interfaces and native operations.
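For example, a navigation routine serving the comparison $person/age > 21 might coerce the value while extracting it (Java sketch reusing the Node shape from above; atomization is simplified to value()):

// Eager coercion inside the navigation routine: the content of the
// age element is converted to a double while it is read, avoiding a
// temporary untyped-atomic object that would be coerced later anyway.
class CoercingAccess {
    static double ageAsDouble(Node person) {
        for (Node c = person.firstChild(); c != null; c = c.nextSibling()) {
            if (c.kind() == Kind.ELEMENT && "age".equals(c.name())) {
                return Double.parseDouble(c.value());  // coerce in place
            }
        }
        return Double.NaN;  // no age child (simplification)
    }
}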


Figure 6.5.: Generalization of axis steps to a complex path step. [The individual steps desc-or-self::a, child::b, and child::c applied to $a are merged into the single multi-step operation [desc-or-self::a, child::b, child::c].]

6.2.3. Path Processing

The generalization of individual navigation steps to a single multi-step path operation can deliver performance gains of orders of magnitude. The coarser operation granule comes with less call overhead and has advantages in terms of code and data locality. Furthermore, efficient evaluation of path expressions is a key concern in most XML storages, which is why substantial performance improvements may be expected. If the storage offers direct support for multi-step path operations, the compiler needs only a simple rewriting to merge individual step expressions as depicted in Figure 6.5. Note that this form of rewriting is also the reason why we do not desugar path expressions to FLWORs beforehand. It is much easier and more efficient to compile the original path expression with all steps at once instead of breaking them up into nested FLWORs and later putting the scattered pieces together again.

As explained in Section 3.2.3, XPath requires the result of a path expression to be free of duplicates and sorted in document order. Therefore, the canonical FLWOR-based evaluation of path expressions employs a post-processing step, e.g., a utility function fs:ddo(), for sorting and duplicate removal. However, expensive post-processing is in many cases not necessary, because navigation algorithms delivering unsorted results are uncommon and because duplicates can appear only under certain circumstances. Whether or not a path expression will deliver sorted and duplicate-free results can often be determined statically, but the analysis is extremely complex for non-trivial paths [FHM+05].


Figure 6.6.: Native path processing with runtime checks. [If not($n instance of node()?), the per-node results of scan() over [desc-or-self::a, child::b, child::c] are merged via a FLWOR and post-processed with fs:ddo(); otherwise, the native scan() is applied to $n directly.]

Holistic path processing routines at the storage level also guarantee sorted and duplicate-free results. For example, a popular evaluation strategy for paths in XML stores is localized subtree scans, which use special stack-based algorithms [BKS02] or metadata structures [GW97, MHSB12] to match a path against nodes streaming in. Hence, in addition to general efficiency gains, this form of evaluating multi-step paths on the storage relieves the runtime from expensive query analysis and post-processing overhead. Additional care is only required in situations where the results from multiple path evaluations have to be merged, i.e., if the path matching routine is called for several context items within the same path expression. In this case, the compiler has to generate guarded code with runtime checks as exemplified in Figure 6.6.

In practice, the pushdown of path expressions to the storage has limitations. Most storage implementations do not support paths where intermediate steps have additional predicates like in $a/b[./x = 5]/c. At most, only a few systems are able to evaluate a restricted form of such patterns (see Section 6.3.1). Similarly, many storages do not offer extended support for uncommon axes like preceding-sibling, following, etc. However, the dominating axes child, descendant, descendant-or-self, and attribute will be efficiently supported in most storage designs.
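The guarded evaluation of Figure 6.6 roughly corresponds to the following sketch (hypothetical Java; nativeScan stands for the storage's multi-step path routine and ddo for the fs:ddo() utility):

import java.util.ArrayList;
import java.util.List;

// A single context node can be handed to the native scan directly;
// several context nodes require merging the per-node results and
// restoring document order without duplicates.
class GuardedPath {
    static List<Node> evaluate(List<Node> context, String[] steps) {
        if (context.size() == 1) {
            return nativeScan(context.get(0), steps);  // already sorted, duplicate-free
        }
        List<Node> merged = new ArrayList<>();
        for (Node n : context) {
            merged.addAll(nativeScan(n, steps));
        }
        return ddo(merged);  // sort in document order and remove duplicates
    }

    // placeholders for the storage-specific scan and the fs:ddo() utility
    static List<Node> nativeScan(Node ctx, String[] steps) { return new ArrayList<>(); }
    static List<Node> ddo(List<Node> nodes) { return nodes; }
}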


6.3. Bulk Processing

The importance of efficient bulk processing requires looking at data access operations particularly within the context of operator pipelines. Here we find potential for optimizing binding operations and data source accesses in general, but also opportunities for improving access locality across several pipeline stages. Our primary goal will again be to identify certain access patterns in order to map large parts of the query directly to native operations.

6.3.1. Twig Patterns

The simplest and, at the same time, one of the most frequent situations that we can exploit is a byproduct of the predicate-pullup rewriting shown in Section 5.1. It aimed at reducing the number of tuples flowing through the pipeline as early as possible. The Select filter is floated up the pipeline to the earliest possible point, which is usually directly below the binding operator on which the predicate depends. The idea is now to combine the binding operator and the filter to avoid the creation of binding tuples which would otherwise be immediately discarded by the following filter. Furthermore, we want to exploit data locality by pushing predicate evaluation down to the storage.

Achieving the first goal is trivial. The binding expression and the predicate are combined to a filtered binding expression and the superfluous filter operator is removed. The second goal does not come for free, because the resulting filter expression can be of arbitrary kind and shape. However, the way structured and semi-structured data is processed will frequently lead to a certain access pattern, which we can exploit. Pipelines are used to iterate over and process collections of data. Hence, they always start with one or several binding operators, which feed the data into the pipeline. They bind “data items” of a certain granule or super-structure, and the subsequent pipeline operators navigate within these logical granules to extract further information, e.g., for filtering. Accordingly, we can expect to frequently find the situation that the filter predicate navigates into the tree structure of the data item to be bound.

This form of filtering by a structural pattern is called twig pattern matching, because the downward paths reflect a tree-like structural pattern that is matched against the data.

The frequency and cost of twig pattern matching makes native storage support advantageous. Hence, the topic gained a lot of attention in research, and efficient algorithms for different kinds of storages were developed [Mat09]. Simple but frequent twigs with a single branch and an optional content predicate like in $products//product[./name] are efficiently supported by almost every XML storage. Most designs support even more advanced algorithms for matching complex twigs with multiple branches and arbitrary sequences of child and descendant axes. As a result, many storages will allow compiling filtered binding expressions to native twig operations. The capabilities and limitations of the various proposals are manifold, so we can here refer only to the literature. It should be emphasized, however, that the applicability of twig matching routines is easy to test, because twig patterns will appear very clearly in the pipeline and the AST, respectively. Again, this is owed to the fact that path expressions are not normalized to FLWORs.

6.3.2. Multi-bind Operator

Twig patterns appear in more flavors than simple combinations of binding operators and filters. Indeed, the mentioned downward path matching against structured and semi-structured data items is a central aspect in most queries. Variable-bound items are navigated at many points for filtering, aggregation, result projection, etc. Consider the example given in Figure 6.7(a). From a bird's eye view, one can see how different parts of the same logical abstraction, a product item, are used. Besides its use in the filter condition, the for-bound variable serves multiple times as starting point for paths in the final result construction. The tree structure of the corresponding twig pattern is shown in Figure 6.7(b).

From the physical perspective, such a twig pattern can be processed efficiently if all branches are matched in a single operation, e.g., by fetching all parts of interest in a single subtree scan. Within a for loop, we achieve efficiency through access locality and by reducing the number of expensive I/O operations for disk-based storages. Although the clauses of a FLWOR imply a logical evaluation order, the actual evaluation order can be different, e.g., because of lazy evaluation or rewritings performed by the compiler.


for $p in e1
where $p/@id = 47261
return (element constructor using {$p/name/text()},
        {$p/details/color/text()}, and {$p/details//tag})

(a) Query (b) Twig pattern

Figure 6.7.: Twig pattern within a pipeline loop. [Twig: root $p with branches @id, name/text(), details/color/text(), and details//tag.]

Efficiency through physical and temporal locality is therefore not guaranteed. Surely, the compiler can never give such guarantees, because it will not be in control of the storage. However, it can perform rewritings that allow putting the storage in charge of an efficient evaluation. The strategy pursued is best explained with a running example for the query of Figure 6.7. First, all twig paths are extracted from their context by binding them to let variables as shown in Figure 6.8. Note how simple the extraction is with the hierarchical AST. We only need to pull out nested path expressions starting at the loop variable. In the next step, all twig bindings are moved up to the respective ForBind operator. This is allowed because a LetBind reflects a nested loop with only a single iteration and exchanging such loops will not influence output order. In this sense, it is very similar to predicate pull-up. The resulting pipeline is shown in Figure 6.9.


Figure 6.8.: Pipeline after extraction of twig branches. [The paths $p/@id, $p/name/text(), $p/details/color/text(), and $p/details//tag are bound to the let variables $p1, $p2, $p3, and $p4.]

The last step is the key. The initial binding expression, all filter twig branches, and all binding twig branches are merged into a single twig operation, which can be pushed down to the native storage. At this point, it should be emphasized that twig algorithms are not limited to matching branches for filtering only. Matched twig branches can also be returned as part of the result. The result of a twig operation is then a sequence of twig vectors, each of which consists of the root node of the matched structure and all nodes matched at the twig output branches. For binding the twig result vector, i.e., the loop variable and the additional output paths, the ForBind is generalized accordingly2 – a feasible step, because the underlying Bind operator offers the possibility of binding multiple variables anyway. The resulting AST is shown in Figure 6.10.

2Alternatively, the twig output vector could also be bound as array value.
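A sketch of the generalized binding step (hypothetical Java; the twig scan is assumed to deliver one vector [$p,$p1,$p2,$p3,$p4] per matched root):

import java.util.Iterator;

// MultiBind: each vector from the native twig scan extends the context
// tuple by all five bindings at once, instead of one navigation per path.
class MultiBind {
    private final Iterator<Object[]> twigVectors;  // from the native twig operation

    MultiBind(Iterator<Object[]> twigVectors) {
        this.twigVectors = twigVectors;
    }

    Object[] next(Object[] contextTuple) {
        if (!twigVectors.hasNext()) {
            return null;  // no further match for this context tuple
        }
        Object[] vector = twigVectors.next();
        Object[] out = new Object[contextTuple.length + vector.length];
        System.arraycopy(contextTuple, 0, out, 0, contextTuple.length);
        System.arraycopy(vector, 0, out, contextTuple.length, vector.length);
        return out;
    }
}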


Figure 6.9.: Pipeline after pull-up of twig branches.

Figure 6.10.: Final pipeline with twig vector binding. [A single MultiBind binds $p, $p1, $p2, $p3, and $p4 from one twig operation, followed by the Select on $p1=47261.]


Treating all data matched by the twig pattern as a unit, we can look at the operation as a sort of projection for semi-structured data. Accordingly, twig output paths should be materialized eagerly to leverage the locality benefits of twig matching. For example, the element constructor in the return clause copies all subtrees of the nodes bound to $p4 into the result. Therefore, it will not be sufficient to bind solely the single twig output nodes to variable $p4. The constructor would need to access the storage again to reconstruct each subtree. Instead, the twig should materialize small fragments, thus saving the need for additional storage access. Similarly, the twig algorithm should perform coercion if necessary, too.

If statistics are not available at compile time, the decision whether or not eager materialization is appropriate should be made dynamically, e.g., if the substructure is rather small. It should be kept in mind, however, that the additional bindings performed by this optimization already increase the memory footprint of pipeline tuples, which may have a considerable effect on the memory requirements of blocking operators.

6.3.3. Indexes

In addition to primary storage, database systems as target platform offer indexes as secondary access paths for fast data retrieval. In this area, we can again leverage extensive research, especially in the context of XML. In contrast to relational systems, which specify indexes on column values, semi-structured data can be indexed by three classes of indexes [MHSB12]:

Value Indexes are the simplest form and map atomic values to a corresponding complex value, e.g., the text value of an XML attribute to the respective attribute node.

Path Indexes contain references to structures on a certain path. For example, a path index for the pattern //product/name will contain references to all name elements in a document collection, which are children of product elements.

Content-and-Structure Indexes (CAS) are a combination of the latter and enable value-based retrieval of structures on certain paths.


For example, a CAS index for //product/name[xs:string] allows searching for product names by string.

A specialty of all three index types is that they are scoped to specific types of values and structures on specific paths. In contrast, a relational index on a table contains an entry for each row.

From the compiler's point of view, indexes yield a promising alternative for evaluating various kinds of path expressions and content predicates. Particularly interesting is the use of indexes for evaluating filtered binding expressions. For example, whenever the compiler detects a for-bound path expression, which is possibly also augmented with an additional twig-like predicate as introduced in Section 6.3.1, it will try to find appropriate indexes for evaluating the binding expression. Especially highly selective twig patterns, which are matched against the whole document or collection (e.g., the path starts with fn:doc()), will benefit from the speed of focused data access through a single index or a combination of several indexes.

As always, the crucial step at compilation time is the analysis of which of the available indexes are applicable and which combination of indexes is best. The index selection problem is known to be NP-hard, so we need to fall back on heuristics and, if available, statistics. Interestingly, an index can even be used if its scope does not fully match the twig pattern, i.e., if it is more general and delivers only candidate matches. In this case, the index is used to partially evaluate the twig pattern and the rest is evaluated on the primary storage. The other way around, i.e., the use of an index with a narrow scope for a more general query pattern, is possible if additional metadata is available which guarantees that qualifying results will not be missed. The actual rewriting and compilation steps are identical to normal path rewriting. Hence, we can refer again to [MHSB12] for details and examples on index matching and selection.

7. Parallel Operator Model

Even the most advanced compilation techniques remain without effect if the runtime cannot deliver the expected performance. Query performance, in particular, stands and falls with the efficiency of bulk operations. In general, the classic pull-based model with the open-next-close protocol is a good choice for implementing operators. It is simple and efficient, and most query algorithms are streamlined to it. However, current trends in multi-core and many-core architectures require new designs, which naturally embrace parallelism and utilize available resources. Such a design requires careful balancing of various aspects. One must not only identify which parts of a query can be processed in parallel, but must also find a good balance of parallelism and overhead for scheduling and synchronization.

Achieving high parallelism is particularly difficult if the system is not tied to a particular data storage, because one cannot expect to be in control of important aspects like data partitioning. Instead, opportunities for effective parallelization must be identified, created, and exploited dynamically at runtime. Existing solutions for parallel processing in the pull-based model are not suitable, because parallelism is statically compiled into the query pipeline and often confined to single operators, e.g., a parallel join algorithm.

In the following, we present a novel push-based operator model, which dynamically partitions data flows and pipeline operators for parallel processing. In the tradition of previous chapters, the concept developed assumes very little knowledge about the query and the actual data itself. Parallelization is performed automatically at runtime and adapts itself to the current workload situation. It neither presumes the availability of cost functions and statistics for the query plan nor does it depend on input of a specific kind, size, or shape. But if desired, the presented solution can be enriched with compile-time or runtime knowledge to improve performance.


7.1. Speedup vs. Scaleup

The goal of parallelization is called speedup. It is a measure of the performance improvement of a parallel execution in comparison to the serial execution.

The literature knows two definitions of speedup. The classic definition of Amdahl [Amd67] calculates speedup with regard to the faster time to completion of a parallel program. The alternative definition of Gustafson [Gus88] calculates speedup as the capability to use parallelism for compensating scaling of the problem size. In some contexts, it is therefore also called scaleup [DG92].
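Stated as formulas (a standard textbook formulation, not quoted from [Amd67, Gus88]), with p the parallelizable fraction of the work and n the number of processors, the two notions read:

S_{\mathrm{Amdahl}}(n) = \frac{1}{(1-p) + p/n}, \qquad S_{\mathrm{Gustafson}}(n) = (1-p) + p \cdot n

For p < 1, the first expression is bounded by 1/(1-p) no matter how large n grows, whereas the second grows linearly in n.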

Which of the two definitions applies best depends on the concrete use case. Some scenarios require parallelism to reduce the time spent on a task, e.g., to improve the response time of a server, whereas others require parallelism to scale a system to larger data volumes. Interestingly, the two definitions need not lead to completely different systems. In many cases, a better parallel design will yield improvements with respect to both definitions.

Independent of the parallelization goal, all efforts will only pay off if a reasonably large part of the work can benefit from it. Accordingly, a high degree of parallelism must be established as fast as possible and also kept alive as long as possible. The concrete challenges that need to be addressed here are manifold. Sequential operations and I/O can turn out to be bottlenecks, as can various forms of resource contention and load thrashing. Furthermore, it is generally difficult to distribute operations on semi-structured data.

In this setting, optimal parallel execution of a query is almost impossible to guarantee. The execution framework presented therefore pursues the philosophy to create at least opportunities for parallelism between operators and within operators. Whether they are exploited or not will depend on the available resources at runtime. The speedup and scaleup achieved will thus be sensitive to various factors, but improve the more chances for parallelism are taken and the less overhead is imposed by the infrastructure itself.


for $a in ea
for $b in eb
for $c in ec
where pred($a,$b,$c)
order by $c
let $d := ed
return ($a,$b,$c,$d)

Figure 7.1.: FLWOR iteration tree.

7.2. Parallel Nested Loops

The idea for the parallel operator model builds on the parallelization of nested loops – a characteristic of operator pipelines that we mentioned several times. Figure 7.1 illustrates nested iterations for a typical FLWOR expression with several variable bindings, a where clause, and an order by clause. Starting from a single context tuple which is passed to a pipe expression, sequences of Bind operators form a tree of nested iteration scopes. The fan-out of a node in the tree is determined by the size of the respective binding expression. In case of Select and LetBind, the fan-out is at most one, but a ForBind may easily reach millions. The operators OrderBy, GroupBy, and Count, as well as the final End, create the corresponding fan-in and optionally a fan-out.


The parallel execution of nested loops has been extensively studied in the literature. A large part of the research work concentrated on techniques for re-organizing loop constructs and loop nests to eliminate loop-carried dependences [WB87]. The presence of dependences implies a full or partial order in which individual iterations of a loop must be executed to yield the correct result. For example, the simplest form of a loop-carried dependence is a counter variable, which is incremented and used in each iteration. In general programming, various other forms of dependences exist and typically require smart analysis and rewriting techniques to peel out exploitable parallelism from a loop nesting [KA02]. Note that dependences are of a structural nature only and do not depend on the actual data itself.

Fortunately, operator pipelines are per se free of dependences, because data is immutable and variables are assigned only once within a loop. Furthermore, iterations do not share an intermediate state. Thus, operators can be freely parallelized. In the following, we can therefore safely focus on the actual creation and scheduling of parallel work.
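The two situations in a minimal Java sketch (illustrative, not tied to the operator model):

class Dependences {
    // Loop-carried dependence: iteration i reads the counter value
    // written by iteration i-1, so iterations cannot simply run in parallel.
    static void withDependence(int n) {
        int counter = 0;
        for (int i = 0; i < n; i++) {
            counter++;                    // value flows across iterations
            System.out.println(counter);
        }
    }

    // Dependence-free, pipeline-style loop: every iteration reads only
    // immutable input and writes its own slot; any execution order works.
    static void withoutDependence(int[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = 2 * in[i];
        }
    }
}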

Forms of Parallelism

The iteration graph in Figure 7.1 visualizes the two options for creating parallelism in an operator pipeline. Pipeline parallelism is achieved by “cutting” the graph horizontally into decoupled processing stages, which run in parallel. In the extreme case, each operator in the sequence is assigned to a separate process1. Data parallelism is achieved by cutting the graph vertically, i.e., multiple “copies” of each operator are applied in parallel to partitions of the input. Both strategies have individual advantages and drawbacks.

Pipeline parallelism is simple to realize and therefore very popular for parallelizing producer/consumer-like constellations. It is very convenient even in pull-based query plans; one must only introduce artificial exchange operators in the pipeline [Gra94]. On the flip side, pipeline parallelism is very sensitive to load skew. Optimally, the workload is balanced and the processing of all pipeline stages overlaps as much as possible. Practically, however, this is hardly the case.

1The term process commonly denotes a unit of scheduling and program execution. For the sake of simplicity, we abstract from its concrete realization as operating system process, thread, etc.

For example, binding operators act like multipliers and typically emit more tuples than they consume, as rendered in Figure 7.1. In a parallel pipeline, this increases the load on the respective following operators. Another form of workload skew occurs if one stage is more costly than others. The resulting load imbalance quickly disrupts the pipeline flow and results in poor parallelism. This is particularly bad as pipelines already need a warmup time before the first inputs reach later stages. Intermediate blocking operators like OrderBy further reduce the benefit of pipeline parallelism. Last but not least, pipeline parallelism suffers from limited scalability. The maximum degree of parallelism is fixed by the pipeline structure, i.e., the approach cannot scale with the number of available resources.

Given a sufficiently large input, data parallelism scales, at least theoretically, much better, because it distributes the load over all available resources. However, data parallelism is relatively complex to realize and must be explicitly implemented within each operator. Especially blocking operators must be coded carefully with parallel algorithms and low-overhead data structures. In practice, the scalability of data parallelism is limited, because data dependencies and inherently serial operations do not allow decomposing programs entirely into data-parallel operations. The need to synchronize access to shared resources is another performance threat. Data-intensive scenarios particularly suffer here from I/O, because external storage is slow and does not support random access well.

Task Granularity

Splitting a query into independent units of work is the first and at the same time the most critical step of parallelization: Fine-grained work units achieve a good load balance, i.e., high parallelism, but cause a high scheduling overhead. Coarse-grained work units impose only minimal overhead, but then the system is likely to suffer from load skew. Translated into the context of parallel nested loops, we need to find a good balance between serial and parallel execution of loop iterations. With respect to a single loop, the maximum degree of parallelism is achieved by scheduling each individual iteration as a separate, parallel task. However, any larger loop will anyway exceed the number of available processors by orders of magnitude, so that the potential parallelism cannot be exploited.

Besides, this approach results in a high overhead per iteration, which only pays off if the loop body is very expensive. Scheduling tasks for whole bunches of iterations in turn reduces the theoretical degree of parallelism, but amortizes overhead. For optimal parallelism, loop iterations must be evenly distributed among the available processing units. However, an equal number of iterations per task achieves a good balance only if the cost of the iterations is equally uniform. If some iterations are considerably more costly than others, the workload is again skewed. Optimal parallelization of loop nestings is especially difficult as it requires considering the cost and, thus, the fan-out of inner loops, too. Accordingly, smart and fast loop partitioning is the key to high parallel performance.

7.2.1. Data Partitioning

The literature proposes various techniques and partitioning schemes for parallelizing nested loops. Conceptually, they all share similar ideas and schedule loop iterations either block-wise or with a constant stride over the input. Unfortunately, most ideas cannot be directly applied in the query processing context, because research on parallel systems focuses on loops over array-organized data in main memory and assumes that the loop size is known beforehand and that data can be accessed randomly using the loop variable as index.
In our case, binding operators loop over sequences, which is a powerful abstraction for data processing, but one that comes with a scalability problem: it is inherently sequential. The work cannot be naïvely partitioned by simply specifying ranges over a loop variable, because the size of a sequence is often unknown and not all kinds of sequences support efficient random access. Accordingly, an effective realization of loop-level parallelism depends on the capabilities of the binding sequence, and the runtime must find the best way to partition it as quickly as possible.
If a sequence supports random access, it is best partitioned with a divide-and-conquer strategy, because this very quickly creates a high fan-out, i.e., parallelism, as shown on the left side of Figure 7.2. Even better, the recursive partitioning can run in parallel as well. In contrast, sequential binding sequences can only be partitioned step-wise as shown

on the right side of Figure 7.2. This is inferior to the divide-and-conquer scheme in every regard: it takes longer to reach the maximum fan-out, and the splitting process itself is inherently sequential.
Of course, both partitioning schemes can also be combined. For example, if the input sequence reflects a scan operation, it can be fetched chunk-wise into memory, where it can be efficiently partitioned with the divide-and-conquer pattern. In the opposite situation, if the input sequence supports random access but individually computed chunks must afterwards be joined again in their initial order, e.g., at the final Return, splitting the input asymmetrically into a smaller head and a larger tail chunk reduces the need to buffer already processed results from “later” iterations. Hence, choosing the right partition size is crucial for CPU and memory efficiency.
Lazy binding sequences come with the additional penalty that merely consuming the sequence is already an expensive operation. Partitioning a lazy sequence can easily take a considerable share of the total processing time of a loop. In the worst case, the partitioning will even take longer than the actual parallel computation of the loop body. However, if the processing of a partition is more costly than the partitioning step, even sequential inputs can benefit from parallel processing.
Under our initial assumption that statistical information about inputs and the query itself is not available or too complex to infer, we cannot expect to find the optimal degree of parallelism at compile time. Therefore, we pursue an optimistic, dynamic variant of parallel loop scheduling. We use relatively small partitions but employ a lightweight scheduling mechanism, which accommodates both the sequential and the divide-and-conquer partitioning scheme and adaptively regulates the number of utilized processes.
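To make the two schemes concrete, the following sketch contrasts recursive and step-wise partitioning over a generic input in Java, the implementation language of our prototype. It is only a minimal illustration under our own naming (emit, minSize, chunkSize), not code taken from the prototype.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class Partitioning {
    // Divide-and-conquer partitioning of a random-access input: both recursive
    // calls are independent and could themselves be executed in parallel.
    static <T> void recursive(List<T> in, int from, int to, int minSize, Consumer<List<T>> emit) {
        if (to - from <= minSize) {
            emit.accept(in.subList(from, to)); // small enough: emit one partition
            return;
        }
        int mid = (from + to) >>> 1;
        recursive(in, from, mid, minSize, emit);
        recursive(in, mid, to, minSize, emit);
    }

    // Step-wise partitioning of a sequential input: only a head chunk can be
    // cut off at a time, so the splitting loop itself remains sequential.
    static <T> void stepwise(Iterator<T> in, int chunkSize, Consumer<List<T>> emit) {
        while (in.hasNext()) {
            List<T> chunk = new ArrayList<>(chunkSize);
            for (int i = 0; i < chunkSize && in.hasNext(); i++) {
                chunk.add(in.next());
            }
            emit.accept(chunk);
        }
    }
}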

7.2.2. Task Scheduling

The complement to input partitioning is the assignment of loop partitions to parallel processes in the form of tasks. At the beginning, a query is assigned to a single process, which drives the computation by continuously creating subtasks for parallel execution. However, we do not directly assign each subtask to a parallel process. The system is designed to treat parallelism as always optional: it allows the number of processes assigned to a query to be increased or decreased at any time.

Figure 7.2.: Recursive and sequential input partitioning.

Therefore, the initial process not only creates the tasks, it also executes them. If additional processes are available or become available at runtime, they will take over some tasks and may thereby create new subtasks themselves. If the system needs the helping processes again for other tasks, e.g., for a second query issued by another client, the initial process continues by processing the remaining subtasks itself. This adaptive approach is often referred to as self-scheduling [PK87, Pol88, KA02].
For load balancing, self-scheduling algorithms employ a technique called work stealing, which has been widely studied and implemented in the context of parallel shared-memory architectures [BL94, FLR98, Rei07, TCBV10]. The key idea of work-stealing schedulers is the use of one or several task queues in which processes place spawned tasks and from which parallel processes “steal”. Therefore, lightweight and highly concurrent queue operations are a central concern of most work-stealing schedulers [HS02, CL05, SLS06].
Numerical and scientific applications, the usual adopters of work-stealing techniques, offer great opportunities for highly parallel input partitioning and result consolidation.

Figure 7.3.: The fork/join task dequeue of a worker process.

Subtasks usually operate in memory only and have predictable input and output sizes (e.g., matrix multiplication). In query processing, the situation is usually more complex, because the actual output sizes of individual subtasks may differ considerably (e.g., filter operations) and buffering of large intermediate results may require expensive I/O. Additionally, the scheduling mechanism must often obey (partial) ordering constraints, e.g., to accommodate output order preservation in a pipeline.

Fork/Join Model

The fork/join model is ideal for scheduling tasks from inputs that support efficient random access. The design principle is very popular in parallel programming, because it can be directly applied to any recursive divide-and-conquer algorithm [FLR98, Lea00]. A given task is recursively divided into smaller subtasks until it is small enough for sequential processing. While unwinding the recursion, the partial results are merged. The pseudocode in Listing 5 shows the skeleton of a fork/join computation.
Parallelism is achieved by processing only one half of the divided input, whereas the second one is forked, i.e., made available for parallel processing. Once the own share of the work has been processed, the spawned task is joined, i.e., the initiating process waits for its completion by a parallel process or, if necessary, completes it itself. For efficient forking and joining, every process maintains its tasks in a double-ended queue (deque) as shown in Figure 7.3. One queue end is exclusively used by the owning thread for pushing (fork) and popping (join) tasks. The other queue end is solely accessed by free processes seeking pending tasks (steal).


Listing 5 Parallel computation in fork/join model.
 1: function compute(task)
 2:   if task is small enough then
 3:     compute task
 4:     return result
 5:   else
 6:     split up task in subtask1 and subtask2
 7:     schedule subtask2 for parallel execution (fork)
 8:     compute subtask1
 9:     wait for/complete subtask2 (join)
10:     merge results from subtask1 and subtask2
11:     return merged result
12:   end if
13: end function
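In Java, the skeleton of Listing 5 maps almost one-to-one onto the JDK fork/join framework that grew out of [Lea00]. The following sketch sums an array; the threshold and the class name are illustrative choices of ours, not part of the prototype.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // assumed cutoff for sequential processing
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {            // small enough: compute sequentially
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;             // split up task in subtask1 and subtask2
        SumTask subtask2 = new SumTask(data, mid, to);
        subtask2.fork();                         // fork: offer the second half for stealing
        long left = new SumTask(data, from, mid).compute(); // compute the first half ourselves
        return left + subtask2.join();           // join: wait for (or help complete) the fork
    }
}
// usage: long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));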

Besides the elegant realization of lightweight parallelism at variable granularities – the task size depends on the recursion level at which it was created – fork/join has several other advantages. In practice, the number of steals is fairly low, because at the beginning larger subtasks are forked, which quickly saturate all free processes. Thereafter, the framework does not require any further load balancing. In contrast to many other parallelization schemes, the model also allows composing nestings of parallel computations, i.e., every elementary subtask can be the root task of a nested parallel computation. Furthermore, fork/join algorithms have good locality with respect to the input, because processes execute forked subtasks in LIFO order. In fact, the processing order of fork/join with a single process is identical to sequential processing. In the general case, a task is seamlessly distributed over any number of processes.
Essential for the fork/join model are the fast creation of subtasks and a reasonably large recursion depth to keep enough subtasks in the queue for stealing. Note that an overflow of the task queue is unlikely because of the logarithmic depth of a typical divide-and-conquer recursion. However, asymmetric partitioning patterns are prohibitive, because they provoke “steal-back” thrashing and result in poor locality.


Producer/Consumer Model

The basic producer/consumer model is suitable for scheduling tasks from sequential inputs. The initiating process, the producer, continuously creates tasks and submits them to a simple FIFO queue, while one or more free processes, the consumers, poll the queue for new tasks and process them in parallel.
Even though this is the simplest of all parallel patterns, it has several advantages. Additional consumers can be dynamically added or removed and, if necessary, the single producer process can at any time take over the consumer role and process generated subtasks interleaved with the input partitioning. If sufficient computing resources are available, the producer can process the input almost without interruption and at very high speed. If the input is read from external storage, chances are good to utilize the full sequential I/O bandwidth. Finally, the synchronization overhead of the task queue is as low as in the fork/join model. Further communication between parallel processes is not required.
In contrast to the fork/join model, the producer/consumer model is more sensitive to task granularity. Because the former creates tasks of different granularities, it is more likely to hit a partition size that leads to a good balance of parallelism and runtime overhead. A producer process, however, schedules only tasks of a certain size. If the tasks are too small, they provoke a high overhead; if they are too large, they cost parallelism. Clearly, one may collect queueing statistics and use heuristics to adjust the size of subtasks dynamically, but dynamically determining the task size that balances the enqueue/dequeue ratio and optimally utilizes available processors, memory, and I/O is prohibitively costly.
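The pattern can be sketched in Java with a bounded queue whose capacity throttles the producer when consumers fall behind. Queue capacity, the task type, and the EOF sentinel below are illustrative assumptions of ours, not details of the prototype.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ChunkPipeline {
    private static final Runnable EOF = () -> {}; // sentinel marking the end of the input

    static void run(Iterable<List<String>> chunks, int consumers) throws InterruptedException {
        // bounded queue: put() blocks the producer if the consumers fall behind
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(64);
        Thread[] workers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (Runnable t = queue.take(); t != EOF; t = queue.take()) {
                        t.run();                  // consumer: execute one chunk task
                    }
                    queue.put(EOF);               // pass the sentinel on to the next consumer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (List<String> chunk : chunks) {       // producer: cut the sequential input into tasks
            queue.put(() -> chunk.forEach(System.out::println));
        }
        queue.put(EOF);
        for (Thread w : workers) w.join();
    }
}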

Worker Pool Organization

To meet the requirements of both sequential and random-access binding sequences, we use a pool of pre-allocated worker processes, which supports scheduling queues for fork/join tasks and producer/consumer tasks at the same time. A schematic overview of the pool is given in Figure 7.4. As standard in fork/join models, each worker has a local task deque, which is used by the owning process like a stack and by others like a queue. For producer/consumer tasks, the pool maintains a global task queue, which is uniformly accessed by all workers in a FIFO manner. Idle processes are queued until additional work is available.

Figure 7.4.: The fork/join pool with 5 worker threads.

The number of allocated workers should be chosen with respect to the number of available hardware contexts. A pool with fewer workers than hardware contexts cannot fully utilize the computing capabilities of the platform, but may be reasonable if some resources need to be reserved for system services etc. More workers than hardware contexts cause overhead for additional context switches, but may also keep the system utilized if processes are frequently suspended for I/O.
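As a rule of thumb, the pool size can be derived from the number of hardware contexts reported by the JVM; reserving one context for system services, as below, is merely an illustrative policy.

// Hypothetical pool sizing along the lines discussed above.
int hardwareContexts = Runtime.getRuntime().availableProcessors();
int poolSize = Math.max(1, hardwareContexts - 1); // leave one context for system services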


The main loop executed by each worker process is shown in Listing 6. First, the worker looks for a pending task at the tail of its local queue (line 4), then at the head of the global task queue (line 6), and finally at the heads of the local queues of other workers (line 9). This process continues until a pending task is found for execution (lines 12-13). To avoid needless spinning and expensive memory synchronization between parallel processes, the worker is suspended when a retry threshold is reached (lines 17-22). Before blocking, the worker enqueues itself in the idle worker queue (line 17) and then checks again whether a new task was assigned to it after the local task queue was last inspected (line 18). Without this re-check, the unblock() signal could be missed if the worker had already been removed from the idle queue (line 20).

Listing 6 Main loop of worker processes.
 1: function run
 2:   retries ← 0
 3:   while TRUE do
 4:     t ← try_dequeue_from_local_tail()
 5:     if t is null then
 6:       t ← try_dequeue_from_pool_head()
 7:     end if
 8:     if t is null then
 9:       t ← try_steal_from_local_queues()
10:     end if
11:     if t is not null then
12:       execute(t)
13:       retries ← 0
14:     else if retries has not reached threshold then
15:       retries ← retries + 1
16:     else
17:       enqueue_in_idle_queue()
18:       t ← try_dequeue_from_local_tail()
19:       if t is null then
20:         block(_self)
21:         retries ← 0
22:       end if
23:     end if
24:   end while
25: end function

The general task dispatch routine is shown in Listing 7. First, the idle queue is inspected (line 2). If a free worker is found, the task is placed in the worker’s local queue and the worker is unblocked (lines 4-5). Otherwise, the task is added to the global task queue, where it will be picked up by a running worker as soon as it has drained its local task queue.
The task join routine shown in Listing 8 is almost identical to the main loop of a worker. If the joined task is not yet completed (line 9), the joining worker uses the time in between to process pending tasks from the global queue (lines 11 and 20), the local queue (line 14), and the local queues of others (line 17). The flag sequential is used to ensure that tasks are always joined according to the caller’s local forking scheme.


Listing 7 Dispatching fork/join and producer/consumer tasks.
 1: function dispatch(task)
 2:   w ← try_dequeue_from_idle_queue()
 3:   if w is not null then
 4:     enqueue_in_worker_queue(w, task)
 5:     unblock(w)
 6:     return TRUE
 7:   else
 8:     enqueue_in_task_queue(task)
 9:     return FALSE
10:   end if
11: end function

This is crucial, because joining tasks in arbitrary orders may lead to overflowing task queues. If a fork/join task is joined, the most recent task in the local queue will be the joined task itself, so executing the found pending task (line 23) ends the loop. However, if the joined task was stolen by a parallel worker, the loop continues seeking further uncompleted tasks. After several unsuccessful attempts to find work, the worker is suspended until the joined task is completed (line 28). The corresponding notification will be issued by the respective stealer.

7.3. Operator Sinks

As shown in Section 4.3.2, the pull-based and push-based operator models have a fairly similar structure. They differ merely in the philosophy of whether something is actively produced or reactively delivered. In this respect, the eager nature of the push-based model suits the parallel computation of individual loop iterations better than the reactive pull-based model does. The realization of parallelism within a sink pipeline is explained in detail in the following sections.


Listing 8 Task join routine.
 1: function join(task, sequential)
 2:   if sequential then
 3:     done ← try_execute(task)
 4:     if done then
 5:       return
 6:     end if
 7:   end if
 8:   retries ← 0
 9:   while task is not finished do
10:     if sequential then
11:       t ← try_dequeue_from_pool_head()
12:     end if
13:     if t is null then
14:       t ← try_dequeue_from_local_tail()
15:     end if
16:     if t is null then
17:       t ← try_steal_from_local_queues()
18:     end if
19:     if t is null and not sequential then
20:       t ← try_dequeue_from_pool_head()
21:     end if
22:     if t is not null then
23:       execute(t)
24:       retries ← 0
25:     else if retries has not reached threshold then
26:       retries ← retries + 1
27:     else
28:       wait_for_task(task, _self)
29:       retries ← 0
30:     end if
31:   end while
32: end function


type Sink {
  void begin()
  void output(Buffer buffer)
  void end()
  Sink fork()
  Sink partition(Sink stop)
}

(a) Sink interface (b) Sink state automaton

Figure 7.5.: Extended operator sink model for parallel processing.
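In Java, the interface of Figure 7.5(a) could look as follows; the Buffer stub merely stands in for the framework’s tuple-buffer type and is only a placeholder here.

interface Buffer { /* placeholder for the framework's tuple buffer */ }

interface Sink {
    void begin();                  // initialize the sink before the first output
    void output(Buffer buffer);    // process one buffer of tuples
    void end();                    // signal that all input tuples were received
    Sink fork();                   // independent copy for safe parallel use
    Sink partition(Sink stopAt);   // independent fork for blocking sinks, up to the boundary
}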

7.3.1. Parallel Data Flow Graphs

The basic begin-output-end interface of a sink has been extended as shown in Figure 7.5(a). The output() operation accepts a buffer of tuples instead of a single tuple at a time. The operations fork() and partition() were added to dynamically create a data-parallel graph of sinks. Except for the terminating Return, every sink writes to a single output sink, which is used according to the state diagram2 in Figure 7.5(b).
First, we look at the normal case. Before the first output is emitted to a sink, the operation begin() is called to initialize it. It is followed by a sequence of output() calls, which pass over buffers of tuples to be processed. Finally, end() signals the sink that all input tuples were received. In a non-blocking sink like ForBind, the end() is just propagated to the subsequent sink, but in a blocking operator like OrderBy, it triggers the sorting of all tuples received, which are then emitted to the next sink in a series of output() calls.
Parallelism within the pipeline is created by using worker processes in some sinks to actively “pump” data through the pipeline. If such a sink receives tuples via an output() call, it encapsulates the local processing of the received tuples, together with the duty of propagating the result to the next sink, in a task, which is submitted to the worker pool. Conceptually, this creates pipeline parallelism similar to the exchange operator [Gra94]. By dividing the work into multiple tasks, however,

2For the purpose of demonstrating the core concept, we excluded the handling of error conditions.

the sink is capable of creating data parallelism as well.
In general, multiple output tasks must not use the same sink in parallel, e.g., because this would cause race conditions within the sink or result in a chaotic tuple order. Therefore, each task must have its own private sink, which is created by forking the original one. The sink returned by calling fork() is an independent copy of the sink, which can be safely used in parallel. Of course, every forked sink can itself be forked again. Usually, forking requires forking the subsequent sink, too. In fact, the forking process propagates through the pipeline up to sinks that do not operate on a per-tuple level, i.e., the blocking sink types OrderBy, GroupBy, and Return and the enumerator Count.
The operation partition() is a special variant of fork(), which creates independent forks for blocking sink types. The fork obtained by calling partition() on an OrderBy sink, for example, will only sort the tuples arriving at itself and at all “normal” forks of it. Its purpose will be explained in Section 7.3.4.
The forking mechanism dynamically turns the straight pipeline into a directed acyclic graph as depicted in Figure 7.6. Obviously, a sink graph shares similarity with the nested iteration tree of a FLWOR like the one in Figure 7.1. Accordingly, we distinguish between fan-out sinks (ForBind, LetBind, Select, . . . ) and fan-in sinks (OrderBy, GroupBy, Count, Return).
In principle, every sink type can make use of the forking mechanism. However, for practical and performance reasons, we use it only in ForBind sinks, because they embody the expensive looping nature3 that we want to parallelize. Hence, ForBind sinks are the active drivers in the pipeline, which is indicated in Figure 7.6 by the shaded boxes around them.

7.3.2. Fan-Out Sinks

The fan-out sink types LetBind and Select are the simplest kinds of sinks. They process a single tuple at a time and output at most the same number of tuples as were received. Furthermore, they do not use a shared data structure like a hash table, so that the processing logic need not be adapted for parallel work at all. A fork simply creates a copy for the simultaneously forked output sink.

3The same looping nature is of course also present in, e.g., Join and WindowBind. For brevity, we focus the discussion on ForBind, but the presented mechanism can be transferred to other looping sink types, too.

Figure 7.6.: Fork tree of operator sinks.

As an example, Listing 9 demonstrates the simple structure of a LetBind4 sink. Note that begin() and end() are not listed, because they solely propagate the call to the next sink.

4Note that we use an object-oriented style in the following listings: variables starting with an underscore (e.g., _sink) denote member variables, and the arrow notation (→) indicates a method call.

Listing 9 Operations of a LetBind sink.
 1: function LetBind:output(buffer)
 2:   for each tuple in buffer do
 3:     sequence ← _binding_expr→evaluate(tuple)
 4:     output_tuple ← concat(tuple, sequence)
 5:     replace(buffer, tuple, output_tuple)
 6:   end for
 7:   _sink→output(buffer)
 8: end function

 9: function LetBind:fork()
10:   sink ← _sink→fork()
11:   fork ← create_letbind_sink(_binding_expr, sink)
12:   return fork
13: end function

14: function LetBind:partition(stopAt)
15:   sink ← _sink→partition(stopAt)
16:   partition ← create_letbind_sink(_binding_expr, sink)
17:   return partition
18: end function

The implementation of ForBind is similar, but, because we want it to drive the parallel computation, output() requires a bit more effort, as shown in Listing 10. First, the output logic is decoupled from the ForBind by creating and executing, for each input tuple, a binding task, which encapsulates the tuple, the corresponding binding sequence, and an output sink (line 7). Note that each task is executed directly and not submitted to the worker pool. The scheduling of parallel work will happen later if a bind task turns out to be large enough.
Noteworthy is that the sink is forked each time a new task is created (lines 5-6). This is necessary because we cannot know in advance whether further bind tasks will be created, e.g., when output() is called again. However, we know that every task created in the future must be completely decoupled from the previously created ones. Hence, each task needs to have its own fork of the sink, which in turn requires always keeping one fork spare. To correctly signal the final end() to the subsequent sink, we must therefore “fake” proper use of the spare sink by issuing an extra begin() before end().

Listing 10 Output operation of a ForBind sink.
 1: function ForBind:output(buffer)
 2:   for each tuple in buffer do
 3:     sequence ← _binding_expr→evaluate(tuple)
 4:     if sequence is not empty sequence then
 5:       sink ← _sink
 6:       _sink ← sink→fork()
 7:       task ← create_bind_task(sink, tuple, sequence)
 8:       task→process_partition()
 9:     end if
10:   end for
11: end function

The routine process_partition(), shown in Listing 11, is the centerpiece of the framework. It is structured similarly to the fork/join pattern sketched in Listing 5. First, the binding sequence is partitioned into two halves, part1 and part2 (line 2). If the binding sequence is small enough to be handled by a single process, part2 will be empty and part1 – the binding sequence itself – is directly processed (line 4). However, if the binding sequence is large enough, it will be processed in parallel.

Sequential Binding

The sequential processing of small binding sequences or partitions is straightforward. The pseudocode is given in Listing 12. In a simple loop over the given input sequence, the output tuples are created by concatenating the sequence elements to the current context tuple (line 5) and added to an output buffer (line 6). Whenever the buffer size is exceeded during the loop, and once more at the end, all tuples produced so far are emitted to the output sink and the buffer is cleared (lines 7-10 and 12-13).

Parallel Binding of Random Access Sequences

If the binding sequence is large enough for parallel processing and supports random access, the two partitions are processed with the recursive divide-and-conquer strategy of the fork/join model (lines 6-11). The partitions are encapsulated in separate subtasks (lines 6-8), whereby the sink is forked for the second subtask. The latter is then forked for parallel execution, i.e., submitted to the worker pool, before the first one is executed by the current process (lines 9-10). Afterwards, the second task is joined (line 11). The recursion ends when the created partitions are small enough for sequential processing.


Listing 11 Execution logic of a bind task.
 1: function BindTask:process_partition()
 2:   part1, part2 ← split(_sequence, MIN_SIZE, MAX_SIZE)
 3:   if part2 is empty then
 4:     bind(part1)
 5:   else if _sequence supports random access then
 6:     sink2 ← _sink→fork()
 7:     task1 ← create_bind_task(_sink, _tuple, part1)
 8:     task2 ← create_bind_task(sink2, _tuple, part2)
 9:     fork(task2)
10:     task1→process_partition()
11:     join(task2, false)
12:   else
13:     queue ← allocate_queue()
14:     while TRUE do
15:       task ← create_bind_task(_sink, _tuple, part1)
16:       if part2 is not empty then
17:         _sink ← _sink→fork()
18:       end if
19:       dispatch(task)
20:       enqueue(queue, task)
21:       if part2 is empty then
22:         break
23:       end if
24:       if size(queue) reaches threshold then
25:         dispatched ← dequeue(queue)
26:         join(dispatched, true)
27:       end if
28:       part1, part2 ← split(part2, MIN_SIZE, MAX_SIZE)
29:     end while
30:     while queue is not empty do
31:       dispatched ← dequeue(queue)
32:       join(dispatched, true)
33:     end while
34:   end if
35: end function


Listing 12 Bind operation in bind tasks.
 1: function BindTask:bind(part)
 2:   _sink→begin()
 3:   buffer ← allocate_buffer(BUF_SIZE)
 4:   for each element in part do
 5:     output_tuple ← concat(_tuple, element)
 6:     add(buffer, output_tuple)
 7:     if buffer is full then
 8:       _sink→output(buffer)
 9:       clear(buffer)
10:     end if
11:   end for
12:   _sink→output(buffer)
13:   clear(buffer)
14:   _sink→end()
15: end function

Parallel Binding of Sequential Sequences

Large sequential binding sequences are processed in a loop, which dispatches a binding task for the respective head partition to the global task queue of the worker pool (lines 15-19). The rationale behind this loop is the assumption that free workers will take care of the submitted tasks while the current process can continue with the – potentially expensive – iteration of the binding sequence. Note that each task created will be small enough to be processed without further partitioning.
To keep track of the dispatched tasks, they are held in a local queue (line 20). If the queue size reaches a certain threshold, the oldest task is dequeued and joined to ensure that new tasks are not created faster than they are processed (lines 24-27). Furthermore, this mechanism guarantees continuous progress, because the joined task will be indirectly executed by the joining process if it has not yet been adopted by a free worker (see Section 7.2.2).
At the end of each iteration, the second partition is split again, i.e., a new head chunk is pulled from the sequential input (line 28). The looping continues until the sequence has been fully consumed (lines 21-22). At the end, the remaining tasks in the submission queue are joined to ensure that all of them will be fully processed (lines 30-33).


Sequence Partitioning

The crucial point in the whole routine is clearly the partitioning of the binding sequence. Several aspects play a role here. If the sequence is materialized, e.g., because the binding expression was eagerly evaluated, we know its size and can easily choose a suitable splitting point. Most likely, the materialized sequence also supports random access, so that we can represent the partitions with lightweight pointers into the base sequence and need not cautiously keep an eye on how much additional memory we need for copying the data into materialized partitions.
If the sequence is lazy, which is the normal case, the partitioning is more difficult. The partitioning step will typically not know the actual size of the sequence, because determining it would require computing the entire sequence. In general, it will also not be possible to create two lazy partitions by splitting up the computation logic encapsulated in a lazy sequence. Accordingly, the partitioning step will need to materialize at least the items for part1, whereas part2 reflects the remaining part of the partially evaluated sequence. Note that this nicely fits the parallel framework, because the computation of the remaining items will likely happen in a separate worker process.
To keep control of the memory used for the materialization, the parameters MIN_SIZE and MAX_SIZE specify minimum and maximum buffer size limits for the partitioning step, respectively. If the binding sequence is smaller than MIN_SIZE, then part1 is the (materialized) sequence itself and part2 is empty. If the binding sequence is larger than MAX_SIZE, then the leading MAX_SIZE share of the binding sequence is materialized in part1 and the (un-materialized) rest in part2. Note that in the rather rare cases where a lazy sequence can be partitioned without any partial materialization (e.g., a range expression 1 to 10000), the two parameters can be ignored completely.
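The following sketch shows how such a split step could look for a lazy sequence modeled as a plain iterator; the method name and the use of java.util.Iterator are our own simplifications, not the prototype’s actual sequence abstraction.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class LazySplit {
    // Materialize at most maxSize leading items as part1. The iterator itself
    // remains as the un-materialized part2; part2 is empty iff the iterator
    // is drained. If fewer than minSize items were produced and the iterator
    // is drained, the caller processes part1 sequentially without splitting.
    static <T> List<T> splitHead(Iterator<T> lazy, int maxSize) {
        List<T> part1 = new ArrayList<>();
        while (part1.size() < maxSize && lazy.hasNext()) {
            part1.add(lazy.next());
        }
        return part1;
    }
}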

7.3.3. Fan-in Sinks

Fan-in sinks come in two flavors. Some are independent of the order in which tuples arrive; others require that the input is received in exactly the same order as in a serial execution. This character of a fan-in sink is not necessarily determined by the type of the operator; it can be the

case that one operator is realized by two different sink implementations, of which the compiler chooses the appropriate one. For example, if a FLWOR pipeline is evaluated in an ordered context, the Return sink will need to enforce the correct (serial) output order. But if the output order need not be preserved, the results from all forks can be collected in parallel, which is typically much faster. In the following, we describe the skeletons for realizing both classes of fan-in sinks and explain how these skeletons are used to implement concrete sinks for the different operator types.

Concurrent Sinks

The easiest way to create an order-insensitive fan-in sink is to implement fork() so that it returns the sink itself instead of creating a real fork. Because every parallel branch in the fork tree will then write to the same sink, some aspects need to be considered. First, one must be aware that, because of the parallel use, the operations begin(), output(), and end() will be called multiple times and in arbitrary schedules. Second, one must take appropriate measures against race conditions, because these calls may be concurrent.
Fortunately, the nature of a concurrent sink is transparent, and every caller will just use its fork according to the protocol shown in Figure 7.5(b). Thus, we know that, even in a highly parallel environment, the very first call will be a begin() and the very last call will be an end(). Consequently, one must only keep track of how many “virtual” sinks have been forked and are still active to find out which of the concurrent begin() and end() calls really indicate the begin and end of the output() calls. Chronologically intermediate calls of begin() and end() may be safely ignored.
The basic skeleton for a concurrent sink is shown in Listing 13. It can be used for various types of order-insensitive sinks like OrderBy, GroupBy (hash-based), or Return (unordered). For lightweight and highly parallel bookkeeping, we use two simple integer variables, _state and _count, for toggling the state and counting the number of active forks, respectively. Both variables are exclusively modified with atomic memory instructions, because locks or mutexes are prohibitively expensive under highly concurrent access.


The member variable _state is initialized with 0. While the value of _state is not 2, each caller of begin() will try with an atomic compare-and-swap operation to change the value from 0 to 1 (line 4). The first and only one that succeeds will perform the actual initialization (indicated by line 6); all others will be trapped in the spinlock (lines 3-9) until the winner has finished the initialization step and set the value of _state to 2 (line 7). All future calls of begin() will read the atomic variable only once (line 2), which is a very cheap operation on most platforms.
The bookkeeping in end() and fork() is similarly lightweight. They decrement and increment, respectively, the atomic variable _count, which is initialized with 1. If end() has decremented _count to 0, all “virtual” sinks have been closed and the sink can be finalized (lines 13-15).

Listing 13 Skeleton of a concurrent sink.
 1: function begin()
 2:   state ← volatile_get(_state)
 3:   while state ≠ 2 do
 4:     first ← atomic_cmp_and_swap(_state, 0, 1)
 5:     if first then
 6:       ... ▷ sink-specific initialization
 7:       volatile_set(_state, 2)
 8:     end if
 9:   end while
10: end function

11: function end()
12:   alive ← atomic_decrement_and_get(_count)
13:   if alive is 0 then
14:     ... ▷ sink-specific housekeeping
15:   end if
16: end function

17: function fork()
18:   alive ← atomic_increment_and_get(_count)
19:   return self
20: end function
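A Java rendering of this bookkeeping could rely on java.util.concurrent.atomic; the sketch below is our illustration of Listing 13, with initialize() and finish() standing in for the sink-specific parts.

import java.util.concurrent.atomic.AtomicInteger;

abstract class ConcurrentSinkSkeleton {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = new, 1 = initializing, 2 = ready
    private final AtomicInteger count = new AtomicInteger(1); // number of active "virtual" sinks

    void begin() {
        while (state.get() != 2) {               // cheap volatile read on the fast path
            if (state.compareAndSet(0, 1)) {     // exactly one caller wins the initialization
                initialize();                    // sink-specific initialization
                state.set(2);                    // publish readiness to spinning callers
            }
        }
    }

    void end() {
        if (count.decrementAndGet() == 0) {      // the last active fork finalizes the sink
            finish();                            // sink-specific housekeeping
        }
    }

    ConcurrentSinkSkeleton fork() {
        count.incrementAndGet();                 // register another "virtual" sink
        return this;                             // a concurrent sink returns itself
    }

    protected abstract void initialize();
    protected abstract void finish();
}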


The presented base routines are already concurrency-safe, but note that output() must still be protected against race conditions. In some cases, a simple mutex will be sufficient, e.g., in OrderBy for adding the given tuples to the sort buffer. In other cases, operation-specific synchronization or concurrent data structures (e.g., for the hash table in a GroupBy) will achieve better parallelism. Note also that the scalability of sorting and aggregation is often improved by organizing them in multiple parallel stages, e.g., sort-merge trees or multi-stage computation of additive aggregates like sum(), min(), and max(). Evidently, such advanced evaluation strategies can be realized in the sink model and with the presented skeleton, too.

Sink Chaining

Order-sensitive fan-in sinks can be realized without expensive sorting. The key observation for their realization is that they can restore the serial tuple order by themselves. The forking mechanism ensures that the created fan-out already produces a proper left-to-right order among the individual fork branches at each level. Each branch guarantees the correct order for its own partition anyway, because it is executed by a single process. Accordingly, fan-in sinks can restore the serial order for tuples arriving in parallel by keeping track of the logical order relationships between the created forks.
The cheapest way to control the serial ordering of sinks is illustrated in Figure 7.7. The forks are connected in a pointer chain, which points from each sink to the respective next one. Note that even if the sinks are forked concurrently in arbitrary schedules, the creation of a fork remains a local operation, which can be realized without any synchronization.
The tuples will arrive at the forks in a globally random but locally serial order. Hence, to restore the ordering, some tuples must be buffered locally until the logical serial process has reached the respective sink. The progress of this serial tuple flow is marked by a token, which is passed on from left to right through the pointer chain. At the beginning, the token is at the left-most sink – the root sink. The sink that owns the token is allowed to process the tuples received via output() directly. In parallel, all others can only perform possible preprocessing steps, but must defer the actual serial part of the processing.

Figure 7.7.: Token passing and tuple buffering in serial fan-in sinks.

For example, a fork of a Return sink may evaluate the return expression if it does not possess the token, but it must not yet, e.g., append the result to the result sequence.
If end() is called on the owning sink, the token is passed on to the right sibling by following the next pointer. If end() has already been called for the sibling sink, any pending tuples are processed and the token is passed further to the next sibling in the chain. This propagation process continues until the token is handed over to an unfinished sink or the end of the chain is reached.
Token passing must be carefully coordinated and synchronized to work properly. Particularly, output() and end() require further explanation, whereas begin() and fork() are trivial and omitted for brevity. The overall concept is easiest understood by starting with the output() routine shown in Listing 14. The integer member variable _token holds the token state of the sink, expressed here with the symbolic constants NEED, AWAIT, and HAS. The root sink initializes _token with HAS; all forks initialize _token with NEED. Note that we use an atomic variable because the state is accessed and modified concurrently.
Each time output() is called, the sink’s token state is checked (line 2). If the sink has the token, possibly pending tuples from previous output() calls are handled first (lines 4-6). Then, the actual tuples received are processed (indicated by line 7). Without the token, the given tuples are just added to a local buffer (line 9).


Listing 14 Conditional output handling in chained sinks.
 1: function output(buffer)
 2:   t ← volatile_get(_token)
 3:   if t = HAS then
 4:     if has pending output then
 5:       process_pending()
 6:     end if
 7:     ... ▷ sink-specific tuple processing
 8:   else
 9:     add_pending(buffer)
10:   end if
11: end function

Pseudocode for end() is given in Listing 15. If the sink possesses the token (lines 14-20), pending output is processed if necessary (lines 15-17), the token is promoted (line 18), and sink-type-specific housekeeping operations (e.g., closing file handles and propagating end() to the output sink) are performed (indicated by line 19).
If the sink does not possess the token at the beginning, it tries to switch the token state atomically from NEED to AWAIT (line 5). If the compare-and-swap operation fails, the predecessor must concurrently have passed the token5 on to the sink (line 12), and the algorithm continues just as if the token had been owned at the beginning of end(). If the compare-and-swap operation succeeds, the predecessor is put in charge of processing all pending output and propagating the token. The current process can exit and continue with executing other tasks (line 10). However, if parallel workers fill the local buffers faster than the token propagation process can drain them, uncontrolled parallelism will quickly lead to memory exhaustion. Hence, to give the slower token propagation a chance to catch up and free utilized buffer memory, the current worker is suspended if necessary until the token reaches the sink (lines 7-9).
The routine promote_token(), shown in Listing 16, propagates the token to the next successor that has not yet finished. As complement to the compare-and-swap operation in end(), it tries to switch the state

5A real implementation might also use a state to signal and propagate errors, but this is left out here for brevity.


Listing 15 Finalization of a chained sink.
 1: function end()
 2:   t ← volatile_get(_token)
 3:   if t = NEED then
 4:     _worker ← current_worker()
 5:     done ← atomic_cmp_and_swap(_token, NEED, AWAIT)
 6:     if done then
 7:       if suspend recommended then
 8:         suspend(_worker)
 9:       end if
10:       return
11:     end if
12:     t ← volatile_get(_token)
13:   end if
14:   if t = HAS then
15:     if has pending output then
16:       process_pending()
17:     end if
18:     promote_token()
19:     ... ▷ sink-specific housekeeping
20:   end if
21: end function

of the successor sink atomically from NEED to HAS (line 4). If it is successful, the routine ends (line 6). Otherwise, the calling process wakes up the suspended worker, if necessary (line 8), takes care of the pending output left behind (lines 9-11), and performs the respective housekeeping (indicated by line 12). Afterwards, it starts a new attempt for the next successor in the chain (line 13). If the end of the chain is reached, sink-specific cleanup operations are performed (line 16).
The serialization of parallel tuple streams plays an important role in the framework, because the operator semantics defaults to order-preserving output. Besides the ordered version of Return, the skeleton described is also used for implementing Count or a sequential variant of GroupBy, which efficiently aggregates inputs that are already sorted by the grouping key. Furthermore, the mechanism can be used to realize a


Listing 16 Token propagation in chained sinks.
 1: function promote_token()
 2:   n ← _next
 3:   while n is not null do
 4:     done ← atomic_cmp_and_swap(n→_token, NEED, HAS)
 5:     if done then
 6:       return
 7:     end if
 8:     unsuspend(n→_worker)
 9:     if n has pending output then
10:       n→process_pending()
11:     end if
12:     ... ▷ housekeeping as in normal end()
13:     n ← n→_next
14:   end while
15:   if n is null then
16:     ... ▷ finalize sink chain
17:   end if
18: end function

universal Valve operator, which can be plugged into a pipeline wherever a serialization of parallel output is required.

Technically, chained sinks are special, because they are the only components that need to manually interfere with the scheduler of the operating system. Like any form of serialization, they can and sometimes must slow down parallelism and suspend worker processes. Note that deadlock freedom is nevertheless guaranteed, because both scheduling mechanisms use a binary partitioning scheme and always give execution precedence to the left partition. As a result, at least one process – the one assigned to the task with the current token-owning sink – will be able to make progress. The proof of this property is given in Appendix B.


7.3.4. Join Sink

The join operator belongs to a special category of fan-out operators. In contrast to a simple binding operator, it computes for each input tuple a combination of the results of two pipelines – the left input and the right input. As discussed in Section 5.2.3, there is often the chance to re-use the result of one or both input pipelines if their outcome is independent or at least partially independent of the input tuples received. In particular, we considered join groups, i.e., sequences of input tuples that allow re-using the inner join table once it has been loaded with the output of the right input pipeline.
Exploiting join groups in a parallel setting is a challenging task, because tuples arrive randomly at multiple forks. Furthermore, the lookup table is a shared resource, which is concurrently accessed by many sinks. Even though loading and probing happen in disjoint phases, everything must be properly timed and synchronized. Tuples from the left join input pipeline can be probed against the same lookup table only if they were produced from input tuples of the same join group. Altogether, this calls for a careful design, which impacts parallelism as little as possible.
Figure 7.8 illustrates the conceptual realization of a join. In contrast to other operators, it consists of three individual sink types. The actual Join sink at the top orchestrates the join computation. For each input tuple received, it checks whether the respective lookup table must be rebuilt, i.e., whether the input tuple belongs to a different join group than the previously processed one. If necessary, it then instantiates a new sink pipeline for the right join input, which outputs to a Load sink. Then the respective input tuple is fed to this pipeline and the output tuples are loaded into the lookup table. When the lookup table has been built, the Join sink can pass the input tuple further to the left input pipeline, which in turn outputs to a Probe sink. There, the join candidate tuples are probed against the lookup table, and the actual join output tuples are constructed and emitted to the next pipeline operator.
Because a join group is actually a group of related tuples, which all belong to the same superordinate iteration, they appear sequentially in the input tuple stream. By exploiting this property and enforcing that all tuples from the same join group are processed together, we can avoid computing the same lookup table twice.

Figure 7.8.: Loading and probing of join lookup table.

Therefore, Join is realized as a chained sink, but with a less strict emphasis on sequential order. We are only interested in a partial order, which preserves the locality of join groups. Hence, the token passing is not used to control which fork is allowed to produce output (i.e., to feed tuples to the left join pipeline), but which sink has the right to rebuild the lookup table. Once the lookup table has been loaded by the fork that owns the token, all successor forks can probe this table in parallel. If a successor fork receives tuples from “later” join groups, it must wait for the token to load the respective lookup table itself, or at least wait until one of its predecessors does so. Of course, we can apply the buffering mechanism again to keep parallelism alive as long as possible. Note also that the idea can be adapted to implement optimized variants of Concat, Union, and Intersect, too.
The left and the right join input pipelines are not subject to any restrictions with respect to parallelism and forking. A Load sink is a concurrent fan-in sink, which adds arriving tuples to the lookup table in parallel. However, if the pipeline is evaluated in an ordered context, a serializing Valve must be prepended so that the lookup table can be loaded in the serialized tuple order. A Probe is a normal fan-out sink.



Figure 7.9.: Partitioned fork of optional post-join pipeline.

The realization of the optional post-join pipeline of LeftJoin is a simple extension of a normal join, as shown in Figure 7.9. The optional post pipeline is directly attached to the Probe sink. It is separated from the actual join output pipeline by an additional PostEnd sink at the end. This fourth sink type serves two purposes. It is used as target to bypass the post-join pipeline if a probe tuple does not find a join partner, and it plays a special role as boundary for a special type of forking.
As detailed in Section 5.3.1, the post-join pipeline reflects a nested evaluation scope. Hence, each join result must be processed separately. In particular, this includes all relation-based operations (e.g., GroupBy), which must span only tuples belonging to the same join match. Therefore, Probe issues partition() on the post-join pipeline to create an independent sink branch for the join matches of each probe tuple. As already mentioned, partition() is a special kind of fork() that creates independent forks for fan-in sinks. Like a normal fork operation, the call propagates through the pipeline. When the propagation reaches the PostEnd that was given by Probe as parameter, the PostEnd finalizes the post-join partition and propagates it as a normal fork().


As visualized in Figure 7.9, the partition mechanism is also used for forking. If fork() is called on a Probe sink, it is propagated through the optional pipeline as partition(). The corresponding PostEnd then translates it back into a normal fork(). A callback mechanism, indicated by the red arrows in the figure, establishes the connection between the newly created forks of Probe and PostEnd.

7.4. Performance Considerations

The framework presented is designed to create and maintain parallelism automatically. However, it is not completely “knob-less”. Besides the worker pool size, there are several other factors that influence the parallel performance of a query. In the following, we consider three central aspects and discuss their influence on the general performance of the system.

7.4.1. Partitioning

The central factor in any parallel system is the granularity of parallel work. In the sink framework, it is controlled by the size of the binding sequence partitions created in a ForBind. As already mentioned, partitioning in a nested-loops scenario is difficult, because the cost of an iteration is hard to determine. Furthermore, the partitioning of sequential and lazy sequences is particularly critical. Certainly, the framework will not be able to determine good partition sizes without statistics and cost estimates. But even if at least the size of a binding sequence is known at runtime, a good partition size is not obvious. Hence, the system or the user necessarily needs to provide default values, whether they are appropriate or not.
Large partitions are preferable if the average loop cost per output tuple is low. However, coarse-grained partitioning can also result in poor parallelism and load skew. Furthermore, large partitions will consume a lot of memory if the partitioning mechanism must materialize lazy inputs. Smaller partitions are better for expensive loop bodies. They allow for higher parallelism, but may impose a large overhead per iteration. Ideally, the partition sizes can be controlled individually for individual pipeline levels. In many situations, domain knowledge and the

nesting depth of a binding operator already yield reliable information for choosing good partition sizes. Therefore, the prototype developed for this thesis allows hinting a good partition size to the compiler.

7.4.2. Buffer Memory

Chained sinks use local buffers for storing already computed intermediate results. Clearly, the more memory they are allowed to allocate, the better they will be able to keep parallelism alive. However, a hard limit for the maximum total amount of buffer memory is difficult to realize. The buffer memory currently in use is a quickly fluctuating value; it is concurrently increased by parallel workers and decreased by the current token owner. Hence, controlling and enforcing the total amount of memory used exactly would create a heavily contended hotspot. Therefore, our prototype relaxes the hard limit to reduce overhead. It uses a single atomic integer variable to keep track of the amount of buffer memory in use, but updates the value only if a sink is closed and some tuples have been added to the local buffer. Accordingly, it may happen that sometimes even more memory is used than actually intended. However, if the memory use overruns the threshold, the suspension mechanism described will slow down parallelism until sufficient buffer memory has been reclaimed.
Nevertheless, the configured buffer memory is rather a guideline for the system and should be set defensively to avoid memory exhaustion. Practically, the maximum amount of overuse possible is indirectly determined by the number of available processes and the maximum size of the partitions emitted by the parent fan-out sink.
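A minimal sketch of such relaxed accounting is given below, under our own naming; the prototype’s actual bookkeeping may differ in detail.

import java.util.concurrent.atomic.AtomicLong;

final class BufferMemoryTracker {
    private final AtomicLong inUse = new AtomicLong(); // buffered bytes, updated lazily
    private final long softLimit;

    BufferMemoryTracker(long softLimit) {
        this.softLimit = softLimit;
    }

    // called only when a sink is closed and has buffered some tuples,
    // so the counter may temporarily lag behind the true usage
    void addOnClose(long bufferedBytes) {
        inUse.addAndGet(bufferedBytes);
    }

    // called by the token owner after draining a local buffer
    void release(long bytes) {
        inUse.addAndGet(-bytes);
    }

    // workers consult this to decide whether to suspend until memory is reclaimed
    boolean overLimit() {
        return inUse.get() > softLimit;
    }
}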

7.4.3. Process Management

The reason for rampant use of buffer memory is an imperfect parallel input provision for a serializing operation. A typical situation is illustrated in Figure 7.10. It shows a snapshot of the parallel output produced in a multi-level fork tree. The bar at the bottom visualizes the entire output that is fed to a chained fan-in sink. The dark-colored area at the bottom left of the bar stands for output already emitted and processed by the serializing sink. The light-colored areas show buffered output, which arrived too early for the serializing operation.

Figure 7.10.: Snapshot of serialized output in a parallel computation.

Obviously, less data would have to be buffered if the parallel work concentrated on the left root branch of the fork tree first. But for several reasons, it is extremely difficult to avoid scattered output on the one hand and to achieve high parallelism on the other hand. There are many unknown or uncertain factors in a pipeline, like input sizes, selectivities, and operation overhead, which will influence the time needed to process a particular branch in the fork tree. Hence, it is almost impossible to make reliable estimates and optimize the partitioning process. Furthermore, the fork tree grows recursively and in parallel, making it impossible to make globally optimal decisions without hampering parallelism through expensive control mechanisms.
Suspension of workers is the last resort when the memory shortage does not settle by itself. Although it is very effective for fighting excessive memory usage, a better use of the available computing resources is desirable. The actual goal is to reduce parallelism in the chained sink, but suspended processes are also unavailable for speeding up overall performance. Possibly, they would be better utilized for other tasks or queries running in parallel. Pools with more workers than available hardware contexts and dynamically spawned processes are sometimes feasible solutions for compensating the temporary loss of computing power. However, they also increase the overhead for process management and context switching. Therefore, our prototype only simulates the suspension of a worker and processes other tasks while “waiting” for the token.

8. Evaluation

The concepts and techniques developed in this thesis have been implemented for empirical evaluation. In the following, we present and discuss the results of various experiments, in which we evaluated the effectiveness of and the interplay between optimization rules and compilation techniques.

8.1. Experimental Setup

The compiler prototype implemented for this thesis is written in Java and uses XQuery as front-end language. To prove that our concepts are able to meet real-world requirements, we built a mature compiler, which has full coverage of XQuery 1.0, supports the group-by and count clauses of XQuery 3.0, and implements the XQuery Update Facility. Large parts of it have also been published as the open-source XQuery engine Brackit1.
The optimizer performs rule-based optimization and applies sets of transformation rules in separate stages, as detailed in Section 4.3.1. For testing specific optimizations, individual rules can be enabled and disabled as desired. By default, the runtime of the prototype uses a standard pull-based operator model with an open-next-close interface. For the parallelization experiments in Section 8.5, we also implemented the push-based operator model presented in Chapter 7.
To assess the compiler inside a database context, we implemented a native XDBMS prototype called BrackitDB from scratch. The system offers full ACID guarantees and internally uses the developed compiler and runtime for query processing. The XML store is inspired by the path-oriented storage of the native XDBMS XTC and stores the nodes of XML documents keyed by DeweyID labels in a specialized, disk-resident B+-tree [HHMW05]. The storage supports fine-grained

1Source code is available at http://brackit.org.

navigation at the node level as well as efficient scan access. Furthermore, it features a rich set of advanced indexes for fast XML retrieval by content predicates and path predicates as in [MHS09].
The measurements were conducted on a dual Intel Xeon server with 4 cores at 2.66 GHz, 4 GB main memory, and two 500 GB SATA drives in RAID level 3. The operating system was Ubuntu Linux 10.04 64-bit with kernel version “2.6.32-24-server”. For the parallelization benchmarks in Section 8.5, we used a quad Intel Xeon server with 24 cores at 2.93 GHz, 96 GB main memory, and a 1 TB SATA disk drive under Ubuntu Linux 10.04 64-bit with kernel version “2.6.32-41-server”. The Java version used on both servers was Oracle Java 1.6.0 64-bit.
For all test queries, we report the fastest runtime out of 10 timed runs. The timings include compilation time and serialization of the result to a /dev/null output stream. For the database benchmarks in Section 8.3, the results were completely transferred from the database server to the client residing on the same machine. For tests with Java-based systems, we performed a sufficient number of warm-up runs before the measurements, so that the JIT compiler could optimize the code. The benchmark queries used can be found in Appendix C.

8.2. Main-memory Processing

Typical XQuery use cases for data integration and message processing operate on relatively small XML documents in main memory. Accordingly, we start by evaluating whether our prototype meets the standard requirements of this setting: fast compile times and a lightweight runtime for traversing XML trees.

8.2.1. Workload

To assess the XML performance of our prototype, we used the widely accepted XMark benchmark [SWK+02]. The suite consists of a data generator and 20 read-only queries, which cover a broad range of XQuery features and optimization challenges. XMark models an auction platform with persons, items, bids, etc., which are represented as data-centric XML fragments interspersed with small markup sections.

The database consists of a single document, which organizes the different entity types as clustered siblings under the root node. Although this kind of data organization is rarely used in practice – semi-structured databases primarily consist of large collections of small, independent documents – XMark is a valuable tool for assessing the quality of a query compiler, as its wide-spread use in research and industry shows. Except for a few queries that navigate paths with a descendant axis step, the benchmark uses only normal child steps and attribute steps. Each query consists of a top-level FLWOR expression, which navigates into one or several document regions, where it performs more or less complex forms of filtering and result transformation. Every query addresses a certain optimization aspect, which requires the whole spectrum from logical transformation and pruning to physical optimization at the operator and storage levels. Accordingly, the main cost drivers of a query may vary considerably between two systems. In the following, we briefly summarize the main challenges to give the reader better insight into the reported results.

Query Q1 is a point query, which searches for a person with a certain id attribute value. Q2 and Q3 evaluate path matching routines for multiple paths with (redundant) positional predicates. Q4 contains a quantified expression as filter predicate, which requires evaluating document order between fragments. Q5 is a simple filter-and-count query, which also assesses the performance of the frequently required type casting from an uninterpreted string to a numeric value. Q6 and Q7 are scan-intensive queries, which contain multiple paths with descendant steps. Q8 and Q9 evaluate nested 1:n joins and 1:n:1 joins, respectively, of fragments from separate document regions. Q10 performs an n:m join, which selectively copies many sub-elements from qualified fragments to construct complex results. Q11 computes an n:m join, which returns a large number of results. Q12 is almost identical to Q11, but contains an additional filter predicate, which reduces the result size. Q13 performs only simple navigation and aims at fast reconstruction of the original document structure. Q14 scans the document structure and performs substring matching on text content. Q15 navigates a deep child path with 13 axis steps. Q16 uses a deep child path with 10 steps as an existential predicate. Q17 uses a test for missing elements as filter predicate. Q18 evaluates the efficiency of a simple user-defined function as predicate. Q19 tests the sort performance with an order by clause.


Q20 performs multiple count aggregations by evaluating identical path expressions with similar filter predicates to manually compute disjoint partitions of qualified nodes.
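To convey the style of these queries, the following sketch is close in spirit to the 1:n join query Q8 (paraphrased from memory, not quoted from the benchmark suite): for every person, it counts the closed auctions in which that person was the buyer.

    let $auction := doc("auction.xml")
    for $p in $auction/site/people/person
    let $purchases :=
      for $t in $auction/site/closed_auctions/closed_auction
      where $t/buyer/@person = $p/@id
      return $t
    return <item person="{ $p/name/text() }">{ count($purchases) }</item>

Without further optimization, the inner FLWOR is re-evaluated for every person, which explains the quadratic behavior discussed in the next section.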

8.2.2. Pipeline Optimization

In our first experiment, we addressed the basic performance of the prototype and the effectiveness of pipeline optimizations. The optimizer configuration nested closely resembles standard XQuery semantics with nested evaluation and performs only basic pipeline rewritings like predicate pull-up. The unnested configuration additionally performs pipeline lifting and optimized aggregation. Finally, the configuration join employs the full set of pipeline optimizations, including join rewriting. We ran the benchmark with scale factor 0.1, i.e., the queries were evaluated on a 12MB document in main memory. Path expressions were processed with basic navigation routines.

The benchmark results are shown in Figure 8.1. Although a 12MB document is already quite large for typical XML processing in main memory, the benchmark queries were not a real challenge for the prototype. Except for the join queries Q8–Q12, query times are below 5ms, which proves fast query compilation and efficient operators. However, the results for the join queries also demonstrate how nested for loops can quickly lead to dramatic increases in response time. The difference between the configurations nested and unnested is only marginal. In most cases, nested evaluation even performed slightly better, because pipeline lifting results in larger tuples and, thus, higher overhead. The benefit of FLWOR unnesting as a preparation step becomes apparent for the join queries. With the optimizer configuration join enabled, all implicit joins were detected by the compiler, which led to performance gains of up to a factor of 20. The small remaining spike for Q10 is caused by the relatively complex result construction, which navigates 11 short paths for each join match to collect data.
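The following hypothetical query (element and attribute names are invented for illustration) shows what these rewritings operate on: pipeline lifting merges the let-bound inner FLWOR into the outer pipeline, the count over the lifted bindings becomes a grouped aggregate, and join rewriting finally evaluates the correlation predicate $b/@item = $i/@id with a join operator instead of repeated scans.

    for $i in $doc//item
    let $bids :=
      for $b in $doc//bid
      where $b/@item = $i/@id
      return $b
    where count($bids) > 5
    return $i/name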

8.2.3. Competitors

In the second experiment, we compared the main-memory performance of our engine against current state-of-the-art competitors using the XMark query set.


[Bar chart: Time [ms] per XMark query for the configurations nested, unnested, and join]

              Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10
    nested     2    3    4    2    1    2    3   56   72   21
    unnested   2    3    5    2    1    1    3   58   77   22
    join       1    3    4    3    1    2    3    4    5   16

             Q11  Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20
    nested   100   30   <1    5    1    1    1    3    4    3
    unnested 101   34    1    4    1    1    1    2    4    3
    join       5    3   <1    4   <1    1    1    3    3    2

Figure 8.1.: XMark performance for different pipeline optimization levels on a 12MB document in main memory.

The test field consists of both mature and widely used processors [BAS, Kay08, BBB+09, XQI, QIZ] and research or in-development systems [MXQ, QEX, VXQ]. Except for implementation details and individual optimization capabilities, all systems operate on pluggable main-memory stores and evaluate queries in the standard nested fashion with lazy iterators over intermediate results. Because of the spread in runtime and system performance, we again chose scale factor 0.1. Except for XQilla and Zorba, which are realized in C++, all tested engines are implemented in Java and were configured with a maximum heap size of 1.5GB.

Figure 8.2 shows the fastest runtimes for each query and system. Note that, to keep the diagram readable, the graph includes only the results of the fastest six processors. Clearly, our prototype shares the lead with Qizx and delivers high performance for each query type. Qizx is the only other main-memory engine that was also able to detect and exploit join semantics in Q8–Q12. All other engines suffered from the nested loops.


[Log-scale chart: Time [ms] (1–10000) per XMark query for Brackit, BaseX, Qizx, Saxon, XQilla, and Zorba]

                     Q1   Q2   Q3   Q4   Q5   Q6   Q7     Q8     Q9   Q10
    Brackit           1    3    4    3    1    2    3      4      5    16
    BaseX 6.7         1    4    9    8    2   22   54   3066   3504   440
    MXQuery 0.6.0   106  147  200  147  142  166  299  1.0e6  1.5e6  8354
    Qexo 1.11         9   11   16  Err  Err  Err  Err    Err    287    67
    Qizx/open 4.1     2    5    3    1    1    2    5      4      5    15
    Saxon HE 9.3      1    3    6    4    2    1    4   1288   1537   262
    VXQuery 0.1       3   23   43   19    6  181  371   5058   1756   Err
    XQilla 2.2.4     11   17   53   33   16   23   44  12426  14755  1598
    Zorba 1.4.0       6   14   31   40    7   <1  132   6466  10305  1554

                      Q11    Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20
    Brackit             5      3   <1    4   <1    1    1    3    3    2
    BaseX 6.7        6777   1571    6   35    3    3    6    4   21   11
    MXQuery 0.6.0   1.6e6  1.1e5  150  158  116  104  134  161  199  381
    Qexo 1.11         Err    Err    5   17    6    8    7    1   24  Err
    Qizx/open 4.1       4      2    1    4   <1   <1    1    2    3    2
    Saxon HE 9.3      752    233    7   19    1    2    4    3   19    4
    VXQuery 0.1      8969   8863    9  123    4   11   30  Err   81   18
    XQilla 2.2.4    40922  13342   28   48    4    7   18   17   56   93
    Zorba 1.4.0      6404   2116   31   53    4    6   22    9   67   38

Figure 8.2.: Comparison of XMark performance of XQuery engines on a 12MB document in main memory.

Only Saxon, often regarded as the unofficial XQuery reference implementation, achieved slightly better results than the other engines without join support. It is able to retain intermediate results in on-demand indexes, which help to reduce the effects of quadratic scaling in nested loops. For all other queries, Saxon, BaseX, XQilla, and Zorba performed solidly, but did not reach the performance of our prototype and Qizx.


VXQuery is still in a very early development phase, and execution failed for Q10 and Q18. However, judging from its current state, it seems likely that future versions will achieve performance comparable to the broad field of mature processors without join support. Qexo is the only engine that generates plain Java byte code from a query plan. However, it is also still in a highly experimental stage and fails for many queries. Furthermore, Qexo was not able to exploit the potential advantage of direct code generation, because we tested ad-hoc performance and the Java just-in-time compiler does not immediately optimize newly generated code. MXQuery delivered the slowest results and revealed serious performance problems even for simple queries. Our analysis confirmed that the system suffers from the overhead of a very fine-grained, token-based data representation [BBB+09].

8.3. XML Database Processing

In the second series of experiments, we evaluated the compiler inside a native XDBMS environment, which is a greater challenge than a plain main-memory setup. The compiled query plans must prove that they can handle larger data volumes and interact efficiently with external storage.

8.3.1. Access Optimization

To evaluate the effectiveness of optimized access to disk-resident XML, we ran the XMark benchmark in BrackitDB with three configurations. In the first configuration (default), the queries are compiled in the same manner as in the main-memory setup, i.e., the XML tree is navigated with basic operations and all filter predicates are applied within the engine itself. In the second configuration (interface), the compiler makes use of an enhanced storage interface and translates individual path steps to dedicated navigation routines. For example, a path step ./person is translated to a single storage call, which returns only matching element nodes for the given context node. At runtime, the storage then jointly evaluates navigation steps and node filter operations. In the third configuration (native), we extended this approach further and let the system perform storage-specific optimizations at compile time: single navigation steps and sequences of child steps are compiled directly to native operations without the indirection of an adapter interface.


[Bar chart: Time [ms] (0–4500) per XMark query for the configurations default, interface, and native]

                Q1   Q2   Q3   Q4   Q5    Q6    Q7   Q8    Q9   Q10
    default     80  139  391  435  107  1680  3531  445  1444  4094
    interface   81  146  389  405  101   198  1450  445  1442  3929
    native      65  137  396  356   96   201  1478  408  1394  3395

               Q11   Q12  Q13   Q14  Q15  Q16  Q17  Q18   Q19  Q20
    default   3614  2273  157  1714  172  187  205  170  1806  545
    interface 3614  2246  151   990  179  195  239  155  1455  512
    native    3523  2200  148   979   52   87  194  157  1411  345

Figure 8.3.: XMark performance for different data access optimization levels on a 112MB document in BrackitDB.

For the experiment, we stored the benchmark document with scale factor 1 (112MB) in BrackitDB and report the timings for warm buffers. Note that we did not create any supporting indexes on the document, as we intended to evaluate the differences between compilation and optimization strategies for navigating XML.

The results of the experiment are shown in Figure 8.3. In comparison to the main-memory experiment, navigation on disk-resident XML is not only slower; the larger benchmark document also emphasizes the importance of data access locality. Accordingly, we observed a greater performance loss for those queries that inspect larger parts of the document. In the default configuration, particularly queries with descendant axes (Q6, Q7, Q14) suffered from the expensive data access, because they scan large document regions. The join queries Q8–Q12 suffered, too, because they access large parts of the document when loading the join tables.


[Chart: execution times of the configurations interface and native relative to default (20%–120%) per XMark query]

Figure 8.4.: Relative performance gain of data access optimizations.

Higher response times due to poor data locality can also be observed for Q19, because it sorts all data before accessing the storage again to construct the result. It is no surprise that the storage-optimized configurations outperformed the default configuration. However, the relative speed-up varies considerably between the queries. The greatest performance gain was achieved in queries with descendant axes (Q6, Q7, Q14) and long downward paths (Q15, Q16), which offer the highest potential for aggressive short-circuiting of navigation and filtering. In contrast, the join queries show only minor improvements, because the amount of data accessed and the random accesses after joining the data dominate.

As Figure 8.4 shows, the difference between the optimizer configurations interface and native is relatively small. Only for the long paths in Q15 and Q16 did native compilation clearly outperform adapter-based navigation, delivering the result more than two times faster. In this configuration, long paths were not evaluated by traversing the document tree node-wise to match the child axis steps. Instead, the compiler directly leveraged the storage system of BrackitDB, which can alternatively evaluate paths as subtree scans with cheap candidate filtering. This option pays off whenever a subtree scan is cheaper than matching a path node-wise. As a built-in heuristic, the compiler performs this optimization for child paths with more than 6 axis steps.
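For illustration, Q15 navigates a path of the following shape (paraphrased from the XMark query set); with 13 child steps, it clearly exceeds the threshold of the heuristic and is evaluated as a subtree scan:

    for $a in $auction/site/closed_auctions/closed_auction
              /annotation/description/parlist/listitem
              /parlist/listitem/text/emph/keyword/text()
    return <text>{ $a }</text>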


In summary, this experiment proves the feasibility and importance of compiler-optimized data access. Note, however, that the reported results do not mark the optimum. Performance can be improved further if content predicates are pushed down and evaluated directly during document navigation. Indexes and holistic twig pattern matching, as shown in Chapter 6, will lead to further improvements, as similar attempts for the related XDBMS prototype XTC proved in [MHSB12]. Note also, however, that the more crucial set-oriented aspects, including correlated nestings, have already been optimized in the independent pipeline rewriting and join rewriting stages.

8.3.2. Scalability

We repeated the previous benchmark with the native configuration of BrackitDB for the scale factors 0.001–10 (1.2MB–11GB) to investigate the scalability of the system. As the results in Figure 8.5 show, both the query plans and the system itself scale solidly. However, we observe that poor access locality has a notable effect on performance for large documents. In Q19, for example, the additional data accesses after the sorting step resulted in a lot of random I/O for the 11GB document. As a typical scenario for twig-like data access, our prototype will benefit from implementing the “piggy-backing” techniques described in Section 6.3.2. Note that the growth in response time for Q11 and Q12 is characteristic for the workload, because the result size grows quadratically with the size of the document.

8.3.3. Competitors

To compare the quality of our compiler with dedicated XQuery compilers for persistent XML storage, we also ran the XMark benchmark at scale factor 1 for other XML(-enabled) database systems. Aside from the general efficiency of query evaluation, this experiment gives insight into the advantages and disadvantages of specific storage designs. Like BrackitDB, eXist [Mei09] uses a prefix-based numbering scheme for encoding the hierarchical structure and stores XML nodes with their prefix keys in a B+-tree. Additionally, eXist generates default indexes for accessing elements and attributes by name.


[Log-scale chart: Time [ms] per XMark query for document sizes 1.2 MB, 12 MB, 112 MB, 1.1 GB, and 11 GB]

Figure 8.5.: Execution times of the XMark benchmark for different scale factors in BrackitDB.

In contrast to BrackitDB, however, it does not maintain a dynamic schema, which could be exploited, e.g., for complex path matching or indexing. A lightweight alternative to prefix-based node labels is range-based numbering schemes [HHMW05]. Instead of addressing nodes with stable but variable-length keys, they use simple pairs of integers. However, because such labels are not stable, even small structural updates may require a re-labeling of large document regions, which is particularly expensive if these changes must be propagated to secondary indexes. Accordingly, storages using such a labeling scheme are favorable only for read-intensive workloads. In our test field, two systems use a range-based encoding. BaseX [BAS] stores XML nodes in an array-like sequential file, which is read page-wise from disk. The nested XML structure is encoded with the so-called pre/post numbering scheme [BGvK+05]. In the default configuration, BaseX creates a path summary for faster path resolution and additional content indexes for text nodes and attributes. MonetDB [BGvK+06] uses a relational storage engine, which separates document structure and content in ternary tables. The node labeling scheme used is an enhanced variant of the pre/post numbering scheme, which reduces the re-labeling cost for structural document updates [BFG+06].


Sedna [FGK06] stores XML trees as linked structures, where each node maintains pointers to its siblings, its first child, and its parent. For mapping this structure transparently to external storage, Sedna uses an intermediate layer, which translates virtual memory addresses to data blocks on disk. An additional metadata structure, which reflects the dynamic schema of a document, organizes nodes on the same path in chained data blocks. Like a path index over all paths in a document, it serves as an efficient gateway for direct jumps into the document structure.

As in the previous experiment, we stored the benchmark document in each database, but did not perform manual tuning, e.g., by defining additional indexes. Note again, however, that some systems create indexes per default. As in the main-memory benchmark, we adjusted the maximum heap size of the Java-based systems to 1.5GB, which was sufficient for each system to handle the workload. Figure 8.6 shows the fastest runtimes measured for each system. For comparison with BrackitDB, we repeat here the results of the native configuration.

As a result of the different storage designs, the general performance characteristics of the systems are less homogeneous than in the main-memory experiment. Some systems achieve extraordinarily fast results for certain queries, but this is often not the result of superior compilation techniques. Rather, those queries hit a sweet spot of the underlying XML storage. For example, BaseX can leverage its default attribute index to answer the point query Q1. Accordingly, it is more than one order of magnitude faster than all other systems, which need to search for the qualifying node in the document itself. Sedna outperforms all others in Q6 and Q7, because its metadata structure provides direct access to all nodes that qualify for paths with descendant steps. The curve shapes indicate that the quality of our query plans is competitive throughout all queries. Recognizing and processing joins efficiently again proves to be an advantage over most competitors. Only MonetDB also handled the join queries without a drastic decrease in performance. For simpler queries, however, BrackitDB did not reach the best marks of the other systems. A closer look at the queries and the competitors reveals that query plan quality is not the decisive factor here. Rather, the other systems benefit from faster storage engines or had additional indexes available.


[Log-scale chart: Time [ms] (0.1–1e+07) per XMark query for BrackitDB, BaseX, eXist, MonetDB, and Sedna]

                      Q1   Q2    Q3     Q4   Q5   Q6    Q7     Q8     Q9    Q10
    BrackitDB         65  137   396    356   96  201  1478    408   1394   3395
    BaseX 6.7          3   75   181    161   56  150  1242    917   1427   2902
    eXist 1.4.0      172  348  1071  2.9e5  160   23    74  3.2e6  5.3e6  2.5e5
    MonetDB 1.1.11    52   88   220    409   61   30    44    273    334   1260
    Sedna 3.4.66      43  398   240    137   27    9    21  5.6e5  7.2e5  66860

                      Q11    Q12  Q13  Q14  Q15  Q16  Q17  Q18   Q19  Q20
    BrackitDB        3523   2200  148  979   52   87  194  157  1411  345
    BaseX 6.7       1.6e6  3.8e5  176  841   41   37   80   93   288  142
    eXist 1.4.0     7.3e6  7.3e6   99  540   78  380  405  168   855  660
    MonetDB 1.1.11    760    501   96  508   81   98  127   74   244  210
    Sedna 3.4.66    9.7e5  2.2e5  271  762   30   33  474  184   899  142

Figure 8.6.: Comparison of XMark performance of XML database systems on a 112MB document.

For example, eXist was able to use its default element index for index-based path navigation, which is especially advantageous in queries with a descendant step (Q6, Q7, Q14). In turn, BrackitDB dominated eXist in queries that do not benefit from the availability of a basic structural index – even though both use a similar storage layout. For BaseX, we observe the expected performance advantage of a range-based node labeling scheme. For all non-join queries aside from Q1, where BaseX can leverage a content index, the query plans compiled by BaseX and BrackitDB are almost identical. In the end, however, the lightweight comparison of fixed-size node labels in BaseX made the difference, because it allows for cheaper navigation logic. Note that, with regard to our compilation approach, this result is not decisive: if we used our compiler to evaluate queries on the same storage, we would benefit from the faster navigation in the same way.

MonetDB obviously pairs a highly efficient runtime with a capable query compiler. The cheap range-based labeling and the heavily tuned relational storage play well together. In conjunction with a specialized path processing operator [GvKT03] and join recognition, the system delivers consistently good results. For simple queries, Sedna achieves similar performance and sometimes even better results than MonetDB. However, Sedna is not able to process the joins in Q8–Q12 efficiently. A direct comparison of both systems with BrackitDB is problematic. First, they are implemented in C/C++ and therefore have numerous advantages with respect to I/O and memory management. Second, the entirely different storage layouts and, thus, different evaluation strategies make it hard to compare the quality of the query plans. MonetDB, for example, emits relational query plans and thrives on raw scan performance, whereas Sedna profits from efficient path processing through its index-like metadata structure. In the end, the ability to recognize and exploit joins remains the only aspect we can identify as a clear advantage of MonetDB and our prototype.

8.4. Relational Data

To showcase the efficiency and versatility of our compilation approach, we set up an experiment in which we processed regular relational data instead of semi-structured XML.

8.4.1. Workload

To model a realistic workload, we took the dataset of the relational decision support benchmark TPC-H [TPC12] and measured the performance of XQuery versions of the SQL queries Q2 and Q6. We chose these two because they represent typical classes of relational queries and challenge different aspects of a query engine. Q6 is a simple filter and aggregation operation over a single table. Q2 is a complex join query over 5 tables with a correlated subquery containing another join of 4 tables.


In the queries, we modeled tables as sequences of structured record objects, where the column values can be accessed through the field access operator (=>). For reading the tables with correct data types, the custom XQuery function parse-schema() loads the respective schema information from a separate XML configuration file. The function returns a virtual record object, which instantiates sequence abstractions for expressing table scans with for loops. The queries were evaluated over normal files in which we stored the relational data. We tested two setups. In the first setup, we ran the queries directly on the ’|’-separated text files generated by the dbgen tool of the benchmark. In the second setup, we stored the table data in files with a simple binary encoding. The database generated for the measurements had a total size of 1GB (scale factor 1). The subset of the database accessed by query Q2 is only about 140MB, but because of the many joins and the subquery, efficient evaluation is challenging. Query Q6 scans the lineitem table (726MB raw data, ∼6 million rows) and sums up the revenue of each qualified row.
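To give an impression of this style, query Q6 may be sketched as follows; the field access operator (=>) and the function parse-schema() are described above, whereas the exact function signature, the configuration file name, and the field names are assumptions for illustration only.

    let $tables := parse-schema("tpch-config.xml")
    return sum(
      for $l in $tables=>lineitem
      where $l=>shipdate >= xs:date("1994-01-01")
        and $l=>shipdate < xs:date("1995-01-01")
        and $l=>discount >= 0.05 and $l=>discount <= 0.07
        and $l=>quantity < 24
      return $l=>extendedprice * $l=>discount
    )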

8.4.2. Data Access Optimization

For processing regularly structured table data as fast as a relational system, the compiler must optimize column accesses and filter operations. Hence, we evaluated the compiler with the naïve configuration default and the two optimized configurations schema and native. The configuration schema inspects at compile time which columns are accessed by the query and propagates the respective projection information as an argument to the parse-schema() function. Furthermore, the compiler replaces dereference expressions for column accesses with cheaper positional access routines. The configuration native additionally extracts pipeline filters and evaluates them directly when scanning the base tables. The results are given in Figure 8.7.

As expected, the optimized configurations outperformed the standard configuration considerably. Configuration schema almost halved the runtime of both queries by skipping unused columns when reading the input and through cheap positional column access. The additional push-down of predicates in the native configuration improved performance further: the scan-dominated query Q6 ran almost two times faster than with configuration schema.
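Conceptually, the configuration schema rewrites the schema call as sketched below; the exact form of the projection argument is hypothetical, but the principle is the one described above: only the columns actually referenced by the query are materialized when scanning the input.

    (: before: all columns of all tables are materialized :)
    parse-schema("tpch-config.xml")

    (: after: the compiler propagates the accessed columns :)
    parse-schema("tpch-config.xml",
                 ("shipdate", "discount", "quantity", "extendedprice"))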


[Bar charts: Time [sec] for the configurations default, schema, and native on binary and text input; (a) TPC-H Query 2, (b) TPC-H Query 6]

Figure 8.7.: Execution times of TPC-H benchmark queries Q2 and Q6.

8.4.3. Comparison with RDBMS

To put the results of our prototype in relation to state-of-the-art systems, we measured query times for the SQL versions of Q2 and Q6 on the relational databases PostgreSQL 8.4 and DB2 9.7. To get comparable results, we did not configure supporting indexes, but created detailed table statistics and assigned sufficient memory to the database buffer and the queries to ensure that both systems created efficient plans and performed the computation entirely in memory. Neither system had to perform any disk access when running with warm buffers. Figure 8.8 shows the fastest results for each system.

The two queries paint very different pictures. In the complex join query, our prototype is more than an order of magnitude faster than the relational systems. This clear result is caused by a better handling of the correlated subquery, which all systems perform using nested loops. Like both DBMS, our compiler generated a bushy operator tree with hash joins and a final sort. But in contrast to the relational systems, our prototype is able to reuse the hash join tables of the nested query between iterations, as detailed in Section 5.2.3. For Q6, the relational systems are about 2 times faster than our prototype on binary files. However, this is the result of more efficient scan-and-filter logic rather than of a superior query plan.


[Bar charts: Time [sec] for Brackit (bin), Brackit (text), Postgres 8.4, and DB2 9.7; (a) TPC-H Query 2, (b) TPC-H Query 6]

Figure 8.8.: System comparison for TPC-H queries Q2 and Q6.

In summary, this experiment underlines the great utility of pairing a versatile data processing language like XQuery with a set-oriented, storage-independent runtime. With minimal effort, we can perform general data processing tasks on top of different data models and representations.

8.5. Parallel Processing

In our final measurement series, we evaluate the push-based operator model and the dynamic parallelization framework.

8.5.1. Workload

For assessing our parallelization framework, we used the data generator of the TPoX benchmark [NKS07] to set up a simple test scenario. Inspired by the original benchmark queries, we derived three basic query types, which allowed us to focus on the behavior of individual operators and for which we could easily investigate the influence of the partitioning scheme used. The generated XML collections represent complex customer and order records in a financial application scenario. The data generator creates batches of 50,000 (∼320MB) and 500,000 (∼715MB) documents per file, respectively.

The advantage of this approach is that the documents in each batch can be read both sequentially and randomly. For random access, the data generator produces an additional metadata file per batch, which contains the offsets of the document boundaries. For reading and parsing the documents from a batch file within XQuery, we provided respective input functions like read-batch(), parse(), and parse-batch(). The default partitioning size for binding expressions was set to 1, but overridden within the queries with XQuery pragmas, hinting the compiler to partition batch file sequences in chunks of 50 documents. The size of the worker pool was varied in each experiment. The maximum Java heap size was limited to 3GB.

8.5.2. Filter and Transform Query

In the first experiment, we evaluated a simple filter query over a single customer batch file. In the outer for loop, the documents are read with the function read-batch() as plain strings from the batch file. Parsing and further processing take place in the loop body. For testing both partitioning schemes, we used two variants of the input function for reading the batch file: the lazy sequence returned by the first variant can only be partitioned sequentially, while the lazy sequence returned by the second variant can be partitioned recursively with the divide-and-conquer pattern. The queries were evaluated in both ordered mode and unordered mode, i.e., the return clause was compiled in the former case to an order-preserving chained sink and in the latter case to a concurrent sink.

Figure 8.9 shows the results for different worker pool sizes. All setups show great scaling behavior with respect to the number of available workers. Because of the high selectivity of the query – only 0.02% of the documents in the batch matched the query predicate – the difference between ordered mode and unordered mode was negligible. The query with sequential partitioning was only slightly slower than the one with divide-and-conquer partitioning. The reason for this good result is the deferred parsing of the documents within the loop body: because the documents were read from the batch file only as cheap strings, sequential partitioning was relatively fast and led to good speedups.
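A minimal sketch of the filter query is given below; read-batch() and parse() are the input functions introduced above, while the pragma name, the batch file, and the document structure are hypothetical. The pragma hints the compiler to partition the batch file sequence in chunks of 50 documents; parsing is deferred to the loop body, so partitioning itself stays cheap.

    for $s in (# brackit:chunk-size 50 #) { read-batch("customer.batch") }
    let $c := parse($s)
    where $c/Customer/Nationality eq "Germany"
    return $c/Customer/Name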


[Line charts: Time [ms] vs. number of workers (1–20) for sequential and parallel partitioning; (a) ordered, (b) unordered]

Figure 8.9.: Execution times of parallel filter query.

For comparison, we repeated the experiment with a slightly modified query, in which the documents were directly parsed by the input function parse-batch(). The result in Figure 8.10 now clearly shows the advantage of divide-and-conquer partitioning. Because parsing, the major cost driver in the query, is now an integral part of the sequential partitioning process, the parallel infrastructure cannot benefit from parallelism at all. The parallel partitioning scheme, however, still scales with the number of available workers as before, because the expensive parsing only takes place when the partitions are small enough for sequential processing. This result underlines the importance of cheap and effective partitioning.

8.5.3. Group and Aggregate Query

The query in the second experiment groups the accounts of rich customers in a batch by nationality and aggregates the results. The rest of the setup is identical to the previous experiment. The results in Figure 8.11 again show good speedups for larger worker pools. However, the performance gains for sequential partitioning are not as good as in the filter query. The reason is a disadvantageous balance of partitioning cost and cost per loop iteration: because this query does not contain a complex filter predicate, the load shifts towards the loop binding with the more expensive sequential partitioning. Note that the divide-and-conquer scheme is not affected by a cheaper loop body.
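In XQuery 3.0 syntax, the group-and-aggregate query may be sketched as follows (element names and the wealth threshold are hypothetical):

    for $s in parse-batch("customer.batch")
    let $c := $s/Customer
    where $c/Wealth > 1000000
    group by $nat := $c/Nationality
    return
      <rich nationality="{ $nat }">{
        count($c), sum($c/Accounts/Account/Balance)
      }</rich>

After the group by clause, $c is bound to the sequence of all customers of one nationality, so count() and sum() aggregate per group.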


[Line chart: Time [ms] vs. number of workers (1–20) for sequential and parallel partitioning]

Figure 8.10.: Execution times of parallel filter query with expensive input binding.

[Line charts: Time [ms] vs. number of workers (1–20) for sequential and parallel partitioning; (a) ordered, (b) unordered]

Figure 8.11.: Execution times of the parallel group query.

8.5.4. Join Query

To evaluate the join performance of the parallel operator model, we used a join query over an order and a customer batch. Both input batches are bound in nested for loops and combined in a where clause. The compiler recognizes the join semantics and compiles the query to a pipeline with a join operator. The results are shown in Figure 8.12. As in the previous experiments, the speedup for parallel partitioning is almost ideal and still strong for sequential partitioning. For the first time, we also observe some clear scalability problems when running a query in ordered mode.
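The join query may be sketched as follows (element and attribute names are hypothetical); the equality predicate in the where clause is what the compiler recognizes as join semantics:

    for $c in parse-batch("customer.batch")
    for $o in parse-batch("order.batch")
    where $o/Order/@custId eq $c/Customer/@id
    return <order customer="{ $c/Customer/Name }">{ $o/Order/Total }</order>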


[Line charts: Time [ms] vs. number of workers (1–20) for sequential and parallel partitioning; (a) ordered, (b) unordered]

Figure 8.12.: Execution times of the parallel join query.

As detailed in Section 7.3.4, evaluating a join in ordered mode requires the join operator to serialize the loading of the hash table. Accordingly, parallel performance degrades the more entries have to be loaded. The fact that we only observe a slowdown for sequential partitioning indicates that this scheduling pattern is more prone to disruptions, e.g., when workers are suspended during the serialization.

8.5.5. Scalability

At the end of the TPoX measurement series, we evaluated the filter query and the group query again, but scaled both the worker pool size and the number of processed batches step-wise. To this end, we nested the original queries inside a loop over the number of batches to be processed. For the sequential variant, this means that individual batches are still read sequentially, but multiple batches can be read in parallel. Note that we could not run this test for the join query, because the Java runtime used cannot efficiently manage applications that require more than 4GB of heap. As the results in Figure 8.13 show, the growth in input size was largely compensated by the higher degree of parallelism. In the largest setting, the 20 times larger input resulted in only about 2.25 times longer execution times.


[Line charts: Time [ms] vs. number of workers and batches (1–20) for sequential and parallel partitioning; (a) filter query, (b) group query]

Figure 8.13.: Results for scaling worker pool size and input size.

In this experiment, the higher locality of sequential partitioning even outperformed divide-and-conquer partitioning for 12 or more parallel workers. With growing degrees of parallelism, more batches were read in parallel at multiple positions, which stressed the I/O subsystem and degraded the performance of the file system buffer.

8.5.6. XMark Benchmark

As our final experiment, we repeated the XMark benchmark in the XDBMS setting with the push-based operator model and with different pool sizes. The results are shown in Figure 8.14. Except for small deviations, we do not observe any effect of parallelization. The reason for this again lies in the partitioning of the for-bound input sequences. XMark evaluates solely path expressions as binding sequences, which can only be processed sequentially and usually navigate over large parts of the input document. Accordingly, a considerable fraction of the total query time is spent on evaluating binding sequences and, hence, partitioning is too slow to achieve parallel speedup. To benefit from data parallelism in such single-document databases, a system needs efficient alternatives like a secondary index to evaluate binding sequences fast enough.
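The problematic shape can be seen in a typical XMark outer loop (sketched below, close in spirit to the benchmark queries): the binding sequence is itself a path expression, so tuples become available only at the pace of the sequential document navigation, and there is little left to partition.

    for $p in doc("auction.xml")/site/people/person
    where $p/profile/@income > 50000
    return $p/name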


[Line charts: Time [ms] for each XMark query (Q1–Q20) with worker pool sizes from 1 to 20]

Figure 8.14.: Execution times of parallel XMark benchmark per query in BrackitDB on a 112MB document.

As a positive result of this final test, we can state that the design goal of a low-overhead framework, which at least creates opportunities for parallel processing, has been achieved. Even poorly parallelizable situations are not unnecessarily burdened with additional overhead by the parallel operator framework.


8.6. Evaluation Summary

The main observations from our empirical evaluation can be summarized as follows:

1. The compilation approach presented provides a solid foundation for a complex real-world language like XQuery. The experiments with relational workloads also proved that declarative languages like SQL can easily be mapped to this infrastructure.

2. Our design is able to deliver competitive performance in various setups and on different storage platforms. In main memory, our prototype excels with fast compile times and a lightweight runtime. In disk-based settings, we depend on storage-specific optimizations to compete with specialized compilers. In several experiments, we demonstrated the potential of such optimizations, which are simplified by the strict separation of query logic and data access routines.

3. Set-oriented optimization and join processing is, and must always be, a key concern in a query processor. Our approach successfully recognized and handled joins in all benchmark queries and thus outperformed most competitors in this discipline.

4. Physical optimization is an important building block for achieving high performance. We showed that we can support various kinds of storages and different levels of storage abstraction. The evaluated spectrum reaches from simple wrapper interfaces to native operations. Our results in the XDBMS context have shown that optimization efforts only pay off if the granularity of operations, and not the storage itself, is the limiting factor.

5. Several experiments proved that our push-based operator model is able to process queries in parallel without additional assistance from the user or the compiler. We also demonstrated that our approach scales with both input size and the number of available processors. The advantage of partitioning data dynamically in a divide-and-conquer fashion was evident in all experiments. However, our tests also showed that queries which can read their input only sequentially benefit from parallel processing only if the sequential input can be accessed and partitioned efficiently.

9. Related Work

Considering the broad spectrum of topics touched by this thesis, it is not possible to narrow related work down to a few areas. The following therefore concentrates on giving insight into the most important works that pursue similar objectives or served as inspiration. Because declarative, database-style query processing is the leading theme, we start with a brief walkthrough of query language history to emphasize the various connections to decades of database research. Thereafter, we present related query and data processing languages and discuss other approaches to building flexible data processing platforms. Finally, we review related work in the field of XQuery compilation.

9.1. A Short History of Query Languages

The revolutionary simplicity of the relational data model, introduced in [Cod70], provided the ground for vivid research activity in the field of declarative query languages. In tandem with the development of SQL as front-end language, the relational model became a success story that lasts until today. Very soon, practical needs led to first extensions of the initial relational algebra, e.g., to incorporate aggregation [Klu82, OO83]. More radically, several works introduced the so-called nested relational algebra to overcome the shortcomings of flat relations in 1NF [JS82, OO83, RKS88]. In essence, they complemented the relational algebra with nest and unnest operations for relation-valued columns. Although the nested relational algebra was never adopted in practice, its core ideas paved the way for object-relational systems [Dit86]. Inspired by the abstraction concepts of object-oriented programming, these systems introduced complex data types, object identity, inheritance, and polymorphism and integrated them into SQL-based or SQL-like query languages [RS87, CD88, LR89]. For the first time, application-level concepts and abstractions could be directly modeled with custom data types in both database storage and queries.

In addition to the novel abstraction concepts for structured entities in the context of database systems, extensibility aspects were another driving force in the object-relational movement [BBG+88, CH90, Gra94]. In a sense, object(-relational) databases made the first step towards a separation of query processing logic and physical data representation. Meanwhile, the concepts of nested relational algebra continuously contributed to optimization efforts for nested queries, which are typical for the navigation of linked structures [SPSW90, CM94, KM94]. Despite enormous research efforts and the availability of commercial-class ORDBMS products, the object-relational “wave” [SM95] finally ebbed in favor of the simpler and more robust relational systems. But the next big trend was already in sight. New application fields emerged that processed or produced large amounts of graph-structured data, e.g., in geographic applications, transportation and logistics, and biology [Woo12]. Because the processing of arbitrarily structured graphs differs considerably from the requirements of classic data management, the development of graph databases and query languages led to decoupled research fields [AG08]. However, a special class of graph-like data – tree-like structures and hierarchical data – crystallized, spread in the form of XML and other semi-structured formats, and became the new trend in database research [Bun97, Abi97, FLM98, W3C98b].

Besides various new challenges at the storage level, semi-structured data introduced several new aspects in query processing. The flexibility aspects and the lack of a rigid schema required some form of tree pattern matching techniques to locate data of interest in hierarchical structures. Also, the need to construct and access nested structures arose again. Notably, most query languages still had a close similarity to the familiar SELECT-FROM-WHERE constructs of SQL [AQM+97, BFS00]. In the wake of the XML navigation language XPath [W3C99] and the XML query language Quilt [RCF00], which finally led to the development of XQuery, tailored tree algebras were developed to better meet the requirements of querying collections of labeled, ordered tree structures [BT99, FSW00, JLST02, SA02]. Interestingly, none of them found its way from research platforms into commercial XML-DBMS. Instead, quite different strategies were pursued for compiling XQuery/XPath.

These include the nested relational algebra [BHKM05], extensions of the relational query graph model (QGM) [ÖSW08, Mat09], and purely relational encodings [GRT08]. At the language level, Quilt introduced a paradigm-shifting novelty: an explicit looping construct over variable-bound input sequences. The idea of nested loops was a step back from fully declarative query processing, but it suited two major concerns perfectly: it allowed collections of data to be drawn and processed from various sources, and it enabled traversals within nested tree structures. The notion of a stream of variable-binding tuples within these loops bridged the gap to the efficient set-oriented techniques of the relational world.

9.2. Related Languages and Data Models

The common denominator of all proposals for declarative processing of object-like and semi-structured data is the representation of data as variations of tree or graph structures, accompanied by some sort of navigation facility [Abi97]. The star player in this field is certainly XQuery/XPath, which is also at the center of this work and discussed in detail. In the following, we summarize other important approaches and systems from research and industry.

9.2.1. Lorel Query Language

The Lore (Lightweight Object Repository) project is a prototype DBMS for semi-structured data [MAG+97]. The underlying data model is the Object Exchange Model (OEM), which was initially developed as a self-describing data interchange model for heterogeneous environments [PGmW95] and later slightly adjusted to facilitate its use in the Lorel query language [AQM+97]. An OEM object is a triple with the fields type, value, and object-ID. The type field indicates either an atom object like a string, an integer, or a floating point number, or a set object. The value field is the actual payload. The object-ID is a unique identifier of the object. The set type is the composition type in OEM. It models complex structures by enumerating pairs of labels and object identifiers in the value of set-typed objects. Each pair represents a labeled edge to a child object.

The label carries the meaning of the relationship, e.g., an edge to a string object that is labeled “street” reflects the name of the street in a complex address record. Effectively, a set value in OEM is an unordered map. Array values are not supported, but might be emulated by a complex value with integer-labeled edges. Cyclic data structures are generally permitted, although typical use cases yield tree structures. The query language supports set-oriented queries over OEM graphs in the classic SELECT-FROM-WHERE structure. Declarative updates are supported, too. A traversal over the edges of an OEM graph can be specified as a path, which can be a simple sequence of edge labels or a complex pattern with optional labels, regular expressions, and wildcards. A path can also specify label and object variables at individual path steps. They are bound to the matched OEM objects during a traversal and can be used as anchors for further traversals, e.g., in the WHERE or SELECT clause. The output of a query is an OEM instance, rooted under an artificial answer object.

9.2.2. UnQL

The UnQL query language [BFS00] uses structural recursion and pattern matching, two popular techniques from functional programming, to query tree-organized semi-structured data. Like the OEM model of the Lorel query language, UnQL represents data as unordered, edge-labeled trees, which consist of atomic and non-atomic nodes. In contrast to OEM, it does not treat nodes as identifiable objects, i.e., a non-atomic node is an anonymous set of label/tree pairs. Accordingly, UnQL specifies equality solely by value. Queries are written as select-where constructs. The where clause specifies a structural query pattern, which is translated into an algebra of function compositions that use argument pattern matching to perform a recursive traversal through the tree structure. The query pattern can contain variable bindings, which can be referred to in the select clause to construct the result. The where clause may consist of multiple patterns, which allows joins to be specified. Furthermore, UnQL supports nested queries within the select clause to group results or to express outer join semantics.


The structural recursion in UnQL can be evaluated as a standard top-down recursion, which is equivalent to a nested-loops-like tree traversal. Alternatively, the recursion pattern can be transformed into bulk operations by using dependent joins [Flo99, MHM04], which allows for set-oriented optimization [BFS00].
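For readers more familiar with XQuery, the flavor of structural recursion can be approximated by a recursive user-defined function that rewrites a tree top-down; this is only a loose, hypothetical analogue and not UnQL syntax:

    declare function local:rename($n as node()) as node()
    {
      typeswitch ($n)
        case element() return
          element { if (local-name($n) eq "tel") then "phone"
                    else local-name($n) }
            { $n/@*, for $c in $n/node() return local:rename($c) }
        default return $n
    };

    (: rename all tel elements to phone throughout the tree :)
    local:rename(doc("contacts.xml")/*)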

9.2.3. TQL

TQL [CG04] takes a unique approach to querying semi-structured data. It is also a query language for edge-labeled, unordered trees, called information trees. It is a logic-based language and thus contrasts with most SQL-influenced proposals through a quite different query syntax. A query has the form from Q |= A select Q’ and is interpreted as a matching of the subject Q (the data) against the formula A, which yields a set of variable bindings, for which the parameterized result expression Q’ is evaluated. The result of the query is the concatenation of the results of all evaluations of Q’. A formula describes a query pattern, which consists of structural and first-order logic operators. The former are matched against the information tree; the latter serve as combinators and quantifiers for variables. Wildcard patterns and a recursion operator allow complex patterns to be matched in the tree at arbitrary depths. With the universal quantifier foreach, TQL provides a feature similar to XQuery’s for loops. However, the construct is limited to the tree pattern logic. Details about the implementation of TQL are given in [CFG02], but information about the efficiency of the fully algebraic compilation is not available.

9.2.4. Object Query Language (ODMG)

Query languages for object(-relational) structures have a close affinity to semi-structured data. Perhaps the most influential one is the Object Query Language (OQL), which was standardized by the Object Data Management Group (ODMG) [Coo97]. OQL is defined for graphs of typed, class-based objects as they are common in object-oriented languages. The supported composition types for collection-based object properties are Array, Set, Bag, and List.


The syntax of OQL leans closely on SQL and supports almost the same expressiveness for quantification (e.g., exists), grouping, sorting, and aggregation. The result of a query is a collection of objects. Navigation within the object graph is equivalent to dereference operations in a normal programming language. In contrast to languages for semi-structured data, however, it is not possible to match complex path patterns with wildcards and variable-length paths. Although OQL turned out to be too complex to ever be fully implemented, it widely influenced efforts in the object-relational world. Many of its ideas served as a blueprint for the object-relational extensions of SQL [SQL99] and a great variety of O/R mapping middleware.

9.2.5. Rule-based Object Query Language

The work in [SR92] presents a rule-based query language – also called OQL – and an algebra for object-oriented databases. Objects are identifiable instances of classes, which are defined as composites of atomic values and other composites. The supported composition types are Tuple, Set, and List, for which distinct null values (nulltuple, nullset, nulllist) exist. The algebra defines six sets of operators for working with objects, tuples, lists, and sets, and for list/set conversion. Further, it defines an operator ASSIGN for assigning an object to a variable and an operator REPEAT UNTIL for evaluating recursive queries. A query is a sequence of statements, which are either variable assignments or rules. A rule consists of a head and a body; the body consists of a set of generators and qualifiers that implement the matching logic. Navigation in the object graph is performed with path expressions. Nesting and unnesting for set-valued types are performed explicitly via rules. A direct relationship to practically relevant set-oriented concepts, algorithms, and optimization principles does not exist. In this sense, the whole nature of the language is similar to [SKS06].

9.2.6. SQL:1999 and SQL:2003

With the SQL revisions SQL:1999 [SQL99] and SQL:2003 [SQL03], the formerly purely relational standard was equipped with plenty of new options for representing and querying structured and semi-structured data.


It introduced a rich set of object-relational features like structured data types, reference types, typed tables, and table hierarchies, which come close to many features of ODMG’s OQL model, but are less general and easier to implement. SQL:1999 added the anonymous collection types ROW and ARRAY, which made multi-valued columns possible, but could not be nested to form arbitrary nestings as required for semi-structured data. SQL:2003 added the collection type MULTISET and finally allowed arbitrary nestings of all collection types [Tü03]. The counterpart to the collection type constructors is the table-valued function UNNEST, which allows, e.g., collection values to be used in the FROM clause of a SELECT statement. As of today, however, many DBMS provide only very limited support for collection types, if any. A third novelty found much better adoption in commercial systems: the new data type XML and the language extension SQL/XML. They enabled support for hybrid query processing over both relational and XML data. Queries can be formulated in either SQL or XQuery; data in the respective “foreign” format can be queried through special mapping functions and data constructors. For these new features, most DBMS manufacturers developed tailored XML storages with extensive indexing support and other DBMS-class features like transactions [NvdL05, Rys05, LKA05].

9.3. Data Processing Languages

Data processing techniques increasingly find applications outside the traditional database context. New extensions and APIs for general-purpose programming languages give the application logic direct access to efficient, set-oriented infrastructure. A second, very active community creates dedicated data processing languages as an abstraction layer for the MapReduce framework and other platforms for distributed, large-scale data processing [IBY+07, DG08, STL11]. Such languages naturally operate beyond the narrow “closed system assumption” of a database query language and embrace semi-structured data from various sources. Therefore, and because of their scripting capabilities, they are especially interesting for this thesis. In the following, we particularly spotlight the language features of the various approaches.


9.3.1. JSON and Jaql

Jaql [BEG+11] is a declarative scripting language targeting large-scale data analysis in the Hadoop MapReduce framework [Whi09]. It builds upon an extension of the JSON data model, i.e., it includes structured records and arrays and a variety of common atomic data types. Furthermore, Jaql supports functions as a data type, which is of essential utility in the operational model of the language. A Jaql script is a sequence of statements like library imports, variable assignments, and expressions. The set of supported expression types is relatively small. It comprises Boolean logic, arithmetic, conditional branching, function calls, and data constructors. The latter include various forms of path expressions for accessing and projecting values from arrays and records. Flexible (path) pattern matching for semi-structured data as, e.g., in XQuery/XPath, is not supported. The strength of Jaql is a composable set of array-based, higher-order functions, called operators, which implement the bulk processing logic. For convenient chaining of operators, the language provides a special pipe operator -> as a syntactic variant of normal function composition. The pipe operator visualizes the data flow from one operator to the next. Built-in are streaming operators like FILTER, TRANSFORM (implementing a functional map operation), and EXPAND (implementing unnesting), the blocking operators SORT and GROUP, and a JOIN operator. Because extensibility is a major concern, Jaql operators are higher-order functions that can be customized with user-defined code, e.g., a comparison function. Furthermore, Jaql can be extended with custom operators. Because the language is intended for data processing in MapReduce clusters, it typically operates on huge collections of relatively small data items, which reside in distributed file systems or key/value stores. They are loaded and converted into the internal data model via customizable I/O adapters. Although Jaql is purposely not a declarative query language in the classical sense – the user wires the query operators manually – the compiler can leverage knowledge about the built-in operators and perform (simple) optimizations like predicate pushdown. If available, full or partial schema information can also be leveraged during optimization.


The compiler's main concern is the identification of opportunities to process parts of a script in the parallel MapReduce environment. Therefore, it identifies operations that can be applied independently over partitions of the input collection (e.g., TRANSFORM) or that perform a distributable aggregation (e.g., inside GROUP). These are compiled to dedicated Jaql functions, which can be manually optimized by the developer (source-to-source translation) or directly deployed as Map and Reduce functions in the distributed framework.
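To make the deployment step concrete, the following minimal sketch shows how a partition-wise TRANSFORM could surface as a Hadoop Map function. The class name and the projection logic are illustrative assumptions, not Jaql's actual generated code; only the Mapper API belongs to Hadoop:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: a TRANSFORM that projects the first field of each
// input record can be deployed as a pure Map function (no Reduce phase).
public class TransformMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        // Stand-in for the compiled transform expression; it is applied
        // independently to every record of every input partition.
        Text projected = new Text(value.toString().split(",")[0]);
        ctx.write(key, projected);
    }
}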

9.3.2. Pig Latin

Pig Latin [ORS+08] is a high-level language for MapReduce clusters, which is positioned between the fully declarative world of SQL and the low-level, procedural world of manual Map and Reduce functions. A Pig Latin program is compiled into MapReduce jobs, which are carried out in a Hadoop cluster [Whi09].

The data model of Pig Latin consists of four types: atom, tuple, bag, and map. Collection types can be arbitrarily nested; a bag is considered a collection of tuples. The supported expression types are constants, conditionals, field/map access and projection, function calls, and flattening of nested collections.

A program is a hand-crafted data flow graph of typical query operations. The data flow is assembled as a sequence of statements, each of which performs an operation and binds the result to a variable. Variable references in subsequent statements implement the actual pipelining of intermediate results.

At the beginning, input data is loaded with the LOAD command, which references a bag-structured input collection and, optionally, specifies its schema. Afterwards, the bag is processed with various bulk commands like FILTER, FOREACH, COGROUP, JOIN, ORDER, etc. All of them are higher-order in nature and are parameterized by expressions or functions. For nested data, FOREACH supports a limited form of nested bulk processing (e.g., filtering and sorting), but only at the direct child level. Processing of arbitrarily nested semi-structured data or advanced structural pattern matching is not possible.

The execution of commands, i.e., the actual data processing, is deferred until a STORE command occurs (sketched below). This lazy approach enables pipelining and other cross-command optimizations like early filtering.
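The effect of the deferred execution can be pictured as plan construction: every statement merely appends an operator to a logical plan, and only STORE compiles and runs it. A minimal Java sketch with entirely hypothetical names:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Pig-style lazy evaluation: statements build a
// logical plan; only store() compiles and executes it.
public class LazyPlan {
    private final List<String> plan = new ArrayList<>();

    LazyPlan load(String bag)    { plan.add("LOAD "   + bag);  return this; }
    LazyPlan filter(String pred) { plan.add("FILTER " + pred); return this; }

    void store(String target) {
        plan.add("STORE " + target);
        // Only now is the complete pipeline known, so cross-command
        // optimizations (e.g., early filtering) can be applied before
        // the plan is compiled to MapReduce jobs.
        System.out.println("compile and execute: " + plan);
    }

    public static void main(String[] args) {
        new LazyPlan().load("urls").filter("pagerank > 0.2").store("good_urls");
    }
}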


Pig, the runtime of Pig Latin, creates a logical plan for the commands that operate on every referenced bag. Only if the resulting bag is finally stored with STORE is the logical plan compiled to a sequence of MapReduce jobs, which is executed in the cluster. The compiler thereby takes special care to map groupings and aggregations effectively to the MapReduce framework.

9.3.3. LINQ

LINQ (Language-Integrated Query) [MBB06] is a query abstraction for data processing within the Microsoft .NET framework. It introduces standard patterns for querying and updating data, which can be virtually mapped to almost any kind of data, independent of whether the "database" is an in-memory collection of objects, a relational SQL database, or an XML document. Therefore, the framework blends the programming language view on a collection type, e.g., a simple list, with a database view considering a list as a queryable data source. At the inner core, LINQ queries embody many ideas of functional programming like lazy evaluation and monad comprehensions [Mei07] and wrap them in the object model of the .NET runtime.

The SQL-like queries are first-class citizens in the .NET languages C# and Visual Basic and are carried out by the backing LINQ provider of the respective collection. In the case of a simple in-memory list, the provider processes the query directly on the list; in the case of a relational database, the provider generates SQL, sends the query to the database server, and makes the result available to the .NET program as a collection value.

Conceptually, LINQ is data-model agnostic in the sense that query operations (select, where, orderby, etc.) are performed on collections of any kind of value. To overcome the heterogeneity of data models, it makes use of the properties of an object representation of the data. A list of point objects, for example, can be filtered with where point.x > 3. The query capability for a semi-structured format like XML is obtained by representing it as a DOM-like tree structure [W3C98a]. With this object abstraction, it is possible to expose structural relationships as collection-valued methods and use them to perform flexible path pattern matching within queries. For example, an XPath expression like $docs//book over an XML resource can be evaluated in LINQ as docs.Descendants("book").


The actual query execution through a LINQ provider can be realized in various forms. In the simplest case, a custom provider exposes the queried resource as a collection value and uses the default, in-memory query logic. More complex but also more efficient solutions can translate, compile, and execute (parts of) the actual query expression tree by themselves (see the sketch below). This flexible consideration of querying capabilities is comparable to the integration strategies of heterogeneous database systems [CHS+95].
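The provider indirection can be pictured as a strategy interface behind a uniform query surface. The following Java sketch is purely illustrative (LINQ is a .NET facility; all names here are hypothetical): the same filter is either evaluated with the default in-memory logic or shipped to a backend as generated SQL.

import java.util.List;
import java.util.function.Predicate;

// Hypothetical provider pattern: one query surface, two execution forms.
interface QueryProvider {
    List<String> where(List<String> source, Predicate<String> p, String sql);
}

class InMemoryProvider implements QueryProvider {
    public List<String> where(List<String> source, Predicate<String> p, String sql) {
        return source.stream().filter(p).toList(); // default, local logic
    }
}

class SqlProvider implements QueryProvider {
    public List<String> where(List<String> source, Predicate<String> p, String sql) {
        // A real provider would translate the query expression tree itself;
        // here we only indicate where the generated SQL would be shipped.
        System.out.println("SELECT * FROM src WHERE " + sql);
        return List.of(); // the result would come back from the database
    }
}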

9.3.4. Database-backed Programming Languages

Several approaches pursue the idea of transparently delegating data-intensive tasks from the program to an external DBMS.

Kleisli [Won98] is a data integration system that has been developed for the special requirements of bioinformatics. Kleisli hosts a high-level functional query language, called CPL (Collection Programming Language), which focuses on database-backed processing of "bulk" data types. To accommodate the needs of data representation in bioinformatics, the data model of CPL supports arbitrarily nested records, sets, lists, bags, and variants thereof. A CPL query uses a comprehension syntax [Wad90], which allows the input to be filtered, unnested, and transformed. The comprehension is translated to Nested Relational Calculus [BNTW95], algebraically optimized, and then sent to an external DBMS for processing.

Links [CLWY06] is a programming language for web applications, which compiles to both client-side JavaScript code and server-side SQL code. It also uses a comprehension syntax to compile data access operations into SQL code. In contrast to Kleisli, however, it supports only the flat relational model.

Ferry [GMRS09] is a compilation framework, which is able to offload data-intensive computations to a relational database. Like Kleisli, it supports arbitrarily nested tuples and lists, but it also puts considerable effort into reducing the number of queries that are generated for queries over nested structures. Therefore, Ferry uses a lifting technique, which was pioneered in the relational XQuery processor Pathfinder for querying nested XML structures [GRT08]. Ferry features its own research language, but has also been integrated into Ruby and into a library for Haskell [GGSW11].


9.4. Extensible Data Processing Platforms

Various earlier contributions already aimed at building a versatile and reusable data processing platform. Basically, they can be divided into compiler-centric and system-centric approaches. The former pursue an extensible compiler infrastructure, which can be customized for new kinds of data and operations. The latter design the database platform as a kind of virtual machine to which front-end languages and data models are mapped.

9.4.1. Compiler Infrastructures

Extensible compiler infrastructures have been studied in the context of extended relational systems and object-based systems. Three of the most influential works are presented in the following; other major approaches can be found in [Bat88, CDG+90, SRH90, SPSW90, VB96].

Starburst

The Starburst project [HFLP89] introduced a new compiler model for extending classic relational DBMS technology. The featured SQL extension builds on table expressions as building blocks, which can represent any kind of table-structured data source like base tables, view definitions, function applications, etc. Queries are internally represented in the query graph model (QGM) as directed graphs of high-level query operators, which are translated into physical operators for execution. A single node in the logical query graph can thereby reflect multiple physical operations at once, e.g., a scan, a projection, and a join. QGM has been recognized as particularly useful for join enumeration and other optimization techniques, but it also offers an extension point for new operators. QGM was adopted by the relational DBMS DB2, where it primarily drives relational workloads, but also successfully serves as a basis for object-relational and XML extensions.

Garlic

Garlic [RAH+96, CHS+95] is a data integration middleware, which provides an SQL-based query interface for heterogeneous and distributed data sources and supports path expressions, nested collections, and methods. The backbone is a flexible wrapper architecture, which encapsulates all data sources behind object-based interfaces. The specialty of Garlic is the flexible offloading of query functionality to individual wrappers, without the need to know the data sources themselves. Instead, all wrappers involved are identified and then participate directly in the query planning process. Queries over multiple data sources are decomposed and, depending on the capabilities of the wrapper implementation, delegated to the sources. Orchestration and missing functionality are handled by the Garlic query processor itself. The optimizer generates alternative query plans by asking each wrapper which share of the query plan it can handle and at what cost. The final plan is the combination with the cheapest cost.

Volcano

The Volcano system [Gra94] is a data flow query processing system, which abstracts all physical operators as pull-based iterators with the open-next-close protocol (see the sketch below). Queries are translated from a logical algebra, e.g., relational algebra, to Volcano's physical algebra, which implements standard query processing algorithms like filters and joins. Set-oriented processing and the interpretation of data are strictly separated by parameterizing iterators with support functions, which implement the actual data access logic. The framework provides extensibility in various dimensions. New data models are introduced as abstract data types with new support functions. New query capabilities or access methods are introduced by implementing the iterator model.
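The protocol can be captured by a small interface. The following Java sketch is a simplified rendering of the pull-based model described above; the predicate plays the role of a support function:

// Pull-based iterator in the style of the open-next-close protocol.
interface TupleIterator<T> {
    void open();   // prepare state, open the input(s)
    T next();      // return the next tuple, or null when exhausted
    void close();  // release resources, close the input(s)
}

// A filter operator: set orientation is generic, while the support
// function (the predicate) encapsulates the data interpretation.
final class FilterIterator<T> implements TupleIterator<T> {
    private final TupleIterator<T> input;
    private final java.util.function.Predicate<T> predicate;

    FilterIterator(TupleIterator<T> input, java.util.function.Predicate<T> predicate) {
        this.input = input;
        this.predicate = predicate;
    }

    public void open()  { input.open(); }
    public void close() { input.close(); }

    public T next() {
        // Pull tuples from the input until one qualifies.
        for (T t = input.next(); t != null; t = input.next()) {
            if (predicate.test(t)) return t;
        }
        return null;
    }
}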

9.4.2. Database Languages

Database languages pursue a two-step compilation approach for realizing flexible data processing platforms. A front-end compiler logically maps the data model and query constructs of a concrete language like SQL to a generic storage and query infrastructure, where the database language compiler performs the physical optimization for the mapped query. Naturally, performance directly correlates with the capabilities of the storage and the quality of the data mapping.


Without going into technical details, the following briefly discusses two approaches, which are based on radically different physical layouts. Further noteworthy approaches can be found in [Gü89] and [OBBt89].

MIL

MIL [BK99] is an intermediate language for building read-intensive query processing systems on top of the extensible database system Monet. It is designed for relational and object-oriented applications and uses a fragmented data model, i.e., MIL operates on binary tables (BATs) of primitive data types only. This allows for high scan performance by clustering related attributes, but requires the MIL program to reconstruct related data by itself. For reading a structured record, e.g., the BATs containing the respective attribute values must be joined. However, these equi-joins are relatively cheap, because MIL is designed for memory-resident data and can often perform them in an array-like fashion with positional lookups in fixed-length BATs (illustrated below).

The language forms an algebra of side-effect-free bulk operators, which cover standard query constructs like filter, join, aggregation, and grouping. A MIL program is a script of bulk operator statements, basic control structures, and variable bindings for intermediate results. A script is typically generated by the compiler of the front-end language, which logically optimizes the query and decides on the execution order of the algebraic BAT operators ("query strategy"). The MIL interpreter itself analyzes the script and chooses suitable algorithms to implement them ("tactics"). For this purpose, the system maintains sets of properties for BATs, which help to exploit physical properties like orderings. Multi-threaded parallelism is supported in two ways: statements with parallelizable BAT operators can be annotated with the desired degree of parallelism, and parallel execution of multiple statements can be specified by putting them inside parallel blocks.

MIL and the heavily optimized Monet infrastructure have been shown to achieve high performance for relational analytic workloads. In tandem with the Pathfinder compiler (see Section 9.5.2), they also proved to be highly efficient for running XQuery logic.
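The positional reconstruction can be pictured with two aligned arrays standing in for BATs. This is a loose Java illustration, not MIL syntax:

// Loose illustration (not MIL): two BATs store the attributes of a
// record column-wise. Because the BATs are aligned and fixed-length,
// the reconstruction "join" degenerates into a positional array lookup.
public class BatLookup {
    public static void main(String[] args) {
        int[]    id   = { 101, 102, 103 };        // BAT for attribute id
        String[] name = { "ann", "bob", "cho" };  // BAT for attribute name

        // Reconstruct the record at position i without hashing or sorting:
        int i = 1;
        System.out.println(id[i] + " -> " + name[i]); // prints 102 -> bob
    }
}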


FAD

FAD [DV92] is the strongly-typed interface language of the parallel database system Bubba [BAC+90]. It is a declarative language for transient and persistent collections of sets, tuples, and structured objects, which closely resembles the expression style of functional general-purpose languages.

A FAD program consists of a composition of actions, which are divided into basic actions like value construction and selection, and higher-order actions for implementing control logic and bulk logic. A let construct binds intermediate results to variables; explicit looping constructs and operations like filter, pump, and group provide the tool set for bulk processing. Data items have an identity, and side-effecting updates are allowed in some, but not all, action types. For type safety, a proper definition of all data types is required.

Aside from translating a program to operations of the parallel infrastructure, the compiler can optimize distributed data access. Higher-level algebraic optimization is not performed.

9.5. XQuery Compiler

The available XQuery compilers differ considerably in functionality, performance, and scalability, because they are designed for special use cases and target platforms. The spectrum reaches from lightweight stand-alone solutions for main-memory data to embedded compilers for database-resident data, and to specialized solutions for full-text search, streaming applications, etc.

The character of a compiler can be classified as either iterative or set-oriented. Iterative compilers closely follow the XQuery standard and evaluate queries in the normal nested-loops fashion. Set-oriented compilers, like the one presented in this thesis, first translate queries into a more general plan representation or algebra to compute independent subexpressions and multiple iterations with more efficient operations, in a different order, or even in parallel. An overview of XQuery processing models is presented in [BBB+09].


9.5.1. Iterative Compilers

Nested, iterator-based evaluation is widespread in main-memory engines, which focus on fast XML transformation and data extraction. The direct compilation of expressions is simple and fast and easily integrates all new programming features of XQuery 3.0 like function items and partial function application. Because most compilers represent data closely to the XQuery standard as sequences of items, they can run on top of different data layouts.

The dynamic context is usually represented as a mutable set of variables. This is simple and naturally fits the processing model. Some implementations use a global context object, whereas others split the context and store bindings locally in the scope of the corresponding sequence iterator. In the latter case, the local scopes must be recursively passed on to subexpressions, and the entire dynamic context turns into a hierarchical leaf-to-root composite.

The disadvantage of a mutable dynamic context is an implied sequential evaluation order of context-dependent expressions (see the sketch below). It requires special treatment in individual situations to perform common optimizations like the parallel computation of (partial) results or the application of an efficient join algorithm instead of computing and filtering the Cartesian product. As a result, processors often suffer from nested-loops semantics and poor scalability. Because of its simplicity, it is, nevertheless, the most common design used in main-memory XQuery processors today [Kay08, Mei09, BBB+09, Grü10].
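The implied evaluation order can be seen in a toy rendering of such an engine. The sketch below (illustrative Java, not an actual XQuery processor) evaluates for $x in (1, 2, 3) return $x * 2 against one shared, mutable context: because every iteration overwrites the same binding slot before the body is evaluated, the iterations cannot run independently or in parallel.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a mutable dynamic context shared by all
// subexpressions forces strictly sequential iteration.
public class MutableContext {
    static final Map<String, Object> context = new HashMap<>();

    public static void main(String[] args) {
        for (Object item : List.of(1, 2, 3)) { // for $x in (1, 2, 3)
            context.put("x", item);            // bind $x in the shared context
            evaluateBody();                    // return $x * 2
        }
    }

    static void evaluateBody() {
        int x = (Integer) context.get("x");    // context-dependent expression
        System.out.println(x * 2);
    }
}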

9.5.2. Set-oriented Compilers

Set-oriented compilers are prevalent in XML-enabled database backends and excel in scalability and extensive XML index support. However, most of them require data in a specialized internal layout or lack support for certain language features.

Compilers for native XML storages are often tailored to a particular data representation and XML node labeling scheme [HHMW07, OOP+04]. As a result, the optimizer is often faced with special operators and must cope with unusual query plan shapes or sophisticated dependencies between operators.


Compilers for relational target platforms suffer from the complexity of XQuery and need elaborate concepts to map queries and data to relational algorithms and data layouts. Relational thinking requires them to radically rewrite the whole query, which severely complicates the realization of language concepts beyond plain "Select-Project-Join" and requires an advanced optimizer to compile efficient query plans.

In both types of systems, the programming aspects of XQuery like user-defined functions, recursion, and higher-order functions usually fall behind. They simply do not fit into the picture of database-style processing. This disqualifies such systems as runtime environments for general data programming tasks. Nevertheless, users often accept partial language conformance as long as a system meets their compatibility and performance needs. On the flip side, they cannot take advantage of one of XQuery's biggest strengths: the ability to represent, interpret, and process different kinds of data from diverse sources.

In the following, we summarize the key aspects of the most prominent approaches.

Pathfinder

Pathfinder [GRT08] is a compiler for relational backends, which requires all data, i.e., item values, sequences, and entire XML documents, to be encoded in a ternary table layout. Variable bindings and iteration scopes are loop-lifted to turn nested loops into operations on "unrolled" tables. A specialized join operator speeds up XML processing [GvKT03]. Pathfinder is especially successful on top of the MonetDB backend [BGvK+06], because the ternary table layout and the frequent equi-join operations for loop-lifting suit the system's infrastructure. However, the strict relational view complicates the sometimes subtle semantics of XQuery (e.g., in comparisons) as well as its functional aspects. Furthermore, loop-lifting causes query plans to quickly grow in size and complexity [GMR09].

XTC

XQGM is the logical, tuple-based operator graph representation of a query in XTC [Mat09]. It is based on the relational query graph model (QGM) [HFLP89] and introduces so-called correlated edges, which indicate context dependencies, i.e., nestings, between operators. Extensive unnesting rules eliminate these correlations and apply independent join operations instead. Special consideration is thereby given to opportunities for twig join processing. After logical optimization, a great variety of physical operator alternatives and evaluation strategies is available to compile XQGM into an executable plan. Query rewriting and optimization within XQGM is a non-trivial task, because the correlated edges turn the operator tree into a directed graph.

NAL

NAL [MHM04] is an algebra of tuple-based operators, which accommodate XQuery's FLWOR bindings. They consume and produce nested tuples forming sequences of sets of variable bindings. After translating a query into algebraic form, the compiler applies a set of equivalence rules to rewrite nested queries to more efficient set-oriented constructs. In principle, the approach is applicable to any storage, but even without considering physical properties, the pattern matching for the rewritings is quite complex. Furthermore, the approach assumes nested tuples, which generally complicates variable handling during compilation and the implementation of efficient physical operators.

Galax

Galax [RSF06] comes with a feature-complete XQuery algebra, which distinguishes between XML operators, tuple operators, and a third group of explicit boundary operators, which connect the two other parts of the algebra. Similar to the compiler developed in this thesis, expression trees are compiled directly and FLWORs are compiled to tuple operator trees. Nestings are modeled as dependent join operations. Galax also supports basic optimizations for unnesting and value-based joins. Further optimizations are not mentioned.

IBM pureXML

IBM's pureXML extends the QGM-based query representation of the relational database DB2 for the joint processing of XML and relational data [ÖSW08]. FLWORs are modeled in the set-oriented QGM by combinations of new ForEach quantifier types and existing table-valued operators like SELECT. All other expression types are realized as conventional scalar functions. XML processing is widely delegated to the native storage in the form of path matching routines. Although the mature optimizer can be reused, several minor and major rewritings are still necessary to yield efficient query plans. Furthermore, the compiler heavily depends on the advanced storage engine to push down complex path patterns.

10. Summary and Future Work

This thesis presented a complete walk-through of a retargetable compiler infrastructure and runtime platform for processing structured and semi-structured data.

Starting from a specification of the basic ingredients for declarative query processing and scripting, we derived a compiler around a top-down query representation, which naturally models variable bindings and nested evaluation scopes. We obtained a hybrid design based on plain expression trees with integrated operator pipelines. It localizes expensive bulk sections in a query and allows them to be optimized and compiled separately from remaining aspects like data access operations, functions, and arithmetics.

The concept developed is fully composable, i.e., query nestings may be arbitrarily deep and rewrite rules are independent of the surrounding expression. Furthermore, the approach is not limited to certain query constructs, and custom functionality can be plugged in flexibly as functions, custom expression types, and even custom operators.

On top of the query representation, we presented the realization of proven set-oriented optimization techniques, which can be reduced to algebraic transformations in the functional monadic core. For efficient data handling at runtime, we demonstrated compilation strategies and extension points for tailoring the compilation process to a specific data store.

For exploiting the capabilities of modern multi-core and many-core architectures, we introduced a novel push-based operator model, which is capable of dynamically parallelizing query sections. We gave detailed insight into effective data partitioning and load balancing mechanisms as well as the realization of the most important types of query operators.

Finally, we empirically evaluated all facets of the approach for different data sets and in various settings.


10.1. Conclusions

In retrospect, the concept of a data programming language as proposed in this thesis is a composition of ideas that originate from all epochs of query processing. It borrows the functional scripting style and the nested-loops-based query logic from XQuery/Quilt, but strips it of the narrow focus on XML¹. Instead, it picks up the extensibility idea from the object-relational world and uses the skeleton of looping constructs, variable bindings, and tuple streams for bulk processing of any kind of structured and semi-structured data.

At the data level, we embrace the idea of abstract data types from object-relational settings, because it implies the encapsulation of data access routines for system-internal optimization. However, we abstain from other "golden rules" ([ABD+89]) like object identity, persistence, and concurrency, because they have proven to be sources of inefficiency (e.g., cyclic data structures) or to be inappropriate in the context of a query language. At the same time, we emphasize the need for schema-less structural pattern matching in semi-structured data. Because of the latter, the data model builds on tree-structured compositions of array and map structures, which subsumes the representation of structured data as well.

Blending so many ideas in a single concept always carries the danger of inefficiency through too many abstraction layers. Therefore, we followed the lesson learned from the success of the relational model, namely that simplicity is key, and imported as much practical knowledge about relational query processing as possible. Furthermore, we aimed at a simple compilation and evaluation process, grounded bulk processing in plain function composition, and stepped back from attempts to implement a universal but complex (nested) algebraic solution.

In the end, the concepts developed and orchestrated in this thesis proved sustainable enough to carry our vision of building a compiler and runtime kernel, which integrates best-of-breed concepts from decades of data management and can serve as a solid foundation for the rapid development of tailored query and data processing systems.

¹Fittingly, the Quilt grammar itself embodied XPathExpression as a terminal symbol [RCF00].


10.2. Outlook and Future Work

As usual in a thesis, it is not possible to exhaustively cover and discuss all topics addressed or touched upon. Even though the cornerstones for the compiler and the parallel query runtime have been laid in this work, a wide field of topics is left open for future research and improvements. Many opportunities for improvement will emerge from practical use, refinement, and evolution of the prototype developed, but others deserve thorough consideration in theoretical contexts as well.

Traditionally, query optimization provides the largest playground for additional work. At the logical level, state-of-the-art techniques like join enumeration, statistics, and cost-based query optimization have to be ported and evaluated. It is also interesting to examine how the structures and idioms of concrete front-end languages interact with the applicability and effectiveness of optimization rules. Likely, one will identify sets of universally applicable and of specific optimizations, which can be bundled into optimization profiles for different languages and environments.

At the physical level, this work only scratched the surface of possible optimizations. Aside from smart tracing and mapping routines for particular classes of storages, plenty of challenges are worth looking at. In the days of cache-optimized storage structures and algorithms, for example, a portable compiler framework needs to find new ways of augmenting query plans with additional information for exploiting and optimizing data locality. Where current relational systems can rely on fixed-size tuple structures and metadata, more powerful concepts are needed to model and trace data sizes and locality within the compilation process. Efficient handling of variable-sized, tree-structured data and the interaction with external storage in query operators are also important for practical use cases.

Regarding the parallelizing operator model, there are also opportunities for improvement. Although parallelization is performed autonomously, it still relies on predefined thresholds for the partitioning step and on buffer management to keep parallelism and memory consumption under control. This aspect is open for dynamic approaches from the field of self-tuning systems. Improved scheduling mechanisms may take the overall progress and memory requirements of a query into account. Last but not least, scalability demands suggest extending the capabilities for automated parallelism to distributed platforms.

A. Translation of Operator Pipelines

Listing 17 Translation of operator pipeline.

function compile pipe expression(ast)
    operator ← child(0, ast)
    end ← create end()
    start ← compile operator(operator, end)
    expr ← create pipe expression(start, end)
    return expr
end function

function compile operator(ast, end)
    if ast is START then
        op ← compile start(ast, end)
    else if ast is END then
        e ← child(0, ast)
        expr ← compile expression(e)
        set expression(end, expr)
        op ← end
    else if ast is FORBIND then
        op ← compile forbind(ast, end)
    else if ... then
        ...
    end if
    return op
end function


Listing 18 Translation of ForBind operator.

function compile forbind(ast, end)
    var ← run variable(ast)
    pos ← pos variable(ast)
    bind(table, var)
    bind(table, pos)
    expr ← child(0, ast)
    e ← compile expression(expr)
    out ← child(1, ast)
    o ← compile operator(out, end)
    bind v ← unbind(table, var)
    bind p ← unbind(table, pos)
    op ← create forbind operator(e, o, bind v, bind p)
    return op
end function

B. Suspend in Chained Sinks

Chained sinks (see Section 7.3.3) are the only components in the push-based operator model that influence the scheduling of tasks and, thereby, of workers. The following proof shows that deadlocks cannot occur, even if worker processes are temporarily suspended in the end() routine of chained sinks.

Proof. Let t1 and t2 be two binding tasks created by a process p for two consecutive partitions of a binding sequence. Let the respective partitions of t1 and t2 be small enough so that they can be output without further partitioning to the chained sibling sinks s1 and s2, respectively. Assume the chain token is owned by s1.

In the fork/join model, t1 will be executed directly by p. In the producer/consumer model, t1 is guaranteed to be executed either by p or by a free worker f. In both cases, t1 is completed without suspending the executing process, and the token will be passed on from s1 to s2.

If t2 was not stolen or adopted by a second worker w before p joins t2 after the completion of t1, p will execute t2 itself without being suspended. Thereafter, the token will be owned by s2's right sibling in the fork tree. If t2, however, was assigned to a parallel worker w, three cases may occur:

a) If s2 receives the token before w calls end() on s2, any pending and all further incoming tuples will be processed by w itself, and t2 is completed without suspending w. The token will be propagated to s2's right sibling.

b) If w calls end() on s2 before the token is received and sufficient buffer memory is available, t2 is completed by w, but any pending tuples are left behind at s2. They will be taken care of when the process that executes t1 passes the token on from s1 to s2. Thereafter, this process will propagate the token to s2's right sibling.


c) If w calls end() on s2 before the token is received and the available buffer memory is insufficient, w is suspended. When s2 receives the token from the process that executes t1, w is restarted and continues the execution of t2 by processing all pending and all further incoming tuples. At the end, w calls end() on s2 to finalize the sink, and the token gets propagated to s2's right sibling.

As shown, the token from s1 will always be propagated over the sibling sink s2 to the next sink in the chain. Since the left-most sink always starts with the token and is handled by a worker, we can deduce by induction that at least one worker will always be active and propagate the token. This property holds for the chained sinks at the leaf level of any fork tree, and because the left-to-right execution precedence is guaranteed for all corresponding tasks, cyclic wait conditions cannot occur.

C. Benchmark Queries

C.1. XMark

Q1

let $auction := doc("auction.xml") return
for $b in $auction/site/people/person[@id = "person0"]
return $b/name/text()

Q2

let $auction := doc("auction.xml") return
for $b in $auction/site/open_auctions/open_auction
return {$b/bidder[1]/increase/text()}

Q3

let $auction := doc("auction.xml") return
for $b in $auction/site/open_auctions/open_auction
where zero-or-one($b/bidder[1]/increase/text()) * 2
      <= $b/bidder[last()]/increase/text()
return

Q4

let $auction := doc("auction.xml") return
for $b in $auction/site/open_auctions/open_auction
where some $pr1 in $b/bidder/personref[@person = "person20"],
           $pr2 in $b/bidder/personref[@person = "person51"]
      satisfies $pr1 << $pr2
return {$b/reserve/text()}


Q5

let $auction := doc("auction.xml") return
count(
  for $i in $auction/site/closed_auctions/closed_auction
  where $i/price/text() >= 40
  return $i/price
)

Q6

let $auction := doc("auction.xml") return
for $b in $auction//site/regions
return count($b//item)

Q7

let $auction := doc("auction.xml") return
for $p in $auction/site
return count($p//description) + count($p//annotation)
       + count($p//emailaddress)

Q8

let $auction := doc("auction.xml") return
for $p in $auction/site/people/person
let $a := for $t in $auction/site/closed_auctions/closed_auction
          where $t/buyer/@person = $p/@id
          return $t
return {count($a)}

Q9

let $auction := doc("auction.xml") return
let $ca := $auction/site/closed_auctions/closed_auction return
let $ei := $auction/site/regions/europe/item
for $p in $auction/site/people/person
let $a := for $t in $ca
          where $p/@id = $t/buyer/@person
          return
            let $n := for $t2 in $ei
                      where $t/itemref/@item = $t2/@id
                      return $t2
            return {$n/name/text()}
return {$a}


Q10

let $auction := doc("auction.xml") return
for $i in distinct-values($auction/site/people/person/
                          profile/interest/@category)
let $p := for $t in $auction/site/people/person
          where $t/profile/interest/@category = $i
          return
            {$t/profile/gender/text()}
            {$t/profile/age/text()}
            {$t/profile/education/text()}
            {fn:data($t/profile/@income)}
            {$t/name/text()}
            {$t/address/street/text()}
            {$t/address/city/text()}
            {$t/address/country/text()}
            {$t/emailaddress/text()}
            {$t/homepage/text()}
            {$t/creditcard/text()}
return {{$i}, $p}

Q11

let $auction := doc("auction.xml") return
for $p in $auction/site/people/person
let $l := for $i in $auction/site/open_auctions/open_auction/initial
          where $p/profile/@income > 5000 * exactly-one($i/text())
          return $i
return {count($l)}


Q12

let $auction := doc("auction.xml") return
for $p in $auction/site/people/person
let $l := for $i in $auction/site/open_auctions/open_auction/initial
          where $p/profile/@income > 5000 * exactly-one($i/text())
          return $i
where $p/profile/@income > 50000
return {count($l)}

Q13

let $auction := doc("auction.xml") return
for $i in $auction/site/regions/australia/item
return {$i/description}

Q14

let $auction := doc("auction.xml") return
for $i in $auction/site//item
where contains(string(exactly-one($i/description)), "gold")
return $i/name/text()

Q15

let $auction := doc("auction.xml") return
for $a in $auction/site/closed_auctions/closed_auction/annotation/
          description/parlist/listitem/parlist/listitem/text/
          emph/keyword/text()
return {$a}

Q16

let $auction := doc("auction.xml") return
for $a in $auction/site/closed_auctions/closed_auction
where not(
  empty(
    $a/annotation/description/parlist/listitem/
    parlist/listitem/text/emph/keyword/text()
  )
)
return


Q17

let $auction := doc("auction.xml") return
for $p in $auction/site/people/person
where empty($p/homepage/text())
return

Q18

declare namespace local = "http://www.foobar.org";
declare function local:convert($v as xs:decimal?) as xs:decimal?
{
  2.20371 * $v (: convert Dfl to Euro :)
};

let $auction := doc('auction.xml') return
for $i in $auction/site/open_auctions/open_auction
return local:convert(zero-or-one($i/reserve))

Q19

let $auction := doc("auction.xml") return
for $b in $auction/site/regions//item
let $k := $b/name/text()
order by zero-or-one($b/location) ascending
return {$b/location/text()}


Q20

let $auction := doc("auction.xml") return
{count($auction/site/people/person/profile[@income >= 100000])}
{
  count(
    $auction/site/people/person/
    profile[@income < 100000 and @income >= 30000]
  )
}
{count($auction/site/people/person/profile[@income < 30000])}
{
  count(
    for $p in $auction/site/people/person
    where empty($p/profile/@income)
    return $p
  )
}


C.2. TPCH

Q2 (XQuery)

declare ordering unordered;
declare variable $schema-file external;

let $schema := rel:parse-schema($schema-file, (), ())
for $p in $schema=>part,
    $s in $schema=>supplier,
    $ps in $schema=>partsupp,
    $n in $schema=>nation,
    $r in $schema=>region
where $p=>p_partkey eq $ps=>ps_partkey
  and $s=>s_suppkey eq $ps=>ps_suppkey
  and $p=>p_size eq 15
  and fn:ends-with($p=>p_type, 'BRASS')
  and $s=>s_nationkey eq $n=>n_nationkey
  and $n=>n_regionkey eq $r=>r_regionkey
  and $r=>r_name eq 'EUROPE'
let $supplycost := for $ps in $schema=>partsupp,
                       $s in $schema=>supplier,
                       $n in $schema=>nation,
                       $r in $schema=>region
                   where $p=>p_partkey eq $ps=>ps_partkey
                     and $s=>s_suppkey eq $ps=>ps_suppkey
                     and $s=>s_nationkey eq $n=>n_nationkey
                     and $n=>n_regionkey eq $r=>r_regionkey
                     and $r=>r_name eq 'EUROPE'
                   return $ps=>ps_supplycost
where $ps=>ps_supplycost eq min($supplycost)
order by $s=>s_acctbal descending, $n=>n_name, $s=>s_name, $p=>p_partkey
return {
  s_acctbal : $s=>s_acctbal,
  s_name    : $s=>s_name,
  n_name    : $n=>n_name,
  p_partkey : $p=>p_partkey,
  p_mfgr    : $p=>p_mfgr,
  s_address : $s=>s_address,
  s_phone   : $s=>s_phone,
  s_comment : $s=>s_comment
}


Q2 (SQL)

SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr,
       s_address, s_phone, s_comment
FROM part, supplier, partsupp, nation, region
WHERE p_partkey = ps_partkey
  AND s_suppkey = ps_suppkey
  AND p_size = 15
  AND p_type LIKE '%BRASS'
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'EUROPE'
  AND ps_supplycost = (
        SELECT min(ps_supplycost)
        FROM partsupp, supplier, nation, region
        WHERE p_partkey = ps_partkey
          AND s_suppkey = ps_suppkey
          AND s_nationkey = n_nationkey
          AND n_regionkey = r_regionkey
          AND r_name = 'EUROPE'
      )
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
FETCH FIRST 100 ROWS ONLY

Q6 (XQuery)

declare ordering unordered;
declare variable $schema-file external;

let $schema := rel:parse-schema($schema-file, (), ())
for $l in $schema=>lineitem
where $l=>l_discount le 0.07
  and $l=>l_discount ge 0.05
  and $l=>l_shipdate ge xs:date("1994-01-01")
  and $l=>l_shipdate lt xs:date("1994-01-01")
                        + xs:yearMonthDuration("P1Y")
  and $l=>l_quantity lt 24
let $revenue := $l=>l_extendedprice * $l=>l_discount
group by *
return { revenue : sum($revenue) }


Q6 (SQL)

SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date ('1994-01-01')
  AND l_shipdate < date ('1994-01-01') + 1 year
  AND l_discount BETWEEN .06 - 0.01 AND .06 + 0.01
  AND l_quantity < 24


C.3. TPoX

Filter query (sequential)

declare default element namespace "http://tpox-benchmark.com/custacc";

for $xml in ( bit:partition min=50 max=50 queue=6 )
            { tpox:read-batch('batch-1.xml') }
let $customer := bit:parse($xml)/Customer
let $balance :=
  ( for $account in $customer/Accounts/Account
    let $obalance := $account/Balance/OnlineActualBal
    where $obalance > 900000
      and $account/Currency = "EUR"
      and ($customer/Nationality = "Greece"
           or $customer/Nationality = "Germany")
    return $obalance )
where not(empty($balance))
return
  {$customer/ShortNames/ShortName/text()}
  {max($balance)}
  {$customer/Nationality/text()}

Note: The divide-and-conquer versions of the above and the following queries differ solely in the input function used. The divide-and-conquer versions read batches with the function tpox:read-batch-parallel().


Filter query (sequential, expensive binding)

declare default element namespace "http://tpox-benchmark.com/custacc";

for $doc in ( bit:partition min=50 max=50 queue=6 )
            { tpox:parse-batch('batch-1.xml') }
let $customer := $doc/Customer
let $balance :=
  ( for $account in $customer/Accounts/Account
    let $obalance := $account/Balance/OnlineActualBal
    where $obalance > 900000
      and $account/Currency = "EUR"
      and ($customer/Nationality = "Greece"
           or $customer/Nationality = "Germany")
    return $obalance )
where not(empty($balance))
return
  {$customer/ShortNames/ShortName/text()}
  {max($balance)}
  {$customer/Nationality/text()}


Filter query (sequential, multiple batches)

declare default element namespace "http://tpox-benchmark.com/custacc";
declare variable $count external;

for $batch in (1 to $count)
for $xml in ( bit:partition min=50 max=50 queue=6 )
            { tpox:read-batch(concat('batch-', $batch, '.xml')) }
let $customer := bit:parse($xml)/Customer
let $balance :=
  ( for $account in $customer/Accounts/Account
    let $obalance := $account/Balance/OnlineActualBal
    where $obalance > 900000
      and $account/Currency = "EUR"
      and ($customer/Nationality = "Greece"
           or $customer/Nationality = "Germany")
    return $obalance )
where not(empty($balance))
return
  {$customer/ShortNames/ShortName/text()}
  {max($balance)}
  {$customer/Nationality/text()}

Group query (sequential)

declare default element namespace "http://tpox-benchmark.com/custacc";

for $xml in ( bit:partition min=50 max=50 queue=6 )
            { tpox:read-batch('custacc/batch-1.xml') }
let $customer := bit:parse($xml)/Customer
where $customer/Accounts/Account/Balance/OnlineActualBal > 500000
let $nationality := $customer/Nationality
group by $nationality
return
  {$nationality}
  {count($customer)}


Group query (sequential, multiple batches)

declare default element namespace "http://tpox-benchmark.com/custacc";
declare variable $count external;

for $batch in (1 to $count)
for $xml in ( bit:partition min=50 max=50 queue=6 )
            { tpox:read-batch(concat('custacc/batch-', $batch, '.xml')) }
let $customer := bit:parse($xml)/Customer
where $customer/Accounts/Account/Balance/OnlineActualBal > 500000
let $nationality := $customer/Nationality
group by $nationality
return
  {$nationality}
  {count($customer)}

Join query (sequential)

declare default element namespace
  "http://www.fixprotocol.org/FIXML-4-4";
declare namespace c="http://tpox-benchmark.com/custacc";

for $ordxml in ( bit:partition min=50 max=50 queue=6 )
               { tpox:read-batch('order/batch-1.xml') }
let $ord := bit:parse($ordxml)/FIXML/Order
for $custxml in ( bit:partition min=50 max=50 )
                { tpox:read-batch('custacc/batch-1.xml') }
let $cust := bit:parse($custxml)/c:Customer
where $ord/OrdQty/@Cash > 3000
  and $cust/c:CountryOfResidence = "Germany"
  and $cust/c:Accounts/c:Account/@id = $ord/@Acct/fn:string(.)
return $cust

Bibliography

[ABD+89] Malcolm Atkinson, François Bancilhon, David DeWitt, Klaus Dittrich, David Maier, and Stanley Zdonik. The Object-Oriented Database System Manifesto. 1989.

[Abi97] Serge Abiteboul. Querying Semi-structured Data. In Proc. ICDT, LNCS, vol. 1186, Springer, 1997, pp. 1–18.

[AG08] Renzo Angles and Claudio Gutierrez. Survey of Graph Database Models. ACM Computing Surveys, 40(1):1–39, February 2008.

[Amd67] Gene M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proc. AFIPS (Spring), vol. 30, 1967, pp. 483–485.

[AQM+97] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet Wiener. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries, 1:68–88, 1997.

[ASS85] Harold Abelson, Gerald Jay Sussman, and Julie Sussman. Structure and Interpretation of Computer Programs, MIT Press, Cambridge, MA, 1985.

[BAC+90] Haran Boral, William Alexander, Larry Clay, George Copeland, Scott Danforth, Michael Franklin, Brian Hart, Marc Smith, and Patrick Valduriez. Prototyping Bubba, A Highly Parallel Database System. IEEE Trans. on Knowl. and Data Eng., 2(1):4–24, March 1990.

[BAS] BaseX XML Database. http://basex.org.


[Bat88] Don S. Batory. Concepts for a Database System Compiler. In Proc. PODS, 1988, pp. 184–192.

[BBB+09] Roger Bamford, Vinayak R. Borkar, Matthias Brantner, Peter M. Fischer, Daniela Florescu, David A. Graf, Donald Kossmann, Tim Kraska, Dan Muresan, Sorin Nasoi, and Markos Zacharioudaki. XQuery Reloaded. PVLDB, 2(2):1342–1353, 2009.

[BBG+88] D. S. Batory, J. R. Barnett, J. F. Garza, K. P. Smith, K. Tsukuda, B. C. Twichell, and T. E. Wise. GENESIS: An Extensible Database Management System. IEEE Transactions on Software Engineering, 14(11):1711–1730, 1988.

[BEG+11] Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Y. Eltabakh, Carl-Christian Kanne, Fatma Özcan, and Eugene J. Shekita. Jaql: A Scripting Language for Large Scale Semistructured Data Analysis. PVLDB, 4(12):1272–1283, 2011.

[BFG+06] Peter Boncz, Jan Flokstra, Torsten Grust, Maurice van Keulen, Stefan Manegold, Sjoerd Mullender, Jan Rittinger, and Jens Teubner. MonetDB/XQuery - Consistent & Efficient Updates on the Pre/Post Plane. In Proc. EDBT, 2006, pp. 1190–1193.

[BFS00] Peter Buneman, Mary Fernandez, and Dan Suciu. UnQL: A Query Language and Algebra for Semistructured Data Based on Structural Recursion. VLDB Journal, 9(1):76–110, March 2000.

[BGvK+05] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. Pathfinder: XQuery—the relational way. In Proc. VLDB, 2005, pp. 1322–1325.

[BGvK+06] Peter A. Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine. In Proc. SIGMOD, 2006, pp. 479–490.


[BHKM05] Matthias Brantner, Sven Helmer, Carl-Christian Kanne, and Guido Moerkotte. Full-fledged Algebraic XPath Processing in Natix. In Proc. ICDE, 2005, pp. 705–716.

[BK99] Peter A. Boncz and Martin L. Kersten. MIL Primitives for Querying a Fragmented World. VLDB Journal, 8(2):101–119, October 1999.

[BKS02] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. In Proc. SIGMOD, 2002, pp. 310–321.

[BL94] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proc. FOCS, 1994, pp. 356–368.

[BNTW95] Peter Buneman, Shamim Naqvi, Val Tannen, and Limsoon Wong. Principles of Programming with Complex Objects and Collection Types. Theoretical Computer Science, 149:3–48, 1995.

[BT99] Catriel Beeri and Yariv Tzaban. SAL: An Algebra for Semistructured Data and XML. In Informal Proc. SIGMOD Workshop on The Web and Databases, 1999, pp. 37–42.

[Bun97] Peter Buneman. Semistructured Data. In Proc. PODS, 1997, pp. 117–121.

[CD88] Michael J. Carey and David J. DeWitt. A Data Model and Query Language for EXODUS. In Proc. SIGMOD, 1988, pp. 413–423.

[CDG+90] Michael J. Carey, David J. DeWitt, Goetz Graefe, David M. Haight, Joel E. Richardson, Daniel T. Schuh, Eugene J. Shekita, and Scott L. Vandenberg. The EXODUS Extensible DBMS Project: An Overview. In Readings in Object-Oriented Database Systems, Morgan Kaufmann, 1990, pp. 474–499.

[CFG02] Giovanni Conforti, Orlando Ferrara, and Giorgio Ghelli. TQL Algebra and its Implementation. In Proc. IFIP, 2002, pp. 422–434.


[CG04] Luca Cardelli and Giorgio Ghelli. TQL: A Query Language for Semistructured Data Based on the Ambient Logic. Mathematical Structures in Computer Science, 14(3):285–327, 2004.

[CH90] Michael Carey and Laura Haas. Extensible Database Management Systems. SIGMOD Record, 19(4):54–60, December 1990.

[CHS+95] Michael J. Carey, Laura M. Haas, Peter M. Schwarz, Manish Arya, William F. Cody, Ronald Fagin, John Thomas, John H. Williams, and Edward L. Wimmers. Towards Heterogeneous Multimedia Information Systems: The Garlic Approach. In Proc. RIDE Workshop DOM, 1995, pp. 124–131.

[CL05] David Chase and Yossi Lev. Dynamic Circular Work-stealing Deque. In Proc. SPAA, 2005, pp. 21–28.

[CLWY06] Ezra Cooper, Sam Lindley, Philip Wadler, and Jeremy Yallop. Links: Web Programming Without Tiers. In Proc. FMCO, Springer, 2006.

[CM94] Sophie Cluet and Guido Moerkotte. Nested Queries in Object Bases. In Proc. DBPL, 1994, pp. 226–242.

[Cod70] Edgar F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6):377–387, 1970.

[Cod80] Edgar F. Codd. Data Models in Database Management. In Proc. SIGMOD Workshop on Data Abstraction, Databases and Conceptual Modelling, vol. 11, 1980, pp. 112–114.

[Cod90] Edgar F. Codd. The Relational Model for Database Management: Version 2, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.

[Coo97] Richard Cooper. Object Databases: An ODMG Approach, Database Technology Series, International Thomson Computer Press, 1997.


[Cro06] Douglas Crockford. The application/json Media Type for JavaScript Object Notation (JSON). July 2006, RFC 4627 (Informational).

[DG92] David J. DeWitt and Jim Gray. Parallel Database Systems: The Future of High-performance Database Systems. Communications of the ACM, 35:85–98, 1992.

[DG08] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107–113, January 2008.

[Dit86] Klaus R. Dittrich. Object-oriented Database Systems (extended abstract): The Notions and the Issues. In Proc. OODBS, 1986, pp. 2–4.

[DV92] Scott Danforth and Patrick Valduriez. A FAD for Data Intensive Applications. IEEE Trans. on Knowl. and Data Eng., 4(1):34–51, February 1992.

[Feg98] Leonidas Fegaras. Query Unnesting in Object-oriented Databases. In Proc. SIGMOD, 1998, pp. 49–60.

[FGK06] Andrey Fomichev, Maxim Grinev, and Sergey Kuznetsov. Sedna: A Native XML DBMS. In Proc. SOFSEM, 2006, pp. 272–281.

[FHM+05] Mary Fernández, Jan Hidders, Philippe Michiels, Jérôme Siméon, and Roel Vercammen. Optimizing Sorting and Duplicate Elimination in XQuery Path Expressions. In Proc. DEXA, 2005, pp. 554–563.

[FLM98] Daniela Florescu, Alon Levy, and Alberto Mendelzon. Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27:59–74, 1998.

[Flo99] Daniela Florescu. Query Optimization in the Presence of Limited Access Patterns. In Proc. SIGMOD, 1999, pp. 311–322.


[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Proc. SIGPLAN, vol. 33, 1998, pp. 212–223.

[FSW00] Mary Fernández, Jérôme Siméon, and Philip Wadler. An Algebra for XML Query. In Proc. FST TCS 2000, LNCS, vol. 1974, 2000, pp. 11–45.

[GGSW11] George Giorgidze, Torsten Grust, Nils Schweinsberg, and Jeroen Weijers. Bringing Back Monad Comprehensions. In Proc. ACM Symposium on Haskell, 2011, pp. 13–22.

[GMR09] Torsten Grust, Manuel Mayr, and Jan Rittinger. XQuery Join Graph Isolation: Celebrating 30+ Years of XQuery Processing Technology. In Proc. ICDE, 2009, pp. 1167–1170.

[GMRS09] Torsten Grust, Manuel Mayr, Jan Rittinger, and Tom Schreiber. FERRY: Database-supported Program Execution. In Proc. SIGMOD, 2009, pp. 1063–1066.

[GR93] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.

[Gra94] Goetz Graefe. Volcano: An Extensible and Parallel Query Evaluation System. IEEE Trans. on Knowl. and Data Eng., 6(1):120–135, February 1994.

[GRT08] Torsten Grust, Jan Rittinger, and Jens Teubner. Pathfinder: XQuery Off the Relational Shelf. IEEE Data Eng. Bull., 31(4):7–14, 2008.

[Gru99] Torsten Grust. Comprehending Queries. Tech. report, 1999.

[Grü10] Christian Grün. Storing and Querying Large XML Instances. Ph.D. thesis, University of Konstanz, Konstanz, 2010.

[Gus88] John L. Gustafson. Reevaluating Amdahl's Law. Communications of the ACM, 31:532–533, 1988.


[GvKT03] Torsten Grust, Maurice van Keulen, and Jens Teubner. Staircase Join: Teach a Relational DBMS to Watch Its (Axis) Steps. In Proc. VLDB, 2003, pp. 524–535.

[GW97] Roy Goldman and Jennifer Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proc. VLDB, 1997, pp. 436–445.

[Gü89] Ralf Hartmut Güting. Gral: An Extensible Relational Database System for Geometric Applications. In Proc. VLDB, 1989, pp. 33–44.

[HFLP89] Laura M. Haas, Johann-Christoph Freytag, Guy M. Lohman, and Hamid Pirahesh. Extensible Query Processing in Starburst. SIGMOD Record, 18(2):377–388, 1989.

[HHMW05] Michael Peter Haustein, Theo Härder, Christian Mathis, and Markus Wagner. DeweyIDs - The Key to Fine-Grained Management of XML Documents. In Proc. SBBD, 2005, pp. 85–99.

[HHMW07] Theo Härder, Michael Peter Haustein, Christian Mathis, and Markus Wagner. Node Labeling Schemes for Dynamic XML Documents Reconsidered. Data & Knowledge Engineering, 60(1):126–149, 2007.

[HS02] Danny Hendler and Nir Shavit. Non-blocking Steal-half Work Queues. In Proc. PODC, 2002, pp. 280–289.

[Hut99] Graham Hutton. A Tutorial on the Universality and Expressiveness of Fold. Journal of Functional Programming, 9(4):355–372, July 1999.

[IBY+07] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. Proc. SIGOPS, 41(3):59–72, 2007.

[JLST02] H. Jagadish, Laks Lakshmanan, Divesh Srivastava, and Keith Thompson. TAX: A Tree Algebra for XML. Proc. DBPL, LNCS, vol. 2397, Springer, 2002, pp. 149–164.


[JS82] Gerhard Jaeschke and Hans-Jörg Schek. Remarks on the Algebra of Non-first Normal Form Relations. In Proc. PODS, 1982, pp. 124–138.

[JW07] Simon Peyton Jones and Philip Wadler. Comprehensive Comprehensions. In Proc. SIGPLAN Workshop on Haskell, 2007, pp. 61–72.

[KA02] Ken Kennedy and John R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann, San Francisco, CA, USA, 2002.

[Kay08] Michael Kay. Ten Reasons Why Saxon XQuery is Fast. IEEE Data Eng. Bull., 31(4):65–74, 2008.

[Klu82] Anthony Klug. Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions. Journal of the ACM, 29(3):699–717, July 1982.

[KM94] Alfons Kemper and Guido Moerkotte. Object-oriented Database Management: Applications in Engineering and Computer Science, Prentice-Hall, Inc., 1994.

[LA04] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proc. CGO, 2004, pp. 75–86.

[Lea00] Doug Lea. A Java Fork/Join Framework. In Proc. ACM Java Grande, 2000, pp. 36–43.

[Lie87] Henry Lieberman. Object-oriented Concurrent Programming. MIT Press, Cambridge, MA, USA, 1987, pp. 9–36.

[LKA05] Zhen Hua Liu, Muralidhar Krishnaprasad, and Vikas Arora. Native XQuery Processing in Oracle XMLDB. In Proc. SIGMOD, 2005, pp. 828–833.

[LR89] Christophe Lécluse and Philippe Richard. The O2 Database Programming Language. In Proc. VLDB, 1989, pp. 423–432.


[LS88] Barbara Liskov and Liuba Shrira. Promises: Linguistic Support for Efficient Asynchronous Procedure Calls in Distributed Systems. SIGPLAN Not., 23(7):260–267, June 1988.

[MAG+97] Jason McHugh, Serge Abiteboul, Roy Goldman, Dallas Quass, and Jennifer Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54–66, September 1997.

[Mat09] Christian Mathis. Storing, Indexing, and Querying XML Documents in Native Database Management Systems. Ph.D. thesis, University of Kaiserslautern, July 2009.

[MBB06] Erik Meijer, Brian Beckman, and Gavin Bierman. LINQ: Reconciling Object, Relations and XML in the .NET Framework. In Proc. SIGMOD, 2006, pp. 706–706.

[Mei07] Erik Meijer. Confessions of a Used Programming Language Salesman. SIGPLAN Not., 42(10):677–694, October 2007.

[Mei09] Wolfgang Meier. eXist: An Open Source Native XML Database. In Proc. Web, Web-Services, and Database Systems, 2009, pp. 169–183.

[MHM04] Norman May, Sven Helmer, and Guido Moerkotte. Nested Queries and Quantifiers in an Ordered Context. In Proc. ICDE, 2004, pp. 239–250.

[MHS09] Christian Mathis, Theo Härder, and Karsten Schmidt. Storing and Indexing XML Documents Upside Down. Computer Science - R&D, 24(1-2):51–68, 2009.

[MHSB12] Christian Mathis, Theo Härder, Karsten Schmidt, and Sebastian Bächle. XML Indexing and Storage: Fulfilling the Wish List. Computer Science - R&D, 27(2), June 2012.

[MXQ] MXQuery XQuery Processor. http://mxquery.org.

[NKS07] Matthias Nicola, Irina Kogan, and Berni Schiefer. An XML Transaction Processing Benchmark. In Proc. SIGMOD, 2007, pp. 937–948.


[NvdL05] Matthias Nicola and Bert van der Linden. Native XML Support in DB2 Universal Database. In Proc. VLDB, 2005, pp. 1164–1174.

[OBBt89] Atsushi Ohori, Peter Buneman, and Val Breazu-Tannen. Database Programming in Machiavelli, a Polymorphic Language with Static Type Inference. In Proc. SIGMOD, 1989, pp. 46–57.

[OO83] Zehra Meral Özsoyoğlu and Gültekin Özsoyoğlu. An Extension of Relational Algebra for Summary Tables. In Proc. SSDBM, 1983, pp. 202–211.

[OOP+04] Patrick E. O'Neil, Elizabeth J. O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and Nigel Westbury. ORDPATHs: Insert-Friendly XML Node Labels. In Proc. SIGMOD, 2004, pp. 903–908.

[ORS+08] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proc. SIGMOD, 2008, pp. 1099–1110.

[ÖSW08] Fatma Özcan, Normen Seemann, and Ling Wang. XQuery Rewrite Optimization in IBM DB2 pureXML. IEEE Data Eng. Bull., 31(4):25–32, 2008.

[PGmW95] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. Object Exchange Across Heterogeneous Information Sources. In Proc. ICDE, 1995, pp. 251–260.

[PK87] Constantine D. Polychronopoulos and David J. Kuck. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers, C-36(12):1425–1439, December 1987.

[PL92] Simon L. Peyton Jones and David R. Lester. Implementing Functional Languages: A Tutorial, Prentice Hall, 1992.

[Pol88] Constantine D. Polychronopoulos. Parallel Programming and Compilers, Kluwer Academic Publishers, Norwell, MA, USA, 1988.


[QEX] Qexo XQuery Processor. http://www.gnu.org/software/qexo.

[QIZ] Qizx XQuery Processor. http://www.xmlmind.com/qizx.

[RAH+96] Mary Tork Roth, Manish Arya, Laura M. Haas, Michael Carey, F. William Cody, Ronald Fagin, Peter M. Schwarz, Joachim Thomas, and Edward L. Wimmers. The Garlic Project. In Proc. SIGMOD, 1996, p. 557.

[RCF00] Jonathan Robie, Don Chamberlin, and Daniela Florescu. Quilt: An XML Query Language. 2000, http://www.almaden.ibm.com/cs/people/chamberlin/quilt_euro.html.

[Rei07] James Reinders. Intel Threading Building Blocks, O'Reilly & Associates, Inc., Sebastopol, CA, USA, 2007.

[Rey72] John C. Reynolds. Definitional Interpreters for Higher-order Programming Languages. In Proc. ACM Annual Conference, Volume 2, 1972, pp. 717–740.

[RKS88] Mark A. Roth, Henry F. Korth, and Abraham Silberschatz. Extended Algebra and Calculus for Nested Relational Databases. ACM Transactions on Database Systems, 13(4):389–417, October 1988.

[RS87] Lawrence A. Rowe and Michael Stonebraker. The POSTGRES Data Model. In Proc. VLDB, 1987, pp. 83–96.

[RSF06] Christopher Ré, Jérôme Siméon, and Mary F. Fernández. A Complete and Efficient Algebraic Compiler for XQuery. In Proc. ICDE, 2006, p. 14.

[Rys05] Michael Rys. XML and Relational Database Management Systems: Inside Microsoft SQL Server 2005. In Proc. SIGMOD, 2005, pp. 958–962.

[SA02] Carlo Sartiani and Antonio Albano. Yet Another Query Algebra for XML Data. In Proc. IDEAS, 2002, pp. 106–115.


[SAC+79] Patricia G. Selinger, Morton M. Astrahan, Don D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access Path Selection in a Relational Database Management System. In Proc. SIGMOD, 1979, pp. 23–34.

[SJ75] Gerald Jay Sussman and Guy L. Steele Jr. Scheme: An Interpreter for Extended Lambda Calculus. MIT AI Lab, Memo 349, 1975.

[SKS06] Abraham Silberschatz, Henry Korth, and S. Sudarshan. Database System Concepts, 5th ed., McGraw-Hill, Inc., New York, NY, USA, 2006.

[SLS06] William N. Scherer III, Doug Lea, and Michael L. Scott. Scalable Synchronous Queues. In Proc. PPoPP, 2006, pp. 147–156.

[SM95] Michael Stonebraker and Dorothy Moore. Object Relational DBMSs: The Next Great Wave, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1995.

[SPSW90] Hans-Jörg Schek, H.-Bernhard Paul, Marc H. Scholl, and Gerhard Weikum. The DASDBS Project: Objectives, Experiences, and Future Prospects. IEEE Trans. on Knowl. and Data Eng., 2(1):25–43, March 1990.

[SQL99] ANSI/ISO/IEC 9075-2:1999. Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). 1999.

[SQL03] ANSI/ISO/IEC 9075-2:2003. Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). 2003.

[SR92] Manojit Sarkar and Steve Reiss. A Data Model and A Query Language for Object-Oriented Databases. Tech. report, Providence, RI, USA, 1992.

[SRH90] Michael Stonebraker, Lawrence A. Rowe, and Michael Hirohama. The Implementation of POSTGRES. IEEE Trans. on Knowl. and Data Eng., 2(1):125–142, March 1990.


[STL11] Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl. Comparing High-level MapReduce Query Languages. In Proc. APPT, 2011, pp. 58–72.

[SWK+02] Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J. Carey, Ioana Manolescu, and Ralph Busse. XMark: A Benchmark for XML Data Management. In Proc. VLDB, 2002, pp. 974–985.

[TCBV10] Alexandros Tzannes, George C. Caragea, Rajeev Barua, and Uzi Vishkin. Lazy Binary-splitting: A Run-time Adaptive Work-stealing Scheduler. In Proc. SIGPLAN PPoPP, 2010, pp. 179–190.

[TPC12] Transaction Processing Performance Council (TPC). TPC Benchmark H (Decision Support), Standard Specification, Revision 2.14.4. 2012.

[TW89] Phil Trinder and Philip Wadler. Improving List Comprehension Database Queries. In Proc. TENCON, 1989, pp. 186–192.

[Tü03] Can Türker. SQL:1999 & SQL:2003 – Objektrelationales SQL, SQLJ & SQL/XML. 2003.

[VB96] Emilia E. Villarreal and Don S. Batory. Rosetta: A Generator of Data Language Compilers. In Proc. Symposium on Software Reuse, 1996.

[VXQ] VXQuery XQuery Processor. http://incubator.apache.org/vxquery.

[W3C98a] W3C. Document Object Model (DOM) Level 1 Specification Version 1.0 – W3C Recommendation 1 October 1998. 1998, http://www.w3.org/TR/REC-DOM-Level-1.

[W3C98b] W3C. XML-QL: A Query Language for XML – Submission to the World Wide Web Consortium 19 August 1998. 1998, http://www.w3.org/TR/NOTE-xml-ql.

[W3C98c] W3C. XML Query Language (XQL). 1998, http://www.w3.org/TandS/QL/QL98/pp/xql.html.


[W3C99] W3C. XML Path Language (XPath) Version 1.0 – W3C Recommendation 16 November 1999. 1999, http://www.w3.org/TR/xpath.

[W3C04a] W3C. XML Information Set (Second Edition) – W3C Recommendation 4 February 2004. 2004, http://www.w3.org/TR/xml-infoset.

[W3C04b] W3C. XML Schema Part 0: Primer Second Edition – W3C Recommendation 28 October 2004. 2004, http://www.w3.org/TR/xmlschema-0.

[W3C06a] W3C. Extensible Markup Language (XML) 1.1 (Second Edition) – W3C Recommendation 16 August 2006, edited in place 29 September 2006. 2006, http://www.w3.org/TR/xml11.

[W3C06b] W3C. XQuery Update Facility – W3C Working Draft 11 July 2006. 2006, http://www.w3.org/TR/xqupdate.

[W3C07] W3C. XML Path Language (XPath) 2.0 – W3C Recommendation 23 January 2007. 2007, http://www.w3.org/TR/xpath20.

[W3C09] W3C. Namespaces in XML 1.0 (Third Edition) – W3C Recommendation 8 December 2009. 2009, http://www.w3.org/TR/xml-names.

[W3C10a] W3C. XQuery 1.0: An XML Query Language (Second Edition) – W3C Recommendation 14 December 2010. 2010, http://www.w3.org/TR/xquery.

[W3C10b] W3C. XQuery 1.0 and XPath 2.0 Formal Semantics. 2010, http://www.w3.org/TR/xquery-semantics.

[W3C10c] W3C. XQuery 3.0: An XML Query Language. 2010, http://www.w3.org/TR/xquery-30.

[W3C10d] W3C. XQuery Scripting Extension 1.0 – W3C Working Draft 8 April 2010. 2010, http://www.w3.org/TR/xquery-sx-10.


[W3C11a] W3C. XPath and XQuery Functions and Operators 3.0 – W3C Working Draft 13 December 2011. 2011, http://www.w3.org/TR/xpath-functions-30.

[W3C11b] W3C. XQuery and XPath Data Model 3.0 – W3C Working Draft 13 December 2011. 2011, http://www.w3.org/TR/xpath-datamodel-30.

[W3C11c] W3C. XSLT and XQuery Serialization 3.0 – W3C Working Draft 13 December 2011. 2011, http://www.w3.org/TR/xslt-xquery-serialization-30.

[Wad90] Philip Wadler. Comprehending Monads. In Proc. LFP, 1990, pp. 61–78.

[WB87] Michael Wolfe and Utpal Banerjee. Data Dependence and its Application to Parallel Processing. International Journal of Parallel Programming, 16:137–178, 1987. doi:10.1007/BF01379099.

[Weg87] Peter Wegner. Dimensions of Object-based Language Design. SIGPLAN Not., 22(12):168–182, December 1987.

[Whi09] Tom White. Hadoop: The Definitive Guide, O'Reilly Media, 2009.

[Won98] Limsoon Wong. Kleisli, a Functional Query System. Journal of Functional Programming, 10, 1998.

[Woo12] Peter T. Wood. Query Languages for Graph Databases. SIGMOD Record, 41(1):50–60, April 2012.

[XQI] XQilla XQuery Processor. http://xqilla.sourceforge.net.

Curriculum Vitae

Personal Information

Name: Sebastian Bächle
Date of birth: 11 November 1981
Place of birth: Zweibrücken
Marital status: married

School Education

1988 – 1992: Grundschule Kirrberg, Homburg (Saar)
1992 – 2001: Saarpfalz Gymnasium Homburg, Homburg (Saar)
06/2001: Abitur (university entrance qualification)

Civilian Service

08/2001 – 06/2002: Paramedic, Malteser Hilfsdienst, Homburg (Saar)

University Studies

10/2002 – 08/2007: Applied Computer Science (Angewandte Informatik), TU Kaiserslautern
08/2007: Award of the academic degree Dipl.-Inf.

Professional Experience

09/2007 – 08/2012: Research assistant, Databases and Information Systems Group, Department of Computer Science, TU Kaiserslautern
since 10/2012: Software developer, SAP AG, Walldorf