Technische Universität München · Institut für Informatik · Lehrstuhl für Datenbanksysteme
Pathfinder: XQuery Compilation Techniques for Relational Database Targets
Jens Thilo Teubner
Full reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München in fulfillment of the requirements for the degree of Doktor der Naturwissenschaften (Dr. rer. nat.).
Chair: Univ.-Prof. Dr. Helmut Seidl
Examiners of the dissertation:
1. Univ.-Prof. Dr. Torsten Grust
2. Prof. Dr. Martin L. Kersten, Universiteit van Amsterdam/Netherlands
The dissertation was submitted to the Technische Universität München on May 10, 2006, and accepted by the Fakultät für Informatik on September 28, 2006.
This thesis is also available in print (ISBN 3-89963-440-3).
pathfinder (ˈpɑːθˌfaɪndə) n. a person who makes or finds a way, esp. through unexplored areas or fields of knowledge.
Collins English Dictionary [Makins95]

Abstract
Even after at least a decade of work on XML and semi-structured information, such data is still predominantly processed in main memory, which leads to significant constraints as XML document sizes grow. Mature database technologies, on the other hand, are readily available to handle vast amounts of data easily. The most efficient ones, relational database management systems (RDBMSs), however, are largely restricted to the processing of very regular, table-shaped data. In this thesis, we will unleash the power of relational database technology on the domain of semi-structured data, the world of XML. We will present a purely relational XQuery processor that handles huge amounts of XML data in an efficient and scalable manner.

Our setup is based on a relational tree encoding. The XPath accelerator, also known as pre/post numbering, has become a widely accepted means to efficiently store XML data in a relational system. Yet, it has turned out that there is additional performance to gain. A close look into the encoding itself and the deliberate choice of relational indexes will give intriguing insights into the efficient access to node-based tree encodings.

The backbone of XML query processing, the access to XML document regions in terms of XPath tree navigation, becomes particularly efficient if the database system is equipped with enhanced, tree-aware algorithms. We will devise staircase join, a novel join operator that encapsulates knowledge of the underlying tree encoding to provide efficient support for XPath navigation primitives. The required changes to the DBMS kernel remain remarkably small: staircase join integrates well with existing features of relational systems and comes at the cost of adding a single join operator only. The impact on performance, however, is significant: we observed speedups of several orders of magnitude on large-scale XML instances.
Although existing XQuery processors make use of the resulting XPath performance gain by evaluating XPath navigation steps in the DBMS back-end, they tend to perform other core XQuery operations outside the relational database kernel. Most notably, this affects the FLWOR iteration primitive, XQuery's node construction facilities, and the dynamic type semantics of XQuery. The loop-lifting technique we present in this work deals with these aspects in a purely relational fashion. The outcome is a loop-lifting compiler that translates arbitrary XQuery
expressions into a single relational query plan. Generated plans take particular advantage of the operations that relational databases know how to perform best: relational joins as well as the computation of aggregates.

The MonetDB/XQuery system to which this thesis has contributed provides the experimental proof of the effectiveness of our approach. MonetDB/XQuery is one of the fastest and most scalable XQuery engines available today and handles queries in the multi-gigabyte range in interactive time. The key to this performance lies in the techniques described in this thesis. They form the basis for MonetDB/XQuery's core component: the XQuery compiler Pathfinder.

Contents
Abstract

1 Introduction
  1.1 Database Technology for XML
    1.1.1 Native XML Databases
    1.1.2 Relational Back-Ends
  1.2 Contributions of this Thesis

2 Relational XML Storage
  2.1 XPath Accelerator Encoding
    2.1.1 Pre- and Postorder Ranks
    2.1.2 XPath Axis Conditions
    2.1.3 Illustrating XPath Accelerator: The pre/post Plane
    2.1.4 SQL-Based XPath Evaluation
    2.1.5 Index Support for XPath Accelerator
    2.1.6 Techniques to Reduce the Search Space
    2.1.7 Range Encoding: An Alternative to pre/post
    2.1.8 A Word on Updates
  2.2 XPath on Commodity RDBMSs
    2.2.1 DB2 Runs XPath
    2.2.2 XPath Accelerator on PostgreSQL
  2.3 Related Work
    2.3.1 Fixed-Length Encodings
    2.3.2 Variable-Length Encodings
    2.3.3 Relational Database Support

3 XPath Evaluation with Staircase Join
  3.1 Re-Inspecting XPath Accelerator
    3.1.1 Node Distribution in the pre/post Plane
  3.2 Staircase Join
    3.2.1 Pruning
    3.2.2 Empty Regions in the pre/post Plane
    3.2.3 Partitioning
    3.2.4 A Further Increase of Tree Awareness: Skipping
  3.3 Implementation Considerations
    3.3.1 A Disk-Based Staircase Join Implementation
    3.3.2 Main Memory-Related Adaptions
  3.4 Tree Awareness Beyond Staircase Join
    3.4.1 Loop-Lifting Staircase Join
    3.4.2 Support for Non-Recursive Axes
    3.4.3 Staircase Join Without Staircase Join
    3.4.4 Tree Awareness in Other Domains
  3.5 Related Work
    3.5.1 Path Evaluation on RDBMSs
    3.5.2 Tree Properties in XPath

4 Loop-Lifting: From XPath to XQuery
  4.1 A Relational Algebra for XQuery
    4.1.1 Relational Sequence Encoding
    4.1.2 An Algebra for XQuery
    4.1.3 A Ruleset to Compile XQuery
    4.1.4 Basic XQuery Expressions
    4.1.5 Sequence Construction
  4.2 Relational FLWORs
    4.2.1 for-Bound Variables
    4.2.2 Maintaining loop
    4.2.3 Free Variables in the return Clause
    4.2.4 Mapping Back
    4.2.5 Complete Compilation Rule for FLWOR Expressions
    4.2.6 Optional: The order by Clause
  4.3 Other Expression Types
    4.3.1 Arithmetics/Comparisons
    4.3.2 Conditionals: if-then-else
  4.4 Interfacing with XML/XPath
    4.4.1 XPath Location Steps
    4.4.2 Element Construction
    4.4.3 A Note on Side-Effects
  4.5 Support for Dynamic Type Tests
    4.5.1 XQuery Subtype Semantics
    4.5.2 Sequence Type Matching on Relational Back-Ends
  4.6 XQuery on DB2
    4.6.1 A Loop-Lifted XQuery-to-SQL Translation
    4.6.2 XPath Bundling and Use of OLAP Functionality
    4.6.3 Live Node Sets: Compile-Time Information for Accelerated Query Evaluation
    4.6.4 XMark on DB2
  4.7 Wrap-Up
    4.7.1 Related Research
    4.7.2 Outlook & Perspective

5 The Pathfinder XQuery Compiler
  5.1 Logical Optimizations in Pathfinder
    5.1.1 DAGs for Loop-Lifted Query Plans
    5.1.2 A Peephole-Style Plan Analysis
    5.1.3 Robust XQuery Join Detection
  5.2 The Importance of Order
    5.2.1 Order in Loop-Lifted XQuery
    5.2.2 Order Indifference in XQuery
    5.2.3 A Performance Advantage can be Realized
    5.2.4 Physical Optimization and Order Awareness
  5.3 Cardinality Forecasts for Loop-Lifted Plans
    5.3.1 Statistical Guide
    5.3.2 Cardinality Forecasts
  5.4 MonetDB/XQuery
    5.4.1 System Architecture
    5.4.2 Overall Query Performance
    5.4.3 Order Awareness in Pathfinder
    5.4.4 Scalability with Respect to Data Volumes
    5.4.5 XQuery on High Data Volumes
  5.5 Research in the Neighborhood
    5.5.1 Algebraic Optimization for XQuery
    5.5.2 Order Awareness
    5.5.3 XQuery Cardinality Forecasts
    5.5.4 Further Optimization Hooks

6 Wrap-Up
  6.1 Summary
    6.1.1 Relational Tree Encodings
    6.1.2 XPath Evaluation with Staircase Join
    6.1.3 Loop-Lifting: A Relational Approach to Iteration
    6.1.4 Query Optimization for Loop-Lifted XQuery Plans
    6.1.5 MonetDB/XQuery: The Proof of Our Claim
  6.2 Ongoing and Future Work
    6.2.1 Alternative Back-Ends for Pathfinder
    6.2.2 Further Optimization Hooks
    6.2.3 Exploring New Fields of Knowledge

Acknowledgments

1 Introduction
Since 1998, when the W3 Consortium published its first XML Recommendation, the Extensible Markup Language has become a standard means for data representation, storage, and interchange. The XML format has proven to be versatile enough to describe virtually any kind of information, ranging from a couple of bytes in Web Service messages to gigabyte-sized data collections (e.g., [Ley, PIR]).

The massive amount of data available in the XML format raises an increasing demand to store, process, and query these data in an effective manner. The W3 Consortium has long since realized this demand and since 1999, the XML Query Working Group has been developing a standard query language for XML: XQuery [Boag05]. Its proposals have matured over the past years and the first official W3C XQuery Recommendation is expected to be released soon. XQuery is a functional and strongly typed language built around its (older) sublanguage XPath. Other core functionalities include the FLWOR (pronounced "flower") looping primitive, conditionals, and a means to construct transient XML nodes during query processing, turning XQuery into a Turing-complete language [Kepser04].

Current XML processors mainly implement XPath, XSLT, and XQuery processing in main memory. Increasing amounts of XML data, however, render this approach infeasible. Data manipulation (updates), transaction management, as well as security issues raise additional questions. All these concerns suggested early on the use of database technology to process XML data.
1.1 Database Technology for XML
Database systems, e.g. for relational data, employ techniques that provide scalability into the terabyte data range and beyond. Applied to the XML domain, these techniques promise a similar scalability for XML query processing. There are a number of key aspects that lead to the scalability of modern database management systems:
(i) Suitable storage model. Data volumes that exceed the limits of main memory require appropriate storage structures that easily extend to secondary storage devices (e.g., hard drives). An example is the avoidance of physical addresses for data references in favor of logical identifiers.
(ii) Suitable algorithms. Database algorithms are designed for operation on secondary storage. The adherence to specific data access patterns allows for intelligent caching and storage management.
(iii) Declarative query formulation. The use of a declarative (intermediate) query language decouples a user's request from specific physical implementations. This allows for the interchangeability of physical operators, the construction of access and index structures, and the rewriting of query plans without affecting the user interface or the query outcome. The latter is facilitated by an algebraic intermediate query representation with mathematically sound rewrite rules. Experiences on relational database systems show that it is particularly beneficial to express queries in a bulk-oriented (rather than tuple-oriented) fashion. This holistic view allows operators to be implemented for highly cache-friendly behavior.
(iv) Query optimization. The choice and execution order of database operators can have a significant impact on query performance. Database systems, hence, employ sophisticated query optimizers that rewrite queries according to rules, heuristics, and physical cost models. An appropriate intermediate query representation facilitates optimization, too.
1.1.1 Native XML Databases

The approaches to bring database technology into the XML domain turn out to be quite diverse. An obvious approach is to build up a database system from scratch that employs XML trees as its intrinsic data type. Such XML database systems are usually referred to as native XML databases.

A representative of a native XML database is the Natix system, developed by Fiebig et al. [Fiebig02]. Natix implements a novel storage engine to account for the XML tree structure. In a nutshell, documents are semantically split based on their tree structure. Tree fragments are then stored on disk pages of fixed size. The split algorithm may be tuned using a split matrix to accommodate specific application needs. Full-text and the structural XASR (Extended Access Support Relation) indexes may be created to further accelerate tree access. Natix offers a bulk-oriented intermediate query representation in terms of the Natix Physical Algebra (NPA), which operates on sequences of tuples. XML tree operations as well as XQuery's FLWOR iteration primitive are handled as nesting and unnesting operations in NPA.

The Timber system by Jagadish et al. [Jagadish02] takes a slightly different approach to an algebraic query description. TAX, a tree algebra for XML [Jagadish01], defines manipulations on multi-sets of trees. To accommodate the heterogeneity in XML trees, TAX introduces the notion of pattern trees. As an annotation to each tree set, a pattern tree describes the commonality of all the trees in the set. The common features may then be used to efficiently represent all items as homogeneous tuples. TAX pushes declarative query description even a step further: the access to XML tree nodes is described in terms of matching a pattern tree against an XML tree, which might be appropriately supported by, e.g., structural join algorithms [Bruno02].
A concise mapping from arbitrary XPath expressions to the corresponding pattern trees has never been formally published, though.

Only recently, IBM presented the Viper system [Nicola05], which will form the XML back-end for upcoming releases of the DB2 Universal Database system. Similar to Natix, Viper splits up the XML document tree and distributes tree fragments over database pages of fixed size.
1.1.2 Relational Back-Ends

Building up a completely new database engine for XML from scratch almost seems like a re-invention of the wheel. The past decades of research have turned relational databases into highly efficient data processors that readily provide effective optimizers, fault-tolerant storage techniques, and concurrency control. The relational data model has proven its versatility to process virtually any kind of information.

With this insight, an avalanche of papers has been published that describe relational storage techniques for XML data. In their survey paper, Krishnamurthy
et al. [Krishnamurthy03] classify these proposals into schema-based and schema-oblivious mapping schemes.

Figure 1.1: Schema-based mapping of XML documents to relational tables.
Schema-Based Storage
The idea of schema-based XML storage is illustrated in Figure 1.1: an input XML document is mapped to a relational table whose schema reflects the semantic content of the document. The schema is typically derived from a DTD or XML Schema specification (e.g., [Shanmugasundaram99]), or from the specific data instance [Deutsch99]. These schema-based approaches are closely related to XML publishing, where relational data is serialized into an XML format and then queried using, e.g., XQuery. The bidirectional mapping between XML and the relational model becomes particularly important in the context of information integration and eases the creation of value-based indexes.

However, intrinsic shortcomings of schema-based XML storage approaches rule out their application for a versatile and standards-compliant XQuery engine. Document order, an inherent concept in the XML data model, is only recoverable if multiple attributes are added to the resulting table schema. XPath navigation along recursive axes or with wildcard name tests leads to expensive multi-way unions and recursion, if it is possible at all. The implementation of XQuery expressions beyond XPath (FLWOR clauses, sequence order, or transient node construction) raises further questions. Finally, this approach depends on a strong regularity and schema conformance of the XML documents to store.
Schema-Oblivious Storage

An alternative approach is schema-oblivious XML storage on relational back-ends. This approach stores the semantic properties of each XML tree node in a relational table of fixed schema, independently of any XML schema information. One or more attributes in the schema are reserved for the structural information of the XML tree.

A large number of proposals have been published to store XML documents in a schema-oblivious fashion. The ones that provide acceptable performance typically use numbering schemes to encode the XML tree structure and can be classified into one of two groups:
(i) Dewey-based schemes assign to each node a vector that represents its path from the document root (e.g., ORDPATH [O’Neil04]), whereas
(ii) pre/post-based numberings use each node's ranks within a pre- and postorder tree walk to encode the structural information (e.g., XPath accelerator [Grust02]).
We will elaborate on both approaches in the course of this thesis. It turns out that the simplicity of the relational data model and the effectiveness of modern database technology easily compensate for the seemingly large mapping overhead and turn relational databases into highly efficient XML processors.
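To make the contrast between the two classes concrete, here is a small illustrative sketch (toy labels only, not the actual ORDPATH or XPath accelerator encodings): it assigns both a Dewey-style path vector and pre/post ranks to each node of a tiny tree, and shows the ancestor test each class supports, namely a prefix test for Dewey labels and a range test for pre/post ranks.

```python
def label(tree):
    """Assign Dewey vectors and pre/post ranks in one recursive walk.
    `tree` is (tag, [children]); returns {tag: (dewey, pre, post)}."""
    labels = {}
    state = {"pre": 0, "post": 0}

    def walk(node, dewey):
        tag, children = node
        pre = state["pre"]; state["pre"] += 1          # rank on first visit
        for i, child in enumerate(children):
            walk(child, dewey + (i,))                  # extend the root path
        post = state["post"]; state["post"] += 1       # rank after the subtree
        labels[tag] = (dewey, pre, post)

    walk(tree, ())
    return labels

doc = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
lab = label(doc)

def is_ancestor_dewey(u, v):
    # u is an ancestor of v iff u's vector is a proper prefix of v's
    return len(u) < len(v) and v[:len(u)] == u

def is_ancestor_prepost(u_pre, u_post, v_pre, v_post):
    # u is an ancestor of v iff v lies inside u's pre/post range
    return u_pre < v_pre and v_post < u_post

a_dewey, a_pre, a_post = lab["a"]
d_dewey, d_pre, d_post = lab["d"]
assert is_ancestor_dewey(a_dewey, d_dewey)
assert is_ancestor_prepost(a_pre, a_post, d_pre, d_post)
```

Note how the Dewey test inspects label contents (prefix comparison on vectors of varying length), while the pre/post test is a pair of integer comparisons, which is what makes the latter directly amenable to relational range predicates.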
Relational XQuery

Schema-oblivious mapping techniques thus provide the promising storage solutions we demanded for database-backed XQuery evaluation in our list of key aspects on page 2. Existing work, however, cannot provide convincing solutions for the remaining three items in our list:
(ii) Suitable algorithms. Conventional relational database kernels cannot fully exploit the information on the XML tree structure encoded in relational tables and behave suboptimally if we demand close adherence to the XPath semantics.
(iii) Declarative query formulation. Though it is not uncommon to keep XML data in a relational back-end, existing XQuery processors tend to perform core XQuery functionalities using imperative programming languages outside the database kernel. If we express incoming XQuery expressions in a relational algebra dialect as a whole, the entire query may be shipped to the back-end in one go and take full advantage of the RDBMS's bulk processing capabilities.
(iv) Query optimization. The use of a relational back-end makes relational query optimization techniques available for XQuery processing. Nevertheless, we can expect further improvements if we consider specific properties of plans that originate from XQuery input.
1.2 Contributions of this Thesis
This thesis focuses on relational XQuery evaluation. We will derive a number of techniques that bridge the apparent gap between the relational model (based on sets of tuples) and the XQuery language with sequences of items as its underlying data model. Together, these techniques assemble into the purely relational XQuery processing stack shown in Figure 1.2.

Figure 1.2: XQuery processing stack—XQuery Compiler (loop-lifting), XPath Axes (Staircase Join), Tree Encoding (XPath Accelerator), RDBMS.

The thesis starts off with a refresher on the XPath accelerator numbering scheme [Grust02] in Chapter 2. The discussion will identify a number of interesting properties of this encoding that may aid efficient query evaluation on existing relational back-ends. A thorough experimental section assesses how many of these properties off-the-shelf RDBMS implementations (IBM DB2 and PostgreSQL, for that matter) can already grasp for efficient XPath processing. Parts of these results have already been published in a follow-up paper to the original XPath accelerator publication:
[Grust04e] Torsten Grust, Maurice van Keulen, and Jens Teubner. Accelerating XPath Evaluation in any RDBMS. ACM Transactions on Database Systems (TODS), 29(1), pages 91–131, March 2004.
In terms of their advanced index structures (namely B-tree or R-tree indexes), existing systems can benefit from the proposed tree encoding to a remarkable extent. However, the statistical information collected by conventional RDBMSs cannot capture specifics in the generated data distribution that stem from the fact that the relational tables actually encode a tree; operators are unable to exploit these specifics for optimized query processing.

This takes us to the development of a novel relational join operator, the staircase join, in Chapter 3. Staircase join encapsulates full tree awareness for XML trees encoded in relational tables. The algorithm can be plugged into a relational database kernel just like any other join operator, requiring only local modifications to the kernel. Query rewrite techniques, such as selection pushdown, are still available with staircase join. The original algorithm has been published in:
[Grust03c] Torsten Grust, Maurice van Keulen, and Jens Teubner. Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps. In Proc. of the 29th Int’l Conference on Very Large Databases (VLDB), pages 524–535. Berlin, Germany, September 2003.
[Grust03b] Torsten Grust, Maurice van Keulen, and Jens Teubner. Bridging the Gap Between Relational and Native XML Storage with Staircase Join. In Proc. of the 15th GI Workshop on Foundations of Database Systems, pages 85–89. Tangermünde, Germany, June 2003.
To support the claim that staircase join can speed up XPath processing in any RDBMS, we incorporated the algorithm into the open-source database PostgreSQL, as published in:
[Mayer04b] Sabine Mayer, Torsten Grust, Maurice van Keulen, and Jens Teubner. An Injection with Tree Awareness: Adding Staircase Join to PostgreSQL. In Proc. of the 30th Int’l Conference on Very Large Databases (VLDB), pages 1305–1308. Toronto, Canada, September 2004.
Recently, we have extended staircase join to also handle the XPath evaluation of multiple context sequences in a single run. This loop-lifted staircase join forms the basis for a full-scale XQuery implementation:
[Boncz05b] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. Loop-Lifted Staircase Join: From XPath to XQuery. Technical Report INS-E0510, CWI, Amsterdam, March 2005.
While the application of staircase join leads to a highly efficient XPath step evaluation, the remaining features of the XQuery language are often left to a standard programming language outside the DBMS kernel. Most notable in this respect are XQuery's FLWOR iteration primitive and transient node construction. In Chapter 4, we will push relational XML processing one step further and extend our processing stack to full XQuery compliance. Our approach remains purely relational: our loop-lifting compilation procedure turns arbitrary XQuery expressions into purely relational plans, efficiently executable on, e.g., SQL hosts. An overview of the compilation procedure has been published as:
[Grust04c] Torsten Grust, Sherif Sakr, and Jens Teubner. XQuery on SQL Hosts. In Proc. of the 30th Int’l Conference on Very Large Databases (VLDB), pages 252–263. Toronto, Canada, September 2004.
[Grust04d] Torsten Grust and Jens Teubner. Relational Algebra: Mother Tongue—XQuery: Fluent. In Proc. of the 1st Twente Data Management Workshop (TDM), pages 7–14. Enschede, The Netherlands, June 2004.
To assess the viability of our approach in practice, the open-source implementation Pathfinder accompanies this thesis. Pathfinder implements the full processing stack and compiles XQuery expressions into code for the relational back-end MonetDB [Boncz02]. The compiler is part of the MonetDB/XQuery system, one of the fastest and most scalable XQuery engines available today. The software is available via http://www.pathfinder-xquery.org/; two publications give an overview of the system setup:
[Boncz05c] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. Pathfinder: XQuery—The Relational Way. In Proc. of the 31st Int’l Conference on Very Large Databases (VLDB), pages 1322–1325. Trondheim, Norway, September 2005.
[Boncz05a] Peter Boncz, Torsten Grust, Stefan Manegold, Jan Rittinger, and Jens Teubner. Pathfinder: Relational XQuery over Multi-Gigabyte XML Inputs in Interactive Time. Technical Report INS-E0503, CWI, Amsterdam, March 2005.
By exploiting memory mapping facilities in modern operating systems, we extended the MonetDB/XQuery implementation into an XQuery system with full transaction and update support. We demonstrated its practicability as well as the employed peephole optimization strategy in:
[Boncz06a] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Sjoerd Mullender, Jan Rittinger, and Jens Teubner. MonetDB/XQuery—Consistent & Efficient Updates on the Pre/Post Plane. In Proc. of the 10th Int'l Conference on Extending Database Technology (EDBT). Munich, Germany, March 2006.
[Boncz06b] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine. In Proc. of the 2006 SIGMOD Int'l Conference on Management of Data. Chicago, IL, USA, June 2006.
The peephole optimization strategy as well as other aspects of Pathfinder's query optimizer will be our topic for Chapter 5, which we will conclude with an experimental assessment of the MonetDB/XQuery system. Chapter 6 finally wraps up.

2 Relational XML Storage
An appropriate storage model for its underlying data lies at the very heart of any database implementation. As this thesis focuses on XML processing on relational back-ends, whose table-shaped data model seems contrary to the tree-based XML model, our challenge is to find a proper translation that allows for the lossless storage of XML content in relational tables, while enabling the RDBMS to efficiently evaluate XPath location steps.

Our findings are based on a schema-oblivious tree encoding. The XPath accelerator described by Grust [Grust02] maps each node in the XML tree structure to a binary tuple ⟨pre, post⟩. After a short review of the encoding in Section 2.1, we will elaborate on some of its specifics that will aid the efficient processing of XPath. A thorough experimental assessment in Section 2.2 highlights strengths and weaknesses of the evaluation strategies employed by existing RDBMS implementations with respect to encoded tree data. We will close this chapter in Section 2.3 with a brief summary of work related to the approach pursued here.
2.1 XPath Accelerator Encoding
While the W3 Consortium specifies XML in terms of the lexical structure of XML files, the file format actually describes a serialized representation of ordered, unranked trees. Hence, a proper relational encoding of this underlying data model forms the starting point for any relational XQuery processor. Primarily, such an encoding must allow for the efficient execution of XQuery's document access
facilities, namely its sublanguage XPath.

Our work is based on the XPath accelerator encoding proposed by Grust [Grust02]. The encoding keeps information on the structural component of the XML document in a pair of integer values and supports the efficient evaluation of all 12 XPath axes—and thus XQuery's full axis feature—starting from arbitrary context nodes.

    pre  post  node
     0     9    a
     1     3    b
     2     2    c
     3     0    d
     4     1    e
     5     8    f
     6     4    g
     7     7    h
     8     5    i
     9     6    j

Figure 2.1: Tree walk to determine preorder ranks pre(v) and postorder ranks post(v) for an example document of ten nodes.
2.1.1 Pre- and Postorder Ranks

XPath accelerator annotates all nodes in the XML structure according to their occurrence in a pre- and postorder traversal of the tree. For any tree node v, we record its pre-/postorder rank in the pair ⟨pre(v), post(v)⟩. The encoding is efficient to generate: both values can easily be derived in a single sequential document read, e.g., using a SAX [SAX] parser. Figure 2.1 illustrates this technique for a small example tree.

Once the values pre(v) and post(v) have been determined, any XPath axis can directly be mapped to a range condition on these values, where any tree node may serve as the step's context node. For example, for the descendant axis we have that

    v′ ∈ v/descendant  ⇔  pre(v) < pre(v′) ∧ post(v′) < post(v) .    (2.1)

An intuitive explanation for this condition can be derived from the serialized XML representation. The pre- and postorder ranks pre(v) and post(v) describe the order in which element start and end tags are encountered in a sequential document read, respectively. So Condition 2.1 may be read as: v′ is a descendant of v if its start tag is read after the start tag of v and its end tag before the end tag of v.
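As a small illustration (not part of the thesis; Python's standard pull parser stands in for the SAX parser mentioned above), the following sketch derives all ⟨pre(v), post(v)⟩ pairs for the example document of Figure 2.1 in a single sequential read and then checks Condition 2.1:

```python
import io
import xml.etree.ElementTree as ET

# Serialized form of the ten-node example tree of Figure 2.1
doc = "<a><b><c><d/><e/></c></b><f><g/><h><i/><j/></h></f></a>"

pre = post = 0
ranks = {}                          # tag -> [pre, post]; tag names are unique here
for event, elem in ET.iterparse(io.StringIO(doc), events=("start", "end")):
    if event == "start":            # a start tag assigns the preorder rank
        ranks[elem.tag] = [pre, None]
        pre += 1
    else:                           # the matching end tag assigns the postorder rank
        ranks[elem.tag][1] = post
        post += 1

def in_descendant(v, w):            # Condition 2.1: w ∈ v/descendant
    return ranks[v][0] < ranks[w][0] and ranks[w][1] < ranks[v][1]

assert ranks["a"] == [0, 9] and ranks["f"] == [5, 8]
assert in_descendant("f", "g") and not in_descendant("f", "b")
```

The single pass mirrors the intuitive reading above: preorder ranks count start tags, postorder ranks count end tags.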
Axis α               Axis Predicate axis(α, v, v′)
ancestor             pre(v′) < pre(v) ∧ post(v′) > post(v) ∧ kind(v′) ≠ attr
ancestor-or-self     pre(v′) ≤ pre(v) ∧ post(v′) ≥ post(v) ∧ kind(v′) ≠ attr
child                par(v′) = pre(v) ∧ kind(v′) ≠ attr
descendant           pre(v′) > pre(v) ∧ post(v′) < post(v) ∧ kind(v′) ≠ attr
descendant-or-self   pre(v′) ≥ pre(v) ∧ post(v′) ≤ post(v) ∧ kind(v′) ≠ attr
following            pre(v′) > pre(v) ∧ post(v′) > post(v) ∧ kind(v′) ≠ attr
following-sibling    pre(v′) > pre(v) ∧ par(v′) = par(v) ∧ kind(v′) ≠ attr
preceding            pre(v′) < pre(v) ∧ post(v′) < post(v) ∧ kind(v′) ≠ attr
preceding-sibling    pre(v′) < pre(v) ∧ par(v′) = par(v) ∧ kind(v′) ≠ attr
parent               pre(v′) = par(v) ∧ kind(v′) ≠ attr
self                 pre(v′) = pre(v) ∧ kind(v′) ≠ attr
attribute            par(v′) = pre(v) ∧ kind(v′) = attr

Table 2.1: XPath axes α and their associated axis predicate axis(α, v, v′) (context node v).
Observe that the preorder rank pre(v) trivially implements XQuery's node identity operator is. As it coincides with the XML document order, we will also use it to express the XQuery operators that test for document order (<< and >>).
2.1.2 XPath Axis Conditions
If stored in an RDBMS table together with each node's semantic content (e.g., its tag name), as suggested in Figure 2.1, the values pre(v) and post(v) constitute a slim relational encoding of XML document trees. Grust [Grust02, Grust04e] maintains semantic node content in the two columns kind(v) and prop(v), the former encoding each node's kind as one of the values elem, attr, text, comm, doc, or pi. The latter contains the tag name for element or attribute nodes and the textual node content for the remaining node kinds. For an efficient and easy characterization of the non-recursive XPath axes, Grust adds a fifth column par(v) to hold the preorder rank of each node's parent.

In this five-column schema ⟨pre, post, par, kind, prop⟩, the predicate axis(α, v, v′) characterizes the result set v′ of XPath axis α, as seen from context node v; axis(α, v, v′) is lined up for the 12 XPath axes in Table 2.1. The conjunctive predicate test(ν, v′) accounts for a node test ν in a step α::ν (Table 2.2). Both predicates are efficiently implementable, e.g., on SQL hosts.
Node test ν      Predicate test(ν, v′)
node()           true
text()           kind(v′) = text
comment()        kind(v′) = comm
*                kind(v′) ∈ {elem, attr}
n (name test)    kind(v′) ∈ {elem, attr} ∧ prop(v′) = n
Table 2.2: Predicate test(ν, v′) to evaluate node test ν in XPath location step α::ν (selection).
2.1.3 Illustrating XPath Accelerator: The pre/post Plane
In the following, we will use a more illustrative representation of the ranks pre(v) and post(v). In Figure 2.2(a), we have used both values as coordinates to map the nodes of our example document into the two-dimensional pre/post plane. The dotted lines indicate that the numbering scheme actually encodes the full tree structure. The adjacent Figure 2.2(b) replicates that tree structure for comparison. In our new representation, the pre/post plane, axis predicates axis(α, v, v′) correspond to region conditions. The shaded regions in Figure 2.2 illustrate the axis conditions axis(α, v, v′) for the four XPath axes ancestor, descendant, following, and preceding, as seen from context node f. These four axes partition the entire pre/post plane into four disjoint regions: nodes in the top-left, top-right, bottom-left, and bottom-right quadrants (as seen from f) constitute f's ancestor, following, preceding, and descendant nodes, respectively. This partitioning property holds for any context node in the document as a direct consequence of the axes' definition in XPath. For any context node v, the set
    v/ancestor ∪· v/descendant ∪· v/following ∪· v/preceding ∪· {v}

contains each document node exactly once. From now on, we will thus summarize these axes under the term major axes.
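The partitioning claim is easy to check mechanically. The sketch below (our own illustration on a hypothetical six-node tree) verifies for every context node that the four major-axis regions plus the node itself cover the document exactly once:

```python
def pre_post(children, root=0):
    """pre/post ranks for a tree given as {node: [children]} in document order."""
    pre, post, cnt = {}, {}, {"pre": 0, "post": 0}
    def walk(n):
        pre[n] = cnt["pre"]; cnt["pre"] += 1
        for c in children.get(n, []):
            walk(c)
        post[n] = cnt["post"]; cnt["post"] += 1
    walk(root)
    return pre, post

children = {0: [1, 4], 1: [2, 3], 4: [5]}   # hypothetical tree shape
pre, post = pre_post(children)
nodes = sorted(pre)

def region(axis_name, v, w):
    """Major-axis region conditions in the pre/post plane."""
    return {"ancestor":   pre[w] < pre[v] and post[w] > post[v],
            "descendant": pre[w] > pre[v] and post[w] < post[v],
            "following":  pre[w] > pre[v] and post[w] > post[v],
            "preceding":  pre[w] < pre[v] and post[w] < post[v]}[axis_name]

for v in nodes:
    covered = [v] + [w for a in ("ancestor", "descendant", "following", "preceding")
                     for w in nodes if region(a, v, w)]
    assert sorted(covered) == nodes   # each document node exactly once
print("partition property holds")
```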
2.1.4 SQL-Based XPath Evaluation
Based on the two predicates axis(α, v, v′) and test(ν, v′), XPath location paths may straightforwardly be translated into query specifications for a relational database back-end. The query template sql(e/α::ν) defines such a translation to SQL in a fully compositional manner:

    sql(e/α::ν) ≡ SELECT DISTINCT v′.*
                  FROM sql(e) AS v, doc AS v′
                  WHERE axis(α, v, v′)                        (2.2)
                    AND test(ν, v′)
                  ORDER BY v′.pre .

Figure 2.2: The pre/post plane (left) illustrates XPath axis conditions for the four major XPath axes ancestor, descendant, following, and preceding, as seen from node f. Corresponding tree regions are shown on the right.
sql(e/α::ν) selects the tuples, i.e., nodes, in doc that are reachable from context set e via location step α::ν. Steps may start at any context set e, represented as the subset sql(e) of the document relation doc. Specifically, sql(e) may be the result of a previous location step or, for absolute path expressions (starting with /), the document root, i.e., the tuple with pre(v) = 0. The clauses DISTINCT and ORDER BY ensure XPath's semantics of a duplicate-free result, sorted in document order.

This XPath-to-SQL compilation scheme leads to queries of nesting depth k for paths of length k, which can straightforwardly be unnested. The XPath expression /descendant::city/following-sibling::zipcode, for instance, can be mapped to the SQL expression listed in Figure 2.3(b). Effectively, this translation turns XPath expressions of k location steps into a k-fold self-join of the document relation doc, where axis conditions serve as join predicates over pre and post ranges. A possible evaluation strategy could be the query plan shown in Figure 2.3(a).

    SELECT DISTINCT z.*
    FROM doc AS d, doc AS c, doc AS z
    WHERE d.pre = 0
      AND c.pre > d.pre AND c.post < d.post
      AND c.kind = elem AND c.prop = 'city'
      AND z.pre > c.pre AND z.par = c.par
      AND z.kind = elem AND z.prop = 'zipcode'
    ORDER BY z.pre

(a) Relational plan (plan tree not reproduced). (b) SQL equivalent obtained using sql(e/α::ν).

Figure 2.3: Query template sql(e/α::ν) translates k-step XPath expressions into a k-fold self-join of relation doc, with join predicates corresponding to the respective step predicates (query /descendant::city/following-sibling::zipcode).
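The unnested translation is directly executable on any SQL host. The following sketch (our own illustration; the four-node document and its encoding are hypothetical, and kind values are stored as strings) runs the query of Figure 2.3(b) in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE doc (pre INT, post INT, par INT, kind TEXT, prop TEXT)")

# <root><a><city/><zipcode/></a></root>, pre/post assigned by one traversal
rows = [
    (0, 3, None, "elem", "root"),
    (1, 2, 0,    "elem", "a"),
    (2, 0, 1,    "elem", "city"),
    (3, 1, 1,    "elem", "zipcode"),
]
con.executemany("INSERT INTO doc VALUES (?,?,?,?,?)", rows)

result = con.execute("""
    SELECT DISTINCT z.pre
    FROM doc AS d, doc AS c, doc AS z
    WHERE d.pre = 0
      AND c.pre > d.pre AND c.post < d.post          -- /descendant::city
      AND c.kind = 'elem' AND c.prop = 'city'
      AND z.pre > c.pre  AND z.par = c.par           -- /following-sibling::zipcode
      AND z.kind = 'elem' AND z.prop = 'zipcode'
    ORDER BY z.pre""").fetchall()
print(result)   # prints [(3,)]
```

The zipcode element (pre = 3) is found as the following sibling of the city element, exactly as the axis predicates of Table 2.1 prescribe.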
2.1.5 Index Support for XPath Accelerator
The self-joins employed in the relational XPath evaluation plans of the previous section describe axis conditions in terms of region conditions on the two-dimensional pre/post plane. Multi-dimensional indexing techniques may, hence, provide efficient support to evaluate location steps over encoded tree data. Among the well-known index structures for multi-dimensional data (e.g., Grid Files [Nievergelt84], Quad Trees [Finkel74]), the value distribution of pre/post-encoded XML data suggests the use of R-trees [Guttman84]. They are known to adapt well to non-uniform point distributions in a low-dimensional space. Figure 2.4 illustrates this distribution for an example document of 2²⁰ nodes. The effectiveness of R-tree indexing for relational XPath evaluation has, in fact, been confirmed by Grust's experimental studies in [Grust02]. R-trees, however, have hardly found their way into mainstream RDBMS products, where B-trees are still the predominant means to index data. A workaround is the use of concatenated B-trees. But as Grust points out, B-trees may be used to scan the pre/post plane along only one direction at a time. Regardless of the choice of this direction (pre or post), the system is doomed to encounter a large number of false hits during the scan.
Figure 2.4: Node distribution in the pre/post plane for an example document of 2²⁰ nodes, with original and shrunken scan ranges for a descendant query rooted at v.
2.1.6 Techniques to Reduce the Search Space

Regardless of the choice of an indexing technique, the determining cost factor for query evaluation in the pre/post plane is the size of the query window implied by the axis join predicate axis(α, v, v′). In Figure 2.4, this query window covers almost 1/4 of the plane, though only few nodes qualify as descendants of v; the database scans a large amount of unoccupied space. A node distribution as in Figure 2.4 is typical for XML documents of realistic size: for larger documents, tree nodes accumulate along a diagonal line; only few nodes are located in the upper-left region, while the bottom-right region is entirely empty. It is promising to exploit this observation and provide the database system with tighter bounds for XPath query regions and, hence, reduce the involved search space for each step. A fundamental correlation to derive such bounds has already been observed by Grust [Grust02]. For any tree node v, its pre- and postorder ranks pre(v) and post(v) relate according to
    pre(v) − post(v) = level(v) − size(v) ,    with level(v) ≤ height(t) ,    (2.3)

where level(v) describes v's distance from the tree root, size(v) is the number of nodes in v's descendant region, and height(t) is the length of the longest root-to-leaf path in the tree. This correlation is a direct consequence of the fact that the involved properties represent the relational encoding of a tree. For a concise derivation of the equation we refer the reader to, e.g., [Rode03]. In case we do not have a node's level information at hand, the overall tree height height(t) may serve as a reasonable over-estimation for level(v). In practice, we have found height(t) ≲ 15 even for large XML instances. Hence, the estimation error is negligible in comparison to the total document size.
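Equation 2.3 is easy to validate mechanically. The following sketch (our own check; the random tree shapes are arbitrary) computes all four node properties in one traversal and asserts the identity for every node:

```python
import random

def annotate(children, root=0):
    """pre, post, level, and size for a tree given as {node: [children]}."""
    pre, post, level, size = {}, {}, {}, {}
    cnt = {"pre": 0, "post": 0}
    def walk(n, lvl):
        pre[n] = cnt["pre"]; cnt["pre"] += 1
        level[n], size[n] = lvl, 0
        for c in children.get(n, []):
            walk(c, lvl + 1)
            size[n] += size[c] + 1
        post[n] = cnt["post"]; cnt["post"] += 1
    walk(root, 0)
    return pre, post, level, size

random.seed(1)
for _ in range(20):
    n = random.randint(1, 50)
    children = {i: [] for i in range(n)}
    for i in range(1, n):                       # random parent among earlier nodes
        children[random.randrange(i)].append(i)
    pre, post, level, size = annotate(children)
    for v in range(n):
        assert pre[v] - post[v] == level[v] - size[v]   # Equation 2.3
print("Equation 2.3 holds on all sampled trees")
```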
Shrink-Wrapping the descendant Axis

Correlation 2.3 forms the basis for the so-called shrink-wrapping technique described by Grust [Grust02], an optimization that may significantly reduce the search space for the descendant axis. Nodes in the descendant region of any node v are consecutively assigned preorder ranks after v itself; hence,
    v′ ∈ v/descendant ⇒ pre(v′) ≤ pre(v) + size(v) .    (2.4)
With a similar reasoning on postorder ranks and with Equation 2.3, this inspires the introduction of additional shrink-wrapping constraints for the axis predicate axis(descendant, v, v′):
    pre(v′) ≤ post(v) + height(t)    and    (2.5a)
    post(v′) ≥ pre(v) − height(t) .         (2.5b)
As illustrated in our example in Figure 2.4, this may significantly reduce the size of the associated query window. The shrink-wrapping conditions overestimate the actual descendant area by at most height(t) (which we found small even for large XML instances). Moreover, this reduction makes the size of the query window independent of the total document size. We will observe an increase in performance of several orders of magnitude as the result of shrink-wrapping in the experimental section of this chapter.
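The effect of the extra constraints is easy to observe on encoded data. In this sketch (our own illustration on a hypothetical bushy tree), the shrink-wrapped pre range of Equation 2.5a never loses a descendant, yet for a node with an empty subtree it shrinks the naive scan to the end of the plane down to nothing:

```python
def annotate(children, root=0):
    """pre, post, and level for a tree given as {node: [children]}."""
    pre, post, level = {}, {}, {}
    cnt = {"pre": 0, "post": 0}
    def walk(n, lvl):
        pre[n] = cnt["pre"]; cnt["pre"] += 1
        level[n] = lvl
        for c in children.get(n, []):
            walk(c, lvl + 1)
        post[n] = cnt["post"]; cnt["post"] += 1
    walk(root, 0)
    return pre, post, level

children = {0: list(range(1, 11))}          # root with ten leaf children
pre, post, level = annotate(children)
nodes = list(pre)
height = max(level.values())                # height(t)

for v in nodes:                             # the bound is safe for every node
    desc = [w for w in nodes if pre[w] > pre[v] and post[w] < post[v]]
    naive = [w for w in nodes if pre[w] > pre[v]]            # scan to plane end
    shrunk = [w for w in naive if pre[w] <= post[v] + height]  # Equation 2.5a
    assert set(desc) <= set(shrunk) <= set(naive)

v = 1                                       # first child: no descendants
naive = [w for w in nodes if pre[w] > pre[v]]
shrunk = [w for w in naive if pre[w] <= post[v] + height]
print(len(naive), len(shrunk))              # prints 9 0
```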
Stretching the pre/post Plane

We may avoid the problem of false hits in a B-tree-indexed plane altogether if we apply minor modifications to the document encoding itself. The reason for these false hits is the specification of query windows based on two independent range predicates (on dimensions pre and post). The scan along either dimension (using a B-tree) leads to numerous false hits with regard to the other. An important aspect of the axis predicates in Table 2.1 is that they define pre/post query windows relative to pre(v) and post(v), i.e., independent of their absolute values. We can exploit this observation by slightly modifying the computation of pre(v) and post(v): couple preorder and postorder ranks such that whenever pre is incremented, post is as well, and vice versa.
Figure 2.5: Stretched preorder/postorder rank assignment and resulting pre/post plane. Axes preceding, descendant, and following are sufficiently characterized as ranges along a single dimension only (pre or post).
The resulting stretched pre/post plane is illustrated in Figure 2.5. For any node v′ ∈ v/descendant, we now have that
    pre(v) < pre(v′) < post(v)    as well as    pre(v) < post(v′) < post(v) .
No other nodes in the plane can fulfill this property, since we incremented pre and post monotonically after traversing the subtree of v, i.e., the regions marked by ∅ in Figure 2.5 are necessarily empty. As a consequence, it becomes sufficient to scan the plane along either dimension to evaluate the descendant axis without encountering any false hits. The same considerations also allow for the characterization of the axes preceding and following in terms of a one-dimensional range query only. Note that we have not lost any of the other valuable properties of the pre/post plane:
(i) predicates axis(α, v, v′) continue to work as before for all axes,
(ii) pre(v) still implements document order and uniquely identifies document node v, and
(iii) we can now accurately compute the subtree size below node v:

    size(v) = (post(v) − pre(v) − 1) / 2 .
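The stretched assignment and the properties above can be checked with a few lines of Python (our own sketch on a hypothetical six-node tree): one shared counter drives both ranks, descendant becomes a pure pre range, and property (iii) yields the exact subtree size:

```python
import itertools

def stretch(children, root=0):
    """Coupled pre/post assignment: both ranks drawn from one counter."""
    pre, post, tick = {}, {}, itertools.count()
    def walk(n):
        pre[n] = next(tick)
        for c in children.get(n, []):
            walk(c)
        post[n] = next(tick)
    walk(root)
    return pre, post

def subtree_size(children, n):
    return sum(1 + subtree_size(children, c) for c in children.get(n, []))

children = {0: [1, 4], 1: [2, 3], 4: [5]}   # hypothetical tree shape
pre, post = stretch(children)
nodes = list(pre)                           # insertion order = document order

for v in nodes:
    two_dim = [w for w in nodes if pre[w] > pre[v] and post[w] < post[v]]
    one_dim = [w for w in nodes if pre[v] < pre[w] < post[v]]  # pre range suffices
    assert two_dim == one_dim
    assert subtree_size(children, v) == (post[v] - pre[v] - 1) // 2
print("descendant is a single pre range; subtree sizes are exact")
```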
(a) /descendant::n / ancestor::m. (b) /desc-or-self::m [ descendant::n ].
Figure 2.6: Scan region for an ancestor step taken from node v (left). Rewriting the query into its symmetric equivalent yields a shrink-wrapped query window (now axis descendant) as shown on the right.
As we showed in [Grust04e], the single-dimensional range conditions for the stretched pre/post plane can lead to an additional 10 % performance increase in comparison to the shrink-wrapping approach. At the same time, the stretched plane no longer requires the explicit maintenance of height(t) as an estimate for level(v). For lack of space, we omit a detailed discussion of the stretched pre/post encoding here and refer the reader to [Grust04e] for an in-depth study.
Symmetries in XPath

Though we have seen that shrink-wrapping and pre/post plane stretching may significantly reduce the size of scan regions in the pre/post plane, both optimizations primarily apply to the XPath descendant axis. As Olteanu et al. [Olteanu02] have observed, symmetries between XPath axes may be used to rewrite path expressions into equivalent ones with possibly better execution plans. This turns out to have an unanticipated impact on relational XPath evaluation. Consider the XPath expression below that retrieves all elements named m which contain at least one element with tag name n:
    /descendant::n / ancestor::m .

Taken literally, this expression involves searching all n nodes in the pre/post plane (/descendant::n) and then, for each node v of them, a scan of v's ancestor region to retrieve result nodes with tag name m. The scan region for the latter step is depicted in Figure 2.6(a).
(a) /descendant::n/ancestor::m:

    SELECT DISTINCT vm.*
    FROM doc AS vr, doc AS vn, doc AS vm
    WHERE vr.pre = 0                 -- 1
      AND vn.pre  > vr.pre           -- 2
      AND vn.post < vr.post          -- 3
      AND vn.prop = n                -- 4
      AND vm.pre  < vn.pre           -- 5
      AND vm.post > vn.post          -- 6
      AND vm.prop = m                -- 7
    ORDER BY vm.pre

(b) /desc-or-self::m [ descendant::n ]:

    SELECT DISTINCT vm.*
    FROM doc AS vr, doc AS vm, doc AS vn
    WHERE vr.pre = 0                 -- 1
      AND vm.pre  ≥ vr.pre           -- 2
      AND vm.post ≤ vr.post          -- 3
      AND vm.prop = m                -- 4
      AND vn.pre  > vm.pre           -- 5
      AND vn.post < vm.post          -- 6
      AND vn.prop = n                -- 7
    ORDER BY vm.pre
Figure 2.7: Corresponding SQL code for a pair of symmetric XPath expressions. The two SQL queries may serve as a proof for the equivalence of the two paths.
If, instead, we first rewrote the expression according to [Olteanu02], we could trade the ancestor for an XPath descendant step:
/descendant-or-self::m [ descendant::n ] .
In other words, we search the document for each node v′ with tag m. Then, for each of them, we evaluate descendant::n and reject v′ if the XPath predicate result is empty. Under these circumstances, the query becomes a candidate for applying the shrink-wrapping technique, which leads to a significant scan size reduction. This benefit is clearly recognizable in Figure 2.6(b), which illustrates the resulting query window for the second evaluation step. Remember that shrink-wrapping leads to query window sizes that are independent of the overall document size. The rewrite will be even more effective on larger document instances. While Olteanu et al. argue the correctness of the rewrites based on specific properties of XPath location steps, we may, in fact, prove these equivalences in a purely algebraic fashion on the basis of the XPath accelerator encoding. Figure 2.7 illustrates this with the help of the SQL clauses that represent a symmetric pair of XPath expressions. Conditions 1 and 4–7 are identical for both queries. As for the remaining Conditions 2 and 3, the descendant predicate in Figure 2.7(a) is implied by the transitive combination of Lines 2 and 5, and 3 and 6, in Figure 2.7(b), respectively. In the opposite direction, the fact that the root node tuple vr, by construction, is assigned the table's overall minimum pre and maximum post value trivially satisfies Conditions 2 and 3 in Figure 2.7(b).
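The algebraic argument can also be replayed operationally: running both SQL formulations of Figure 2.7 on randomly generated pre/post tables must always produce identical results. A sketch (our own test harness; tree shapes and tag assignments are random, not taken from the thesis):

```python
import itertools
import random
import sqlite3

def random_doc(seed):
    """Random pre/post-encoded document with tags m, n, x."""
    rnd = random.Random(seed)
    rows, tp, tq = [], itertools.count(), itertools.count()
    def build(depth):
        pre = next(tp)
        for _ in range(0 if depth == 0 else rnd.randint(0, 3)):
            build(depth - 1)
        rows.append((pre, next(tq), rnd.choice("mnx")))
    build(4)
    return rows

QA = """SELECT DISTINCT vm.pre FROM doc vr, doc vn, doc vm
        WHERE vr.pre = 0 AND vn.pre > vr.pre AND vn.post < vr.post
          AND vn.prop = 'n' AND vm.pre < vn.pre AND vm.post > vn.post
          AND vm.prop = 'm' ORDER BY vm.pre"""   # /descendant::n/ancestor::m

QB = """SELECT DISTINCT vm.pre FROM doc vr, doc vm, doc vn
        WHERE vr.pre = 0 AND vm.pre >= vr.pre AND vm.post <= vr.post
          AND vm.prop = 'm' AND vn.pre > vm.pre AND vn.post < vm.post
          AND vn.prop = 'n' ORDER BY vm.pre"""   # /desc-or-self::m[descendant::n]

for seed in range(20):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE doc (pre INT, post INT, prop TEXT)")
    con.executemany("INSERT INTO doc VALUES (?,?,?)", random_doc(seed))
    assert con.execute(QA).fetchall() == con.execute(QB).fetchall()
print("both formulations agree on all sampled documents")
```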
2.1.7 Range Encoding: An Alternative to pre/post

The original XPath accelerator proposal provides a simple, yet efficient tree encoding with the values pre(v) and post(v) as the primary carrier of structural information. The pre/post plane as its visualization offers an intuitive insight into query processing on such encoded tree data. Specific application needs, however, may drive the decision for alternatives to pre/post.

Equation 2.3 relates the four node properties pre(v) (preorder rank of v), post(v) (postorder rank of v), level(v) (v's distance from the tree root), and size(v) (the number of nodes in the subtree below v) in a single equation. As a consequence, we can recover the original pre/post values from any encoding that provides at least three of these four values. In that sense, any such encoding is equivalent to XPath accelerator. In fact, if an encoding involves at least either pre(v) or post(v), one more value suffices to fully encode a tree.

The Pathfinder compiler, whose foundations we describe in this thesis, as well as the compilation procedure that we pursue in Chapter 4, are based on such an alternative to pre/post. The range encoding employs the triple ⟨pre(v), size(v), level(v)⟩ and turns out to exhibit a set of interesting properties:

(i) Implicit search space minimization. On pre/size-encoded data, the XPath descendant axis naturally corresponds to a region scan along a single dimension only:

    axis_pre/size(descendant, v, v′) ≡ pre(v) < pre(v′) ≤ pre(v) + size(v) .

Needless to say, this scan is efficiently implementable using, e.g., a B-tree index scan.

(ii) Efficient construction of transient nodes. The pre/post encoding leads to extensive renumbering costs in case of structural updates of the XML tree (see Section 2.1.8). The values size(v) and level(v) are invariant with respect to subtree copying or moving, which, as we shall see in Chapter 4, can significantly lower the costs for XQuery's element construction operator.
(iii) Enhanced estimation accuracy. The maintenance of an explicit level(v) column turns out to significantly enhance the accuracy of cost estimations on encoded XML documents [Rode03]. This may be a promising leverage point for future enhancements of the Pathfinder XQuery system.
In addition, the Pathfinder implementation abandons column par(v) in its relational storage. As already sketched by Rode [Rode03], an efficient implementation of the non-recursive XPath axes (child, parent) may easily be recovered if the respective algorithm is aware of specific tree properties captured by the values pre(v) and level(v) in range-encoded data. In the experimental section of this chapter, we will see that even off-the-shelf RDBMSs adapt gracefully to the missing parent pointer. A positive side effect of replacing par(v) with level(v) is the reduced storage consumption: while a single byte typically suffices to store level(v), par(v) has to scale with XML document sizes.
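A small sketch of the ⟨pre, size, level⟩ storage (our own illustration on a hypothetical six-node tree) shows both properties at work: descendant as a single pre range, and child recovered without any par column:

```python
def range_encode(children, root=0):
    """<pre, size, level> annotation in one document-order traversal."""
    pre, size, level, tick = {}, {}, {}, [0]
    def walk(n, lvl):
        pre[n] = tick[0]; tick[0] += 1
        level[n], size[n] = lvl, 0
        for c in children.get(n, []):
            walk(c, lvl + 1)
            size[n] += size[c] + 1
    walk(root, 0)
    return pre, size, level

children = {0: [1, 4], 1: [2, 3], 4: [5]}    # hypothetical tree shape
pre, size, level = range_encode(children)
nodes = list(pre)                            # already in document order

def descendant(v):
    # property (i): descendant is one range scan along the pre dimension
    return [w for w in nodes if pre[v] < pre[w] <= pre[v] + size[v]]

def child(v):
    # child without par: descendants on the level directly below v
    return [w for w in descendant(v) if level[w] == level[v] + 1]

print(descendant(0), child(0), child(1))   # prints [1, 2, 3, 4, 5] [1, 4] [2, 3]
```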
2.1.8 A Word on Updates

Tree nodes are consecutively assigned their pre- and postorder ranks during a tree traversal as described in Section 2.1.1. While this can be efficiently implemented for an initial tree load [Grust04e], the insertion of a new node v comes at a high cost: the preorder and postorder ranks of all nodes in v's following and ancestor regions must be adapted accordingly.¹ To delete a node, however, it suffices to simply remove the corresponding tuple from relation doc. High update costs are an inherent problem of numbering schemes that use a fixed-width encoding [Cohen02]. Nevertheless, the Pathfinder system supports efficient structural updates on encoded XML documents with its pointer swizzling technique [Boncz06a].
2.2 XPath on Commodity RDBMSs
To see how off-the-shelf database implementations can cope with pre/post-encoded data, we created several instances of a pre/post table and fed them into two popular RDBMSs: Version 8.2 of IBM's DB2 Universal Database (Enterprise Edition) as one of the big players in the database industry, and PostgreSQL 7.3.3 as a complete open-source implementation of SQL. Both back-ends were installed on a SuSE Linux Enterprise Server 9 system, equipped with 2 × 3.2 GHz Intel Xeon processors and 8 GB RAM. The system was running off four SCSI hard drives (140 GB each, 10,000 rpm), two of which were dedicated to DB2.

The initial nine XML document instances were generated using the XML document generator xmlgen from the XMark benchmark project [Schmidt02]. Their sizes ranged between 113 KB and 1.1 GB (5,256 to 50,844,982 tree nodes, respectively). The benchmark documents model an internet auction site and use the schema sketched in Figure 2.8. xmlgen scales generated documents very regularly, so that result sizes for XPath expressions typically grow linearly with the document size. Before loading them into the database systems, the generated documents were not only pre/post-, but also range-encoded (pre/size/level) for comparison with the storage scheme of the Pathfinder system.

¹ Note that not only the update overhead is causing trouble here. Updating all ancestor nodes of v also imposes a significant locking bottleneck.

Figure 2.8: Element hierarchy of XMark document instances (excerpt).
Indexes Created. The DB2 database system ships with a sophisticated index wizard that suggests a set of indexes based on a typical database query workload. We prepared a workload of different SQL queries that represent a selection of various XPath expressions and set up indexes as suggested by the advisor. The suggestions of the advisor included indexes on the structural component of our storage (pre and post) as well as on the remaining node properties. On the PostgreSQL installation, we created a combined ⟨pre, post, kind, prop⟩ B-tree index for the efficient evaluation of structural constraints. The inclusion of columns kind and prop allows for the evaluation of name tests within index scans. An additional index on ⟨parent, pre, kind, prop⟩ backs the execution of non-recursive axes.
Query Set. Against all XMark instances, we ran a selection of XPath queries each of which stresses a specific aspect of the XPath accelerator encoding:
Q1: /descendant::open_auction/descendant::description
Query    0.11 MB    1.1 MB    11 MB    111 MB    1,118 MB
Q1            12       120    1,200    12,000     120,000
Q2             8        77      631     6,409      64,463
Q3            12       120    1,200    12,000     120,000
Q4            12       125    1,255    12,716     127,315
Q5            60       708    6,182    59,486     597,777

Table 2.3: Number of nodes returned by Queries Q1–Q5 on different XML document instances. Result sizes grow linearly with the document size.
Q2: /descendant::age/ancestor::person
Q3: /descendant::current/preceding::initial
Q4: /descendant::city/following::zipcode
Q5: /descendant::open_auction/child::bidder/child::increase

All queries contain an initial descendant step, whose sole purpose is to provide a context set of reasonable size for the following steps that are of actual interest. Queries Q1 through Q4 use location steps with recursive semantics and hence play to the obvious strengths of the XPath accelerator encoding. Q5, in contrast, examines the support for the non-recursive child axis, probably the most important axis in XPath. All queries return result sets that grow linearly with the XMark instance sizes (cf. Table 2.3). We expressed all five queries in SQL, strictly following the translation rules described in Section 2.1.4. Queries were run multiple times to determine the average execution time afterwards (though variations in execution time were remarkably small).
2.2.1 DB2 Runs XPath

As one of the major commercial database products, DB2 provides a highly scalable relational platform. In this section, we want to assess how well an off-the-shelf DBMS can cope with tree queries on pre/post-encoded data.
Querying the descendant Region
Figure 2.9 visualizes the query execution times observed for Query Q1 on the DB2 installation. The system had no problems processing the query on any XML document size loaded. Apart from that, in all cases, the execution cost is dominated by the second location step. (In additional experiments, we ran the query /descendant::open_auction on all nine XML instances. Execution times for this step grew linearly with the XML document size, but remained negligible compared to the execution times in Figure 2.9.)

Figure 2.9: XPath evaluation performance for Query Q1 (execution time in ms over XML document sizes from 0.11 MB to 1,118 MB). Shrink-wrapping avoids the quadratic increase observed for the original pre/post encoding and leads to a linear scaling. The range encoding (pre/size/level) exhibits a similar behavior.
The initial descendant step in Query Q1 produces a context set that grows linearly with the number of tuples N in the document table. For each of these context nodes, the scan region for the subsequent descendant step also grows linearly with the size of the pre/post plane, which brings in another factor of N. This leads to the quadratic scaling for the original XPath accelerator implementation that can be seen in Figure 2.9.

The shrink-wrapped query regions for the descendant axis overestimate their result set by at most the overall tree height height(t) (Equations 2.5), negligible for large document instances. Therefore, the effort to evaluate the shrink-wrapped variant of Query Q1 solely depends on the axis step result. For documents generated with xmlgen, result sizes grow linearly with the XML document size; hence, we see a linear dependency in Figure 2.9.

For comparison, Figure 2.9 also includes the execution times we observed for the equivalent queries on range-encoded data (cf. Section 2.1.7). In this encoding, the descendant axis can be described as a one-dimensional range only, ideally supported by a B-tree index scan. Moreover, there is no more over-estimation as in the shrink-wrapping case: the range boundaries are now exact. In combination with the computationally less expensive predicate, this leads to a performance improvement of roughly 15 % over the shrink-wrapped pre/post variant.
Figure 2.10: XPath performance for query Q2. The symmetric rewrite of Q2 according to [Olteanu02] reduces its quadratic complexity to a linear scaling for both encoding variants, pre/post as well as pre/size/level.
Exploiting XPath Symmetries to Evaluate ancestor
The XPath performance for the second query of our test set, Q2, is lined up in Figure 2.10. Again, a linearly growing context set in combination with the linear growth of the pre/post scan regions leads to a quadratic complexity of the respective SQL query. This time, the explicit maintenance of subtree sizes in the range encoding is no help either: the original encoding and its variant are almost at par in Figure 2.10.
Query Q2, however, is an instance of the query pattern we saw in Section 2.1.6 that allows for a path rewrite according to the symmetry observation of Olteanu et al.:
    Q2^symm ≡ /descendant-or-self::person [ descendant::age ] .
On our evaluation platform, the reformulated query exhibits a linear behavior for both encoding variants.² Both of them perform equally well, with a slight tendency towards the pre/size/level-based encoding.
DB2 Captures XPath Semantics. Rewriting simple XPath expressions according to [Olteanu02] typically leads to the introduction of predicate expressions,
² Again, the evaluation on the pre/post back-end benefits from shrink-wrapping.
Figure 2.11: The execution plan employed by DB2 to evaluate Query Q2 directly reflects XPath’s existential predicate semantics. Relation doc is accessed with a single record scan to implement the predicate.
which is also the case for query Q2^symm. XPath specifications prescribe existential semantics for these predicates, i.e., systems are free to abort processing for the current context node as soon as they find the first match for the predicate expression.
This is exactly what DB2 does. In the query plan that DB2 uses to evaluate Q2 (see Figure 2.11), we can see how its optimizer seizes the opportunity and executes the search for age nodes as a single record index scan. By that means, DB2 precisely implements XPath's early-out semantics.
Axes preceding and following
Without the application of specific query rewrites, the XPath descendant and ancestor axes (Queries Q1 and Q2, respectively) have shown a quadratic runtime behavior for the examined XMark document instances. We expect the same behavior to apply to the remaining two major XPath axes, preceding and following. Indeed, we see a quadratic scaling for test queries Q3 and Q4 in Figure 2.12; execution times for the 1 GB instance are even worse than quadratic complexity. In the earlier cases, we were able to achieve linear runtime complexity with specific adaptions to the generated SQL queries (shrink-wrapping) as well as to the path expression itself (symmetric rewrites). None of these optimizations, however, achieves the same effect for the preceding and following axes. In Chapter 3, we will discover how the introduction of tree-awareness can achieve a linear scalability for both axes nevertheless.

Both queries reveal an additional aspect crucial for the scalability of any XPath engine: the overhead incurred by XPath's requirement for a duplicate-free, ordered result. While all queries evaluated so far showed a linear scaling not only for the final, but also for intermediate result sizes, the evaluation of Q3 and Q4 produces a high volume of duplicates that the database needs to sort and remove afterwards. For documents generated with xmlgen, the amount of these duplicates grows quadratically for Queries Q3 and Q4, adding up to 7.2 × 10⁹ and 8.1 × 10⁹ tuples on the 1 GB instance. From this document size onwards, we have to pay the toll for the growing sorting overhead: DB2 falls significantly behind quadratic scaling for both queries.

Figure 2.12: Axes preceding and following lead to a quadratic runtime complexity for XML document instances generated with xmlgen. (a) Query Q3 (preceding axis); (b) Query Q4 (following axis).
The Non-Recursive Case: Evaluation of the child Axis

The encoding of the XML document structure per values pre(v) and post(v) has been particularly crafted to accelerate the evaluation of navigation steps with a recursive definition in XPath. Obviously, this should not negatively affect the execution of the non-recursive axes, namely child and parent. For most efficient support of these two axes, Grust makes the parent/child relationship among tree nodes explicit to the RDBMS in terms of the key reference in column par(v) [Grust02]. With an index on par(v), this renders the evaluation of the child axis into one of the operations most efficiently implemented in today's database systems: a relational join over index columns. Query Q5 exploits this operation with two child steps, and it is not surprising to see that the XPath accelerator execution times in Figure 2.13 scale linearly with our document sizes.³
³Remember that the linear scaling factor is due to the linearly growing result sizes, while the execution time of a single child step is actually independent of the XML document size.
[Plot: execution time [ms] vs. XML document size [MB] for the original XPath accelerator (pre/post/par) and the range encoding (pre/size/level).]
Figure 2.13: Execution times observed for Query Q5. The evaluation of axis child via the combination pre/size/level is almost on par with the foreign key reference in the original pre/post/par encoding.
When introducing the range encoding variant (Section 2.1.7), we abandoned the explicit par(v) column in favor of each node's level information, level(v). The result set of a step v/child may then be described as the subset of v/descendant that belongs to the tree level directly below v, level(v) + 1:

v′ ∈ v/child ⇔ v′ ∈ v/descendant ∧ level(v′) = level(v) + 1 .

This characterization of the child axis seems quite expensive at first, given that the size of v/descendant is usually large. Nevertheless, we see only a negligible performance penalty in comparison to an evaluation of the child axis via par(v) (Figure 2.13).
Favorable Index Choice. The favorable child performance is a consequence of the efficient usage of B-trees to access nodes in the pre/size/level space. An index on the concatenation ⟨level, pre⟩ (with the major ordering on column level) simplifies the query for a child step into a single range scan:

v′ ∈ v/child ⇔ ⟨level(v) + 1, pre(v)⟩ < ⟨level(v′), pre(v′)⟩ ≤ ⟨level(v) + 1, pre(v) + size(v)⟩ .
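To make the range condition concrete, the following Python sketch (our own illustration, not code from the thesis; a sorted in-memory list stands in for the B-tree) evaluates a child step with two binary searches over a sorted ⟨level, pre⟩ key list:

```python
from bisect import bisect_right

def children(index, pre_v, size_v, level_v):
    """Evaluate v/child as one range scan over a sorted (level, pre) list.

    index simulates a B-tree on <level, pre>: a list of (level, pre)
    pairs in ascending key order, one entry per document node.
    """
    # <level(v)+1, pre(v)>  <  <level(v'), pre(v')>  <=  <level(v)+1, pre(v)+size(v)>
    lo = bisect_right(index, (level_v + 1, pre_v))
    hi = bisect_right(index, (level_v + 1, pre_v + size_v))
    return [pre for (_, pre) in index[lo:hi]]
```

Since level is the major ordering, all children of v are physically adjacent in the key space, so the slice returns them in document order and without false hits, exactly as the range condition promises.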
[Plan tree: UNIQ over SORT_pre over three NLJOINs (desc::open_auction, child::bidder, child::increase), each with an IXSCAN on doc via the ⟨level, pre⟩ index; the outermost input is an IXSCAN on doc via pre = 0.]
Figure 2.14: Execution plan for Query Q5 with efficient B-tree usage.
Observe that, in this way, we obtain all children of v in pre-order (i.e., document order), without encountering any false hits. The query plan shown in Figure 2.14 uses a B-tree on table doc in such a manner to map each child step to an index scan along a ⟨level, pre⟩ range. In fact, the plan is almost identical to the one employed by our DB2 instance to evaluate Query Q5. DB2, however, prepends each index with information on columns kind and prop. This effectively pushes the evaluation of name tests into index conditions and prevents index scans from accessing any false tuples in their base relations. This opportunity to optimize index usage for XPath node tests arises for all axes. The DB2 index advisor/optimizer seizes this chance for all queries we evaluated. This constitutes another situation in which the system captures specifics of the pre/post encoding without any explicit hints or user intervention.
2.2.2 XPath Accelerator on PostgreSQL
Being available as open source, the PostgreSQL database system provides the ideal playground for tree-specific kernel tweaks, such as the staircase join algorithm that will be described in the following chapter. Here, we form the baseline for an evaluation of these kernel tweaks by looking at the XPath performance of an unmodified PostgreSQL instance.
The trends observed on the PostgreSQL installation precisely reflect those measured on DB2 before. The open-source system lags behind its commercial sibling by almost an order of magnitude, though. On document instances beyond 111 MB, PostgreSQL did not finish execution within reasonable time (≈ 5 hours) at all.
[Plot: execution time [ms] (lines) and page misses (bars) vs. XML document size [MB], for the original XPath accelerator and shrink-wrapping.]
Figure 2.15: Relational evaluation of Q1 on PostgreSQL. Execution times closely resemble those observed on the DB2 installation before.
The achieved XPath performance for Queries Q1 and Q2 is illustrated in Figures 2.15 and 2.16, respectively. In addition, we used PostgreSQL's facilities to report page access statistics for every query. In both figures, the bar charts depict the total number of page misses required to run the query, while the lines represent the overall execution time. (All queries were run with caches "warm"; hence, the small document instances did not require any page loads at all.)
The two measurements confirm the observations we made on our DB2 installation earlier. The tight correlation between query execution times and the number of page misses indicates scans over large document regions as the major cost factor. As expected, the number of misses grows quadratically with the document size in the case of the unaltered XPath accelerator application, while the shrink-wrapping and symmetric rewrite techniques lead to linear scaling on all XMark instances.
In Chapter 3, we will revisit PostgreSQL, when we investigate the impact of tree-specific enhancements to the kernel itself. At that time, we will also provide performance numbers for PostgreSQL's preceding and following performance.
2.3 Related Work
The maturity of relational database systems has long since suggested their use as back-ends for XQuery processing, and a number of alternative encoding techniques have been published in the literature. In the domain of schema-oblivious encodings (cf. Section 1.1.2), these alternatives may be categorized into encodings that use
[Plot: execution time [ms] (lines) and page misses (bars) vs. XML document size [MB], for Q2 (original XPath accelerator) and Q2^symm (symmetric rewrite).]
Figure 2.16: Execution times observed for Q2 on PostgreSQL. Much like DB2, PostgreSQL embraces the cheaper evaluation strategy for the symmetric equivalent, Q2^symm.
a fixed number of bits to store the structural part of the XML document tree and variable-length encodings.
As we have seen in Section 2.1.8, structural updates require an expensive renumbering of the pre/post plane. Upon insertion of a new subtree, any node in the following and ancestor region of the insertion node is subject to such renumbering. Others have suggested overcoming this renumbering problem by the introduction of gaps during the determination of the nodes' pre- and postorder ranks [Li01] or the use of floating point numbers to represent pre(v) and post(v) [Amagasa03]. Both approaches, however, do not solve the update problem per se, but can merely defer the renumbering effort until all gaps are filled up.⁴ In fact, Cohen et al. proved that for any possible labeling scheme, there exists a sequence of N node insertions that requires labels of length Ω(N) [Cohen02]. Hence, no fixed-length encoding can possibly accommodate updates without relabeling (parts of) its relational storage (and XPath accelerator is no exception).
⁴Note that for a given bit-length of columns pre(v) and post(v), both "solutions" limit the number of nodes that may be accommodated in the doc relation in the first place.

2.3.1 Fixed-Length Encodings

One of the first relational mappings for XML documents has been described by Florescu and Kossmann [Florescu99]. The edge mapping technique models parent-child relationships as directed edges and stores the node identifiers of source and target nodes in two columns of the table Edge. Column ordinal implements the order among sibling nodes and thus XML document order.
In contrast to XPath accelerator, the edge mapping technique requires recursively defined XPath axes to be evaluated on the Edge table in a recursive manner, a processing model that relational databases have not originally been designed for. In return, edge mapping allows for simple and cheap structural updates, where only a few nodes require renumbering.
Li and Moon [Li01] describe a relational encoding that closely resembles our range encoding. It uses an extended preorder rank plus the node's subtree size to encode the XML structure. However, their work focuses exclusively on the efficient evaluation of the XPath child and descendant axes.
The same is true for the UID (unique identifier) encoding that was introduced by Lee et al. [Lee96] and later refined by Kha et al. [Kha02]. The UID encoding virtually fills up the XML tree structure to obtain a fully populated k-ary tree, where k is the maximum fanout of the original document. The UID of any document node is then its rank in a breadth-first traversal of the virtual tree. Given the UID of a node v, the fixed arity allows all UIDs of v's parent/child nodes to be derived. However, given that the fanout of real-world XML instances is rather irregular, with typically few nodes that exhibit a very high fanout, it seems hardly appropriate to base an encoding on the maximum arity in the tree.
A Corner Case: Binary Associations

Somewhat on the borderline to the schema-based storage variant lies the binary association technique proposed by Schmidt et al. [Schmidt00]. This storage approach is inspired by the edge mapping technique discussed earlier and stores the structural relationship among nodes as explicit parent-child associations. The technique of Schmidt et al. (employed, e.g., in an implementation on top of MonetDB [Boncz02]) partitions the XML document according to all root-to-leaf paths found in the overall document (the so-called path summary). This naturally leads to a semantic grouping of tree nodes into various association tables. The authors argue that, depending on the query workload, this data clustering may significantly reduce the amount of processing work spent on irrelevant nodes.
The segmentation of the relational storage along root-to-leaf paths may be beneficial for our pre/post storage as well. Apart from that, it could be implemented in a surprisingly simple manner: if we record the root-to-leaf path path(v) for each tree node v, a B-tree index with major ordering on column path (e.g., a concatenated ⟨path, pre⟩ index) naturally simulates the fragmentation proposed by Schmidt et al. in terms of B-tree partitioning. This way, we fully benefit from the semantic grouping of binary associations, while we still retain the original XPath
[Two example trees labeled with dotted node identifiers; the Dewey tree uses consecutive components (1.1, 1.2, . . . ), the ORDPATH tree only odd components (1.1, 1.3, . . . ).]
(a) Dewey Order. (b) ORDPATH labels.
Figure 2.17: Variable-length encoding schemes: Dewey Order [Tatarinov02] and its extension ORDPATH [O'Neil04]. The illustration of the latter assumes that f and g were inserted after the initial document load by careting in.
accelerator performance (e.g., for wildcard steps). We will see another instance of this unaccustomed use of B-trees in Section 2.3.3.
The addition of path(v) to the relational storage has, in fact, already been proposed by others. Microsoft's SQL Server, e.g., maintains such information in terms of its PATH ID field [Pal04].
2.3.2 Variable-Length Encodings

Tatarinov et al. recognized early on that update costs are an issue in XML data processing [Tatarinov02]. Inspired by the Dewey Decimal Classification⁵, the authors propose a numbering scheme that uses a dot-separated sequence of integers to label each node in the tree (cf. Figure 2.17(a)). Labeling according to this Dewey Order limits the nodes affected by renumbering to the subtrees in the following-sibling region of the insertion point.
Similar to XPath accelerator, Dewey Order labels allow us to efficiently test two nodes u and v for their ancestor/descendant (u ∈ v/descendant) or document order relationships (u ≪ v), solely based on their node labels γu and γv.
More widely known is a variant of the original Dewey Order encoding, the ORDPATH labeling [O'Neil04]. ORDPATH is successfully used in the 2005 version of Microsoft SQL Server and eliminates the need to renumber tree nodes in case of structural updates. The idea is equally simple and effective: during the initial document load, only positive, odd integer values are used to create ORDPATH labels. Even numbers are reserved as insertion points (carets) for later updates. Figure 2.17(b) illustrates the use of carets to incorporate new tree nodes: nodes f and g have been inserted as new children of a between the nodes b and h;
⁵See, e.g., http://en.wikipedia.org/wiki/Dewey_Decimal_Classification.
the caret fills the space between 1.1 and 1.3 and leaves room for even more child additions to a in the future. ORDPATH labels have similar properties to Dewey Order labels, though care has to be taken to ignore carets (easily identified by the even number in their last component) in the case of non-recursive axes (child, parent, . . . ).
The dotted notation for Dewey-based node labels constitutes an intuitive representation for human reading. Internally, both approaches use a storage format that is inspired by the UTF-8 character encoding [Yergeau03], where the leading bits of the labels determine the bit length of each integer. In the case of ORDPATH, this leads to a highly compact binary representation.
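As a small illustration of these label-only tests (our own sketch, not code from the thesis), Dewey Order relationships reduce to prefix and lexicographic comparisons of the label components:

```python
def parts(label):
    """Split a dotted Dewey label into its integer components."""
    return [int(c) for c in label.split(".")]

def is_ancestor(u, v):
    """u is an ancestor of v iff u's label is a proper prefix of v's."""
    pu, pv = parts(u), parts(v)
    return len(pu) < len(pv) and pv[:len(pu)] == pu

def doc_before(u, v):
    """u << v (document order) iff u's components compare lexicographically smaller."""
    return parts(u) < parts(v)
```

For ORDPATH labels the same comparisons apply, except that, as noted above, components with even values (carets) would additionally have to be skipped when evaluating non-recursive axes.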
2.3.3 Relational Database Support

The favorable XPath performance we saw in the foregoing sections is mainly due to the re-use of mature database techniques inherent to the relational systems we examined. Two truly relational techniques are particularly remarkable in our context: the deliberate use of B-trees to index pre/post-encoded data, as well as the exploitation of dense key numberings in modern database kernels.
Index Support for the child Axis

We found the evaluation of the child axis efficiently supported by prepending the level column to an index on preorder ranks. Similarly, columns kind and prop as leading index keys could significantly accelerate the evaluation of XPath steps that involve name tests. Technically, the prepending of level values to an index on preorder ranks effectively leads to a partitioning of the resulting B-tree:
[Illustration: a B-tree over ⟨level, pre⟩ keys, partitioned into subtrees for level = 1, level = 2, level = 3, and level = 4.]
This partitioning resembles the work of Graefe on partitioned B-trees [Graefe03]. Graefe proposes the use of artificial leading key columns for an enhanced implementation of sort operations in high-volume RDBMSs. He points out that a low-selectivity leading column will typically not hamper query performance, particularly if the underlying B-tree implementation handles prefixes in a proper manner. Prefix B-trees [Bayer77], e.g., provide such an implementation and elegantly compress key prefixes.
Remember that we applied the same "trick" to simulate the effect of the binary association storage model on pre/post-encoded data in a simple and elegant manner. The use of a ⟨level, pre⟩ index also blends perfectly with the work of Chan et al. [Chan04]. They use an artificial layer axis to rewrite chains of XPath wildcard steps (e.g., paths of the form e/child::*/child::*/child::t) into a single tree navigation. The resulting paths (e.g., e/L3::t) directly translate into conditions on ⟨level, pre⟩ ranges and, hence, are ideally backed by the B-tree on ⟨level, pre⟩.
Support for Further Properties of the pre/post Plane

Some database kernels can exploit other interesting properties of the tables generated during the pre/post encoding of XML trees. The MonetDB system [Boncz02], e.g., provides dedicated support for columns with a dense key numbering. MonetDB will only store pre(v) for the first tuple of our encoding and infer the remaining pre values from the physical tuple order (in terms of its void column type). This not only avoids the materialization of column pre (and, hence, reduces storage consumption), but also allows for instantaneous, positional access by preorder ranks.
Additional benefits may be gained from a relational database implementation if we provide its kernel with specific information on the tables' underlying tree origin. Such kernel modifications constitute our agenda for the upcoming chapter.

3 XPath Evaluation with Staircase Join
The XPath accelerator tree encoding leverages mature database technology into the domain of XML data and turns relational DBMSs into efficient tree processors. In this chapter, we will increase the tree awareness of relational database systems once more by the injection of a new join operator: staircase join. We will see that this single operator, when added to the DBMS kernel, encapsulates all tree knowledge that is inherent to the pre/post plane. The chapter starts off with a review of some additional considerations about the XPath accelerator tree encoding. We will then introduce staircase join (Sec- tion 3.2) and discuss its implementation in disk-based or main memory-oriented DBMS kernels (Sections 3.3.1 and 3.3.2, respectively). Experiments confirm the effectiveness of staircase join in both setups. Section 3.4 discusses some more applications of the staircase join idea, before we conclude in Section 3.5 with a glimpse into the research vicinity of staircase join.
3.1 Re-Inspecting XPath Accelerator
In general, the translation of XPath expressions over pre/post-encoded data leads to relational query plans such as the one shown in Figure 3.1(a): a repeated self-join of the pre/post relation doc. Each XPath location step corresponds to a join operator in the relational plan, with the join condition determined by the predicates axis(α, v, v′) and test(ν, v′) (cf. Tables 2.1 and 2.2, respectively). An RDBMS query optimizer will typically generate a query plan like the one
[Two plan trees: (a) a logical plan with UNIQ_pre and SORT_pre above a cascade of joins of the context with selections σ_kind/prop over doc; (b) the associated execution plan using NLJOINs with IXSCANs on doc over the pre/post, pre/par, kind/prop/pre, and kind/prop/par/pre indexes.]
(a) Logical query plan. (b) Associated execution plan.
Figure 3.1: XPath location paths translate into repeated self-joins of the document relation doc. (b) The resulting plan is typically executed in terms of index nested loops joins (query context/descendant/following-sibling).
shown in Figure 3.1(b) to answer queries for XPath location steps. XML documents are accessed in terms of an index nested loops join, where the indexed doc relation serves as the inner join relation. Depending on the join predicate and the availability of appropriate indexes, systems may even employ index-only scans and, hence, not access the base relation doc at all.
3.1.1 Node Distribution in the pre/post Plane

Proper index selection as well as efficient plan generation is typically driven by statistics on the underlying database tables. The non-uniform point distribution in the pre/post plane, however, makes such statistics a poor means to capture the tree-specific nature of the data. Off-the-shelf RDBMSs, e.g., will not detect the presence of unpopulated regions in the pre/post plane. In effect, they will waste significant effort on scanning such unpopulated space. Making the system's optimizer aware of the underlying tree data could be a means to save a considerable amount of effort.
In the case of the XPath descendant axis, such hints to the optimizer were expressible on the SQL level in terms of shrink-wrapping (cf. page 18) by adding the line
AND v1.pre ≤ c.post + height(t)
AND v1.post ≥ c.pre − height(t)

to the SQL expression in Figure 2.3(b). As observed in Section 2.2.1, this leads to a significant improvement in query performance. In the following, we will see that the awareness of similar tree properties can lead to even further speedups of XPath evaluation.
Expensive XPath Semantics

Another observation in Figure 3.1(a) is the need for expensive sorting and duplicate elimination (operators SORT and UNIQ, respectively) to implement the respective XPath semantics. In Chapter 2, we looked at location steps that originate from a single context node only. In general, however, an entire set of context nodes serves as the input to an XPath location step. In consequence, the joins in Figure 3.1(a) will typically produce duplicate nodes that the database system needs to eliminate afterwards at considerable expense.
3.2 Staircase Join
The processing of an entire set of context nodes is, in fact, the basis for the optimizations we present in this chapter. We will first look into such a situation in an XML document tree and see how it is reflected in the pre/post plane. This will reveal a number of interesting tree properties that we will exploit afterwards in terms of the three techniques pruning, partitioning, and skipping. The new join operator that we propose, staircase join, incorporates all three techniques and introduces full tree awareness when added to an RDBMS kernel. Though we will discuss its details based on the pre/post encoding, all staircase join ideas can easily be adapted to other encoding variants as well (the pre/size/level variant in particular).
3.2.1 Pruning

Effectively, the processing of a context set with an (index) nested loops join, as we have observed it in actual DBMS implementations, leads to the independent evaluation of XPath axis regions for each context node in turn. Figure 3.2(a) depicts the situation for the evaluation of an ancestor-or-self location step for a context set of five nodes. The ancestor-or-self regions for these nodes significantly overlap, which we indicated by the intensity of the shadings in Figure 3.2(a). The darker a path's shade, the more often will the contained nodes appear in the join result, which ultimately leads to the requirement for the duplicate removal operator UNIQ in Figure 3.1(a).
Obviously, we can remove a node from the context set if its ancestor-or-self region is entirely included in the query region of some other context node(s) prior
[Two copies of an example document tree with context nodes c1–c5 marked; their ancestor-or-self paths are shaded.]
(a) Original context set: (c1, c2, c3, c4, c5). (b) Context set after pruning: (c2, c3, c5).
Figure 3.2: (a) The ancestor-or-self regions for the context set (c1, . . . , c5) intersect and hence produce duplicates in the result. (b) The pruned context set (c2, c3, c5) covers the same ancestor-or-self region with fewer duplicate nodes.
to join processing without any effect on the final query result. In our example in Figure 3.2, nodes c1 and c4 are instances of such nodes. As we see in Figure 3.2(b), the removal of such nodes, the so-called pruning step, minimizes the number of duplicates generated while, at the same time, lowering the join processing costs due to a reduced context size. Note, however, that the duplicate problem still persists: some tree nodes in Figure 3.2(b) (the root node in particular) are still reachable from more than one context node.
Context Pruning in the pre/post Plane
In the case of the ancestor-or-self axis, a context node v is a candidate for pruning if it is located along a path from some other context node v′ to the document root. Figure 3.3 illustrates this situation in the pre/post plane: each context node defines the boundary of its corresponding ancestor-or-self region. Regions may include one another or partially overlap. The former case of inclusion is easily dealt with by the removal of covered nodes, and thus regions, from the context set. This is exactly how we removed the context nodes c1 and c4 before. Figure 3.3(b) illustrates this removal in the pre/post plane after pruning the context set of Figure 3.3(a).
Similar opportunities to simplify the context set arise for other XPath axes as well. Figure 3.4 illustrates the situation for the descendant, following, and preceding steps for the same context set as in Figure 3.3(a). Again, a number of context nodes and their query regions are entirely covered by the query regions of other nodes, such that we can safely remove them from the context set, i.e., we can remove (a) nodes c2 and c5 for the descendant axis, (b) nodes c3, c4, and c5
[Two pre/post planes showing the query regions of context nodes c1–c5 before and after pruning.]
(a) Original context set: (c1, c2, c3, c4, c5). (b) Context set after pruning: (c2, c3, c5).
Figure 3.3: (a) Nodes may be pruned from the context set if their query regions include one another in the pre/post plane. (b) Pruning minimizes the region overlap and hence the production of duplicate result nodes (cf. Figure 3.2).
for the following axis, and (c) nodes c1, c2, and c3 for the preceding axis.
Pruning for Forward Axes

The task of identifying the inclusion of query regions is easily implemented for a pre/post-encoded context sequence. Algorithm 3.1 spells out the pruning procedure for the descendant axis. Routine prunecontext_desc () assumes that the table context is sorted along the pre dimension and processes it in a single sequential pass. It collects nodes whose post value exceeds the maximum value seen so far and discards the others. Note that this implementation seamlessly integrates with the pipelined execution model of modern RDBMSs.
prunecontext_desc (context : table (pre, post))
1 begin 2 result ← new table (pre, post); 3 prev ← 0; 4 foreach c in context do 5 if post(c) > prev then 6 append c to result; 7 prev ← post(c);
8 end
Algorithm 3.1: Context pruning for the descendant axis.
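A direct Python rendering of Algorithm 3.1 (a sketch under the algorithm's own assumptions: nodes as (pre, post) pairs, context sorted by pre) looks as follows:

```python
def prunecontext_desc(context):
    """Drop context nodes whose descendant region is covered (Algorithm 3.1).

    context: list of (pre, post) pairs sorted by pre. A node is kept iff
    its post rank exceeds the maximum post rank seen so far, i.e., iff it
    is not a descendant of a previously kept context node.
    """
    result, prev = [], 0
    for pre, post in context:
        if post > prev:
            result.append((pre, post))
            prev = post
    return result
```

The single sequential pass over a pre-sorted input mirrors the pipelined execution the text refers to.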
[Three pre/post planes for the context set (c1, . . . , c5): (a) descendant axis. (b) following axis. (c) preceding axis.]
Figure 3.4: Overlapping query regions for context set (c1, . . . , c5). The darkly shaded areas are covered by the query windows of multiple context nodes and hence produce duplicate result nodes.
After pruning for the descendant axis, all remaining context nodes exclusively relate to each other in the preceding/following direction. As a result, they form a proper staircase as illustrated in Figure 3.5.

[Figure 3.5: Pruned context set for Figure 3.4(a), shown in the pre/post plane.]

While Algorithm 3.1 is trivially adapted to implement context pruning for the following axis, the analogous handling of reverse axes (e.g., ancestor or preceding) would require the context set to be read in reverse document (i.e., pre-) order and, hence, block pipelined processing. It turns out, however, that these cases also become fully pipelineable if we consider further properties of the pre/post plane that are due to the tree origin of the nodes in the plane. These properties are what we will look into next.
3.2.2 Empty Regions in the pre/post Plane

Observe that the remaining region overlap in Figure 3.5 (darker shaded regions) is entirely empty. Apart from that, Figure 3.4 also exhibits some cases in which query regions cover unpopulated space in the pre/post plane. In Figure 3.4(b), context node c1 does not contribute any new result nodes to the following region originating at node c2. Likewise, node c4 does not contribute to the preceding result of c5 in Figure 3.4(c). Note that the situation persists even if we apply context pruning based on region inclusion.
[Two pre/post planes, each partitioned by nodes a and b into the nine regions R, S, T, U, V, W, X, Y, Z.]
(a) Nodes a and b relate on the (b) Nodes a and b relate on the preceding/following axis. ancestor/descendant axis.
Figure 3.6: Empty regions in the pre/post plane.
The described cases are not a coincidence of the example chosen here, but a direct consequence of the fact that the pre/post plane represents the encoding of a tree. In fact, more nodes may be pruned from the context set if we introduce tree awareness into the pruning procedure.
Figure 3.6 gives a detailed insight into the tree-related properties of regions in the pre/post plane. For any two nodes a and b, there are only two cases in which they can relate to each other: (a) on the preceding/following axis or (b) on the ancestor/descendant axis. In either case, nodes a and b partition the entire pre/post plane into the nine regions R to Z illustrated in Figure 3.6.
If we assume both nodes to be in a preceding/following relationship (Figure 3.6(a)), the presence of a node v in region Z would imply that v is a descendant of both nodes a and b. Obviously, such a situation cannot arise in a tree, and we can conclude that region Z must be entirely empty. (Note how this observation of an empty region explains the node distribution in the pre/post plane observed in Section 2.1.5.) The case in which a and b relate in an ancestor/descendant direction is illustrated in Figure 3.6(b). Again, based on our knowledge about the plane's tree origin, we know that an ancestor of node b may never precede (region U) or follow (region S) a. Hence, U and S must be empty regions as well.
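The emptiness of region Z can also be checked mechanically. The sketch below (an assumed example of ours, not code from the thesis) assigns pre/post ranks to a small tree and verifies that no node descends from two nodes that relate on the preceding/following axis:

```python
def rank(tree):
    """Assign (pre, post) ranks to a tree given as (label, [children])."""
    ranks, ctr = {}, {"pre": 0, "post": 0}
    def walk(node):
        label, children = node
        my_pre = ctr["pre"]; ctr["pre"] += 1
        for c in children:
            walk(c)
        ranks[label] = (my_pre, ctr["post"]); ctr["post"] += 1
    walk(tree)
    return ranks

def is_desc(v, a, r):
    """v is a descendant of a iff pre(a) < pre(v) and post(v) < post(a)."""
    return r[a][0] < r[v][0] and r[v][1] < r[a][1]

r = rank(("a", [("b", [("c", []), ("d", [])]), ("e", [("f", [])])]))
for a in r:
    for b in r:
        if r[a][0] < r[b][0] and r[a][1] < r[b][1]:   # a precedes b
            # region Z: common descendants of a and b -- must be empty
            assert not [v for v in r if is_desc(v, a, r) and is_desc(v, b, r)]
```

The loop condition pre(a) < pre(b) ∧ post(a) < post(b) selects exactly the preceding/following pairs, since ancestors of b would have the larger post rank.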
Pruning for Reverse Axes

The observations about empty regions form the basis for the pruning algorithm for the XPath ancestor step as shown in Algorithm 3.2. From Figure 3.6(a), we know that we can never encounter a context node whose ancestor region covers the ancestor region of any node prev already appended to the pruned context set, as such a node would have to be situated in an empty region of type Z. This means
prunecontext_anc (context : table (pre, post))
1 begin 2 result ← new table (pre, post); 3 prev ← first node in context; 4 foreach c in context do 5 if post(c) > post(prev) then 6 append prev to result;
7 prev ← c;
8 append prev to result;
9 end
Algorithm 3.2: Context pruning for the ancestor axis.
that with prunecontext_anc () we have found a means to prune an ancestor context set in a fully pipelineable fashion (again we assume context to be sorted in document (pre-) order).
The observations about empty regions also allow for even more profound pruning opportunities in the case of the preceding and following axes. Pruning based on region inclusion as discussed earlier leads to a context set where all nodes are related to each other along the ancestor/descendant axes, i.e., they are all arranged in the same way as in Figure 3.6(b). For the preceding axis, this implies that we can prune all context nodes but the one with the maximum pre value, because region U is known to be empty. The same applies to the node with the minimum post value in the case of the following axis, where region S must be empty. With only a single context node remaining, this degenerates the evaluation of the preceding and following axes into a single region scan, regardless of the original context size. We will, hence, focus on the ancestor and descendant axes in the following and come back to the preceding and following axes in Section 3.4.3.
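Rendered in Python (same hedged setup as before: (pre, post) pairs sorted in document order), Algorithm 3.2 becomes:

```python
def prunecontext_anc(context):
    """Context pruning for the ancestor axis (Algorithm 3.2).

    A node is dropped exactly when the next context node lies in its
    subtree -- its ancestor region is then covered by that node's region.
    The empty-region argument guarantees that one look-ahead suffices.
    """
    result = []
    prev = context[0]
    for c in context[1:]:
        if c[1] > prev[1]:      # prev is not an ancestor of c: keep it
            result.append(prev)
        prev = c
    result.append(prev)
    return result
```

Note how the loop only ever compares a node with its immediate successor, which is what makes the procedure pipelineable despite the reverse direction of the axis.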
3.2.3 Partitioning

Pruning redundant nodes from the context set may significantly reduce the number of duplicates produced by an XPath location step. However, as we can see in Figures 3.2(b) and 3.3(b), a number of duplicates still remains due to intersecting ancestor-or-self paths originating at different context nodes.¹ We can avoid this if we partition the document tree using the remaining context nodes and
¹In fact, the tree root is always returned once for every context node.
[A pre/post plane and the corresponding document tree, partitioned at p0, p1, p2, p3 by the context nodes c2, c3, and c5.]
Figure 3.7: Document partitioning induced by context nodes c2, c3, and c5.
ensure that we process each partition just once. This approach is illustrated in Figure 3.7 for the case of the ancestor-or-self step that we have seen earlier. The context nodes' preorder ranks pre(ci) induce a partitioning of the pre/post plane that follows the staircase shape obtained after pruning. Each of the partitions [p0, p1], (p1, p2], and (p2, p3] defines a region within the plane that contains the axis step's sub-result for the context nodes c2, c3, and c5 (respectively).
A natural way to process such a partitioned pre/post plane is listed in Algorithm 3.3. These basic staircase join algorithms sequentially scan the plane partition-wise from left to right, selecting those nodes in each partition that lie within the sub-result of its respective context node. Note that Algorithm 3.3 assumes both input relations, the context set context and the XPath accelerator document table doc, to be sorted in pre-order. Hence, accessing the tuple doc[i] in procedure scanpartition () does not actually involve a distinct lookup for the tuple with the preorder rank i, but rather means using the record at hand during the sequential scan. Furthermore, as staircase join effectively traverses the tree in document order, it will automatically produce its result in document order, too.
The evaluation of a location step according to Algorithm 3.3 exhibits a number of interesting characteristics and, most importantly, upper bounds for the execution of each step:
(i) tables doc and context are both scanned sequentially,
(ii) both tables are read only once for an entire context set (hence, |doc| is an upper bound for the number of tuple accesses in relation doc),
(iii) staircase join never delivers duplicate nodes, and
staircasejoin_anc (doc : table (pre, post), context : table (pre, post))

1 begin
2     result ← new table (pre, post);
3     n ← first node in context;
4     d ← first node in doc;
5     scanpartition (pre(d), pre(n) − 1, post(n), >);
6     foreach successive pair (n1, n2) in context do
7         scanpartition (pre(n1) + 1, pre(n2) − 1, post(n2), >);
8     return result;
9 end
staircasejoin_desc (doc : table (pre, post), context : table (pre, post))

10 begin
11     result ← new table (pre, post);
12     foreach successive pair (n1, n2) in context do
13         scanpartition (pre(n1) + 1, pre(n2) − 1, post(n1), <);
14     n ← last node in context;
15     d ← last node in doc;
16     scanpartition (pre(n) + 1, pre(d), post(n), <);
17     return result;
18 end
scanpartition (pre1, pre2, post, θ)

19 begin
20     for i from pre1 to pre2 do
21         if post(doc[i]) θ post then
22             append doc[i] to result;
23     return result;
24 end

Algorithm 3.3: Basic staircase join algorithms (ancestor and descendant axes). For each context node, procedure scanpartition () sequentially scans the associated pre interval in document order.
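Algorithm 3.3 transcribes almost literally into executable form. The following Python sketch (the function names and the seven-node toy pre/post table are our own, purely illustrative) encodes the document as a list of (pre, post) pairs, with the pre rank doubling as the list index:

```python
import operator

# Toy pre/post encoding of a seven-node tree (pre rank = list index):
# a(0,6) -> b(1,3) -> c(2,0), d(3,2) -> e(4,1); a -> f(5,5) -> g(6,4)
DOC = [(0, 6), (1, 3), (2, 0), (3, 2), (4, 1), (5, 5), (6, 4)]

def scanpartition(doc, result, pre1, pre2, post, theta):
    # Sequentially scan the partition [pre1, pre2]; keep the nodes whose
    # post rank satisfies the axis-specific comparison theta.
    for i in range(pre1, pre2 + 1):
        if theta(doc[i][1], post):
            result.append(doc[i])

def staircasejoin_anc(doc, context):
    # context must be a pruned staircase, sorted by pre rank
    result = []
    n, d = context[0], doc[0]
    scanpartition(doc, result, d[0], n[0] - 1, n[1], operator.gt)
    for n1, n2 in zip(context, context[1:]):
        scanpartition(doc, result, n1[0] + 1, n2[0] - 1, n2[1], operator.gt)
    return result

def staircasejoin_desc(doc, context):
    result = []
    for n1, n2 in zip(context, context[1:]):
        scanpartition(doc, result, n1[0] + 1, n2[0] - 1, n1[1], operator.lt)
    n, d = context[-1], doc[-1]
    scanpartition(doc, result, n[0] + 1, d[0], n[1], operator.lt)
    return result
```

With the context {b}, staircasejoin_desc returns c, d, and e; with the context {c, g}, staircasejoin_anc returns each of a, b, and f exactly once: duplicate-free and in document order, as claimed above.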
Figure 3.8: Skipping for the descendant axis. [Pre/post plane: each partition is scanned only up to the first node v outside the context node's descendant boundary; the remainder of the partition (∅) is skipped.]

scanpartition_desc (pre1, pre2, post)

1 begin
2     for i from pre1 to pre2 do
3         if post(doc[i]) < post then
4             append doc[i] to result;
5         else
6             break; /* skip */
7     return result;
8 end

Algorithm 3.4: Abort scanning early (line 6) to implement skipping (descendant axis).
(iv) result nodes are returned in document order; no expensive post-processing is required to comply with the semantics of XPath.
It is important to note, though, that Algorithm 3.3 only works correctly on proper staircases, i.e., on pruned context sets. Both processing stages, pruning as well as staircase join evaluation, read their input in a strictly sequential fashion. They may thus straightforwardly be integrated into a single algorithm. In [Grust03a], the authors describe a staircase join variant that implements such pruning on the fly.
3.2.4 A Further Increase of Tree Awareness: Skipping

If we know certain regions of the pre/post plane to be empty, we may as well try to skip them during our scan. Figure 3.8 illustrates this idea for the descendant axis. To evaluate this axis, staircase join scans the pre/post table from left to right, starting at the first context node c1. During the scan of the first document partition (associated with c1), the first node we encounter outside of c1's descendant boundary is node v. Since nodes c1 and v are arranged in a preceding/following relationship as illustrated earlier in Figure 3.6(a), we can conclude that no further descendants of c1 can be found between v and the next context node c3 (indicated with ∅ in Figure 3.8). In such a situation, we may stop processing the current partition immediately and continue the evaluation at the next context node, as indicated by the skip in Figure 3.8.

It is easy to adapt the basic staircase join algorithm to this behavior by replacing its worker function scanpartition () with the implementation shown as
Algorithm 3.4. When a node outside the current post boundary is encountered, the algorithm immediately aborts the current partition scan (line 6) and continues processing at the next document partition.

The effectiveness of skipping can be substantial. Each node we encounter while scanning the document partition of a context node ci will either (i) belong to ci's result set or (ii) play the role of v above and, hence, lead to a skip. To produce the final result, we thus never touch more than |context| + |result| nodes in the pre/post plane. Most importantly, the effort to evaluate an XPath location step becomes independent of the size of the pre/post table (i.e., of the XML document size). In contrast, the basic staircase join algorithm (Algorithm 3.3) would scan the entire document relation, starting at the context node with minimum preorder rank, and, hence, have an upper bound of |doc| tuple accesses. In [Grust03c], we found staircase join to skip about 92 % of these tuples.
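The |context| + |result| bound is easy to observe empirically. Below is a Python sketch (the toy document and the touched-tuple counter are our own additions) of the skipping variant of scanpartition from Algorithm 3.4:

```python
# Toy pre/post table; pre rank = list index (see the running example).
DOC = [(0, 6), (1, 3), (2, 0), (3, 2), (4, 1), (5, 5), (6, 4)]

def staircasejoin_desc_skip(doc, context):
    # Skipping variant of the descendant staircase join: abort a partition
    # scan at the first node whose post rank exceeds that of the current
    # context node (Algorithm 3.4).
    result, touched = [], 0

    def scanpartition_desc(pre1, pre2, post):
        nonlocal touched
        for i in range(pre1, pre2 + 1):
            touched += 1
            if doc[i][1] < post:
                result.append(doc[i])
            else:
                break  # skip: the rest of the partition holds no descendants

    for n1, n2 in zip(context, context[1:]):
        scanpartition_desc(n1[0] + 1, n2[0] - 1, n1[1])
    n = context[-1]
    scanpartition_desc(n[0] + 1, doc[-1][0], n[1])
    return result, touched
```

For the context {c, f} the scan touches only two tuples (one miss ending c's partition, one hit g for f), respecting the bound touched ≤ |context| + |result|.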
Skipping for Other XPath Axes

For the ancestor and following axes, we can find similar opportunities to skip parts of the document relation:
Ancestor Axis. Whenever we encounter a node v outside the ancestor boundaries of some context node c, we know that all of its descendants will also not qualify as ancestor nodes of c. In that case, Equation 2.3 (that we used earlier to implement shrink-wrapping) comes in handy to provide an estimate of the size of the subtree below v. We are off by at most the height of the document tree if we skip to the node v′ with pre(v′) = post(v) in that case.
Following Axis. Evaluating the following axis, we know that the nodes we encounter in pre-order immediately after the single remaining context node c will be c’s descendants. Again, using Equation 2.3, we can skip those with an error of at most height(t). Consequently, we will read at most height(t) false hits to evaluate an XPath following step, resulting in an upper bound of |result| + height(t) node accesses.2
Preceding Axis. When scanning the pre/post region left of the single remaining context node c for a preceding step, we encounter all nodes that precede c, interspersed with the ancestors of c. There are at most height(t) of the latter, such that even without skipping we have to read at most |result| + height(t) nodes to evaluate the preceding axis.
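The ancestor-axis skipping rule above (jump to the node v′ with pre(v′) = post(v)) can be prototyped as follows; the document table and the node names are again our own toy example:

```python
# Toy pre/post table; pre rank = list index (see the running example).
DOC = [(0, 6), (1, 3), (2, 0), (3, 2), (4, 1), (5, 5), (6, 4)]

def scanpartition_anc_skip(doc, pre1, pre2, post):
    # Collect nodes in [pre1, pre2] whose post rank exceeds `post`
    # (ancestor condition). A non-qualifying node v cannot contribute
    # any of its descendants either, so we jump ahead to pre = post(v);
    # by Equation 2.3 this undershoots by at most height(t) nodes.
    result = []
    i = pre1
    while i <= pre2:
        if doc[i][1] > post:
            result.append(doc[i])
            i += 1
        else:
            i = max(i + 1, doc[i][1])  # skip over v's subtree
    return result
```

For context node e = (4, 1), scanning the partition [0, 3] yields its ancestors a, b, and d; the scan skips past c without inspecting any node of c's subtree.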
2Remember that height(t) ≪ |doc|.

3.3 Implementation Considerations
We have now identified a number of optimizations that a relational database engine could exploit for accelerated XPath step evaluation. Staircase join makes those optimizations available in the form of a new join operator which is added to the DBMS kernel. The ¢ operator encapsulates all tree specifics present in the database's pre/post relation and uses this tree awareness to evaluate XPath location steps at minimal cost. We claim that staircase join could easily be added to any relational database kernel, turning it into a high-performance tree processor.

Maybe the most suitable characterization of the staircase join algorithm is a merge join with a dynamic range predicate. ¢ reads both of its inputs, a relation that represents the step's starting point and the document table doc, in a single sequential scan. Moreover, staircase join exhibits a number of valuable properties that allow for a seamless integration into existing database kernels:
(i) Staircase join is a join in the usual sense. As with other join implementations, ¢ allows for effective plan rewrites such as, e.g., selection pushdown.
(ii) Both input relations are consumed and processed in a stream-like fashion, making ¢ not only interesting in terms of its cache-friendliness, but also suitable for pipelined execution.
Both aspects turn out to be equally important, whether staircase join is introduced into a traditional, disk-based RDBMS or into a modern database kernel tuned for CPU efficiency and execution in main memory. To prove the claim that ¢ speeds up XPath execution in any RDBMS, we implemented the operator in the disk-based system PostgreSQL [PostgreSQL] as well as in the main-memory DBMS MonetDB [Boncz02].
3.3.1 A Disk-Based ¢ Implementation

To assess the applicability of staircase join to a disk-based database environment, the open-source system PostgreSQL provides quite an ideal playground. In the course of the work in [Mayer04a], we injected staircase join into PostgreSQL 7.3.3. The required changes to PostgreSQL were remarkably small. Only local changes to the database's execution engine (the addition of staircase join itself) and an adaptation of its planner/optimizer (to make it actually use the new operator) led to significant performance advantages for queries on pre/post-encoded data. At the same time, the addition did not do any harm to non-tree-related SQL queries.
    ¢ (following)
    ├── ¢ (descendant)
    │   ├── SORTpre
    │   │   └── context
    │   └── IXSCAN (doc)
    └── IXSCAN (doc)
Figure 3.9: Staircase join-enhanced query plan for the path expression context/descendant/following. The ¢ operator requires its input to be pre-sorted and guarantees the same for its output.
Scheduling for Staircase Join

To benefit from the physical staircase join operator, we modified PostgreSQL's planning/optimization phase to apply the new ¢ operator whenever it detects an instance of an XPath location step. This decision is based on an examination of the join clauses in the query tree. We trigger the use of staircase join whenever the clauses meet the following four conditions:

(i) the join must be defined as two conditions on attributes named pre and post,

(ii) the respective comparison operators must specify a valid XPath axis (e.g., d.pre > c.pre and d.post < c.post for the descendant axis),

(iii) both columns (pre and post) must be of data type tree in both join partners (this column type was newly introduced into PostgreSQL to identify columns that hold tree-structured data), and

(iv) the system must find a suitable index on the inner join relation that supports the ¢ operator, i.e., a B-tree index on column pre.

The resulting plan (see Figure 3.9 for an instance) resembles the execution plan we saw earlier for an unmodified relational system (cf. Figure 3.1(b)), except that ¢ is used to join the context set with the document relation for each XPath step. Observe how ¢ reads its input in document order (hence, requires a SORT operator to access the context sequence). In return, it guarantees a duplicate-free result that is returned in document order (hence, no additional sorting effort is required at the plan's root).
Indexed Document Access As mentioned earlier, an unmodified PostgreSQL instance would execute path steps on pre/post-encoded data by applying its index nested loops join algorithm. 3.3. IMPLEMENTATION CONSIDERATIONS 53
[State diagram: states INIT, NEXT_CONTEXT (prune), STORE, IXSCAN, NEXT_NODE, TEST_PARTITION (partition), TEST_POST (skip), and JOIN.]
Figure 3.10: The finite state automaton that defines a fully pipelineable ¢ imple- mentation for the XPath descendant axis.
The pre/post relation doc is thereby accessed via a B-tree on column pre. PostgreSQL provides a special index variant for this purpose, the inner join index, that allows for efficient re-scans of the inner relation for each of the distinct tuples from the outer join relation.

Staircase join adopts this way of accessing the doc relation as shown in Figure 3.9 (operators IXSCAN). Our implementation initiates a partition scan at the lower partition boundary pi by triggering an index (re-)scan of the document table using the condition doc.pre > pi. The inner join index will deliver the matching tuples in ascending pre order, starting directly at the first pre value that satisfies this property. With pruning in place and for a pre-ordered context set, this leads to a progressive forward scan of doc.
A Pipelineable ¢ Implementation

As is typical for a disk-based database system, PostgreSQL strives to avoid materialization and to process data in a fully pipelined fashion. To appropriately support this execution paradigm, we carefully adapted the original staircase join algorithm (Algorithm 3.3) such that it will only request the next input tuple from either subplan if that tuple is immediately required to produce the next output tuple. The clearly distinguished execution steps predefined by pipelining and the three staircase join-specific techniques (pruning, partitioning, and skipping) suggested the use of a finite state automaton to implement the staircase join execution module. Each of the four major axes was assigned its own automaton.

Figure 3.10 illustrates the automaton that implements staircase join for the XPath descendant axis. After the first context tuple v1 has been read from the outer subplan (state INIT), the automaton stores it as the lower boundary for the first partition. State NEXT_CONTEXT then requests tuples from the outer subplan (the context set), until a context node v2 is found with a higher post value than
v1. This implements context set pruning (cf. prunecontext_desc (), page 43). As soon as the first scan partition, (v1, v2], is set, ¢ starts reading the document nodes within that partition. As mentioned before, we assume the doc relation to be accessed via an inner join index and, hence, initiate an index re-scan (state IXSCAN), starting right after pre(v1). Tuples are retrieved one by one from the inner join index in state NEXT_NODE. For each of them we ensure that it is still within the boundaries of the current scan partition (state TEST_PARTITION) and that it satisfies the post condition (state TEST_POST). State JOIN then builds and returns the next result tuple. As soon as we encounter a node outside the current scan partition (state TEST_PARTITION), we switch to state STORE and, hence, to the next partition.

As we have seen, the crucial ingredient of efficient step evaluation with staircase join is the opportunity to skip over parts of the pre/post table that are known not to contribute to the result. In case of the descendant axis, this opportunity arises whenever scanpartition_desc () (Algorithm 3.4) encounters a node outside the current post range. In our pipelineable implementation, the state automaton immediately switches to the STORE state whenever it finds a node that does not meet the current post condition (state TEST_POST) and skips forward to the next context node. The automaton reaches a final state if either the outer or the inner subplan runs out of tuples.
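The pull-based behavior of such an automaton can be mimicked with a Python generator (our own sketch; the state handling is inlined rather than expressed as an explicit state table): context tuples are requested only while pruning or when a partition is exhausted, and each result tuple is produced on demand:

```python
# Toy pre/post table; pre rank = list index (see the running example).
DOC = [(0, 6), (1, 3), (2, 0), (3, 2), (4, 1), (5, 5), (6, 4)]

def pipelined_desc(doc, context_iter):
    # Pipelined staircase join for the descendant axis: pull input
    # tuples lazily, prune on the fly, emit result tuples one by one.
    it = iter(context_iter)
    v1 = next(it, None)                       # state INIT
    while v1 is not None:
        v2 = next(it, None)                   # state NEXT_CONTEXT
        while v2 is not None and v2[1] < v1[1]:
            v2 = next(it, None)               # prune: v2 lies in v1's subtree
        end = v2[0] - 1 if v2 is not None else doc[-1][0]
        for i in range(v1[0] + 1, end + 1):   # IXSCAN / NEXT_NODE
            if doc[i][1] < v1[1]:             # TEST_POST
                yield doc[i]                  # JOIN: emit result tuple
            else:
                break                         # skip to the next partition
        v1 = v2                               # STORE: next partition
```

Feeding the unpruned context (b, c, f) yields c, d, e, and g lazily; c is pruned on the fly because its post rank lies below b's.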
Quantitative Assessment
We used our staircase join-enhanced PostgreSQL instance to re-run Queries Q1–Q4 from Chapter 2 on identical hardware. For ease of comparison, we also restate the execution times we observed in Chapter 2. All execution times are illustrated in Figures 3.11 (Query Q1, descendant axis), 3.12 (Query Q2, ancestor axis), and 3.13 (Queries Q3/Q4, preceding/following axes). The four queries each stress one specific axis of the XPath language. On our original PostgreSQL platform, we observed quadratic scaling with document size for each of them on documents generated with xmlgen. Only explicit modifications to the original query (in the case of Q2) or to the resulting SQL code (Query Q1) were able to reduce this complexity to linear behavior.

In contrast, none of these modifications are required if the database has been extended with staircase join, which incorporates any tree specifics present in the pre/post plane. Our enhanced PostgreSQL installation beats the original instance on all of our test queries (even the modified ones). Staircase join achieved linear scalability within the entire size range.

The staircase join-extended PostgreSQL system shows its advantage most prominently during the execution of the preceding and following axes (Queries Q3
[Log-log plot: execution times [ms] and page misses over XML document sizes 0.11–1,118 MB for the original XPath accelerator, shrink-wrapping, and staircase join.]
Figure 3.11: Staircase join performance for Query Q1. Staircase join achieves performance linear in the query's result size. The introduction of the new join operator no longer requires explicit query adaptations on the SQL level.
[Log-log plot: execution times [ms] and page misses over XML document sizes 0.11–1,118 MB for the original XPath accelerator, the symmetric rewrite (Q2symm), and staircase join.]
Figure 3.12: Staircase join implementation for the ancestor axis (Q2). The enhanced PostgreSQL instance renders any rewrites to the XPath expression itself obsolete.
[Two log-log plots: execution times [ms] and page misses over XML document sizes 0.11–1,118 MB for the original XPath accelerator and staircase join.]
(a) Query Q3 (preceding axis). (b) Query Q4 (following axis).
Figure 3.13: Staircase join on PostgreSQL for the preceding and following axes.
and Q4). Here, the off-the-shelf system not only suffers from the significant search effort; sorting the query result and eliminating duplicate hits add another considerable cost. In fact, the original system was not able to handle the two queries at all for XML instances above 34 MB (Figures 3.13(a) and 3.13(b)).

For further details and experiments on the injection of staircase join into PostgreSQL, we refer the interested reader to [Mayer04a]. For now, we will put the disk-based execution paradigm aside and look into the domain of main memory-oriented database kernels.
3.3.2 Main Memory-Related Adaptations

The classical design of disk-based database systems considers access to secondary storage as the dominating cost factor. This model, however, is increasingly being invalidated by ongoing advances in hardware technology that put main-memory access cost into the spotlight. Main-memory database systems (MMDBMSs) acknowledge this fact with algorithms optimized for execution on modern computing hardware. The careful consideration of CPU and cache characteristics is crucial to this new database paradigm, into which staircase join turns out to fit perfectly.

The MonetDB system [Boncz02] constitutes an advanced implementation of the main memory-oriented execution paradigm and is available in open source.3 MonetDB's data model exclusively uses binary tables ("binary association tables",
3http://monetdb.cwi.nl/
BATs) and thus blends perfectly with the two-column schema of the XPath accelerator document encoding. The MonetDB kernel forms the back-end of the XQuery system MonetDB/XQuery. This section focuses on the staircase join implementation included therein.

BATs provide a number of interesting features with respect to the implementation of the XPath accelerator document encoding, in particular the column type void (virtual oid) mentioned earlier in Section 2.3.3. Columns of this type hold continuous sequences of integers o, o + 1, o + 2, . . . , of which MonetDB only stores the offset o. It is not the reduced storage requirement that is of special interest here, but the possibility to execute many operations as positional lookups. This functionality provides a highly efficient means to implement staircase join's skipping idea.

The adaptations to staircase join that we are about to describe are motivated by the calculations we published in [Grust03c]. Though their concrete numbers are specific to a certain hardware system, these calculations exemplify hardware-independent characteristics of staircase join and may easily be reproduced for different systems. The figures in the following assume a dual Pentium 4 Xeon machine, clocked at 2.2 GHz. The system is equipped with 2 GB of main memory and a two-level cache (levels L1/L2) of sizes 8 KB/512 KB. L1/L2 have cache line sizes of 32 bytes/128 bytes with miss latencies of 28 cy/387 cy = 12.7 ns/176 ns (measured with the calibrator tool described in [Manegold02]).
CPU Bottleneck Elimination

In MonetDB, staircase join essentially nests two loops that scan the context and doc BATs, respectively. The major fraction of work is done by the inner loop, i.e., by procedure scanpartition_desc () (see Algorithm 3.4), which scans the document relation doc. To optimize the CPU characteristics of staircase join, we will first concentrate on how to make this scanning process most efficient.

As suggested before, column pre forms a continuous sequence of integers, which perfectly matches MonetDB's space-efficient column type void. This leaves us with the storage of post values, which we implemented as 4-byte integers. No more overhead is required to store a tuple ⟨pre, post⟩ in MonetDB; hence, an L2 cache line will contain 128/4 = 32 nodes.

The instruction latencies of single assembly instructions [Int05] provide an estimate for the number of CPU cycles spent on each iteration in scanpartition_desc (). On our test machine, they amount to 17 cy per iteration. For one cache line, this is 17 cy × 32 = 544 cy, which exceeds the L2 miss latency of 387 cy; scanpartition_desc () is obviously CPU-bound (rather than cache-bound).

Algorithm 3.4 basically performs two distinct tasks for each document node: (i) test the post condition in line 3 and (ii) append qualifying nodes to the output
[Pre/post plane with the two diagonals pre = post and pre = post + height(t); partitions are processed in copy, scan, and skip phases.]
Figure 3.14: Estimation-based skipping: Lower and upper bounds for the number of descendants of each document node form two diagonals in the pre/post plane. The lower bound pre = post defines the range of the copy phase in Algorithm 3.5.
BAT (line 4). We obviously cannot avoid the latter step, as it implements the construction of the operator's result. However, we can save a large amount of work with respect to the former task if we, again, take tree-specific properties of the values pre(v) and post(v) into account. Based on Equation 2.3, we can find a lower bound for pre values, such that nodes below this pre boundary cannot possibly violate the join's post condition (this is dual to the shrink-wrapping conditions, i.e., Equations 2.5):4
pre(v′) ≤ post(v) ⇒ post(v′) < post(v) .    (3.1)
According to Equation 2.3, this estimation of the number of v's descendant nodes is off by level(v), i.e., at most height(t). In combination with Equation 2.5a, it defines lower and upper bounds for the size of each node's descendant region. In Figure 3.14, these bounds are illustrated by the two diagonals in the pre/post plane. Nodes within the lower boundary may safely be copied directly into the result BAT, without any inspection of their post values. Our CPU-optimized implementation of scanpartition_desc () (Algorithm 3.5), hence, divides the processing of the doc BAT into two distinct phases:
(i) The copy phase directly copies the first post(c) − pre(c) nodes that immediately follow the context node c in the pre/post table into the join result.
4We assume pre(v′) > pre(v) here, as this is the situation in scanpartition_desc ().
scanpartition_desc (pre1, pre2, post)

1 begin
2     estimate ← max (pre1, post);
      /* copy phase */
3     for i from pre1 to estimate do
4         append doc[i] to result;
      /* scan phase */
5     for i from estimate + 1 to pre2 do
6         if post(doc[i]) < post then
7             append doc[i] to result;
8         else
9             break; /* skip */
10    return result;
11 end

Algorithm 3.5: A CPU-optimized implementation of scanpartition_desc ().
According to Equation 3.1, no postorder ranks have to be checked for these nodes. The code for this phase forms an extremely tight copy loop.
(ii) The remaining document nodes within the partition are processed in the algorithm's scan phase. Essentially, this phase is identical to the original loop and includes a test for each node's postorder rank. As before, we abort scanning the current partition as soon as we find the first node in c's following region and skip forward to the next context node.
As mentioned earlier, the estimation error with respect to the number of descendant nodes amounts to at most the tree's height, height(t). This is insignificant in comparison to the overall document size, so the copy phase will represent the bulk of the algorithm's work. A single node copy iteration takes about 5 CPU cycles. On our system, processing one L2 cache line now takes 5 cy × 32 = 160 cy, which lies clearly below the L2 miss latency. Procedure scanpartition_desc () is now cache-bound, rather than CPU-bound.

The branch prediction friendliness of both processing phases adds another share to the CPU efficiency of scanpartition_desc (). Both loops are terminated by a fixed end condition, and the postorder condition always enters the same (then) branch, except for the last iteration. This branching behavior is perfectly predictable, such that the significant penalties for instruction retirement are avoided [Ross02].
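The two-phase Algorithm 3.5 can be sketched in Python as follows (our own transcription; as an assumption on our part, the copy boundary is clamped to pre1 − 1 so that leaf context nodes get an empty copy phase):

```python
# Toy pre/post table; pre rank = list index (see the running example).
DOC = [(0, 6), (1, 3), (2, 0), (3, 2), (4, 1), (5, 5), (6, 4)]

def scanpartition_desc_opt(doc, pre1, pre2, post):
    result = []
    # Copy phase: by Equation 3.1, every node with pre rank <= post(c)
    # is a descendant of c, so no post comparisons are needed here.
    # (Clamping to pre1 - 1 keeps the phase empty for leaf context nodes.)
    estimate = min(max(pre1 - 1, post), pre2)
    for i in range(pre1, estimate + 1):
        result.append(doc[i])
    # Scan phase: identical to Algorithm 3.4, including the skip.
    for i in range(estimate + 1, pre2 + 1):
        if doc[i][1] < post:
            result.append(doc[i])
        else:
            break  # skip
    return result
```

For context node b = (1, 3), the first post(b) − pre(b) = 2 nodes (c and d) are copied without any post test; only the remaining nodes reach the scan phase. For the leaf c = (2, 0), the copy phase is empty and the scan aborts on the very first node.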
Enhancements to Staircase Join's Cache Friendliness

Thanks to its sequential and strictly forward processing, staircase join shows a highly cache-friendly data access pattern, which we already found beneficial for disk-based database systems. This feature applies even more to the main memory-optimized domain. On the hardware level, sequential tuple access leads to full usage of CPU cache lines. The resulting sequential memory bandwidth of a machine with two cache levels has been described in [Manegold02]. For our example system, we get [Grust03c]:
    LS_L2 / (L_L2 + (LS_L2 / LS_L1) × L_L1)
        = 128 byte / (176 ns + (128 byte / 32 byte) × 12.7 ns)
        = 551 MB/s

(with LS_C the cache line size of cache C and L_C the associated miss latency). This bandwidth is shared by the parallel read and write operations in algorithm scanpartition_desc () (both employ strictly sequential memory access). During the execution of the procedure for the query /descendant::node() (which almost entirely consists of a copy phase), we observed a bandwidth that was significantly higher:

    (bytes read + bytes written) / execution time
        = (|doc| + context nodes scanned + result size) × 4 byte / execution time
        = (50,844,982 + 1 + 47,015,212) × 4 byte / 519 ms
        = 719 MB/s .
This excess sequential memory bandwidth is due to hardware prefetching provided by modern processors, such as the Pentium 4 system under investigation. After a short startup penalty (after which it has recognized the sequential access pattern), the CPU will automatically read two L2 cache lines (i.e., 256 byte) ahead.

An additional read-ahead may be obtained by employing software prefetching. Explicit prefetch instructions (prefetchnta) advise the CPU to prefetch further cache lines ahead. Our system has an overall miss latency (L_L1 + L_L2) of 28 cy + 387 cy = 415 cy, while Algorithm 3.5 requires only 2 × 160 cy = 320 cy to process the two cache lines loaded by the hardware prefetcher. Hence, we use prefetchnta to advise the CPU to read three cache lines ahead. In combination with additional code tuning (loop unrolling), this boosted the observed bandwidth up to 805 MB/s.
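The observed-bandwidth figure can be re-derived from the quoted tuple counts (a sanity check of the arithmetic only; MB here denotes 2^20 bytes):

```python
# Tuples touched: |doc| + context nodes scanned + result size,
# with 4 bytes (one post value) read or written per tuple.
bytes_moved = (50_844_982 + 1 + 47_015_212) * 4
seconds = 0.519                      # 519 ms execution time
bandwidth_mb_s = bytes_moved / seconds / 2**20
print(round(bandwidth_mb_s))         # -> 719
```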
Performance Assessment

To assess the effect of introducing tree awareness into a main-memory DBMS, we implemented staircase join as an extension module to the MonetDB system. As usual, documents from the XMark benchmark set [Schmidt02] served as a synthetic, yet realistic data set to examine the benefits of staircase join. The tree-aware optimizations introduced with staircase join (pruning, partitioning, and skipping) are most effective for the XPath descendant and ancestor axes.5 Figures 3.15 and 3.16 illustrate the performance of staircase join for the following queries (respectively):
Qdesc: /descendant::profile/descendant::education and
Qanc: /descendant::increase/ancestor::bidder.
Again, the sole purpose of the initial step is to provide a sufficiently large context set for the second step, which is of actual interest. Both queries have been evaluated in [Grust03c] on the hardware described above. An IBM DB2 installation on the same system serves as a baseline for a commodity RDBMS (without the availability of staircase join). Both queries were answered by our staircase join implementation in interactive time. Result sizes for Qdesc and Qanc scale linearly with document size and, hence, so do the execution times of staircase join.
Staircase Join and the descendant Axis. In case of Qdesc, Figure 3.15 shows that DB2 keeps up with staircase join fairly well for the full range of document sizes. This is mainly an effect of the shrink-wrapping conditions (included in Qdesc) that serve as a very accurate estimate to “simulate” staircase join’s skipping technique. Furthermore, a look into the execution plans employed by DB2 reveals another instance of its elegant means to evaluate XPath name tests: pushing them through the join operators makes them efficiently supported by suitable B-trees with a primary ordering on tag names (cf. page 31). As mentioned earlier, staircase join behaves much like a join in the classical sense to the query optimizer and name test pushdowns are equally applicable here. To demonstrate their effect, Figure 3.15 shows the execution times for both: staircase join with name tests pushed through (“early name test”) and staircase join with name tests evaluated after join processing. Note, however, that the effectiveness of this plan rewrite depends on the selectivity of the particular name test. Ideally, the choice for either plan should be driven by a suitable cost model.
5With pruning in place, axes preceding and following degenerate into a single range scan. We will demonstrate this effect later in Section 3.4.3.
[Log-scale bar chart: execution times (1–10^4 ms) over XML document sizes 1.1–1,111 MB for staircase join (early name test), staircase join, and IBM DB2 SQL.]
Figure 3.15: CPU-optimized staircase join performance observed for the evaluation of /descendant::profile/descendant::education (Qdesc). Name test pushdown saves another factor of three (measurements obtained in [Grust03c]).
Partitioning Boosts the ancestor Axis. The same arguments hold for the evaluation of an ancestor step, illustrated by Query Qanc in Figure 3.16. Again, we can significantly benefit from early name test evaluation. Though the actual SQL input we shipped to DB2 describes a symmetric equivalent to the Query Qanc listed above (cf. Section 2.1.6), and despite the database's efficient application of single record index accesses (cf. Section 2.2.1), the execution times observed for DB2 significantly suffer from the involved sort overhead. The use of DB2's index nested-loops join algorithms now falls behind the order-preserving staircase join operator by a factor of 25.
3.4 Tree Awareness Beyond Staircase Join
While staircase join elegantly integrates a high degree of tree awareness into a single database operator, it does not yet mark the end of tree-specific optimizations to relational database systems.
3.4.1 Loop-Lifting Staircase Join

It is in the very nature of staircase join to prune its input for redundant context nodes or overlap of query regions. While this is an ideal strategy for expressions along the lines of XPath 1.0, this optimization may hinder the implementation of language features that go beyond those simple path expressions, in particular XQuery's FLWOR expressions. While the implementation of this iteration construct
[Log-scale bar chart: execution times (1–10^4 ms) over XML document sizes 1.1–1,111 MB for staircase join (early name test), staircase join, and IBM DB2 SQL.]
Figure 3.16: Path /descendant::increase/ancestor::bidder. The avoidance of duplicates is particularly effective for the ancestor axis. Staircase join's advantage over DB2 is even more significant here (measurements obtained in [Grust03c]).

itself will be the subject of the next chapter, we will look into the impacts on path evaluation right now. Semantically, the use of a path expression within a FLWOR statement's return clause demands the repeated evaluation of the same sequence of path steps for multiple context sets. To exemplify, the expression
for $v in (/child::a, /child::a/child::b)
return $v/descendant::node()

assembles the results of /child::a and /child::a/child::b into a single node sequence, and then, for each $v within the sequence in turn, queries $v's descendants. All of them are collected, in order, into the flat result sequence. Note that this is not the same as evaluating
(/child::a, /child::a/child::b)/descendant::node() .
In the latter query, all nodes returned by /child::a/child::b could obviously be removed from the context set in order to avoid duplicates during the evaluation of the final descendant step. The former FLWOR query, in contrast, makes these duplicates a semantically required part of the result. Hence, the direct application of staircase join to the binding sequence (/child::a, /child::a/child::b) would violate the semantics of FLWOR expressions. A repeated execution of staircase join is hardly an alternative: for an iteration over n bindings, this could lead to n repeated document scans—an unacceptable effort for larger n.
In [Boncz05b], we introduced a loop-lifted variant of staircase join that allows for the evaluation of an XPath location step for multiple context sequences in a single document scan. A relational representation of the input accepted by loop-lifted staircase join is shown in the following table. For each of the n iterations (encoded in column iter), the table lists the preorder ranks γi,1, . . . , γi,si of all nodes that contribute to the context sequence in iteration i ∈ {1, . . . , n} (si denotes the length of the ith context sequence). In this representation, loop-lifted staircase join restricts pruning to nodes within each iter group.

  iter  pre
   1    γ1,1
   .    .
   1    γ1,s1
   .    .
   n    γn,1
   .    .
   n    γn,sn
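To make the per-group pruning concrete, here is a small Python sketch of our own (not Pathfinder code; the function name, the pre/size dictionaries, and the data layout are illustrative). It assumes a pre/size tree encoding and a context relation sorted on ⟨iter, pre⟩, and prunes a context node only if another context node of the same iteration already covers it:

```python
def prune_per_iteration(context, size):
    """Prune a loop-lifted context relation for the descendant axis.

    context: list of (iter, pre) tuples, sorted on (iter, pre)
    size:    dict mapping a node's preorder rank to its subtree size

    A context node is pruned only if another context node of the *same*
    iteration covers it; nodes of other iterations are left untouched.
    """
    kept = []
    cur_iter, cover_end = None, -1
    for it, pre in context:
        if it != cur_iter:                 # new iter group: reset coverage
            cur_iter, cover_end = it, -1
        if pre > cover_end:                # not inside a kept node's subtree
            kept.append((it, pre))
            cover_end = pre + size[pre]    # region [pre, pre + size(pre)]
        # else: covered within this iteration -- safe to prune
    return kept
```

For a tree with size = {0: 5, 1: 2, 2: 0, 3: 0, 4: 1, 5: 0}, the context [(1, 1), (1, 2), (2, 2)] prunes node 2 in iteration 1 (it lies below context node 1) but keeps it in iteration 2.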
Partitioning in the Loop-Lifted Case

It is a consequence of the partitioning technique in the basic staircase join algorithm that only a single context node is active (and hence "producing" result nodes) at a time while scanning the pre/post plane. Within an XQuery FLWOR clause, however, the same node(s) might occur more than once in the result, each occurrence belonging to a different iteration. The key to partitioning for the loop-lifted staircase join variant is the maintenance of a stack of context nodes, annotated with the iterations they occur in. While scanning/skipping over the document, partition boundaries now determine when to push a context node onto the stack and when to pop it off again. When loop-lifted staircase join encounters a result node, it will analyze the stack and produce result tuples for all active context nodes/iterations.
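As an illustration only (the actual MonetDB implementation described in [Boncz05b] differs in data layout and adds skipping), the stack maintenance can be sketched for the descendant axis: a context node is pushed when the scan reaches it, popped when its region ends, and every scanned node yields one result tuple per iteration currently on the stack.

```python
from collections import defaultdict

def loop_lifted_descendant(size, context):
    """Evaluate descendant steps for many iterations in one pre-order scan.

    size:    dict pre -> subtree size (all document nodes)
    context: list of (iter, pre) tuples, duplicate-free per iteration
    Returns (iter, pre) result tuples, grouped by document order.
    """
    iters_at = defaultdict(list)           # pre -> iterations using that node
    for it, pre in context:
        iters_at[pre].append(it)
    stack, result = [], []                 # frames: (region_end, iterations)
    for pre in sorted(size):               # single sequential document read
        while stack and stack[-1][0] < pre:
            stack.pop()                    # that context node's region ended
        for _, iters in stack:             # pre is a descendant of each frame
            for it in iters:
                result.append((it, pre))
        if pre in iters_at:                # push *after* producing: a node
            stack.append((pre + size[pre], iters_at[pre]))  # is not its own
    return result                          # descendant
```

On the six-node tree used above, with context node 0 bound in iteration 2 and context node 1 in iteration 1, a single scan produces both iterations' descendant sets interleaved in document order.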
Main Characteristics of Loop-Lifted Staircase Join

We refer the reader to [Boncz05b] for details on the loop-lifted staircase join implementation but will summarize its main characteristics here:
(i) Loop-lifted staircase join processes an arbitrary number of context sequences in a single sequential document read. Skipping further reduces the number of nodes examined.
(ii) The algorithm's output is an ⟨iter, pre⟩ schema, ordered on ⟨pre, iter⟩. With only minimal effort, the result may alternatively be produced in ⟨iter, pre⟩ order.
The implementation of loop-lifted staircase join in the MonetDB kernel is part of the MonetDB/XQuery system, the open-source XQuery processor that accompanies this thesis.
(a) Sample tree. (b) Corresponding pre/level space.
Figure 3.17: Order-preserving child implementation for arbitrary context sets. If, while scanning the children of context node a, we pass a second context node c, we suspend the scan process and resume after collecting the children of c.
3.4.2 Support for Non-Recursive Axes

So far, the considerations about tree properties for efficient XPath evaluation were focused on the support for recursive axes. Similar observations, however, apply to the non-recursive case as well.
Evaluating the child Axis for Entire Context Sets
Querying along XPath's child axis was found to be highly efficient for single context nodes if the execution is backed by suitable B-tree indexes. On pre/size/level-encoded data, a range scan along a concatenated ⟨level, pre⟩ index, for example, directly retrieves all children of a given context node in document order, without encountering any false hits (cf. page 30). However, if applied to all items in a context sequence one by one, the result of such an approach will typically violate the XML document order. If we evaluate the query (a, c)/child::* in the sample tree in Figure 3.17(a) for both context nodes a and c separately, the concatenated result (b, c, f, d, e) does not satisfy XPath's order semantics. In Figure 2.14, this led to the introduction of an explicit SORT operator into the query plan employed by DB2.6 Figure 3.17(b) illustrates an alternative, order-preserving technique to evaluate the child axis. The idea is to scan the child region of the first context node (a in Figure 3.17) using a ⟨level, pre⟩ index as usual. However, as soon as we pass the pre value of the next context node c in the set, we suspend the scan process
6The child axis does preserve uniqueness, though. If evaluated on a duplicate-free context set, the result of a child step will always be duplicate-free as well.
(a) Sample tree. (b) Corresponding pre/level space.
Figure 3.18: Tree-aware parent axis evaluation. On a concatenated ⟨level, pre⟩ index, a reverse scan starting at ⟨level(e) − 1, pre(e)⟩ will yield e's parent as its first hit.

and initiate a new scan for all children of c. As soon as we finish scanning for c's children, we resume the scan for the children of a. This approach is inspired by the algorithm proposed in [Rode03] and guarantees a properly sorted step result for arbitrary context sets. Its execution requires exactly |result| tuple accesses. As the child axis naturally does not produce duplicate result nodes (hence, does not require pruning), this technique of retrieving child nodes straightforwardly extends to a loop-lifted variant.
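The suspend/resume idea translates into a short recursive sketch (ours, not Pathfinder code; the ⟨level, pre⟩ index is simulated as a sorted list and the helper names are invented). Each context node scans its child range; before emitting a child that lies beyond the next pending context node, that context node's scan is run to completion first, then the outer scan resumes:

```python
import bisect

def child_step(context, size, level):
    """Order-preserving child step over a simulated <level, pre> index.

    context: context nodes (pre ranks) in document order, duplicate-free
    size, level: dicts pre -> subtree size / depth
    """
    index = sorted((level[v], v) for v in size)   # B-tree on <level, pre>
    out, nxt = [], [0]                            # nxt[0]: cursor into context

    def scan(c):
        lvl, end = level[c] + 1, c + size[c]
        i = bisect.bisect_left(index, (lvl, c + 1))
        while i < len(index) and index[i] <= (lvl, end):
            child = index[i][1]
            # suspend: a pending context node precedes this child
            while nxt[0] < len(context) and context[nxt[0]] < child:
                c2 = context[nxt[0]]; nxt[0] += 1
                scan(c2)                          # collect c2's children first
            out.append(child)                     # resume the outer scan
            i += 1

    while nxt[0] < len(context):
        c = context[nxt[0]]; nxt[0] += 1
        scan(c)
    return out
```

For the tree of Figure 3.17 (a = 0, b = 1, c = 2, d = 3, e = 4, f = 5), the context (a, c) yields (b, c, d, e, f) in document order, with exactly |result| emitted tuples.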
A parent Step With Tree Awareness

Unfortunately, the idea of scanning a concatenated ⟨level, pre⟩ index to evaluate the child axis on range-encoded data cannot be directly transferred to the evaluation of XPath's parent axis. For that axis, the respective region conditions describe a constraint on two columns (pre(v′) and size(v′)) that can no longer be mapped to a single B-tree scan:

v′ ∈ v/parent ⇔ pre(v′) < pre(v) ≤ pre(v′) + size(v′) ∧ level(v′) = level(v) − 1 .

However, considering tree-specific properties of the pre/level space, we may evaluate parent in a single index lookup nevertheless. For a given context node e (see Figure 3.18), a tree-aware parent implementation may trigger a reverse index scan7 on a concatenated ⟨level, pre⟩ index, starting at ⟨level(e) − 1, pre(e)⟩. Such
7The functionality to scan indexes in a reverse fashion may presuppose an explicit declaration during index creation. On DB2, e.g., reverse scans are enabled via the CREATE INDEX ... ALLOW REVERSE SCANS command.

a scan will always encounter e's parent node as its first hit (if any). While this approach to the evaluation of the parent axis perfectly blends with the execution model of existing databases (DB2, e.g., allows index scans to be executed as single record scans; cf. page 27), the choice of such an execution plan requires explicit knowledge about the data's tree origin. The technique sketched in Figure 3.18 is thus hardly expressible in SQL. Additional advantages for the evaluation of the parent axis may be gained if we consider the step's behavior with respect to order. On a sorted context set, the one-by-one evaluation of the parent axis guarantees document order as well [Fernández04]. Hence, no explicit sorting is required to process the step's result.
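A sketch of this single-lookup parent evaluation (the reverse index scan is simulated with a binary search; the helper name and dictionaries are ours, not part of any DBMS API):

```python
import bisect

def parent_step(v, size, level):
    """Tree-aware parent lookup via a reverse scan on a <level, pre> index.

    Starting at <level(v) - 1, pre(v)> and scanning backwards, the first
    entry on level level(v) - 1 is v's parent: it is the nearest preceding
    node one level up, and in a tree its region must enclose v.
    """
    index = sorted((level[u], u) for u in size)   # the <level, pre> index
    i = bisect.bisect_right(index, (level[v] - 1, v)) - 1   # scan start
    if i >= 0 and index[i][0] == level[v] - 1:    # first hit, if any
        p = index[i][1]
        assert p < v <= p + size[p]               # sanity: region encloses v
        return p
    return None                                   # v is a root node
```

The assertion never fires on well-formed pre/size/level data: any node at level(v) − 1 strictly between v's parent and v would have to lie inside the parent's subtree, where all nodes have level at least level(v).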
3.4.3 Staircase Join Without Staircase Join

Staircase join is a single query operator that encapsulates full tree awareness and may turn existing relational database systems into efficient tree processors. Though the necessary changes remain local to the system's query engine, the addition of staircase join still requires an invasion into the respective database kernel. In some cases, however, this is not desirable or a system may not allow the addition of kernel operators at all.

Yet, many of the observations that led us to staircase join may still be made available to such "black box" systems. We have already discovered two such optimizations with a considerable impact in Chapter 2: shrink-wrapping and symmetric rewrites. Shrink-wrapping was identified as a weaker means to describe staircase join's skipping technique and makes this technique applicable to the symmetric rewrites. Both techniques were easily expressible in SQL, hence, accessible to commercial RDBMSs such as DB2. In fact, the idea of pruning can have an equally significant impact on such systems. Pruning proved particularly effective for the preceding and following axes, where context sets of arbitrary size could be reduced to a single context node (cf. page 46). Both cases are easily expressible in SQL. E.g., for the path e/preceding::ν, we get

q(e/preceding::ν) ≡ SELECT DISTINCT v′.*
                    FROM doc AS v, doc AS v′
                    WHERE v.pre = (SELECT MAX(pre) FROM q(e))
                      AND v′.pre < v.pre AND v′.post < v.post
                      AND test(ν, v′)
                    ORDER BY v′.pre .                              (3.2)
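Query (3.2) can be tried on any SQL host. The following SQLite sketch is ours, not the DB2 setup used for the measurements: a toy four-node pre/post table, the context q(e) fixed to the d element, and no name test (ν = node()):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE doc (pre INTEGER, post INTEGER, tag TEXT)")
# pre/post ranks for the tree a(b(c), d)
con.executemany("INSERT INTO doc VALUES (?, ?, ?)",
                [(0, 3, 'a'), (1, 1, 'b'), (2, 0, 'c'), (3, 2, 'd')])

# Pruned preceding step: reduce the context to its node with maximal pre,
# then a single pre/post window selects all preceding nodes.
preceding = con.execute("""
    SELECT DISTINCT v2.pre
    FROM doc AS v, doc AS v2
    WHERE v.pre = (SELECT MAX(pre) FROM doc WHERE tag = 'd')
      AND v2.pre < v.pre AND v2.post < v.post
    ORDER BY v2.pre
""").fetchall()
# b and c precede d in document order; the ancestor a is excluded
```

The pre < pre(d) condition alone would also admit the ancestor a; the additional post < post(d) condition rules out exactly the ancestors, as in the pre/post plane arguments of Chapter 2.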
We ran the "pruned" SQL code for Queries Q3 and Q4 from Chapter 2 on our DB2 installation. The measured execution times are lined up in Figures 3.19
Figure 3.19: Rephrasing Query Q3 on the SQL level improves XPath performance by orders of magnitude (Q3: /descendant::current/preceding::initial).

and 3.20. For reasons of comparison, both figures also include the execution times observed for the unmodified queries as presented in Section 2.2.1. On both document encodings (XPath accelerator as well as range encoding), rephrasing the SQL code speeds up query evaluation by several orders of magnitude. Note that this is not only the effect of a reduced search effort due to the smaller context set. Moreover, the new queries produce significantly fewer duplicates and, hence, avoid a large amount of expensive result sorting.
3.4.4 Tree Awareness in Other Domains

The inspection of tree specifics hidden in encoded XML data is not restricted to the XPath evaluation domain. Another possibly performance-critical application is the efficient validation of XML tree fragments. This functionality, an integral part of the W3C XQuery specification, describes the verification of structural constraints on XML tree fragments, accompanied by an annotation of type information to the validated nodes. Grust and Klinger [Grust04b] describe a validation procedure that generates type annotations for encoded tree data in a single sequential document read. Their approach may benefit from type annotations already attached to parts of the overall tree (such situations arise, e.g., after the construction of tree fragments in XQuery's strict construction mode). In that case, detailed knowledge about the underlying encoding helps the algorithm to skip over the respective subtrees.
Figure 3.20: Context set pruning by a rewrite of the SQL code speeds up Query Q4 by orders of magnitude (Q4: /descendant::city/following::zipcode).
3.5 Related Work
An efficient access to the logical tree structure is crucial to the performance of any XML database system, and since XML emerged in the database domain, a large body of research has been published on the efficient evaluation of path expressions. Quite surprisingly, though, most of these suggestions are limited to the child/descendant relationships.
3.5.1 Path Evaluation on RDBMSs

Staircase join is quite similar to the multi-predicate merge join (MPMGJN) developed by Zhang et al. [Zhang01]. In contrast to the traditional merge join (equi-joins only), this new algorithm handles containment predicates in a merge-like fashion. Containment in the sense of [Zhang01] precisely corresponds to the descendant and child axes in XPath. MPMGJN assumes a pre/post-like tree encoding and, hence, is directly applicable to XPath accelerator. Much like staircase join, multi-predicate merge join is crafted for a high cache utilization and performs particularly well in that respect. The algorithm has been designed to exploit the hierarchical containment of intervals and, in contrast to staircase join, does not inspect further tree-specifics in the underlying tables.

A natural extension of the MPMGJN algorithm are the structural joins suggested by Al-Khalifa et al. [Al-Khalifa02]. They specifically target the evaluation of location steps on XML tree structures and introduce additional support for the ancestor and parent axes. The structural join algorithms inherit the skipping of irrelevant document regions from MPMGJN and guarantee a proper ordering of their output according to XML document order. They do not, however, employ staircase join's pruning technique. This makes them more easily applicable to a loop-lifted step evaluation on the one hand, but may constitute a performance penalty on the other.

The PathStack algorithm proposed by Bruno et al. takes quite a different approach to path evaluation [Bruno02]. PathStack accepts an entire path pattern as its input and matches it against an XML document instance in a holistic fashion. Support for twig patterns is provided in terms of the accompanying TwigStack algorithm that divides the search for such twigs into several path queries (evaluated with PathStack). Results are then "stitched" together to form the overall result.
PathStack may significantly benefit from the existence of indexes on the stored document instances and, hence, interoperates well with existing database technology. In this proposal, indexes provide a stream of nodes that qualify for a specific node predicate in the query pattern. In a sense, this is comparable to the pushdown of name tests through staircase join and the respective application of indexes. Though PathStack efficiently evaluates arbitrary path patterns with at most |doc| accesses to the document relation doc, its authors have already identified one of its drawbacks themselves. In contrast to the algorithms mentioned before, PathStack can only keep to XPath's ordering constraints if it is evaluated in a "blocking" fashion.
3.5.2 Tree Properties in XPath

Staircase join thoroughly exploits the semantics of XPath location steps for their efficient evaluation on pre/post-encoded data. Others have looked into similar characteristics of the query language, without the specific execution of queries on a relational back-end in mind.

The complexity of path step evaluation has been investigated by Gottlob et al. [Gottlob05]. Based on a reasonable evaluation strategy, they conclude an exponential time requirement in the query size for path expressions. This complexity is mainly a consequence of the growing size of intermediate query results. To exemplify, each repetition in the query

//a/b/parent::a/b/.../parent::a/b        (n repetitions of parent::a/b)

multiplies the intermediate result size by the number of b children below an a node. Staircase join's pruning phase minimizes the number of context nodes before executing each step. Hence, our algorithm would not suffer from the effect sketched here. In fact, experiments confirmed a linear scaling for this example. Our algorithm does suffer, however, during the evaluation of other test queries listed by Gottlob et al. This is mainly due to the lack of early-out semantics in our implementation of staircase join. Experiments 2–4 in [Gottlob05] employ XPath predicate expressions that lead to exponential sizes of intermediate results even with staircase join.

The opportunity to prune nodes from an XPath location step context has been observed by Helmer et al. in connection with the algebraic optimizer included in Natix [Helmer02]. In Natix, pruning for the preceding, following, and descendant axes is carried out with the help of explicit step functions, with the same effect as sketched in Section 3.2.1. Grust et al. have investigated the consequences of context pruning on positional predicates in XQuery [Grust04a]. It is easy to see that the removal of context nodes falsifies the query outcome in that case.
As a remedy, the authors propose a refined implementation of staircase join that directly incorporates predicates on positions. The preservation of document order in the result set of different location paths has been discussed by Fernández et al. [Fernández04]. Depending on sorting and uniqueness properties of the input to a location step, some XPath axes give similar guarantees for their evaluation result. Simplified reasonings of that kind have inspired our implementation of the non-recursive XPath axes (cf. Section 3.4.2).

4 Loop-Lifting: From XPath to XQuery
As we have seen, suitable encoding techniques can turn relational databases into highly efficient tree processors, particularly if their kernel has been explicitly enhanced for tree awareness, e.g., in terms of the staircase join operator. These techniques, however, have remained focused on the execution of XPath queries so far. In this chapter, we will leverage our relational processing stack to full XQuery support, including the evaluation of FLWOR clauses with arbitrary nesting. This iteration primitive defined in the XQuery language specification will be handled in terms of the loop-lifting technique that we already touched upon in Chapter 3. It will form the key contribution of this chapter. We propose an XQuery compilation procedure that adheres to a relational algebra dialect easily implementable on, e.g., SQL hosts.
Our assumptions about the relational back-end in our processing stack remain minimal. After setting up a consistent representation for XQuery's fundamental data type, sequences of items, we will identify our back-end requirements in Section 4.1. The loop-lifted compilation of XQuery FLWOR clauses will form the topic of Section 4.2, before we discuss additional XQuery features in Section 4.3. We will see that loop-lifted compilation goes perfectly together not only with the handling of XML tree nodes (Section 4.4), but also with the dynamic typing facilities in XQuery (Section 4.5). Experiments on a commercial SQL system in Section 4.6 confirm the applicability of our approach, before we wrap up in Section 4.7.
4.1 A Relational Algebra for XQuery
The set-oriented processing model of relational systems perfectly fits the set semantics of XPath location steps, which is of great advantage for the tree navigation performance we saw earlier. XQuery's tight ordering constraints on (sub-)expression results as well as the means to iterate over them (FLWOR expressions) seem to be quite contrary to that and require specific care in the setup of our compiler. Hence, we will first have to draw our attention to an adequate representation of XQuery item sequences by relational means.
4.1.1 Relational Sequence Encoding

The XQuery language specification considers data of two principal kinds: XML tree nodes and atomic values. They are collectively referred to as items in the W3C XQuery Data Model (XDM) specification [Fernández05]. Items may heterogeneously be assembled into sequences and the evaluation of any XQuery (sub-)expression must result in an instance of XQuery's fundamental data type, ordered, finite sequences of items. An important characteristic is the prohibition of nesting. XQuery sequences will be immediately flattened upon combination or concatenation.

This blends perfectly with a relational representation as shown by the following table, which illustrates our encoding of the sequence ("a", "b", "c", "d"):

  pos  item
   1   "a"
   2   "b"
   3   "c"
   4   "d"

In this relation, each tuple ⟨p, v⟩ encodes a single sequence item i, where p maintains the order among items (i.e., the position in the sequence) and v carries the actual payload, i.e., the value of i. We assume column item to be of a polymorphic type that may hold the values of items of atomic type as well as node surrogates for XML tree nodes (e.g., their preorder ranks pre(v)). The polymorphic column types available in modern SQL implementations (e.g., the SQL_VARIANT type in Microsoft SQL Server [SQL06]) are perfectly suited for this purpose.

We encode the empty sequence () in terms of the empty relation. A single item i and the singleton sequence (i) have identical representations, which coincides with the XQuery semantics. We will usually populate column pos with intuitive, consecutive numbers 1, 2, 3, . . . . Our compilation procedure, however, does not actually depend on such a dense numbering in the relational encoding of a sequence.
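A throwaway Python model of this encoding (function names are ours, chosen for illustration) makes the flattening behavior concrete:

```python
def seq(*items):
    """Encode a flat item sequence as a list of <pos, item> tuples."""
    return [(p, v) for p, v in enumerate(items, start=1)]

def concat(*encodings):
    """Sequence concatenation: nesting is impossible, so the result is
    flattened immediately and positions are renumbered densely."""
    flat = [v for enc in encodings for _, v in sorted(enc)]
    return seq(*flat)

empty = seq()          # the empty sequence () = the empty relation
single = seq("a")      # item "a" and singleton sequence ("a") coincide
```

Note that concat re-sorts each argument on pos before flattening, so the encoding indeed does not depend on a dense pos numbering.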
Interfacing with XML Trees and Fragments

Our approach does not prescribe any specific tree encoding for the storage of XML instances. It only requires the availability of node surrogates γ that uniquely identify each tree node. For a semantically correct XQuery translation, we assume these surrogates to correctly implement the basic XML concepts of node identity and document order, i.e., for two nodes v1, v2 and their surrogates γv1, γv2, we require
v1 is v2 ⇔ γv1 = γv2    and    v1 << v2 ⇔ γv1 < γv2 .
This requirement is satisfied by both encodings discussed in Chapter 2, in which node surrogates took the form of the nodes' preorder rank pre(v). Operators is and << then compile into integer comparisons on ranks. Other encodings serve our purpose equally well and may easily be plugged into the compilation procedure, e.g., ORDPATH labels [O'Neil04] or extended preorder ranks [Li01].

XQuery is not limited to queries on single XML documents. In general, query evaluation involves nodes from multiple documents or fragments thereof, possibly created at runtime using XQuery's element constructors. To exemplify, the query
(element a { element b { () } }, element c { () })

creates three element nodes in two independent fragments. To capture this information by relational means, we extend the relational tree encoding by a new property frag and record a unique fragment identifier for each constructed fragment. Figure 4.1 illustrates this concept for two XML fragments and the range encoding discussed in Section 2.1.7. For reasons that we will elaborate in Section 4.4.2, this encoding is a particularly good fit for the compilation approach pursued here and, hence, will be the encoding of choice for the remainder of this chapter. Note that we kept the document order of two nodes v1 and v2 from separate fragments consistent with the XQuery semantics: if v1 precedes v2 (i.e., v1 << v2 ≡ pre(v1) < pre(v2)), the same is true for any pair of nodes taken from these two fragments.
Loop-Lifting for Constant Subexpressions

Our prime concern in this chapter is the sound implementation of XQuery's iteration primitive, the for-return construct. In a nutshell, a for expression successively binds a variable $v to the items listed in the clause's in part. The return
Figure 4.1: The addition of column frag to the relational tree encoding implements the distinctness of independent XML tree fragments. This chapter assumes documents stored according to the range encoding (pre/size/level).

body e is then evaluated for each item and its sub-results are assembled to form the overall expression result:
for $v in (x1, x2, ... , xn) return e ≡ ( e[x1/$v], e[x2/$v], ... , e[xn/$v] )
(e[x/$v] denotes the consistent replacement of all free occurrences of $v in e by x). The semantics of these FLWOR clauses remains purely functional: it is sound to evaluate e for all n bindings of $v in parallel. This property suggests that we encode all bindings of $v within the loop body e in a single relation, which forms the basis for our loop-lifted compilation strategy:
(i) We represent a loop of n iterations by means of a relation loop with a single column iter of n values (e.g., 1, 2, . . . , n).
(ii) If a subexpression e occurs inside the body of an XQuery FLWOR clause, its relational representation is lifted with respect to loop (intuitively, this accounts for the n independent evaluations of the loop body).
For a constant atomic value c, lifting with respect to a given loop relation means to form the Cartesian product
         pos  item
loop ×    1    c   .
  iter       pos  item      iter  pos  item
   1     ×    1    10    ≡    1    1    10
   2                          2    1    10
   3                          3    1    10
  loop    encoding of 10   lifted encoding of 10

(a) Lifting the constant 10.

  iter  pos  item
   1     1    10
   1     2    20
   2     1    10
   2     2    20
   3     1    10
   3     2    20

(b) Lifted encoding of (10, 20).
Figure 4.2: Loop-lifting for constant subexpressions. Lifted encodings for expressions c = 10 and c = (10, 20) with respect to the loop for $v0 in (1, 2, 3) return c.
Figure 4.2(a) shows an example of loop-lifting for the constant subexpression 10, lifted with respect to the loop
for $v0 in (1, 2, 3) return 10 .
If, for example, we replace 10 by the sequence (10, 20), the loop-lifted representation consists of the six tuples shown in Figure 4.2(b) instead. To ensure compositionality for arbitrary nestings of XQuery FLWOR clauses, we will use the ⟨iter, pos, item⟩ schema throughout the whole compilation procedure. In this representation of a subexpression e, a tuple ⟨i, p, v⟩ may be read as the assumption that, during the iteration identified by i, the item at position p in e has value v. In the following, we will refer to this schema as the loop-lifted representation of the XQuery expression e. Before we derive a complete compilation scheme that exclusively operates on loop-lifted sequence representations, we will briefly formalize the target language of our compiler, the relational algebra we use.
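Loop-lifting a constant is literally the Cartesian product described above; a two-line Python model (the function name is ours) reproduces both tables of Figure 4.2:

```python
def lift(loop, encoding):
    """loop x encoding: replicate a <pos, item> sequence encoding for every
    iteration, yielding the loop-lifted <iter, pos, item> representation."""
    return [(it, pos, item) for (it,) in loop for (pos, item) in encoding]

loop = [(1,), (2,), (3,)]                       # for $v0 in (1, 2, 3) return ...
lifted_10 = lift(loop, [(1, 10)])               # cf. Figure 4.2(a)
lifted_10_20 = lift(loop, [(1, 10), (2, 20)])   # cf. Figure 4.2(b): six tuples
```

Every later compilation rule consumes and produces relations of exactly this ⟨iter, pos, item⟩ shape, which is what makes the scheme compositional.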
4.1.2 An Algebra for XQuery

Our compilation procedure uses a relational algebra dialect that consists of the operators lined up in Table 4.1. Most of these operators are rather standard or even restricted variants of the operators found in a classical relational algebra. For the join operator ⋈, e.g., it is sufficient to evaluate equality predicates only. The selection operator σa solely selects tuples with column a = true. Such a column is typically the output of an operator ◦a:⟨b1,...,bn⟩ which applies the n-ary operator ◦ to columns b1, . . . , bn and extends the input tuples by the result column a. We do not expect the underlying implementation to implicitly remove duplicates from computed relations. In particular, the union operator ∪· will only be
  πa1:b1,...,an:bn    projection and column renaming
  σa                  select tuples with a = true
  ∪· , \              disjoint union, difference
  δ                   duplicate elimination
  ×                   Cartesian product
  ⋈a=b                equi-join
  %a:⟨b1,...,bn⟩‖p    sorted row numbering (with ordering ⟨b1, . . . , bn⟩, grouping p)
  #a                  unsorted (arbitrary) row numbering
  ¢α,ν                XPath step join (axis α, node test ν)
  ε/τ                 element/text node construction
  ◦a:⟨b1,...,bn⟩      n-ary arithmetic/comparison/Boolean operator ◦
  grpa:◦ b‖p          aggregation/grouping (◦ ∈ {min, max, sum, count, . . . })
  a b                 literal table
Table 4.1: A relational algebra of rather standard operators constitutes the target language of our compiler (a, b column names).

applied to disjoint arguments. Our compiler triggers any required duplicate removal with the explicit δ operator. All in all, this forms a rather simplistic relational algebra and we will later benefit from its assembly style notation when it comes to effective algebraic optimizations and rewrites.
A Tribute to XQuery’s Order Semantics: % and #
Though our compiler's target language operates on relations, an inherently unordered data model, we still demand strict adherence to XQuery language specifications, including its prevalent concept of order. To this end, the generated plans make frequent use of the two row numbering operators % and #. Given the sort order defined by columns b1, . . . , bn, operator %a:⟨b1,...,bn⟩‖p produces consecutive row numbers in the new column a. Row numbers re-start at 1 for each partition defined by the optional grouping attribute p.

This type of row number assignment can induce a significant cost: a physical implementation of % will typically involve a blocking—hence, expensive—sort of its input relation. If the actual order among assigned numbers does not matter to the semantics of the plan, we may thus better trade the sorted row numbering operator % for its unsorted counterpart #, which simply enumerates tuples as it goes in arbitrary order. Both variants are readily provided by many RDBMS implementations. According to the OLAP amendment to the SQL:1999 standard,
e.g., compliant systems provide %a:⟨b1,...,bn⟩‖p(q) in terms of [Melton03]
SELECT *, ROW_NUMBER() OVER (PARTITION BY p ORDER BY b1, . . . , bn) AS a
FROM q .

Database hosts operating on ordered relations may even provide such numbering for free. We have already discussed the void columns in the MonetDB [Boncz02] RDBMS in Chapter 2. To cope with the unsorted numbering operator #, a system's internal row identifier could serve as an implementation of # free of charge. The exploitation of both opportunities is on our agenda in Chapter 5 where we discuss order-aware optimizations in Pathfinder.
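The semantics of % can be modeled in a few lines of Python (the row layout as dictionaries and the helper names are illustrative, not part of any system): rows are numbered in the given sort order, with the counter restarting per partition, mirroring SQL's ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...).

```python
def rownum(rows, order_by, part=None):
    """%a:<b1,...,bn>||p : consecutive row numbers per partition, assigned
    in the sort order given by order_by; column a carries the result."""
    key_p = (lambda r: r[part]) if part else (lambda r: None)
    counters, numbers = {}, {}
    for r in sorted(rows, key=lambda r: tuple(r[b] for b in order_by)):
        n = counters.get(key_p(r), 0) + 1      # restart at 1 per partition
        counters[key_p(r)] = n
        numbers[id(r)] = n
    # attach column a; the input's (arbitrary) tuple order is kept
    return [{**r, "a": numbers[id(r)]} for r in rows]

rows = [{"p": 1, "b": 30}, {"p": 1, "b": 10}, {"p": 2, "b": 20}]
numbered = rownum(rows, order_by=["b"], part="p")
```

An unsorted # would instead attach arbitrary but distinct numbers, e.g., a plain enumerate over the input as it stands, with no sort at all.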
Access to XML Trees and Fragments

To keep our compilation procedure independent of any specific tree encoding, we express access to XML tree nodes in terms of explicit operators that we assume to be provided by the underlying back-end. Operators ¢, ε, and τ, e.g., describe the evaluation of XPath location steps and the construction of transient element/text nodes, respectively. On range-encoded data, the staircase join from Chapter 3 provides an efficient implementation for ¢; other means to evaluate path steps include TwigStack [Bruno02] or structural joins [Al-Khalifa02].1 We will sketch the semantics as well as a possible implementation for ε and τ in Section 4.4.2.

The tree access operators are expected to operate on node surrogates γ as demanded earlier. Typically, these surrogates point into a database table doc of persistently stored XML documents or refer to XML tree fragments constructed at runtime. In order to have the respective node containers at hand whenever an algebraic operator requires access to it,2 our compiler maintains a set of live nodes ∆ associated with every XQuery (sub-)expression. Note that we will postpone a detailed discussion on live nodes to Section 4.4.1. For the time being, the reader may think of the maintenance of ∆ as a simple form of data flow analysis for XML tree nodes.
¹ You may want to read the ¢ symbol as “step operator” there.
² This does not only apply to operators ¢, ε, and τ, but also, e.g., to a post-processing procedure that serializes the overall query result back into XML.

4.1.3 A Ruleset to Compile XQuery

The core of our XQuery-to-relational-algebra compiler is described in terms of inference rules that define the · ⇒ · (read “compiles to”) function. In these rules, a judgment of the form

  Γ; loop ⊢ e ⇒ (q, ∆)

reads that, given
(i) an environment Γ that maps any variable $v that may occur free in e to its algebraic equivalent (qv, ∆v), and
(ii) a relation loop that describes the iteration context of e,

the XQuery (sub-)expression e compiles to the algebraic expression q with an associated set of live nodes ∆. Our compiler is fully compositional: to translate any subexpression independently of its surroundings, both input and output of the · ⇒ · function are represented in a loop-lifted manner, i.e., they adhere to the schema ⟨iter, pos, item⟩. We invoke the compilation process with the top-level XQuery expression, an empty environment Γ = ∅, and a singleton loop relation whose single column iter contains the lone value 1, which indicates that the top-level expression is not embedded in any iteration.

Our compilation process accepts the normalized XQuery Core dialect as its input, which basically represents a subset of the XQuery surface language, with most syntactic sugar removed. In accordance with the W3C Formal Semantics specification [Draper05], the Pathfinder XQuery compiler transforms incoming queries into XQuery Core before initiating the actual compilation. This transformation, e.g., trades where clauses in XQuery FLWOR expressions for the respective conditional if-then-else and positional predicates for an explicit iteration over the context set. The output of the compiler is a single algebra query and a set of live nodes (in the form of an algebraic expression, too) associated with it. The generated tuples will typically be consumed by a post-processing step that serializes the query result back into XML.
4.1.4 Basic XQuery Expressions

Equipped with this notation, we can now formalize the loop-lifting idea that we have already seen for constant subexpressions.
Constant Subexpressions

The informal derivation of loop-lifted sequence representations (page 76) directly translates into the inference rule that handles literal values in XQuery:
  ─────────────────────────────────────────────── (Const)
  Γ; loop ⊢ c ⇒ ( loop × ⟨pos:1, item:c⟩ , ∅ )

Since no nodes can be contained in the expression result of a literal atomic value, the compiler output is associated with the empty set of live nodes ∅.
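Rule Const amounts to a single Cartesian product, which we can sketch directly (illustrative Python, not the actual compiler; `compile_const` is our name):

```python
# Rule Const as a sketch: a literal c compiles to loop × ⟨pos:1, item:c⟩,
# i.e., one copy of c per active iteration.

def compile_const(loop, c):
    return [{"iter": t["iter"], "pos": 1, "item": c} for t in loop]

# three active iterations, e.g., inside "for $v0 in (1,2,3) return 10"
loop = [{"iter": 1}, {"iter": 2}, {"iter": 3}]
q = compile_const(loop, 10)
# q now encodes the singleton sequence (10) in each of the three iterations
```

The output matches the loop-lifted encoding q0(10) discussed later in Figure 4.4(c).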
  (a) Encoding q1 of e1:    (b) Encoding q2 of e2:
      iter pos item             iter pos item
       1    1    1               1    1    2
       2    1   10               2    1   30
       2    2   20

  (c) Temporary column attachment:    (d) Encoding of (e1, e2):
      iter pos item ord p               iter pos item
       1    1    1   1  1                1    1    1
       1    1    2   2  2                1    2    2
       2    1   10   1  1                2    1   10
       2    2   20   1  2                2    2   20
       2    1   30   2  3                2    3   30
Figure 4.3: Sequence construction: a temporary ord column facilitates the establishment of the correct sequence order (temporary column p) over the expression result (cf. Rule Seq).
4.1.5 Sequence Construction
Essentially, we compute the sequence construction operator (e1, e2) in terms of a disjoint union of the relational encodings q1 and q2 of e1 and e2:
  Γ; loop ⊢ e1 ⇒ (q1, ∆1)    Γ; loop ⊢ e2 ⇒ (q2, ∆2)
  ─────────────────────────────────────────────────── (Seq)
  Γ; loop ⊢ (e1, e2) ⇒
    ( πiter,pos:p,item( %p:⟨ord,pos⟩/iter( (⟨ord:1⟩ × q1) ∪· (⟨ord:2⟩ × q2) ) ) , ∆1 ∪ ∆2 )
To ensure the correct logical ordering of the expression result (items coming from e1 must appear before those from e2 in pos order), we temporarily extend both operands q1 and q2 by a column ord that encodes the order of the arguments. After the union, Rule Seq sets up the final pos column (named p first to avoid name clashes) using the renumbering operator %p:⟨ord,pos⟩/iter, before the projection operator π restores the ⟨iter, pos, item⟩ schema. Note that this approach evaluates the sequence construction for all iterations encoded in q1 and q2 at once.

Figure 4.3 demonstrates how the resulting algebra code is evaluated. Relation q1 (Figure 4.3(a)) encodes two sequences in two iterations: the singleton sequence (1) in iteration 1 and the two-item sequence (10, 20) in the second iteration. The second argument of the sequence constructor consists of the two singletons (2) and (30) in iterations 1 and 2, respectively, as displayed in Figure 4.3(b). The disjoint union followed by the renumbering operator in Rule Seq yields the intermediate table shown in Figure 4.3(c). The overall expression result after column projection/renaming is the relation in Figure 4.3(d).

Note that the “append union” that systems typically employ to evaluate ∪· will lead to a costly sort of the intermediate result to implement row numbering. Assuming properly sorted input relations, however, we would be much better off evaluating ∪· as a “merge union”, which would render the subsequent % a no-cost operation. We will have a closer look at optimizations of this kind in the next chapter.
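The evaluation of Rule Seq on the data of Figure 4.3 can be sketched as follows (illustrative Python, not the compiler's actual back-end code; `rho` and `seq` are our names):

```python
# Rule Seq as a sketch: attach ord, take the disjoint union, renumber
# per iteration, then project back to the ⟨iter, pos, item⟩ schema.

def rho(rel, a, order_by, part):
    """%a:⟨order_by⟩/part — consecutive numbering within each partition."""
    out, counters = [], {}
    for t in sorted(rel, key=lambda t: (t[part],) + tuple(t[b] for b in order_by)):
        counters[t[part]] = counters.get(t[part], 0) + 1
        out.append({**t, a: counters[t[part]]})
    return out

def seq(q1, q2):
    union = [{**t, "ord": 1} for t in q1] + [{**t, "ord": 2} for t in q2]
    renum = rho(union, "p", ["ord", "pos"], part="iter")
    return [{"iter": t["iter"], "pos": t["p"], "item": t["item"]} for t in renum]

# the encodings of Figures 4.3(a) and (b)
q1 = [{"iter": 1, "pos": 1, "item": 1},
      {"iter": 2, "pos": 1, "item": 10},
      {"iter": 2, "pos": 2, "item": 20}]
q2 = [{"iter": 1, "pos": 1, "item": 2},
      {"iter": 2, "pos": 1, "item": 30}]

result = seq(q1, q2)
# iteration 1 yields (1, 2); iteration 2 yields (10, 20, 30) — Figure 4.3(d)
```

The single `rho` call over the ∪· result is exactly the blocking sort that the “merge union” optimization mentioned above would avoid.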
Variable Binding and Usage

We handle variable bindings expressed through XQuery’s let construct entirely at query compilation time: to compile let $v := e1 return e2, we translate e1 in the currently active context (consisting of the environment Γ and relation loop) to yield the algebra code q1 and then use the enriched environment Γ + {$v ↦ (q1, ∆1)} to compile e2:
  Γ; loop ⊢ e1 ⇒ (q1, ∆1)    Γ + {$v ↦ (q1, ∆1)}; loop ⊢ e2 ⇒ (q2, ∆2)
  ──────────────────────────────────────────────────────────────────── (Let)
  Γ; loop ⊢ let $v := e1 return e2 ⇒ (q2, ∆2)

A reference to a variable $v then simply compiles to a lookup in Γ:
  ──────────────────────────────────────────────── (Var)
  {..., $v ↦ (qv, ∆v), ...}; loop ⊢ $v ⇒ (qv, ∆v)

Note that the maintenance of Γ is performed entirely at compile time. Effectively, this unfolds any let binding during the compilation process via Rules Let and Var. Unused bindings will vanish from the compilation result without any additional treatment. Such a strategy is likely to violate the XQuery semantics if node constructors appear in the binding expression e1 of a let clause. In Section 4.4.3, we will see why our compiler guarantees standards compliance nevertheless. We will now turn to the heart of the loop-lifting compilation strategy, the translation of XQuery’s FLWOR clauses.
4.2 Relational FLWORs
The iteration primitive for-return constitutes a vital part of the XQuery language. The arbitrary nesting of these FLWOR clauses introduces a notion of scopes into the language. For instance, the query

  s {  ( for $v0 in e0 return
             s0 { e′0
       ,
         for $v1 in e1 return
             s1 { for $v1·0 in e1·0 return
                      s1·0 { e′1·0
       )                                         (Q1)
  (a) q((1,2,3)):     (b) q0($v0):       (c) q0(10):
      iter pos item       iter pos item      iter pos item
       1    1    1         1    1    1        1    1   10
       1    2    2         2    1    2        2    1   10
       1    3    3         3    1    3        3    1   10

  (d) q0((10,$v0)):   (e) Result in s:
      iter pos item       iter pos item
       1    1   10         1    1   10
       1    2    1         1    2    1
       2    1   10         1    3   10
       2    2    2         1    4    2
       3    1   10         1    5   10
       3    2    3         1    6    3
Figure 4.4: Relational representation of intermediate results in Query Q2.
defines four variable scopes as indicated by the curly braces. Variable $v0 is visible in scope s0, variable $v1 in scopes s1 and s1·0, and $v1·0 may be accessed in scope s1·0 only. No variables are bound in the top-level scope s in this example. These scopes coincide with the iteration scopes introduced by the respective for loops. The iteration over expressions found in scope s0, e.g., is determined by the bindings of variable $v0.

It is important to see that FLWOR clauses may be nested in an arbitrary fashion. In general, this leads to a tree-shaped hierarchy of scopes; for Query Q1, scope s is the root with children s0 and s1, and s1·0 is the single child of s1. We will use sx·y, x ∈ {0, 1, ...}*, y ∈ {0, 1, ...}, in the following to identify the yth child of scope sx. The loop-lifted representation of any expression result e is determined by the iteration scope sx it appears in, and we will use qx(e) to denote it. It is the task of the upcoming compilation rules (the back-mapping step in Section 4.2.4 in particular) to keep all relational representations consistent with their respective iteration scopes.
4.2.1 for-Bound Variables

To illustrate the consequences of our loop-lifting idea on XQuery FLWOR clauses, consider the query

  s {  for $v0 in (1, 2, 3) return
           s0 { (10, $v0) .                      (Q2)
Figure 4.4 lines up the relational representations encountered for different subexpressions during the evaluation of this query. The top-level scope s is iterated only once, leading to a constant iter column in the relational representation of the sequence (1, 2, 3) in this scope (q((1, 2, 3)) in Figure 4.4(a)). We now want to successively bind variable $v0 to the items in this sequence, i.e., $v0 describes a singleton sequence in three different iterations. The relation in
Figure 4.4(b) accounts for this fact with a constant pos column and three distinct values in column iter. We can easily compute the representation of $v0 as follows:

  q0($v0) ≡ ⟨pos:1⟩ × πiter:inner,item( %inner:⟨iter,pos⟩( q((1,2,3)) ) ) .   (4.1)
We retain the values of the binding sequence (1, 2, 3) (to which $v0 shall be bound successively). To establish a new iter column with a consecutive numbering, we use the row numbering operator %. Column pos is then filled with a constant value of 1, which we express as a Cartesian product with the singleton relation ⟨pos:1⟩.

The construction of the sequence (10, $v0) in scope s0 then yields the representation shown in Figure 4.4(d). A back-mapping step (see Section 4.2.4) maps the result back into the top-level scope s (Figure 4.4(e)), establishing the overall query result.
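Equation 4.1 can be sketched directly (illustrative Python, not the compiler itself; `bind_var` is our name for the composed %, π, and × steps):

```python
# Equation 4.1 as a sketch: derive q0($v0) from the encoding of the
# binding sequence — renumber to obtain fresh iter values, then attach
# the constant pos column via a Cartesian product with ⟨pos:1⟩.

def bind_var(q):
    out = []
    # %inner:⟨iter,pos⟩, then π_iter:inner,item, then × ⟨pos:1⟩
    for inner, t in enumerate(sorted(q, key=lambda t: (t["iter"], t["pos"])), 1):
        out.append({"iter": inner, "pos": 1, "item": t["item"]})
    return out

# q((1,2,3)) of Figure 4.4(a): one iteration, three positions
q_seq = [{"iter": 1, "pos": 1, "item": 1},
         {"iter": 1, "pos": 2, "item": 2},
         {"iter": 1, "pos": 3, "item": 3}]

q_v0 = bind_var(q_seq)
# Figure 4.4(b): iter 1..3, constant pos 1, items 1, 2, 3
```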
The General Case
To generalize this idea, consider a for-loop in its directly enclosing scope sx:

        ⋮
  sx {  for $vx·y in ex·y return
            sx·y { e′x·y                         (Q3)
        ⋮
Again, we derive qx·y($vx·y), the representation of the iteration variable in the inner scope, from the representation qx(ex·y) of its binding sequence ex·y, and employ operators % and × for that task:

  qx·y($vx·y) ≡ ⟨pos:1⟩ × πiter:inner,item( %inner:⟨iter,pos⟩( qx(ex·y) ) ) .   (4.2)
Positional Variables
Optionally, a second variable $px·y may be bound to the position of the iteration variable $vx·y within its binding sequence ex·y:

        ⋮
  sx {  for $vx·y at $px·y in ex·y return
            sx·y { e′x·y
        ⋮
With the help of the % operator, we can easily establish the relational representation of such a positional variable based on the binding sequence ex·y: replace column item with a consecutive numbering within each iter group, then set up columns iter and pos as described above for the iteration variable $vx·y:
  qx·y($px·y) ≡ ⟨pos:1⟩ × πiter:inner,item( %inner:⟨iter,pos⟩( %item:⟨pos⟩/iter( πiter,pos( qx(ex·y) ) ) ) ) .   (4.3)
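The effect of Equation 4.3 can be sketched as follows (illustrative Python, not the compiler itself; `bind_pos` is our name):

```python
# Equation 4.3 as a sketch: the positional variable carries, as its item,
# the position of each binding within its iter group of the binding sequence.

def bind_pos(q):
    out, counters, inner = [], {}, 0
    for t in sorted(q, key=lambda t: (t["iter"], t["pos"])):
        # %item:⟨pos⟩/iter — number tuples within each iter group
        counters[t["iter"]] = counters.get(t["iter"], 0) + 1
        # %inner:⟨iter,pos⟩ — fresh iter values, as for the iteration variable
        inner += 1
        out.append({"iter": inner, "pos": 1, "item": counters[t["iter"]]})
    return out

# a binding sequence with two iterations: (10, 20) in iter 1, (30) in iter 2
q = [{"iter": 1, "pos": 1, "item": 10},
     {"iter": 1, "pos": 2, "item": 20},
     {"iter": 2, "pos": 1, "item": 30}]

q_p = bind_pos(q)
# items 1, 2, 1 — the positions within the respective binding sequences
```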
We make both the iteration variable $vx·y and the positional variable $px·y accessible to the compilation of the loop body e′x·y (see Rule Var) by adding them to the variable environment Γ before compiling e′x·y.
4.2.2 Maintaining loop

The concept of loop-lifting requires the consistent maintenance of a relation loop, containing the independent iterations of the current scope. The semantics of the for construct prescribes the evaluation of its inner scope sx·y once for each binding of the iteration variable $vx·y. Based on the encoding of the latter, qx·y($vx·y), we define a new loop relation to compile the expression’s return body e′x·y:
  loopx·y ≡ πiter( qx·y($vx·y) ) .   (4.4)
This kind of loop relation was used to derive the loop-lifted representation of the constant 10 in earlier examples (cf. Figures 4.2(a) and 4.4(c)).
4.2.3 Free Variables in the return Clause
A variable introduced by means of an XQuery FLWOR clause in scope sx will be accessible in any scope sx·x′, x′ ∈ {0, 1, ...}+, enclosed by sx. If scope sx·x′ is viewed in isolation, such variables appear to be free. As the compiled representation of any expression depends on the iteration scope it appears in, any variable $w accessible in scope sx must be made compatible with the enclosed scope sx·y upon compilation of the FLWOR clause Q3. We will now derive qx·y($w) ($w’s representation in scope sx·y) from its relational encoding in the enclosing scope sx. To understand this derivation, consider the following
  (a) q0($a):       (b) q0·0($a):     (c) q0·0($b):     (d) map(0,0·0):
      iter pos item     iter pos item     iter pos item     outer inner
       1    1    1       1    1    1       1    1   10        1     1
       2    1    2       2    1    1       2    1   20        1     2
                         3    1    2       3    1   10        2     3
                         4    1    2       4    1   20        2     4
Figure 4.5: Scope-dependent representation of variables $a and $b (Query Q4). The connection between iteration scopes s0 and s0·0 is captured by relation map(0,0·0).

evaluation of two nested for loops (note the reference to $a in the inner scope s0·0):

  s {  for $a in (1, 2) return
           s0 { ( $a,
                  for $b in (10, 20) return
                      s0·0 { ($a, $b)
                )                                (Q4)
The first iteration of the outer loop binds $a to 1. Two evaluations of the innermost scope s0·0 correspond to this iteration, with $b being successively bound to 10 and 20. The second iteration of the outer loop involves another two eval- uations of the innermost loop body, both with $a bound to 2. In effect, scope s0·0 is iterated four times. Figures 4.5(a)–(c) illustrate the required relations for variables $a and $b. Note how Figures 4.5(b) and 4.5(c) reflect that in iteration 3 of the inner loop body variable $a is bound to 2 while $b is bound to 10, as desired.
The semantics of this nested iteration is captured by the relation map(0,0·0) shown in Figure 4.5(d), where an entry ⟨o, i⟩ indicates that during iteration i of the inner loop body in scope s0·0 the outer body (scope s0) is in iteration o. In Figure 4.5(d), e.g., iterations 1 and 2 of the inner loop body (scope s0·0) correspond to the first iteration over the outer scope s0; iterations 3 and 4 in s0·0 occur within the second iteration over s0.
More generally, we introduce map(x,x·y) to relate iteration identifiers of a scope sx·y to those of its directly enclosing scope sx. The iteration identifiers of scope sx·y originate from the derivation of qx·y($vx·y), the relational encoding of the iteration variable $vx·y (see Equation 4.2). Analogously to the derivation of qx·y($vx·y), we compute relation map(x,x·y) based on qx(ex·y), the relational representation of the
binding sequence ex·y:
  map(x,x·y) ≡ πouter:iter,inner( %inner:⟨iter,pos⟩( qx(ex·y) ) ) .   (4.5)

With this connection between enclosing iteration scopes at hand, we can now map variables to the inner scope of a for-loop and derive the representation of any free variable $w in scope sx·y by means of an equi-join:
  qx·y($w) ≡ πiter:inner,pos,item( qx($w) ⋈iter=outer map(x,x·y) ) .   (4.6)

To make this representation available during the compilation of the for body, we insert the binding $w ↦ qx·y($w) into the variable environment Γ used to compile the body (along with the bindings for the iteration and positional variables, $vx·y and $px·y). Rule Var will then establish compatible representations for each occurrence of a reference to variable $w.
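Equations 4.5 and 4.6 can be sketched on the data of Query Q4 (illustrative Python, not the compiler itself; `build_map` and `map_var` are our names):

```python
# Equations 4.5 and 4.6 as a sketch: build map(x,x·y) from the binding
# sequence, then map a free variable into the inner scope via an equi-join.

def build_map(q_bind):
    """Equation 4.5: pair each fresh inner iteration id with the outer
    iteration (iter) it originates from."""
    ordered = sorted(q_bind, key=lambda t: (t["iter"], t["pos"]))
    return [{"outer": t["iter"], "inner": inner}
            for inner, t in enumerate(ordered, 1)]

def map_var(q_w, mp):
    """Equation 4.6: equi-join on iter = outer, then adopt inner as iter."""
    return [{"iter": m["inner"], "pos": t["pos"], "item": t["item"]}
            for t in q_w for m in mp if t["iter"] == m["outer"]]

# Query Q4: the inner binding sequence (10,20), evaluated in iterations 1 and 2
q_bind = [{"iter": 1, "pos": 1, "item": 10}, {"iter": 1, "pos": 2, "item": 20},
          {"iter": 2, "pos": 1, "item": 10}, {"iter": 2, "pos": 2, "item": 20}]
mp = build_map(q_bind)        # Figure 4.5(d): (1,1), (1,2), (2,3), (2,4)

q_a = [{"iter": 1, "pos": 1, "item": 1}, {"iter": 2, "pos": 1, "item": 2}]
q_a_inner = map_var(q_a, mp)  # Figure 4.5(b): $a in the inner scope s0·0
```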
4.2.4 Mapping Back

Based on the relational representations of variables $a and $b, the system is now ready to compute the sequence construction ($a, $b) in the innermost loop body (scope s0·0) of Query Q4, yielding the table shown in Figure 4.6(a). Note, however, that the second sequence construction operation ($a, for $b in ...) in this query requires this intermediate result to be compatible with the enclosing scope s0 (i.e., as illustrated in Figure 4.6(b)). Hence, we need to map q0·0(($a, $b)) back into s0.

In a sense, this back-mapping step implements the assembly of all item sequences from independent loop iterations into a single result sequence. Relation map(x,x·y) again provides the necessary connection between two directly enclosing scopes sx and sx·y, and an equi-join with map(x,x·y) retrieves the required iteration identifiers from the parent scope. A subsequent % operator brings all result items into correct order:

  πiter:outer,pos:pos1,item( %pos1:⟨inner,pos⟩/outer( qx·y(e) ⋈iter=inner map(x,x·y) ) ) .
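The back-mapping step can be sketched on the data of Figure 4.6 (illustrative Python, not the compiler itself; `map_back` is our name):

```python
# Back-mapping as a sketch: join the inner result with map, renumber per
# outer iteration (%pos1:⟨inner,pos⟩/outer), and adopt outer as the new iter.

def map_back(q_inner, mp):
    joined = [{**t, "outer": m["outer"], "inner": m["inner"]}
              for t in q_inner for m in mp if t["iter"] == m["inner"]]
    joined.sort(key=lambda t: (t["outer"], t["inner"], t["pos"]))
    out, counters = [], {}
    for t in joined:
        counters[t["outer"]] = counters.get(t["outer"], 0) + 1
        out.append({"iter": t["outer"], "pos": counters[t["outer"]],
                    "item": t["item"]})
    return out

# q0·0(($a,$b)) of Figure 4.6(a) and map(0,0·0) of Figure 4.5(d)
q_inner = [{"iter": i, "pos": p, "item": v}
           for i, pair in enumerate([(1, 10), (1, 20), (2, 10), (2, 20)], 1)
           for p, v in enumerate(pair, 1)]
mp = [{"outer": 1, "inner": 1}, {"outer": 1, "inner": 2},
      {"outer": 2, "inner": 3}, {"outer": 2, "inner": 4}]

back = map_back(q_inner, mp)
# Figure 4.6(b): (1,10,1,20) in outer iteration 1, (2,10,2,20) in iteration 2
```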
This delivers the missing pieces to complete the evaluation of Query Q4. Figure 4.6 lines up the intermediate results involved: the result of back-mapping the
  (a) q0·0(($a,$b)):   (b) Inner loop in s0:   (c) q0($a):       (d) Final result in s:
      iter pos item        iter pos item           iter pos item      iter pos item
       1    1    1          1    1    1             1    1    1        1    1    1
       1    2   10          1    2   10             2    1    2        1    2    1
       2    1    1          1    3    1                                1    3   10
       2    2   20          1    4   20                                1    4    1
       3    1    2          2    1    2                                1    5   20
       3    2   10          2    2   10                                1    6    2
       4    1    2          2    3    2                                1    7    2
       4    2   20          2    4   20                                1    8   10
                                                                       1    9    2
                                                                       1   10   20
Figure 4.6: Intermediate and final results during the evaluation of Query Q4.
inner loop body into scope s0 (Figure 4.6(b)), made compatible with the repre- sentation of variable $a in scope s0 (Figure 4.6(c)), and, finally, the overall query result after mapping back the sequence construction result into the top-level scope (Figure 4.6(d)).
4.2.5 Complete Compilation Rule for FLWOR Expressions

Inference rule For lines up the complete translation of XQuery for loops, also ensuring the proper maintenance and propagation of the loop relation through the compilation process:
  ①  {..., $vi ↦ (qvi, ∆vi), ...}; loop ⊢ e1 ⇒ (q1, ∆1)
  ②  qv ≡ ⟨pos:1⟩ × πiter:inner,item( %inner:⟨iter,pos⟩(q1) )
  ③  qp ≡ ⟨pos:1⟩ × πiter:inner,item( %inner:⟨iter,pos⟩( %item:⟨pos⟩/iter( πiter,pos(q1) ) ) )
  ④  loopv ≡ πiter(qv)
      map ≡ πouter:iter,inner( %inner:⟨iter,pos⟩(q1) )
  ⑤  Γv ≡ {..., $vi ↦ ( πiter:inner,pos,item( qvi ⋈iter=outer map ), ∆vi ), ...}
            + {$v ↦ (qv, ∆1)} + {$p ↦ (qp, ∅)}
  ⑥  Γv; loopv ⊢ e2 ⇒ (q2, ∆2)
  ─────────────────────────────────────────────────────────────────────── (For)
  {..., $vi ↦ (qvi, ∆vi), ...}; loop ⊢ for $v at $p in e1 return e2 ⇒
  ⑦    ( πiter:outer,pos:pos1,item( %pos1:⟨inner,pos⟩/outer( q2 ⋈iter=inner map ) ) , ∆2 )

The compilation procedure captured by this rule may be summarized as follows:
① Compile the binding expression e1 in the current context.

② Derive the relational representation of the iteration variable $v.

③ If present in the query text, establish the representation of the positional variable $p.

④ Set up relations loop and map to prepare the compilation of the loop body e2.

⑤ The new variable environment Γv for the compilation of e2 consists of all variables currently free (after mapping them to the scope of the loop body), the iteration variable $v, and (optionally) the positional variable $p.

⑥ In this new context, invoke the compilation of the loop body e2.

⑦ Finally, map the compilation result of e2 back into the parent scope using relation map.
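The steps above can be traced end-to-end on Query Q2, for $v0 in (1, 2, 3) return (10, $v0) (an illustrative Python sketch following Rule For, not the compiler itself; all variable names are ours):

```python
# Rule For as a sketch, "running" the generated plan for Query Q2.

# step 1: the binding sequence (1,2,3); the top-level loop has iteration 1 only
q1 = [{"iter": 1, "pos": p, "item": v} for p, v in enumerate([1, 2, 3], 1)]

# steps 2 and 4: qv, map, and loopv all derive from one row numbering of q1
numbered = list(enumerate(sorted(q1, key=lambda t: (t["iter"], t["pos"])), 1))
qv    = [{"iter": i, "pos": 1, "item": t["item"]} for i, t in numbered]
mp    = [{"outer": t["iter"], "inner": i} for i, t in numbered]
loopv = [{"iter": t["iter"]} for t in qv]

# step 6: compile the return body (10, $v0) — Rule Const for 10, then Rule Seq
q10   = [{"iter": t["iter"], "pos": 1, "item": 10} for t in loopv]
union = [{**t, "ord": 1} for t in q10] + [{**t, "ord": 2} for t in qv]
union.sort(key=lambda t: (t["iter"], t["ord"], t["pos"]))
q2, c = [], {}
for t in union:
    c[t["iter"]] = c.get(t["iter"], 0) + 1
    q2.append({"iter": t["iter"], "pos": c[t["iter"]], "item": t["item"]})

# step 7: map the body result back into the top-level scope via map
joined = [{**t, "outer": m["outer"]} for t in q2 for m in mp
          if t["iter"] == m["inner"]]
joined.sort(key=lambda t: (t["outer"], t["iter"], t["pos"]))
result, c = [], {}
for t in joined:
    c[t["outer"]] = c.get(t["outer"], 0) + 1
    result.append({"iter": t["outer"], "pos": c[t["outer"]], "item": t["item"]})
# result encodes the single sequence (10, 1, 10, 2, 10, 3) — Figure 4.4(e)
```

The intermediate relation q2 corresponds to Figure 4.4(d), and the final relation to Figure 4.4(e).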
4.2.6 Optional: The order by Clause
Rule For above will concatenate the result sequences for individual evaluations of the return body in the order prescribed by the binding sequence e1 to form the overall FLWOR expression result. Optionally, a user may explicitly request a different arrangement of all subsequences by stating an order by clause. The results of the independent evaluations of e2 in the loop
  for $v at $p in e1
  order by eo1, eo2, ..., eon
  return e2 ,

for example, will be concatenated according to the lexicographic order described by expressions eo1, ..., eon (with major ordering on eo1; expressions eoi will be evaluated for each binding of $v). Note that this only affects the assembly of the return clause’s result sequences, but not the outcome of individual evaluations of e2. In particular, the positional variable $p still indicates the position of the binding variable $v within the binding sequence e1. It is the back-mapping step ⑦ that implements the assembly of result sequences in Rule For. With only a slight extension to the map relation, we may seamlessly integrate the handling of order by clauses into our compilation process. We translate all sort specifiers eoi analogously to the compilation of the loop body e2 itself and extend the relation map to collect the item columns of all eoi, yielding
the relation map′(x,x·y) with schema ⟨outer, inner, sort1, ..., sortn⟩:

  map′(x,x·y) ≡ πouter,inner,sort1,...,sortn:item(
                  ( ··· πouter,inner,sort1,sort2:item(
                          πouter,inner,sort1:item( map(x,x·y) ⋈inner=iter qx·y(eo1) )
                          ⋈inner=iter qx·y(eo2) )
                    ··· )
                  ⋈inner=iter qx·y(eon) ) .   (4.7)
Based on map′, we can now derive the result’s position information (column pos) consistent with the order by clause. The result of the back-mapping step ⑦ in Rule For then reads (note the modified parameters to %):