Technische Universität München
Fakultät für Informatik
Lehrstuhl III – Datenbanksysteme

DISSERTATION

Enhancing Relational Database Systems for Managing Hierarchical Data

Robert Anselm Brunel

Full reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Prof. Dr. Helmut Seidl
Examiners of the dissertation:
1. Prof. Alfons Kemper, Ph. D.
2. Prof. Dr. Torsten Grust

The dissertation was submitted to the Technische Universität München on 07.08.2017 and accepted by the Fakultät für Informatik on 10.11.2017.

Abstract

Hierarchical structures permeate our daily lives, and as such are also featured in many of the software applications we work with. Unfortunately, maintaining and querying hierarchical data in the underlying database systems remains a clumsy and inconvenient task even today, almost 50 years after the relational data model was first conceived. Now that modern in-memory engines are becoming more capable than ever, we take the opportunity to revisit the challenge of making these systems truly “hierarchy-aware.” Our approach is based on modeling hierarchies in relational tables in an intuitive and light-weight way by means of a built-in abstract data type. This opens up new opportunities both on the level of the SQL query language as well as at the heart of the database engine, which we fully exploit: We extend SQL with concise but expressive constructs to represent, bulk-load, manipulate, and query hierarchical tables. An accompanying set of relational algebra concepts and algorithms ensure these constructs can be translated into highly efficient query plans. We design these algorithms in terms of a carefully crafted generic framework for representing and indexing hierarchies, which enables us to freely choose among a variety of sophisticated indexing schemes at the storage layer according to the application scenario at hand. While we strive to leverage the decades of existing research on indexing and processing semi-structured data in our framework, we further push the limits when it comes to robustly indexing and querying very large and highly dynamic hierarchical datasets.

The result is an unprecedented degree of native support and versatility for managing hierarchical data on the level of the relational database system. This benefits a great many real-world scenarios which routinely deal with hierarchical data in enterprise software, scientific applications, and beyond.

Contents

1 Introduction
  1.1 Contributions
  1.2 Publications
  1.3 Outline

2 Hierarchical Data in Relational Databases
  2.1 Basic Concepts of Hierarchies
  2.2 Application Scenarios
    2.2.1 Common Data Models
    2.2.2 Our Focus
    2.2.3 Example Scenarios
    2.2.4 Typical Queries on Hierarchies
    2.2.5 Summary of Requirements and Challenges
  2.3 Status Quo: Integrating Hierarchical Data and DBMS
    2.3.1 The Adjacency List Model
    2.3.2 Recursion in SQL
    2.3.3 Hierarchical Queries in Oracle Database
    2.3.4 Recursive Common Table Expressions
    2.3.5 Hierarchical Computations via ROLLUP
    2.3.6 Encoding Hierarchies in Tables
    2.3.7 hierarchyid in Microsoft SQL Server
    2.3.8 ltree in PostgreSQL
    2.3.9 XML Databases and SQL/XML
    2.3.10 Multidimensional Databases and MDX

3 A Framework for Integrating Relational and Hierarchical Data
  3.1 Overview
  3.2 The Hierarchical Table Model
    3.2.1 Hierarchical Tables
    3.2.2 The NODE Data Type
  3.3 SQL Extensions: Querying Hierarchies
    3.3.1 Hierarchy Functions
    3.3.2 Hierarchy Predicates
    3.3.3 Hierarchical Windows
    3.3.4 Recursive Expressions
    3.3.5 Further Examples
  3.4 DDL Extensions: Creating Hierarchies
    3.4.1 Hierarchical Base Tables
    3.4.2 Derived Hierarchies: The HIERARCHY Expression
    3.4.3 Deriving a Hierarchy from an Adjacency List
  3.5 DML Extensions: Manipulating Hierarchies


  3.6 Advanced Examples
    3.6.1 Flexible Forms of Hierarchies
    3.6.2 Heterogeneous Hierarchies (I)
    3.6.3 Heterogeneous Hierarchies (II)
    3.6.4 Dimension Hierarchies
    3.6.5 A Complex Report Query
    3.6.6 Hierarchical Tables and RCTEs

4 The Backend Perspective of Hierarchical Tables
  4.1 Hierarchy Indexing Schemes
    4.1.1 Basic Concepts
    4.1.2 A Common Interface for Hierarchy Indexes
    4.1.3 Elementary Index Primitives
    4.1.4 Index Primitives for Queries
    4.1.5 Index Primitives for Updates
  4.2 Implementing Indexing Schemes: State of the Art
    4.2.1 Challenges in Dynamic Settings
    4.2.2 Naïve Hierarchy Representations
    4.2.3 Containment-based Labeling Schemes
    4.2.4 Path-based Labeling Schemes
    4.2.5 Index-based Schemes
  4.3 Schemes for Static and Dynamic Scenarios
  4.4 PPPL: A Static Labeling Scheme
  4.5 Order Indexes: A Family of Dynamic Indexing Schemes
    4.5.1 Order Index Interface
    4.5.2 The AO-Tree
    4.5.3 Block-based Order Indexes
    4.5.4 Representing Back-Links
    4.5.5 Implementing the Query Primitives
    4.5.6 Implementing the Update Primitives
  4.6 Building Hierarchies
    4.6.1 A Practical Intermediate Hierarchy Representation
    4.6.2 A Generic build() Algorithm
    4.6.3 Implementing Derived Hierarchies
    4.6.4 Transforming Adjacency Lists
    4.6.5 Transforming Edge Lists to Hierarchy IR
  4.7 Handling SQL Data Manipulation Statements

5 Query Processing on Hierarchical Data
  5.1 Logical Algebra Operators
    5.1.1 Preliminaries
    5.1.2 Expression-level Issues and Normalization
    5.1.3 Map
    5.1.4 Join


    5.1.5 Binary Structural Grouping
    5.1.6 Unary Structural Grouping
    5.1.7 Unary Versus Binary Grouping
  5.2 Physical Algebra Operators
    5.2.1 Hierarchy Index Scan
    5.2.2 Hierarchy Join: Overview
    5.2.3 Nested Loop Join
    5.2.4 Hierarchy Index Join
    5.2.5 Hierarchy Merge Join
    5.2.6 Hierarchy Staircase Filter
    5.2.7 Structural Grouping: Overview
    5.2.8 Hierarchy Merge Groupjoin
    5.2.9 Hierarchical Grouping
    5.2.10 Structural Grouping: Further Discussion
    5.2.11 Hierarchy Rearrange
  5.3 Further Related Work and Outlook

6 Experimental Evaluation
  6.1 Evaluation Platform
  6.2 Test Data
  6.3 Experiments
    6.3.1 Hierarchy Derivation
    6.3.2 Index Primitives for Queries
    6.3.3 Hierarchy Traversal
    6.3.4 Hierarchy Joins
    6.3.5 Hierarchical Windows
    6.3.6 Hierarchical Sorting
    6.3.7 Pattern Matching Query
    6.3.8 Complex Report Query
  6.4 Summary

7 Conclusions and Outlook

Bibliography


1 Introduction

Hierarchies are infused into our daily lives. Nearly any system in the world can be arranged in a hierarchy, be it political, socioeconomic, technical, biological, or nature itself. As a consequence, hierarchies also appear in software applications throughout. Central concepts in computing are inherently hierarchical, such as file systems, access control schemes, or semi-structured document formats like XML and JSON. If we look specifically at enterprise software, we see hierarchies being used for modeling geographically distributed sites, marketing schemes, financial accounting schemes, reporting lines in human resources, and task breakdowns in project plans. In manufacturing industries, so-called bills of materials, which describe the hierarchical assembly structure of an end product, are a central artifact. And last but not least, in business analytics, hierarchies in the dimensions of data cubes help data analysts to effectively organize even vast amounts of data and guide their analysis to the relevant levels of granularity.

A crucial component in virtually all of the larger mission-critical applications is the database layer, where the hierarchies are ultimately stored. Given the important role that hierarchies play in so many commercial applications, it comes as no surprise that the first recognized database model, devised by IBM for their Information Management System in the 1960s, was hierarchical in nature. One of its intended purposes was to manage bills of materials for the construction of the Apollo spacecraft [5]. Many other early database systems focused primarily on manufacturing scenarios, such that “BOM processor” became a household term. Today, however, virtually all mainstream database systems are based on Codd’s relational data model [21].

While the relational data model has been a runaway success for many good reasons, its flat nature makes it particularly hard to accommodate hierarchies and their inherently recursive structure: A hierarchy must be manually mapped to a flat table in the database schema, where rows of the table correspond to nodes of the hierarchy, and a designated set of columns represents the structure by means of a particular encoding scheme. Choosing and manually implementing such an encoding scheme is a major challenge for application developers: The encoding must be expressive enough to be able to answer all relevant types of queries against the hierarchy, and at the same time it must support all types of structural updates the users may wish to perform. A poorly chosen encoding can render certain queries and updates prohibitively expensive to execute, if not impossible to express in the first place. The developer further has to consider the physical design and add supporting database indexes as necessary in order to find a suitable tradeoff between storage utilization, support for updates,


BOM:

ID    PID   Kind        Payload
'A1'  NULL  'compound'  ···
'A2'  NULL  'display'   ···
'B1'  'A1'  'engine'    ···
'B2'  'A1'  'engine'    ···
'C1'  'B1'  'valve'     ···
'C2'  'B1'  'rotor'     ···
'C3'  'B2'  'compound'  ···
'C4'  'B2'  'control'   ···
'D1'  'C3'  'valve'     ···
'D2'  'C3'  'rotor'     ···
'D3'  'C4'  'cpu'       ···

[Tree diagram: a virtual root ⊤ above the top-level parts A1 and A2; A1 has children B1 and B2; B1 has children C1 and C2; B2 has children C3 and C4; C3 has children D1 and D2; C4 has child D3.]

Figure 1.1: An example bill of materials hierarchy and how it would typically appear in a relational database table.

and query performance. Beyond that, an eye has to be kept on keeping the database logic maintainable, and on foreseeing future operations and query workloads that may come up in an evolving application. All this is a major burden to developers and very hard to get right. Somewhat surprisingly, today’s off-the-shelf relational database products do not offer much support to their users, even though there has been popular demand ever since the relational data model has existed. A solution that application developers thus commonly resort to is to use a trivial data model for representing the hierarchy at the database layer (Figure 1.1) and outsource most logic into the application layer. This situation is especially unsatisfactory in the context of highly capable modern in-memory technology, where software architects are looking for ways to push more logic down to the database layer in order to utilize the available powerful hardware. This research project therefore takes on the challenge of making database systems truly hierarchy-aware. Our aim is a general and user-friendly framework for storing, manipulating, and querying hierarchical datasets in a relational database context.

1.1 Contributions

Our contributions towards providing comprehensive support for hierarchies in emerging in-memory relational database systems can be summarized as follows:

A holistic approach. We consider all relevant aspects across the layers of a typical database system architecture: the logical modeling of hierarchical data, the SQL front-end, the translation of our SQL constructs into relational algebra, necessary operators and algorithms for processing updates and queries efficiently, and last but not least the storage layer where the hierarchies are represented and indexed. To our knowledge, we are the first to take such a holistic approach. Previous work often starts off on


top of the SQL level (e. g., by adding functionality through user-defined data types or stored procedures) without touching the backend. Our invasive method enables a deeper and more comprehensive integration of hierarchical and relational data and results in unprecedented usability and performance.

A design based on real-world use cases. Our research has been conducted in cooperation with the HANA Database group [29] of SAP SE based in Walldorf. As a major vendor of enterprise solutions, SAP has a plethora of relevant use cases. Academic literature sometimes focuses on techniques that are theoretically powerful but turn out to lack usability in practice. By contrast, we consider a selected set of real-world use cases and let them guide us in finding the “right” degree of flexibility and functionality to support. This ensures our research can be of immediate benefit to the industry.

An analysis of conventional approaches to hierarchies in RDBMS. We thoroughly investigate the present state of support for hierarchies in today’s off-the-shelf database systems. We also briefly cover more distantly related functionality and research tracks. While many relevant query processing techniques—e. g., for recursive common table expressions and structural joins—have been tackled by prior research, our analysis from a practical point of view reveals that a robust and efficient solution that would optimally meet the requirements of our application scenarios is still missing to date. Our research thus generalizes and reconciles many of the prior ideas and techniques into a more comprehensive, flexible, and user-friendly framework for handling hierarchies specifically in a relational context.

A concept to allow explicitly modeling hierarchies as “first-class citizens”. The fundamental idea of our approach is to allow users to model hierarchical tables within the database by means of a built-in abstract data type NODE. This intuitive and light-weight concept makes the fact that a hierarchy is being modeled explicit, and ensures that hierarchical and relational data can interact seamlessly. The NODE type supports a well-defined hierarchy topology: irregular forest structures with heterogeneous node types. It also gives access to well-defined functionality with guaranteed performance characteristics.

Carefully crafted extensions to the SQL query language. As a consequence of being unable to model hierarchies explicitly in the past, SQL has no real support for queries on hierarchies, which results in unintelligible and unmaintainable query statements. Our NODE data type can serve as a “syntactic handle” to the hierarchy and thus provides the foundation for attractive language features. We add concise but expressive constructs to create and populate hierarchical tables, to manipulate the hierarchy structure, and to effectively express queries on the hierarchies. Our constructs interact seamlessly and intuitively with well-established SQL concepts such as views, joins, and windowed tables, and they are light-weight and orthogonal in that no significant grammar extensions are required. Thus, the original spirit of SQL is maintained.

Novel functionality regarding hierarchical computations. Our SQL extensions allow users to conveniently express a class of practically relevant queries from typical application scenarios. When it comes to aggregation-like hierarchical computations involving


structural recursion, they go beyond what can practically be expressed in SQL today. Many applications—in particular in business analytics—depend on such functionality and will thus be able to push logic down to the database layer.

Engine support for query processing. Lacking data modeling and query language support, adding engine support for queries on hierarchies has not been a consideration in today’s database systems. However, to be able to evaluate complex real-world queries with high performance and scalability, such support is indispensable. In our design the backend is aware of the hierarchical structure of the involved tables and thus many opportunities are opened up in that regard. We describe a comprehensive system of generic and asymptotically efficient algebra operators and algorithms for implementing the functionality that our SQL constructs expose. We also cover considerations on the relational algebra level and issues around generating efficient query plans.

A flexible framework for hierarchy indexes. Our algebra operators are designed in terms of a carefully crafted generic framework for representing and indexing hierarchies at the storage layer. A hierarchy indexing scheme comes with a definition of the abstract NODE data type and is encapsulated by a common index interface, which exposes low-level query primitives and update operations with suitable performance guarantees. A broad range of existing sophisticated techniques for encoding hierarchies fit into this design. This provides users with the flexibility to choose among a variety of indexing schemes according to the application scenario at hand.

A survey of indexing techniques. The technical challenges of storing and indexing recursively structured data have received a fair amount of attention from the research community in the past two decades, although that work has not primarily been in the context of relational databases but in the domains of XML and semi-structured databases. A plethora of tree encoding schemes have been proposed that can equally be applied to hierarchies in a relational schema. We contribute a comprehensive survey of this area, and design our framework to embrace the prior art.

Pushing the limits of hierarchy indexing. As part of our research project a family of indexing schemes called order indexes has also been proposed, which further advances the state of the art when it comes to indexing hierarchies in relational databases. Unlike many prior works, their emphasis is particularly on coping with very large and at the same time highly dynamic datasets, on supporting complex update operations, and on offering robustness in case of unfavorable “skewed” update patterns. While their design and implementation is not in the scope of this thesis, we cover how they are fitted into our framework and demonstrate their feasibility in our benchmarks.

Working with existing hierarchical data. Providing a migration path for legacy applications is of high practical relevance but easily overlooked in green-field designs. We place an emphasis on this topic by providing language constructs to derive hierarchical tables from existing relational data. We also cover the necessary algorithms to efficiently bulk-load hierarchy indexing schemes from the data.


Prototype. Our research prototype provides a proof of concept for the mentioned algorithms and algebra operators. It also incorporates a considerable selection of indexing scheme implementations that fit our generic design. We use it to conduct an experimental evaluation to demonstrate the performance characteristics of our framework.

1.2 Publications

Some ideas and figures presented in this thesis have appeared previously in the following publications. Some were written jointly with collaborators from TUM and SAP.

• Robert Brunel and Jan Finis. “Eine effiziente Indexstruktur für dynamische hierarchische Daten”. In: Datenbanksysteme für Business, Technologie und Web (BTW) 2013. Workshopband P-216 (Mar. 2013), pp. 267–276

• Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Franz Färber, and Norman May. “DeltaNI: An efficient labeling scheme for versioned hierarchical data”. In: Proc. 2013 ACM SIGMOD Int. Conf. Management of Data. ACM, June 2013, pp. 905–916

• Jan Finis, Martin Raiber, Nikolaus Augsten, Robert Brunel, Alfons Kemper, and Franz Färber. “RWS-Diff: Flexible and efficient change detection in hierarchical data”. In: Proc. 22nd ACM Int. Conf. Information and Knowledge Management (CIKM). ACM, Oct. 2013, pp. 339–348

• Robert Brunel, Jan Finis, Gerald Franz, Norman May, Alfons Kemper, Thomas Neumann, and Franz Färber. “Supporting hierarchical data in SAP HANA”. In: Proc. 31st Int. Conf. Data Engineering (ICDE). IEEE, Apr. 2015, pp. 1280–1291

• Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Färber. “Indexing highly dynamic hierarchical data”. In: Proc. VLDB Endowment 8.10 (Aug. 2015), pp. 986–997

• Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Färber. “Order Indexes: Supporting highly dynamic hierarchical data in relational main-memory database systems”. In: The VLDB Journal 26.1 (Feb. 2017), pp. 55–80

• Robert Brunel, Norman May, and Alfons Kemper. “Index-assisted hierarchical computations in main-memory RDBMS”. In: Proc. VLDB Endowment 9.12 (Aug. 2016), pp. 1065–1076


1.3 Outline

The remainder of this thesis is organized as follows:

Chapter 2: Hierarchical Data in Relational Databases. This chapter provides our foundation and motivation. It introduces all relevant concepts concerning hierarchical data in relational databases. We characterize the forms of hierarchical data we target, motivate our focus by citing typical scenarios and queries from the enterprise applications of SAP, and state specific requirements a relational database system has to fulfill to be called “hierarchy-aware.” A survey of the state of the art positions us against the various related work in the area.

Chapter 3: A Framework for Integrating Relational and Hierarchical Data. This chapter covers our framework for hierarchical data in SQL-based database systems, beginning with its core concept of encapsulating the structure of a hierarchy by an abstract data type NODE. Based on this we motivate and define extensions to SQL for creating, querying, and modifying a hierarchical table. A few advanced examples illustrate that the language elements cover typical usage scenarios and blend seamlessly with the “look and feel” of SQL.

Chapter 4: The Backend Perspective of Hierarchical Tables. This chapter discusses how hierarchical tables and the NODE type are implemented and indexed under the hood, what exactly makes up a hierarchy indexing scheme, and our common abstract index interface. We also thoroughly survey a broad range of existing hierarchy encoding techniques that fit into this design, and describe how it accommodates our proposed order indexes family of indexing schemes. Finally, we discuss how to implement update statements as well as efficient bulk-building from existing hierarchical data.

Chapter 5: Query Processing on Hierarchical Data. This chapter covers the translation of our query language extensions into relational algebra and from there into executable query plans. We propose a comprehensive set of hierarchy-aware physical algebra operators, analyze their properties in detail, and discuss their generic implementation in terms of the query primitives of our index interface.

Chapter 6: Experimental Evaluation. In this chapter we demonstrate the feasibility of our framework by exercising the proposed indexing and query processing techniques against synthetic and real-world data based on a prototypical execution engine. Our experiments comprise our own family of indexing schemes and physical algebra operators as well as typical alternative approaches and ad-hoc solutions.

Chapter 7: Conclusions and Outlook. This chapter concludes this thesis, wraps up the key properties of our solution, and outlines possible directions for future research.

2 Hierarchical Data in Relational Databases

This chapter sheds light on our motivation and the challenges we tackle in this thesis. We first introduce the necessary fundamental concepts around hierarchies in a relational database context (§ 2.1). This foundation allows us to precisely characterize the forms of hierarchies we intend to support with our framework (§ 2.2.1, § 2.2.2). We motivate our focus by looking at a number of scenarios in enterprise applications (§ 2.2.3), which in particular gives us an understanding of the common types of queries (§ 2.2.4). Based on this discussion, we summarize the requirements a database system has to meet in order to exhibit decent support for managing hierarchical data (§ 2.2.5). Finally, we survey the state of the art of how users deal with such data today, which reveals the existing functionality gaps that our framework intends to fill (§ 2.3).

2.1 Basic Concepts of Hierarchies

Hierarchical data appears in database applications in various forms. Much like an ordinary relation in the relational data model, a hierarchy is a collection of real-world entities such as objects, names, or categories. The gist is a particular relationship that is defined between the entities, which gives meaning to terms such as one entity being “above,” “below,” or “at the same level as” another. Different applications place different constraints on how exactly the hierarchical arrangement may look. Figure 2.1 displays two examples. In hierarchy (a) each entity has at most one superordinate entity, and there are no cycles in the “is below” relationships. Hierarchies like this are often called strict. Hierarchy (b) has two top-level entities, A and F, and allows entity D to have multiple superordinates: it is “directly below” B, C, and F. A hierarchy like this is sometimes referred to as an overlapping hierarchy. Figure 2.1c shows the common data model in the high-level entity/relationship notation [18]: Each entity is identified by a key and described by certain attributes, and there is an “is below” relationship between entities. The multiplicity N of the relationship can be 1 in order to model a strict hierarchy, or it can be ∗ to allow for an overlapping hierarchy. In a relational database, the N = 1 model can be implemented as a single flat table:

CREATE TABLE Entity (
  ID  INTEGER NOT NULL PRIMARY KEY,
  PID INTEGER NULL REFERENCES Entity (ID),  -- “is below”
  ...                                       -- additional attributes
)


[Panels (a)–(c) of Figure 2.1: (a) the example tree with root A1 (children B1, B2; B1 → C1, C2; B2 → C3, C4; C3 → D1, D2; C4 → D3); (b) the overlapping hierarchy with top-level entities A and F, where D is directly below B, C, and F, and E is below D; (c) the E/R model: an Entity with Key and Attributes and an N : M “is below” self-relationship.]

Entity:

ID    PID   Attributes
'A1'  NULL  ···
'B1'  'A1'  ···
'B2'  'A1'  ···
'C1'  'B1'  ···
'C2'  'B1'  ···
'C3'  'B2'  ···
'C4'  'B2'  ···
'D1'  'C3'  ···
'D2'  'C3'  ···
'D3'  'C4'  ···

Figure 2.1: Two example hierarchies, and how they may appear in entity/relationship models (c) and relational database tables (d).

The hierarchical relationship is established by the foreign key self-reference, which associates each entity (row) with the respective superordinate entity. Figure 2.1d shows example data representing hierarchy (a). In practice, the basic multiplicities of E/R diagrams and REFERENCES constraints of SQL are not expressive enough to precisely capture the semantics of the modeled hierarchies. Most applications have additional constraints on the structure of the hierarchies. For example, the above models permit cyclic “is below” relationships (an entity could even be a subordinate of itself!), which is usually meaningless.
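As a small illustration of how little plain SQL constraints can enforce here, a table-level CHECK can at least rule out the one-node cycle. A hedged sketch against the Entity table above; longer cycles remain undetectable without recursion, e. g., in a trigger or a periodic validation query:

ALTER TABLE Entity
  ADD CONSTRAINT Entity_no_self_loop
  CHECK (PID IS NULL OR PID <> ID);  -- forbids an entity directly below itself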

The Structure of a Hierarchy. To characterize hierarchical structures precisely we need to bring in the terminology of graph theory (as covered, e. g., in [26]). We can describe a data model such as Figure 2.1c as a vertex-labeled directed graph, where each graph node is associated with exactly one entity (table row). The mathematical model of a directed graph is a tuple G = (V, E), where V is some set of objects, the nodes (or vertices), and where E ⊆ V × V is a set of ordered pairs of elements of V representing the directed edges. By their nature as elements of a set, the nodes are distinguishable, so we may call the graph vertex-labeled. In our setting, the labels (the contents of V) are the key values that identify the modeled entities. The edge set E thus contains pairs of keys and directly represents the “is below” relationship. Note that an edge is defined solely by the two nodes it connects: In hierarchies the focus is typically on the entities themselves rather than on their relationship. Treating the edges as weak entities without an identity of their own is therefore usually sufficient. Furthermore, neither the E/R model nor the relational model would allow the same edge to exist multiple times between a particular pair of nodes. In line with this, our graph model defines E as a set rather than a multiset, precluding so-called “multiple edges” by design. Beyond this, our relational model does not impose any further structural constraints that may hold in applications. The terminology of graph theory helps us to express these. Common restrictions are:


• Connected graphs: In a connected graph, every pair of nodes is connected by at least one path. There are no unreachable nodes.

• Acyclic directed graphs (DAGs): A directed graph is acyclic if there are no directed cycles. A directed cycle is a closed walk over the nodes using only edges in E, with no repetitions of nodes and edges (other than the end node matching the start node). Disallowing directed cycles also precludes edges that connect a node to itself, so-called loops. However, a DAG may still have an undirected cycle, a closed walk that ignores the direction of the edges.

• Rooted trees: A rooted tree is a directed graph with the following properties: (1) It is connected. (2) It is rooted: There is a designated root node ⊤ ∈ V which determines the direction of the edges. All edges are oriented away from the root.¹ (3) It is acyclic. There are neither directed nor undirected cycles.

The example (a) of Figure 2.1 is a tree with root A1, whereas (b) is a connected DAG.

Further Terminology for Trees. Besides being connected and acyclic, trees have several further characteristic properties, which render them comparatively simpler to represent and process algorithmically: The absence of cycles means that any node is reachable from the root through a unique path. It also implies that a tree of |V| nodes has exactly |E| = |V| − 1 edges (assuming V is finite). For example, the hierarchy in Figure 2.1a has 10 vertices and, by its nature as a tree, 9 edges. The concept of a unique root path furthermore gives rise to a rich and intuitive terminology that we will heavily use in this thesis. The most important terms are:

• If the unique path from ⊤ to node v passes through node u, node u is called an ancestor of v, and v is called a descendant of u.

• The parent of a node v is the ancestor that is directly connected to v. Every node except the root has a unique parent. A child of a node v is a node of which v is the parent. A node that has no children is called a leaf. A node that is not a leaf is called an inner node (also internal or branch node).

• The depth of a node v is the number of nodes on the path from ⊤ to v. In the example of Figure 2.1a the depth of D1 is 4. (By contrast, in the DAG of Figure 2.1b, the depth of D is ambiguous: it can be seen as either 3 or 2.) A hierarchy in which all leaves have the same depth is called a balanced hierarchy. (A sketch showing how depths can be computed with today’s SQL follows below.)

¹Equivalently, all edges could be considered as being oriented towards the root. This interpretation is essentially up to the application. We consistently stick to the former, more common interpretation, as it matches the natural direction of traversal. Also note that as ⊤ determines the direction of the edges, there is no need to explicitly manifest that direction in the mathematical model. In the literature, a tree is therefore more commonly defined as a 3-tuple (V, E, ⊤) where (V, E) is an undirected graph, whose edges E are unordered pairs (2-sets) of nodes.
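The depth just defined can be computed today from the adjacency-list Entity table of § 2.1 with a recursive common table expression, a construct surveyed in § 2.3.4. A minimal sketch in standard SQL:

WITH RECURSIVE NodeDepth (ID, Depth) AS (
  SELECT ID, 1                          -- the root(s) have depth 1
  FROM Entity
  WHERE PID IS NULL
  UNION ALL
  SELECT e.ID, d.Depth + 1              -- each child is one level deeper
  FROM Entity e JOIN NodeDepth d ON e.PID = d.ID
)
SELECT ID, Depth FROM NodeDepth;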


Ordered Hierarchies. In some applications the relative order of entities with a common superordinate is meaningful, although neither E/R models nor SQL have built-in means for expressing this. In structured documents like XML, for example, a “document order” is inherent. In other scenarios, a particular attribute may represent the order. Even if that is not the case, we want to treat such hierarchies as being ordered when storing them in a database. Although maintaining the order incurs space and runtime overhead, queries such as “enumerate all nodes in their natural hierarchy order” could otherwise not be answered deterministically. Mathematically, we are thus dealing with ordered graphs: An ordered graph is an unordered graph where the outgoing edges of each node are totally ordered. Two concepts that exist only with ordered trees are preorder and postorder. They refer to a linear ordering of the nodes that can be produced by “tracing” an ordered depth-first traversal. In a preorder listing, the root of each subtree is listed just before the nodes in its subtree:

A1 B1 C1 C2 B2 C3 D1 D2 C4 D3

In a postorder listing, each root appears right after the nodes in its subtree:

C1 C2 B1 D1 D2 C3 D3 C4 B2 A1

If we assign numbers to the nodes reflecting their preorder and postorder ranks, we can characterize the terms ancestor and descendant via these numbers:

v is a descendant of u ⟺ pre(v) > pre(u) ∧ post(v) < post(u)
v is an ancestor of u ⟺ pre(v) < pre(u) ∧ post(v) > post(u)

The other two possible combinations are:

v precedes u ⟺ pre(v) < pre(u) ∧ post(v) < post(u)
v follows u ⟺ pre(v) > pre(u) ∧ post(v) > post(u)

In our example, C3 follows B1, C1, and C2, but precedes C4 and D3. Note the symmetry in the ancestor/descendant and preceding/following relationships. In the terminology of XPath [110], binary relationships of this kind are called axes, and further axes beyond ancestor, descendant, preceding, and following are defined. However, in the context of “pure” hierarchies the ancestor and descendant axes matter the most. These concepts will play a central role in later chapters, in particular § 5.
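If the pre and post ranks are materialized as columns, say in a hypothetical table Node (ID, pre, post), these characterizations translate directly into plain relational predicates. A sketch under that assumption:

-- all descendants v of a given node u
SELECT v.ID
FROM Node u, Node v
WHERE u.ID = 'B1'           -- pick some node u
  AND v.pre  > u.pre        -- v comes after u in preorder
  AND v.post < u.post;      -- v comes before u in postorder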

2.2 Application Scenarios

In our cooperation with SAP, we consulted with stakeholders of various enterprise applications and investigated a number of scenarios that can benefit from first-class support for hierarchical datasets. Armed with the concepts introduced in the previous section, we can now precisely characterize the modeled types of hierarchies and clarify the scenarios we particularly focus on.


[Figure 2.2: four E/R diagrams. (a) the self-referencing Entity (Key, Attributes) with an “is below” relationship; (b) Entity plus a distinct Edge entity carrying its own attributes; (c) a chain Entity 1 – Entity 2 – Entity 3 of 1 : N relationships (Key 1/Attribute 1, Key 2/Attribute 2, Key 3/Attribute 3); (d) a single entity whose attribute pairs Key 1/Attribute 1 through Key 3/Attribute 3 encode one hierarchy level each.]

Figure 2.2: Common entity/relationship models of hierarchical data

2.2.1 Common Data Models

Figure 2.2 shows, again from an abstract entity/relationship point of view, different models of hierarchical data that are commonly seen in practice:

(a) This is the basic “self-referencing entity” model we have already seen in Figure 2.1c. It is capable of representing directed graphs, but they may be disconnected or cyclic. If N = 1 and cycles are disallowed, it closely corresponds to our notion of a tree.

(b) This variant is equivalent to (a) with N = 1, but gives edges a distinct identity. A design in the fashion of (b) is necessary when a general graph is modeled and the edges are to be labeled. However, labeled edges are untypical for “pure” hierarchies where the focus is primarily on the entities (the nodes). Also note that for a tree the distinction between node attributes and edge attributes is technically not necessary: Since every entity has a unique superordinate, there is a 1 : 1 correspondence between a node v, its incoming edge (u, v), and transitively the whole path ⊤, . . . , v from the root to v. This allows us to fold the Edge relation into the Entity relation, bringing us back to variant (a). The interpretation of whether an attribute is a node label or an edge label is then up to the application.

(c) Besides a single self-referencing entity, another common pattern is a series of multiple entities in 1 : N relationships, directed from subordinate entity to superordinate entity, respectively. This models a homogeneous hierarchy where the number of levels is fixed and all entities on a level are of the same type. These hierarchies are usually also balanced. A common example is geographic data:

Continent : {[ Name ]} = {(Europe), . . . }
Country : {[ Name, Continent ]} = {(Germany, Europe), . . . }
State : {[ Name, Country ]} = {(Bavaria, Germany), . . . }
Region : {[ Name, State ]} = {(Upper Bavaria, Bavaria), . . . }
City : {[ Name, Region ]} = {(Munich, Upper Bavaria), . . . }


Although this model is widespread in practice, it does not make sense for every application. The rigid homogeneous structure creates modeling problems when entities on certain levels do not actually exist (e. g., states and regions in tiny countries). To account for such cases one would have to either introduce a dummy region or distort the application logic accordingly. Model (c) can be transformed into model (a) by introducing a common key (e. g., an integer ID) with a shared domain across all involved tables, and then arranging these IDs in a separate self-referencing relation according to model (a); a sketch of this transformation follows after this list.

(d) In this model there is a single “denormalized” relation with a number of dedicated attributes a1, . . . , ak which represent the hierarchy structure somewhat implicitly: Attribute ai (as a whole) represents level i of the hierarchy, and each distinct ai value represents a particular node on that level. A tuple of ai values corresponds to a complete path from the root to a leaf node.

Model (d) is a variant of model (c) where all relations are joined into one. In practice many database tables exhibit this design or can be interpreted as a hierarchy in this way. For example, every table containing a date field could be viewed as a Year–Quarter–Month–Day hierarchy.
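A sketch of the model (c) to model (a) transformation for the geographic example, here simply reusing the names as the shared key (a real schema would typically introduce surrogate integer IDs); GeoEntity is a hypothetical view name:

CREATE VIEW GeoEntity (ID, PID) AS
SELECT Name, CAST(NULL AS VARCHAR(40)) FROM Continent
UNION ALL SELECT Name, Continent FROM Country
UNION ALL SELECT Name, Country   FROM State
UNION ALL SELECT Name, State     FROM Region
UNION ALL SELECT Name, Region    FROM City;

Likewise, the date interpretation of model (d) can be produced on the fly; here Fact(SaleDate, ...) is a hypothetical table, and the QUARTER field is a common vendor extension (e. g., PostgreSQL) rather than standard SQL:

SELECT EXTRACT(YEAR    FROM SaleDate) AS Year,
       EXTRACT(QUARTER FROM SaleDate) AS Quarter,
       EXTRACT(MONTH   FROM SaleDate) AS Month,
       EXTRACT(DAY     FROM SaleDate) AS Day
FROM Fact;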

2.2.2 Our Focus

Our goal is to make working with data models such as those shown in Figure 2.2 as convenient and efficient as possible. We primarily focus on hierarchy structures of the following two categories:

1. Rooted trees. We refer to these as strict hierarchies. They may be inhomogeneous (unlike models c and d), unbalanced, arbitrarily wide and deep, and ordered.

2. Non-strict hierarchies that can reasonably be viewed as strict hierarchies.

Although the data models of Figure 2.2a and b can accommodate them, we do not intend to support arbitrary directed graphs. An often-cited example for this kind of scenario is flight planning (e. g., [82]): In a flight route graph, entities of type Airport are connected via Flight edges, forming a complex and inherently cyclic directed graph. Flights may be associated with attributes such as a departure time, duration, and cost, and a common query may be: “For every pair of cities, count the potential flights in a given time range.” To express such queries effectively one requires a graph-centered query language [106], and evaluating them involves systematic graph traversals that can handle encountered cycles. This appears to be better tackled by dedicated graph databases (see, e. g., [89]). We therefore decided to exclude any graphs that cannot reasonably be interpreted as strict hierarchies from our scope. Our goal is a light-weight solution in the world of SQL that tightly integrates with relational database backends, where most business data today lives. This targeted focus allows us to provide features that are both user-friendly and “performance-friendly” to a degree that would be hard to achieve by a one-size-fits-all solution for graph-like data.


That said, deviations from a strict hierarchy structure are sometimes necessary or desirable in real-world applications such as those of SAP. Acyclic directed graphs with a limited number of edges violating the strictness are not as problematic. We found that these scenarios can usually be handled by working on a tree view of the graph, on which all operations of strict hierarchies are well-defined. Such a view can be obtained either by eliminating edges to extract a spanning tree, or by replicating any subtrees that are reachable via multiple paths. We anticipate these advanced requirements in our framework.
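To make the first option concrete: if the “is below” relationships of such a mildly overlapping hierarchy are kept in an edge table, one spanning tree view can be extracted by keeping a single parent per node. A hedged sketch, assuming a hypothetical edge table EntityEdge (ID, PID) and using the smallest parent key as an arbitrary but deterministic choice:

CREATE VIEW SpanningTree (ID, PID) AS
SELECT ID, MIN(PID)        -- keep exactly one parent per node
FROM EntityEdge
GROUP BY ID;

Since every retained edge also exists in the acyclic source graph, the result is guaranteed to be a forest.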

2.2.3 Example Scenarios

We now look at typical applications featuring strict hierarchies, mostly in the context of enterprise software. To gain insights into which features and semantics we require from our framework, we consider both the data modeling perspective as well as the queries that are commonly run against the data.

Materials Planning. A bill of materials (BOM) is perhaps the classic use case for databases in business applications. It is defined by the German standard DIN 199 [27] as “a complete, formally structured list of the components that make up a product or assembly.” The BOM is a central artifact in enterprise resource planning systems, where it can drive the whole value chain from engineering to manufacturing through to sales and maintenance. From a data modeling perspective, a BOM contains entities of type Part, and there is a Part “is composed of” Part relationship, possibly associated with a quantity denoting how often the part appears within the composite. The relative order of parts usually does not matter.

By far the most common materials planning procedure is the parts explosion, a structured (indented) list that breaks apart an assembly into the components that are needed to build it. Explosions can be created in various ways: The list may be in depth-first or (more rarely) breadth-first order; it may show only direct subassemblies (single-level explosion) or a limited number of levels. Where-used queries view the BOM in a bottom-up manner, going from raw materials to composites at increasingly higher levels. Often there are additional filters, or accumulated quantities are displayed with the parts. Example queries are: “What are the total production costs for assembly X?”; “List subassemblies whose accumulative costs are larger than X”; “Which raw materials need to be ordered in which quantity?”; “What is the total quantity of each component required to build X, grouped by component type?” (summarized explosion); and “In how many different assemblies does a subpart of type t appear?”

Human Resources. In large organizations, department divisions and reporting lines usually form hierarchies. These comprise two entity types, Employee and Department, in different relationships: A department “subdivides” another department, and each employee is “assigned to” a department and “reports to” a supervising employee. The sibling order does not matter. In a more complex model an employee could be partially assigned to multiple departments (creating an N : M relationship), where each assignment is labeled with the contribution in percent. Queries are primarily about


listing and inspecting the organization structure: “Which employees work under manager X?”; or “Display the reporting line for Y.” Beyond that, simple computations may be performed: “Count all employees within each level-k department”; or “List managers who are responsible for more than n employees.”

Enterprise Asset Management. An asset hierarchy helps to keep track of production-relevant assets (such as machine parts, tools, and auxiliary equipment) and their functional locations (plants, rooms, assembly lines). It contains two types of potentially nested entities: Functional Location and Equipment. A piece of equipment is “installed at” a functional location and may be “composed of” another piece of equipment. Alternatively, two separate location and equipment hierarchies could be modeled—the latter resembling a BOM. The sibling order usually does not matter.

Profit Center Accounting. Cost and profit centers are typically arranged in a strict hierarchy with a Cost Center “is composed of” Cost Center relationship. The sibling order reflects the numbering scheme of the accounts. Queries often involve computations, such as “Sum up the costs for each level-k cost center.”

Project Planning. Planning and issue tracking systems often structure their items into milestones and each milestone into tasks and subtasks. The primary relationship is Task “depends on” Task. A task is considered complete when all of its subtasks are complete. The sibling order may model the priorities or the planned order of completion. More complex project plans may additionally contain cross-links between tasks, resulting in an acyclic directed graph. However, strictly hierarchical project plans are often preferred, as they are easier to display, navigate, and maintain.

Documents. Semi-structured text, most notably XML [108] and JSON [52], is by nature strictly hierarchical and ordered. A document consists of entities of type Document Node, with subtypes such as Text Block, Paragraph, Section, Page, et cetera. There may be auxiliary cross-references within a document. Queries involve searching for document nodes of certain types, matching simple patterns against the document, and summing up metrics such as word counts along the document structure. Many other artifacts in computing and software technology could be counted into this category, such as file repositories, package dependencies, and abstract syntax trees.

Business Intelligence (BI). In analytic applications, hierarchies are routinely used to explore large fact tables. Such a table characterizes facts (e. g., sales items) by a number of dimension attributes (e. g., the sold product) and associates them with numeric measures of interest (e. g., the price). Many dimensions can be organized as hierarchies. For example, consider a large enterprise that uses a hierarchical scheme for its product portfolio (e. g., Business Area–Product Line–Product–Version). We may have a table of sales and an auxiliary table storing the products:

Sale : {[ID, Item, Customer, Product, Date, Amount]}
Product : {[ID, Name, Category, Size, Price]}


If Category identifies a node in the product hierarchy, Product can act as a hierarchical dimension of the sales table. Other common examples are geographic dimensions such as Continent–Country–Region–City, or date/time dimensions such as Year–Quarter–Month–Day or Year–Calendar Week–Day. BI products use such dimensions to interactively filter and group the facts. The user may begin his analysis on the top-most level—at the most coarse level of detail—and then “drill” deeper as required.

From published material on BI technology and multidimensional data models [50, 54, 86, 101] we can learn about the typical characteristics of dimension hierarchies: They are usually strict, as directed graphs are difficult to handle in user interfaces and pose semantic issues when summarizing and accumulating the data. The sibling order is usually meaningless for discrete dimension values, or the required order may depend on the type of report (e. g., products ordered by name or by decreasing revenue). For simplicity, the hierarchies are also often balanced and treated primarily as a fixed set of named levels, so they can be conveniently referred to in the user interface. (Although this is not always the most appropriate modeling, as we noted in § 2.2.1.) The lowest level has a distinguished role, as the leaves in each dimension represent the specific items the facts are mapped to (e. g., the actual products), whereas the inner nodes represent abstract groups or summarized ranges of items. Multiple hierarchies may be in place for a dimension, as the various examples of subdividing a date/time period show.

A design one sometimes encounters is to mix multiple divisions into a single overlapping hierarchy. For example, when hierarchies for both Fiscal Year and Calendar Year are defined, sharing the lower Month–Day levels instead of replicating them in two hierarchies seems convenient. But this is not clean, as Day 1 of January is in fact a different entity than Day 1 of February. Turning the hierarchy into a DAG needlessly complicates matters. As two different divisions are rarely used in the same query, it is virtually always preferable to treat them as different (strict) hierarchies. We therefore do not intend to support overlapping hierarchies.

Queries involving dimension hierarchies are generally computation-intensive. They apply various filters to the fact data and aggregate the measures along the hierarchy structure. Examples are: “Compute the total/average sales amount per region”; or “List regions with total sales below average.” A challenge is that multiple hierarchical dimensions are sometimes involved: “Display a crosstab of total sales per country/region and per year/month.” Having multiple dimensions essentially multiplies the number of possible paths through the hierarchies for aggregation.
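For a concrete flavor of the leaf level of such computations, pre-aggregating the sales per product category over the Sale and Product tables sketched above is ordinary SQL (column semantics as assumed there); rolling these sums further up the product hierarchy is exactly the part that is still hard to express:

SELECT p.Category, SUM(s.Amount) AS Total
FROM Sale s
  JOIN Product p ON s.Product = p.ID
GROUP BY p.Category;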

2.2.4 Typical Queries on Hierarchies

By looking at the scenarios discussed in the previous section, we see that queries involving hierarchies can broadly be classified into three categories:

1. Basic node queries produce a potentially ordered listing of a selection of hierarchy nodes and project basic properties such as their depths.

2. Structural pattern matching queries filter and match entities based on their positions in the hierarchy.


3. Hierarchical computations associate a subset of the hierarchy nodes with numeric values and then propagate or aggregate them along the hierarchy structure.

Due to the limitations of the relational model and SQL, hierarchies are in practice often represented using very basic relational encodings, which preclude implementing such queries at the database layer. While investigating the applications of SAP, we observed several implementations of largely equivalent hierarchy-handling logic written in ABAP (running within the application server) or stored procedures. These approaches have significant shortcomings regarding performance, interoperability, and maintainability. Our goal, therefore, is to be able to express all relevant queries directly using appropriate extensions to SQL. To achieve this we need to examine the three categories of queries and clarify what features exactly are missing from the language.

Basic Node Queries. Certain properties of the hierarchy nodes are needed frequently and should therefore be easily accessible. The most interesting properties of a node are its depth, whether it is a root or a leaf node, and its number of children. Besides querying these properties, the user may wish to filter nodes according to their values, and produce a listing of the nodes in a particular order. These requirements all map fairly straightforwardly onto the basic SELECT–FROM–WHERE–ORDER framework of SQL, so no major conceptual enhancements are needed. We only have to extend the expression language of SQL appropriately. For example:

SELECT ID, depth of the node
FROM Hierarchy
WHERE depth of the node <= 3
ORDER BY depth-first hierarchy order
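For contrast, approximating this listing with today’s means requires a recursive common table expression over an adjacency-list table (cf. § 2.3.4). Even then the depth-first output order must be emulated, here with a PostgreSQL-style path array and siblings implicitly ordered by ID; a sketch assuming a Hierarchy (ID, PID) adjacency list:

WITH RECURSIVE Sub (ID, Depth, Path) AS (
  SELECT ID, 1, ARRAY[ID] FROM Hierarchy WHERE PID IS NULL
  UNION ALL
  SELECT h.ID, s.Depth + 1, s.Path || h.ID
  FROM Hierarchy h JOIN Sub s ON h.PID = s.ID
)
SELECT ID, Depth
FROM Sub
WHERE Depth <= 3
ORDER BY Path;      -- a depth-first order, with siblings ordered by ID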

Structural Pattern Matching Queries. Pattern matching queries involve a hierarchical pattern, which can be as simple as “v is below u,” where u and v are placeholders for nodes. The user usually either wants a list of all combinations of nodes that match the given pattern, or wants to test a particular list of combinations against the pattern and see those combinations that match. To illustrate a more complex pattern matching query, we revisit the bill of materials table from Figure 1.1 (p. 2):

“Select all combinations (e, r, c) of an engine e, a rotor r, and a compound part c, such that e contains r, and r is contained in c.”

The qualifying combinations are (B2, D2, C3), (B2, D2, A1), and (B1, C2, A1). The challenge with pattern matching queries lies in the recursive nature of the “is below” relationship. We want to be able to match, for instance, a node v with a particular ancestor u without having to—from a language perspective—explicitly mention or “loop” over the intermediate nodes between u and v in the query statement. It turns out these kinds of queries fit quite well into SQL’s declarative paradigm. The most appropriate mechanisms to express pattern matching are filters and structural joins on hierarchy axes. For example, the simple “v is below u” pattern becomes:

SELECT u.ID, v.ID
FROM Hierarchy u, Hierarchy v
WHERE v’s node is a descendant of u’s node
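The three-way engine/rotor/compound pattern from above can be phrased in the same style. With hypothetical pre/post rank columns added to the BOM table of Figure 1.1, both containment conditions become simple range predicates:

SELECT e.ID AS engine, r.ID AS rotor, c.ID AS compound
FROM BOM e, BOM r, BOM c
WHERE e.Kind = 'engine' AND r.Kind = 'rotor' AND c.Kind = 'compound'
  AND r.pre > e.pre AND r.post < e.post    -- e contains r
  AND r.pre > c.pre AND r.post < c.post;   -- r is contained in c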


Hierarchy1:                 Input1:

Node  ID    Weight          Node  Value
A1    'A1'  NULL            B1    10
B1    'B1'  0.5             C1    100
B2    'B2'  0.5             C2    200
C1    'C1'  0.4             D1    1000
C2    'C2'  0.6             D2    2000
C3    'C3'  0.25            D3    3000
C4    'C4'  0.75
D1    'D1'  0.8
D2    'D2'  0.2
D3    'D3'  1.0

[Tree diagram: A1 with children B1 and B2 (edge weights 0.5/0.5); B1 with children C1 and C2 (0.4/0.6); B2 with children C3 and C4 (0.25/0.75); C3 with children D1 and D2 (0.8/0.2); C4 with child D3 (1.0).]

Figure 2.3: A hierarchy with weighted edges represented by table Hierarchy1, and a table Input1 of numeric input values attached to some of the nodes.

Note that the hierarchy may potentially contain a large number of intermediate nodes between any two nodes matching this pattern. Intermediate nodes merely provide connectivity between the nodes of interest; they otherwise are not relevant to the query. This leads us to an important design goal concerning the evaluation of such queries: The evaluation of a pattern should not be required to “touch” (or iterate over) the intermediate nodes in any way. In particular, the number of intermediate nodes should be insignificant to the overall performance. To achieve this goal, we require appropriate hierarchy indexes that efficiently maintain the connectivity information, as well as join operators that can leverage those indexes. Many relevant indexing and query processing techniques have been studied previously in the domain of XML databases [11, 41, 53, 118]. The problem of efficiently evaluating pattern matching queries can thus be considered as largely solved already—albeit in a different context.

Hierarchical Computations. When it comes to the third category of queries, things are less obvious. Consider the example hierarchy in Figure 2.3, where the edges are weighted and some nodes are associated with numeric values given by table Input1. If Hierarchy1 modeled a bill of materials, for example, ID could be a foreign key referencing the part master data, Weight the part quantity, and Value the prices of selected components. Now, suppose we want to compute weighted sums of those values bottom up—how can we state a SQL expression that correctly incorporates the weights? To see how this can work, we need to take a closer look at the intended semantics. In general, a hierarchical computation propagates and accumulates data—usually numeric values—along the hierarchy edges. Data flow can happen either in the direction towards the root (bottom up) or away from the root (top down). The computation input may include the “static” labels stored with the entities themselves (e. g., ID and Weight in Hierarchy1). However, in realistic queries the input is generally the result of an arbitrary subquery, which associates the hierarchy nodes with “dynamic” input values


Input1 (a):    Output1 (b):   Result1 (c):    Input2 (d), with result column x (e):

Node  Value    Node           Node  Result    Node  ID    Weight  Value  x
B1    10       A1             A1    6310      C1    'C1'  0.4     100    100
C1    100      B1             B1    310       C2    'C2'  0.6     200    200
C2    200      C1             C1    100       B1    'B1'  0.5     10     310
D1    1000                                    D1    'D1'  0.8     1000   1000
D2    2000                                    D2    'D2'  0.2     2000   2000
D3    3000                                    C3    'C3'  0.25    NULL   3000
                                              D3    'D3'  1.0     3000   3000
                                              C4    'C4'  0.75    NULL   3000
                                              B2    'B2'  0.5     NULL   6000
                                              A1    'A1'  NULL    NULL   6310

Figure 2.4: Example tables. — (a/b) input/output nodes for binary grouping; (c) result of a bottom-up rollup based on Input1; (d) combination of Hierarchy1 and Input1 for unary grouping; (e) result of a bottom-up rollup based on Input2

(like table Input1). Not all nodes potentially carry an input value, and not all nodes will be equally interesting in the result. Consider an analytic scenario as in § 2.2.3, with a fact table of sales data and a dimension hierarchy of products and product groups. Here, the computation input could be the (pre-aggregated) sales amounts that are associated with the products (the leaf nodes) via join. A common task in scenarios of this kind is to sum up the revenue of a selection of products—say, “type A”—along the hierarchy and report these sums for certain product groups visible in the user interface—say, the three uppermost levels. This example represents a type of hierarchical computation with two particular characteristics: First, only a subset of nodes carry an input value. We call these the input nodes. Second, the set of input nodes is comparatively small and mostly disjunct from the nodes that after the computation carry a result we are interested in. We call those the output nodes. As input and output nodes are naturally determined by two separate subqueries, we refer to this type of hierarchical computation as binary structural grouping. “Structural” here alludes to the role the hierarchy structure plays in forming groups of tuples: In the top-down case, the group of each node consists of its ancestors; in the bottom-up case, it consists of the descendants. In our example we are computing a simple sum over each group.

In both SQL and relational algebra, binary grouping queries exhibit a join–group–aggregate pattern, with an inner or left outer join on a hierarchy axis such as descendant, and subsequent grouping of the outer side. The following example statement computes a bottom-up rollup based on Hierarchy1, given the input nodes Input1 and the output nodes Output1 in Figure 2.4b (but neglecting the edge weights for now):

SELECT t.*, SUM(u.Value) AS Result
FROM Output1 t LEFT OUTER JOIN Input1 u
  ON u.Node = t.Node OR u's node is a descendant of t's node in Hierarchy1
GROUP BY t.*


In our example this yields table Result1 (Figure 2.4c). From an expressiveness point of view, join–group–aggregate statements are a sufficient and, to most SQL users, a fairly intuitive way of specifying a hierarchical computation. They are not fully satisfactory, though: they lack conciseness, since conceptually a table of tuple pairs representing the group associations must be assembled by hand prior to grouping, and the fact that a top-down or bottom-up hierarchical computation is being done is somewhat disguised. Regarding their evaluation, join–group–aggregate query plans perform acceptably when the aggregation (in the example: SUM) is cheap to compute and the set of output nodes is rather small. However, there is a major efficiency issue: for each output node, the computation naïvely sums up the values of all matching input nodes, while ideally we would reuse results from previously processed output nodes. In our example, to compute the sum for A1 we can save a few arithmetic operations by reusing the sum of B1 and adding just the input values of D1, D2, and D3. To enable such reuse, hierarchy-aware algorithms are needed. The binary grouping operators in our framework process output nodes (i. e., the left join input) in the natural top-down or bottom-up order—which ensures each output node is processed after any of its covered nodes—and memorize any results that can be reused for upcoming output nodes. Thereby they overcome the mentioned efficiency issues.

Binary structural grouping is not suitable for all types of hierarchical computations, however. In cases where there is no clear distinction between input and output nodes, unary structural grouping is more natural. Unary structural grouping works on a single table and inherently yields a result for every tuple. Every node essentially acts as both an input and an output node. This is comparable to a binary grouping query where the output and input nodes and the input values have been fused into one table and a self-join is performed. For example, Input2 in Figure 2.4d combines the nodes of Hierarchy1 with Input1. Since only input nodes carry meaningful values, "output-only" nodes are assigned NULL as a neutral Value. Figure 2.4e shows the new column x resulting from a unary bottom-up rollup computing the (unweighted) SUM on that table. In SQL, there is currently no more direct way to express this than a self-joining join–group–aggregate statement. What we would ideally be able to write is:

SELECT t.*, t.Value + SUM(u.Value over all covered nodes u) AS x
FROM Input2 t with bottom-up unary structural grouping based on Hierarchy1

Besides lending themselves well to certain computations, unary grouping queries also afford more powerful types of aggregations: In theory, we can evaluate a structurally recursive aggregation formula against the data. Such a formula, when evaluated for a node v in the table (say, B2 in Input2), is conceptually first evaluated in a recursive manner for its directly covered nodes (C3 and C4), and the evaluation results become available in the computation for v itself. Using structural recursion we can state the above rollup in yet another way:

SELECT t.*, t.Value + SUM(u.x over directly covered nodes u) AS x
FROM Input2 t with bottom-up unary structural grouping based on Hierarchy1

Note that the computation for each node (in the bottom-up case) no longer goes over all descendants as in our previous binary grouping example; only direct input


nodes are included in the sums. For B1 the value x would be computed as 10 + (x of C1) + (x of C2), and overall this would yield the same result column x as in Figure 2.4e. Finally, we can now express the weighted rollup of our opening example by simply multiplying in the Weight within the SUM expression (see the sketch below). Thus, unary structural grouping combined with structurally recursive aggregation expressions goes far beyond what basic join–group–aggregate statements are able to express. However, it requires the system to "know" the hierarchy structure, so it can apply the expression in a bottom-up manner. Given this ability, enhancing SQL with a corresponding syntax turns out to be surprisingly straightforward, as we will see in § 3. Regarding the evaluation, note that—unlike binary grouping—unary grouping with structural recursion makes reusing the results of covered nodes explicit and thus inherently translates into an efficient evaluation approach.
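In the same pseudo-syntax, the weighted rollup can be sketched as follows (an informal sketch only—each covered tuple's weight is multiplied into its contribution; the formal constructs follow in § 3):

SELECT t.*, t.Value + SUM(u.Weight * u.x over directly covered nodes u) AS x
FROM Input2 t with bottom-up unary structural grouping based on Hierarchy1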

2.2.5 Summary of Requirements and Challenges

Following the discussion of application scenarios in this section, we can now summarize the requirements that a database system needs to fulfill in order to exhibit decent support for hierarchical data.

#1 Tightly integrate relational and hierarchical data. First and foremost, any support for hierarchies in databases must harmonize with relational concepts, both on the data model side and on the query language side. Business data today still resides mainly in pure relational tables, and from the examples in this section it becomes clear that queries on hierarchies will routinely need to refer to such “pure” data as well. In particular, it is of major importance that joining hierarchical and relational data works in a straightforward and frictionless way.

#2 Provide expressive and intuitive extensions to SQL. The language constructs must be expressive enough to support all typical tasks of defining, manipulating, and querying hierarchies. Apart from expressiveness, the resulting statements must be intelligible, so that programmers are able to intuitively understand, adopt, and leverage the functionality they provide. At the same time, an eye must be kept on light syntactic impact: Where appropriate, existing mechanisms of SQL should be reused or enhanced rather than replaced with new inventions. This not only minimizes training efforts for users who are already familiar with SQL, but also reduces implementation complexity and adoption barriers for existing relational database systems.

#3 Support explicit modeling of hierarchical data. When designing a relational database schema from scratch, a database engineer initially has to decide on a data model to represent a hierarchy and in particular to encode its structure—for example, using a self-referencing table as we have seen in § 2.1. Even though relational tree encodings have been thoroughly studied in the literature (cf. § 2.3), carefully choosing and implementing one of them still requires advanced knowledge. What's more, the non-trivial encodings tend to clutter the table schema and disguise the hierarchical nature of the


data. What is required is a way to explicitly model a hierarchical table in the schema, using abstract data definition constructs that hide the gory implementation details.

#4 Enable and facilitate non-trivial queries involving hierarchies, by offering convenient query language constructs and corresponding backend support. Besides basic queries on hierarchy nodes, we discussed two types of queries in § 2.2.4 that must be supported in particular: structural pattern matching and hierarchical computations.

#5 Support updates to the hierarchy structure. Hierarchies in transactional databases are usually dynamic and undergo regular updates. In some applications only basic insertions or removals of individual leaf nodes are performed, while in other cases complex operations such as bulk-relocations of arbitrarily large subtrees are required. As an example, consider the asset management scenario of § 2.2.3. In an automotive company, such a hierarchy would contain a large number of machines, robots, and mechanical tools. The assets would need to be relocated in bulk when a new production line is established. In [31] we conducted an analysis of the asset hierarchy of an SAP ERP customer, and found that as much as 31% of the recorded update operations were subtree relocations. The database system must therefore provide a data manipulation interface that supports various types of structural changes—including both leaf and subtree manipulations—and appropriate backend support for their efficient execution.

#6 Enforce structural integrity. Support for manipulating the hierarchy structure goes hand in hand with the requirement of ensuring its integrity. To this end, the system must prevent the user from inserting edges that would violate the strictness properties of the hierarchy (see § 2.2.2). Furthermore, it must ensure that an entity cannot be removed as long as it still has subordinate entities, which would become orphans in the hierarchy.

#7 Support legacy applications. In live applications, modeling and maintaining hierarchies explicitly (#3) by designing a green-field database schema or by extending an existing schema is not always an option. In existing databases, hierarchies are necessarily represented in variants of the data models we discussed in § 2.2.1. An important requirement thus is to allow users to take advantage of the extended functionality for hierarchies in an ad-hoc way on the basis of existing data, without forcing them to modify the schema. For that purpose, means to create a derived hierarchy from common data formats are required. This process must be supported in the backend by efficient transformation and bulk-building operations.

#8 Cope with very large hierarchical datasets comprising millions of entities. All provided language constructs must have strict performance guarantees and allow themselves to be translated into highly efficient execution plans. To achieve high query performance, backend support in the form of sophisticated hierarchy indexing schemes and physical algebra operators is indispensable. The backend also needs to adapt to different scenarios: For hierarchies that are never updated—in particular derived hierarchies (#7)—read-optimized static indexing schemes may be used. In contrast, dynamic


scenarios (#5) demand dynamic indexing schemes that provide an adequate tradeoff between query and update performance.

A system fulfilling these requirements will enable application engineers to implement all logic for handling hierarchies directly at the database layer, or to push parts of the logic in an existing application down to the database. This promises significant maintainability benefits, makes it easier to ensure the consistency of the hierarchical data, allows for novel kinds of queries on the hierarchies, and enables the applications to handle even large-scale datasets with excellent performance.

2.3 Status Quo: Integrating Hierarchical Data and DBMS

At this point we have a fair understanding of the scenarios we intend to support. In terms of DBMS history, the problem of representing, manipulating, and querying hierarchical data is very old and has been studied many times in the past. We now survey the state of the art with a focus on the off-the-shelf functionality available in today's relational database systems, where hierarchies are stored as basic tables, but we also look into various related areas of database research and technology. We cover the strongest solutions we found and assess them against our requirements stated in § 2.2.5.

2.3.1 The Adjacency List Model

In a relational database, every hierarchy modeled in the application becomes a flat table which encodes the structure in terms of one or more table columns of basic SQL data types. Although there is a lot of freedom in how this can be done, in practice one very frequently encounters the basic self-referencing table model we introduced in § 2.1 (Figure 2.1d), or minor variants thereof. In the literature (e. g., [13]) it is also referred to under the term adjacency list model—although that is somewhat misleading, as the resulting table is more comparable to an "edge list." To support accessing tuples by their ID or PID, two (usually B-tree-based) indexes can be defined. Furthermore, to represent a sibling order, a dedicated numeric field is sometimes added. Under the name edge mapping, the indexed and ordered variant has for example been applied to storing XML documents in RDBMS (see [36, 39, 40]).

The adjacency list model is perhaps the most effortless approach to store a hierarchy in a table, which explains its popularity. However, one pays for its simplicity when it comes to updating and querying the table. Having two indexes is not economical in space—especially when the keys are of a string type—and costly to maintain (not to speak of the Order field). Regarding queries, the foreign key self-reference only gives us access to the direct parent and children of a node. Anything beyond that is not directly supported and requires iteration or recursion (i. e., a level-by-level search); see the following sections. Due to the many drawbacks, practical guidelines such as [13] and academic works often recommend more sophisticated encodings, which represent the "is below" relationship in a more implicit way along with other information that helps in answering common queries. We will cover many of those in § 2.3.6 and § 4.2.
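For illustration, a minimal self-referencing table with the two mentioned indexes might be declared as follows (a sketch; column and index names are illustrative):

CREATE TABLE Hierarchy (
  ID  INTEGER PRIMARY KEY,                 -- node key; backed by the first index
  PID INTEGER REFERENCES Hierarchy (ID),   -- parent key; NULL for root nodes
  Ord INTEGER                              -- optional: explicit sibling order
);
CREATE INDEX Hierarchy_PID ON Hierarchy (PID);  -- supports finding the children of a node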


2.3.2 Recursion in SQL

The structure of hierarchical data is recursive by nature. Thus, queries against a hierarchy sometimes require recursion as well. In particular when the structure is encoded as a trivial adjacency list (§ 2.3.1), even basic tasks—such as determining all descendants of a node—require iteration or recursion. But even with more powerful encodings, the user may wish to evaluate a structurally recursive expression against the hierarchy, as we explored in § 2.2.4. Not all database systems offer support for recursive queries. Users therefore often resort to workarounds in application-level logic, which is of course unsatisfying from an expressiveness, efficiency, and maintainability point of view. One mechanism that some systems do provide is the ability to define custom stored procedures, which may use iterative loops or even recursive calls. The following example uses standard SQL/PSM syntax [97] and is adapted from § 2.4.1 of [13]. Starting from a given node, a WHILE loop moves up the hierarchy toward the root and performs an action for each encountered node:

CREATE PROCEDURE TraverseUpwards (IN n INTEGER)
LANGUAGE SQL DETERMINISTIC
WHILE EXISTS (SELECT * FROM Hierarchy WHERE Node = n) DO
BEGIN
  CALL AnotherProcedure(n);  -- process the current node
  SET n = (SELECT Parent FROM Hierarchy WHERE Node = n);
END WHILE;

Stored procedures are an improvement over application-level code, as they get executed much closer to the actual data they operate on. No per-iteration data transfers between the application layer and the database layer are necessary. However, an unfortunate practical issue is that many database systems use a proprietary syntax for stored procedures, and each offers a slightly different feature set. As in our example, the procedures usually need to be hard-coded against a particular table. Furthermore, they are often executed as a black box, which hinders optimization and creates friction losses. To achieve better performance, it is desirable to have a mechanism that is deeply integrated into the execution engine. We will examine two such mechanisms: Hierarchical Queries (§ 2.3.3) and Recursive Common Table Expressions (RCTEs, § 2.3.4). Unlike stored procedures, RCTEs can straightforwardly be translated into holistic execution plans and are thus (in practical terms) more amenable to optimization. Refer to Ordonez et al. ([82] and [83]) for a recent survey of optimization techniques.

A merit of all three mentioned approaches is that they are general, flexible, and computationally powerful in the sense that they can potentially generate entirely new pieces of data and recur on them. (Some authors refer to this as generative recursion as opposed to structural recursion [30].) Working with adjacency-list-encoded hierarchies (§ 2.3.1) is therefore just one of their many use cases. But the power and generality of recursive stored procedures and RCTEs comes with two major drawbacks: They are difficult for the user to write and maintain, and they are comparatively hard for the system to optimize. Also, for the purposes of hierarchical computations (§ 2.2.4), the computational power of generative recursion is less relevant: It turns out that structural recursion—where the recursion tree is essentially predetermined, in our case by


the hierarchy itself—suffices for typical tasks. As this concept does not exist in SQL yet, we add language extensions that focus on structural recursion. They enable computations that even go beyond what is practically possible with RCTEs, while minimizing the syntactic overhead. As an added benefit, the simple nature of structural recursion leaves more room for optimizations and efficient evaluation techniques, as we will see in § 5.1.7.

2.3.3 Hierarchical Queries in Oracle Database

Hierarchical Queries are a proprietary SQL extension for "connecting" and traversing recursively structured data. They have been a part of Oracle Database for about 35 years (§ 9.3 of [80]), and a part of IBM DB2 for i since 2001 [45]. A hierarchical query is an extended SELECT statement using the constructs START WITH, CONNECT BY, and ORDER SIBLINGS BY. The following example is slightly adapted from [80]:

SELECT ID, LEVEL, SYS_CONNECT_BY_PATH(ID, '/') "Path"
FROM Employees
WHERE LEVEL <= 3
START WITH Name = 'King'
CONNECT BY PRIOR ID = PID
ORDER SIBLINGS BY Name

The wording of the constructs clearly hints at their intended use for traversing hierarchical data in the adjacency list format. They inform the database engine about the "is below" relationship in the table, the intended roots, and the sibling order. To work with the so-created hierarchy, the projection list may include special pseudo columns such as LEVEL and built-in functions such as SYS_CONNECT_BY_PATH (for obtaining a string representation of the root path) and CONNECT_BY_ROOT (for accessing the root row). The underlying recursion mechanism is conceptually similar to RCTEs. Most functionality of hierarchical queries can be expressed straightforwardly using RCTEs [75], and Oracle Database appears to execute them using a similar technique (a CONNECT BY PUMP operator that iteratively adds rows to a union). The following discussion therefore applies to hierarchical queries as well.

2.3.4 Recursive Common Table Expressions

Recursive Common Table Expressions (RCTEs) are a mechanism to compute the transitive closure of a recursively defined table using iterative fixpoint semantics. They were introduced to standard SQL in 1999 [35, 97], and superseded Hierarchical Queries (§ 2.3.3) and several previous proposals for recursion support such as Recursive Union (also critiqued in [35]). While RCTEs are general and powerful in theory, an approach relying only on the adjacency list model and RCTEs to query hierarchies would leave a lot to be desired, as we will see. The following example shows how basic navigation works. It starts at node A1 of the hierarchy in Figure 2.1d (p. 8) and produces an (unordered) list of its descendants:


WITH RECURSIVE ER (ID, PL, r_ID, r_PL, r_Kind) AS (
    SELECT e.ID, e.Payload, e.ID, e.Payload, e.Kind
    FROM BOM e WHERE e.Kind = 'engine'
  UNION ALL
    SELECT e.ID, e.PL, r.ID, r.Payload, r.Kind
    FROM BOM r JOIN ER e ON r.PID = e.r_ID
),
CER (e_ID, e_PL, r_ID, r_PL, c_ID, c_PL, c_Kind, PID) AS (
    SELECT ID, PL, r_ID, r_PL, r_ID, r_PL, r_Kind, r_ID
    FROM ER WHERE r_Kind = 'rotor'
  UNION ALL
    SELECT e.e_ID, e.e_PL, e.r_ID, e.r_PL, c.ID, c.Payload, c.Kind, c.PID
    FROM BOM c JOIN CER e ON e.PID = c.ID
)
SELECT e_ID, e_PL, r_ID, r_PL, c_ID, c_PL
FROM CER WHERE c_Kind = 'compound'

Figure 2.5: A structural pattern matching query, expressed using two RCTEs.

WITH RECURSIVE RCTE AS (
    SELECT * FROM Entity WHERE ID = 'A1'
  UNION ALL
    SELECT v.* FROM RCTE u INNER JOIN Entity v ON v.PID = u.ID
)
SEARCH DEPTH FIRST BY ID SET Ord
SELECT * FROM RCTE ORDER BY Ord

The result includes all fields of the original Entity table, plus a generated column Ord, which can be used to arrange the rows in depth-first order. The de facto standard technique to evaluate an RCTE is the semi-naïve algorithm described in [35]: It first evaluates the non-recursive part of the UNION, then iteratively evaluates the recursive query expression and the UNION operation until no new rows are generated, which means a fixpoint has been reached.

To see RCTEs in action in a less trivial example, we revisit the pattern matching query from § 2.2.4, which determines combinations of three nodes (e, r, c) in a particular hierarchical relationship. Figure 2.5 shows the RCTE-based solution. It includes the ID and the Payload of each of the three nodes in the result. The statement uses two RCTEs: one starting from an engine e and navigating downwards from e to a rotor r, the other navigating upwards from r to a compound c. This example clearly conveys the major downsides of an RCTE-based approach: The statements are convoluted and tedious to write down, and not intelligible to their readers. In our examples, it is not immediately obvious that a hierarchy is being traversed, or in which direction the traversal proceeds. Seemingly basic tasks are surprisingly difficult to express (let alone evaluate): In order to determine basic node properties such as the depth of a node, the user must manually specify the computation using arithmetics within the RCTE. As a consequence of the trivial adjacency list model, structural patterns other than the basic descendant and ancestor axes are not supported.
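To make the semi-naïve fixpoint procedure mentioned above tangible, the descendants query from the beginning of this section could be evaluated roughly as follows, sketched in SQL/PSM style (an illustration only; Result, Delta, and Delta2 are hypothetical work tables—actual engines use dedicated internal operators instead):

INSERT INTO Delta SELECT * FROM Entity WHERE ID = 'A1';   -- non-recursive part
INSERT INTO Result SELECT * FROM Delta;
WHILE EXISTS (SELECT * FROM Delta) DO
BEGIN
  -- evaluate the recursive part against the previous round's rows only
  INSERT INTO Delta2 SELECT v.* FROM Delta u INNER JOIN Entity v ON v.PID = u.ID;
  INSERT INTO Result SELECT * FROM Delta2;
  DELETE FROM Delta;
  INSERT INTO Delta SELECT * FROM Delta2;
  DELETE FROM Delta2;
END WHILE;   -- fixpoint reached: the last round produced no new rows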


The reason why RCTEs become bulky very quickly is that they combine three separate tasks into one statement: First, imposing a hierarchy structure onto the base table (by specifying a join condition that identifies the relevant ID and PID fields); second, "discovering" the hierarchy nodes on the fly by navigating it via joins; and third, computing the actual data of interest. These tasks are inseparably intertwined.

In addition, there are two less obvious issues that render RCTEs virtually infeasible when it comes to hierarchical computations: First, to enable semi-naïve fixpoint evaluation (i. e., to guarantee stratification), several SQL constructs are disabled within the recursive query expression of an RCTE (see [35] for details). In particular, GROUP BY is forbidden, as aggregation crossing recursion can violate monotonicity in subtle ways. The GROUP BY therefore has to be placed outside of the RCTE definition, which essentially results in a join–group–aggregate statement with a very expensive recursive join. Second, RCTEs inherently require any hierarchical computation to be phrased in an iterative way. This means that, even if we ignore any concerns about readability and performance, it is virtually impossible to use RCTE-based recursion with GROUP BY to achieve the semantics of structural grouping, in particular the bottom-up case. Such a computation would have to be stated in a bottom-up iterative way, starting at the "lowermost" tuples in the input, then sweeping breadth-first over the input by iteratively joining in the "next higher" tuples. However, neither "lowermost" nor "next higher" can be reasonably expressed with the adjacency list model (and would require unwieldy EXISTS subqueries even with more powerful encodings). Even if that were possible and GROUP BY were allowed within the RCTE, the grouping would not necessarily capture all relevant covered nodes within a single iteration: In an irregular hierarchy the "lowermost" nodes will already be on different levels to begin with.

What we have discussed so far are mainly issues of expressiveness. When it comes to performance, an RCTE-based approach also bears some inherent inefficiencies:

• Repeated self-joins can be expensive. Each join can potentially produce a large intermediate result, and the overall number of iterations depends on the height of the hierarchy. (Our evaluation in § 6 quantifies this further.)

• Often, attributes of interest to the user (Payload in our example) must be materialized early and carried along through the recursion, which is costly.

• The RCTE necessarily has to navigate over all intermediate nodes between the actual nodes of interest. In the example of Figure 2.5, we are interested only in qualifying ancestor/descendant pairs (e, r) and (r, c), but not in any nodes in between. Touching the intermediate nodes would violate our requirement of § 2.2.4 that the query runtime should not depend on the number of intermediate nodes; it should ideally be linear in the number of e, r, and c candidates.

Finally, RCTE statements have an imperative aspect that violates SQL's declarative nature. Consider again Figure 2.5. An alternative solution to the same problem would be to use a different join order: start with r, then navigate to e and c from there. Yet another option would be to first materialize all possible ancestor/descendant combinations (u, v) using a single RCTE, and then use two non-recursive joins on the resulting


table to match the pattern. For large hierarchies this option is usually inferior due to the larger intermediate result. The point is that it is up to the user to choose the most appropriate strategy for answering the query, and the query optimizer is tightly constrained by that choice. This can easily result in severe performance penalties.

All in all, we do not believe RCTEs to be a practical solution for our use cases. Writing RCTEs is an "expert-friendly" and error-prone task in terms of achieving correctness, intelligibility, and robust performance. As a general mechanism, RCTEs do have interesting uses beyond traversing hierarchical data, and we do not intend to render them obsolete. That said, applications relying on our framework will never have to resort to RCTEs for typical tasks.

2.3.5 Hierarchical Computations via ROLLUP

In Figure 2.2d (p. 11) we showed a common data model for hierarchies that is based on a denormalized table. This model is quite restrictive in that it accommodates only homogeneous and balanced hierarchies. Nevertheless, it seems to be dominant today for modeling dimension hierarchies in data warehouses [93, 100]. The Common Warehouse Metamodel [23] refers to this model under the term level-based hierarchy. Indeed, there is a particular use case where tables of this type shine: bottom-up hierarchical rollups computing simple aggregations such as COUNT and SUM. This is a routine task in analytic applications (see § 2.2). Computations of this type can easily be expressed on level-based tables using SQL's basic grouping mechanisms GROUP BY and GROUPING SETS. SQL in fact has another feature that is targeted at exactly this scenario: the ROLLUP construct [38, 81, 97]. Given a sales table with a hypothetical geographic dimension hierarchy as outlined in § 2.2.3, we could for example write:

SELECT Country, State, Region, SUM(Sale)
FROM Sale
GROUP BY ROLLUP (Country, State, Region)

Technically, the construct is merely syntactic sugar for GROUPING SETS. A ROLLUP with k dimensions (C1, ..., Ck) is equivalent to a GROUPING SETS clause with k + 1 grouping sets, corresponding to all prefixes of the sequence:

GROUP BY GROUPING SETS ( (C1, ..., Ck), (C1, ..., Ck−1), ..., (C1, C2), (C1), () )

Thus, it does not enable any computations that are not possible otherwise.
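For the geographic example above (k = 3), the ROLLUP therefore expands to:

SELECT Country, State, Region, SUM(Sale)
FROM Sale
GROUP BY GROUPING SETS ( (Country, State, Region),
                         (Country, State),
                         (Country),
                         () )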


A merit of using the existing grouping operators is that they tend to be heavily optimized in commercial DBMS. If the aggregates are algebraic or distributive [38], they are also well-suited to parallelization, and the results for higher levels can be derived from the lower levels. Rollup computations can therefore yield high performance. On the other hand, there are tight constraints on both the structure of the underlying hierarchy and the class of computations that can be expressed. When the hierarchy is inhomogeneous or exhibits an irregular structure—where the depth is unbounded and nodes on a level may be of different types—or when the computations are more involved than simple rollups, the approach fails. Many of the scenarios of § 2.2.3 do indeed feature such complex hierarchies. We therefore believe that, while ROLLUP is attractive due to its simplicity and performance, users would welcome less restrictive data models and enhanced functionality for hierarchical computations. Our framework is able to provide both.

2.3.6 Encoding Hierarchies in Tables

In the previous sections we have seen the shortcomings of approaches based on the adjacency list model (§ 2.3.1) or the level-based model (§ 2.3.5). Some of these shortcomings can be overcome by representing the hierarchy structure more cleverly in the database schema. Such representations, which rely entirely on the means of the relational model and standard SQL (as opposed to specific engine support), are commonly referred to as labeling schemes. In general, a labeling scheme represents a hierarchy primarily as a set of nodes, and attaches to each node a unique label. A label is a fixed number of data items—in our setting, table fields. The edges are not represented explicitly. Labeling schemes have been studied extensively [2, 7, 12, 41, 44, 46, 48, 60–63, 73, 79, 99, 113, 115, 118]. Especially in the context of XML, a large body of relevant research exists (see also § 2.3.9). We can categorize the various approaches as follows:

• Naïve labeling schemes use trivial labels. An example is the adjacency list model, where the label is simply the key of the parent. They are easy to implement and thus widespread in practice. However, in comparison to "full-blown" labeling schemes they do not provide sufficient support for queries and updates.

• Containment-based labeling schemes, also known as order-based schemes, label each node with a [lower, upper] interval or similar values. As the term "containment" alludes to, their main property is that the interval of each node is contained in (or nested into) the interval of its parent node.

• Path-based labeling schemes, sometimes called prefix-based codes, label each node with the full path from the root to the node using variable-length labels.

Figure 2.6 shows three examples: (a) is a simple and often-cited nested intervals labeling [44, 48, 63, 104, 105, 114, 118]. We can see that node D3 is a descendant of node A1, because its interval [16, 17] is a proper subinterval of [1, 20]. Example (b) shows the conceptually similar pre/post labeling scheme, as for example studied in [39–41]. Here each node is labeled with its preorder and postorder ranks. Example (c) shows a Dewey labeling [99], which is perhaps the most basic path-based scheme. It builds upon the sibling rank, the 1-based position of a node among its siblings. A label consists of the label of the parent node, a separating dot, and the sibling rank. In the example, the Dewey label of C4 is 1.2.2, as it is the second child of B2, which is the second child of A1, which is the first root. Note how all three schemes naturally represent the sibling order. We will study further types of labeling schemes in § 4.2. For now, we are mainly interested in their basic properties.


In general, all queries against the hierarchy are answered by considering only the labels. With containment-based schemes, this means testing whether the intervals of the involved nodes overlap (a constant-time operation); with path-based schemes, this means decomposing and comparing the variable-length path strings. Labeling schemes obviate the need for recursion to perform basic axis checks, which is a big gain in usability over the naïve adjacency list model. For example, we can straightforwardly translate the abstract “v is a descendant of u” condition of § 2.2.4 into SQL:

Nested Intervals:  v.lower > u.lower AND v.upper < u.upper
Pre/Post:          v.pre > u.pre AND v.post < u.post
Dewey:             u.path is a proper prefix of v.path
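For instance, with the nested intervals labels of Figure 2.6a, all descendants of B2 can be enumerated with a plain self-join (a sketch; lower and upper are the label columns):

SELECT d.*
FROM Hierarchy a, Hierarchy d
WHERE a.ID = 'B2'
  AND d.lower > a.lower AND d.upper < a.upper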

Finding and enumerating the labels can be further supported by defining B-tree or hash indexes over the label columns. Updates to the hierarchy structure are generally executed by relabeling the affected nodes and updating the indexes accordingly.

It is a major challenge to design a labeling scheme that supports all anticipated operations well while balancing the classic tradeoff between query and update performance, especially when considering large hierarchies and high rates of non-trivial update operations in potentially skewed patterns. We cover how state-of-the-art labeling schemes attempt to tackle this challenge in § 4.2. In any case, solutions based on labeling schemes are always rather specialized. They need to be carefully chosen and manually implemented for the application scenario at hand. The choice largely dictates (or restricts) the types of hierarchies that can be represented and the types of queries and updates that can be expressed conveniently. As this is a non-trivial task, entire books such as [13] have been written to provide recipes to SQL developers. They usually come with sets of hard-wired SQL snippets that implement particular schemes and their supported query and update operations.

Ideally, we want the database system to take over these tasks for us. Our approach therefore is to provide an abstract data type that hides the underlying encoding details completely. It guarantees a decent feature set regardless of the actual representation, and thus relieves the user from dealing with the complexities and limitations of a custom implementation. At the same time, the database kernel is given the means to employ special-purpose algorithms and data structures, which allow it to easily outperform most hand-implemented "pure" labeling schemes.

2.3.7 hierarchyid in Microsoft SQL Server

The built-in hierarchyid data type introduced in SQL Server 2008 [70, 78] is a comparatively recent example of an RDBMS vendor investing in enhancements to simplify working with hierarchical data. The hierarchyid type represents a path string. A collection of hierarchyid values can thus be used to represent an ordered tree. A node can effectively be inserted and relocated by assigning the row a new hierarchyid value. Such values can be generated from a string representation (e. g., '/2/1/1/') or from existing values using methods such as GetReparentedValue and GetDescendant.


(a) Nested intervals:
  (A1, [1, 20])  (B1, [2, 7])   (C1, [3, 4])     (C2, [5, 6])     (B2, [8, 19])
  (C3, [9, 14])  (D1, [10, 11]) (D2, [12, 13])   (C4, [15, 18])   (D3, [16, 17])

(b) Pre/post:
  (A1, 1, 10)    (B1, 2, 3)     (C1, 3, 1)       (C2, 4, 2)       (B2, 5, 9)
  (C3, 6, 6)     (D1, 7, 4)     (D2, 8, 5)       (C4, 9, 8)       (D3, 10, 7)

(c) Dewey:
  (A1, 1)        (B1, 1.1)      (C1, 1.1.1)      (C2, 1.1.2)      (B2, 1.2)
  (C3, 1.2.1)    (D1, 1.2.1.1)  (D2, 1.2.1.2)    (C4, 1.2.2)      (D3, 1.2.2.1)

Figure 2.6: Example labeling schemes representing the hierarchy of Figure 2.2a.

For querying nodes, the data type provides self-explanatory methods such as GetLevel, IsDescendantOf, GetAncestor, CommonAncestor, GetRoot, and conversions from and to strings and VARBINARY. The following example procedure lists all nodes below B2:

DECLARE @v hierarchyid;
SELECT @v = Node FROM Hierarchy WHERE ID = 'B2';

SELECT *, Node.ToString() AS Position, Node.GetLevel() AS Depth
FROM Hierarchy
WHERE Node.IsDescendantOf(@v) = 1;
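The index recommendations discussed next might look roughly as follows in SQL Server syntax (a hedged sketch; the index names and the computed Lvl column are illustrative):

-- depth-first order: index the hierarchyid column itself;
-- UNIQUE also prevents two rows from representing the same node
CREATE UNIQUE INDEX Hierarchy_DFS ON Hierarchy (Node);
-- breadth-first order: all nodes of one level are stored together
ALTER TABLE Hierarchy ADD Lvl AS Node.GetLevel() PERSISTED;
CREATE INDEX Hierarchy_BFS ON Hierarchy (Lvl, Node);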

To speed up queries, the reference manual [70] recommends that users add an ordinary B-tree index to maintain the rows in depth- or breadth-first hierarchyid order, and a unique index to guarantee the hierarchy actually forms a tree. Overall, this design is comparable to our hierarchical table model from a syntax and data model perspective. However, there are also several major differences:

First, hierarchyid is basically a plug-in based on SQL Server's Common Language Runtime (CLR). To our knowledge, it does not come with deeper SQL language and engine integration. We go a step further by employing special-purpose hierarchy indexes instead of ordinary B-tree indexes, as well as physical algebra operators integrated into the execution engine.

Second, the hierarchyid type is provided as a simple tool for modeling a hierarchy; yet, a table with such a field does not necessarily represent a valid hierarchy. By design, the system does not enforce the structural integrity of the represented tree. For example, it does not guarantee the uniqueness of generated values, so multiple rows could represent the same node. It also does not require the direct parent of a node to exist, so one could accidentally "orphan" a subtree by deleting its parent. It is up to the user to generate and manage the values in a reasonable way. In contrast, we are more stringent in that regard and guarantee structural integrity at any time (Requirement #6 of § 2.2.5). Of course, this design choice comes at a price, as consistency has to be checked on each update, but it helps to avoid surprising results from queries. Our indexes make these consistency checks comparatively cheap.

Third, as another difference, we opt to provide flexibility regarding the underlying indexing scheme. The encoding scheme used behind the hierarchyid type is Ordpath [79], a path-based scheme.


It is hard-wired into the data type; in particular, conversions from and into a path string are directly supported, which would be hard to compute efficiently with encoding schemes that are not path-based. While it is claimed to be "compact and update-friendly" [79], Ordpath has a few inherent deficiencies that are common to all path-based schemes: First, to relocate a subtree one has to update all hierarchyid values in that subtree. Second, the representation is variable-length (and allowed to grow up to 892 bytes), which makes hierarchyid values less compact and processing-friendly than certain alternative schemes. For these reasons, our philosophy is to avoid manifesting a particular encoding scheme in the design; we rather intend the scheme to be chosen flexibly according to the application scenario at hand.

2.3.8 ltree in PostgreSQL

PostgreSQL has a module called ltree for representing "hierarchical tree-like structures" [87]. It comes with a data type ltree that stores a so-called label path, much comparable to the hierarchyid type discussed in the previous section. The module also comes with a general path pattern matching language, which is needed to formulate queries against a hierarchy. For example:

SELECT *, NLEVEL(Path) AS Depth
FROM Hierarchy
WHERE Path <@ 'A1.B2';
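Queries like this can be accelerated with an index on the path column; in ltree this is commonly a GiST index (standard module usage; the index name is illustrative):

CREATE INDEX Hierarchy_Path_Gist ON Hierarchy USING GIST (Path);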

These special indexes fit into the framework of Generalized Search Trees (GiST, [47]). Thus, ltree goes a step further than hierarchyid in terms of language syntax and backend support. Apart from that, most of the remarks we made in the previous section also apply to ltree.

2.3.9 XML Databases and SQL/XML

In the late 1990s the Extensible Markup Language (XML) [108] became popular as a versatile format for data transport. There were high hopes that it would also lend itself well to data storage and access. To allow users to conveniently extract nodes and data from XML documents, the powerful XPath [110] and XQuery [111] languages were developed. The existing database products around these technologies can roughly be divided into two classes: On the one hand, there are "native" XML stores, whose internal physical structures were designed from the ground up for XML. On the other hand, there are systems that rely on relational databases as the backend for storing XML fragments and executing queries by translating XPath and XQuery to SQL [6, 42]. As documents written in XML are inherently hierarchical, the latter class is particularly interesting in our context. To speed up the resulting queries, suitable RDBMS-based XML encodings and indexing schemes have been studied, e. g. [1, 7, 22, 24, 36, 39–43, 56, 63, 66, 67, 72, 79, 95, 99, 102, 118]. If one ignores some XML specifics, much of this work is applicable to our setting as well. We therefore thoroughly survey many of the proposed indexing techniques in § 4.2.

Another, more invasive approach is to turn existing RDBMS into XML-enabled systems by "fusing" the two data models together and integrating XML support directly into SQL. This idea resulted in the SQL/XML standard [98] and has been implemented by prominent vendors [4, 64, 84]. SQL/XML enables queries over both tables and XML


SELECT XMLElement("PurchaseOrder", XMLAttributes(ponoAS "pono"), XMLElement("ShipAddr", XMLForest(streetAS "Street", cityAS "City", stateAS "State")), (SELECT XMLAgg( XMLElement("LineItem", XMLAttributes(linoAS "lineno"), XMLElement("liname", liname))) FROM lineitemsl WHERE l.pono = p.pono )AS po FROM purchaseorderp

SELECT top_price,
       XMLQUERY('for $a in /buyer/contract/item/amount
                 where /buyer/name = $var1
                 return $a'
                PASSING BY VALUE 'A.Eisenberg' AS var1, buying_agents
                RETURNING SEQUENCE BY VALUE)
FROM buyers

Figure 2.7: Using SQL/XML to generate XML fragments from relational data (top, from [64]), and to evaluate XQuery on an XML fragment, producing a mix of relational and XML data (bottom, from [28]).

documents, and as such is clearly the tool of choice for working with XML in a relational context. One may therefore ask whether this technology can be leveraged for our scenarios by representing hierarchies as XML fragments, and whether the design of the XML-enabled SQL dialect can serve as a blueprint for our language extensions.

To this end, consider Figure 2.7, which shows the interaction between relational and XML data by means of two example queries. The top example illustrates how so-called publishing functions enable the user to convert relational input data into XML fragments. Note how a single field of type XML can store a complete hierarchy of XML nodes. This is a major difference from our approach, which associates one table row with one node. The bottom example illustrates how XQuery is used within SQL statements to extract data from an XML fragment and produce either XML or a relational view. Note the heavy "machinery" for passing values in and out of the query.

Without going into details, these two examples clearly show that SQL/XML makes no attempt to bridge the duality between the two types of data. It requires users to know both models and the respective query languages, which is a challenge to SQL-only users. Furthermore, converting data from relational to XML or vice versa induces a lot of overhead. This overhead is not only "syntactic," as the many XML... clauses cluttering Figure 2.7 attest; there is also a significant runtime cost. In summary, the heavyweight approach of drawing the complete technology stack of XML, XML Schema, XPath, and XQuery into the database would contradict our requirements #1 and #2, which mandate that the data model and query language must blend seamlessly with SQL. We therefore cannot consider SQL/XML as a fruitful design blueprint.


2.3.10 Multidimensional Databases and MDX

The multidimensional data model is a variation of the relational model that organizes data as a collection of multidimensional data cubes. Its applications are mainly in the business intelligence (BI) tools we already mentioned in § 2.2.3, where the cubes are sliced-and-diced and aggregated along their dimensions and presented in visual or crosstab-style reports. To produce the data underlying these reports, Multi-Dimensional Expressions (MDX) is often used, a declarative query language for data cubes. It is a proprietary technology by Microsoft [71], who first released it in 1997 as part of their OLE DB for OLAP specification, but it has been embraced by a majority of OLAP products since. For a general overview of BI technology and multidimensional database concepts, refer to [16], [54], and the survey articles by Jensen and Pedersen [50, 86].

In this section we are mainly interested in multidimensional databases and MDX for the primary role they give to hierarchical dimensions, and how they can inspire us in extending SQL for hierarchies. As a simple example, assume a dimension [Store] for which a hierarchy [Location] of three levels has been defined, referred to in the path-like notation of MDX by:

[Store].[Location].[Continent]
[Store].[Location].[Country]
[Store].[Location].[City]

Having defined this upfront, one can output the full hierarchy as a tree by simply “selecting” it, much like one would a column in SQL:

SELECT [Measures].[Sales Amount] ON COLUMNS,
       [Store].[Location].Members ON ROWS
FROM [Sales Cube]

This already associates the nodes with the corresponding sums of sales amounts. To demonstrate further features related to hierarchies, Figure 2.8 shows a less trivial query taken from the MDX Language Reference [71]. The Adventure Works cube has a Customer dimension, which in turn has a geographic hierarchy arranging the customers. In the SELECT part, the query navigates the geographic hierarchy using the Descendants() function: It determines the descendants of the Australia node on the State-Province level, that is, all state-provinces in Australia. This set of members is placed on the rows of the report. In the first column, the Internet Sales Amount for each province is displayed. In the second column, the [Measures].X calculation is displayed. This calculation computes the percentage of the Internet Sales Amount in a State-Province relative to the aggregated total Internet Sales Amount in the country (i. e., Australia). How the percentages are computed is specified in the WITH clause. The Ancestors() function determines the Country to which CurrentMember belongs; the query execution engine will successively bind CurrentMember to each State-Province value in the report rows. Item() is needed for technical reasons: it extracts the first (and only) tuple from the set returned by Ancestors(). Although this example (and arguably MDX as such) is not exactly easy to understand, we can learn several things:

• MDX works on a rich data model that has been specified upfront using some proprietary tooling (e. g., XML for Analysis [109]). It is a language for querying


WITH MEMBER [Measures].X AS
     [Measures].[Internet Sales Amount] /
     ([Measures].[Internet Sales Amount],
      Ancestors([Customer].[Customer Geography].CurrentMember,
                [Customer].[Customer Geography].[Country]).Item(0)),
     FORMAT_STRING = '0%'
SELECT {Descendants(
          [Customer].[Customer Geography].[Country].&[Australia],
          [Customer].[Customer Geography].[State-Province],
          SELF)} ON ROWS,
       {[Measures].[Internet Sales Amount], [Measures].X} ON COLUMNS
FROM [Adventure Works]

Figure 2.8: An example MDX query

only; it supports neither data definition nor data manipulation, whereas our SQL extensions cover these aspects as well.

• MDX attempts to look like SQL, but its mechanics work quite differently. To SQL users its behavior, such as the fact that aggregations of the numeric measures in the cube happen implicitly, may come as a surprise. While MDX makes it possible to express multidimensional hierarchical aggregations very concisely, it deviates too far from the SQL look and feel for our purposes.

• Hierarchies (and their levels) are named first-class objects, and can be handled much like an ordinary column in SQL. Again, this is a step too far for our purposes. Our goal is to treat hierarchies as ordinary database tables instead of as distinct objects. Also, a hierarchy is treated as a fixed set of homogeneous levels, and this limitation is deeply built into the language. As already noted in § 2.2, we want to allow for more flexibility regarding the hierarchy structure.

• The Ancestors() and Descendants() functions used in the query are part of an extensive set of functions for hierarchy navigation that can be used in query statements and calculations. The general approach to navigation is to apply set-valued functions to sets of nodes. This functional style contradicts the declarative nature of SQL. Our preferred way to express navigation is an ordinary SQL join (see § 2.2.4).

We thus conclude that the concepts of MDX, although promising upon first sight, are mostly inapplicable to our setting.

3 A Framework for Integrating Relational and Hierarchical Data

This chapter covers our framework for supporting hierarchical data in relational database systems. Based on its core concepts of a hierarchical table and an abstract NODE data type (§ 3.2), we describe language constructs for querying hierarchies (§ 3.3), for defining new hierarchies or deriving them from existing data (§ 3.4), and for updating their structure (§ 3.5). The language elements are designed to cover typical requirements and blend seamlessly with the look and feel of SQL, which we illustrate on some advanced scenarios (§ 3.6).

3.1 Overview

Before delving into the details, we outline the central concepts of our framework and demonstrate its language features by some simple examples. As there is virtually no application that consists solely of hierarchical data, our central design goal has been to seamlessly blend all functionality into existing SQL concepts. In particular, we refrain from treating hierarchies as a distinct type of object in the database schema, as this would counteract the idea of a frictionless interaction between hierarchical data and coexisting tables. Instead, we tie hierarchies very closely to an associated table that contains the actual data that is hierarchically arranged. This leads us to our core concept of a hierarchical table, which is a table containing at least one column of the abstract data type NODE. In a nutshell, a field of this type represents the position of the row in the hierarchy. It is backed by an index that represents the structure, but these details are completely hidden from the user. The user can create a hierarchical table either from scratch or derive such a table from existing data. For example, the bill of materials table from the opening chapter (Figure 1.1, p. 2) can be transformed into a hierarchical table based on the information in its ID and PID fields. This is a one-time process, after which the system becomes fully aware of the hierarchy structure: The resulting NODE column and its accompanying index can be cached or even persisted and then leveraged in all subsequent queries on the same table.

From the user's perspective, all queries against the hierarchy are expressed in terms of the readily available NODE field. Our extensions to SQL comprise a small yet essential set of built-in functions that conceptually operate on values of type NODE. To see them

This chapter is primarily based on the material published in [9]. The concepts and language constructs for hierarchical computations were previously published in [10].


SELECT e.ID, e.Payload, r.ID, r.Payload, c.ID, c.Payload
FROM BOM e, BOM r, BOM c
WHERE e.Kind = 'engine' AND IS_DESCENDANT(r.Node, e.Node)
  AND r.Kind = 'rotor' AND IS_ANCESTOR(c.Node, r.Node)
  AND c.Kind = 'compound'

Figure 3.1: The query from Figure 2.5, expressed in terms of our SQL extensions.

in action, let us first consider structural pattern matching. In § 2.2.4 we implemented a non-trivial example query as a recursive common table expression (Figure 2.5, p. 25). Figure 3.1 shows the query in our syntax. This example exhibits three cornerstones of our design: First, there is a clear separation between the task of creating or deriving a hierarchy on the one hand, and the task of actually querying it on the other hand. In contrast, the RCTE necessarily "explores" the hierarchy structure through cumbersome, user-prescribed joins on the ID and PID fields, and queries it at the same time; the two aspects are inseparably intertwined. Second, unlike the general-purpose facilities for recursion (see § 2.3.2), our syntax is tailored for working with hierarchies, as the two used predicates IS_DESCENDANT and IS_ANCESTOR attest. This specialization allows us to increase user-friendliness and expressiveness, as well as to use special-purpose data structures and algorithms at the backend. Third, hierarchical relationships such as "descendant of" are stated in a direct and declarative way. The recursive nature of a hierarchy is mostly hidden, so the user does not have to craft a recursive solution. This again benefits the backend, as the query optimizer can reason about the user's intent and pick an optimal evaluation strategy (i. e., join direction). Finally, the example also shows how we meet the requirements #1, #2, and #4 stated in § 2.2.5: Our syntax blends with SQL (#1), as we stick mostly to joins and built-in functions to provide the required query support (#4). As a corollary, the syntactic impact—in terms of required extensions to the SQL grammar—is low (#2). Still, the syntax is highly expressive (#2): the query in Figure 3.1 reads just like the English sentence defining it.

Structural Grouping. A significant part of our language extensions is dedicated to rendering structural grouping as convenient as possible. § 2.2.4 motivated such queries from a high-level point of view. We saw that binary structural grouping translates into join–group–aggregate queries that combine a join on a hierarchy axis (as in Figure 3.1) with a subsequent GROUP BY of the outer side. This way, a useful class of hierarchical computations can be expressed fairly intuitively out of the box. The following example corresponds to the analytic query of § 2.2.4, which sums up the revenue of “type A” products bottom up and reports the sums for the three uppermost levels:

-- (I-a)
WITH Input1 AS (
   SELECT p.Node, s.Amount AS Value
   FROM Hierarchy1 p JOIN Sales s ON p.Node = s.Product
   WHERE p.Type = 'type A')
SELECT t.*, SUM(u.Value) AS Total
FROM Hierarchy1 t LEFT OUTER JOIN Input1 u
  ON IS_DESCENDANT_OR_SELF(u.Node, t.Node)
WHERE DEPTH(t.Node) <= 3
GROUP BY t.*

However, we already noted a lack of convenience and conciseness in such join–group–aggregate queries. The user has to assemble a table of tuple pairs representing the group associations by hand prior to grouping. This disguises the fact that a top-down or bottom-up hierarchical computation is being done. It becomes tedious especially when the output and input nodes largely overlap or are even identical, as in:

-- (II-a)
SELECT t.Node, SUM(u.Value)
FROM Input1 AS t LEFT OUTER JOIN Input1 AS u
  ON IS_DESCENDANT_OR_SELF(u.Node, t.Node)
GROUP BY t.*

This query works on only one table, but we have to mention it twice in the query. A small yet effective extension to the windowed table mechanism of SQL will allow us to rewrite such queries into a unary structural grouping form:

-- (II-b)
SELECT Node, SUM(Value) OVER (HIERARCHIZE BY Node)
FROM Input1

The HIERARCHIZE BY clause forms windows over the current table in a hierarchical way, which in this case achieves exactly the desired bottom-up rollup semantics. In general, rewriting binary to unary structural grouping queries will often result in more concise and intuitive statements. But beyond that, the same mechanism offers us another attractive language opportunity: support for structural recursion. Using a structurally recursive SQL expression we can state the rollup in statements II-a and II-b from above in yet another way:

-- (II-c)
SELECT Node, RECURSIVE INT (Value + SUM(x) OVER w) AS x
FROM Input1
WINDOW w AS (HIERARCHIZE BY Node)

The expression for x works just like in our original pseudo code from § 2.2.4: it sums up the readily computed sums x of all tuples that are covered by the current tuple. Besides allowing us to state certain binary hierarchical computations (which are in principle already supported in SQL today) in a simpler and more direct way, unary grouping also enables us to state significantly more complex computations based on structural recursion with remarkable conciseness. For example, we can now straightforwardly take the edge weights from Input2 (Figure 2.4d, p. 18) into account in our rollup:

SELECT Node, RECURSIVE DOUBLE (Value + SUM(Weight * x) OVER w) AS x   -- III
FROM Input2
WINDOW w AS (HIERARCHIZE BY Node)

Such computations could not be expressed (let alone evaluated) convincingly in the past. With our hierarchical windows they appear very natural.


BOM
ID    Node  Kind        Payload
'A1'  A1    'compound'  ···
'A2'  A2    'display'   ···
'B1'  B1    'engine'    ···
'B2'  B2    'engine'    ···
'C1'  C1    'valve'     ···
'C2'  C2    'rotor'     ···
'C3'  C3    'compound'  ···
'C4'  C4    'control'   ···
'D1'  D1    'valve'     ···
'D2'  D2    'rotor'     ···
'D3'  D3    'cpu'       ···

(The accompanying tree shows the hierarchical dimension: ⊤ has children A1 and A2; A1 has children B1 and B2; B1 has children C1 and C2; B2 has children C3 and C4; C3 has children D1 and D2; C4 has child D3.)

Figure 3.2: The bill of materials table from Figure 1.1 as a hierarchical table.

3.2 The Hierarchical Table Model

3.2.1 Hierarchical Tables

A hierarchical table is a table with at least one hierarchical dimension. A hierarchical dimension is a strict hierarchy that is associated with the table and arranges its tuples. The hierarchy conceptually does not contain any data besides the structural information and the node–tuple associations. A table may have multiple hierarchical dimensions, but a hierarchy can have exactly one associated table. (If additional tables need to be “tied” to a hierarchy, ordinary foreign key references to the hierarchical table can of course be used; see § 3.6.) Table BOM in Figure 3.2 is an example; it has only one associated hierarchy.

In our model, the topology of a strict hierarchy is an ordered, rooted, labeled tree as defined in § 2.1. Much of the functionality we describe in this chapter thus does not (and cannot) apply for non-strict hierarchies. To still support such data to some degree, we provide the means to transform data representing a directed graph into a proper strict hierarchy (see § 3.4.2). Note also that we rely on an ordered tree model. The system maintains a deterministic internal order even if no meaningful sibling order is prescribed by the user. This in particular ensures that every node has a well-defined rank in the preorder or postorder sequence of all nodes; for example, B1 in Figure 2.3 always has pre rank 2 and post rank 3. These ranks enable certain order-based language features, and they are the foundation for the evaluation techniques we discuss in § 5.

A single hierarchical table may be used to store multiple logical hierarchies of the same type. For example, the BOM table could store all bills of materials that have ever been entered into the system. We therefore do not require the hierarchy to form a single connected tree. This means there may be multiple root nodes. To avoid complications in handling such forests (as well as issues with empty hierarchies), we always maintain a single, virtual root node, which we denote by ⊤ and call the super root. The actual roots of the individual hierarchies in the user data become the children of ⊤. This way, the hierarchy can be represented as a connected tree internally, but will appear to the user—who never sees the virtual node ⊤—as a forest.

Let us now consider the tuple–node relationship, given a hierarchy H attached to a table T. We require each tuple of T to be associated with at most one node of H. Thus, some tuples may not appear in the hierarchy. Conversely, we require each node ν except for ⊤ to be associated with exactly one tuple t of T. Logically, a node is not associated with any further data: The label of node ν consists exactly of the fields of t. Although we associate the label with the node, recall from § 2.1 that for trees a 1 : 1 association between a node and its incoming edge can be made. Each field value can thus be interpreted as a label on either the node ν or the edge onto ν. This is up to the application. For instance, in Figure 2.3 (p. 17) we viewed the Weight column of table Hierarchy1 as an edge label and the ID column as a node label.

A user never works with the hierarchy object H itself, only with the associated table T. Consequently, a “syntactic handle” is required to enable the user to refer to a node in H given the corresponding tuple. Such a handle is provided by an attribute of type NODE in T, whose name can be chosen freely by the user.

3.2.2 The NODE Data Type

A field of the built-in data type NODE serves as the primary handle to a specific hierarchical dimension H. Intuitively, a NODE value represents the position of the associated node of a tuple in that hierarchy. For our example table BOM with its NODE field named Node, this would look as follows:

SELECT h.ID, ..., “depth of” h.Node
FROM BOM AS h
WHERE h.Node “is a leaf”

These node handles are a cornerstone of our design. There are in fact other conceivable approaches to making such a handle available in the query language, such as exposing built-in pseudo columns (similar to Hierarchical Queries in Oracle, see § 2.3.3), operating on the table h itself (a style mandated by early proposals for temporal SQL [25]), or operating on explicit named hierarchy objects. However, after careful consideration, we found that the proposed design allows us to expose all desired functionality in the most natural way while incurring the least “syntactic impact.” It also simplifies some technical aspects: Transporting “hierarchy information” across a SQL view becomes a trivial matter of including the NODE column in the projection list of the defining SELECT statement. Furthermore, the functionality can be extended in the future by simply defining new operators and functions on the NODE type.

Similar to an abstract data type, we want the actual values of the NODE data type to be opaque to the user. To prevent them from being observed directly, a naked NODE field may not be part of the output of a top-level query. The user may think of a NODE value as a position in a hierarchy, but how this position is represented is not relevant to the user. (In fact, even at the backend it can mostly be treated as variable-length BINARY data; the actual properties of the type and its representation and manipulation are ultimately up to an underlying indexing scheme, as we discuss in § 4.)

The user works with a NODE field mostly by applying hierarchical functions and predicates such as “level of” and “is ancestor of.” Besides that, the NODE type supports only the operators = and <>. Other operations such as arithmetics and casts from other data types are not allowed.

NODE values can be NULL to express that a tuple is not part of the hierarchy. A non-NULL value always encodes a valid position in the hierarchy. The handling of NULL values during query processing must be consistent with SQL semantics. For example, a predicate such as “is ancestor of” applied to two NODE values where at least one value is NULL must evaluate to UNKNOWN. Likewise, grouping (via GROUP BY) must treat NULL nodes as equivalent and place them into one group.

We already noted that each NODE field is associated with a different, specific hierarchy. We provide no way to add more than one NODE field for the same hierarchical dimension to a hierarchical table. This guarantees a tuple can only appear once in any hierarchy, which is in line with our fundamental hierarchy model. However, a table can very well have two or more NODE fields, and thus a tuple can be part of multiple distinct hierarchies. Similarly, the user can of course join hierarchical tables together and thus assemble tuples with arbitrarily many different NODE fields. In these situations, NODE values from different hierarchies could potentially be applied to a binary predicate such as “is ancestor of,” or even mixed into a single column via operations such as UNION. This is not a feature we support, as such mixing is usually meaningless and likely an error in the program logic. We therefore require the system to prohibit this at query compilation time by tracking the underlying hierarchy of each NODE field as part of its type, and raising a SQL error when two different NODE types are mixed. Effectively, each hierarchy H has its own distinct NODE type, although this is not exposed to the user. In the remainder we will occasionally use the notation NODE_H to express the NODE type that is specific to hierarchy H.
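To illustrate the compile-time type check, consider a table with two hierarchical dimensions; the following sketch (hypothetical columns Node1 and Node2 referring to two different hierarchies) would be rejected with a SQL error:

SELECT *
FROM T t1, T t2
WHERE IS_ANCESTOR(t1.Node1, t2.Node2)   -- error: NODE values of distinct hierarchies mixed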

3.3 QL Extensions: Querying Hierarchies

We now cover our enhancements to SQL’s query language to support and facilitate complex queries (our Requirement #4). As outlined previously, queries require a NODE field that serves as handle to the nodes in the associated hierarchy. For the following, we assume a table with a field ν of data type NODE_H is at hand. How to obtain such a table—either a hierarchical base table or a derived hierarchy—is covered in § 3.4.

3.3.1 Hierarchy Functions

To enable the user to directly query commonly required properties of the nodes, we provide some built-in scalar functions that operate on a NODE value ν:

IS_LEAF(ν) — Whether ν is a leaf, i. e., has no children.
IS_ROOT(ν) — Whether ν is a root, i. e., its parent is ⊤.
DEPTH(ν) — The number of edges on the path from ⊤ to ν.
SIZE(ν) — The number of nodes in the subtree rooted at ν.
DEGREE(ν) — The number of children of ν.
HEIGHT(ν) — The height of the subtree rooted at ν, i. e., the depth of the deepest node in ν’s subtree.
PRE_RANK(ν) — The preorder traversal rank of ν.
POST_RANK(ν) — The postorder traversal rank of ν.

Node  IS_LEAF  IS_ROOT  DEPTH  SIZE  DEGREE  HEIGHT  PRE_RANK  POST_RANK
A1    FALSE    TRUE     1      10    2       4       1         10
B1    FALSE    FALSE    2      3     2       2       2         3
C1    TRUE     FALSE    3      1     0       1       3         1
C2    TRUE     FALSE    3      1     0       1       4         2
B2    FALSE    FALSE    2      6     2       3       5         9
C3    FALSE    FALSE    3      3     2       2       6         6
D1    TRUE     FALSE    4      1     0       1       7         4
D2    TRUE     FALSE    4      1     0       1       8         5
C4    FALSE    FALSE    3      2     1       2       9         8
D3    TRUE     FALSE    4      1     0       1       10        7
A2    TRUE     TRUE     1      1     0       1       11        11

Figure 3.3: Projecting basic properties of the nodes in the BOM table.

Figure 3.3 shows the result of projecting all these properties for the BOM table. Note the values of DEPTH, HEIGHT, PRE_RANK, and POST_RANK are 1-based. The following simple query shows the node properties in action:

SELECT ID, DEPTH(Node) AS "Depth"
FROM BOM
WHERE IS_LEAF(Node) = TRUE

It produces a list of all non-composite parts (i. e., leaves) and their respective depths. These functions are primarily offered for the sake of expressiveness and convenience. Therefore, they are deliberately not strictly orthogonal. For example, IS_ROOT(ν) is equivalent to DEPTH(ν) = 1 and thus redundant. (We will explore more equivalences of this kind in § 5.1.2.) Furthermore, the user could also compute these functions manually, using for example joins or hierarchical windows. It is therefore acceptable if a particular implementation chooses to not expose all of the functions. HEIGHT and DEGREE, in particular, are used rather infrequently but may be expensive to compute depending on which indexing scheme is employed, so we regard them as optional.

As the semantics of SQL mandate, the order of the result tuples in the above query is undefined. To traverse a hierarchy in a particular order, one can combine ORDER BY with a node property. As an example, consider a parts explosion for the BOM (cf. § 2.2.3), which shows the parts in depth-first order, down to a certain maximum depth:

-- Depth-first, depth-limited parts explosion with depths
SELECT ID, DEPTH(Node)
FROM BOM
WHERE DEPTH(Node) < 5
ORDER BY PRE_RANK(Node)

41 3 a framework for integrating relational and hierarchical data

In fact, the primary use case for PRE_RANK and POST_RANK is to perform such topological sorting. With PRE_RANK, parents will be arranged before children (preorder); with POST_RANK, children will be arranged before parents (postorder). A breadth-first search order can be achieved with DEPTH:

-- Breadth-first parts explosion
SELECT ID, DEPTH(Node)
FROM BOM
ORDER BY DEPTH(Node), ID
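Analogously, sorting by POST_RANK arranges children before their parents, which is useful whenever results must be processed bottom up; a sketch:

-- Postorder parts listing: children before parents
SELECT ID
FROM BOM
ORDER BY POST_RANK(Node)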

3.3.2 Hierarchy Predicates

Besides querying node properties, a general task is to navigate from a given set of nodes along a certain hierarchy axis. Such axes can be expressed using one of the following binary hierarchy predicates (where ν1 and ν2 are NODE values):

IS_PARENT(ν1, ν2) — whether ν1 is a parent of ν2.
IS_CHILD(ν1, ν2) — whether ν1 is a child of ν2.
IS_SIBLING(ν1, ν2) — whether ν1 is a sibling of ν2, i. e., has the same parent.
IS_ANCESTOR(ν1, ν2) — whether ν1 is an ancestor of ν2.
IS_ANCESTOR_OR_SELF(ν1, ν2) — whether ν1 = ν2 or ν1 is an ancestor of ν2.
IS_DESCENDANT(ν1, ν2) — whether ν1 is a descendant of ν2.
IS_DESCENDANT_OR_SELF(ν1, ν2) — whether ν1 = ν2 or ν1 is a descendant of ν2.
IS_PRECEDING(ν1, ν2) — true iff ν1 precedes ν2 in preorder and is not an ancestor of ν2.
IS_FOLLOWING(ν1, ν2) — true iff ν1 follows ν2 in preorder and is not a descendant of ν2.

As we already noted in § 2.2.4, the task of axis navigation maps quite naturally onto a self-join with an appropriate hierarchy predicate as the join condition. For example, the following lists the IDs of all pairs (ν1, ν2) of nodes where ν1 is a descendant of ν2:

SELECT u.ID, v.ID
FROM BOM u JOIN BOM v ON IS_DESCENDANT(u.Node, v.Node)

As another example, we can use a join to answer the classic where-used query on a bill of materials (cf. § 2.2.3). The query “Where is part D2 used?” corresponds to enumerating all ancestors of the corresponding node:

SELECT u.ID
FROM BOM u, BOM v
WHERE IS_ANCESTOR(u.Node, v.Node) AND v.ID = 'D2'

Here we formulated the join as a filter. The set of predicates is somewhat aligned with the axis steps of XPath (see § 2.1). Note that the preceding and following predicates are only meaningful in an ordered hierarchy, and will thus rarely be used in order-indifferent applications. Also, the set of predicates again has some redundancy for the user’s convenience; for example, IS_ANCESTOR and IS_DESCENDANT are symmetric, and the OR_SELF variants are a simple shorthand for hand-written OR expressions. In § 5.1.2 these properties are explored further.
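For instance, the OR_SELF variants could be spelled out by hand; the following sketch is equivalent to filtering with IS_DESCENDANT_OR_SELF(u.Node, v.Node):

SELECT u.ID
FROM BOM u, BOM v
WHERE v.ID = 'B2'
  AND (u.Node = v.Node OR IS_DESCENDANT(u.Node, v.Node))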


Sale : {[ID, Item, Customer, Product, Date, Amount]}

SELECT Customer, Date, SUM(Amount) OVER w
FROM Sale
WINDOW w AS (
  PARTITION BY Customer   -- window partition clause
  ORDER BY Date           -- window ordering clause
  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    EXCLUDE NO OTHERS     -- window frame clause
)

Figure 3.4: An example window specification for a sales table.

The set of functions and predicates presented thus far covers everything that is needed in the common application scenarios we encountered, while being aligned with the capabilities of the hierarchy indexing schemes we considered for the backend. It is of course conceivable (and straightforward!) to further extend this set in order to cover more specialized use cases.

3.3.3 Hierarchical Windows

While node properties and hierarchy predicates are useful in particular for pattern matching queries, hierarchical windows are designed to facilitate complex hierarchical computations based on the concept of unary structural grouping we introduced in § 2.2.4. Unlike binary grouping, unary structural grouping is a novel concept to SQL that could not be expressed in the past. Following our informal overview in § 3.1, we now cover the syntax and semantics of our extensions for unary grouping.

Windowed Tables and Hierarchies. Windowed tables are a convenient and powerful means for aggregations and statistical computations on a single table, which otherwise would require unwieldy correlated subqueries. Their implicitly self-joining nature makes them a natural fit for structural grouping. We therefore add the concept of hierarchical windows to this mechanism.

Let us first briefly review the standard terminology and behavior of windowed tables. (We assume the reader to already be somewhat familiar with these concepts; refer to e. g. [116].) A window is formed on the result table of a SELECT/FROM/WHERE query according to a window specification. Note the table underlying a window may not be grouped. That is, in case GROUP BY is used in the same SQL block, the statement will be syntactically transformed by the query compiler so that the GROUP BY is encapsulated in a nested subquery. This ensures that these two related features do not get into each other’s way. A standard window specification may comprise a window partition clause, a window ordering clause, and a window frame clause. As an example, consider again our sales table from § 2.2.3: Figure 3.4 shows how we may annotate the table with per-customer sales totals running over time. The used frame clause is:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW


This happens to be the implicit default and thus could be omitted. Conceptually, the query is evaluated as follows: (1) the sales are partitioned by Customer; (2) each partition is sorted by Date; (3) within each sorted partition, each tuple t is associated with a group of tuples relative to t, its window frame, as determined by the frame clause, in this case all sales up to t; (4) the window function (SUM) is evaluated for that group and its result appended to t.

The frame is always a subsequence of the current ordered partition. The order imposed by the ordering clause is generally a partial order, since tuples may not be distinct with respect to the ORDER BY fields. Tuples in t’s frame that match in these fields are called peers or TIES.

Hierarchical Window Specification. For unary structural grouping, our windowed table will be a collection of nodes; that is, there is a NODE column whose values are drawn from a hierarchical base table. (Table Input1 from the scenario of § 2.2.4 is an example.) We extend the standard window specification with a new HIERARCHIZE BY clause specifying a hierarchical window. This clause may take the place of the ordering clause behind the partitioning clause. That is, partitioning happens first as usual, and hierarchizing replaces ordering. While window ordering turns the partition into a partially ordered sequence, hierarchizing turns it into an acyclic directed graph derived from the hierarchy. We begin our discussion with a minimal hierarchical window specification, which omits partitioning and the frame clause (so the above default applies):

HIERARCHIZE BY ν [BOTTOM UP | TOP DOWN]

The clause determines the NODE field ν, its underlying hierarchy H, and the direction of the intended data flow (bottom up by default), giving us all information we need to define an appropriate < predicate on the partition:

top-down:  u < t  :⇔  IS_ANCESTOR(u.ν, t.ν)
bottom-up: u < t  :⇔  IS_DESCENDANT(u.ν, t.ν)

We additionally need the notion of covered elements we already used informally in § 2.2.4. An element u is said to be covered by an element t if no third element lies in between:

u <: t  :⇔  u < t ∧ ¬∃ u′ : u < u′ < t.

Using <:, we can identify the immediate < neighbors (descendants/ancestors) of a tuple t within the current partition. Note that in case all hierarchy nodes are contained in the current partition, the “tuple u is covered by t” relationship is equivalent to “node u.ν is a child/parent of t.ν.” However, we need the general <: notion because the current partition may well contain only a subset of the nodes. The <: predicate helps us establish a data flow between tuples even when the intermediate nodes do not appear in the input. A tuple u from the current partition can be related in four relevant ways to the current tuple t:

(a) u < t    (b) t < u    (c) u.ν = t.ν    (d) neither of those


SELECT SUM(Value) OVER (HIERARCHIZE BY Node) AS x
FROM Input3

Input3              Result
Node  Value         x
C1    100           100
C2    200           200
D1    1000          1000
D3    3000          3000
B2    20            4020
A1    1             4321

(The original figure additionally shows the pairwise window-frame relationships and the resulting data flow graph: B2 covers D1 and D3; A1 covers C1, C2, and B2, while D1 and D3 are its non-covered descendants. Thus C1, C2, and B2 feed into A1, and D1 and D3 feed into B2.)

Figure 3.5: Applying a bottom-up hierarchical window to a table.

To reuse the syntax of the standard window frame clause without any modifications, we have to reinterpret three concepts accordingly: PRECEDING tuples are those of category (a); FOLLOWING tuples are those of category (b); TIES are tuples of category (c). In the bottom-up case, PRECEDING tuples correspond to descendants and FOLLOWING tuples to ancestors of t.ν. These terms are intuitively understood if we view the tuples in the window frame as a sequence partially ordered by <. They are not to be mixed up with the preceding and following hierarchy axes. Tuples on those axes, as well as tuples where ν is NULL, fall into category (d) and are always excluded from the frame.

The default frame clause includes categories (a), (c), and the current row. The handling of (c) tuples can be controlled independently via the EXCLUDE clause. The options are to include them (NO OTHERS, default) or to exclude the CURRENT ROW, the TIES, or the whole GROUP. Note that distinguishing between category (c) and (d) is specific to hierarchizing; standard window ordering puts all tuples for which neither u < t nor t < u holds into t’s peer group (with < defined according to the ordering clause).
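As an illustration of these reinterpreted concepts, a hierarchical window can be combined with an explicit frame clause; the following sketch (in the spirit of Statement II-b) sums strictly over the descendants in the partition, excluding the current tuple and its ties via EXCLUDE GROUP:

SELECT Node,
       SUM(Value) OVER (HIERARCHIZE BY Node
         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW EXCLUDE GROUP)
FROM Input1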

Example. As a basic example, consider Figure 3.5, where we apply a bottom-up hierarchical window to table Input3 and compute x = SUM(Value), just as in Statement II-b from our initial overview in § 3.1. The matrix indicates the relationships of the tuples. Since our window uses the default frame clause, the frames comprise exactly the < and <: tuples, plus the current row. (Note there are no ties in this example data.) Summing over them yields the x values shown to the right. Note that although the table does not include the intermediate nodes B1 / C3 / C4, the input values of C1 / C2 do still count into A1, and likewise for D1 / D3 and the B2 tuple, as illustrated by the data flow graph to the right. As said, unary grouping does not require all intermediate nodes to be present in the input. Thus, it behaves precisely like the alternative binary approach based on an IS_DESCENDANT_OR_SELF join (Statement II-a from § 3.1).

For basic rollups, which are by far the most common type of hierarchical computation, the implicit window frame clause does exactly the “right thing”—thanks to our definitions of < and the PRECEDING/FOLLOWING concepts—and it is hard to imagine a more concise and readable way of expressing them in SQL.


3.3.4 Recursive Expressions

Thus far, hierarchical windows are merely a shorthand; the computations we have seen can equivalently be expressed using join–group–aggregate statements. Structural recursion, however, significantly extends their expressive power. To enable recursive expressions, we recycle the SQL keyword RECURSIVE and allow wrapping it around expressions containing one or more window functions:

RECURSIVE [τ] (expr) AS c

This makes a field c of type τ accessible within any contained window function, and thus provides a way to refer to the computed expr value of any tuple in the window frame. If c is used anywhere in expr, τ must be specified explicitly, and an implicit CAST to τ is applied to expr. Automatic type deduction in certain cases is a possible future extension, but it is not generally possible without ambiguity.

The following additional rules apply: First, if expr contains one or more window function expressions of the form “expr_i OVER w_i,” all used hierarchical windows w_i must be equal (that is, match in their partitioning and HIERARCHIZE clauses, including the NODE field and the direction). Second, the frame of each window w_i is restricted as follows: only the covered tuples (“RANGE 1 PRECEDING”) can potentially be included in the frame, and in particular EXCLUDE GROUP is enforced. That is, the frame clause of every window function within expr effectively becomes:

RANGE BETWEEN 1 PRECEDING AND CURRENT ROW EXCLUDE GROUP

This in particular ensures that the window frame will not contain the CURRENT ROW, any TIES, or any FOLLOWING tuples. If any of those were contained in the frame, any access to field c within expr would create a cyclic dependency. (It is conceivable to loosen the restrictions somewhat and give the user more control via a custom frame clause. However, as there are no obvious use cases for this functionality, we do not consider frame clauses at this point.) Third, the field c may appear only within one of the window function expressions expr_i; for instance, in combination with an aggregate function AGG:

RECURSIVE τ (... AGG(expr′) OVER w ...) AS c

Mentioning c outside a window function would implicitly access the current tuple, which is forbidden, whereas according to SQL’s rules, mentioning c within expr′ implicitly accesses the frame row (FRAME_ROW), which thanks to our restrictive window frame can only be a covered tuple for which the c value is available. While this standard behavior is what is usually intended and quite convenient, SQL has a way to override the implicit frame row access. We could for example refer to the current tuple even within AGG by using a so-called nested window function:

RECURSIVE τ (... AGG(... VALUE_OF(c AT CURRENT_ROW) ...) OVER w ...) AS c


SELECT * FROM (                                        -- I-b
  SELECT h.*, SUM(Amount) FILTER (WHERE Type = 'type A') OVER w
  FROM Hierarchy1 h LEFT OUTER JOIN Sales s ON h.Node = s.Product
  WINDOW w AS (HIERARCHIZE BY Node)
) WHERE DEPTH(Node) <= 3

Figure 3.6: Statement I-a from § 3.1, expressed using hierarchical windows.

We prohibit this for c (which, as said, may only be accessed “AT FRAME_ROW”), but allow it for any other field.

Returning to Figure 3.5, we can now equivalently apply the recursive rollup expression from our example Statement II-c in § 3.1 to Input3:

RECURSIVE INT (Value + SUM(x) OVER w) AS x

The window frames are now restricted to the covered <: tuples. Since Input3 is already ordered suitably for bottom-up evaluation—i. e., postorder—we can fill in the x result column in a single pass and always have the x values of our frame rows at hand when we need them.

3.3.5 Further Examples

Even with non-recursive expressions, hierarchical windows are already an attractive alternative to verbose join–group–aggregate statements. Consider again our opening Statement I-a from § 3.1. A little-known but useful feature of the aggregation functions in SQL is that the input values to each function can be further restricted by specifying a FILTER condition. This allows us to state this query as shown in Figure 3.6. Since a self-join is now implicitly covered through the hierarchical window, we save one join over Statement I-a. Note also that the outer join may yield tuples where Amount is NULL, but these are conveniently ignored in the SUM; therefore, no further preprocessing is necessary.

Altogether there are three conceivable points in the evaluation of a SELECT statement where the user might want to add WHERE conditions: a priori (before windows are formed); as FILTER (restricting the computation input but not affecting the table); and a posteriori (restricting the output). To achieve the latter, two selections must be nested, as SQL currently has no HAVING equivalent for windowed tables.

Figure 3.7 shows further meaningful expressions. They all are based on the bottom-up or top-down hierarchical window specified by the following query:

Input : {[Node, ID, Weight, Value]}

SELECT Node, expr                                      -- IV
FROM Input
WINDOW td AS (HIERARCHIZE BY Node TOP DOWN),
       bu AS (HIERARCHIZE BY Node BOTTOM UP)

The input table here models a weighted hierarchy with additional values attached to the nodes (like in Figure 2.4d, p. 18). Some of the example expressions are based on the assumption that Input contains all nodes of the underlying hierarchy, that is, Π_Node(Input) = Π_Node(T), where T is the hierarchical base table.


(1a) SUM(Value) OVER bu
(1b) RECURSIVE INT (Value + SUM(x) OVER bu) AS x

(2a) PRODUCT(Weight) OVER td   -- non-standard
(2b) RECURSIVE INT (Weight * COALESCE(FIRST_VALUE(x) OVER td, 1)) AS x

(3a) SUM(Value) OVER (bu RANGE 1 PRECEDING EXCLUDE GROUP)
(3b) RECURSIVE (SUM(Value) OVER bu)

(4a) RECURSIVE DOUBLE (Weight * (Value + SUM(x) OVER bu)) AS x
(4b) RECURSIVE DOUBLE (Value + Weight * (SUM(x) OVER bu)) AS x
(4c) RECURSIVE DOUBLE (Value + SUM(Weight * x) OVER bu) AS x
(4d) RECURSIVE DOUBLE (Value + SUM(VALUE_OF(Weight AT CURRENT_ROW) * x) OVER bu) AS x

(5)  RECURSIVE VARCHAR (COALESCE(FIRST_VALUE(x) OVER td, '') || '/' || ID) AS x

(6a) COUNT(*) OVER td
(6b) RECURSIVE INT (COALESCE(FIRST_VALUE(x) OVER td, 0) + 1) AS x

(7a) COUNT(*) OVER bu
(7b) RECURSIVE INT (1 + COALESCE(SUM(x) OVER bu, 0)) AS x

(8)  RECURSIVE INT (1 + COALESCE(MAX(x) OVER bu, 0)) AS x

(9a) COUNT(*) OVER (bu RANGE 1 PRECEDING EXCLUDE GROUP)
(9b) RECURSIVE (COUNT(*) OVER bu)

(10) RECURSIVE (MY_FUNC(ARRAY_AGG(ROW(ID, x)) OVER bu)) AS x

Figure 3.7: SQL examples for unary computations.

For computations that can be stated using either an ordinary expression or a RECURSIVE expression, Figure 3.7 shows both alternatives. (And there would be yet another alternative to express these computations: a self-joining join–group–aggregate statement, i. e., binary structural grouping; but we omit these verbose statements in the figure.)

(1) is our familiar rollup. Besides SUM, the operation in expression 1a could also be AVG, MIN, MAX, COUNT (cf. example 7), EVERY, ANY, or ARRAY_AGG to simply collect all values in an array. The recursive variant would have to be appropriately adapted. Duplicate input values can be eliminated as usual using DISTINCT, as opposed to the implicit ALL semantics. SQL’s FILTER construct adds further expressiveness. For example, in a bill of materials we may count the distinct types of subparts of a certain manufacturer that each part is built of:

COUNT(DISTINCT Type) FILTER (WHERE Manufacturer = 'A') OVER bu

(2) is a top-down counterpart to the previous example (1). This computation yields the effective weights by multiplying over all tuples on the root path. Expression 2a uses a hypothetical PRODUCT aggregation function, which is curiously missing from standard SQL. Expression 2b works around that via recursion, aptly taking advantage of the special FIRST_VALUE aggregation function. To understand the example, note that for a top-down recursive computation, the window frame can be either empty—making FIRST_VALUE yield NULL—or contain one covered ancestor. In our bill of materials the weight could be the part’s multiplicity (“how often?”) within its super-part. Here the product would tell us how often the part appears in total in the assembly.

(3) is a variant of example (1) summing over only the covered tuples rather than all descendants. In expression 3b we access only Value but not the actual expression result (thus, its type τ can be auto-deduced); still, the semantics are those of recursive evaluation. Under the assumption that Input happens to contain all hierarchy nodes, the cover relation <: becomes equivalent to the IS_CHILD predicate as noted earlier; so the same could as well be achieved via join–group–aggregate.

(4) are variants of a weighted rollup; they sum up Value while taking Weight into account. Expression 4a immediately applies t.Weight to every computed t.x value, whereas expression 4b applies it only to the u.x values of the input tuples u. Expression 4c applies the u.Weight of any input tuple u to the u.x value of that tuple. (Therefore, during the computation of t.x, both the x and the Weight fields of the frame rows are accessed.) Expression 4d is mostly equivalent to expression 4b, but brings it into a form similar to expression 4c using a nested window function to access the Weight of the current row within the SUM.

In general, such weighted rollups cannot be performed without (structural) recursion. A non-recursive expression such as “SUM(Weight * Value) OVER bu” would apply the Weight only to each tuple’s Value, rather than the accumulated value, which is not what we intend. A non-recursive workaround that is sometimes applicable is to “multiply out” the expression according to the distributivity law and use two separate computations: first expression 2a, yielding absolute weights w for each tuple, then SUM(w * Value) bottom up. The result is then mostly equivalent to expression 4a.

(5) constructs a path-based representation of the hierarchy using the same technique as example (2). It builds a string from the ID values on the root path, for example /A1/B1/C1 for C1. Related hierarchy-handling tools such as Hierarchical Queries (§ 2.3.3), hierarchyid (§ 2.3.7), and ltree (§ 2.3.8) often provide similar functionality via built-in functions. This example demonstrates how one can easily emulate that functionality by a custom computation, but with added flexibility.

(6–9) compute properties of the data flow graph over the input table. As Input can be viewed as an extension of the hierarchical base table (containing all nodes of the hierarchy), they are equal to the node’s (6) depth, (7) subtree size, (8) subtree height, and (9) degree. In general (7) gives us the size of the window frame and (9) the number of covered tuples. Note how expression 9b does not actually access any frame tuple at all: COUNT(*) counts tuples but does not access any attributes.

(10) Finally, if we need to go beyond the capabilities of SQL’s aggregate functions and expression language, we can use ARRAY_AGG to collect data from the covered tuples and pass it to a user-defined function.

Yet another option for (10) would be user-defined aggregates, which can be implemented in some RDBMS via proprietary extension mechanisms (e. g. [38]). Using these mechanisms, arbitrarily complex computations could be plugged in.

3.4 DDL Extensions: Creating Hierarchies

The previous section covered our query language constructs which work on fields of type NODE. In this section, we show how hierarchical tables containing such fields are defined and maintained.

3.4.1 Hierarchical Base Tables

For newly designed applications we intend to provide specific DDL constructs to express and maintain hierarchies explicitly in the table schema (Requirement #3 from § 2.2.5). Because the underlying object is a persistent base table, updates are supported. In this scenario the user starts out with an empty hierarchy and then incrementally adds and updates its nodes using the DML constructs we discuss in § 3.5. A base table definition may specify a hierarchical dimension as follows:

CREATE TABLE T (
  ...,
  HIERARCHY name [NULL | NOT NULL] [WITH (option*)]
)

This implicitly adds a column named name of type NODE to the table, exposing the underlying hierarchy. Explicitly adding columns of type NODE is prohibited. A hierarchical dimension can also be added to or dropped from an existing table using ALTER TABLE:

ALTER TABLE T ADD HIERARCHY Node NULL
ALTER TABLE T DROP HIERARCHY Node

Like a column, a hierarchical dimension can explicitly be declared nullable, or a “NOT NULL” constraint can be added. If it is declared NOT NULL, the implicit NODE value of a newly inserted row becomes DEFAULT, making it a new root without children. A row with a NULL value in its NODE field is not part of the hierarchy.

A hierarchy whose structure is known to rarely or never change allows the system to employ a read-optimized indexing scheme (cf. Requirement #8). Therefore, we provide the user with a means of controlling the degree to which updates to the hierarchy are to be allowed. This is done through an option named UPDATES:

UPDATES = {BULK | NODE | SUBTREE}

BULK allows only bulk-updates, which makes the NODE column essentially immutable; NODE allows bulk-updates and single-node operations, that is, relocating, adding, and removing single leaf nodes; SUBTREE allows bulk-updates, single-node operations, and the relocation of whole subtrees. These restrictions are strict in the sense that a SQL error is raised when a prohibited DML operation is attempted. The default setting is SUBTREE, so full update flexibility is provided unless there is an explicit restriction from the user. Depending on the specified option, the system can choose an appropriate indexing scheme. The motivation behind distinguishing between single-node and subtree updates is that the latter require a more powerful dynamic indexing scheme than the former, with inevitable tradeoffs in query performance (see § 4.2).

Note that a table with one or more hierarchical dimensions might additionally have one or more temporal dimensions, specified via standard PERIOD definitions [97]. The defined hierarchies would then by default become temporal hierarchies. Although we do not cover this topic any further, the orthogonal way in which a hierarchical dimension is specified would allow our framework to fairly straightforwardly accommodate scenarios featuring temporal-hierarchical data in the future (see § 7).
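Putting the pieces together, a definition matching the BOM table of Figure 3.2 might look as follows (a sketch; the column types are illustrative, and the UPDATES setting is chosen arbitrarily here to permit single-node operations only):

CREATE TABLE BOM (
  ID VARCHAR PRIMARY KEY,
  Kind VARCHAR,
  Payload VARCHAR,
  HIERARCHY Node NOT NULL WITH (UPDATES = NODE)
)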

3.4.2 Derived Hierarchies: The HIERARCHY Expression

While hierarchical base tables are a preferred choice for newly designed applications, derived hierarchies are targeted mainly at legacy applications. According to our Requirement #7 from § 2.2.5, legacy applications demand a means to derive a hierarchy from an available table using a certain relational hierarchy encoding. Derived hierarchies enable users to take advantage of all query functionality “ad hoc” on the basis of relationally encoded hierarchical data, while staying entirely within the query language—and in particular, without requiring schema modifications via DDL. For this purpose we provide the derived hierarchy construct, which syntactically resembles an invocation of a built-in table-valued function named HIERARCHY. It derives a hierarchy from a given source table, which may be a table, a view, or the result of an arbitrary subquery:

SELECT ..., name
FROM HIERARCHY (
  USING source table AS source name
  ...         -- transformation specification
  SET name    -- NODE column name
)

This expression can be used wherever a table reference is allowed, in particular in a FROM clause. It is comparable to a table-valued function: Its result is a temporary table containing the data from the source table and additionally a new NODE column named name, which exposes the associated hierarchy. The transformation specification is a description of the specific relational input format. A system might support various common input formats, such as those we discussed in § 2.2. We next cover the proposed syntax of the transformation specification for the most common type of input, the adjacency list format (§ 3.4.3).

3.4.3 Deriving a Hierarchy from an Adjacency List

The syntax to derive a hierarchy from a table in the adjacency list format is as follows:


HIERARCHY (
  USING source table AS source name
  [START WHERE start condition]
  JOIN PRIOR parent name ON join condition
  [SEARCH BY order]
  SET node column name
)

Conceptually, this table-valued expression is evaluated by first self-joining the source table in order to derive a parent–child relation representing the edges, then building a temporary hierarchy representation from that, and finally producing the corresponding NODE column. The optional START WHERE subclause can be used to restrict the hierarchy to only the nodes that are reachable from any node satisfying start condition. The optional SEARCH BY subclause can be used to specify a desired sibling order; if omitted, siblings are ordered arbitrarily. In more detail, the conceptual procedure for evaluating the complete expression is as follows:

1. Evaluate source table and materialize the required columns into a temporary table T. Also add a NODE column named node column name to T and initialize it with NULL values.

2. Perform the join T AS C LEFT OUTER JOIN T AS P ON join condition where P is the parent name and C is the source name. Within the join condition, P and C can be used to refer to the parent and the child node, respectively.

3. Build a directed graph G containing all row IDs of T as nodes, and add an edge r_P → r_C between any two rows r_P and r_C that are matched through the join.

4. Traverse G, starting at rows satisfying start condition, if specified, or otherwise at rows that have no (right) partner through the outer join. If order is specified, visit siblings in that order. Check whether the traversed edges form a valid tree or forest, that is, there are no cycles and no node has more than one parent. Raise an error when a non-tree edge is encountered.

5. Build a hierarchy representation from all edges visited during Step 4 and populate the NODE column of T with a corresponding NODE value for each visited node, accordingly. The result of the HIERARCHY expression is T.

Note that this description is merely conceptual; we describe an efficient implementation in § 4.6.

In Step 4, an error is raised when an edge is encountered that leads to an already visited node. The resulting hierarchy is therefore guaranteed to have a strict tree structure (Requirement #6). Alternatively, such edges could simply be ignored. This would effectively derive a spanning tree over the graph. Yet another alternative would be to revisit the previously visited node again but producing new tree nodes for that branch in the output. This would have the effect of “multiplying out” the whole graph. Which


WITH PartHierarchy AS (
  SELECT ID, Node
  FROM HIERARCHY (
    USING BOM AS c
    JOIN PRIOR p ON p.ID = c.PID
    SET Node
  )
)
SELECT v.ID, DEPTH(v.Node) AS Depth
FROM PartHierarchy u, PartHierarchy v
WHERE u.ID = 'C2' AND IS_DESCENDANT(v.Node, u.Node)

Figure 3.8: Deriving a hierarchy from the BOM table and querying it ad hoc.

option is most suitable depends on the application at hand. Either option ensures that the derived hierarchy will be strict.

The HIERARCHY syntax may be reminiscent of Hierarchical Queries (§ 2.3.3) or RCTEs (§ 2.3.4). In particular, the self-join via parent name is roughly comparable to a CONNECT BY via PRIOR in a Hierarchical Query. However, the semantics of HIERARCHY are quite different in that by design only a single self-join is performed on the input rather than a recursive join. This allows for a very efficient evaluation algorithm compared to a recursive join. And there is another major conceptual difference to the mentioned approaches: Note that the HIERARCHY expression does nothing more than define a hierarchy; that hierarchy can then be queried by wrapping the expression into a SELECT statement. In contrast, an RCTE both defines and queries a hierarchy in one convoluted statement. Separating these two aspects greatly increases comprehensibility.

As an example, consider again our familiar BOM of Figure 1.1 (p. 2). The statement in Figure 3.8 uses a CTE featuring a HIERARCHY expression to derive the Node column from ID and PID, then selects the IDs and depths of all parts that are used within part C2. The mentioned separation of aspects is clearly visible. PartHierarchy could be extracted into a SQL view and reused for different queries. One might argue that an RCTE or Hierarchical Query could be wrapped into a view as well, but that still would not clearly separate the definition and querying aspects: The user would have to compute any node properties that might potentially be needed in later queries (such as DEPTH in the example) upfront in the view definition, even though they are clearly part of the query. A query that does not need the depths would still trigger their computation, resulting in unnecessary overhead. In contrast, our design allows the user to cleanly defer the selection of node properties of interest to the actual query.
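The optional subclauses combine naturally with this example. A sketch that derives only the subhierarchy reachable from part 'A1' and orders siblings by their IDs (reusing the adjacency-list BOM table of Figure 3.8):

SELECT ID, Node
FROM HIERARCHY (
  USING BOM AS c
  START WHERE c.ID = 'A1'
  JOIN PRIOR p ON p.ID = c.PID
  SEARCH BY ID
  SET Node
)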

3.5 DML Extensions: Manipulating Hierarchies

In order to accommodate legacy applications (Requirement #7), we aim to provide a smooth transition path from relationally encoded hierarchies to full-fledged hierarchical dimensions. In a first stage, we expect most legacy applications to rely entirely on views featuring HIERARCHY expressions on top of relational encodings such as adjacency lists, as this approach allows them to avoid any schema changes. In such a scenario, each view evaluation conceptually derives a new hierarchical table from scratch (although that may often be elided if the system supports view caching). As the NODE column of a derived hierarchy is immutable, the only way the hierarchy structure can be modified is to manipulate the original encoding. In a second stage, a partly adapted legacy application might add a static hierarchical dimension (UPDATES = BULK) alongside the existing adjacency list encoding, and update the dimension periodically from the adjacency list by performing an explicit bulk update. This can be done using SQL’s standard MERGE INTO statement, by specifying a HIERARCHY expression as the source and the existing hierarchical table as the target.

These two stages provide a way to gradually adopt hierarchy functionality in a legacy application, but they come at the cost of requiring frequent bulk builds whenever the hierarchy structure changes. Therefore, for both green-field applications as well as fully migrated legacy applications it may be preferable to use a dynamic hierarchy (UPDATES = NODE or SUBTREE). Such a hierarchical table supports fine-grained updates via explicit DML constructs. Again, our NODE abstraction allows us to straightforwardly add support for expressing structural updates while using a minimally invasive syntax: Update operations naturally translate into ordinary INSERT and UPDATE statements operating on the NODE column of a hierarchical dimension.
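The stage-two bulk update mentioned above might look roughly as follows; this is a sketch only, with hypothetical names (BOM as the adjacency-list encoding, BOMH as the hierarchical table) and deliberately simplified matching conditions:

MERGE INTO BOMH t
USING (
  SELECT ID, Node
  FROM HIERARCHY (
    USING BOM AS c
    JOIN PRIOR p ON p.ID = c.PID
    SET Node)
) s
ON t.ID = s.ID
WHEN MATCHED THEN UPDATE SET t.Node = s.Node
WHEN NOT MATCHED THEN INSERT (ID, Node) VALUES (s.ID, s.Node)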

Node Anchors. To specify the position where a row is to be inserted into the hierarchy, we use an anchor value. A node anchor acts as a pseudo value for the NODE field in an INSERT or UPDATE statement. It is essentially an existing NODE value plus an indication of where the node or subtree being manipulated is to be placed relative to that node. Again, we refrain from extending the SQL grammar and instead use simple built-in functions which take a NODE value as input and yield a node anchor as output:

• BELOW(ν) inserts the new row as a child of ν. The insert position among the existing children is undefined.

• BEFORE(ν) or BEHIND(ν) insert the new row as the left or right sibling of ν.

• ABOVE(ν) inserts the new row above ν. The parent of ν becomes the parent of the new node, and ν itself becomes the child of the new node.

The BELOW anchor is useful for unordered hierarchies, while BEFORE and BEHIND anchors allow for precise positioning in hierarchies where sibling order matters.

Inserting Nodes. When we insert a new row into a hierarchical base table using INSERT, we can add a corresponding node to the hierarchy by specifying a node anchor for the NODE field. Obtaining an appropriate node anchor typically requires a sub-select. For example, we can add a node B3 as a new child of A2 into our familiar BOM hierarchy like this:

INSERT INTO BOM (ID, Node)
VALUES ('B3', (SELECT BELOW(Node) FROM BOM WHERE ID = 'A2'))

or

INSERT INTO BOM (ID, Node)
SELECT 'B3', BELOW(Node) FROM BOM WHERE ID = 'A2'

54 3.5 DML Extensions: Manipulating Hierarchies

To make the new row a root, we can specify the DEFAULT value. To omit the row from the hierarchy, we can specify NULL, provided the NODE column is nullable:

INSERT INTO T (..., Node) VALUES (..., DEFAULT)   -- insert as root node

INSERT INTO T (..., Node) VALUES (..., NULL)      -- do not insert a node

In case no value is provided in the INSERT statement, DEFAULT and NULL are the implicit defaults for non-nullable and for nullable dimensions, respectively. A row with a NULL node can of course be added to the hierarchy later on via an UPDATE.
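For example, a row previously inserted with a NULL node can be attached to the hierarchy afterwards (a sketch, reusing the BOM example):

UPDATE BOM
SET Node = (SELECT BELOW(Node) FROM BOM WHERE ID = 'A2')
WHERE ID = 'B3'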

Relocating Nodes. A node can be relocated by updating the NODE field of the associated row using an ordinary UPDATE statement, again specifying an anchor to indicate the target position:

UPDATE T
SET Node = (SELECT BELOW(Node) FROM T WHERE ...)
WHERE ...

If the node to be relocated has any descendants, the whole subtree rooted at the node is relocated together with it. However, such a subtree relocation is only allowed if option UPDATES = SUBTREE is used for the hierarchical dimension. In order to guarantee structural integrity, the system must raise a SQL error when an attempt is made to relocate a subtree below a node within that same subtree, as this would result in a cycle.

Removing Nodes. A node can be removed from a hierarchy either by deleting its row or by updating its NODE field to NULL:

UPDATE T SET Node = NULL WHERE ...

DELETE FROM T WHERE ID = '...'      -- by attribute/key value

DELETE FROM T WHERE Node IN (...)   -- by NODE value

However, these operations are prohibited if the node has any descendants that are not also removed during the same transaction. In other words, in order to remove the root of a subtree, all its children have to be relocated first or removed together with that node. This rule is quite restrictive, but it is necessary to ensure that removing nodes does not leave behind an invalid hierarchy.

Also note that in case the hierarchical dimension uses option UPDATES = BULK, nodes cannot be individually inserted, relocated, or removed. This renders it difficult to delete any rows that are part of the hierarchy, that is, whose NODE value is not NULL. One operation that is allowed on a BULK dimension is to truncate the whole hierarchy by unconditionally setting the NODE values of all rows to NULL: “UPDATE T SET Node = NULL.” Thus, a way to delete rows with associated nodes is to first truncate the hierarchy, then delete the rows at will, and subsequently rebuild the hierarchy from scratch, as sketched below.

If any of the mentioned rules are violated, a SQL error is raised. Although this is restrictive, it ensures that the structure of the hierarchy remains valid at any time, thus satisfying Requirement #6 of § 2.2.5.
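On a BULK dimension, this truncate–delete–rebuild workflow might thus read (a sketch with a hypothetical table T; the rebuild reuses the bulk-update mechanism described in the migration stages above):

UPDATE T SET Node = NULL;    -- truncate the hierarchy
DELETE FROM T WHERE ...;     -- now rows can be deleted at will
-- finally, rebuild the hierarchy from the relational encoding,
-- e. g. via MERGE INTO with a HIERARCHY expression as the source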


3.6 Advanced Examples

We now explore several more advanced scenarios involving hierarchical tables. This includes modeling entities that are part of multiple hierarchies or entities that appear in the same hierarchy multiple times (§ 3.6.1), as well as creating inhomogeneous hierarchies that contain entities of various types (§ 3.6.2). These examples are inspired by customer scenarios at SAP and demonstrate that our language extensions stand up to real-world scenarios.

3.6.1 Flexible Forms of Hierarchies

In certain applications an entity may be designed to belong to two or even more hierarchies at the same time. For example, an employee might have both a disciplinary superior as well as a line manager, and thus be part of two reporting lines. A straightforward way to model this is to add multiple hierarchical dimensions to a table:

CREATE TABLE Employee (
  ID INTEGER PRIMARY KEY,
  ...,
  HIERARCHY Disciplinary,
  HIERARCHY Line
)

A more complex case arises when a single hierarchy shall be able to contain the same entity multiple times. Again, a bill of materials can serve as an example: A common part such as a screw may generally appear multiple times within the same BOM, and we may not want to replicate its attributes each time. This is a typical 1 : N relationship: one part may appear N times in the hierarchy. The solution is to model this case exactly as one would usually model 1 : N relationships in the relational model, namely by introducing two tables and linking them by means of a foreign key constraint. In our example, we would split the original BOM schema from Figure 3.2 (p. 38) into a Part table storing per-part data and a correspondingly simplified BOM table:

CREATE TABLE Part (
  ID INTEGER PRIMARY KEY,
  Kind VARCHAR,
  -- per-part master data
  ...
)

CREATE TABLE BOM (
  NodeID INTEGER PRIMARY KEY,
  HIERARCHY Node,
  PartID INTEGER,   -- a node is a part (N : 1)
  FOREIGN KEY (PartID) REFERENCES Part (ID),
  -- additional node or edge attributes
  ...
)

The generalized pattern of “factoring out” a hierarchy into a satellite table is as follows:

CREATE TABLE T (PK INTEGER PRIMARY KEY, ...)

CREATE TABLE U (
  HIERARCHY Node,
  FK INTEGER UNIQUE REFERENCES T (PK),
  ...
)

The UNIQUE constraint on FK ensures that U–T is a 1 : 1 relationship. Thus, the hierarchy effectively arranges the rows from both T and U. Fields that logically belong to a node or edge should be placed in U in order to keep T uncluttered. Rows from T that are currently not a part of the hierarchy do not need to actually appear in U. To remove a node, one would rather delete the corresponding U row altogether instead of setting its Node value to NULL.

With this pattern, one can add any number of distinct T-hierarchies by simply creating more satellite tables like U. Furthermore, if we relax the U–T relationship by dropping the UNIQUE constraint, a T entity can effectively appear any number of times within the same hierarchy.
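Querying through a satellite table then combines a hierarchy predicate on U with joins back to T; a sketch that lists all T entities below a given start entity:

SELECT t2.*
FROM U u1, U u2, T t2
WHERE u1.FK = 4711                      -- hypothetical key of the start entity
  AND IS_DESCENDANT(u2.Node, u1.Node)
  AND t2.PK = u2.FK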

3.6.2 Heterogeneous Hierarchies (I)

In many scenarios, entities of various types are arranged in a single hierarchy. Different types of entities are characterized by different sets of attributes. Especially in structured documents such as XML, various types of document nodes (i. e., tags with corresponding attributes) are interleaved in arbitrary ways, and XPath expressions routinely combine hierarchy navigation with filtering by node type (so-called node tests). The SQL way of modeling multiple entity types is to define a separate table per type, each with a different set of columns as appropriate. Returning to our BOM example from § 3.6.1, we can elaborate the Part–BOM data model further by introducing entities of type “engine”:

CREATE TABLE Engine (
  ID INTEGER PRIMARY KEY,
  FOREIGN KEY (ID) REFERENCES Part (ID),
  Power INTEGER,
  ...   -- engine-specific master data
)

While Part contains master data that is common to all kinds of parts, Engine adds master data that is specific to engines. Note that the two tables must have a common primary key domain (ID). The foreign key constraint enforces a 1 : 1 relationship to Part. BOM is now a heterogeneous hierarchy in that each node can be either of type Part or of the specific subtype Engine. This design is extensible: Further part types can be added by defining further tables like Engine. When querying the BOM, type-specific part attributes can be accessed by simply joining in the corresponding master data. Such a join acts as a filter by type.

As an example query, suppose that fittings by manufacturer X have been reported to outwear too quickly when used in combination with engines more powerful than 700 watts, and we need to determine the compounds that contain this hazardous combination in order to issue a recall. Figure 3.9 shows the solution. Note in particular how the BOM–Engine join implicitly ensures that node e is of kind “engine,” so we do not need a separate test.

3.6.3 Heterogeneous Hierarchies (II)

We now consider another example from a Human Resources use case, where entities of two entirely different types are mixed into a single hierarchy. Suppose we have


SELECT *
FROM BOM c, Part cm,     -- compound node and master data
     BOM f, Part fm,     -- fitting node and master data
     BOM e, Engine em    -- engine node and master data
WHERE c.ID = cm.ID AND cm.Kind = 'compound'
  AND IS_DESCENDANT(f.Node, c.Node)
  AND f.ID = fm.ID AND fm.Kind = 'fitting' AND fm.Manufacturer = 'X'
  AND IS_DESCENDANT(e.Node, f.Node)
  AND e.ID = em.ID AND em.Power > 700

Figure 3.9: Querying the heterogeneous bill of materials hierarchy.

SELECT e.Emp AS "Employee", d.Dep AS "Department"
FROM DepEmp e, DepEmp d
WHERE e.Emp = 'Bob'
  AND IS_ANCESTOR(d.Node, e.Node)   -- e is within d
  AND d.Dep IS NOT NULL             -- d is a department
  AND NOT EXISTS (                  -- no department in between
    SELECT * FROM DepEmp v
    WHERE IS_ANCESTOR(d.Node, v.Node)
      AND IS_ANCESTOR(v.Node, e.Node)
      AND v.Dep IS NOT NULL
  )

Figure 3.10: Querying the Department–Employee hierarchy: “Where does Bob work?”

an existing Department table and an existing Employee table, and we want to arrange the departments and employees in a combined hierarchy. In this hierarchy, each department node would have one child of type employee indicating the department manager, as well as any number of children for the subdepartments; each manager node would form the root of a subhierarchy of employees. One way to model this is as follows:

CREATE TABLE DepEmp (
  NodeID INTEGER PRIMARY KEY,
  HIERARCHY Node,
  Dep INTEGER UNIQUE REFERENCES Department (ID),
  Emp VARCHAR UNIQUE REFERENCES Employee (ID),
  CONSTRAINT C1 CHECK (Dep IS NULL OR Emp IS NULL)
)

Every node in this table references either a department or an employee, but never both at the same time. Additional attributes on the entities can be joined in from the Department and Employee tables when needed. To query, for example, the department a given employee e works in, we need to determine the closest department node above e. Figure 3.10 shows a possible solution. A bigger challenge is to query the manager of e: If e has a parent of type employee, that is the manager; if, however, e is a department head, the manager is the head of the parent department. To solve this query we would need a much more complex pattern.

An advantage of the DepEmp design is that it is not invasive: We do not need to modify the original Department and Employee tables, nor do we need to add any satellite tables or have a shared key domain between Department and Employee, unlike the approaches of § 3.6.1 and § 3.6.2.


SELECT v.ID, SUM(sale.Amount)
FROM Location u, Location v, Location w, Store store, Sale sale
WHERE u.Name = 'Europe'
  AND IS_DESCENDANT(v.Node, u.Node)
  AND DEPTH(v.Node) = DEPTH(u.Node) + 2
  AND IS_DESCENDANT(w.Node, v.Node)
  AND IS_LEAF(w.Node) = TRUE  -- store locations are leaves
  AND w.ID = store.LocationID
  AND store.ID = sale.StoreID
GROUP BY v.ID

Figure 3.11: A rollup query based on a geographic dimension hierarchy.

An advantage of the DepEmp design is that it is not invasive: We do not need to modify the original Department and Employee tables, nor do we need to add any

satellite tables or have a shared key domain between Department and Employee, unlike the approaches of § 3.6.1 and § 3.6.2. However, it is somewhat inconvenient that an employee node cannot have both a parent employee (manager) and a parent department, as this would violate the strictness constraints. Our example works around this by interleaving the nodes such that employees are only indirectly connected to their department via the manager. This of course results in more complex queries, as the above example shows.
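To substantiate that point, the following statement is one conceivable formulation of the manager query. It is a hypothetical, untested sketch using only the constructs introduced so far; the two disjuncts mirror the case distinction described above:

SELECT m.Emp AS "Manager"
FROM DepEmp e, DepEmp m
WHERE e.Emp = 'Bob'
  AND m.Emp IS NOT NULL
  AND ( IS_CHILD(e.Node, m.Node)  -- case 1: e's parent is an employee
        OR EXISTS (               -- case 2: e heads department d; m is
          SELECT *                --   the head of d's parent department p
          FROM DepEmp d, DepEmp p
          WHERE IS_CHILD(e.Node, d.Node) AND d.Dep IS NOT NULL
            AND IS_CHILD(d.Node, p.Node) AND p.Dep IS NOT NULL
            AND IS_CHILD(m.Node, p.Node) ) )

Even this sketch covers only the two described cases, and it is noticeably harder to read than the query of Figure 3.10.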

3.6.4 Dimension Hierarchies

As we noted in § 2.2.3, dimension hierarchies of fact tables are an important use case. In this example we consider a variant of the sales scenario, where the sales table records the store where each sale took place, and the stores are arranged in a common geographic hierarchy. The schema is:

Sale : {[StoreID, Date, Amount, . . . ]}
Store : {[ID, LocationID, . . . ]}
Location : {[ID, Node, Name, . . . ]}

By joining Sale–Store–Location, we can associate each sale with a Node value corresponding to the location of the store. Now, let us answer the following example query: “Considering only sales within Europe, list the total sales per sub-subregion.” The prose form of this query speaks, quite implicitly, of three distinct Location nodes: a reference node u, namely Europe; a node v two levels below u, corresponding to a sub-subregion; and a leaf node w below v, corresponding to the location of a store where a sale took place. We can use a join–group–aggregate statement to answer the query. While we are mainly interested in the values for v, we also need names for u and w in the statement. All in all, three self-joined instances of the hierarchical table are required. Figure 3.11 shows the solution. Note the straightforward reading direction of the query: it intuitively matches the direction of navigation in the hierarchy. This example and the one from Figure 3.9 are good examples of how our language extensions maintain the original “look and feel” of SQL, so even more complex queries will be intuitive to experienced SQL users.
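The pattern adapts directly to related rollups. For instance, totals per subregion (one level below Europe instead of two) only require changing the depth offset; the following is a straightforward variant of Figure 3.11:

SELECT v.ID, SUM(sale.Amount)
FROM Location u, Location v, Location w, Store store, Sale sale
WHERE u.Name = 'Europe'
  AND IS_DESCENDANT(v.Node, u.Node)
  AND DEPTH(v.Node) = DEPTH(u.Node) + 1  -- subregions, not sub-subregions
  AND IS_DESCENDANT(w.Node, v.Node)
  AND IS_LEAF(w.Node) = TRUE
  AND w.ID = store.LocationID
  AND store.ID = sale.StoreID
GROUP BY v.ID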


3.6.5 A Complex Report Query

In our previous example, we were able to express a non-trivial hierarchical computation in terms of a simple join–group–aggregate statement. We now investigate a more elaborate query whose solution requires hierarchical windows. Similar to the setup of § 2.2.4, it is based on an abstract hierarchical table and a separate table of numeric input values attached to some of the nodes:

Hierarchy : {[ID, Node, Payload]}
Input : {[Node, Value]}

At its heart, the purpose of this query is to compute a bottom-up sum of the measure values, much like in the basic examples of § 3.1. But there are some additional requirements: First, before we can do the rollup we need an initial join between the given Input values and the Hierarchy nodes, which gives us a combined input table for unary structural grouping. Second, each node in the hierarchy carries 128 bytes of further payload, which we want to include in the result and thus have to carry through the computation. Third, in addition to the actual sum x for each node, we want to also show the contribution p in percent of each node’s x value to the total of its parent. Fourth, we want to report only the three upper levels of the hierarchy, with the nodes arranged in preorder. Fifth, we want to visualize the nodes’ positions using Dewey-style path strings (see § 2.3.6). If the x value of B1 is 1250, an example result line for C2 may be

['/A1/B1/C2', 125, 10%, payload]

The additional “stress factors” in this example are commonly found in real-world queries. In particular, the contribution percentage is an often-requested feature in the financial applications we saw at SAP. To obtain both the contribution percentages and the Dewey path strings, we need a bottom-up and a top-down hierarchical computation. With the help of hierarchical windows this is expressed as follows:

-- bottom-up rollup
WITH T1 (Node, ID, Payload, x) AS (
  SELECT h.Node, h.ID, h.Payload,
         SUM(i.Value) OVER (HIERARCHIZE BY h.Node BOTTOM UP)
  FROM Hierarchy h LEFT OUTER JOIN Input i ON h.Node = i.Node
),
-- top-down computation of the contribution p and the path
T2 (Node, ID, Payload, x, p, Path) AS (
  SELECT Node, ID, Payload, x,
         RECURSIVE ( 100.0 * x / FIRST_VALUE(x) OVER w ) AS p,
         RECURSIVE VARCHAR(255) (
           COALESCE(FIRST_VALUE(Path) OVER w, '') || '/' || ID ) AS Path
  FROM T1 WINDOW w AS (HIERARCHIZE BY Node TOP DOWN)
)
-- final filtering and ordering
SELECT Path, x, p, Payload
FROM T2
WHERE DEPTH(Node) <= 3
ORDER BY PRE_RANK(Node)

Without our language extensions this query would be prohibitively difficult to express. By contrast, the above statement is fairly intuitive, considering the complexity of the


task at hand. Moreover, it can also be translated into a very efficient execution plan, as our evaluation in § 6.3.7 will show.

3.6.6 Hierarchical Tables and RCTEs

In § 2.3.4 we mentioned that hierarchical tables are not intended to completely replace or obviate RCTEs. In fact, these two features can interact. We illustrate this by the example of a quantities calculation in a bill of materials. Suppose the payload of a Part contains a Quantity field indicating how often the part is used within its respective parent part. The following query computes the required quantity of each part by multiplying the quantities of the nodes on the root path:

WITH RECURSIVE RCTE (Node, Quantity) AS (
  SELECT Node, Quantity FROM Part WHERE DEPTH(Node) = 1
  UNION ALL
  SELECT v.Node, u.Quantity * v.Quantity
  FROM RCTE u JOIN Part v ON IS_CHILD(v.Node, u.Node)
)
SELECT * FROM RCTE

This particular query could of course be expressed more easily in terms of a hierarchical window. (A join–group–aggregate statement over the ancestors of each node is not possible, but only because there is no PRODUCT aggregation function in standard SQL.) However, what the example shows is that RCTEs can navigate hierarchical tables just like adjacency list tables, using, for example, IS_CHILD as the join condition within the recursive query expression. Legacy queries relying on RCTEs can therefore straightforwardly be rewritten when the underlying tables are migrated to hierarchical tables. An RCTE like in the example could be considered comparatively readable: The general structure of the recursion—start at level 1, iteratively move along the child axis—is quite obvious in the statement. An expression of the form IS_CHILD(v, u) expresses the navigation direction more clearly than an ordinary join such as v.PID = u.ID would.
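As an aside, the hierarchical-window formulation alluded to above might look as follows. This is a hypothetical sketch: it assumes that a RECURSIVE expression may reference its own result column (here q) via FIRST_VALUE, in analogy to the Path computation of § 3.6.5, and the DOUBLE result type is an assumption:

SELECT ID, RECURSIVE DOUBLE (
         Quantity * COALESCE(FIRST_VALUE(q) OVER w, 1) ) AS q
FROM Part
WINDOW w AS (HIERARCHIZE BY Node TOP DOWN)

For each node, FIRST_VALUE(q) OVER w yields the already computed value of its parent in the top-down traversal, so q multiplies up the quantities along the root path just like the RCTE does.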


4 The Backend Perspective of Hierarchical Tables

In the hierarchical table model we introduced in the previous chapter, a hierarchy appears to the user as an ordinary relational table with an added column of type NODE. From the user’s perspective, the NODE field is opaque; it is merely a “syntactic handle” for expressing queries and updates on the hierarchy. At the backend, this fundamental design gives us the flexibility to choose between different encoding schemes for the structure, which we call hierarchy indexing schemes (§ 4.1). This flexibility is critical: As Requirement #8 from § 2.2.5 anticipates, there is no “one size fits all” encoding that could serve all applications equally well. All data structures that make up a specific type of indexing scheme are encapsulated by an abstract index interface (§ 4.1.2). The interface exposes the required primitives to implement our SQL extensions with suitable worst-case asymptotic performance guarantees. Thus, the backend can be mostly agnostic about the layout and manipulation of the NODE values and auxiliary data structures. Each implementation of an indexing scheme comes with a definition of the abstract NODE type and a number of related data types, plus usually one or more auxiliary data structures. We survey a broad range of existing schemes that fit into our abstract model (§ 4.2), and recommend suitable default options for common scenarios (§ 4.3) and in particular static settings (§ 4.4). We also include an overview of our novel order indexes family of indexing schemes (§ 4.5), which target highly dynamic scenarios. Besides the query primitives, indexing schemes also need to support bulk building (§ 4.6) as well as SQL update statements (§ 4.7). Efficient bulk-building is particularly important for supporting legacy applications that derive hierarchies from existing data; we therefore cover the necessary algorithms in much detail.

4.1 Hierarchy Indexing Schemes

A hierarchy indexing scheme is a specific implementation of a hierarchy encoding on top of the abstract NODE column design. We begin our discussion by considering the fundamental components that make up such an implementation.

This chapter builds upon the publications [9], [32], and [33]. As the implementation of our proposed order indexes is not the focus of this thesis, we only summarize their relevant characteristics in § 4.5 and refer to those publications for further information. § 4.6 provides a much extended discussion of the bulk-building techniques first described in [9].


4.1.1 Basic Concepts

With most types of indexing schemes, the NODE column is backed by a more or less sophisticated auxiliary data structure. The run-time representation of a hierarchy therefore generally consists of two logical data objects: the complete contents of the NODE column, and an instance of the auxiliary data structure. Unlike the NODE column, the index structure is entirely transparent to the user: It is not an explicit, user-accessible object in the database catalog. It is created or destroyed, kept up to date, and leveraged for query processing by the backend alone without the user ever noticing. In some ways, this is comparable to a traditional index on a table column, such as a B-tree or hash table. However, there is an important difference: The hierarchy indexing scheme contains the hierarchy structure as non-redundant information. Traditional indexes, by contrast, are redundant and could be dropped at any time without losing information. In § 4.2 we explore specific types of indexing schemes. Each type of indexing scheme consists of implementations of the following general components:

• The auxiliary index structure (and possible sub-structures).

• A Label data type, which is the actual data that is stored per row in the abstract NODE field. We call these scheme-specific, SQL-level node identifiers the labels of the indexing scheme.

• A Node data type and a Cursor data type. A Node or Cursor object can be obtained from a Label. These objects are internal node handles as opposed to the labels, which are exposed at SQL level. They are obtained and manipulated through the interface and used in the backend during query processing. Cursor is a subtype of Node, that is, cursors can be used wherever the Node type is required. While Node is a basic handle to a single node, a Cursor can additionally be used to traverse the hierarchy structure in various directions.

• An implementation of the abstract hierarchy index interface based on the Label, Node, and Cursor types and the auxiliary index structure.

Some update operations require a specification of an intended target position in the hierarchy. This concept is based on the Node data type: A target position is simply a Node object plus an anchor, which can be “before” or “behind” (meaning: as a direct left or right sibling), “above,” or “below.” The unspecific “below” is used when the sibling order is not meaningful; its semantics are up to the indexing scheme. A hierarchy indexing scheme has to maintain a bidirectional association between the table rows and the internal representation of the nodes. If an auxiliary data structure is used, the Label essentially works as a link from the row to an entry representing the node in the index structure. Defining the mechanism for this is up to the indexing scheme. The link from a hierarchy node to its associated row, on the other hand, is realized via ordinary row IDs. The row ID can be obtained from a Node or Cursor handle via the index interface. This “linking” via row IDs is analogous to how ordinary database indexes such as B-trees work. The row IDs are usually stored directly in the


index structure entries. How a row ID is actually represented may differ significantly between row stores, column stores, disk-based systems, and in-memory systems, but this is not relevant in the following.

The motivation for having three data types rather than just the labels is that the process of locating an internal entry (i. e., obtaining a Node) in the auxiliary index structure given only a Label as a starting point may have a non-trivial cost, which we want to avoid where possible. Furthermore, the capability to perform systematic traversals of the hierarchy structure may require a Cursor to maintain a non-trivial state. The actual definitions of the Label, Node, and Cursor data types heavily depend on the design of the auxiliary data structure. A Label may be a simple pointer, a structured set of values, or a non-trivial binary code. The Node and Cursor types are a form of pointer to an entry in the data structure, which can be dereferenced in constant time. Any Label, Node, and Cursor values are manipulated solely through the index interface and are treated by the rest of the backend as opaque bytes. In that regard, the NODE type is comparable to BINARY or VARBINARY. That is, the Label of a row can be accessed individually and treated as opaque bytes. Whether its length is fixed or variable depends on the indexing scheme, but the size limit is known at data definition time.

Static and Dynamic Indexing Schemes. The choice among indexing schemes matters particularly with regard to their varying degrees of support for updates. We differentiate static and (more or less) dynamic schemes. Static indexing schemes cannot incorporate incremental updates and are invalidated whenever nodes are inserted, relocated, or removed. An invalidated indexing scheme must be rebuilt at the latest when it is accessed by a transaction to which the changes are visible. Static schemes generally can employ significantly more compact data structures and efficient algorithms than their dynamic counterparts. But they are only feasible when the hierarchy structure does not change often, or when changes only happen in batches. Two cases where static schemes are a perfect choice are derived hierarchies, which are by design immutable, and snapshots of historic states in system-time temporal tables. Dynamic indexing schemes, on the other hand, trade off query processing performance and simplicity of data structures in return for more efficient and more powerful update operations. They are significantly more complex to design and implement.

Clustering. In physical storage, database systems usually maintain the rows of a table in a purposeful order, such as by primary key or in a way that maximizes compression rates or temporal locality. In many scenarios (such as the SAP ERP use cases we analyzed in [9]), a hierarchy is not the primary dimension of a table, and clustering the associated table by hierarchy structure is infeasible or not preferable. Hierarchy indexes are therefore non-clustered indexes: The existence of a hierarchical dimension does not enforce a particular ordering of the rows in memory or persisted storage. That said, the user may still direct the system to cluster a hierarchical table by the hierarchy order, that is, to arrange the rows in pre-, post-, or level-order. This can significantly speed up certain workloads, such as when nodes are enumerated first via the


hierarchy index but additional table columns are subsequently accessed. We do not consider these orthogonal optimization techniques further, but note that our design can easily incorporate them, as hierarchy indexes use ordinary row IDs to reference the associated rows.

construction: H.build(H′); H.relink(ℓ, r, r′)
basic access: ν ← H.node(ℓ); c ← H.cursor(ν); ℓ ← H.label(ν); r ← H.rowid(ν)
node properties: H.depth(ν); H.is-root(ν); H.is-leaf(ν); H.size(ν); H.range-size([ν1, ν2])
binary predicates: H.is-before-pre(ν1, ν2); H.is-before-post(ν1, ν2); H.axis(ν1, ν2); H.is-parent(ν1, ν2)
traversal: c′ ← H.ς-prev(c); c′ ← H.ς-next(c); c ← H.ς-begin(); c ← H.ς-end(); c′ ← H.next-sibling(c); c′ ← H.next-following(c)
ordinal access: i ← H.pre-rank(ν); ν ← H.pre-select(i); i ← H.post-rank(ν); ν ← H.post-select(i)
leaf updates: ν ← H.insert-leaf(r, p); H.relocate-leaf(ν, p); H.remove-leaf(ν)
subtree updates: H.relocate-subtree(ν, p); H.remove-subtree(ν)
range updates: H.relocate-range([ν1, ν2], p); H.remove-range([ν1, ν2])
inner updates: ν ← H.insert-inner(r, [ν1, ν2]); H.relocate-inner(ν, [ν1, ν2]); H.remove-inner(ν)

Figure 4.1: Essential query and update primitives on hierarchies.

4.1.2 A Common Interface for Hierarchy Indexes

Figure 4.1 gives an overview of the primitives for querying and updating hierarchies that are provided by our common index interface. While many further primitives are conceivable, our design goal has been to devise a coherent set of essential primitives that is aligned with the functionality of our SQL language constructs (and foreseeable future extensions of it), but that at the same time allows for practical and efficient indexing scheme implementations. Therefore, our discussion in § 4.1.4 omits query primitives that are clearly too specific (e. g., least common ancestor) or that few schemes can support (e. g., subtree height). Likewise, the update operations we present in § 4.1.5 can be reasonably supported by any sophisticated dynamic indexing scheme. Of course, implementers may still decide to add special primitives for advanced applications, or redundant primitives for special cases in order to improve performance further. Many operations in our interface (in particular the traversal and ordinal primitives) depend on a well-defined and deterministic sibling order. We therefore consider only indexing schemes that maintain ordered hierarchies (§ 2.1). Even if the user hierarchy is unordered, they impose a deterministic, implementation-defined internal order. We use a few conventions in the following discussion:

• The variable T refers to the hierarchical table containing the NODE column at hand, which is assumed to be named “Node.” The variable H refers to the associated


instance of the hierarchy indexing scheme. The notation |H| refers to the number of nodes in H. The notation T[r] accesses a row of table T by its row ID r. For instance, T[r].Node retrieves the label of row r.

• ℓ is a variable of type Label, that is, a value stored in some NODE column such as T.Node. Labels directly support the basic equality comparisons ℓ1 = ℓ2 and ℓ1 ≠ ℓ2. Variables ν and νi are of type Node. Variables c and c′ are of type Cursor. Note again that a Cursor can be used in any place where a Node is required.

• The syntax [ν1, ν2] denotes a sibling range of nodes: ν2 must be a right sibling of ν1 (or ν1 itself), and [ν1, ν2] refers to all siblings between and including ν1 and ν2.

• p indicates a target position in the hierarchy, used for updating.

All primitives that accept arguments of type Label, Node, Cursor, or a target position require that these arguments represent valid values; that is, a node with label ℓ or node handle ν or cursor c must exist in H. They are undefined otherwise. In particular, Label values are expected to be non-NULL. (NULL arguments are handled at the level of the SQL functions, which yield NULL when any argument is NULL.) These implicit preconditions are not repeated in the following.
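At the SQL level, this NULL handling implies that rows whose NODE value is NULL, i. e., rows not currently represented in the hierarchy, can never satisfy a hierarchy predicate. A hypothetical example against the BOM table of § 3.6:

SELECT *
FROM BOM c, BOM d
WHERE IS_DESCENDANT(d.Node, c.Node)
-- evaluates to NULL, and thus filters the row combination out,
-- whenever c.Node or d.Node is NULL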

4.1.3 Elementary Index Primitives

Before we proceed to the individual primitives for queries and updates, we first consider how an indexing scheme is constructed and how its fundamental index/table association can be accessed and maintained. We begin with construction:

H.build(H′) — Builds the hierarchy from another hierarchy representation H′.

This operation can also be used to “bulk-update” an existing indexing scheme, even if it is static. It clears the contents of H and the associated NODE column and reconstructs the index structure and NODE column to reflect the structure of H′. The given hierarchy H′ must be tied to the same hierarchical table T; that is, it must contain a subset of the rows in T. The underlying representation of H′ may be entirely different from H. We require the worst-case asymptotic runtime complexity of build() to be in O(|H′| log |H′|). For many indexing schemes it can even be in O(|H′|).

An efficient operation for bulk-building is useful in several situations: The main use case is deriving a new hierarchical table from existing relational data in another format, such as an adjacency list, using our HIERARCHY( ) construct. A related use case is bulk-updates via MERGE, which replaces the existing hierarchy structure in a hierarchical table by a newly derived structure. Another scenario is creating a snapshot from a temporal hierarchy index. Finally, serialization mechanisms can use bulk-building to efficiently reconstruct an index from a serial format. § 4.6 considers the implementation of build() in detail, and also discusses a practical data structure for an intermediate hierarchy representation H′.
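To make the main use case concrete, a derivation statement in the spirit of our HIERARCHY( ) construct could look roughly as follows; the clause names below are a sketch of the general pattern rather than authoritative syntax, assuming an adjacency list table Part with columns ID and PID:

SELECT * FROM HIERARCHY (
  USING Part AS v
  JOIN PRIOR u ON v.PID = u.ID  -- the parent-child edges
  SET Node                      -- the derived NODE column
)

Internally, the engine assembles an intermediate representation H′ from the source data and then invokes build(H′) once to populate the index and the NODE column.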


Relinking Nodes. In general, executing any of the query or update primitives may require access to arbitrary other index entries and labels of T, even if the primitive itself logically affects only a single node. Therefore, an important precondition for all primitives is that all node–row associations are valid at the time of invocation. However, since indexing schemes reference rows through ordinary row IDs, these associations may be affected by database operations that are not direct manipulations of the hierarchy structure, such as DML statements against the hierarchical table T or a physical reorganization. We therefore need the database engine to cooperate: Whenever the system “moves” a row of T that is also referenced in H (i. e., a row whose NODE value is not NULL) in a way that affects its row ID, it needs to notify the indexing scheme to update its node–row association. The following primitive serves this purpose:

H.relink(ℓ, r, r′) — Update the row associated with the node given by label ℓ.

This tells the index that the row with former ID r now has a new row ID r′.

Basic Access. A few elementary primitives are needed in order to obtain an index handle from a given table row, and vice versa:

ν ← H.node(ℓ) — The node with label ℓ.
c ← H.cursor(ν) — A cursor to node ν.
ℓ ← H.label(ν) — The label of node ν.
r ← H.rowid(ν) — The row ID corresponding to the given node ν.

Most query and update primitives require at least Node objects or even cursors as arguments. A label ℓ as such is therefore not very useful; the corresponding node has to be located first in the index structure using node(ℓ). Vice versa, given a Node handle, label() returns the corresponding Label object; that is, node() and label() are inverse:

H.label(H.node(ℓ)) = ℓ

The rowid() function returns the ID of the corresponding table row holding the label, which allows us to access the whole table row rather than just the label. Thus, label(ν) is logically just a short-hand for T[H.rowid(ν)].Node.

4.1.4 Index Primitives for Queries

Query primitives are the building blocks for answering high-level queries on hierarchies. We distinguish between four kinds of query primitives: node properties, binary predicates, traversal operations, and primitives for ordinal access.

Node Properties compute characteristic properties of a node:

H.depth(ν) — The number of edges on the path from the super-root ⊤ to ν.


H.is-root(ν) — Whether ν is a root node, i. e., a child of ⊤.
H.is-leaf(ν) — Whether ν is a leaf node, i. e., has no children.
H.size(ν) — The number of nodes in the subtree rooted in ν.

H.range-size([ν1, ν2]) — The number of nodes in all subtrees rooted in [ν1, ν2].

These primitives are primarily needed for queries that explicitly request the node properties via the corresponding SQL functions, but they may also be leveraged internally, for example to gather statistics for cardinality estimations. Let us examine some of their properties: Regarding depth(), note that user-level roots have a depth of 1. The depth of the super-root would be 0, but this node can never appear in any user data. The integral functions depth(), size(), and range-size() yield values in the range [1; |H|], but are neither injective nor surjective. Finally, the two unary predicates are redundant, as we have the equivalences

H.is-root(ν) ⇔ H.depth(ν) = 1    and    H.is-leaf(ν) ⇔ H.size(ν) = 1.

However, they may be significantly cheaper to compute directly, depending on the indexing scheme in use.

Binary Predicates allow us to test the relationship of two given nodes.

ν1 = ν2 — Whether ν1 and ν2 are the same node.

H.is-before-pre(ν1, ν2) — Whether ν1 precedes ν2 in a preorder traversal.

H.is-before-post(ν1, ν2) — Whether ν1 precedes ν2 in a postorder traversal.

H.axis(ν1, ν2) — Returns the axis of ν1 with respect to ν2.

H.is-parent(ν1, ν2) — Whether ν1 is the parent of ν2.

There are exactly five basic, disjoint axes in which two given nodes can possibly be positioned (see § 2.1):

Axes = { preceding, ancestor, self, descendant, following }.

The Node equality check “=” has the usual properties of an equivalence relation. The is-before-pre() and is-before-post() predicates form a total and strict order relation on the set of Node objects—that is, they are irreflexive, asymmetric, and transitive. In particular, exactly one of the following terms is always true (where ς is either “pre” or “post”):

H.is-before-ς(ν1, ν2) ∨ (ν1 = ν2) ∨ H.is-before-ς(ν2, ν1).

Most of the order-based physical hierarchy operators we discuss in § 5 rely on “=” and is-before-ς() alone, which makes them the most important primitives for query

processing. The other two predicates are redundant. is-parent() can be expressed in terms of axis() and depth():

H.is-parent(ν1, ν2) ⇔ (H.axis(ν1, ν2) = ancestor) ∧ (H.depth(ν2) = H.depth(ν1) + 1).

A self axis check “H.axis(ν1, ν2) = self” is equivalent to “ν1 = ν2.” The other axes can be dissected into elementary is-before-ς() checks:

H.axis(ν1, ν2) = preceding  ⇔ H.is-before-pre(ν1, ν2) ∧ H.is-before-post(ν1, ν2).
H.axis(ν1, ν2) = ancestor   ⇔ H.is-before-pre(ν1, ν2) ∧ H.is-before-post(ν2, ν1).
H.axis(ν1, ν2) = descendant ⇔ H.is-before-pre(ν2, ν1) ∧ H.is-before-post(ν1, ν2).
H.axis(ν1, ν2) = following  ⇔ H.is-before-pre(ν2, ν1) ∧ H.is-before-post(ν2, ν1).

For example, in the hierarchy of Figure 4.4, B precedes G both in preorder and in postorder, so H.axis(B, G) = preceding; C precedes its child D in preorder but follows it in postorder, so H.axis(C, D) = ancestor. It follows from these definitions that the preceding, ancestor, descendant, and following axis checks are also strict order relations, albeit not total. Despite their redundancy, we found dedicated and potentially optimized implementations for is-parent() and axis() to be clearly beneficial to performance, as navigation along these axes is featured in many typical queries.

Traversal Operations navigate along the hierarchy structure in various directions, in particular preorder (ς = pre) and postorder (ς = post).

c′ ← H.ς-prev(c) — A cursor to the previous node in ς-order (if c ≠ H.ς-begin()).
c′ ← H.ς-next(c) — A cursor to the next node in ς-order (if c ≠ H.ς-end()).
c ← H.ς-begin() — A cursor to the first node in ς-order.
c ← H.ς-end() — A cursor to the one-behind-the-last node in ς-order.
c′ ← H.next-sibling(c) — A cursor to the next sibling.
c′ ← H.next-following(c) — A cursor to the next node in preorder that is on the following axis of the node indicated by c.

Traversal operations are useful to implement scan operators that enumerate certain subsets of the nodes. The seemingly obscure next-following() retrieves the first preorder successor outside of the subtree rooted in c, which is often a suitable scan delimiter.

Ordinal Access Primitives are based on the position of a node with respect to a systematic traversal of the hierarchy H in ς-order, where ς can be “pre” or “post.”

i ← H.ς-rank(ν) — The numerical position of ν in the ς-ordered sequence of nodes.
ν ← H.ς-select(i) — The i-th node in the ς-ordered node sequence (if i ∈ [1, |H|]).

pre-rank() and post-rank() return the position of a node with respect to preorder or postorder traversal, respectively. Given such ranks, pre-select() and post-select() return the corresponding nodes.


The ς-rank() functions are counterparts to the is-before-ς() order relations. As such, they are bijective, produce a dense numbering from 1 to |H|, and

H.is-before-ς(ν1, ν2) ⇔ H.ς-rank(ν1) < H.ς-rank(ν2).

ς-rank() and ς-select() are inverse:

H.ς-rank(H.ς-select(i)) = i    and    H.ς-select(H.ς-rank(ν)) = ν.

With the help of ς-rank() we can establish a relationship between depth(), pre-rank(), post-rank(), and size():

H.pre-rank(ν) − H.post-rank(ν) + H.size(ν) − H.depth(ν) = 0.

For instance, for node C in Figure 4.4 we have pre-rank(C) = 3, post-rank(C) = 6, size(C) = 5, and depth(C) = 2 (with 1-based ranks), and indeed 3 − 6 + 5 − 2 = 0. Thanks to this relationship, only three out of the four functions have to be explicitly computed, and the fourth can be derived using simple arithmetic. In particular, by implementing rank() the support for size() and even range-size() comes as a byproduct.

Although few applications require the full power of these primitives, they can be very convenient. Ranks are, for instance, used to create compact representations of subtrees, so-called tree signatures, for pattern matching purposes [117]. A use case for the select primitives are top-k queries with an offset for displaying parts of the hierarchy in a user interface. Unlike the other primitives we discussed, most indexing schemes require additional implementation efforts, storage space, and update overhead to support them efficiently. This will not pay off in every application. In particular, it is often sufficient to use is-before-ς() comparisons instead of evaluating the ς-rank() values. We therefore rely only on the is-before-ς() primitives in our algorithms.
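For instance, a hypothetical top-k-style statement that displays the “page” of 50 nodes starting at preorder position 101, say for a scrollable tree view, could be written with the SQL-level PRE_RANK function (cf. § 3.6.5), which an indexing scheme would answer via pre-rank() and pre-select():

SELECT *
FROM Hierarchy
WHERE PRE_RANK(Node) BETWEEN 101 AND 150
ORDER BY PRE_RANK(Node)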

4.1.5 Index Primitives for Updates

We now consider a set of update primitives to manipulate the hierarchy structure represented by an indexing scheme. Note that these primitives work only on index level; they do not modify the hierarchical table apart from potentially updating the affected labels stored in the NODE column. They are, however, usually triggered by table updates, such as when a row is deleted. How exactly SQL update statements are translated into invocations of the primitives is covered in § 4.7. The update primitives can be grouped into four classes: leaf updates, subtree updates, range updates, and inner updates, named after the affected entities. Within each class, three kinds of updates are conceivable: inserting nodes, relocating nodes, and removing nodes. Unlike relocate and remove, insert primitives add new nodes to the hierarchy, so they do not take an existing Node object as their argument; instead, their argument is a row ID r, and their return value is the new Node object. The row r must not yet exist in H, that is, its NODE value must be NULL. Figure 4.2 illustrates the various update classes with regard to the relocate kind on an example hierarchy.


[Figure 4.2 depicts an example hierarchy rooted in A, with children B, C, F, G, J, where D and E are below C and H and I are below G, before (top) and after (bottom) each of the four relocation classes: relocate-leaf(E, below J); relocate-subtree(C, below J); relocate-range([B, F], below J); and relocate-inner(C, [F, J]).]

Figure 4.2: The four classes of relocation updates on an example hierarchy: before (top) and after (bottom).

The other kinds, insert and remove, are similar; the only difference is that the updated entities enter or leave the hierarchy, respectively, instead

of just being relocated. In a sense, the relocate kind subsumes the others: insertions and removals are relocations into and out of the hierarchy, respectively. Thus, any index that handles relocation efficiently can support efficient insertion and removal as well.

Leaf Updates alter a single leaf node.

ν ← H.insert-leaf(r, p) — Inserts a node for row r as a leaf node at position p. Precondition: p is a valid position in H, and does not use an “above” anchor.
H.relocate-leaf(ν, p) — Relocates node ν to position p (if H.is-leaf(ν)).
H.remove-leaf(ν) — Removes node ν (if H.is-leaf(ν)).

Subtree Updates relocate or remove a subtree.

H.relocate-subtree(ν, p) — Relocates the subtree rooted in ν to position p.
H.remove-subtree(ν) — Removes the subtree rooted in ν.

Range Updates alter all subtrees rooted in a range of siblings.

H.relocate-range([ν1, ν2], p) — Relocates all subtrees rooted in range [ν1, ν2] to p.

H.remove-range([ν1, ν2]) — Removes all subtrees rooted in range [ν1, ν2].

Inner Updates alter an inner node.

ν ← H.insert-inner(r, [ν1, ν2]) — Inserts a node ν for row r in place of range [ν1, ν2] and makes [ν1, ν2] children of ν.


H.relocate-inner(ν, [ν1, ν2]) — Puts the children of ν into ν’s place and moves ν to the place of [ν1, ν2], which become children of ν.
H.remove-inner(ν) — Removes node ν and puts its children into its place below its former parent.

Note that insert-subtree() and insert-range() are missing from the set, as our SQL extensions have no syntax for inserting a whole range of nodes at once (or “copying” over a range of nodes from another hierarchical table).

Leaf updates have been the main focus of many prior publications. In comparison to the other classes, they are easiest to implement. However, we found that many scenarios require strong support for the more complex subtree updates. Here, a subtree of an arbitrary size is removed as a whole, or moved in bulk to another location. As an example, consider an enterprise asset hierarchy where an assembly line is relocated to another plant, or a machine or robot is relocated to another assembly line. Since the machine consists of parts and subparts and the assembly line consists of various machines, robots, and equipment, these are large subtree relocations. Note that subtree updates subsume the corresponding leaf updates, as every leaf is a trivial subtree. But since most indexing schemes can employ optimizations for leaves, distinguishing between those classes is still useful in practice. The range updates might seem obscure at first sight, but they are in fact very powerful: They subsume subtree and leaf updates, because a subtree rooted in ν is a trivial sibling range [ν, ν]. Finally, the inner updates are useful for injecting a new level into the hierarchy, and for wrapping a subtree into a new root. As an example application, certain tree differentiation algorithms such as MH-Diff [17] emit edit scripts featuring these operations. An index that is being used for replaying such edit scripts has to support them. Inner updates are subsumed by range updates as well: We can use range relocation to move all children of an inner node to another position, making this node a leaf, then use a leaf update to relocate (or remove) it. Thus, indexes that support range updates—such as our order indexes—can implement all update primitives in terms of range updates.

Unlike the query primitives, the update primitives may be significantly more complex for ordered than for unordered hierarchies, as the sibling order must be maintained. Vice versa, allowing an ordered indexing scheme to neglect the sibling order may be an interesting optimization. For example, some schemes may be able to leverage the unspecific “below” target position and pick the sibling position that yields the cheapest update, while for others, “below x” may safely default to, e. g., “as last child of node x” with no performance loss.

4.2 Implementing Indexing Schemes: State of the Art

In § 2.3.6 we already noted that there is a tremendous amount of prior work on representing hierarchies in databases. Many of the techniques discussed in those works can be used to implement our interface for indexing schemes. Each design provides a different tradeoff between query and update performance.


[Figure 4.3 is a complexity matrix: for each indexing scheme — the naïve schemes Adjacency and Linked; the containment-based labeling schemes NI, Dyn-NI, (Dyn-)NI-Parent, and (Dyn-)NI-Level; the path-based labeling schemes Dewey and Dyn-Dewey; and the index-based schemes B-BOX[B], AO-Tree, BO-Tree[B], O-List[B], and DeltaNI — it lists the amortized average-case complexities of the query primitives (node, is-before, axis, parent, depth, root, leaf, size, and the traversal operations) and of the update classes (leaf, subtree, inner, range, and skew[u]). Legend: n = hierarchy size; l = level/depth of node; b = number of siblings; c = number of children; s = number of descendants; f = number of following siblings incl. descendants; u = number of updates; B = block size; “—” = not supported; ₀ = operates on Node or Cursor objects; † = only in the static variant; ‡ = in the Dyn variant, performance may decrease over time due to growing labels; ∗ = O(1) during index scan.]

Figure 4.3: Asymptotic query and update complexities for various groups of indexing schemes (amortized, average case).

However, it turns out that


many proposals are variations of basic schemes with similar capabilities and asymptotic properties. We can therefore group the prior art into a small taxonomy. In this taxonomy we primarily distinguish between labeling schemes and index-based schemes. The former class of schemes was already introduced in § 2.3.6, together with the subcategories of naïve, containment-based, and path-based labeling schemes. In our terminology of hierarchy indexing schemes, labeling schemes store the essential information encoding the hierarchy structure entirely in the Label data type, and use common database indexes for the auxiliary index structure. These schemes are comparatively simple and non-invasive in that they can be implemented entirely in terms of SQL. Index-based schemes, by contrast, enhance the RDBMS backend by special-purpose auxiliary index structures. These indexes contain the bulk of the hierarchy information, and the Label objects mainly act as handles into those structures. Modifying the backend is of course a much more invasive approach.

Generally speaking, labeling schemes allow for executing certain queries very efficiently by merely considering the labels at hand, whereas index-based schemes suffer from the overhead of their auxiliary index structure. Regarding updates, however, a common drawback of many labeling schemes is that structural updates can easily become very expensive. Most attempts to mitigate this bring along major drawbacks like large memory footprints, a vulnerability to skewed updates, or no longer supporting all desired query and update primitives. With our work on order indexes (§ 4.5) we therefore make a case for index-based schemes.

Figure 4.3 is a compact overview of the different groups of schemes in our taxonomy. It shows the amortized asymptotic time complexities for the most important primitives and the different classes of updates of Figure 4.1. Many of our index primitives are not discussed in the cited works, but inferring their implementation is straightforward. For entries tagged with ₀ in the figure we assume the Node object of the involved node to be readily available; for the other entries we assume a Label argument. The update operations in general are given Node objects to indicate a desired target position. Depending on the access path to the index during query execution, the Node objects may first need to be obtained from the labels via the node() function—which can have non-negligible costs, as the table also shows—or they might happen to be readily available from previous operations. A case where the Node objects are already available “for free” is when the traversal primitives are used to scan the hierarchy, or when a Node was obtained from a prior update at the same position. For the asymptotic complexities of updates we therefore do not include the time for obtaining the Node handles.

4.2.1 Challenges in Dynamic Settings

A particular emphasis in our following analysis is on highly dynamic use cases. We will see that existing dynamic indexing schemes lack important capabilities: they either ignore important query primitives, or they inherently suffer from certain problems inhibiting their update flexibility or robustness. To characterize these missing capabilities, we identify three problems P1 to P3 in this section. Figure 4.4 illustrates different indexing schemes—the familiar adjacency list model, as well as PSL, NI, GapNI, and Ordpath, which we cover later—to help us exemplify the problems.


The example hierarchy consists of the super-root ⊤ above the root A; A has the children B, C, H, I, and C has the children D, E, F, G. The Label data per node:

Node  Adjacency    NI        GapNI         PSL        PPPL           Ordpath
      ID  Parent   [l, u]    [l, u]        (p, s, l)  (p, p', P, l)
A     'A'  NULL    [ 0, 17]  [   0, 1700]  (0, 8, 0)  (0, 9, −, 0)   1
B     'B'  'A'     [ 1,  2]  [ 100,  200]  (1, 0, 1)  (1, 1, 0, 1)   1.1
C     'C'  'A'     [ 3, 12]  [ 300, 1200]  (2, 4, 1)  (2, 6, 0, 1)   1.3
D     'D'  'C'     [ 4,  5]  [ 400,  500]  (3, 0, 2)  (3, 2, 2, 2)   1.3.1
E     'E'  'C'     [ 6,  7]  [ 600,  700]  (4, 0, 2)  (4, 3, 2, 2)   1.3.3
F     'F'  'C'     [ 8,  9]  [ 800,  900]  (5, 0, 2)  (5, 4, 2, 2)   1.3.5
G     'G'  'C'     [10, 11]  [1000, 1100]  (6, 0, 2)  (6, 5, 2, 2)   1.3.7
H     'H'  'A'     [13, 14]  [1300, 1400]  (7, 0, 1)  (7, 7, 0, 1)   1.5
I     'I'  'A'     [15, 16]  [1500, 1600]  (8, 0, 1)  (8, 8, 0, 1)   1.7

Figure 4.4: An example hierarchy and the corresponding data stored in the Label objects for different indexing schemes.

After the relocation, A has the children B, H, I; C with its children D, E, F, G now sits below H. The Label data:

Node  Adjacency    PSL        NI        GapNI         Ordpath
      ID  Parent   (p, s, l)  [l, u]    [l, u]
A     'A'  NULL    (0, 8, 0)  [ 0, 17]  [   0, 1700]  1
B     'B'  'A'     (1, 0, 1)  [ 1,  2]  [ 100,  200]  1.1
C     'C'  'H'     (3, 4, 2)  [ 4, 13]  [1309, 1390]  1.5.1
D     'D'  'C'     (4, 0, 3)  [ 5,  6]  [1318, 1327]  1.5.1.1
E     'E'  'C'     (5, 0, 3)  [ 7,  8]  [1336, 1345]  1.5.1.3
F     'F'  'C'     (6, 0, 3)  [ 9, 10]  [1354, 1363]  1.5.1.5
G     'G'  'C'     (7, 0, 3)  [11, 12]  [1372, 1381]  1.5.1.7
H     'H'  'A'     (2, 5, 1)  [ 3, 14]  [1300, 1400]  1.5
I     'I'  'A'     (8, 0, 1)  [15, 16]  [1500, 1600]  1.7

Figure 4.5: Relocating subtree C (below H) in the example from Figure 4.4. Note how the PSL, NI, GapNI, and Ordpath labels of C and all of its descendants change, whereas Adjacency merely updates the Parent entry of C.
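Under the adjacency list model, the relocation depicted in Figure 4.5 is literally a single-field update; a point we return to under problem P2 below. Assuming the rows live in a table named PartList (a hypothetical name), the statement is simply:

UPDATE PartList SET Parent = 'H' WHERE ID = 'C'

With a hierarchical table, the counterpart is an equally simple UPDATE of the Node field of C; how such statements translate into the update primitives is covered in § 4.7.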


[P1] Lack of Query Capabilities. Certain indexing schemes do support updates decently, but fail to offer query capabilities to evaluate even fundamental queries, which renders them infeasible for our use cases. An example is the adjacency list model, which cannot even handle the ancestor-descendant relationship efficiently. The fundamental query primitives we identified in § 4.1.4 are the minimum we require. An implementation that fails to support them cannot be considered a general-purpose hierarchy indexing scheme.

[P2] Lack of Complex Update Capabilities. Various use cases demand an indexing scheme that supports a rich set of update operations efficiently. However, most existing schemes are confined to leaf updates, that is, insertion or removal of single leaf nodes, and fail to recognize more complex operations. Consider subtree relocation: The trivial adjacency list model happens to naturally


support this quite well. However, virtually all labeling schemes by design preclude an efficient implementation, because they inherently require relabeling all nodes in the relocated subtree. In general, subtree and range operations have to be implemented naïvely through node-by-node processing, requiring at least s leaf updates for a subtree of size s. Figure 4.5 shows the hierarchy from Figure 4.4 with the subtree rooted in C moved below H, and illustrates the Ω(s) fields that need to be updated.

For small subtrees, an update complexity of Ω(s) might be tolerable. However, real-world hierarchies—such as those found in SAP ERP datasets—tend to have a large average fan-out. Thus, even if a node that has only leaves as children is relocated, s will often be in the magnitude of thousands. Of course, updating larger subtrees will be detrimental to overall system performance only if a lot of such operations appear in the workload. But in certain use cases, a high percentage of updates (e. g., 31% in the enterprise asset hierarchy we examined in [31]) are indeed subtree relocations. Furthermore, note that although subtree relocation may appear as an unnatural bulk operation in comparison to single leaf insertion or removal, the operation is quite heavily exercised in SQL statements: Our language extensions make it easy to relocate a subtree by simply updating the NODE field of its root. Likewise, in the adjacency list table shown in Figure 4.5, the relocation is performed by simply setting the Parent field of C to 'H' (as sketched below Figure 4.5). A single UPDATE statement may in fact provoke an arbitrary number of subtree relocations at once. Our assessment of prior work therefore places an emphasis on the complex update operations we identified in § 4.1.5.

[P3] Vulnerability to Skewed Updates. Certain dynamic labeling schemes crumble when they are confronted with skewed updates, such as when inserts are issued repeatedly at the same position. In some scenarios these updates are more frequent than is commonly acknowledged. For example, when inserting a new plant into an enterprise asset hierarchy, many nodes will be added at one position. Fixed-length labeling schemes commonly indulge in excessive relabeling in this case, while variable-length schemes decay in their query performance and memory effectiveness due to overly growing labels. We therefore point out such vulnerabilities in our assessment. In Figure 4.3, column skew[u] represents “skewed” leaf node insertions. It depicts the complexity of a single skewed insertion after u other skewed insertions have taken place. This is an indication of how severely an indexing scheme can be impacted by skewed insertion patterns in the worst case.

We proceed to explore the indexing schemes in our taxonomy with a focus on dynamic settings, and assess to which extent they suffer from the three identified problems.

4.2.2 Naïve Hierarchy Representations

Figure 4.3 includes two “naïve” indexing schemes. They are comparatively easy to implement, but cannot stand up to the full-blown indexing schemes in terms of query capabilities and efficiency (P1).

The first, Adjacency, is an incarnation of the familiar adjacency list scheme (§ 2.3.1). We assume a hash index for the ID and PID columns, so that node(), which performs a


lookup by label, can run in O(1). As Adjacency does not maintain a sibling order, it has no support for any of the order-based query primitives. The extreme simplicity of the scheme happens to make subtree relocations and the is-parent() predicate remarkably efficient. On the other hand, it is unsurprisingly very weak in answering most of the query primitives (P1).

The second naïve representation is Linked, a simple in-memory node structure reminiscent of a classic linked list, whose structure matches the hierarchy structure. Each node has links—in the form of direct pointers—to the parent, the first and last child, and the previous and next sibling. A label stores direct pointers into the hierarchy representation, so node() also runs in O(1).

The limitations of these two naïve schemes can be clearly seen with the important query primitives depth() and axis(): Because their implementations have to walk up the tree, they run in O(l) and are also likely to cause O(l) cache misses.

4.2.3 Containment-based Labeling Schemes

The first major category of non-naïve schemes are labeling schemes. We begin with the subcategory of containment-based labeling schemes. Our discussion of labeling schemes in general assumes that the label column is indexed with a B-tree, where B is the block size of the B-tree. We further make use of that B-tree for the traversal operations, as they could otherwise not be implemented efficiently. Under this assumption, the node() operation runs in logarithmic time, because it has to perform a B-tree lookup to locate the corresponding index entry. Note that the B-tree also adds an O(B) term to all update operations for these schemes, because the block in which the corresponding index entry lies has to be updated. For simplicity reasons, and because B is a constant independent of the hierarchy size, we omit this term in Figure 4.3.

Column NI in Figure 4.4 shows the classic nested intervals labeling [44, 48, 118], and a variation, the Pre/Post scheme [41], where each node is labeled with its preorder and postorder ranks. We already introduced both in § 2.3.6. These schemes are static (P2). Their fundamental weakness is that each insertion or removal requires relabeling O(|H|) labels on average, as all interval bounds behind a newly inserted bound have to be shifted to make space. Static nested interval schemes of this kind are grouped under the name NI in the tables.

Considering queries, the plain NI and Pre/Post schemes have similar, limited capabilities: For example, we cannot test the important is-parent() predicate, because neither scheme allows us to compute the distance between a node and an ancestor. This severe limitation renders a nested intervals scheme without further fields useless (P1). It can be mitigated by either storing the depth of a node or its parent in addition to the interval. In the tables, these variants are named NI-Level and NI-Parent, respectively.

Various mitigations for the nested intervals update problem have been proposed. These are collectively grouped under the name Dyn-NI in the tables.

• Li et al. [63] suggest pre-allocating gaps between the interval bounds. Column GapNI in Figure 4.5 illustrates this. As long as a gap exists, new bounds can be


placed in it and no other bounds need to be shifted; once a gap between two nodes is filled up, all bounds are relabeled with equally spaced values. The problem is that relabelings are expensive, and skewed insertions may fill up certain gaps overly quickly and lead to unexpectedly frequent relabelings (P3). Also, all s nodes in a range or subtree being updated still need to be relabeled (P2).

• Amagasa et al. [2] propose the QRS encoding based on pairs of floating-point numbers. Schemes along these lines are essentially gap-based as long as they rely on fixed-width machine representations of floats.

• Boncz et al. [7] tackle the update problem with their Pre/Size/Level encoding (PSL, see Figure 4.4) by storing the pre-rank() values implicitly as a page offset, which yields update characteristics comparable to gap-based schemes.

• W-BOX [96] uses gaps but tries to relabel only locally using a weight-balanced B-tree. Its skewed update performance is superior to basic gap-based schemes.

• The Nested Tree [115] uses a nested series of nested interval schemes to relabel only parts of the hierarchy during an update and is therefore comparable to gap-based schemes as well.

Another idea to tackle the update problem for NI is to use variable-length data types to represent interval bounds: For example, the QED [60], CDBS [61], and CDQS [62] encodings by Li et al. are always able to derive a new label between two existing ones, and thus avoid relabeling completely. EXCEL [73] uses an encoding comparable to CDBS. It tracks the lower value of the parent for enhanced query capabilities. While these encodings never have to relabel nodes, they bear other problems: The variable-length labels cannot be stored easily in a fixed-size table column, and comparing them is more expensive than comparing fixed-size integers. In addition, labels can degenerate and become overly big due to skewed insertion (P3). Cohen et al. [22] proved that for any labeling scheme that is not allowed to relabel existing labels upon insertion, an insertion sequence of length N exists that yields labels of size Ω(N). Thus, the cost of relabeling is traded in for a larger, potentially unbounded label size. Query primitives that suffer from these degenerating labels are tagged with ‡ in Figure 4.3.

All gap-based and variable-length NI schemes handle inner node updates decently by wrapping a node range into new bounds. For example, with GapNI in Figure 4.4 we could insert a parent node K above D, E, and F by assigning it the bounds [350, 950]. However, as soon as the node’s depth [7] or parent [73] are to be tracked (these variants are named Dyn-NI-Level and Dyn-NI-Parent in the tables), inner node updates turn expensive, as the depths of all s descendants change, and the parent references of all c children of K (namely D, E, and F) need to be adjusted. Updates to subtrees or ranges of size s always require us to modify s labels. Unfortunately, tracking at least the depth is necessary for many queries (P1). Thus, containment-based schemes suffer from P2 and P3, which limits their use in dynamic settings.


4.2.4 Path-based Labeling Schemes

We next consider path-based labeling schemes. From this subcategory we already discussed the Dewey [99] labeling scheme (§ 2.3.6), which is the basis of several more sophisticated schemes. Figure 4.5 repeats the example data. Dewey is not dynamic (P2): We can easily insert a new node as rightmost sibling, but in order to insert a node ν between two siblings, we need to relabel all siblings to the right of ν and all their descendants (as indicated by factor f in Figure 4.3). For unordered hierarchies, inserting rightmost siblings is sufficient, but for ordered hierarchies insertion between siblings is a desirable feature. Therefore, several proposals of dynamic path-based schemes try to enhance Dewey correspondingly:

• One prominent representative is Ordpath ([79], see also § 2.3.7). It is similar to Dewey, but uses only odd numbers to encode sibling ranks, while reserving even numbers for “careting in” new nodes between siblings. This way, Ordpath supports insertions at arbitrary positions without having to relabel existing nodes. In Figure 4.4, for example, inserting a sibling between C and H would result in the label 1.4.1. Note that the dot notation is only a human-readable surrogate; Ordpath actually stores labels in a more compact binary format.

• DeweyID [46] improves upon Ordpath by providing gaps that are possibly larger than the ones of Ordpath, thus resulting in fewer carets and usually shorter labels.

• CDDE [113] also aims for a Dewey encoding with shorter label sizes than Ordpath.

• The techniques of [60–62] can also be used to build dynamic path-based schemes.

In the tables, Dewey refers to static path-based schemes such as Dewey itself and Dyn-Dewey refers to dynamic ones (e. g., Ordpath and CDDE). Dynamic path-based schemes are variable-length labeling schemes, and the proof of [22] holds as well, so they pay the price of potentially unbounded label sizes. In addition, all path-based schemes pay a factor l on all update and most query operations, since the size of each node’s label is proportional to its depth.

Considering updates, the dynamic variants cannot handle inner node updates efficiently, as the paths and thus the labels of all descendants of an updated inner node would change. An exception to this is OrdpathX [12], which can handle inner node insertion without having to relabel other nodes. All path-based schemes inherently cannot handle subtree and range relocations efficiently, as the paths of all descendants have to be updated (P2). For ordered hierarchies, they are also vulnerable to skewed insertions (P3); however, update sequences that trigger worst-case behavior are much less common than for a containment-based scheme.

4.2.5 Index-based Schemes

To evaluate queries, index-based schemes use special-purpose auxiliary index structures rather than considering just the Label (all operations tagged with 0 in Figure 4.3). As such, they must be tightly integrated into the database system. Their advantage is that they generally offer much improved support for updates.


B-BOX [96] uses a keyless B+-tree to represent a containment-based scheme dynamically. It has the same update complexity as the BO-Tree in Figure 4.3. However, it represents only lower and upper bounds and does not add information to efficiently determine a node's depth() or parent; it thus has limited query capabilities (P1). Its node() implementation runs in O(B), because it always scans a block of size B when searching for index entries.

DeltaNI, one of the contributions of this research project [31], uses an index to represent a containment-based scheme with depth() support. DeltaNI supports even complex update operations, such as relocating an entire subtree, in O(log |H|) time. As an index for temporal hierarchies, it is able to capture a whole history of updates (factor u in the figures) and can answer time-traveling queries. It can be used for non-temporal hierarchies as well by simply keeping all data in a single version delta. While DeltaNI bears none of the three identified problems, its overall query performance is generally inferior to unversioned schemes, as the evaluation in [33] shows.

Order indexes are a family of index-based schemes we developed in order to overcome the mentioned problems of prior works. An order index in essence represents a nested intervals labeling in a dynamic, skew-resistant data structure. In [32] and [33] we describe three specific implementations of this concept: the AO-Tree based on the AVL tree, the BO-Tree based on the B+-tree, and the O-List based on a linked list of blocks. They combine various ideas, from keyless trees (B-BOX) to accumulation trees (DeltaNI) through to gap allocation techniques (GapNI), to achieve increased update efficiency and robustness for dynamic workloads. The tree-based implementations are able to handle all presented update operations efficiently in logarithmic worst-case time. They avoid degeneration in case of unfavorable update patterns, and provide the query capabilities of labeling schemes with highly competitive performance. By adjusting the configurable parameters, the BO-Tree and the O-List can be tuned towards lower query times or lower update costs, so a trade-off appropriate for the application at hand can be chosen. As we recommend order indexes as the standard dynamic indexing schemes for our hierarchy framework, we give a more detailed overview in § 4.5. B-BOX, DeltaNI, AO-Tree, and BO-Tree can handle subtree and range relocations with logarithmic worst-case complexity and thus do not exhibit any of the update problems.

4.3 Schemes for Static and Dynamic Scenarios

In order to accommodate a wide variety of application scenarios, we expect any database system that implements our framework to offer a selection of multiple alternative hierarchy indexing schemes. In principle, a suitable scheme could be chosen individually for each hierarchical table. However, the user should not be burdened with this choice; we rather intend the DBMS to pick the optimal scheme automatically among the available implementations.


The user may more or less indirectly influence the choice through additional parameters to the hierarchical dimension, in particular the critical UPDATES option (§ 3.4). We propose a decision scheme along the following lines:

• For derived hierarchies, which are inherently static, and for hierarchical base tables with UPDATES = BULK, the obvious choice is a static labeling scheme. In § 4.4 we introduce a suitable scheme, the Pre/Post/Parent/Level scheme (PPPL), which is particularly tailored to the requirements of our index interface.

• If the user requires support for complex updates (UPDATES = SUBTREE), the BO-Tree we discuss in § 4.5 is a safe all-round choice. To further improve query performance without sacrificing update support, the O-List may be a good alternative. A path-based scheme may also perform well, in particular when the user settles for simple updates (UPDATES = NODE). However, this choice has to be made by considering the workload at hand.

• If the hierarchical table also has an application-time or system-time temporal dimension, and this dimension is accessed non-trivially by queries in the workload, a temporal indexing scheme like DeltaNI is recommended. However, the implementation and performance of such a scheme depend heavily on the basic design for temporal dimensions chosen by the specific RDBMS, which is beyond the scope of this thesis.

More adaptive approaches to choosing the most suitable scheme according to the workload are conceivable, but they are left to future work (see § 7). We proceed with a detailed discussion of the proposed PPPL scheme (§ 4.4) and of order indexes (§ 4.5).

4.4 PPPL: A Static Labeling Scheme

The Pre/Post/Parent/Level scheme (PPPL) is an extension of the Pre/Post containment-based labeling scheme, which we propose as the default built-in indexing scheme for static or read-mostly scenarios. In these scenarios the hierarchy is loaded once and rarely changed. Except for bulk-loading, no updates are supported. The Label of PPPL is a 4-tuple [pre, post, parent-pre, level] consisting of the pre-rank(), the post-rank(), the pre-rank of the parent, and the depth() of the corresponding node. Additionally, the hierarchical table is indexed on pre-rank() and on post-rank() using two simple direct lookup tables, pre-to-rowid[] and post-to-rowid[]. No additional information is needed for Node or Cursor, so these types are identical to Label. Figure 4.4 shows the NODE column of an example hierarchy. Let us examine how the query primitives of our index interface can be implemented with PPPL. The basic access primitives are almost no-ops:

H.node(ℓ), H.cursor(ℓ) ≡ ℓ
H.label(ν) ≡ ν
H.rowid(ν) ≡ pre-to-rowid[ν.pre]


[Figure content: the conceptual example hierarchy (A with children B, F, and G; B with child C; C with children D and E), the hierarchical table with its Label columns [level, lower, upper], and the associated AO-Tree; entries [r and ]r denote the lower and upper bound of row r, and back-links are drawn as dotted arrows.]

Figure 4.6: A hierarchy represented by an AO-Tree order index [33].

The pre and post fields are sufficient for answering the critical is-before-pre(), is-before-post(), and axis() predicates. The parent-pre field adds support for is-parent().

H.is-before-pre(ν1, ν2) ≡ ν1.pre < ν2.pre
H.is-before-post(ν1, ν2) ≡ ν1.post < ν2.post
H.axis(ν1, ν2) ≡ (via is-before)
H.is-parent(ν1, ν2) ≡ ν2.parent-pre = ν1.pre

The level field adds support for the properties depth(), size(), is-root(), and is-leaf().

H.depth(ν) ≡ ν.level
H.is-root(ν) ≡ H.depth(ν) = 1
H.is-leaf(ν) ≡ H.size(ν) = 1
H.size(ν) ≡ H.depth(ν) + ν.post − ν.pre

The pre-to-rowid[] and post-to-rowid[] indexes add support for traversal in preorder and postorder, and for the ordinal access primitives.

H.ς-next(c) ≡ T[ς-to-rowid[ν.ς + 1]].Node
H.next-sibling(c) ≡ H.next-following(c)
H.next-following(c) ≡ T[pre-to-rowid[ν.post + ν.level + 1]].Node
H.ς-rank(ν) ≡ ν.ς
H.ς-select(i) ≡ T[ς-to-rowid[i]].Node

As these definitions show, most primitives boil down to cheap O(1) arithmetics on the label itself. This makes PPPL essentially as fast as an indexing scheme can get.
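To make the O(1) arithmetic tangible, here is a compact C++ sketch of a PPPL label and some of the primitives defined above; the struct layout and function names are our own rendering, not a prescribed format:

    #include <cstdint>

    // One possible memory layout for a PPPL label (16 bytes).
    struct PPPL {
        std::uint32_t pre, post, parentPre, level;
    };

    // is-before-pre / is-before-post: plain integer comparisons.
    bool isBeforePre(const PPPL& a, const PPPL& b)  { return a.pre  < b.pre;  }
    bool isBeforePost(const PPPL& a, const PPPL& b) { return a.post < b.post; }

    // "a is a descendant of b" via the interval containment implied
    // by the pre/post ranks.
    bool isDescendant(const PPPL& a, const PPPL& b) {
        return b.pre < a.pre && a.post < b.post;
    }

    bool isParent(const PPPL& a, const PPPL& b) { return b.parentPre == a.pre; }

    std::uint32_t depth(const PPPL& v) { return v.level; }
    // size = depth + post - pre, as defined above; always >= 1.
    std::uint32_t size(const PPPL& v)  { return v.level + v.post - v.pre; }
    bool isLeaf(const PPPL& v)         { return size(v) == 1; }

Every primitive is branch-free arithmetic on at most two labels, which is what makes PPPL hard to beat for read-mostly workloads.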

4.5 Order Indexes: A Family of Dynamic Indexing Schemes

The basic idea of order indexes is to represent nested intervals in a dynamic data structure.


A central observation on which they are based is that in a nested intervals encoding, the actual numerical value of each lower or upper bound has little importance; what really counts is the relative order among the bounds. For example, in Figure 4.4 (p. 76), we can infer that node F is a descendant of node A because its interval [8, 9] is a proper subinterval of A's interval [0, 17], that is, 0 < 8 and 9 < 17. Thus, instead of representing the numbers explicitly, we can use a data structure that directly represents the order relation (<) among the bounds. Ordered data structures such as the classic AVL- or B-trees maintain an order relation among their entries by design. Consequently, they can be adapted to represent nested intervals encodings implicitly, which is exactly what order indexes do.

An order index conceptually represents each hierarchy node by two interval bounds and its depth value. These components [lower, upper, level] are stored in the Label. However, a lower or upper bound is not an explicit number or other literal, but rather a special link to a corresponding entry in the ordered data structure. We call these links back-links, as they refer back from a table row to an index entry, whereas common secondary indexes merely point the other way, from an index entry to a row (through its row ID). The main task of the ordered data structure is to maintain the relative order of its entries; hence the term "order index." Each entry represents either a lower or an upper bound of a node. Regardless of the underlying ordered data structure, the information an entry conceptually stores is the rowid of the corresponding table row and an is-lower flag determining whether it is the lower or the upper bound of that row. An order index cursor is a direct reference (e. g., a pointer) to a specific entry in the ordered data structure. A cursor can be obtained from a back-link, although this incurs a non-trivial cost. The Node and Cursor types are identical (i. e., no further distinction is necessary) and simply wrap an order index cursor.

4.5.1 Order Index Interface

We now discuss the low-level interface an order index O provides for querying. In the following, l is a back-link, and c, c′, c1, and c2 are order index cursors.

c ← O.entry(l) — Returns an order index cursor to the entry for back-link l.
r ← O.rowid(c) — Returns the ID of c's associated row.
O.is-lower(c) — Returns true if c represents a lower bound.

O.before(c1, c2) — Returns true if c1 is before c2 in the entry order.
c′ ← O.next(c) — Returns a cursor c′ to the next entry in the entry order.
O.adjust-level(c) — Returns the level adjustment for c.

The starting point for working with an order index is a back-link l, that is, a value from the lower or upper field of a Label. Using O.entry(l), a back-link can be followed to obtain a cursor. (This is in fact the only functionality back-links support.) O.rowid(c) and O.is-lower(c) access the information of the entry to which c points.


next() is used to traverse the ordered data structure. The most central primitive is before(), which reflects the order relation implied by the order index. The level adjustments returned by adjust-level() are a mechanism to maintain depth() information dynamically: The level stored in the Label of a node ν is a relative value; its actual depth H.depth(ν) is obtained by adding the level adjustment for its lower bound to the stored level. We will elaborate further on this mechanism below. While the implementations of rowid() and is-lower() are trivial, the implementations of entry(), next(), before(), and adjust-level() differ among the three order index implementations we propose. entry() depends on how back-links are actually represented. next() corresponds to a basic forward traversal of whatever ordered data structure is used. Similarly, before() has to traverse the data structure to detect the relative positions of the two given entries.
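The six operations translate naturally into an abstract interface. The following C++ rendering is a sketch under our own naming conventions; the handle types in particular are implementation-defined and shown here only as opaque placeholders:

    #include <cstdint>

    using RowId = std::uint64_t;

    struct BackLink { std::uint64_t opaque; };           // stored in the Label
    struct Cursor   { void* block; std::uint32_t pos; }; // direct entry reference

    class OrderIndex {
    public:
        virtual ~OrderIndex() = default;
        virtual Cursor       entry(BackLink l) const = 0;    // follow a back-link
        virtual RowId        rowid(Cursor c) const = 0;      // row of the entry
        virtual bool         isLower(Cursor c) const = 0;    // lower vs. upper bound
        virtual bool         before(Cursor c1, Cursor c2) const = 0;  // entry order
        virtual Cursor       next(Cursor c) const = 0;       // successor entry
        virtual std::int64_t adjustLevel(Cursor c) const = 0; // level adjustment
    };

In an actual engine one would avoid virtual dispatch on this hot path (e. g., via templates); the class form is used here only to make the contract explicit.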

4.5.2 The AO-Tree

Figure 4.6 shows an example hierarchy (upper right) and its representation as a hierarchical table and an associated AO-Tree order index. The AO-Tree (AVL-Order-Tree) is one of our three proposed implementations. It is based on a keyless AVL tree. A few exemplary back-links are shown as red, dotted arrows. An opening bracket denotes a lower and a closing bracket an upper bound; for example, ]3 is the entry for the upper bound (is-lower is false) of row #3 (rowid is 3). The further left an entry appears in the AO-Tree in the figure, the lower its place in the order relation. For example,

... < ]4 < ]2 < ]1 < [5 < ...,

and O.before(]2, ]5) is true. Do not confuse hierarchy edges and AO-Tree edges: The purpose of the AO-Tree edges is to maintain an ordered, balanced tree. Only the ordering implied by this tree is meaningful, in the sense that it constitutes an implicit nested intervals encoding. Thus, for example, we could also rotate the entries ]3 and [4 in such a way that ]3 becomes the right child of [3 and [4 the right child of ]3. The implied ordering [3 < ]3 < [4 would stay the same. Note that, strictly speaking, the is-lower flag would not have to be stored explicitly in each entry, as

O.is-lower(c) ≡ (O.entry(H.label(c).lower) = c).

However, implementing is-lower() this way would incur a round trip to the table on each call, with a potential cache miss. We therefore prefer to store is-lower explicitly.

4.5.3 Block-based Order Indexes

Self-balancing binary trees such as AVL trees or red/black trees offer logarithmic complexity for most operations. While this makes them good candidates for an order index structure, their performance in practice suffers from their cache-unfriendliness.


We therefore devised two more sophisticated order index implementations: the BO-Tree and the O-List. Rather than storing one entry per tree node like the AO-Tree, they are based on larger blocks of memory that fit B entries each. We therefore call them block-based order indexes and B the block size.

Block-based data structures afford a technique called accumulation, which allows us to efficiently maintain the level adjustments. (We originally conceived it in the context of DeltaNI [31] for different purposes.) Accumulation works for any hierarchically organized data structure that stores its entries in blocks, such as a B-tree or even a classic binary tree (where the "blocks" are just single-entry nodes). The idea is to store a block level with each block. The level adjustment of an entry e is obtained by summing up the block levels of all blocks on the path from e's block to the root block. This brings along the cost that O.adjust-level(c) becomes linear in the height of the data structure, usually O(log |H|). But in return, even for range relocations at most O(log |H|) block levels need to be adjusted in order to update the depths of the involved nodes. As an optimization, during an index scan the level adjustment can be tracked and needs to be refreshed only when a new block starts, yielding amortized constant time for depth() in this case.
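The following C++ fragment sketches the accumulation mechanism under a deliberately simplified block structure of our own design; it is meant to show the principle, not an actual BO-Tree block layout:

    #include <cstdint>

    // Simplified block of a hypothetical block-based order index.
    struct Block {
        Block*       parent;     // back-link to the parent block (null at root)
        std::int64_t blockLevel; // delta contributed to all entries below
    };

    // adjust-level: sum the block levels on the root path; O(height) additions.
    std::int64_t adjustLevel(const Block* leafBlock) {
        std::int64_t sum = 0;
        for (const Block* b = leafBlock; b != nullptr; b = b->parent)
            sum += b->blockLevel;
        return sum;
    }

    // Moving the entries below `root` delta levels deeper or shallower
    // touches a single block level instead of every affected entry.
    void shiftLevels(Block* root, std::int64_t delta) {
        root->blockLevel += delta;
    }

The design choice is the classic lazy-aggregation trade-off: reads pay a logarithmic walk to the root, but a relocation that would otherwise rewrite the depth of every moved node becomes a constant number of block-level adjustments per affected block.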

The BO-Tree (B+-Order-Tree) is basically a B+-tree that has been adapted as follows: The entries are stored in a series of leaf blocks, which each have pointers to the neighboring leaf blocks for faster scans. An entry in a leaf block consists of a row ID and an is-lower flag. (Recall that no keys are needed.) In an inner block, there are no separator keys but only child block pointers. Each block additionally maintains a back-link to its parent block and a block level.

Most B+-tree operations, including splitting and rebalancing, need almost no adaptations. Key search is no longer required, since BO-Trees are keyless and the table stores back-links to leaf entries rather than keys. A cursor directly references an entry within a leaf block; more precisely, a BO-Tree cursor c consists of a pointer c.block to the block hosting the entry and a position c.pos. The back-links to parent blocks are needed because most operations involve leaf-to-root navigation. adjust-level(), for instance, is computed by summing up all block levels on the path from the corresponding entry's leaf block to the root block.

The worst- and best-case complexity of adjust-level() and similar operations is proportional to the tree height, which is in O(log_B |H|). Thus, the wider the blocks in the BO-Tree, the faster these primitives work. Due to the large logarithm base B, a worst-case complexity of O(log_B |H|) is a very favorable asymptotic bound. For example, a tree with B = 1024 is able to represent a hierarchy with 500 million nodes at a height of only 3, and thus needs at most 3 steps for a binary predicate or a depth() query. Since the root block will probably reside in cache, only 2 cache misses are to be anticipated in this case, which makes the BO-Tree a very cache-efficient data structure.

The simpler O-List is not a tree structure but merely a doubly linked list of blocks. Block keys encode the order among the blocks, an idea borrowed from GapNI.


Block keys are integers that are assigned using the whole key universe, while leaving gaps so that new blocks can be inserted between two blocks without relabeling, as long as there is a gap. In addition to the block key, each block maintains a block-level field for the level adjustment. The blocks are comparable to BO-Tree leaf blocks without a parent block, and are treated in a similar manner: Inserting into a full block triggers a split; a block whose load factor drops below a certain percentage (e. g., 40%) is either refilled with entries from a neighboring block or merged with it. adjust-level() simply returns the block level of the corresponding entry's block. is-before(c1, c2) first checks whether the corresponding entries e1 and e2 are in the same block; if so, it compares their positions within the block, and if not, it compares the keys of their blocks. As both adjust-level() and is-before() reduce to a constant number of arithmetic operations, they are in O(1), which makes them even faster than for the BO-Tree. This query performance comes at the price of a possibly non-logarithmic update complexity, however.
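For illustration, the constant-time O-List primitives might look as follows; the block layout and names are simplified placeholders of our own, not the actual implementation from [33]:

    #include <cstdint>

    struct OListBlock  { std::uint64_t key; std::int64_t blockLevel; };
    struct OListCursor { OListBlock* block; std::uint32_t pos; };

    // is-before: same block -> compare positions; otherwise compare block
    // keys, which encode the order among blocks (gaps left for splits).
    bool isBefore(OListCursor c1, OListCursor c2) {
        if (c1.block == c2.block) return c1.pos < c2.pos;
        return c1.block->key < c2.block->key;
    }

    // adjust-level: a single field access, since there are no inner blocks.
    std::int64_t adjustLevel(OListCursor c) { return c.block->blockLevel; }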

A more detailed discussion of the implementation of the AO-Tree, BO-Tree, and O-List indexes is beyond the scope of this thesis; the reader is referred to [32] and [33].

4.5.4 Representing Back-Links

In block-based order indexes, different designs for the representation of back-links are conceivable. Recall that back-links are references to bounds stored in a Label, whose only purpose is that they can be passed to entry() to obtain a corresponding cursor. The task of entry() is to locate an entry in the data structure given the information stored in the back-link. In the AO-Tree, back-links and order index cursors are identical: they are simply direct pointers to the AVL tree nodes, so O.entry(l) ≡ l. This is only feasible because AVL tree entries never move in memory. For the BO-Tree and the O-List, however, entries are shifted around within their blocks or even moved across blocks by rotate, merge, and split operations.

Suppose we represented back-links exactly like cursors: by a block pointer and the offset within the block. We call this the pos approach. It makes entry() a no-op. However, the back-links (i. e., cursors) stored in the Label objects (i. e., the NODE column) would have to be kept up to date whenever entries are moved out of their place, even when an entry is just shifted around within its block. As any insertion or removal in a block involves shifting all entries behind the corresponding entry, this slows down updates considerably, especially for a larger B. As adjacent entries in the order index blocks are not necessarily linked to adjacent tuples in the table (recall that hierarchy indexes are non-clustered indexes), this incurs random data accesses and thus causes a significant slowdown. We suggest two alternative approaches for representing back-links in block-based data structures: scan and gap.

• With the simple scan approach, a back-link consists only of a [block pointer] indicating the block containing the entry. To actually locate an entry, entry() has to scan the block linearly for the row ID. These block scans add an unattractive O(B) factor to most queries and thus hinder us from using larger blocks.


On the other hand, scan has the advantage that back-links need to be updated only when entries are migrated to other blocks. The scan approach is used by B-BOX [96].

• The gap approach is a compromise between pos and scan. It again leverages ideas of GapNI: Each entry is tagged with a block-local key (e. g., 1 byte) that is unique within the block. A back-link is a [block pointer, key] pair. Initially, the keys are assigned by dividing the key space equally among the entries in the block. When an entry is inserted, it is assigned the arithmetic mean of its neighbors' keys; if no gap is available, all entries in the block are relabeled. entry() uses binary search or interpolation search to locate the entry by its block-local key (see the sketch below).

gap is significantly cheaper than pos, which effectively relabels half a block on average on each update. As with GapNI, an adversary can trigger relabelings by repeatedly inserting into the same gap. That said, even frequent relabelings do not pose a serious problem, as they are restricted to a single block of constant size B. Our evaluation in [33] therefore suggests gap as a good all-round approach, but order indexes might be configured to use pos or scan to further optimize certain applications.
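A sketch of entry() for the gap approach follows; the block layout, the 1-byte key width, and the names are illustrative assumptions of ours, and relabeling on gap exhaustion is omitted:

    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <vector>

    struct Entry     { std::uint8_t key; std::uint64_t rowid; bool isLower; };
    struct LeafBlock { std::vector<Entry> entries; };   // kept sorted by key

    struct GapBackLink { LeafBlock* block; std::uint8_t key; };

    // entry(): locate the entry inside its block by binary search on the
    // block-local key -- O(log B) rather than the O(B) scan approach.
    std::size_t entryPosition(const GapBackLink& l) {
        const auto& es = l.block->entries;
        auto it = std::lower_bound(es.begin(), es.end(), l.key,
            [](const Entry& e, std::uint8_t k) { return e.key < k; });
        assert(it != es.end() && it->key == l.key);     // key must exist
        return static_cast<std::size_t>(it - es.begin());
    }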

4.5.5 Implementing the Query Primitives

All query primitives of our hierarchy index interface can be implemented in terms of the six order index operations, as we show in the following. We begin with the basic access primitives.

H.node(ℓ), H.cursor(ℓ) ≡ O.entry(ℓ.lower)
H.label(ν) ≡ T[H.rowid(ν)].Node
H.rowid(ν) ≡ O.rowid(ν)

To understand these and the following definitions, recall that both a Node ν and a Cursor c of the hierarchy index interface are identical to an order index cursor. However, the order index cursor of a Node is guaranteed to always point to the entry of the lower bound of the node. In other words, O.is-lower(ν) holds for a valid Node ν, but not necessarily for a Cursor. Some of the following definitions rely on this invariant. The helper functions to-lower() and to-upper() are given an order index cursor c to some entry, which may represent either the upper or the lower bound of some node ν, and yield a cursor to the lower or upper bound of ν, respectively:

O.to-lower(c) ≡ if O.is-lower(c) then c else O.entry(H.label(c).lower)
O.to-upper(c) ≡ if O.is-lower(c) then O.entry(H.label(c).upper) else c

Note that all hierarchy index primitives that accept a Node also accept a Cursor. The following definitions, however, expect a Node. If a Cursor ci is actually given, O.to-lower(ci) has to be applied first to make sure it points to a lower bound. Binary predicates mainly rely on before():


H.is-before-pre(ν1, ν2) ≡ O.before(ν1, ν2)
H.is-before-post(ν1, ν2) ≡ O.before(O.to-upper(ν1), O.to-upper(ν2))
H.axis(ν1, ν2) = descendant ≡ O.before(ν2, ν1) ∧ O.before(ν1, O.to-upper(ν2))
H.is-parent(ν1, ν2) ≡ H.axis(ν2, ν1) = descendant ∧ H.depth(ν2) = H.depth(ν1) + 1

We next consider the node properties. As noted above, depth() needs to take into account the level adjustment that applies to the block containing ν's lower bound:

H.depth(ν) ≡ H.label(ν).level + O.adjust-level(ν)

is-root() is based on depth(). A leaf is detected if the lower bound of the node is immediately followed by an upper bound:

H.is-root(ν) ≡ H.depth(ν) = 0
H.is-leaf(ν) ≡ ¬O.is-lower(O.next(ν))

H.pre-next(c) ≡ c ← O.to-lower(c); do c ← O.next(c) until O.is-lower(c); c
H.post-next(c) ≡ c ← O.to-upper(c); do c ← O.next(c) until ¬O.is-lower(c); c
H.next-sibling(c) ≡ O.next(O.to-upper(c))
H.next-following(c) ≡ H.pre-next(O.to-upper(c))

The above discussion omits the size(), rank(), and select() primitives, whose implementation is more involved. Refer to [33] for detailed coverage.
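As a concrete rendering of these definitions, pre-next() and is-leaf() could be written generically over any order index implementation; the method names mirror the interface sketch given in § 4.5.1 and are our own:

    // OI: any order index type providing toLower/next/isLower as sketched
    // earlier; Cursor: its cursor type.
    template <typename OI, typename Cursor>
    Cursor preNext(const OI& O, Cursor c) {
        // Hop to the lower bound, then advance until the next lower
        // bound is reached -- exactly the do/until loop defined above.
        c = O.toLower(c);
        do { c = O.next(c); } while (!O.isLower(c));
        return c;
    }

    template <typename OI, typename Cursor>
    bool isLeaf(const OI& O, Cursor nu) {
        // A leaf's lower bound is immediately followed by an upper bound.
        return !O.isLower(O.next(nu));
    }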

4.5.6 Implementing the Update Primitives

We now discuss how the update primitives of our index interface can be implemented, from a high-level point of view. Our aim is to outline how order indexes achieve their logarithmic update characteristics, without diving into the technicalities of any specific implementation. (These are covered in our publications [32] and [33].) The specific algorithms need to manipulate the low-level blocks and entries making up the ordered data structure. They of course differ much between AO-Tree, BO-Tree, and O-List, but are mostly straightforward adaptations of the standard algorithms of the underlying AVL- and B+-tree data structures. In fact, there is one big advantage the order index adaptations have over the standard algorithms, which allows them to be simpler and more efficient: They never have to perform key searches (which would be O(log N) for balanced trees of size N) in order to locate entries. Rather, the positions of a node ν or range [ν1, ν2] to be relocated or removed, as well as the target position p of an inserted node, are always given a priori: as cursors indicating positions in the sequence of bounds.


Two things complicate matters, however: First, we must take care of the back-links, that is, the links between the labels in the NODE column and their associated order index entries. With block-based order indexes, back-links must be updated whenever entries are migrated from one block to another, and also, in case the pos approach is used, whenever they are moved around within their block during an update. Second, some operations need to adjust the block levels so as to maintain a constant effective level adjustment for any entries that are moved out of their place. Also, order indexes require rebalancing just like the data structures they are based on. The AO-Tree maintains balance through rotations. BO-Tree and O-List split overfull blocks and fill underutilized blocks by taking entries from neighboring blocks or merging two blocks, carefully maintaining the level adjustments in the process.

Implementing the leaf update operations is fairly straightforward: insert-leaf() inserts a lower and an upper bound as adjacent entries into the ordered data structure (using individual insert-before() inserts, which are adaptations of the standard algorithms) and stores a Label comprising the two back-links and an initial level in the corresponding table row.

H.insert-leaf(r, before c) ≡
    l ← H.depth(O.to-lower(c));
    c2 ← O.insert-before(c, r, false);
    c1 ← O.insert-before(c2, r, true);
    l ← l − O.adjust-level(c1);
    T[r].Node ← [c1, c2, l];

remove-leaf() removes the two entries from the data structure and sets the Label in the table row to NULL.

H.remove-leaf(ν) ≡
    r ← H.rowid(ν);
    O.remove(H.label(ν).lower);
    O.remove(H.label(ν).upper);
    T[r].Node ← NULL;

Finally, relocate-leaf() can be done in terms of removal and subsequent reinsertion. Leaf updates have an amortized runtime of O(1) with the AO-Tree and O(B) with the BO-Tree (though occasionally some rebalancing work is necessary). With the O-List, leaf updates have constant O(B) amortized average-case time complexity, but insert-leaf() has a linear worst-case complexity due to its gap-based block keys. To implement range updates, a move-range() update primitive operating on ranges of bounds is useful:

O.move-range([c1, c2], ct, δ) — Moves a range of entries. Cursors c1 and c2 represent the first and last entry of the range to be relocated; cursor ct represents the target before which the range is to be placed; δ is the level difference between the source and the target position. Cursor ct may not lie within [c1, c2].

move-range() is aware of the level adjustment (the value returned by adjust-level()) for the entries in the range [c1, c2]; it adjusts the effective level of each entry in the range by δ. With move-range(), relocating a range is straightforward:


H.relocate-range([ν1, ν2], before ν0) ≡
    c1 ← H.label(ν1).lower;
    c2 ← H.label(ν2).upper;
    ct ← H.label(ν0).lower;
    δ ← H.depth(ν0) − H.depth(ν1);
    O.move-range([c1, c2], ct, δ);

As noted, the target position is simply an order index cursor ct. In the example, the position p is "before ν0", so ct is set to a cursor to the lower bound of ν0. For order indexes based on balanced trees, move-range() can generally be implemented with logarithmic time complexity by relying on two low-level tree operations, split() and join(): The former splits a given tree into two; the latter concatenates two trees into one. Both operations are in O(log N) for balanced trees and can maintain their balance. (Since split() and join() algorithms are rarely discussed in textbooks about AVL- and B+-trees, we point the reader to their thorough treatment in our publication [33].) With the help of these operations, move-range() can first crop out the range [c1, c2] into a separate tree T′; then add the desired level delta δ to the block level of the root block(s) of T′, which effectively adds δ to the levels of all nodes within the range; rejoin the remaining parts of the original tree; rebalance underutilized blocks at the crop boundaries; reinsert T′ at the target position; and finally rebalance again. Note in particular that during move-range() both the number of Label level values and the number of block levels that need to be touched are bounded. The level adjustment mechanism thus enables us to efficiently update the effective levels of all nodes involved in the range relocation.

We have now discussed the leaf update primitives as well as relocate-range(). The algorithm for remove-range() is a straightforward variation of relocate-range(). relocate-subtree() and remove-subtree() can trivially be implemented via relocate-range() and remove-range(). Finally, the inner update primitives can be expressed in terms of the range operations as described in § 4.1.5.

4.6 Building Hierarchies

In § 4.1.3 we pointed out several situations in which hierarchy indexing schemes are constructed or rebuilt from scratch. A major use case is the derived hierarchical table, where rebuilding is potentially necessary whenever a row in the original source table is updated or a new row is inserted. To avoid painful stalls of the database system in these situations, all stages of indexing scheme construction must be as efficient as possible. In this section we cover this topic in detail. We first consider the build(H0) primitive and describe a practical data structure for its input hierarchy representation H0. Based on that, we discuss a largely generic implementation of build(). We then cover the complete process of transforming input data from the adjacency list format into such an intermediate representation H0 in order to implement derived hierarchies.


[Figure content: the example hierarchy (⊤ with children 4 and 3; 4 with child 1; 1 with children 2 and 5; 2 with child 0; only node 4 marked as start node) and its representation:

  i:   C[i]   R[i]   M[i]
  0:    0      2      0
  1:    0      5      0
  2:    2      0      0
  3:    3      1      0
  4:    3      4      1
  5:    4      3      0
  6:    4                 ]

Figure 4.7: The basic format of an Intermediate Hierarchy Representation.

4.6.1 A Practical Intermediate Hierarchy Representation

While the data structure used for H0 in the build(H0) operation could be any of the available full-blown hierarchy indexing schemes, this would often be overkill, since H0 is usually used in a throw-away manner as an intermediate representation for bulk building. Upon its construction, there is in fact only one task it has to support well: performing a full depth-first traversal in O(|H0|) time. We therefore developed a simple and space-efficient special-purpose data structure, the Intermediate Hierarchy Representation (Hierarchy IR), which we found to be very effective in practice. Similar to a full-blown indexing scheme, a Hierarchy IR references the rows of an associated table via row IDs. It arranges these rows in a directed acyclic graph; that is, each row can have one or more parent rows. The IR is thus less restrictive than most of the indexing schemes we explored. This is helpful in particular for input data that does not form a strict tree: We can bring the data into an IR representation first, and then remove non-tree edges to transform it into a strict hierarchy according to a policy specified by the user (see § 3.4.3). At its heart, the data structure is a vector of the row IDs of the hierarchical table, which is sorted and indexed by the respective row IDs of the parent rows of its entries. Let T be the associated table and N := |T| the number of rows in T. We assume that the row IDs of T are assigned densely from 0 to N − 1, although this is not a strict requirement. The representation consists of three vectors C, R, and M. Figure 4.7 shows an example hierarchy and its representation. Here, the table size |T| is 6. The nodes are labeled by the IDs of the associated rows, which range from 0 to 5. The figure also displays the virtual super-root ⊤; it is the parent of all rows that do not otherwise have an associated parent row. The three vectors are defined as follows:

• R is a vector of row IDs of size N. It arranges the row IDs of T by the row IDs of their respective parent rows: first the children of row 0 (none in the example), then the children of row 1 (2 and 5), and so on. Rows without a parent (4, 3) appear at the end. The internal ordering of a group of row IDs with the same parent row is significant: it implicitly represents the sibling order of the hierarchy.

• C is a vector of size N + 1 storing indices of R, that is, it indexes R. Its purpose is to enable us to determine the children of a given row. Its own indices are the row IDs of T: For i < N, C[i] gives us the index of the R entry that indicates the first child of row i, if any. Thus, the children of row i can be found in the range R[C[i], ..., C[i + 1] − 1]. The last entry, C[N], points us to the children of ⊤, that is, all rows without a parent row: they are in the range R[C[N], ..., N − 1]. Note that C is sorted. This is because the row IDs in R are arranged by parent row ID.

• Vector M is optional. It is a bit vector of size N whose indices are the row IDs of T (like C). Its purpose is to mark a subset of T's rows in cases where the hierarchy H0 shall be restricted to only those nodes reachable via a given set of start nodes. (One such case are derived hierarchies with a specified START WHERE clause.) If M[i] is set, row i is a start node. In the example, only 4 is a start node; thus 3, though represented in H0, is not considered to be part of the hierarchy.

A Hierarchy IR can be efficiently constructed from an unordered list of edges, as we will discuss in § 4.6.5.
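In C++ terms, the Hierarchy IR and its child-range lookup might be sketched as follows (C++20 for std::span; the naming is ours):

    #include <cstdint>
    #include <span>
    #include <vector>

    struct HierarchyIR {
        std::vector<std::uint32_t> C;  // size N+1: offsets into R, per parent row
        std::vector<std::uint32_t> R;  // size N:   child row IDs in sibling order
        std::vector<bool>          M;  // size N:   start-node marks

        // Children of row i, in sibling order: R[C[i] .. C[i+1]-1].
        std::span<const std::uint32_t> children(std::uint32_t i) const {
            return {R.data() + C[i], R.data() + C[i + 1]};
        }

        // Children of the super-root (rows without a parent): R[C[N] .. N-1].
        std::span<const std::uint32_t> roots() const {
            auto n = static_cast<std::uint32_t>(R.size());
            return {R.data() + C[n], R.data() + n};
        }
    };

Both lookups are constant-time slices into contiguous memory, which is what makes the O(|H0|) depth-first traversal of the next subsection cheap in practice.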


4.6.2 A Generic build() Algorithm

We now assume a valid Hierarchy IR instance H0 is given, and present an algorithm for the H.build(H0) operation based on a systematic traversal of this data structure. The implementation of build() is the only part of the construction process that is specific to each indexing scheme. That said, the algorithm we present is still largely generic; it can be used for most indexing schemes we investigated. The algorithm performs a depth-first traversal of H0. During the traversal it performs a pre and a post visit of each row: When a node is first encountered, the associated row r is pre-visited, then its descendants are processed, and finally r is post-visited. The indexing schemes only differ in what is done during the pre/post visits, and in what data needs to be tracked with the visited nodes during the traversal. We therefore encapsulate these actions behind four internal operations:

H.initialize(N) — Initiates bulk building. N ≥ 0 is the number of nodes in H0.

H.visit-pre(r) — Pre-visits row r. Precondition: r has not been pre- or post-visited.

ℓ ← H.visit-post(r) — Post-visits row r. Returns the label of the associated node. Precondition: r has been pre-visited previously.

H.finalize() — Finalizes bulk building.

The following protocol involving these operations must be strictly followed: initialize() must be called first (exactly once), then visit-pre() and visit-post() may be called repeatedly, then finalize() must be called (exactly once). No other index operations on H may be called in between, since H will be in a consistent state only after finalize() has been called. Furthermore, the visit() calls must happen in a correctly nested pre/post order, and their arguments r must be row IDs in the range [0, N[. After the post visit of a row r, the Label object of the corresponding node for r becomes available and is returned by visit-post().

(a) The algorithm:

 1 function H.build(H0)
 2   H.initialize()
 3   Cp ← C
 4   S : Stack⟨[rowid, bool]⟩
 5   S.push([|R|, 0])
 6   while ¬S.empty()
 7     [rP, mP] ← S.top()
 8     if Cp[rP] = C[rP + 1]
 9       if mP
10         ℓ ← H.visit-post(rP)
11         yield [rP, ℓ]
12       S.pop()
13       continue
14     i ← Cp[rP]
15     r ← R[i]
16     m ← mP ∨ M[i]
17     if m
18       H.visit-pre(r)
19     S.push([r, m])
20     Cp[rP] ← i + 1
21   H.finalize()

(b) Steps taken for the example Hierarchy IR: not reproduced here.

Figure 4.8: A generic algorithm for build(), performing a depth-first traversal over H0.

Figure 4.8a shows the algorithm for build(). Its input is a Hierarchy IR H0 = (C, R, M). Its overall output is a stream of [row ID, Label] tuples, which provides a Label value for each row ID that has actually been included in H. This information is subsequently used to populate a corresponding NODE column, as we describe in § 4.6.3. As mentioned, the algorithm essentially performs a depth-first traversal of H0. We maintain a stack S to track the current position in the hierarchy. It contains pairs of a row ID and a boolean mark. The sequence of row IDs on S represents the path from ⊤ to the current node. We use the mark to determine whether the current node is reachable from any start node and thus should be added to H. During the traversal, visit-pre() and visit-post() are called for marked nodes and any of their descendants; all other nodes are traversed but otherwise ignored. In the initialization phase (l. 2–5) we first make a mutable copy Cp of C. We use Cp to keep track of the "visited" parts of C.
We begin the traversal with the super root ⊤ by pushing its pseudo row ID N onto the stack. We then continue the following process until the stack becomes empty, which means that H0 has been traversed completely:

• Inspect the row ID rP and the mark mP (“P” stands for “parent”) on the top of the stack (l. 7). Use Cp and C to determine the next child of node rP to visit:

If Cp[rP] = C[rP + 1], then rP either has no children at all, or we have already visited all of its children (l. 8). This means we are done with rP. If mP is set, rP was previously pre-visited; in that case, post-visit rP, obtain the corresponding Label value ℓ, and yield a tuple [rP, ℓ] to the result (l. 9–11). Either way, pop rP from the stack and continue (l. 12–13).

If Cp[rP] < C[rP + 1], then there is at least one more child of rP left to be visited next. That child’s row ID r is stored in R[Cp[rP]].

• To visit row r (l. 14–20): Determine whether r should be included in the index (l. 16–17). This is the case only if it either is a start node by itself (M[i] is set) or if its parent rP has already been pre-visited (mP from the top of S is set). Invoke H.visit-pre(r) only if r is to be included in the index (l. 18). Push r onto the stack (l. 19). This has the effect that r will subsequently be considered as a parent, and its children, if any, will be visited next (in a depth-first manner). Finally, increase Cp[rP] by one (l. 20). Since the "current" child r of rP has just been visited, this makes Cp[rP] point to the next child.

Figure 4.8b shows the traversal steps that are taken for the example input. Initially, S contains ⊤, but the super root itself is not pre-visited. The algorithm first pre-visits the nodes 4, 1, 2, 0 (arrows 1–4 in the figure) and pushes them onto the stack S (not shown). It then determines that 0 is a leaf, as Cp[0] = C[1] = 0. It therefore post-visits 0, pops it from S, and checks 2 again. However, as 2 has only one child, now also Cp[2] = C[3] = 3, so 2 is post-visited and popped as well. At this point Cp[1] = 1, but there is another child of 1 to visit (Cp[1] ≠ C[2]), namely R[Cp[1]] = 5. The algorithm therefore pre-visits 5 (arrow 5). After post-visiting and popping 5, 1, and 4, the algorithm considers 3, as R[Cp[6]] = 3 (arrow 6). However, as 3 is not a start node (M[3] = 0), row ID 3 is pushed and subsequently popped without generating actual pre- or post-visits. Finally, ⊤ is popped but not post-visited. At this point the algorithm has pre- and post-visited all nodes in the following order:

[⊤ [4 [1 [2 [0 0] 2] [5 5] 1] 4] [3 3] ⊤].

Constructing the Indexing Scheme. As mentioned before, most indexing schemes we investigated can be constructed during a single depth-first traversal of the intermediate representation H0. To adapt the generic build() algorithm for these indexing schemes, we only have to suitably define its four internal operations. Let us consider a few example schemes.


(a) Nested Intervals:

 1 function H.initialize(N)
 2   reset counter c ← 0
 3   allocate L[0, . . . , N[
 4 function H.visit-pre(r)
 5   L[r].lower ← c++
 6 function H.visit-post(r)
 7   L[r].upper ← c++
 8   return L[r]

(b) Pre/Post:

 1 function H.initialize(N)
 2   reset pre, post ← 0
 3   allocate L[0, . . . , N[
 4 function H.visit-pre(r)
 5   L[r].pre ← pre++
 6 function H.visit-post(r)
 7   L[r].post ← post++
 8   return L[r]

Figure 4.9: Defining the internal functions for bulk building.

 1 function H.initialize(N)
 2   pre ← 0
 3   post ← 0
 4   allocate pre-to-rowid[0, . . . , N[
 5   allocate post-to-rowid[0, . . . , N[

 1 function H.visit-pre(r)
 2   pre-to-rowid[pre] ← r
 3   ℓP ← L[rP]
 4   L[r].pre ← pre++
 5   L[r].parent-pre ← ℓP.pre
 6   L[r].level ← ℓP.level + 1
 7 function H.visit-post(r)
 8   post-to-rowid[post] ← r
 9   L[r].post ← post++
10   return L[r]

Figure 4.10: Defining the internal functions for bulk building: PPPL.

The simple nested intervals and pre/post encodings can both be constructed straightforwardly by maintaining a global counter c that is incremented on each pre and post visit. Figure 4.9 shows the pseudo code. A table L associates each seen node with its (incomplete) label. On visit-pre, the lower bound of the node is assigned from c to the label. On visit-post, the upper bound is assigned, making the label complete. For the construction of the PPPL scheme we also associate each seen node with its preliminary label. Figure 4.10 shows the pseudo code. We track the current pre-rank and post-rank in global counters. During visit-pre(), we assign the pre-rank to the label. Additionally, we determine the parent node and use it to fill in the parent-pre and level fields. The post-rank is available on visit-post(), making the label complete. The pre-to-rowid[] and post-to-rowid[] indexes are populated on the fly. Order indexes can be constructed during a depth-first traversal by essentially building up the ordered index structure from left to right: H.initialize() pre-allocates memory and prepares the underlying ordered data structure. H.visit-pre(r) appends an entry for the lower bound to the order index, and enters a link to that entry into the preliminary Label. H.visit-post(r) appends an entry for the upper bound and again links to it from the Label.


H.finalize() constructs the remainder of the data structure, such as the inner levels in the case of the BO-Tree. All three implementations fit into the generic bulk-building algorithm and meet its guaranteed O(N log N) worst-case complexity: For the AO-Tree and the BO-Tree, rather than performing one-at-a-time inserts for the entries, we use straightforward adaptations of the common O(N) algorithms for constructing AVL trees and B+-trees sequentially in a bottom-up manner. The B+-tree algorithm first completes the leaf level, then during finalize() proceeds upwards level by level (see [68], Sec. 15.5.3). The O-List algorithm is simpler, as there are no inner blocks. However, it needs a second pass during finalize() to assign evenly-spaced keys to its blocks. This two-pass algorithm is still in O(N).

4.6.3 Implementing Derived Hierarchies

We now discuss how to efficiently evaluate the derived hierarchy construct, whose basic syntax we defined in § 3.4.2 as follows:

HIERARCHY(
    USING ⟨source table⟩ AS ⟨source name⟩
    ⟨derived hierarchy spec⟩
    SET ⟨name⟩
)

A central pre-processing step behind this construct is to transform the relational source table into a Hierarchy IR representation H0 according to the specifications given by the ⟨derived hierarchy spec⟩. This step is independent of the type of the target indexing scheme. Note that our focus is on data that is already in a relational table. That said, it is well conceivable to also support deriving hierarchies from structured text such as XML and JSON. Reading such data in document order is effectively a depth-first traversal, and it is straightforward to generate corresponding visit-pre and visit-post events on the fly. Such a traversal is already all the functionality we require from H0. Regardless of the actual input format that is specified by the ⟨derived hierarchy spec⟩, the general process to evaluate a derived hierarchy construct is as follows:

1. Evaluate ⟨source table⟩ and materialize the result into a temporary table T. If the source table is a subquery or a view, this involves translating the query into an optimized plan with a materialization operator to store the result. All columns the user chooses to select in that query or view will be included in the final hierarchical table.

2. Evaluate the ⟨derived hierarchy spec⟩. (We discuss this for the adjacency list format in § 4.6.4.) The result of this operation is a Hierarchy IR object H0 associated with T.

3. Create and populate the new hierarchy index H from H0 using the H.build(H0) algorithm of § 4.6.2. The output of this operation is a stream of [row ID, Label] tu- ples, associating rows of T with Label values corresponding to the created nodes.

4. Allocate the new NODE column and append it to T. Populate it from the output of the previous operation. For rows that are not associated with a Label, set the NODE value to NULL. T is now the final hierarchical table.


(a) Input table T1:

  rowid   ID    PID    Ord
  0       'A'   'C'    1
  1       'B'   'E'    1
  2       'C'   'B'    1
  3       'D'   NULL   2
  4       'E'   NULL   1
  5       'F'   'B'    2

(b) ⟨derived hierarchy⟩ syntax:

  HIERARCHY(
      USING T1 AS cc
      START WHERE PID IS NULL
      JOIN PARENT pp ON cc.PID = pp.ID
      SEARCH BY Ord
      SET Node
  )

Figure 4.11: Example derived hierarchy for the adjacency list input format.

The final step depends largely on the architecture of the database engine at hand. However, no special logic should be needed: As NODE is a normal column, the usual update mechanisms can be used.

4.6.4 Transforming Adjacency Lists

We now concentrate on how to efficiently implement Step 2 of the described process for the adjacency list format, which is the most common type of input. Let us revisit the syntax of § 3.4.3:

HIERARCHY(
    USING ⟨source table⟩ AS ⟨source name⟩
    [START WHERE ⟨start condition⟩]
    JOIN PRIOR ⟨parent name⟩ ON ⟨join condition⟩
    [SEARCH BY ⟨order⟩]
    SET ⟨node column name⟩
)

Figure 4.11a shows an example source table named T1. Figure 4.11b shows a HIERARCHY clause that derives a hierarchy from T1. The transformation consists of two steps: The first step is to extract an edge list from the source table according to the specifications. The second step is to efficiently transform this edge list into a Hierarchy IR. An important design goal is to reuse existing relational operators for as many aspects as possible in order to leverage their proven and optimized implementations. The detailed process for Step 2 is as follows:

2.1 To determine the edges of the hierarchy, evaluate an outer self-join on T according to the ⟨join condition⟩: T AS C LEFT OUTER JOIN T AS P ON ⟨join condition⟩. The left join input C represents the child node and the right input P the parent node of an edge. We use an outer join so as not to exclude nodes without a parent node. In the absence of a ⟨start condition⟩, these nodes are by default the roots of the hierarchy.


(a) T1 AS C LEFT OUTER JOIN T1 AS P ON C.PID = P.ID:

  rC   C.ID   C.PID   C.Ord   rP     P.ID   P.PID   P.Ord
  0    'A'    'C'     1       2      'C'    'B'     1
  1    'B'    'E'     1       4      'E'    NULL    1
  2    'C'    'B'     1       1      'B'    'E'     1
  3    'D'    NULL    2       NULL   NULL   NULL    NULL
  4    'E'    NULL    1       NULL   NULL   NULL    NULL
  5    'F'    'B'     2       1      'B'    'E'     1

(b) After sorting, marking, and projecting:

  rC   rP     m
  2    1      0
  4    NULL   1
  1    4      0
  0    2      0
  3    NULL   1
  5    1      0

Figure 4.12: Intermediate results during the evaluation of a derived hierarchy.

We include the row IDs rP and rC of both join sides in the result for later use. rP can be NULL due to the outer join. Furthermore, the rC values are not necessarily unique: While we would expect an N : 1 join for an adjacency list that represents a strict hierarchy, this need not be the case with the given data. In our example, this step results in the table of Figure 4.12a.

2.2 If ⟨order⟩ is specified, use a Sort operator to sort the join result accordingly. (If ⟨order⟩ is missing, use an implementation-defined ordering operation to ensure a deterministic order.)

2.3 If a START WHERE clause is specified, use a Map operator to explicitly evaluate the ⟨start condition⟩. This results in a boolean column m (for "start mark"), which marks all rows satisfying the condition. Note that the ⟨start condition⟩ may reference the parent row; this is why we have to evaluate it after the join. In the absence of a START WHERE clause, mark the rows that had no join partner in the outer join.

2.4 Remove all columns except for rP, rC, and m. The result of this step is a stream of marked parent/child pairs (i. e., edges) in the desired sibling order. In our example, after adding column m, sorting the table by C.Ord, and removing the columns C.* and P.*, we obtain the table of Figure 4.12b.

2.5 Construct H0 from the stream of marked edges. We encapsulate this process into a new relational operator, Edge List Transformation, which is described next.

With the exception of the last step (Edge List Transformation), we are only reusing existing query processing operators and facilities. In terms of relational algebra, the process of executing our example HIERARCHY clause could therefore look as follows:

T ⟕ T ⟶ Sort ⟶ Map [m] ⟶ Project [rC, rP, m] ⟶ Edge List Transformation ⟶ H.build(H0) ⟶ Update.


[Figure content: the example hierarchy and the evolution of the edge list E and the auxiliary count/offset vectors across Steps 1–4 of the edge list transformation.]

Figure 4.13: Executing the bulk-building operator on an example hierarchy.

4.6.5 Transforming Edge Lists to Hierarchy IR

The Edge List Transformation operator is the final step in transforming a table in the adjacency list format into a Hierarchy IR. It converts an unordered edge list (consisting of parent/child row ID pairs) into the ordered and indexed Hierarchy IR representation. As such, it has applications beyond the adjacency list transformation as well. The operator consumes a stream of [rC, rP, m] tuples. Their order is significant: it implicitly reflects the intended sibling order. The operator's output is the corresponding Hierarchy IR representation H0. In the process, it also identifies non-tree edges, so that the system can detect early whether the data does not represent a strict hierarchy. The operator is designed for efficiency: it runs in O(N) time, where N is the size of its input stream, that is, the number of edges in the hierarchy. Figure 4.13 illustrates the steps of the algorithm. To the left, our example hierarchy is shown again. Figure 4.14 shows the complete pseudo code, which proceeds as follows:

2.5.1 Materialize all edges into a vector E. In the process, track the highest row ID rmax encountered in the input, and set r⊤ to rmax + 1. r⊤ is a "virtual" row ID we assign to the super root ⊤. In the figure, the maximum row ID is 5, so r⊤ = 6. The purpose of the following two steps is to collect the information needed to sort E by rP as efficiently as possible. Note that since ⊤ has been assigned the highest row ID, edges without a parent will be placed at the back.

2.5.2 Count the number of children each node has. In the process, identify multi-parent nodes to ensure there are no non-tree edges. To this end, allocate a vector C of size r⊤ + 1. Each entry C[i] stores the child count of the node with row ID i. Nodes without a parent (rP is NULL) are counted in the last slot C[r⊤]. Additionally, allocate a bitvector B of size r⊤, initially set to false, to mark encountered rC values. Once an rC value is encountered more than once (i. e., B[rC] is found to be already set when it is encountered), a non-tree edge rP → rC is identified.


This case can be handled according to a policy specified by the user: by simply omitting the edge, or by raising an error and aborting the entire process. After this step, B can be disposed of.

In the example, node 1 has two children, so C[1] = 2. There are two nodes, 3 and 4, without a parent, so C[r⊤] = 2. The figure omits the bitvector B for brevity.

2.5.3 Compute the prefix sums over the array of counts C:

CΣ[k] := ∑ i=0, ..., k−1 C[i].

Note that the sums "lag behind" by one element, so CΣ[1] = 0; the count C[1] = 2 first contributes to CΣ[2]. As C is subsequently no longer needed, the computation can happen in place to increase efficiency: Iterate over C while maintaining a running prefix sum of the counts, and store the prefix sums within C itself.

2.5.4 With the help of C (alias CΣ), sort vector E by rP, producing the two sorted vectors R (from E.rC) and M (from E.m). (The information of E.rP is no longer needed.)

To do the sorting, first make a copy C′ of C, since we need to modify it in the process but still need the original state of C for the Hierarchy IR. Then iterate over E; for each tuple [rC, rP, m], the target position j in R and M is given by C′[rP]. Increment C′[rP] after moving each tuple. In the example, node 2 is sorted into R[0] as C′[1] = 0. Then C′[1] is incremented to 1, which results in node 5 subsequently being sorted into R[1].

After the algorithm, CΣ and the sorted R and M vectors constitute the Hierarchy IR H0 and can be passed to the H.build(H0) algorithm. Note that the final step is a "perfect" bucket sort: Thanks to the preprocessing steps, the algorithm can directly determine the final target position of each tuple. Its asymptotic worst-case time complexity is O(N), whereas common comparison-based sort algorithms are in O(N log N). Only five simple operations per tuple are executed, making this step extremely fast. Furthermore, the sort is stable; that is, the relative order among rows with identical rP values is preserved. In the example, node 2 is guaranteed to remain before node 5. This is important because otherwise the desired sibling order (as indicated via SEARCH BY) would be destroyed.
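For reference, the counting-sort core of Steps 2.5.2–2.5.4 can be condensed into a few lines of C++. This sketch deviates slightly from Figure 4.14 in that it uses one extra offset slot so that the end of the super-root bucket is explicit, and it omits the non-tree-edge check via the bitvector B; names and types are our own:

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct MarkedEdge { std::optional<std::uint32_t> rP; std::uint32_t rC; bool m; };

    // Transforms the edge list E (row IDs dense in [0, N[) into C, R, M.
    void edgeListToIR(const std::vector<MarkedEdge>& E, std::uint32_t N,
                      std::vector<std::uint32_t>& C,
                      std::vector<std::uint32_t>& R, std::vector<bool>& M) {
        C.assign(N + 2, 0);                    // slot N is the super-root bucket
        for (const MarkedEdge& e : E)          // Step 2.5.2: count children
            ++C[e.rP.value_or(N)];
        std::uint32_t sum = 0;                 // Step 2.5.3: exclusive prefix sums
        for (std::uint32_t i = 0; i < N + 2; ++i) {
            std::uint32_t c = C[i]; C[i] = sum; sum += c;
        }
        R.resize(E.size()); M.resize(E.size());
        std::vector<std::uint32_t> Cp(C);      // Step 2.5.4: stable bucket scatter
        for (const MarkedEdge& e : E) {
            std::uint32_t j = Cp[e.rP.value_or(N)]++;
            R[j] = e.rC; M[j] = e.m;
        }
    }

Because the scatter loop visits E in input order and bucket positions only grow, the sort is stable by construction, which preserves the SEARCH BY sibling order as required above.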

Late Sorting. When a SEARCH BY term is specified, the transformation process as described performs a complete Sort before executing the bulk-build. This is one of the most expensive steps of the whole process. An alternative approach is to defer sorting until after the bucket sort. To enable this, all fields appearing in the SEARCH BY term, rather than just rC and rP, must be included in the edge list E, which slows down the bucket sort due to larger tuples.


// Step 1
 1 materialize input into E : {[rC, rP, m]}
 2 r⊤ ← 1 + max rC or rP in E
// Step 2
 3 C[0, . . . , r⊤] ← 0
 4 B[0, . . . , r⊤[ ← false
 5 for [rC, rP, ·] ∈ E
 6   i ← (rP = NULL ? r⊤ : rP)
 7   C[i] ← C[i] + 1
 8   if B[rC] ∨ (rP = rC)
 9     non-tree edge rP → rC found
10   B[rC] ← true
// Step 3
 1 Σ ← 0
 2 for i ∈ ⟨0, . . . , r⊤⟩
 3   c ← C[i]
 4   C[i] ← Σ
 5   Σ ← Σ + c
// Step 4
 6 C′ ← C
 7 allocate R[0, . . . , r⊤[
 8 allocate M[0, . . . , r⊤[
 9 for [rC, rP, m] ∈ E
10   i ← (rP = NULL ? r⊤ : rP)
11   j ← C′[i]++
12   R[j] ← rC
13   M[j] ← m
14 return C, R, M

Figure 4.14: Pseudo code for the Edge List Transformation operator.

The advantage is that not all rows but only the rows within each bucket then have to be sorted by the SEARCH BY columns, which should speed up the sorting step considerably. On the other hand, the original approach is appealing in that the implementation of the Edge List Transformation operator remains simple and compact. Furthermore, since the system’s available Sort operator is reused for the SEARCH BY sorting step, we can expect it to be highly optimized and potentially parallelized. We therefore recommend enabling the late-sorting option only if the Sort step turns out to be the bottleneck of the transformation process, and sticking to the original approach otherwise.

Iterative Approach. The approach we discussed so far includes the entire hierarchy in the Hierarchy IR and performs a complete traversal of H′, regardless of whether a potentially selective START WHERE clause is present. This way, the full work would be done even if—in an extreme case—only a few leaf nodes qualify as start nodes. A different approach to the adjacency list transformation is to iteratively descend from the qualifying start nodes, thus “discovering,” processing, and traversing only the nodes that are reachable from the start nodes:

1. Evaluate and materialize T (as is), and determine the start rows R0.

2. Iteratively use the join condition to select the next level of rows from T, initially starting from rows in R0, in order to eventually enumerate all (and only) the reachable nodes. Output corresponding edges during each iteration.


3. Convert the edge list into a Hierarchy IR using the Edge List Transformation operation as is.

However, as our experiments indicate (§ 6), a recursive join can be very expensive compared to an ordinary join, so the recursive variant should only be chosen if it is very certain that the final hierarchy H (spanned by the start rows R0) is very small in comparison to the full intermediate hierarchy H′ we would otherwise construct. Unfortunately, the final size of H is not easy to predict, since it is not related to the number of start nodes |R0|: Suppose, for example, R0 contains only a single row r0, so a naïve query optimizer might choose the recursive algorithm. If r0, however, happens to be the only root of H, then |H| = |H′| and the optimizer’s choice would be very bad. We therefore do not suggest using the recursive algorithm, unless derived hierarchies with exceptionally selective START WHERE clauses become a bottleneck.

4.7 Handling SQL Data Manipulation Statements

Our extensions for data manipulation (§ 3.5) realize updates to the hierarchy structure via INSERT, UPDATE, and DELETE statements that provide appropriate node anchors as values for the NODE field. The anchors may be ABOVE(ℓ), BELOW(ℓ), BEFORE(ℓ), or BEHIND(ℓ). The special value NULL excludes a row from the hierarchy. The special value DEFAULT is automatically translated to a specific anchor (e. g., “BEHIND the last root”). The following summarizes how the different DML statements affect the hierarchical table T on the one hand, and the associated indexing scheme H on the other hand. The NODE field is assumed to be named Node.

• INSERT row. Insert a new row r into T, with the Node field initially set to NULL. If the insert sets Node to NULL, the hierarchy H is not affected. If the anchor is ABOVE, insert an inner node using H.insert-inner(r, [ν′, ν′]), where ν′ is the target node of the anchor. Otherwise, insert a leaf using H.insert-leaf(r, p) with a corresponding position p. (These primitives populate the label stored in Node accordingly.)

• UPDATE row r. Update the fields of r as usual. Obtain a Node object ν for the label stored in the Node field of row r, and check whether ν is a leaf. If the update sets Node to NULL, invoke H.remove-leaf(ν) if ν is a leaf, otherwise H.remove-inner(ν). If the update sets Node to some anchor, translate it into a suitable target position p and invoke H.relocate-leaf(ν, p) or H.relocate-subtree(ν, p), depending on whether ν is a leaf.

• DELETE row r. Obtain the Node object ν for the label stored in the Node field of row r. Invoke H.remove-inner(ν) or H.remove-leaf(ν) depending on H.is-leaf(ν), then delete row r.

On an INSERT, the row needs to be physically created first, whereas on a DELETE, the corresponding index call needs to happen before physically removing the row, since the row’s label may need to be accessed to execute the remove operation.
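For illustration, a handler for an UPDATE that changes the Node field could look as follows. The HierarchyIndex type and its method names merely mirror the primitives described above (with stubbed bodies), and the translation of the SQL anchor into a target position p is assumed to have happened already.

    #include <optional>

    struct Node {};                        // stand-ins for the index concepts
    struct Position {};

    struct HierarchyIndex {                // simplified, stubbed interface
        Node node(long /*label*/) { return {}; }
        bool isLeaf(Node) { return true; }
        void removeLeaf(Node) {}
        void removeInner(Node) {}
        void relocateLeaf(Node, Position) {}
        void relocateSubtree(Node, Position) {}
    };

    // target == nullopt models "SET Node = NULL": the row leaves the hierarchy.
    void onUpdateNode(HierarchyIndex& h, long label,
                      std::optional<Position> target) {
        Node v = h.node(label);            // obtain the Node via the row's label
        if (!target) {                     // detach: leaf vs. inner node
            if (h.isLeaf(v)) h.removeLeaf(v); else h.removeInner(v);
        } else {                           // relocate within the hierarchy
            if (h.isLeaf(v)) h.relocateLeaf(v, *target);
            else             h.relocateSubtree(v, *target);
        }
    }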


In addition to these constraints, the system needs to keep the node–row associations up to date at all times by carefully issuing all pending relink() calls before invoking any index primitives, as discussed in § 4.1.3. (Recall that update primitives may access arbitrary labels, so there must not be any stale row ID references at any time.) The circumstances in which relink() calls are needed are system-dependent. Some systems guarantee stable row IDs, in which case no relinking is needed at all. Invalidation issues are another intricacy the system has to take into account. Calling an index primitive may not only change the Label values in the (original) NODE column, it may also invalidate any existing temporary Node or Cursor objects. Only the Label objects in the original NODE column are guaranteed to be kept up to date. Therefore, the Node arguments needed for the update primitives should be obtained through the row ID immediately before the respective primitives are called. The fact that Label values get invalidated by updates also means that if any such values are present in scratch variables of running SQL scripts, or in the system cache (e. g., materialized views containing a NODE field), those items have to be invalidated.

Outlook: Handling Concurrent Transactions. As indexing schemes in general store non-redundant information and are closely tied to tables via direct row–node associations, some care is needed to realize SQL transactions in an ACID-compliant way. Essentially, when one transaction modifies the hierarchy, other transactions must be prevented from seeing these updates before the writing transaction commits. At the same time, it must be possible to revert to the original hierarchy state in case of an abort. With the exception of DeltaNI, the indexing schemes we discussed capture only a single, namely the most recent, state of the hierarchy. This makes it hard to allow for concurrent access with strong isolation guarantees. Furthermore, the protocols to handle such concurrency and isolation issues differ highly from system to system; a custom solution essentially has to be crafted for every database engine. We can thus only give a brief outlook of conceivable approaches:

• A trivial approach is to abort any transaction that accesses, or has accessed, a hierarchy index that is being modified concurrently by a writing transaction.

• Another option is to use a second, static indexing scheme to create a read-only snapshot of the original hierarchy state as soon as a transaction attempts to modify the hierarchy. This allows older read-only transactions to access the original state, and in case the modifying transaction gets aborted, the changes can be undone by bulk-loading the original (dynamic) index from the snapshot.

• Another viable approach is to use indexes with multi-versioning capabilities (such as DeltaNI) to capture the relevant recent history of the hierarchy structure. Transactions can then use time-traveling to access the version they are supposed to see. This also integrates well with systems based on multi-version concurrency control (MVCC, [3]), where updates create new versions of the involved rows and retain the old versions in their original location. MVCC essentially adds a temporal dimension to the table, and a temporal hierarchy index can natively


capture that. The downside is that multi-versioning complicates the index logic significantly and incurs considerable performance penalties. Then again, one gets support for system-time temporal hierarchical tables at no extra cost.

• An interesting approach inspired by the XQuery Update Facility (XUF, [112]) is to defer the application of structural updates to the hierarchy until commit time. Rather than executing the updates to the NODE field and the index immediately, they are collected in an edit script (“pending update lists” in XUF). When the transaction commits, the complete edit script is checked for conflicts—such as the same node being inserted, relocated, or removed more than once—and atomically applied to the index. This is also a convenient time to normalize and group the operations: for example, individual node operations can be rewritten to more expressive subtree and range operations in order to reduce the total number of operations and to leverage the update capabilities of more sophisticated indexing schemes. Unfortunately, this approach significantly impacts the programming style, as structural updates remain invisible even to the updating transaction prior to the commit, so the user cannot build on prior changes when making a series of changes. However, the key advantage of this approach is that the hierarchy index remains unchanged during the whole transaction, which elegantly circumvents many concurrency issues.
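A minimal sketch of such an edit script follows; the Edit representation, the simplistic “no node may be touched twice” conflict rule, and the generic apply() entry point are all assumptions of this example.

    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    enum class Op { Insert, Relocate, Remove };
    struct Edit { Op op; long node; long target; };   // labels of affected nodes

    class EditScript {
        std::vector<Edit> pending;                    // XUF-style pending updates
    public:
        // Structural updates stay invisible until commit; we only record them.
        void record(const Edit& e) { pending.push_back(e); }

        template <typename Index>
        void commit(Index& index) {
            // Conflict check: the same node must not be touched more than once.
            for (std::size_t i = 0; i < pending.size(); ++i)
                for (std::size_t j = i + 1; j < pending.size(); ++j)
                    if (pending[i].node == pending[j].node)
                        throw std::runtime_error("conflicting structural updates");
            for (const Edit& e : pending)             // atomic application point
                index.apply(e);
            pending.clear();
        }
    };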

A further discussion of these approaches is beyond the scope of this thesis, especially since they are highly system-dependent. However, we do consider them an interesting topic for future research.


5 Query Processing on Hierarchical Data

This chapter covers the process of translating the query language extensions introduced in § 3 into executable query plans. We first discuss which operators are needed on the relational algebra level (§ 5.1). Most of our SQL extensions take the form of functions operating on the abstract NODE data type and as such do not actually extend the syntax and semantics of SQL. Thus, it comes as no surprise that we can rely mostly on existing relational algebra operators in their translation. On the physical algebra level (§ 5.2), the key to efficient query evaluation is a set of special-purpose hierarchy operators. Unlike many existing works (§ 5.3), our algorithms are not tied to specific types of labeling schemes. They rely entirely on the abstract Label, Node, and Cursor concepts and the query primitives of the interface presented in § 4.1.2. This way, a lot of complexity is hidden, and query processing can use the same algorithms irrespective of the underlying indexing scheme.

(The material in this chapter on processing structural grouping queries has previously been published in [10]. High-level discussions of query processing on hierarchical tables have also appeared in [9] and [33]. This chapter extends these discussions and also includes additional, novel material.)

5.1 Logical Algebra Operators

We first investigate the translation of queries against hierarchical tables on the relational algebra level. Part of the design goals for our language extensions was to be able to evaluate most constructs without having to introduce many new relational operators. Thanks to this, creating (logical) query plans is fairly straightforward and can rely on well-known techniques. Most of our query language constructs result in basic Map (§ 5.1.3) and Join (§ 5.1.4) operations. For evaluating hierarchical windows we reuse the concept of Groupjoin (§ 5.1.5). Only for hierarchical windows with recursive expressions is a new operator needed: Structural Grouping (§ 5.1.6). For each operator, we introduce our notation and discuss its uses by the example of simple SQL queries.

5.1.1 Preliminaries

Although we assume the reader to be fairly familiar with the general concepts of relational algebra, we first clarify some of the less common notations. Generally, we follow the style used by Moerkotte et al. (e. g., [74]). A relation is a bag (multiset) of tuples. We denote a set by {…} and a bag by {…}b. For an expression e that evaluates to a tuple or a relation, A(e) gives us the set of attributes provided by the tuple or relation, and F(e) gives us the set of free (unbound) attributes in the expression. We use ⊥τ as a short-hand for a tuple of type τ that consists only of NULL values.

The operator ∘ is overloaded in different contexts: For tuples t1 : τ1 and t2 : τ2, the notation t1 ∘ t2 means tuple concatenation. When applied to tuple types rather than concrete tuples (e. g., τ1 ∘ τ2), ∘ yields the type of the concatenated tuple. When applied to sequences, it yields the concatenated sequence. Given a bag-valued expression e (i. e., the expression produces a bag of elements) and a variable name t, the notation e[t] binds the elements to the variable; formally, e[t] := {[t : x] | x ∈ e}b.

We use a number of well-known relational algebra operators: selection σ, projection Π, duplicate-eliminating projection ΠD, map χ, the common join operations—inner join ⋈, the outer joins ⟕, ⟖, ⟗, the semi joins ⋉, ⋊, and the anti joins ▷, ◁—as well as their dependent join counterparts. The following conventions apply for the remainder of this chapter: We assume T is a hierarchical table containing a label attribute named ℓ with the underlying hierarchy index H; formally, T is of type {τT}b, where τT is some tuple type that has a field ℓ : LabelH. By convention, the symbols e, e1, e2, et cetera denote arbitrary input expressions producing some relations. Symbols ν, ν1, ν2, et cetera denote names of NodeH attributes. Symbols τ, τi, et cetera denote tuple types.

5.1.2 Expression-level Issues and Normalization

On the SQL level, our set of hierarchy functions and predicates is intentionally somewhat redundant. The indexing scheme interface, however, exposes a more minimal set of primitives. While the SQL-level functions and predicates operate on NODE values of data type LabelH, the index primitives operate only on NodeH and CursorH objects. This distinction between Label and Node/Cursor is invisible to the user. To account for the differences, some transformation and normalization has to take place as a first step during query compilation.

Canonicalization of Hierarchy Functions and Predicates. A Label ℓ can occur in the SQL statement either in a hierarchy function, a hierarchy predicate, or a hierarchical window specification. For each distinct Label field ℓ used, we have to make the corresponding Node ν available. The easiest way to achieve this is to generate explicit calls to an internal TO_NODE conversion function whenever a label is accessed:

    FUNC(ℓ)           ⇝  FUNC(TO_NODE(ℓ))
    PRED(ℓ1, ℓ2)      ⇝  PRED(TO_NODE(ℓ1), TO_NODE(ℓ2))
    ℓ1 = ℓ2           ⇝  TO_NODE(ℓ1) = TO_NODE(ℓ2)
    HIERARCHIZE BY ℓ  ⇝  HIERARCHIZE BY TO_NODE(ℓ)

The TO_NODE function maps directly onto a node() index call. This first rewriting step can already be done during parsing with minimal implementation effort. Alternatively, the same could be achieved during the subsequent semantic analysis step, at the time the ℓi references are resolved.


Let us next consider the translation of SQL hierarchy functions FUNC(ℓ) or FUNC(ν) to corresponding index invocations. This is straightforward:

    TO_NODE(ℓ)    ⇝  H.node(ℓ)           DEGREE(ν)     ⇝  H.degree(ν)
    IS_LEAF(ν)    ⇝  H.is-leaf(ν)        HEIGHT(ν)     ⇝  H.height(ν)
    IS_ROOT(ν)    ⇝  H.is-root(ν)        PRE_RANK(ν)   ⇝  H.pre-rank(ν)
    DEPTH(ν)      ⇝  H.depth(ν)          POST_RANK(ν)  ⇝  H.post-rank(ν)
    SIZE(ν)       ⇝  H.size(ν)

NULL values simply propagate through the functions; for example, TO_NODE(NULL) is NULL, and DEPTH(NULL) is still NULL. However, the index primitives internally expect valid Label, Node, or Cursor argument objects, which must not be NULL. In situations where they could possibly be null, the query compiler has to make sure this is handled by generating appropriate checks. In terms of SQL, it conceptually has to rewrite the function invocation to “CASE WHEN ν IS NULL THEN NULL ELSE FUNC(ν) END.”
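Conceptually, the generated check is just a null-propagating wrapper around the primitive. The following small C++ template is a hypothetical illustration of that rewrite, not part of the actual interface:

    #include <optional>

    // Mirrors CASE WHEN v IS NULL THEN NULL ELSE FUNC(v) END: a NULL
    // argument short-circuits to NULL instead of reaching the primitive.
    template <typename F, typename Node>
    auto nullSafe(F primitive, const std::optional<Node>& v)
        -> std::optional<decltype(primitive(*v))> {
        if (!v) return std::nullopt;
        return primitive(*v);
    }

For example, DEPTH(ν) on a nullable ν would be evaluated along the lines of nullSafe([&](auto n) { return H.depth(n); }, ν).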

Let us now consider the binary hierarchy predicates PRED(ν1, ν2). To handle them effectively during query optimization, we eliminate “redundant” predicates by rewriting all predicates to uniform axis checks:

    IS_ANCESTOR(ν1, ν2)            ⇝  H.axis(ν1, ν2) ∈ {ancestor}
    IS_ANCESTOR_OR_SELF(ν1, ν2)    ⇝  H.axis(ν1, ν2) ∈ {ancestor, self}
    IS_DESCENDANT(ν1, ν2)          ⇝  H.axis(ν1, ν2) ∈ {descendant}
    IS_DESCENDANT_OR_SELF(ν1, ν2)  ⇝  H.axis(ν1, ν2) ∈ {self, descendant}
    IS_PRECEDING(ν1, ν2)           ⇝  H.axis(ν1, ν2) ∈ {preceding}
    IS_FOLLOWING(ν1, ν2)           ⇝  H.axis(ν1, ν2) ∈ {following}
    ν1 = ν2                        ⇝  H.axis(ν1, ν2) ∈ {self}
    ν1 <> ν2                       ⇝  H.axis(ν1, ν2) ∉ {self}

Recall from § 4.1.2 that two given nodes can be on exactly one of five axes:

    Axes := {preceding, ancestor, self, descendant, following}.

The parent, child, and sibling axes are not part of the basic five axes. Nodes on the parent and child axes are subsets of the nodes on the ancestor and descendant axes, respectively. Nodes on the sibling axis are a subset of the nodes on the {preceding, self, following} axes. These three axes need special treatment:

    IS_PARENT(ν1, ν2)   ⇝  H.is-parent(ν1, ν2)
    IS_CHILD(ν1, ν2)    ⇝  H.is-parent(ν2, ν1)
    IS_SIBLING(ν1, ν2)  ⇝  H.is-sibling(ν1, ν2)

Again, when one of the arguments can potentially be NULL, the query compiler has to add NULL checks. In terms of SQL, this precaution conceptually looks as follows:

CASE WHEN ν1 IS NULL OR ν2 IS NULL THEN UNKNOWN ELSE PRED(ν1, ν2) END


The representation of SQL-level predicates as axis() calls based on sets of axes is useful to reason about complex compound predicates. Two operations on axis sets are helpful: First, given an axis set A ⊆ Axes, we can obtain the inverse set Ā := Axes \ A. Second, we can “reflect” a set of axes: reflect(A) := {reflect(a) | a ∈ A}, where reflect(a) maps preceding ↦ following, ancestor ↦ descendant, self ↦ self, descendant ↦ ancestor, and following ↦ preceding. With these helpers we can state a set of rules for transforming a complex boolean expression of axis checks:

    H.axis(ν1, ν2) ∈ A1 ∧ H.axis(ν1, ν2) ∈ A2   ⇝  H.axis(ν1, ν2) ∈ A1 ∩ A2
    H.axis(ν1, ν2) ∈ A1 ∨ H.axis(ν1, ν2) ∈ A2   ⇝  H.axis(ν1, ν2) ∈ A1 ∪ A2
    H.axis(ν1, ν2) ∉ A                          ⇝  H.axis(ν1, ν2) ∈ Ā
    H.axis(ν1, ν2) ∈ A                          ⇝  H.axis(ν2, ν1) ∈ reflect(A)
    H.axis(ν1, ν2) ∈ Axes                       ⇝  true
    H.axis(ν1, ν2) ∈ {}                         ⇝  false

The goal of the transformations is to simplify a complex expression, potentially consisting of multiple conjunctive or disjunctive predicates, into a single non-negated term “H.axis(ν1, ν2) ∈ A,” where A is as small as possible. For example, from the SQL expression “IS_ANCESTOR(ν1, ν2) OR (ν1 = ν2)” the rules allow us to obtain

    H.axis(ν1, ν2) ∈ {ancestor, self}   or   H.axis(ν2, ν1) ∈ {self, descendant}.

Either one may be the most convenient choice for the evaluation of the overall query. Altogether, the outlined transformations allow us to rewrite an arbitrarily complex boolean expression of SQL hierarchy predicates into a potentially simpler expression consisting only of axis(), is-parent(), or is-sibling() checks.
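To illustrate how an optimizer might mechanize these rules, the following self-contained C++ sketch represents axis sets as 5-bit masks; the enum names and helpers are assumptions of this example.

    #include <cstdint>

    enum Axis : uint8_t {
        Preceding = 1 << 0, Ancestor = 1 << 1, Self = 1 << 2,
        Descendant = 1 << 3, Following = 1 << 4,
    };
    constexpr uint8_t AllAxes = 0b11111;

    // Conjunction/disjunction of two checks on the same (v1, v2) pair.
    constexpr uint8_t intersect(uint8_t a1, uint8_t a2) { return a1 & a2; }
    constexpr uint8_t unite(uint8_t a1, uint8_t a2) { return a1 | a2; }
    // Negation: axis(v1, v2) NOT IN A  ==  axis(v1, v2) IN complement(A).
    constexpr uint8_t complement(uint8_t a) { return AllAxes & ~a; }
    // Argument swap: axis(v1, v2) IN A  ==  axis(v2, v1) IN reflect(A).
    constexpr uint8_t reflect(uint8_t a) {
        uint8_t r = a & Self;                  // self maps to itself
        if (a & Preceding)  r |= Following;
        if (a & Following)  r |= Preceding;
        if (a & Ancestor)   r |= Descendant;
        if (a & Descendant) r |= Ancestor;
        return r;
    }

    // IS_ANCESTOR(v1, v2) OR v1 = v2 simplifies exactly as in the text above:
    static_assert(unite(Ancestor, Self) == (Ancestor | Self), "");
    static_assert(reflect(Ancestor | Self) == (Descendant | Self), "");
    static_assert(intersect(AllAxes, Self) == Self, "");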

Optimizations on Expression Level. Since invocations of index functions and predicates have a non-negligible cost for most hierarchy index types, a goal of the query optimizer is to minimize the overall number of such calls. For node() and the unary functions and predicates, ideally at most one call should be issued for every distinct Label field actually accessed. To achieve this, a conventional analysis and elimination of common subexpressions first helps us to eliminate any redundant expressions (such as multiple TO_NODE() calls) in a query block. After translating the query into an internal algebra-based representation, there will be a Map operator (see § 5.1.3) that performs the calls. During query optimization, the query optimizer may reorder this operator in two steps to potentially eliminate it: The first step is to push it down as far as possible. In the process, another Map operator for the same call may be found and eliminated. When a Hierarchy Index Scan or a Hierarchy Index Join is encountered at a leaf of the plan, it may be able to handle the call implicitly and thus eliminate the Map. In particular, node() can always be eliminated, since HIS and HIJ can produce the Node objects directly. In case the call cannot be eliminated, a second step is to reduce the number of calls by pulling the operator up again (across selections), towards the operator that actually consumes the results.


In § 4.1.4 we discussed several relevant properties and equivalences that apply to the hierarchy functions and predicates. These are useful for query optimization in two ways: First, to eliminate redundant terms in logical expressions and thus simplify their evaluation. Second, to derive and introduce further predicates from existing ones, which are subsequently moved around (pushed down or pulled up) in the query plan in order to reduce result sizes early. For example, the equivalences for the unary predicates is-root() and is-leaf() can be applied in both directions: For a set of is-leaf()-filtered nodes, there is no need to explicitly compute degree(), size(), or height(). Vice versa, if we can infer that the depth of a node variable is always 1 or always > 1, we can omit an explicit is-root() check. As another example, if a join algorithm does not support an is-parent() join directly, the equivalence to an ancestor join with a depth() filter gives us an alternative way to evaluate it. These optimization techniques are used in advanced SQL query compilers and are also known under the terms predicate inference and subsumption. As this is a non-trivial, orthogonal topic, we point the reader to [14, 59, 77] and the overview in [85].

5.1.3 Map

In § 5.1.2 we examined the translation of SQL hierarchy functions and predicates into internal calls of index primitives. After this step, the index invocations can in principle be executed explicitly in the query plans using the Map operator χ. This basic operator produces new attributes ai by applying given functions fi : τ → τi to each tuple in the input e, for ai ∉ A(e). It is commonly defined as

    χ_{a1 : f1, …, an : fn}(e) := {t ∘ [a1 : f1(t), …, an : fn(t)] | t ∈ e}b.

Example I. The basic query “SELECT *, DEPTH(Node) FROM T” is initially translated to

    e1 ← χ_{ν : H.node(Node)}(T);   e2 ← χ_{depth : H.depth(ν)}(e1).

Note the first invocation of node(): As described in § 5.1.2, whenever a function or predicate is (explicitly) evaluated, the plan will also contain a node() call to first obtain Node objects from the labels stored in the NODE field.

Example II. As the index interface supports explicit axis() checks, we can even handle hierarchy joins naïvely via Map. The query

_ SELECT * FROM( e1) u JOIN( e2) vONIS DESCENDANT(v.Node, u.Node)

can be translated into σ_{a = descendant}(χ_{a : H.axis(v.ν2, u.ν1)}(e1[u] × e2[v])). This assumes the ν1 and ν2 fields have already been obtained via node() calls. The plan first creates the cross product of the inputs, then uses χ to perform an axis check on each pair, and σ to filter pairs on the descendant axis.


Example III. The query “SELECT * FROM T ORDER BY PRE_RANK(Node)” is translated to

    e1 ← χ_{ν : H.node(Node)}(T);   e2 ← χ_{r : H.pre-rank(ν)}(e1);   e3 ← Sort [r](e2).

Remarks. The examples attest that Map alone would in theory be sufficient to naïvely translate almost all of our language extensions, with the exception of HIERARCHIZE BY with recursive expressions. However, in the generated code Map results in explicit per-tuple index calls. Plans relying on Map alone would therefore rarely deliver acceptable performance. Whenever possible, we want to avoid Map or optimize it “away” in favor of the following operators:

• Hierarchy Index Scan (§ 5.2.1). When the plan works directly on the hierarchical table, as in Examples I and III above, this operator can be used (on the physical algebra level) instead of an ordinary table scan and Map. It can sometimes obviate the need to explicitly evaluate certain functions or predicates by enumerating the desired nodes directly.

• Join (§ 5.1.4). When a binary predicate is evaluated for the purpose of a join, as in Example II above, an appropriate join operator should be chosen.

• Sort (§ 5.2.11). When PRE_RANK or POST_RANK appear only in the ORDER BY clause—which is their main use case—then there is no actual need for the expensive computation of the absolute ranks. For sorting purposes, a pairwise comparison via is-before-pre() and is-before-post() is sufficient. In the case of Example III:

    e1 ← χ_{ν : H.node(Node)}(T);   e2 ← Sort [ν; is-before-pre](e1).

Even better, the mentioned Hierarchy Index Scan can sometimes be used to enumerate the tuples in preorder or postorder directly.

Cases where Map is still needed are when the result values are explicitly requested by the query, when none of the mentioned alternatives are applicable, or when the plans that use the alternative hierarchy operators are not globally optimal.

5.1.4 Join

The join operation e1 ⋈θ e2 is simply defined as σθ(e1 × e2). It comes in several well-known variants (the outer, semi, and anti joins ⟕, ⟖, ⟗, ⋉, ⋊, ▷, ◁), but we do not repeat their definitions for brevity. In our context we are only interested in joins where θ is a hierarchy join predicate. A hierarchy join predicate is a boolean expression that relates two tuples of different relations and features one or more invocations of hierarchy index primitives. Following the normalization step of § 5.1.2, all remaining hierarchy join predicates are of the form

    H.axis(ν1, ν2) ∈ A  for  A ⊆ Axes,   or   H.is-parent(ν1, ν2),   or   H.is-sibling(ν1, ν2).


Such predicates can appear as join conditions or filter conditions. A conventional query optimizer that does not understand hierarchy join predicates would produce plans based on ×, χ, and σ, as shown in Example I of § 5.1.3. These plans result in nested loops with an explicit evaluation of the join condition, which is infeasible for all but the smallest input sizes. Instead, we want these instances to be translated into standard joins on the logical algebra level, just like ordinary equi joins. This way, the physical hierarchy join operators we discuss in § 5.2 can be considered during algorithm selection. These operators generally do not have to enumerate the complete cross product, and they handle the join condition implicitly rather than computing it explicitly per tuple pair, which is critical to the query performance of our framework.

The case “H.axis(ν1, ν2) ∈ {self}” is special in that it corresponds to an ordinary equi join. Hierarchy predicates using other sets of axes (e. g., {descendant}) are more comparable to ≤ joins than to equi joins. For these cases, special join operators are needed in the physical algebra. Generally, each such operator can handle only a limited set of axis combinations A. For uncommon cases (e. g., A = {preceding, self, descendant}) the only options may therefore be to construct the union of two supported cases, or to fall back to nested loops. We will discuss the particular cases our physical operators can handle in the respective sections under § 5.2.

Example I. Consider the join–group–aggregate pattern for hierarchical computations:

SELECT u.ID, SUM(v.Value)
  FROM T u JOIN T v ON IS_DESCENDANT_OR_SELF(v.Node, u.Node)
 GROUP BY u.*

The plan features an inner join ⋈θ over θ := (H.axis(v.ν2, u.ν1) ∈ {self, descendant}). This particular pattern of a descendant-or-self self-join is very common.

Example II. In real-world queries the types of hierarchy join predicates listed above may not necessarily appear in isolation. Sometimes, queries feature more complex compound join predicates. The predicate could be a conjunction or disjunction of multiple hierarchy predicates, and it could contain further terms that are not necessarily hierarchy predicates. For example, we may be interested in descendants of the same type:

    θ := (H.axis(t1.ν1, t2.ν2) = descendant) ∧ (t1.Type = t2.Type).

Or we may wish to query only the following siblings of some nodes:

    θ := H.is-sibling(t1.ν1, t2.ν2) ∧ (H.axis(t1.ν1, t2.ν2) = following).

Also, nothing hinders the user from creating highly uncommon axis checks:

    θ := (H.axis(t1.ν1, t2.ν2) ∈ {preceding, following}).

None of our algorithms can handle such complex predicates directly. For this reason, in such queries the optimizer has to determine which of the terms of the compound

predicates could potentially be handled by a built-in join operator. The most selective predicate term can then be handled by a join, and the remainder of the predicate by a subsequent selection σ. As this is a standard, orthogonal task of algorithm selection, we do not discuss it further.

Example III. Suppose we wanted to select the “lowermost” nodes in an arbitrary subset U of nodes from T. Since U does not necessarily contain all nodes from T, we cannot use IS_LEAF for this. Rather, we have to use an EXISTS subquery: lowermost nodes are those for which no descendants exist in the input.

SELECT * FROM U v WHERE [NOT] EXISTS (
    SELECT * FROM U w WHERE IS_DESCENDANT(w.Node, v.Node)
)

This is a common use of semi or anti hierarchy joins. The initial translation of the SQL query is a dependent semi join U ⋉ (σ(U)) for the EXISTS version and a dependent anti join U ▷ (σ(U)) for the NOT EXISTS version. The d-joins can immediately be fused with the σ operators and replaced by proper semi and anti joins, yielding the “perfect” plans U ⋉ U and U ▷ U. Note that in case U contains all nodes from T, the same result can be achieved using a simple IS_LEAF(Node) filter.

Example IV. The following query performs XPath-style path pattern matching:

SELECT DISTINCT w.*
  FROM T u, T v, T w
 WHERE IS_ROOT(u.Node)
   AND IS_DESCENDANT(v.Node, u.Node)  -- θ1
   AND IS_DESCENDANT(w.Node, v.Node)  -- θ2
   AND φu(u) AND φv(v) AND φw(w)

This query is comparable to the XPath expression “/*[φu]//*[φv]//*[φw].” The initial plan after pushing down selections and introducing join operators is

    Π^{distinct}_{w.*}(σ_{φu}(T[u]) ⋈_{θ1} σ_{φv}(T[v]) ⋈_{θ2} σ_{φw}(T[w])).

By pushing down the projection Π, the inner joins can be replaced by semi joins:

    σ_{φu}(T[u]) ⋊_{θ1} σ_{φv}(T[v]) ⋊_{θ2} σ_{φw}(T[w]).

Examples III and IV show that semi and anti joins deserve special care. Significant optimizations are possible: Of the non-retained join side, only the fields accessed by the join condition (i. e., the Node field) need to be projected, and duplicate Node values can be eliminated if DISTINCT semantics are desired. Going further, we can apply a “staircase filter” to eliminate Node values which are dominated by others and thus do not contribute to the join result. This technique is discussed in § 5.2.6.

5.1.5 Binary Structural Grouping

The binary structural grouping operator we introduce in this section is used for two purposes: First, to optimize a join–group–aggregate query such as Example I of § 5.1.4,


and second, to evaluate a query that specifies a hierarchical window (§ 3.3.3) but does not contain a recursive expression.

Definition. Binary structural grouping consumes two input relations {τ1}b and {τ2}b given by expressions e1 and e2, where τ1 and τ2 are tuple types. Let θ be a join predicate, x a new attribute name, and f a scalar aggregation function {τ2}b → N for some type N. The signature of the operation is

    ⋈^θ_{x : f} : {τ1}b × {τ2}b → {τ1 ∘ [x : N]}b.

Its definition is

    e1 ⋈^θ_{x : f} e2 := {t ∘ [x : f(e2⟦θ t⟧)] | t ∈ e1}b,   where   e⟦θ t⟧ := {u | u ∈ e ∧ θ(u, t)}b.

It extends each tuple t ∈ e1 by an x attribute of type N, whose value is obtained by applying function f to the bag e2⟦θ t⟧, which contains the relevant input tuples for t.
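As a reference point for the semantics, the groupjoin can be spelled out naively; this hypothetical C++ helper enumerates all pairs, which is exactly the O(|e1| · |e2|) behavior the physical operators of § 5.2 avoid.

    #include <utility>
    #include <vector>

    // For each left tuple t, extend it by x := f(bag of all u in e2 with theta(u, t)).
    template <typename T1, typename T2, typename Theta, typename F>
    auto structuralGroupjoin(const std::vector<T1>& e1, const std::vector<T2>& e2,
                             Theta theta, F f) {
        using N = decltype(f(std::declval<const std::vector<T2>&>()));
        std::vector<std::pair<T1, N>> result;
        result.reserve(e1.size());
        for (const T1& t : e1) {
            std::vector<T2> bag;                    // the bag of matches for t
            for (const T2& u : e2)
                if (theta(u, t)) bag.push_back(u);
            result.emplace_back(t, f(bag));         // left outer semantics: f({}) if no match
        }
        return result;
    }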

Example I. The initial query plans resulting from join–group–aggregate statements are typically variations of

    Γ_{t.*; x : f}(e1[t] ⟕_{u < t} e2[u]),

where Γ denotes unary grouping and < is a predicate—usually an axis check—that reflects the input/output relationship among tuples. Using the groupjoin, they can be rewritten to

    e1 ⋈^<_{x : f} e2

with the same definitions of f and <. In § 2.2.4 we discussed a non-weighted rollup based on an input table Input1 and a table of output nodes Output1 (see Figure 2.4, p. 18). Assuming χ operators that perform the necessary node() calls, a plan to evaluate this example rollup would be

    χ(Output1) ⋈^<_{x : f} χ(Input1),   with   u < t := (H.axis(u.ν, t.ν) ∈ {self, descendant}),
                                               f(X) := ∑_{u ∈ X} u.Value.

Example II. Besides optimizing Γ(· ⋈θ ·) plans, we also use the groupjoin to evaluate hierarchical windows with non-RECURSIVE expressions. They are translated into a binary self-grouping e[t] ⋈^θ_{x : f} e[u], with θ(t, u) defined as appropriate. Consider our example query from § 3.1:

II-b  SELECT Node, SUM(Value) OVER (HIERARCHIZE BY Node) FROM Input1

The resulting query plan is

    e[t] ⋈^<_{x : f} e[u],   with   e := χ_{ν : H.node(Node)}(Input1),

and u < t and f(X) as in Example I.


Remarks. Having the query optimizer rewrite the Γ(· ⋈θ ·) pattern into a binary grouping operator allows us to employ corresponding physical operators and thus overcome the efficiency issues noted in § 2.2.4. This idea is not new to relational algebra but commonly known as binary grouping or groupjoin. It has been explored in [74] mainly for the equi join setting, together with relevant rewrite rules for query optimization. Our definition of ⋈_{x : f} is a minor adaption of [74]’s definition of the binary grouping operator. Its semantics are those of a left outer join; it can, however, be adapted to inner join semantics as well.

For hierarchical windows, the join condition θ is defined by the system in terms of an axis() check. The basic join axis is A0 = {descendant} for a BOTTOM UP window and A0 = {ancestor} for a TOP DOWN window. Details of the window frame exclusion clause (EXCLUDE, § 3.3.3) can be handled on the logical algebra level by adjusting the join condition:

    NO OTHERS    θ(u, t) := (H.axis(ν1, ν2) ∈ A0 ∪ {self})
    CURRENT ROW  θ(u, t) := (H.axis(ν1, ν2) ∈ A0 ∪ {self}) ∧ (u.rowid ≠ t.rowid)
    GROUP        θ(u, t) := (H.axis(ν1, ν2) ∈ A0)
    TIES         θ(u, t) := (H.axis(ν1, ν2) ∈ A0) ∨ (u.rowid = t.rowid)

Example III. Starting from a binary self-grouping plan as shown in Example II, more advanced rewrites are possible. Consider again Statement I-b from § 3.3.5:

I-b  SELECT * FROM (
         SELECT h.*, SUM(Amount) FILTER (WHERE Type = 'type A') OVER w
           FROM Hierarchy1 h LEFT OUTER JOIN Sales s ON Node = s.Product
         WINDOW w AS (HIERARCHIZE BY Node)
     ) WHERE DEPTH(Node) <= 3

The basic shape of the plan is σφ(e ⋈^θ_{x : fψ} e), with e := (Hierarchy1 ⟕ Sales). There is a condition φ(t) := (H.depth(t.ν) ≤ 3) on the output that does not depend on the computed sum x. Select operators of this kind can typically be pushed down to the left input of the groupjoin. The FILTER ψ(t) := (t.Type = …) can be handled by f or pushed down to the right input. The shape of the rewritten plan would be

    σφ(e ⋈^θ_{x : fψ} e)   ⇝   σφ(e) ⋈^θ_{x : f} σψ(e).

Such rewriting always pays off, especially when the filters can be pushed down further.

5.1.6 Unary Structural Grouping

Translating a hierarchical window that features a recursive expression requires a new operator, unary structural grouping Γˆ(e).

Definition. Let expression e produce a relation {τ}b for some tuple type τ; let < be a comparator for τ elements providing a strict partial ordering of e’s tuples, x a new attribute name, and f a structural aggregation function τ × {τ ∘ [x : N]}b → N, for some


scalar type N. The unary structural grouping operator Γˆ associated with <, x, and f has the signature

    Γˆ^<_{x : f} : {τ}b → {τ ∘ [x : N]}b

and is defined as

    Γˆ^<_{x : f}(e) := {t ∘ [x : rec^<_{x : f}(e, t)] | t ∈ e}b,   where
    rec^<_{x : f}(e, t) := f(t, {u ∘ [x : rec^<_{x : f}(e, u)] | u ∈ e⟦<: t⟧}b).

Remarks. This new operator is necessary because the quasi-recursive computation cannot be expressed in terms of conventional relational algebra operators. Since the concept as such may be useful beyond hierarchical windows, we define it based on an abstract < comparison predicate on the tuples of the input relation, which drives the data flow of the operation (i. e., the way the recursive grouping happens). It is required to be a strict partial order relation: irreflexive, transitive, and asymmetric. The operator arranges its input in an acyclic directed graph whose edges are given by the notion of covered tuples <: we defined in § 3.3.3. On that structure it evaluates a structural aggregation function f , which performs an aggregation-like computation given a current tuple t and the corresponding bag of covered tuples. In other words, a variable, pseudo-recursive expression f is evaluated on a recursion tree predetermined by <. We reuse the symbol Γ of standard unary grouping for Γˆ. Both are similar in that they form groups of the input tuples, but Γˆ does not “fold away” the tuples. Instead, it behaves like SQL-style window grouping: it extends each tuple t in e by a new attribute x and assigns it the result of “rec,” which applies f to t and the bag of its covered tuples u. The twist is that each tuple u in the bag already carries the x value, which has in turn been computed by applying rec to u, in a recursive fashion. Thus, while f itself is not recursive, a structurally recursive computation is encapsulated in Γˆ’s definition. The recursion is guaranteed to terminate, since < is strict.
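To make the structural recursion tangible, the following toy C++ sketch evaluates a BOTTOM UP rollup f(t, X) = t.Value + ∑_{u ∈ X} u.x for the special case where the covers relation <: coincides with the parent–child edges and the rows arrive in postorder (children before parents); the Row layout and parent links are assumptions of this example.

    #include <cstddef>
    #include <vector>

    struct Row {
        long   parent;    // index of the covering row, or -1 for a root
        double value;     // t.Value
        double x = 0.0;   // the computed attribute x
    };

    // One forward pass implements "rec": every child's final x is ready
    // before its parent is reached, so f is never evaluated recursively.
    void structuralGrouping(std::vector<Row>& rows) {
        for (Row& t : rows) t.x = t.value;            // f's own term
        for (std::size_t i = 0; i < rows.size(); ++i)
            if (rows[i].parent >= 0)
                rows[rows[i].parent].x += rows[i].x;  // propagate along <:
    }

    int main() {
        // Nodes 0 and 1 are children of node 2 (the root), stored in postorder.
        std::vector<Row> rows = {{2, 10.0}, {2, 5.0}, {-1, 1.0}};
        structuralGrouping(rows);
        return rows[2].x == 16.0 ? 0 : 1;             // root x = 1 + 10 + 5
    }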

Examples. For hierarchical windows, we define < as in § 5.1.5 in terms of axis(). As we discussed in § 3.3.4, the default frame clause of a window function within a recursive expression is

RANGE BETWEEN 1 PRECEDING AND CURRENT ROW EXCLUDE GROUP

Using the rules of § 5.1.5 we thus obtain

    (u < t) := (H.axis(u.ν1, t.ν2) ∈ A),

where A = {descendant} for a BOTTOM UP window and A = {ancestor} for a TOP DOWN window. This definition makes < indeed irreflexive, transitive, and asymmetric. We can now translate our two example statements from § 3.1 into plans based on Γˆ:


    (1b)  total Value           ↑   t.Value + ∑_{u ∈ X} u.x
    (2b)  absolute Weight       ↓   t.Weight ∗ ∏_{u ∈ X} u.x
    (3b)  Value sum over “<:”   ↑   ∑_{u ∈ X} u.Value
    (4a)  weighted rollup       ↑   t.Weight ∗ (t.Value + ∑_{u ∈ X} u.x)
    (4b)                            t.Value + t.Weight ∗ (∑_{u ∈ X} u.x)
    (4c)                            t.Value + ∑_{u ∈ X} u.Weight ∗ u.x
    (5)   Dewey conversion      ↓   ⟨t.ID⟩ if X = {}b;  u.x ∘ ⟨t.ID⟩ if X = {u}b
    (6b)  depth                 ↓   1 + ∑_{u ∈ X} u.x
    (7b)  subtree size          ↑   1 + ∑_{u ∈ X} u.x
    (8)   subtree height        ↑   1 if X = {}b, else 1 + max_{u ∈ X} u.x
    (9b)  degree                ↑   |X|

    Symbols: ↑ bottom up, ↓ top down

Figure 5.1: Example definitions of Γˆ’s structural aggregation function f(t, X).

II-c  RECURSIVE (Value + SUM(x) OVER (HIERARCHIZE BY Node)) AS x
      Γˆ^<_{x : f}(Input1),   f(t, X) = t.Value + ∑_{u ∈ X} u.x

III   RECURSIVE (Value + SUM(Weight * x) OVER (HIERARCHIZE BY Node)) AS x
      Γˆ^<_{x : f}(Input2),   f(t, X) = t.Value + ∑_{u ∈ X} u.Weight ∗ u.x

Figure 5.1 shows further example definitions of f. They correspond to the SQL expressions of Figure 3.7 (p. 48). As the examples attest, RECURSIVE expressions can be translated almost literally into corresponding f(t, X) formulas.

5.1.7 Unary Versus Binary Grouping

Theoretically, there are few restrictions on the function f of Γˆ and the groupjoin, which performs the actual per-tuple computation. The practical limit is what SQL’s expression language allows us to write. It is, however, useful to distinguish a class of common “simple” functions that let us establish a correspondence between Γˆ(e) and binary self-grouping e ⋈ e. An aggregation function {τ}b → N for use with the groupjoin is simple if it is of the form

    acc_{⊕; g; φ}(X) := ⨁_{u : u ∈ X ∧ φ(u)} g(u),

where function g : τ → N extracts or computes a value from each tuple, φ is a predicate to filter the input, and ⊕ is a commutative, associative operator to combine the values. This largely corresponds to what SQL allows us to express in the form

    AGG(expr) FILTER (WHERE φ),

where AGG is a basic aggregate function such as SUM, MIN, MAX, EVERY, or ANY without a DISTINCT set quantifier. We can define a structural counterpart as follows: A structural aggregation function τ × {τ ∘ [x : N]}b → N for use with Γˆ is simple if it is of the form

    str-acc_{x : ⊕; g; φ}(t, X) :=  g(t) ⊕ ⨁_{u ∈ X} u.x   if φ(t),
                                    ⨁_{u ∈ X} u.x          otherwise.

In Figure 5.1, functions 1b, 2b, 6b, and 7b are in fact simple.

To obtain our correspondence, consider R := Γˆ^<_{x : str-acc}(e). If the acyclic directed graph imposed by < on e is a tree—that is, there are no undirected cycles—the following holds for all t ∈ R:

    t.x  =  g(t) ⊕ ⨁_{u ∈ R⟦<: t⟧} u.x  =  g(t) ⊕ ⨁_{u ∈ e⟦< t⟧} g(u)  =  ⨁_{u ∈ e⟦≤ t⟧} g(u)

and thus

    e ⋈^≤_{x : acc_{⊕; g; φ}} e  =  Γˆ^<_{x : str-acc_{x : ⊕; g; φ}}(e).

Note that this equivalence will not hold if there are multiple paths u <: … <: t connecting two tuples u < t in the input e. In this situation, Γˆ would indirectly count u multiple times into t’s result, while the groupjoin would not. This is due to the particular semantics of structural recursion, which simply propagates x values along the <: paths. When we apply Γˆ in our hierarchical window setting, the equivalence holds, as <: is derived from the acyclic tree structure of H, provided that we additionally make sure there are no duplicate ν values in the current window partition.

The correspondence is then useful in both directions and enables significant optimizations: As many typical non-recursive hierarchical window computations (and sometimes even join–group–aggregate queries) fit the form of acc, we can rewrite their initial translation e ⋈ e into Γˆ(e). As we assess in § 6, even when e is just a table scan, our Γˆ algorithms outperform the groupjoin due to their simpler logic (e need not be evaluated twice) and effective pipelining. Vice versa, if we can algebraically transform a given RECURSIVE expression into the form of str-acc, then the groupjoin becomes an alternative to Γˆ. If a WHERE condition φ on the output or a FILTER condition ψ is applied to the aggregate function, σφ(e) ⋈ σψ(e) will usually be superior to σφ(Γˆ_{fψ}(e)), as already noted in Example III of § 5.1.5.

Finally, consider again our manual rewrite of Statement I-a to I-b from Figure 3.6 (p. 47). Here we saved one join by using a hierarchical window. This demonstrates an advanced optimization from e1 ⋈ e2 into Γˆ: By “merging” the two inputs into a combined input e12, we can potentially (without going into details) rewrite e1 ⋈ e2


Operator [Parameters] (Inputs)                              Short Name; Ref.  Order (in; out)  Time Complexity              Output Size     Space Requirements
Hierarchy Index Scan [H; ν; ς; L; φ] ()                     HIS; § 5.2.1      —; ς             O(|H|)                       |σφ(T)| ≤ |H|   O(1)
Nested Loop Join [θ; ι] (e1, e2)                            NLJ; § 5.2.3      arbitrary; ‡     O(|e1| · |e2|)               |e1 ⋈ e2|       O(1), mat. e2
Hierarchy Index Join [H; ν2; θ; ι; ς] (e1)                  HIJ; § 5.2.4      arbitrary; ‡     O(|e1| · |H|)                |e1 ⋈ T|        O(1)
Hierarchy Merge Join [H; ν1; ν2; ς; A; ι] (e1, e2)          HMJ; § 5.2.5      †; ‡             O(|e1| + |e2| + |e1 ⋈ e2|)   |e1 ⋈ e2|       O(1), mat. e2
Hierarchy Staircase Filter [H; ν; ς; a] (e)                 HSF; § 5.2.6      †; ‡             O(|e|)                       ≤ |e|           O(1)
Hierarchy Merge Groupjoin [H; ν1; ν2; ς; A; x : f] (e1, e2) HMGJ; § 5.2.8     †; ‡             O(|e1| + |e2|)               |e1|            O(|e2|), mat. e2
Hierarchical Grouping [H; ν; ς; A; x : f] (e)               HG; § 5.2.9       †; ‡             O(|e|)                       |e|             O(|e|)
Sort [<] (e)                                                Sort; § 5.2.11    arbitrary; <     O(|e| · log |e|)             |e|             O(|e|), mat. e
Hierarchy Rearrange [H; ν; ς] (e)                           HR; § 5.2.11      †; ς             O(|e|)                       |e|             O(|e|)

H hierarchy index; νi Node attribute names; ς ∈ {pre, post} sort order; L list of name/expression pairs;
φ filter predicate; θ join predicate; ι join type; a axis; A set of axes;
† requires suitably ordered input(s); ‡ order of input(s) is retained in the output; mat. = materialization required

Figure 5.2: An overview of the hierarchy operators in physical algebra.

to e12 ⋈ e12 and then into Γˆ(e12). This would pay off if e12 can be further simplified, for example, when e1 and e2 were very similar in the first place. Establishing relevant equivalences to enable such advanced optimizations is part of future work (see § 5.3).

5.2 Physical Algebra Operators

On the level of physical algebra, several new algorithms are needed in order to efficiently evaluate the logical operators we discussed in § 5.1. Figure 5.2 shows an overview of the operators we cover in the following: Hierarchy Index Scan (HIS) accesses the rows of the hierarchical table through a hierarchy traversal in either preorder or postorder. Nested Loop Join (NLJ), Hierarchy Index Join (HIJ), and Hierarchy Merge Join (HMJ) evaluate different types of hierarchy joins. Hierarchy Merge Groupjoin (HMGJ) and Hierarchical Grouping (HG) evaluate binary and unary structural grouping queries, respectively (that is, join–group–aggregate statements and hierarchical windows); in particular, the latter is needed whenever a RECURSIVE expression is involved. Hierarchy Rearrange (HR) reorders inputs that are already sorted in either preorder or postorder;

where applicable, it can outperform a full Sort. For each operator, we specify a notation and syntactic rules for when it is applicable, define its exact functionality, discuss its practical uses and restrictions, and provide the pseudo code and a detailed analysis of the underlying algorithms. Thanks to our common index interface, the operators work for every type of hierarchy indexing scheme that is implemented by the system. In the discussion of runtime complexities, we assume the used index primitives to be in O(1) for static indexes and in O(log |H|) for dynamic indexes, |H| being the hierarchy size (cf. Figure 4.3, p. 74). Either way, the costs of the primitives are not affected by the input sizes and can therefore be considered constant-time for that matter.

We assume a push-based bottom-up model of query execution (see [76], § 5.3). For binary operators, we further stick to the convention that the left input e1 is the one that is conceptually pushed towards the operator, thus forming a pipeline. The right input e2 must be evaluated and materialized prior to the execution of the left pipeline, and can then be accessed through iterators (potentially in a random-access manner). In other words, the pipeline of the right input is always broken. In our pseudo code this is recognizable by a simple outer “for” loop that iterates over e1 in a strictly forward manner, whereas e2 is accessed like an array (“e2[i]”). We use the keyword “yield” to indicate where a result tuple is pushed to the respective parent operator in the plan.

5.2.1 Hierarchy Index Scan

A hierarchy index provides an alternative access path to the hierarchical table T, much like other common database indexes such as B+-trees or hash tables. Instead of scanning the table, we can scan an attached hierarchy index H and obtain its nodes and the corresponding row IDs of T in the scan order ς. Additionally, a set of expressions L based on the hierarchy nodes can be evaluated on the fly. Finally, the nodes can be filtered by a predicate φ.

Syntax. Hierarchy Index Scan [H; ν; ς; L; φ](), where H is a hierarchy index, ν is the name to be attached to the generated Node attribute, and ς ∈ {pre, post} is the scan order, either preorder or postorder. L = {f1 : e1, …, fm : em} is a list of names and expressions of types τ1, …, τm. The expressions may involve (only) the node attribute, that is, F(ei) ⊆ {ν}. They are evaluated for each scanned node and their result is included in the output. φ is a boolean expression involving the ν attribute or any of the computed attributes, that is, F(φ) ⊆ {ν, f1, …, fm}.

The output is of type {[rowid : ROWID, ν : NodeH, f1 : τ1, …, fm : τm]}b and sorted by ν in ς-order.

Remarks. Hierarchy Index Scan itself produces only a (sorted) stream e of row IDs and Node objects. To actually access some attributes a1, …, an ∈ A(T) of the hierarchical table T, a Map operator must be used to dereference the row IDs:

Map [t : T[rowid]; t.a1,..., t.an](e)


It is possible to emulate Hierarchy Index Scan in terms of basic operators, namely by combining Scan, Map, Select and Sort. The following produces an equivalent result:

    e1 ← Scan [T; rowid, ℓ, a1, …, an]()
    e2 ← Map [ν : H.node(ℓ), f1 : e1, …, fm : em](e1)
    e3 ← Select [φ](e2)
    e4 ← Sort [<](e3),   where (t1 < t2) := H.is-before-ς(t1.ν, t2.ν)

Compared to this emulation, HIS has a few home-field advantages: First, no explicit sorting is necessary; the sorting happens implicitly by traversing the index in the desired order. Second, calling node() to obtain Node objects from the labels ℓ is not necessary, as the index scan produces the Node objects directly. Third, filtering by φ happens prior to dereferencing the row IDs. This increases efficiency if the dereferencing has a non-negligible cost—as is commonly the case, especially with column-oriented table storage. Fourth, there are favorable cache locality effects: Since the predicate φ and the functions L are evaluated “just in time” during the scan, the parts of the index data structure that are needed for the evaluation are likely to be in cache. This enables HIS to significantly outperform the conventional Scan–Map–Select–Sort alternative.

Example. HIS is very useful whenever hierarchical filtering or ordering is performed directly on a hierarchical table:

SELECT Node, DEPTH(Node) FROM T WHERE DEPTH(Node) < 5 ORDER BY PRE_RANK(Node)

A conventional plan would be:

    Sort [<∗](Select [d < 5](Map [ν : H.node(ℓ), d : H.depth(ν)](Scan [T]())))

where (t1 <∗ t2) := H.is-before-pre(t1.ν, t2.ν). When the optimizer considers HIS as an alternative way to access T, it can detect that the Map, Select, and Sort instances can in fact be pushed down to the scan and eliminated. The result is an “index-only” plan:

Hierarchy Index Scan [H; ν; pre; d : H.depth(ν); d < 5]()

Algorithm. The algorithm is a straightforward loop over the hierarchy, using the traversal primitives offered by our index interface. Its time complexity is in Θ(|H|). The space requirements are in O(1).

    1  operator Hierarchy Index Scan [H; ν; ς; L; φ]()
    2      (c1, c2, φ′) ← (H.ς-begin(), H.ς-end(), φ)
    3      c ← c1                       // c, c1, c2 are of type CursorH
    4      while c ≠ c2
    5          xi ← ei(c) for (fi : ei) ∈ L
    6          t ← [rowid : H.rowid(c), ν : c, f1 : x1, …, fm : xm]
    7          if φ′(t)
    8              yield t
    9          c ← H.ς-next(c)


Extensions. Evaluating the given L and φ expressions on the fly already gives HIS certain “just in time” advantages over a conventional a-posteriori Select_φ. However, since L and φ are known at query compile time, we could improve the performance even further by generating tailored variants of the scan algorithm. A big potential gain is when a predicate φ can be handled implicitly by restricting the scan range. In this case the evaluation of φ does not add to the runtime costs at all—it even reduces them. To incorporate this idea into the above algorithm, we need to replace line 2 by

    (c1, c2, φ′) ← scan-range [H; ν; ς; φ]().

This helper function determines a range [c1, c2[ that is suitable for the given predicate φ and as small as possible. The range is guaranteed not to preclude any nodes for which φ may hold. However, it may still include nodes for which φ does not hold. Therefore, a “residual” predicate φ′ is returned that remains to be checked for each node in [c1, c2[. For example, a predicate φ of the form “H.axis(ν, ν′) = descendant,” where ν′ is a constant, can be fully handled by scanning from ν′ to its first following node. Here is a simple version of scan-range() that can detect this case:

    1  function scan-range [H; ν; ς; φ]()
    2      if (ς = pre) ∧ (φ = [H.axis(ν, ν′) ∈ {descendant}] for some ν′)
    3          c′ ← H.cursor(ν′)
    4          return (H.pre-next(c′), H.next-following(c′), true)
    5      else if …          // other cases
    6      else               // fallback: full scan
    7          return (H.ς-begin(), H.ς-end(), φ)

Axis Restricted range [c1, c2[ Residual predicate φ0 descendant [H.pre-next(c0), H.next-following(c0)[ true ancestor [H.pre-begin(), c [ H.is-before-post(ν, ν ) 0 ¬ 0 preceding [H.pre-begin(), c0[ H.is-before-post(ν, ν0) following [H.next-following(c0), H.pre-end()[ true Of course, these optimizations are only beneficial for indexes where is-before-ς() is faster than axis(), and next-following() is not implemented in terms of a next-pre() loop. This is for example the case with our order indexes, as shown in § 4.5.5. Extending the outlined ideas further is a topic for future work (see also § 6).

5.2.2 Hierarchy Join: Overview

In the following sections we discuss algorithms to evaluate hierarchy joins e1 ⋈θ e2, which we characterized in § 5.1.4. The primary algorithms are Hierarchy Index Join (HIJ, § 5.2.4) and Hierarchy Merge Join (HMJ, § 5.2.5). Nested Loop Join (NLJ, § 5.2.3) is mainly relevant as a fallback method. Each algorithm can handle only a limited set of join cases and subsets of axis sets A. The relevant cases are:


• Equi Joins — The case where θ(t1, t2) is “H.axis(t1.ν1, t2.ν2) ∈ {self}” corresponds to a plain equi join. Node objects are both equality-comparable and hashable, so this case can and should be handled with a standard equi join algorithm.

• HIJ — Joins of the form “e1 ⋈θ σ(T),” where e1 is arbitrarily ordered and the right input is the (potentially restricted) hierarchical table itself, can be answered by Hierarchy Index Join. Instead of independently evaluating the right join side, it directly enumerates the joined tuples for each left input tuple in e1. Thus the order of e1 is retained. However, HIJ only supports the join types ⋈, ⟕, ⋉, and ▷.

• HIS + HIJ — Self joins of the form σ(T) ⋈θ σ(T) on the (potentially restricted) hierarchical table can be answered using an “index-only” approach that combines HIS (to enumerate the left side) and HIJ. The result will be ordered by ν1–ν2 in either preorder or postorder.

• HMJ — If both e1 and e2 are intermediate result tables (rather than the hierarchical table T itself), Hierarchy Merge Join can be used. It requires both inputs to be sorted by the Node fields in either preorder or postorder. The order is retained; that is, the resulting rows will be in preorder or postorder as well. HMJ supports any join type. As HMJ does not restrict the input expression e2, it can be used to construct bushy query plans (like NLJ, but unlike HIJ).

• NLJ — Only the basic Nested Loop Join can work with arbitrary input tables without requiring them to be ordered like HMJ does. Furthermore, it can handle all possible join predicates and join types.

• HSF — For semi and anti joins on ordered inputs, the Hierarchy Staircase Filter (§ 5.2.6) can be applied in order to reduce the size of the non-retained join side and thus improve the performance.

The naïve and inefficient NLJ should be considered only as a fallback method, and avoided by the optimizer in favor of the more efficient and robust HMJ and HIJ algorithms where possible. Introducing Sort operators for both inputs to enable HMJ may well pay off. Compared to NLJ, HMJ is a major improvement in that it requires only a single pass through the two inputs and therefore has a linear worst-case time complexity. Even with HIJ available, ordering the left input to enable HMJ can be worthwhile, especially if there are overlaps in the sets of join partners. That said, HMJ also requires more space, since intermediate results need to be buffered.

Aside from the general usage patterns listed above, further restrictions are imposed by the individual algorithms. For example, HMJ comes in two variants—bottom up and top down—with significantly different algorithms; the former handles axis A = {descendant}, the latter {ancestor}. During algorithm selection, the optimizer may need to apply the transformations discussed in § 5.1.2—in particular, inverting the axis set A or swapping the arguments using reflect()—in order to determine the applicability of a join algorithm in the light of such restrictions.


5.2.3 Nested Loop Join

Syntax. Nested Loop Join [θ; ι](e1, e2), where θ is a predicate with F(θ) ⊆ A(e1) ∪ A(e2), and ι is the join type, ι ∈ {⋈, ⟕, ⟖, ⟗, ⋉, ⋊, ▷, ◁}. If the inputs are of type e1 : {τ1}b and e2 : {τ2}b for some tuple types τ1 and τ2, then the output is of type {τ1 ∘ τ2}b for ι ∈ {⋈, ⟕, ⟖, ⟗}, of type {τ1}b for ι ∈ {⋉, ▷}, and of type {τ2}b for ι ∈ {⋊, ◁}.

Remarks. The standard algorithm naïvely checks the predicate θ for each pair in the cross product e1 × e2 using nested loops. The only difference in our setting is that θ is a hierarchy predicate, as characterized in § 5.1.4. Although NLJ is often prohibitively inefficient, it is useful as a fallback: Its implementation is trivial, it is fully flexible regarding θ and ι, and it does not impose any requirements on the inputs regarding sortedness, the uniqueness of the Node values, or the shape of the subplans e1 and e2. Also, any available order in the inputs is retained (in e1–e2 order).

The basic algorithm can be adapted and slightly optimized for left/right outer/semi/anti join semantics. Furthermore, if the right input happens to be sorted by Node, the “≤” nature of hierarchy predicates allows the algorithm to skip parts of e2 that cannot match a given e1 tuple. But regardless of such optimizations, the worst-case time complexity remains in O(|e1| · |e2|). As the cross product is not actually materialized, no space is required beyond materializing the right input.
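As a concrete illustration, the following minimal C++ sketch runs the naïve NLJ with a descendant-axis predicate. It assumes a simple pre/post rank labeling as the hierarchy encoding; the struct and function names are illustrative and not taken from the thesis implementation.

#include <cstdio>
#include <vector>

struct Node  { int pre, post; };                 // assumed pre/post labels
struct Tuple { Node nu; int id; };

// index primitives, here expressed directly on the labels
static bool isBeforePre (Node a, Node b) { return a.pre  < b.pre;  }
static bool isBeforePost(Node a, Node b) { return a.post < b.post; }

// hierarchy predicate: is t2 a proper descendant of t1?
static bool isDescendant(Node t2, Node t1) {
    return isBeforePre(t1, t2) && isBeforePost(t2, t1);
}

int main() {
    // tree: A(0,9) with descendants B(1,2) and C(3,8)
    std::vector<Tuple> e1 = {{{0, 9}, 1}};
    std::vector<Tuple> e2 = {{{1, 2}, 2}, {{3, 8}, 3}};
    for (const Tuple& t1 : e1)                   // O(|e1| * |e2|) predicate checks
        for (const Tuple& t2 : e2)
            if (isDescendant(t2.nu, t1.nu))
                std::printf("%d joins %d\n", t1.id, t2.id);
}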

5.2.4 Hierarchy Index Join

Syntax. Hierarchy Index Join [H; ν2; θ; ι; ς](e1), where H is a hierarchy index, ν2 is the name of the generated (right-hand) Node attribute, θ is a hierarchy join predicate, ι is the join type (inner, left outer, left semi, or left anti), ς ∈ {pre, post} is the desired sort order of the join partners, and the input is of type e1 : {τ1}b for some tuple type τ1.
Let τ2 := [rowid : ROWID, ν2 : NodeH]. The output is of type {τ1 ∘ τ2}b for the inner and left outer join types, and of type {τ1}b for left semi and left anti joins.

Algorithm. The algorithm simply runs a Hierarchy Index Scan for each t1 ∈ e1 by partially binding the predicate θ to t1 and passing it as a parameter to the scan. All tuples produced by the scan are combined with t1 and yielded. This simple “for” loop can be expressed in terms of a dependent join:

Hierarchy Index Join [H; ν2; θ; inner; ς](e1) ≡ e1 ⋈d Hierarchy Index Scan [H; ν2; ς; {}; θ](),

where ⋈d denotes the dependent join. The other join types ι translate into the corresponding outer, semi, and anti variants of the d-join. As |e1| index scans are run in total, the time complexity is in O(|e1| · |H|). The space complexity is in O(1). For certain predicates θ, such as a join on the descendant axis, the range of each scan can be restricted to exactly the join partners via the mechanism discussed in § 5.2.1; the time complexity is then even in O(|e1| + |e1 ⋈ e2|).
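To make the per-tuple index scan concrete, here is a minimal C++ sketch of the descendant-axis case, assuming the index simply keeps all nodes in a preorder-sorted array so that each scan can be restricted to exactly the join partners (cf. § 5.2.1). HierarchyIndex and scanDescendants are illustrative names, not the thesis interface.

#include <cstdio>
#include <algorithm>
#include <vector>

struct Node { int pre, post; };

struct HierarchyIndex {
    std::vector<Node> preorder;                  // all nodes, sorted by pre rank
    // enumerate the descendants of u: a contiguous preorder range after u
    template <class Yield>
    void scanDescendants(Node u, Yield yield) const {
        auto it = std::upper_bound(preorder.begin(), preorder.end(), u,
                                   [](Node a, Node b) { return a.pre < b.pre; });
        for (; it != preorder.end() && it->post < u.post; ++it)
            yield(*it);                          // exactly the join partners
    }
};

int main() {
    // tree: A(0,9) -> { B(1,2), C(3,8) -> { D(4,5), E(6,7) } }
    HierarchyIndex H{{{0, 9}, {1, 2}, {3, 8}, {4, 5}, {6, 7}}};
    std::vector<Node> e1 = {{0, 9}, {3, 8}};     // left input (pipelined)
    for (Node t1 : e1)                           // one index scan per e1 tuple
        H.scanDescendants(t1, [&](Node t2) {
            std::printf("(%d,%d) joins (%d,%d)\n", t1.pre, t1.post, t2.pre, t2.post);
        });
}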


Remarks. HIJ is fairly efficient, as it pipelines e1 and directly enumerates the T tuples that potentially match. In the descendant case, the joining happens completely implicitly: unlike with NLJ or HMJ, tuple pairs that do not actually match are never even considered. Beyond that, HIJ has some useful properties: First, it supports multiple join types—albeit only the inner and “left” types (left outer, left semi, left anti) due to the dependent right side. Second, it does not require e1 to be sorted and retains the available order. Third, it can enumerate the join partners in either preorder or postorder. Fourth, θ can include further restrictions besides the join condition, which are handled by HIS on the fly.

Example. The following query combines each root with the IDs of its descendants:

SELECT u.Node, v.ID
FROM T u, T v
WHERE IS_ROOT(u.Node)
  AND IS_DESCENDANT(v.Node, u.Node)

A plan based on Hierarchy Index Join would be:

1  e0 ← Hierarchy Index Scan [H; ν1; pre; {φ : H.is-root(ν1)}; φ]()
2  θ(t1, t2) := (H.axis(t2.ν2, t1.ν1) = descendant)
3  e1 ← Hierarchy Index Join [H; ν2; θ; inner; pre](e0)
4  e2 ← Map [t2 : T[rowid]; t2.ID](e1)

This is a reasonable and common use case. Note that each root node has a distinct set of (potential) join partners. Its simple logic makes HIJ preferable to HMJ in this case.
When e1 contains duplicate Node values, the basic version of HIJ does redundant work. The same happens in cases where the sets of join partners overlap, such as when one left input node is a parent of another and thus shares all its descendants. For example, if we change the condition “IS_ROOT(u.Node)” to “DEPTH(u.Node) < 3,” certain ν2 values will be enumerated multiple times and thus repeatedly processed by the subsequent Map operator. In these cases, a HIJ-based plan may be inferior to a plan that first sorts e1 and then employs Hierarchy Merge Join.

5.2.5 Hierarchy Merge Join

Syntax. Hierarchy Merge Join [H; ν1; ν2; ς; A; ι](e1, e2), where H is a hierarchy index, ν1 and ν2 are Node attribute names, ς ∈ {pre, post} is the order of the two inputs (either both preorder or both postorder), A is the set of hierarchy axes to join on, and ι is the join type: inner, left/right/full outer, left/right semi, or left/right anti.
Both inputs e1 and e2 must have a Node field named ν1 and ν2, respectively; formally, ei : {τi}b for some tuple type τi with a field νi : NodeH. The inputs are required to be sorted by νi in ς-order.
For left semi and left anti joins, the output is of type {τ1}b and ordered by ν1 in ς-order. For right semi and right anti joins, the output is of type {τ2}b and ordered by ν2 in ς-order. For the inner and outer join types, the output is of type {τ1 ∘ τ2}b. It will be ordered either by ν1–ν2 or by ν2–ν1 in ς-order, as detailed below. (“Ordered by ν1–ν2” means that the joined tuples are ordered first by e1.ν1 in ς-order, then by e2.ν2 in ς-order.)


Functionality. Hierarchy Merge Join is a family of four distinct algorithms. All four have in common that they require both inputs to be sorted in ς-order, and they retain one of the two input orders as the primary order of the output. Furthermore, they generally pipeline their left input (and process it in a strictly forward manner) but materialize their right input (and access it through an iterator). The two dimensions in which they differ are, first, whether they process preorder or postorder inputs (ς = pre versus ς = post), and second, whether their output is ordered by ν1–ν2 (algorithm “A”) or by ν2–ν1 (algorithm “B”). Each of the four algorithms supports (only) a particular join “direction”: upwards or downwards. Upwards means that the supported join predicate θ is “H.axis(ν2, ν1) ∈ {ancestor, self}.” Downwards means that θ has to be “H.axis(ν2, ν1) ∈ {self, descendant}.” Altogether, this gives us the following choices depending on the available order ς of the inputs and the intended join direction (i. e., the axis A on which we wish to join e2 against e1):

     ς     Direction    Suitable Algorithm                            Output Order
1.   pre   upwards      Hierarchy Merge Join A/pre  [...](e1, e2)     ν1–ν2  pre
2.   post  downwards    Hierarchy Merge Join A/post [...](e1, e2)     ν1–ν2  post
3.   pre   upwards      Hierarchy Merge Join B/pre  [...](e1, e2)     ν2–ν1  pre
4.   post  downwards    Hierarchy Merge Join B/post [...](e1, e2)     ν2–ν1  post
5.   pre   downwards    Hierarchy Merge Join A/pre  [...](e2, e1)     ν2–ν1  pre
6.   post  upwards      Hierarchy Merge Join A/post [...](e2, e1)     ν2–ν1  post
7.   pre   downwards    Hierarchy Merge Join B/pre  [...](e2, e1)     ν1–ν2  pre
8.   post  upwards      Hierarchy Merge Join B/post [...](e2, e1)     ν1–ν2  post

The first algorithm, Hierarchy Merge Join A, comes in two variants, based on either preorder or postorder inputs. The former joins e1 and e2 upwards (case 1), the latter joins e1 and e2 downwards (case 2). The second algorithm, Hierarchy Merge Join B, also comes in preorder and postorder variants and supports essentially the same ς/direction combinations as HMJ A (cases 3 and 4), but produces the respective “swapped” output orders. That is, the output of HMJ B is ordered by ν2–ν1 where the output of HMJ A would be ordered by ν1–ν2 (cases 1 versus 3 and 2 versus 4).
By simply swapping the e1 and e2 arguments of HMJ A, we can use HMJ A to effectively join e1 and e2 in the respective inverse directions: pre/downwards (case 5) and post/upwards (case 6). Of course, the resulting output order will then be ν2–ν1 accordingly. Analogously, by swapping the e1 and e2 arguments of HMJ B, we can, for a given ς, use HMJ B to join in the respective inverse directions in comparison to HMJ A and still achieve ν1–ν2 order (cases 7 and 8). Viewed from a different angle, HMJ B effectively joins downwards where HMJ A would join upwards, and vice versa. For example, we can use HMJ A for the pre/upwards case (1) and HMJ B for the pre/downwards case (7) and get output order ν1–ν2 in both cases.
That said, swapping the arguments e1 and e2 also swaps the pipelined/materialized roles of the inputs as a side effect, as HMJ A and HMJ B always pipeline their first argument and materialize their second. This may be relevant to the performance of the overall plan and should therefore also be considered during algorithm selection by the query optimizer. It may, for instance, be preferable to materialize the smaller of the two inputs even if that would mandate using a (less efficient) HMJ B variant instead of HMJ A.
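The eight cases can be encoded mechanically. The following small C++ helper (purely illustrative; pickHmj and its types are hypothetical names, not part of the thesis) demonstrates how an optimizer could select the variant and decide whether to swap the arguments:

#include <cstdio>

enum class Order { Pre, Post };
enum class Dir   { Up, Down };

struct Choice { char variant; bool swapArgs; };  // 'A' or 'B'

Choice pickHmj(Order sigma, Dir dir, bool wantNu1Nu2) {
    // HMJ A joins upwards on preorder and downwards on postorder inputs;
    // without swapping, HMJ A yields nu1-nu2 order and HMJ B yields nu2-nu1.
    bool naturalDir = (sigma == Order::Pre) == (dir == Dir::Up);
    bool swap = !naturalDir;                     // cases 5-8 swap e1 and e2
    // after a swap, the output orders of A and B are exchanged as well
    bool aGivesWanted = wantNu1Nu2 != swap;
    return {aGivesWanted ? 'A' : 'B', swap};
}

int main() {
    Choice c = pickHmj(Order::Pre, Dir::Down, /*wantNu1Nu2=*/true);
    std::printf("HMJ %c, swap=%d\n", c.variant, c.swapArgs);   // case 7: B, swapped
}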

Remarks. The logic and properties of HMJ A and HMJ B are reminiscent of the classic Merge Join operator (which handles equi join “=” predicates). Taking advantage of their ordered inputs allows them to disregard ranges of the inputs that cannot possibly match. HMJ A and HMJ B are thus able to achieve the ideal linear time complexity of O(|e1| + |e2| + |e1 ⋈ e2|).
The two HMJ A variants work with the “natural” input orders for the respective cases, that is, orders that lend themselves well to merge-join-style forward processing. For upwards joins the natural input order is preorder, whereas for downwards joins it is postorder. The HMJ A algorithms are therefore very efficient. HMJ B, by contrast, rearranges its output in order to produce the “swapped” output order (ν2–ν1 instead of ν1–ν2). As the above table shows, these rearranging joins are essentially equivalent to the “unnatural” join variants pre/downwards and post/upwards. The rearranging requires the HMJ B algorithms to maintain buffers of stashed intermediate results. They are therefore generally less efficient than the HMJ A algorithms.
Concerning the semi and anti join cases, the “left” variants are generally preferable to the “right” variants when there is a choice (regardless of HMJ A versus HMJ B). This can be understood by inspecting the pseudo code of the right semi and right anti variants, which need additional memory to mark the matched input tuples.
In the situation that a combination of input/output sort orders is desired that does not match any of the cases 1–8 listed above (e. g., the input e1 is in preorder but e2 is in postorder), Hierarchy Rearrange (§ 5.2.11) brings further flexibility. HR can be applied to one or both inputs and/or the output to convert them from preorder to postorder, and vice versa. Even though HR comes at a non-negligible runtime cost, this multiplies the combinations of input/output orders that can be achieved without requiring us to implement even more variants of the Hierarchy Merge Join.
Our Hierarchy Merge Join operators are inspired by the work of Al-Khalifa et al. on so-called structural join operators for XML data. In [53] the authors present two algorithms, “stack-tree-desc” and “stack-tree-anc,” which consume lists of XML nodes in document order (i. e., preorder) and perform an axis step (i. e., join) on the descendant axis. HMJ can be seen as a generalization of these concepts to general hierarchical tables and a generic hierarchy index interface. We also contribute join algorithms based on postorder inputs, which to our knowledge have not received much attention in the XML field (since XML data is naturally in document order). In our classification of Hierarchy Merge Join algorithms, stack-tree-desc is comparable to pre/upwards HMJ A, whereas stack-tree-anc is comparable to pre/upwards HMJ B.

Example. Consider a variant of the example we used for HIJ (§ 5.2.4):

SELECT u.Node, v.ID
FROM T u, T v
WHERE IS_DESCENDANT(v.Node, u.Node)
  AND DEPTH(u.Node) <= 2
  AND DEPTH(v.Node) >= 5


The following HMJ-based plan generally outperforms an equivalent HIJ-based plan:

1  e0 ← Hierarchy Index Scan [H; ν1; pre; {d : H.depth(ν1)}; d ≤ 2]()
2  e1 ← Hierarchy Index Scan [H; ν2; pre; {d : H.depth(ν2)}; d ≥ 5]()
3  e2 ← Map [t : T[rowid]; t.ID](e1)
4  e3 ← Hierarchy Merge Join [H; ν2; ν1; pre; {ancestor}; inner](e2, e0)

With HMJ A/pre the output would be in ν2–ν1 order; with HMJ B/pre it would be in ν1–ν2 order. Note there is some overlap between the different sets of join partners of the left input tuples. Thanks to HMJ, we can push down Map in order to obtain the ID values before performing the join, so this work is done only once per distinct v node. This ability to create bushy plans allows HMJ-based plans to outperform the left-deep HIJ-based plans in many situations.

Algorithm for HMJ A/pre. The following pseudo code shows the basic outline of HMJ A, assuming the input order ς = pre, the join axes A = {ancestor, self}, and a full outer join as the join type ι.

1  operator Hierarchy Merge Join A [H; ν1; ν2; pre; {ancestor, self}; full outer](e1, e2)
2    S2 : Stack⟨τ2⟩                     // stack of matched e2 tuples
3    p : int, p ← 0                     // position in e2 (iterator)
4    for t1 ∈ e1
5      ⟨maintain S2⟩
6      if S2 = ⟨⟩
7        yield t1 ∘ ⊥τ2                 // “left outer” e1 tuple
8      for t2 ∈ S2                      // “inner” join matches
9        yield t1 ∘ t2
10   for t2 ∈ e2[p..|e2|[               // remaining “right outer” e2 tuples
11     yield ⊥τ1 ∘ t2

Both inputs are accessed strictly sequentially: The outer loop (l. 4) passes through e1, whereas e2 is accessed via an iterator p. The stack S2 keeps tuples from e2 that may still have join partners. For each incoming left input tuple t1, the ⟨maintain S2⟩ block—which we consider below—maintains the stack S2 in such a way that it contains exactly the join partners with respect to t1.ν1 (l. 5). In the process it also adds further tuples from e2 onto S2, if necessary. If S2 is empty, a “left outer” tuple is yielded for t1 (l. 6–7). Otherwise, a simple “for” loop combines t1 with the tuples on S2 to produce the “inner” join tuples (l. 8–9). Finally, after all e1 tuples have been processed, there might be further unprocessed tuples remaining in e2, which are yielded as “right outer” join tuples (l. 10–11).
The ⟨maintain S2⟩ block has to identify e2 tuples that are on the ancestor or self axes with respect to t1.ν1:

12  ⟨maintain S2⟩:
13    while S2 ≠ ⟨⟩ ∧ H.is-before-post(S2.top().ν2, t1.ν1)    // preceding axis
14      S2.pop()


15    while p ≠ |e2|
16      t2 ← e2[p]
17      if t2.ν2 = t1.ν1                              // self axis
18        S2.push(t2)
19      else if ¬H.is-before-pre(t2.ν2, t1.ν1)        // descendant or following axis
20        break
21      else if H.is-before-post(t2.ν2, t1.ν1)        // preceding axis
22        yield ⊥τ1 ∘ t2                              // “right outer” e2 tuple
23      else
24        S2.push(t2)                                 // ancestor axis
25      p++

The first loop pops preceding tuples from S2 that are no longer relevant. The second loop examines further tuples from e2 up to the first descendant or following tuple. It pushes them onto S2 if they are on the ancestor or self axes, and dismisses them—after yielding a corresponding “right outer” tuple—if they are on the preceding axis. The seemingly redundant check in l. 17 is an optimization for the self case: an equality check is generally very cheap compared to the is-before() primitives.
To modify the algorithm to exclude the self axis, that is, to join on A = {ancestor} only, it is sufficient to replace l. 18 by “break.” As noted, the algorithm as shown performs a full outer join. With minor modifications, it can support any other join type ι as well (a compact code sketch follows the list below):

For a left outer join, based on the full outer case, remove l. 10–11 and replace l. 22 by a no-op.

For a right outer join, based on the full outer case, remove l. 6–7.

For an inner join, apply the modifications for both the left and the right outer join.

For a left semi join, based on the inner case, replace the loop in l. 8–9 by the following:

if S2 ≠ ⟨⟩
  yield t1

For a right semi join, based on the inner case, remove the loop in l. 8–9 and replace the two occurrences of “S2.push(t2)” by “yield t2.”

For a left anti join, based on the left outer case, remove l. 8–9 and replace “yield t1 ∘ ⊥τ2” by “yield t1.” For a right anti join, based on the right outer case, remove l. 8–9 and replace “yield ⊥τ1 ∘ t2” by “yield t2.”
In the semi and anti join cases, a Staircase Filter can additionally be applied to the non-retained join side; see § 5.2.6.
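For illustration, the following self-contained C++ sketch implements the inner join variant of HMJ A/pre on a simple pre/post labeling. The names are illustrative; the thesis operators work against the generic index interface instead of raw labels.

#include <cstdio>
#include <vector>

struct Node { int pre, post; };
static bool isBeforePre (Node a, Node b) { return a.pre  < b.pre;  }
static bool isBeforePost(Node a, Node b) { return a.post < b.post; }

// Both inputs must be sorted by their Node in preorder.
void hmjAPreInner(const std::vector<Node>& e1, const std::vector<Node>& e2) {
    std::vector<Node> s2;                    // stack of candidate e2 ancestors
    size_t p = 0;                            // forward iterator over e2
    for (Node t1 : e1) {
        // <maintain S2>: drop e2 tuples that are now on the preceding axis
        while (!s2.empty() && isBeforePost(s2.back(), t1))
            s2.pop_back();
        // pull e2 forward up to the first descendant/following tuple of t1
        while (p != e2.size()) {
            Node t2 = e2[p];
            if (t2.pre == t1.pre)           s2.push_back(t2);   // self axis
            else if (!isBeforePre(t2, t1))  break;              // desc./following
            else if (!isBeforePost(t2, t1)) s2.push_back(t2);   // ancestor axis
            /* else: preceding axis, dismissed (right-outer output omitted) */
            ++p;
        }
        for (Node t2 : s2)                   // all stack entries are join partners
            std::printf("(%d,%d) x (%d,%d)\n", t1.pre, t1.post, t2.pre, t2.post);
    }
}

int main() {
    // tree: A(0,9) -> { B(1,2), C(3,8) -> { D(4,5), E(6,7) } }, preorder inputs
    std::vector<Node> e1 = {{1, 2}, {4, 5}, {6, 7}};   // nodes of interest
    std::vector<Node> e2 = {{0, 9}, {3, 8}};           // candidate ancestors
    hmjAPreInner(e1, e2);
}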

Algorithm for HMJ A/post. This algorithm handles postorder inputs (ς = post) and joins downwards with A = {self, descendant}. Its basic outline is:


1  operator Hierarchy Merge Join A [H; ν1; ν2; post; {self, descendant}; full outer](e1, e2)
2    S1 : Stack⟨[ν1 : NodeH, i1 : int, i2 : int]⟩
3    i1, i2 : int, i1 ← 0, i2 ← 0
4    for t1 ∈ e1
5      ⟨determine [i1, i2[⟩
6      if i1 = i2
7        yield t1 ∘ ⊥τ2                 // “left outer” e1 tuple
8      for t2 ∈ e2[i1, i2[              // “inner” join matches
9        yield t1 ∘ t2
10   for i ∈ [0, |e2|[ not covered by any [i1, i2[ on S1    // “right outer” e2 tuples
11     yield ⊥τ1 ∘ e2[i]

The left input is again pipelined, but unlike in the pre/upwards case of HMJ A, the right input is not accessed in a strictly forward manner. (In return, no extra buffer S2 for e2 tuples is needed.) The stack S1 stores, for each seen node ν1 in the left input e1, a corresponding range [i1, i2[ of e2 that indicates the join partners with respect to t1.ν1. Thus, once the range has been determined for ν1 and placed on the stack, this information can be reused for upcoming nodes ν1′ that are ancestors of ν1. Whenever such an ancestor is placed on S1, it “sucks up” any descendant entries already on S1 and their join ranges. The ⟨determine [i1, i2[⟩ block (l. 5) does the critical work of determining the appropriate range of join partners, and maintains S1 in the process. Once this is done, an empty range indicates an unmatched “left outer” tuple (l. 6–7); otherwise, the join pairs are straightforwardly enumerated (l. 8–9). In the end, unmatched “right outer” tuples can be identified by the fact that they are not covered by any [i1, i2[ range on S1, and are enumerated in a simultaneous loop over S1 and e2 (l. 10–11).
The ⟨determine [i1, i2[⟩ block has to identify e2 tuples that are on the self or descendant axes with respect to ν1:

12  ⟨determine [i1, i2[⟩:
13    if S1 ≠ ⟨⟩ ∧ S1.top().ν1 = t1.ν1                         // same as previous
14      goto l. 6
15    while i2 ≠ |e2| ∧ (e2[i2].ν2 = t1.ν1 ∨ H.is-before-post(e2[i2].ν2, t1.ν1))
16      i2++
17    i1 ← i2
18    while S1 ≠ ⟨⟩ ∧ ¬H.is-before-pre(S1.top().ν1, t1.ν1)     // descendants of ν1 on S1
19      i1 ← S1.pop().i1
20    while i1 > 0 ∧ ¬H.is-before-pre(e2[i1 − 1].ν2, t1.ν1)    // further descendants in e2
21      i1––
22    S1.push([ν1, i1, i2])

If the previous node on S1 matches the current node, the [i1, i2[ range can immediately be reused (l. 13–14). Otherwise, the first “while” loop (l. 15–16) increases the upper bound i2 up to the first node in the right input that is on the following or ancestor axis of ν1. (Note the i2 value is initially taken over from the previous iteration.) The lower bound i1 is initially set to i2 (“no join partners”); however, any of the previous nodes on S1 (when viewed from the top) might be descendants of ν1. In this case, these entries are removed and their join ranges are subsumed (l. 18–19). This is correct because any range of join partners of a descendant on S1, as well as any e2 nodes in between two such ranges, are also valid join partners for ν1. After that, the interval [i1, i2[ may need to be further extended at the lower end, as there might be tuples on the self or descendant axes in the range e2[S1.top().i2, i1[. This is done by the third “while” loop (l. 20–21). Note that at this point i1 = i2 is possible, namely if all considered nodes in e2 and on S1 happened to be on the preceding axis. Finally, the determined information is memorized on S1 for later reuse (l. 22).
To modify the algorithm so that the self axis is not included in the join, it is sufficient to adjust the condition in l. 15 accordingly. Again, the algorithm can be modified to handle any other join type ι, just as we did for the pre/upwards variant of HMJ A.

Algorithm for HMJ B/pre. The following pseudo code shows the basic outline of HMJ B, assuming the input order ς = pre, the join axes A = {ancestor, self}, and a full outer join as the join type ι.

1  operator Hierarchy Merge Join B [H; ν1; ν2; pre; {ancestor, self}; full outer](e1, e2)
2    S2 : Stack⟨[t2 : τ2, joined : List⟨τ1 ∘ τ2⟩, inherited : List⟨τ1 ∘ τ2⟩]⟩
3    unmatched-t1 : List⟨τ1⟩
4    p : int, p ← 0
5    for t1 ∈ e1
6      ⟨maintain S2⟩
7      if S2 = ⟨⟩
8        unmatched-t1.add(t1)
9      for s ∈ S2                       // “inner” join matches
10       s.joined.add(t1 ∘ s.t2)
11   while S2 ≠ ⟨⟩
12     ⟨pop S2⟩
13   for t1 ∈ unmatched-t1              // “left outer” e1 tuples
14     yield t1 ∘ ⊥τ2

To produce the desired ν2–ν1 output order, this algorithm has to stash joined tuples in tuple lists until they are ready to be yielded. The stack S2 stores, for each encountered right input tuple t2, a list of tuples “joined” thus far with t2, as well as a list of joined tuples that have been “inherited” from descendants of t2.ν2 that were removed from the stack. The ⟨maintain S2⟩ block maintains S2 in such a way that it contains exactly the join partners of t1 (l. 6). These tuples are then joined with t1. However, instead of being yielded immediately, the joined tuples are added to the appropriate “joined” lists (l. 7–10). If there is no join partner for t1, it is stashed in the unmatched-t1 list. The lists of joined tuples are rearranged and possibly yielded at the point when the corresponding tuples


are removed from S2; this is done by the ⟨pop S2⟩ block. After e1 has been completely processed, stack S2 is unwound to ensure all stashed joined tuples are yielded (l. 11–12). Finally, “left outer” tuples for the unmatched-t1 entries are yielded (l. 13–14). The output shall be ordered by ν2–ν1, so these tuples with ν2 = NULL are yielded last.
The ⟨maintain S2⟩ block has to identify e2 tuples that are on the ancestor or self axes with respect to t1.ν1:

15  ⟨maintain S2⟩:
16    while S2 ≠ ⟨⟩ ∧ H.is-before-post(S2.top().t2.ν2, t1.ν1)    // preceding axis
17      ⟨pop S2⟩
18    while p ≠ |e2|
19      t2 ← e2[p]
20      if ¬(t2.ν2 = t1.ν1 ∨ H.is-before-pre(t2.ν2, t1.ν1))
21        break
22      S2.push([t2, ⟨⟩, ⟨⟩])
23      if H.is-before-post(t2.ν2, t1.ν1)                         // preceding axis
24        ⟨pop S2⟩
25      p++

The first loop (l. 16–17) unwinds the stack to remove all tuples that are now on the preceding axis. After that, only ancestor tuples remain on S2. The second loop (l. 18–25) adds further tuples from e2 that are on the ancestor, preceding, or self axes to S2. However, tuples on the preceding axis are immediately popped again (l. 23–24). This conveniently results in a “right outer” tuple being produced by ⟨pop S2⟩. The net result is that only the relevant join partners for t1 remain on S2.
Finally, the pseudo code for the ⟨pop S2⟩ block is as follows:

26  ⟨pop S2⟩:
27    [t2, joined, inherited] ← S2.pop()
28    if joined = ⟨⟩
29      joined.add(⊥τ1 ∘ t2)            // “right outer” e2 tuple
30    if S2 ≠ ⟨⟩
31      S2.top().inherited.append(joined ∘ inherited)
32    else
33      for t ∈ joined ∘ inherited
34        yield t

If the popped t2 tuple did not have a join partner, a “right outer” tuple is produced (l. 28–29). If there is another entry remaining on S2, that entry inherits the “joined” and “inherited” lists of t2 (l. 30–31). Otherwise—when the last entry has been removed from S2—the collected joined tuples are finally yielded in the order “joined”–“inherited” (l. 32–34).


[Figure 5.3: Applying a Hierarchy Staircase Filter to an example input. The figure shows the four variants HSF [H; ν; pre; ancestor], HSF [H; ν; pre; descendant], HSF [H; ν; post; descendant], and HSF [H; ν; post; ancestor] applied to an example hierarchy with a second root A2. For each case, the input nodes are listed in the appropriate order (preorder: B1 C1 C2 B2 C3 D1 D2 C4 D3 A2; postorder: C1 C2 B1 D1 D2 C3 D3 C4 B2 A2), and the nodes that pass the respective filter are marked.]

5.2.6 Hierarchy Staircase Filter

Syntax. Hierarchy Staircase Filter [H; ν; ς; a](e), where H is a hierarchy index, ν is a Node attribute name, ς ∈ {pre, post}, and a ∈ {ancestor, descendant}. The input is of type e : {τ}b for some tuple type τ with a field ν : NodeH. It is expected to be ordered by ν in ς-order. The output is of the same type as the input.

Functionality. The filter produces a subset e′ ⊆ e in the same order as the input. It removes any tuple t that is covered by another tuple t′ whose node is either equal to t.ν or on the axis a, that is, H.axis(t′.ν, t.ν) ∈ {a, self}. The effect is that the nodes in the output form a “staircase” in the pre/post plane: both the preorder ranks and the postorder ranks of the nodes form a monotonically increasing sequence.
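A minimal C++ sketch of the pre/ancestor case, assuming pre/post labels, shows how cheap the filter is: one comparison against the previously kept tuple suffices (illustrative code, not the thesis implementation; it anticipates the pseudo code given under “Algorithm” below).

#include <cstdio>
#include <vector>

struct Node { int pre, post; };

// HSF [pre; ancestor]: input sorted by pre; keep t unless the last kept tuple
// is an ancestor of (or equal to) t
std::vector<Node> hsfPreAncestor(const std::vector<Node>& e) {
    std::vector<Node> out;
    for (Node t : e)
        if (out.empty() || out.back().post < t.post)   // preceding axis => keep
            out.push_back(t);
    return out;
}

int main() {
    // A(0,9) -> { B(1,2), C(3,8) -> D(4,5) }; input in preorder
    std::vector<Node> e = {{0, 9}, {1, 2}, {3, 8}, {4, 5}};
    for (Node n : hsfPreAncestor(e))                   // keeps only A: it covers all
        std::printf("(%d,%d) ", n.pre, n.post);
    std::printf("\n");
}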

Example. Figure 5.3 shows a version of our example hierarchy with a second root A2. We apply HSF to a selection of the nodes that excludes A1. For each of the four cases of HSF, the figure shows the nodes sorted in the appropriate order (preorder or postorder) and marks the nodes that pass the respective filter.
As we already mentioned in § 5.1.4, the Staircase Filter is an important optimization for semi and anti joins. To our knowledge, these ideas were first examined systematically for the Staircase Join [41], a work in the context of XML/XPath processing. Consider the following example:

SELECT u.Node
FROM T u
WHERE EXISTS (
    SELECT * FROM T v
    WHERE IS_DESCENDANT(v.Node, u.Node) AND φ(v.Node)
)

A plan leveraging HSF would be:

1  e0 ← Hierarchy Index Scan [H; ν1; pre; {}; true]()
2  e1 ← Hierarchy Index Scan [H; ν2; pre; {}; φ]()
3  e2 ← Hierarchy Staircase Filter [H; ν2; pre; descendant](e1)
4  e3 ← Hierarchy Merge Join [H; ν2; ν1; pre; {ancestor}; right semi](e2, e0)

The presence of HSF does not affect the result, but it improves performance by removing tuples that are “superfluous” for the join from the input as early as possible.


Algorithm. The algorithms for the four different cases are largely straightforward. In the pseudo code, |e| > 0 is assumed for simplicity; if the input is empty, the output is trivially empty as well. In the pre/ancestor case, a tuple t is excluded from the output if there is another t′ ∈ e with H.axis(t′.ν, t.ν) ∈ {ancestor, self}.

1  operator Hierarchy Staircase Filter [H; ν; pre; ancestor](e)
2    yield e[0]
3    t0 : τ, t0 ← e[0]                  // previous output tuple
4    for t ∈ e[1..[
5      if H.is-before-post(t0.ν, t.ν)   // preceding axis
6        yield t
7        t0 ← t

In the pre/descendant case, a tuple t is excluded from the output if there is another t′ ∈ e with H.axis(t′.ν, t.ν) ∈ {self, descendant}.

1  operator Hierarchy Staircase Filter [H; ν; pre; descendant](e)
2    t0 : τ, t0 ← e[0]                  // previous input tuple
3    for t ∈ e[1..[
4      if H.is-before-post(t0.ν, t.ν)   // preceding axis: t0 passes
5        yield t0
6      t0 ← t
7    yield t0                           // the last tuple always passes

The post/descendant case is structurally similar to the pre/ancestor case:

1  operator Hierarchy Staircase Filter [H; ν; post; descendant](e)
2    yield e[0]
3    t0 : τ, t0 ← e[0]                  // previous output tuple
4    for t ∈ e[1..[
5      if H.is-before-pre(t0.ν, t.ν)    // preceding axis
6        yield t
7        t0 ← t

Due to the nature of postorder, post/ancestor is the most complicated case. Because we cannot be sure whether the currently inspected tuple is already the “uppermost” node or whether an even higher ancestor is going to appear (such as the root itself), we need to keep the seen tuples on a stack and delay yielding them until we have processed all of the input. We can, however, safely remove any descendants from the stack at the time we push a new tuple.

1  operator Hierarchy Staircase Filter [H; ν; post; ancestor](e)
2    S : Stack⟨τ⟩
3    for t ∈ e
4      while S ≠ ⟨⟩ ∧ ¬H.is-before-pre(S.top().ν, t.ν)    // descendant or self axis
5        S.pop()
6      S.push(t)
7    for t ∈ S                          // from bottom to top
8      yield t

Note how these algorithms also filter out duplicate ν values, as tuples on the self axis will not pass the “is-before” checks.
The time complexity of all four algorithms is in O(|e|). Their space complexity is in O(1), except for the post/ancestor case, which is in O(|e|).

5.2.7 Structural Grouping: Overview

In this overview we discuss various approaches for evaluating the unary structural grouping Γ̂ and binary structural grouping operations we defined in § 5.1.5 and § 5.1.6.

Join-Group. A generic approach for binary structural grouping is to treat θ as an opaque join predicate with partial-order properties, and stick to a sort-based join–group–aggregate plan: sort both inputs e1 and e2 in either preorder or postorder according to θ, then perform a sort-based left outer join e1[t] ⟕θ e2[u], and finally use sort-based unary grouping to compute the result.

1  Join-Group [θ; x : f](e1, e2) :=
2    e1 ← Sort [θ](e1)
3    e2 ← Sort [θ](e2)
4    e3 ← Join [θ](e1[t1], e2[t2])     // sort-based
5    Group [t1.∗; x : f](e3)           // sort-based

This requires a non-equi join operator that retains the order of e1 and deals appropriately with the fact that some tuples may not be comparable using θ. Unfortunately, the standard sort-based merge join supports only equi joins. If we make no further assumptions about e1, e2, and θ, we can only use some variant of Nested Loop Join, making the time complexity an unattractive O(|e1| · |e2|). We refer to this approach as “Join-Group.”

HMJ-Group. When the binary and unary structural grouping operators are used for hierarchical computations, where θ and < operate on NODE fields, the underlying hierarchy index H can and should be leveraged. A significant improvement over the generic Join-Group approach is to use a sort-based Hierarchy Merge Join. Alternatively, a Hierarchy Index Join can be used if e2 is compatible, that is, if e2 is the hierarchical base table T itself and can be enumerated through a Hierarchy Index Scan. We refer to the variant of Join-Group based on Hierarchy Merge Join as “HMJ-Group.”


HG and HMGJ. The two approaches mentioned above keep implementation efforts low by reusing existing operators. However, they cannot evaluate the structural recursion of Γ̂, and they suffer from the efficiency issues we already noted in § 2.2.4: they inherently need to materialize and process all < join pairs rather than just the <: pairs during query evaluation, and they are unable to reuse results from covered tuples. We therefore propose four new special-purpose operators: Hierarchy Merge Groupjoin for binary structural grouping, and Hierarchical Grouping for unary structural grouping Γ̂, each in a top-down and a bottom-up variant. The top-down variants require the inputs to be sorted in preorder, the bottom-up variants in postorder; this order is retained in the output.
The relational algebra definitions of binary and unary structural grouping apply the given aggregation function f to different bags of input tuples. In the pseudo code for HMGJ and HG, we use an abstract data type Aggregate to represent such a bag X. It supports the self-explanatory operations add(u), merge(X′), and clear(). While processing their inputs, the HMGJ and HG algorithms create one Aggregate instance X per tuple t ∈ e1, assemble the appropriate input tuples in it, and pass it to the given aggregation function f(X) or f(t, X) in order to obtain t.x. In the actual query-specific implementation of an Aggregate and its operations, significant optimizations may be possible depending on the given f, as we discuss in § 5.2.10.
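For illustration, a possible query-specific Aggregate instance for a simple SUM might look as follows in C++ (the names are hypothetical; the actual code would be generated per query, as discussed in § 5.2.10):

#include <cstdio>

struct Tuple { double value; };

struct SumAggregate {                        // the "X" of the pseudo code
    double x = 0.0;
    void clear()                      { x = 0.0; }
    void add(const Tuple& u)          { x += u.value; }
    void merge(const SumAggregate& o) { x += o.x; }   // reuse a covered result
};

int main() {
    SumAggregate subtree, parent;
    subtree.add({2.0}); subtree.add({3.0});  // aggregate a covered subtree once
    parent.merge(subtree);                   // ...and reuse it at the ancestor
    parent.add({1.0});
    std::printf("parent sum = %.1f\n", parent.x);     // 6.0
}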

5.2.8 Hierarchy Merge Groupjoin

Syntax. Hierarchy Merge Groupjoin [H; ν1; ν2; ς; A; x : f](e1, e2), where H is a given hierarchy index, ν1 and ν2 are Node attribute names, ς ∈ {pre, post} is the input order, A is a set of hierarchy axes, and x : f is a name/expression pair.
f must be a scalar aggregation function of type {τ2}b → N for some type N. The inputs e1 and e2 must have a Node field named ν1 and ν2, respectively. Formally, ei : {τi}b for a type τi that has a field νi : NodeH. The inputs are required to be sorted by νi in ς-order. The output is of type {τ1 ∘ [x : N]}b and has the same order as e1.

Algorithm. The HMGJ algorithms evaluate binary structural grouping (§ 5.1.5). They come in a bottom-up variant that handles ς = post and A = {self, descendant}, and a top-down variant that handles ς = pre and A = {ancestor, self}. The following pseudo code shows the basic framework for both variants. Overall, the logic is reminiscent of a left outer Hierarchy Merge Join A (§ 5.2.5).

1  operator Hierarchy Merge Groupjoin [H; ν1; ν2; ς; A; x : f](e1, e2)
2    S1 : Stack⟨[ν1 : NodeH, X : Aggregate⟨τ2⟩, i : int]⟩
3    S2 : Stack⟨τ2⟩
4    p : int, p ← 0
5    X : Aggregate⟨τ2⟩


6    for t1 ∈ e1
7      if S1 ≠ ⟨⟩ ∧ S1.top().ν1 = t1.ν1
8        [·, X, ·] ← S1.top()
9        yield t1 ∘ [x : f(X)]
10       continue

11     X.clear()
12     ⟨collect input⟩
13     yield t1 ∘ [x : f(X)]
14     S1.push([t1.ν1, X, |S2|])

Two stacks are used: Analogously to Hierarchy Merge Join A, stack S2 stashes processed e2 tuples that may become relevant as join partners. Stack S1 collects processed nodes ν1 from e1 together with the corresponding aggregates X of matched e2 tuples for reuse. The field i refers to a position on S2 and is needed in the top-down case.
For each t1 ∈ e1 (l. 6) we either reuse X from a previous equal node (l. 7–10) or assemble X via the ⟨collect input⟩ block, which differs between the top-down and bottom-up cases. Finally, an output tuple is produced and X is memorized on S1.
The bottom-up variant (postorder inputs) essentially performs a join on the descendant or self axes with left outer join semantics.

15  ⟨collect input⟩ — bottom up:
16    while S1 ≠ ⟨⟩ ∧ ¬H.is-before-pre(S1.top().ν1, t1.ν1)
17      [·, X′, ·] ← S1.pop()
18      X.merge(X′)
19    while S2 ≠ ⟨⟩ ∧ ¬H.is-before-pre(S2.top().ν2, t1.ν1)
20      X.add(S2.pop())

21    while p ≠ |e2|
22      t2 ← e2[p]
23      if t2.ν2 = t1.ν1
24        X.add(t2)
25      if ¬H.is-before-post(t2.ν2, t1.ν1)
26        break
27      if ¬H.is-before-pre(t2.ν2, t1.ν1)
28        X.add(t2)
29      else
30        S2.push(t2)
31      p++

The algorithm consists of three steps: The first loop (l. 16) removes all covered descendant entries from S1 and merges their aggregates into X using the merge() operation. This operation is the key to effectively reusing partial results, as motivated in § 2.2.4. The second loop (l. 19) adds relevant matches on the descendant or self axes from S2 to X. The third loop (l. 21) advances the right input e2 up to the first postorder successor of ν1. Any encountered t2 either is a postorder predecessor or has t2.ν2 = ν1; if t2 is also a preorder successor, it is a descendant. Descendant or self matches are added


straight to X (l. 28), whereas preceding tuples are stashed on S2 (l. 30).
The top-down variant (preorder inputs) joins on the ancestor or self axes:

32  ⟨collect input⟩ — top down:
33    while S1 ≠ ⟨⟩ ∧ H.is-before-post(S1.top().ν1, t1.ν1)
34      S1.pop()

35    j : int, j ← 0
36    if S1 ≠ ⟨⟩
37      [·, X′, j] ← S1.top()
38      X.merge(X′)

39    while j ≠ |S2| ∧ H.is-before-post(t1.ν1, S2[j].ν2)
40      X.add(S2[j])
41      j++

42    pop S2[j], ..., S2.top()
43    while p ≠ |e2|
44      t2 ← e2[p]
45      if t2.ν2 = t1.ν1
46        X.add(t2)
47        S2.push(t2)
48      if ¬H.is-before-pre(t2.ν2, t1.ν1)
49        break
50      if ¬H.is-before-post(t2.ν2, t1.ν1)
51        X.add(t2)
52        S2.push(t2)
53      p++

A peculiarity of the top-down case is that S1 and S2 entries may be consumed multiple times and therefore cannot be popped immediately. S1 and S2 are maintained in such a way that they comprise the full chain of ancestor tuples from e1 and e2 relative to ν1. The field i on S1 establishes the relationship to S2: For an S1 entry [ν, X, i], the bag X incorporates all matches for ν, corresponding to the S2 range [0, i[ (i. e., from the bottom up to position i, exclusively). If there is another S1 entry [ν′, X′, i′] below, then ν′ is the covered ancestor of ν, and X consists exactly of X′ plus the S2 tuples at positions [i′, i[.
Maintaining these invariants requires four steps: First (l. 33), obsolete preceding entries are popped from S1. Second (l. 36), any remaining entry on S1 is an ancestor, so its aggregate X′ is reused. Third (l. 39), any additional ancestors t2 that were not already in X′ (starting from position j) are added to X. Then, the remaining S2 tuples from position j to the top are preceding and therefore obsolete (l. 42). Finally (l. 43), e2 is advanced up to the first preorder successor of ν1; in the process, tuples on the ancestor or self axes are added to X and S2, whereas preceding tuples are ignored.
The two algorithms we discussed have left outer join semantics and include the self axis. They can be adapted for other axes (child/parent and the non-self variants) as well as for inner join semantics analogously to what we described for Hierarchy Merge Join in § 5.2.5.


5.2.9 Hierarchical Grouping

Syntax. Hierarchical Grouping [H; ν; ς; A; x : f](e), where H is a hierarchy index, ν is a Node attribute name, ς ∈ {pre, post} is the input order, A is a set of hierarchy axes, and x : f is a name/expression pair.
The input is of type e : {τ}b, where τ is a tuple type with a ν : NodeH field, and e is ordered by ν in postorder (bottom up) or preorder (top down). The output is of type {τ′}b, where τ′ := τ ∘ [x : N]. It has the same order as e.

Algorithm. The HG algorithms handle unary structural grouping with structural recursion. In a single pass through the input e, HG effectively issues the following call sequence for each tuple t:

X.clear()
X.add(u) for each u <: t
yield t ∘ [x : f(t, X)]

The following pseudo code shows the basic framework for both variants of HG:

1  operator Hierarchical Grouping [H; ν; ς; A; x : f](e)
2    S : Stack⟨[ν : NodeH, u : τ′, X : Aggregate⟨τ′⟩]⟩
3    X : Aggregate⟨τ′⟩
4    for t ∈ e
5      if S ≠ ⟨⟩ ∧ S.top().ν = t.ν
6        skip                           // reuse previous X
7      else
8        X.clear()
9        ⟨collect input⟩
10     yield t′ ← t ∘ [x : f(t, X)]
11     S.push([t.ν, t′, X])

The stack S (l. 2) manages previously processed tuples u and their computation states, that is, u.x and the corresponding aggregate X, for potential reuse. For each input tuple t (l. 4) the algorithm first checks whether t.ν matches the previous node; in that case it reuses X as is. (This step can be omitted if ν is known to be duplicate-free.) Otherwise, the ⟨collect input⟩ block (l. 9) maintains the stack and collects into X the tuples covered by t. The algorithm then computes f(t, X), constructs and yields an output tuple, and puts it onto the stack together with X for later reuse.
Regarding ⟨collect input⟩, let us first consider the bottom-up case (postorder input):

12  ⟨collect input⟩ — bottom up:
13    while S ≠ ⟨⟩ ∧ ¬H.is-before-pre(S.top().ν, t.ν)
14      [·, u, Xu] ← S.pop()
15      X.add(u)                        // leverage Xu if possible!

Since the input e is in postorder, previously processed tuples on S, if any, are postorder predecessors and as such on the descendant and preceding axes relative to t.ν,


in that order when viewed from the top of the stack (whereas upcoming tuples from e are postorder successors and thus will be on the ancestor or following axes). Therefore, the covered tuples X we need for t are conveniently placed on the upper part of S. The “while” loop (l. 13) collects and removes them, as they will no longer be needed. Any remaining S entries are preceding and irrelevant to t—as preorder and postorder predecessors—but might be consumed in a later iteration.
Let us now consider the top-down case (preorder input):

16  ⟨collect input⟩ — top down:
17    while S ≠ ⟨⟩ ∧ H.is-before-post(S.top().ν, t.ν)
18      S.pop()
19    if S ≠ ⟨⟩
20      for [ν, u, Xu] ∈ upper part of S where ν = S.top().ν
21        X.add(u)                      // leverage Xu if possible!

19 if S = 6 hi 20 for [ν, u, X ] upper part of S where ν = S.top().ν u ∈ 21 X.add(u) // leverage Xu if possible! S may, when viewed from the top, contain obsolete preceding tuples, then relevant covered ancestor tuples to add to X, then further non-immediate ancestors which may still be needed in a future iteration. The “while” loop (l. 17) first dismisses the preceding tuples. If there is an entry left on top of S (l. 19), it is a covered ancestor u <: t, and the “for” loop (l. 20) collects it and further tuples below with equal ν (if not distinct in e). Due to the tree-structured data flow, there cannot be any further covered tuples. Unlike in the bottom-up case, we cannot pop the covered entries after adding them to X, since they may still be needed for upcoming following tuples (e. g., a sibling of ν). Note that we do not need any explicit checks for <: in this algorithm—the covered tuples are identified implicitly. Note also that in l. 15 and 21, the full Xu state cor- responding to u.x is available to the add() operation. This state may be needed for non-trivial computations where u.x alone does not provide enough information, as we discuss in § 5.2.10. In case it is not needed, we do not need to keep the X objects (marked in teal in the code) on the stack at all. Likewise, we may include only the fields of u that are actually accessed by f to minimize memory consumption.

5.2.10 Structural Grouping: Further Discussion

Recall from § 5.1.5 and § 5.1.6 that the Hierarchical Grouping operator is primarily used for evaluating RECURSIVE expressions on hierarchical windows, and the Hierarchy Merge Groupjoin operator is used for non-recursive expressions (via a self-groupjoin of e with itself) as well as for certain classes of join–group–aggregate statements. Handling further details of hierarchical windows—such as the different variants of window frames and EXCLUDE clauses—would require further additions to the algorithms; in particular, tuples with equal Node values must be identified and handled as a group. As these adaptions are straightforward, we omit their discussion.

Inline Computations. The following optimization is crucial to the practical performance of HMGJ and HG: While their pseudo code literally collects the input tuples to the aggregation function f into a bag X, we can often avoid such buffering altogether


by evaluating the function on the fly. To this end the query compiler has to generate specific code in-place for the four different Aggregate operations:

1  X.clear()
2  X.add(u)
3  X.merge(X′)
4  f(t, X)

Consider the rollup Expression 1b from Figure 5.1 (p. 118): Rather than a list of tuples, the actual state of X would be a simple partial sum x : N, and the Aggregate operations would boil down to

1  x ← 0
2  x ← x + u.x
3  x ← x + X′.x
4  x + t.x

These definitions work with both HG and HMGJ. For a structurally recursive computation with HG, consider Expression 4c: Here the state remains the same as for Expression 1b, but operation 2 becomes x ← x + u.Weight ∗ u.x.
Eliminating the bag of tuples X like this is possible whenever either the scalar x value itself or some other data of O(1)-bounded size can adequately represent the required result information of a sub-computation. This characterization roughly corresponds to the classes of distributive (e. g., COUNT, MIN, MAX, and SUM) and algebraic (e. g., AVG, standard deviation, and “k largest/smallest”) aggregation functions identified by Gray et al. [38].
There are of course also SQL expressions, such as ARRAY_AGG or DISTINCT aggregates, for which we have to maintain the bag X literally, or at least some state of size Θ(|X|). Consider for example COUNT(DISTINCT Weight): To evaluate this using either HG or HMGJ, the Aggregate has to maintain a set of distinct Weight values. But even then, our mechanism for reusing the results of sub-computations can provide certain optimization opportunities: maintaining the distinct Weight values could, for example, be sped up by employing an efficient set union algorithm for the merge() operation.
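As a sketch of what such generated code might look like for the weighted, structurally recursive rollup, consider the following C++ fragment (assuming expression shapes analogous to 1b/4c above; the names are hypothetical):

#include <cstdio>

struct Tuple { double value, weight, x = 0.0; };   // x: previously computed result

struct PartialSum {                                // inlined Aggregate state
    double x = 0.0;
    void clear()                        { x = 0.0; }
    void add(const Tuple& u)            { x += u.weight * u.x; }  // recursive step
    void merge(const PartialSum& o)     { x += o.x; }
    double finish(const Tuple& t) const { return x + t.value; }   // f(t, X)
};

int main() {
    Tuple child1{2.0, 0.5, 2.0}, child2{3.0, 1.0, 3.0}, parent{1.0, 1.0};
    PartialSum X;
    X.add(child1); X.add(child2);
    parent.x = X.finish(parent);                   // 0.5*2 + 1*3 + 1 = 5
    std::printf("parent.x = %.1f\n", parent.x);
}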

Complexities. With the optimization opportunities discussed in the previous paragraph in mind, let us consider the time and space complexities of HMGJ and HG. If the computation is indeed inlined as discussed, the size of the Aggregate objects and the running times of their operations are in O(1). Under this assumption, the time and space complexity is O(|e|) for HG, and O(|e1| + |e2|) for HMGJ. However, if the computation cannot be inlined, we fall back to literally collecting the respective input tuples in the Aggregate objects. Our algorithms then essentially degenerate to merge joins, and their time and space complexities become O(|e1| + |e2| + |e1 ⋈ e2|).
To establish these asymptotic complexities, an amortized analysis is needed in order to argue that the inner loops of the HMGJ and HG algorithms do not contribute to the overall complexity. Regarding Hierarchical Grouping (§ 5.2.9), observe that the outer “for” loop pushes each e tuple at most once onto S (so |S| ≤ |e|), whereas the inner “while” loops remove one S entry per iteration; their bodies can thus be amortized against the respective pushes. Regarding Hierarchy Merge Groupjoin (§ 5.2.8), the loop bodies of l. 21 and l. 43 are executed |e2| times in total, regardless of the outer


loop; at most |e1| and |e2| tuples are pushed onto S1 and S2, respectively; and since the other loops pop either an S1 or an S2 entry per iteration, an analogous argument applies.

5.2.11 Hierarchy Rearrange

Syntax. Hierarchy Rearrange [H; ν; ς → ς′](e), where H is a hierarchy index, ν is a Node attribute name, and ς → ς′ is either “pre → post” or “post → pre.”
The input must have a Node field ν. Formally, e : {τ}b for some tuple type τ with a field ν : NodeH. The input is required to be sorted by ν in ς-order. The output is of type {τ}b and sorted by ν in ς′-order.

Functionality. To sort a table of nodes in preorder or postorder, we can always use an ordinary Sort operator with is-before-ς() as the comparison predicate:

Sort [<](e), where t1 < t2 :⟺ H.is-before-pre(t1.ν, t2.ν).

However, if the input happens to be sorted in either postorder or preorder already, the Hierarchy Rearrange operators can exploit that order to perform the same sorting task more efficiently.
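For reference, this generic fallback is just an ordinary sort with the is-before primitive as comparator, as in the following minimal C++ sketch (assumed pre ranks; illustrative):

#include <algorithm>
#include <cstdio>
#include <vector>

struct Node { int pre, post; };

int main() {
    std::vector<Node> e = {{3, 8}, {0, 9}, {1, 2}};
    std::sort(e.begin(), e.end(),
              [](Node a, Node b) { return a.pre < b.pre; });   // is-before-pre
    for (Node n : e) std::printf("(%d,%d) ", n.pre, n.post);   // now in preorder
    std::printf("\n");
}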

Remarks. The main use case of HR is as a “glue” operator to combine preorder-based and postorder-based operators such as HMJ, HMGJ, and HG in a plan.
To understand how the algorithms work in principle, note that the postorder sequence of the nodes in a hierarchy is somewhat comparable to the reverse preorder sequence in that a node appears before any of its ancestors. The difference is that in postorder the children of each node (and with them their subtrees) are arranged in their original order, whereas in reverse preorder they are reversed. Compare, for instance, the postorder and reverse preorder sequences of our familiar example hierarchy:

postorder:          C1 C2 B1 D1 D2 C3 D3 C4 B2 A1 A2
reverse preorder:   A2 D3 C4 D2 D1 C3 B2 C2 C1 B1 A1

To convert a general input e from preorder to postorder, we therefore have to delay yielding a tuple t until all its descendants have been processed. This reverses the “vertical” order of the input. At the same time, we need to make sure that all immediate descendants of the tuple t remain in their original order. This way we keep the “horizontal” order intact. The algorithms we discuss below achieve this by stashing their input tuples on a stack and carefully unwinding this stack at the point when the respective descendants (in the pre → post case) or ancestors (in the post → pre case) have been processed.

Algorithm for “pre → post.” This case is simple and efficient:

1  operator Hierarchy Rearrange [H; ν; pre → post](e)
2    S : Stack⟨τ⟩


3    for t ∈ e
4      while S ≠ ⟨⟩ ∧ H.is-before-post(S.top().ν, t.ν)
5        yield S.pop()
6      S.push(t)

7    while S ≠ ⟨⟩
8      yield S.pop()

The algorithm unconditionally stashes each input tuple t on the stack S. But first, the inner loop (l. 4–5) pops and yields any tuples whose node ν is on the preceding axis with respect to t.ν. The effect of performing this loop at each iteration is that runs of tuples in preceding–following relationships (such as a run of siblings) are yielded in their originally encountered order, whereas runs of tuples in ancestor–descendant relationships remain on the stack until all their descendants have been processed, and are then popped in reverse order. In other words, the “horizontal” order is maintained but the “vertical” order is reversed, which results in the desired postorder. However, the pseudo code as shown also reverses the relative order of runs of tuples with equal ν values (i. e., in self relationships). If this is not desired, the inner loop needs to be refined so as to first identify those runs and yield them in their original order.
The runtime complexity of this operator is in O(|e|). An attractive property is that e is pipelined at least partially. The worst-case space complexity is in O(|e|) as well. This worst case happens, for example, when all ν values are equal. However, if we assume that all nodes are distinct, then the stack size |S| is bounded by the height of the hierarchy H, since S then forms a run of tuples in ancestor–descendant relationships at any point in time.
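The pre → post case translates almost literally into code. A minimal C++ sketch, assuming distinct pre/post labels (illustrative, not the thesis implementation):

#include <cstdio>
#include <vector>

struct Node { int pre, post; };

void rearrangePreToPost(const std::vector<Node>& e) {   // e sorted by pre
    std::vector<Node> s;
    for (Node t : e) {
        // pop and yield everything on the preceding axis of t
        while (!s.empty() && s.back().post < t.post) {  // is-before-post
            std::printf("(%d,%d) ", s.back().pre, s.back().post);
            s.pop_back();
        }
        s.push_back(t);                                 // t waits for its subtree
    }
    while (!s.empty()) {                                // flush remaining ancestors
        std::printf("(%d,%d) ", s.back().pre, s.back().post);
        s.pop_back();
    }
    std::printf("\n");
}

int main() {
    // preorder of A(0,9) B(1,2) C(3,8) D(4,5) E(6,7); prints postorder B D E C A
    rearrangePreToPost({{0, 9}, {1, 2}, {3, 8}, {4, 5}, {6, 7}});
}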

Algorithm for “post → pre.” This case is comparatively involved. Due to the nature of postorder, pipelining is very limited; for instance, the root node of a hierarchy appears last in postorder but first in preorder. Since the algorithm cannot know whether it has already seen the “highest” ancestor in a run of tuples, it must buffer and analyze its whole input before it can yield the first output tuple.

1  operator Hierarchy Rearrange [H; ν; post → pre](e)
2    S : Vector⟨[t : τ, cov : int]⟩
3    for t ∈ e
4      if S ≠ ⟨⟩ ∧ S.top().t.ν = t.ν
5        c ← S.top().cov
6        S.top().cov ← |S| − 1
7        S.push([t, c])
8        continue

9      cov ← |S|
10     while cov ≠ 0 ∧ ¬H.is-before-pre(S[cov − 1].t.ν, t.ν)
11       cov ← S[cov − 1].cov
12     S.push([t, cov])

13   ⟨enumerate tuples on S⟩


The algorithm stashes each input tuple on the vector S, which is built up like a stack but also accessed in a random manner. For each tuple t it identifies the range of previously seen tuples that are “covered” by t, that is, on the self or descendant axes with respect to t.ν. The covered ranges are represented by the “cov” fields on the vector: cov is the index of the first (lowest) covered entry on S; that is, entry S[i] covers the entries S[S[i].cov, i[. In other words, the tuples S[cov, i[ are on the self or descendant axes, whereas the tuples S[0, cov[ are on the preceding axis with respect to S[i].ν. The block in l. 9–11 determines cov for t by hopping over S backwards (skipping any transitively covered entries). If, however, t.ν equals the previously processed node, special care has to be taken (l. 4–8): the entry for t then steals the cov value from the previous tuple u, whose cov value is adjusted to an “empty range.” This helps us identify runs of tuples with equal ν values later on: In each such run S[i1, i2], the uppermost tuple S[i2] covers the other tuples S[i1, i2[ in addition to the descendants, whereas each tuple in S[i1, i2[ covers only itself.
With the collected coverage information, the ⟨enumerate tuples on S⟩ block can finally enumerate the output in the desired order. It effectively performs a preorder traversal of the constructed vector:

14  ⟨enumerate tuples on S⟩:
15    Q : Stack⟨int⟩
16    for c ← |S|; c ≠ 0; c ← S[c].cov
17      c––
18      Q.push(c)

19    while Q ≠ ⟨⟩
20      c ← Q.pop()                     // current subtree: c
21      cov ← S[c].cov                  // limit for traversal
22      c′ ← c
23      if c′ ≠ cov
24        while c′ ≠ 0 ∧ (S[c′ − 1].t.ν = S[c′].t.ν)
25          c′––
26      for j ∈ [c′, c]
27        yield S[j].t

28    for c′; c′ ≠ cov; c′ ← S[c′].cov
29      c′––
30      Q.push(c′)

The stack Q stores the indexes of the remaining S entries to visit. The first loop (l. 16–18) initializes Q with the “top-level” tuples that are not covered by any other tuple. The main loop (l. 19) repeatedly picks the top-most tuple c on Q to visit. The tuples belonging to the subtree of S[c].ν can be found in the range S[S[c].cov, c]. The loop in l. 22–25 first identifies the range [c′, c] of equal nodes. S[c] and its equal nodes are then yielded in the original order (l. 26–27). The loop in l. 28–30 then enumerates c’s directly covered entries backwards and pushes them onto Q (similar to l. 16–18). This has the


effect that these entries will subsequently be visited in their correct (original) order.
In an actual implementation of this algorithm, it may be beneficial not to copy the input tuples onto S. Instead, the input can be materialized, so that S only needs to store the node ν and a pointer to the actual tuple. This approach improves cache efficiency if the tuples are large. As another optimization, we can flush the stack early whenever the previous node was a root. To this end, insert an is-root() check after l. 8:

if S ≠ ⟨⟩ ∧ H.is-root(S.top().t.ν)
  ⟨enumerate tuples on S⟩
  S ← ⟨⟩

5.3 Further Related Work and Outlook

This section points to relevant related work with regard to query processing on hier- archical data. We also outline several open problems in the context of our framework that would be interesting for future work.

Query Execution and Pipelining. Throughout this chapter we assumed a bottom-up push-based model of query execution, as described in [76]. This model allows algorithms which process their input in a single pass—such as our HIJ, HMJ, HMGJ, and HG operators—to “pipeline” the tuples in the sense that the relevant attribute values (in particular, the Node values) can potentially remain in processor registers during execution instead of being materialized in memory. With engines that compile query plans to native machine code, these algorithms can deliver excellent performance. However, binary operators can inherently pipeline only one of their inputs and always have to materialize the other input. (Our operators by convention materialize their right input.)
In the pull-based iterator model [37], both inputs would be accessed through an iterator interface with methods open(), next(), close(), et cetera. We can adapt our algorithms to this model by simply rewriting the outer loops over e1 and the accesses to e2 in terms of iterators. Although the iterator abstraction limits performance due to function call overhead, it is conceptually elegant in that both inputs of binary operators are handled uniformly. When they are accessed strictly sequentially, they can also be “pipelined” in the sense that all tuples are considered just in time without any upfront materialization.
In theory, the performance gains through pipelining can be maximized (and the overall memory usage reduced) by choosing the bigger input as the pipelined input and materializing the smaller input. However, some inherent asymmetries in our join algorithms (HIJ, HMJ, HMGJ) complicate matters: swapping the inputs and inverting the join condition does more than just swap the pipelined/materialized roles; it usually requires using a different variant of the algorithm with different properties that may impact the rest of the query plan. One therefore wants the query optimizer to be aware of the plan-relevant characteristics of the different variants of each algorithm, and to


also reflect their performance differences in the cost model. Studying these issues and how they affect algorithm selection is a possible direction of future work.

Operators on Tree-Structured Data. Since XML documents are inherently hierarchical and commonly stored in XML-enhanced relational databases in the form of tables (see § 2.3), algorithms for the efficient processing of queries on tree-structured data were also studied in that context. These are sometimes referred to as structural or tree-aware join operators. Examples are MPMGJN [118], tree-merge and stack-tree [53], Staircase Join [41], and Twig2Stack [19]. Similar to our algorithms, they in general work on ordered inputs, use logic reminiscent of merge self-joins, and leverage an available (though usually hard-wired) tree encoding. [19] notably also works with bottom-up processing on postorder inputs, whereas other operators (e. g. [53]) usually require their inputs to be in document order (i. e., preorder).
Beyond binary structural joins, powerful path and tree pattern matching operators were proposed in the XML context; for example, TwigStack [11] and variations [51] for handling so-called twig joins. Translated to the SQL world, these operations could be characterized as n-way joins based on a tree-shaped pattern. While such complex patterns are comparatively easy to express in XPath, they are beyond our requirements for handling hierarchical data in RDBMS.
Although they are a fruitful source of inspiration, not all techniques from the XML world fit into our setting. Most of the more sophisticated join operators were designed to work directly on appropriately pre-processed and indexed XML documents. In contrast, our operators are not restricted to the hierarchical base table itself; they can be applied to arbitrary input tables containing a NODE field. As indexing the inputs on the fly seems infeasible, we rely only on the hierarchy index of the base table. Many of the proposed techniques are not applicable in this setting. As an example, we consider Staircase Join [41]. This operator performs an XPath axis step given a set of context nodes. It achieves high performance using sophisticated techniques to skip over irrelevant nodes. However, these techniques require direct access to the indexed XML document table, which needs to be encoded using the pre/post labeling scheme and physically sorted in preorder. In terms of SQL and our algorithms, an axis step is comparable to a right semi-join between a table of context nodes and the hierarchical base table, with duplicate elimination on the result (DISTINCT). For queries of this type, Staircase Join could be roughly emulated by combining a Hierarchy Index Join with a Staircase Filter on the left input. However, we expect the share of queries that will benefit from this optimization to be rather limited in the SQL context—as opposed to XPath, where they are routine.

Further Applications. We noted earlier that many algorithms from the literature were originally hard-coded against specific labeling schemes. As our algorithms are formulated in terms of a small set of index primitives, they are generic in the sense that we can plug in arbitrary indexing schemes in the backend. This ensures a clean separation between the physical operators and the layer of index implementations. Thus, an interesting direction of future work would be to redesign some of the more sophisticated algorithms from the literature in terms of our index interface. As an example, the following pseudo code shows an adaption of the mentioned Staircase Join for the descendant axis:

function staircasejoin_desc(doc, context)
  for each successive pair (c1, c2) in context
    scanpartition_desc(c1, c2)
  c ← last node in context
  n ← end of doc
  scanpartition_desc(c, n)

function scanpartition_desc(c1, c2)
  for (c ← H.pre-next(c1); c ≠ c2; c ← H.pre-next(c))
    if H.is-before-post(c, c1)
      yield c
    else
      break  // skip

As Staircase Join relies only on pre/post traversal and axis checks, converting the algorithm from the original definition (algorithms 2 and 3 in [41]) is particularly easy. However, such an adaption should be possible for other existing algorithms as well. This way our framework of indexing schemes and query processing algorithms could potentially be leveraged in a much broader range of database systems and applications, XPath processing being just one example.
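To illustrate, a C++ transcription of scanpartition_desc against a hypothetical index interface could look as follows; the HierarchyIndex members stand in for the pre-next() and is-before-post() primitives, and all names and signatures are our own assumptions rather than the actual interface of our framework:

#include <functional>
#include <optional>

struct Node {                                    // opaque node handle
    long id;
    bool operator==(const Node& o) const { return id == o.id; }
};

struct HierarchyIndex {                          // assumed primitive signatures
    std::function<std::optional<Node>(Node)> pre_next;   // next node in preorder
    std::function<bool(Node, Node)> is_before_post;      // postorder comparison
};

// Emit all descendants of c1 up to (but excluding) c2 in preorder; the
// is_before_post check detects when the scan has left c1's subtree, in which
// case the remainder of the partition can be skipped entirely.
template <class Yield>
void scanpartition_desc(const HierarchyIndex& H, Node c1,
                        std::optional<Node> c2, Yield yield) {
    for (auto c = H.pre_next(c1);
         c && !(c2 && *c == *c2);
         c = H.pre_next(*c)) {
        if (H.is_before_post(*c, c1))
            yield(*c);
        else
            break;                               // skip the rest of the partition
    }
}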

Techniques for Grouping and Aggregation. Our algorithms HMGJ and HG perform a particular kind of grouping on relational tables. Therefore, relevant related work comes from the extensive literature on efficiently evaluating GROUP BY in SQL. Commonly, either sort-based or hash-based methods are used [37]. Similar to ordinary sort-based grouping, our operators rely on ordered inputs and are order-preserving. Groupjoin, a.k.a. binary grouping [15, 65, 74], improves join–group–aggregate plans by fusing ⋈ and Γ, which allows the engine to avoid materializing the intermediate join result. While [74] discusses mainly hash-based equi-groupjoins, [65] and [20] also consider the non-equi case. This case is more comparable to our setting, although identifying the groups is more involved with structural grouping. [20] discusses a method based on so-called θ-tables, which enable reusing the results from “covered” groups and are roughly comparable to the stacks our operators are based on.

GROUP BY CUBE and ROLLUP can be viewed as constructs for multi-dimensional hierarchical grouping, although they work only on a specific form of “denormalized” tables (see § 2.3.5). [38] discusses approaches to implement these constructs. The naïve approach is to execute 2ᵏ separate aggregation operations and collect their results using UNION. A more sophisticated approach uses a dedicated single-pass operator that is able to reuse results of lower levels. This idea is similar in spirit to our approach. MD-Join [15] is another kind of groupjoin for evaluating CUBE, which is also able to reuse results from finer-grained groupings for coarser-grained groupings.


Windowed Tables [116] are another powerful way of grouping in SQL; they resemble a “self-groupjoin.” Their efficient evaluation has received surprisingly little attention from the research community. [58] is a recent guide with a focus on modern multi-core architectures. Alas, techniques for standard windowed tables cannot easily be adapted to our hierarchical windows due to their unique semantics.

Estimation and Statistics for Hierarchical Data. Being able to precisely estimate the output sizes of individual plan operations is a highly important basis for cost-based query optimization. In connection with queries on hierarchical data, cardinality estimation comes into play when a hierarchy predicate such as “H.depth(ν) ≤ 3” is used to filter tuples; when filtering hierarchy operators are used, such as Staircase Filter, or Index Scan with additional predicates; and most importantly, for the output of a hierarchy join, taking into account the axis and the type of join.

Cardinality estimation is already difficult for ordinary predicates and equi joins. For hierarchy joins the estimation errors tend to be even less predictable. Simple statistics for estimating equi join results—such as the number of distinct key values in the two inputs—are less useful for hierarchy joins. For example, consider a join on the descendant axis where one distinct NODE value is joined with a table of leaves. If the left input node happens to be the root, the join can easily degenerate into a cross product, but it could as well be empty if it happens to be a leaf itself. Furthermore, any specific formula for estimation would have to incorporate knowledge or assumptions on the hierarchy structure. A hierarchy of N nodes might contain only roots and thus be extremely flat, but it might also be unusually deep (e. g., more than 8 levels) or even a single linear chain of child nodes (H.height(⊤) = N). A pragmatic model for cardinality estimation would be to assume a basic regular structure where each node has a fixed number of k children, and to set k to the average number of children in the actual data. Based on such a model, formulas could be stated to estimate, for instance, the total number of descendants for a (known) set of nodes. While such hand-crafted heuristics might work well in sane cases, more elaborate models for cardinality estimation are desirable, such as path query synopses [119]. Techniques such as sampling and re-optimization [107] would be useful in the context of hierarchical data as well. This is an interesting line of future work.

A related topic is statistics on hierarchical tables. To support cost estimation, one wants to be able to efficiently gather compact but useful summarized data about the structure of a hierarchy at bulk-build time, and update the information at regular intervals. For example, if we know only the minimum, average, and maximum number of children of the (non-leaf) nodes—three integers that can be collected during a single hierarchy traversal—we can already compute a reasonable estimate of the cardinality of an is-parent() join. The hierarchy index interface itself already provides access to interesting measures: For example, the higher the average depth() or size() of a set of nodes, the fewer join partners they will presumably have in a join on the descendant axis.
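As a back-of-the-envelope illustration of such a regular-structure model (our own worked example, with all assumptions stated): suppose the hierarchy is a complete k-ary tree of height h with the root at level 1. A node at level ℓ then has

\[
  \operatorname{size}(\ell) \;=\; \sum_{j=0}^{h-\ell} k^{j} \;=\; \frac{k^{\,h-\ell+1}-1}{k-1}
  \qquad\text{and}\qquad
  \operatorname{desc}(\ell) \;=\; \operatorname{size}(\ell)-1 \;=\; \frac{k^{\,h-\ell+1}-k}{k-1}
\]

nodes in its subtree and proper descendants, respectively. A join of a known set C of context nodes with the full table on the descendant axis would accordingly be estimated at \( \sum_{c \in C} \operatorname{desc}(\operatorname{depth}(c)) \) result tuples, with k set to the average observed fanout.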


Yet another challenge is cost estimation, where the query optimizer estimates the CPU and memory access costs of physical operators in order to assess alternative plans. With hierarchy operators the true costs do not only depend on the chosen algorithm but also on the type of hierarchy index. While our index interface offers certain asymptotic complexity guarantees, the observable runtime costs of the primitives may differ heavily between different indexing schemes. For example, PPPL can answer most primitives straight from the label and makes node() a no-op, whereas other schemes deal with variable-length labels and complex auxiliary data structures. To account for these differences, one could enhance the cost model by (statically) assigning an appropriate CPU and memory access cost to each index primitive. The costs of a particular algorithm would then be computed in terms of the input cardinalities and the index primitives it uses. This is another worthwhile topic for future work.

Plan Generation and Rewriting. In § 4.1.4 and § 5.1.2 we already explored most of the relevant equivalences and properties of our hierarchy functions and predicates. In future work, it will be interesting to examine further rewrite rules on the plan level regarding the different physical operators, like the rewrites between binary and unary structural grouping we discussed in § 5.1.7 as an example. Such rewrites are not trivial to perform, but they allow the engine to leverage the available physical operators as effectively as possible and can thus yield substantial performance gains.

Another challenge is to deal with hierarchical sorting effectively. Queries on hierarchies often request the output to be sorted. Most of our operators also require sorted inputs in preorder or postorder, and some of them retain certain orders in the output. Once the data has been sorted in the plan, subsequent operators may therefore benefit as well. The query optimizer should leverage these properties by employing explicit Sort operations using the is-before-pre() and is-before-post() primitives; where possible, Hierarchy Index Scan operators on the base table to establish the desired order in the first place; chains of order-preserving operators to retain the order, once established; and in particular, Hierarchy Rearrange when a conversion from preorder to postorder, or vice versa, is necessary. Thus, techniques such as maintaining interesting orders for query plans [94] become particularly relevant for queries on hierarchies.

Enhanced Index Scans. Hierarchy Index Scan (§ 5.2.1) is a critical operator: it provides the access path to the hierarchical table and also serves as the basis of Hierarchy Index Join. It therefore seems worthwhile to further enhance its functionality. In § 5.2.1 we already outlined an approach to handle certain φ filters implicitly by delimiting the scan range upfront. This idea can be extended to further types of predicates, in particular axis checks where φ = (H.axis(ν, ν₀) ∈ A) for a constant ν₀ and a set of axes A, and arbitrary disjunctions thereof. However, many common predicates such as d₁ ≤ H.depth(ν) ≤ d₂ cannot be handled by simply adjusting the overall scan range. For such predicates, significant speedups might still be achieved via fine-granular skipping over index ranges that cannot possibly fulfill the predicate, in the spirit of Staircase Join [41]. Further potential could be unlocked by integrating HIS more deeply with the underlying indexing scheme. For example, a hypothetical index maintaining lookup tables of the roots and leaves could handle predicates such as H.is-leaf(ν) and H.is-root(ν) by simply traversing over the available tables.
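To make the skipping idea concrete, here is a hedged C++ sketch for the depth-range predicate. We assume a hypothetical interval-style encoding in which a node's subtree occupies the preorder positions [begin, end); a real hierarchy index would jump via its own traversal primitives instead of a binary search over an array:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct NodeEntry { uint32_t begin, end; };      // [begin, end) spans the subtree

// Emit all nodes with d1 <= depth <= d2 from a preorder-sorted node array.
// Once a node at depth d2 is reached, its entire subtree is skipped in one jump.
template <class Callback>
void scan_depth_range(const std::vector<NodeEntry>& nodes,
                      uint32_t d1, uint32_t d2, Callback cb) {
    std::vector<uint32_t> open;                 // 'end' bounds of open ancestors
    std::size_t i = 0;
    while (i < nodes.size()) {
        const NodeEntry n = nodes[i];
        while (!open.empty() && n.begin >= open.back())
            open.pop_back();                    // we left these subtrees
        uint32_t depth = static_cast<uint32_t>(open.size()) + 1;
        if (d1 <= depth && depth <= d2) cb(n);
        if (depth == d2) {                      // no deeper node can qualify:
            auto it = std::lower_bound(         // jump past n's subtree
                nodes.begin() + i + 1, nodes.end(), n.end,
                [](const NodeEntry& e, uint32_t pos) { return e.begin < pos; });
            i = static_cast<std::size_t>(it - nodes.begin());
        } else {
            open.push_back(n.end);
            ++i;
        }
    }
}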


A related topic is evaluating certain expressions incrementally during the scan in order to avoid explicit per-node index calls (as the basic Map operator would do). The idea is to call the index primitive eᵢ only once to obtain its initial value eᵢ(c₁) at the beginning of the scan range, then derive subsequent values from the previous value on each pre or post traversal step performed by the scan. Functions that can in principle be computed incrementally are pre-rank(), post-rank(), depth(), is-root(), height(), and—by peeking at the node ahead—is-leaf(). This technique would make the evaluation of such functions a matter of a few simple arithmetics per scanned node—even for dynamic indexes where they would otherwise involve costly algorithms.
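A hedged sketch of one possible realization in C++: we assume an interval-style encoding where each node's subtree occupies the preorder positions [begin, end), so that depth() falls out of a stack of open ancestor bounds with a few operations per scanned node. The thesis' traversal primitives would drive equivalent bookkeeping; NodeEntry and the stack-based formulation here are our own illustration:

#include <cstdint>
#include <vector>

struct NodeEntry { uint32_t begin, end; };      // [begin, end) spans the subtree

// Invoke cb(node, depth) for every node of a preorder-sorted array, computing
// depth() incrementally instead of issuing a per-node index call.
template <class Callback>
void scan_with_depth(const std::vector<NodeEntry>& nodes, Callback cb) {
    std::vector<uint32_t> open;                 // 'end' bounds of open ancestors
    for (const NodeEntry& n : nodes) {
        while (!open.empty() && n.begin >= open.back())
            open.pop_back();                    // ancestors whose subtree ended
        cb(n, static_cast<uint32_t>(open.size()) + 1);
        open.push_back(n.end);                  // n is now an open ancestor
    }
}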

Parallel Algorithms. While our operators are generally efficient in terms of both asymptotic complexities and “constant factors,” they might be confronted with huge data sets especially in analytic applications. Parallelizing the algorithms is therefore an interesting line of future work. Consider the hierarchy join operation as an example. A helpful property of (inner) joins is that one input side can in principle be chunked and freely processed by parallel workers. A simple approach for a parallel Hierarchy Merge Join (and similarly, HIJ) would thus be to materialize the right input e₂ upfront as usual, then partition the left input e₁ and run the single-threaded algorithm HMJ(e₁′, e₂) on each sorted partition e₁′ against e₂. If the output order is irrelevant, HMJ (unlike HMGJ) does not strictly need e₁ to be globally sorted; each worker can sort its partition by ν₁ individually. If a consolidated output table must be produced, the individual outputs can simply be concatenated.
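A minimal sketch of this simple strategy, using std::thread; the InTuple/OutTuple types and the hmj_single_threaded callable are placeholders for the actual operator code, i. e., this is our own illustration rather than the engine's implementation:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

template <class OutTuple, class InTuple, class Join, class Less>
std::vector<std::vector<OutTuple>>
parallel_hmj(std::vector<InTuple> left,          // input e1, to be partitioned
             const std::vector<InTuple>& right,  // input e2, materialized upfront
             std::size_t workers,
             Join hmj_single_threaded, Less preorder_less) {
    std::vector<std::vector<OutTuple>> results(workers);
    std::vector<std::thread> pool;
    std::size_t chunk = (left.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            std::size_t lo = std::min(w * chunk, left.size());
            std::size_t hi = std::min(lo + chunk, left.size());
            // Each worker sorts its own partition; a global sort of e1 is not
            // required if the output order is irrelevant.
            std::sort(left.begin() + lo, left.begin() + hi, preorder_less);
            // Run the single-threaded HMJ of this chunk against all of e2.
            results[w] = hmj_single_threaded(left.begin() + lo,
                                             left.begin() + hi, right);
        });
    }
    for (auto& t : pool) t.join();
    return results;      // concatenating results[0..workers-1] yields the output
}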

Note that for each e₁′ partition the whole right input e₂ is potentially relevant. Scanning e₂ multiple times will significantly increase the total amount of work performed (e. g., in terms of is-before() calls) over the single-threaded case. The major challenge therefore is to optimally divide the work in order to maximize the gains from concurrent processing. Splitting the (left) input into equal-sized chunks is not sufficient, as the actual amount of work on the chunks may differ greatly due to a kind of “skew” that is inherent to many hierarchy operations. Consider a basic join e₁ ⋈ e₂ on the descendant axis: Input tuples from e₁ whose associated nodes are close to the root will match with many tuples from e₂, whereas leaf nodes may not find any join partners at all. When we divide e₁ into partitions e₁ⁱ, a desirable goal is to balance the actual amounts of work O(|e₁ⁱ| + |e₁ⁱ ⋈ e₂|) as far as possible. To achieve a reasonable distribution, it may well be worthwhile to perform a first quick pass through the data to assess the skewness for the purpose of partitioning. Alternatively, work-stealing schemes such as the morsel-driven approach of [57] can elegantly circumvent many of the issues.

Besides the challenging problem of partitioning, the situation is even more complicated for HSF, HMGJ, HG, and HR, where each result tuple can have data dependencies to the previously processed nodes. A worker on partition j may depend on results from partition j − 1 to be able to complete its work. Consider the bottom-up variant of Hierarchical Grouping, which is designed to reuse partial results from previous tuples. Suppose we are running HG against our familiar example hierarchy, assuming it has already been (globally) sorted in postorder and split into two halves as follows:


[Figure: the postorder sequence C1 C2 B1 D1 D2 | C3 D3 C4 B2 A1 A2, split into two halves; dashed edges in the original figure indicate data dependencies.]

Two workers could work on the two chunks in parallel to a certain degree. For example, worker 1 can fold nodes C1 and C2 into B1, and worker 2 can fold D3 into C4. However, due to the data dependencies indicated by the dashed edges, both are unable to complete their work: Worker 1 cannot output B1, D1, or D2, because they are covered by upcoming nodes, and worker 2 cannot process C3 before D1 and D2 are ready. The workers can, however, easily detect these data dependencies: In the bottom-up HG case, worker j simply needs to check whether the last node of partition j − 1 (D2 in our example) is a descendant of its currently processed node. A promising approach for stack-based algorithms like HG would thus be to let each worker run a modified HG algorithm on its partition, which processes all nodes that can be folded as usual, but detects and marks nodes that are on either end of a data dependency. Each worker would then end up with a stack of “incomplete” nodes that need further processing. For our example this would look as follows:

S⁽¹⁾ = ⟨{C1, C2, B1}, {D1}, {D2}⟩    S⁽²⁾ = ⟨{C3}, {D3, C4}, {B2}, {A1}, {A2}⟩

Here, sets of nodes indicate partially folded subtrees. These results cannot be yielded to the parent operator yet, as the dotted nodes from the second partition depend on them. Worker 2 just marks these dotted nodes on the stack without processing them. After both workers finish, a subsequent single-threaded merge step needs to pick up their unfinished stacks, finish the processing, and yield the actual output tuples. In sane cases, the workers can be expected to complete a decent amount of work in parallel before the non-parallel merge step takes over: Assuming a regular hierarchy structure, perfectly equal-sized partitions, and only distinct nodes in the input, the size of the stacks is proportional to the hierarchy height h. Thus, the number of elements to merge is bounded by the number of partitions times h. Unfortunately, one can just as well construct insane hierarchy structures for which parallelization is inherently impossible. A practical solution would have to detect such cases and fall back to non-parallel processing. These ideas could be applied to binary operators such as HMJ as well. Further research is needed to develop robust algorithms, study their properties, and assess their benefits against the sequential algorithms.

6 Experimental Evaluation

In this chapter we demonstrate the feasibility of our proposed framework for hierarchical data by means of an experimental evaluation, based on a prototypical execution engine that implements its key components. We conduct a series of experiments with synthetic and real-world data, covering both small and extremely large datasets in order to gain insights into caching effects and scalability. Our experiments in particular comprise the query primitives of the hierarchy index interface described in § 4.1.2, the bulk-loading algorithms of § 4.6, and the physical algebra operators of § 5.2. With respect to indexing, we concentrate on those hierarchy indexing schemes we recommend as a default choice in § 4, which includes a member of our own family of indexing schemes, the BO-Tree. Where feasible, we compare our framework to state-of-the-art alternative approaches as well as common ad-hoc solutions, such as recursive common table expressions operating on adjacency lists.

6.1 Evaluation Platform

The platform for our experiments is a standalone single-threaded execution engine written in C++. It allows us to hand-craft query plans based on a push-based physical algebra as described in the previous chapter. All hierarchy operators we discussed in § 5.2 by design fit into this execution model. In addition to these operators, we implemented all SQL data types, as well as simple but reasonably efficient textbook algorithms for the essential operators of relational algebra. The base data is stored column-wise without further compression (not even dictionary encoding). Through careful use of C++ templates, no virtual function calls are necessary in the hot paths, and heavy inlining and low-level optimizations can be performed by the compiler. GCC 5.2.1 with -O3 is able to translate query plans constructed from the physical algebra into efficient machine code, where no operator boundaries are apparent within the code fragments representing the different pipelines. Thus, the friction losses due to the physical algebra abstractions are minimal, and the resulting code is comparable in quality to what hand-written code would emit.

Results from experiments based on our research prototype were previously published in [8], [31], [9], [32], [33], and [10]. In particular, experiments § 6.3.1, § 6.3.4, and § 6.3.7 were partially covered in [9]; experiments § 6.3.2 and § 6.3.3 were also covered in [32] and [33]; and experiments § 6.3.5, § 6.3.6, and § 6.3.8 were also covered in [10]. While this chapter does not include all of the previously published results, it provides an extended discussion of those experiments covered, and updates the reported numbers to reflect the most recent state of our prototype. Furthermore, all measurements are obtained from a single machine, which differs from the setup used in some of our previous publications. However, these differences do not affect any of the conclusions we draw from our evaluation.

According to a few micro-benchmarks we conducted, it is also in the same order of magnitude as the single-threaded performance of modern engines such as HANA Vora [91]. This renders our execution engine a simple yet flexible and meaningful experimentation platform for evaluating our framework and in particular our proposed physical algebra operators.

Our test machine runs Ubuntu 15.10 and has two Intel Xeon X5650 CPUs at 2.67 GHz (6 cores, 2 hyperthreads each), 12 MB L3 cache, and 24 GB RAM. Our experiments are generally single-threaded and ensure that all data fits into RAM.

6.2 Test Data

For several of the experiments we use a hierarchical table HT with a schema similar to the one in Figure 2.3 (p. 17):

CREATE TABLE HT (
  Node    NODE PRIMARY KEY, -- size is index-dependent
  ID      CHAR(8) UNIQUE,   -- 8 bytes
  PID     CHAR(8),          -- 8 bytes
  Weight  TINYINT,          -- 1 byte
  Payload BINARY(p)         -- p bytes
)

Each tuple has a primary key Node, a unique ID, and a Weight randomly drawn from the small domain [1, 100]. The PID column stores, for each node, the ID of the parent row. We use it only in query plans based on RCTEs. The table is stored column-wise and clustered by Node in preorder. There is a unique hash index on ID. In the experiments we assess the performance on varying hierarchy sizes by varying the table size |HT| exponentially from 10³ to 10⁷, so as to also cover loads that by far exceed processor cache capacity. Already at |HT| = 10⁶ all indexes exceed the L3 cache, yielding “worst case” uncached results. Regarding the hierarchy structure, we use the following test hierarchies:

• BOM⟨N⟩: The structure of this hierarchy models a real-world bill of materials table. It is derived from a materials planning dataset of a large SAP customer. The original size is in the range of 10⁵ nodes, and the average height of the individual trees is 10.3. The overall shape of the hierarchy is fairly regular and has no particularly challenging distortions. To expand the size to up to N = 10⁷ nodes, we remove or replicate trees in such a way that the essential structural properties of the original hierarchy, especially the overall height, are preserved.

• BOM⟨N, s⟩: For experiments involving subtrees of a specific size, we derive a family of hierarchies BOM⟨N, s⟩ containing N nodes like BOM⟨N⟩, but with a different shape: All children of the super-root ⊤ are trees of size s. We obtain each tree by choosing a random subtree of size at least s from BOM⟨N⟩ and then removing random nodes to downsize it to s. Using BOM⟨N, s⟩, we are able to easily pick random subtrees of size s among the children of ⊤.


• Regular⟨N, k, s⟩: To be able to control and vary the size, depth, and “width” of the overall hierarchy while maintaining a homogeneous structure, we use an artificially generated hierarchy structure in some experiments. Regular⟨N, k, s⟩ is a forest of N nodes in total, where each inner node is assigned exactly k children. Each tree in the forest is limited to s nodes in total. This way increasing N results in further trees of size s being added, but does not affect the total height of the forest. The height is controlled via k. (A sketch of such a generator follows after this list.)
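The following is a hedged sketch of such a generator; the thesis does not spell out generator code, so the parent-array representation and the (i − 1)/k parent rule are our own illustration of the stated structure:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build Regular<N, k, s> as a parent array (parent[i] == -1 marks a root).
// Within each tree of up to s nodes, node i's parent is (i - 1) / k, so every
// inner node receives exactly k children (the last inner node possibly fewer
// when s - 1 is not a multiple of k).
std::vector<int64_t> make_regular(int64_t N, int64_t k, int64_t s) {
    std::vector<int64_t> parent(static_cast<std::size_t>(N));
    for (int64_t base = 0; base < N; base += s) {
        int64_t treeSize = std::min(s, N - base);
        parent[static_cast<std::size_t>(base)] = -1;   // root of this tree
        for (int64_t i = 1; i < treeSize; ++i)
            parent[static_cast<std::size_t>(base + i)] = base + (i - 1) / k;
    }
    return parent;
}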

Regarding the hierarchy index, we cover the main indexing schemes we suggest in § 4.3 as well as a few additional alternatives of interest:

• PPPL: The simple Pre/Post/Parent/Level labeling scheme, which we fully describe in § 4.4. PPPL does not support any of the update primitives and thus is feasible only for read-mostly analytic scenarios, where the hierarchy is loaded once and changed rarely, if ever. As all query primitives boil down to very cheap O(1) arithmetics directly on the cache-resident Label objects, this is as fast as a hierarchy index can get. Measurements with PPPL thus indicate the upper performance bounds.

• BO-Tree: The BO-Tree indexing scheme outlined in § 4.5. We use a configuration with mixed block sizes and gap back-links, which we recommend as a good trade-off following our study in [33]. The BO-Tree indexing scheme is highly dynamic and at the same time robust against unfavorable “skewed” update patterns. This makes it a good all-round fit for dynamic OLTP scenarios. However, its merits come at a cost: Due to its elaborate structure—where each Label has links to two entries in the B+-tree structure—the query primitives become computationally non-trivial O(log |HT|) operations, and accessing the index blocks incurs additional cache misses. Comparing the measures of BO-Tree to those of PPPL thus gives us a good hint of the overhead to expect from a sophisticated dynamic index structure.

• Ordpath: We also implemented and measured the Ordpath labeling scheme as described in [79] (see also § 4.2.4). Here a Label stores a binary-encoded root path. The labels (NODE column) are indexed by a standard B-tree. Path-based schemes such as Ordpath generally support dynamic OLTP scenarios well, although their query primitives are slowed down by the variable-size labels, and, in comparison to BO-Tree, they are somewhat less robust and less flexible regarding complex updates (cf. [33]). We nevertheless include the measurements due to the popularity of path-based labeling schemes in commercial systems.

• Adjacency is an indexing scheme implementation which emulates the trivial adjacency list scheme. The Label object stores a pair (ID, PID) of integers. There is a (unique) hash index on ID and another (non-unique) hash index on PID. This scheme is very widespread in practice due to its simplicity. However, it does not efficiently support the query primitives nor the complex update primitives of a full-blown hierarchy indexing scheme. Nevertheless, we implemented the index interface as far as possible, and include the numbers as an indication of the performance of RDBMS which do not adopt indexed hierarchical tables.

• DeltaNI: We also include our DeltaNI indexing scheme [31] as another example of a sophisticated index-based scheme, although it has not been a focus of this thesis. As a versioned index that is able to handle time-travel queries, it unsurprisingly comes with significant overhead in comparison to the other contenders. However, as mentioned in § 4.2.5, DeltaNI can be useful in scenarios featuring temporal hierarchies.

This selection includes two proper “dynamic” indexing schemes. As § 4.2 shows, there is actually a much wider range of dynamic schemes to choose from. Each of them will of course exhibit more or less different characteristics. For example, by replacing the BO-Tree by the O-List and tweaking the back-link representation and the block sizes, query performance could be further boosted. An O-List with a sufficiently large block size outperforms BO-Tree in queries by roughly 50%, although it also becomes less robust in dealing with skewed insertions and relocations of large subtrees and ranges. That said, a systematic comparison of dynamic indexing schemes is not our focus in this chapter. For more detailed insights we refer to our study in [32].

6.3 Experiments

In the following subsections we present the individual experiments we conducted. Each of these experiments focuses on a particular aspect of our framework.

6.3.1 Hierarchy Derivation

As a starting point, we evaluate the performance of building a hierarchy from scratch based on a table in the adjacency list format. This demonstrates the performance of the bulk-building process we described in § 4.6.

Setup. To obtain the source data for bulk-building, we create a table Source in the adjacency list format. We assess the performance on varying input table sizes by scaling Source from N = 10³ to 10⁷. The table contains an ID primary key column and a PID column referencing the superordinate part. Both are of type INTEGER. The ID column is populated with generated IDs, and the PID column is populated so as to reflect the structure of our BOM⟨N⟩ hierarchy. Additionally there is a 16 byte Payload field, which models data attached to the nodes that is to be included in the hierarchical table. Deriving a hierarchy from Source is expressed as follows using our SQL extensions:

SELECT Node, ID, Payload
FROM HIERARCHY (
  USING Source JOIN PARENT p ON PID = p.ID
  SEARCH BY ID
  SET Node
)

We assume table Source to be already materialized in memory, so that Step 1 of the process described in § 4.6.3 (see p. 97) is skipped. For the initial outer self-join (Step 2.1 in § 4.6.4, p. 98) we use a hash-based algorithm. Step 2.2 involves sorting by ID, for which we use a (single-threaded) standard library algorithm. In this case there is no explicit START WHERE clause that would need to be handled (Step 2.3). For purposes of comparison, we measure the time of a recursive common table expression over Source using semi-naïve evaluation. This RCTE “discovers” the hierarchy structure in the source table through iterative joins over the ID–PID association:

WITH RECURSIVE RCTE (ID, Payload) AS (
  SELECT ID, Payload FROM Source WHERE PID IS NULL
  UNION ALL
  SELECT v.ID, v.Payload FROM Source v JOIN RCTE u ON v.PID = u.ID
)
SELECT * FROM RCTE

Such a recursive join could be used to implement the alternative iterative approach to the adjacency list transformation, as outlined in § 4.6.5. Note that the measured RCTE does not perform duplicate elimination and thus cannot detect and eliminate cycles. Adding corresponding logic would make it considerably slower. Note also that ID and PID are processing-friendly INTEGER columns rather than, for example, VARCHAR columns which one often encounters in practice; this also benefits the iterative approach. The join algorithm we use for the RCTE is hash-based, just as for the HIERARCHY() statement.

Observations. Figure 6.1 displays the measurements. The shown times comprise all steps 1–4 of evaluating the HIERARCHY() statement. The right-most bar shows the time of the RCTE as an approximation of the mentioned iterative bulk-building approach. The differences between the indexing schemes are not surprising. The static PPPL indexing scheme can be constructed with extremely low overhead (see the algorithm in § 4.6.2). The overhead for BO-Tree is also quite moderate due to the efficient bulk construction algorithm of the underlying B+-tree. DeltaNI takes around twice as long for constructing its non-trivial index structure based on binary trees, which are not particularly memory-efficient. Ordpath involves comparatively costly manipulation of its variable-length strings during B+-tree construction.

Most importantly, we see that—regardless of the chosen indexing scheme—bulk-loading a hierarchy index is considerably faster than the RCTE, especially so for large and deep hierarchies. The main reason for this is that the RCTE performs repeated (iterative) joins, whereas bulk-loading involves only a single self-join on the (complete) source table. The runtime of the RCTE is therefore proportional to the height of the loaded hierarchy (which in the case of the BOM⟨N⟩ hierarchy is 10.3). Therefore, in this case—where no START WHERE clause was given—the iterative approach can be considered inferior to the non-recursive approach of evaluating a HIERARCHY() statement.

[Figure 6.1: Experimental results for § 6.3.1 (time per node in µs over input size N, for PPPL, BO-Tree, Ordpath, DeltaNI, and the RCTE).]

In Figure 6.2 the individual parts of the process are broken down further for the PPPL indexing scheme. Note that only Steps 3 and 4 are index-specific (and as such the reason for the performance differences in Figure 6.1). Due to the minimal overhead of PPPL, the measurements of Step 3 essentially indicate the best achievable performance of the generic build() algorithm. The breakdown unveils the most expensive steps of the process: the initial hash-based self-join (Step 2.1) and the subsequent sorting (Step 2.2). The actual Edge List Transformation and the build() algorithm are both cheaper than the join. Note, however, that in commercial RDBMS the join and sort operations can be expected to be further optimized and even parallelized. But even without further optimizations, we can safely conclude from the absolute numbers that the bulk-building process we propose in § 4.6.3 can deliver satisfying performance for even very large input sizes.

Step                                 N = 10⁴    10⁵        10⁶        10⁷
1–2.1    Join                        624 µs     5.6 ms     73.2 ms    1129 ms
2.2–2.4  Sort                        654 µs     9.1 ms     119.2 ms   1234 ms
2.5      Edge List Transformation    136 µs     2.6 ms     69.5 ms    962 ms
3        build()                     347 µs     3.9 ms     46.7 ms    390 ms
4        Populate NODE column        172 µs     0.3 ms     1 ms       10 ms
1–4      HIERARCHY() Total           1933 µs    21.5 ms    310 ms     3725 ms

Figure 6.2: Experimental results for § 6.3.1 — Breakdown of HIERARCHY() steps.

Memory Utilization. The memory utilization of the different indexing schemes is another consideration to make when choosing a suitable scheme. The following table compares the memory utilization for the case |HT| = 10⁷ in bytes per hierarchy node:

        PPPL   BO-Tree   Ordpath   DeltaNI   Adjacency
clean   48.1   50.6      27.0      90.6      81.4
dirty   —      66.1      33.9      211.2     81.4

The first row “clean” shows the memory consumption immediately after bulk building. These sizes include the NODE column and the auxiliary data structure, but not the other columns of HT. The space required by our configuration of BO-Tree is comparatively high due to the employed gap back-links, which add 8 bytes per node: 2 bytes per key in the index entry and in the label, times two because each node has a lower and an upper bound. The size could be reduced by modifying the configuration; for example, 1-byte gap keys could be used, or pos back-links, which require only 2 extra bytes. The BO-Tree block size B is also a factor: smaller blocks incur a certain (small) overhead over larger ones. Ordpath is remarkably compact, as it stores only one path label as opposed to two bound labels. Note, however, that the ordpaths in all our tests contained almost no carets, so this represents a favorable case for this scheme. The sizes for DeltaNI are constantly large, since its auxiliary data structure is based on space-costly binary trees with parent pointers. The high memory requirements of Adjacency are mainly due to its two hash indexes.

Another issue concerning dynamic indexes is that the memory utilization, and consequently the performance, degrades slightly when performing update operations. The “dirty” row in the table shows what the memory utilization would be if the identical hierarchy structure had been constructed via N individual inserts at random positions. (Note this is not applicable to PPPL.) Especially DeltaNI is affected by this issue. Particularly unfavorable “skewed” insertions can blow up the sizes of certain susceptible indexing schemes (in particular, containment-based variable-length labeling schemes) even more dramatically. Refer to [33] for more details on these effects.

6.3.2 Index Primitives for Queries

The goal of this experiment is to gain insights into the performance of different query primitives for the indexing schemes under consideration. These primitives are the building blocks for the algorithms we evaluate in the remainder, and as such influence their performance significantly. Since our focus is on query performance, we exclude the update primitives. For an in-depth study that focuses on updates and compares various dynamic indexing schemes, we again refer the reader to our publication [33].

Setup. For this experiment our table HT is populated again with the BOM⟨N⟩ hierarchy of sizes 10⁴ to 10⁷. We assess the query primitives node(), is-before-pre(), is-before-post(), axis(), is-parent(), depth(), and is-leaf(). They are invoked repeatedly with randomly selected Node objects of the hierarchy—Label objects in case of node(). We assume that the Node objects are already available, so the measured times of the primitives other than node() do not include the time of obtaining them first via node().

The measurements of the node() primitive, in particular, indicate the time it takes to retrieve a Node handle of the index structure given a Label object from the NODE column. For BO-Tree, node() essentially executes entry(l) using the location strategy. For PPPL, it involves a lookup in the pre-to-rowid[] index to make the corresponding row ID available for fast access. For Ordpath, it essentially performs a B-tree search for the label. For DeltaNI, node() involves a non-trivial process to “warp” the nested intervals bounds of the node to the most recent point in time using the auxiliary data structure, which consists of a pair of binary trees (see [31]).

[Figure 6.3: Experimental results for § 6.3.2 (calls per second, on a logarithmic scale, for node(), is-before-pre(), is-before-post(), axis(), is-parent(), depth(), and is-leaf(), per indexing scheme).]

Observations. Figure 6.3 displays the measurements for |HT| = 10⁷. To further demonstrate the accompanying caching effects, Figure 6.4 shows the average number of cache misses per operation (but omits some of the primitives for brevity). We first observe that the naïve Adjacency indexing scheme fails to offer robust query performance, as is to be expected. While some queries such as is-parent() are fast, other important queries such as axis() and depth() are unacceptably slow. Also unsurprisingly, the static PPPL indexing scheme is hard to beat in terms of query performance. Ordpath also performs quite well, although it is outperformed by BO-Tree in some cases. The heavy-weight DeltaNI appears to perform astonishingly well on most queries, but this is only the case because all the complexity is hidden in node(), which DeltaNI must perform every time before it executes a query primitive.

Considering the node() primitive in particular, BO-Tree and order indexes in general benefit a lot from their fast back-link mechanism. This makes node() an O(1) operation that in practice involves around 2 cache misses. In contrast, labeling schemes such as Ordpath (but not PPPL) in general have to perform more costly O(log N) B-tree key searches, which involve over 10 cache misses per lookup.

PPPL and Ordpath (and labeling schemes in general) can answer certain query primitives by touching only their Label objects. As the Label objects are readily loaded in cache, they show less than 0.5 cache misses for these primitives. Of course, the actual spent CPU cycles are much higher for Ordpath, which performs non-trivial operations on the binary-encoded ordpaths. This is most strongly visible with is-parent() and depth(), where it always compares the complete path strings.

In contrast to PPPL and Ordpath, as a block-based order index, BO-Tree touches the auxiliary data structure to evaluate any of the index primitives, and thus suffers at least 1 extra cache miss per invocation (around 1–3 cache misses according to Figure 6.4). This effect is particularly pronounced in this experiment: As we select the argument nodes randomly and thus also access the auxiliary data structure at random positions, the relevant index blocks will rarely be in cache. In this regard, this experiment is a worst-case scenario for BO-Tree in comparison to the labeling schemes. For query primitives where all indexing schemes inherently need to access their auxiliary index structure, labeling schemes show more cache misses and BO-Tree becomes more competitive. The is-leaf() primitive, for instance, requires a B-tree index access for labeling schemes (except for PPPL). Further examples are the traversal primitives, which we examine in the next experiment.

           node()   is-before-pre()   axis()   is-parent()   depth()   is-leaf()
PPPL       1.07     0.24              0.21     0.23          0.15      0.15
BO-Tree    1.80     1.74              2.31     2.37          1.91      0.86
Ordpath    9.75     0.30              0.23     0.28          0.18      2.93
DeltaNI    13.77    1.35              0.62     1.11          0.66      0.22
Adjacency  6.74     70.52             33.59    0.60          33.67     2.47

Figure 6.4: Experimental results for § 6.3.2 — Average number of cache misses per call.

To assess the scalability of the indexing schemes, we also conduct the measurements on smaller hierarchies of size 10⁶, 10⁵, and 10⁴. Figure 6.5 shows the results. (The query primitives we omit in this figure show a comparable behavior.) Regarding is-before-pre() and depth(), indexing schemes accessing the auxiliary index structure suffer between 10⁵ and 10⁶, where the L3 cache size is reached. This is due to the mentioned extra cache misses, as these primitives are invoked with randomly chosen node arguments. The drop for node() is comparable for all indexing schemes, as they all need to access the auxiliary index structure. However, BO-Tree slows down only moderately between 10⁶ and 10⁷ nodes due to its O(1) node() algorithm, while other indexes drop further. Apart from these details, all considered indexing schemes scale quite well on almost all query primitives, and therefore can be used for even very large hierarchies.

6.3.3 Hierarchy Traversal

Our previous experiment does not exercise the traversal primitives pre-next() and post-next(), as these primitives are never called in isolation in a query. Their main use is through Hierarchy Index Scan (HIS) or Hierarchy Index Join (HIJ) in order to navigate over the hierarchy.

Setup. In this experiment we use the same setup as in § 6.3.2. We execute the following query Q3.1:

SELECT ROWID() FROM HT ORDER BY PRE_RANK(Node)

[Figure 6.5: Experimental results for § 6.3.2 — Scalability (calls per second for node(), is-before-pre(), and depth(), over hierarchy size |HT|, per indexing scheme).]

The query plan consists of a single full Hierarchy Index Scan, which uses pre-next() to move over the hierarchy in preorder. It enumerates the row IDs of the nodes, but does not materialize them. No column of the table HT is touched except for (potentially) the NODE column. We do not report separate measurements for postorder, since traversal via post-next() works fully analogously for all indexing schemes under consideration.

Besides the performance of HIS, we measure another example query Q3.2, which uses a Hierarchy Index Join to self-join the hierarchical table HT on the descendant axis and additionally evaluates the depth() of each descendant:

SELECT u.Node, v.Node, DEPTH(v.Node)
FROM HT AS u JOIN HT AS v ON IS_DESCENDANT(v.Node, u.Node)
WHERE IS_ROOT(u.Node)
ORDER BY PRE_RANK(u.Node), PRE_RANK(v.Node)

The plan using Hierarchy Index Join is:

    e₀ ← Hierarchy Index Scan [H; ν₁; pre; {}; H.is-root(ν₁)]()
    θ(t₁, t₂) := (H.axis(t₂.ν₂, t₁.ν₁) = descendant)
    e₁ ← Hierarchy Index Join [H; ν₂; θ; ; pre](e₀)
    e₂ ← Map [d : H.depth(ν₂)](e₁)

Generally, the performance of HIJ varies with the sizes of the subtrees that are scanned for each tuple from its (left) input. We therefore make these sizes variable: For s ranging from 2⁰ to 2¹³, we generate a BOM⟨N, s⟩ hierarchy structure. Then we use the children of the super-root ⊤ for the join input e₀. The first HIS in the plan therefore does not contribute to the measured execution times. As HIJ essentially executes a partial index scan for each input tuple to enumerate the s entries in the respective tree, we refer to these measurements by “Q3.2 [s].” This instance of HIJ represents an important query pattern and thus can give us a hint of the performance we can expect of queries involving index scans in their plans.

Note the depth() primitive is exercised in a different way than in § 6.3.2, where we measured the query primitives in isolation and called them on random individual nodes.

[Figure 6.6: Experimental results for § 6.3.3. (a) Q3.1 and Q3.2 [1024]: scanned nodes per second over hierarchy size |HT|; (b) Q3.2 [s]: scanned nodes per second over subtree size s (2⁰ to 2¹³).]

Here, memory locality effects regarding the auxiliary data structure come into play due to the systematic traversal. This benefits in particular block-based indexes such as BO-Tree, since the relevant blocks will often be in cache already. depth() therefore becomes a comparatively cheap operation.

Observations. Figure 6.6 shows scalability measurements of Q3.1 and Q3.2 [1024] with varying hierarchy size |HT|, as well as measurements of Q3.2 [s] at |HT| = 10⁷ with varying scan size s. The plotted measure is the number of scanned tuples—i. e., calls to pre-next()—per second.

Scans generally access the hierarchy index in a predictable pattern. Therefore, we intuitively expect the initial cursor() call of an index scan to be the most expensive individual operation, as it always incurs a cache miss. The subsequent traversal via repeated pre-next() calls is very prefetching-friendly and will thus often incur few further cache misses. This explains why all indexing schemes benefit from larger scan sizes s in the graph to the right. We can further confirm this intuition by looking at the respective average number of cache misses per pre-next() call:

           Q3.1
PPPL       0.13
BO-Tree    0.04
Ordpath    0.09
DeltaNI    0.39
Adjacency  0.75

For PPPL, BO-Tree, and Ordpath there are virtually no cache misses during the scan. When we look at the results for the pure scan Q3.1, Ordpath is fastest. This is not surprising, as it simply performs a scan over the B-tree which already arranges the labels in preorder. BO-Tree performs a scan through its underlying B+-tree-based data structure, which also benefits a lot from the mentioned locality effects. PPPL is slower than Ordpath and BO-Tree in this experiment, as it has to simultaneously access the Node column and its pre-to-rowid[] index. That said, its absolute performance is still very good. Adjacency is very slow, as its hash indexes have to be repeatedly accessed to determine the child nodes to be subsequently scanned. DeltaNI is even slower; it pays the price of a full-blown versioned indexing scheme.

When we look at the more complex query Q3.2 featuring the additional depth() call, things change noticeably. Ordpath suffers hard from its variable-length labels and the more expensive depth() primitive that involves counting the elements in the ordpaths. depth() becomes even more expensive for Adjacency, which has to perform traversals through the auxiliary data structure to determine the number of nodes on the respective root paths.

Regarding the scalability with respect to |HT|, there is a slight penalty when the hierarchy cannot fit into the L3 cache any more. But in general, the performance drops less significantly in comparison to the analogous measurements of § 6.3.2. This is again due to the predictable patterns the data structures are accessed in, which particularly benefits the indexing schemes that are based on a block-based auxiliary data structure: BO-Tree as well as Ordpath with its B-tree of labels.

We conclude that both recommended indexing schemes PPPL and BO-Tree offer robust and high performance at raw index scans and at queries featuring scans as building blocks. Since HIS and HIJ are core operators of common query plans, outstanding query performance can be anticipated.

6.3.4 Hierarchy Joins

In this experiment we assess the performance of our join algorithms Hierarchy Merge Join (HMJ) and Hierarchy Index Join (HIJ).

Setup. For this experiment we again use the hierarchical table HT as described in § 6.2 and vary its size |HT| from 10³ to 10⁶. Due to the large output sizes of the test queries we exclude |HT| = 10⁷. The hierarchy structure is the generated forest structure Regular⟨|HT|, k, s⟩, where each tree is given s = 10⁴ nodes and each inner node exactly k children. To assess the influence of the hierarchy shape, we compare very deep trees (k = 2) to very shallow trees (k = 32). With k = 2 we get a total hierarchy height h of approximately 13.2, whereas for k = 32 we get h ≈ 3.6. Since in a Regular hierarchy the majority of nodes is on the lowest levels, the average level of the nodes is 11.4 and 2.89, respectively.

We assess four queries. Query Q4.1 performs an inner self-join on the descendant-or-self axis over the complete hierarchy and lists the ID pairs of the matching nodes:

SELECT t.ID, u.ID
FROM HT t JOIN HT u ON IS_DESCENDANT_OR_SELF(u.Node, t.Node)

We assume the input table to be readily available in memory. To mask out the effects of accessing and sorting the required columns of HT, which is in column-oriented format, we pre-process the input table as follows: First, we pre-materialize a row-oriented input table Inp with the columns ID and Node. Second, we use Map to convert the Node column (which contains Label objects) to a column ν of Node objects. This way we avoid having to insert node() calls into some of the assessed query plans, which would put them at a disadvantage against plans that don’t require Node objects. Third, we order the table by Node in preorder as required by the respective query plans.

We use different alternative plans to compare HMJ, HIJ, Nested Loop Join (NLJ), and an RCTE-based solution. Each plan projects only the required t.ID and u.ID fields, and the time for result materialization is included in the measurements. The plans are:

(HMJ A) A plan based on HMJ A:
    e₁ ← Hierarchy Merge Join A [H; t.ν; u.ν; pre; {self, descendant}; ](Inp[t], Inp[u])
    e₂ ← Map [t.ID, u.ID](e₁)

(HMJ B) A plan based on HMJ B:
    e₁ ← Hierarchy Merge Join B [H; u.ν; t.ν; pre; {ancestor, self}; ](Inp[u], Inp[t])
    e₂ ← Map [t.ID, u.ID](e₁)

Note the HMJ-based plans are only possible because the table Inp is already sorted accordingly.

(HIJ-1) A plan based on HIJ:
    e₁ ← Hierarchy Index Join [H; νᵤ; H.axis(νᵤ, t.ν) ∈ {self, descendant}; ; pre](Inp[t])
    e₂ ← Map [u : HT[H.rowid(νᵤ)]; t.ID, u.ID](e₁)

(HIJ-2) An alternative plan based on HIJ with an inverted join direction:
    e₁ ← Hierarchy Index Join [H; νₜ; H.axis(νₜ, u.ν) ∈ {ancestor, self}; ; pre](Inp[u])
    e₂ ← Map [t : HT[H.rowid(νₜ)]; t.ID, u.ID](e₁)

(NLJ) A plan based on NLJ:
    e₁ ← Nested Loop Join [H.axis(u.ν, t.ν) ∈ {self, descendant}; ](Inp[t], Inp[u])
    e₂ ← Map [t.ID, u.ID](e₁)

(RCTE) An equivalent plan based on a semi-naïve least-fixpoint operator. It works on the ID and PID columns of HT and mimics a solution based on SQL’s recursive CTEs, under the assumption that efficient IS_PARENT joins are possible. The SQL definition of the RCTE is as follows:

    WITH RECURSIVE RCTE (t_ID, u_ID) AS (
      SELECT ID, ID FROM HT
      UNION ALL
      SELECT t.t_ID, u.ID FROM RCTE t JOIN HT u ON u.PID = t.u_ID
    )
    SELECT * FROM RCTE

The plan uses a hash-based algorithm to evaluate the JOIN.


Query Q4.1* assesses the overhead of large ID keys. The plans are the same as for Q4.1, but the ID and PID fields are of size 32 bytes instead of 8 bytes. This increases the sizes of the input table Inp and of the join result, but also the intermediate results—especially with the RCTE-based plan.

Query Q4.2 performs the same type of join as Q4.1, but with smaller, filtered left and right inputs. To this end we prepare two disjoint subsets LeftInp and RightInp of HT by filtering it by Weight in such a way that a random 20% of the nodes is retained. We again pre-materialize these two tables in row format and order them by Node in preorder. The SQL definition of Q4.2 is:

SELECT t.ID, u.ID
FROM LeftInp t JOIN RightInp u ON IS_DESCENDANT_OR_SELF(u.Node, t.Node)

The HMJ-based and NLJ-based plans are analogous to the plans of Q4.1, except that LeftInp and RightInp are used as inputs rather than two times Inp. The modified HIJ-based plans are as follows:

(HIJ-1)
    e₁ ← Hierarchy Index Join [H; νᵤ; H.axis(νᵤ, t.ν) ∈ {self, descendant}; ; pre](LeftInp[t])
    e₂ ← Map [u : HT[H.rowid(νᵤ)]; t.ID, u.ID](e₁)
    e₃ ← Join [r.ID = u.ID; ](RightInp[r], e₂)

(HIJ-2)
    e₁ ← Hierarchy Index Join [H; νₜ; H.axis(νₜ, u.ν) ∈ {ancestor, self}; ; pre](RightInp[u])
    e₂ ← Map [t : HT[H.rowid(νₜ)]; t.ID, u.ID](e₁)
    e₃ ← Join [l.ID = t.ID; ](LeftInp[l], e₂)

The limitations of HIJ force us to add additional semi joins to filter for RightInp or LeftInp after performing the actual hierarchy joins. For these Join instances we use a hash-based algorithm. An analogous filter has to be added for the RCTE-based solution. The modified SQL definition of the RCTE for Q4.2 is:

WITH RECURSIVE RCTE (t_ID, u_ID) AS (
  SELECT ID, ID FROM LeftInp
  UNION ALL
  SELECT t.t_ID, u.ID FROM RCTE t JOIN HT u ON u.PID = t.u_ID
)
SELECT * FROM RCTE WHERE u_ID IN (SELECT ID FROM RightInp)

Finally, query Q4.2/semi is a variant of Q4.2 which features a left semi join:

SELECT t.ID FROM LeftInp t
WHERE EXISTS (
  SELECT * FROM RightInp u WHERE IS_DESCENDANT_OR_SELF(u.Node, t.Node)
)

The HMJ-based and NLJ-based plans are analogous to the plans of Q4.2, except that the semi-join variant is used (parameter ι set accordingly) and only t.ID is projected. The HIJ-based plans cannot take advantage of the semi variants of HIJ, as the filtering for IDs in RightInp and LeftInp needs to happen after the actual join. Therefore, the only difference to the HIJ-based plans for Q4.2 is in the final projection after the semi join, which has to eliminate duplicates (DISTINCT semantics) using an order-based algorithm.

[Figure 6.7: Experimental results for § 6.3.4 (processed tuples per second over hierarchy size |HT|, for Q4.1, Q4.1*, Q4.2, and Q4.2/semi with k = 2 and k = 32, comparing the plans HMJ A, HMJ B, HIJ-1, HIJ-2, NLJ, and RCTE on the PPPL, BO-Tree, and Adjacency indexes).]

Observations. Figure 6.7 shows the measured results. The y axis displays the execu- tion time of the queries divided by the size of the output table; in other words, the number of joined tuples per second. We measured NLJ only for HT = 104 due to its | | quadratically increasing runtimes. The output sizes are: Result Size HT = 104 HT = 105 HT = 106 | | | | | | q4.1, q4.1* k = 2 113.644 1.136.440 11.364.400 k = 32 28.912 289.120 2.891.200 q4.2 k = 2 5.035 50.578 554.640 k = 32 2.453 24.335 255.451 q4.2/semi k 2, 32 1.921 18.860 190.114 ∈ { } The red lines indicate the speed of copying a precomputed result table tuple by tuple. This is a physical upper bound for the achievable performance. Considering q4.1 and q4.1*, HIJ is generally fastest. It is followed closely by HMJ B (roughly 2–3 times slower), which in turn noticeably outperforms HMJ A (roughly 2 times slower). The RCTE-based solution is an order of magnitude below HIJ, and the NLJ-based plan is another order of magnitude below the RCTE. When comparing the dynamic BO-Tree to the static PPPL index, we see almost no differences for the HIJ-based plans, as both indexes are very efficient at scans. With the HMJ-based plans we clearly see the effects of the more complex is-before-ς() implemen- tations of a dynamic index. While they are almost for free with PPPL, the non-trivial implementations of BO-Tree incur noticeable CPU costs. The effect is strongest with the NLJ-based plans, as NLJ performs the most is-before checks. Note our BO-Tree in the experiments is freshly loaded. In practice, the internal structure of most dynamic indexes tends to degrade somewhat from incremental updates, as we saw in § 6.3.1 when comparing the index sizes. This will also have a slight effect on performance in practice. That said, the slow-down of BO-Tree is quite tolerable overall. Interestingly,

167 6 experimental evaluation

the (log HT ) complexity of the index primitives is barely visible in the measure- O | | ments. Theoretically, increasing the hierarchy size HT should slow down BO-Tree | | further against PPPL; however, the figures are practically indifferent to HT . For the | | chosen block-based BO-Tree index, the logarithmic complexity does not matter much in practice. One reason for this is that the height of the BO-Tree grows extremely slowly, so the complexity can in practical terms be considered constant. Another reason is the favorable data locality in the ordered inputs: the nodes involved in is-before checks are usually close in terms of pre/post distance. Therefore, the relevant BO-Tree blocks will be in cache, and most checks can even be answered by considering only one leaf block hosting both nodes. As the input sizes are equal to HT (or 20% of HT in case of q4.2 and q4.2*) and | | | | the output sizes are also largely proportional to HT , our chosen y axis allows us | | to get an indication of the scalability of the different query plans with respect to the hierarchy size. The figures confirm the linear asymptotic complexities of HMJ and HIJ, and the quadratic complexity of NLJ makes this plan infeasible for HT = 105 and | | higher. Still, HMJ and HIJ are slowed down slightly for higher hierarchy sizes, which can be attributed to the increased memory traffic. That slowdown is even higher for the RCTE, as that plan produces significantly bigger intermediate results. At HT = 106 | | and k = 2 there is a particularly pronounced slowdown, where the L3 cache seems to become ineffective. This also explains the differences we see when comparing q4.1* to q4.1. The inputs and outputs are identical in these queries; the only difference is the larger keys and thus the increased memory traffic. All the effects of increasing HT | | are thus more pronounced in q4.1*. When comparing k = 32 against k = 2, one has to take into account that decreas- ing the hierarchy height also decreases the average number of join partners in our setting—and thus the size of the total join result. The smaller result is the reason for the slight speedup of k = 32 over k = 2. It also means the negative caching effects noted with k = 2 and higher hierarchy sizes are somewhat damped for k = 32. The RCTE-based plan gains a lot from the flatter hierarchy, as fewer iterations are needed. Looking at q4.2, we first notice that HIJ and RCTE are at a big disadvantage now, as they have to perform late filtering using an additional hash join. NLJ is speeded up significantly, as it has to deal with only 20% of the original input sizes and thus suf- fers less from its quadratic asymptotic complexity. The performance of HMJ remains (roughly) comparable. Note that both the input sizes and the join result sizes change over q4.1; therefore, a direct comparison of the performance to q4.1 is difficult. Looking at q4.2/semi, we see that both HMJ and NLJ become more effective in comparison to q4.2 due to their specialized semi join variants. (We did not implement HIJ and RCTE against q4.2/semi. In general, HIJ benefits in the left semi join case, but is inherently unable to handle right semi joins. The RCTE-based plan requires an additional a-posteriori semi join or selection for the filter, and thus would be slowed down even further. This is another inherent disadvantage of RCTEs.) Overall, we can conclude that both HIJ and HMJ have their respective ideal scenarios where they deliver the best possible performance. 
Both easily outplay the suboptimal RCTE-based and NLJ-based alternatives by up to two orders of magnitude.
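To make the index primitives behind these numbers concrete, the following minimal C++ sketch (our illustration, not the prototype's code; all field and function names are assumptions) shows how a static PPPL-style labeling answers the is-before checks and the descendant test in constant time, which is what makes PPPL the natural static baseline; a dynamic scheme such as BO-Tree answers the same primitives in O(log |HT|) via block lookups instead.

#include <cstdint>

// Minimal sketch of a static PPPL-style label (pre rank, post rank, parent,
// level); field names are our assumptions, not the prototype's actual layout.
struct PPPLLabel {
    uint32_t pre;     // preorder rank
    uint32_t post;    // postorder rank
    uint32_t parent;  // row id of the parent node
    uint32_t level;   // depth of the node
};

// The is-before checks driving merge-style operators are plain rank comparisons.
inline bool isBeforePre(const PPPLLabel& a, const PPPLLabel& b)  { return a.pre < b.pre; }
inline bool isBeforePost(const PPPLLabel& a, const PPPLLabel& b) { return a.post < b.post; }

// d lies on the descendant axis of a iff it nests within a's pre/post interval.
inline bool isDescendant(const PPPLLabel& d, const PPPLLabel& a) {
    return d.pre > a.pre && d.post < a.post;
}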


6.3.5 Hierarchical Windows

In this experiment we assess the bare performance of hierarchical windows.

Setup. The basic setup is the same as for § 6.3.4. In addition to HT we pre-materialize an input table Inp as described below. Then we run Statement IV from § 3.3.5 with various expressions from Figure 3.7 on Inp:

SELECT Node, expr FROM Inp
WINDOW td AS (HIERARCHIZE BY Node TOP DOWN),
       bu AS (HIERARCHIZE BY Node BOTTOM UP)

Queries q5.1 to q5.4 differ in which expression expr they evaluate:

• Queries q5.1 and q5.2 compute Expression 1a bottom up and top down, respectively, and represent non-recursive computations.

  expr ≡ SUM(Value) OVER bu   (q5.1)
  expr ≡ SUM(Value) OVER td   (q5.2)

• Query q5.3 computes Expression 4c and represents a structurally recursive computation.

  expr ≡ RECURSIVE DOUBLE (Value + SUM(Weight * x) OVER bu) AS x

• Query q5.4 computes a count/distinct aggregate bottom up and features a comparatively expensive duplicate elimination.

  expr ≡ COUNT(DISTINCT Weight) OVER bu

The aggregate state X of q5.4 requires a data structure for duplicate elimination. We maintain the distinct Weight values in a local array that spills to a hash table on the heap once its fixed capacity of 128 values is exceeded (a sketch of such a spilling state follows after the plan listing).

For each of these queries we measure alternative plans. All plans work on the same input Inp, which associates all nodes of HT with values as the computation input. Inp is prepared a priori as follows: We select the contents of HT (thus, |Inp| = |HT|), add a randomly populated field Value of type INT, and run Map to obtain a column ν of Node objects from the Label objects of the Node field. Then we project (in row format) the required fields for the respective query: [ν, Value] for q5.1/q5.2, [ν, Value, Weight] for q5.3, and [ν, Weight] for q5.4. Finally we sort the data in either preorder or postorder as needed by the respective plan. The measurements thus show the bare performance of the respective operators without any pre- or post-processing (in particular, without sorting) but including the materialization of the query result. We compare the following alternative plans, where applicable:

(a) the straight translation into Hierarchical Grouping:
    Hierarchical Grouping [H; ν; post; {descendant}; x : expr](Inp) — for window bu
    Hierarchical Grouping [H; ν; pre; {ancestor}; x : expr](Inp) — for window td


(b) the alternative Hierarchy Merge Groupjoin:
    Hierarchy Merge Groupjoin [H; t₁.ν₁; t₂.ν₂; post; {descendant}; x : expr](Inp[t₁], Inp[t₂]) — for window bu
    Hierarchy Merge Groupjoin [H; t₁.ν₁; t₂.ν₂; pre; {ancestor}; x : expr](Inp[t₁], Inp[t₂]) — for window td

(c) the equivalent HMJ-Group approach of § 5.2.7 with a preorder-based Hierarchy Merge Join (variant HMJ A);

(d) the equivalent Join-Group approach of § 5.2.7 with a Nested Loop Join.
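Before comparing these plans, here is the promised sketch of the spilling aggregate state used for q5.4: a hedged C++ illustration rather than the prototype's actual code; the fixed capacity of 128 matches the setup described above, while all type and member names are our own.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>

// Aggregate state for COUNT(DISTINCT Weight): a small in-place buffer that
// spills to a heap-allocated hash set once 128 distinct values are exceeded.
class DistinctState {
    std::array<int32_t, 128> local{};                       // in-place distinct values
    size_t n = 0;                                           // values currently buffered
    std::unique_ptr<std::unordered_set<int32_t>> spilled;   // heap fallback after spill

public:
    void add(int32_t v) {
        if (spilled) { spilled->insert(v); return; }
        if (std::find(local.begin(), local.begin() + n, v) != local.begin() + n)
            return;                                         // duplicate, ignore
        if (n < local.size()) { local[n++] = v; return; }
        // Capacity exceeded: spill everything to a hash set on the heap.
        spilled = std::make_unique<std::unordered_set<int32_t>>(local.begin(), local.end());
        spilled->insert(v);
    }
    size_t count() const { return spilled ? spilled->size() : n; }
};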

Comparing Plan b to Plan a shows us the overhead of using the binary groupjoin e₁ ⋉ e₂ instead of the unary Γ̂(e) for a non-recursive computation. Plan c can be considered state of the art and a natural baseline for the comparison against our native Γ̂ and ⋉ algorithms, as explained in § 5.2.7. Plan d, the Join-Group approach, will often be the only option when an encoding such as PPPL is hand-implemented in an RDBMS without further engine support. Like in § 6.3.4, we furthermore consider two plans based on a semi-naïve least-fixpoint operator to mimic an RCTE-based solution using iterative IS_PARENT joins:

(e) Iterative uses iterative hierarchy merge joins on the parent axis to first compute all joining pairs bottom up (q5.1) or top down (q5.2), analogously to § 6.3.4. Then it performs the actual computation using sort-based grouping.

(f) Iterative* additionally applies “early grouping” within each iteration, an optimization inspired by [82]. This early grouping can also be sort-based, as the hierarchy merge join conveniently retains the input order.

This gives us a hint of the performance that can be expected from an exceptionally well-optimized RCTE or from a hand-crafted iterative stored procedure; a sketch of the early-grouping idea follows below. We commonly see such procedures in real-world applications that are based on trivial adjacency list tables. However, plans e and f are not general solutions; they work in our setup only because all nodes from the hierarchical table HT are also present in Inp. Note also that plans b to f work only for non-recursive computations. Thus, we can evaluate Query q5.3 only with Plan a.
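To make the early-grouping idea of plans e and f concrete, the following hedged C++ sketch performs an iterative bottom-up SUM rollup over an adjacency list: instead of materializing ancestor/descendant pairs, each iteration folds the current frontier of partial sums into its parents. All names are illustrative, and the real plans of course run as engine operators rather than on hash maps.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Iterative* sketch: bottom-up SUM rollup with per-iteration "early grouping".
// parentOf maps each node to its parent (roots are absent); values holds the
// per-node input values. The number of iterations equals the hierarchy height.
std::unordered_map<int64_t, double>
iterativeRollup(const std::unordered_map<int64_t, int64_t>& parentOf,
                const std::unordered_map<int64_t, double>& values) {
    std::unordered_map<int64_t, double> total(values);     // running totals per node
    std::unordered_map<int64_t, double> frontier(values);  // sums still to push upward
    while (!frontier.empty()) {
        std::unordered_map<int64_t, double> next;
        for (auto& [node, sum] : frontier) {
            auto it = parentOf.find(node);
            if (it == parentOf.end()) continue;  // root: nothing to propagate
            next[it->second] += sum;             // early grouping per parent
        }
        for (auto& [node, sum] : next) total[node] += sum;
        frontier = std::move(next);
    }
    return total;
}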

Observations. Figure 6.8 shows the execution times, normalized with respect to the number of processed elements |Inp|. As in § 6.3.4, the red line indicates the speed of copying a precomputed result table, which is approximately 37.6 million tuples per second. In q5.1–3 with PPPL, the Hierarchical Grouping operator is remarkably close to this bound (around 25.4 million tuples per second, or 67%). That non-recursive computations using HG (q5.1a) are not slower than recursive ones (q5.3a) comes as no surprise, since the algorithm is identical. For both HG and HMGJ, the top-down algorithms (q5.2ab) are slightly slower than the bottom-up algorithms (q5.1ab), as they cannot dismiss covered tuples as early and thus inherently issue more index calls. The duplicate elimination of q5.4 is costly: both HG and HMGJ become roughly 3 to 4 times slower compared to the trivial arithmetics of q5.1–3. When comparing the groupjoin e₁ ⋉ e₂ to Γ̂(e) over all queries (q5.1–4ab), we see the latter is on average around 32% faster. The overhead of binary grouping stems from evaluating e twice (which in this case is a table scan) and from the extra index calls needed to associate e₁ and e₂ tuples.

The HMJ-Group approach (Plan c) is significantly slower than HMGJ, mostly in bottom-up q5.1 (for instance, around 11 times slower at k = 2) but also in top-down q5.2 (around 3.5 times at k = 2); the gap grows with the hierarchy height. HMJ-Group is somewhat handicapped at q5.1, as the hierarchy merge join algorithm we use in the plan is preorder-based. As preorder is more natural to top-down computations, HMJ-Group performs noticeably better at q5.2. Interestingly, the HMJ-Group-based Plan c is not slowed down as much as the others at q5.4 versus q5.1. Apparently, the intermediate join dominates the costs so that the subsequent processing-friendly sort-based grouping does not matter much. Correspondingly, the overhead over HMGJ is smaller at q5.4, though still noticeable.

The iterative solutions are generally slow. Early aggregation helps much in the bottom-up case, where Iterative* even approaches HMJ-Group at |HT| = 10^6. In the top-down case, however, early aggregation does not help to reduce the intermediate result sizes, as IS_PARENT is an N : 1 join; here, the (minor) savings over Iterative come from saved arithmetic operations by reusing results of previous iterations.

Regarding dynamic (BO-Tree) versus static (PPPL) indexes, the more complex is-before-ς() implementations of the former are again clearly noticeable, as in § 6.3.4. The slowdown is most visible with the top-down q5.2, where more is-before checks are issued inherently. As we also saw in § 6.3.4, increasing the hierarchy size |HT| does not significantly slow down BO-Tree in spite of the O(log |HT|) complexity of the index primitives. HMJ-Group and Iterative are much more sensitive to |HT| due to their growing intermediate results.

If we consider the hierarchy shape (deep k = 2 versus flat k = 32), we see that Iterative and Iterative* are very sensitive (unsurprisingly, as their time complexity is proportional to h), whereas HG and HMGJ are practically indifferent. The intermediate join result of HMJ-Group is somewhat proportional to h, so it is also affected to some extent (factor 2–3).

Note that the above experiments assess only e₁ ⋉ e₂ where e₁ = e₂, that is, a unary hierarchical window setup. We also conducted measurements where e₁ ≠ e₂ with varying |e₁| and |e₂| sizes. However, as we found the results to be fully in line with the O(|e₁| + |e₂| + |e₁ ⋉ e₂|) complexity of HMGJ, we do not separately report them.

Overall, the results exhibit the robust linear space and time complexities of our operators for hierarchical windows, and again highlight their merits over conventional approaches. Furthermore, they confirm the known “groupjoin advantage” also for the hierarchical case, in line with the reports on hash-based equi-groupjoins of [74].

[Figure 6.8: Experimental results for § 6.3.5. Processed tuples per second over hierarchy size |HT| for queries q5.1–q5.4 at k = 2 and k = 32; plans (a) HG, (b) HMGJ, (c) HMJ-Group, (d) NLJ-Group, (e) Iterative*, (f) Iterative; indexes PPPL and BO-Tree.]

6.3.6 Hierarchical Sorting

In this experiment we compare plans based on Sort to plans based on Hierarchy Rearrange (HR). Recall that the HR operator consumes an input that is already sorted in preorder or postorder and employs a stack-based algorithm to convert it to the respective other order. Its advantage over Sort is that it allows for pipelining to some degree. Here we investigate the potential of these “order-based” sorting algorithms.
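As an illustration of the stack-based conversion, consider the following C++ sketch (ours, under the assumption that an ancestor test is available via the hierarchy index). A stacked tuple can be emitted as soon as the incoming tuple proves its subtree is complete, that is, as soon as the stacked tuple is no longer an ancestor of the incoming one; this is also what enables the pipelining mentioned above.

#include <vector>

// Hedged sketch of the pre-to-post conversion behind Hierarchy Rearrange;
// isAnc(x, y) is assumed to answer "x is an ancestor of y" via the hierarchy
// index. The input stream must be preorder-sorted.
template <typename Tuple, typename IsAncestor, typename Emit>
void rearrangePreToPost(const std::vector<Tuple>& pre, IsAncestor isAnc, Emit emit) {
    std::vector<Tuple> stack;
    for (const Tuple& t : pre) {
        // Stacked tuples that are not ancestors of t have complete subtrees;
        // popping them deepest-first yields postorder.
        while (!stack.empty() && !isAnc(stack.back(), t)) {
            emit(stack.back());
            stack.pop_back();
        }
        stack.push_back(t);  // t awaits completion of its own subtree
    }
    while (!stack.empty()) {  // flush the remaining rightmost ancestor chain
        emit(stack.back());
        stack.pop_back();
    }
}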

Setup. Queries q6.1 and q6.2 work directly on the hierarchical base table and access only the Node field. They simply select all nodes and sort them in either postorder (q6.1) or preorder (q6.2):

SELECT Node, ROWID FROM HT ORDER BY [POST_RANK(Node) | PRE_RANK(Node)]

For each query we compare four different ways to do the sorting. Like in § 6.3.4 and § 6.3.5, we first do some preprocessing work on HT that is not included in the time measurements: We use Map to convert the Node column of HT to a column ν of corresponding Node objects. We project only the row ID and ν fields and materialize the result in a table Inp. For some of the plans we need a preorder-sorted copy Inp-pre of Inp and a postorder-sorted copy Inp-post, which we also prepare and materialize ahead of time. In the bottom-up case q6.1, the four plans are:

(a) A simple e₁ ← Scan(Inp-post).
(b) Using Hierarchy Rearrange on the preorder-sorted copy of Inp:
    e₁ ← Hierarchy Rearrange [H; ν; pre → post](Inp-pre)
(c) Performing a full sort of Inp:
    e₁ ← Sort [<](Inp) where t₁ < t₂ := H.is-before-post(t₁.ν, t₂.ν)
(d) Enumerating the HT tuples in the desired order using an index scan:
    e₁ ← Hierarchy Index Scan [H; ν; post; {}; true](HT)


In the top-down case q6.2, the plans are identical, except that the preorder and postorder roles have to be swapped. Note that Plan d works only for these particular queries, as we are working directly on HT instead of an arbitrary input table. Beyond q6.1 and q6.2, which allow us to assess the peak performance of the different sort methods, it is also interesting to see how these operators interact with other operators in a more “useful” query. To this end, queries q6.3 and q6.4 perform a hierarchical computation based on a bottom-up and a top-down window, respectively.

SELECT Node, SUM(Value) OVER w FROM Inp
WINDOW w AS (HIERARCHIZE BY Node [BOTTOM UP | TOP DOWN])

These queries are closely comparable to q5.1 and q5.2 of § 6.3.5, and thus give us an indication of the additional costs when the input is not already in the required order. The Value field is generated upfront as described in § 6.3.5. The plans we compare are just as for q6.1 and q6.2, except that an additional Hierarchical Grouping operator is applied to the sorted output e₁, and in case of Plan d an additional Map operator is needed to bring in the Value field. The HG operator is identical in all plans for q6.3:

Hierarchical Grouping [H; ν; post; {descendant}; x : sum(Value)](e₁)

In the top-down case q6.4, the descendant axis becomes the ancestor axis, accordingly.

Observations. From the results in Figure 6.9, we first see that dynamic indexes suffer from their more expensive is-before() calls, which are heavily exercised for sorting. Unsurprisingly, this affects full sorting (Plan c) in particular. Apart from that, we observe that full sorting is less expensive than one may expect (roughly 3 times slower than Plan a with PPPL), considering that our algorithm is not multithreaded. Leveraging an index scan (Plan d) also helps a lot, but is of course possible only when working directly on the hierarchical table. Most interestingly, the “order-based sorting” of Hierarchy Rearrange (Plan b) is greatly superior to a full Sort, especially in the bottom-up PPPL case: HR closely approaches the “perfect” speed of Plan a, where no sorting is performed at all. This can be explained by the favorable data locality in the already preorder-sorted inputs.

The pre-sorted scenario (Plan a) of q6.3 and q6.4 is closely comparable to our setup for q5.1 and q5.2. When comparing the numbers for the HR- and HIS-based plans to Plan a, we notice only a moderate slowdown. The push-based query execution model allows our HR and HIS operators to efficiently pipeline their results into the subsequent Hierarchical Grouping operator, which effectively hides parts of its costs. As a full pipeline breaker, Sort (Plan c) is again at a disadvantage in that regard.

On a higher level, these results also mean that our hierarchy operators that consume postorder inputs are not actually restricted to only postorder; by adding Hierarchy Rearrange operators, they can be applied to preorder inputs as well at only moderate extra costs. This also applies to the preorder-based (top-down) algorithms, although HR is somewhat less effective in these cases due to the more complicated logic and additional buffering that is necessary for post-to-pre conversion.


[Figure 6.9: Experimental results for § 6.3.6. Processed tuples per second over hierarchy size |HT| for queries q6.1–q6.4 at k = 2 and k = 32; plans (a) pre-sorted, (b) HR, (c) Sort, (d) HIS; indexes PPPL and BO-Tree.]

6.3.7 Pattern Matching Query

In this experiment we examine the performance to expect from basic structural pattern matching queries. Our test query is similar to the SQL statement of Figure 3.1 (p. 36):

SELECT a.ID, a.Payload, b.ID, b.Payload, c.ID, c.Payload
FROM HT a
JOIN HT b ON IS_DESCENDANT(b.Node, a.Node)
JOIN HT c ON IS_DESCENDANT(c.Node, b.Node)
WHERE a.Weight IN (...) AND b.Weight IN (...) AND c.Weight IN (...)

The three WHERE conditions are chosen to be quite selective; each selects a different, random 5% of the tuples.
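For reference, the following C++ sketch (ours; names and the deliberately cubic nested loop are illustrative only) pins down the semantics of this chain pattern over pre/post ranks, cf. the PPPL sketch in § 6.3.4: b must nest inside a's pre/post interval and c inside b's. The actual plans below (HIJ, HMJ, RCTE) of course avoid such a brute-force loop.

#include <cstddef>
#include <cstdint>
#include <vector>

// One row of HT; inA/inB/inC stand for the precomputed results of the three
// Weight IN (...) filters. All names are assumptions for illustration.
struct Row { uint32_t pre, post; bool inA, inB, inC; };

static bool nests(const Row& d, const Row& a) {  // d is a descendant of a?
    return d.pre > a.pre && d.post < a.post;
}

size_t countMatches(const std::vector<Row>& ht) {
    size_t n = 0;
    for (const Row& a : ht) if (a.inA)
        for (const Row& b : ht) if (b.inB && nests(b, a))
            for (const Row& c : ht) if (c.inC && nests(c, b))
                ++n;  // one (a, b, c) result row of the pattern
    return n;
}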

Setup. We use the hierarchical table HT with the Regular⟨N, k, 10^4⟩ hierarchy structure, and set the size of the Payload field to 16 bytes. Thus, the size of a result row containing three CHAR(8) keys and three payload fields is 72 bytes. We compare three plans. In the plans we use three select conditions φ_a(t), φ_b(t), and φ_c(t) according to the query above. Plan a is based on a Hierarchy Index Join:

a₁ ← Scan [Node_a, ID_a, Payload_a, Weight_a](HT)
a₂ ← Select [φ_a](a₁)
a₃ ← Map [ν_a : H.node(Node_a)](a₂)
ab₁ ← Hierarchy Index Join [H; ν_b; H.axis(ν_b, ν_a) = descendant; {}; pre](a₃)
ab₂ ← Map [b : HT[rowid(ν_b)]; ID_b : b.ID, Payload_b : b.Payload, Weight_b : b.Weight](ab₁)
ab₃ ← Select [φ_b](ab₂)
abc₁ ← Hierarchy Index Join [H; ν_c; H.axis(ν_c, ν_b) = descendant; {}; pre](ab₃)
abc₂ ← Map [ID_c, Payload_c, Weight_c](abc₁) — analogous to ab₂
abc₃ ← Select [φ_c](abc₂)

While the plan omits it for ease of presentation, our actual implementation of the three plans takes care to actually access the attributes ID, Payload, and Node at the latest possible occasion, that is, only after performing the filtering by Weight. For example, the ab₂ operation is actually split into two Map operators, one before and one after the Select operator ab₃. Also note that in a real-world schema there might conceivably be an index on the filter attribute (Weight), which could be used as an alternative access path to HT and further speed up this query. Plan b is based on a Hierarchy Merge Join:

a₁ to a₃ like in Plan a
b₁ ← Scan [Node_b, ID_b, Payload_b, Weight_b](HT)
b₂ ← Select [φ_b](b₁)
b₃ ← Map [ν_b : H.node(Node_b)](b₂)
c₁ to c₃ analogous to a₁ to a₃ and b₁ to b₃
ab ← Hierarchy Merge Join [H; ν_b; ν_a; pre; {ancestor}; {}](b₃, a₃)
abc ← Hierarchy Merge Join [H; ν_c; ν_b; pre; {ancestor}; {}](c₃, ab)

We applied a number of optimizations to this plan: First, the selections have been pushed further downwards than in the HIJ-based Plan a, making Plan b bushy. Second, Plan b takes advantage of HT being clustered and enumerated by the table scans in preorder. (If HT were not clustered, additional Sort operators would be needed prior to the HMJ operators, or alternatively, HT could be accessed via HIS.) Third, the plan demonstrates how two HMJ operators can be chained together by leveraging their sorted output. No additional sorting in between the HMJ operators is needed. Fourth, we changed the join axis to ancestor and swapped the join arguments in order to enable the more efficient HMJ A/pre algorithm.

For comparison, Plan c evaluates an RCTE that produces the equivalent result. The SQL statement of the RCTE is largely similar to the one shown in Figure 2.5. We omit the lengthy plan, as it is essentially a direct translation of the SQL query statement. It features two semi-naïve evaluation steps: First it applies filter φ_a, then it joins downwards via the ID–PID association to obtain all descendants, then applies filter φ_b, then joins downwards again, and finally applies filter φ_c. The algorithm used for the joins within the RCTE is again a hash join.

Observations. Figure 6.10 shows the runtimes of the plans with different hierarchy indexes, sizes |HT| from 10^3 to 10^7, and shapes k ∈ {2, 32}. The times are normalized with respect to |HT|, that is, they show the runtime per 1000 input nodes, which exhibits the scalability properties of the different plans. Plan c is index-independent and measured on the plain table (ID and PID instead of Node). With our test data the sizes of the result sets are as follows:

Hierarchy Size |HT|     10^4    10^5    10^6     10^7
Result Size (k = 2)       62     424    3556    48988
Result Size (k = 32)       4       8      63      873

[Figure 6.10: Experimental results for § 6.3.7. Time per 1000 adjacency nodes (µs) over hierarchy size |HT| for k = 2 and k = 32; plans (a) HIJ-based, (b) HMJ-based, (c) RCTE-based; indexes PPPL, BO-Tree, Ordpath, and DeltaNI.]

From the figure we see that the HIJ-based Plan a is slightly superior to the HMJ-based Plan b for all indexes except DeltaNI. This difference is amplified with flat hierarchies (k = 32), where the HMJ-based Plan b is up to three times slower. This may be surprising, as the HIJ-based plan cannot reduce its input sizes as early as the bushy HMJ-based plan. However, for flat hierarchies the intermediate join results will not be as large anyway, so the early reduction of intermediate result sizes in the bushy HMJ-based plan cannot play out its advantages in this particular query. With both the HIJ- and the HMJ-based plans, Ordpath and DeltaNI perform painfully worse than the other indexes; this is explained by the is-before-ς() checks of HMJ, which are particularly costly for Ordpath due to its variable-length labels. HIJ relies on the cursor() and pre-next() primitives, which in turn are expensive for DeltaNI.

Regardless of the index type, both Plans a and b easily outperform the RCTE-based Plan c: for a hierarchy as large as |HT| = 10^6, the latter is slower than the HIJ-based Plan a by a factor of over 30. Furthermore, Plan c scales poorly with respect to |HT|, which renders it infeasible for size |HT| = 10^7. This difference stems from a fundamental problem of RCTEs: Regardless of the selectivity of the filters, the recursive join inherently iterates over the complete hierarchy and yields large intermediate results, only to find that most of these results are irrelevant as the nodes do not match the pattern. By contrast, HIJ and HMJ do not need to enumerate whole subtrees to find matching nodes. With HMJ in particular, the predicate can be pushed down in the plan and only nodes that already meet the filter conditions participate in the join in the first place.

We conducted analogous measurements where we compared CHAR(8) keys in the ID and PID columns with CHAR(32) keys. The following table shows the runtimes of the various plans with CHAR(32) keys, and the respective slowdown over CHAR(8). As the effects are more visible with deep hierarchies, we report only the numbers for k = 2:

|HT|    Index      Time of Plan a        Time of Plan b
10^4    BO-Tree      0.3 ms   +25.1%       0.3 ms    +5.6%
        PPPL         0.3 ms   +23.4%       0.2 ms   −18.4%
10^5    BO-Tree      2.7 ms   +20.5%       2.8 ms    +4.0%
        PPPL         2.3 ms   +36.3%       2.1 ms    −8.2%
10^6    BO-Tree     38.3 ms   +18.7%      50.4 ms    +3.4%
        PPPL        33.1 ms   +36.8%      27.3 ms    +2.2%
10^7    BO-Tree    421.0 ms   +22.4%     516.4 ms    +4.5%
        PPPL       396.0 ms   +47.2%     277.0 ms    +1.9%

|HT|    Time of Plan c
10^4       23.3 ms   +94.5%
10^5      258.4 ms   +90.0%
10^6     4923.6 ms   +66.4%


The advantages of the bushy HMJ-based Plan b are now amplified; it is less affected by the larger key type than the HIJ-based Plan a, where the slowdown is roughly between 20% and 35%. These numbers reveal one more general advantage of hierarchical tables over adjacency lists with RCTEs: Hierarchy operators such as HIJ and HMJ work on the system-managed NODE column and auxiliary data structure, whereas RCTEs on an adjacency list table necessarily work on the user-prescribed ID/PID columns. A processing-unfriendly key type hurts the performance of RCTEs. This issue disappears with hierarchical tables. With the additional handicap for Plan c in this scenario, Plans a and b now outperform the RCTE by up to two orders of magnitude.

6.3.8 Complex Report Query

Having assessed hierarchical windows in isolation, we next look at a complete query featuring hierarchical windows, namely the report query we studied in § 3.6.5 (p. 60). At its heart, it performs a bottom-up rollup as q5.1 in § 6.3.5, but with additional “stress factors” such as additional non-hierarchy joins, multiple hierarchical windows (bottom-up and top-down), carrying along payload, sorting the result in preorder, and computing Dewey-style paths.

Setup. We again use our hierarchical table HT. We choose |HT| = 10^4 and the hierarchy structure Regular⟨10^4, k, s⟩, which contains a single tree (s = 10^4) with a moderate height (k = 8). We emulate a setting where the input values are attached to only a subset of the hierarchy HT, namely the leaf nodes. To this end we prepare a table Inp containing p% of the 8751 leaf nodes of HT, which are randomly chosen. Inp already has a field ν of the corresponding Node objects. We measure different hand-optimized plans. In all plans, the depth filter on the output nodes, φ := (H.depth(ν) ≤ 3), is fully pushed down and handled directly by an ordered index scan of HT. The following operation is common to all plans:

HT_φ ← Map [ν : H.node(Node)](Scan [φ](HT)).

The following common operations appear in multiple plans:

HT_φ^post ← Sort [<_is-before-post](HT_φ)
HT_φ^pre ← Sort [<_is-before-pre](HT_φ)
Inp^pre ← Sort [<_is-before-pre](Inp)
Inp^post ← Sort [<_is-before-post](Inp)

Plans a and b use our Γ̂ and ⋉ operators. Plan a unions together the input table Inp and the hierarchical table HT and runs Γ̂ on the union table:

e₁ ← Merge Union [<_is-before-post](HT_φ^post, Inp^post)
e₂ ← Hierarchical Grouping [H; ν; post; {descendant}; x : sum(Value)](e₁)
e₃ ← Select [φ](e₂)
e₄ ← Hierarchy Rearrange [H; ν; post → pre](e₃)
e₅ ← Hierarchical Grouping [H; ν; pre; {ancestor}; (p, Path) : ...](e₄)


The Merge Union is an efficient sort-based algorithm. The Select for φ “undoes” the union and retains only the output tuples we are interested in. The final HG handles both top-down computations: the contribution percentage p and the Dewey string Path. It also preserves the order of its input, which yields the result in the desired preorder. Plan b uses a Hierarchy Merge Groupjoin between HT and Inp and thus does not need a Merge Union:

e₁ ← Hierarchy Merge Groupjoin [H; t.ν; u.ν; post; {self, desc.}; ...](HT_φ^post[t], Inp^post[u])
e₂ ← Hierarchy Rearrange [H; t.ν; post → pre](e₁)
e₃ ← Hierarchical Grouping [H; t.ν; pre; {ancestor}; (p, Path) : ...](e₂)

For Plan c we assume the hierarchical table model without our syntax extensions for hierarchical windows and without our operators HG and HMGJ for structural grouping. It relies on Hierarchy Merge Join, that is, the HMJ-Group approach.

e₁ ← Hierarchy Merge Join [H; t.ν; u.ν; pre; {descendant}; {}](HT_φ^pre[t], Inp^pre[u])
e₂ ← Group [t.ν; x : sum(u.Value)](e₁)
e₃ ← Materialize(e₂)
e₄ ← Hierarchy Merge Join [H; t.ν; p₁.ν; pre; {parent}; {}](e₃[t], e₃[p₁])
e₅ ← Hierarchy Merge Join [H; p₁.ν; p₂.ν; pre; {parent}; {}](e₄, e₃[p₂])
e₆ ← Map [(p, Path) : ...](e₅)

The grouping is sort-based. Lacking our syntax extensions, a lot of manual “SQL labor” is involved: The upper three levels are assembled via two left outer is-parent() joins, yielding the first parent p₁ and the second parent p₂. Then the final Map operator constructs the path strings and computes the contribution percentages by hand from p₂, p₁, and t. With Plan d we emulate a hand-implemented static PPPL-like labeling scheme. Lacking engine support, it has to rely on nested loop joins, that is, the Join-Group approach.

e₁ ← Nested Loop Join [H.axis(u.ν, t.ν) ∈ {descendant, self}; {}](HT_φ[t], Inp[u])
e₂ ← Group [t.ν; x : sum(u.Value)](e₁)
e₃ ← Materialize(e₂)
e₄ ← Nested Loop Join [H.is-parent(p₁.ν, t.ν); {}](e₃[t], e₃[p₁])
e₅ ← Nested Loop Join [H.is-parent(p₂.ν, p₁.ν); {}](e₄, e₃[p₂])
e₆ ← Map [(p, Path) : ...](e₅)
e₇ ← Sort [<_is-before-pre](e₆)

For Plan e, we assume again the adjacency list model and a hand-written stored procedure which does an iterative fixpoint computation (like Iterative in § 6.3.5). After the bottom-up computation it uses a PID IS NULL filter and three hash joins to determine the upper three levels for the result and to fill in the p and Path fields. Although Plans d–e appear to be severely handicapped versus a–c, they are representative of the state of the art in real-world applications we encountered. In particular, an approach based on the adjacency list model (Plan e) is very common in practice.


[Figure 6.11: Experimental results for § 6.3.8. Queries per second over input size p for Plans a–e; indexes PPPL and BO-Tree.]

Observations. Figure 6.11 shows the measured query throughput over varying p. The big performance gaps we have seen in some of the previous experiments are now less pronounced due to a damping effect of the other involved relational operators. In particular, a big pain point in this query is the expensive sorting of Inp, which turns more expensive the bigger p gets. This performance issue could be alleviated by employing a parallel sorting algorithm. Nevertheless, we still see the merits of our proposed syntax and algorithms for hierarchical computations: Both Γ̂ (Plan a) and ⋉ (Plan b) handle the query quite reasonably. As the Hierarchy Merge Groupjoin more naturally fits the binary nature of this query (and obviates the Merge Union needed to prepare the input for Γ̂ in Plan a), Plan b significantly outperforms Plan a in this case. The advantage of having dedicated operators HMGJ and HG over plain HMJ-Group (Plan c) is again clearly visible, although less pronounced than in § 6.3.5 due to the damping effect of the sorting. Plan d is painfully slowed down by the employed nested loop joins, though still faster than Plan e, which always has to perform an iterative join through the complete hierarchy. This renders it constantly slow, regardless of the actual input size p. The numbers show a clear benefit of having dedicated operators for hierarchical computations (Plans a and b versus Plan c). They also show that Plans d and e are not just unwieldy hand-crafted solutions and unsatisfactory in terms of expressiveness; they also cannot hold up with our framework in terms of efficiency.

6.4 Summary

Altogether, our experiments touch on all major components of our proposed framework, and thus give a good indication of the performance one can expect in practice. Our assessment of index primitives for queries (§ 6.3.2) shows that the interface for hierarchy indexes can be implemented highly efficiently even with dynamic schemes. The proposed standard schemes, PPPL for static scenarios and BO-Tree for dynamic scenarios, perform well throughout all disciplines. Beyond its query performance, BO-Tree with mixed block sizes is also an excellent all-round index structure with robustness for all update operations; it therefore should be the first choice when the update pattern is unknown. Its performance can be optimized further by choosing the simpler O-List index structure, a different back-link representation, or a different block size, as we discuss in our paper [33].

Besides the index primitives, our experiments also assess the performance characteristics of our essential operators HIS, HIJ, HMJ, HMGJ, HG, and HR, both in isolation as well as in complex plans for typical pattern matching and hierarchical computation queries. They show that our framework clearly outperforms any solution based on adjacency list tables and recursive stored procedures or RCTEs. In fact, the query plans are often so fast that they would outperform an RCTE-based approach even if we performed a full bulk-build prior to executing them. For example, with 10^7 nodes, if we execute Plan a from § 6.3.7 together with the bulk build (assessed in § 6.3.1), the speedup over the RCTE is still more than a factor of 5. Of course, the ultimate goal of every legacy application should be to migrate from derived hierarchies to hierarchical base tables, so that bulk-building is never necessary. Still, our results show that even applications that conceptually derive a hierarchy for every query will be able to achieve considerable performance gains.

7 Conclusions and Outlook

Working with hierarchies in relational databases has always been complicated. None of the conventional solutions we surveyed were able to fully meet the typical requirements that we identified in the applications of SAP and beyond. These requirements call for a holistic approach that seamlessly integrates relational and hierarchical data on the levels of the data model, the query language, and the backend of the database system, while retaining the philosophy of the relational data model.

Our solution is to represent hierarchies by means of an abstract NODE data type and to provide the necessary language extensions to define, manipulate, and query hierarchies in a user-friendly way. Many formerly complicated tasks are thus greatly simplified: Developers are no longer burdened with physical design considerations, as a simple abstract data type replaces any hand-crafted tree encoding schemes. Expressive language constructs turn convoluted blocks of SQL text into concise and intuitive statements. The task of choosing a suitable indexing scheme that balances storage utilization, support for updates, and query performance is handed off to the system, which also makes it easy to adapt an evolving application to changing workloads throughout its lifetime. Altogether, these simplifications promise a boost in developer productivity and maintainability. In addition, the sophisticated indexing and query processing techniques we employ at the backend make sure that performance and scalability are not compromised along the way. These features do not only benefit green-field applications; they are also made available to existing applications through facilities for deriving hierarchies from existing data and for bulk-loading hierarchy indexes efficiently. This allows legacy applications to gradually push hierarchy-handling logic that formerly lived in the application layer down to the database layer, and thus benefit as well from the productivity and performance gains that our framework brings along.

Our research prototype already shows promising evidence of the mentioned merits over conventional approaches. As part of our cooperation with SAP, an “industrial-grade” proof-of-concept prototype was also developed in collaboration with members of the HANA team. It is especially pleasant to see that at the time of writing, the technology has already been made publicly available to users of both the HANA Database and the SAP Vora engine (see “HIERARCHY function” in [90], and [91, 92]). The HANA platform has from an early stage on been envisioned as a flexible query processing environment to accommodate data of different degrees of structure, and our concept of hierarchical tables fits well in this spirit. We hope that this will open the doors for conducting broader user studies and gaining first-hand customer feedback, which can eventually inspire further research and development around our framework.


Outlook. In this thesis we already exploit many of the opportunities our hierarchical table model provides both from a language perspective (NODE as a “syntactic handle”) and from a backend perspective (indexing the NODE column and leveraging it during query evaluation). But further enhancements are always possible. Two especially rich areas of open problems we already covered earlier are about concurrency control (see § 4.7) and query optimization and processing on hierarchical data (see § 5.3). We conclude this thesis by outlining a few further promising ideas for future work.

Benchmarking. The aforementioned customer-available version of our framework will ideally enable us to collect experiences with a wider range of real-world hierarchical data sets and application patterns, and thus unveil areas for further improvement. It would be a promising endeavor to condense the most typical patterns into a compact benchmark suite in the spirit of the well-known TPC benchmarks [103]. Such a suite can be very helpful in motivating and driving future research in the area.

Other database systems. While our prototype was implemented with an eye towards modern engines based on memory-resident tables and a push-based query execution model, the techniques could in principle be adapted to other RDBMS architectures. With the exception of our SQL-based language extensions, we would in fact expect most components of our framework, such as the indexing schemes and the algorithms for query processing, to be applicable as well to a broader range of non-relational systems, such as XML or graph databases. For example, it should be possible to implement an XPath engine [110] entirely in terms of our low-level index interface. Such extended application areas could possibly be explored in the future.

Features. Many extensions to our feature set are conceivable, but they should ideally be motivated by realistic use cases. A few ideas are to add further syntax for customizing how window frames are formed in hierarchical computations (e. g., to take into account the sibling order); to allow users to add functionality by defining custom operations on the NODE type; or to generalize SQL’s ROLLUP feature to work with hierarchical tables. The latter would benefit the tightly constrained applications that use ROLLUP or MDX today, as discussed in § 2.3.10 and § 2.3.5. A further challenge is multi-dimensional computations: While we considered only the single-dimensional case, generalizing hierarchical windows and rollups to two or more NODE fields should be possible.

Hierarchical and temporal data. To keep historic states of data available for analytics or auditing purposes, business applications routinely use temporal tables, where one or more system time or application time dimensions are modeled using SQL’s PERIOD construct [49, 55]. When a hierarchical dimension is defined on a temporal table, the result is a temporal hierarchy. The hierarchy may arrange multiple instances of the same logical entity at different time periods, as if each edge was labeled with a certain validity period. Such hierarchies are strict trees only as long as they are restricted to a single point in time; when viewed over all periods they are general graphs. Advanced queries on temporal hierarchies may involve both dimensions simultaneously and even span over intervals of time; for example: “Perform a hierarchical sales rollup and determine the point in time with the maximum sales total.” These queries are very hard to evaluate efficiently, as indexes and algorithms are conventionally designed for either hierarchical or temporal data, but not both (but see [31, 69, 88, 120]). Although this topic was beyond our scope, our design takes care that a temporal table is conceptually and syntactically not hindered from having a hierarchical dimension as well. It can therefore be integrated with the functionality of PERIOD in the future.

Non-strict hierarchies. We handle non-strict hierarchies in a minimalistic but conceptually clean way by transforming them into trees at hierarchy derivation time. That said, our NODE model could in principle accommodate general acyclic directed graphs “natively”: one could use graph indexes to store them and generalized algorithms to query them. There are some major challenges, though. Syntactically, it is not obvious how the edges (which become distinct entities) could be accessed in a concise and effective way without violating the look and feel of SQL. Semantically, issues would arise with tree-based hierarchy functions and predicates and the ambiguous data flow in hierarchical computations. Moreover, it is uncertain whether such support would have any selling point against dedicated graph stores and their more purpose-built query interfaces, which is why we did not pursue these ideas further (see also § 2.2.2).

Adaptivity. In § 4.3 we proposed a simple static selection method for indexing schemes. A more sophisticated system could conceivably adapt during the lifetime of the hierarchical table. It might, for example, take the current workload into account and decide to switch between indexes on demand. For slowly changing hierarchical tables, a static indexing scheme could be used when the table is first loaded and at times where no updates happen, but it might be converted to an update-friendly dynamic index type as soon as some updates are performed. The conversion costs could be kept moderate by leveraging the highly efficient build() operation. Even more sophisticated forms of adaptivity are conceivable: An adaptive index might replicate certain table columns that are accessed frequently in conjunction with the nodes. This would enable more index-only queries and thus further boost performance.
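To sketch what such a policy might look like, the following hypothetical C++ fragment (entirely ours, not part of the prototype; all names and thresholds are placeholders) decides per workload window whether to stay with a dynamic index or to convert back to a static one via a bulk build:

#include <cstdint>

// Hypothetical adaptive index selection; IndexKind names mirror the two
// standard schemes discussed in this thesis, everything else is assumed.
enum class IndexKind { StaticPPPL, DynamicBOTree };

struct WorkloadWindow {
    uint64_t queries = 0;  // queries observed in the current window
    uint64_t updates = 0;  // structural updates observed in the current window
};

IndexKind chooseIndex(const WorkloadWindow& w, IndexKind current) {
    if (w.updates > 0)
        return IndexKind::DynamicBOTree;     // updates arrived: stay update-friendly
    const uint64_t quietThreshold = 100000;  // placeholder tuning knob
    // A long quiet, query-only window justifies converting back to the static
    // scheme, amortized by the highly efficient bulk build() operation.
    if (current == IndexKind::DynamicBOTree && w.queries >= quietThreshold)
        return IndexKind::StaticPPPL;
    return current;
}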

Recursive Common Table Expressions. There are certain situations where the features for RCTEs and hierarchical tables could interact. On the one hand, an RCTE operating on a hierarchical table could be accelerated if the system could deduce the kind of traversal that is being performed and then leverage the available hierarchy indexes accordingly. On the other hand, an RCTE could produce a hierarchical table from its “recursion tree” as a by-product. To do so, the semi-naïve algorithm would have to keep track of the parent rows that spawned each of the generated rows at the time the join with the recursive table is evaluated. This feature would enable applications to initially use an RCTE to explore an unknown graph, but subsequently switch to the more comfortable tools for hierarchies to analyze and work on the resulting tree.



Bibliography

[1] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. “The Lorel query language for semistructured data.” Int. Journal on Digital Libraries (JODL) vol. 1, no. 1 (1997), pp. 68–88.
[2] Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. “QRS: A robust numbering scheme for XML documents.” Proc. 19th Int. Conf. Data Engineering (ICDE). IEEE, 2003, pp. 705–707.
[3] Philip A. Bernstein and Nathan Goodman. “Concurrency control in distributed database systems.” ACM Computing Surveys (CSUR) vol. 13, no. 2 (1981), pp. 185–221.
[4] Kevin Beyer, Roberta J. Cochrane, Vanja Josifovski, Jim Kleewein, George Lapis, Guy Lohman, Bob Lyle, Fatma Özcan, Hamid Pirahesh, Normen Seemann, Tuong Truong, Bert van der Linden, Brian Vickery, and Chun Zhang. “System RX: One part relational, one part XML.” Proc. 2005 ACM SIGMOD Int. Conf. Management of Data. ACM, 2005, pp. 347–358.
[5] K. R. Blackman. “Technical Note: IMS celebrates thirty years as an IBM product.” IBM Systems Journal vol. 37, no. 4 (1998), pp. 596–603.
[6] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. “MonetDB/XQuery: A fast XQuery processor powered by a relational engine.” Proc. 2006 ACM SIGMOD Int. Conf. Management of Data. ACM, 2006, pp. 479–490.
[7] Peter Boncz, Stefan Manegold, and Jan Rittinger. “Updating the pre/post plane in MonetDB/XQuery.” Informal Proc. 2nd Int. Workshop on XQuery Implementation, Experience, and Perspectives (XIME-P). 2005.
[8] Robert Brunel and Jan Finis. “Eine effiziente Indexstruktur für dynamische hierarchische Daten.” Datenbanksysteme für Business, Technologie und Web (BTW) 2013. Workshopband P-216 (Mar. 2013), pp. 267–276.
[9] Robert Brunel, Jan Finis, Gerald Franz, Norman May, Alfons Kemper, Thomas Neumann, and Franz Färber. “Supporting hierarchical data in SAP HANA.” Proc. 31st Int. Conf. Data Engineering (ICDE). IEEE, Apr. 2015, pp. 1280–1291.
[10] Robert Brunel, Norman May, and Alfons Kemper. “Index-assisted hierarchical computations in main-memory RDBMS.” Proc. VLDB Endowment vol. 9, no. 12 (Aug. 2016), pp. 1065–1076.
[11] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. “Holistic Twig Joins: Optimal XML pattern matching.” Proc. 2002 ACM SIGMOD Int. Conf. Management of Data. ACM, 2002, pp. 310–321.
[12] Jing Cai and Chung Keung Poon. “OrdPathX: Supporting two dimensions of node insertion in XML data.” Proc. 20th Int. Conf. Database and Expert Systems Applications (DEXA). Springer-Verlag Berlin Heidelberg, 2009, pp. 332–339.
[13] Joe Celko. Trees and Hierarchies in SQL for Smarties. 2nd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers, Jan. 2012.
[14] Upen S. Chakravarthy, John Grant, and Jack Minker. “Logic-based approach to semantic query optimization.” ACM Transactions on Database Systems (TODS) vol. 15, no. 2 (1990), pp. 162–207.


[15] Damianos Chatziantoniou, Theodore Johnson, Michael Akinde, and Samuel Kim. “The MD-join: An operator for complex OLAP.” Proc. 17th Int. Conf. Data Engineering (ICDE). IEEE, 2001, pp. 524–533.
[16] Surajit Chaudhuri, Umeshwar Dayal, and Vivek Narasayya. “An overview of Business Intelligence technology.” Communications of the ACM vol. 54, no. 8 (2011), pp. 88–98.
[17] Sudarshan S. Chawathe and Hector Garcia-Molina. “Meaningful change detection in structured data.” Proc. 1997 ACM SIGMOD Int. Conf. Management of Data. ACM, 1997, pp. 26–37.
[18] Peter Pin-Shan Chen. “The Entity-Relationship Model—Toward a unified view of data.” ACM Transactions on Database Systems (TODS) vol. 1, no. 1 (1976), pp. 9–36.
[19] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. “Twig2Stack: Bottom-up processing of generalized tree pattern queries over XML documents.” Proc. 32nd Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2006, pp. 283–294.
[20] Sophie Cluet and Guido Moerkotte. “Efficient evaluation of aggregates on bulk types.” Proc. 5th Int. Workshop on Database Programming Languages (DBPL). Springer-Verlag Berlin Heidelberg, 1995.
[21] Edgar F. Codd. “A relational model of data for large shared data banks.” Communications of the ACM vol. 13, no. 6 (1970), pp. 377–387.
[22] Edith Cohen, Haim Kaplan, and Tova Milo. “Labeling dynamic XML trees.” SIAM Journal on Computing (SICOMP) vol. 39, no. 5 (2010), pp. 2048–2074.
[23] Common Warehouse Metamodel (CWM) Specification. OMG Document formal/03-03-02. Version 1.1. Object Management Group. Mar. 2003. www.omg.org/spec/CWM/1.1/PDF.
[24] Brian F. Cooper, Neal Sample, Michael J. Franklin, Gísli R. Hjaltason, and Moshe Shadmon. “A fast index for semistructured data.” Proc. 27th Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2001, pp. 341–350.
[25] Hugh Darwen and Christopher J. Date. “An overview and analysis of proposals based on the TSQL2 approach.” Date on Database: Writings 2000–2006. New York City, USA: Apress, 2007.
[26] Reinhard Diestel. Graph Theory. 4th ed. Graduate Texts in Mathematics. Volume 173. Springer-Verlag Berlin Heidelberg, 2010.
[27] Technische Produktdokumentation – CAD-Modelle, Zeichnungen und Stücklisten. Teil 1: Begriffe. DIN 199-1:2002–03. Deutsches Institut für Normung, 2002.
[28] Andrew Eisenberg and Jim Melton. “Advancements in SQL/XML.” ACM SIGMOD Record vol. 33, no. 3 (2004), pp. 79–86.
[29] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. “The SAP HANA Database—An architecture overview.” IEEE Data Engineering Bulletin vol. 35, no. 1 (2012), pp. 28–33.
[30] Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. How to Design Programs. An Introduction to Programming and Computing. Cambridge, MA, USA: The MIT Press, 2001.


[31] Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Franz Färber, and Norman May. “DeltaNI: An efficient labeling scheme for versioned hierarchical data.” Proc. 2013 ACM SIGMOD Int. Conf. Management of Data. ACM, June 2013, pp. 905–916.
[32] Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Färber. “Indexing highly dynamic hierarchical data.” Proc. VLDB Endowment vol. 8, no. 10 (Aug. 2015), pp. 986–997.
[33] Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Färber. “Order Indexes: Supporting highly dynamic hierarchical data in relational main-memory database systems.” The VLDB Journal vol. 26, no. 1 (Feb. 2017), pp. 55–80.
[34] Jan Finis, Martin Raiber, Nikolaus Augsten, Robert Brunel, Alfons Kemper, and Franz Färber. “RWS-Diff: Flexible and efficient change detection in hierarchical data.” Proc. 22nd ACM Int. Conf. Information and Knowledge Management (CIKM). ACM, Oct. 2013, pp. 339–348.
[35] Shel J. Finkelstein, Nelson Mattos, Inderpal Mumick, and Hamid Pirahesh. Expressing recursive queries in SQL. ISO/IEC JTC 1/SC 21 WG 3 Document X3H2-96-075r1. Joint Technical Committee ISO/IEC JTC 1, Mar. 6, 1996.
[36] Daniela Florescu and Donald Kossmann. “A performance evaluation of alternative mapping schemes for storing XML data in a relational database.” Research Report RR-3680. Institut national de recherche en informatique et en automatique (INRIA), May 1999.
[37] Goetz Graefe. “Query evaluation techniques for large databases.” ACM Computing Surveys (CSUR) vol. 25, no. 2 (1993), pp. 73–169.
[38] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. “Data Cube: A relational aggregation operator generalizing Group-By, Cross-Tab, and Sub-Totals.” Data Mining and Knowledge Discovery vol. 1, no. 1 (1997), pp. 29–53.
[39] Torsten Grust. “Accelerating XPath location steps.” Proc. 2002 ACM SIGMOD Int. Conf. Management of Data. ACM, 2002, pp. 109–120.
[40] Torsten Grust, Maurice van Keulen, and Jens Teubner. “Accelerating XPath evaluation in any RDBMS.” ACM Transactions on Database Systems (TODS) vol. 29, no. 1 (2004), pp. 91–131.
[41] Torsten Grust, Maurice van Keulen, and Jens Teubner. “Staircase Join: Teach a relational DBMS to watch its (axis) steps.” Proc. 29th Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2003, pp. 524–535.
[42] Torsten Grust, Jan Rittinger, and Jens Teubner. “Why off-the-shelf RDBMSs are better at XPath than you might expect.” Proc. 2007 ACM SIGMOD Int. Conf. Management of Data. ACM, 2007, pp. 949–958.
[43] Torsten Grust, Sherif Sakr, and Jens Teubner. “XQuery on SQL hosts.” Proc. 30th Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2004, pp. 252–263.
[44] Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy, Ajith Nagaraja Rao, Feng Tian, Stratis D. Viglas, Yuan Wang, Jeffrey F. Naughton, and David J. DeWitt. “Mixed mode XML query processing.” Proc. 29th Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2003, pp. 225–236.


[45] Birgitta Hauser. Hierarchical queries with DB2 Connect By. A new method for recursively processing data relationships. IBM Corp. Aug. 22, 2011. www.ibm.com/developerworks/ibmi/library/i-db2connectby.
[46] Michael Haustein, Theo Härder, Christian Mathis, and Markus Wagner. “DeweyIDs—The key to fine-grained management of XML documents.” Proc. 20th Brazilian Symposium on Databases (SBBD). Sociedade Brasileira de Computação, 2005, pp. 85–99.
[47] Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. “Generalized search trees for database systems.” Proc. 21st Int. Conf. Very Large Databases (VLDB). San Francisco, CA, USA: Morgan Kaufmann Publishers, 1995, pp. 562–573.
[48] Hosagrahar V. Jagadish, Shurug Al-Khalifa, Adriane Chapman, Laks Lakshmanan, Andrew Nierman, Stelios Paparizos, Jignesh Patel, Divesh Srivastava, Nuwee Wiwatwattana, Yuqing Wu, and Cong Yu. “Timber: A native XML database.” VLDB Journal vol. 11, no. 4 (2002), pp. 274–291.
[49] Christian S. Jensen, James Clifford, Ramez Elmasri, Shashi K. Gadia, Pat Hayes, and Sushil Jajodia. “A consensus glossary of temporal database concepts.” ACM SIGMOD Record vol. 23, no. 1 (1994), pp. 52–64.
[50] Christian S. Jensen, Torben Bach Pedersen, and Christian Thomsen. “Multidimensional databases and data warehousing.” Synthesis Lectures on Data Management. Ed. by Meral Özsoyoğlu. Vol. 2. 1. Morgan & Claypool Publishers, 2010, pp. 1–111.
[51] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. “Holistic Twig Joins on indexed XML documents.” Proc. 29th Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2003, pp. 273–284.
[52] RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format. Proposed Standard. Internet Engineering Task Force (IETF), Mar. 2014. tools.ietf.org/html/rfc7159.
[53] Shurug Al-Khalifa, Hosagrahar V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, and Yuqing Wu. “Structural Joins: A primitive for efficient XML query pattern matching.” Proc. 18th Int. Conf. Data Engineering (ICDE). IEEE, 2002, pp. 141–152.
[54] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. The Complete Guide to Dimensional Modeling. 2nd ed. John Wiley & Sons, 2002.
[55] Krishna Kulkarni and Jan-Eike Michels. “Temporal Features in SQL:2011.” ACM SIGMOD Record vol. 41, no. 3 (2012), pp. 34–43.
[56] Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and P. Bruce Berra. “Index structures for structured documents.” Proc. 1st ACM Int. Conf. Digital Libraries (DL). ACM, 1996, pp. 91–99.
[57] Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann. “Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age.” Proc. 2014 ACM SIGMOD Int. Conf. Management of Data (SIGMOD). ACM, 2014, pp. 743–754.
[58] Viktor Leis, Kan Kundhikanjana, Alfons Kemper, and Thomas Neumann. “Efficient processing of window functions in analytical SQL queries.” Proc. VLDB Endowment vol. 8, no. 10 (2015), pp. 1058–1069.
[59] Alon Y. Levy, Inderpal Singh Mumick, and Yehoshua Sagiv. “Query optimization by predicate move-around.” Proc. 20th Int. Conf. Very Large Databases (VLDB). San Francisco, CA, USA: Morgan Kaufmann Publishers, 1994, pp. 96–107.


[60] Changqing Li and Tok Wang Ling. “QED: A novel quaternary encoding to completely avoid re-labeling in XML updates.” Proc. 14th ACM Int. Conf. Information and Knowledge Management (CIKM). ACM, 2005, pp. 501–508.
[61] Changqing Li, Tok Wang Ling, and Min Hu. “Efficient processing of updates in dynamic XML data.” Proc. 22nd Int. Conf. Data Engineering (ICDE). IEEE, 2006.
[62] Changqing Li, Tok Wang Ling, and Min Hu. “Efficient updates in dynamic XML data: from binary string to quaternary string.” VLDB Journal vol. 17, no. 3 (2008), pp. 573–601.
[63] Quanzhong Li and Bongki Moon. “Indexing and querying XML data for regular path expressions.” Proc. 27th Int. Conf. Very Large Databases (VLDB). San Francisco, CA, USA: Morgan Kaufmann Publishers, 2001, pp. 361–370.
[64] Zhen Hua Liu, Muralidhar Krishnaprasad, and Vikas Arora. “Native XQuery processing in Oracle XMLDB.” Proc. 2005 ACM SIGMOD Int. Conf. Management of Data. ACM, 2005, pp. 828–833.
[65] Norman May and Guido Moerkotte. “Main memory implementations for binary grouping.” Proc. 2005 Int. XML Database Symposium (XSym). Springer-Verlag Berlin Heidelberg, 2005, pp. 162–176.
[66] Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, and Jennifer Widom. “Lore: a database management system for semistructured data.” ACM SIGMOD Record vol. 26, no. 3 (1997), pp. 54–66.
[67] Jason McHugh, Jennifer Widom, Serge Abiteboul, Qingshan Luo, and Anand Rajaraman. Indexing semistructured data. Technical Report. Stanford University, 1998.
[68] Dinesh P. Mehta and Sartaj Sahni. Handbook of Data Structures and Applications. Boca Raton, Florida, USA: CRC Press, Oct. 2004.
[69] Alberto O. Mendelzon, Flavio Rizzolo, and Alejandro Vaisman. “Indexing temporal XML documents.” Proc. 30th Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2004, pp. 216–227.
[70] Microsoft SQL Documentation – Relational databases – Hierarchical Data. Microsoft Corp. Mar. 14, 2017. docs.microsoft.com/en-us/sql/relational-databases/hierarchical-data-sql-server.
[71] Microsoft SQL Documentation – Multidimensional Expressions (MDX) Reference. Microsoft Corp. Mar. 2, 2016. docs.microsoft.com/en-us/sql/mdx/multidimensional-expressions-mdx-reference.
[72] Tova Milo and Dan Suciu. “Index structures for path expressions.” Proc. 7th Int. Conf. Database Theory (ICDT). Springer-Verlag Berlin Heidelberg, 1999, pp. 277–295.
[73] Jun-Ki Min, Jihyun Lee, and Chin-Wan Chung. “An efficient XML encoding and labeling method for query processing and updating on dynamic XML data.” Journal of Systems and Software (JSS) vol. 82, no. 3 (2009), pp. 503–515.
[74] Guido Moerkotte and Thomas Neumann. “Accelerating queries with Group-By and Join by Groupjoin.” Proc. VLDB Endowment vol. 4, no. 11 (2011), pp. 843–851.
[75] Karen Morton, Kerry Osborne, Robyn Sands, Riyaj Shamsudeen, and Jared Still. Pro Oracle SQL. 2nd ed. New York City, USA: Apress, 2013.
[76] Thomas Neumann. “Efficiently compiling efficient query plans for modern hardware.” Proc. VLDB Endowment. Vol. 4. 9. VLDB Endowment, 2011, pp. 539–550.


[77] Thomas Neumann, Sven Helmer, and Guido Moerkotte. “On the optimal ordering of maps and selections under factorization.” Proc. 21st Int. Conf. Data Engineering (ICDE). IEEE, 2005, pp. 490–501.
[78] Paul Nielsen, Mike White, and Uttam Parui. Microsoft SQL Server 2008 Bible. John Wiley & Sons, Sept. 2009.
[79] Patrick O’Neil, Elizabeth O’Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and Nigel Westbury. “ORDPATHs: Insert-friendly XML node labels.” Proc. 2004 ACM SIGMOD Int. Conf. Management of Data. ACM, 2004, pp. 903–908.
[80] Oracle Database 12c Release 2 (12.2). SQL Language Reference. Oracle Corp. Apr. 2017. docs.oracle.com/database/122/SQLRF/toc.htm.
[81] Oracle 9i OLAP Release 2 (9.2). User’s Guide. Oracle Corp. Mar. 2002. docs.oracle.com/cd/A97630_01/olap.920/a95295.pdf.
[82] Carlos Ordonez. “Optimization of linear recursive queries in SQL.” IEEE Transactions on Knowledge and Data Engineering (TKDE). Vol. 22. 2. IEEE, 2010, pp. 264–277.
[83] Carlos Ordonez, Achyuth Gurram, and Nirmala Rai. “Recursive query evaluation in a column DBMS to analyze large graphs.” Proc. 17th Int. Workshop Data Warehousing and OLAP (DOLAP). ACM, 2014, pp. 71–80.
[84] Shankar Pal, Istvan Cseri, Oliver Seeliger, Michael Rys, Gideon Schaller, Wei Yu, Dragan Tomic, Adrian Baras, Brandon Berg, Denis Churin, and Eugene Kogan. “XQuery implementation in a relational database system.” Proc. 31st Int. Conf. Very Large Databases (VLDB). VLDB Endowment, 2005, pp. 1175–1186.
[85] Glenn N. Paulley. “Exploiting functional dependence in query optimization.” Ph.D. Thesis. University of Waterloo, Apr. 2000.
[86] Torben Bach Pedersen and Christian S. Jensen. “Multidimensional database technology.” IEEE Computer vol. 34, no. 12 (2001), pp. 40–46.
[87] PostgreSQL 9.6.3 Documentation – Appendix F: Additional Supplied Modules – ltree. The PostgreSQL Global Development Group. 2017. www.postgresql.org/docs/9.6/static/ltree.html.
[88] Flavio Rizzolo and Alejandro A. Vaisman. “Temporal XML: modeling, indexing, and query processing.” VLDB Journal vol. 17, no. 5 (2008), pp. 1179–1212.
[89] Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases. 2nd ed. Sebastopol, CA, USA: O’Reilly Media, Inc., June 2015.
[90] SAP HANA Platform 2.0 SPS 01. SAP HANA SQL and System Views Reference. Document version 1.0. SAP SE. Apr. 12, 2017. help.sap.com/viewer/product/SAP_HANA_PLATFORM/2.0.01/en-US.
[91] SAP Software Solutions – Products – SAP Vora. SAP SE. 2017. go.sap.com/germany/product/data-mgmt/hana-vora-hadoop.html.
[92] SAP Vora Developer Guide – Using Hierarchies. SAP SE. 2017. help.sap.com/viewer/product/SAP_VORA/1.4/en-US.
[93] Sunita Sarawagi. “Indexing OLAP data.” IEEE Data Engineering Bulletin. Vol. 20. 1. IEEE, 1997, pp. 36–43.
[94] Patricia Griffiths Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. “Access path selection in a relational database management system.” Proc. 1979 ACM SIGMOD Int. Conf. Management of Data. ACM, 1979, pp. 22–34.
[95] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. “Relational databases for querying XML documents: limitations and opportunities.” Proc. 25th Int. Conf. Very Large Databases (VLDB). San Francisco, CA, USA: Morgan Kaufmann Publishers, 1999, pp. 302–314.
[96] Adam Silberstein, Hao He, Ke Yi, and Jun Yang. “BOXes: Efficient maintenance of order-based labeling for dynamic XML data.” Proc. 21st Int. Conf. Data Engineering (ICDE). IEEE, 2005, pp. 285–296.
[97] ISO/IEC Standard 9075: Information technology – Database languages – SQL. Joint Technical Committee ISO/IEC JTC 1, Dec. 2011.
[98] ISO/IEC Standard 9075-14: Information technology – Database languages – SQL, Part 14: XML-Related Specifications (SQL/XML). Joint Technical Committee ISO/IEC JTC 1, Dec. 2011.
[99] Igor Tatarinov, Stratis Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. “Storing and querying ordered XML using a relational database system.” Proc. 2002 ACM SIGMOD Int. Conf. Management of Data. ACM, 2002, pp. 204–215.
[100] Dimitri Theodoratos. “Exploiting hierarchical clustering in evaluating multidimensional aggregation queries.” Proc. 6th ACM Int. Workshop Data Warehousing and OLAP (DOLAP). ACM, 2003, pp. 63–70.
[101] Erik Thomsen. OLAP Solutions. Building Multidimensional Information Systems. 2nd ed. John Wiley & Sons, Apr. 2002.
[102] Feng Tian, David J. DeWitt, Jianjun Chen, and Chun Zhang. “The design and performance evaluation of alternative XML storage strategies.” ACM SIGMOD Record vol. 31, no. 1 (2002), pp. 5–10.
[103] The Transaction Processing Performance Council (TPC). www.tpc.org.
[104] Vadim Tropashko. “Nested intervals tree encoding in SQL.” ACM SIGMOD Record vol. 34, no. 2 (2005), pp. 47–52.
[105] Vadim Tropashko. “Nested intervals tree encoding with continued fractions.” Computing Research Repository (2004). arxiv.org/abs/cs.DB/0402051.
[106] Peter T. Wood. “Query languages for graph databases.” ACM SIGMOD Record vol. 41, no. 1 (2012), pp. 50–60.
[107] Wentao Wu, Jeffrey F. Naughton, and Harneet Singh. “Sampling-based query re-optimization.” Proc. 2016 ACM Int. Conf. Management of Data (SIGMOD). ACM, 2016, pp. 1721–1736.
[108] Extensible Markup Language (XML), Version 1.0 (5th ed.). W3C Recommendation. The World Wide Web Consortium (W3C), Nov. 26, 2008. www.w3.org/TR/xml.
[109] XML for Analysis Specification, Version 1.0. Microsoft Corp. and Hyperion Solutions Corp., Apr. 24, 2001. msdn.microsoft.com/en-us/library/ms977626.aspx.
[110] XML Path Language (XPath), Version 2.0 (2nd ed.). W3C Recommendation. The World Wide Web Consortium (W3C), Dec. 14, 2010. www.w3.org/TR/xpath20.
[111] XQuery: An XML Query Language, Version 1.0 (2nd ed.). W3C Recommendation. The World Wide Web Consortium (W3C), Dec. 14, 2010. www.w3.org/TR/xquery.
[112] XQuery Update Facility, Version 1.0. W3C Recommendation. The World Wide Web Consortium (W3C), Mar. 17, 2011. www.w3.org/TR/xquery-update-10.
[113] Liang Xu, Tok Wang Ling, Huayu Wu, and Zhifeng Bao. “DDE: From Dewey to a fully dynamic XML labeling scheme.” Proc. 2009 ACM SIGMOD Int. Conf. Management of Data. ACM, 2009, pp. 719–730.
[114] Jeffrey Xu Yu, Daofeng Luo, Xiaofeng Meng, and Hongjun Lu. “Dynamically updating XML data: Numbering scheme revisited.” World Wide Web Journal vol. 8, no. 1 (2005), pp. 5–26.
[115] Jung-Hee Yun and Chin-Wan Chung. “Dynamic interval-based labeling scheme for efficient XML query and update processing.” Journal of Systems and Software (JSS) vol. 81, no. 1 (2008), pp. 56–70.
[116] Fred Zemke, Krishna Kulkarni, Andy Witkowski, and Bob Lyle. Introduction to OLAP functions. ISO/IEC JTC 1/SC 32 WG 3 Document YGJ-068 = ANSI NCITS H2-99-154r2. Joint Technical Committee ISO/IEC JTC 1, May 1999.
[117] Pavel Zezula, Federica Mandreoli, and Riccardo Martoglia. “Tree signatures and unordered XML pattern matching.” Int. Conf. Current Trends in Theory and Practice of Computer Science (SOFSEM). Springer-Verlag Berlin Heidelberg, 2004, pp. 122–139.
[118] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. “On supporting containment queries in relational database management systems.” ACM SIGMOD Record vol. 30, no. 2 (2001), pp. 425–436.
[119] Ning Zhang, M. Tamer Özsu, Ashraf Aboulnaga, and Ihab F. Ilyas. “XSEED: Accurate and fast cardinality estimation for XPath queries.” Proc. 22nd Int. Conf. Data Engineering (ICDE). IEEE, 2006, p. 61.
[120] Yuping Zhang, Xinjun Wang, and Ying Zhang. “A labeling scheme for temporal XML.” Proc. 2009 Int. Conf. Web Information Systems and Mining (WISM). IEEE, 2009, pp. 277–279.