On-Demand ETL for Real-Time Analytics

TECHNISCHE UNIVERSITÄT KAISERSLAUTERN PHDDISSERTATION On-Demand ETL for Real-Time Analytics Thesis approved by the Department of Computer Science Technische Universität Kaiserslautern for the award of the Doctoral Degree Doctor of Engineering (Dr.-Ing.) to Dipl.-Inf. Weiping Qu Date of Defense: June 9, 2020 Dean: Prof. Dr. Jens Schmitt Reviewer: Prof. Dr.-Ing. Stefan Deßloch Prof. Dr.-Ing. Bernhard Mitschang D 386 For my father, mother, parents-in-law, my wife and my daughter(s) iii Acknowledgements This dissertation summarizes my research work during the period when I worked as a member of the Heterogeneous Information Systems (HIS) group at the University of Kaiserslautern while it was finished in the past two years when I am working in the Online Computing group at Ant Fi- nancial in Shanghai, China after I returned back from Germany. I think I am fortunate that the 5-year research experience helped me to successful- ly take the fist step towards industry while the most significant two-year working experience in turn adds more nutrition to this dissertation. First, I would like to thank my advisor, Prof. Dr.-Ing Stefan Deßloch for offering me the opportunity of pursuing my research work in the most interesting area in the world: database and information systems. His gui- dance on the research and teaching work helped me to complete my view in scientific research in a more systematic manner. In addition to work, I want to express my deepest gratitude to him, as he delivered strong support when I was facing my family issues during that tough time and also encouraged and guided me to finish this dissertation when I was strugg- ling with my work at Ant Financial. Next, I would like to thank Prof. Dr.-Ing. Bernhard Mitschang for taking the role of the second examiner and Prof. Dr. Christoph Garth for acting as chair of the PhD committee. Furthermore, I would like to thank my first mentor Dr. Tianchao Li, who guided me with my diploma thesis at IBM Böblingen and opened up the door of database research for me. Besides, I want to thank Prof. Dr.-Ing. Dr.h.c. Theo Härder who shows his great passion in the database research area and serves as a great example to the young generations. Moreo- ver, I am also grateful to Prof. Dr.-Ing Sebastian Michel for his interesting thoughts and valuable discussions. I am grateful to my master thesis students: Nikolay Grechanov, Dipesh Dangol, Sandy Ganza, Vinanthi Basavaraj and Sahana Shankar, who also contributed their efforts to this dissertation with their solid knowledge and highly motivated attitude. I am indebted to my former colleagues and friends, especially Dr. Tho- mas Jörg, Dr. Yong Hu, Dr. Johannis Schildgen, Dr. Sebastian Bächle, Dr. v Yi Ou, Dr. Caetano Sauer, Dr. Daniel Shall, Dr. Kiril Panev, Dr. Evica Mil- chevski, Dr. Koninika Pal, Manuel Dossinger, Steffen Reithermann and Heike Neu, with whom I had a great time in Kaiserslautern. Also, I would like to thank my best friends, Xiao Pan, Dr. Yun Wan and Yanhao He, who always have interesting ideas and made every day of these years in Kaiserlautern sparkling. Finally, I would like to express my love to my family who always supports me behind the curtain. I want to thank my parents, She Qu and Xingdong Zhang, who educated me the most strictly and love me the most deeply. Without them, I would never have gone that far. I want to thank my wife, Jingxin Xie, who always stands beside me and encourages, motivates and supports me towards the final success of my PhD carrier, and also thank her for bringing our precious daughter to this world, who really brings our life to a brand new world and makes it enjoyable. Weiping Qu Shanghai, Jan 2020 vi vii Abstract In recent years, business intelligence applications become more real-time and traditional data warehouse tables become fresher as they are conti- nuously refreshed by streaming ETL jobs within seconds. Besides, a new type of federated system emerged that unifies domain-specific computation engines to address a wide range of complex analytical applications, which needs streaming ETL to migrate data across computation systems. From daily-sales reports to up-to-the-second cross-/up-sell campaign activities, we observed various latency and freshness requirements set in these analytical applications. Hence, streaming ETL jobs with regu- lar batches are not flexible enough to fit in such a mixed workload. Jobs with small batches can cause resource overprovision for queries with low freshness needs while jobs with large batches would starve queries with high freshness needs. Therefore, we argue that ETL jobs should be self- adaptive to varying SLA demands by setting appropriate batches as nee- ded. The major contributions are summarized as follows. • We defined a consistency model for “On-Demand ETL” which ad- dresses correct batches for queries to see consistent states. Further- more, we proposed an “Incremental ETL Pipeline” which reduces the performance impact of on-demand ETL processing. • A distributed, incremental ETL pipeline (called HBelt) was introdu- ced in distributed warehouse systems. HBelt aims at providing consistent, distributed snapshot maintenance for concurrent table scans across different analytics jobs. • We addressed the elasticity property for incremental ETL pipeline to guarantee that ETL jobs with batches of varying sizes can be finished within strict deadlines. Hence, we proposed Elastic Queue Middleware and HBaqueue which replace memory-based data ex- change queues with a scalable distributed store - HBase. • We also implemented lazy maintenance logic in the extraction and the loading phases to make these two phases workload-aware. Besi- des, we discuss how our “On-Demand ETL” thinking can be exploi- ted in analytic flows running on heterogeneous execution engines. ix Contents 1 Introduction1 1.1 State-of-the-art ETL.......................1 1.2 Motivation............................4 1.2.1 Working Scope.....................4 1.2.2 Problem Analysis....................5 1.2.3 On-Demand ETL....................7 1.3 Research Issues.........................8 1.4 Outline.............................. 10 2 Preliminaries 13 2.1 Incremental View Maintenance................ 13 2.2 Near Real-time Data Warehouses............... 14 2.2.1 Incremental ETL.................... 15 2.2.2 Near real-time ETL................... 15 2.3 Lazy Materialized View Maintenance............ 16 2.3.1 Right-time ETL..................... 17 2.3.2 Relaxed Currency in SQL............... 18 2.4 Real-time Data Warehouse................... 19 2.4.1 One-size-fits-all Databases............... 19 2.4.2 Polystore & Streaming ETL.............. 20 2.4.3 Stream Warehouse................... 20 2.5 Streaming Processing Engines................. 21 2.5.1 Aurora & Borealis.................... 21 2.5.2 Apache Storm...................... 23 2.5.3 Apache Spark Streaming................ 25 2.5.4 Apache Flink...................... 27 2.5.5 S-store.......................... 29 2.5.6 Pentaho Kettle...................... 30 2.6 Elasticity in Streaming Engines................ 30 2.7 Scalable NoSQL Data Store: HBase.............. 32 2.7.1 System Design..................... 32 2.7.2 Compaction....................... 34 2.7.3 Split & Load Balancing................. 34 2.7.4 Bulk Loading...................... 35 xi Contents 3 Incremental ETL Pipeline 37 3.1 The Computational Model................... 37 3.2 Incremental ETL Pipeline................... 39 3.3 The Consistency Model..................... 41 3.4 Workload Scheduler...................... 43 3.5 Operator Thread Synchronization and Coordination.... 46 3.5.1 Pipelined Incremental Join.............. 46 3.5.2 Pipelined Slowly Changing Dimensions....... 49 3.5.3 Discussion........................ 53 3.6 Experiments........................... 53 3.6.1 Incremental ETL Pipeline Performance....... 54 3.6.2 Consistency-Zone-Aware Pipeline Scheduling... 56 3.7 Summary............................. 58 4 Distributed Snapshot Maintenance in Wide-Column NoSQL Databases 61 4.1 Motivation............................ 61 4.2 The HBelt System for Distributed Snapshot Maintenance. 63 4.2.1 Architecture Overview................. 63 4.2.2 Consistency Model................... 65 4.2.3 Distributed Snapshot Maintenance for Concurrent Scans........................... 68 4.3 Experiments........................... 70 4.4 Summary............................. 74 5 Scaling Incremental ETL Pipeline using Wide-Column NoS- QL Databases 75 5.1 Motivation............................ 75 5.2 Elastic Queue Middleware................... 77 5.2.1 Role of EQM in Elastic Streaming Processing Engines 77 5.2.2 Implementing EQM using HBase........... 79 5.3 EQM Prototype: HBaqueue.................. 82 5.3.1 Compaction....................... 82 5.3.2 Resalting......................... 83 5.3.3 Split Policy....................... 84 5.3.4 Load balancing..................... 85 5.3.5 Enqueue & Dequeue Performance.......... 89 5.4 Summary............................. 90 6 Workload-aware Data Extraction 93 6.1 Motivation............................ 93 xii Contents 6.2 Problem Analysis........................ 96 6.2.1 Terminology....................... 96 6.2.2 Performance Gain................... 100 6.3 Workload-aware CDC..................... 100 6.4 Experiments........................... 104 6.4.1 Experiment Setup.................... 104 6.4.2 Trigger-based CDC Experiment............ 104 6.4.3 Log-based CDC Experiment.............. 105 6.4.4 Timestamp-based CDC Experiment......... 107 6.5 Summary............................

On-Demand ETL for Real-Time Analytics

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support