Replication at the Speed of Change – a Fast, Scalable Replication Solution for Near Real-Time HTAP Processing

Replication at the Speed of Change – a Fast, Scalable Replication Solution for Near Real-Time HTAP Processing Dennis Butterstein Daniel Martin Jia Zhong Lingyun Wang Knut Stolze Felix Beier IBM Silicon Valley Lab IBM Research & Development GmbH 555 Bailey Ave Schonaicher¨ Strasse 220 San Jose, CA 95141, United States 71032 Boblingen,¨ Germany [email protected] [email protected] [email protected] fdanmartin,stolze,[email protected] ABSTRACT engine was first replaced with Netezza (IBM PureData Sys- 2 The IBM Db2 Analytics Accelerator (IDAA) is a state- tem for Analytics ). Netezza's design is to always use table of-the art hybrid database system that seamlessly extends scans on all of its disks in parallel, leveraging FPGAs to ap- the strong transactional capabilities of Db2 for z/OS with ply decompression, projection and filtering operations before the very fast column-store processing in Db2 Warehouse. the data hits the main processors of the cluster nodes. The The Accelerator maintains a copy of the data from Db2 for row-major organized tables were hash-distributed across all z/OS in its backend database. Data can be synchronized nodes (and disks) based on selected columns (to facilitate at a single point in time with a granularity of a table, one co-located joins) or in a random fashion. The engine itself or more of its partitions, or incrementally as rows changed is optimized for table scans; besides Zone Maps there are no using replication technology. other structures (e. g., indices) that optimize the processing IBM Change Data Capture (CDC) has been employed as of predicates. With the advent of IDAA Version 6, Netezza replication technology since IDAA version 3. Version 7:5:0 was replaced with Db2 Warehouse (Db2wh) and its column- introduces a superior replication technology as a replace- store engine relying solely on general purpose processors. ment for IDAA's use of CDC { Integrated Synchronization. Various minor deviations between Db2z and Netezza were In this paper, we present how Integrated Synchronization im- removed that way, which resolved many SQL syntax and proves the performance by orders of magnitudes, paving the datatype incompatibilities faced by our customers. way for near real-time Hybrid Transactional and Analytical Figure 1 gives an overview of the system: the Acceler- (HTAP) processing. ator is an appliance add-on to Db2z running on an IBM zEnterprise mainframe. It comes in form of a purpose- PVLDB Reference Format: built software and hardware combination that is attached to Dennis Butterstein, Daniel Martin, Knut Stolze, Felix Beier, Jia the mainframe via (redundant) network connections to al- Zhong, Lingyun Wang. Replication at the Speed of Change { a Fast, Scalable Replication Solution for Near Real-Time HTAP low Db2z to dynamically offload scan-intensive queries. The Processing. PVLDB, 13(12): 3245-3257, 2020. Accelerator enhances the Db2z database with the capabil- DOI: https://doi.org/10.14778/3415478.3415548 ity to efficiently process all types of analytical workloads, typical for data warehousing and standardized reports as 1. INTRODUCTION well as ad-hoc analytical queries. Furthermore, data trans- formations inside the Accelerator are supported to simplify IBM Db2 Analytics Accelerator (IDAA)1 is an enhance- or even avoid separate ETL pipelines. At the same time, ment of Db2 for z/OS (Db2z) for analytical workloads. The the combined hybrid database retains the superior transac- current version of IDAA is an evolution of IBM Smart An- tional query performance of Db2z. Query access as well as alytics Optimizer (ISAO)[17] which was using the BLINK administration use a common interface { from the outside, in-memory query engine[14, 3]. In order to quickly broaden the hybrid system appears mostly like a single database. the scope and functionality of SQL statements, the query Table maintenance operations (e. g., reorganization and 1http://www.ibm.com/software/data/db2/zos/analytics-accelerator/ statistics collection) are fully automated and scheduled au- tonomically in the background. An IDAA installation in- herits all System Z attributes known from Db2z itself: the This work is licensed under the Creative Commons Attribution- data is owned by Db2z, i. e., security and access control, NonCommercial-NoDerivatives 4.0 International License. To view a copy backup, data governance etc. are all managed by Db2z it- of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For self. The Accelerator does not change any of the existing any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights procedures or violate any of the existing concepts. As a licensed to the VLDB Endowment. result, the combination of Db2z and the Accelerator is a Proceedings of the VLDB Endowment, Vol. 13, No. 12 ISSN 2150-8097. 2 DOI: https://doi.org/10.14778/3415478.3415548 http://www.ibm.com/software/data/puredata/analytics/system/ 3245 Therefore, IDAA offers its \Incremental Update" feature SQL which refreshes the copy tables on the Accelerator by asyn- Db2 z/OS dialect Accelerator Db2 Client Optimizer Server Warehouse chronously monitoring the Db2z transaction log for changes. CALL Admin Application SQL (Db2 LUW) Stored Procedure Db2 LUW dialect Completed and committed transactions are replicated to the SQL Db2 LUW dialect Accelerator. For efficiency reasons, those changes are ap- Database Admin Stored Accelerator Query Runtime Procedures Services Execution plied in \micro batches" that typically contain data from Parallel Bulk INSERT Optimizer UNLOAD Unload Streams all transactions that committed during a certain time in- Batch Load Utility Database terval (e. g., the last 60 seconds). This technology was im- INSERT / Runtime UPDATE / plemented by the existing replication product called IBM DELETE 3 TXN Change Records Incremental Log Reader Infosphere Change Data Capture (CDC) . Log Update Db2 for z/OS Integrated Synchroniziation (InSync) is the successor for CDC in the context of IDAA. Since it is not a general purpose replication product, it is much more light-weight and specifically tailored to the Accelerator. Installation and administration of InSync is significantly reduced, while repli- Figure 1: IDAA Architecture cation performance exceeds CDC. 2. RELATED WORK hybrid system that consists of two architecturally very dif- Relational database system vendors have been focusing on ferent databases systems: a traditional RDBMS that has its analytical DBMS appliances for a while; popular examples strengths at processing highly concurrent OLTP workloads, are Oracle Exadata4, the Teradata appliance5, EXASOL's using well-established serialization and indexing concepts { EXASolution Appliance6, and SAP's HANA7 [7, 13]. Sim- and the Accelerator engine available for processing complex ilarly, appliances running MapReduce implementations to- queries. gether with an analytical DBMS have become increasingly Both systems not only complement each other, but can popular, sometimes under the name of \Big Data Analyt- also shield each other from workloads that may have a neg- ics Platform". They use MapReduce programs for analytics ative impact like exhaustive use of resources if dispatched to on unstructured data and add additional ETL flows into an the less suitable system. For these query routing decisions, analytical DBMS that runs on the same appliance for re- the Db2z optimizer implements a heuristic that decides if a porting and prediction on cleansed, structured data. Essen- given query should be processed by Db2z itself or if it should tially these systems promise an all-in-one solution for data be offloaded to the Accelerator. transformation and analytics of high-volumes of structured It is very important to understand that nothing has to and unstructured data. Examples are the EMC Greenplum be changed on existing applications that are already using DCA8 and the \Aster Big Analytics Appliance"9. Db2z in order to take advantage of the acceleration: they Common to all of these products is a shared nothing ar- are still connecting only to Db2z and use its specific SQL chitecture that hash-partitions the data of each table and dialect. Deep integration into existing Db2z components distributes it across the available compute nodes of the clus- ensures that only little specific training is required to oper- ter. There is a strong focus on optimizing long-running, ate the Accelerator. Db2z remains the owner of the data; scan-heavy queries and analytical functions. Data access is data maintenance, backup and monitoring procedures do mostly read-only and data modifications are done in large not change. batches through bulk load interfaces. The distributed, sha- Because IDAA operates on copies of the tables in Db2z, red nothing architecture makes these systems unsuitable for any changes to these tables must also be reflected on the OLTP with a mixed read/write workload and the require- Accelerator. This is normally done by running the IDAA ment for very low response times. Clearly, it is impossible ACCEL LOAD TABLES stored procedure that refreshes either to achieve the short access path and thus the low response an entire table or the changed (horizontal) partitions of that times of a traditional OLTP architecture because these sys- table on the Accelerator. This refresh mechanism matches tems require a query to go through all nodes of a cluster for many use-cases of IDAA, as analytical systems are often scanning a table and a coordinator node to combine interme- updated by ETL jobs that run in batch mode so that changes diate results before synchronizing and possibly re-distribute are applied in bulk on a scheduled basis. work at runtime. By design, these systems have a minimum Another use case, however, is to allow Accelerator-based response time of several tens or hundreds milliseconds, as reporting over \online" data, providing a solution that com- opposed to OLTP systems where the lowest response times bines the features of traditionally separated OLTP vs.

Replication at the Speed of Change – a Fast, Scalable Replication Solution for Near Real-Time HTAP Processing

Query All Tables in a Schema

A Platform for Networked Business Analytics BUSINESS INTELLIGENCE

Exasol User Manual Version 6.0.8

SQL Server Column Store Indexes Per-Åke Larson, Cipri Clinciu, Eric N

A Peek Under the Hood

EXASOL AG Our History – Inventing World’S Fastest In-Memory Database

IBM Cognos Analytics Version 11.1 : Administration and Security Guide Chapter 1

Columnar Storage in SQL Server 2012

Fastest Analytics Database

An Analytics Database Company to Watch

Exasolution User Manual

Bring Your Language (And Libraries) to Your Data