VectorH: Taking SQL-on-Hadoop to the Next Level

Andrei Costea‡  Adrian Ionescu‡  Bogdan Răducanu‡  Michał Świtakowski‡  Cristian Bârcă‡

Juliusz Sompolski‡  Alicja Łuszczak‡  Michał Szafrański‡

Giel de Nijs‡  Peter Boncz§

Actian Corp.‡  CWI§

ABSTRACT

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.

1. INTRODUCTION

Hadoop, originally an open-source copy of MapReduce, has become the standard software layer for Big Data clusters. Though MapReduce is losing market share as the programming framework of choice to an ever-expanding range of competing options and meta-frameworks, most prominently Spark [26], Hadoop has established itself as the software base for a variety of technologies, thanks to its evolution, which decoupled the YARN resource manager from MapReduce, and the wide adoption of its distributed file system HDFS, which has become the de-facto standard for cheap scalable storage on on-premise clusters. SQL-on-Hadoop systems allow existing SQL-based business applications to connect to the results of "big data" pipelines, adding to their worth and further accelerating the adoption of Hadoop in commercial settings. This paper describes VectorH, a new SQL-on-Hadoop system with advanced query execution, updatability, and YARN, HDFS and Spark integration.

SQL-on-Hadoop = MPP DBMS? A SQL-on-Hadoop system shares certain characteristics with an analytical parallel DBMS, such as Teradata. First of all, these so-called MPP systems are used in analytical workloads, which consist of relatively few very heavy queries – compared to transactional workloads – that perform heavy scans, joins, aggregations and analytical SQL, such as SQL'2003 window functions based on PARTITION BY, ROLLUP and GROUPING SETS. That said, these workloads are by no means read-only, and systems must also sustain a continuous stream of voluminous updates under limited concurrency.

In the past decade, analytical database systems have seen the rise of (i) columnar stores – which reduce the amount of IO and memory bandwidth needed in query processing – further boosted by (ii) integrated data compression and (iii) facilities that allow data skipping in many queries, e.g. through lightweight index structures like MinMax indexes that exploit natural orders and correlations in the data. Such modern analytical database systems typically (iv) use a query engine that is (re)designed to exploit modern hardware strengths. Two approaches for this are popular. Vectorwise [27] developed the first fully vectorized query engine, where all operations on data are performed on vectors of values rather than tuple-at-a-time, reducing query interpretation overhead, increasing data and code CPU cache locality, and allowing the use of SIMD instructions. The competing approach is to use Just-In-Time (JIT) compilation, such that a SQL query is translated into assembly instructions – see Impala [24].

A commonality between SQL-on-Hadoop and MPP systems is that both run on a cluster of nodes; both systems must distribute data in some way over this cluster, and when a single query is run it must ideally parallelize over all nodes – and all their cores – in this cluster. While this shared goal is obvious, it requires a lot of technical excellence in data placement, query optimization and execution to achieve near-linear parallel scalability on complex SQL queries. Both kinds of system, finally, must be prepared for nodes or links between them to go down (un)expectedly, and thus replicate data and take measures to achieve fault tolerance.

SQL-on-Hadoop ≠ MPP DBMS. There are also significant differences between SQL-on-Hadoop and MPP systems. MPP systems typically run on a dedicated cluster, and sometimes even on specific hardware (database machines), whereas SQL-on-Hadoop systems share a Hadoop cluster with other workloads. The task of dividing the hardware resources between concurrent parallel queries is already hard in MPP, but when other Hadoop jobs also take cores, memory, network and disk bandwidth, this becomes even harder. In order to counter this problem, all members of the Hadoop ecosystem should coordinate their resource usage through a resource manager such as YARN. By basing data storage on HDFS, SQL-on-Hadoop systems leverage scalability and fault tolerance for a large part from their Hadoop infrastructure, whereas MPP systems must introduce specific and proprietary mechanisms for fault tolerance (e.g. for adding and removing hardware, reacting gracefully to failures, monitoring status). MPP systems are typically more expensive, not only in upfront software and hardware cost, but also in terms of administration cost; the required expertise is more rare than that for maintaining a Hadoop cluster, which has become a standard in this space.

In some analysis scenarios, raw data first needs to be processed with methods and algorithms that cannot be expressed in SQL, but better in a programming language. MPP systems have introduced proprietary UDF (User Defined Function) interfaces with differing degrees of expressive power in an attempt to address this, whereas in SQL-on-Hadoop it is more natural to use parallel programming frameworks like Spark for such tasks. The advantage of this approach is that Spark is more powerful and flexible than any UDF API, and is open, so re-use in creating UDFs is more likely. SQL-on-Hadoop systems thus must fit seamlessly into "big data" pipelines, either through integrated APIs or through an ability to read and write large datasets in a number of standard formats (text, binary, ORC, Parquet) stored via HDFS on the cluster.

VectorH Features and Contributions. VectorH is a SQL-on-Hadoop system whose prime contribution is (i) to provide vastly superior query processing performance. This higher performance is based on a combination of a truly vectorized query engine, compressed columnar data formats with better lightweight compression methods than those used in common Hadoop formats, and better data skipping statistics than these common formats. Its SQL technology is mature, which is especially visible in the query plan quality produced by its cost-based query optimizer, but is also observed in its robust SQL support, including analytical SQL, encryption, authentication, and user role management.

A second system-level contribution is that VectorH (ii) instruments the HDFS block placement policy to determine where its compressed columnar blocks are replicated – by HDFS default on 3 nodes. VectorH uses this locality to always guarantee local IO, even if the cluster composition changes, e.g. due to hardware failures. This tight integration with HDFS is a novelty in SQL-on-Hadoop systems.

VectorH also pioneers (iii) elasticity features in Hadoop, by tightly integrating with YARN; it is able to scale its memory usage and core footprint up and down, in order to accommodate fluctuating workload needs on a Hadoop cluster that is shared with many other users and job types. VectorH further (iv) allows fine-grained updates (inserts, deletes, modifications) to its columnar compressed tables by leveraging Vectorwise's Positional Delta Tree [12] data structure, without updates affecting the high performance of read queries – which still work on the latest up-to-date state. Integrating its column storage formats and PDTs in the append-only HDFS environment is another system-level contribution of the system. Finally, the Spark integration of VectorH allows it to read all common Hadoop formats with full HDFS locality and also enables running user code in parallel on VectorH-resident data efficiently.
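Contribution (ii) can be pictured with a small illustrative model. The sketch below is not VectorH's actual placement code (which hooks into HDFS's block placement mechanism); the function name choose_targets, the round-robin choice of the extra replicas, and the node names are assumptions made purely for the example.

    def choose_targets(partition_owner, all_nodes, replication=3):
        """Pick datanodes for the replicas of a block belonging to a table
        partition, so that the node responsible for the partition always
        holds a local copy (illustrative model only)."""
        targets = [partition_owner]
        # place the remaining R-1 replicas on other nodes, e.g. round-robin
        for node in all_nodes:
            if len(targets) == replication:
                break
            if node != partition_owner:
                targets.append(node)
        return targets

    # example: a 5-node cluster, a partition owned by node "dn2"
    nodes = ["dn0", "dn1", "dn2", "dn3", "dn4"]
    print(choose_targets("dn2", nodes))   # ['dn2', 'dn0', 'dn1']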
2. FROM VECTORWISE TO VECTORH

In this section we first summarize the most relevant features of the Vectorwise system, the performance-leading single-server Actian Vector product (see www.tpc.org/downloaded_result_files/tpch_results.txt). Then, we provide a short overview of how this technology was used to architect the new shared-nothing Actian VectorH SQL-on-Hadoop product, which also serves as a roadmap for the paper.

Relevant Vectorwise Features. Vectorized processing performs any operation on data at the granularity of vectors (mini-columns) of roughly 1000 elements. This dramatically reduces the overhead typically present in row-based query engines that interpret query plans tuple-at-a-time [4]. Additionally, instruction cache locality increases as the interpreter stays longer in each function, and memory bandwidth consumption is reduced as all vectors reside in the CPU data cache. Vectorized execution exposes possibilities to leverage features of modern CPUs like SIMD instructions in database workloads, and increases analytical query performance by an order of magnitude. The vectorized execution model has been adopted by analytical subsystems such as IBM BLU [20] and SQL Server column store indexes [17], and it is also being deployed in new releases of Hive [13].
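To make the contrast with tuple-at-a-time interpretation concrete, here is a minimal sketch of vectorized primitives in the spirit described above; the vector size of 1024 and the function names are illustrative choices, not Vectorwise internals.

    VECTOR_SIZE = 1024  # roughly 1000 values per primitive call

    def select_greater_than(col_vector, constant):
        """One interpreted call processes a whole vector and returns a
        selection vector of qualifying positions (no per-tuple overhead)."""
        return [i for i, v in enumerate(col_vector) if v > constant]

    def project_add(col_a, col_b, sel):
        """Compute a+b only for the selected positions, again per vector."""
        return [col_a[i] + col_b[i] for i in sel]

    # toy example on one vector
    a = list(range(VECTOR_SIZE))
    b = [2] * VECTOR_SIZE
    sel = select_greater_than(a, 1000)
    out = project_add(a, b, sel)        # 23 results: 1003, 1004, ...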
Vectorwise introduced the compression schemes PFOR, PFOR-DELTA and PDICT, designed to achieve good compression ratios even with skewed frequency value distributions, as well as high decompression speed [28]. PDICT is dictionary compression; PFOR is zero-prefix compression of values represented as the difference from a block-dependent base (the Frame Of Reference). PFOR-DELTA compresses deltas between subsequent tuples using PFOR, and has been adopted by the Lucene system to store its inverted text index. These schemes represent values as thin fixed-bitwidth codes packed together, but store infrequent values uncompressed as "exceptions" later in the block. The letter P in these names stands for "Patched": decompression first inflates all values from the compressed codes, then in a second phase patches the values that were exceptions by hopping over the decompressed codes starting from the first exception, treating these codes as "next" pointers. Thus, the data-dependent part of decompression is separated into a short second phase, while most of the work (code inflation) is SIMD-friendly and branch-free. These decompression formats were recently re-implemented in Vectorwise in AVX2, yielding a function that decompresses 64 or 128 consecutive values in typically less than half a CPU cycle per value. The Hadoop formats ORC and Parquet, while clear improvements over predecessor formats such as sequence files and RCFile, have not been designed for branch-free SIMD decompression and are much slower to parse [25].
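A minimal Python model of the "patched" idea is sketched below, assuming plain integer lists instead of bit-packed codes and AVX2; the function names are invented for the example. Phase one adds the frame-of-reference base to every code; phase two follows the exception chain and patches the true values back in.

    def pfor_encode(values, base, bit_width):
        limit = 1 << bit_width
        codes, exceptions, exc_positions = [], [], []
        for v in values:
            delta = v - base
            if 0 <= delta < limit:
                codes.append(delta)
            else:
                codes.append(0)                 # placeholder, becomes a 'next' pointer
                exc_positions.append(len(codes) - 1)
                exceptions.append(v)
        # link exceptions: each exception slot stores the gap to the next one
        for i in range(len(exc_positions) - 1):
            codes[exc_positions[i]] = exc_positions[i + 1] - exc_positions[i]
        first_exc = exc_positions[0] if exc_positions else -1
        return codes, exceptions, first_exc

    def pfor_decode(codes, exceptions, first_exc, base):
        # phase 1: bulk inflation (branch-free and SIMD-friendly in the real system)
        out = [base + c for c in codes]
        # phase 2: short data-dependent pass that patches the exceptions
        pos = first_exc
        for value in exceptions:
            nxt = codes[pos]                    # gap to the next exception
            out[pos] = value
            pos += nxt
        return out

    codes, exc, first = pfor_encode([3, 4, 100, 5, 6, 200, 7], base=3, bit_width=3)
    print(pfor_decode(codes, exc, first, base=3))   # [3, 4, 100, 5, 6, 200, 7]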

In data warehouses, fact table order is often time-related, so date-time columns typically are correlated with tuple order. Vectorwise automatically keeps so-called MinMax statistics on all columns. These MinMax indexes store simple metadata about the values in a given range of records, and allow quick elimination of ranges of records during scan operations (skipping), saving both IO and CPU decompression cost. Both ORC and Parquet adopted this idea, but Parquet often cannot avoid IO as it stores MinMax information at a block position that can only be determined while parsing the header – forcing the block to be read [25].
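As an illustration of how such summaries enable skipping, the following sketch keeps one (min, max) pair per fixed-size range of a date-ordered column and only scans the ranges that can contain matches; the 4096-row granularity and the function names are assumptions for the example, not the actual VectorH block granularity.

    RANGE_SIZE = 4096   # illustrative granularity

    def build_minmax(column):
        """One (min, max) summary per range of RANGE_SIZE values."""
        return [(min(column[i:i + RANGE_SIZE]), max(column[i:i + RANGE_SIZE]))
                for i in range(0, len(column), RANGE_SIZE)]

    def ranges_to_scan(minmax, lo, hi):
        """Return indexes of ranges that may contain values in [lo, hi];
        all other ranges are skipped without IO or decompression."""
        return [r for r, (mn, mx) in enumerate(minmax) if mx >= lo and mn <= hi]

    # example: a date-correlated column stored as days since some epoch
    col = list(range(100_000))                  # naturally ordered, like l_shipdate
    mm = build_minmax(col)
    print(ranges_to_scan(mm, 50_000, 50_500))   # -> [12]: 1 of 25 ranges touched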

Tables in Vectorwise can have a clustered index declared on them. Rather than creating some kind of B-tree, this means that Vectorwise stores the table sorted on the index key. It also provides unclustered indexes (i.e. real index trees), which can help queries that access a few tuples to avoid a table scan. Vectorwise supports horizontal table partitioning by hash, which can be combined with indexing.

The main benefit of a clustered index is data skipping in scans for queries with range-predicates on the index keys. When a clustered index is declared on a foreign key, treatment is special, as the tuple order then gets derived from that of the referenced table, making the tables co-ordered. This allows merge-joins between them, and is very similar on the physical storage level to nested Hadoop data formats such as ORC and Parquet (where, say, an order object may contain nested lineitems). In the columnar layout used by ORC and Parquet, this means a table consists of columns with the same tuple order, and nested tables are co-ordered and merge-joinable using an also co-ordered "repeat-count" column. Such a tight, co-ordered layout is difficult to insert into or delete from, a problem solved in Vectorwise with PDTs.

Compressed, columnar, ordered data formats are read-optimized, and direct modifications on these would be costly. Vectorwise nevertheless allows high-performance updates, using a differential update mechanism based on Positional Delta Trees (PDTs) [11]. Positional Delta Trees are counting B+-trees that store updates (insert, delete, modify) in their leaves, where the interior nodes count the amount of inserts minus the amount of deletes in the sub-tree below them. PDTs can quickly translate back-and-forth between the old position of a tuple (the SID – stable ID) and its current position (RID), and thus allow ordered tables to be updated.
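The counting logic behind SID-to-RID translation can be illustrated with a flattened model; the real structure is a counting B+-tree with logarithmic updates and lookups, whereas this sketch uses sorted Python lists and invented method names purely to show the position arithmetic.

    import bisect

    class FlatPDT:
        """Flat stand-in for a counting B+-tree: differences keyed by the
        stable position (SID) in the on-disk table image."""
        def __init__(self):
            self.inserts = []   # sorted SIDs before which a new tuple is inserted
            self.deletes = []   # sorted SIDs of deleted stable tuples

        def add_insert(self, sid):
            bisect.insort(self.inserts, sid)

        def add_delete(self, sid):
            bisect.insort(self.deletes, sid)

        def sid_to_rid(self, sid):
            """Current position = stable position + inserts before it - deletes
            before it (a real PDT keeps exactly these counts per subtree)."""
            ins_before = bisect.bisect_right(self.inserts, sid)
            del_before = bisect.bisect_left(self.deletes, sid)
            return sid + ins_before - del_before

    pdt = FlatPDT()
    pdt.add_insert(10)          # a tuple inserted before stable position 10
    pdt.add_delete(3)           # stable tuple 3 deleted
    print(pdt.sid_to_rid(20))   # 20 + 1 - 1 = 20
    print(pdt.sid_to_rid(5))    # 5 + 0 - 1 = 4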

Storage Format Micro-Benchmarks. We performed micro-benchmarks on the SF=100 TPC-H lineitem table, sorted on the l_shipdate column, to corroborate our belief that the Vectorwise data storage is more efficient than the ORC and Parquet implementations. The topmost chart of Figure 1 shows that Vectorwise is faster on: SELECT max(l_linenumber) FROM lineitem WHERE l_shipdate ... with a range predicate whose selectivity is varied.

[Figure 1: Data Format Micro-Benchmarks. (a) Hot query time in seconds and (b) data read in GB, for Impala (Parquet), Presto (ORC), Presto (Parquet) and VectorH, at predicate selectivities 0.1, 0.3, 0.6 and 0.9. (c) Compressed size per lineitem column, totaling 19GB for ORC, 18GB for Parquet and 11GB for VectorH.]

The distributed exchange (DXchg) operators of VectorH are built on MPI, which performs well on high-speed Infiniband networks (where MPI provides a native binding, avoiding slow TCP/IP emulation). DXchg producers send messages of a fixed size (>= 256KB for good MPI throughput) to consumers and use double-buffering, such that communication overlaps with processing (i.e. partitioning + serialization). Tuples are serialized into MPI message buffers in a PAX-like layout, such that receivers can return vectors directly out of these buffers with minimal processing and no extra copying. It is often the case that the sets of producer and consumer nodes are overlapping, or even the same. As an optimization, for intra-node communication we only send pointers to sender-side buffers, in order to avoid the full memcpy() that MPI would otherwise need to do behind the scenes.

[Figure 4: DXchg Operator Implementation. Sender streams (threads) on each node partition vectors of tuples into per-destination MPI buffers; receivers on the consuming nodes hand vectors to the parent operators via next() calls.]

For the two partitioning DXchg variants, DXchgHashSplit (DXHS) and DXchgRangeSplit (DXRS), the original implementations, described in [7], used a thread-to-thread approach. Each DXchg sender partitions its own data with a fanout that is equal to the total number of receiver threads across all nodes. For senders to operate independently, they each require fanout private buffers. Typically, the nodes in the cluster are homogeneous (same num_cores). Hence, the partitioning fanout equates to num_nodes * num_cores, and considering double-buffering and the fact that each node has num_cores DXchg threads active, this results in a per-node memory requirement of 2 * num_nodes * num_cores^2 buffers. As VectorH gets deployed on larger systems with typically tens and sometimes more than 100 nodes, each often having around 20 cores, partitioning into 100*20=2000 buffers becomes more expensive due to TLB misses. More urgently, this could lead to buffering many GBs (2 * 100 * 20^2 * 256KB = 20GB). Not only is this excessive, but the fact that the buffers are only sent when full or when the input is exhausted (the latter being likely in case of 20GB buffer space per node, leading also to small MPI message sizes) can make DXchg a materializing operator, reducing parallelism.

Therefore, we implemented a thread-to-node approach, where rather than sending MPI messages to remote threads directly, these get sent to a node, reducing the fanout to num_nodes and the buffering requirements to 2 * num_nodes * num_cores. The sender includes a one-byte column that identifies the receiving thread for each tuple. The exchange receiver lets consumer threads selectively consume data from incoming buffers using this one-byte column. The thread-to-node DXchg is more scalable than the default implementation, yet on low core counts and small clusters the thread-to-thread implementation is still used, as it has a small performance advantage there.
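The buffer arithmetic behind this switch can be spelled out; the small sketch below merely reproduces the 2 * num_nodes * num_cores^2 versus 2 * num_nodes * num_cores buffer counts for the 100-node, 20-core, 256KB example above.

    BUF_SIZE = 256 * 1024          # minimum MPI message size used above

    def per_node_buffer_bytes(num_nodes, num_cores, thread_to_node=False):
        """Buffers held by the DXchg senders of one node (double-buffered)."""
        fanout = num_nodes if thread_to_node else num_nodes * num_cores
        senders_per_node = num_cores
        return 2 * senders_per_node * fanout * BUF_SIZE

    gb = 1024 ** 3
    print(per_node_buffer_bytes(100, 20) / gb)                      # ~19.5 GB ("20GB")
    print(per_node_buffer_bytes(100, 20, thread_to_node=True) / gb) # ~1 GB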

Query Optimization. Currently it is the session-master which parallelizes incoming queries. In the future, it will be relatively easy to distribute this work over all worker nodes, at least for read-queries, but given that updates need some form of central coordination, and the fact that analytical workloads have only a limited query throughput, we settled for this centralized approach for now. Most query optimization changes were made in the Parallel Rewriter that already existed in Vectorwise for single-server query parallelization [2]. It performs cost-based optimization using a dynamic programming algorithm and a simple cost model based on the cardinality estimates taken from the serial plan. A state in this search space is a tuple consisting of a physical operator label x, some structural properties sp and a current level of parallelism pl: s = (x, sp, pl). For each such state, we define as c(s) the best (estimated) cost of a parallel query plan for the sub-tree rooted at x that satisfies the structural data properties sp such that exactly pl streams consume the output of x (pl is always smaller than the maximum level of parallelism granted to a query by the scheduler). The structural properties used in VectorH are partitioning (which guarantees that two tuples with the same value on a set of keys will be output by the same stream) and sorting (which guarantees that tuples from each stream are sorted on some other keys). The aim is thus to find c(s = (root, ∅, 1)), where root is the full query tree.

The Parallel Rewriter contains a collection of rewrite rules that transform a state into a different state. We give three examples for MergeJoin. The first adds an XchgHashSplit on top and from a state s1 = (mj, partitioned[keys], pl) recurses into s1' = (mj, ∅, pl), relaxing the requirement for partitioning since the XchgHashSplit (on [keys]) takes care of it. The second adds an XchgUnion[pl] when at a state s2 = (mj, ∅, 1) and therefore recurses into s2' = (mj, ∅, pl). A final transformation just recurses into mj's children, from s3 = (mj, ∅, 1) into s3' = (child1, sorted[keys], 1) and s3'' = (child2, sorted[keys], 1). More details on the dynamic programming algorithm that traverses this search space are in [2].
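One way to picture the search is as recursion over such states; the toy sketch below encodes just the three MergeJoin rewrites described above (the State type, the frozenset properties and the max_pl argument are modeling choices, not the actual Parallel Rewriter data structures).

    from collections import namedtuple

    # a state s = (physical operator label, structural properties, parallelism level)
    State = namedtuple("State", "op props pl")

    def mergejoin_rewrites(s, max_pl):
        """Candidate rewrites for a MergeJoin state, mirroring the three
        examples in the text: each entry is (operator added on top, child states)."""
        assert s.op == "mj"
        candidates = []
        if "partitioned" in s.props:
            # 1. add an XchgHashSplit on top; the split takes care of partitioning
            candidates.append(("XchgHashSplit", [State("mj", frozenset(), s.pl)]))
        if not s.props and s.pl == 1:
            # 2. add an XchgUnion so one consumer stream reads max_pl parallel ones
            candidates.append(("XchgUnion", [State("mj", frozenset(), max_pl)]))
            # 3. or recurse into both children, requiring sorted inputs
            candidates.append((None, [State("child1", frozenset({"sorted"}), 1),
                                      State("child2", frozenset({"sorted"}), 1)]))
        return candidates

    print(mergejoin_rewrites(State("mj", frozenset({"partitioned"}), 8), max_pl=8))
    print(mergejoin_rewrites(State("mj", frozenset(), 1), max_pl=8))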

The initial MPP prototype of VectorH [7] modified the Parallel Rewriter to (i) make pl a list (instead of an integer) that contains the current parallelism level per node, (ii) adapt transformations to introduce distributed DXchg rather than local Xchg operators, and (iii) adapt the cost model for these operators. This was sufficient since the initial prototype assumed the presence of a global distributed file system without table partitioning, where everybody can read everything at equal IO cost. This was a simplifying assumption only, and is invalid in HDFS even though that is a global distributed filesystem, since tables have affinities to datanodes. Moreover, with the introduction of hash-partitioned tables in VectorH there appeared a number of new optimization opportunities, e.g. certain group-by/join queries on the partition key can be executed without DXchg operations. The Parallel Rewriter aims to avoid communication at all cost, and appropriately adds a high cost for DXchg operators in its model. Finally, with the addition of trickle updates on partitioned tables, ensuring data locality for update operators becomes a requirement rather than an optimization opportunity, as will be discussed in Section 6.

Detecting locality. We will illustrate some of the VectorH-specific transformations through a query on the TPC-H schema, on tables lineitem partitioned by l_orderkey, orders partitioned by o_orderkey and supplier non-partitioned (i.e. replicated):

SELECT FIRST 10 s_suppkey, s_name, count(*) as l_count
FROM lineitem, orders, supplier
WHERE l_orderkey=o_orderkey AND l_suppkey=s_suppkey AND
      l_discount>0.03 AND
      o_orderdate BETWEEN '1995-03-05' AND '1997-03-05'
GROUP BY s_suppkey, s_name
ORDER BY l_count

The distributed plan that VectorH generates is shown in Figure 5. The only network communication is high up in the query tree (the DXchg operator), namely splitting the s_suppkeys that match all the predicates and communicating partial results before selecting the top 10 suppliers. The Parallel Rewriter finds this plan using transformations like:

• local join: the opportunity to HashJoin the matching partitions of lineitem and orders is detected by tracking the origins of the join keys, detecting that the tables are partitioned on the join key.
• replicate build side of a join: in the HashJoin with supplier, we detect that the build side of the join operates entirely on replicated tables, and thus we split both sides of the join only locally (between threads on the same node). A variant of this would be to forgo splitting and build a shared hash table – the decision between these is made based on cost.
• partial aggregation: before the DXchgHashSplit required by the Aggr on s_suppkey (and s_name, but Vectorwise knows it is functionally determined), it is often better to aggregate locally first. Rather than letting each thread aggregate its stream locally, on medium-cardinality aggregations, inserting a local Xchg has the effect of bringing together more duplicate keys, and thus further reducing the amount of data that needs to be sent. This strategy is detected here, but VectorH also detects that an XchgHashSplit does not need to be inserted below the Aggr, as the stream is already partitioned on s_suppkey at that point.

With all three rules applied, on a 6-node cluster with 40Gbit/s Infiniband, VectorH runs the example query on TPC-H SF-500 in 5.02s. Without partial aggregation this is 5.64s, and without replication in the build 5.67s. The biggest effect is from local joins: without this, performance deteriorates to 25.51s, whereas without any of these rewrites it is 26.14s.

[Figure 5: Example Distributed Query Plan.
TopN (final) [l_count]
  DXchgUnion
    TopN (partial) [l_count]
      Aggr (final) [s_suppkey,s_name][l_count]
        DXchgHashSplit [s_suppkey]
          Aggr (partial) [s_suppkey,s_name][l_count]
            HashJoin [l_suppkey][s_suppkey]
              XchgHashSplit [l_suppkey]
                HashJoin [l_orderkey][o_orderkey]
                  Select [l_discount>0.03] over Scan [lineitem] (partitioned)
                  Select [o_orderdate in ..] over Scan [orders] (partitioned)
              XchgHashSplit [s_suppkey]
                Scan [supplier] (replicated)]
Some of the changes that were made to accommodate these new transformations augmented the state structure by (i) adding a replicated boolean field to mark that the entire sub-tree rooted at the state's operator will be replicated on all nodes, (ii) extending the partitioning structural property to also include a mapping of each partition to streams/nodes, and (iii) marking whether the partitioning property is partial or not.

6. TRANSACTIONS IN HADOOP

Vectorwise transaction management is described in [12], and consists of the following elements:

• As described in Section 3, updates (insert, modify, delete) are administered as differences, stored in a Positional Delta Tree (PDT) that is kept for each table (partition) in RAM. Not all updates go to PDTs; large inserts to unordered tables are appended directly on disk.
• PDTs can be stacked (differences on differences on .. on a table). Isolation is provided by sharing among all queries a slow-moving large Read-PDT that contains direct differences to a persistent table, and stacked on that a smaller Write-PDT that contains differences to the Read-PDT. When it starts, a transaction creates an empty private Trans-PDT stacked on top of it all.
• While a transaction runs, it gathers changes w.r.t. the transaction start in the Trans-PDT. This provides snapshot isolation, since commits by concurrent transactions make a copy-on-write of the master Write-PDT. If such a situation occurs, the old master Write-PDT is freed as soon as all transactions referencing it finish.
• When a transaction commits, its Trans-PDT is serialized into the global database state, which may have advanced in the meantime. PDT serialization not only transforms the Trans-PDT contents but also implements optimistic concurrency control, as it will detect write-write conflicts at tuple granularity. Such conflicts, if present, force the transaction to abort.
• During commit, which happens atomically (globally locked), the serialized Trans-PDT is written into a Write Ahead Log (WAL) for persistence, and its changes are propagated into the master Write-PDT; this makes the system reach the new database state that new incoming transactions will see. The changes from the Write-PDT are propagated to the Read-PDT when the size of the Write-PDT reaches a certain threshold.

Vectorwise uses a single, global WAL that is also used to log other meta-data, such as DDL commands, the creation and deletion of blocks in column files, and changes to the contents of MinMax indexes.
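The optimistic write-write check at serialization time can be pictured as follows; this is only a schematic stand-in, assuming each transaction tracks the (table, stable position) pairs it changed, which is a simplification of what PDT serialization actually records.

    class TransactionModel:
        """Toy model of optimistic concurrency control at PDT-serialize time."""
        def __init__(self, start_commit_seq):
            self.start = start_commit_seq   # database state seen at start
            self.written = set()            # (table, stable position) pairs

        def update(self, table, sid):
            self.written.add((table, sid))

    def try_commit(txn, commit_log, next_seq):
        """commit_log: list of (seq, written-set) of already-committed txns.
        Abort if a txn that committed after our snapshot wrote the same tuples."""
        for seq, their_writes in commit_log:
            if seq > txn.start and their_writes & txn.written:
                return False                          # write-write conflict -> abort
        commit_log.append((next_seq, txn.written))    # serialize our changes
        return True

    log = []
    t1 = TransactionModel(start_commit_seq=0); t1.update("orders", 42)
    t2 = TransactionModel(start_commit_seq=0); t2.update("orders", 42)
    print(try_commit(t1, log, 1))   # True
    print(try_commit(t2, log, 2))   # False: tuple 42 changed since t2 started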
Distributed Transactions in VectorH. We now summarize how VectorH builds on this for distributed transaction management. Since the PDTs store all committed updates in RAM and grow over time, holding them in memory on all nodes would not scale. Rather than a single WAL, VectorH uses table partition-specific WALs, and only the node that is responsible for a table partition (see Sections 3 and 4) reads this WAL at startup, keeps the respective PDTs in memory, and writes to it as part of the commit sequence. (This was needed to ensure correctness in situations where responsibilities change over time or where the number of threads allocated to a query does not match the number of table partitions. It also has the added benefit of being flexible in concurrency scenarios, i.e. #partitions is not strongly tied to the parallelism level.) Update queries get a distributed query plan that ensures that each table partition is updated at its responsible node, and hence modifies PDTs on the right node or, if applicable, performs HDFS appends to the right datanode.

VectorH introduces 2PC (2-Phase Commit) to ensure ACID properties for distributed transactions, where a much-reduced global WAL is written to by the session-master, who coordinates transaction processing. The fact that HDFS is a distributed filesystem, and any worker can read (and potentially write) this global WAL, means that the role of session-master can be taken over by any other worker in case of session-master failure.
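A bare-bones sketch of such a two-phase commit between the session-master and the workers responsible for the touched partitions is shown below; the class names and the in-memory lists standing in for WALs are placeholders, not the real VectorH protocol code.

    class Worker:
        """A worker that owns some table partitions and their WALs."""
        def __init__(self, name):
            self.name, self.wal = name, []

        def prepare(self, txn_id, changes):
            # phase 1: persist the changes so the commit can be redone after a crash
            self.wal.append(("prepare", txn_id, changes))
            return True                      # vote yes (a real worker may vote no)

        def finish(self, txn_id, decision):
            # phase 2: make the outcome durable and (on commit) apply to the PDTs
            self.wal.append((decision, txn_id))

    def session_master_commit(txn_id, plan, global_wal):
        """plan: {worker: partition-local changes}. Returns True if committed."""
        votes = [w.prepare(txn_id, ch) for w, ch in plan.items()]
        decision = "commit" if all(votes) else "abort"
        global_wal.append((decision, txn_id))        # the much-reduced global WAL
        for w in plan:
            w.finish(txn_id, decision)
        return decision == "commit"

    w1, w2 = Worker("node1"), Worker("node2")
    print(session_master_commit(7, {w1: ["orders p3"], w2: ["lineitem p3"]}, []))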
Log Shipping. Not all VectorH tables are partitioned: typically, small tables will not use partitioning, and these can be scanned and cached in the buffer pool by all workers (therefore we consider them "replicated"). All workers need to read the PDTs of these replicated tables from their WALs on startup and keep them in RAM, in order to ensure that table scans on these see the latest image. The commit protocol was extended with log-shipping for these replicated tables, such that all changes to them are broadcast during transaction processing to all workers, who can then apply the changes (i.e., to the replicated PDTs). We could alternatively have forced a query plan that executes updates on replicated tables to replicate this work on all nodes. However, log-shipping consumes fewer resources and, in the future, we plan to allow nodes that have a table partition HDFS-local but are not responsible for it to also participate in query processing – for increased parallelism and better load-balancing. For this functionality, the responsible node will have to ship its log actions to the (R-1) nodes that also store the table partition. Hence, log-shipping is useful. The log actions sent over the network use the same format as in the on-disk transaction log, and are applied in a manner similar to the replay of the transaction log on startup, allowing reuse of existing code and the testing infrastructure.

Referential Integrity. The ability to perform concurrent updates is also provided for schemas with an enforced unique key or foreign key constraint. It must not be allowed for two transactions to insert the same key if a unique key is defined. If the table is partitioned and the partition key is a subset of the unique key, VectorH verifies such constraints by performing node-local verification only. Foreign key constraints also prohibit inserting keys that do not exist in the referenced table or have been concurrently deleted from the referenced table. If one of the involved tables is non-partitioned (i.e. replicated), typically the smaller referenced side, all nodes have a replica of the committed PDTs and it is possible to find conflicts if all nodes execute the verification. If both tables are partitioned, and co-located at their responsible nodes, VectorH again performs quick node-local verification. The default policy for integrity constraints that cannot be checked without communication is to reject concurrent updates – though this policy may be refined in a future release. However, this policy covers all typical cases, as it is most natural to partition data on the primary key and the foreign key referencing it.

MinMax Indexes. MinMax indexes are small table summaries kept for each table partition; a MinMax index conceptually divides a table into a limited number of large tuple ranges and, for each range, keeps the Min and Max value of each column. Whereas ORC and Parquet store this information inside their file format, VectorH stores it separately, as part of the WAL. The rationale is that MinMax information is intended to help prevent data accesses, therefore it is better to store it separately from that data. Because in VectorH only the responsible nodes have this information, yet during query optimization this information is consulted (e.g. by the Parallel Rewriter), an MPI network interface was developed to allow MinMax indexes to be consulted remotely. This interface was engineered such that it is possible to resolve all MinMax information needed for an entire query (which may involve multiple selection predicates on multiple tables) in a single network interaction. MinMax information is relatively easy to maintain: deletes are ignored, and for inserts and modifies the Min and Max extremes can just be widened using the new values, without any need to scan the old values in the table. MinMax indexes are rebuilt from scratch as part of update propagation, which already happens in distributed fashion in VectorH.
Update Propagation. The in-memory PDT structures cannot grow forever. Vectorwise implements a background process called update propagation that is responsible for flushing PDTs to the compressed column store. Update propagation is triggered based on the size of the PDTs as well as the fraction of tuples that reside in memory. The latter policy cares for high performance of table scans, which are affected by the merging of the on-disk data with PDT data. The internal storage of PDTs is column-wise, as this improves memory access locality and saves bandwidth in table scans. We found that in practice inserts account for most of the PDT volume. To make update propagation more efficient, VectorH introduces an algorithm that is able to separate tail inserts from other types of updates. It also provides a query option whereby inserts to unordered tables, which normally would be direct appends, get buffered as PDT inserts, as for very small inserts this provides better performance (no IO). In a future release, this decision should become automatic. Deletes are stored efficiently in PDTs, especially for contiguous ranges of deleted tuples, while modifies often concern only one column. Flushing tail inserts only creates new data blocks and does not modify existing ones. Other kinds of updates require re-compression of existing blocks and can be flushed with a lower frequency. In future releases we will exploit the horizontally fragmented table partitions, stored in block chunk files (see Section 3), making the decision for each chunk file to either (i) re-write it in a new file with the PDT changes applied and delete the old one, or (ii) avoid rewriting it if there are few updates on it – moving these updates to the new PDTs, rather than fully emptying them.
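The per-chunk-file decision described above boils down to a threshold test; the sketch below is a guessed formulation (the 5% threshold and all names are invented) of the rewrite-versus-carry-over choice.

    REWRITE_THRESHOLD = 0.05    # invented: rewrite once >5% of a chunk changed

    def propagate_chunk(chunk_rows, pdt_updates_for_chunk):
        """Decide per chunk file: (i) rewrite it with the PDT changes applied,
        or (ii) keep the file and carry the few updates over to the new PDT."""
        if len(pdt_updates_for_chunk) / chunk_rows > REWRITE_THRESHOLD:
            return "rewrite chunk, drop its PDT entries"
        return "keep chunk, move its updates to the new PDT"

    print(propagate_chunk(1_000_000, range(200)))      # keep chunk
    print(propagate_chunk(1_000_000, range(120_000)))  # rewrite chunk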

7. CONNECTIVITY WITH SPARK

Vectorwise has a vwload loader utility for bulk-loading. It provides a range of features enabling users to load from a variety of input files and handle errors in a flexible way. For example, it allows specifying custom delimiters, loading only a subset of columns from the input file, performing character set conversion, using custom date formats, skipping a number of errors, logging rejected tuples to a file, etc. VectorH extends vwload with the capability to load data from HDFS in parallel. To achieve high efficiency, its functionality was moved into a vwload operator that runs directly inside the VectorH server.

Spark-VectorH Connector. SparkSQL [3] is a module built on top of Apache Spark that enhances it with relational processing capabilities. It has seen increasing popularity among data scientists because of its versatility and tight integration with various systems from the Hadoop ecosystem – Parquet, Avro or Hive, to name a few. In early 2015, its Data Source API was released, providing users with the ability to develop their own data source that can export data into SparkSQL, but that can also ingest data from SparkSQL into another database system.

For this connector, some new operators were needed in VectorH: ExternalScan, an operator that is able to process binary data coming from multiple network sockets (in parallel), and ExternalDump, which is able to output binary data in parallel through network sockets. Using these operators, we were able to implement (i) the insert() and (ii) the buildScan() methods of the BaseRelation interface exposed by the Data Source API. A prerequisite for improved performance was to implement our own VectorH RDD, which extends Spark's RDD (Resilient Distributed Dataset) interface. This new RDD has exactly as many partitions as ExternalScan/ExternalDump operators and overrides the getPreferredLocations() method in order to specify the affinity of each RDD partition to the location (hostname) of its corresponding ExternalScan/ExternalDump operator. Consequently, we instruct Spark's scheduler where to process those VectorH RDD partitions to get local Spark-VectorH transfers. Finally, we add a NarrowDependency between the VectorH RDD and its parent RDD that defines the mapping between parent RDD partitions and VectorH RDD partitions, using an algorithm similar to Hopcroft-Karp matching in bipartite graphs. Figure 6 shows an Input RDD with 5 partitions mapped to 2 VectorH ExternalScan operators on 4 nodes. Each input partition has an affinity to exactly two nodes (e.g. HDFS blocks in a scenario where replication is set to R=2). The dashed arrows represent these affinities, the solid arrows represent assignments that respect affinity, while the dot-dash arrow represents an assignment that is forced to ignore affinity information (and therefore possibly incurs network communication).

[Figure 6: Mapping Spark RDDs to VectorH RDDs and ExternalScan operators. Five Input RDD partitions on nodes 1-4 are assigned to the two VectorH RDD partitions backing the ExternalScan operators.]

We are currently able to ingest data into VectorH coming from arbitrary Spark sources: files stored in formats like Parquet, Avro, ORC, etc., but we can also perform complex computation in Spark and only load the transformed data into VectorH. Furthermore, a SparkSQL user can perform computations in Spark on data read (and possibly filtered) from VectorH. A benefit of loading data into VectorH from Spark is that Spark(SQL) creates one RDD partition per input HDFS block. In scenarios where VectorH and Spark run on the same set of nodes, we can instrument assignments in such a way that all blocks are read locally with short-circuit HDFS reads. This is in contrast with the standard vwload utility, which typically reads remote blocks over HDFS.
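The affinity-respecting assignment of input partitions to ExternalScan operators can be approximated greedily, as in the sketch below; this is a simplification of the matching described above (the real connector uses an algorithm similar to Hopcroft-Karp and Spark's NarrowDependency machinery), and all names are illustrative.

    def assign_partitions(partition_affinities, operator_hosts):
        """partition_affinities: {partition: set of hosts holding its blocks}.
        operator_hosts: {operator: host}. Greedy, affinity-first assignment."""
        assignment = {}
        load = {op: 0 for op in operator_hosts}
        for part, hosts in partition_affinities.items():
            local = [op for op, h in operator_hosts.items() if h in hosts]
            candidates = local or list(operator_hosts)   # fall back to remote reads
            op = min(candidates, key=lambda o: load[o])  # keep operators balanced
            assignment[part] = op
            load[op] += 1
        return assignment

    parts = {"p0": {"n1", "n2"}, "p1": {"n2", "n3"}, "p2": {"n4", "n5"},
             "p3": {"n1", "n3"}, "p4": {"n3", "n4"}}
    ops = {"scan@n1": "n1", "scan@n3": "n3"}
    print(assign_partitions(parts, ops))   # only p2 cannot be placed locally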

Performance. We tested loading 650GB in 72 CSV input files with 10 uniformly distributed integer columns into VectorH on the same 6-node cluster used for the experiment in Figure 5. Using the vwload utility this takes 1237 seconds. However, by distributing the CSV input files in HDFS such that each node has 12 of them local, and tweaking the parameter order in vwload to make the VectorH load operators read only local files, this is reduced to 850 seconds. Using the Spark-VectorH Connector, this works out-of-the-box in 892 seconds, which is impressive given that the data is read and parsed in a different process.

8. EVALUATION IN TPC-H SF1000

We evaluated the performance of VectorH on a Hadoop cluster on the TPC-H benchmark dataset at scale factor 1000, and compare it with the latest versions of Impala, Apache Hive, HAWQ and SparkSQL. (This evaluation does not represent an official TPC-H score, is not audited, and we omit overall TPC-H metrics. All DDLs and SQL queries are at https://github.com/ActianCorp/VectorH-sigmod2016.) The cluster consists of 10 nodes with Hadoop 2.6.0 (Cloudera Express 5.5.0) installed. One node runs the Hadoop namenode and other services (Cloudera Manager, etc.), so we used the remaining 9 machines for the SQL-on-Hadoop experiments. The nodes all have two Intel Xeon E5-2690 v2 CPUs at 3.00GHz, totaling 20 real cores (40 with hyper-threading), and have 256GB RAM (1866 MT/s). Each node has a 10Gb Ethernet card and 24 magnetic disks of 600GB (2.5-inch, 10K RPM), one holding the OS (CentOS 6.6) and the others used for data/HDFS.

VectorH. We tested a pre-release version of VectorH 5.0, built with Intel MPI 5.1.1. Our DDL resembles that used in previous audited TPC-H runs of Vectorwise (Actian Vector): it declares all primary and foreign keys, except for a PK on lineitem. Clustered indexes are defined for region and part on their primary keys; orders is clustered on o_orderdate, and lineitem, partsupp and nation are clustered on their foreign keys l_orderkey, ps_partkey and n_regionkey, respectively.
We also partition lineitem and orders on l_orderkey and o_orderkey respectively, part and partsupp on p_partkey and ps_partkey respectively, and customer on c_custkey. This partitioning scheme, which in all cases uses 180 partitions, means that joins between lineitem and orders, as well as between part and partsupp, are co-located mergejoins. The clustered indexes cause selections on dates (of any kind) to enable data skipping thanks to MinMax indexes. Total data size was 327GB (in between Parquet at 347GB and ORC at 327GB – both using snappy compression), and thus I/O plays no role. VectorH is CPU-bound on TPC-H, though in Q16 we measured 45% network cost, in Q2, Q5, Q10, Q14 and Q15 around 30%, and in Q13, Q20 and Q22 around 20%, caused by exchange operators with all-to-all communication.

Impala. We tested Impala 2.3 with the official Cloudera TPC-H queries (github.com/cloudera/Impala/blob/34dc5bd210ddb7af23735f831033d769b0134821/testdata/workloads/tpch), which are more recent than those used in [8]. lineitem and orders were partitioned by l_shipdate and o_orderdate, respectively, to trigger Impala's partition pruning. In Q19 predicate pushdown is not working, which appears to be a known issue.

Hive. We tested Apache Hive 1.2.1 using all available performance optimizations, such as the Tez execution engine, vectorization, predicate pushdown, the cost-based optimizer and the correlation optimizer [13]. We used the ORC file format with Snappy compression. Partitioning with bucketing allows Hive to use local joins and improves the performance of a number of queries. The tables are bucketed by the same keys as in the VectorH schema, which leads to the same co-locations. We tuned Tez and YARN so that the number of containers is equal to the number of cores and all RAM is used. Also, we tuned the split size: one mapper is scheduled per container, so the map phase ends in a single wave. The reduce phase uses Hive's auto reducer parallelism feature. We found some problems: Q5 did not finish in 2 hours, and in Q17 the correlation optimization did not trigger, so lineitem gets scanned twice. Finally, we found that decimal types in columns such as l_extendedprice, l_discount and l_tax are much slower than 32-bit floats; however, we used decimals to be fair to the other systems and because the rounding errors inherent to floating-point arithmetic are not acceptable in business queries on monetary values.

HAWQ. We tested HAWQ 1.3.1, with 10 segments per node, each with a 21GB vmem limit, 12GB for work memory, 6GB for query statements and 256MB shared buffers; merge join optimizations were also enabled. The table schemas were taken from Pivotal (github.com/pivotalguru/demos/tree/master/TPC-H%20Benchmark), in which Parquet storage with snappy compression is used for the LINEITEM and ORDERS tables. We tried to further improve the performance by partitioning LINEITEM and ORDERS on dates with 1-month intervals, but this just slowed down many queries.

SparkSQL. We tested SparkSQL 1.5.2 with Parquet storage and snappy compression, and partitioned on o_orderdate and l_shipdate. SparkSQL was configured to cache tables in RAM when possible. Most queries had to be re-written to avoid IN/EXISTS/NOT EXISTS subqueries. Also, some of the queries generated Cartesian products and had to be manually rewritten to explicitly define the join order.

Figure 7: Table: TPC-H SF1000 results (seconds); DNF = did not finish. Chart: how many times faster is VectorH? (The original chart plots, on a log scale, the speedup of VectorH over HAWQ, SparkSQL, Impala and Hive per query.)

          Q1     Q2    Q3    Q4     Q5    Q6    Q7    Q8    Q9     Q10   Q11   Q12   Q13   Q14   Q15   Q16   Q17    Q18   Q19   Q20   Q21   Q22
VectorH   1.5    1.14  3.16  0.17   1.94  0.31  2.75  1.31  11.11  1.21  1.69  0.34  3.66  0.83  1.63  1.68  1.24   0.99  1.32  2.15  1.48  2.84
HAWQ      158.2  21.46 32.06 38.21  36.38 20.19 44.74 48.38 766.4  32.97 12.48 31.75 27.97 19.47 31.58 14.17 173.2  87.08 24.82 42.84 84.7  29.44
SparkSQL  155.4  74.98 62.38 68.27  146.5 5.1   180.2 174.6 264.0  56.62 30.28 66.97 47.65 6.92  11.16 33.81 244.9  254.7 24.89 31.56 1614  91.18
Impala    585.4  81.81 167.7 163.18 242.5 1.81  369.0 276.2 1242.9 69.97 35.04 45.67 180.8 13.95 15.19 47.52 581.53 1234  714.7 74.25 880.8 34.81
Hive      490.1  63.57 266.6 59.08  DNF   63.63 721.8 625.6 1077   230.5 246.1 65.78 140.7 53.23 556.5 92.51 711.7  454.5 1010  100.5 247.7 81.11

After executing the updates (Hive: RF1=34s, RF2=112s, GeoDiff=138.2%; VectorH: RF1=17.8s, RF2=8.4s, GeoDiff=102.8%):
VectorH   1.68   0.94  3.21  0.23   1.9   0.27  2.74  1.4   11.62  1.21  1.44  0.37  3.9   0.81  1.57  1.64  1.27   0.95  1.5   2.25  1.78  2.82
Hive      608.4  80.8  335.7 205.4  DNF   128.0 690.7 719.8 1150   334.4 218.7 170.5 143.8 130.7 596.7 101.4 891.2  594.6 1167  153.3 275.6 67.85

Results. The total database sizes achieved are 340GB (ORC), 347GB (Parquet) and 364GB (VectorH), showing that applying general-purpose compression to all data (ORC and Parquet use Snappy), which adds significant decompression cost to all table scans, does not provide much extra space savings over doing this only for non-dictionary-compressed string columns (VectorH uses LZ4 in this case). VectorH provides interactive performance on TPC-H and is at least one order of magnitude faster than the competition, and sometimes two or even three. HAWQ has the richest physical design options, and is a bit faster than the other competitors (the nearest is SparkSQL). We attribute the large difference with VectorH to HAWQ's PostgreSQL-based query engine, which cannot compete with a modern vectorized engine in terms of CPU efficiency (e.g. see Q1). Impala performance mostly suffers from its single-core join and aggregation processing.

Impact of Updates. Apart from VectorH, only Hive allows updates; behind the scenes this results in additional delta tables for deletions, insertions and modifications that must be merged transparently during all subsequent queries. This merging can have a cost, which we measure by comparing the geometric mean of the 22 query times before and after ("GeoDiff") the TPC-H refresh functions RF1 (inserts) and RF2 (deletes). Hive query performance after these updates deteriorates to be 38% slower than before. In VectorH, the GeoDiff is 2.8%, which is in the range of noise. Therefore, thanks to PDTs, query performance remains unaffected by updates.
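The GeoDiff metric is simply a ratio of geometric means; a minimal computation, using made-up timings rather than the measured ones, looks like this.

    from math import prod

    def geomean(times):
        return prod(times) ** (1.0 / len(times))

    def geodiff(before, after):
        """Ratio of the geometric means of the query times, in percent."""
        return 100.0 * geomean(after) / geomean(before)

    # made-up example: every query 3% slower after the refresh functions
    before = [1.0, 2.0, 4.0, 8.0]
    after = [t * 1.03 for t in before]
    print(round(geodiff(before, after), 1))   # 103.0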
9. RELATED WORK

VectorH builds on previous work on a high-performance vectorized query engine [27], parallel query optimization using Xchg operators [2], the preceding Vectorwise MPP project [7], and a study into creating an elastic DBMS in YARN [5].

The Hadoop formats ORC and Parquet borrow some ideas from Vectorwise's PFOR, PFOR-DELTA and PDICT compression schemes [28], but are also influenced by Google's Dremel [18], which popularized a nested object format; nested columnar formats allow fast mergejoin-like navigation, and putting multiple tables in one file makes co-location easy in distributed environments. An interesting future direction is to use the join-index update algorithms described in the Positional Delta Trees thesis [11] to support updates on such nested columnar formats efficiently. Our micro-benchmarks, as well as source-code inspection of the ORC and Parquet readers across the tested systems, found these to be unnecessarily inefficient in their implementation of skipping as well as decompression (e.g., routine use of expensive general-purpose compression such as Snappy, as well as tuple-at-a-time bitwise value decoding instead of doing so in a vectorized manner exploiting SIMD) [25].

VectorH is the first SQL-on-Hadoop system that instruments HDFS block placement to ensure co-located table partitions even as the HDFS block distribution evolves over time – though the idea of instrumenting HDFS was proposed to co-locate separate column files in [9]. HAWQ [6] has in common with VectorH that it is a SQL-on-Hadoop system that evolved from an analytical database product (i.e. Greenplum).
It provides richer SQL feature support and a more mature optimizer [22] than the "from-scratch" efforts SparkSQL [3], Impala [24] and Hive [13]. However, HAWQ cannot do deletes and modifications to HDFS-resident data, as it lacks a differential update mechanism like PDTs, and while all other tested systems can deal with failing nodes quite transparently, manual segment recovery is needed in HAWQ. Our performance tests with these competing systems conform with those in [8], though the newer versions of Hive and Impala we tested need fewer reformulated SQL queries to run TPC-H, showing optimizer progress.

Impala shares certain virtues of VectorH, such as out-of-band YARN integration through its Llama component [16], and a modern query engine – in its case using JIT compilation [24]. Here, each operator is compiled in isolation and passes tuple buffers from its child operators, and to its parent operators, in the pipeline on each next() call. Similar to vectorization, these next() calls are thus amortized over many tuples, and intermediate data is stored in memory – possibly CPU cache resident. Impala's operator-template-based JIT compilation is much easier to write, test and maintain than the more efficient data-centric compilation proposed in HyPer [19], which fuses all operators in a pipeline into a single tight loop, avoiding in-memory materialization of intermediates – these stay in CPU registers. Further, Impala does not have co-located table partitioning or control over HDFS placement, regularly suffers from low-quality query optimization, and lacks multi-core parallelism.

Recently, an early version of VectorH (its development code-name was "Vortex", the name used to identify the system in [21]) was benchmarked on TPC-H SF100 against a distributed version of HyPer [21] that proposes full communication multiplexing, where all threads funnel their outgoing data through a single multiplexer, to spread the data according to destination and push it onto a high-speed RDMA network. The experimental results differ from our measurements here due to the small dataset and the early demo version of VectorH used, yet such work in the direction of network shuffling optimization is of immediate interest for VectorH, whose scalability sweet-spot we soon intend to improve from tens to hundreds of machines. We did not benchmark against distributed HyPer here because it does not work in Hadoop – where high-end RDMA network hardware is seldom available.

10. CONCLUSION

VectorH is a new SQL-on-Hadoop system built on the fast analytical Vectorwise technology. It raises the bar for SQL-on-Hadoop systems on many dimensions: mature SQL support, elasticity in full cooperation with YARN, updatability powered by Positional Delta Trees (PDTs), and absolute performance. In our TPC-H evaluation against Impala, Hive, HAWQ and SparkSQL, VectorH proved to be 1-3 orders of magnitude faster, thanks to a combination of lightweight compression, effective MinMax skipping, vectorized execution, data partitioning and clustered indexing, control over HDFS block locality, a well-tuned query optimizer and efficient multi-core parallel execution.

In the future, we aim to integrate elements of the VectorH data layout into existing open-source Hadoop formats. For the integration of VectorH in the Hadoop ecosystem beyond HDFS and YARN, the focus is on extending the Spark-VectorH connector into a seamless external tables, UDF and query pushdown facility integrated with Spark.
11. REFERENCES

[1] A. Ailamaki, D. DeWitt, M. Hill, and M. Skounakis. Weaving relations for cache performance. In PVLDB, 2001.
[2] K. Anikiej. Multi-core parallelization of vectorized query execution. MSc thesis, VU University, 2010.
[3] M. Armbrust, R. Xin, et al. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015.
[4] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: hyper-pipelining query execution. In CIDR, volume 5, 2005.
[5] C. Bârcă. Dynamic Resource Management in Vectorwise on Hadoop. MSc thesis, VU University Amsterdam, 2014.
[6] L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan, et al. HAWQ: a massively parallel processing SQL engine in Hadoop. In SIGMOD, 2014.
[7] A. Costea and A. Ionescu. Query optimization and execution in Vectorwise MPP. MSc thesis, VU University, 2012.
[8] A. Floratou, U. F. Minhas, and F. Özcan. SQL-on-Hadoop: Full circle back to shared-nothing database architectures. PVLDB, 7(12), 2014.
[9] A. Floratou, J. Patel, E. Shekita, and S. Tata. Column-oriented storage techniques for MapReduce. PVLDB, 4(7), 2011.
[10] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, 1990.
[11] S. Héman. Updating Compressed Column Stores. PhD thesis, VU University, 2015.
[12] S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. Boncz. Positional update handling in column stores. In SIGMOD, 2010.
[13] Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. Hanson, et al. Major technical advancements in Apache Hive. In SIGMOD, 2014.
[14] Y. Huai, S. Ma, R. Lee, O. O'Malley, and X. Zhang. ... table placement methods in clusters. PVLDB, 6(14), 2013.
[15] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In SOSP, 2009.
[16] M. Kornacker et al. Impala: A modern, open-source SQL engine for Hadoop. In CIDR, 2015.
[17] P.-Å. Larson, C. Clinciu, E. Hanson, A. Oks, S. Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL Server column store indexes. In SIGMOD, 2011.
[18] S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1-2), 2010.
[19] T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011.
[20] V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11), 2013.
[21] W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4), 2015.
[22] M. A. Soliman et al. Orca: a modular query optimizer architecture for big data. In SIGMOD, 2014.
[23] M. Świtakowski, P. Boncz, and M. Zukowski. From cooperative scans to predictive buffer management. PVLDB, 5(12), 2012.
[24] S. Wanderman-Milne and N. Li. Runtime code generation in Cloudera Impala. DEBULL, 37(1), 2014.
[25] S. Whoerl. Efficient relational main-memory query processing for Hadoop Parquet Nested Columnar storage with HyPer and Vectorwise. MSc thesis, CWI/LMU/TUM/U. Augsburg, 2014.
[26] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In USENIX, volume 10, 2010.
[27] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, 2009.
[28] M. Zukowski, S. Héman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In ICDE, 2006.

Appendix

For the perusal of the reader, we provide a graphical performance profile for the SF1000 evaluation by VectorH of TPC-H Q1. This query has a simple plan: a Scan of lineitem, followed by a Select that evaluates a date constraint (99% of tuples qualify), then a Project that performs computations on l_tax, l_discount, l_extendedprice etc., followed by GROUP BY aggregation in an Aggr which produces just 4 groups. This query is parallelized by putting a DXchgUnion (DXU 180:1) on top and then aggregating again. Between the operators, arrows are plotted with the amount of produced tuples. The detail-boxes for each operator also show the amount of tuples going in and out of the operator, as well as the time spent evaluating the operator itself (time) and all time spent in the operator and its children together (cum time). Below the DXU there are 180 threads, above it just one. The parallel operators in the profile show, for each thread, the time, cum time and number of produced tuples, in a graph where every point on the X-axis is a thread (the notation Nxx@yy means: node xx, thread yy). Here we see that the query spends most time in the lower Aggr, Project and MScan, which is expected, and that there is a slight load-balancing problem: cum time in the parallel Aggr varies between 2.95G cycles and 3.64G cycles (20%). Still, the overall performance penalty for this is less than 15%.

[Graphical query profile of TPC-H Q1 at SF1000: per-operator detail boxes (Sort, Project, Aggr, DXchgUnion send/receive, Select, MScan of lineitem) with time, cum_time, memory and tuple counts per thread across the 9 nodes, plus a breakdown of the parse, rewrite, build, execute and profile stages.]