Impala: A Modern, Open-Source SQL Engine for Hadoop

Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, Michael Yoder

Cloudera http://impala.io/

ABSTRACT

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user's perspective, gives an overview of its architecture and main components, and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.

1. INTRODUCTION

Impala is an open-source (https://github.com/cloudera/impala), fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala's goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala's beta release was in October 2012 and it GA'ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala's ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

This paper discusses the services Impala provides to the user and then presents an overview of its architecture and main components. The highest performance that is achievable today requires using HDFS as the underlying storage manager, and therefore that is the focus of this paper; when there are notable differences in terms of how certain technical aspects are handled in conjunction with HBase, we note that in the text without going into detail.

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average, or nearly three times faster on average for multi-user queries than for single-user ones.

The remainder of this paper is structured as follows: the next section gives an overview of Impala from the user's perspective and points out how it differs from a traditional RDBMS. Section 3 presents the overall architecture of the system. Section 4 presents the frontend component, which includes a cost-based distributed query optimizer. Section 5 presents the backend component, which is responsible for the query execution and employs runtime code generation. Section 6 presents the resource/workload management component. Section 7 briefly evaluates the performance of Impala. Section 8 discusses the roadmap ahead and Section 9 concludes.

2. USER VIEW OF IMPALA

Impala is a query engine which is integrated into the Hadoop environment and utilizes a number of standard Hadoop components (Metastore, HDFS, HBase, YARN, Sentry) in order to deliver an RDBMS-like experience. However, there are some important differences that will be brought up in the remainder of this section.

Impala was specifically targeted for integration with standard business intelligence environments, and to that end supports most relevant industry standards: clients can connect via ODBC or JDBC; authentication is accomplished with Kerberos or LDAP; authorization follows the standard SQL roles and privileges (provided by another standard Hadoop component called Sentry [4], which also makes role-based authorization available to Hive and other components). In order to query HDFS-resident data, the user creates tables via the familiar CREATE TABLE statement, which, in addition to providing the logical schema of the data, also indicates the physical layout, such as file format(s) and placement within the HDFS directory structure. Those tables can then be queried with standard SQL syntax.

2.1 Physical schema design

When creating a table, the user can also specify a list of partition columns:

CREATE TABLE T (...) PARTITIONED BY (day int, month int) LOCATION '<hdfs-path>' STORED AS PARQUET;

For an unpartitioned table, data files are stored by default directly in the root directory (all data files located in any directory below the root are part of the table's data set; that is a common approach for dealing with unpartitioned tables, also employed by Apache Hive). For a partitioned table, data files are placed in subdirectories whose paths reflect the partition columns' values. For example, for day 17, month 2 of table T, all data files would be located in directory <root>/day=17/month=2/. Note that this form of partitioning does not imply a collocation of the data of an individual partition: the blocks of the data files of a partition are distributed randomly across HDFS data nodes.

Impala also gives the user a great deal of flexibility when choosing file formats. It currently supports compressed and uncompressed text files, sequence file (a splittable form of text files), RCFile (a legacy columnar format), Avro (a binary row format), and Parquet, the highest-performance storage option (Section 5.3 discusses file formats in more detail). As in the example above, the user indicates the storage format in the CREATE TABLE or ALTER TABLE statements. It is also possible to select a separate format for each partition individually. For example, one can specifically set the file format of a particular partition to Parquet with: ALTER TABLE PARTITION(day=17, month=2) SET FILEFORMAT PARQUET.

As an example of when this is useful, consider a table with chronologically recorded data, such as click logs. The data for the current day might come in as CSV files and get converted in bulk to Parquet at the end of each day.

2.2 SQL Support

Impala supports most of the SQL-92 SELECT statement syntax, plus additional SQL-2003 analytic functions, and most of the standard scalar data types: integer and floating point types, STRING, CHAR, VARCHAR, TIMESTAMP, and DECIMAL with up to 38 digits of precision. Custom application logic can be incorporated through user-defined functions (UDFs) in Java and C++, and user-defined aggregate functions (UDAs), currently only in C++.

Due to the limitations of HDFS as a storage manager, Impala does not support UPDATE or DELETE, and essentially only supports bulk insertions (INSERT INTO ... SELECT ...). Impala also supports the VALUES clause; however, for HDFS-backed tables this generates one file per INSERT statement, which leads to very poor performance for most applications. For HBase-backed tables, the VALUES variant performs single-row inserts by means of the HBase API. Unlike in a traditional RDBMS, the user can add data to a table simply by copying/moving data files into the directory location of that table, using HDFS's API. Alternatively, the same can be accomplished with the LOAD DATA statement.

Similarly to bulk insert, Impala supports bulk data deletion by dropping a table partition (ALTER TABLE DROP PARTITION). Because it is not possible to update HDFS files in-place, Impala does not support an UPDATE statement. Instead, the user typically recomputes parts of the data set to incorporate updates, and then replaces the corresponding data files, often by dropping and re-adding the partition.

After the initial data load, or whenever a significant fraction of the table's data changes, the user should run the COMPUTE STATS <table> statement, which instructs Impala to gather statistics on the table. Those statistics will subsequently be used during query optimization.

3. ARCHITECTURE

Impala is a massively-parallel query execution engine, which runs on hundreds of machines in existing Hadoop clusters. It is decoupled from the underlying storage engine, unlike traditional relational database management systems where the query processing and the underlying storage engine are components of a single tightly-coupled system. Impala's high-level architecture is shown in Figure 1.

An Impala deployment is comprised of three services. The Impala daemon (impalad) service is dually responsible for accepting queries from client processes and orchestrating their execution across the cluster, and for executing individual query fragments on behalf of other Impala daemons. When an Impala daemon operates in the first role by managing query execution, it is said to be the coordinator for that query. However, all Impala daemons are symmetric; they may all operate in all roles. This property helps with fault-tolerance and with load-balancing.

One Impala daemon is deployed on every machine in the cluster that is also running a datanode process, the block server for the underlying HDFS deployment, and therefore there is typically one Impala daemon on every machine. This allows Impala to take advantage of data locality and to read blocks from the filesystem without having to use the network.

The Statestore daemon (statestored) is Impala's metadata publish-subscribe service, which disseminates cluster-wide metadata to all Impala processes. There is a single statestored instance, which is described in more detail in Section 3.1 below.

Finally, the Catalog daemon (catalogd), described in Section 3.2, serves as Impala's catalog repository and metadata access gateway. Through the catalogd, Impala daemons may execute DDL commands that are reflected in external catalog stores such as the Hive Metastore. Changes to the system catalog are broadcast via the statestore.

All these Impala services, as well as several configuration options, such as the sizes of the resource pools, the available memory, etc. (see Section 6 for more details about resource and workload management), are also exposed to Cloudera Manager, a sophisticated cluster management application (http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html). Cloudera Manager can administer not only Impala but also pretty much every other service, for a holistic view of a Hadoop deployment.

(Figure 1 diagram: a SQL application submits a query via ODBC to one of the impalad processes; each impalad comprises a query planner, query coordinator, and query executor, and runs alongside an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, statestore, and catalog services complete the deployment.)

Figure 1: Impala is a distributed query processing system for the Hadoop ecosystem. This figure also shows the flow during query processing.
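To make the symmetric-daemon design concrete, the following C++ sketch illustrates the two roles an impalad can play. The types and function names are hypothetical placeholders, not Impala's actual interfaces, and the trivial round-robin dispatch below ignores the data locality that the real scheduler takes into account (Section 4).

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the symmetric daemon roles described above: the
// impalad that receives a query acts as its coordinator, while every daemon
// (including the coordinator itself) can execute plan fragments.
struct PlanFragment { std::string description; };

class ImpalaDaemon {
 public:
  explicit ImpalaDaemon(std::vector<ImpalaDaemon*> peers) : peers_(peers) {}

  // Coordinator role: plan the query and hand fragments to executors.
  void CoordinateQuery(const std::string& sql) {
    std::vector<PlanFragment> fragments = Plan(sql);
    for (std::size_t i = 0; i < fragments.size(); ++i) {
      ImpalaDaemon* executor = peers_.empty() ? this : peers_[i % peers_.size()];
      executor->ExecuteFragment(fragments[i]);
    }
  }

  // Executor role: run one fragment on behalf of some coordinator.
  void ExecuteFragment(const PlanFragment& fragment) {
    // scan, join, aggregate, exchange ... (omitted in this sketch)
    (void)fragment;
  }

 private:
  std::vector<PlanFragment> Plan(const std::string& sql) {
    (void)sql;  // placeholder: real planning is described in Section 4
    return {PlanFragment{"scan"}, PlanFragment{"aggregate"}};
  }

  std::vector<ImpalaDaemon*> peers_;
};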

3.1 State distribution

A major challenge in the design of an MPP database that is intended to run on hundreds of nodes is the coordination and synchronization of cluster-wide metadata. Impala's symmetric-node architecture requires that all nodes must be able to accept and execute queries. Therefore all nodes must have, for example, up-to-date versions of the system catalog and a recent view of the Impala cluster's membership so that queries may be scheduled correctly.

We might approach this problem by deploying a separate cluster-management service, with ground-truth versions of all cluster-wide metadata. Impala daemons could then query this store lazily (i.e. only when needed), which would ensure that all queries were given up-to-date responses. However, a fundamental tenet in Impala's design has been to avoid synchronous RPCs wherever possible on the critical path of any query. Without paying close attention to these costs, we have found that query latency is often compromised by the time taken to establish a TCP connection, or by load on some remote service. Instead, we have designed Impala to push updates to all interested parties, and have designed a simple publish-subscribe service called the statestore to disseminate metadata changes to a set of subscribers.

The statestore maintains a set of topics, which are arrays of (key, value, version) triplets called entries, where 'key' and 'value' are byte arrays and 'version' is a 64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not persisted across service restarts.

Processes that wish to receive updates to any topic are called subscribers, and express their interest by registering with the statestore at start-up and providing a list of topics. The statestore responds to registration by sending the subscriber an initial topic update for each registered topic, which consists of all the entries currently in that topic.

After registration, the statestore periodically sends two kinds of messages to each subscriber. The first kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a per-topic most-recent-version identifier which allows the statestore to only send the delta between updates. In response to a topic update, each subscriber sends a list of changes it wishes to make to its subscribed topics. Those changes are guaranteed to have been applied by the time the next update is received.

The second kind of statestore message is a keepalive. The statestore uses keepalive messages to maintain the connection to each subscriber, which would otherwise time-out its subscription and attempt to re-register. Previous versions of the statestore used topic update messages for both purposes, but as the size of topic updates grew it became difficult to ensure timely delivery of updates to each subscriber, leading to false positives in the subscriber failure-detection process.

If the statestore detects a failed subscriber (for example, by repeated failed keepalive deliveries), it will cease sending updates. Some topic entries may be marked as 'transient', meaning that if their 'owning' subscriber should fail, they will be removed. This is a natural primitive with which to maintain liveness information for the cluster in a dedicated topic, as well as per-node load statistics.

The statestore provides very weak semantics: subscribers may be updated at different rates (although the statestore tries to distribute topic updates fairly), and may therefore have very different views of the content of a topic. However, Impala only uses topic metadata to make decisions locally, without any coordination across the cluster. For example, query planning is performed on a single node based on the catalog metadata topic, and once a full plan has been computed, all information required to execute that plan is distributed directly to the executing nodes. There is no requirement that an executing node should know about the same version of the catalog metadata topic.

Although there is only a single statestore process in existing Impala deployments, we have found that it scales well to medium-sized clusters and, with some configuration, can serve our largest deployments. The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by live subscribers (e.g. load information). Therefore, should a statestore restart, its state can be recovered during the initial subscriber registration phase. If the machine that the statestore is running on fails, a new statestore process can be started elsewhere, and subscribers may fail over to it. There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS entry to force subscribers to automatically move to the new process instance.
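For concreteness, the following C++ sketch mirrors the topic model just described: entries are (key, value, version) triplets, and a subscriber's last-seen version lets the statestore compute and ship only the delta. The names and data structures are illustrative assumptions, not Impala's actual statestore implementation.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One statestore topic entry: key and value are opaque byte arrays in the
// real system; an empty value can be used to mark a deletion.
struct TopicEntry {
  std::string key;
  std::string value;
  int64_t version;  // monotonically increasing within a topic
};

class Topic {
 public:
  // Insert or overwrite an entry, stamping it with the next version.
  void Put(const std::string& key, const std::string& value) {
    entries_[key] = TopicEntry{key, value, ++last_version_};
  }

  // Collect all entries changed since the subscriber's last-seen version.
  std::vector<TopicEntry> DeltaSince(int64_t from_version) const {
    std::vector<TopicEntry> delta;
    for (const auto& kv : entries_) {
      if (kv.second.version > from_version) delta.push_back(kv.second);
    }
    return delta;
  }

  int64_t last_version() const { return last_version_; }

 private:
  std::map<std::string, TopicEntry> entries_;
  int64_t last_version_ = 0;
};

// A periodic update to one subscriber would then look roughly like:
//   auto delta = topic.DeltaSince(subscriber_last_seen);
//   send(delta);
//   subscriber_last_seen = topic.last_version();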
3.2 Catalog service

Impala's catalog service serves catalog metadata to Impala daemons via the statestore broadcast mechanism, and executes DDL operations on behalf of Impala daemons. The catalog service pulls information from third-party metadata stores (for example, the Hive Metastore or the HDFS Namenode) and aggregates that information into an Impala-compatible catalog structure. This architecture allows Impala to be relatively agnostic of the metadata stores for the storage engines it relies upon, which allows us to add new metadata stores to Impala relatively quickly (e.g. HBase support). Any changes to the system catalog (e.g. when a new table has been loaded) are disseminated via the statestore.

The catalog service also allows us to augment the system catalog with Impala-specific information. For example, we register user-defined functions only with the catalog service (without replicating this to the Hive Metastore, for example), since they are specific to Impala.

Since catalogs are often very large, and access to tables is rarely uniform, the catalog service only loads a skeleton entry for each table it discovers on startup. More detailed table metadata can be loaded lazily in the background from its third-party stores. If a table is required before it has been fully loaded, an Impala daemon will detect this and issue a prioritization request to the catalog service. This request blocks until the table is fully loaded.

4. FRONTEND

The Impala frontend is responsible for compiling SQL text into query plans executable by the Impala backends. It is written in Java and consists of a fully-featured SQL parser and cost-based query optimizer, all implemented from scratch. In addition to the basic SQL features (select, project, join, group by, order by, limit), Impala supports inline views, uncorrelated and correlated subqueries (that are rewritten as joins), all variants of outer joins as well as explicit left/right semi- and anti-joins, and analytic window functions.

The query compilation process follows a traditional division of labor: query parsing, semantic analysis, and query planning/optimization. We will focus on the latter, most challenging, part of query compilation. The Impala query planner is given as input a parse tree together with query-global information assembled during semantic analysis (table/column identifiers, equivalence classes, etc.). An executable query plan is constructed in two phases: (1) single-node planning and (2) plan parallelization and fragmentation.

In the first phase, the parse tree is translated into a non-executable single-node plan tree, consisting of the following plan nodes: HDFS/HBase scan, hash join, cross join, union, hash aggregation, sort, top-n, and analytic evaluation. This step is responsible for assigning predicates at the lowest possible plan node, inferring predicates based on equivalence classes, pruning table partitions, setting limits/offsets, applying column projections, as well as performing some cost-based plan optimizations such as ordering and coalescing analytic window functions and join reordering to minimize the total evaluation cost. Cost estimation is based on table/partition cardinalities plus distinct value counts for each column (we use the HyperLogLog algorithm [5] for distinct value estimation); histograms are currently not part of the statistics. Impala uses simple heuristics to avoid exhaustively enumerating and costing the entire join-order space in common cases.

The second planning phase takes the single-node plan as input and produces a distributed execution plan. The general goal is to minimize data movement and maximize scan locality: in HDFS, remote reads are considerably slower than local ones. The plan is made distributed by adding exchange nodes between plan nodes as necessary, and by adding extra non-exchange plan nodes to minimize data movement across the network (e.g., local aggregation nodes). During this second phase, we decide the join strategy for every join node (the join order is fixed at this point). The supported join strategies are broadcast and partitioned. The former replicates the entire build side of a join to all cluster machines executing the probe, and the latter hash-redistributes both the build and probe side on the join expressions. Impala chooses whichever strategy is estimated to minimize the amount of data exchanged over the network, also exploiting existing data partitioning of the join inputs.

All aggregation is currently executed as a local pre-aggregation followed by a merge aggregation operation. For grouping aggregations, the pre-aggregation output is partitioned on the grouping expressions and the merge aggregation is done in parallel on all participating nodes. For non-grouping aggregations, the merge aggregation is done on a single node. Sort and top-n are parallelized in a similar fashion: a distributed local sort/top-n is followed by a single-node merge operation. Analytic expression evaluation is parallelized based on the partition-by expressions; it relies on its input being sorted on the partition-by/order-by expressions. Finally, the distributed plan tree is split up at exchange boundaries. Each such portion of the plan is placed inside a plan fragment, Impala's unit of backend execution. A plan fragment encapsulates a portion of the plan tree that operates on the same data partition on a single machine.

Figure 2 illustrates in an example the two phases of query planning. The left side of the figure shows the single-node plan of a query joining two HDFS tables (t1, t2) and one HBase table (t3), followed by an aggregation and order by with limit (top-n). The right-hand side shows the distributed, fragmented plan. Rounded rectangles indicate fragment boundaries and arrows data exchanges. Tables t1 and t2 are joined via the partitioned strategy. The scans are in a fragment of their own since their results are immediately exchanged to a consumer (the join node) which operates on a hash-based partition of the data, whereas the table data is randomly partitioned. The following join with t3 is a broadcast join placed in the same fragment as the join between t1 and t2, because a broadcast join preserves the existing data partition (the results of joining t1, t2, and t3 are still hash-partitioned based on the join keys of t1 and t2). After the joins we perform a two-phase distributed aggregation, where a pre-aggregation is computed in the same fragment as the last join. The pre-aggregation results are hash-exchanged based on the grouping keys, and then aggregated once more to compute the final aggregation result. The same two-phased approach is applied to the top-n, and the final top-n step is performed at the coordinator, which returns the results to the user.
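The planner itself is written in Java; the following C++-style sketch only illustrates the cost comparison behind the broadcast-versus-partitioned decision described above. The names and the simplified cost model (bytes moved over the network) are assumptions for illustration, and the sketch ignores the existing partitioning of the inputs that the real planner also exploits.

#include <cstdint>

enum class JoinStrategy { kBroadcast, kPartitioned };

struct JoinInput {
  int64_t estimated_bytes;  // estimated size of this join input
};

JoinStrategy ChooseJoinStrategy(const JoinInput& build, const JoinInput& probe,
                                int num_nodes) {
  // Broadcast: the whole build side is replicated to every node running the probe.
  int64_t broadcast_cost = build.estimated_bytes * num_nodes;
  // Partitioned: both sides are hash-redistributed across the network once.
  int64_t partitioned_cost = build.estimated_bytes + probe.estimated_bytes;
  return broadcast_cost <= partitioned_cost ? JoinStrategy::kBroadcast
                                            : JoinStrategy::kPartitioned;
}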

(Figure 2 diagram: on the left, the single-node plan with scans of t1, t2 and t3 feeding two hash joins, an aggregation, and a top-n; on the right, the distributed plan with hash exchanges on the join keys of t1 and t2, a broadcast of t3, a pre-aggregation and merge aggregation partitioned on the grouping key, and the final top-n merge at the coordinator.)


Figure 2: Example of the two phase query optimization.
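As an illustration of the two-phase aggregation in the plan above, the sketch below shows a local pre-aggregation, a hash exchange on the grouping key, and a final merge aggregation for a simple grouped count. All names are hypothetical; Impala's actual operators work on row batches and are code-generated (Section 5.1).

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using PartialCounts = std::unordered_map<std::string, int64_t>;

// Phase 1: each node pre-aggregates its local rows.
PartialCounts PreAggregate(const std::vector<std::string>& local_keys) {
  PartialCounts partial;
  for (const auto& key : local_keys) ++partial[key];
  return partial;
}

// The pre-aggregation output is hash-exchanged on the grouping key so that
// all partial counts for a given key land on the same node.
int HashPartition(const std::string& key, int num_nodes) {
  return static_cast<int>(std::hash<std::string>{}(key) %
                          static_cast<std::size_t>(num_nodes));
}

// Phase 2: each node merges the partial results it received.
PartialCounts MergeAggregate(const std::vector<PartialCounts>& partials) {
  PartialCounts merged;
  for (const auto& p : partials)
    for (const auto& kv : p) merged[kv.first] += kv.second;
  return merged;
}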

5. BACKEND

Impala's backend receives query fragments from the frontend and is responsible for their fast execution. It is designed to take advantage of modern hardware. The backend is written in C++ and uses code generation at runtime to produce efficient codepaths (with respect to instruction count) and small memory overhead, especially compared to other engines implemented in Java.

Impala leverages decades of research in parallel databases. The execution model is the traditional Volcano-style with Exchange operators [7]. Processing is performed batch-at-a-time: each GetNext() call operates over batches of rows, similar to [10]. With the exception of "stop-and-go" operators (e.g. sorting), the execution is fully pipelineable, which minimizes the memory consumption for storing intermediate results. When processed in memory, the tuples have a canonical in-memory row-oriented format.

Operators that may need to consume lots of memory are designed to be able to spill parts of their working set to disk if needed. The operators that are spillable are the hash join, (hash-based) aggregation, sorting, and analytic function evaluation.

Impala employs a partitioning approach for the hash join and aggregation operators: some bits of the hash value of each tuple determine the target partition and the remaining bits are used for the hash table probe. During normal operation, when all hash tables fit in memory, the overhead of the partitioning step is minimal, within 10% of the performance of a non-spillable, non-partitioning-based implementation. When there is memory pressure, a "victim" partition may be spilled to disk, thereby freeing memory for other partitions to complete their processing. When building the hash tables for the hash joins and there is a reduction in cardinality of the build-side relation, we construct a Bloom filter which is then passed on to the probe-side scanner, implementing a simple version of a semi-join.
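A minimal sketch of the hash-bit split just described: the low-order bits of a tuple's hash select the partition (the unit of spilling), while the remaining bits are used to probe that partition's in-memory hash table. The constants and names are illustrative assumptions, not Impala's actual values.

#include <cstdint>

constexpr int kNumPartitionBits = 4;                        // 16 partitions
constexpr uint32_t kNumPartitions = 1u << kNumPartitionBits;

// Low bits choose the partition a tuple is routed to.
inline uint32_t PartitionIndex(uint32_t hash) {
  return hash & (kNumPartitions - 1);
}

// The remaining bits index into that partition's hash table.
inline uint32_t HashTableBucket(uint32_t hash, uint32_t num_buckets) {
  return (hash >> kNumPartitionBits) % num_buckets;
}

// Under memory pressure, one "victim" partition's build rows and hash table
// can be written to disk and re-processed later, while the other partitions
// continue entirely in memory.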

5.1 Runtime Code Generation

Runtime code generation using LLVM [8] is one of the techniques employed extensively by Impala's backend to improve execution times. Performance gains of 5x or more are typical for representative workloads.

LLVM is a compiler library and collection of related tools. Unlike traditional compilers that are implemented as stand-alone applications, LLVM is designed to be modular and reusable. It allows applications like Impala to perform just-in-time (JIT) compilation within a running process, with the full benefits of a modern optimizer and the ability to generate machine code for a number of architectures, by exposing separate APIs for all steps of the compilation process.

Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance. In particular, code generation is applied to "inner loop" functions, i.e., those that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the total time the query takes to execute. For example, a function used to parse a record in a data file into Impala's in-memory tuple format must be called for every record in every data file scanned. For queries scanning large tables, this could be billions of records or more. This function must therefore be extremely efficient for good query performance, and removing even a few instructions from the function's execution can result in large query speedups.

Without code generation, inefficiencies in function execution are almost always necessary in order to handle runtime information not known at program compile time. For example, a record-parsing function that only handles integer types will be faster at parsing an integer-only file than a function that also handles other data types such as strings and floating-point numbers. However, the schemas of the files to be scanned are unknown at compile time, and so a general-purpose function must be used, even if at runtime it is known that more limited functionality is sufficient.

A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance penalty, particularly when the called function cannot be inlined. If the type of the object instance is known at runtime, we can use code generation to replace the virtual function call with a call directly to the correct function, which can then be inlined. This is especially valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a tree of individual operators and functions, as illustrated in the left-hand side of Figure 3. Each type of expression that can appear in a tree is implemented by overriding a virtual function in the expression base class, which recursively calls its child expressions. Many of these expression functions are quite simple, e.g., adding two numbers. Thus, the cost of calling the virtual function often far exceeds the cost of actually evaluating the function. As illustrated in Figure 3, by resolving the virtual function calls with code generation and then inlining the resulting function calls, the expression tree can be evaluated directly with no function call overhead. In addition, inlining functions increases instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression elimination across expressions.

Overall, JIT compilation has an effect similar to custom-coding a query. For example, it eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions. Code generation has a dramatic impact on performance, as shown in Figure 4. For example, in a 10-node cluster with each node having 12 disks, 8 cores and 48GB RAM, we measure the impact of codegen using an Avro TPC-H database of scaling factor 100 and simple aggregation queries. Code generation speeds up the execution by up to 5.7x, with the speedup increasing with the query complexity.

Figure 3: Interpreted vs codegen'ed code in Impala. (The figure contrasts the expression (col1 + 10) * 7 / col2 evaluated as an interpreted tree of function pointers with the single generated function my_func produced by code generation.)

Figure 4: Impact of run-time code generation in Impala. (Bar chart of query times with codegen off versus on: roughly 1.19x faster for select count(*) from lineitem, 1.87x for select count(l_orderkey) from lineitem, and 5.7x for TPC-H Q1.)
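The following sketch contrasts the two sides of Figure 3 for the expression (col1 + 10) * 7 / col2: an interpreted tree of virtual calls versus the single flat function that code generation effectively produces. It is an illustration only; Impala emits the specialized form as LLVM IR at runtime rather than as hand-written C++, and the class names here are hypothetical.

#include <memory>

struct Row {
  int col1;
  int col2;
};

// Interpreted form: one virtual call per operator per row.
class Expr {
 public:
  virtual ~Expr() = default;
  virtual int Eval(const Row& row) const = 0;
};

class ColRef : public Expr {
 public:
  explicit ColRef(int idx) : idx_(idx) {}
  int Eval(const Row& row) const override { return idx_ == 0 ? row.col1 : row.col2; }
 private:
  int idx_;
};

class Literal : public Expr {
 public:
  explicit Literal(int v) : v_(v) {}
  int Eval(const Row&) const override { return v_; }
 private:
  int v_;
};

class BinaryOp : public Expr {
 public:
  enum Op { kAdd, kMul, kDiv };
  BinaryOp(Op op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : op_(op), l_(std::move(l)), r_(std::move(r)) {}
  int Eval(const Row& row) const override {
    int a = l_->Eval(row), b = r_->Eval(row);
    switch (op_) {
      case kAdd: return a + b;
      case kMul: return a * b;
      default:   return a / b;
    }
  }
 private:
  Op op_;
  std::unique_ptr<Expr> l_, r_;
};

// What code generation effectively produces: a flat, inlinable function with
// no virtual calls, which the JIT optimizer can simplify further.
inline int EvalCodegened(const Row& row) {
  return (row.col1 + 10) * 7 / row.col2;
}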
5.2 I/O Management

Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. In order to perform data scans at or near hardware speed, both from disk and from memory, Impala uses an HDFS feature called short-circuit local reads [3] to bypass the DataNode protocol when reading from local disk. Impala can read at almost disk bandwidth (approx. 100MB/s per disk) and is typically able to saturate all available disks. We have measured that with 12 disks, Impala is capable of sustaining I/O at 1.2GB/sec. Furthermore, HDFS caching [2] allows Impala to access memory-resident data at memory bus speed, and also saves CPU cycles as there is no need to copy data blocks and/or checksum them.

Reading/writing data from/to storage devices is the responsibility of the I/O manager component. The I/O manager assigns a fixed number of worker threads per physical disk (one thread per rotational disk and eight per SSD), providing an asynchronous interface to clients (e.g. scanner threads). The effectiveness of Impala's I/O manager was recently corroborated by [6], which shows that Impala's read throughput is from 4x up to 8x higher than that of the other tested systems.

5.3 Storage Formats

Impala supports the most popular file formats: Avro, RC, Sequence, plain text, and Parquet. These formats can be combined with different compression algorithms, such as snappy, gzip, and bz2.

In most use cases we recommend using Apache Parquet, a state-of-the-art, open-source columnar file format offering both high compression and high scan efficiency. It was co-developed by Cloudera and Twitter with contributions from Criteo, Stripe, Berkeley AMPlab, and LinkedIn. In addition to Impala, most Hadoop-based processing frameworks, including Hive, Pig, MapReduce and Cascading, are able to process Parquet.

Simply described, Parquet is a customizable PAX-like [1] format optimized for large data blocks (tens, hundreds, or thousands of megabytes) with built-in support for nested data. Inspired by Dremel's ColumnIO format [9], Parquet stores nested fields column-wise and augments them with minimal information to enable re-assembly of the nesting structure from column data at scan time. Parquet has an extensible set of column encodings. Version 1.2 supports run-length and dictionary encodings, and version 2.0 added support for delta and optimized string encodings. The most recent version (Parquet 2.0) also implements embedded statistics: inlined column statistics for further optimization of scan efficiency, e.g. min/max indexes.

As mentioned earlier, Parquet offers both high compression and high scan efficiency. Figure 5 (left) compares the size on disk of the Lineitem table of a TPC-H database of scaling factor 1,000 when stored in some popular combinations of file formats and compression algorithms. Parquet with snappy compression achieves the best compression among them. Similarly, Figure 5 (right) shows the Impala execution times for various queries from the TPC-DS benchmark when the database is stored in plain text, Sequence, RC, and Parquet formats. Parquet consistently outperforms all the other formats by up to 5x.

Figure 5: (Left) Comparison of the compression ratio of popular combinations of file formats and compression. (Right) Comparison of the query efficiency of plain text, SEQUENCE, RC, and Parquet in Impala. (Left panel, database size in GB: text 750, text with LZO 367, Sequence with snappy 410, Avro with snappy 386, RCFile with snappy 317, Parquet with snappy 245. Right panel: per-query runtimes in seconds for a set of TPC-DS queries.)

6. RESOURCE/WORKLOAD MANAGEMENT

One of the main challenges for any cluster framework is careful control of resource consumption. Impala often runs in the context of a busy cluster, where MapReduce tasks, ingest jobs and bespoke frameworks compete for finite CPU, memory and network resources. The difficulty is to coordinate resource scheduling between queries, and perhaps between frameworks, without compromising query latency or throughput.

Apache YARN [12] is the current standard for resource mediation on Hadoop clusters, which allows frameworks to share resources such as CPU and memory without partitioning the cluster. YARN has a centralized architecture, where frameworks make requests for CPU and memory resources which are arbitrated by the central Resource Manager service. This architecture has the advantage of allowing decisions to be made with full knowledge of the cluster state, but it also imposes a significant latency on resource acquisition. As Impala targets workloads of many thousands of queries per second, we found the resource request and response cycle to be prohibitively long.

Our approach to this problem was two-fold: first, we implemented a complementary but independent admission control mechanism that allowed users to control their workloads without costly centralized decision-making. Second, we designed an intermediary service to sit between Impala and YARN with the intention of correcting some of the impedance mismatch. This service, called Llama for Low-Latency Application MAster, implements resource caching, gang scheduling and incremental allocation changes while still deferring the actual scheduling decisions to YARN for resource requests that don't hit Llama's cache.

The rest of this section describes both approaches to resource management with Impala. Our long-term goal is to support mixed-workload resource management through a single mechanism that supports both the low latency decision making of admission control and the cross-framework support of YARN.

6.1 Llama and YARN

Llama is a standalone daemon to which all Impala daemons send per-query resource requests. Each resource request is associated with a resource pool, which defines the fair share of the cluster's available resources that a query may use.

If resources for the resource pool are available in Llama's resource cache, Llama returns them to the query immediately. This fast path allows Llama to circumvent YARN's resource allocation algorithm when contention for resources is low. Otherwise, Llama forwards the request to YARN's resource manager and waits for all resources to be returned. This is different from YARN's 'drip-feed' allocation model, where resources are returned as they are allocated. Impala's pipelined execution model requires all resources to be available simultaneously so that all query fragments may proceed in parallel.

Since resource estimations for query plans, particularly over very large data sets, are often inaccurate, we allow Impala queries to adjust their resource consumption estimates during execution. This mode is not supported by YARN; instead we have Llama issue new resource requests to YARN (e.g. asking for 1GB more of memory per node) and then aggregate them into a single resource allocation from Impala's perspective. This adapter architecture has allowed Impala to fully integrate with YARN without itself absorbing the complexities of dealing with an unsuitable programming interface.

6.2 Admission Control

In addition to integrating with YARN for cluster-wide resource management, Impala has a built-in admission control mechanism to throttle incoming requests. Requests are assigned to a resource pool and admitted, queued, or rejected based on a policy that defines per-pool limits on the maximum number of concurrent requests and the maximum memory usage of requests. The admission controller was designed to be fast and decentralized, so that incoming requests to any Impala daemon can be admitted without making synchronous requests to a central server. The state required to make admission decisions is disseminated among Impala daemons via the statestore, so every Impala daemon is able to make admission decisions based on its aggregate view of the global state, without any additional synchronous communication on the request execution path. However, because shared state is received asynchronously, Impala daemons may make decisions locally that result in exceeding the limits specified by the policy. In practice this has not been problematic because state is typically updated faster than non-trivial queries. Further, the admission control mechanism is designed primarily to be a simple throttling mechanism rather than a resource management solution such as YARN.

Resource pools are defined hierarchically. Incoming requests are assigned to a resource pool based on a placement policy, and access to pools can be controlled using ACLs. The configuration is specified with a YARN fair scheduler allocation file and Llama configuration, and Cloudera Manager provides a simple user interface to configure resource pools, which can be modified without restarting any running services.

7. EVALUATION

The purpose of this section is not to exhaustively evaluate the performance of Impala, but mostly to give some indications. There are independent academic studies that have derived similar conclusions, e.g. [6].

7.1 Experimental setup

All the experiments were run on the same 21-node cluster. Each node in the cluster is a 2-socket machine with 6-core Intel Xeon CPU E5-2630L at 2.00GHz. Each node has 64GB RAM and twelve 932GB disk drives (one for the OS, the rest for HDFS).

We run a decision-support style benchmark consisting of a subset of the queries of TPC-DS on a 15TB scale factor data set. In the results below we categorize the queries based on the amount of data they access, into interactive, reporting, and deep analytic queries. In particular, the interactive bucket contains queries q19, q42, q52, q55, q63, q68, q73, and q98; the reporting bucket contains queries q27, q3, q43, q53, q7, and q89; and the deep analytic bucket contains queries q34, q46, q59, q79, and ss_max. The kit we use for these measurements is publicly available (https://github.com/cloudera/impala-tpcds-kit).

For our comparisons we used the most popular SQL-on-Hadoop systems for which we were able to show results: Impala, Presto, Shark, SparkSQL, and Hive 0.13. (There are several other SQL engines for Hadoop, for example Pivotal HAWQ and IBM BigInsights; unfortunately, as far as we know, these systems take advantage of the DeWitt clause and we are legally prevented from presenting comparisons against them.) Due to the lack of a cost-based optimizer in all tested engines except Impala, we tested all engines with queries that had been converted to SQL-92 style joins. For consistency, we ran those same queries against Impala, although Impala produces identical results without these modifications.

Each engine was assessed on the file format that it performs best on, while consistently using Snappy compression to ensure fair comparisons: Impala on Apache Parquet, Hive 0.13 on ORC, Presto on RCFile, and SparkSQL on Parquet.

7.2 Single User Performance

Figure 6 compares the performance of the four systems on single-user runs, where a single user is repeatedly submitting queries with zero think time. Impala outperforms all alternatives on single-user workloads across all queries run. Impala's performance advantage ranges from 2.1x to 13.0x and on average is 6.7x. This is a wider performance gap against Hive 0.13 (from an average of 4.9x to 9x) and Presto (from an average of 5.3x to 7.5x) than we measured with earlier versions of Impala (http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/).

Figure 6: Comparison of query response times on single-user runs. (Bar chart of the geometric mean response time per workload bucket for Impala, SparkSQL, Presto, and Hive 0.13.)

7.3 Multi-User Performance

Impala's superior performance becomes more pronounced in multi-user workloads, which are ubiquitous in real-world applications. Figure 7 (left) shows the response time of the four systems when there are 10 concurrent users submitting queries from the interactive category. In this scenario, Impala's advantage over the other systems grows from 6.7x to 18.7x when going from single-user to concurrent-user workloads. The speedup varies from 10.6x to 27.4x depending on the comparison. Note that Impala's speed under 10-user load was nearly half that under single-user load, whereas the average across the alternatives was just one-fifth that under single-user load.

Similarly, Figure 7 (right) compares the throughput of the four systems. Impala achieves from 8.7x up to 22x higher throughput than the other systems when 10 users submit queries from the interactive bucket.

Figure 7: Comparison of query response times and throughput on multi-user runs. (Left: single-user versus 10-user completion times in the interactive bucket; right: throughput in queries per hour for the interactive bucket.)

7.4 Comparing against a commercial RDBMS

From the above comparisons it is clear that Impala is at the forefront among the SQL-on-Hadoop systems in terms of performance. But Impala is also suitable for deployment in traditional data warehousing setups. In Figure 8 we compare the performance of Impala against a popular commercial columnar analytic DBMS, referred to here as "DBMS-Y" due to a restrictive proprietary licensing agreement. We use a TPC-DS data set of scale factor 30,000 (30TB of raw data) and run queries from the workload presented in the previous paragraphs. We can see that Impala outperforms DBMS-Y by up to 4.5x, and by an average of 2x, with only three queries performing more slowly.

Figure 8: Comparison of the performance of Impala and a commercial analytic RDBMS. (Per-query completion times in seconds for Impala and DBMS-Y.)

8. ROADMAP

In this paper we gave an overview of Cloudera Impala. Even though Impala has already had an impact on modern data management and is the performance leader among SQL-on-Hadoop systems, there is much left to be done. Our roadmap items roughly fall into two categories: the addition of yet more traditional parallel DBMS technology, which is needed in order to address an ever increasing fraction of the existing data warehouse workloads, and solutions to problems that are somewhat unique to the Hadoop environment.
8.1 Additional SQL Support

Impala's support of SQL is fairly complete as of the 2.0 version, but some standard language features are still missing: set operators MINUS and INTERSECT; ROLLUP and GROUPING SET; dynamic partition pruning; DATE/TIME/DATETIME data types. We plan on adding those over the next releases.

Impala is currently restricted to flat relational schemas, and while this is often adequate for pre-existing data warehouse workloads, we see increased use of newer file formats that allow what are in essence nested relational schemas, with the addition of complex column types (structs, arrays, maps). Impala will be extended to handle those schemas in a manner that imposes no restrictions on the nesting levels or number of nested elements that can be addressed in a single query.

8.2 Additional Performance Enhancements

Planned performance enhancements include intra-node parallelization of joins, aggregation and sort, as well as more pervasive use of runtime code generation for tasks such as data preparation for network transmission, materialization of query output, etc. We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions [11, 13].

Another area of planned improvements is Impala's query optimizer. The plan space it explores is currently intentionally restricted for robustness/predictability, in part due to the lack of sophisticated data statistics (e.g. histograms) and additional schema information (e.g. primary/foreign key constraints, nullability of columns) that would enable more accurate costing of plan alternatives. We plan on adding histograms to the table/partition metadata in the near-to-medium term to rectify some of those issues. Utilizing such additional metadata and incorporating complex plan rewrites in a robust fashion is a challenging ongoing task.

8.3 Metadata and Statistics Collection

The gathering of metadata and table statistics in a Hadoop environment is complicated by the fact that, unlike in an RDBMS, new data can show up simply by moving data files into a table's root directory. Currently, the user must issue a command to recompute statistics and update the physical metadata to include new data files, but this has turned out to be problematic: users often forget to issue that command or are confused about when exactly it needs to be issued. The solution to that problem is to detect new data files automatically by running a background process which also updates the metadata and schedules queries which compute the incremental table statistics.

8.4 Automated Data Conversion

One of the more challenging aspects of allowing multiple data formats side-by-side is the conversion of data from one format into another. Data is typically added to the system in a structured row-oriented format, such as Json, Avro or XML, or as text. On the other hand, from the perspective of performance a column-oriented format such as Parquet is ideal. Letting the user manage the transition from one to the other is often a non-trivial task in a production environment: it essentially requires setting up a reliable data pipeline (recognition of new data files, coalescing them during the conversion process, etc.), which itself requires a considerable amount of engineering.
We are planning on adding automation of the conversion process, such that the user can mark a table for References auto-conversion; the conversion process itself is piggy-backed onto the background metadata and statistics gathering pro- [1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. cess, which additionally schedules conversion queries that Weaving relations for cache performance. In VLDB, run over the new data files. 2001. [2] Apache. Centralized cache management in HDFS. Avail- 8.5 Resource Management able at https://hadoop.apache.org/docs/r2.3.0/hadoop- Resource management in an open multi-tenancy environ- project-dist/hadoop-hdfs/CentralizedCacheManagement.html. ment, in which Impala shares cluster resource with other pro- [3] Apache. HDFS short-circuit local reads. Available at cessing frameworks such as MapReduce, Spark, etc., is as yet http://hadoop.apache.org/docs/r2.5.1/hadoop-project- an unsolved problem. The existing integration with YARN dist/hadoop-hdfs/ShortCircuitLocalReads.html. does not currently cover all use cases, and YARN’s focus on having a single reservation registry with synchronous resource [4] Apache. Sentry. Available at http://sentry.incubator.apache.org/. reservation makes it difficult to accommodate low-latency, high-throughput workloads. We are actively investigating [5] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. new solutions to this problem. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA, 2007. 8.6 Support for Remote Data Storage [6] A. Floratou, U. F. Minhas, and F. Ozcan. SQL-on- Impala currently relies on collocation of storage and com- Hadoop: Full circle back to shared-nothing database putation in order to achieve high performance. However, architectures. PVLDB, 2014. cloud data storage such as Amazon’s S3 is becoming more [7] G. Graefe. Encapsulation of parallelism in the Volcano popular. Also, legacy storage infrastructure based on SANs query processing system. In SIGMOD, 1990. necessitates a separation of computation and storage. We are actively working on extending Impala to access Amazon S3 [8] C. Lattner and V. Adve. LLVM: A compilation frame- (slated for version 2.2) and SAN-based systems. Going be- work for lifelong program analysis & transformation. In yond simply replacing local with remote storage, we are also CGO, 2004. planning on investigating automated caching strategies that [9] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivaku- allow for local processing without imposing an additional mar, M. Tolton, and T. Vassilakis. Dremel: Interactive operational burden. analysis of web-scale datasets. PVLDB, 2010. [10] S. Padmanabhan, T. Malkemus, R. C. Agarwal, and 9. CONCLUSION A. Jhingran. Block oriented processing of relational In this paper we presented Cloudera Impala, an open- database operations in modern computer architectures. source SQL engine that was designed to bring parallel DBMS In ICDE, 2001. technology to the Hadoop environment. Our performance [11] V. Raman, G. Attaluri, R. Barber, N. Chainani, results showed that despite Hadoop’s origin as a batch pro- D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Light- cessing environment, it is possible to build an analytic DBMS stone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, on top of it that performs just as well or better that cur- I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, rent commercial solutions, but at the same time retains the and L. Zhang. 
DB2 with BLU Acceleration: So much flexibility and cost-effectiveness of Hadoop. more than just a column store. PVLDB, 6, 2013. In its present state, Impala can already replace a tradi- [12] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, tional, monolithic analytic RDBMSs for many workloads. We M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, predict that the gap to those systems with regards to SQL S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, functionality will disappear over time and that Impala will B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: be able to take on an every increasing fraction of pre-existing Yet another resource negotiator. In SOCC, 2013. data warehouse workloads. However, we believe that the modular nature of the Hadoop environment, in which Impala [13] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, draws on a number of standard components that are shared A. Zeier, and J. Schaffner. SIMD-scan: ultra fast in- across the platform, confers some advantages that cannot be memory table scan using on-chip vector processing units. replicated in a traditional, monolithic RDBMS. In particular, PVLDB, 2, 2009. the ability to mix file formats and processing frameworks means that a much broader spectrum of computational tasks can be handled by a single system without the need for data movement, which itself is typically one of the biggest imped-