Memory Footprint Matters: Efficient Equi-Join Algorithms for Main Memory Data Processing

Spyros Blanas and Jignesh M. Patel University of Wisconsin–Madison {sblanas,jignesh}@cs.wisc.edu

Abstract

High-performance analytical data processing systems often run on servers with large amounts of main memory. A common operation in such environments is combining data from two or more sources using some "join" algorithm. The focus of this paper is on studying hash-based and sort-based equi-join algorithms when the data sets being joined fully reside in main memory. We only consider a single node setting, which is an important building block for larger high-performance distributed data processing systems. A critical contribution of this work is in pointing out that in addition to query response time, one must also consider the memory footprint of each join algorithm, as it impacts the number of concurrent queries that can be serviced. Memory footprint becomes an important deployment consideration when running analytical data processing services on hardware that is shared by other concurrent services. We also consider the impact of particular physical properties of the input and the output of each join algorithm. This information is essential for optimizing complex query pipelines with multiple joins. Our key contribution is in characterizing the properties of hash-based and sort-based equi-join algorithms, thereby allowing system implementers and query optimizers to make a more informed choice about which join algorithm to use.

1 Introduction

Fueled by the economies of scale, businesses and governments around the world are consolidating their data processing infrastructure to public or private clouds. In order to meet the stringent response time constraints of interactive analytical applications, analytic data processing systems running on these clouds aim to keep the working data set cached in main memory. In fact, many vendors now offer dedicated main memory database appliances, like SAP HANA [13] and Oracle Exalytics [27], or have incorporated main-memory technology, such as IBM Blink [5], to further accelerate data processing.

While memory densities have been increasing, at the other end of the motherboard another big change has been underway. Driven by power and heat dissipation considerations, about seven years ago processors started moving from a single core to multiple cores (i.e., multicore). The processing component has moved even further now into multi-sockets, with each socket having multiple cores. The servers powering the public and private clouds today commonly have multi-socket, multicore processors and hundreds of gigabytes of main memory.

This paper focuses on the workhorse operation of an analytical query processing engine, namely the equi-join operation. As is readily recognized, an equi-join operation is a common and expensive operation in analytics and decision support applications, and efficient join algorithms are critical in improving the overall performance of main-memory data processing for such workloads.

Naturally, there has been a lot of recent interest in designing and evaluating main memory equi-join algorithms. The current literature, however, provides a confusing set of answers about the relative merits of different join algorithms. For example, Blanas et al. [6] showed how the hash join can be adapted to single-socket multicore main memory configurations. Then, Albutiu et al. [2] showed that on multi-socket configurations, a new sort-merge join algorithm (MPSM) outperformed the hash join algorithm proposed in [6], but their study did not evaluate hash-based algorithms that are multi-socket friendly. At the same time, Balkesen et al. [4] found that a hardware-conscious implementation of the radix partitioned join algorithm can greatly improve performance over the join algorithm proposed in [6]. All of these studies are set against the backdrop of an earlier study by Kim et al. [22] that described how sort-based algorithms are likely to outperform hash-based algorithms in future processors. Thus, there is no clear answer regarding what equi-join algorithm works well, and under what circumstances, in main memory, multi-socket, multicore environments.

Previous work in this space has also largely ignored the memory footprint of each join algorithm. Main memory, however, is a precious resource in these "in-memory" settings. The peak working memory that each query requires is a key factor in determining how many tenants can be consolidated into a single system. Algorithms that use working memory judiciously allow more tenants to simultaneously use a single system, and thus lower the total infrastructure operating cost.

In this paper, we build on previous work as we adapt the hash join algorithm by Balkesen et al. [4] for a multi-socket setting. We consider the recently proposed massively parallel sort-merge (MPSM) join algorithm by Albutiu et al. [2], in addition to two more traditional sort-merge join variants. Finally, we also experiment with a sort-merge algorithm that implements a bitonic merge network using SIMD instructions to exploit data-level parallelism, as it was proposed by Kim et al. [22].

This paper makes three key contributions. First, we characterize the memory access pattern and the memory footprint of each join algorithm. Second, we identify two common physical properties of the join input and output that greatly impact the overall performance. The two physical properties are (1) data being hash partitioned on the join key, and (2) data being pre-sorted on the join key. This information is important for query optimization when considering queries with multiple joins. Finally, we conduct a comprehensive experimental study for each join algorithm in a main memory setting. Our results show that the hash join algorithm in many cases produces results faster than the sort-based algorithms, and the memory footprint of the hash join algorithm is generally smaller than that of the sort-based algorithms. Sort-based algorithms become competitive only when one of the join inputs is already sorted. Our study concludes that to achieve the best response time and consolidation for main memory equi-join processing, it is necessary to take into account the physical properties of the inputs and the output, and it is crucial that both hash-based and sort-based join algorithms are implemented.

The remainder of this paper is organized as follows. We introduce the join algorithms in Section 2, and provide implementation details in Section 3. We continue with the experimental evaluation in Section 4. Related work is described in Section 5, and Section 6 contains our conclusions.

2 Parallel equi-join algorithms

In this section, we describe key main memory join algorithms that have been recently proposed. We first briefly present each algorithm, and then analyze how each algorithm fares with respect to two factors: (1) memory traffic and access pattern, and (2) working memory footprint.

We describe each algorithm assuming that the query engine is built using a parallel, operator-centric, pull-based model [17]. In this model, a query is a tree of operators, and T worker threads are available to execute the query. The leaves of the tree read data from the base tables or indexes directly. To execute a query, each operator recursively requests ("pulls") data from one or more children, processes them, and writes the results in an output buffer local to each worker thread. The output buffer is then returned to the parent operator. Multiple worker threads may concurrently execute (part of) the same operator.

When the algorithm operates on a single source, such as the partitioning and sorting algorithms we describe below, we use R to denote the entire input, and ‖R‖ to denote the cardinality (number of tuples) of the input. When the algorithm needs two inputs, such as the join algorithms, we use R and S to refer to each input, and we assume that ‖R‖ ≤ ‖S‖. In the description of each join algorithm that follows, we focus on equi-joins: an output tuple is produced only if the join key in R is equal to the join key in S.

2.1 Partitioning

The partitioning operation takes as inputs a table R and a partitioning function p(·), such that p(r) is an integer in [1, P] for all tuples r ∈ R. R is partitioned on the join key if the partitioning function p(·) ensures that if two tuples r1, r2 ∈ R have equal join keys, then p(r1) = p(r2). The output of the partitioning operation is P partitions of R. We use Ri, where i ∈ [1, P], to refer to the partition whose tuples satisfy the property p(r) = i, for r ∈ R.

Partitioning plays a key role in intra-operator parallelism in various settings, since once the input is partitioned, different worker threads can operate on different partitions [10, 12, 28] in parallel. The partitioning algorithm described here is the parallel radix partitioning described by Kim et al. [22], which is in turn based on the original radix partitioning algorithm by Boncz et al. [7]. The main idea behind this parallel partitioning algorithm is that one can eliminate expensive latch-based synchronization during repartitioning, if all the threads know how many tuples will be stored in each partition in advance.

This partitioning algorithm works as follows. Let T be the number of threads that participate in the repartitioning. The algorithm starts by having each thread i, where i ∈ [1, T], construct an empty histogram Hi with P bins. We use Hi(t) to refer to bin t of the histogram Hi, where t ∈ [1, P]. Every thread then reads a tuple r from the input table R, increments the count for the bin Hi(p(r)), and stores the tuple r in a (NUMA) local buffer. (In general, it is faster to buffer the input locally than it is to re-evaluate the subquery that produces it. The buffering step can be eliminated in the special case where the input is in a materialized table.) This process continues until all the tuples in R have been processed. Now the size of the output partition Rt, where t ∈ [1, P], can be computed as ∑i Hi(t). In addition, thread i can write its output at location Li(t) = ∑k∈[1,i) Hk(t) inside this partition Rt, without interfering with any other thread. Finally, the actual repartitioning step takes place, and thread i reads each tuple r from its buffer, computes the output partition number p(r), and writes r to the output location Li(p(r)) in the output partition Rp(r). The local buffer can be deallocated once each thread has finished processing the R tuples.
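The two-pass structure described above — per-thread histograms, a prefix sum to derive each thread's write offsets Li(t), and a latch-free scatter — is straightforward to express in code. The following C++ sketch is only illustrative: it chunks the input by thread rather than pulling it from a child operator, ignores NUMA-aware allocation, and all names (Tuple, partition_parallel, and so on) are ours, not the paper's.

```cpp
// Minimal sketch of the histogram-based parallel partitioning of Section 2.1.
// Simplifying assumptions: the input is an in-memory array (so no per-thread
// input buffering is shown), and P is a power of two.
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

struct Tuple { uint64_t key, payload; };

// p(r): partitioning function.
static inline uint32_t p_of(const Tuple& r, uint32_t P) { return r.key & (P - 1); }

std::vector<Tuple> partition_parallel(const std::vector<Tuple>& R,
                                      uint32_t T, uint32_t P,
                                      std::vector<size_t>& part_begin) {
    auto chunk = [&](uint32_t i) {   // contiguous input range for thread i
        return std::make_pair(R.size() * i / T, R.size() * (i + 1) / T);
    };

    // Pass 1: each thread i builds its private histogram H_i over its chunk.
    std::vector<std::vector<size_t>> H(T, std::vector<size_t>(P, 0));
    std::vector<std::thread> workers;
    for (uint32_t i = 0; i < T; ++i)
        workers.emplace_back([&, i] {
            auto [lo, hi] = chunk(i);
            for (size_t k = lo; k < hi; ++k) ++H[i][p_of(R[k], P)];
        });
    for (auto& w : workers) w.join();
    workers.clear();

    // Partition sizes, and each thread's private write cursor
    // L_i(t) = (start of partition t) + sum over k < i of H_k(t).
    part_begin.assign(P + 1, 0);
    for (uint32_t t = 0; t < P; ++t)
        for (uint32_t i = 0; i < T; ++i) part_begin[t + 1] += H[i][t];
    for (uint32_t t = 0; t < P; ++t) part_begin[t + 1] += part_begin[t];

    std::vector<std::vector<size_t>> L(T, std::vector<size_t>(P));
    for (uint32_t t = 0; t < P; ++t) {
        size_t off = part_begin[t];
        for (uint32_t i = 0; i < T; ++i) { L[i][t] = off; off += H[i][t]; }
    }

    // Pass 2: scatter. Threads write to disjoint ranges, so no latches are needed.
    std::vector<Tuple> out(R.size());
    for (uint32_t i = 0; i < T; ++i)
        workers.emplace_back([&, i] {
            auto [lo, hi] = chunk(i);
            for (size_t k = lo; k < hi; ++k) out[L[i][p_of(R[k], P)]++] = R[k];
        });
    for (auto& w : workers) w.join();
    return out;   // out[part_begin[t] .. part_begin[t+1]) holds partition R_t
}
```

Because every thread owns a disjoint slice of each output partition, the scatter pass needs no synchronization beyond the final join of the worker threads, which is exactly the property the histogram pass buys.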
2.2 In-place sorting

Work on sorting algorithms can be traced back to the earliest days of computing, and textbooks on data structures and algorithms commonly have a chapter dedicated to sorting. Graefe has written a survey [19] of sorting techniques used by commercial DBMSs, and fast parallel sort algorithms for main memory DBMSs are an active research area [22, 29].

We originally experimented with a sort-merge algorithm that implements a bitonic merge network using SIMD instructions to exploit data-level parallelism [22]. (We show results from this algorithm in Section 4.4 for the dataset where all tuples are eight bytes wide.) We ultimately decided to use introspective sort [25], which is a popular in-place sorting algorithm, for three reasons. First, many highly-optimized sorting algorithms make strong assumptions about the physical placement of the data and the width of the tuple (for example, only eight bytes in [22]). The second reason is that introspection sort has been used in recently published results [2]; using the same algorithm promotes uniformity with prior work. Finally, introspection sort is implemented in many widely-used and heavily optimized libraries, like the C++ standard template library. Our findings are therefore less likely to be affected by suboptimal configuration settings or poorly optimized code.

2.3 Hash join

The algorithm described here is the main-memory hash join proposed in [6], adapted to ensure that the hash table is striped across the local memory of all threads that participate in the join [4]. But, exactly as the original algorithm, it is otherwise oblivious to any NUMA effects. The algorithm has a build phase and a probe phase.

At the start of the build phase, all the threads allocate memory locally. The union of these memory fragments constitutes a single hash table that is shared by all the threads, where logically adjacent hash buckets are physically located in different NUMA nodes. Thread i then reads a tuple r ∈ R and hashes on the join key of r using a pre-defined hash function h(·). It then acquires a latch on bucket h(r), copies r in this bucket, and releases the latch. The memory writes required for this operation may either target local or remote NUMA memory. The build phase is completed when all the R tuples have been stored in the hash table.

During the probe phase, each thread reads a tuple s ∈ S, and hashes on the join key of s using the same function h(·). For each tuple r stored in the hash bucket h(s), the join keys of r and s are compared and the tuples are joined together if the join keys match. These memory reads may either target local or remote memory depending on the physical location of the hash bucket h(s). Because the hash table is only read during the probe phase, there is no need to acquire latches. When all the S tuples have been processed, the probe phase is complete and the memory holding the hash table can be reclaimed.

If the R tuples processed by a thread are pre-sorted or pre-partitioned on the join key, neither property will be reflected in the output. However, if the S tuples that are processed by a thread are pre-sorted or pre-partitioned on any S key, the tuples produced by this thread will also be sorted or partitioned on the same key.
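A minimal sketch of the build and probe phases is shown below. It keeps the two essential properties of the algorithm — per-bucket latching during the build and latch-free probing — but uses a simple chained bucket with a std::mutex as the latch; the 16-byte bucket metadata, bucket pre-sizing, and NUMA-aware striping of the actual prototype (Section 3.2) are not modeled, and all identifiers are illustrative.

```cpp
// Minimal sketch of the shared-hash-table build and probe phases of Section 2.3.
#include <cstdint>
#include <mutex>
#include <vector>

struct Tuple { uint64_t key, payload; };

class SharedHashTable {
public:
    explicit SharedHashTable(size_t log2_buckets)
        : buckets_(size_t{1} << log2_buckets),
          mask_((size_t{1} << log2_buckets) - 1) {}

    // Build phase: any thread may insert; a per-bucket latch serializes writers.
    void build(const Tuple& r) {
        Bucket& b = buckets_[h(r.key)];
        std::lock_guard<std::mutex> latch(b.latch);
        b.tuples.push_back(r);
    }

    // Probe phase: the table is read-only, so no latch is acquired.
    template <typename Emit>
    void probe(const Tuple& s, Emit&& emit) const {
        const Bucket& b = buckets_[h(s.key)];
        for (const Tuple& r : b.tuples)
            if (r.key == s.key) emit(r, s);    // produce one join result
    }

private:
    struct Bucket {
        std::mutex latch;                // used only during the build phase
        std::vector<Tuple> tuples;       // chain of tuples hashed to this bucket
    };
    // Multiplicative hash into a power-of-two table (see Section 3.2).
    size_t h(uint64_t k) const { return (k * 2654435761ULL) & mask_; }

    std::vector<Bucket> buckets_;
    size_t mask_;
};
```

In a typical use, every worker thread first calls build() for each of its R tuples, all threads synchronize at a barrier, and then each thread calls probe() for each of its S tuples.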
2.4 Streaming merge join (STRSM)

This algorithm is a parallel version of the standard merge join algorithm [11]. It is assumed that both R and S are already partitioned on the join key with some partitioning function that has produced as many partitions as the number of threads that participate in the join. If T is the number of threads, we use Ri to refer to the partition of R that will be processed by thread i ∈ [1, T], and similarly we use Si to denote the partition of S that will be processed by the same thread i. Furthermore, this algorithm requires that both Ri and Si are already sorted on the join key.

Each thread i reads the first tuple r from Ri and the first tuple s from Si. If the r tuple has a smaller join key value than the s tuple, then the r tuple is discarded and the next tuple from Ri is read. Otherwise, if the r tuple has a larger join key value than the s tuple, the s tuple is discarded and the next Si tuple is read. If the join keys of r and s are equal, then the algorithm will buffer all Ri and Si tuples whose join key is equal, produce the join output, and then discard these temporary buffers. This process is repeated until either Ri or Si is depleted. The output of each thread is sorted and partitioned on the join key.
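The per-thread merge logic lends itself to a compact implementation. The sketch below uses illustrative names, handles the equal-key runs with index ranges instead of temporary buffers, and assumes both inputs are already sorted on the join key, as the algorithm requires.

```cpp
// Minimal sketch of the per-thread streaming merge join of Section 2.4.
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };

template <typename Emit>
void stream_merge_join(const std::vector<Tuple>& Ri,
                       const std::vector<Tuple>& Si, Emit&& emit) {
    size_t r = 0, s = 0;
    while (r < Ri.size() && s < Si.size()) {
        if (Ri[r].key < Si[s].key)      ++r;        // discard unmatched R tuple
        else if (Ri[r].key > Si[s].key) ++s;        // discard unmatched S tuple
        else {
            // Equal keys: find the run of equal tuples on both sides and
            // produce their cross product, then move past both runs.
            uint64_t k = Ri[r].key;
            size_t r_end = r, s_end = s;
            while (r_end < Ri.size() && Ri[r_end].key == k) ++r_end;
            while (s_end < Si.size() && Si[s_end].key == k) ++s_end;
            for (size_t i = r; i < r_end; ++i)
                for (size_t j = s; j < s_end; ++j) emit(Ri[i], Si[j]);
            r = r_end; s = s_end;
        }
    }
}
```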
2.5 Range-partitioned MPSM merge join

This algorithm is the merge phase of the range-partitioned P-MPSM algorithm [2]. The algorithm requires R to be partitioned in T partitions on the join key. We keep the notation introduced in Section 2.4, and use Ri to refer to the partition of R that will be processed by thread i, where i ∈ [1, T]. The algorithm requires that each Ri partition has already been sorted on the join key.

Each thread i first allocates a private buffer Bi in NUMA-local memory, and starts copying tuples from S into Bi. (Bi is not a partition of S, as two tuples with the same key might be processed by different threads, and thus be stored in different buffers.) When S is depleted, each thread i sorts its Bi buffer in-place, using the algorithm described in Section 2.2. Let B_i^j be the region in the Bi buffer that corresponds to the tuples that join with tuples in partition Rj. Thread i then computes the offsets of every B_i^j region in the Bi buffer, for each j ∈ [1, T]. Because Bi is sorted on the join key, one can compute the partition boundary offsets efficiently using either interpolation or binary search. Once all the threads have computed these offsets, thread i proceeds to merge Ri and B_1^i using the streaming merge join algorithm described in Section 2.4. Thread i continues with merging Ri with B_2^i, then Ri with B_3^i, and so on, until Ri has been fully merged with B_T^i. The memory for the Bi buffer can only be reclaimed by the memory manager after all the threads have completed.

The output of the MPSM join is partitioned on the join key, with the same partitioning function as R, per thread. If S is range-partitioned on the join key, even with a different partitioning function than that for R, each thread i can produce output that is sorted on the join key by processing each B_j^i fragment in order.

2.6 Parallel merge join (PARSM)

This algorithm is a variation of the MPSM join algorithm described above. Instead of scanning the R input T times, this algorithm scans R once and performs a merge of T + 1 inputs on the fly. Compared to the original MPSM merge join algorithm, this variant reduces the volume of memory traffic and always produces output sorted on the join key, at the cost of a non-sequential access pattern when reading S tuples. As before, the algorithm assumes R has already been partitioned on the join key in T partitions, and that each partition Ri is sorted on the join key.

Each thread i that participates in the parallel merge join algorithm starts by reading and storing tuples from S in the private buffer Bi. Then, thread i sorts Bi on the join key, and the B_i^j regions are computed exactly as in the MPSM algorithm. The difference in the parallel merge join algorithm lies in how thread i merges all the B_j^i regions with Ri. The original MPSM algorithm performs T passes, and merges two inputs at a time. In the parallel merge join algorithm, thread i merges all B_j^i buffers with the Ri partition in one pass, where j ∈ [1, T]. This is a parallel merge between T inputs for the S side, and one input for the R side, which results in T + 1 input tuples being candidates for merging at any given time.

The output of the parallel merge join is partitioned on the join key per thread, with the same partitioning function as R, and each thread's output is always sorted on the join key.
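To make the T + 1-way merge concrete, the sketch below joins one sorted Ri partition against T sorted S regions in a single pass, always advancing the S cursor with the smallest current key. It uses a plain linear scan over the cursors instead of a tuned multi-way merge, and the names are illustrative rather than taken from any published implementation.

```cpp
// Minimal sketch of the merge phase of the parallel merge join (Section 2.6):
// one sorted R_i partition against T sorted S regions, T + 1 cursors in play.
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };
using Run = std::vector<Tuple>;   // one sorted B_j^i region of S

template <typename Emit>
void parallel_merge_join(const Run& Ri, const std::vector<Run>& Sregions,
                         Emit&& emit) {
    std::vector<size_t> cur(Sregions.size(), 0);   // one cursor per S region
    size_t r = 0;
    while (r < Ri.size()) {
        // Pick the S region whose current tuple has the smallest key.
        size_t best = Sregions.size();
        for (size_t j = 0; j < Sregions.size(); ++j)
            if (cur[j] < Sregions[j].size() &&
                (best == Sregions.size() ||
                 Sregions[j][cur[j]].key < Sregions[best][cur[best]].key))
                best = j;
        if (best == Sregions.size()) break;         // S side exhausted

        uint64_t skey = Sregions[best][cur[best]].key;
        if (skey < Ri[r].key)      ++cur[best];     // discard unmatched S tuple
        else if (skey > Ri[r].key) ++r;             // discard unmatched R tuple
        else {
            // Join the current S tuple with the run of equal keys in R_i;
            // output is produced in nondecreasing join key order.
            for (size_t k = r; k < Ri.size() && Ri[k].key == skey; ++k)
                emit(Ri[k], Sregions[best][cur[best]]);
            ++cur[best];
        }
    }
}
```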
2.7 Analysis

We now analyze how each algorithm fares with respect to two factors: (1) memory traffic and access pattern, and (2) working memory footprint. We summarize the analysis in Table 1. Paying attention to the first factor, memory accesses, is important because the CPU may stall while waiting for a particular datum to arrive. Such stalls have been shown to have a significant impact on performance [1]. With modern CPUs packing more than eight cores per die, the available memory bandwidth per hardware thread has been shrinking and access to memory is becoming relatively more expensive than in the past. The second factor, working memory size, is important because memory is a precious storage medium, and using it judiciously can allow a database service to admit more concurrent queries and achieve higher throughput.

Algorithm        Input   Sequential reads   Sequential writes   Random reads       Random writes      Peak memory footprint
Partitioning     R       2 × ‖R‖            ‖R‖                 —                  ‖R‖                2 × ‖R‖
In-place sort    R       —                  —                   Θ(‖R‖ log ‖R‖)     Θ(‖R‖ log ‖R‖)     ‖R‖
Hash join        R, S    ‖R‖ + ‖S‖          ‖R‖ × ‖S‖ × σ       ‖S‖                ‖R‖                ‖R‖
Streaming merge  R, S    ‖R‖ + ‖S‖          ‖R‖ × ‖S‖ × σ       —                  —                  Θ(1)
MPSM merge       R, S    ‖R‖ × T + ‖S‖      ‖R‖ × ‖S‖ × σ       —                  —                  ‖R‖ + ‖S‖
Parallel merge   R, S    ‖R‖                ‖R‖ × ‖S‖ × σ       ‖S‖                —                  ‖S‖

Table 1: Number of memory accesses and peak memory footprint (in tuples) for each algorithm described in Section 2. ‖R‖ is the cardinality of R, the number of worker threads is denoted by T, and σ is the join selectivity.

The partitioning algorithm first scans R twice, once from the input and once from each thread's local buffer, for a total of 2 × ‖R‖ memory read operations and ‖R‖ memory write operations. These reads and writes are sequential and to the local NUMA memory, making this phase extremely efficient. Producing the output partitions requires ‖R‖ additional memory writes, assuming the partitioning is completed in one pass. This is a more costly operation as the writes may target remote NUMA memory. The partitioning algorithm needs to maintain an input and an output buffer, a total of 2 × ‖R‖ tuples. Once the repartitioning is completed, the input buffer can be deallocated, bringing the total space requirement down to ‖R‖ tuples.

Moving to the in-place sorting algorithm, we find that sorting is an expensive operation with respect to memory traffic, as the introspective sort algorithm reads and writes Θ(‖R‖ log ‖R‖) tuples in the average case. The memory access pattern is random, but all operations are performed in a buffer that is local to each processing thread, so all memory operations target local NUMA memory.

The hash join algorithm exhibits a memory access pattern that is hard to predict, and represents a demanding workload for the memory subsystem. Building a hash table requires writing ‖R‖ tuples at the locations pointed to by the hash function. A good hash function would pick any bucket with the same probability for each distinct join key, causing writes that are randomly scattered across different hash buckets. Similarly, probing the hash table causes ‖S‖ reads, and these reads are randomly distributed across all the hash buckets, causing remote memory reads from other NUMA nodes. Each probe lookup reads the entire chain of tuples associated with a hash bucket, but, in the absence of skew, the chain length is low in correctly-sized hash tables [16]. Overall, the hash join algorithm performs ‖R‖ writes and ‖S‖ reads with a random memory access pattern, potentially to remote NUMA nodes. When it comes to memory footprint, the hash join algorithm needs to buffer all the tuples in the build table R. In addition, the memory needed to store bucket metadata can be significant, and can even be greater than the size of R if the size of each R tuple is small.

The streaming merge join algorithm (STRSM) is extremely efficient. When it comes to memory operations, the algorithm reads the two inputs once and produces results on-the-fly. Moreover, the read access pattern is sequential within each Ri and Si input for each thread, a pattern which is easily optimized with data prefetching by the hardware. Finally, the memory requirements are also negligible: the memory manager needs to provide sufficient memory to each thread i to buffer the most commonly occurring join key in either Ri or Si. Across all the T threads, this can be as little space as is needed to hold one tuple per thread, for a total of T tuples of buffer space, or at most min(‖R‖, ‖S‖) tuples of buffer space for the degenerate case where the join is equivalent to the cartesian product.

The MPSM merge join algorithm can cause a lot of memory traffic. The number of tuples that MPSM reads grows linearly with the number of threads participating in the join; this algorithm reads a total of ‖R‖ × T + ‖S‖ tuples, where T is the number of threads. The access pattern of the MPSM algorithm is sequential and only two inputs are merged at any given time per thread. This memory access pattern is highly amenable to hardware prefetching. Regarding memory footprint, we find that the MPSM join needs ‖R‖ + ‖S‖ tuples of space to perform well. At minimum, the MPSM join needs to store the entire S input, so the memory capacity requirement is at least ‖S‖ tuples. Although it is not necessary for correctness, we have observed that buffering R significantly improves performance: as the join is commonly an operator that is closer to the root than the leaves of a query tree, it is preferable to execute the subquery that produces R once, buffer the output, and reuse this buffer T − 1 times, than it is to execute the entire R subquery T times.

The parallel merge join (PARSM) causes significantly lower memory traffic compared to the MPSM join. The biggest gain comes from reading the inputs once, so the number of tuples that the parallel merge join reads remains constant regardless of the number of threads participating in the join. The memory access pattern, however, is very different, as T + 1 locations need to be inspected. As we will show in Section 4.2, keeping track of T + 1 locations stresses the TLB, causing frequent page walks which waste many cycles, and also makes hardware prefetching less effective. When it comes to memory space, the parallel merge join algorithm only needs sufficient memory to buffer the entire S input. Unlike the MPSM algorithm, each Ri partition is scanned once by each thread i, so there is no performance benefit in buffering the R input.

3 Evaluation methodology

3.1 Hardware

The hardware we use for our evaluation is a Dell R810 server with four Intel Xeon E7-4850 CPUs clocked at 2GHz, running RedHat Enterprise 6.0 with the 2.6.32-279 kernel. This is a system with four NUMA nodes, one for each die, and 64 GB per NUMA node, or 256 GB for the entire system. Each E7-4850 CPU packs 10 cores (20 hyper-threads) that share a 24MB L3 cache and have private 256KB L2 and 32KB L1 data caches with 64-byte wide cache lines. The L1 data TLB has 64 entries. Each socket is directly connected to every other socket via a point-to-point QPI link, and a single QPI link can transport up to 12.8 GB/sec in each direction. Finally, each socket has its own memory controller, which has 4 DDR3 channels to memory for a total theoretical bandwidth of about 33.3 GB/sec per socket.

We refer to a number of performance counters when presenting our experimental results in Section 4.2. We have written our own utility to tap into the Uncore counters [21]. When we refer to memory reads, we count the L3 cache lines filled as reported by the LLC_S_FILLS event. When we refer to memory writes, we count the L3 cache lines victimized in the M state, or the LLC_VICTIMS_M event. We obtain the QPI utilization by comparing the NULL/IDLE flits sent across all QPI links with the number of NULL/IDLE flits sent when the system is idle. We then report the utilization of the most heavily utilized QPI link (i.e., the link most likely to be a bottleneck) averaged over the measurement interval. Finally, we obtain timings for the TLB page miss handler by measuring the DTLB_LOAD_MISSES.WALK_CYCLES event.

3.2 Query engine prototype

We have implemented all algorithms described in Section 2 in a query engine prototype written in C++. The engine is parallel and NUMA-aware, and can execute simple queries on main memory data. The engine expects queries in the form of a physical plan, which is provided by the user in a text file. The plan carries all the information necessary to compute the query result. All operators in our query engine have a standard interface consisting of start(), stop() and next() calls. This is a pull-based model similar to the Volcano system [17]. We amortize the cost of function calls; each next() call returns a 64KB array of tuples.

Our query engine prototype implements all the algorithms described in Section 2. We base the hash join implementation on the code written by Blanas et al. [6], which is publicly available. Each hash bucket in our implementation has 16 bytes of metadata, and these are used for: (1) a latch, (2) a counter holding the number of unused bytes in this hash bucket, and (3) an overflow pointer to the next bucket, or NULL if there is no overflow. We have extended the original implementation to be NUMA-aware by allocating the entire hash bucket i from NUMA node i mod N, where N is the number of NUMA nodes in the system. This means that tuples in the same hash bucket always reside in the same NUMA node, and reading hash buckets sequentially accesses all NUMA nodes in a round-robin fashion. We size each hash bucket so that two tuples can fit in a bucket before there is an overflow. The number of hash buckets is always a power of two, and we size the hash table such that the load factor is greater than one, but less than two. The original code preallocated the first hash bucket, and we do likewise.

The hash join algorithm needs to compute the hash function for ‖R‖ tuples in the build phase, and ‖S‖ tuples in the probe phase, for a total of ‖R‖ + ‖S‖ hash function evaluations. An ideal hash function has low computational overhead and low probability of hash collisions. Database management systems typically choose the hash function dynamically based on the data type and the domain of the key when the hash table is constructed. The number of cycles spent on computing the hash value per tuple depends on the chosen hash function and the size of the join key. In general, this number ranges from one or two cycles for hash functions based on bit masking and shifting, to hundreds of cycles for hash functions that involve one multiplication per byte of input, like FNV-1a [14]. For our experimental evaluation, if the hash table has 2^b hash buckets, we use the hash function h(x) = (x ∗ 2654435761) mod 2^b. This multiplicative hash function [23] performs one multiplication and one modulo operation per evaluation; it is more expensive to evaluate than bit shifting, but much cheaper than FNV-1a.

We implemented the MPSM merge join algorithm from scratch based on the description of the P-MPSM variant in [2], as the original implementation is not available.
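For reference, the hash function used in the evaluation can be written as a one-line helper; because the bucket count is a power of two, the modulo reduces to a bit mask. The function name below is ours, and the constant 2654435761 is the widely used multiplicative-hashing (Fibonacci hashing) constant.

```cpp
// h(x) = (x * 2654435761) mod 2^b, for a hash table with 2^b buckets (Section 3.2).
#include <cstdint>

inline uint64_t hash_bucket(uint64_t x, unsigned b) {
    return (x * 2654435761ULL) & ((uint64_t{1} << b) - 1);   // mod 2^b via bit mask
}
```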
(In the table below, "Sorted?" is short for "per-thread join output sorted on the join key?", and "Partitioned?" is short for "join output partitioned across threads?".)

S property  R property  Label   Sorted?  Partitioned?  Query plan
random      random      HASH    No       No            Build hash table on R, Probe hash table with S, Sum
                        STRSM   Yes      Yes           Partition R, Sort R, Partition S, Sort S, Streaming merge, Sum
                        MPSM    No       Yes           Partition R, Sort R, Sort S, MPSM merge, Sum
                        PARSM   Yes      Yes           Partition R, Sort R, Sort S, Parallel merge, Sum
random      sorted      HASH    No       No            Build hash table on R, Probe hash table with S, Sum
                        STRSM   Yes      Yes           Partition S, Sort S, Streaming merge, Sum
                        MPSM    No       Yes           Sort S, MPSM merge, Sum
                        PARSM   Yes      Yes           Sort S, Parallel merge, Sum
random      hash        HASH    No       No            Probe hash table with S, Sum
                        STRSM   Yes      Yes           Partition R, Sort R, Partition S, Sort S, Streaming merge, Sum
                        MPSM    No       Yes           Partition R, Sort R, Sort S, MPSM merge, Sum
                        PARSM   Yes      Yes           Partition R, Sort R, Sort S, Parallel merge, Sum
sorted      random      HASH    Yes      No            Build hash table on R, Probe hash table with S, Sum
                        STRSM   Yes      Yes           Partition R, Sort R, Streaming merge, Sum
                        MPSM    Yes      Yes           Partition R, Sort R, MPSM merge, Sum
                        PARSM   Yes      Yes           Partition R, Sort R, Parallel merge, Sum
sorted      sorted      HASH    Yes      No            Build hash table on R, Probe hash table with S, Sum
                        STRSM   Yes      Yes           Streaming merge, Sum
                        MPSM    Yes      Yes           MPSM merge, Sum
                        PARSM   Yes      Yes           Parallel merge, Sum
sorted      hash        HASH    Yes      No            Probe hash table with S, Sum
                        STRSM   Yes      Yes           Partition R, Sort R, Streaming merge, Sum
                        MPSM    Yes      Yes           Partition R, Sort R, MPSM merge, Sum
                        PARSM   Yes      Yes           Partition R, Sort R, Parallel merge, Sum

Table 2: List of query plans that we evaluate: “random” means tuples are in random order and not partitioned; “sorted” means sorted on the join key and range-partitioned; “hash” means stored in a hash table and hashed on the join key.

3.3 Metrics

The two metrics we report are response time and memory footprint. Short query response times are, obviously, important for decision support queries. Memory footprint is also an important metric for two reasons. The first is that query plans that use less memory allow for higher tenant consolidation in a cloud-based database service. This permits the concurrent execution of more queries, increasing the throughput per node. Second, main memory is used as working memory for active queries, and also as storage for user data. A main-memory database engine that selects plans that use working memory frugally can keep a larger database in memory, improving the overall utility of the system.

4 Experimental results

4.1 Workload

In our experimental evaluation we simulate the expensive join operations that occur when executing ad-hoc queries against a decision support database. Such a database typically has many smaller dimension tables that contain categorical information, such as customer or product details, and a few large fact tables that capture events of business interest, like the sale of a product.

To speed up query processing, a table might have already been pre-processed in some form; for example, all tuples might have already been sorted on some particular key. We refer to such pre-processing as the physical property of a table. We focus on three such properties:

1. Random, which corresponds to the case where no pre-processing has been done and data is in random order. Random input is processed by worker threads in any order.

2. Sorted, which corresponds to the case where the input is already sorted on the join key. Sorted input can easily be partitioned among worker threads in distinct ranges, as the range partition boundaries can be computed from existing statistics on the input.

3. Hash, which reflects the case where a table is stored in a hash table, indexed on a particular key. This type of pre-processing is common for dimension tables that are small and are updated less frequently. Storing the large fact table in a hash table may be prohibitive in terms of space overhead, and maintaining the hash table is costly in the presence of updates. We therefore only consider this physical property for dimension tables.
4.2 Results from a uniform dataset with narrow tuples

A common pattern in decision support queries is the equi-join between a dimension table and a fact table, followed by an aggregation. We use a synthetic dataset to evaluate an ad-hoc equi-join between the dimension table R, and the fact table S. The dimension table R contains the primary key and the fact table S contains the foreign key. R has 800 × 2^20 tuples and each tuple is sixteen bytes wide. Each R tuple consists of an eight-byte unique primary key in the range of [1, 800 × 2^20], and an eight-byte random integer. The fact table S has four times as many tuples as R, namely 3200 × 2^20 tuples, and each tuple is also sixteen bytes wide. (We picked the one-to-four ratio to match the cardinality ratios of a primary-key foreign-key join between tables Orders and LineItem, or Part and PartSupp, of the TPC-H decision support benchmark.) The first eight-byte attribute of an S tuple is the foreign key of R, and the second eight-byte attribute is a random integer. The cardinality of the output of the primary-key foreign-key join is the cardinality of S, or 3200 × 2^20 tuples. This dataset does not have any skew, and every primary key in R occurs in S exactly four times.

Assuming that the dimension table R has two integer attributes R.a and R.b, and the fact table S has two integer attributes S.c and S.d, for all the experiments described here, we produce plans corresponding to the following SQL query:

SELECT SUM(R.b + S.d)
FROM R, S
WHERE R.a = S.c

                      S in random order              S in join key order
In Figure             1(a)    1(b)    1(c)           2(a)    2(b)    2(c)
                      R rnd   R ord   R ht           R rnd   R ord   R ht
HASH
  Build HT on R       4.2     5.0     —              5.1     5.6     —
  Probe HT with S     13.5    13.7    13.6           6.0     5.3     6.0
  Query               18.8    18.8    13.9           11.1    11.1    6.1
STRSM
  Partition R         1.5     —       1.5            2.8     —       2.8
  Sort R              7.2     —       7.2            7.3     —       7.3
  Partition S         6.4     8.8     6.4            —       —       —
  Sort S              36.3    36.7    36.3           —       —       —
  Stream merge        5.9     2.0     5.9            1.8     1.9     1.8
  Query               62.7    52.3    62.7           11.9    2.0     12.0
MPSM
  Partition R         3.4     —       3.4            2.7     —       2.7
  Sort R              5.9     —       5.9            6.7     —       6.7
  Sort S              29.2    29.1    29.2           —       —       —
  MPSM merge          24.5    16.2    24.5           1.7     1.3     1.7
  Query               69.1    51.2    69.1           16.2    7.5     16.2
PARSM
  Partition R         3.1     —       3.1            2.7     —       2.7
  Sort R              5.8     —       5.8            6.7     —       6.7
  Sort S              28.9    29.0    28.9           —       —       —
  Parallel merge      35.6    28.2    35.6           4.8     4.4     4.8
  Query               81.3    63.4    81.3           19.5    10.6    19.5

Table 3: Time breakdown per operator (in seconds) for the uniform dataset. "R rnd" means that R is in random order, "R ord" means that R is sorted in join key order, and "R ht" means that R is in a hash table on the join key. "Query" reflects the query response time as shown in Figures 1 and 2, and is not the sum of the rows above due to synchronization and buffering overheads that are not reflected in the breakdown.

Depending on the physical properties of R and S, this logical query may be translated into a number of physical execution plans. For our experimental evaluation, we consider candidate plans that feature each join algorithm presented in Section 2 for all combinations of the physical properties of R and S. The query plan space we explore is shown in Table 2. For all our experiments, we execute each query in isolation, and use 80 worker threads, which is the number of hardware contexts our system supports.

As described in Section 3.3, we measure cost both in terms of response time and memory space. Figures 1 and 2 show the response time in seconds on the x-axis, and the memory consumption in gigabytes (2^30 bytes) on the y-axis. We use "HASH" for the hash-based plan, and the "SM" suffix for the three sort-based query plans: "STRSM" for the plan with the streaming merge join operator, "MPSM" for the plan containing the MPSM merge join operator, and "PARSM" for the plan with the parallel merge join operator. Table 2 shows the actual operators for each plan, and Table 3 breaks down time per operator.

4.2.1 S in random order

We start with the case where the larger fact table S contains tuples with keys in random join key order. This requires each algorithm to pay the full cost of transforming the input S to have the physical property necessary to compute the join, like being sorted or partitioned on the join key. In Figure 1 we plot the response time and the memory consumption of each plan, for all three physical properties of R we explore: random, sorted, and hash. We describe each in turn.

[Figure 1 comprises three panels — (a) R is in random order, (b) R is sorted in ascending join key order, (c) R is in hash table on join key — each plotting memory consumption (GB) against response time (sec) for the HASH, STRSM, MPSM, and PARSM plans.]

Figure 1: Response time (in seconds) and memory consumption (in GB) when S is in random order, uniform dataset.

R in random order

We start with the general case where both the R and S tables contain tuples with join keys in random order. The response time and memory demand for all four query plans is shown in Figure 1(a), and a breakdown per operator is shown in Table 3, column 1(a).

The hash join query plan computes the query result in 18.8 seconds, the fastest among all four query plans. The hash join query plan also uses the least space, needing 24.2 GB of scratch memory space, nearly all of which (99%) is used to store the hash table. The majority of the time is spent in the probe phase, which does about 1.90 memory reads per S tuple. Because the S tuples are processed in random join key order, these reads are randomly scattered across the shared hash table. This random memory access pattern causes a significant number of TLB misses, with 10% of the probe time, or 1.31 seconds, being spent on handling page misses on memory loads. As the hash table is striped across all the NUMA nodes, the cross-socket traffic is the highest among all the query plans: we have observed an average QPI utilization of 34%, which means that remote memory access requests might get queued at the QPI links during bursts of higher memory activity.

The streaming sort-merge plan (STRSM) needs 62.7 seconds and 113 GB of memory to produce the output, and the majority of the memory space (100 GB) is needed to partition the S table. The majority of the time is spent in sorting the larger S table. Repartitioning and sorting both R and S causes significant memory traffic: 4.92 memory reads and 3.72 memory writes are performed, on average, per processed tuple.

The MPSM sort-merge join query plan takes 69.1 seconds and 62.7 GB of space to run. Compared to the streaming sort-merge query plan, MPSM is only 10% slower, and, because the MPSM merge-join algorithm avoids repartitioning the larger S table, it needs about half the space. The MPSM merge join algorithm performs T scans of R during the merge phase (see Section 2.5), which means reading an additional 0.96 TB for this particular dataset. This causes 4.32 memory reads per S tuple during the merge phase, which accounts for 35% of the total time.

The parallel sort-merge query plan (PARSM) is identical to the MPSM sort-merge join query plan, described above, for the partitioning and sort phases. During the merge phase, however, the parallel merge join algorithm does not partition S and does not scan R multiple times. This cuts the memory traffic significantly, resulting in only 0.38 memory reads per S tuple for the merge phase. Merging from 81 (T + 1) locations in parallel, however, overflows the 64-entry TLB (see Section 2.6 for a description of the algorithm). We observed that page miss handling on memory loads accounts for at least 37% of the merge time (13.2 seconds). The high TLB miss rate causes the merge phase to take 11 more seconds than the MPSM merge phase, which brings the end-to-end response time to a total of 81.3 seconds, making the parallel merge-join plan the slowest of the four alternatives.

Overall, when R and S are processed in random order, the hash join query plan produces results 3.34× faster and uses 2.59× less memory than the best sort-merge based query plan, despite the many random accesses to remote NUMA nodes that result in increased QPI traffic.
We now consider the case where the smaller dimen- MPSM PARSM MPSM MPSM PARSM PARSM

STRSM HASH STRSM

Memory consumption (GB) HASH Memory consumption (GB) Memory consumption (GB)

0 20 40 60 80 100 120 0 20STRSM 40 60 80 100 120 0 20HASH 40 60 80 100 120

0 20 40 60 80 100 120 0 20 40 60 80 100 120 0 20 40 60 80 100 120 Response time (sec) Response time (sec) Response time (sec) (a) R is in random order (b) R is sorted in ascending join key order (c) R is in hash table on join key

Figure 2: Response time (in seconds) and memory consumption (in GB) when S is sorted in ascending join key order, uniform dataset.

sion table R is sorted, and the larger table S is in random page miss handling overhead associated with keeping order. We plot the response time and memory demand track of T + 1 locations in parallel. for the four query plans in Figure 1(b), and a breakdown In summary, when R is sorted on the join key and the per operator is shown in Table 3, column 1(b). tuples of S are in random physical order, the hash join The response time and memory consumption of the query plan has 2.72× lower response time and 2.07× hash-join query plan are 18.8 seconds and 24.2 GB re- smaller memory footprint compared to all other sort- spectively. Comparing with the case where R is in ran- merge based queries. dom order, Figure 1(a), we see that performance is not affected by whether the build side R is sorted or not. R in hash table, on join key Also, the numberof memory operationsand the QPI link utilization are the same as when processing R with tu- We now move to the case where R is stored in a hash ples in random physical order. This happens because the table created on the join key, and the S tuples are in ran- hash function we use, described in Section 3.2, scatters dom order. We plot the results in Figure 1(c). We ob- the sorted R input among distant buckets. This results serve that the three sort-based queries have similar re- in memory accesses that are not amenable to hardware sponse time and memory consumption as when the R caching or prefetching, exactly as when processing R in tuples are in random order (cf. Figure 1(a)). Due to the random order. pre-allocated space in the hashtable buckets, no pointer Moving to the stream-merge join query plan chasing is needed to read the first bucket of each hash (STRSM), we find that it completes the join in 52.3 sec- chain, allowing for sequential reading across all NUMA onds and has a memory footprint of about 100 GB, with nodes. R is partitioned in buckets based on hash value, so nearly all of this space being used for repartitioning S. all sort-based plans first repartition R to create range par- There is no need to repartition and sort R, as it is already titions. These partitions subsequently need to be sorted, in join key order. Repartitioning and sorting S causes before being processed by each merge-join algorithm. 5.25 memory reads and 3.82 memory writes, on average, Overall, these steps result in the same data movement per S tuple. operations as when R is in random order. The MPSM merge join query plan has a response time The hash join plan can take advantage of the fact that of 51.2 seconds and needs 62.7 GB of memory.Sorting S R is already in a hash table on the join key, as one can takes the majority of the time, and then the merge phase now skip the build phase. This reduces the response time performs T scans of the pre-sorted, pre-partitioned side of the query to 13.9 seconds, which is 4.51× faster than R to produce the result. the fastest sort-based query. As there is no hash table to The parallel merge join query (PARSM) takes 63.4 allocate, populate and discard, the hash join algorithm seconds to produce the join output and needs 50.2 GB now becomes a streaming algorithm, needing a fixed of memory space, nearly all of which is needed to buffer amount of memory regardless of the input size. In our and sort S. The merge phase, however, takes 75% more prototype, the total space needed was 0.01 GB for hold- time than the MPSM merge phase primarily due to the ing output buffers, metadata and input state. 
4.2.2 S in ascending join key order GB of memory, 50 GB of which is used for buffering S. The PARSM plan is similar in all phases except the final We now consider the case when the S table is sorted in merge join phase, which it completes in 4.77 seconds, ascending join key order. In this case, we also assume bringing the total time to 19.5 seconds. that S is partitioned, as each thread i can discover the To summarize, when looking at response time, we find boundaries of the S partition by performing interpola- i that both the hash join and streaming merge join plans tion or binary search in the sorted table S. In the results perform comparably as they return results in 11-12 sec- that follow, we have discounted the cost of computing onds, and both need about 25 GB of memory; both out- the partition boundaries in this fashion. We plot the re- perform the other two plans. sponse time and the memory consumption of each query, for all three physical properties of R we explore. These R in ascending join key order results are shown in Figure 2. This is the case where both R and S are presorted on R in random order the join key. The results for this experiment are shown in Figure 2(b), and a breakdown per operator is shown in First we consider the case when the larger table S is Table 3, column 2(b). The hash join plan has a response sorted on the join key, but the smaller table R is not. time of 11.1 seconds, and uses 24.2 GB to store the hash We plot the results in Figure 2(a), and a breakdown per table for R. The hash join plan cannot take advantage of operator is shown in Table 3, column 2(a). the sorted R side and needs to construct a hash table on The hash join query computes the result in 11.1 sec- R. Join keys appear sequentially in S, and this translates onds and needs 24.2 GB of memory space. If we com- into less memory traffic when probing the hash table due pare with the case where S is in random order, in Fig- to caching. The streaming merge join plan (STRSM) is ure 1(a), we see a 1.70× improvement in response time, the fastest, as it only needs to scan the pre-sorted and if S is sorted on the join key. This happens because now pre-partitioned inputs in parallel to produce the join out- that S is sorted, the four S tuples that match a given R put. The response time of the STRSM plan is 1.98 sec- tuple occur in sequence. This allows the R tuple to be onds, and because of its streaming nature the memory read only once and then stay in the local cache, reducing size is fixed, regardless of the input size. We measured the total number of memory operations. Looking at the memory consumption to be 0.01 GB, primarily for out- performance counters, we find that 0.36 memory reads put buffer space. As before, the MPSM and the PARSM occur per S tuple, which is 5.31× less than the number plans are degenerate two-way stream merge join queries. of memory operations that happen when S is randomly Overall, when both R and S are sorted on the join key, ordered. Probing the hash table with a sorted S input also the preferred join plan is the streaming merge join which results in fewer cycles spent on TLB miss handling (0.6 needs minimal memory space and is 5.60× faster than seconds, or 10% of the probe time, a 2.22× improve- the hash join. ment), as well as lower QPI link utilization during the probe phase (14.6% utilization on average, a 2.35× im- R in hash table, on join key provement) when compared with processing a randomly ordered S table. 
Finally, we consider the case where R is stored in a The streaming sort-merge join plan (STRSM) has a hash table, on the join key, and S is sorted in ascending response time of 11.9 seconds and a memory footprint join key order. We plot the response time and memory of 25.2 GB, and sorting R takes the majority of the time. demand of each query plan in Figure 2(c), and a break- Repartitioning and sorting R is an expensive operation in down per operator is shown in Table 3, column 2(c). terms of memory traffic: it takes 5.02 memory reads and The hash join plan omits the build phase, as R is al- 3.31 memory writes per R tuple. ready in the hash table, and proceeds directly to the In this experiment, both the MPSM and the PARSM probe phase. The hash join plan completes the query in query plans represent degenerate cases where each 6.1 seconds, and needs only 0.01 GB of space for storing thread i pays the full overhead of tracking and merging metadata, the input state and the output buffers of each T partitions of S with its Ri partition. However, because thread. None of the sort-based plans can take advantage S is prepartitioned, T − 1 partitions are empty and the of the hash-partitioned R. As a consequence, the first merge is a two-input streaming merge, as described in step in all plans is to repartition R and produce range par- the previous paragaph. We include the data points here titions. The streaming merge join plan (STRSM) takes for completeness, and to demonstrate what the response 12.0 seconds, and needs 25 GB for the repartitioning, time and memory consumption would be in this setting. and the remaining two sort-based plans are degenerate The MPSM plan finishes in 16.2 seconds, which is 36% cases of the streaming merge join query. slower than the STRSM plan. The MPSM plan uses 62.7 To summarize, if R is already stored in a hash table, 1 2 3 4 5 6 response time of the hash join query plans actually im- Compare to 1(a) 1(b) 1(c) 2(a) 2(b) 2(c) proves with skew, akin to what has been observed for S in random order S in join key order the single-socket case [6]. Comparing with the uniform Query plan R rnd R ord R ht R rnd R ord R ht dataset, we find that response time improves with skew HASH 12.72 12.77 7.63 9.27 9.29 4.19 for all hash-based query plans. The improvement ranges STRSM 84.72 66.13 84.72 15.64 5.45 15.64 from 1.2× (columns 4–5) to 1.8× (column 3). Break- MPSM 75.84 51.39 75.84 19.70 11.05 19.70 ing this down further by operation, we see that the build PARSM 84.82 61.08 84.82 23.00 13.98 23.00 phase is nearly unchanged, and all the response time Table 4: Response time (in seconds) for the skewed dataset. gains come from the probe phase. This improvement is “R rnd” means that R is in random order, “R ord” means that R because of CPU caching, as now the most popular items is sorted in join key order, and “R ht” means that R is in a hash are accessed frequently enough to remain cached. This table on the join key. The query plan with the fastest response significantly reduces the number of memory operations time is highlighted in bold. needed to complete the hash join: For example, for the case where both R and S tuples are in random physi- and S is sorted on the join key, the preferred join strat- cal order (column 1), the number of memory reads is egy is the hash join, because it can take advantage of now 0.68 per S tuple, a 2.8× improvement from the 1.90 the physical property of R without additional operations. 
memory reads per S tuple for the uniform dataset (shown The hash join query in this case is nearly 2× faster in in Figure 1(a)). response time and only uses minimal memory. Turning our attention to the sort-merge based query plans, we find that they are all negatively impacted 4.3 Effect of data skew by data skew. The MPSM and parallel merge plans (PARSM) are more resilient, but the streaming merge Many real workloads exhibit data skew. As a conse- plan (STRSM) is affected to a larger degree. For in- quence, data processing systems are rarely able to dis- stance, when both R and S are in random physical order tribute data as uniformly as in the experiments of the (column 1), the response time of the STRSM query plan previous section. In this section, we show how response is 84.7seconds, which is 1.35× slower than the 62.7 sec- time and memory consumption is affected by the pres- onds with the uniform dataset (shown in Figure 1(a)). ence of data skew. The culprit here is the partitioning step that all sort- We generated a new skewed dataset where the foreign merge based query plans rely on. Data skew results in keys in S are generated based on the Zipf distribution skewed partition sizes, and as partitions are assigned to with parameter s = 1.05. In this skewed dataset, the top specific threads, this results in a skewed work distribu- 1000 keys from R appear in nearly half (48%) the tu- tion. Although this effect can be ameliorated by smart ples of S. All other parameters of the skewed dataset are partition boundary selection [2, 12] and work stealing, it the same as for the uniform dataset (described in Sec- cannot be eliminated: When a popular data item is in a tion 4.2). The query plans we experiment with are iden- single partition by itself, it cannot be further subdivided tical to the ones used in Section 4.2, and are described in into smaller partitions. In our 80-thread system, some Table 2. partition will be imbalanced if any key occurs more fre- 1 The memory consumption of all query plans was not quently than 80 = 1.25% of the time. As the number meaningfully influenced by data skew, because the two of threads in a system increases, the probability of hav- datasets are of equal size and the relational operators ing an imbalanced work distribution for a given skewed in our implementation dynamically allocate memory as dataset increases. needed. Although the aggregate memory demand does To summarize, the hash join is the preferred query not change with skew, the memory needs of individual plan when there is data skew in a primary key-foreign threads differ and depend on the size of the partitions key join for two reasons. First, frequent accesses to they are assigned to. the hottest items of the hash table effectively “pin” Data skew, however, has a significant effect on the them in the cache, significantly reducing memory traf- query response time of certain query plans. Table 4 fic. This improves the query response time as skew in- shows the response time for each query (in seconds). For creases. Second, the hash join algorithm results in a bal- each physical property of R of interest, we report the re- anced work distribution for all threads during the probe sponse time of all query plans in columns 1–3 when S phase, regardless of the skew in S and the number of is in random order, and in columns 4–6 when S is pre- threads. Partitioning-based algorithms, in comparison, sorted on the join key. 
To summarize, the hash join is the preferred query plan when there is data skew in a primary key-foreign key join, for two reasons. First, frequent accesses to the hottest items of the hash table effectively "pin" them in the cache, significantly reducing memory traffic. This improves the query response time as skew increases. Second, the hash join algorithm results in a balanced work distribution for all threads during the probe phase, regardless of the skew in S and the number of threads. Partitioning-based algorithms, in comparison, cannot guarantee a uniform work distribution, and their response time increases with higher skew when a few threads must handle a disproportionate amount of work.

4.4 Impact of tuple size

[Figure 3: Response time (in seconds) as dataset size remains fixed and tuple size varies between 8 and 128 bytes. S and R are in random order. When the tuple size is 8 bytes, all sort-based plans ("SM" suffix) use the SIMD-optimized bitonic sort-merge join algorithm by Kim et al. [22] for sorting. The horizontal axis shows the tuple size in bytes (8, 16, 32, 64, 128); the vertical axis shows the response time in seconds for the HASH, STRSM, MPSM, and PARSM plans.]

The datasets we have experimented with so far have tuples that are only sixteen bytes wide. While such small tuple sizes are common in data warehousing environments that store data in a column-oriented fashion, tuples may be substantially larger in other settings.

We generate five different datasets to explore how different tuple sizes affect the equi-join query response time. Across all five datasets, we fix the size of the dimension table R to be 10 GB and the fact table S to be 40 GB, and we only vary the tuple size. Tuples in both the dimension table R and the fact table S have N integer attributes, which are r1, r2, ..., rN, and s1, s2, ..., sN, respectively. The join key is always fixed to eight bytes, and we experiment with tuples as thin as eight bytes (N=1) and as wide as 128 bytes (N=16). Larger tuple sizes, such as 128 bytes, capture the case when tuples are stored in a row-oriented fashion, as is the case when data is retrieved from a transaction processing engine. A tuple size of eight bytes enables additional performance optimizations by cleverly using SIMD registers to form a bitonic merge network for sort-merge join [22]. All datasets are uniform, and every primary key in R always occurs in S exactly four times.

For the experiments in this section, we produce plans corresponding to the following SQL query:

    SELECT SUM(R.r1 + R.r2 + ··· + R.rN
             + S.s1 + S.s2 + ··· + S.sN)
    FROM R, S
    WHERE R.r1 = S.s1
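For concreteness, the sketch below shows one possible layout of the widest (N = 16, 128-byte) tuple and the per-match work that the SUM aggregate implies; WideTuple and match_contribution are hypothetical names and not part of our implementation. Because every attribute of both matching tuples is read, wider tuples amortize each hash table lookup or in-place tuple copy over more useful bytes.

    #include <cstdint>
    #include <numeric>

    // Hypothetical layout for the widest configuration: N = 16 eight-byte
    // integer attributes, i.e. a 128-byte tuple. attr[0] holds r1 (or s1),
    // which doubles as the join key.
    constexpr int N = 16;
    struct WideTuple {
        int64_t attr[N];
    };

    // Work implied by SUM(R.r1 + ... + R.rN + S.s1 + ... + S.sN) for one
    // join match: read and add every attribute of both tuples.
    int64_t match_contribution(const WideTuple& r, const WideTuple& s) {
        int64_t acc = 0;
        acc = std::accumulate(r.attr, r.attr + N, acc);
        acc = std::accumulate(s.attr, s.attr + N, acc);
        return acc;
    }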

We only present the case where both the R and S inputs are in random order, and we plot the results in Figure 3. Each tick on the horizontal axis corresponds to a dataset with a different tuple width. The vertical axis shows the response time, in seconds, of each join query. The query plans that are produced for this query consist of the same operators as the query plans discussed in Section 4.2 and described in Table 2. For the dataset where the tuple width is 8 bytes, all sort-based query plans use an implementation of the SIMD-optimized bitonic sort-merge join algorithm that is described by Kim et al. [22].

We find that the response time for all query plans drops as tuples get wider, but the hash-based query plan retains its advantage across all tuple sizes. The performance gain is primarily due to reduced memory activity that results from better spatial locality. Intuitively, as tuples get wider, the hash join query plan processes more bytes per hash table lookup, and the sort-based plans process more bytes per tuple copy when sorting in place. This spatial locality benefit is especially pronounced for the hash join plan: 16.0 billion memory operations take place when the dataset consists of eight-byte tuples, compared with only 2.9 billion memory operations when the same hash join query plan is executed on 128-byte wide tuples.

To summarize, regardless of the size of the input tuples, when R and S are in random join key order, the hash join plan has the fastest response time. The sort-based query plans cannot close the performance gap even when using a SIMD-optimized bitonic sort-merge sorting algorithm [22] that is designed for eight-byte tuples.

4.5 Summary of our findings

The hash join is the best join strategy when tuples in the fact table S are randomly ordered. In this case, the hash join outperforms all other algorithms both in terms of response time and memory consumption. For the uniform dataset, the hash-based join plans have from 2.72× to 4.57× lower response time compared to their fastest sort-based counterpart, while using at least 2.07× less memory. The response time margin widens even further, in favor of the hash join, when the larger table S has data skew, as the popular items get cached and memory traffic is reduced. The hash join remains the preferred join algorithm for tuples as wide as 128 bytes.

The sort-based algorithms only have competitive response times when the larger table S is presorted on the join key. This improves the performance of the hash join as well, albeit to a smaller degree. When both R and S are pre-sorted on the join key, the streaming merge join plan (STRSM) has the lowest response time and the smallest memory footprint.

5 Related work

There has been a lot of work on analyzing and comparing the performance of sort-based and hash-based join methods, especially in the context of traditional databases, e.g. [11, 18]. As the hardware landscape evolved, join algorithms became more sophisticated to better take advantage of caching in uniprocessors. Radix join [7, 24] is a cache-friendly join algorithm that performs multiple passes over the data until they can fit in the cache. Chen et al. [9] optimize hash joins by inspecting the data to choose the most efficient cache-friendly algorithm for the probe phase [8].

With the advent of multi-core CPUs, there was a renewed interest in parallel join algorithms. Kim et al. [22] compare a parallel sort-merge join with a parallel radix join. Their study relies on an analytical model and raises interesting issues about the impact of SIMD register lengths on the relative performance of the sort-merge join versus the hash join in the future. As the SIMD registers have doubled in size with the introduction of the AVX extension, revisiting the use of SIMD for join processing is a promising direction that complements our work. Blanas et al. [6] argue for a hash join algorithm that does no partitioning in a single-socket setting. Balkesen et al. [4] demonstrate that hardware-conscious implementations of the non-partitioned and radix-partitioned join algorithms can greatly improve performance. Albutiu et al. [2] introduce a NUMA-aware sort-merge algorithm that is designed for a modern multi-socket server, and they show that it outperforms the hash join that was proposed in [6].

There has also been work that looks at query execution more broadly. Graefe introduced the widely-used pull-based model for encapsulating parallelism in the Volcano system [17]. More recently, Harizopoulos et al. [20] explore how one can exploit sharing opportunities among multiple queries in the context of a disk-based system. Arumugam et al. [3] experiment with the DataPath engine, which is built around a push-based model. Giannikis et al. [15] created the SharedDB system and demonstrate that sharing opportunities can be exploited for performance in non-OLAP workloads that also do updates. Finally, Neumann [26] leverages the LLVM infrastructure to compile entire queries into a single highly optimized operator to improve the performance of a main-memory engine.

6 Conclusions and future work

In this paper we have shown that it is too early to write off the hash join algorithm for main memory equi-join processing. We have carefully characterized the impact of various input physical properties on the simple hash-based join and three flavors of sort-based equi-join algorithms, and shown that the hash-based join algorithm often answers queries faster than the current state-of-the-art sort-based algorithms. We also characterize the memory footprint that is required to run each join algorithm. When neither input is sorted, we find that the memory footprint of the hash join algorithm is smaller than that of the sort-based algorithms, as the hash join algorithm needs to buffer only one of the two join inputs. Overall, our results find that there is a place for both hash-based and sort-based algorithms in a high-performance main-memory data processing engine.

One promising direction for future work is expanding this study to more complex queries with multiple joins. Database management systems traditionally optimize multi-join queries by identifying query plans that maintain "interesting orders" during query execution. An interesting order, for example, is an operator sequence that keeps the input sorted or partitioned on a particular key. The optimizer then estimates the disk access cost of all query plans and picks the best plan for execution. Whether such an optimization process produces efficient query plans for memory-resident input is an open question. Other interesting directions for future work include considering non-equality joins, and further investigating methods to improve the basic join algorithms.

Acknowledgments

We thank the anonymous SoCC reviewers and our shepherd, Zack Ives, who helped us improve this paper with their thorough reviews and suggestions. We also thank Venkatraman Govindaraju for providing the SIMD-optimized implementation of the bitonic merge algorithm for the experiment in Section 4.4. This work was supported in part by a grant from the Microsoft Jim Gray Systems Lab, and in part by the National Science Foundation under grants IIS-0963993 and IIS-1250886.

References

[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In VLDB, pages 266–277, 1999.

[2] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10):1064–1075, 2012.

[3] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. L. Perez. The DataPath system: a data-centric analytic processing engine for large data warehouses. In SIGMOD, pages 519–530, 2010.

[4] C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In ICDE, 2013.

[5] R. Barber, P. Bendel, M. Czech, O. Draese, F. Ho, N. Hrle, S. Idreos, M.-S. Kim, O. Koeth, J.-G. Lee, T. T. Li, G. M. Lohman, K. Morfonios, R. Müller, K. Murthy, I. Pandis, L. Qiao, V. Raman, R. Sidle, K. Stolze, and S. Szabo. Business analytics in (a) blink. IEEE Data Eng. Bull., 35(1):9–14, 2012.

[6] S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In SIGMOD, pages 37–48, 2011.

[7] P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In VLDB, pages 54–65, 1999.

[8] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Inspector joins. In VLDB, pages 817–828, 2005.

[9] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. ACM Trans. Database Syst., 32(3):17, 2007.

[10] J. Cieslewicz and K. A. Ross. Data partitioning on chip multiprocessors. In DaMoN, pages 25–34, 2008.

[11] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood. Implementation techniques for main memory database systems. In SIGMOD, pages 1–8, 1984.

[12] D. J. DeWitt, J. F. Naughton, and D. A. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In PDIS, pages 280–291, 1991.

[13] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA database - Data management for modern business applications. SIGMOD Record, 40(4):45–51, 2011.

[14] G. Fowler, L. C. Noll, and P. Vo. FNV hash. http://www.isthe.com/chongo/tech/comp/fnv/.

[15] G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: Killing one thousand queries with one stone. PVLDB, 5(6):526–537, 2012.

[16] G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. J. ACM, 28:289–304, April 1981.

[17] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, pages 102–111, 1990.

[18] G. Graefe. Sort-merge-join: An idea whose time has(h) passed? In ICDE, pages 406–417, 1994.

[19] G. Graefe. Implementing sorting in database systems. ACM Comput. Surv., 38(3), 2006.

[20] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A simultaneously pipelined relational query engine. In SIGMOD, pages 383–394, 2005.

[21] Intel Xeon Processor 7500 Series Uncore Programming Guide, March 2010. Reference number: 323535-001.

[22] C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, A. D. Nguyen, A. D. Blas, V. W. Lee, N. Satish, and P. Dubey. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PVLDB, 2(2):1378–1389, 2009.

[23] D. E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching, chapter 6.4. Addison-Wesley, 1998.

[24] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing main-memory join on modern hardware. IEEE Trans. Knowl. Data Eng., 14(4):709–730, 2002.

[25] D. R. Musser. Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983–993, 1997.

[26] T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539–550, 2011.

[27] Oracle Exalytics In-Memory Machine: A Brief Introduction, October 2011.

[28] A. Pavlo, C. Curino, and S. B. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD, pages 61–72, 2012.

[29] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, 2010.