Fast and Effective Distribution-Key Recommendation for Amazon Redshift
Panos Parchas (Amazon Web Services, Berlin, Germany) [email protected]
Yonatan Naamad (Amazon Devices, Sunnyvale, CA, USA) [email protected]
Peter Van Bouwel (Amazon Web Services, Dublin, Ireland) [email protected]
Christos Faloutsos (Carnegie Mellon University & Amazon Web Services, Palo Alto, CA, USA) [email protected]
Michalis Petropoulos (Amazon Web Services, Palo Alto, CA, USA) [email protected]

ABSTRACT

How should we split data among the nodes of a distributed data warehouse in order to boost performance for a forecasted workload? In this paper, we study the effect of different data partitioning schemes on the overall network cost of pairwise joins. We describe a generally-applicable data distribution framework initially designed for Amazon Redshift, a fully-managed petabyte-scale data warehouse in the cloud. To formalize the problem, we first introduce the Join Multi-Graph, a concise graph-theoretic representation of the workload history of a cluster. We then formulate the "Distribution-Key Recommendation" problem, a novel combinatorial problem on the Join Multi-Graph, and relate it to problems studied in other subfields of computer science. Our theoretical analysis proves that "Distribution-Key Recommendation" is NP-complete and hard to approximate efficiently. Thus, we propose BaW, a hybrid approach that combines heuristic and exact algorithms to find a good data distribution scheme. Our extensive experimental evaluation on real and synthetic data showcases the efficacy of our method in recommending optimal (or close to optimal) distribution keys, which improve cluster performance by reducing network cost by up to 32x in some real workloads.

PVLDB Reference Format:
Panos Parchas, Yonatan Naamad, Peter Van Bouwel, Christos Faloutsos, Michalis Petropoulos. Fast and Effective Distribution-Key Recommendation for Amazon Redshift. PVLDB, 13(11): 2411-2423, 2020. DOI: https://doi.org/10.14778/3407790.3407834

1. INTRODUCTION

Given a database query workload with joins on several relational tables, which are partitioned over a number of machines, how should we distribute the data to reduce network cost and ensure fast query execution?

Amazon Redshift [12, 2, 3] is a massively parallel processing (MPP) database system, meaning that both the storage and the processing of a cluster are distributed among several machines (compute nodes). In such systems, data is typically distributed row-wise, so that each row appears (in its entirety) at one compute node, but distinct rows from the same table may reside on different machines. Like most commercial data warehouse systems, Redshift supports 3 distinct ways of distributing the rows of each table:
• "Even" (= uniform/random), which distributes the rows among compute nodes in a round-robin fashion;
• "All" (= replicated), which makes full copies of a database table on each of the machines;
• "Dist-Key" (= fully distributed = hashed), which hashes the rows of a table on the values of a specific attribute known as the distribution key (DK).

In this work, we focus on the latter approach. Our primary observation is that, if both tables participating in a join are distributed on the joining attributes, then that join enjoys great performance benefits due to minimized network communication cost. In this case, we say that the two tables are collocated with respect to the given join. The goal of this paper is to minimize the network cost of a given query workload by carefully deciding which (if any) attribute should be used to hash-distribute each database table.

Figure 1 illustrates collocation through an example. Consider the schema of Figure 1(a) and the following query Q1:

    -- query Q1
    SELECT *
    FROM Customer JOIN Branch
    ON c_State = b_State;

With the records distributed randomly ("Even" distribution style) as in Figure 1(b), evaluating Q1 incurs very high network communication cost. In particular, just joining the "CA" rows of 'Customer' from Node 1 with the corresponding "CA" rows of 'Branch' from Node 2 requires that at least one of the nodes transmits its contents to the other, or that both nodes transmit their rows to a third-party node. The same requirement holds for all states and for all pairs of nodes, so that, ultimately, a large fraction of the database must be communicated in order to compute this single join.

This is in stark contrast with Figure 1(c), in which all "CA" rows are on Node 1, all "NE" rows are on Node 2, and so on. This is the best-case scenario for this join, in that no inter-server communication is needed; all relevant pairs of rows are already collocated and the join can be performed locally at each compute node. This ideal setup happens whenever the distribution method "Dist-Key" is chosen for both tables, and the distribution key attributes are chosen to be Customer.c_State and Branch.b_State respectively, hashing all records with matching state entries to the same compute node. This yields significant savings in communication cost, and in overall execution time.
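To make the collocation effect concrete, the following minimal Python sketch (our own illustration, not Redshift code) simulates the two placement policies on toy versions of the tables above; the 4-node cluster, the row values, and the cost model (a row is "shipped" whenever it does not already sit on the node that a hash join on State would assign it to) are all assumptions made for this example.

```python
import hashlib

NODES = 4  # hypothetical cluster size

def hash_node(value: str) -> int:
    """Map a distribution-key value to a compute node ("Dist-Key" style)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % NODES

def even_node(row_id: int) -> int:
    """Round-robin placement of rows ("Even" style)."""
    return row_id % NODES

def shipped_rows(states, place):
    """Count rows that must cross the network so that equal join keys
    end up on the node owning their hash bucket."""
    return sum(1 for i, s in enumerate(states) if place(i, s) != hash_node(s))

customer = ["CA", "OR", "WY", "NY", "CA", "NE", "WA", "NY"]
branch = ["CA", "NE", "WA", "NY"]

even = lambda i, s: even_node(i)   # placement ignores the join key
dkey = lambda i, s: hash_node(s)   # placement hashes the join key

print("Even:    ", shipped_rows(customer, even) + shipped_rows(branch, even))
print("Dist-Key:", shipped_rows(customer, dkey) + shipped_rows(branch, dkey))  # always 0
```

Under the "Dist-Key" policy the shipped-row count is zero by construction, which is exactly the collocation property the paper exploits.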
[Figure 1: BaW wins. (a) Schema. (b) "Even" distribution. (c) "Dist-Key" distribution on the State attribute. (d) Query time (sec). Data distribution is important: visually, "Dist-Key" needs less communication for the join on State when tables are spread over m nodes; our method BaW makes almost all TPC-DS queries faster, up to 18x (execution times of "Even" vs. proposed BaW, in seconds).]

1.1 Informal problem definition

The insight that collocation can lead to great savings in communication cost raises a critical question: how can we achieve optimal data distribution in the face of multiple tables, each participating in multiple joins, which may or may not impose conflicting requirements? We refer to this as the "Distribution-Key Recommendation" problem. Our solution, Best of All Worlds (BaW), picks a distribution key for a subset of the tables so as to maximally collocate the most impactful joins.

Figure 1(d) illustrates the query performance of the TPC-DS [19] queries (filled circles in the scatter-plot) loaded on a Redshift cluster. The x-axis (resp. y-axis) corresponds to the query execution time in seconds for the BaW (resp. "Even") distribution. BaW consistently performs better, and rarely ties: most points/queries are above the diagonal, with a few on or below the diagonal [1]. For some queries, like q, the savings in performance reach 18x (axes on log scale).

[1] The lone datapoint significantly below the diagonal is further discussed in Section 6.2.

Despite the importance of the problem, there have been few related papers from the database community. The majority of related methods either depend on a large number of assumptions that are not generally satisfied by most commercial data warehouse systems, including Amazon Redshift, or provide heuristic solutions to similar problems. To the best of our knowledge, this work is the first to propose an efficient, assumption-free, purely combinatorial, provably optimal data distribution framework based on the query workload. Furthermore, our method, BaW, is applicable to any column-store, distributed data warehouse. It is standalone and does not require any change in the system's architecture.

The contributions of the paper are:
• Problem formulation: We formulate the "Distribution-Key Recommendation" (DKR) problem as a novel graph-theoretic problem and show its connections to graph matching and other combinatorial problems. Also, we introduce the Join Multi-Graph, a concise, edge-weighted graph-theoretic representation of the join characteristics of a cluster.
• Theoretical analysis: We analyze the complexity of DKR, and we show it is NP-complete.
• Fast algorithm: We propose BaW, an efficient meta-algorithm to solve DKR; BaW is extensible, and can accommodate any and every optimization sub-method, past or future.
• Validation on real data: We experimentally demonstrate that the distribution keys recommended by our method improve the performance of real Redshift clusters by up to 32x in some queries.

The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 introduces the Join Multi-Graph and mathematically formulates the "Distribution-Key Recommendation" problem (DKR). Section 4 presents a theoretical analysis of DKR and proves its complexity. Section 5 proposes BaW, an efficient graph algorithm to solve DKR. Section 6 contains an extensive experimental evaluation of our method on real and synthetic datasets on Amazon Redshift. Lastly, Section 7 concludes the paper.

2. BACKGROUND AND RELATED WORK

This section starts with an overview of previous papers on workload-based data partitioning. Then, it describes related graph problems motivating our BaW algorithm.

2.1 Data partitioning

Workload-based data partitioning has been well studied from various perspectives, including generation of indices and materialized views [1], partitioning for OLTP workloads [5, 21, 23], and others. However, our main focus here is data partitioning in OLAP systems such as Amazon Redshift. There have been two main directions of research, depending on the interaction with the query optimizer: 1) optimizer-modifying techniques, which require alteration of the optimizer, and 2) optimizer-independent techniques, which work orthogonally to the optimizer.

2.1.1 Optimizer modifying

The first category heavily alters the optimizer, exploiting its cost model and its internal data structures to suggest a good partitioning strategy. Nehme and Bruno [20] extend the notion of MEMO [2] of traditional optimizers to that of workload MEMO, which can be described as the union of the individual MEMOs of all queries, given the workload and the optimizer's statistics. The cost of each possible partitioning of the data is approximated, in a way similar to the cost estimation of potential query execution plans in traditional optimizers. This technique ensures optimal data partitioning based on the given workload. Unfortunately, it does not have general applicability, as it requires severe alteration of the query optimizer. In addition, the generation of the workload MEMO is a rather expensive process, as it requires several minutes for a workload of only 30 queries. In order to give accurate recommendations, our method utilizes a pool of tens of thousands of queries in just a few seconds.

[2] A search data structure that stores the alternative execution plans and their expected cost.

Other methods utilize the What-if Engine, which is already built into some query optimizers [22, 11]. The main idea is to create a list of candidate partitions C for each table and evaluate the entire workload on various combinations of C using statistics. Due to the exponential number of such combinations, [22] utilizes Genetic Algorithms to balance the elements of directed and stochastic search. Unfortunately, although faster than deep integration, shallow integration has several disadvantages: first, the search space of all feasible partitioning configurations is likely to become extremely large, due to the combinatorial explosion of combinations. Second, although smart techniques limit the search space, the approach is still very expensive, because each query in the workload needs to be evaluated in several candidate what-if modes. Lastly, many systems (including Amazon Redshift) do not support a what-if mode in the optimizer, which limits the applicability of the approach in large commercial systems.

2.1.2 Optimizer independent

The optimizer-independent methods are closer to our work [30, 10, 26, 24, 7, 28].

Zilio et al. [30] also use a weighted graph to represent the query workload, similarly to our approach. However, their graph weights correspond to the frequency of operations and do not capture the importance of different joins, as in our case. This could mistakenly favor the collocation of cheaper joins that appear more often in the workload over expensive joins that are less common. [30] proposes two heuristic algorithms: IR, a greedy heuristic approach (similar to the NG baseline of our experiments) that picks the best distribution key for each table in isolation, and Comb, an exhaustive search algorithm that is guided by heuristic pruning and depends on optimizer estimates. Similarly, the methods of [10, 26] follow a greedy heuristic approach without any optimality guarantee [3]. However, as Theorem 2 in Section 4 shows, no polynomial-time heuristic can give a worst-case approximation guarantee within any constant factor (say 1%) of the optimal solution, unless P = NP. Unlike previous approaches, our work describes an optimal solution (see Equation 2 in Section 5.1), provides an in-depth complexity analysis (NP-hard, see Theorem 2 in Section 4) and, in our experiments, the proposed method BaW always reaches the optimal within minutes (see Section 6.2).

[3] The algorithm of [26] is similar to one of our heuristic methods, namely RC (see Section 5.2).

Stöhr et al. [24] introduce an algorithm for data allocation in a shared-disk system assuming a star schema architecture. They propose hierarchical fragmentation of the fact table and a bitmap index for all combinations of joins between the fact table and the dimensions. Then, they perform a simple round-robin distribution of the fact table and the corresponding bitmap indices among the compute nodes. Similarly, [7, 28] propose techniques that require the construction and maintenance of indices, and allow some level of replication. Unfortunately, these approaches impose severe architectural restrictions and are not generally applicable to large MPP systems that do not support indices, such as Amazon Redshift. In addition, they entail extra storage overhead for storing and maintaining the indices. As we show in our experimental evaluation, real workloads follow a much more involved architecture that cannot always be captured by star/snowflake schemata.

Table 1 summarizes the optimizer-independent methods, based on the following critical dimensions:
1. provably optimal: the proposed algorithm provably converges to the optimal solution,
2. complexity analysis: theoretical proof of the hardness of the problem,
3. deployed: the approach is deployed at large scale,
4. cost-aware: the problem definition includes the cost of joins (and not just their frequency),
5. extensible algorithm: the solution is an extensible meta-algorithm that can incorporate any heuristic,
6. schema independent: no architectural assumptions are made on the input schema (e.g., star/snowflake),
7. no indices: no auxiliary indices need to be maintained,
8. no replication: space-optimal, no data replication.

[Table 1: Comparison to other methods. Rows: the eight dimensions above; columns: BaW, [30], [10], [26], [24], [7, 28]. A green check (resp. red cross) indicates that a property is supported (resp. not supported) by the corresponding related paper, whereas a '?' shows that the paper does not provide enough information.] As Table 1 indicates, none of the above methods matches all properties of our approach.

2.2 Related graph problems

The following classes of graph-theoretic problems are used in our theoretical analysis and/or algorithmic design.

2.2.1 Maximum Matching

Let G = (V, E, w) be a weighted undirected graph. A subgraph H = (V, E_H, w) of G is a matching of G if the degree of each vertex u ∈ V in H is at most one. The Maximum Matching of G is a matching with the maximum sum of weights. In a simple weighted graph, the maximum weight matching can be found in time O(|V||E| + |V|^2 log |V|) [9, 4]. The naive greedy algorithm achieves a 1/2-approximation of the optimal matching in time O(|V| log |V|), similar to a linear-time algorithm of Hougardy [14].

2.2.2 Maximum Happy Edge Coloring

In the Maximum Happy Edge Coloring (MHE) problem, we are given an edge-weighted graph G = (V, E, w), a color set C = {1, 2, ..., k}, and a partial vertex coloring function phi : V' -> C for some V' strictly contained in V. The goal is to find a (total) color assignment phi' : V -> C extending phi and maximizing the sum of the weights of mono-colored edges (i.e., those edges (u, v) for which phi'(u) = phi'(v)). This problem is NP-hard, and the reductions in [6], [17], and [29] can be combined to derive a 951/952 hardness of approximation. The problem has recently been shown to admit a 0.8535-approximation algorithm [29].

2.2.3 Max-Rep / Label Cover

The Max-Rep problem, known to be equivalent to LabelCover_max, is defined as follows [16]: let G = (V, E) be a bipartite graph with partitions A and B, each of which is further partitioned into k disjoint subsets A_1, A_2, ..., A_k and B_1, B_2, ..., B_k. The objective is to select exactly one vertex from each A_i and B_j, for i, j = 1, ..., k, so as to maximize the number of edges incident to the selected vertices. Unless NP is contained in Quasi-P, this problem is hard to approximate within a factor of 2^(log^(1-eps) n) for every eps > 0, even in instances where the optimal solution is promised to induce an edge between every (A_i, B_j) pair containing an edge in G.

3. PROBLEM DEFINITION

Our data partitioning system collocates joins to maximally decrease the total network cost. Thus, we restrict our attention solely to join queries [4].

[4] Aggregations could also benefit from collocation in some cases, but the benefit is smaller, so they are not our focus here. Nonetheless, our model can easily be extended to incorporate them.

Definition 1. Let Q = {q_1, q_2, ..., q_n} be the query workload of a cluster. For our purposes, each query is viewed as a set of pairwise joins, i.e., q_i = {j_i1, j_i2, ..., j_ik, ..., j_im}. Each join j_k is defined by the pair of tables it joins (t_1k, t_2k), the corresponding join attributes (a_1k, a_2k) and the total cost of the join in terms of processed bytes (w_k), i.e., j_k = <t_1k, t_2k, a_1k, a_2k, w_k>.

For instance, in the example provided in Figure 1, if the query Q1 appeared 10 times in Q, each yielding a cost w, then the query is represented as Q1 = <Customer, Branch, c_State, b_State, 10w>.

Definition 2. A join j_k = <t_1k, t_2k, a_1k, a_2k, w_k> is collocated if tables t_1k and t_2k are distributed on attributes a_1k and a_2k respectively.

Note that if a join j_k is collocated, then at query time we benefit from not having to redistribute w_k bytes of data through the network. The problem we solve in this paper is: given the workload Q, identify which distribution key to choose for each database table in order to achieve the highest benefit from collocated joins. Section 3.1 first introduces the Join Multi-Graph, a concise representation of the typical workload of a cluster. Section 3.2 formally defines the "Distribution-Key Recommendation" problem on the Join Multi-Graph.

3.1 Proposed structure: Join Multi-Graph

Given a cluster's query workload Q, consider a weighted, undirected multigraph G_Q = (V, E, w), where V corresponds to the set of tables in the cluster and E contains an edge for each pair of tables that has been joined at least once in Q. Also, let the attribute set A_u of a vertex u ∈ V correspond to the set of u's columns that have been used as join attributes at least once in Q (and, thus, are good candidates for DKs). Each edge e = (u.x, v.y) with u, v ∈ V, x ∈ A_u and y ∈ A_v encodes the join 'u JOIN v ON u.x = v.y' [5]. The weight w(e) : E -> N+ represents the cumulative number of bytes processed by that join in Q [6] and quantifies the benefit we would have from collocating the corresponding join [7]. If a join occurs more than once in Q, its weight in G_Q corresponds to the sum of all bytes processed by the various joins [8]. Since two tables may be joined on more than one pair of attributes, the resulting graph may also contain parallel edges between two vertices, making it a multigraph.

[5] We only consider equality joins, because the other types of joins cannot benefit from collocation.
[6] In cases of multi-way joins, the weight function evaluates the size of the intermediate tables.
[7] Appendix A.1 contains more details on how we chose this function for our experiments.
[8] Due to different filter conditions and updates of the database tables, the weight of the same join may differ among several queries.

Figure 2(a) illustrates our running example of a Join Multi-Graph G_Q = (V, E, w). The tables that have participated in at least one join in Q correspond to the vertex set V. An edge represents a join. The join attributes are denoted as edge labels near the corresponding vertex. The cumulative weight of a join is illustrated through the thickness of the edge; a bold (resp. normal) edge corresponds to weight value 2 (resp. 1). For instance, table B was joined with table D on B.b = D.d and on B.b1 = D.d1, with weights 1 and 2 respectively. Finally, the attribute set (i.e., the set of join attributes) of table B is A_B = {b, b1}.

G_Q is a concise way to represent the join history of a cluster. Independently of the underlying schema, the Join Multi-Graph contains all valuable information of the join history. It is a very robust and succinct representation that adapts to changes in workload or data: if we have computed G_Q for a query workload Q, we can incrementally construct G_Q' for workload Q' = Q ∪ {q} for any new query q, by increasing the weight of an existing edge (if the join has already occurred in the past) or by adding a new edge (if q contains a new join). Similarly, if the database tables have increased or decreased considerably in size, this is mirrored in the weight of the join.

The number of vertices in the graph is limited by the number of tables in the database, which does not usually exceed several thousands. The number of edges is limited by the number of distinct joins in the graph.
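The incremental construction described above is straightforward to express in code. The following sketch (our own; the tuple format follows Definition 1, but the function and variable names are assumptions) accumulates join records into edge weights and attribute sets:

```python
from collections import defaultdict

def build_join_multigraph(workload):
    """Incrementally build the Join Multi-Graph from join records.

    `workload` yields tuples (t1, t2, a1, a2, bytes_processed), one per
    executed pairwise join, as in Definition 1. Parallel edges (same table
    pair, different attribute pair) stay distinct via their attribute key.
    """
    weights = defaultdict(int)   # edge ((u, x), (v, y)) -> cumulative w(e)
    attrs = defaultdict(set)     # table -> attribute set A_u
    for t1, t2, a1, a2, cost in workload:
        # Normalize endpoint order so (u.x, v.y) and (v.y, u.x) coincide.
        edge = tuple(sorted([(t1, a1), (t2, a2)]))
        weights[edge] += cost    # a repeated join only bumps its weight
        attrs[t1].add(a1)
        attrs[t2].add(a2)
    return weights, attrs

# Toy workload: query Q1 of the introduction, seen 10 times with cost w = 5.
Q = [("Customer", "Branch", "c_State", "b_State", 5)] * 10
w, A = build_join_multigraph(Q)
print(w)  # {(('Branch', 'b_State'), ('Customer', 'c_State')): 50}
print(A)  # {'Customer': {'c_State'}, 'Branch': {'b_State'}}
```

This incremental form is what makes the structure adapt to workload changes: extending G_Q to G_Q' for a new query only touches the affected edges.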
In practice, not all join attributes are good candidates for distribution keys: columns with very low cardinality would result in high skew of the data distribution, meaning that a few compute nodes would be burdened with a large portion of the data, whereas other nodes would be practically empty. Handling of skew is outside the scope of our paper: if the data have skew, this is reflected in the cost of the join, that is, the weight w(e), which is an input to our algorithm. In Appendix A.2 we provide more insight on our handling of skew for our experiments. In what follows, we assume that the input graph is already aware of skew-prone edges.

Figure 2(d) illustrates a real example of a Join Multi-Graph, in which for readability we have excluded all vertices with degree equal to one. Again, the weight of an edge is proportional to its thickness. It is evident from the figure that reality is often more complex than a star/snowflake schema, as many fact tables join each other on a plethora of attributes. Section 6 provides more insights about the typical characteristics of real Join Multi-Graphs.

[Figure 2: Placement makes a difference. (a) Example of a Join Multi-Graph. (b) The 'red' recommendation saves 6 units of work (three heavy, bold edges have their tables collocated, e.g., B.b1-D.d1). (c) The 'green' recommendation saves only 3 units of work. (d) Real Join Multi-Graphs can be quite complex.]

In the next section we utilize our Join Multi-Graph data structure to formulate the "Distribution-Key Recommendation" problem.

3.2 Optimization problem: Distribution-Key Recommendation

Suppose we are given a query workload Q and the corresponding Join Multi-Graph G_Q = (V, E, w). Let r_u be an element of u's attribute set, i.e., r_u ∈ A_u. We refer to the pair u.r_u as the vertex recommendation of u, and we define the total recommendation R as the collection of all vertex recommendations, i.e., R = {u.r_u | u ∈ V}. For instance, the recommendation R that results in the distribution of Figure 1(c) is R = {(Customer.c_State), (Branch.b_State)}. Note that each vertex can have at most one vertex recommendation in R. We define the weight W_R of a total recommendation R as the sum of the weights of all edges whose endpoints belong to R, i.e.,

    W_R = SUM over e = (u.x, v.y) with r_u = x and r_v = y of w(e)    (1)

Intuitively, the weight of a recommendation corresponds to the weight of the collocated joins that R would produce if all tables u ∈ R were distributed with DK(u) = r_u. Since we aim to collocate joins with maximum impact, our objective is to find the recommendation of maximum weight. Formally, the problem we are solving is as follows:

Problem 1 ("Distribution-Key Recommendation"). Given a Join Multi-Graph G_Q, find the recommendation R* with the maximum weight, i.e., R* = argmax_R W_R.

Continuing the running example, Figures 2(b) and (c) illustrate two different recommendations, namely R_red and R_green. A normal (resp. dashed) edge denotes a collocated (resp. non-collocated) join. R_red = {(B.b1), (C.c), (D.d1), (E.c), (F.c)} and R_green = {(A.a), (B.b), (C.c), (D.d), (E.e), (F.a)}, with corresponding weights W_Rred = w(C.c, F.c) + w(E.c, C.c) + w(B.b1, D.d1) = 6 and W_Rgreen = w(E.e, B.b) + w(B.b, D.d) + w(F.a, A.a) = 3. Obviously, R_red should be preferred, as it yields a larger weight, i.e., more collocated joins. Our problem definition aims at discovering the recommendation with the largest weight W_R out of all possible combinations of vertex recommendations.
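As a sanity check, Equation (1) transcribes directly into code. The dictionary below is our own partial rendering of Figure 2's graph, listing only the edges needed to reproduce W_Rred = 6:

```python
def recommendation_weight(weights, rec):
    """Weight W_R of a total recommendation (Equation 1).

    `weights` maps ((u, x), (v, y)) -> w(e); `rec` maps table -> chosen DK.
    An edge counts iff BOTH of its endpoints agree with the recommendation.
    """
    return sum(w for ((u, x), (v, y)), w in weights.items()
               if rec.get(u) == x and rec.get(v) == y)

edges = {(("B", "b1"), ("D", "d1")): 2,   # bold edges have weight 2
         (("C", "c"), ("F", "c")): 2,
         (("C", "c"), ("E", "c")): 2,
         (("B", "b"), ("D", "d")): 1}     # normal edge, weight 1
R_red = {"B": "b1", "C": "c", "D": "d1", "E": "c", "F": "c"}
print(recommendation_weight(edges, R_red))  # 6, matching W_Rred
```

Note that the edge (B.b, D.d) contributes nothing under R_red, since B's recommended key b1 disagrees with that edge's endpoint b.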
The Join Multi-Graph can be used to directly compare the collocation benefit of different recommendations. For instance, we can quickly evaluate a manual choice (e.g., a choice made by a customer) and compare it to the result of our methods. If we deem that there is a sufficient difference between the two, we can expose the redistribution scheme to the customers. Otherwise, if the difference is small, we should avoid the redistribution, which (depending on the size of the table) can be an expensive operation. When external factors make the potential future benefit of redistribution unclear (for instance, due to a possible change of architecture), the decision of whether or not to redistribute can be cast as an instance of the Ski Rental Problem [15, 27] (or related problems such as the Parking Permit Problem [18]), and therefore analyzed in the lens of competitive analysis, which is out of scope for this paper. Alternatively, one may resort to simple heuristics, e.g., by reorganizing data only if the net benefit is above some threshold value relative to the table size.

Oftentimes, the join attributes of a join either share the same label or (more generally) can be relabelled to a common label (without duplicating column names within any of the affected tables). For instance, in Figure 1, if the only joins are (b_State, c_State) and (b_ID, c_ID), we can relabel those columns to just 'State' and 'ID', respectively, without creating collisions in either table. The extreme case in which all joins in the query workload have this property commonly arises in discussions of related join-optimization problems (e.g., [10, 25, 26]) and is combinatorially interesting in its own right. We call this consistently-labelled subproblem DKRCL.

Problem 2 (DKRCL). Given a Join Multi-Graph G_Q in which all pairs of join attributes have the same label, find the recommendation R* with the maximum weight, i.e., R* = argmax_R W_R.

4. DKR - THEORY

In this section, we show that both variants of DKR are NP-complete, and are (to varying degrees) hard to approximate.

Theorem 1 (Hardness of DKRCL). DKRCL generalizes the Maximum Happy Edge Coloring (MHE) problem, and is therefore both NP-complete and inapproximable to within a factor of 951/952.

Proof. Refer to Section 2.2.2 for the definition of MHE. Given an MHE instance <G = (V, E), C, phi>, we can readily convert it into the framework of DKRCL as follows. For each vertex v ∈ V we create one table t_v. If v is in the domain of phi, then the attribute set of t_v is the singleton set {phi(v)}; otherwise, it is the entire color set C. For each edge (u, v) of weight w, we add a join between t_u and t_v of that same weight for each color c in the attribute sets of both t_u and t_v (thus creating either 0, 1, or |C| parallel, equal-weight joins between the two tables). This completes the reduction.

An assignment of distribution keys to table t_v corresponds exactly to a choice of a color phi'(v) in the original problem. Thus, solutions to the two problems can be put into a natural one-to-one correspondence, with the mapping preserving all objective values. Therefore, all hardness results for Maximum Happy Edge Coloring directly carry over to DKRCL. []

Intuitively, the above proof shows that MHE is a very particular special case of DKR. In addition to the assumptions defining DKRCL, the above correspondence further limits the diversity of attribute sets (either to singletons or to all of C), and requires that any two tables with a join between them have an equal-weight join for every possible attribute (in both attribute sets). As we now show, the general problem without these restrictions is significantly harder.

Theorem 2 (Max-Rep hardness of DKR). There exists a polynomial-time approximation-preserving reduction from Max-Rep to DKR. In particular, DKR is both NP-complete and inapproximable to within a factor of 2^(log^(1-eps) n) unless NP is contained in DTIME(n^(poly log n)), where n is the total number of attributes.

Proof. Refer to Section 2.2.3 for the definition of Max-Rep. Given a Max-Rep instance with graph G = (V, E), whose vertex set V is split into two collections of pairwise-disjoint subsets A = {A_1, A_2, ...} and B = {B_1, B_2, ...}, we construct a DKR instance as follows. Let C = A ∪ B; our join multi-graph contains exactly one vertex u_Ci for each set C_i ∈ C. Further, for each node u_Ci, we have one attribute u_Ci.x for each x ∈ C_i. For each edge (a, b) ∈ E, we identify the sets C_i containing a and C_j containing b and add a weight-1 edge between u_Ci.a and u_Cj.b.

There is a one-to-one correspondence between feasible solutions to the initial Max-Rep instance and those of the constructed DKR instance. For any solution to the DKR instance, the selected hashing columns exactly correspond to picking vertices from the Max-Rep instance. Because each table is hashed on one column, the corresponding Max-Rep solution is indeed feasible. Similarly, the vertices chosen in any Max-Rep solution must correspond to a feasible assignment of distribution keys to each table. Because the objective function value is preserved by this correspondence, the theorem follows. []

Next, we show that the approximability of DKR and Max-Rep differs by at most a logarithmic factor.

Theorem 3. An f(n)-approximation algorithm for Max-Rep can be transformed into an O(f(n)(1 + log min(n, w_r)))-approximation algorithm for DKR, where w_r = w_max/w_min is the ratio of the largest to smallest nonzero join weights and n is the total number of attributes.

Proof. We reverse the above reduction. Suppose we are given an instance of DKR with optimal objective value OPT_DKR. Without loss of generality, we can assume that all joins in the instance have weight at least w_max/n^3; otherwise, we can discard all joins of smaller weight with only a sub-constant multiplicative loss in the objective score. Since log(w_max/w_min) = O(log n) in this refined instance, log(min(n, w_r)) = O(log w_r), and therefore it suffices to find an algorithm with approximation guarantee O(f(n)(1 + log w_r)). We next divide all join weights by w_min, and round down each join weight to the nearest (smaller) power of 2. The first operation scales all objective values by a factor of 1/w_min (but otherwise keeps the structure of solutions the same) and the second operation changes the optimal objective value by a factor of at most 2. After this rounding, our instance has k <= 1 + log2(w_r) different weights on edges. For each such weight w, consider the Max-Rep instance induced only on those joins of weight exactly w. Note that the sum of the optimal objective values for each of these k Max-Rep instances is at least equal to the objective value of the entire rounded instance, and is thus within a factor 1/(2 w_min) of OPT_DKR. Thus, one of these k Max-Rep instances must have a solution of value at least OPT_DKR/(2k w_min), and therefore our Max-Rep approximation algorithm can find a solution with objective value at least OPT_DKR/(2k w_min f(n)). We output exactly the joins corresponding to this solution. In the original DKR instance, the weights of the joins in this output were each a factor of at least w_min greater than what we scaled them to, so the DKR objective value of our output is at least OPT_DKR/(2k f(n)). The conclusion follows from plugging in k = O(1 + log w_r). []

As these theorems effectively rule out the possibility of exact polynomial-time algorithms for these problems, we study the efficacy of various efficient heuristic approaches, as well as the empirical running time of an exact super-polynomial-time approach based on integer linear programming.

5. PROPOSED METHOD - ALGORITHMS

Based on the complexity results of the previous section, we propose two distinct types of methods to solve Problem 1. The first is an exact method based on integer linear programming (Section 5.1). The second is a collection of heuristic variants, which exploit similarities of DKR to graph matching (Section 5.2). Finally, Section 5.3 presents our proposed meta-algorithm, namely Best of All Worlds (BaW), a combination of the exact and heuristic approaches.
5.1 ILP: an exact algorithm

One approach to solving this problem involves integer programming. We construct the program as follows. We have one variable y_a for each attribute, and one variable x_ab for each pair (a, b) of attributes on which there exists a join. Intuitively, we set y_a = 1 if a is chosen as the distribution key for its corresponding table, and x_ab = 1 if we select both a and b as distribution keys, and thus successfully manage to collocate their tables with respect to the join on a and b. To this end, our constraints ensure that each table selects exactly one attribute, and that the value of x_ab is bounded from above by both y_a and y_b (in an optimal solution, this means x_ab is 1 exactly when both y_a and y_b are 1). Finally, we wish to maximize the sum over all (a, b) ∈ E of w_ab * x_ab of the captured join costs. The full integer program is provided below.

    maximize    SUM over (a,b) ∈ E of w_ab * x_ab
    subject to  SUM over a ∈ v of y_a = 1    for all v ∈ V,
                x_ab <= y_a                  for all (a, b) ∈ E,          (2)
                x_ab <= y_b                  for all (a, b) ∈ E,
                x_ab, y_a ∈ {0, 1}           for all a, b

Unfortunately, there is no fast algorithm for solving integer linear programs unless P = NP. While the natural linear programming relaxation of this problem can be solved efficiently, it is not guaranteed to give integer values (or even half-integer values) to its variables, as implied by the inapproximability results of Section 4. Further, fractional solutions have no good physical interpretation in the context of DKR: collocating only half of the rows of a join does not yield half of the network cost savings of a full collocation of the same join. Thus, while we can run an ILP solver in the hope that it quickly converges to an integer solution, we also present a number of fast heuristic approaches for instances where the solver cannot quickly identify the optimum.
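For concreteness, here is a compact sketch of program (2) in PuLP, the library the experiments of Section 6.1 use; the function name and the data encoding are our own, and BaW's time budget (killing the solve after t seconds) is left to the caller.

```python
import pulp

def solve_dkr_ilp(weights, attrs):
    """Exact DKR via the integer program (2).

    weights: ((u, x), (v, y)) -> w(e), one entry per join edge.
    attrs:   table -> set of candidate distribution-key attributes.
    """
    prob = pulp.LpProblem("DKR", pulp.LpMaximize)
    y = {(u, a): pulp.LpVariable(f"y_{u}_{a}", cat="Binary")
         for u, cols in attrs.items() for a in cols}
    x = {e: pulp.LpVariable(f"x_{i}", cat="Binary")
         for i, e in enumerate(weights)}
    # Objective: total weight of the collocated joins.
    prob += pulp.lpSum(w * x[e] for e, w in weights.items())
    # Each table picks exactly one distribution key.
    for u, cols in attrs.items():
        prob += pulp.lpSum(y[(u, a)] for a in cols) == 1
    # An edge is collocated only if both endpoint attributes are chosen.
    for e in weights:
        (u, a), (v, b) = e
        prob += x[e] <= y[(u, a)]
        prob += x[e] <= y[(v, b)]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {u: a for (u, a), var in y.items() if var.value() == 1}
```

BaW (Algorithm 2 below) simply aborts this call once its time budget t expires and falls back to the heuristics of the next section.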
5.2 Heuristic algorithms

Here we describe our heuristic methods, which are motivated by graph matching. We first discuss the relation of Problem 1 to matching and then provide the algorithmic framework.

5.2.1 Motivation

Recall from Section 2.2.1 that the Maximum Weight Matching problem assumes as input a weighted graph G = (V, E, w). The output is a set of edges E_H contained in E such that each vertex u ∈ V is incident to at most one edge in E_H. Assume the Join Multi-Graph G_Q = (V, E, w) of a query workload Q. We next investigate the relationship of Problem 1 to Maximum Weight Matching, starting from the special case where the degree of any vertex u ∈ V equals the cardinality of u's attribute set [9], i.e., for all u ∈ V, deg_GQ(u) = |A_u|. In other words, this assumption states that each join involving a table u is on different join attributes than all other joins that involve u. For instance, vertex F in Figure 2(a) has deg_GQ(F) = |A_F| = |{a, b, c}| = 3.

[9] We restrict our attention to the attributes that participate in joins only.

Lemma 1. Given a query workload Q and the corresponding Join Multi-Graph G_Q = (V, E, w), assume the special case where deg_GQ(u) = |A_u| for all u ∈ V. Then the solution to this special case of the "Distribution-Key Recommendation" problem on Q is given by solving a Maximum Weight Matching on G_Q.

Proof. A recommendation R is equivalent to the set of collocated edges. Since the degree of each vertex u ∈ V equals the cardinality of u's attribute set, all edges E_u = {e_1, e_2, ..., e_deg(u)} contained in E incident to u have distinct attributes of u, i.e., e_i = (u.i, x.y). Then, any recommendation R contains at most one edge from E_u for a vertex u. Thus, R is a matching on G_Q. Finding the recommendation with the maximum weight is equivalent to finding a Maximum Weight Matching on G_Q. []

In general, however, the cardinality of the attribute set of a vertex is not equal to its degree, but rather upper-bounded by it, i.e., |A_u| <= deg_GQ(u) for all u ∈ V. With this observation and Lemma 1, we prove the following theorem:

Theorem 4. The Maximum Weight Matching problem is a special case of the "Distribution-Key Recommendation" problem.

Proof. We show that any instance of Maximum Weight Matching can be reduced to an instance of "Distribution-Key Recommendation". Let a weighted graph G be an input to Maximum Weight Matching. We can always construct a workload Q and the corresponding Join Multi-Graph G_Q according to the assumptions of Lemma 1. In particular, for any edge e = (u, v, w) of G, we create an edge e' = (u'.i, v'.j, w), ensuring that attributes i and j will never be used subsequently. The assumptions of Lemma 1 are satisfied, and an algorithm that solves "Distribution-Key Recommendation" would solve Maximum Weight Matching. The opposite of course does not hold, as shown in Section 4. []

Motivated by the similarity to Maximum Weight Matching, we propose methods that first extract a matching and then refine it. Currently, the fastest algorithms to extract a matching have complexity O(sqrt(|V|) |E|). This is too expensive for particularly large instances of the problem, where a strictly linear solution is required. For this reason, in the following we aim for approximate matching methods.
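In the special case of Lemma 1, off-the-shelf matching code already solves DKR exactly. The sketch below uses networkx (our own choice of library, not one the paper mentions) and, for simplicity, assumes at most one join per table pair:

```python
import networkx as nx

def dkr_special_case(weights):
    """Solve the Lemma 1 special case of DKR via Maximum Weight Matching.

    weights: ((u, x), (v, y)) -> w(e), where every edge incident to a table
    uses a distinct attribute of that table (deg(u) == |A_u|).
    """
    G = nx.Graph()
    for ((u, a), (v, b)), w in weights.items():
        # Keep only the heavier edge if a table pair repeats; a faithful
        # multigraph treatment would need nx.MultiGraph plus edge keys.
        if not G.has_edge(u, v) or G[u][v]["weight"] < w:
            G.add_edge(u, v, weight=w, attrs={u: a, v: b})
    rec = {}
    for u, v in nx.max_weight_matching(G):
        rec[u] = G[u][v]["attrs"][u]
        rec[v] = G[u][v]["attrs"][v]
    return rec
```

The exact matching routine runs in super-linear time, which is precisely why the paper's heuristics below settle for approximate matchings.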
5.2.2 Match-N-Grow (MNG) algorithm

The main idea is to extract the recommendation in two phases. In Phase 1, MNG ignores the join attributes of the multigraph and extracts an approximate maximum weight matching in linear time, with quality guarantees that are at least 1/2 of the optimal [14]. In Phase 2, MNG greedily expands the recommendation of Phase 1. Algorithm 1 contains the pseudocode of MNG.

    Algorithm 1: MNG: heuristic approach based on maximum weight matching.
    input : a Join Multi-Graph G_Q = (V, E, w)
    output: a distribution key recommendation R
     1  R1 <- {}, R2 <- {}, Er <- E, i <- 1
     2  // PHASE 1: Maximal Matching
     3  while Er is not empty do
     4      pick an active vertex u ∈ V with deg_u >= 1
     5      while u has a neighbour do
     6          e <- (u.x, v.y) such that w(e) >= w(e') for all e' incident to u   // heaviest edge
     7          R_i <- R_i ∪ {(u, u.x), (v, v.y)}
     8          i <- 3 - i   // flip i : 1 <-> 2
     9          remove u and its incident edges from V
    10          u <- v
    11      end
    12  end
    13  // PHASE 2: Greedy Expansion
    14  let R be the recommendation with the max weight, i.e., R <- R1 if W(R1) >= W(R2), else R2
    15  for each u not in R do
    16      e <- (u.x, v.y) such that w(e) >= w(e') for all e' incident to u with (v, v.y) ∈ R
    17      R <- R ∪ {(u, u.x)}
    18  end
    19  return R

Phase 1 (lines 3-12) maintains two recommendations, namely R1 and R2, that are initially empty. At each iteration of the outer loop, it randomly picks a vertex u ∈ V with degree at least one. Then, from all edges incident to u, it picks the heaviest one (i.e., e = (u.x, v.y)) and assigns the corresponding endpoints (i.e., (u.x) and (v.y)) to R_i. The process is repeated for vertex v and recommendation R_(3-i), until no edge can be picked. Intuitively, Phase 1 extracts two heavy alternating paths [10] and assigns their corresponding edges to recommendations R1 and R2, respectively. At the end of Phase 1, let recommendation R be the one with the largest weight. Phase 2 (lines 15-18) extends the recommendation R by greedily adding more vertex recommendations; for each vertex u that does not belong to R, line 16 picks the heaviest edge (i.e., e = (u.x, v.y)) that is incident to u and has its other endpoint (i.e., (v.y)) in R. Then, it expands the recommendation by adding (u.x). Since all edges are considered at most once, the complexity of Algorithm 1 is O(|E|).

[10] Following the standard terminology of matching theory, an alternating path is a path whose edges belong to matchings R1 and R2 alternately.

Figure 3 illustrates recommendations R1 and R2 on the Join Multi-Graph G_Q of Figure 2(a) during four iterations of the main loop of Phase 1 (i.e., lines 5-10). Initially, both recommendations are empty (Figure 3(a)). Let C be the active vertex (marked with red). MNG picks the heaviest edge incident to C, i.e., (C.c, F.c), and adds (C.c), (F.c) to R1, while removing all edges incident to C from the graph (Figure 3(b)). The active vertex now becomes F. Since w(F.b, B.b) > w(F.a, A.a), R2 becomes {(F.b), (B.b)} and the edges incident to F are removed (Figure 3(c)). Figure 3(d) illustrates the recommendations and the Join Multi-Graph at the end of Phase 1. R1 is the heavier of the two, with W(R1) = 4 > W(R2) = 2, thus R = R1. In Phase 2, vertex E picks the heaviest of its incident edges in G_Q (i.e., (E.c, C.c)) and the final recommendation becomes R = {(C.c), (F.c), (B.b1), (D.d1), (E.c)}, illustrated in Figure 2(b) with normal (non-dashed) edges.

[Figure 3: Example of Phase 1 of MNG on the graph of Figure 2(a). The method maintains two recommendations, R1 and R2, which it improves alternately; the active vertex is marked red. Panel states: (a) R1 = {}, R2 = {}; (b) R1 = {C.c, F.c}, R2 = {}; (c) R1 = {C.c, F.c}, R2 = {F.b, B.b}; (d) R1 = {C.c, F.c, B.b1, D.d1}, R2 = {F.b, B.b}.]

Following the same two-phase approach, where Phase 1 performs a maximal matching and Phase 2 a greedy expansion, we design a number of heuristics, described below. The following variations are alternatives to Phase 1 (the matching phase) of Algorithm 1; they are all followed by the greedy expansion phase (Phase 2). All heuristics iteratively assign keys to the various tables and never change the assignment of a table. Thus, at any point in the execution of an algorithm, we refer to the set of remaining legal edges, which are those edges (u.x, v.y) whose set of two endpoint attributes {x, y} is a superset of the DKs chosen for each of its two endpoint tables {u, v} (and thus this join can still be captured by assigning the right keys to whichever of u and v is yet unassigned).

Greedy Matching (GM). Sort the tables in order of heaviest outgoing edge, with those with the heaviest edges coming first. For each table u in this ordering, sort the list of legal edges outgoing from u. For each edge e = (u.x, v.y) in this list, assign key x to u and y to v (if it is still legal to do so when e is considered).

Random Choice (RC). Randomly split the tables into two sets, A1 and A2, of equal size (plus or minus one table). Take the better of the following two solutions: (a) for each u ∈ A1, pick an attribute x at random; once these attributes have all been fixed, for each v ∈ A2, pick an attribute maximizing the sum of its adjacent remaining legal edges; (b) repeat the previous process with the roles of A1 and A2 reversed.

Random Neighbor (RN). For each attribute u.x of any table, run the following procedure and eventually return the best result: 1) assign x to u; 2) let T be the set of tables adjacent to an edge with exactly one endpoint fixed; 3) for each table v ∈ T, let e = (u.x, v.y) be the heaviest such edge, and assign key y to v. Repeat steps 2) and 3) until T is empty.

Lastly, as a baseline for our experiments, we consider the naive greedy approach (essentially Phase 2 of Algorithm 1 run from scratch).

Naive Greedy (NG). Start off with an empty assignment of keys to tables. Let Er be the set of all legal joins. While Er is non-empty, repeat the following three steps: (i) let e = (u.x, v.y) be the maximum-weight edge in Er, breaking ties arbitrarily; (ii) assign key x to u and key y to v; (iii) remove e from Er, as well as any other join in Er made illegal by this assignment. Note that such a process will never assign two distinct keys to the same table. For any table that is not assigned a key by the termination of the above loop, assign it a key arbitrarily.
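The following is a compact Python rendering of Algorithm 1 under simplifying assumptions of ours: plain dictionaries replace the linear-time bookkeeping, and the iteration order is deterministic rather than random, so it illustrates the two-phase structure rather than reproducing the paper's exact implementation.

```python
def mng(weights):
    """Sketch of Match-N-Grow: two alternating heavy paths, then greedy
    expansion. `weights` maps ((u, x), (v, y)) -> w(e)."""
    # Adjacency: table -> list of (weight, own_attr, other_table, other_attr).
    adj = {}
    for ((u, a), (v, b)), w in weights.items():
        adj.setdefault(u, []).append((w, a, v, b))
        adj.setdefault(v, []).append((w, b, u, a))

    removed, recs, i = set(), [{}, {}], 0
    for start in list(adj):                      # PHASE 1: maximal matching
        u = start
        while u not in removed:
            live = [e for e in adj[u] if e[2] not in removed]
            if not live:
                removed.add(u)
                break
            w, a, v, b = max(live)               # heaviest incident edge
            recs[i][u], recs[i][v] = a, b
            i = 1 - i                            # flip between R1 and R2
            removed.add(u)
            u = v                                # the neighbour becomes active

    def weight_of(rec):
        return sum(w for ((u, a), (v, b)), w in weights.items()
                   if rec.get(u) == a and rec.get(v) == b)

    rec = max(recs, key=weight_of)               # keep the heavier path set
    for u in adj:                                # PHASE 2: greedy expansion
        if u in rec:
            continue
        cand = [(w, a) for (w, a, v, b) in adj[u] if rec.get(v) == b]
        if cand:
            rec[u] = max(cand)[1]
    return rec
```

The same `weight_of` helper doubles as the scoring function that BaW uses below to compare the outputs of the individual heuristics.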
Notably, variants of the heuristics Greedy Matching, Random Choice, and Random Neighbor were previously studied in [4], where the authors proved various guarantees on their performance on Max-Rep instances. By alluding to the DKR-Max-Rep connection described in Section 4, one can expect similar quality guarantees to hold for our DKR instances.

5.3 BaW: a meta-algorithm

As we demonstrate in the experimental section, there is no clear winner among the heuristic approaches of Section 5.2. Thus, we propose Best of All Worlds (BaW), a meta-algorithm that combines all previous techniques. It assigns a time budget to ILP and, if ILP times out, it triggers all matching-based variants of Section 5.2.2 and picks the one with the best score. Algorithm 2 contains the pseudo-code of BaW.

    Algorithm 2: BaW: Best of All Worlds meta-algorithm.
    input : a JMG G_Q = (V, E, w), a time budget t
    output: a distribution key recommendation
    1  R_ILP <- ILP(G_Q), killed after time t
    2  if execution time of ILP <= t then
    3      return R_ILP
    4  R_MNG <- MNG(G_Q)
    5  R_GM <- GM(G_Q)
    6  R_RC <- RC(G_Q)
    7  R_RN <- RN(G_Q)
    8  return max(R_ILP, R_MNG, R_GM, R_RC, R_RN)

6. EXPERIMENTS

In our evaluation we use 3 real datasets: Real1 and Real2 are join graphs extracted at random from real users of Redshift, with various sizes and densities (Appendix A describes how edge weights were obtained), and TPC-DS [19] is a well-known benchmark commonly applied to evaluate analytical workloads. In order to assess the behaviour of the methods on graphs with increasing density, we also use 4 synthetic Join Multi-Graphs created by randomly adding edges among 100 vertices (in a manner analogous to the Erdős-Rényi construction of random simple graphs [8]). The total number of edges ranges from 100 to 10,000, the maximum number of attributes per table ranges from 1 to 100, and all edge weights are chosen uniformly in the interval [0, 10000]. Table 2 summarizes the characteristics of our datasets.

    Table 2: Characteristics of datasets
    dataset   | vertices | edges  | |E|/|V|
    Real1     | 1,162    | 2,356  | 2.03
    Real2     | 1,342    | 2,677  | 1.99
    TPC-DS    | 24       | 126    | 5.25
    Synthetic | 100      | 100    | 1
    Synthetic | 100      | 1,000  | 10
    Synthetic | 100      | 5,000  | 50
    Synthetic | 100      | 10,000 | 100

Figure 4 plots the distribution of the sizes (in number of (a) vertices and (b) edges) of real Redshift Join Multi-Graphs in log-log scale. Note that, as Appendix A.2 describes, edges that could cause data skew have been handled during a preprocessing phase. Both distributions follow a zipf curve with slope alpha = -2: more than 99% of the graphs have fewer than 1000 vertices and edges, but there are still a few large graphs that span several thousands of vertices and edges.

[Figure 4: How big are join graphs in reality? Distribution of (a) vertex-set and (b) edge-set sizes, plotted as "% of settings" versus count on log-log axes.]

Section 6.1 compares our proposed methods in terms of their runtime and the quality of the generated recommendations. Section 6.2 demonstrates the effectiveness of our recommendations by comparing the workload of real settings before and after the implementation of our recommended distribution keys.

6.1 Comparison of our methods

All methods were implemented in Python 3.6. For ILP we used the PuLP 2.0 library with the CoinLP solver. We plot our experimental results using boxplots: each boxplot represents the distribution of values for a randomized experiment; the vertical line includes 95% of the values, the rectangle contains 50% of the values, and the horizontal line corresponds to the median value. Outliers are denoted with disks. For the randomized algorithms, we repeat each experiment 100 times. To visually differentiate the greedy baseline (NG) from the matching-based heuristics, we use a white hashed boxplot for NG and gray solid boxplots for the rest.

6.1.1 Real datasets

Figure 5 compares the performance and the runtime of our proposed methods on the real datasets. Each row of Figure 5 corresponds to a real dataset, namely Real1, Real2 and TPC-DS. The first column plots the performance of the various methods in terms of W_R, i.e., the total weight of the recommendation (the higher the better).

Overall, RN and MNG have very consistent performance that closely approaches the exact algorithm (ILP). RN in particular almost always meets the bar of the optimal solution. On the other hand, the greedy baseline NG varies a lot among the different executions and usually underperforms: the mean value of NG is almost 3x worse than the optimal. This is expected, as greedy decisions are usually suboptimal: they only examine a very small subset of the graph and do not account for the implications for the other vertices. In very few cases (around 1%) NG manages to return a solution with weight similar to the optimal.

The second column of Figure 5 plots the runtime comparison of our methods (the lower the better); note that the y-axis is in logarithmic scale.
Notably, RN takes much longer than the other heuristics to complete (its mean value is 4x higher than that of the other techniques), and its runtime is comparable only to that of ILP, which yields the exact solution. This indicates that the consistently high performance of the method comes at a high cost. Concerning the other heuristics, they all perform similarly in terms of runtime, with GM having a slight edge over the others. Overall, MNG is consistently highly performant and runs very fast.

[Figure 5: Performance and runtime of the methods on the real datasets; no clear winner. Panels: (a) performance Real1, (b) runtime Real1, (c) performance Real2, (d) runtime Real2, (e) performance TPC-DS, (f) runtime TPC-DS.]

6.1.2 Synthetic datasets

To evaluate the scalability of the various methods, we utilize synthetic workloads with a varying number of edges. In all cases, the number of nodes is fixed to 100. As a representative technique of the matching approaches, we focus our attention on MNG. NG has already been shown to underperform on our real datasets, so we exclude it from further consideration.

Figure 6(a) compares the two methods on the quality of the returned solution. The x-axis corresponds to the size (in thousands of edges) of the synthetic graphs and the y-axis corresponds to W_R, i.e., the total weight of the recommendation. ILP finds a better solution compared to MNG, especially for denser graphs, for which it can be up to 20% better. Figure 6(b) compares the execution time of the two approaches. ILP is always slower than MNG, because the former gives the exact solution, whereas the latter heuristically finds an approximate one. ILP's runtime increases exponentially with the density of the graph (reaching 100 seconds for the densest), whereas MNG is always much faster and stays under 0.1 seconds even for dense graphs (10k edges).

[Figure 6: Results on synthetic datasets. (a) Performance as weight of recommendation (W_R) of ILP and MNG. (b) Runtime of ILP and MNG (seconds).]

Our combined meta-algorithm BaW finds a balance between the runtime and the quality of the proposed techniques by applying a time budget to the execution of ILP. Given that the size of the largest connected component of real join graphs is typically within an order of magnitude of a thousand edges, BaW finds the optimal solution in a matter of seconds in Amazon Lambda (the allowed time budget t was 5 minutes, but ILP finishes in less than a minute in the vast majority of the cases).

6.2 Quality of recommendations

This section evaluates the actual benefit (in terms of query execution and network cost) of the "Dist-Key" distribution style using the keys recommended by BaW, compared to the default "Even" distribution, i.e., round-robin distribution of the table rows among the nodes of the cluster.

Initially we perform the experiment on a 3TB TPC-DS workload, using a four-node dc2.8xl Redshift cluster [12]. We run the workload once and construct the corresponding Join Multi-Graph. We use BaW to pick the best distribution keys for the observed workload. With these keys, we distribute the tables' data using Redshift's Dist-Key distribution style [11]. Finally, we run the workload again to compare the runtime of each of the queries between the two settings (in both cases we discard the first run to exclude compilation time).

[11] We used the default "Even" distribution if BaW produced no recommendation for a certain table.

Figure 7 (an alternative representation of Figure 1(d)) illustrates the execution time (log scale) versus the query id for the two methods, where the queries are ordered by their runtime in "Even". For the majority of the queries, BaW outperforms "Even" by up to 18x, as illustrated by the red arrow in the figure. This enormous performance benefit is explained by the collocation of expensive joins and shows that BaW is capable of finding the most appropriate distribution keys for complex datasets such as TPC-DS. On the other hand, several queries perform similarly in the two distribution styles because they do not join any of the redistributed tables. Finally, we see some regression towards the tail of the distribution, i.e., for queries that run in under 10 seconds; this is attributed to the fact that Dist-Key may introduce slight skew in the data distribution, which can marginally regress some already fast queries. However, this small overhead is insignificant relative to the huge benefit of collocation for expensive queries [12].

[Figure 7: BaW wins almost always: quality of recommendations in TPC-DS. Running time (log scale) vs. query id; BaW often achieves dramatic savings (18x), especially on the heavier queries.]

[12] Note that further performance improvement in TPC-DS is achieved by also picking the most appropriate sort keys [13].

We repeat this experiment on real-world workloads of users that implemented our recommendations (Figure 8). We picked 4 real scenarios and observed the workload for a few days. Based on our observations we incrementally generate the Join Multi-Graph and run BaW to recommend the most appropriate distribution keys for some target tables. Data redistribution of all target tables happens at approximately the same time. For all queries that touch these tables, we keep track of the total network cost (i.e., data broadcast or distributed at query time) and the total workload (i.e., data scanned at query time), before and after the data redistribution.

[Figure 8: BaW saves: effect of recommendations on real data, for Real settings 1-4. Upper (blue) line: network cost (%) for queries touching the recommended tables; the red vertical line indicates the time our recommendations were implemented; savings reach 32x (setting 1), 5x (setting 2), 16x (setting 3) and 5x (setting 4). Lower (green) line: total scanned data (%) over time for the redistributed tables, growing by 45% (setting 3) and 165% (setting 4).]

Each sub-figure of Figure 8 corresponds to one of our real settings and consists of two plots: the upper one (blue line) illustrates the total network cost, and the lower one (green line) the total workload (data scanned) of the relevant tables over the same period of time. The x-axis corresponds to the hours around the time of deployment, which is indicated by a red, dotted, vertical line.

Overall, we observe that after deployment the network cost drops by as much as 32x (Figures 8(a), (b)), whereas the total workload (data scanned) for the same period of time remains stable. This clearly shows that BaW's choice of distribution keys was very effective: the query workload did not change, but the network cost dropped drastically. In addition, many of the joins that used to require data redistribution or broadcasting at query time now happen locally on each compute node, which explains why the cost drops to zero at times.

In Figure 8(b), around the time of deployment, we see a spike in both the network cost and the scanned data. This is expected, because changing the distribution key of the tables itself may induce network cost that can be considerable, depending on the size of the tables. Nonetheless, this is a one-time effort, whereas the benefits of collocation accumulate over time.

In addition, Figures 8(c), (d) show a drastic drop in the network cost (16x and 5x, respectively), even though the total amount of scanned data increases by up to 2.7x. The reason (assuming the workload remains constant) is that, without the network overhead, queries are more efficient and thus occupy system resources for less time. Consequently, execution slots are released faster, allowing more queries to run per unit of time, which results in increased scanned data for the relevant tables.

To summarize, the observations of our experiments are:
• Very fast techniques: The majority of the heuristic methods terminate within seconds for graphs of thousands of edges. ILP in Amazon Lambda finds the optimal solution in less than a minute for almost all cases seen in Redshift so far.
• No clear winner among heuristics: The heuristic methods perform differently in our various settings. This motivates BaW, a meta-algorithm that picks the best of the individual techniques.
• Network cost reduction: BaW, applied in 4 real scenarios, identified distribution keys that decreased the network cost by up to 32x for joins over a constant or even heavier query workload.

7. CONCLUSIONS

In this paper, we formulate the problem of "Distribution-Key Recommendation" as a novel combinatorial optimization problem defined on the Join Multi-Graph, and we study its complexity. We show that not only is it NP-complete, but it is also hard to approximate within any constant factor. For this reason, we present BaW, a meta-algorithm that solves the "Distribution-Key Recommendation" problem efficiently. Its main idea is to combine an exact ILP-based solution with heuristic solutions based on maximum weight matching. The exact solution is only allowed to run for a given time budget; the heuristic algorithms follow a two-phase approach: Phase 1 performs an approximate maximum matching on the Join Multi-Graph, and Phase 2 greedily extends the recommendation. BaW picks the best of the above methods.

The contributions of our approach are:
• Problem formulation: We introduce the Join Multi-Graph, a concise graph representation of the join history of a cluster, and we define the "Distribution-Key Recommendation" problem (DKR) as a novel graph problem on the Join Multi-Graph.
• Theoretical Analysis: We analyse the complexity of DKR and of some special-case variants, and we prove that they are both NP-complete and hard to approximate.
• Fast Algorithm: We propose BaW, an efficient meta-algorithm to solve DKR.
• Validation on real data: We experimentally demonstrate that BaW improves the performance of real workloads by virtually eliminating the network cost.

APPENDIX

A. IMPLEMENTATION DETAILS

This section discusses some implementation details used in our experiments, concerning the selection of weights and the handling of skew for edges of the Join Multi-Graph. They are included here for completeness, as these decisions are orthogonal to the problem we solve in this paper.

A.1 Choosing weights

Recall from Section 3.2 that the weight w(e) : E → N+ of an edge e represents the cumulative number of bytes processed by that join (after any pre-join filters and column pruning have been applied). For instance, let b1 and b2 be the byte sizes of the two inputs of t1 JOIN t2, and let b1 < b2. The weight of the corresponding edge is given as input based on the formula:

W = min(b1 + b2, b1 * NODES)    (3)

where NODES is the number of nodes in the cluster. The two arguments of min() in Equation 3 correspond to the costs of the two options of Redshift's optimizer for performing a distributed hash join over non-collocated tables: 1) redistribute both tables on the join attributes, or 2) broadcast the smaller table to all compute nodes. Intuitively, the above function approximates the communication cost that one would have to pay during query execution if the tables were not collocated. Note that b1 and b2 can be obtained asynchronously from the system logs of a join, so there is no overhead on query execution.
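As a sanity check, the helper below transcribes Equation 3 directly; it is a minimal sketch whose name (edge_weight) is ours, with the byte counts and node count supplied by the caller.

    def edge_weight(b1, b2, nodes):
        # Equation 3: approximate network cost of a non-collocated hash join.
        # b1, b2: bytes of the two join inputs (after filters/column pruning);
        # nodes: number of compute nodes in the cluster.
        small = min(b1, b2)
        # Option 1: redistribute both inputs on the join attributes (b1 + b2).
        # Option 2: broadcast the smaller input to every compute node.
        return min(b1 + b2, small * nodes)

    # e.g., edge_weight(1_000, 50_000, 4) == 4_000: broadcasting the small
    # table is cheaper than redistributing both.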
A.2 Handling of data skew

We assume that the penalty for data skew is already included in the join cost (see Section 3.2, Equation 1), which is an input to our problem and thus orthogonal to the method proposed in this work. For our experiments, columns that may introduce data skew are penalized using statistics: if the number of distinct values of a column divided by the approximate total number of rows in the table is larger than a threshold θ1, and the number of distinct values divided by the number of compute nodes is larger than a threshold θ2, then the column contains enough distinct values to be considered safe in terms of skew. Otherwise, it is removed from the graph. The thresholds θ1, θ2 were chosen experimentally.
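The skew test above can be captured in a few lines. In the sketch below the function name and the default thresholds are illustrative placeholders; the paper's θ1, θ2 were chosen experimentally and are not disclosed.

    def is_skew_safe(distinct_values, approx_rows, num_nodes,
                     theta1=0.01, theta2=10.0):
        # A column is kept as a distribution-key candidate only if it has
        # enough distinct values relative to the table size (theta1) and to
        # the number of compute nodes (theta2). The defaults here are
        # placeholders, not the experimentally chosen values.
        return (distinct_values / approx_rows > theta1 and
                distinct_values / num_nodes > theta2)

    # Columns failing either test are removed from the Join Multi-Graph and
    # can therefore never be recommended as distribution keys.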

8. REFERENCES

[1] S. Agrawal, V. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359–370, 2004.
[2] N. Borić, H. Gildhoff, M. Karavelas, I. Pandis, and I. Tsalouchidou. Unified spatial analytics from heterogeneous sources with Amazon Redshift. In SIGMOD, pages 2781–2784, 2020.
[3] M. Cai, M. Grund, A. Gupta, F. Nagel, I. Pandis, Y. Papakonstantinou, and M. Petropoulos. Integrated querying of SQL database data and S3 data in Amazon Redshift. IEEE Data Eng. Bull., 41(2):82–90, 2018.
[4] M. Charikar, M. Hajiaghayi, and H. Karloff. Improved approximation algorithms for label cover problems. In European Symposium on Algorithms, pages 23–34. Springer, 2009.
[5] C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: A workload-driven approach to database replication and partitioning. PVLDB, 3(1-2):48–57, 2010.
[6] E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and M. Yannakakis. The complexity of multiterminal cuts. SIAM Journal on Computing, 23(4):864–894, 1994.
[7] G. Eadon, E. I. Chong, S. Shankar, A. Raghavan, J. Srinivasan, and S. Das. Supporting table partitioning by reference in Oracle. In SIGMOD, pages 1111–1122, 2008.
[8] P. Erdős, A. Rényi, et al. On random graphs. Publicationes Mathematicae, 6(26):290–297, 1959.
[9] M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987.
[10] P. Furtado. Workload-based placement and join processing in node-partitioned data warehouses. In International Conference on Data Warehousing and Knowledge Discovery, pages 38–47. Springer, 2004.
[11] C. Garcia-Alvarado, V. Raghavan, S. Narayanan, and F. M. Waas. Automatic data placement in MPP databases. In IEEE 28th International Conference on Data Engineering Workshops, pages 322–327, 2012.
[12] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani, and V. Srinivasan. Amazon Redshift and the case for simpler data warehouses. In SIGMOD, pages 1917–1923, 2015.
[13] J. Harris et al. Optimized DDL for TPC-DS in Amazon Redshift, 2020. https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCDS/3TB [Online; accessed 15-July-2020].
[14] S. Hougardy. Linear time approximation algorithms for degree constrained subgraph problems. In Research Trends in Combinatorial Optimization, pages 185–200. Springer, 2009.
[15] A. R. Karlin, M. S. Manasse, L. Rudolph, and D. D. Sleator. Competitive snoopy caching. Algorithmica, 3(1-4):79–119, 1988.
[16] G. Kortsarz. On the hardness of approximating spanners. Algorithmica, 30(3):432–450, 2001.
[17] M. Langberg, Y. Rabani, and C. Swamy. Approximation algorithms for graph homomorphism problems. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 176–187. Springer, 2006.
[18] A. Meyerson. The parking permit problem. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 274–282, 2005.
[19] R. O. Nambiar and M. Poess. The making of TPC-DS. In VLDB, pages 1049–1058, 2006.
[20] R. Nehme and N. Bruno. Automated partitioning design in parallel database systems. In SIGMOD, pages 1137–1148, 2011.
[21] A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD, pages 61–72, 2012.
[22] J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558–569, 2002.
[23] D. Saccà and G. Wiederhold. Database partitioning in a cluster of processors. ACM Transactions on Database Systems (TODS), 10(1):29–56, 1985.
[24] T. Stöhr, H. Märtens, and E. Rahm. Multi-dimensional database allocation for parallel data warehouses. In VLDB, pages 273–284, 2000.
[25] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64–71, 2010.
[26] R. Varadarajan, V. Bharathan, A. Cary, J. Dave, and S. Bodagala. DBDesigner: A customizable physical design tool for Vertica analytic database. In ICDE, pages 1084–1095, 2014.
[27] Wikipedia contributors. Ski rental problem — Wikipedia, the free encyclopedia, 2020. https://en.wikipedia.org/w/index.php?title=Ski_rental_problem&oldid=954059314 [Online; accessed 13-May-2020].
[28] E. Zamanian, C. Binnig, and A. Salama. Locality-aware partitioning in parallel database systems. In SIGMOD, pages 17–30, 2015.
[29] P. Zhang, Y. Xu, T. Jiang, A. Li, G. Lin, and E. Miyano. Improved approximation algorithms for the maximum happy vertices and edges problems. Algorithmica, 80:1412–1438, 2018.
[30] D. C. Zilio, A. Jhingran, and S. Padmanabhan. Partitioning key selection for a shared-nothing parallel database system. IBM Research Report, RC 19820, 1994.
