
Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning

Guna Prasaad (University of Washington), Alvin Cheung (University of California, Berkeley), Dan Suciu (University of Washington)

ABSTRACT
Research on transaction processing has made significant progress towards improving performance of main memory multicore OLTP systems under low contention. However, these systems struggle on workloads with lots of conflicts. Partitioned databases (and variants) perform well on high contention workloads that are statically partitionable, but time-varying workloads often make them impractical. Towards addressing this, we propose Strife, a novel transaction processing scheme that clusters transactions together dynamically and executes most of them without any concurrency control. Strife executes transactions in batches, where each batch is partitioned into disjoint clusters without any cross-cluster conflicts and a small set of residuals. The clusters are then executed in parallel with no concurrency control, followed by residuals separately executed with concurrency control. Strife uses a fast dynamic clustering algorithm that exploits a combination of random sampling and a concurrent union-find data structure to partition the batch online, before executing it. Strife outperforms lock-based and optimistic protocols by up to 2× on high contention workloads. While Strife incurs about 50% overhead relative to partitioned systems in the statically partitionable case, it performs 2× better when such static partitioning is not possible and adapts to dynamically varying workloads.

ACM Reference Format:
Guna Prasaad, Alvin Cheung, Dan Suciu. 2020. Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20), June 14-19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3389764

1 INTRODUCTION
As modern OLTP servers today ship with an increasing number of cores, users expect transaction processing performance to scale accordingly. In practice, however, achieving scalability is challenging, with the bottleneck often being the concurrency control needed to ensure a serializable execution of the transactions. Consequently, a new class of high performance concurrency control protocols [8, 18, 19, 37, 39, 44] have been developed and are shown to scale to even thousands of cores [39, 43].

Yet, the performance of these protocols degrades rapidly on workloads with high contention [1, 43]. One promising approach to execute workloads with high contention is to partition the data, and have each core execute serially the transactions that access data in one partition [1, 15, 24, 29]. The intuition is that, if many transactions access a common data item, then they should be executed serially by the core hosting that data item without any need for synchronization. In these systems, the database is partitioned statically using either user input, application semantics, or workload-driven techniques [6, 25, 26, 28]. Once the data is partitioned, the transactions are routed to the appropriate core where they are executed serially. For workloads where transactions tend to access data from a single partition [34], static data partitioning systems are known to scale up very well [15].

However, cross-partition transactions (i.e., those that access data from two or more partitions) do require synchronization among cores and can significantly degrade the performance of the system. Unfortunately, in some applications, a large fraction of the transactions are cross-partition as no good static partitioning exists, or the hot data items change dynamically over time. As an extreme example, in the 4chan social network, a post lives for only about 20-30 minutes, and its popularity can vary within seconds [2]. The various static partitioning schemes differ in how they synchronize cross-partition transactions: H-Store [13] acquires partition-level locks, Dora [1, 24] migrates transactions between cores, while Orthrus [29] delegates concurrency control to a dedicated set of threads. If the database is statically partitionable, then these approaches can mitigate a small fraction of cross-partition transactions; however, their performance degrades rapidly when this fraction increases, or when the workload is not statically partitionable, as discussed in [1].


In this paper, we propose Strife, a new batch-based OLTP system that partitions transactions dynamically. Strife groups transactions into batches, partitions them based on the data they access, then executes transactions in each partition serially on an exclusive core. Our key insight is to recognize that most OLTP workloads, even those that are not amenable to static partitioning, can be "almost" perfectly partitioned at a batch granularity. More precisely, Strife divides the batch into two parts. The first consists of transactions that can be grouped into disjoint clusters, such that no conflicts exist between transactions in different clusters, and the second consists of residual transactions, which include the remaining ones. Strife executes the batch in two phases: in the first phase it executes each cluster on a dedicated core, without any synchronization, and in the second phase it executes the residual transactions using some fine-grained serializable concurrency protocol.

As a result, Strife does not require a static partition of the database a priori; it discovers the partitions dynamically at runtime. If the workload is statically partitionable, Strife will rediscover these partitions, setting aside the cross-partition transactions as residuals. When the workload is not statically partitionable, Strife will discover a good way to partition the workload at runtime, based on hot data items. It easily adapts to time-varying workloads, since it computes the optimal partition for every batch.

The major challenge in Strife is the clustering algorithm, and this forms the main technical contribution of this paper. Optimal graph clustering is NP-hard in general, which rules out an exact solution. Several polynomial time approximation algorithms exist [9, 16, 21, 23], but their performance is still too low for our needs. To appreciate the challenge, consider a statically partitioned system that happens to execute a perfectly partitionable balanced workload. Its runtime is linear in the total number of data accesses and scales with the number of cores. To be competitive, the clustering algorithm must run in linear time, with a small constant, and must be perfectly parallelizable across cores. Our novel clustering algorithm satisfies both criteria: it performs clustering using only three passes over data items and exploits intra-batch parallelism (Sec. 3). To achieve this, we also extend the union-find data structure to support concurrent execution (Sec. 4).

We implemented a prototype of Strife and conducted an extensive empirical evaluation of the entire algorithm (Sec. 5). We start by evaluating Strife's clustering algorithm and show that, in addition to having high performance, it also finds high quality clusters on three different benchmarks. Next, we compare Strife with general purpose concurrency control techniques, and find that its throughput is at least 2× that of the nearest competitor. A more interesting comparison is with protocols designed specifically for high contention workloads. We find that, in the ideal case when the workload is statically partitionable and there are very few cross-partition transactions, Strife achieves about half of the throughput of a statically partitioned system (due to clustering overhead). However, when the workload is not statically partitioned, or there are many cross-partition transactions, then it can improve throughput by 2× as compared to other systems. We conclude by reporting a number of micro-benchmarks to evaluate the quality of the clustering algorithm, and the impact of its parameters.

While many automated static partitioning techniques exist in the literature (surveyed in Sec. 6), to the best of our knowledge the only system that considers dynamic partitioning is LADS [42]. Like Strife, LADS uses dynamic graph-based partitioning for a batch of transactions. Unlike Strife, the partitioning algorithm in LADS depends on range-based partitioning. In addition, it does not eliminate the need for concurrency control; instead it breaks a transaction into record actions, and migrates different pieces to different cores, to reduce the number of locks required. This creates new challenges, such as cascading aborts, and it is unclear how much overhead it introduces. In contrast, Strife is not limited to range partitions, is free from cascading aborts, and completely eliminates concurrency control when possible.

In summary, we make the following contributions:
• We describe Strife, an OLTP system that relies on dynamic clustering of transactions, aiming to eliminate concurrency control for most transactions in a batch. (Sec. 2).
• We describe a novel clustering algorithm that produces high quality clusters using only three scans over the set of transactions in the batch. (Sec. 3).
• We then present a concurrent implementation of the union-find data structure that enables parallelizing the clustering of transactions. (Sec. 4).
• We conduct an extensive experimental evaluation, comparing our system with both general-purpose concurrency control systems, and with systems designed for high contention, based on a static partition of the database. (Sec. 5).

2 OVERVIEW
Strife executes transactions in batches, where it collects transactions for a fixed duration and executes them together. For each batch, Strife analyzes the conflicts among transactions based on their read-write sets¹ and partitions them with the goal of minimizing concurrency control. Strife executes batches in three phases: analysis, conflict-free, and residual. These phases are executed synchronously on all cores as shown in Fig. 1. During the analysis phase, the batch is partitioned into a number of conflict-free clusters and some residuals.

¹ Read-write sets can be obtained either through static analysis of the transaction code or via a reconnaissance query and conditional execution as in deterministic databases [36].
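For concreteness, the only inputs the analysis phase needs per transaction are an identifier and the sets of data items it reads and writes. The following C++ fragment is a minimal sketch of such a representation together with the pairwise conflict test used throughout this section; the type and function names (TxnAccess, Conflicts) are illustrative assumptions, not Strife's actual interfaces.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Illustrative only: a transaction as seen by the analysis phase.
struct TxnAccess {
    uint64_t txn_id;
    std::vector<uint64_t> reads;   // data items read (kept sorted)
    std::vector<uint64_t> writes;  // data items written (kept sorted)
};

// Two transactions conflict if they access a common data item and at
// least one of the two accesses is a write.
bool Conflicts(const TxnAccess& a, const TxnAccess& b) {
    auto intersects = [](const std::vector<uint64_t>& x,
                         const std::vector<uint64_t>& y) {
        std::vector<uint64_t> common;
        std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                              std::back_inserter(common));
        return !common.empty();
    };
    return intersects(a.writes, b.writes) ||
           intersects(a.writes, b.reads) ||
           intersects(a.reads, b.writes);
}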


[Fig. 1. Strife execution scheme: each batch passes through the analysis, conflict-free, and residual phases; conflict-free clusters C1-C4 execute in parallel, one batch at a time.]

[Fig. 2. Access Graph of a batch of transactions: transactions T1-T7 (set B) connect to the data items (set D) they access; T1 accesses {w1, a, b}, T2 {w1, c, d}, T3 {w1, a, c, d}, T4 {w2, e, f}, T5 {w2, f, h}, T6 {w2, g, h}, and T7 {w2, c, f, g}.]

Transactions in a conflict-free cluster are executed serially with no concurrency control, with several clusters executed in parallel on different cores in the conflict-free phase. After all conflict-free transactions are executed, residuals are then executed in a separate residual phase across all cores using a conventional concurrency control protocol. Once the residual transactions are executed, Strife repeats the same process for the next batch.

The goal of the analysis phase is to partition the batch into clusters such that transactions from two different clusters are conflict-free. To do so, Strife first represents a batch of transactions using its data access graph. A data access graph A = (B ∪ D, E) is an undirected bipartite graph, where B is the set of transactions, D is the set of data items (e.g., tuples or tables) accessed by transactions in B, and the edges E contain all pairs (T, d) where transaction T accesses data item d. Transactions T, T′ are in conflict if they access a common data item with at least one of them updating it.

Fig. 2 depicts the access graph of an example batch (colors explained later) derived from the TPC-C [5] benchmark. Each transaction updates a warehouse (w1 or w2) and a set of stocks (a-h). Transaction T1, for example, accesses warehouse w1 and stocks a and b. Transactions T1, T2 are in conflict because they both access w1; whereas T1, T5 are not. A batch can be partitioned into many conflict-free clusters when transactions access disjoint sets of data items. Transactions T1-T3 and T4-T6 (Fig. 2) access different data items ({w1, a-d} and {w2, e-h} respectively). Hence, the batch T1-T6 can be partitioned into 2 clusters {T1, T2, T3} and {T4, T5, T6} with no cross-cluster conflicts.

However, real workloads often contain outliers that access data items from multiple clusters (such as the red transaction T7 in Fig. 2). A conflict-free cluster for the batch T1-T7 would thus contain all transactions. To avoid this, we instead remove T7 as a residual and create conflict-free clusters only with respect to the remaining batch. This results in 2 conflict-free clusters {T1, T2, T3}, {T4, T5, T6} and a residual T7.

Formally, a clustering in Strife is an (n + 1)-partition of a batch of transactions B into {C1, C2, ..., Cn, R} such that, for any i ≠ j and transactions T ∈ Ci, T′ ∈ Cj, T and T′ do not conflict. Notice that there is no such requirement on R. Ideally, we want to maximize n, the number of conflict-free clusters, to maximize parallelism, while minimizing the size of R. To elucidate the trade-offs, we describe two naive clusterings. The first is fallback clustering, where all transactions are in R and n = 0; this corresponds to running the entire batch using conventional concurrency control. The second is sequential clustering, where we place all transactions in C1, and set n = 1 and R = ∅; this corresponds to executing all transactions sequentially, on a single core. Obviously, neither of these are good clusterings. Hence, we want n to be at least as large as the number of cores, and constrain R to be no larger than a small fraction, α, of the batch.

In practice, a good clustering exists for most workloads, except in the extreme cases. In one extreme, when all transactions access a single highly contended data item, then no good clustering exists besides fallback and sequential, and Strife simply resorts to fallback clustering. Once data contention decreases, i.e., the number of contentious data items increases, a good clustering would try to place each hot item in a different conflict-free cluster. When contention further decreases such that all transactions access different data items, then any n-clustering of roughly equal sizes is adequate. Thus, we expect good clusterings to exist in all but the most extreme workloads. The challenge, however, is to find a good clustering efficiently.

After partitioning, conflict-free clusters are placed in a shared worklist. Multiple cores then obtain a cluster from the worklist, and execute all transactions in it serially one after another in the conflict-free phase. Once done with a cluster, a core obtains another from the worklist until done. The degree of parallelism in this phase is determined by the number and size of conflict-free clusters. More clusters result in parallel execution, thus reducing the total time to execute them. Once the worklist is exhausted, a core waits for other cores to finish processing their clusters before moving to the residual phase. This is because the residuals may conflict with the conflict-free transactions that are being executed concurrently, and hence require data synchronization. A skew in cluster sizes may reduce concurrency since threads that complete early cannot advance to the next phase.
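The batch lifecycle described above can be summarized in code. The sketch below is a minimal, self-contained C++ outline of the three phases driven over a shared worklist; the names (Txn, Cluster, Worklist, ExecuteBatch) are assumptions, the worklist is a simple mutex-protected queue standing in for Strife's low-overhead MPMC queue, and the residual phase is shown as a serial placeholder rather than the no-wait two-phase locking protocol the system actually uses.

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative transaction: the real system carries read/write sets,
// parameters, and metadata; here only the executable body matters.
struct Txn { std::function<void()> body; };
using Cluster = std::vector<Txn>;

// Minimal thread-safe worklist shared by all worker threads.
template <typename T>
class Worklist {
 public:
  void Push(T item) {
    std::lock_guard<std::mutex> g(m_);
    q_.push(std::move(item));
  }
  bool TryPop(T& out) {
    std::lock_guard<std::mutex> g(m_);
    if (q_.empty()) return false;
    out = std::move(q_.front());
    q_.pop();
    return true;
  }
 private:
  std::mutex m_;
  std::queue<T> q_;
};

// One batch: conflict-free clusters run in parallel with no concurrency
// control; residuals run afterwards (serially here, standing in for 2PL).
void ExecuteBatch(Worklist<Cluster>& worklist,
                  std::vector<Txn>& residuals, int workers) {
  std::vector<std::thread> pool;
  for (int i = 0; i < workers; ++i) {
    pool.emplace_back([&worklist] {
      Cluster c;
      while (worklist.TryPop(c)) {
        for (Txn& t : c) t.body();   // serial execution, no locks
      }
    });
  }
  for (auto& th : pool) th.join();   // barrier before the residual phase

  // Residual phase placeholder: Strife uses no-wait 2PL with retries here.
  for (Txn& t : residuals) t.body();
}

Because all transactions of a data cluster end up in the same queue, executing each queue serially preserves serializability for the conflict-free phase without any per-record locks.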


Algorithm 1: Strife Clustering Algorithm
Input: List⟨Txn⟩ batch
Output: Queue⟨Queue⟨Txn⟩⟩ worklist, Queue⟨Txn⟩ residuals

 1  set special;
 2  int count[k][k] = {0};
    // Spot Step (initially each data node is a cluster)
 3  i = 0;
 4  repeat
 5      Pick a random transaction T from batch;
 6      set C = { Find(r) | r ∈ T.nbrs };
 7      set S = { c | c ∈ C and c is special };
 8      if |S| == 0 then
 9          c = C.first;
10          foreach other ∈ C do
11              Union(c, other);
12          c.id = i;
13          c.count++;
14          c.is_special = true;
15          special.insert(c);
16          i++;
17  until k times;
    // Fuse Step
18  foreach T ∈ batch do
19      set C = { Find(r) | r ∈ T.nbrs };
20      set S = { c | c ∈ C and c is special };
21      if |S| ≤ 1 then
22          c = (|S| == 0) ? C.first : S.first;
23          foreach other ∈ C do
24              Union(c, other);
25          c.count++;
26      else
27          foreach (c1, c2) ∈ S × S do
28              count[c1.id][c2.id]++;
    // Merge Step
29  foreach (c1, c2) ∈ special × special do
30      n1 = count[c1.id][c2.id];
31      n2 = c1.count + c2.count + n1;
32      if n1 ≥ α × n2 then
33          Union(c1, c2);
    // Allocate Step
34  foreach T ∈ batch do
35      C = { Find(r) | r ∈ T.nbrs };
36      if |C| = 1 then
37          c = C.first;
38          if c.queue = ∅ then
39              c.queue = new Queue();
40              worklist.Push(c.queue);
41          c.queue.Push(T);
42      else
43          residuals.Push(T);
44  return (worklist, residuals);

As such, optimal allocation of conflict-free clusters to cores can be modeled as a 1-D bin-packing problem. However, solving this requires estimating the cost of executing a transaction, and the quality of the solution critically depends on how good the estimate is. We instead rely on work-sharing parallelism powered by a low-overhead concurrent queue to balance load across cores dynamically.

After the conflict-free clusters are processed, transactions in the residual cluster are executed in the residual phase. As discussed earlier, concurrency control is needed to execute them concurrently across cores. Strife uses two-phase locking with a no-wait deadlock avoidance policy as it is shown to be highly scalable [43]. Under this policy, a transaction aborts immediately when it fails to acquire a lock. Strife retries transactions aborted this way until successful commit or logical abort. Finally, after the residuals are executed, Strife processes the next batch. We next describe the clustering algorithm in Strife.

3 STRIFE CLUSTERING ALGORITHM
A batch of transactions in Strife, B, is partitioned into conflict-free clusters C1, C2, ..., Cn, and a set of residuals, R, of size at most α|B|. The inputs to the algorithm are the batch of transactions (denoted B) along with their read-write sets, a number k, typically a small factor times the core count, and the residual parameter α, a fraction between 0 and 1.

Our clustering problem can be viewed as partitioning the data access graph A = (B ∪ D, E) (refer to Fig. 2) with dual objectives: maximize the number of conflict-free clusters, and reduce the size of residuals. Graph partitioning in general is NP-hard, and several PTIME approximations have been proposed in the literature [9, 10, 21, 23, 27]. These solutions, however, do not meet the latency budget of our use case. For instance, our preliminary study revealed that partitioning a batch from the TPC-C benchmark using METIS [21], a state-of-the-art graph partitioning library, is an order of magnitude slower than executing it directly with a concurrency protocol. Hence, our key challenge is to devise a clustering algorithm that touches each node in the graph as few times as possible. In other words, we seek an approximation algorithm that runs in linear time, with a very small constant, and can be parallelized to incur the least runtime overhead. We next describe our algorithm (Sec. 3.1), illustrate it using an example (Sec. 3.2), and finally analyze its runtime (Sec. 3.3); the pseudo-code is listed in Alg. 1.


3.1 Description
Partitioning a batch of transactions in Strife is akin to partitioning the data nodes in A into data clusters. Once that is done, we can then allocate each transaction to a conflict-free cluster or the residuals depending on how its connected data nodes are clustered. Hence, our algorithm will cluster the data nodes first, then allocate transactions to an appropriate cluster or the residuals. The algorithm runs in 5 steps: (1) prepare, (2) spot, (3) fuse, (4) merge, and (5) allocate. Initially each data node in the graph is a singleton cluster. The algorithm repeatedly merges clusters connected to a transaction, so that data nodes accessed by that transaction belong to the same data cluster. It uses different heuristics to prevent merging all data nodes into a single large data cluster. We describe each step in detail below.

3.1.1 Prepare Step. First, Strife creates the data access graph from the read-write sets of transactions in the batch. Strife does not explicitly create a graph data structure in memory; instead it uses the read-write sets along with some metadata to avoid overheads. We consider only data nodes that are updated by at least one transaction in the batch, since there is no need to coordinate concurrent access to read-only data items. For instance, the Items table in the TPC-C benchmark is only read and never updated; as a result Items is not part of our access graph. We do this by scanning and labeling records that are updated as active and ignoring all inactive records in subsequent steps (not shown in Alg. 1 for brevity).

3.1.2 Spot Step. The spot step (Lines 3-17) identifies hot data items in the batch so that we can avoid placing several of them in the same data cluster. Initially, all data nodes in the access graph are singleton data clusters. We say a transaction is connected to a data cluster c if it accesses any data node in c. In this step, we create a small number of clusters that are all labeled "special" using a sample-reject scheme. We begin by picking a transaction T from B uniformly at random. We merge the data clusters connected to T and label the resulting cluster "special."

We continue by repeatedly picking another transaction T′ at random; if T′ is connected to a special cluster, then we reject it, otherwise we merge all its data nodes into a new "special" cluster. We continue this for k trials, creating k or fewer special clusters. Recall that k is an input parameter to the algorithm, and is typically chosen to be a small factor times the number of cores. For instance, in our experiments we use k = 100 for 30 cores.

With high probability, our scheme will find the "hot" data items (i.e., items accessed by many transactions) and create a special cluster for each of them. To understand the intuition, suppose d is a hot data item accessed by half of the transactions. Then, with high probability, one of the randomly chosen transactions will access d, and spot will create a special cluster for d; if another sampled transaction also accesses d, it will be rejected. Suppose now that there are n "hot" data items, d1, d2, ..., dn, and each is accessed by approximately |B|/n transactions. In this case, with high probability, spot will create a special cluster for each di, because the probability that none of the k sampled transactions accesses a data item di is (1 − 1/n)^k ≈ 1/e^{k/n}, which is very low when k ≫ n; for example, if k ≳ 3n, then 1/e^{k/n} ≤ 1/e^3 ≈ 0.05. This argument extends to the case of pairs of data items d1, d2 that have direct affinity, meaning that a large number of transactions access both; then, with high probability, some special cluster will include both.
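As a concrete check of this bound, consider the paper's own sample budget of k = 100 and, purely for illustration, n = 30 hot items (matching a 30-warehouse TPC-C configuration); the arithmetic below is an added example, not a figure reported in the paper.

    Pr[item di is missed by all k samples] = (1 − 1/n)^k ≈ 1/e^{k/n}
    with n = 30, k = 100:  1/e^{100/30} = e^{−3.33} ≈ 0.036   (exactly, (1 − 1/30)^100 ≈ 0.034)

Each of the 30 hot items is therefore spotted with probability above 96%, and by linearity of expectation only about one hot item per batch is expected to be missed.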
3.1.3 Fuse Step. The spot step processed only a small number of transactions (up to k); the fuse step (Lines 18-28) processes all remaining transactions in the batch. For each transaction, we examine the data clusters connected to it. If at most one of them is labeled "special," we merge all of them together. To prevent creating a single cluster with all data items, fuse never merges two special clusters: if a transaction accesses two or more special clusters, then fuse skips it, and instead records the count of such transactions for each special cluster pair (Line 28). The number of special clusters remains the same throughout the fuse step, but each special cluster may have grown from its initial size in the spot step, and new non-special clusters may be created as well. The merge of two clusters is "special" if and only if at least one of them was special.

3.1.4 Merge Step. So far the special clusters created in spot have never been merged. In this step, we revisit this invariant and merge some of them to reduce the number of residuals. If two hot data items d1, d2 are accessed by many transactions, then the special clusters they belong to are good candidates for merging; we say that d1, d2 have direct affinity. Special clusters may also have to be merged due to indirect affinity, as illustrated by the following example. Suppose there are two hot items, d1, d2, each in a separate special cluster, and a third item d3, such that a large number of transactions access both d1, d3 and another large number access d2, d3; then we say that d1, d2 have high indirect affinity. Depending on the order of processing, fuse will include d3 either in d1's special cluster or in d2's special cluster. In the first case, all transactions that access d2, d3 are skipped as residuals; in the second case the d1, d3 transactions are skipped. The merge step examines the benefit of merging the two special clusters of d1 and d2, thus allowing all these transactions to be executed in the conflict-free phase.

In such scenarios, we relax the criterion from the fuse step and merge special clusters. We use a new heuristic to prevent merging all data items into a single huge cluster in the merge step (Lines 29-33). If two special clusters ci and cj are merged, transactions connected to ci and cj can now be classified into a conflict-free cluster. We merge them only when the residuals exceed the bound specified by parameter α, i.e., |R| ≥ α|B|.


Recall that α is one of the input parameters of the algorithm: it chooses between executing transactions on more cores with concurrency control (if α is small) versus on fewer cores without concurrency control. Empirically, we found α = 0.2 to give the best results in our experiments.

Let count[ci][cj] denote the number of transactions in B that access data items in ci and cj. Note that the transactions that are accounted for in count[ci][cj] can access data items from clusters other than ci and cj as well. If the two clusters ci and cj are not merged, then all of count[ci][cj] will be residuals. So, we merge a cluster pair ci, cj when:

    count[ci][cj] ≥ α × (|ci| + |cj| + count[ci][cj])

Since |R| ≤ Σ_{i,j} count[ci][cj], this merge scheme results in residuals fewer than α|B| when most transactions access at most two special clusters.

3.1.5 Allocate Step. In the allocate step (Lines 34-43), we examine the data clusters connected to each transaction and allot it to a conflict-free cluster or the residuals. If the number of clusters connected to a transaction T, denoted |C|, is more than one, then T is added to the residuals since it has conflicting data accesses with more than one data cluster. If |C| = 1, all data items accessed by T belong to one data cluster and hence T does not conflict with transactions not connected to that data cluster. So, T can be assigned to a conflict-free cluster. We first check if the data cluster already has a designated queue (Line 38). If so, we add T to that queue. Else, we create a new conflict-free queue and add T to it. This queue is added to the shared worklist. This way all transactions connected to a data cluster are added to the same conflict-free queue, thus ensuring a serializable execution.

3.2 Example
We illustrate in Fig. 3 how Strife clusters an example batch of new-order transactions from the TPC-C benchmark. We consider only stock and warehouse accesses for brevity. As shown in Fig. 3a, each transaction (mid column) updates a warehouse (right column) and a few stock tuples (left column). Each stock item is listed in the catalog of one warehouse, as labeled on the left of Fig. 3a.

At the start, each warehouse and stock item is a singleton data cluster. spot randomly samples one transaction, say the first from the top, and merges its data items into one special cluster: they are shown in orange. If the next transaction sampled accesses any of the orange items, it is ignored, otherwise its data items form a new special cluster. After several trials, Strife creates four special clusters as shown in Fig. 3b: orange, green, red, blue. All black data items continue to be singleton clusters.

Next, fuse processes all transactions, top to bottom. fuse merges each transaction's data items into one single cluster, unless doing so would merge two special clusters, in which case the transaction is skipped. For instance, the transaction accessing both a green and a red data item is skipped. In our example, after the fuse step, all data items will be included in one of the four special clusters as shown in Fig. 3c; in general, this is not the case; instead some new clusters may be formed that are not special, or some data items may be left in their singleton cluster (also non-special). Notice that most of the catalog stock items of w2 and w3 are also accessed by transactions of the other warehouse and, since we process transactions from top to bottom, most of them will be allotted to the green cluster (of w2).

At this stage, most of the w1, w2 and w4 transactions are connected to a single data cluster, while several w3 transactions access data from both the green and red clusters (notice that w3 itself is red). We say that the w2 and w3 warehouse tuples have an indirect affinity via their stock items. These two special clusters are candidates for merging, because many transactions access both red and green items. More precisely, merge checks that the number of transactions accessing both clusters is more than an α fraction of those accessing either of them, and then merges them (refer to Fig. 3d). This reduces the number of transactions in the residual cluster. Note that the clusters of w1 and w2 (orange and green) are not merged, because too few transactions access both (just two).

Finally, allocate iterates over all transactions, and for each, checks if it can be included in a conflict-free cluster or the residuals. As shown in Fig. 3d, only three transactions are residual, two accessing orange+red, the other accessing red+blue; the other transactions are included in one of the three conflict-free clusters: orange, red, or blue.

3.3 Runtime Analysis
We use the union-find [33] data structure to maintain data clusters in Strife. The amortized cost of a union-find operation is O(α(N)) ≈ O(1), where N is the number of data items and α the inverse Ackermann function; thus we assume that every union/find operation takes time O(1). The prepare, fuse and allocate steps iterate over the entire batch of transactions B, but do this in parallel using n cores; thus, their total runtime is O(|B|/n), where the constant under O(···) is small (about 3). spot and merge are executed sequentially, and their runtime is as follows: spot takes time O(k), since it samples using k trials, while merge takes time O(k²), since it examines up to k² pairs of special clusters. The total runtime of the clustering algorithm is O(|B|/n + k²). For comparison, if the database can be perfectly partitioned statically, and there are no cross-partition transactions, then a system using static partitioning can execute the entire batch in time O(|B|/n).
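The analysis above treats Find and Union as constant-time black boxes. For reference, a minimal sequential union-find with path compression and union by size is sketched below in C++; it only illustrates the primitive being assumed and is not the concurrent variant that Strife actually builds (Sec. 4).

#include <cstddef>
#include <utility>
#include <vector>

// Minimal sequential disjoint-set over data-item ids. Path compression
// plus union by size give near-constant amortized cost per operation.
class UnionFind {
 public:
  explicit UnionFind(std::size_t n) : parent_(n), size_(n, 1) {
    for (std::size_t i = 0; i < n; ++i) parent_[i] = i;
  }

  // Returns the representative (cluster id) of x, compressing the path.
  std::size_t Find(std::size_t x) {
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];  // path halving
      x = parent_[x];
    }
    return x;
  }

  // Merges the clusters containing a and b; the smaller joins the larger.
  void Union(std::size_t a, std::size_t b) {
    a = Find(a);
    b = Find(b);
    if (a == b) return;
    if (size_[a] < size_[b]) std::swap(a, b);
    parent_[b] = a;
    size_[a] += size_[b];
  }

 private:
  std::vector<std::size_t> parent_;
  std::vector<std::size_t> size_;
};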
4 PARALLEL UNION-FIND
Strife heavily exploits the union-find data structure to partition the batch efficiently. The analysis is done in parallel on multiple cores to reduce end-to-end transaction latency. Specifically, the spot and merge steps are executed in a



5 EVALUATION
In this section, we evaluate Strife by comparing it to other systems and on micro-benchmarks. Specifically:
§5.2 First, we analyze how well Strife's clustering algorithm works on OLTP workloads.
§5.3 Then, we compare Strife with several concurrency control (CC) protocols using the shared-everything architecture.
§5.4 We next evaluate Strife against state-of-the-art CCs for handling high contention workloads: MOCC [39], Dora [24], Orthrus [29] and PartCC [13].
§5.5 We then evaluate Strife's adaptability when the hot data items dynamically vary over time.
§5.6 Finally, we study the impact of the residual parameter α, batch size, and contention using microbenchmarks.

5.1 Setup, Implementation and Workloads
Experimental Setup. We run all our experiments on a multi-socket Intel Xeon CPU E7-4890 v2 @ 2.80GHz machine with 2TB memory and two sockets, each having 15 physical cores. We implemented Strife and our baselines in C++. We use a default batch size of 10K transactions.

System Implementation. We have implemented a prototype of Strife that clusters transactions using the algorithm along with the optimizations described in Sec. 3. Strife, at its core, is a multi-threaded transaction manager that interacts with the storage layer of the database using the standard get-put interface. A table is implemented as a collection of in-memory pages that contain sequentially organized records. Some tables are indexed using a hashmap that maps primary keys to records. We implement the index using the libcuckoo [22] library, a thread-safe fast concurrent hashmap. Records contain meta-data for clustering and concurrency control. We chose this design to avoid overheads due to multiple hash look-ups and focus primarily on concurrency control.

The analysis phase produces several conflict-free (CF) queues and a residual queue. All CF queues are stored in a multi-producer multi-consumer (MPMC) concurrent queue, called the worklist. The worklist is shared among all threads. In the conflict-free phase, threads obtain a queue from the worklist and execute the transactions in it serially one after another without any concurrency control. Once a thread is done with its queue, it obtains another from the worklist. When the worklist is empty, threads spin-wait for other threads to complete.

Once all threads are done with conflict-free queues, they enter the residual phase. The residuals are stored in a shared MPMC queue. Threads dequeue and execute them using a 2-phase locking concurrency protocol with a no-wait deadlock avoidance policy (i.e., it immediately aborts the transaction if it fails to grab a lock). This phase is similar to traditional concurrency control based OLTP systems.

Strife moves to the analysis phase of the next batch once the residual phase is complete. Threads could start analyzing the next batch while the residual phase of the previous batch is in progress. However, we did not implement this optimization to simplify the analysis and interpretability of results.
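The no-wait policy used in the residual phase can be realized with a lock word co-located with each record, as described above. The following self-contained C++ sketch shows one possible encoding of such a per-record no-wait reader/writer lock and the retry-until-commit loop; the class and function names are illustrative assumptions, not Strife's actual code.

#include <atomic>
#include <cstdint>
#include <functional>

// Lock word stored in a record's header: 0 = free, WRITER = exclusive,
// any other value = number of concurrent shared readers.
class NoWaitLock {
 public:
  bool TryLockShared() {
    uint32_t v = word_.load(std::memory_order_acquire);
    while (v != WRITER) {
      if (word_.compare_exchange_weak(v, v + 1, std::memory_order_acq_rel))
        return true;             // acquired one shared slot
    }
    return false;                // a writer holds it: abort, never wait
  }
  bool TryLockExclusive() {
    uint32_t expected = 0;
    return word_.compare_exchange_strong(expected, WRITER,
                                         std::memory_order_acq_rel);
  }
  void UnlockShared()    { word_.fetch_sub(1, std::memory_order_release); }
  void UnlockExclusive() { word_.store(0, std::memory_order_release); }

 private:
  static constexpr uint32_t WRITER = 0xFFFFFFFFu;
  std::atomic<uint32_t> word_{0};
};

// Residual execution: a transaction that fails any TryLock releases what
// it already holds, returns false, and is retried until it commits (or
// aborts for a logical reason handled inside try_once).
void RunResidual(const std::function<bool()>& try_once) {
  while (!try_once()) {
    // no-wait abort: retry immediately
  }
}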
Workloads. We use the following benchmarks:
TPC-C [5]: We use a subset of full TPC-C consisting of a 50/50 mixture of new-order and payment transactions, with all payments accessing the customer through primary key lookups. These two transactions make up the vast majority of the benchmark and have been used previously [29] in evaluating high contention workloads to stress concurrency control. Refer to Sec. 5.6.4 for payment transactions that access customer records through last name. In the TPC-C benchmark, 15% of payment transactions are remote and 1% of items are ordered from remote warehouses.
YCSB [4]: For YCSB, each transaction reads/updates 20 records from a single table with 20M keys and a payload size of 128 bytes. We use a workload that is statically partitionable and hence is the best case scenario for such protocols. Transactions are such that intra-partition key access is determined by a Zipfian distribution controlled by the θ parameter. Unless specified otherwise, we use θ = 0.99 to generate a highly contended workload.
HOT: HOT is a microbenchmark on a single table database with 50M records, each having an 8-byte key and twenty 8-byte integer fields. Each transaction updates 10 records from the table. To add contention, we classify some records (default: 100) as hot and the remaining as cold. Every transaction updates one hot record and 9 cold records. We control partitionability of the workload by limiting remote partition accesses to at most 3. By default, hot records are evenly distributed across all partitions.

5.2 Clustering Algorithm
Fig. 5 summarizes the clustering solutions produced by Strife on a batch of 10K transactions, α = 0.2 and different benchmark parameters. Spot Clusters refers to the number of clusters identified in the spot step, CF Clusters denotes the number of conflict-free clusters finally produced, and Residuals denotes the size of the residual cluster.

We present TPC-C results for different numbers of warehouses (denoted w), specifically 4, 15 and 30, in Fig. 5. The spot step correctly identifies 4, 15 and 30 special clusters, one corresponding to each warehouse. TPC-C has around 25% cross-warehouse transactions and hence some of them are classified as residuals.


Fig. 5. Clustering: TPC-C (Varying #Warehouses), YCSB (Varying θ), HOT (Varying #Hot Records)

Description / Workload    TPC-C            YCSB (30 Partitions)           HOT
  Parameter               4    15   30     0.1  0.5  0.8  0.99  1.2       10   50   100  200
(a) Spot Clusters         4    15   30     100  98   78   48    30        10   43   63   84
(b) CF Clusters           4    15   30     100  98   78   30    30        10   43   63   84
(d) Residuals             636  191  94     0    0    298  0     0         350  371  330  250
The same number of CF clusters are produced since put doubles from 2 to 4 cores since conflict-free phase none of them are merged. As we will see in Sec. 5.3, differ- is 2× faster. Beyond 4 cores, improvement in throughput is ent sizes for CF clusters does not impact system throughput only due to increased parallelism in analysis and residual heavily since Strife uses a shared worklist and dynamically phases and conflict-free phase remains the same. With 30 balances load across cores. In summary, the experiments warehouses, Strife increases parallelism in all three phases show that Strife identifies hot records and clusters conflict- to scale almost linearly. Strife throughput on YCSB (30 parti- ing transactions together while distributing hot records to tions) also follows a similar trend. Since YCSB is partitionable as many clusters as possible. without any cross-partition transactions, no residuals are created. Throughput increase from 15 to 25 cores is gradual 5.3 CC protocols in both TPC-C (30 warehouses) and YCSB since the core We now compare Strife against 4 classes of CC protocols: count is insufficient to execute all 30 CF clusters in paral- LockSorted sorts the read-write set and acquires locks in lel. However, with 30 cores Strife exploits all parallelism an ordered fashion before executing the transaction. yielding almost a 2× improvement over 15 cores. DlDetect is a 2-Phase-Locking (2PL) variant which main- A batch in HOT benchmark is clustered into approxi- tains a waits-for graph among transactions and uses it to mately 60 clusters, each containing at least one hot record. detect and eliminate deadlocks by abortion. To avoid scala- conflict-free phase exploits maximum parallelism, thus bility bottlenecks, each thread maintains a local partition of scaling linearly as we increase number of cores since Strife



and releasing locks, so its performance is relatively stable compared to Dora even with partition-induced skews. When the distribution is very skewed, even the CC threads in Orthrus become a bottleneck and performance drops. MOCC is able to detect changing hot items dynamically and is not affected by partition-induced skews since it is oblivious to partitions. Strife spots hot records in a given batch by random sampling and distributes them dynamically across as many CF clusters as possible. This eliminates any partition-induced skews present in other architectures, and enables Strife to outperform all other protocols with stable throughput.

5.6 Microbenchmarks
5.6.1 Clustering Performance. Fig. 9(a) depicts Strife's analysis throughput (transactions analyzed per second) as we vary the number of cores. The Strife algorithm scales almost linearly until 15 cores (within a single NUMA socket). Beyond 15 cores, there is a slight drop in scalability due to cross-socket communication, but throughput still increases. The analysis cost depends on the total number of data accesses in the batch. The depicted performance (YCSB) is for 20 read-write accesses, and the analysis throughput is much better than most main-memory OLTP engines today. This shows that in the Strife execution pipeline, the analysis phase is unlikely to be the bottleneck.

Fig. 9(b) breaks down the time taken by each step during analysis for TPC-C, YCSB-Low (θ = 0.1), YCSB-High (θ = 0.99) and HOT. The prepare step scans through read-write sets, and this is the most expensive step since records are looked up for the first time and brought into cache. In subsequent steps, cached data is reused, and hence they are much faster. The analysis time also depends on the number of distinct data items accessed and cache size. In YCSB-Low, data items are accessed almost uniformly, while YCSB-High accesses fewer distinct data items. So, analysis of a YCSB-Low batch is more expensive than that of YCSB-High. We do not show spot and merge as they are negligible compared to these 3 steps. Our implementation caches the current set of clusters connected to a transaction in a thread-local cache, which further improves fuse and allocate performance.

5.6.2 Residual Parameter. The residual parameter α plays a key role in Strife. It determines (1) the proportion of transactions executed with or without CC and (2) the number of CF clusters. A larger α allows Strife to classify more cross-cluster transactions as residuals, improving the number of CF clusters; but more transactions are executed with CC, reducing the benefits of CC-free execution. A smaller α, on the other hand, may reduce concurrency in the conflict-free phase but fewer transactions are executed with CC.

Fig. 10(a) depicts the number of conflict-free clusters produced for different numbers of warehouses in TPC-C as we vary α. When α is zero, Strife clusters all transactions into a single cluster. However, as we increase α, the number of CF clusters increases and settles at the number of warehouses. Fig. 10(b) depicts the CF clusters produced for various values of θ (which controls contention in the workload) in YCSB with 30 partitions. A smaller θ implies lower contention and hence Strife results in 100 special clusters, which is the maximum value set in our experiments. Note that there is minimal impact of increasing α for the lower θ cases. For larger θ, each CF cluster corresponds to a YCSB partition and hence there is no impact.

5.6.3 Latency and Batch Size. Strife employs a batched execution strategy. Similar to streaming systems [3, 46], batch size poses an important trade-off between throughput and latency. With a larger batch size, the latency is higher, but there is more potential for amortization and hence throughput. When the batch is small, other overheads in the system become more prominent, but with a smaller latency.

We present throughput, end-to-end latency and analysis latency for TPC-C, YCSB and HOT as we vary batch size in Fig. 11 (in log scale). For small batch sizes (1K), synchronization overheads at phase boundaries affect the throughput. For larger batches, throughput is marginally higher due to better amortization and cache locality. Beyond a point, throughput does not increase since the data structures no longer fit in cache. End-to-end latency increases linearly with batch size. Although we can cluster a 1K batch in sub-millisecond latencies, the throughput is sub-optimal. In our experiments we chose a batch size of 10K since it provides a good trade-off between latency (< 5 ms) and throughput.

5.6.4 Cost of Reconnaissance Queries. When read-write sets are not known a priori, Strife uses the OLLP protocol [36]: it issues a read-only, low isolation level query called the reconnaissance query to acquire the entire read/write set as part of the prepare step. Additionally, index lookups are cached in the transaction context for later execution. A particularly common pattern of such transactions are those that perform a secondary index lookup [36]. In full TPC-C, some payment transactions obtain the customer record through the last name, which involves a secondary index lookup. Strife executes these transactions using the OLLP protocol. Fig. 12 depicts the throughput of Strife and other CC protocols on 30 warehouses as we increase the percentage of payment transactions that require a reconnaissance query in a 50/50 mixture of payment and new-order transactions. While there is a marginal decrease in performance for other CC protocols due to secondary index lookups, the decrease is more pronounced in Strife since it performs the lookup twice. At the same time, the additional cost incurred is a single lock-free secondary index lookup, so the impact of the reconnaissance query is minimal (at most 9%) and Strife still outperforms other CC protocols on high contention workloads.
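To make the reconnaissance pattern concrete, the sketch below runs a transaction's logic once in a collect-only mode to learn its read/write set (including keys resolved through secondary-index lookups), and later checks whether that set still matches at execution time; every name here (AccessSet, Reconnoiter, StillValid) is a hypothetical illustration of the OLLP-style flow, not Strife's implementation.

#include <cstdint>
#include <functional>
#include <set>

// Read/write set collected by a reconnaissance (read-only) pass.
struct AccessSet {
  std::set<uint64_t> reads;
  std::set<uint64_t> writes;
};

// Transaction logic parameterized by a recorder: in the reconnaissance
// pass it only declares the keys it would touch; nothing is modified.
using TxnLogic = std::function<void(AccessSet&)>;

AccessSet Reconnoiter(const TxnLogic& logic) {
  AccessSet rw;
  logic(rw);        // low-isolation, read-only evaluation
  return rw;
}

// If the keys touched at execution time differ from the reconnaissance
// result (e.g., a last-name lookup now resolves to a different customer),
// the read/write set must be re-collected before the transaction proceeds.
bool StillValid(const AccessSet& planned, const AccessSet& actual) {
  return planned.reads == actual.reads && planned.writes == actual.writes;
}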



tracking [45] or reusing partial executions of aborted transactions [7] have addressed this partially. MOCC [39], the state-of-the-art for high contention, combines pessimistic locking with optimistic validation-based execution. There are no aborts due to conflicts in Strife.

Partitioned and Deterministic Databases. H-Store [13, 15] and Hyper [17] yield state-of-the-art performance on partitionable high contention workloads but are plagued by cross-partition transactions [43]. Appuswamy et al. [1] advocate for fine-grained partitioning to avoid skew induced by static partitions, which Strife eliminates while dynamically adapting to the changing hot set. Calvin [31, 36] acquires locks in a deterministic order on a single thread, while Strife is free from such concurrency bottlenecks and mostly avoids locks.

Alternate Architectures. Orthrus [29] delegates CC to dedicated threads and uses message passing to synchronize access. Dora [1, 24] adopts thread-to-data assignment on a partitioned database; transactions flow from one thread to another acquiring/releasing locks. Both suffer on update-intensive workloads due to longer lock-hold times, worsened by cross-partition transactions and skews [1]. Strife does not use locks, adopts fine-grained partitioning, and avoids skews. Tang et al. [35] propose adaptive concurrency control (ACC) that clusters data and chooses the optimal CC for each cluster using a machine learning model. Both of these combine existing CC protocols to yield better performance on a mixed workload. This is unlike Strife, which optimizes for and performs better than each of these CC protocols under high contention.

Automatic Database Partitioning. Schism [6] partitions the co-access graph among data items and uses the partition to infer partition predicates statically; Horticulture [25] automatically partitions a database using local neighborhood search on the design space with an analytical cost model. While it seeks to reduce partition-induced skews and cross-partition transactions similar to Strife, its solution space is limited to hash/range partitioning. Sword [28] is an extension of Schism that uses hyper-graph compression to partition a database, also with incremental repartitioning. All these approaches are designed for offline computation and are not efficient enough for dynamic clustering in Strife.

Transaction Decomposition. Callas [41] partitions a transaction into groups that are executed independently under different CC protocols. IC3 [40] tracks run time dependencies, allowing transactions to be chopped into finer pieces than possible statically. LADS [42] breaks a transaction into a set of record actions using their semantics. Overall, decomposing a transaction into smaller pieces can help reduce contention by reducing the effective lock hold time on hot data items. LADS adopts a similar graph-based partitioning strategy on batches of transactions as Strife, but there are several key differences: LADS does not entirely eliminate CC during execution since it synchronizes on logical dependencies between record actions of a transaction. A user-initiated abort will lead to cascading aborts in LADS due to speculative execution of record actions. Further, the graph partitioning algorithm in LADS initially assumes a range-based partition and iteratively removes cross-edges by relocating records. This does not perform well on workloads not suitable for range-based partitioning, such as the HOT benchmark.

7 CONCLUSIONS
We presented Strife, a transaction processing protocol for high-contention workloads. Strife identifies hot records in the database during runtime, clusters transactions that access them together, and executes them without concurrency control. Strife achieves this with an algorithm that exploits random sampling and efficient union-find data structures to cluster thousands of transactions in milliseconds. Our experiments have shown that Strife can achieve substantial performance improvement over conventional CC protocols; while it incurs about 50% overhead over partitioned systems on a statically partitionable workload, Strife can improve performance by up to 2× when such static partitioning is not viable and when the workload changes over time.

ACKNOWLEDGEMENTS
This work is supported in part by the NSF grants IIS-1907997, AITF-1535565, IIS-1546083, IIS-1651489; the Intel-NSF CAPA Center; DOE award DE-SC0016260; and gifts from Adobe and Google. We also thank Xiangyao Yu and the anonymous reviewers for their valuable comments and feedback.


REFERENCES
[1] Raja Appuswamy, Angelos C. Anadiotis, Danica Porobic, Mustafa K. Iman, and Anastasia Ailamaki. 2017. Analyzing the Impact of System Architecture on the Scalability of OLTP Engines for High-contention Workloads. Proc. VLDB Endow. 11, 2 (Oct. 2017), 121–134. https://doi.org/10.14778/3149193.3149194
[2] Michael S Bernstein, Andrés Monroy-Hernández, Drew Harry, Paul André, Katrina Panovich, and Gregory G Vargas. 2011. 4chan and /b/: An Analysis of Anonymity and Ephemerality in a Large Online Community. In ICWSM.
[3] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412. https://doi.org/10.14778/2735496.2735503
[4] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC ’10). ACM, New York, NY, USA, 143–154. https://doi.org/10.1145/1807128.1807152
[5] T. P. P. Council. 2018. TPC-C Standard Specification. http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf. [Online; accessed 19-Dec-2017].
[6] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism: A Workload-driven Approach to Database Replication and Partitioning. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 48–57. https://doi.org/10.14778/1920841.1920853
[7] Mohammad Dashti, Sachin Basil John, Amir Shaikhha, and Christoph Koch. 2017. Transaction Repair for Multi-Version Concurrency Control. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). ACM, New York, NY, USA, 235–250. https://doi.org/10.1145/3035918.3035919
[8] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server’s Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 1243–1254. https://doi.org/10.1145/2463676.2463710
[9] Michael T. Heath and Padma Raghavan. 1995. A Cartesian Parallel Nested Dissection Algorithm. SIAM J. Matrix Anal. Appl. 16, 1 (Jan. 1995), 235–253. https://doi.org/10.1137/S0895479892238270
[10] Bruce Hendrickson and Robert Leland. 1995. A Multilevel Algorithm for Partitioning Graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (Supercomputing ’95). ACM, New York, NY, USA, Article 28. https://doi.org/10.1145/224170.224228
[11] Takashi Horikawa. 2013. Latch-free Data Structures for DBMS: Design, Implementation, and Evaluation. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 409–420. https://doi.org/10.1145/2463676.2463720
[12] Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. 2009. Improving OLTP Scalability Using Speculative Lock Inheritance. Proc. VLDB Endow. 2, 1 (Aug. 2009), 479–489. https://doi.org/10.14778/1687627.1687682
[13] Evan P. C. Jones, Daniel J. Abadi, and Samuel Madden. 2010. Low Overhead Concurrency Control for Partitioned Main Memory Databases. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10). ACM, New York, NY, USA, 603–614. https://doi.org/10.1145/1807167.1807233
[14] Hyungsoo Jung, Hyuck Han, Alan D. Fekete, Gernot Heiser, and Heon Y. Yeom. 2013. A Scalable Lock Manager for Multicores. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 73–84. https://doi.org/10.1145/2463676.2465271
[15] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. 2008. H-store: A High-performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1496–1499. https://doi.org/10.14778/1454159.1454211
[16] George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20, 1 (Dec. 1998), 359–392. https://doi.org/10.1137/S1064827595287997
[17] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE ’11). IEEE Computer Society, Washington, DC, USA, 195–206. https://doi.org/10.1109/ICDE.2011.5767867
[18] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1675–1687. https://doi.org/10.1145/2882903.2882905
[19] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 691–706. https://doi.org/10.1145/2723372.2746480
[20] H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (June 1981), 213–226. https://doi.org/10.1145/319566.319567
[21] Dominique LaSalle, Md Mostofa Ali Patwary, Nadathur Satish, Narayanan Sundaram, Pradeep Dubey, and George Karypis. 2015. Improving Graph Partitioning for Modern Graphs and Architectures. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3 ’15). ACM, New York, NY, USA, Article 14, 4 pages. https://doi.org/10.1145/2833179.2833188
[22] Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. 2014. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys ’14). ACM, New York, NY, USA, Article 27, 14 pages. https://doi.org/10.1145/2592798.2592820
[23] Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. 1993. Automatic Mesh Partitioning. In Graph Theory and Sparse Matrix Computation, Alan George, John R. Gilbert, and Joseph W. H. Liu (Eds.). Springer New York, New York, NY, 57–84.
[24] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and Anastasia Ailamaki. 2010. Data-oriented Transaction Execution. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 928–939. https://doi.org/10.14778/1920841.1920959
[25] Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems. In SIGMOD ’12: Proceedings of the 2012 International Conference on Management of Data. 61–72. http://hstore.cs.brown.edu/papers/hstore-partitioning.pdf
[26] Andrew Pavlo, Evan P. C. Jones, and Stanley Zdonik. 2011. On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems. Proc. VLDB Endow. 5, 2 (Oct. 2011), 85–96. https://doi.org/10.14778/2078324.2078325
[27] Alex Pothen, Horst D. Simon, and Kan-Pu Liou. 1990. Partitioning Sparse Matrices with Eigenvectors of Graphs. SIAM J. Matrix Anal. Appl. 11, 3 (May 1990), 430–452. https://doi.org/10.1137/0611030


[28] Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. 2013. SWORD: Scalable Workload-aware Data Placement for Transactional Workloads. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT ’13). ACM, New York, NY, USA, 430–441. https://doi.org/10.1145/2452376.2452427
[29] Kun Ren, Jose M. Faleiro, and Daniel J. Abadi. 2016. Design Principles for Scaling Multi-core OLTP Under High Contention. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1583–1598. https://doi.org/10.1145/2882903.2882958
[30] Kun Ren, Alexander Thomson, and Daniel J. Abadi. 2013. Lightweight Locking for Main Memory Database Systems. In Proceedings of the 39th International Conference on Very Large Data Bases (PVLDB ’13). VLDB Endowment, 145–156. http://dl.acm.org/citation.cfm?id=2448936.2448947
[31] Kun Ren, Alexander Thomson, and Daniel J. Abadi. 2014. An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems. Proc. VLDB Endow. 7, 10 (June 2014), 821–832. https://doi.org/10.14778/2732951.2732955
[32] Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, and Kenneth A. Ross. 2014. Reducing Database Locking Contention Through Multi-version Concurrency. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1331–1342. https://doi.org/10.14778/2733004.2733006
[33] Robert Sedgewick and Kevin Wayne. 2011. Algorithms (4th ed.). Addison-Wesley Professional.
[34] Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE ’05). IEEE Computer Society, Washington, DC, USA, 2–11. https://doi.org/10.1109/ICDE.2005.1
[35] Dixin Tang, Hao Jiang, and Aaron J. Elmore. 2017. Adaptive Concurrency Control: Despite the Looking Glass, One Concurrency Control Does Not Fit All. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. http://cidrdb.org/cidr2017/papers/p63-tang-cidr17.pdf
[36] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2213836.2213838
[37] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA, 18–32. https://doi.org/10.1145/2517349.2522713
[38] Tianzheng Wang, Milind Chabbi, and Hideaki Kimura. 2016. Be My Guest: MCS Lock Now Welcomes Guests. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’16). ACM, New York, NY, USA, Article 21, 12 pages. https://doi.org/10.1145/2851141.2851160
[39] Tianzheng Wang and Hideaki Kimura. 2016. Mostly-optimistic Concurrency Control for Highly Contended Dynamic Workloads on a Thousand Cores. Proc. VLDB Endow. 10, 2 (Oct. 2016), 49–60. https://doi.org/10.14778/3015274.3015276
[40] Zhaoguo Wang, Shuai Mu, Yang Cui, Han Yi, Haibo Chen, and Jinyang Li. 2016. Scaling Multicore Databases via Constrained Parallel Execution. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1643–1658. https://doi.org/10.1145/2882903.2882934
[41] Chao Xie, Chunzhi Su, Cody Littley, Lorenzo Alvisi, Manos Kapritsos, and Yang Wang. 2015. High-performance ACID via Modular Concurrency Control. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ’15). ACM, New York, NY, USA, 279–294. https://doi.org/10.1145/2815400.2815430
[42] Chang Yao, Divyakant Agrawal, Gang Chen, Qian Lin, Beng Chin Ooi, Weng-Fai Wong, and Meihui Zhang. 2016. Exploiting Single-Threaded Model in Multi-Core In-Memory Systems. IEEE Trans. on Knowl. and Data Eng. 28, 10 (Oct. 2016), 2635–2650. https://doi.org/10.1109/TKDE.2016.2578319
[43] Xiangyao Yu, George Bezerra, Andrew Pavlo, Srinivas Devadas, and Michael Stonebraker. 2014. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. Proc. VLDB Endow. 8, 3 (Nov. 2014), 209–220. https://doi.org/10.14778/2735508.2735511
[44] Xiangyao Yu, Andrew Pavlo, Daniel Sanchez, and Srinivas Devadas. 2016. TicToc: Time Traveling Optimistic Concurrency Control. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1629–1642. https://doi.org/10.1145/2882903.2882935
[45] Yuan Yuan, Kaibo Wang, Rubao Lee, Xiaoning Ding, Jing Xing, Spyros Blanas, and Xiaodong Zhang. 2016. BCC: Reducing False Aborts in Optimistic Concurrency Control with Low Cost for In-memory Databases. Proc. VLDB Endow. 9, 6 (Jan. 2016), 504–515. https://doi.org/10.14778/2904121.2904126
[46] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA, 423–438. https://doi.org/10.1145/2517349.2522737
