
The End of a Myth: Distributed Transactions Can Scale

Erfan Zamanian, Brown University, [email protected]
Carsten Binnig, Brown University, carsten [email protected]
Tim Harris, Oracle Labs, [email protected]
Tim Kraska, Brown University, tim [email protected]

ABSTRACT

The common wisdom is that distributed transactions do not scale. But what if distributed transactions could be made scalable using the next generation of networks and a redesign of distributed databases? There would no longer be a need for developers to worry about co-partitioning schemes to achieve decent performance. Application development would become easier as data placement would no longer determine how scalable an application is. Hardware provisioning would be simplified as the system administrator can expect a linear scale-out when adding more machines rather than some complex sub-linear function, which is highly application specific.

In this paper, we present the design of our novel scalable database system NAM-DB and show that distributed transactions with the very common Snapshot Isolation guarantee can indeed scale using the next generation of RDMA-enabled network technology without any inherent bottlenecks. Our experiments with the TPC-C benchmark show that our system scales linearly to over 6.5 million new-order (14.5 million total) distributed transactions per second on 56 machines.

1 Introduction

The common wisdom is that distributed transactions do not scale [40, 22, 39, 12, 37]. As a result, many techniques have been proposed to avoid distributed transactions, ranging from locality-aware partitioning [35, 33, 12, 43] and speculative execution [32] to new consistency levels [24] and the relaxation of durability guarantees [25]. Even worse, most of these techniques are not transparent to the developer. Instead, the developer not only has to understand all the implications of these techniques, but also must carefully design the application to take advantage of them. For example, Oracle requires the user to carefully specify the co-location of data using special SQL constructs [15]. A similar feature was also recently introduced in Azure SQL Server [2]. This works well as long as all queries are able to respect the partitioning scheme. However, transactions crossing partitions usually observe a much higher abort rate and relatively unpredictable performance [9]. For other applications (e.g., social apps), a developer might not even be able to design a proper sharding scheme since those applications are notoriously hard to partition.

But what if distributed transactions could be made scalable using the next generation of networks and we could rethink the distributed database design? What if we would treat every transaction as a distributed transaction? The performance of the system would become more predictable. The developer would no longer need to worry about co-partitioning schemes in order to achieve scalability and decent performance. The system would scale out linearly when adding more machines rather than sub-linearly because of partitioning effects, making it much easier to provision how much hardware is needed.

Would this make co-partitioning obsolete? Probably not, but its importance would significantly change. Instead of being a necessity to achieve a scalable system, it becomes a second-class design consideration in order to improve the performance of a few selected queries, similar to how creating an index can help a selected class of queries.

In this paper, we will show that distributed transactions with the common Snapshot Isolation scheme [8] can indeed scale using the next generation of RDMA-enabled networking technology without an inherent bottleneck other than the workload itself. With Remote Direct Memory Access (RDMA), it is possible to bypass the CPU when transferring data from one machine to another. Moreover, as our previous work [10] showed, the current generation of RDMA-capable networks, such as InfiniBand FDR 4x, is already able to provide a bandwidth similar to the aggregated memory bandwidth between a CPU socket and its attached RAM. Both of these aspects are key requirements to make distributed transactions truly scalable. However, as we will show, the next generation of networks does not automatically yield scalability without redesigning distributed databases. In fact, when keeping the "old" architecture, the performance can sometimes even decrease when simply migrating a traditional database from an Ethernet network to a high-bandwidth InfiniBand network using protocols such as IP over InfiniBand [10].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 10, No. 6. Copyright 2017 VLDB Endowment 2150-8097/17/02.

1.1 Why Distributed Transactions are Considered Not Scalable

To value the contribution of this paper, it is important to understand why distributed transactions are considered not scalable.

One of the most cited reasons is the increased contention likelihood. However, contention is only a side effect. Perhaps surprisingly, in [10] we showed that the most important factor is the CPU overhead of the TCP/IP stack. It is not uncommon that the CPU spends most of the time processing network messages, leaving little room for the actual work.

Additionally, the network bandwidth also significantly limits the transaction throughput. Even if transaction messages are relatively small, the aggregated bandwidth required to handle thousands to millions of distributed transactions is high [10], causing the network bandwidth to quickly become a bottleneck, even in small clusters. For example, assume a cluster of three servers connected by a 10Gbps Ethernet network. With an average record size of 1KB, and transactions reading and updating three records on all three machines (i.e., one per machine), 6KB has to be shipped over the network per transaction, resulting in a maximal overall throughput of ∼29k distributed transactions per second.

Furthermore, because of the high CPU overhead of the TCP/IP stack and the limited network bandwidth of typical 1/10Gbps Ethernet networks, distributed transactions have much higher latency, significantly higher than even the message delay between machines. This causes the commonly observed high abort rates due to time-outs and the increased contention likelihood; a side-effect rather than the root cause.

Needless to say, there are workloads for which contention is the primary reason why distributed transactions are inherently not scalable. For example, if every single transaction updates the same item (e.g., incrementing a shared counter), the workload is not scalable simply because of the existence of a single serialization point. In this case, avoiding the additional network latencies for distributed message processing would help to achieve a higher throughput, but not to make the system ultimately scalable. Fortunately, in many of these "bottleneck" situations, the application itself can easily be changed to make it truly scalable [1, 5].

1.2 The Need for a System Redesign

Assuming a scalable workload, the next generation of networks removes the two dominant limiting factors for scalable distributed transactions: the network bandwidth and the CPU overhead. Yet, it is wrong to assume that the hardware alone solves the problem. In order to avoid the CPU message overhead with RDMA, many data structures have to change. In fact, RDMA-enabled networks change the architecture to a hybrid shared-memory and message-passing architecture: it is neither a distributed shared-memory system (as several address spaces exist and there is no cache-coherence protocol), nor is it a pure message-passing system, since the memory of a remote machine can be directly accessed via RDMA reads and writes.

While there has been work on leveraging RDMA for distributed transactions, most notably FaRM [13, 14], most works still rely on locality and more traditional message transfers, whereas we believe locality should be a second-class design consideration. Even more importantly, the focus of existing works is not on leveraging fast networks to achieve a truly scalable design for distributed databases, which is our main contribution. Furthermore, our proposed system shows how to leverage RDMA for Snapshot Isolation (SI) guarantees, which is the most common transaction guarantee in practice [18] because it allows for long-running read-only queries without expensive read-set validations. Other RDMA-based systems focus instead on serializability [14] or do not have transaction support at all [20]. At the same time, existing (distributed) SI schemes typically rely on a single global snapshot counter or timestamp; a fundamental issue obstructing scalability.

1.3 Contribution and Outline

In our vision paper [10], we made the case for a shift in the way transactional and analytical database systems must be designed and showed the potential of efficiently leveraging high-speed networks and RDMA. In this paper, we follow up on this vision and present and evaluate one of the first transactional systems for high-speed networks and RDMA. In summary, we make the following main contributions: (1) We present the full design of a truly scalable system called NAM-DB and propose scalable algorithms specifically for Snapshot Isolation (SI) with (mainly one-sided) RDMA operations. In contrast to our initial prototype [10], the presented design imposes far fewer restrictions on workloads, supports index-based range requests, and efficiently executes long-running read transactions by storing more than one version per record. (2) We present a novel RDMA-based and scalable global counter technique which allows for efficiently reading the latest consistent snapshot in a distributed SI-based protocol. (3) We show that NAM-DB is truly scalable using a full implementation of TPC-C. Most notably, for the standard configuration of the TPC-C benchmark, we show that our system scales linearly to over 3.6 million transactions per second on 56 machines, and to 6.5 million transactions per second with locality optimizations, which is 2 million more transactions per second than what FaRM [14] achieves on 90 machines. Note that our total transaction throughput is even higher (14.5 million transactions per second), as TPC-C specifies to report only the new-order transactions.

2 System Overview

InfiniBand offers two network communication stacks: IP over InfiniBand (IPoIB) and remote direct memory access (RDMA). IPoIB implements a classic TCP/IP stack over InfiniBand, allowing existing database systems to run on fast networks without any modifications. While IPoIB provides an easy migration path from Ethernet to InfiniBand, it cannot fully leverage the network's capabilities and sometimes even degrades system performance [10]. On the other hand, RDMA provides a verbs API, which enables remote data transfer using the processing capabilities of an RDMA NIC (RNIC), but also requires a radical system redesign. When using RDMA verbs, most of the processing is executed by the RNIC without OS involvement, which is essential for achieving low latencies. The verbs API has two operation types: one-sided verbs (read, write, and atomic operations), where only the CPU of the initiating node is actively involved in the communication, and two-sided verbs (send and receive), where the CPUs of both nodes are involved (for more details see [10]). Redesigning distributed databases to efficiently make use of RDMA is a key challenge that we tackle in this paper.

In the following, we first give a brief overview of the network-attached-memory (NAM) architecture [10] that was designed

Figure 1: The NAM Architecture (compute servers, each with a CPU and a local RAM buffer, access the RAM of the memory servers, which hold the database partitions DB 1-4, via RDMA reads and writes)

to efficiently make use of RDMA. We then discuss the core design principles of NAM-DB, which builds upon NAM, to enable a scalable transactional system without an inherent bottleneck other than the workload itself.

2.1 The NAM Architecture

The NAM architecture logically decouples compute and storage nodes and uses RDMA for communication between all nodes, as shown in Figure 1. The idea is that memory servers provide a shared distributed memory pool that holds all the data, which can be accessed via RDMA from compute servers that execute transactions. This design already highlights that locality is a tuning parameter. In contrast to traditional architectures, which physically co-locate the transaction execution with the storage location from the beginning as much as possible, the NAM architecture separates them. As a result, all transactions are by default distributed transactions. However, we allow users to add locality as an optimization, like an index, as we will explain in Section 6.

Memory Servers: In a NAM architecture, memory servers hold all data of a database system such as tables and indexes, as well as all other state for transaction execution (e.g., logs and metadata). From the transaction execution perspective, memory servers are "dumb" since they provide only memory capacity to compute servers. However, memory servers still have important tasks such as memory management to handle remote memory allocation calls from compute servers, as well as garbage collection to ensure that enough free space is always available for compute servers, e.g. to insert new records. Durability of the data stored by memory servers is achieved in a similar way as described in [14] by using an uninterruptible power supply (UPS). When a power failure occurs, memory servers use the UPS to persist a consistent snapshot to disks. Hardware failures, on the other hand, are handled through replication as discussed in Section 6.2.

Compute Servers: The main task of compute servers is to execute transactions over the data items stored in the memory servers. This includes finding the storage location of records on memory servers, inserting/modifying/deleting records, as well as committing or aborting transactions. Moreover, compute servers are in charge of performing other tasks which are required to ensure that transaction execution fulfills all ACID properties, such as logging and consistency control. Again, the strict separation of transaction execution in compute servers from managing the transaction state stored in memory servers is what distinguishes our design from traditional distributed database systems. As a result, the performance of the system is independent of the location of the data.

2.2 Design Principles

We now describe the challenges in designing NAM-DB to achieve a scalable system design using the NAM architecture.

Separation of Compute and Memory: Although the separation of compute and storage is not new [11, 27, 29, 14], existing database systems that follow this design typically push data access operations into the storage layer. When scaling out the compute servers and pushing data access operations from multiple compute servers into the same memory server, memory servers are likely to become a bottleneck. Even worse, with traditional socket-based network operations, every message consumes additional precious CPU cycles.

In a NAM architecture, we therefore follow a different route. Instead of pushing data access operations into the storage layer, the memory servers provide fine-grained byte-level data access. In order to avoid any unnecessary CPU cycles for message handling, compute servers exploit one-sided RDMA operations as much as possible. This makes the NAM architecture highly scalable since all computation required to execute transactions can be farmed out to compute servers.

Finally, for the cases where the aggregated main memory bandwidth is the main bottleneck, this architecture also allows us to increase the bandwidth by scaling out the memory servers. It is interesting to note that most modern InfiniBand switches, such as the Mellanox SX6506 108-port FDR (56Gb/s) InfiniBand switch, are provisioned in a way that they allow the full duplex transfer rate across all machines at the same time and therefore do not limit the scalability.

Data Location Independence: Another design principle is that compute servers in a NAM architecture are able to access any data item independent of its storage location (i.e., independent of which memory server the item is stored on). As a result, the NAM architecture can easily move data items to new storage locations (as discussed before). Moreover, since every compute server can access any data item, we can also implement work-stealing techniques for distributed load balancing, since any compute node can execute a transaction independent of the storage location of the data.

This does not mean that compute servers cannot exploit data locality if the compute server and the memory server run on the same physical machine. However, exploiting locality becomes just an optimization that can be added on top of our scalable system, like an index, not a first-class design decision.

Partitionable Data Structures: As discussed before, in the NAM architecture every compute server should be able to execute any functionality by accessing the externalized state on the memory servers. However, this does not prevent a single memory region (e.g., a global read or commit timestamp) from becoming a bottleneck. Therefore, it is important that every data structure is partitionable. For instance, following this design principle, we invented a new data structure to implement a partitionable read/commit timestamp (see Section 4).

3 The Basic SI-Protocol

In this section, we first describe our Snapshot Isolation protocol for the NAM architecture as already introduced in our vision paper [10]. Afterwards, we analyze potential scalability bottlenecks for distributed transactions, which we then address in Sections 4-6 as the main contributions of this paper.
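The division of labor between "dumb" memory servers and transaction-executing compute servers can be sketched in a few lines. This is a toy single-process model, not NAM-DB code: `MemoryServer`, `ComputeServer`, and the byte-level read/write methods are our own names, standing in for one-sided RDMA verbs.

```python
class MemoryServer:
    """Holds raw bytes; performs no transaction logic ("dumb" storage)."""
    def __init__(self, size: int):
        self.mem = bytearray(size)

class ComputeServer:
    """Executes transactions; accesses remote memory directly by offset."""
    def __init__(self, memory_servers):
        self.memory_servers = memory_servers

    def rdma_read(self, server_id: int, offset: int, length: int) -> bytes:
        # analogue of a one-sided RDMA read: no memory-server CPU involved
        mem = self.memory_servers[server_id].mem
        return bytes(mem[offset:offset + length])

    def rdma_write(self, server_id: int, offset: int, payload: bytes) -> None:
        # analogue of a one-sided RDMA write
        mem = self.memory_servers[server_id].mem
        mem[offset:offset + len(payload)] = payload

# Any compute server can reach any data item, regardless of where it lives:
servers = [MemoryServer(1024) for _ in range(4)]
c1, c2 = ComputeServer(servers), ComputeServer(servers)
c1.rdma_write(2, 100, b"Bob")
assert c2.rdma_read(2, 100, 3) == b"Bob"   # location-independent access
```

Because every compute server holds only a handle to the shared pool, work stealing and data movement reduce to changing which `(server_id, offset)` a catalog hands out.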

Figure 2: Naive RDMA-based SI-Protocol (the compute server (1) reads the read timestamp rts, (2) reads the required records, e.g. customer Bob with header (3,0), (3) fetch-and-adds the commit timestamp cts, (4) compare-and-swaps the record headers to set the lock bits, (5) writes the new record versions, and (6) sends the commit outcome to the ctsList of the timestamp oracle)

Listing 1: Transaction Execution in a Compute Server

 1 runTransaction(Transaction t) {
 2   // get read timestamp
 3   rts = RDMA_Read(&r(rts));
 4   // build read- and write-set
 5   t.execute(rts);
 6   // get commit timestamp
 7   cts = RDMA_FetchAndAdd(&r(cts), 1);
 8
 9   // verify write version and lock write-set
10   commit = true;
11   parfor i in size(t.writeSet) {
12     header = t.readSet[i].header;
13     success[i] = RDMA_CompAndSwap(&r(header), header, setLockBit(header));
14     commit = commit && success[i];
15   }
16
17   // install write-set
18   if(commit) {
19     parfor i in size(t.writeSet)
20       RDMA_Write(&r(t.readSet[i]), t.writeSet[i]);
21   }
22   // otherwise reset lock bits
23   else {
24     parfor i in size(t.writeSet) {
25       if(success[i]) {
26         header = t.readSet[i].header;
27         RDMA_Write(&r(header), header); }
28     }
29   }
30   RDMA_Send([cts,commit]); // append cts and result to ctsList
31 }

3.1 A Naive RDMA Implementation

With Snapshot Isolation (SI), a transaction reads the most recent (consistent) snapshot of a database that was committed before the beginning of the transaction, and it does not see concurrent updates. Furthermore, a transaction only aborts if it would overwrite a concurrent change. For distributed systems, Generalized SI (GSI) [16] is more common as it allows any committed snapshot (and not only the most recent one) to be read. While we also use GSI for NAM-DB, our goal is still to read a recent committed snapshot in order to avoid high abort rates.

Listing 1 and the corresponding Figure 2 show a naive SI protocol using only one-sided RDMA operations that is based on a global timestamp oracle as implemented in commercial systems [9]. To better focus on the interaction between compute and memory servers, we make the following simplifying assumptions: First, we do not consider the durability guarantee and the recovery of transactions. Second, we assume that there is an up-to-date catalog service which helps compute servers to find the remote address of a data item in the pool of memory servers; the remote address is returned by the &r operator in our pseudocode. Finally, we consider a simple variant of SI where only one version is kept around for each record. Note that these assumptions are only made in this section and we tackle each one later in this paper.

For executing a transaction, the compute server first fetches the read timestamp rts using an RDMA read (step 1 in Figure 2, line 3 in Listing 1). The rts defines a valid snapshot for the transaction. Afterwards, the compute server executes the transaction, which means that the required records are read remotely from the memory servers using RDMA read operations (e.g., the record with ckey = 3 in the example) and updates are applied locally to these records; i.e., the transaction builds its read- and write-set (step 2 in Figure 2, line 5 in Listing 1). Once the transaction has built its read- and write-set, the compute server starts the commit phase.

For committing, a compute server fetches a unique commit timestamp (cts) from the memory server (step 3 in Figure 2, line 7 in Listing 1). In the naive protocol, fetching a unique cts counter is implemented using an atomic RDMA fetch-and-add operation that returns the current counter and increments it in the memory server by 1. Afterwards, the compute server verifies and locks all records in its write-set on the memory servers using RDMA compare-and-swap operations (lines 10-15 in Listing 1). The main idea is that each record stores a header that contains a version number and a lock bit in an 8-byte memory region. For example, in Figure 2, (3,0) stands for version 3 and lock-bit 0 (0 means not locked). The idea of the compare-and-swap operation is that the compute server compares the version in its read-set to the version installed on the memory server for equality and checks that the lock-bit is set to 0. If the compare succeeds, the atomic operation swaps the lock bit to 1 (step 4 in Figure 2, line 13 in Listing 1).

If the compare-and-swap succeeds for all records in the write-set, the compute server installs its write-set using RDMA writes (lines 19-20 in Listing 1). These RDMA writes update the entire record, including updating the header to install the new version and setting the lock-bit back to 0. For example, (6,0) is remotely written to the header in our example (step 5 in Figure 2). If the transaction fails, the locks are simply reset again using RDMA writes (lines 24-28 in Listing 1).

At the end of the commit phase, the compute server appends the outcome of the transaction (commit or abort) as well as the commit timestamp cts to a list (ctsList) in the memory server (step 6 in Figure 2, line 30 in Listing 1). Appending this information can be implemented in different ways. However, for our naive implementation we simply use an unsignaled RDMA send operation; i.e., the compute server does not need to wait for the cts to actually be sent to the server. Every timestamp is given a fixed position (i.e., timestamp value minus an offset) at which a single bit is set to indicate the success of a transaction. This is possible because the fetch-and-add operation creates continuous timestamp values.

Finally, the timestamp oracle is responsible for advancing the read timestamp by scanning the queue of completed transactions. It therefore scans ctsList and tries to find the highest commit timestamp (i.e., highest bit) such that every transaction before that timestamp is also committed (i.e., all bits are set). Since advancing the read timestamp is not in the critical path, the oracle uses a single thread that continuously scans the memory region to find the highest commit timestamp, and also adjusts the offset if the servers run out of space.
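The commit phase of Listing 1 (verify and lock via compare-and-swap, then install or reset) can be mimicked in a self-contained sketch. All names here are ours, and a process-local lock stands in for the atomicity the RNIC provides; this illustrates the protocol's logic, not NAM-DB code.

```python
import threading

class Record:
    def __init__(self, version, value):
        self.header = (version, 0)       # (version, lock_bit)
        self.value = value

_atomic = threading.Lock()               # stands in for RNIC atomicity

def comp_and_swap(record, expected, new):
    # analogue of RDMA_CompAndSwap: swap only if header matches expected
    with _atomic:
        if record.header == expected:
            record.header = new
            return True
        return False

def commit(write_set, cts):
    """write_set: list of (record, version_read, new_value) tuples."""
    locked = []
    for rec, v_read, _ in write_set:      # verify versions and set lock bits
        if comp_and_swap(rec, (v_read, 0), (v_read, 1)):
            locked.append(rec)
        else:                             # stale version or concurrent lock
            for r in locked:              # reset lock bits, then abort
                r.header = (r.header[0], 0)
            return False
    for rec, _, new_value in write_set:   # install write-set and unlock
        rec.value = new_value
        rec.header = (cts, 0)
    return True

bob = Record(version=3, value={"cname": "Bob", "cage": 32})
assert commit([(bob, 3, {"cname": "Bob", "cage": 33})], cts=6)
assert bob.header == (6, 0)                         # new version installed
assert not commit([(bob, 3, {"cname": "Bob"})], cts=7)  # stale read: abort
```

The second `commit` fails exactly as in the paper's example: the reader saw version 3, but the header now holds version 6, so the compare-and-swap rejects the lock attempt.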

3.2 Open Problems and Challenges

While the previously-described protocol achieves some of our goals (e.g., it heavily uses one-sided RDMA to access the memory servers), it is still not scalable. The main reason is that global timestamps have inherent scalability issues [17], which are emphasized further in a distributed setting.

First, for every transaction, each compute server uses an RDMA atomic fetch-and-add operation on the same memory region to get a unique timestamp. Obviously, atomic RDMA operations to the same memory location scale poorly with the number of concurrent operations since the network card uses internal latches for the accessed memory locations. In addition, the oracle is involved in message passing and executes a timestamp management thread that needs to handle the result of each transaction and advance the read timestamp. Although this overhead is negligible for a few thousand transactions, it shows its negative impact when millions of transactions are running per second.

The second problem with the naive protocol is that it likely results in high abort rates. A transaction's snapshot can be "stale" if the timestamp management thread cannot keep up with advancing the read timestamp. Thus, there could be many committed snapshots which are not included in the snapshot read by a transaction. The problem is even worse for hot spots, i.e. records which are accessed and modified frequently.

A third problem is that slow workers also contribute to a high abort rate by holding back the most recent snapshot from getting updated. In fact, the oracle moves the read timestamp rts forward only as fast as the slowest worker. Note that long transactions have the same effect as slow workers.

Finally, the last problem is that the naive implementation does not have any support for fault-tolerance. In general, fault-tolerance in a NAM architecture is quite different (and arguably less straight-forward) than in the traditional architecture. The reason is that the transactions' read- and write-sets (including the requested locks) are managed directly by the compute servers. Therefore, a failure of a compute server could potentially result in undetermined transactions and thus abandoned locks. Even worse, in our naive implementation, a compute server that fails can lead to "holes" in the ctsList that cause the read timestamp to not advance anymore.

4 Timestamp Oracle

In this section, we first describe how to tackle the issues outlined in Section 3.2 that hinder the scalability of the timestamp oracle as described in our naive SI-protocol implementation. Afterwards, we discuss some optimizations.

4.1 Scalable Timestamp Generation

The main issues with the current implementation of the timestamp oracle are: (1) The timestamp management thread that runs on a memory server does not scale well with the number of transactions in the system. (2) Long-running transactions and slow compute servers prevent the oracle from advancing the read timestamp, further contributing to the problem of too many aborts. (3) RDMA atomic operations incur high synchronization costs when accessing the commit timestamp cts, which is stored in one common memory region.

In the following, we explain how we tackle these issues to build a scalable timestamp oracle. The main idea is to use a data structure called the timestamp vector, similar to a vector clock, which represents the read timestamp as follows:

    TR = ⟨t1, t2, t3, ..., tn⟩

Here, each component ti in TR is a unique counter that is assigned to one transaction execution thread i in a compute server, where i is a globally unique identifier. This vector can either be stored on one of the memory servers or be partitioned across several servers, as explained later. However, in contrast to vector clocks, we do not store the full vector with every record; instead, we store only the timestamp of the compute server that did the latest update:

    TC = ⟨i, ti⟩

Here, i is the global transaction execution thread identifier and ti the corresponding commit timestamp. This helps to mitigate one of the most fundamental drawbacks of vector clocks, the high storage overhead per record.

Commit Timestamps: Each component ti = TR[i] represents the latest commit timestamp that was used by execution thread i. Creating a new commit timestamp can be done without communication since one thread i executes transactions in a closed loop. The reason is that each thread already knows its latest commit timestamp and just needs to increase it by one to create the next commit timestamp. It then uses the previously-described protocol to verify if it is safe to install the new versions in the system with timestamp TC = ⟨i, t + 1⟩, where t + 1 is the new timestamp.

At the end of the transaction, the compute server makes the updates visible by increasing the commit timestamp in the vector TR. That is, instead of adding the commit timestamp to a queue (line 30 of Listing 1), it uses an RDMA write to increase its latest timestamp in the vector TR. No atomic operations are necessary since each transaction thread i only executes one transaction at a time.

Read Timestamps: Each transaction thread i reads the complete timestamp vector TR and uses it as read timestamp rts. Using TR, a transaction can read a record, including its header, from a memory server and check if the most recent version is visible to the transaction. The check is simple: as long as t in the record's version ⟨i, t⟩ is smaller than or equal to the component ti of the vector TR, the update is visible. If not, an older version has to be used to meet the condition. We will discuss the details of the memory layout of our multi-versioning scheme in Section 5.

It is important to note that this simple timestamp technique has several important characteristics. First, long-running transactions, stragglers, or crashed machines do not prevent the read timestamp from advancing. The transaction threads are independent of each other. Second, if the timestamp vector is stored on a single memory server, it is guaranteed to increase monotonically. The reason is that all RDMA writes are always materialized in the remote memory of the oracle and not cached on its NIC. Therefore, it is impossible for one transaction execution thread in a compute server to see a timestamp vector like ⟨.., tn, .., tm + 1, ..⟩ while another observes ⟨.., tn + 1, .., tm, ..⟩. As a result, the timestamps still progress monotonically, similar to a single global timestamp counter. However, in the case where the timestamp vector is partitioned, this property might no longer hold, as explained later.
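The vector-based oracle described above can be sketched as follows. Class and function names are our own, and plain Python lists stand in for the RDMA-accessible memory region holding TR; the visibility rule is the one from the text: a version ⟨i, t⟩ is visible under a snapshot iff t <= TR[i].

```python
class TimestampOracle:
    def __init__(self, num_threads):
        self.t_r = [0] * num_threads     # timestamp vector TR, one slot per thread

    def read_snapshot(self):
        # analogue of an RDMA read of the full vector
        return list(self.t_r)

    def publish_commit(self, thread_id, cts):
        # plain RDMA write; no atomics needed since each thread owns one slot
        self.t_r[thread_id] = cts

def next_cts(last_cts):
    # no communication: each thread just increments its own counter
    return last_cts + 1

def is_visible(version, snapshot):
    i, t = version                       # record header TC = <i, ti>
    return t <= snapshot[i]

oracle = TimestampOracle(num_threads=4)
oracle.publish_commit(thread_id=2, cts=next_cts(4))
snap = oracle.read_snapshot()            # [0, 0, 5, 0]
assert is_visible((2, 5), snap)          # committed at or before snapshot
assert not is_visible((2, 6), snap)      # newer: fall back to older version
```

Note how a straggler thread (e.g., thread 3 stuck at 0) delays only the visibility of its own updates; the other components of the vector keep advancing.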

Figure 3: Version Management and Record Layout (the current version of a record, consisting of a header and a data section, is kept in a dedicated region and updated copy-on-update: older versions, e.g. v13-v19, move into fixed-size old-version header and data buffers with write-next pointers, and the oldest versions are continuously moved into an overflow region; the 8-byte header holds a 29-bit thread identifier, a 32-bit commit timestamp, and a moved, locked, and deleted bit)

4.2 Further Optimizations

In the following, we explain further optimizations that make the timestamp oracle even more scalable.

Dedicated Fetch Thread: Fetching the most recent TR at the beginning of each transaction can cause a high network load for large transaction execution thread pools on large clusters. In order to reduce the network load, we can have one dedicated thread per compute server that continuously fetches TR, and let all transaction threads simply use the pre-fetched TR. At first view, this seems to increase the abort rate since pre-fetching increases the staleness of TR. However, due to the reduced network load, the runtime of each transaction is heavily reduced, leading instead to a lower abort rate.

Compression of TR: The size of TR depends on the number of transaction execution threads, which could rise to hundreds or even thousands of entries when scaling out. Thus, instead of having one slot per thread, we can compress TR by having only one slot ti per compute server; i.e., all transaction execution threads on one machine share one timestamp slot. One alternative is that the threads of a compute server use an atomic RDMA fetch-and-add operation to increase the counter value. Since the number of transaction execution threads per compute server is bounded (if we use one dedicated thread per core), the contention will not be too high. As another alternative, we can cache the shared counter tc in a compute server's memory. Increasing tc is then implemented by a local compare-and-swap followed by a subsequent RDMA write.

Partitioning of TR: In our evaluation, we found that with these two optimizations, a single timestamp server is already able to sustain over 140 million transactions per second using a single dual-port FDR 4x NIC. In other words, we could scale our cluster to roughly 500 machines for TPC-C with two FDR 4x ports before the network bandwidth of the server becomes a bottleneck. In case a single server does become the bottleneck, it is easy to partition TR across several memory nodes, since a transaction execution thread needs to update only a single slot ti. This improves the bandwidth per server as every machine now only stores a fraction of the vector. Unfortunately, partitioning TR no longer guarantees strict monotonicity: every transaction execution thread still observes a monotonically increasing order of updates, but the order of transactions between transaction execution threads might differ. We believe that this does not impose a big problem in real systems.

5 Memory Servers

In this section, we first discuss the details of the multi-versioning scheme implemented in the memory servers of NAM-DB, which allows compute servers to install and find a version of a record. Afterwards, we present further details about the design of table and index structures in NAM-DB, as well as memory management including garbage collection. Note that our design decisions are made to make distributed transactions scalable rather than to optimize for locality.

5.1 Multi-Versioning Scheme

The scheme to store multiple versions of a database record in a memory server is shown in Figure 3. The main idea is that the most recent version of a record, called the current version, is stored in a dedicated memory region. Whenever a record is updated by a transaction (i.e., a new version needs to be installed), the current version is moved to an old-version buffer and the new current version is installed in-place. As a result, the most recent version can always be read with a single RDMA request. Furthermore, as we use continuous memory regions for the most recent versions, transferring the most recent versions of several records also requires only a single RDMA request, which dramatically helps scans. The old-version buffer has a fixed size so that it is efficiently accessible with one RDMA read. Moreover, the oldest versions in the buffers are continuously copied to an overflow region. That way, slots in the old-version buffer can be re-used for new versions while keeping old versions available for long-running transactions.

In the following, we first explain the memory layout in more detail and then discuss the version management.

Record Layout: For each record, we store a header section that contains additional metadata and a data section that contains the payload. The data section is a full copy of the record which represents a particular version. Currently, we only support fixed-length payloads. Variable-length payloads could be supported by storing in the data section an additional pointer that refers to the variable-length part, which is stored in a separate memory region. However, when using RDMA, the latency for messages of up to 2KB remains constant, as shown in our vision paper [10]. Therefore, for the many workloads where the record size does not exceed this limit, it makes sense to store the data in a fixed-length field that has the maximal required length inside a record.

The header section describes the metadata of a record. In our current implementation, we use an 8-byte value that can be atomically updated by a remote compare-and-swap operation from compute servers. The header encodes different variables: The first 29 bits are used to store the thread identifier i of the transaction execution thread that installed the version (as described in the section before). The next 32 bits are used for the commit timestamp. Both of these variables represent the version information of a record and are set during the commit phase. Moreover, we also store other data in the header section that is used for version management, each represented by a 1-bit value: a moved-bit, a deleted-bit, and a locked-bit.
The tems, we are currently investigating if we can solve this by moved-bit indicates if a version was already moved from the leveraging the message ordering guarantees provided by In- old-version buffer to the overflow region, and thus its slot can finiBand for certain broadcast operations. This direction rep- be safely reused. The deleted-bit indicates if the version of a resents an interesting avenue of future work. record is marked for deletion and can be safely garbage col-
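The bit layout above packs all version metadata into a single 8-byte word that a remote compare-and-swap can update atomically. A minimal sketch in Python; the paper fixes only the field widths (29-bit thread id, 32-bit commit timestamp, three flag bits), so the exact bit positions chosen here are an assumption:

```python
# Sketch of the 8-byte record header: 29-bit thread id, 32-bit commit
# timestamp, and three 1-bit flags (moved, deleted, locked). The field
# widths come from the paper; the bit ordering is our assumption.

THREAD_BITS, TS_BITS = 29, 32
THREAD_MASK = (1 << THREAD_BITS) - 1
TS_MASK = (1 << TS_BITS) - 1

def pack_header(thread_id, commit_ts, moved=0, deleted=0, locked=0):
    assert thread_id <= THREAD_MASK and commit_ts <= TS_MASK
    flags = (moved << 2) | (deleted << 1) | locked
    return (flags << (THREAD_BITS + TS_BITS)) | (commit_ts << THREAD_BITS) | thread_id

def unpack_header(h):
    thread_id = h & THREAD_MASK
    commit_ts = (h >> THREAD_BITS) & TS_MASK
    flags = h >> (THREAD_BITS + TS_BITS)
    return thread_id, commit_ts, (flags >> 2) & 1, (flags >> 1) & 1, flags & 1

# A remote compare-and-swap on this 64-bit value can atomically set the
# locked-bit only if the header still matches the version that was read.
h = pack_header(thread_id=7, commit_ts=1234)
locked = h | (1 << (THREAD_BITS + TS_BITS))  # header with the locked-bit set
```

Since 29 + 32 + 3 = 64, the whole header fits exactly into one RDMA-atomic-sized word, which is what makes the combined validate-and-lock step possible.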

Version Management: The old-version buffer consists of two circular buffers of a fixed size as shown in Figure 3; one that holds only headers (excluding the current version) and another one that holds data sections. The reason for splitting the header and the data section into two circular old-version buffers is that the size of the header section is typically much smaller than the data section. That way, a transaction that needs to find a particular version of a record only needs to fetch all headers without reading the payload. Once the correct version is found, the payload can be fetched with a separate read operation using the offset information. This effectively minimizes the latency when searching for a version of a record.

For installing a new version, we first validate that the current version has not changed since reading it and set the lock-bit using one atomic RDMA compare-and-swap operation (i.e., we combine validation and locking). If locking and validation fails, we abort. Otherwise, the header and the data of the current version are copied to the old-version buffer. In order to copy the current version to the buffers, a transaction first needs to determine the slot which stores the oldest version in the circular buffer and find out if that slot can be overwritten (i.e., the moved-bit is set to 1). In order to identify the slot with the oldest version, the circular buffers provide a next-write counter.

Another issue is that the circular buffers have only a fixed capacity, the reason being that we want to efficiently access them with one-sided RDMA operations and avoid pointer-chasing operations. However, in order to support long-running transactions, a version-mover thread that runs in a memory server continuously moves the header and data sections to an overflow region and sets the moved-bit in the header-buffer to 1. This does not actually mean that the tuples are removed from the old-version buffers. It only means that a slot can be safely re-used by a transaction to install a new version. Keeping the moved versions in the buffer maximizes the number of versions that can be retrieved from the old-version buffers.

5.2 Table and Index Structures

In the following, we discuss how table and index structures are implemented in memory servers.

Table Structures: In NAM-DB, we only support one type of table structure that implements a hash table design similar to the one in [30]. In this design, compute servers can execute all operations on the hash table (e.g., a put or a get) by using one-sided RDMA operations. In addition to the normal put and get operations to insert and look up records of a table, we additionally provide an update operation to install a new record version, as well as a delete operation. The hash tables in NAM-DB store key-value pairs where the keys represent the primary keys of the table. Moreover, the values store all information required by our multi-versioning scheme: the current record version and three pointers (two pointers to the old-version buffers as well as one pointer to the overflow region).

In contrast to [30], hash tables in NAM-DB are partitioned to multiple memory servers. In order to partition the hash table, we split the bucket array into equally-sized ranges and each memory server stores only one of the resulting sub-ranges as well as the corresponding keys and values. In order to find a memory server which stores the entries for a given key, the compute server only needs to apply the hash function which determines the bucket index (and thus the memory server which holds the bucket and its key-value pairs). Once the memory server is identified, the hash table operation can be executed on the corresponding memory server.

Index Structures: In addition to the table structure described before, NAM-DB supports two types of secondary indexes: a hash-index for single-key lookups and a B+-tree for range lookups. Both types of secondary indexes map a value of the secondary attribute to a primary key that can then be used to look up the record using the table structure discussed before (e.g., a customer name to the customer key). Moreover, secondary indexes do not store any version information. Thus, retrieving the correct version of a record requires a subsequent lookup on the table structure using the primary key.

For NAM-DB's secondary hash indexes, we use the same hash table design that we have described before for the table design. The main difference is that for values in a secondary hash index, we store only primary keys and no pointers (e.g., to old-version buffers etc.) as discussed before. For the B+-tree index, we follow a different route. Instead of designing a tree structure that can be accessed purely by one-sided RDMA operations, we use two-sided RDMA operations to implement the communication between compute and memory servers. The reason is that operations in B+-trees need to chase multiple pointers from the root to the leaf level, and we do not want to pay the network overhead for pointer chasing. While pointer chasing is also an issue for hash tables, which use linked lists, [30] shows that clustering keys in a linked list into one memory region largely mitigates this problem (i.e., one RDMA operation can read the entire linked list). Moreover, for scaling out and preventing individual memory servers from becoming a bottleneck, we range-partition B+-trees to different memory servers. In the future, we plan to investigate alternative indexing designs for B+-trees.

5.3 Memory Management

Memory servers store tables as well as index structures in their memory as described before. In order to allow compute servers to access tables and indexes via RDMA, memory servers must pin and register memory regions at the RDMA network interface card (NIC). However, pinning and registering memory at the NIC are both costly operations which should not be executed in a critical path (e.g., whenever a transaction creates a new table or an index). Therefore, memory servers allocate a large chunk of memory during initialization and register it to the NIC. After initialization, memory servers handle both allocate and free calls from compute servers.

Allocate and Free Calls: Allocate and free calls from compute servers to memory servers are implemented using two-sided RDMA operations. In order to avoid many small memory allocation calls, compute servers request memory regions from memory servers in extents. The size of an extent can be defined by a compute server as a parameter and depends on different factors (e.g., expected size and update rate). For example, when creating a new table in NAM-DB, the compute server that executed the transaction allocates an extent that allows the storage of an expected number of records and their different versions. The number of expected records per table can be defined by applications as a hint.
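The bucket-range partitioning of Section 5.2 lets a compute server locate the responsible memory server with purely local computation. A small illustrative sketch; the hash function and the equal-width split of the bucket array are assumptions, not NAM-DB's actual implementation:

```python
# Sketch: locating the memory server for a key by splitting the bucket
# array into equally-sized ranges. The compute server applies the hash
# function locally; no network round trip is needed to find the server.

NUM_BUCKETS = 1024          # illustrative sizes
NUM_MEMORY_SERVERS = 8
BUCKETS_PER_SERVER = NUM_BUCKETS // NUM_MEMORY_SERVERS

def bucket_index(key):
    # Stand-in hash function; the real system would use a fixed hash.
    return hash(key) % NUM_BUCKETS

def memory_server_for(key):
    # Each server owns one contiguous sub-range of the bucket array.
    return bucket_index(key) // BUCKETS_PER_SERVER

server = memory_server_for("customer:42")
```

Once the server id is known, the put/get/update/delete operation is issued against that server with one-sided RDMA operations.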

Garbage Collection: To prevent old versions from taking up all the space, the job of the garbage collection is to determine old version records which can be safely evicted. In NAM-DB, garbage collection is implemented by having a timeout on the maximal transaction execution time E that can be defined as a parameter by the application. Transactions that run longer than the maximal execution time might abort since the versions they require might already be garbage collected. For garbage collecting these versions, each memory server has a garbage collection thread which continuously scans the overflow regions and sets the deleted-bit of the selected versions of a record to 1. These versions are then truncated lazily from the overflow regions once contiguous regions can be freed.

6 Compute Servers

In this section, we discuss how compute servers execute transactions and present techniques for recovery/fault-tolerance.

6.1 Transaction Execution

Compute servers use multiple so-called transaction execution threads to execute transactions over the data stored in the memory servers. Each transaction execution thread i executes transactions sequentially using the complete timestamp vector T as the read timestamp, as well as (i, T[i]) as the commit timestamp, to tag new versions of records as discussed in Section 4. The general flow of executing a single transaction in a transaction execution thread follows the same workflow as outlined already in Section 3.1. Indexes are updated within the boundaries of the transaction that also updates the corresponding table using RDMA operations (i.e., we pay additional network roundtrips to update the indexes).

One important aspect that we have not discussed so far is how the database catalog is implemented such that transactions can find the storage location of tables and indexes. The catalog data is hash-partitioned and stored in memory servers. All accesses from compute servers are implemented using two-sided RDMA operations since query compilation does not result in a high load on memory servers when compared to the actual transaction execution. Since the catalog does not change too often, the catalog data is cached by compute servers and refreshed in two cases. In the first case, a requested database object is not found in the cached catalog, and the compute server requests the required metadata from the memory server. The second case is if a database object is altered. We detect this case by storing a catalog version counter within each memory server that is incremented whenever an object is altered on that server. Since transaction execution threads run transactions in a closed loop, this counter is read from the memory server that stores the metadata for the database objects of a given transaction before compiling the queries of that transaction. If the version counter has changed when compared to the cached counter, the catalog entries are refreshed.

6.2 Failures and Recovery

NAM-DB provides a fault-tolerance scheme that handles failures of both compute and memory servers. In the following, we discuss both cases. At the moment, we do not handle failures resulting from network partitioning since these events are extremely rare in InfiniBand networks. These types of failures could be handled using a more complex commit protocol than 2PC (e.g., a version of Paxos based on RDMA), which is an interesting avenue of future work. Moreover, it is also important to note that high availability is not the design goal of NAM-DB; it could be achieved in NAM-DB by replicating the write-set during the commit phase.

Memory Server Failures: In order to tolerate memory server failures, each transaction execution thread of a compute server writes a private log journal to a memory server using RDMA writes. In order to avoid the loss of a log, each transaction execution thread writes its journal to more than one memory server. The entries of such a log journal are <T, S>, where T is the read snapshot used by thread i and S is the executed statement with all its parameters. Commit timestamps that have been used by a transaction execution thread are stored as parameters together with the commit statement and are used during replay. The log entries for all transaction statements are written to the database log before installing the write-set on the memory servers.

Once a memory server fails, we halt the complete system and recover all memory servers to a consistent state from the last persisted checkpoint (discussed below). For replaying the distributed log journal, the private logs of all transaction execution threads need to be partially ordered by their logged read timestamps T. Therefore, the current recovery procedure in NAM-DB is executed by one dedicated compute server that replays the merged log for all memory servers.

In order to avoid long restart phases, an asynchronous thread additionally writes checkpoints to the disks of memory servers using a dedicated read-timestamp. This is possible in snapshot isolation without blocking other transactions. The checkpoints can be used to truncate the log.

Compute Server Failures: Compute servers are stateless and thus do not need any special handling for recovery. However, a failed compute server might result in abandoned locks. Therefore, each compute server is monitored by another compute server called a monitoring compute server. If a monitoring compute server detects that a compute server is down, it unlocks the abandoned locks using the log journals written by the transaction execution threads of this compute server.
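The partial order on logged read snapshots that recovery relies on can be sketched as a componentwise comparison of timestamp vectors; the merge-by-sum linearization below is our illustration (any linear extension of the partial order would do), not the paper's actual recovery code:

```python
# Sketch: partially ordering log entries <T, S> by their read-snapshot
# vectors. T1 must be replayed before T2 if T1 <= T2 in every component.

def happened_before(t1, t2):
    # Componentwise partial order on timestamp vectors (strict).
    return all(a <= b for a, b in zip(t1, t2)) and t1 != t2

def replayable_order(entries):
    # entries: list of (T, statement). Sorting by the sum of the vector
    # components is a simple linear extension: if T1 <= T2 componentwise
    # and T1 != T2, then sum(T1) < sum(T2), so the partial order is kept.
    return sorted(entries, key=lambda e: sum(e[0]))
```

The dedicated recovery compute server would merge the private journals into one such sequence before replaying the statements against the memory servers.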

692 7 Cluster A has 57 machines, each with Mellanox Connect-IB 6 Classic (2-sided) NAM-DB w/o locality card, and all connected through a single InfiniBand FDR 4X 5 NAM-DB w locality switch. The cluster contains two types of machines: the first 4 28 machines (type 1) have two Intel Xeon E7-4820 processors 3 (each with 8 cores) and 128 GB RAM, the other 29 machines 2 1 (type 2) have two Intel Xeon E5-2660 processors (each with

Throughput (M trxs/sec) 0 8 cores) and 256 GB RAM. All machines in this cluster run 10 20 30 40 50 Oracle Linux Server 6.5 (kernel 2.6.32) and use the Mellanox Cluster Size OFED 2.3.1 driver for the network. Figure 4: Scalability of NAM-DB Cluster B has 8 machines connected to a single InfiniBand put even degrades when using more than 10 machines. The de- FDR 4X switch using a Mellanox Connect-IB card. Each ma- grade results from the high CPU costs of handling messages. chine has two Intel Xeon E5-2660 v2 processors (each with 10 Figure 5(a) shows that the latency of new-order transactions. cores) and 256GB RAM. The machines run Ubuntu 14.01 Ser- While NAM-DB almost stays constant regardless of the num- ver Edition (kernel 3.13.0-35-generic) as their operating sys- ber of machines, the latency of the classic SI implementation tem and use the Mellanox OFED 2.3.1 driver for the network. increases. This is not surprising; in the NAM-DB design the 7.1 Exp.1: System Scalability work per machine is constant and is not related to the number To show that NAM-DB scales linearly, the number of ser- of machines in the cluster, whereas the classical implementa- vers were increased from 2 to 56 on Cluster A. We used two tion requires more and more message handling. configurations of NAM-DB, with and without locality. For the When looking more carefully into the latency of NAM-DB setup without locality optimization, we deployed 28 memory w/o locality and its break-down (Figure 5(b)), it reveals that servers on type-2 machines and 28 compute servers on type- the latency increases slightly mainly because of the overhead 1 machines, the latter using 60 transaction execution threads to install new versions. In fact, we know from profiling that per machine. For the setup with the locality optimization, we NAM-DB for TPC-C is network bandwidth bound. 
That is deployed 56 compute and 56 memory servers (one pair per also the main reason why locality improves the overall perfor- physical machine). In this deployment, each compute server mance, and we speculate that a system with the next generation was running only 30 transaction execution threads to have the of network hardware, such as EDR, would be able to achieve same total number in both deployments. Finally, in both de- even higher throughputs. Finally, Figure 5(b) shows, that the ployments we used one additional dedicated memory server latency for the timestamp oracle does not increase, indicating on a type-2 machine to store the timestamp vector. the efficiency of our new technique (note that we currently do Figure 4 shows the throughput of NAM-DB on an increas- not partition the timestamp vector). ing cluster size both without exploring locality (blue) and with 7.2 Exp.2: Scalability of the Oracle adding locality (purple) and compares them against a more tra- To test the scalability of our novel timestamp oracle, we var- ditional implementation of Snapshot Isolation (red) with two- ied the number of compute servers that concurrently update sided message-based communication. The results show that the oracle. As opposed to the previous experiment, however, NAM-DB scales nearly linearly with the number of servers to compute servers do not execute any real transaction logic. In- 3.64 million distributed transactions over 56 machines. How- stead, each compute server thread executes the following three ever, if we allow the system to take advantage of locality, we actions in a closed loop: (1) reads the current timestamp, (2) achieve 6.5 million TPC-C new-order transactions. This is generates a new commit timestamp, and (3) makes the new 2 million more transactions than the current scale-out record commit timestamp visible. We call this sequence of operations by Microsoft FaRM [14], which achieves 4.5 million TPC- a timestamp transaction, or simply t-trx. 
For this experiment, C transactions over 90 machines with comparable hardware we used cluster B with eight nodes. The compute servers were and using as much locality as possible. It should be noted deployed on seven machines, where we scaled the number of though that FaRM was deployed on a cluster with ConnectX-3 threads per machine. The remaining one node runs a memory NICs, not ConnectIB, which can have an performance impact server that stores the timestamp vector. if the number of queue pairs is large [21]. However, as Sec- As a baseline, we analyze the original timestamp oracle tion 7.5 will show, for TPC-C this should make almost no dif- of our vision paper [10] (red line in Figure 7), which only ference. Furthermore, FaRM implements serializability guar- achieved up to 2 million t-trxs/sec. As shown in the graph, antees, whereas NAM-DB supports snapshot isolation. While our old oracle did not scale. In fact, when scaling to more for this benchmark it makes no difference (there is no write- than 20 clients, the throughput starts to degrade due to high skew), it might be important for other workloads. At the same contention.However, it should be noted that for smaller clus- time, though, FaRM never tested their system for larger read ters, the threshold of 2 million t-trxs/sec might be enough. For queries, for which it should perform particularly worse as it example, in our vision paper [10], we executed a variant of requires a full read-set validation. TPC-W on a smaller cluster and achieved up to 1.1 million The traditional SI protocol in Figure 4 follows a partitioned transactions per second; a load that the original oracle could shared-nothing design similar to [26] but using 2-sided RDMA sustain. However, it becomes a bottleneck for larger deploy- for the communication. As the figure shows, this design does ments. As shown before, our system can execute up to 14 not scale with the number of servers. 
Even worse, the through- million transactions on 56 nodes (6.5 million new-order trans- actions). This load could not be handled by the classic oracle.
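The three steps of a t-trx can be sketched against an in-memory stand-in for the timestamp vector; in NAM-DB these steps are RDMA reads and writes against the timestamp server, and the class below (including the exact increment rule for the commit timestamp) is purely illustrative:

```python
# Sketch of a timestamp transaction (t-trx): (1) read the current
# timestamp vector as the snapshot, (2) generate a new commit timestamp
# (i, T[i] + 1) for thread i, (3) make it visible by writing slot i.
# A plain Python list stands in for the RDMA-resident vector.

class TimestampOracle:
    def __init__(self, num_threads):
        self.vector = [0] * num_threads  # one slot per transaction thread

    def t_trx(self, i):
        snapshot = list(self.vector)      # (1) read current timestamp vector
        commit_ts = (i, snapshot[i] + 1)  # (2) generate new commit timestamp
        self.vector[i] = commit_ts[1]     # (3) make commit timestamp visible
        return snapshot, commit_ts

oracle = TimestampOracle(num_threads=4)
snapshot, commit_ts = oracle.t_trx(2)
```

Because each thread only ever writes its own slot, no coordination between threads is needed to make a commit timestamp visible, which is what the oracle's scalability rests on.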

693 100000 350 2.5 Classic (2-sided) Timestamps Lock and verify WS NAM-DB w/o locality 140 NAM-DB w/o locality 300 10000 NAM-DB w/o locality Read RS Install WS + index 2 NAM-DB w locality 120 NAM-DB w locality NAM-DB w locality 250 100 1000 200 1.5 80 100 150 1 60 100 Latency (us) Latency (us) 10 0.5 Latency (us) 40 50 20

1 0 Throughput (M trxs/sec) 0 10 20 30 40 50 2 4 8 16 24 32 40 48 56 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Cluster Size Cluster Size Probability of distribution Probability of distribution (a) Latency (b) Breakdown for NAM-DB (a) Throughput (b) Latency Figure 5: Latency and Breakdown Figure 6: Effect of Locality As shown in Figure 6, the new oracle (blue line) can easily executed all memory accesses using RDMA. When running sustain the above mentioned load. For example, the basic ver- w/ locality, we directly accessed the local memory if possible. sion with no optimization achieves 20 million t-trxs/sec. How- Since our HCA’s atomic operations are not atomic with respect ever, the basic version still does not scale linearly because the to the attached processor, all the atomic operations were issued size of the timestamp vector grows with the number of trans- as RDMA atomics, even in locality mode. action execution threads (i.e., clients) and makes the network Figure 6 shows that the performance benefit of locality is bandwidth of the timestamp server the main bottleneck. roughly 30% in regard to throughput and latency. While 30% While 20 million t-trxs/sec is already sufficient that the ba- is not negligible, it still demonstrates that there are no longer sic new oracle (blue line) does not become a bottleneck in our orders-of-magnitude differences between them if the system is experiments, we can push the limit even further by applying designed to achieve high distributed transaction throughput. the optimizations discussed in Section 4.2. One of the op- We also executed the same experiment on a modern in- timizations is using a dedicated background fetch thread per memory database (H-Store [22]) that implements a classical compute server (instead of per transaction execution thread) shared-nothing architecture which is optimized for data-locality. to read the timestamp vector periodically. 
This reduces the We choose H-Store as it is one of the few freely available trans- load on the network. When applying this optimization (black actional in-memory databases. We used the distributed version line, denoted by “bg ts reader”), the oracle scales up to 36 of H-Store without any modifications using IP over InfiniBand million t-trxs/sec. Furthermore, when applying compression as communication stack. Overall, we observed that H-Store (green line), where there is one entry in the timestamp vec- only achieves 11K transactions per second (not shown in Fig- tor per machine (instead of per transaction thread), the oracle ure 6) on a perfectly partitionable workload. These numbers scales even further to 80 million t-trxs/sec. Finally, when en- are in line with the ones reported in [33]. However, at 100% abling both optimizations (yellow line), the oracle scales up to distributed transactions the throughput of H-Store drops to 900 135 million t-trxs/sec on only 8 nodes. transactions per second (which is approx. a 90% drop), while It is worth noting that even the optimized oracle reaches its our system still achieves more than 1.5M transactions under capacity at some point when deployed on clusters with hun- the same workload. This clearly shows the sensitivity of the dreds or thousands of machines (we speculate that with these shared-nothing design to data locality. two optimizations and the given hardware we could support a cluster size of 500 machines for TPC-C). At that point, the idea 7.4 Exp.4: Effect of Contention of partitioning the timestamp vector (see Section 4.2) could be As mentioned earlier, the scalability is influenced by the in- applied to remove the bottleneck. Therefore, we believe that trinsic scalability of the workload. In order to analyze the ef- our proposed design for the timestamp oracle is truly scalable. fect of contention on the scalability, we increased the number of machines with different levels of contention. 
That is, we 7.3 Exp.3: Effect of Locality varied the likelihood that a given product item is selected by As described earlier, we consider locality an optimization a transaction by using a uniform distribution as well as differ- technique, like adding an index, rather than a key requirement ent zipf distributions with low skew (α = 0.8), medium skew to achieve good scale-out properties. This is feasible with (α = 0.9), high skew (α = 1.0) and very-high skew (α = 2.0). high-speed networks since the impact of locality is no longer Figure 8 shows the results in regard to throughput and abort as severe as it is on slow networks. To test this assumption, rate. For the uniform and zipf distribution with low skew, we we varied the degree of distribution for new-order transactions can see that the throughput per machine is stable (i.e., almost from 0% up to 100%. The degree of distribution represents the linearly as before). However, for an increasing skewness factor likelihood that a transaction needs to read/write data from/to a the abort rate also increases due to the contention on a single warehouse that is stored on a remote server When exploiting machine. This supports our initial claim that while RDMA can locality, transactions are executed at those servers that store help to achieve a scalable system, we can the so-called home warehouse. In this experiment, we only not do something against an inherently non-scalable workload executed the new-order transaction and not the complete mix that has individual contention points. The high abort rate can in order to show the direct effect of locality. be explained by the fact that we immediately abort transac- For the setup, we again use cluster B with one server act- tions instead of waiting for a lock once a transaction does not ing as the timestamp oracle and the remaining seven machines acquire a lock. 
It is important to note that this does not have a physically co-locating one memory and one computer server huge impact on the throughput, since in our case the compute each. The TPC-C database contained 200 warehouses parti- server directly triggers a retry after an abort. tioned to all memory servers. When running w/o locality, we
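The skew levels used here can be reproduced with a simple Zipf sampler; the item count, sample sizes, and seed below are arbitrary, and the sampler only illustrates how a larger α concentrates accesses on a few hot items:

```python
# Sketch: Zipf-distributed item selection. A higher alpha concentrates
# accesses on fewer items, which raises conflicts (and thus abort rates).
import random

def zipf_weights(n_items, alpha):
    # Weight of the item at rank r is proportional to 1 / r^alpha.
    return [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]

def sample_items(n_items, alpha, k, rng):
    weights = zipf_weights(n_items, alpha)
    return rng.choices(range(n_items), weights=weights, k=k)

rng = random.Random(42)
low = sample_items(1000, 0.8, 10_000, rng)   # low skew (alpha = 0.8)
high = sample_items(1000, 2.0, 10_000, rng)  # very high skew (alpha = 2.0)
# With alpha = 2.0 the hottest item receives a far larger share of accesses
# than with alpha = 0.8, turning it into a single contention point.
```

With α = 2.0 a majority of all accesses hit the single hottest item, which matches the observation above that no system design can help a workload with such an inherent contention point.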

Figure 7: Scalability of Oracle: (a) Throughput, (b) Abort Rate. Figure 8: Effect of Contention. Figure 9: Scalability of QPs. (Plots not shown.)

7.5 Exp. 5: Scalability of RDMA QPs

Recent work [21] pointed out that a high number of queue pairs (QPs) per NIC limits scalability. The reason is that the NIC cache is limited; a high number of QPs may overflow the cache, potentially causing performance to degrade. To investigate the impact of the number of QPs on our design, we dedicated one machine as server and seven machines as clients on Cluster B and varied the number of queue pairs per client thread while running 20 threads per client. That way, we scaled the total number of queue pairs to approximately 4000. Moreover, in every round, a client thread chooses one of its queue pairs randomly and issues a fixed-size READ to the server.

Figure 9 shows the total number of performed operations for three different sizes of one-sided RDMA READs. The results show that queue pairs indeed have an impact on the overall performance, but mainly for small messages. For example, we observed a 40% (25%) drop in the throughput of 8-byte (64-byte) READs. With 256-byte READs, however, the number of queue pairs has almost no impact; in this case, the network bandwidth limits the maximum throughput, not the queue pairs. Thus, we argue that for many workloads (as well as benchmarks such as TPC-C) the number of queue pairs does not play an important role.

Also note that queue pairs do not need to be established between all cores. For example, in the NAM architecture only queue pairs between servers and client-threads are required. This number can be further reduced by not using a queue pair for every client-thread (as currently done in NAM-DB) but rather a few dedicated "communicator" threads, at the potential cost of additional coordination overhead.

Finally, while in the simplest case a queue pair is needed per core to enable RDMA, and thus the increasing number of cores per CPU might again become a bottleneck, we observe that, at least currently, the cache sizes of NICs increase much faster than the number of cores per CPU. If this trend continues, it might further mitigate the problem in the future.

8 Related Work

Most related to our work is FaRM [14, 13]. However, FaRM uses a more traditional message-based approach and focuses on serializability, whereas we implemented snapshot isolation, which is more common in practice because of its low-overhead consistent reads. More importantly, in this work we made the case that distributed transactions can now scale, whereas FaRM's design is centered around locality.

FaSST [21] is another related project, which was published while this paper was under review. Like NAM-DB, FaSST also focuses on scalability, but the authors took a different approach by building an efficient RPC abstraction on top of 1-to-many unreliable datagrams using two-sided SEND/RECEIVE verbs. This design minimizes the size of the queue pair state stored in the NIC cache with the goal of better scalability with the size of the cluster. However, as Section 7.5 showed, for many realistic workloads and cluster sizes, the number of queue pairs may not be that influential on performance. Due to their decision to abandon one-sided RDMA verbs in favor of unreliable datagrams, their system is not able to take full advantage of leveraging the NIC as a co-processor to access remote memory, and the design is likely more sensitive to data locality. Finally, and most importantly, FaSST implements serializability guarantees, whereas we show how to scale snapshot isolation, which provides better performance for read-heavy workloads and is more common in practice than serializability (e.g., Oracle does not even support it).

Another recent work [29] is similar to our design in that it also separates storage from compute nodes. However, instead of treating RDMA as a first-class citizen, they treat RDMA as an afterthought. Moreover, they use a centralized commit manager to coordinate distributed transactions, which is likely to become a bottleneck when scaling out to larger clusters. Conversely, our NAM-DB architecture is designed to leverage one-sided RDMA primitives to build a scalable shared distributed architecture without a central coordinator, avoiding bottlenecks in the design.

Industrial-strength products have also adopted RDMA in existing DBMSs [6, 36, 28]. For example, Oracle RAC [36] has RDMA support, including the use of RDMA atomic primitives. However, RAC does not directly take advantage of the network for distributed transactions and is essentially a workaround for a legacy system. Furthermore, IBM pureScale [6] uses RDMA to provide high availability for DB2 but also relies on a centralized manager to coordinate distributed transactions. Finally, SQLServer [28] uses RDMA to extend the buffer pool of a single-node instance but does not discuss the effect on distributed databases at all.

Other projects in academia have also targeted RDMA for data management, such as distributed join processing [7, 38, 19]. However, they focus mainly on leveraging RDMA in a traditional shared-nothing architecture and do not discuss the redesign of the full database stack. SpinningJoins [19] suggest a new join architecture for RDMA. Different from our work, this work assumes severely limited network bandwidth (only 1.25GB/s) and therefore streams one relation across all the nodes (similar to a block-nested loop join). Another line of work is on RDMA-enabled key/value stores [31, 30, 20]. We leverage some of these results to build our distributed indexes in NAM-DB, but transactions and query processing are not discussed in these papers.
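The queue-pair arithmetic and the bandwidth argument from Section 7.5 can be sketched as a back-of-the-envelope model. This is not the benchmark code: the cluster parameters (7 clients, 20 threads each, roughly 4000 total QPs) come from the text, while the per-thread QP count, the assumed link bandwidth, and the throughput numbers are illustrative assumptions.

```python
# Back-of-the-envelope model of Exp. 5 (Section 7.5).
# QPS_PER_THREAD, LINK_BW, and the throughput figures below are
# illustrative assumptions, not measured values.

CLIENT_MACHINES = 7
THREADS_PER_CLIENT = 20
QPS_PER_THREAD = 28  # chosen so the total lands near 4000

total_qps = CLIENT_MACHINES * THREADS_PER_CLIENT * QPS_PER_THREAD
print(total_qps)  # 3920, i.e., approximately 4000 queue pairs

# A READ of s bytes at r ops/s consumes r * s bytes/s of bandwidth.
def bandwidth_used(ops_per_sec, size_bytes):
    return ops_per_sec * size_bytes

# With an assumed ~6.8 GB/s usable link bandwidth, small READs stay far
# below the link limit (their throughput is bounded by QP state in the
# NIC cache), while 256-byte READs approach it (bandwidth-bound):
LINK_BW = 6.8e9  # bytes/s, assumption

print(bandwidth_used(150e6, 8) / LINK_BW)   # small fraction of the link
print(bandwidth_used(25e6, 256) / LINK_BW)  # close to the link limit
```

The point of the sketch is the crossover: once r * s approaches the link bandwidth, adding or removing queue pairs no longer changes throughput, which matches the observation that QP count matters only for small messages.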

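Several comparisons in the related work hinge on snapshot isolation's cheap consistent reads. The following minimal single-node sketch shows the SI visibility rule (a reader with read timestamp rts sees the newest version with commit timestamp <= rts); it is an illustration only and does not model NAM-DB's one-sided RDMA implementation or its timestamp oracle.

```python
# Minimal single-node model of snapshot isolation reads (illustrative).
class VersionedStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value)

    def commit(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, read_ts):
        """Return the newest version with commit_ts <= read_ts."""
        visible = [(ts, v) for ts, v in self.versions.get(key, [])
                   if ts <= read_ts]
        return max(visible)[1] if visible else None

store = VersionedStore()
store.commit("x", "v1", commit_ts=10)
store.commit("x", "v2", commit_ts=20)

print(store.read("x", read_ts=15))  # v1: the ts=20 version is invisible
print(store.read("x", read_ts=25))  # v2
```

Because a reader's snapshot is fixed by its read timestamp, reads never block on concurrent writers, which is why snapshot isolation is attractive for read-heavy workloads.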
Furthermore, there is a huge body of work on distributed transaction processing over slow networks. In order to reduce the network overhead, many techniques have been proposed, ranging from locality-aware partitioning schemes [35, 33, 12, 43] and speculative execution [32] to new consistency levels [24, 4, 3] and the relaxation of durability guarantees [25].

Finally, there is also recent work on high-performance OLTP systems for many-core machines [23, 42, 34]. This line of work is largely orthogonal to ours, as it focuses on scale-up rather than scale-out. However, our timestamp oracle could be used in a scale-up solution to achieve better scalability. In addition, it should be noted that our current system is able to achieve a quarter of the performance of the current scale-up record [42]: ≈ 59 Ktps/core vs. ≈ 16.8 Ktps/core (only counting new-order transactions). This is an impressive result, as it is generally easier and more cost-efficient to incrementally scale out than to scale up a system. Furthermore, future increases in the available network bandwidth are likely to close the gap (recall that the bandwidth of dual-port FDR 4x is roughly a quarter of the memory bandwidth and our system is bandwidth bound).

9 Conclusions

We presented NAM-DB, a novel scalable distributed database system which uses distributed transactions by default and considers locality as an optimization. We further presented techniques to achieve scalable timestamps for snapshot isolation, as well as showed how to implement Snapshot Isolation using one-sided RDMA operations. Our evaluation shows nearly perfect linear scale-out to up to 56 machines and a total TPC-C throughput of 6.5 million transactions per second, significantly more than the state-of-the-art. In the future, we plan to investigate avenues such as distributed index design for RDMA and to study the design of other isolation levels in more detail. This work ends the myth that distributed transactions do not scale and shows that NAM-DB is at most limited by the workload itself.

10 References

[1] M. Armbrust et al. PIQL: Success-tolerant query processing in the cloud. PVLDB, 5(3):181–192, 2011.
[2] Scaling out with Azure SQL Database. https://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-introduction/.
[3] P. Bailis et al. Eventual consistency today: Limitations, extensions, and beyond. Communications of the ACM, 56(5):55–63, 2013.
[4] P. Bailis et al. Scalable atomic visibility with RAMP transactions. ACM Transactions on Database Systems (TODS), 41(3):15, 2016.
[5] D. Barbará-Millá et al. The demarcation protocol: A technique for maintaining constraints in distributed database systems. VLDB Journal, 3(3):325–353, 1994.
[6] V. Barshai et al. Delivering Continuity and Extreme Capacity with the IBM DB2 pureScale Feature. IBM Redbooks, 2012.
[7] C. Barthels et al. Rack-scale in-memory join processing using RDMA. In Proc. of ACM SIGMOD, pages 1463–1475, 2015.
[8] H. Berenson et al. A critique of ANSI SQL isolation levels. ACM SIGMOD Record, 24(2):1–10, 1995.
[9] C. Binnig et al. Distributed snapshot isolation: global transactions pay globally, local transactions pay locally. VLDB Journal, 23(6):987–1011, 2014.
[10] C. Binnig et al. The end of slow networks: It's time for a redesign. PVLDB, 9(7):528–539, 2016.
[11] M. Brantner et al. Building a database on S3. In Proc. of ACM SIGMOD, pages 251–264, 2008.
[12] C. Curino et al. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3(1-2):48–57, 2010.
[13] A. Dragojević et al. FaRM: Fast remote memory. In Proc. of NSDI, pages 401–414, 2014.
[14] A. Dragojević et al. No compromises: distributed transactions with consistency, availability, and performance. In Proc. of OSDI, pages 54–70, 2015.
[15] G. Eadon et al. Supporting table partitioning by reference in Oracle. In Proc. of ACM SIGMOD, pages 1111–1122, 2008.
[16] S. Elnikety et al. Database replication using generalized snapshot isolation. In IEEE SRDS 2005, pages 73–84, 2005.
[17] J. M. Faleiro et al. Rethinking serializable multiversion concurrency control. PVLDB, 8(11):1190–1201, 2015.
[18] A. Fekete. Snapshot isolation. In Encyclopedia of Database Systems. Springer US, 2009.
[19] P. W. Frey et al. A spinning join that does not get dizzy. In Proc. of ICDCS, pages 283–292, 2010.
[20] A. Kalia et al. Using RDMA efficiently for key-value services. In Proc. of ACM SIGCOMM, pages 295–306, 2014.
[21] A. Kalia et al. FaSST: fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. of OSDI, pages 185–201, 2016.
[22] R. Kallman et al. H-store: a high-performance, distributed main memory transaction processing system. PVLDB, 1(2):1496–1499, 2008.
[23] H. Kimura. FOEDUS: OLTP engine for a thousand cores and NVRAM. In Proc. of ACM SIGMOD, pages 691–706, 2015.
[24] T. Kraska et al. Consistency rationing in the cloud: pay only when it matters. PVLDB, 2(1):253–264, 2009.
[25] V. Krishnaswamy et al. Relative serializability: An approach for relaxing the atomicity of transactions. J. Comput. Syst. Sci., 55(2):344–354, 1997.
[26] J. Lee et al. High-performance transaction processing in SAP HANA. IEEE Data Eng. Bull., 36(2):28–33, 2013.
[27] J. J. Levandoski et al. High performance transactions in Deuteronomy. In CIDR 2015, Online Proceedings, 2015.
[28] F. Li et al. Accelerating relational databases by leveraging remote memory and RDMA. In Proc. of ACM SIGMOD, pages 355–370, 2016.
[29] S. Loesing et al. On the design and scalability of distributed shared-data databases. In Proc. of ACM SIGMOD, pages 663–676, 2015.
[30] C. Mitchell et al. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proc. of USENIX ATC, pages 103–114, 2013.
[31] J. Ousterhout et al. The case for RAMCloud. Communications of the ACM, 54(7):121–130, 2011.
[32] A. Pavlo et al. On predictive modeling for optimizing transaction execution in parallel OLTP systems. PVLDB, 5(2):85–96, 2011.
[33] A. Pavlo et al. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proc. of ACM SIGMOD, pages 61–72, 2012.
[34] D. Porobic et al. Characterization of the impact of hardware islands on OLTP. VLDB Journal, 25(5):625–650, 2016.
[35] A. Quamar et al. SWORD: scalable workload-aware data placement for transactional workloads. In Proc. of EDBT, pages 430–441, 2013.
[36] Delivering Application Performance with Oracle's InfiniBand Technology, 2012.
[37] K. Ren et al. Lightweight locking for main memory database systems. PVLDB, 6(2):145–156, 2012.
[38] W. Rödiger et al. Flow-join: Adaptive skew handling for distributed joins over high-speed networks. In Proc. of ICDE, pages 1194–1205, 2016.
[39] M. Stonebraker and U. Cetintemel. "One size fits all": an idea whose time has come and gone. In Proc. of ICDE, pages 2–11, 2005.
[40] A. Thomson et al. The case for determinism in database systems. PVLDB, 3(1-2):70–80, 2010.
[41] TPC-C. http://www.tpc.org/tpcc/.
[42] T. Wang et al. Mostly-optimistic concurrency control for highly contended dynamic workloads on a thousand cores. PVLDB, 10(2):49–60, 2016.
[43] E. Zamanian et al. Locality-aware partitioning in parallel database systems. In Proc. of ACM SIGMOD, pages 17–30, 2015.
