HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions in Multi-Tenant Storage Systems

Daehyeok Kim¹∗, Amirsaman Memaripour²∗, Anirudh Badam³, Yibo Zhu³, Hongqiang Harry Liu³†, Jitu Padhye³, Shachar Raindel³, Steven Swanson², Vyas Sekar¹, Srinivasan Seshan¹
¹Carnegie Mellon University, ²UC San Diego, ³Microsoft

ABSTRACT
Storage systems in data centers are an important component of large-scale online services. They typically perform replicated transactional operations for high data availability and integrity. Today, however, such operations suffer from high tail latency even with recent kernel-bypass and storage optimizations, and thus affect the predictability of the end-to-end performance of these services. We observe that the root cause of the problem is the involvement of the CPU, a precious commodity in multi-tenant settings, in the critical path of replicated transactions. In this paper, we present HyperLoop, a new framework that removes the CPU from the critical path of replicated transactions in storage systems by offloading them to commodity RDMA NICs, with non-volatile memory as the storage medium. To achieve this, we develop new and general NIC-offloading primitives that can perform memory operations on all nodes in a replication group while guaranteeing ACID properties without CPU involvement. We demonstrate that popular storage applications can be easily optimized using our primitives. Our evaluation results with microbenchmarks and application benchmarks show that HyperLoop can reduce 99th percentile latency ≈ 800× with close to 0% CPU consumption on replicas.

CCS CONCEPTS
• Networks → Data center networks; • Information systems → Remote replication; • Computer systems organization → Cloud computing;

KEYWORDS
Distributed storage systems; replicated transactions; RDMA; NIC-offloading

ACM Reference Format:
Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitu Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, Srinivasan Seshan. 2018. HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions in Multi-Tenant Storage Systems. In SIGCOMM '18: ACM SIGCOMM 2018 Conference, August 20–25, 2018, Budapest, Hungary. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3230543.3230572

∗The first two authors contributed equally to this work.
†The author is now with Alibaba Group.

1 INTRODUCTION
Distributed storage systems are an important building block for modern online services. To guarantee data availability and integrity, these systems keep multiple replicas of each data object on different servers [3, 4, 8, 9, 17, 18] and rely on replicated transactional operations to ensure that updates are consistently and atomically performed on all replicas.

Such replicated transactions can incur large and unpredictable latencies, and thus impact the overall performance of storage-intensive applications [52, 57, 58, 75, 86, 92]. Recognizing this problem, both the networking and storage communities have proposed a number of solutions to reduce the average and tail latencies of such systems.

Networking proposals include kernel-bypass techniques, such as RDMA (Remote Direct Memory Access) [64], and userspace networking technologies [26, 90]. Similarly, there have been efforts to integrate non-volatile main memory (NVM) [6, 11] and userspace solid-state disks (SSDs) [7, 29, 98] that bypass the OS storage stack to reduce latency.

While optimizations such as kernel bypass do improve the performance of standalone storage services and appliance-like systems where only a single service runs in the entire cluster [60, 100, 101], they are unable to provide low and predictable latency for multi-tenant storage systems.

The problem is twofold. First, many of the proposed techniques for reducing average and tail latencies rely on the CPU for I/O polling [7, 26]. In a multi-tenant cloud setting, however, providers cannot afford to burn cores in this manner and remain economically viable. Second, even without polling, the CPU is involved in too many steps of a replicated storage transaction. Take a write operation as an example: i) the CPU is needed for logging, processing of log records, and truncation of the log, to ensure that all the modifications listed in a transaction happen atomically; ii) the CPU also runs a consistency protocol to ensure all the replicas reach identical states before sending an ACK to the client; iii) during transactions, the CPU must be involved to lock all replicas, providing isolation between different transactions for correctness; iv) finally, the CPU ensures that the data from the network stack reaches a durable storage medium before sending an ACK.
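To make these four steps concrete, the following is a minimal sketch of the CPU-driven write path on a backup replica. It is illustrative only: every helper function is a hypothetical placeholder, not code from any real storage system.

/* Sketch of a conventional backup replica's write path. Each step marked
 * (i)-(iv) corresponds to the CPU involvement described above. All helper
 * functions are hypothetical placeholders. */
struct log_record;                                /* opaque transactional log entry */

extern struct log_record *recv_log_record(void);  /* pull record off the network stack */
extern void lock_object(struct log_record *r);    /* (iii) isolation */
extern void unlock_object(struct log_record *r);
extern void append_log(struct log_record *r);     /* (i) write-ahead logging */
extern void apply_and_truncate_log(void);         /* (i) atomic commit of logged updates */
extern void flush_to_nvm(struct log_record *r);   /* (iv) durability */
extern void send_ack(struct log_record *r);       /* (ii) consistency-protocol ACK */

void backup_replica_loop(void)
{
    for (;;) {
        struct log_record *rec = recv_log_record();
        lock_object(rec);          /* (iii) serialize with concurrent transactions */
        append_log(rec);           /* (i) log so the update can be replayed */
        apply_and_truncate_log();  /* (i) make the listed modifications atomic */
        flush_to_nvm(rec);         /* (iv) reach durable media before acknowledging */
        unlock_object(rec);
        send_ack(rec);             /* (ii) replicas converge to identical state */
    }
}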
To guarantee these ACID properties, the whole transaction has to stop and wait for the CPU to finish its tasks at each of the four steps. Unfortunately, in multi-tenant storage systems, which co-locate 100s of database instances on a single server [67, 92] to improve utilization, the CPU is likely to incur frequent context switches and other scheduling issues.

Thus, we explore an alternative approach for predictable replicated transaction performance in a multi-tenant environment: completely removing the CPU from the critical path. In this paper, we present HyperLoop, a design that achieves this goal. Tasks that previously needed to run on the CPU are entirely offloaded to commodity RDMA NICs, with non-volatile memory (NVM) as the storage medium, without the need for CPU polling. Thus, HyperLoop achieves predictable performance (up to an 800× reduction of 99th percentile latency in microbenchmarks!) with nearly 0% CPU usage. Our insight is driven by the observation that replicated transactional operations are essentially a set of memory operations, and are thus viable candidates for offloading to RDMA NICs, which can directly access or modify contents in NVM.

In designing HyperLoop, we introduce new and general group-based NIC-offloading primitives for NVM access, in contrast to conventional RDMA operations that only offload point-to-point communications via volatile memory. HyperLoop has a necessary mechanism for accelerating replicated transactions, which helps perform logically identical and semantically powerful memory operations on a group […] same logical lock across all replicas, and durably flushing the volatile data across all replicas. HyperLoop can help offload these to the NIC.

Realizing these primitives, however, is not straightforward with existing systems. Our design entails two key technical innovations to this end. First, we repurpose a less-studied yet widely supported RDMA operation that lets a NIC wait for certain events before executing RDMA operations held in a special queue. This enables us to pre-post RDMA operations that are triggered only when the transactional requirements are met. Second, we develop a remote RDMA-operation-posting scheme that allows a NIC to enqueue RDMA operations on other NICs in the network. This is realized by modifying the NIC driver and registering the driver metadata region itself to be RDMA-accessible (with safety checks) from other NICs. By combining these two techniques, an RDMA-capable NIC can precisely program a group of other NICs to perform replicated transactions.
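As an illustration of the first technique, the sketch below chains a WAIT work request to an RDMA WRITE on a forwarding queue pair: the NIC holds the pre-posted WRITE until a completion appears on the receive completion queue (i.e., until the log record has arrived from the predecessor), then forwards the data to the next replica with no CPU on the path. This is a sketch under stated assumptions, not the paper's exact code: it assumes Mellanox's experimental cross-channel verbs as shipped in MLNX_OFED (ibv_exp_post_send with IBV_EXP_WR_CQE_WAIT, on a QP created with IBV_EXP_QP_CREATE_CROSS_CHANNEL); these identifiers are not in upstream libibverbs, and the struct field names vary across driver versions, so treat them as illustrative.

/* Pre-post a WAIT + RDMA WRITE pair so the NIC itself forwards an incoming
 * log record to the next replica. Field names follow the MLNX_OFED
 * experimental verbs API and are illustrative, not authoritative. */
#include <stdint.h>
#include <infiniband/verbs_exp.h>

int prepost_forward(struct ibv_qp *fwd_qp,   /* cross-channel QP toward next replica */
                    struct ibv_cq *recv_cq,  /* CQ where the incoming record completes */
                    struct ibv_sge *log_sge, /* registered local log buffer */
                    uint64_t raddr, uint32_t rkey)
{
    struct ibv_exp_send_wr wait_wr = {0}, write_wr = {0}, *bad_wr;

    /* WAIT: stall the send queue until one completion lands on recv_cq,
     * i.e., until the predecessor's write has delivered the log record. */
    wait_wr.exp_opcode = IBV_EXP_WR_CQE_WAIT;
    wait_wr.task.cqe_wait.cq = recv_cq;
    wait_wr.task.cqe_wait.cq_count = 1;
    wait_wr.next = &write_wr;

    /* WRITE: released by the WAIT above; pushes the just-received log
     * record into the next replica's (NVM-backed) memory region. */
    write_wr.exp_opcode = IBV_EXP_WR_RDMA_WRITE;
    write_wr.sg_list = log_sge;
    write_wr.num_sge = 1;
    write_wr.wr.rdma.remote_addr = raddr;
    write_wr.wr.rdma.rkey = rkey;
    write_wr.exp_send_flags = IBV_EXP_SEND_SIGNALED;

    /* Both WRs sit in the NIC's queue before any data arrives; the chain
     * fires later without any CPU involvement on this node. */
    return ibv_exp_post_send(fwd_qp, &wait_wr, &bad_wr);
}

Posted once per expected log record, such pre-posted chains are what let a replication group's NICs forward and commit updates while the replicas' CPUs stay idle.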
Our approach is quite general, as it only uses commodity RDMA NICs, so applications can adopt our primitives with ease. For example, we modified RocksDB (an open-source alternative to Google LevelDB) and MongoDB (an open-source alternative to Azure CosmosDB and Amazon DynamoDB) to use HyperLoop with under 1000 lines of code. We evaluate the performance of these systems using microbenchmarks as well as the Yahoo! Cloud Serving Benchmark (YCSB) workload. The results show that running MongoDB with HyperLoop decreases the average latency of insert/update operations by 79% and reduces the gap between the average and the 99th percentile by 81%, while CPU usage on backup nodes goes down from nearly 100% to almost 0%. Further, microbenchmarks show that HyperLoop-based group memory operations can be performed more than 50× and 800× faster than conventional RDMA-based operations in the average and tail cases, respectively.

2 BACKGROUND & MOTIVATION
In this section, we briefly define the key operations of replicated storage systems. We also present benchmarks that highlight the high tail latency incurred by these systems in a multi-tenant setting, even with state-of-the-art optimizations such as kernel bypass.

2.1 Storage Systems in Data Centers
Replicated storage systems maintain multiple copies of data to deal with failures. For instance, large-scale block stores [20, 22, 36, 53, 63], key-value stores [23, 43], and