Distributed Lock Management with RDMA: Decentralization without Starvation

Dong Young Yoon, Mosharaf Chowdhury, Barzan Mozafari
University of Michigan, Ann Arbor

ABSTRACT

Lock managers are a crucial component of modern distributed systems. However, with the increasing availability of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications in order to maintain global knowledge, which can result in significantly lower throughput.

In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager primarily using RDMA's fetch-and-add (FA) operations, which always succeed, rather than compare-and-swap (CAS) operations, which only succeed if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts.

Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems running on RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by guaranteeing first-come-first-serve (FCFS) scheduling without maintaining explicit queues. We adapt Lamport's bakery algorithm [36] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that, on average, DSLR delivers 1.8× (and up to 2.8×) higher throughput than all existing RDMA-based lock managers, while reducing their mean and 99.9th-percentile latencies by 2.0× and 18.3× (and up to 2.5× and 47×), respectively.

ACM Reference Format:
Dong Young Yoon, Mosharaf Chowdhury, and Barzan Mozafari. 2018. Distributed Lock Management with RDMA: Decentralization without Starvation. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3183713.3196890

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGMOD'18, June 10–15, 2018, Houston, TX, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-4703-7/18/06...$15.00
https://doi.org/10.1145/3183713.3196890

1 INTRODUCTION

With the advent of high-speed RDMA networks and affordable memory prices, distributed in-memory systems have become increasingly common [39, 57, 65]. The reason for this rising popularity is simple: many of today's workloads can fit within the memory of a handful of machines, and they can be processed and served over RDMA-enabled networks at a significantly faster rate than with traditional architectures.

A primary challenge in distributed computing is lock management, which forms the backbone of many distributed systems accessing shared resources over the network. Examples include OLTP databases [53, 59], distributed file systems [22, 56], in-memory storage systems [39, 51], and any system that requires synchronization, consensus, or leader election [14, 28, 37]. In a transactional setting, the key responsibility of a lock manager (LM) is ensuring both serializability (or other forms of isolation) and starvation-free behavior of competing transactions [49].

Centralized Lock Managers (CLM). In traditional distributed databases, each node is in charge of managing the locks for its own objects (i.e., tuples hosted on that node) [3, 8, 25, 34]. In other words, before remote nodes or transactions can read or update a particular object, they must communicate with the LM daemon running on the node hosting that object. Only after the local LM grants the lock can the remote node or transaction proceed to read or modify the object. Although distributed, these LMs are still centralized, as each LM instance represents a central point of decision for granting locks on the set of objects assigned to it.

Because each CLM instance has global knowledge and full visibility into the operations performed on its objects, it can guarantee many strong and desirable properties. For example, it can queue all lock requests and take holistic actions [61, 63], prevent starvation [25, 30, 34], and even employ sophisticated scheduling of incoming requests to bound tail latencies [26, 33, 40, 66].

Unfortunately, the operational simplicity of having global knowledge is not free [23], even with low-latency RDMA networks. First, the CLM itself, specifically its CPU, becomes a performance bottleneck for applications that require high-throughput locking/unlocking operations and as transactional workloads scale up or out. Second, the CLM becomes a single point of failure [23]. Consequently, many modern distributed systems do not use CLMs in practice; instead, they rely on decentralized lock managers, which offer better scalability and reliability by avoiding a single point of contention and failure [23].

Decentralized Lock Managers (DLM). To avoid the drawbacks of centralization and to exploit fast RDMA networks, decentralized lock managers are becoming more popular [15, 18, 47, 62]. This is because RDMA operations enable transactions to acquire and release locks on remote machines at extremely low latencies, without involving any remote CPUs. In contrast to CLMs, RDMA-based decentralized approaches offer better CPU usage, scalability, and fault tolerance.

Unfortunately, existing RDMA-based DLMs take an extremist approach, where they either completely forgo the benefits of maintaining global knowledge and rely on blind fail-and-retry strategies to achieve higher throughput [15, 62], or they emulate global knowledge using distributed queues and additional network round-trips [47]. The former can cause starvation and thereby higher tail latencies, while the latter significantly lowers throughput, undermining the performance benefits of decentralization.

Challenges. There are two key challenges in designing an efficient DLM. First, to avoid data races, RDMA-based DLMs [15, 62] must only rely on RDMA atomic operations: fetch-and-add (FA) and compare-and-swap (CAS). FA atomically adds a constant to a remote variable and returns the previous value of the variable. CAS atomically compares a constant to the remote variable and updates the variable only if the constant matches the previous value. Although CAS does not always guarantee successful completion, it is easy to reason about, which is why all previous RDMA-based DLMs have relied on CAS for implementing exclusive locks [15, 18, 47, 62]. Consequently, when there is high contention in the system, protocols relying on CAS require multiple retries to acquire a lock. These blind and unbounded retries cause starvation, which increases tail latency. Second, the lack of global knowledge complicates other run-time issues, such as deadlock detection and mitigation.

[...]

(2) To compete with existing RDMA-based DLMs, our algorithm must be able to acquire locks on uncontended objects using a single RDMA atomic operation. However, the original bakery algorithm requires setting a semaphore, reading the existing tickets, and assigning the requester the maximum ticket value.

DSLR overcomes all of these challenges (see §4) and, to the best of our knowledge, is the first DLM to extend Lamport's bakery algorithm to an RDMA context.

Contributions. We make the following contributions:

(1) We propose a fully decentralized and distributed locking algorithm, DSLR, that extends Lamport's bakery algorithm and combines it with novel RDMA protocols. Not only does DSLR prevent lock starvation, but it also delivers higher throughput and significantly lower tail latencies than previous proposals.

(2) DSLR provides fault tolerance for transactions that fail to release their acquired locks or fall into a deadlock. DSLR achieves this goal by utilizing leases and determining lease expirations using a locally calculated elapsed time.

(3) Through extensive experiments on TPC-C and micro-benchmarks, we show that DSLR outperforms existing RDMA-based LMs; on average, it delivers 1.8× (and up to 2.8×) higher throughput, and 2.0× and 18.3× (and up to 2.5× and 47×) lower average and 99.9th-percentile latencies, respectively.

The rest of this paper is organized as follows. Section 2 provides background material and the motivation behind DSLR. Section 3 discusses the design challenges involved in distributed and decentralized locking. Section 4 explains DSLR's algorithm and design decisions for overcoming these challenges. Section 5 describes how DSLR offers additional features often used by modern database systems. Section 6 presents our experimental results, and Section 7 discusses related work.
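The contrast between FA and CAS drawn in the introduction can be made concrete with a small local simulation. The sketch below is illustrative only: `RemoteWord`, `cas_try_lock`, and `TicketLock` are hypothetical names, a Python mutex stands in for the atomicity that an RDMA NIC would provide, and the ticket scheme is a plain FA-based ticket lock rather than DSLR's actual protocol (which Section 4 describes). It shows that a CAS attempt can fail and force a retry, whereas an FA always succeeds and hands each requester a position in FCFS order.

```python
import threading


class RemoteWord:
    """Local stand-in for a 64-bit word in remote memory (hypothetical helper)."""

    def __init__(self, value=0):
        self._value = value
        self._mutex = threading.Lock()  # emulates the atomicity of NIC-side atomics

    def read(self):
        with self._mutex:
            return self._value

    def fetch_and_add(self, delta):
        """FA: always succeeds and returns the value held before the addition."""
        with self._mutex:
            old = self._value
            self._value += delta
            return old

    def compare_and_swap(self, expected, new):
        """CAS: writes `new` only if the word equals `expected`; returns the prior value."""
        with self._mutex:
            old = self._value
            if old == expected:
                self._value = new
            return old


def cas_try_lock(word, owner_id):
    """One CAS attempt on a word where 0 means free. Fails if the lock is already held."""
    return word.compare_and_swap(0, owner_id) == 0


class TicketLock:
    """Plain FA-based ticket lock: an assumption for illustration, not DSLR itself."""

    def __init__(self):
        self.next_ticket = RemoteWord(0)
        self.now_serving = RemoteWord(0)

    def take_ticket(self):
        # A single FA both enqueues the requester and tells it its position.
        return self.next_ticket.fetch_and_add(1)

    def is_my_turn(self, ticket):
        return self.now_serving.read() == ticket

    def release(self):
        self.now_serving.fetch_and_add(1)


# CAS-based locking: three requesters try once each; only the first one wins,
# and the others would have to retry blindly.
cas_word = RemoteWord(0)
cas_results = [cas_try_lock(cas_word, owner) for owner in (1, 2, 3)]

# FA-based locking: every requester gets a ticket on its first attempt and is
# then served strictly in ticket order.
lock = TicketLock()
tickets = [lock.take_ticket() for _ in range(3)]
served = []
for _ in tickets:
    for requester, ticket in enumerate(tickets):
        if requester not in served and lock.is_my_turn(ticket):
            served.append(requester)
            lock.release()
            break

print(cas_results, tickets, served)
```

In this simplified model the requesters that lose the CAS learn nothing about when to try again, while each ticket holder knows exactly how many requesters are ahead of it, which is the kind of partial knowledge the abstract refers to.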
