Flexible Paxos: Quorum Intersection Revisited

Flexible Paxos: Quorum Intersection Revisited Heidi Howard1,2, Dahlia Malkhi1, and Alexander Spiegelman1,3 1 VMware Research, Palo Alto, US [email protected] 2 University of Cambridge Computing Laboratory, Cambridge, UK [email protected] 3 Viterbi Dept. of Electrical Engineering, Technion Haifa, 32000, Israel [email protected] Abstract Distributed consensus is integral to modern distributed systems. The widely adopted Paxos algorithm uses two phases, each requiring majority agreement, to reliably reach consensus. In this paper, we demonstrate that Paxos, which lies at the foundation of many production systems, is conservative. Specifically, we observe that each of the phases of Paxos may use non-intersecting quorums. Majority quorums are not necessary as intersection is required only across phases. Using this weakening of the requirements made in the original formulation, we propose Flex- ible Paxos, which generalizes over the Paxos algorithm to provide flexible quorums. We show that Flexible Paxos is safe, efficient and easy to utilize in existing distributed systems. We discuss far reaching implications of this result. For example, improved availability results from reducing the size of second phase quorums by one when the system size is even, while keeping majority quorums in the first phase. Another example is improved throughput of replication by using much smaller phase 2 quorums, while increasing the leader election (phase 1) quorums. Finally, non intersecting quorums in either first or second phases may enhance the efficiency of both. 1998 ACM Subject Classification C.2.4 Distributed Systems Keywords and phrases Paxos, Distributed Consensus, Quorums Digital Object Identifier 10.4230/LIPIcs.CVIT.2016.23 1 Introduction Distributed consensus is the problem of reaching agreement in the face of failures. It is a common problem in modern distributed systems and its applications range from distributed locking and atomic broadcast to strongly consistent key value stores and state machine replication [35]. Lamport’s Paxos algorithm [18, 19] is one such solution to this problem and since its publication it has been widely built upon in teaching, research and practice. At its core, Paxos uses two phases, each requires agreement from a subset of participants (known as a quorum) to proceed. The safety and liveness of Paxos is based on the guarantee that any two quorums will intersect. To satisfy this requirement, quorums are typically composed of any majority from a fixed set of participants, although other quorum schemes have been proposed. In practice, we usually wish to reach agreement over a sequence of values, known as Multi-Paxos [19]. We use the first phase of Paxos to establish one participant as a leader and the second phase of Paxos to propose a series of values. To commit a value, the leader must always communicate with at least a quorum of participants and wait for them to accept the value. © Heidi Howard, Dahlia Malkhi and Alexander Spiegelman; licensed under Creative Commons License CC-BY 42nd Conference on Very Important Topics (CVIT 2016). Editors: John Q. Open and Joan R. Access; Article No. 23; pp. 23:1–23:17 Leibniz International Proceedings in Informatics Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany 23:2 Flexible Paxos: Quorum Intersection Revisited In this paper, we weaken the requirement in the original protocol that all quorums intersect to require only that quorums from different phases intersect. Within each of the phases of Paxos, it is safe to use disjoint quorums and majority quorums are not necessary. We will refer to this new formulation as Flexible Paxos (FPaxos) as it allows developers the flexibility to choose quorums for the two phases, provided they meet the above requirement. FPaxos is strictly more general than Paxos and FPaxos with intersecting quorums is equivalent to Paxos. Given that Multi-Paxos and its variants are widely deployed, such a result has wide reaching practical applications. Since the second phase of Paxos (replication) is far more common than the first phase (leader election), we can use FPaxos to reduce the size of commonly used second phase quorums. For example, in a system of 10 nodes, we can safely allow any set of only 3 nodes to participate in replication, provided that we require 8 nodes to participate when recovering from leader failure. This strategy, decreasing phase 2 quorums at the cost of increasing phase 1 quorums, is referred to in the body of the paper as simple quorums. The simple quorum system reduces latency, as leaders will no longer be required to wait for a majority of participants to accept proposals. Likewise, it improves steady state throughput as disjoint sets of participants can now accept proposals, enabling better utilization of participants and decreased network load. The price we pay for this is reduced availability as the system can tolerate fewer failures whilst recovering from leader failure. However, it is not always necessary to compromise availability for improved steady state performance. For example, in a system of 10 nodes, we can safely use any set of 5 nodes to form phase 2 quorums and any set of 6 nodes to form phase 1 quorums. This improves both the performance and availability of the second phase, without hurting availability of the first phase at all. In the paper we also illustrate that surprisingly, there are quorum systems such as grid quorums, in which FPaxos allows us to decrease the quorum sizes of both phases. Furthermore, in this quorum system, the quorums within either phase do not intersect with each other. In the following section we outline the basic Paxos algorithm using the standard termino- logy. Readers who are already familiar with the algorithm should proceed directly to the next section. In §3 we describe the observation in detail and then in §4 motivate why such flexibility is useful in practice. §5 gives an informal description of why it is safe to weaken Paxos’s assumption on quorum intersection. In §6 we evaluate a naïve implementation of FPaxos and demonstrate its usefulness. §7 outlines how to dynamically choose quorums and §8 relates FPaxos to the existing work in the field. The appendix includes a TLA+ [20] specification of the FPaxos algorithm which has been model checked against our safety assumption. 2 Paxos We wish to decide a single value v between a set of processes. The system is asynchronous, each process may fail and the messages passed between them may be lost. Each process has one or more roles. We have three roles: the proposer, a process who wishes to have a particular value chosen, the acceptor, a process which agrees and persists decided values or the learner, a process wishing to learn the decided value. A proposer who has a candidate value will try to propose the value to the acceptors. If a value has already been chosen, the proposer will instead learn it. The process of proposing a value has two stages: phase 1 and phase 2, each phase requires a majority of acceptors to H. Howard, D. Malkhi and A Spiegelman 23:3 agree in order to proceed. We will now look at each of these stages in details: Phase 1 - Prepare & Promise i A proposer selects a unique proposal number p and sends prepare(p) to the acceptors. ii Each acceptor receives prepare(p). If p is the highest proposal number promised, then p is written to persistent storage and the acceptor replies with promise(p’,v’). (p’,v’) is the last accepted proposal (if present) where p’ is the proposal number and v’ is the corresponding proposed value. iii Once the proposer receives promise from the majority of acceptors, it proceeds to phase two. Otherwise, it may try again with higher proposal number. Phase 2 - Propose & Accept i The proposer must now select a value v. If more than one proposal was returned in phase 1 then it must choose the value associated with the highest proposal number. If no proposals were returned, then the proposer can choose its own value for v. The proposer then sends propose(p,v) to the acceptors. ii Each acceptor receives a propose(p,v). If p is equal to or greater than the highest promised proposal number, then the promised proposal number and accepted proposal is written to persistent storage and the acceptor replies with accept(p). iii Once the proposer receives accept(p) from the majority of acceptors, it learns that the value v is decided. Otherwise, it may try phase 1 again with a higher proposal number. Paxos guarantees that once a value is decided, the decision is final and no different value can be chosen. Paxos will reach agreement provided that bn/2c + 1 acceptors out of n acceptors are up and are able to communicate. Proving progress requires us to make some assumptions about the synchrony of the system, as we cannot guarantee progress in a truly asynchronous systems [7]. Usually, we wish to reach agreement over a sequence of values, which we will refer to as slots. We could use distinct instances of Paxos to decide each value in the sequence i.e. the ith slot is decided by the ith instance of Paxos. In practice however, we can do much better and this is referred to as Multi-Paxos. The first phase of Paxos is independent of the value proposed for any given instance, therefore phase 1 can be executed prior to knowledge of which value to propose. Furthermore, we can aggregate phase 1 over a series of slots. We refer to a proposer who has completed phase 1 as a leader. To avoid loss of generality, we introduce another agent, the client who is the origin of values for proposal.

Flexible Paxos: Quorum Intersection Revisited

Understanding Distributed Consensus

Distributed Consensus Revised

Distributed Consensus: Performance Comparison of Paxos and Raft

42 Paxos Made Moderately Complex

CS5412: Lecture II How It Works

State Machine Replication Is More Expensive Than Consensus

A History of the Virtual Synchrony Replication Model Ken Birman

A Formal Treatment of Lamport's Paxos Algorithm

Consensus on Transaction Commit

Consensus on Transaction Commit

In Search of Lost Time

The Impact of RDMA on Agreement [Extended Version]