Distributed Consensus Revised

UCAM-CL-TR-935 Technical Report ISSN 1476-2986 Number 935 Computer Laboratory Distributed consensus revised Heidi Howard April 2019 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 https://www.cl.cam.ac.uk/ c 2019 Heidi Howard This technical report is based on a dissertation submitted September 2018 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Pembroke College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: https://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Distributed consensus revised Heidi Howard Summary We depend upon distributed systems in every aspect of life. Distributed consensus, the abil- ity to reach agreement in the face of failures and asynchrony, is a fundamental and powerful primitive for constructing reliable distributed systems from unreliable components. For over two decades, the Paxos algorithm has been synonymous with distributed consensus. Paxos is widely deployed in production systems, yet it is poorly understood and it proves to be heavyweight, unscalable and unreliable in practice. As such, Paxos has been the subject of extensive research to better understand the algorithm, to optimise its performance and to mitigate its limitations. In this thesis, we re-examine the foundations of how Paxos solves distributed consensus. Our hypothesis is that these limitations are not inherent to the problem of consensus but instead specific to the approach of Paxos. The surprising result of our analysis is a substantial weakening to the requirements of this widely studied algorithm. Building on this insight, we are able to prove an extensive generalisation over the Paxos algorithm. Our revised understanding of distributed consensus enables us to construct a diverse family of algorithms for solving consensus; covering classical as well as novel algorithms to reach consensus where it was previously thought impossible. We will explore the wide reaching implications of this new understanding, ranging from pragmatic optimisations to production systems to fundamentally novel approaches to consensus, which achieve new tradeoffs in performance, scalability and reliability. Acknowledgments First, and foremost, I would like to sincerely thank my advisor, Jon Crowcroft, for his tireless enthusiasm and support since our paths first crossed six years ago. Without his encouragement, I would not have believed it possible for to me to pursue a PhD. It takes a village to raise a child and it takes a research group to raise a graduate student. Since my first steps into the world of research, the Systems Research Group (SRG) at the Computer Lab has been my home. I am indebted to my mentor, Richard Mortier, for his counsel and patience, particularly during the longer than expected final stretch of my PhD. I am also indebted to Tim Harris for insightful feedback on this thesis which substantially helped toward improving the clarity and readability. I am thankful for the friendship of my former fellow students, Natacha Crooks, Malte Schwarzkopf, Matthew Grosvenor, Shehar Bano and my current fellow students, Krittika D'Silva, Zahra Tarkhani, Mohibi Hussain and Marco Caballero. In addition to those already mentioned, I am sincerely grateful to my friends beyond the SRG, especially Laura Scriven, Shreedipta Mitra and Jeunese Payne, for keeping me sane over the years. I am grateful to my colleague Martin Kleppmann for our discussions on distributed systems during key points of this PhD. I am deeply grateful to my 2nd supervisor, Anil Madhavapeddy, for adopting me into the OCaml labs community and regularly feeding me. Over the years, I have enjoyed many lively lunchtime discussions and trips to the food park with Gemma Gordon, David Allsopp and my office mates, KC Sivaramakrishnan and Stephen Dolan. I am also thankful to my former colleagues, Mindy Preston and Amir Chaudhry for their companionship during their time with me at the computer lab. I have been extremely privileged to work with Dahlia Malkhi, my mentor in the field of distributed systems. Thanks are also due to the friends, new and old, who kept me company during my time in California; Nick Spooner, Diego Ongaro, Jenny Wolochow and Igor Zablotchi. Without you all my time away from Cambridge could have been very lonely. Last but not least, I'd like to thank my husband, Olly Andrade, for the joy you have brought to my life over the last seven years as well as the countless hours spend proofreading this thesis. This thesis is dedicated to my father, Daniel Howard, who raised me and supported me throughout my life but sadly passed away after a short battle with cancer before the completion of this PhD. Contents 1 Introduction ..................................... 13 1.1 State of the art . 14 1.2 Historical background . 14 1.3 Motivation . 16 1.4 Approach . 17 1.5 Contributions . 18 1.5.1 Publications . 19 1.5.2 Follow up research . 19 1.6 Scope & limitations . 19 2 Consensus & Classic Paxos ........................... 21 2.1 Preliminaries . 21 2.1.1 Single acceptor algorithm . 23 2.2 Classic Paxos . 27 2.2.1 Proposer algorithm . 29 2.2.2 Acceptor algorithm . 31 2.3 Examples . 32 2.4 Properties . 38 2.5 Non-triviality . 39 2.6 Safety . 39 2.7 Progress . 45 2.8 Summary . 47 3 Known revisions .................................. 49 3.1 Negative responses (NACKs) . 49 3.2 Bypassing phase two . 52 3.3 Termination . 55 3.4 Distinguished proposer . 59 3.5 Phase ordering . 60 3.6 Multi-Paxos . 60 3.7 Roles . 61 3.8 Epochs . 62 3.9 Phase one voting for epochs . 63 5 3.10 Proposal copying . 65 3.11 Generalisation to quorums . 68 3.12 Miscellaneous . 69 3.13 Summary . 73 4 Quorum intersection revised .......................... 75 4.1 Quorum intersection across phases . 75 4.1.1 Algorithm . 76 4.1.2 Safety . 76 4.1.3 Examples . 78 4.2 Quorum intersection across epochs . 78 4.2.1 Algorithm . 81 4.2.2 Safety . 81 4.2.3 Examples . 83 4.3 Implications . 85 4.3.1 Bypassing phase two . 86 4.3.2 Co-location of proposers and acceptors . 86 4.3.3 Multi-Paxos . 88 4.3.4 Voting for epochs . 90 4.4 Summary . 90 5 Promises revised .................................. 91 5.1 Intuition . 91 5.2 Algorithm . 92 5.3 Safety . 92 5.4 Examples . 94 5.5 Summary . 96 6 Value selection revised .............................. 97 6.1 Epoch agnostic algorithm . 97 6.1.1 Safety . 101 6.1.2 Progress . 104 6.1.3 Examples . 105 6.2 Epoch dependent algorithm . 106 6.2.1 Safety . 108 6.2.2 Progress . 109 6.3 Summary . 111 7 Epochs revised ................................... 113 7.1 Epochs from an allocator . 114 7.2 Epochs by value mapping . 115 6 7.3 Epochs by recovery . 118 7.3.1 Intuition . 118 7.3.2 Algorithm . 120 7.3.3 Safety . 122 7.3.4 Progress . 125 7.3.5 Examples . 126 7.4 Hybrid epoch allocation . 134 7.4.1 Multi-path Paxos using allocator . 135 7.4.2 Multi-path Paxos using value-based allocation . 136 7.4.3 Multi-path Paxos using recovery . 136 7.5 Summary . 137 8 Conclusion ...................................... 141 8.1 Motivation . 141 8.2 Summary of contributions . 142 8.3 Implications of contributions . 143 8.3.1 Greater flexibility . 143 8.3.2 New progress guarantees . 144 8.3.3 Improved performance . 145 8.3.4 Better clarity . 146 Bibliography ...................................... 147 7 List of Figures 2.1 Single acceptor algorithm (Alg. 1,2) . 25 2.2 Classic Paxos with serial proposers, p1 then p2 (Alg. 3,4) ..

Distributed Consensus Revised

Understanding Distributed Consensus

Distributed Consensus: Performance Comparison of Paxos and Raft

42 Paxos Made Moderately Complex

CS5412: Lecture II How It Works

State Machine Replication Is More Expensive Than Consensus

A History of the Virtual Synchrony Replication Model Ken Birman

A Formal Treatment of Lamport's Paxos Algorithm

Consensus on Transaction Commit

Consensus on Transaction Commit

In Search of Lost Time

The Impact of RDMA on Agreement [Extended Version]

Leaderless Byzantine Paxos