Distributed Computing with Modern Shared Memory


EPFL Thesis No. 7141, presented on 20 November 2020 to the School of Computer and Communication Sciences, Distributed Computing Laboratory, doctoral program in computer and communication sciences, for the degree of Doctor of Science (Docteur ès Sciences), by Mihail Igor ZABLOTCHI. Accepted on the proposal of the jury: Prof. R. Urbanke (jury president), Prof. R. Guerraoui (thesis director), Prof. H. Attiya (examiner), Prof. N. Shavit (examiner), Prof. J.-Y. Le Boudec (examiner). 2020.

"Where we going man?" "I don't know but we gotta go." — Jack Kerouac, On the Road

To my grandmother Momo, who got me started on my academic adventure and taught me that learning can be fun.

Acknowledgements

I am first and foremost grateful to my advisor, Rachid Guerraoui. Thank you for believing in me, for encouraging me to trust my own instincts, and for opening so many doors for me. Looking back, it seems like you always made sure I had the perfect conditions to do my best possible work. Thank you also for all the things you taught me: how to get at the heart of a problem, how to stay optimistic despite all signs to the contrary, how to be pragmatic while also chasing beautiful problems.

I am grateful to my other supervisors and mentors: Yvonne Anne Pignolet and Ettore Ferranti for my internship at ABB Research, Maurice Herlihy for the time at Brown University and at Oracle Labs, Dahlia Malkhi and Ittai Abraham for my internship at VMware, Virendra Marathe and Alex Kogan for my internship at Oracle Labs and the subsequent collaboration, Marcos Aguilera for the collaboration on the RDMA work, and Aleksandar Dragojevic for the internship at Microsoft Research. Thank you for spoiling me with such interesting problems to work on and for teaching me so many valuable lessons about research.

I am grateful to the members of my PhD oral exam jury: Prof. Hagit Attiya, Prof. Jean-Yves Le Boudec, and Prof. Nir Shavit. Thank you for your insightful questions and comments, which have helped improve my thesis. I am also thankful to Prof. Rüdiger Urbanke for presiding over my thesis committee.

I am grateful to my collaborators: Naama Ben-David, Thanasis Xygkis, Karolos Antoniadis, Vincent Gramoli, Vasilis Trigonakis, Tudor David, Oana Bălmău, Julien Stainer, Peva Blanchard, and Nachshon Cohen. Discussing and working with you on tough problems has undoubtedly been the most enjoyable part of my PhD. Huge thanks to Naama and Thanasis for the hundreds of hours of pair programming, Skype meetings and deadline crunches that we shared in the last two years.

I am grateful to the administrative staff of the lab. Thank you, France, for your invaluable help on so many fronts, big and small. I always look forward to interacting with you because I know you will help me get things done quickly, and your green office feels like an oasis where I can take a breath of fresh air. I also always enjoy our chats on many different subjects. Thank you, Fabien, for your help with the technical infrastructure for my experiments; thanks to you, it has always been smooth sailing in that regard.

I am grateful to my colleagues and friends in the Distributed Computing Lab at EPFL, who have made it fun to come into work every day: Karolos, Matej, George D., Adi, George C., Thanasis, Sébastien, Le, Mahdi, Alex, Jovan, Arsany, Vlad, Tudor, David, Vasilis, Jingjing, Mahsa, Matteo, Rhicheek, Victor, and Andrei.
Thanks for all the goofy memories that have helped me make light of the often stressful PhD life. Special thanks to Karolos for always being extra supportive in my moments of doubt, and for the countless hours of (sometimes scientific) discussions. Big thanks to Le for helping with the French translation of the thesis abstract.

I am grateful to my friends outside of the lab: Corina, Camilo, Cristina, Bogdan Stoica, Olivier, Maria, Viki, Étienne, Laura, Bogdan Sassu, Călin, Răzvan, and Diana. Thank you for the conversations, laughs, hikes, parties, and adventures that we shared. Without you the PhD would have been a much lonelier experience. I am also thankful to all the interesting people I met during my internships: Irina, Dan, Aran, Sasha, Heidi, Mike, Zoe, Rashika, Lily, Petros, Dario, and Ho-Cheung. Thank you for the fun memories!

I am deeply grateful to my girlfriend, Monica, who has been by my side for most of the PhD. Thank you for supporting and encouraging me during my most difficult moments. You have been the best adventure partner I could have hoped for. I am looking forward to our next adventure!

I am profoundly grateful to my family: my parents, Gabriela and Nicolae, my grandmother, Irina (to whom this thesis is dedicated), my aunt, Mihaela, and my cousin, Silvana. Thank you for your constant, loving support, not only during the PhD, but throughout my entire education. You have always seen in me the best version of myself and given me courage to believe in that version too.

My work has been supported in part by the European Research Council (ERC) Grant 339539 (AOC).

Igor Zablotchi
Lausanne, 2020

Preface

The research presented in this dissertation was conducted in the Distributed Computing Laboratory at EPFL, under the supervision of Professor Rachid Guerraoui, between 2015 and 2020. The main results of this dissertation appear originally in the following publications (author names are in alphabetical order):

1. Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Athanasios Xygkis, Igor Zablotchi. Microsecond Consensus for Microsecond Applications. Under submission.

2. Rachid Guerraoui, Alex Kogan, Virendra Marathe, Igor Zablotchi. Efficient Multi-word Compare and Swap. In International Symposium on Distributed Computing (DISC), 2020.

3. Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Igor Zablotchi. The Impact of RDMA on Agreement. In ACM Symposium on Principles of Distributed Computing (PODC), 2019.

4. Nachshon Cohen, Rachid Guerraoui, Igor Zablotchi. The Inherent Cost of Remembering Consistently. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.

5. Oana Balmau, Rachid Guerraoui, Maurice Herlihy, Igor Zablotchi. Fast and Robust Memory Reclamation for Concurrent Data Structures. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2016.

Besides the above-mentioned publications, I was also involved in other research projects that resulted in the following publications:

1. Tudor David, Aleksandar Dragojevic, Rachid Guerraoui, Igor Zablotchi. Log-Free Concurrent Data Structures. In USENIX Annual Technical Conference (ATC), 2018.

2. Oana Balmau, Rachid Guerraoui, Vasileios Trigonakis, Igor Zablotchi. FloDB: Unlocking Memory in Persistent Key-Value Stores. In European Conference on Computer Systems (EuroSys), 2017.

3. Peva Blanchard, Rachid Guerraoui, Julien Stainer, Igor Zablotchi. The Disclosure Power of Shared Objects.
In International Conference on Networked Systems (NETYS), 2017.

Abstract

In this thesis, we revisit classic problems in shared-memory distributed computing through the lenses of (1) emerging hardware technologies and (2) changing requirements. Our contributions consist, on the one hand, in providing a better understanding of the fundamental benefits and limitations of new technologies, and on the other hand, in introducing novel, efficient tools and systems to ease the task of leveraging new technologies or meeting new requirements.

First, we look at Remote Direct Memory Access (RDMA), a networking hardware feature which enables a computer to access the memory of a remote computer without involving the remote CPU. In recent years, the distributed computing community has taken an interest in RDMA due to its ultra-low latency and high throughput and has designed systems that take advantage of these characteristics. However, we argue that the potential of RDMA for distributed computing remains largely untapped. We show that RDMA's unique semantics enable agreement algorithms which improve on fundamental trade-offs in distributed computing between performance and failure-tolerance. Furthermore, we show the practical applicability of our theoretical results through Mu, a state machine replication system which can replicate requests in under 2 microseconds, and can fail over in under 1 millisecond when failures occur. Mu's replication and fail-over latencies are at least 61% and 90% lower, respectively, than those of prior work.

Second, we focus on persistent memory, a novel class of memory technologies which is only now starting to become available. Persistent memory provides byte-addressable persistent storage with access times comparable to traditional DRAM. Recent work has focused on designing tools for working with persistent memory, but little is known about the fundamental cost of providing consistency in persistent memory. Furthermore, important shared-memory primitives do not yet have efficient persistent implementations. We provide an answer to the former question through a tight bound on the number of persistent fences required to implement a lock-free persistent object. We address the latter problem by presenting a novel efficient multi-word compare-and-swap algorithm for persistent memory.

Third and finally, we consider the current exponential increase in the amount of data worldwide. Memory capacity has been on the rise for decades, but remains scarce when compared to the rate of data growth. Given this scarcity and the prevalence of concurrent in-memory processing, the classic problem of concurrent memory reclamation
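To make the persistent-memory contribution above concrete: a multi-word compare-and-swap (MCAS) atomically updates several words if and only if each still holds an expected value. The sketch below is only a semantic reference under a single global lock, with illustrative names; the thesis's actual algorithm is lock-free and designed for persistent memory, which this sketch does not attempt to reproduce.

    // Semantic sketch of multi-word compare-and-swap (MCAS), for
    // illustration only. A single global mutex stands in for the
    // lock-free protocol; this shows what MCAS guarantees, not how
    // the thesis's algorithm achieves it.
    #include <cstdint>
    #include <mutex>
    #include <vector>

    struct McasEntry {
        uint64_t* addr;     // word to update
        uint64_t expected;  // value it must currently hold
        uint64_t desired;   // value to install on success
    };

    static std::mutex mcas_lock;  // stand-in for the lock-free mechanism

    // Atomically: if every addr holds its expected value, install all
    // desired values and return true; otherwise change nothing.
    bool mcas(std::vector<McasEntry>& entries) {
        std::lock_guard<std::mutex> guard(mcas_lock);
        for (const auto& e : entries)
            if (*e.addr != e.expected) return false;
        for (const auto& e : entries)
            *e.addr = e.desired;
        return true;
    }

A lock-free persistent variant would additionally need to flush and fence the updated words, which is where the thesis's bound on persistent fences applies.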
Recommended publications
  • Understanding Distributed Consensus
    What's Live? Understanding Distributed Consensus. Saksham Chand, Yanhong A. Liu. Computer Science Department, Stony Brook University, Stony Brook, NY 11794, USA. {schand,liu}@cs.stonybrook.edu

    Abstract: Distributed consensus algorithms such as Paxos have been studied extensively. They all use the same definition of safety. Liveness is especially important in practice despite well-known theoretical impossibility results. However, many different liveness properties and assumptions have been stated, and there are no systematic comparisons for better understanding of these properties. This paper systematically studies and compares different liveness properties stated for over 30 prominent consensus algorithms and variants. We introduce a precise high-level language and formally specify these properties in the language. We then create a hierarchy of liveness properties combining two hierarchies of the assumptions used and a hierarchy of the assertions made, and compare the strengths and weaknesses of algorithms that ensure these properties. Our formal specifications and systematic comparisons led to the discovery of a range of problems in various stated liveness properties, from too weak assumptions for which no liveness assertions can hold, to too strong assumptions making it trivial to achieve the assertions. We also developed TLA+ specifications of these liveness properties, and we use model checking of execution steps to illustrate liveness patterns for Paxos.

    1 Introduction. Distributed systems are pervasive, where processes work with each other via message passing. For example, this article is available to the current reader owing to some distributed system. At the same time, distributed systems are highly complex, due to their unpredictable asynchronous nature in the presence of failures—processes can fail and later recover, and messages can be lost, delayed, reordered, and duplicated.
  • Distributed Consensus Revised
    University of Cambridge Computer Laboratory, Technical Report UCAM-CL-TR-935 (ISSN 1476-2986), April 2019. Distributed Consensus Revised. Heidi Howard. © 2019 Heidi Howard. This technical report is based on a dissertation submitted September 2018 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Pembroke College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: https://www.cl.cam.ac.uk/techreports/

    Summary: We depend upon distributed systems in every aspect of life. Distributed consensus, the ability to reach agreement in the face of failures and asynchrony, is a fundamental and powerful primitive for constructing reliable distributed systems from unreliable components. For over two decades, the Paxos algorithm has been synonymous with distributed consensus. Paxos is widely deployed in production systems, yet it is poorly understood and it proves to be heavyweight, unscalable and unreliable in practice. As such, Paxos has been the subject of extensive research to better understand the algorithm, to optimise its performance and to mitigate its limitations. In this thesis, we re-examine the foundations of how Paxos solves distributed consensus. Our hypothesis is that these limitations are not inherent to the problem of consensus but instead specific to the approach of Paxos. The surprising result of our analysis is a substantial weakening to the requirements of this widely studied algorithm. Building on this insight, we are able to prove an extensive generalisation over the Paxos algorithm.
  • Patent Application Publication US 2004/0107227 A1 (Michael)
    United States Patent Application Publication US 2004/0107227 A1. Pub. Date: Jun. 3, 2004. (54) Method for Efficient Implementation of Dynamic Lock-Free Data Structures with Safe Memory Reclamation. (75) Inventor: Maged M. Michael, Danbury, CT (US). (73) Assignee: International Business Machines Corporation, Armonk, NY (US). (21) Appl. No.: 10/308,449. (22) Filed: Dec. 3, 2002. (51) Int. Cl.: G06F 17/30; G06F 12/00. (52) U.S. Cl.: 707/206.

    (57) Abstract: A method for safe memory reclamation for dynamic lock-free data structures employs a plurality of shared pointers, called hazard pointers, that are associated with each participating thread. Hazard pointers either have null values or point to nodes that may potentially be accessed by a thread without further verification of the validity of the local references used in their access. Each hazard pointer can be written only by its associated thread, but can be read by all threads. The method requires target lock-free algorithms to guarantee that no thread can access a dynamic node at a time when it is possibly unsafe (i.e., removed from the data structure), unless one or more of its associated hazard pointers has been pointing to the node continuously, from a time when it was not removed.

    [Figure: hash table with 7 buckets and hash function h(k) = k mod 7.]
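    The hazard-pointer scheme described in this abstract can be illustrated in a few lines. Below is a minimal single-slot sketch in C++ (illustrative names, not the patent's code): a reader publishes the pointer it is about to dereference, then re-validates it; a reclaimer may free a removed node only if no hazard pointer refers to it. Real implementations keep one or more slots per thread and scan all of them before freeing.

        // Minimal single-slot hazard-pointer sketch (illustrative only).
        #include <atomic>

        std::atomic<int*> shared_node{nullptr};  // node published by writers
        std::atomic<int*> hazard_slot{nullptr};  // one thread's hazard pointer

        // Reader: announce the node before using it, then re-check that it
        // is still reachable; after the loop, the node cannot be freed out
        // from under this thread.
        int* protect() {
            int* p;
            do {
                p = shared_node.load();
                hazard_slot.store(p);            // publish intent to access p
            } while (p != shared_node.load());   // validate the announcement
            return p;
        }

        // Reclaimer: a node removed from the structure may be freed only if
        // no hazard slot points to it; otherwise its free is deferred.
        bool safe_to_free(int* node) {
            return hazard_slot.load() != node;
        }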
  • Practical Parallel Data Structures
    Practical Parallel Data Structures. Shahar Timnat. Technion - Computer Science Department - Ph.D. Thesis PHD-2015-06 - 2015. Research Thesis Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Submitted to the Senate of the Technion - Israel Institute of Technology. Sivan 5775, Haifa, June 2015.

    This research was carried out under the supervision of Prof. Erez Petrank, in the Faculty of Computer Science. Some results in this thesis have been published as articles by the author and research collaborators in conferences during the course of the author's doctoral research period, the most up-to-date versions of which being:

    Keren Censor-Hillel, Erez Petrank, and Shahar Timnat. Help! In Proceedings of the 34th Annual ACM Symposium on Principles of Distributed Computing, PODC 2015, Donostia-San Sebastian, Spain, July 21-23, 2015.

    Erez Petrank and Shahar Timnat. Lock-free data-structure iterators. In Distributed Computing - 27th International Symposium, DISC 2013, Jerusalem, Israel, October 14-18, 2013. Proceedings, pages 224-238, 2013.

    Shahar Timnat, Anastasia Braginsky, Alex Kogan, and Erez Petrank. Wait-free linked-lists. In Principles of Distributed Systems, 16th International Conference, OPODIS 2012, Rome, Italy, December 18-20, 2012. Proceedings, pages 330-344, 2012.

    Shahar Timnat, Maurice Herlihy, and Erez Petrank. A practical transactional memory interface. In Euro-Par 2015 Parallel Processing - 21st International Conference, Vienna, Austria, August 24-28, 2015.
  • Distributed Consensus: Performance Comparison of Paxos and Raft
    Degree Project in Information and Communication Technology, Second Cycle, 30 credits. Stockholm, Sweden, 2020. Distributed Consensus: Performance Comparison of Paxos and Raft. Harald Ng. Master in Software Engineering of Distributed Systems. Date: September 10, 2020. Supervisors: Lars Kroll (RISE), Max Meldrum (KTH). Examiner: Seif Haridi. KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science. Host company: RISE. Swedish title: Distribuerad Konsensus: Prestandajämförelse mellan Paxos och Raft.

    Abstract: With the growth of the internet, distributed systems have become increasingly important in order to provide more available and scalable applications. Consensus is a fundamental problem in distributed systems where multiple processes have to agree on the same proposed value in the presence of partial failures. Distributed consensus allows for building various applications such as lock services, configuration manager services or distributed databases. Two well-known consensus algorithms for building distributed logs are Multi-Paxos and Raft. Multi-Paxos was published almost three decades before Raft and gained a lot of popularity. However, critics of Multi-Paxos consider it difficult to understand. Raft was therefore published with the motivation of being an easily understood consensus algorithm. The Raft algorithm shares similar characteristics with a practical version of Multi-Paxos called Leader-based Sequence Paxos. However, the algorithms differ in important aspects such as leader election and reconfiguration. Existing work mainly compares Multi-Paxos and Raft in theory, but there is a lack of performance comparisons in practice. Hence, prototypes of Leader-based Sequence Paxos and Raft have been designed and implemented in this thesis.
  • 42 Paxos Made Moderately Complex
    Paxos Made Moderately Complex. Robbert van Renesse and Deniz Altinbuken, Cornell University. This article explains the full reconfigurable multidecree Paxos (or multi-Paxos) protocol. Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. We provide pseudocode and explain it guided by invariants. We initially avoid optimizations that complicate comprehension. Next we discuss liveness, list various optimizations that make the protocol practical, and present variants of the protocol.

    Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—Network operating systems; D.4.5 [Operating Systems]: Reliability—Fault-tolerance. General Terms: Design, Reliability. Additional Key Words and Phrases: Replicated state machines, consensus, voting. ACM Reference Format: Robbert van Renesse and Deniz Altinbuken. 2015. Paxos made moderately complex. ACM Comput. Surv. 47, 3, Article 42 (February 2015), 36 pages. DOI: http://dx.doi.org/10.1145/2673577

    1. Introduction. Paxos [Lamport 1998] is a protocol for state machine replication in an asynchronous environment that admits crash failures. It is useful to consider the terms in this sentence carefully: A state machine consists of a collection of states, a collection of transitions between states, and a current state. A transition to a new current state happens in response to an issued operation and produces an output. Transitions from the current state to the same state are allowed and are used to model read-only operations. In a deterministic state machine, for any state and operation, the transition enabled by the operation is unique and the output is a function only of the state and the operation.
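    The state-machine definition above is easy to make concrete. The sketch below (illustrative, not from the article) shows two replicas of a deterministic counter: fed the same operations in the same order, they necessarily end in the same state, with reads modeled as self-transitions.

        // Two replicas of a deterministic state machine stay in sync when
        // they apply the same operation log in the same order.
        #include <cstdint>
        #include <string>
        #include <vector>

        struct Counter {                       // the replicated state machine
            int64_t state = 0;

            // Deterministic transition: output depends only on state + op.
            int64_t apply(const std::string& op) {
                if (op == "inc") ++state;      // writes move to a new state
                else if (op == "dec") --state;
                return state;                  // "read" is a self-transition
            }
        };

        int main() {
            std::vector<std::string> log = {"inc", "inc", "dec", "read"};
            Counter a, b;                      // two replicas
            for (const auto& op : log) {       // same ops, same order...
                a.apply(op);
                b.apply(op);
            }
            return a.state == b.state ? 0 : 1; // ...identical final states
        }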
  • A General Characterization of Indulgence
    A General Characterization of Indulgence. R. Guerraoui (1,2), N. Lynch (2). (1) School of Computer and Communication Sciences, EPFL. (2) Computer Science and Artificial Intelligence Laboratory, MIT.

    Abstract: An indulgent algorithm is a distributed algorithm that, besides tolerating process failures, also tolerates arbitrarily long periods of instability, with an unbounded number of timing and scheduling failures. In particular, no process can take any irrevocable action based on the operational status, correct or failed, of other processes. This paper presents an intuitive and general characterization of indulgence. The characterization can be viewed as a simple application of Murphy's law to partial runs of a distributed algorithm, in a computing model that encompasses various communication and resilience schemes. We use our characterization to establish several results about the inherent power and limitations of indulgent algorithms.

    1 Introduction. Indulgence. The idea of indulgence is motivated by the difficulty for any process in a distributed system to accurately figure out, at any point of its computation, any information about which, and in what order, processes will take steps after that point. For instance, a process can usually not know if other processes have failed and stopped operating or are simply slow to signal their activity and will indeed perform further computational steps. More generally, a process can hardly exclude any future interleaving of the processes. This uncertainty is at the heart of many impossibilities and lower bounds in distributed computing, e.g., [9], and it has been expressed in various forms and assuming specific computation models, e.g., [7, 4, 19].
  • Velos: One-Sided Paxos for RDMA Applications
    Velos: One-sided Paxos for RDMA applications. Rachid Guerraoui, Antoine Murat, Athanasios Xygkis (EPFL).

    Abstract: Modern data centers are becoming increasingly equipped with RDMA-capable NICs. These devices enable distributed systems to rely on algorithms designed for shared memory. RDMA allows consensus to terminate within a few microseconds in failure-free scenarios, yet RDMA-optimized algorithms still use expensive two-sided operations in case of failure. In this work, we present a new leader-based algorithm for consensus based on Paxos that relies solely on one-sided RDMA verbs. Our algorithm decides in a single one-sided RDMA operation in the common case, and changes leader also in a single one-sided RDMA operation in case of failure.

    These solutions primarily focus on increasing the throughput and lowering the latency of common case executions, thus achieving consensus in the order of a few microseconds. Nevertheless, outside of their common case, these systems suffer from orders of magnitude higher failover times, ranging from 1ms [2] to tens or hundreds of ms [23, 25]. In this work we introduce a leader-based consensus algorithm based on the well-known Paxos [16] that is efficient both during the common case as well as during failover. Our algorithm provides comparative performance to the state-of-the-art system Mu, i.e., it achieves consensus in under 1.9µs, making it only twice as slow as Mu. At the same time, our algorithm is 13 times faster than Mu in the event of a leader failure.
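    As a rough illustration of deciding with a single one-sided operation: pack an acceptor's state into one 64-bit word that the leader updates with one remote compare-and-swap. The sketch below uses std::atomic as a stand-in for the RDMA CAS verb; the field layout and names are assumptions made for illustration, not Velos's actual protocol or wire format.

        // Illustrative only: acceptor state as one CAS-able 64-bit word.
        #include <atomic>
        #include <cstdint>

        // Assumed layout: | 16-bit ballot | 48-bit value id |
        constexpr uint64_t pack(uint16_t ballot, uint64_t value_id) {
            return (uint64_t(ballot) << 48) | (value_id & 0xFFFFFFFFFFFFULL);
        }

        std::atomic<uint64_t> acceptor_word{0};  // lives in acceptor memory

        // Leader: one compare-and-swap (one one-sided RDMA CAS in a real
        // deployment) succeeds iff the word still holds what the leader
        // last observed, accepting (ballot, value_id) at this acceptor.
        bool try_accept(uint64_t observed, uint16_t ballot,
                        uint64_t value_id) {
            return acceptor_word.compare_exchange_strong(
                observed, pack(ballot, value_id));
        }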
  • CS5412: Lecture II How It Works
    CS5412: Paxos. Lecture XIII. Ken Birman. CS5412 Spring 2016 (Cloud Computing).

    Leslie Lamport's vision: centers on state machine replication. We have a set of replicas that each implement some given, deterministic, state machine and we start them in the same state. Now we apply the same events in the same order. The replicas remain in the identical state. To tolerate ≤ t failures, deploy 2t+1 replicas (e.g. Paxos with 3 replicas can tolerate 1 failure). How best to implement this model?

    Two paths forwards... One option is to build a totally ordered reliable multicast protocol, also called an "atomic broadcast" protocol in some papers. To send a request, you give it to the library implementing that protocol (for cs5412: probably Vsync). Eventually it does upcalls to event handlers in the replicated application and they apply the event. In this approach the application "is" the state machine and the multicast "is" the replication mechanism. Use "state transfer" to initialize a joining process if we want to replace replicas that crash.

    A second option, explored in Lamport's Paxos protocol, achieves a similar result but in a very different way. We'll look at Paxos first because the basic protocol is simple and powerful, but we'll see that Paxos is slow. Can speed it up... but doing so makes it very complex! The basic, slower form of Paxos is currently very popular. Then will look at faster but more complex reliable multicast options (many of them...).

    Key idea in Paxos: quorums. Starts with a simple observation: Suppose that we lock down the membership of a system: It has replicas {P, Q, R, ..
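    The replica arithmetic on these slides can be checked mechanically: with n = 2t+1 replicas, any two majority quorums of size t+1 overlap in at least one replica, which is the observation Paxos builds on. A small sketch (illustrative, not course code):

        // Check quorum intersection for n = 2t+1 replicas, t = 0..5.
        #include <cassert>

        int main() {
            for (int t = 0; t <= 5; ++t) {
                int n = 2 * t + 1;   // replicas needed to tolerate t failures
                int quorum = t + 1;  // majority quorum size
                // Two quorums overlap in at least |Q1| + |Q2| - n replicas.
                assert(quorum + quorum - n >= 1);
            }
            return 0;
        }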
  • Distributed Computing Column 36 Distributed Computing: 2009 Edition
    Distributed Computing Column 36: Distributed Computing: 2009 Edition. Idit Keidar, Dept. of Electrical Engineering, Technion, Haifa, 32000, Israel. [email protected]

    It's now the season for colorful reviews of 2009. While you can read elsewhere about the year in sports, top-40 pop hit charts, people of the year, and so on, I throw my share into the mix with a (biased) review of this year's major distributed computing events.

    Awards. First, let's look at awards. This year we learned that two women were recognized with prestigious ACM and IEEE awards for their achievements in (among other things) distributed computing. Barbara Liskov won the 2008 ACM Turing Award for a range of contributions to practical and theoretical foundations of programming language and system design, some of which are in distributed computing. The award citation mentions her impact on distributed programming, decentralized information flow, replicated storage, and modular upgrading of distributed systems. I include below a short reflection, by Rodrigo Rodrigues, on Barbara Liskov's Turing Award and legacy. Looking ahead to 2010, it was recently announced that Nancy Lynch is the recipient of the prestigious 2010 IEEE Emanuel R. Piore Award for her contributions to foundations of distributed and concurrent computing. The award is given for outstanding contributions in the field of information processing in relation to computer science. Leslie Lamport won the same award in 2004 for his seminal contributions to the theory and practice of concurrent programming and fault-tolerant computing. The 2009 Edsger W. Dijkstra Prize in Distributed Computing was awarded to Joseph Halpern and Yoram Moses for "Knowledge and Common Knowledge in a Distributed Environment".
  • State Machine Replication Is More Expensive Than Consensus
    State Machine Replication Is More Expensive Than Consensus. Karolos Antoniadis (EPFL, Lausanne, Switzerland, karolos.antoniadis@epfl.ch), Rachid Guerraoui (EPFL, Lausanne, Switzerland, rachid.guerraoui@epfl.ch), Dahlia Malkhi (VMware Research, Palo Alto, USA, [email protected]), Dragos-Adrian Seredinschi (EPFL, Lausanne, Switzerland, dragos-adrian.seredinschi@epfl.ch).

    Abstract: Consensus and State Machine Replication (SMR) are generally considered to be equivalent problems. In certain system models, indeed, the two problems are computationally equivalent: any solution to the former problem leads to a solution to the latter, and vice versa. In this paper, we study the relation between consensus and SMR from a complexity perspective. We find that, surprisingly, completing an SMR command can be more expensive than solving a consensus instance. Specifically, given a synchronous system model where every instance of consensus always terminates in constant time, completing an SMR command does not necessarily terminate in constant time. This result naturally extends to partially synchronous models. Besides theoretical interest, our result also corresponds to practical phenomena we identify empirically. We experiment with two well-known SMR implementations (Multi-Paxos and Raft) and show that, indeed, SMR is more expensive than consensus in practice. One important implication of our result is that – even under synchrony conditions – no SMR algorithm can ensure bounded response times.

    2012 ACM Subject Classification: Computing methodologies → Distributed algorithms. Keywords and phrases: Consensus, State machine replication, Synchronous model. Digital Object Identifier: 10.4230/LIPIcs.DISC.2018.7. Related Version: A full version of the paper is available at [4], https://infoscience.epfl.ch/record/256238.
  • A History of the Virtual Synchrony Replication Model Ken Birman
    A History of the Virtual Synchrony Replication Model Ken Birman Introduction A “Cloud Computing” revolution is underway, supported by massive data centers that often contain thousands (if not hundreds of thousands) of servers. In such systems, scalability is the mantra and this, in turn, compels application developers to replicate various forms of information. By replicating the data needed to handle client requests, many services can be spread over a cluster to exploit parallelism. Servers also use replication to implement high availability and fault‐tolerance mechanisms, ensure low latency, implement caching, and provide distributed management and control. On the other hand, replication is hard to implement, hence developers typically turn to standard replication solutions, packaged as sharable libraries. Virtual synchrony, the technology on which this article will focus, was created by the author and his colleagues in the early 1980’s to support these sorts of applications, and was the first widely adopted solution in the area. Viewed purely as a model, virtual synchrony defines rules for replicating data or a service that will behave in a manner indistinguishable from the behavior of some non‐replicated reference system running on a single non‐faulty node. The model is defined in the standard asynchronous network model for crash failures. This turns out to be ideal for the uses listed above. The Isis Toolkit, which implemented virtual synchrony and was released to the public in 1987, quickly became popular. In part this was because the virtual synchrony model made it easy for developers to use replication in their applications, and in part it reflected the surprisingly good performance of the Isis protocols.