
UNIVERSITY OF CALIFORNIA, SAN DIEGO

State Machine Replication for Wide Area Networks

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Computer Science

by

Yanhua Mao

Committee in charge:

Professor Keith Marzullo, Chair
Professor Chung-Kuan Cheng
Professor Curt Schurgers
Professor Amin Vahdat
Professor Geoffrey M. Voelker

2010

Copyright
Yanhua Mao, 2010
All rights reserved.

The dissertation of Yanhua Mao is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego

2010

TABLE OF CONTENTS

Signature Page

Table of Contents

List of Figures

List of Tables

Acknowledgements

Vita and Publications

Abstract of the Dissertation

Chapter 1 Introduction

Chapter 2 Background
    2.1 System Model
        2.1.1 Asynchronous message passing model
        2.1.2 Failure models
    2.2 Consensus
    2.3 Replicated state machine
    2.4 FLP impossibility result
    2.5 Failure detectors
        2.5.1 The concept
        2.5.2 Failure detector classes
        2.5.3 The weakest failure detector for solving Consensus
    2.6 Lower bounds
    2.7 Paxos
    2.8 PBFT

Chapter 3 Benefits and challenges of wide-area networks
    3.1 Benefits
    3.2 Challenges
    3.3 Paxos vs. Fast Paxos
        3.3.1 Fast Paxos
        3.3.2 Protocol Analysis
        3.3.3 Simulation
        3.3.4 Summary
    3.4 Design considerations
    3.5 Summary
    3.6 Acknowledgement

Chapter 4 Rotating leader design for multi-site systems
    4.1 Multi-site systems
    4.2 Simple Consensus
    4.3 R: a rotating leader protocol
        4.3.1 Coordinated Protocols
        4.3.2 A generic rotating leader protocol
    4.4 Performance analysis of R
        4.4.1 Delayed commit
    4.5 Optimizations for R
        4.5.1 Revocation in large blocks
        4.5.2 Out-of-order commit
    4.6 Summary

Chapter 5 Crash Failure: Protocol Mencius
    5.1 Performance issues
    5.2 Why not existing protocols?
    5.3 Deriving Mencius
        5.3.1 Assumptions
        5.3.2 Coordinated Paxos
        5.3.3 P: A simple state machine
        5.3.4 Optimizations
    5.4 Implementation considerations
        5.4.1 Prototype implementation
        5.4.2 Recovery
        5.4.3 Revocation
        5.4.4 Commit delay and out-of-order commit
        5.4.5 Parameters
    5.5 Evaluation
        5.5.1 Experimental settings
        5.5.2 Throughput
        5.5.3 Throughput under failure
        5.5.4 Scalability
        5.5.5 Latency
        5.5.6 Other possible optimizations
    5.6 Related work
    5.7 Future directions and open issues
    5.8 Summary
    5.9 Acknowledgement

Chapter 6 Byzantine Failure: Protocol RAM
    6.1 Assumptions
    6.2 Design goals
        6.2.1 Low latency as a main goal
        6.2.2 Discouraging uncivil rational behavior
        6.2.3 Exposing irrational behavior
    6.3 Reducing latency for PBFT
        6.3.1 Mutually Suspicious Domains
        6.3.2 Rotating leader design
        6.3.3 A2M for identical Byzantine failure
    6.4 Coordinated Byzantine Paxos
        6.4.1 Revocation sub-protocol
        6.4.2 Correctness proof
        6.4.3 Performance metrics
    6.5 Deriving RAM
        6.5.1 Revocation in RAM
        6.5.2 Protocol Q
        6.5.3 Optimizations
    6.6 Discourage uncivil rational behavior
        6.6.1 Failure detection under network jitter
        6.6.2 Turnaround time advertisement
        6.6.3 Unverifiable suspicion
        6.6.4 Verifiable suspicion
    6.7 Combating irrational behavior
        6.7.1 Compromised A2M
        6.7.2 Untrustworthy local servers
        6.7.3 Ignoring revocation
        6.7.4 Latency upper bound
    6.8 Evaluation
        6.8.1 Prototype
        6.8.2 Experimental setup
        6.8.3 Latency
        6.8.4 Throughput
        6.8.5 Revocation
    6.9 Related work
    6.10 Future direction
    6.11 Summary
    6.12 Acknowledgement

Chapter 7 Conclusion

Bibliography

Appendix A Pseudo code of Coordinated Paxos

Appendix B Pseudo code of Protocol P

Appendix C Pseudo code of Mencius

Appendix D Pseudo code of Coordinated Byzantine Paxos

LIST OF FIGURES

Figure 2.1: The asynchronous message passing model
Figure 2.2: The asynchronous message passing model enhanced with failure detectors
Figure 2.3: The message flow of Paxos during failure-free runs
Figure 2.4: The message flow of Paxos leader change
Figure 2.5: The message flow of Paxos when ACCEPT messages are broadcasted
Figure 2.6: The message flow of PBFT in the steady state
Figure 2.7: The message flow of PBFT view change

Figure 3.1: The structure of a {W×L} system
Figure 3.2: Cumulative distribution comparing Paxos and Fast Paxos, learning latency
Figure 3.3: Cumulative distribution comparing Paxos and Fast Paxos, client latency

Figure 4.1: A multi-site system
Figure 4.2: Delayed commit

Figure 5.1: The communication pattern in Paxos
Figure 5.2: The message flow of suggest, skip and revoke in Coordinated Paxos
Figure 5.3: Liveness problems can occur when using Optimization M.2 without Accelerator M.1
Figure 5.4: A comparison of the communication patterns in Paxos and Mencius
Figure 5.5: Delayed commit in Mencius
Figure 5.6: Throughput for 20 Mbps bandwidth
Figure 5.7: Mencius with asymmetric bandwidth (ρ = 4,000)
Figure 5.8: Mencius dynamically adapts to changing network bandwidth (ρ = 4,000)
Figure 5.9: The throughput of Mencius when a server crashes
Figure 5.10: The throughput of Paxos when the leader crashes
Figure 5.11: The throughput of Paxos when a follower crashes
Figure 5.12: The scalability of Paxos and Mencius
Figure 5.13: Mencius's commit latency when client load shifts from one site to another
Figure 5.14: Commit latency distribution under low and medium load
Figure 5.15: Commit latency vs. offered client load (ρ = 4,000, no network variance)
Figure 5.16: Commit latency vs. offered client load (ρ = 0, no network variance)
Figure 5.17: Commit latency vs. offered client load (ρ = 0, with network variance)

Figure 6.1: The relationship between different kinds of server behavior
Figure 6.2: A space time diagram showing the message flows of the suggest and skip actions in Coordinated Byzantine Paxos
Figure 6.3: A space time diagram showing the message flow of the revoke action in Coordinated Byzantine Paxos
Figure 6.4: A1
Figure 6.5: Commit latency vs. offered client load
Figure 6.6: The throughput of RAM
Figure 6.7: Increasing one site's latency leading to revocation

LIST OF TABLES

Table 2.1: Eight classes of failure detectors defined in terms of accuracy and completeness properties
Table 2.2: Summary of messages in Paxos
Table 2.3: Summary of messages in PBFT

Table 3.1: Four categories of wide-area replicated state machine systems
Table 3.2: Summary of Paxos and Fast Paxos protocols

Table 4.1: During a period of stability, the performance metrics of coordinated simple consensus protocol A derived from consensus protocol C
Table 4.2: During a period of stability, the performance metrics of state machine R compared to consensus protocol C

Table 5.1: A comparison of Paxos and Fast Paxos in multi-site systems

Table 6.1: Summary of messages in Coordinated Byzantine Paxos
Table 6.2: Micro benchmark on security primitives

ACKNOWLEDGEMENTS

I would like to express my gratitude to a few people who made all the work presented here possible.

To my father Jinhan Mao and my mother Xinmin Hu, for reserving nothing in loving me, and for supporting me in any way they can to make my life easier.

To my advisor Professor Keith Marzullo, for his patience in teaching me, for trusting my ideas, for guiding me to break my limits, for squeezing out time from his department chair duties, and for being there for me especially when I needed his help the most.

To Doctor Flavio Junqueira, for his helpful advice, for all the insightful discussions, and for all the hard work he put into our publications.

To my girlfriend Shengxin Tang, for being by my side through the best and the worst, for learning and growing with me as a person, and for making life much more colorful.

To my folks in China, especially Yong Wan and Yi Hu, for all the support from the homeland.

To my committee members Professor Geoff Voelker, Professor Amin Vahdat, Professor CK Cheng and Professor Curt Schurgers, for dedicating a bit of their time to evaluate my research work.

To my dear friends Shan Yan, Chong Zhang, Haixia Shi, Huaxia Xia, Yuanfang Hu, Liang Jin, Jin Cai, Wanping Zhang, Yuzhe Jin, Yi Jing, Zhongyi Jin, Yi Zhu, Jing Zhu, Shaofeng Liu, Chengmo Yang, Wenyi Zhang, Min Xu, Luo Gu, Kian Win Ong, Dan Liu, Junwen Wu, Zhou Lu, Yushi Shen, Meng Wei, Ying Zhang, Zheng Wu, Dayou Zhou and Liwen Yu, for making these years in San Diego very pleasant.

To the CSE professors, in particular Professor Stefan Savage, Professor Alex Snoeren, Professor Joe Pasquale and Professor Russell Impagliazzo, for all the helpful discussions.

To Marcos Aguilera, for the alternate proof in section 3.3.2. To Rich Wolski, for providing us with the NWS traces used in section 3.3.3. To NSF, for funding my research.

Chapter 3, in part, reprints material as it appears in “Classic Paxos vs. Fast Paxos: Caveat Emptor”, by Flavio Junqueira, Yanhua Mao, and Keith Marzullo, in the Proceedings of the 3rd Workshop on Hot Topics in System Dependability (HotDep), June, 2007.

The dissertation author was the primary coauthor and co-investigator of this paper.

Chapter 5, in part, reprints material as it appears in “Mencius: Building Efficient Replicated State Machines for WANs”, by Yanhua Mao, Flavio Junqueira, and Keith Marzullo, in the Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), December, 2008. The dissertation author was the primary coauthor and co-investigator of this paper.

Chapter 6, in part, reprints material as it appears in “Towards Low Latency State Machine Replication for Uncivil Wide-area Networks”, by Yanhua Mao, Flavio Junqueira, and Keith Marzullo, in the Proceedings of the 5th Workshop on Hot Topics in System Dependability (HotDep), June, 2009, and in “RAM: State Machine Replication for Uncivil Wide-area Networks”, by Yanhua Mao, Flavio Junqueira, and Keith Marzullo, unpublished technical report, Department of Computer Science and Engineering, University of California, San Diego, October, 2009. The dissertation author was the primary coauthor and co-investigator of the two papers.

VITA

2002         B.Eng., Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China

2002-2004    Research Assistant, Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China

2002-2003    Teaching Assistant, Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China

2004         M.Eng., Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China

2004-2010    Research Assistant, University of California, San Diego

2010         Ph.D., Department of Computer Science and Engineering, University of California, San Diego

PUBLICATIONS

Yanhua Mao, Flavio Junqueira, and Keith Marzullo, “Towards Low Latency State Machine Replication for Uncivil Wide-area Networks”, in the Proceedings of the 5th Workshop on Hot Topics in System Dependability (HotDep), June, 2009.

Yanhua Mao, Flavio Junqueira, and Keith Marzullo, “Mencius: Building Efficient Replicated State Machines for WANs”, in the Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), December, 2008.

Flavio Junqueira, Yanhua Mao, and Keith Marzullo, “Classic Paxos vs. Fast Paxos: Caveat Emptor”, in the Proceedings of the 3rd Workshop on Hot Topics in System Dependability (HotDep), June, 2007.

ABSTRACT OF THE DISSERTATION

State Machine Replication for Wide Area Networks

by

Yanhua Mao

Doctor of Philosophy in Computer Science

University of California, San Diego, 2010

Professor Keith Marzullo, Chair

State machine replication is the most general approach for providing highly available services with strong consistency guarantees. This dissertation studies protocols for implementing replicated state machines for wide area networks. First, it demonstrates the challenges by comparing two protocols designed for local area networks in a cluster-based wide-area setting, and shows that existing protocols designed for local area networks do not perform well in wide-area settings. A generic rotating leader protocol is then designed for wide-area multi-site systems. From this generic protocol, two more protocols are derived and evaluated: Mencius for crash failures and RAM for Byzantine (arbitrary) failures. Mencius has low latency because it is based on a rotating leader design that allows all servers to propose client requests immediately upon receiving them. It also has high throughput because the rotating leader design helps to balance both the computation and communication loads. RAM is designed to provide low latency while handling both uncivil rational and irrational behavior. RAM applies three techniques to reduce latency: a rotating leader design, Attested Append-only Memory (A2M) and a novel Mutually Suspicious Domains (MSD) trust model. Uncivil rational behavior is a result of selfish local administrators who try to optimize for local utility.

Such behavior is discouraged by the protocol, so that any divergence from the civil (correct) behavior only increases the chance of reducing local utility. Irrational behavior can arise from an outside attack that tries to disturb the system. RAM is designed to expose as much of this kind of behavior as possible. When detection is impossible, RAM resorts to damage control methods that bound the decrease in system utility.

Chapter 1

Introduction

The most general approach for providing a highly available service is to use a replicated state machine architecture [58]. Assuming a deterministic service, the state and function of the service are replicated across a set of servers, and an unbounded sequence of consensus instances is used to agree upon the commands they execute. This approach provides strong consistency guarantees, and so is broadly applicable. Advances in efficient consensus protocols have made this approach practical as well for a wide set of system architectures, from its original application in embedded systems [63] to asynchronous systems. Depending on the failure model adopted, this approach can also be used to tolerate a wide range of failures: from the most restricted crash failures to the most general Byzantine (arbitrary) failures.

Today, the asynchronous system model is popular among both researchers and practitioners due to its relaxed timing requirements. Recent examples of services that use asynchronous replicated state machines include Chubby [13, 15], ZooKeeper [69] and Boxwood [46]. It is, however, well known that consensus is unsolvable in a purely asynchronous environment [26], i.e., no deterministic protocol can guarantee safety and liveness at the same time. The most popular approach to circumvent this impossibility result is to use unreliable failure detectors. By taking this approach, one can design protocols that are always safe despite the unbounded number of mistakes the failure detectors may make. The protocols provide liveness only during a period of synchrony, i.e., when the failure detector eventually becomes accurate. A large number of protocols [14, 25, 36, 39, 47] have since been designed using failure detectors.

Ω, a class of failure detectors that provide eventually accurate leader election functionality, has also been proven to be the weakest failure detector for solving consensus. Leader-based protocols, such as Paxos [36, 37] and PBFT [14], have since become the de facto standard for implementing replicated state machines: not only do they require the minimal synchrony augmentation for solving the problem, they are also efficient due to the use of a single leader to assign commands to consensus instances.

Despite the large number of existing protocols, most of them are designed for and deployed in local-area networks. Efficiency has been one of the most common concerns about wide-area deployment. With the rapid growth of wide-area services such as web services, grid services, and service-oriented architectures, wide-area replicated state machines become more and more appealing due to their strong consistency guarantees. Also, replication across multiple locations is the only way to tolerate co-location failures, such as fire and power outages, that can bring down all the machines in the same location at the same time. So, a basic research question is how to provide efficient and effective state machine replication in the wide area – the thesis of this dissertation.

While it is certainly possible to pick an application and design efficient replicated state machine protocols for that application, doing so would lose the generality of the replicated state machine approach. By not choosing a specific application, this dissertation focuses on the most fundamental properties of the problem. Doing so makes it possible for the solutions in this dissertation to be widely applicable.

Chapter 2 reviews the background materials for this dissertation: the asynchronous message passing model, failure models, consensus and replicated state machines, the aforementioned impossibility result, the concept of failure detectors, and the lower bounds for asynchronous consensus. We also review two de facto standard leader-based protocols: Paxos for crash failures and PBFT for Byzantine failures.

Chapter 3 is our first step toward an efficient protocol design. We start the chapter by summarizing our motivation for wide-area replicated state machines and the challenges we face. To further demonstrate the challenges, we proceed with a comparison of two protocols, namely Paxos and Fast Paxos, in a wide-area client-server environment. Our analysis and simulation results show that Fast Paxos, though designed to be faster than Paxos in a local-area environment, is actually slower in this wide-area environment. We conclude the chapter with a summary of important issues, such as contention, quorum size, and latency variances, that need to be considered when designing new protocols for wide-area systems.

In chapter 4, we move our focus to wide-area multi-site systems. While traditional leader-based protocols are efficient, they have inherent disadvantages in a multi-site system. From the point of view of individual sites, the leader has unfair performance advantages over the other sites. From the point of view of the whole system, the leader is a single bottleneck that limits throughput. To cope with these disadvantages, we introduce a generic rotating-leader protocol called R. The core mechanism of R is a partition of the consensus sequence space so that each site acts as the default leader for a subset of the instances.
By doing this, we remove the single-leader bottleneck by spreading the load of being the leader among the sites. We also introduce simple consensus to replace traditional consensus in the replicated state machine approach. Simple consensus is a variant of consensus with a restricted initial state; we show that it is equivalent to consensus in solvability. Together, simple consensus and the rotating-leader scheme help R avoid contention during a period of stability. Simple consensus also makes certain operations cheaper to implement. This feature, enhanced with some further optimizations, lets all sites enjoy the low latency that was only possible for the leader in a traditional leader-based design. R is, however, not efficient in terms of message complexity; optimizing it requires finding opportunities in the specific implementation of simple consensus.

Chapter 5 continues our study of wide-area multi-site systems, this time in a crash failure environment. The core result of this chapter is Mencius: an efficient crash-failure rotating-leader protocol that has high throughput under high load and low latency under low load. The protocol is derived from Paxos and uses the generic rotating-leader scheme of protocol R. First, we derive Coordinated Paxos from Paxos to implement simple consensus efficiently. We then apply the rotating-leader scheme of protocol R to obtain an inefficient intermediate protocol P. Finally, we apply a set of optimizations to protocol P to obtain Mencius. This derivation of Mencius makes it easier to understand its correctness and how it works. We also discuss important implementation considerations for Mencius and run experiments to evaluate its latency, throughput, scalability, and ability to adapt to changing environments and loads.

Chapter 6 extends our study to Byzantine failures. The core result of this chapter is a low-latency protocol, RAM. RAM is built on three core techniques: (1) the rotating-leader scheme of R; (2) the use of Attested Append-only Memory (A2M) for reducing latency; and (3) the assumption of Mutually Suspicious Domains (MSD), which allows processes in the same site to trust each other. The development of RAM is similar to that of Mencius, and so has similar advantages. First, we introduce Coordinated Byzantine Paxos as an efficient solution to simple consensus, taking advantage of A2M and MSD. We then apply the rotating-leader scheme of R to obtain an intermediate protocol Q. Finally, we optimize Q to obtain RAM.

We also extend the traditional view of Byzantine failures by introducing uncivil rational and irrational behavior. These two classes of behavior have different root causes: uncivil rational behavior is caused by faulty or selfish administrators who try to game the system to improve their own utility; irrational behavior is a result of outside attacks that seek to disrupt the system. Different methods are warranted to cope with these two classes of behavior. We seek to discourage uncivil rational behavior by reinforcing the protocol such that any divergence from the correct behavior can only hurt the utility of a site, so the best strategy for an uncivil rational site is to follow the correct behavior. The core mechanism for achieving this goal is revocation against uncivil sites. For irrational behavior, we also seek to expose it as much as possible through failure detection. However, not all failures can be pinpointed.
So, when detection is impossible, we resort to damage control techniques that limit the decrease of system utility. With these goals in the design, RAM is also a complex protocol. While this chapter explains the core mechanisms that make this approach feasible, future work is warranted to fully understand the policies that make these mechanisms effective. This chapter makes a first step toward such a goal with a prototype implementation of RAM. We use the implementation to evaluate not only the latency, throughput, and scalability of RAM, but also the implications of the revocation mechanism.

Finally, chapter 7 concludes this dissertation.

Chapter 2

Background

This chapter reviews background materials that are necessary for understanding this dissertation. Section 2.1 reviews the asynchronous message passing model and introduces both the crash failure model and the Byzantine failure model. Based on this model, sections 2.2 and 2.3 further introduce the consensus problem and the replicated state machine approach. Section 2.4 explains a well-known impossibility result for asynchronous consensus and section 2.5 discusses how to use unreliable failure detectors to circumvent this impossibility result. Section 2.6 summarizes important lower bounds that have been established for asynchronous consensus. Finally, sections 2.7 and 2.8 review the Paxos consensus protocol for crash failures and the PBFT replicated state machine protocol for Byzantine failures.

2.1 System Model

For any distributed system, it is important to establish an appropriate system model for understanding the possible behavior of not only the computation nodes but also the environment within which the nodes run. Based on a specific model, one can then design distributed protocols to solve problems (e.g., consensus and replicated state machines for this dissertation). Since a system model allows us to explore all possible behaviors of a protocol, it enables us to formally prove the correctness of a protocol and analyze its efficiency and/or complexity. Moreover, the model also enables us to discuss the solvability of a distributed problem and obtain theoretical bounds for understanding the minimal resources needed to solve a particular problem.

This dissertation focuses on the asynchronous message passing model, a popular model for wide-area networks. To model the node behavior, this dissertation uses both the crash failure model and the Byzantine failure model. The remainder of this section gives a brief review of these topics.

2.1.1 Asynchronous message passing model

An asynchronous message passing system consists of three main components: (1) the environment, such as the communication channels and adversaries; (2) the processes, such as client and server computation nodes; and (3) the protocol that solves a specific distributed problem. In this model, processes are characterized as deterministic state machines (possibly with infinite states) that communicate with each other by exchanging messages via point-to-point asynchronous channels.

More formally, a message passing system consists of a collection of processes. Each is characterized as a deterministic state machine that executes in a series of rounds. In each round, a process p receives a message m (possibly empty, or consisting of multiple physical messages) that is addressed to p. The protocol specifies the deterministic state machine, which takes m and the current state s as input, outputs a new state s′, and generates messages to be sent to one or more processes as needed.

On top of the message passing model, system synchrony is used to model the speed of processes and message channels. Synchronous systems make assumptions on the upper bounds of the processing speed of processes and the delivery delay of messages. Protocols can be designed around these bounds and take timeouts as additional inputs. Asynchronous systems do not make such assumptions, and so cannot take advantage of timeouts.

More formally, in asynchronous systems, no assumption is made on the processing speed of a process or on how long it takes for a process to complete a round. A process may also have access to a local clock, but no assumption is made on either the accuracy or the drift of the clock. Such inaccurate clocks may not be used to guarantee safety or ensure liveness, but it is common practice to use them to provide liveness when the system enters a period of synchrony. No assumption is made on how long it takes for the message passing channels to deliver messages, i.e., there is no known upper bound on message delivery delay.

Most wide-area systems and modern operating systems are considered asynchronous systems, even though known upper bounds do exist in such systems. The reason behind this classification is that the worst-case delay can be orders of magnitude higher than the common case. For example, it typically takes a few microseconds for a process to complete a round on a modern operating system; however, several milliseconds may be needed if the system experiences a context switch or a page miss. As another example, the typical one-way network delay within the North American continent is on the order of tens of milliseconds; however, when BGP routes are being updated, it may take seconds for the routes to stabilize and messages to be delivered in the wide area. Should we assume a synchronous system and design protocols around these extremely large and rarely reached bounds, we would have to sacrifice the common-case performance of the system. As a result, people resort to the asynchronous model and design protocols that do not rely on these bounds, at least not in the common case.

Finally, most works in the literature assume the channels only deliver messages that were sent by the processes, i.e., they do not create messages by themselves. However, different works may assume slightly different additional properties.
For example, some works assume reliable channels that do not drop messages, whereas others only assume fair-lossy channels, in which a message will eventually be delivered if the sender keeps re-sending it. Some works assume an atomic broadcast primitive while others do not. Moreover, the message channels may or may not have FIFO properties. However, one can always implement reliable channels on top of fair-lossy channels and/or FIFO channels on top of non-FIFO channels [8]. Unless otherwise explicitly noted, this dissertation assumes that processes do not have an atomic broadcast primitive and that the channels are reliable and FIFO, mostly because we study wide-area systems in this dissertation and used TCP as the transport protocol in our implementation.

So far, we have summarized the asynchronous message passing model, which is a message passing system running in an asynchronous environment. Figure 2.1 summarizes this model. The model is also sometimes referred to as the time-free asynchronous model, as the servers either have no access to a clock or the clock is too inaccurate to be of any usefulness.

Figure 2.1: The asynchronous message passing model

The asynchronous message passing model defines the possible behavior of the environment, while the failure model and the protocol define the possible behavior of the processes. A process fails when it diverges from this pre-defined behavior. The next section further explains two widely used failure models: the crash failure model and the Byzantine failure model.
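To make the round structure concrete, here is a minimal sketch of the model in Python (all names are illustrative assumptions, not from the dissertation): a protocol supplies the deterministic transition function, while the environment controls when rounds run and when messages are delivered.

# A minimal sketch of the asynchronous message passing model described
# above. Names (Message, Process, step) are illustrative. A protocol
# supplies the deterministic transition function; the environment decides
# when, and in what order, messages are delivered.

from dataclasses import dataclass

@dataclass
class Message:
    src: int          # sender process id
    dst: int          # recipient process id
    payload: object   # protocol-specific content

class Process:
    """A deterministic state machine executed in rounds."""

    def __init__(self, pid: int, initial_state: object):
        self.pid = pid
        self.state = initial_state

    def step(self, incoming: list[Message]) -> list[Message]:
        # One asynchronous round: consume delivered messages (possibly
        # none), deterministically compute the new state, and emit
        # outgoing messages. The environment makes no promise about when
        # the next round runs or when the emitted messages arrive.
        new_state, outgoing = self.transition(self.state, incoming)
        self.state = new_state
        return outgoing

    def transition(self, state, incoming):
        raise NotImplementedError  # supplied by a concrete protocol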

2.1.2 Failure models

A failure is the behavior of a process diverging from its specification as defined by the system model and the protocol. Such a process is said to be faulty. A process is said to be correct if it is not faulty. Depending on the failure model, restrictions may be applied to the behavior of faulty processes. The most common failure models are the crash failure (fail-stop) model and the Byzantine (arbitrary) failure model.

With the crash failure model, processes can only fail by stopping, i.e., by ceasing to execute any further asynchronous rounds. Usually, there is no restriction on when a failure can occur, i.e., a failure can occur in the middle of a round. As a result, a process that fails while broadcasting a message to all processes may only send the message to a subset of the recipients.

By contrast, the Byzantine failure model imposes no restriction on how faulty processes may behave. An arbitrarily faulty process can stop running, omit some or all messages, or send arbitrary messages. It can even deliberately collude with other faulty processes and try to “trick” correct processes in some way. The Byzantine failure model also often assumes the existence of adversaries in the environment. The adversaries can, for example, generate messages or temporarily disrupt communication to try to disturb the replicated service.

When dealing with sender failures, cryptographic primitives are often used. For example, a public/private key infrastructure can be used to authenticate and verify the messages sent by the processes. In this case, each process holds its own private key and has knowledge of the public keys of all other processes. Private keys are used to sign messages before they are sent out, and public keys are used to verify that messages were indeed sent by the claimed sender. If the private key of a process is leaked or compromised, the process is considered faulty: the adversaries can now use the leaked key to assume the identity of the process and engage in arbitrary communication with the rest of the processes. Processes and adversaries are also assumed to have limited computational power, so that they cannot subvert the cryptographic primitives. Adversaries are often assumed to be able to delay the delivery of messages, though they cannot cause indefinitely long delays.

It is also necessary to restrict the number of failures that can occur, for both the crash and Byzantine failure models. This can be done by specifying a threshold, i.e., the maximum number of failures that can occur for a certain type of process, or a failure pattern, i.e., the set of all possible combinations of faulty processes that can occur. For simplicity, this dissertation only considers the threshold model, since most results for the threshold model can be translated into their counterparts based on failure patterns.
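As a concrete illustration, the sketch below uses message authentication codes (MACs) over pairwise shared keys, a common practical stand-in for public-key signatures (PBFT's optimized path uses MAC authenticators); the names and the key itself are illustrative assumptions, not part of the model above.

# A sketch of message authentication with pairwise shared keys, using
# only Python's standard library. Key distribution is assumed and the
# key below is illustrative only.

import hmac
import hashlib

def sign(key: bytes, message: bytes) -> bytes:
    # The sender tags the message with a MAC computed under the key it
    # shares with the recipient.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    # The recipient recomputes the MAC; compare_digest avoids timing leaks.
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

# If the key a process shares with others leaks, an adversary can forge
# that process's messages -- exactly the condition under which the model
# above classifies the process as faulty.
key = b"shared-key-between-p-and-q"   # illustrative only
tag = sign(key, b"PROPOSE instance=7 value=x")
assert verify(key, b"PROPOSE instance=7 value=x", tag)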

2.2 Consensus

Consensus is a fundamental coordination problem that requires a group of processes to agree on a common output, based on their (possibly conflicting) inputs. A unique identity is assigned to each server, since consensus is not solvable if the servers are indistinguishable [8]. In the context of replicated state machines, consensus input values are typically requests sent to the servers by the clients. The servers take client requests and propose them as initial consensus input values. They later decide on one of the proposed values as the output.

More formally, the consensus problem is defined as follows: each server starts with an initial value to propose and later decides on some proposed value. A consensus implementation satisfies the following four properties [8].

Definition C.1 (Termination): Every correct server eventually decides on some value.

Definition C.2 (Validity): If all servers propose the same value v, then every correct server decides on v.

Definition C.3 (Integrity): Every correct server decides on at most one value.

Definition C.4 (Agreement): If a correct server decides on v, then every correct server that decides decides on v.

Of the four properties above, C.1 (Termination) is a liveness property, i.e., it requires that the processes eventually make progress. The other three are safety properties, as they must hold at all times.

It is possible for the system to enter a state in which the consensus outcome is determined but not yet known by any of the servers. Instead of proving that such a state exists, we later show it in example protocols. We say that a value v is chosen when the system enters a state in which v is the only possible outcome. We say that a server learns value v when it knows that v is the outcome and decides on v.

Lamport [34, 37] generalized the problem by assigning roles to processes. There are proposers that propose values, acceptors that accept proposals (if certain conditions are met) and learners that learn the outcome of consensus. A process can take on all three roles or any combination of them. The original scenario can be considered a special case in which all processes implement exactly all three roles. Consensus defined in this way requires the same safety and liveness properties, except that instead of having servers propose and decide on values, the proposers propose and the learners decide on values. Based on this structure, Lamport proved a series of lower bounds for asynchronous consensus [34]. We briefly review these results in section 2.6.
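The safety properties can be made concrete as checks over a finished run. The following sketch (illustrative names; not from the dissertation) validates C.2-C.4 on a trace; C.1 (Termination), being a liveness property, cannot be checked on any finite trace.

# `proposals` maps server id -> its proposed value; `decisions` maps
# correct-server id -> the list of values it decided (normally 0 or 1).

def check_safety(proposals: dict, decisions: dict) -> bool:
    proposed = set(proposals.values())
    decided = [v for vs in decisions.values() for v in vs]
    integrity = all(len(vs) <= 1 for vs in decisions.values())  # C.3
    agreement = len(set(decided)) <= 1                          # C.4
    # C.2: if all servers proposed the same value v, only v may be decided.
    validity = len(proposed) > 1 or set(decided) <= proposed
    return integrity and agreement and validity

# Example: two servers propose the same value and both decide it.
assert check_safety({1: "a", 2: "a"}, {1: ["a"], 2: ["a"]})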

2.3 Replicated state machine

The replicated state machine approach [58] is one of the most general approaches for providing highly available services. Assuming a deterministic service, this is done, at a high level, by replicating the state and function of the service across a set of servers, and by using an unbounded sequence of consensus instances to agree upon the sequence of commands the replicas execute.

In more detail, a replicated state machine consists of a set of servers that replicate a deterministic application service and a set of clients that issue requests to the service by sending them to one or more servers. Each server runs an unbounded sequence of consensus instances. Upon receiving a request from a client, a server submits the request by assigning it to one of the unused consensus instances, i.e., it proposes the request as the initial consensus input of that instance. While multiple servers may propose different values in the same instance, the consensus algorithm ensures that all correct servers eventually agree on a unique request for each used instance. Once a server learns the outcome of a consensus instance, it commits the request, provided that it has learned and committed all previous consensus instances. A server commits a request by having the application service execute it. As a result, replies may be generated and sent to one or more clients.

It is straightforward to see that all correct servers eventually learn and execute the same sequence of requests. If the servers do not skip instances when proposing requests, this sequence also contains no gaps. Thus, if all servers start from the same initial state and the application service is deterministic, then the service state will always be consistent across servers and the servers will always generate consistent responses.

When running the replicated state machine, care needs to be taken to ensure that each unique client request is eventually executed exactly once. We illustrate the problem with the following example. Consider an execution in which multiple servers propose different requests in the same consensus instance. Only one request can be chosen as the outcome of that instance; the rest of the requests need to be re-proposed in other unused instances. The client may also send a request to multiple servers at the same time, in the hope that at least some correct server will receive it. Both cases can cause duplication, and when duplication occurs, the replicated state machine must not execute the same request twice.

More formally, in a replicated state machine, a server submits requests to the state machine and later commits a sequence of requests that were submitted by the servers. Assuming requests are unique, a replicated state machine implementation satisfies the following liveness and safety properties [58].

Definition R.1 (Validity): If a correct server submits a request r, then all correct servers will eventually commit r.

Definition R.2 (Agreement): If a correct server commits a request r, then all correct servers will eventually commit r as well.

Definition R.3 (Integrity): Any given request r is committed by each correct server at most once, and only if r was previously submitted.

Definition R.4 (Total Order): If two correct servers p and q both commit requests r1 and r2, then p commits r1 before r2 if and only if q commits r1 before r2.

Note that R.1 (Validity) and R.2 (Agreement) are liveness properties, whereas R.3 (Integrity) and R.4 (Total Order) are safety properties.

The most appealing factor of the replicated state machine approach is that it provides strong consistency guarantees, and so is broadly applicable. Advances in efficient consensus protocols have made this approach practical as well for a wide set of system architectures, from its original application in embedded systems [63] to asynchronous systems. Recent examples of services that use replicated state machines include Chubby [13, 15], ZooKeeper [69] and Boxwood [46].
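The commit discipline described above can be sketched as follows (a minimal sketch with illustrative names; the consensus module is assumed to expose a per-instance propose/learn interface, and requests are assumed to carry unique ids). Requests are executed strictly in instance order, and duplicates are filtered so that each request is executed at most once (R.3).

class RSMServer:
    def __init__(self, consensus, service):
        self.consensus = consensus   # assumed: propose(instance, request)
        self.service = service       # deterministic application service
        self.next_unused = 0         # next instance this server proposes into
        self.next_commit = 0         # next instance to commit (no gaps)
        self.learned = {}            # instance -> learned request
        self.executed = set()        # ids of requests already executed

    def submit(self, request):
        # Assign the request to an unused consensus instance.
        self.consensus.propose(self.next_unused, request)
        self.next_unused += 1

    def on_learn(self, instance, request):
        # Called when the outcome of `instance` is learned. Commit in
        # order: a request is executed only after every earlier instance
        # has been learned and committed.
        self.learned[instance] = request
        while self.next_commit in self.learned:
            req = self.learned.pop(self.next_commit)
            self.next_commit += 1
            if req.id in self.executed:
                continue             # duplicate: execute at most once
            self.executed.add(req.id)
            reply = self.service.execute(req)  # replies go to clients
            # A request this server proposed that lost its instance would
            # be re-proposed into a fresh instance at this point.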

2.4 FLP impossibility result

While consensus is attractive for achieving high availability and providing strong consistency in the replicated state machine approach, it is well known, as the FLP impossibility result, that consensus is impossible in a purely asynchronous system in which even a single process can fail. More formally, Fischer, Lynch, and Paterson [26] discussed a very weak form of consensus where the only values a process can propose are 0 and 1. They assumed at most one process can fail and proved that no algorithm based on the asynchronous message passing model can maintain safety at all times while also guaranteeing liveness (Termination) in the presence of the possibility of just a single failure. The intuition behind the result is that the only way a server can tell that another server has not failed is by receiving messages from it. When receiving no messages, however, it cannot distinguish whether a server has failed or is just “very slow”. This makes it inherently hard for a server to decide on the consensus outcome in the presence of failures.

The impossibility result may seem discouraging, but fortunately, most practical systems are not purely asynchronous, and so the FLP result does not apply. As a result, people make additional assumptions that capture the synchrony characteristics of practical systems to circumvent the impossibility result. Protocols designed with these additional assumptions usually maintain safety at all times and make progress (provide liveness) when certain synchrony conditions are met. One of the most commonly used methods is to augment the servers with an inaccurate local failure detector. We briefly review this approach in the next section.

There are other ways of circumventing the impossibility result, for example, requiring the system to eventually synchronize. Simply speaking, with crash failure semantics, this means there is a time after which no more processes crash and all messages are delivered in time. For example, Keidar and Shraer used the eventually synchronous model to study the overhead of consensus failure recovery [32]. We do not discuss these alternatives further in this dissertation.

2.5 Failure detectors

Due to its simplicity and modularity, the failure detector approach has gained popularity in the distributed computing community for circumventing the FLP impossibility result. Chandra and Toueg first introduced the concept of unreliable failure detectors [17]. They showed that consensus can be solved even with unreliable failure detectors that may make an unbounded number of mistakes. They further classified failure detectors based on their completeness and accuracy properties and showed the resulting hierarchy of failure detector classes. They also identified the weakest failure detector necessary to solve asynchronous consensus. We review failure detector related materials in this section.

2.5.1 The concept

Figure 2.2: The asynchronous message passing model enhanced with failure detectors

As we have discussed, the impossibility result for asynchronous systems stems from the inherent difficulty of determining whether a server has actually failed or is only “very slow”. The failure detector approach augments the asynchronous model with an inaccurate external failure detection mechanism. Figure 2.2 shows the structure of the failure-detector-enhanced asynchronous message passing model. Each failure detector monitors the set of servers in the system and maintains a list of servers that are currently suspected of being faulty. The failure detector is allowed to make mistakes by erroneously adding correct servers to its suspect list; for example, it may suspect that a server p has crashed even though p is still running. If the failure detector later believes that suspecting p was a mistake, it can remove p from its list. A protocol based on failure detectors should always maintain its safety properties despite the mistakes made by the failure detectors. Usually such a protocol only guarantees progress if the failure detectors remain accurate for a sufficiently long period of time.

Failure detectors are defined in terms of abstract properties, as opposed to specific implementations. This enables us to design protocols and prove their correctness relying solely on these properties, without referring to a specific implementation or to low-level network parameters. This also makes the presentation of protocols and their correctness proofs more modular.
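A concrete (and necessarily inaccurate) failure detector of this kind can be built from heartbeats and timeouts, as in the following minimal sketch (illustrative names; not the dissertation's implementation). Mistakes are retracted, and the timeout is increased, whenever a suspected server is heard from again, so that during a stable period the suspect list eventually becomes accurate.

import time

class HeartbeatFailureDetector:
    def __init__(self, peers, timeout=1.0):
        self.timeout = {p: timeout for p in peers}
        self.last_heard = {p: time.monotonic() for p in peers}
        self.suspected = set()

    def on_message(self, p):
        # Any message from p counts as evidence that p is alive.
        self.last_heard[p] = time.monotonic()
        if p in self.suspected:
            # A mistake: p was suspected but is alive. Retract the
            # suspicion and back off the timeout so the same mistake
            # becomes less likely in a period of synchrony.
            self.suspected.discard(p)
            self.timeout[p] *= 2

    def poll(self):
        # Suspect every peer not heard from within its current timeout.
        now = time.monotonic()
        for p, t in self.last_heard.items():
            if now - t > self.timeout[p]:
                self.suspected.add(p)
        return self.suspected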

2.5.2 Failure detector classes

Chandra and Toueg characterized failure detectors by their completeness and accuracy properties. Roughly speaking, completeness requires that a failure detector eventually suspects all faulty processes, and accuracy restricts the mistakes that a failure detector can make. They defined two completeness and four accuracy properties:

Strong completeness: Eventually every process that crashes is permanently suspected by every correct process.

Weak completeness: Eventually every process that crashes is permanently suspected by some correct process.

Strong accuracy: No process is suspected before it crashes.

Weak accuracy: Some correct process is never suspected.

Eventually strong accuracy: There is a time after which correct processes are not suspected by any correct process.

Eventually weak accuracy: There is a time after which some correct process is never suspected by any correct process.

Eventually strong accuracy and eventually weak accuracy are referred to as eventual accuracy properties; strong accuracy and weak accuracy are referred to as perpetual accuracy properties.

Table 2.1: Eight classes of failure detectors defined in terms of accuracy and completeness properties.

Accuracy              Strong completeness         Weak completeness
Strong                Perfect (P)                 Q
Weak                  Strong (S)                  Weak (W)
Eventually strong     Eventually perfect (♦P)     ♦Q
Eventually weak       Eventually strong (♦S)      ♦W

Note that eventual accuracy allows failure detectors to make an unbounded number of mistakes. This reflects the inherent difficulty of determining whether a process is just slow or is faulty in an asynchronous system. Note also that even the weakest accuracy property, eventually weak accuracy, still requires the failure detectors not to make certain mistakes after a certain time, which remains difficult to implement in practice. However, this is not a restriction for most practical systems: we only need the failure detectors to be accurate for a sufficiently long period, so that the correct servers can decide on the outcome of consensus.

Based on the eight completeness-accuracy pairs, Chandra and Toueg classified failure detectors into the eight classes shown in Table 2.1. For example, the class of failure detectors that have both strong completeness and strong accuracy is called the class of Perfect failure detectors (denoted by P).

One can also explore the hierarchy of failure detectors using a reduction technique. More formally, given two failure detector classes D and D′, D′ is said to be reducible to D (denoted D′ ⪯ D) if there exists a reduction algorithm that uses D to maintain a variable output_p at every process p such that output_p is a valid output of D′ at p in every possible execution. In this case D is stronger than D′, as D can emulate the output of D′ and thus gives at least as much information about failures as D′ does. The reducibility relation is transitive. Moreover, two failure detector classes D and D′ are said to be equivalent (denoted D ≅ D′) if and only if D′ ⪯ D and D ⪯ D′.

It is obvious that any failure detector with strong completeness can emulate its weak completeness counterpart, i.e., Q ⪯ P, W ⪯ S, ♦Q ⪯ ♦P, and ♦W ⪯ ♦S. Chandra and Toueg also gave an algorithm that transforms any failure detector with weak completeness into its strong completeness counterpart, making weak completeness and strong completeness equivalent. Therefore, P ≅ Q, S ≅ W, ♦P ≅ ♦Q, and ♦S ≅ ♦W. This reduces the number of failure detector classes that need to be studied from eight to four.

Chandra and Toueg further examined the hierarchy of failure detector classes. An example algorithm was given that solves crash-failure consensus using W even when up to n − 1 servers can crash. Another algorithm was shown to solve crash-failure consensus using ♦W when a majority of the processes are correct (f < ⌈n/2⌉). A lower bound was proved showing that consensus cannot be solved using ♦P in crash-failure asynchronous systems with f ≥ ⌈n/2⌉. Since ♦P is stronger than ♦W, and ♦W can be used to solve asynchronous consensus with f < ⌈n/2⌉, so can ♦P, and this lower bound is tight. This also shows that failure detectors with perpetual accuracy properties can solve more problems than failure detectors with only eventual accuracy; failure detectors with perpetual accuracy are therefore strictly stronger.

2.5.3 The weakest failure detector for solving Consensus

The existence of the failure detector hierarchy raises a natural question: what is the weakest failure detector for solving consensus? To answer this question, Chandra et al. [16] introduced a new failure detector class Ω. Instead of outputting a list of suspected servers, each server's failure detector outputs (trusts) a single server that is believed to be correct. Ω is the class of failure detectors for which there is a time after which all correct servers always trust the same correct server, and the identity of the trusted server does not change thereafter. Ω is also called the leader election failure detector. On one hand, protocols such as Paxos can use Ω to implement crash-failure asynchronous consensus when a majority of the servers are correct. On the other hand, it has been shown that for any failure detector that can solve consensus, the consensus protocol can be used to emulate Ω. Combining these two results, we conclude that Ω is the weakest failure detector for solving consensus. It is also possible to reduce Ω to ♦W, so ♦W is likewise a weakest failure detector for solving consensus. The generic reduction from Ω to ♦W in [16] is complicated; Chu [18] gave two simpler reductions from Ω to ♦W.

Both Ω and ♦W are widely used in the literature when designing consensus protocols, as they are the minimal augmentations necessary for solving asynchronous consensus. Ω is particularly popular due to its simple leader-election semantics.
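As an illustration of Ω's leader-election semantics, one common construction (a sketch with illustrative names, assuming a ♦P-style suspect list such as the one sketched in section 2.5.1) is to trust the smallest process id that is not currently suspected. Once the underlying detector permanently suspects every crashed process (completeness) and stops suspecting correct ones (eventual accuracy), every correct process trusts the same correct server forever, which is exactly Ω's property.

def omega_leader(all_pids: list[int], suspected: set[int]) -> int:
    trusted = [p for p in sorted(all_pids) if p not in suspected]
    # If everything is suspected, the choice is arbitrary; mistakes are
    # allowed, since Ω's guarantee is only eventual.
    return trusted[0] if trusted else min(all_pids)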

2.6 Lower bounds

So far, we have focused our discussion on the solvability of consensus. In this section, we review how fast asynchronous consensus can be achieved.

Lamport studied the speed of crash-failure consensus by analyzing the minimal number of communication steps needed for reaching consensus in the context of proposers, acceptors and learners [34]. The steps are counted from the time a value is proposed by the proposers to the time a value is learned by the learners. Since asynchrony can prevent any algorithm from reaching consensus for an arbitrarily long period of time, any lower bounds obtained refer only to synchronous runs (in the context of failure-detector-based protocols, this means that the failure detectors are accurate during these runs). Lamport showed that the answer is three, two, or one, depending on how many acceptors are used, how many acceptors may fail, exactly who takes on the roles of proposers and learners, and whether or not proposals are issued concurrently. The result can be informally summarized as follows. For an algorithm that uses n acceptors, that can always make progress despite up to f process failures, and that works with arbitrary assignments of proposers and learners:

• Learning is possible in three message delays iff n > 2f.

• Learning is possible in two message delays, when up to e acceptors can fail, for 0 ≤ e ≤ f and f > 0, iff n > 2e + f. However, concurrent proposals can prevent an algorithm from achieving such fast learning.

• Learning is impossible in one message delay.

Note that we have discussed that n > 2f is the minimal replication required for solving asynchronous consensus with Ω. The above result can roughly be translated as: when using Ω and the minimal number of replicas, three communication steps are needed for reaching consensus.
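The quorum arithmetic behind these bounds can be checked directly (a small illustrative sketch, not from the dissertation):

def classic_possible(n: int, f: int) -> bool:
    return n > 2 * f                 # three message delays need n > 2f

def fast_possible(n: int, e: int, f: int) -> bool:
    return 0 <= e <= f and f > 0 and n > 2 * e + f   # two delays need n > 2e + f

# For example, with e = f = 1: classic learning needs n >= 3 acceptors,
# while fast learning needs n >= 4.
assert classic_possible(3, 1) and not fast_possible(3, 1, 1)
assert fast_possible(4, 1, 1)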

Lamport also gave special cases in which none of these bounds holds. However, those special cases are uninteresting in practice. For example, learning is possible in one round if there is only one acceptor and the acceptor process also implements a learner. With this setup, the failure of the single critical acceptor is equivalent to the total failure of the system.

2.7 Paxos

Paxos [36, 37] is an efficient asynchronous consensus protocol that tolerates up to f crash failures with 2f + 1 replicas. Table 2.2 summarizes the messages in Paxos. Processes in Paxos have roles: there are proposers that propose values, acceptors that accept proposals, and learners that learn the consensus outcome based on the values accepted by the acceptors. It is common for a Paxos server to take on all three roles.

Table 2.2: Summary of messages in Paxos

Message       Meaning
NEW-LEADER    A new leader starts Phase 1
ACK           A replica acknowledges a NEW-LEADER message
REQUEST       A request is sent to the leader
PROPOSE       The leader proposes a request
ACCEPT        A replica accepts the proposed request
LEARN         The leader informs the replicas of the chosen value

Paxos is a leader-based protocol: a distinguished server (proposer) acts differently from the others and is responsible for assigning proposals to consensus instances. There can be more than one leader at the same time, but during such periods the protocol may not make progress.

Figure 2.3 illustrates the message flow in a run of a sequence of Paxos instances during failure-free executions. Although we show the instances executing sequentially, in practice they can overlap. Each instance of Paxos consists of one or more rounds, and each round can have three phases. Phase 1 (explained below) is only run when there is a leader change. Phase 1 can be run simultaneously for an unbounded number of future instances, which amortizes its cost across all instances that successfully choose a command.

Figure 2.3: The message flow of Paxos during failure-free runs

During normal execution, all client requests go through the leader, either by being sent directly to the leader or by being forwarded to the leader via another server. Upon receiving a request, the leader starts Phase 2 by sending PROPOSE messages that ask the servers (acceptors in Paxos terminology) to accept the value. If there are no other leaders concurrently proposing requests, then the servers acknowledge the request with ACCEPT messages. Once the leader receives ACCEPT messages from a majority of the servers, it learns that the value has been chosen and broadcasts a Phase 3 LEARN message to inform the other servers of the consensus outcome. Phase 3 can be omitted by broadcasting the ACCEPT messages instead, which reduces the learning latency for non-leader servers; Figure 2.5 shows the message flow of this alternative. This option, however, increases the number of messages significantly and so can lower throughput.

Figure 2.4: The message flow of Paxos leader change

When a leader crashes (Figure 2.4), the crash is eventually suspected, and another server eventually emerges as the new leader. The new leader then starts a higher-numbered round and polls the other servers to determine possible commands to propose by running Phase 1 of the protocol. It does this by sending out NEW-LEADER messages and collecting ACK messages from a majority of the servers. Upon finishing Phase 1, the new leader starts Phase 2 to finish any Paxos instances that were started but not finished by the old leader before it crashed.
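A corresponding sketch of the new leader's Phase 1 follows, under the same caveats. It assumes each ACK carries, for every unfinished instance in which the sender accepted a value, the round number of that acceptance; this is the information the new leader needs to apply Paxos' rule of re-proposing the value accepted in the highest round.

    # Sketch of a new leader's Phase 1: collect ACKs from a majority
    # and, per instance, keep the value accepted in the highest round.
    class Phase1NewLeader:
        def __init__(self, n, round_no):
            self.n = n
            self.round_no = round_no  # higher than any round seen so far
            self.acks = {}            # sender -> {instance: (round, value)}

        def start(self, broadcast):
            broadcast(("NEW-LEADER", self.round_no))

        def on_ack(self, sender, accepted):
            self.acks[sender] = accepted
            if len(self.acks) > self.n // 2:
                return self.values_to_propose()  # Phase 2 can now run
            return None

        def values_to_propose(self):
            # For each pending instance, pick the value accepted in the
            # highest round among the quorum's reports.
            best = {}
            for accepted in self.acks.values():
                for inst, (rnd, val) in accepted.items():
                    if inst not in best or rnd > best[inst][0]:
                        best[inst] = (rnd, val)
            return {inst: val for inst, (rnd, val) in best.items()}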

[Figure: timeline of Client0, Leader, Replica1, Replica2, and Client1, as in Figure 2.3 but with ACCEPT messages broadcast to all servers, making Phase 3 LEARN messages unnecessary.]

Figure 2.5: The message flow of Paxos when ACCEPT messages are broadcast

In the steady state, it takes three communication steps (client → leader → acceptor → learner) for a learner in Classic Paxos to learn the value if the request is directly sent to the leader.

2.8 PBFT

Table 2.3: Summary of messages in PBFT

Message       Meaning
REQUEST       A client sends a request to the leader
PRE-PREPARE   The leader proposes the client request
PREPARE       The replicas confirm the request proposed by the leader
COMMIT        The replicas accept the confirmed proposal
REPLY         The replicas send the reply to the client
NEW-VIEW      The replicas enter a new view

PBFT [14] is a replicated state machine protocol that tolerates up to f Byzantine failures with 3f + 1 replicas. Table 2.3 summarizes the messages in PBFT. PBFT can be considered an extension of Paxos from the crash failure model to the Byzantine failure model. Indeed, they share many important characteristics: both are leader-based (primary in PBFT terminology) protocols; both are always safe and are live as long as the leader is correct; and when the leader fails, both rely on a leader change (view change in PBFT terminology) to provide liveness.

[Figure: timeline of Client0, the Leader, and Replicas 1–3 exchanging REQUEST, PRE-PREPARE, PREPARE, COMMIT, and REPLY messages.]

Figure 2.6: The message flow of PBFT in the steady state.

Figure 2.6 shows the message flow of PBFT in failure-free executions. At a high level: (a) execution starts when the client sends its request to the leader in a REQUEST message; (b) the leader then assigns the request a consensus sequence number and disseminates this assignment by broadcasting PRE-PREPARE messages; (c) after that, the servers exchange PREPARE messages to agree on the assignment the leader has proposed; (d) if 2f + 1 matching PREPARE messages are obtained, the servers then exchange COMMIT messages to reach consensus, i.e., to guarantee that even if a leader (view) change occurs the new leader must propose the same value for this sequence number; (e) consensus is established when a server gathers 2f + 1 matching COMMIT messages; it can then execute the request and send a REPLY back to the client; (f) the client learns the outcome of the request once it receives f + 1 matching REPLY messages. Note that PRE-PREPARE and PREPARE in PBFT serve a similar function to PROPOSE in Paxos: the goal is to propose the request to the replicas. Similarly, COMMIT in PBFT and ACCEPT in Paxos are both used by replicas to inform each other of the values they have just accepted. When a quorum of replicas (f + 1 for Paxos and 2f + 1 for PBFT) has accepted the proposal, consensus is reached and subsequently learned.
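The quorum counting in steps (c)–(e) can be sketched as follows for a single replica and a single sequence number; view changes, request digests, and message authentication are deliberately omitted, and the 2f + 1 thresholds are the ones quoted above.

    # Sketch of one PBFT replica's quorum counting for one sequence number.
    class PBFTReplica:
        def __init__(self, f):
            self.f = f
            self.prepares = set()   # senders of matching PREPAREs
            self.commits = set()    # senders of matching COMMITs
            self.value = None
            self.prepared = False
            self.committed = False

        def on_pre_prepare(self, value, broadcast):
            # Step (c): echo the leader's assignment to the other replicas.
            self.value = value
            broadcast(("PREPARE", value))

        def on_prepare(self, sender, broadcast):
            # Step (d): 2f + 1 matching PREPAREs -> exchange COMMITs.
            self.prepares.add(sender)
            if not self.prepared and len(self.prepares) >= 2 * self.f + 1:
                self.prepared = True
                broadcast(("COMMIT", self.value))

        def on_commit(self, sender, execute):
            # Step (e): 2f + 1 matching COMMITs -> execute and reply.
            self.commits.add(sender)
            if not self.committed and len(self.commits) >= 2 * self.f + 1:
                self.committed = True
                execute(self.value)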

[Figure: timeline of Client0, the old Leader, the New Leader, Replica2, and Replica3. After the leader fails, the replicas exchange VIEW-CHANGE and VIEW-CHANGE-ACK messages, the new leader sends NEW-VIEW, and the protocol finishes with PREPARE, COMMIT, and REPLY.]

Figure 2.7: The message flow of PBFT view change

Figure 2.7 shows the message flow of PBFT when a view change occurs. When the leader in PBFT fails, it is eventually suspected by the other replicas since the protocol stops making progress. Correct servers then abandon the current view v and its leader l_v by changing to view v + 1. The servers do this by broadcasting VIEW-CHANGE messages and then using VIEW-CHANGE-ACK messages to relay the VIEW-CHANGE messages to server l_{v+1}, i.e., the leader of view v + 1. VIEW-CHANGE and VIEW-CHANGE-ACK messages serve a similar function to the ACK messages in Paxos: they inform the new leader of the current states of the replicas. Based on the information in the VIEW-CHANGE and VIEW-CHANGE-ACK messages, the new leader then computes the possible consensus outcomes to propose. It does this by sending the NEW-VIEW message, which serves a similar function to the PRE-PREPARE message. PBFT then proceeds with PREPARE and COMMIT messages to finish consensus, much like Paxos resumes Phase 2 after a leader change.

Chapter 3

Benefits and challenges of wide-area networks

This chapter explains the motivation for building efficient replicated state machine protocols for wide-area networks. Section 3.1 discusses the benefits of running a replicated state machine in a wide-area network and section 3.2 explains the challenges of doing so. Section 3.3 further demonstrates the challenges by comparing two crash-failure replicated state machine protocols in one particular wide-area setting. Section 3.4 concludes this chapter by discussing important factors that need to be taken into consideration when designing a protocol.

3.1 Benefits

The state machine replication approach is an important tool for building highly available services and providing strong consistency. It has been deployed in local-area networks as an infrastructure for coordinating applications [13,46,69]. While few wide-area systems have deployed replicated state machines, the same principle can be applied to coordinating wide-area applications. It is even more beneficial to do so in wide-area networks, as the larger variance of message latencies in such systems makes it more difficult to coordinate applications. Another benefit of running a replicated state machine in a wide-area network is to improve service reliability and availability by increasing failure independence.


Table 3.1: Four categories of wide-area replicated state machine systems

                              Distribution of servers
                              LAN          WAN
Distribution       LAN        {L×L}        {L×W}
of clients         WAN        {W×L}        {W×W}

To provide highly available services using less reliable replicas, a replicated state machine relies on the assumption that replicas fail independently. Running all the replicas in the same local-area network, however, makes the system vulnerable to co-located failures such as fire, power outage, and natural disaster. For example, an electrical explosion and subsequent fire at a data center caused hours of downtime for the whole data center [48]. As a result, thousands of servers in the data center were offline during that period until power was restored. Any service replicated within a single local-area network such as a data center will be rendered unavailable in the face of the aforementioned failures – all of the replicas are subject to the same failure at the same time. The only way to deal with co-location failures is to distribute the replicas across multiple locations so that they are more likely to fail independently in the face of such failures.

3.2 Challenges

Despite the benefits, few systems have deployed replicated state machines in wide-area settings. One of the most important concerns is performance. For example, the PNUTS [21] system opted to adopt a weaker consistency model due to latency concerns. The increased communication delay and decreased bandwidth of wide-area networks make it all the more important to deploy highly efficient protocols. The complexity of wide-area networks does not make the problem any easier. A wide-area system is more complex than simply a system that is not local-area. To illustrate this, we roughly classify distributed replicated state machine systems using two criteria: (1) how are the clients distributed? and (2) how are the servers distributed? Combining these two criteria, we can roughly classify systems into four categories, as shown in Table 3.1.

• {L×L}: A system whose clients and servers are all in the same local-area network. This is a local-area system. Such a system is not the main focus of this dissertation.

• {W×L}: A system whose clients are distributed across the wide-area network, but whose servers are in the same local-area network. For example, in such a system, a service is replicated in a data center and remote clients access the service via wide-area communication. In section 3.3, we compare two consensus protocols in such a {W×L} setting to illustrate some of the challenges in designing replicated state machine protocols for wide-area systems.

• {W×W}: A system whose clients and servers are both distributed across the wide-area network. A multi-site system is {W×W}. Such a system consists of several sites, and each site hosts one server and several clients. In chapters 4, 5 and 6, we design replicated state machine protocols for such multi-site systems.

• {L×W}: A system whose clients are in the same local-area network, but whose servers are distributed across the wide-area network. Though interesting, such a system is outside the scope of this dissertation.

It is not difficult to imagine that protocols designed primarily for one particular network setting may not work as efficiently when deployed in another setting. To further illustrate this idea, in the next section we compare two protocols, Paxos and Fast Paxos (a variant of Paxos that is designed to have lower latency than Paxos in local-area networks), in a {W×L} setting and show that Fast Paxos is actually slower.

3.3 Paxos vs. Fast Paxos

Paxos and Fast Paxos are two protocols that are at the core of efficient implementations of replicated state machines. In runs with no failures and no conflicts, Fast Paxos requires fewer communication steps for learners to learn of a request than Paxos. However, there are realistic scenarios in which Paxos has a significant probability of having a lower latency. This section discusses one such scenario with an analytical comparison of the protocols and simulation results in a {W×L} setting.

Table 3.2: Summary of Paxos and Fast Paxos protocols.

                       Paxos     Fast Paxos
Communication steps    3         2
Number of replicas     2f + 1    3f + 1
Quorum size            f + 1     2f + 1

3.3.1 Fast Paxos

We have already explained Paxos, an asynchronous consensus protocol that tolerates up to f failures with 2f + 1 replicas. It is also a leader-based protocol: the clients send their requests to the leader to be proposed. In the steady state, it takes three communication steps (client → leader → acceptor → learner) for a learner in Paxos to learn the value. Fast Paxos [39] is a variant of Paxos that is designed to have lower latency in local-area networks, hence the name. It saves one communication step by allowing clients to propose values directly to the acceptors (client → acceptor → learner). However, to preserve safety, a larger quorum of acceptors is necessary. The quorum size of Fast Paxos is 2f + 1, i.e., a learner needs to know that 2f + 1 acceptors have accepted the same value before it knows consensus has been reached. This is f replicas larger than the f + 1 quorum size of Paxos. As a result, Fast Paxos requires at least 3f + 1 acceptors while Paxos requires only 2f + 1. In addition, Fast Paxos can suffer from collisions, which can happen when two or more clients send proposals at nearly the same time and acceptors receive these proposals in different orders. A collision causes additional latency for Fast Paxos, and the extra latency can be large in some cases. For simplicity, we do not consider collisions when comparing Paxos and Fast Paxos. Doing so favors Fast Paxos. Our results show that Fast Paxos is slower despite such favoritism. Table 3.2 summarizes some of the facts about Paxos and Fast Paxos.
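Since the replication and quorum requirements in Table 3.2 are simple functions of f, the following few lines of Python merely tabulate them for small values of f.

    # Replication and quorum sizes from Table 3.2 as functions of f.
    def paxos(f):
        return {"replicas": 2 * f + 1, "quorum": f + 1, "steps": 3}

    def fast_paxos(f):
        return {"replicas": 3 * f + 1, "quorum": 2 * f + 1, "steps": 2}

    for f in (1, 2, 3):
        print(f, paxos(f), fast_paxos(f))
    # f = 1: Paxos uses 3 replicas with quorums of 2, while
    #        Fast Paxos uses 4 replicas with quorums of 3.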

3.3.2 Protocol Analysis

Fast Paxos enables a learner to learn a new client request in two communication steps, whereas Paxos requires three. In this section, we analyze the message latency of both flavors of Paxos, and we only discuss the case in which at most one server is faulty at any time, i.e., f = 1. In this case, Paxos requires a minimum of three servers while

Fast Paxos requires four. We first introduce some notation. Let lt(p_1, p_2) be the latency of a message sent from p_1 to p_2 and pc(p_1) be the processing time of an operation at process p_1. We also use smin{A,B,C} to denote the second smallest value among A, B, and C, and tmin{A,B,C,D} to denote the third smallest value among A, B, C, and D. We use the expression learning latency to denote the time for a given learner to learn a new request. For Paxos, the learning latency is given by the following expression:

learn_p = lt(Client, Leader) + pc(Leader) + smin{A_1, A_2, A_3}

A_i = lt(Leader, Acceptor_i) + pc(Acceptor_i) + lt(Acceptor_i, Learner),  i ∈ {1, 2, 3}

For Fast Paxos, the equivalent expression is as follows:

learn_fp = tmin{A_1, A_2, A_3, A_4}

A_i = lt(Client, Acceptor_i) + pc(Acceptor_i) + lt(Acceptor_i, Learner),  i ∈ {1, 2, 3, 4}

If we assume that the processing times are negligible, then we can simplify the previous equations to the following:

learn_p = lt(Client, Leader) + smin{A_1, A_2, A_3}

A_i = lt(Leader, Acceptor_i) + lt(Acceptor_i, Learner),  i ∈ {1, 2, 3}

learn_fp = tmin{A_1, A_2, A_3, A_4}

A_i = lt(Client, Acceptor_i) + lt(Acceptor_i, Learner),  i ∈ {1, 2, 3, 4}
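The simplified expressions translate directly into code. The sketch below evaluates both learning latencies from a hypothetical table lt of point-to-point latencies keyed by (sender, receiver) pairs; the names A1–A4, Client, Leader, and Learner are placeholders.

    # Learning latency under the simplified (zero processing time) model.
    def smin(values):
        return sorted(values)[1]   # second smallest

    def tmin(values):
        return sorted(values)[2]   # third smallest

    def learn_paxos(lt):
        legs = [lt[("Leader", a)] + lt[(a, "Learner")]
                for a in ("A1", "A2", "A3")]
        return lt[("Client", "Leader")] + smin(legs)

    def learn_fast_paxos(lt):
        legs = [lt[("Client", a)] + lt[(a, "Learner")]
                for a in ("A1", "A2", "A3", "A4")]
        return tmin(legs)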

To give an example in which Fast Paxos has a significant probability of having a higher latency than Paxos, we consider a {W×L} setting, in which all servers are in the same site, interconnected through a local-area network, and the clients are in different networks, connected to the servers through a wide-area network. Figure 3.1 shows the structure of such a system. We can then assume that the network communication among the servers is negligible compared to the cost of the wide-area latencies.

[Figure: clients connected through a WAN to a cluster of servers in a single LAN.]

Figure 3.1: The structure of a {W×L} system

For Paxos, we then have that the time for a learner to learn that a client request has been accepted is given by:

learn_p = lt(Client, Leader)

For Fast Paxos, it is given by:

learn_fp = tmin{lt(Client, Acceptor_i) : i ∈ {1, 2, 3, 4}}

Theorem 3.1: Suppose that wide-area latency ranges from α to β and follows a continuous probability density function (PDF) D(x), where x ∈ [α,β]. Paxos has a fixed probability of 0.6 of being faster than Fast Paxos.

Proof. The cumulative distribution function (CDF) C(x) of D(x) captures the probability that a message is delivered within time x, i.e., C(x) = Pr(latency < x), where x ∈ [α,β]. The distribution of the learning latency for Paxos is the same as the wide-area message latency distribution, as it is dominated by the message from the client to the leader, and the communication among the servers is negligible. The PDF and CDF for

Paxos are then given by D_p(x) = D(x) and C_p(x) = C(x) respectively. Assuming that the message latencies for Fast Paxos messages are independent, the CDF of the learning latency for Fast Paxos is as follows:

C_fp(x) = Pr(learn_fp < x) = 4C^3(x)(1 − C(x)) + C^4(x)

Recall that the probability of Fast Paxos learning a value by time x is the probability that at least three of the four messages from the client to the servers are delivered within time x. The PDF is given by:

P_fp(x) = dC_fp(x) / dx

Now consider an independent run of both protocols. The probability P that Fast Paxos is slower than Paxos is:

P = Pr(learn_fp > learn_p) = ∫_{x=α}^{β} P_fp(x) C_p(x) dx

Noting that C(α) = 0, C(β) = 1, and dC_fp(x) = 12 (C^2(x) − C^3(x)) dC(x), it is relatively simple to expand this equation and obtain:

P = 12 ∫_{C(x)=0}^{1} (C^3(x) − C^4(x)) dC(x) = 12 (1/4 − 1/5) = 3/5

This result implies that, independent of the distribution of wide-area message latencies, Paxos is faster than Fast Paxos 60% of the time in this particular network topology. Note that this proof only holds for continuous distributions. For example, if the message latency is constant, the result clearly does not hold. We offer the following explanation for the result. Intuitively, Fast Paxos has to wait for three messages out of four to make progress, whereas Paxos only requires one particular message. Even though the one message for Paxos can be slow, in most cases it is faster than waiting for three out of four messages. An alternate way of obtaining this same result, but under different constraints, is the following.

Theorem 3.2: Assume that the probability distribution of network latencies is not concentrated on any value. Paxos has a fixed probability of 0.6 of being faster than Fast Paxos.

Proof. This assumption is weaker than assuming that the PDF is continuous, and it implies that if we sample the distribution multiple times, then all samples will be distinct with probability 1. Fast Paxos samples the network latency distribution four times. Let these samples have values A, B, C, D, where A < B < C < D. Paxos samples the network latency distribution once. Let that sample have value E. By assumption, these five samples are distinct. There are five possible relations between the value E and the other four values:

1. E < A;

2. A < E < B;

3. B < E < C;

4. C < E < D;

5. D < E.

Each case has the same probability because we draw A,B,C,D,E from the same distribution. Thus, the probability of Cases 1, 2 or 3 is 3/5. These are the cases when Paxos is faster than Fast Paxos.

We can extend Theorem 3.2 to a system that tolerates up to f failures. In such a system, Paxos requires at least 2f + 1 replicas and Fast Paxos requires at least 3f + 1 replicas.

Theorem 3.3: Assume that the probability distribution of network latencies is not concentrated on any value. Paxos with 2f + 1 replicas has a fixed probability of (2f + 1)/(3f + 2) of being faster than Fast Paxos with 3f + 1 replicas.

Proof. The proof is similar to that of Theorem 3.2. We draw one variable P for Paxos and 3f + 1 variables (F_0 to F_{3f}) for Fast Paxos. We order P and F_0 – F_{3f}. There are a total of 3f + 2 cases, in 2f + 1 of which Paxos is faster. Since each of the 3f + 2 cases also has the same probability, the probability of Paxos being faster is (2f + 1)/(3f + 2).

From Theorem 3.3, as f increases, the probability of Paxos being faster approaches 2/3.
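Theorems 3.2 and 3.3 are easy to check numerically. The sketch below draws independent samples from one continuous distribution (exponential, as an arbitrary example) and compares Paxos' single sample against the (2f + 1)-th smallest of Fast Paxos' 3f + 1 samples; the estimated probabilities match (2f + 1)/(3f + 2).

    # Monte Carlo check of Theorems 3.2 and 3.3.
    import random

    def paxos_wins(f, trials=200_000):
        wins = 0
        for _ in range(trials):
            e = random.expovariate(1.0)   # Paxos' single latency sample
            fp = sorted(random.expovariate(1.0) for _ in range(3 * f + 1))
            if e < fp[2 * f]:             # (2f+1)-th smallest of 3f+1
                wins += 1
        return wins / trials

    for f in (1, 2, 3):
        print(f, paxos_wins(f), (2 * f + 1) / (3 * f + 2))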

3.3.3 Simulation

The performance of both variants of the Paxos algorithm depends upon how fast the network delivers messages to receivers. Their relative performance depends strongly on the variance of message latency. In many real networks, the variance in message latency is high due to traffic variations and non-deterministic scheduling of processes on a single computer. Informally, this observation implies that most of the time messages are delivered fast, but occasionally messages take one or two orders of magnitude longer to be delivered. In this section, we present simulation results on the latency of Paxos and Fast Paxos. The simulator we used assumes that processing time is negligible compared to message latencies, and consequently the learning latency is the sum of the message latencies. To simulate message latencies, we considered traces obtained with NWS (Network Weather Service [65]) in the GrADS testbed [11] over the period between August and October of 2002. These are traces of TCP connections between pairs of machines. Each trace contains the time to establish a TCP connection, send four bytes, receive four bytes, and close the connection. Our simulator is trace-driven. We associate the history between two computers in our dataset with a channel of our simulator, and for every message that crosses the channel, we obtain the latency for this message from the associated history. Also, we consider failure-free runs only, and we assume no conflicts between different clients. Failure-free runs should be the common case in many systems, and collisions introduce extra complexity into our environment that is not necessary to make our point. In fact, had we considered collisions for Fast Paxos, the latency for Fast Paxos would have been higher. The case we present consists of a client and a set of servers implementing Paxos, where the client is in one site and all the servers are in another site. That is, only the communication between the client and the servers crosses a wide-area network. For this scenario, we selected two different sites A and B from our dataset, and used the traces between two machines in different sites, one in A and one in B, and between pairs of machines in site B. Figures 3.2 and 3.3 show the cumulative fraction of requests (y-axis) with a given latency (x-axis). The latency of Figure 3.2 includes the time for a client to send a request to the servers implementing Paxos, the time for servers to exchange messages, and the time for a learner to learn this request by receiving accept messages from a quorum of acceptors. In addition to the latency mentioned for the case of Figure 3.2, the latency of Figure 3.3 also includes the time to send a response back to the client.
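A minimal sketch of this trace-driven comparison is shown below. It assumes the per-channel histories have already been loaded into lists wan and lan of latencies in milliseconds, and it samples them randomly rather than replaying them in order, which simplifies the simulator described above.

    # Trace-driven latency comparison in the {W×L} setting.
    import random

    def sample(trace):
        return random.choice(trace)

    def learn_latency_paxos(wan, lan):
        # Client -> leader crosses the WAN; the leader -> acceptor ->
        # learner legs stay in the LAN. Wait for the 2nd smallest of 3.
        legs = sorted(sample(lan) + sample(lan) for _ in range(3))
        return sample(wan) + legs[1]

    def learn_latency_fast_paxos(wan, lan):
        # Client -> acceptor crosses the WAN; acceptor -> learner stays
        # in the LAN. The learner waits for the 3rd smallest of 4.
        legs = sorted(sample(wan) + sample(lan) for _ in range(4))
        return legs[2]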

In these figures, if we draw a vertical line at some value x_0, then the two y values of the points at which this line crosses the curves correspond to the fractions of instances for which Paxos and Fast Paxos obtain a latency value x ≤ x_0. From the learning curves, for values of x_0 < 90ms, the fraction of instances for which Paxos obtains this latency is larger than the same fraction for Fast Paxos. The curves cross roughly at 90ms and, for values of x_0 > 90ms, the roles change, and the fraction of instances that have latency x or smaller is higher for Fast Paxos. The intuition for this result is as follows. Suppose we pick an instance of Paxos as a reference, and consider the latency for a client request to reach the proposer. If the latency for this message is low, then there is a high probability that an instance of Fast Paxos using the same latency distribution has a higher latency. This is due to the variance in message latency. As there are more messages from the client to the acceptors, the probability that at least two messages have a higher latency is significant compared to the single request message to the proposer in Paxos. If the request latency in Paxos is high, then there is a high probability that Fast Paxos is faster, because it can discard one message among all four sent to the acceptors if this message is too slow. Note also that the difference between Fast Paxos and Paxos is more noticeable in the learning latency graph. For example, if we pick the value x = 80ms, the difference between the fractions of instances that have at most this learning latency value is over 0.7. In the client latency figure, the difference between y fraction values for the same latency value x is not greater than 0.5. This is due to the addition of the latency to respond to the client in the client latency graph, which takes another wide-area communication step. This new step increases the variability in the latency of instances, as instances that are learned fast now have a non-negligible probability of encountering slow wide-area messages on the way back to the client. From a different perspective, we can also draw horizontal lines to determine the latency values for which we obtain a particular fraction of instances. For example, if

[Figure: y-axis is the cumulative fraction of requests (0 to 1); x-axis is latency from 60 to 200 ms; one curve each for Classic Paxos and Fast Paxos.]

Figure 3.2: Cumulative distribution comparing Paxos and Fast Paxos, learning latency.

y = 0.5, then in the learning latency graph Paxos obtains this fraction with a latency of 62ms, whereas Fast Paxos obtains this fraction with 83ms. In general, for values of y < 0.9, the latency value for such a fraction is smaller for Paxos in both graphs. Paxos and Fast Paxos swap roles for y > 0.9.

3.3.4 Summary

So far, we have compared Paxos and Fast Paxos in a {W×L} setting. Though we did not consider collisions, which favors Fast Paxos, our analytical and simulation results showed that Fast Paxos can be significantly slower in terms of latency than Paxos. We also showed that the higher latency of Fast Paxos in this {W×L} setting is due to its larger quorum size and the variance of message delivery latency in the wide area. Our comparison serves as a clear example that a protocol (e.g., Fast Paxos) designed to perform well in one particular network setting (e.g., {L×L}) may perform poorly in another setting (e.g., {W×L}). In the next section, we conclude this chapter by discussing some of the important design considerations for efficient replicated state machine protocols.

[Figure: y-axis is the cumulative fraction of requests (0 to 1); x-axis is latency from 100 to 400 ms; one curve each for Classic Paxos and Fast Paxos.]

Figure 3.3: Cumulative distribution comparing Paxos and Fast Paxos, client latency.

3.4 Design considerations

Theoretically, counting the number of communication steps to learn the value is a simple way to evaluate the performance of consensus protocols. Indeed, in a system where system load is low and network delivery variance is small, communication steps can be an excellent predictor of latency. However, communication steps do not always translate into practical performance because of the asynchronous nature of the system. Especially in a wide-area network, there can be a large variance in message latency. In addition, often theoretical analysis overlooks other factors that can add to latency, such as available network bandwidth and scheduling. In this section, we discuss a set of issues to consider when designing a consensus protocol for a given system.

Message complexity: Message complexity plays an important role in high-load systems. An increase in the number of messages by a constant factor can incur a

significant performance penalty when system load increases. For example, Dobre et al. propose a new consensus algorithm that performs better under collisions [25], but requires a larger number of messages by a constant factor. Not surprisingly, their evaluation results show that this larger number of messages can cause a significant performance penalty when system load is high.

Quorum size: Larger quorum sizes require higher degrees of replication for the same degree of fault tolerance. In addition, in systems with significant variance in message delay and in operating system scheduling latency, a larger quorum size can lead to significantly higher latency [31].

Single critical path: The existence of a single critical path makes performance vulnerable to a single performance problem along that path. Increased delay from the client to the leader in Paxos slows the whole algorithm down. Paxos can achieve up to a 50% performance gain when it is able to dynamically change the leader [56].

Resistance to collisions: Fast-learning consensus protocols, such as Fast Paxos, rely on the absence of collisions to achieve high performance, and according to the Collision-Fast Learning Theorem by Lamport, no general consensus algorithm can be fast learning upon collisions [34]. Collisions are inherently hard to avoid in wide-area environments, in particular when there are multiple clients submitting requests concurrently. Because of the large values and high variance of message latency in wide-area environments, there is often a higher probability of collisions when using larger quorums such as the quorums that Fast Paxos requires.

Besides designing new protocols, the following two techniques can be used to achieve high performance across a wide range of environments, though not without their respective drawbacks.

1. Run multiple protocols concurrently;

2. Switch between protocols depending on the network conditions.

For example, Fast Paxos outperforms Paxos when both system load and message delay variance are low, but can do worse otherwise [25]. One can hope to achieve the better of the two protocols at any time by using both protocols. However, this may not be ideal in practice. When two protocols are run at the same time, additional care needs to be taken to make sure the two protocols choose consistent values. In addition, more messages will be sent when running the two at the same time, which may hurt system performance as load increases. Putting aside the policy of deciding when to switch, switching between protocols is straightforward when the consensus protocol is used to support replicated state machines. Switching can be learned as a proposal in the command sequence of the replicated state machine. A system is usually designed to have a maximum number (say α) of concurrent consensus instances, and so if a switching command is learned in instance i, switching can be achieved in instance i + α. Furthermore, no-ops (a no-op is a command that leaves the state unchanged and generates no replies) can be used in bulk to fill the gap from instance i + 1 to i + α − 1 so as to speed up the switching process, as the sketch below illustrates. The less straightforward question is when to make the switch, which is usually tied to the network settings and the goal for which the system tries to optimize.
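As a concrete illustration of the mechanism just described, the sketch below shows the bookkeeping a server might perform when a SWITCH command is learned. ALPHA, propose_noop, and schedule_switch are assumed, illustrative names; they are not part of any protocol defined in this dissertation.

    # Sketch: a SWITCH learned at instance i takes effect at i + ALPHA;
    # the gap i+1 .. i+ALPHA-1 is filled with no-ops in bulk.
    ALPHA = 128   # assumed maximum number of concurrent instances

    def on_learned(instance, command, propose_noop, schedule_switch):
        if command.kind == "SWITCH":
            for j in range(instance + 1, instance + ALPHA):
                propose_noop(j)
            schedule_switch(at_instance=instance + ALPHA,
                            new_protocol=command.protocol)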

3.5 Summary

Running replicated state machines in wide-area networks can benefit applications in the wide area. It also helps to cope with co-location failures. Doing so is also challenging: most existing protocols are designed for local-area networks and may not work well in wide-area systems. Designing an efficient protocol for a wide-area system needs to take into account not only the complex system structure but also the high message delivery variance in the wide-area network. In the next three chapters of this dissertation, we continue studying replicated state machine protocols in one such setting: multi-site {W×W} systems.

3.6 Acknowledgement

This chapter is, in part, a reprint of material as it appears in “Classic Paxos vs. Fast Paxos: Caveat Emptor”, by Flavio Junqueira, Yanhua Mao, and Keith Marzullo, in the Proceedings of the 3rd Workshop on Hot Topics in System Dependability (HotDep), June 2007. The dissertation author was the primary coauthor and co-investigator of this paper.

Chapter 4

Rotating leader design for multi-site systems

The remaining chapters of this dissertation focus on state machine replication for multi-site systems – a kind of {W×W} system. This chapter lays the groundwork for designing efficient replicated state machine protocols for such multi-site systems by introducing the general design principle of rotating leader based protocols. Section 4.1 explains the architecture of multi-site systems. Section 4.2 introduces simple consensus, a restricted form of consensus that is later used in section 4.3 to build protocol R – a generic rotating leader based replicated state machine protocol. Section 4.4 analyzes the performance metrics of R and section 4.5 discusses general optimizations that can be applied to R.

4.1 Multi-site systems

We model a multi-site system as a collection of n sites interconnected by a wide-area network. Each site consists of a server and a group of clients. These run on separate processors and communicate through a local-area network. The servers communicate with each other through the wide-area network, and together they implement a replicated state machine. Up to f servers can fail, either in a fail-stop or in a Byzantine manner, depending on the desired failure model. We also allow a server to recover later. A server serves as the main portal for its local clients, i.e., unless the local server is faulty,


[Figure: three sites, each a local-area network containing one server and several clients, interconnected by a WAN.]

Figure 4.1: A multi-site system

clients access the replicated service by sending requests to their local server via local-area communication. When a server fails, we either assume it is acceptable for clients in the same site as the faulty server to stop making progress, or we assume the clients forward their requests to some remote site. Communication in the multi-site system, both within and between sites, is asynchronous, since the servers communicate through wide-area networks and the servers and clients run on generic modern operating systems. In terms of network characteristics, we assume the wide-area network has higher latency and lower bandwidth than the local-area networks, and that the latency can have high variance. We model the wide-area network as a set of links pairwise connecting the servers (sites). The bandwidth between pairs of servers can be asymmetric and can vary over time. We do not consider any dependent behavior of these links. For example, we do not consider issues such as routers that are bottlenecks for communication among three or more servers. This assumption holds when servers are hosted by data centers and links between centers are dedicated. It is preferable to design protocols to be adaptable to link behaviors. One interesting application of such multi-site systems is to provide highly available services that do not suffer from co-location failures, since no two servers are in the same site. Another application is to coordinate wide-area applications that are spread across multiple sites, for example, a multi-site resource management/locking system. Running a single-leader based protocol in a multi-site system has a few disadvantages. For example, doing so gives performance advantages to the clients co-located with the leader over other clients. This is because in a single-leader based design, all client requests are assigned to consensus instances by the leader. The would-be wide-area communication between the clients and the leader becomes local communication for clients co-located with the leader. For a system that assumes Byzantine failures, such a performance advantage gives an incentive for selfish servers to compete for the leader role, which in turn may lead to sub-optimal performance for the system as a whole. Even for crash failure systems, such a performance advantage is unfair to the non-leader sites – it is preferable for all the sites to be able to enjoy such a performance advantage, if possible. The leader may also become a performance bottleneck of the system, since the leader typically sends, receives and processes more messages than non-leader servers. To deal with such problems, we use a rotating leader design. The idea is to partition the consensus sequence space so that each server is responsible for an unbounded number of instances. For example, in a system with three servers, server p0 could be assigned the consensus instances 0, 3, 6, ..., server p1 the consensus instances 1, 4,

7, ..., and server p2 the consensus instances 2, 5, 8, .... This design not only allows the system to avoid a performance bottleneck but also allows all the sites to share the advantage of having the leader. However, the rotating leader design only makes sense if it has little or no overhead compared to (static) leader based protocols. To achieve this, we introduce simple consensus in the next section and use it to implement efficient rotating leader based replicated state machine protocols.

4.2 Simple Consensus

A traditional replicated state machine runs an unbounded number of instances of consensus. By design, consensus allows different servers to propose different values in the same instance of consensus, which can cause contention. To resolve contention during normal execution, a single leader is responsible for assigning requests to consensus instances. Other servers serve as backups and only take action when they suspect the leader has failed. We take this concept one step further by introducing simple consensus, which is a restricted form of consensus.

Simple consensus is consensus with a restricted initial state. Let no-op be the special state machine command that leaves the state unchanged and generates no response. In simple consensus instance i, only one distinguished server c, which we call the coordinator or the owner, can propose any command (including no-op); the others, which we call followers, can only propose no-op. Typically, the coordinator is the default leader for the simple consensus instance. The simplest form of simple consensus is the one in which the coordinator can propose either 0 (no-op) or 1 (a value other than no-op) and the followers can only propose 0. It is straightforward to see that the two forms of simple consensus are equivalent, because the coordinator of the more complex form is bound to propose one specific value even if it has many values to choose from. We can easily map that value to 1 and no-op to 0. More formally, the simple consensus problem is defined as follows: one server assumes the role of coordinator; all the servers agree ahead of time on the identity of the coordinator. Each server in a simple consensus starts with an initial value to propose. The coordinator starts with either 0 or 1, and the followers start with 0. The servers later decide on some proposed value. A simple consensus implementation satisfies the following five properties.

Definition SC.1 (Termination): Every correct server eventually decides on some value.

Definition SC.2 (Validity): If all servers propose the same value v, then every correct server that decides decides on v.

Definition SC.3 (Integrity): Every correct server decides on at most one value.

Definition SC.4 (Agreement): If a correct server decides on v, then every correct server decides on v.

Definition SC.5 (Non-triviality): In every equivalence class of the relation ≡ on failure detector outputs, there is a run in which the correct servers decide on 1.

Note that the first four properties (SC.1 – SC.4) are the same as the four properties of consensus (C.1 – C.4). The equivalence class relation ≡ for the failure detector output was introduced in [16] to define WC.1 (Non-triviality) for weak consensus (see below).

Here, we use it to prevent a trivial implementation of simple consensus that decides on 0 regardless of the initial values, or that merely decides on a value using the failure detector as the sole input. Note that there are only two initial states for simple consensus: (1) the coordinator proposes 0 and (2) the coordinator proposes 1. Note that, if the coordinator proposes 0, the consensus outcome must be 0 according to the Validity property. So, for Non-triviality, a run that decides on 1 can occur only if the coordinator proposes 1. It is not difficult to see that simple consensus is a restricted form of consensus and that any protocol that solves consensus can be used to solve simple consensus.

Lemma 4.1: Let P be a protocol that implements consensus; then P implements simple consensus.

Proof. The consensus protocol P can easily be transformed into a simple consensus protocol P′ by running P with the simple consensus input. In addition, P′ decides on the same value as P.

SC.1 – SC.4: These are trivially satisfied due to C.1 – C.4.

SC.5 (Non-triviality): If the coordinator proposes 0, the outcome of P′ is 0 because all servers proposed 0. If the coordinator proposes 1, there exists a run of P′ such that 1 is decided, because the initial state is bivalent.

Simple consensus, however, is not simpler than consensus in terms of the underlying failure detector required. To explain, we use the weak consensus problem. Weak consensus is the same as consensus, except that the Validity property is replaced by the following Non-triviality property. We assume the only possible input values to weak consensus are 0 and 1.

Definition WC.1 (Non-triviality): In every equivalence class of the relation ≡ on failure detector outputs, there is a run in which all correct servers decide on 0 and a run in which all correct servers decide on 1.

It has been shown that even weak consensus is not deterministically solvable in an asynchronous environment [26]. It has also been shown that if a failure detector D can be used to solve weak consensus, it can be used to implement Ω. This makes weak consensus equivalent to consensus in terms of the failure detector class required.

Lemma 4.2: If a failure detector D can be used to solve weak consensus, it can be used to solve simple consensus.

Proof. If D can be used to solve weak consensus, it can be used to implement Ω, which in turn can be used to solve consensus, which according to Lemma 4.1 can be used to solve simple consensus.

Lemma 4.3: If a failure detector D can be used to implement simple consensus using protocol P, then D can be used to implement weak consensus.

Proof. We construct a weak consensus protocol P′ in the following way. Let v be the weak consensus input value of the coordinator. With P, the coordinator proposes 1, and it attaches v to its proposal. If the simple consensus outcome of P is 1, then P′ decides on v; otherwise P′ decides on 0. P′ solves weak consensus because:

C.1 (Termination): The Termination property of simple consensus guarantees that P′ eventually terminates. If the outcome is 0, all servers eventually decide on 0. If the outcome is 1, since v is attached to the proposal 1, all servers that have decided already know v. At that point, they can decide on v.

C.3 (Integrity) and C.4 (Agreement): Trivially verified.

WC.1 (Non-triviality): In every equivalence class of the relation ≡ on failure detector outputs, there is a run in which P decides on 1. In this run, P′ decides on v. If v is 1, then P′ decides on 1. If v is 0, then P′ decides on 0.

Theorem 4.4: A failure detector D can be used to solve simple consensus if and only if it can be used to solve consensus.

Proof. From Lemmas 4.1, 4.2 and 4.3.

Since simple consensus is equivalent to consensus in solvability, one might ask what the benefit of introducing it is. The primary benefit of simple consensus is that it allows the servers to learn 0 quickly if the coordinator has proposed 0. This in turn enables the design of efficient replicated state machines based on a rotating leader scheme, by allowing the servers to learn no-op (0) quickly once they have confirmed that the coordinator has proposed no-op (0). Here we introduce a new term: the confirmed proposal from the coordinator. Roughly speaking, a server confirms that a value v is proposed by the coordinator only if it knows that no value other than v can be confirmed at the correct servers as the value proposed by the coordinator.

Definition 4.1 (Confirmed proposal): A value v is confirmed at a correct server as the proposal by the coordinator if and only if no value other than v can be confirmed at any of the correct servers.

Lemma 4.5: In simple consensus, a server learns 0 immediately once it confirms that the coordinator has proposed 0.

Proof. By the definition of confirmed proposal, no value other than 0 can be proposed by the coordinator. The followers by definition can only propose 0. Since all servers can only propose 0, 0 is the only possible outcome.

Note that confirming that the coordinator has proposed a value v has a different meaning under different failure models. With crash failures, as soon as a follower receives v as a proposal from the coordinator, it can confirm that the coordinator has proposed v. This takes only one communication step from the coordinator to the followers. With Byzantine failures, however, a follower cannot simply do so on the first proposal it receives from the coordinator, since the coordinator may tell some followers that it has proposed 0 and other followers that it has proposed 1. In this case, either 0 or 1 may become the simple consensus outcome. To deal with this problem, additional communication steps (e.g., a flooding sub-protocol described in [60]) may be needed to confirm that the coordinator has proposed value v. We will revisit this when considering the Byzantine model in chapter 6.

4.3 R: a rotating leader protocol

In this section, we explain the general design of rotating leader based replicated state machine protocols that run an unbounded sequence of simple consensus instances. The goal of our design is to allow multiple servers to propose values concurrently without introducing collisions during a period of stability, which is a period during which there are no failures or false suspicions. The general idea is to run an unbounded sequence of simple consensus instances instead of consensus instances and to partition the simple consensus space so that the servers take turns being the coordinator and proposing commands. Since each server always has a future turn to be the coordinator and propose a command, there is no need for the servers to contend, i.e., if a server p is acting as a follower for an instance of simple consensus, there is no need for p to propose no-op while the coordinator has not been suspected. The result is a rotating leader scheme that has no contention during a period of stability. More formally, the rotating leader scheme is defined as follows:

Definition 4.2 (Rotating leader scheme): An unbounded number of simple consensus instances is run to implement a replicated state machine. For each instance, one server is designated as the coordinator, i.e., the default leader of the instance. Let coordinator(i) be the coordinator of instance i. The scheme requires that: (1) the assignment of instances to coordinators is known by all servers; (2) ∀m ∈ N_0 and for every server p ∈ {0, ..., n−1}, ∃i > m such that coordinator(i) = p; and (3) for every server p ∈ {0, ..., n−1} there is a bounded number of instances assigned to other servers between consecutive instances that p coordinates.

The first property says there exists an agreed upon and well-known assignment of coordinators. The second property restricts the assignment so that every server always has a future turn as coordinator. Finally, the last property further restricts the assignment to bound the “waiting time” of a server between turns. A simple round-robin scheme is an example that satisfies the above definition. This scheme assigns instance cn + p to server p, where c ∈ N_0 and p ∈ {0, ..., n − 1}. Without loss of generality, we assume this scheme for the rest of this dissertation.
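In code, the round-robin assignment and the “next instance I coordinate” computation that the rules in section 4.3.2 rely on look as follows (a sketch; the function names are ours):

    # Round-robin assignment: instance c*n + p belongs to server p.
    def coordinator(i, n):
        return i % n

    def next_owned_instance(p, after, n):
        # min{k : coordinator(k) = p and k > after}
        k = (after // n) * n + p
        return k if k > after else k + n

For example, with n = 3 servers, next_owned_instance(1, 4, 3) is 7, matching the assignment of instances 1, 4, 7, ... to server p1 above.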

4.3.1 Coordinated Protocols

We have shown that any consensus protocol can be used to solve simple consensus (Lemma 4.1). The benefits of the rotating leader scheme, however, lie in the use of an efficient implementation of simple consensus (Lemma 4.5). To illustrate, we introduce a class of protocols that we call coordinated protocols for simple consensus. In such a protocol, a server may take one of the following actions based on its role in simple consensus.

Suggest: The coordinator suggests a request v ≠ no-op by proposing it to other servers.

Skip: The coordinator skips its turn by proposing no-op to other servers.

Revoke: A follower revokes the coordinator by proposing no-op to other servers. Revocation is usually a result of the coordinator being suspected by a follower.

The actions suggest, skip and revoke specialize mechanisms that already exist in simple consensus. Making them explicit, however, enables more efficient implementations in wide-area networks. Examples of coordinated protocols include Coordinated Paxos (see section 5.3.2) for crash failures and Coordinated Byzantine Paxos (see section 6.4) for Byzantine failures. Common optimizations that can be applied to coordinated protocols include:

Default leader: The identity of the coordinator is well known. So, the coordinator can serve as the default leader of the simple consensus instance. This way, the protocol can start from a state in which the leader status of the coordinator has been established. The coordinator can then suggest a request immediately after receiving it from the client. Doing so results in minimal latency for reaching consensus in the absence of failures and false suspicions. For example, assuming crash failures, Coordinated Paxos implements simple consensus and allows the coordinator to learn the value it suggested in just one round-trip delay, which is the theoretical lower bound.

Fast skipping: We have already explained in Lemma 4.5 that every server can learn no-op as soon as it confirms that the coordinator has skipped. This allows skips to

be learned even faster than suggestions. For example, Coordinated Paxos implements skips in just one one-way communication delay from the coordinator to the followers.

4.3.2 A generic rotating leader protocol

Assume the existence of a coordinated simple consensus protocol A. We use A to construct an intermediate protocol R for implementing replicated state machines. Note that R captures the essential structure of rotating leader protocols without specifying the environment in which R runs. By replacing A with an appropriate protocol, we can reuse this structure to build different protocols for different environments. Note that R is not an efficient protocol by itself: we introduce optimizations to improve its efficiency. For example, in chapter 5, we introduce optimizations specific to the crash failure model and build Mencius. Later, in chapter 6, we take a similar approach for the Byzantine failure model and build protocol RAM. Should we need to implement or optimize replicated state machines under different assumptions, we can reuse R by selecting an appropriate protocol A and applying appropriate optimizations. At a high level, R runs an unbounded sequence of simple consensus instances in the scheme defined by Definition 4.2. R commits the value learned in instance i if and only if R has learned and committed all values learned in instances prior to i. R needs to handle duplicate requests that arise from, for example, clients sending requests multiple times due to timeouts. This can be done using any well-known technique, such as assuming idempotent requests or recording committed requests and checking for duplicates before committing. Without loss of generality, we assume, for the rest of this dissertation, that the latter is used and all duplicate requests are silently dropped. Each instance of simple consensus is solved using a coordinated protocol A. We describe R using a set of rules that determine the behavior of a server, and we argue that R is correct using these rules.
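The in-order commit rule and the duplicate handling just described amount to the following sketch; CommitQueue and execute are illustrative names, and using the request value itself as the duplicate key is a simplification of recording committed requests.

    # Sketch of R's commit logic: instance i commits only after every
    # instance below i has been learned and committed.
    class CommitQueue:
        def __init__(self, execute):
            self.execute = execute     # application-level callback
            self.learned = {}          # instance -> learned value
            self.next_commit = 0
            self.seen = set()          # committed requests (duplicates)

        def on_learn(self, instance, value):
            self.learned[instance] = value
            while self.next_commit in self.learned:
                v = self.learned.pop(self.next_commit)
                if v != "no-op" and v not in self.seen:
                    self.seen.add(v)
                    self.execute(v)    # commit in instance order
                self.next_commit += 1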

Lemma 4.6: Protocol R satisfies R.2 (Agreement), R.3 (Integrity) and R.4 (Total Order).

Proof. We prove this lemma by arguing that each of the three properties holds.

R.2 (Agreement): If request r is committed by a correct server p, then the following hold: (1) r must have been decided by p in some simple consensus instance i; (2) p must have learned all instances prior to i; and (3) p does not learn r in any instance smaller than i. For any given correct server q, the Termination property of simple consensus guarantees that q will eventually learn all instances smaller than or equal to i as well. According to the Agreement property of simple consensus, the value learned in instance i is r and no value learned in instances smaller than i is r. Therefore q commits r once all instances smaller than or equal to i are learned and committed by q, which happens eventually.

R.3 (Integrity): As we have discussed, duplicate requests are handled by recording committed requests and discarding duplicates before committing. Now, suppose a request r is committed by a correct server. Then r must have been chosen in some simple consensus instance. The Validity property of simple consensus guarantees that r was proposed by some server in that instance, i.e., previously submitted by some server.

R.4 (Total Order): If a correct server p commits r_1 before r_2, then there must exist i_1 and i_2 (i_1 < i_2) such that (1) p has learned all instances smaller than or equal to i_2; (2) p learns r_1 in instance i_1 and does not learn r_1 in instances smaller than i_1; and (3) p learns r_2 in instance i_2 and does not learn r_2 in instances smaller than i_2. If a correct server q commits r_2 before r_1, then there must exist j_1 and j_2 (j_1 < j_2) such that (1) q has learned all instances smaller than or equal to j_2; (2) q learns r_2 in instance j_1 and does not learn r_2 in instances smaller than j_1; and (3) q learns r_1 in instance j_2 and does not learn r_1 in instances smaller than j_2. Without loss of generality, assume i_2 ≤ j_2. Since p learns r_1 in instance i_1, according to the Agreement property of simple consensus, q must also learn r_1 in instance i_1. Since i_1 < i_2 ≤ j_2, this contradicts the fact that q does not learn r_1 in any instance smaller than j_2.

For R.1 (Validity), we use the following Rules RL.1 – RL.4 to ensure that any client request sent to a correct server eventually commits.

As we have discussed, one of the benefits of using the rotating leader design is the ability to minimize the commit latency. To do that we let a server suggest a value immediately upon receiving it from a client.

Rule RL.1: Each server p maintains its next simple consensus sequence number I_p. We call I_p the index of server p. Upon receiving a request from a client, a server p suggests the request to the simple consensus instance I_p and updates I_p to the next instance it coordinates, i.e., I′_p = min{k : coordinator(k) = p ∧ k > I_p}, where I′_p is the updated index.

Rule RL.1 by itself performs well only when all servers suggest values at about the same rate. Otherwise, the index of a server generating requests more rapidly will increase faster than the index of a slower server. Servers cannot commit requests before all previous requests are committed, and so Rule RL.1 commits requests at the rate of the slowest server. In the extreme case that a server suggests no request for a long period of time, the state machine stalls, preventing a potentially unbounded number of requests from committing. Rule RL.2 uses a technique similar to logical clocks [35] to overcome this problem.

Rule RL.2: If server p receives a suggestion for instance i and i > I_p, then before proceeding with protocol A, p updates I_p such that its new index is I′_p = min{k : coordinator(k) = p ∧ k > i}. p also executes skip actions for each of the instances in the set {k : I_p ≤ k < I′_p ∧ coordinator(k) = p}, i.e., all instances in the range [I_p, I′_p) that p coordinates.

With Rule RL.2, slow servers skip their turns. Consequently, the requests that fast servers suggest do not have to wait for slow servers to have requests to suggest before committing. However, a faulty server may not skip, for example, because it has crashed or because it is acting in a Byzantine manner. Such a server can prevent others from committing. Rule RL.3 overcomes this problem.
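Rules RL.1 and RL.2 can be sketched by reusing next_owned_instance() from the round-robin sketch in section 4.3; suggest and skip stand for the corresponding actions of protocol A and are assumed primitives.

    # Sketch of Rules RL.1 and RL.2 for server p.
    class RotatingProposer:
        def __init__(self, p, n, suggest, skip):
            self.p, self.n = p, n
            self.suggest, self.skip = suggest, skip
            self.index = next_owned_instance(p, -1, n)

        def on_client_request(self, request):
            # Rule RL.1: suggest at the current index, then advance it.
            self.suggest(self.index, request)
            self.index = next_owned_instance(self.p, self.index, self.n)

        def on_remote_suggestion(self, i):
            # Rule RL.2: skip our turns below i, leaving the index at
            # min{k : coordinator(k) = p and k > i}.
            while self.index < i:
                self.skip(self.index)
                self.index = next_owned_instance(self.p, self.index, self.n)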

Rule RL.3: Let q be a server that another server p suspects has failed, and let C_q be the smallest instance that is coordinated by q and not learned by p. p revokes q for all instances in the range [C_q, I_p] that q coordinates.

Rule RL.3 allows the non-faulty servers to fill the gaps left by faulty servers with no-ops so that the correct servers can make progress. In principle, all non-faulty servers can start revocation upon suspicion of another server. With the crash failure model, this means any non-faulty server can start revocation against a suspected server at any time. In practice, however, one non-faulty server is usually elected to lead the revocation process, to avoid wasting resources or causing liveness problems (see section 5.4.3). With the Byzantine failure model, in addition to electing a leader to lead the revocation process, at least f + 1 suspicions against the same server are required to start revocation: since up to f servers can fail in a Byzantine manner, f + 1 suspicions ensure that at least one of them is from a non-faulty server.

Lemma 4.7: When there is no false suspicion, R with Rules RL.1 – RL.3 satisfies R.1 (Validity).

Proof. If any correct server p suggests a value v to instance i, a server updates its index to a value larger than i upon receiving this suggestion. Thus, according to Rule RL.2, every correct server r eventually proposes a value (either by skipping or by suggesting) to every instance smaller than i that r coordinates, and all non-faulty servers eventually learn the outcome of those instances. For instances that faulty servers coordinate, according to Rule RL.3, non-faulty servers eventually revoke them, and non-faulty servers eventually learn the outcome. When the instances prior to i are eventually learned and committed, the value learned in instance i can then be committed. Since we assume there is no false suspicion, the followers do not take the revocation action. So v, the value suggested by the coordinator, is eventually chosen as the outcome of instance i and eventually committed.

False suspicions, however, are possible with unreliable failure detectors and can potentially occur an unbounded number of times. We add Rule RL.4 to allow a server to suggest a request multiple times upon false suspicions.

Rule RL.4: If server p suggests a value v ≠ no-op to instance i, and p learns that no-op is chosen, then p suggests v again.

Ω has been shown to be the weakest failure detector for solving Consensus, and ♦W and Ω are equivalent classes of failure detectors [16]. However, they are not sufficient to implement a replicated state machine with R. For example, ♦W only guarantees that there is a time after which at least one correct server is never suspected. In other words, some correct server could be permanently falsely suspected and hence be revoked for an unbounded number of instances in R, keeping this server and its clients from making progress. There are two ways of dealing with this problem. One way is to use a stronger failure detector, for example ♦P, the next-strongest commonly used failure detector after ♦W.

Lemma 4.8: Assuming ♦P, R with Rule RL.1 – RL.4 satisfies RSM-Validity.

Proof. Suppose a correct server p suggests a value v. By Rule RL.4, p continues to re-suggest v upon false suspicion until v is chosen. By the definition of ♦P, there exists a time t after which p is never suspected. If v has not been chosen by t, then by Lemma 4.7, v will be chosen once p re-suggests v after t. Therefore, R satisfies RSM-Validity.

Theorem 4.9: Assuming ♦P, R with Rule RL.1 – RL.4 implements replicated state machines.

Proof. From Lemma 4.6 and Lemma 4.8.

Note that ♦P is strictly stronger than both ♦W and Ω. In practice, when assuming ♦P, a period of no false suspicion only needs to hold long enough for p to re-suggest v and have it chosen for the protocol to make progress. This is not much more difficult to implement than ♦W or Ω under the crash failure model. It is, however, considerably more difficult under the Byzantine failure model: for example, faulty servers can trick the correct servers into considering a correct server to be maliciously slow when the slowness is in fact caused by high network latency. When using ♦P is impossible, another way to deal with the insufficiency of ♦W or Ω is to allow a server to forward its request to another server when it is repeatedly falsely suspected. This way, a server is able to make progress as long as it can find a correct server to forward its request to. A server can eventually do so since we assume the existence of an Ω failure detector, the weakest failure detector for solving asynchronous consensus. Note that forwarding is a degeneration from the rotating leader design to the single leader design. In the extreme case, if all but one correct server are falsely suspected, all other servers forward their requests to that server: this is the same as having a single server propose the requests for all servers. So, the single leader design can be viewed as a special case of the rotating leader design with the forwarding mechanism. We formalize the forwarding mechanism as Rule RL.5 and prove its correctness in Theorem 4.10.

Rule RL.5: If server p suggests a value v ≠ no-op to instance i, and p learns that no-op is chosen, instead of re-suggesting v, p may opt to forward v to another server r that p believes is correct. Upon receiving v, r treats it the same as a client request.

Rule RL.5 implies that if p later suspects that r has failed and p has not yet learned v, p may forward v to yet another server r′.

Theorem 4.10: Assuming Ω, R with Rule RL.1 – RL.5 implements replicated state machines correctly.

Proof. From Lemma 4.6, we know that R always satisfies R.2 (Agreement), R.3 (Integrity) and R.4 (Total Order). So, we only need to prove R.1 (Validity). From Lemma 4.7, we know that only the requests sent to falsely suspected servers are not live. Rule RL.5 allows a falsely suspected server p to forward a request to another server r, and r will suggest the request. p can eventually find a correct server r to forward the request to because we assume Ω. Since any request suggested by a correct server is guaranteed to be live, any request submitted to p is also guaranteed to be live. So, R with Rule RL.1 – RL.5 satisfies Validity and hence implements replicated state machines correctly.

4.4 Performance analysis of R

Table 4.1: During a period of stability, the performance metrics of coordinated simple consensus protocol A derived from consensus protocol C

                                Simple consensus (A)             Consensus (C)
  Action                        Suggest    Skip     Revoke       Propose
  Learning   Leader/Coordinator Lsc        0        Lr           Lc
  latency    Follower           Lsf        Lk       Lr           Lf
  Message complexity            Ns         Nk       Nr           N
  Quorum size                   Qs         Qk       Qr           Q

Key metrics, such as the number of message delays, message complexity and quorum size, can be helpful for predicting the performance of a protocol. In this section, we analyze these metrics for R. Since R is built by running protocol A repeatedly, these metrics, without any optimization, are mostly determined by the underlying protocol A that is used. We only analyze these metrics for the common case, i.e., during a period of stability. This is because during unstable periods, in which there are new failures or false suspicions, no protocol can guarantee liveness: reaching consensus may take arbitrarily long communication delays and an arbitrarily large number of messages. We also assume a consensus protocol C is used to derive a coordinated simple consensus protocol A, which in turn is used to build R. Table 4.1 summarizes the performance metrics of A and C. In the table, we use L to denote learning latency, N to denote message complexity, and Q to denote quorum size. We also use subscript s to denote the suggestion action, subscript k to denote the skip action, subscript r to denote the revoke action, subscript c to denote the coordinator or the leader, and subscript f to denote the followers. The following relationships among the metrics of A and C can be assumed:

• Lk < Lsc ≤ Lsf < Lr, Nk < Ns < Nr, and Qk ≤ Qs: Lk < Lsc, Nk < Ns, and Qk ≤ Qs because skipping is a simpler operation than suggesting, according to Lemma 4.5. Lsc ≤ Lsf because the leader always learns no slower than the followers, and the leader can potentially learn faster in some cases due to certain optimizations. Lsf < Lr and Ns < Nr because the revocation process involves a leader change, which results in higher latency and larger message complexity.

• Lsc ≤ Lc < Lr, Lsf ≤ Lf < Lr, and Ns ≤ N: Lsc ≤ Lc, Lsf ≤ Lf, and Ns ≤ N because suggesting in the derived protocol A typically follows the same execution path as the leader proposing a value in C, though some simple-consensus-specific optimization may potentially reduce the communication steps for A. Lc < Lr and Lf < Lr because revocation, which typically involves a leader change, requires a longer communication chain than the failure-free proposing action of C.

• Qs = Qr = Q: this is because the derived protocol A still needs the same number of replicas as C to make progress.

Table 4.2: During a period of stability, the performance metrics of state machine R compared to consensus protocol C .

                                         State machine R              Relation    Consensus (C)

  No         Commit     Leader           max{Lsc, Lk + 1}             ≤           Lc
  failure    latency    Follower         max{Lsf, Lk + 1}             ≤           Lf
             Message complexity          Ns + (n − 1)Nk               >           N
             Quorum size                 n                            >           Q

  f          Commit     Leader           max{Lsc, Lk + 1, Lr}         >           Lc
  failures   latency    Follower         max{Lsf, Lk + 1, Lr}         >           Lf
             Message complexity          Ns + (n − f − 1)Nk + fNr     >           N
             Quorum size                 n − f                        ≥           Q

Using the values in Table 4.1, Table 4.2 summarizes the performance metrics of R and compares them to those of C . Note that concurrent requests are not considered in Table 4.2. We offer the following explanations for Table 4.2:

Learning latency (no failures): When no server has failed, the commit latency of R at the coordinator p is the larger of Lsc and Lk + 1. This is because even though it only takes Lsc steps for the coordinator p to learn the value it suggested, p cannot commit the value until it knows that the other servers have skipped their turns. This takes Lk + 1 steps, since the other servers start skipping immediately after they receive the suggestion from p, and it takes an additional Lk steps for p to learn the skips. Assuming A is skip-fast, we know Lk < Lsc, and so max{Lsc, Lk + 1} = Lsc ≤ Lc. A similar result can be obtained for the followers, as shown in Table 4.2.

Learning latency (f servers failed): Assume f servers have failed and all the other servers know which ones have failed. When a correct server suggests a value, it also needs to revoke the faulty ones and wait for the result of the revocation before committing its request. So, the commit latency at the coordinator is max{Lsc, Lk + 1, Lr}. We already know that Lk < Lsc < Lr and Lr > Lc, so max{Lsc, Lk + 1, Lr} = Lr > Lc. Similar results can be obtained for the followers, as shown in Table 4.2.

Message complexity: When concurrent requests are not considered, n simple consensus instances are consumed for committing one request: one to choose the request, f to revoke the f known faulty servers, and n − f − 1 are skipped. So, the message complexity of R is Ns + (n − f − 1)Nk + fNr, which is considerably higher than N, considering that Nr > N. When there are no failures, i.e., f = 0, the message complexity is Ns + (n − 1)Nk, which is still higher than N, considering that usually Ns = N and Nk < N.

Quorum size: For protocol A, Qs and Qr, the quorum sizes of A for suggestion and revocation, are usually the same as that of protocol C, which is Q. Qk, the quorum size for skipping, is no larger than Q because skipping can be optimized in A, but the coordinator is a critical server in the case of skipping. And since a server in R needs to wait for all non-faulty servers to skip, it needs to wait for n − f servers (including itself) before committing a request.

We have not yet considered concurrent requests. When requests are proposed concurrently by servers, fewer simple consensus instances need to be skipped or revoked. So, concurrent requests can help cut down the message complexity of R. Assuming m requests are concurrently proposed for n consecutive simple consensus instances that are coordinated by n distinct servers, the amortized message complexity for committing one request is (mNs + (n − f − m)Nk + fNr)/m, which is still higher than N unless m = n and f = 0, i.e., there is no failure and all servers propose requests at the same time. The sketch below works through this formula for a few illustrative values. From the above analysis, we know concurrent requests reduce message complexity. However, concurrent requests also introduce contention, which can cause higher commit latency for R. We analyze this case in the next subsection.
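As a numeric illustration of the amortized formula, the small program below plugs in assumed costs (Ns = N = 3n − 3 as in Paxos, a cheap skip Nk = n − 1, and a revoke cost Nr = 10 chosen for illustration only):

    #include <cstdio>

    // Amortized message complexity per committed request, per the formula
    // above: (m*Ns + (n - f - m)*Nk + f*Nr) / m.
    double amortized(int m, int n, int f, double Ns, double Nk, double Nr) {
        return (m * Ns + (n - f - m) * Nk + f * Nr) / m;
    }

    int main() {
        // Assumed illustrative costs for n = 3: Ns = N = 3n-3 = 6,
        // a cheap skip Nk = n-1 = 2, and an expensive revoke Nr = 10.
        double Ns = 6, Nk = 2, Nr = 10, N = 6;
        std::printf("m=1, f=0: %.0f (vs N = %.0f)\n", amortized(1, 3, 0, Ns, Nk, Nr), N); // 10
        std::printf("m=3, f=0: %.0f (vs N = %.0f)\n", amortized(3, 3, 0, Ns, Nk, Nr), N); // 6
        std::printf("m=2, f=1: %.0f (vs N = %.0f)\n", amortized(2, 3, 1, Ns, Nk, Nr), N); // 11
        return 0;
    }

Only the fully concurrent, failure-free case (m = n, f = 0) matches the complexity N of the underlying consensus protocol.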

4.4.1 Delayed commit

While any server of R can commit its request in just Lsc steps when there is no contention, commits may have to be delayed for additional steps when there are concurrent suggestions.

For example, in the scenario illustrated in Figure 4.2, server p0 suggests x to instance 0 concurrently with p1 suggesting y to instance 1.

Figure 4.2: Delayed commit. (Timeline omitted; legend: A: p1 suggests y to instance 1. B: p0 suggests x to instance 0. C: first round trip since point A. D: p1 learns y but not x. E: p1 learns both x and y.)

Lsc steps after proposing y, p1 learns that y has been chosen for instance 1, but cannot commit y yet because instance 0 is still in progress and a value is yet to be learned for instance 0. In this case, p1 cannot commit y until Lsf steps after p0 suggested x. Then p1 can commit both x and y at once. We say that y experiences a delayed commit at p1. The extra delay introduced by delayed commit is bounded by Theorem 4.11. We also discuss a simple flexible commit mechanism in section 4.5.2 that mitigates the effect of delayed commit.

Theorem 4.11: Assume (1) the coordinator and the followers learn the suggestion from the coordinator in Lsc and Lsf steps, respectively; (2) the servers skip unused turns immediately after receiving the suggestion from the coordinator; and (3) FIFO communication channels. Let Ld be the extra delay caused by delayed commit; then Lsf − Lsc − 1 < Ld < Lsf − Lsc + 1.

Proof. Without loss of generality, we use the example in Figure 4.2, in which p0 and p1 suggest x and y to instances 0 and 1 respectively.

For delayed commit to occur at p1, p1 must receive the suggestion of x between points A and C, i.e., after p1 suggests y and before p1 receives the first round trip from p0 following its suggestion. The suggestion of x must be received after p1 suggests y, because otherwise p1 would have skipped instance 1 and no delayed commit would happen. Now suppose the suggestion of x were received after the first round trip, i.e., after point C. Because channels are FIFO, p0 must then have suggested x (point B) after receiving the suggestion of y from p1. This would have made p0 skip instance 0, avoiding the delayed commit. So, the suggestion must be received before point C.

When delayed commit occurs, Ly, the total delay of y, is measured from point A to point E. From the above argument we know that Lsf − 1 < Ly < Lsf + 1: when the suggestion of x is received right after point A at p1, Ly approaches its minimum of Lsf − 1 steps; when the suggestion of x is received right before point C at p1, Ly approaches its maximum of Lsf + 1 steps. Since it takes Lsc steps for y to be learned at p1, we have Lsf − Lsc − 1 < Ld < Lsf − Lsc + 1.

4.5 Optimizations for R

From the performance analysis in the previous section, it is straightforward to see that protocol R, without optimizations, offers little advantage over C. The only case in which R is better than C is when there are no concurrent requests: R then allows all servers to commit their requests in Lc steps, which with C is possible only at the leader. However, concurrent requests can make this advantage go away, since delayed commit adds extra latency to the baseline latency. Other drawbacks of R include its high message complexity, the increased latency due to revocation when one or more servers have failed, and the need to receive skips from all correct servers to make progress. It is clear that optimizations are necessary for any protocol based on R to be practical. In this section, we discuss optimizations that are generic to protocol R, i.e., those that do not involve the specific details of the simple consensus protocol A used. These include issuing revocations in large blocks and the use of out-of-order commit, a flexible commit scheme that is unique to rotating leader based protocols because of the properties of simple consensus. There are also other forms of optimization that can be applied to rotating leader based protocols; however, they do involve the details of protocol A, and so we discuss them later when we introduce the specific protocols for solving replicated state machines in specific environments.

4.5.1 Revocation in large blocks

Revocation was introduced in protocol R to ensure liveness in the presence of faulty servers. Rule RL.3 allows a server p to revoke instances coordinated by faulty servers prior to Ip, the index of server p. This introduces a performance bottleneck when one or more servers have failed: revocation, which has high communication delay and large message complexity, is executed regularly on the normal execution path. A simple idea is to revoke all of q's future turns, which irreversibly chooses no-op for all of them. However, q may need to suggest values in the future, either because q was falsely suspected or because it recovers. A better way to reduce the cost of revocation is to revoke faulty servers in advance of p's index and in large blocks, as described in Optimization RL.1.

Optimization RL.1: Let q be a server that another server p suspects has failed, and let Cq be the smallest instance that is coordinated by q and not learned by p. For some constant β, p revokes q for all instances in the range [Cq, Ip + 2β] that q coordinates if Cq < Ip + β.

Optimization RL.1 allows p to revoke q at least β instances in advance of any instance i greater than Cq to which p suggests a value. The rationale behind Optimization RL.1 is that a faulty server will have to be revoked in the future anyway; revoking it in advance removes revocation from the execution path of normal proposing activity. By tuning β, we can ensure that by the time p learns the outcome of instance i, all instances prior to i that q coordinates are revoked and learned. Thus, p can commit instance i without further delay. Since Optimization RL.1 also requires revocations to be issued in large blocks, the amortized message cost is reduced as well, as the sketch below illustrates.
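A minimal sketch of the advance-revocation window follows; the helper names are hypothetical, not part of the protocol specification:

    #include <cstdint>
    #include <utility>

    // Sketch of Optimization RL.1's advance-revocation window.
    struct Revoker {
        uint64_t beta;  // revocation look-ahead constant

        // Cq: smallest instance coordinated by the suspected server q that
        // p has not learned; Ip: p's current index. Revoke once q's
        // unlearned turns come within beta instances of p's index.
        bool shouldRevoke(uint64_t Cq, uint64_t Ip) const {
            return Cq < Ip + beta;
        }

        // Revoke all of q's turns in [Cq, Ip + 2*beta]; revoking out to
        // Ip + 2*beta leaves at least beta instances of slack, keeping
        // revocation off the normal suggestion path.
        std::pair<uint64_t, uint64_t> revocationRange(uint64_t Cq, uint64_t Ip) const {
            return {Cq, Ip + 2 * beta};
        }
    };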

Lemma 4.12: Assuming ♦P, protocol R (RL.1 – RL.4) with Optimization RL.1 implements replicated state machines correctly.

Proof. Note that Optimization RL.1 can only exclude the actions of a falsely suspected server for a bounded number of instances. It clearly does not violate any of the safety properties. Assuming ♦P means that such false suspicions will eventually cease. So, using Optimization RL.1 does not affect the liveness of the protocol.

In chapter 5, we use Lemma 4.12 to prove the correctness of protocol Mencius. Similarly, Lemma 4.13 can be proved for the forwarding-based protocol; we use Lemma 4.13 to prove the correctness of protocol RAM in chapter 6.

Lemma 4.13: Assuming Ω, protocol R (RL.1 – RL.5) with Optimization RL.1 implements replicated state machines correctly.

Proof. Note that Optimization RL.1 can only exclude the actions of a falsely suspected server for a bounded number of instances. It clearly does not violate any of the safety properties. Assuming Ω means that such false suspicions eventually cease for the elected leader. Since Optimization RL.1 only extends the false suspicion period of the leader for a bounded period, the resulting failure detector still satisfies the properties of Ω. So, using Optimization RL.1 does not affect the liveness of the protocol.

4.5.2 Out-of-order commit

In the example shown in Figure 4.2, we can mitigate the effects of delayed commit with a simple and more flexible commit mechanism that allows x and y to be executed in any order when they are commutable, i.e., when executing x followed by y produces the same system state as executing y followed by x.

Theorem 4.14: When a request y, suggested to instance i and originating from server p1, experiences a delayed commit because of a request x suggested to instance j (j < i) and originating from server p0, p1 can commit y immediately if (1) x and y are commutable; and (2) p1 has confirmed that p0 has suggested value x to instance j.

Proof. Once p1 has confirmed that x was suggested by p0 to instance j, the outcome of instance j is narrowed down to two values: x or no-op. This is true because p0, the coordinator of instance j, has proposed x, and all the other servers, i.e., the followers, can only propose no-op by the definition of simple consensus. If x and y are commutable, y will be commutable with the outcome of instance j regardless of which value is eventually chosen, because no-op commutes with any request. So, y can be committed at p1 as soon as p1 has confirmed that x was suggested by p0.

The out-of-order commit mechanism is unique to rotating leader protocols and cannot easily be applied to generic consensus based protocols, because out-of-order commit takes advantage of the definition of simple consensus, which restricts the values that can be proposed by the followers; this restriction is the key that lets R guarantee safety while allowing out-of-order commit. Note that Theorem 4.14 only discusses delayed commit between two instances of simple consensus. However, it is possible for a request to have multiple dependencies caused by delayed commit. In this case, the request cannot be committed until all the requests it depends on are committed or all uncommitted ones are confirmed to be commutable with it. Finally, whether two requests are commutable is application dependent. For out-of-order commit to work, this means that, in addition to the propose and commit interfaces used by the replicated state machine, another interface is needed to ask the application whether requests commute; a sketch of such a predicate follows. See section 5.5.5 and section 6.8.3 for evaluations of the effectiveness of out-of-order commit.
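As an illustration, a commutability predicate for a read/write register service (the kind of application evaluated in chapter 5) might look as follows; this is an example predicate, not the interface of any particular implementation:

    #include <cstdint>

    // Example commutability predicate for a read/write register service.
    enum class Op { Read, Write };

    struct Request {
        Op op;
        uint16_t reg;  // register name
    };

    // Two requests commute if they touch different registers or are both
    // reads; no-op (handled separately) commutes with every request.
    bool commutes(const Request& a, const Request& b) {
        return a.reg != b.reg || (a.op == Op::Read && b.op == Op::Read);
    }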

4.6 Summary

In this chapter, we introduced the simple consensus problem, which we proved to be equivalent to consensus in terms of solvability. We then explained R, a generic rotating-leader replicated state machine protocol that runs an unbounded sequence of instances of simple consensus instead of consensus. This makes it easy for R to satisfy the Agreement, Integrity and Total Order properties of replicated state machines. The Validity property is ensured using a set of rules; depending on the failure detector assumed, four or five rules are used. In the next chapter, we instantiate and optimize R to obtain an efficient crash-failure replicated state machine protocol that we call Mencius.

Chapter 5

Crash Failure: Protocol Mencius

In this chapter, we study efficient replicated state machine protocols for multi-site {W×W} systems under the crash failure model. Our goal is to build a crash failure protocol that has both high throughput under high client load and low latency under low client load in the face of changing wide-area network environments and client load.

Existing protocols such as Paxos, Fast Paxos, and CoReFP [25] are not, in general, the best consensus protocols for such wide-area applications. For example, Paxos relies on a single leader to choose the request sequence. Due to its simplicity it has high throughput, and requests generated by clients in the same site as the leader enjoy low latency, but clients in other sites have higher latency. In addition, the leader in Paxos is a bottleneck that limits throughput. Having a single leader also leads to an unbalanced communication pattern that limits the utilization of the bandwidth available on all of the network links connecting the servers. Fast Paxos and CoReFP, on the other hand, do not rely on a single leader. They have low latency under low load, but lower throughput under high load due to their higher message complexity.

This chapter presents Mencius¹, a rotating-leader replicated state machine protocol derived from Paxos. It is designed to achieve high throughput under high client load and low latency under low client load, and to adapt to changing network and client environments.

¹Mencius (Chinese: 孟子; pinyin: Mèngzǐ) was one of the principal philosophers during the Warring States Period. During the fourth century BC, Mencius worked on reform among the rulers of the area that is now China.


The basic approach of Mencius is to partition the sequence of consensus protocol instances among the servers, much as protocol R does. Indeed, Mencius is an optimized version of R that uses Coordinated Paxos, a variant of Paxos, for solving simple consensus. Applying the rotating leader scheme of R amortizes the load of being a leader. Doing so increases throughput when the system is CPU-bound. When the network is the bottleneck, a rotating leader scheme is capable of fully utilizing the available bandwidth to increase throughput. It also reduces latency, because clients can use a local server as the leader for their requests, and because of the design of R, a client will typically not have to wait for its server to get its turn.

The idea of partitioning sequence numbers among multiple leaders is not original: indeed, it is at the core of a recent patent [40], for the purpose of amortizing server load. To the best of our knowledge, however, Mencius is novel: not only are sequence numbers partitioned, but key performance problems such as adapting to changing client load and to asymmetric network bandwidth are addressed. Simple consensus plays a key role in the design of Mencius in achieving these performance improvements. First, simple consensus allows servers with low client load to skip their turns without having to have a majority of the servers agree on it first. Second, by opportunistically piggybacking SKIP messages on other messages, Mencius allows servers to skip turns with little or no communication and computation overhead. This allows Mencius to adapt inexpensively to variance in client and network load.

The remainder of the chapter is organized as follows. Section 5.1 examines the performance problems in a crash-failure multi-site environment. Section 5.2 discusses the drawbacks of existing protocols in multi-site environments. Section 5.3 refines Paxos into Mencius. Section 5.4 discusses practical implementation considerations for Mencius. Section 5.5 evaluates Mencius, Section 5.6 summarizes related work, and Section 5.7 discusses future work and open issues. Section 5.8 summarizes this chapter.

5.1 Performance issues

In this section we briefly review the performance issues and system bottlenecks in a multi-site system, to better understand how to design an efficient crash failure replicated state machine protocol. In a multi-site system, the clients access the service by issuing requests to the servers. From the clients' point of view, the most important performance metric is request latency, i.e., the delay between when a request is issued and when a reply is received. In addition to low latency, high throughput is also desirable, because request latency can increase dramatically in a low-throughput system under high load: requests that cannot be processed in time build up queues at the servers and end up spending more time in queues than being processed [28]. So, our goal is to design a protocol with both high throughput and low latency. For throughput, there are two possible bottlenecks in these systems, depending upon the average request size:

Wide-area channels: When the average request size is large enough, channels saturate before the servers reach their CPU limit. Therefore, the throughput is determined by how efficiently the protocol is able to propagate requests from their originators to the remaining sites. In this case, we say the system is network-bound.

Server processing power: When the average request size is small enough, the servers reach their CPU limit first. Therefore, the throughput is determined by the processing efficiency at the bottleneck server. In this case, we say the system is CPU-bound.

As a rule of thumb, lower message complexity leads to higher throughput, because more network bandwidth is available for sending actual state machine commands and fewer messages per request are processed. Limiting the use of message broadcasting helps to reduce the message complexity. To achieve low latency, it is important to design the protocol to take advantage of the multi-site system architecture, i.e., the fact that communication within a site has much higher throughput and lower latency than communication across sites. It is therefore important to have short chains of wide-area communication steps for the servers to learn the consensus outcome. However, the number of communication steps is not the only factor that impacts latency: high variance in message delivery in wide-area networks is also a major contributor (see section 3.3).

Finally, besides high throughput and low latency, it is also preferable to design a protocol that adapts well to the network environment. For inter-site communication, for example, link latency has high variance and link bandwidth may also change from time to time; an adaptive protocol works well over a wide range of such network conditions and adjusts itself when conditions change.

5.2 Why not existing protocols?

Over the years, a large number of crash failure protocols have been proposed. Most of them, however, were primarily designed for local area networks. Not surprisingly, even the best existing protocols, such as Paxos, Fast Paxos and CoReFP, have their respective drawbacks in multi-site systems. In section 3.3, we showed that a protocol designed for {L×L} settings may not work as well in {W×L} settings; more specifically, we compared Paxos and Fast Paxos. Table 5.1 extends this comparison to multi-site {W×W} settings.

Table 5.1: A comparison of Paxos and Fast Paxos in multi-site systems

                            Paxos                  Fast Paxos

  WAN learning latency      Leader: 2              2 (when no collision)
                            Follower: 4
  Collision                 No                     Yes
  Number of replicas        2f + 1                 3f + 1
  WAN message complexity    Leader: 3n − 3         n² − 1
                            Follower: 3n − 2
  Load balancing            No                     Yes

Fast Paxos has advantages in two categories. (1) When collisions and contention do not occur, client requests at all sites have the minimal latency of two wide-area delays with Fast Paxos. With Paxos, a two-step delay is only possible at the leader site, while all other sites have a latency of four wide-area delays. (2) In Fast Paxos, all servers act the same, and so the protocol is more load-balanced than the leader-based Paxos. These two advantages, however, likely translate into little real-life performance advantage for Fast Paxos because (1) concurrent requests can collide, which not only nullifies the reduced latency of Fast Paxos but also adds the extra latency required for recovery; and (2) though more load-balanced, the higher message complexity of Fast Paxos results in lower throughput when the system is CPU-bound: each Fast Paxos server may end up with a higher load than the leader has with Paxos. Finally, Fast Paxos also requires more replicas than Paxos to tolerate the same number of failures. This not only increases the initial hardware cost but also increases the subsequent maintenance cost: more servers means more failures.

CoReFP [25] tries to deal with the collision problem of Fast Paxos by running Paxos and Fast Paxos concurrently. When there is no contention, it has the minimal two-step latency at all servers. When concurrent requests result in contention, one more step is necessary to reach consensus. However, CoReFP has a larger message complexity than both Paxos and Fast Paxos. As a result, it has lower throughput and pays a significant latency penalty when system load is high.

Though the choice is not clear-cut, Paxos is in general a better candidate for multi-site systems than Fast Paxos and CoReFP because of its simplicity and lower wide-area message complexity. Paxos, however, is still not ideal for multi-site systems due to its single-leader nature. We examine this next.

Figure 5.1: The communication pattern in Paxos. (Legend: A: ACCEPT; R: REQUEST; P: PROPOSE; L: Leader; F: Follower.)

Unbalanced communication pattern: With Paxos, the leader generates and consumes more traffic than the other servers. Figure 5.1 shows that there is network traffic from replicas to the leader, but no traffic between non-leader replicas. Thus, in a system where sites are pairwise connected, Paxos uses only the channels incident

upon the leader, which reduces its ability to sustain high throughput. In addition, during periods of synchrony, only the REQUEST and PROPOSE messages in Paxos carry significant payload. When the system is network-bound, the volume of these two messages determines the system throughput. In Paxos, a REQUEST is sent from the originator to the leader and a PROPOSE is broadcast by the leader. Under high load, the outgoing bandwidth of the leader is the bottleneck, whereas the channels between the non-leaders are idle. In contrast, Mencius uses a rotating leader scheme, which both eliminates the need to send REQUEST messages across the wide-area network and gives a more balanced communication pattern that better utilizes the available bandwidth.

Computational bottleneck at the leader: The leader in Paxos is a potential bottleneck because it processes more messages than the other replicas. When CPU-bound, a system running Paxos reaches its peak capacity when the leader is at full CPU utilization. As the leader requires more processing power than the other servers, the CPU utilization of the non-leader servers does not reach its maximum, thus underutilizing the overall processing capacity of the system. The number of messages the leader processes for every request grows linearly with the number of servers n, but it remains constant for the other replicas. This seriously impacts the scalability of Paxos as n grows. By rotating the leader, no single server is a potential bottleneck when the workload is evenly distributed across the sites of the system.

Higher learning latency for non-leader servers: While the leader always learns and commits any value it proposes in two communication steps, any other server needs two more communication steps to learn and commit a value it proposes, due to the REQUEST and LEARN messages. With a rotating leader scheme, any server can propose values as a leader. By skipping turns opportunistically when a server has no value to propose, one can achieve the optimal commit delay of two communication steps for any server when there are no concurrent proposals [34]. Concurrent proposals can result in additional commit delay, but such delays do not always occur; when they do, one can take advantage of commutable operations by having servers execute commands in different but equivalent orders using out-of-order commit.

5.3 Deriving Mencius

Mencius is a rotating-leader protocol designed for crash-failure multi-site systems. In this section, we first explain our assumptions and design decisions. We then introduce Coordinated Paxos, an efficient variant of Paxos for solving simple consensus, and use it to instantiate the generic rotating-leader protocol R. Finally, we optimize the instantiated protocol; this last protocol is the one that we call Mencius. This development of Mencius has two benefits. First, by deriving Mencius from Paxos, Coordinated Paxos, and a set of rules, optimizations, and accelerators, it is relatively easy to see that Mencius is correct. Second, one can continue to refine Mencius, or even derive a new version of Mencius, to adapt it to a particular environment.

5.3.1 Assumptions

In addition to the crash failure multi-site system model, we make the following assumptions about the system. Note that it is possible to build a variant of Mencius that does not rely on these assumptions; however, they make the derivation and analysis of the protocol cleaner.

Recoverable crash-failure processes: Like Paxos, Mencius assumes that servers fail by crashing and can later recover. Servers have access to stable storage, which they use to recover their state after a failure. Stable storage simplifies recovery from failures that last only a short period of time; more sophisticated methods need to be considered for recovering from longer failures (see section 5.4.2).

Unreliable failure detector ♦P: We have already explained that R requires ♦P for liveness. So does Mencius, a protocol derived from R. Though it is possible for Mencius to use the weaker Ω by using the forwarding mechanism described in the previous chapter, we do not consider it here.

Asynchronous FIFO communication channels: Since we use TCP as the underlying transport protocol, we assume FIFO channels and that messages between two correct servers are eventually delivered. This is a strictly stronger assumption than the one made by Paxos. Had we instead decided to use UDP, we would have had to implement our own message retransmission and flow control at the application layer. Assuming FIFO channels enables the optimizations discussed in section 5.3.4. These optimizations, however, are applicable only if both parties of a channel are available and a TCP connection is established. When servers fail and recover after long periods, implementing FIFO channels is impractical, as it may require buffering a large number of messages. Mencius therefore uses a separate recovery mechanism that does not depend on FIFO channels (see section 5.4.2).

5.3.2 Coordinated Paxos

Figure 5.2: The message flow of suggest, skip and revoke in Coordinated Paxos.

It is easy to see that Paxos, a consensus protocol, implements simple consensus as a direct result of Lemma 4.1.

Lemma 5.1: Paxos implements simple consensus.

We use, however, an efficient variant of Paxos to implement simple consensus. We call it Coordinated Paxos (see Appendix A for the protocol in pseudo code). In each instance of Coordinated Paxos, all servers agree that the coordinator is the default leader, and start from the state in which the coordinator had run Phase 1 for some initial round r0. Such a state consists of a promise not to accept any value for any round smaller than r0. A server can subsequently initiate the following actions, as shown in Figure 5.2:

Suggest: The coordinator suggests a request v by sending PROPOSE messages with payload v in round r0 (Instance 0 in Figure 5.2). We call these PROPOSE messages SUGGEST messages.

Skip: The coordinator skips its turn by sending PROPOSE messages that propose no-op in round r0 (Instance 1 in Figure 5.2). We call these PROPOSE messages SKIP messages. Note that because all other servers can only propose no-op, when the coordinator proposes no-op, any server learns that no-op has been chosen as soon as it receives a SKIP message from the coordinator.

Revoke: When it suspects that the coordinator has failed, some server will eventually arise as the new leader and revoke the right of the coordinator to propose a value. The new leader does so by trying to finish the simple consensus instance on behalf of the coordinator (Instance 2 in Figure 5.2). Just as a new Paxos leader would, it starts Phase 1 for some round r′ greater than the current round r. If Phase 1 indicates that no value may have been chosen, the new leader proposes no-op in Phase 2; otherwise, it proposes the possible consensus outcome indicated by Phase 1.
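The following sketch shows how a participant in a single Coordinated Paxos instance might react to the three actions. It is a minimal illustration under assumed message and type names, not the pseudo code of Appendix A:

    #include <cstdint>
    #include <optional>

    // Sketch of a participant's reaction to the three actions in one
    // Coordinated Paxos instance; illustrative names only.
    struct Value { bool isNoop; /* request payload elided */ };

    struct Instance {
        uint64_t coordinator;         // the pre-agreed default leader
        uint64_t promisedRound = 0;   // r0: Phase 1 implicitly completed
        std::optional<Value> learned;

        // SUGGEST: the coordinator's PROPOSE of a client request in r0.
        void onSuggest(const Value& v) {
            sendAccept(v);            // Phase 2b reply to the coordinator
        }

        // SKIP: the coordinator's PROPOSE of no-op in r0. Because the
        // followers may only propose no-op, one SKIP suffices to learn.
        void onSkip() {
            learned = Value{true};
        }

        // REVOKE: a new leader runs Phase 1 for a round r > promisedRound;
        // if Phase 1 reports no possibly-chosen value it proposes no-op,
        // otherwise it proposes the value Phase 1 reported.
        void onRevoke(uint64_t r, const std::optional<Value>& phase1Value) {
            if (r <= promisedRound) return;   // stale round: ignore
            promisedRound = r;
            sendPropose(r, phase1Value.value_or(Value{true}));
        }

        void sendAccept(const Value&) { /* network send elided */ }
        void sendPropose(uint64_t, const Value&) { /* network send elided */ }
    };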

As we described in chapter 4, the actions suggest, skip and revoke specialize mechanisms that already exist in Paxos. Making them explicit, however, enables more efficient implementations in wide-area networks. Also, it is straightforward to see that SUGGEST and SKIP are specialized PROPOSE messages that propose a client request and no-op for round r0, respectively.

Theorem 5.2: Assuming a failure detector at least as strong as Ω, Coordinated Paxos implements simple consensus.

Proof. By Lemma 5.1, we know Paxos implements simple consensus. Coordinated Paxos differs from Paxos in the following ways: (1) Coordinated Paxos starts from a specific (and safe) state; (2) a server learns no-op upon receiving a SKIP message from the coordinator, and can act accordingly; and (3) Coordinated Paxos assumes a failure detector at least as strong as that assumed by Paxos. None of the three affects Paxos's safety or liveness in terms of implementing simple consensus; therefore, Coordinated Paxos implements simple consensus.

5.3.3 P: A simple state machine

We now construct an intermediate protocol P by using Coordinated Paxos to instantiate protocol R. The pseudo code of P can be found in Appendix B. Essentially, P follows the same set of rules that R does (Rules RL.1 – RL.4). For completeness, we restate these rules in the context of Coordinated Paxos.

Rule M.1: Each server p maintains its next simple consensus sequence number Ip. We call Ip the index of server p. Upon receiving a request from a client, a server p suggests the request to the simple consensus instance Ip and updates Ip to the next instance it will coordinate.

Rule M.2: If server p receives a SUGGEST message for instance i and i > Ip, then before accepting the value and sending back an ACCEPT message, p updates Ip such that its new index is I′p = min{k : p coordinates instance k ∧ k > i}. p also executes skip actions for each of the instances in the range [Ip, I′p) that p coordinates.

Rule M.3: Let q be a server that another server p suspects has failed, and let Cq be the smallest instance that is coordinated by q and not learned by p. p revokes q for all instances in the range [Cq,Ip] that q coordinates.

Rule M.4: If server p suggests a value v ≠ no-op to instance i, and p learns that no-op is chosen, then p suggests v again.

Corollary 5.3: Assuming ♦P, P implements replicated state machines.

Proof. P is an instance of R obtained by instantiating A with Coordinated Paxos. So, P implements replicated state machines, according to Theorem 4.9.

5.3.4 Optimizations

Protocol P is correct but not necessarily efficient. It always achieves the minimal two communication steps for the proposing server to learn the consensus value, but its message complexity varies depending on the rates at which the servers suggest values. In the worst case, when there are no failures and only one server is proposing requests, the message complexity of P is (n + 2)(n − 1): 3n − 3 for suggesting the value and (n − 1)² for the rest of the servers to skip. This is considerably higher than that of Paxos under the same conditions (3n − 3). In this section, we optimize P to obtain Mencius. Consider the case where server p receives a SUGGEST message for instance i from server q. As a result, p skips all of its unused instances smaller than i (Rule M.2).

Let the first instance that p skips be i1 and the last instance p skips be i2. Since p needs to acknowledge the SUGGEST message of q with an ACCEPT message, p can piggyback the SKIP messages on the ACCEPT message. Since channels are FIFO, by the time q receives this ACCEPT message, q has received all the SUGGEST messages p sent to q before sending the ACCEPT message. This means that p does not need to include i1 in the ACCEPT message: i1 is simply the first instance coordinated by p that q does not know about. Similarly, i2 does not need to be included, because i2 is the largest instance smaller than i that p coordinates. Since both i and p are already included in the ACCEPT message, no additional information is needed: all we have to do is augment the semantics of the ACCEPT message. In addition to acknowledging the value suggested by q, this message now implies a promise from p that it will not suggest any client requests to any instance smaller than i in the future. This gives us the first optimization:

Optimization M.1: p does not send a separate SKIP message to q. Instead, p uses the ACCEPT message that replies to the SUGGEST message to promise not to suggest any client requests to instances smaller than i in the future.

An alternative way to implement Optimization M.1 when FIFO channels are not available is to include i1 as an additional field in the ACCEPT message. This alternative applies to Optimization M.2 (shown below) as well.

Lemma 5.4: Protocol P with Optimization M.1 implements replicated state machines correctly.

Proof. Optimization M.1 combines multiple messages that P would have sent separately into one message. Doing this clearly does not violate any of the safety properties. It does not affect the liveness properties either, since the combined message is sent with no additional delay. Therefore, Protocol P with Optimization M.1 is correct.

We can also apply the same technique to the SKIP messages from p to other servers. Instead of using ACCEPT messages, we piggyback the SKIP messages on future SUGGEST messages from p to another server r:

Figure 5.3: A liveness problem can occur when using Optimization M.2 without Accelerator M.1. (Timeline omitted; in the example, replica 2 suggests x to instance 2 and y to instance 5; the figure distinguishes instances with no value learned, no-op learned, x learned and committed, and x learned but not committed.)

Optimization M.2: p does not immediately send a SKIP message to r. Instead, p lets a future SUGGEST message from p to r indicate that p has promised not to suggest any client requests to instances smaller than i.

An alternative way to propagate the SKIP messages from p to the other servers is to have p piggyback them on instance i's Phase 3 LEARN messages, which are broadcast by p. However, this is less scalable than Optimization M.2, because the number of additional fields needed grows linearly with the number of servers n. We do not discuss this alternative further in this dissertation. Note that Optimization M.2 can potentially defer the propagation of SKIP messages from p to r for an unbounded period of time. Figure 5.3 shows one such example.

Consider three servers p0, p1 and p2. Only p0 suggests values, for instances 0, 3, 6, and so on. p0 always learns the result of all instances by means of the ACCEPT messages from p1 and p2. Server p1, however, learns all values that p0 proposes, and it knows which instances it is skipping, but it does not learn that p2 skips, such as for instance 2 in this example. This leaves gaps in p1's view of the consensus sequence and prevents p1 from committing the values learned in instances 3, 6, and so on. Similarly, p2 does not learn that p1 is skipping, which prevents p2 from committing the values learned in instances 3, 6, and so on. This problem only occurs between two idle servers p1 and p2: any value suggested by either server will propagate the SKIP messages in both directions and hence fill in the gaps. Fortunately, while idle, neither p1 nor p2 is responsible for generating replies to the clients. This means that, from a client's perspective, its requests are still being processed in a timely manner, even if p1 and p2 are stalled. We use a simple accelerator rule to limit the number of outstanding SKIP messages before p1 and p2 start to catch up:

Accelerator M.1: A server p propagates SKIP messages to r if the total number of outstanding SKIP messages to r is larger than some constant α, or if the messages have been deferred for more than some time τ.
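A minimal sketch of Accelerator M.1's batching logic follows; the structure and names are illustrative, not those of the Mencius prototype:

    #include <chrono>
    #include <cstdint>

    // Sketch of Accelerator M.1: defer SKIPs to a peer until alpha are
    // outstanding or the oldest has waited tau.
    struct SkipBatcher {
        using Clock = std::chrono::steady_clock;
        uint64_t alpha;                    // e.g., 20 messages
        std::chrono::milliseconds tau;     // e.g., 50 ms
        uint64_t outstanding = 0;
        Clock::time_point oldest{};

        void deferSkip() {
            if (outstanding++ == 0) oldest = Clock::now();
            maybeFlush();
        }

        // Also invoked from a periodic timer so tau is honored while idle.
        void maybeFlush() {
            if (outstanding == 0) return;
            if (outstanding > alpha || Clock::now() - oldest >= tau) {
                sendCombinedSkip();        // one SKIP covering all deferred turns
                outstanding = 0;
            }
        }

        void sendCombinedSkip() { /* network send elided */ }
    };

With α = 20 and τ = 50 ms (the values used in section 5.5), at most 20 combined SKIP messages per second are generated per idle peer.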

Lemma 5.5: Protocol P with Optimization M.2 and Accelerator M.1 implements replicated state machines correctly.

Proof. Optimization M.2 combines multiple messages that P would have sent separately into one message. Doing this clearly does not violate any of the safety properties. Optimization M.2 alone, however, could affect the liveness properties, since it can potentially delay the propagation of SKIP messages for an unbounded amount of time. Because Accelerator M.1 bounds this delay, and P only relies on the eventual delivery of messages for liveness, adding Optimization M.2 and Accelerator M.1 to protocol P does not affect liveness, and so P still implements replicated state machines.

Given that the number of extra SKIP messages generated by Accelerator M.1 is negligible over the long run, the amortized wide-area message complexity of Mencius is 3n − 3 ((n − 1) SUGGEST, ACCEPT and LEARN messages each), the same as Paxos when REQUEST messages are not counted. We can also reduce the extra cost generated by the revocation mechanism. If server q crashes, revocations need to be issued for every simple consensus instance that q coordinates. Doing so naively increases both latency and message complexity due to the use of the full three phases of Paxos. We have already explained how to reduce the cost of revocation using Optimization RL.1 for R. For completeness, we restate it here as Optimization M.3.

Optimization M.3: Let q be a server that another server p suspects has failed, and let Cq be the smallest instance that is coordinated by q and not learned by p. For some constant β, p revokes q for all instances in the range [Cq, Ip + 2β] that q coordinates if Cq < Ip + β.

Corollary 5.6: Protocol P with Optimization M.3 implements replicated state machines correctly.

Proof. From Lemma 4.12.

Optimization M.3 addresses the common case in which there are no false suspicions. When a false suspicion does occur, it may result in poor performance for the falsely suspected servers. We consider the poor performance in this case acceptable because we assume false suspicions occur rarely in practice and the cost of recovering from a false suspicion is small (see section 5.4.5). Mencius is P combined with Optimizations M.1 – M.3 and Accelerator M.1. Appendix C has the pseudo code for Mencius.

Theorem 5.7: Mencius implements replicated state machines correctly.

Proof. From Lemmas 5.4 and 5.5 and Corollary 5.6.

Mencius, being derived from Paxos, has the same quorum size of f + 1. This means that up to f servers can fail among a set of 2f + 1 servers. Paxos incurs temporarily reduced performance when the leader fails. Since all servers in Mencius act as a leader for an unbounded number of instances, Mencius incurs this reduced performance when any server fails. Thus, Mencius has higher performance than Paxos in the failure-free case at the cost of potentially higher latency upon failures. Note that the latency upon failures also depends on other factors, such as the stability of the communication network. Figure 5.4 compares the communication patterns of Paxos and Mencius. The links between Paxos followers are left idle, whereas they are utilized in Mencius. By distributing the messages that carry significant payload over all the available links, Mencius is able to achieve better throughput under network-bound workloads (see section 5.5.2).

Figure 5.4: A comparison of the communication pattern in Paxos and Mencius. (Legend: A: ACCEPT; R: REQUEST; P: PROPOSE; L: Leader; F: Follower; S: Server.)

5.4 Implementation considerations

In the previous section, we derived Mencius from Paxos and proved its correctness. In this section we address further issues towards a practical implementation of Mencius.

5.4.1 Prototype implementation

We implemented Mencius using C++ and used TCP as the transport protocol. The same code base was also used to implement Paxos for the performance evaluation in section 5.5. Here are some of the implementation details:

Single-threaded implementation: An asynchronous, single-threaded architecture was used. This allows a more meaningful comparison of the peak CPU-bound throughput of Paxos and Mencius on multi-core machines, since both protocols consume a maximum of 100% of one CPU at peak. A multi-threaded implementation would instead consume a maximum of 100% CPU at its bottleneck thread, and so the total CPU consumed by the two protocols would differ.

Nagle’s algorithm: Nagle’s algorithm [49] is a technique in TCP for improving the efficiency of wide-area communication by batching small messages into larger ones. It does so by delaying sending small messages and waiting for data from 77

the application. In our implementation, we can instruct servers to dynamically turn on or turn off Nagle’s algorithm.

Temporarily broken TCP connections: We add an application-layer sequence number to Mencius's messages to handle temporarily broken TCP connections. FIFO channels are maintained by retransmitting missing messages upon reestablishing the TCP connection (see the sketches below).
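For illustration, dynamically toggling Nagle's algorithm on a connected socket uses the standard TCP_NODELAY option of the POSIX sockets API:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // Enable or disable Nagle's algorithm on a TCP socket.
    // TCP_NODELAY = 1 turns Nagle off; 0 turns it back on.
    bool setNagle(int fd, bool enabled) {
        int noDelay = enabled ? 0 : 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                          &noDelay, sizeof(noDelay)) == 0;
    }

And a minimal sketch, under assumed names, of how application-layer sequence numbers can restore the FIFO property across a reconnect; this illustrates the idea, not the prototype's code:

    #include <cstdint>
    #include <deque>
    #include <string>
    #include <utility>

    struct FifoChannel {
        uint64_t nextSend = 0;      // sequence number of the next outgoing message
        uint64_t nextExpected = 0;  // sequence number we expect from the peer
        std::deque<std::pair<uint64_t, std::string>> unacked;  // retransmit buffer

        void send(const std::string& msg) {
            unacked.emplace_back(nextSend, msg);
            transmit(nextSend++, msg);
        }

        // The peer reports the next sequence number it expects; everything
        // below that has been received in order and can be dropped.
        void onAck(uint64_t peerNextExpected) {
            while (!unacked.empty() && unacked.front().first < peerNextExpected)
                unacked.pop_front();
        }

        // On reconnect, resend everything the peer has not yet received,
        // in order, restoring the FIFO property of the channel.
        void onReconnect(uint64_t peerNextExpected) {
            onAck(peerNextExpected);
            for (const auto& [seq, msg] : unacked) transmit(seq, msg);
        }

        // Deliver in order; drop duplicates and out-of-order arrivals.
        void receive(uint64_t seq, const std::string& msg) {
            if (seq != nextExpected) return;
            ++nextExpected;
            process(msg);
        }

        void transmit(uint64_t, const std::string&) { /* socket write elided */ }
        void process(const std::string&)            { /* hand to protocol layer */ }
    };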

5.4.2 Recovery

We have already explained how Mencius recovers from broken TCP connections. The following outlines how a practical implementation of Mencius deals with process failures.

Short-term failures: Like Paxos, Mencius logs its state to stable storage and recovers from short-term failures by replaying the logs and learning recently chosen requests from other servers.

Long-term failures: It is impractical for a server to recover from a long period of down time by simply learning the missing sequence from other servers, since this would require correct servers to maintain an unboundedly long log. The best way to handle this, such as with checkpoints or state transfer [15, 44], is usually application specific.

5.4.3 Revocation

So far, we have assumed that every server can start the revocation process against a suspected server. Doing so, however, is an inefficient use of resources, and it also creates a liveness problem. For example, consider three servers p, q and r. Suppose r crashes and both p and q suspect the failure. If p and q concurrently attempt to revoke r (for example, p chooses round a1 and then q chooses round a2 > a1 before p completes), then no value will be chosen in round a1. This situation can repeat an unbounded number of times. This is the same liveness problem that occurs in Paxos, and it can be addressed in the same way: have a leader elected to revoke the (suspected of being faulty) server r. As in a state machine built with Paxos, care has to be taken in ensuring that all servers are up to date as to the point at which r failed. For example, suppose r learned the outcome of instance 2, but crashed after sending the LEARN message to p and not to q. When p revokes r, it will start from the next instance, say instance 5. When p prepares q to revoke r starting at instance 5, q will note that it does not know the outcome of instance 2. So, q can prompt p, and p can send a LEARN message that informs q as to the outcome of instance 2.

5.4.4 Commit delay and out-of-order commit

In Paxos, the leader serializes the requests from all the servers. For purposes of comparison, assume that Paxos is implemented, like Mencius, using FIFO channels. If the leader does not crash, then each server learns the requests in order, and can commit a request as soon as it learns it. The leader can commit a request as soon as it collects ACCEPT messages from a quorum of f + 1 servers; any other server incurs an additional round-trip delay due to the REQUEST and LEARN messages. While a Mencius server can commit its request in just one round-trip delay when there is no contention, as we explained in section 4.4.1, extra communication delay caused by delayed commit may occur when there are concurrent suggestions. Figure 5.5 shows one such example, in which y experiences a delayed commit: when y is first learned at p1, it cannot be committed until p1 learns x. According to Theorem 4.11, delayed commit can add up to two communication steps in Mencius, since it takes three steps for a follower in Mencius to learn the value suggested by the coordinator. We can reduce the upper bound on delayed commit to one communication step by broadcasting ACCEPT messages and eliminating LEARN messages. This reduction gives Mencius an optimal commit delay of three communication steps when there are concurrent proposals [34], at the cost of higher message complexity and thus lower throughput. Because delayed commit arises with concurrent suggestions, it becomes more of a problem as the number of suggestions grows. Note, however, that delayed commit impacts commit latency but not overall throughput: over a long period of time, the total number of requests committed is independent of delayed commits.

Figure 5.5: Delayed commit in Mencius.

In section 4.5.2, we discussed how to use out-of-order commit to mitigate the effects of delayed commit in protocol R by allowing commutable requests to be committed in different orders at different servers. The same technique applies to Mencius. To support out-of-order commit, a third API call, ISCOMMUTE(u,v), was implemented in Mencius to let Mencius ask the application whether two requests are commutable. This is in addition to the standard calls PROPOSE(v) and ONCOMMIT(v), used by the application to issue a request to Mencius and by Mencius to notify the application that a request is ready to commit. We implement out-of-order commit in Mencius by tracking the dependencies between requests and committing a request as soon as all the requests it depends on have been committed; a sketch follows. We evaluate its effectiveness in section 5.5.5.
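A minimal rendering of this three-call interface and the dependency check might look as follows; the C++ signatures are hypothetical and merely mirror PROPOSE, ONCOMMIT and ISCOMMUTE as described above:

    #include <string>
    #include <vector>

    // Hypothetical C++ rendering of the three calls described above.
    struct Application {
        virtual ~Application() = default;
        virtual void onCommit(const std::string& v) = 0;    // ONCOMMIT(v)
        virtual bool isCommute(const std::string& u,
                               const std::string& v) = 0;   // ISCOMMUTE(u,v)
    };

    struct ReplicatedStateMachine {
        virtual ~ReplicatedStateMachine() = default;
        virtual void propose(const std::string& v) = 0;     // PROPOSE(v)
    };

    // Commit a newly learned request out of order if every earlier,
    // still-uncommitted slot is confirmed commutable with it.
    bool tryCommitOutOfOrder(Application& app, const std::string& v,
                             const std::vector<std::string>& earlierUncommitted) {
        for (const auto& u : earlierUncommitted)
            if (!app.isCommute(u, v)) return false;  // dependency: must wait
        app.onCommit(v);
        return true;
    }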

5.4.5 Parameters

Accelerator M.1 and Optimization M.3 use three parameters: α, β and τ. We discuss here strategies for choosing these parameters. Accelerator M.1 limits the number of outstanding SKIP messages between two idle servers p1 and p2 before they start to catch up. It bounds both the amount of time (τ) and the number of outstanding messages (α). τ should be chosen large enough that the cost of SKIP messages can be amortized. A larger τ, however, adds more delay to the propagation of SKIP messages,

and so results in extra commit delay for requests learned at p1 and p2. Fortunately, when idle, neither p1 nor p2 generates any replies to clients, and so this extra delay has little impact from a client's point of view. For example, in a system with a 50 ms one-way link delay, we can set τ to the one-way delay. This is a good value because: (1) with τ = 50 ms, Accelerator M.1 generates at most 20 SKIP messages per second, given a large enough α; the network resources and CPU power needed to transmit and process these messages are negligible; and (2) the extra delay added to the propagation of the SKIP messages is at most 50 ms, which could occur anyway due to network delivery variance or packet loss.

α limits the number of outstanding SKIP messages before p1 and p2 start to catch up: if τ is large enough, α SKIP messages are combined into just one SKIP message, reducing the overhead of SKIP messages by a factor of α. For example, we set α to 20 in our implementation, which reduces the cost of SKIP messages by 95%.

β defines an interval of instances: if a server q is crashed and Ip is the index of a non-faulty server p, then in steady state all instances coordinated by q in the range [Ip, Ip + k] for some k : β ≤ k ≤ 2β are revoked. Choosing a large β guarantees that while crashed, q's inactivity will not slow down other servers. It, however, forces the indexes of q and the other servers to be further out of synchronization when q recovers from a false suspicion or a failure. Nonetheless, the overhead of having a large β is negligible. Upon recovery, q will learn which of the instances it coordinates have been revoked. It then updates its index to the next available slot and suggests the next client request using that instance. Upon receiving the SUGGEST message, other replicas skip their turns and catch up with q's index (Rule M.2). The communication overhead of skipping is small, as discussed in Optimizations M.1 and M.2. The computation overhead of skipping multiple consecutive instances at once is also small, since an efficient implementation can easily combine their states and represent them at the cost of just one instance. While setting β too large could introduce problems with consensus instance sequence number wrapping, any practical implementation should have plenty of room to choose an appropriate β. Here is one way to calculate a lower bound for β. Revocation takes up to two and a half round trip delays. Let i be an instance of server q that is revoked. To avoid delayed commit of some instance i′ > i at a server p, one needs to start revoking i two and a half round trips in advance of instance i′ being learned by p. In our implementation with a round trip delay of 100 ms and with n = 3, the maximum throughput is about 10,000 operations per second. Two and a half round trip delays total 250 ms, which, at maximum throughput, is 2,500 operations. All of these operations could be proposed by a single server, and so the instance number may advance by as many as 3 × 2,500 = 7,500 in any 250 ms interval. Thus, if β ≥ 7,500, then in steady state no instances will suffer delayed commit arising from q being crashed. Taking network delivery variance into account, we set β = 100,000, which is a conservative value that is more than ten times the lower bound, but still reasonably small even for the 32-bit sequence number space in our implementation.
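In symbols, the bound just computed is (here T_max denotes the peak system throughput and RTT the round trip delay; the factor n reflects that all requests could be proposed by a single server, whose instance numbers are spaced n apart):

\[ \beta_{\min} = n \cdot T_{\max} \cdot 2.5\,\mathrm{RTT} = 3 \times 10{,}000\ \text{ops/s} \times 0.25\ \text{s} = 7{,}500\ \text{instances}. \]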

5.5 Evaluation

We ran controlled experiments in the DETER testbed [10] to evaluate and compare the performance of Mencius and Paxos. Our implementation of Mencius and Paxos shared the same C++ code base and used TCP as the transport protocol. We set the parameters that control Accelerator M.1 and Optimization M.3 to α = 20 messages, τ = 50 ms, and β = 100,000 instances.

5.5.1 Experimental settings

To compare the performance of Mencius and Paxos, we use a simple, low-overhead application that enables commutable operations. We chose a simple read/write register service of κ registers. The service implements a read and a write command. Each command consists of the following fields: (1) operation type – read or write (1 bit); (2) register name (2 bytes); (3) the request sequence number (4 bytes); and (4) ρ bytes of dummy payload. All the commands are ordered by the replicated state machine in our implementation. When a server commits a request, it executes the action, sends a zero-byte reply to the client, and logs the first three fields along with the client's ID. We use the logs to verify that all servers learn the same client request sequence; or, when reordering is allowed, that the servers learned compatible orders. Upon receiving the reply from the server, the client computes and logs the latency of the request. We use the client-side log to analyze experiment results. We evaluated the protocols using a three-server clique topology for all but the experiments in Section 5.5.4. This architecture simulated three data centers (A, B and C) connected by dedicated links. Each site had one server node running the replicated register service, and one client node that generated all the client requests from that site. Each node was a 3.0 GHz Dual-Xeon PC with 2.0 GB memory running Fedora 6. Each client generated requests at either a fixed rate or with inter-request delays chosen randomly from a uniform distribution. The additional payload size ρ was set to 0 or 4,000 bytes. 50% of the requests were reads and 50% were writes. The register name was chosen uniformly from the total number of registers the service implemented. A virtual link was set up between each pair of sites using the DummyNet [54] utility. Each link had a one-way delay of 50 ms. We also experimented with other delay settings such as 25 ms and 100 ms, but do not report these results here because we did not observe significant differences in the findings. The link bandwidth values varied from 5 Mbps to 20 Mbps. When the bandwidths were chosen within this range, the system was network-bound when ρ = 4,000 and CPU-bound when ρ = 0. Except where noted, Nagle's algorithm was enabled. In this section, we use "Paxos" to denote the register service implemented with Paxos, "Mencius" to denote the register service using Mencius with out-of-order commit disabled, and "Mencius-κ" to denote the service using Mencius with κ total registers and out-of-order commit enabled (e.g., Mencius-128 corresponds to the service with 128 registers). Given the read/write ratio, requests in Mencius-κ can be moved up, on average, 0.75κ slots before reaching an incommutable request. We used κ equal to 16, 128, or 1,024 registers to represent a service with a low, moderate, or high likelihood of adjacent requests being commutable, respectively. We first describe, in Section 5.5.2, the throughput of the service both when it is CPU-bound and when it is network-bound, and we show the impact of asymmetric channels and variable bandwidth. In both cases, Mencius has higher throughput. We further evaluate both protocols under failures in Section 5.5.3. In Section 5.5.4 we show that Mencius is more scalable than Paxos. In Section 5.5.5 we measure latency and observe the impact of delayed commit. In general, as load increases, the commit latency of Mencius degrades from being lower than that of Paxos to being the same as that of Paxos. Reordering requests decreases the commit latency of Mencius. Finally, we show that the impact of variance in network latency is complex.
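For concreteness, the command layout and commutativity rule described above can be sketched in C++ as follows (the names are illustrative; the evaluation code may differ):

    #include <cstdint>
    #include <vector>

    // Hypothetical layout of one command in the read/write register service.
    enum class OpType : uint8_t { Read = 0, Write = 1 };  // 1 bit on the wire

    struct Command {
      OpType   op;                   // (1) operation type: read or write
      uint16_t registerName;         // (2) register name (2 bytes)
      uint32_t sequenceNumber;       // (3) request sequence number (4 bytes)
      std::vector<uint8_t> payload;  // (4) rho bytes of dummy payload (0 or 4,000)
    };

    // A natural ISCOMMUTE predicate for this service: two commands conflict
    // only when they name the same register and at least one is a write.
    inline bool isCommute(const Command& u, const Command& v) {
      if (u.registerName != v.registerName) return true;
      return u.op == OpType::Read && v.op == OpType::Read;
    }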

5.5.2 Throughput

Figure 5.6: Throughput for 20 Mbps bandwidth

To measure throughput, we use a large number of clients generating requests at a high rate. Figure 5.6 shows the throughput of the protocols for a fully-connected topology with 20 Mbps available for each link, and a total of 120 Mbps available bandwidth for the whole system. When ρ = 4,000, the system was network-bound: all four Mencius variants had a fixed throughput of about 1,550 operations per second (ops). This corresponds to 99.2 Mbps, or 82.7% utilization of the total bandwidth, not counting the TCP/IP and MAC header overhead. Paxos had a throughput of about 540 ops, or one third of Mencius's throughput: Paxos is limited by the leader's outgoing bandwidth. When ρ = 0, the system is CPU-bound. Paxos presents a throughput of 6,000 ops, with 100% CPU utilization at the leader and 50% at the other servers. Mencius's throughput under the same condition was 9,000 ops, and all three servers reached 100% CPU utilization. Note that the throughput improvement for Mencius was in proportion to the extra CPU processing power available. Mencius with out-of-order commit enabled had lower throughput than Mencius with this feature disabled because of the extra work of dependency tracking. The throughput dropped as the total number of registers decreased because, with fewer registers, there was more contention and more dependencies to handle.
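The 99.2 Mbps figure follows directly from the request rate: each 4,000-byte payload must be propagated to each of the n − 1 = 2 other sites, so

\[ 1{,}550\ \text{ops/s} \times 4{,}000\ \text{B} \times (n-1) \times 8\ \text{bits/B} \approx 99.2\ \text{Mbps}, \]

which is the 82.7% of the 120 Mbps aggregate reported above.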


Figure 5.7: Mencius with asymmetric bandwidth (ρ = 4,000)

Figure 5.7 demonstrates Mencius's ability to use available bandwidth even when channels are asymmetric with respect to bandwidth. Here, we set the bandwidth of links A→B and A→C to 20 Mbps, links B→C and B→A to 15 Mbps, and links C→B and C→A to 10 Mbps. We varied the number of clients, ensuring that each site had the same number of clients. Each client generated requests at a constant rate of 100 ops. The additional payload size ρ was 4,000 bytes. As we increased the number of clients, site C saturated its outgoing links first, and from that point on committed requests at a maximum throughput of 285 ops. Meanwhile, the throughput at both A and B increased until site B saturated its outgoing links at 420 ops. Finally, site A saturated its outgoing links at 530 ops. As expected, the maximum throughput at each site is proportional to its outgoing bandwidth (more precisely, to the bandwidth of its slowest outgoing link).


Figure 5.8: Mencius dynamically adapts to changing network bandwidth (ρ = 4,000)

Figure 5.8 shows Mencius's ability to adapt to changing network bandwidth. We set the bandwidth of links A→B and A→C to 15 Mbps, links B→A and B→C to 10 Mbps, and links C→A and C→B to 5 Mbps. Each site had a large number of clients generating enough requests to saturate the available bandwidth. Sites A, B, and C initially committed requests with throughputs of about 450 ops, 300 ops, and 150 ops respectively, reflecting the bandwidth available to them. At time t = 60 seconds, we dynamically increased the bandwidth of link C→A from 5 Mbps to 10 Mbps. With the exception of a spike, C's throughput did not increase because it is limited by the 5 Mbps link from C to B. At t = 120 seconds, we dynamically increased the bandwidth of link C→B from 5 Mbps to 10 Mbps. This time, site C's throughput doubled accordingly. At t = 180 seconds, we dynamically decreased the bandwidth of link A→C from 15 Mbps to 5 Mbps. The throughput at site A dropped, as expected, to one third.

In summary, Mencius achieves higher throughput than Paxos under both CPU-bound and network-bound workloads. Mencius also fully utilizes available bandwidth and adapts to bandwidth changes.

5.5.3 Throughput under failure

In this section, we show throughput during and after a server failure. We ran both protocols with three servers under a network-bound workload (ρ = 4,000). After 30 seconds, we crashed one server. We implemented a simple failure detector that suspects a peer when it detects the loss of the TCP connection. The suspicion happened quickly, and so we delayed reporting the failure to the suspecting servers for another five seconds. Doing so made it clearer what occurs during the interval when a server's crash has not yet been suspected.
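A sketch of this detector follows (illustrative names, not the production code): suspicion is raised as soon as a TCP connection drops, but reporting to the protocol layer is deferred by a fixed delay.

    #include <chrono>
    #include <functional>
    #include <map>

    // Hypothetical sketch of the experiment's failure detector: suspect on
    // TCP disconnect, report to the protocol only after a fixed delay
    // (five seconds in this experiment).
    class DelayedReportDetector {
    public:
      using Clock    = std::chrono::steady_clock;
      using ReportFn = std::function<void(int /*peerId*/)>;

      DelayedReportDetector(Clock::duration reportDelay, ReportFn report)
          : delay_(reportDelay), report_(std::move(report)) {}

      // Invoked by the transport layer when the connection to a peer breaks.
      void onDisconnect(int peerId) {
        suspected_.emplace(peerId, Clock::now());  // suspicion is immediate
      }

      // Invoked periodically from the event loop.
      void onTick() {
        auto now = Clock::now();
        for (auto it = suspected_.begin(); it != suspected_.end();) {
          if (now - it->second >= delay_) {
            report_(it->first);          // hand the failure to the protocol
            it = suspected_.erase(it);
          } else {
            ++it;
          }
        }
      }

    private:
      Clock::duration delay_;
      ReportFn report_;
      std::map<int, Clock::time_point> suspected_;  // peer -> time of suspicion
    };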


Figure 5.9: The throughput of Mencius when a server crashes

Figure 5.9 shows Mencius's instantaneous throughput observed at server p0 when we crash server p1. The throughput is roughly 850 ops in the beginning, and quickly drops to zero when p1 crashes. During the period the failure remains unreported, both p0 and p2 are still able to make progress and learn the instances they coordinate, but cannot commit these instances because they have to wait for the consensus outcome of the missing instances coordinated by p1. When the failure detector reports the failure, p0 starts revocation against p1. At the end of the revocation, p0 and p2 learn of a large block of no-ops for the instances coordinated by p1. This enables p0 to commit all instances learned during the five-second period in which the failure was not reported, which results in a sharp spike of 3,600 ops. Once these instances are committed, Mencius's throughput stabilizes at roughly 580 ops. This is two thirds of the rate before the failure, because there is a reduction in the available bandwidth (there are fewer outgoing links), but it is still higher than that of Paxos under the same conditions.


Figure 5.10: The throughput of Paxos when the leader crashes

Figure 5.10 shows Paxos's instantaneous throughput observed at server p1 when we crash the leader p0. Throughput is roughly 285 ops before the failure, and it quickly drops to zero when p0 crashes because the leader serializes all requests. Throughput remains zero for five seconds until p1 becomes the new leader, which then starts recovering previously unfinished instances. Once it finishes recovering such instances, Paxos's throughput goes back to 285 ops, which was roughly the throughput before the failure of p0. Note that at t = 45 seconds, there is a sharp drop in the throughput observed at p1. This is due to duplicates: upon discovering the crash of p0, both p1 and p2 need to re-propose requests that had been forwarded to p0 and were still unlearned. Some of those requests, however, had already been assigned sequence numbers by p0 and accepted by either p1 or p2. Upon taking leadership, p1 revokes such instances, hence the duplicates. In addition, the throughput at p1 has higher variance after the failure than before. This is consistent with our observation that the Paxos leader sees higher variance than the other servers.


Figure 5.11: The throughput of Paxos when a follower crashes

Figure 5.11 shows Paxos's instantaneous throughput at the leader p0 when we crash p1. There is a small transient drop in throughput, but since the leader and a majority of servers remain operational, throughput quickly recovers. To summarize, Mencius temporarily stalls when any of the servers fails, while Paxos temporarily stalls only when the leader fails. Also, the throughput of Mencius drops after a failure because of a reduction in available bandwidth, while the throughput of Paxos does not change since it does not use all available bandwidth.

5.5.4 Scalability

For both Paxos and Mencius, availability increases by increasing the number of servers. Given that wide-area systems often target an increasing population of users, and sites in a wide-area network can periodically disconnect, scalability is an important property. We evaluated the scalability of both protocols by running them with a state machine ensemble of three, five and seven sites. We used a star topology where all sites connected to a central node: these links had a bandwidth of 10 Mbps and a 25 ms one-way delay. We chose the star topology to represent the Internet cloud, with the central node modeling the cloud. The 10 Mbps link from a site represents the aggregated bandwidth from that site to all other sites. We chose 10 Mbps because it is large enough that the system is CPU-bound when ρ = 0, but small enough that the system is network-bound when ρ = 4,000. When n = 7, 10 Mbps for each link gives a maximum demand of 70 Mbps for the central node, which is just under its 100 Mbps capacity. The 25 ms one-way delay to the central node gives an effective 50 ms one-way delay between any two sites. Because we only consider throughput in this section, network latency is irrelevant. To limit the number of machines we use, we chose to run the clients and the server on the same physical machine at each site. Doing this takes away some of the CPU processing power from the server; this is equivalent to running the experiments on slower machines under the CPU-bound workload (ρ = 0), and has no effect under the network-bound workload (ρ = 4,000). When the system is network-bound, increasing the number of sites (n) makes both protocols consume more bandwidth per request: each site sends a request to each of the remaining n − 1 sites. Since Paxos is limited by the leader's total outgoing bandwidth, its throughput is proportional to 1/(n − 1). Mencius, on the other hand, can use the extra bandwidth provided by the new sites, and so its throughput is proportional to n/(n − 1). Figure 5.12(a) shows both protocols' throughput with ρ = 4,000. Mencius started with a throughput of 430 ops with three sites, approximately three times Paxos's 150 ops under the same conditions. When n increased to five, Mencius's throughput dropped to 360 ops (84% ≈ (5/4)/(3/2)), while Paxos's dropped to 75 ops (50% = (1/4)/(1/2)). When n increased to seven, Mencius's throughput dropped to 340 ops (79% ≈ (7/6)/(3/2)), while Paxos's dropped to 50 ops (33% = (1/6)/(1/2)).


(a) ρ = 4,000 (b) ρ = 0

Figure 5.12: The scalability of Paxos and Mencius

When the system is CPU-bound, increasing n requires the leader to perform more work for each client request. Since the CPU of the leader is the bottleneck for Paxos, its throughput drops as n increases. Mencius, by rotating the leader, takes advantage of the extra processing power. Figure 5.12(b) shows throughput for both protocols with ρ = 0. As n increases, Paxos's throughput decreases gradually. Mencius's throughput increases gradually because the additional processing power outweighs the increasing processing cost for each request. When n = 7, Mencius's throughput is almost double that of Paxos.
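The percentages quoted above are instances of these proportionality relations. With per-site outgoing bandwidth fixed, each request consumes n − 1 payload transmissions of its proposer's outgoing bandwidth, so

\[ T_{\text{Paxos}}(n) \propto \frac{1}{n-1}, \qquad T_{\text{Mencius}}(n) \propto \frac{n}{n-1}, \]

giving, for example,

\[ \frac{T_{\text{Mencius}}(5)}{T_{\text{Mencius}}(3)} = \frac{5/4}{3/2} \approx 0.84, \qquad \frac{T_{\text{Paxos}}(5)}{T_{\text{Paxos}}(3)} = \frac{1/4}{1/2} = 0.50, \]

which matches the measured drops from 430 to 360 ops and from 150 to 75 ops.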

5.5.5 Latency

In this section, we use the three-site clique topology to measure Mencius's commit latency under low to medium load. We ran the experiments with both Nagle on and off. Not surprisingly, both Mencius and Paxos with Nagle on show a higher commit latency due to the extra delay added by Nagle's algorithm. Having Nagle enabled also adds some variability to the commit latency. For example, with Paxos, instead of a constant commit latency of 100 ms at the leader, the latency varied from 100 to 250 ms with a concentration around 150 ms. Except for this, Nagle's algorithm does not affect the general behavior of commit latency. Therefore, for the sake of clarity, we only present the results with Nagle off for the first two experiments. With Nagle turned off, all experiments with Paxos showed a constant latency of 100 ms at the leader and 200 ms for the other servers. Since we have three servers, Paxos's average latency was 167 ms. In the last set of experiments, we increased the load and so turned Nagle on for more efficient network utilization.


Figure 5.13: Mencius’s commit latency when client load shifts from one site to another

In a wide-area system, the load at different sites can differ for many reasons, such as time zones. To demonstrate the ability of Mencius to adjust to a changing client load, we ran a three-minute experiment with one client on site A and one on B. Site A's client generated requests during the first two minutes and site B's client generated requests during the last two minutes. Both clients generated requests at the same rate (inter-request delay δ ∈ [100 ms, 200 ms]). Figure 5.13 shows that during the first minute, when only site A generated requests, all requests had the minimal 100 ms commit latency. In the next minute, when both sites A and B generated requests, the majority of the requests still had the minimal 100 ms delay, but some requests experienced extra delayed commits of up to 100 ms. During the last minute, the latencies returned to 100 ms. To further see the impact of delayed commit, we ran experiments with one client at each site and all three clients concurrently generating requests. Figure 5.14 plots the CDF of the commit latency under low load (inter-request delay δ ∈ [100 ms, 200 ms]) and medium load (δ ∈ [10 ms, 20 ms]). We show only the low-load distribution for Paxos because the distribution for medium load is indistinguishable from the one we show. For Paxos, one third of the requests had a commit latency of 100 ms and two thirds had a 200 ms latency.


Figure 5.14: Commit latency distribution under low and medium load

With low load, the contention level was low and delayed commit happened less often for Mencius. As a result, about 50% of the Mencius requests had the minimal 100 ms delay. For those requests that did experience delayed commits, the extra latency was roughly uniformly distributed in the range (0 ms, 100 ms). Under medium load, the concurrency level goes up and almost all requests experience delayed commits. The average latency is about 155 ms, which is still better than Paxos's average of 167 ms under the same condition. For the experiments of Figures 5.15–5.17, we increased the load by adding more clients, and we enabled Nagle. All curves show lower latency under higher load. This is because of the extra delay introduced by Nagle: the higher the client load, the more often messages are sent, and therefore, on average, the less time any individual message is buffered by Nagle. This effect is much weaker in the ρ = 4,000 cases than in the ρ = 0 case because Nagle has more impact on small messages. All experiments also show a rapid jump in latency as the protocols reach their maximum throughput: at this point, the queues of client requests start to grow rapidly. Figure 5.15 shows the result for the network-bound case of ρ = 4,000. Mencius and Paxos had about the same latency before Paxos reached its maximum throughput.


Figure 5.15: Commit latency vs offered client load (ρ = 4,000, no network variance)

At this point, delayed commit has become frequent enough that Mencius has the same latency as Paxos. Lower latency can be obtained by allowing commutable requests to be reordered. Indeed, Mencius-1024, which has the lowest level of contention, had the lowest latency. For example, at 340 ops, Paxos and Mencius showed an average latency of 195 ms, Mencius-16 had an average latency of 150 ms, and Mencius-128 and Mencius-1024 had an average latency of 130 ms, an approximately 30% improvement. As client load increased, Mencius's latency remained roughly the same, whereas Mencius-16's latency increased gradually because the higher client load resulted in fewer opportunities to take advantage of commutable requests. Finally, Mencius-128 and Mencius-1024 showed about the same latency as client load increased, with Mencius-1024 being slightly better. This is because at the maximum client load (1,400 ops) and corresponding latency (130 ms), the maximum number of concurrently running requests is about 180. This gave Mencius-128 and Mencius-1024 about the same opportunity to reorder requests.


Figure 5.16: Commit latency vs offered client load (ρ = 0, no network variance)

Figure 5.16 shows the result for the CPU-bound case of ρ = 0. It shows the same trends as Figure 5.15. The impact of Nagle on latency is more obvious, and before reaching 900 ops, the latency of all four variants of Mencius increases as load goes up. This is because delayed commits happened more often as the load increased. We see the increase in latency because the penalty from delayed commits outweighed the benefit of being delayed, on average, for less time by Nagle. In addition, Mencius started with a slightly worse latency than Paxos, and the gap between the two decreased as throughput went up. Out-of-order commit helps Mencius reduce its latency: Mencius-16 (a high contention level) had about the same latency as Paxos. Finally, Mencius-128's latency was between that of Mencius-16 and Mencius-1024. As client load increased, the latency of Mencius-128 tended away from Mencius-1024 towards Mencius-16. This is because the higher load resulted in higher contention: increased contention gave Mencius-128 less and less flexibility to reorder requests.


Figure 5.17: Commit latency vs offered client load (ρ = 0, with network variance)

In the experiment of Figure 5.17, we select delivery latencies at random. It is the same experiment as the one of Figure 5.16, except that we add a Pareto distribution to each link using the NetEm [29] utility. The average extra latency is 20 ms and the variance is 20 ms. The time correlation of the latency is 50%, meaning that 50% of the latency of the next packet depends on the latency of the current packet. Pareto is a heavy-tailed distribution, which models the fact that wide-area links are usually timely but can occasionally present high latency. Given the 20 ms average and 20 ms variance, we observe extra latencies ranging from 0 to 100 ms. This is at least a twofold increase in latency at the tail. We also experimented with different parameters and distributions, but we do not report them here as we did not observe significant differences in the general trend. The shapes of the curves in Figure 5.17 are similar to those in Figure 5.16, despite the network variance, except for the following: (1) All protocols have lower throughput despite the system being CPU-bound – high network variance results in packets being delivered out of order, and TCP has to reorder and retransmit packets, since out-of-order delivery of ACK packets triggers TCP fast retransmission. (2) At the beginning of the curves in Figure 5.16, all four Mencius variants show lower latency under lower load because delayed commit happened less often. There is no such trend in Figure 5.17. This happens because with Mencius we wait for both servers to reply before committing a request, whereas with Paxos we only wait for the fastest server to reply. The penalty of waiting for the extra reply is an important factor under low load and results in higher latency for Mencius. For example, at 300 ops, Mencius's latency is 455 ms compared to Paxos's 415 ms delay. However, out-of-order commit helps Mencius achieve lower latency: Mencius-16 shows a 400 ms delay while both Mencius-128 and Mencius-1024 show a 350 ms delay. (3) As load increases, Paxos's latency becomes larger than Mencius's. This is due to the higher latency observed at non-leader servers. Although with Paxos the leader only waits for the fastest reply to learn a request, the non-leaders have the extra delay of the REQUEST and LEARN messages. Consider two consecutive requests u and v assigned to instances i and i + 1, respectively. If the LEARN message for u arrives at a non-leader later than the LEARN message for v because of network variance, the server cannot commit v for instance i + 1 until it learns u for instance i. If the delay between learning v and learning u is long, then the commit delay of v is also long. Note that in our implementation, TCP causes this delay as TCP orders packets that are delivered out of order. Under higher load, the interval between u and v is shorter, and the penalty instance i + 1 takes is larger because of the longer relative delay of the LEARN message for instance i. In summary, Mencius has lower latency than Paxos when network latency has little variance. The out-of-order commit mechanism helps Mencius reduce its latency by up to 30%. Non-negligible network variance has a negative impact on Mencius's latency under low load, but low load also gives Mencius's out-of-order commit mechanism more opportunity to reduce latency. And, under higher load, Paxos shows higher latency than Mencius because of the impact of network variance on non-leader replicas.

5.5.6 Other possible optimizations

There are other ways one can increase throughput or reduce latency. One idea is to batch multiple requests into a single request, which increases throughput at the expense of increased latency. This technique can be applied to both protocols, and would have the same benefit. We verified this with a simple experiment: we applied a batching strategy that combined up to five requests arriving within 50 ms into one. With small requests (ρ = 0), Paxos's throughput increased by a factor of 4.9 and Mencius's by a factor of 4.8; with large requests the network was the bottleneck and throughput remained unchanged. An approach to reducing latency consists of eliminating Phase 3 and instead broadcasting ACCEPT messages. This approach cuts the learning delay of non-leaders in Paxos by one communication step, and for Mencius it reduces the upper bound on delayed commit by one communication step. For both protocols, it increases the message complexity from 3n − 3 to n² − 1, thus reducing throughput when the system is CPU-bound. However, doing so has little effect on throughput when the system is network-bound, because the extra messages are small control messages that are negligible compared to the payload of the requests. Another optimization for Paxos is to have the servers broadcast the body of the requests and reach consensus on a unique identifier for each request. This optimization allows Paxos, like Mencius, to take full advantage of the available link bandwidth when the service is network-bound. It is not effective, however, when the service is CPU-bound. In fact, it might reduce Paxos's throughput by increasing the wide-area message complexity.
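One accounting consistent with these per-instance counts is the following: with Phase 3, the coordinator sends n − 1 PROPOSE and n − 1 LEARN messages and receives n − 1 ACCEPT messages; with broadcasting, every one of the n servers sends its ACCEPT to the other n − 1 servers:

\[ \underbrace{(n-1)}_{\text{PROPOSE}} + \underbrace{(n-1)}_{\text{ACCEPT}} + \underbrace{(n-1)}_{\text{LEARN}} = 3n-3, \qquad \underbrace{(n-1)}_{\text{PROPOSE}} + \underbrace{n(n-1)}_{\text{broadcast ACCEPT}} = n^2-1. \]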

5.6 Related work

Mencius is derived from Paxos [36, 37]. Fast Paxos, one of the variants of Paxos [39], has been designed to improve latency. However, it suffers from collisions (which result in significantly higher latency) when concurrent proposals occur. Another protocol, CoReFP [25], deals with collisions by running Paxos and Fast Paxos concurrently, but has lower throughput due to increased message complexity. Generalized Paxos [38], on the other hand, avoids collisions by allowing Fast Paxos to commit requests in different but equivalent orders. In Mencius, we allow all servers to immediately assign requests to the instances they coordinate to obtain low latency. We avoid contention by rotating the leader (coordinator), which is called a moving sequencer in the classification of Défago et al. [24]. We also use the rotating leader scheme to achieve high throughput by balancing network utilization. Mencius, like Generalized Paxos, can also commit requests in different but equivalent orders. Another moving sequencer protocol is Totem [5], which enables any server to broadcast by passing a token. A process in Totem, however, has to wait for the token before broadcasting a message, whereas a Mencius server does not have to wait to propose a request. Lamport's application of multiple leaders [40] is the closest to Mencius. It is primarily used to remove the single-leader bottleneck of Paxos. However, Lamport does not discuss in detail how to handle failures or how to prevent a slow leader from affecting others in a multi-leader setting. The idea of rotating the leader has also been used for a single consensus instance in the ♦S protocol of Chandra and Toueg [17]. A number of low latency protocols have been proposed in the literature to solve atomic broadcast, a problem equivalent to that of implementing a replicated state machine [17]. For example, Zieliński presents an optimistic generic broadcast protocol that allows all messages to be delivered in two communication steps if all conflicting messages are received by different processes in the same order. Compared to Mencius, this algorithm has the disadvantage of requiring n > 3f [67]. Zieliński also presents a protocol that relies on synchronized clocks to deliver messages in two communication steps [68]. Similar to Mencius, the latter protocol sends empty (equivalent to no-op) messages when it has no message to send. Unlike Mencius, it suffers from higher latency after one server has failed. The Bias Algorithm deterministically orders messages from multiple sources, and uses the rates at which sources produce messages to determine order, with the goal of minimizing the latency to deliver a message [3]. In contrast, Mencius does not assume knowledge of the message producing rates. It instead skips instances to reduce delivery latency. Schmidt et al. propose the M-Consensus problem for low latency atomic broadcast and solve it with Collision-fast Paxos [57]. Instead of learning a single value for each consensus instance, M-Consensus learns a vector of values. Collision-fast Paxos works similarly to Mencius in that it requires a server to propose an empty value when it has no value to propose; but, different from Mencius, it assigns rounds of an M-Consensus instance to groups of proposers (collision-fast proposers), and each instance may have multiple values learned. When used in a solution to atomic broadcast, Collision-fast Paxos also suffers from a problem similar to the delayed commit problem in Mencius, since it needs to order the multiple values that are assigned to and learned in a single instance of M-Consensus. Different from Mencius, however, out-of-order commit is not possible because a slot is re-assigned to a different server upon a server failure, and different servers can propose two different arbitrary values. Recall that with Mencius, the value learned for an instance can only be the value proposed by the coordinator of the instance or no-op. We are not the first to consider high-throughput consensus and fault-scalability. For example, FSR [28] is a protocol for high-throughput total-order broadcast for clusters that uses both a fixed sequencer and a ring topology. PBFT [14] and Zyzzyva [33] propose practical protocols for high-throughput consensus when processes can fail arbitrarily. Q/U [2] proposes a scalable Byzantine fault-tolerant protocol. Steward [7] is a hybrid Byzantine fault-tolerant protocol for multi-site systems. It runs a Byzantine fault-tolerant protocol within a site and a benign consensus protocol between sites. Steward could benefit from Mencius by replacing its inter-site protocol (the main bottleneck of the system) with Mencius.

5.7 Future directions and open issues

We have discussed important implementation considerations for Mencius and conducted an extensive evaluation of the protocol in multi-site settings. The following issues, however, have yet to be studied in future work.

Mencius for local-area systems: Though designed specifically for wide-area multi-site systems, Mencius is expected to perform well in local-area systems for the following reasons. (1) In local-area systems, the network is no longer the bottleneck; therefore, system throughput benefits from the rotating-leader design of Mencius, as it does in CPU-bound multi-site systems. (2) Network latency has smaller variance and is more predictable in local-area systems. This not only makes the ♦P failure detector easier to implement, but also makes it less of a disadvantage for Mencius to have to wait for ACCEPT messages from all correct servers before committing a request – the difference between waiting for all correct servers and for only f + 1 servers is much smaller in local-area systems due to less variance in message latency. (3) Unless forwarding is used, Mencius in wide-area systems can only achieve its peak throughput when all sites have high load at the same time. When running in a local-area network, it is easy for clients to randomly select a server to send requests to, achieving optimal load balancing. Despite these promising observations, we have yet to conduct experiments to further evaluate Mencius and explore possible optimizations in local-area systems.

Coordinator allocation: Mencius’s commit latency is limited by the slowest server. A solution to this problem is to have coordinators at only the fastest f + 1 servers and have the slower f servers forward their requests to the other sites.

Sites with faulty servers: We have assumed that while a server is crashed, it is accept- able that its clients do not make progress. In practice, we can relax this assump- tion and cope with faulty servers in two ways: (1) have the clients forward their requests to other sites, or (2) replicate the service within a site such that the servers can continuously provide service despite the failure of a minority of the servers.

Managing out-of-order commit: Out-of-order commit improves Mencius's latency, but also hurts its throughput. It is not clear, though, which heuristics to use to decide when to enable this feature.

Managing Nagle’s algorithm: While Nagle’s algorithm increases throughput when the system has high load, it also adds latency under low load. Nagle’s algorithm, how- ever, can be turned on and off on a per link basis, but it is not clear how to make such choices in an optimal way.

5.8 Summary

We have derived, implemented, and evaluated Mencius, a high performance state machine replication protocol in which clients and servers are spread across a wide-area multi-site system. By using a rotating coordinator scheme, Mencius is able to sustain higher throughput than Paxos, both when the system is network-bound and when it is CPU-bound. Mencius presents better scalability with more servers compared to Paxos, which is an important attribute for wide-area applications. Finally, the state machine commit latency of Mencius is usually no worse, and often much better, than that of Paxos, although the effect of network variance on both protocols is complex.

5.9 Acknowledgement

This chapter is, in part, a reprint of material as it appears in "Mencius: Building Efficient Replicated State Machines for WANs", by Yanhua Mao, Flavio Junqueira, and Keith Marzullo, in the Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), December 2008. The dissertation author was the primary coauthor and co-investigator of this paper.

Chapter 6

Byzantine Failure: Protocol RAM

In the previous chapter, we studied replicated state machines for multi-site wide-area systems under the crash failure model, in which processes can only fail by stopping to execute. In this chapter, we extend the study and design a protocol called RAM for the Byzantine failure model, in which a process may fail in an arbitrary manner. Replicated state machine protocols that tolerate Byzantine failures are often called Byzantine fault-tolerant protocols, or BFT protocols for short. Being able to tolerate more general failures makes BFT protocols applicable to a wider range of applications. For example, crash-failure protocols are most useful for tolerating benign hardware failures. BFT protocols, on the other hand, can be used to tolerate unexpected behavior caused by random bit flips, uncivil rational behavior caused by selfish administrators, or even attacks mounted from compromised servers or clients. Despite these advantages, BFT protocols are rarely deployed in production. Cost is one of the many reasons behind their limited use. BFT protocols require at least 3f + 1 replicas to tolerate f failures [41], f more than crash-failure protocols. Moreover, fast-learning protocols require even higher replication; for example, FaB [47] tolerates f failures with 5f + 1 replicas. Efforts have been made to use specialized hardware to tolerate f Byzantine failures with only 2f + 1 replicas [19, 42]; however, these protocols are vulnerable when the hardware is tampered with, and cannot recover from tampering. We also take advantage of specialized hardware in our own protocol RAM, for the purpose of reducing latency. RAM, however, does require 3f + 1 replicas, so that faulty hardware behavior can be easily detected and any damage rolled back and repaired.

Performance is another concern. BFT protocols usually use cryptographic primitives for message authentication and are typically more complicated than their crash-failure counterparts. As a result, they have lower throughput and higher latency even in the common case, i.e., failure-free execution. To counter that, many existing protocols introduce optimizations for failure-free runs. However, without careful design, such optimizations can be fragile: a single undetected faulty server can render the protocols nearly useless [20]. Designing protocols that deliver robust performance with undetected faulty servers is an active research area [6, 20]. One of the objectives of RAM is to deliver robust latency across wide-area multi-site systems.

Moreover, traditional BFT protocols often apply a flat trust model to the whole system, i.e., no two processes trust each other. Though simple, the flat trust model misses out on practical optimization opportunities. For example, in a multi-site system, processes in the same site often have the same basic objectives since they are managed by the same administrator. Local processes are also more likely to have fate-sharing relationships since they share the same firewall and intrusion detection system that protect the processes from outside attacks. Moreover, when processes are compromised in a site, the problem is quite severe for that site but less so for other sites: masking faulty behavior in only one service is short-sighted. Instead, the system administrators need to understand how widespread the attack is and restore the site's integrity. So, if desired, it is reasonable and practical for the processes in the same site to trust each other because such trust already exists in some scenarios. RAM is designed to take advantage of such a trust model to optimize performance.

Finally, existing BFT protocols do not recognize different classes of failures. Indeed, BFT is essentially a masking technique: all failures up to a certain threshold are masked regardless of the cause. Detecting failures, however, can improve fault coverage and availability through repairing or excluding faulty processes. RAM is designed to expose as many failures as possible while discouraging uncivil rational (selfish and/or competitive) behavior that may hurt system performance. RAM also treats different classes of failures differently. For example, a masking technique is used to cope with failures occurring at a remote site; processes within a site usually trust each other, and external tools are used to detect failures within a site when abnormal behavior is observed; also, specialized hardware is usually trusted by both local and remote processes, and when the hardware is compromised or tampered with, a simple mechanism is used to detect such faulty behavior. The system can then be rolled back and any damage repaired after the detection of faulty behavior.

With these goals in mind, we start this chapter with section 6.1, reviewing the system model and explaining three classes of possible process behavior: correct, uncivil rational, and irrational. We then state our main design goals in section 6.2. Section 6.3 discusses one of these goals, reducing latency; the result is the use of three basic techniques: a rotating-leader design, attested append-only memory (A2M), and the Mutually Suspicious Domains (MSD) model.
In section 6.4, A2M and MSD are then incorporated into a simple consensus protocol, Coordinated Byzantine Paxos, which is in turn used to build a rotating leader replicated state machine protocol in section 6.5. The resulting protocol is called RAM, named after the three basic techniques used. Sections 6.6 and 6.7 explain how RAM copes with uncivil rational and irrational behavior, respectively. We also implemented a prototype of RAM and report preliminary evaluation results in section 6.8. Finally, we survey related work in section 6.9 and summarize this chapter in section 6.11.

6.1 Assumptions

In this chapter, we consider a practical Byzantine failure model for the multi-site setting, meaning that processes, by default, are assumed to be able to fail arbitrarily. However, practical assumptions are also made to judiciously restrict the faulty behavior. Details about the additional assumptions we make can be found in sections 6.3.1, 6.3.3, 6.7.1 and 6.7.2. These assumptions allow us to design efficient yet robust protocols. Because of a well-known impossibility result [41], our protocol is designed to tolerate up to (n − 1)/3 Byzantine failures. We assume minimal replication, n = 3f + 1, for simplicity. We also assume the availability of standard cryptographic primitives such as secure hashing [1, 52], message authentication codes (MACs) [9], and digital signatures [53]. We assume it is computationally infeasible to break these primitives. Keys are assumed to have been deployed in advance, and all servers only process messages that are properly signed or authenticated. Unless otherwise noted, we only consider faulty server behavior. Tolerating faulty client behavior is an orthogonal problem and can be dealt with using published techniques [23, 43]. We assume the existence of a utility for each server. For example, it could be providing low request latency for its local clients. Based on the utility, we can roughly categorize the behavior of a server into the following three broad classes:

Civil behavior: behavior that complies with the protocol specification. Civil behavior is also correct behavior.

Irrational behavior: an arbitrary behavior that can hurt the utility of a site or of the system. Irrational behavior is also Byzantine behavior.

Uncivil rational behavior: behavior that, from an outsider's point of view, appears correct. However, to seek higher utility for its own site, the server may exploit the protocol and reorder or delay message processing. This is possible because the asynchronous message passing model bounds neither the message delivery time nor the processing time. So, another server cannot tell whether the extra delay is intentionally caused by the server or caused by system asynchrony. Such behavior is uncivil if it causes the utility of other sites to decrease. Such situations can arise from deliberate action of a site. For example, under high load conditions, an uncivil rational site can process messages from a local client in preference to those from remote clients. Other sites may not be able to determine if the additional delay is caused by uncivil behavior or network jitter.

It is easy to see that civil behavior is correct and irrational behavior is Byzantine faulty. It is, however, inherently hard to determine whether uncivil rational behavior is correct. From an outsider's point of view, an uncivil rational server eventually processes and sends out messages as defined by the protocol specification, so it is considered correct by other servers. From the point of view of the server in question, where internal knowledge is available, an uncivil rational server does not comply with the protocol specification, and so it should be considered faulty. Figure 6.1 summarizes the relationship between correct, Byzantine faulty, civil, uncivil rational, and irrational behavior. Note that it is possible to have rational behavior that is civil: such behavior optimizes local utility but does not hurt the utility of remote sites. Such behavior is considered both correct and civil.


Figure 6.1: The relationship between different kinds of server behavior

Uncivil rational and irrational behavior also have different root causes. Irrational behavior is often caused by a server or a site being compromised by remote attacks that seek to disrupt the replicated state machine service. Uncivil rational behavior is often caused by faulty (selfish) local administrators that seek to increase local utility, possibly at the cost of the utility of remote sites. The differences in root cause result in different strategies for coping with these two classes of behavior. The next section discusses these differences.

6.2 Design goals

In this section, we examine the three main design goals of our protocol: robust low request latency, discouraging uncivil rational behavior, and exposing irrational behavior.

6.2.1 Low latency as a main goal

In the crash failure model, we designed Mencius to have both low latency and high throughput in multi-site systems. For the weaker Byzantine failure model, we relax the performance goals to robust low request latency and robust acceptable system throughput. So, we define the system utility function as low request latency; our main goal is to provide low request latency for multiple sites. Such low latency should also be robust. This means not only that RAM has low latency during periods of stability (no untimely communication, false suspicion, or undetected faulty servers), but also that the latency does not increase too much even in the presence of false suspicion and/or undetected faulty servers. We only require low latency when communication is timely because no protocol can be both safe and live when communication is untimely (i.e., asynchronous). Reducing latency in the wide area requires protocol redesign; we discuss how to achieve this goal in section 6.3. Throughput is a secondary objective in RAM. Indeed, the throughput of RAM is significantly lower than that of many existing protocols, such as [14, 33], mainly because RAM signs critical messages using digital signatures, which is in turn useful in exposing irrational behavior (see section 6.2.3). Even though throughput was not our primary goal, the throughput that RAM provides is robust and is in the range that has been sufficient for practical applications [15]. Additionally, throughput increases as hardware improves, so we consider treating throughput as a secondary objective a reasonable engineering trade-off.

6.2.2 Discouraging uncivil rational behavior

We have discussed uncivil rational behavior, which is behavior that optimizes for local utility. Examples of such behavior include: competing for the leader role, competing for resources (for example, by flooding), issuing irresponsible suspicions, omitting messages, and untimely behavior. While protocols are usually designed to achieve the best utility function for the system as a whole, such rational behavior can cause the system to diverge from that goal. It is our goal to discourage uncivil rational behavior. To achieve this goal, we designed RAM in such a way that any divergence from civil behavior only increases the chance of reducing the local utility function. By doing so, the best strategy for any uncivil rational server is to follow the protocol.

6.2.3 Exposing irrational behavior

Most existing BFT protocols focus, naturally, on masking Byzantine failures. By doing so, an undetected faulty server can cause a significant reduction in system performance [6, 20]. For example, a faulty leader in PBFT can significantly increase the latency for all servers by intentionally delaying requests just enough not to trigger timeouts. While not all failures can be pinpointed, detecting those that can be is useful. So, we designed RAM to expose as much irrational behavior as possible. For example, RAM uses digital signatures to sign critical messages. Though this results in lower throughput compared to the much cheaper MACs, digital signatures provide the ability for a server to prove the faulty behavior of another server to a third party, which is impossible if MACs are used instead. In addition, RAM also makes the practical assumptions that (1) a client trusts its local server and (2) servers trust each other's A2M. Should these assumptions be broken (a sign of site compromise), RAM will temporarily enter an unsafe state, but will detect the failure quickly. Once detected, broader site recovery tools can be used to roll back and repair any damage.

6.3 Reducing latency for PBFT

We have explained Paxos and PBFT in sections 2.7 and 2.8, respectively. Since reducing latency is our primary goal, we use a variant of Paxos that broadcasts the ACCEPT message to reduce latency (see Figure 2.5). When running Paxos in multi-site systems, the latency of a client request is two wide-area communication steps when the client is co-located with the leader and three steps otherwise. PBFT requires two more steps than Paxos in both cases. This is a result of the following two observations. (1) With PBFT, to prevent a faulty leader from proposing inconsistent values to different servers, two steps (PRE-PREPARE and PREPARE) are required to propose a request, whereas only one step (PROPOSE) is required with Paxos, since it assumes crash failures. (2) A Paxos client learns the outcome of the request as soon as it hears it from its local server, while a PBFT client must wait for f + 1 matching REPLY messages, at least f of which must cross the wide-area network. This is because a PBFT client does not trust its local server; waiting for f + 1 matching replies ensures that at least one of them is from a correct server. In the remainder of this section, we explore techniques to reduce latency for PBFT in multi-site systems.

6.3.1 Mutually Suspicious Domains

PBFT assumes a flat Byzantine model, in which there is no trust between pairs of processes: both clients and correct servers have to be aware of Byzantine servers. Applying the flat Byzantine model to a multi-site system means that the clients do not even trust their local servers. However, in many practical systems, the clients and their local servers share fate: (1) Clients may rely solely on their local servers to make progress, and the servers have incentives to improve the utility for their local clients. (2) Client and server machines in the same site are more likely to share vulnerabilities because they are in the same administrative domain. (3) When a local server is compromised, instead of simply masking the failure, it is often the case that administrative actions are required to recover the server and the clients that rely on it. We therefore propose the following practical failure model to recognize the fact that a local server is more trustworthy than remote ones.

Definition 6.1 (Mutually Suspicious Domains (MSD)): We model each site as an in- dependent communication domain. While there is trust between the server and clients within a domain, we assume no trust for inter-domain communication, i.e., a domain must protect itself from possible uncivil behavior from other domains.

Assuming MSD gives us two advantages in terms of reducing latency for BFT protocols:

Local replies: By assuming MSD and trusting its local server, a client can immediately learn the outcome of a request when it receives the REPLY from its local server, thereby reducing its latency by one wide-area communication step.

Local reads: Certain read-only requests that do not require linearizability can now be executed locally without going through wide-area communication.

Note that local reads may violate real-time precedence ordering, thus making the history of operations not linearizable [30]. However, serializability is still preserved: clients only perceive that requests are not executed atomically if they communicate through channels that are external to the replicated state machine. If clients only communicate through the state machine, then they perceive the same order of state changes.

6.3.2 Rotating leader design

PBFT, as well as other more recent BFT protocols [20, 33], relies on a single leader to propose requests. This makes it possible for a Byzantine leader to mount performance attacks by, for example, delaying proposals of requests from remote sites. With MSD, clients trust their local servers. Consequently, it is natural for a server to propose requests on behalf of its local clients. This way, the clients do not have to worry about the server being unfair. Naturally, this leads to the adoption of the rotating leader scheme as described by the generic protocol R. In addition to reducing the risk of being treated unfairly by a server at a remote site, rotating the leader allows the REQUEST message to be sent as a local-area message. Doing this reduces the latency by one wide-area communication step as compared to PBFT when clients are not co-located with the leader. When there are no concurrent client requests, this reduces the latency observed by the client by one step. Concurrent requests may cause the delayed commit problem and increase the latency by up to one step. This extra latency can be reduced by allowing out-of-order commit to safely commit commutable requests at different servers in different orders. We evaluate the effectiveness of out-of-order commit for RAM in section 6.8.3. Finally, being the leader in RAM is a privilege rather than a right: the leader status of a server may be revoked if it is acting suspiciously. We designed RAM such that divergence from the correct behavior only increases the probability of a server being revoked, which in turn leads to lower local utility. As a result, uncivil rational behavior is discouraged because the best strategy for a rational server is to follow the correct behavior.

6.3.3 A2M for identical Byzantine failure

While it only takes Paxos one step (PROPOSE) for the servers to confirm the value proposed by the leader, it takes PBFT two steps (PRE-PREPARE and PREPARE) to confirm the leader's proposal. The extra step is necessary to expose inconsistent proposals from a Byzantine leader, e.g., a Byzantine leader may propose some request v to one server and no-op to another. By running a flooding sub-protocol in the extra PREPARE phase, PBFT is able to simulate identical Byzantine failure [8]: if a server has prepared a value from the leader, all other servers will prepare either the same value or no value.

Definition 6.2 (Attested append-only memory (A2M)): A tamper-resistant hardware device that functions as an append-only log: when asked, A2M appends and digitally signs a new log entry if and only if it is consistent with the previous entries in the log [19].

Conceptually, A2M contains an internal logic that determines consistency and an unbounded number of log entries. Since any practical implementation of A2M can only support a bounded number of log entries, an application that uses A2M is also allowed to discard entries that are no longer useful. The logic that determines consistency is usually application specific and can be quite complicated. TrInc [42], a simplified version of A2M, can be used to offload much of the logic to the application; it only stores a ⟨counter, value⟩ pair in the log.

A2M has been proposed for designing BFT protocols that tolerate f failures with only 2f + 1 replicas. This is done by pairing each server with an A2M, which is treated as a trusted third party by all servers. Each server is required to record all outgoing messages in its A2M, which in turn needs to make sure they are consistent with respect to the semantics of the BFT protocol. Such a usage of A2M reduces the degree of replication, but it has its drawbacks: non-faulty replicas must both log/attest all protocol messages before sending them and can only handle one request at a time. The logic of A2M is also complicated, and so prone to error and vulnerability. Relying on A2M to tolerate f failures with 2f + 1 replicas also makes the system vulnerable to compromised A2Ms. Though highly unlikely, when f A2Ms are compromised, it is impossible to determine the faulty party or to recover from the damage caused by such compromised A2Ms.

We instead opted for a higher degree of replication and a simpler design. Requiring 3f + 1 replicas allows us to detect faulty A2Ms and recover from the damage caused by them. This is useful in environments that cannot fully guarantee the integrity of A2M devices. We discuss this topic further in section 6.7.1. Additionally, we only use A2M to implement identical Byzantine failure with only one wide-area message delay.

Each server is paired with a local A2M. Whenever a server proposes a value (either no-op or a client request) to a consensus instance, it asks its local A2M to digitally sign the proposal. The A2M records the last pair of sequence number and value it has signed, and only signs a new proposal if it immediately succeeds the recorded sequence number: this way the server cannot have the A2M sign inconsistent proposals or leave gaps in the consensus sequence space.

We can now reduce PBFT's proposing phase from two steps to just one step if each server's A2M can be trusted: an assumption that is only reasonable if the A2M involved is simple enough to be implemented correctly without vulnerabilities. This assumption is plausible in practice because the logic that determines consistency is very simple and is independent of the specific protocol that uses it: every consensus-based replicated state machine protocol needs to propose a request at one point or another.
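As an illustration of the consistency logic just described, here is a minimal sketch of such an A2M. The class and its key handling are hypothetical, not the interface of [19] or [42], and an HMAC stands in for the device's digital signature.

```python
import hashlib
import hmac

class SimpleA2M:
    """Minimal sketch of the A2M consistency logic described above.

    The device signs an (instance, value) proposal only if the instance
    number immediately follows the last instance it signed, so the host
    server can neither equivocate within an instance nor leave gaps in
    the consensus sequence. An HMAC stands in for a real signature.
    """

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._last_instance = -1  # nothing signed yet

    def sign_proposal(self, instance: int, value: bytes) -> bytes:
        if instance != self._last_instance + 1:
            # Re-signing an old instance would allow equivocation;
            # skipping ahead would leave a gap. Both are refused.
            raise ValueError("proposal must extend the log by exactly one entry")
        self._last_instance = instance
        entry = instance.to_bytes(8, "big") + value
        return hmac.new(self._key, entry, hashlib.sha256).digest()

# Usage: signing instance 0 a second time (equivocation) is rejected.
a2m = SimpleA2M(b"device-secret")
sig0 = a2m.sign_proposal(0, b"request v")
sig1 = a2m.sign_proposal(1, b"no-op")
# a2m.sign_proposal(0, b"no-op") would raise ValueError
```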

6.4 Coordinated Byzantine Paxos

RAM is a replicated state machine protocol that executes an unbounded sequence of simple consensus instances. Each instance of simple consensus is solved using Coordinated Byzantine Paxos, which is a Paxos-like protocol that incorporates MSD and A2M to tolerate f Byzantine failures using 3f + 1 replicas.

Like Paxos, Coordinated Byzantine Paxos operates in rounds. Each replica is responsible for an unbounded number of rounds and acts as the leader of these rounds. This implies that no two replicas are responsible for the same round. The goal of a replica is to have a value chosen for a round in which it is the leader. A value v is chosen in a round r if and only if 2f + 1 replicas accepted v in round r. Coordinated Byzantine Paxos ensures that once a value v is chosen in round r, the replica responsible for any round r′ (r′ > r) can and will only propose v for round r′. This ensures that the

Table 6.1: Summary of messages in Coordinated Byzantine Paxos.

Message       Meaning
REQUEST       A client sends a request to a replica
SUGGEST       The coordinator proposes a client request
ACCEPT        A replica accepts the proposal
SKIP          The coordinator proposes no-op
SU-RELAY      A replica relays a SUGGEST message
REPLY         A replica sends the result of a request
SK-RELAY      A replica relays a SKIP message
O-REPLY       Optional reply certificate
NEW-LEADER    The new leader initiates the leader change
ACK           A replica acknowledges a NEW-LEADER message
PRE-PREPARE   The new leader proposes a value based on the progress certificate
PREPARE       A replica confirms the value proposed by the new leader

protocol is safe.

Coordinated Byzantine Paxos incorporates MSD and A2M. This means that the clients always trust their local server and all the servers trust the A2Ms. Critical messages such as SUGGEST and SKIP (and, by extension, SU-RELAY and SK-RELAY) must be signed by an A2M. Like Coordinated Paxos, all servers running Coordinated Byzantine

Paxos start with an initial round r0, in which the coordinator is the leader. Also as in Coordinated Paxos, a server running Coordinated Byzantine Paxos can invoke one of the three actions given below, based on its role. The pseudo code of Coordinated Byzantine Paxos can be found in Appendix D. We use the following notation:

• ⟨m⟩αp: a message m from p authenticated using the shared secret keys between p and other processes.

• ⟨m⟩σp: a message m signed by server p.

• ⟨m⟩ρp: a message m signed by the A2M of server p.

Suggest: The coordinator c can suggest by proposing a client request to the initial round r0. It does this by broadcasting a ⟨SUGGEST, v⟩ρc message, which is signed by its local A2M.

Figure 6.2: A space-time diagram showing the message flows of the suggest and skip actions in Coordinated Byzantine Paxos. [Figure not reproduced; its lanes show Client0, the coordinator, and Replicas 1–3 exchanging REQUEST, SUGGEST, ACCEPT+SU-RELAY, REPLY, O-REPLY, SKIP, and SK-RELAY messages, with the coordinator suggesting in instance 0 and skipping in instance 1.]

Upon receiving the SUGGEST message for the first time, a server p accepts it unless it previously promised not to do so. It then broadcasts an ⟨ACCEPT, v, r0⟩αp message and also relays the SUGGEST message as a separate SU-RELAY message. (An SU-RELAY message has the exact same fields and content as the SUGGEST message being relayed; we give it a new name only to distinguish it from the original. For the same reason, we use SK-RELAY below to distinguish a relayed SKIP message from the original SKIP message.) A server learns the outcome when it receives 2f + 1 matching ACCEPT messages. The coordinator c commits and executes the request to generate the reply R. It then sends a ⟨REPLY, R⟩αc message to the client. The client learns the outcome of its request once it receives the REPLY from the coordinator, which is its local server. Instance 0 in Figure 6.2 summarizes the message flow of suggest. The dotted lines in this figure represent optional REPLY and O-REPLY messages, which are useful for detecting a faulty local server. We further explain them in section 6.7.2.

Skip: The coordinator c can skip by proposing no-op. It does this by broadcasting ⟨SKIP⟩ρc, which is signed by its local A2M. Note that no ACCEPT message is generated to acknowledge the SKIP message because Coordinated Byzantine Paxos is skipping fast. Upon receiving the SKIP message for the first time, a server relays the message to all other servers using SK-RELAY and learns that no-op has been chosen. A server can learn no-op immediately because no-op is the only possible outcome: the coordinator has proposed no-op, and all the other servers can only propose no-op by the definition of simple consensus. Instance 1 in Figure 6.2 summarizes the message flow of skip.
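The following sketch summarizes how a replica might handle the suggest and skip actions above. It is a schematic illustration, not the dissertation's pseudocode: message authentication, A2M signature verification, and revocation state are elided, and `broadcast` and the fault bound `f` are assumed inputs.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicaState:
    accepted: dict = field(default_factory=dict)      # instance -> value
    learned: dict = field(default_factory=dict)       # instance -> outcome
    accept_votes: dict = field(default_factory=dict)  # (instance, value) -> senders
    promised: set = field(default_factory=set)        # instances promised away in r0

def on_suggest(state, instance, value, broadcast):
    """First receipt of a (verified) <SUGGEST, value> for round r0."""
    if instance in state.promised:
        return                                 # promised a higher round
    state.accepted[instance] = value
    broadcast(("ACCEPT", instance, value))     # MAC-authenticated
    broadcast(("SU-RELAY", instance, value))   # copes with omission by c

def on_accept(state, instance, value, sender, f):
    votes = state.accept_votes.setdefault((instance, value), set())
    votes.add(sender)
    if len(votes) >= 2 * f + 1:                # quorum of matching ACCEPTs
        state.learned[instance] = value

def on_skip(state, instance, broadcast):
    # no-op is learned immediately: by the definition of simple consensus,
    # no other value can be chosen once the coordinator has skipped.
    broadcast(("SK-RELAY", instance))
    state.learned[instance] = "no-op"
```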

Revoke: For some protocol parameter α (1 ≤ α ≤ f + 1), if f + α servers suspect that the current leader is faulty, then a new leader is elected to revoke the suspected leader. Figure 6.3 summarizes the message flow of revoke; we explain the details of the revocation sub-protocol in section 6.4.1. The new leader asks all servers to advance to a higher round r′ by polling them with a NEW-LEADER message. Upon receiving this message, the replicas advance to round r′, promise not to accept any value in any round r∗ < r′, and respond with a signed ACK message. The ACK message also includes the value a server has accepted and the round in which the value was accepted. The new leader then gathers 2f + 1 ACK messages to form a progress certificate. If the progress certificate indicates that some value v might have been chosen, then the new leader proposes v; otherwise it proposes no-op. The subsequent execution is similar to that of PBFT.

Figure 6.3: A space-time diagram showing the message flow of the revoke action in Coordinated Byzantine Paxos. [Figure not reproduced; its lanes show the coordinator, the new leader, and Replicas 2–3 exchanging NEW-LEADER, ACK, PRE-PREPARE, PREPARE, and ACCEPT messages.]

Note that f + α suspicions are required to revoke a server. α ≥ 1 ensures that at least one correct server suspects the current leader, and α ≤ f + 1 because up to f servers may be faulty: the protocol cannot wait for more than 2f + 1 suspicions. Choosing a large α reduces the probability of unnecessary revocation caused by network jitter. We discuss this in section 6.6.1.

Table 6.1 summarizes the messages used in Coordinated Byzantine Paxos. SUGGEST and SKIP messages are initiated by the coordinator and are in the critical execution path. So, they are signed by A2M to enable identical Byzantine failure. They are also relayed, with SU-RELAY and SK-RELAY respectively, to cope with faulty servers that omit the SUGGEST and SKIP messages to some of the servers. Relaying ensures that if one correct server receives such a message, all correct servers eventually will. This helps to bound the latency increase that irrational servers can cause (see section 6.7.4).

The ACK messages are digitally signed by the replying servers (and not by the servers' A2Ms). A digital signature is required so that a new leader can later prove that the value it proposes is indeed a valid choice. An alternative is to have the servers broadcast ACK messages authenticated by MACs. The servers then only accept the new leader's proposal if they have received a sufficient number of ACK messages that can be used to construct a progress certificate and validate the new proposal. Since revocation is usually not in the critical execution path of Coordinated Byzantine Paxos and RAM, we opt for the simpler, although more expensive, alternative of using digital signatures.

6.4.1 Revocation sub-protocol

We assume that an external leader election protocol that is at least as strong as Ω is used for the revocation sub-protocol. When the leader l of the current round r is suspected of having failed, a server p sends a ⟨SUSPECT, l, r, L⟩σp message to a newly elected leader L. L gathers f + α such SUSPECT messages to construct a revocation certificate rc. L then broadcasts ⟨NEW-LEADER, rc, r′⟩αL, for some round r′ > r. If a server p receives a valid NEW-LEADER message and is currently in a round rp that is smaller than r′, p advances to round r′ and acknowledges with an ⟨ACK, r′, vp, σvp, rp⟩σp message. vp is the value that p most recently accepted, in round rp. σvp is the signature that other servers need to validate vp. When rp = r0, σvp is the signature produced by the A2M of the coordinator. Otherwise, it is a collection of 2f + 1 signatures generated in the prepare phase of round rp, as described below. The leader L then collects 2f + 1 ACK messages to construct a progress certificate pc, based on which it picks a value v′ to propose for round r′ using the following rules:

1. If the ACK messages indicate that some server has accepted some value, then L picks the value v′ with the highest round number among these ACK messages.

2. If the ACK messages indicate that no server has accepted any value, then L picks no-op.

Once a value v′ is picked, L broadcasts a ⟨PRE-PREPARE, pc, v′, r′⟩σL message. A server p only processes a PRE-PREPARE message if (1) p is currently in a round no larger than r′; (2) v′ is the valid choice according to pc; and (3) p has not received any other valid PRE-PREPARE message for round r′. p acknowledges a valid PRE-PREPARE message by broadcasting a ⟨PREPARE, pc, v′, r′⟩σp message. A server p then collects 2f + 1 matching valid PREPARE messages for round r′ before accepting v′. It does so by broadcasting an ⟨ACCEPT, v′, r′⟩αp message. Note that if no value is chosen in this round and p later needs to advance to a higher round r′′ upon receiving a NEW-LEADER message from the leader of round r′′, p responds with an ⟨ACK, r′′, v′, σv′, r′⟩σp message, as described earlier. σv′, the signature that vouches for v′, consists of the signatures of the 2f + 1 matching PREPARE messages p received in round r′. Upon receiving the ACK message, the new leader of round r′′ can reconstruct these 2f + 1 PREPARE messages and validate the signatures.

After sending out the ACCEPT messages, server p collects matching ACCEPT messages. Once 2f + 1 matching ACCEPT messages are collected for round r′, p learns v′ and commits it. A reply is then sent to the client if necessary.
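The new leader's choice from a progress certificate can be stated compactly. The sketch below is a hypothetical helper implementing the two rules above; signature validation of each ACK is assumed to have happened already.

```python
NOOP = "no-op"

def pick_value(acks):
    """Given 2f+1 verified ACKs as (value, round) pairs, with value=None
    when a server has accepted nothing, propose the value accepted in
    the highest round, or no-op if no server has accepted anything."""
    accepted = [(rnd, val) for (val, rnd) in acks if val is not None]
    if not accepted:
        return NOOP            # rule 2: nothing could have been chosen
    _, value = max(accepted)   # rule 1: highest round number wins
    return value

# Example (f = 1): one server accepted v in round 3, the rest nothing.
assert pick_value([(None, 0), (None, 0), ("v", 3)]) == "v"
```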

6.4.2 Correctness proof

Lemma 6.1: Assuming A2Ms can be trusted and the MSD model, Coordinated Byzantine Paxos satisfies Validity (SC.2), Integrity (SC.3), and Non-triviality (SC.5).

Proof. Assuming MSD allows the clients to trust their local server, and so it is safe for the client to commit to a reply upon receiving it from its local server.

Validity (SC.2): The only case in which validity applies is when the coordinator skips, i.e., proposes no-op. In this case no-op is the only possible consensus outcome because: (1) all correct servers learn no-op upon receiving the SKIP message

from the coordinator; and (2) any subsequent revocation can only propose no-op. This is because the coordinator's A2M ensures that the coordinator cannot submit any value other than no-op, and so any subsequent progress certificate can only vouch for no-op or for no value. In either case, no-op will be proposed by the new leader.

Integrity (SC.3): This is trivially satisfied.

Non-triviality (SC.5): This can be verified as follows: (1) when the coordinator proposes a client request v, v will be the outcome if the coordinator is correct and no server suspects the coordinator; and (2) we have explained that no-op is the only possible outcome if the coordinator proposes no-op.

Lemma 6.2: Assuming A2Ms can be trusted, no two correct servers p and q learn different values in the same round r.

Proof. We prove the lemma by verifying the following two cases:

• r = r0: In round r0 a server can only accept the proposal from the coordinator. Since the A2M prevents the coordinator from proposing two values, no server can accept or learn different values.

• r ≠ r0: p and q must each receive ACCEPT messages for round r from a quorum of 2f + 1 servers. Since there are 3f + 1 servers in total, the two quorums have at least f + 1 servers in common. At least one of these f + 1 servers is correct, since at most f servers are faulty. It is impossible for p and q to accept different values in round r because a correct server does not send conflicting ACCEPT messages.

Lemma 6.3: Assuming A2Ms can be trusted, Coordinated Byzantine Paxos satisfies Agreement (SC.4).

Proof. From Lemma 6.2, we have already ruled out the possibility of two servers learning different values in the same round. So, without loss of generality, we assume two servers p and q learn distinct values u and v in rounds r1 and r2, respectively. We also assume that r1 is the smallest round in which any server learns u, that r2 is the smallest round in which any server learns v, and that r1 < r2.

If r1 = r0 and u = no-op, we have already explained that no-op is the only possible consensus outcome, so the Agreement property is satisfied.

Otherwise (r1 ≠ r0 ∨ u ≠ no-op), p must have received 2f + 1 ACCEPT messages in round r1 for value u. At least f + 1 such ACCEPT messages are from correct servers. We use S to denote these f + 1 correct servers.

Since r2 > r1 ≥ r0, q must have learned v through revocation: a server learns a value in a round larger than r0 only through revocation. This means q has received matching ACCEPT messages from 2f + 1 servers. We use T to denote these 2f + 1 servers. Since |S| = f + 1 and |T| = 2f + 1, S and T intersect in at least one correct server. So, at least one correct server accepted v in round r2. This correct server must have received matching PREPARE messages for v in round r2 from 2f + 1 servers, which we denote as T′. S and T′ also intersect in at least one correct server. So, there exists at least one correct server that accepted u in round r1 and later prepared v in round r2.

Now, let r′ (r1 < r′ ≤ r2) be the smallest round in which a correct server p′ does the preceding, i.e., p′ accepted u in round r1 and prepared v in round r′. For p′ to have done this, the leader of round r′ must have constructed a progress certificate that vouched for v. The progress certificate consists of ACK messages from a set of 2f + 1 servers, say T∗. Since the certificate vouches for v, the ACK message with the highest round number in the certificate must have the form ⟨ACK, r′, v, σv, r∗⟩σs, i.e., s, the sender of the ACK message, claims that it has accepted v in some round r∗. We argue this is impossible because:

• If r∗ < r1, then all correct servers in T∗ sent ACK messages indicating that they had not accepted any value in any round between r∗ and r′. S and T∗ intersect in at least one correct server. This correct server must not have accepted any value in round r1, since r∗ < r1 < r′ and the correct server advances to round r′ before sending back the ACK message. This contradicts the assumption that all servers in S accepted u in round r1.

• If r∗ = r1, there are two cases:

1. r1 = r0: This means the coordinator proposed u in round r0. It is impossible for s to collect σv because the trusted A2M would not have signed any value other than u.

2. r1 > r0: Since u was accepted by f + 1 correct servers in round r1, they must have received 2f + 1 matching PREPARE messages for value u. At least f + 1 of those PREPARE messages are from correct servers. This makes it impossible for s to collect 2f + 1 matching signatures to construct σv for v in round r1: it can collect at most 2f signatures, since the remaining f + 1 servers have already prepared u.

• If r∗ > r1, then 2f + 1 servers must have sent PREPARE messages for value v in round r∗, where r1 < r∗ < r′. At least one correct server in S is among them; that server accepted u in round r1 and prepared v in round r∗ < r′, which contradicts the assumption that r′ is the smallest such round.

So, we conclude that Coordinated Byzantine Paxos satisfies the Agreement property.

Lemma 6.4: Assuming that a faulty coordinator is eventually suspected and that a failure detector at least as strong as Ω is used to select the new leader for revocation, Coordinated Byzantine Paxos satisfies Termination (SC.1).

Proof. If the revocation process is not invoked, then the coordinator must be correct. In this case, Coordinated Byzantine Paxos eventually terminates once the coordinator sends out its proposal for round r0. This is true because every step of the protocol only needs responses from 2f + 1 replicas, which a correct server eventually receives because at most f servers are faulty and the SUGGEST and SKIP messages are relayed.

If the revocation process is invoked, eventually exactly one server will lead the revocation process in some round r′. The leader can eventually collect the progress certificate for round r′ because it eventually receives the ACK messages from the at least 2f + 1 correct servers. After that, every correct server eventually receives the PRE-PREPARE message and 2f + 1 matching PREPARE messages for round r′. This generates 2f + 1 matching ACCEPT messages for round r′, because no other server will be elected to start a higher round since we assume Ω is used to elect the leader. Eventually, every correct server receives these 2f + 1 matching ACCEPT messages, learns the simple consensus outcome, and terminates consensus.

Theorem 6.5: Assuming (1) A2Ms can be trusted; (2) the MSD model; (3) a faulty coordinator is eventually suspected; and (4) a failure detector that is at least as strong as Ω is used to select the new leader for revocation, Coordinated Byzantine Paxos implements simple consensus.

Proof. From Lemmas 6.1, 6.3, and 6.4.

6.4.3 Performance metrics

It is not difficult to see that the quorum size for Coordinated Byzantine Paxos is 2f + 1. The message complexity of suggest is 2n² − 3n + 2: one REQUEST message from the client to the coordinator, n − 1 SUGGEST messages, (n − 1)(n − 2) SU-RELAY messages, and n(n − 1) ACCEPT messages. This total counts neither the n − 1 optional REPLY messages from the followers to the coordinator nor the O-REPLY message from the coordinator to the client. The message complexity of the skip action is (n − 1)²: n − 1 SKIP messages and (n − 1)(n − 2) SK-RELAY messages. The message complexity of revocation is higher than that of suggest, and depends on whether concurrent leaders were elected for revocation. Since the revocation cost is amortized in the long run (see section 6.5.3), we do not analyze the message complexity of revocation further.

During failure-free runs, it takes two wide-area communication steps to learn a suggestion and one step to learn a skip. If the coordinator is faulty and omits SUGGEST or SKIP messages to some of the servers, then up to one additional step may be needed for all the servers to learn the outcome: relaying guarantees that all correct servers receive the SUGGEST or SKIP messages two steps after the coordinator sends them to one correct server. Revocation takes at least five steps: this is the case when no concurrent leaders are elected for the revocation process.
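As a consistency check on these counts, consider a four-site deployment (n = 4, so f = 1):

    suggest: 1 + (n − 1) + (n − 1)(n − 2) + n(n − 1) = 1 + 3 + 6 + 12 = 22 = 2n² − 3n + 2
    skip:    (n − 1) + (n − 1)(n − 2) = 3 + 6 = 9 = (n − 1)²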

6.5 Deriving RAM

We have explained Coordinated Byzantine Paxos, a simple consensus protocol that assumes A2M and the MSD model. We now apply Coordinated Byzantine Paxos to the generic rotating leader protocol R to obtain RAM. RAM uses the rotating leader scheme defined in Definition 4.2 and uses Coordinated Byzantine Paxos to solve each instance of simple consensus. The execution of RAM is defined by a set of rules and optimizations. As we did with Mencius, we first introduce the set of rules for building a simple yet inefficient protocol called Q. We then discuss the set of optimizations for obtaining a more efficient protocol. We call this final protocol RAM. Many of the rules and optimizations are similar to those of R. We restate these rules and optimizations here for completeness and point out the differences.

6.5.1 Revocation in RAM

Revocation is part of the core mechanisms of the rotating-leader design. Before discussing the detailed rules for RAM, we explain some of the design decisions we made for the revocation process.

With the rotating leader scheme, faulty servers are revoked to allow correct servers to make progress, i.e., to ensure liveness. Such a process involves the use of a failure detector. We already know that the weakest failure detector for implementing consensus is Ω, which provides a form of eventual accuracy. This is also true for ♦W, a class of failure detectors that has been shown to be equivalent to Ω: ♦W provides eventual weak accuracy. While it is not difficult to implement such classes of failure detectors, unavoidable false suspicions come with any practical implementation: it is impossible in an asynchronous system to determine whether an unresponsive server has failed or is just slow. Any rotating leader protocol, e.g., Mencius or RAM, needs to handle false suspicion properly to be efficient.

When a false suspicion happens with Mencius, a server can resume action by setting its index to be greater than the last instance that was revoked. A falsely suspected server makes other servers update their indexes and skip their unused turns (see Rule M.2). This quickly synchronizes the indexes of the servers and allows all correct servers to make progress. Mencius has no time period associated with revocation because a revoked server has either crashed, or has been falsely suspected and so should leave revocation immediately.

The strategy of Mencius, however, does not work with RAM, which considers Byzantine failures: a Byzantine faulty server could abuse it to nullify the effects of revocation. This is reflected in the additional conditions of Rules R.1 – R.4 (see the next section) that prevent a suspected server from immediately getting out of being revoked, regardless of whether the suspicion is a false positive.

Because RAM, by design, does not allow a falsely suspected server to end revocation immediately, revocation in RAM has a time period associated with it, i.e., it is set to expire unless otherwise renewed. This allows a falsely suspected server to eventually resume action. We further discuss this topic in section 6.6.1. Even though it cannot suggest requests on behalf of its clients, a falsely suspected server can either wait out the revocation period or forward its requests to other servers, expecting that a correct server will propose them on its behalf.

6.5.2 Protocol Q

Rule R.1: Each server p maintains its next available simple consensus instance number in a variable Ip, which we call p’s index. Upon receiving a new client request r, if server p is not currently under revocation, p proposes r to Ip and updates Ip by assigning it the sequence number of the next instance it coordinates.

Rule R.2: When a server p receives a SUGGEST message from server q for an instance i greater than Ip, if q is not currently being revoked, i.e., if p does not know that q is being revoked, p skips its turns for the instances between Ip and i by proposing no-ops and updates

Ip accordingly; otherwise p ignores the SUGGEST message.

Rule R.3: Let q be a server that f + α servers suspect has failed. A non-faulty server p is elected to revoke q. Let Cq be the smallest instance that is coordinated by q and not learned by p. p revokes q for all instances in the range [Cq, Ip] that q coordinates.

Rule R.4: When a server p suggests a value v ≠ no-op to instance i but learns no-op for instance i, p suggests v again, provided p is not currently being revoked.
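The following sketch illustrates how Rules R.1 and R.2 maintain the index Ip. It is a schematic illustration, not the dissertation's pseudocode; it hypothetically assumes a round-robin assignment in which server p coordinates every instance i with i mod n = p, and suggest() and skip() stand in for the Coordinated Byzantine Paxos actions.

```python
class Server:
    def __init__(self, server_id, n):
        self.id = server_id
        self.n = n
        self.index = server_id        # I_p: next instance p coordinates
        self.under_revocation = False
        self.revoked_peers = set()

    def on_client_request(self, request, suggest):
        """Rule R.1: propose to I_p unless p is currently revoked."""
        if self.under_revocation:
            return                    # Rule R.5 would forward instead
        suggest(self.index, request)
        self.index += self.n          # next turn that p coordinates

    def on_peer_suggest(self, peer_id, instance, skip):
        """Rule R.2: skip p's unused turns up to a non-revoked peer's
        SUGGEST; ignore SUGGEST messages from revoked peers."""
        if peer_id in self.revoked_peers:
            return
        while self.index < instance:
            skip(self.index)          # propose no-op for p's own turn
            self.index += self.n
```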

It is straightforward to see that Rules R.1 – R.4 are similar to Rules RL.1 – RL.4 of R. Indeed, Rules R.1 and R.4 are the same as Rules RL.1 and RL.4, except that the former exclude server p from action when p is currently being revoked. Rule R.2 is the same as Rule RL.2, except that Rule R.2 requires that a server p only skip its turns when the SUGGEST message is from a server that is not currently being revoked; Rule R.3 is the same as Rule RL.3, except that Rule R.3 requires f + α suspicions for revocation.

We have already explained the additional condition of Rule R.3. The additional conditions of Rules R.1 and R.4 are in place to prevent a falsely suspected correct server from ignoring a revocation against itself. This matters because our protocol, by design, punishes servers that ignore revocation: following R.1 and R.4 helps prevent unnecessary penalties for falsely suspected servers. We discuss this further in section 6.7.3. The additional condition of Rule R.2 is in place to prevent a revoked server from easily getting out of revocation.

Lemma 6.6: Q with Rules R.1 – R.4 is always safe.

Proof. From Lemma 4.6 we know that Rules R.1 – R.4 without the aforementioned additional conditions are safe. The additional conditions merely restrict the actions that Q can take; in other words, the possible behavior of Q is more restricted and is a subset of the possible behavior without such conditions. So, Q is safe with such conditions.

Lemma 6.7: Assuming ♦P, Q with Rules R.1 – R.4 is live.

Proof. From Lemma 4.8 we know that Rules R.1 – R.4 without the aforementioned additional conditions are live. ♦P guarantees that eventually all and only the crashed servers are suspected. We prove the liveness of Q by arguing that the additional conditions eventually do not apply when assuming ♦P: the eventual execution of the actions in Rules R.1 – R.4 guarantees the liveness of the protocol.

Rules R.1 and R.4: The additional conditions exclude a revoked server p from suggesting or re-suggesting requests. If p is faulty, then this does not matter, since only requests sent to correct servers are required to be eventually committed. If p is correct, then false suspicions of p eventually cease, after which time p can take the actions of Rules R.1 and R.4 if necessary.

Rule R.2: The additional condition has server p ignore the SUGGEST messages from a revoked server q. If q is faulty, then this does not matter, since only requests sent to correct servers are required to be eventually committed. If q is correct, then false suspicions of q eventually cease, at which time q can take the action of Rule R.4 to re-suggest the request if necessary.

Rule R.3: The additional condition makes revocation less likely to trigger because of the additional suspicions required. However, since we assume ♦P, any faulty server will eventually be suspected by all correct servers, of which there are at least 2f + 1. Since α ≤ f + 1, a faulty server is eventually suspected by a sufficient number of servers, and hence revoked. Since the protocol only relies on the eventual revocation of faulty servers to provide liveness, the additional condition of Rule R.3 does not affect liveness.

Theorem 6.8: Assuming ♦P, Q with Rules R.1 – R.4 implements replicated state machines.

Proof. From Lemma 6.6 and Lemma 6.7.

We argue that it is not difficult to implement ♦P for Q. ♦P is only required for liveness, which in turn only requires that requests submitted to correct servers be eventually committed. With Q, each correct server is responsible for suggesting its own requests. A request will eventually be learned in whichever simple consensus instance it is submitted to, as long as the server is not falsely suspected; this condition eventually holds because we assume ♦P. So, a faulty server can only prevent learned requests from committing by creating gaps in the consensus sequence. This can be done in two ways: by not skipping turns or by not suggesting requests. A faulty server can only do this by sending the SKIP or SUGGEST message to none of the correct servers, because the relay mechanism ensures that if one correct server receives such a message, all correct servers will. This makes it easy to detect such faulty behavior using timeouts.

Eventual accuracy can be obtained by exponentially increasing timeouts. To implement ♦P, small timeout values can be used initially. The values can be subsequently increased

(exponentially) if they prove to be too short and generate false positives. Doing so helps to provide eventual accuracy. Implementing ♦P in this way, however, may lead to large timeout values. A faulty server can take advantage of large timeout values to temporarily delay the commitment of learned requests. Though this does not violate liveness, it makes it possible for malicious servers to mount a performance attack [6] that can result in a significant increase in request latency. So, we use stricter criteria for detecting failures in our protocol. This may lead to correct but slow servers being permanently falsely suspected. To compensate, we allow a falsely suspected server to forward its requests to another server, much like Rule RL.5 of protocol R.

Rule R.5: If server p is being revoked, upon receiving a request v from its client or upon learning no-op for an instance to which it has previously proposed a client request v, instead of suggesting or re-suggesting v, p forwards v to another server r that p believes is correct. When receiving v, r treats it the same as a client request.

Rule R.5 implies that when the forwarder is later suspected to be faulty, a server may re-forward requests if needed.

Theorem 6.9: Assuming Ω, Q with Rules R.1 – R.5 implements replicated state machines.

Proof. We already know from Lemma 4.6 that Q always satisfies R.2 (Agreement), R.3 (Integrity), and R.4 (Total Order). So, we only need to prove R.1 (Validity). From the proof of Lemma 6.7, we know that only the requests sent to falsely suspected servers are not live. Rule R.5 allows a falsely suspected server p to forward requests to another server r, and r will suggest them. Since we assume Ω, p can eventually find a correct server r to which to forward requests. Since any request suggested by a correct server is guaranteed to commit eventually, any request submitted to p is also guaranteed to be live. So, Q with Rules R.1 – R.5 satisfies Validity and hence implements replicated state machines correctly.

6.5.3 Optimizations

The message complexity of the basic protocol is high due to the broadcast of SKIP, ACCEPT, SU-RELAY, and SK-RELAY messages. The ACCEPT messages are broadcast by design; the other three classes of messages can be piggybacked on other messages to reduce message complexity.

Recall that Mencius reduces message complexity by piggybacking SKIP messages on subsequent ACCEPT and future SUGGEST messages. Mencius must rely on future SUGGEST messages because it does not broadcast ACCEPT messages, due to its goal of improving throughput. RAM, with its focus on reducing latency, broadcasts ACCEPT messages. Doing so is useful not only for dealing with faulty servers that omit messages to a fraction of the servers (see section 6.4), but also for making piggybacking easier: SKIP and SU-RELAY messages are always generated at the same time as ACCEPT messages.

Optimization R.1: SKIP and SU-RELAY are always piggybacked on ACCEPT messages.

Note that Mencius also overloads the semantics of ACCEPT and SUGGEST messages to further reduce the number of fields required in the piggybacked messages. Because of the crash-failure semantics, all fields of the SKIP message can be inferred from the overloaded ACCEPT or SUGGEST messages, and so no additional field is added to the piggybacked messages. This technique is, however, not effective with Optimization R.1. This is due to the use of digital signatures under the Byzantine failure model: the signature fields must be present in the piggybacked messages. SK-RELAY messages can also be piggybacked or combined using the following optimization:

Optimization R.2: SK-RELAY messages are buffered to be piggybacked on other messages. A protocol parameter ζ specifies the maximum number of seconds an SK-RELAY message can be buffered. Once ζ seconds have elapsed, all outstanding SK-RELAY messages are sent out immediately.

RAM can afford the extra latency introduced by Optimization R.2 because SK-RELAY messages are not time critical. For example, a simple way to select ζ is to set it to 1/10 of the estimated one-way delay. In our evaluation environment, the one-way delay is

50 ms, and so ζ is set to 5 ms, resulting in at most 200 combined SK-RELAY messages being sent out every second. This helps to reduce the message complexity at the cost of relatively little extra latency.
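A minimal sketch of Optimization R.2's buffering discipline follows; the `send` callback and the flush policy are illustrative assumptions, not the prototype's implementation.

```python
import time

class SkRelayBuffer:
    """Buffer SK-RELAY messages and flush them, combined, either when
    another message offers a piggybacking opportunity or at most zeta
    seconds after the oldest buffered SK-RELAY arrived."""

    def __init__(self, zeta, send):
        self.zeta = zeta       # e.g., 0.005 s for a 50 ms one-way delay
        self.send = send       # assumed callback: send(sk_relays, carrier)
        self.pending = []
        self.oldest = None

    def add(self, sk_relay):
        if not self.pending:
            self.oldest = time.monotonic()
        self.pending.append(sk_relay)

    def maybe_flush(self, carrier_message=None):
        """Called whenever another message is sent (piggybacking) and
        from a periodic timer (the zeta deadline)."""
        if not self.pending:
            return
        if carrier_message is not None or \
           time.monotonic() - self.oldest >= self.zeta:
            self.send(list(self.pending), carrier_message)
            self.pending.clear()
```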

Lemma 6.10: Protocol Q with Optimizations R.1 and R.2 implements replicated state machines correctly.

Proof. Optimization R.1 merely combines multiple messages that Q would have sent separately into one message. Optimization R.2 delays the SK-RELAY messages for a bounded time. Doing this clearly does not violate any of the safety properties. It does not violate any of the liveness properties either, because Q only relies on the eventual delivery of messages for liveness: at most ζ seconds are added to the message latency.

Like R, Q can also reduce the overhead of revocation by revoking suspected servers in large blocks. This results in the following optimization:

Optimization R.3: Let q be a server that f + α servers suspect has failed, let p be a non-faulty server that is elected to revoke q, and let Cq be the smallest instance that is coordinated by q and not learned by p. For some constant β, p revokes q for all instances in the range [Cq, Ip + 2β] that q coordinates if Cq < Ip + β.
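Optimization R.3 reduces to a small computation. The hypothetical helper below returns the block of q's instances to revoke, or nothing if Cq has not yet fallen behind the β threshold.

```python
def revocation_range(c_q, i_p, beta):
    """Per Optimization R.3: revoke q's instances in [C_q, I_p + 2*beta],
    but only once C_q has fallen behind the threshold I_p + beta."""
    if c_q < i_p + beta:
        return (c_q, i_p + 2 * beta)
    return None  # no revocation needed yet
```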

Corollary 6.11: Protocol Q with Optimization R.3 implements replicated state machines correctly.

Proof. From Lemma 4.13.

Definition 6.3 (RAM): RAM is Q (Rules R.1 – R.5) combined with Optimizations R.1 – R.3.

Theorem 6.12: RAM implements replicated state machines correctly.

Proof. From Lemma 6.10 and Corollary 6.11.

6.6 Discourage uncivil rational behavior

We have explained the basic structure of RAM. In this section, we discuss how RAM discourages uncivil rational behavior. The root cause of rational behavior is faulty and/or selfish administrators who exploit the protocol to improve local utility, possibly at the cost of the utility of remote sites. Such behavior can be discouraged by reinforcing the protocol so that any divergence from the correct behavior only increases the chance of lowering local utility. The core mechanism by which RAM accomplishes this is revocation.

With the rotating leader design of RAM, all servers have the opportunity to make requests directly for their clients. Revocation means that being a coordinator is a privilege rather than a right. Since RAM, by design, prevents a server from bypassing revocation, a revoked server can only forward its clients' requests to other servers. Forwarding has two obvious drawbacks: it adds to the latency of client requests, and it requires the revoked server to find a correct server to be the forwarder. Thus, it is in the best interest of any rational server to avoid being revoked. In fact, in some cases it would make sense to sacrifice a bit of latency on local requests to avoid the higher latency that revocation would induce. Thus, revocation constitutes a fundamental tool in RAM for discouraging uncivil rational behavior. The rest of the section explains RAM's revocation policy as well as the implications of using revocation.

6.6.1 Failure detection under network jitter

RAM is designed to revoke servers that are suspected to be faulty. Any practical implementation of a failure detector is likely to produce false suspicions due to network jitter. But a falsely suspected (and therefore revoked) server cannot recover quickly. This means – not surprisingly – that, in addition to eventual accuracy, the underlying failure detector should avoid false positives. A simple approach is to have the failure detector report one of three possible values: correct, suspicious, or faulty.

Correct: A server is detected as correct if it is timely and no misbehavior is detected.

Faulty: A server is faulty if definite misbehavior is detected, e.g., proposing inconsistent requests for the same consensus instance because of a faulty A2M. Upon detecting faulty behavior, a correct server raises a verifiable suspicion (discussed in section 6.6.4) for this class of failures, and a faulty server is repeatedly revoked until it has been repaired via human intervention.

Suspicious: Finally, a server is considered suspicious if it is not acting fairly or in a timely fashion. Examples of such behaviors are failing to batch when it has a high client load, failing to accept suggestions from other servers in time, or failing to skip its turns in time. Such conduct is classified as suspicious because, in practice, it is difficult to tell whether it is due to environmental issues like network jitter, or whether the actions are deliberate. Upon detecting such behavior, a correct server raises an unverifiable suspicion (see section 6.6.3).

The protocol has a parameter α such that f + α unverifiable suspicions are required to start revocation. Ideally, a practical failure detection implementation has a low false-positive rate resulting from network jitter. A larger value of α makes revocation triggered by network jitter less likely: up to α − 1 correct servers can experience untimely behavior without triggering revocation, even in the presence of f Byzantine servers. As a consequence, a rational server should avoid being unnecessarily suspected, since it would then take fewer servers noting jitter for it to be revoked.

Since revocation is used as a punishment for uncivil behavior, the time period associated with revocations triggered by unverifiable suspicions should, at least initially, be small because of occasional false suspicions arising from bad environmental conditions. If a server continues to be suspected frequently enough, then the time period can be increased.

6.6.2 Turnaround time advertisement

At the core of RAM's failure detection is a turnaround time advertisement mechanism. A similar mechanism has been used by others to select a fast leader for BFT protocols [6]. Under this mechanism, a server p is required to periodically advertise a turnaround time Lp→q between itself and each other server q. This value is an upper bound on the amount of time q will need to wait to get a response from p concerning a message that requires an immediate response (such as a SUGGEST message). Missing this deadline may cause q to suspect p.

To give this kind of advertisement, p needs to predict round-trip latency. Predicting round-trip latency in wide-area networks has been an active research topic in the

Grid community, and promising results have been obtained [12, 45, 59]. Once signed and advertised, these promises are collected and fed into each server's failure detector module as dynamic input to produce a list of up to f suspected servers. As we discuss later, it is in the best interest of any rational server to provide a promise that is as accurate as possible.
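A sketch of how a server might turn these promises into suspicions follows. The names are hypothetical; promise signature verification and the f + α aggregation of section 6.6.3 are elided.

```python
import time

class PromiseBasedDetector:
    """Record each advertised promise L_{p->q}, timestamp messages that
    require an immediate response, and raise an (unverifiable) suspicion
    when p's response misses p's own promised deadline. suspect() is an
    assumed callback into the revocation machinery."""

    def __init__(self, suspect):
        self.promises = {}     # p -> latest promise L_{p->q}, in seconds
        self.outstanding = {}  # (p, msg_id) -> send time
        self.suspect = suspect

    def on_promise(self, p, turnaround_seconds):
        self.promises[p] = turnaround_seconds

    def on_send(self, p, msg_id):
        self.outstanding[(p, msg_id)] = time.monotonic()

    def on_response(self, p, msg_id):
        self.outstanding.pop((p, msg_id), None)

    def check_deadlines(self):
        """Run periodically."""
        now = time.monotonic()
        for (p, msg_id), sent in list(self.outstanding.items()):
            promise = self.promises.get(p)
            if promise is not None and now - sent > promise:
                self.suspect(p)          # untimely behavior (section 6.6.3)
                del self.outstanding[(p, msg_id)]
```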

6.6.3 Unverifiable suspicion

Suspicions in RAM can be roughly categorized into two classes: verifiable and unverifiable. A server p raises an unverifiable suspicion against server q based on information that is collected locally and cannot be verified globally. For example, p may suspect q because q is slow, but p cannot prove this to a third party; perhaps q is timely and p is attacking q deliberately. In the following, we give a list of conditions that lead to an unverifiable suspicion. This list covers some important and generic conditions for unverifiable suspicions, but it is not meant to be comprehensive. Specific environments or system performance requirements will introduce other conditions as well.

Untimely behavior

When a server p sends a message (such as SUGGEST) to q that requires q's immediate response, p raises an unverifiable suspicion against q if q's response time does not meet q's latency promise Lq→p. This revocation condition gives a server incentive not to advertise latency promises that it cannot deliver, as well as to reply to other servers as soon as possible to avoid untimely behavior caused by unexpectedly large network jitter.

We have discussed that network jitter makes it difficult to distinguish a deliberately slow server from a server whose messages happen to experience jitter. It is interesting to note that by using the turnaround time and revocation mechanisms, network jitter is turned from something an uncivil rational server can take advantage of into something that encourages a rational server to act civilly.

Flooding

One type of uncivil behavior is for a server to flood the system with local requests when it is experiencing a high load from its clients. This behavior can potentially improve throughput and latency for its local clients at the risk of overloading the system and reducing the overall utility of the system as a whole. To discourage such behavior, a server is required to either batch or traffic-shape its local requests so that it does not exceed a predefined throughput threshold. Failing to comply results in suspicion and hence revocation.

A simple strategy for selecting the throughput threshold is to set it to 80% of the throughput a server can handle. For example, each server in our prototype implementation can handle approximately 200 ops (operations per second) (see section 6.8.4), largely independent of the request size. In a system with a total of four servers, setting the threshold to 160 ops would allow a total of 640 ops before batching kicks in. The extra latency caused by batching is no larger than 1/160 second, or 6.25 ms, which is quite small compared to the 100 ms round-trip latency in our experiments.
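The batching discipline can be sketched as a simple traffic shaper. This is a hypothetical illustration of the 160-ops example above, not the prototype's batching code.

```python
import time

class RequestShaper:
    """Release local requests at no more than `threshold_ops` suggestions
    per second; requests arriving faster are batched into the next slot.
    With threshold_ops = 160, a request waits at most 1/160 s = 6.25 ms.
    A periodic timer to flush a lingering batch is elided for brevity."""

    def __init__(self, threshold_ops):
        self.interval = 1.0 / threshold_ops  # seconds between suggestions
        self.next_slot = 0.0
        self.batch = []

    def submit(self, request, suggest_batch):
        now = time.monotonic()
        self.batch.append(request)
        if now >= self.next_slot:
            suggest_batch(self.batch)        # one suggestion, many requests
            self.batch = []
            self.next_slot = now + self.interval
```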

Forwarding omission

When a server p is under revocation, it cannot directly propose requests; it has to forward its local requests to another server q. To prevent p from overloading q, RAM allows q to buffer such forwarded requests for a short period of time ρ to seek opportunities to batch requests. An uncivil server may pretend it has not received the forwarded messages and delay proposing them for a long period of time. To mitigate this behavior, p signs and broadcasts the request to all servers with a designation that q is to propose it. Upon receiving the request from p, a third party r may also relay the request to q, expecting to receive a SUGGEST from q within ρ + Lq→r. r suspects q if this expectation is not met. A server r decides whether to do this based on a random variable, set so that the expected number of servers forwarding the request is larger than f. While this mechanism does not completely prevent q from sitting on forwarded requests, it does put an upper bound on how long it can do so.

Note that ρ should be set to no less than the inverse of a server's throughput threshold. For example, if the threshold is set to 160 ops, ρ should be at least 6.25 ms to make sure p cannot force q to exceed the throughput threshold. Finally, p is free to designate any server to propose its requests; it can therefore try multiple servers and settle on the one that gives the best response time, making forwarding omission less of a threat.

Handling forwarded requests requires a server to do extra work: (1) the server needs to buffer requests and handle batching; and (2) its A2M needs to sign the batched, and hence larger, requests. One potential problem is that handling forwarded requests can hurt the ability of a server to handle requests from its own clients. The cost of buffering and batching is relatively small and negligible compared to the A2M signing cost, and so is not an issue. From the evaluation results in section 6.8.4, we can see that the number of requests the A2M thread can sign is roughly the same regardless of the size of the requests. So, handling larger requests would not significantly hurt the ability of a server to handle requests from its own clients.

Reciprocal suspicion

In section 6.6.4 we define a policy that determines which servers should be suspected because they are slow. This policy suspects no more than f servers. Because of this, an uncivil rational server may attempt to reduce the chance of itself being suspected by increasing the chances of other servers being suspected. A rational server can do this by simply suspecting other servers for no valid reason, in hopes that network jitter will result in enough suspicions. To discourage this behavior, we allow a server to raise a reciprocal suspicion. For example, q can raise a reciprocal suspicion against p should p suspect q of being untimely when q has been timely and has not recently experienced network jitter between p and q. The rationale behind a reciprocal suspicion is that suspecting a correct server is itself suspicious behavior, since there is nothing preventing a faulty server from suspecting any server. Suspicion retaliation gives rational servers an incentive not to abuse suspicions, since suspicions come at a cost.

6.6.4 Verifiable suspicion

Verifiable suspicions are those that can be verified by any server. A correct server q becomes suspicious of r upon receiving a verifiable suspicion against r, and so one verifiable suspicion from server p against server r is sufficient to initiate revocation against r.

Most verifiable suspicions are raised as a result of definite Byzantine behavior. For example, if p's A2M is compromised, p may tell server q that it has suggested v ≠ no-op for instance i and tell server r that it has skipped instance i. By exchanging SU-RELAY and SK-RELAY messages, q and r can easily detect this failure and raise a verifiable suspicion. Verifiable suspicions can also be raised when a server violates the definition of simple consensus, i.e., when a follower uses its A2M to suggest v ≠ no-op. The rest of the section discusses another class of verifiable suspicions and describes how they are used to discourage a server from advertising inaccurate turnaround times.

We noted in section 6.6 that it is in the best interest of a rational server to advertise a turnaround time that it can deliver. Without further restriction, this gives rational servers an incentive to advertise promises that are higher than what they can actually deliver; doing so makes it easier to meet their promises and avoid being revoked. The drawback is that uncivil rational servers could use this additional slack time to prioritize their own requests without risking being suspected and revoked. RAM uses the following mechanism to discourage this behavior by increasing a server's probability of being revoked if it reports an abnormally large turnaround time.

The combined turnaround promises can be represented as a graph with servers as vertices and promises as edges. Each server r collects the advertised turnaround times between each pair of servers, i.e., Lp→q and Lq→p are collected for each pair of servers p and q. Let Lp↔q = max{Lp→q, Lq→p}. r constructs a fully connected graph G of the servers using the values of Lp↔q. Note that the larger of Lp→q and Lq→p is used to construct G because, were the lower of the two used, it would allow p to provide a high turnaround time promise without it being observed in G, as long as q has a lower promise.

Let Π be the set of all servers. We define a utility function U(G, S, p) that, given G and a set S (S ⊆ Π) of revoked servers, estimates the utility of server p. Then a deterministic policy P is used to decide whether a server q is too slow, i.e., whether revoking q would result in a better global utility. If so, then server q is suspected and revoked. One requirement on U(G, S, p) and P is that increasing Lp↔q increases the chance of p and q being considered slow; a server that reports a large turnaround promise is consequently more likely to be revoked. The exact utility function U(G, S, p) and policy P to be used depend on the specific system and should try to optimize the global utility function of the system. While the exact details of selecting U and P are outside the scope of this dissertation, we present the following examples:

Definition 6.4 (Utility function U1): We define U1(G, S, p) to estimate the expected latency of server p. Note that the higher U1(G, S, p) is, the lower the utility server p is expected to have.

U1(G, S, p) =
    max{Lp↔q : q ∉ S}                                                  if p ∉ S
    min{(1/2)·Lp↔q + (1/2)·max{Lq↔r + Lr↔p : r ∉ S} : q ∉ S}           if p ∈ S

We used the above definition for the following reasons:

• When p is not revoked, the expected latency of p is the highest expected round-trip latency between p and any server q that is not revoked. This is because p needs to receive an acknowledgment from all the servers that are not currently being revoked to commit its proposal.

• When p is revoked, p uses a server q that is not currently being revoked to forward its requests. The expected latency of p via q consists of three parts: (1) the one-way delay from p to q for p to forward the request to q; (2) the one-way delay from q to all non-revoked servers for q to propose the request; and (3) the one-way delay from all non-revoked servers to p for sending the ACCEPT and SKIP messages. After enough ACCEPT and SKIP messages are collected, p can execute the request and send a reply to the client via local communication. Assuming link latencies are symmetric, the first part is estimated as (1/2)Lp↔q. The second and third parts of the latency need to be considered together: they are the highest one-way latency from q to p via a non-revoked server r, since p needs to know that all non-revoked servers have skipped before committing the request. So, they are estimated as (1/2)max{Lq↔r + Lr↔p : r ∉ S}. We then estimate the expected latency of p as the lowest latency possible using any of the servers that is not being revoked (see footnote below).

Footnote: By using the lowest latency possible, we do not consider faulty servers. This gives a lower bound on the utility of server p. If desired, one can instead estimate the utility of a server by using the (f + 1)th smallest latency, which guarantees that a correct server is used for forwarding; doing so gives an upper bound on the utility of server p.

We have defined U1(G, S, p) to estimate the expected latency of server p when the servers in S are revoked. Given G, we now define a policy P1 that uses U1 to decide whether a set of servers S is too slow, i.e., whether revoking the servers in S would result in better utility for the system as a whole. First, we introduce two supporting definitions:

Definition 6.5 (Expected bottom-line latency): Given a set of non-revoked servers T (i.e., T ∩ S = φ), we define the expected bottom-line latency of T as U1′(G, S, T) = max{U1(G, S, p) : p ∈ T}. This is the highest expected latency of all servers in T.

Definition 6.6 (Valid revocation set): We say a set of servers S is a valid revocation set under policy P1 iff |S| ≤ f ∧ (3/2)·U1′(G, S, Π \ S) ≤ U1′(G, φ, Π \ S), i.e., a set of no more than f servers is considered a valid revocation set if revoking the servers in S would reduce the bottom-line latency of the rest of the servers by at least 1/3.

The threshold 1/3 is somewhat arbitrary. The goal is to avoid making revocation too aggressive: servers are only revoked if they are considered too slow and revoking them would benefit the rest of the servers enough to outweigh the increased latency that would occur for the servers in S.

Definition 6.7 (Policy P1): Policy P1 uses Algorithm A1 to compute a minimal-sized valid revocation set that would result in the best possible bottom-line latency for the servers that are not revoked. A1 returns either φ or a collection of server sets, all of the same size. With P1, no server is revoked if A1 returns φ; all servers in the set are revoked if A1 returns a single set of servers; when A1 returns multiple sets, one set is chosen deterministically and those servers are revoked.

It is straightforward to see that U1 and P1 meet the requirement that increasing Lp↔q increases the chance of p and q being considered too slow, i.e., p or q could be revoked if all the other servers advertise low latencies among themselves.

Therefore, it is in the best interest of a rational server p to advertise Lp→q as low as possible – advertising an unnecessarily high Lp→q results in an unnecessarily high Lp↔q, which in turn unnecessarily increases the chance of p being revoked.

procedure A1(G)
    SS ← {S : S is a valid revocation set in G}
    if SS = φ then
        return φ
    end if
    m ← min{U1′(G, S, Π \ S) : S ∈ SS}
    SM ← {S : U1′(G, S, Π \ S) = m ∧ S ∈ SS}
    ms ← min{|S| : S ∈ SM}
    return {S : |S| = ms ∧ S ∈ SM}
end procedure

Figure 6.4: Algorithm A1
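For concreteness, here is a brute-force sketch of U1, the bottom-line latency U1′, and Algorithm A1. It follows the reconstruction of U1 above, represents G as a map from unordered server pairs to Lp↔q, and is exponential in n, as the text notes below; all names are illustrative.

```python
from itertools import combinations

def L(G, x, y):
    """L_{x<->y} from the promise graph; zero for a server and itself."""
    return 0.0 if x == y else G[frozenset({x, y})]

def u1(G, S, p, servers):
    """Definition 6.4: expected latency of p when the servers in S are revoked."""
    live = [q for q in servers if q not in S and q != p]
    if p not in S:
        return max(L(G, p, q) for q in live)
    # p is revoked: forward via the best non-revoked server q.
    return min(0.5 * L(G, p, q)
               + 0.5 * max(L(G, q, r) + L(G, r, p) for r in live)
               for q in live)

def bottom_line(G, S, servers):
    """Definition 6.5: U1'(G, S, servers minus S)."""
    return max(u1(G, S, p, servers) for p in servers - S)

def a1(G, servers, f):
    """Algorithm A1: minimal-sized valid revocation sets (Definition 6.6)
    achieving the best bottom-line latency for the non-revoked servers."""
    baseline = bottom_line(G, set(), servers)
    valid = [set(S) for k in range(1, f + 1)
             for S in combinations(sorted(servers), k)
             if 1.5 * bottom_line(G, set(S), servers) <= baseline]
    if not valid:
        return []                        # corresponds to returning phi
    m = min(bottom_line(G, S, servers) for S in valid)
    best = [S for S in valid if bottom_line(G, S, servers) == m]
    ms = min(len(S) for S in best)
    return [S for S in best if len(S) == ms]
```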

We have already discussed that p has an incentive not to advertise Lp→q lower than what it can deliver. Combining the two results, a rational server p has an incentive to advertise

Lp→q as accurately as possible.

We introduced U1 and P1 here to demonstrate how to discourage unnecessarily high latency-promise advertisements. It is likely, however, that a more sophisticated policy should be considered for practical deployment. Also, the complexity of the algorithm

A1 used by P1 is exponential in n, the total number of servers. While we do not expect the complexity to be an issue for small n, should it become one, an approximation algorithm can be used instead. For example, we can start with an empty revocation set and add the slowest server to the revocation set until adding such a server would result in an invalid revocation set under P1 or the size of the revocation set reaches f.

We also report a latency upper bound obtained using U1 and P1 in section 6.7.4, and we evaluate the latency of the servers under U1 and P1 in section 6.8.5.

6.7 Combating irrational behavior

In the previous section, we discussed how to use incentives to discourage uncivil rational behavior. Irrational behavior is less restricted – a compromised server may become untrustworthy to its local clients or cooperate with other compromised servers in an attempt to mislead other servers into entering an unsafe state. Although it is impossible to implement a perfect failure detector for Byzantine failures (or even for crash failures [17]), we wish to expose as many Byzantine behaviors as possible so that correct servers can exclude faulty servers. This section discusses the Byzantine failures that attack system assumptions: trustworthy A2Ms and the MSD model (clients trust the local server). Other Byzantine failures pose no risk to the safety of the system assuming that no more than f servers are faulty, because Coordinated Byzantine Paxos is derived from PBFT, which masks up to f Byzantine failures. Detecting and removing Byzantine servers, however, is a good idea – doing so is a well-known technique for improving fault coverage. When detection of faulty behavior is impossible, RAM resorts to damage-control methods that prevent a significant drop in system utility: RAM imposes an upper bound on the extra latency Byzantine servers can cause, regardless of whether their faulty behavior is detectable. We discuss this in section 6.7.4.

6.7.1 Compromised A2M

We have so far assumed that A2M is a trustworthy third party and used it to remove one wide-area communication step. If a server p and its local A2M are both compromised, then, for a consensus instance i, it is possible for p to suggest a value v ≠ no-op to server q and propose no-op to server r. Server r might enter an unsafe state by immediately learning no-op for instance i upon receipt of the proposal. Note that q does not enter an unsafe state because it still requires a quorum of the servers to accept the request before it can learn the outcome. As discussed in section 6.6.4, this behavior can be detected by exchanging SK-RELAY and SU-RELAY messages. Once it is detected, server r and its clients can roll back their states and recover using rollback methods such as transactions [27]. Doing so also requires that a client not send output to users until it knows the state machine command has committed [64]. Note that even in the face of a faulty A2M, r can confirm that no-op is the consensus outcome for instance i once it receives an additional 2f matching SK-RELAY messages; doing so takes only one extra wide-area communication step. Such a short speculative time frame enables efficient management of speculative states – for example, not allowing output commit – and makes it possible to commit stable states quickly.

The way RAM deals with a faulty A2M can also be used to enable RAM when A2M hardware is not available. In this scenario, a RAM server simply signs a proposal when suggesting values or skipping turns. The signed proposal is treated as if it were signed by the A2M. If inconsistent proposals are later discovered, servers roll back their states to recover. Essentially, we use untrustworthy software to simulate a hardware A2M and rely on recovery mechanisms for correctness. Doing so does not introduce new mechanisms but does make RAM more vulnerable when one or more servers are compromised. Fortunately, such faulty behavior is easily detectable and is verifiable at all servers. Once it is detected, the faulty server is excluded from further action to prevent further disruption of the replicated state machine.
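As an illustration of the speculative-state bookkeeping just described, the sketch below tracks, for a single consensus instance, when a speculatively learned no-op becomes stable; the class, method, and field names are hypothetical.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: a server that speculatively learned no-op for an
    // instance treats the outcome as stable only after 2f matching SK-RELAY
    // messages from distinct servers have arrived.
    class SpeculativeNoOp {
        private final int f;
        private final Set<Integer> matchingRelays = new HashSet<Integer>();

        SpeculativeNoOp(int f) { this.f = f; }

        /** Record an SK-RELAY for this instance that matches the A2M-signed
         *  skip we acted on; sender is the relaying server's id. */
        void onMatchingSkRelay(int sender) { matchingRelays.add(sender); }

        /** Output commit (replies visible to users) must wait until stable. */
        boolean isStable() { return matchingRelays.size() >= 2 * f; }
    }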

6.7.2 Untrustworthy local servers

So far we have assumed the MSD model, in which a client trusts its local server and consequently learns the outcome of its requests as soon as it receives a reply from its local server. Trusting its local server saves a client one wide-area delay for every request. We have also assumed in the MSD model that the servers have an incentive to reduce latency for their clients, and therefore propose local requests and send correct replies back to clients as soon as possible. A compromised server can violate this assumption by, for example, delaying proposals or sending incorrect replies.

RAM uses the following techniques to expose such servers. For a server that deliberately delays the processing of local requests, a RAM client raises an alert against its local server if the request latency is unacceptably high. For a server that sends fabricated replies, every server is required to collect replies from other servers and forward them to the client using O-REPLY messages. (Since it is the client that eventually verifies the O-REPLY messages, when MACs are used to authenticate messages, a remote server authenticates the reply sent to the local server using the secret key it shares with the client.) If the client does not get f additional O-REPLY messages that match the original reply within a reasonable time, it raises an alert against the local server.

Note that raising an alert against a local server does not lead to the server being revoked. Instead, it alerts the local administrator that either (1) the client is compromised, (2) the server is compromised, or (3) the site is experiencing unexpected network problems. Since the client and the server are under the same administrative domain, administrators can use support systems (e.g., intrusion detection systems [55, 66]) and forensic tools [51] to determine the culprit. If neither the server nor the client is deemed faulty, then the alert indicates that this is a good time for local administrators to investigate the cause of the underlying network problem.

6.7.3 Ignoring revocation

One possible Byzantine behavior is for a server to ignore its own revocation. A server q can pretend it is not aware of its own revocation and continue to suggest requests even though q knows other servers will ignore these suggestions. Doing so does not help q's latency but can slightly increase the load on the system. We use the following mechanism to combat this behavior. Upon first learning that q is being revoked, a server p sends a revocation certificate to q and expects q to confirm within time γ · Lq→p, for some protocol parameter γ. Failing to confirm in time, or suggesting requests after confirmation, results in q being suspected more seriously and revoked for a longer period of time. γ should be chosen to allow for reasonable network jitter, so that with high probability q is able to confirm within γ · Lq→p seconds.
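A small sketch of this confirmation check, with illustrative names and millisecond timestamps, might look as follows.

    // Hypothetical sketch of the revocation-confirmation check: after sending
    // a revocation certificate to a revoked server q, server p waits at most
    // gamma * L(q->p) for q's confirmation before escalating its suspicion.
    class RevocationWatch {
        private final double gamma;      // protocol parameter γ
        private final double lQtoP;      // advertised one-way latency L(q→p), in ms
        private final long sentAt;       // when the certificate was sent (ms)
        private boolean confirmed = false;

        RevocationWatch(double gamma, double lQtoP, long nowMs) {
            this.gamma = gamma;
            this.lQtoP = lQtoP;
            this.sentAt = nowMs;
        }

        void onConfirmation() { confirmed = true; }

        /** True if q failed to confirm within γ·L(q→p): q is then suspected
         *  more seriously and revoked for a longer period. */
        boolean overdue(long nowMs) {
            return !confirmed && (nowMs - sentAt) > gamma * lQtoP;
        }
    }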

6.7.4 Latency upper bound

In this section we explain why a Byzantine server can only cause a limited increase in the latency of requests proposed by other servers without risking being revoked. We consider a period of synchrony in which communication is timely and the revoked set is stable. We assume this because it is well known that consensus is impossible during periods of asynchrony [26], so no upper bound can be obtained otherwise. We first obtain this upper bound in terms of the number of communication steps. We then apply utility function U1 and policy P1 to obtain an upper bound in terms of the communication delay between correct servers. For simplicity, we ignore the latency introduced by delayed commit, which can add up to one more one-way message delay when concurrent requests are proposed by the servers.

Upper-bound on communication steps

A RAM server participates in the state machine in two ways: it acts as the proposer for the instances it coordinates and as an acceptor for instances coordinated by other servers. Up to f Byzantine servers can only cause a limited increase in latency for servers that are not currently under revocation because:

1. As an acceptor, a Byzantine server cannot keep requests proposed by a correct server from being chosen: Coordinated Byzantine Paxos only needs 2f + 1 responses from the correct servers to make progress.

2. f Byzantine servers cannot keep requests proposed by a correct server from being chosen by revoking that server: f + α servers are required to revoke a server.

3. As a proposer, a Byzantine server cannot introduce gaps in the consensus sequence by not skipping its turn when required. When a correct server has suggested a value, all other servers must either skip or suggest values for their turns. Failing to do so results in the server being suspected and revoked, and hence in the gap being filled. In the worst-case scenario, a Byzantine server does not have to respond until it receives the (f + α − 1)th SU-RELAY message. Doing so adds at most one wide-area message delay. After receiving the (f + α − 1)th SU-RELAY message, a Byzantine server does not have to send SUGGEST or SKIP messages for its unused turns to all servers: it can omit the messages to up to f + α − 1 correct servers without being revoked. The other correct servers, however, relay messages, which ensures that the extra latency is at most one wide-area message delay.

Thus, the total extra delay a Byzantine server can induce at a server that is not currently under revocation is two wide-area message delays, i.e., four steps is the upper bound on the total delay. For a server r that is currently being revoked, one more delay is added to the upper bound (five steps total) if r can find a correct server to forward its requests to. Otherwise, yet another step is added, bringing the total to six steps, because a Byzantine server can refuse to forward a request for up to one step without being revoked (discussed in section 6.6.3).

We summarize the above result with the following theorem.

Theorem 6.13: The extra delay L, in communication steps, that Byzantine servers can induce at a server p is:

    L = 2, if p is not being revoked;
    L = 3, if p is being revoked and a correct server is used for forwarding;
    L = 4, if p is being revoked and a faulty server is used for forwarding.

Upper-bound on communication delays

In this section, we apply the utility function U1 and policy P1 to the upper bound in communication steps obtained in Theorem 6.13. First, we discuss the model and introduce a few supporting definitions.

Let Dp→q be the one-way delay from server p to server q. For simplicity, we consider a simple model in which Dp→q = Dq→p, i.e., the link latencies are symmetric between each pair of servers.

Definition 6.8 (Correct quorum): Let 𝒞 (|𝒞| ≥ 2f + 1) be the set of all correct servers. A set of servers C is called a quorum of correct servers iff C ⊆ 𝒞 ∧ |C| = 2f + 1.

Definition 6.9 (Intra-quorum latency): The intra-quorum latency of a quorum of correct servers C is defined as tC = max{Dp→q : p, q ∈ C}.

Definition 6.10 (Fastest correct quorum): A quorum of correct servers Ĉ is called a fastest correct quorum if Ĉ has the lowest intra-quorum latency among all possible correct quorums.

Note that it is possible to have more than one fastest correct quorum; for example, if all the servers are correct and have the same inter-server latency, then any 2f + 1 servers make up a fastest correct quorum.

Lemma 6.14: Let Ĉ be a fastest correct quorum and denote tĈ by τ. A server p will be revoked under U1 and P1 if ∀q ∈ Ĉ : Dp→q ≥ 1.5τ.

Proof. Let Π be the set of all servers. It is not difficult to verify that Π \ Ĉ is a valid revocation set: |Π \ Ĉ| = f, U1(G, φ, Ĉ) ≥ 3τ, and U1(G, Π \ Ĉ, Ĉ) = 2τ.

Now assume p is not revoked under U1 and P1. Then there must exist a revocation set S such that p ∉ S ∧ U1(G, S, Π \ S) ≤ 2τ. Thus, ∀s, t ∈ Π \ S : Ds→t ≤ τ. This is impossible because at least f + 1 servers in Π \ S are also in Ĉ, and the one-way latency between p and these f + 1 servers is at least 1.5τ.

If a server p is revoked, p needs to forward its requests to another server. By assumption, the set of revoked servers does not change during a period of synchrony, and p knows which servers are revoked: revocation certificates are broadcast and no-op is chosen for turns coordinated by the revoked servers. To avoid a forwarding chain that adds further latency to its requests, p only forwards requests to servers that are not currently being revoked.

Theorem 6.15: Let Ĉ be a fastest correct quorum and denote tĈ by τ. Let p be a correct server and lp = max{Dp→q : q ∈ Ĉ}. Let Lp be the latency at server p. We obtain the following upper bounds:

1. If p is not being revoked, Lp ≤ 3τ + 2lp;

2. If p is being revoked and a correct server in Ĉ is used for forwarding, then Lp ≤ 4τ + 2lp;

3. If p is being revoked and a server not in Ĉ is used for forwarding, then Lp ≤ 5.5τ + 2lp.

Proof. If p is not being revoked, p's proposal commits when p receives the acknowledgements to its SUGGEST message from all servers that are not being revoked. The relay mechanism in RAM ensures that, lp + 1.5τ seconds after p suggests its request, all unrevoked servers have received at least 2f + 1 copies of the SUGGEST message from the servers in Ĉ. The unrevoked servers, correct or Byzantine, must by then have generated an acknowledgement and sent it to one of the servers in Ĉ – a correct server does so when it receives the first copy of the SUGGEST message; a Byzantine server does so upon receiving the (f + α)th copy to avoid being suspected by f + α servers. The acknowledgement takes at most 1.5τ + lp to reach p: at most 1.5τ to reach a correct server c in Ĉ and at most lp from c to p. Therefore, p can commit its proposal within 3τ + 2lp.

Similarly, if p is being revoked and a correct server c in Ĉ is used for forwarding, then p's request arrives at c within lp seconds, and c immediately proposes the request. A Byzantine server does not have to acknowledge until it receives the (f + α − 1)th SU-RELAY message, which takes at most 2.5τ. The acknowledgement then takes at most 1.5τ + lp to arrive at p. Therefore, the upper bound for this case is 4τ + 2lp.

Finally, if p is revoked and a server q not in Ĉ is used for forwarding, then within 1.5τ + lp seconds q must have received at least 2f + 1 copies of p's request from the servers in Ĉ. By this time, q, correct or Byzantine, must have proposed the forwarded request. This introduces an additional 1.5τ seconds of delay, bringing the upper bound to a total of 5.5τ + 2lp.
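For concreteness, the following sketch computes τ and the three bounds of Theorem 6.15 from a given symmetric delay matrix. Knowing which servers are correct is an assumption made only for this offline calculation; the class and method names are illustrative.

    import java.util.*;

    // Sketch: evaluating the Theorem 6.15 bounds from a symmetric one-way
    // delay matrix d (in seconds). Brute-forcing all (2f+1)-subsets of the
    // assumed-correct servers is fine for the small n of a multi-site system.
    public class LatencyBounds {
        private double tau;            // τ: intra-quorum latency of a fastest correct quorum
        private List<Integer> fastest; // one fastest correct quorum Ĉ

        void findFastestQuorum(double[][] d, List<Integer> correct, int f) {
            tau = Double.POSITIVE_INFINITY;
            for (Set<Integer> q : choose(correct, 2 * f + 1)) {
                double t = 0.0;        // t_C = max pairwise one-way delay in the quorum
                for (int p : q) for (int r : q) t = Math.max(t, d[p][r]);
                if (t < tau) { tau = t; fastest = new ArrayList<Integer>(q); }
            }
        }

        /** The three upper bounds of Theorem 6.15 for server p. */
        double[] bounds(double[][] d, int p) {
            double lp = 0.0;           // l_p = max{D(p→q) : q ∈ Ĉ}
            for (int q : fastest) lp = Math.max(lp, d[p][q]);
            return new double[] {
                3.0 * tau + 2 * lp,    // p not being revoked
                4.0 * tau + 2 * lp,    // p revoked, forwards via a correct server in Ĉ
                5.5 * tau + 2 * lp     // p revoked, forwards via a server not in Ĉ
            };
        }

        private static List<Set<Integer>> choose(List<Integer> items, int k) {
            List<Set<Integer>> out = new ArrayList<Set<Integer>>();
            rec(items, k, 0, new HashSet<Integer>(), out);
            return out;
        }

        private static void rec(List<Integer> items, int k, int start,
                                Set<Integer> cur, List<Set<Integer>> out) {
            if (cur.size() == k) { out.add(new HashSet<Integer>(cur)); return; }
            for (int i = start; i < items.size(); i++) {
                cur.add(items.get(i));
                rec(items, k, i + 1, cur, out);
                cur.remove(items.get(i));
            }
        }
    }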

6.8 Evaluation

In this section we report the results of experiments on the DETER testbed [10] to evaluate RAM. We also compare RAM's performance with PBFT's baseline performance obtained from other sources [14, 33]. First we summarize our prototype implementation and explain our experimental setup. We then evaluate RAM's latency and throughput. Finally, we present a setup in which revocation is in action, as a first step toward understanding the implications of revocation.

6.8.1 Prototype

We implemented RAM in Java 1.6 and used Sun's implementation of cryptographic primitives. We used TCP as the transport protocol and disabled Nagle's algorithm [49] to reduce latency. Out-of-order commit was implemented as an option that can be dynamically turned on or off.

Like Mencius, RAM implements an API with three calls: the application calls PROPOSE(v) to issue a request, and the state machine calls back the application using ONCOMMIT(v) once the request is ready to commit. We use a second callback, ISCOMMUTE(u,v), to ask the application whether two requests are commutable when out-of-order commit is enabled.

For evaluation, we used the read/write register service that the evaluation of Mencius used. Such a service is attractive because it is simple and allows out-of-order commit. In more detail, this service implements a total of κ registers, a read command, and a write command. Each command consists of the following fields: (1) operation type – read or write (1 bit); (2) register name (2 bytes); (3) the request sequence number (4 bytes); and (4) ρ bytes of payload (the payload is ignored for read operations and is included only to give read requests a specific size).

6.8.2 Experimental setup

Unless otherwise stated, we use a four-site fully-connected topology. We set up a virtual link between each pair of sites, and each link had, by default, 50 ms one-way delay and 20 Mbps bandwidth. Due to hardware constraints, we used one node for each site – the clients and the server of a site run on the same machine. The configuration of each node is: a 2.13 GHz quad-Xeon PC with 2.0 GB of memory, running Ubuntu 8.

Table 6.2: Micro benchmark on security primitives

    Hash function            MD5             SHA1
    Data size (bytes)        64     2048     64     2048
    Hash MAC (×10⁴ ops)      513    311      82.3   48.1
    RSA Sign (ops)           221    219      222    221
    RSA Verification (ops)   4210   4195     4035   3879

A micro-benchmark of the throughput of the primitives, measured in ops (operations per second), is given in Table 6.2. 1024-bit keys were used for all RSA operations. As expected, hash-based MACs are significantly faster than RSA. In addition, the throughput of RSA operations is nearly independent of the hash function used and the data length. Since PBFT uses MD5, we only present RAM's performance measured with operations that use MD5 as the hash function. A multi-threaded implementation was used to take advantage of the availability of multi-core machines. A2M was implemented in a separate thread and was the single bottleneck of the system, as we show later.

We verified that running the clients and the server on the same machine had minimal impact on the performance of the servers – each client used less than 1% CPU, whereas the servers used 170% CPU at peak utilization (utilization exceeds 100% because the machines are multi-core).
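For reference, a micro-benchmark of this kind can be reproduced with the JDK's standard providers (the prototype used Sun's implementation of the cryptographic primitives); the iteration counts and class name below are illustrative, and absolute numbers will vary with hardware.

    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class CryptoBench {
        public static void main(String[] args) throws Exception {
            byte[] data = new byte[64];              // also try 2048-byte inputs

            // HMAC-MD5 throughput
            Mac mac = Mac.getInstance("HmacMD5");
            mac.init(new SecretKeySpec(new byte[16], "HmacMD5"));
            int n = 1000000;
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) mac.doFinal(data);
            System.out.printf("HMAC-MD5: %.0f ops/s%n",
                              n / ((System.nanoTime() - t0) / 1e9));

            // 1024-bit RSA signing throughput (MD5withRSA, as in the prototype)
            KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
            kpg.initialize(1024);
            KeyPair kp = kpg.generateKeyPair();
            Signature sig = Signature.getInstance("MD5withRSA");
            sig.initSign(kp.getPrivate());
            int m = 1000;
            t0 = System.nanoTime();
            for (int i = 0; i < m; i++) { sig.update(data); sig.sign(); }
            System.out.printf("RSA sign: %.0f ops/s%n",
                              m / ((System.nanoTime() - t0) / 1e9));
        }
    }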

Figure 6.5: Commit latency vs. offered client load (latency in ms vs. throughput in ops; curves for RAM, RAM-16, RAM-256, and RAM-1024).

An A2M signature cache was implemented at the servers to verify SU-RELAY and SK-RELAY messages more efficiently. Using the cache reduced the CPU utilization at peak from 170% to 125%. However, it did not improve throughput significantly, since the cache did not relieve the system bottleneck, which was the A2M signing thread.

Clients submit operations at a fixed rate. Each operation is either a read or a write, selected randomly, to a register chosen uniformly from the set of registers, with a fixed payload of ρ bytes. For the following experiments, we did not enable local processing of read operations, so all operations go through the replication protocol.

6.8.3 Latency

In this section, we measure RAM's latency under different offered client loads. RAM denotes RAM with out-of-order commit disabled, and RAM-κ denotes RAM with out-of-order commit enabled and κ registers. For example, RAM-16 implements 16 registers. As we discussed in section 5.5.5, having more registers in the service implies a higher chance that two commands are commutable, which causes the out-of-order commit technique to be exercised more often. Four RAM variants were used in the experiment: RAM, RAM-16, RAM-256, and RAM-1024.

Figure 6.5 shows the average delay over a 10-second period for each data point. The expected minimal delay for RAM is 109 ms, i.e., the round-trip delay (100 ms) plus the delay of two RSA sign operations, one for signing SUGGEST and one for signing SKIP (4.5 ms each). Indeed, all four variants started with a delay of approximately 110 ms. For RAM, latency gradually increases to about 160 ms and ramps up before reaching the maximum throughput at around 800 ops. This extra delay was caused by the delayed-commit problem, which adds up to one one-way delay (50 ms) depending on the level of contention – higher load causes more contention and increases latency.

All three out-of-order enabled variants showed noticeable latency improvements over RAM. RAM-1024 achieves the maximal latency improvement: approximately 15%. This value is significantly smaller than the 35% that Mencius achieves, but that is because RAM uses broadcast to reduce the upper bound of delayed commit from two steps to one step. Up to 320 ops, all three variants show latency close to the minimum. As the throughput ramps up, the differences between the variants with out-of-order commit enabled increase: RAM-1024 had better latency than RAM-256, which in turn had better latency than RAM-16. As expected, all three variants always have better latency than RAM.

For comparison, from a communication pattern analysis [14], PBFT would have at least 200 ms delay at the primary and at least 250 ms at the followers, resulting in an average of 237.5 ms. This is significantly higher than that of any of the four RAM variants.

6.8.4 Throughput

This section evaluates RAM’s throughput, which was measured by running a large number of clients to overload the servers. The result is shown in figure 6.6. Note that error bars are not shown in the figure as they are small. Table 6.2 shows that the 148

Figure 6.6: The throughput of RAM (bars for four sites with ρ = 8 bytes, four sites with ρ = 2000 bytes, and seven sites with ρ = 8 bytes, grouped by OOC=off and OOC=on (κ = 256)).

Table 6.2 shows that the A2M thread can sign approximately 220 messages per second. In the ideal case, in which no skip messages are generated and signed at any of the servers, the system should have a maximal throughput of 880 ops. Indeed, RAM had a peak throughput of approximately 820 ops with four sites and ρ = 8 (black bars). Increasing ρ to 2,000 bytes did not cause a significant throughput drop: throughput falls to approximately 700 ops (gray bars).

We also ran experiments with seven sites, but due to hardware constraints, we used a star topology: each node had a 20 Mbps link to a central node with 25 ms one-way delay. We only show results for ρ = 8, since ρ = 2000 would have exceeded the 20 Mbps link bandwidth. The shaded bars show 1,320 ops throughput: a nearly linear scaling from the four-site configuration (1320/820 = 1.61 ≈ 7/4 = 1.75). We have already seen in Figure 6.5 that out-of-order commit has little impact on throughput because the A2M is the bottleneck. Figure 6.6 confirms this result under different configurations.

Kotla et al. have reported 20 Kops throughput for PBFT, and even higher throughput for Zyzzyva, with comparable machine configurations in a LAN [33]. While RAM's throughput is significantly lower due to its more expensive signature operations, a lower throughput has been deemed sufficient in practical settings by others [15]. We see two techniques that can be used to improve RAM's throughput: (1) use a multi-threaded A2M implementation to take advantage of multi-core machines; and (2) batch small requests to amortize the cost of signatures at a relatively low cost in increased latency. For example, we project RAM to achieve 700 × 10 = 7,000 ops if a batch size of 10 is used for an average of 200 bytes of client load. At the projected throughput, a client request would be buffered for at most 1000/(700/4) ≈ 5.7 ms before being proposed – small compared to the 100 ms round-trip latency, which is also the minimal latency for committing a request.
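The batching projection above amounts to the following back-of-the-envelope calculation; the constants are taken from the text and the class name is illustrative.

    public class BatchingProjection {
        public static void main(String[] args) {
            double signRate = 700;                        // A2M signatures/s at 2000-byte load
            double batchSize = 10;                        // requests amortized per signature
            double sites = 4;
            double projectedOps = signRate * batchSize;   // 700 × 10 = 7,000 ops system-wide
            double batchesPerSite = signRate / sites;     // 175 batches/s proposed per site
            double maxBufferMs = 1000.0 / batchesPerSite; // ≈ 5.7 ms worst-case buffering
            System.out.printf("projected throughput: %.0f ops, max buffering: %.1f ms%n",
                              projectedOps, maxBufferMs);
        }
    }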

6.8.5 Revocation

Figure 6.7: Increasing one site's latency leading to revocation (client latency in ms at Site A, at Site D, and on average, vs. extra link latency δ in ms, from 0 to 50).

Revocation is used extensively in RAM. Understanding the implications of revocation is essential for selecting a good utility function U and a good revocation policy P. In this section, we take a first step toward this goal by implementing U1 and P1 and evaluating them in the following scenario. In a fully connected four-site setting (sites A–D), we run one client at each site and configure each link to have a 50 ms one-way delay. We then gradually add δ ms of extra one-way delay to all links that connect to site D. When δ reaches 25 ms, D is considered too slow and hence revoked. While revoked, D uses A to forward its requests.

Figure 6.7 shows the measured latency after the system stabilizes. Before revocation occurred, the latency at all sites grew linearly with δ. During this period, if D had been revoked, sites A–C would have achieved a latency close to the lower bound. Revoking D, however, would also have increased the latency of D by an additional 50 ms. During this initial period, the latency reduction that sites A–C could obtain might not justify the latency increase at D. As δ continued to grow, D was revoked. Although D's latency grew considerably larger, the revocation is justified because: (1) D's latency was primarily due to the slow links incident on D, and (2) the other three sites were able to achieve the best possible latency, so the average latency of the system did not increase significantly as a result of the slow links to and from D. In fact, in this particular scenario, the average latency never grew over 150 ms even though the latency of D almost doubled.

6.9 Related work

Fault-tolerant protocols. We have explained Paxos [37], Mencius, and PBFT [14]. FaB [47] reduces PBFT's proposing phase from two steps to one. The improved latency, however, comes at the cost of more replicas: 5f + 1 are required to tolerate f failures.

One technique for improving the latency of BFT protocols is speculative execution, which can improve the latency of both PBFT and FaB by one round. Speculation often requires the replicated state machine to tentatively execute a request and roll back its state if necessary. Client speculation has also been used to improve latency [64]. Doing so, though, requires that clients be able to roll back their states and that they not output any unstable result to the user. The way RAM handles a faulty A2M is similar to such an approach. Finally, when properly used, speculation can also improve throughput. For example, Zyzzyva [33] improves system throughput by letting the clients handle tentatively executed results and decide if rollback is necessary. Doing so requires that clients be able to talk directly to all servers, which may not be realistic for a multi-site wide-area system due to security concerns. A common disadvantage of speculative execution is that a carefully crafted faulty client or server can dramatically reduce the performance of such protocols by forcing an expensive recovery path [20].

Amir et al. [6] observe that a slow primary can reduce throughput or increase latency significantly. To deal with this problem, they proposed Prime, a protocol that adds a pre-agreement phase requiring all-to-all communication and signatures. Spin is also a protocol that rotates the leader [62]. However, instead of pre-assigning instances (as is done in RAM), Spin uses a different leader for every new batch of requests. One drawback of Prime and Spin is that they increase the latency observed by the clients even when there is no failure or false suspicion. RAM, on the other hand, does not sacrifice normal-case latency while still providing robust latency in the face of uncivil behavior. Aardvark [20] is a robust BFT protocol that provides usable throughput during both failure-free and uncivil runs. It accomplishes this goal by frequently changing the primary (using performance metrics), by signing client requests, and by isolating resources.

Steward [7] is the only BFT protocol we are aware of that was designed specifically for multi-site systems. It is a hybrid protocol that replicates servers within a site to provide the illusion of crash-failure semantics over wide-area networks. It is, however, vulnerable when a site is corrupted or behaves selfishly. RAM is specifically designed to tolerate such scenarios by assuming remote processes are untrustworthy.

Other notable BFT protocols in the literature are Q/U [2] and HQ [23], both of which use quorum update protocols. Q/U uses a state-machine replication protocol based on quorum agreement that is more scalable than traditional state-machine replication protocols, but Q/U not only requires a higher degree of replication but also performs poorly in the presence of contention. HQ is a hybrid protocol that uses quorum agreement in the absence of contention and uses PBFT to order contending operations. RAM already achieves low latency by design (clients receive a response in one WAN round-trip latency), so using quorum agreement to speed up the execution of operations is unnecessary. RAM is also built to discourage uncivil rational behavior.
A similar philosophy has been used to design applications in a different context [4]. Finally, RAM focuses on tolerating and detecting faulty servers. Tolerating faulty client behavior is an orthogonal problem, and solutions have been proposed in the literature to cope with Byzantine clients [23, 43].

Trusted components. The Wormholes model of Veríssimo [61] is a hybrid system model that defines a system comprising a subsystem with stronger guarantees (a privileged artifact) along with a main functional component that provides weaker guarantees. Correia et al. discuss the design of the TTCB security kernel, which is a simple subsystem that provides a set of basic security services [22]. Neves et al. show how to use the TTCB to solve vector consensus in an asynchronous system with Byzantine failures [50]. A2M [19] and TrInc [42] are trusted devices that prevent equivocation and enable more efficient protocols. With A2M, Chun et al. have shown that it is possible to implement a variant of PBFT that requires only 2f + 1 replicas. RAM also uses A2M but requires 3f + 1 replicas. We opted for a higher degree of replication because it leads to a simpler and more efficient implementation of the protocol and enables the detection of faulty A2M devices. TrInc is a simpler device that essentially implements a set of trusted counters and has been shown to be able to implement A2M. The ways TrInc and RAM use trusted hardware components share the same philosophy: simplify the semantics to make implementing such trusted devices more realistic.

Performance prediction. Performance prediction is a critical aspect of our approach, as deviating from the expected performance may cause the revocation of a server. There has been work in the area of predicting the performance of communication in wide-area systems. Schulz et al. propose an algorithm for predicting the performance of communication in wide-area applications. SyncProbe provides applications with a real-time estimate of the maximum expected message delay [59]. Lu et al., with DualPats, exploit the strong correlation between TCP throughput and flow size, and the statistical stability of Internet path characteristics, to accurately predict the TCP throughput of large transfers using active probing [45]. On the performance of services, Quorum uses response monitoring, among other mechanisms, at the border of an Internet site to ensure throughput and response-time guarantees [12].

6.10 Future direction

We have explained the core mechanisms that enable RAM. The preliminary evaluation of our prototype implementation also demonstrates promising results. The following issues, however, remain to be studied.

Improving throughput: While we have demonstrated that RAM can provide realistic throughput and has the potential to improve through multi-threading and/or batching, we have yet to fully investigate the effectiveness of such techniques. Improved throughput would make RAM suitable for a wider range of applications.

Separating policy and mechanism: We have explained several rules for detecting uncivil rational and irrational behavior. These rules are, however, not meant to be complete. Different systems are also likely to have different rules for achieving their specific performance goals. Additional rules should therefore be provided as policies so that RAM can be adapted while deployed. Doing so requires further separating the mechanisms of RAM from its policies, both in protocol design and in implementation.

Utility functions and revocation policies: We have introduced an example utility function and revocation policy to optimize system utility. A more comprehensive study is needed to further understand the potential, limitations, and tradeoffs in choosing such functions and policies.

Checkpoint and recovery: RAM is designed to work well with checkpoint and recovery tools to protect against the tampering of A2Ms. We have yet to fully investigate the integration of RAM with such tools.

6.11 Summary

RAM is an efficient Byzantine fault-tolerant protocol for wide-area replication. There are three core concepts that RAM builds upon: a Rotating leader design, the use of A2M devices to prevent equivocation, and a trust model based on Mutually suspicious domains (MSD). Together, these concepts enable a protocol that provides low latency in wide-area systems. To prevent uncivil rational and irrational servers from disrupting protocol execution for extended periods of time, RAM relies upon revocation, which removes a server's privilege to directly propose new requests. Revocation, however, can be a double-edged sword, and can also harm civil servers if not used correctly. We have consequently crafted a mechanism based on performance estimates and suspicion certificates that strives to maximize system utility. Our evaluation shows that RAM obtains excellent latency compared to PBFT, along with realistic throughput. It also shows that servers are able to make appropriate decisions with respect to revocation.

6.12 Acknowledgement

This chapter is, in part, a reprint of material as it appears in "Towards Low Latency State Machine Replication for Uncivil Wide-area Networks", by Yanhua Mao, Flavio Junqueira, and Keith Marzullo, in the Proceedings of the 5th Workshop on Hot Topics in System Dependability (HotDep), June 2009, and "RAM: State Machine Replication for Uncivil Wide-area Networks", by Yanhua Mao, Flavio Junqueira, and Keith Marzullo, unpublished technical report, Department of Computer Science and Engineering, University of California, San Diego, October 2009. The dissertation author was the primary coauthor and co-investigator of the two papers.

Chapter 7

Conclusion

This dissertation studied protocols that implement state machine replication for wide area networks, specifically for multi-site wide area networks.

State machine replication is the most general approach to providing highly available stateful services. By replicating a deterministic service across a few replicas, the service is available as long as the total number of failures does not exceed a certain threshold. If the replicas fail independently, then this approach can significantly increase the availability and reliability of the service even when it is built from less available and less reliable servers.

State machine replication is also widely applicable because it provides strong consistency, which guarantees that the clients that use the replicated service always observe the same sequence of state changes. This greatly simplifies the design and implementation of clients: the replication layer is transparent to the core client logic, so the clients do not have to deal with concurrency themselves.

The core building block of state machine replication is consensus, which, despite the possibility of contention and failures, solves the problem of how replicas agree on a unique command from those sent by clients. By running an unbounded sequence of instances of consensus, it is possible to select a consistent sequence of commands across all the correct replicas.

Like many other distributed computing problems, both consensus and the replicated state machine are defined using a set of safety and liveness properties. The safety properties state the conditions that an implementation must satisfy at all times: they allow

one to verify that nothing bad has happened yet. The liveness properties, on the other hand, state the conditions that an implementation must eventually satisfy: they require that something good eventually happens.

Consensus is a well-studied problem: many protocols have been proposed to solve consensus under different system models, failure models, and/or assumptions. The asynchronous message passing model, in which there is no upper bound on either message delivery delays or message processing delays, is popular due to its generality: most modern operating systems and communication networks are modeled as asynchronous. It is, however, well known that consensus cannot be both safe and live in an asynchronous system with the possibility of just a single failure. This is because of the inherent difficulty of distinguishing a faulty replica from a slow replica in an asynchronous system. There are many ways to circumvent this impossibility result. The most popular approach is to use unreliable failure detectors, which are classified in terms of their accuracy and completeness properties. Despite the possibility of an unbounded number of mistakes made by the failure detector, one can design protocols that satisfy safety at all times and provide liveness when the failure detector becomes accurate. The leader election failure detector class, denoted Ω, has been shown to be the weakest failure detector for providing both safety and liveness for consensus. Many existing consensus protocols, such as Paxos and PBFT, are designed to work with Ω due to its simple semantics.

Despite the advantages of wide-area deployment, such as the ability to cope with co-located failures and to facilitate wide-area applications, replicated state machines are usually deployed in local area environments. Most existing protocols for consensus and replicated state machines are also designed with local area networks in mind, making them less efficient in the wide area. To make the replicated state machine approach more practical and applicable in the wide area, this dissertation focused on designing protocols that work both efficiently and effectively in such environments. Following a brief review of background materials in chapter 2, this dissertation started the investigation with the crash failure model, in which the replicas may fail by stopping to execute.

To understand the challenges of designing efficient protocols, chapter 3 compared two crash failure protocols: Paxos and Fast Paxos. Paxos is a leader-based protocol that tolerates up to f failures with 2f + 1 replicas. It is popular due to its simplicity. Fast Paxos is a variant of Paxos that tolerates up to f failures with 3f + 1 replicas. It is a fast learning protocol that is designed to have lower latency in local area networks than Paxos. In the comparison, we considered a simple cluster-based wide-area system, in which all the servers are in the same local area network while the clients are in a remote network. Simply putting the clients in a remote network yields a surprising analytic result: Paxos has a fixed probability of 60% of having lower latency than Fast Paxos if (1) the local-area communication delay is negligible compared to that of the wide area and (2) the wide-area communication delay follows a distribution that is not concentrated on any value (e.g., is continuous). Our simulation results also confirmed that Paxos has lower latency. This motivated us to examine the characteristics of multi-site wide-area systems and to design new protocols based on them.

The remainder of the dissertation focused on multi-site wide-area systems. Such a system consists of a small number of sites. Each site implements one server replica and multiple clients. Together, the servers implement a replicated state machine service. The clients access the service through their local servers. Chapter 4 presented R, a generic rotating leader protocol for such multi-site systems. First, we introduced simple consensus, a variant of consensus that restricts the values a server can propose: a well-known server, called the coordinator, can propose any command, while all other servers can only propose the special no-op command. We proved that simple consensus is equivalent to consensus in terms of solvability: Ω is the weakest failure detector for solving both problems. We gave special names to the actions a server can take: the coordinator suggests by proposing a command other than no-op; the coordinator skips by proposing the no-op command; and the followers (non-coordinators) revoke by proposing no-op on behalf of a coordinator that is suspected to have failed.

We then solved the replicated state machine problem by running an unbounded sequence of instances of simple consensus instead of consensus. We used a rotating-leader design. The idea is to partition the simple consensus sequence space so that each server is the coordinator for an unbounded subset of the instances. For example, instances may simply be assigned in a round-robin manner. Each instance of simple consensus is then implemented with A, a coordinated simple consensus protocol that takes advantage of the more restricted definition of simple consensus. For example, A may assign the coordinator as the default leader of the instance. The details of protocol A depend on the failure model and system assumptions. It has, though, one key characteristic – fast skipping, which allows all servers to immediately learn that no-op is the simple consensus outcome as soon as they confirm the coordinator has skipped. Fast skipping is a unique feature enabled by the more restricted definition of simple consensus.
With A implementing simple consensus, the generic protocol R works as follows: (1) upon receiving commands from clients, a server immediately submits them to the next available simple consensus instances it coordinates; (2) a server skips its unused simple consensus instances when it observes other servers consuming instances faster than itself; (3) a server revokes the coordinator when it suspects the coordinator has failed; (4) if the eventually perfect failure detector ♦P is available, to provide liveness, a server re-suggests outstanding requests when it recovers from a false suspicion or a failure; (5) if only the weaker Ω is available, to provide liveness, a server forwards outstanding requests to other servers when under a prolonged period of false suspicion.

The resulting generic protocol R is not efficient by itself. To improve its efficiency, we studied optimizations that are generic to the rotating leader design of R: revocation in large blocks and out-of-order commit. Additional optimizations can also be applied to rotating leader protocols derived from R, but they usually depend on the specific protocol A that is used.

First, due to the expensive revocation process, R has both high message complexity and high latency when one or more servers have failed. To improve this, we apply an optimization that allows revocation to be issued ahead of time and in large blocks. Issuing revocations for future turns helps improve the commit latency for non-faulty servers: learned commands can commit without having to wait for the revocation process to finish. Issuing revocations in large blocks amortizes the message complexity: the messages exchanged for revocation become negligible in the long run.

The rotating leader design allows R to achieve minimal commit latency at all
First, we derive Coordinated Paxos from Paxos to efficiently implement simple consensus. We then instantiate R using Coordinated Paxos. To improve efficiency, besides the general optimizations for R, additional optimizations are applied to opportunistically piggy- backing messages for skipping to other messages. Doing so makes it possible for the servers to skip at effectively no extra cost. The resulting protocol is called Mencius. We evaluate the performance of Mencius by comparing it to that of Paxos. In a basic three-site configuration, the throughput of Mencius is 200% higher than that of Paxos under a network-bound load and 50% higher than that of Paxos under a CPU- bound load. Mencius also have better scalability than Paxos. With seven sites, the 160 throughput of Mencius is seven times that of Paxos under a network-bound load and two times that of Paxos under a CPU-bound load. We also showed that Mencius is able to automatically adapt to changing network bandwidth and use all available bandwidth for provide throughput. Mencius has lower latency than Paxos when there is little or no contention. Under high contention, the latency of Mencius is no worse than that of Paxos. Enabling out-of-order commit helps Mencius reduce latency by up to 30% when under high contention. The only case in which Mencius performances worse than Paxos is when a failure is not detected: Mencius temporarily loses liveness when any server has crashed but not yet been detected; Paxos temporarily loses liveness only when the leader has crashed but not yet been detected. In chapter 6, we continue studying multi-site systems but under the more general Byzantine failure model, in which a server may fail in an arbitrary manner. Performance wise, we seeked a design that provides low latency for all sites and realistic throughput. First, we compared PBFT, the de facto standard protocol for the Byzantine failure model, with Paxos and identify three main causes of the higher latency of PBFT. Next, we proposed three techniques for reducing latency for each of three causes. First, the rotating-leader design of R can be used to reduce the extra latency at the non-leader sites. Second, the Attested Append-only Memory (A2M), a specialized hardware, can be used to reduce the proposing phase of PBFT from two steps to just one step. Finally, the Mutually Suspicious Domains (MSD) model, which allows clients to trust their local servers, can be used to address the extra latency caused by the traditional flat Byzantine model, in which there is no trust between any pair of processes. These three techniques make up the core building block of our protocol RAM. First we used A2M and MSD to implement Coordinated Byzantine Paxos, which in turn implements simple consensus. We then used Coordinated Byzantine Paxos to instantiate the generic rotating leader protocol R. Finally, we obtained the protocol RAM By applying optimizations to improve efficiency. Assuming the Byzantine failure model not only allows the applications to toler- ate a wider range of failures but also shift our design goals. Instead of only providing good performance with the benign crash failure, we need to design protocols that can 161 cope with potential uncivil and/or malicious behavior. We roughly classified the be- havior of a server into three categories: civil, uncivil rational and irrational. A civil behavior is consider as correct while both uncivil rational and irrational behavior are faulty. 
Uncivil rational and irrational behavior have different root causes: uncivil rational behavior is caused by selfish or faulty administrators who try to improve local utility (in our case, the latency of local requests) by gaming the system; irrational behavior can arise from remote attacks that try to disturb the system. Differences in root causes result in different strategies for coping with such behavior.

For uncivil rational behavior, we discourage it by designing the protocol such that any divergence from the correct behavior only increases the chance of lowering local utility. This way, the best strategy for a rational server is to follow the protocol. We did this with revocation, which penalizes uncivil behavior by removing a server's ability to propose commands directly, thereby increasing the server's latency. We demonstrated how to use revocation to discourage a wide range of uncivil rational behavior, such as untimely behavior, forwarding omission, flooding, and irresponsible suspicion.

For irrational behavior, we designed RAM to identify as much faulty behavior as possible. We demonstrated how irrational behavior such as a compromised A2M, a faulty local server, and ignoring revocation can be detected. When detection is impossible, we resort to damage-control methods that bound the latency increase that can be caused by undetected Byzantine servers. By providing an example utility estimation function and revocation policy, we proved upper bounds on request commit delays.

Finally, we evaluated the performance of RAM with our prototype implementation. Our results showed that RAM has latency up to 50% lower than that of PBFT. RAM does, though, have lower throughput than PBFT due to its use of digital signatures for failure-detection purposes. The throughput of RAM is, however, still higher than what has been claimed as sufficient for practical applications. Adding additional replicas also helps to improve system throughput. Finally, we showed the use of a utility estimation function and a corresponding policy in preserving high system utility (low latency) in the face of a server that becomes unusually slow.

There are several future directions of research, primarily in the context of RAM. These include:

• Improving throughput performance through batching and multi-core A2M imple- mentation.

• Integrating RAM with a checkpoint and recovery mechanism to cope with faulty A2M and irrational local servers.

• Separating the policy RAM uses for utility and for failure detection from the mechanisms of RAM. This would allow RAM to be dynamically tunable based on a changing environment.

• Running RAM in collaboration with a broader security environment, such as a firewall rule manager or a network intrusion detection system.

Bibliography

[1] Secure Hash Standard. National Institute of Standards and Technology, Washington, 2002. Federal Information Processing Standard 180-2.

[2] M. Abd-El-Malek, G. Ganger, G. Goodson, et al. Fault-scalable Byzantine fault- tolerant services. SIGOPS Oper. Syst. Rev., 39(5):59–74, 2005.

[3] M. Aguilera and R. Strom. Efficient atomic broadcast using deterministic merge. In Proceedings of ACM PODC, pages 209–218, New York, NY, USA, 2000.

[4] Amitanand S. Aiyer, Lorenzo Alvisi, Allen Clement, Mike Dahlin, Jean-Philippe Martin, and Carl Porth. BAR fault tolerance for cooperative services. In Proceedings of the twentieth ACM symposium on Operating systems principles, volume 39, pages 45–58, New York, NY, USA, December 2005. ACM Press.

[5] Y. Amir, L. Moser, P. Melliar-Smith, et al. The Totem single-ring ordering and membership protocol. ACM Trans. Comput. Syst., 13(4):311–342, 1995.

[6] Yair Amir, Brian A. Coan, Jonathan Kirsch, and John Lane. Byzantine replication under attack. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 197–206, Anchorage, AK, USA, 2008. IEEE Computer Society.

[7] Yair Amir, Claudiu Danilov, Jonathan Kirsch, John Lane, Danny Dolev, Cristina Nita-Rotaru, Josh Olsen, and David Zage. Scaling Byzantine fault-tolerant replication to wide area networks. In Proceedings of the International Conference on Dependable Systems and Networks, pages 105–114, Washington, DC, USA, 2006. IEEE Computer Society.

[8] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, 2004.

[9] Mihir Bellare, Ran Canetti, and Hugo Krawczyk. Keying hash functions for message authentication. In CRYPTO '96: Proceedings of the 16th Annual International Cryptology Conference on Advances in Cryptology, pages 1–15, London, UK, 1996. Springer-Verlag.


[10] Terry Benzel, Robert Braden, Dongho Kim, Clifford Neuman, Anthony Joseph, Keith Sklower, Ron Ostrenga, and Stephen Schwab. Design, deployment, and use of the DETER testbed. In Proceedings of the DETER Community Workshop on Cyber Security Experimentation and Test, pages 1–1, Berkeley, CA, USA, 2007. USENIX Association.

[11] Francine Berman, Andrew Chien, Keith Cooper, Jack Dongarra, Ian Foster, Dennis Gannon, Lennart Johnsson, Ken Kennedy, Carl Kesselman, John Mellor-Crummey, Dan Reed, Linda Torczon, and Rich Wolski. The GrADS project: Software support for high-level Grid application development. International Journal of High Performance Computing Applications, 15:327–344, 2001.

[12] Josep M. Blanquer, Antoni Batchelli, Klaus Schauser, and Rich Wolski. Quorum: flexible quality of service for internet services. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation, pages 159–174, Berkeley, CA, USA, 2005. USENIX Association.

[13] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 335–350, Berkeley, CA, USA, 2006. USENIX Association.

[14] Miguel Castro and Barbara Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS), 20(4):398–461, November 2002.

[15] Tushar Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: An engineering perspective. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 398–407, New York, NY, USA, 2007. ACM Press.

[16] Tushar Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, 1996.

[17] Tushar Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.

[18] Francis C. Chu. Reducing Ω to ♦W. Information Processing Letters, 67(6):289–293, 1998.

[19] Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. Attested append-only memory: making adversaries stick to their word. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, pages 189–204, New York, NY, USA, 2007. ACM.

[20] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 153–168, Berkeley, CA, USA, 2009. USENIX Association.

[21] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277–1288, 2008.

[22] M. Correia, P. Veríssimo, and N. F. Neves. The design of a COTS real-time distributed security kernel. In Proc. of the Fourth European Dependable Computing Conference, October 2002.

[23] J. Cowling, D. Myers, B. Liskov, et al. HQ replication: a hybrid quorum protocol for Byzantine fault tolerance. In OSDI 2006, pages 177–190, Berkeley, CA, USA, 2006.

[24] Xavier Défago, André Schiper, and Péter Urbán. Total order broadcast and multi- cast algorithms: Taxonomy and survey. ACM Computing Survey, 36(4):372–421, 2004.

[25] Dan Dobre, Matthias Majuntke, and Neeraj Suri. CoReFP: Contention-resistant Fast Paxos for WANs. Technical Report TR-TUD-DEEDS-11-01-2006, Depart- ment of Computer Science, Technische Universität Darmstadt, 2006.

[26] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. In Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of systems, pages 1–7, New York, NY, USA, 1983. ACM.

[27] Jim Gray. Notes on data base operating systems. In Operating Systems, An Advanced Course, pages 393–481, London, UK, 1978. Springer-Verlag.

[28] Rachid Guerraoui, Ron R. Levy, Bastian Pochon, and Vivien Quéma. High throughput total order broadcast for cluster environments. In DSN ’06: Proceedings of the International Conference on Dependable Systems and Networks, pages 549–557, Washington, DC, USA, 2006. IEEE Computer Society.

[29] S. Hemminger. Network emulation with NetEm. In Linux Conf Au, April 2005.

[30] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990.

[31] Flavio Junqueira and Keith Marzullo. Coterie availability in sites. In DISC, pages 3–17. Springer-Verlag, 2005.

[32] Idit Keidar and Alexander Shraer. Timeliness, failure-detectors, and consensus performance. In Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, pages 169–178, New York, NY, USA, 2006. ACM Press.

[33] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine fault tolerance. SIGOPS Operating Systems Review, 41(6):45–58, 2007.

[34] Leslie Lamport. Lower bounds for asynchronous consensus. In Future Directions in Distributed Computing, volume 2584 of Lecture Notes in Computer Science, pages 22–23, 2003.

[35] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, 1978.

[36] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.

[37] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, December 2001.

[38] Leslie Lamport. Generalized Consensus and Paxos. Technical Report MSR-TR-2005-33, Microsoft Research, 2005.

[39] Leslie Lamport. Fast Paxos. Distributed Computing, 19(2):79–103, 2006.

[40] Leslie Lamport, Aamer Hydrie, and Demetrios Achlioptas. Multi-leader distributed system. U.S. patent 7,260,611 B2, Aug 2007.

[41] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, 1982.

[42] Dave Levin, John R. Douceur, Jacob R. Lorch, and Thomas Moscibroda. TrInc: small trusted hardware for large distributed systems. In NSDI’09: Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 1–14, Berkeley, CA, USA, 2009. USENIX Association.

[43] Barbara Liskov and Rodrigo Rodrigues. Tolerating Byzantine faulty clients in a quorum system. In 26th IEEE International Conference on Distributed Computing Systems, pages 34–43, Lisbon, Portugal, Jul 2006.

[44] Jacob R. Lorch, Atul Adya, William J. Bolosky, Ronnie Chaiken, John R. Douceur, and Jon Howell. The SMART way to migrate replicated stateful services. SIGOPS Oper. Syst. Rev., 40(4):103–115, 2006.

[45] Dong Lu, Yi Qiao, Peter A. Dinda, and Fabian E. Bustamante. Characterizing and predicting TCP throughput on the wide area network. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, pages 414–424, Washington, DC, USA, 2005. IEEE Computer Society.

[46] John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, and Lidong Zhou. Boxwood: abstractions as the foundation for storage infrastructure. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 8–8, Berkeley, CA, USA, 2004. USENIX Association.

[47] Jean-Philippe Martin and Lorenzo Alvisi. Fast Byzantine consensus. IEEE Transactions on Dependable and Secure Computing, 3(3):202–215, 2006.

[48] Rich Miller. Explosion at The Planet causes major outage. http://www.datacenterknowledge.com/archives/2008/06/01/explosion-at-the-planet-causes-major-outage/. June 1, 2008 (accessed February 15, 2010).

[49] John Nagle. RFC 896: Congestion control in IP/TCP internetworks, January 1984. Status: UNKNOWN.

[50] Nuno F. Neves, Miguel Correia, and Paulo Veríssimo. Solving vector consensus with a wormhole. IEEE Transactions on Parallel and Distributed Systems, 16(12):1120–1131, 2005.

[51] Sean Peisert, Matt Bishop, Sidney Karin, and Keith Marzullo. Analysis of computer intrusions using sequences of function calls. IEEE Transactions on Dependable and Secure Computing, 4(2):137–150, 2007.

[52] R. Rivest. RFC 1321: The MD5 message-digest algorithm, April 1992.

[53] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21(2):120–126, 1978.

[54] Luigi Rizzo. Dummynet: a simple approach to the evaluation of network protocols. SIGCOMM Comput. Commun. Rev., 27(1):31–41, 1997.

[55] Martin Roesch. Snort - lightweight intrusion detection for networks. In LISA ’99: Proceedings of the 13th USENIX conference on System administration, pages 229–238, Berkeley, CA, USA, 1999. USENIX Association.

[56] Livia Sampaio and Francisco Brasileiro. Adaptive indulgent consensus. In DSN ’05: Proceedings of the 2005 International Conference on Dependable Systems and Networks, pages 422–431, Washington, DC, USA, 2005. IEEE Computer Society.

[57] Rodrigo Schmidt, Lásaro Camargos, and Fernando Pedone. On collision-fast atomic broadcast. Technical Report LABOS-REPORT-2007-001, École Polytechnique Fédérale de Lausanne, 2007.

[58] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299–319, 1990.

[59] Jawwad Shamsi and Monica Brockmeyer. SyncProbe: Providing assurance of message latency through predictive monitoring of Internet paths. In Proceedings of the IEEE International Symposium on High-Assurance Systems Engineering, pages 187–196, 2007.

[60] T. K. Srikanth and Sam Toueg. Simulating authenticated broadcasts to derive simple fault-tolerant algorithms. Distributed Computing, 2(2):80–94, 1987.

[61] Paulo E. Veríssimo. Travelling through wormholes: a new look at distributed systems models. SIGACT News, 37(1):66–81, 2006.

[62] Giuliana Veronese, Miguel Correia, Alysson Bessani, and Lau Cheuk Lung. Spin one’s wheels? Byzantine fault tolerance with a spinning primary. In SRDS ’09: The 28th IEEE Symposium on Reliable Distributed Systems, Niagara Falls, USA, September 2009.

[63] J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. pages 560–575, 1989.

[64] Benjamin Wester, James Cowling, Edmund B. Nightingale, Peter M. Chen, Jason Flinn, and Barbara Liskov. Tolerating latency in replicated state machines through client speculation. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 245–260, Berkeley, CA, USA, 2009. USENIX Association.

[65] Rich Wolski. Experiences with predicting resource performance on-line in computational grid settings. SIGMETRICS Perform. Eval. Rev., 30(4):41–49, 2003.

[66] Jingmin Zhou, Mark Heckman, Brennen Reynolds, Adam Carlson, and Matt Bishop. Modeling network intrusion detection alerts for correlation. ACM Trans. Inf. Syst. Secur., 10(1):4, 2007.

[67] Piotr Zieliński. Optimistic generic broadcast. In Proceedings of the 19th International Symposium on Distributed Computing, pages 369–383, Kraków, Poland, September 2005.

[68] Piotr Zieliński. Low-latency atomic broadcast in the presence of contention. Distributed Computing, 20(6):435–450, 2008.

[69] ZooKeeper. http://hadoop.apache.org/zookeeper.

Appendix A

Pseudo code of Coordinated Paxos

/* Pseudo code of Coordinated Paxos at server p. Each round of Coordinated Paxos is assigned to one of the servers. The round number is also called the ballot number. Function owner(r) returns the server ID of the owner of round (ballot number) r. Note that this is only the pseudo code for one instance of Coordinated Paxos.
PREPARE(b): a PREPARE message for ballot number b.
ACK(b,ab,av): to acknowledge PREPARE(b); ab is the highest ballot for which the sending server has accepted a value, and av is the value accepted for ballot ab.
PROPOSE(b,v): to propose a value v with ballot number b.
ACCEPT(b,v): to acknowledge PROPOSE(b,v); the sending server has accepted v for ballot number b.

LEARN(v): to inform the other servers that value v has been chosen for this instance of simple consensus. */

/* learner states: */
variable learned ← ⊥ // No value is learned initially.
variable learner_history ← {} // No peer has accepted any value.

/* proposer states: */
variable prepared_history ← {} // No prepared history initially.

/* acceptor states: */ variable prepared_ballot ← 0 // All servers are initially prepared for ballot number 0.

variable accepted_ballot ← −1 // Initially, no ballot is accepted.
variable accepted_value ← ⊥ // Initially, no value is accepted.

/* The coordinator can call either SUGGEST or SKIP. */
procedure SUGGEST(v) // Coordinator proposes value v.
    Broadcast PROPOSE(0,v)
end procedure

procedure SKIP // Coordinator proposes no-op.
    Broadcast PROPOSE(0,no-op)
end procedure

procedure REVOKE // Non-coordinator starts to propose no-op.
    ballot ← Choose b : owner(b) = p ∧ b > prepared_ballot ∧ b > accepted_ballot
    /* Choose a ballot that is owned by p and greater than any ballot p has ever seen. */
    Broadcast PREPARE(ballot) // Start phase 1 with a higher ballot number.
end procedure

OnMessage ANY From q OnCondition learned ≠ ⊥ Begin
    /* Once p has learned a value, it answers any further protocol message with the outcome. */
    if the incoming message is not a LEARN message then
        Send LEARN(learned) To q
    end if
End

OnMessage PREPARE(b) From q OnCondition learned = ⊥ Begin
    if b > prepared_ballot then
        prepared_ballot ← b
        Send ACK(b,accepted_ballot,accepted_value) To q
    end if
End

OnMessage LEARN(v) From q OnCondition learned = ⊥ Begin
    learned ← v
    ONLEARNED(v)
End

OnMessage ACCEPT(b,v) From q OnCondition learned = ⊥ Begin
    if b = 0 then
        /* This ACCEPT message acknowledges a SUGGEST message. */
        ONACCEPTSUGGESTION(q) // An upcall interface for Mencius.
    end if
    learner_history ← learner_history ∪ {⟨b,v,q⟩}
    LSet ← {⟨e1,e2,e3⟩ : e1 = b ∧ ⟨e1,e2,e3⟩ ∈ learner_history}
    if size(LSet) = (n+1)/2 then
        /* A quorum has accepted the value, so the value is chosen. */
        Broadcast LEARN(v)
    end if
End

OnMessage ACK(b,a,v) From q OnCondition learned = ⊥ Begin
    prepared_history ← prepared_history ∪ {⟨b,a,v,q⟩}
    PSet ← {⟨e1,e2,e3,e4⟩ : e1 = b ∧ ⟨e1,e2,e3,e4⟩ ∈ prepared_history}
    if size(PSet) = (n+1)/2 then
        /* A quorum of peers has been prepared; ready to propose. */
        ha ← max{a : ⟨−,a,−,−⟩ ∈ PSet}
        hvset ← {v : ⟨−,ha,v,−⟩ ∈ PSet}
        hv ← Choose v : v ∈ hvset
        /* hv is set to the only element in hvset, since hvset must have exactly one element. */
        if hv = ⊥ then
            /* No value has been chosen yet; propose no-op. */
            Broadcast PROPOSE(b,no-op)
        else
            /* Must propose hv. */
            Broadcast PROPOSE(b,hv)
        end if
    end if
End

OnMessage PROPOSE(b,v) From q OnCondition learned = ⊥ Begin
    if b = 0 ∧ v = no-op then
        /* The coordinator skips; p learns no-op immediately. */
        learned ← no-op
        ONLEARNED(no-op)
    else if prepared_ballot ≤ b ∧ accepted_ballot < b then
        // p accepts (b,v).
        if b = 0 then
            /* This is a SUGGEST message. */
            ONSUGGESTION // An upcall interface for protocol P and Mencius.
        end if
        accepted_ballot ← b
        accepted_value ← v
        Send ACCEPT(b,v) To q
    end if
End
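For readers who want the quorum logic of the ACCEPT handler above in executable form, the following is a minimal Python sketch of the learner-side bookkeeping for a single Coordinated Paxos instance. It is an illustration under simplifying assumptions, not the prototype's implementation: the names QuorumLearner and on_accept are hypothetical, message transport and the LEARN broadcast are elided, and a crash-failure majority quorum is assumed as in the pseudo code.

# Minimal sketch (hypothetical names) of the learner-side bookkeeping in
# the ACCEPT handler of Coordinated Paxos: a value is chosen once a
# majority of the n servers has accepted it for the same ballot.
class QuorumLearner:
    def __init__(self, n):
        self.n = n            # total number of servers
        self.learned = None   # plays the role of `learned`, initially ⊥
        self.history = set()  # plays the role of learner_history

    def on_accept(self, ballot, value, sender):
        # Record ACCEPT(ballot, value) from sender; return the chosen value
        # once n//2 + 1 servers, i.e., (n+1)/2 for odd n, have accepted the
        # same ballot and value, and None before that.
        if self.learned is not None:
            return self.learned
        self.history.add((ballot, value, sender))
        voters = {s for (b, v, s) in self.history if b == ballot and v == value}
        if len(voters) >= self.n // 2 + 1:
            self.learned = value  # chosen; a real server would broadcast LEARN
        return self.learned

# Example: with n = 3, the second matching ACCEPT chooses the value.
learner = QuorumLearner(n=3)
learner.on_accept(0, "request-42", sender=1)
assert learner.on_accept(0, "request-42", sender=2) == "request-42"

Appendix B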

Pseudo code of Protocol P

/* Pseudo code of Protocol P at server p. Protocol P runs a series of Coordinated Paxos instances. It commits a value learned in instance i once all instances prior to i have been learned and committed. The following code handles duplication by checking for duplicates before committing. Other techniques, such as assuming idempotent requests, can also be used.
Note that besides the original arguments, all upcalls/downcalls from/to Coordinated Paxos also have an additional argument i that specifies the simple consensus instance number.
Protocol P provides two APIs to its applications. The application downcalls ONCLIENTREQUEST to submit a request to the state machine. When a value is chosen, P upcalls ONCOMMIT to notify the application.
Function owner(i) returns the coordinator of instance i. Function learned(i) returns a reference to the learned variable of the ith simple consensus instance. */

variable proposed[] // An array that records the value the coordinator initially suggested to an instance. It maps an instance number to a value. Every key is initially mapped to ⊥.
variable index ← min{i : owner(i) = p} // The next instance to suggest a value to.
variable expected ← 0 // The next instance number to commit a value, i.e., the smallest instance whose value is not learned.

procedure ONCLIENTREQUEST(v)
    SUGGEST(index,v) // Rule 1: p suggests v to instance index.
    proposed[index] ← v
    index ← min{i : owner(i) = p ∧ i > index}
end procedure

/* Rule 2: on receiving a SUGGEST message for instance i, p skips all unused instances prior to i. */
procedure ONSUGGESTION(i)
    SkipSet ← {k : k ≥ index ∧ k < i ∧ owner(k) = p}
    for all k in SkipSet do
        SKIP(k) // Skip instance k.
    end for
    index ← min{k : owner(k) = p ∧ k > i}
end procedure

/* Rule 3: When suspecting that q has failed, p revokes all instances that are smaller than index and are coordinated by q. */
procedure ONSUSPECT(q)
    RevokeSet ← {i : owner(i) = q ∧ i < index ∧ learned(i) = ⊥}
    for all k in RevokeSet do
        REVOKE(k) // Revoke instance k.
    end for
end procedure

procedure CHECKCOMMIT // Check if a new value can be committed.
    while learned(expected) ≠ ⊥ do
        v ← learned(expected)
        if v ≠ no-op ∧ v ∉ {learned(i) : 0 ≤ i < expected} then
            /* Commit value v only if it is not a no-op and is not a duplicate. */
            ONCOMMIT(v)
        end if
        expected ← expected + 1
    end while
end procedure

procedure ONLEARNED(i,v) // Upon instance i learning value v.
    if owner(i) = p ∧ proposed[i] ≠ v then
        /* Rule 4: v must be no-op and proposed[i] must be re-suggested. */
        ONCLIENTREQUEST(proposed[i])
    end if
    CHECKCOMMIT
end procedure
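The commit rule of CHECKCOMMIT, deliver learned values strictly in instance order while filtering out no-ops and duplicates, can also be restated as a short Python sketch. This is a sketch under simplifying assumptions rather than the prototype's code: learned is modeled as a dictionary from instance numbers to learned values, NOOP stands in for no-op, and on_commit stands in for the ONCOMMIT upcall.

# Sketch (hypothetical names) of Protocol P's in-order commit rule.
NOOP = object()  # stand-in for the no-op value

def check_commit(learned, expected, on_commit):
    # Deliver learned values in instance order, skipping no-ops and values
    # that duplicate an earlier learned instance, as CHECKCOMMIT does.
    while expected in learned:                 # learned(expected) ≠ ⊥
        v = learned[expected]
        earlier = {learned[i] for i in range(expected) if i in learned}
        if v is not NOOP and v not in earlier:
            on_commit(v)                       # the application upcall
        expected += 1
    return expected  # the smallest instance whose value is not yet learned

# Example: instance 1 is a skip and instance 2 duplicates instance 0.
log = {0: "a", 1: NOOP, 2: "a", 3: "b"}
assert check_commit(log, 0, print) == 4        # prints "a" then "b"

Appendix C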

Pseudo code of Mencius

/* Pseudo code of Mencius at server p. Mencius runs a series of Coordinated Paxos instances. It commits a value learned in instance i once all instances smaller than i have been learned and committed. The following code handles duplication by checking for duplicates before committing. Other techniques, such as assuming idempotent requests, can also be used.
Mencius provides two APIs to its applications. The application downcalls ONCLIENTREQUEST to submit a request to the state machine. When a value is chosen, Mencius upcalls ONCOMMIT to notify the application.
Note that besides the original arguments, all upcalls/downcalls from/to Coordinated Paxos also have an additional argument i that specifies the simple consensus instance number.
Function owner(i) returns the coordinator of instance i. Function learned(i) returns a reference to the learned variable of the ith simple consensus instance. Mencius also uses n timers for Accelerator 1. */

variable proposed[] // An array that records the value the coordinator initially suggested to an instance. It maps an instance number to a value. Every key is initially mapped to ⊥.
variable expected ← 0 // The next instance number to commit a value, i.e., the smallest instance whose value is not learned.
variable index ← min{i : owner(i) = p} // The next instance to suggest a value to.
variable est_index[] // An array that records the estimated index of the other servers. It maps a server ID to an instance number. Initially, est_index[q] ← min{i : owner(i) = q}.

variable need_to_skip[] // An array that records the set of outstanding SKIP messages that need to be sent to a server. It maps a server ID to a set of instance numbers. Every key is initially mapped to an empty set.

procedure ONCLIENTREQUEST(v)
    /* By Optimization 2, SKIP messages will be piggybacked on SUGGEST messages. So, we cancel the timers that were previously set for Accelerator 1 and reset the records of the outstanding SKIP messages. */
    for all q in {0,...,n−1} do
        CANCELTIMER(q) // Cancel the qth timer.
        need_to_skip[q] ← {}
    end for
    SUGGEST(index,v) // Rule 1: p suggests v to instance index.
    proposed[index] ← v
    index ← min{i : owner(i) = p ∧ i > index}
end procedure

procedure ONACCEPTSUGGESTION(i,q) // Upon receiving an ACCEPT message that acknowledges a previous SUGGEST message.
    /* By Optimization 1, SKIP messages are piggybacked on this ACCEPT message. QSkipSet is the set of instances q has skipped. */
    QSkipSet ← {j : est_index[q] ≤ j < i ∧ owner(j) = q}
    for all j in QSkipSet do
        learned(j) ← no-op
    end for
    CHECKCOMMIT
    est_index[q] ← min{j : j > i ∧ owner(j) = q}
end procedure

procedure ONSUGGESTION(i)

    /* Upon receiving a SUGGEST message for instance i. */
    q ← owner(i) // q is the sender of the SUGGEST message.

    /* By Optimization 2, SKIP messages are piggybacked on this SUGGEST message. QSkipSet is the set of instances q has skipped. */
    QSkipSet ← {j : est_index[q] ≤ j < i ∧ owner(j) = q}
    for all j in QSkipSet do
        learned(j) ← no-op
    end for
    CHECKCOMMIT
    est_index[q] ← min{j : j > i ∧ owner(j) = q}
    /* By Rule 2, server p skips all unused instances smaller than i. SkipSet is the set of instances p needs to skip. */
    SkipSet ← {j : j ≥ index ∧ j < i ∧ owner(j) = p}
    for all k in SkipSet do
        learned(k) ← no-op
    end for
    CHECKCOMMIT
    /* p does not send SKIP messages to the other servers immediately. By Optimization 1, p piggybacks the SKIP message to q on the ACCEPT message. */
    for all k in {r : 0 ≤ r < n ∧ r ≠ p ∧ r ≠ q} do
        /* By Optimization 2, SKIP messages to all other servers are not sent immediately; instead they wait for a future SUGGEST message. */
        if need_to_skip[k] = {} then
            need_to_skip[k] ← SkipSet
            SETTIMER(k,τ) // Set the kth timer for Accelerator 1 to trigger τ time units from now.
        else
            need_to_skip[k] ← need_to_skip[k] ∪ SkipSet
        end if
        if size(need_to_skip[k]) > α then
            // By Accelerator 1, propagate the SKIP messages once more than α are outstanding.
            SENDSKIP(k) // Propagate the SKIP messages to server k.
        end if
    end for
    index ← min{j : owner(j) = p ∧ j > i}
end procedure

procedure ONSUSPECT(q)
    /* Rule 3 and Optimization 3: p revokes q for a large block of instances when suspecting that server q has failed. */

    Cq ← min{i : owner(i) = q ∧ learned(i) = ⊥}
    if Cq < index + β then
        RevokeSet ← {i : Cq ≤ i ≤ index + 2β ∧ owner(i) = q ∧ learned(i) = ⊥}
        for all k in RevokeSet do
            REVOKE(k) // Revoke instance k.
        end for
    end if
end procedure

procedure ONLEARNED(i,v) /* Upon instance i learning value v. */
    if owner(i) = p ∧ proposed[i] ≠ v then
        /* Rule 4: v must be no-op and proposed[i] must be re-suggested. */
        ONCLIENTREQUEST(proposed[i])
    end if
    CHECKCOMMIT
end procedure

procedure ONTIMEOUT(k) // The kth timer times out.
    SENDSKIP(k) // Propagate the SKIP messages to server k.
end procedure

procedure SENDSKIP(k)
    CANCELTIMER(k) // Cancel the kth timer.
    for all j in need_to_skip[k] do
        SKIP(j) // Skip instance j.
    end for
    need_to_skip[k] ← {}
end procedure

procedure CHECKCOMMIT // Check if a new value can be committed.
    while learned(expected) ≠ ⊥ do
        v ← learned(expected)
        if v ≠ no-op ∧ v ∉ {learned(i) : 0 ≤ i < expected} then
            // Commit value v only if it is not a no-op and is not a duplicate.
            ONCOMMIT(v)
        end if
        expected ← expected + 1
    end while
end procedure
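The interplay of Optimization 2 and Accelerator 1 above, deferring SKIP messages in the hope of piggybacking them on a later SUGGEST but flushing them once more than α are outstanding or a timer of length τ fires, is sketched below in Python for a single peer. The class SkipBuffer, its method names, and the threading.Timer-based timeout are illustrative assumptions rather than the structure of the actual prototype, and thread-safety is ignored.

import threading

# Sketch (hypothetical names) of deferred SKIP propagation to one peer,
# combining Optimization 2 (piggybacking) with Accelerator 1 (bounds on
# how many SKIPs may stay outstanding, and for how long).
class SkipBuffer:
    def __init__(self, alpha, tau, send_skips):
        self.alpha = alpha            # flush threshold on outstanding SKIPs
        self.tau = tau                # flush timeout, in seconds
        self.send_skips = send_skips  # callback that transmits a set of instances
        self.pending = set()          # outstanding skipped instance numbers
        self.timer = None

    def add_skips(self, instances):
        if not self.pending:          # first outstanding SKIPs: arm the timer
            self.timer = threading.Timer(self.tau, self.flush)
            self.timer.start()
        self.pending |= set(instances)
        if len(self.pending) > self.alpha:
            self.flush()              # Accelerator 1: too many outstanding

    def on_suggest_sent(self):
        # The pending SKIPs were piggybacked on an outgoing SUGGEST
        # (Optimization 2), so drop them and disarm the timer.
        self._cancel_timer()
        self.pending.clear()

    def flush(self):
        self._cancel_timer()
        if self.pending:
            self.send_skips(self.pending)  # propagate the SKIPs explicitly
            self.pending = set()

    def _cancel_timer(self):
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None

# Example: with α = 2, the third deferred SKIP forces an immediate flush.
buf = SkipBuffer(alpha=2, tau=5.0, send_skips=print)
buf.add_skips({3, 6})  # buffered; the timer is armed
buf.add_skips({9})     # 3 > α, so {3, 6, 9} is flushed immediately

Appendix D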

Pseudo code of Coordinated Byzantine Paxos

/* Pseudo code of Coordinated Byzantine Paxos at server p. Note that this is only the pseudo code for one instance of Coordinated Byzantine Paxos. Each round of Coordinated Byzantine Paxos is assigned to one of the servers. Function owner(r) returns the server ID of the owner of round r.
⟨m⟩αp : a message m from p authenticated using the shared secret keys between p and the other processes.
⟨m⟩σp : a message m signed by server p.
⟨m⟩ρp : a message m signed by the A2M of server p.
⟨SUGGEST,v⟩ρp : the coordinator p suggests v, i.e., proposes v to round 0.
⟨SKIP⟩ρp : the coordinator p skips, i.e., proposes no-op to round 0.
⟨NEW-LEADER,rc,r⟩αp : a new leader of round r is elected to revoke the current leader. rc is the revocation certificate.
⟨ACCEPT,v,r⟩αp : a value v is accepted in round r by server p.
⟨ACK,r,v,s,vr⟩σp : p has accepted value v in round vr. s is a set of signatures that verifies v. p also promises not to accept any value in a round smaller than r in the future.
⟨PRE-PREPARE,pc,v,r⟩σp : the new leader p informs the other servers of the proposal v for round r. pc is the progress certificate that vouches for v.
⟨PREPARE,v,r⟩σp : server p has prepared v in round r. */

/* learner states: */
variable learned ← ⊥ // No value is learned initially.
variable history_learner ← {} // No peer has accepted any value.


/* proposer states: */
variable history_prepare ← {} // No prepare history initially.
variable history_ack ← {} // ACK messages collected for the current revocation round.
variable round_revoking ← −1 // Current revocation round.

/* acceptor states: */
variable round ← 0 // The current round, initially 0: no value will be accepted for a round smaller than the current round.
variable accepted_value ← ⊥ // Initially, no value is accepted.
variable accepted_round ← −1 // The round in which a value, if any, is accepted.
variable accepted_signature ← {} // A set of signatures that verifies the accepted value.
variable preprepared ← −1 // The highest round in which a value is prepared.

procedure SUGGEST(v) // The coordinator proposes value v.
    Broadcast ⟨SUGGEST,v⟩ρp // SUGGEST messages are relayed.
end procedure

procedure SKIP // The coordinator proposes no-op.
    Broadcast ⟨SKIP⟩ρp // SKIP messages are relayed.
end procedure

procedure REVOKE(rc) // A non-coordinator starts to revoke once a revocation certificate rc is acquired.
    /* Choose a round that is owned by p and is greater than any round p has ever seen. */
    r ← Choose b : owner(b) = p ∧ b > round ∧ b > accepted_round
    round_revoking ← r
    Broadcast ⟨NEW-LEADER,rc,r⟩αp
end procedure

OnMessage ⟨SKIP⟩ρq From q OnCondition q is the coordinator Begin
    learned ← no-op
    ONLEARNED(no-op)
End

OnMessage ⟨SUGGEST,v⟩ρq From q OnCondition q is the coordinator Begin
    if accepted_round < 0 then
        accepted_value ← v
        accepted_round ← 0
        accepted_signature ← {ρq}
        Broadcast ⟨ACCEPT,v,0⟩αp
    end if
End

OnMessage ⟨ACCEPT,v,r⟩αq From q Begin
    history_learner ← history_learner ∪ {⟨r,v,q⟩}
    LSet ← {⟨e1,e2,e3⟩ : e1 = r ∧ ⟨e1,e2,e3⟩ ∈ history_learner}
    /* A value is learned once 2f+1 matching ACCEPT messages are collected. */
    if size(LSet) = 2f+1 then
        learned ← v
        ONLEARNED(v)
    end if
End

OnMessage ⟨NEW-LEADER,rc,r⟩αq From q Begin
    if r > round then
        round ← r
        Send ⟨ACK,r,accepted_value,accepted_signature,accepted_round⟩σp To q
    end if
End

OnMessage ⟨ACK,r,v,s,vr⟩σq From q Begin
    if r = round_revoking then
        history_ack ← history_ack ∪ {⟨r,v,s,vr⟩}
        pc ← {⟨e1,e2,e3,e4⟩ : e1 = r ∧ ⟨e1,e2,e3,e4⟩ ∈ history_ack}
        /* 2f+1 ACK messages are needed to construct a progress certificate. */
        if size(pc) = 2f+1 then
            pc′ ← {⟨e1,e2,e3,e4⟩ : e2 ≠ ⊥ ∧ ⟨e1,e2,e3,e4⟩ ∈ pc}
            /* If no value has been accepted in pc, the new proposal is no-op. */
            vv ← no-op
            if pc′ ≠ ∅ then
                /* Otherwise choose the value accepted in the highest round. */
                m ← max{vr : ⟨−,−,−,vr⟩ ∈ pc′}
                vv ← Choose e2 : e4 = m ∧ ⟨e1,e2,e3,e4⟩ ∈ pc′
            end if
            Broadcast ⟨PRE-PREPARE,pc,vv,r⟩σp
        end if
    end if
End

OnMessage ⟨PRE-PREPARE,pc,v,r⟩σq From q Begin
    /* A server prepares for at most one value in each round. */
    if preprepared < r ∧ ISVALID(pc,v,r) then
        round ← r
        preprepared ← r
        Broadcast ⟨PREPARE,v,r⟩σp
    end if
End

OnMessage ⟨PREPARE,v,r⟩σq From q Begin
    history_prepare ← history_prepare ∪ {⟨v,r,q,σq⟩}
    PSet ← {⟨e1,e2,e3,e4⟩ : e2 = r ∧ ⟨e1,e2,e3,e4⟩ ∈ history_prepare}
    /* A value is accepted once 2f+1 matching PREPARE messages are collected. */
    if size(PSet) = 2f+1 then
        Broadcast ⟨ACCEPT,v,r⟩αp
        accepted_value ← v
        accepted_round ← r
        accepted_signature ← {σ : ⟨−,−,−,σ⟩ ∈ PSet}
    end if
End
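Finally, the value-selection rule in the ACK handler, propose no-op if no ACK in a complete progress certificate carries an accepted value and otherwise the value accepted in the highest round, fits in a few lines of Python. The Ack tuple and choose_proposal are hypothetical names, and signature verification (the s field and ISVALID) is deliberately elided.

from collections import namedtuple

# One ACK for revocation round r: v is the value (or None) the sender has
# accepted, s is the signature set vouching for v, and vr is the round in
# which v was accepted. Hypothetical names; signatures are not checked.
Ack = namedtuple("Ack", ["r", "v", "s", "vr"])
NOOP = "no-op"

def choose_proposal(acks, f):
    # Given the ACKs collected for one revocation round, return the value
    # the new leader must pre-prepare, or None while fewer than 2f+1 ACKs
    # (a complete progress certificate) have arrived.
    if len(acks) < 2 * f + 1:
        return None
    accepted = [a for a in acks if a.v is not None]
    if not accepted:
        return NOOP                             # nothing accepted: no-op
    return max(accepted, key=lambda a: a.vr).v  # highest accepted round wins

# Example: f = 1, so 2f+1 = 3 ACKs are needed; one reports an accepted value.
acks = [Ack(5, None, set(), -1),
        Ack(5, "request-7", {"sig-a"}, 0),
        Ack(5, None, set(), -1)]
assert choose_proposal(acks, f=1) == "request-7"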