Proof-of-Execution: Reaching Consensus through Fault-Tolerant Speculation

Suyash Gupta, Jelle Hellings, Sajjad Rahnama, Mohammad Sadoghi
Exploratory Systems Lab, Department of Computer Science, University of California, Davis

ABSTRACT

Multi-party data management systems require data sharing among participants. To provide resilient and consistent data sharing, transaction engines rely on Byzantine Fault-Tolerant consensus (bft), which enables operations during failures and malicious behavior. Unfortunately, existing bft protocols are unsuitable for high-throughput applications due to their high computational costs, high communication costs, high client latencies, and/or reliance on twin-paths and non-faulty clients. In this paper, we present the Proof-of-Execution consensus protocol (PoE) that alleviates these challenges. At the core of PoE are out-of-order processing and speculative execution, which allow PoE to execute transactions before consensus is reached among the replicas. With these techniques, PoE manages to reduce the costs of bft in normal cases, while guaranteeing reliable consensus for clients in all cases. We envision the use of PoE in high-throughput multi-party data-management and blockchain systems. To validate this vision, we implement PoE in our efficient ResilientDB fabric and extensively evaluate PoE against several state-of-the-art bft protocols. Our evaluation showcases that PoE achieves up-to-80% higher throughputs than existing bft protocols in the presence of failures.

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

In federated data management, a single common database is managed by many independent stakeholders (e.g., an industry consortium). In doing so, federated data management can ease data sharing and improve data quality [17, 32, 48]. At the core of federated data management is reaching agreement on any updates to the common database in an efficient manner, which enables fast query processing, data retrieval, and data modifications.

One can achieve federated data management by replicating the common database among all participants, that is, by replicating the sequence of transactions that affect the database to all stakeholders. One can do so using commit protocols designed for distributed databases, such as two-phase [22] and three-phase commit [49], or by using crash-resilient replication protocols such as Paxos [39] and Raft [45].

These solutions are error-prone in a federated decentralized environment in which each stakeholder manages its own replicas and replicas of each stakeholder can fail (e.g., due to software, hardware, or network failure) or act maliciously: commit protocols and replication protocols can only deal with crashes. Consequently, recent federated designs propose the usage of Byzantine Fault-Tolerant (bft) consensus protocols. bft consensus aims at ordering client requests among a set of replicas, some of which could be Byzantine, such that all non-faulty replicas reach agreement on a common order for these requests [9, 21, 29, 38, 51]. Furthermore, bft consensus comes with the added benefit of democracy, as bft consensus gives all replicas an equal vote in all agreement decisions, while the resilience of bft can aid in dealing with the billions of dollars of losses associated with prevalent attacks on data management systems [44].

Akin to commit protocols, the majority of bft consensus protocols use a primary-backup model in which one replica is designated the primary that coordinates agreement, while the remaining replicas act as backups and follow the protocol [46]. This primary-backup bft consensus was first popularized by the influential Pbft consensus protocol of Castro and Liskov [9]. The design of Pbft requires at least 3f + 1 replicas to deal with up-to-f malicious replicas and operates in three communication phases, two of which necessitate quadratic communication complexity. As such, Pbft is considered costly when compared to commit or replication protocols, which has negatively impacted the usage of bft consensus in large-scale data management systems [8].

The recent interest in blockchain technology has revived interest in bft consensus, has led to several new resilient data management systems (e.g., [3, 18, 29, 43]), and has led to the development of new bft consensus protocols that promise efficiency at the cost of flexibility (e.g., [21, 28, 38, 51]). Despite the existence of these modern bft consensus protocols, the majority of bft-fueled systems [3, 18, 29, 43] still employ the classical time-tested, flexible, and safe design of Pbft.

In this paper, we explore different design principles that can enable implementing a scalable and reliable agreement protocol that shields against malicious attacks. We use these design principles to introduce Proof-of-Execution (PoE), a novel bft protocol that achieves resilient agreement in just three linear phases. To construct PoE's scalable and resilient design, we start with Pbft and successively add four design elements:

(I1) Non-Divergent Speculative Execution. In Pbft, when the primary replica receives a client request, it forwards that request to the backups. Each backup, on receiving a request from the primary, agrees to support it by broadcasting a prepare message. When a replica receives prepare messages from a majority of other replicas, it marks itself as prepared and broadcasts a commit message. Each replica that has prepared, and receives commit messages from a majority of other replicas, executes the request. Evidently, Pbft requires two phases of all-to-all communication. Our first ingredient towards faster consensus is speculative execution. In Pbft terminology, PoE replicas execute requests after they get prepared, that is, they do not broadcast commit messages. This speculative execution is non-divergent, as each replica has a partial guarantee—it has prepared—prior to execution.

(I2) Safe Rollbacks and Robustness under Failures. Due to speculative execution, a malicious primary in PoE can ensure that only a subset of replicas prepare and execute a request. Hence, a client may or may not receive a sufficient number of matching responses. PoE ensures that if a client receives a full proof-of-execution, consisting of responses from a majority of the non-faulty replicas, then such a request persists in time. Otherwise, PoE permits replicas to rollback their state if necessary. This proof-of-execution is the cornerstone of the correctness of PoE.

(I3) Agnostic Signatures and Linear Communication. bft protocols are run among distrusting parties. To provide security, these protocols employ cryptographic primitives for signing the messages and generating message digests. Prior works have shown that the choice of cryptographic signature scheme can impact the performance of the underlying system [9, 30]. Hence, we allow replicas to either employ message authentication codes (MACs) or threshold signatures (TSs) for signing [36]. When few replicas are participating in consensus (up to 16), a single phase of all-to-all communication is inexpensive and using MACs for such setups can make computations cheap. For larger setups, we employ TSs to achieve linear communication complexity. TSs permit us to split a phase of all-to-all communication into two linear phases [21, 51].

(I4) Avoid Response Aggregation. SBFT [21], a recently-proposed bft protocol, suggests the use of a single replica (designated as the executor) to act as a response aggregator. In specific, all replicas execute each client request and send their response to the executor. It is the duty of the executor to reply to the client and send a proof that a majority of the replicas not only executed this request, but also outputted the same result. In PoE, we avoid this additional communication between the replicas by allowing each replica to respond directly to the client.

In specific, we make the following contributions:

(1) We introduce PoE, a novel Byzantine fault-tolerant consensus protocol that uses speculative execution to reach agreement among replicas.
(2) To guarantee failure recovery in the presence of speculative execution and Byzantine behavior, we introduce a novel view-change protocol that can rollback requests.
(3) PoE supports batching and out-of-order processing, is signature-scheme agnostic, and can be made to employ either MACs or threshold signatures.
(4) PoE does not rely on non-faulty replicas, clients, or trusted hardware to achieve safe and efficient consensus.
(5) To validate our vision of using PoE in resilient federated data management systems, we implement PoE and four other bft protocols (Zyzzyva, Pbft, SBFT, and HotStuff) in our efficient ResilientDB fabric [23-25, 27, 29, 30, 47]. (ResilientDB is open-sourced at https://github.com/resilientdb.)
(6) We extensively evaluate PoE against these protocols on a Google Cloud deployment consisting of 91 replicas and 320k clients under (i) no failure, (ii) backup failure, (iii) primary failure, (iv) batching of requests, (v) zero payload, and (vi) scaling the number of replicas. Further, to prove the correctness of our results, we also stress test PoE and the other protocols in a simulated environment. Our results show that PoE can achieve up to 80% more throughput than existing bft protocols in the presence of failures.

To the best of our knowledge, PoE is the first protocol that achieves consensus in only two phases while being able to deal with Byzantine failures and without relying on trusted clients (e.g., Zyzzyva [38]) or on trusted hardware (e.g., MinBFT [50]). Hence, PoE can serve as a drop-in replacement of Pbft to improve scalability and performance in permissioned blockchain fabrics such as our ResilientDB fabric [27-31], MultiChain [20], and Hyperledger Fabric [4]; in multi-primary meta-protocols such as RCC [26, 28]; and in sharding protocols such as AHL [15].

Protocol        | Phases | Messages     | Resilience | Requirements
Zyzzyva         | 1      | O(n)         | 0          | Reliable clients; unsafe
PoE (our paper) | 3      | O(3n)        | f          | Sign. agnostic
Pbft            | 3      | O(n + 2n^2)  | f          | -
HotStuff        | 8      | O(8n)        | f          | Sequential consensus
SBFT            | 5      | O(5n)        | 0          | Optimistic path

Figure 1: Comparison of bft consensus protocols in a system with n replicas of which f are faulty. The costs given are for the normal-case behavior.

2 ANALYSIS OF DESIGN PRINCIPLES

To arrive at an optimal design for PoE, we studied practices followed by state-of-the-art distributed data management systems and applied their principles to the design of PoE where possible. In Figure 1, we present a comparison of PoE against four well-known resilient consensus protocols.

To illustrate the merits of PoE's design, we first briefly look at Pbft. The last phase of Pbft ensures that non-faulty replicas only execute requests and inform clients when there is a guarantee that such a transaction will be recovered after any failures. Hence, clients need to wait for only f + 1 identical responses, of which at-least one is from a non-faulty replica, to ensure guaranteed execution. By eliminating this last phase, replicas speculatively execute requests before obtaining recovery guarantees. This impacts Pbft-style consensus in two ways:

(1) First, clients need a way to determine a proof-of-execution after which they have a guarantee that their requests are executed and maintained by the system. We shall show that such a proof-of-execution can be obtained using nf >= 2f + 1 identical responses (instead of f + 1 responses).
(2) Second, as requests are executed before they are guaranteed, replicas need to be able to rollback requests that are dropped during periods of recovery.

PoE's speculative execution guarantees that requests with a proof-of-execution will never rollback and that only a single request can obtain a proof-of-execution per round. Hence, speculative execution provides the same strong consistency (safety) of Pbft in all cases, this at much lower cost under normal operations. Furthermore, we show that speculative execution is fully compatible with other scalable design principles applied to Pbft, e.g., batching and out-of-order processing to maximize throughput, even with high message delays.

Out-of-order execution. Typical bft systems follow the order-execute model: first, replicas agree on a unique order of the client requests, and only then do they execute the requests in that order [9, 21, 29, 38, 51]. Unfortunately, this prevents these systems from providing any support for concurrent execution. A few bft systems suggest executing prior to ordering, but even such systems need to re-verify their results prior to committing changes [4, 35]. Our PoE protocol lies between these two extremes: the replicas speculatively execute using only partial ordering guarantees. By doing so, PoE can eliminate communication costs and minimize latencies of typical bft systems, this without needing to re-verify results in the normal case.

Out-of-order processing. Although bft consensus typically executes requests in-order, this does not imply that replicas need to process proposals to order requests sequentially. To maximize throughput, Pbft and other primary-backup protocols support out-of-order processing in which all available bandwidth of the primary is used to continuously propose requests (even when previous proposals are still being processed by the system). By doing so, out-of-order processing can eliminate the impact of high message delays. To provide out-of-order processing, all replicas will process any request proposed as the k-th request whenever k is within some active window bounded by a low-watermark and a high-watermark [9]. These watermarks are increased as the system progresses. The size of this active window is—in practice—only limited by the memory resources available to replicas. As out-of-order processing is an essential technique to deliver high throughputs in environments with high message delays, we have included out-of-order processing in the design of PoE.
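To make the active-window mechanism concrete, the following is a minimal sketch in Python of how a replica could gate out-of-order proposals. The class and attribute names (ActiveWindow, low_watermark, window_size) are illustrative assumptions and are not taken from any implementation discussed in this paper.

    # Minimal sketch of an out-of-order processing window (illustrative only).
    class ActiveWindow:
        def __init__(self, window_size):
            self.window_size = window_size   # in practice bounded by available memory
            self.low_watermark = 0           # first sequence number that is still open
            self.executed_upto = -1          # highest consecutively executed k

        def accepts(self, k):
            # A replica processes a k-th proposal only if k lies inside the window
            # [low_watermark, low_watermark + window_size).
            return self.low_watermark <= k < self.low_watermark + self.window_size

        def advance(self, k_executed):
            # As transactions are executed (or checkpointed), the watermarks move up,
            # opening space for further out-of-order proposals.
            self.executed_upto = max(self.executed_upto, k_executed)
            self.low_watermark = self.executed_upto + 1

Proposals outside the window are buffered or ignored until the watermarks advance; a larger window masks higher message delays at the cost of memory and, as discussed in Section 3.5, of a larger number of transactions that may have to be rolled back.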
Twin-path consensus. Speculative execution employed by PoE is different from the twin-path model utilized by Zyzzyva [38] and SBFT [21]. These twin-path protocols have an optimistic fast path that works only if none of the replicas are faulty and require outside aid to determine whether these optimistic conditions hold.

In the fast path of Zyzzyva, primaries propose requests, and backups directly execute such proposals and inform the client (without further coordination). The client waits for responses from all n replicas before marking the request executed. When the client does not receive n responses, it timeouts and sends a message to all replicas, after which the replicas perform an expensive client-dependent slow-path recovery process (which is prone to errors when communication is unreliable [2]).

The fast path of SBFT can deal with up to c crash-failures using 3f + 2c + 1 replicas and uses threshold signatures to make communication linear. The fast path of SBFT requires a reliable collector and executor to aggregate messages and to send only a single (instead of at-least f + 1) response to the client. Due to aggregating execution, the fast path of SBFT still performs four rounds of communication before the client gets a response, whereas PoE only uses two rounds of communication (or three when PoE uses threshold signatures). If the fast path timeouts (e.g., the collector or executor fails), then SBFT falls back to a threshold-version of Pbft that takes an additional round before the client gets a response. Twin-path consensus is in sharp contrast with the design of PoE, which does not need outside aid (reliable clients, collectors, or executors), and can operate optimally even while dealing with replica failures.

Primary rotation. To minimize the influence of any single replica on bft consensus, HotStuff opts to replace the primary after every consensus decision. To efficiently do so, HotStuff uses an extra communication phase (as compared to Pbft), which minimizes the cost of primary replacement. Furthermore, HotStuff uses threshold signatures to make its communication linear (resulting in eight communication phases before a client gets responses). The event-based version of HotStuff can overlap phases of consecutive rounds, thereby assuring that consensus of a client request starts in every one-to-all-to-one communication phase. Unfortunately, the primary replacements require that all consensus rounds are performed in a strictly sequential manner, eliminating any possibility of out-of-order processing.

3 PROOF-OF-EXECUTION

In our Proof-of-Execution consensus protocol (PoE), the primary replica is responsible for proposing transactions requested by clients to all backup replicas. Each backup replica speculatively executes these transactions with the belief that the primary is behaving correctly. Speculative execution expedites the processing of transactions in all cases. Finally, when malicious behavior is detected, replicas can recover by rolling back transactions, which ensures correctness without depending on any twin-path model.

3.1 System model and notations

Before providing a full description of our PoE protocol, we present the system model we use and the relevant notations.

A system is a set R of replicas that process client requests. We assign each replica r in R a unique identifier id(r) with 0 <= id(r) < |R|. We write F, a subset of R, to denote the set of Byzantine replicas that can behave in arbitrary, possibly coordinated and malicious, manners. We assume that non-faulty replicas (those in R \ F) behave in accordance to the protocol and are deterministic: on identical inputs, all non-faulty replicas must produce identical outputs. We do not make any assumptions on clients: all clients can be malicious without affecting PoE. We write n = |R|, f = |F|, and nf = |R \ F| to denote the number of replicas, faulty replicas, and non-faulty replicas, respectively. We assume that n > 3f (nf > 2f).

We assume authenticated communication: Byzantine replicas are able to impersonate each other, but replicas cannot impersonate non-faulty replicas. Authenticated communication is a minimal requirement to deal with Byzantine behavior. Depending on the type of message, we use message authentication codes (MACs) or threshold signatures (TSs) to achieve authenticated communication [36]. MACs are based on symmetric cryptography in which every pair of communicating nodes has a secret key. We expect non-faulty replicas to keep their secret keys hidden. TSs are based on asymmetric cryptography. In specific, each replica holds a distinct private key, which it can use to create a signature share. Next, one can produce a valid threshold signature given at least nf such signature shares (from distinct replicas). We write s<v>_i to denote the signature share of the i-th replica for signing value v. Anyone that receives a set T = {s<v>_j | j in T'} of signature shares for v from |T'| = nf distinct replicas can aggregate T into a single signature <v>, which can then be verified using a public key.

We also employ a collision-resistant cryptographic hash function D(.) that can map an arbitrary value v to a constant-sized digest D(v) [36]. We assume that it is practically impossible to find another value v', v != v', such that D(v) = D(v'). We use the notation v||w to denote the concatenation of two values v and w.

Next, we define the consensus provided by PoE.

Definition 3.1. A single run of any consensus protocol should satisfy the following requirements:
Termination. Each non-faulty replica executes a transaction.
Non-divergence. All non-faulty replicas execute the same transaction.

Termination is typically referred to as liveness, whereas non-divergence is typically referred to as safety. In PoE, execution is speculative: replicas can execute and rollback transactions. To provide safety, PoE provides speculative non-divergence instead of non-divergence:

Speculative non-divergence. If nf - f >= f + 1 non-faulty replicas accept and execute the same transaction T, then all non-faulty replicas will eventually accept and execute T (after rolling back any other executed transactions).

To provide safety, we do not need any other assumptions on communication or on clients. Due to well-known impossibility results for asynchronous consensus [19], we can only provide liveness in periods of reliable bounded-delay communication during which all messages sent by non-faulty replicas will arrive at their destination within some maximum delay.

3.2 The Normal-Case Algorithm of PoE

PoE operates in views v = 0, 1, .... In view v, the replica r with id(r) = v mod n is elected as the primary. The design of PoE relies on authenticated communication, which can be provided using MACs or TSs. In Figure 2, we sketch the normal-case working of PoE for both cases. For the sake of brevity, we will describe PoE built on top of TSs, which results in a protocol with low—linear—message complexity in the normal case. The full pseudo-code for this algorithm can be found in Figure 3. In Section 3.6, we detail the minimal changes to PoE necessary when switching to MACs.

[Figure 2 sketch: message-flow diagrams of the normal case between client c, primary p, replicas r1 and r2, and Byzantine replica b, with phases propose, support, certify, and inform, for (a) PoE using MACs and (b) PoE using TSs.]

Figure 2: Normal-case algorithm of PoE: Client c sends its request containing transaction T to the primary p, which proposes this request to all replicas. Although replica b is Byzantine, it fails to affect PoE.

Client-role (used by client c to request transaction T):
1: Send <T>_c to the primary p.
2: Await receipt of messages inform(<T>_c, v, k, r) from nf replicas.
3: Considers T executed, with result r, as the k-th transaction.

Primary-role (running at the primary p of view v, id(p) = v mod n):
4: Let view v start after execution of the k-th transaction.
5: event p awaits receipt of message <T>_c from client c do
6:   Broadcast propose(<T>_c, v, k) to all replicas.
7:   k := k + 1.
8: end event
9: event p receives nf messages support(s<h>_i, v, k) such that:
     (1) each message was sent by a distinct replica, i in {1, ..., n}; and
     (2) all s<h>_i in this set can be combined to generate signature <h>
   do
10:  Broadcast certify(<h>, v, k) to all replicas.
11: end event

Backup-role (running at every i-th replica r):
12: event r receives message m := propose(<T>_c, v, k) such that:
      (1) v is the current view;
      (2) m is sent by the primary of v; and
      (3) r did not accept a k-th proposal in v
    do
13:  Compute h := D(<T>_c || v || k).
14:  Compute signature share s<h>_i.
15:  Transmit support(s<h>_i, v, k) to p.
16: end event
17: event r receives message certify(<h>, v, k) from p such that:
      (1) r transmitted support(s<h>_i, v, k) to p; and
      (2) <h> is a valid threshold signature
    do
18:  View-commit T, the k-th transaction of v (VCommit_r(<T>_c, v, k)).
19: end event
20: event r logged VCommit_r(<T>_c, v, k) and has logged Execute_r(t', v', k') for all 0 <= k' < k do
21:  Execute T as the k-th transaction of v (Execute_r(<T>_c, v, k)).
22:  Let r be the result of execution of T (if there is any result).
23:  Send inform(D(<T>_c), v, k, r) to c.
24: end event

Figure 3: The normal-case algorithm of PoE.

Consider a view v with primary p. To request execution of transaction T, a client c signs transaction T and sends the signed transaction <T>_c to p. The usage of signatures assures that malicious primaries cannot forge transactions. To initiate replication and execution of T as the k-th transaction, the primary proposes T to all replicas via a propose message.

After the i-th replica r receives a propose message m from p, it checks whether at least nf other replicas received the same proposal from primary p. This check assures r that at least nf - f non-faulty replicas received the same proposal, which will play a central role in achieving speculative non-divergence. To perform this check, each replica supports the first proposal m it receives from the primary by computing a signature share s<m>_i and sending a support message containing this share to the primary.

The primary p waits for support messages with valid signature shares from nf distinct replicas, which can then be aggregated into a single signature <m>. After generating such a signature, the primary broadcasts this signature to all replicas via a certify message.
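The following minimal Python sketch illustrates the share-aggregation step just described. The helpers verify_share and aggregate_shares stand in for a threshold-signature library (the evaluation in Section 5.2 uses BLS signatures); they and the message encoding are illustrative assumptions, not the ResilientDB interfaces.

    # Minimal sketch of the primary collecting support shares for the k-th proposal
    # of view v and turning them into a certify message (illustrative only).
    def collect_support(pending, sender, share, v, k, nf, verify_share, aggregate_shares):
        if not verify_share(share, sender, (v, k)):
            return None                              # drop invalid or forged shares
        shares = pending.setdefault((v, k), {})
        shares.setdefault(sender, share)             # at most one share per replica
        if len(shares) < nf:
            return None                              # not enough distinct supporters yet
        # nf valid shares from distinct replicas can be combined into the threshold
        # signature <h>, which is broadcast to all replicas in a certify message.
        threshold_signature = aggregate_shares(list(shares.values()))
        return ("certify", threshold_signature, v, k)

With MACs instead of threshold signatures, this aggregation step disappears and is replaced by the all-to-all support phase described in Section 3.6.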

After a replica r receives a valid certify message, it view-commits to T as the k-th transaction in view v. The replica logs this view-commit decision as VCommit_r(<T>_c, v, k). After r view-commits to T, r schedules T for speculative execution as the k-th transaction of view v. Consequently, T will be executed by r after all preceding transactions are executed. We write Execute_r(<T>_c, v, k) to log this execution.

After execution, r informs the client of the order of execution and of the execution result r (if any) via a message inform. In turn, client c will wait for a proof-of-execution for the transaction T it requested, which consists of identical inform messages from nf distinct replicas. This proof-of-execution guarantees that at least nf - f >= f + 1 non-faulty replicas executed T as the k-th transaction and, in Section 3.3, we will see that such transactions are always preserved by PoE when recovering from failures.

If client c does not know the current primary or does not get any timely response for its requests, then it can broadcast its request <T>_c to all replicas. The non-faulty replicas will then forward this request to the current primary (if T is not yet executed) and ensure that the primary initiates a successful proposal of this request in a timely manner.

To prove correctness of PoE in all cases, we will need the following technical safety-related property of view-commits.

Proposition 3.2. Let r_i, i in {1, 2}, be two non-faulty replicas that view-committed to <T_i>_{c_i} as the k-th transaction of view v (VCommit_{r_i}(<T_i>_{c_i}, v, k)). If n > 3f, then <T_1>_{c_1} = <T_2>_{c_2}.

Proof. Replica r_i only view-committed to <T_i>_{c_i} after r_i received certify(<h>, v, k) from the primary p (Line 17 of Figure 3). This message includes a threshold signature <h>, whose construction requires signature shares from a set S_i of nf distinct replicas. Let X_i = S_i \ F be the non-faulty replicas in S_i. As |S_i| = nf and |F| = f, we have |X_i| >= nf - f. The non-faulty replicas in X_i will only send a single support message for the k-th transaction in view v (Line 12 of Figure 3). Hence, if <T_1>_{c_1} != <T_2>_{c_2}, then X_1 and X_2 must not overlap and nf >= |X_1 union X_2| >= 2(nf - f) must hold. As n = nf + f, this simplifies to 3f >= n, which contradicts n > 3f. Hence, we conclude <T_1>_{c_1} = <T_2>_{c_2}.

We will later use Proposition 3.2 to show that PoE provides speculative non-divergence. Next, we look at typical cases in which the normal-case of PoE is interrupted:

Example 3.3. A malicious primary can try to affect PoE by not conforming to the normal-case algorithm in the following ways:
(1) By sending proposals for different transactions to different non-faulty replicas. In this case, Proposition 3.2 guarantees that at most a single such proposed transaction will get view-committed by any non-faulty replica.
(2) By keeping some non-faulty replicas in the dark by not sending proposals to them. In this case, the remaining non-faulty replicas can still end up view-committing the transactions as long as at least nf - f non-faulty replicas receive proposals: the faulty replicas in F can take over the role of up to f non-faulty replicas left in the dark (giving the false illusion that the non-faulty replicas in the dark are malicious).
(3) By preventing execution by not proposing a k-th transaction, even though transactions following the k-th transaction are being proposed.

When the network is unreliable and messages do not get delivered (or not on time), then the behavior of a non-faulty primary can match that of the malicious primary in the above example. Indeed, failure of the normal-case of PoE has only two possible causes: primary failure and unreliable communication. If communication is unreliable, then there is no way to guarantee continuous service [19]. Hence, replicas simply assume failure of the current primary if the normal-case behavior of PoE is interrupted, while the design of PoE guarantees that unreliable communication does not affect the correctness of PoE.

To deal with primary failure, each replica maintains a timer for each request. If this timer expires (timeout) and the replica has not been able to execute the request, it assumes that the primary is malicious. To deal with such a failure, the replicas will replace the primary. Next, we present the view-change algorithm that performs primary replacement.
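The client behavior described earlier in this section—waiting for a proof-of-execution and, on timeout, broadcasting the request to all replicas—can be summarized by the following minimal Python sketch; recv_inform, broadcast_request, and the message layout are illustrative assumptions.

    # Minimal sketch of a client gathering a proof-of-execution (illustrative only).
    import time

    def await_proof_of_execution(recv_inform, broadcast_request, request, nf, timeout_s):
        deadline = time.time() + timeout_s
        supporters = {}                                  # (digest, v, k, result) -> replica ids
        while time.time() < deadline:
            msg = recv_inform(deadline - time.time())    # (replica_id, digest, v, k, result) or None
            if msg is None:
                continue
            replica_id, digest, v, k, result = msg
            key = (digest, v, k, result)
            supporters.setdefault(key, set()).add(replica_id)
            if len(supporters[key]) >= nf:
                # nf identical inform messages form the proof-of-execution:
                # T is executed as the k-th transaction with the given result.
                return k, result
        # No timely proof-of-execution: broadcast the request to all replicas so that
        # they forward it to the current primary and, if the primary stays silent,
        # eventually detect its failure.
        broadcast_request(request)
        return None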

3.3 The View-Change Algorithm

If PoE observes failure of the primary p of view v, then PoE will elect a new primary and move to the next view, view v + 1, via the view-change algorithm. The goals of the view-change are
(1) to assure that each request that is considered executed by any client is preserved under all circumstances; and
(2) to assure that the replicas are able to agree on a new view whenever communication is reliable.

As described in the previous section, a client will consider its request executed if it receives a proof-of-execution consisting of identical inform responses from at-least nf distinct replicas. Of these nf responses, at-most f can come from faulty replicas. Hence, a client can only consider its request executed whenever the requested transaction was executed (and view-committed) by at-least nf - f >= f + 1 non-faulty replicas in the system. We note the similarity with the view-change algorithm of Pbft, which will preserve any request that is prepared by at-least nf - f >= f + 1 non-faulty replicas.

The view-change algorithm of PoE consists of three steps. First, failure of the current primary p needs to be detected by all non-faulty replicas. Second, all replicas exchange information to establish which transactions were included in view v and which were not. Third, the new primary p' proposes a new view. This new-view proposal contains a list of the transactions executed in the previous views (based on the information exchanged earlier). Finally, if the new-view proposal is valid, then replicas switch to this view; otherwise, replicas detect failure of p' and initiate a view-change for the next view (v + 2). The communication of the view-change algorithm of PoE is sketched in Figure 4 and the full pseudo-code of the algorithm can be found in Figure 5. Next, we discuss each step in detail.

[Figure 4 sketch: message-flow diagram in which the next primary p' and replica r1 detect the failure of the faulty primary b and broadcast vc-request (detection), replica r2 joins via vc-request (join), and p' broadcasts nv-propose, after which the replicas enter view v + 1.]

Figure 4: The current primary b of view v is faulty and needs to be replaced. The next primary, p', and the replica r1 detected this failure first and request a view-change via vc-request messages. The replica r2 joins these requests.

3.3.1 Failure Detection and View-Change Requests. If a replica r detects failure of the primary of view v, then it halts the normal-case algorithm of PoE for view v and informs all other replicas of this failure by requesting a view-change. The replica r does so by broadcasting a message vc-request(v, E), in which E is a summary of all transactions executed by r (Figure 5, Line 1). Each replica r can detect the failure of the primary in two ways:
(1) r timeouts while expecting normal-case operations toward executing a client request. E.g., when r forwards a client request to the current primary, and the current primary fails to propose this request on time.
(2) r receives vc-request messages, indicating that the primary of view v failed, from f + 1 distinct replicas. As at most f of these messages can come from faulty replicas, at least one non-faulty replica must have detected a failure. In this case, r joins the view-change (Figure 5, Line 6).

3.3.2 Proposing the New View. To start view v + 1, the new primary p' (with id(p') = (v + 1) mod n) needs to propose a new view by determining a valid list of requests that need to be preserved. To do so, p' waits until it receives sufficient information. In specific, p' waits until it received valid vc-request messages from a set S, a subset of R, of |S| = nf distinct replicas.

An i-th view-change request m_i is considered valid if it includes a consecutive sequence of pairs (c, <T>_c), where c is a valid certify message for request <T>_c. Such a set S is guaranteed to exist when communication is reliable, as all non-faulty replicas will participate in the view-change algorithm. The new primary collects the set S of |S| = nf valid vc-request messages and proposes them in a new-view message nv-propose to all replicas.

3.3.3 Move to the New View. After a replica r receives an nv-propose message containing a new-view proposal from the new primary p', r validates the content of this message. From the set of vc-request messages in the new-view proposal, r chooses, for each k, the pair (certify(<h>, w, k), <T>_c) proposed in the most-recent view w. Furthermore, r determines the total number of such requests, k_max. Then, r view-commits and executes all k_max chosen requests that happened before view v + 1. Notice that replica r can skip execution of any transaction it already executed. If r executed transactions not included in the new-view proposal, then r needs to rollback these transactions before it can proceed executing requests in view v + 1. After these steps, r can switch to the new view v + 1. In the new view, the new primary p' starts by proposing the (k_max + 1)-th transaction.

When moving into the new view, we see the cost of speculative execution: some replicas can be forced to rollback execution of transactions:

Example 3.4. Consider a system with non-faulty replica r. When deciding the k-th request, communication became unreliable, due to which only r received a certify message for request <T>_c. Consequently, r speculatively executes T and informs the client c. During the view-change, all other replicas—none of which have a certify message for <T>_c—provide their local state to the new primary, which proposes a new view that does not include any k-th request. Hence, the new primary will start its view by proposing client request <T'>_{c'} as the k-th request, which gets accepted. Consequently, r needs to rollback execution of T. Luckily, this is not an issue: the client c only got at-most f + 1 < nf responses for its request, does not yet have a proof-of-execution, and, consequently, does not consider T executed.

In practice, rollbacks can be supported by, e.g., undoing the operations of a transaction in reverse order, or by reverting to an old state. For the correct working of PoE, the exact working of rollbacks is not important as long as the execution layer provides support for rollbacks.

vc-request (used by replica r to request view-change):
1: event r detects failure of the primary do
2:   r halts the normal-case algorithm of Figure 3 for view v.
3:   E := {(certify(<h>, w, k), <T>_c) | w <= v and Execute_r(<T>_c, w, k) and h = D(<T>_c || w || k)}.
4:   Broadcast vc-request(v, E) to all replicas.
5: end event
6: event r receives f + 1 messages vc-request(v_i, E_i) such that:
     (1) each message was sent by a distinct replica; and
     (2) v_i, 1 <= i <= f + 1, is the current view
   do
7:   r detects failure of the primary (join).
8: end event

On receiving nv-propose (used by replica r):
9: event r receives m = nv-propose(v + 1, m_1, m_2, ..., m_nf) do
10:  if m is a valid new-view proposal (similar to creating nv-propose) then
11:    Derive the transactions N for the new view from m_1, m_2, ..., m_nf.
12:    Rollback any executed transactions not included in N.
13:    Execute the transactions in N not yet executed.
14:    Move into view v + 1 (see Section 3.3.3 for details).
15:  end if
16: end event

nv-propose (used by replica p' that will act as the new primary):
17: event p' receives nf messages m_i = vc-request(v_i, E_i) such that:
      (1) these messages are sent by a set S, |S| = nf, of distinct replicas;
      (2) for each m_i, 1 <= i <= nf, sent by replica q_i in S, E_i consists of a consecutive sequence of entries (certify(<h>, v, k), <T>_c);
      (3) v_i, 1 <= i <= nf, is the current view v; and
      (4) p' is the next primary (id(p') = (v + 1) mod n)
    do
18:  Broadcast nv-propose(v + 1, m_1, m_2, ..., m_nf) to all replicas.
19: end event

Figure 5: The view-change algorithm of PoE.
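To illustrate Lines 9-16 of Figure 5, the following minimal Python sketch shows how a replica could apply a valid new-view proposal. The flattened entry layout (k, w, request) and the rollback/execute callbacks are illustrative assumptions about the execution layer, not part of the protocol specification above.

    # Minimal sketch of moving to a new view (illustrative only).
    def apply_new_view(vc_requests, executed, rollback, execute):
        # For every sequence number k, keep the request certified in the most recent view w.
        chosen = {}
        for entries in vc_requests:                      # one entry list per vc-request
            for (k, w, request) in entries:
                if k not in chosen or w > chosen[k][0]:
                    chosen[k] = (w, request)
        k_max = max(chosen, default=-1)
        # Rollback speculatively executed transactions that are not preserved,
        # e.g., by undoing their operations in reverse order (Section 3.3.3).
        for k in sorted(executed, reverse=True):
            if k not in chosen or chosen[k][1] != executed[k]:
                rollback(k)
                del executed[k]
        # Execute all preserved requests that have not been executed yet, in order.
        for k in sorted(chosen):
            if k not in executed:
                execute(k, chosen[k][1])
                executed[k] = chosen[k][1]
        return k_max + 1    # the new primary starts proposing from this sequence number

A request with a proof-of-execution is executed and certified at nf - f >= f + 1 non-faulty replicas and therefore always survives this selection, which is exactly the property proven in Proposition 3.6 below.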
3.4 Correctness of PoE

First, we show that the normal-case algorithm of PoE provides non-divergent speculative consensus when the primary is non-faulty and communication is reliable.

Theorem 3.5. Consider a system in view v, in which the first k - 1 transactions have been executed by all non-faulty replicas, in which the primary is non-faulty, and communication is reliable. If the primary received <T>_c, then the primary can use the algorithm in Figure 3 to ensure that
(1) there is non-divergent execution of T;
(2) c considers T executed as the k-th transaction; and
(3) c learns the result of executing T (if any),
this independent of any malicious behavior by faulty replicas.

Proof. Each non-faulty primary would follow the algorithm of PoE described in Figure 3 and send propose(<T>_c, v, k) to all replicas (Line 6). In response, all nf non-faulty replicas will compute a signature share and send a support message to the primary (Line 15). Consequently, the primary will receive signature shares from nf replicas and will combine them to generate a threshold signature <h>. The primary will include this signature <h> in a certify message and broadcast it to all replicas. Each replica will successfully verify <h> and will view-commit to T (Line 17). As the first k - 1 transactions have already been executed, every non-faulty replica will execute T. As all non-faulty replicas behave deterministically, execution will yield the same result r (if any) across all non-faulty replicas. Hence, when the non-faulty replicas inform c, they do so by all sending identical messages inform(D(<T>_c), v, k, r) to c (Line 20-Line 23). As all nf non-faulty replicas executed T, we have non-divergent execution. Finally, as there are at most f faulty replicas, the faulty replicas can only forge up to f invalid inform messages. Consequently, the client c will only receive the message inform(D(<T>_c), v, k, r) from at least nf distinct replicas, and will conclude that T is executed yielding result r (Line 3).

At the core of the correctness of PoE, under all conditions, is that no replica will rollback requests <T>_c for which client c already received a proof-of-execution. We prove this next:

Proposition 3.6. Let <T>_c be a request for which client c already received a proof-of-execution showing that T was executed as the k-th transaction of view v. If n > 3f, then every non-faulty replica that switches to a view v' > v will preserve T as the k-th transaction of view v.

Proof. Client c considers <T>_c executed as the k-th transaction of view v when it received identical inform-messages for T from a set A of |A| = nf distinct replicas (Figure 3, Line 3). Let B = A \ F be the set of non-faulty replicas in A.

Now consider a non-faulty replica r that switches to view v' > v. Before doing so, r must have received a valid proposal m = nv-propose(v', m_1, ..., m_nf) from the primary of view v'. Let C be the set of nf distinct replicas that provided messages m_1, ..., m_nf and let D = C \ F be the set of non-faulty replicas in C. We have |B| >= nf - f and |D| >= nf - f. Hence, using a contradiction argument similar to the one in the proof of Proposition 3.2, we conclude that there must exist a non-faulty replica q in (B intersect D) that executed <T>_c, informed c, and requested a view-change.

To complete the proof, we need to show that <T>_c was proposed and executed in the last view that proposed and view-committed a k-th transaction and, hence, that q will include <T>_c in its vc-request message for view v'. We do so by induction on the difference v' - v. As the base case, we have v' - v = 1, in which case no view after v exists yet and, hence, <T>_c must be the newest k-th transaction available to q. As the induction hypothesis, we assume that all non-faulty replicas will preserve T when entering a new view w, v < w <= w'. Hence, non-faulty replicas participating in view w will not support any k-th transactions proposed in view w. Consequently, no certify messages can be constructed for any k-th transaction in view w. Hence, the new-view proposal for w + 1 will include <T>_c, completing the proof.

As a direct consequence of the above, we have

Corollary 3.7 (Safety of PoE). PoE provides speculative non-divergence if n > 3f.

We notice that the view-change algorithm does not deal with minor malicious behavior (e.g., a single replica left in the dark). Furthermore, the presented view-change algorithm will recover all transactions since the start of the system, which will result in unreasonably large messages when many transactions have already been proposed. In practice, both these issues can be resolved by regularly making checkpoints (e.g., after every 100 requests) and only including requests since the last checkpoint in each vc-request message. To do so, PoE uses a standard fully-decentralized Pbft-style checkpoint algorithm that enables the independent checkpointing and recovery of any request that is executed by at least f + 1 non-faulty replicas whenever communication is reliable [9]. Finally, utilizing the view-change algorithm and checkpoints, we prove

Theorem 3.8 (Liveness of PoE). PoE provides termination in periods of reliable bounded-delay communication if n > 3f.

Proof. When the primary is non-faulty, Theorem 3.5 guarantees termination as replicas continuously accept and execute requests. If the primary is Byzantine and fails to guarantee termination for at most f non-faulty replicas, then the checkpoint algorithm will assure termination of these non-faulty replicas. Finally, if the primary is Byzantine and fails to guarantee termination for at least f + 1 non-faulty replicas, then it will be replaced using the view-change algorithm. For the view-change process, each replica will start with a timeout delta after it receives nf matching vc-requests and double this timeout after each view-change (exponential backoff). When communication becomes reliable, this mechanism guarantees that all replicas will eventually view-change to the same view at the same time. After this point, a non-faulty replica will become primary in at most f view-changes, after which Theorem 3.5 guarantees termination.

3.5 Fine-Tuning and Optimizations

To keep the presentation simple, we did not include the following optimizations in the protocol description:
(1) To reach nf signature shares, the primary can generate one itself. Hence, it only needs nf - 1 shares from other replicas.
(2) The propose, support, inform, and nv-propose messages are not forwarded and only need MACs to provide message authentication. The certify messages need not be signed, as tampering with them would invalidate the threshold signature. The vc-request messages need to be signed, as they need to be forwarded without tampering.

Finally, the design of PoE is fully compatible with out-of-order processing, as a replica only supports proposals for a k-th transaction if it had not previously supported another k-th proposal (Figure 3, Line 12) and only executes a k-th transaction if it has already executed all the preceding transactions (Figure 3, Line 20). As the size of the active out-of-order processing window determines how many client requests are being processed at the same time (without receiving a proof-of-execution), the size of the active window determines the number of transactions that can be rolled back during view-changes.

3.6 Designing PoE using MACs

The design of PoE can be adapted to only use message authentication codes (MACs) to authenticate communication. This will sharply reduce the computational complexity of PoE and eliminate one round of communication, this at the cost of higher—quadratic—overall communication costs (see Figure 2).

The usage of only MACs makes it impossible to obtain threshold signatures or to reliably forward messages (as forwarding replicas can tamper with the content of unsigned messages). Hence, using MACs requires changes to how client requests are included in proposals (as client requests are forwarded), to the normal-case algorithm of PoE (which uses threshold signatures), and to the view-change algorithm of PoE (which forwards vc-request messages). The changes to the proposal of client requests and to the view-change algorithm can be derived from the strategies used by Pbft to support MACs [9]. Hence, next we only review the changes to the normal-case algorithm of PoE.

Consider a replica r that receives a propose message from the primary p. Next, r needs to determine whether at least nf other replicas received the same proposal, which is required to achieve speculative non-divergence (see Proposition 3.2). When using MACs, r can do so by replacing the all-to-one support and one-to-all certify phases by a single all-to-all support phase. In the support phase, each replica agrees to support the first proposal propose(<T>_c, v, k) it receives from the primary by broadcasting a message support(D(<T>_c), v, k) to all replicas. After this broadcast, each replica waits until it receives support messages, identical to the message it sent, from nf distinct replicas. If r receives these messages, it view-commits to T as the k-th transaction in view v and schedules T for execution. We have sketched this algorithm in Figure 2.
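A minimal Python sketch of this MAC-based support phase follows; the bookkeeping structure and message format are illustrative assumptions.

    # Minimal sketch of the all-to-all support phase when only MACs are used
    # (illustrative only).
    def on_support_mac(support_log, sender, digest, v, k, nf, supported_digest):
        # supported_digest is D(<T>_c) of the first k-th proposal this replica
        # supported (and broadcast) for view v; conflicting digests are not counted.
        if digest != supported_digest:
            return False
        voters = support_log.setdefault((digest, v, k), set())
        voters.add(sender)
        # Once nf distinct replicas sent an identical support message, view-commit T
        # as the k-th transaction of view v and schedule it for speculative execution.
        return len(voters) >= nf

Compared to the threshold-signature variant, this trades one linear phase for a quadratic number of cheap MAC-authenticated messages, which is attractive for small deployments (see ingredient I3 in the introduction).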

4 RESILIENTDB FABRIC

To test our design principles in practical settings, we implement our PoE protocol in our ResilientDB fabric [27-31]. ResilientDB provides its users access to a state-of-the-art replicated transactional engine and fulfills the need of a high-throughput permissioned blockchain fabric. ResilientDB helps us to realize the following goals: (i) implement and test different consensus protocols; (ii) balance the tasks done by a replica through a parallel pipelined architecture; (iii) minimize the cost of communication through batching client transactions; and (iv) enable the use of a secure and efficient ledger. Next, we present a brief overview of our ResilientDB fabric.

ResilientDB lays down a client-server architecture where clients send their transactions to servers for processing. We use Figure 6 to illustrate the multi-threaded pipelined architecture associated with each replica. At each replica, we spawn multiple input and output threads for communicating with the network.

[Figure 6 sketch: each replica runs input and output threads that exchange messages with clients and replicas over the network, batch-creation threads, a worker thread, and execute and checkpoint threads; support and certify messages flow to clients and replicas.]

Figure 6: Multi-threaded pipelines at different replicas.

Batching. During our formal description of PoE, we assumed that the propose message from the primary includes a single client request. An effective way to reduce the overall cost of consensus is to aggregate several client requests in a single batch and use one consensus step to reach agreement on all these requests [9, 21, 38]. To maximize performance, ResilientDB facilitates batching requests at both replicas and clients.

At the primary replica, we spawn multiple batch-threads that aggregate client requests into a batch. The input-threads at the primary receive client requests, assign them a sequence number, and enqueue these requests in the batch-queue. In ResilientDB, all batch-threads share a common lock-free queue. When a client request is available, a batch-thread dequeues the request and continues adding it to an existing batch until the batch has reached a pre-defined size. Each batch-thread also hashes the requests in a batch to create a unique digest.

All other messages received at a replica are enqueued by the input-thread in the work-queue to be processed by the single worker-thread. Once a replica receives a certify message from the primary, it forwards the request to the execute-thread for execution. Once the execution is complete, the execute-thread creates an inform message, which is transmitted to the client.

Ledger Management. We now explain how we efficiently maintain a blockchain ledger across different replicas. A blockchain is an immutable ledger, where blocks are chained as a linked-list. An i-th block can be represented as B_i := {k, d, v, H(B_{i-1})}, in which k is the sequence number of the client request, d the digest of the request, v the view number, and H(B_{i-1}) the hash of the previous block. In ResilientDB, prior to any consensus, we require the first primary replica to create a genesis block [31]. This genesis block acts as the first block in the blockchain and contains some basic data. We use the hash of the identity of the initial primary, as this information is available to each participating replica (eliminating the need for any extra communication to exchange this block).

After the genesis block, each replica can independently create the next block in the blockchain. As stated above, each block corresponds to some batch of transactions. A block is only created by the execute-thread once it completes executing a batch of transactions. To create a block, the execute-thread hashes the previous block in the blockchain and creates a new block. To prove the validity of individual blocks, ResilientDB stores the proof-of-accepting the k-th request in the k-th block. In PoE, such a proof includes the threshold signature sent by the primary as part of the certify message.
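The block layout B_i := {k, d, v, H(B_{i-1})} can be illustrated with the following minimal Python sketch; the dictionary encoding and field names are assumptions made for illustration, with SHA256 used as in Section 5.2.

    # Minimal sketch of hash-chained block creation (illustrative only).
    import hashlib

    def block_hash(block):
        # A canonical (sorted) encoding of the block fields, hashed with SHA256.
        encoded = repr(sorted(block.items())).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def next_block(prev_block, k, digest, view, proof):
        # B_i := {k, d, v, H(B_{i-1})}, extended with the proof-of-accepting the
        # k-th batch (in PoE, the threshold signature from the certify message).
        return {
            "k": k,            # sequence number of the client batch
            "d": digest,       # digest of the batch of requests
            "v": view,         # view in which the batch was certified
            "prev": block_hash(prev_block),
            "proof": proof,
        }

Because every replica executes the same batches in the same order, each replica can extend its copy of the ledger independently and later prove a block's validity using the stored certify proof.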
5 EVALUATION

We now analyze our design principles in practice. To do so, we evaluate our PoE protocol against four state-of-the-art bft protocols. There are many bft protocols we could compare with. Hence, we pick a representative sample: (1) Zyzzyva—as it has the absolute minimal cost in the fault-free case, (2) Pbft—as it is a common baseline (the used design is based on BFTSmart [7]), (3) SBFT—as it is a safer variation of Zyzzyva, and (4) HotStuff—as it is a linear-communication protocol that adopts the notion of rotating leaders. Through our experiments, we want to answer the following questions:

(Q1) How does PoE fare in comparison with the other protocols under failures?
(Q2) Does PoE benefit from batching client requests?
(Q3) How does PoE perform under zero payload?
(Q4) How scalable is PoE on increasing the number of replicas participating in the consensus, in the normal case?

Setup. We run our experiments on the Google Cloud, and deploy each replica on a c2 machine having a 16-core Intel Xeon Cascade Lake CPU running at 3.8 GHz with 32 GB memory. We deploy up to 320k clients on 16 machines. To collect results after reaching a steady state, we run each experiment for 180 s: the first 60 s are warmup, and measurement results are collected over the next 120 s. We average our results over three runs.

Configuration and Benchmarking. For evaluating the protocols, we employed YCSB [13] from Blockbench's macro benchmarks [16]. Each client request queries a YCSB table that holds half a million active records. We require 90% of the requests to be write queries, as the majority of typical blockchain transactions are updates to existing records. Prior to the experiments, each replica is initialized with an identical copy of the YCSB table. The client requests generated by YCSB follow a Zipfian distribution and are heavily skewed (skew factor 0.9).

Unless explicitly stated, we use the following configuration for all experiments. We perform scaling experiments by varying the number of replicas from 4 to 91. We divide our experiments along two dimensions: (1) Zero Payload or Standard Payload, and (2) Failures or No Failures. We employ batching with a batch size of 100, as the percentage increase in throughput on larger batch sizes is small. Under Zero Payload conditions, all replicas execute 100 dummy instructions per batch, while the primary sends an empty proposal (and not a batch of 100 requests). Under Standard Payload, with a batch size of 100, the size of a Propose message is 5400 B, of a Response message is 1748 B, and of other messages is around 250 B. For experiments with failures, we force one backup replica to crash. Additionally, we present an experiment that illustrates the effect of primary failure. We measure throughput as transactions executed per second. We measure latency as the time from when the client sends a request to the time when the client receives a response.

Other protocols: We also implement Pbft, Zyzzyva, SBFT, and HotStuff in our ResilientDB fabric. We refer to Section 2 for further details on the working of Zyzzyva, SBFT, and HotStuff. Our implementation of Pbft is based on the BFTSmart [7] framework with the added benefits of out-of-order processing, pipelining, and multi-threading. In both Pbft and Zyzzyva, digital signatures are used for authenticating messages sent by the clients, while MACs are used for other messages. Both SBFT and HotStuff require threshold signatures for their communication.

5.1 System Characterization

We first determine the upper bounds on the performance of ResilientDB. In Figure 7, we present the maximum throughput and latency of ResilientDB when there is no communication among the replicas. We use the term No Execution to refer to the case where all clients send their requests to the primary replica and the primary simply responds back to the client. We count every query responded to in the system throughput. We use the term Execution to refer to the case where the primary replica executes each query before responding back to the client.

The architecture of ResilientDB (see Section 4) states the use of one worker-thread. In these experiments, we maximize system performance by allowing up to two threads to work independently at the primary replica without ordering any queries. Our results indicate that the system can attain high throughputs (up to 500 ktxn/s) and can reach low latencies (up to 0.25 s). Notice that if we employ additional worker-threads, our ResilientDB fabric can easily attain higher throughput.

Figure 7: Upper bound on performance when the primary only replies to clients (No exec.) and when the primary executes a request and replies to clients (Exec.).

5.2 Effect of Cryptographic Signatures

ResilientDB enables a flexible design where replicas and clients can employ both digital signatures (threshold signatures) and message authentication codes. This helps us to implement PoE and other consensus protocols in ResilientDB.

To achieve authenticated communication using symmetric cryptography, we employ a combination of CMAC and AES [36]. Further, we employ ED25519-based digital signatures to enable asymmetric cryptographic signing. For an efficient threshold signature scheme, we use Boneh-Lynn-Shacham (BLS) signatures [36]. To create message digests and for hashing purposes, we use the SHA256 algorithm.

Next, we determine the cost of different cryptographic signing schemes. For this purpose, we run three different experiments in which (i) no signature scheme is used (None); (ii) everyone uses digital signatures based on ED25519 (ED); and (iii) all replicas use CMAC+AES for signing, while clients sign their messages using ED25519 (MAC). In these three experiments, we run Pbft consensus among 16 replicas. In Figure 8, we illustrate the throughput attained and latency incurred by ResilientDB for these experiments. Clearly, the system attains its highest throughput when no signatures are employed. However, such a system cannot handle malicious attacks. Further, using just digital signatures for signing messages can prove to be expensive. An optimal configuration can require clients to sign their messages using digital signatures, while replicas communicate using MACs.

Figure 8: System performance using three different signature schemes. In all cases, n = 16 replicas participate in consensus.

5.3 Scaling Replicas under Standard Payload

In this section, we evaluate the scalability of PoE under both a backup failure and no failures.

(1) Single Backup Failure. We use Figures 9(a) and 9(b) to illustrate the throughput and latency attained by the system on running different consensus protocols under a backup failure. These graphs affirm our claim that PoE attains higher throughput and incurs lower latency than all other protocols.

In the case of Pbft, each replica participates in two phases of quadratic communication, which limits its throughput. For the twin-path protocols such as Zyzzyva and SBFT, a single failure is sufficient to cause massive reductions in their system throughputs. Notice that the collector in SBFT and the clients in Zyzzyva have to wait for messages from all n replicas, respectively. As predicting an optimal value for timeouts is hard [11, 12], we chose a very small value for the timeout (3 s) for replicas and clients. We justify these values, as the experiments we show later in this section show that the average latency can be as large as 6 s. We note that high timeouts affect Zyzzyva more than SBFT. In Zyzzyva, clients are waiting for timeouts during which they stop sending requests, which empties the pipeline at the primary, starving it of new requests to propose. To alleviate such issues in real-world deployments of Zyzzyva, clients need to be able to precisely predict the latency to minimize the time the client needs to wait between requests. Unfortunately, this is hard and runs the risk of ending up in the expensive slow path of Zyzzyva whenever the predicted latency is slightly off. In SBFT, the collector may timeout waiting for threshold shares for the k-th round while the primary can continue to propose requests for future rounds l, l > k. Hence, in SBFT replicas have more opportunity to occupy themselves with useful work.

HotStuff attains significantly lower throughput due to its sequential primary-rotation model in which each of its primaries has to wait for the previous primary before proposing the next request, which leads to a huge reduction in its throughput. Interestingly, HotStuff incurs the least average latency among all protocols. This is a result of the intensive load on the system when running the other protocols. As these protocols process several requests concurrently (see the multi-threaded architecture in Section 4), these requests spend on average more time in the queue before being processed by a replica. Notice that all out-of-order consensus protocols employ this trade-off: a small sacrifice on latency yields higher gains in system throughput.

In the case of PoE, its high throughput under failures is a result of its three-phase linear protocol that does not rely on any twin-path model. To summarize, PoE attains up to 43%, 72%, 24x and 62x more throughput than Pbft, SBFT, HotStuff, and Zyzzyva, respectively.

(2) No Replica Failure. We use Figures 9(c) and 9(d) to illustrate the throughput and latency attained by the system on running different consensus protocols in fault-free conditions. These plots help us to bound the maximum throughput that can be attained by different consensus protocols in our system.

First, as expected, in comparison to Figures 9(a) and 9(b), the throughputs for PoE and Pbft are slightly higher. Second, PoE continues to outperform both Pbft and HotStuff, for the reasons described earlier. Third, both Zyzzyva and SBFT are now attaining higher throughputs as their clients and collector no longer timeout, respectively. The key reason SBFT's gains are limited is because SBFT requires five phases and becomes computation bounded. Although Pbft is quadratic, it employs MACs, which are cheaper to sign and verify.

[Figure 9 panels: (a)-(b) Scalability (Single Failure); (c)-(d) Scalability (No Failures); (e)-(f) Zero Payload (Single Failure); (g)-(h) Zero Payload (No Failures); (i)-(j) Batching (Single Failure); (k)-(l) Out-of-ordering disabled. Each pair plots throughput (txn/s) and average latency (s) against the number of replicas (n), except (i)-(j), which vary the batch size from 10 to 400.]

Figure 9: Evaluating system throughput and average latency incurred by PoE and other bft protocols.

Notice that the differences in throughputs of PoE and Zyzzyva are small. PoE has 20% (on 91 replicas) to 13% (on 4 replicas) less throughput than Zyzzyva. An interesting observation is that on 91 replicas, Zyzzyva incurs almost the same latency as PoE, even though it has higher throughput. This happens as clients in PoE have to wait for only the fastest nf = 61 replies, whereas a client for Zyzzyva has to wait for replies from all replicas (even the slowest ones). To conclude, PoE attains up to 35%, 27% and 21x more throughput than Pbft, SBFT and HotStuff, respectively.

5.4 Scaling Replicas under Zero Payload

We now measure the performance of different protocols under zero payload. In any bft protocol, the primary starts consensus by sending a Propose message that includes all transactions. As a result, this message has the largest size and is responsible for consuming the majority of the bandwidth. A zero payload experiment ensures that each replica executes dummy instructions. Hence, the primary is no longer a bottleneck.

We again run these experiments for both Single Failure and ... as its other linear counterparts. Its throughput is impacted by the delay of five phases.

5.5 Impact of Batching under Failures

Next, we study the effect of batching client requests on bft protocols [9, 51]. To answer this question, we measure performance as a function of the number of requests in a batch (the batch-size), which we vary between 10 and 400. For this experiment, we use a system with 32 available replicas, of which one replica has failed.

We use Figures 9(i) and 9(j) to illustrate, for each consensus protocol, the throughput and average latency attained by the system. For each protocol, increasing the batch-size also increases throughput, while decreasing the latency. This happens as larger batch-sizes require fewer consensus rounds to complete the exact same set of requests, reducing the cost of ordering and executing the transactions. This not only improves throughput, but also reduces client latencies, as clients receive faster responses for their requests. Although increasing the batch-size reduces the number of consensus rounds, the large message size ca