Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography

Appears as Technical Memo MIT/LCS/TM-589, MIT Laboratory for Computer Science, June 1999 Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Miguel Castro and Barbara Liskov Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA 02139 castro,liskov @lcs.mit.edu Abstract of these optimizations in detail. It explains how to modify We have developed a practical state-machine replication the base algorithm to eliminate the major performance algorithm that tolerates Byzantine faults: it works correctly bottleneck in previous systems Ð the cost of using public- in asynchronous systems like the Internet and it incorporates key cryptography to produce digital signatures. several optimizationsthat improve the responsetime of previous The time to perform public-key cryptography oper- algorithms by more than an order of magnitude. This paper ations to generate and verify signatures is cited as the describes the most important of these optimizations. It explains major bottleneck [24, 17, 13] in state-machine replica- how to modify the base algorithm to eliminate the major tion algorithms designed for practical application. The performance bottleneck in previous systems Ð public-key optimization described in this paper replaces digital sig- cryptography. The optimization replaces public-key signatures natures by vectors of message authentication codes dur- by vectors of message authentication codes during normal operation, and it overcomes a fundamental limitation on the ing normal operation. It uses digital signatures only for power of message authentication codes relative to digital view changes that occur when a replica fails and are likely signatures Ð the inability to prove that a message is authentic to be infrequent. For the same level of security and typ- to a third party. As a result, authentication is more than two ical service con®gurations, our authentication scheme is orders of magnitude faster while providing the same level of more than two orders of magnitude faster than one using security. public-key signatures. Message authentication codes are widely used. What is 1 Introduction interesting in the work described here is that it overcomes a fundamental limitation of message authentication codes The growing reliance of industry and government on relative to digital signatures Ð the inability to prove online information services makes malicious attacks that a message is authentic to a third party. Previous more attractive and makes the consequences of successful state-machine replication algorithms [24, 26, 13] (as well attacks more serious. Byzantine-fault-tolerant replication as the base version of our algorithm described in [5]) enables the implementation of robust services that rely on this property of digital signatures for correctness. continue to function correctly even when some of their We explain how to modify our algorithm to overcome replicas are compromised by an attacker. this problem while retaining the same communication We have developed a practical algorithm for state- performance during normal case operation and the same machine replication [14, 29] that tolerates Byzantine resiliency. Our solution to the problem takes advantage faults. The algorithm is described in [5]. It offers both of the bound of 1 on the number of faulty replicas liveness and safety provided at most 1 out of a total 3 3 (which is also required by the algorithms that use digital of replicas are faulty. This means that clients eventually signatures) and the fact that correct replicas agree on an receive replies to their requests and those replies are order for requests. correct according to linearizability [12, 4]. Unlike previous algorithms [29, 25, 13], ours works This work is part of our research to produce practical correctly in asynchronous systems like the Internet, and Byzantine fault tolerance algorithms and to demonstrate it incorporates important optimizations that enable it to their practicality by implementing real systems. Our outperform previous systems by more than an order of initial results are very promising. We have implemented a magnitude [5]. This paper describes the most important Byzantine-fault-tolerant NFS ®le system and it performs less than 3% slower than a standard, unreplicated implementation of NFS [5]. This research was supported in part by DARPA under contract F30602- The rest of the paper is organized as follows. Section 2 98-1-0237 monitored by the Air Force Research Laboratory, and in part by NEC. Miguel Castro was partially supported by a PRAXIS XXI presents an overview of our system model and lists fellowship. our assumptions. Section 3 describes the problem 1 solved by the algorithm and states correctness conditions. cannot produce a valid signature of a non-faulty node, Section 4 describes the algorithm using public-key or ®nd two messages with the same digest. The signatures. These sections also appeared in [5]. They are cryptographic techniques we use are thought to have these repeated here for completeness and because the version properties [28, 30, 27]. of the algorithm with public-key signatures is easier to understand, but can be skipped by a reader that is 3 Service Properties familiar with the algorithm in [5]. Section 5 discusses the changes to the algorithm that allow us to avoid public- Our algorithm can be used to implement any deterministic key cryptography during normal operation. Section 6 replicated service with a state and some operations. The discusses related work. Our conclusions are presented in operations are not restricted to simple reads or writes of Section 7. portions of the service state; they can perform arbitrary deterministic computations using the state and operation 2 System Model arguments. Clients issue requests to the replicated service to invoke operations and block waiting for a reply. The We assume an asynchronous distributed system where replicated service is implemented by replicas. Clients nodes are connected by a network. The network may and replicas are non-faulty if they follow the algorithm fail to deliver messages, delay them, duplicate them, or and if no attacker can forge their signature. deliver them out of order. The algorithm provides both safety and liveness assum- 1 We use a Byzantine failure model, i.e., faulty nodes ing no more than 3 replicas are faulty. Safety means may behave arbitrarily, subject only to the restriction that the replicated service satis®es linearizability [12] mentioned below. We assume independent node failures. (modi®ed to account for Byzantine-faulty clients [4]): it For this assumption to be true in the presence of malicious behaves like a centralized implementation that executes attacks, some steps need to be taken, e.g., each node operations atomically one at a time. should run different implementations of the service code Safety is provided regardless of how many faulty and operating system and should have a different root clients are using the service (even if they collude with password and a different administrator. It is possible faulty replicas): all operations performed by faulty clients to obtain different implementations from the same code are observed in a consistent way by non-faulty clients. base [23] and for low degrees of replication one can buy In particular, if the service operations are designed to operating systems from different vendors. N-version preserve some invariants on the service state, faulty programming, where different teams of programmers clients cannot break those invariants. produce different implementations, is another option for This safety property is very strong but it is insuf®cient some services. to guard most services against faulty clients, e.g., in a We use cryptographic techniques to prevent spoo®ng ®le system a faulty client can write garbage data to some and replays and to detect corrupted messages. Our shared ®le. However, we limit the amount of damage messages contain public-key signatures [28], message a faulty client can do by providing access control: we authentication codes [30], and message digests produced authenticate clients and deny access if the client issuing by collision-resistant hash functions [27]. We denote a a request does not have the right to invoke the operation. message signed by node as and the digest of Also, services may provide operations to change the message by . We follow the common practice access permissions for a client. Since the algorithm of signing a digest of a message and appending it to ensures that the effects of access revocation operations the plaintext of the message rather than signing the full are observed consistently by all clients, this provides a message ( should be interpreted in this way.) All powerful mechanism to recover from attacks by faulty replicas know the public keys of other replicas and clients clients. to verify signatures. Clients also know the public keys of The algorithm does not rely on synchrony to provide replicas. safety. Therefore, it must rely on synchrony to provide We allow for a very strong adversary that can liveness; otherwise it could be used to implement coordinate faulty nodes, delay communication, or delay consensus in an asynchronous system, which is not correct nodes in order to cause the most damage to the possible [9]. We guarantee liveness, i.e., clients replicated service. We do assume that the adversary eventually receive replies to their requests, provided at 1 cannot delay correct nodes inde®nitely. We also

Load more