<<

Distributed Logging for Transaction Processing

Dean Spencer Daniels

December 1988 CMU-CS-89-114

Submitted to Carnegie Mellon University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Copyright © 1988 Dean Spencer Daniels

This work was supported by IBM and the Defense Advanced Research Projects Agency, ARPA Order No. 4976, monitored by the Air Force Avionics Laboratory under Contract F33615-87-K-C-1499.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any of the sponsoring agencies or of the United States Government.

Abstract

This dissertation defends the thesis that recovery logs for transaction processing can be efficiently and reliably provided by a highly available network service. Recovery logs are a special implementation of stable storage that transaction processing systems use to record information essential to their operation. Recovery logs must be very reliable and have fast access. Typically, mirrored magnetic disks are dedicated to log storage in high performance transaction systems. Dedicated mirrored disks are an expensive resource for small processors like workstations or nodes in a non-shared memory multiprocessor. However, it is these types of processors that participate in many distributed programs and benefit from the availability of a general purpose distributed transaction facility. Distributed logging promotes reliable distributed computing by addressing the problem of the resources needed by the recovery log for a general purpose distributed transaction processing facility.

The distributed logging thesis is defended by discussion of the design, implementation, and evaluation of distributed logging services. The design issues considered include the global representation of distributed logs, communication, security, log server data structures, log space management, and load assignment. A new distributed algorithm for replicating recovery logs and a new data structure for direct access to log records in append-only storage are presented. The dissertation explores the use of uninterruptible power supplies to implement low-latency non-volatile virtual memory disk buffers.

The implementation and evaluation of the Distributed Log Facility for the Camelot Distributed Transaction Facility is described. The Camelot DLF uses the new distributed algorithm for representing distributed logs and uses uninterruptible power supplies to implement non-volatile virtual memory. Specially designed protocols are used for communication between clients and log servers in the Camelot DLF. The performance of the Camelot DLF is competitive with the Camelot local log implementation for a variety of benchmarks. The throughput capacity of log servers is reported.

Acknowledgments

This thesis, and my entire graduate career, would not have been possible without support, guidance, encouragement, and friendship from Alfred Spector. I doubt that I will ever participate in collaborations as successful as my eight years (so far) of working with Alfred.

My readers, Maurice Herlihy, Bruce Lindsay, and Rick Rashid, are to be commended for promptly producing helpful comments on (often very rough) drafts of this thesis. Bruce's efforts at quality control are particularly appreciated.

Dean Thompson collaborated in the development of many of the new ideas presented here and implemented portions of the Camelot Distributed Log Facility. The system described here should really be called Deans' logger.

The entire Camelot group, especially Dean, Alfred, Josh Bloch, Dan Duchamp, Jeff Eppinger and Randy Pausch, is to be thanked for providing an exciting family in which to work and play. Thanks to Steve Berman for procuring the UPSs used for the Camelot DLF.

My graduate career and life in Pittsburgh was a really great experience because of many fine friends. There are too many to list them all, but some deserve special mention, including: Alan, Mike, Ann, Jumpin, Chris, Archie, Steve, David, Dan, Sherri, Doug, Penn, and of course, Bill.

For W.A.B.

Table of Contents

1. Introduction
1.1. Distributed Logging: Thesis and Motivation
1.2. Goals
1.3. Outline
2. Background
2.1. Distributed Systems
2.2. What a Log Is
2.2.1. Transactions and Transaction Processing Facilities
2.2.1.1. The Transaction Concept
2.2.1.2. Transaction System Applications
2.2.1.3. Distributed Transaction Processing Facilities
2.2.2. Log Definition, Use, and Implementation
2.2.2.1. A Simple Log Definition
2.2.2.2. Log-based Transaction Recovery Algorithms
2.2.2.3. Practical Log Interfaces
2.2.2.4. Log Implementation Issues
3. Design Issues and Alternatives
3.1. Representation of Distributed Log Data
3.1.1. Mirroring
3.1.2. Distributed Replication
3.1.2.1. Replicated Time-Ordered Unique Identifiers
3.1.2.2. Replicated Log Algorithm
3.1.2.3. Formal Proof of Restart Procedure
3.1.3. Hybrid Schemes
3.1.4. Comparison Criteria
3.1.4.1. Reliability
3.1.4.2. Availability
3.1.4.3. Performance
3.2. Client/Server Communication
3.2.1. Stream Protocols
3.2.2. RPC Protocols
3.2.3. The Channel Model and LU 6.2
3.2.4. Parallel Communications and Multicast Protocols
3.2.5. Comparison Criteria
3.2.5.1. Load Models
3.2.5.2. Force and Random Read Times
3.2.5.3. Streaming Rates
3.2.5.4. Resilience
3.2.5.5. Complexity and Suitability
3.3. Security
3.3.1. Alternative Mechanisms
3.3.1.1. Authentication Mechanisms
3.3.1.2. Physical Security
3.3.1.3. End-to-End Encryption
3.3.1.4. Encryption-Based Protocols
3.3.2. Policies and Comparison Criteria
3.3.2.1. Threat Models and Security Requirements
3.3.2.2. Cost
3.4. Log Representation on Servers
3.4.1. Disk Representation Alternatives
3.4.1.1. Files
3.4.1.2. Partitioned Space
3.4.1.3. Single Data Stream
3.4.2. Low-Latency Buffer Alternatives
3.4.3. Comparison Criteria
3.4.3.1. Low-Latency Buffer Costs
3.4.3.2. Disk Utilization
3.4.3.3. Performance
3.5. Log Space Management
3.5.1. Mechanisms
3.5.1.1. Server Controlled Mechanisms
3.5.1.2. Client Controlled Mechanisms
3.5.1.3. Compression
3.5.2. Policy Alternatives
3.5.3. Comparison Criteria
3.5.3.1. Use Models
3.5.3.2. Costs
3.5.3.3. Performance
3.6. Load Assignment
3.6.1. Mechanisms
3.6.1.1. Load Assessment
3.6.1.2. Load Assignment Mechanisms
3.6.2. Policies
3.7. Summary (and Perspective)
4. The Design of the Camelot Distributed Log Facility
4.1. Camelot Distributed Log Design Decisions
4.1.1. Distributed Log Representation
4.1.2. Communication
4.1.3. Security
4.1.4. Server Data Representation
4.1.5. Log Space Management
4.1.6. Load Assignment
4.2. Communication Design
4.2.1. Transport Protocol
4.2.2. Message Protocols
4.2.2.1. Data Message Packing and ReadLog Buffering
4.2.2.2. RPC Subprotocols
4.2.2.3. WriteLog Subprotocol
4.2.2.4. CopyLog Subprotocol
4.3. Log Client Structure

4.3.1. Camelot Architecture
4.3.2. Camelot's Local and Network Loggers
4.3.2.1. The Camelot Log Interface
4.3.2.2. The Local Logger
4.3.2.3. The Network Logger
4.4. Log Server Design
4.4.1. Log Server Threads
4.4.2. Log Server Data Structures
4.4.2.1. Main Memory Structures
4.4.2.2. Disk Data Structures
4.4.3. Uninterruptible Power Supply Operations
5. Performance of the Camelot Distributed Logging Facility
5.1. Methodology
5.1.1. Latency Experiments
5.1.1.1. CPA Tests
5.1.1.2. Debit/Credit Tests
5.1.1.3. Debit/Credit Latency Breakdown Tests
5.1.2. Throughput Experiments
5.2. Experiments
5.2.1. Experimental Environment
5.2.2. Latency Experiments
5.2.2.1. CPA Tests
5.2.2.2. Debit/Credit Tests
5.2.2.3. Debit/Credit Latency Breakdown Test
5.2.3. Throughput Experiments
5.3. Results and Discussion
5.3.1. Latency Experiments
5.3.1.1. CPA Tests
5.3.1.2. Debit/Credit Tests
5.3.1.3. Debit/Credit Latency Breakdown Test
5.3.2. Throughput Experiments
5.3.3. Performance Summary
6. Conclusions
6.1. Evaluations
6.1.1. The Camelot Distributed Logging Facility
6.1.2. Future Distributed Logging Systems
6.1.3. High Performance Network Services
6.2. Future Research
6.2.1. Camelot DLF Enhancements
6.2.2. Future Distributed Log Servers
6.3. Summary

List of Figures

Figure 2-1: A Simple Log Interface
Figure 2-2: Practical Log Interface
Figure 2-3: Figure 2-2 Continued
Figure 3-1: Unique Identifier Generator State Representative Interface
Figure 3-2: Program for OrderedId NewId
Figure 3-3: Three Log Servers with LSN 10 Partially Replicated
Figure 3-4: Figure 3-3 after Restart with Servers 1 and 2
Figure 3-5: Directory for Distributed Log in Figure 3-4 (merges interval lists from Servers 1 and 2)
Figure 3-6: Type Definitions for Distributed Log Replication
Figure 3-7: Log Server Interface for Distributed Log Replication
Figure 3-8: Global Variables and Open Procedure for Distributed Log Replication
Figure 3-9: Figure 3-8 Continued
Figure 3-10: Read Procedure for Distributed Log Replication
Figure 3-11: Write Procedure for Distributed Log Replication
Figure 3-12: Markov Reliability Model for 2 Copy Logs
Figure 3-13: Write Availability of Different Distributed Logs
Figure 3-14: Restart Availability of Different Distributed Logs
Figure 3-15: Stream Protocol Interface for Log Writes
Figure 3-16: RPC Interface for Log Writes
Figure 3-17: Channel Interface for Log Writes
Figure 3-18: Entry Bit-Map Index (adapted from [Finlayson and Cheriton 87])
Figure 3-19: Eleven Node Append-forest
Figure 4-1: Log Data Message (WriteLog, ReadLogReply, etc.) Layout
Figure 4-2: Camelot Processes
Figure 4-3: Client Information Record
Figure 4-4: Interval Lists Table Record
Figure 4-5: Interval List Element Record
Figure 4-6: Data Block Record Group Identifier Record
Figure 4-7: Data Block Layout
Figure 4-8: Checkpoint Interval List Record
Figure 4-9: Checkpoint Block Layout
Figure 5-1: CPA Test Architecture
Figure 5-2: One Small Read from 1 Server CML Transaction
Figure 5-3: Ten Small Reads from 1 Server CML Transaction
Figure 5-4: One Large Read from 1 Server CML Transaction
Figure 5-5: Ten Large Reads from 1 Server CML Transaction

Figure 5-6: One Small Read from 3 Servers CML Transaction
Figure 5-7: Distributed Log Debit/Credit Throughput

List of Tables

Table 3-1: Distributed Log Representation Performance Comparisons
Table 3-2: Communication Comparison Summary
Table 5-1: CML Read Operation Times in Milliseconds
Table 5-2: CML Write Operation Times in Milliseconds
Table 5-3: CML Transaction Overhead Times in Milliseconds
Table 5-4: Debit/Credit Logging Latency Breakdown

Chapter 1

Introduction

This dissertation is concerned with a distributed implementation of the stable storage recovery logs used by transaction processing facilities. General purpose distributed transaction facilities could greatly simplify the construction of reliable distributed programs, but the resources required for recovery logs are not always available on the small processors and workstations that need to participate in the execution of the distributed programs. A network log service overcomes this difficulty.

A transaction system recovery log is a specialized implementation of stable storage (highly reliable storage) in which transaction processing facilities store recovery and transaction management information. Distributed transaction processing facilities are extensions to network operating systems that permit the management of permanent abstract objects and the composition of sequences of operations on the objects in transactions. These concepts entail a significant amount of background on distributed systems and transactions.

Chapter 2 provides background suitable for readers not familiar with the distributed systems and transaction facilities concepts. This introduction continues in Section 1.1 with a precise statement of the thesis defended by this dissertation and the motivation for this research. Section 1.2 presents goals for the defense of the thesis. Section 1.3 is an outline of the remainder of the dissertation.

1.1. Distributed Logging: Thesis and Motivation

The thesis defended in this dissertation is that recovery logs for transaction processing can be efficiently and reliably provided by a highly available network service. The transaction processing systems under consideration are general purpose ones, which could support databases or other applications with large amounts of data and moderate transaction rates. Recoverable data could reside in main memory, on disks local to the transaction processing node, or on a disk server, and the recoverable data could move dynamically between different kinds of storage. The logging service should be suitable for various recovery techniques, and the network over which the logging services are provided could be either a local area network or a high speed bus in a multiprocessor.

The reliability and efficiency requirements preclude a trivial demonstration of the thesis using a disk or file server. The reliability of distributed logging should equal or exceed that of local logging to mirrored disk storage, which is implemented by executing each disk write on two (or more) disks that have independent failure modes. Distributed logging should be efficient enough so that servers can be shared by a reasonable number of clients and so that the performance of the distributed log service is comparable to local logging.

Distributed transactions are an essential foundation for simplifying the construction of distributed programs. It should be easier to build a distributed computing service if a distributed transaction facility is available. A distributed transaction facility needs a recovery log. A distributed recovery log service necessarily can not use a distributed transaction facility in its implementation. Instead, the log service must be constructed using more fundamental distributed system services.

Many arguments for distributed logging are similar to those for disk and file servers. In some environments, the use of shared logging facilities could have advantages in cost, reliability, operations, and performance. These advantages are particularly applicable to a workstation environment, but the benefits might also apply to the processing nodes of a multiprocessor.

Cost advantages can be obtained if expensive logging devices can be shared. For example, in a workstation environment, it would be wasteful to dedicate mirrored disks and tapes to each workstation. Cost advantages depend on the cost of server nodes and on the number of clients that can share each server.

Reliability can be better for shared facilities, because they can be specially hardened, or replicated in two or more locations. Reliability is a critical issue for transaction processing recovery logs, since the log is the basis for all error recovery in the system. A transaction processing system can tolerate (with some performance loss) unreliable storage for data, providing the log is available for media recovery. Hence, local disk storage can be viewed as a cache for data that actually resides in the log.

Operational advantages are possible because it is easier to manage high volumes of log data at a small number of logging nodes, rather than at all transaction processing nodes. For the workstation environment these operational advantages are particularly important and result in additional reliability and cost advantages because the log data can be at physically secure central locations that are more convenient for operations personnel.

Performance can be improved because shared logging facilities can have faster hardware than could be afforded for each processing node. However, it is more likely that there will be a performance penalty for accessing a remote service. This cost must be traded against the advantages of distributed logging.

1.2. Goals

This dissertation presents a three part defense of the distributed logging thesis. First, the design of a distributed logging service is discussed in detail. Second, an implementation of a service is described. Third, the service and the distributed logging concept are evaluated using a variety of criteria.

The goals for the distributed logging service that is the focus of this dissertation are that it have reliability and performance comparable to those of a local logging service. In addition, the distributed nature of the service should provide it with advantages in resource sharing, availability, and operations.

The first goal for the discussion of distributed log service design issues is comprehensiveness. All issues relevant to a distributed logging service should be considered. Some issues, such as the overall distributed log service architecture, and the design of individual servers, are very specific to distributed logging. The designs presented for distributed log services should include innovative technology needed to achieve overall goals for the service. Other issues, such as security, communication, and load assignment, have more general applicability in distributed systems. It is not a goal of this dissertation to present new technology in these areas. Instead, existing paradigms can be applied to distributed logging.

The goal of the implementation effort is to take a complete design for a distributed logging service and make it work. The distributed logging service implemented for this dissertation is an integral component of the Camelot distributed transaction processing facility. The service should work well enough to be preferred over local logging for workstations running the transaction facility. A complete implementation implies refining design decisions and adapting them to a particular system environment. Limited implementation resources may require some tradeoffs in the service's function or completeness.

The goal of the evaluation is to show that the distributed logging service works and meets its performance objectives. The evaluation should identify performance limitations and attempt to show whether they are inherent in a distributed logging system or artifacts of an implementation. The potential for scaling the service and adapting it to other technologies should be identified by the evaluation.

A peripheral goal of this dissertation is to assess technologies available for constructing high performance network services.

1.3. Outline

Chapter 2 elaborates this introduction and presents considerable background information suitable for non-experts in distributed transaction processing. The background includes motivation for distributed transaction facilities as a means for simplifying the construction of distributed programs, definitions of recovery logs and explanations of their use, and a discussion of the implementation of local recovery logs. The top level organization of the remainder of this dissertation follows the three parts of the defense of the distributed logging thesis.

Chapter 3 discusses a number of issues in the design of the distributed logging service. The most attention is given to the overall distributed architecture of the service and the design of storage on individual log servers. Other issues discussed include communication, security, log space management, and load assignment. The design chapter concludes by presenting an overall design methodology.

Chapter 4 first identifies which of the design alternatives for each of the issues in Chapter 3 were chosen for the distributed log facility in the Camelot distributed transaction system. The communication protocols used between log servers and their clients are described. Both the local log implementation for Camelot, and the client side of the distributed log facility are described, along with the design of log servers.

Chapter 5 presents the performance evaluation of the Camelot DLF in three sections. First, the measurements used for the evaluation are described. Second, the experiments that performed the measurements are detailed. Finally, the experimental results are presented and discussed.

Chapter 6 first presents evaluations of the Camelot DLF, distributed logging in general, and technology for creating high performance servers. Then, topics for future research are discussed. A summary of results completes the dissertation.

Chapter 2

Background

This chapter presents background material on distributed computing, transactions, and transaction recovery logs. Readers familiar with these concepts can skip this chapter, although it may later be necessary to refer to the interface specifications contained herein.

2.1. Distributed Systems

Distributed computing is partially motivated by the proliferation of cheap microprocessor-based computers and workstations. These machines offer greater computing power per dollar than more powerful (and more expensive) mainframe computers. Together with this price/performance advantage, personal workstations provide a higher quality of user interaction, compared with terminal access to central computers.

The workstations and personal computers that are a key element of distributed computing have some limitations. Although they may have considerable computing power, personal computers have limited input/output equipment (like printers, or tape drives), and relatively small secondary storage. To compensate for these limitations, personal computers need to communicate to share resources and information.

Local area networks are a key enabling communication technology for distributed computing [Clark et al. 78]. Typically, such networks are communications channels that can be shared among hundreds of nodes and provide communication rates from one million to one hundred million bits per second. Local networks can span distances ranging from hundreds of yards to a few miles. The Ethernet™ [Metcalfe and Boggs 76, Digital Equipment Corporation 80] is a canonical example of such technology. Ethernet implementations transmit data at ten megabits per second and can span distances up to about 1500 meters. There are some local area networks and special interconnection busses that operate at higher speeds [Kronenberg 86]. This dissertation is concerned with developments that can take advantage of current local area networks and future developments.

Local area networks can be joined to other local networks and long haul networks by gateways to form large internetworks [Postel 82]. Special internetwork protocols make communications between nodes on different networks in an internetwork transparent to most software, although communications that span multiple local networks are subject to delays for processing in gateways and retransmission on each network [Boggs et al. 82].

Network protocols, particularly those used in long haul networks and internetworks, are specified and often implemented in several layers of increasing function [Tanenbaum 81]. This layered structure often makes communication expensive, and there are alternatives to layered communications, such as the specialized hardware and microcode support for fast message passing discussed by Spector [Spector 81, Spector 82]. Some operating systems have emphasized efficient support for message passing and access to network services [Cheriton 83, Cheriton 84a, Cheriton 84b].

Networks and communication protocols provide a basis for distributed computing, but building distributed systems is very difficult without a suitable paradigm for structuring the systems. The client-server model of distributed systems is a widely accepted paradigm for structuring distributed systems [Svobodova et al. 79, Watson 81a]. In this model, resources are provided to client processes by server processes and clients access server resources using network communication. Servers may themselves be clients of other servers. The communication interface to a server encapsulates the resource in much the same way that a procedural interface to an abstract data type encapsulates its implementation [Dahl and Hoare 72, Department of Defense 82, Wulf et al. 74, Wulf et al. 76, Liskov et al. 77].

The most common example of a distributed systems service is that provided by a file server. Svobodova's excellent survey describes many different file servers [Svobodova 84]. Disk servers, which provide a more primitive interface than file servers, are used to support diskless operation of personal workstations [Cheriton 83, Lazowska et al 86]. Another common service is a name server for mapping high level names to network identifiers [Terry 85]. Authentication servers distribute the encryption keys used in secure communications [Needham and Schroeder 78]. Mail servers store and deliver electronic mail [Birrell et al. 82, Schroeder et al. 84]. This dissertation is concerned with the design of a new service for distributed systems and thus fits directly into the client-server paradigm.

When the client-server model is applied to operating system design, the result is a system where every service is obtained through a message interface. Operating systems with this structure include Tandem's NonStop™ kernel [Bartlett 81], Accent [Rashid and Robertson 81], V [Cheriton 84a], and Mach [Baron et al. 85, Accetta et al. 86], among others. These systems are intended as versatile platforms for distributed computing.

The client-server model for structuring distributed systems meshes nicely with the use of remote procedure calls (RPCs) for programming them [Nelson 81, Birrell and Nelson 84, Jones et al. 85]. RPC systems provide programmers with a simple model of communication in which a procedure call executes on one machine while the body of the procedure executes on another machine. In addition to allowing for efficient implementation, RPCs also relieve programmers from the burden of packing and unpacking messages.

Powerful low cost processors, high speed communication, efficient and flexible communication protocols, network operating systems, and remote procedure calls are all important foundations for constructing distributed computer systems. The software components of these foundations attempt to make writing distributed programs a simple extension of writing non-distributed ones. Despite all this infrastructure, distributed programs are still very difficult to construct. This is because distributed systems present new classes of problems and (often inescapable) opportunities for programmers.

Implementations of distributed systems force many programmers to deal with difficult problems for the first time. Concurrent access to shared resources is a problem that occurs often in distributed systems. Along with the concurrency problems that arise in distributed systems is the opportunity to exploit the parallelism possible with distributed computing. Failures of only some of the systems on which a program is executing are another problem that can occur in distributed systems. It is difficult to guarantee that distributed programs execute atomically, either completing execution at every node, or leaving no effect at any node. Of course, the redundancy possible in a distributed system provides opportunities for fault tolerance.

Concurrency, atomicity, and permanence of data in the face of faults are problems that are not unique to distributed computing, but they are more common problems for distributed system programmers. Transactions, a programming construct that originated with database management systems, are useful for dealing with these problems. Many research projects have investigated the use of transactions as a building block for reliable distributed programs. Transactions are defined and their implementation is discussed in the next section.

2.2. What a Log Is

This section presents the background on distributed transaction facilities and their recovery logs that is needed to understand the thesis defended in the remainder of the dissertation. First, fundamental concepts are explained, including distributed transaction facilities, their components, and applications. Second, these concepts are applied when this section discusses the functions, use, and implementation of transaction recovery logs.

2.2.1. Transactions and Transaction Processing Facilities

2.2.1.1. The Transaction Concept

The transaction concept is broadly useful, but a precise semantic definition of a transaction is a topic for other research. The difficulty of formally defining a transaction comes in part from the difficulty of formalizing concepts like consistency, permanence, and simultaneity. There are many good informal tutorials on the transaction concept including those by Date [Date 83] and Gray [Gray 80, Gray 81].

Syntactically, a transaction is a computer program (or portion of a program) that is bracketed by the commands Begin_Transaction and End_Transaction. The program contains operations on shared permanent data and optional Abort_Transaction commands. Transactions provide three properties that are useful for constructing reliable applications: concurrency, atomicity, and permanence.
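
The bracketing can be illustrated with a small, self-contained sketch. The example below is a single-process toy, not the interface of any particular transaction facility: the three bracketing commands follow the text, while the in-memory accounts, the snapshot-based abort, and every other name are invented for illustration.

```c
/* Toy, single-process sketch of transaction bracketing.  Begin_Transaction
 * snapshots the (in-memory) "permanent" data so that Abort_Transaction can
 * restore it; a real facility would instead use locks and a recovery log.
 * All names other than the three bracketing commands are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define NACCOUNTS 4
static long balances[NACCOUNTS] = { 1000, 250, 0, 0 };  /* cents */
static long snapshot[NACCOUNTS];

static void Begin_Transaction(void) { memcpy(snapshot, balances, sizeof balances); }
static void End_Transaction(void)   { /* commit: keep the new state */ }
static void Abort_Transaction(void) { memcpy(balances, snapshot, sizeof balances); }

/* Transfer money atomically: either both balances change or neither does. */
static bool transfer(int from, int to, long cents)
{
    Begin_Transaction();
    if (balances[from] < cents) {
        Abort_Transaction();                 /* no partial effects remain */
        return false;
    }
    balances[from] -= cents;
    balances[to]   += cents;
    End_Transaction();
    return true;
}

int main(void)
{
    transfer(0, 1, 300);                     /* commits */
    transfer(1, 2, 10000);                   /* aborts: insufficient funds */
    for (int i = 0; i < NACCOUNTS; i++)
        printf("account %d: %ld\n", i, balances[i]);
    return 0;
}
```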

The concurrency property provided by transactions is intended to permit different applications to execute in parallel. Transactions that execute concurrently are isolated so that the behavior of the transactions is the same as it would be if they executed one at a time. A formalization of this property is known as serializability [Eswaran et al. 76, Bernstein and Goodman 81]. There are many methods by which transaction facilities can guarantee serializability and some of the most important are surveyed by Bernstein and Goodman [Bernstein and Goodman 81].

The atomicity property of transactions assures that either all of the changes made by a transaction occur or none do. If a transaction executes an End_Transaction command, then the transaction is said to commit, and the transaction processing facility must ensure that all changes made by the transaction occur. Because of this property programmers need to be concerned only with the correctness of programs that transform a database from one consistent state to another. Intermediate states that violate consistency constraints will not be visible to other transactions even if failures or Abort_Transaction commands cause the transaction to abort. The transaction manager and recovery manager components of a transaction processing facility are responsible for ensuring atomicity, often by using information recorded in a recovery log.

The permanence property of transactions assures that data changes made by committed transactions will not be lost in spite of processor, disk or other failures. The exact extent of failures that a transaction processing system will tolerate varies with different systems and applications; there are always disasters from which a given system can not completely recover.

The properties described above make transactions a convenient tool for writing programs that manipulate databases and other valuable data that have complex consistency constraints.

Programmers only need to be concerned with writing programs that correctly transform data from one consistent state to another when run to completion in isolation. Interactions with other programs, recovery from interruptions, and processor or storage failures are the responsibility of the transaction processing system. These advantages are particularly important in distributed systems where there can be more concurrent activities and more opportunities for failures than in centralized systems.

2.2.1.2. Transaction System Applications

The transaction concept evolved from many influences including both academic and commercial ones. Online transaction processing is now recognized as a major commercial application for computer systems. Researchers have emphasized flexible transaction systems for more ambitious applications.

Airline reservation systems were among the first commercial online transaction systems. The SABER system, developed jointly by IBM and American Airlines, was one of the first airline reservation systems [Perry and Plugge 61]. Subsequently, IBM developed the Airline Control Program (ACP), which is now called the Transaction Processing Facility (TPF) [IBM Corporation 87]. TPF is now used for non-airline applications including electronic banking [Bamberger 87].

The transaction model provided by ACP and related systems is not as general as the one introduced in Section 2.2.1.1. Transactions in an airline system are often a single message to the system and a single response [Gifford and Spector 84]. The functions performed by such a "transaction" are not arbitrary actions on the data in the system but are chosen from a set of procedures defined by the system's programmer.

Banking, including processing for automatic and human tellers, credit card authorizations, and other applications, is another major use of commercial transaction systems. The standard benchmark of transaction system performance is a transaction that performs a bank account debit/credit operation [Anonymous et al. 85].

Other major commercial transaction system applications include online order, inventory and manufacturing planning systems, and factory automation. Major commercial transaction processing systems include IBM's CICS and IMS products and the Tandem TMF system [IBM Corporation 78, IBM Corporation 80, Helland 85].

Transaction facilities must be integrated with a full function database management system when the data management needs of an application are too complex or rapidly changing for a simpler file system. IBM's SQL/DS and DB2 relational DBMS products, and the System R research prototype, incorporate full transaction facilities that permit arbitrary composition of DBMS operations in a transaction, as well as concurrency control to isolate parallel transactions [IBM Corporation 82, Haderle and Jackson 84, Astrahan et al. 76].

The R* distributed relational DBMS research prototype added distributed function to the transaction management facilities of System R (and to the data management facilities as well) [Williams et al. 81, Lindsay et al. 84]. Tandem's NonStop SQL DBMS product uses the distributed transaction management facilities of TMF [Tandem Computers 87].

The utility of transactions in writing database applications suggests that they may be useful in other domains. Lomet was one of the first to suggest that transactions had applicability outside of database systems [Lomet 77]. Experience with some systems reinforced this conjecture. Transactions were used to coordinate the execution of a distributed compiler for the R* system [Daniels 82, Lindsay et al. 84]. Gifford implemented a transaction-based distributed calendar system [Gifford 79a].

The attraction of transactions as a general construct for building reliable distributed programs has led to the construction of systems offering transactions. A number of these systems augment file servers with transaction facilities. The Cambridge File Server permits transactions to access a single file at one server [Mitchell and Dion 82]. The Felix file server permits transactions to access multiple files at one server [Fridrich and Older 81]. The Xerox Distributed File System, the Alpine file server, and Paxton's extension to the Woodstock File Server permit transactions to access multiple files at multiple servers [Mitchell and Dion 82, Brown 85, Paxton 79]. The Locus distributed operating system offers nested distributed transactions on files with record level locking [Weinstein et al. 85].

Transactional file systems have been used as the basis for database management systems [Brown et al. 81, Cattell 83]. However, transactional file systems have had mixed success. The performance overheads they impose for recovery and concurrency control are neither necessary nor acceptable for many traditional file system applications (compiler temporary files, for example). General purpose distributed transaction facilities, which provide a more flexible data model, have been proposed to address shortcomings of transactional file systems.

General purpose distributed transaction facilities extend network operating systems to provide facilities for implementing permanent abstract atomic objects and coordinating distributed transactions that manipulate the objects. Unlike database management systems and transactional file systems, these facilities do not impose a specific data model on distributed system implementors. A major advantage of such facilities is that they encourage the creation of new distributed applications by writing transactions that compose operations on different existing objects and database systems. The distributed transaction facilities provide a common set of compatible services that permit such composition. The functions of general purpose distributed transaction facilities are discussed further in Section 2.2.1.3. Implementations of general purpose distributed transaction facilities include TABS, Quicksilver, Clouds, and Camelot [Spector et al. 85a, Spector et al. 85b, Allchin and McKendry 83, Haskin et al. 88, Spector et al. 86, Spector and Swedlow 88].

The extra flexibility provided by general purpose distributed transaction facilities complicates the design and implementation of atomic abstract objects. Object implementors must use the facilities provided by the transaction facility to represent, recover, and synchronize access to their data. This task is simplified by programming languages that have permanent atomic data as one of the storage classes provided by the language. Examples of programming languages that are integrated with distributed transaction facilities include Argus and Avalon [Liskov and Scheifler 83, Liskov et al. 83, Liskov 84, Liskov et al. 87, Herlihy and Wing 87].

General purpose distributed transaction facilities encourage a variety of applications in addition to databases and traditional online transaction systems. Electronic mail systems have been designed and implemented using such facilities [Kato and Spector 88, Liskov and Scheifler 83]. Other applications include a room reservation service [Swedlow 87, Spector et al. 88], and transactional spread sheets for collaborative work [Jones and Spector ??]. Pausch designed a transaction-based system for input/output with emphasis on user interaction [Pausch 88]. Conventional file systems, which do not necessarily offer transactions on files, can benefit from the performance and reliability of transaction based management of their allocation information and other meta-data [Satyanarayanan 88, Hagmann 87].

Replication is a major use for distributed transaction facilities. The motivation for replication is to exploit the redundancy possible in a distributed system to achieve greater availability and reliability. The challenge in designing a replication algorithm is to achieve high availability despite the temporary inaccessibility of some number of replicas of data and still guarantee that all accesses observe consistent data. Generally, the goal is to preserve single copy semantics. This means that it should not be apparent to the users of replicated objects that replication is taking place.

One fundamental distributed replication strategy is unanimous update: every update operation must be done on all replicas, but reads may be directed to any replica. This replication strategy guarantees single copy semantics if the systems storing each replica guarantee data consistency locally. Unfortunately, the availability for updates is poor when large numbers of replicas are used. Update availability can be increased by using the communication system to buffer updates to replicas that are not available. The SDD-1 distributed database system uses this approach [Rothnie et al. 77]. A similar approach is taken in the available copies method [Bernstein and Goodman 84]. These methods do not address the problem of network partitions.

A second approach to replication is based on keeping primary and secondary copies of data. The primary copy receives all updates and then relays the updates to the secondary copies [Alsberg and Day 76]. An inquiry may be sent to a secondary copy, but the result might not reflect the most recent updates. Because responses to inquiries might not reflect recent updates, it is difficult for a primary/secondary copy strategy to duplicate the semantics of a non-replicated object. Techniques for alleviating this problem have been developed. For example, each file open operation in the Locus distributed file system ensures the currency of data by consulting a known synchronization site [Popek et al. 81]. Locus maintains availability after synchronization site failure by nominating a new synchronization site.

A third basic approach to replication is weighted voting [Gifford 79b, Gifford 81]. This approach is described in more detail because it is related to algorithms developed in Chapter 3. A file is stored as a collection of replicas, called representatives, each of which is assigned a certain number of votes. A representative consists of a copy of the file and a version number. The entire collection of representatives is called a file suite. Write operations write an updated copy of the file to each representative in a group called a write quorum and associate a new version number with all of these representatives. The new version number is higher than any version number previously associated with the file. Read operations read from each representative in a read quorum and return the data from the representative with the highest version number. Write operations establish a higher version number by incrementing the highest version number encountered in a read quorum.

A write quorum consists of any set of representatives whose votes total at least W and a read quorum consists of any set of representatives whose votes total at least R. The constants R and W are chosen so that their sum is greater than the total number of votes assigned to all representatives, N. Thus, every read quorum has a non-empty intersection with every write quorum and each inquiry is guaranteed to access at least one current copy of the data. Current copies will always have a higher version number than non-current copies so the read operation will always return current data. The values chosen for R and W control a tradeoff between the cost and availability of read and write operations.
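
To make the quorum rules concrete, the following is a minimal sketch of weighted voting for a single replicated value. It assumes every representative is reachable and held in memory, and it ignores concurrency control and stable storage; the names and data layout are invented for illustration rather than taken from Gifford's implementation.

```c
/* Minimal sketch of weighted voting for one replicated value. */
#include <stdio.h>
#include <assert.h>

#define NREPS 3
#define R 2            /* read quorum threshold  */
#define W 2            /* write quorum threshold */

struct representative {
    int votes;
    long version;      /* highest version written here */
    int value;         /* the "file" contents */
};

static struct representative suite[NREPS] = {
    { 1, 0, 0 }, { 1, 0, 0 }, { 1, 0, 0 }   /* total votes N = 3, so R + W > N */
};

/* Read from a read quorum and return the value with the highest version. */
static int quorum_read(long *version_out)
{
    int votes = 0, value = 0;
    long version = -1;
    for (int i = 0; i < NREPS && votes < R; i++) {
        votes += suite[i].votes;
        if (suite[i].version > version) {
            version = suite[i].version;
            value = suite[i].value;
        }
    }
    assert(votes >= R);
    if (version_out) *version_out = version;
    return value;
}

/* Write a new value with a version higher than any seen in a read quorum. */
static void quorum_write(int value)
{
    long version;
    (void)quorum_read(&version);            /* establish the current version */
    int votes = 0;
    for (int i = 0; i < NREPS && votes < W; i++) {
        suite[i].version = version + 1;
        suite[i].value = value;
        votes += suite[i].votes;
    }
    assert(votes >= W);
}

int main(void)
{
    quorum_write(42);
    printf("read %d\n", quorum_read(NULL)); /* prints 42 */
    return 0;
}
```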

Abbadi, Skeen and Christian extend the available copies approach to handle partitions [Abbadi et al. 85]. In this approach, the nodes maintain virtual partitions, which are logical groups corresponding to perceived actual partitions. The unanimous update approach is used within each virtual partition. Only a virtual partition containing a majority of the replicas for any object can access that object.

Abbadi and Toueg extend the virtual partitions approach to gain added flexibility [Abbadi and Toueg 86]. In this system, nodes maintain views, similar to the virtual partitions described above. Within each view, the weighted voting technique is used. Performance and availability tradeoffs between read and write operations can be controlled by choosing appropriate quorum sizes.

All of the replication methods above apply to files, data objects supporting only read and write operations. Herlihy describes a technique called generalized quorum consensus whereby the weighted voting technique can be systematically applied to any abstract data type [Herlihy 86]. This technique is completely general but results in implementations that are costly in terms of communications, storage, and computation. Herlihy suggests some optimizations to decrease these costs, but the emphasis in his work is on complete generality and theoretical investigation of quorum intersection issues rather than on providing efficient implementations. Bloch, Daniels, and Spector described an efficient weighted voting algorithm for replicated directory objects that exploits type-specific properties of directories [Bloch et al. 87].

The unanimous update, available copies, weighted voting, generalized quorum consensus, virtual partitions, and views replication algorithms all require an underlying distributed transaction system. Implementations of other replication algorithms may rely on transactions as well. Transactions are important to the replication algorithms because they guarantee the atomicity of operations on multiple nodes. It is much easier to assure the invariant that voted files be written to W representatives if all of the writes are done in one atomic transaction. Transaction synchronization guarantees that no other transaction observes a state where fewer than W copies of a file have been written. Similar properties are essential for the transactions used by the virtual partitions algorithm to reintegrate partitioned representatives.

Not all replication methods attempt to provide single copy semantics. Some variations on primary/secondary copy replication might propagate updates to secondary copies as soon as possible, perhaps using separate transactions for each copy. Transactions are still useful for locating primary or secondary copies as data migrates in the system, and for keeping track of which secondary copies need to receive updates.

When replication does not provide single copy semantics, transactions are often useful for coping with inconsistencies that result. For example, R* caches replicas of catalog information at nodes remote from the storage site of a table. The R* query compiler uses the cached information when planning the execution of a database query. If out of date catalog information is detected later during the compilation, the compilation transaction is backed out to a save point (a primitive form of nested transaction) established before catalog accesses [Daniels 82, Lindsay et al. 84].

2.2.1.3. Distributed Transaction Processing Facilities

A distributed transaction processing facility is an extension to a network operating system that permits the management of abstract objects and the composition of sequences of operations on the objects in transactions. Some of the services provided by a distributed transaction facility are simple extensions of services of network operating systems. A distributed transaction processing facility also cooperates with remote distributed transaction processing facilities to permit operations on remote objects to be composed into distributed transactions.

Abstract objects are data or input/output devices on which collections of operations have been defined. Access to an object is permitted only by these operations. A queue object having operations such as Enqueue, Dequeue, and EmptyQueue is a typical data object, and a CRT display having operations such as WriteLine and ReadLine is a typical I/O object. Objects vary in their lifetimes and their implementation. This notion of object is similar to the class construct of Simula [Dahl and Hoare 72], packages in Ada [Department of Defense 82], and the abstract objects supported by operating systems such as Hydra [Wulf et al. 74]. The operating system work has tended to emphasize authorization.

Many models exist for implementing abstract objects that are shared by multiple processes. In one model, objects are encapsulated in protected subsystems and accessed by protected procedure calls or capability mechanisms [Saltzer 74, Fabry 74]. The client-server model is commonly used in distributed transaction facilities. Servers that encapsulate data objects are called Data Servers in TABS and Camelot, Resource Managers in R* [Lindsay et al. 84], and Guardians in Argus [Liskov et al. 83].

Message transmission mechanisms and server organizations differ among distributed transaction facilities using the client/server model. Most systems offer communication connections that are transparent as to whether the sender and receiver are on the same or different machines. They differ in whether or not messages are typed, in the amount of data that can be transmitted in messages, and in whether communication connections can be passed in messages. The message system provided by the Accent and Mach operating systems is particularly flexible. Many Mach processes may have send rights to a message queue (called a port), but only one has receive rights. Send rights and receive rights can be transmitted in messages along with ordinary data. Large quantities of data are efficiently conveyed between Mach processes on the same machine via copy-on-write mapping into the address space of the recipient process. This message model differs from that of Unix 4.2 [Joy et al. 83] and the V Kernel [Cheriton 84a] in that messages are typed sequences of data which can contain port capabilities, and that large messages can be transmitted with nearly constant overhead.

The programming effort associated with packing and unpacking messages can be reduced in a message oriented system by the use of a remote procedure call facility like Matchmaker [Jones et al. 85]. The term remote procedure call used here applies to both intra-node and inter-node communication. Matchmaker's input is a syntactic definition of procedure headers and its outputs are client and server stubs that pack data into messages, unpack data from messages, and dispatch to the appropriate procedures on the server side.

Servers that never wait while processing an operation can be organized as a loop that receives a request message, dispatches to execute the operation, and sends a response message. Unfortunately, servers may wait for many reasons: to synchronize with other operations, to execute a remote operation or system call, or to page-fault. For such servers, there must be multiple threads of control within a server, or else the server will pause or deadlock when it need not.

One implementation approach for servers is to allocate independently schedulable processes that share access to data. With this approach, a server is a class of related processes - in the Simula sense of the word "class." An alternative approach is to have multiple lightweight processes within a single server process. Lightweight processes vary in the degree of parallelism and multiprocessor support they provide. Simple coroutines or other facilities implemented at user level allow servers to schedule other operations when a single process would block for synchronization, but such servers still block for page faults or synchronous I/O. The lightweight processes, called threads, that are provided by the Mach operating system permit multiprogramming for I/O and paging. Mach threads exploit multiprocessors by allowing threads from one heavyweight process (task) to simultaneously run on different processors. The topic of server organization has been clearly discussed by Liskov and Herlihy [Liskov and Herlihy 83].
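
The receive/dispatch/reply loop and the use of multiple threads within one server process can be sketched as below. This is a self-contained simulation using POSIX threads rather than the structure of any particular Mach or Camelot server; the request queue stands in for the message system and all names are invented for illustration.

```c
/* Several lightweight threads share one server's state and each runs the
 * receive / dispatch / reply loop, so the server keeps serving while any
 * one thread is blocked.  Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NREQUESTS 16

static int queue[NREQUESTS];
static int head = 0, tail = 0, done = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* "Receive" the next request message, or -1 when the server is shutting down. */
static int receive_request(void)
{
    pthread_mutex_lock(&lock);
    while (head == tail && !done)
        pthread_cond_wait(&nonempty, &lock);
    int req = (head == tail) ? -1 : queue[head++ % NREQUESTS];
    pthread_mutex_unlock(&lock);
    return req;
}

/* Each server thread: receive a request, dispatch on it, send the reply. */
static void *server_thread(void *arg)
{
    int req;
    while ((req = receive_request()) != -1)
        printf("thread %ld handled request %d\n", (long)arg, req);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, server_thread, (void *)i);

    for (int r = 0; r < NREQUESTS; r++) {       /* clients enqueue requests */
        pthread_mutex_lock(&lock);
        queue[tail++ % NREQUESTS] = r;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = 1;                                    /* shut the server down */
    pthread_cond_broadcast(&nonempty);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```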

In addition to providing communication, distributed transaction facilities support operations on objects by providing a name resolution service that creates bindings between object names in programs and communication connections to servers encapsulating the objects. A communication connection to a server and a logical object identifier that distinguishes between the various objects implemented by that server are sufficient to name an object. Name resolution is a difficult process because there are conflicting requirements. One name in different programs might refer to different objects, or different names might refer to the same object. Names might be independent of an object's changing location, or names might be sensitive to the location of objects. Lindsay has discussed the resolution of names in the context of a distributed database system [Lindsay 80].

Resources used by abstract objects include processing and storage. A distributed transaction facility can allocate processing resources to servers encapsulating objects using conventional operating system scheduling techniques. Operating system scheduling does need to be tuned for the frequent transfers of control that occur between processes in message based systems.

Storage management entails a number of considerations. The amount of storage required for an object can vary enormously. Objects like counters need a few bytes while video images require megabytes of storage. The range of reasonable object sizes exceeds the range provided by many file systems. Object sizes are not static but can grow and shrink. A distributed transaction facility will probably use a variety of allocation strategies to meet these requirements.

Because a very large amount of storage is required, a distributed transaction facility must manage a hierarchy consisting of secondary (or even tertiary) storage, like disks, and primary processor memory. Objects, or parts of objects, must be moved from disk to main memory when they are needed and main memory must be cleaned of less frequently referenced objects. It can be important to allocate large objects contiguously on disk for good access performance.

A distributed transaction facility should integrate the management of objects. A convenient interface is needed for controlling the bindings of programs to processes and object names, for starting and stopping servers, and for allocating storage to them. A distributed transaction facility needs to provide system directory objects to centralize this information. Such system objects could be implemented using the services of the distributed transaction facility itself.

The preceding functions of a distributed transaction processing facility are all essential to its operation. However, none of them distinguishes a transaction processing facility from the distributed systems services necessary for building distributed systems without transactions. To support transactions the system must have facilities for serializability, atomicity, and permanence.

Serializability requires controlling the interactions of concurrent transactions. This entails selection of the concurrency control method and its implementation. Many techniques exist for synchronizing the concurrent execution of transactions. Locking, optimistic, timestamp, and many hybrid schemes are frequently discussed; these are surveyed by Bernstein and Goodman [Bernstein and Goodman 81]. If different transaction processing facilities use different concurrency control methods, then transactions accessing objects in both facilities are not guaranteed to be serializable [Weihl 84]. The designers of a distributed transaction facility must agree in advance on compatible concurrency control methods.

The implementation of the concurrency control function can have a major impact on the degree of concurrency possible in a system, regardless of the concurrency control method chosen. Greater concurrency is often possible if a scheduler has knowledge of the semantics of the operations being scheduled. If all concurrency is managed by the distributed transaction facility then it must assume that concurrent operations are not permitted on any object that it manages. On the other hand, individual servers can exploit knowledge about individual operations to increase concurrency. For example, a directory object can permit concurrent inserts of entries with different keys. Weihl discussed type-specific synchronization in his dissertation, and Schwarz, Spector and Korth have described type-specific schemes for locking [Korth 83, Schwarz and Spector 84, Schwarz 84].

Concurrency control methods that delay operations (as locking does) can permit cycles of transactions waiting for resources held by other transactions. To deal with this problem a distributed transaction facility can implement local and distributed deadlock detectors that identify and break cycles of waiting transactions [Obermarck 82, Lindsay et al. 84, Knapp 87]. Other systems rely on time-outs that are explicitly set by system users [Tandem 82, Spector et al. 85b].

Transaction management functions are used to define transactions and achieve atomicity for transactions that may modify objects at multiple transaction processing facilities. One transaction management function is the assignment of transaction identifiers. For a simple transaction model, an identifier consisting of a unique identifier and the identifier of the originating transaction processing system is sufficient. When nested transactions are supported more complex transaction identifiers are required. To resolve synchronization conflicts between two transactions, the genealogy of the transactions must be compared. However, it is very expensive to embed complete transaction genealogies in transaction identifiers. Duchamp considers solutions to this problem in his dissertation [Duchamp 88].
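
For the simple, non-nested model, the identifier just described might be represented as in the following sketch. The field names and sizes are assumptions chosen for illustration, not the format used by any of the systems cited.

```c
/* Hypothetical flat (non-nested) transaction identifier: a locally unique
 * value qualified by the originating node, so identifiers generated at
 * different transaction processing facilities never collide. */
#include <stdint.h>
#include <stdbool.h>

struct transaction_id {
    uint32_t node;      /* originating transaction processing facility */
    uint64_t local_id;  /* unique at that facility, e.g. a monotone counter */
};

static bool tid_equal(struct transaction_id a, struct transaction_id b)
{
    return a.node == b.node && a.local_id == b.local_id;
}

/* Allocating a new identifier only requires local state at the originator. */
static struct transaction_id new_tid(uint32_t this_node)
{
    static uint64_t counter = 0;   /* would be kept across crashes in practice */
    struct transaction_id t = { this_node, ++counter };
    return t;
}

int main(void)
{
    struct transaction_id a = new_tid(7), b = new_tid(7);
    return tid_equal(a, b);        /* distinct identifiers: returns 0 */
}
```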

Transaction commit protocols guarantee the atomicity property for distributed transactions. The objective of a commit protocol is the guarantee that all objects modified by a transaction agree that the transaction has either committed or aborted and that all objects participating in the transaction are informed of the transaction's termination. Commit protocols must reach this agreement despite the temporary unavailability of one or more transaction participants.

The basic commit protocol used by most distributed transaction processing facilities is the two phase commit protocol [Gray 78, Lindsay et al. 79]. In the first phase of the protocol a distinguished coordinator requests all participants to become recoverably prepared to either commit or abort the transaction. The coordinator collects votes from each of the participants. A yes vote indicates the participant is willing to commit and is recoverably prepared to either abort or commit the transaction. After all votes are collected the commit coordinator recoverably records the commit decision (if all participants voted yes) and the second phase of the protocol is entered. In the second phase of the protocol the coordinator informs all participants of the transaction outcome and the participants acknowledge the notification.
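
The coordinator's role in the basic protocol can be sketched as follows. This is a simplification for illustration only: the primitives SendPrepare, CollectVote, ForceCommitRecord, and SendOutcome are assumed, participant acknowledgments and failure handling are omitted, and no particular system's protocol is reproduced.

    type
      Outcome = (TransCommit, TransAbort);

    { Assumed primitives: SendPrepare(i) asks participant i to become     }
    { recoverably prepared; CollectVote(i) returns true for a yes vote;   }
    { ForceCommitRecord recoverably records the commit decision;          }
    { SendOutcome(i, o) informs participant i of the outcome.             }
    function TwoPhaseCommit(nParticipants : integer) : Outcome;
    var
      i : integer;
      allYes : boolean;
      decision : Outcome;
    begin
      { Phase one: solicit and collect votes. }
      for i := 1 to nParticipants do
        SendPrepare(i);
      allYes := true;
      for i := 1 to nParticipants do
        if not CollectVote(i) then
          allYes := false;

      { Recoverably record the decision before entering phase two. }
      if allYes then
      begin
        ForceCommitRecord;
        decision := TransCommit
      end
      else
        decision := TransAbort;

      { Phase two: inform every participant of the outcome. }
      for i := 1 to nParticipants do
        SendOutcome(i, decision);

      TwoPhaseCommit := decision
    end;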

The basic two phase commit protocol requires four messages for every participant (other than the coordinator) and each participant must make two recoverable state transitions. Optimizations that reduce this overhead for common cases have been developed [Mohan 83, Duchamp 88]. The basic algorithm can block indefinitely when the coordinator crashes during the phase one to phase two transition. The participants remain prepared to either commit or abort the transaction and other transactions may be prevented from accessing objects at participants because of concurrency controls. Commit protocols that reduce the probability of blocking address this problem [Dwork and Skeen 83, Skeen 82, Duchamp 88].

The recovery component of a transaction processing system is responsible for implementing the permanence property of transactions. The recovery component is also called upon to undo the effects of an aborted transaction in order to provide the atomicity property of transactions. To discuss different recovery schemes it is first necessary to describe the three-tiered model of storage upon which they are based. Volatile storage is the main memory of transaction processors. The contents of this memory are (detectably) lost by processor or other failures. Non-volatile storage is generally disk storage attached to transaction processors. Data in non-volatile storage survives processor and power failures, but is destroyed by various detectable errors that are expected to occur at (low) predictable rates. Non-volatile storage may also be battery backup CMOS memory or other low-latency memory that survives power failures and other common errors.

Stable storage is constructed by redundantly storing data in multiple non-volatile storage modules. The modules are chosen to have independent failure modes and the storage is manipulated by carefully designed procedures so that data successfully written to stable storage is not lost unless a disaster that the system is not designed to survive occurs [Lampson 81].

Haerder and Reuter's taxonomy of transaction-oriented database recovery classifies recovery algorithms according to four different criteria [Haerder and Reuter 83]. Recovery schemes can also be conveniently separated into those based on careful replacement [Verhofstad 78] and those based on logging.

Careful replacement schemes have been described by Lorie [Lorie 77] and Lampson [Lampson 81]. Such schemes are also referred to as shadow page schemes because when a page of recoverable storage is modified by a transaction a new page is allocated for the data and the old page remains as a "shadow" copy of the data. When a transaction commits, the shadow pages are "carefully replaced" with the new pages in an atomic operation. The Xerox Distributed File System uses shadow pages [Israel et al. 78].

Shadow page recovery schemes are frequently criticized as having poor performance (especially for normal processing, as opposed to crash recovery), limited concurrency, and poor support for the commit protocols needed for distributed transactions [Gray et al. 81, Reuter 84]. These problems are remedied by write-ahead log schemes [Date 83, Gray 78, Lindsay et al. 79, Schwarz 84, Oki 83, Oki et al. 85]. In write-ahead logging, information necessary to either redo or undo operations of transactions is written to a recovery log before modified data is written to secondary storage and before transactions commit. Log-based transaction recovery is discussed extensively in Section 2.2.2.2.

In summary, a distributed transaction facility provides many functions in support of the management of objects and transactions that access them. The facility supports objects implemented and accessed according to a general model like the client/server model. The facility must provide a mechanism for structuring objects, and a means, such as RPCs, for invoking operations on them. Naming objects and binding to instances of them is a critical function. A distributed transaction facility must efficiently manage a hierarchy of storage for objects. The objects of the distributed transaction facility must utilize compatible concurrency control mechanisms to synchronize the actions of transactions. Transaction management functions assure the atomicity of transactions and recovery management functions assure the permanence of objects managed by the facility.

2.2.2. Log Definition, Use, and Implementation

The concepts of transaction and distributed transaction facility presented above form the basis for the detailed discussion of transaction recovery logs presented here. Section 2.2.2.1 starts the discussion with a description of a simplified log and its interface. Section 2.2.2.2 explains the use of logs in recovery algorithms. Sections 2.2.2.3 and 2.2.2.4 discuss a number of issues in the practical implementation and use of recovery logs.

2.2.2.1. A Simple Log Definition

A log is an abstraction for stable storage used by a transaction processing system's recovery manager to store information needed to implement the atomicity and permanence properties of transactions. The log manager is the component of a transaction processing facility that implements the log abstraction. A minimal set of log operations consists of the five calls shown in Figure 2-1.

routine Write( recPtr : pointer_t; recLength : unsigned_t; OUT lsn : lsn_t);
   Appends a new record to the log and returns the LSN for the record. lsn is greater than any LSN previously assigned.

routine Read( lsn : lsn_t; OUT recPtr : pointer_t; OUT length : unsigned_t);
   Read a record by LSN. Signals an error if lsn is not a valid LSN.

routine Previous( lsn : lsn_t; OUT previousLsn : lsn_t);
   Return the LSN of the immediately preceding log record given an LSN. Signals an error if lsn is not a valid LSN.

routine Next( lsn : lsn_t; OUT nextLsn : lsn_t);
   Return the LSN of the immediately succeeding log record given an LSN. Signals an error if lsn is not a valid LSN.

routine EndOfLog( OUT highLsn : lsn_t);
   Return the LSN of the most recently written log record.

Figure 2-1: A Simple Log Interface

Logs are logically an append-only sequence of unstructured records. The log manager knows the length of a record but nothing about its internal structure. In practice a log manager often places a limit on the length of records it stores, but this limit is usually quite large. Restricting records to the size of a disk block is unusual and insufficiently flexible, because some recovery managers will log before and after images of large objects (like disk blocks) in single log records.

Each record appended to a stable storage log with the Write call is assigned a unique log sequence number (LSN) by the log manager. 1 Log records can later be retrieved from the log by passing their LSNs as arguments to Read calls.

LSNs are assigned as monotonically increasing identifiers. This allows a transaction processing facility to infer the ordering of events based on the LSNs of the log records describing the events. Recovery and synchronization algorithms depend on this property.

The EndOfLog operation always returns the same LSN as the most recent Write operation. EndOfLog may be used to find a starting point in the log for recovery processing after a node crash. The Next and Previous calls are used to examine log records in the sequence they were written (or in reverse sequence).

2.2.2.2. Log-based Transaction Recovery Algorithms

An idealized log-based transaction recovery system implements two forms of stable storage. A stable storage recovery log implementation is used by a recovery manager to implement recoverable object storage. The recoverable object storage is a hierarchy consisting of non-volatile disk storage and volatile main memory. Most data resides on disk storage and it is transferred from disk to main memory when accessed. Modifications to recoverable data are initially made in main memory and later propagated to disk. The recoverable storage survives three classes of failures: transaction aborts, node crashes that erase volatile memory, and media failures that corrupt disk storage. This section first discusses some general principles for log-based recovery algorithms and then describes some particular algorithms. A summary of the log usage made by different recovery algorithms concludes this section.

Log-based transaction recovery algorithms can differ in the exact type of information they record in the log and elsewhere. However, all recovery methods must maintain information sufficient for recovering from each of the three types of failures. Log-based algorithms follow the write-ahead log principle: recovery data for a transaction must be written to the stable log before a transaction prepares or commits and recovery data for a recoverable data page must be written to the log before the page is written from main memory to disk.
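
The write-ahead principle can be restated as two simple checks against the LSN of the newest record known to be in stable storage. The sketch below is only a restatement of the rule; the names Lsn, highStableLsn, pageLsn, and lastLsnOfTransaction are chosen for this illustration.

    type
      Lsn = integer;

    { A dirty page may be written to disk only after every log record   }
    { describing it (identified here by the page's highest LSN) is in   }
    { the stable log.                                                   }
    function MayWritePageToDisk(pageLsn, highStableLsn : Lsn) : boolean;
    begin
      MayWritePageToDisk := pageLsn <= highStableLsn
    end;

    { A transaction may prepare or commit only after all of its log     }
    { records are in the stable log.                                    }
    function MayPrepareOrCommit(lastLsnOfTransaction, highStableLsn : Lsn) : boolean;
    begin
      MayPrepareOrCommit := lastLsnOfTransaction <= highStableLsn
    end;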

1In this simple log interface, the Write call writes each log record directly to stable storage, rather than buffering the log record in volatile storage as the practical interface described in Section 2.2.2.3 does. Also, in this simple interface LSNs are assigned by the log manager, not by the log client. LSN assignment is also discussed in Section 2.2.2.3.

Recovery managers are integrated with buffer pool (or virtual memory 2) managers so that the write-ahead log principle can be enforced. Different recovery algorithms impose different policies on buffer pool managers. The policies described here are defined by Haerder and Reuter in their survey of recovery mechanisms [Haerder and Reuter 83]. Some recovery algorithms permit the buffer manager to steal dirty main memory pages by writing them to disk before the transactions that modify them complete. The write-ahead log principle requires that the log records referring to the dirty pages be written to the stable storage log before the pages are cleaned. No-steal buffer management policies result in simpler recovery algorithms but limit the length of transactions. Some recovery algorithms require the buffer manager to force all pages modified by a transaction to disk when the transaction commits. Recovery algorithms with no-force buffer management policies are more complicated, but they do less I/O than those which force buffers at commit.

To permit transaction aborts, a recovery manager logs enough information to undo the effects of a transaction. As a transaction executes, records describing its operations are appended to the log. The information logged will vary depending on the recovery algorithm.

A value-based logging algorithm with a steal buffer management policy logs the old and new values of the recoverable storage modified by the transaction in modify log records. A value-based recovery algorithm using a no-steal buffer management policy needs only to record the new values and addresses of locations modified by the transaction. An operation-based recovery algorithm writes modify records with enough information about a logical operation to undo it (and to redo it if a no-force buffer management policy is used). For example, a record in an operation log for a directory insert operation would contain the key of the entry, so that the entry can be removed if the transaction aborts.

When a transaction aborts, the information contained in its log records is used to undo the transaction's effects on recoverable storage. Most logging disciplines require that a transaction be undone in reverse of the order that its operations were performed. To facilitate this, many transaction systems maintain a singly linked chain of a transaction's modify log records. Each log record written by a transaction contains the LSN of the immediately preceding log record for the transaction. The LSN of each transaction's most recently written log record is kept in the recovery manager's virtual memory. 3
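
With such a chain, transaction abort reduces to a short loop over the transaction's records, sketched below. The routines ReadModifyRecord and UndoOperation and the NullLsn end-of-chain marker are assumptions of this sketch rather than part of the interfaces defined above.

    const
      NullLsn = 0;        { assumed marker for the start of a transaction's chain }
    type
      Lsn = integer;

    { Assumed primitives: ReadModifyRecord(lsn, prevLsn) reads a modify   }
    { record and returns the LSN of the transaction's preceding record;  }
    { UndoOperation(lsn) reverses the effect described by that record.   }
    procedure AbortTransaction(lastLsn : Lsn);
    var
      lsn, prevLsn : Lsn;
    begin
      lsn := lastLsn;                    { the transaction's newest record }
      while lsn <> NullLsn do
      begin
        ReadModifyRecord(lsn, prevLsn);
        UndoOperation(lsn);              { undo in reverse of write order }
        lsn := prevLsn
      end
    end;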

A crash erases the contents of main memory and leaves each recoverable storage page on disk in one of two states. Disk pages that were not being modified in main memory at the time of the crash reflect all logged updates. Disk pages that were dirty in main memory at the time of the crash do not reflect all updates in the recovery log. An efficient recovery algorithm must distinguish between these two states so that it restores only pages that are missing updates. To accomplish this a recovery manager can log fetch and end-write records to indicate when a page is first modified in main memory and when a page has been cleaned to disk. Pages which have fetch records but no corresponding end-write records must be restored during crash recovery. Other techniques can be used instead of fetch and end-write records. Fetch records are redundant with the first modify record referring to a page, and end-write records can be replaced by a summary in a checkpoint record.

2Eppinger's dissertation discusses the integration of recovery management and virtual memory [Eppinger 88].

3Tandem's TMF transaction processing facility does not maintain transaction chains in the log and must scan the log sequentially to abort a transaction [Helland 88].

A crash divides transactions into three classes. First, there are transactions that committed before the crash. The effects of completed transactions must appear in recoverable storage after recovery. If the recovery manager forces all dirty pages at transaction commit time, there is no work necessary to redo committed transactions. If a more efficient no-force buffer management policy is used it may be necessary to redo some of the operations of committed transactions during crash recovery. Second, transactions that aborted before the crash and in-progress transactions aborted by the crash must have their effects expunged from recoverable storage. Third, distributed transactions in the prepared state must be returned to the prepared state. Typically recovery managers restore prepared transactions as though they were committed and also restore their transaction management and concurrency control state. 4 Transaction managers must write state transitions to the log when transactions prepare, commit, or abort so that transaction classes may be identified during recovery. Concurrency control state may be logged when transactions prepare or it may be reconstructed from other log records.

The type of logging used influences how carefully restorations must be applied to pages. Not all modifications are idempotent or insensitive to the order in which they are redone or undone. Write operations (that is, value-based logging) are always idempotent, but the most recently committed write of any piece of recoverable storage must be the last (or only) one redone. For operation logging, the recovery manager may need to place restrictions on concurrent operations because of the order in which operations might be undone or redone by crash recovery. Often a recovery manager using operation logging will write the LSN of the most recently logged modification with each page, either in the sector header or in a reserved portion of the page. By examining these page-LSNs the recovery manager can determine what operations are reflected in the page. The recovery manager uses the LSNs returned by each log write to update page-LSNs.
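
The page-LSN test amounts to a single comparison made for each logged operation that might need to be redone. In the sketch below the Page record, RedoOperation, and the parameter names are illustrative assumptions.

    type
      Lsn = integer;
      Page = record
        pageLsn : Lsn   { LSN of the newest operation already reflected in the page }
      end;

    { Redo a logged operation against a page only if the page does not  }
    { already reflect it; afterwards advance the page-LSN so later      }
    { comparisons remain correct.                                       }
    procedure MaybeRedo(var p : Page; recLsn : Lsn);
    begin
      if recLsn > p.pageLsn then
      begin
        RedoOperation(p, recLsn);   { assumed: re-applies the logged operation }
        p.pageLsn := recLsn
      end
    end;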

In summary, for node failures, a recovery manager analyzes logged information to determine transaction outcomes and to determine what pages may have been dirty in main memory at the time of the crash. This information is used to determine what logged operations must be redone or undone. The order in which operations are undone or redone depends in part on the restrictions that recovery places a priori on the interleaving of operations of different transactions.

4Only part of the concurrency control state may be required to re-prepare a transaction. For example, only exclusive locks must be restored.

Different recovery algorithms have varying costs, both during normal processing and crash recovery. The costs of algorithms during normal processing include the volume and frequency of log writes and the algorithm's influence on paging behavior. Recovery algorithms which force the buffer pool have comparatively poor performance. The costs of crash recovery include the number of log records read, and the number of pages of recoverable data that must be paged into main memory, restored, and then (eventually) paged out. The buffer management policy affects the amount of data paging during recovery. No-force recovery algorithms will read more data pages during recovery than force algorithms. Recovery algorithms with steal buffer pool management policies may read data pages for both redo and undo processing, while no-steal buffer managers mainly read pages for redo processing.

The amount of log to be analyzed during crash recovery can be reduced using checkpoints. A checkpoint is information, usually written to the log, that has the effect of limiting the amount of redo activity that is performed during recovery [Haerder and Reuter 83]. Checkpoints often consist of a list of dirty recoverable pages in main memory and a list of active and prepared transactions. A policy for writing dirty pages to disk is often integrated with the policy for setting checkpoint intervals so that no page remains dirty for a large number of checkpoints. The most recent checkpoint in a recovery log provides a starting point for the log analysis that determines transaction and page status at the time of a crash.
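
A checkpoint record of the kind just described might be declared as follows. The field names, the fixed array bounds, and the per-entry LSNs are illustrative assumptions, not the format of any particular system.

    const
      MaxDirtyPages   = 64;
      MaxTransactions = 64;
    type
      Lsn = integer;
      TransState = (TransActive, TransPrepared);
      DirtyPageEntry = record
        pageId    : integer;
        modifyLsn : Lsn      { LSN of the first modify record since the page was fetched }
      end;
      TransEntry = record
        transId : integer;
        state   : TransState;
        lastLsn : Lsn        { head of the transaction's backward record chain }
      end;
      CheckpointRecord = record
        nDirty : integer;
        dirty  : array[1..MaxDirtyPages] of DirtyPageEntry;
        nTrans : integer;
        trans  : array[1..MaxTransactions] of TransEntry
      end;

During crash recovery the oldest modifyLsn in the dirty-page list bounds how far back in the log the redo work can reach, and the transaction list seeds the analysis of transaction outcomes.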

Media recovery can be accomplished by procedures similar to crash recovery. Instead of starting from storage on disk, media recovery starts from an archive dump that is usually made to magnetic tape. For crash recovery, recoverable storage is not necessarily consistent with the recovery log. That is, it does not necessarily reflect all logged operations. In addition, some aborted transactions may need to be undone during crash recovery. Media recovery does not need to deal with these inconsistencies. A sharp archive dump that reflects all committed transactions as of a certain LSN can be used for media recovery. Media recovery from a sharp dump simply redoes the operations of committed transactions from the log. A fuzzy dump, taken gradually during normal operation, can be used for media recovery, and in this case recovery functions as though the recovery pass started from a checkpoint taken before the start of the dump. Rather than using end-write records in the log to determine what restorations are to be made to each page, media recovery uses a dump LSN written with each page. A dump LSN is the high LSN at the time a page is written to the dump. It is important for the media recovery logic to be able to selectively restore small increments of recoverable storage (like disk pages) rather than entire disks.

A very simple log-based recovery algorithm makes no use of non-volatile memory. All data resides in virtual memory when it is manipulated and recovery is always media recovery starting from an archive dump. Crash recovery consists of replaying all operations logged after the archive dump and undoing operations of aborted transactions. Transaction aborts should write compensation log records when they undo an operation so that the compensations will be redone during crash recovery. Birrell et al. described a variant of this approach without general transaction abort support [Birrell et al. 87]. This roll forward from dump recovery approach is efficient for small amounts of recoverable storage.

Lindsay et al. described a value logging recovery algorithm that permits a steal and no-force buffer management policy [Lindsay et al. 79]. Each modify record contains old and new values for a page modification and locks are set on entire pages. Lindsay's algorithm uses three passes for node recovery. First, an analysis pass starts at the most recent checkpoint and continues forward in the log to determine transaction status and buffer pool contents at the time of the crash. Second, an undo pass starts at the end of the log and scans backward to the first record of the oldest transaction aborted by the crash. The undo pass restores old values logged by aborted transactions. Third, a redo pass starts from the fetch record of the oldest dirty page in the buffer pool at the time of the crash and scans forward restoring the values logged by committed transactions to pages that were in the buffer pool at the time of the crash. Lindsay's algorithm uses page-LSNs, and fetch and end-write log records.

Schwarz adapted Lindsay's recovery algorithm to do all recovery with a single backward pass starting from the end of the log [Schwarz 84]. Careful bookkeeping combines the analysis, redo, and undo in one pass. Fetch and end-write records are used but page-LSNs are unnecessary. Schwarz's algorithm allows logging and locking of arbitrary non-overlapping quanta, rather than pages. Eppinger and Thompson's modification of Schwarz's value logging algorithm permits overlapping byte ranges [Thompson 88a]. They also permit new-value only logging if buffer management is restricted to a no-steal policy. During transaction undo, old values are obtained either from the pages on disk or from previously logged new values.

Operation logging recovery algorithms are desirable because they permit greater concurrency than value logging algorithms and because they potentially reduce the volume of log data. Schwarz's operation logging algorithm logs operations on arbitrary size objects [Schwarz 84]. Operations are restricted so that if different transactions perform concurrent operations on the same object one transaction's action can be undone before a preceding transaction's action is redone. The node recovery algorithm uses analysis, undo and redo passes like Lindsay's algorithm. Fetch and end-write records are used and page-LSNs guarantee that non-idempotent operations are not redone when they should not be. A steal and no-force buffer pool management policy is supported, but the algorithm does a lot of bookkeeping to determine when multi-page objects are operation consistent on disk so that redo and undo operations can be applied to them. When an object is not operation consistent on disk it must be restored from an archive copy or logged snapshot.

Recovery for operation logging is greatly simplified when operations refer to single page objects. A log record can describe an operation that modifies more than one page, but it must be possible to redo or undo the operation on only one page. Mohan et al.'s algorithm is based on this principle [Mohan et al. 88]. This algorithm uses page-LSNs and tracks dirty pages by their modify-LSNs, which are the LSNs of the first modify record referring to a page since it was brought into main memory. A combined analysis and redo pass scans forward from the modify-LSN of the oldest page in the buffer pool at the time of the crash (as determined from the most recent checkpoint record). The first pass redoes all operations including actions and compensations for transactions that aborted before the crash and transactions that were aborted by the crash. After redo, transactions aborted by the crash are undone either by a backwards undo scan or by invoking the regular transaction abort logic. Transaction aborts write compensation records that are linked to the corresponding operation records. Compensations are redone during the node recovery redo scan. This use of compensation records avoids writing duplicate compensation records or compensation records for compensation records during crash recovery.

Log-based recovery methods vary in a number of details. However, all methods place some common requirements on the recovery log. First, all recovery methods adhere to the write-ahead log invariant. Information is always written to stable storage in the recovery log before it propagates to non-volatile memory and before transactions commit and prepare. Second, all recovery methods rely on the ordering of operations expressed in the recovery log to provide an ordering for redo and undo. Third, all recovery methods use checkpoints and other techniques to bound the amount of log processed for recovery. Fourth, all recovery methods need a starting point for crash recovery. For some methods this is the end of the log, while others must locate the most recent checkpoint record. Most recovery methods make random access to the log when performing transaction abort processing. Some recovery methods scan the log sequentially forward during crash recovery. Others scan the log sequentially backwards. Still others use both forward and backward scans.

2.2.2.3. Practical Log Interfaces

The simple abstraction for a recovery log described in Section 2.2.2.1 is not practical for a number of reasons. Some of the limitations of the simple abstract log are due to implementation issues that will be addressed in Section 2.2.2.4. Other extensions are required to support some recovery algorithms and sophisticated distributed transaction facilities. Figure 2-2 gives a more practical log interface.

The Open call indicates one of the most important features of a realistic log interface.

routine Open( recMan : authenticator_t);
   Opens the log for a particular recovery manager. Signals an error if the authentication is invalid.

routine GetRestartRecord( OUT recPtr : pointer_t; OUT recLength : unsigned_t);
   Reads the emergency restart record. recLength can be zero if the record has not been written using PutRestartRecord.

routine PutRestartRecord( recPtr : pointer_t; recLength : unsigned_t);
   Writes the emergency restart record. An error is signaled if the record exceeds a system defined maximum length.

routine Write( recPtr : pointer_t; recLength : unsigned_t; OUT lsn : lsn_t);
   Appends a new record to the log, possibly buffering the record in volatile memory. lsn is greater than any LSN previously assigned by this log manager for any recovery manager.

routine Force( recPtr : pointer_t; recLength : unsigned_t; OUT lsn : lsn_t);
   Appends a new record to the log. Writes the record and any previously buffered records to stable storage before returning. lsn is greater than any LSN previously assigned by this log manager for any recovery manager.

routine ForceLSN( lsn : lsn_t);
   Forces all records with LSNs less than or equal to lsn to stable storage, if they are not already in stable storage. Signals an error if lsn is not a valid LSN.

routine Read( lsn : lsn_t; OUT recPtr : pointer_t; OUT length : unsigned_t);
   Read a record by LSN. Signals an error if lsn is not an LSN for a record that this recovery manager has written.

Figure 2-2: Practical Log Interface

routine CreateCursor( startLsn : lsn_t; forward : boolean; OUT cursor : cursor_t);
   Create a cursor on this recovery manager's log records. The first record returned by the cursor will have startLsn as its LSN. Subsequent Next operations on the cursor will return records with greater LSNs if forward is true and smaller LSNs otherwise. Signals an error if startLsn is not valid for this recovery manager.

routine Next( cursor : cursor_t; OUT lsn : lsn_t; OUT recPtr : pointer_t; OUT length : unsigned_t);
   Returns the next record from a cursor and the record's LSN. Signals an error if the cursor is invalid or the end of the cursor has been reached.

routine CloseCursor( cursor : cursor_t);
   Closes a cursor. Signals an error if the cursor is invalid.

routine HighLsn( OUT highLsn : lsn_t);
   Returns the highest LSN in the log for a recovery manager.

routine HighStableLsn( OUT highStableLsn : lsn_t);
   Returns the highest LSN in the stable storage log for a recovery manager.

routine Truncate( newLowLsn : lsn_t);
   Sets the lowest LSN for any record kept online for a recovery manager to be newLowLsn. Records with lower LSNs can be spooled offline or discarded depending on operating policy.

Figure 2-3: Figure 2-2 Continued

A sophisticated distributed transaction facility supports many recovery managers with a common recovery log. The recovery managers might use different recovery algorithms, but they use common transaction management and a common recovery log. The motivation for a common log lies in the implementation of a stable storage recovery log using mirrored disk storage. If every recovery manager and the transaction manager have their own logs the number of I/Os required to commit a transaction is more than twice the number of participating recovery managers. Each recovery manager must individually write its own recovery log to stable storage (two disk writes per recovery manager) and then the transaction manager writes the commit record to its own log

(another two disk writes). A common log providing service to multiple recovery managers allows a single pair of disk writes to be used for a commit, because the recovery managers' writes will be buffered as explained below. Disk space utilization can be improved as well.

The performance of a mirrored disk implementation of a recovery log motivates splitting the log write primitive into the Write, Force, and ForceLSN calls shown in Figure 2-2. The Write call buffers a record in volatile storage and returns its LSN. The log manager can initiate transfers to disk when log buffers fill. A recovery manager would use the Write call to spool fetch, end-write, and modify records. The recovery manager then uses the ForceLSN call to force the most recently written operation record for a recoverable storage page (along with all preceding records) before the page is copied to disk. At the end of a transaction, the transaction manager uses the Force call to write a commit record to stable storage along with all previously buffered log records. The common recovery log property guarantees that forcing the transaction manager's commit record also forces all of the records that different recovery managers have previously written for the transaction. The Force call is redundant with the Write and ForceLSN calls, but it is more common to need to force commit records when they are written than to force other log records for page cleaning.
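
The intended division of labor among the three calls can be sketched with the Pascal fragments below, which use the types and routines of Figure 2-2. The helper WritePageToDisk and the record pointer and length arguments are assumptions of the sketch.

    { Spooling a modify record: the record is only buffered in volatile  }
    { memory; no disk write occurs here.                                 }
    procedure LogModify(recPtr : pointer_t; recLen : unsigned_t; var pageLsn : lsn_t);
    begin
      Write(recPtr, recLen, pageLsn)
    end;

    { Cleaning a page: the newest record describing the page, and all    }
    { earlier records, must be stable before the page itself is written. }
    procedure CleanPage(pageLsn : lsn_t);
    begin
      ForceLSN(pageLsn);
      WritePageToDisk          { assumed primitive }
    end;

    { Committing: one Force covers the commit record and every record    }
    { previously buffered with Write by any recovery manager that shares }
    { the log.                                                           }
    procedure CommitTransaction(commitRec : pointer_t; commitLen : unsigned_t);
    var
      commitLsn : lsn_t;
    begin
      Force(commitRec, commitLen, commitLsn)
    end;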

Because of buffering, there can be a difference between the end of the log, returned by the HighLsn call, and the LSN of the newest stable log record, returned by the HighStableLsn call.

Disk latency motivates the use of cursors for sequential log access. A recovery manager uses the CreateCursor call to initiate a sequential scan of its log records. The log manager can implement cursors by reading the log from the disk with larger buffers and can use read-ahead to improve performance.

The abstract recovery log is an infinite resource, but in practice the online disk space available for storing the log is limited. Log users can free online log space using the Truncate call. The exact function of the Truncate call depends on system policies. Some systems will spool truncated log records to tape storage for use in media recovery. Other systems will provide enough online log space for media recovery and will discard truncated log data.

Not all node recovery algorithms start from the end of the log as returned by the HighStableLsn call. Instead, many recovery algorithms need to rapidly locate the most recent checkpoint record, or some other point in the log. This starting point is obtained from a restart record that is maintained at a known location on mirrored disks using a careful replacement protocol. The log manager gets the extra job of maintaining this information, even though it is a separate abstraction from the main recovery log. The GetRestartRecord and PutRestartRecord calls are used to access the information. A restart record may be restricted to be less than one disk block in size.

2.2.2.4. Log Implementation Issues

Logs are most commonly implemented with mirrored disk storage. Each write to mirrored disk storage actually writes to two physically separate disks. Ideally, the disks have separate controllers, so that a controller failure will not corrupt both disks.

The log manager uses a low level interface to disk. Logically adjacent locations in the log "files" should be physically close together on the disk so that log performance can benefit from the sequential behavior of most accesses. The disk interface used by the log manager should not delay writes with any buffering. The reliability properties of a log are predicated on data being on disk 5 when a write returns. In Unix, this type of access is obtained by opening a character disk partition file in the /dev/ directory [University of California at Berkeley 86].

The log manager lays out the log with knowledge of the size and alignment of disk sectors, which are the unit of atomic transfer and physical integrity checking on the disk. Sector sizes are often 512 bytes. A log manager can implement multi-sector log pages with software checksums. Generally a log manager picks a small block size to reduce internal fragmentation when a block is forced before it is full, as explained below.

A log manager takes a large chunk of disk space and lays it out for the log as a circular buffer of log pages. A subset of the logical log is stored on this disk space at any time, with portions of the log written earlier either discarded or spooled to tape when they are truncated. Generally each log page is tagged with its logical address in the log. The physical position of a log page is its logical address modulo the size of the disk logging space. 6 The log manager appends new blocks sequentially to the circular buffer as the log fills. When the log manager restarts it may find the end of the log by using a binary search of the circular buffer to locate the point where a log page has a lower logical log address than the immediately preceding page, or the log manager may find the end of the log by searching forward linearly from a frequently updated hint.
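
Under this layout the mapping from logical log addresses to physical slots is a single modulus, and the linear-scan variant of the restart search can be sketched as below. LogSpacePages, ReadPageTag, and the hint value are assumptions of the sketch; the binary search over the circular buffer is the alternative mentioned above.

    const
      LogSpacePages = 8192;   { size of the disk logging space, in log pages (assumed) }

    { A log page's physical slot is its logical address modulo the size  }
    { of the disk logging space.                                         }
    function PhysicalSlot(logicalAddr : integer) : integer;
    begin
      PhysicalSlot := logicalAddr mod LogSpacePages
    end;

    { Scan forward from a frequently updated hint.  ReadPageTag (assumed) }
    { returns the logical address tag stored in a physical slot; the end  }
    { of the log is reached when the next slot does not hold the next     }
    { logical address.                                                    }
    function FindEndOfLog(hint : integer) : integer;
    var
      addr : integer;
    begin
      addr := hint;
      while ReadPageTag(PhysicalSlot(addr + 1)) = addr + 1 do
        addr := addr + 1;
      FindEndOfLog := addr
    end;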

The log manager must be able to retrieve a log record based on the record's LSN. This is easy if the log manager assigns LSNs to be the relative addresses of log records in the log. It is necessary to have a means for checking the validity of an LSN passed to the log manager when relative addresses are assigned for LSNs. One technique that also provides an integrity check on multi-page log records is to format log records in storage with their lengths at the beginning and their LSNs at the end.

5Or in a non-volatile buffer in a controller that is physically integral with the disk drive.

6This mapping gets more complicated if the size of the disk logging space changes dynamically.

Log managers are not necessarily in a separate process from other transaction system components, but they should be in separate protection domains, especially when a log is shared by multiple recovery managers and a common transaction manager. The separation of the log manager from its recovery manager clients can complicate LSN assignment when the cost of cross domain calls is significant. Ideally, the log manager assigns LSNs to be the relative addresses of log records and returns the LSNs with the results of Write calls. The recovery managers need the LSNs immediately to use for page-LSNs and to use in the log record chains they maintain for transaction aborts and other purposes. But expensive cross domain calls motivate the use of asynchronous queued Write calls, because write calls are very common and LSNs are the only results returned.

If an asynchronous Write interface is used, log clients can assign their LSNs either consecutively or using a global time source. Log managers can implement indexes mapping client supplied LSNs to physical log locations. Chapter 3 describes data structures that can be used for these indexes. One alternative is a complicated protocol to keep all clients approximately aware of overall logging rates so that clients can individually assign LSNs that closely approximate the relative byte addresses of the log records. A second alternative is for clients to assign temporary LSNs and have the log manager issue permanent LSNs as it receives the log records.

Log records are supposed to be in stable storage after they have been forced. A simplified version of the invariant property defining stable storage is that there are, under normal conditions, two failure-independent copies of each record. Careful replacement implementations of stable storage maintain this invariant by first writing one copy of data, then checking that the write succeeded by reading the copy, and then writing and checking the second copy 7 [Lampson 81]. Recovery logs are append-only and writes to different copies can go on in parallel, rather than sequentially, because there is never a possibility of corrupting valid data during a write. Parallel writes contribute to a log's performance advantages over careful replacement stable storage, but parallel writes can only be used if each log page is written only once on each pass through the circular buffer of disk log space.

7The read after write protocol is ignored by all practical implementors. Instead they assume that a successful write does not corrupt the data [Mitchell 82].

The write-once log pages that result from parallel writes to mirrored log storage can complicate multiplexing a log among multiple recovery managers. The problem is the implementation of cursors that scan forward in the log. One recovery manager's log records can be widely spaced in the log with many records belonging to other recovery managers in between. It is easy to implement backward scans by storing the LSN of a recovery manager's previous record with each new record written by the recovery manager. However, maintaining a similar LSN chain for forward scans would require re-writing log pages. Less efficient forward scans can be implemented by sequentially scanning the logging disk for the next record belonging to a recovery manager. The techniques described in Chapter 3 for indexing append-only storage are also applicable to implementing forward scans.

Write-once log pages can hurt log space utilization because a log force will frequently cause a log page to be written before it is full. This problem can be addressed by keeping separate mirrored ping-pong buffer pairs for forcing partially full pages. The first time a partially full log page must be forced it is written to one of the buffers in the ping-pong pair. The second force of the partially full page goes to the second of the paired buffers and subsequent writes ping-pong between the two buffers until the full page is written to the regular log storage. Ping-pong buffer writes can never corrupt important data (because it is always in the other buffer), so the mirrored writes can proceed in parallel. The log manager must check both ping-pong buffers when it restarts. When a log manager uses a ping-pong buffer it can use a much larger log page size (an entire disk track for example) which results in more efficient disk transfers. Multiple sets of ping-pong buffers can be used to build large blocks for efficient transfers.
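
The alternation between the two buffers can be sketched as follows. The slot-selection routines PingPongSlot and RegularLogSlot and the mirrored write WriteBothMirrors are assumed; the point of the sketch is only that successive forces of a partially full page never overwrite the copy written by the previous force.

    type
      PingPongPair = record
        next : 1..2    { buffer area that will receive the next partial-page force }
      end;

    { Force one log page.  Partially full pages alternate between the two }
    { ping-pong buffer areas, so the mirrored writes can proceed in       }
    { parallel: the previous force is still intact in the other buffer.   }
    { A full page goes to its regular slot in the circular log space.     }
    procedure ForcePage(var pair : PingPongPair; pageFull : boolean);
    begin
      if not pageFull then
      begin
        WriteBothMirrors(PingPongSlot(pair.next));   { assumed primitives }
        pair.next := 3 - pair.next                   { alternate 1 <-> 2  }
      end
      else
      begin
        WriteBothMirrors(RegularLogSlot);            { assumed primitive       }
        pair.next := 1                               { reset for the next page }
      end
    end;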

The time required to force a log record is a critical parameter for transaction processing facility performance. The force latency of a log that uses disks as its only form of non-volatile storage is determined by the average rotational delay of its disks. This is about sixteen milliseconds for many current disks. Sixteen milliseconds can be the greatest latency component for short transactions that operate on data in main memory and this limitation leads to interest in constructing logs with non-volatile or stable storage buffers that have lower latency than disk storage. Section 3.4 discusses low-latency buffer technology.

It is possible to write to a disk with lower average latency than a disk's rotational delay. Writes must be directed to the sector that is about to pass under the write head. IBM's IMS system does this by writing to a count-key-data disk with a write command that matches all of the keys on a track. A sophisticated disk driver can accomplish the same thing by measuring the elapsed time since the last write and computing the address of the next sector to pass under the write head. These techniques for fast forces require dedicating an entire disk (or disk arm) so that a single sector can be written very quickly and are likely to be made obsolete by the solid-state low-latency buffer technology described in Section 3.4.2.

Technology for low-latency forces is expensive and, even when it is used, the overhead of two I/Os for each force of a mirrored log limits system throughput. Transaction managers overcome this limit using a technique called group commit. When group commit is used, the transaction manager (or possibly the log manager) delays writing commit records until a group of transactions are waiting to commit [Gawlick 85, DeWitt 84]. This permits a full buffer of log data to be forced and amortizes the overhead of the force over several transactions. Selecting the amount of time to delay forcing a commit record is an optimization problem that trades transaction response time for throughput [Helland et al. 87].
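
A simple group commit policy can be expressed as the loop below: a force is issued when either enough commit records are waiting or the oldest has waited long enough. The queue and clock primitives and the two threshold constants are assumptions of the sketch; choosing MaxDelayMsec is exactly the response-time versus throughput tradeoff cited above.

    const
      GroupSize    = 8;     { force as soon as this many commit records are queued (assumed) }
      MaxDelayMsec = 10;    { never hold a commit record longer than this (assumed)          }

    { Assumed primitives: WaitingCommits is the number of queued commit    }
    { records, OldestWaitMsec the time the oldest has been queued, and     }
    { ForceCommitGroup writes all queued commit records with one log force.}
    procedure GroupCommitLoop;
    begin
      while true do
      begin
        if (WaitingCommits >= GroupSize) or
           ((WaitingCommits > 0) and (OldestWaitMsec >= MaxDelayMsec)) then
          ForceCommitGroup
        { a real implementation would block here instead of spinning }
      end
    end;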

A multi-threaded log manager is an alternative to implementing group commit in a transaction manager. In a multi-threaded log manager, new Force operations are allowed to execute and buffer data during a disk transfer for preceding Force operations. After a Force buffers its data it waits for disk write completion. Waiting Force operations complete when the disk writes for their log records complete. If transaction management is multi-threaded, this technique automatically builds commit groups when commits arrive at rates faster than individual log forces. System R and the Camelot local log manager have multi-threaded log managers [Lindsay 88, Thompson 88b].

When group commit is used, the performance of a log system is limited by how fast data can be streamed to the logging disk. Logging rates of very high performance transaction systems can exceed the streaming rates of conventional disks (and even I/O channels). Parallel transfer disk subsystems partially address this problem [Patterson et al. 88].

For many systems the disk space used for the recovery log is only a buffer for an infinite recovery log. The log is continually being archived to off-line tape storage. The process of archiving a log implies reading it from disk. Spooling ties up a disk arm and destroys the performance of log writes. For this reason, a high performance log system uses three disks, rather than just two, for implementing mirrored log storage. The log is written to two of the three disks at a time, with archiving being done on the free disk [Gray 78]. When the log is spooled off-line it can be compressed to reduce storage space or facilitate media recovery [Gray 78].

Chapter 3 Design Issues and Alternatives

This chapter examines in detail the issues that must be addressed in the design of a distributed logging service.

The issues discussed in this chapter follow a top-down approach to the design of a distributed log service. First, the global representation of distributed log data is addressed in Section 3.1. Second, the communication interfaces used by clients and log servers are considered in Section 3.2. Security, which is closely related to communications, is treated as a separate topic in Section 3.3. The implementation of log servers, particularly the representation of log data from multiple clients, is discussed in Section 3.4. Management of the space occupied by logs is discussed in Section 3.5. Load assignment is discussed in Section 3.6. These final issues are more independent of the preceding log service design issues. This set of topics includes all of the design issues encountered in the development of the Camelot Distributed Logging Facility. Specific implementation decisions, such as the division of functions among processes, are addressed by example in Chapter 4.

There is some interdependence among the design alternatives available for each issue. For example, the choice of representation for distributed logs controls what alternatives are appropriate to use for representing multiple logs on log servers. The descriptions of the design issues and alternatives, and the order in which they are presented reflect a suggested procedure for the design of a distributed log service that is summarized in Section 3.7.

For each design issue, a general discussion of the issue is followed by descriptions of alternative designs. The general discussion includes an indication of the alternative chosen for the Camelot DLF. Some of the alternatives presented are straw men, that is, design alternatives that are simple to implement but which cannot meet system goals. In some cases, such as log space management in Section 3.5, a set of mechanisms is described and then policy alternatives that can be constructed from the mechanisms are detailed. An analysis of the design alternatives according to various comparison criteria completes the discussion of each issue. In the analysis, the reasons for the choices made in the Camelot DLF are given. Comparison criteria are selected based on the goals for a distributed logging facility set in Section 1.2.

The logging service discussed in this chapter is based on the client/server model. Log clients (referred to simply as clients where context is clear) are the log managers in distributed transaction processing facilities. Log servers (or, simply servers) are the remote nodes that provide services that the log managers use to implement logs for their transaction facilities.

3.1. Representation of Distributed Log Data

The choice of a representation for distributed log data involves tradeoffs between reliability, availability, and performance that are unavailable in a local logging situation. Local logging implementations, as discussed in Section 2.2.2.4, are almost universally implemented using mirrored disk technology. Reliability is achieved by writing data to two disks with independent failure modes, preferably disks with separate controllers. The availability of a local logging system is limited by the availability of the local system, except that in the event of the failure of a single logging disk there is a policy question of whether to continue operation with simplex logging or to wait for repair of the failed copy (by copying the good copy to a spare disk). The first option increases availability with some loss of reliability, while the second option decreases availability while maintaining high reliability.

The representation of a distributed log achieves reliability in the same manner as a local log: each log record is written to multiple non-volatile memories that have independent failure modes. For distributed logging, there is no requirement that the non-volatile memories be disks attached to a single network server. Instead a distributed replication algorithm may be used to write data to non-volatile memories on multiple servers. Or, a hybrid scheme using multiple servers with mirrored disks can be designed. The Camelot DLF uses a distributed replication algorithm for log representation. Below, these alternatives are described in detail. Comparisons of the alternatives in terms of reliability, availability and performance are given in Section 3.1.4.

For simplicity, the algorithms presented in this section are described in terms of an implementation of the simplified abstract log interface given in Section 2.2.2.1.

3.1.1. Mirroring

A distributed log representation based on mirroring uses log servers with mirrored disks for storing log data. Conceptually a mirrored server could be implemented by moving the local log component of a transaction processing system to a remote server and communicating with it through an RPC interface. Such an approach might not work well in practice because the local log implementation would not be suitable for sharing by multiple transaction processing systems. Requiring an RPC for every log call would result in unacceptable performance.

Simplicity is the major attraction of mirrored representations for distributed logs. A minimal distributed log facility based on mirroring uses a single server node with two disks. The function required of a mirrored log server is very much like a local log that is multiplexed among multiple recovery managers. The function required of a mirrored log client is just some simple buffering and communication logic.

With mirroring, the availability of a single log server node is critical to the availability of many client transaction processing nodes. Therefore, a service must use high availability log server nodes. This is exactly the approach used by Tandem for its logging system [Borr 81, Borr 84]. Normally a Tandem log server only provides service to its local cluster (multi-processor node).

A Tandem log server uses a process-pair running on a non-shared memory multi-processor with redundant power supplies and disk controllers. The process-pair writes log data to two disks. Programming process-pairs is considerably more difficult than writing ordinary server programs. Each process in the pair receives all input messages (which are the only input allowed in the system). One process is designated the primary and it processes input messages and issues responses. The second, backup process, receives and holds all input messages but processes them only when the operating system notifies it of failure of the primary process. To allow for take over at any time, the primary process sends periodic checkpoint messages to the backup process. The Tandem operating system uses process-pairs for most services so that a Tandem node has greater availability than similar uni-processors.

3.1.2. Distributed Replication

Logs are normally replicated for reliability. However, the use of a distributed log service makes the availability of the log service a critical issue because many client systems depend on the service. Replicated distributed abstract datatypes, like those described in Section 2.2.1.2, are a technique for constructing highly available distributed services. A log service using distributed replication can achieve high availability and high reliability with a single mechanism.

Voting algorithms, such as the weighted voting algorithm for files sketched in Section 2.2.1.2, are appealing as a basis for a distributed log replication algorithm because they can provide arbitrarily high availability (and high reliability) and because tradeoffs can be made in the expense and availability of different operations. One difficulty with voting algorithms is that, like most other sophisticated distributed replication algorithms, they are designed with the assumption of an underlying transaction mechanism to guarantee that updates either occur at all copies written or that updates are undone at all copies. Naturally, an algorithm for implementing a recovery log cannot make such an assumption and must be prepared to deal with incomplete updates. An additional difficulty with a voting algorithm is that if a reasonable number of log servers (for example, six) are used and each log record is written to two log servers, then a log read operation must exchange vote messages with many servers (five in this example), making log reads expensive.

The lack of a distributed transaction facility can be overcome. First, a log is accessed by a single client process so the concurrency control facilities of a transaction facility are unnecessary. There must be an external mechanism that prevents two or more nodes from simultaneously attempting to access the same log. Transaction processing facilities are mostly bound to a single node, so this is not a problem. The same property means that a distributed replication algorithm for recovery logs automatically tolerates network partitions because the single client can be in at most one partition at a time. A log server supports the implementation of multiple clients' logs. Each log server provides service to many clients even though access to an individual log is restricted to a single client.

To overcome the lack of multi-site atomic write, a generalization of the atomic write technique originally developed for the SABER reservation system may be used [Perry and Plugge 61]. The problem addressed by the SABER system was the atomic update of a disk record when a system crash could corrupt the record. The solution used was to keep two copies of each record and to add timestamps to the records. When a record is written a new timestamp value is used and the copies are written one at a time. If the timestamps in the copies differ after a crash the copy with the higher timestamp is written onto the other copy.
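
The two-copy technique can be sketched directly: each record is stored twice with a timestamp, a writer installs a new, larger timestamp and writes the copies one at a time, and crash repair copies the higher-timestamped copy over the other when they disagree. The record layout and the ReadCopy and WriteCopy primitives are assumptions of the sketch.

    type
      TimestampedRec = record
        stamp : integer;
        data  : integer    { stands in for the record contents }
      end;

    { Assumed primitives: ReadCopy(i, rec) and WriteCopy(i, rec) read and }
    { write copy i (1 or 2) of the record in non-volatile storage.        }

    procedure TwoCopyWrite(newData : integer);
    var
      rec : TimestampedRec;
    begin
      ReadCopy(1, rec);
      rec.stamp := rec.stamp + 1;    { a new, strictly larger timestamp }
      rec.data  := newData;
      WriteCopy(1, rec);             { the copies are written one at a time }
      WriteCopy(2, rec)
    end;

    procedure TwoCopyRepairAfterCrash;
    var
      c1, c2 : TimestampedRec;
    begin
      ReadCopy(1, c1);
      ReadCopy(2, c2);
      if c1.stamp > c2.stamp then WriteCopy(2, c1);   { higher timestamp wins }
      if c2.stamp > c1.stamp then WriteCopy(1, c2)
    end;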

An approach to atomic actions like the one used by SABER can use any source of monotonically increasing unique identifiers. In Section 3.1.2.1 a replicated distributed method for generating time-ordered unique identifiers is described. These identifiers are used in the client restart procedure for the replicated log algorithm given in Section 3.1.2.2.

3.1.2.1. Replicated Time-Ordered Unique Identifiers

The version numbers used in the distributed log replication algorithm can be obtained by a unique identifier generator function called OrderedId_NewId. OrderedId_NewId returns integer identifiers such that if one identifier is greater than another then it was the result of a later invocation of OrderedId_NewId.

OrderedId_NewId uses a replicated distributed representation for the state of the unique identifier generator. A collection of some number, N, of generator state representatives is used. Each generator state representative provides non-volatile storage for an integer and exports operations to read and write the integer. Figure 3-1 gives headers for an RPC interface to generator state representatives. The serverPort parameter to generator state representative operations is an identifier for a connection to a generator state representative server. The procedure GetQuorum is used to obtain ports for a set of representatives.

Operations are atomic at individual generator state representatives. That is, a GenStateRep_Write operation either completely updates the value stored in non-volatile storage at the representative or it leaves the value unmodified. The semantics of the remote operations on representatives are at-most-once. A client performing a remote operation on a generator state representative either receives an indication that the operation completed successfully, or it receives no indication and the operation may or may not have been performed.

function GenStateRep_Read(IN serverPort : port) : integer;
{ returns the value stored by the generator state }
{ representative at port serverPort. Returns 0 if the value }
{ has not been set by a prior GenStateRep_Write call. }

procedure GenStateRep_Write(IN serverPort : port; value : integer);
{ sets the value stored in non-volatile storage by the }
{ generator state representative at port serverPort. }

Figure 3-1: Unique Identifier Generator State Representative Interface

OrderedId_NewId operates by first reading the values stored by a quorum of W = ⌊N/2⌋ + 1 generator state representatives. Then, a value one greater than the maximum value read is written to W generator state representatives. 8 After all the writes complete, the value written is returned as the result of OrderedId_NewId. A program for this operation is shown in Figure 3-2.

If every invocation of OrderedId_NewId executes completely, then a consecutive sequence of integers starting with 1 is the result of the invocations. If some invocations do not complete then the sequence returned may skip some integers. The sequence contains gaps when one execution of OrderedId_NewId fails after the first invocation of GenStateRep_Write (either because of a timeout of a GenStateRep_Write invocation or because of a crash of the process executing OrderedId_NewId) and the GenStateRep_Read operation in a subsequent invocation of OrderedId_NewId returns the value written in the failed execution. Regardless of whether integers are skipped, the sequence of values returned reflects the order of invocations. It is not possible to determine whether an arbitrary integer is a member of a sequence of identifiers generated by this method.

OrderedId_NewId should not be executed concurrently by multiple client processes. Duplicate identifiers could result from concurrent executions using the same generator state representatives. This restriction is acceptable for distributed logging because each log client will have its own generator state representatives.

⁸ Different sizes of intersecting quorums may be used for reads and writes, but the value for W shown has the highest availability.

const
  N = 5;               { number of generator state representatives }
  W = (N div 2) + 1;   { write quorum size }

type
  OrderedId = integer;

function OrderedId_NewId () : OrderedId;
var
  Quorum : array[1..N] of Port;
  MaxId  : OrderedId;
  TempId : OrderedId;
  i      : integer;
begin
  GetQuorum("GenStateRep", Quorum, W);

  MaxId := MININT;

  for i := 1 to W do begin
    TempId := GenStateRep_Read(Quorum[i]);
    if TempId > MaxId then MaxId := TempId;
  end;

  MaxId := MaxId + 1;
  for i := 1 to W do
    GenStateRep_Write(Quorum[i], MaxId);

  OrderedId_NewId := MaxId;
end;

Figure 3-2: Program for OrderedId_NewId

3.1.2.2. Replicated Log Algorithm

A replicated log is implemented using a set of N log servers, each of which stores portions of a log. Servers can, of course, store portions of multiple clients' logs, but this description is concerned with the implementation of a single log. Each record in the replicated log is stored on at least W log servers. The parameters N and W can be varied to adjust the reliability of the replicated log and the relative availability of different operations. Typical values are 2 or 3 for W and 3 to 6 for N.

The development of this algorithm starts by first considering Write and EndOfLog operations and how they are implemented. The problem of partially replicated log records is then described and a revised client restart procedure is developed to deal with this problem. Next, Read operations are considered, and the problem of locating servers storing a particular log record leads to another refinement in the restart procedure. The final algorithms are illustrated with Pascal-like programs.

A log client needs to know the highest previously assigned LSN to assign new log sequence numbers in Write operations and implement EndOfLog operations. This is done when the client process is restarted by sending messages requesting the highest LSN stored at each log server. Responses are collected from (N−W)+1 log servers so that at least one log server storing each log record will respond.

For each Write operation, the client generates a new consecutive LSN by incrementing the high LSN value for the log. The log record, together with its LSN, is then sent to W log servers. If all W log servers acknowledge the log record, then the new high LSN is returned as the result of the log write operation. If some log servers do not respond then the record is sent to other servers until W responses are gathered.
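The retry behavior just described is not shown in the Write procedure of Figure 3-11, which assumes that all W servers in the quorum respond. The Pascal-like sketch below suggests one way a client might gather W acknowledgments, moving on through the server suite after a timeout. The LogServer_WriteAck function (a variant of LogServer_Write that returns false on timeout) and the AllServers array of ports for the full server suite are assumptions made only for illustration.

{ Assumes: AllServers : array[1..N] of Port, and a variant of the }
{ LogServer_Write call that returns false when the server times out. }
function WriteToQuorum(recLsn : LSN; recEpoch : OrderedId;
                       recLength : integer; recPtr : ^LogRecord) : boolean;
var acks, next : integer;
begin
  acks := 0;
  next := 1;
  while (acks < W) & (next <= N) do begin
    if LogServer_WriteAck(AllServers[next], recLsn, recEpoch,
                          true, recLength, recPtr) then
      acks := acks + 1;   { this server acknowledged the record }
    next := next + 1      { move on to the next server in the suite }
  end;
  WriteToQuorum := (acks = W)   { false if fewer than W servers responded }
end;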

[Figure: one table per log server (Server 1, Server 2, Server 3), each with columns LSN, Epoch, and Present; the record with LSN 10 appears only on Server 3.]

Figure 3-3: Three log servers with LSN 10 Partially Replicated

A partially replicated log record may be created if a log client crashes when a new log record has been received by fewer than W log servers. This situation is illustrated in Figure 3-3, where the log record with LSN 10 is stored only on log Server 3 in a suite with N=3 and W=2. If responses to high LSN requests are received from only Servers 1 and 2 when the client process restarts then log record 10 will not be observed and the high LSN will be 9. In other cases record 10 will be available. This ambiguity would not occur if a distributed transaction system were used to implement the replicated log, because the crash of the client process would cause a transaction abort that would undo the write of LSN 10 at log Server 3.

When a partially replicated log record is created by a client node crash the log Write operation creating the record can not have completed, and hence the client transaction processing system has not been affected by the log write. For example, if the partially replicated log record is a commit record for some transaction, the transaction's commit will not have completed, its locks will not have been dropped, and other transactions will not have read data modified by the committing transaction. It would be correct to either commit or abort the transaction. Therefore, a recovery procedure can either replicate a partially replicated log record at a full write quorum or eliminate such a record from the log.

If a procedure for eliminating partially replicated log records is run during every client restart, then there can be at most one partially replicated record in a distributed log during a client restart. The partially replicated log record, if it exists, either has the high LSN determined by consulting (N−W)+1 log servers, or has an LSN one greater than the high LSN found. That is, the only partially replicated record will be the last one written, and it will either be at the observed end of the log or it will be unobserved at some log server that was not contacted during client restart. The recovery procedure executed at client restart will replicate an observed partially replicated log record by copying the record at the observed end of the log to a full write quorum, and will eliminate an unobserved partially replicated log record by writing a log record marked as "not present" with an LSN one greater than the highest observed LSN to W log servers.

The first step in the client restart procedure is to obtain a new restart epoch number. The epoch number is written with every log record at each log server and is used to distinguish between different versions of log records. Any source of time-ordered unique identifiers may be used for epoch numbers. The replicated time-ordered unique identifier generator algorithm given in Section 3.1.2.1 may be used, or a global time service may be used.

The second step in the client restart procedure is to obtain from each of (N−W)+1 log servers the highest LSN stored at the log server and the epoch of the record with that LSN. Then, the record with the globally highest LSN and highest epoch number (if there is more than one version of the record with the highest LSN) is read from one of the log servers storing it. This record, with the highest observed LSN, is potentially a partially replicated log record.

The third step in the client restart procedure is to write the record with the highest observed LSN to W log servers. This is done using the new epoch number. After this step, the record with highest observed LSN is guaranteed to be fully replicated.

The fourth step in the client restart procedure is to write a new log record with LSN one greater than the observed high LSN to W log servers. The new epoch number is used for these writes.

[Figure: one table per log server (Server 1, Server 2, Server 3), each with columns LSN, Epoch, and Present, showing the log of Figure 3-3 after the restart procedure has copied the high record and written a "not present" record to Servers 1 and 2.]

Figure 3-4: Figure 3-3 after Restart with Servers 1 and 2

This new record is marked as not present in the distributed log so that the client process will not pass it to its transaction processing system as the result of a read operation. This step compensates for a possible unobserved partially written log record, because the new record has higher epoch number and supersedes the partially written record. This step completes the client restart procedure and the new high LSN is the last one written (the not present log record). Figure 3-4 shows the result of running the restart procedure on the distributed log from Figure 3-3 with log Server 3 unavailable so that records were read from and copied to log Servers 1 and 2.

After the restart procedure completes, Write operations proceed by incrementing the high LSN and writing new records with that LSN and the current epoch number to W log servers. Any W of the N log servers may be used, and the client can switch log servers at any time.

Read operations present a special problem because writes are sent to a small number (typically W=2) of a fairly large number (N=3 to 6) of log servers. To do Read operations by simple voting could require two to five messages for each operation. However, a single remote procedure call to one of the log servers storing (the most recent version of) a desired log record will suffice if a directory giving the log servers storing each log record is kept by the client. The cached directory is feasible for a replicated log because the log is accessed only by a single client process, so there are no cache consistency problems.

Low LSN   High LSN   Epoch Number   Present   Server List
   1          2            1          yes        1,2
   3          3            3          yes        1
   4          4            3          no         1
   5          6            3          yes        1
   7          7            3          yes        1,2
   8          8            3          yes        1
   9          9            4          yes        1,2
  10         10            4          no         1,2

Figure 3-5: Directory for Distributed Log in Figure 3-4 (merges interval lists from Servers 1 and 2)

The size of a directory that maps LSNs to the sets of log servers storing log records is potentially quite large, but it is convenient for the client to facilitate compression of the directory by sending consecutive log records to the same log servers. Then, the directory can be compactly represented as a list of intervals of contiguous log sequence numbers, together with the epoch number of the interval, an indication of whether the log records are present in the distributed log or not, and the set of servers storing the log records. The directory for the distributed log of Figure 3-4 is shown in Figure 3-5.

The construction of the directory should be combined with the second step of the client restart procedure. Instead of obtaining only a high LSN from each of (N−W)+1 log servers, the client obtains interval lists from the servers and merges the lists into a single directory giving the log servers storing the most recent version of each LSN. The client updates its directory as new log records are written.

const
  N = 6;             { for example, maximum number of log servers }
  W = 2;             { for example, replication factor }
  R = (N - W) + 1;   { restart quorum size }

type
  Interval = record
    highLsn, lowLsn : LSN;
    epochNumber     : OrderedId;
    present         : boolean;
    serverCount     : integer;
    servers         : array[1..N] of ServerId
  end;

{ Interval_List arrays are sorted by decreasing highLsn }
type Interval_List = array[1..maxIntervals] of Interval;

Figure 3-6: Type Definitions for Distributed Log Replication

Figure 3-6 gives type definitions used in the replicated log implementation shown in Figures 3-7 through 3-11. An RPC interface to log servers used for distributed replication is shown in Figure 3-7. The calls in this interface are used in the implementations of the Open, Read, and Write operations. The EndOfLog, Next, and Previous operations are simple interval list searches and are not shown.

The serverPort parameter in the log server RPC interface routines is an identifier for a communication connection from a client to a log server. The interval lists maintained by the client use permanent server names, rather than temporary port identifiers. The ServerId_To_Port and Port_To_ServerId routines translate between the two types of identifiers.

Merging the interval lists is the key to the Open operation. In Figure 3-8 this is done by repeated IList_Merge calls. IList_Merge is a procedure that merges two interval lists into a new list that contains the information with the highest epoch number from each list. The details of merging the interval list information are complicated because the interval lists are a compact representation of the directory, not an easily modified representation. Logically, IList_Merge examines the information about each LSN in either list, and if one list records the LSN with a higher epoch number it is that information which goes into the result list. If both lists contain the same epoch number for some LSN, then the result list contains the union of the sets of servers from the input lists. Of course, IList_Merge does not examine LSNs individually, but instead works with intervals that can overlap in many complex ways, so the number of cases handled is large. The IList_AddRecord procedure adds information about a single record to an interval list.
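IList_AddRecord is not shown in the figures. A minimal Pascal-like sketch appears below; it assumes records are added in increasing LSN order and that the list is kept sorted by decreasing highLsn (as noted in Figure 3-6), so a new record either extends the newest interval or starts a new one. The ServerArray type and the SameServers comparison helper are assumptions for illustration, and the full IList_Merge, with its many overlap cases, is not attempted here.

type ServerArray = array[1..N] of ServerId;   { assumed helper type }

procedure IList_AddRecord(var count : integer; var list : ^Interval_List;
                          recLsn : LSN; recEpoch : OrderedId;
                          recPresent : boolean; serverCount : integer;
                          servers : ServerArray);
var i : integer;
begin
  { extend the newest interval when the record is contiguous with it and }
  { was written with the same epoch, present flag, and server set }
  if (count > 0) &
     (recLsn = list^[1].highLsn + 1) &
     (recEpoch = list^[1].epochNumber) &
     (recPresent = list^[1].present) &
     SameServers(list^[1], serverCount, servers) then
    list^[1].highLsn := recLsn
  else begin
    { otherwise start a new interval at the head of the list }
    for i := count downto 1 do
      list^[i + 1] := list^[i];
    count := count + 1;
    list^[1].lowLsn := recLsn;
    list^[1].highLsn := recLsn;
    list^[1].epochNumber := recEpoch;
    list^[1].present := recPresent;
    list^[1].serverCount := serverCount;
    for i := 1 to serverCount do
      list^[1].servers[i] := servers[i]
  end
end;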

procedure LogServer_Intervals(IN serverPort : port;
                              OUT intervalCount : integer;
                              intervals : ^Interval_List);
{ returns the directory of intervals of log records stored at }
{ the server }

procedure LogServer_Write(IN serverPort : port; recLsn : LSN;
                          recEpoch : OrderedId; recPresent : boolean;
                          recLength : integer; recPtr : ^LogRecord);
{ writes a log record to the server }

procedure LogServer_Read(IN serverPort : port; recLsn : LSN;
                         OUT recLength : integer; recPtr : ^LogRecord);
{ reads a record from the server }

Figure 3-7: Log Server Interface for Distributed Log Replication

{ Global Variables for client of distributed replicated log }
var
  IntervalCount : integer;
  Intervals     : ^Interval_List;
  HighLsn       : LSN;
  CurrentEpoch  : OrderedId;

{ Open procedure for client side of distributed replicated log }
procedure Open;
var
  ReadQuorum        : array[1..R] of Port;
  WriteQuorum       : array[1..W] of Port;
  rPort             : Port;
  tempIntervalCount : integer;
  tempIntervals     : ^Interval_List;
  recLength         : integer;
  recPtr            : ^LogRecord;
  i                 : integer;
  tempServers       : array[1..N] of ServerId;
  highPresent       : boolean;
begin
  IntervalCount := 0;
  HighLsn := 0;

  CurrentEpoch := OrderedId_NewId;

  GetQuorum("LogServer", ReadQuorum, R);

  for i := 1 to R do begin
    LogServer_Intervals(ReadQuorum[i], tempIntervalCount, tempIntervals);
    IList_Merge(tempIntervalCount, tempIntervals,
                IntervalCount, Intervals,
                IntervalCount, Intervals);
  end;

  GetQuorum("LogServer", WriteQuorum, W);

  { if there is one, copy the high log record to W servers }
  { with the new epoch number }
  if (IntervalCount <> 0) then begin
    HighLsn := Intervals^[1].highLsn;
    highPresent := Intervals^[1].present;
    if (highPresent) then begin
      rPort := ServerId_To_Port(Intervals^[1].servers[1]);
      LogServer_Read(rPort, HighLsn, recLength, recPtr)
    end
    else begin
      recLength := 0;
      recPtr := NULL
    end;

Figure 3-8: Global Variables and Open Procedure for Distributed Log Replication

    for i := 1 to W do begin
      LogServer_Write(WriteQuorum[i], HighLsn, CurrentEpoch,
                      highPresent, recLength, recPtr);
      tempServers[i] := Port_To_ServerId(WriteQuorum[i])
    end;
    IList_AddRecord(IntervalCount, Intervals, HighLsn, CurrentEpoch,
                    highPresent, W, tempServers)
  end;

  { write a "not present" log record to W servers }
  HighLsn := HighLsn + 1;

  for i := 1 to W do begin
    LogServer_Write(WriteQuorum[i], HighLsn, CurrentEpoch, FALSE, 0, NULL);
    tempServers[i] := Port_To_ServerId(WriteQuorum[i])
  end;
  IList_AddRecord(IntervalCount, Intervals, HighLsn, CurrentEpoch,
                  FALSE, W, tempServers)
end;

Figure 3-9: Figure 3-8 Continued

3.1.2.3. Formal Proof of Restart Procedure

The correctness of the distributed log replication algorithm in the face of client node failures is difficult to discern from a simple description of the algorithms. This section presents a more formal approach to the problem. The arguments presented here assume that external mechanisms guarantee that only one process accesses a distributed log at any time. Thus, concurrency conflicts and partitions are not considered.

The proof presented here is based on a model of the contents of log servers and a model of a view of a replicated log. Definitions of the log write and restart operations are presented in terms of these models. Then using these definitions, theorems are proved that show the consistency of the distributed log is maintained by these procedures.

A distributed log consists of N log servers, S1, ..., SN.

Each log server is a set of tuples ⟨l, e, p⟩. l and e are non-negative integers representing LSNs and epoch numbers, respectively; p is a boolean present flag. Values of log records are ignored in this model.

A log server is keyed by LSN and epoch number. A log server might store more than one tuple with the same LSN, but each such tuple will have a different epoch number:

for any two tuples ⟨l_i, e_i, p_i⟩ and ⟨l_j, e_j, p_j⟩ in a log server, (i ≠ j and l_i = l_j) implies e_i ≠ e_j.

procedure Read(recLsn : LSN; var recLength : integer; var recPtr : ^LogRecord);
var
  i        : integer;
  tooLow   : boolean;
  intIndex : integer;
  rPort    : Port;
begin
  { find the record in the interval lists }
  i := 1;
  intIndex := 0;
  tooLow := false;
  while ((intIndex = 0) & (not tooLow) & (i <= IntervalCount)) do begin
    if ((Intervals^[i].highLsn >= recLsn) & (Intervals^[i].lowLsn <= recLsn)) then
      intIndex := i;
    if (Intervals^[i].highLsn < recLsn) then
      tooLow := true;

    i := i + 1
  end;

  if (intIndex = 0) then begin
    { no record exists }
    recLength := 0;
    recPtr := NULL
  end
  else begin
    rPort := ServerId_To_Port(Intervals^[intIndex].servers[1]);
    LogServer_Read(rPort, recLsn, recLength, recPtr)
  end
end;

Figure 3-10: Read Procedure for Distributed Log Replication

The highest LSN appearing in any tuple in all the log servers of a distributed log is designated l_high.

A set of log servers, Q, is a read quorum if it has cardinality N−W+1, where W is the write quorum size. The contents of a read quorum, C(Q), is the union of the log servers in the read quorum.

The view of a read quorum is the set of LSN and present flag tuples from the contents of the read quorum that have higher epoch numbers than other tuples in the contents of the read quorum with the same LSNs:

{ ⟨l_i, p_i⟩ : ⟨l_i, e_i, p_i⟩ ∈ C(Q) and, for all ⟨l_j, e_j, p_j⟩ ∈ C(Q), l_i = l_j implies e_i ≥ e_j }

A view of a distributed log is the view of some read quorum for the distributed log.

procedure Write(recLength : integer; recPtr : ^LogRecord; var recLsn : LSN);
var
  WriteQuorum : array[1..W] of Port;
  i           : integer;
  tempServers : array[1..W] of ServerId;
begin
  GetQuorum("LogServer", WriteQuorum, W);

  { assign a new LSN }
  HighLsn := HighLsn + 1;
  recLsn := HighLsn;

  for i := 1 to W do begin
    LogServer_Write(WriteQuorum[i], recLsn, CurrentEpoch,
                    true, recLength, recPtr);
    tempServers[i] := Port_To_ServerId(WriteQuorum[i])
  end;
  IList_AddRecord(IntervalCount, Intervals, recLsn, CurrentEpoch,
                  true, W, tempServers)
end;

Figure 3-11: Write Procedure for Distributed Log Replication

A distributed log is consistent if it has one view, that is, the views of all of its read quorums are equal.

A complete write operation inserts the tuple ⟨l_high+1, e_w, true⟩ into W log servers. l_high+1 is one greater than the highest LSN in the distributed log, and e_w is greater than or equal to any epoch number in the distributed log. A complete write is represented by w in a regular expression describing a sequence of distributed log operations. Write operations are only permitted when the distributed log is consistent. A partial write operation is like a complete write except that the new tuple is inserted into at least one and fewer than W log servers. A partial write is represented as ŵ in a sequence of operations.

The restart procedure involves two steps, both of which are based on the tuple ⟨l_h, p_h⟩ that has the highest LSN in the view of some restart read quorum. The same restart read quorum must be used for both steps of the restart procedure.

The first, or replicate, step in the restart procedure determines the tuple ⟨l_h, p_h⟩ and appends the tuple ⟨l_h, e_r, p_h⟩ to W log servers. e_r is a new epoch number that is greater than any epoch number in the distributed log. If the contents of the read quorum are empty (C(Q_r) = ∅), the replicate step does nothing. The replicate step of the restart procedure is designated r in sequences of distributed log operations. A partial replicate step writes the tuple to at least one and fewer than W log servers. A partial replicate step is designated r̂ in sequences of log operations.

The second, or cancel, step in the restart procedure appends the tuple ⟨l_h+1, e_r, false⟩ to W log servers. The cancel step is designated c in log operation sequences and must always be preceded by a replicate step. A partial cancel step appends the tuple to at least one and fewer than W log servers. The partial cancel step is designated ĉ in log operation sequences and must always be preceded by a replicate step.

Lemma 1: An empty distributed log is consistent.

Lemma 2: If a consistent distributed log has view V, then after a complete write the distributed log is consistent with view V ∪ {⟨l_high+1, true⟩}, where l_high is the largest LSN in the distributed log before the write operation.

Proof: The new tuple appears in W log servers and thus is included in all read quorums, because for a read quorum not to contain one of the W log servers it would have to have fewer than N−W+1 members and hence could not be a read quorum. Because the log was consistent before the write, V was the view of all read quorums before the write. With the new tuple in at least one member of each read quorum, V ∪ {⟨l_high+1, true⟩} is the view of every read quorum after the complete write. Hence, the distributed log is consistent after the complete write.

Lemma 3: If a consistent distributed log has view V, then after a replicate step the distributed log is consistent and has view V.

Proof: When views of the distributed log are constructed after execution of the replicate step, the tuple ⟨l_h, e_r, p_h⟩ that was written by the replicate step will appear in every read quorum. Because e_r is greater than any other epoch number, the new tuple will be used to form the tuple ⟨l_h, p_h⟩ in the view of the distributed log. But ⟨l_h, p_h⟩ is already in the view because it is the tuple used by the replicate step to construct ⟨l_h, e_r, p_h⟩ in the first place. The replicate step only writes a tuple with LSN l_h, so other tuples in the view are not affected either.

Lemma 4: If a consistent distributed log has view V, then after a cancel step the distributed log is consistent and has view V ∪ {⟨l_high+1, false⟩}, where l_high is the high LSN in the distributed log before restart.

Proof: Analogous to the proof of Lemma 2.

Lemma 5: If a consistent distributed log has view V, then after a partial write the distributed log is inconsistent and has views V and V ∪ {⟨l_high+1, true⟩}, where l_high is the high LSN in the distributed log before the partial write.

Proof: According to the definition of a partial write, the tuple ⟨l_high+1, e_w, true⟩ was written to at least one and fewer than W log servers. Because it is written to at least one, there is some read quorum with view V ∪ {⟨l_high+1, true⟩}. Because it is written to fewer than W log servers, there remains some read quorum with view V.

Lemma 6: If a distributed log has views V and V ∪ {⟨l_h+1, true⟩} (where l_h is the highest LSN in view V), and if the restart read quorum has view V ∪ {⟨l_h+1, true⟩}, then after a replicate step the distributed log is consistent with view V ∪ {⟨l_h+1, true⟩}. If the restart read quorum has view V, then after the replicate step the distributed log is inconsistent and has views V and V ∪ {⟨l_h+1, true⟩}.

Proof: When the restart read quorum has view V ∪ {⟨l_h+1, true⟩}, the replicate step writes the tuple ⟨l_h+1, e_r, true⟩ to W log servers because ⟨l_h+1, true⟩ is the tuple with the highest LSN in the view of the restart read quorum. Because the new tuple appears in W log servers, the views of all read quorums will be V ∪ {⟨l_h+1, true⟩}.

When the restart read quorum has view V, the replicate step writes the tuple ⟨l_h, e_r, p_h⟩ to W log servers, because ⟨l_h, p_h⟩ is the tuple with the highest LSN in the view of the restart read quorum. Writing this tuple does not change the view of any read quorum, and so the log remains inconsistent with views V and V ∪ {⟨l_h+1, true⟩}.

Lemma 7: If a distributed log has views V and V ∪ {⟨l_h+1, true⟩} (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after a cancel step the distributed log is consistent and has view V ∪ {⟨l_h+1, false⟩}.

Proof: The cancel step writes the tuple ⟨l_h+1, e_r, false⟩ to W log servers, and this tuple appears in all read quorums. Read quorums that previously had view V will now have view V ∪ {⟨l_h+1, false⟩}, since there is no other tuple in the contents with LSN l_h+1. Read quorums that previously had view V ∪ {⟨l_h+1, true⟩} will now have view V ∪ {⟨l_h+1, false⟩}, because the restart epoch number e_r is greater than the epoch number in the ⟨l_h+1, e_w, true⟩ tuple in the contents of the read quorums.

Lemma 8: If a consistent distributed log has view V, then after execution of a partial replicate step the log is consistent and has view V.

Proof: When views of the distributed log are constructed after execution of the partial replicate step, the tuple ⟨l_h, e_r, p_h⟩ that was written by the partial replicate step will appear in some read quorums. Because e_r is greater than any other epoch number, the new tuple will be used to form the tuple ⟨l_h, p_h⟩ in the view of the read quorums where the tuple was written. But ⟨l_h, p_h⟩ is already in the view because it is the tuple used by the partial replicate step to construct ⟨l_h, e_r, p_h⟩ in the first place. The partial replicate step only writes a tuple with LSN l_h, so other tuples in the view are not affected either.

Lemma 9: If a consistent distributed log has view V, then after execution of a partial cancel step the distributed log is inconsistent and has views V and V ∪ {⟨l_h+1, false⟩}, where l_h is the highest LSN in view V.

Proof: Analogous to the proof of Lemma 5.

Proofs of the following lemmas are omitted. The structure of the proofs is similar to the proofs of the preceding lemmas. In each case, the outcomes of partial or complete replicate and cancel steps are derived from the state that the log was in before the step.

Lemma 10: If a distributed log has views V and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after a partial replicate step the distributed log has views V and V ∪ { }. If the restart read quorum has view V ∪ { }, then after a partial replicate step the distributed log either is consistent with view V ∪ { }, or is inconsistent with views V and V ∪ { }.

Lemma 11: If a distributed log has views V and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after a partial cancel step the distributed log either is consistent with view V ∪ { }, or is inconsistent with views V, V ∪ { }, and V ∪ { }, or is inconsistent with views V and V ∪ { }, or is inconsistent with views V ∪ { } and V ∪ { }.

Lemma 12: If a distributed log has views V, V ∪ { }, and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after a replicate step the distributed log is inconsistent and has views V, V ∪ { }, and V ∪ { }. If the restart read quorum has view V ∪ { }, then after a replicate step the distributed log is consistent and has view V ∪ { }. If the restart read quorum has view V ∪ { }, then after a replicate step the log is consistent and has view V ∪ { }.

Lemma 13: If a distributed log has views V ∪ { } and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V ∪ { }, then after a replicate step the distributed log is consistent with view V ∪ { }. If the restart read quorum has view V ∪ { }, then after a replicate step the distributed log is consistent with view V ∪ { }.

Lemma 14: If a distributed log has views V, V ∪ { }, and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after the cancel step the distributed log is consistent with view V ∪ { }.

Lemma 15: If a distributed log has views V, V ∪ { }, and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after a partial replicate step the distributed log is inconsistent and has views V, V ∪ { }, and V ∪ { }. If the restart read quorum has view V ∪ { }, then after a partial replicate step the distributed log either is consistent and has view V ∪ { }, or is inconsistent and has views V ∪ { } and V ∪ { }, or is inconsistent and has views V and V ∪ { }, or is inconsistent and has views V, V ∪ { }, and V ∪ { }. If the restart read quorum has view V ∪ { }, then after a partial replicate step the distributed log either is consistent and has view V ∪ { }, or is inconsistent and has views V ∪ { } and V ∪ { }, or is inconsistent and has views V and V ∪ { }, or is inconsistent and has views V, V ∪ { }, and V ∪ { }.

Lemma 16: If a distributed log has views V ∪ { } and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V ∪ { }, then after a partial replicate step the distributed log either is consistent and has view V ∪ { }, or is inconsistent and has views V ∪ { } and V ∪ { }. If the restart read quorum has view V ∪ { }, then after a partial replicate step the distributed log either is consistent and has view V ∪ { }, or is inconsistent and has views V ∪ { } and V ∪ { }.

Lemma 17: If a distributed log has views V, V ∪ { }, and V ∪ { } (where l_h is the highest LSN in view V), and if the restart read quorum has view V, then after a partial cancel step the distributed log either is consistent with view V ∪ { }, or is inconsistent and has views V ∪ { } and V ∪ { }, or is inconsistent and has views V and V ∪ { }, or is inconsistent and has views V, V ∪ { }, and V ∪ { }.

The lemmas presented above describe all of the views that a distributed log can have if a complete restart procedure is run when the log is first used, and then a complete restart procedure is run after any crash that might have resulted in a partial write or partial restart. This restriction on distributed log operations can be expressed succinctly with a regular expression. The lemmas can be used to prove that if the sequence of distributed log operations conforms to this restriction, the log is always consistent after complete restarts and subsequent complete writes.

Theorem 18: A distributed log will be consistent when the sequence of operations performed on it corresponds to this regular expression:

((r̂*(rĉ)*)* rc (w*ŵ + w*))* (r̂*(rĉ)*)* rc w*

Proof: Lemma 1 provides the base case for an empty log. Lemmas 2 - 4 show that the complete write and restart operations preserve the consistency of a distributed log. Lemmas 5, 8 - 10, and 15 - 17 show what views of a distributed log can

result from partial writes, partial replicate steps, and partial cancel steps in permitted sequences. Lemmas 6 - 7, and 12 - 14 show how complete restart procedures restore consistency to a distributed log that has any views that can arise from a permitted sequence of operations. The same restart read quorums must be used by the replicate and cancel steps in a complete restart procedure.

3.1.3. Hybrid Schemes

A mirrored representation for a distributed log is simpler and requires fewer messages for log writes than a representation based entirely on distributed replication. Section 3.1.4.3 describes how a log server for a mirrored representation can also be used as a central commit coordinator for distributed transactions. Central commit coordination reduces both the number of log forces and the number of messages required for the commit of a multi-site transaction. On the other hand, a log based on distributed replication can be more available for log writes than a mirrored log, because a client can use a different log server if one fails. The distributed replication algorithm can switch servers because it maintains a distributed replicated directory telling which log servers store which log records. It is feasible to design a hybrid log representation in which each log record is sent to a single log server that uses mirrored disks, but a directory telling which servers store which ranges of log records is maintained on a larger set of log directory servers. The directory servers can run on the same nodes as log servers.

In hybrid log replication the abstractions of the log directory and the log record storage are more cleanly separated than in a pure distributed replication representation. A higher degree of replication can be used for the directory than for the log record storage itself. For example, log servers might maintain two copies of each log record stored using mirrored disks while each new directory entry is written to three of five different log directory servers. The log record storage might even use distributed replication with a smaller number of copies than the log directory.

The directory for a hybrid log representation is logically similar to the directory for a representation using distributed replication. The difference is that the hybrid log's directory is a separate object, rather than information that is gathered from log servers that store it implicitly. It is too expensive to continuously send information for the most recent range of LSNs to log directory servers. Instead, the information for a range of LSNs is written when a client switches log servers. Each entry stored at a hybrid log directory representative contains the high LSN of the previous range of log records and the log server for the current range. Because multi-site atomic updates are implemented by a method similar to the one used for distributed replicated logs, each entry also contains a sequence number, present flag, and epoch number.

The directory for a hybrid log representation is implemented by a collection of N_d log directory representatives. Each new directory entry is sent to W_d directory representatives. When a client is restarted it reconstructs the contents of the directory by retrieving the contents of R_d = (N_d − W_d) + 1 log directory representatives and merging the results. As in the case of distributed log replication, the last entry in a directory might have been written to fewer than W_d directory representatives and so is copied with a new epoch number to W_d representatives. An entry with the next sequence number that is marked not present is written to W_d representatives. Then, to start logging, a client must write a new entry delimiting the last LSN it had previously logged and the node at which it is starting logging. To write this it must obtain the high LSN stored at the log server at which it was logging prior to the restart.
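For concreteness, the following Pascal-like record sketches one possible form for a hybrid log directory entry, with one field for each piece of information named above; the type and field names are illustrative assumptions rather than a specified interface.

type
  HybridDirEntry = record
    seqNumber   : integer;    { position of this entry in the directory }
    epochNumber : OrderedId;  { epoch of the client restart that wrote it }
    present     : boolean;    { false for the "not present" entry written at restart }
    prevHighLsn : LSN;        { high LSN of the previous range of log records }
    server      : ServerId    { log server storing the current range of records }
  end;

Each new entry would be written to W_d of the N_d directory representatives, and restart would merge the entries read from R_d = (N_d − W_d) + 1 representatives, exactly as described above.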

3.1.4. Comparison Criteria

Comparisons of different distributed log representations can be made with respect to reliability, availability, and performance criteria.

Reliability and availability are related concepts with precise definitions in the field of reliable systems design [Siewiorek and Swarz 82]. Availability is defined as a function of time, A(t), that is the probability that a system is available at time t. For the purposes of database and transaction processing system applications reliability can be defined as the probability that a system will not lose or corrupt data and will function correctly when it is available.

The performance of a distributed log depends on a number of factors besides its distributed representation, but the representation does affect communication costs. The choice of a distributed representation also controls whether any of three additional performance enhancements is possible for a distributed log implementation. First, a log server for a log using distributed replication can use non-volatile main memory to enhance its performance. Second, mirrored and hybrid log servers can also serve as central commit coordinators for multi-site transactions. Third, the representation controls whether LSNs can be assigned so as to permit direct access to log records, rather than searching or using an index.

The Camelot DLF uses distributed replication (Section 3.1.2) as its representation for distributed log data. The primary reason for this choice is that distributed replication, unlike mirroring or hybrid representations, achieves high availability for the distributed log service without using special high availability server hardware and programs. Use of distributed replication also provided an opportunity for further research in voted replication algorithms.

3.1.4.1. Reliability

Different criteria are available for comparing the reliability of different distributed log representations. Under a simple deterministic evaluation criterion, all representations that maintain the same number of non-volatile copies of each log record have the same reliability, because they all tolerate the same number of failures of non-volatile memory modules [McConnel and Siewiorek 82]. More sophisticated criteria are needed to evaluate differences between representations.

A suitable criterion is the reliability function, R(t), and the related mean time to failure (MTTF) measure. Reliability is defined as a function of time that is the probability that a system is functioning at time t, given that it started operation at time 0. Mean time to failure is given by:

MTTF = ∫_0^∞ R(t) dt.

A model of the reliability of distributed log representations should reflect the differences between representations in two key areas. The first is the different degrees of failure independence of the copies of log data kept by different representations. The second is different repair rates for failed copies of log data. A Markov model of distributed log reliability permits consideration of both of these differences [McConnel and Siewiorek 82, Drake 67].

Figure 3-12: Markov Reliability Model for 2 Copy Logs

Figure 3-12 shows a Markov model of a distributed log that maintains two copies of each log record. The remainder of the discussion in this section is restricted to duplex logs. Similar analysis can be applied to logs keeping more than two copies of data.

The model in Figure 3-12 has three states. State 1 is the initial state that a distributed log is in when operation starts, and there are two copies of all log records. State 2 is entered when one copy (of some⁹) of the log records fails. State 3 is the failed state; it is entered when both copies of (some of) the log data have failed. When in state 2, it is possible to repair the failed copy of the log and return to state 1.

⁹ For a distributed replicated log, state 2 is entered when the storage on one of the log servers fails, even though there may still be two copies of some of the log records.

The λ_ij are the rates at which transitions are made among the model states. Formally, λ_ij Δt is the probability that a transition occurs from state i to state j during a suitably small time period Δt, given that the system is in state i at the beginning of the time period. Different distributed log representations have different values for the λ_ij.

The λ_12 transition rate is the rate of failure of any one of the non-volatile memories in the system, when they are all operating. For a mirrored system with two disks this parameter, λ^m_12, is about twice the failure rate for the individual disks (ignoring higher order terms in the compound failure estimate), which is about one failure per 10,000 hours. The state 1 to state 2 transition rate for a distributed replicated log with three one-disk servers, λ^d_12, is about three times the single disk failure rate, or 0.0003 failures per hour.

The rate of transitions from state 1 to state 3 includes the very small probability of simultaneous non-volatile memory failures and the very real possibility of flood, fire, or sabotage destroying two or more non-volatile memories. Very often λ^m_13 > λ^d_13, because the greater physical separation of copies possible with a distributed replicated log makes it less susceptible to catastrophes than a mirrored log. In any case, simultaneous failures are less likely than single failures (λ^m_13 < λ^m_12 and λ^d_13 < λ^d_12).

The rate of failure of a second non-volatile memory once one has failed, λ_23, is much greater than the rate of occurrence of the first failure, especially for mirrored log servers. This is partially because of the annoying chance that a repair person will attempt to repair the working memory instead of the failed one. Such accidents are less likely if the memories are in different rooms. Thus λ^m_12 < λ^m_23, λ^d_12 < λ^d_23, and λ^d_23 < λ^m_23.

Repair methods and rates vary for different representations. A mirrored disk is repaired by copying the contents of the good disk to a spare, so λ^m_21 is about 10 per hour. A server for a distributed log is repaired by reading the distributed log from other servers. Thus, λ^d_21 is about 1 per hour.

Given values for the λ_ij, the probabilities of the distributed log being in state 1, state 2, or state 3 are time-varying functions: P1(t), P2(t), P3(t). For the distributed log model in Figure 3-12, these probabilities are given by the solution to the differential equations [Drake 67]:

dP1(t)/dt = P2(t) λ_21 − P1(t) λ_12 − P1(t) λ_13

dP2(t)/dt = P1(t) λ_12 − P2(t) λ_21 − P2(t) λ_23

subject to the restriction: P1(t) + P2(t) + P3(t) = 1

and the initial condition:

P1(0) = 1

The reliability of a log is the probability that it is in state 1 or 2: R(t) = P1(t) + P2(t).
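As an illustration of how the model can be evaluated, the short standalone program below integrates the two differential equations (with the constraint) using a simple Euler step to estimate R(t) and a truncated MTTF. The transition rates are placeholders in the spirit of the estimates given above, not measured values.

program DuplexLogReliability;
const
  lambda12 = 0.0002;   { rate of first copy failure, per hour (illustrative) }
  lambda13 = 0.000001; { rate of simultaneous or catastrophic failure }
  lambda21 = 1.0;      { repair rate, per hour }
  lambda23 = 0.001;    { rate of second copy failure after one has failed }
  dt       = 0.01;     { integration step, in hours }
  horizon  = 200000.0; { integrate over roughly 23 years of operation }
var
  p1, p2, dp1, dp2, t, mttf : real;
begin
  p1 := 1.0; p2 := 0.0; t := 0.0; mttf := 0.0;
  while t < horizon do begin
    dp1 := p2 * lambda21 - p1 * lambda12 - p1 * lambda13;
    dp2 := p1 * lambda12 - p2 * lambda21 - p2 * lambda23;
    p1 := p1 + dp1 * dt;
    p2 := p2 + dp2 * dt;
    mttf := mttf + (p1 + p2) * dt;   { accumulates the integral of R(t) }
    t := t + dt
  end;
  writeln('R(t) at t = ', horizon:10:0, ' hours: ', p1 + p2:10:6);
  writeln('truncated MTTF estimate (hours): ', mttf:12:1)
end.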

3.1.4.2. Availability

Comparisons of the availability of different distributed log representations must take into account the availability of different operations, since, for example, restarting the client of a log implemented with distributed replication requires more log servers than are needed for writing new records to the log.

Network availability can be ignored in comparisons of different distributed log representations because failures of the entire network are common to all three representations. Failures of communications to individual nodes are modeled as failures of the nodes. Depending upon the applications that are considered, network availability could be a critical issue when comparing the availability of a distributed log with local logging. Many distributed transaction processing applications depend on network communications anyway, so that network availability is not relevant to distributed log availability for these applications.

A mirrored representation of a distributed log has the same availability for all operations. This availability is a function of the rates of failure and repair of the log server, and for simple constant failure and repair rate models this is a constant, a_m, describing the fraction of the time the mirrored log server is available. If a log server uses redundant hardware and software, such as a Tandem multi-processor, it will typically be unavailable one hundred times less often than a server using conventional processors [Gray 86].

The availability of a log based on distributed replication is different for different operations. Write operations require W log servers, and the availability of these operations is given by Σ_{i=0}^{n−w} C(n, i) (1 − a_d)^i a_d^(n−i), where a_d is the availability of a log server for distributed replication.

Similarly, the availability of Open operations in this model is the simultaneous availability of n−w+1 or more log servers: Σ_{i=0}^{w−1} C(n, i) (1 − a_d)^i a_d^(n−i). The availability of log records for reading is 1 − (1 − a_d)^w.
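The sums above are straightforward to evaluate numerically. The standalone program below, offered only as a sketch with example parameter values, computes the Write, Open, and Read availabilities of a distributed replicated log for a given n, w, and server availability a_d.

program ReplicatedLogAvailability;
const
  n  = 4;      { number of log servers (example value) }
  w  = 2;      { write quorum size (example value) }
  ad = 0.996;  { availability of a single log server }

function Choose(m, k : integer) : real;
var i : integer; c : real;
begin
  c := 1.0;
  for i := 1 to k do
    c := c * (m - k + i) / i;
  Choose := c
end;

{ probability that at least up of the n servers are available, }
{ i.e. that no more than n - up of them are down }
function AtLeast(up : integer) : real;
var i : integer; sum : real;
begin
  sum := 0.0;
  for i := 0 to n - up do
    sum := sum + Choose(n, i) * exp(i * ln(1.0 - ad)) * exp((n - i) * ln(ad));
  AtLeast := sum
end;

begin
  writeln('Write availability (at least w servers up):        ', AtLeast(w):10:8);
  writeln('Open availability (at least n - w + 1 servers up): ', AtLeast(n - w + 1):10:8);
  writeln('Read availability, 1 - (1 - ad)^w:                 ',
          1.0 - exp(w * ln(1.0 - ad)):10:8)
end.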

A hybrid log representation permits a client to switch mirrored log servers while writing, increasing the availability of Write operations. Only one log server needs to be available while a client is using it, but to switch log servers when the current one crashes, W_d log directory servers (which presumably run on the same nodes as log servers) are needed. Therefore, the availability of continuous Write operations is given by Σ_{i=0}^{n−w_d} C(n, i) (1 − a_m)^i a_m^(n−i). Open operations require that the log server that was last written to be available, together with n−w_d other hybrid log directory servers. This availability is a_m Σ_{i=0}^{w_d−1} C(n−1, i) (1 − a_m)^i a_m^((n−1)−i).

[Figure: plot of Write availability versus number of servers (2 to 8) for hybrid logs with server availabilities 0.996 and 0.99996 and W_d = 2 and 3, and for distributed replicated logs with server availability 0.996 and W = 2 and 3.]

Figure 3-13: Write Availability of Different Distributed Logs

Figures 3-13 and 3-14 show the availabilities of different distributed and hybrid log configurations for writing and restart. Individual server availabilities of 0.996 and 0.99996 are used for the estimates in the figures. These availabilities correspond to 90 minutes of system downtime once every two weeks or once every four years, respectively. The availability models show that a distributed replicated log using servers without specially redundant hardware achieves availability equal to that of a mirrored log server that uses the special hardware when as few as three servers are used. Restart availability for distributed replicated logs remains competitive with hardened mirrored servers when as many as seven servers are used for two copy logging.

[Figure: plot of restart (Open) availability versus number of servers (2 to 8) for hybrid logs with server availabilities 0.996 and 0.99996 and W_d = 2 and 3, and for distributed replicated logs with server availability 0.996 and W = 2 and 3.]

Figure 3-14: Restart Availability of Different Distributed Logs

3.1.4.3. Performance

The representation of a distributed log dictates a number of aspects of its performance. Other details of a distributed log system can have a major impact on performance as well. The choice of a distributed or centralized (mirrored or hybrid) representation for each log record controls the amount of communication required. The representation choice also dictates some aspects of log server design, including whether it is beneficial to make the server's main memory non-volatile, and whether the log server can coordinate commits of distributed transactions. Table 3-1 summarizes performance comparisons.

Representation             Communication Cost   Non-volatile Main Memory   Commit Coordination   LSNs
Mirrored                   low                  with dual processors       yes                   addresses
Distributed Replication    high                 with single processors     no                    contiguous
Hybrid                     low                  with dual processors       yes                   addresses

Table 3-1: Distributed Log Representation Performance Comparisons

Communication Costs

In the simplest implementation the number of messages and the amount of communication bandwidth used by Write operations in a distributed replicated log is W (the number of copies written) times the communication cost of a mirrored or hybrid representation. Read operations have the same communication costs regardless of the distributed log representation, and the communication cost of an Open operation is not important, because the operation is comparatively infrequent.

Write operations are appropriate for a multicast implementation, and this would reduce the message overhead of a distributed representation. Bandwidth use will be reduced even more because acknowledgment messages are quite short. Section 3.2.4 discusses the use of multicast communications for distributed replicated logs.

Use of Non-volatile Main Memory to Improve Server Performance

Local log implementations typically use only mirrored disk storage for log data despite the latency cost for log writes that this implies.¹⁰ Section 3.4.2 discusses designs for low latency stable memories. Most such memories rely on devices that are external to a processor's main storage, so that one or more I/O operations are required to write to them. Designs for memory-mapped low latency stable memories do exist [Banatre et al. 83], but they are complex, and the ease of access that makes their latency low also makes them more susceptible than disk storage to software or undetected CPU errors.

It is comparatively easy to make a computer's main memory non-volatile. For workstations and small computers the easiest method is to provide a battery-based uninterruptible power supply that allows a few minutes of operation in the event of a power failure. Battery backed-up CMOS memory boards can also be used. Although such a memory could be corrupted by an erroneous program or processor, it is acceptable for use as one of the non-volatile memories that are replicated to create stable log storage.

The use of non-volatile main memory in a log server for a distributed replicated log enables the server to acknowledge log writes as soon as the data is received instead of waiting for a disk write. This can significantly improve response, especially for servers that do not have the extra disks needed for fast disk write methods. In addition, disk usage is improved because the server does not need to write partially filled pages.

Mirrored log servers that use non-shared memory multiprocessors (like Tandem systems) for high availability could use non-volatile main memory buffers. In such systems, incoming log data can be buffered in two processors' battery-backed main memories before it is acknowledged. The entire node implements low latency stable storage by this mechanism.

¹⁰ Section 2.2.2.4 describes special techniques that permit this latency to be reduced to one millisecond or so.

Central Commit Coordination

Typical optimized two-phase commit algorithms make n log forces and send 3(n−1) messages to commit a transaction at n nodes. If a log server is used as a central coordinator, then a single log force and 3n messages are sufficient to commit the transaction. The first set of messages from the coordinator asks participants to prepare to commit the transaction at the central coordinator. In the second set of messages participants inform the coordinator of their votes. The coordinator then forces a commit record (preceded by any unforced log records from the participants). The final set of messages informs all participants of the commit. The centralized commit protocol uses only one permanent state transition for all participants.
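As a worked example of these counts, the trivial program below prints the log forces and messages required by both protocols for a transaction with n participant nodes; n = 4 is just an example value.

program CommitCosts;
const
  n = 4;   { number of participant nodes (example value) }
begin
  { conventional optimized two-phase commit: n forces, 3(n - 1) messages }
  writeln('two-phase commit:     ', n, ' log forces, ', 3 * (n - 1), ' messages');
  { log server as central commit coordinator: 1 force, 3n messages }
  writeln('central coordination: 1 log force, ', 3 * n, ' messages')
end.

For n = 4 this gives 4 log forces and 9 messages for the conventional protocol, against 1 log force and 12 messages with central coordination: fewer and cheaper log forces at the cost of a few additional messages.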

A log server can serve as a central commit coordinator only when it is used by all participants in a transaction. The central commit coordinator allows a significant performance improvement only for mirrored or hybrid representations of a distributed log. If servers for a distributed replicated log are used for coordination, the extra messages between the log servers will offset the performance improvement possible by central coordination.

LSN Assignment

Local log implementations routinely assign LSNs to be the relative byte offsets of records in the log. This permits the log system to directly access a record on disk without searching or using an index. Distributed log implementations must do LSN assignment on the log client machines, or else every log write operation would require communication with log servers.
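The address arithmetic that byte-offset LSNs enable is simple; the Pascal-like sketch below maps an LSN directly to a disk block and byte offset. The fixed block size and the partition-origin parameter are assumptions for illustration.

const
  BlockSize = 4096;   { bytes per disk block (illustrative value) }

{ Assumes the client's log occupies a contiguous disk partition whose }
{ first block is partitionOrigin and that LSNs are relative byte offsets. }
procedure LsnToDiskAddress(lsn : LSN; partitionOrigin : integer;
                           var blockNumber, byteOffset : integer);
begin
  blockNumber := partitionOrigin + (lsn div BlockSize);
  byteOffset  := lsn mod BlockSize
end;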

Clients of distributed replicated logs must assign contiguous LSNs so that the client restart procedure can correctly deal with partially replicated log records. When clients assign contiguous LSNs, log servers must use an index or search a disk when records are read. Clients of mirrored logs can assign relative byte addresses as LSNs, and this will enable a server to directly access records (provided the server does not use a single data stream representation for log data; see Section 3.4). Clients of hybrid logs are free to negotiate with log servers to determine a new starting LSN that will map conveniently to a disk address on the server.

Client assignment of LSNs as relative byte addresses can conflict with efficient central commit coordination on mirrored and hybrid log servers. The central commit optimization is based on one pair of disk writes forcing all required log records from all clients. This is not possible if the server uses a separate disk partition for each log client's data, which is how a log server arranges for client supplied LSNs to serve as the relative byte addresses of log records. This conflict is avoided if a log server first forces log data to a low latency buffer and later writes client log data to individual disk partitions.

3.2. Client/Server Communication

Communication between client transaction processing nodes and log servers accounts for a large part of the latency of distributed log operations and much of the implementation complexity. There are a number of alternatives for the abstraction provided by the communication interface between distributed log servers and their clients. For example, the interface could resemble a byte stream, rather than the remote procedure call interface used in Section 3.1 for the exposition of different distributed log representations. This section describes various alternatives and compares them according to performance, resilience, and implementation complexity criteria.

The representations for distributed logs in Section 3.1 were presented in terms of the simple log interface given in Section 2.2.2.1. This simplified the exposition of different representations, and the examination of their correctness, reliability and availability. However, it is necessary to consider the requirements of practical log interfaces like the one given in Section 2.2.2.3 when designing the communication between clients and log servers. For example, it is important to distinguish between Write and Force operations at this level of abstraction.

Communication interfaces may be based on a variety of different network technologies. The most sophisticated network protocols support internetworks of local and long distance networks interconnected by gateways [Postel 82]. Distributed logging is practical only if it uses low latency local area networks [Clark et al. 78] or direct high speed long distance communication links. The protocols used for the client/server interface in distributed logging should be specific to the properties of these networks. This specialization does not preclude using internetwork protocol headers so that the distributed log's communication can coexist with internet communication on clients and servers.

Distributed logging is a very specialized application, and the ideal communication interface for distributed logging must also be very specialized. Among other issues, distributed logging has special end-to-end reliability requirements [Saltzer et al. 84]. Log client software must receive an acknowledgment (either explicitly or implicitly) for each log record it sends to a server. Most communication paradigms do not by themselves provide this degree of reliability. Additionally, depending on the type of transaction mix being processed, the distributed log might need either quick turnaround responses for remote operations, or efficient bulk data transfer. The ideal communication system for distributed logging combines features from a number of existing paradigms.

This section continues with descriptions of different communication paradigms that might be used for a distributed log. Stream protocols, remote procedure calls, and a new communication paradigm called the channel model are presented. Section 3.2.4 considers how multicast communication can be used in the implementation of a distributed log that uses distributed replication for its representation. This section concludes with comparisons of different communication paradigms according to performance, resilience, and complexity criteria. The Camelot DLF uses specially designed communication protocols with features similar to the channel model and parallel operations on multiple servers.

3.2.1. Stream Protocols

Protocols specialized for efficiently moving streams of data from one node to another have been the subject of considerable research and development. Examples of such protocols are the DARPA internet stream transport protocol, TCP, and the X.25 protocol [Comer 88, CCITT 84]. A distributed log interface needs to move data efficiently from one node to another and could benefit from the techniques used in stream protocols. However, not all the issues of concern to the designer of a general purpose stream protocol are relevant to a distributed log interface.

Stream protocols are typically designed for good performance and reliability in complex long-haul or internetwork environments. To achieve good performance stream protocols use complex flow control mechanisms to allocate buffers, match the speeds of senders and receivers, and permit multiple packets to be transiting a network at once. For example, TCP uses a moving window flow control mechanism. Stream protocols attempt to guarantee that all data sent is reliably delivered to the receiver in the order it was sent with no omissions, duplications, or corruption. Additionally, both ends should be notified when a stream terminates. To achieve these guarantees stream protocols use acknowledgments (which are often piggybacked on regular traffic for performance), checksums, sequence numbers, buffers for assembling packets in correct order, and special connection establishment and disconnection subprotocols. TCP uses a three way handshake to establish connections.

Not all of the features of complex stream protocols are appropriate for a distributed log. Distributed logs will use low latency networks and so there is little advantage (or even physical possibility) to having multiple packets in transit at once, except when a network controller can be transmitting or receiving one packet while it is transferring another packet to or from main memory. Local networks have very low packet loss rates and probably can not deliver packets out of order so that protocols optimized for higher error rates with automatic retransmission may have unacceptable overhead. Guaranteeing delivery to the receiving side of a stream connection does not mean that data is safely stored in non-volatile memory and therefore the reliability mechanisms of the stream protocol must be supplemented with end-to-end acknowledgments.

Despite some superfluous features, stream protocols do enable a single packet to serve as a flow control acknowledgment for multiple data packets, reducing the total number of packets that are exchanged to move a large amount of log data from a client to a server. This is important for performance, particularly for a highly loaded server when there is a moderate processing cost for each message. The flow control features of a stream protocol allocate a server's message buffer space among multiple clients.

Figure 3-15: Stream Protocol Interface for Log Writes. The figure shows a client-to-server data stream and a server-to-client data stream, with command codes a (acknowledge force), w (write record, unforced), and f (force record).

Stream protocols provide a transport layer, but for implementing a distributed log it is necessary to layer a logging interface on the byte stream. A means of representing log data, server operation requests, and responses is needed. For example, the sample stream interface for log writes shown in Figure 3-15 represents individual operations in the data stream by a single character identifying the operation, followed by a length for the remainder of the parameters for the operation. Operations for writing and forcing log records are defined in the stream from log clients to servers. A stream from server to client is used for acknowledgments. Programming such a byte stream based interface would be tedious and error prone at best. To implement log forces, the stream protocol implementation must include some way to flush buffers and force a message to be sent. This interface does not show the application level messages that are needed to establish connections. Application level connection establishment is necessary because a client, a server, or the byte stream connection could fail, leaving the log client uncertain about what log records are stored by a server.
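As a rough illustration of the layering this requires, the following C sketch frames a single write or force operation onto the client-to-server byte stream in the style of Figure 3-15. The send_bytes routine, the header layout, and the error conventions are assumptions made only for this sketch, not part of any interface described in this chapter.

#include <stdint.h>
#include <string.h>

/* Hypothetical framing for the stream interface of Figure 3-15: a
 * one-character command code ('w' = write unforced, 'f' = force),
 * a four-byte length, then the log record itself.  send_bytes()
 * stands in for whatever reliable byte stream primitive is used
 * (a TCP write, for example) and is assumed to exist elsewhere. */

extern int send_bytes(int stream, const void *buf, size_t len);

int stream_log_send(int stream, char cmd, const void *record, uint32_t length)
{
    char header[5];

    header[0] = cmd;                    /* operation code */
    memcpy(&header[1], &length, 4);     /* length of remaining parameters */

    if (send_bytes(stream, header, sizeof(header)) < 0)
        return -1;
    return send_bytes(stream, record, length);
}

Even this much omits the acknowledgment stream from server to client and the application level connection establishment discussed above, both of which the implementor must also program by hand.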

3.2.2. RPC Protocols

Remote procedure calls are a distributed programming construct in which a procedure call executes on one node and the body of the procedure executes on a different node [Birrell and Nelson 84, Nelson 81]. These simple semantics make RPCs convenient for structuring distributed computations, which is why they are used in Section 3.1 to describe distributed log representations.

RPCs can be implemented on top of stream protocols, but more commonly they are built directly on low-level network (or internetwork) datagrams. When a RPC is called, a local stub procedure (generated by a RPC stub-compiler) collects arguments and packs them with appropriate header information into a single (usually) network packet. This message is sent to the server host, where a RPC runtime service matches the message with the correct server process and invokes a server side stub procedure that unpacks the message and calls the desired procedure in the server. On return from the remote procedure, the server stub packs results into a message and sends them back to the client.

When everything works well, a RPC uses only two network packets. The stub-compiler generated code handles the tedious details of packing arguments into messages, and the RPC runtime procedures deliver the messages to their destinations without network induced errors or duplication. RPCs require additional packets when arguments or results do not fit into a single network packet, when network errors occur, or when responses are delayed (perhaps simply because a procedure takes a long time to execute). RPCs do not work well with complex structures11 or reference arguments, and their failure semantics can be quite different from those of ordinary procedure calls.

Because of their simplicity, RPCs often have very good performance for executing synchronous procedures remotely, especially when arguments and results fit in single network packets. When arguments are larger than a single packet, the RPC's transport protocol might use a simple, one-packet-at-a-time, stop-and-wait acknowledgment protocol to transfer data [Birrell and Nelson 84], although RPC transport protocols with more efficient flow control exist [Cheriton 87]. Regardless of whether multi-packet RPCs are transported efficiently, the basic RPC paradigm blocks the calling process until the call has completed. For distributed logging this means that a client can not stream unforced log writes to a server while continuing to generate log data (the related problem of sending data in parallel to multiple servers for distributed replicated logs is discussed in Section 3.2.4). A client can circumvent this problem by using a separate process to wait for the results of the RPC, or the client can use one-way messages generated by a RPC stub-compiler like Matchmaker [Jones et al. 85] (one-way messages are declared as "simple procedures" in Matchmaker syntax). In either case, the distributed log client must be aware of the network packet size to use asynchronous messages efficiently. Some RPC stub-compilers do not efficiently support a variable number of variable length arguments, and distributed log clients must be manually programmed to pack and unpack portions of messages if efficient bulk data transfer is attempted with RPCs.

Figure 3-16 shows a log writing interface using one-way RPCs.

11The problem of transmitting abstract values in messages has been studied by Herlihy [Herlihy and Liskov 82], but few RPC systems implement Herlihy's mechanisms in their full generality, in part because encoding a single value could require many procedure calls.

{ Synchronous call from Client to Server to force log records }
Routine RPCLog_Force(IN serverPort: port;
                        lowLsn, highLsn: LSN;
                        recLengths: ^array[] of integer;
                        recs: ^array[] of char;
                     OUT missingLowLSN, missingHighLSN: LSN;
                        newAllocation: integer);

{ Asynchronous call from Client to Server to spool log records }
Simple Procedure RPCLog_Write(IN serverPort: port;
                                 lowLsn, highLsn: LSN;
                                 recLength: ^array[] of integer;
                                 recs: ^array[] of char);

{ Asynchronous call from Server to Client to acknowledge log }
{ records and give new flow control allocation }
Simple Procedure RPCLog_Ack(IN serverPort: port;
                               highAckedLsn: LSN;
                               newAllocation: integer);

{ Asynchronous call from Server to inform Client of missing }
{ log records }
Simple Procedure RPCLog_Missing(IN serverPort: port;
                                   missingLowLSN, missingHighLSN: LSN);

Figure 3-16: RPC Interface for Log Writes

Unforced log writes are collected and sent in one-way RPCLog_Write RPCs when an approximately full packet of data is available. Log forces are sent, together with any buffered writes, in two-way RPCLog_Force RPCs. When a rare log record exceeds a single data packet, it is sent by itself either as a RPCLog_Write or a RPCLog_Force, and the RPC transport protocol must deliver it in total. Because the client could potentially overrun the server with one-way RPCs, a moving window flow control is used, and new allocations are included in the results from each RPCLog_Force RPC, as well as in one-way RPCLog_Ack messages from server to client. A one-way RPC could be lost, but the server detects this when a RPC is received with out of sequence LSNs and it notifies the client with a RPCLog_Missing RPC.
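The client side of this interface amounts to filling a packet-sized buffer and shipping it through the one-way stub. The C sketch below illustrates that bookkeeping; the rpclog_write routine stands in for the stub-compiler generated RPCLog_Write stub of Figure 3-16, and the buffer sizes, type names, and field names are invented for illustration only.

#include <stddef.h>
#include <string.h>

#define PACKET_PAYLOAD 1400          /* assumed usable bytes per network packet */
#define MAX_BATCH_RECS 64            /* assumed limit on records per batch */

typedef unsigned long lsn_t;

struct log_batch {
    char   data[PACKET_PAYLOAD];
    int    lengths[MAX_BATCH_RECS];
    int    nrecs;
    size_t nbytes;
    lsn_t  low_lsn, high_lsn;
};

/* Placeholder for the stub-compiler generated one-way RPCLog_Write stub. */
extern void rpclog_write(lsn_t low, lsn_t high, const int *lengths,
                         int nrecs, const char *data, size_t nbytes);

/* Buffer an unforced record; ship the batch as a one-way RPC when an
 * approximately full packet of data has accumulated.  A real client
 * would also check the flow control window (replenished by RPCLog_Ack)
 * before sending, and would send a record larger than one packet by
 * itself; both are omitted here. */
void log_write(struct log_batch *b, lsn_t lsn, const char *rec, size_t len)
{
    if (b->nrecs == MAX_BATCH_RECS || b->nbytes + len > sizeof b->data) {
        rpclog_write(b->low_lsn, b->high_lsn,
                     b->lengths, b->nrecs, b->data, b->nbytes);
        b->nrecs = 0;
        b->nbytes = 0;
    }
    if (b->nrecs == 0)
        b->low_lsn = lsn;
    memcpy(b->data + b->nbytes, rec, len);
    b->lengths[b->nrecs++] = (int)len;
    b->nbytes += len;
    b->high_lsn = lsn;
}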

3.2.3. The Channel Model and LU 6.2

Gifford and Glasser proposed the channel model of distributed computation to address the bulk data transfer limitations of RPCs [Gifford and Glasser 88]. LU (for Logical Unit) 6.2 is a protocol suite that is part of IBM's System Network Architecture [IBM Corporation 79]. LU 6.2 is designed for communication between application programs and has attributes that make it appealing as the transport protocol for the channel model.

The channel model incorporates two kinds of remote operations: ordinary synchronous remote procedure calls that block the calling process until results are returned, and asynchronous pipe

calls that do not return results and do not block the caller. Pipe calls can be buffered by the underlying transport system to improve network packet utilization and throughput. Ordinarily, calls on different pipes implemented by the same server are not ordered and can execute in parallel with other pipe calls and procedure calls by the same client process. To provide more control over the ordering of different calls, remote pipes and procedures that are implemented by the same server can be combined in channel groups that enforce ordering of calls made by the same client process. Channel groups are created dynamically with a Group operator, or in an interface specification using a Sequence primitive.

{ Channel model interface to log server for writing }
Remote Interface ChannelLogServer;

    { Force log records }
    Procedure ChannelLog_Force(recLsn: LSN;
                               recLength: integer;
                               recPtr: ^LogRecord);

    { Spool log records }
    Pipe ChannelLog_Write(recLsn: LSN;
                          recLength: integer;
                          recPtr: ^LogRecord);

    Sequence ChannelLog_Write, ChannelLog_Force;

End ChannelLogServer.

Figure 3-17: Channel Interface for Log Writes

The channel model is well suited to the needs of a distributed log interface because pipe calls can be used to efficiently buffer unforced log writes and send them in large packets, while fast RPCs are used for log forces. For example, a log server can export a channel group consisting of a ChannelLog_Write pipe and a ChannelLog_Force call, each of which takes a single log record as an argument, as shown in Figure 3-17. The log client can implement Write operations with calls on the server's ChannelLog_Write pipe and Force operations with the ChannelLog_Force RPC. The stub-compiler generated local procedures handle the details of packing call arguments into messages, which are buffered for higher throughput if they are pipe calls. The channel abstraction relieves the distributed log implementor of most of the details of message building, buffering, and reliable transmission.

LU 6.2 provides message oriented communication that would make a good transport service for an implementation of the channel model. Request/reply half duplex communication is the basic abstraction provided by LU 6.2. The operations provided by LU 6.2 (verbs in LU 6.2 terminology) permit multiple messages to be sent and buffered for improved throughput, or a message may be sent immediately with a response expected. This exactly models the communication pattern for remote pipe and procedure calls. Sequencing, flow control, low level acknowledgments, reliable delivery, and failure reporting are all handled by LU 6.2. It should be simple to implement the channel model on top of LU 6.2 by providing a binding service and stub-compiler to automate message packing.

3.2.4. Parallel Communications and Multicast Protocols

The distributed replication representation for a distributed log requires roughly twice as much communication and server processing for log writes as the mirrored and hybrid representations. Latency would be prohibitive if the server processing could not go on in parallel at separate replicas. Ideally, the physical multicast facilities of local area networks [Digital Equipment Corporation 80] can be exploited to permit communications to different servers to proceed in parallel as well.

Efficient logging to multiple servers in parallel is complicated. Using a separate (preferably lightweight) process for each server is an obvious solution, but this incurs extra process dispatch costs for each message sent to a server. Parallel RPC primitives could be used for parallel communications in a distributed replicated log, although the problems of buffering and blocking inherent in a RPC mechanism are still incurred. A parallel RPC mechanism like Satyanarayanan's MultiRPC [Satyanarayanan and Siegel 89] that implements reliable multicast of RPCs is preferred over mechanisms like the V Kernel Group IPC [Cheriton and Zwaenepoel 85] that provides for receipt of only one response. If only non-blocking communication primitives are used, a single client process can manage connections with multiple servers.
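As an illustration of how a single client process can manage parallel connections to several replicas with only non-blocking primitives, the C sketch below sends one force request to every log server and collects all of the acknowledgments with select(). The use of connected datagram sockets, the buffer sizes, and the absence of timeouts are simplifying assumptions; a real client would time out and treat a silent replica as failed.

#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>

#define MAX_SERVERS 8

/* Send the same force request to every replica and wait until each one
 * has acknowledged, using one client process and select() rather than a
 * process per server.  fds[] is assumed to hold datagram sockets already
 * connected to the log servers. */
int force_all_replicas(int fds[], int nservers, const void *req, size_t reqlen)
{
    char ack[64];
    int  done[MAX_SERVERS] = {0};
    int  acked = 0, i;

    for (i = 0; i < nservers; i++)
        if (send(fds[i], req, reqlen, 0) < 0)
            return -1;

    while (acked < nservers) {
        fd_set readset;
        int maxfd = -1;

        FD_ZERO(&readset);
        for (i = 0; i < nservers; i++)
            if (!done[i]) {
                FD_SET(fds[i], &readset);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }
        if (select(maxfd + 1, &readset, NULL, NULL, NULL) < 0)
            return -1;              /* a real client would retry or time out */
        for (i = 0; i < nservers; i++)
            if (!done[i] && FD_ISSET(fds[i], &readset) &&
                recv(fds[i], ack, sizeof(ack), 0) > 0) {
                done[i] = 1;        /* a real client would also check the LSN */
                acked++;
            }
    }
    return 0;
}

With physical multicast the loop that sends the request could be replaced by a single transmission, but the acknowledgment collection would remain per-server, which is exactly the part that mechanisms such as the V Kernel Group IPC do not provide.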

Local network interfaces implement physical multicast addressing by providing special address registers that may be dynamically loaded with addresses chosen from a space of reserved multicast addresses. The interfaces receive messages with destination addresses that match any of their address registers. Even without special hardware, multicast can be implemented in the lowest level network driver, thus reducing the software overhead for sending the same message to multiple nodes. Special protocols are needed to assign multicast addresses and map the addresses used by higher level protocol layers to physical multicast addresses. A multicast addressing scheme has been proposed for the IP/TCP protocol hierarchy [Deering 85]. The VMTP protocol, intended as a transport protocol for RPCs, has provisions for using physical multicast [Cheriton 87]. However, VMTP does not provide a reliable multicast that collects acknowledgments from all recipients.

It is difficult to imagine a high-level communication model that fully supports the very specialized needs of a replicated distributed log representation. A version of the channel model that supported reliable physical multicast would not be sufficient by itself because a distributed log replication client must be able to detect and recover from server failures. Server failure handling requires a client to know what log records have been acknowledged by a server and what records have been sent but not acknowledged. It is not possible to hide from the log client the fact that communication involves two (or more) separate remote nodes and that connections with those nodes can fail independently.

3.2.5. Comparison Criteria

The performance of the distributed log communication interface is among the more important criteria to be used in comparing different alternatives. A load model is needed to evaluate the alternatives. In the case of a distributed log, the load model shows that both the round trip time for Force operations and the streaming rates achievable for long transactions and sequential reads are relevant measures of communications performance. The resilience of the communication protocols, and the effort required to implement the interfaces, are also important comparison criteria. It is difficult to make definitive comparisons between different communication paradigms. Evidence about performance or resilience is often anecdotal, and even where there is published data (such as for Birrell's RPC implementation [Birrell and Nelson 84]) there are no guarantees that a different implementation will have similar performance. Comparisons must focus on the claimed or intended properties of different communication paradigms. Table 3-2 summarizes the communication paradigm comparisons.

Communication          Force     Stream    Protocol    Log Interface
Paradigm               Time      Rate      Resilience  Complexity     Suitability

Stream Protocols       Slow      Fast      Good        Moderate       Low
RPC Protocols          Fast      Slow      Fair        Low            Low
Channel Model/LU 6.2   Moderate  Moderate  Very Good   High           High

Table 3-2: Communication Comparison Summary

None of the high level communication paradigms described above is perfectly suited to the implementation of a log service based on distributed replication. RPC and multicast RPC protocols do not provide the data streaming that is desirable when long update transactions are executed. Conventional RPC protocols, stream protocols, and the channel model do not support the parallel operations on multiple servers that are essential for good performance with distributed replication. Because of these drawbacks, Camelot uses specially designed message protocols for communication between clients and log servers. The Camelot log communication interface is described in detail in Section 4.2.

3.2.5.1. Load Models

The usage pattern of a client transaction processing system determines what aspects of client-server communications most affect performance. Log usage varies depending on the applications processed by a transaction system and also during different phases of operation.

When a transaction processing system restarts, it first executes node or media recovery. There are several different recovery algorithms, but they all read log records sequentially by LSN (in either increasing or decreasing order, or both) first. The amount of log read sequentially will correspond to between a few minutes and many hours of normal transaction processing. At some

point during recovery random accesses to the log might be made.12 The log interface should include hints (or the client should infer) that sequential reads are being undertaken and inform the log server so that the server can invoke efficient stream communication to send multiple log records in each network packet. Some recovery algorithms write log records during recovery, but forces are rare. Streamed reading is the majority of the communication during recovery.

During normal operation, there are three different kinds of logger interactions that are relevant to communication performance. The first is the pattern of logging observed when short update transactions such as debit/credit are running. These transactions log a small (less than one thousand bytes) amount of data and force it very soon. The round trip time to send the log data with a force request and receive the response is the critical measure for this load component. The second kind of log interaction occurs when a long running update transaction executes or when group commit is being used for multiple short transactions. Such a transaction generates a large amount (possibly many kilobytes) of log data but forces it very rarely. For these types of transactions, the rate at which log records can be streamed to the server is the important measure. The final interaction occurs when a transaction aborts. In this case the recovery manager follows a back-chain of log records written by the transaction. For each log record written by the transaction a request for the log record is sent to a server and a response received back.13 Unlike node recovery there is no streaming of log reads from server to client. The transaction abort component of normal processing appears similar to short transactions to the communication system.

3.2.5.2. Force and Random Read Times

The critical performance measure for short transactions such as debit/credit is the time required to execute a Force operation. The communication required by these operations is sending a single packet message (typically about one thousand bytes) to one (or more, depending on the representation being used) log server and synchronously receiving a short reply packet from the server. This pattern exactly fits the RPC communication paradigm, though the channel model should approach the performance of RPCs for these communications.

RPCs are very suitable for this sort of communication because they combine high level aspects of communication in one mechanism. Acknowledgments and flow control for all protocol levels

12There will often be a period during node recovery when only a small portion of the log records read are processed to any significant extent. For example, only log records referring to some small number of transactions might be of interest. This suggests that the log server could filter the records read for the client as discussed in Section 6.2.2.

13A log client could cache recently written log data in anticipation of transaction aborts, but the amount of data that must be cached to accommodate a reasonable fraction of aborts might require a significant amount of memory that could be better used for other purposes (for example as database buffers). Consider a transaction system that generates 50 kilobytes of log data per second. If most transactions are aborted because of timeouts during commit processing or because of periodic (or slow) deadlock detection, then an average aborted transaction might be 20 seconds old and a megabyte of log buffer would be needed to avoid communicating with a log server.

are contained in a single request and response packet. RPC transport protocols can be tuned for this sort of interaction and will deliver very good performance. RPCs are easily adapted to use physical multicast facilities in support of distributed replicated logs. In Table 3-2 RPCs are shown as having the fastest force times and stream interfaces the slowest. Channel model/LU 6.2 communication promises performance approaching RPCs, but there is less evidence that this performance can be realized in practice.

3.2.5.3. Streaming Rates

Long running transactions, and the sequential Log_Read operations that take place during node recovery, are best served by communication interfaces that permit data to be efficiently streamed from client to server (or vice versa). This communication fits the stream protocol communication paradigm.

Stream-based interfaces achieve good performance for bulk data transfer because they better utilize each packet transferred. First, the network is better utilized because the interface, rather than the client programmer, is concerned with filling each network packet. The programmer does not need to know the optimum number of log records to include in a RPC, for example. Second, stream protocols can allocate and acknowledge multiple data packets at once. This reduces the total number of packets that must be exchanged to move a given amount of data. In long haul networks, the more sophisticated flow control facilities of stream protocols have the advantage of better utilizing network buffers, but this is not as important a consideration in a local network. Table 3-2 shows that stream protocol interfaces have the highest log stream transfer rates and RPC interfaces the lowest. An implementation of the channel model on LU 6.2 might fall somewhere in between.

3.2.5.4. Resilience

The performance of a communication interface is the most significant comparison criterion, but the resilience of the protocol to transient network failures and to the effects of high loads is also important. A key issue is the level of abstraction at which error handling is done.

The end-to-end argument asserts that a client of a distributed logging facility must use end-to- end mechanisms to guarantee that log data sent to a server is reliably received and stored. The distributed log representation algorithm and the communication interface that implements it must observe this requirement. However, the degree to which a communication facility automatically recovers from errors can contribute to or detract from the performance of distributed logging.

Local area networks are supposed to have error rates that are much lower than those of long haul networks, and in addition, the cost of retransmission is lower on a local network. Hence, resilience mechanisms that contribute to performance in a long haul network environment, like low level retransmission, may actually impose overheads (for buffering, for example) that reduce

performance in a local network environment. The good performance of some RPC implementations is partially attributed to a lack of complex low level error recovery mechanisms [Birrell and Nelson 84]. The SNA protocol suite that LU 6.2 is based on is known for complex and expensive error recovery mechanisms; however, given the greater degree of parallelism inherent in the channel model, a greater degree of resilience is justified. Byte stream protocols like TCP probably fall somewhere between datagram based RPC protocols and SNA in terms of resilience.

As server loads increase, timeout events unrelated to network failures may become more frequent. Timeout driven retransmissions can exacerbate server loads. In consideration of this, a protocol with some sort of load control is probably important. Any communication interface using fixed interval timeouts or keep alive messages, as some RPCs do, will have problems with highly loaded servers.

3.2.5.5. Complexity and Suitability

The communications interface can be a big component of a distributed log implementation, particularly if a sophisticated protocol must be implemented. Debugging and tuning protocol implementations is notoriously difficult. These considerations argue for using a simple protocol, or for using a standard one that has an existing implementation. There is a tradeoff between high level protocols, like the channel model/LU 6.2, that have complex implementations but are well suited to the requirements of a distributed log interface, and lower level protocols, like streams, that have simpler implementations but are not as well suited to distributed logging.

The availability of communication interfaces varies with the environment. Stream protocols are widely available, thanks in part to well defined standards. RPC systems are increasingly common, and RPCs are easy to implement on top of a simple datagram interface. For a single application like distributed logging, the implementor can use the RPC paradigm for communications but skip the task of implementing a stub compiler; instead, stubs can be hand coded. The channel model is an experimental proposal and not widely available, but the communication facilities of LU 6.2 (upon which channel model communication can easily be constructed) are available on many commercial systems.

3.3. Security

In a perfect world, security might be considered in Section 3.2, along with other communication issues, since the most general security mechanisms are integrated with communication. The reality is that security is often an afterthought and security mechanisms are added to a system after its initial design and implementation. The Camelot DLF makes no provisions for security. This separate discussion reflects the status of security as a secondary issue and also indicates that there are circumstances when security mechanisms do not need to be integrated with communication.

The information in a transaction recovery log is valuable to an enterprise, because it is possible to reconstruct an enterprise's database from information in the log, or to prevent database recovery by denying access to the log. Security policies, which are implemented with various mechanisms, protect the data in distributed logs from unauthorized access or intentional corruption. Some classes of mechanisms support security policies by authenticating log servers and clients, while other mechanisms prevent improper access.

3.3.1. Alternative Mechanisms

Below, four security mechanisms that can be used in combination to fulfill various security requirements are described. Some mechanisms, like physical security for the log servers and some form of authentication, are essential in any environment. It may be difficult to base security entirely on physical (and related social) mechanisms, and therefore cryptographic security mechanisms may also be necessary to protect log data from unauthorized release or access.

3.3.1.1. Authentication Mechanisms

The simplest authentication mechanism is client and server identifiers in messages. However, it is trivial to forge such identifiers when they are publicly known, unless they are inserted by hardware in a physically secure environment.

Secret identifiers, or passwords, that are known only to clients and servers provide a little additional protection. Passwords must be exchanged using some trusted means outside of regular communication channels. Passwords provide only a small amount of additional security over known identifiers, because it is easy for an intruder to steal a password by copying it from a message.

The problem of stolen or forged authenticators is solved by the use of encrypted authentication tokens distributed by a trusted authentication server [Needham and Schroeder 78, Steiner 88]. Protocols for establishing authenticated communications between two nodes were originally described by Needham and Schroeder. They described different protocols for authentication using either conventional single key encryption or public key encryption [Rivest, Shamir, and Adleman 78]. The protocols use trusted authentication servers to distribute conversation keys or public keys, depending upon what type of encryption is being used. The protocols then use a three way handshake to establish the time-authenticity of the conversation.

3.3.1.2. Physical Security

If physical access to log servers, clients, and the network can be restricted to authorized users, and if social pressures and training are adequate to ensure acceptable user behavior, then no other security mechanisms are needed.

Physical security of log servers is essential, independent of any other security mechanisms. The physical media storing log data must be protected from intentional or accidental theft or destruction. Physical security for computer systems is nothing new. It should be easy to provide physical and environmental security for log servers since they will typically be fairly small, air- cooled machines.

It is possible to provide physical security for a local area network and the client machines attached to it. Local networks are physically contained within rooms, buildings, or sites that may already be subject to access restrictions. These restrictions may be sufficient to guarantee that only authorized connections are made to the local network being used for distributed logging. A local network may be connected to an internetwork by a gateway, but this need not limit the effectiveness of physical network security since gateways can be programmed to restrict the destinations of incoming packets. Secure gateways effectively limit access from outside the local network.

If local network access is limited to authorized machines, then the only way to compromise a distributed log is to circumvent the hardware and software protection mechanisms on a machine attached to the network. There are limits to the attacks that can be undertaken without detection, and the potential for detection and subsequent sanctions may discourage more serious attacks. It might be possible to place a workstation's network interface in a promiscuous mode so that it receives all network traffic. Other users' log data could be read, but not modified, in this way (end-to-end encryption prevents the exploitation of any data read). Attempts to circumvent authentication mechanisms might be detected, for example by the multiple messages that result when a real node responds in addition to the impostor. Even though serious damage could be done by such attacks (for example, logs could be truncated or corrupted), the prospect of detection and subsequent sanctions may be sufficient to discourage such behavior within many organizations.

3.3.1.3. End-to-End Encryption

The contents of a recovery log can be protected from unauthorized disclosure if a client encrypts every record before it is transmitted over a network. The client must follow sound cryptographic techniques to secure the data. Managing keys is also the client's responsibility.

For end-to-end encryption, a log client uses a cipher known only to it and encrypts every log record it sends to log servers. The log servers can not decrypt the log data and therefore store it encrypted. Each log record must be encrypted individually, rather than as part of a stream of log records, so that it may easily be decrypted when it is read. If the Data Encryption Standard [National Bureau of Standards 77] is used for encrypting log records, the key actually used to encrypt a log record should be a function of the record's LSN and a private key for the log client. Deriving a different key for each LSN makes it harder for an attacker to guess the contents of each record [Voydock and Kent 83]. The function deriving the key could be the DES encryption of the LSN using the log client's primary key.
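A minimal sketch of that key derivation appears below. The des_encrypt_block routine is a placeholder for whatever DES implementation (hardware or software) the client has available, and all of the names and sizes are assumptions made for illustration only.

#include <stdint.h>
#include <string.h>

/* Placeholder for a DES implementation: encrypt one 8-byte block of
 * input under the given 8-byte key, leaving the result in out. */
extern void des_encrypt_block(const uint8_t key[8],
                              const uint8_t in[8], uint8_t out[8]);

/* Per-record key derivation as described above: the key used for a
 * record is the DES encryption of its LSN under the client's primary
 * key.  The record itself is then encrypted with record_key before it
 * is handed to the communication layer, and decrypted with the same
 * derived key when it is read back. */
void derive_record_key(const uint8_t primary_key[8],
                       uint64_t lsn, uint8_t record_key[8])
{
    uint8_t block[8];

    memcpy(block, &lsn, sizeof block);   /* treat the 64-bit LSN as a DES block */
    des_encrypt_block(primary_key, block, record_key);
}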

If the only encryption used for distributed logging is end-to-end, then much information is sent in the clear. For example, LSNs, log record lengths, operations on servers, and responses are all transmitted without encryption. This makes the information in the log susceptible to traffic analysis. Such attacks can be partially circumvented by writing extra random log records and by randomly padding other log records.

3.3.1.4. Encryption-Based Protocols

Encryption can be used in protocols to authenticate and protect network communications [Needham and Schroeder 78, Birrell 85, Voydock and Kent 83]. These protocols protect against more general threat models in which networks are not secure and an intruder may record, delay, copy, modify, or replay any message. The drawback of such protocols is that they require connection setup and state overhead.

Encryption-based protocols use special connection establishment protocols involving trusted authentication servers to create secure communication connections (Section 3.3.1.1). Once authenticated communication is established, the nodes use a variety of techniques to prevent or detect attempts to discover message contents, analyze message traffic, modify the message stream, or deny communications service. Techniques for dealing with these attacks for connection oriented communication were surveyed by Voydock and Kent [Voydock and Kent 83].

Birrell has described how secure authenticated RPCs can be implemented with a minimum of overhead for establishing or maintaining connection state [Birrell 85]. This technique uses authentication servers and handshakes for establishing connections just as conversational protocols do (an RPC is actually piggybacked on the first message of the handshake). Servers are free to discard the connection state when a connection is idle for longer than the latest possible retransmission.

3.3.2. Policies and Comparison Criteria

The security mechanisms described above can be combined in various ways to form a security policy. There are two criteria to be used when comparing security policies. The first criterion is the threats addressed by a policy. The second is a policy's costs in performance and facilities.

The Camelot DLF made no provisions for security, in part because hardware support for encryption was not available on the computer used for clients and servers. Existing security mechanisms are adequate for distributed logging systems, and there was comparatively little opportunity for new research in security. The log servers used for the Camelot DLF must be kept physically secure.

3.3.2.1. Threat Models and Security Requirements

The security threats expected in a particular environment determine the security mechanisms required in a security policy for distributed logging. The degree of physical security for log servers, clients, and the network determines what other security measures are necessary.

The most general threat model for communication assumes that an intruder can place a computer on any communication link. The intruder can copy, retransmit, delay, reorder, or modify any message, in addition to being able to send new messages as though it were another node. Authenticated, encryption-based protocols are necessary to address these attacks.

A slightly less general threat model occurs when intruders can copy or retransmit messages, or can send messages as though they were another node, but can not otherwise modify message streams. In this case, cryptographic authentication protocols can be used to establish a secure connection, but encryption of each message to establish its identity is unnecessary. End-to-end encryption may still be necessary to prevent disclosure of log contents.

In many distributed logging environments, a less general threat model in which an intruder can only copy messages from the network may be appropriate. Physical network security, together with mechanisms to detect and sanction any network user that attempts to masquerade as a log server or log client are sufficient to prevent such attacks. In this case, disclosure of log contents can be prevented with end-to-end encryption. Additional mechanisms will be needed to prevent traffic analysis.

Some environments require no special security provisions, other than physical protection for servers and network access. This situation could arise if distributed logging were implemented on a loosely coupled multiprocessor that was located entirely in one room. Trusted users, hardware, and software on a physically secure local network are another such environment. Regardless of whether special provisions are necessary to protect log data from disclosure or corruption using network access, it is essential to protect log servers from fire, flood, or insurrection by locating them in a locked room with appropriate equipment.

3.3.2.2. Cost

The costs of security mechanisms include the costs for physical security for system components, and for encryption hardware, if it is used. There are also latency and overhead costs incurred for encryption and for maintaining the extra state that secure protocols require.

The cost of providing physical security for log servers is the same as the cost of securing similar network servers. A well built and well managed machine room or closet works well. Geographically separate rooms could be used to increase the reliability of a log using distributed replication.

Preventing unauthorized physical access to a local network that spans several buildings could be expensive, both in physical security devices and personnel, and also politically difficult for some organizations. Other commercial or government organizations already have adequate levels of physical security at their facilities.

The cost of using encryption to protect data in a distributed log depends on the type of use and on whether special encryption hardware is available. Encryption performed without special hardware support is CPU intensive. Fast DES chip sets are readily available, although general purpose encryption boards suitable for workstations or other processors are less common. Even with special hardware, processing is needed to start the encryption and be notified of its completion, and direct memory access encryption devices will steal bus cycles from the CPU. The processing cost for sending an encrypted message with hardware support will still be about twice the cost of sending an unencrypted one because two I/O operations are performed instead of one.

End-to-end encryption is less costly than encryption-based protocols. In an end-to-end approach all encryption is done by the log client, so that log servers do not incur additional loads for security. Encryption-based protocols require a key distribution service, and extra state must be established and maintained on each end of a connection.

3.4. Log Representation on Servers

The choice of distributed log representation, addressed in Section 3.1, affects choices in the representation of data as it is stored in log servers. For example, mirrored or hybrid representations can use LSNs that are easily mapped to disk addresses, while clients of distributed replicated logs must assign contiguous LSNs. On the other hand, servers for distributed replicated logs can more easily exploit low-latency buffers to improve utilization of disk space and bandwidth. The Camelot DLF uses distributed replication and makes use of the somewhat novel technique of using an uninterruptible power supply to implement non-volatile virtual memory for disk buffers. The Camelot DLF combines log data from different clients in a

single disk data stream and accesses individual log records using the range searching technique described in this section. A server for any representation may need mechanisms for measuring and enforcing limits on clients' use of resources (this is done partially using the mechanisms described in Section 3.5).

3.4.1. Disk Representation Alternatives

The majority of log data stored on a server will be in disk storage, even if main memory buffers are used for some information. The disk storage might be reusable magnetic disks or it might be write-once optical storage.

Write-once optical storage is appealing as a medium for recovery log storage. There is no need to copy data to less expensive media (like magnetic tape) when it is spooled offline. Instead, an operator or a robot simply dismounts a full disk and stores it in an offline library. Write-once optical media should have excellent archive life.

A disk representation for log data should efficiently utilize the log server's disk, which is shared by many client transaction processing systems. In addition, reasonably fast random access for reading log records is needed. Finally, depending on the disk representation for log data, it might be easy or difficult to account for allocations of disk space to various clients and to manage log space subsequently (addressed in Section 3.5).

One disk representation for client logs is an operating system file for each client. If ordinary files are not suitable, the log server could implement its own scheme for partitioning disk space among different client logs. Instead of partitioning space among different clients, a log server can combine client logs in a single data stream and use indexes to locate individual client's log records. These three alternatives are described below.

3.4.1.1. Files

Most operating systems provide some sort of dynamically allocated file service with primitives for reading, writing and appending data to named files stored on disk. Unfortunately, the overhead and semantics of such services can make them unsuitable for log representations. A standard file system is unlikely to provide the correct write-once storage model for log servers, so consideration of file systems is limited to re-writable disks.

Files are a convenient disk representation for client logs. A separate file is used for each client's log. New log data is written by appending records to the appropriate file. Most file representations have efficient random read access (for example the Unix lseek call), so that locating a log record by LSN can be done either by direct access when LSNs are assigned by relative byte address for mirrored or hybrid logs, or by binary search of the file for distributed

replicated logs. Additionally, file systems include useful utilities, like tape dump programs, that can simplify operational chores on a log server.

File systems have many conflicting performance and function objectives, and the tradeoffs that a file system designer makes may not be suitable for log servers. File systems attempt to be efficient in disk space utilization while simultaneously allocating disk space to a large number of dynamically growing files. The data structures used to accomplish this can cause inefficiencies in other aspects of system performance such as disk bandwidth utilization. For example, Unix allocates files as lists of small fixed sized blocks, with the result that there is no external fragmentation and little internal fragmentation. Many seeks might be used to read an entire file because the blocks comprising a file might be scattered anywhere on disk.14 A block oriented file system like Unix also incurs overhead for frequently updating lists and index structures as new blocks are appended. Extent based file systems, like the one used in IBM's OS/MVS, avoid many of the performance problems of block oriented file systems by allocating files as a small number of large (multiple contiguous blocks or tracks) extents. Extent based file systems can have space utilization problems, but these are not likely to be an issue for a log server.

Log servers must be able to reliably write logical log pages that might be bigger than file system blocks or disk sectors. The file system and the underlying disk controller can make this problematic. A major problem with the semantics of many file systems is that file writes are buffered, so that the completion of a write system call does not guarantee that the data is on disk. Even worse, when appending data to a block oriented file system, the completion of the write call will probably not guarantee that the additional data structures necessary to access the newly appended file block are written out, even if the block of data itself reaches the disk. Log servers that wish to guarantee that data will be accessible on disk must issue a close operation on the file or some other expensive operation (like fsync in Unix). File systems use buffering to improve I/O throughput, in a fashion that is supposed to be transparent to users, by performing disk writes with large blocks. Section 2.2.2.4 mentioned that log managers avoid re-writing log pages that contain valid log data, and log servers will want to take the same precautions. Thus, it is desirable for log server software to do its own buffering and use write calls that guarantee the data is accessible on disk when the calls complete.
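In Unix terms, that discipline amounts to writing a full page and forcing it out before acknowledging the client, as in the sketch below. The routine name and the abbreviated error handling are assumptions for illustration; fsync itself is the expensive call mentioned above.

#include <unistd.h>

/* Append a buffered log page and make sure it has reached the disk
 * before any client is acknowledged.  The completion of write() alone
 * only means the data is in the file system's buffer cache; fsync()
 * forces the buffered data (and the file system metadata needed to
 * reach it) out to disk. */
int append_log_page(int fd, const void *page, size_t pagesize)
{
    if (write(fd, page, pagesize) != (ssize_t)pagesize)
        return -1;
    if (fsync(fd) < 0)
        return -1;
    return 0;
}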

3.4.1.2. Partitioned Space

If an extent based file system with adequate performance and appropriate function is not available, a log server can implement its own scheme for partitioning disk space among client logs. The partitioning could be completely static, or an extent based dynamic partitioning scheme could be implemented. A log server needs only one extent size, so log server implemented partitioning avoids the external fragmentation that occurs in an extent based file system.

14In fact the Berkeley Unix file system incorporates optimizations to attempt to allocate blocks of a file close together on disk [McKusick et al 85].

A low-level interface to disk storage that permits unbuffered access to specific disk sectors or tracks is needed to implement a partitioned space representation. Obviously, the portions of the disk being used for this sort of log representation can not simultaneously be used for an ordinary file system. In Unix this low level access is obtained by opening a raw device file in /dev/ and using regular read, write, and lseek calls on the file descriptor returned by the open. Log servers can not use file system information to determine the ends of client logs. Techniques like the one described in Section 2.2.2.4 can be used to determine the ends of logs, but this requires formatting all logging space before it is first used.
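A sketch of that style of access follows. The device name, the sector size, and the routine names are illustrative assumptions only; on a real system the device path and geometry come from the server's configuration, and raw device transfers usually must be aligned to, and a multiple of, the sector size.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define SECTOR_SIZE 512               /* assumed sector size */

/* Write a buffer at a given sector of a raw disk device, bypassing the
 * file system and its buffer cache.  nbytes is assumed to be a multiple
 * of SECTOR_SIZE. */
int write_sectors(int fd, long sector, const void *buf, size_t nbytes)
{
    if (lseek(fd, (off_t)sector * SECTOR_SIZE, SEEK_SET) < 0)
        return -1;
    if (write(fd, buf, nbytes) != (ssize_t)nbytes)
        return -1;
    return 0;
}

/* For example (device name hypothetical):
 *     int fd = open("/dev/rsd0c", O_RDWR);
 *     write_sectors(fd, 1024, page, SECTOR_SIZE);
 */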

Log record access by LSN is accomplished in the same fashion as with a file representation for logs. Either the LSNs assigned by clients represent relative positions in the log space, or log servers for distributed replicated logs use binary search of the space to locate log records.

There are a few advantages to using a completely static partitioning scheme that gives each log client a single fixed partition of disk space for its log. Such a scheme is very simple to implement and it guarantees each log client a fair share of disk space. For write-once storage, the log volume must be replaced by a new one when the first client log partition fills, so that space utilization might average 50 percent. For re-writable storage, log space management techniques (Section 3.5) permit clients to reclaim disk space as it fills.

Not all log clients will necessarily need equal amounts of online log data. A simple dynamic partitioning scheme based on fixed sized extents can achieve better space utilization without forcing clients to frequently recycle online log space. Such a dynamic partitioning scheme can be very simple. For re-writable disk, allocation is recorded in a table, indexed by extent number, that gives the log client using the extent (if any) and the position of the extent within the client's space. When the log server is restarted it reads the table and transforms it into contiguous maps of each client's space. The extent map on disk is carefully rewritten whenever a new extent is added to a client's log.
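A sketch of such an extent table is shown below. The table size, field widths, and names are invented for illustration; the only essential content is the owning client and the extent's position within that client's log space.

#include <stdint.h>

#define MAX_EXTENTS 500               /* assumed number of extents on the disk */
#define FREE_EXTENT 0xFFFF            /* marker for an unallocated extent */

/* One entry per disk extent: which client (if any) owns it and where it
 * falls within that client's log space.  The whole table is read when
 * the server restarts and is carefully rewritten on disk whenever an
 * extent is added to a client's log. */
struct extent_entry {
    uint16_t client_id;               /* FREE_EXTENT if unused */
    uint16_t position;                /* extent's position in the client's log */
};

struct extent_table {
    struct extent_entry map[MAX_EXTENTS];
};

/* At restart the server transforms this table into, for each client, an
 * ordered list of the extents that make up that client's space. */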

Extent size choice is a tradeoff between list length and internal fragmentation. The space used for representing extent lists is not likely to be a problem, since 500 extents can be mapped into lists with just a kilobyte of main memory. If there are fewer, larger extents, then access performance will be better when searching a client's disk space because the space is more contiguous and seeks are fewer and shorter. Internal fragmentation will average one-half of the extent size times the number of log clients. A log server with 50 clients, 1000 cylinders, and an extent size of 4 cylinders will have 10 percent of its space wasted by internal fragmentation. For many common disk geometries, extent sizes on the order of one cylinder are likely to be a convenient compromise.

3.4.1.3. Single Data Stream

Combining the logs of different clients into a single data stream that is written to disk through a low level interface provides for efficient disk utilization, both in I/O bandwidth and in space. A single stream representation is particularly suited to optical disk storage because reuse of storage is not important and the difficulties in accounting for clients' use of space are not a concern. An append-only data stream could make good use of write once disks that have separate read and write heads. A single data stream representation improves I/O throughput by eliminating most seeks for log writes. A single data stream representation can use a single low-latency non-volatile or stable memory buffer to enable it to write only full disk blocks. Partitioned space representations would require multiple buffers to achieve similar disk space utilization improvements.

A single data stream disk representation is implemented using the same sort of low level disk interface used by a partitioned space representation. The representation distinguishes between different clients' log records in the data stream. The overhead for this is small because a client typically sends several log records to a server at once and the overhead of a client identifier can be amortized over several records. Section 4.4.2.2 describes the single data stream representation of log records used in the Camelot Distributed Log Facility. The difficult representation problem is mapping LSNs assigned by log clients to positions in the data stream.

Log records can be located by LSN in a number of ways. On a re-writable disk a conventional index structure can be used. The three techniques considered here are all applicable to append-only storage. The first is a hybrid binary-sequential search combined with a single level index. The second is a compact mechanism for indexing the entire contents of a logging disk. The third is an append-only data structure for indexing individual client logs.

Range Searching is the term for log record access where a range of disk blocks is searched for a record with the desired LSN. The search used is a variant of a standard binary search. Because log records for a particular client might not be stored on any particular block in the range being searched, linear search is used when a probe of the binary search examines a block that contains no records for the client. The worst case number of blocks searched is linear in the size of the range, but if a client's log records are uniformly distributed in the range, the logarithmic performance of binary search will be realized. A good method for selecting the range of disk blocks to search can improve the performance of the search. Log servers implementing distributed replicated logs can augment their lists of contiguous intervals of LSNs for each client with the addresses of the blocks where the high and low LSNs in each interval are stored. Log servers can keep a finer grained range index if intervals are stored on ranges that contain blocks with no records for a client, but such a single level index could get very long. Range searching will work well for log servers with small logging disks because clients will have to frequently reduce the amount of online log space they use.
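The sketch below gives the flavor of this hybrid binary-sequential search. It assumes helper routines block_low_lsn and block_high_lsn that report the lowest and highest LSNs a block holds for a given client (or -1 if the block holds none of that client's records); these helpers, and the exact handling of empty probes, are illustrative assumptions rather than the Camelot DLF's algorithm.

typedef long lsn_t;

/* Placeholders: lowest/highest LSN stored in a block for the client,
 * or -1 if the block holds none of the client's records. */
extern lsn_t block_low_lsn(int client, long block);
extern lsn_t block_high_lsn(int client, long block);

/* Return the block in [lo, hi] holding the client's record with LSN
 * target, or -1 if no such block is in the range.  A client's records
 * appear in the data stream in increasing LSN order, so this is a
 * binary search with a sequential scan past blocks that hold nothing
 * for the client (the source of the linear worst case). */
long range_search(int client, long lo, long hi, lsn_t target)
{
    while (lo <= hi) {
        long probe = lo + (hi - lo) / 2;
        long p = probe;

        while (p <= hi && block_high_lsn(client, p) < 0)
            p++;                         /* scan past blocks without our records */

        if (p > hi) {
            hi = probe - 1;              /* nothing for us in [probe, hi] */
        } else if (target > block_high_lsn(client, p)) {
            lo = p + 1;                  /* target lies in a later block */
        } else if (target < block_low_lsn(client, p)) {
            hi = probe - 1;              /* target lies in an earlier block */
        } else {
            return p;                    /* target falls within this block */
        }
    }
    return -1;
}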

Entry Bit-Map Indexes are an index structure that stores on disk a multi-level tree of bit maps describing which blocks (or groups of blocks, in higher levels of the tree) contain log records for a client. This index was designed by Finlayson for a network log service based on write-once storage [Finlayson and Cheriton 87].

Finlayson's logging service is not specific to transaction logging and does not address all the necessary reliability, availability, and performance issues; instead it allows time-based access to records in append-only files. The differences between the intended uses of the services are not relevant to the applicability of the index to distributed logs, except that all logs in Finlayson's service use a common space of LSNs that are timestamps assigned either by the log server or by clients using synchronized clocks. LSNs are not included with the entry bitmaps because a LSN from any log on the disk block with the entry map can be used for searching that node of the tree. Distributed logging must include a LSN for each log client with a bitmap in each entry in the index. With this change, each entry has the following structure (adapted from Finlayson [Finlayson and Cheriton 87]):

1. A 'level-1' entry map index entry appears every N blocks on the log device. Such an entry contains a bitmap of size N for each log client that has log records in any of the previous N blocks. This bitmap indicates those blocks that contain such log records. The entry also contains the high LSN in any of the previous N blocks for each log client.

2. A 'level-2' entry map index entry appears every N^2 blocks on the log device. Such an entry contains a bitmap, of size N, for each log client that has log records in any of the previous N^2 blocks. This bitmap indicates those groups of N blocks that contain such entries. The entry also contains the high LSN in any of the previous N^2 blocks for each log client.

3. And so on for higher levels.

The bit maps and high LSNs for each client in each entry map index entry form a search tree of degree N, rooted at the level-k entry map, if k is the highest level entry map index entry written. Figure 3-18 shows a two level entry map index with N=4 for a single log client. The number of disk accesses needed to follow an entry map index is at most the log to the base N of the total disk space used for all logs. In practice the higher levels of the tree would be cached in main memory. Finlayson's analysis suggests that the best size for N is between 16 and 32.
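For concreteness, a level-1 entry in the modified form described above might be laid out as in the sketch below. The field names, widths, and the fixed choice of N are assumptions for illustration, not the layout used by Finlayson or by the Camelot DLF.

#include <stdint.h>

#define N 16                          /* blocks summarized by a level-1 entry */

/* Per-client element of an entry map index entry: a bitmap over the
 * previous N blocks (or N groups of blocks at higher levels) and that
 * client's highest LSN within the span covered. */
struct entrymap_client {
    uint32_t client_id;
    uint16_t bitmap;                  /* bit i set => block/group i holds records */
    int64_t  high_lsn;                /* highest LSN for this client in the span */
};

/* A level-k entry, written every N^k blocks, holds one such element for
 * each client with records in the span it covers. */
struct entrymap_entry {
    uint16_t level;                   /* 1, 2, ... */
    uint16_t nclients;
    struct entrymap_client clients[1];  /* nclients elements in practice */
};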

Append-Forest Indexes are search trees designed for indexing individual client logs in a distributed transaction logging system [Daniels et al. 87]. Nodes in an append-forest are not placed at any specific disk locations, but nodes contain pointers so that nodes written earlier can be located from more recent nodes. The most recent node is found by examining a checkpoint record or by scanning recently written portions of the data stream when a log server is restarted.

New records may be added to an append-forest in constant time using append-only storage, providing that keys are appended to the tree in strictly increasing order. A complete append-forest (with 2^n - 1 nodes) resembles, and is accessed in the same manner as, a binary search tree having the following two properties:

1. The key of the root of any subtree is greater than all its descendants' keys.

2. All the keys in the right subtree of any node are greater than all keys in the left subtree of the node.

Figure 3-18: Entry Bit-Map Index (adapted from [Finlayson and Cheriton 87]); the figure shows the entry map levels above the log device blocks and their high LSNs.

An incomplete append-forest (with more than 2^n - 1 nodes and fewer than 2^(n+1) - 1 nodes) consists of a forest of (at most n+1) complete binary search trees, as described above. Trees in the forest have height n or less, and only the two smallest trees may have the same height. All but at most one of the smallest trees in the forest will be left subtrees of nodes in a complete append-forest (with 2^(n+1) - 1 nodes).

All nodes in the append-forest are reachable from its root, which is the node most recently appended to the forest. The data structure supports incomplete append-forests by adding an extra pointer, called the forest pointer, to each node. This pointer links a new append-forest root (which is also the root of the rightmost tree in the append-forest) to the root of the next tree to the left. A chain of forest pointers permits access to nodes that are not descendants of the root of the append-forest.

Figure 3-19 illustrates an eleven node append-forest. The solid lines are the pointers for the right and left sons of the trees in the forest. Dashed lines are the forest pointers that would be used for searches on the eleven node append-forest. Dotted lines are forest pointers that were used for searches in smaller trees. The last node inserted into the append-forest was the node with key 11. A new root with key 12 would be appended with a forest pointer linking it to the node with key 11. An additional node with key 13 would have height 1, the nodes with keys 11 and 12 as its left and right sons, and a forest pointer linking it to the tree rooted at the node with key 10. 82

[Figure 3-19: Eleven Node Append-Forest]

Another node with key 14 could then be added with the nodes with keys 10 and 13 as sons, and a forest pointer pointing to the node with key 7.

3.4.2. Low-Latency Buffer Alternatives

For mirrored and hybrid log representations the log data must be in stable (mirrored) storage when log writes are acknowledged. Stable storage is normally implemented using mirrored disks, but lower latency stable buffer memories can be constructed. Log servers for distributed replicated logs can use low-latency non-volatile buffers to improve response time and disk usage. There are different ways of creating low-latency non-volatile and stable memories. The technologies differ in reliability, cost, latency, and ease of design and implementation.

Magnetic disk storage is the prototypical high latency non-volatile memory. Magnetic disks are expected to retain data written on them after most power failures (there is a small chance that an additional failure during a power failure will cause detectable loss of disk contents; stable storage designs address this problem). Low-latency non-volatile memory is storage that does not lose its contents during ordinary processor power failures and is accessed more quickly than typical disk memory.

There are two classes of low-latency non-volatile memory. The first class is external memory devices that are accessed through an I/O interface like a disk memory. The interface to an external memory might or might not be identical to magnetic disk storage and access requires an I/O operation. The second class is non-volatile RAMs that are addressable as ordinary main processor memories. The two classes of non-volatile memories differ in access times and in the degree to which they tolerate failures of processors, bus hardware, and operating systems.

One example of external low-latency memory is the dedicated log tail disk described in Section 2.2.2.4. This buffer is quick to write and incurs an average rotational delay to read, but read access times are not important because the contents of the memory are also held in volatile memory for writing to the main logging disk later when a block fills. Various solid state technologies, including magnetic bubbles and charge coupled devices, have been proposed to replace the magnetic disk in external non-volatile memories, but CMOS RAMs with battery power backup seem the best current technology.

CMOS RAM with a battery can be used to build external memories that are accessed as I/O devices, or non-volatile memory that is accessed just like any other memory. The designer of a memory board with battery backup must ensure that the bus interface can reject the spurious signals that might appear on the bus when a system is powered up or down.

An alternative to a CMOS battery backup memory module is a backup power supply for an entire computer system. Uninterruptible power supplies (UPSs) for moderate sized computer systems (workstations with large disks) are readily available. These UPSs use large storage batteries and DC to AC inverters. A simple circuit detects loss of line power (many UPSs have these circuits) and the log server can respond to the power fail interrupt by flushing main memory buffers to disk and checkpointing itself. Only a few minutes of battery operation are needed.
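As a rough sketch of the power-fail path (the signal name, buffer, and file names are assumptions for illustration, not the Camelot code), a log server built around a UPS might catch a power-fail signal and flush its volatile buffer before the batteries are exhausted:

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t power_failing = 0;

    static void on_power_fail(int sig)
    {
        (void)sig;
        power_failing = 1;            /* the real work happens outside the handler */
    }

    /* Force the volatile log buffer to disk before battery power runs out. */
    static void emergency_flush(int log_fd, const char *buf, size_t len)
    {
        if (write(log_fd, buf, len) == (ssize_t)len && fsync(log_fd) == 0)
            fprintf(stderr, "log buffer flushed; checkpointing and shutting down\n");
        else
            perror("emergency flush failed");
    }

    int main(void)
    {
        char buffer[8192];            /* stands in for the volatile log buffer */
        int log_fd = open("logdata", O_WRONLY | O_CREAT | O_APPEND, 0600);
        struct sigaction sa;

        memset(buffer, 0, sizeof buffer);
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_power_fail;
    #ifdef SIGPWR
        sigaction(SIGPWR, &sa, NULL); /* delivered when the UPS reports loss of line power */
    #endif
        sigaction(SIGTERM, &sa, NULL);/* stand-in on systems without SIGPWR */

        while (!power_failing)
            pause();                  /* normal logging work would go here */
        emergency_flush(log_fd, buffer, sizeof buffer);
        close(log_fd);
        return 0;
    }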

Lampson's design for stable storage is noteworthy because it is based on a thorough model of the underlying components [Lampson 81]. The behavior of disk memories is divided into desired actions, expected failures, and disasters. The stable storage system tolerates any number of expected failures, and the disasters that will corrupt data are explicitly defined. Lampson's model differs from the common recovery log model of stable storage (which is rarely stated as formally) in two ways. First, Lampson's model is concerned with implementing mutable stable objects, unlike the immutable records in an append-only recovery log. Second, implementors assume that data is correct on disk if a write completes without reporting any errors, so that re-reading the disk after a write is unnecessary [Mitchell 82]. In practice these differences mean that the writes to different replicas of a recovery log can go on in parallel, rather than sequentially, provided that no block of the log media is ever rewritten while it contains important data.

A low-latency stable memory can be designed either as a single module (board) containing redundant non-volatile storage (probably CMOS memories with batteries) that is accessed through one interface, or it can be designed using multiple low-latency non-volatile memory modules with separate interfaces. If the low-latency stable memory is comprised of multiple modules, then software protocols similar to those for disk based logs or stable memories are used to access it. A single module design has lower access overhead, but is likely to be less reliable than a multiple module design. A single erroneous interface could corrupt data as it is transferred into the multiple memories in a single module stable storage, and a single module implementation is subject to physical failures (like a cracked printed circuit board) that a multiple module system avoids. If it is necessary to move stable memory from one system to another, separate modules are preferred because they can be transferred by different routes.

External low-latency memory modules have roughly the same failure behavior as disk memories and can be used to construct a multiple module low-latency stable memory buffer in the same way that disks are used to construct higher latency stable memories. Non-volatile RAMs are different because they are more likely to be corrupted by a system software failure. The low-latency stable memory used by Enchère [Banatre et al. 83] uses non-volatile RAMs that have special hardware controls so that only one memory in a pair can be written at once. This seems a better solution than requiring that a new value written to stable storage be accompanied by the previous value, as suggested by Needham et al. [Needham et al. 83].

3.4.3. Comparison Criteria

Criteria for comparing different log server representations include cost, utilization of disk space and I/O bandwidth, and the performance (latency) of log forces and random read accesses. Cost comparisons of mirrored and distributed replication log representations appear in Section 3.1.4. This section compares the costs and performance of different disk representations, and the costs of different low-latency buffer implementations.

The Camelot DLF uses a single data stream representation because it could have better disk utilization than partitioned space, particularly for write-once optical storage where the write head might move relatively slowly. Also, there was research interest in experimenting with append-only storage systems. The append-forest was the planned index structure for the Camelot DLF and range searching was implemented as an interim access structure. Range search was simpler to implement and the performance of range search was expected to be acceptable for testing and other initial applications for the Camelot DLF. Camelot's use of UPSs for low-latency non-volatile storage was motivated by the simplicity and portability of the non-volatile storage system. A battery-backup non-volatile memory board must be redesigned for every bus architecture and requires special operating system support to map the memory into the address space of the log server and to prevent the memory's contents from being erased during restart. A UPS-based non-volatile memory avoids these complications.

3.4.3.1. Low-Latency Buffer Costs

Battery backup CMOS memory in board configurations costs a few times as much as the same amount of volatile memory. Perhaps more importantly for servers, a low-latency non-volatile memory takes up a memory or I/O slot in a machine. Stable memories cost roughly twice as much as non-volatile memories.

UPS based non-volatile memories are priced according to the power demands of the system they are used with. The $7000 supplies used in the Camelot Distributed Log Facility (described in Chapter 4) contributed about 15% to system hardware cost. A UPS based non-volatile memory costs about twice as much as 256K of semiconductor non-volatile memory, but a UPS is available for any machine and does not need any backplane space or kernel software support.

3.4.3.2. Disk Utilization

A low-latency buffer is the most important factor affecting both disk space and disk I/O utilization. With a low-latency buffer a log server writes infrequently to disk, with large blocks or whole tracks, and the blocks are always full because incomplete blocks are forced to the low-latency buffer. Without a low-latency buffer a log server writes frequently to disk with small blocks that are not always full at the time it must force them to disk. Servers with partitioned disk space need low-latency buffers big enough to buffer a track for each active client. When no low-latency buffer is available, servers using a single data stream representation can trade off disk utilization and force latency by using group commit (actually group force) techniques for multiple clients' logs.
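A group force can be sketched as follows (a single-threaded simplification with invented names, not a description of any particular server): force requests from several clients are queued, and one disk write fills the block and acknowledges every queued request at once. A real server would also flush after a short timeout so that a lone client is not delayed indefinitely.

    #include <stdio.h>

    #define BLOCK_SIZE   4096
    #define MAX_PENDING  32

    struct force_request {
        int client;
        int nbytes;                    /* log bytes this client needs forced */
    };

    static struct force_request pending[MAX_PENDING];
    static int n_pending = 0;
    static int bytes_buffered = 0;

    /* Pretend to write one block and acknowledge every queued force at once. */
    static void group_force(void)
    {
        int i;

        if (n_pending == 0)
            return;
        printf("one disk write of %d bytes satisfies %d forces\n",
               bytes_buffered, n_pending);
        for (i = 0; i < n_pending; i++)
            printf("  ack client %d\n", pending[i].client);
        n_pending = bytes_buffered = 0;
    }

    /* A client asks for a force; the write is delayed until the block fills. */
    static void request_force(int client, int nbytes)
    {
        pending[n_pending].client = client;
        pending[n_pending].nbytes = nbytes;
        n_pending++;
        bytes_buffered += nbytes;
        if (bytes_buffered >= BLOCK_SIZE || n_pending == MAX_PENDING)
            group_force();
    }

    int main(void)
    {
        int c;

        for (c = 0; c < 10; c++)
            request_force(c, 700);     /* ten clients each force 700 bytes of log */
        group_force();                 /* timeout path: flush whatever is left */
        return 0;
    }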

The choice between single data stream and partitioned space representations has a secondary (after the low-latency buffer) effect on disk I/O utilization, but for mirrored log servers the effect depends on the relative frequency of forces and read operations. A single data stream representation does fewer seeks when forcing log data from different clients than a partitioned space representation. However, a mirrored log server with a partitioned space representation can assign relative byte addresses for LSNs and thus uses fewer seeks to read log data than a log server using a single data stream representation.

The disk space required for the indexes needed for single data stream representations is negligible (about 2% of log space).

3.4.3.3. Performance

There are two performance measures relevant to server data representation. The first performance measurement is the response time for log forces. The second is the time to access a random log record. Random read access times affect the performance of a log server when a client starts aborting a transaction or recovering. Random access performance does not reflect the performance of a server for most log reads. The performance of typical log reads on any server representation will benefit from locality of reference, caching, and streaming. For log reads subsequent to an initial random access the server already knows approximately where the log record is, the disk arm is close to the record, the record is already in the server's main memory, or it is already at the client.

The low-latency buffer is the most important determinant for log force performance because forces must wait for one or two disk writes when no buffer is available. When there is no low- latency buffer a server using a single data stream will do forces a little faster than one using partitioned space because shorter seeks are used for the disk writes.

The most important determinant for random read performance is whether LSNs are relative byte addresses so that only one seek and disk read is used to retrieve the record. This only occurs on mirrored servers that use partitioned space. Other servers must use an index or binary search to locate log records.

Search and index access speeds are affected by the size of the space being searched and by whether index entries are cached in main memory. Partitioned space representations have a smaller search space than single data stream representations.

3.5. Log Space Management

The online storage space in log servers is a limited resource that must be managed efficiently. This is true for online space in local logs as well, but distributed logging is complicated by many clients sharing a common pool of log space and by clients that might be unavailable and unable to participate in the management of their log space for long periods.

The simplest model of a recovery log is an infinite append-only sequence of log records. Treating a log as infinite is unrealistic for a number of reasons. First, storing a nearly infinite amount of data is impractical. Magnetic tape (the most common bulk storage) deteriorates after only a few years so that large tape libraries must be continually copied. Second, performance would not be acceptable if recovery processing must examine many volumes of offline log data.

The model of a recovery log advocated in this dissertation has a limited amount of online storage that can be accessed in one second or less¹⁵ and some amount (possibly none) of offline storage that requires from a few minutes to hours for access. The spooled (offline) log is usually viewed as an extension of the online log (conversely the online log is a cache of the most recent portion of the offline log). An exception to this model is when log compression techniques (Section 3.5.1.3) are used on the offline log, in which case the online and offline logs are different objects that are interpreted in different ways by the recovery mechanisms.

This section continues with descriptions of mechanisms that are used to manage online log space and limit recovery time. Then, policies that combine these mechanisms in different ways are outlined. The section concludes with comparisons of the policies according to various criteria. The Camelot DLF uses the simple policy of using log servers with only online log storage. Log servers request clients to truncate their logs when the servers begin to run low on log space.

¹⁵Some optical disks have many tracks and slow seeks, resulting in maximum seek times close to one second.

3.5.1. Mechanisms

Most log space management mechanisms can be divided between server controlled mechanisms and client controlled mechanisms. Only the server controlled mechanisms and log compression need to be considered in the design of log servers, but the choice of server controlled mechanisms affects what client controlled log space mechanisms are appropriate or necessary.

3.5.1.1. Server Controlled Mechanisms

Server controlled log space management functions include spooling log data to offline storage, implementation of the Archive and Truncate calls to spool or discard unnecessary log data, and actively requesting clients to truncate their logs.

Spooling is the process of transferring log data from online to offline storage. Spooling is essential to providing an infinite log model to clients, but because that model is unrealistic and fraught with performance traps, spooling should be presented to clients in a controlled manner.

Typically, log data is spooled from online magnetic disks to offline magnetic tape. Reading the data from disk might be a performance problem. Reading ties up the disk arm and the I/O bandwidth to the disk, so that logging to the disk should not continue at full speed during spooling. A low-latency buffer would help some, but a high performance log server would still require an extra disk (giving a total of two logging disks for a server for distributed replicated logs and three disks for mirrored log servers) so that one disk can be used for spooling to a tape drive. Mirrored servers should schedule logging and archiving on their disks so that a maximum amount of online log data is maintained while not causing contention for spooling. A second tape drive and a small disk might be needed for staging spooled log data back to the server for recovery.

Write once optical disks simplify log spooling. The server does not need a tape drive. A server for distributed replicated logs uses only two drives and a mirrored server uses three drives. With these configurations it is possible to keep an entire optical disk volume (about 100 megabytes for 5.25 inch optical disks, and gigabytes for 14 inch disks) of old data online on a disk not used for writing. The full disk can be dismounted temporarily if older log data is needed. Capacity is so great that the system can run without an operator for a long time (like overnight), when offline log data is not needed.

Spooling log servers keep a database called a media catalog describing what log data is on what offline volumes. Ambitious log server designs will implement the media catalog as a recoverable object stored in some transaction facility that is itself a client of the log server. This leads to special case logic for locating the media catalog's offline log. More conservative designs implement the media catalog with a small amount of mirrored storage that is updated with a careful replacement strategy.

The Archive call is an advisory call that is interpreted by a spooling log server as meaning that a client will not need log records for crash recovery. The LSN passed in the call is the lowest LSN that might be needed for crash recovery. Records with smaller LSNs can be archived to offline storage and will be needed only for media recovery. Log servers that do not spool log data offline ignore this call. Log servers that do spool log data have the option of spooling data before they receive Archive calls for the data, but they might delay clients if the data is later needed for crash recovery.

Servers using partitioned space on rewriteable disks can spool each client's data individually, though this might entail a lot of mounting, unmounting, and remounting of tapes. Servers using a single data stream on rewriteable disks will want to spool all clients' data from one region of storage and so will wait until all (or most) clients have issued Archive calls. Servers using write-once storage will move storage offline when it fills.

The Truncate call indicates that log data will not be needed for any purpose and the space it occupies may be reclaimed. Log clients issue Truncate calls when they determine that log data will not be needed for media recovery.

Spooling log servers will normally find that truncated data is on offline storage. When all of the log data on an offline volume has been truncated the log server notifies an operator that the volume can be reused or destroyed and deletes the volume from its media catalog.

Log servers that partition online disk space among clients can immediately reclaim the space freed by a client when it truncates its log. The space is either returned to a free extents pool, or is marked for reuse by the same client, depending on the partitioning strategy. Single data stream log servers that do not spool log data to offline storage must wait until all log data on a disk page has been truncated before the page can be reused. This means that the server records a truncation point for each log client and a high water mark that is the lowest numbered page of log storage that contains any valid records. The log server moves the high water mark and makes disk space available for reuse as clients issue new truncations.
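The bookkeeping this requires is small; the following sketch (invented names and page numbers, not the Camelot data structures) shows how a single data stream server might advance its high water mark as clients truncate:

    #include <stdio.h>

    #define N_CLIENTS 4

    /* Lowest page of the single data stream still holding a valid record for
       each client; a Truncate call simply raises the client's entry. */
    static long lowest_valid_page[N_CLIENTS] = { 120, 95, 260, 180 };
    static long high_water_mark = 90;     /* pages below this are free for reuse */

    static void client_truncated(int client, long new_lowest_page)
    {
        long min;
        int i;

        lowest_valid_page[client] = new_lowest_page;
        min = lowest_valid_page[0];
        for (i = 1; i < N_CLIENTS; i++)
            if (lowest_valid_page[i] < min)
                min = lowest_valid_page[i];
        if (min > high_water_mark) {
            printf("reclaiming pages %ld..%ld\n", high_water_mark, min - 1);
            high_water_mark = min;        /* space below the mark can be rewritten */
        }
    }

    int main(void)
    {
        client_truncated(1, 150);         /* client 1 truncates: mark moves to page 120 */
        client_truncated(0, 300);         /* client 0 truncates: mark moves to page 150 */
        return 0;
    }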

Nearly all log servers that do not spool log data to offline storage and some spooling log servers (those that have offline storage limited by the capacity of an automatic optical disk changer, for example) will need mechanisms to deal with situations when clients are not freeing log space rapidly enough. The first line of defense is to send truncation request messages to log clients that are using more than their share of log space. For these requests to be effective the clients must be up, and they must have mechanisms available for truncating their logs (discussed next in Section 3.5.1.2).

If requests for truncation fail to free enough log space, a log server must deny service. Only log writes need to be refused; reads and (especially) log truncations can still be permitted. Log servers with partitioned logging space only need to deny service to clients that are using more than their share of space, while log servers with a single logging stream must deny log writes to all clients. There is a trap in denying log write service because log clients must often do some log writes (for example, during crash recovery) before they can truncate their logs. Logs using distributed replication and hybrid representations might tolerate such a situation because the client is free to use other log servers if they are available and not full. Mirrored log servers especially, and other log servers as well, should deny service when a reserve of space is still available. The reserved space can be used by offending log clients for the logging that they need to perform to truncate their logs. The most common reason for a client failing to truncate on request will be that it is down, and the reserved logging space can be used for crash recovery. There must be a protocol by which a log client can ask a server for permission to do logging specifically in support of truncation.

3.5.1.2. Client Controlled Mechanisms

Mechanisms that log clients use to limit the amount of log space they use include checkpoints that limit the amount of redo work performed during crash recovery, techniques for handling long running transactions that limit the amount of undo processing, and archive dumps to facilitate media recovery. After using these mechanisms log clients perform calculations to determine what LSNs to pass as arguments to Archive and Truncate calls.

Checkpoints are a mechanism used to limit the amount of redo processing performed during crash recovery (and consequently the amount of log that must be read) [Haerder and Reuter 83]. Checkpoints do not by themselves limit the amount of undo processing that must be performed during crash recovery for transactions aborted by a node crash. Other mechanisms must be combined with checkpoints to truncate the log needed for node recovery.

A transaction processing system takes a checkpoint by performing some processing and writing a checkpoint record to its recovery log. The exact processing performed and the contents of the checkpoint record depend on the recovery and checkpointing algorithm used by the transaction system. Haerder and Reuter characterized the four checkpointing strategies [Haerder and Reuter 83]. In general, a checkpoint record contains a list of recoverable pages that are dirty in main memory (if there are any), and a list of active and prepared transactions. A checkpoint strategy must have an integrated recoverable page flushing policy that ensures that no recoverable page remains dirty in main memory for too long.
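The general shape of such a record can be suggested with a small declaration (the field names and sizes here are hypothetical, not taken from Haerder and Reuter or from Camelot):

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_DIRTY_PAGES  256
    #define MAX_TRANSACTIONS 128

    /* One possible layout for a checkpoint record written to the recovery log. */
    struct checkpoint_record {
        uint64_t checkpoint_lsn;                    /* LSN assigned to this record */

        /* Recoverable pages dirty in main memory, each with the LSN of the oldest
           update not yet on disk; redo must start no later than the minimum. */
        uint32_t n_dirty_pages;
        struct { uint64_t page_id; uint64_t recovery_lsn; }
                 dirty_pages[MAX_DIRTY_PAGES];

        /* Transactions active or prepared at the checkpoint, with the LSN of each
           transaction's first log record for undo processing. */
        uint32_t n_transactions;
        struct { uint64_t tid; uint64_t first_lsn; uint8_t prepared; }
                 transactions[MAX_TRANSACTIONS];
    };

    int main(void)
    {
        printf("a full checkpoint record occupies at most %zu bytes\n",
               sizeof(struct checkpoint_record));
        return 0;
    }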

Checkpoints do not by themselves limit the amount of log processed during crash recovery. Crash recovery must also go back in the log to the first record for the oldest transaction that is active or prepared at the time of the crash. Therefore, log clients may need techniques for handling long running transactions to manage their log space.

If the first log record of the oldest transaction in a system is too far back in the log, one option is to abort the transaction so that the log can be truncated at a later point. Aborting a transaction can require as much log space as was originally required for the transaction's log, so transaction abort is not a policy that should be used when a log is about to completely run out of space. Transaction system users are often encouraged to believe that transactions can perform arbitrary operations and last arbitrarily long, so aborting a long running transaction could annoy users considerably. Besides, the transaction might be automatically restarted and have another chance to fill up all the log space.

Not all long running transactions can be aborted. If a system is a participant in a multi-site distributed transaction that has entered the prepared phase of a commit protocol, then the transaction system has pledged to maintain the transaction in the prepared state and retain all information (log data) needed to commit or abort the transaction until it is informed of the transaction's final state by the commit protocol coordinator. If the coordinator node crashes this can take a long time.

There are a few strategies for dealing with long running prepared transactions. First, the transaction system can inform an operator of long prepared transactions that are preventing log truncation and encourage the operator to resolve (i.e. guess) their final status. Second, it is important to realize that the log records of a prepared transaction probably account for only a small fraction of the log space that is prevented from reclamation. This suggests that log records for long running prepared transactions should be copied somewhere else so that the log can be truncated.

One place to copy a prepared transaction's log records is into the log itself. By copying a prepared transaction's records forward in the log, the information needed to commit or abort the transaction is kept near the end of the log and earlier portions of the log containing the original records for the transaction can be truncated. Copying the records of a prepared transaction must be done carefully because the copying process could be interrupted by a node crash. This will typically be done by writing a Copying_Complete record to the log when all records for a transaction have been copied. Copied records, which are identified as such in their headers, are ignored during recovery unless they are preceded by a Copying_Complete record in a backward (newest to oldest) scan of the log.
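The backward-scan rule can be stated precisely with a short sketch (a hypothetical record layout; a real recovery pass does much more): scanning newest to oldest, a copied record is honored only if a Copying_Complete record for the same transaction has already been seen.

    #include <stdio.h>

    enum kind { UPDATE, COPIED_UPDATE, COPYING_COMPLETE };

    struct record { enum kind kind; int tid; };

    #define MAX_TIDS 64

    /* Scan the log newest-to-oldest and report which records recovery would use. */
    static void backward_scan(const struct record *log, int n)
    {
        int copy_complete_seen[MAX_TIDS] = { 0 };
        int i;

        for (i = n - 1; i >= 0; i--) {           /* newest record is log[n-1] */
            const struct record *r = &log[i];
            switch (r->kind) {
            case COPYING_COMPLETE:
                copy_complete_seen[r->tid] = 1;
                break;
            case COPIED_UPDATE:
                printf("record %d (copied, tid %d): %s\n", i, r->tid,
                       copy_complete_seen[r->tid] ? "use" : "ignore");
                break;
            case UPDATE:
                printf("record %d (original, tid %d): use\n", i, r->tid);
                break;
            }
        }
    }

    int main(void)
    {
        /* Transaction 1's records were copied forward and the copy finished;
           transaction 2's copy was interrupted before its Copying_Complete. */
        struct record log[] = {
            { UPDATE, 1 }, { UPDATE, 2 },
            { COPIED_UPDATE, 2 },                /* no Copying_Complete follows */
            { COPIED_UPDATE, 1 }, { COPYING_COMPLETE, 1 },
        };

        backward_scan(log, (int)(sizeof log / sizeof log[0]));
        return 0;
    }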

The problem with copying prepared transactions' log records forward in a log is that the process must be repeated if the log is truncated again. An alternative to copying log records into a log is to copy the records to some other stable storage the first time that a prepared transaction's log data must be copied to truncate the log. The recoverable storage implemented by the transaction processing system itself can be a suitable place. To store the log records of a long running transaction in recoverable storage, the system runs a local transaction to write the records into recoverable storage. The copying transaction is entirely local, so it can not get stuck in a prepared state. A small record must be kept in the log, perhaps in each subsequent checkpoint record, to tell the recovery processes where to look for the prepared transaction's data, and this record must be copied as the log is truncated.

Archiving, the process of making a backup copy of recoverable storage, is done to permit media recovery and the truncation of recovery logs. Media recovery uses the most recent archive copy of recoverable storage and the recovery log to restore the recoverable storage to the state it had as of the time of the media failure. Archive frequency affects log space management policies because it controls how much log must be retained for media recovery.

The need to truncate recovery logs is only one of the factors influencing selection of archive frequencies. The size of recoverable storage, together with the bandwidth of the I/O path to the backup storage, controls how long an archive dump takes. Availability requirements may force fuzzy dumps to be performed during normal operations. Fuzzy dumps take longer to create than snapshot dumps, which are taken with normal transaction processing suspended, and the amount of log that must be retained for media recovery is greater because additional log is written as the dump takes place. Other availability constraints might require more frequent archive dumps to reduce the amount of time for media recovery (and the amount of log processed during media recovery).¹⁶

Transaction processing systems with very large recoverable storage will have tape drives dedicated to archive dumps. These big systems will probably use local logging as well. Workstations, and processors in multiprocessor transaction systems that use distributed logging, will probably use a network archive service. Workstations will probably dump their recoverable storage to remote servers at night or at other times when they are not being used for transaction processing. Other applications (like multiprocessor transaction systems) might share network bandwidth with distributed logging to do continuous archive dumps.

A recovery manager indicates to the log manager, using the Archive call, that it will not need log records other than for media recovery. Logs are truncated using the Truncate call. Methods for calculating archive and truncation points vary with the recovery algorithms used and with the archive and checkpointing strategies.

The criterion for log truncation is to eliminate log data not needed for media recovery, and knowledge of the media recovery algorithm and archive policies is necessary to determine where to truncate the log. For archive strategies that shut down a transaction processing system to take a media dump, the truncation point for media recovery is the same as the truncation point for node recovery for the non-fuzzy checkpoint taken immediately prior to shutdown. For fuzzy strategies the node archive point as of the checkpoint prior to the start of the fuzzy dump is a lower bound on the amount of log that must be retained to recover from media failure. Careful analysis of a particular algorithm might provide a tighter bound.

¹⁶If media recovery time is a serious enough criterion, it will be cost effective to mirror the disks for recoverable storage, as Tandem systems do [Gray 86].

3.5.1.3. Compression

Log compression is the process of reorganizing the recovery log to eliminate information that is not needed for recovery and to make recovery processing more efficient. The term log compression refers to the elimination of information that is not needed for crash or media recovery like undo information for committed transactions or values that have been superseded by subsequent updates to the same locations. Gray refers to log compression as change accumulation in his description of the process [Gray 78]. There are a number of log compression steps that vary in cost, degree of compression, and recovery speed improvement. Log compression techniques are most appropriately performed when logs are spooled to offline storage in anticipation of media failure recovery. The most complete log compression schemes do a lot of processing to produce a compressed log that is only read after a media failure. This expense is appropriate for a large system that needs very fast media recovery.

The first step in compressing a recovery log is eliminating unnecessary redo and undo information. If the archive copy of recoverable storage only reflects the actions of committed transactions,¹⁷ then undo information for committed transactions is unnecessary and can be eliminated from the log along with all records for aborted transactions. If the archive copy might contain the effects of aborted transactions, then the undo data for those transactions must be retained.

The second step in log compression is eliminating superseded data values. Value oriented logs are sorted by object identifier (recoverable storage address for physical logs) and then by decreasing time (newest first) for records for the same object. Only the most recent value for each object is retained in the final compressed log. The compressed log is then sorted by recoverable storage address, which hopefully corresponds closely to a linear disk address space, so that the compressed log can be merged with the archive dump using a single sequential scan. Logs for recovery algorithms that permit logging of arbitrary overlapping byte ranges, and logs for operation oriented recovery schemes, which contain log records describing operations like increments on counters, can not be compressed this simply.
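These two steps can be sketched for a value oriented physical log (a toy example with invented record fields, not a production change accumulation utility): sort by object identifier, break ties newest first, and keep only the first record seen for each object.

    #include <stdio.h>
    #include <stdlib.h>

    struct log_record {
        long object;          /* recoverable storage address */
        long lsn;             /* higher LSN means newer */
        long value;
    };

    /* Sort by object identifier, and within an object by decreasing LSN. */
    static int cmp(const void *a, const void *b)
    {
        const struct log_record *x = a, *y = b;

        if (x->object != y->object)
            return x->object < y->object ? -1 : 1;
        return x->lsn > y->lsn ? -1 : (x->lsn < y->lsn);
    }

    int main(void)
    {
        struct log_record log[] = {
            { 40, 101, 7 }, { 12, 102, 3 }, { 40, 105, 9 }, { 12, 110, 4 },
        };
        int n = (int)(sizeof log / sizeof log[0]);
        int i;

        qsort(log, (size_t)n, sizeof log[0], cmp);
        for (i = 0; i < n; i++)
            if (i == 0 || log[i].object != log[i - 1].object)   /* newest survives */
                printf("object %ld: value %ld (LSN %ld)\n",
                       log[i].object, log[i].value, log[i].lsn);
        return 0;
    }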

Log compression appears to be a function that should be performed at log servers, which are assumed to have the tape drives and disk space necessary to manipulate the log in large volumes. Also, the network traffic needed to transfer the log back to clients is avoided by compressing the log at servers. However, log compression is logically a function that should be performed at log clients because it requires knowledge of recovery algorithms and the ability to interpret log records. Section 6.2.2 discusses possible future research that would allow client programs for log compression to run in log servers.

¹⁷Gray suggests merging a fuzzy dump with the recovery log generated while it is taken to create a sharp dump [Gray 78].

3.5.2. Policy Alternatives

Distributed log space management policies consist of selections of server controlled mechanisms, client controlled mechanisms, and possibly log compression. The designers of a distributed log service directly select the server controlled mechanisms and whether log compression will be supported. These choices dictate what client controlled log space mechanisms need to be available.

The server design choice with the largest impact is whether to provide servers that have limited storage capacity and must occasionally request truncations or deny service, or servers that have effectively infinite storage. Servers attempting to provide infinite storage probably spool to offline storage, but this is not a requirement. Servers with finite storage can use just online storage, or a combination of online and offline storage.

Log servers with a finite storage model force their clients to be able to truncate recovery logs on request. This implies that clients must use checkpoints and hot page flushing (which are needed to limit crash recovery time in any case), and must be able to move old log records for long prepared transactions (either forward in the log or to other storage). Log clients also need access to archive dump services, so that they can dump recoverable storage and truncate their media recovery logs on request.

Log compression is a special mechanism that requires close cooperation of log clients and log servers for an efficient implementation at log servers. Clients could read their logs from servers in order to compress the logs by themselves. If a server is to perform compression on client logs it must be able to interpret and rewrite the log records.

3.5.3. Comparison Criteria

The suitability of different log space management policies depends on the type of use to which they are put. For an overall evaluation a model of the frequency of different operations is needed. The costs of different policies include the amount of storage and other resources required at servers, the processing that servers and clients must perform, operation costs, and the complexity of the mechanisms. The Camelot DLF uses an excessively simple log space management policy because operators and offline storage were not available for the log servers. Performance measures include recovery speed, which is partly determined by whether recovery operations can be performed with online log data.

3.5.3.1. Use Models

Three different uses of distributed logs are affected by log space management policy. Crash and media recovery are discussed frequently. Normal transaction processing is the most common activity using a recovery log and its performance is also affected by log space management policy.

Crash recovery is the most common activity for a distributed log after writing log data and reading logs for individual transaction aborts. The amount of log required for crash recovery depends on the frequency of checkpoints, and other aspects of log truncation policy, including the handling of long running transactions.

Media recovery is a somewhat frequent (about once every three years per disk drive) event for most transaction systems. The time to do media recovery depends on the frequency of archive dumps, the type of archive dump taken and its granularity, and on whether the log is compressed to accelerate media recovery.

3.5.3.2. Costs

Costs of different log server space management policies can be measured in online storage requirements, processing costs in servers and clients, the complexity of the algorithms that must be implemented and tested, and in the cost of operating servers.

Online storage requirements are generally greatest for servers that do not spool and provide an infinite storage model. All servers that do not spool data to offline storage should have substantial online storage so that they do not need to force their clients to frequently truncate logs. Non-spooling servers do not need tape drives, extra disk arms or staging disks. A non-spooling log server must be prepared to store as much as an entire day's log for each of its clients. If a server has 10 active clients at any time, each executing about 10 debit/credit transactions per second, then it will need over seven gigabytes of online storage to hold all of the log generated each day. This is feasible but expensive. Of course, smaller servers could handle fewer clients or clients that archived more frequently. Log servers that spool data offline should have enough online storage to permit clients to do crash recovery from online storage without requiring excessively frequent Archive calls. How much storage is needed is a function of the client's optimal checkpoint intervals, which is in turn a function of the clients' main memory buffer sizes and other parameters.
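The figure follows from a back-of-the-envelope calculation; the sketch below assumes roughly one kilobyte of log per debit/credit transaction, which is an assumption made for this example rather than a measured value.

    #include <stdio.h>

    int main(void)
    {
        double clients = 10, tps = 10, bytes_per_txn = 1000, seconds_per_day = 86400;
        double bytes_per_day = clients * tps * bytes_per_txn * seconds_per_day;

        /* Prints about 8.6 gigabytes per day under these assumptions. */
        printf("about %.1f gigabytes of log per day\n", bytes_per_day / 1e9);
        return 0;
    }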

Processing overhead is higher for log servers that spool data from magnetic disk to tape. More server processing resources can be used for the useful functions of receiving new log records or retrieving old records if the server is operator-less or uses optical disks. On the other hand, client loads are increased because they must frequently make archive dumps in order to truncate their logs. The overhead for doing log compression, particularly for data stored on a log server, seems very high, and the alternative of mirroring all data storage is viable.

Spooling log servers need human (or robot) operators. How often the operator must mount a tape or swap optical disks depends on the clients' logging rates and the speed of transfers to offline storage. A server with optical disks could have low operator service requirements.

3.5.3.3. Performance

Log space management policies affect the performance of normal transaction processing by adding overheads at servers and clients. Server overheads for spooling log data can slow responses for log forces and reads unless priority disk scheduling is used, or spooled data comes from separate disk arms. Clients that are forced to truncate frequently will have less time for transaction execution.

Recovery speed is affected mainly by whether the log space management policy permits node or media recovery to use exclusively online log data or whether data must be staged from offline storage. Non-spooling log servers guarantee that both node and media recovery use online data. Clients using infinite storage log servers will typically use Archive calls and set truncation policies so that node recovery can use only online log data while media recovery might require staging log data.

3.6. Load Assignment

Distributed logging is an example of a network service that can be provided by a number of servers. In such a situation the problem of assigning a dynamically varying client load to a varying number of servers arises. Load assignment in distributed systems is an important and complex problem in its own right, and a complete treatment of it here is infeasible. This discussion gives cursory treatment to the general issues and concentrates on the special load assignment issues raised by distributed logging.

One of the unique aspects of distributed logging, compared with a general load assignment problem, is that loads on log servers change in well understood and predictable ways. A model of how logging loads change provides a basis for comparing different load assignment policies and suggests the development of special load assignment mechanisms and policies.

Comparatively small changes in loads or logging resources occur for a number of reasons. A client system might increase its transaction processing rate. If several clients do this at once the effect on server loads will be considerable. A single client aborting one long running transaction can cause a noticeable change in a log server's load because the disk seeks for reading records of the aborted transaction will limit the disk bandwidth available for writing. A server that does spooling will reduce its available logging capacity when it starts transferring data to offline storage.

A client's biggest effect on logging load occurs when it crashes and restarts. While the client is down it does no logging; then when it restarts its node recovery procedure will read log data in quantities that depend on its recovery algorithms and checkpoint strategies. Depending on the recovery algorithm, the client might write log records during recovery as well. The sequential reading activity performed by a client during restart is a different load than the logging being carried out by other operational clients.

A log server's crash causes the biggest change in the logging resources provided by a system. If distributed replicated or hybrid representations for the log are being used, then the clients that were formerly using the failed server must be distributed among the remaining resources. Clients of a failed mirrored log server must wait for it to restart. When a log server restarts there must be a way of informing clients of the availability of new logging resources and distributing them evenly over the new set of servers.

This section continues with a description of the mechanisms used for load assignment in distributed logging. Then, policies that can be constructed using the mechanisms are discussed. The Camelot DLF uses a mostly static assignment of clients to log servers. When they are started, log clients are passed an ordered list of log server nodes. Normally, log clients send log data to the first servers on the list and switch to other servers only when failures are detected.

3.6.1. Mechanisms

There are two classes of mechanisms needed for implementing a load assignment policy. The first class of mechanisms are means of assessing the loads experienced by log servers and offered by log clients. The second class of mechanisms are ways of assigning clients to servers.

3.6.1.1. Load Assessment

Measuring or predicting the load on a distributed logging service can be done in a variety of ways. Two assessment mechanisms, instrumenting servers to measure their loads and instrumenting clients to measure server response times, can be used by a load assignment policy for any service. Semantic knowledge of the operations of distributed logs and their patterns of use can be used to predict logging loads.

Server measurements of CPU, disk, and network interface utilization are a good indication of the load on a particular log server. Interpretation of CPU and network interface utilization is straightforward. Simple disk utilization information does not convey all needed load information because disks used for reading log records for a single client might have less idle time than a disk that is being written to by many clients. These measurements assess the load on a server, but they do not identify which clients are generating the load. To obtain this information, which is needed to determine how to reassign clients, message or operation counts should be kept for each client.

Client measurement of the time required for a server to respond to requests is among the simplest load measures to implement. By itself, a measurement of server response time provides very little information on which to base load assignment policies. If response time measurements of similar operations on different servers are collected and compared, then there is a basis for distributed resource allocation. Client measurements of their logging volume can also be input to a load assignment policy.

The use of operations in distributed logging is fairly well known. When a client restarts, it is easy to predict that read operations will predominate. Similarly, a particular client executing a known mix of repetitive transactions will generate a predictable volume of log data. Such semantic data is more predictive of logging loads than measurements and operation counts.

3.6.1.2. Load Assignment Mechanisms

Load assignment mechanisms are the means by which logging loads are directed to servers. These mechanisms are separate from the policies that they implement. Clients can be assigned statically to servers, or they can be assigned by external inputs (either from log servers or some other coordinator), or clients can assign themselves to log servers.

Static assignment is the simplest mechanism. Clients of mirrored log servers use a static assignment scheme since they can not change log servers.¹⁸ Clients of distributed replicated and hybrid log servers can use a nearly static scheme in which an administrator assigns preferred servers and clients select from an ordered list of alternates when necessary. Administrators making static assignments of clients to servers should take the average loads generated by clients into account and attempt to balance them.

If log clients incorporate a message interface that allows a remote service to assign the client to a particular log server or servers, then assignments can be made by servers or some other coordinator. Server directed load assignment mechanisms are useful when a server needs to shed load for some reason. A client directed load assignment mechanism is desirable in case a load assignment coordinator is unavailable.

Client directed load assignment is the most robust mechanism because it does not rely on a coordinator or log server. Clients need mechanisms to locate available log servers in order to assign themselves to servers.

¹⁸In fact, mirrored log clients can be reassigned to a new server with considerable difficulty. The mechanism for reassigning a mirrored log client requires a careful shutdown of the transaction processing system so that no log data is required for restarting the client. Then, the client is restarted with a new empty log.

3.6.2. Policies

Mechanisms for assessing logging loads and assigning clients to particular log servers can be combined to implement specific load assignment policies. There can be a wide variety of load assignment policies, but the discussion here is limited to a survey of the general classes. Some distributed log systems will use simple static assignments while others will allow clients to select their log servers. An alternative is to have log servers select their clients. Both client and server directed load assignment policies make local decisions about a global set of resources. Instead of local optimization, global policies based on (possibly incomplete) knowledge of the status of all servers and clients can be implemented either at a central server, or using consensus protocols between either clients or servers.

Mirrored representations must use static load assignment policies. Static policies estimate each transaction processing system's logging volume and assign clients to servers to divide the logging load evenly among available resources.

Clients of hybrid and distributed replicated logs might be statically assigned to a small number of servers from which they choose randomly. Administrators of small distributed log services might choose to designate one server as a backup and use the backup server for other work normally. In this case clients would be statically assigned to other servers and switch to the backup server only when primary servers fail. A mechanism for informing clients of primary server restarts is necessary to completely implement a backup assignment policy.

Clients of hybrid and distributed replicated logs can make local decisions to distribute their load among servers. These decisions can be based on local resource and load assessment measurements like response times and timeouts, and on reception of restart broadcasts from servers. Completely local server selection can lead to oscillating behavior as all clients switch to the most lightly loaded server when one fails. A degree of randomness or other distribution (like selecting a new server on the basis of the client's node identifier) can alleviate this problem.
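One cheap way to get such a spread (a sketch with an arbitrary hash; any reasonable function of the node identifier would do) is to start each client's search of the server list at an offset derived from its node identifier:

    #include <stdio.h>

    /* Pick a starting server for this client by hashing its node identifier,
       so that the clients of a failed server do not all move to the same node. */
    static int choose_server(unsigned long node_id, int n_servers)
    {
        unsigned long h = node_id * 2654435761UL;   /* Knuth-style multiplicative hash */
        return (int)(h % (unsigned long)n_servers);
    }

    int main(void)
    {
        int n_servers = 3;
        unsigned long node;

        for (node = 1001; node <= 1008; node++)
            printf("client node %lu starts with server %d\n",
                   node, choose_server(node, n_servers));
        return 0;
    }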

Servers can attempt to even loads by assigning their clients to other servers. This sort of load assignment is well suited to situations where one of several clients increases its load on a server, as happens when a client starts to do node recovery. This is another form of local optimization, and so some form of randomization or arbitrary selection is needed to smooth out the response of the assignment policy.

The most complicated load assignment policies are likely to be based on complete knowledge of the loads being generated by all clients and the resources available to all servers. With such knowledge, a fair and balanced assignment of clients to servers can be computed. No individual server or client can collect the information needed to perform such an assignment, but clients and servers can inform an assignment service of their measurements. Alternatively, a consensus protocol can be used to propagate the load information to all clients (or all log servers) and they can make collective load assignment decisions.

3.7. Summary (and Perspective)

This section summarizes the design alternatives discussed in this chapter and presents a design methodology for a distributed log service. The previous sections focused on many design details, but this section attempts to put the design issues in perspective. Some design decisions must be made at the outset of a distributed log design, while others, such as load assignment policies, can be changed after a system is in operation provided that the underlying mechanisms are present. In Chapter 4 the decisions made in designing the Camelot Distributed Log Facility are described.

An understanding of the expected use of a distributed logging system is an important aid to making design decisions, since many of the alternatives are better suited to some environments than others. Among the information needed is the reliability and availability desired for the distributed log, the reliability and availability of the systems to be used for log servers, the number of log clients, the logging volume, the type of transactions executed (small, or large updates, local or distributed), and the frequency of client restarts.

The representation of distributed log data is the first choice that must be made because it affects many other details. The reliability and availability of different distributed log representations depend on the characteristics of the underlying components. According to the simplest analysis, the reliability of different representations is essentially the same because they all store log data in two (or more) non-volatile memories. If no special highly available systems are available for server nodes then a distributed replicated log will have greater availability than a mirrored or hybrid log. A distributed replicated log needs three servers for the smallest configuration, but it can grow in smaller increments (one half a log, in effect) than hybrid or mirrored logs. Distributed replicated logs require extra communication, reducing their performance compared to other representations, but it is easier to improve distributed replicated log server performance with a low-latency buffer because non-volatile, rather than stable memory is needed. If high proportions of transactions will update data on multiple log clients, then the optimized commit protocols possible with a mirrored or hybrid log will improve transaction system performance.

If the security requirements for the network logging service necessitate encryption based authentication (with or without the encryption of entire messages), then this requirement should be included when client/server communications are designed. Less stringent security requirements can be addressed through physical server and network protection, and end-to-end encryption.

The design of the communication for network logging should be heavily influenced by the transport protocols and related technology available on the systems being used for distributed log servers and clients. In general, the highest level communication service with performance suitable for distributed logging should be used. If the expected logging load is predominantly short transactions instead of long update transactions and frequent node restarts, then an RPC based interface can be used; otherwise an interface based on stream protocols or the channel model may be needed. The interface to a server for a distributed replicated log may need to use low-level datagrams to exploit multicast and streaming, and still achieve fast response for forces.

The first decision to be made when designing a log server data representation is whether a low-latency non-volatile or stable buffer will be used. A buffer should be used if possible, especially on servers for distributed replicated logs where the buffer can be implemented using commercially available uninterruptible power supplies. Single data stream allocation should be used if there is no low-latency buffer. If the log server will not spool data to offline storage, then partitioned space should be used (with a low-latency buffer), so that a single log client can not deny service to other clients by failing to truncate its log. If a low-latency buffer is used and the log server will spool data to offline storage, then both single data stream and partitioned space representations are likely to have good performance. The use of append-forests or entry-map indexes for single data stream representation facilitates the use of write-once optical storage.

Log servers that spool to offline storage must have operations staffs or automated offline storage libraries. These resources are expensive, but they are probably cheaper than online storage sufficient to provide an infinite storage model to clients. If log servers provide only a finite storage model, then clients must be able to truncate their logs on demand and service denial is possible.

A distributed logging service using a distributed replicated or hybrid representation that is expected to involve more than two or three log servers in an installation should include mechanisms to support flexible load assignment policies. Mirrored distributed logs and installations with few servers can use static assignments. The mechanisms required for implementing load assignment policies include ways of assigning clients to servers, measurements of server loads and loads generated by clients, and a language or interface for describing a policy.

Chapter 4 The Design of the Camelot Distributed Log Facility

This chapter describes the distributed logging facility implemented for the Camelot Distributed Transaction Facility. The distributed log facility is part of Camelot Release 1.0 and is the preferred logging facility when Camelot runs on workstation nodes.

Camelot is a facility that extends the Mach operating system kernel by providing support for general purpose distributed transactions. Mach is an experimental operating system kernel developed at Carnegie Mellon. Mach is derived from and object code compatible with Berkeley Unix version 4.3 [Accetta et al. 86]. Mach is not modified by Camelot, but Camelot uses some special facilities of Mach, most notably the Mach external pager interface [Young et al. 87].

This chapter continues with a description of decisions made in the design of the Camelot Distributed Log Facility (DLF). This description follows the outline of design issues given in Chapter 3. The next section presents the details of the Camelot DLF's communication protocols and interfaces. The distributed log client and its interfaces to the rest of the Camelot system are described in Section 4.3. The final section in this chapter describes the Camelot log server design.

This chapter concentrates on descriptions of the Camelot DLF design. Justifications for specific design decisions are based on considerations made when the design was developed. Evaluations of the Camelot DLF are given in Chapters 5 and 6.

4.1. Camelot Distributed Log Design Decisions

This section examines each of the design issues discussed in Chapter 3 and describes the alternative chosen for Camelot in each case. Justification is given for each decision. Evaluations of the design of the Camelot DLF are given in Chapter 6. This section is a summary of the design choices identified in Chapter 3.

4.1.1. Distributed Log Representation

Camelot uses distributed replication (Section 3.1.2) as its representation for distributed log data. The primary reason for this choice is that distributed replication, unlike mirroring or hybrid representations, achieves high availability for the distributed log service without using special high availability server nodes. An additional justification for use of a distributed representation is the potential for a log service based on distributed replication to grow in smaller increments than a log using a mirrored or hybrid representation. Use of distributed replication also provided an opportunity for further research in voted replication algorithms.

The Camelot DLF generalizes the distributed log replication algorithm presented in Section 3.1.2 to permit more than one log write operation to be in progress at once. A fixed limit of log records (50 in the implementation) can be sent to log server without waiting for acknowledgments. When the log client restarts it copies this maximum number of unacknowledged records and writes that many invalidating records.

The log servers used for the Camelot DLF are also the unique identifier generator state representatives used to produce the time-ordered unique identifiers required for epoch numbers in the log replication algorithm.

4.1.2. Communication

None of the high level communication paradigms described in Section 3.2 is perfectly suited to the implementation of a log service based on distributed replication. RPC and multicast RPC protocols do not provide the data streaming that is desirable when long update transactions are executed. Conventional RPC protocols, stream protocols, and the channel model do not support parallel operations on multiple servers that are essential for good performance with distributed replication. Because of these drawbacks, Camelot uses specially designed message protocols for communication between clients and log servers.

Camelot's logging protocols could be described as RPC-influenced because they were designed by starting with an RPC interface design and modifying the interface to support the multi-record operations, streaming, and parallelism that were deemed essential for efficient network logging. The resulting protocols do not resemble RPCs very much. Rather than the datagram oriented transport that many RPC protocols use, the Camelot log protocols use a connection-oriented transport protocol for flow control and multi-packet message delivery. The Camelot log communication interface is described in detail in Section 4.2. 103

4.1.3. Security

The Camelot DLF makes no provisions for providing security. The system could be used with a security policy based entirely on physical protection. This would require establishing a physically secure operating environment and procedures for enforcing this physical security.

End-to-end encryption or encryption-based protocols are not used for security in the Camelot DLF partially because hardware support for encryption is not available on the computers used for clients and servers.

4.1.4. Server Data Representation

A single data stream is used for the disk representation of data in servers in the Camelot DLF. Magnetic disks are used for the implementation, though the system could use optical write-once storage. An append forest index structure was planned to support access to records by LSN, but the simpler range searching approach was implemented. The servers use uninterruptible power supplies to implement non-volatilevirtual memory buffers.

The single data stream representation was chosen because it could have better disk utilization than partitioned space, particularly for write-once optical storage where the write head might move relatively slowly. Also, there was research interest in experimenting with append-only storage systems. The append-forest was the planned index structure for the Camelot DLF and range searching was implemented as an interim access structure. Range search was simpler to implement and the performance of range search was expected to be acceptable for testing and other initial applications for the Camelot DLF. The use of UPSs for low-latency non-volatile storage was motivated by the simplicity and portability of the non-volatile storage system. A battery-backup non-volatile memory board must be redesigned for every bus architecture and requires special operating system support to map the memory into the address space of the log server and to prevent the memory's contents from being erased during restart. A UPS based non-volatile memory avoids these complications. Section 4.4.3 describes how servers operate with a UPS.

4.1.5. Log Space Management

Camelot log servers have only online storage and provide a finite amount of log storage. The servers reuse logging disk blocks when all clients with records on a disk block truncate their on-line space. Clients take checkpoints, copy long running transaction records forward in the log, and make archive dumps limit their use of on line log space. When it starts run low on disk space, a log server sends messages to clients requesting that they truncate their online log space. The request messages include a target truncation LSN. Clients perform truncation and 104 send truncation messages to all log servers storing some of their log records either in response to a truncation request from a server or according to their own criteria. A log server periodically re-sends truncation requests as long as its free space is below a threshold amount. When the amount of free space on a log server falls below a small limit, it suspends receiving new log data (though it continues to process read requests).

The log servers use only online storage because operators or automated offline storage were not expected for all Camelot DLF installations. The limited disk storage available for practical implementations required the finite storage model.

4.1.6. Load Assignment

The Camelot DLF uses a mostly static assignment of clients to log servers. When they are started, log clients are passed an ordered list of log server nodes. Normally, log clients send log data to the first servers on the list and switch to other serves only when failures are detected. This load assignment policy accommodates the Carnegie Mellon installation of the Camelot DLF where there are two primary log servers and one backup log server. The backup server is normally used used for other functions in addition to logging service.

4.2. Communication Design

There are two separate components to the communication interface between log clients and servers. The low level component is the connection-oriented transport protocol that is used to move messages between clients and server nodes. The high level component is the message protocols that are used to implement the distributed log replication algorithms. This section describes both components in bottom up order.

4.2.1. Transport Protocol

The Camelot DLF's transport protocol is used for flow control and multi-packet message delivery only. It does not perform retransmission for error recovery, although it does detect and discard out of sequence and duplicate packets. The message level protocols retry after a timeout when messages are discarded because of errors detected by the transport protocol.

The messages transmitted by the transport protocol are called chains because they are chains of packet buffers. Routines are provided for putting data in chains and extracting data from chains. An interface for sending a single packet chain without first allocating and packing the chain is also provided.

The transport is layered on the internet protocol suite user datagram protocol (IP/UDP). UDP 105 packets are sent and received using a special packet filter interface that is part of the Mach operating system. The packet filter mechanism allows IP/UDP packets to be sent by sending a Mach message containing the packet to a special kernel port. Packets are received by first passing a port to the kernel in the NetIPCListen call [Baron et al. 87], together with a packet filter specifying the Ie and UDP header field values that the desired packet should contain. The kernel then sends packets received from the net with those header values to the port passed in the Net IPCListen call. The transport protocol uses a specially designated UDP port.

Connection initiation uses a timer based subprotocol [Watson 81b]. Flow control uses a moving window strategy and flow control allocations are on a per packet basis.

4.2.2. Message Protocols

There are eleven different messages that pass from clients to log servers and nine different messages that are sent from log servers to clients. Most of the messages exchanged are treated as remote procedure calls and returns. The major exception to this is the special protocol for log writes described in Section 4.2.2.3. A similar protocol is used for the log copying that goes on at client restart time. Messages containing log data have a special format and are packed according to the rules given in Section 4.2.2.1.

4,2.2.1. Data Message Packing and ReadLog Buffering

The messages that contain log data (CopyLog, WriteLog, ForceLog, and ReadLogReply) all have essentially the same layout as shown in Figure 4-1. The first field in each of these messages is a message identifier that consists of a message type and a client identifier (the client identifier is omitted from the ReadLogReply message). The next fields are the high and low LSNs of the group of log records contained in the message. The third field in the message is the epoch number for the log records. The fourth field is the offset within the message of an array of offsets of individual log records. After this header information come the log records themselves, and then the array of offsets. This message layout allows the records to be packed into the messages as they are presented (for example in log Write calls). The offsets array must be assembled on the side and copied into the message just before it is sent.

The log data message packing policy is to pack as many records as will fit into a single network packet into one message. Exceptions to this rule are log forces, copy log operations when the records remaining to be copied total less than a full packet, and read log operations when a log server does not store a full packet of log records contiguously with the requested log record. In the exception cases, less than a full packet of log records will be sent. Application of the packing policy results either in single packet messages containing as many records as fit, or multi-packet messages containing one log record that spans multiple network packets. 106

Message Identifier

Client Identifier

Low LSN Message Header_ High LSN

Epoch Number

Indexes Offset

Log Records"

First Record Offset

Second Record Offset _"

Record Offsets "

Last Record Offset G--- --

End of Last Record Offset G---

Figure 4-1: Log Data Message (WriteLog, ReadLogReply, etc) Layout

To support efficient backward scans log servers pack ReadLogReply messages with as many contiguous proceeding log records as will fit in a single network packet, if the log servers store that many records contiguously with the requested record. The Camelot recovery algorithms do not use forward log scans. All log accesses are either direct by LSN or, as part of a backward log scan. The Camelot log interface (described in Section 4.3.2.1 supports direct access and backward scans through the LD_Read call, which returns the LSN of the preceding log record along with the requested record.

4.2.2.2. RPC Subprotocols

Most of the messages in the distributed logging protocols are request and response pairs that constitute remote procedure calls executed on log servers. These messages pairs include ReadEpoch and ReadEpochReply, WriteEpoch and WriteEpochReply, KequestIntervals and Intervals, CopyLog and CopyLogReply, and InstallCopies and InstallCopiesReply messages. These message pairs are not programmed as synchronous RPCs. Instead, the client sends a request message and then waits in a receive 107 loop until a response is received. The client can process other messages, like

RequestTruncate, while waiting for a reply. Except for the ReadLog and ReadLogReply pair, the RPC subprotocol pairs are executed as parallel RPCs that execute simultaneously on different servers.

The log truncation messages, Truncate and RequestTruncate, are one-way RPCs that are executed on the log client and log server, respectively. ReadEpoch/ReadEpochReply RPCs read the unique identifier generator state from log servers. The ReadEpoch messages contain a client identifier. The ReadEpochReply messages contain the epoch number stored at the log server for the client. Writ eEpoch/Writ eEpochReply RPCs write the unique identifier generator state at log servers. The WriteEpoch messages contain a client identifier and a new epoch number. The WriteEpochReply messages contain the epoch number from the WriteEpoch message.

Reque st Interval s/Int erval s RPCs read interval lists from a log servers. The RequestIntervals messages contain a client identifier. The Intervals messages contain a count, and an array of records containing low and high LSNs, and epoch numbers for each interval of contiguous LSNs that a log server stores for the client.

ReadLog/ReadLogReply RPCs read log records from a log server. The ReadLog messages contain a client identifier, a LSN, and an epoch number. The ReadLogReply messages contain an epoch number, high and low LSNs and log data. The layout of a ReadLogReply message is shown in Figure 4-1. CopyLog/CopyLogReply RPCs are used to copy log records during client restart (the CopyLog subprotocol is described in Section 4.2.2.4). The CopyLog messages contain a client identifier, an epoch number, high and low LSNs, and log data. The layout of a CopyLog message is shown in Figure 4-1. The CopyLogReply messages contain an epoch number, and high and low LSNs.

Inst allCopie s/I nst allCopie sKeply RPCs are used to terminate the copy log subprotocol (described in Section 4.2.2.4). number. The InstallCopies messages contain a client identifier and an epoch number. The InstallCopiesReply messages contain an epoch number. Truncate messages contain a client identifier and an LSN and are used as one-way RPCs to truncate a client's log at a log server. The message indicates that a log server may discard all records with LSNs less than the LSN contained in the message.

RequestTruncate messages are used as one-way RPCs to request that a client truncate its log. These messages ware sent periodically when a log server's free disk space is less than some threshold. The messages contain a target LSN for the client's log truncation. 108

4.2.2.3. WriteLog Subprotocol

The WriteLog subprotocol supports both streaming of large amounts of log data and fast forces. The subprotocol automatically detects lost messages and establishes state information for the distributed replication algorithm when a client switches log servers.

The WriteLog subprotocol uses the following messages that are sent from clients to log servers:

WriteLog messages contain a client identifier, an epoch number, high and low LSNs and log data. The layout of a WriteLog message is shown in Figure 4-1. The log server sends a MissingLSNs message in response to this message if some log records might be missing because of a lost message. Otherwise, the server stores the log data contained in the message and no response is sent. ForceLog messages contain a client identifier, an epoch number, high and log log sequence numbers and log data. The layout of a ForceLog message is shown in Figure 4-1. The log server sends a MissingLSNs message in response to this message if some log records might be missing because of a lost message. Otherwise, the server stores the log data contained in the message and a NewHighLSN message is sent in response. NewInterval messages contain a client identifier, and an LSN. They are sent by a log client in response to a MissingLsns message and inform a log server that some (or all) of the log records it regards as missing are stored at other log servers and so the log server should start a new interval of contiguous LSNs. A NewHighLSN message or a MissingLSNs message is sent in response. ForceLSN messages contain a client identifier and an LSN. The message requests an indication of whether a log server stores a record with an LSN equal to or greater than the LSN in the message. If the log server stores such a record, a NewHighLSN message is sent in response. Otherwise, a MissingLSNs messageissentinresponse.

The WriteLog subprotocol uses the following messages that are sent from log servers to clients:

NewHighLSN messages are sent in response to ForceLog, ForceLSN, and NewInterval messages. They contain the high non-volatile LSN for at a server for the client they are sent to.

MissingLSNs messages are sent in response to WriteLog, ForceLog, ForceLSN, and NewInterval messages. The messages contain low and high LSN delimiting a range of LSNs that the server believes should be (re)sent to it. The subprotocol relies on two pieces of state information at a log server. The first is the high stored LSN. When a log server restarts (actually when a client first contacts a server) the high stored LSN is set to the highest LSN stored at the server for the client. The high stored LSN is subsequently updated as described below. The second state information is the high received LSN. The high received LSN is originally set to the value of the high stored LSN and is updated as described below.

Normally, a log client spools log data to servers using WriteLog messages. WriteLog messages can be sent at any rate without waiting for acknowledgments because the flow control 109 mechanisms in the transport protocol prevent a client from overrunning a log server. Unless a message is lost, a client sends a ForceLog message for one of two reasons. First, the transaction processing system might issue an LD_ForceLSN call the LSN of a log record that the log client is buffering. Second, while processing an LD_WriteLSN call, the log client might detect that it has reached the maximum number of log records that are permitted to be written, but unacknowledged (50 in current implementations of the Camelot DLF). In either case all buffered records are sent in a ForceLog message.

ForceLog messages are (logically) synchronous. After sending a ForceLog message, a client sends no additional messages until a reply is received or a timeout occurs. If a timeout occurs, the client sends a ForceLSN message to any log servers that have not responded. If too many timeouts occur, a log client chooses a new server and sends a ForceLSN message to the new server. A client also sends a ForceLSN message when a LD_ForceLSN call requests the log client to force a log record that has been sent to servers in a WriteLog message but has not been acknowledged by some servers.

Normally, when a log server receives a WriteLog message, the LSNs in the message are contiguous with the high received LSN and the log server simply adds the records to its storage and updates its high stored LSN and high received LSN to be the high LSN in the WriteLog message. A ForceLog message with LSNs that are contiguous with previously received data is processed the same way, and a NewHighLSN message with the new high stored LSN (the high LSN in the ForceLog message) is sent in reply. If a ForceLSN message is received with a LSN that is less than or equal to the high stored LSN for a client, the servers responds with a NewHighLSN message with the high stored LSN for the client.

When a log server receives a WriteLog or ForceLog message with data that is not contiguous with the high stored LSN for a client, or it receives a ForceLSN message for a LSN that is greater than the high stored LSN, the missing log write subprotocol is invoked. First, the log server updates the high received LSN for the client if the high LSN in the WriteLog or ForceLog, or the LSN in the ForceLSN message is greater the high received LSN. Other than updating the high received LSN, data in WriteLog and ForceLog messages is discarded. The log server then sends a MissingLSNs message to the client for the range from one greater than the high stored LSN through the high received LSN for the client.

A MissingLSNs message implicitly acknowledges all LSNs less than the low LSNs in its range, and the first thing a log client does when it receives a MissingLSNs message is to update its record of what LSNs the server has acknowledged. Next, the log client checks to see whether some or all of the log records that the server is missing have been acknowledged by a write quorum of other log servers. In this case the client sends the log server a NewInterval message with the highest LSN acknowledged by a write quorum of log servers. A NewInterval message is often sent when a client switches log servers. If a NewInterval message is not 110

sent, then the log client sends all or some of the missing log records to the server in a ForceLog message. The message packing policy described in Section 4.2.2.1 is followed when determining how many of the missing log records to send.

When a log server receives a NewInterval message it updates its high stored LSN for the client to be the LSN in the message. If the high received LSN is still greater than the high stored LSN for the client, the log server sends another MissingLSNs message. Otherwise, the log server sends a NewHighLSN message with the new high stored LSN.

When a log server receives a ForceLog message while its high received LSN is greater than its high stored LSN for the client it processes the message in the same way as any normal ForceLog message except that a NewHighLSN response is sent only if the high received LSN and high stored LSN are equal after processing the ForceLog message. Otherwise a

MissingLSNs message is sent in reply.

4.2.2.4. CopyLog Subprotocol

Section 4.1.1 explains that Camelot DLF uses distributed replication for log representation, but modifies the algorithm given in Section 3.1.2 to allow more than one log record to be buffered or sent to log servers but not acknowledged. This means that during client restart a number of log records must be copied. This copying is done using a special subprotocol involving the

ReadLog, ReadLogReply, CopyLog, CopyLogKeply, InstallCopies, and InstallCopiesReply messages.

Log records to be copied are obtained from a log server storing them by sending ReadLog messages and receiving ReadLogReply messages. After enough log records to fill a CopyLog message are obtained, a CopyLog message is sent to a write quorum of log servers and then CopyLogReply message are received from a write quorum. This process is repeated until all records to be copied have been acknowledged by a write quorum.

The InstallCopies message is intended to allow a log server using append-only storage to reduce the overhead of copying records by buffering the index information for many records and then installing that information all at once. This is useful because the order in which records are copied violates the assumptions under which index structures like append-forests are designed. Installing the index data for all copied records at once reduces the overhead. 111

4.3. Log Client Structure

The client side of the Camelot DLF is a module that is interchangeable with a local disk logging module. This section first provides background for the description of the network logger by describing the architecture of a Camelot node. Then, the logging interface implemented by both the local and network logger is presented. The design of the local logging module is given for perspective. Finally, the program structure of the network logger is described.

4.3.1. Camelot Architecture

Camelot extends the Mach operating system kernel to include support for distributed transaction processing. Camelot supports transaction processing but does not itself provide a database or transactional file system. Instead, Camelot users implement processes called data servers that encapsulate permanent recoverable data. Operations on data, servers are performed using remote procedure calls from other data servers or from application processes that use Camelot facilities for initiating and terminating transactions and for invoking operations on data servers, but not for implementing recoverable objects. Programming Camelot data servers and applications is facilitated by the Camelot Library [Spector and Swedlow 88].

Figure 4-2 shows the process (tasks in Mach terminology) architecture of a node running Camelot. In addition to the Mach kernel there are seven processes that comprise the Camelot system for a node.

Two Camelot system processes are involved in starting the Camelot system. The process named camelot is a configuration program. This program obtains Camelot system configuration information from a file or from an interactive user dialog. Configuration information is passed as parameters to the Camelot master control process which starts other Camelot system processes and redirects their error output to a file.

The disk manager process implements the recoverable virtual memory that Camelot data server processes use to store permanent data. The disk manager is a Mach external pager process that implement virtual memory objects that are mapped in data server address spaces. When a data server faults on a recoverable page, the kernel sends a message to the disk manager requesting the data for the page, which the disk manager reads from disk. The kernel sends data from pages it is cleaning to the disk manager and the disk manager writes the data to disk after it has ensured that the all log records referring to the page have been forced. The disk manager cooperates with the recovery manager during transaction aborts and node recovery. To limit the amount of time required for node recovery, the disk manger writes periodic checkpoints to the log and forces the contents of hot recoverable pages to disk. The log implementation (either local or network) is part of the disk manager. 112

Config. • • • Applicatio Application

Server • • • Server __ Recoverable _ Data J I Data _

ServerNode __ Processes

Recovery Transaction Manager Manager ___ C_elot System

Disk | I Components

Manage_ L°g I Co=unicatiOnManager

Master Camelot Control

Mach Kernel

Figure 4-2: Camelot Processes

The communication manager forwards inter-node Mach messages and provides name resolution services used for locating data servers and other resources. These functions are performed by the standard Mach message server process. The Camelot communication manager extends the Mach functions by providing logical and physical clocks and by examining message headers to record a list of all participants in each transaction that spans nodes.

The recovery manager process performs transaction abort, crash, and media recovery processing. When local logging is used, the recovery manager process contains a read only version of the Camelot logging module (see Section 4.3.2.2). When distributed logging is used, the recovery manager uses a message interface to the network logger in the disk manager process (Sections 4,3.2.1 and 4.3.2.3).

The transaction manager process coordinates the initiation, commitment, and aborting of local and distributed transactions.

The node server is a distinguished data server process that is the repository of configuration data for a node. This information includes the mapping of recoverable segments to disk 113 addresses. The node server is recovered separately from other servers during an initial recovery pass. New data servers are added to a Camelot node, and other administrative functions are performed using the node configuration application, which is a client program for the node server.

4.3.2. Camelot's Local and Network Loggers

Camelot provides two implementations of the same log interface. The local logger stores the log on one or two disks that are accessed through the (raw) Unix file system. The network logger stores the log data on remote log servers using the algorithms identified in Section 4.1. This section first describes the Camelot log interface and then explains the implementation of both loggers.

Sections 2.2.2.3 and 2.2.2.4 discussed how a recovery log can be shared by multiple recovery managers with a common transaction manager. In such an environment the log is logically a separate component. The architectural choice of where to place a recovery manager is decided by access patterns and the cost of cross protection domain calls. For example, in a multiple recovery manager environment, the log might be implemented in the same process as transaction management.

Camelot Release 1 supports only one recovery manager (whose function is divided between the disk manager and recovery manager processes). Because of this, the logger is integrated with the disk manager, which is the process that makes the majority of log calls. Overall logging performance is improved because the disk manager is the only process that directly writes log data. The transaction manager uses a separate interface to write the log through the disk manager process. When the disk manager receives a transaction manager log record it ensures that its own log data is written to the log first.

The recovery manager is the only Camelot process that reads the log. During a recovery pass, the transaction manager's log records are replayed to it by the recovery manager. For network logging, the recovery manger uses a message interface to the log implementation in the disk manager for reading log records. For local logging, the recovery manger contains a read only implementation of the local logger that accesses the logging disk directly. An message interface allows the disk manager to send its log buffers to the read only logger in the recovery manager so that the read only logger can read unforced log records.

4.3.2.1. The Camelot Log Interface

There are two components to the Camelot log interface. The first is a set of calls that the disk manager and recovery manager make on the log. Procedures in this interface begin with LD . The second is a call that the log makes on the disk manager, which begins with the DL_ prefix. The first parameter to every call is a Mach port. This parameter is only used in calls made by the recovery manager, where the port is used by the RPC system as the address for a call message. 114

routine LD_OpenLog( idPort : port t; readOnly : bool_an_t) ;

The LD OpenLog call initializes the logger process. The local logger implementation opens the loggingdiskfiles and locatesthe end ofthe log. Then_tworkIoo0_rimpl_m_ntmion_x_out_ _h_ node restart procedure for the distributed log replication algorithm. The recovery manager uses the second parameter to open the local log in read-only mode.

routine LD Write( idPort : port t; recPtr : pointer_t; forceWhenFull : boolean t; newTruncPoint : isn t; OUT isn : isn t; OUT bufferSent : boolean t; OUT reqTruncPoint : isn_t);

The LD_Write call writes a log record. The LSN assigned to the log record is returned as an output parameter. This call is used only by the disk manager. Records are normally buffered in volatile storage by this call, but this call includes buffering control so that full log pages can be written to disk (or fewer messages sent to the log servers) when there are additional log records available to be included in the log force. After the disk manager writes a log record that is to be forced (a commit record, for example) the disk manager can continue processing and write additional records to the log (to immediately force the log record the disk manager issues the LD_ForceLsn call). The additional records are written with the forceWhenFull flag set on, and if the log buffer is full, the logger will force the buffer to stable storage before the returning from the LD_Write call with the bufferSent flag turned on. The bufferSent flag indicates that previously buffered records have been forced, not the record written in the call. The call includes log truncation parameters The newTruncPoint parameter is the lowest LSN that needs to be kept in the log after the record being written is in stable storage. The logger drives truncation by setting the reqTruncPoint output parameter to the LSN that the logger wants for the next truncation point.

routine LD GatherWrite( idPort : port t; ioVec : cam _o vec t; forceWhenFull : boolean t; newTruncPoint : isn t; OUT isn : isn t; OUT bufferSent : boolean t; OUT reqTruncPoint : isn_t) ;

The LD GatherWrite call writes a log record by gathering together a number of pieces. The ioVec parameteris a nullterminatedvectorof pointersand lengthsof parts of a log recordthat are to be copied into the log buffer to form a contiguous record. Even though this routine has a ldPort parameter, it is not a message interface and is only used by the disk manager to call the logger implementation that is integrated with the disk manager process. 115

routine LD ForceLsn( idPort : port t; isn : isn_t) ;

The LD_ForceLsn call immediately forces the log up to the LSN specified by the Isn parameter. The call returns when the log is stable at least up to to lsn.

routine LD HighStable( idPort : port_t; OUT isn : isnt);

The LD_HighStable call returns the highest LSN for records in stable storage. This call is used to determine the starting point for crash recovery.

routine LD_HighWritten ( ldPort : port_t ; OUT lsn : isn_t) ;

The LD_HighWritten call returns the highest LSN for records written to the log. Any difference in the results of the LD_HighWritten and LD_HighStable calls will be because of log records that have been written but not forced to stable storage. The LD_HighWritten call is used to start a transaction abort.

routine LD_Read( idPort : port_t; Isn : isn t; OUT prevLsn : isn t; OUT recPtr : pointer_t);

The LD_Read call reads a record given an LSN. The call returns a pointer to the record and the record's length. The call also returns the LSN of the previous record.

routine LD SetReserve( IdPort : port_t; reserveInBytes : u int) ;

The LD_SetReserve call instructs the local logger to resewe the specified amount of space in the log. When the amount of used log space approaches the specified minimum, the logger should begin to make DL_RequestTruncate calls.

simpleroutine DL RequestTruncate( dlPort : port t targetLsn : isn_t) ;

The DL_RequestTruncate call instructs the disk manager to truncate the log. The network logger issues this call to force a truncation when a node is idle and not writing the log. 116

4.3.2.2. The Local Logger

The local logger implements recovery logs using either simplex or duplex (mirrored) disk storage. This section describes the local logger to provide a counterpoint for the description of the network logger and log servers. Also, Chapter 5 compares the performance of the local and network loggers. A description of the local logger is a prerequisite to explaining the performance differences.

The disk storage used by the local loggers is accessed with standard Unix file operations. Files used for the log are either ordinary buffered files or unbuffered raw disk partition files (files in the /dev/ directory). If buffered files are used for the log, the local logger must use the fsync call to force buffered log writes to disk. Use of fsync calls is typically much slower than writing directly to unbuffered raw disk partition files.

Duplex recovery logs are implemented by parallel writes to two disk files. Parallel writes are performed by separate threads (lightweight processes) in the logger. The separate threads are required because Unix does not support asynchronous disk I/O operations. Thread dispatching adds processing time to duplex log writes and the total average latency of duplex log forces may be more than twice the average latency of simplex log forces because contention in operating system drivers and disk controllers can serialize the writes even though they are initiated in parallel threads.

The logger uses disk space in blocks (sectors) that are the unit of atomic transfer between main memory and the disk. For most systems on which Camelot runs this size is 512 bytes. Disk writes will be initiated for single blocks or for groups of contiguous blocks. The algorithms for finding the last written sector do not require that multi-block transfers to be entirely completed if a crash occurs. The local logger writes consecutive disk blocks starting with the first one in the logging file. Each log block is written at most once in any pass over the log file and log blocks are reused for new log data only after log records contained on the blocks are truncated or spooled offline. Each block contains a pass number that is incremented when a block is reused. At restart time the end of the log is found by locating (with a binary search) the block with the largest pass number that is farthest from the beginning of the log file.

The 64 bit LSNs assigned by the local logger have two components. The first component is a 32 bit pass number from the log file block that contains the first part of the log record. The second component is the 32 bit byte offset from the beginning of the log file of the start of the log record. To read a record the logger uses the byte offset in the LSN to seek to the block containing the record and verifies that the block's pass number matches the pass number in the LSN. 117

4.3.2.3. The Network Logger

The network logging module implements the logging interface from Section 4.3.2.1 using the distributed tog replication algorithm from Section 3.1.2. The network logger is the client side of the communication interface described in Section 4.2. The log servers to be used by the network logger are determined at system configuration time and when Camelot is started the names of nodes running log servers are passed to the network logger as command line parameters.

As described in Section 4.3.2 the network logger executes as part of the Camelot disk manager and exports an interface to both the disk manager (for writing the log) and the recovery manager (for reading the log). The network logger is structured as a single monitor with a single mutex variable, ld.Hutex, controlling access. There is a disk manager thread dedicated to servicing RPCs from the recovery manager's log interface and processing asynchronous messages (like RequestTruncate) from log servers. Routines in the LD_ interface acquire .-LdMutex at the beginning of their execution (and release ld.Mutex before exiting) whether they are executed by the server loop thread processing RPCs from the recovery manager or by other disk manager threads calling the interface directly. The server loop thread acquires ldHutex before processing messages from log servers.

There are three major software modules in the logger. The lowest level is the transport protocol module that is common to both the log client and log server. The next level is a client state module that encapsulates the state of the higher level logging algorithm for each client. Among other functions the client state module translates transport protocol connection identifiers to server identifiers, reopens transport connections when they fail, lists what LSNs are stored by what servers, and lists records have been sent to servers but not acknowledged. The logger interface module implements all LD_ routines, and is responsible for log record buffering, message packing, and implementation of the log replication algorithm using the information in the client state module. The logger interface module assigns consecutive LSNs by incrementing an internal high LSN variable, which is initialized during the LD_OpenLog call.

The logger interface module maintains two tog record buffers. The first is the log tail. The log tail contains the most recently written log records, including all records that have been written but not acknowledged by enough log servers. The tail is used to retransmit log records and as a cache for SD_Read calls. The tail can contain records that have been acknowledged by a write quorum if the tail uses less space than a maximum length that is set when when the log client is compiled. The second buffer maintained by the logger is a cache of the most recently received ReadLogReply message. This cache enables some sequential log reads to be satisfied without messages to servers. 118

4.4. Log Server Design

Serversfor the CamelotDistributedLog Facilityare speciallyequippednodeswith largedisk partitions dedicated for log data and uninterruptible power supplies to implement non-volatile virtual memory. Mach is the operating system for the servers. The log server is a user level process with special privileges for accessing the logging disk as a raw device and for Mach net access.

This section first describes the thread (lightweight process) structure of the log server and then examines the main memory and disk data structures implemented by a log server. Section 4.4.3 explains how uninterruptible power supplies are used to make a log server's virtual memory non-volatile.

4.4.1. Log Server Threads

The log server is a single Mach task with two threads of control to permit the logging disk to be written asynchronously while new log data is being received in the other thread. An additional thread to process log reads was designed but not implemented.

The main log server thread performs all log server functions except for disk writes. These functions include server initialization, transport protocol and higher level message processing, packing received log records into disk buffers, maintenance of index and other main memory structures, locating log data on disk (or in a main memory buffer) when a ReadLog message is received, and checkpointing the log server's state to disk during idle times and when a power failure occurs. Log writes are acknowledged as soon as log data is buffered in main memory.

The track writer thread's sole function is to write buffered log data to disk. A separate thread is used for this function so that the synchronous Unix write calls do not delay the main thread while it is receiving log data. Buffers of log data to be written to disk are passed from the main thread to the track writer in a main memory queue. The track writer and main thread synchronize their access to the queue using mutual exclusion semaphores and condition variables.

The log read thread was intended to permit the main thread to continue receiving new log data while the log read thread executed synchronous disk read operations to satisfy a ReadLog message. The thread was not included in the log server's initial implementation because of the difficulty of synchronizing the log read thread's activity with other threads and because use of buffering and hints eliminated many disk accesses for reads. 119

4.4.2. Log Server Data Structures

Diskstorageis the only permanentnon-volatilestorageon a logserver. Alldata necessaryfor restartinga logserver hasa diskrepresentation.Forclient logrecords,the diskrepresentationis the only one that the log server manipulates. Other information, like the interval lists describing what log records are stored for each client, is normally manipulated in a main memory data structure and is checkpointed19to disk when a power failure occurs.

4.4.2.1. Main Memory Structures

Two main memory data structures are used by log servers. Both structures are hash tables indexed by client identifier and there is some redundancy between the two tables. The first table, called the client information table, is accessed by the log server module that implements the server side of the logging message protocols. There is no disk representation for the client information table. The second table, called the interval lists table, is accessed by the disk storage component of the log server. The interval lists table is written to disk during a server checkpoint.

client Id

¢i

highStableLSN

highReceivectLSN

Figure 4-3: Client Information Table Record

Figure 4-3 shows a record from the client information table. The table is keyed by the clientId field in the record, which is the identifier of a client of the log server. There is an entry in the table for each client that the log server has communicated with since the server was restarted. The ci field is the most recently used transport layer connection for the client. The other fields are the high stored and high received LSNs for the client that are used by the logging message protocols.

client Id

epochNumber

iList

Figure 4-4: Interval Lists Table Record

The interval lists table contains an entry for each client that has data stored on a log server. Figure 4-4 shows the record format for the table, which is keyed by the client identifier in the ciientId field. The epochNu.mber field in the record is the state for the replicated unique

lgHere, checkpoint is a copy of the main memory structures on disk, not the checkpoints that transaction systems create to bound recovery time. 120

epochNumber

IowLSN

highLSN

lowBlockAddress

highBlockAddress

nextPointer

Figure 4-5: Interval List Element Record identifier generator used by the tog replication algorithm. The ReadEpoch and WriteEpoch messages request access to this value for a client. The iList field is the root of the interval list for the client.

An interval list is a singly linked list of records describing the intervals records with contiguous LSNs stored for a client. Figure 4-5 shows the record format for an element in the list. Each element in an interval list describes a disjoint interval of contiguous LSNs from lowLsn to highLsn. All log records in an interval have the same epoch number as given by the epochNumber field. The address of the first and last disk blocks containing records from an interval are given in the lowBlockAddress and highBlockAddress fields. Interval lists are ordered with the higher ranges of LSNs first.

4.4.2.2. Disk Data Structures

The log server uses disk block consisting of many sectors to permit more efficient disk transfers. It is intended that log server disk blocks correspond to tracks on the logging disk, but some disk interfaces do not support this correspondence because disk sectors with adjacent addresses may not be adjacent on disk. In any case, the use of a large block size does amortize the overhead for operating system calls and initiating I/O.

Use of a multi-sector disk block raises the issue of block integrity. All sectors might not be written to disk if a block write is interrupted by a failure. Such an error would not be caught by the sector oriented error detection built into a disk drive. A block checksum could be used to address this problem. The implementation of the Camelot log server does not use checksums, so the Camelot log server is vulnerable to failures during writes.

Each disk block begins with the logical byte address of the block in the log servers disk space. The address is the position of the block in the (logically) append-only disk space for the log server Disk blocks are assigned new logical addresses as space is reused. The physical address of a block is its logical address modulo the size of the logging space. Immediately after the block address is a one byte code that determines the type of block. There are five block types. The zero block type code indicates an unused block, so the disk space can be written with zeros to initialize the logger. The other block types are data blocks, data continuation blocks, checkpoint header blocks and checkpoint continuation blocks. 121

client Id

epochNumber

IowLSN

highLSNof f set

Figure 4-6: Data BlockRecord Group Identifier Record

Data and data continuationblockscontain log recordsfrom (potentially)multipleclients. The difference betweenthe two blocktypes is that the first recordon a data continuationblockis the continuationof a recordsthat spans multipleblocks. Data blocklayoutsare designedto allow log recordsto be copied directlyintodisk buffers as they are received. Recordsare storedin groups of upto 256 that are from the same client,have the same epochnumber,and have a contiguous range of LSNs. The records in a groupshare the commongroupidentifierinformationshown in Figure 4-6. The groupidentifier layoutcompressesthe LSN informationfor a set of log records and allows the space for client identifiers,epoch numbersand LSNs to be amortized over many logrecords. The range of LSNs in a groupis indicatedin the groupidentifierby a base LSN and the integeroffset of the highestLSN in the group. Immediatelyfollowingthe block type field on data and data continuationblocksis an eight bit count of the numberof groupson the blockand the sixteenbitoffsetof the first ofthe groupidentifiers.

The layoutof data and data continuationblocksis shownin Figure4-7. All recordsare packed contiguouslyafter the block header information. The group identifiersfollow records,and the record index informationfollowsthe group identifiers. Record index informationconsistsof a recordlength andthe offset inthe blockof each record. There is a blockof indexinformationfor eachgroup identifierand the blocksare storedin the same orderas the groupidentifiers. Within a block,the offsetinformationis in LSNorder. When a recordspansmultipleblocks,the lengthof the record given in the first blockis the length of the entire record (the amount of the record storedin the blockis foundby subtractingits startingoffsetfrom the groupidentifieroffset inthe disk block header). The lengthfor the first record appearingon a data continuationblock is the lengthof that portionof the recordonly.

Checkpointand checkpointcontinuationblocksare writtenwhen a log servercheckpointsitself before shutting down (for example, because of a line power failure, see Section 4.4.3). Checkpointinformationconsistsof one checkpointblock followed by zero or more checkpoint continuationblocks. A checkpointcontainsenough informationto rebuildthe intervallisttable in main memory. Besides the blocktype, the headers of checkpointand checkpointcontinuation blockscontaintwo counts,the countof interval listelements (from the listsfor all clients)in the entirecheckpoint,and the countof elementson that block(Figure4-9. The intervallistelements, as shownin Figure4-8 containthe informationfrom the listelementsin the intervalliststable and 122

blockAddress

blockType Block Header groupIdCount

groupldsOffset O < < <

Log Records

_ <

Group Identifiers

length offset e--

length offset O

length offset O Record Indexes

length offset O

length offset O i!

Figure 4-7: Data Block Layout

clientId

clientEpochNumber

IowLSN

highLsn

intervalEpoch

Figure 4-8: Checkpoint Interval List Record in addition each element contains the client identifier and the unique identifier state (epochNu_'Lber field) from the interval list table header. This redundant structure was chosen because it simplifies the generation and processing of checkpoints. The counts in the checkpoint headers are used to verify that a multi-block checkpoint was completely wri_en. 123

blockAddress

blockType Block Header checkpointCount

blockCount

Intervals .I Figure 4-9: Checkpoint Block Layout

4.4.3. Uninterruptible Power Supply Operations

The uninterruptible power supplies used for Camelot log servers have a built in RS232 interface to the microprocessor that controls and monitors the power supply. The RS232 interface is a remote panel facility that provide status information and to controls the power supply's operation. For Camelot, the power supply's RS232 interface is connected to a processor that runs a monitor process.

When a log server starts it sends a message to the process monitoring its power supply. This message contains a log server port to which the monitor process sends a shutdown request message to when the line power to the UPS fails. A log server thread is dedicated to waiting for message on this signal port. When a shutdown message is received, the signal handling thread sets a flag. The flag is noticed by the main log server thread after normal message handling and the log server flushes disk buffers, checkpoints itself, and exits. The UPS monitor detects the log server shutdown when it receives a port death notification for the signal port. At this point, the UPS monitor can shutdown the UPS if necessary. 124 125

Chapter 5 Performance of the Camelot Distributed Logging Facility

Part of the thesis defended in this dissertation states that tog services can be efficiently provided by a network service. The performance evaluation presented in this chapter demonstrates the efficiency of the Camelot DLF. There are specific objectives for the performance evaluation of the Camelot DLF.

The first performance evaluation objective is to demonstrate the operation of the Camelot DLF. Although the performance evaluation experiments might not duplicate the conditions the DLF would encounter in actual operation, they do represent more use for the facility than ad hoc testing provides. Successful completion of performance experiments also increases confidence in the facilities correctness, availability, and reliability.

The second objective is to characterize the performance of the Camelot DLF in meaningful terms. This includes absolute measures of latency and throughput for realistic workloads. In addition, latency comparisons with the Camelot local logging module can show the relative performance of the facility. The efficiency of distributed logging is partially demonstrated by the Camelot DLF performance characterization. Distributed logging can be considered efficient if two criteria are met. First, the latency of distributed logging should be competitive with local logging. Second, the throughput capacity of log server should be sufficient to permit the resources used for log servers to be amortized over a number of clients.

The third objective is to identify the costs of different components of the distributed logging implementation. This includes the identifying of what components contribute most to latency and what resources limit throughput.

The fourth objective, which is closely related to the third, is to be able to predict the performance of other distributed logging systems. This includes predicting how the Camelot DLF scales or performs with different processors or improved low level facilities like faster message passing. The performance evaluation should also be useful for predicting the performance of other distributed log facilities with different hardware or software components.

The next section, on performance evaluation methodology, describes the measurements used 126 to satisfy these performance evaluation objectives. Section 5.2 specifies the experiments conducted to generate these measurements. Section 5.3 presents and discusses the results.

5.1. Methodology

There are two classes of performance measurements relevant to the evaluation of the Camelot DLF. The first is the latency of various log operations and latency comparisons for transactions using either the local or distributed log. The second is the throughput capacity of the distributed log facility. The Camelot local log is not intended to be shared among multiple systems, so measurements of local log throughput are not relevant to an evaluation of the Camelot DLF.

5.1.1. Latency Experiments

This section describes experiments for comparing the latency of local and distributed logging by measuring the performance of logging primitives by themselves and by measuring latencies of entire Camelot operations and transactions. Complete system comparisons demonstrate the relative importance of logging on overall performance and reflect the effects of contention (for example, for disk controllers). Measurements of logging components in isolation are more easily related to other system configurations and to the latency component analysis. Logging latency comparisons using entire Camelot operations and transaction are made using a simple but flexible performance analysis tool called the Camelot Performance Analyzer (CPA) [Eppinger 87]. The industry standard debit/credit benchmark is also used for comparisons of entire Camelot systems and for direct comparisons of logging latencies. The debit/credit logging load is analyzed to determine the components of distributed logging latency. This analysis provides details on the latency costs of various components of the Camelot DLF.

5.1.1.1. CPA Tests

The CPA consists of a simple programming language for describing transaction loads to be executed repeatedly and timed, and two processes that are interpreters for the CPA language, CML. The architecture of a Camelot system for CPA tests is shown in Figure 5-1. The cpa process is a Camelot application program that interprets CML programs and invokes operations on cpaserver processes, cpaserver processes can also interpret CML programs that are sent to them by cpa applications or other cpaserver processes. CPA tests can be run with the application and server processes on two or more different nodes, but for log performance tests only one machine was used. Figure 5-2 shows a CML program used for one Camelot DLF performance test.

A cpaserver implements a recoverable object that is an array of bytes. CML language primitives include operations (RPCs) on servers that read or write ranges of recoverable data. 127

cpaserverll I cpaserver2

I cpaserver31

Camelot System Processes

I Mach Kernel I

Figure 5-1: CPA Test Architecture

The operations provided by CML include reading and writing recoverable storage, beginning and ending (possibly nested) transactions, and aborting transactions. Parameters control the amount of data read or written by an operation and whether the accesses cause paging. Latencies of individual operations can be obtained using regressions on the times of transactions that contain different numbers of operations. Similarly, commit latencies, can be derived from regressions of transactions that access different numbers of data servers. Commit latencies are expressed as a constant delay, plus a delay that is the product of a constant term and the number of local data servers participating in the transaction.

The results of the CPA tests are operation and commit latencies for Camelot running with local simplex, local duplex, distributed simplex, and distributed duplex logs. These results are used to compare the different log configurations. Differences in write operation latencies reflect differences in overheads for buffering log data and streaming log data to the logging disk or log servers. Differences in commit latencies reflect differences in log force times for the different loggers. 128

5.1.1.2. Debit/Credit Tests

Unlike the CPA tests, the debit_credit test is a standard benchmark for transaction processing systems [Anonymous et al. 85]. The debit_credit transaction modifies one hundred byte account, branch, and teller records, and appends a fifty byte record to a history file. A typical implementation will cache the branch and teller records in main memory. The account record is chosen randomly from among one million accounts, so it is fetched from disk. The history file is an intentional bottleneck because concurrent transactions append to the sequential historyfile.

The benchmark specifies that fifteen percent of the transactions are distributed. Pausch interprets this to mean that fifteen percent of transactions write branch and teller records on one node and write account and history file records on a second node [Pausch 88]. The benchmark also specifies that X.25 protocols are to be used for communication between terminals and the transaction processor, and that the implementation is supposed to include presentation services for integrating different kinds of terminals. The Camelot implementation of debit/credit addresses these requirements as described in Section 5.2.2.2.

Debit_Credit integrates more Camelot function than most CPA tests. In particular the combination of main memory access for some data, and sequential or random paging for other data is difficult to construct with a CPA program. The data paging performed by a debit_credit can result in disk or controller contention that further slows local logging. As result of these considerations, distributed logging might compare more favorably with local logging for debit/credit transactions than it does in some CPA tests.

A debit_credit transaction logs more data than the simplest write transactions, but it is still a very short transaction. Debit/Credit transactions are a good measure of log performance for forcing relatively small amounts of data and they make a good basis for study the latency of log forces in detail. Comparisons of the latencies of the logging components of debit/credit transactions for different local and distributed logging configurations are a first step in this study.

The results of the debit/credit tests will be transaction rates (in debit/credits per second) for local and distributed logging configurations of Camelot. The latency of the logging component of the debit/credit transaction will also be reported for various local and distributed loggers.

5.1.1.3. Debit/Credit Latency Breakdown Tests

Detailed analysis of debit/credit transaction logging explains the contributions that different components of the distributed logging implementation make to latency. This information can suggest what components deserve the most attention when tuning the system for improved performance. It also provides evidence about the relative cost of different design decisions (the log replication algorithm, for example). 129

Analyzing the latency components of logging for debit credit transactions is a challenge because it is difficult to precisely profile a distributed computation involving asynchronous I/O and parallel actions at multiple nodes. Eliminating parallelism by profiling simplex logging helps somewhat. The result of the debit/credit latency breakdown tests are a profile that breaks down the latency according into contributions from various components such as low level communication, transport protocols, logging algorithm processing at the client, message packing, and data copying. This result conveys information necessary for understanding how design and implementation decisions affect logging latency.

5.1.2.ThroughputExperiments

Throughput evaluation is performed to determine the total capacity of a distributed logging service. It is expected that maximum throughput will be achieved with more than one client for the service. This is because distributed logging latency is greater than the time that the servers are busy when there is only a single client.

Debit/Credit is a good benchmark transaction for a throughput test because it is the standard benchmark for high performance transaction processing systems so it provides perspective on the performance range of distributed log systems. Compared with longer transactions, debit/credit places more stress on server processors and network interfaces because it transmits data in shorter packets and requires more frequent responses from servers.

The results of throughput experiments using debit/credit on the Camelot DLF will be determine the maximum throughput (in debit/credits per second) of a distributed logging system, independent from transaction processing components other than logging. Other results will be the number of clients at which maximum throughput is reached, and resource utilization, including client and server CPU use, network bandwidth and disk use. The results should allow predictions of how throughput will scale as resources such as faster CPU, better network interfaces, and additional servers are added to a system using distributed logging.

5.2. Experiments

This section supplies details about the experiments used to evaluate the performance of the Camelot DLF. First, Section 5.2.1 describes the hardware and software environment used for the experiments. Section 5.2.2 explains the the experiments for evaluating latency and Section 5.2.3 explains the throughput evaluation experiment. 130

5.2.1. Experimental Environment

IBM RT PC workstations running Mach were used as clients and servers for Camelot DLF performance evaluation experiments. The RT PC is based on a RISC processor[IBM Corporation 85]. Two versions of the RT PC were used because insufficient numbers of a single model were available The original RT PCs (referred to simply as RT PCs) were used for distributed log servers and clients. Newer models of the RT PCs (referred to as RT APCs) were used for distributed log clients and for local logging tests. The processors in the RT APCs are about twice as fast as the processors in the RT PCs. Experiments that compared local and distributed logging always ran Camelot on the same machine type.

Log servers had six and eight megabytes of main memory. Client machines had main memory sizes ranging from six to twelve megabytes. There was no evidence that variances in main memory sizes affected test results.

Each log server used a 140 megabyte partition of a 400 megabyte IBM 9332 SCSI disk. Local logging used fourteen megabyte partitions of IBM and Micropolis 70 megabyte disks. Differences in disk speed are not directly relevant to comparisons of local and network logging because the log servers implemented non-volatile virtual memory using Sun 3 KVA uninterruptible power supplies.

All machines were connected by the CMU Computer Science Department's 4 megabit per second token ring. Most tests were conducted at times of low expected network load (as a courtesy to other department members). Normal variations in network load had no observable effects on performance results.

Version M6 of Mach was used on both clients and servers for all tests except for debit/credit transaction execution. These tests used an experimental version of Mach with support for external pagers (approximately version XM19). The external pager kernel was important for good performance when paging recoverable data but otherwise had no effect on the test results.

Camelot versions 0.98(49) and 0.98(50) were used for these experiments. There were no performance related differences in the two versions of the system.

5.2.2. Latency Experiments

Latency experiments consist of the CPA tests, the debit/credit test, and the debit/credit latency breakdown test as described below. 131

5.2.2.1. CPA Tests

Figures 5-2 to 5-6 show some of the CML programs that were run to provide inputs for regressions giving operation and transaction latencies for different logging configurations. The programs were run on an RT APC with local simplex, local duplex, distributed simplex, and distributed duplex logging. Each test program executes fifty identical transactions and reports the time for the entire test divided by fifty as the transaction time for the test. Each test is executed ten times for each logging configuration and the median test results are used as input to the regressions.

do (50) begin-top-level (nv, standard, two-phased) do (i) read ("cpaserverl ", small, none) end end end Figure 5-2: One Small Read from 1 Server CML Transaction

The simplest CML transaction used for log performance evaluation is shown in Figure 5-2. This program executes the same transaction 50 times. Each transaction performs a RPC that reads 32 bytes bytes from a data server named cpaserverl, which is running on the same machine as the cpa application program. The parameter small in the read statement in the CML program indicates that the read is of a small (32 byte) quantity. The parameter none indicates that subsequent reads are not to cause paging of recoverable data. In this test each transaction reads the same 32 bytes from the server.

do (50) begin-top-level (nv, standard, two-phased) do (10) read ("cpaserverl", small, none) end end end Figure 5-3: Ten Small Reads from 1 Server CML Transaction

do (50) begin-top-level (nv, standard, two-phased) do (1) read ("cpaserverl", large, none) end end end

Figure 5-4: One Large Read from 1 Server CML Transaction

The regression that computes the time for 32 byte read operations subtracts the transaction latency for the CML transaction in Figure 5-2 from the latency for the 10 small read CML transaction in Figure 5-3 and divides the difference by nine. Latencies for 1024 byte read 132 operations, and 32 and 1024 byte write operations are derived similarly using CML programs with the keyword write substituted for read. do (50) begin-top-level (nv,standard,two-phased) do (10) read ("cpaserverl ", large, none ) end end end Figure 5-5: Ten Large Reads from 1 Server CML Transaction do (50) begin-top-level (nv, standard, two-phased) do (1) read ("cpaserverl ", small, none) read ("cpaserver2", small, none) read ("cpaserver3", small, none ) end end end Figure 5-6: One Small Readfrom 3 Servers CML Transaction

The CML transaction shown in Figure 5-6 is used to derive a formula for read transaction commit latency. This transaction performs one read operation on each of three local data servers. To produce a formula for transaction commit latency, read operation times are subtracted from the transaction times for the one and three data server transactions shown in Figures 5-2 and 5-6. The time remaining after read operations in the single data server transaction is subtracted from the time remaining in the three server transaction and the difference is divided by two to determine the overhead each data server adds to commit time. The base transaction commit latency is obtained by subtracting the data server overhead and operation time from the transaction times. Commitment latencies for write transactions are obtained in the same way using an analogous CML program.

5.2.2.2. Debit/Credit Tests

Tests to determine the effect of distributed logging latency on debit/credit benchmark performance used the Camelot implementation of the debit/credit benchmark [Spector et al. 88, Pausch 88] The benchmark was run with local simplex logging and with duplex network logging. It was not possible to run the test with local duplex logging because there were not enough disks on the R'I-APC workstation used for the test.

The debig/credit benchmark was programmed by Pausch as part of his dissertation research [Pausch 88]. The test uses two data servers and one application process. The application process, Terminal, initiates debit/credit transactions (in any number of parallel threads) by sending RPCs to the FrontEnd data server. The FrontEnd data sewer encapsulates no recoverable data. FrontEnd merely begins a transaction and forwards the RPC to the Bank 133 data server (the FrontEnd data server is in the benchmark implementation to approximate the overhead of presentation services, as the benchmark specification requires). The Bank data server executes the debit/credit transaction.

The Bank data server contains branch, teller, and account record arrays in recoverable storage. Branch, teller, and account records are determined by hashing the account number in the input message form the Terminal process. The history file is represented as an array of history records in recoverable storage and a recoverable end-of-file pointer. Updates to the end-of-file pointer would be a bottle neck if it were locked exclusively by a debit/credit transaction. Instead, the end-of-file pointer is updated by a lazy nested top-level transaction that commits and releases its locks immediately. The commit record for the lazy nest top-level transaction is not forced to stable storage but spooled so that it is guaranteed to be forced before the debit/credit transaction's commit.

Normally, for a test in closer compliance with the benchmark specification, the Terminal process(es) would run on different machines from the FrontEnd and Bank processes. Also, 15 percent of the transactions would involve a RPC from the Bank data server on one machine to the Bank data server on another machine. Insufficient numbers of machines were available to run such a test. Therefore, all processes ran on one machine and there were no distributed transactions.

Before executing a debit/credit test, a separate application is run against the Bank Data server to randomly page in as much of the account data as will fit in the processors main memory. This ensures that start up paging transients are not observed during tests. The application permits a varying number of threads to execute debit/credit transactions in parallel. Short tests were run for each logging configuration to determine the multiprogramming level that produced maximum throughput. Then, tests of 1200 transactions were run to determine the average execution rate in transactions per second. The application program tracks the latency of each transaction so that it can verify the benchmark specification's requirement that 90% of transactions execute in less than one second.

A debit/credit logging stub was created to measure debit/credit logging latency directly. First, the debit/credit benchmark was traced. The LD_ interface call sequences and parameters for a debit/credit transaction were determined from the trace. The only significant variations in logging between two debit/credit transactions are in the log records written when the disk manager reads and write pages of recoverable storage. On average the disk manager reads and writes one page for each transaction and the sequence of calls from the trace was adjusted to reflect this.

The sequence of LD interface calls obtained from the trace of the debit/credit benchmark execution was used to construct a stub program that repeatedly executes the log calls made by the benchmark. The stub program can be linked directly with either the local or network logging 134 module to create a test program that exercises the Camelot log without the rest of the Camelot system. The logging stub is used to directly measure the latency of the logging component of the debit/credit transaction (this program is also used for the throughput test described in Section 5.2.3).

5.2.2.3. Debit/Credit Latency Breakdown Test

The debit/credit latency breakdown was measured for simplex logging on RT PCs. The experimental methodology required using the same machine type for both clients and servers, since the CPU contributions to latency would be distorted if different machine types were used. Only simplex logging was measured because it would be very difficult to account for the overlapped components of processing that occur when more than one server is used for logging.

The Unix profiling tool prof is not suitable for directly producing a latency breakdown using the simple debit/credit logging stub for two reasons. First, the tool does not construct profiles for distributed programs. Accurate communication latencies will not be reported. Second, I/O and message waits and the OS scheduling that results, interfere with the accuracy of profiling results. A strategy that combined use of prof and direct measurement of communication latency was used to construct a complete latency breakdown for debit/credit forces.

Communication latency was measured with two programs that sent messages using the same math net interface that the logging transport protocol uses. One program simulated the message traffic of the log client by repeatedly sending a message to the other program that ran on another machine. The second program simulated a log server's message traffic. The server program repeatedly receives the message from the client and sends a reply message. Message sizes were determined from a trace of the distributed log service. This test measured round trip real-time latency, server processing, and client processing for the lowest level communication component of distributed logging.

To measure the latency contributions of other processing in the client and server a new program was constructed that combines the client and server programs in one process. Low level communication was modified to call a slightly modified version of the log server's receive loop when the client send a message. The server's send routine places the message on a main memory queue. The client's receive routine removed message from this queue. Disk I/O calls were commented out of the log server. The resulting program executes the logic of both the client and server sides of simplex distributed logging without the low level communication. The program does no I/O so prof produces an accurate accounting of processing overhead by procedure name.

prof reports time by individual procedure. The prof results for different procedures were combined to report time by module. Some modules were common to both client and server and so the time in those functions could not be attributed specifically. 135

5.2.3. Throughput Experiments

The throughput experiment used the debit/credit logging stub that was also used for the debit/credit logging latency comparisons. Instances of the stub on multiple machines sent log data to two log servers.

The logging stub programs are instrumented to display latencies for groups of one thousand transactions. Latency information from clients is used to determine when steady state response has been reached as new log clients are started during a test. The logging stub also reported client machine CPU use.

The log server is instrumented to measure throughput by recording the times for one thousand completed log force operations. The log server records real time, user state CPU time, and system state CPU time. Statistics are kept in a main memory buffer and are printed and reset when the server receives a special signal message. The signal interval was selected so that throughput data for approximately ten intervals of 1000 forces were collects for each test point.

Tests were made with from one to six client machines. Only three client RT APCs were available for the tests, so some tests were conducted with a mixture of three RT APCs and one or two RT PCs. Additional tests were conducted with from one to six RT PCs.

5.3. Results and Discussion

This section presents and discusses the results of the performance experiments. Results are organized according the the type of experiment. For each experiment, experimental conditions that affect results and conclusions about the experiment's results are discussed. Particular attention is paid to how the results would scale or relate to other systems. Section 5.3.3 summarizes and relates different results.

5.3.1. Latency Experiments

The resultsof the CPA and debit/creditlatency experiments primarilydemonstrate that the performanceof distributedloggingis competitivewith local loggingfor Camelot. The CPA and debit/credittest resultsare consistent. The debit]creditlatencybreakdownresultprovideinsight intothefunctionof the distributedlog. 136

5.3.1.1. CPA Tests

Results (after regressions) of the CPA experiments described in Section 5.2.2.1 are shown in Tables 5-1 to 5-3. Times are in milliseconds. The parameter n in Table 5-3 is the number of local data servers participating in the transaction.

CML Read Operation Latencies on RT APC Simplex Simplex Duplex Duplex Operation Size Local Log Distributed Local Log Distributed 32 Bytes 3.2 3.2 3.2 3.2 1024 Bytes 3.9 3.9 3.9 3.9

Table 5-1: CML Read Operation Times in Milliseconds

CML Write Operation Latencies on RT APC Simplex Simplex Duplex Duplex Operation Size Local Log Distributed Local Log Distributed 32 Bytes 6.1 6.1 5.9 6.1 1024 Bytes 10.1 14.6 12.4 17.8

Table 5-2: CML Write Operation Times in Milliseconds CML Transaction Overheads on RT APC Simplex Simplex Duplex Duplex Transaction Type Local Log Distributed Local Log Distributed Local Read Only 8.4+5.6n 8.4+5.6n 8.4+5.6n 8.4+5.6n Local Write 35.5+4.5n 26.8+9.5n 36.0+9.2n 27.7+13n

Table 5-3: CML Transaction Overhead Times in Milliseconds

The times for read operations in Table 5-1 and read only transactions in Table 5-3 are independent of the type of logging used. This result is not surprising, since read operations and read only transactions do absolutely no logging. The read only results provide confidence in the other results and to illustrate the importance of efficient logging, since the difference in latency between read and write operations or transactions are due to recovery related processing, including logging.

The latency of write operations is independent of the logging method as long as only a small amount of data is modified. When one thousand bytes are modified the average latency of a write operation on a system using distributed logging is more than 40% greater than locally logging the same number of copies of data. This is partially because the local logger can transfer bulk data more efficiently than the distributed log. The ten large write CML transaction logs more 137 than ten thousand bytes of data and the local log transfers it all to disk with one operating system I/O call. The distributed log sends at least 10 messages to transmit the same amount of data to a log server. A better tuned bulk data transfer protocol might improve distributed logging performance.

Write transaction overheads reflect the log force latencies for various amounts of data. Distributed logging is faster than local logging when only one data server is involved, but as additional data servers participate and increase the amount of log data forced at commit time, local logging is faster.

The CPA tests generally indicate that distributed log forces are faster than local log forces in Camelot, while bulk log writing is faster locally. CPA tests are comparably simple and do not involve I/O for paging recoverable data. The tests indicate that the relative performance of distributed logging will decrease as larger more complex transactions are executed.

These tests were run without an external paging kernel, so there was recovery-related activity other than logging. Similar tests on an external paging version of Camelot would tend to accentuate the differences in logging performance. Duplex local logging suffered in relative comparisons because the version of Mach used for the tests does not permit concurrent writes for the disk process. Also, paging of recoverable data would introduce disk or controller contention that would improve the relative performance of distributed logging. The debit/credit tests illustrate some of these differences.

5.3.1.2. Debit/Credit Tests

With local simplex logging an RT APC achieved a throughput of 8.4 transactions per second averaged over 1200 transactions (97% of transactions executed in less than one second). The same machine achieved 7.6 transactions per second with duplex distributed logging (also with 97% less than one second). The local logging test used three application threads while the distributed logging tests had a multiprogramming level of two.

The debit/credit test results were inconclusive with regard to absolute performance of the Camelot system. The greatest transaction rate observed consistently for a relatively long period was 8.4 per second. However with both local and distributed logging, transaction rates in excess of 11 per second were observed for shorter periods. A small number of transactions (less than 1%) are delayed in execution for more than one second, significantly reducing throughput. This delay is probably caused by some problem in Mach or Camelot. In any case, all results indicate that distributed logging of two copies of log data is competitive in performance with local logging of only one copy.

The debit/credit logging stub program was run on an RT APC to measure latency of the logging 138 component of debit/credit with distributed duplex, local simplex, and local duplex logging. Distributed duplex debit/credit logging time is 27 milliseconds, compared with 18 milliseconds for local simplex logging and 35 milliseconds for local duplex logging. Even though the local logger uses parallel threads for disk writes of duplex logs, the writes are apparently serialized by the I/O driver. Similar contention makes distributed duplex logging competitive with local simplex logging for the execution of the debit/credit benchmark, even though latency of local logging is less than distributed logging when there is no contention for the disk controller.

The debit/credit tests measure the logging latency for a single debit/credit transaction committing individually. However, group commit is generally used for debit/credit when transaction system throughputs are greater. These results do not predict the performance of distributed logging for group commit of debit/credit transactions. The relatively poor bulk data transfer performance of distributed logging as indicated by the CPA test results indicates that distributed logging performance for group commit will not be as competitive as it is for individual commits.

5.3.1.3. Debit/Credit Latency Breakdown Test

Distributed simplex logging from an RT PC client to one RT PC server requires 34.1 milliseconds for a debit/credit transaction, as measured with the debit/credit logging stub. Figure 5-4 gives the breakdown of this time. The measured time for debit/credit logging differs from the total time in the latency breakdown by 0.3 milliseconds (about 1%).

Low level communication dominates latency for debit/credit logging, accounting for nearly 60% of the time. Communication time consists of approximately 2.6 milliseconds of system state CPU processing on servers and 4.2 milliseconds of system state time on clients 2°. The remainder of communication time is in I/O, network access and other delays in the network interface, and network transfer. Network transfer time is about 3 milliseconds at 4 megabits per second. I/O latency is a particular problem because the link level protocol used on the token ring has a 17 byte header. This forces one byte per cycle DMA transfers, rather than one word per cycle transfers that the token ring interface can use when data is properly aligned.

The second largest latency component is the 4.3 milliseconds used for client logging calls and algorithm state. This time would be significant if the communication time was reduced to 4 or 5 milliseconds that an RPC mechanism with good performance would use to move the same amount of data [Birrell and Nelson 84]. Much of the 1.7 milliseconds in the client state module could be eliminated by converting routines that simply access variables to macros.

Z°System state time was 2.8 milliseconds on servers during the execution of the communication timing stubs and 2.4 milliseconds during execution of the debit/credit logging stub tests (Client times were 3.9 and 4.5 milliseconds). The differences in these times could be attributed to operating system accounting 139

Component Time (ms) Percent Round Trip Communication 19.5 58%

Client Logging Calls 2.6 8%

Client Algorithm State 1.7 4%

Client Message Packing 1.5 4%

Client Transport 1.0 3%

Server Calls & Algorithm: 0.4 1%

Server Storage Formatting: 0.7 2%

Server Message Packing 0.5 1%

Server Transport: 0.9 3%

Threads Library 2.0 6%

64 Bit Arithmetic 2.0 6%

Bcopy 1.0 3%

Total Latency 33.8

(34.1 Measured)

Table 5-4: Debit/Credit Logging Latency Breakdown

Other components of the distributed logging code that are candidates for in-line expansion are the 64 bit arithmetic routines which are used for calculations on LSNs and log storage addresses, and the threads (lightweight process) library routines. Programming language support is needed to implement these functions most efficiently.

Improvement to the communications component of distributed logging and straight forward code improvements (primarily in-line expansions) could reduce the debit/credit logging times by 14 and 5.5 milliseconds, respectively. The latency of the improved service would be about 14.3 milliseconds, a 58 percent improvement to the measured performance.

This breakdown accurately characterizes the latency of the Camelot DLF for simplex logging, but adjustments for other configurations are tricky. Its generally difficult to analyze latency when parallel execution is involved. Even in the simplex logging case, there is a small amount of parallelism that might occur when the client updates its state information after sending data to a server. To evaluate the latency of parallel communication, a duplex communication client stub was programmed to send messages in parallel to two communication server stubs. The latency 140 for this simulation of the duplex logging communication component was 26 milliseconds. The remainder of the latency should be about the same as for simplex logging, yielding latency of about 40 milliseconds for duplex logging on RT PCs.

5.3.2. Throughput Experiments

Figure5-7 showsthe resultsof throughputexperimentsfor variouscombinationsof RT PC and RT APC clients. The greatestmedianthroughput,achievedwith3 RT APC clientsand 2 RT PC clients was 89.2 debit/credittransactionlog forces per second. Accordingto the benchmark specification,this transactionloggingrate correspondsto a load that would be generated by nearly9000 tellerterminals[Anonymouset al. 85]. A thirdRT PC clientapparentlysaturatesthey system, reducingthroughputto 86.4 transactionsper second. The ratesare mediansof average timesfor one thousandtransactions.The greatestshortterm averagethroughputobservedwas 94.6 transactionsper second. 90

.I "re" f "" " "_r- IBMRT-APCClients f i t_ 80 4-- m -t- IBMRT-PCClients 1 1 //

70 ...... Je

i "° J

3O // // 20 Ii/

0', I I I I I I 1 2 3 4 5 6 Number of Clients

Duplex Logging to Two IBM RT-PC Servers

Figure 5-7: DistributedLogDebiVCreditThroughput

Duringthistest a singleRT APC loggingto twoservers observed latenciesof 26.7 milliseconds. Latencies on RT PCs were greater and latencies increased to 60 milliseconds when there were 141

three RT APCs and three RT PCs logging. When only one RT PC was logging its CPU utilization was 62%. This decreased with additional clients until RT APCs had only 22% CPU utilization when there were six clients.

Network traffic was about one million bits per second at maximum throughput. Server disk usage reached about 20% of the maximum streaming rate as independently determined using Unix write calls for large blocks. Server CPU utilization reached 79%. Neither network bandwidth, nor server disk bandwidth was saturated by the test. Server CPUs were heavily loaded and may have been the factor limiting throughput. Also, the servers' network interfaces might have been saturated, but there was no data available on packet rejection. 21

This test did good job of illustrating the system's capacity for short transactions like debit/credit. Limiting resources were server CPUs and server network interfaces and they potential for scaling was determined to be about 45 transaction per server. The results do not necessarily relate directly to transactions that log more data because they would invoke the streaming protocols (which are not used at all by the debit/credit transaction) and thus might encounter different bottlenecks.

This performance measurement enables prediction of Camelot DLF performance for more novel workloads such as workstation file management. File systems benefit from using recovery logs to record changes to file system metadata like directory entries, protection information, and disk block allocation maps. Few workstation applications need the reliability that would be provided by logging all file data as it is written, but workstation crash resilience and file system performance is improved if changes to the metadata are logged [Hagmann 87]. In particular, crash recovery can be much faster because it is not necessary to scavenge the disks to reconstruct a file system. For typical workstation files (which are frequently only a few disk blocks [Satyanarayanan 81]) the metadata will take up about the same log space as a debit credit transaction. Data from Andrew file system measurements indicates the workstations write files at rates averaging less than one file per minute per workstation [Howard et al 88]. 22 A distributed logging system consisting of three servers could accommodate the file system logging loads of 8000 workstations.

21Thetokenringspecificationcallsfora refusedbitthatcouldbeusedfor thispurpose.

22Howardet al reportthat the Andrewfile serversreceiveda total of 135000requeststo storedata or file status informationina 78hourperiod.Therewereabout400workstationsusingthefile system.If allfile writesoccurredina 24 period(correspondintog workinghoursonthedaysstudied)thiswouldaverageto 14filewritesper hourperworkstation. 142

5.3.3, Performance Summary

The results presented here demonstrate the performance of the Camelot DLF under a variety of conditions using a number of measures. These results satisfy the performance evaluation objectives listed at the beginning of this chapter.

The operation the Camelot DLF was extensively demonstrated by the performance experiments. As expected, some bugs were uncovered by the performance evaluation and repaired. Overall confidence in the facility is much greater as a consequence of the evaluation.

The performance evaluation measurements characterize the performance of the Camelot DLF for a variety of workloads. The latency of log forces using the facility is as low as 26.7 milliseconds for a debit/credit force. Throughput is about 45 debit/credit forces per second per server for duplexed logging. The streaming rate for the facility, as evidenced by the CPA tests, is not as good as the force performance. Distributed logging was compared with local logging and found to be competitive in performance for all transaction types studied.

The latency breakdown tests demonstrated that communication dominates latency. If communication were improved, then standard code tuning techniques (like converting frequently used routines to macros) would reduce major latency components in clients and server. The throughput test indicated that the facility limited by either server CPU capacity or by server network interface capacity. Neither the network itself nor disk bandwidtt-, at the servers was shown to be a significant bottleneck. The use of low latency non-volatile virtual memory in the log servers was critical for good latency and throughput.

The performance evaluation indicates how the Camelot DLF would scale or adapt to different environments. First, performance would be greatly enhanced if communication overhead were lowered. Latency reductions of 50 percent or more are possible with improved communication and simple program tuning. Second, throughput would be enhanced by servers with faster CPUs and more efficient network interfaces. A Camelot DLF, as it performs in this evaluation, can grow to accommodate more than 250 debit/credit transactions per second simply by adding additional servers. An additional server adds about 45 transactions per second to system capacity for duplex logging. Network saturation is reached at over 250 transactions per second and to expand further with a 4 megabit token ring it would be necessary to develop multicast protocols. 143

Chapter 6 Conclusions

The dissertation defends the thesis that transaction system recovery logs can be efficiently and reliably provided by a distributed network service. These conclusions start in Section 6.1 with evaluations of this thesis, and the technology presented in the dissertation. Section 6.2 examines future research in distributed logging. Section 6.3 summarizes the results of this dissertation.

6.1. Evaluations

This section presents three evaluations of distributed logging and the underlying technology. First, the Camelot Distributed Logging Facility is evaluated. Second, the potential of other distributed logging systems is assessed. Third, the technology underlying high performance network services is evaluated.

6.1.1. The Camelot Distributed Logging Facility

The Camelot Distributed Logging Facility demonstrates the practicality of distributed logging. The service is part of Camelot Release 1.0 and is expected to be the preferred logging service when Camelot runs on personal workstations that have few disk drives. This section evaluates the Camelot DLF in light of the analysis of design alternatives in Chapter 3, the performance analysis in Chapter 5, and experience from the facility's use.

The Camelot DLF is compatible with the Camelot local tog implementation. Multiple client Camelot nodes can share a set of log servers, eliminating the need to dedicate local disk space for logs. The service is successful and achieves its objectives of providing a logging services that is transparent, efficient, and reliable.

The same Camelot log interface is implemented by the local log implementation and the Camelot DLF. This interface only supports the Camelot disk and recovery managers and can not be used by a separate recovery manager implementing its own recovery algorithm. The Camelot log interface lacks the multiplexing, and the forward scans that would be needed to support multiple recovery managers (see Section 2.2.2.3). An extended log interface specification supporting multiple recovery managers has been proposed. 144

Use of the Camelot DLF is transparent to users and to most of the Camelot system software. The only time users normally are aware that they are using a network log service is when they initially configure Camelot to use the DLF. In the event of a log failure administrators receive error messages from the Camelot DLF that are different than those issued by the local log and administrators must use different diagnostic procedures.

Chapter 5 demonstrated that the performance of the Camelot DLF is comparable to the Camelot local log implementation for a variety of relevant benchmarks. Latency of duplex log forces for debit/credit transaction is 27 milliseconds, which is faster than local duplex logging. The latency of some operations increases when distributed logging is used, but this increase is not likely to be noticed by interactive users. Debit/Credit throughput with duplexed distributed logging using the Camelot DLF can grow at the rate of 45 transactions per second for each additional log server. This throughput result means that Camelot DLF servers can support many clients at once, depending on client activities, so that server resources can be amortized over multiple active clients. Log servers use less disk space than the sum of the space that clients would need to reserve.

When the Camelot DLF is used for duplex logging, its reliability is comparable to local mirrored logging. Log data written to the Camelot DLF is more reliable than most local logging because local log users might only have enough disk storage for simplex logging.

The security of the Camelot DLF is inadequate for many practical environments. The only protection (other than physical security) that the DLF provides is a reliance on the authenticity of client identifiers in requests. More robust authentication and protection mechanisms are needed.

Operational experience with the Camelot DLF uncovered an availability problem. Server availability is reduced by the interaction of the single data stream representation of log servers and the decision not to spool log data offline. There is a limited amount of storage available on servers and servers must deny service to all clients when they run out of space. Logging service is unavailable when clients fail to truncate their logs promptly. This can occur often in a development environment where the Camelot system running on client machines is frequently restarted and stopped.

Although the space management problem has a serious negative impact on availability, the problem is less likely to occur in a production environment where Camelot is always running on client machines. Other than this problem, server availability and reliability have been good. Overall, the limited experience with the Camelot DLF indicates that it is successful and meets its goals. 145

6.1.2. Future Distributed Logging Systems

Distributed logging is an efficient and reliable alternative to local logging for transaction processing systems. Future transaction processing systems supporting moderate transaction rates on processors with limited disks for logging will certainly use a network logging service. Other transaction systems, for example in non-shared memory multiprocessors, might also use network log servers.

The technology presented by this dissertation in support of the distributed logging thesis is applicable to future distributed log systems. The distributed replication algorithm for log representation is viable alternative to mirroring for network log services. The distributed replication algorithm permits highly available and reliable network log services without the use of specially hardened mirrored log servers. Mirrored log servers would require special hardware, operating system software, and log server software to achieve availability comparable to that possible with distributed replication.

Experience with the Camelot DLF indicates that communication will have a major impact on the performance of future distributed log services, but this experience is inconclusive as to what the best communication paradigm is. A choice of communication facilities for future services will depend more on the performance of available paradigms than on levels of function. The log service implementor will start from whatever fast low level communication transport is available and will layer communication along the lines of the channel model on top of that transport.

A low-latency stable or non-volatile buffer dramatically increases log server performance and future log servers will probably use one. This is an example of how a network service can amortize the cost of special purpose hardware over many clients. The use of a low-latency buffer makes the choice of single data stream or partitioned space server data representations less important. The append forest and entry bit-map index structures enable log servers to use a single append-only data stream for multiple client's logs.

6.1.3. High Performance Network Services

The design of a distributed log service provided an opportunity to survey the technology available for constructing high performance network services. The experiences with the Camelot DLF were mixed.

Communication has a major impact on network service performance. Certainly the Camelot DLF performance suffers because of the high latency of inter-node messages in Mach. However, communication in other operating systems can be much faster than the performance of Mach-net messages. It seems important for services to use a common communication facility, because that one will be better tuned. For example, the standard Mach message facility improved over the 146 course of the development of the Camelot DLF and now Mach messages, which are more reliable, and more convenient to use, are also as fast as the Mach-net messages used by the Camelot DLF.

Communication paradigms are evolving. RPCs are adequate and convenient for many applications, however some high performance network services can take advantage of more sophisticated paradigms, like the channel model, that permit both synchronous and asynchronous remote calls. Any network service based on distributed replication needs to use communications that permit parallel operations on servers and, ideally, exploits hardware multicast.

Uninterruptible power supplies were used successfully in the Camelot DLF and there is potential for using UPSs for other network services. UPSs can be used to implement non-volatile virtual memory for other types of stable memories based on distributed replication, for other servers that need fast non-volatile memory, and simply to increase server availability. As power requirements for network servers decline because of more powerful microprocessors and more dense memories, UPSs become even more practical.

6.2. Future Research

6.2.1. Camelot DLF Enhancements

The Camelot DLF is a complete service, but it can be improved by a few changes including multi-threading, multi-volume log storage, spooling, and implementation of the append-forest access structure. In addition, changes to the Camelot log interface might necessitate additional changes to the DLF.

The implementation of the Camelot DLF uses a single mutual exclusion variable to restrict concurrent access to log routines. This implementation prevents the execution of log _rite calls to buffer additional log data while a log force is in progress. A more sophisticated concurrency control structure that permits buffering during log forces would improve log throughput.

The original implementation of the Camelot DLF uses only a single Unix raw disk file for log storage at each server. This limits the total storage space to at most one disk drive. Extending the DLF to allow multiple files should be fairly simple.

Multiple disk files are a useful first step in implementing the spooling of log data to offline storage. Multiple volumes are useful so that one disk can be used for spooling while logging continues to others. In addition, a media catalog and operator interface must be implemented. Support of an Archive log call is not essential for the single data stream used by the Camelot DLF because data will be spooled when disk fills, not when clients no longer need data for crash recovery. 147

The range searching access method used in the Camelot DLF is adequate for the small logging space (150 megabytes) used in current configurations, at least for the tests and other uses that the service has received thus far. However, with larger amounts of storage space and more active clients a more efficient access structure is necessary. Implementation of the append-forest is not expected to be a major problem.

6.2.2.FutureDistributedLogServers

There is additional technology that could benefit network log servers that was not considered in this dissertation, or was considered in only a limited fashion. The Camelot DLF could be modified to use some of the technology, or it could be used in future services. One such technology is for clients to export programs that can interpret client log records to log servers.

Log server interpretation of client records makes it possible for log servers to select records that a clients needs for transaction abort or for a final phase of crash recovery. In these situations, only a portion of the records read by a sequential scan are actually needed for recovery processing. Server interpretation enables the server to efficiently stream the selected records to the client. For many log record layouts a general expression pattern matcher applied to individual log records is sufficient for selecting the desired records. For example, to abort a transaction a client might what all log records of certain types, with a certain transaction identifier. Most log record layouts will place the record type at a fixed location and transaction identifiers at fixed locations within each record type, so a general expression pattern matcher is all that is needed to select records needed for transaction aborts.

For log compression, a log server needs to do more interpretation of log records than is possible with a simple pattern matcher. The log server must supply a means by which the client can supply a general program to execute in the log server. This raises a number of issues including protection of the log servers from the foreign programs and the interface for the log server to provide to the imported program.

6.3. Summary

This dissertation has defended the thesis that the recovery logs used by transaction processing system clan be provided efficiently and reliability by a highly available network service. The results presented in support of this thesis are summarized below.

The results in this dissertation are based on a simple, but thorough definition of a transaction system recovery log, the log's interface, and the uses that transaction systems make of the recovery log as presented in Chapter 2. Many of the results presented in this dissertation are possible because the interface to a recovery log is very simple, because log use patterns are predictable, and because a log is used by only one client and not shared. 148

The problem of designing a distributed log service is addressed comprehensively by Chapter 3. New algorithms were presented to address some design problems, such as the overall distributed representation of log data and the indexing of log records on write-once storage. For other issues, existing technology was surveyed and analyzed.

A key contribution of this dissertation is a new algorithm for distributed replication of transaction recovery logs that is presented in Section 3.1.2. The algorithm is based on voting and uses a collection of log servers, each of which stores one copy of log data. Each log record is sent to two log servers (or more, if greater reliability is desired). The client can switch log servers freely if one server becomes unavailable. Unlike other voting algorithms, which depend on an underlying transaction system for atomicity, reliability, and concurrency control, the distributed replication algorithm for logs achieves reliability and availability with a single mechanism. Additionally, the algorithm permits reads to be satisfied with a single message.to one log server, rather than many messages for voting that are required by most voting algorithms. These features of the representation are possible because a log is accessed by only one client.

Section 3.2 surveyed a number of different communication paradigms and concluded that the channel model is well suited to distributed logging because it combines the fast response possible with remote procedure calls with efficient buffering for asynchronous operations (like spooled log writes). In addition, the communication for distributed replicated logs needs to support parallel operations on multiple remote servers, and if possible exploit hardware multicast.

Two important important ideas were presented in the discussion of log server data structures in Section 3.4. The first is a new data structure, called an append-forest that is used to index log records in append-only storage. An append-forest, and the entry-map index developed by Finlayson, allow logarithmic time access to log records by LSN. Append-forests and entry-map indexes can be constructed in constant time when the keys are presented in ascending order. These index structures permit a single data stream representation for multiple clients' logs on a server. The single stream representation allows better disk utilization, compared with partitioning space among different clients. The append-only nature of the data structures permits them to be used with write-once optical disk storage.

The second important idea for log server data structures is the use of an uninterruptible power supply to implement non-volatile virtual memory for use as a low latency disk buffers on servers for distributed replicated logs. The use of a UPS is very convenient because no operating system kernel modifications are required to use the UPS, in contrast to low-latency buffers constructed with battery-backup main memory or RAM disks. A low latency buffer was shown to be very important for distributed log server performance, both in terms of log force latency and disk utilization.

Chapter 3 also considers issues of security, log space management, and load assignment. For 149 these issues existing technology applies to distributed logging with few modifications. Standard mechanisms can be use to authenticate and secure the communication between log clients and server. Online log space is a limited resource for both distributed logging and local logging and log users and log servers can use standard mechanisms to manage this space. Log space management is more complicated for distributed logging because the logging space is shared by multiple log clients and because log servers can not directly interpret client log records to compress the client logs when they are spooled offline. Load assignment policies and mechanisms can be somewhat specialized for the loads and usage patterns found in logging.

The discussion of design issues in Chapter 3 is complemented by the description of the Camelot Distributed Log Facility in Chapter 4. The Camelot DLF is a complete distributed log service including log servers and a client interface that is compatible with the Camelot local log implementations. The service constitutes an implementation of several of the new algorithms and concepts presented in this dissertation. The Camelot DLF uses distributed replication for logs. Log clients and servers communicate with special protocols that are optimized for logging. The log servers use a single data stream disk representation for clients logs and the servers are equipped with uninterruptible power supplies to implement non-volatile virtual memory disk buffers.

The performance of the Camelot DLF is analyzed in Chapter 5. A variety of benchmarks are used and the the performance of the distributed log is shown to be comparable to local logging for each benchmark. In addition, the throughput capacity of log servers was shown to be about 45 industry standard debit/credit transactions per second. The latency of distributed logging is dominated by communication and throughput is limited by either CPU or network interface capacity in log servers. Improvements in communication and in client and server processors will substantially improve distributed logging performance.

In conclusion, this dissertation has demonstrated, both analytically and experimentally, that the recovery logs used by transactions processing systems can be provided efficiently and reliably by a highly available network service. 150 151

References

[Abbadi and Toueg 86] Amr El Abbadi, Sam Toueg. Availability in Partitioned Replicated Databases. In Proceedings of the Fifth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. 1986.

[Abbadi et al. 85] Amr El Abbadi, Dale Skeen, Flaviu Cristian. An Efficient, Fault-Tolerant Protocol for Replicated Data Management. In Proceedings of the Fourth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. March, 1985.

[Accetta et al. 86] Mike Accetta, Robert Baron, William Bolosky, David Golub, Richard Rashid, Avadis Tevanian, Michael Young. Mach: A New Kernel Foundation for UNIX Development. In Proceedings of Summer Usenix. July, 1986.

[Allchin and McKendry 83] J. E. Allchin, M. S. McKendry. Support for Objects and Actions in Clouds: Status Report. Technical Report GIT-ICS-83/11, Georgia Institute of Technology, May, 1983.

[Alsberg and Day 76] P. A. Alsberg, J. D. Day. A Principle for Resilient Sharing of Distributed Resources. In Proceedings of the Second International Conference on Software Engineering, pages 562-570. October, 1976.

[Anonymous et al. 85] Anonymous, et al. A Measure of Transaction Processing Power. Datamation 31(7), April, 1985. Also available as Technical Report TR 85.2, Tandem Corporation, Cupertino, California, January 1985.

[Astrahan et al. 76] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. System R: A Relational Approach to Database Management. ACM Transactions on Database Systems 1(2), June, 1976.

[Bamberger 87] Frank K. Bamberger. Citicorp's New High-Performance Transaction Processing System. September, 1987. Presented at the Second International Workshop on High Performance Transaction Systems, Asilomar, September, 1987.

[Banatre et al. 83] J. P. Banatre, M. Banatre, F. Ployette. Construction of a Distributed System Supporting Atomic Transactions. In Proceedings of the Third Symposium on Reliability in Distributed Software and Database Systems. IEEE, October, 1983.

[Baron et al. 85] Robert Victor Baron, Richard F. Rashid, Ellen H. Siegel, Avadis Tevanian, Jr., Michael Wayne Young. Melange: A Multiprocessor-Oriented Operating System and Environment. September, 1985.

[Baron et al. 87] Robert V. Baron, David Black, William Bolosky, Jonathan Chew, David B. Golub, Richard F. Rashid, Avadis Tevanian, Jr., Michael Wayne Young. Mach Kernel Interface Manual. February, 1987.

[Bartlett 81] Joel Bartlett. A NonStop™ Kernel. In Proceedings of the Eighth Symposium on Operating System Principles. ACM, 1981.

[Bernstein and Goodman 81] Philip A. Bernstein, Nathan Goodman. Concurrency Control in Distributed Database Systems. ACM Computing Surveys 13(2):185-221, June, 1981.

[Bernstein and Goodman 84] P. Bernstein and N. Goodman. An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases. ACM Transactions on Database Systems 9(4):596-615, December, 1984.

[Birrell 85] Andrew D. Birrell. Secure Communication Using Remote Procedure Calls. ACM Transactions on Computer Systems 3(1):1-14, February, 1985.

[Birrell and Nelson 84] Andrew D. Birrell, Bruce J. Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems 2(1):39-59, February, 1984.

[Birrell et al. 82] Andrew D. Birrell, Roy Levin, Roger M. Needham, and Michael D. Schroeder. Grapevine: An Exercise in Distributed Computing. Communications of the ACM 25(4), April, 1982.

[Birrell et al. 87] Andrew D. Birrell, Michael B. Jones, Edward P. Wobber. A Simple and Efficient Implementation for Small Databases. In Proceedings of the 11th Symposium on Operating System Principles, pages 149-154. ACM, 1987.

[Bloch et al. 87] Joshua J. Bloch, Dean S. Daniels, Alfred Z. Spector. A Weighted Voting Algorithm for Replicated Directories. Journal of the ACM 34(4), October, 1987. To appear. Also available as Technical Report CMU-CS-86-132, Carnegie-Mellon University, July 1986.

[Boggs et al. 82] David R. Boggs, John F. Shoch, Edward A. Taft, Robert M. Metcalfe. A Specific Internetwork Architecture (Pup). In Paul E. Green, Jr. (editor), Computer Network Architectures and Protocols, chapter 19, pages 527-555. Plenum Press, 1982.

[Borr 81] Andrea J. Borr. Transaction Monitoring in Encompass™: Reliable Distributed Transaction Processing. In Proceedings of the Very Large Database Conference, pages 155-165. September, 1981.

[Borr 84] Andrea J. Borr. Robustness to Crash in a Distributed Database: A Non Shared-Memory Multi-Processor Approach. In Proceedings of the Very Large Database Conference, pages 445-453. August, 1984.

[Brown 85] Mark R. Brown, Karen N. Kolling, Edward A. Taft. The Alpine File System. ACM Transactions on Computer Systems 3(4):261-293, November, 1985.

[Brown et al. 81] Mark R. Brown, R.G.G. Cattell, N. Suzuki. The Cedar DBMS: A Preliminary Report. In Proceedings of ACM SIGMOD 1981 International Conference on Management of Data, pages 205-211. April, 1981.

[Cattell 83] R.G.G. Cattell. Design and Implementation of a Relationship-Entity-Datum Data Model. Xerox Research Report CSL-83-4, Xerox Research Center, Palo Alto, CA, May, 1983.

[CCITT 84] CCITT. Recommendation X.25 - Interface between Data Terminal Equipment (DTE) and Data Circuit-Terminating Equipment (DCE) for Terminals Operating in the Packet Mode and Connected to Public Data Networks by Dedicated Circuit. October, 1984.

[Cheriton 83] D. R. Cheriton and W. Zwaenepoel. The Distributed V Kernel and its Performance for Diskless Workstations. In Proceedings of the Ninth Symposium on Operating System Principles. ACM, 1983.

[Cheriton 84a] David R. Cheriton. The V Kernel: A Software Base for Distributed Systems. IEEE Software 1(2):186-213, April, 1984.

[Cheriton 84b] David R. Cheriton. An Experiment using Registers for Fast Message-Based Interprocess Communication. Operating Systems Review 18(4):12-20, October, 1984.

[Cheriton 87] David R. Cheriton. VMTP: Versatile Message Transaction Protocol. 1987. Computer Science Department, Stanford University.

[Cheriton and Zwaenepoel 85] David R. Cheriton and Willy Zwaenepoel. Distributed Process Groups in the V Kernel. ACM Transactions on Computer Systems 2(2):77-107, 1985.

[Clark et al. 78] David D. Clark, Kenneth T. Pogran, David P. Reed. An Introduction to Local Area Networks. Proceedings of the IEEE 66(11):1497-1516, November, 1978.

[Comer 88] Douglas Comer. Internetworking with TCP/IP: Principles, Protocols and Architecture. Prentice-Hall, 1988.

[Dahl and Hoare 72] O. J. Dahl, C. A. R. Hoare. Hierarchical Program Structures. In C. A. R. Hoare (editor), A.P.I.C. Studies in Data Processing. Volume 8: Structured Programming, chapter 3, pages 175-220. Academic Press, London and New York, 1972.

[Daniels 82] Dean S. Daniels. Query Compilation in a Distributed Database System. Master's thesis, Massachusetts Institute of Technology, March, 1982.

[Daniels et al. 87] Dean S. Daniels, Alfred Z. Spector, Dean Thompson. Distributed Logging for Transaction Processing. In Sigmod '87 Proceedings. ACM, May, 1987. Also available as Technical Report CMU-CS-86-106, Carnegie-Mellon University, June 1986.

[Date 83] C.J. Date. The System Programming Series: An Introduction to Database Systems Volume 2. Addison-Wesley, Reading, MA, 1983.

[Deering 85] S. E. Deering and D. R. Cheriton. Host Groups: A Multicast Extension to the Internet Protocol. Technical Report RFC 966, Network Working Group, December, 1985.

[Department of Defense 82] Reference Manual for the Ada Programming Language. July 1982 edition, Department of Defense, Ada Joint Program Office, Washington, DC, 1982.

[DeWitt 84] D. DeWitt, et al. Implementation Techniques for Main Memory Database Systems. In Proceedings of SIGMOD '84, pages 1-8. May, 1984.

[Digital Equipment Corporation 80] The Ethernet Specifications. First edition, Digital Equipment Corporation, Intel Corporation, Xerox Corporation, Maynard, MA, 1980.

[Drake 67] Alvin W. Drake. Fundamentals of Applied Probability Theory. McGraw-Hill, New York, 1967.

[Duchamp 88] Dan Duchamp. Transaction Management for Transaction Processing Systems. PhD thesis, Carnegie-Mellon University, 1988. Forthcoming.

[Dwork and Skeen 83] Cynthia Dwork, Dale Skeen. The Inherent Cost of Nonblocking Commitment. In Proceedings of the Second Annual Symposium on Principles of Distributed Computing, pages 1-11. ACM, August, 1983.

[Eppinger 87] Jeffrey L. Eppinger. CPA: The Camelot Performance Analyzer. August, 1987. Camelot Working Memo 12.

[Eppinger 88] Jeffrey L. Eppinger. Virtual Memory Management for Transaction Processing Systems. PhD thesis, Carnegie-Mellon University, 1988. Forthcoming.

[Eswaran et al. 76] K. P. Eswaran, James N. Gray, Raymond A. Lorie, Irving L. Traiger. The Notions of Consistency and Predicate Locks in a Database System. Communications of the ACM 19(11):624-633, November, 1976.

[Fabry 74] R. S. Fabry. Capability-Based Addressing. Communications of the ACM 17(7):403-411, July, 1974.

[Finlayson and Cheriton 87] R. Finlayson, D. Cheriton. Log Files: An Extended File Service Exploiting Write-once Storage. In Proceedings of the 11th Symposium on Operating System Principles, pages 139-148. ACM, 1987.

[Fridrich and Older 81] M. Fridrich and W. Older. The FELIX File Server. In Proceedings of the Eighth Symposium on Operating System Principles. ACM, 1981.

[Gawlick 85] D. Gawlick. Processing 'Hot Spots' in High Performance Systems. In Proceedings of COMPCON '85. 1985.

[Gifford 79a] David K. Gifford. Violet, an Experimental Decentralized System. In Proceedings of the I.R.I.A. Workshop on Integrated Office Systems. Versailles, France, November, 1979. Also available as Xerox Palo Alto Research Center Report CSL-79-12.

[Gifford 79b] David K. Gifford. Weighted Voting for Replicated Data. In Proceedings of the Seventh Symposium on Operating System Principles, pages 150-162. ACM, December, 1979.

[Gifford 81] David K. Gifford. Information Storage in a Decentralized Computer System. PhD thesis, Stanford University, 1981. Available as Xerox Palo Alto Research Center Report CSL-81-8, March 1982.

[Gifford and Glasser 88] David K. Gifford and Nathan Glasser. Remote Pipes and Procedures for Efficient Distributed Communication. ACM Transactions on Computer Systems 6(3), August, 1988.

[Gifford and Spector 84] David K. Gifford, Alfred Z. Spector. A Case Study: The TWA Reservation System. Communications of the ACM 27(7):650-665, July, 1984.

[Gray 78] James N. Gray. Notes on Database Operating Systems. In R. Bayer, R. M. Graham, G. Seegmuller (editors), Lecture Notes in Computer Science. Volume 60: Operating Systems - An Advanced Course, pages 393-481. Springer-Verlag, 1978. Also available as Technical Report RJ2188, IBM Research Laboratory, San Jose, California, 1978.

[Gray 80] James N. Gray. A Transaction Model. Technical Report RJ2895, IBM Research Laboratory, San Jose, California, August, 1980.

[Gray 81] James N. Gray. The Transaction Concept: Virtues and Limitations. In Proceedings of the Very Large Database Conference, pages 144-154. September, 1981.

[Gray 86] J. Gray. Why Do Computers Stop and What Can Be Done About It? In Proceedings of the Fifth IEEE Symposium on Reliability in Distributed Software and Database Systems, pages 3-12. 1986.

[Gray et al. 81] James N. Gray, et al. The Recovery Manager of the System R Database Manager. ACM Computing Surveys 13(2):223-242, June, 1981.

[Haderle and Jackson 84] D. J. Haderle, R. D. Jackson. IBM Database 2 Overview. IBM Systems Journal 23(2):112-125, 1984.

[Haerder and Reuter 83] Theo Haerder, Andreas Reuter. Principles of Transaction-Oriented Database Recovery. ACM Computing Surveys 15(4):287-318, December, 1983.

[Hagmann 87] Robert Hagmann. Reimplementing the Cedar File System Using Logging and Group Commit. In Proceedings of the 11th Symposium on Operating System Principles, pages 155-162. ACM, 1987.

[Haskin et al. 88] Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan. Recovery Management in QuickSilver. ACM Transactions on Computer Systems 6(1):82-108, February, 1988.

[Helland 85] Pat Helland. Transaction Monitoring Facility. Database Engineering 8(2):9-18, June, 1985.

[Helland 88] Pat Helland. Personal communication, 1988.

[Helland et al. 87] Pat Helland, Harald Sammer, Jim Lyon, Richard Carr, Phil Garrett, Andreas Reuter. Group Commit Timers and High Volume Transaction Systems. September, 1987. Presented at the Second International Workshop on High Performance Transaction Systems, Asilomar, September, 1987.

[Herlihy 86] Maurice P. Herlihy. A Quorum-Consensus Replication Method for Abstract Data Types. ACM Transactions on Computer Systems 4(1), February, 1986.

[Herlihy and Liskov 82] Maurice P. Herlihy and Barbara H. Liskov. A Value Transmission Method for Abstract Data Types. ACM Transactions on Programming Languages and Systems 4(4):527-551, October, 1982.

[Herlihy and Wing 87] M. P. Herlihy, J. M. Wing. Avalon: Language Support for Reliable Distributed Systems. In Proceedings of the Seventeenth International Symposium on Fault-Tolerant Computing. IEEE, July, 1987.

[Howard et al. 88] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems 6(1):51-81, February, 1988. Presented at the Eleventh Symposium on Operating System Principles, Austin, Texas, November, 1987.

[IBM Corporation 78] Customer Information Control System/Virtual Storage, Introduction to Program Logic. SC33-0067-1 edition, IBM Corporation, 1978.

[IBM Corporation 79] Systems Network Architecture - Introduction to Sessions Between Logical Units. GC290-1869-1 edition, IBM Corporation, White Plains, New York, 1979.

[IBM Corporation 80] IMS/VS Version 1 General Information Manual. GH20-1260 edition, IBM Corp., White Plains, NY, 1980.

[IBM Corporation 82] SQL/Data System Concepts and Facilities. GH24-5013 edition, IBM Corporation, Armonk, NY, 1982.

[IBM Corporation 85] IBM RT PC Hardware Technical Reference, Volume 1. SC23-0854-1 edition, IBM Corporation, Austin, Texas, 1985.

[IBM Corporation 87] TPF2 General Information Manual. GH20-6200 edition, IBM Corporation, Information Systems Group, Transportation Industry Marketing, Valhalla, NY, 1987.

[Israel et al. 78] Jay E. Israel, James G. Mitchell, Howard E. Sturgis. Separating Data from Function in a Distributed File System. In Proceedings of the Second International Symposium on Operating Systems. IRIA, 1978.

[Jones and Spector ??] Scott Jones, Alfred Z. Spector. A Transactional Spread Sheet for Distributed Collaborative Work. Camelot Working Memo, to be written.

[Jones et al. 85] Michael B. Jones, Richard F. Rashid, Mary R. Thompson. Matchmaker: An Interface Specification Language for Distributed Processing. In Proceedings of the Twelfth Annual Symposium on Principles of Programming Languages, pages 225-235. ACM, January, 1985.

[Joy et al. 83] William Joy, Eric Cooper, Robert Fabry, Samuel Leffler, Kirk McKusick, David Mosher. 4.2 BSD System Interface Overview. Technical Report CSRG TR/5, University of California Berkeley, July, 1983.

[Kato and Spector 88] Toshihiko Kato and Alfred Z. Spector. A Reliable Mail System Using the Camelot Distributed Transaction Facility. May, 1988.

[Knapp 87] Edgar Knapp. Deadlock Detection in Distributed Databases. ACM Computing Surveys 19(4):303-328, December, 1987.

[Korth 83] Henry F. Korth. Locking Primitives in a Database System. Journal of the ACM 30(1):55-79, January, 1983.

[Kronenberg 86] Nancy P. Kronenberg, Henry M. Levy, and William D. Strecker. VAXclusters: A Closely-Coupled Distributed System. ACM Transactions on Computer Systems 4(2), May, 1986. Presented at the Tenth Symposium on Operating System Principles, Orcas Island, Washington, December, 1985.

[Lampson 81] Butler W. Lampson. Atomic Transactions. In G. Goos and J. Hartmanis (editors), Lecture Notes in Computer Science. Volume 105: Distributed Systems - Architecture and Implementation: An Advanced Course, chapter 11, pages 246-265. Springer-Verlag, 1981.

[Lazowska et al. 86] Edward D. Lazowska, John Zahorjan, David R. Cheriton, and Willy Zwaenepoel. File Access Performance of Diskless Workstations. ACM Transactions on Computer Systems 4(3):238-268, 1986.

[Lindsay 80] Bruce G. Lindsay. Object Naming and Catalog Management for a Distributed Database Manager. IBM Research Report RJ2914, IBM Research Laboratory, San Jose, CA, August, 1980.

[Lindsay 88] Bruce G. Lindsay. Personal communication, 1988.

[Lindsay et al. 79] Bruce G. Lindsay, et al. Notes on Distributed Databases. Technical Report RJ2571, IBM Research Laboratory, San Jose, California, July, 1979. Also appears in Droffen and Poole (editors), Distributed Databases, Cambridge University Press, 1980.

[Lindsay et al. 84] Bruce G. Lindsay, Laura M. Haas, C. Mohan, Paul F. Wilms, Robert A. Yost. Computation and Communication in R*: A Distributed Database Manager. ACM Transactions on Computer Systems 2(1):24-38, February, 1984.

[Liskov 84] Barbara Liskov. Overview of the Argus Language and System. Programming Methodology Group Memo 40, Massachusetts Institute of Technology Laboratory for Computer Science, February, 1984.

[Liskov and Herlihy 83] Barbara Liskov, Maurice Herlihy. Issues in Process and Communication Structure for Distributed Programs. In Proceedings of the Third Symposium on Reliability in Distributed Software and Database Systems. October, 1983.

[Liskov and Scheifler 83] Barbara H. Liskov, Robert W. Scheifler. Guardians and Actions: Linguistic Support for Robust, Distributed Programs. ACM Transactions on Programming Languages and Systems 5(3):381-404, July, 1983.

[Liskov et al. 77] B. Liskov, A. Snyder, R. Atkinson, C. Schaffert. Abstraction Mechanisms in CLU. Communications of the ACM 20(8), August, 1977.

[Liskov et al. 83] B. Liskov, M. Herlihy, P. Johnson, G. Leavens, R. Scheifler, W. Weihl. Preliminary Argus Reference Manual. Programming Methodology Group Memo 39, Massachusetts Institute of Technology Laboratory for Computer Science, October, 1983.

[Liskov et al. 87] B. Liskov, D. Curtis, P. Johnson, R. Scheifler. Implementation of Argus. In Proceedings of the 11th Symposium on Operating System Principles, pages 111-122. ACM, November, 1987.

[Lomet 77] David B. Lomet. Process Structuring, Synchronization, and Recovery Using Atomic Actions. ACM SIGPLAN Notices 12(3), March, 1977.

[Lorie 77] Raymond A. Lorie. Physical Integrity in a Large Segmented Database. ACM Transactions on Database Systems 2(1):91-104, March, 1977.

[McConnel and Siewiorek 82] Stephen McConnel and Daniel P. Siewiorek. Evaluation Criteria. The Theory and Practice of Reliable System Design. Digital Press, Bedford, MA, 1982, pages 201-302, Chapter 5.

[McKusick et al. 85] M. K. McKusick, W. N. Joy, S. J. Leffler, R. S. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems 2(3):181-197, August, 1985.

[Metcalfe and Boggs 76] R. M. Metcalfe, D. R. Boggs. Ethernet: Distributed Packet Switching for Local Computer Networks. Communications of the ACM 19(7), July, 1976.

[Mitchell 82] James G. Mitchell. September, 1982. Distinguished Lecture, Computer Science Department, Carnegie Mellon University.

[Mitchell and Dion 82] James G. Mitchell, Jeremy Dion. A Comparison of Two Network-Based File Servers. Communications of the ACM 25(4), April, 1982.

[Mohan 83] C. Mohan, B. Lindsay. Efficient Commit Protocols for the Tree of Processes Model of Distributed Transactions. In Proceedings of the Second Annual Symposium on Principles of Distributed Computing, pages 76-88. ACM, August, 1983.

[Mohan et al. 88] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, P. Schwarz. Recovery Method Using Write-Ahead Logging and Supporting Fine-Granularity Locking. June, 1988. Slides from lecture presentation.

[National Bureau of Standards 77] National Bureau of Standards. Data Encryption Standard. Federal Information Processing Standards Publication 46, Government Printing Office, Washington, D.C., 1977.

[Needham and Schroeder 78] Roger M. Needham, Michael D. Schroeder. Using Encryption for Authentication in Large Networks of Computers. Communications of the ACM 21(12):993-999, December, 1978. Also Xerox Research Report, CSL-78-4, Xerox Research Center, Palo Alto, CA.

[Needham et al. 83] R. M. Needham, A. J. Herbert, J. G. Mitchell. How to Connect Stable Memory to a Computer. Operating Systems Review 17(1):16, January, 1983.

[Nelson 81] Bruce Jay Nelson. Remote Procedure Call. PhD thesis, Carnegie-Mellon University, May, 1981. Available as Technical Report CMU-CS-81-119a, Carnegie-Mellon University.

[Obermarck 82] Ron Obermarck. Distributed Deadlock Detection Algorithm. ACM Transactions on Database Systems 7(2):187-208, June, 1982.

[Oki 83] Brian Masao Oki. Reliable Object Storage to Support Atomic Actions. Master's thesis, Massachusetts Institute of Technology, May, 1983. Also Technical Report MIT/LCS/TR-308, MIT Laboratory for Computer Science.

[Oki et al. 85] Brian M. Oki, Barbara H. Liskov, Robert W. Scheifler. Reliable Object Storage to Support Atomic Actions. In Proceedings of the Tenth Symposium on Operating System Principles, pages 147-159. ACM, December, 1985.