Total Recall: System Support for Automated Availability Management

Ranjita Bhagwan, Kiran Tati, Yu-Chung Cheng, Stefan Savage, and Geoffrey M. Voelker
Department of Computer Science and Engineering
University of California, San Diego

Abstract

Availability is a storage system property that is both highly desired and yet minimally engineered. While many systems provide mechanisms to improve availability – such as redundancy and failure recovery – how to best configure these mechanisms is typically left to the system manager. Unfortunately, few individuals have the skills to properly manage the trade-offs involved, let alone the time to adapt these decisions to changing conditions. Instead, most systems are configured statically and with only a cursory understanding of how the configuration will impact overall performance or availability. While this issue can be problematic even for individual storage arrays, it becomes increasingly important as systems are distributed – and absolutely critical for the wide-area peer-to-peer storage infrastructures being explored.

This paper describes the motivation, architecture and implementation of a new peer-to-peer storage system, called TotalRecall, that automates the task of availability management. In particular, the TotalRecall system automatically measures and estimates the availability of its constituent host components, predicts their future availability based on past behavior, calculates the appropriate redundancy mechanisms and repair policies, and delivers user-specified availability while maximizing efficiency.

1 Introduction

Availability is a storage system property that is highly desired in principle, yet poorly understood in practice. How much availability is necessary, over what period of time and at what granularity? How likely are failures now and in the future, and how much redundancy is needed to tolerate them? When should repair actions be initiated and how should they be implemented? These are all questions that govern the availability of a storage system, but they are rarely analyzed in depth or used to influence the dynamic behavior of a system.

Instead, system designers typically implement a static set of redundancy and repair mechanisms simply parameterized by resource consumption (e.g., number of replicas). Determining how to configure the mechanisms and what level of availability they will provide if employed is left for the user to discover. Moreover, if the underlying environment changes, it is again left to the user to reconfigure the system to compensate appropriately. While this approach may be acceptable when failures are consistently rare, such as for the individual drives in a disk array (and even here the management burden may be objectionable [23]), it quickly breaks down in large-scale distributed systems where hosts are transiently inaccessible and individual failures are common.

Peer-to-peer systems are particularly fragile in this respect as their constituent parts are in a continual state of flux. Over short time scales (1-3 days), individual hosts in such systems exhibit highly transient availability as their users join and leave the system at will – frequently following a rough diurnal pattern. In fact, the majority of hosts in existing peer-to-peer systems are inaccessible at any given time, although most are available over longer time scales [4, 19]. Over still longer periods, many of these hosts leave the system permanently, as most peer-to-peer systems experience high levels of churn in their overall membership. In such systems, we contend that availability management must be provided by the system itself, which can monitor the availability of the underlying host population and adaptively determine the appropriate resources and mechanisms required to provide a specified level of availability.

This paper describes the architecture, design and implementation of a new peer-to-peer storage system, called TotalRecall, that automatically manages availability in a dynamically changing environment. By adapting the degree of redundancy and frequency of repair to the distribution of failures, TotalRecall guarantees user-specified levels of availability while minimizing the overhead needed to provide these guarantees. We rely on three key approaches in providing these services:

Availability Prediction. The system continuously monitors the current availability of its constituent hosts. This measured data is used to construct predictions, at multiple time-scales, about the future availability of individual hosts and groups of hosts.

Redundancy Management. Short time-scale predictions are then used to derive precise redundancy requirements for tolerating transient disconnectivity (a sketch of this calculation follows the list). The system selects the most efficient redundancy mechanism based on workload behavior and system policy directives.

Dynamic Repair. Long time-scale predictions coupled with information about current availability drive system repair actions. The repair policy is dynamically selected as a function of these predictions, target availability and system workload.
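To make the redundancy calculation concrete, the following is a minimal sketch, not taken from the TotalRecall implementation, of how a short time-scale availability estimate might be turned into a replica count. It assumes independent host failures and whole-object replication; the probe-based estimator, the 65% host availability, and the 99.9% target are illustrative values.

```cpp
// Sketch: derive a replica count from a predicted per-host availability and
// a user-specified target. Assumes independent host failures: with c replicas
// on hosts that are each reachable with probability p, the object is
// unavailable only if all replicas are down, so we need 1 - (1-p)^c >= target.
#include <algorithm>
#include <cmath>
#include <cstdio>

// Estimate short-term host availability as the fraction of recent liveness
// probes that were answered (a stand-in for the availability predictor).
double predicted_availability(int probes_answered, int probes_sent) {
    if (probes_sent == 0) return 0.0;
    return static_cast<double>(probes_answered) / probes_sent;
}

// Smallest replica count c with 1 - (1-p)^c >= target; assumes 0 < p < 1
// and 0 < target < 1.
int replicas_needed(double p, double target) {
    double c = std::log(1.0 - target) / std::log(1.0 - p);
    return std::max(1, static_cast<int>(std::ceil(c)));
}

int main() {
    double p = predicted_availability(13, 20);   // hosts up ~65% of the time
    std::printf("replicas for 99.9%% availability: %d\n",
                replicas_needed(p, 0.999));      // -> 7 with p = 0.65
    return 0;
}
```

With hosts predicted to be up roughly 65% of the time, seven full replicas reach a 99.9% target under this model; an erasure-coded layout can reach the same target with considerably less storage, which is one reason the choice of mechanism is left to policy.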
TotalRecall is implemented in C++ using a modified version of the DHash peer-to-peer object location service [8]. The system implements a variety of redundancy mechanisms (including replication and online coding), availability predictors and repair policies. However, more significantly, the system provides interfaces that allow new mechanisms and policies to describe their behavior in a unified manner – so the system can decide how and when to best use them.
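The excerpt does not show these interfaces, so the following is a hedged C++ sketch of the kind of unified description a redundancy mechanism might export to the policy layer; the type names, method signatures, and the SimpleReplication example are illustrative assumptions, not TotalRecall's actual API.

```cpp
// Hypothetical sketch of a unified redundancy-mechanism interface; names and
// signatures are illustrative, not taken from the TotalRecall code base.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct Fragment {
    uint64_t host_id;              // host the fragment is placed on
    std::vector<uint8_t> data;     // fragment contents
};

// A redundancy mechanism (replication, erasure/online coding, ...) reports its
// storage cost for a given availability target and produces/recovers
// fragments, so a policy layer can compare mechanisms without knowing their
// internals.
class RedundancyMechanism {
public:
    virtual ~RedundancyMechanism() = default;
    // Storage blow-up factor needed to reach `target` availability when hosts
    // are predicted to be reachable with probability `host_avail`.
    virtual double overhead(double host_avail, double target) const = 0;
    // Encode an object into fragments to be placed on distinct hosts.
    virtual std::vector<Fragment> encode(const std::vector<uint8_t>& object,
                                         double host_avail,
                                         double target) const = 0;
    // Reconstruct the object from whatever fragments are currently reachable.
    virtual std::vector<uint8_t> decode(
        const std::vector<Fragment>& reachable) const = 0;
};

// Whole-object replication, using the replica-count rule sketched earlier.
class SimpleReplication : public RedundancyMechanism {
public:
    double overhead(double host_avail, double target) const override {
        return std::max(1.0, std::ceil(std::log(1.0 - target) /
                                       std::log(1.0 - host_avail)));
    }
    std::vector<Fragment> encode(const std::vector<uint8_t>& object,
                                 double host_avail,
                                 double target) const override {
        int copies = static_cast<int>(overhead(host_avail, target));
        std::vector<Fragment> frags;
        for (int i = 0; i < copies; ++i)
            frags.push_back(Fragment{0 /* host chosen by placement layer */, object});
        return frags;
    }
    std::vector<uint8_t> decode(
        const std::vector<Fragment>& reachable) const override {
        if (reachable.empty()) throw std::runtime_error("no replica reachable");
        return reachable.front().data;  // any single replica suffices
    }
};
```

Because every mechanism reports its overhead for a given availability target, a policy layer built against such an interface can compare replication and coding on equal terms and pick whichever is cheapest for the current workload.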
The remainder of this paper describes the motivation, architecture and design of the TotalRecall system. The following section motivates the problem of availability management and describes key related work. In Section 3 we discuss our availability architecture – the mechanisms and policies used to ensure that user availability requirements are met. Section 4 describes the design of the TotalRecall Storage System and its implementation. Section 5 describes the TotalRecall File System, an NFS file service implemented on the core storage system. In Section 6, we quantitatively evaluate the effectiveness of our system and compare it against existing approaches. Finally, Section 7 concludes.

2 Motivation and Related Work

The implicit guarantee provided by all storage systems is that data, once stored, may be recalled at some future point. Providing this guarantee has been the subject of countless research efforts over the last twenty years, and has produced a wide range of technologies ranging from RAID to robotic tape libraries. However, while the efficiency of these techniques has improved over time and while the cost of storage itself has dropped dramatically, the complexity of managing this storage and its availability has continued to increase. In fact, a recent analysis of cluster-based services suggests that the operational cost of preparing for and recovering from failures easily dominates the capital expense of the individual hardware systems [17]. This disparity will only continue to increase as hardware costs are able to reflect ad-

The AutoRAID system provided two implementations of storage redundancy – mirroring and RAID5 – and dynamically assigned data between them to optimize the performance of the current workload. While this system did not directly provide users with explicit control over availability, it did significantly reduce the management burden associated with configuring these high-availability storage devices. A later HP project, AFRAID, did allow user-specified availability in a disk array environment, mapping availability requests into variable consistency management operations [20].

In the enterprise context, several researchers have recently proposed systems to automate storage management tasks. Keeton and Wilkes have described a system designed to automate data protection decisions that is similar in motivation to our own, but they focus on longer time scales since the expected failure distribution in the enterprise is far less extreme than in the peer-to-peer environment [10]. The WiND system is being designed to automate many storage management tasks in a cluster environment, but is largely focused on improving system performance [2]. Finally, the PASIS project is exploring system support to automatically make trade-offs between different redundancy mechanisms when building a distributed file system [24].

Not surprisingly, perhaps the closest work to our own arises from the peer-to-peer (P2P) systems community. Due to the administrative heterogeneity and poor host availability found in the P2P environment, almost all P2P systems provide some mechanism for ensuring data availability in the presence of failures. For example, the CFS system relies on a static replication factor coupled with an active repair policy, as does Microsoft's FARSITE system (although FARSITE calculates the replication factor as a function of total storage and is more careful about replica placement) [1,8,9]. The Oceanstore system uses a combination of block-level erasure coding for long-term durability and simple replication to tolerate transient failures [12, 22]. Finally, a recent paper by Blake and Rodrigues argues that the cost of dynamic membership makes cooperative storage infeasible in the transiently available peer-to-peer environments [5]. This finding is correct under certain assumptions, but is not critical in the environments we have measured and the system we