ARCHIVAL REPOSITORIES FOR DIGITAL LIBRARIES
A Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
Arturo Crespo
March 2003

© Copyright by Arturo Crespo 2003
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Hector Garcia-Molina (Principal Adviser)
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jennifer Widom
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andreas Paepcke
Approved for the University Committee on Graduate Studies:
Abstract
Current libraries use an assortment of uncoordinated and unreliable techniques for storing and managing their digital information. Digital information can be lost for a variety of reasons: magnetic decay, format and device obsolescence, and human or system error, among many others. In this thesis, we address the problem of how to build archival repositories (ARs). An AR has the following combined key requirements that distinguish it from other repositories. First, digital objects (e.g., documents, technical reports, movies) must be preserved indefinitely, as technologies, file formats, and organizations evolve. Second, ARs will be formed by a confederation of independent organizations. Third, published digital objects have a historical nature, so changes should not be made in place; instead, they should be recorded as versions. We provide an architecture for ARs that assures long-term archival storage of digital objects. This guarantee is achieved by having a federation of independent but collaborating sites, each managing a collection of digital objects. We also provide a framework for evaluating the reliability and cost of an AR. Finally, we present techniques for efficient access to documents in a federation of independent sites.
Acknowledgments
First, I would like to thank my advisor, Hector Garcia-Molina, who taught me much of what I know about research, clear thinking, and effective writing. Working with Hector was not only a privilege but a pleasure as well. I also want to thank the other members of my reading committee, Andreas Paepcke and Jennifer Widom, as well as my orals committee members, Chris Jacobs and Terry Winograd, for their many useful suggestions and comments. I would like to thank Jerry Saltzer for providing the initial spark that ignited my interest in this topic. In addition, I thank Carl Lagoze for his useful suggestions and encouragement in pursuing this topic. I would also like to thank all the members of the Stanford Database Group for their helpful advice, insightful discussions, and for always being there during good and bad times. In particular, I would like to thank my co-authors Orkut Buyukkokten and Brian Cooper, as well as Stefan Decker (whose ideas led to the work presented in Chapter 7). Finally, special thanks to my officemates Sergey Brin, Wilburt Labio, Claire Cui, Taher Haveliwala, and Glen Jeh, who made the long hours at Stanford much more enjoyable.
At Stanford, I was fortunate to have met, and to count among my friends, a number of smart, interesting, and caring people. The list of friends who provided me support and encouragement is far too long to write here, but I would like to mention a couple of names. Very special thanks to Erik Boman and the leaders of the Stanford Outing Club, who introduced me to the natural wonders of California; and to Carlos Guestrin, for encouraging me to break away from my self-imposed limits, allowing me to go farther and higher than I ever thought I could go. I also want to thank
Sergey Melnik, not only for his true friendship, but also for his technical help in developing some of the ideas in this dissertation. Finally, I want to thank my family: my parents Arturo and Inmaculada, my sister Massiel, and my brother Joaquin. Familia, esta tesis se la dedico a ustedes. Sin su fe en mi y su apoyo incondicional escribir esta tesis hubiese sido imposible. (Family, I dedicate this thesis to you. Without your faith in me and your unconditional support, writing this thesis would have been impossible.)
Contents

Abstract
Acknowledgments
1 Introduction

I Design and Evaluation of Archival Repositories

2 Evaluating the Reliability of an AR
  2.1 Overview
  2.2 Archival Problems and Solutions
    2.2.1 Sources of Failures
    2.2.2 Component Failure Avoidance Techniques
    2.2.3 System-Level Techniques
  2.3 Architecture of an Archival Repository
    2.3.1 Archival Documents
    2.3.2 Architecture of the Non-Fault-Tolerant Store
    2.3.3 Architecture of the Archival System
  2.4 Failure and Recovery Modeling
    2.4.1 Modeling a Non-Fault-Tolerant Store
    2.4.2 Modeling an Archival System
    2.4.3 Modeling Archival Documents
    2.4.4 Example: An Archival System based on Tape Backups
  2.5 ArchSim: A Simulation Tool for Archival Repositories
    2.5.1 Challenges in Simulating an Archival Repository
    2.5.2 The Simulation Engine: ArchSim
    2.5.3 Library of Failure Distributions
    2.5.4 ArchSim's Implementation
  2.6 Case Study: MIT/Stanford TR Repository
  2.7 Discussion

3 A Design Methodology for ARs
  3.1 Overview
  3.2 AR Costs
    3.2.1 A Taxonomy of Cost Sources
    3.2.2 A Taxonomy of Cost Events
  3.3 Designing an AR
    3.3.1 Framing the Problem
    3.3.2 Identifying Uncertainties, Alternatives, and Preferences
    3.3.3 Modeling an AR
    3.3.4 Transforming Non-Critical Uncertainties into Variables
    3.3.5 Eliminating Futile Alternatives
    3.3.6 Probabilistic Assessment of Uncertainties
    3.3.7 Evaluating an AR Design
    3.3.8 Appraising Cost Decisions
  3.4 Discussion

II An Architecture and Algorithms for Archival Repositories

4 An Architecture for Archival Repositories
  4.1 Overview
  4.2 Key Components
    4.2.1 Signatures as Object Handles
    4.2.2 No Deletions
    4.2.3 Layered Architecture
    4.2.4 Awareness Everywhere
    4.2.5 Disposable Auxiliary Structures
  4.3 Object Store Layer
    4.3.1 Object Store Interface
    4.3.2 Object Store Implementation
  4.4 Identity Layer
    4.4.1 Identity Interface
    4.4.2 Identity Implementation
  4.5 Complex Object Layer
    4.5.1 Tuples
    4.5.2 Versions
    4.5.3 Sets
  4.6 Reliability Layer
    4.6.1 Reliability Interface
    4.6.2 Implementation
  4.7 A Complete Example
  4.8 Discussion
  4.9 Related Work

5 Archive Synchronization
  5.1 Overview
  5.2 Problem Definition
  5.3 Related Work
  5.4 The Client-Store Design Space
  5.5 The UNI-AWARE Algorithm
    5.5.1 Snapshot Algorithms
    5.5.2 Timestamps
    5.5.3 Logs
    5.5.4 Signature Algorithm
    5.5.5 UNI-AWARE Summary
  5.6 Distributed UNI-AWARE Algorithm
    5.6.1 Hierarchical Signatures
    5.6.2 Hierarchical Timestamps
    5.6.3 Performance Issues
  5.7 Discussion

III Efficient Content Access in a Federation of Archival Repositories

6 Efficient Searching in an AR Federation
  6.1 Overview
  6.2 Related Work
  6.3 Peer-to-peer Systems
    6.3.1 Query Processing in a Distributed-Search P2P System
  6.4 Routing Indices
    6.4.1 Using Routing Indices
    6.4.2 Creating Routing Indices
    6.4.3 Maintaining Routing Indices
  6.5 Alternative Routing Indices
    6.5.1 Hop-count Routing Indices
    6.5.2 Exponentially Aggregated RI
  6.6 Cycles in the P2P Network
  6.7 Experimental Results
    6.7.1 Modeling Search Mechanisms in a P2P System
    6.7.2 Simulating a P2P System
    6.7.3 Evaluating P2P Search Mechanisms
  6.8 Discussion

7 Efficient Networks in an AR Federation
  7.1 Overview
  7.2 Related Work
  7.3 Semantic Overlay Networks
    7.3.1 Classification Hierarchies
  7.4 Classification Hierarchies
    7.4.1 Experiments
  7.5 Classifying Queries and Documents
    7.5.1 Experiments
  7.6 Nodes and SON Membership
    7.6.1 Layered SONs
    7.6.2 Experiments
  7.7 Searching SONs
    7.7.1 Searching with Layered SONs
    7.7.2 Experiments
  7.8 Discussion

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work

Bibliography

A MIT/Stanford CSTR Scenario Parameters

B List of Styles, Substyles, and Tones
  B.1 Styles
  B.2 SubStyles
  B.3 Tones
List of Figures

2.1 Typical Media Decay Times
2.2 Reducing Media Decay
2.3 Archival Repository Model
2.4 Materializations
2.5 Archival Repository Model Parameters
2.6 Archival Document Model
2.7 A model for a dual tape system
2.8 A Dual-tape Archival System
2.9 A Dual-tape Archival System
2.10 Parameter values
2.11 Failure Functions
2.12 MIT/Stanford CSTR Scenario
2.13 Base values
2.14 MTTF vs. φ_sto with ρ_sto of 60 days
2.15 System MTTF (logarithmic scale)
2.16 MTTF with φ_form = 20 years, ρ_sto = ρ_form = 60 days
2.17 MTTF with 3 Copies (logarithmic scale)
2.18 Conversion between % failed devices and MTTF
2.19 MTTF with Infant Mortality
2.20 System MTTF with Aging and ρ_sto of 60 days
2.21 System MTTF with PM and Aging
3.1 AR Design Cycle
3.2 Archival Repository Model Parameters
3.3 Tornado Diagram (MTTF)
3.4 Tornado Diagram (Cost)
3.5 Base values
3.6 MTTF for base values
3.7 MASC for Base Values
3.8 MTTF Evaluation
3.9 Cost Evaluation
3.10 MTTF Sensitivity Analysis (Detection Interval)
3.11 MASC Sensitivity Analysis (Detection Interval)
3.12 MASC Sensitivity Analysis (Detection Cost)
4.1 Number of bits required for typical n, p
4.2 Layers of a Cellular Repository
4.3 The tuple ⟨⟨O1, O2⟩, O3⟩
4.4 A document with versions v1 and v2
4.5 A set with two members
4.6 The repositories after replication
4.7 New versions at Stanford and MIT
4.8 Inconsistent State
5.1 UNI-AWARE Algorithm
5.2 Example of the UNI-AWARE Algorithm
5.3 Custom Functions for the Simple Snapshot Algorithm
5.4 Custom Functions for the Handle Snapshot Algorithm
5.5 Custom Functions for the Timestamp Algorithm
5.6 Example of the Timestamp Algorithm
5.7 Custom Functions for the Log Algorithm
5.8 Custom Functions for the Signature Algorithm
5.9 Example of the Signature Algorithm
5.10 DIST-UNI-AWARE Algorithm
5.11 Custom Functions for the Hierarchical Signature Algorithm
5.12 Example of the Hierarchical Signature Algorithm
5.13 Example of the Clustering Technique
6.1 Routing Indices
6.2 P2P Example
6.3 A Sample Compound RI
6.4 Routing Indices
6.5 Creating a Routing Index
6.6 A sample Hop-count RI for node W
6.7 A sample Exponential Routing Index for node W
6.8 Cycles and Routing Indices
6.9 Simulation Parameters
6.10 Comparison of CRI, HRI, and ERI
6.11 Effect of Number of Documents Requested
6.12 Effect of Number of Potential Results
6.13 Effect of Overcounts
6.14 Effect of Cycles
6.15 Network topology
6.16 Updates and Network Topology
6.17 Updates vs. Queries
7.1 Semantic Overlay Networks
7.2 A Classification Hierarchy
7.3 Classification Examples
7.4 Generating Semantic Overlay Networks
7.5 Classification Hierarchies
7.6 Distribution of Style Buckets
7.7 Bucket Size Distribution for Style Hierarchy
7.8 Distribution of Substyle Buckets
7.9 Bucket Size Distribution for Substyle Hierarchy
7.10 Distribution of Decade Buckets
7.11 Bucket Size Distribution for Decade Hierarchy
7.12 Distribution of Tone Buckets
7.13 Bucket Size Distribution for Tone Hierarchy
7.14 Choosing SONs to join
7.15 Distribution of Style SONs
7.16 SON Size Distribution for Style Hierarchy
7.17 Distribution of Substyle SONs
7.18 SON Size Distribution for Substyle Hierarchy
7.19 Number of messages for query "Spears"
7.20 Average number of messages for a Query Trace
Chapter 1
Introduction
The advancement of human knowledge critically depends on the preservation of knowledge gained in the past. Currently, more and more of that knowledge is stored electronically, in the form of files, databases, web pages, and programs. Unfortunately, most of the systems that store this information are not designed for preserving information over very long periods of time, and their information is vulnerable as technologies and organizations evolve. Furthermore, the information is often stored in an uncoordinated and unintegrated fashion. This can lead to the loss of important components of the intellectual record. This problem will only get worse as more and more information is provided only in digital form.
Digital information can be lost in many different ways; in Chapter 2 we study the threats to digital information in detail. Some of the major threats are:
Media decay and failure: Natural decay processes, such as magnetic decay, may make media unreadable. For reference, while microfilm has a life of 100 to 200 years, magnetic tapes decay in only 10 to 20 years. Media may also fail when it is being accessed, e.g., a tape may break when in a reader, or a disk head may crash into a platter.
Access Component Obsolescence: A document may be lost if we do not have one of the components necessary for supporting access to it. For example, the components required to access documents stored on a floppy disk include a floppy drive and the drivers to access it. If we have a document on an 8-inch floppy disk, but we do not have an 8-inch floppy drive, we will not be able to access the document (even if the media is still readable).
Human and Software Errors: Documents can be damaged by human errors or by software bugs. For example, a person or a program may delete a document (e.g., by reformatting the disk that holds the document), or may improperly modify the files and data structures that represent the document (e.g., by writing random characters in the middle of the document).
External Events: Events external to the archival repository, such as fires, earthquakes and wars, may of course also cause damage.
When considering long-term preservation of digital objects, there are two interrelated factors: data preservation and meaning preservation. To illustrate, consider the Mayan inscriptions on their temples. For us to "read" them, first the carvings and paintings had to be preserved (data preservation) over the centuries. Second, the meaning of their hieroglyphs had to be decoded. Thus, to preserve meaning there needs to be some translation machinery, which is either based on a lot of guesswork (as in the case of Mayan writings) or on aids left behind (which are of course extremely hard to provide in advance). The translation could be done gradually and continuously, to avoid spanning large differences in representations (e.g., translating a document in MS Word 4 to Word 5 to Word 6).

In this dissertation we focus on data preservation only. This is admittedly the much simpler of the two problems, but clearly, without data preservation as a first step, meaning cannot be preserved. Thus, we view a digital object as a bag of bits (with some simple header information, to be discussed). We will not concern ourselves here with whether this object is a postscript file (or any other format), the document that explains how postscript is interpreted (an aid for preserving the meaning of the postscript file), or an object giving the metadata for the postscript file (e.g., author, title). However, we do wish to preserve relationships among objects. That is, we will develop an identification scheme so that one object can "point" or "reference" another one. This way, for instance, the metadata object we just discussed can identify the postscript file it is describing.

Our approach to dealing with the problem of data preservation is to build a reliable archival repository that protects digital information from failures. Users who create digital documents would deposit these documents in their local repository. This repository would then take whatever actions were necessary to protect the documents. Our vision of an archiving system is built on several principles:
• Write-once archiving: Documents, once archived, are never modified or deleted. Updates to documents are represented as version chains.
• Federated replication: A federation of archives works together, storing copies of each others' data so that if a site fails, no documents are permanently lost.
• Scientific design: The policies for constructing and operating an archiving system should be studied scientifically, to find the most reliable and resource-efficient techniques.
• Robust operation: The system should be robust in the face of failures. This means both that documents are preserved, and also that the system operations (document replication, content searches, and so on) continue to run.
Building an archival repository is a complex task because the problem of long-term preservation entails not only a large number of uncertainties about the future, but also a large number of configuration options that could impact the reliability of the system. For example, how many copies should we make of each document? How frequently should we check these copies for corruption? Should these copies be stored on a few expensive disks or many cheap disks? In addition, the designer needs to predict future events such as the reliability of sites and disks, the survivability of formats, how many resources will be consumed by the recovery algorithms, how frequently the recovery algorithms will be invoked, how many user accesses will be made to the documents, and many other uncertainties. In addition to the challenges of long-term preservation, an archival repository is only useful if its documents can be found and accessed efficiently.

Given our problem definition, the reader may wonder if data preservation is a solved problem. After all, a database system can very reliably store objects and their relationships. This may be true as long as the same or compatible software is used to manage the objects, but it is not true otherwise. For instance, suppose that the Stanford and MIT libraries wish to store backup copies of each other's technical reports, but they each use different database systems. It is not possible (at least with current systems) to tell the Stanford system that an object is managed jointly with MIT. Similarly, if Stanford's database vendor goes out of business in 500 years, or Stanford decides to use another vendor, migrating the objects elsewhere can be problematic, since database systems typically represent reliable objects in ways that are intimately tied to their architecture and software.
The objective of our work is not to replace database systems, but rather to allow existing and future systems to work together in preserving an interrelated collection of digital objects (and their versions) in the simplest and most reliable way possible. We aim for a system that provides preservation, even if some efficiency is lost, while ensuring the autonomy of individual archives. Our work complements that of other investigators in areas such as preserving digital formats, ensuring the security of digital objects against malicious users, encoding semantic meaning in digital documents, and so on.

This dissertation is divided into three main parts: (i) design and evaluation of ARs (Chapters 2 and 3), (ii) a reference architecture and algorithms for ARs (Chapters 4 and 5), and (iii) efficient access to ARs (Chapters 6 and 7). Relevant related work is covered in each chapter of the dissertation. Specifically, in this dissertation we study:
• AR Reliability Analysis (Chapter 2): In this chapter we study the archival problem: how a digital library can preserve electronic documents over long periods of time. We analyze how an archival repository can fail and we present different strategies to help solve the problem. We introduce ArchSim, a simulation tool for evaluating an implementation of an archival repository system, and we compare options such as different disk reliabilities, error detection and correction algorithms, and preventive maintenance.
• AR Design Methodology (Chapter 3): Building on the evaluation tools presented in Chapter 2, we study the design of archival repositories. Designing an archival repository is a complex task because there are many alternative configurations, each with different reliability levels and costs. The objective of this chapter is to be able to answer questions such as: how much would it cost to have an AR with a given archival guarantee? Or the converse: given a budget, what is the best archival guarantee that we can achieve? We achieve this objective by systematically and scientifically analyzing each step of the design with the aid of an ArchSim extension that models and evaluates costs. In summary, in this chapter we study the costs involved in an archival repository and we introduce a design framework for evaluating alternatives and choosing the
best configuration in terms of cost. The design framework and the usage of the ArchSim extension are illustrated with a case study of a hypothetical (yet realistic) archival repository shared between two universities.
• Cellular Repository Architecture (Chapter 4): In this chapter we present a reference architecture for archival repositories. This architecture is formed by a federation of independent but collaborating sites, each managing a collection of digital objects. The architecture is based on the following key components: the use of signatures as object handles, no deletions of digital objects, functional layering of services, the presence of an awareness service in all layers, and the use of disposable auxiliary structures. Long-term persistence of digital objects is achieved by creating replicas at several sites.
• Awareness Algorithms (Chapter 5): One of the most critical components in an AR is the awareness mechanism, used to notify clients of inserted, deleted or changed objects. In this chapter we survey the various awareness schemes (including snapshot, timestamp and log based), describing them all as variations of a single unified scheme. This makes it possible to understand their relative differences and strengths. In particular we focus on a signature-based awareness scheme that is especially well suited for digital libraries, and show enhancements to improve its performance.
• Routing Indices (Chapter 6): Finding information in a large federation of ARs, such as the one proposed in Chapter 4, currently requires either a costly and vulnerable central index, or flooding the network with queries. In this chapter we introduce the concept of Routing Indices (RIs), which allow ARs to forward queries to neighbors that are more likely to have answers. If an AR cannot answer a query, it forwards the query to a subset of its neighbors, based on its local RI, rather than selecting neighbors at random or flooding the network by forwarding the query to all neighbors. We present three RI schemes: compound, hop-count, and exponential. We evaluate their performance via simulations, and find that RIs can improve performance by one or two orders of magnitude vs. a flooding-based system, and by up to 100% vs. a random forwarding system. We also discuss the tradeoffs between the different RI schemes and highlight the effects of key design variables on system performance. (A small illustrative sketch of RI-based forwarding appears after this list.)
• Semantic Overlay Networks (Chapter 7): Current network topologies for large federations of nodes, such as the ones used in P2P networks, are usually generated randomly. Each node typically connects to a small set of random nodes (its neighbors), and queries are propagated along these connections. Such query flooding tends to be very expensive. We propose that node connections be influenced by content so that, for example, nodes having many "computer science" documents will connect to other similar nodes. Thus, semantically related nodes form a Semantic Overlay Network (SON). Queries are routed to the appropriate SONs, increasing the chance that matching files will be found quickly, and reducing the search load on nodes that have unrelated content.
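To make the routing-index idea above concrete, here is a minimal sketch, in Python, of RI-based forwarding. It is illustrative only, not the mechanism evaluated in Chapter 6: the neighbor names, topics, and counts are hypothetical, and the "goodness" estimate is a simple sum rather than the compound-RI formula.

    # Hypothetical routing index for one node: for each neighbor, the estimated
    # number of documents per topic reachable through that neighbor.
    routing_index = {
        "neighbor_A": {"databases": 120, "networks": 5,  "theory": 0},
        "neighbor_B": {"databases": 10,  "networks": 80, "theory": 40},
        "neighbor_C": {"databases": 0,   "networks": 3,  "theory": 200},
    }

    def rank_neighbors(query_topics, index):
        """Order neighbors by the estimated number of matching documents."""
        def goodness(neighbor):
            counts = index[neighbor]
            # A simple estimate: sum of per-topic counts for the query's topics.
            return sum(counts.get(topic, 0) for topic in query_topics)
        return sorted(index, key=goodness, reverse=True)

    def forward_query(query_topics, index, stop_after=1):
        """Forward the query to the most promising neighbors instead of flooding."""
        best = rank_neighbors(query_topics, index)[:stop_after]
        for neighbor in best:
            print(f"forwarding query {query_topics} to {neighbor}")
        return best

    forward_query(["theory"], routing_index)       # -> ["neighbor_C"]
    forward_query(["databases"], routing_index)    # -> ["neighbor_A"]

The point of the sketch is the contrast with flooding: only the highest-ranked neighbor receives the query, and the ranking comes entirely from the local index.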
We conclude this thesis by discussing challenging open problems in Chapter 8.

Part I
Design and Evaluation of Archival Repositories
Chapter 2
Evaluating the Reliability of an AR
2.1 Overview
One of the main challenges in designing an archival repository (AR) is how to configure the repository to achieve some target "preservation guarantee" while minimizing the cost and effort involved in running the repository. For example, the AR designer may have to decide how many computer sites to use, what types of disks or tape units to use, what and how many formats to store documents in, how frequently to check existing documents for errors, what strategy to use for error recovery, how often to migrate documents to a more modern format, and so on. Each AR configuration leads to different levels of assurance; e.g., on average a document will not be lost for 1000 years, or in 1000 years we expect to still have access to 99% of our documents. Each configuration has an associated cost, e.g., the disk hardware involved, the computer cycles used to check for errors, or the staff running each site. The number of options and choices is daunting, and the AR designer has few good tools to help in this task. Traditional fault tolerance models and techniques, of the type used to evaluate hardware, are a helpful starting point, but they do not capture the unique complexities of ARs. For example, traditional models may have difficulty capturing different document loss scenarios (e.g., missing interpreter, missing bits, missing metadata) and they frequently assume failure distributions (e.g., exponential) that are too simplistic.

In this chapter we present a powerful modeling and simulation tool, ArchSim, for helping in AR design. ArchSim can model important details, such as multiple formats, preventive maintenance, and realistic failure distribution functions. ArchSim is capable of evaluating a large number of components over very long time periods, and it uses specialized techniques in order to run comprehensive simulations in a time frame that allows the exploration and testing of different policies.

Of course, no model can be absolutely complete: there is an intrinsic tradeoff between how detailed the model is and the complexity (even feasibility) of its analysis. In our case, we have chosen to ignore (at least in this initial study) information loss due to format conversion; that is, we assume that migration to a new format is always successful and does not introduce any loss. We also do not model partial failures (a failure in which we
can salvage part, but not all, of the information in a device). As we will see, we also rely on an "expert" who can provide failure distributions for the components of the system. For instance, if format obsolescence is an issue, an expert needs to give us the probabilities of different format lifetimes. In summary, in this chapter we present:
• A comprehensive model for an AR, including options for the most common recovery and preventive maintenance techniques (Sections 2.2, 2.3 and 2.4).
• A powerful simulation tool, ArchSim, for evaluating ARs and for studying avail- able archival strategies (Section 2.5).
• A detailed case study for a hypothetical Technical Report repository operated between two universities. Through this case study, we evaluate AR factors such as disk reliability, handling of format failures, and preventive maintenance.
2.2 Archival Problems and Solutions
We define an Archival Repository (AR) as a data store that gives a guarantee of long-term survivability of its collection. We will refer to each unit of information in the AR as an Archival Document, or just document for short. As a first step in modeling and evaluating ARs, it is important to understand how information can be lost, and what techniques can reduce the likelihood of loss.
2.2.1 Sources of Failures
An AR fails to meet its guarantee when the archive loses information. Such a loss may be caused by a variety of undesired events, such as the failure of a disk or an operator error. A document is lost if the bits that represent it are lost, and also if the necessary components that give meaning to those bits are lost. We define the components of a document to be all the resources that are needed to support access to the document, e.g., the document bits, the disk that stores the bits, and the viewer that interprets the bits.
An undesired event does not necessarily cause information loss. For instance, if the AR keeps two copies of a document, and the disk holding one of the copies fails, then the document is not lost. It would take a second undesired event affecting the second copy to cause information loss. A document is damaged if a copy or instance of one of its components is corrupted or lost. Damaged documents should be repaired by the system to protect them from further failures.

The most common undesired events that may lead to the loss of a document can be broken down into the following categories. (We do not consider transient failures, e.g., a power failure, that do not lead to a permanent loss.)

Media decay and failure: Natural decay processes may make media unreadable. A well-known example is magnetic decay, which occurs when a medium loses its ability to hold the "bits" that encode a document. In Figure 2.1, we summarize typical decay times for electronic media [8]. These decay times assume "good" storage and operating conditions, and infrequent accesses. Note that decay typically does not affect all parts of a tape or disk simultaneously. For example, a few bits may decay first, rendering a document (or part of a document) unreadable, followed by other decays. Figure 2.1 gives times to the first decay.
Media            Decay Time
Magnetic Tapes   10-20 years
CD-ROM           5-50 years
Newspaper        10-20 years
Microfilm        100-200 years
Figure 2.1: Typical Media Decay Times
Media may also fail when being accessed; e.g., a tape may break when in a reader, or a disk head may crash into a platter. The likelihood of failure increases with the frequency with which the medium is accessed. For example, the expected life of a hard disk that starts and stops spinning very frequently is 5 years, while it is 10 to 20 years for an inactive disk.

Component Obsolescence: A document may be lost if we do not have one of
the components necessary for supporting access to it. To illustrate, we describe two important forms of component obsolescence: media obsolescence and format obsolescence. Media obsolescence occurs when we do not have a machine that can read the medium (even if the medium is still readable). Format obsolescence happens when the system does not know how to interpret the bits that are encoded on the medium. This kind of failure is the result of the non-self-describing nature of information stored in electronic media. That is, the stored bits must be transformed into a human-comprehensible form by a process that cannot be inferred from the original bits. For example, say we have a postscript document. To view it, we must (a) know that its format is some particular version of postscript, (b) have a program that can interpret that version of postscript and display the document on a screen (or printer), and (c) have a computer that can execute the code for the program. If we do not have any of these components, then we cannot access the postscript document.

Human and Software Errors: Documents can be damaged by human errors or by software bugs. Reference [37] suggests that humans and software are the most serious sources of failures. Better technologies and designs are reducing media decay and component failure rates, but there is no sign of a drop in the rate of faults created by software and human errors.

External Events: Events external to the AR, such as fires, earthquakes and wars, may of course also cause damage. Such events may cause several AR components to fail simultaneously. For example, a flood can destroy a collection of disks at one site. If a document were replicated using those disks, all copies would be lost.
2.2.2 Component Failure Avoidance Techniques
There are two well-known ways to avoid an AR failure: we can either reduce the probability of component faults, or we can design the AR so those faults do not result in document loss. In this subsection, we review techniques in the first category, while the next subsection will cover the second category. We organize the presentation in this subsection using the taxonomy of faults presented in Section 2.2.1. 14 CHAPTER 2. EVALUATING THE RELIABILITY OF AN AR
Media decay and failure
A fundamental decision in designing an AR is which medium will be used. In this subsection, we first outline how to choose the appropriate medium for an AR. Then, we present strategies that reduce the rate of decay and failure for these media.

The question of which medium to use involves two decisions. First, we need to decide on the access properties that will be needed from the medium. Second, we need to find the most cost-effective technology that provides these access properties. By access properties, we mean characteristics of the medium such as how fast we can start reading the medium, how fast we can find a document, and whether we can rewrite a document without having to rewrite the whole medium. The reason we need to consider the access properties is that they will affect how we design our system. For example, the time that it takes to start accessing a medium will have an impact on the available strategies for reducing media decay. Specifically, on-line media (i.e., media with very little initial delay) allow for easy detection of decay, as it is cheap to access all the information periodically. On the other hand, detecting decay on off-line media is more complicated because it is expensive to load and read the media. Nevertheless, we need to recognize the tradeoffs between different kinds of media, as off-line media tend to have a lower cost of maintenance and a longer expected life than on-line media.

In terms of underlying technologies, there are many alternatives for archival media. In our work, we have concentrated on "digital" media; however, the boundary between digital media and paper has been blurred by some interesting new media that have been developed recently. For example, "PaperDisk" technology uses 2D barcodes to print electronic information on paper [2]. The barcodes can then be scanned back into the computer and transformed into digital information. By using special algorithms, this technology can save up to 1Mb of information on a sheet of paper.

For ARs, the most important factor in choosing media technology is its lifetime. Evaluating the lifetime of media technologies is a complicated process. Media lifetime is nearly impossible to determine in practice (we cannot wait "n" years for the medium to fail). Therefore, estimates are based on the physical properties of the medium. The
challenge is to find the relevant properties (so we can find the weakest link) and the appropriate model of decay for those properties. For example, a CD-ROM is basically a reflective layer sandwiched between a clear substrate and a lacquer film. A CD-ROM can fail because of either the deterioration of the reflective layer or the loss of optical clarity of the substrate. Deterioration of the reflective layer can occur because of oxidation, corrosion, or delamination. As these three sources of deterioration are well known for the materials used in the reflective layer, we can predict when the layer will deteriorate and eventually fail. Similarly, we can study the causes of losing optical clarity of the substrate and try to predict when this failure will occur.

Beyond the choice of which medium to use, we need to make certain that the medium is stored in the best conditions to ensure its preservation. In the case of tapes, a controlled environment with low temperature and low humidity may increase the life of the tape by an order of magnitude [9].
Component Obsolescence
Avoiding component obsolescence is a complex problem, as an accurate prediction of which operating systems, document formats, and devices will be used in the future is impossible. However, there are some techniques that help avoid the component obsolescence problem. In this subsection, we briefly explore three of these techniques: the use of standards, self-contained media, and the preservation of equipment. In the next section, we will study system techniques that can lessen the problem of component obsolescence.

One way to avoid the component obsolescence problem is to use standards. Standards are (usually) well-documented, which may allow the re-creation of readers for that standard. However, there are some problems with standards. First, there is a bootstrap problem: a person recreating a reader needs to be able to read the documentation for the standard. Second, an important risk when using a standard is that many documents that claim to follow a standard are not really 100% compliant. For example, most HTML documents on the web do not use HTML correctly, so a
browser that is built to accept only correct HTML will not be able to display most documents.

Another solution to the obsolescence problem is the preservation of equipment. This approach has been attempted by the National Media Lab, but it is a difficult and costly task, especially when the parts needed to repair preserved equipment are no longer available.

Another interesting proposal for solving the problem of component obsolescence is the use of a self-contained medium such as the electronic tablet [42]. The electronic tablet is a portable computer that contains all the necessary hardware and software (as well as data) to make a document readable. The only input needed by this tablet is a power source. Although the electronic tablet solves the component obsolescence problem (as everything needed to read the document is in the tablet), no tablets have been built, and it is yet to be determined how cost-efficient they will be.
Human/Software Errors
As stated before, human and software errors are the fastest-growing source of faults in computer systems. Beyond traditional techniques such as program validation and user training, some specific guidelines are useful in an AR. These guidelines include:
• Define interfaces that minimize the amount of damage that can be done: For example, by not allowing deletions or updates in the AR, users cannot make mistakes that will result in the deletion of a file. In addition, by defining a restricted interface, it is easy to distinguish between faults and the normal operation of the system. For instance, if a file disappears in a system that does not allow deletions, then we know that a fault has occurred.
• Minimize the amount of code that can cause damage: The assumption is that the smaller the amount of code, the better the chance that it can be extensively tested and verified.
• Use hardware safeguards to prevent software errors: Hardware can help prevent damage to documents. For example, by opening the "read-only tab" on a
diskette, we instruct the hardware to disallow any writes to it. Another commonly used hardware protection is virtual memory, which can be used to prevent unauthorized writes to some memory addresses.
External Events
A range of techniques can be used to prevent damage from external or environmental events. These techniques include, for example, fire-proof walls, earthquake-proof buildings, etc. These avoidance techniques are beyond the scope of this thesis and will not be discussed further.
2.2.3 System-Level Techniques
Migration
Migration involves replacing a document component with a new, safer one. For example, suppose that "Postscript" readers are becoming obsolete and are being replaced by new "Postscript II" readers. Then, we may decide to migrate Postscript documents into the new format before Postscript readers become unavailable. Migration is particularly effective for storage devices. However, migrating to new formats is more challenging because some information may be lost in the transformation.

Migration techniques can be classified along a continuum ranging from passive strategies to active strategies. In Figure 2.2 we show this spectrum in the context of reducing media decay. As the figure shows, passive strategies include reducing the media decay time either by using better media or by storing media in a "controlled" environment. Active strategies attempt to reduce media decay by reading and copying the data periodically. When the copies are made onto the same medium, the strategy is called periodic refreshing. When the copies are made onto a different medium, the strategy is usually called periodic migration. There are many tradeoffs between active and passive strategies. In general, passive strategies have the advantage of using fewer resources than active strategies, but, unfortunately, they also provide a lower probability of preservation. Active strategies have a high probability of preservation, but they consume more resources. This last point is one of the reasons why archival
models are needed; designers need to make strategy decisions that achieve their target archival guarantee while reducing the total amount of resources consumed.
[Figure 2.2 depicts a spectrum from passive to active strategies: use better media, control the storage environment, periodic refreshing, and periodic migration.]
Figure 2.2: Reducing Media Decay
Finally, migration can also be used to avoid component obsolescence. However, performing a migration process that does not lose some information or formatting is frequently impossible (especially when used transitively, that is, when a document in a given format is migrated to another format, and then in turn is migrated into yet another format). For example, consider translating an ancient text from Greek to Latin, and then from Latin to English. Most people would agree that if the Greek version were available, a direct translation from Greek to English would be better.
Replication
Replication makes copies or creates new instances of needed components to ensure the long-term survival of the component. For example, to prevent media failure, we can write the document onto multiple tapes. If one of the tapes is damaged, then we can use the remaining tapes to access the document. Replication can also be used to survive component obsolescence. For example, a document can be written in multiple formats; in this way, if a format becomes obsolete, we can still access the document in another format.

Unfortunately, replication by itself does not improve the mean time to failure (MTTF) of a system. In [37], it is shown that without repairs the MTTF of a system may decrease when using replication. Specifically, to obtain a benefit from replication, we should repair a failure as rapidly as possible after it has been detected.

An additional advantage of replication is that it simplifies failure detection. If we keep three copies of a document on three different tapes and one of the copies
is different, then we can assume, with very high probability, that the copy that is different has failed and should be discarded. Replication is used by many archival systems, including the Computing Research Repository [34], the Archival Intermemory Project [28, 13], and the Stanford Archival Vault [17].
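As a minimal illustration of how replication simplifies failure detection, the sketch below (a hypothetical example, not part of any of the systems cited above) compares three copies of a document by their hashes and flags the single disagreeing copy for repair.

    import hashlib

    def digest(copy_bytes):
        # Hash each copy so that large documents can be compared cheaply.
        return hashlib.sha256(copy_bytes).hexdigest()

    def find_damaged_copy(copies):
        """Given three copies, return the index of the single disagreeing copy,
        or None if all three digests match.  With three copies, the majority
        value is assumed correct, as discussed above."""
        digests = [digest(c) for c in copies]
        for i in range(3):
            others = [digests[j] for j in range(3) if j != i]
            if digests[i] != others[0] and others[0] == others[1]:
                return i
        return None

    copies = [b"archival document", b"archival document", b"archival docoment"]
    bad = find_damaged_copy(copies)
    if bad is not None:
        # Repair by overwriting the damaged copy with a good one (fast repair).
        copies[bad] = copies[(bad + 1) % 3]

The same comparison also illustrates why repair should follow detection quickly: once one copy is known bad, only two good copies protect the document until the third is recreated.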
Emulation
Emulation involves recreating all the components needed to access a document on a new platform [70]. Emulation can be done at several levels: hardware, operating system, or application software. The computer game community has embraced emulation, and many "platform" emulators are available that allow old computer games to run on modern computers.

The main advantage of emulation is that it can potentially improve the scalability of archival systems. This scalability improvement results from emulation recognizing that components are frequently stacked on top of each other (e.g., to read a document, we need to run software, which needs an operating system, which in turn needs some specific hardware). By choosing an appropriate level, say the operating system level, we do not need to worry about components higher than this level. In other words, if we produce an emulator of Windows 98, we do not need to do anything special in order to use all the programs currently available for that platform. (On the contrary, with all other system solutions, we would need to concern ourselves with each possible component independently.) In addition to this benefit, emulation better preserves the "look and feel" of the original application (which may be important in some fields).

One major disadvantage of emulation is its feasibility. Emulating a piece of software or hardware can be a tricky and difficult task that requires a lot of engineering and collaboration from the current producers of the component. Additionally, the task may infringe on industrial property and copyright laws. Moreover, the benefits of emulation may not materialize in many situations. For example, if we have a relatively uniform collection of documents, the cost of producing an emulation system will probably be much higher than the cost of using replication. In addition, preserving
the “look and feel,” while important in some contexts, can be negative in others (e.g, we do not want to preserve the “feeling” of waiting a couple of days for the answer to a query when accessing an archival database).
Fast Failure Detection and Repair
With most system fault tolerance techniques, we need to check periodically for failed documents. Fast failure detection and repair yields improved reliability. For example, if one of two component copies has failed, the sooner we detect the problem and generate a new copy, the more protection we have against a second fault.

In general, it is impractical (and sometimes impossible) to detect exactly when a fault in a component has occurred. This is because the manifestation of the fault, an error, might only appear long after the fault occurred. For example, a tape can deteriorate and lose information (a fault), but we will not know that the tape has failed until we attempt to read the data and get an error.

Testing for errors may be a difficult task. A typical system has a very large number of components, and it would take a long time to check all of them for errors. In order to reduce the time spent finding errors, we can use sampling (by time or by component type). We can also use partial checks or predictors to select the components that have a high probability of failure.
• Time-based samples: Documents are probed at specific intervals of time so that the probability of losing a document due to multiple failures is small. Obviously, the challenge of this technique is finding the appropriate time interval. Later, in the experiments section of this chapter, we will see how we can use simulations to find good detection intervals.
• Component-based samples: This technique probes components instead of individual documents, an especially useful approach when the number of components is much smaller than the number of documents; a minimal sketch of this idea appears after this list. Specifically, the system performs a test on the component, under the assumption that if the component passes the test, it will most likely work correctly for all documents. A way to perform the test is to select one document (or some
small number of documents) that use the component and try to access them; if we succeed, then the component is considered operational; if not, we have detected a fault. For example, if we are testing the postscript renderer, we can select a document and attempt to display it; if we succeed, the system will assume that it can render all documents stored in that format.
• Partial checks: This technique probes only a small part of each document, under the assumption that if that part of the document is not damaged, the whole document is undamaged. For example, if we have a postscript document, we can just try to print the first page; if we succeed, then we can assume that the rest of the pages can be printed too. Similarly, to test a disk, we only check a few blocks; if those blocks are undamaged, then we assume that the whole disk is undamaged.
• Predictors: We can use "indirect" information to predict the likelihood of a fault in a component. For example, if a new format has been released as a replacement for a format in which we have documents stored, we can use this information as a predictor of the obsolescence of the old format. Similarly, some hardware systems provide warnings of the approaching end of the life of a device. For example, Compaq's Intellisafe [16] and Seagate's S.M.A.R.T. [60] measure several disk attributes, including head flying height, to predict failures. The disk drive, upon sensing degradation of an attribute, such as flying height, sends a notice to the host that a failure may occur. Upon receiving this notice, users can take steps to protect their data.
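The following sketch illustrates the component-based sampling idea from the list above. The inventory and the access routine are hypothetical placeholders; the point is only the sampling logic: probe one document per component and, if the probe succeeds, assume the component works for all of its documents.

    import random

    # Hypothetical inventory: for each component, the documents that depend on it.
    documents_by_component = {
        "disk2":      ["TR1233", "TR1240", "TR1302"],
        "postscript": ["TR1233", "TR1302"],
        "site1":      ["TR1233", "TR1240", "TR1302"],
    }

    def attempt_access(document_id, component_id):
        """Placeholder for a real access attempt (e.g., read the bits and render
        them with the component); here it simply reports success."""
        return True

    def check_components(inventory):
        """Probe each component once via a randomly chosen dependent document.
        If the probe fails, report a fault; otherwise the component is assumed
        operational for all of its documents (the sampling assumption)."""
        suspected_faults = []
        for component, docs in inventory.items():
            probe = random.choice(docs)
            if not attempt_access(probe, component):
                suspected_faults.append(component)
        return suspected_faults

    print(check_components(documents_by_component))   # -> [] while nothing fails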
After we have detected that a failure has occurred, or if we are predicting an imminent failure, we should repair the component as soon as possible. The longer we wait for the component to be repaired, the higher the risk that another failure will lead to information loss. There are many standard techniques for fast repair, such as keeping spares and parts available and training repair personnel.
[Figure 2.3 shows the archival system modules (Archival Document Creation, Archival Document Access, Failure Detection, Damage Repair, Failure Prevention, and Other Services) surrounding a non-fault-tolerant data store that holds the archival documents.]
Figure 2.3: Archival Repository Model
2.3 Architecture of an Archival Repository
Our goal in this section is to identify the elements of a typical AR, so we can model each element and study how it impacts reliability. A typical AR stores documents in a data store that can fail. The AR can still achieve long-term survivability by enhancing the data store with an archival system (AS) that implements some of the techniques of Section 2.2.3. We present our AR model in Figure 2.3. The figure shows the AS modules (in solid-line boxes), the non-fault-tolerant store (in a dashed-line box), and the archival documents. The arrows represent the runtime interactions between the elements.
2.3.1 Archival Documents
In our architecture, an archival document embodies information. An archival document cannot be just a bag of bits; it must also include the components necessary to transform the bits into a human-comprehensible form.

An archival document is an abstract entity. The connection between a document and access to it is achieved through materializations. A materialization is the set of all the components necessary to provide some form of human access to a document. For example, a materialization may include the bits, disks, and format interpreters necessary to display a technical report. The same technical report may be accessible through a different materialization, which may include a different format interpreter to print the technical report.

[Figure 2.4 shows Sites A and B with Disks 1-3 holding Files 1-4, plus an ASCII printer, a PDF displayer, and a Postscript printer, grouped into Materializations 1-4 of Documents 1 and 2.]

Figure 2.4: Materializations
We illustrate materializations in Figure 2.4. In the figure, there are two archival documents, each with two different materializations. For example, Materialization 1 requires the following components to be available: File 1, Site A, Disk 1, and an ASCII printer. Incidentally, note that File 1 is stored on Disk 1, which in turn is at Site A. Such component interdependencies will be discussed later, when we model materialization failures.
The AR is able to preserve documents by preserving the materializations and their components. In this chapter, we treat documents as "black boxes": we do not attempt to take advantage of document structure (e.g., chapters in a book).
2.3.2 Architecture of the Non-Fault-Tolerant Store
The Store encompasses the set of components, such as sites, disks, or format interpreters, that make materializations accessible. Because the store is not fault tolerant, materializations may be lost. A materialization is considered lost when any of its components has failed. If all of the materializations of a document are lost, then the document is considered lost.

To create a materialization, first we must ensure that the necessary components exist in the store. Then we create a record that links the components together as a materialization. For example, say we want to create a document materialization that requires the bits in file "doc.ps," which are located on disk2 at site1, and requires a postscript interpreter. Any components that do not already exist (e.g., the file doc.ps) are created. In some cases, "creating" a component may require a physical action, e.g., adding a new printer or disk to the store. Once the components exist, a metadata record containing references to the components is created (e.g., ⟨site1, doc.ps, disk2, postscript⟩). This record serves as the identifier for the materialization.

A document metadata record includes the records for all available materializations. If this metadata record is corrupted or lost, the document will not be accessible. Therefore, the record must be one of the required components for any materialization.
2.3.3 Architecture of the Archival System
The AS provides fault tolerance by managing multiple materializations for each document. The AS monitors these materializations and, when a failure is detected, attempts to repair them. The AS can improve fault tolerance further by taking preventive actions to avoid failures. The AS provides the user with the ability to create and retrieve archival documents. It also provides miscellaneous services such as indexing, security, and document retirement, among others. When a user requests a document, the AS uses its metadata to find all the available materializations of that document, selects one, and returns it to the user. In this section, we describe the six modules that make up an AS (see Figure 2.3).
The Archival Document Creation module (ADC) generates new documents, implementing policies on the number and types of materializations that are needed. For example, say an administrator has decided that documents should be materialized as illustrated by Document 1 in Figure 2.4. Then, for each new document, the ADC creates (on the data store) the appropriate "File 1" and "File 2" and the document metadata record, and checks that the other components exist. The main objective of this chapter is to provide a framework that allows a system administrator to choose the best materialization policies to achieve a desired level of reliability.

The Archival Document Access module (ADA) services requests for documents. Basically, the module translates a request for a document into a request for the appropriate components of one of the document's materializations.

The Failure Detection module (FD) scans the store looking for damaged or lost materializations. When a damaged materialization is found, the failure detection module informs the Damage Repair module (described below) about the problem.

The Damage Repair module (DR) attempts to repair damaged documents. There are many strategies for repairing a damaged document, as discussed in Section 2.2.3. The input of the DR module is a signal from the FD module.

The Failure Prevention module (FP) scans the store and takes preventive measures so that materializations are less likely to be damaged. For example, the FP module may copy components that are stored on a disk that is close to the end of its expected life onto a newer disk.

Finally, the Other Services module (OS) provides miscellaneous services such as indexing, security, and document retirement. Retiring a document involves removing from the store any components that are no longer needed, even by other materializations.
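A minimal, self-contained sketch of how the FD and DR modules might cooperate is given below. The store layout, the damaged set, and the repair step are hypothetical simplifications, not the interfaces used later in the dissertation; the sketch only shows the scan-repair loop described above.

    # Toy in-memory store: for each document, its materialization ids, plus a
    # set of materializations currently known to be damaged (hypothetical names).
    store = {
        "TR1233": ["M1", "M2"],
        "TR1240": ["M3"],
    }
    damaged = {"M1", "M3"}          # e.g., reported by component checks

    def detect_and_repair(store, damaged):
        lost_documents = []
        for doc_id, materializations in store.items():
            bad = [m for m in materializations if m in damaged]
            good = [m for m in materializations if m not in damaged]
            if bad and good:
                # DR module: recreate each damaged materialization from a good one.
                for m in bad:
                    print(f"repairing {m} of {doc_id} from {good[0]}")
                    damaged.discard(m)
            elif bad and not good:
                # Every materialization is damaged: the document is lost.
                lost_documents.append(doc_id)
        return lost_documents

    print(detect_and_repair(store, damaged))   # -> ['TR1240']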
2.4 Failure and Recovery Modeling
In this section, we will model the failure and recovery characteristics of an AR, based on the architecture presented in the previous section. First, we will explore how to
model a non-fault-tolerant store, and then the archival system (AS). Later, we will combine these two models into an archival document model.
2.4.1 Modeling a Non-Fault-Tolerant Store
To model the failure characteristics of a store, we start with an abstract representation of materializations and components. We model a materialization as an n-tuple ⟨matid, docid, comp1, ..., compn⟩, where matid is the materialization identifier, docid is the document identifier, and comp1, ..., compn are the components required to provide the required document access. The identifiers matid and docid together form a unique id for the materialization. For example, ⟨M1, TR1233, doc.ps, site1, disk2, postscript⟩ means that the materialization M1 contains the document identified by TR1233 that needs the bits in file doc.ps, disk2, site1, and the postscript interpreter in order to be readable. A document can have more than one materialization. For example, Technical Report 1233 can also have the materialization ⟨M2, TR1233, doc.ps, site1, disk3, postscript⟩, which would be a copy of M1 but on a different disk (disk3).
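To make the tuple notation concrete, the following minimal sketch (the class and field names are ours, purely illustrative, and not part of any AR implementation described here) represents a materialization as its identifier pair plus the list of component identifiers it requires:

    import java.util.Arrays;
    import java.util.List;

    // A materialization <matid, docid, comp1, ..., compn>:
    // matid and docid together identify it; the components are whatever
    // is needed (files, disks, sites, interpreters) to render the document.
    final class Materialization {
        final String matId;
        final String docId;
        final List<String> componentIds;

        Materialization(String matId, String docId, List<String> componentIds) {
            this.matId = matId;
            this.docId = docId;
            this.componentIds = componentIds;
        }

        public String toString() {
            return "<" + matId + ", " + docId + ", " + String.join(", ", componentIds) + ">";
        }

        public static void main(String[] args) {
            // M1 and M2 materialize TR1233 on two different disks.
            Materialization m1 = new Materialization("M1", "TR1233",
                    Arrays.asList("doc.ps", "site1", "disk2", "postscript"));
            Materialization m2 = new Materialization("M2", "TR1233",
                    Arrays.asList("doc.ps", "site1", "disk3", "postscript"));
            System.out.println(m1);
            System.out.println(m2);
        }
    }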
We model components by a tuple ⟨componentid, type⟩, where componentid is a unique identifier for the component instance and type is the class (e.g., file, disk, interpreter) to which the instance belongs. To further model components, we need to describe the following (a small sketch of such a component descriptor appears after this list):
• How many component instances and types are present in the system: that is, how many disks, formats, etc., are available.
• Failure distribution of each component type. Many components have two different failure distributions, one during archival and another during access. For example, a tape is more likely to fail when it is being manipulated and mounted on a reader than when it is stored. Therefore, each component will have two failure distributions: during archival (i.e., time to next failure when the component is not used) and during access. For some components, such as disks or sites, the access and archival distributions will be the same; but for other components, such as tapes or diskettes, these distributions can be very different.
• Time distribution for performing a component check. This distribution describes how long it takes to discover a failure (or to determine that a component is good), from the time the check process starts. For example, consider checking a tape. This may involve getting the tape from the shelf, mounting the tape, and scanning the tape for errors.
• Time distribution for repairing a component failure. This distribution describes how long it takes to repair a component. This distribution may be deterministic (if the component can be repaired in a fixed amount of time). Repair time may be “infinite” if the component cannot be fixed.
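As a rough illustration of the bookkeeping this list implies, the sketch below attaches the four time distributions to a component type; the Sampler interface and all names are our own stand-ins for whatever distribution objects a concrete model would plug in:

    import java.util.Random;

    // A placeholder for any of the time distributions in the list above.
    interface Sampler {
        double sample(Random rng);   // returns a time, in the model's time unit
    }

    // A component type (disk, site, format interpreter, ...) with its
    // archival-failure, access-failure, check-time, and repair-time distributions.
    final class ComponentType {
        final String name;
        final Sampler failureDuringArchival;
        final Sampler failureDuringAccess;
        final Sampler checkTime;
        final Sampler repairTime;

        ComponentType(String name, Sampler failureDuringArchival, Sampler failureDuringAccess,
                      Sampler checkTime, Sampler repairTime) {
            this.name = name;
            this.failureDuringArchival = failureDuringArchival;
            this.failureDuringAccess = failureDuringAccess;
            this.checkTime = checkTime;
            this.repairTime = repairTime;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            // A tape: exponential failures (more likely during access), fixed check and repair times.
            ComponentType tape = new ComponentType("tape",
                    rng1 -> -1000.0 * Math.log(1 - rng1.nextDouble()),  // archival: mean 1000 days
                    rng1 -> -250.0 * Math.log(1 - rng1.nextDouble()),   // access: mean 250 days
                    rng1 -> 0.5,                                        // half a day to check
                    rng1 -> 1.0);                                       // one day to replace
            System.out.println(tape.name + " fails (archival) after "
                    + tape.failureDuringArchival.sample(rng) + " days");
        }
    }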
In addition, there is an important interdependency between components. Specifically, the failure of one component may cause the failure of another component. For example, if a site fails (e.g., because it was destroyed by a fire), then all the disks at the site will also fail. As pointed out earlier, we are only taking into account permanent failures; transient failures (e.g., the site was temporarily disconnected from the network) are ignored. This failure dependency is captured by a directed graph (a small sketch of such a graph appears at the end of this subsection). For example, an arrow between "Site A" and "Disk 1" in the interdependency graph means that if "Site A" fails, then "Disk 1" will also fail.

We close this subsection with two comments. First, we do not claim that the model presented for the store is complete. For instance, we have not included policies for handling concurrent access. There is always a tradeoff between the complexity of the model and our ability to analyze it. We believe that our model strikes a good balance in this respect, and captures the essential features of a store. Second, the reliability predictions we make are only valid for the current configuration of the repository. Over time, the repository will change (e.g., as new devices are introduced), so we may need to change our repository model. As the model changes, we may need to revisit our predictions.
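The failure dependency graph can be captured, for instance, as an adjacency map from a component to the components that fail with it. The following sketch (ours, purely illustrative) propagates a failure through such a graph:

    import java.util.*;

    // Failure dependency graph: an edge A -> B means "if A fails, B fails too".
    final class DependencyGraph {
        private final Map<String, List<String>> dependents = new HashMap<>();

        void addEdge(String from, String to) {
            dependents.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        }

        // All components lost when 'failed' fails, including transitive casualties.
        Set<String> cascade(String failed) {
            Set<String> lost = new LinkedHashSet<>();
            Deque<String> todo = new ArrayDeque<>();
            todo.push(failed);
            while (!todo.isEmpty()) {
                String c = todo.pop();
                if (lost.add(c)) {
                    todo.addAll(dependents.getOrDefault(c, Collections.emptyList()));
                }
            }
            return lost;
        }

        public static void main(String[] args) {
            DependencyGraph g = new DependencyGraph();
            g.addEdge("Site A", "Disk 1");   // if Site A burns down, Disk 1 is gone
            g.addEdge("Disk 1", "doc.ps");   // and with Disk 1, the file it holds
            System.out.println(g.cascade("Site A"));   // [Site A, Disk 1, doc.ps]
        }
    }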
2.4.2 Modeling an Archival System
In this section, we describe how to model the behavior of the modules of the archival system. We do not include failure distributions for these modules as we are assuming
that the AS itself does not fail. We recognize that this is a strong assumption, but in this chapter, we have chosen to concentrate on the failure of components, instead of on the failure of the system that provides fault tolerance. Techniques for developing robust, error-tolerant software have been studied in [63].

In general, we model the input of the modules with probability distribution functions and their behavior by algorithms. For example, consider the document creation (ADC) module. Its input distributions tell us how frequently requests for document creation arrive, how many materializations each new document will have, and which components will be selected to participate in a given materialization. The algorithms for the failure detection (FD) module spell out what policies are implemented, e.g., whether or not all components are checked on a regular basis.

The probability distributions that drive the model can be obtained in different ways. If we have data from a real system, we can use the data directly (trace driven), we can define an empirical distribution, or we can fit the data to a theoretical distribution [4]. If we do not have real data, we need to choose a theoretical distribution that matches our intuition. A sensible choice (when requests are generated independently) is a Poisson arrival process, i.e., exponentially distributed inter-arrival times [12] (a small sampling sketch appears at the end of this subsection).

We summarize the model parameters in Figure 2.5. The figure is divided into three parts. At the top are the parameters that describe the AR: the number of components and their types, and the failure dependency graph. Then, we list all the distributions needed for the model, with the units being modeled in parentheses. Finally, we list all the policies and algorithms that must be defined to model the archival system.
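For instance, if access or creation requests form a Poisson process, their inter-arrival times are exponentially distributed and can be sampled by inverse transform. A minimal sketch (illustrative only; the rate value is an assumption, not a measured parameter):

    import java.util.Random;

    // Inter-arrival times of a Poisson process with the given rate
    // (events per time unit) are exponential with mean 1/rate.
    final class PoissonArrivals {
        static double nextInterArrival(double rate, Random rng) {
            return -Math.log(1 - rng.nextDouble()) / rate;
        }

        public static void main(String[] args) {
            Random rng = new Random(7);
            double rate = 0.05;          // e.g., one access request every 20 days on average
            double t = 0;
            for (int i = 0; i < 5; i++) {
                t += nextInterArrival(rate, rng);
                System.out.printf("request %d at day %.1f%n", i + 1, t);
            }
        }
    }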
2.4.3 Modeling Archival Documents
In this section, we combine the models for the data store and the AS, in order to describe the life of an archival document. In Figure 2.6, we depict our model for the life of an archival document. The life of a document starts when its materializations are created by the ADC module and handed to the store. The creation of a document may not be an instantaneous process. For example, if long-term survival is achieved by keeping multiple copies, the document is not considered archived until all the
• AR Description
  – Number of components and types
  – Failure dependency graph
• Distributions
  – For each component type:
    ∗ Failure distribution during access (time)
    ∗ Failure distribution during archival (time)
    ∗ Failure detection distribution (time)
    ∗ Repair distribution (time)
  – Document creation distribution (time)
  – Document access distribution (time)
  – Access duration distribution (time)
  – Document selection distribution (document)
• Policies
  – Document Creation policy
  – Document to materializations policy
  – Failure detection algorithm
  – Damage Repair algorithm
  – Failure prevention algorithm
Figure 2.5: Archival Repository Model Parameters
copies are generated. Once the ADC module has taken all the actions that ensure the long-term survival of the document, then the document has full protection, and we say that the document is in the Archived state.

When any of the document components fail (based on the distributions in Figure 2.5), the document is considered to be in the Damaged state and becomes temporarily unprotected. For example, if we keep two copies of a document and one of the copies is lost, then the document would be damaged. As explained earlier, the AS will not know that a document is damaged until the FD module detects the failure. When the failure is detected, the document goes to the Damage Detected state.

When damage to a document is detected (by the policies summarized in Figure 2.5), the AS starts actions to restore the document and, hopefully, return it to the Archived state. For example, if the document is damaged because one of its copies
[Figure 2.6 is a state diagram with nodes "Document is created," "Document is Archived," "Document is Accessible," "Document is Damaged," "Damage detected," "Document is retired," and "Document is Lost," connected by numbered transitions.]
Figure 2.6: Archival Document Model
was lost, the repository can replace the damaged copy by creating a fresh one from one of the good copies. However, if the repair is not successful, then the document may be Lost. This latter state is the one that we want to avoid in an archival system.

We also distinguish two additional states: Accessible and Retired. When a document is in the Accessible state, it can be accessed (e.g., read, printed) by users, which is not the case for the Archived state. For example, if some of the document's components are stored on a tape which is kept in a safe, we need to take the tape out and mount it in a reading device to make it accessible. When the tape is stored, the document is in the Archived state; when the tape is mounted, it is in the Accessible state. As we discussed in the previous section, by making the document accessible, in general, we are increasing the chances of damaging the document. Therefore, the probability of transition 7 is, in general, greater than the probability of transition 5.

The Retired state allows users to mark the documents that are not needed anymore. In this case, the document is retired from the archival system and the system does not provide any survivability guarantees. It is important to note that retiring a document may eventually result in removing all materializations from the store. This action is different from taking a document "out of circulation," in which case the document is no longer available to regular users, but it is still preserved for historical reasons.
In the Archival Document Model, transitions between states are due to deliberate actions taken by the AR or due to external events. An example of a deliberate action is the creation of enough copies to trigger the transition between the Created state and the Archived state. An example of an external event is the failure of a disk, causing the transition of a document from Archived to Damaged. Most external events have a probabilistic nature.

The archival document model presented contains only the main states for a document. In some particular scenarios, some of the states may merge or may not exist at all. For example, in an on-line archival system, there may not be any difference between the "Archived" and "Accessible" states. Similarly, depending on the actual policies of the AR, some of the states can be expanded. For example, if a system has a "Document to Materialization" policy that requires the creation of three copies of the document, the Damaged state expands into six states, one for each possible combination of one or two copies being lost. It is important to note that for analyzing an archival system we do not need to expand the nodes unless we are interested in the specific probability of being in those expanded nodes.

Also note that we have simplified our document model by assuming some transitions do not occur. For example, we do not show a transition directly from the Archived state to the Lost state, as we assume that documents are always damaged before they are lost. We also do not show a transition between the Damaged state and the Archived state, as we assume that damage does not repair itself.
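One minimal way to encode this state machine is an enum of states plus a table of allowed transitions; the sketch below is our own reading of the discussion above, and the exact transition set of Figure 2.6 may differ:

    import java.util.*;

    // States of an archival document (cf. Figure 2.6) and the transitions we allow.
    enum DocState { CREATED, ARCHIVED, ACCESSIBLE, DAMAGED, DAMAGE_DETECTED, RETIRED, LOST }

    final class DocumentLifecycle {
        private static final Map<DocState, EnumSet<DocState>> ALLOWED = new EnumMap<>(DocState.class);
        static {
            ALLOWED.put(DocState.CREATED, EnumSet.of(DocState.ARCHIVED));
            ALLOWED.put(DocState.ARCHIVED, EnumSet.of(DocState.ACCESSIBLE, DocState.DAMAGED, DocState.RETIRED));
            ALLOWED.put(DocState.ACCESSIBLE, EnumSet.of(DocState.ARCHIVED, DocState.DAMAGED));
            ALLOWED.put(DocState.DAMAGED, EnumSet.of(DocState.DAMAGE_DETECTED, DocState.LOST));
            ALLOWED.put(DocState.DAMAGE_DETECTED, EnumSet.of(DocState.ARCHIVED, DocState.LOST));
            ALLOWED.put(DocState.RETIRED, EnumSet.noneOf(DocState.class));
            ALLOWED.put(DocState.LOST, EnumSet.noneOf(DocState.class));
        }

        static boolean canMove(DocState from, DocState to) {
            return ALLOWED.getOrDefault(from, EnumSet.noneOf(DocState.class)).contains(to);
        }

        public static void main(String[] args) {
            System.out.println(canMove(DocState.ARCHIVED, DocState.LOST));  // false: damage comes first
            System.out.println(canMove(DocState.DAMAGED, DocState.LOST));   // true
        }
    }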
2.4.4 Example: An Archival System based on Tape Backups
Let us describe the archival document model in more detail by using a specific example (Figure 2.8). This archival system saves documents on two tapes. The tapes are stored in a controlled environment (i.e., a safe) to improve preservation. We summarize the parameters for the model of this system in Figure 2.7. The operation of the system starts with the ADC saving n documents on each of the tapes. For simplicity in this example, we will assume that this operation always succeeds. Let us assume that the time between access requests to the ADA module is exponentially distributed with rate A.
• AR Description
  – Initial collection: empty.
  – Number of components and types: 2 tape drives
  – Failure dependency graph: empty
• Distributions
  – Tape failure distribution during access: exponential with rate D
  – Tape failure distribution during archival: exponential with rate L
  – Tape failure detection distribution: exponential with rate R
  – Tape repair distribution: uniform with probability 1.
  – Document creation rate: n documents at startup, then no documents are created.
  – Document access period: A
  – Access duration period: S
  – Document selection: uniform over the n documents.
• Policies
  – Document Creation policy: write in both tapes.
  – Document to Materialization: read from any working tape.
  – Failure detection algorithm: complete scan of both tapes taking a negligible amount of time.
  – Damage Repair algorithm: discard bad tape and replace with new copy taking a negligible amount of time.
  – Failure prevention algorithm: none
Figure 2.7: A model for a dual-tape system

The duration of a successful tape access is also exponentially distributed, with rate S. The FD module scans the tapes and finds all failures at a rate R. The DR module can repair all failures as long as one of the tapes is working, and the repair is done instantaneously. Therefore, the system will lose the documents if both tapes fail in between the scans of the FD module. The system has neither an FP nor an OS module. Additionally, let us assume that while the tapes are in a safe environment, they become unreadable under an exponential distribution with rate L. When the tapes are being read, the rate of damage increases to D (D > L). Using the information from the archival model, we can model the life of a document
[Figure 2.8 depicts the repository with its Archival Document Creation, Archival Document Access, Failure Detection, and Damage Repair modules, and the two tapes stored in a safe.]
Figure 2.8: A Dual-tape Archival System
as it is presented in Figure 2.9. The document is considered Archived after we finish copying the document onto the tapes and the tapes are in a safe place. As this process always succeeds, the transition to Archived happens with probability 1.
Given that we have two tapes, the distribution for the transition from Archived to Damaged will be exponential with rate 2L (if we assume independent failures). When making the document accessible, we increase the rate of failure of a tape to D; thus, the rate of the transition from Accessible to Damaged will be D + L: D for the mounted tape and L for the tape that stays in storage. Finally, when a tape is damaged, the system will attempt to recover it by using the other tape. We are assuming that this recovery succeeds at rate R, in which case we return to the Archived state. However, if, while in the Damaged state, the second tape becomes unreadable (which happens at rate L), then the document is lost. The rate of access to the document is A (not A/n, where n is the number of documents); this is because when accessing a document, we are actually accessing all documents on the tape.
[Figure 2.9 is a state diagram with states "Document is created," "Document is Archived," "Document is Accessible," "Document is Damaged," and "Document is Lost"; its transitions are labeled with the rates 1, A, S, R, 2L, D + L, and L.]
Figure 2.9: A Dual-tape Archival System
Analyzing the Dual-tape Archival System
Our goal is to find how long it would take for the Dual-tape Archival System to lose documents. In other words, we want to find the mean time to failure (MTTF) of the system. The computation of the MTTF of an AR can be done in two ways: we can attempt to compute it analytically, or we can infer it statistically by simulating the system. The analytical approach consists of modeling the problem as a Markov Decision Process (MDP) and finding a solution for it. However, MDPs make strong assumptions that may not be true in archival systems, and the computational complexity of finding a solution for them is high. The basic assumption of Markov models is that the probability of a given state transition depends only on the current state [74]. Moreover, the amount of time already spent in a state does not influence either the probability distribution of the next state or the probability distribution of remaining time in the same state before the next transition. This assumption prevents us from using realistic failure distributions. In addition, the computational complexity of solving MDPs has been proven in [62] to be P-complete. This means that, although the problem is solvable in polynomial time, if an efficient parallel algorithm
were available, then all problems in P would be solvable efficiently in parallel (which is considered an unlikely outcome at the moment). In practice, this result means that, given the high order of the polynomials describing the complexity of solving an MDP, it would take months to solve a general MDP with a million nodes/transitions. In the next section, to overcome the limitations of solving MDPs, we will introduce a Monte-Carlo simulation tool that allows the use of general failure distributions and is able to find the MTTF of complex archival systems in a short time.
The Dual-tape Archival System is simple enough that it can be modeled using a Markov chain, and the mean time to failure can be computed exactly:
MTTF = \frac{S \cdot R + D \cdot R + D \cdot L + 2L^2 + 4L \cdot R + 3L \cdot S + 2A \cdot L + 2A \cdot R}{2L^3 + 2L^2 \cdot D + A \cdot L^2 + A \cdot D \cdot L + 2L^2 \cdot S}    (2.1)
Formal modeling allows us to make precise allocation decisions. For example, if we use the parameters in Figure 2.10, then the MTTF of the system will be 9,316 days (or about 25 years). We can also use this result to make resource allocation decisions. Let us suppose that we want to improve the MTTF to 100 years and we can do so by improving the rate of detection and repair of damaged media (i.e., change the value of R). In this case, we can compute the needed repair frequency to achieve a MTTF of 100 years: 13 days. These 13 days include the sum of the frequency of checking the tapes plus the number of days that it takes to replace a defective tape with a new copy. Specifically, if it takes 1 day to create a new tape, then we need to check all the tapes every 12 days.
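As a sanity check on Equation 2.1, the snippet below plugs in the Figure 2.10 values (all rates per day) and reproduces, up to rounding, the numbers quoted above; it is a direct transcription of the formula, not part of ArchSim:

    // Evaluates Equation 2.1 for the dual-tape system.
    final class DualTapeMttf {
        static double mttf(double L, double D, double R, double A, double S) {
            double num = S*R + D*R + D*L + 2*L*L + 4*L*R + 3*L*S + 2*A*L + 2*A*R;
            double den = 2*L*L*L + 2*L*L*D + A*L*L + A*D*L + 2*L*L*S;
            return num / den;
        }

        public static void main(String[] args) {
            double L = 1.0 / 1000;   // tape failure rate in stable storage
            double D = 1.0 / 250;    // tape failure rate while reading
            double A = 1.0 / 20;     // access request rate
            double S = 1.0 / 2;      // access completion rate (2-day accesses)
            // Base case: roughly 9,316 days, i.e., about 25 years.
            System.out.printf("base case: %.0f days%n", mttf(L, D, 1.0 / 60, A, S));
            // Checking/repairing every 13 days raises the MTTF to roughly 100 years.
            System.out.printf("faster repair: %.0f days%n", mttf(L, D, 1.0 / 13, A, S));
        }
    }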
The computation of Equation 2.1 is complex, even though we are not considering software errors, site failures, etc. When considering all those factors, obtaining a closed-form result is very difficult, and in some cases, impossible. In the next section we will introduce a simulation tool, ArchSim, that will be able to handle more complex cases.
Mean time to failure (days)
  Tape in stable storage (1/L)    1000
  Tape when reading (1/D)          250
Access rates (days)
  Access frequency (1/A)            20
  Access duration (1/S)              2
Failure detection and repair rate (avg. days)
  Tape (1/R)                        60
Figure 2.10: Parameter values
2.5 ArchSim: A Simulation Tool for Archival Repositories
To evaluate a possible AR configuration, we need to predict how well it protects documents. This prediction can sometimes be done analytically, but as the AR gets more complex, an analytical solution is impractical (and sometimes impossible). Instead, we rely on a specialized simulation engine for archival repositories: ArchSim. We start this section by discussing the specific challenges confronted when simulating an AR. Then we describe ArchSim and its libraries.
2.5.1 Challenges in Simulating an Archival Repository
ArchSim builds upon existing simulation techniques for fault-tolerant systems. However, the unique characteristics of archival repositories make their simulation challenging:
• Time Span: The life of an archival system is measured in hundreds, perhaps thousands of years. This means that simulation runs will be extremely long, so special precautions must be taken to make the simulation very efficient. Furthermore, given these long periods, failure distributions must take into account component "wear-out." (A component is more likely to fail after 50 years than when it is new.) Simple failure distributions (e.g., exponentially distributed
time between failures) are frequently used in fault-tolerant studies, but they cannot be used here since they do not capture wear-out.
• Repairs: In an archival system we cannot in general assume that damaged components can always be replaced by new identical components (another common assumption when studying fault-tolerant systems). For example, after 100 years, it may be impossible or undesirable to replace a disk with one having the same failure characteristics.
• Component models: Component models are fairly rich, compounding the number of states that must be considered. For instance, as we have discussed, a file is not simply correct or corrupted. Instead, it can be corrupted but the error undetected, it can be correct but not accessible for reads, and so on. The failure models in each of these states may be different; e.g., a file is more likely to be lost when being read.
• Sources of failures: A document can be lost for many reasons; e.g., a disk fails or a format becomes obsolete. Each of these failures has very different models and probability distributions. The approach of finding the “weak link” and assuming that all other factors can be ignored is not appropriate for ARs.
• Number of Components: An AR needs to deal with a large number of components and materializations. The challenge of simulating a large number of objects has been studied extensively [29, 59] and ArchSim uses those results.
2.5.2 The Simulation Engine: ArchSim
ArchSim receives as input an AR model, a stop condition (stop when the first document is lost or when all documents are lost), and a simulation time unit (minutes, hours, days, etc.). ArchSim outputs the mean time to failure (mean time to stop condition), plus a confidence interval for this time. We are currently considering other output metrics, e.g., the fraction of the documents that are available after some fixed amount of time. However, these other metrics are not used in our case study (Section 2.6).
To define the AR model, each distribution and policy is implemented as a Java object, so they can be as general as necessary. For example, for component repair, the corresponding Java module can simply use a probability distribution (perhaps one of the library functions described below) to generate the expected repair time. However, that module can easily be replaced by one that first decides if the component can be repaired (using one probability distribution), and then for each case generates a completion time (when the component is repaired or the repair is declared unsuccessful).
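The two repair-time modules just described might look roughly as follows; the interfaces and class names are ours, purely to illustrate how a policy object can be swapped, and do not reflect ArchSim's actual API:

    import java.util.Random;

    // Pluggable models of how long a component repair takes.
    final class RepairModels {
        interface RepairTimeModel {
            double repairTime(Random rng);   // simulated repair duration, in days
        }

        // Simplest version: repair time drawn from an exponential distribution.
        static final class ExponentialRepair implements RepairTimeModel {
            private final double meanDays;
            ExponentialRepair(double meanDays) { this.meanDays = meanDays; }
            public double repairTime(Random rng) {
                return -meanDays * Math.log(1 - rng.nextDouble());
            }
        }

        // Richer version: first decide whether the component is repairable at all.
        // For simplicity, an unrepairable component gets an infinite repair time; the
        // variant in the text would instead draw the time at which the repair is
        // declared unsuccessful.
        static final class MaybeUnrepairable implements RepairTimeModel {
            private final double repairProbability;
            private final RepairTimeModel ifRepairable;
            MaybeUnrepairable(double repairProbability, RepairTimeModel ifRepairable) {
                this.repairProbability = repairProbability;
                this.ifRepairable = ifRepairable;
            }
            public double repairTime(Random rng) {
                return rng.nextDouble() < repairProbability
                        ? ifRepairable.repairTime(rng)
                        : Double.POSITIVE_INFINITY;
            }
        }

        public static void main(String[] args) {
            Random rng = new Random(11);
            RepairTimeModel simple = new ExponentialRepair(30);
            RepairTimeModel richer = new MaybeUnrepairable(0.9, simple);
            System.out.println("simple: " + simple.repairTime(rng) + " days");
            System.out.println("richer: " + richer.repairTime(rng) + " days");
        }
    }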
2.5.3 Library of Failure Distributions
ArchSim makes available a library of pre-defined failure distributions that can be used to describe AR components. The distributions in the library are: deterministic, uniform, exponential, infant mortality, bathtub, and historical survival. To illustrate, in Figure 2.11(a)-(e), we sketch generic versions of these probability distributions. (The area under these curves, from time 0 to t, represents the probability the component will fail by time t.)
1. Deterministic Model (graph not shown): This is a deterministic model in which the event happens (with probability one) at a fixed time. It is useful for describing deterministic events such as the start of a failure detection process. The mass function of this distribution is:
p(t) = \begin{cases} 1 & \text{if } t = t_0; \\ 0 & \text{otherwise.} \end{cases}
where t0 is the time when the event will happen.
2. Uniform Model (Figure 2.11a): This is a very simple model based on a uniform distribution. Uniform distributions are convenient, as they are easy to generate and analyze. The density function of a uniform distribution is:
f(t) = \begin{cases} \frac{1}{t_1 - t_0} & \text{if } t \in [t_0, t_1]; \\ 0 & \text{otherwise.} \end{cases}
where [t0, t1] is the time interval where the failure may happen.
3. Exponential model (Figure 2.11b): This model assumes that the probability of failure at any single point in time is constant (i.e., the model is memoryless). In many fault-tolerant systems, exponential distributions are used instead of more complex distributions (described next) under the assumption that components are checked before installation and that the life of the system is much shorter than the life of the components. However, in an archival system, these assumptions are not always true. The density function of the exponential distribution is:
f(t) = \begin{cases} \frac{1}{\beta} e^{-t/\beta} & \text{if } t \ge 0; \\ 0 & \text{otherwise.} \end{cases}
where β is the mean (expected time to failure) of the distribution.
4. Infant Mortality model (Figure 2.11c): This model is good for describing failures due to system or human errors. At first, when the document has been recently created, the probability is high (as there could be errors in its creation and/or initial storage); then the probability drops sharply and remains fairly constant during the rest of the life of the document. The density function of this distribution can be obtained by using a Weibull distribution in the infant mortality phase and an exponential distribution in the constant phase:
f(t) = \begin{cases} \alpha \beta^{-\alpha} t^{\alpha-1} e^{-(t/\beta)^{\alpha}} & \text{if } t \in [0, t_1); \\ \frac{1}{(\beta/t_1)^{\alpha}\, t_1}\; e^{-(t-t_1)/((\beta/t_1)^{\alpha}\, t_1)} & \text{if } t \in [t_1, \infty); \\ 0 & \text{otherwise.} \end{cases}
where [0, t1) is the infant mortality period, [t1, ∞) is the constant period, α (0 < α < 1) is the shape parameter (values closer to zero generate higher rates of early failures), and β (β > 0) is a scale parameter used to adjust the mean of the distribution during the infant mortality phase.
5. Bathtub model (Figure 2.11d): This failure probability function is considered typical of electronic devices. In the bathtub distribution, the probabilities of failure early in the component's life (infant mortality phase) and late in its life (wear-out phase) are higher than during the middle years. The density function of this distribution can be obtained by using a Weibull distribution in the infant mortality phase, an exponential distribution in the constant phase, and another Weibull distribution in the wear-out phase:
f(t) = \begin{cases} \alpha \beta^{-\alpha} t^{\alpha-1} e^{-(t/\beta)^{\alpha}} & \text{if } t \in [0, t_1); \\ \frac{1}{\gamma} e^{-(t-t_1)/\gamma} & \text{if } t \in [t_1, t_2]; \\ \delta \zeta^{-\delta} (t-t_2)^{\delta-1} e^{-((t-t_2)/\zeta)^{\delta}} & \text{if } t \in (t_2, \infty); \\ 0 & \text{otherwise.} \end{cases}
where [0, t1) is the infant mortality period, [t1, t2] is the constant phase, (t2, ∞) is the wear-out phase, α (0 < α < 1) is the shape parameter for the infant mortality phase, β (β > 0) is a scale parameter used to adjust the mean of the distribution during the infant mortality phase, γ is the mean of the constant phase, δ (δ ≥ 1) is the shape parameter for the wear-out phase, and ζ (ζ > 0) is a scale parameter used to adjust the mean of the distribution during the wear-out phase. The constants in the distribution must be chosen such that: e^{-(t_1/\beta)^{\alpha}} + e^{-t_2/\gamma} = e^{-t_1/\gamma} + e^{-(t_2/\zeta)^{\delta}}
6. Historical survival model (Figure 2.11e): This model attempts to describe the life of a published resource. The model is based on the idea that if a resource survives for a long period, it becomes of "historical significance," and its chance of continued survival increases. Although there have not been studies modeling the obsolescence of formats, we theorize that this function is a good choice for modeling the obsolescence of a format interpreter. When a format is adopted and retained for a long time (we are assuming here that we are adopting a "commercial" format), the probability of not being able to interpret the format is low. As the format ages, the probability of failure increases as knowledge about the format starts disappearing until we reach the historical
[Figure 2.11 sketches the failure density p against time t for each model: (a) Uniform, (b) Exponential, (c) Infant Mortality, (d) Bathtub, and (e) Historical Significance, with the infant mortality, constant, wear-out, adoption, and historical phases marked along the time axis.]
Figure 2.11: Failure Functions
phase when the probability of losing the format, after having survived that far, starts decreasing. This distribution is modeled using a Weibull distribution for the adoption phase and another Weibull distribution for the historical phase.
f(t) = \begin{cases} \alpha \beta^{-\alpha} t^{\alpha-1} e^{-(t/\beta)^{\alpha}} & \text{if } t \in [0, t_1); \\ \delta \zeta^{-\delta} (t-t_2)^{\delta-1} e^{-((t-t_2)/\zeta)^{\delta}} & \text{if } t \in (t_1, \infty); \\ 0 & \text{otherwise.} \end{cases}
where [0, t1) is the adoption phase, (t2, ∞) is the historical phase, α (α ≥ 1) is the shape parameter for the adoption phase (the higher the parameter, the lower the probability of losing the format), β (β > 0) is a scale parameter used to adjust the mean of the distribution during the adoption phase, δ (0 < δ < 1) is the shape parameter for the historical phase (the closer to zero, the earlier a failure may happen), and ζ (ζ > 0) is a scale parameter used to adjust the mean of the distribution during the historical phase. The parameters in the distribution must be chosen such that: (t_1/\beta)^{\alpha} = (t_2/\zeta)^{\delta}
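All of these library densities are assembled from exponential and Weibull pieces, both of which can be sampled by inverse transform. The sketch below (ours, not the ArchSim library itself) shows the two building blocks:

    import java.util.Random;

    // Inverse-transform samplers for the two building blocks of the library:
    // exponential (constant phase) and Weibull (infant-mortality / wear-out / adoption phases).
    final class FailureSamplers {
        // Exponential with mean beta: F(t) = 1 - e^{-t/beta}.
        static double exponential(double beta, Random rng) {
            return -beta * Math.log(1 - rng.nextDouble());
        }

        // Weibull with shape alpha and scale beta: F(t) = 1 - e^{-(t/beta)^alpha}.
        static double weibull(double alpha, double beta, Random rng) {
            return beta * Math.pow(-Math.log(1 - rng.nextDouble()), 1.0 / alpha);
        }

        public static void main(String[] args) {
            Random rng = new Random(1);
            // alpha < 1 gives the early-failure behavior of an infant-mortality phase;
            // alpha > 1 gives the increasing hazard of a wear-out phase.
            System.out.println("exponential, mean 1000 days: " + exponential(1000, rng));
            System.out.println("infant mortality (alpha=0.5): " + weibull(0.5, 1000, rng));
            System.out.println("wear-out (alpha=3):           " + weibull(3.0, 1000, rng));
        }
    }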
2.5.4 ArchSim’s Implementation
ArchSim follows the structure of a traditional simulation tool. Each module of the AR model can register future events in a timeline. For example, when a disk is created, the simulation uses the disk failure distribution to compute when the disk will fail; then, it registers the future failure event in the timeline. The simulation engine advances time by calling the module that registered the first event. This module may change the state of the repository and register more events in the timeline. After the module returns, the simulation advances to the next event, in chronological order. The user can choose between two end conditions for the simulation: the simulation can stop after the first document is lost or after all the documents are lost.

ArchSim needs to be very flexible and efficient to meet the challenges of simulating an archival repository. Flexibility is needed to model very different archival conditions and implementations. Speed is needed to cope with many materializations, components, and events. Additionally, each simulation needs to be run many times in order to obtain narrow confidence intervals.

In an AR simulation, many events are inconsequential. For example, suppose that the detection module schedules periodic detection events. If the detection event finds a fault (i.e., there was a failure event before the detection event), then the module starts a component repair; if no failure is detected, then the module does nothing. If a repair module checks a component with MTTF of 20 years every 15 days, on average, 486 events (20 ∗ 365/15) will be fired and the repair module will return without doing anything; only in event 487, when a failure of the component has occurred, will the repair module perform an action. Given the large number of components and modules that may be part of the model, this large number of inconsequential events represents a significant overhead. To avoid this overhead and to improve efficiency, modules are allowed to register conditional events in the timeline. These events will only happen if some other event happens before them. By using conditional events, we can arrange for the detection event to fire only if a failure event on a specific component happens before it. A conditional event is not registered directly in the timeline. Instead, it is registered in an index that is part of its triggering event. For example, if event B is conditional to the occurrence of event A, we will put B in the
index of A. When a new event A is scheduled in the timeline, we look in the index and find that B is conditional to it, so at that point we also schedule B.

To reduce the number of events further, ArchSim also modifies failure distributions. For example, when modeling preventive maintenance, a large number of "replace component" events are generated. We can eliminate all these events by modifying the failure distribution. Specifically, the original failure distribution is used to generate the time of the next failure of the component. If this time is higher than the prescribed preventive maintenance period, we ignore this time, and we generate a new failure time, again using the original distribution. We repeat this process until the failure time is lower than the preventive maintenance period. The new distribution then returns the number of iterations minus one, times the PM period, plus the last failure time (a short sketch of this procedure appears at the end of this subsection).

Another challenge for ArchSim is how to deal with a large number of materializations and components. This is done by scheduling only the next failure for each component type and associating a trigger with that event. When the failure event happens, the trigger is activated. The trigger computes when the next failure of a member of that component type will occur, and adds it to the timeline. Obviously, this approach is only beneficial if we have many components of the same type, which is a reasonable scenario for an AR. In the case in which we have N different distributions for N components, this technique does not improve the simulation time, but it does not increase it either.
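The preventive-maintenance rewriting described above can be stated in a few lines; the sketch below is our transcription of that procedure, with a generic Sampler standing in for the original failure distribution:

    import java.util.Random;

    // Failure distribution of a component that is preventively replaced every pmPeriod
    // time units, derived from the component's original failure distribution.
    final class PreventiveMaintenance {
        interface Sampler { double sample(Random rng); }

        static double timeToFirstFailure(Sampler original, double pmPeriod, Random rng) {
            int iterations = 1;
            double t = original.sample(rng);
            while (t > pmPeriod) {        // the component was replaced before it could fail
                iterations++;
                t = original.sample(rng); // try again with a fresh component
            }
            // (iterations - 1) full maintenance periods elapsed, then the last component fails at t.
            return (iterations - 1) * pmPeriod + t;
        }

        public static void main(String[] args) {
            Random rng = new Random(3);
            Sampler exponential = r -> -1000.0 * Math.log(1 - r.nextDouble()); // mean 1000 days
            System.out.println(timeToFirstFailure(exponential, 365.0, rng) + " days");
        }
    }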
2.6 Case Study: MIT/Stanford TR Repository
In this section, we use ArchSim to answer some design questions for a hypothetical MIT/Stanford Technical Report Archival Repository. The AR follows the Stanford Archival Vault (SAV) design and implementation [17]. In this case study, MIT and Stanford preserve their Computer Science Technical reports by replicating the reports at both universities. In this case study we will have to make many assumptions. Our goal here is not to make any specific predictions, but rather to illustrate the types of evaluations that ArchSim can support, the types of decisions that must be made to
[Figure 2.12 shows a Technical Report with four materializations: Files 1 and 2 on Disk 1 and Disk 10 at Stanford, and Files 3 and 4 on Disk 1 and Disk 10 at MIT, each paired with a format interpreter (PDF displayer, Postscript printer, ASCII printer, or TIFF displayer).]
Figure 2.12: MIT/Stanford CSTR Scenario
model an AR, and the types of comparisons that can be made to support rational decisions among alternatives. In addition to this case study, we also used ArchSim to evaluate the dual-tape system presented in Section 2.4.4. The results from ArchSim and the Markov chain analysis were consistent for the parameter values of Figure 2.10.

We assume that the collection has 200,000 documents and that each document is stored in one or more of four available formats. The repository has two types of components: storage devices (disks) and format interpreters. To ensure preservation, the AR maintains four materializations of each technical report; two materializations at Stanford and two at MIT. Materializations are created by choosing two formats out of the four available formats. Two of the materializations will be in one of the chosen formats, while the other two will be in the other. Then, at each site, we place the two materializations, which are in different formats, on two different disks. Figure 2.12 illustrates the arrangement of the technical reports in this system.
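The placement rule just described, two formats chosen from the four, one materialization per chosen format at each site on distinct disks, could be written roughly as follows; the class and method names are ours and purely illustrative:

    import java.util.*;

    // Builds the four materializations of one technical report:
    // two formats are picked, and each site stores one copy per format on a different disk.
    final class CstrPlacement {
        static List<String> materialize(String docId, List<String> formats,
                                        int disksPerSite, Random rng) {
            List<String> shuffled = new ArrayList<>(formats);
            Collections.shuffle(shuffled, rng);
            List<String> chosen = shuffled.subList(0, 2);        // two of the four formats
            List<String> result = new ArrayList<>();
            int mat = 1;
            for (String site : Arrays.asList("Stanford", "MIT")) {
                int diskA = rng.nextInt(disksPerSite);
                int diskB = (diskA + 1 + rng.nextInt(disksPerSite - 1)) % disksPerSite; // a different disk
                result.add("<M" + mat++ + ", " + docId + ", " + site + ", disk" + diskA + ", " + chosen.get(0) + ">");
                result.add("<M" + mat++ + ", " + docId + ", " + site + ", disk" + diskB + ", " + chosen.get(1) + ">");
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> formats = Arrays.asList("PDF", "Postscript", "ASCII", "TIFF");
            materialize("TR1233", formats, 100, new Random(5)).forEach(System.out::println);
        }
    }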
Disks, formats, and sites have uniform failure distributions with parameters 1/φsto,
1/φform, 1/φsite. As our base values, we are assuming φsto, φform, and φsite, the mean time to failure (MTTF) for disks, formats, and sites, to be 3, 20, and 45 years, respectively. These values are our best guess for a typical archival system. We chose 3 years as the disk MTTF, as this is the normal period for which a hard drive is under manufacturer warranty. We chose 20 years for the format MTTF as we theorize that it will take that long for a well-known format to be replaced by a new format and for all displayers and transformers of the old format to be lost. We chose 45 years for the site MTTF as we assign a 50% probability to the event of losing the Stanford site due to a high-intensity earthquake in the San Francisco Bay Area (which is predicted to happen within a 45-year period).

The archival system checks for faults periodically. When a fault is detected in one of the storage devices, the bad device is retired, and a new device is set online. Then, the system regenerates the bad device by making a copy of the lost materializations from the other site. Similarly, when a format becomes obsolete, a new format is selected and a new set of materializations (transformed from a non-obsolete format) is created in the new format. In addition, in case of site failure, the site is recreated from the other site. If all sites, formats, and devices that support all the materializations of a technical report are lost, then the technical report is lost and the simulation stops.
Disks, formats, and sites are checked and repaired (if needed) every ρsto, ρform, and
ρsite days. As our base values, we are assuming ρsto, ρform, and ρsite to be 60, 60, and 7 days, respectively. To justify these values, we need to describe what is involved in the detection and repair of component failures. Detecting a failure in a disk involves scanning the whole disk and checking for lost data. When we find lost data, we need to order a new disk and then copy all the data that was on the damaged disk from other sources onto the new disk. Assuming that the repair time is 60 days means that we need to dedicate only 3% of the disk bandwidth to scanning all materializations in order to detect failures. We did not choose a quicker repair because, in our opinion, the scanning overhead would be too high. For example, 25% of the disk bandwidth is required to detect failures in 7 days.
Parameter                              Symbol   Value
Number of disks                        nsto     100 per site
Number of formats                      nform    4
Number of documents                    numdoc   200,000
Mean time to disk failure              φsto     3 years
Mean time to format failure            φform    20 years
Mean time to site failure              φsite    45 years
Disk failure detection/repair time     ρsto     60 days
Format failure detection/repair time   ρform    60 days
Site failure detection/repair time     ρsite    7 days
Figure 2.13: Base values
For formats, the detection/repair times imply that we are able to realize that a format is obsolete and that we can create a new copy of the document from a non-obsolete format within a 60-day period. In the case of site failures, we assume that the detection/repair time for the site is much lower than for formats and disks. The detection itself should be rather fast in this case (the entire site is down), and the 7 days could be the time it takes to find a backup site to take over.

In this case study, we are assuming that failures are total. That is, we cannot partially repair a component and salvage some of the materializations. The failure distribution during access will be assumed to be the same as the failure distribution during archival. The simulation parameters are summarized in Appendix A and the base values for our simulation are in Figure 2.13.

In our first experiment, we evaluate the effect of the failure MTTF and repair times of storage devices on the system MTTF. In Figure 2.14, we show the system MTTF for different disk MTTFs, given a detection/repair time of 60 days. To single out the influence of storage device failures, we are assuming in this experiment that formats and sites never fail. The dotted lines in the figure represent the 99% confidence interval for the simulation, while the solid line is the average of all the simulation runs. As expected, the system MTTF increases when the disk MTTF increases (when the disk failure rate decreases). The exponential shape of the curve is the result of constant repair time. As we keep increasing the disk MTTF, it becomes much more improbable that another device will also fail before 60 days have passed.
[Figure 2.14 plots the MTTF of the system (years) against φsto (years, from 3 to 20), showing the average over the simulation runs and its 99% confidence interval.]
Figure 2.14: MTTF vs. φsto with ρsto of 60 days

This graph allows us to select a good storage device for a target system MTTF. If the library targets a 10-year MTTF, then a disk with a failure MTTF of 3 years will suffice. However, if the library requires an MTTF of 100 years, then we will need disks with an MTTF of about 6 years. Most manufacturers guarantee their disks for three years and very few guarantee them beyond five years. Therefore, a 6-year disk MTTF requirement will be hard (or very expensive) to meet. Nevertheless, we can still achieve our target system MTTF by changing other parameters in our system, as we will see in the next experiment.
We now evaluate the sensitivity of the system MTTF to the repair time (ρsto).
In Figure 2.15, we show the MTTF of the AR for different disk MTTF (φsto) values, and for different expected detection/repair times. (The graph has a logarithmic scale. Confidence intervals are not shown to avoid clutter.) First, let us concentrate on the curve for a φsto of 3 years. As expected, the MTTF of the system decreases when
[Figure 2.15 plots the MTTF of the MIT/Stanford CSTR Archival System (logarithmic scale), with one curve per φsto value (3 years, 5 years, ...).]