ARCHIVAL REPOSITORIES for DIGITAL LIBRARIES a Dissertation
Total Page:16
File Type:pdf, Size:1020Kb
ARCHIVAL REPOSITORIES FOR DIGITAL LIBRARIES a dissertation submitted to the department of computer science and the committee on graduate studies of stanford university in partial fulfillment of the requirements for the degree of doctor of philosophy Arturo Crespo March 2003 c Copyright by Arturo Crespo 2003 All Rights Reserved ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Hector Garcia-Molina (Principal Adviser) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Jennifer Widom I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Andreas Paepcke Approved for the University Committee on Graduate Studies: iii iv Abstract Current libraries use an assortment of uncoordinated and unreliable techniques for storing and managing their digital information. Digital information can be lost for a variety of reasons: magnetic decay, format and device obsolescence, human or system error, among many others. In this thesis, we address the problem of how to build archival repositories (AR). An AR has the following combined key requirements that distinguish it from other repositories. First, digital objects (e.g., documents, tech- nical reports, movies) must be preserved indefinitely, as technologies, file formats, and organizations evolve. Second, ARs will be formed by a confederation of inde- pendent organizations. Third, published digital objects have a historical nature, so changes should not be done in-place; instead, they should be recorded in versions. We provide an architecture for ARs that assures long-term archival storage of digital objects. This assurance guarantee is achieved by having a federation of independent but collaborating sites, each managing a collection of digital objects. We also pro- vide a framework for evaluating the reliability and cost of an AR. Finally, we present techniques for efficient access of documents in a federation of independent sites. v Acknowledgments First, I would like to thank my advisor, Hector Garcia-Molina who taught me much of what I know about research, clear thinking, and effective writing. Working with Hector was not only a privilege but a pleasure as well. I also want to thank the other members of my reading committee, Andreas Papecke and Jennifer Widom, as well as my orals committee members, Chris Jacobs and Terry Winograd, for their many useful suggestions and comments. I would like to thank Jerry Saltzer for providing the initial spark that ignited my interest on this topic. In addition I thank Carl Lagoze for his useful suggestions and encouragement on pursuing this topic. I would also like to thank all the members of the Stanford Database Group for their helpful advice, insightful discussions, and for always being there during good and bad times. In particular, I would like to thank my co-authors Orkut Buyukkokten and Brian Cooper as well as Stefan Decker (whose ideas lead to the work presented in Chapter 7). Finally, a special thank to my officemates Sergey Brin, Wilburt Labio, Claire Cui, Taher Haveliwala, and Glen Jeh, who made the long hours at Stanford much more enjoyable. At Stanford, I was fortunate to have met, and count among my friends, a number of smart, interesting, and caring people. The list of friends that provided me support and encouragement is far too long to write here, but I would like to mention a couple of names. Very special thanks to Erik Boman and the leaders of the Stanford Outing Club, who introduce me to the natural wonders of California; and to Carlos Guestrin for encouraging me to break away with my self-imposed limits allowing me to to go farther and higher than I ever thought I could ever go. I also want to thank vi Sergey Melnik not only for his true friendship, but also for his technical help in developing some of the ideas in this dissertation. Finally, I want to thank my family: my parents Arturo and Inmaculada, my sister Massiel, and my brother Joaquin. Familia, esta tesis se la dedico a ustedes. Sin su fe en mi y su apoyo incondicional escribir esta tesis hubiese sido imposible. vii Contents Abstract v Acknowledgments vi 1 Introduction 1 I Design and Evaluation of Archival Repositories 8 2 Evaluating the Reliability of an AR 9 2.1 Overview . 10 2.2 Archival Problems and Solutions . 11 2.2.1 Sources of Failures . 11 2.2.2 Component Failure Avoidance Techniques . 13 2.2.3 System-Level Techniques . 17 2.3 Architecture of an Archival Repository . 22 2.3.1 Archival Documents . 22 2.3.2 Architecture of the Non-Fault-Tolerance Store . 23 2.3.3 Architecture of the Archival System . 24 2.4 Failure and Recovery Modeling . 25 2.4.1 Modeling a Non-Fault-Tolerance Store . 25 2.4.2 Modeling an Archival System . 27 2.4.3 Modeling Archival Documents . 28 2.4.4 Example: An Archival System based on Tape Backups . 30 viii 2.5 ArchSim: A Simulation Tool for Archival Repositories . 34 2.5.1 Challenges in Simulating an Archival Repository . 35 2.5.2 The Simulation Engine: ArchSim . 36 2.5.3 Library of Failure Distributions . 36 2.5.4 ArchSim’s Implementation . 40 2.6 Case Study: MIT/Stanford TR Repository . 41 2.7 Discussion . 49 3 A Design Methodology for ARs 51 3.1 Overview . 52 3.2 AR Costs . 53 3.2.1 A Taxonomy of Cost Sources . 54 3.2.2 A Taxonomy of Cost Events . 56 3.3 Designing an AR . 58 3.3.1 Framing the problem . 59 3.3.2 Identifying Uncertainties, Alternatives, and Preferences . 59 3.3.3 Modeling an AR . 60 3.3.4 Transforming Non-Critical Uncertainties into Variables . 61 3.3.5 Eliminating Futile Alternatives . 66 3.3.6 Probabilistic Assessment of Uncertainties . 67 3.3.7 Evaluating an AR Design . 68 3.3.8 Appraising Cost Decisions . 69 3.4 Discussion . 70 II An Architecture and Algorithms for Archival Reposi- tories 71 4 An Architecture for Archival Repositories 72 ix 4.1 Overview . 73 4.2 Key Components . 73 4.2.1 Signatures as Object Handles . 73 4.2.2 No Deletions . 76 4.2.3 Layered Architecture . 77 4.2.4 Awareness Everywhere . 78 4.2.5 Disposable Auxiliary Structures . 78 4.3 Object Store Layer . 79 4.3.1 Object Store Interface . 79 4.3.2 Object Store Implementation . 80 4.4 Identity Layer . 81 4.4.1 Identity Interface . 81 4.4.2 Identity Implementation . 83 4.5 Complex Object Layer . 84 4.5.1 Tuples . 85 4.5.2 Versions . 85 4.5.3 Sets . 87 4.6 Reliability Layer . 88 4.6.1 Reliability Interface . 89 4.6.2 Implementation . 89 4.7 A Complete Example . 92 4.8 Discussion . 94 4.9 Related Work . 95 5 Archive Synchronization 97 5.1 Overview . 98 5.2 Problem Definition . 100 5.3 Related Work . 103 5.4 The Client-Store Design Space . 103 5.5 The UNI-AWARE Algorithm . 105 5.5.1 Snapshot Algorithms . 109 x 5.5.2 Timestamps . 110 5.5.3 Logs . 113 5.5.4 Signature Algorithm . 114 5.5.5 UNI-AWARE Summary . 116 5.6 Distributed UNI-AWARE Algorithm . 117 5.6.1 Hierarchical Signatures . 120 5.6.2 Hierarchical Timestamps . 123 5.6.3 Performance Issues . 123 5.7 Discussion . 126 III Efficient Content Access in a Federation of Archival Repositories 127 6 Efficient Searching in an AR Federation 128 6.1 Overview . 129 6.2 Related Work . 131 6.3 Peer-to-peer Systems . 132 6.3.1 Query Processing in a Distributed-Search P2P System . 132 6.4 Routing indices . 133 6.4.1 Using Routing Indices . 136 6.4.2 Creating Routing Indices . 137 6.4.3 Maintaining Routing Indices . 138 6.5 Alternative Routing Indices . 140 6.5.1 Hop-count Routing Indices . 140 6.5.2 Exponentially aggregated RI . 141 6.6 Cycles in the P2P Network . 143 6.7 Experimental Results . 145 6.7.1 Modeling search mechanisms in a P2P system . 145 6.7.2 Simulating a P2P System . 147 6.7.3 Evaluating P2P Search Mechanism . 149 6.8 Discussion . 157 xi 7 Efficient Networks in an AR Federation 159 7.1 Overview . 160 7.2 Related Work . 162 7.3 Semantic Overlay Networks . 163 7.3.1 Classification Hierarchies . 165 7.4 Classification Hierarchies . 170 7.4.1 Experiments . 172 7.5 Classifying Queries and Documents . 176 7.5.1 Experiments . 177 7.6 Nodes and SON Membership . 179 7.6.1.