Integrating Preservation Functions Into the Web Server Joan A
Total Page:16
File Type:pdf, Size:1020Kb
Old Dominion University ODU Digital Commons Computer Science Theses & Dissertations Computer Science Summer 2008 Integrating Preservation Functions Into the Web Server Joan A. Smith Old Dominion University Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds Part of the Computer Sciences Commons, and the Digital Communications and Networking Commons Recommended Citation Smith, Joan A.. "Integrating Preservation Functions Into the Web Server" (2008). Doctor of Philosophy (PhD), dissertation, Computer Science, Old Dominion University, DOI: 10.25777/jr7m-6g71 https://digitalcommons.odu.edu/computerscience_etds/22 This Dissertation is brought to you for free and open access by the Computer Science at ODU Digital Commons. It has been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of ODU Digital Commons. For more information, please contact [email protected]. INTEGRATING PRESERVATION FUNCTIONS INTO THE WEB SERVER by Joan A. Smith B.A. 1986 University of the State of New York M.A. 1988 Hampton University A Dissertation Submitted to the Faculty of Old Dominion University in Partial Fulfillment of the Requirement for the Degree of DOCTOR OF PHILOSOPHY COMPUTER SCIENCE OLD DOMINION UNIVERSITY August 2008 Approved by: Michael L. Nelson (Director) Kurt Maly Steven J. Zeil Mohammed K. Zubair Simeon Warner ABSTRACT INTEGRATING PRESERVATION FUNCTIONS INTO THE WEB SERVER Joan A. Smith Old Dominion University, 2008 Director: Dr. Michael L. Nelson Digital preservation of the World Wide Web poses unique challenges, different from the preservation issues facing professional Digital Libraries. The complete list of a website’s resources cannot be cited with confidence, and the descriptive metadata available for the resources is so minimal that it is sometimes insufficient for a browser to recognize. In short, the Web suffers from a counting problem and a representation problem. Refreshing the bits, migrating from an obsolete file format to a newer format, and other classic digital preservation problems also affect the Web. As digital collections devise solutions to these problems, the Web will also benefit. But the core World Wide Web problems of Counting and Representation need a targeted solution. As the host of web content, the web server is uniquely positioned to assist in the preservation of the resources it serves. It both knows the resources it has, and knows what kind of resources they are. This dissertation presents research in which preservation functions have been integrated into the web server itself. The CRATE Model defines a method for addressing the Counting Problem and the Representation Problem using existing web server-compatible technology. A series of experiments which evaluated this approach are presented, along with a technical review of the MODOAI web server module which acts as the preservation agent. The feasibility of this approach is demonstrated by a quantitative analysis of its use in a commercial web testing environment. iii ©Copyright, 2008, by Joan A. Smith, All Rights Reserved. iv Dedicated to my mother and to the memory of her mother. a barren field gives birth to the fertile ground — Byzantine (Septuagint) Psalter v ACKNOWLEDGMENTS Considering how many people influenced this research or directly contributed to it through creative discussion, commentary, or plain critique, it seems unfair to not have them as co-authors or at least co-inspirators, so to speak. If not for Michael Nelson, my eternally patient advisor, this whole project would never have begun. He and Johan Bollen took the time to convince me that pursuing a PhD was a worthwhile endeavor. More importantly, Michael then followed up by securing funding for this research from both the Andrew Mellon Foundation and the Library of Congress. His guid- ance was always on target; he knew somehow when to apply pressure and when to step back. I am indebted to him on many levels and for many reasons. Still, the process is long and courage falters. Stephan Olariu and Hussein Abdel-Wahab provided much-needed encouragement and support at critical points along the way, particularly in overcoming the first hurdle, the Diagnostic (PhD qualifying) Exam. The Candidacy Exam committee members, Kurt Maly, Simeon Warner, Steven Zeil and Mohammed Zubair, provided very helpful feedback and commentary on the initial proposal and I greatly appreciate the time and effort they put into reviewing that as well as this dissertation. Throughout these years I’ve also had the good fortune to have Martin Klein and Frank Mc- Cown as project partners, research colleagues and office mates. Working with them has been both productive and rewarding. The pseudo-competition to get papers accepted at conferences added a dimension of fun to what was otherwise just plain hard work. In addition, I have benefited from the software development efforts of Terry Harrison and Aravind Elango, who wrote the initial prototype of MODOAI, from which the author’s current modular version was derived. I owe Howard Smith (while at Symantec) and Jim Gray (of Kronos) many thanks for suggesting testing methods and for providing commercial test environments for MODOAI. It isn’t often that such an academic endeavor can be tested in a commercial environment. The experience was especially helpful in defining criteria for metadata utility compatibility and in designing off-line tests for the software. When it comes to software engineering, however, one individual stands above the rest. If I have learned anything at all about the art and practice of software engineering, it is thanks to John Owen. As a professional colleague, business partner, and friend, his depth of knowledge and love of the craft have been an inspiration and a guide. The version of MODOAI developed for this dissertation owes its new architecture and implementation style to the debates and discussions we’ve had. From moral support to technical opinions his help has been invaluable. Michael Nelson guided the development of the CRATE concept and its focus on the counting and description problems. The complex object implementation, particularly MPEG-21 DIDL, is based on the work of Herbert Van de Sompel and Jeroen Bekaert. Their research significantly vi helped clarify the concepts and goals of the CRATE model. There have, of course, been many others whose professionalism and support have made a differ- ence during my time at Old Dominion University. Janet Brunelle gave me an opportunity to teach the Computer Science capstone course, renewing my interest in teaching while simultaneously re- minding me just how much work goes into preparing lectures. The department’s administrative staff, particularly Phyillis Woods, kept the paperwork straight and the paychecks coming. Ajay Gupta and his systems staff assisted, supported, and restored as only "Sys Admins" can. Among these system administrators, Chris Robinson, Ian Gullet and Joshua Robertson were especially important to the "Counting Problem" experiments. I appreciate their help with both hardware and software. Ultimately, of course, it all comes down to family, and here I have been blessed with good fortune beyond the normal measure. My parents and siblings have continually expressed both con- fidence in my abilities as well as a belief in the value of education for its own sake. Last, but most importantly, I have had the unflagging devotion of a wonderful husband. Howard Smith has been with me through it all, with patience and love. There is no way to express how much this has meant to me every day. This PhD belongs to him. vii ACRONYMS AIP Archival Information Package (OAIS) API Application Programming Interface CCSDS Consultative Committee for Space Data Systems DIDL Digital Item Declaration Language DOI Digital Object Identifier DIP Dissemination Information Package (OAIS) DL Digital Library GIF Graphics Interchange Format HTML HyperText Markup Language HTTP Hypertext Transfer Protocol JPEG Joint Photographic Experts Group KWF Kahn-Wilensky Framework LANL Los Alamos National Laboratory METS Metadata Encoding and Transmission Standard MIME Multipurpose Internet Mail Extensions MPEG Moving Picture Experts Group NDIIPP National Digital Information Infrastructure and Preservation Program NSSDC National Space Sciences Data Center OAI Open Archives Initiative OAI-PMH Open Archives Initiative Protocol for Metadata Harvesting OAIS Open Archival Information System PDF Portable Document Format (Adobe) PNG Portable Network Graphics PREMIS Preservation Metadata Implementation Strategies RDF Resource Description Framework RSS Really Simple Syndication SIP Submission Information Package (OAIS) URI Uniform Resource Identifier URL Uniform Resource Locator URN Unform Resource Name W3C The World Wide Web Consortium WWW The World Wide Web XHTML Extensible HyperText Markup Language XML Extensible Markup Language viii TABLE OF CONTENTS Page LISTOFTABLES ....................................... xi LISTOFFIGURES...................................... xiii CHAPTER I INTRODUCTION .................................... 1 1 The Challenge of Digital Preservation . 1 2 Scope ...................................... 4 3 Approach .................................... 5 4 Organization................................... 7 II CURRENT PRACTICE IN DIGITAL PRESERVATION . ... 9 1 Preservation: Definitions and Limitations . 9 2 Digital Preservation Models & Implementations . 12 3 LOCKSS..................................... 19 4 OAI-PMH...................................