Lazy Preservation: Reconstructing Websites from the Web Infrastructure
Frank McCown
Old Dominion University
Old Dominion University
ODU Digital Commons

Computer Science Theses & Dissertations
Computer Science
Fall 2007

Lazy Preservation: Reconstructing Websites from the Web Infrastructure
Frank McCown
Old Dominion University

Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds
Part of the Computer Sciences Commons, and the Digital Communications and Networking Commons

Recommended Citation
McCown, Frank. "Lazy Preservation: Reconstructing Websites from the Web Infrastructure" (2007). Doctor of Philosophy (PhD), dissertation, Computer Science, Old Dominion University, DOI: 10.25777/ys8r-nj25
https://digitalcommons.odu.edu/computerscience_etds/21

This Dissertation is brought to you for free and open access by the Computer Science at ODU Digital Commons. It has been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of ODU Digital Commons. For more information, please contact [email protected].

LAZY PRESERVATION: RECONSTRUCTING WEBSITES FROM THE WEB INFRASTRUCTURE
by
Frank McCown
B.S. 1996, Harding University
M.S. 2002, University of Arkansas at Little Rock

A Dissertation Submitted to the Faculty of Old Dominion University in Partial Fulfillment of the Requirement for the Degree of
DOCTOR OF PHILOSOPHY
COMPUTER SCIENCE
OLD DOMINION UNIVERSITY
December 2007

Approved by:
Michael L. Nelson (Director)
William Y. Arms (Member)
Johan Bollen (Member)
Kurt Maly (Member)
Ravi Mukkamala (Member)
Mohammad Zubair (Member)

ABSTRACT

LAZY PRESERVATION: RECONSTRUCTING WEBSITES FROM THE WEB INFRASTRUCTURE
Frank McCown
Old Dominion University, 2007
Director: Dr. Michael L. Nelson

Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to recover some of their websites from the Internet Archive. Still others have sought to retrieve missing resources from the caches of commercial search engines. Inspired by these post hoc reconstruction attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and behavior. Methods for reconstructing websites from the WI are then investigated, and a new type of crawler is introduced: the web-repository crawler. Several experiments are used to measure and evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository crawler strategies are introduced and evaluated. The implementation of the web-repository crawler Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for recovering the generative functionality (i.e., CGI programs, databases, etc.) of websites is presented, and its effectiveness is demonstrated by recovering an entire Eprints digital library from the WI.

© Copyright, 2007, by Frank McCown, All Rights Reserved.

To my wife, Becky.

ACKNOWLEDGMENTS

There are a number of people who I would like to acknowledge for their support during my doctoral work. I would especially like to thank my advisor, Michael L. Nelson, for the time and effort he put into mentoring me these past several years. Our many discussions sparked a number of great ideas and helped turn several dead-ends into possibilities. I am also grateful to my doctoral committee and for the input they have provided.

Much of this dissertation is the product of collaboration with a number of excellent researchers. Michael Nelson and Johan Bollen (LANL) provided many of the initial ideas on lazy preservation and the Web Infrastructure. Other ideas about lazy preservation developed from collaboration with Cathy Marshall (Microsoft).
Joan Smith worked with me on the decaying website experiment from Chapter IV and provided helpful ideas and advice throughout my time at ODU. Giridhar Nandigam helped with the search engine sampling experiment in Chapter IV, Amine Benjelloun developed most of the Brass system from Chapter VI, and Norou Diawara helped perform the statistical analysis in Chapter VII.

I would like to thank Janet Brunelle, Hussein Abdel-Wahab and other ODU faculty for their friendship and encouragement these past several years. I enjoyed attending a number of conferences with Michael, Johan, Joan, Martin Klein, Marko Rodriguez (LANL) and Terry Harrison (CACI) and learned a lot from our many discussions. Thanks also to members of the Systems Group who kept our infrastructure running smoothly. My friends and colleagues at Harding University were very supportive of me while I worked on my Ph.D., and I am thankful to them for providing me the time off to pursue my doctorate degree.

This dissertation is dedicated to my wife Becky who cheered me on through the good times, encouraged me when I was down and wanted to quit, and made me laugh every day. I could not have had a more supportive spouse. I especially enjoyed becoming a father in the final year of my doctoral work; much of this dissertation was written only a few feet away from Ethan as he slept and played. I thank my parents Andy and Genia and my siblings John and Sara who have encouraged me in all my pursuits, and I thank my family at the Bayside Church of Christ who provided friendship and spiritual guidance while we lived in Virginia.

Finally and primarily, I thank God for giving me the strength to finish what I started, and, in the spirit of 1 Corinthians 10:31, I hope that this dissertation glorifies Him.

"'My son,' the father said, 'you are always with me, and everything I have is yours. But we had to celebrate and be glad, because this brother of yours was dead and is alive again; he was lost and is found.'"
- Luke 15:31

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter

I INTRODUCTION
  1 MOTIVATION
  2 OBJECTIVE
  3 APPROACH
  4 ORGANIZATION
II PRESERVING THE WEB
  1 LINK ROT
  2 PRESERVING THE WEB
  3 WEB CRAWLING
  4 PRESERVING WEBSITES
  5 CONCLUSIONS
III LAZY PRESERVATION AND THE WEB INFRASTRUCTURE
  1 LAZY PRESERVATION
  2 LIMITATIONS
  3 WEB REPOSITORIES
  4 CONCLUSIONS
IV CHARACTERIZING THE WEB INFRASTRUCTURE
  1 A MODEL FOR RESOURCE AVAILABILITY
  2 WEB INFRASTRUCTURE PRESERVATION CAPABILITY
  3 WEB INFRASTRUCTURE CONTENTS
  4 DISCUSSION
  5 CONCLUSIONS
V WEB-REPOSITORY CRAWLING
  1 CRAWLER ARCHITECTURE
  2 LISTER QUERIES AND CRAWLING POLICIES
  3 URL CANONICALIZATION
  4 CONCLUSIONS
VI WARRICK, A WEB-REPOSITORY CRAWLER
  1 BRIEF HISTORY
  2 IMPLEMENTATION
  3 OPERATION
  4 RUNNING
  5 BRASS
  6 USAGE STATISTICS
  7 CONCLUSIONS
VII EVALUATING LAZY PRESERVATION
  1 WEBSITE DEFINITIONS
  2 RECONSTRUCTION MEASUREMENTS
  3 INITIAL RECONSTRUCTION EXPERIMENT
  4 CRAWLING POLICIES EXPERIMENT
  5 FACTORS AFFECTING WEBSITE RECONSTRUCTION
  6 CONCLUSIONS
VIII RECOVERING A WEBSITE'S SERVER COMPONENTS
  1 GENERATING DYNAMIC WEB CONTENT
  2 WHAT TO PROTECT
  3 INJECTION MECHANICS
  4 EXPERIMENTS
  5 DISCUSSION
  6 CONCLUSIONS
IX CONCLUSIONS AND FUTURE WORK
  1 CONCLUSIONS
  2 CONTRIBUTIONS
  3 FUTURE WORK

BIBLIOGRAPHY

APPENDICES
  A WARRICK COMMAND-LINE SWITCHES
  B RECONSTRUCTED WEBSITES

VITA

LIST OF TABLES

Table

1 Systems for preserving and recovering web resources.
2 Sample of reconstructed websites.
3 Web repository-supported data types as of July 10, 2007.
4 Implementation summary of web-repository interfaces.
5 Resource availability states.
6 Caching of HTML resources from four web collections.
7 Web and cache accessibility.
8 Indexed and cached content by type.
9 Staleness of search engine caches (in days).
10 Search engine overlap with the Internet Archive.
11 Repository request methods and limits.
12 Brass usage statistics from 2007.
13 Brass recovery summary.
14 Repository use, contributions and requests.
15 General levels of reconstruction success.
16 Results of initial website reconstructions.
17 Results of crawling-policy website reconstructions (Part 1).
18 Results of crawling-policy website reconstructions (Part 2).
19 Statistics for crawling-policy website reconstructions.
20 Descriptive statistics for reconstruction success levels.
21 Reconstruction performance of web repositories.
22 Regression parameter estimates.
23 Various r values (bold). ..