A Web Archive Profiling Framework for Efficient Memento Routing
Total Page:16
File Type:pdf, Size:1020Kb
Old Dominion University ODU Digital Commons Computer Science Theses & Dissertations Computer Science Fall 12-2020 MementoMap: A Web Archive Profiling rF amework for Efficient Memento Routing Sawood Alam Old Dominion University, [email protected] Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds Part of the Computer Sciences Commons Recommended Citation Alam, Sawood. "MementoMap: A Web Archive Profiling rF amework for Efficient Memento Routing" (2020). Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, DOI: 10.25777/ 5vnk-s536 https://digitalcommons.odu.edu/computerscience_etds/129 This Dissertation is brought to you for free and open access by the Computer Science at ODU Digital Commons. It has been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of ODU Digital Commons. For more information, please contact [email protected]. MEMENTOMAP: A WEB ARCHIVE PROFILING FRAMEWORK FOR EFFICIENT MEMENTO ROUTING by Sawood Alam B.Tech. May 2008, Jamia Millia Islamia, India M.S. August 2013, Old Dominion University A Dissertation Submitted to the Faculty of Old Dominion University in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY COMPUTER SCIENCE OLD DOMINION UNIVERSITY December 2020 Approved by: Michael L. Nelson (Director) Michele C. Weigle (Member) Jian Wu (Member) Sampath Jayarathna (Member) Erika F. Frydenlund (Member) ABSTRACT MEMENTOMAP: A WEB ARCHIVE PROFILING FRAMEWORK FOR EFFICIENT MEMENTO ROUTING Sawood Alam Old Dominion University, 2020 Director: Dr. Michael L. Nelson With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings as well as to support routing of requests in Memento aggregators. A memento is a past version of a web page and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator’s URI requests. Additionally, we use fulltext search (when available) or sample URI lookups to build an understanding of an archive’s holdings. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. For evaluation we used CDX files from Archive-It, UK Web Archive, Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the In- ternet Archive’s Wayback Machine, UK Web Archive, Arquivo.pt, LANL’s Memento Proxy, and ODU’s MemGator Server. In addition, we utilized historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile and 94% URIs with less than 10% relative cost without any false negatives. In another experiment we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. We created MementoMap, a framework that allows web archives and third parties to express holdings and/or voids of an archive of any size with varying levels of details to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies we predefined the maximum depth of host and path segments of URIs for each policy that are used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means to represent URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to rollup URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes. In an experiment with dynamic profiling of archival holdings we found that a MementoMap of less than 1.5% relative cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives (i.e., 100% recall). In addition, we separately evaluated archival voids based on the most frequently accessed resources in the access log and found that we could have avoided more than 8% of the false positives without introducing any false negatives. We defined a routing score that can be used for Memento routing. Using a cut-off threshold technique on our routing score we achieved over 96% accuracy if we accept about 89% recall and for a recall of 99% we managed to get about 68% accuracy, which translates to about 72% saving in wasted lookup requests in our Memento aggregator. Moreover, when using top-k archives based on our routing score for routing and choosing only the topmost archive, we missed only about 8% of the sample URIs that are present in at least one archive, but when we selected top-2 archives, we missed less than 2% of these URIs. We also evaluated a machine learning-based routing approach, which resulted in an overall better accuracy, but poorer recall due to low prevalence of the sample lookup URI dataset in different web archives. We contributed various algorithms, such as a space and time efficient approach to ingest large lists of URIs to generate MementoMaps and a Random Searcher Model to discover samples of holdings of web archives. We contributed numerous tools to support various aspects of web archiving and replay, such as MemGator (a Memento aggregator), Inter- Planetary Wayback (a novel archival replay system), Reconstructive (a client-side request rerouting ServiceWorker), and AccessLog Parser. Moreover, this work yielded a file for- mat specification draft called Unified Key Value Store (UKVS) that we use for serialization and dissemination of MementoMaps. It is a flexible and extensible file format that allows easy interactions with Unix text processing tools. UKVS can be used in many applications beyond MementoMaps. iv Copyright, 2020, by Sawood Alam, All Rights Reserved. v Beneath my mother’s feet... vi ACKNOWLEDGEMENTS All praises be to the Almighty to bring me this far. I am deeply grateful for the help, support, and contributions from numerous people and organizations in this journey. My advisor during my masters and doctorate degree programs, Dr. Michael L. Nelson, has been a tremendous support to me in my academic journey and career. If I were to lead a team in the future, punctual weekly progress meetings and rigorous practice sessions before every upcoming presentation are the two lessons I have learned from him that I would replicate. My co-advisor Dr. Michele C. Weigle helped me learn how to express research findings more effectively using suitable information visualization techniques for various types of results and telling stories with properly annotated figures. These two people get the lion share of credits for turning me into a researcher and teaching me techniques of effective scholarly communication. My doctoral dissertation committee members Dr. Jian Wu and Dr. Sampath Jayarathna were helpful beyond this research. They were available for discussions on many research topics and were supportive in finding job opportunities for me. Dr. Erika F. Frydenlund provided an outsider’s perspective and her reflections improved the readability of my dis- sertation for those unfamiliar with my research discipline. Prior works of Dr. Robert Sanderson and Dr. Ahmed AlSum formed the basis of this research. I got the opportunity to work with Dr. David S. H. Rosenthal, Dr. Herbert Van de Sompel, Dr. Martin Klein, Dr. Lyudmila L. Balakireva, and Harihar Shankar during the IIPC funded preliminary exploration phase of this archive profiling work. Dr. Rosenthal’s blog post reviews after his retirement from Stanford kept this work in check and reassured that I was on the right track. Dr. Sanderson reflected his thoughts and provided constructive feedback on our MementoMap framework during JCDL ’19. Dr. Edward A. Fox from Virginia Tech reflected his thoughts on our URI Key concept and showed interest on potential future collaborations. Numerous IIPC member organizations and individuals contributed to this work with datasets and feedback. Kris Carpenter and Jefferson Bailey from the Internet Archive helped us with the Archive-It WARC dataset. Joseph E. Ruettgers helped us with the petabytes of storage space setup at ODU. Daniel Gomes and Fernando Melo were very generous in sharing complete dataset of CDX files and access logs from Arquivo.pt web archive. Nicholas Taylor shared Stanford Web Archive Portal’s CDX datasets. Dr. Andrew Jackson from the British Library shared CDX files and a sample of access logs from the UK Web Archive vii and provided feedback on the IIPC funded archive profiling project. Kristinn Sigurðsson provided feedback during the early days of the project. Garth Stewart shared CDX files from the National Records of Scotland. Ilya Kreymer contributed to the discussion about CDXJ profile serialization format and shared OldWeb.today access logs. Alex Osborne and Dr. Paul Koerbin worked on sharing the web archive index from the National Library of Australia. Olga Holownia from the British Library helped the progress of IIPC funded project in numerous ways. Dr. Ian Milligan gave me the opportunity to attend multiple Archives Unleashed datathons, an event where InterPlanetary Wayback was created.