<<

OCLC RESEARCH SUPPLEMENTAL


Descriptive Metadata for Web Archiving: Review of Harvesting Tools

Mary Samouelian, Harvard University
Jackie Dooley, OCLC Research

© 2018 OCLC. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/

February 2018

OCLC Research, Dublin, Ohio 43017 USA, www.oclc.org

ISBN: 978-1-55653-014-2 DOI: 10.25333/C37H0T OCLC Control Number: 1021288422

ORCID iDs Mary Samouelian https://orcid.org/0000-0002-0238-5405 Jackie Dooley https://orcid.org/0000-0003-4815-0086

Please direct correspondence to: OCLC Research [email protected]

Suggested citation: Samouelian, Mary, and Jackie Dooley. 2018. Descriptive Metadata for Web Archiving: Review of Harvesting Tools. Dublin, OH: OCLC Research. https://doi.org/10.25333/C37H0T.

ACKNOWLEDGMENTS

The OCLC Research Library Partnership Web Archiving Metadata Working Group used several subgroups to make our range of research investigations possible. The Tools Subgroup defined the scope of work, selected and analyzed the 11 tools, and prepared the grids that articulate the nature of each tool in some detail.

In addition to Mary Samouelian, the members of the Tools Subgroup were:

• Jason Kovari,

• Dallas Pillen,

• Lily Pregill, Getty Research Institute

• Matthew McKinley,

Several individuals contributed helpful comments during the external review period. Jefferson Bailey (Internet Archive) and Dragan Espenschied (Rhizome) provided extensive input about both the tools they manage and other aspects of the draft report. WAM member Deborah Kempe (Frick Art Reference Library) reviewed the draft thoroughly and provided helpful suggestions.

Special thanks are due to the developers and managers of the 11 tools that we analyzed, particularly those who provided feedback on our draft analyses. They are the dedicated experts who truly make web archiving possible.

Finally, generous colleagues in OCLC Research were indispensable contributors. Program Officer Dennis Massie provided wise counsel and ubiquitous support to the working group throughout the project. Karen Smith-Yoshimura, Roy Tennant, and Bruce Washburn all contributed insightful comments. Profuse thanks also are due to those who efficiently shepherd every publication through the production process: Erin M. Schadt, Jeanette McNicol and JD Shipengrover.

CONTENTS

Introduction
Brief Description of Each Tool
Analysis
Appendix: Full Review of Tools
Notes

INTRODUCTION

The OCLC Research Library Partnership Web Archiving Metadata Working Group (WAM) was formed to recommend descriptive metadata best practices for archived web content.1 When the group began its work early in 2016, we discovered that metadata practitioners had high hopes that it would be possible to extract descriptive metadata from harvested content. This report offers our objective analysis of 11 tools in pursuit of an answer to that question.

We reviewed selected web harvesting tools to determine their descriptive metadata functionalities. The question we sought to answer was this: Can web harvesting tools automatically generate descriptive metadata that supports the discoverability of archived web resources? Auto-generation of descriptive metadata for archived web resources could result in significant gains in the efficiency of data entry and thus help enable metadata production at scale.

Our intent was twofold: 1) provide the web archiving community with a description of each relevant tool’s overall purpose and metadata-related capabilities, and 2) inform WAM’s overarching objective of preparing best practice recommendations for web archiving descriptive metadata based on an understanding of user needs.

What is descriptive metadata? Its ultimate purpose is resource discovery. It describes the content of a resource, associates various access points and describes its relationship to other resources.2 Libraries, archives, and museums rely on descriptive metadata to enable users to locate, distinguish and select resources of all types. Metadata creation has often been found to be the most expensive activity in preparing resources for use.

In today’s library and archives environment, descriptive metadata often is repurposed for use in multiple discovery systems. In the context of archived web content, this may include library catalogs, archival finding aids, and standalone platforms for delivering web content.

At the outset, we were skeptical that the level of automated metadata generation needed to support the discoverability of resources would be feasible due to an inherent limitation: only bare-bones descriptive metadata is recorded in the headers of most web pages. This prediction was borne out by our reviews. Nevertheless, the analysis of tools appended to the report can serve as a useful resource to understand the landscape of harvesting tools available for web archiving.
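To illustrate that limitation, the sketch below (not part of the original report) pulls whatever descriptive metadata a page header exposes. It assumes the third-party requests and beautifulsoup4 packages, and the URL is illustrative only; in practice most pages yield little more than a title and perhaps a description.

```python
# Minimal sketch (not from the report): inspect what descriptive metadata a
# typical HTML <head> actually exposes. Assumes the third-party packages
# `requests` and `beautifulsoup4`; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

def head_metadata(url):
    """Return the handful of descriptive fields most pages expose: a title
    plus whatever <meta> tags the page author bothered to include."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    found = {"title": title}
    for tag in soup.find_all("meta"):
        name = tag.get("name") or tag.get("property")
        if name and name.lower() in (
            "description", "author", "keywords", "dc.title", "dc.creator", "dc.date"
        ):
            found[name.lower()] = tag.get("content")
    return found

if __name__ == "__main__":
    # Dublin Core meta tags (dc.*) are rare in the wild, which is one reason
    # auto-generation of descriptive metadata falls short.
    print(head_metadata("https://example.com/"))
```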

This report is one of a complementary trio being issued simultaneously to document the work of the WAM Working Group. Its siblings are Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group3 and Descriptive Metadata for Web Archiving: Literature Review of User Needs.4


Methodology

We began by examining two lists of web archiving tools as the starting point for identifying relevant tools: one compiled by the International Internet Preservation Consortium (IIPC)5 and the other by WAM member Rosalie Lack.6 We filtered the lists to retain only those tools that harvest or replay web content, are actively under development and/or are actively supported, and appeared to include descriptive metadata capture features.7

Given the open-source nature of tools in this realm, it is challenging to provide continued support and development for particular tools. Some of the tools on our initial list are no longer supported and so were discarded from consideration.

Ultimately, we analyzed these 11 tools:

• Archive-It
• Heritrix
• HTTrack
• Memento
• NetarchiveSuite
• SiteStory
• Social Feed Manager
• Wayback
• Web Archive Discovery
• Web Curator Tool
• Webrecorder

We developed seven criteria for evaluating each tool to ensure consistency in our approach:

1. What is the purpose of the tool and its core functionalities? (e.g., capture, display and/or administrative layer)

2. What objects/files can it take in and generate? (i.e., the atomic unit that the tool creates or alters, such as Mementos or WARCs (Web ARChive files))

3. In which metadata profiles does it record?

4. Which descriptive elements are automatically generated?

5. Which descriptive elements can be created or edited by the user?

6. Which descriptive data elements can be exported for use outside of the tool?

7. What relation does it have to other tools? (e.g., Heritrix gathers metadata that is embedded in a WARC file, some of which is used by Archive-It.)

We did not investigate the tools’ capabilities for generating technical or preservation metadata.


Brief Description of Each Tool

A summary of each tool follows to highlight basic functionality, including what we learned about metadata capabilities.

After preparing a description of each tool’s purpose and functionality, we contacted its owner to obtain feedback and thus ensure the accuracy of our descriptions. We are grateful to all who responded; their suggestions significantly improved upon our work. Our full review of the tools is in the appendix.

Archive-It: This is a widely used subscription web archiving service from the Internet Archive (IA) that harvests websites using a variety of capture technologies, including IA's open-source Heritrix. WARC files8 are preserved in the IA digital repository and can be downloaded by users for preservation in their own repositories.9 The “grab title” feature at the seed level automatically scrapes title metadata from each page. Archive-It provides 16 metadata fields from which users can choose, as well as the ability to add custom fields; metadata can be added manually at the collection, seed and document levels.10

Heritrix: A widely used, open-source web crawler developed by the IA, Heritrix is one of the principal capture tools used by IA and by many others for harvesting websites. It produces WARCs but does not allow for the input or generation of additional descriptive metadata within the tool.11
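As a concrete illustration (not taken from the report), the sketch below reads the per-capture metadata that a Heritrix-produced WARC carries. It assumes the third-party warcio package, and the file name is a placeholder; note that none of these fields amount to descriptive metadata in the discovery sense.

```python
# Minimal sketch (not from the report): read harvest metadata from a WARC.
# Assumes the third-party `warcio` package and a local file named
# example.warc.gz (placeholder).
from warcio.archiveiterator import ArchiveIterator

def warc_capture_metadata(path):
    """Yield per-capture metadata (URL, capture date, MIME type) from a WARC."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                yield {
                    "url": record.rec_headers.get_header("WARC-Target-URI"),
                    "capture_date": record.rec_headers.get_header("WARC-Date"),
                    "mime_type": record.http_headers.get_header("Content-Type")
                    if record.http_headers else None,
                }

if __name__ == "__main__":
    for capture in warc_capture_metadata("example.warc.gz"):
        print(capture)
```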

HTTrack: An open-source capture tool that uses an off-line browser utility to download a website to a directory, generating a folder hierarchy and saving content that mirrors the original website structure. HTTrack produces basic log files and stores some technical metadata in its cache, but it does not generate WARCs or ARCs. It also does not allow for input of any descriptive metadata.12

Memento group: Open-source tooling built on Memento protocols, such as MediaWiki Memento Extension and Memento TimeTravel, that facilitates access to archived websites through http content negotiation based on capture date. Current tooling built on this protocol focuses on browsing of archived content based on URL and capture date. At this time, it lacks functionality to capture descriptive metadata for enhanced discovery.13
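A minimal sketch of the datetime negotiation that the Memento protocol defines, not taken from the report: it asks a TimeGate for the capture of a URL closest to a given date. It assumes the requests package; the TimeGate URL pattern follows the Time Travel service's published form and is used here purely for illustration.

```python
# Minimal sketch (not from the report): Memento-style datetime content
# negotiation. Assumes `requests`; the TimeGate URL pattern is illustrative.
import requests

def nearest_memento(url, when="Thu, 01 Jan 2015 00:00:00 GMT"):
    """Ask a TimeGate for the archived copy of `url` closest to `when`."""
    timegate = "http://timetravel.mementoweb.org/timegate/" + url
    response = requests.get(timegate, headers={"Accept-Datetime": when}, timeout=30)
    return {
        "memento_uri": response.url,  # URI of the archived snapshot after redirect
        "memento_datetime": response.headers.get("Memento-Datetime"),
        "link": response.headers.get("Link"),  # pointers to other mementos
    }

if __name__ == "__main__":
    # Only a URL and capture date come back -- no descriptive metadata.
    print(nearest_memento("http://example.com/"))
```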

NetarchiveSuite: This open-source software is used to plan, schedule and run web crawls. NetarchiveSuite consists of several modules, including a harvester module (which uses Heritrix) for defining, scheduling and running crawls; an archive module that serves as a preservation repository for harvested material; and an access module for viewing harvested material. NetarchiveSuite produces WARCs.14

SiteStory: An open-source transactional tool that captures every version of a resource as it is requested by a web browser, thus replicating the user’s experience. The resulting archive is effectively representative of a web server's entire history, although versions of resources that are never requested by a browser are not included. The only metadata automatically generated is the “http” metadata associated with the web server. SiteStory creates Mementos of each resource served by the web server it is associated with, irrespective of the media of the resource. These Mementos can be off-loaded to WARC files and then uploaded into an instance of the Internet Archive's Wayback software.15

Social Feed Manager: An open-source web application, still under active development, that allows users to create collections of data from social media platforms, including Twitter, Tumblr and Flickr. The application harvests social media data via public APIs and uses Heritrix to capture images and web pages linked from or embedded in the social media content. Consistent descriptive metadata such as creator and


date are inherently embedded in social media posts and are automatically generated. Other descriptive metadata pertaining to selection of a collection (including title and a description) or set of collections must be manually generated.16

Wayback: Released by the Internet Archive in 2005, Wayback is software that replays archived webpages. It consists of a group of interrelated applications “under the hood” that index and retrieve captured web content, render WARC and ARC files, and display the content in a web-based user interface. The next-generation OpenWayback was launched in 2013. The Wayback Machine can be searched only by URL, but the Beta Wayback Machine, released in October 2016, expands search capabilities, including the ability to perform a keyword search against an index of terms contained within the archive. Descriptive elements automatically generated during archiving are username (labeled “creator”), title, capture date/time, format, title of collection (if defined by user) and URL.17

Web Archive Discovery: This open-source tool enables full-text search within web archives by indexing WARC or ARC files, parsing their contents and posting the results in JSON format to an Apache SOLR server. The resulting records can then be parsed either by using a basic front-end interface that is included with the tool or by implementing a custom interface atop the SOLR instance. The only descriptive metadata captured are crawl date, crawl year and Wayback date.18
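A hedged sketch (not from the report) of how a client might query the resulting SOLR index for the fields named above. It assumes the requests package, a local SOLR core named "webarchive," and that the listed fields are stored; the host and core name are placeholders rather than documented defaults.

```python
# Minimal sketch (not from the report): query a Solr index of the kind
# webarchive-discovery builds. Assumes `requests`; host, core name, and
# stored-field availability are placeholders.
import requests

SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"  # placeholder

def search_archive(query, rows=10):
    """Full-text search over indexed captures, returning a few stored fields."""
    params = {
        "q": query,                           # e.g. 'content:"web archiving"'
        "fl": "url,crawl_date,content_type",  # fields to return
        "rows": rows,
        "wt": "json",
    }
    response = requests.get(SOLR_SELECT, params=params, timeout=30)
    return response.json()["response"]["docs"]

if __name__ == "__main__":
    for doc in search_archive('content:"web archiving"'):
        print(doc.get("crawl_date"), doc.get("url"))
```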

Web Curator Tool: This open-source workflow management tool is designed to enable non-technical users to manage the web archiving process. Workflow can be customized to include a variety of specialized tasks to support web acquisition and description, including permission/authorization, selection/scoping/scheduling, harvesting, quality review and archiving. Harvest date is automatically captured, while all other descriptive metadata must be added to the target web resource by the user.19

Webrecorder: An open-source tool that captures the exact sequence of navigation through a series of web pages or digital objects, thus preserving the user’s experience. The recording produces WARC files with a high level of detail that can then be replayed using Webrecorder. Descriptive metadata elements include username (labeled “creator”), title, capture date/time, archive file format, title of collection (if included in the metadata) and all URLs the user visited during a recording session. Descriptive metadata is automatically generated during archiving and embedded in a downloadable WARC file. Future development plans include the ability to add user-created metadata and support for page-level annotations.20

Analysis

As predicted, we found that capabilities for storing descriptive metadata elements and for automatic metadata generation vary widely and tend to be minimal. This may change as new tools are developed and existing ones enhanced. The owners of several tools that are in early stages of development are actively eliciting feedback to determine which elements would be valuable to their users.

Many tools generate WARC files, since this ISO standard is widely used by the web archiving community. Inherent relationships among tools and the ability to export at least some data elements (including administrative and technical metadata) are also fairly common. Widespread use of tools such as Heritrix also contributes to dependencies between tools.


We derived several generalizable conclusions:

• Most tools built for web archives focus on capturing and storing technical metadata for accurate transmission and re-creation but capture minimal descriptive metadata. This is at least partially because so little descriptive metadata exists in the captured files. It therefore must usually be created manually, either within the tool or externally.

• The hope for auto-generation may be fruitless unless or until creators of textual web pages routinely embed more metadata that can be made available for capture.

• The title of a site (as recorded in its metadata) and the date of capture are routinely recorded, but it may not be possible to extract them automatically. Titles are sometimes unhelpful, such as “home page” or “title.”

• Not all tools define descriptive metadata in the same way.

Clearly, stronger metadata functionality in tools to enable automatic generation of descriptive metadata warrants further development, given the needs expressed by metadata practitioners and users. Factors discussed above, particularly the lack of metadata embedded in source files and a completely appropriate focus on capturing and storing technical and preservation metadata for accurate re-creation and transmission, lead us to ask, however, whether the harvesting stage is the most appropriate part of the web archiving process during which to add descriptive metadata for most types of content. Given the lack of metadata features in current web archiving tools, perhaps there is a clearer path forward for approaches that leverage external services and APIs.

That said, it is critical that members of the library, archives and museum community who seek to build web archives become sufficiently knowledgeable to assist tool builders in addressing needs for generating descriptive metadata for discovery of archived web resources. Ongoing development of new tools with new functionalities demonstrates a vibrant community that continues to enable more efficient and effective harvesting of web content and generation of metadata.


APPENDIX: FULL REVIEW OF TOOLS

Archive-It URL: https://www.archive-it.org/ Tool Owner: Lori Donovan

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? Archive-It is a subscription web archiving service from the Internet Archive (IA) and has multiple purposes. It is a capture tool using IA’s open-source web crawler Heritrix; it is an administrative tool that allows partners to collect, catalog and manage their archived collections using a web application; and it preserves partners’ web archives by storing their WARC files, which contain web-captured content, in IA’s digital repositories (partners can also download these files). The public-facing website displays the archived websites using the Wayback Machine and allows visitors to browse and search the content of partners’ public collections. The metadata in the public-facing website is both browsable and searchable.

What objects/files can it take in and generate? The tool generates WARC files and WAT (Web Archive Transformation) files, which can be used to create data analysis reports based on very large data sets; LGA (Longitudinal Graph Analysis) files; and WANE files, which provide researchers the named entities from each text resource in a web archive.

In which metadata profiles does it record? Dublin Core Metadata Element Set

Which descriptive metadata elements are automatically generated? The “Grab Title” feature at the seed level automatically scrapes title metadata from the URL's title meta tag. Additionally, preservation and administrative metadata (including capture date, URL, response codes, etc.) is captured automatically during crawling and is available in the WARC files and via cdx (Wayback index) files. Archive-It provides 16 standard metadata fields plus customized metadata fields. A user can manually add metadata at the collection, seed and document levels. This includes descriptive metadata (title, creator, subject, description, publisher, contributor, relation and collector), administrative metadata (date, identifier, coverage, rights, language) and technical metadata (type, format). Customized elements are not in Dublin Core.


Which descriptive metadata elements can be created/edited by the user? Adding metadata at the collection level is currently manual. Partners have the capability to import a spreadsheet to make bulk changes to metadata at the seed or document levels by using an .ODS (OpenDocument Spreadsheet) file.

Which descriptive metadata elements can be exported and used outside of the tool? A user can make collection and seed-level metadata available to harvesters, including OCLC's WorldCat catalog, via an “opt-in” OAI-PMH metadata XML feed. By selecting an export metadata checkbox at the collection level, the collection will be included the next time the feed is picked up. Metadata at the seed and document level can also be exported via .ODS.

What relation does the tool have to other tools? Archive-It uses the following tools: Heritrix (web crawler), NutchWax (allows for archived websites to be text searchable), SOLR (provides metadata-based search for Archive-It), and Umbra (works with Heritrix; this allows a client-side script to be executed so that previously unavailable URLs can be detected for Heritrix to crawl; useful for websites with dynamic content like social media). Archive-It also integrates with several preservation and access systems including LOCKSS and DuraCloud.
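As an illustration of the “opt-in” OAI-PMH feed mentioned above, the sketch below issues a standard ListRecords request for Dublin Core records. It is not taken from Archive-It documentation: the endpoint URL is a placeholder, and only the generic OAI-PMH protocol is assumed.

```python
# Minimal sketch (not from Archive-It documentation): harvest Dublin Core
# records from an OAI-PMH feed. Assumes `requests`; the endpoint URL is a
# placeholder -- substitute the partner's actual feed.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://example.org/oai"  # placeholder endpoint
DC = "{http://purl.org/dc/elements/1.1/}"

def list_dc_titles():
    """Issue a ListRecords request and print each record's dc:title."""
    response = requests.get(
        OAI_ENDPOINT,
        params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
        timeout=30,
    )
    root = ET.fromstring(response.content)
    for title in root.iter(DC + "title"):
        print(title.text)

if __name__ == "__main__":
    list_dc_titles()
```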


Heritrix URL: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix Tool Owners: Jefferson Bailey, Lori Donovan, Noah Levitt

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? Heritrix is “the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project” (taken from the tool description). It is one of the capture tools used by the IA, including its Archive-It service, and is designed to respect the robots.txt exclusion directives and META robot tags, and to collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

What objects/files can it take in and generate? The tool generates WARC files.

In which metadata profiles does it record? WARC

Which descriptive metadata elements are automatically generated? Heritrix captures technical and structural data (e.g., the structure of a website, MIME type and content length), all of which are automatically generated.

Which descriptive metadata elements can be created/edited by the user? Most elements are automatically generated. The user can add custom metadata in the WARC info header.

Which descriptive metadata elements can be exported and used outside of the tool? All the metadata captured in WARCs can be parsed from the WARCs and used outside of the tool.

What relation does the tool have to other tools? Heritrix gathers metadata that is embedded in a WARC file, some of which is used by Archive-It and the IA’s Wayback Machine, the latter of which is used to playback WARCs. Heritrix is also used as the web crawler by other web archiving software packages including NetarchiveSuite. Additionally, Heritrix generates logs and reports that can be used in a variety of ways by many external tools.


HTTrack URL: https://www.httrack.com/ Tool Owner: Xavier Roche

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? HTTrack is a free open-source capture tool that uses an off-line browser utility to download a website to a directory, which generates a folder hierarchy and saves content that mirrors the original website structure. HTTrack can also update an existing mirrored site, and resume interrupted downloads.

What objects/files can it take in and generate? The objects/files generated by HTTrack reflect the objects/files present on the website being mirrored (e.g., HTML, CSS, JavaScript, JPEG, etc.) HTTrack also produces basic log files and stores some information in its cache, including the raw data, MIME type, URL and server’s headers related to the capture. It does not generate WARCs or ARCs.

In which metadata profiles does it record? Not applicable

Which descriptive metadata elements are automatically generated? HTTrack does not allow for the input of any descriptive metadata and does not extract any technical or preservation metadata from mirrored content. The captured content would need to be run through additional tools to produce any technical or preservation metadata and any descriptive metadata would need to be added in a separate collection management or access system.

Which descriptive metadata elements can be created/edited by the user? Descriptive metadata is not built into the tool. HTTrack is primarily used for copying/preserving websites.

Which descriptive metadata elements can be exported and used outside of the tool? Not applicable

What relation does the tool have to other tools? Not applicable


Web Archiving Tool: Memento (suite of tools) Extension: Memento URL: https://www.mediawiki.org/wiki/Extension:Memento Memento Time Travel URL: http://timetravel.mementoweb.org/ Memento Time Travel API URL: http://timetravel.mementoweb.org/guide/api/ About the Time Travel Service URL: http://timetravel.mementoweb.org/about/ Tool Owner: Owner did not respond to the request for feedback

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? The Memento suite of tools facilitates access to archived versions of web content. The Memento protocol allows http content negotiation, such as the ability to navigate between different versions of a web resource based on the capture/crawl date. This suite of tools is more about the discovery of archived sites than crawling new ones.

What objects/files can it take in and generate? Mementos

In which metadata profiles does it record? Constrained RESTful Environments Link Format (RFC 6690): https://datatracker.ietf.org/doc/rfc6690/

Which descriptive metadata elements are automatically generated? Descriptive metadata is in tooling sitting on top of the Memento protocol and not part of protocol itself.

Which descriptive metadata elements can be created/edited by the user? Tooling that allows for capture of descriptive metadata sits on top of the Memento protocol.

Which descriptive metadata elements can be exported and used outside of the tool? DateTime and URL/URI

What relation does the tool have to other tools? A user can build discovery tools on top of Memento protocol (as evidenced by Time Travel).


NetarchiveSuite URL: https://sbforge.org/display/NAS/NetarchiveSuite Tool Owner: Owner did not respond to the request for feedback

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? The NetarchiveSuite is “a complete web archiving software package” that can be used “to plan, schedule and run web harvests of parts of the Internet.” NetarchiveSuite is divided into several modules, including a harvester module that “handles defining, scheduling and performing harvests,” an archive module that “makes it possible to setup and run a repository with replication, active bit consistency checks for bit-preservation and support for distributed batch jobs,” and an access module that “gives access to previously harvested material through a proxy solution” (quoted phrases are taken from the tool description).

What objects/files can it take in and generate? The tool generates WARC files.

In which metadata profiles does it record? WARC

Which descriptive metadata elements are automatically generated? Technical and structural data (e.g., the structure of a website, MIME type and content length) are both automatically generated.

Which descriptive metadata elements can be created/edited by the user? All metadata elements are automatically generated.

Which descriptive metadata elements can be exported and used outside of the tool? All of the metadata captured in WARCs can be parsed from the WARCs and used outside of the tool.

What relation does the tool have to other tools? NetarchiveSuite uses Heritrix as its web crawler, and the WARCs produced by NetarchiveSuite can be viewed in the IA’s Wayback software.


SiteStory URL: http://mementoweb.github.io/SiteStory/ Tool Owner: Herbert Van de Sompel

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? SiteStory is an open-source transactional web archive that continues to be actively developed. It captures every version of a resource as it is being requested by a web browser. The resulting archive is effectively representative of a web server's entire history, although versions of resources that are never requested by a browser will also never be archived. It archives resources of a web server it is associated with. As a browser requests a resource published by the server, that resource is delivered to the browser but also pushed into the archive. This tool is more fine-grained than “traditional” web archiving and what triggers it is usage of a page by a web browser. The “collection” is the web server that is being archived.

What objects/files can it take in and generate? SiteStory creates Mementos of each resource served by the web server that it is associated with, regardless of the media of the resource. Mementos can be off-loaded to WARC files and can be uploaded into an instance of the IA’s Wayback software.

In which metadata profiles does it record? There is no metadata profile associated with SiteStory.

Which descriptive metadata elements are automatically generated? None. The only metadata being automatically generated is the http metadata associated with the web server.

Which descriptive metadata elements can be created/edited by the user? There is no descriptive metadata associated with SiteStory, although Van de Sompel indicated that it would be very simple to add a form to capture some descriptive metadata about the web server and other information.

Which descriptive metadata elements can be exported and used outside of the tool? Not applicable

What relation does the tool have to other tools? A SiteStory archive is accessible via the Memento protocol.


Social Feed Manager URL: http://gwu-libraries.github.io/sfm-ui/about/overview Tool Owner: Christie Peterson

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? Social Feed Manager (SFM) is a web application that allows users to create collections of data from social media platforms including Twitter, Tumblr, Flickr and Sina Weibo (a Chinese microblogging site). The application harvests social media data via public APIs and uses Heritrix to capture images and web pages linked from or embedded in the social media. Users can export a collection to a spreadsheet, feed data into their processing pipeline from the command line or explore data with Elasticsearch/Logstash/Kibana (ELK). Version 1 was released in June 2016; it is a “work in progress” and development is “active and ongoing” (taken from the tool description).

What objects/files can it take in and generate? Social media data from APIs is written to WARCs, as are web and media captures via Heritrix.

In which metadata profiles does it record? The tool does not record metadata following any established metadata standard, but currently the focus is on provenance metadata. SFM is in active development and the developers are currently taking feedback from early users and testers; they have not yet committed to a metadata profile and are actively exploring what elements can be automatically generated and what can or should be entered by users.

Which descriptive metadata elements are automatically generated? Creation (e.g., who authored a tweet, how and where it was posted, etc.), collection (e.g., info about the http request, the seeds and the WARC files) and selection (deciding which social media to collect and which not to collect, including collection name and description).

Which descriptive metadata elements can be created/edited by the user? Metadata about social media posts and harvesting is automatically generated, while selection metadata about a collection and set of collections is human generated. This includes a title and description. The creation of the collection/collection set and all following edits and additions are documented in a change log to which comments can be appended. SFM is a collection tool, not a preservation or access tool, and as such does not support robust descriptive metadata of the type that would be used in those systems. Instead, the


developers have focused on the elements of description that are critical at the collection phase (i.e., provenance metadata), with only the minimal descriptive metadata required for control at that point (i.e., a title).

Which descriptive metadata elements can be exported and used outside of the tool? While metadata that is part of the social media data (i.e., creation metadata) is exportable, work is still in progress on an “export manifest” with descriptive data elements about collection and selection.

What relation does the tool have to other tools? SFM's primary focus is on harvesting from social media APIs. It makes use of Heritrix for capture of websites and media embedded or referenced in social media. It uses the WARC format for storing harvests; WARCs may be used in a variety of other applications for analysis and playback.


Wayback Machine URL: http://archive.org/web/ Tool Owners: Jefferson Bailey, Lori Donovan

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? The Internet Archive’s Wayback Machine is a “group of interrelated applications that index captured Web content, retrieve the content and display the content in a web-based user-interface” (quotations taken from the tool description). It replays WARC files from the Internet Archive's collection of over 270 billion web pages. The public interface, also referred to as the general Wayback, can be searched only by URL, as opposed to the Wayback within the Archive-It service, which supports full-text searching. Once a URL has been retrieved, the archive for that website can be browsed by date of capture. Three Wayback Machine APIs are available: Wayback Availability JSON API, Memento API and Wayback CDX Server API.

What objects/files can it take in and generate? Wayback reads and renders WARC and ARC files. The Wayback Index is generated by parsing WARC data and creating CDX (capture index) records. The CDX points to the location within the WARC and ARC of the data being replayed and also contains other key metadata.

In which metadata profiles does it record? CDX file format

Which descriptive metadata elements are automatically generated? Wayback captures descriptive metadata (canonical URL, original URL), administrative metadata (date, redirect URL) and technical metadata (IP, MIME type, HTML response code, MD5 digest, file offset and WARC file name).

Which descriptive metadata elements can be created/edited by the user? All descriptive metadata elements are automatically generated.

Which descriptive metadata elements can be exported and used outside of the tool? The Wayback CDX Server API allows for complex querying, filtering and analysis of capture data. The following fields are publicly accessible: urlkey, timestamp, original, mimetype, statuscode, digest and length.

What relation does the tool have to other tools? Not applicable
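The sketch below (not from the report) queries the Wayback CDX Server API for the publicly accessible capture fields listed above. It assumes the requests package and the public endpoint at web.archive.org; the example URL is illustrative.

```python
# Minimal sketch (not from the report): query the Wayback CDX Server API for
# capture records. Assumes `requests`; example.com is a placeholder URL.
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def captures(url, limit=5):
    """Return capture records (urlkey, timestamp, original, mimetype,
    statuscode, digest, length) for a URL as dictionaries."""
    params = {"url": url, "output": "json", "limit": limit}
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    if not rows:
        return []
    header, data = rows[0], rows[1:]  # first row holds the field names
    return [dict(zip(header, row)) for row in data]

if __name__ == "__main__":
    for capture in captures("example.com"):
        print(capture["timestamp"], capture["original"], capture["mimetype"])
```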


Web Archive Discovery URL: https://github.com/ukwa/webarchive-discovery Tool Owner:

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? Web Archive Discovery provides full-text search for web archives by indexing/parsing WARC and/or ARC files and posting resulting records to an Apache SOLR server. The SOLR server can then be queried by a front-end GUI (the GUI is currently based on Drupal Sarnia, but soon to be custom built). Web Archive Discovery is intended to be middleware, and users can put their own interface on top of SOLR. This could be useful for end users with technical skills, but it is not intended to be an end-user tool.

What objects/files can it take in and generate? WARC, ARC

In which metadata profiles does it record? The tool parses WARC into JSON format for SOLR querying.

Which descriptive metadata elements are automatically generated? Web Archive Discovery captures descriptive metadata (including crawl_date, crawl_year, wayback_date and content), administrative metadata (including source_file, url, server, host, content_type and hash) and technical metadata (including content, url, url_type, resourcename, content_type_droid and content_type_tika).

Which descriptive metadata elements can be created/edited by the user? All metadata elements are automatically generated.

Which descriptive metadata elements can be exported and used outside of the tool? All metadata elements can be extracted and exported.

What relation does the tool have to other tools? “While search is the primary goal, the fact that we are going through and parsing every byte means that this is a good time to perform any other analysis or processing of interest. Therefore, we have been exploring a range of additional content properties to be exposed via the Solr index. These include format analysis (Apache Tika and DROID), some experimental preservation risk scanning, link extraction, metadata extraction and so on” (taken from the tool description).


Web Curator Tool URL: http://dia-nz.github.io/webcurator/ Tool Owner: Ben O’Brien

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? The Web Curator Tool (WCT) is an open-source workflow management application designed to allow non-technical users to manage the selective web archiving process. Harvesting workflow comprises a series of specialized tasks to support web acquisition and description including: permission/authorization, selection/scoping/scheduling, harvesting, quality review and archiving. WCT does not attempt to be a digital repository, access tool, catalog or records management system.

What objects/files can it take in and generate? WARC, ARC

In which metadata profiles does it record? Optional basic Dublin Core Metadata Schema for “Description” element.

Which descriptive metadata elements are automatically generated? Harvest date is automatically calculated and inserted into the dc:date field. All other descriptive metadata must be added to “target” web resource by a user.

Which descriptive metadata elements can be created/edited by the user? Web resource name, owner, annotation/notes and Basic Dublin Core metadata in the descriptive field.

Which descriptive metadata elements can be exported and used outside of the tool? All metadata is added to the WARC/ARC object and will be included when submitted to a digital repository, where it may then be accessed: “When a Target Instance is submitted to an archive, its Target metadata is included in the SIP” (taken from the tool description).

What relation does the tool have to other tools? Employs Heritrix to crawl and uses WARC as its “atomic unit.” Can optionally integrate with the Wayback Machine and Rosetta DPS. For Rosetta DPS, a plugin adds DNX metadata, some of it user input and some auto-generated, that is used by Rosetta.


Webrecorder URL: https://webrecorder.io/ Tool Owner: Dragan Espenschied

Criteria for Evaluating Tool | Description

What is the basic purpose of the tool and its core functionalities? Webrecorder is a free service that is “a human-centered archival tool to create high-fidelity, interactive, contextual archives of social media and other dynamic content, such as embedded video and complex JavaScript.” Unlike other crawlers, Webrecorder archives web content through interactive browsing, capturing the exact sequence of navigation through a series of web pages or digital objects and preserving the unique experience of an individual user at a moment in time. This approach places control of the archive in the hands of the curator. The tool uses the same software to capture and replay the site; this approach is called symmetrical web archiving. New developments in 2016 include the ability to upload external WARC (or ARC) files into a user’s Webrecorder collection and the inclusion of the full metadata about the collection when the WARC file is exported. These enhancements allow WARCs to move more easily between systems.

What objects/files can it take in and generate? WARC

In which metadata profiles does it record? Named fields and JSON

Which descriptive metadata elements are automatically generated? All data elements are automatically generated during archiving. The descriptive elements are username (labeled “creator”), title, capture date/time, archive file format, title of collection (if defined by user) and URL.

Which descriptive metadata elements can be created/edited by the user? User-created metadata is a planned future development, which may include support for page-level annotations.

Which descriptive metadata elements can be exported and used outside of the tool? All generated metadata is embedded in the downloadable WARC file.

What relation does the tool have to other tools? Not applicable


NOTES

1. http://oc.lc/wam.
2. For a thorough introduction to descriptive and other types of metadata, including technical, preservation, rights and structural, see: Riley, Jenn. 2017. Understanding Metadata: What Is Metadata, and What Is It For? Baltimore, Maryland: NISO Press. http://www.niso.org/apps/group_public/download.php/17446/Understanding%20Metadata..
3. Dooley, Jackie, and Kate Bowers. 2018. Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group. Dublin, OH: OCLC Research. doi:10.25333/C3005C.
4. Venlet, Jessica, Karen Stoll Farrell, Tammy Kim, Allison Jai O’Dell, and Jackie Dooley. 2018. Descriptive Metadata for Web Archiving: Literature Review of User Needs. Dublin, OH: OCLC Research. doi:10.25333/C33P7Z.
5. International Internet Preservation Consortium. 2017. “Tools and Software.” Web Archiving. Accessed 6 January 2017. http://www.netpreserve.org/web-archiving/tools-and-software.
6. Formerly at the California Digital Library, now at the University of California, Los Angeles.
7. While we did not rely on Harvard Library’s Environmental Scan of Web Archiving, its analysis of web archiving tools merits a look; see Truman, Gail. 2016. Web Archiving Environmental Scan. Cambridge, MA: Harvard Library. https://dash.harvard.edu/handle/1/25658314.
8. International Organization for Standardization Technical Committee 46, Subcommittee 4. 2006. “Information and Documentation—The WARC File Format.” Draft (ISO/DIS 28500) distributed for review and comment. http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
9. WARCs contain extensive technical and structural metadata about a web capture. Their structure is based on International Standards Organization, “ISO 28500:2009: Information and Documentation—WARC File Format.” http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717. Further information is available in the Library of Congress's documentation of digital file formats. https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
10. Archive-It: https://support.archive-it.org/hc/en-us/articles/208332603-Add-edit-and-manage-your-metadata.
11. Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix.
12. HTTrack: https://www.httrack.com.
13. Memento group: http://timetravel.mementoweb.org/about/.
14. NetarchiveSuite: https://sbforge.org/display/NAS/NetarchiveSuite.
15. SiteStory: https://mementoweb.github.io/SiteStory/.
16. Social Feed Manager: https://social-feed-manager.readthedocs.io/en/master/.
17. Wayback: http://archive.org/web/web.php.
18. Web Archive Discovery: https://github.com/ukwa/webarchive-discovery/wiki.
19. Web Curator Tool: http://webcurator.sourceforge.net/.
20. Webrecorder: https://webrecorder.io/.


For more information about our work related to digitizing library collections, please visit: oc.lc/digitizing

6565 Kilgour Place Dublin, Ohio 43017-3395 T: 1-800-848-5878 T: +1-614-764-6000 F: +1-614-764-6096 ISBN: 978-1-55653-014-2 DOI:10.25333/C37H0T www.oclc.org/research RM-PR-215938-WWAE 1709