
Web archiving: Policy and practice

Received (in revised form): 24th October, 2019

Tori Maches is the Digital Archivist at UC San Diego. Her work includes responsibility for UC San Diego’s web archiving efforts, as well as capturing born-digital materials stored on legacy media and other aspects of stewardship of born-digital archival materials. Her previous roles include assistant archivist with Concordis LLC, and digital programme scholar at UCLA. She is also the Vice-Chair of the Society of American Archivists Web Archiving Section, Vice-Chair of the UC Libraries Born-Digital Common Knowledge Group, and a member of the UC Libraries Web Archiving Common Knowledge Group.

UC San Diego Library, University of California, San Diego, 9500 Gilman Drive #0175-S, La Jolla, CA 92093-0175, USA Tel: +1 858 822 0838; E-mail: [email protected]

Marlayna Christensen is the University Archivist at UC San Diego. She is responsible for liaising with, and selecting records from, campus and affiliated groups. Her previous roles include Reference Librarian for Interlibrary Loan at New York University; Head of Interlibrary Loan and Assistant Head of Circulation Services at Yale University; and Director of Access Services at the University of California, Santa Barbara. Over her 30-year career, she has served on a number of professional committees with the Society of American Archivists, the Society of California Archivists, the American Library Association, and within the University of California.

UC San Diego Library, University of California, San Diego, 9500 Gilman Drive #0175-S, La Jolla, CA 92093-0175, USA Tel: +1 858 534 8605; E-mail: [email protected]

Abstract The UC San Diego Library has been collecting and providing access to archived web content since 2007. Initial collections were created on an ad hoc basis, with no high-level plan to identify websites and content of interest, and there was little documentation of how early decisions were made. As time passed, the library’s web archiving efforts increased in scale and outgrew this informal approach. Efforts were made to standardise web archiving processes and policies via collection request forms and standardised metadata, eventually culminating in the creation of a web archive collection development policy, as well as collection and quality control workflows and tracking. This article outlines the process of creating these tools, including establishing institutional needs and concerns, evaluating the wider landscape of web archiving policies and norms, and considering sustainable use of available resources. The article also discusses future areas of work to ensure that web content of research and historical interest is captured in full, preserved responsibly, and made accessible even when the original websites have changed or disappeared.

KEYWORDS: web archives, collection development, policy, process, workflows

INTRODUCTION
Creating an institutional web archiving programme can be an intimidating prospect. Capturing and preserving websites and other content in web archive collections allows libraries and archives to provide these resources long after the original website has changed or disappeared; facilitate stable citation links and effective fact-checking; and guard against the loss of content with historic value.


However, the degree of institutional commitment required, combined with the lack of documentation on best practices, can make it difficult to know where to begin. Likewise, changing the scope, scale and mandate of a web archiving initiative from a pilot programme or ad hoc collecting effort to a fully realised part of institutional collecting can be similarly daunting. Web collections created on an ad hoc basis may not be structured in an intuitive or efficient way, making management, discovery and future collecting more challenging.

This article explores these and other considerations related to web archive collection-building, management and access. The writers will discuss challenges and strategies related to web archive collection development, description and management, using case studies from their own institution as a lens through which to explore more generally applicable solutions.

WEB ARCHIVING AT UC SAN DIEGO
The UC San Diego Library began archiving web content in 2007 as a collecting effort through the California Digital Library’s Web Archiving Service (WAS), and captured content sporadically until 2011, when web archive collecting largely came to a halt due to staffing changes. Initial collections were created on an ad hoc basis (Figure 1); there was no high-level plan to identify websites and content of interest, and web archiving collection decisions were not documented for future reference. Web archiving resumed in 2015, after WAS shut down and UC San Diego migrated to the Internet Archive’s Archive-It subscription service along with the rest of the University of California (UC) system.1 Archive-It includes curation, collection and access services.

Web archiving also became more systematic with the recruitment of a digital archivist. She implemented a standardised form to collect essential metadata for new and existing collections, and this metadata was added to the collection information on Archive-It’s public portal and used to create a catalogue record with a link to the archived pages.

Due to this initial work to standardise web archiving metadata and formalise collecting, the number of web archive collections in institutional holdings began to expand, and this continued after the first digital archivist’s successor was recruited in 2018. By this time, the programme had increased in scale and outgrown this approach to collecting, and creating a web archive collection development policy became a priority. In 2019, a project management process and workflow were developed to track the status of and next steps involved in creating and monitoring web archive collections (Figure 2).

INSTITUTIONAL WEB ARCHIVE POLICY NEEDS
As of autumn 2018, regularly scheduled captures included the library’s website, campus news and local government websites. Archived web content also included collections created in response to issues and events of local impact, such as natural disasters, on-campus incidents and regional political activity. According to internal account statistics generated via Archive-It, as of summer 2019, the library has captured 5.6 TB of data since its migration from WAS to Archive-It, with approximately 3 TB of said data captured since 2017.

Given this sharp increase in web archiving scope and scale, the digital archivist was tasked with writing a collection development policy for web archives that would codify collecting practices developed organically as web archiving activity grew, and guide the library’s web archiving work through future periods of growth.

Anecdotally, this is a common way for institutions to begin.


Figure 1: Prior web archiving workflow

The external colleagues contacted while the digital archivist began work on the policy were largely in similar circumstances; their institutions began collecting web content without an official collection development policy, and either codified policies later, or relied on draft policies to guide web archive collecting. Research supports this assertion; the results of the National Digital Stewardship Alliance (NDSA) 2011 Web Archiving Survey indicated that 32 per cent of the institutions actively engaged in web archiving lacked a collecting policy for this work.2 Additionally, 56 per cent of the institutions with web archiving programmes in the testing or planning stages lacked a collection policy that involved web archiving.3 Reports from later iterations of the survey, which were conducted in 2013, 2016 and 2017, devote less focus to institutions with emerging web archiving programmes; respondents to these surveys had more mature programmes, so this question was no longer relevant.


Figure 2: Trello web archiving workflow

Before beginning work on the policy, the digital archivist conducted research to determine institutional priorities and learn about common practices in web archiving. The first step in this research was to establish a list of internal stakeholders, consisting primarily of library administrators and subject selectors responsible for or interested in curating web archive collections. Meeting with these stakeholders would allow the digital archivist to understand the history of the institution’s existing web archive collections, and determine priorities for the institution’s future approach.


The digital archivist built awareness of the web archiving project and collection development policy via one-on-one conversations, and by attending relevant committee meetings to describe the library’s web archiving work and progress made on the policy. These meetings represented an opportunity for outreach about past and current web archiving efforts, to ensure that subject selectors knew that they could pursue web archiving for websites related to their subject areas, and to solicit feedback to guide future work on the policy. Once a draft was complete, attending committee meetings allowed the digital archivist to present the draft for comments, determine the extent of revisions that would need to be made, and, following these revisions, receive administrative approval to implement the policy.

After meeting with these stakeholders, a list of primary institutional concerns and priorities was developed. This list included:

• a need for more awareness among selectors that web archiving tools were available;
• a need to streamline the collection proposal process to facilitate the timely capture of at-risk or otherwise ephemeral content;
• a need for better documentation of the history of and decisions related to each web archive collection;
• a need to remain within a limited annual data budget for new web archive collecting;
• a need for more established and transparent processes regarding resource allocation, including establishing a list of individuals and entities responsible for resource decisions related to web archiving.

Generally, there was institutional interest in web archiving at all levels. Selectors wanted the processes of creating, augmenting and maintaining web archive collections to be streamlined, transparent and widely acknowledged throughout the library. Meanwhile, administrators wanted assurance that enhanced collecting and increased awareness remained in balance with a limited annual budget.

ENVIRONMENTAL SCAN OF WEB ARCHIVING COLLECTING POLICIES
In addition to these internal conversations, the digital archivist conducted an environmental scan to understand the landscape of web archiving work, as well as common practices that could be used to shape the policy document. This environmental scan sought to discover how policy approaches change over time, how quickly existing policies became outdated, and the information typically included in collection development policies for web archives.

The environmental scan included publicly available policies from institutions of different sizes and types, policies created by other UC institutions, and a high-level view of trends and common practices in the field. ‘Common practices’ are distinct from ‘best practices’, as there are not yet codified best practices for web archive collection development policies, but there are widely adopted norms. However, as these practices are not standardised, the concept of web archiving ‘best practices’ can change dramatically in a relatively short time. This became evident after reviewing older policies, as well as observing how responses to NDSA web archiving surveys changed over the years.

Publicly available policies were obtained via an online search, and nine representative policies were selected for inclusion in the environmental scan: three from public universities, four from private universities, one from a small private college and one from a government library. Unpublished draft documents from UC San Francisco and UC Davis were also included in the scan.


After gathering the policies, the digital archivist noted each policy’s approach to different aspects of collecting (eg selection, acquisition, access, notification of or obtaining permission to collect web content, takedown policies, deselection, institutional priorities), to identify similarities. In the absence of accepted web archiving best practices, comparing policies to identify trends in structure or content allowed for a greater sense of web archiving norms.4 Out of the many different topics addressed in these policies, only selection criteria appeared in all of them. Issues of access, permissions and institutional priorities appeared in many.
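To illustrate the kind of comparison involved (the policy names and aspect labels below are hypothetical stand-ins, not the nine policies actually reviewed), a simple tally can show which aspects of collecting appear in every policy and which appear in only some:

    from functools import reduce

    # Each reviewed policy mapped to the aspects of collecting it addresses.
    policies = {
        "Public university A": {"selection", "access", "permissions", "takedown"},
        "Private university B": {"selection", "access", "preservation"},
        "Government library C": {"selection", "permissions", "deselection"},
    }

    # Aspects addressed by every policy (set intersection across all policies).
    in_all = reduce(set.intersection, policies.values())

    # Aspects addressed by more than half of the policies.
    all_aspects = set().union(*policies.values())
    in_many = {a for a in all_aspects
               if sum(a in p for p in policies.values()) > len(policies) / 2}

    print("In all policies:", in_all)    # here, only 'selection'
    print("In many policies:", in_many)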
Many of the web archiving policies dovetailed with other collection development policies at their respective institutions. Compared with these more traditional collecting policies, the web archiving policies devoted more attention to subject-agnostic criteria, such as uniqueness; whether the web content is at risk of being lost; whether it complements existing collections; or whether the content represents voices and communities that are not well represented in the institution’s existing holdings. They also addressed issues specific to web archiving, such as whether capture is feasible with current technology, and approaches to clearing legal and technical permissions. Some policies had not been updated in several years, and thus reflected outdated attitudes about web archiving approaches, or relied on defunct tools or methods. These older policies not only showed how the web archiving landscape has changed over the last five or ten years, but also emphasised the importance of revisiting policies regularly to ensure that they remain useful and relevant. This observation informed the decision to embed a revision schedule into the policy, to prevent the policy from becoming inaccurate or outdated as tools and practices change.

DRAFTING THE POLICY
After the environmental scan, a policy was drafted and shared with internal and external colleagues for comment. The policy was also presented at the Society of California Archivists’ Annual General Meeting in April 2019, to inform the broader archival community of the work accomplished and to solicit feedback. The policy was edited to reflect comments and input, and the subsequent draft was approved for use in July 2019.

The policy itself is divided into five sections, each reflecting a set of considerations related to web archive collecting: Collecting, Robots.txt, Access, Preservation and Takedown.

The Collecting section describes how to create a collection, or group of one or more related URLs that address a topic of interest, and explains the library’s selection criteria for web content (said criteria are drawn from existing internal policy for selecting materials for digitisation). Both policies address materials that encompass a variety of subjects and formats, and for which the university does not necessarily own the copyright. Selection criteria include: degree of likely research interest; whether the website is UC-owned or affiliated; whether the website or content is at risk; and whether the material addresses an existing collection gap or missing perspective. These criteria currently serve as guidelines to direct collecting, with collections that meet more criteria receiving higher priority, but the library has not yet determined a specific process for final collection decisions. Reusing the library’s existing selection criteria for digitisation provided a model intended to be as transparent and easily understood as possible. The criteria were simple, subject-agnostic, already in use, and similar to the web archiving selection criteria employed by peer institutions.
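The policy does not yet prescribe how these criteria translate into a final decision, but the idea that collections meeting more criteria receive higher review priority can be sketched as a simple count; the criteria flags and candidate URLs below are purely illustrative, not the library’s procedure:

    # The four selection criteria named in the policy, expressed as yes/no flags.
    CRITERIA = ("research_interest", "uc_owned_or_affiliated", "at_risk", "fills_collection_gap")

    def priority_score(candidate: dict) -> int:
        """Count how many selection criteria a proposed website meets."""
        return sum(bool(candidate.get(c)) for c in CRITERIA)

    proposals = [
        {"url": "https://example.org/campus-news", "research_interest": True,
         "uc_owned_or_affiliated": True, "at_risk": False, "fills_collection_gap": False},
        {"url": "https://example.org/activist-blog", "research_interest": True,
         "uc_owned_or_affiliated": False, "at_risk": True, "fills_collection_gap": True},
    ]

    # Higher scores are reviewed first; final decisions remain a human judgment.
    for p in sorted(proposals, key=priority_score, reverse=True):
        print(priority_score(p), p["url"])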


This section also outlines the responsibilities and considerations of the library, digital archivist, content selector and, where applicable, content creator. It also outlines the process of requesting that an Archive-It collection be created; notifying content owners of the library’s interest in capturing their content and requesting that they deny permission within a specified period of time if they do not wish for their content to be captured; and the technical capabilities and limitations of Archive-It’s web-crawling technology. For example, Archive-It is best equipped to capture static content, and content that requires little or no user input, so the policy informs selectors that dynamic or heavily interactive content would not be a good candidate for capture.5 Given these capabilities and restrictions, the Collecting section also states that although the library will make a good faith effort to capture requested websites, it cannot guarantee that this will be possible for every website. This section of the policy serves to manage expectations; by outlining these responsibilities and capabilities at the start, those involved in capturing a website will understand both expectations and commitments.

The Robots.txt section details the library’s policy for capturing websites that employ robots.txt exclusions. Robots.txt exclusions prohibit web crawlers (programs whose functions may include search engine indexing, web archiving, or malicious capture of personal information) from functioning on a particular website.6 These exclusions may be included to block malicious programmes, but, as Archive-It uses similar crawling technology, it is blocked as well.7 Archive-It can be directed to ignore robots.txt exclusions during capture of a specific website, so it is important to consider how to address these exclusions in order to capture desired content.8 The library chose to ask permission before capturing sites with a robots.txt exclusion unless the exclusion is part of the platform hosting the content, or otherwise not under the content owner or creator’s control. This policy is slightly more conservative than common practice: according to the 2017 NDSA web archiving survey, 70 per cent of institutions surveyed did not seek permission or notify content owners before capturing content, while 91 per cent of institutions surveyed had never been asked to take down or cease crawling content they had captured without permission.9 However, the policy does not involve asking for permission to capture all content with a robots.txt exclusion; content such as YouTube videos, for which Archive-It documentation recommends ignoring robots.txt exclusions due to the platform’s automatic restrictions on crawling videos and page styling, would not require permission to capture.10 In this way, websites employing robots.txt exclusions can be captured, but the wishes of content owners who do not wish for their content to be preserved remain an institutional consideration.
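For readers unfamiliar with how such exclusions work in practice, the short sketch below (not part of the library’s policy, and not how Archive-It itself operates) uses Python’s standard-library robots.txt parser to show how a crawler decides whether a published rule excludes a given URL; the rules, domain and user agent are placeholders:

    from urllib import robotparser

    # Hypothetical robots.txt rules published by a site owner.
    rules = """
    User-agent: *
    Disallow: /private/
    Disallow: /drafts/
    """

    parser = robotparser.RobotFileParser()
    parser.parse(rules.splitlines())

    # A well-behaved crawler checks each URL before fetching it.
    for url in ("https://example.org/news/2019/", "https://example.org/private/minutes/"):
        allowed = parser.can_fetch("ExampleArchiveBot", url)
        print(url, "->", "fetch" if allowed else "excluded by robots.txt")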
The Access section describes access decisions that Archive-It requires at the time of capture, in order to ensure that collections and individual websites have the correct metadata. This section of the policy explains the circumstances under which a collection might be listed as ‘public’ in Archive-It, that is, available to users via Archive-It’s public access portal, instead of available only to staff with an Archive-It login.11 Collections are marked ‘private’ if they are incomplete, undergoing quality control, under embargo, or otherwise not ready for public access. All other collections or websites are marked ‘public’.

The Preservation section acknowledges that Archive-It does not provide local backup or preservation options, and commits to exploring solutions for preserving web archive files.12

The Takedown section reuses the existing institutional takedown policy for publicly available digital content. That is, if a content owner believes that their content is available in violation of fair use, or does not want their content archived, they can contact the institution to request that it be removed from public access. The decision to rely on and reuse these existing policies for web archiving, where applicable, not only underscores the connection between web archiving and existing institutional collecting work, but also provides a familiar template for the parts of the policy that were not web archives-specific. This is intended to forestall potential qualms about archiving copyrighted material, especially as the language in this portion of the policy has already been approved for use within the library.


FUTURE POLICY WORK
As of now, the policy has been approved for use as a draft. The library will continue to respond and adapt to its experiences as the policy is implemented, and continue to refine the document during this trial period. Issues to be addressed during this period include determining responsibility for collection decisions, ensuring that the collection proposal and creation process functions as planned, and evaluating the selection criteria to determine both whether they accurately reflect institutional priorities for new collecting, and how each criterion should be weighted in relation to the others.

Overall, the library’s experiences offer general guidance both for future work on and revisions to the policy, and for other institutions interested in creating web archive policies of their own. For example, a crucial lesson was the importance of considering a policy’s longevity; web archiving policies should be created with consideration of how professional approaches and institutional resource commitments might change over time. Web archiving philosophies change quickly and frequently. Technological affordances, prevalent tools and resource availability are all subject to change, both at an institutional and a profession-wide level, and websites themselves can be ephemeral. This means that it is especially important to be able to review and revise a policy regularly, to address changing resources and keep the policy relevant (eg due to reliance on defunct tools, or resource and other changes that affect projected timelines for work completion).

MOVING FORWARD
The new collection development policy provides structure for decisions throughout the current web archiving process, and allows the library to identify and prioritise content to capture and make continuously accessible to users, even after the original website has changed or disappeared. After websites of interest have been identified and approved for collecting, the formal web archiving workflow begins. This process is divided into four steps: plan, collect, describe and access, reflecting the major types of work involved in creating web archive collections and beginning to capture content. Tracking each collection as it moves through this process allows the library to manage and maintain a group of collections that will continue to expand over time. It also helps ensure that captured web content is complete, by facilitating communication between collection stakeholders and quality control review of collections with different needs and technical problems.

PLAN
The first stage of UC San Diego’s workflow involves creating a collection plan, which outlines the steps and states the objectives for capturing the targeted content. Questions asked and answered during the planning phase generally include:

• Does the collection meet the selection criteria detailed in the collection development policy?
• How much data must be captured?
• What type of crawl would work best (eg single page, single page plus external links, entire domain, entire domain plus external links)?
• How often should the crawl occur?
• How many times will a given URL be crawled? If crawls are ongoing, when will the decision to crawl the URL be re-evaluated?
• Who are the stakeholders in this collection (eg site owners, subject selectors)?
• What rights does the institution hold?
• Will the captured collection be immediately accessible to the public? If not, how long will it be private, or when will it be reviewed for release?
• What resources are available for capturing and maintaining the collection (eg staffing, time, money)?
• Are there limitations which should be placed on the collection in order to use resources effectively (eg impose a data cap, or exclude specific types of documents or links)?
• How will users find and access the collection?
• Will the captured content resemble the original website when users access it?
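A lightweight way to make the answers to these questions explicit is to record them in a structured plan; the sketch below is illustrative only (the field names and sample values are hypothetical, not a library system):

    from dataclasses import dataclass, field

    @dataclass
    class CollectionPlan:
        """Answers to the planning questions for one proposed collection."""
        title: str
        seed_urls: list
        crawl_type: str            # eg 'single page', 'entire domain plus external links'
        crawl_frequency: str       # eg 'weekly', 'quarterly', 'one time'
        reevaluation_date: str     # when an ongoing crawl will be reviewed
        data_cap_gb: float         # limit to stay within the annual data budget
        publicly_accessible: bool
        stakeholders: list = field(default_factory=list)

    plan = CollectionPlan(
        title="Regional wildfire response (hypothetical)",
        seed_urls=["https://example.org/emergency-updates/"],
        crawl_type="single page plus external links",
        crawl_frequency="weekly",
        reevaluation_date="2020-06-30",
        data_cap_gb=5.0,
        publicly_accessible=True,
        stakeholders=["subject selector", "site owner"],
    )
    print(plan)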


Trello was implemented for staff engaged in web archiving work to track and manage workflows, communicate with each other, and record decisions made by the selector, the digital archivist and the metadata librarian. Trello (https://trello.com) is an online project management tool that is free to use with limited functions, or fully available with a subscription. Trello was selected because it provided a visually intuitive, flexible, shared environment in which to monitor the progress of active collections and facilitate communication and collaboration between subject selectors, metadata librarians, the digital archivist and other staff assisting in the process. Trello’s interface is based on project boards that are used to track the stages of a project, and are subdivided into sections, or lists, representing each step and collecting associated tasks, or cards (Figure 3). The institutional Web Archiving Project board is shared with staff engaged in web archiving work.

The Web Archiving Project board is divided into lists corresponding to each phase of creating a new collection. Each list is divided into cards detailing outstanding tasks, questions, issues and decisions. Once a new collection is proposed, a corresponding card is added to the Potential Collections list on the board. On the card is recorded the basic collection information needed to create the collection within Archive-It, set parameters for the crawl, and create collection-level metadata. The standard descriptive information collected at UC San Diego for a new collection includes: collection title; a brief description of the collection; creator name(s); date range of content; starting date of capture; language(s) of the website; publisher; rights and access information (will the collection be publicly accessible, or only available to staff with an Archive-It login); geographic and topical subjects; the name of the librarian or department that requested the collection; and the seed (ie starting point) URLs to be crawled (Figure 4). Additional crawl information recorded on the card includes crawl frequency, crawl extent, and either an end date for crawling or a date to re-evaluate an ongoing crawl.
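Because Trello also exposes a public REST API, a proposal card carrying this descriptive information could in principle be created programmatically. The sketch below assumes Trello’s standard card-creation endpoint (POST /1/cards) and uses placeholder credentials, list identifiers and metadata; it is an illustration, not part of the library’s documented workflow:

    import requests

    # Placeholder credentials and board list ID; real values come from a Trello account.
    API_KEY, API_TOKEN = "your-key", "your-token"
    POTENTIAL_COLLECTIONS_LIST_ID = "5d000000000000000000000a"

    metadata = {
        "Title": "Local government websites (hypothetical)",
        "Description": "City and county sites of regional interest.",
        "Creator": "Various",
        "Date range of content": "2018-",
        "Language": "English",
        "Rights/access": "Public",
        "Seed URLs": "https://example.gov/",
        "Crawl frequency": "Quarterly",
    }
    description = "\n".join(f"{k}: {v}" for k, v in metadata.items())

    # Create a card on the Potential Collections list via Trello's REST API.
    response = requests.post(
        "https://api.trello.com/1/cards",
        params={
            "key": API_KEY,
            "token": API_TOKEN,
            "idList": POTENTIAL_COLLECTIONS_LIST_ID,
            "name": metadata["Title"],
            "desc": description,
        },
    )
    response.raise_for_status()
    print("Created card:", response.json()["id"])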

Figure 3: Trello project board


As a new collection moves through the process, staff move the corresponding card from one list to the next to monitor the collection’s status and make any changes clear. When more granular tracking is needed, staff add checklists to cards to show work completed.

Figure 4: Trello card


COLLECT
The Collection phase includes deciding the scope of the content to be captured, testing to ensure that this content is captured as expected, and quality control or assurance review.

Scoping
Defining which content is included in or excluded from capture, or scoping the capture, is an essential part of shaping the outcome. As the library primarily uses Archive-It, which employs a web crawler, options for scoping a capture can include how far the crawler will go from the original URL (eg capturing a single page or an entire domain), how frequently captures occur, and limits on data captured and time spent crawling. Limiting data captured, and carefully considering time spent crawling, are particularly important when scoping captures for social media sites and pages with embedded media. A lack of careful scoping can result in capturing large amounts of unwanted data.

Most of the crawls at UC San Diego are scoped to capture only the immediate domain URL. When a site included related content beyond the organisational domain, an expanded crawl was used. In the instance of capturing the campus news site, which linked to a number of local news sources, the institution expanded the crawl to include pages external to the campus.

The frequency at which a URL is captured is also a factor to consider during scoping. Questions used to determine frequency include how frequently content of interest changes or disappears from the page, and whether it is important to capture every change to the available content. For example, in the case of federal government websites, daily or weekly changes may be significant. On the other hand, the general catalogue of campus courses does not change significantly during the school year, and can be captured less frequently.
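Conceptually, these scoping choices act as a rule applied to every URL the crawler discovers. The simplified sketch below illustrates the idea under that assumption; it is not how Archive-It implements scoping, and the seed and URLs are placeholders:

    from urllib.parse import urlparse

    def in_scope(url: str, seed: str, crawl_type: str) -> bool:
        """Very simplified scope rule: decide whether a discovered URL should be captured."""
        same_page = url.rstrip("/") == seed.rstrip("/")
        same_domain = urlparse(url).netloc == urlparse(seed).netloc
        if crawl_type == "single page":
            return same_page
        if crawl_type == "entire domain":
            return same_domain
        if crawl_type == "entire domain plus external links":
            # External pages directly linked from the domain would also be kept;
            # a real crawler tracks link depth, which is omitted here.
            return True
        return False

    seed = "https://example.edu/news/"
    for url in ("https://example.edu/news/", "https://example.edu/events/",
                "https://twitter.com/example"):
        print(url, "->", in_scope(url, seed, "entire domain"))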


Testing
In addition to typical crawls, Archive-It provides a test crawl option, which allows staff to capture new URLs on a trial basis without impacting the data budget. After a test crawl is complete, staff can review the crawl and decide whether to delete or save it, depending on whether the crawl captured the desired content. Once saved, the crawl is counted against the annual data budget. If left unsaved for 60 days, the crawl will expire and be deleted automatically. If a test crawl captures too much or too little information, the scoping rules can be adjusted before beginning another test crawl.

When starting a new collection or adding new seeds to an existing collection, the library employs test crawls to determine whether the planned scoping rules and crawl parameters are adequate. Once a test crawl is complete, staff perform quality control to determine whether the test crawl produced the anticipated outcome. If so, the test crawl is saved and, for ongoing crawls, an automated crawl is scheduled to run at the desired frequency.

Quality control/quality assurance
When a crawl is complete, staff review the resulting crawl report and captured content to determine whether the crawl was successful. Due to the shifting landscape of web content and the diversity of intended outcomes, there is currently no standardised or recommended web archive quality control checklist based on best practices. UC San Diego developed a basic checklist of elements to review and measure the effectiveness of captures, which is added to each collection’s Trello card and used to track quality control progress. Occasionally, items are added to the standard quality control checklist if a collection has content that is difficult to capture. After a test crawl of a new URL or collection, staff review the capture to determine whether there was adequate time allowed for the crawl, whether the content is complete, and whether links function as expected. UC San Diego’s checklist includes:

• Did the crawl stop too soon due to an insufficient time limit? If so, what types of content were not captured?
• Does the capture look like the original pages?
• Is any expected content missing?
• Do links connect as expected?
• How much data was collected? Does this match the apparent size and complexity of the page or website?
• Which host websites or file types represent the largest proportion of captured data? Is this appropriate or expected?
• What types of documents were collected? Are these appropriate or expected?

When a crawl is not complete or content is not captured as expected, staff review the report and determine which scoping parameters to adjust, remove or add before beginning another test crawl. Changes are noted on the corresponding Trello card.
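Several of these checklist questions can be answered by summarising the crawl report, which Archive-It breaks down by seed, host and file type. The sketch below tallies a hypothetical list of captured documents by host and MIME type to show the kind of summary involved; the rows are invented examples:

    from collections import Counter

    # Hypothetical rows from a crawl report: (URL, bytes captured, MIME type).
    captured = [
        ("https://example.edu/news/", 120_000, "text/html"),
        ("https://example.edu/reports/budget.pdf", 4_800_000, "application/pdf"),
        ("https://cdn.example-video.com/clip.mp4", 95_000_000, "video/mp4"),
    ]

    bytes_by_host = Counter()
    bytes_by_type = Counter()
    for url, size, mime in captured:
        host = url.split("/")[2]
        bytes_by_host[host] += size
        bytes_by_type[mime] += size

    total = sum(size for _, size, _ in captured)
    print(f"Total captured: {total / 1e6:.1f} MB")
    for host, size in bytes_by_host.most_common():
        print(f"{host}: {100 * size / total:.0f}% of captured data")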
The crawl report details elements that may need attention. For example, one crawl regularly captured 17–20 GB of data weekly, yet never finished in the allotted time. On first review of the output, the captured pages displayed correctly, but many documents from the domain URL were still queued for capture. Initially, it appeared that the website simply included many large media files. Upon closer review, staff discovered that much of the captured data represented unrelated social media and external source content (New York Times, Wall Street Journal, etc). Additionally, Facebook and Twitter content linked from the original page had been captured in several languages, which was well beyond the intended scope of the crawl. In response, the scope was adjusted to limit social media capture for this collection. With these changes in place, the adjusted crawl began to capture between 500 MB and 800 MB weekly. However, while parameters for capturing social media links were changed, parameters for external content were not. As some external publisher content was expected in the crawl, refining the parameters to allow some, but not all, of this content would be challenging. In this instance, as the amount of data captured was significantly reduced, the decision was made to allow the extra external content rather than spending time to identify and exclude unwanted content.

Quality control on completed crawls is necessary to ensure proper collection. Following a quality control checklist and reviewing crawl reports to confirm that all expected content was collected can be time- and labour-intensive, but ensures that, if any content is missing, there is still an opportunity to recapture the missing information before the original website changes. The content captured early in the library’s web archiving project did not go through a rigorous quality control review. Later, it was discovered that many of the captures were only partially complete. As the captures were several years old, the websites in question had changed or disappeared, and the missing content could not be recovered.

Fortunately, for recurring captures, quality control becomes less burdensome over time. While initial crawls consist mostly or completely of new data, and therefore require careful and complete review, the amount of new data per crawl decreases as the static portions of a page are captured. Once a recurring crawl is established and very little new content is captured in each crawl, quality control or quality assurance can become less frequent, consisting of spot-checking the captured website instead of reviewing every page.


However, some degree of quality assurance is still necessary to ensure that sudden or drastic changes to a website (eg URL changes, website redesigns) do not affect crawl success. For this reason, the library found that spot-checking established crawls remained necessary to confirm that Archive-It was still collecting the appropriate content.

DESCRIBE
Once a web archive collection has been created and made publicly accessible, descriptive metadata help users to discover it. In the 2018 OCLC Research Perspectives report, ‘Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group’, the authors state that ‘the context within which a website or collection was archived is of vital importance to end users for understanding potential uses of the content’.13 The report identifies essential descriptive elements, defines them, and provides a crosswalk with other descriptive standards commonly used for archival collections. The library adds seed- and collection-level metadata to content on Archive-It to improve discoverability through the public user interface, which allows users to search on descriptive metadata. These metadata are also used in the local online catalogue.

ACCESS
The library currently provides access to web archive collections via Archive-It’s public user interface, as well as by including links to archived web content in its existing local resources. The library creates local catalogue records for each collection with subject headings and other pertinent tracings (see, for example: http://roger.ucsd.edu/record=b9626849~S9). Library cataloguers also add links to web archive collections in related finding aids. That is, if permission to archive a donor’s website is included in the donation, purchase or transfer of analogue materials to Special Collections and Archives, the archived site is included in the finding aid. Other web archive collections are included in associated finding aids with an EAD tag. Linking archived web collections within finding aids (see, for example: https://library.ucsd.edu/speccoll/findingaids/sac0001) helped establish a high-profile location for users to look for archived websites. Another potential access point for web archive collections could be links included in relevant subject and course-specific resources developed by campus librarians. The library will continue to evaluate access to and use of archived web content, and look for additional ways to bring such content to researchers’ attention.

CONCLUSION
Policy and practice offer a framework for structuring web archiving work, but should remain flexible to allow institutions to adapt to changes in content, technology and user needs. This perspective will continue to inform the library’s web archiving efforts and next steps. Progress has been made toward addressing all of the institutional concerns identified during work on the web archive collection development policy, but none of these considerations have been resolved fully. Instead, the library is using this initial period after the policy’s approval as an opportunity to evaluate how the current version of the policy works in practice, and determine changes that may be required over time. For example, the library’s existing collecting activity is within the 2019–2020 data budget, with room for new collecting. As outstripping the current budget is not an immediate problem, the library has time to observe the policy’s effects on new collecting, and evaluate whether the approach to data budget allocation outlined in the policy is reasonable and sustainable in practice.


This and other ‘wait and see’ elements of the policy represent opportunities to take incremental steps toward resolving institutional concerns, and ensure that the library is not without a policy while these issues are considered in more depth.

In addition to ongoing work to address institutional concerns and priorities, and to respond to changes in how the internet is used and understood, the library will explore ways to preserve captured content locally, to avoid reliance on backups outside institutional control. Currently, UC San Diego is wholly reliant on Archive-It to host and maintain captured content, but the library is considering other options, which could include depositing backups in Chronopolis, the institutional digital preservation repository. As the library moves toward maintaining local copies of its web archives, additional types of data will be captured and added to collections, including technical and administrative metadata, some of which may be captured automatically by the harvesting tool. Local preservation would not replace functions or services provided by Archive-It, but would provide a backup in case of emergency.
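As a rough sketch of what such a local backup step might involve (the file list, URL and checksum below are hypothetical, and no particular Archive-It API is assumed), captured WARC files could be downloaded and their fixity verified before deposit in a preservation repository such as Chronopolis:

    import hashlib
    import urllib.request

    # Hypothetical list of WARC files and their expected MD5 checksums,
    # as provided by the hosting service's download interface.
    warc_files = [
        ("https://example.org/downloads/COLLECTION-20190701.warc.gz",
         "9e107d9d372bb6826bd81d3542a419d6"),
    ]

    for url, expected_md5 in warc_files:
        filename = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, filename)   # download the WARC file locally

        md5 = hashlib.md5()
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)

        # Verify fixity before handing the file to the preservation repository.
        status = "ok" if md5.hexdigest() == expected_md5 else "CHECKSUM MISMATCH"
        print(filename, status)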
Another area for future work is ensuring that the library can commit sufficient time and resources to support both current and future web archiving activity. As the library’s web archiving activity continues to grow, there will be a need for increased staffing support, diverse harvesting tools to capture a variety of technologically challenging content, and innovative approaches to anticipate and respond to change. These represent substantial institutional commitments, and will require flexibility and creativity to ensure that they do not outstrip institutional capacity.

References
1. Lack, R. (2015) ‘Announcing a new partnership: California Digital Library, UC Libraries, and Internet Archive’s Archive-It Service’, available at: https://www.cdlib.org/cdlinfo/2015/01/14/announcing-a-new-partnership-california-digital-library-uc-libraries-and-internet-archives-archive-it-service/ (accessed 28th June, 2019).
2. NDSA Content Working Group (2012) ‘National Digital Stewardship Alliance Web Archiving Survey Report’, available at: http://www.digitalpreservation.gov/documents/ndsa_web_archiving_survey_report_2012.pdf (accessed 10th December, 2018).
3. Ibid.
4. Maches, T. and Stine, K. (2019) ‘Web Archiving CKG “Birds of a Feather”’, available at: https://wiki.library.ucsf.edu/display/UCLCKG/2019-05-21+UC+DLFx+Birds+of+a+Feather (accessed 28th June, 2019).
5. Lohndorf, J. (2019) ‘Known web archiving challenges’, available at: https://support.archive-it.org/hc/en-us/articles/209637043-Known-Web-Archiving-Challenges (accessed 5th August, 2019).
6. Praetzellis, M. (2018) ‘Avoid robots.txt exclusions’, available at: https://support.archive-it.org/hc/en-us/articles/208001096-How-to-avoid-robots-exclusions (accessed 5th August, 2019).
7. Ibid.
8. Ibid.
9. Farrell, M., McCain, E., Praetzellis, M., Thomas, G. and Walker, P. (2018) ‘Web Archiving in the United States: A 2017 Survey’, National Digital Stewardship Alliance, available at: https://doi.org/10.17605/OSF.IO/3QH6N (accessed 10th December, 2018).
10. Praetzellis, M. (2019) ‘Archiving YouTube videos’, available at: https://support.archive-it.org/hc/en-us/articles/208333753-Archiving-YouTube-videos (accessed 5th August, 2019).
11. Blumenthal, K.-R. (2019) ‘Controlling access to your web archives’, available at: https://support.archive-it.org/hc/en-us/articles/208334003-Controlling-access-to-your-web-archives- (accessed 20th July, 2019).
12. Blumenthal, K.-R. (2019) ‘Partner guide to downloading Archive-It data’, available at: https://support.archive-it.org/hc/en-us/articles/209643793-Partner-Guide-to-Downloading-Archive-It-Data (accessed 20th July, 2019).
13. Dooley, J. and Bowers, K. (2018) ‘Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group’, available at: https://doi.org/10.25333/C3005C (accessed 9th August, 2019).
