Bots, Seeds and People: Web Archives as Infrastructure

Ed Summers, University of Maryland, [email protected]
Ricardo Punzalan, University of Maryland, [email protected]

ABSTRACT
The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries old theories of archival appraisal are being transplanted into the sociotechnical environment of the World Wide Web with varying degrees of success. The work of the archivist and bots in contact with the material of the web presents a distinctive and understudied CSCW shaped problem. To investigate this space we conducted semi-structured interviews with archivists and technologists who were directly involved in the selection of content from the web for archives. These semi-structured interviews identified thematic areas that inform the appraisal process in web archives, some of which are encoded in heuristics and algorithms. Making the infrastructure of web archives legible to the archivist, the automated agents and the future researcher is presented as a challenge to the CSCW and archival community.

ACM Classification Keywords
H.3.7. Digital Libraries: Systems issues; K.4.3. Organizational Impacts: Computer Supported Collaborative Work

Author Keywords
archive; web; collaboration; design; practice

INTRODUCTION
In 2008, Google estimated that it had 1 trillion unique URLs in its index [2]. More recently, in 2015, the Internet Archive's homepage announced that it has archived 438 billion web pages. A simple back of the envelope calculation indicates that roughly 44% of the web has been archived. But this estimate is overly generous. Of course the web has continued to grow in the 8 years since Google's announcement. The Internet Archive's count includes multiple snapshots of the same URL over time. Even Google does not know the full extent of the web, since much of it is either hidden behind search forms that need to be queried by humans, the so called deep web [54], or blocked from indexing by Google, the dark web. Consequently, the actual amount of the web that is archived is not readily available, but it is certain to be much, much less than 44%. Archives of web content matter, because hypertext links are known to break. Ceglowski [11] has estimated that about a quarter of all links break every 7 years. Even within highly curated regions of the web such as scholarly and legal publishing, rates of link rot can be up to 50% [69,79].

Failing to capture everything should not be surprising to the experienced archivist. Over the years, archival scholars have argued that gaps and silences in the archival record are inevitable. This is partly because we do not have the storage capacity, the manpower, or the resources required to keep everything. Thus, archivists necessarily select representative samples and identify unique, irreplaceable, and culturally valuable records. We often assume that archivists abide by a clear set of appraisal principles in their selection decisions. In practice, selection is a highly subjective process that reflects the values and aspirations of a privileged few. More often than not, archival acquisition happens opportunistically, without adherence to a planned or comprehensive collecting framework. The central challenge facing the archival community is to better understand our predisposition to privilege dominant cultures, which results in gaps in society's archives. As Lyons [53] recently argued:

    If we have any digital dark age, it will manifest, as has been the case in the past with other forms of information, as a silence within the archive, as a series of gaping holes where groups of individuals and communities are absent because there was no path into the archive for them, where important social events go undocumented because we were not prepared to act quickly enough, and where new modalities for communication are not planned for. The digital dark age will only happen if we, as communities of archives and archivists, do not reimagine appraisal and selection in light of the historical gaps revealed in collections today.

Unlike more traditional archival records, the web is a constantly changing information space. For example, the New York Times homepage, which is uniquely identified by the URL http://www.nytimes.com, can change many times during any given day. In addition, increased personalization of web content means that there is often not one canonical version of a particular web document: what one user sees on a given website can vary greatly compared to what another individual sees. For instance, what one sees when visiting a particular profile on Facebook can be quite different from what another person will see, depending on whether they are both logged in, and part of a particular network of friends. Even when collecting web content for a particular institution, such as an academic community, it can be difficult to discover and delimit relevant regions of content to collect.

Given its vastness, volume of content, and the nature of online media, capturing and archiving the web relies on digital tools. These archiving tools typically require archivists to supply seed lists: lists of website URLs that are deemed important to capture. These lists are essentially a series of starting points for a web crawler to begin collecting material. The lists are managed by web archiving platforms, which then deploy web crawlers, or bots, that start at a seed URL and begin to wander outwards into the web by following hyperlinks. Along with the seeds, archivists also supply scopes to these systems, which define how far to crawl outwards from a seed URL, since the limits of a given website can extend outwards into the larger space of the web and it is understandably desirable for the bot not to try to archive the entire web. Different web archiving systems embody different sets of algorithms, and as platforms they offer varying degrees of insight into their internal operation. In some ways this increasing reliance on algorithmic systems represents a relinquishing of archival privilege to automated agents and processes that are responsible for the mundane activity of fetching and storing content. The collaborative moment in which the archivist and the archival software platform and agents work together is understudied and of great significance.
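To make the seed-and-scope mechanism described above concrete, the sketch below shows a minimal crawler driven by a seed list and a single scope rule. It is an illustration only, not the code of any particular web archiving platform; the seed URL, scope prefix, and depth limit are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def in_scope(url, scope_prefix):
    """A very simple scope rule: stay under one URL prefix."""
    return url.startswith(scope_prefix)

def crawl(seeds, scope_prefix, max_depth=2):
    """Breadth-first crawl outward from the seed list, limited by scope and depth."""
    queue = deque((seed, 0) for seed in seeds)
    seen = set(seeds)
    archived = {}                      # url -> page bytes (stand-in for WARC storage)
    while queue:
        url, depth = queue.popleft()
        try:
            page = urlopen(url, timeout=10).read()
        except OSError:
            continue                   # unreachable content is simply skipped
        archived[url] = page
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(page.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urljoin(url, link)
            if in_scope(absolute, scope_prefix) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return archived

# Hypothetical seed list and scope, as an archivist might supply them.
pages = crawl(["https://example.edu/"], scope_prefix="https://example.edu/", max_depth=1)
```

Production crawlers such as Heritrix add politeness delays, robots.txt handling, deduplication and WARC serialization, but the basic division of labor is the same: the archivist supplies seeds and scope, and the bot does the walking.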
The web is an immensely large information space, even within the bounds of a given organization, so given a topic or content area it is often difficult to even know what website URLs are available, let alone whether they are relevant or not. The seed list functions as an interface between the web, the archivist, archival systems, and the researcher. The seed list also offers material evidence of the interactions between human and automated agents, and makes the sociotechnical construction of the web archive legible. It is in this context that we ask the question: how do technologies for web crawling and harvesting–seed lists and scopes–figure in archivists' appraisal decisions? To put it simply, who archives what of the web, and how do they do it?

In this study we focus on answering the research question of how archivists interact with web archiving systems, and collaborate with automated agents, when deciding what to collect from the web. A better understanding of this collaborative moment can serve as a foundation for theories about how archivists are deciding what to collect, which can then inform the design of web archiving technologies.
LITERATURE REVIEW
The web is very much a part of contemporary life. Archived pieces of the web will provide essential evidence in the construction of historical knowledge of our time. Various efforts currently exist to capture the web at the local, national and international as well as institutional or community levels [50,62,78]. Studies of these efforts center around more efficient methods for capturing the dynamic web and increasing discoverability of the archived web by examining tools, policies, and metadata [19,38,67].

We have seen reports that point to the limitations of large web archiving projects. The Wayback Machine, for instance, cannot capture all information online and still misses local content, government documents, and database-backed websites [74]. Despite its promise to provide "universal access to all online knowledge" [46], cultural heritage institutions cannot in good faith rely solely on the Internet Archive to do all the work. Hence, some have advocated for libraries and archives to take on the responsibility of web archiving in order to capture and preserve more content. However, the dynamics and mechanics of the decision-making process over what ends up being archived have not been studied in much depth.

Deciding what to keep, and what gets to be labeled archival, has long been a topic of discussion in archival science. Over the past two centuries archivists have developed a body of research literature around the concept of appraisal, which is the practice of identifying and retaining records of enduring value. During that time there has been substantial debate between two leading archival appraisal thinkers, Hilary Jenkinson [44] and Theodore Schellenberg [70], about the role of the archivist in appraisal: whether in fact it is the role of the archivist to decide what records are and are not selected for preservation [34,77]. The rapid increase in the amount of records being generated that began in the mid-20th century led to the inevitable (and now obvious) realization that it is impractical and perhaps impossible to preserve the complete documentary record. Appraisal decisions must necessarily shape the archive over time, and by extension our knowledge of the past [5,15]. It is in the particular contingencies of the historical moment that the archive is created, sustained and used [8,35,68].

Web Archives
The emergence, deployment, and adoption of the World Wide Web since its introduction in 1989 has accelerated the growth of the documentary record even further. The archival community is still in the process of adapting to this material transformation, and understanding how it impacts appraisal processes and theories.

The work of archiving the Internet has been underway since 1996 at the Internet Archive. Since 2003, the Internet Archive's automated bots have walked from link to link on the web, archiving what they can along the way [56]. Also starting in 2003, organizations belonging to the International Internet Preservation Consortium (IIPC) began building their own collections of web content. These collections can either be country domains, like the .uk top-level domain collected by the British Library [29,63], or specific websites that have been selected according to a collection development policy. There has been some attempt at measuring the number and size of web archives [30]. In addition, work has gone into investigating the contributions of automated web archiving agents in the Internet Archive [51,52]. There has also been some analysis of how collection development policies at major national archives enact appraisal [60], as well as general overviews of archival processes in web archives [61]. Indeed, there is no shortage of research material about the need for web archiving, and technical approaches for achieving it. However, the practice of appraisal, or how web content is selected for an archive, which is being performed by a growing number of actors, is currently understudied.

Surveys of the web archiving community have been conducted by the International Internet Preservation Consortium (IIPC) [55] and more recently by the National Digital Stewardship Alliance (NDSA) in 2011 [58] and 2013 [4]. The decade old perspective of the 2004 IIPC report is now largely anachronistic, but it still provides a useful historical snapshot of web archiving activity at the time, particularly with regards to the tools used for web archiving. While the fundamental architecture of the web has been relatively constant, its core standards (HTTP, HTML, CSS, JavaScript) have been in constant change since then. The landscape of the web has been profoundly altered by social media, which has allowed millions of individuals around the world to participate as content creators and to entangle their lived experience with the infrastructure of the web [18]. Consequently, web archiving tools and the skills of archivists have required near continuous adaptation in order to collect and provide access to archived content.

The 2011 and 2013 National Digital Stewardship Alliance (NDSA) surveys are marked by an attention to the deployment of web archiving technology. These surveys contribute significantly to an understanding of how organizations manage, fund and train web archiving teams. It is notable that the 2013 survey contains the first questions about the actual content being collected, in terms of its type: social media, databases, video, audio, blogs, art. How these categories were determined is not clear. Both the 2011 and 2013 surveys have questions about inter-institutional, collaborative web archiving, which indicate a growing interest and engagement with cooperative approaches. However, specific details about how this cooperation is enacted and achieved are not detailed.

This lack of detail should not be seen as a failing but rather as a direct result of the survey method itself, which aims to representatively sample a population in order to draw statistical inferences and insights about the larger population of web archiving efforts. These surveys have been extremely useful for gaining insight into the activity of a growing international community of web archivists. Their necessarily broad focus on trends has rationalized and demarcated the emerging field of web archiving. However, this broad focus has only provided a very high level view of how particular web content is selected for archiving. In order to better design systems to aid in web archiving, a thicker description of the day to day work of the web archivist who appraises content is needed [13,23].

The beginnings of such an examination can be found in a series of blog post interviews with recommending officers at the Library of Congress, who selected websites for archiving [3,32,33,59]. These short vignettes offer a glimpse into how the work of selecting and appraising web content is achieved in the context of a particular organization. Using a series of semi-structured interviews, Dougherty and Meyer [19] closely examined the state of practices in web archives with a particular focus on the needs of researchers, and the gaps between research and archival processes. In many ways our study takes a compatible approach, but instead of identifying "experts" in the field to interview we focused on practitioners who were involved in the day to day selection of web content for archives.

Algorithms and Infrastructure
Thus far, our review has focused on the literatures of archival science and digital libraries, which are essential to positioning our study. While the study of digital library use and digital libraries as cyberinfrastructure are no stranger to the CSCW community, there has been little attention to the specific practice of appraisal in web archives. Yet we believe there are significant strands of work from CSCW that have much to offer in understanding how web archives are constructed–particularly when they are considered not just as collections of documents, but as sociotechnical systems in which archivists collaborate with automated agents.

Key among these strands of CSCW work is the study of algorithms and their role in the construction of bots, platforms, and infrastructures. Algorithms are increasingly being studied in CSCW contexts that consider not only the specific workings of the code itself, but also how the algorithms are components of larger sociotechnical assemblages. Much of the mundane crawling and downloading of web content in archival systems is necessarily performed by automated agents, or crawlers, that are directed in varying degrees by people's interactions with web archiving software. Much relevant work has been done by Geiger on the use of bots in Wikipedia to monitor spam and defacement, with specific attention to the conditions in which code is run [24]. Geiger has also looked at collaborative systems for thwarting online harassment in social media [25], which is essentially concerned with content curation on the web, and closely aligned with archival theory.

The study of algorithms and their social effects is a rapidly growing area of research which offers multiple modes of analysis for the study of web archives [26]. Kitchin provides a review of this literature while presenting a framework of methodological approaches for the study of algorithms [48]. He stresses why thinking critically about algorithms is so important, which is especially relevant for thinking about the socio-political dimension of web archives:

    Just as algorithms are not neutral, impartial expressions of knowledge, their work is not impassive and apolitical. Algorithms search, collate, sort, categorise, group, match, analyse, profile, model, simulate, visualize and regulate people, processes and places. They shape how we understand the world and they do work in and make the world through their execution as software, with profound consequences.

Useful work has also been done examining how users come to understand the algorithms they interact with: for example, how users come to understand what is visible, and what is invisible, in their social media feeds [22]. Seaver [71] highlights how important cultural understandings of algorithms are. These various examinations of the algorithm are highly relevant to analyzing the ways in which archivists use computational tools while doing appraisal work.

Zooming out a level from algorithms, web archiving tools and systems themselves can also be understood as software platforms or socio-political assemblages [27,28]. Emerging CSCW work that examines the interactions of policy and governance in the formulation of software platforms is directly relevant to understanding how web archiving technologies are situated with respect to intellectual property, privacy, security and organizational collection development policies [12,39,42].

Finally, zooming out even further, we arrive at the perspective of infrastructure. The CSCW community has had a long and sustained engagement with research into cyberinfrastructure and infrastructure studies [47,57,64,65]. Since its inception in the mid 1990s, the web's emergence as a critical infrastructure has made it an ever present topic for CSCW work [6], especially with regards to collaborative systems in the sciences. However, little attention has been paid to the specific details of how content from the web is archived: who performs these preservation actions, and what tools and collaborative systems they use to do it. The CSCW perspective has much to offer, since recognizing and working with breakdowns in web architecture is the essence of archival work, which fits neatly into a lineage of infrastructure studies beginning with Star [72] through to the present day work focused on repair and its impact on design [37,41,66]. There has been some engagement with sociotechnical theory from the field of archival science [1,9]; however, like CSCW, it has largely been concerned with the study of scientific data usage, and not specifically with web infrastructure in general.

This review has covered a great deal of ground in order to make a case for the study of web archives as sociotechnical systems. In order to better design and evolve appraisal tools for web archives, a much richer understanding of how web content is currently selected by actual practitioners is needed. We believe that CSCW perspectives, in particular tried and true ethnographic and practice oriented methodologies, have a great deal to offer the study of web archives [45,49,73]. While this theoretical positioning informs our work, this study is specifically focused on how archivists interact with web archiving systems as they select material, to gain insight into how content is selected for preservation. We know that lists of URLs, or seed lists, are created, since web archiving technologies require them in order to function. But how URLs end up on these lists is not well understood. These seed lists are singular artifacts of the intent to archive, which makes them valuable excavation sites for a deepened understanding of the day to day process of appraisal in web archives.

METHODOLOGY
To answer our research question we conducted a series of semi-structured interviews with a carefully selected group of individuals involved in the selection of web content, to explore and excavate these seed lists as sites of appraisal practice. Rather than providing a statistically representative or generalizable picture, the goal was to evoke a thick description of how practitioners enact appraisal in their particular work environments. Semi-structured interviews were specifically chosen to add an individualized, humanistic dimension to the existing body of survey material about the construction of web archives. Archival appraisal is a socially constructed activity that necessarily involves the individual archivist, the organization in which they work, and the broader society and culture in which they live. Consequently the interviews did not serve as windows onto the appraisal process so much as they provided insight into how archivists talk about their work with web archives, specifically with regards to their selection of web content for archiving or long-term preservation [36].

We selected our interview subjects using purposeful sampling that primarily followed a pattern of stratified sampling, where both typical cases and extreme cases were selected. Typical cases included self-identified archivists involved in traditional web archiving projects at libraries, archives and museums, many of whom were on the list of attendees at the Web Archives conference in Ann Arbor, Michigan on November 12-13, 2015. The study also involved participation from extreme or deviant cases: individuals who do not necessarily identify as archivists but are nevertheless involved in web archiving, such as researchers, local government employees, volunteers, social activists, and business entrepreneurs. To avoid oversampling from the users of Archive-It (currently the leading provider of web archiving services in the United States), we also recruited customers of other web archiving service providers, such as Hanzo and ArchiveSocial, and members of the ArchiveTeam community. The organization types identified in the NDSA survey results [4] provided a good basis for sampling and recruitment. However, our own personal familiarity with the small but growing field of web archiving also informed the development of our participant list.

In some instances we relied on snowball sampling to recruit interview participants. There were occasions when the interview subject was not involved directly in the selection of content for their web archives. In those cases, we asked if they would refer someone who was more involved in the actual selection process. Other names were often mentioned during the interview, and if we felt those individuals could add a useful dimension to the interview we asked for their contact information.

The study recruited 39 individuals (21 female, 18 male), 27 of whom (13 female, 14 male) agreed to be interviewed. A table summarizing the organization types, occupations, and roles for the interview subjects is included below. It also includes a designation of whether they were considered extreme or deviant cases (participants who do not identify themselves as archivists but are involved in web archiving duties). The tables illustrate how our study explicitly focuses on archivists involved in the selection of web content in a university setting. Deviant cases such as the role of researcher or developer provide an opening for future work in this area.

Organization      Deviant   n=27
University        N         19
Non-Profit        Y         3
Museum            N         2
Government        N         1
Public Library    N         1
Activist          Y         1

Occupation        Deviant   n=27
Archivist         N         17
Manager           N         7
Developer         Y         2
Professor         Y         1

Role              Deviant   n=27
Selector          N         17
Technician        N         6
Developer         Y         3
Researcher        Y         1

Each interview lasted approximately an hour and was allowed to develop organically as a conversation. The interview protocol guided the conversation, and provided a set of questions to return to when necessary. This protocol was particularly useful for getting each interview started: describing the purpose of the study and the reason for contacting them. The interview subjects were then asked to describe their work in web archives, and to tell their own personal story of how they had come to that work. After this general introduction and discussion, the conversation developed by asking follow-on questions about their work and history. The ensuing conversation normally touched on the interview questions from the protocol in the process of inquiring about their particular work practices and experiences. Towards the end of the interview, the interview protocol was also useful in identifying any areas that had not been covered.

Interviews were conducted via Skype and recorded. Each participant provided informed consent via email. Participants were located in places all across the United States, so in-person interviews were impractical. Because of the nature of the study the risks to participants were low; all interviews were kept confidential, and all recordings and transcripts would be destroyed after the completion of the study. Consequently, we use pseudonyms to refer to our respondents, and the names of their respective organizations have been obscured.

While this study was not conducted in the field over an extended period of time, it was deeply informed by ethnographic practices of memoing and fieldnote taking. These techniques were selected to document the conversation itself but also to reflect on our involvement and participation in the web archiving community [21]. During the remote interviews, memos or jottings were essential for noting particular moments or insights during the interview. In some cases, these jottings were useful in highlighting points of interest during the interview itself. Immediately after each interview, these jottings prompted more reflective fieldnotes that described notable things that came up in the interviews. Particular attention was paid to themes that recurred from previous interviews, and to new phenomena. As the interviews proceeded, a file of general reflections helped determine recurring themes and scenarios as well as unusual cases that were encountered.

The process of inductive thematic analysis performed in this study relied on the use of field notes and personal memos [10]. The analysis began by reading all the field notes together, and then returning to do line by line coding. While coding was done without reference to an explicit theoretical framework, it was guided by our own interest in appraisal theory as a sociotechnical system that entangles the archivist with the material of the web and automated agents. Interviewee responses that specifically mentioned the selection of particular web content, and the tools and collaborations they used to enact that selection, were followed up on and explored through open discussion. This analysis yielded a set of themes that will now be described.

FINDINGS
Our study reveals that web archiving involves a variety of technical and resource constraints that go beyond what is normally considered in archival appraisal theory. Archival scholars typically characterize archival selection as a process whereby human actors (archivists) primarily follow prescribed sets of rules (institutional policies and professional expectations) to accomplish the task of appraisal and selection [7,15–17,20,31,76]. This traditional notion does not adequately describe how selection occurs in the web archiving context. Instead, we found that automated agents often serve as collaborators that act in concert with the archivist. Indeed, these agents themselves are often the embodiment of rules or heuristics for appraisal. In this section, we report how crawl modalities, information structures, and tools play a significant role in selection decisions. We also highlight how resource constraints as well as moments of breakdown work to shape appraisal practice.

Crawl Modalities
While often guided by archivists in some fashion, the work of archiving is mostly achieved through a partnership with automated agents (bots) that do the monotonous, mostly hidden work of collecting web pages. This work includes fetching the individual HTML for web pages, and then fetching any referenced CSS, JavaScript, image or video files that are required for the page to render. Once a given page has been archived, the bot then analyzes the resource for links out into the web, and decides whether to follow those links to archive them as well. This process continues recursively until the bot is unable to identify new content that the archivist has selected, is told to stop, or terminates because of an unforeseen error. Participants reflected on this process by talking about the paths that they took, often with their automated agents, through the web in different ways, or modalities: domain crawls, website crawls, topical crawls, event based crawls, and document crawls.

In domain crawls, a particular DNS name was identified, such as example.edu, and the crawler was instructed to fetch all the web content at that domain. These instructions often included scoping rules that allowed the crawler to pull in embedded content from another domain, such as video content from YouTube, or that excluded portions of the domain, such as very large data repositories or so-called browser traps, such as calendars that create an infinite space for the crawler to get lost in.

Website crawls are similar in principle to domain crawls, but rather than collecting all the content at a particular domain they are focused on content from a specific DNS host such as www.example.edu, or even a portion of the content made available by that webserver, such as http://www.example.edu/website/.

In topic based crawls the archivist was interested in collecting web material in a specific topical area, such as "fracking" or a particular "classical music composer". In order to do this type of crawl the archivist must first identify the domains or websites for that topic area. Once a website or domain has been identified, the archivist is then able to instruct the crawler to collect that material.

Event based crawls are similar to topic based crawls, but rather than being oriented around a particular topic they are concerned with an event anchored at a particular time and place, such as the Deepwater Horizon Oil Spill in the Gulf of Mexico. Just as in topic based crawls, host names or website URLs must be identified first before an automated agent can be given instructions about what content to collect. With event based crawls, crawling tends to extend over a particular period of time in which the event is unfolding.

In document crawls the archivist has a known web resource that they want to collect and add to their archive. This may require an automated agent of some kind, such as Webrecorder, but it could also be a more manual process where an archivist collects a PDF of a report from a website and individually deposits it into their archive.

These crawl modalities were often used together, in multiple directions, by human and non-human agents either working together or separately. For example, in a topic based crawl for fracking related material one archivist engaged in a discovery process of searching the web using Google and then following links laterally outwards onto social media sites and blogs. Once a set of URLs was acquired they were assembled into a seed list and given to the Archive-It service to collect.

Similarly, when an archivist instructed Archive-It to perform a domain crawl for a large art museum, the resulting dataset was deemed too large and incomplete. A proliferation of subdomains, and a multiplicity of content management systems, made it difficult to determine the completeness of the crawl. In this case the archivist used Archive-It's crawl reports, as well as searching and browsing the website, to build a list of particular sub-websites within the domain that were desirable to archive. Many of these sub-websites were in fact different computer systems underneath, with their own technical challenges for the web archiving bot. The larger problem of archiving the entire domain was made more feasible by focusing on the websites discovered during the failed domain crawl. This list of websites was then given to Archive-It in the form of a seed list.
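One way to think about these modalities is that each of them reduces, in the end, to a particular combination of seeds and scope rules handed to a crawler. The sketch below expresses that idea as data; the URLs, patterns and field names are hypothetical and are not drawn from Archive-It or any other particular service.

```python
import re
from dataclasses import dataclass, field

@dataclass
class CrawlConfig:
    """A crawl modality expressed as seeds plus scope rules."""
    seeds: list
    allow: list = field(default_factory=list)    # regexes a URL must match to stay in scope
    exclude: list = field(default_factory=list)  # regexes that knock a URL out of scope

    def in_scope(self, url: str) -> bool:
        if any(re.search(pattern, url) for pattern in self.exclude):
            return False
        return any(re.search(pattern, url) for pattern in self.allow)

# Domain crawl: everything under example.edu, plus embedded YouTube video,
# minus a calendar that would trap the crawler in an infinite space.
domain_crawl = CrawlConfig(
    seeds=["https://example.edu/"],
    allow=[r"^https?://([^/]+\.)?example\.edu/", r"^https?://www\.youtube\.com/embed/"],
    exclude=[r"/calendar/"],
)

# Website crawl: a single host, or even a single directory on that host.
website_crawl = CrawlConfig(
    seeds=["http://www.example.edu/website/"],
    allow=[r"^http://www\.example\.edu/website/"],
)

# Topic or event crawl: the scope is simply the list of sites the archivist
# discovered for the topic; discovery itself happens before the crawler runs.
topic_crawl = CrawlConfig(
    seeds=["https://fracking-news.example.org/", "https://watershed-group.example.org/"],
    allow=[r"example\.org/"],
)

assert domain_crawl.in_scope("https://www.example.edu/physics/")
assert not domain_crawl.in_scope("https://www.example.edu/calendar/2015/11/")
```

A document crawl is the degenerate case: a single seed with no outward scope at all.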
Information Structures
In addition to the types of crawling being performed, the activities of archivists and automated agents were informed by information structures on and off the web. Primary among the structures encountered in the interviews were hierarchies, networks, streams and lists. Hierarchies of information were mentioned many times, though not by name, especially when an archivist was engaged with collecting the web content of a particular organization: for example, the web content from a particular university or government agency. This process often involved the use of an organizational chart or a directory listing the components and subcomponents of the entity in question. One participant talked about how they used their university's A-Z listing of departments as a way to build a list of seeds to give to Archive-It. In another example, a government documents librarian used the organizational chart of San Mateo local government to locate web properties that were in need of archiving.

Not all web archiving projects are fortunate enough to have an explicit hierarchical map. Many appraisal activities involve interacting with and discovering networks of resources that extend and cut across organizational and individual boundaries. For example, when Vanessa was archiving web content related to the Occupy social movement she saw her organization's interest in collecting this content fold into her own participation as an activist. This enfolding of interest and participation was evident in the network structure of the web, where her personal social media connections connected her to potential donors of web content.

    It was part of that same sort of ecosystem of networks. It became clear to me through that process how important that network is becoming in collecting social movements moving forward. It was interesting watching people who had been doing collecting for decades in activist networks that they were a part of, and then these new activist networks. . . there wasn't a whole lot of overlap between them, and where there was overlap there was often tension. Unions really wanted in on Occupy and young people were a little bit wary of that. So social media networks became really important.

In another example, a network of vendor supplied art agents supplied a museum with gallery catalogs, which were then used to identify gallery and artist websites of research value. Physical networks of agents, artists and galleries undergirded the networks of discovered websites. Indeed, this particular museum used multiple vendors to perform this activity.

Another information structure that participants described as part of their appraisal process was "information streams." Information streams are content flows on the web that can be tapped into and used for the selection of content for a web archive. For example, Roger, who worked for a non-profit web archive, described how they developed a piece of software that used a sample of the Twitter firehose to identify web resources that are being discussed. Roger also described how edit activity on Wikipedia involving the addition of external links was used to identify web resources in need of archiving.

Nelson, who worked as a software developer for another volunteer organization, described how he used RSS feeds to identify new news content that was in need of archiving. More traditional streams of content, in the form of mailing lists and local radio, pushed content to several archivists. These streams were analyzed for references to people, organizations and documents to seek out on the web.
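A stream like an RSS feed can be turned into nominations with very little code. The sketch below polls a feed and prints candidate URLs for an archivist (or a bot) to review; it is a simplified illustration, and the feed URL is hypothetical rather than one mentioned by our participants.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def candidate_seeds(feed_url):
    """Yield (title, link) pairs from an RSS 2.0 feed as nomination candidates."""
    with urlopen(feed_url, timeout=10) as response:
        tree = ET.parse(response)
    for item in tree.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link:
            yield title.strip(), link.strip()

if __name__ == "__main__":
    # Hypothetical local news feed an archivist might monitor.
    for title, link in candidate_seeds("https://news.example.org/feed.rss"):
        print(f"{link}\t{title}")
```

In practice the output would be reviewed, deduplicated against what has already been collected, and appended to a seed list or nomination spreadsheet rather than printed.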
While they are a bit more abstract, participants also described interacting with lists of information. The most common example of this was the lists of URLs in the host reports from the Archive-It service, which allowed archivists to review what host names were and were not present in their capture of web content. For example, Dorothy, who was collecting her university's domain:

    I definitely remember there was a lot of trial and error. Because there's kind of two parts. One of them is blocking all those extraneous URLs, and there were also a lot of URLs that are on the example.edu domain that are basically junk. Like when sometimes Archive-It hits a calendar it goes into an infinite loop trying to grab all the pages from the calendar. So what I would typically do is look at the list of the URLs. Once you've done a crawl, a real crawl, or a test crawl that doesn't actually capture any data, there's this report that has a list of hosts, for example facebook.com, twitter.com and then next to that there's a column called URLs and if you click the link you get a file, or if it's small enough a web page that lists all the URLs on that domain. So one thing that I would try to do is visually inspect the list and notice if there's a lot of junk URLs.

The question of what is and what is not junk is the central question facing the archivist when they attempt to archive the web. The reports that Archive-It provides at the host name level are an indicator of whether the crawl is missing things or including things that it should not. Scanning lists of host names and URLs happened iteratively as multiple crawls were performed.

When considering how participants talked about these hierarchies, networks, streams and lists of information, it became clear that they were traversing these structures themselves using their browsers, as well as instructing and helping the archival bots do the same. The domain knowledge of the archivist was a necessary component in this activity, as was the ability of the bot to rapidly perform and report on highly repetitive tasks.
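The kind of visual inspection Dorothy describes can be partially delegated back to software. As a sketch, the code below scans a host report, here assumed to be a simple CSV with host and url_count columns (real crawl reports have their own formats), and flags hosts that look like calendar traps or out-of-scope junk.

```python
import csv

SUSPECT_WORDS = ("calendar", "login", "search", "share")  # heuristics, not rules

def flag_hosts(report_path, scope_suffix="example.edu", url_threshold=10000):
    """Read a host report CSV and flag hosts worth a closer look."""
    flagged = []
    with open(report_path, newline="") as handle:
        for row in csv.DictReader(handle):
            host = row["host"]
            count = int(row["url_count"])
            out_of_scope = not host.endswith(scope_suffix)
            suspicious = count > url_threshold or any(w in host for w in SUSPECT_WORDS)
            if out_of_scope or suspicious:
                flagged.append((host, count, "out of scope" if out_of_scope else "suspicious"))
    return flagged

for host, count, reason in flag_hosts("host-report.csv"):
    print(f"{host}\t{count}\t{reason}")
```

The point is not to automate the judgment, which remains appraisal work, but to shorten the list a person has to look at.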

Time and Money
Another thematic feature that emerged from the fieldnotes around the interviews was the material constraint of time and money in the human-machine collaboration of web archiving. Time and money are combined here because of the way they abstract, commensurate and make appraisal practices legible.

Many web archiving projects cited the importance of grant money in establishing web archiving programs. These grants were often focused on building technical capacity for web archiving, which itself is not directly tied to the appraisal process. However, it is clear that the technical ability to archive web content is a key ingredient in performing it. Grant money was also used to archive particular types of web content. For example, one university used grant money to archive music related web content, and another university received a grant to focus on state government resources.

The most common way that money was talked about by participants was in subscription fees for web archiving services. Archive-It is a web archiving service where organizations pay an annual subscription fee to archive web content. The primary metric of payment is the amount of data collected in a given year. Interviewees often mentioned that their ability to crawl content was informed by their storage budget. In one example, an archivist set the scoping rules for a full domain crawl of her university such that software version control systems were ignored, because of the impact they were having on the storage allocation. Dorothy, who was a user of the ArchiveSocial service, needed to reduce the number of local government social media accounts that were being archived because her subscription only allowed a certain number of accounts to be collected.

Time manifested in the appraisal of web content at human and machine scales. In one common pattern, archivists set aside time every week, be it a day or a few hours, for work on the discovery of web content. In one case, Wendy set aside time to read filtered emails about local news stories. In Lisa's case, she set aside a meeting time every week for her acquisition team to get together and review potential websites for archiving by inspecting them together on a large screen monitor.

Time was also evident in the functioning of automated agents, because their activity was often constrained and parameterized by time. For example, archivists talked about running test crawls in Archive-It for no longer than 3 days. Dorothy talked about the information being gathered in near real time from the social media accounts that Archive-It was monitoring:

    The archiving is by the minute. So if I post something, and then edit it in five minutes then it is archived again. If someone comments on something and then another person comments it is archived again. You don't miss anything. A lot of the other archiving companies that we've talked to say they archive a certain number of times a day: maybe they archive at noon, and at 5, and at midnight, and there's an opportunity to miss things that people deleted or hid.

In this case the software was always on, or at least appeared to be always on at human time scales. The web content itself also had a time dimension that affected appraisal decisions. For example, the perceived cumulativeness of a website was an indicator of whether, or how often, material was in need of archiving. Blogs, in particular, were given as examples of websites that might need to be crawled less often because of the way that they accumulate, rather than remove, content.

Another motivation for linking time and money in this way is because of how they entail each other. The time spent by archivists in discovery and evaluation of web content for archiving often has a monetary value in terms of salary or hourly wages.

Similarly, the amount of time spent crawling is often a function of the amount of data acquired, and the cost for storage.

People
One might assume that the work of appraising web archives necessarily involves archivists. However, our interview data made it clear that not all the people involved in appraisal called themselves archivists, and they often worked together with human and non-human agents in collaborative relationships that extended beyond the archives itself.

At one large university archives, a series of individuals were involved in the establishment of their web archives. Their effort extended over a 15-year period that started with Kate, who pioneered the initial work that ultimately led to a mainstreaming of web content into the archives. Multiple staff members were involved, including Jack and Deb, who were field archivists responsible for outreach into the university community and around the state. The field archivists selected web content, which was communicated to Phillip, another archivist, who managed their Archive-It subscription, performed crawls and did quality assurance. Jack and Deb actively sought out records in their communities by interviewing potential donors to determine what types of physical and electronic records were valuable.

John worked as a software developer for a volunteer organization that performed focused collecting of web content that was in danger of being removed from the web. He was a physics student who was interested in using his software development skills to help save at-risk web content. John collaborated with 20-30 other volunteers, one of whom was Jane, who worked at a large public web archive and was routinely contacted via email and social media when websites were in danger of disappearing.

Many interviewees reflected on their own participation in the activities and events that they were documenting. Recall Vanessa, who was working to collect web content related to the Occupy movement. She and her colleagues at the library worked to document the meetings and protests from within the movement itself. One of her colleagues worked on the minutes working group, which recorded and made available the proceedings of the meetings. In another case, two archivists at separate institutions were working together to document the use of fracking in their respective geographic areas. They worked together to partition the space as best they could by region, but many businesses and activist organizations worked across the regions.

While collaboration across organizational boundaries was evident, several participants noted that duplication of web content was not widely viewed as something to be avoided. Many commented that duplication was one way to ensure preservation, following "lots of copies keep stuff safe" (LOCKSS). Local copies of resources that are available elsewhere can be of benefit when using the data:

    If I can't get a copy it doesn't exist in the same way. I think that there is still a lot to being able to locally curate and manage collections and the fact that it's over in another space limits, or puts some limits on the things that can be done with the data now and in the future. Sure right now I've got a great relationship with a guy that knows how to get the stuff. But what happens in five years when those relationships end? How do our students and researchers get access to the data then?

In addition, the locus of web archiving work shifted within organizations from one department to another as key individuals left the library, and as web content was migrated from one system to another. This turbulence was common, especially in the use of fellowships and other temporary positions.

Tools
We have already discussed some tools of the trade that archivists use for collecting websites: the Internet Archive, Archive-It, ArchiveSocial and Hanzo are notable ones that came up during the interviews. These tools are really more like services, or assemblages of individual tools and people interacting in complex and multilayered ways. An investigation of each of these services could be a research study in itself. These tools largely require intervention by a person who guides the tool to archive a particular website, or set of web resources, using a seed list or the equivalent. Rather than dig into the particular systems themselves, it is useful to attend to the ways in which tools were used to fill in the gaps between these platforms and their users.

Consider the ways in which spreadsheets were used almost ubiquitously by interviewees. These spreadsheets were occasionally used by individuals in relative isolation, but were most often used to collaboratively collect potential websites that were of interest. Google Sheets in particular allowed individuals to share lists of URLs and information about the websites. Archivists would share read-only or edit-level permissions for their spreadsheets to let each other know what was being collected. These spreadsheets were later transferred into a web archiving service like Archive-It as seed lists. In the process, much of the additional information, or provenance metadata, concerned with the selection of a website was lost in translation.
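One small design response to that loss is to carry the provenance columns along when a nomination spreadsheet is flattened into a seed list. The sketch below reads a CSV exported from a shared spreadsheet and writes both a bare seed list and a provenance sidecar file; the column names (url, nominated_by, reason, date) are hypothetical, not those of any participant's spreadsheet or of Archive-It.

```python
import csv
import json

def export_seeds(nominations_csv, seeds_txt, provenance_json):
    """Split a nominations spreadsheet into a seed list and a provenance record."""
    seeds, provenance = [], []
    with open(nominations_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            url = row["url"].strip()
            if not url:
                continue
            seeds.append(url)
            provenance.append({
                "url": url,
                "nominated_by": row.get("nominated_by", ""),
                "reason": row.get("reason", ""),
                "date": row.get("date", ""),
            })
    with open(seeds_txt, "w") as out:
        out.write("\n".join(seeds) + "\n")       # one URL per line, ready for a crawler
    with open(provenance_json, "w") as out:
        json.dump(provenance, out, indent=2)     # the "why" that usually disappears

export_seeds("nominations.csv", "seeds.txt", "provenance.json")
```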
dividuals to share lists of URLs and information about the websites. Archivists would share read-only or edit level per- Many interviewees reflected on their own participation in missions for their spreadsheets to let each other know what the activities and events that they were documenting. Re- was being collected. These spreadsheets were later trans- call Vanessa who was working to collect web content related ferred into a web archiving service like Archive-It as seed to the Occupy movement. She and her colleagues at the li- lists. In the process much of the additional information, or brary worked to document the meetings and protests from provenance metadata concerned with the selection of a web- within the movement itself. One of her colleagues worked on site was lost in translation. the minutes working group which recorded and made avail- Often times web forms of various kinds were used as front able the proceedings of the meetings. In another case two ends on these spreadsheets. These forms mediated access to archivists and separate institutions were working together to the spreadsheets and provided additional guidance on what document the use of fracking in their respective geographic sorts of data were required for nominating a web resource for areas. They worked together to partition the space as best they could by region, but many businesses and activist orga- the archive. Tracy developed a custom application for track- nizations worked across the regions. ing nominations, so different parties could see each other’s nominations. Tracy noted that one of its drawbacks was that While collaboration across organizational boundaries was ev- the tool did not link to the archived web content when it was ident, several participants noted that duplication of web con- acquired. tent was not widely viewed as something to be avoided. Many Email was also widely used as a communication tool between commented that duplication was one way to ensure preserva- tion, following “lots of copies keeps stuff safe” (LOCKSS). selectors of websites and the individuals performing the web Local copies of resources that are available elsewhere can be crawling. In one case a technician would receive requests to of benefit when using the data: crawl a particular website via email, which would initiate a conversation between the technician and the selector to de- If I can’t get a copy it doesn’t exist in the same way. termine what parts of the website to archive. This process I think that there is still a lot to being able to locally would often involve the technician in running test crawls to curate and manage collections and the fact that it’s over see what problem areas there were. Several archivists spoke

However, email was not the only communication method used in the appraisal process. As already noted, social media, particularly Twitter, was used as a way of communicating with prominent web archiving individuals when websites were in need of archiving. In one case IRC chat was also a way for volunteers to talk about websites that were in need of archiving, and to coordinate work. These conversations were extremely important because they embody the process of determining the value of resources.

Many interviewees used the Archive-It service and commented on the utility of test crawls. Test crawls were essentially experiments where the archivist instructed the crawler to archive a particular URL using particular scoping URLs to control how far the crawl proceeded. Once the crawl was completed the archivist would examine the results by browsing the content and comparing it to the live website. The archivist would also examine reports to look at the amount of data used, and at URLs that were discovered but not crawled, either because of time or because they were blocked by the scope rules. The experiments were iterative in that the results of one test would often lead to another refined test until the crawl was deemed good. Almost all participants talked about this process as quality assurance, or QA, instead of appraisal, despite the fact that it was ultimately a question of what would and would not go into the archive. One exception to that rule was an archivist with 10 years of experience doing web archiving with multiple systems, who referred to this as pre-crawl and post-crawl appraisal.

It is notable to observe how engineering terminology like quality assurance has crept into the language of the archive where appraisal would be a more apt term. One archivist also noted how the archival notions of processing and appraisal, which are normally thought of as distinct archival activities, get folded together or entangled in the process of test crawling. Indeed, one participant went so far as to say that the process of web archiving actually felt more like collection building than archiving.

In a few cases, the Domain Name System itself was used as a service to discover subdomains that were part of a university's domain. A large number of target hostnames were discovered, which were then prioritized in order to build a seed list. In another case, knowledge of the rules around the .mil DNS top level domain was used to determine websites of interest for archiving government sites. However these rules were imperfect, as some US government websites use the .com top level domain, such as the US Postal Service.

Another prominent technology that participants mentioned was content management systems. In many cases archivists had experience working as web designers or information architects. They had used content management systems like Drupal, Ruby on Rails, WordPress, etc. The archivist would use this knowledge to decide how to crawl websites and to diagnose problems when they arose.

Breakdown
One of the more salient findings during analysis was the locus of breakdown, which made the relations between people, tools, and web infrastructure more legible. These moments of breakdown also led to greater understanding of how the tools operated, and generated opportunities for repair and innovation [40].

Charles was attempting to do a full domain crawl of his university's domain with the Archive-It tool. An unfortunate side effect of running this crawl was that portions of the university website were put under more significant load than usual, became unresponsive, and crashed. IT specialists who were responsible for these systems incorrectly identified the crawlers as a denial of service attack, and traced them to Archive-It. An email conversation between the technicians and Archive-It led to the technicians at the university connecting up with the archivists who were attempting to archive web content–at the same institution. This situation led to lines of communication being opened between the library and central IT which were not previously available. It also led to increased understanding of the server infrastructure at the university, which was housed in four different locations. The IT department became aware of the efforts to archive the university's web spaces, and began to notify the archivist when particular websites were going to be redesigned or shut down and were in need of archiving.

In another case, John used a command line web crawling and archiving tool called wget to collect web content. wget was used to generate a snapshot of web content and serialize it using the WARC file format. He then used a playback tool called WebarchivePlayer to examine the data stored in the WARC file to see how complete the archive was. In some cases he would notice missing files or content that failed to load because the browser was attempting to go out to the live web and he had disabled Internet access. This breakdown in the visual presentation of web resources would prompt John to use the browser's developer tools to look for failed HTTP requests, and to trace these back to JavaScript code that was dynamically attempting to collect content from the live web. He would then use this knowledge to craft additional rules for wget, using the Lua programming language, to fetch the missing resources. When his examination of the WARC file yielded a satisfactory result, the resulting Lua code and wget instructions were bundled up and deployed to a network of crawlers that collaborated to collect the website.
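A rough version of the completeness check John performed by eye can also be expressed in code. The sketch below walks a WARC file and reports responses that came back with error status codes; it uses the open source warcio library as one convenient way to read WARC records, a choice made for the illustration rather than something the participant reported using, and the file name is hypothetical.

```python
from warcio.archiveiterator import ArchiveIterator

def error_responses(warc_path):
    """List captured URLs whose HTTP response indicates a problem (4xx/5xx)."""
    problems = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            status = int(record.http_headers.get_statuscode())
            if status >= 400:
                url = record.rec_headers.get_header("WARC-Target-URI")
                problems.append((status, url))
    return problems

for status, url in error_responses("snapshot.warc.gz"):
    print(status, url)
```

A check like this only catches resources that were requested and failed; resources that were never requested at all still require the kind of browser-based inspection John describes.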
As previously discussed, storage costs are another point of breakdown when archivists are deciding what web content to archive. Several participants mentioned their use of test crawls in an attempt to gauge the size of a website. The full contours of a website are difficult to estimate, which makes estimating storage costs difficult as well. Some participants were able to communicate with the individuals who ran the website being archived in order to determine what content to collect. Roger, who was mentioned earlier, got into conversation with an engineer who worked at a video streaming service that was in the process of being sold. Together they determined that the full set of data was 1.1 petabytes in size, which (after consultation with the director of that archive) made it very difficult to think about archiving in full.

    I went back to the developer and asked: could you give me a tally of how many videos have had 10 views, how many videos have had 100 views and how many videos have had a 1000 views? It turned out that the amount of videos that had 10 views or more was like 50-75 TB. And he told me that 50% of the videos, that is to say 500 TB, had never been viewed. They had been absorbed and then never watched. A small amount had been watched when they were broadcast and never seen again. We had to walk away from the vast majority. Given that we can't take them all, what are the most culturally relevant ones? We grabbed mostly everything that was 10 or more. The debate is understandable. In an ideal world you'd take it all. The criteria we've tended to use is, I always like to grab the most popular things, and the first things. So if you have a video uploading site I want the first year of videos. I want to know what people did for the first year when they were faced with this because there's no questions this is very historically relevant. But I also want people to have what were the big names, what were the big things that happened. And that's not perfect.

In this case a breakdown that resulted from the size of the collection and the available storage became a site for innovation, and an opportunity to make legible the appraisal decisions around what constitutes culturally significant material.

Another extremely common case of breakdown was when a robots.txt file prevented a crawler from creating a high fidelity capture of a website. A robots.txt file instructs automated agents about what resources they can and cannot request. Frequently, content management systems will block access to CSS or image files, which makes a web archive of the pages visibly incomplete, and difficult to use. Many (but not all) archives attempted to be polite by instructing their web archiving bots to respect these robots.txt files. When they encountered a problem they would often need to reach out to someone at the organization hosting the website. When contact was made, the robots.txt file would sometimes be adjusted to allow the bot in. The archivist became aware of how the website was operating, and the website owner became aware of the archiving service. In one instance this communication channel led a website owner to make more cumulative information available on their website instead of replacing (and thus removing) older content. In some sense the website itself adapted or evolved an archival function based on the interactions between the archivist and the manager of the website being archived.
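Because so much hinges on what a site's robots.txt allows, a quick programmatic check is often the first diagnostic step when a capture comes back visibly incomplete. The sketch below uses Python's standard urllib.robotparser to test whether a polite crawler would be allowed to fetch the page requisites it needs; the site, user agent string and paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(site, paths, user_agent="ExampleArchiveBot"):
    """Return the paths that a robots.txt-respecting crawler could not fetch."""
    parser = RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()                      # fetch and parse the live robots.txt
    return [p for p in paths if not parser.can_fetch(user_agent, f"{site}{p}")]

# Page requisites whose absence would make an archived page render badly.
requisites = ["/index.html", "/assets/site.css", "/assets/logo.png"]
for path in blocked_paths("https://example.edu", requisites):
    print("robots.txt blocks:", path)
```

When the blocked paths turn out to be stylesheets and images, the list is also a concrete thing to send to the site owner when asking for the rules to be relaxed.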
A robots.txt file instructs auto- Echoes of Booms can be found in this description by Roger mated agents in what resources it can and cannot request. of how his archive’s appraisal policies are enacted: Frequently content management systems will block access to CSS or image files which makes a web archive of the The greater vision, as I interpret it, is that we allow the pages visibly incomplete, and difficult to use. Many (but not drive of human culture to determine what is saved. Not all) archives attempted to be polite by instructing their web to the exclusion of others, but one really good source archiving bots to respect these robots.txt files. When they en- of where things are that need to be saved is to see what human beings are conversing about and what they are countered a problem they would often need to reach out to interacting with online. someone at the organization hosting the website. When con- tact was made the robots.txt file would sometimes be adjusted Websites, search engines and social media platforms are ma- to allow the bot in. The archivist became aware of how the terial expressions of the transformation of content into com- website was operating and the website owner became aware putational resources, with centers of power and influence that of the archiving service. In one instance this communication are new, but in many ways all too familiar. The continued channel led a website owner to make more cumulative in- challenge for archivists is to tap into these sources of in- formation available on their website instead of replacing (and formation, to deconstruct, and reconstruct them in order to thus removing) older content. In some sense the website itself document society, as Booms urged. In the shift to compu- adapted or evolved an archival function based on the interac- tational resources there is an opportunity to design systems tions between the archivist and the manager of the website that make these collaborations between archivists, automated being archived. agents and the web legible and more understandable for all parties, and particular for the future researcher who is trying DISCUSSION AND FUTURE WORK to understand why something has been archived. On the one hand these research findings demonstrate a some- what mundane but perhaps comforting finding that in many The appraisal processes that are being enacted by archivists ways appraisal processes in web archives appear to be con- are not always adequately represented in the archive itself. gruent with traditional notions of appraisal. We saw docu- Recall the spreadsheets, emails and chat systems that are used mentation strategies [68] at play in many cases where a col- during appraisal, that all but disappear from the documentary laboration between records creators, archives and their users record. These systems are being used to fill the broken spaces informed decisions about what needed to be collected from or gaps in the infrastructure of web archives. Each of these
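The politeness check described above can be illustrated with a short sketch using Python's standard-library robots.txt parser; the site, paths and user agent are hypothetical, and production crawlers implement their own, more elaborate handling.

    from urllib.robotparser import RobotFileParser

    # The site, paths, and user agent below are hypothetical.
    robots = RobotFileParser()
    robots.set_url("https://example.org/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt

    user_agent = "ExampleArchiveBot"
    resources = [
        "https://example.org/index.html",
        "https://example.org/assets/site.css",    # CMS defaults often disallow these
        "https://example.org/images/banner.png",
    ]

    for url in resources:
        if robots.can_fetch(user_agent, url):
            print("allowed:", url)
        else:
            # Blocked stylesheets and images are what make the capture look broken.
            print("blocked:", url)

When the stylesheet or image paths come back blocked, the bot can either skip them (and accept a broken-looking capture) or, as the participants described, the archivist can contact the site owner and ask for the rules to be relaxed.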

DISCUSSION AND FUTURE WORK

On the one hand these research findings demonstrate a somewhat mundane but perhaps comforting finding: that in many ways appraisal processes in web archives appear to be congruent with traditional notions of appraisal. We saw documentation strategies [68] at play in many cases where a collaboration between records creators, archives and their users informed decisions about what needed to be collected from the web. The functional analysis appraisal technique was also used by archivists as they analyzed the structure of organizations in order to determine what needed to be collected. We also saw postcustodial theory [14] in operation when archivists interacted with website owners, and in some cases encouraged them to adopt archival practices. So rather than a particular archival institution being responsible for the preservation of and access to documents, the responsibility is spread outwards into the community of web publishing.

A recurring theme in the analysis above was the archivists' attention to contemporary culture and news sources. We recall one participant who spoke of her mentor, who had set an example of taking two days every week to pore over a stack of local newspapers and clip stories that contained references to local events, people and organizations to explore as record sources. She spoke of how she continued this tradition by listening to local radio and subscribing to podcasts, RSS feeds and email discussion lists. She then regularly noted names of organizations, people and events in these streams as potential record sources. While not all interviewees spoke explicitly of this practice being handed down, the attention to local news sources was a common theme, particularly when it came to processing information streams. This attention to current events, while simple, is extraordinarily powerful, and reminiscent of Booms [8]:

The documentary heritage should be formed according to an archival conception, historically assessed, which reflects the consciousness of the particular period for which the archives is responsible and from which the source material to be appraised is taken. (p. 105)

Echoes of Booms can be found in this description by Roger of how his archive's appraisal policies are enacted:

The greater vision, as I interpret it, is that we allow the drive of human culture to determine what is saved. Not to the exclusion of others, but one really good source of where things are that need to be saved is to see what human beings are conversing about and what they are interacting with online.

Websites, search engines and social media platforms are material expressions of the transformation of content into computational resources, with centers of power and influence that are new, but in many ways all too familiar. The continued challenge for archivists is to tap into these sources of information, to deconstruct and reconstruct them in order to document society, as Booms urged. In the shift to computational resources there is an opportunity to design systems that make these collaborations between archivists, automated agents and the web legible and more understandable for all parties, and in particular for the future researcher who is trying to understand why something has been archived.

The appraisal processes that are being enacted by archivists are not always adequately represented in the archive itself. Recall the spreadsheets, emails and chat systems that are used during appraisal, and that all but disappear from the documentary record. These systems are being used to fill the broken spaces or gaps in the infrastructure of web archives. Each of these hacks, or attempts at creatively patching archival technology, is a design hint that can inform the affordances offered by archival tools and platforms. For example, if spreadsheets can be collaboratively used by a group of archivists to record why a web resource was selected, who selected it, and other administrative notes, perhaps this collaborative functionality could be incorporated into the web archiving platforms themselves? One opportunity for future work would be to examine these sites of breakdown in greater detail, in order to help archivists and their automated agents create a more usable and legible archival record. Further examination of how consensus is established when archivists are collaborating would also be a fruitful area to explore, in order to understand how archivists are collaborating with each other using these technical systems.
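As a purely hypothetical illustration of the spreadsheet suggestion above, the sketch below shows one shape such a shared appraisal record might take; the field names are ours and are not drawn from the study.

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    @dataclass
    class AppraisalDecision:
        """One hypothetical record of why a web resource was selected."""
        url: str                      # the seed or resource that was selected
        selected_by: str              # the archivist (or bot) that made the call
        rationale: str                # why it was judged worth collecting
        collection: str               # the collection or crawl it belongs to
        decided_on: date = field(default_factory=date.today)
        notes: List[str] = field(default_factory=list)  # administrative notes

    decision = AppraisalDecision(
        url="http://example.org/",
        selected_by="jdoe",
        rationale="Local organization active in the events being documented",
        collection="City elections 2016",
    )
    decision.notes.append("Site owner agreed to adjust robots.txt for the crawler")

Whatever form it takes, the point is that the provenance of a selection decision would travel with the archive, rather than being stranded in an ad hoc spreadsheet or email thread.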
Another significant theme that points to an area for future research is found in the collaborative sociotechnical environment made up of archivists, researchers (the users of the archive), and the systems and tools they use in their work. The inner workings of the archive always reflect or reinscribe the media they attempt to preserve and provide access to. As electronic records and the World Wide Web have come to predominate, the architecture of the archive itself has necessarily been transformed into a computerized, distributed system, whose data flows and algorithms reshape the archival imagination itself [75]. Even with its narrow focus on the appraisal decisions made by archivists, this study demonstrates that archivists have rich and highly purposeful interactions with algorithmic systems as they do their work of selecting and processing web resources. Time and again we saw that archivists used these systems, and cleverly arrived at techniques for understanding the dimensions of the resources they were collecting, the fidelity of the representations created, and ultimately the algorithmic processes that they were directing.

As outlined by Kitchin [48], the study of these archival algorithmic systems offers several fruitful areas for future work, the most promising of which could be the study of the algorithms themselves (their source code) in conjunction with an ethnographic analysis of the environments where those platforms are developed. This work could study a particular platform such as Archive-It, or the operations of a specific group such as ArchiveTeam or the British Library, in order to better understand how these archival systems are designed and used. Attention to how policy and governance concerns become entangled with the design of these preservation systems would also be extremely valuable [39]. In addition, it could be useful to turn the analysis inward and examine the practices of non-archivists, such as CSCW researchers themselves, as they collect and manage content retrieved from the web. How do their practices compare to those of the emerging web archiving professional?

CONCLUSION

Take a moment to imagine a science fiction future where the archival record is complete. Nothing is lost. Everything is remembered. To some this is a big data panacea, and to others a dystopian nightmare of the panopticon. Fortunately we find ourselves somewhere in between these two unlikely extremes. The archive is always materially situated in society. This study asked a simple question of how URLs end up being selected for an archive. The responses illustrate that archivists, bots, record creators and the web operate in a flattened space, where it is often difficult to assess where one begins and the other ends [43]. They also highlighted that much work remains to be done to understand how web archives operate as a sociotechnical system, and how that understanding can inform the design of better tools.

Our ability to collect and preserve records is a function not only of technology but of our laws, values and ethics, as well as the resources at hand to make them real. The web is not vastly different. But what is different is that the material of the web and our computational infrastructures provide us with new opportunities to make legible the values and ethics that are inscribed in our decisions of what to keep, and what not to keep. Are we up to the challenge?

ACKNOWLEDGEMENTS

We would like to thank Samantha Abrams, Nicholas Taylor, Kari Kraus, Leah Findlater and Jessica Vitak for their valuable feedback and guidance during the development and implementation of this study. Also, a big thanks to the members of the web archiving community whom we interviewed. You all displayed such generosity, enthusiasm and care when talking about your work.

References

1. Dharma Akmon, Ann Zimmerman, Morgan Daniels, and Margaret Hedstrom. 2011. The application of archival concepts to a data-intensive environment: Working with scientists to understand data management and preservation needs. Archival Science 11, 3: 329–348. https://doi.org/10.1007/s10502-011-9151-4
2. Jesse Alpert and Nissan Hajaj. 2008. We knew the web was big. Retrieved from https://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
3. Kimberly D Anderson. 2011. Appraisal learning networks: How university archivists learn to appraise through social interaction. University of California, Los Angeles.
4. Jefferson Bailey, Abigail Grotke, Kristine Hanna, Cathy Hartman, Edward McCain, Christie Moffatt, and Nicholas Taylor. 2013. Web archiving in the United States: A 2013 survey. National Digital Stewardship Alliance. Retrieved from http://digitalpreservation.gov/ndsa/working_groups/documents/NDSA_USWebArchivingSurvey_2013.
5. David Bearman. 1989. Archival methods. Archives and Museum Informatics 3, 1. Retrieved from http://www.archimuse.com/publishing/archival_methods/
6. Richard Bentley, Thilo Horstmann, and Jonathan Trevor. 1997. The World Wide Web as enabling technology for CSCW: The case of BSCW. In Richard Bentley, Uwe Busbach, David Kerr and Klaas Sikkel (eds.), Groupware and the World Wide Web. Springer, 1–24.

7. Frank Boles and Julia Young. 1985. Exploring the black box: The appraisal of university administrative records. The American Archivist 48, 2: 121–140.
8. Hans Booms. 1987. Society and the formation of a documentary heritage: Issues in the appraisal of archival sources. Archivaria 24, 3: 69–107. Retrieved from http://journals.sfu.ca/archivar/index.php/archivaria/article/view/11415/12357
9. Peter Botticelli. 2000. Records appraisal in network organizations. Archivaria 1, 49.
10. Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2: 77–101.
11. Maciej Ceglowski. 2011. Remembrance of links past. Retrieved from https://blog.pinboard.in/2011/05/remembrance_of_links_past/
12. Alissa Centivany. 2016. Policy as embedded generativity: A case study of the emergence and evolution of HathiTrust. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 924–938.
13. Michel de Certeau. 2011. The practice of everyday life. University of California Press.
14. Terry Cook. 1993. The concept of the archival fonds in the post-custodial era: Theory, problems and solutions. Archivaria 35: 24–37.
15. Terry Cook. 2011. We are what we keep; we keep what we are: Archival appraisal past, present and future. Journal of the Society of Archivists 32, 2: 173–189.
16. Carol Couture. 2005. Archival appraisal: A status report. Archivaria 1, 59: 83–107.
17. Richard J Cox. 2010. Archivists and collecting. In Marcia Bates and Mary Niles Maack (eds.), Encyclopedia of library and information sciences (3rd ed.). Taylor & Francis, 208–220.
18. José van Dijck. 2013. The culture of connectivity: A critical history of social media. Oxford University Press. Retrieved from https://global.oup.com/academic/product/the-culture-of-connectivity-9780199970780
19. Meghan Dougherty and Eric T Meyer. 2014. Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology 65, 11: 2195–2209.
20. Terry Eastwood. 1992. Towards a social theory of appraisal. In The archival imagination: Essays in honour of Hugh A. Taylor: 71–89.
21. Robert M Emerson, Rachel I Fretz, and Linda L Shaw. 2011. Writing ethnographic fieldnotes. University of Chicago Press.
22. Motahhare Eslami, Karrie Karahalios, Christian Sandvig, Kristen Vaccaro, Aimee Rickman, Kevin Hamilton, and Alex Kirlik. 2016. First I like it, then I hide it: Folk theories of social feeds. In Proceedings of the 2016 CHI conference on human factors in computing systems, 2371–2382. Retrieved from http://www-personal.umich.edu/~csandvig/research/Eslami_FolkTheories_CHI16.pdf
23. Clifford Geertz. 1973. Thick description: Toward an interpretive theory of culture. In The interpretation of cultures: Selected essays. Basic Books.
24. R. Stuart Geiger. 2014. Bots, bespoke, code and the materiality of software platforms. Information, Communication & Society: 342–356. Retrieved from http://www.tandfonline.com/doi/full/10.1080/1369118X.2013.873069
25. R. Stuart Geiger. 2016. Administrative support bots in Wikipedia: How algorithmically-supported automation can transform the affordances of platforms and the governance of communities.
26. Tarleton Gillespie and Nick Seaver. 2015. Critical algorithm studies: A reading list. Retrieved from https://socialmediacollective.org/reading-lists/critical-algorithm-studies/
27. Tarleton Gillespie. 2010. The politics of platforms. New Media & Society 12, 3: 347–364.
28. Tarleton Gillespie. 2017. Governance of and by platforms. In Jean Burgess, Thomas Poell and Alice Marwick (eds.), Sage handbook of social media. Sage.
29. Daniel Gomes, Sérgio Freitas, and Mário J Silva. 2006. Design and selection criteria for a national web archive. Springer, 196–207.
30. Daniel Gomes, João Miranda, and Miguel Costa. 2011. A survey on web archiving initiatives. Springer, 408–420.
31. Mark Greene. 1998. "The surest proof": A utilitarian approach to appraisal. Archivaria: 127–169.
32. Abbie Grotke. 2011. Ask the recommending officer: The civil war sesquicentennial web archive. Retrieved from http://blogs.loc.gov/digitalpreservation/2011/08/ask-the-recommending-officer-the-civil-war-sesquicentennial-web-archive/
33. Abbie Grotke. 2012. Ask the recommending officer: Indian general elections 2009 web archive. Retrieved from http://blogs.loc.gov/digitalpreservation/2012/01/ask-the-recommending-officer-indian-general-elections-2009-web-archive/
34. F Gerald Ham. 1993. Selecting and appraising archives and manuscripts. Society of American Archivists. Retrieved from http://catalog.hathitrust.org/Record/002650426
35. Verne Harris. 2002. The archival sliver: Power, memory, and archives in South Africa. Archival Science 2, 1-2: 63–86.
36. James A Holstein and Jaber F Gubrium. 2011. Animating interview narratives. Qualitative Research 3: 149–167.

37. Lara Houston, Steven J Jackson, Daniela K Rosner, Syed Ishtiaque Ahmed, Meg Young, and Laewoo Kang. 2016. Values in repair. In Proceedings of the 2016 CHI conference on human factors in computing systems, 1403–1414.
38. Hugo C Huurdeman, Jaap Kamps, Thaer Samar, Arjen P de Vries, Anat Ben-David, and Richard A Rogers. 2015. Lost but not forgotten: Finding pages on the unarchived web. International Journal on Digital Libraries 16, 3-4: 247–265.
39. Steven J Jackson, Tarleton Gillespie, and Sandy Payette. 2014. The policy knot: Re-integrating policy, practice and design in CSCW studies of social computing. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, 588–602.
40. Steven J. Jackson. 2014. Rethinking repair. In Pablo Boczkowski and Kirsten Foot (eds.), Media technologies: Essays on communication, materiality and society. MIT Press. Retrieved from http://sjackson.infosci.cornell.edu/RethinkingRepairPROOFS(reduced)Aug2013.pdf
41. Steven J. Jackson and Laewoo Kang. 2014. Breakdown, obsolescence and reuse: HCI and the art of repair. Retrieved from http://sjackson.infosci.cornell.edu/Jackson&Kang_BreakdownObsolescenceReuse(CHI2014).pdf
42. Steven Jackson, Stephanie Steinhardt, and Ayse Buyuktur. 2013. Why CSCW needs science policy (and vice versa). In Proceedings of the 2013 conference on computer supported cooperative work.
43. Sheila Jasanoff. 2006. States of knowledge: The co-production of science and the social order. Routledge.
44. Hilary Jenkinson. 1922. A manual of archive administration including the problems of war archives and archive making. Clarendon Press.
45. Brigitte Jordan. 1996. Ethnographic workplace studies and CSCW. In Human factors in information technology. Elsevier, 17–42.
46. Brewster Kahle. 2007. Universal access to all knowledge. The American Archivist 70, 1: 23–31.
47. Helena Karasti, Karen S. Baker, and Florence Millerand. 2010. Infrastructure time: Long-term matters in collaborative development. Computer Supported Cooperative Work 19: 377–415.
48. Rob Kitchin. 2016. Thinking critically about and researching algorithms. Information, Communication & Society: 1–16.
49. Kari Kuutti and Liam J Bannon. 2014. The turn to practice in HCI: Towards a research agenda. In Proceedings of the 32nd annual ACM Conference on Human Factors in Computing Systems, 3543–3552. Retrieved from http://dl.acm.org/citation.cfm?id=2557111
50. France Lasfargues, Leila Medjkoune, and Chloé Martin. 2012. Archiving before loosing valuable data? Development of web archiving in Europe. Bibliothek Forschung und Praxis 36, 1: 117–124.
51. Kalev Leetaru. 2015. Why it's so important to understand what's in our web archives. Retrieved from http://www.forbes.com/sites/kalevleetaru/2015/11/25/why-its-so-important-to-understand-whats-in-our-web-archives/
52. Kalev Leetaru. 2015. How much of the internet does the Wayback Machine really archive? Forbes. Retrieved from http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive/
53. Bertram Lyons. 2016. There will be no digital dark age. Retrieved from https://issuesandadvocacy.wordpress.com/2016/05/11/there-will-be-no-digital-dark-age/
54. Jayant Madhavan, David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. 2008. Google's deep web crawl. Proceedings of the VLDB Endowment 1, 2: 1241–1252.
55. Jennifer Marill, Andy Boyko, and Michael Ashenfelder. 2004. Web harvesting survey. International Internet Preservation Consortium. Retrieved from http://www.netpreserve.org/sites/default/files/resources/WebArchivingSurvey.pdf
56. Gordon Mohr, Michael Stack, Igor Ranitovic, Dan Avery, and Michele Kimpton. 2004. Introduction to Heritrix. In 4th international web archiving workshop. Retrieved from https://webarchive.jira.com/wiki/download/attachments/5441/Mohr-et-al-2004.pdf
57. Eric Monteiro, Neil Pollock, Ole Hanseth, and Robin Williams. 2013. From artefacts to infrastructures. Computer Supported Cooperative Work 22.
58. NDSA. 2012. Web archiving survey report. National Digital Stewardship Alliance. Retrieved from http://www.digitalpreservation.gov/ndsa/working_groups/documents/ndsa_web_archiving_survey_report_2012.pdf
59. Michael Neubert. 2014. Five questions for Will Elsbury, project leader for the election 2014 web archive. Retrieved from http://blogs.loc.gov/digitalpreservation/2014/10/five-questions-for-will-elsbury-project-leader-for-the-election-2014-web-archive/
60. Jinfang Niu. 2012. Appraisal and custody of electronic records: Findings from four national archives. Archival Issues 34, 2: 117–130.
61. Jinfang Niu. 2012. An overview of web archiving. D-Lib Magazine 18, 3. Retrieved from http://www.dlib.org/dlib/march12/niu/03niu1.html
62. Liladhar R Pendse. 2014. Archiving the Russian and East European lesbian, gay, bisexual, and transgender web, 2013: A pilot project. Slavic & East European Information Resources 15, 3: 182–196.

63. Margaret E Phillips. 2006. What should we preserve? The question for heritage libraries in a digital world. Library Trends 54, 1: 57–71.
64. David Ribes and Thomas Finholt. 2009. The long now of technology infrastructure: Articulating tensions in development. Journal of the Association for Information Systems 10, 5.
65. David Ribes and Charlotte P. Lee. 2010. Sociotechnical studies of cyberinfrastructure and e-research: Current themes and future trajectories. Computer Supported Cooperative Work (CSCW) 19, 3: 231–244. https://doi.org/10.1007/s10606-010-9120-0
66. Daniela K Rosner and Morgan Ames. 2014. Designing for repair?: Infrastructures and materialities of breakdown. In Proceedings of the 17th ACM conference on computer supported cooperative work & social computing, 319–331. Retrieved from http://people.ischool.berkeley.edu/~daniela/files/cscw14-rosner-repair.pdf
67. Myriam Ben Saad and Stéphane Gançarski. 2012. Archiving the web using page changes patterns: A case study. International Journal on Digital Libraries 13, 1: 33–49.
68. Helen Willa Samuels. 1986. Who controls the past. The American Archivist: 109–124. Retrieved from http://americanarchivist.org/doi/abs/10.17723/aarc.49.2.t76m2130txw40746
69. Robert Sanderson, Mark Phillips, and Herbert Van de Sompel. 2011. Analyzing the persistence of referenced web resources with Memento. Retrieved from http://arxiv.org/abs/1105.3459
70. Theodore R Schellenberg. 1956. Modern archives: Principles and techniques. University of Chicago Press. Retrieved from http://catalog.hathitrust.org/Record/003147122
71. Nick Seaver. 2013. Knowing algorithms. Media in Transition 8: 1–12. Retrieved from http://nickseaver.net/s/seaverMiT8.pdf
72. Susan Leigh Star. 1999. The ethnography of infrastructure. American Behavioral Scientist 43, 3: 377–391.
73. Lucy Suchman. 1985. Plans and situated actions: The problem of human-machine communication. Xerox Corporation.
74. Nick Szydlowski. 2010. Archiving the web: It's going to have to be a group effort. The Serials Librarian 59, 1: 35–39.
75. Hugh A. Taylor. 1992. The archival imagination: Essays in honour of Hugh A. Taylor. Association of Canadian Archivists.
76. Ciaran B Trace. 2010. On or off the record? Notions of value in the archive. Currents of Archival Thinking: 47–68.
77. Reto Tschan. 2002. A comparison of Jenkinson and Schellenberg on appraisal. The American Archivist 65, 2: 176–195.
78. Sonia Yaco, Ann Jimerson, Laura Caldwell Anderson, and Chanda Temple. 2015. A web-based community-building archives project: A case study of Kids in Birmingham 1963. Archival Science 15, 4: 399–427.
79. Jonathan Zittrain, Kendra Albert, and Lawrence Lessig. 2014. Perma: Scoping and addressing the problem of link and reference rot in legal citations. Legal Information Management 14, 02: 88–99.
