Understanding Computational Web Archives Research Methods Using Research Objects

Emily Maemura, Christoph Becker
Digital Curation Institute
University of Toronto
Toronto, Canada
[email protected], [email protected]

Ian Milligan
Department of History
University of Waterloo
Waterloo, Canada
[email protected]

Abstract—Use of computational methods for exploration and analysis of web archives sources is emerging in new disciplines such as digital humanities. This raises urgent questions about how such research projects process web archival material using computational methods to construct their findings. This paper aims to enable web archives scholars to document their practices systematically to improve the transparency of their methods. We adopt the Research Object framework to characterize three case studies that use computational methods to analyze web archives within digital history research. We then discuss how the framework can support the characterization of research methods and serve as a basis for discussions of issues such as reuse and provenance. The results suggest that the framework provides an effective conceptual perspective to describe and analyze the computational methods used in web archive research at a high level and to make transparent the choices made in the process. The documentation of the research process contributes to a better understanding of the findings and their provenance, and the possible reuse of data, methods, and workflows.

Keywords—computational methods; web archives; research objects; computational archival science; digital curation

I. INTRODUCTION

The web has become a key source for scholars studying social and cultural phenomena. Web archives facilitate this work by documenting and preserving snapshots of the web for future research. In areas such as history, scholars have unprecedented opportunities to leverage access to massive primary sources of born-digital artifacts to make sense of individual experiences [1].

These new forms of archives trouble traditional conceptions of historical archival research, which assumes that any material found in archives is historically significant, having been assessed through archival appraisal. The scale of web archives confounds the traditional appraisal process. Web archives are replete with information that may not be significant to the research questions being asked. The scale of these collections frequently mandates the use of big data analysis techniques to address their research questions. The adoption of these computational methods is a major shift for fields in the humanities. Approaches that have traditionally focused on close readings of source materials are moving to the use of distant reading methods [2].

We highlight three critical factors to be considered for humanities researchers using web archives:
1) Interrogating sources – Understanding how these web archives collections were created is key to judging the adequacy, appropriateness, and limitations of the source material.
2) Understanding new methods – The use of emerging computational methods is a key prerequisite to working with large-scale data sets.
3) Transparency of the research process – Findings are dependent on the validity of the computational processes and the adequacy of the data.

These factors contribute to the need for a stronger methodological framework for research with web archives to understand the research process in more detail. This can serve as a common vocabulary for discussions of trust in the findings, as well as reuse of data, tools, or analytical techniques.

Our work combines the perspectives of a digital humanist engaged in the computational exploitation of web archives and scholars working at the intersection of systems design and digital curation. Our joint interest in research with web archives led to an effort to structure research processes in order to develop detailed descriptions, with an eye to developing a research model. Our motivation for this study is to ask: how do research projects in digital humanities process web archival material using computational methods to construct their findings? We follow the work of other researchers in the digital humanities who have called for stronger frameworks for computational methods and the needs of researchers using web archives [3], [4].

To develop the framework, we draw on the concept of the Research Object (RO), developed to address the aggregations of resources and processes used in conducting computational research [5]. We use the RO profile as a structure to characterize three case studies that use computational methods to analyze web archives within digital history research, and discuss how the framework can illuminate issues such as transparency, provenance, and reuse.

The next section will outline background work in computational research with web archives and in conceptual frameworks to model computational science methods, with particular attention to Research Objects. We then develop a conceptual framework for Web Research Objects that we use to describe three case studies. These descriptions show that structured documentation of the research process contributes to a better understanding of the findings and their provenance. They serve as a basis for discussing questions of transparency, provenance, and the origins of the archival material used for analysis. These discussions show the value of the Research Object framework and raise a number of questions for further research.

II. BACKGROUND

Web Archives and Historical Research

Web archives attempt to capture and preserve web data for future use and study. Different approaches to web archiving range from micro-archiving that captures details of the experience of interacting with individual web sites or elements, to macro-archiving initiatives that use crawlers and other automated tools to capture web data [6], [7]. Major web archive initiatives for research use include the Internet Archive, national libraries and archives (such as the Library of Congress and the U.K. National Archives), academic libraries, as well as community efforts like the Archive Team. In parallel to archiving web pages, social media archives collect data from platforms such as Twitter or Instagram. They often rely on APIs from these services to construct their holdings.

For researchers in the social sciences and humanities, web archives are increasingly recognized as an essential source for studying cultural and social phenomena of recent decades [8], [9]. Niels Brügger has been studying the evolution of national domains [10]. Anat Ben-David has studied the now-deleted .yu domain and found parallels in the structure of the web archive with the break-up of the former Yugoslavia [11]. Richard Rogers' Digital Methods initiative has studied the Dutch blogosphere using the Internet Archive [12], [13]. Matthew Weber used web archive data to trace connections between local newspapers, studying a transformative moment in the history of American journalism [14]. The field is continually growing.

Working with web archival material presents researchers with opportunities for developing new approaches and methods of analysis, often because existing methods do not translate to them. For example, exploring web archives does not require navigating a finding aid, and collections can be sorted and filtered in multiple ways, resulting in multiple arrangements [15]. Unlike structured datasets from censuses, surveys, or statistical information, web archives are neither temporally bounded nor predictable. Research methods for working with large-scale web data must take into account technical affordances like indexing and the different types of metadata available. Research with web archives frequently involves iterative cycles of text-mining or data-mining to sort and filter data, then applying analytics techniques like topic modeling, sentiment analysis, or network analysis. Whereas historians have traditionally found themselves wishing they had more information about the past, the "abundance" of web archives threatens to dramatically reshape forms of historical research and knowledge [16].

The emerging field of web archive research is therefore witnessing the development of new research methods. This is complicated by the nature of historical research, in which methods often remain implicit. Indeed, many historical research projects involve an iterative process where methods develop emergently and remain undocumented. The variation across studies adds to the difficulty of understanding, comparing, or validating results.

As a new form of archives, web archives challenge definitions found in traditional archival theory. Traditional approaches deal with aggregations of records that reflect specific business or institutional activities, and serve as evidence of those activities. In contrast, web archives often collect documents from a wide range of sources, created for different purposes, and formed around themes or events. The principles that guide the scoping of web archives may be seen as closer to curated collections in how they are selected and aggregated [17]. Interdisciplinary perspectives on provenance address the chain of custody of objects [18]; however, the archival perspective is critical for supporting historical research by addressing the additional layers of meaning that context provides.

Previous work has investigated the practices of web archiving and of researchers using web archives [3]. Studies done under the auspices of the Big UK Domain Data for the Arts and Humanities (BUDDAH) project have echoed this, noting in particular the lack of guidance for humanist researchers [19]. Beyond these studies that ask researchers to reflect on their own methods, approaches are not often clearly documented. Instead, information about a study can exist in many places – code in a GitHub repository or embedded in a blog post, occasionally published in an institutional repository or tweeted during intermediate stages, or methods articulated at practitioner conferences. For example, the historical article with little technical information is the final 'result', with the code, technical approaches, and earlier presentations elsewhere. As time advances, too, platforms or technologies can change without warning, so that results generated with one version may not line up with newer versions.

A shared conceptual framework of the web archives research process is essential to systematize practices, advance the field, and welcome new entrants to this area. It can also provide a shared vocabulary and flexible backbone to document and justify heterogeneous methods in a common perspective. Such a framework would be structurally useful to describe any research that investigates social questions based on web archives.

Frameworks from Computational Research in the Sciences

The Research Object (RO) framework structures data and computational workflows in reusable aggregates to facilitate reuse and reproducibility [5]. Its development was motivated by the need to support more sophisticated sharing of data and computational artefacts in computational science fields such as bioinformatics, medicine, astronomy, and genomics.^1 ROs aggregate the digital resources used in experiments in a structure, supported by linked data ontologies, that facilitates automated services and goes beyond publishing results in scientific papers or datasets. Based on the claim that "linked data is not enough", the project aims to provide "a mechanism to describe the aggregation of resources", including data, computational environments and configurations, and services used in carrying out an experiment or computational investigation [5].

Two aspects of the RO framework can be distinguished. On a research infrastructure development level, the RO framework implements a set of ontologies and mechanisms that enable the automated aggregation and linkage of named resources used in scientific investigations [20]. The primary aim is to increase the level of abstraction and semantic annotation to improve reusability and reproducibility in computational research. Some profiles are available that range from simulation experiments to computational workflow experiments [20].

On a conceptual level, the framework attempts to develop a perspective on the elements of computational research methods and articulate how the various pieces relate to each other. To do so, it also provides domain-specific standards for the component parts and the relationships between them. Bechhofer et al. [5] describe the seven broad elements of a scientific study that a research object will include:
• Questions around a research problem, with or without a formal hypothesis; descriptions or abstracts.
• Organizational context: ethical and governance approvals, investigators, etc.; acknowledgements.
• Study design encoded in structured documents.
• Methods and scientific workflows or scripts, services, software packages.
• Data from observations or measurements organised as input datasets.
• Results from analyses or in silico experiments: observations and derived datasets, along with information about their derivation or capture (provenance, algorithms, analyses, instrument calibrations).
• Answers: publications, papers, reports, slide-decks, DOIs, PubMed IDs, etc.

The authors also characterize the different types of reuse that the RO framework can help support. For example, they distinguish repurposing constituent parts of an RO (such as data or methods) in new experiments from reproducing the experiment using the same inputs and methods in an attempt to validate results. They also note the concept of ensuring an experiment is sufficiently documented to be 'replayable' – to go back and see what happened – which places requirements on metadata recording the provenance of data and results, but does not necessarily require enactment services [5].

Reproducible computational research is a driving concern in science, and replication is often the ultimate test of a study's validity [21]. Support of reuse and reproducibility is also the focus of many data curation activities like the FAIR (Findable, Accessible, Interoperable, Reusable) Principles^2 and other guidance [22]. In contrast, computational reproducibility of results is not a goal for the humanities. However, the central principle of transparency mandates that the steps in a research process are 'recoverable' by others, i.e. that the representation of the applied research methods and techniques is sufficiently transparent for an independent observer to make sense of the research and arrive at a conclusion on how much confidence is warranted in its findings. While the RO framework was developed for work in the natural sciences, its focus on aggregating, structuring, linking, documenting, and preserving elements of the computational process makes it a useful perspective to describe the archives research process.

III. WEB ARCHIVES RESEARCH OBJECTS: THREE CASES

In search of a framework that can structure the complex aggregation of computational processes, services, forms of data, contexts, and approaches used in the research process with web archives, we adopt the RO concepts and leverage more mature perspectives on computational methods from the sciences. We seek to balance the systematic approaches of the computer sciences with humanistic issues of provenance and trust. Therefore, less focus is placed on reproducibility as a test of validity, and more on other forms of reuse.

Our approach uses RO framework concepts to structure a discussion of three cases. The examples are studies completed by our historian co-author, though they also involve interdisciplinary work with computer scientists and librarians. For each case we describe, characterize, and compare the processes and infrastructures used, the sources of data, the artefacts generated, and the important contextual factors that influence the overall study design.

Our process of applying the RO concept began with the seven elements of the research included in the RO: questions, organizational context, study design, methods,

^1 www.researchobject.org
^2 https://www.force11.org/group/fairgroup/fairprinciples

Figure 1. Case 1: An Open-Source Strategy for Documenting Events (#elxn42)

data, results, and answers. We used these as a template to loosely structure a discussion of the process for each study, and to ask questions about what happened, including details not reported in publications. Answers were noted with an online whiteboard tool that provided a rough visualization drawing connections between elements. Further synthesis led to the structure used to describe the outcomes for each case below.

A. Case 1: Open-Source Event Documentation

This case describes a project to create and analyze a web archive data set about the election of the 42nd Canadian Parliament in October 2015, focusing on the popular Twitter hashtag #elxn42 [23]. It is summarized in Fig. 1, which organizes the seven RO elements spatially.

The organizational context includes an interdisciplinary collaboration between a librarian (Nick Ruest at York University) and a historian (Ian Milligan at the University of Waterloo), both part of the Web Archives for Historical Research Group. The project was funded by the Canadian Social Sciences and Humanities Research Council (SSHRC). Data was shared with Library and Archives Canada (LAC). No ethics approvals were required, since tweets are considered public data for research in a Canadian context. Notably, the collection and use of data is limited by the Twitter Terms of Service (Developer Agreement & Policy). Since the project focuses on open source tools, additional acknowledgments include the individuals who have worked to create the tools used, especially Ed Summers' twarc library [24].

The project aimed to enable libraries to collect social media data by demonstrating the free and open-source techniques and methods available. Motivations for this study focused less on research questions or hypotheses; instead, it served as an example of how social media can be used to understand historical events (both in general and for the election).

The choices in the study design included scoping the project around the election, a recurring event with a clear time frame for data collection (starting with the election announcement and ending with the swearing in of the new Prime Minister). Twitter was chosen due to its prominent role in political discussions, although its limitation of not being 'a representative sample of broader society' was noted [23]. #Elxn42 was the most commonly used hashtag for the election. The study used open source methods and tools, and further embraced openness through an open peer review process and publishing in an open-access journal.

Data was both collected by the researchers, and a second dataset was received from LAC. Due to the restrictions of the Twitter Developer Agreement & Policy, extensive Tweet data cannot be shared between parties, and requires an additional step of 'rehydrating' with the Twitter API. The datasets were then combined using a series of tools from the twarc.py library. As noted in the paper's analysis section, this means that tweets which were deleted after the time of collection would not be available for subsequent analysis.

Three different methods of analysis were applied to this combined dataset. First, tweets were grouped by day, and text analysis was performed comparing word frequencies (visualized as a word cloud) using a Python script. For tweets with geographic information, locations were interactively mapped. Quantitative analysis was performed on the data to identify top retweets, users, hashtags, links, most retweeted domains, and most frequent images.
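The per-day word-frequency step can be sketched in Python. This is an illustrative reconstruction, not the authors' published script: it assumes line-delimited tweet JSON of the kind harvesting tools like twarc produce, with `created_at` and `text` fields.

```python
import json
from collections import Counter, defaultdict

def word_frequencies_by_day(path):
    """Count word frequencies per day from a file of line-delimited
    tweet JSON (one tweet object per line, as twarc-style harvests emit)."""
    counts = defaultdict(Counter)
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            # Twitter's classic created_at looks like
            # "Mon Oct 19 20:10:15 +0000 2015"; year-month-day is
            # enough of a key for per-day grouping.
            parts = tweet["created_at"].split()
            day = f"{parts[5]}-{parts[1]}-{parts[2]}"
            words = tweet["text"].lower().split()
            # Drop links and @-mentions, mirroring typical
            # word-cloud preprocessing.
            counts[day].update(
                w for w in words if not w.startswith(("http", "@"))
            )
    return counts
```

The resulting per-day counters can then feed a word cloud or frequency comparison across days.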

As the study was concerned more with the usefulness and feasibility of tools than with testing or validating hypotheses, an important outcome is the publication of the code samples outlining the different processes used. Other results include wordcloud images, a map visualization, and top ten lists in the final article. The dataset of Tweet IDs is another result, published in DataVerse with a unique handle, but again requiring the Twitter API and 'rehydration' for future use and analysis.

The answers from this study were initially published as a draft online, available for a period of open peer review prior to publication in the openly accessible Code4Lib Journal. Additional conference presentations, blog posts, and the collected tweet IDs are publicly available.

Figure 2. Case 2: GeoCities Community

B. Case 2: GeoCities Community

This project analyzed GeoCities.com, a platform hosting user-created sites that existed between 1994 and 2009 [25]. The GeoCities dataset is interesting as it represents an early example of mainstream website creation by a diverse group of users, built on a unique thematic neighbourhood structure. Fig. 2 organizes the seven RO elements spatially, but only the workflow is represented in detail here to save space.

For organizational context, the study was undertaken as a sole-authored research contribution by a historian, again funded by SSHRC. No ethics approval was required, as all data was publicly available via the Wayback Machine.

The motivating question asked here is whether GeoCities users had a sense of community. A secondary question is what tools and approaches historians will need to study the web.

The study design included two iterations of analysis, each using different datasets. The first, beginning in 2013, used the Archive Team torrent and a variety of tools. These approaches were further explored in a second round of analysis in early 2016 using a larger end-of-life dataset from the Internet Archive. This second round of analysis also took advantage of an analytics platform developed by the team (Warcbase). Both phases of analysis took an exploratory approach, including analysis of links, text, and images, to understand what can be learned from methods of 'distant reading.' The study also included aspects of design research, as the Warcbase tool was further developed through it.
Additional data was provided by a elements and focuses on depicting the workflow. publicly-accessible torrent of GeoCities, created by the web The organizational context for the project involves an archivist group Archive Team. The two datasets together interdisciplinary and inter-institutional collaboration as part provided a foundation for the paper. of the Web Archives for Historical Research Group. Au- Two different sets of workflows were used in the different thors include the aforementioned librarian (Nick Ruest) and phases of analysis, in response to the differing types of data. historian (Ian Milligan), joined by a computer scientist Working with the Archive Team torrent (˜1GB) required (Jimmy Lin at the University of Waterloo). The project was additional steps of extraction and transformation of data funded by SSHRC and the National Science Foundation (for to facilitate analysis. The torrent essentially reproduced all Warcbase). No ethics approval was required. The project in- website directories to a local machine, from which URLs volved coordination with the University of Toronto Libraries were extracted to create a network graph visualization with for use of the Canadian Political Parties and Interest Groups smaller sample sets of links. Similarly, plain text was ex- (CPP) Archive-It collection (particularly Nich Worby, Gov- tracted from the files for text analysis and topic modelling. ernment Information and Statistics Librarian). The paper In the second round of analysis, these workflows were relates to previous work involving these datasets, including simplified since the data was structured in the standard the work described in Case 1, and previous work with the WARC web archive container format. The data (˜4 TB) had CPP collection. to be loaded into HDFS on a cluster at the University of The research question asked: What are the differences Maryland. 
Warcbase scripts, written in Scala and parsed in coverage between different social media archiving strate- via Apache Spark, performed extractive functions on the gies? The web archives compared represent two approaches WARCs: extracting hyperlink graphs, full text, statistical to content selection and curation: an expert-selected list of URL analysis, and popular images via MD5 hashes. sites versus a Twitter stream as a basis for URL selection. The two input datasets also have varying documentation of The study design set out to compare the content and how they were created. The Archive Team data was gathered coverage of the two collections. This included comparing with a wget command for geocities.com. Less information which domains dominated each collection, as well as an is available for the parameters of the Internet Archive crawl. intersection analysis of all URLs between collections. URLs The analyses created a number of results artifacts and from each collection were also checked for inclusion in the outputs: raw link structures, full text, popular images ordered Internet Archives broader crawls for the time period. by MD5 hash, counted URL statistics. Extracted results Three different sets of input data were used, and each were then put into analytic platforms: topic modeling with had varying levels of documentation describing how they MALLET, text analysis with bash and Mathematica scripts, were generated. Twitter data was collected by the authors image analysis and visualization with ImageMagick. Still, and did not require further explanation. However, as noted the output data itself doesnt directly answer the initial in Case 1, Twitter terms of service constrain data sharing. research question, as additional interpretation was required. 
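The project's extraction steps ran as Warcbase scripts in Scala over Spark; purely as an illustration of the hyperlink-graph idea, the following stdlib-only Python sketch builds domain-to-domain link edges from pages already unpacked out of an archive. The `pages` input of (url, html) pairs is our stand-in for real WARC records, which a reader such as warcio or Warcbase would supply; this is not the project's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in an HTML payload."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def domain_link_graph(pages):
    """Build domain-to-domain hyperlink edges from (url, html) pairs,
    e.g. records previously unpacked from a WARC file.

    Returns a set of (source_domain, target_domain) edges, skipping
    self-links so the graph shows connections *between* sites."""
    edges = set()
    for url, html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        src = urlparse(url).netloc
        for href in parser.links:
            # Resolve relative links against the page URL first.
            dst = urlparse(urljoin(url, href)).netloc
            if dst and dst != src:
                edges.add((src, dst))
    return edges
```

Edges like these are the raw material for the network graph visualizations described above.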
C. Case 3: Content Selection and Curation

This case focused on the process of creating web archives, and on the choices of what to select for inclusion in the archive. Two web archives documenting the 2015 Canadian federal election are compared [17]. Fig. 3 represents the seven RO elements and focuses on depicting the workflow.

Figure 3. Case 3: Content Selection and Curation – The Gatekeepers vs. the Masses

The organizational context for the project involves an interdisciplinary and inter-institutional collaboration as part of the Web Archives for Historical Research Group. Authors include the aforementioned librarian (Nick Ruest) and historian (Ian Milligan), joined by a computer scientist (Jimmy Lin at the University of Waterloo). The project was funded by SSHRC and the National Science Foundation (for Warcbase). No ethics approval was required. The project involved coordination with the University of Toronto Libraries for use of the Canadian Political Parties and Interest Groups (CPP) Archive-It collection (particularly Nich Worby, Government Information and Statistics Librarian). The paper relates to previous work involving these datasets, including the work described in Case 1 and previous work with the CPP collection.

The research question asked: what are the differences in coverage between different social media archiving strategies? The web archives compared represent two approaches to content selection and curation: an expert-selected list of sites versus a Twitter stream as a basis for URL selection.

The study design set out to compare the content and coverage of the two collections. This included comparing which domains dominated each collection, as well as an intersection analysis of all URLs between collections. URLs from each collection were also checked for inclusion in the Internet Archive's broader crawls for the time period.

Three different sets of input data were used, and each had varying levels of documentation describing how they were generated. Twitter data was collected by the authors and did not require further explanation; however, as noted in Case 1, Twitter's terms of service constrain data sharing. The CPP collection was opaque, as its set of 50 seed URLs was established in 2005 with little documentation by a subsequently retired librarian. The seed list is periodically updated by the current librarian. The Internet Archive collection approach is similarly opaque, with little documentation, although previous research examined its coverage [26].

The methods used included preparation of the Twitter data by unshortening and extracting all URLs, then deduplicating for a set of unique URLs. These Twitter-curated URLs were also crawled to create a separate web archival collection for future use. Quantitative analysis of the Twitter and CPP URLs identified the top domains represented in each. Then an intersection analysis was completed with a bash script, resulting in the percentage of Twitter URLs found in the CPP collection and vice versa. The intersection analysis of each collection with the Internet Archive involved querying the unique URLs of each collection against the Wayback CDX Server API and similarly determining the percentage of coverage. Further, for the comparison of Twitter with the Wayback Machine, the top ten domains included in and excluded from the Wayback Machine's collection were determined.
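The intersection analysis itself is a simple set computation. The authors used a bash script; a hedged Python equivalent of the coverage calculation might look like this (the Wayback lookups would come from querying each URL against the CDX Server API, which is not reproduced here):

```python
def intersection_coverage(a_urls, b_urls):
    """Percentage of unique URLs in collection A that also appear in
    collection B. Inputs may contain duplicates; they are deduplicated
    first, mirroring the unique-URL preparation step."""
    a, b = set(a_urls), set(b_urls)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)
```

Computed in both directions, this yields the asymmetric coverage figures reported in the results (e.g. Twitter-in-CPP versus CPP-in-Twitter).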

The results of the intersection analysis provide a quantitative indication of the coverage between collections. The low level of intersection of Twitter URLs found in the CPP collection (0.269%) and of CPP URLs found in the Twitter collection (0.341%) led to a recommended hybrid selection and curatorial approach. Comparing the Twitter collection to the Internet Archive found 10.06% coverage, compared to 74.30% coverage of CPP URLs in the Wayback Machine.

The answers were published as a conference paper and presentation. Preliminary work appeared in a blog post, which served as an impetus for further research [27].

IV. DISCUSSION

Towards a RO Profile for Web Archives Research

One outcome of studying these cases is the development of a preliminary profile for a Web Archives Research Object. This framework takes the seven elements of the RO (questions, context, study design, methods, data, results, answers) and expands on the relevant or influential factors to identify for each. These are listed in Table I. Some aspects required adaptation to domains outside the natural sciences.

Organizational context was found to be of primary importance, since it impacts the overall approach. Disciplinary perspectives are especially relevant to note, since research with web archives can involve researchers from fields such as history, library and information science, computer science, media studies, and communication. Work taking place in a collaborative team may involve a variety of roles, as well as other forms of support or partnerships. The processes required to ensure ethical compliance may also vary depending on disciplinary perspectives.

A notable finding is the importance of other legal frameworks, agreements, and contracts that may influence the degree to which the sources, intermediate artefacts, and results can be made available. For these cases this included the Twitter Terms of Service and Developer Agreement, as well as the Research Agreement with the Internet Archive. It is also important to understand the decisions made in selecting source data, and whether other options were considered but found not useful or not available. Study design should also address the reasoning behind choices in research methods.

Generally, greater documentation is needed on methods, data, and results, as well as the relationships between them. If the study included collecting data, the use of specific services like APIs should note the versions and timeframes. If the study uses data generated previously, questions arise of where and when it was collected, and by whom. As Borgman notes, the greater distance between the scholar and the origins of the data means that interpretation relies on 'formal knowledge representations' [28]. This could include metadata or other descriptions of how the data was generated.

Separate workflows are often required for preparing data prior to analysis. This includes integrating datasets from different sources, as well as other forms of selecting and filtering. The analysis workflows explored here separate along the lines of format: plain text, images, and platform-specific data structures (e.g. hashtags, in-links or out-links). The results of analysis take many forms, like derived datasets or other artifacts such as visualizations. A notable tension between the original development of ROs for the sciences and an application to humanities research is how or where to acknowledge the critical step of interpretation of the artifacts generated by computational processes. We have addressed these as part of the Answers, i.e. in the text of publications, papers, reports, or other presentations of the research, which can also include alternate forms of knowledge mobilization.

The approach also supports a much-needed conversation about reuse of computational workflows and methods.
Table I
AN INITIAL PROFILE FOR WEB ARCHIVES RESEARCH OBJECTS

RO Element: Concerns for Web Archives RO profile

Organizational context: Note which disciplinary perspectives are represented, especially in interdisciplinary collaborations; these may be reflected in Funding Agency support. Project team and investigators should be identified and other roles acknowledged, including institutional or individual partnerships. In addition to ethical and governance approvals, also consider which other legal Agreements and Contracts impact data use and sharing, and any relevant jurisdictions that determine what web data is public.

Questions: Include motivations for the study such as the overall research contribution. For interdisciplinary work it is also useful to consider if questions are framed towards certain audiences: is the work relevant for researchers or practitioners in a certain field?

Study design: Since designs can vary widely by discipline, it may be useful to consider conceptual perspectives reflected in research questions (e.g. positivist or interpretivist approaches). Other relevant decisions include how data sources were selected, which types of questions were considered, how the scope of the study was determined, and any limitations.

Methods: Identify specific scripts, services (like APIs), and software packages used. Was code deposited, and if so, in what repository?

Data: Include information about sources (single or multiple) and how data was collected, whether directly by the research team or by others. What was the timeframe of collection, and was it contiguous? Which formats are available, and are they interoperable?

Results: Consider if results are Findable, Accessible, Interoperable, Reusable (FAIR). How are they published and made citable? How were they validated?

Answers: In addition to identifying formal and informal publications, consider the methods used for interpretation of results, which is an essential part of humanities scholarship.
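One way to operationalize the profile in Table I is to record its elements as machine-readable metadata kept alongside a project. The sketch below is a hypothetical illustration; the field names and example values are ours, not part of any RO specification, and a real profile would carry far more detail.

```python
import json

# Hypothetical machine-readable record of the Table I profile elements.
ro_profile = {
    "organizational_context": {
        "disciplines": ["history", "information science"],
        "team": ["investigator", "research assistant"],
        "agreements": ["Twitter Developer Agreement"],
    },
    "questions": "How was the #elxn42 election discussed on Twitter?",
    "study_design": {"sources_considered": ["Twitter API", "Internet Archive"]},
    "methods": {"scripts": ["twarc v0.7.0"], "code_repository": "https://github.com/edsu/twarc"},
    "data": {"timeframe": "2015", "formats": ["JSON", "txt"]},
    "results": {"fair": {"findable": True, "accessible": True}},
    "answers": ["conference paper", "blog post"],
}

# Serializing the record makes it citable and depositable with the dataset.
profile_json = json.dumps(ro_profile, indent=2)
```

A record like this could be deposited alongside code and derived datasets, making the profile itself findable and reusable.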

Consider Table II, which examines the elements of Case 1 and the degree to which they are findable, accessible, interoperable, and reusable. The analysis highlights several aspects of this project that are problematic in terms of its reusability. While Case 1 is extensively documented in the published article, there remain elements of the study that were not documented, such as the runtime environment, logs, and code repositories. Determining the level of detail necessary to document about workflows also requires consideration of what types of reuse are intended, and possible. While reusability is much more complex than this reductive list, and the structured checklist is far from the advanced state of the art in reproducible computational science, it is efficient and quick to complete as a first step to identify opportunities for improvement. The researcher found it very helpful in understanding the nature of his own research.

How the RO Profile can support the Research Process

This framework can support the research process by helping scholars in emerging disciplines such as the digital humanities to document their practices systematically. Even within a single researcher's set of cases, there is a broad variety of sources that shift over time and tools that rapidly evolve. The consistent structure of describing these components helps in keeping track of these shifts and making sense of how the findings are constructed. At the same time, however, this structure provides only a preliminary understanding on a high level, and needs to evolve through further iterations and extended studies. It must be complemented by a better understanding of the relationships between the components.

Applying ROs to the research process allows several significant benefits for the emerging field of web archiving. First, we can more efficiently develop and compare research projects. Instead of scholars individually developing research practices, a set of common practices, workflows, and approaches can be developed and reused. The RO framework provides a view of the interrelated parts, and can encourage more systematic processes of publication and sharing of code, saving intermediate datasets, and preserving important organizational documents such as research agreements and Terms of Service. Second, in a new and developing field, the increased transparency of methods allows for both new entrants to the field as well as a shared pooling of technical knowledge in a largely non-technical community of scholars. Such transparency facilitates the sharing of datasets and workflows, allows the citation of technical elements, and supports the recognition of non-traditional venues such as code repositories, blog posts, and lightning talks.

In applying these concepts, we are keenly aware that the computational sciences are more oriented towards replication than the exploratory and interpretive work in the humanities. Still, the approach to reuse of specific tools or workflows is relevant to working with web archives, and we believe that using common frameworks can help foster collaboration between computer science and the social sciences and humanities.

Beyond reuse and sharing that directly supports the research process, the framework can also be used to support peer reviewing, as a checklist or scaffolding to begin assessing aspects of this emerging field. The framework could also be used to explore cases across disciplines, similar to Borgman's work [28] that facilitated a better understanding of the role research data play in digital scholarship across different disciplines. These discussions are needed to begin sketching out a methodology for web research, separating the often-conflated aspects of tools, methods, and methodologies, as well as identifying the diverse (and sometimes incompatible) guiding principles, types of research, and approaches.
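The preparation steps itemized in Table II, validating JSON lines, deduplicating by Tweet ID, and splitting the collection by day for text analysis, can be sketched in a few lines. This is an illustrative reconstruction, not the twarc v0.7.0 utilities themselves; the Twitter JSON field names assumed here (`id_str`, `created_at`, `text`) were standard for that era's API.

```python
import json
from collections import defaultdict

def deduplicate(lines):
    """Keep only the first tweet seen for each unique Tweet ID."""
    seen, unique = set(), []
    for line in lines:
        tweet = json.loads(line)  # also validates that the line is well-formed JSON
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            unique.append(tweet)
    return unique

def split_by_day(tweets):
    """Group tweet text into per-day buckets keyed by 'Mon DD' from created_at."""
    days = defaultdict(list)
    for tweet in tweets:
        # created_at looks like 'Mon Sep 21 12:34:56 +0000 2015'
        parts = tweet["created_at"].split()
        days[f"{parts[1]} {parts[2]}"].append(tweet["text"])
    return days
```

Each per-day bucket would then be written to its own text file, the input the researchers used for generating daily word clouds.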
Table II
HOW FINDABLE, ACCESSIBLE, INTEROPERABLE, REUSABLE ARE THE COMPUTATIONAL METHODS IN CASE 1?

Workflow: Collect Tweets

Input: #elxn42 hashtag (FAIR: Y Y N Y). This hashtag was selected as the main object of research.

Script: twarc.py (FAIR: Y Y Y Y). This open source Python module was developed on GitHub. The version used is v0.7.0, at https://github.com/edsu/twarc/releases/tag/v0.7.0.

Parameters: twarc configuration settings (FAIR: N Y Y N). Search parameters are set with a series of key-value pairs. The exact settings used for collection are not documented or publicly available. If saved, they would be interoperable for other collections using the twarc library.

Service: Twitter Search API (FAIR: Y N Y N). The Twitter API allows users to access datasets from Twitter users, with specific limits: Twitter policy does not allow more than 1% of the publicly accessible datastream. The Search mode runs a search for all tweets with a given hashtag, going back roughly 5-7 days. LAC ran it 1-2 times/week. The version used is no longer operational.

Result: Tweet datasets - JSON (FAIR: N N Y N*). The tweets collected directly by the researchers were saved as a JSON dataset. Twitter policy does not allow sharing this data or making it public in this form. *Reuse is possible by the researcher who collected the data.

Result: Tweet datasets - IDs .txt (FAIR: Y Y Y Y). The tweets collected by Library and Archives Canada were made available as a dataset of only the Tweet IDs in text format. They are hosted on ScholarsPortal DataVerse, with documentation, and can be used to 'rehydrate' a Twitter dataset. [https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=hdl:10864/11310]

Workflow: Combine Tweets

Script: twarc.py --hydrate (FAIR: Y Y Y Y). This Python module command takes a list of tweet IDs and 'rehydrates' them by returning the full JSON of each tweet. The version used was twarc.py v0.7.0. This depends upon the Service: Twitter API.

Command: cat (FAIR: n/a n/a Y Y). The cat command was used to concatenate (combine) the tweets gathered by the LAC and York/Waterloo teams.

Script: validate.py (FAIR: Y Y Y Y). This Python command ensures that each line of the .json files is a valid JSON object. (Part of the twarc.py v0.7.0 library.)

Script: deduplicate.py (FAIR: Y Y Y Y). This Python command deduplicated the tweets, so that only unique Tweet IDs remained. (Part of the twarc.py v0.7.0 library.)

Script: unshorten.py (FAIR: Y Y Y Y). This Python command found all shortened URLs and expanded them for the dataset. (Part of the twarc.py v0.7.0 library.)

Result: JSON files (FAIR: N N N N*). A series of JSON files were created at various stages of this process. Due to the Twitter Terms of Service, these files cannot be shared as they contain the full JSON. *Reuse is possible by the researcher who collected the data.

Result: Tweet datasets - IDs .txt (FAIR: Y Y Y Y). The final set of combined tweets was made available as a dataset of only the Tweet IDs in text format. They are hosted on ScholarsPortal DataVerse, with documentation, and can be used to 'rehydrate' a Twitter dataset. [https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=hdl:10864/11311]

Workflow: Analyze Text

Script: jq (FAIR: Y Y Y Y). The command-line JSON processor 'jq' was used to extract the plain text of every tweet. jq is open source and available on GitHub.

Script: custom Python script (FAIR: Y Y Y Y). The researchers wrote a Python script to sort each tweet by day, creating separate text files for each day. The script was provided in the text of the open-access Code4Lib article.

Script: wordcloud.py (FAIR: Y Y Y Y). Using wordcloud.py, we created a word cloud of tweet text for each day. This allowed us to track major shifts in tweet content. (Part of the twarc.py v0.7.0 library.)

V. CONCLUSION

Computational methods are emerging for exploration and analysis of web archives sources, but these are not clearly or comprehensively reported in publications. This makes it difficult to compare, assess, and validate results. Historians have rarely made their methods clear, relying instead on the tacit knowledge of traditional approaches. The increased use of computational workflows highlights the need to enable such scholars to document their practices systematically to improve the transparency of their methods.

The past decade has seen great advances on such crucial components as reproducibility in computational science, the study of computational research methods, techniques and models that facilitate the reuse of computational workflows, and mechanisms to facilitate the sharing of data and workflows. By uniting these perspectives, this article aimed to contribute to a shared understanding of emerging methods through applying the Research Object concept in three cases of computational web archival research. The initial results provide a stepping stone to a common framework for characterizing such research. Bundling constituent parts together enabled researchers to make sense of decisions, approaches, and findings.

Our current work builds on this in several ways:

1) Expand the set of cases to describe a wider variety of research scenarios and derive an effective template to support researchers in documenting their methods.

2) Expand the RO model to include more information on the origin of the data itself. Rather than taking data as an input for granted, with web archives the RO needs to include details on the scope of the web crawl, the technical decisions made, websites excluded, and beyond.

3) Explore, through the lens of Web Research Objects, how the design of cyber-infrastructures for archive-based computational research, such as web crawlers and social media archives, can support more transparent research methods and outcomes.

ACKNOWLEDGMENT

Part of this work was supported by the Natural Sciences and Engineering Research Council (NSERC) through RGPIN-2016-06640, and the Social Sciences and Humanities Research Council (SSHRC) through Insight Grant 435-2015-0011 and Canada Graduate Scholarship 767-2015-2217. Ian Milligan was also supported by the Marshall McLuhan Centenary Fellowship in Digital Sustainability at the University of Toronto iSchool Digital Curation Institute.

REFERENCES
[1] I. Milligan, "Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives," Intl J Humanities & Arts Computing, vol. 10, no. 1, pp. 78-94, Mar. 2016. [Online]. Available: http://www.euppublishing.com/doi/abs/10.3366/ijhac.2016.0161

[2] F. Moretti, Graphs, Maps, Trees: Abstract Models for Literary History. Verso, Sep. 2007.

[3] M. Dougherty and E. T. Meyer, "Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs," JASIST, vol. 65, no. 11, pp. 2195-2209, Nov. 2014. [Online]. Available: http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract

[4] C. L. Borgman, "The Digital Future is Now: A Call to Action for the Humanities," Digital Humanities Quarterly, vol. 3, no. 4, 2009. [Online]. Available: http://www.digitalhumanities.org/dhq/vol/3/4/000077/000077.html

[5] S. Bechhofer, I. Buchan, D. De Roure, P. Missier, J. Ainsworth, J. Bhagat, P. Couch, D. Cruickshank, M. Delderfield, I. Dunlop, M. Gamble, D. Michaelides, S. Owen, D. Newman, S. Sufi, and C. Goble, "Why linked data is not enough for scientists," Future Generation Computer Systems, vol. 29, no. 2, pp. 599-611, Feb. 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X11001439

[6] J. Masanès, Web Archiving. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. [Online]. Available: http://link.springer.com/10.1007/978-3-540-46332-0

[7] N. Brügger, Archiving Websites: General Considerations and Strategies. Aarhus: Center for Internetforskning, 2005.

[8] S. M. Schneider and K. A. Foot, "The Web as an Object of Study," New Media & Society, vol. 6, no. 1, pp. 114-122, 2004.

[9] N. Brügger and R. Schroeder, Eds., The Web as History. London: UCL Press, 2017. [Online]. Available: https://www.ucl.ac.uk/ucl-press/browse-books/the-web-as-history

[10] N. Brügger, D. Laursen, and J. Nielsen, "Studying a nation's web domain over time: analytical and methodological considerations," Palo Alto, California, Apr. 2015. [Online]. Available: http://netpreserve.org/sites/default/.../2015 IIPC-GA Slides 02 Brugger.pptx

[11] A. Ben-David, "What does the Web remember of its deleted past? An archival reconstruction of the former Yugoslav top-level domain," New Media & Society, p. 1461444816643790, Apr. 2016. [Online]. Available: http://nms.sagepub.com/content/early/2016/04/27/1461444816643790

[12] R. Rogers, Digital Methods. Cambridge, Mass: MIT Press, 2013.

[13] E. Weltevrede and A. Helmond, "Where do bloggers blog? Platform transitions within the historical Dutch blogosphere," First Monday, vol. 17, no. 2, Feb. 2012. [Online]. Available: http://firstmonday.org/ojs/index.php/fm/article/view/3775

[14] M. S. Weber, "From Big Data to Big Theory: Lessons Learned from Archival Internet Research," Atlanta, Georgia, 2016. [Online]. Available: http://www.slideshare.net/mwe400/from-big-data-to-big-theory-lessons-learned-from-archival-internet-research

[15] J. Bailey, "Disrespect des Fonds: Rethinking Arrangement and Description in Born-Digital Archives," Archive Journal, no. 3, 2013. [Online]. Available: http://www.archivejournal.net/issue/3/archives-remixed/disrespect-des-fonds-rethinking-arrangement-and-description-in-born-digital-archives/

[16] R. Rosenzweig, "Scarcity or Abundance? Preserving the Past in a Digital Era," The American Historical Review, vol. 108, no. 3, pp. 735-762, Jun. 2003. [Online]. Available: http://www.jstor.org/stable/10.1086/529596

[17] I. Milligan, N. Ruest, and J. Lin, "Content Selection and Curation for Web Archiving: The Gatekeepers vs. the Masses," in Proceedings of the Joint Conference on Digital Libraries. Newark, New Jersey: ACM, 2016.

[18] V. L. Lemieux, Ed., Building Trust in Information, ser. Springer Proceedings in Business and Economics. Cham: Springer International Publishing, 2016. [Online]. Available: http://link.springer.com/10.1007/978-3-319-40226-0

[19] R. Deswarte, "Revealing British Euroscepticism in the UK Web Domain and Archive Case Study," Jul. 2015. [Online]. Available: http://sas-space.sas.ac.uk/6103/#undefined

[20] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K. Hettne, R. Palma, E. Mina, O. Corcho, J. M. Gómez-Pérez, S. Bechhofer, G. Klyne, and C. Goble, "Using a suite of ontologies for preserving workflow-centric research objects," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 32, pp. 16-42, May 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1570826815000049

[21] G. K. Sandve, A. Nekrutenko, J. Taylor, and E. Hovig, "Ten Simple Rules for Reproducible Computational Research," PLOS Comput Biol, vol. 9, no. 10, p. e1003285, Oct. 2013. [Online]. Available: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

[22] A. Goodman, A. Pepe, A. W. Blocker, C. L. Borgman, K. Cranmer, M. Crosas, R. D. Stefano, Y. Gil, P. Groth, M. Hedstrom, D. W. Hogg, V. Kashyap, A. Mahabal, A. Siemiginowska, and A. Slavkovic, "Ten Simple Rules for the Care and Feeding of Scientific Data," PLOS Comput Biol, vol. 10, no. 4, p. e1003542, Apr. 2014. [Online]. Available: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542

[23] N. Ruest and I. Milligan, "An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter," The Code4Lib Journal, no. 32, Apr. 2016. [Online]. Available: http://journal.code4lib.org/articles/11358

[24] E. Summers, "edsu/twarc," 2016. [Online]. Available: https://github.com/edsu/twarc

[25] I. Milligan, "Welcome to the Web: The Online Community of GeoCities and the Early Years of the World Wide Web," in The Web as History, N. Brügger and R. Schroeder, Eds. London: UCL Press, 2017.

[26] S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson, "How Much of the Web is Archived?" in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, ser. JCDL '11. New York, NY, USA: ACM, 2011, pp. 133-136. [Online]. Available: http://doi.acm.org/10.1145/1998076.1998100

[27] I. Milligan, "Institutional vs. Twitter Seed Lists for Web Archives," Nov. 2015. [Online]. Available: https://ianmilligan.ca/2015/11/26/institutional-vs-twitter-seed-lists-for-web-archives/

[28] C. L. Borgman, Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, Mass.: MIT Press, 2015.