If These Crawls Could Talk: Studying and Documenting Web Archives Provenance
Emily Maemura, PhD Candidate
Faculty of Information, University of Toronto
NetLab Forum, February 27, 2018

The Team
Nich Worby: Librarian, Web Archivist
Christoph Becker: Assistant Professor; Director, Digital Curation Institute
Ian Milligan: Associate Professor, Digital Historian
Emily Maemura: Doctoral Candidate
brought together through the Digital Curation Institute McLuhan Centenary Fellowship in Digital Sustainability 2016-17 and supported by SSHRC

Number of Pages per Domain, Canadian Political Parties Collection, 2005-2015
http://lintool.github.io/warcbase/vis/crawl-sites/
Pages from “policyalternatives.ca”
Sept. 2007 - Nov. 2009: over 200,000 pages per crawl
Dec. 2009 - May 2015: less than 50,000 pages per crawl

How are web archives made and used?
How can we document or communicate this?

Creating web archives
As a web archivist... What do I need to document for researchers using this data to have confidence in the analysis and findings?
Using web archives
As a researcher… What do I need to ask about web archives material?

Researcher Perspective: “Web Archives Research Objects”
Maemura, Milligan, Becker, 2016. Understanding computational web archives research methods using research objects, Computational Archival Science Workshop at IEEE Big Data 2016 http://hdl.handle.net/1807/74866
How are web archives used by researchers?
Can shared conceptual frameworks of research methods used with Web Archives collections help to systematize practices, advance the field, and make it easier to introduce new researchers to the area?
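One concrete instance of computational use is the pages-per-domain tally behind the Canadian Political Parties figure above. The following is a minimal sketch only, assuming the archive has already been reduced to (crawl label, URL) pairs; the `records` list, the crawl labels, and the `pages_per_domain` helper are hypothetical illustrations and not the warcbase implementation.

```python
from collections import Counter
from urllib.parse import urlsplit


def pages_per_domain(records):
    """Count captured pages for each (crawl label, domain) pair."""
    counts = Counter()
    for crawl_label, url in records:
        # hostname is lowercased by urlsplit; fall back to "" if missing
        host = urlsplit(url).hostname or ""
        # Treat "www.example.org" and "example.org" as the same domain
        if host.startswith("www."):
            host = host[4:]
        counts[(crawl_label, host)] += 1
    return counts


# Hypothetical sample records (crawl label, captured URL)
records = [
    ("2007-09", "http://www.policyalternatives.ca/reports/a"),
    ("2007-09", "http://policyalternatives.ca/reports/b"),
    ("2007-09", "http://www.example.org/"),
    ("2009-12", "http://www.policyalternatives.ca/"),
]

counts = pages_per_domain(records)
for (crawl_label, domain), n in sorted(counts.items()):
    print(crawl_label, domain, n)
```

A sharp drop in one domain's count between crawls, like the one shown for policyalternatives.ca, is the kind of signal that prompts questions about how the crawl was scoped.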
“Research Objects” in Computational Science
Bechhofer et al., 2013, Why Linked Data is Not Enough for Scientists, Future Generation Computer Systems 29(2), February 2013, Pages 599-611
www.researchobject.org

Our Approach
Adopting the Research Object (RO) Framework for Research with Web Archives
● to structure the complex aggregation of computational processes, services, forms of data, contexts, and approaches
● balancing systematic approaches from computational sciences with humanistic issues of provenance and trust
Developed here to characterize three cases of web archives research studies, examples completed by co-author Ian Milligan

“Research Objects” as a Conceptual Framework
ORGANIZATIONAL CONTEXT: Ethical and governance approvals, investigators, etc. Acknowledgements
QUESTIONS: state a problem and/or hypothesis
STUDY DESIGN: scope, rationale for choices in tools, sources, methodology
ANSWERS: publications and presentations
DATA: materials studied, those taken as inputs to processes
METHODS: the tools, workflows, scripts, processes, settings, configurations, used to perform the analysis
RESULTS: materials produced, derived datasets, visualizations, etc.

Case: GeoCities Community
Milligan, I. (2017). Welcome to the Web: The Online Community of GeoCities and the Early Years of the World Wide Web. In R. Schroeder & N. Brügger (Eds.), The Web as History.
ORGANIZATIONAL CONTEXT
● sole-authored research by a historian
● no ethics approval since data is publicly available via Wayback Machine
● research agreement signed with Internet Archive limits sharing and publication of specific derivative datasets

Case: GeoCities Community
Research Question: Did GeoCities users have a sense of community?
● also understanding what tools and approaches historians need to study the web
● iterative approach, exploratory investigation of archived GeoCities.com data
● development of Warcbase analytics platform

Case: GeoCities Community
DATA, METHODS, RESULTS
Analysis Phase 1 (2013):
  Data: data from Archive Team torrent (wget geocities.com)
  Methods: data prep with bash scripts, html2text.py
  Results: analysis with Mathematica scripts, bash (on sample)
Analysis Phase 2 (2016):
  Data: data from Internet Archive end-of-life crawl
  Methods: data prep and selection done within warcbase
  Results: analysis of raw link structures; popular images ordered by MD5 hash
Data sizes listed on the slide: ~1TB, ~4TB

An Initial Profile for Web Archives Research Objects
ORGANIZATIONAL CONTEXT
● Disciplinary perspectives of researchers, roles in large teams and partnerships
● Legal agreements and contracts that impact data sharing and use
QUESTIONS
● Motivations for the study and research contributions - for whom is the work relevant?
STUDY DESIGN
● Designs vary widely by discipline and conceptual perspectives
● Include rationale for selecting sources, methods, scope
ANSWERS
● Interpretation of results is an important part of humanities scholarship, described in publication of findings (not simply the results of workflows)
DATA, METHODS, & RESULTS
● How was data sourced? Collection timeframe(s), formats and interoperability
● Which scripts, services, software packages used - was code published?
● Are results FAIR, published, citable? How were they validated?

Template and Workshop at RESAW 2017, London
ORGANIZATIONAL CONTEXT
  Discipline(s) of Research Team
  Funding
  Approvals
  Partnerships
  Agreements & Contracts
  Acknowledgements
QUESTIONS
  ?: Research questions and motivations for the study
STUDY DESIGN
  Scope: limits of time period, geographic area, etc.
  Rationale: for choices of tools, sources, methods
ANSWERS
  Publications and Presentations of findings
DATA, METHODS & RESULTS
  Data Sources: metadata, method of collection
  Data Preparation: workflows and derived datasets
  Data Analysis: methods, config settings, logs
  Results Generated: published figures, data, code
Where do these come from, how are they made?

Web Archivist Perspective: “Elements of Provenance”
Maemura, Worby, Milligan, Becker, 2017. Origin Stories: documentation for web archives provenance, Web Archiving Week / IIPC Web Archiving Conference
How are web archives made?
Which individual decisions are made in web archives practices, and how can we understand and communicate the impacts on resulting collections?

Our Approach
Starting with Web Archiving Life Cycle Model (Bragg et al. 2013) as a framework:
● What decisions are at each life cycle phase? (e.g.
Appraisal and selection; Scoping; Data Capture; Storage and Organization; Quality Assurance and Analysis)
● Studying the process of using Archive-It to create three web archives collections by University of Toronto Libraries (UTL)
http://ait.blog.archive.org/files/2014/04/archiveit_life_cycle_model.pdf

Overview of the Collections Studied
Collection Timeframe
  Canadian Political Parties: October 2005 to present (ongoing)
  Pan Am Games: February 2015 to December 2016
  Global Summitry: June 2016 to present (ongoing)
Crawl frequency
  Canadian Political Parties: Quarterly (every 3 months)
  Pan Am Games: Combination of Daily, Weekly, Monthly and One-time crawls
  Global Summitry: TBD, based on timing of summit events
Crawl duration
  Canadian Political Parties: 3 days
  Pan Am Games: Varies widely by crawl (from hours to days)
  Global Summitry: 5 days (currently test crawls only)
# of Active Seeds
  Canadian Political Parties: 62
  Pan Am Games: 434
  Global Summitry: 167
Total data archived*
  Canadian Political Parties: >900 GB, >29,000,000 documents
  Pan Am Games: >100 GB, >3,500,000 documents
  Global Summitry: >400 GB, >5,000,000 documents
Crawl limits and rules specified
  All three collections: Ignore robots.txt
  Block twitter.com URLs for “lang=?”

General Workflow for UTL crawls
Importance of iterations, different types of crawls: test crawls, production crawls, patch crawls
Managing organizational data budget across multiple collections
● test crawls may reveal size of material on web is bigger than expected, requires re-working scope or renegotiating data budget allocated in proposal
● a collection set to run 4 crawls a year may have many more due to patch crawls

Three Key Findings
1) Scoping: decisions are made throughout
2) Process: Unforeseen issues arise during a crawl - the actions taken to resolve these issues need to be documented
3) Context: individual decisions interact and are influenced by changes in organizational context and wider environment, impacting the collection over time

Interdependencies and interactions between factors
Is a site with robots.txt exclusions captured?
● Does the technical system allow? ☑ Yes, option available in Archive-It, since 2010
● Does the legal environment permit? ☑ Yes, after Copyright Law Amendment, 2013
● Does organizational policy guide action? ☑ Yes, new law interpreted in Permissions Policy, 2014
Individual curatorial choice to exclude sites with robots.txt for a particular collection or crawl

Elements of Scoping, Process, Context to Document
Element: Key Questions and Information to Document
Motivation: What is the purpose of the collection? Has its mandate changed over time?
Focus: Which geographic, temporal, technical, political, topical and/or social boundaries are defined to scope the collection?
Access & Discovery: Who is the intended audience? Do they have known characteristics or needs? Which contractual, organizational, legal, or other agreements restrict access? What metadata fields and indexes support discovery? At what degree of granularity (by collection, site, or individual resource)? Which data formats or derivative datasets are available?
Seed list: What seeds