Getting started with

Karl-Rainer Blumenthal, [email protected]

Digital POWRR Chicagoland Institute, November 30, 2017, Naperville, IL

What is web archiving?

Web archiving is the process of collecting, preserving, and enabling access to web-native materials.

Why archive the web?

> Collect web-based materials in your normal collecting scope

> Fulfill a records retention requirement

> Document spontaneous/online events

> Combat link rot and content drift (no more 404s!)

How does it work?

> Web crawlers download source code from “live” websites into archival storage.

> Replay tools render the archived websites as they appeared when they were crawled.

Web archiving tools and services
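Before turning to specific tools, the crawl-and-replay model described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real crawler or replay tool is implemented: the `ToyWebArchive` class, its method names, and the example.org pages are all hypothetical, and the "crawl" step is handed the page source directly instead of downloading it.

```python
from bisect import bisect_right
from datetime import datetime


class ToyWebArchive:
    """In-memory stand-in for archival storage (hypothetical, for illustration)."""

    def __init__(self):
        # url -> list of (capture_time, body), kept sorted by capture_time
        self._captures = {}

    def crawl(self, url, body, captured_at):
        """Store one capture of `url`, as a crawler would after downloading it."""
        captures = self._captures.setdefault(url, [])
        captures.append((captured_at, body))
        captures.sort(key=lambda capture: capture[0])

    def replay(self, url, as_of):
        """Return the most recent capture made at or before `as_of`, or None."""
        captures = self._captures.get(url, [])
        times = [captured_at for captured_at, _ in captures]
        index = bisect_right(times, as_of)
        if index == 0:
            return None
        return captures[index - 1][1]


archive = ToyWebArchive()
archive.crawl("http://example.org/", "<h1>2016 homepage</h1>", datetime(2016, 5, 1))
archive.crawl("http://example.org/", "<h1>2017 homepage</h1>", datetime(2017, 5, 1))

# The live site may have changed or vanished; the archive still answers.
print(archive.replay("http://example.org/", datetime(2016, 12, 31)))
```

The point of the sketch is the separation of concerns: capture writes timestamped copies, and replay picks the copy that matches the moment a user asks about.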

The Wayback Machine https://archive.org/web/

The largest publicly available web archive in existence.

> 500+ billion URLs

> 100+ million websites

> 40+ languages

> ~1 billion URLs added per week
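As a rough illustration of how replay is addressed in practice, the sketch below builds a Wayback Machine address for a page at a given moment. It assumes the public URL pattern `https://web.archive.org/web/<14-digit timestamp>/<original URL>`, which is not stated in these slides; the helper name `wayback_url` and the example.org address are hypothetical. The Wayback Machine generally resolves such an address to the capture closest to the requested time, so the timestamp does not need to match a capture exactly.

```python
from datetime import datetime

# Assumed public replay prefix; not taken from the slides.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build a Wayback Machine address for `original_url` near the time `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


print(wayback_url("http://example.org/", datetime(2017, 11, 30)))
# https://web.archive.org/web/20171130000000/http://example.org/
```

The function only formats a string; it makes no network request and does not check whether a capture exists.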

Brozzler, Heritrix, ARC, HTTrack, WARC, warcprox, wget

Archive-It, Wayback Machine, NetarchiveSuite (DK/FR), OpenWayback, PANDAS (AUS), pywb (Python Wayback), Web Curator (UK/NZ), oldweb.today, Webrecorder

Who archives the web?

Chart: Organizations with web archiving programs by type (NDSA, Web Archiving in the United States: A 2016 Survey)

Chart: Collecting focus: own vs. 3rd-party content (vs. both!) (NDSA, Web Archiving in the United States: A 2016 Survey)

Chart: Use of external service vs. in-house tools (vs. both!) (NDSA, Web Archiving in the United States: A 2016 Survey)

Chart: Staff dedicated to web archiving program (NDSA, Web Archiving in the United States: A 2016 Survey)

Web archiving issues and trends

> Arrangement and description

> Big data analysis

> Appraisal, provenance, and metadata

> Spontaneous events and social media

> Collaborative collecting

Samples and Examples!

What do Archive-It web archiving partners collect?

Collection Development Decisions and Policies

What is the goal of this web archiving program?

Who are its primary/secondary users?

What belongs in this collection? And what doesn’t?

Who will make appraisal and selection decisions?

What technology will be employed?

The Library of Virginia's mission is to preserve the legacy of Virginia's culture and history and provide access to the most comprehensive information resources for and about Virginia. Specifically, the Library collects the archival records of the executive, legislative, and judicial branches of Virginia state government and Virginia-related materials that are of a private nature and that may assist researchers in discovering more about the history and lives of Virginia citizens. The Library's Web collection is designed to capture Web sites that mirror records that are already represented in the Library's State and Private Papers collections. For more information, please click on the policies listed below...

Library of Virginia

The Archive of the Washington and Lee University School of Law Website is a project of the Lewis F. Powell, Jr. Archives. It was established in 2011 to collect and preserve the contents of the ever-changing law school web presence.

Researchers and law school administrators can use this collection to trace changes in publicly accessible information and to retrieve information that is no longer currently available on the web.

Washington & Lee University School of Law

This collection of blogs, social media sites, video, and organizational websites documents the international art exhibition, La Biennale di Venezia, in 2013, 2015, and 2017 on the web. The crawls began on April 28th of 2013, May 1st of 2015, and May 10th of 2017 and continued through to the end of the exhibitions in November.

An initiative undertaken by the Sterling and Francine Clark Art Institute Library…the Venice Biennale web project complements the Clark’s Venice Biennale Collection of exhibition catalogues, press kits, and ephemera beginning with the 52nd Biennale in 2007.

Clark Art Institute Library

This web archive contains the websites of the Center for Jewish History; its partner institutions: American Jewish Historical Society, American Sephardi Federation, Leo Baeck Institute, Yeshiva University Museum, YIVO Institute for Jewish Research, and other organizations that occupy office space at CJH.

Websites include event listings, calendars, online exhibitions, news and press releases, videos and audio from events, blogs highlighting collections at the partner institutions, digital publications, research guides, encyclopedias, conference materials and other information documenting the organizations. The majority of the websites are captured on a quarterly basis…The web collection documents the publicly available content of the web page and does not archive material that is password protected.

Center for Jewish History

The Web Resources Collection Program follows principles and techniques of non-intrusive harvesting. We attempt to notify all organizations and/or individuals whose websites are selected for archiving. We refrain from archiving websites that do not wish to be included in this project and will remove harvested content from the archive upon request by website owner(s).

Columbia University

Break time!

Coming up: Make your own web archiving plan

Where can I learn more?

Journal of Western Archives Special Issue on Web Archiving (2017) digitalcommons.usu.edu/westernarchives/vol8/iss2/

NDSA Web Archiving in the United States Surveys: 2011, 2013, 2016 (2017 forthcoming!)

SAA Web Archiving Roundtable archivists.org/groups/web-archiving-roundtable

Jill Lepore, “The Cobweb: Can the Internet be Archived?” The New Yorker, 1/26/2015 newyorker.com/magazine/2015/01/26/cobweb

Thanks!

...and keep in touch!

Karl-Rainer Blumenthal
Web Archivist, Internet Archive
[email protected]
@LandLibrarian

Image credits: Iconathon, Creative Stall, Simple Icons, Brian Ejar, Society of American Archivists, Archive-It, National Digital Stewardship Alliance, Kansas Archive-it Consortium, New York Art Resources Consortium, Ivy Plus Libraries, Tri-College Libraries, Utah State University, Condé Nast, GifCities.org