Archive-It: Tools to “Do” Web Archiving
METRO Webinar, March 16, 2021
Karl-Rainer Blumenthal, Web Archivist, Internet Archive

Prerequisite: Some (beginners’ OK!) knowledge of web browsing

Learning objectives:

● Understand the process of web archiving with Archive-It technologies
● Identify the primary Archive-It tools for web capture, storage, and replay
● Identify the additional Archive-It tools for access and sharing
● Explore new and developing Archive-It tools for research

Out of scope:
● Advanced training for Archive-It’s software suite
● Appraisal, coverage, description, &c.

WEB ARCHIVING

Web archiving is the process of collecting, preserving, and enabling access to web-published materials.

● Capture: crawler
● Store: W/ARC
● Replay: “Wayback”
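To make the capture, store, replay cycle concrete, here is a minimal sketch of the "store" step: wrapping one captured HTTP response in a simplified WARC/1.0 response record, then reading it back for replay. The function names are hypothetical, and real crawlers such as Heritrix write more header fields, digests, and compression than shown here.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(uri: str, http_bytes: bytes, when: datetime) -> bytes:
    """Wrap a raw HTTP response in a simplified WARC/1.0 'response' record."""
    warc_date = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    # Header block, blank line, content block, then two CRLFs end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_bytes + b"\r\n\r\n"


def parse_warc_record(record: bytes):
    """Split one record back into (version, header fields, captured payload)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return lines[0], fields, rest[:length]


# "Capture" is faked here with a hard-coded response instead of a live crawl.
captured = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>"
record = build_warc_response(
    "http://example.org/",
    captured,
    datetime(2021, 3, 16, 12, 0, 0, tzinfo=timezone.utc),
)
version, fields, payload = parse_warc_record(record)
```

Replay tools work from the same two pieces this sketch recovers: the target URI plus capture date, and the original HTTP response bytes.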

The Wayback Machine

The largest publicly available web archive in existence.

https://archive.org/web/

● > 300 billion web pages
● > 100 million websites
● > 150 languages
● ~ 1 billion URLs added per week

The Wayback Machine

Limitations:
● Lightly curated
● Completeness
● Temporal cohesion

Access:
● No full-text search
● No descriptive metadata
● ‘Hunt and peck’ by URL only
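Access by URL can be sketched as simple string building. The example below assumes the public Wayback Machine address patterns (a 14-digit timestamp or `*` followed by the original URL) and the availability lookup at archive.org/wayback/available. The helper names are hypothetical, and nothing here makes a network request.

```python
from datetime import datetime
from urllib.parse import urlencode


def wayback_replay_url(url: str, when: datetime) -> str:
    """Address of the capture closest to `when` (YYYYMMDDhhmmss timestamp)."""
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{url}"


def wayback_calendar_url(url: str) -> str:
    """Address of the calendar view listing all captures of `url`."""
    return f"https://web.archive.org/web/*/{url}"


def availability_query(url: str, when: datetime) -> str:
    """Query string for the availability lookup (assumed endpoint, see lead-in)."""
    params = urlencode({"url": url, "timestamp": when.strftime("%Y%m%d%H%M%S")})
    return f"https://archive.org/wayback/available?{params}"


day = datetime(2021, 3, 16)
replay = wayback_replay_url("http://example.org/", day)
calendar = wayback_calendar_url("http://example.org/")
query = availability_query("example.org", day)
```

Every lookup starts from a known URL, which is why this style of access is described as ‘hunt and peck’.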

Capture: Brozzler, Heritrix, HTTrack, warcprox, wget
Store: ARC, WARC
Replay: Wayback Machine, OpenWayback, pywb, wab.ac, oldweb.today
Services and suites: Archive-It, Conifer, NetarchiveSuite (DK/FR), PANDAS (AUS), Web Curator (UK/NZ)

Archive-It

https://archive-it.org

● Curator controlled
● > 800 partner organizations
● ~ 2 PB of web data collected
● Full text and metadata searchable
● APIs for archives, metadata, search, &c.

ARCHIVE-IT TOOLS

Brozzler | Browser-based capture for high-fidelity social media archives

Waybackfill Service

Add past archived webpages from the Internet Archive’s Wayback Machine to your own Archive-It collections.

● Covers 1996 to the present day
● ARC and WARC files available
● Indexed for search & browse
● Flat engineering service fee

Redirection Service

Send visitors to archived versions of webpages no longer on your website.

● For Apache, nginx, or HAProxy
● No more 404s!
● Link to web captures or calendars
● Flat engineering service fee
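The redirect logic can be sketched in a few lines. This is an illustration only, not the configuration that Archive-It engineers install on Apache, nginx, or HAProxy. It assumes archived pages are addressed as wayback.archive-it.org/<collection id>/<timestamp or *>/<original URL>. The collection ID 1234 and both function names are hypothetical.

```python
from typing import Optional, Set, Tuple


def archived_redirect_target(
    collection_id: int, original_url: str, timestamp: Optional[str] = None
) -> str:
    """Link to one capture (with a timestamp) or to the calendar (with '*')."""
    when = timestamp if timestamp else "*"
    return f"https://wayback.archive-it.org/{collection_id}/{when}/{original_url}"


def handle_request(
    path: str, live_paths: Set[str], site_root: str, collection_id: int
) -> Tuple[int, Optional[str]]:
    """Serve live pages normally; redirect missing ones to the archive."""
    if path in live_paths:
        return 200, None
    # Instead of answering 404, send the visitor to the archived copy.
    return 302, archived_redirect_target(collection_id, site_root + path)


live = {"/", "/about"}
still_here = handle_request("/about", live, "https://example.org", 1234)
moved_on = handle_request("/2015/annual-report", live, "https://example.org", 1234)
```

Passing a timestamp links to a single capture, while omitting it links to the calendar of all captures.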

Archive-It APIs and integrations | Access web archives “under the hood”

ARCHIVE-IT TOOLS (DEVELOPING!)

Significant Properties

Social Feed Manager | API access to Twitter data

● WANE: Named entities from full text
● WAT: Key metadata from request headers
● LGA: Link graphs for network analysis

THANKS <3

...and keep in touch!

Karl-Rainer Blumenthal Web Archivist, Internet Archive

[email protected] [email protected]