Archive-It: Tools to “Do” Web Archiving
METRO Webinar, March 16, 2021
Karl-Rainer Blumenthal, Web Archivist, Internet Archive
Prerequisite: Some (beginners’ OK!) knowledge of web browsing
Learning objectives:
● Understand the process of web archiving with Archive-It technologies
● Identify the primary Archive-It tools for web capture, storage, and replay
● Identify the additional Archive-It tools for access and sharing
● Explore new and developing Archive-It tools for research
Out of scope:
● Advanced training for Archive-It’s software suite
● Appraisal, coverage, description, &c.
WEB ARCHIVING
Web archiving is the process of collecting, preserving, and enabling access to web-published materials.
● Capture: crawler
● Store: W/ARC
● Replay: “Wayback”
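To make the "store" step concrete, here is a minimal sketch of what a crawler writes after capture: a WARC "response" record wrapping the raw HTTP response. The function name `build_warc_response_record` is hypothetical, and the record carries only a basic set of WARC/1.0 headers; production crawlers such as Heritrix or warcprox write richer, usually gzip-compressed records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes, when: datetime) -> bytes:
    """Wrap raw HTTP response bytes in a minimal WARC/1.0 'response' record.

    Illustrative helper only (hypothetical name); `when` should be a
    timezone-aware datetime so it can be normalized to UTC.
    """
    utc = when.astimezone(timezone.utc)
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {utc.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Headers end with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"


if __name__ == "__main__":
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
    captured_at = datetime(2021, 3, 16, 12, 0, 0, tzinfo=timezone.utc)
    print(build_warc_response_record("http://example.com/", payload, captured_at).decode("utf-8"))
```

A replay tool later reads records like this one back and serves the stored response as if it were the live page.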
The Wayback Machine
The largest publicly available web archive in existence.
https://archive.org/web/
● > 300 billion web pages
● > 100 million websites
● > 150 languages
● ~ 1 billion URLs added per week
The Wayback Machine
Limitations:
● Lightly curated
● Completeness
● Temporal cohesion
Access:
● No full-text search
● No descriptive metadata
● ‘Hunt and peck’ by URL only
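Since access is by URL only, a lookup amounts to building the right address. The sketch below constructs a Wayback Machine replay or calendar link and a query for the Internet Archive's availability endpoint. The helper names are hypothetical, and the code only builds URLs; it makes no network request.

```python
from typing import Optional
from urllib.parse import urlencode

WAYBACK_REPLAY = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def replay_url(url: str, timestamp: str = "*") -> str:
    """Link to a capture near `timestamp` (YYYYMMDDhhmmss or a prefix of it).

    The default "*" points at the calendar of all captures for the URL.
    """
    return f"{WAYBACK_REPLAY}/{timestamp}/{url}"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build a query asking whether a capture of `url` exists (closest to `timestamp`)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(replay_url("http://example.com/"))
    print(replay_url("http://example.com/", "20210316"))
    print(availability_query("example.com", "20210316"))
```

Without the exact URL there is nothing to look up, which is the "hunt and peck" limitation in practice.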
● Capture: Brozzler, Heritrix, HTTrack, warcprox, wget
● Store: ARC, WARC
● Replay: Wayback Machine, OpenWayback, pywb, wab.ac, oldweb.today
● Services: Archive-It, Conifer, NetarchiveSuite (DK/FR), PANDAS (AUS), Web Curator (UK/NZ)
ARCHIVE-IT
Archive-It
https://archive-it.org
● Curator controlled
● > 800 partner organizations
● ~ 2 PB of web data collected
● Full text and metadata searchable
● APIs for archives, metadata, search, &c.
ARCHIVE-IT TOOLS
Brozzler | Browser-based capture for high fidelity social media archives
Waybackfill Service
Add past archived webpages from the Internet Archive’s Wayback Machine to your own Archive-It collections.
● Covers 1996 to the present day
● ARC and WARC files available
● Indexed for search & browse
● Flat engineering service fee
Redirection Service
Send visitors to archived versions of webpages no longer on your website.
● For Apache, nginx, or HAProxy
● No more 404s!
● Link to web captures or calendars
● Flat engineering service fee
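The actual service is configured in the web server itself (Apache, nginx, or HAProxy). As a rough illustration of the mapping it performs, the sketch below turns a missing page into a redirect to either a specific capture or the capture calendar. The function names and the collection ID `1234` are hypothetical, and the `wayback.archive-it.org/<collection>/<timestamp>/<url>` pattern is assumed from public Archive-It replay links.

```python
from typing import Optional, Tuple

ARCHIVE_IT_WAYBACK = "https://wayback.archive-it.org"


def archived_location(collection_id: int, original_url: str, timestamp: Optional[str] = None) -> str:
    """Link to one capture if a timestamp is given, otherwise to the calendar ("*")."""
    ts = timestamp or "*"
    return f"{ARCHIVE_IT_WAYBACK}/{collection_id}/{ts}/{original_url}"


def redirect_for(status: int, collection_id: int, original_url: str,
                 timestamp: Optional[str] = None) -> Optional[Tuple[int, str]]:
    """Return (302, location) when the live site answered 404, else None."""
    if status != 404:
        return None  # page still exists on the live site; serve it normally
    return 302, archived_location(collection_id, original_url, timestamp)


if __name__ == "__main__":
    # 1234 is a placeholder collection ID, not a real collection.
    print(redirect_for(404, 1234, "http://example.org/old-page"))
    print(redirect_for(404, 1234, "http://example.org/old-page", "20200101000000"))
    print(redirect_for(200, 1234, "http://example.org/"))
```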
Archive-It APIs and integrations | Access web archives “under the hood”
ARCHIVE-IT TOOLS (DEVELOPING!)
Significant Properties
Social Feed Manager | API access to Twitter data
WANE | Named entities from full text
WAT | Key metadata from request headers
LGA | Link graphs for network analysis
THANKS <3
...and keep in touch!
Karl-Rainer Blumenthal Web Archivist, Internet Archive