WEB ARCHIVING
© 2016 Nicole Martin

INTERNET ARCHIVE
‣ 1996: Brewster Kahle founds the Internet Archive
‣ 1996: Starts Alexa Internet project to archive the World Wide Web
‣ Alexa Internet captures text only
‣ Kahle was inspired by a visit to the AltaVista search engine, whose entire
system was just "five or six Coke machines."
‣ Internet Archive preserves copyrighted data against lawyers' advice
‣ 1999: Sells Alexa Internet (company and crawler software) to Amazon
‣ 1999: Develops new crawling software for Internet Archive
‣ Begins collecting audiovisual media, scanned books, etc.
‣ 2001: Creates a search interface for the archived collections of websites
held in the IA repository, known as the Wayback Machine
‣ 2003: Release of Heritrix open source web archiving software

HERITRIX
‣ Web crawler for archiving, released in 2003
‣ Developed by the Internet Archive and Nordic National Libraries
‣ Previous tools captured only plain text; Heritrix also captures media
‣ Runs on Linux/Unix
‣ Open source: Apache license
(permissive: no share-alike required; initial releases were LGPL)
‣ Produces WARC files, the successor to IA's ARC file format
Scalable and Flexible
‣ Initiate crawl with a "seed" URL
‣ Customize scope of crawl (focused or broad crawls)
‣ Adjust depth and reach of crawl (recursive directories, linked sites, etc.)
‣ Set minimum/maximum bytes to download during crawl
‣ Set maximum bandwidth usage
‣ Obey/ignore robots.txt files (a site's "do not crawl" instructions)
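The knobs above can be sketched as a crawl policy. This is a toy illustration in Python, not Heritrix's actual (Java-based) configuration; the seed URL, depth limit, and robots.txt rules here are invented for the example.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical crawl settings, loosely mirroring Heritrix's knobs
SEED = "https://example.org/"   # "seed" URL that starts the crawl
MAX_DEPTH = 2                   # reach: how many link-hops to follow
OBEY_ROBOTS = True              # toggle robots.txt compliance

# Parse made-up robots.txt rules without fetching anything
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

def in_scope(url: str) -> bool:
    """Focused crawl: only URLs on the seed's host are in scope."""
    return urlparse(url).netloc == urlparse(SEED).netloc

def should_fetch(url: str, depth: int) -> bool:
    """Combine scope, depth, and robots.txt checks for one candidate URL."""
    if depth > MAX_DEPTH or not in_scope(url):
        return False
    return not OBEY_ROBOTS or robots.can_fetch("*", url)

print(should_fetch("https://example.org/page", 1))        # True
print(should_fetch("https://example.org/private/x", 1))   # False: robots.txt
print(should_fetch("https://other.com/", 1))              # False: out of scope
```

A real crawler would pair this policy with a frontier queue of (URL, depth) pairs seeded from the seed list.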
Challenges…
‣ Avoiding crawler loops (redirects) and duplicate data
‣ Avoiding archiving outside of selection policy (automation)
‣ Avoiding spam & viruses (unless you want them)
‣ Archiving dynamic or interactive content:
‣ Content requested from APIs or Databases
‣ User search query
‣ Complex AV media
More challenges…
‣ Streaming media
‣ Robots.txt files that block crawlers, and URLs containing # fragments
‣ Password protected sites
‣ Archiving social media and commercial platforms
‣ GET vs. POST HTTP requests
GET is a simple URL-based, read-only request
POST is interactive and read/write (data travels in the request body)
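The distinction can be seen by building both request types with Python's standard library (the URLs are placeholders): urllib infers the method from whether the request carries a body.

```python
from urllib.request import Request

# A GET request: everything the server needs is in the URL itself
get_req = Request("https://example.org/search?q=web+archiving")

# A POST request: the data travels in the request body, not the URL
post_req = Request("https://example.org/search", data=b"q=web+archiving")

# No body -> GET; body present -> POST
print(get_req.get_method(), post_req.get_method())  # GET POST
```

This is why GET captures archive cleanly (the URL alone reproduces the page), while POST interactions generally do not.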
WARC FILE
‣ Aggregates the set of digital resources (HTML files, etc.) that make up a website
‣ Rendered in a stand-alone application or a web browser (though browsers do not open WARC files by default)
‣ Original digital resources (HTML, etc.) can be extracted from a WARC file (e.g., with Unarchiver or ARCreader)
‣ Created by the Internet Archive
‣ Open format; ISO standard ISO 28500:2009
‣ Apache license (permissive: no share-alike required)
‣ Successor to the ARC file format (also by IA, 1996)
‣ Adds new functionality:
‣ Iterative: documents changes to web content over time
‣ Detects duplicate content when a WARC is compared to previous captures
‣ Stores associated metadata (subject)
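A WARC file is just a concatenation of records: named header fields, a blank line, a content block, and a blank-line terminator. The sketch below builds one minimal record along the lines of ISO 28500; the field values are invented for illustration, and real tools (Heritrix, wget) write these for you.

```python
import uuid
from datetime import datetime, timezone

def warc_record(target_uri: str, payload: bytes) -> bytes:
    """Build one minimal WARC/1.0 'resource' record: header fields,
    a blank line, the content block, then a blank-line terminator."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",  # counts only the content block
    ]
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = warc_record("http://example.org/", b"<html>hi</html>")
print(record.decode().splitlines()[0])  # WARC/1.0
```

Because each capture of a URL becomes its own dated record, appending records over time is what makes the format iterative.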
‣ Can be invoked with the "wget" command
‣ Can perform a "crawl" or automated capture directly via CLI
‣ Setting the parameters of wget = the archivist's way of implementing a WARC selection policy
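For example, a capture along these lines (the URL and output name are placeholders; the flags shown are standard GNU wget options):

```shell
# Recursive capture of a site into a WARC file with GNU wget.
# --warc-file writes example.warc.gz alongside the normal download;
# --level and --no-parent express the selection policy's scope.
wget --recursive --level=2 --no-parent \
     --page-requisites \
     --warc-file=example \
     "https://example.org/"
```

Tightening or loosening flags like --level and --no-parent is how the selection policy gets encoded in the command line.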
DEMO TIME!
‣ Developer mode preview: Safari > Preferences > Advanced tab > Show Develop menu in menu bar
THE END.
© 2019 Nicole Martin