
WEB ARCHIVING

© 2016 Nicole Martin

‣ 1996: Brewster Kahle founds the Internet Archive

‣ 1996: Starts project to archive the

‣ Alexa Internet captures text only

‣ Kahle was inspired by a visit to the AltaVista search engine, whose entire system was just “five or six Coke machines.”

‣ Internet Archive preserves copyrighted data against lawyers' advice

‣ 1999: Sells Alexa Internet (company and crawler software) to Amazon

INTERNET ARCHIVE

‣ 1999: Develops new crawling software for Internet Archive

‣ Begins collecting AV media, scanned books, etc.

‣ 2001: Creates search interface for archived collections of websites held at the IA repository, known as the Wayback Machine

‣ 2003: Release of open source web archiving software Heritrix

HERITRIX

‣ Web crawler for archiving, released in 2003

‣ Developed by the Internet Archive and Nordic National Libraries

‣ Previous tools captured only plain text; Heritrix also captures media

‣ Runs on Linux/Unix

‣ Open source, Apache license (permissive: no share-alike required; initial release was LGPL)

‣ Produces WARC files, the successor to IA's ARC file format

Scalable and Flexible

‣ Initiate crawl with a "seed" URL

‣ Customize scope of crawl (focused or broad crawls)

‣ Adjust depth and reach of crawl (recursive directories, linked sites, etc.)

‣ Set minimum/maximum bytes to download during crawl

‣ Set maximum bandwidth usage

‣ Obey/ignore robots.txt files (a site's "do not crawl" rules, published as a text file at the site root rather than as an HTTP header)
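The crawl settings listed above can be illustrated with a small sketch. This is not Heritrix code: the function name `crawl`, its parameters, and the in-memory link graph `SITE` are all hypothetical, and the "focused" scope rule (same host as the seed) is just one possible choice. Pages are supplied by a caller-provided function so the example runs without network access.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def crawl(seed, get_links, max_depth, robots_txt="", obey_robots=True):
    """Breadth-first crawl from one seed URL, limited by depth, host, and robots.txt.

    get_links is a caller-supplied function mapping a URL to the URLs it links to.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    seed_host = urlsplit(seed).netloc

    captured = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if obey_robots and not robots.can_fetch("*", url):
            continue  # excluded by the site's robots.txt rules
        captured.append(url)
        if depth == max_depth:
            continue  # depth limit reached: capture the page but do not follow its links
        for link in get_links(url):
            in_scope = urlsplit(link).netloc == seed_host  # focused crawl: same host only
            if in_scope and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return captured


# Hypothetical link graph standing in for real pages.
SITE = {
    "http://example.org/": [
        "http://example.org/a",
        "http://example.org/private/x",
        "http://other.example/",
    ],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": ["http://example.org/c"],
}
ROBOTS = "User-agent: *\nDisallow: /private/\n"

pages = crawl("http://example.org/", lambda u: SITE.get(u, []), max_depth=2, robots_txt=ROBOTS)
print(pages)
```

Raising `max_depth` or loosening the host check turns the focused crawl into a broader one; passing `obey_robots=False` corresponds to the "ignore" setting.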

Challenges…

‣ Avoiding crawler loops (redirects) and duplicate data

‣ Avoiding archiving outside of selection policy (automation)

‣ Avoiding spam & viruses (unless you want them)

‣ Archiving dynamic or interactive content:

    ‣ Content requested from APIs or databases

    ‣ User search queries

    ‣ Complex AV media
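One common way to approach the first challenge (loops and duplicate data) is to remember which URLs have been visited and which payloads have already been stored. The sketch below is an illustration under assumptions, not how Heritrix does it: `canonical_url` and `CaptureLog` are hypothetical names, the normalization rules are deliberately minimal, and SHA-1 is just one possible digest.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Lowercase scheme and host and drop the #fragment so equivalent URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class CaptureLog:
    """Tracks visited URLs (to avoid loops) and payload digests (to avoid duplicate data)."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_digests = {}

    def should_fetch(self, url):
        """Return True the first time a URL is offered, False on every repeat."""
        key = canonical_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def record(self, url, payload):
        """Return the URL that first held this payload, or None if the payload is new."""
        digest = hashlib.sha1(payload).hexdigest()
        first = self.seen_digests.get(digest)
        if first is None:
            self.seen_digests[digest] = canonical_url(url)
        return first


log = CaptureLog()
print(log.should_fetch("http://example.org/page"))      # first visit
print(log.should_fetch("HTTP://Example.org/page#top"))  # same page, different spelling
print(log.record("http://example.org/a", b"same bytes"))
print(log.record("http://example.org/b", b"same bytes"))
```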

More challenges…

‣ Streaming media

‣ Avoiding robots.txt (blocks crawlers) and URLs with #

‣ Password protected sites

‣ Archiving social media and commercial platforms

‣ GET vs. POST HTTP requests

GET is a simple URL-based, read-only request

POST is interactive and read/write
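The difference is easy to see by building both kinds of request with Python's standard library. The sketch only constructs the requests and never sends them; the URL `http://example.org/search` and the parameter `q` are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the whole request is expressed in the URL, so a crawler can replay it from a link.
get_request = Request("http://example.org/search?" + urlencode({"q": "web archiving"}))

# POST: the parameters travel in the request body, not in the URL.
post_request = Request(
    "http://example.org/search",
    data=urlencode({"q": "web archiving"}).encode("ascii"),
)

for request in (get_request, post_request):
    print(request.get_method(), request.full_url, request.data)
```

Because a link-following crawler only discovers URLs, content that is reachable solely through a POST body tends to be missed unless the request is made deliberately.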


WARC FILE WEB ARCHIVE


‣ Aggregates sets of digital resources (HTML files, etc.) that make up the web

‣ Rendered in a stand-alone application or a web browser (though browsers do not render WARC by default)

‣ Original digital resources (HTML, etc.) can be extracted from a WARC file (Unarchiver, ARCreader)
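To make the extraction step concrete, here is a minimal sketch that pulls the original HTML out of a single WARC "response" record. It is a simplification under stated assumptions: the record is uncompressed (real WARC files are often gzip-compressed), header fields are assumed to be written as "Name: value", and the sample record, its URL, date, and record ID are invented for illustration. The function names are hypothetical; dedicated tools handle many more cases.

```python
def parse_warc_record(record):
    """Split one uncompressed WARC record into (version line, header fields, content block)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    version, *field_lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in field_lines)
    block = rest[: int(fields["Content-Length"])]
    return version, fields, block


def extract_payload(block):
    """In a 'response' record the block is a full HTTP response; the payload follows its headers."""
    _, _, payload = block.partition(b"\r\n\r\n")
    return payload


# Invented sample data: one HTML page wrapped in an HTTP response, wrapped in a WARC record.
HTML = b"<html><body>Hello, archive</body></html>"
HTTP_RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + HTML
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2016-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: " + str(len(HTTP_RESPONSE)).encode("ascii") + b"\r\n"
    b"\r\n" + HTTP_RESPONSE + b"\r\n\r\n"
)

version, fields, block = parse_warc_record(SAMPLE_RECORD)
print(version, fields["WARC-Type"], fields["WARC-Target-URI"])
print(extract_payload(block).decode("utf-8"))
```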

‣ Created by the Internet Archive

‣ Open format, ISO standard (ISO 28500:2009)

‣ Apache license (permissive: no share-alike required)

‣ Successor to the ARC file format (also by IA, 1996)

‣ Adds new functionality:

    ‣ Iterative: documents changes to web content over time

    ‣ Detects duplicate content when a WARC is compared to previous capture events

    ‣ Stores associated metadata (subject)

‣ WARC capture can be invoked with the "wget" command

‣ Can perform a "crawl" or automated capture directly via CLI

‣ Setting parameters of wget = archivist's way of implementing WARC selection policy
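As a rough illustration of "parameters as selection policy", the sketch below assembles a wget command line from a few policy choices. It only builds and prints the command; it does not run wget. The function name `wget_warc_command`, its defaults, and the output name `example-capture` are hypothetical, while the flags themselves are standard GNU wget options (`--warc-file` requires a wget build with WARC support, 1.14 or later).

```python
import shlex


def wget_warc_command(seed_url, warc_name, depth=2, wait_seconds=1, obey_robots=True):
    """Translate a simple selection policy into a wget argument list."""
    command = [
        "wget",
        "--recursive",                # follow links from the seed URL
        f"--level={depth}",           # how many link hops to follow
        "--page-requisites",          # also fetch images, CSS, and scripts needed to render pages
        "--no-parent",                # stay at or below the seed's directory
        f"--wait={wait_seconds}",     # pause between requests
        f"--warc-file={warc_name}",   # write the capture to a WARC file with this base name
    ]
    if not obey_robots:
        command += ["-e", "robots=off"]  # ignore robots.txt
    command.append(seed_url)
    return command


command = wget_warc_command("http://example.org/", "example-capture")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run` once the policy looks right.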

DEMO TIME!

DEVELOPER MODE PREVIEW

‣ Safari > Preferences > Advanced tab > Show Develop menu in menu bar

THE END.

© 2019 Nicole Martin