Archive-It: Tools to “Do” Web Archiving
METRO Webinar, March 16, 2021
Karl-Rainer Blumenthal, Web Archivist, Internet Archive

ARCHIVE-IT: TOOLS TO “DO” WEB ARCHIVING
Prerequisite: Some (beginners’ OK!) knowledge of web browsing
Learning objectives:
● Understand the process of web archiving with Archive-It technologies
● Identify the primary Archive-It tools for web capture, storage, and replay
● Identify the additional Archive-It tools for access and sharing
● Explore new and developing Archive-It tools for research
Out of scope:
● Advanced training for Archive-It’s software suite
● Appraisal, coverage, description, &c.

Web archiving is the process of collecting, preserving, and enabling access to web-published materials.

WEB ARCHIVING
● capture: crawler
● store: W/ARC
● replay: “Wayback”

WEB ARCHIVING
The Wayback Machine
The largest publicly available web archive in existence.
https://archive.org/web/
● > 300 billion web pages
● > 100 million websites
● > 150 languages
● ~ 1 billion URLs added per week

WEB ARCHIVING
The Wayback Machine
Limitations:
● Lightly curated
● Completeness
● Temporal cohesion
Access:
● No full-text search
● No descriptive metadata
● ‘Hunt and peck’ by URL only

WEB ARCHIVING
● Brozzler, Heritrix, HTTrack, warcprox, wget
● ARC, WARC
● Wayback Machine, OpenWayback, pywb, wab.ac, oldweb.today

WEB ARCHIVING
● Brozzler, Heritrix, HTTrack, warcprox, wget
● ARC, WARC
● Wayback Machine, OpenWayback, pywb, wab.ac, oldweb.today
● Archive-It, Conifer, NetarchiveSuite (DK/FR), PANDAS (AUS), Web Curator (UK/NZ)

ARCHIVE-IT
Archive-It
https://archive-it.org
● Curator controlled
● > 800 partner organizations
● ~ 2 PB of web data collected
● Full text and metadata searchable
● APIs for archives, metadata, search, &c.
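
The Wayback Machine slides above note that access is by URL only. As a minimal sketch of what that looks like in practice, the Python below builds two links to the public Wayback Machine for a known URL: one to the capture closest to a given date, and one to the calendar of all captures. The function names are hypothetical, and the URL pattern is the one used by web.archive.org rather than anything stated in these slides; Archive-It collections are replayed at their own addresses.

```python
# Assumption: the public Wayback Machine's replay URL pattern,
# https://web.archive.org/web/<timestamp>/<original URL>
WAYBACK_PREFIX = "https://web.archive.org/web"


def replay_url(original_url: str, timestamp: str) -> str:
    """Link to the capture closest to a timestamp (YYYY up to YYYYMMDDhhmmss)."""
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 20210316")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def calendar_url(original_url: str) -> str:
    """Link to the calendar page listing every capture of a URL."""
    return f"{WAYBACK_PREFIX}/*/{original_url}"


if __name__ == "__main__":
    print(replay_url("https://archive-it.org/", "20210316"))
    print(calendar_url("https://archive-it.org/"))
```

Both functions only assemble strings and make no network requests, so a returned link is not a guarantee that a capture exists for that URL or date.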

ARCHIVE-IT TOOLS
Brozzler | Browser-based capture for high fidelity social media archives

ARCHIVE-IT TOOLS
Waybackfill Service
Add past archived webpages from the Internet Archive’s Wayback Machine to your own Archive-It collections.
● Covers 1996 to the present day
● ARC and WARC files available
● Indexed for search & browse
● Flat engineering service fee

ARCHIVE-IT TOOLS
Redirection Service
Send visitors to archived versions of webpages no longer on your website.
● For Apache, nginx, or HAProxy
● No more 404s!
● Link to web captures or calendars
● Flat engineering service fee

ARCHIVE-IT TOOLS
Archive-It APIs and integrations | Access web archives “under the hood”

ARCHIVE-IT TOOLS
Significant Properties

ARCHIVE-IT TOOLS (DEVELOPING!)
Social Feed Manager | API access to Twitter data

ARCHIVE-IT TOOLS (DEVELOPING!)
● WANE: Named entities from full text
● WAT: Key metadata from request headers
● LGA: Link graphs for network analysis

THANKS <3
...and keep in touch!
Karl-Rainer Blumenthal, Web Archivist, Internet Archive
[email protected] [email protected]