Tools to “Do” Web Archiving METRO Webinar March 16, 2021 Karl-Rainer Blumenthal Web Archivist, Internet Archive ARCHIVE-IT: TOOLS to “DO” WEB ARCHIVING

ARCHIVE-IT: TOOLS TO “DO” WEB ARCHIVING

Prerequisite:
● Some (beginners’ OK!) knowledge of web browsing

Learning objectives:
● Understand the process of web archiving with Archive-It technologies
● Identify the primary Archive-It tools for web capture, storage, and replay
● Identify the additional Archive-It tools for access and sharing
● Explore new and developing Archive-It tools for research

Out of scope:
● Advanced training for Archive-It’s software suite
● Appraisal, coverage, description, &c.

WEB ARCHIVING
Web archiving is the process of collecting, preserving, and enabling access to web-published materials.

WEB ARCHIVING
● capture: crawler
● store: W/ARC
● replay: “Wayback”

WEB ARCHIVING
The Wayback Machine
The largest publicly available web archive in existence.
https://archive.org/web/
● > 300 billion web pages
● > 100 million websites
● > 150 languages
● ~ 1 billion URLs added per week

WEB ARCHIVING
The Wayback Machine
Limitations:
● Lightly curated
● Completeness
● Temporal cohesion
Access:
● No full-text search
● No descriptive metadata
● ‘Hunt and peck’ by URL only

WEB ARCHIVING
● Brozzler, Heritrix, HTTrack, warcprox, wget
● ARC, WARC
● Wayback Machine, OpenWayback, pywb, wab.ac, oldweb.today
● Archive-It, Conifer, NetarchiveSuite (DK/FR), PANDAS (AUS), Web Curator (UK/NZ)

ARCHIVE-IT
Archive-It
https://archive-it.org
● Curator controlled
● > 800 partner organizations
● ~ 2 PB of web data collected
● Full text and metadata searchable
● APIs for archives, metadata, search, &c.
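The URL-only access to the Wayback Machine noted above can be illustrated with a short sketch. It assumes the public Wayback Machine's addressing convention, in which a capture is reached at https://web.archive.org/web/ followed by a 14-digit timestamp and the original URL, and a calendar of all captures is reached by putting * in place of the timestamp. The function names and the example address are hypothetical and do not come from the webinar.

```python
from datetime import datetime

# Assumed base of the public Wayback Machine's replay URLs.
WAYBACK_BASE = "https://web.archive.org/web"


def capture_url(original_url: str, when: datetime) -> str:
    """Build a link to a capture at or near `when`.

    `when` is assumed to already be expressed in UTC; the Wayback
    Machine is expected to redirect to the closest capture it holds.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK_BASE}/{timestamp}/{original_url}"


def calendar_url(original_url: str) -> str:
    """Build a link to the calendar view of all captures of a URL."""
    return f"{WAYBACK_BASE}/*/{original_url}"


if __name__ == "__main__":
    # Hypothetical example address, dated to the day of the webinar.
    print(capture_url("https://example.org/", datetime(2021, 3, 16, 12, 0, 0)))
    print(calendar_url("https://example.org/"))
```

Under these assumptions, both lookups still require knowing the original address in advance, which is the limitation that full-text and metadata search are meant to address.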
ARCHIVE-IT TOOLS
Brozzler | Browser-based capture for high fidelity social media archives

ARCHIVE-IT TOOLS
Waybackfill Service
Add past archived webpages from the Internet Archive’s Wayback Machine to your own Archive-It collections.
● Covers 1996 to the present day
● ARC and WARC files available
● Indexed for search & browse
● Flat engineering service fee

ARCHIVE-IT TOOLS
Redirection Service
Send visitors to archived versions of webpages no longer on your website.
● For Apache, nginx, or HAProxy
● No more 404s!
● Link to web captures or calendars
● Flat engineering service fee

ARCHIVE-IT TOOLS
Archive-It APIs and integrations | Access web archives “under the hood”

ARCHIVE-IT TOOLS
Significant Properties

ARCHIVE-IT TOOLS (DEVELOPING!)
Social Feed Manager | API access to Twitter data

ARCHIVE-IT TOOLS (DEVELOPING!)
● WANE: Named entities from full text
● WAT: Key metadata from request headers
● LGA: Link graphs for network analysis

THANKS <3
...and keep in touch!
Karl-Rainer Blumenthal
Web Archivist, Internet Archive
[email protected] [email protected].
