Internet Archive 20 Years / 20 Petabytes Internet Archive Digital Storage

Digital Storage
● Lots of hard drives
● 56.36 petabytes of disk space
● 21.93 petabytes of archival content
● Stored redundantly in different physical locations

DIY storage cluster
● Archival on-disk format
● Efficient (cost + energy)

Upload
● 8M Texts
● 2.4M Audio recordings
● 140K Concerts
● 1.9M Videos
● 100K Software items
● … and your stuff!

Upload
● S3-compatible API
● Metadata API
● `pip install internetarchive`
● https://archive.org/help/abouts3.txt

Upload
Using IA S3 to back distributed art projects

Physical Storage

Wayback Machine
● Started in 1996 with Alexa crawls
● Updated in realtime at 2000 URLs/sec
● 475,951,398,000 captures
● 10.5 petabytes of WARC data
● Intl Internet Preservation Consortium
● Open source on github.com/internetarchive

Realtime Wayback
Deleted within two hours

Realtime Wayback - Save Page Now

Wayback Under the Hood
● 10.5 petabytes of WARC data
● 2nd Level CDX Index: 20TB compressed
● 1st level: 13GB. Fits in core.
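
The two index levels listed under "Wayback Under the Hood" suggest a lookup in two steps: search a small in-memory summary to pick a block, then decompress and scan only that block. The sketch below shows one way such a scheme could work. It is not the Archive's actual code: the block layout, the helper names `build_index` and `lookup`, and the sample index lines are all assumptions made for illustration.

```python
import bisect
import gzip


def build_index(sorted_lines, block_size):
    """Split sorted 'key timestamp' lines into gzip-compressed blocks.

    Returns (first_keys, blocks). first_keys plays the role of the small
    first-level index held in memory; blocks stands in for the large
    compressed second level (kept in memory here only for the demo).
    """
    first_keys, blocks = [], []
    for start in range(0, len(sorted_lines), block_size):
        chunk = sorted_lines[start:start + block_size]
        first_keys.append(chunk[0].split(" ", 1)[0])
        blocks.append(gzip.compress("\n".join(chunk).encode("utf-8")))
    return first_keys, blocks


def lookup(first_keys, blocks, key):
    """Return every line whose first field equals key."""
    # Matches may start in the block just before the first block whose
    # first key is >= key, so step back one block (but not below zero).
    i = max(bisect.bisect_left(first_keys, key) - 1, 0)
    matches = []
    # Scan forward while a block could still contain the key.
    while i < len(blocks) and first_keys[i] <= key:
        text = gzip.decompress(blocks[i]).decode("utf-8")
        for line in text.split("\n"):
            if line.split(" ", 1)[0] == key:
                matches.append(line)
        i += 1
    return matches


# Hypothetical index lines: a sortable URL key followed by a capture timestamp.
SAMPLE_LINES = sorted([
    "com,example)/ 20010101000000",
    "com,example)/ 20020101000000",
    "com,example)/ 20030101000000",
    "org,archive)/ 19970126045828",
    "org,archive)/about 20160101000000",
])

if __name__ == "__main__":
    keys, blocks = build_index(SAMPLE_LINES, block_size=2)
    for hit in lookup(keys, blocks, "com,example)/"):
        print(hit)
```

Only the list of first keys is searched in memory; each query decompresses at most the few blocks that can hold the key, which is why a small first level is enough to serve a much larger compressed second level.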

Wayback APIs
● Availability and CDX APIs: build your own tools on top of Wayback
● Open source tools to read/write WARCs
● WARC Support in wget: wget --warc-file

Books
● 2.5 million books with Scribe bookscanner
● 1000 books/day
● 35 scanning centers on 5 continents
● 800 million images
● Open-sourced gphoto Canon 1D/5D drivers

Television

Internet Archive
● Archive-It
● Audio Books
● Bitcoin ATM
● Bittorrent seeds
● Bookmobile
● Book Reader
● Book Scanning
● CD archiving
● Community Wireless
● Credit Union
● DAISY / Print-Disabled
● Distributed Web
● Film preservation
● Foundation Housing
● Heritrix crawler
● IIPC
● JSMESS emulator
● Lending Library
● Listening Rooms
● Live Music Archive
● LP digitization
● Metadata API
● Microfilm scanning
● Microfiche
● Netlabels
● Newspaper digitization
● OPDS
● Open Content Alliance
● Open Library
● Petabox
● Physical Archive
● Reader Privacy
● Researcher Access
● S3-like API
● Scribe and TTScribe
● Software preservation
● Television Archive
● V2 site
● VHS digitization
● Wayback Machine

Questions?
● developers.archive.org
● we love volunteers!
● we are hiring engineers!
● archive.org/jobs
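
As a starting point for building tools on the Availability API named on the "Wayback APIs" slide, here is a minimal sketch of a client. The slides do not give the endpoint or the response format; the URL `https://archive.org/wayback/available`, its `url` and `timestamp` parameters, and the `archived_snapshots` / `closest` response fields are taken from the public Wayback API documentation as an assumption and should be checked against it. The function names and the sample response are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint (not stated in the slides).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def fetch_closest(page_url, timestamp=None):
    """Query the live service (needs network access)."""
    request_url = availability_url(page_url, timestamp)
    with urllib.request.urlopen(request_url, timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))


# Hypothetical response body, shaped like the assumed format above.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200",
        }
    }
})

if __name__ == "__main__":
    print(availability_url("example.com", "20060101"))
    print(closest_snapshot(SAMPLE_RESPONSE))
```

Keeping URL construction and response parsing separate from the network call lets both be checked offline; `fetch_closest("example.com", "20060101")` performs the live request.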
