Internet Archive 20 Years / 20 Petabytes Internet Archive Digital Storage

Total Page:16

File Type:pdf, Size:1020Kb

Internet Archive 20 Years / 20 Petabytes Internet Archive Digital Storage Internet Archive 20 years / 20 petabytes Internet Archive Digital Storage ● Lots of hard drives ● 56.36 petabytes of disk space ● 21.93 petabytes of archival content ● Stored redundantly in different physical locations DIY storage cluster ● Archival on-disk format ● Efficient (cost + energy) Upload ● 8M Texts ● 2.4M Audio recordings ● 140K Concerts ● 1.9M Videos ● 100K Software items ● … and your stuff! Upload ● S3-compatible API ● Metadata API ● `pip install internetarchive` ● https://archive.org/help/abouts3.txt Upload Using IA S3 to back distributed art projects Physical Storage Upload Wayback Machine ● Started in 1996 with Alexa crawls ● Updated in realtime at 2000 urls/sec ● 475,951,398,000 captures ● 10.5 petabytes of WARC data ● Intl Internet Preservation Consortium ● Open source on github.com/internetarchive Realtime Wayback Deleted within two hours: Realtime Wayback - Save Page Now Save Page Now Wayback Under the Hood ● 10.5 petabytes of WARC data ● 2nd Level CDX Index: 20TB compressed ● 1st level: 13GB. Fits in core. Wayback APIs ● Availability and CDX APIs: build your own tools on top of Wayback ● Open source tools to read/write WARCs ● WARC Support in wget: wget --warc-file Books ● 2.5 million books with Scribe bookscanner ● 1000 books/day ● 35 scanning centers on 5 continents ● 800 million images ● Open-sourced gphoto Canon 1D/5D drivers Books Books Books Books Television Television Internet Archive ● Archive-It ● Foundation Housing ● OPDS ● Audio Books ● Heritrix crawler ● Open Content Alliance ● Bitcoin ATM ● IIPC ● Open Library ● Bittorrent seeds ● JSMESS emulator ● Petabox ● Bookmobile ● Lending Library ● Physical Archive ● Book Reader ● Listening Rooms ● Reader Privacy ● Book Scanning ● Live Music Archive ● Researcher Access ● CD archiving ● LP digitization ● S3-like API ● Community Wireless ● Metadata API ● Scribe and TTScribe ● Credit Union ● Microfilm scanning ● Software preservation ● DAISY / Print- ● Microfiche ● Television Archive Disabled ● Netlabels ● V2 site ● Distributed Web ● Newspaper digitization ● VHS digitization ● Film preservation ● Wayback Machine Questions? ● developers.archive.org ● we love volunteers! ● we are hiring engineers! ● archive.org/jobs.
Recommended publications
  • How to Find Free, Reusable Content Online Rhode Island Library
    Open Everything: How to find free, reusable content online Rhode Island Library Association Conference 2016, “Color Outside the Lines” Andrée Rathemacher • Julia Lovett • Angel Ferria University of Rhode Island Open Culture General Resources: Sites, Portals & Guides Digital Public Library of America — http://dp.la/ Aims to be a national digital library for the​ USA. Harvests metadata and content in all formats from other digital libraries and databases (HathiTrust, Internet Archive, state/consortium repositories, govt repositories etc. ­ full partner list here http://dp.la/partners) Does not yet allow searching/filtering by rights information. ​ ​ Europeana — http://www.europeana.eu/portal/ Europe’s portal t​o cultural collections: “Explore 52,219,831 artworks, artefacts, books, videos and sounds from across Europe.” Can filter search results by reuse rights. Internet Archive — http://archive.org Founded in 1996. A “no​n­profit library of millions of free books, movies, software, music, and more.” Searchable by Creative Commons license or Public Domain: See https://archive.org/about/faqs.php#1069 ​ Open Culture — http://www.openculture.com/ Founded in 2006. B​rings together free/open resources from around the web. Geared for a popular audience, with frequent blog posts and active social media presence. OpenGLAM Open Collections — http://openglam.org/open­collections/ A searchable index of open cultural her​ itage collections with freely reusable content. Shared Shelf Commons — http://www.sscommons.org Freely available images and oth​ er digital content from libraries, archives, and museums participating in Shared Shelf by Artstor. Copyright restrictions vary. Creative Commons Search — https://search.creativecommons.org/ Search CC­licensed content from m​ultiple sites such as Flickr, Google, and YouTube.
    [Show full text]
  • March 2010, Corrected 3/31/10 ISSN: 0195-4857
    TECHNICAL SERVICE S LAW LIBRARIAN Volume 35 No. 3 http://www.aallnet.org/sis/tssis/tsll/ March 2010, corrected 3/31/10 ISSN: 0195-4857 INSIDE: Technical Services Law Librarian From the Officers OBS-SIS ..................................... 3 to be added to HeinOnline! TS-SIS ........................................ 4 AALL Headquarters and William S. Hein & Co. signed an agreement Announcements on December 2, 2009 that will permit TSLL to become available in a Renee D. Chapman Award ....... 31 fully-searchable image-based format as part of HeinOnline’s Law Librarian’s TS SIS Educational Grants ...... 13 Reference Library. TSLL to be added to Hein ........... 1 The Law Librarian’s Reference Library, currently in beta version, is accessible by subscription at http://heinonline.org/HOL/Index?collection=lcc&set_ as_cursor=clear. At present if a library subscribes to Larry Dershem’s print Columns version of the Library of Congress Classification Schedules it has free access Acquisitions ............................... 5 to this reference library. As part of this HeinOnline library TSLL will join such Classification .............................. 6 classic works as Library of Congress Classification schedules, Cataloging Collection Development ............ 8 Service Bulletin, Subject Headings Manual, and the Catalog of the Library of Description & Entry ................... 9 the Law School of Harvard University (1909). For more information about the The Internet .............................. 10 Law Librarian’s Reference Library see Hein’s introductory brochure at http:// Management ............................. 14 heinonline.org/HeinDocs/LLReference.pdf. MARC Remarks....................... 15 OCLC ....................................... 18 We’re hopeful TSLL will be accessible on HeinOnline in time for the AALL Preservation .............................. 19 Annual Meeting in July, but no timetable has yet been set … so stay tuned! Private Law Libraries ..............
    [Show full text]
  • Hathitrust Preferred Internet Archive Book Package Overview
    HathiTrust Preferred Internet Archive Book Package Overview & Background As a by-product of the Internet Archive scanning process, a variety of different files and formats are available to everyone, everywhere. This differs from the Google output, which offers no file-level variations or options. However, this also means that files chosen for ingest into the HathiTrust repository must be carefully selected, with an eye towards both near-term and long-term utility. The process of selecting files that is described below attempted to balance the following important criteria: a baseline, cross-partner standard; functional consistency with the Google work products; a desire to keep the highest quality master images; a disinclination to discard useful information; and an attempt to minimize overall package size to reduce storage costs. Ingest into the HathiTrust repository will require pre-processing of the original file set described below in order to normalize files to an expected format. This normalization will allow HathiTrust processes to accommodate content from all partners. This process is currently in development and a link to the documentation of the process will be included here, once it is finalized. File Selection Criteria In the following section, the files selected for ingest into the HathiTrust repository are identified, along with a justification for why they were selected. Also listed are files that are available from the Internet Archive, but have not been selected. A description of each file can be found in the All Available Files & Characteristics section below. All files below are ​ ​ ​ named using the Internet Archive identifier, preceding the underscore (ex.
    [Show full text]
  • Overview of the INEX 2009 Book Track
    Overview of the INEX 2009 Book Track Gabriella Kazai1, Antoine Doucet2, Marijn Koolen3, and Monica Landoni4 1 Microsoft Research, United Kingdom [email protected] 2 University of Caen, France [email protected] 3 University of Amsterdam, Netherlands [email protected] 4 University of Lugano [email protected] Abstract. The goal of the INEX 2009 Book Track is to evaluate ap- proaches for supporting users in reading, searching, and navigating the full texts of digitized books. The investigation is focused around four tasks: 1) the Book Retrieval task aims at comparing traditional and book-specific retrieval approaches, 2) the Focused Book Search task eval- uates focused retrieval approaches for searching books, 3) the Structure Extraction task tests automatic techniques for deriving structure from OCR and layout information, and 4) the Active Reading task aims to explore suitable user interfaces for eBooks enabling reading, annotation, review, and summary across multiple books. We report on the setup and the results of the track. 1 Introduction The INEX Book Track was launched in 2007, prompted by the availability of large collections of digitized books resulting from various mass-digitization projects [1], such as the Million Book project5 and the Google Books Library project6. The unprecedented scale of these efforts, the unique characteristics of the digitized material, as well as the unexplored possibilities of user interactions present exciting research challenges and opportunities, see e.g. [3]. The overall goal of the INEX Book Track is to promote inter-disciplinary research investigating techniques for supporting users in reading, searching, and navigating the full texts of digitized books, and to provide a forum for the exchange of research ideas and contributions.
    [Show full text]
  • Harvesting Strategies for a National Domain France Lasfargues, Clément Oury, Bert Wendland
    Legal deposit of the French Web: harvesting strategies for a national domain France Lasfargues, Clément Oury, Bert Wendland To cite this version: France Lasfargues, Clément Oury, Bert Wendland. Legal deposit of the French Web: harvesting strategies for a national domain. International Web Archiving Workshop, Sep 2008, Aarhus, Denmark. hal-01098538 HAL Id: hal-01098538 https://hal-bnf.archives-ouvertes.fr/hal-01098538 Submitted on 26 Dec 2014 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License Legal deposit of the French Web: harvesting strategies for a national domain France Lasfargues, Clément Oury, and Bert Wendland Bibliothèque nationale de France Quai François Mauriac 75706 Paris Cedex 13 {france.lasfargues, clement.oury, bert.wendland}@bnf.fr ABSTRACT 1. THE FRENCH CONTEXT According to French Copyright Law voted on August 1st, 2006, the Bibliothèque nationale de France (“BnF”, or “the Library”) is 1.1 Defining the scope of the legal deposit in charge of collecting and preserving the French Internet. The On August 1st, 2006, a new Copyright law was voted by the Library has established a “mixed model” of Web archiving, which French Parliament.
    [Show full text]
  • Challenges in Sustaining the Million Book Project, a Project Supported by the National Science Foundation
    Clair / J Zhejiang Univ-Sci C (Comput & Electron) 2010 11(11):919-922 919 Journal of Zhejiang University-SCIENCE C (Computers & Electronics) ISSN 1869-1951 (Print); ISSN 1869-196X (Online) www.zju.edu.cn/jzus; www.springerlink.com E-mail: [email protected] Personal View: Challenges in sustaining the Million Book Project, a project supported by the National Science Foundation Gloriana St. CLAIR wisely invested in bringing educational and cultural Director, Universal Library Project resources to a large segment of their constituents. The Dean, Carnegie Mellon University Libraries, Pittsburgh, disadvantage is that, as government budgets tighten, Pennsylvania, USA the funding necessary to sustain a project can be lost. E-mail: [email protected] One great advantage of government funding is that the government wants to serve the whole public. doi:10.1631/jzus.C1001011 Beginning this year, in the U.S., the National Science Foundation now requires principal investigators to explain how the data they have collected will be made One of the main roles I have played as a director available to the larger research community and how it of the Universal Digital Library has been to write will be sustained. In the U.S., the government also grant proposals to support our work. Both for this wants free-to-read access and at the same time allows project and for another project, Olive.org, an archive creators to charge for enhanced versions. of executable content, how to sustain the final product Foundations and other not-for-profit organi- is the most difficult challenge. This paper discusses zations.
    [Show full text]
  • The Internet Archive: an Interview with Brewster Kahle Brewster Kahle and Ana Parejo Vadillo
    The Internet Archive: An Interview with Brewster Kahle Brewster Kahle and Ana Parejo Vadillo Rumour has it that one of the candidates for Librarian of Congress is Brewster Kahle, the founder and director of the non-profit digital library Internet Archive.1 That he may be considered for the post is a testament to Kahle’s commitment to mass digitization, the cornerstone of modern librarianship. A visionary of the digital preservation of knowledge and an outspo- ken advocate of the open access movement (the memorial for the Internet activist Aaron Swartz was held at the Internet Archive’s headquarters in San Francisco), Kahle has been part of the many ventures that have created our cyber age. At MIT, he was on the project team of Thinking Machines, a precursor of the World Wide Web. In 1989 he created WAIS (Wide Area Information Server), the first electronic publishing system, which was designed to search and make information available. He left Thinking Machines to focus on his newly founded company, WAIS, Inc., which was sold to AOL two years later for a reported $15 million. In 1996 he co- founded Alexa Internet, which was built on the principles of collecting Web traffic data and analysis.2 The company was named after the Library of Alexandria, the largest repository of knowledge in the ancient world, to highlight the potential of the Internet to become such a custodian. It was sold for c. $250 million in stock to Amazon, which uses it for data mining. Alongside Alexa Internet, in 1996 Kahle founded the Internet Archive to archive Web culture (Fig.
    [Show full text]
  • Canadian Libraries
    Web Moving Images Texts Audio Software Patron Info About IA Projects Home American Libraries | Canadian Libraries | Universal Library | Community Te xts | Project Gutenberg | Children's Library | Biodiversity Heritage Library | A dditional Collections Search: All Media Types Wayback Machine Moving Images Animation & Cart oons Arts & Music Community Video Computers & Technology Cultura l & Academic Films Ephemeral Films Movies News & Public Affairs Prelinger Archives Spirituality & Religion Sports Videos Television Videogame Videos Vlogs Youth Media Texts American Libraries Canadian Libraries Universal Library Community Texts Project Guten berg Children's Library Biodiversity Heritage Library Additional Col lections Audio Audio Books & Poetry Community Audio Computers & Te chnology Grateful Dead Live Music Archive Music & Arts Netlabels News & Public Affairs Non-English Audio Podcasts Radio Programs Spirituality & Religion Software DigiBarn The Shareware CD Archiv e Tucows Software Library The Vectrex Collection Education Math Le ctures from MSRI UChannel Chinese University Lectures AP Courses fro m MITE MIT OpenCourseWare Forums FAQs Advanced Search Anonymous User ( login or join us) Upload See other formats Full text of "My water-cure : tested for more than 35 years an d published for the cure of diseases and the preservation of health" '^I'.V-i mjisi;iM M>) ¦¦if< I'm '» WB .<¦ sr%(i mtm iW"\ r^ 3K STORE lAST WATER ST. lVAUKEE,Wir FRpyiTHE-FUND-BtiQUbAiHbD-BY- :^zMj^^^ MY WATER-CURE *'Go, and wash seven times in the Jordan, and thy flesh shall recover health, and thou shalt be clean." 4. Kings V, 10. -t/^t ^^ £h.b TEE ONLY AUTEQ-RTZED AND COMPLETE ENGLTSTl EDITION MY WATER-CURE TESTED FOR MORE THAN 35 YEARS AND PUBLISHED FOR THE CURE OF DISEASES AND TEE PRESERVATION OF HEALTH BY SEBASTIAN KNEIPP BECRET CHAMBERLAIN OF THE POPE, PARISH PRIEST OP WORISHOFEN (BAVARIA) TRANSLATED FROM HIS 36^^ GERMAN EDITION.
    [Show full text]
  • NAME: Mary-Jo K. Romaniuk
    CURRICULUM VITAE NAME: Mary-Jo K. Romaniuk PLACE OF BIRTH: Columbia, Missouri USA Citizenship: Canadian and American UNIVERSITY EDUCATION: • PhD Candidate, Queensland University of Technology (expected completion 2012) • Masters of Library and Information Science – San Jose State University • Bachelor of Commerce (With Distinction) - University of Saskatchewan RELATED EDUCATION: 2012 Harvard Graduate School of Education-Leadership Institute for Academic Librarians - August 2012 (accepted - forthcoming) 2007 Frye Leadership Institute, Frye Fellow 2003 Public Participation Certificate Program, International Association of Public Participation 2000 University Management Course, University of Manitoba, Centre for Higher Education 1999 Library Management Skills Institute, Library Manager, Association of Research Libraries 1997 Advanced Facilitation Skills Course, Dr. Donald Carmont 1995 Competitive Intelligence Program, Dr. Jonathan Calof, University of Ottawa 1990 Alberta Best, Government of Alberta 1988 Finalist - Uniform Final Examination, Canadian Institute of Chartered Accountants 1984 – 1987 Student Education Program, Institute of Chartered Accountants of Alberta AWARDS & HONOURS: • Library Journal - 2010 Mover and Shaker • Student Convocation Speaker, San Jose State SLIS – Convocation 2009 • Fellow of the Frye Leadership Institute (2007) • Deans Scholarship – College of Commerce (1982) • Government of Saskatchewan Scholarship (1978) • Catholic Women’s League Scholarship (1978) PROFESSIONAL AND WORK EXPERIENCE University of Alberta,
    [Show full text]
  • Perspectives from Canadian Research Libraries
    Submitted on: May 8, 2013 New frontiers in Open Access for Collection Development: Perspectives from Canadian Research Libraries K. Jane Burpee Research Enterprise and Scholarly Communication, University of Guelph, Guelph, ON, Canada. [email protected] Leila Fernandez Steacie Science and Engineering Library, York University Libraries, Toronto, ON, Canada. [email protected] Copyright © 2013 by K. Jane Burpee and Leila Fernandez. This work is made available under the terms of the Creative Commons Attribution 3.0 Unported License: http://creativecommons.org/licenses/by/3.0/ Abstract: As the push for open access (OA) burgeons around the globe, it is important to examine OA as it relates to collection development practices. Canada has its own particular set of characteristics and approaches to service delivery based on its history and context. Like our global colleagues, opportunities for collection development in Canada include the support of OA journals, repositories, monographs and electronic theses. The strengthening of OA in Canada is tied closely with other issues. Political and educational realities as well as geographic spread are affecting the way the movement is strengthening and impacting collection development practices. In this context, we share the results of a study examining the scholarly communication landscape in Canadian research libraries. The results of interviews with librarians, who are leaders in scholarly communication activities at their own institutions, showcase the prominent role OA plays in enhancing collections at Canadian institutions. Collaboration and the role of cooperative collection development are covered. The paper concludes with recommendations for strengthening access to open scholarship in libraries regardless of their geographic location. Keywords: Open Access; Collection Development; Canadian Research Libraries; Interviews; Scholarly Communication 1 1 INTRODUCTION Open Access (OA) is defined as literature that is digital, online, free of charge, and free of most copyright and licensing restrictions (Suber, 2013).
    [Show full text]
  • Web Archiving Supplementary Guidelines
    LIBRARY OF CONGRESS COLLECTIONS POLICY STATEMENTS SUPPLEMENTARY GUIDELINES Web Archiving Contents I. Scope II. Current Practice III. Research Strengths IV. Collecting Policy I. Scope The Library's traditional functions of acquiring, cataloging, preserving and serving collection materials of historical importance to Congress and the American people extend to digital materials, including web sites. The Library acquires and makes permanently accessible born digital works that are playing an increasingly important role in the intellectual, commercial and creative life of the United States. Given the vast size and growing comprehensiveness of the Internet, as well as the short life‐span of much of its content, the Library must: (1) define the scope and priorities for its web collecting, and (2) develop partnerships and cooperative relationships required to continue fulfilling its vital historic mission in order to supplement the Library’s capacity. The contents of a web site may range from ephemeral social media content to digital versions of formal publications that are also available in print. Web archiving preserves as much of the web‐based user experience as technologically possible in order to provide future users accurate snapshots of what particular organizations and individuals presented on the archived sites at particular moments in time, including how the intellectual content (such as text) is framed by the web site implementation. The guidelines in this document apply to the Library’s effort to acquire web sites and related content via harvesting in‐house, through contract services and purchase. It also covers collaborative web archiving efforts with external groups, such as the International Internet Preservation Consortium (IIPC).
    [Show full text]
  • Web Archiving Environmental Scan
    Web Archiving Environmental Scan The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Truman, Gail. 2016. Web Archiving Environmental Scan. Harvard Library Report. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:25658314 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Web Archiving Environmental Scan Harvard Library Report January 2016 Prepared by Gail Truman The Harvard Library Report “Web Archiving Environmental Scan” is licensed under a Creative Commons Attribution 4.0 International License. Prepared by Gail Truman, Truman Technologies Reviewed by Andrea Goethals, Harvard Library and Abigail Bordeaux, Library Technology Services, Harvard University Revised by Andrea Goethals in July 2017 to correct the number of dedicated web archiving staff at the Danish Royal Library This report was produced with the generous support of the Arcadia Fund. Citation: Truman, Gail. 2016. Web Archiving Environmental Scan. Harvard Library Report. Table of Contents Executive Summary ............................................................................................................................ 3 Introduction ......................................................................................................................................
    [Show full text]