Hi! Good morning, all, and thanks for joining us. I’m Karl Blumenthal. I’m a web archivist for the Internet Archive’s “Archive-It” service and partnership community. And to begin our discussion of collaborative web archiving I’d like to introduce a little bit of web archiving’s history and how in fact it was collaboration among many different archivists, technologists, and organizations that made the practice what it is today, and indeed how the lessons learned from that early collaboration are just as vital and important to new web archivists and their subjects today as they ever were, which I think Amy and Sam can then demonstrate in even more living color.
So before we dig any deeper into this topic we can first just agree on some specific terminology. What we mean when we say “web archiving” is something like this: it’s the process of collecting, preserving, and ultimately enabling end-user patron access to materials originally published to the web. There are myriad reasons why libraries and archives perform this labor, but in general, you may find that the materials you have traditionally collected in print, bound and serial forms, have increasingly shifted to a web-based publishing paradigm--that local organization or academic department might no longer send you their materials on paper but instead may share it all online; and indeed your organization itself may need to meet its own records retention mandate by preserving materials only published to its website or even the website itself; and increasingly web archiving is a means to preserve and provide enduring access to events and conversations that exist entirely online, like movements with social media presences.
And whatever their specific goal, each web archivist engaged in this work as a result mitigates the threat of what we call Link Rot, the loss of content found at the other end of live links, such as increasingly appear in the citations of journal articles, book chapters, even court decisions, so that everyone can still find what they’re looking for online instead of those universally dreaded “404: File not found” error messages.
This work takes different precise forms, based on the needs and goals, but more often than not it looks generally something like this: an archivist or selector of some kind at a computer terminal identifies a website or web page that they want to collect; they acquire it using software--not always but most frequently a web crawler--which deposits it into local or networked storage--again not always but most commonly in the form of what we call a Web ARChive or “WARC” file; that file can thereafter be read through browser-based software that knows how to interpret WARC files--just as you would use Microsoft Word or the like to read a DOC file--and renders the archived pages the way we would expect to see and browse through them as they appeared at the time that they were archived.
Now, if you’ve never done any web archiving yourself but you’ve heard of it before, it might be because of my organization, the Internet Archive. Since 1996 we’ve been a non-profit digital library based in San Francisco--this is our actual headquarters in an old converted (but not very converted!) Christian Science church near Golden Gate Park. And from here we host and serve millions of books, movies, audio recordings... software and games. Increasingly we’re collecting born-digital artifacts and even the broader software environments that are necessary to access them like we might have originally done at a school computer lab thirty years ago. But we’re likely most universally known to this day for the Wayback Machine.
That’s our web-spanning archive--one of those pieces of rendering software that I mentioned a moment ago--at archive.org/web, which provides access to the web as we’ve collected and preserved it since 1996. And you can find all sorts of interesting stuff in there! Say for instance I wanted to know what the Archivists Round Table of New York were up to in 2000: just from the comfort of my laptop or tablet, I can do that now. New England Archivists, I see you too. All the way back to 1996, this time!
So the Wayback Machine is obviously pretty useful in its way; it’s a vital resource to historians of technology and the web and its cultures in particular, of course. Increasingly though it’s also become singularly important to journalists, activists, government watchdogs, artists, and really anyone invested in not throwing web-published information down the memory hole when it becomes politically or socially inconvenient to provide. I’ll talk a little bit further about that one in just a moment. But in the meantime really it’s proven its value in countless individual stories of writers long since disconnected from their old and shuttered blog sites, students and teachers looking for that long lost course syllabus -- all of those things that deserved but didn’t have an archive.
There are myriad stories like these and they begin to stretch the Wayback Machine to its boundaries. The trove is vast, to be sure--some 500+ billion individual URLs preserved over the last 22 years. And still, with that vast extent, it can present an incomplete or unsatisfying record. And in fact you can summarize its limitations as a function of not having enough archivists. What we long had here was a reflection of as much as we could gather from the entire expanse of the web by using highly automated web crawling tools.
What it lacks then is depth--oftentimes we don’t have more than a site’s homepage--or coherence--there’s no regular, predictable frequency at which you can rely on seeing captures of the URL most important to you; to this day the only really reliable way to find a preserved URL is to go straight to it, there’s no arrangement or description of holdings. These are all perfect demonstrations of what distinguishes a back-up, like Wayback, from an archive, like the people in this room would create--something that has been critically curated to contain the right content, in an order and with the context needed to understand what it is and means.
By about 2005 we’d found that there was a critical mass of archivists who needed something similarly modeled but under their own intellectual control. And with 10 pilot partners to help build and beta test it we developed what came to be known as “Archive-It,” a suite of software and a non-profit partnership model that enabled archivists to create their own web archives--their own miniature versions of the Wayback Machine, if you like, but they could enrich that corpus too in the process--with the tools to decide precisely what gets acquired and when, to arrange and describe those holdings and to ultimately enable access to them through a front-end web-based interface, and to maintain and manage multiple redundant copies of those WARC files in their network and locally for preservation purposes, since as we know “lots of copies keeps stuff safe.”
We started with the geographically, thematically, and professionally diverse group that you see here. Just for example our early partners in state archives and libraries succeeded in collecting and preserving government records that had no print analog, such as the website of the Bob McDonnell campaign and later administration in Virginia, over on the left. The later web archive of Governor Tim Kaine’s administration got a lot of attention and use some years later, as you might imagine.
And these archivists were even in fact the first to help us at the Internet Archive build out our capacity for capturing social media. The State Archives and Library of North Carolina, for instance, were keen to preserve Governor Bev Perdue’s communication with constituents over Facebook and Twitter since she was likewise among the first to really heavily invest in those notoriously ephemeral or walled-off platforms to reach them. You can now find these and many more captures in the Wayback Machine, and even better you can browse through or search for them on our website, archive-it.org. Just like a real archive! It worked!
From a few pilot partners the community of web archivists has grown, and since I made this graph a couple of weeks ago it has topped 600 different organizations and institutions. Each one contributes to the self-sustaining non-profit model by paying to keep the lights on--that’s what it takes to keep copies of their collections online from our data centers in the Bay Area--or through tool development to make sure that our crawlers and rendering tools are the best that they can be, or by doing further outreach of their own. From the smallest one- or two-person non-profit to the big R1 universities, each in the process is empowered to fulfill its own specific mission by using the shared technology stack to collect those unique materials that it knows best, be that the institutional records of a small town or municipality... or J.R.R. Tolkien fandom, if you happen to be Marquette University.
It’s a model that’s worked for services with similar goals like HathiTrust and which we’ve seen increasingly in the digital preservation realm, as archivists across institutions decide that it’s really better to share infrastructure than to always roll their own.
Very early in Archive-It’s life though, there began to be interest not only in sharing the hardware and software to build web archives, but indeed in sharing the appraisal and selection responsibilities; to use good old fashioned archival practice, documentation strategy in this case, to again make web archiving achievable at what would otherwise be an unattainable scale.