
Web Archiving and You and Us

Amy Wickner University of Maryland Libraries Code4Lib 2018

Slides & Resources: https://osf.io/ex6ny/

Hello, thank you for this opportunity to talk about web and archiving. This talk is about what stakes the code4lib community might have in documenting particular experiences of the live web. In addition to slides, I’m leading with a list of material, tools, and trainings I read and relied on in putting this talk together. Despite the limited scope of the talk, I hope you’ll each find something of personal relevance to pursue further.

“the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use”

International Internet Preservation Consortium

To begin, here’s how the International Internet Preservation Consortium or IIPC defines web archiving. Let’s break this down a little. “Collecting portions” means not collecting everything: there’s generally a process of selection. “Archival format” implies that long-term preservation and stewardship are the goals of collecting material from the web. And “serving the archives for access and use” implies a stewarding entity conceptually separate from the bodies of creators and users of archives. It also implies that there is no web archiving without access and use. As we go along, we’ll see examples that both reinforce and trouble these assumptions.

A point of clarity about wording: when I say for example “critique,” “question,” or “trouble” as a verb, I mean inquiry rather than judgement or condemnation.

we are collectors

So, preambles mostly over. Between us, we here in this room and on the livestream and so on represent collectors of web-based material, subjects of captured , and users of web archives. To confirm, let’s do a quick poll. Please raise your hand if you are a collector of web-based material. Thank you.

Web design & development

Browser technology

Personal tech

Digital culture

Web archiving technology

Costs & impact of storage

Labor

Federal policy

Corporate policy

Information ethics

Collaboration

Attention

As individuals, collectives, and agents of institutions building web archives, we manage many moving parts in attempting to document even a small part of the living web. Here’s a short list of areas that influence our practices. Web development affects the archivability of websites. Personal tech influences how people produce and consume web-based material. Regime change leads to federal policy change like the end of , content appearing and disappearing, large-scale collaborations like the End of Term , and the moving servers to Canada. Corporate policies and practices like terms of service, DRM, startup churn, and data selling influence both archiving and live use of the web, as we heard Wednesday in Mark Matienzo’s overview of IndieWeb. And trends in ethics include growing discussion of privacy as contextual integrity, particularly in online spaces, as well as the right to be forgotten.

“What constitutes archival value is, and will always be, specific to place, time, culture and individual subjectivity. It does not dangle somewhere outside of humanity, immutable, pristine, transcendent. The appraiser creates, or recreates, archival value with every appraisal exercise.”

Harris 1998

As collectors, we also work within the specific contexts of our biases. This is an appraisal practice, in which collectors assign value to material and take actions accordingly. We aren’t always able to articulate these criteria, nor do we always itemize the actions taken -- colloquially lumping it all together in the word “save.” Appraisal as a fundamental archival practice has been hotly and insularly contested for more than a century -- and I have a syllabus to share if you’re at all curious. This [POINTS TO SLIDE] is one of the more accessible and increasingly relevant approaches, articulated by Verne Harris in 1998. He argues that appraisal is where power is most concentrated in , and that it’s closer to storytelling than to a science. Trying to articulate that story is one way we can grow as web archivists.

How can we better use web archiving technologies?

Competent, critical, curious

Learn, teach how it works

Foreground labor

Put faces to names behind infrastructure

Blewer 2017; Arquivo.pt 2018

Improving our appraisal also comes down to being not only competent but also critical and curious users of web archiving tools. As starting points, I recommend two posts by Ashley Blewer: one that approachably introduces the technical side of popular web archiving frameworks; and one that explains the links between archivability and accessibility. If you find yourself in a position to teach with web archiving, try to scaffold learning around not just how to use things but also how they work.

Let’s also look at how labor impacts web archiving: How much time do you spend on different parts of the process? What kind of work is web archiving? Put faces to the names of developers, archivists, and designers behind what you collect and how. The developers at Arquivo.pt, the Portuguese web archive, put out a pretty honest video last week describing the process behind their work, including some recent struggles and decisions around improving services. Documentation like this gives us insight into the care of web archives.

we are subjects

So that’s an overview of how collectors and web archiving mutually shape one another. Next let’s consider how web archiving impacts subjects represented in web archives. And in fact, much of the power archivists wield is in the description or that tells the story of a and its subjects. Please raise your hand if you’re a subject represented in web archives. Trick question: it’s all of us.

How are we represented as subjects?

identity

safety

privacy

access

accessibility

exclusion

harm

To get at why representation in web archives even matters, consider how you and I are represented on the internet. Many of us come face to face every day with the reality that the web is not so friendly to us, that it’s not built for us to use, and that it is designed to propagate and privilege limited, harmful representations, all of which have real impacts on our well-being. Spaces nominally designed for participation -- , , , I don’t really want to go on -- can be some of the most unwelcoming.

There’s the tumultuous experience of trying to manage our own identities and safety online. Appeals to and for the right to be forgotten elicit cries of “Shame!”, of government overreach, and of the dangers of censorship for accountability and democracy. Protecting privacy is now treated as an individual rather than a collective responsibility: the admirable work of the Electronic Frontier Foundation, Library Freedom Project, and more only emphasizes that institutions and their policies currently trend towards a lack of respect for privacy. And maybe they always have.

So. We know the power of archival representation. We know that bias, mis- and underrepresentation, are rampant on the web, including in highly participatory spaces. We’re aware of access and accessibility issues in all of the above. So what warrant could we possibly have to assume web archives would be any different?

Future users

Current users

Audience

Designated community

Web archives perpetuate unresolved issues that affect us as subjects and for which we bear responsibility as collectors. Communicating the context of web archiving to people or robots who might use the results is one way to confront these issues. There were some raging Western archival debates in the mid-to-late 20th century about whether and how archivists should envision a future user or user community when building collections, and I assume there are small fires of this kind smoldering today. In the world, and in any system even nominally based on the Open Archival Information System (OAIS), we assume a designated community to justify preserving certain data. It can be illuminating to examine one’s assumptions about audience.

we are users

Now, please raise your hand if you use web archives. Thank you.

How do people use web archives?

remix

critique

“plural and heterogeneous archives”

legal evidence

historical evidence

receipts

Post 2017; Taylor 2017; Belovari 2017; Milligan 2016; Zannettou, et al. 2018

How do people use web archives, anyway? Artists use them as source material for remix and critique, as what Colin Post calls “plural and heterogeneous archives.” Courts are starting to understand the legal uses and limits of web archives as evidence, including whose interests such evidence and cases tend to serve. Historians have used web archives to study political discourse and engagement, among other topics, but it’s also been pointed out that today’s web archives are so little conducive to historical research methods that it may not be strictly accurate to refer to them as “the historical record.” And of course, journalists and the general public use them for RECEIPTS. Maybe you see yourself in one of these categories, although I’ve left so many out. Let’s think about lines of inquiry we can take as users of web archives, to critically read, much as we learn to critically build.

EXERCISE: postcolonial critique

● How do web archives reflect or suppress values of the people they represent?
● How are the decisions behind designs, appraisal, and access obscured or revealed?
● What are the labor practices behind web archives and archiving technologies?
● What are the environmental impacts of web archiving?

Anderson 2002

Fundamentally, this means approaching them as having been constructed in different ways and for a variety of purposes. Just like other archives, the narrative of how they’re built is closely tied to how they represent the world. If you’re super familiar with web archives and archiving, you can try “making them strange.” What would a complete newcomer think and see?

One exercise I tried recently is looking at web archives through a few different postcolonial lenses for the study of science, technology, and society. Here are some of the questions that came out of that, and you can check the resources doc for some slightly longer but still relatively short takes on postcolonial critique, which as you may notice don’t necessarily play nicely together. The point is you can ask questions about web archives through an existing critical framework, or roll your own, whatever helps you in that process of making strange and digging deeper.

Who put this here? Who decided to put it here? Who did the work? What isn’t here? Why not?

Wiggins 1996; Bingham 2017

Ask questions like, “Who put this here?” “Who decided it should be preserved?” “Who did the work?” “What has not been documented or preserved?” And people have been doing this for years. One of the earliest articles in First Monday, from the August 1996 issue, concerns the “Mysterious Disappearance of the White House Speech Archive” from the Clinton administration’s . As collectors, we can try to anticipate and answer these questions. For example, the UK Web Archive does a good job of identifying gaps in its record and where they come from, describing collection criteria, showing the literal forms used to include and scope websites, and visualizing the kinds of data in their collection.

we can DIY (why?)

As an extension of critically reading web archives and archiving, members of the code4lib community might explore DIY web archiving. Why might we want to do this?

Personal digital archiving

Managing everyday digital stuff

Memorials

More web archives are better

Archiving for obscurity

Lepore 2015; Mayer-Schönberger 2009; Eveleth 2017

Personal digital archiving is one reason. Recording your Twitter feed (before quitting forever and joining Mastodon!); capturing your website before changing platforms; getting organized. Personal web archiving might also take the form of memorials, bringing together online evidence of a loved one. It’s likely that more web archives are better, both for representation and redundancy. The more people care enough to save a part of the web, the more likely it is to survive. There’s also the semi-extreme project of creating vastly more web archives in order to bury something you want to hide. This idea is borrowed from the book Delete and from an episode of the podcast Flash Forward about AI-generated video and compromised identity. Very impractical, but it’s out there.

Image source: https://i.ytimg.com/vi/qdu3fl1gOtU/hqdefault.jpg

Some web archiving speaks more to the work world. In the most recent round of updating documentation, I replaced every link with a Perma.cc, WebRecorder, or Wayback Machine URL, just in case. I’ve suggested to colleagues that they use small-scale web archiving -- and draw on existing collections like our institutional Archive-It -- to document their work for performance reviews, tenure dossiers, and other favored activities of the neoliberal university. Perma.cc itself emerged from law librarians’ concerns about link rot, which continues to be a problem in legal opinions, scholarly communication, and beyond.

Image description: document emoji; logos for WebRecorder, Perma.cc, and Wayback Machine; screenshot from a YouTube video with title text reading, “Neoliberalism in Higher Education”

Preserving the news

“Hopeless or not writers shouldn't have their work erased, for God's sake. You worked hard for years and this is your future not just past.”

“We hope that sites that can't simply be made to disappear will show some immunity to the billionaire problem.”

https://twitter.com/ftrain/status/926313561462329344; Freedom of the Press Foundation 2018

So how we preserve our work impacts how social institutions continue to run and knowledge continues to circulate. You’ll remember the sudden shutdown of DNAinfo and Gothamist last November, after which journalists scrambled to recover copies of their clips. Paul Ford generated and posted a list of 12,000 Gothamist articles, bylines, and archive.org links to help them. Recently, the Freedom of the Press Foundation and Archive-It partnered to “archive the alternative press threatened by wealthy buyers.” Watch the documentary Nobody Speak for a good idea of the threat.

Online creativity / archives / community

fan archives

archiving through repertoire

online libraries

video evidence

De Kosnik 2016; Radjy 2018

Many of us pour labor into online creativity and community, where critical web archiving is also valuable. Not news to anyone here, but people treat websites as repositories all the time. From the Marxists Internet Archive to the Egyptian Mosireen YouTube archive collecting video evidence of police abuse, putting text, video, sound, and images online is a means of preservation, if not necessarily through long-term, secure, geographically diverse storage and archival formats, then through use. The book Rogue Archives describes how online fan archives are often built through the “repertoire” or repetition and transmission of archiving practices: Members of a community archives encourage users to contribute content, including stuff that remixes existing content -- thereby reproducing their particular archival practice.

And this is also where we can start to expand the scope of how we think about “web archiving,” to push at elements of the IIPC definition like “archival format” and “access and use.”

#VinesWithoutVines

https://twitter.com/eveewing/status/929790289304346624; https://twitter.com/eveewing/status/929793226025947136

Two Twitter threads have gotten me to expand my thinking recently about the terms and uses of web archiving. This one was started by the sociologist and poet Eve Ewing, and features text retellings of Twitter users’ favorite Vines, RIP. The result was a shared creative and memory practice that also continued for a few hours on the hashtag #VinesWithoutVines. #VinesWithoutVines reanimated and celebrated precarious web content, and long-term preservation and access were not really the point. This thread also led to a personal first of contacting someone’s creative agency for permission to screenshot a tweet for public presentation.

“I wanna create a thread called #GifHistory. Send a gif that you want to know the backstory to and we'll try to find the original video.”

@MatthewACherry

https://twitter.com/MatthewACherry/status/962011241815277568

The other thread was started last week by the filmmaker Matthew Cherry, using the hashtag #GifHistory. In this thread, Cherry matches super well-known reaction gifs with the videos they came from. The first person I saw retweet it added, “This is really important web archiving right here” and I’m inclined to agree. Again, I don’t know that #GifHistory itself lasted or is going to last that long but, like #VinesWithoutVines, it pushes the boundaries of how we think about web archiving. Some of us, running across either of these hashtags, would be all, “Someone needs to save this!” or, more tone-appropriate to archives Twitter, “I really hope someone is saving this!” And certainly it’s important to make sure people in the future can learn about a rich array of internet lives. But while springing into action to “save” something from the internet, consider whether it’s yours to save, whether it was meant to be saved, what value you add by saving it and, if you decide to go ahead, how you can properly contextualize what you save to reflect this background.

“I believe they are important and ought to be preserved, but this doesn’t necessarily translate to unrestricted, public access.”

de jesus 2014

In an example of what this critical process looks like, nina de jesus gave a talk at the 2014 Gender and Sexuality in Information Studies Colloquium about building a web archive of by, in her words, “marginalized people.” She approached the project as a newish insider, running on blogs that she worried would soon be taken down by their authors in response to exploitation via plagiarism. Her talk centers on the importance of consent and agency in preservation, and this is what she concluded. So as part of understanding how we collect, use, and are represented in web archives, consider also how preservation and access are at times linked and at times separate, how access might be neither universal nor given, and how sometimes there are good reasons for that. Asking these questions can be a core part of doing our own web archiving. The practical details, like which tools work for you, will depend on your critical readings, not just on how comfortable you are with learning software, using the command line, and so on -- although those are non-negligible factors as well.

One way to participate in web archiving is to add URLs to an existing web archive. Internet Archive, Archive.is, Perma.cc, and WebCite accept single-page submissions of this kind.

If you prefer to store your own material and feel good about using command-line utilities (and I really don’t want to assume anything here), wget is a flexible, free utility that follows HTML links to recursively download files from web servers following specified parameters. (Thank you to Sam Abrams, Andrew Berger, and Jarrett Drake for pronunciation help!) Some everyday use cases from my personal life: downloading blog posts or grabbing mp3s of a bunch of podcasts at once. Wget focuses on web-based content rather than performance or experience. There are other utilities for downloading specific kinds of content, including youtube-dl for video, twarc for tweet data, and so on. Opinions vary on whether web archiving strictly means preserving entire websites, or if it applies broadly to preserving any evidence of the web.
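For anyone who wants to see what "specified parameters" can look like in practice, here is a minimal sketch in Python that assembles a bounded, polite recursive wget command. The flags are standard GNU wget options, but the helper name `build_wget_command`, its defaults, and the example URL are my own illustrative assumptions, not settings from the talk. Check `--warc-file` against your installed wget version, since older builds lack WARC support.

```python
import shlex


def build_wget_command(url, depth=2, wait_seconds=1, warc_name=None, accept=None):
    """Return an argv list for a bounded recursive wget download.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    argv = [
        "wget",
        "--recursive",             # follow links found in downloaded HTML
        f"--level={depth}",        # stop after this many link hops
        "--no-parent",             # never climb above the starting directory
        "--page-requisites",       # also fetch images, CSS, and scripts
        "--convert-links",         # rewrite links so the local copy browses offline
        f"--wait={wait_seconds}",  # pause between requests to be polite
    ]
    if accept:
        # Keep only files with these suffixes, e.g. ["mp3"] for podcasts.
        argv.append("--accept=" + ",".join(accept))
    if warc_name:
        # Additionally write the capture to a WARC file with this base name.
        argv.append(f"--warc-file={warc_name}")
    argv.append(url)
    return argv


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    cmd = build_wget_command("https://example.org/blog/", warc_name="blog-capture")
    print(shlex.join(cmd))
```

Printing the command rather than executing it leaves room to review scope (depth, parent directories, file types) before anything is downloaded, which fits the appraisal questions raised earlier. To run it, the list could be passed to `subprocess.run`.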

WebRecorder is designed to capture dynamic, interactive elements of websites by recording what happens in browser and network as site content loads and a user interacts with it. You can draw from existing internet archives and the live web to “patch” missing content, and extract pages from other archives. Use cases might be documenting interactive fiction, games, artwork, and anything built with Flash. The idea is capturing a specific experience involving a particular user, browser, and settings. I see a lot of parallels with “Let’s Plays” and other instances of preservation through use and capturing participatory experience.

A few steps further along in terms of technical demand is Heritrix, the Internet Archive crawler. Crawlers start with a seed URL; copy it; identify, follow, and copy links out from that page; and continue on until reaching a specified limit in terms of domain, documents, data, and so on. Heritrix is integrated into a number of other web crawling suites.
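The crawl loop just described can be sketched in a few lines of Python. This is a toy illustration of the general idea, not how Heritrix itself is implemented: the `crawl` function, the `fetch` callback, the same-host scope rule, and the document limit are all assumptions made for the example. Using an in-memory "site" keeps the sketch runnable without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_documents=100):
    """Breadth-first crawl from `seed`, staying on the seed's host.

    `fetch` is any callable mapping a URL to HTML text, or None on failure.
    Returns a dict of captured URL -> HTML, in capture order.
    """
    seed_host = urlparse(seed).netloc
    queue = deque([seed])   # URLs waiting to be copied
    seen = {seed}           # URLs already queued, to avoid loops
    captured = {}
    while queue and len(captured) < max_documents:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html  # "copy it"
        parser = LinkExtractor()
        parser.feed(html)     # "identify links out from that page"
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            # Scope limit: only follow links on the seed's host.
            if urlparse(link).netloc == seed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


if __name__ == "__main__":
    # A pretend three-page site; dict.get stands in for an HTTP client.
    site = {
        "https://example.org/": '<a href="/a">a</a> <a href="https://other.example/x">x</a>',
        "https://example.org/a": '<a href="/b#top">b</a> <a href="/">home</a>',
        "https://example.org/b": "end",
    }
    for captured_url in crawl("https://example.org/", site.get):
        print(captured_url)
```

Swapping the host check or the document cap for a byte budget or a depth limit changes what ends up in the archive, which is one concrete place where appraisal decisions get written into configuration.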

There’s a lot of current research on tools that could lower barriers in web archiving. For example, the Web Science and Digital Libraries Research Group at Old Dominion University are working on an Interplanetary Wayback Machine based on the Interplanetary File System, a Chrome extension called WARCreate that creates a WARC file from what you see in your browser, and a standalone application called WAIL for replay of same. You can find links to these tools and several more in the resources list.

Image description: logos for Archive.is, Perma.cc, WebCite, Save Page Now; screenshot of wget in command prompt; logos for Heritrix, WebRecorder, InterPlanetary Wayback, WAIL, and WARCreate

We b Archiving; and You? We, barkiving, and Us

Ultimately web archiving is about capturing and recapturing aspects of the experience and performance of the live web, and it’s up to collectors, users, and subjects to negotiate together what exactly that means.

Image description: formal portrait of two dogs side by side against a black background