Musso the Digital Dark

Today I would like to talk about DIGITAL RESOURCES and what they can give to historical research. Digital sources may seem like something that does not concern us historians today, but twenty years from now we will not be able to investigate any aspect of history without taking these sources into consideration, because this is what our era is producing now. Above all, they represent possibilities that no previous generation of historians could enjoy. The utopia of the Library of Alexandria, the Universal Library which holds all knowledge and is accessible to anyone, is potentially something that the internet could give us. However, just as the Library of Alexandria burned to the ground, our universal and fast-growing digital heritage is in danger. I called this presentation “The digital dark age” to focus on the problem of preserving our digital cultural heritage, which has some specific features that make it particularly vulnerable. I would also like to stress that when I talk about digital sources I don’t just mean documents created since the arrival of computers and the internet, which will be useful historical sources to contemporary historians today or to historians in general 100 years from now: digital sources matter to anyone engaging in historical research today.

1
I would like to start with this: a digitized copy of the Magna Carta from the British Library, http://www.bl.uk/collection-items/magna-carta-1215
- Zoom: you can access the whole document in a simulation of what it looks like
- Metadata on the document: where, when, copyright, description
- Transcript
So this makes the Magna Carta a digital source, like any other document, from any era, that has been digitized.

2
I like to divide digital sources into three big categories:
a) DIGITIZED SOURCES: any type of document that was not originally created as a digital object (whether text, video, or sound) and that was subsequently digitized and made into a digital object, whether a pdf, mov, jpeg, or whatever format a computer can read. EX: MAGNA CARTA
b) BORN-DIGITAL SOURCES: all files that were created digitally: pictures, texts, music scores, sketches, this PowerPoint… pretty much anything that is produced today, from offices to artists’ studios to government assemblies
c) ONLINE SOURCES: sources that are created not only as digital files, but as online objects, on the internet basically. Tweets, Facebook statuses, blogs, newspapers, etc. These objects have the further characteristics of being publicly accessible and much more interactive: think of how many times a day a newspaper website is updated, or the process of retweeting tweets, or the comments that we can all leave on the web in many ways. The Guardian website is nothing like its paper copy, and it is also very different from a pdf version of the paper copy. Also, online sources are composed of different media: text, images, video,

3
I would like to say something about the process of digitization, which potentially affects every historical source from before the arrival of digital records. Digitization is important for many reasons:
- providing online, potentially universal access to information held on paper
- reducing wear and tear to historical records (imagine how many more people can access the Magna Carta online without having to actually touch it)
- retaining the appearance of the original artifact
- reducing the problem of SPACE
- improving searchability (e.g. OCR)

4
There is no single digitization technique: techniques vary a lot according to the type of medium you want to digitize. Standards are under review, but good guidelines are provided, for example, by JISC, the UK association for the usage of digital media.
The easiest documents to digitize are text and still images, which don’t require particularly complicated technologies and are often digitized DIY by archives, usually with digital cameras or scanners. These are:
- an old picture of people at King’s College
- the other is a picture of a document I took in the historical archive of the Italian oil company, and it is a terrible example of how to digitize: I just did it with a small camera for personal use.

5
From the JISC guidelines:
- Copies must not be enhanced or modified. So no cropping, no Photoshop.
- Each page must be copied on its own.
- Weights or sheets may be needed to flatten documents.
- A colour checker and ruler must be included on every page so as to show the actual dimensions and colour of the object.
- The entire page should be included; the edge of the paper must not be cropped out of view. If you are photographing a bound volume, the margin should be included.
Resolution: 300 dpi for reference-only, 600 dpi for actual preservation. Usually two files are created, one at lower and one at higher resolution.
Colour settings: bit-depth relates to the level of colour that will be captured. A ‘bit’ is the binary digit that represents the tonal value of the pixel. Generally speaking, a 1-bit image is black and white, an 8-bit image has 256 shades of either grey or colour, and a 24-bit image has millions of shades of colour.
Format: these are jpeg, tiff, pdf, etc. The difference lies in whether the format is compressed or uncompressed and, if compressed, whether it is lossy or lossless. Usually jpeg is used for low-res, compressed storage and tiff for high-res storage. Usually each

6
Digitization of audio and video objects is much more complex and often requires professional equipment. Usually only archives that specialize in audiovisual material (like the BBC archive) do the digitization themselves. Most of them outsource the job to specialized companies.
The original object (tape, record, film) must be played on its original reader (recorder, record player, etc.), connected to a computer with special cables, and run through a programme which converts the information into digital form. For example, AUDACITY is a free, easy-to-learn piece of software.

7
One great advantage of digital files is that they do not degrade or lose quality with repeated use (as tapes, record albums, or books do). They can also be copied repeatedly without any loss or alteration. This leads to big problems related to copyright and intellectual property, but I will leave them for another time. The digital object will not be catalogued in a folder or on a shelf anymore, but it will still be in a folder in a directory. These folder trees are set up by the archives according to specific guidelines. However, researchers usually do not access the folder tree: they carry out their research through a search engine interface, whether online or in the archive’s offline catalogue.

8
We should never forget that these documents are kept on servers and hard drives, so they do occupy a physical space, just a different one. This physical space is actually very fragile (I’m sure you have all broken a hard drive by dropping it or just through a power malfunction). One big difference between physical archives and digital archives is that we do not tend to preserve the objects, say through restoration, but the data. The hard drive is not a historical source; it is its content that matters. The new frontier of historical preservation is DATA MIGRATION, the process of transferring data between storage types, formats, or computer systems. As digital storage technology progresses, data will be migrated periodically to new formats. I think the suggested standard is to migrate every 10 years.
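
A data migration like the one just described is usually paired with a check that nothing changed in transit. The following is a minimal sketch, not any archive's actual procedure: it assumes the collection is an ordinary folder tree, and the names `sha256_of`, `migrate`, `old_drive` and `new_drive` are hypothetical, chosen only for illustration. It copies every file to the new location and compares a SHA-256 checksum of the original and the copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source_dir: Path, target_dir: Path) -> dict[str, str]:
    """Copy every file under source_dir to target_dir and verify each copy.

    Returns a manifest mapping relative paths to checksums, which can be
    stored and re-checked at the next migration.
    """
    manifest = {}
    for source in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(source_dir)
        target = target_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies content and timestamps
        before, after = sha256_of(source), sha256_of(target)
        if before != after:
            raise IOError(f"checksum mismatch after copying {relative}")
        manifest[relative.as_posix()] = after
    return manifest


if __name__ == "__main__":
    # Hypothetical demo: a throwaway "old drive" with one sample record.
    with tempfile.TemporaryDirectory() as tmp:
        old_drive = Path(tmp) / "old_drive"
        new_drive = Path(tmp) / "new_drive"
        (old_drive / "charters").mkdir(parents=True)
        (old_drive / "charters" / "sample.txt").write_bytes(b"sample record")
        for name, checksum in migrate(old_drive, new_drive).items():
            print(name, checksum)
```

Keeping the manifest alongside the data is one way to notice silent corruption between migrations; the old drive can then be kept as an extra copy rather than discarded.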

The file size is the size of the computer file of your image. It is measured in bytes. The larger the file size, the more disk space (storage space) the file will take up on your computer. Bytes (1 byte = 8 bits) are often grouped into kilobytes or KB (1000 bytes).

9
This in red is the old one, these in black are the new drives. This goes on and on for several kilometres. This is “The Internet Archive”, a non-profit organization founded in San Francisco in 1996 whose purpose is to collect, preserve, and make available to the general public all historical collections that exist in digital format. The Internet Archive includes pictures, websites, music, moving images, and over three million public-domain books. It is an umbrella archive, as it both acquires digital sources itself and links to different collections around the world.

10
This is additional storage purchased for the archive. The main problem with these archives is that they need constant power and generate excess heat. For example, the Internet Archive’s Petabox system uses the heat the hard drives generate to heat the building.
- Also, separate data centres, so that physical damage only affects one part
- old drives kept as an extra copy, not thrown away
The Internet Archive alone currently holds 50 petabytes, that is 50,000 terabytes (1 petabyte = 1000 terabytes), which corresponds to roughly:
- 6 million books
- 400 billion webpages
- 3,800 films
- 350,000 news programmes
- 200,000 audio recordings
- 100,000 pictures
This order of magnitude takes me to the next problem I would like to describe: the

11
This is what happens on the internet in 60 seconds. And this is just the monitored websites; on top of that there are all the non-online digital sources and the digitized sources. The amount of information available is skyrocketing. Big data, sampling and social science approaches often seem to be the only way to navigate this ocean of digital sources. If we consider pictures, for example, 5 billion pictures are uploaded on the internet every day.
Getty Images, one of the largest photography archives in the world, had 80 million pictures in total.
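
To make the resolution, bit-depth and byte figures mentioned earlier more concrete, here is a rough arithmetic sketch. Several things in it are assumptions and not from the guidelines: the page is taken to be A4 (8.27 x 11.69 inches), the size is that of the raw pixel data only (no file headers, no compression), and the function name `uncompressed_bytes` is hypothetical. Units are decimal, as in the talk (1 KB = 1000 bytes, 1 petabyte = 1000 terabytes).

```python
# Decimal units, as used in the talk (1 KB = 1000 bytes, 1 PB = 1000 TB).
KB = 1000
MB = 1000 * KB
GB = 1000 * MB
TB = 1000 * GB
PB = 1000 * TB

# Assumed page size for illustration only: A4, in inches.
A4_WIDTH_IN = 8.27
A4_HEIGHT_IN = 11.69


def uncompressed_bytes(width_in, height_in, dpi, bit_depth):
    """Size of the raw pixel data, ignoring file headers and compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bit_depth + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # An 8-bit pixel can take 2**8 = 256 values, a 24-bit pixel 2**24.
    print(f"8-bit shades: {2 ** 8}, 24-bit shades: {2 ** 24:,}")

    for dpi, bit_depth in [(300, 8), (600, 24)]:
        size = uncompressed_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, dpi, bit_depth)
        print(f"A4 at {dpi} dpi, {bit_depth}-bit: {size / MB:.1f} MB uncompressed")

    archive = 50 * PB
    print(f"50 PB = {archive // TB:,} TB")

    master = uncompressed_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, 600, 24)
    print(f"600 dpi, 24-bit A4 masters that would fit in 50 PB: {archive // master:,}")
```

Real files will differ: compressed formats such as jpeg are much smaller, and even uncompressed tiff files carry some extra header data, so these numbers are only an order-of-magnitude guide.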
