Today I would like to talk about DIGITAL RESOURCES and what they can give to historical research.

Digital sources may seem like something that does not concern us historians today, but twenty years from now we will not be able to investigate any aspect of history without taking these sources into consideration, because this is what our era is producing now. Above all, they represent possibilities that no previous generation of historians could enjoy. The utopia of the Library of Alexandria, the Universal Library which holds all knowledge and is accessible to anyone, is potentially something that the internet could give us. However, just as the Library of Alexandria burned to the ground, our universal and fast-growing digital heritage is in danger.

I called this presentation “The digital dark age” to focus on the problem of preserving our digital cultural heritage, which has some specific features that make it particularly vulnerable.

I would also like to stress that when I talk about digital sources I do not just mean documents created since the arrival of computers and the internet, which will be useful historical sources to contemporary historians today or to historians in general 100 years from now. I mean sources that matter to anyone engaging in historical research today.

1 I would like to start with this: a digitized copy of the Magna Carta from the British Library, http://www.bl.uk/collection-items/magna-carta-1215

- Zoom: you can access the whole document in a simulation of what it looks like
- Metadata on the document: where, when, copyright, description
- Transcript

So this makes the Magna Carta a digital source, like any other document, from any era, that has been digitized.

2 I like to divide digital sources into three big categories:

a) DIGITISED SOURCES: any type of document (whether text, video, or sound) that was not originally created as a digital object and was subsequently digitized and made into one, whether a , mov, jpeg, or whatever format a computer can read. EX: MAGNA CARTA

b) BORN-DIGITAL SOURCES: these are all files that were created in digital form: pictures, texts, music scores, sketches, this PowerPoint… pretty much anything that is produced today, from offices to artists’ studios to government assemblies.

c) ONLINE SOURCES: sources that are created not only as digital files, but as online objects, on the internet basically. Tweets, Facebook statuses, blogs, newspapers, etc. These objects have the further characteristic of being publicly accessible and much more interactive: think of how many times a day a newspaper website is updated, or the process of retweeting tweets, or the comments that we can all leave on the web in many ways. The Guardian website is nothing like its paper copy, and it is also very different from a pdf version of the paper copy.

Also, online sources are composed of different media: text, images, video,

3 I would like to say something about the process of digitization, which potentially affects all historical sources, since these are the only sources available up to the arrival of digital records.

Digitisation is important for many reasons:

- providing online, potentially universal access to information held on paper
- reducing wear and tear on historical records (imagine how many more people can access the Magna Carta online without having to actually touch it)
- retaining the appearance of the original artefact
- reducing the problem of SPACE
- improving searchability (e.g. OCR)

4 There is no single digitization technique: techniques vary a lot according to the type of medium you want to digitize.

Standards are under review, but good guidelines are for example provided by JISC, the UK association for the usage of .

The easiest documents to digitize are text and still images, which do not require particularly complicated technologies; DIY methods are often used by archives.

This is usually done with digital cameras or scanners.

These are two examples:
- an old picture of people at King’s College
- a picture of a document I took in the historical archive of the Italian oil company. It is a terrible example of how to digitize: I just took it with a small camera for personal use.

5 From the JISC guidelines:

- Copies must not be enhanced or modified: no cropping, no Photoshop.
- Each page must be copied on its own.
- Weighted sheets may be needed to flatten documents.
- A colour checker and ruler must be included on every page to show the actual dimensions and colour of the object.
- The entire page should be included; the edge of the paper must not be cropped out of view. If you are photographing a bound volume, the margin should be included.

Resolution: 300 dpi for reference-only copies, 600 dpi for actual preservation. Usually two files are created, one at lower and one at higher resolution.
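To get a feel for what these resolutions mean in practice, the sketch below converts a page size and a dpi value into pixel dimensions. The A4 page size (210 x 297 mm) and the function name `pixel_dimensions` are assumptions chosen for illustration; they do not come from the JISC guidelines.

```python
MM_PER_INCH = 25.4

def pixel_dimensions(width_mm, height_mm, dpi):
    """Return (width_px, height_px) of a scan at the given dots per inch."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px

# Hypothetical example: an A4 page scanned at the two resolutions
for dpi in (300, 600):
    w, h = pixel_dimensions(210, 297, dpi)
    print(f"{dpi} dpi: {w} x {h} pixels")
```

Doubling the dpi doubles both dimensions, so the preservation copy has roughly four times as many pixels as the reference copy.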

Colour settings: bit-depth relates to the level of colour that will be captured. A ‘bit’ is the binary digit that represents the tonal value of the pixel. Generally speaking, a 1-bit image is black and white, an 8-bit image has 256 shades of either grey or colour, and a 24-bit image has millions of shades of colour.

Format: these are jpeg, tiff, pdf, etc. The difference lies in whether the format is compressed or uncompressed and, if compressed, whether it is lossy or lossless. Usually jpeg is used for low-res, compressed storage and tiff for high-res storage. Usually each

6 Digitisation of audio and video objects is much more complex and often requires professional equipment. Usually only archives that specialize in audiovisual material (like the BBC archive) do the digitization themselves. Most archives outsource the job to specialized companies.

The original object (tape, record, film) must be played on its original player (tape recorder, record player, etc.), which is connected to a computer with special cables, and captured with a programme which converts the information into digital form.

For example, AUDACITY is a free, easy-to-learn piece of software.

7 One great advantage of digital files is that they do not degrade or lose quality with repeated use (as tapes, record albums, or books do). They can also be copied repeatedly without any loss or alteration.

This leads to big problems related to copyright and intellectual property, but I will leave them for another time.

The digital object will not be catalogued in a folder or on a shelf anymore, but it will still be in a folder in a directory.

These folder trees are set up by the archives according to specific guidelines. However, researchers usually do not access the folder tree; they carry out their research through a search engine interface, whether online or on the archive’s offline catalogue.

8 We should never forget that these documents are kept on servers and hard drives, so they do occupy a physical space, just a different one.

This physical space is actually very fragile (I’m sure you have all broken a hard drive by dropping it or just through a power malfunction).

One big difference between physical archives and digital archives is that we do not tend to preserve the objects themselves, say through restoration, but rather the data. The hard drive is not a historical source; it is its content that matters.

The new frontier of historical preservation is DATA MIGRATION, the process of transferring data between storage types, formats, or computer systems. As digital storage technology progresses, data will be migrated periodically to new formats. I think the suggested standard is to migrate every 10 years.

The file size is the size of the computer file of your image. It is measured in bytes. The larger the file size, the more disk space (storage space) it will take up on your computer. Bytes (1 byte = 8 bits) are often grouped into kilobytes or KB (1000 bytes).
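As a rough illustration of how the resolution and bit-depth discussed earlier translate into file size, the sketch below estimates the size of an uncompressed image. The function names and the example pixel dimensions are assumptions made for illustration, and real files will differ because of headers, metadata, and compression.

```python
def shades(bit_depth):
    """Number of distinct tonal values a pixel can take at this bit-depth."""
    return 2 ** bit_depth

def uncompressed_size_bytes(width_px, height_px, bit_depth):
    """Estimate the raw size of an image storing `bit_depth` bits per pixel."""
    total_bits = width_px * height_px * bit_depth
    return total_bits / 8  # 1 byte = 8 bits

# Hypothetical scan of 2480 x 3508 pixels at three bit-depths
for depth in (1, 8, 24):
    size_kb = uncompressed_size_bytes(2480, 3508, depth) / 1000  # 1 KB = 1000 bytes
    print(f"{depth}-bit: {shades(depth):,} shades, about {size_kb:,.0f} KB")
```

The same page costs 24 times more storage in 24-bit colour than in 1-bit black and white, which is one reason archives keep a small reference copy next to the large preservation copy.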

9 The drive in red is the old one; those in black are the new drives. This goes on and on for several kilometres.

This is “The Internet Archive”, a non-profit organization founded in San Francisco in 1996 whose purpose is to collect, preserve, and make available to the general public all historical collections that exist in digital format.

The Internet Archive includes pictures, websites, music, moving images, and over three million public-domain books.

It is an umbrella archive, as it both acquires digital sources itself and links to different collections around the world.

10 This is additional storage purchased for the archive.

The main problem with these archives is that they consume power constantly and generate excess heat. For example, the Internet Archive’s Petabox system uses the heat the hard drives generate to heat the building.

- Also, a separate data centre is kept, so that physical damage in one place does not destroy everything
- old drives are kept as an extra copy, not thrown away

The Internet Archive alone currently holds 50 petabytes, that is 50,000 terabytes (1 petabyte = 1000 terabytes).
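The conversion can be checked with a few lines of arithmetic. This is a minimal sketch using decimal (power-of-1000) units, as in the figure quoted; the constant and function names are chosen here for illustration.

```python
TB_PER_PB = 1000  # 1 petabyte = 1000 terabytes
GB_PER_TB = 1000  # 1 terabyte = 1000 gigabytes

def petabytes_to_terabytes(pb):
    """Convert petabytes to terabytes."""
    return pb * TB_PER_PB

def petabytes_to_gigabytes(pb):
    """Convert petabytes to gigabytes."""
    return pb * TB_PER_PB * GB_PER_TB

print(petabytes_to_terabytes(50))  # 50000
print(petabytes_to_gigabytes(50))  # 50000000
```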

Which corresponds to roughly:
- 6 million books
- 400 billion webpages
- 3,800 films
- 350,000 news programmes
- 200,000 audio recordings
- 100,000 pictures

This order of magnitude takes me to the next problem I would like to describe: the

11 This is what happens on the internet in 60 seconds. And this covers only the websites monitored here; on top of that there are all the non-online digital sources and the digitized sources.

The amount of information available is skyrocketing.

Big data, sampling and social science approaches often seem to be the only way to navigate this ocean of digital sources.

If we consider pictures, for example, 5 billion pictures are uploaded to the internet every day. Getty Images, one of the largest photography archives in the world, has 80 million pictures in total.

18 million hours of video are uploaded to YouTube every year. The whole BBC archive has only 600 thousand videos.

12 For this reason, search engines and metadata are absolutely fundamental to our research. We cannot think of digital research without search engine tools: it would be exactly like trying to empty the ocean with a spoon.

This is a picture of what Google would have looked like in the pre-digital era, except that nothing as inclusive as the World Wide Web could ever have existed.

Fortunately though, digital documents allow for new approaches to research that analogue documents did not allow.

13 The reason why I called the picture I took in the archive a terrible example is mostly that it does not allow for Optical Character Recognition (OCR). If the picture had better lighting and the lines were more horizontal, I could have processed the image with various programmes.

Here are three examples of programmes you can use. Personally I have only used Google Docs because it’s free, but precisely because it’s free the result is not very good. Adobe Acrobat scans the page, recognizes the text, and tells you which parts it did not recognize. ABBYY FineReader also allows you to manually add the parts it did not recognize automatically, making the programme better and better as you use it; it’s the same principle as the CAPTCHA.

There is a big difference between having to read a text, even just scrolling through it, and being able to run a keyword search.

There is also a CTRL+F equivalent for audiovisual files, except that it is obviously more complicated. It is possible to bookmark video and audio files according to keywords or topics, so as to automatically play the digital file from that specific moment. For example, this is a company that offers bookmarking on videos: http://www.voicebase.com/public/

14 However, digital sources also have the opposite problem. The idea of a digital dark age grew precisely from the risk of losing most of our digital sources, which in many ways are much more fragile than parchments or ancient inscriptions.

For example, most of the early 20th-century media are completely lost.

For example, the BBC started recording TV and radio regularly only in the 1970s.

Television was all live, and the only way to record it was to place a film camera in front of a television set and film the screen. This is why we do have the images transmitted by the BBC for the coronation of the Queen, but think of all the TV news of the time and how much we have lost.

Even when recording technology became available, people simply did not use it, because television and radio were perceived as ephemeral arts; videotapes would be reused over and over for budget reasons, the new programmes erasing the old ones.

20th-century history was definitely not televised. The Internet Archive and similar projects were established to ensure that digital documents would not meet the same destiny with the arrival of computers.

15 There are two ways in which digital documents can be in danger: the obsolescence of the hardware and the obsolescence of the software. Documents are stored on physical media which require special hardware in order to be read, and this hardware may no longer be available a few decades after the document was created. For example, it is already the case that disk drives capable of reading 5¼-inch floppy disks are not readily available.

16 A very good example was the BBC Domesday Project in 1986, in which a survey of the nation was compiled 900 years after the original Domesday Book. People, mostly school children, wrote about geography, history or social issues in their local area or just about their daily lives. Children from over 9,000 schools were involved. This was linked with maps, many colour photos, statistical data, video and "virtual walks". Over 1 million people participated in the project. The project also incorporated professionally prepared video footage, virtual reality tours of major landmarks and other prepared datasets such as the 1981 census. It was a very interactive digital project that combined different types of sources, just as we are used to today on any website.

The problem is that the laser discs on which the new Domesday Book was stored soon became obsolete, and by the early 1990s they were unreadable, as computers capable of reading the format had become rare and drives capable of accessing the discs even rarer.

Ironically, while the Domesday Book of 900 years ago is still easily accessible through our eyes, the 1986 source was at risk of being lost forever. It was a project between the University of Leeds and the BBC called “DomesEm” that created a system for modern computers to access the Domesday laser discs, and immediately proceeded to data migration.

17 The obsolescence of software is even trickier than the obsolescence of the storage medium, because it is not just a technical problem but also a problem of patents held by software and IT companies.

For example, perhaps some of you will remember from when you were very young WORDSTAR, the DOS writing programme that came before Windows arrived. Programmes able to simulate WordStar and read the old files are being developed in order to retrieve the information. In general, it is now good practice for the new version of a programme to be able to read files from the older versions.

This backward compatibility is now more and more common, but it is a convention, not something imposed by law (e.g. Final Cut X against Final Cut).

18 Remember I said that one fundamental question to consider in choosing the file format (jpeg, pdf, etc.) when digitizing is whether the format is open or proprietary:

A proprietary file format is defined as: “containing data that is ordered and stored according to a particular encoding-scheme, designed by a company or organization to be secret, such that the decoding and interpretation of this stored data is only easily accomplished with particular software or hardware that the company itself has developed”

This simply means that there are files you can read with any programme, and files you can only read with certain programmes. For example, pdf was a proprietary format, only readable with Adobe Reader, up to 2008. It was then made open because it was a de facto standard. This was due to a cooperative approach by companies and to pressure from organisations promoting free software. Microsoft, too, signed an Open Specification Promise in 2006 not to assert legal rights over certain Microsoft patents.

This is important mostly to allow future programmes to read old files through an open and inclusive technology. The Internet Archive also collects software to allow the simulation of old files, for example videogames.

19 Finally, I would like to say a few words about the internet as a source.

You might already have checked the link I sent you, from the Internet Archive.

→ go to Wayback Machine

Ok, so this is what google.com looked like on November 11, 1998. It looks as much as possible like the original website, except that it does not work: you cannot search for results there.

The Wayback Machine tries to collect all pages that are public on the World Wide Web. It is not just pages in English; it tries to cover the whole existing internet from 1996 or a bit earlier.

→ for example, this is the major Italian newspaper, larepubblica.it, which has been crawled since 1996.

The Wayback Machine collects internet pages through what are called “web spiders” or “web crawlers”. This software starts with a list of URLs to visit; on each page it identifies all the hyperlinks and adds them to the list of URLs to visit. Usually a website is archived up to 5 levels down.
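The crawling loop described above can be sketched in a few lines. This is only an illustration of the general idea, not the Wayback Machine's actual code: the function `crawl`, the `get_links` callback, and the miniature `TOY_WEB` are all invented for this example, and no real pages are downloaded.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=5):
    """Breadth-first crawl: visit each URL once, following links
    at most max_depth levels below the seed URLs."""
    seen = set()
    to_visit = deque()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            to_visit.append((url, 0))
    archived = []
    while to_visit:
        url, depth = to_visit.popleft()
        archived.append(url)      # a real crawler would save the page here
        if depth == max_depth:
            continue              # do not follow links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append((link, depth + 1))
    return archived

# Hypothetical miniature "web": each page maps to the pages it links to
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["story1", "home"],
    "about": [],
    "story1": ["comments"],
    "comments": [],
}

def toy_get_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])

print(crawl(["home"], toy_get_links))
print(crawl(["home"], toy_get_links, max_depth=1))
```

The `seen` set is what stops the crawler from going round in circles when two pages link to each other, and `max_depth` plays the role of the "5 levels down" limit.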

20 One final issue I would like to address is that of social media.

Think of how we will study cultural history or political history 50 years from now, and whether we will be able to do it without Twitter, Facebook, or YouTube.

Twitter is replacing press releases more and more; Facebook is a mix between a secret diary, a pamphlet, and much more; YouTube will probably become the most relevant audiovisual source for cultural history ever.

The problem with social media is that they are owned by private companies and, since the content is user-generated, the people who create it have the right to remove it at any time. For us as historians, they are worth as much as the diaries, correspondence, and all the private sources that we work on.

But emails and Facebook posts cannot just be found in an old drawer and donated to an archive. How do we preserve them for the future?

Honestly I don’t have an answer for that.

For example, Facebook now allows you to archive your profile: just go to general settings -> download a copy of your Facebook data.

21 So this is the end of my presentation. On a final note, I would like to recommend a visit to the website of The Long Now Foundation, a cultural institution established in 1996 that is one of the main supporters of The Internet Archive. It has the interesting peculiarity of using five-digit dates, in order to solve the deca-millennium bug which will come into effect in about 8,000 years.
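The five-digit convention is easy to reproduce. A minimal sketch, where the function name `long_now_year` is invented for this example:

```python
def long_now_year(year):
    """Format a year with five digits, padding with leading zeros."""
    return f"{year:05d}"

print(long_now_year(1996))   # 01996
print(long_now_year(2000))   # 02000
print(long_now_year(10000))  # 10000
```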

The goal of the Long Now Foundation is to promote long-term thinking and the assumption of long-term responsibility in a society with a very short-horizon perspective. One of the main projects of the association is the creation of a LONG-TERM clock which [I quote] “ticks once a year, bongs once a century, and the cuckoo comes out every millennium”.

I would like to conclude my presentation with a quote from one of the Long Now Foundation’s founders, computer scientist Daniel Hillis:

“When I was a child, people used to talk about what would happen by the year 02000. For the next thirty years they kept talking about what would happen by the year 02000, and now no one mentions a future date at all. The future has been shrinking by one year per year for my entire life. I think it is time for us to start a long-term project that gets people thinking past the mental barrier of an ever-shortening future.”

22 List of useful links:

- Gutenberg
- Million books
- American memory project
- Egypt thing
- Papyrology
- The National Archives
- www.archives.org
- CIS History Universe
- The Avalon Project (Yale)
- Electronic Text Centre
- Family History UK
- Smithsonian Institute
- A geographic guide to uncovering women’s history in archival collections
- Eogan.org
- Ready, Net, Go! (Tulane University)
- Africa Research Central (Cal State)
- European Archival Network
- UNESCO Archives Portal
