Meeting the Challenges of Preserving the UK Web

Helen Hockx-Yu 96 Euston Road, London NW1 2DB United Kingdom [email protected]

ABSTRACT Collecting and providing continued access to the UK’s digital heritage is a core purpose for the British Library. An important element of this is the World Wide Web. The British Library The British Library started archiving UK websites in 2004, started in 2004, building from scratch the based on the consent from site owners. This resulted in the 1 capability of eventually preserving the entire UK web domain. Open UK Web Archive , a curated collection currently This is required by the non-print Legal Deposit Regulations consisting of over 70,000 point-in-time snapshots of nearly which came into force in April 2013, charging the Legal 16,000 selected websites, archived by the British Library and 2 Deposit Libraries with capturing, among a wide range of digital partners. publications, the contents of every site carrying the .uk suffix Non-Print Legal Deposit (NPLD) Regulations became effective (and more), preserving the material and making it accessible in in the UK in April 2013, applying to all digitally published and the Legal Deposit Libraries’ reading rooms. on line work. NPLD is a joint responsibility of publishers and The paper provides an overview of the key challenges related to Legal Deposit Libraries (LDLs). An important requirement is archiving the UK web, and the approaches the British Library that access to NPLD content is restricted to premises controlled has taken to meet these challenges. Specific attention will be by the LDLs. given to issues such as the “right to be forgotten” and the treatment for social networks. The paper will also discuss the The British Library leads the implementation of NPLD for the access and scholarly use of web archives, using the Big UK UK web. While the many existing web archiving challenges Domain Data for Arts and Humanities project as an example. described in detail by Hockx-Yu[7] remain valid, the significant increase of scale, from archiving hundreds of Keywords websites to millions, has brought about new and additional Web Archiving, Non-print Legal Deposit, Digital Preservation, challenges. The key ones are discussed in this paper. Big data, Scholarly use, Digital Humanities.

2. IMPLEMENTING NON-PRINT LEGAL 1. INTRODUCTION DEPOSIT Web Archiving was initiated by the in the NPLD of UK websites is mainly implemented through periodic mid-1990’s, followed by memory institutions including crawling of the openly available UK web domain, following an national libraries and archives around the world. Web archiving automated harvesting process where web crawlers request has now become a mainstream digital heritage activity. Many resources from web servers hosting in-scope content. countries expanded the existing mandatory deposit scheme to include digital publications and passed regulations to enable systematic collection of the national web domain. A recent 2.1 Collecting Strategy survey identified 68 web archiving initiatives and estimated With over 10 million registered domain names, .uk is one of the that 534 billion files (measuring 17PB) had been archived since largest Top Level Domains (TLDs) in the world. A strategy to 1996. [2] archive such a large web space requires striking a balance between comprehensive snapshots of the entire domain and adequate coverage of changes of important resources. Figure 1. outlines our current strategy, which is a mixed model allowing annual crawl of the UK web in its entirety, augmented by prioritisation of the parts which are deemed important and Copyright Helen Hockx-Yu (2015), licensed under a Creative receive greater curatorial attention. Commons 4.0 Attribution International Licence. When citing this paper please use the following: Hockx-Yu, H., “Meeting the

Challenges of Preserving the UK Web”, Digital Preservation for the Arts, Social Sciences and Humanities, Dublin, 25-26 June 2015. Dublin: The Digital Repository of Ireland.

1 Open UK Web Archive, http://www.webarchive.org.uk.

2 Another key collection is the UK Government UK Web Archive provided by the National Archives, containing government records on the web, http://www.nationalarchives.gov.uk/webarchive.  Data volume limitation3 4 A default per-host data cap of 512MB or 20 hops is applied to our domain crawls with the exception of a few selected hosts. As soon as one of the pre-configured caps has been reached, the crawl of a given host will terminate automatically.  Robots.txt policy We obey robots.txt and META exclusions, except for the home pages and content required to render a page (e.g. JavaScript, CSS).  Embedded resources Resources which are essential to the coherent Figure 1. UK Web Archive collecting strategy interpretation of a web page (e.g. JavaSrcript, CSS) are considered in-scope and collected, regardless of where  The domain crawl is intended to capture the UK domain as these are hosted. comprehensively as possible, providing the “big picture”.  The key sites represent UK organisations and individuals of general and enduring interest in a particular sector of 3. “RIGHT TO BE FORGOTTEN” [6] the life of the UK and its constituent nations. The “right to be forgotten” relates to the European Court of  News websites contain news published frequently on the Justice (ECJ)’s ruling against Google, who were asked to web by journalistic organisations. remove the index and access to a 16-year old newspaper article  The events-based collections are intended to capture concerning an individual’s proceedings over social security political, cultural, social and economic events of national debts. [10] interest, as reflected on the web. “Right to be forgotten” reflects the principle of an individual The key sites, news sites and events collections are maintained being able to remove traces of past events in life from the by curators across the LDLs and governed by a sub-group Internet or other records. When considering this, it is important overseeing web archiving for NPLD. These are typically not to lose sight of the purpose of NPLD. By keeping a captured more than once a year. historical record of the UK web for heritage purposes, it ensures the “right to be remembered”. Websites archived for NPLD are only accessible within the LDL’s reading rooms and 2.2 UK Territoriality the content of the archive is not available for search engines. The Regulations define an on line work as in scope if: This significantly reduces the potential damage and impact to individuals and the libraries’ exposure to take-down requests. a) it is made available to the public from a website with a There is at present no formal and general “right to be forgotten” domain name which relates to the United Kingdom or to a place in UK law by which a person may demand withdrawal of the within the United Kingdom; or lawfully archived copy of lawfully published material, just b) it is made available to the public by a person and any of that because they do not wish it to be available any longer. We person’s activities relating to the creation or the publication of apply the Data Protection Act 1998 for withdrawing material the work take place within the United Kingdom. [13] containing sensitive personal data from the NPLD collection. A notice and takedown policy is in place allowing withdrawal of Part a) is interpreted as including all .uk websites, plus websites public access or removal of deposited material under specific in future geographic top level domains that relate to the UK circumstances.[12] "Evidence of damage and distress to such as .scotland, .wales or .london. This part of the UK individuals" is a key criterion used to review complaints. territoriality criteria can be implemented using automated methods, by assembling various lists or directories, or through discovery crawls, which identify linked resources from an 4. ARCHIVING SOCIAL MEDIA [5] initial list and extract additional URLs recursively. A sampling approach was taken to archiving social media Part b) concerns websites using non .uk domains. It is a content prior to NPLD. The Open UK Web Archive contains a statement about the location of the publisher or the publishing limited amount of pages from Twitter, Facebook and YouTube. process without defining explicitly what “takes place within the These typically are parts of “special collections”, groups of United Kingdom” constitutes. We use a mixture of automated websites about a particular theme or an event, usually archived and manual means to discover content relevant to this category. for a fixed period of time. An example is the special collection Manual checks include UK postal address, private on the UK General Election 2010, which includes Twitter communication, who-is records and professional judgment. A pages belonging to the Prospective Parliamentary Candidates crawl-time process has also been developed, to check non .uk (PPCs). The decision not to systematically archive social media URLs against an external Geo-IP database and add UK-hosted related to the selective nature of the archive itself, the difficulty content to the fetch-chain. This helped us identify over 2 in obtaining permissions and resources constraints – even the million non .uk hosts during our 2014 domain crawl. exemplar content required highly skilled technical staff to At a more detailed level, crawler configurations also determine develop customised solutions outside standard workflow. the scope and boundary of a national web archive collection. Some key parameters of our current implementation are as follows: 3 This is a common way to manage large scale crawls, which otherwise could require significant machine resources or time to complete. 4 Each page below the level of the seed, i.e. the starting point of a crawl, is considered a hop. Social media content are not treated differently from other web prototype interface to conduct specific research and provide resources in the context of NPLD. Regardless of the platform feedback which guided the next cycle of development. used for publication, non-print work is collected for Legal The Shine interface7 provides access to a 30TB underlying Deposit if it is fulfills the territoriality criteria. archive containing 2.02 billion URLs, collected between 1996 Determining territoriality of social media is however not and 2010. It supports query building, corpus formation and straightforward. The major social network platforms, YouTube, handling. The full-text search has proximity options, and can Twitter and Facebook, all use .com domain names so do not exclude specified text strings. The search results are presented meet the first part of the territoriality criteria.5 The second back to users without much manipulation or ranking, but can be applies to UK-based individuals or organisations, so does not filtered using multiple facets, e.g. content type, public suffix, warrant archiving twitter.com or facebook.com in their entirety. domain, crawl year. Single resources or whole hosts can also be Until scalable solutions are developed, the identification of UK excluded from result sets. Queries can be saved and results content in social media relies on manual checking and curators’ exported as CSV or similar. The interface also allows “trends” professional judgment. In-scope content will continue to be search, visually presenting occurrences of search terms across a archived together with the rest of the UK web for non-print timeline and allowing access to random samples at given data Legal Deposit. points which support the trends.

5. ACCESS AND SCHOLARLY USE [4] Despite the time-consuming and non-scalable administrative process, the advantage of permissioned-based archiving is online access. NPLD enabled systematic collection at scale but limits access to those physically present at LDL’s premises. Access restriction is common for web archives developed with a Legal Deposit mandate. There seems to be a choice between comprehensiveness of the archive and online access. While similar access restrictions were applied to printed Legal Deposit collections, there is an expectation from the users to be able to access web archvies 24/7. The misalignment between legal requirements and user expectation is a difficult problem. Another issue is the single-site based access method, over- focusing on the actual HTML text and ignoring contextual or para-textual information. The division of effort in archiving just Figure 2. Shine Trends Search the national (or a subset) web also breaks down a global system and introduces arbitrary boundaries which are irrelevant to Researchers on the project concluded that web archives have research questions. great potential and limitations. The main challenge is Access and use of the web archives has been one of our methodological rather than content-related: how to make sense strategic focuses from the outset. Many activities and projects of a vast amount of unstructured data and how to create took place to involve general users and researchers in collection relevant research corpora in a consistent manner. Traditional building, requirements and tools development. We explored “relevance searching” seems to raise unreasonable web archives both in granularity and totality, providing a rich expectations. A number of researchers had to adapt the original set of functions to enhance access to individual websites, and research questions due to the overwhelming amount of search developing analytics and visualisations to explore patterns, results. Gareth Millward, a historian who proposed to research trends and relationships based on the entire web archive disability organisations on the web, detailed some of the collection. We also developed the Mementos Service, allowing challenges and frustrations in an interview with the Washington resource discovery across multiple web archives in the world.6 Post. [11] Andy Jackson, who led the technical development of Shine, responded to the interview and explained the difference The Big UK Domain Data for the Arts and Humanities project, between a historical search engine and one like Google, which was funded by the Arts and Humanities Research Council to makes many assumptions and uses these to rank search grasp the opportunities offered by big data. Using a historical results.[8] UK web domain dataset[9] collected by the Internet Archive and acquired by the Joint Information Systems Committee The project demonstrated a learning process for both (JISC), the project aims to develop a methodological researchers and web archive providers. Understanding each framework for the study of the archived web, a monograph on other’s assumptions or expectations reveals issues but is also a the history of the UK web and an access tool based on step towards building better web archives. It is not about a requirements extracted from 11 research proposals across a choice between granularity or totality, it is rather the ability to range of disciplines. move between the two that is desired: “a visualisation tool that allows a single data point, to be both visualised at scale in the The British Library’s role was to work with the researchers and context of a billion other data points, and drilled down to its co-develop a prototype user interface to the JISC/IA dataset. smallest compass”. [3] An iterative model was followed where researchers used the

5 YouTube is out of scope as the Regulations do not apply to works consisting solely or predominantly of film or recorded sound (or both). 6 UK Web Archive Mementos Service, 7 UK Web Archive Shine application, http://www.webarchive.org.uk/ukwa/info/mementos. http://www.webarchive.org.uk/shine. 6. CONCLUSION 7. REFERENCE After more than ten years of archiving the UK web, we still [1] Big UK Domain Data for the Arts and Humanities: face many challenges. Two years into NPLD, we have already http://buddah.projects.history.ac.uk/. Accessed: 2015-05- collected over 100TB of web data and the archive continues to 20. grow. While new content types and technologies are added to [2] Costa, M. and Gomes, D. 2015. Web Archive Information the web, our purpose-built crawler struggles with dynamic Retrival: content, streaming media and social media. Only a relatively http://www.netpreserve.org/sites/default/files/attachments/ small group of researchers have discovered the value of web 2015_IIPC-GA_Slides_11_Gomes.pptx. Accessed: 2015- archives and started to use this new type of scholarly source. 05-20. They too need to understand and resolve many conceptual and [3] Hitchcock, T. 2014. Big Data, Small Data and Meaning. methodological issues. Historyonics: http://historyonics.blogspot.co.uk/2014/11/big-data-small- The most valuable lesson from interaction with scholars is that data-and-meaning_9.html. Accessed: 2015-05-20. much contextual information we regard as operational or [4] Hockx-Yu, H. 2014. Access and Scholarly Use of Web private is relevant and can impact the interpretation of web Archives. Alexandria: The Journal of National and archives. This includes a wide range of technical decisions, International Library and Information Issues. 25, curatorial choices and contextual data. Crawl logs and Numbers 1-2 (Aug. 2014), 113–127. configurations, responses from web servers, websites we [5] Hockx-Yu, H. 2014. Archiving Social Media in the intended to include in a special collection but failed to obtain Context of Non-print Legal Deposit. (Lyon, France, Jul. rights-holders’ permission, they are all relevant. Explanation of 2014). http://library.ifla.org/id/eprint/999. Accessed: and access to such information should become base-line 2015-05-20 knowledge and integral parts of the web archive. [6] Hockx-Yu, H. 2014. A Right to be Remembered. UK Web The immediate next step is for us to redevelop the Open UK Archive Blog: Web Archive and fold the key learning into a next generation http://britishlibrary.typepad.co.uk/webarchive/2014/07/a- web archive with a different focus. We hope to provide as right-to-be-remembered.html. Accessed: 2015-05-20. much as possible information and open a window to our rich [7] Hockx-Yu, H. 2011. The Past Issue of the Web. (Koblenz, web archive collection regardless of the access conditions, Germany, Jun. 2011). including the millions of websites with have disappeared from http://www.websci11.org/fileadmin/websci/Papers/PastIss the live web. We hope to link out to more web archives, so that ueWeb.pdf. Accessed: 2015-05-20. the historical UK web can be studied in the global context. [8] Jackson, A. 2015. Building a “Historical Search Engine” is no easy thing. UK Web Archive Blog: http://britishlibrary.typepad.co.uk/webarchive/2015/02/bu ilding-a-historical-search-engine-is-no-easy-thing.html. Accessed: 2015-05-20. [9] JISC UK Web Domain Dataset (1996-2013): http://data.webarchive.org.uk/opendata/ukwa.ds.2/.

Accessed: 2015-05-20. [10] JUDGMENT OF THE COURT (Grand Chamber): 2014. http://curia.europa.eu/juris/document/document_print.jsf? doclang=EN&docid=152065. Accessed: 2015-05-20. [11] Millward, G. 2015. I tried to use the Internet to do historical research. It was nearly impossible: http://www.washingtonpost.com/posteverything/wp/2015/ 02/17/i-tried-to-use-the-internet-to-do-historical- research-it-was-nearly-impossible/. Accessed: 2015-05- 20. [12] The British Library Notice and Take Down Policy:

http://www.bl.uk/aboutus/legaldeposit/complaints/noticeta kedown. Accessed: 2015-05-20. [13] The Legal Deposit Libraries (Non-Print Works) Regulations 2013: 2013. http://www.legislation.gov.uk/uksi/2013/777/pdfs/uksi_20 201307_en.pdf. Accessed: 2015-05-20.