Edinburgh Research Explorer
Citation for published version:
Thompson, H & Tong, J 2018, Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time. in Companion Proceedings of the The Web Conference 2018. WWW '18, International World Wide Web Conferences Steering Committee, Lyon, France, pp. 1749-1755, The Web Conference 2018, Lyon, France, 23-27 April. DOI: 10.1145/3184558.3191636

Digital Object Identifier (DOI): 10.1145/3184558.3191636
Document Version: Publisher's PDF, also known as Version of record
Published In: Companion Proceedings of the The Web Conference 2018

Track: 8th Temporal Web Analytics Workshop
WWW 2018, April 23-27, 2018, Lyon, France

Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time?

Henry S. Thompson
University of Edinburgh
Edinburgh, United Kingdom
[email protected]

TONG Jian∗
University of Edinburgh
Edinburgh, United Kingdom
[email protected]

∗SURNAME forename

ABSTRACT
We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5x10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3x10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.

KEYWORDS
temporal web analytics, persistent identifier, Common Crawl, Uniform Resource Identifier, longitudinal web crawl analysis, Digital Object Identifier

ACM Reference Format:
Henry S. Thompson and TONG Jian. 2018. Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time?. In WWW ’18 Companion: The 2018 Web Conference Companion, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3184558.3191636

This paper is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’18 Companion, April 23–27, 2018, Lyon, France
© 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License.
ACM ISBN 978-1-4503-5640-4/18/04.
https://doi.org/10.1145/3184558.3191636

1 INTRODUCTION
The history of efforts to meet the demand for so-called ‘persistent identifiers’ (PIDs) for use on the Web is complicated, with many alternative offerings and much debate about the meaning of persistence and how to go about ensuring it. We take no position in that debate here, beyond the observation that the demand for PIDs shows no signs of abating, and that there has been a more-or-less general acknowledgement over the last 5–10 years that to be successful in the context of the Web a PID scheme must define and support a mapping from PIDs in the scheme to ‘actionable’ identifiers. In practice this has meant specifying a purely syntactic procedure for converting a PID into an http(s): URI using a domain owned and operated by the proprietors of the scheme. An HTTP request for such an ‘actionable’ URI will typically result in a redirection to the then-current location of the identified resource.

The Digital Object Identifier scheme [11], managed by the International DOI Foundation (IDF) [16], was an early adopter of this approach, and DOIs are now in widespread use, particularly in scientific journals, where their use is actually mandated by a number of major publishers. The mapping from DOIs to actionable https: URIs is simple: for example, a DOI for a journal article written in the form of a URI such as doi:...¹ is mapped (client-side) to https://doi.org/... In response to an HTTP request for that URI, the server at doi.org (operated on behalf of IDF by the Corporation for National Research Initiatives (CNRI) [5]) will respond with a redirect to the appropriate http(s): URI from the actual publisher of the article. We call the three forms involved the ‘original’ (e.g. doi: or info:hdl), the ‘actionable’ (e.g. https://doi.org/... and variants thereof or http://hdl.handle.net/...) and the ‘locating’. Note that none of these is strictly speaking a PID as such: the PID is what comes after the doi: or http://hdl.handle.net/.

¹ doi: is not (yet) a registered URI scheme, but is often used as if it were one.

The success of this approach has overcome a significant barrier to the adoption of PIDs in general: to date there has been no significant move towards support for any of them as URIs in web browsers or PDF viewers. That is, if you try to use doi://10.1000/182 or info:hdl/20.1000/100 as a link (for example, as the value of the href attribute of an HTML A element), it will not work. But you can use them as the link text of an A element, and put the actionable form (https://doi.org/10.1000/182 and http://hdl.handle.net/20.1000/100 respectively) in the href attribute, and that will work just fine.

That’s the good news. The less good news is that the use of redirection from the actionable form to the locating form means that when someone follows a link such as those in the previous paragraph, it’s the locating form that appears in the address bar of their browser, and is thus the form they may well copy and paste into an email to a colleague or their own reading list. But this undermines the fundamental value proposition of the original (‘persistent’) form: that it is not vulnerable to all the things that cause http: URIs to fail over time.

Our goal in the work reported here was to quantify the growth over time in actual usage of the three forms, to see not only how good the good news was, but also whether there was cause to worry about the less good news: are locating forms ‘leaking’ into public use? For concrete evidence we used the Common Crawl sample of HTML pages on the Web [3], the only large-scale public source of evidence readily available to us. This turned out to be challenging in a number of respects, to the extent that although our results are interesting, problems with the CC data mean that they may not accurately reflect the actual situation. In what follows we will first describe the work as such, and then discuss the ways in which the CC data fell short of what we think is required for reliable analysis.

Note on terminology: Although most of the PIDs in various forms (original, actionable or locating) found during our studies were DOIs, we will be careful hereafter to use ‘PID’ when we mean anything recognised as a form of persistent identifier, and ‘DOI’ for the subset thereof which are some form of DOI.

2 PRIOR WORK AND OTHER SOURCES OF INFORMATION
An excellent overview of the space of PIDs and arguments for their use, only slightly dated, can be found in [21]. The IDF’s views on the need for PIDs and their goals for DOIs are described in [1].

The IDF occasionally update their "Key Facts" page [12], which currently says that:
• [DOIs are] Currently used by well over 5,000 assigners, e.g., publishers, science data centres, movie studios, etc.
• Approximately 148 million DOI names assigned to date
• Over 5 billion DOI resolutions per year

The leading issuer of DOIs for publications is CrossRef [7], who publish regularly-updated statistics about membership numbers, DOIs registered, etc. [8]

Table 1: Crawl size for first study [24]

Crawl month   URIs crawled    Pages retrieved   Dup URI %age
2014-04       1,718,646,762   2,641,371,316     34.9%
2017-04       2,907,715,349   2,942,930,482     1.2%

Table 2: Duplicate page estimates for first study [24]

Crawl month   Pages retrieved   Digests         Dup pages %age
2014-04       2,641,371,316     2,250,363,653   14.8%
2017-04       2,907,715,349     2,915,114,582   0.9%

[...] starts with a unique set of URIs and does not follow page-internal links, redirects to URIs in the initial set occur surprisingly often, giving rise to duplication in some cases. The "Duplicate URI %age" column in Table 1 reports this, as estimated by subtracting the ratio of the URI to Page columns from 1.
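The estimate just described is simple arithmetic, and a short sketch makes it easy to check against the figures in Table 1. The function name `duplicate_percentage` and the dictionary layout are illustrative assumptions, not part of the tooling used in the studies; only the counts themselves come from Table 1.

```python
def duplicate_percentage(unique_count: int, pages_retrieved: int) -> float:
    """Return 100 * (1 - unique_count / pages_retrieved).

    Hypothetical helper: estimates the share of retrieved pages that
    duplicate an already-seen URI, as a percentage.
    """
    if pages_retrieved <= 0:
        raise ValueError("pages_retrieved must be positive")
    return 100.0 * (1.0 - unique_count / pages_retrieved)


# (URIs crawled, pages retrieved) per crawl month, copied from Table 1.
TABLE_1 = {
    "2014-04": (1_718_646_762, 2_641_371_316),
    "2017-04": (2_907_715_349, 2_942_930_482),
}

for month, (uris, pages) in TABLE_1.items():
    print(f"{month}: {duplicate_percentage(uris, pages):.1f}%")
# 2014-04: 34.9%
# 2017-04: 1.2%
```

Rounded to one decimal place, the results agree with the "Dup URI %age" column of Table 1.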