REFERENCE ROT: A Digital Preservation Issue Beyond File Formats
Archivability of documents is emerging as a best practice in digital preservation. What's in your institutional repository?

Mandated electronic deposit of theses and dissertations (ETDs) brings digital preservation concerns for librarians in their new role as de facto digital publishers. As scholarly content vanishes from the ephemeral web, what's next for the born-digital documents deposited in our institutional repositories? Reference rot combines two problems: link rot and content drift.

LINK ROT
Links pointing to webpages and resources that are no longer available at the URL address, e.g., "404 - page not found".

"All three corpora show a moderate, yet alarming, link rot ratio for references made in recent articles, published in 2012: 13% for arXiv, 22% for Elsevier, and 14% for PMC ... Going back to the earliest publication year in our corpora, 1997, the ratios become 34%, 66%, and 80%, respectively."
Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K., & Tobin, R. (2014). Scholarly context not found: One in five articles suffers from reference rot. PLoS ONE, 9(12), e115253.

CONTENT DRIFT
Links work, but the content at the URL has evolved over time and differs, sometimes dramatically, from what was there originally.

"We find that for over 75% of references the content has drifted away from what it was when referenced."
Jones, S. M., Van de Sompel, H., Shankar, H., Klein, M., Tobin, R., & Grover, C. (2016). Scholarly context adrift: Three out of four URI references lead to changed content. PLoS ONE, 11(12), e0167475.

MEMENTOS
Digital snapshots of web pages as they existed at a point in time, preserved in publicly accessible archives.

The international archiving community has been periodically crawling websites and saving mementos for years. An example of such incidental archiving is the Internet Archive's Wayback Machine. Even if libraries have taken no action to ensure digital preservation, some mementos will exist, as our research shows.
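As an illustration of how a memento can be located, the sketch below builds a query for the Wayback Machine's public availability endpoint and reads the closest snapshot from its JSON reply. It is a minimal sketch, not the procedure used in this study: the helper names `availability_query` and `closest_memento` and the sample response are hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the request URL; `timestamp` (e.g. YYYYMMDD) asks for the closest snapshot to that date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_memento(response_text):
    """Return (snapshot_url, timestamp) from an availability reply, or None if no memento exists."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


# Hypothetical reply, shaped like the JSON the endpoint returns.
SAMPLE_REPLY = """
{"archived_snapshots": {"closest": {
    "available": true,
    "url": "http://web.archive.org/web/20130919044612/http://example.com/",
    "timestamp": "20130919044612",
    "status": "200"}}}
"""

# Reply for a URL with no archived snapshot.
EMPTY_REPLY = '{"archived_snapshots": {}}'
```

In practice the reply would come from fetching `availability_query(link, deposit_date)`, for example with `urllib.request.urlopen`. Passing the thesis deposit date as the timestamp is one way to find the snapshot nearest to when the link was cited.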
To see whether Concordia University's ETDs suffered from reference rot, we examined PhD dissertations deposited in Spectrum over a five-year period (Spring 2011 to Fall 2015).
• Documents were downloaded, converted to text, and mined for URLs
• URLs were checked using cURL to obtain HTTP response codes

Using a 10% stratified random sample of 990 links, we found:
• About half of the sampled PhD links (492/990) exhibited content drift
• 77% (764/990) had mementos; 23% (226/990) had no memento
• Content for 54 sampled links is lost and not recoverable (11% of the 492 links with content drift)

Total PhDs (720), by Discipline
Eng & Comp Sci 351 (49%); Arts 210 (29%); Science 86 (12%); Business (JMSB) 45 (6%); Fine Arts 28 (4%)

Total PhDs (720), by Document Health
Infected (≥1 unhealthy links) 333 (46%); Healthy 171 (24%); Immune (no links) 160 (22%); Embargoed 36 (5%); Not Converted 20 (3%)

Link Distribution, by HTTP Status Code (11,437 links)
2xx - Successful 8,834 (77%); 4xx - Client error 1,916 (17%); 0 - Empty response 507 (4%); 3xx - Redirect 102 (1%); 5xx - Server error 76 (1%)
HTTP response codes: 2xx ("active"); 0, 1xx, 3xx, 4xx, 5xx ("rotten")

Total Links (11,437), by Discipline
Arts 5,946 (52%); Eng & Comp Sci 3,259 (28%); Fine Arts 1,728 (15%); Science 294 (3%); Business (JMSB) 210 (2%)

Sampled Links (990), by Memento Found
Yes 764 (77%); No 226 (23%)

Sampled Links (990), by Content Drift Detected
Yes 492 (49.7%); No 498 (50.3%)

Total Content Drift (492), by Type of Drift
Minor 179; Other 148; Lost 54; Major 42; Custom 404 35; Updating website 34

[Figures: Links, by HTTP Response Code, by Year; Links, by HTTP Response Code, by Discipline and Year; Sampled Links, by Memento Found, by Year; Sampled Links, by Content Drift Detected, by Year; Total Content Drift, by Type of Drift, by Discipline (2011-2015)]

Reframing Our Responsibilities
• Avoid use of URL shorteners (e.g., bit.ly)
• Make repositories and publishing websites archive-friendly
• Add archiving crawlers to whitelists
• Collaborate with Thesis Office to systematically preserve ETD links

Get to know
• Archive-It: Internet Archive's web archiving subscription service. Save collections; rescue websites before they disappear.
• Use Save Page Now
• Install browser extension
• Individual/institutional accounts

Massicotte, M., & Botter, K. (2017). Reference rot in the repository: A case study of electronic theses and dissertations (ETDs) in an academic library. Information Technology and Libraries (ITAL), 36(1), 11-28.
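The link-checking steps described above (mine URLs from the converted text, then label each HTTP response code) could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' scripts: the URL pattern is deliberately simplified, and the helper names `extract_urls`, `classify`, and `summarize` are hypothetical. Only the active/rotten rule (2xx is active, everything else is rotten) comes from the poster.

```python
import re
from collections import Counter

# Simplified URL pattern (an assumption): text converted from PDF often breaks
# URLs across lines, which this pattern does not try to repair.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return URLs found in plain text, trimming trailing sentence punctuation."""
    return [u.rstrip(".,;:") for u in URL_RE.findall(text)]


def classify(status):
    """Poster's legend: 2xx is 'active'; 0, 1xx, 3xx, 4xx, 5xx are 'rotten'."""
    return "active" if 200 <= status <= 299 else "rotten"


def summarize(statuses):
    """Count active and rotten links and return the rotten share as a fraction."""
    counts = Counter(classify(s) for s in statuses)
    total = sum(counts.values())
    rotten_share = counts["rotten"] / total if total else 0.0
    return counts, rotten_share
```

The status codes themselves would come from a separate checking step. One common way to get them from cURL is `curl -s -o /dev/null -w "%{http_code}" URL`, which prints `000` when no response is received; the poster does not state the exact invocation used. With the poster's totals, 8,834 active links out of 11,437 leaves a rotten share of about 23%.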