<<

EARLY EXIT STRATEGIES IN DIGITAL PRESERVATION

Ashley Adair Maria Esteva Benn Chang University of Texas Texas Advanced Computing University of Texas Libraries, University of Texas at Center, University of Texas at Libraries, University of Texas at Austin Austin Austin United States United States United States [email protected] [email protected] [email protected]

Abstract – Digital preservation is a set of measures, often developed on the spot. continuous activity requiring long-term effort, Depending on the context and status of the the lack of which presents risks for falling data, and on the possibilities of the institutions behind in maintenance, representation, that support them, the approaches may entail functionalities, and long-term safeguarding. significant challenges, particularly if not However, contingencies in a preservation pathway can change quickly. Going to the rescue considered and codified in advance. of data at preservation risk requires potentially In this paper, we discuss two different costly and time consuming strategies. The ability scenarios enacted due to the abrupt closure of to respond successfully is enhanced by planning a large distributed data preservation initiative an exit strategy for the data. We present two scenarios enacted in response to the closure of a [9]. While our approach to depositing two sets distributed data preservation initiative and of data in this network included several stress the importance of a prior “plan B” to strategies that supported exit efforts, failure to digital preservation plans. outline a comprehensive early exit strategy in Keywords – exit strategy, at-risk data, each case led to extra effort and decision- distributed digital preservation making following news of the closure. Based Conference Topics – Designing and on this experience we identify what worked, Delivering Sustainable Digital Preservation; what could have been improved, and provide Collaboration: a Necessity, an Opportunity or a recommendations. Luxury?

II. DEPOSITING DATA INTO A DISTRIBUTED I. INTRODUCTION PRESERVATION INITIATIVE Because digital preservation efforts exist The case study we present concerns a large on extended time scales, conditions distributed digital preservation initiative that surrounding their context are bound to opened in 2016. It was comprised of nodes at change. The ongoing nature of digital academic institutions geographically dispersed preservation has been extensively stressed. throughout the United States, each using a Administrative tools such as cost sustainability different storage architecture. Members of the calculators [1] [2] and decision-making initiative bought a data allocation for deposit in matrices [3], and technical approaches such as the network. They worked with an ingest node, auditing [4], migration [5] [6], and virtualization which used a centralized suite of tools to or emulation [7] [8], allow institutions to select deposit data and replicate it to additional and maintain a preservation pathway. nodes for long term storage. The transfer However, when conditions for preservation mechanism for the initiative was BagIt [10], a change, the pathway is disrupted. Responding widely adopted specification for grouping files to data at risk requires implementing another in a standardized directory structure (a “bag”) and attaching “tag files,” plain text files 16th International Conference on Digital Preservation iPRES 2019, Amsterdam, The Netherlands. Copyright held by the author(s). The text of this paper is published under a CC BY-SA license (https://creativecommons.org/licenses/by/4.0/). DOI: 10.1145/nnnnnnn.nnnnnnn

containing descriptive and administrative recover the directly from the legacy metadata, a file manifest, information about database were unsuccessful. An oversight on the version of the bagging tool used, and our part was not pursuing extracting the checksums for each file in the bag. metadata from the new system as a JSON file. UT Austin served as a network node, Due to the expected changes data receiving content from other member ownership, we needed an identifier system to institutions for storage at the University of track the preservation network data packages Texas at Austin’s Texas Advanced Computing over time. Each bag was given an ARK identifier Center (TACC) via ingest tool implementation [13] through a global identifier service before and hosting by the Texas Digital Library [11]. deposit. The ARK pointed to the new location Data deposited by UT Austin in the initiative of the dataset so that information about the would be copied to TACC and two additional project was maintained. Using this strategy, geographically dispersed nodes. upon changes in data stewardship, the identifiers could be updated to show new III. DATA DEPOSIT: DESIGNSAFE-CI custody. For our own recordkeeping, and to provide future custodians information about the In 2015 DesignSafe, hosted at TACC, preservation network packages, we created become the awardee of a National Science metadata packages for each bag to retain Foundation cyberinfrastructure (CI) grant to locally. We stored a copy of each bag’s tag files build an end-to-end and and copies of the network’s ingest and analysis portal for natural hazards engineering replication tool reports in a directory named [12]. The grant required taking custody of data according to bag identifiers. We placed a copy from the previous iteration of the project, of these within the cyberinfrastructure for which had been hosted at two other transmission to future awardees. institutions for more than a decade [16]. The legacy data, composed of ~2000 datasets and IV. DATA DEPOSIT: UT LIBRARIES their metadata, were migrated into the new web-based portal for distribution and access. At the same time, the UT Libraries were While the metadata for each dataset in the preparing their own data for ingest into the collection followed a logical model, it was not network. These were archival master TIFF translated into a standard schema. The new images of content digitized from library system involves a second copy of the data on a collections, primarily representing items such geographically replicated file system. as rare books, University theses and dissertations, maps, and government reports. In late 2016, we began preparing this legacy data for ingest into the distributed digital Copies of the flies were stored in bags in preservation initiative. The goal was to explore the Libraries’ LTO tape archive, largely a long term preservation proof of concept by organized only in relation to their date of creating a subset of static data and its creation, and without descriptive and in some metadata as a third dark archival copy. The cases technical metadata. The online projects cyclical nature of funding for the CI meant that arising from these digitization efforts feature special care had to be taken to make the data descriptive metadata for the files, but and knowledge of it and its whereabouts asynchronous legacy workflows meant that portable, anticipating when the next host metadata were not ready for vaulting at the institution would take custody in 5-10 years. time that files needed to move to tape to free processing space on disk. To prepare for deposit, the data were grouped per data publication (research Because purchasing storage in the project) and packaged according to the BagIt distributed digital preservation initiative specification. When possible we enclosed each represented a significant cost to our project in one bag according to the 200 GB limit organization, we wanted to prepare our data to for the distributed initiative’s ingest tool. For a higher degree of preservation quality for projects over 200 GB we enclosed data in ingest than we had been storing it locally. To sequenced bags. In each bag we also placed prepare, we restored a copy from tape, descriptive metadata, which was scraped from reorganized files in logical content units, the legacy site interface. Multiple attempts to generated FITS technical metadata [14], and re- iPRES 2019 - 16th International Conference on Digital Preservation 2 September 16 - 20, 2019, Amsterdam, The Netherlands.

bagged, making use of bag-info.txt files to add A simultaneous development was our basic descriptive metadata for each package. university’s adoption of a new global identifier This metadata came from various sources, service that does not support ARKs. With this such as project web portals, digitization change, the DesignSafe preservation bag ARKs records, and in some cases institutional were decommissioned. We did not anticipate memory. at the time of creating the ARKs, which were central to our preservation plan, that this Bags were ingested into the network in the service would be disrupted. Had the same manner as the natural hazards legacy distributed initiative continued we would have data, with bag tag files and ingest reports needed a new strategy for identifiers, retained locally. The initiative marketed very illustrating how many preservation services long goals, meaning that staff and systems can change in a short period of creating these initial ingest bags could be time within one preservation pathway. Risks retired by the end of the service terms. This for each dependency in a plan, especially reality stressed the importance of local regarding services and systems outside of recordkeeping regarding our deposits that one’s immediate control, should be taken into could be persisted in our organization over account at the outset. Risk management is not time. Notably, the enhanced data packages well represented in current digital preservation were not re-written to tape locally, since we literature but would be a fruitful area for future assumed they would be preserved in the work [15] [16] [17]. distributed network and the data were sizable by our local storage standards. The content VI. EXITING THE NETWORK: UT LIBRARIES file-only bags were retained as originally stored. UT Libraries’ data took another path. Since we knew that the deposited data packages V. EXITING THE NETWORK: DESIGNSAFE-CI were superior to our local copies, we wanted to retrieve them. We first collected bag identifiers In early 2019 the distributed digital applied by the Libraries while preparing the preservation initiative announced that it would data for ingest, using a client that was part of shutter. Because we had no formalized exit the technology stack of the distributed strategy to turn to, quick action was needed to network. Interacting with TACC storage node decide the disposition of the data stored within was via iRODS iCommands, an open source it. data management software [18]. After copying We first investigated which network nodes the data to local storage, a post copy received copies of our data and began verification computed SHA2 values on both conversations with staff there to determine ends for comparison. Each copied tarball was options. In the end, we found that full copies of then extracted and had bagit-python validation all UT Austin data, both DesignSafe’s and UT run. Since the ingests into the distributed Libraries’, had been replicated to a file system initiative were an early proof of concept using at the TACC network node. Because we are new technology, this time consuming campus partners with an existing collaborative validation assured us that the bag contents relationship, this offered us some time and were an exact match to what had been flexibility to move forward. originally placed into the network. With the DesignSafe data, we initiated The UT Libraries are now exploring testing on the CI to ensure that the data we alternative options for storage duplication. For placed in the network had been effectively the time being, we write two copies of all data ported to the new CI for access. We searched for preservation to tape, with one being stored the cyberinfrastructure for legacy project in an off-site vendor facility. numbers that we had embedded in the network bag identifiers and found that all were VII. CONCLUSIONS present. Because the data was ported and includes the geographically replicated copy, we In each of these cases, staff at TACC and the decided not to recall the copies that were at the UT Libraries worked together to expend other three national nodes. These copies will considerable effort strategizing an approach to be deleted. If we decide to make a third copy of preservation packages for ingest into the the data, it can be sent to TACC’s tape archive. distributed digital preservation initiative, along with even more time and effort spent actually iPRES 2019 - 16th International Conference on Digital Preservation 3 September 16 - 20, 2019, Amsterdam, The Netherlands.

creating the data packages. We then One positive outcome for the UT Libraries meticulously tracked and recorded ingests of is that since we were able to retrieve and verify the packages into the network. We did not, these higher quality packages when the however, spend enough time creating a plan distributed initiative closed, we can supplant that could be enacted quickly if the network the lower quality packages in our tape archive failed or we needed to leave it for our own right as we are planning a tape migration. reasons. Another is that the exercise of creating the superior preservation bags for the distributed In the case of DesignSafe, we took the network transformed our ongoing local work. continuation of the initiative for granted and We now treat all preservation data with the concerned ourselves primarily with how we same approach that we devised for would let new CI awardees know about the participating the distributed initiative. We are packages that we deposited into the network. also developing a Digital Asset Management At the UT Libraries, we wanted to take the best System (DAMS), which will help automate advantage of our financial investment in the much of the work involved in creating these network by depositing the best-organized, enhanced preservation packages and supply most fully-described copy of our data possible. us with means for including more robust Because we were aware that some technical structured descriptive metadata. aspects of the network were still in development when our ingests started, we had In summary, our efforts in DesignSafe and a degree of skepticism about how we or the the UT Libraries to prepare data for the receiving nodes would keep track of our bags distributed preservation initiative should have over time. And, as previously described, we been matched by equally careful early exit were mindful of potential staff turnover in the strategy planning, risk analysis, and risk long term. These led us to make decisions management. This came into sharp view when about preparing archival packages that would the initiative closed and we needed to respond be fully self-describing. We wanted our data, quickly. However, the experience presented an once out of our hands, to be understandable to opportunity to improve on previous anyone encountering it without the staff who shortcomings in the projects involved, ended prepared it needing to be available for with successful retrieval of data, and pushed us explanations over the long term. These to make point-forward changes in existing strategies all addressed aspects of data’s practices so that we would not repeat mistakes persistence in the initiative over the long term, of the past. but not what we would do in the event of Our recommendations for exit strategies in closure. digital preservation include: Our lack of a fully formed exit strategy cost ● Pay equal attention not just to how to us a good deal of staff time and effort. For best use a system or tool but also how DesignSafe, had we kept records for each bag to stop using it, possibly very abruptly. that the corresponding project was safely We were careful in planning our ingest ported into the new cyberinfrastructure, we packages and process, but then caught could have notified the partner nodes off guard by needing to exit the immediately that they could delete the initiative on a relatively short timeline. preservation network bags, rather than use ● Consider the goals of an exit strategy. valuable time tracking bag and project With one in place, what will you be able whereabouts on news of the closure. We to do? What is most important: expended significant staff time and Efficiency? Ease? Technical computational resources at the UT Libraries considerations? Had we planned for pulling down and verifying a copy of all of our how abruptly the network might network bags from TACC storage when the dissolve we would have devised a network closed. In the end it would have been strategy that made data deletion a much more efficient for us to have written the quick and easy decision. The network enhanced copies to tape locally as the new bags would only have represented an copy of record at the time of their creation. On additional replication. closure of the network we then could have ● Assess dependencies early in the simply agreed to delete the distributed copies. planning process. If we had done this, we might have foreseen how lack of iPRES 2019 - 16th International Conference on Digital Preservation 4 September 16 - 20, 2019, Amsterdam, The Netherlands.

support for ARKs could cause issues [4] S. Marks, Becoming a Trusted Digital Repository, later in the switch to a new identifier Chicago, IL: Society of American Archivists, 2015, pp. 46-49. system. [5] S. Marks, Becoming a Trusted Digital Repository, ● Include metadata in preservation Chicago, IL: Society of American Archivists, 2015, pp. packages, not just data. Without 50. metadata files may become [6] K. Green, K. Niven, G. Field, “Migrating 2 and 3D datasets: preserving AutoCAD at the Archeology Data meaningless over time. UT Libraries Service,” ISPRS International Journal of Geo-Information, enhanced packages became valuable vol. 5, no. 4, pp. 44-56, April 2016. in the network exit because they were [7] D. Rosenthal, “Emulation & virtualization as the only copies with metadata preservation strategies: a report commissioned by The Andrew W. Mellon Foundation,” New York, 2015. alongside the content. https://mellon.org/Rosenthal-Emulation-2015 ● Preferably include structured [8] D. Anderson, J. Delve, D. Pinchbeck, “Toward a metadata to allow interoperability with workable emulation-based preservation strategy: future systems. In our examples, lack rationale and metadata,” New Review of Information Networking, vol. 15, no. 2, pp.110-131, November of structured metadata will make 2010. pushing preservation packages back [9] D. Minor, “The Digital Preservation Network,” into a repository a problem. presentation, 2014. ● Include identifiers that link replicated http://web.stanford.edu/group/dlss/pasig/PASIG_Sep tember2014/20140918_Presentations/20140918_04_ data with the projects to which they DigitalPreservationNetwork_DavidMinor.pdf belong so that provenance can be [10] J. Kunze, J. Littman, E. Madden, J. Scancella, C. Adams, retraced. This helped us track the “The BagIt File Packaging Format (V1.0).” DesignSafe data, assuring safety to https://tools.ietf.org/html/rfc8493 [11] The Texas Digital Library. delete network bags. https://www.tdl.org/about-tdl ● Keep careful local records of what data [12] E. Rathje, C. Dawson, J.E. Padgett, J.P. Pinelli, D. have been sent for replication, where, Stanzione, A. Adair, P. Arduino, S.J. Brandenberg, T. and when. Again, this helped us verify Cockerill, C. Dey, M. Esteva, F.L. Jr. Haan, M. Hanlon, A. Kareem, L. Lowes, S. Mock, G. Mosqueda, “DesignSafe: our decisions at exit. a new cyberinfrastructure for natural hazards ● Select tools that offer hash checking at engineering,” ASCE Natural Hazards Review, 2016. both ends of transfers for data doi:10.1061/(ASCE)NH.1527-6996.0000246. integrity. This is well-established in [13] H. Tarver, M. Phillips, “Identifier usage and maintenance in the UNT Libraries’ digital collections,” digital preservation but bears presentation, DCMI International Conference on repeating. Dublin Core and Metadata Applications, 2016. ● Carefully consider contractual http://dcpapers.dublincore.org/pubs/article/view/384 language and technical documentation 6 [14] File Information Tool Set (FITS). when selecting a preservation https://projects.iq.harvard.edu/fits/home approach, but proceed with caution [15] S. Hein, K. Schmitt, “Risk management for digital long- knowing that even with written terms term preservation services,” Proceedings of the 10th in place conditions may change over International Conference on Digital Preservation (iPRes 2013), 2013. time. https://services.phaidra.univie.ac.at/api/object/o:378 059/diss/Content/get ACKNOWLEDGEMENT [16] A. McHugh, P. Innocenti, S. Ross, R. Ruusalepp, “Risk management foundations for digital libraries: This work was partially funded by the DRAMBORA (Digital Repository Audit Method Based National Science Foundation grant number on Risk Assessment),” Second Workshop on Foundations of Digital Libraries, 2007. 1520817 https://www.researchgate.net/publication/45456621 _Risk_management_foundations_for_DLs_DRAMBOR REFERENCES A_digital_repository_audit_method_based_on_risk_as sessment [1] J. Morley, “Storage cost modeling,” presentation, [17] S. Ross, A. McHugh, “Preservation pressure points: figshare, 2019. evaluating diverse evidence for risk management,” https://doi.org/10.6084/m9.figshare.7795829.v1 presentation, International Conference on Digital [2] K. Dohe, D. Durden, “The cost of keeping it: toward Preservation (iPres 2006), 2006. effective cost-modeling for digital preservation,” https://services.phaidra.univie.ac.at/api/object/o:294 presentation, open science framework, 2018. 550/diss/Content/get https://doi.org/10.17605/OSF.IO/HVD5F [18] Integrated Rule-Oriented Data System (iRODS). [3] N. Tallman, L. Work, “Approaching appraisal: framing https://irods.org/ criteria for selecting digital content for preservation,” presentation, Open Science Framework, 2018. https://doi.org/10.17605/OSF.IO/8Y6DC

iPRES 2019 - 16th International Conference on Digital Preservation 5 September 16 - 20, 2019, Amsterdam, The Netherlands.