What’s in your bag? Practical tools and strategies for scalable

Adriane Hanson, Brandon Pieczko, Iva Dimitrova, Mary Willoughby

October 25, 2018

UGA Organizational Context

● Five departments with content
    ○ media archives
    ○ political archives
    ○ manuscript/university archives
    ○ map and government docs
    ○ digital library
● Ten staff do digital preservation
● Library has IT and a developer
● Collaborated since 2014
● 1.5 PB content
● Wide variety of formats
● Workflow to create AIPs
● In-house preservation storage
● Set of policies
● Digital preservation is priority

How We Might Apply to You

● Most departments have less than 1 FTE for digital preservation
● Collaboration between departments is like collaboration between institutions
● Other than storage, all tools we use are free
● Other than storage, nothing we did required developer expertise
● Scripts require minimal programming expertise, but you don't have to use them

Policy Development

Digital Preservation ≠ Technology (only) Problem

Standards and Assessments

Why it's worth it
● Distill current research
● Basis for making decisions
● Framework for program
● Measure progress

Making the most of them:
● One part at a time
● Choose option closest to compliance
● Start with likely risks

Implementing New Things

● Identify criteria for evaluating options
    ○ Standards
    ○ User needs
    ○ Staff needs
● Test against criteria, with real data
● Learn as you go
● Documentation

Collaboration

● Who is at the table
● Agility: subgroups, "sprints"
● Project management software

Prioritizing Digital Content for Preservation

● Issues to consider:
    ○ Appraisal is still important
    ○ Multiple levels of digital preservation with corresponding actions

Source: Northwestern University Libraries, Digital Preservation Policy
https://www.library.northwestern.edu/about/administration/policies/digital-preservation-policy.html

Prioritizing Digital Content for Preservation

● Other considerations:
    ○ Uniqueness
    ○ Rights
    ○ Accessibility
● Our approach (Russell Library):
    ○ born-digital + deed of gift + appraised + processed -> ARCHive
    ○ other digital records -> SAN storage and/or LTO tape

AIP Creation Concerns

● Level at which to create Archival Information Packages
● Issues to consider:
    ○ Level of description for digital records within the finding aid
    ○ How the files will be made accessible to researchers (as Dissemination Information Packages)
    ○ Size, quantity, and diversity of file formats in the AIP
    ○ Version control
● Our approach: DIPs = AIPs (usually)

AIP Creation Concerns

Unique identifier = repository id + collection number + er + ######

Example: rbrl-432-er-000001

AIP Creation Concerns
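A minimal Python sketch of the identifier pattern (repository id + collection number + er + ######), assuming the final part is a six-digit, zero-padded sequence number as in the example rbrl-432-er-000001. The function names `build_aip_id` and `parse_aip_id` are hypothetical, chosen for illustration, and the characters allowed in the repository id and collection number are an assumption.

```python
import re

# Pattern inferred from the example "rbrl-432-er-000001":
# repository id, collection number, the literal "er", six-digit sequence.
# Allowing lowercase letters and digits in the first two parts is an assumption.
AIP_ID_PATTERN = re.compile(r"^([a-z0-9]+)-([a-z0-9]+)-er-(\d{6})$")


def build_aip_id(repository_id, collection_number, sequence):
    """Return an identifier such as 'rbrl-432-er-000001'."""
    if not 1 <= sequence <= 999999:
        raise ValueError("sequence must fit in six digits")
    return f"{repository_id}-{collection_number}-er-{sequence:06d}"


def parse_aip_id(aip_id):
    """Split an identifier into its parts, or raise ValueError if malformed."""
    match = AIP_ID_PATTERN.match(aip_id)
    if match is None:
        raise ValueError(f"not a valid AIP identifier: {aip_id!r}")
    repository_id, collection_number, sequence = match.groups()
    return repository_id, collection_number, int(sequence)


if __name__ == "__main__":
    print(build_aip_id("rbrl", 432, 1))
    print(parse_aip_id("rbrl-432-er-000001"))
```

Generating identifiers from their parts, rather than typing them, keeps the padding consistent, and the parser gives a quick way to reject malformed identifiers before they reach a request system or the repository.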

Aeon requests (unique identifier = “barcode”)

AIP Creation Concerns

AIP identifiers in the ARCHive

Validating Fixity of Stored Digital Objects

● Validating fixity -> preserves authenticity
● Validate early and validate often
    ○ Generate checksums during accessioning and store them with content
    ○ Validate on a regular basis, especially when copying files between storage environments (Robocopy, TeraCopy, rsync, etc.)
● Strategies for checksum validation
    ○ Fun with spreadsheets!

Validating Fixity of Stored Digital Objects

● Spreadsheet method for validating fixity
    ○ Original checksums from manifest vs. current checksums

Validating Fixity of Stored Digital Objects
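The comparison of original checksums from a manifest against current checksums can also be scripted rather than done in a spreadsheet. The Python sketch below uses only the standard library and assumes the manifest holds one `checksum  relative/path` pair per line with MD5 checksums; the function names are hypothetical and this is not the presenters' own script.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest_path):
    """Return {relative_path: checksum} from 'checksum path' lines."""
    expected = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, relative_path = line.split(maxsplit=1)
            expected[relative_path] = checksum.lower()
    return expected


def validate_fixity(base_dir, manifest_path):
    """Compare original checksums with current ones; return a list of problems."""
    problems = []
    for relative_path, original in read_manifest(manifest_path).items():
        target = Path(base_dir) / relative_path
        if not target.is_file():
            problems.append((relative_path, "missing"))
        elif md5_of_file(target) != original:
            problems.append((relative_path, "changed"))
    return problems
```

An empty list from `validate_fixity` means every file listed in the manifest is present and unchanged. Files that exist on disk but are not listed in the manifest are not reported by this sketch.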

● Other (better) strategies for checksum validation:
    ○ Dedicated fixity checking software (e.g. Fixity)
    ○ “Buy into the beauty of bags” (BagIt File Packaging Format)
        ■ For real, they will make your life much easier (well, at least your fixity checking).

Preservation metadata: putting the pieces together

● A lot of pieces of useful (but not necessarily required) information
● Strategic selection of required information for preservation & decision about how to structure data
● Interfacing with digipres system

Pieces shown: aip-id, collection-id, format(s), aip size

3 types of metadata

1. technical metadata
    a. format(s) master.
    b. format note (extra format info)
    c. file size / aip size
    d. fixity

2. rights metadata
    a. copyright
    b. access / restrictions
    c. provenance

3. system metadata
    a. title
    b. aip-id
    c. collection-id
    d. version #

Choosing a metadata standard

PREMIS (Preservation Metadata: Implementation Strategies)
- widely used and has active development/support
- good balance of required vs. optional fields
- extensible

Dublin Core
- simpler than PREMIS
- includes a title tag (PREMIS does not)
- used for rights metadata
- URI links to online rights statement http://rightsstatements.org/en/

Technical metadata extraction

FITS (File Identification Tool Set)
● robust format ID (11 tools)
● tools are configurable (turn on/off)
● XML output
● PRONOM-based
● cons:
    ○ one XML for each file
    ○ open-source
    ○ inconsistent order of XML tags

MediaInfo
● great for AV items
● XML output
● tags for video & audio streams; includes codec info
● cons: 1 tool can misidentify a file (no internal checks)

Tools + Automation

You can use & modify our stylesheets!

Available on GitHub @ uga-libraries

Testing metadata validity

XML Schema Definition (XSD)

"I said, 'Do you speak my language?'

He just smiled and gave me a vegemite sandwich."

The XML Schema Definition specifies:
- expected hierarchy of XML elements
- max/min number of fields to expect
- type of data to expect in each field (e.g., number, string)
- combination of characters to look for (regexes), etc.

BagIt Specification

Packages files for transfer and storage

● Allows repositories to check completeness and fixity

● Platform independent

● Widely implemented

● Tools in many different languages including Python, Java, and Perl

A look inside...
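As a rough sketch of what a bag looks like on disk, the Python below builds a minimal one using only the standard library: a `data` folder holding the payload, plus the `bagit.txt` and `manifest-md5.txt` tag files. It illustrates the layout only and is not a substitute for tools such as Bagger or bagit-python. The function name `make_minimal_bag` is hypothetical, and the declaration values assume BagIt version 1.0.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write the two required tag files."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload_dir = bag_dir / "data"
    # copytree creates bag_dir if needed; bag_dir/data must not already exist.
    shutil.copytree(source_dir, payload_dir)

    # Bag declaration: identifies the folder as a bag (version assumed to be 1.0).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "checksum  path" line per payload file,
    # with paths relative to the bag's base directory.
    # Files are read whole here for brevity; large files should be chunked.
    lines = []
    for path in sorted(payload_dir.rglob("*")):
        if path.is_file():
            checksum = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{checksum}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

Because the manifest travels with the payload, anyone who receives the bag can recompute the checksums and confirm that the contents are complete and unchanged.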

In the base directory of the bag are the data folder, which contains the payload, and text files (tag files) that describe the payload and the bag itself.

A look inside... bagit.txt (required tag file)

A look inside... manifest-md5.txt (required tag file)

A look inside... bag-info.txt (common tag file)

Tools: Bagger

Tools: bagit - python

Tools: bagit - python + scripting

bag_and_validate_all_folders.cmd

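@ECHO OFF
REM Bag and validate every folder in the current directory, then add a _bag suffix to its name.
REM Assumes bagit.py is in the current directory and python is on the PATH.
ECHO Please close the Explorer window you launched this from
timeout /t -1

FOR /D %%g IN ("*") DO python bagit.py "%%g" && python bagit.py --validate "%%g" && rename "%%g" "%%g_bag"

PAUSE

Uses for Bags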

Is it important? Do you need to keep it? Put it in a Bag.

● Any time you have a set of files destined for a preservation workflow.

● When files are ready for preservation storage.

● When transferring files to partners.

● Work with partners/donors to have them use Bags for transferring files.

Bags in ARCHive

How are UGA Libraries using bags in ARCHive?

● Mandated as part of our ARCHive AIP definition

● Created with scripting, but our methods vary by department
    ○ Mac/Linux using Python and Bash scripting
    ○ Windows batch scripting

● Serialized for ingest into ARCHive using for hardware efficiency

● Compressed for more efficient transfer using BZip

Storage

A lot of good options, but know their limitations:

● External hard drives

● RAID (Redundant Array of Independent Disks)

● LTFS (Linear Tape File System)

● Cloud Storage

Storage

Don’t put all of your eggs in one basket!

● Redundancy
    ○ Multiple copies of data on different kinds of media
● Geographic dispersal
    ○ Store copies in different locations
● Monitor
    ○ Check for errors
    ○ Store media in a controlled environment
    ○ Plan migrations to manage obsolescence

Thank you!

What questions do you have?

Contact us: Adriane: [email protected] Brandon: [email protected] Iva: [email protected] Mary: [email protected]

Image: (CC BY-NC-SA 2.0) https://www.flickr.com/photos/bensheldon/2686058366/

Additional Information

BagIt. (n.d.). In Wikipedia. Retrieved October 20, 2018, from https://en.wikipedia.org/wiki/BagIt

Kunze, J., Littman, J., Madden, E., Scancella, J., & Adams, C. (2018, September 17). The BagIt file packaging format (V1.0) [IETF Internet-Draft]. Retrieved from https://tools.ietf.org/html/draft-kunze-bagit-17

Digital Library of Georgia AIP Scripts: https://github.com/mkwzzz/AIPscripts_DLG

UGA Libraries AIP Scripts: https://github.com/uga-libraries

UGA Libraries Digital Preservation Workflow Handout: https://bit.ly/SGA2019Session2A