Masaryk University Faculty of Informatics

Design and implementation of Archival Storage component of OAIS Reference Model

Master’s Thesis

Jan Tomášek

Brno, Spring 2018


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jan Tomášek

Advisor: doc. RNDr. Tomáš Pitner Ph.D.


Acknowledgement

I would like to express my gratitude to my supervisor, doc. RNDr. Tomáš Pitner, Ph.D., for valuable advice concerning the formal aspects of my thesis, readiness to help and unprecedented responsiveness. Next, I would like to thank RNDr. Miroslav Bartošek, CSc. for providing me with long-term preservation literature, RNDr. Michal Růžička, Ph.D. for managing the very helpful internal documents of the ARCLib project, and the whole ARCLib team for the great collaboration and willingness to meet up and discuss questions that emerged during the implementation. I would also like to thank my colleagues from the inQool company for providing me with this opportunity, for help in solving practical issues, and for the time flexibility needed to finish this work. Last but not least, I would like to thank my closest family and friends for all the moral support.

Abstract

This thesis deals with the development of the Archival Storage module of the Reference Model for an Open Archival Information System. The theoretical part analyzes the basic concepts of the OAIS Reference Model and its Archival Storage component and then researches current storage technologies suitable for use in the Archival Storage. The practical part deals with the analysis, design and implementation of the Archival Storage and its integration with the Ceph and ZFS storage technologies. The developed first version of the Archival Storage is ready to be integrated with other OAIS modules to create the first functional version of the whole system, which will be tested from the summer of 2018. The result of this thesis is one of the first steps towards the implementation of a complex, open-source, OAIS-compliant solution for the long-term preservation of (library) digital collections which may be used by organizations of any size.

Keywords

OAIS, ISO 14721, archival storage, archival system, LTP, long-term preservation, digital preservation, cultural heritage, Java, Ceph, ZFS


Contents

1 Introduction ...... 1
2 OAIS ...... 3
2.1 The OAIS Standard ...... 3
2.2 Functional Model ...... 4
2.2.1 OAIS Environment ...... 5
2.2.2 Information Objects ...... 6
2.2.3 Functional Entities ...... 7
2.3 Data Migration ...... 9
2.3.1 Refreshment and Replication ...... 9
2.3.2 Repackaging and Transformation ...... 10
2.4 Archival Storage in Detail ...... 11
2.4.1 Receive Data ...... 12
2.4.2 Manage Storage Hierarchy ...... 12
2.4.3 Replace Media ...... 13
2.4.4 Error Checking ...... 13
2.4.5 Disaster Recovery ...... 13
2.4.6 Provide Data ...... 13
3 Software Storage Technologies for LTP ...... 15
3.1 Features of Software Storage Technologies ...... 15
3.1.1 Underlying Hardware ...... 15
3.1.2 Capacity and Speed ...... 15
3.1.3 Open-Source ...... 16
3.1.4 Redundancy ...... 16
3.1.5 Geo-Replication ...... 17
3.1.6 Other Features ...... 18
3.2 Research of Up-to-date Technologies ...... 18
3.2.1 ZFS ...... 19
3.2.2 Btrfs ...... 21
3.2.3 Gluster ...... 21
3.2.4 Ceph ...... 23
3.2.5 Amazon Web Services ...... 26
3.3 Conclusion ...... 29
3.3.1 Summary ...... 29
3.3.2 Features Comparison ...... 30
4 Analysis ...... 33

4.1 The ARCLib Project ...... 33
4.2 ARCLib Archival Storage ...... 34
4.3 Archival Storage Requirements ...... 35
4.3.1 Receive Data ...... 36
4.3.2 Manage Storage Hierarchy ...... 36
4.3.3 Replace Media ...... 36
4.3.4 Error Checking ...... 37
4.3.5 Disaster Recovery ...... 38
4.3.6 Provide Data ...... 38
4.3.7 Other Functional Requirements ...... 38
4.3.8 Other Non-functional Requirements ...... 39
5 Design ...... 41
5.1 Archival Storage Prototype ...... 41
5.2 Refining Requirements and Design ...... 42
5.2.1 Object Metadata ...... 42
5.2.2 Storing Objects ...... 43
5.2.3 Logical Storage Failure ...... 44
5.2.4 Authentication and Authorization ...... 44
5.3 Entity Relationship Diagram ...... 45
5.4 Class Diagram ...... 46
5.5 Object State Diagram ...... 47
6 Implementation ...... 49
6.1 Technologies ...... 49
6.2 HTTP API ...... 50
6.3 Service Layer ...... 50
6.3.1 AIP Store Request ...... 51
6.3.2 AIP Get Request ...... 53
6.4 Database Layer ...... 54
6.5 Storage Service ...... 55
6.5.1 ZFS Storage Service ...... 55
6.5.2 Ceph Storage Service ...... 56
6.6 Testing ...... 57
7 Conclusion ...... 59
A Source Code, Setup and Documentation References ...... 61

List of Figures

2.1 OAIS Functional Model [6, p. 4-1] 5
2.2 OAIS Archival Storage functional entity [6, p. 4-8] 12
3.1 ZFS Architecture [20] 19
3.2 Gluster architecture [26] 22
3.3 Ceph architecture [31] 24
4.1 ARCLib Archival Storage 34
5.1 Entity Relationship Diagram 45
5.2 Class Diagram 46
5.3 Object State Diagram 47
6.1 AIP Store Request Activity Diagram 51
6.2 AIP Get Request Activity Diagram 53


1 Introduction

The digital age provides the institutions responsible for the preservation of cultural heritage with means which can contribute greatly to the fulfillment of their mission. The topic of the long-term preservation of digital documents (LTP) is addressed in the Reference Model for an Open Archival Information System, known as the OAIS standard, a generally recognized standard which has become the starting point for most of the actual LTP systems. In the Czech Republic, the digitization efforts were advanced in 2002, when the flood destroyed many historical collections [1], which could have been saved if the documents had been digitized, replicated and distributed to distinct geographical locations. One of the best known open-source LTP systems, Archivematica, was tested in the LTP-pilot project, led by the Institute of Computer Science of Masaryk University in 2014-2015 [2]. The research shows that while Archivematica is a powerful LTP system, it does not fulfill all OAIS requirements [3]. Using the outputs of the LTP-pilot and other research projects, the ARCLib project, led by the Library of the Czech Academy of Sciences, together with Masaryk University, the National Library of the Czech Republic and the Moravian Library in Brno, has originated with the main goal of creating a complex, open-source and OAIS-compliant LTP solution, integrating Archivematica and other systems and standards used by Czech organizations [4]. The aim of this thesis is to analyze the OAIS standard and its Archival Storage module and then perform research and a comparison of current storage technologies suitable for use in the Archival Storage module. Based on the acquired knowledge, the first version of the Archival Storage module within the ARCLib context is designed and implemented. The first version of the Archival Storage is ready to be integrated with the rest of the ARCLib system through its HTTP API and provides the main functionality, which is storing data to the Archival Storage, retrieving data from the Archival Storage and versioning. The first version also contains the integration of two different storage technologies.


2 OAIS

This chapter describes the OAIS standard. The introductory section is followed by a description of the OAIS functional model, which is needed to understand the OAIS data lifecycle. As this thesis is dedicated to a single component of the OAIS, responsible for the archival data storage, the third section describes the OAIS data migration concept and is followed by the last section, which further elaborates the Archival Storage functional entity of the OAIS standard.

2.1 The OAIS Standard

The Reference Model for an Open Archival Information System (OAIS standard) is a standard developed by the Consultative Committee for Space Data Systems (CCSDS) on request of the ISO (International Organization for Standardization). The first version of the CCSDS standard, designated as Recommended Standard, was released in 1999, accepted as an ISO standard in 2002 and published as such a year later (ISO 14721:2003). In accordance with both organizations' policies, the standards went through a revision process, and after revisions, which were modest with a few exceptions [5, p. 6], the new version of the CCSDS standard, designated as Recommended Practice [6], was published in 2012 together with a new version of the ISO standard (ISO 14721:2012). Both CCSDS standards are available for free at the official public CCSDS websites [7]. Even though the development of the OAIS standard originated in the field of space agencies, it is not tied to the space domain. Being a reference model, the OAIS standard is a conceptual framework with a high level of abstraction and flexibility. It introduces terminology and specifies fundamental requirements, entities, relations and processes of an archival system, but it does not impose specific requirements on the actual implementation or the technologies used to achieve the requirements. It is therefore applicable to any digital archive, mainly those with the need of long-term preservation (LTP), where long-term means long enough to overcome technological changes as well as changes of the user community [6, p. 1-1].


The OAIS standard was developed in open public forums (hence Open Archival Information System) in which any interested party could participate. Besides the advantage of taking shared knowledge into account during the development, this approach also made it possible for OAIS to be integrated and popularized years before its official publication date. For example, the Deposit System for Electronic Publications (DSEP) of the Networked European Deposit Library (NEDLIB) is an OAIS-compliant archive which originated from the OAIS standard even six months before its first official version [8, p. 27]. The OAIS standard has become a starting point not just for OAIS-compliant archives but also for the development of new methodologies, standards, concepts, technologies and formats originating from the requirements specified in the OAIS standard. Now, after twenty years, the OAIS standard is still considered the lingua franca of LTP.

2.2 Functional Model

The OAIS Functional Model is the key to understanding how the individual functional entities of the OAIS interact with each other. The diagram also shows how users interact with the OAIS and how Information Packages (IP) are transferred and transformed. As this thesis further elaborates only one component of the OAIS (Archival Storage), the following description is simplified to a level appropriate for understanding the context and the terminology used later. Examples used to overcome the abstraction of OAIS are mostly from a real OAIS-compliant system which is introduced in Chapter 4. The terms OAIS and system are used interchangeably further in this chapter, if not specified otherwise (e.g. LTP system). When using the OAIS standard terminology, the first time a term occurs it is expressed in italics. Definitions of these terms may be found in the OAIS standard [6]. The diagram below shows six functional entities (modules) within the system boundaries, four information objects and the interaction with the OAIS environment. Arrows represent a unidirectional communication, whereas lines represent a bidirectional communication. Some lines are dashed for clarity reasons.


Figure 2.1: OAIS Functional Model [6, p. 4-1]

2.2.1 OAIS Environment

There are three different roles to be played in the OAIS environment. A role can be played either by an individual or by an organization. In the case of the Producer and the Consumer, it can also be played by another cooperating system.

Producer

At first, the Producer has to establish a Submission Agreement with the OAIS. Such an agreement may contain the specification of the provided data, the process of extracting metadata, as well as payment and authentication requisites. After the agreement is negotiated, the Producer provides data to the system in Data Submission Sessions.

Consumer

To get data from the system, the Consumer has to establish an Order Agreement, which can be either ad-hoc or event-based (e.g. for periodical exports). The Consumer can also use the Finding Aids of the system to search for data which correspond to a particular search query. The Order Agreement in this case can be as simple as completing a form in the Graphical User Interface (GUI) of the system [6, p. 2-11].

Management

Management administers the archive policies and defines the scope of the archive. It may generally do any typical management activity, such as processing measurement results, constructing risk analyses, solving conflicts etc.

2.2.2 Information Objects

The ability to understand or render the archived data in the long term is underpinned by storing additional data, the Representation Information, together with the data which are the subject of archiving. This data couple is defined as the Content Information. For example, when archiving data objects gathered from a web service, their meaning and format as well as the service used could be part of the Representation Information. Besides the Content Information, an Information Package also contains Preservation Description Information (PDI), which includes information required for the archive to fulfill its requirements. An example of a PDI part is the fixity information (checksums) used for periodic checks of the consistency of the data, or the access rights for the data. The last part of the Information Package is the Packaging Information, which links the Content Information and the PDI together into a single logical unit that can be identified, located and managed [5, p. 18]. The Packaging Information itself does not have to be stored in the system. If all files of the Information Package are packed into a single ZIP file, then this fact can be considered the Packaging Information.

Submission Information Package

The Submission Information Package (SIP) is the package that is supplied to the system by the Producer. Its content and format specifications are usually part of the Submission Agreement.


Archival Information Package

The Archival Information Package (AIP) is an SIP transformed within the OAIS before being stored to the media. This is the key Information Package that is actually archived within the OAIS, and therefore it has to contain all the additional data required by the system to work according to the specification. One SIP can result in multiple AIPs and vice versa. The relations can be even more complex.

Dissemination Information Package

The Dissemination Information Package (DIP) is a package returned to the Consumer as a result of the Order Agreement. The DIP can be equal to the AIP, but it can also be different, as the AIP can be transformed within the OAIS for retrieval purposes. It can also be a collection of AIPs.

Descriptive Information

The Descriptive Information contains metadata which are the source of the system's Finding Aids used by the Consumer to find the data of interest. The Descriptive Information can be provided by a Producer within a SIP and enriched by the OAIS process, resulting in the Descriptive Information of the AIP, which may contain additional metadata.

2.2.3 Functional Entities

Ingest

The Ingest process receives the SIP and transforms it into an AIP. The preparation can be a long list of processes containing, for example, SIP format validation, checksum validation, virus scanning, format detection and metadata extraction. The resulting AIP is then stored to the Archival Storage and its metadata to the Data Management module.

Archival Storage

The basic responsibility of the Archival Storage is to receive an AIP from the Ingest, store it to the permanent storage and make it available for a Consumer via the Access module. This module is the subject of this thesis and will be further explained in more detail.

Data Management

The Data Management receives AIP metadata from the Ingest and manages it in a way suitable for the Consumer's search queries delegated through the Access module. Indexing software may be an example of a suitable technology. It also manages other administrative data, e.g. reports about the archiving processes or the state of the Archival Storage media.

Access

The Access module is an interface for a Consumer to search within the metadata in the Data Management and then retrieve the requested data (AIP) from the Archival Storage and transform it into a DIP. It may also handle authentication, authorization and other processes related to access to the system.

Preservation Planning

The purpose of the Preservation Planning is to ensure that the archived information remains representable and understandable over the long term. It has to watch and analyze the evolution of technology as well as the evolution of the Consumers, and create plans for migration/emulation as solutions to information obsolescence. It therefore tends to be a research and monitoring activity done by people rather than an automated process. The Preservation Planning module is an example of community contributions to the OAIS standard, as it was not included in the first OAIS draft of May 1999 [8, p. 26].

Administration

The Administration module is a communication hub for all other modules and external entities of the system [5, p. 13]. Its responsibility is the monitoring of the system, its configuration and, generally, the coordination of the other entities.


For example, when a Producer negotiates the Submission Agreement, this happens via the Administration module, which may later provide the information to the Ingest module to verify the format of the SIP. Another example could be monitoring the state of the Archival Storage and providing the results to the Data Management, which may then make them accessible to the Consumer via the Access module, assuming that the Consumer has a role specified by the policies provided by the Management.

2.3 Data Migration

Before going into the details of the Archival Storage, it is convenient to have some background about the data migration concept. Data migration can refer to a process of transferring data from one media to another without changing its format. It can, however, also refer to the process of transforming data from one format to another. Likewise, the OAIS standard divides migration types into two categories: those which change bit sequences and those which do not. Here is a brief overview of the types of migration defined in the OAIS, together with examples and the motivation for their usage.

2.3.1 Refreshment and Replication

Refreshment and Replication fall into the first category. In both cases, no changes are made to the Information Package. In the case of Refreshment, AIPs are transferred to a medium of the same type in a sufficiently exact form, so that the old medium can be replaced with the new one without any further changes. Example: copying all bits from an old error-prone disk to a new one. During Replication, AIPs are transferred to media of the same or a new type and their location may change. The change of the location can cause a necessary change of the Archival Storage mapping infrastructure, which is responsible for mapping the AIP ID to its Packaging Information. Example: replicating AIPs to media of the same type, a file system with a differently named root, could cause an update of the Archival Storage mapping to match the new name.


Both approaches are widely used to keep the system robust. Migrating from an old media to a newer one, as well as creating a Redundant Array of Independent Disks (RAID), e.g. RAID 5, of replicas, prevents data loss caused by media decay. Another reason for this type of migration is to increase the performance and capacity of the system by migrating to new media types, or again by using RAID variants designed for this purpose. Considering the evolution of software and the limited support for old hardware, compatibility is another key reason for this type of migration.

2.3.2 Repackaging and Transformation

The Repackaging process involves changes in the Packaging Information but not in the Content Information or PDI. It therefore belongs to the second category. Following the example of Packaging Information in Subsection 2.2.2, the repackaging may be just a repackaging from the ZIP file format to the TAR file format. It can, however, be a more complex process. For example, the data of the Information Package do not have to be at one location and their Packaging Information may be represented by a Metadata Encoding and Transmission Standard (METS) file [9]. In this case, a change of the location of some data may require changes in the METS file. Transformation is a migration which involves changes in the Content Information or in the PDI. A typical use-case is transforming an obsolete, unsupported or specific file format to a new or a general one. This can be achieved in an automated way by using a converter; however, in many cases the transformation has to be done manually, either because the converter does not exist, or because it may cause undue information loss or misinterpretation. Transformation can be reversible, e.g. changing the type of fixity information in the PDI from MD5 to SHA-256, or non-reversible, e.g. converting an MS Word document to a PDF or using OCR software to transform a scanned, handwritten document into a sequence of characters. The former example of non-reversible transformation points to the problem of authenticity versus data efficiency, which is often much more complicated. The OAIS standard recognizes three types of the resulting AIP:


∙ AIP version, if the transformation was done for preservation purposes, keeping the information as close as possible to the original

∙ AIP edition, if the transformation was done to improve the original, for example with some additional data

∙ derived AIP, if the transformation was done by extracting data from an existing AIP or aggregating data of multiple AIPs

In all cases, the resulting AIP has a new ID and its PDI is updated to register the transformation and to identify the original AIP. Due to the gradual obsolescence of data formats, it is almost certain that Transformation will take place in every LTP system. However, the amount of transformations needed can be reduced by a wisely defined policy of the archive. Such a policy may, for example, restrict the incoming SIPs to contain only files of widely supported standardized formats like PDF.

2.4 Archival Storage in Detail

The diagram below shows the major data flows of the Archival Storage functions with relations to other modules of the system. It also shows that there can be multiple media used to store the data and that those media can be physically separated. Example media could be a database, a file system or some distributed storage system. The media can be composed of multiple devices, for example to form a RAID.


Figure 2.2: OAIS Archival Storage functional entity [6, p. 4-8]

2.4.1 Receive Data

The Archival Storage receives a Storage Request with an AIP from the Ingest and stores it to the media. When the transfer is done, a Storage Confirmation message including the AIP ID is sent back to the Ingest.

2.4.2 Manage Storage Hierarchy

The Archival Storage can use different media for different data. One of the responsibilities of this function is to place the data on the appropriate media. The mapping can be defined in the policies of the Administration module, together with the Storage Request, or by operational statistics (e.g. free space on the disk). Another responsibility is to monitor the error logs raised by the Error Checking function when an AIP is damaged, and to report the operational statistics of the Archival Storage to the Administration module.

2.4.3 Replace Media

The Replace Media function handles the reproduction of AIPs. Using the migration terminology, Refreshment, Replication and simple Repackaging may be performed completely within this function. Complex Repackagings and Transformations are, however, also subjects of the Archival Information Update function of the Administration module.

2.4.4 Error Checking

Error Checking ensures that no AIP was corrupted anywhere within the Archival Storage. This requires checking the integrity of AIPs at the SW/HW layer and, if an AIP is corrupted, writing this information to the error log, which is monitored by the storage hierarchy management. It should also provide support for verifying the integrity of a particular AIP upon request.
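The fixity verification described above amounts to recomputing a cryptographic digest of the stored AIP and comparing it with the checksum recorded in the PDI. The following is only a minimal Java sketch of such a check (the class and method names and the choice of SHA-256 are illustrative assumptions, not the OAIS specification); a mismatch would be written to the error log monitored by the storage hierarchy management.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixityChecker {

    /** Recomputes the SHA-256 digest of the stored AIP and compares it
     *  with the checksum recorded in the PDI (hex-encoded). */
    public boolean verify(Path aipFile, String expectedSha256Hex)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(aipFile)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest()).equalsIgnoreCase(expectedSha256Hex);
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}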

2.4.5 Disaster Recovery

The Disaster Recovery function provides a mechanism for duplicating the data in the permanent storage to other media, possibly in another location. The disaster recovery policy should be specified in the Administration module. The duplication process may simply involve copying the archive content to a remote storage over the network.

2.4.6 Provide Data

This function receives an AIP request with an AIP ID, looks up the requested AIP in the permanent storage and transfers it to the calling Access module. When the AIP transfer is done, the Access module is notified.
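Summarizing the six functions in programming terms, the externally visible behaviour of the Archival Storage can be reduced to a handful of operations. The interface below is a hypothetical Java illustration of this chapter only, not the OAIS specification and not the ARCLib design described later; all names are assumptions.

import java.io.InputStream;

/** Hypothetical, simplified view of the OAIS Archival Storage functions. */
public interface ArchivalStorage {

    /** Receive Data: stores the AIP and returns its ID for the Storage Confirmation. */
    String store(InputStream aip, String checksum);

    /** Provide Data: looks up the AIP by its ID and returns its content. */
    InputStream get(String aipId);

    /** Error Checking: verifies the fixity of a particular AIP upon request. */
    boolean verifyIntegrity(String aipId);

    /** Manage Storage Hierarchy: reports operational statistics (e.g. free space,
     *  error log entries) to the Administration module. */
    String getStateReport();
}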


3 Software Storage Technologies for LTP

Storage technology and methodology are fundamental to any LTP system. Yet, they are often underestimated and do not get the attention they deserve, especially from people lacking LTP knowledge or IT knowledge generally [10]. This chapter does not describe the methodology for physically storing data and providing bit-level protection; it rather summarizes important features of software storage technologies related to LTP system requirements. It then follows with research of some up-to-date storage technologies and evaluates them against the specified criteria.

3.1 Features of Software Storage Technologies

This section outlines some important features to be considered when choosing the right storage technology for an LTP system. Those which are general, required or affect every storage technology are briefly commented on. Those which are more specific to each technology are only outlined and may be described in Section 3.2.

3.1.1 Underlying Hardware

Disks, mainly HDDs, are used as an online, primary storage for the LTP system to read and write data. Nowadays, due to technological progress and reduced prices, disks are also often used for backups. SSDs may be used for caching or tiering purposes, or for storing frequently accessed data, like the journal and metadata. Magnetic tapes are used as a backup storage. They are ideally kept off-site, i.e. at a different geographical place. They are cheaper and offer higher durability than disks; however, their performance is much worse and their usage may require a significant initial investment into tape drives or even a robotic tape library.

3.1.2 Capacity and Speed

While some LTP systems are meant to be frequently accessed public repositories with lots of daily reads and writes, others can be dark archives1 used for batch ingests of archival, not frequently accessed data. The decision which storage technology to use depends on the nature of the system. It is also common to use more than one technology to achieve both speed and capacity, with redundancy and independence of a specific technology as other significant advantages. Capacity requirements vary a lot among organizations. Where one organization expects a 10 TB annual data increase, another one expects 100 TB [10]. The ongoing increase of produced digital data and the changing requirements on which and how much data to archive require intensive and continuous planning to choose the right storage technologies. Scalability, mainly the horizontal scalability provided by distributed systems, is one of the most widely accepted answers to the unstoppable data increase.

1. Archives with restricted access, typically for a subset of individuals of an organization

3.1.3 Open-Source

Open-source projects are flexible, extensible, free and may benefit from a wide community and shared knowledge. The possibility to integrate open-source software is required by communities [11] and it is one of the main priorities of organizations like, for example, the Rockefeller Archive Center [10, p. 49]. Other requirements usually tied with open-source technologies are a wide community, adoption in big projects, sponsorship, documentation and ongoing development and support. These aspects simply indicate the usability of the technology. However, the human labor needed to integrate and maintain open-source technologies has to be taken into account every time.

3.1.4 Redundancy

In order to prevent data loss caused by media decay or failure, storage technologies have to support redundancy, i.e. write additional data which can be used to reconstruct the original data when a failure occurs. Redundancy may be achieved, for example, by creating a copy of the data, i.e. replicating it, or by striping data and using parity information in a RAID 5 array. Nowadays, with the growth of digital data, erasure coding is also being used as an alternative to traditional RAIDs and replicas. It may lower the cost by using less storage and also provide more resilience but, on the other hand, it adds complexity and I/O latency [12]. Redundancy itself can prevent data loss caused by a disk failure, but it does not prevent data loss caused by correlated faults, i.e. "when one fault causes others or when one error causes multiple faults" [13]. An example of the former case could be a human error, a mistake, which results in data deletion on the disk and all its replicas. Regular backups are the solution to these cases. An example of the latter case may be a natural disaster destroying a whole data center with all replicas and backups. The solution to this is a disaster recovery plan which contains the distribution of data copies to multiple geographically separated locations, a geo-replication.
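The parity idea behind RAID 5 (and, in a generalized form, behind erasure coding) can be illustrated with a toy example: the parity block is the bytewise XOR of the data blocks, and any single lost block can be recomputed from the surviving blocks and the parity. This is only a didactic Java sketch, not how a real RAID controller or erasure-coding library is implemented.

public class XorParityDemo {

    /** Computes the parity block as the bytewise XOR of all blocks (all equally long). */
    static byte[] parity(byte[][] blocks) {
        byte[] p = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= block[i];
            }
        }
        return p;
    }

    public static void main(String[] args) {
        byte[][] data = {"AIP-p".getBytes(), "art-1".getBytes(), "of-3.".getBytes()};
        byte[] parity = parity(data);

        // Simulate the loss of the second block: XOR-ing the surviving blocks
        // with the parity block reconstructs it.
        byte[] recovered = parity(new byte[][] {data[0], data[2], parity});
        System.out.println(new String(recovered)); // prints "art-1"
    }
}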

3.1.5 Geo-Replication

Possibly the least costly way to distribute data is by repeating the process of backup creation and manually transferring the backups to a different location to keep them off-site. In addition to the low cost, this approach may also be preferred for its simplicity and total data control. However, if a disaster happens, all data written in the time window since the last backup procedure would be lost. In many LTP systems, any data loss is unacceptable. Thus, while the manual transfer of backups off-site may still be useful, it is necessary to integrate some distributed storage technology. As a decision whether to pay for a whole commercial LTP product or to integrate and maintain free open-source LTP technologies has to be made, the same decision has to be made for the storage technology. Especially in the case of distributed storage systems, which tend to be complex and require IT infrastructure maintenance, a cloud-based commercial solution may be a better choice. The National Library of Scotland and the Edinburgh Parallel Computing Centre were researching cloud possibilities for LTP in a project called Cloudy Culture. The estimated price of the cloud storage was in the end roughly half of the price of using local storage [14]. A disadvantage of such cloud storage is that further processing of the submitted data is not completely under our control. Security concerns about cloud storage are well founded by known incidents, like the human error on the Pentagon's side resulting in the exposure of the data to any Amazon cloud storage user [15]. Moreover, the incident of the hacked Dropbox [16] confirms the vulnerability of cloud storage providers, where the frequency of attacks is likely to grow together with the popularity of cloud storage. Also, an LTP system should not be locked in by any service provider, so there should always be an exit strategy.

3.1.6 Other Features

Last but not least, there are other features to be considered when choosing the right storage technology, for example:

∙ Extensibility/shrinkage of the storage

∙ Ability to add/replace/remove storage devices to/from a running storage

∙ Bulk data exports used in exit strategies

∙ Self-healing and scrubbing2

∙ Authentication, authorization and auditing

∙ Level of complexity and configurability

For an LTP system it is not a must to have a one-fits-all storage technology. In fact, it may be better to integrate more technologies with different features to achieve the best results.

3.2 Research of Up-to-date Technologies

In this section, five storage technologies of four different kinds are described. From local advanced file systems, through more complex distributed storage systems, to the biggest cloud services provider, this chapter tries to summarize the important aspects of each technology together with its background.

2. Self-healing is an automatic process performed to detect and fix data corruption. In this context, scrubbing usually refers to performing the self-healing process for the whole storage section.


When some feature occurs for the first time, it is briefly commented on to underline its importance. However, the feature does not have to be explicitly mentioned in the subsections dedicated to other technologies, even though it might be present. Each subsection rather describes the specifics of the particular technology than follows some feature-matching pattern. There are also a lot of back references. Therefore, it is better to read this section as a whole. See Section 3.3 for a brief outline and a features comparison.

3.2.1 ZFS

ZFS is a file system and volume manager used in the Solaris OS, which became open-source in 2005. In 2010 it became proprietary again, but the development of its open-source version, OpenZFS, continues [17]. Integration with various platforms is the subject of specific projects, like the open-source ZFS on Linux project [18] (OpenZFS combined with ZFS on Linux is further referenced as ZFS). Due to licensing problems, ZFS may not be contained in the Linux kernel and was added as a kernel module in the Ubuntu distribution just two years ago (2016) [19]. At the top level of each ZFS pool, data are striped across the attached virtual devices (VDEVs) like disks, RAIDs or files (useful for testing). There is also a possibility to provide a dedicated VDEV, typically an SSD, for caching, tiering and logging purposes. Each ZFS dataset is mapped to a directory, and multiple datasets may share the same pool. Each dataset may be configured with pool space quotas etc.

Figure 3.1: ZFS Architecture [20]


Compared to a traditional file system like Ext4, ZFS can store a huge amount of data. Theoretically, a single file may be 16 EiB large, which is 16x more than the maximum size of a whole Ext4 volume. The maximum size of a ZFS pool is 256 quadrillion ZiB. The mentioned ZFS RAIDs (called RAID-Z) are non-standard, enhanced implementations of RAID providing advanced data resilience. The ability to distribute data over these RAID-Zs attached as VDEVs in the pool makes it possible to handle many more disk failures, because disk failures in one VDEV do not affect the resilience of the other VDEVs. For example, as a single RAID-Z3 may handle 3 disk failures without data loss, a pool of 4 RAID-Z3 VDEVs can handle up to 12 disk failures. Moreover, ZFS supports the creation of multiple mirrors for any VDEV of the pool. The disadvantage of the ZFS design is that once the VDEV RAID is created, it is neither extensible nor shrinkable by adding/removing a device. At the higher level, as mentioned, attaching a VDEV to a pool is possible; however, its removal is not supported: once the VDEV is added to a pool, it can't be removed. Therefore, setting up a ZFS instance may be a hard task requiring proper analysis with the right predictions of the storage needs. RAID improves resilience but does not detect silent errors called bit rot, or data decay. To minimize the possibility of data corruption, for example due to a hardware error, ZFS manages checksums of data blocks and uses these checksums when reading data to verify data integrity. If corruption is detected and there is some redundancy level set, ZFS either reads the block from the mirrored device or reconstructs it using parity data and then fixes the bad block. This can also be done for the whole pool on request. ZFS also supports the copy-on-write technique and snapshotting, which can be used for incremental storage backups. Snapshots may be created for the whole pool or a particular dataset. Geo-replication may be achieved using the ZFS send/receive commands to transfer these snapshots from one server to another. To automate this backup process, a third-party tool, for example Syncoid [21], may be used. Compared to Linux's rsync command, which may also be used for incremental backups, ZFS send/receive is much faster [22].
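From the application's point of view, a ZFS dataset is simply a mounted directory, so an archival application can store and retrieve objects through the standard file API while the checksumming, redundancy and snapshotting described above happen below that level. A minimal Java sketch follows; the class name, the mount path and the write-then-atomic-move pattern are illustrative assumptions, not the ARCLib implementation.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FileSystemObjectStore {

    private final Path mountPoint; // e.g. the mount point of a ZFS dataset

    public FileSystemObjectStore(Path mountPoint) {
        this.mountPoint = mountPoint;
    }

    /** Writes the object to a temporary file first and then moves it atomically,
     *  so that a partially written object is never visible under its final name. */
    public void store(String objectId, InputStream content) throws IOException {
        Path tmp = mountPoint.resolve(objectId + ".tmp");
        Path target = mountPoint.resolve(objectId);
        Files.copy(content, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public InputStream retrieve(String objectId) throws IOException {
        return Files.newInputStream(mountPoint.resolve(objectId));
    }
}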


3.2.2 Btrfs

The biggest open-source ZFS competitor in the Linux world is Btrfs [23]. Btrfs was introduced in 2007, with the first stable version in 2014. It is under frequent development supported by Oracle. Unlike ZFS, it is a file system made specially for Linux and may be distributed directly within the Linux kernel. An interesting feature comparison of these two file systems can be read at the Czech Linux portal [24]. In many cases, the Btrfs design is more flexible [24]. An important difference is that, unlike ZFS, Btrfs supports shrinkage of the pool as well as its extension by adding or removing disks to/from the RAID. The supported RAID types are RAID 0, RAID 1, RAID 10, RAID 5 and RAID 6. Moreover, it is possible to easily switch from one RAID type to another. On the other hand, Btrfs supports only a single RAID, whereas ZFS RAIDs may be grouped as VDEVs in a pool. Btrfs is therefore able to handle only 2 disk failures without data loss. Btrfs also provides very useful features like defragmentation and rebalancing (redistributing data across all devices). These are missing in ZFS and a workaround involves copying the data from one dataset to another and back.

3.2.3 Gluster

Even though ZFS and Btrfs are featured file systems, they are not designed to run in a cluster or in a distributed environment. Therefore, they do not meet the increasingly important requirement for horizontal scalability. Gluster [25] is an open-source, scalable, distributed file system released to the community in 2007. Gluster was bought by Red Hat in 2011 and became a part of the Red Hat Storage Server product. In 2014, after Red Hat had bought Inktank, the company behind the Ceph storage, Red Hat Storage Server was renamed to Red Hat Gluster Storage. As many other distributed systems, Gluster may scale up horizontally (up to several petabytes) by utilizing commodity hardware3.

3. To scale up horizontally means to enhance the performance/capacity by adding more machines (servers, even PCs) to the pool of system resources. On the other hand, to scale up vertically means to add or upgrade CPUs, RAM, HDDs etc. on a single machine. Horizontal scaling is typically cheaper.


From the architectural point of view, Gluster is quite simple. Its single Volume, the mountable directory from the client's perspective, is a set of Bricks (disks or mount points) which are distributed across multiple nodes (servers). There is no centralized metadata server that would present a single point of failure. Instead, Gluster uses a distributed hash table to determine the location of files based on their names. Clients can access the cluster in several ways, e.g. via the Gluster Native Client over FUSE, NFS v3 or Samba.

Figure 3.2: Gluster architecture [26]

In terms of redundancy, Gluster supports a replicated volume with a configurable number of mirrors and a distributed replicated volume, which is a distributed volume of replicated volumes, thus something like RAID 10. There is also a distributed striped replicated volume type which stripes large files into chunks to increase performance in cases like HPC (High-Performance Computing); however, as stated in the Administrator Guide section of the documentation, it is currently supported only for Map Reduce workloads [27]. Gluster also supports erasure coding as a form of redundancy. If these redundancy options are not enough, there is a possibility to use another storage technology, for example ZFS, as a Gluster brick and delegate the reliability requirement to it. However, in that case the ZFS structure or data should not be altered outside of Gluster. Extending/shrinking of the Gluster storage is more flexible than with ZFS but less flexible than with Btrfs. In the case of a distributed replicated volume, the number of bricks to be added must be a multiple of the number of replicas set for the volume. The same applies to removing bricks, with the additional condition that the removed bricks have to be from the same sub-volume. Converting between volume types is not supported, although, in some specific cases, a workaround exists. The number of bricks which can fail without data loss is determined by the redundancy setting, e.g. with a 3-replica setting, 2 bricks may fail. Besides being a clustered system, Gluster also supports geo-replication for backup purposes. The replication is a Master/Slave, asynchronous, incremental replication of the whole volume. The replication process involves the detection of changed files on the master node and the transfer of the differences between the master's and the slaves' copies of the files to the slave nodes using the rsync command over SSH. The volume type of the slave node does not need to match the volume type of the master node. The failover process, i.e. promoting a slave to a master when the master is down, has to be done manually, together with reconfiguring the clients' connections. The same applies to the failback process, when the former master is up again and needs to be synced and promoted to a master again.

3.2.4 Ceph

Ceph [28] is a distributed storage system which originated from a PhD project in 2003 and was made open-source in 2006. It aims to provide a self-managing, infinitely scaling storage without a single point of failure. It is made to run on Linux. Even though the previously mentioned systems are all popular, under active development and with wide communities, the amount of attention they get does not seem so high when compared with Ceph, which is used, for example, by CERN. CERN is also an active community member. In 2017, CERN performed a scaling experiment called Big Bang III, in which Ceph successfully scaled up to 65 petabytes [29].


Architecture

Ceph is unique also for its architecture. At its base there is RADOS, an intelligent, self-managing object storage written in C++, which is responsible for all the low-level work with the underlying clusters. To provide the application with the ability to use some low-level functionality, there is the LIBRADOS layer, which is represented by a set of libraries in various programming languages, like Java [30], Python etc. The uniqueness of the design shows up at the top layer: Ceph can be used as a distributed file system, a block device and an object storage. Though it is not possible to share data among these three storage types, it is possible to share the RADOS cluster beneath.

Figure 3.3: Ceph architecture [31]

The architecture of RADOS is scalable and without a single point of failure. There are three types of nodes in the cluster:

∙ OSD: an intelligent daemon which stores objects and communicates with other OSDs via a P2P network to take care of the replication, rebalancing, recovery etc.


∙ Monitor: a daemon which knows the topology of the cluster and takes care of the cluster’s state. It provides OSDs and clients with the cluster map.

∙ Metadata: a daemon used only in the case of CephFS, responsible for handling metadata of the stored files.

Ceph Object Storage

All previously mentioned storage technologies were based on file systems with directories and files, as we know them. However, considering the Ceph options, the object storage may be more suitable for an archival storage application. While file systems store files in a hierarchical structure, object storages store objects, which consist of data and metadata, in a flat structure. The only information needed to retrieve an object is its identifier. The object itself may be stored at various locations but, from the application point of view, it is still accessed the same way. Object storages are usually accessible via an HTTP API and may therefore be integrated with an application more easily than a file system. Last but not least, unlike file metadata, object metadata are very flexible and may contain useful information about the transfer, treatment, nature of the object etc. The Ceph Object Storage is accessible through HTTP at Rados Gateway (RGW) daemons which use Civetweb, an embedded web server. There might be multiple RGWs per cluster. RGW uses the existing HTTP APIs of popular object storage projects and maps them to its own backend logic. Therefore, a client application which uses Amazon S3 or OpenStack Swift (the currently supported RGW APIs) may switch to Ceph with minimal changes, like a reconfiguration of the storage endpoint. Data written through the S3-compatible API may be read through the Swift-compatible API and vice versa. Access through the RGW has further advantages over accessing the cluster directly through LIBRADOS. RGW is not just an additional layer translating HTTP requests into LIBRADOS functions. In fact, it handles complex business logic related to the functionality of the S3 and Swift APIs, like the concept of buckets, authentication and the striping of large RGW objects into smaller RADOS objects, and it also handles other features of the Ceph Object Storage, like geo-replication synchronization.
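Because RGW exposes an S3-compatible API, a Java application can talk to the Ceph Object Storage with an ordinary S3 client just by overriding the endpoint. The sketch below uses the AWS SDK for Java (version 1); the endpoint, credentials, bucket and object key are placeholder assumptions.

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class RgwClientExample {

    public static void main(String[] args) {
        // Point the S3 client at the Rados Gateway instead of Amazon.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration("http://rgw.example.org:7480", "default"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                .withPathStyleAccessEnabled(true) // RGW is usually addressed by path, not by virtual host
                .build();

        s3.putObject("aip-bucket", "aip-12345.zip", new File("/tmp/aip-12345.zip"));
        System.out.println(s3.getObjectMetadata("aip-bucket", "aip-12345.zip").getETag());
    }
}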


Features

Ceph also supports asynchronous geo-replication, or, as it is called in Ceph, a multi-site configuration. In Ceph terminology, individual clusters are mapped to zones and these zones form a zone group. Unlike in Gluster, in Ceph it is possible to choose between the typical Master/Slave (Active/Passive) replication, in which only one node is used for write operations, and Active/Active replication, where all nodes which are set as read/write nodes may receive write operations and replicate the data to the rest of the nodes. However, only object writes may be done on all active nodes. Cluster metadata operations, e.g. the creation of a new user, or bucket operations, like the creation of a new bucket, are handled only by the primary node, of which there is only one per zone group. Synchronization tasks are handled on the RGW nodes. Failover/failback from the primary node to a secondary node and vice versa must be done manually. The Active/Active configuration does not require performing this failover immediately, because the other active nodes are still able to write objects. This is a great advantage, as the primary node failure may be only temporary, e.g. because of a power outage [32]. The resilience, flexibility and scalability of Ceph are much higher than in the ZFS, Btrfs or Gluster case. While those, in the case of a disk failure, passively wait for the disk replacement, exposing the system to the risk of data loss due to a failure of another disk, Ceph's self-management ability automatically allocates copies of the failed data and replicates them to hold the desired replica number. Also, due to its ability to replicate, move and rebalance data in a smart way, it does not require the number of OSDs to match any multiple of the number of copies.

3.2.5 Amazon Web Services

Amazon Web Services (AWS) [33] is a cloud computing provider which was launched in 2002. Even though its biggest competitor, Microsoft Azure, is getting more and more popular, AWS is still the clear leader, with a market share of about 60 percent [34]. Some motivation for and implications of using AWS, or cloud services generally, were already revealed in Subsection 3.1.5, referencing also the Cloudy Culture project of the National Library of Scotland, which shows that the usage of a cloud service may be a valid, economical and easy way of fulfilling archival requirements without the need for one's own IT infrastructure [14]. Sticking to the object storage concept, Amazon offers two services: Simple Storage Service (S3) and Glacier. Both services replicate data across multiple geographically distributed facilities in a particular region, run integrity verification and self-healing processes and provide 99.999999999% durability. Features like authentication, auditing or data retrieval policies are supported by both. They differ in cost and access speed, and they generally serve different purposes. Glacier may be accessed directly or through S3 integration.

Amazon Glacier is a cheap storage suitable for archival purposes of infrequently accessed objects. To retrieve data, a retrieval request specifying the id of the object or a creation date filter must be sent, and once the data are retrieved, they are available for downloading for 24 hours. The retrieval price depends on the retrieval speed and the overall number of retrieval requests. There are three different retrieval options:

∙ Expedited: retrieval time 1-5 minutes; for objects of bigger sizes (250 MB+) it may take longer

∙ Standard: retrieval time 3-5 hours

∙ Bulk: retrieval time 5-12 hours

Amazon S3 is a storage providing low latency and real-time, frequent access possibilities. S3 is also more feature-rich than Glacier, supporting user-defined object metadata, various storage policies, reporting etc. In addition to geo-replication within a region, S3 also supports asynchronous geo-replication between different regions, e.g. from the EU to the US. The S3 lifecycle management feature enables setting up policies for the transition of objects between different storage classes, which provide different levels of protection, access speed, price etc. (a configuration sketch follows the class list below).

∙ S3 Standard storage class provides 99.9% availability in the Service Level Agreement (SLA), which is 8h 45m downtime per year.

∙ S3 Standard-Infrequent Access (S3 IA) provides 99% availability in the SLA (3d 15h downtime per year). The latency and throughput stay the same as in the S3 Standard case, but storage access is additionally charged, e.g. $0.01 for each retrieved GB.


∙ S3 One Zone-Infrequent Access (S3 OZIA) provides 99.5% availability. Data retrieval is additionally charged and there is no geo-replication.

∙ S3 Glacier storage class integrates the Glacier storage for archival purposes.
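The storage class is selected per object at upload time, and lifecycle rules can later move objects between classes automatically. The following is a hedged sketch using the AWS SDK for Java (version 1); the bucket name, the aip/ prefix and the 30-day threshold are arbitrary assumptions.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;
import java.io.File;

public class S3LifecycleExample {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Upload an AIP directly into the Standard-Infrequent Access class.
        s3.putObject(new PutObjectRequest("aip-bucket", "aip/aip-12345.zip",
                new File("/tmp/aip-12345.zip"))
                .withStorageClass(StorageClass.StandardInfrequentAccess));

        // Lifecycle rule: transition objects under the "aip/" prefix to Glacier after 30 days.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("archive-aips")
                .withPrefix("aip/")
                .withStatus(BucketLifecycleConfiguration.ENABLED)
                .addTransition(new BucketLifecycleConfiguration.Transition()
                        .withDays(30)
                        .withStorageClass(StorageClass.Glacier));

        s3.setBucketLifecycleConfiguration("aip-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
    }
}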

The following table gives a general overview of AWS prices for the different storage classes. The first column shows the price of a one-time upload of 10 000 objects of 1 GB each, and the second column shows the price of a one-time, one-by-one download of these files. The last column shows the overall archival price, containing these operations and the price of the storage itself. All prices are computed for the EU-Frankfurt region and do not contain additional charges, like the charge for transitions between storage classes.

Table 3.1: Example costs of AWS

                   Put      Retrieval   GB/month   After a year
S3 Standard        $0.054   $0.0043     $0.0245    $2940.0583
S3 IA              $0.1     $100.0043   $0.0135    $1720.1043
S3 OZIA            $0.1     $100.0043   $0.0108    $1296.1043
Glacier Expedited  $0.6     $480        $0.0045    $1020.6
Glacier Standard   $0.6     $120.6      $0.0045    $661.2
Glacier Bulk       $0.6     $30.3       $0.0045    $570.9
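For instance, the S3 Standard total decomposes into the one-time upload and download fees plus twelve months of storing the 10 000 GB: $0.054 + $0.0043 + 10 000 GB × $0.0245 per GB-month × 12 months = $2940.0583; analogously, the Glacier Standard row gives $0.6 + $120.6 + 10 000 GB × $0.0045 × 12 = $661.2 (assuming the storage price is applied to the full volume for the whole year).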

Amazon provides many other services which may be integrated with AWS object storages, for example: ∙ AWS Direct Connect, a dedicated network connection with AWS to speed up transfer.

∙ AWS Snowball, a portable device (21.3 kg) with a usable capacity of 72 TB, which may be used to import/export data to/from AWS. Once the Snowball device arrives, the customer copies data from/to it locally (e.g. over 10 Gb Ethernet) and sends it back to AWS. Export from Glacier is possible only through S3, therefore the fee for the transition from Glacier to S3 is charged.


∙ AWS Snowmobile, a truck carrying a shipping container capable of transporting 100 PB of data. It may only be used to import data into AWS.

∙ Amazon Athena, a service for querying and analyzing data stored in Amazon S3.

3.3 Conclusion

3.3.1 Summary

ZFS and Btrfs are advanced file systems capable of fulfilling the majority of LTP requirements. Compared to the other researched technologies, they are the simplest, which might be a valid argument for using them, at least for a small archive. ZFS provides more resilience at the price of less flexibility. It is considered to be more mature than Btrfs due to its origin in proprietary software. Therefore, it could be a better choice for mission-critical systems. Neither technology is distributed, both contain a single point of failure and cannot scale horizontally. Therefore, they are not ideal for big archives. However, they might be integrated as virtual devices in more complex technologies. Gluster is a horizontally scalable system with enough features to represent the storage layer of an LTP system. Being a distributed file system, Gluster provides an easy way of sharing data among users. This is also true for CephFS and the Amazon Elastic File System, which were not covered in the research. Compared to Ceph, Gluster's advantage is its simplicity. Performance tests from 2014 [35] and 2017 [36] also show that Gluster may outperform Ceph. On the other hand, Ceph has a bigger community and provides file, block and object interfaces to access the same cluster. It also excels at self-management and scaling up and down, and it provides great resilience. In addition to Active/Passive geo-replication, it also supports Active/Active replication and thus may provide higher availability. Ceph is more complex and harder to set up. Luckily, there are great resources provided by Red Hat, for example performance and sizing guides, which are available for both Ceph [37] and Gluster [38].


AWS is the biggest cloud computing provider in the world. There are some disadvantages of commercial cloud solutions, like security concerns or the loss of control over the data, but the ability to use a professional archival solution without the need to maintain one's own IT infrastructure makes them very attractive, especially for smaller organizations. The cost of using a commercial cloud solution may or may not be lower than maintaining one's own storage technology. Amazon Glacier is a very cheap product suitable for archiving data which does not have to be retrieved in real time. Data retrieval may take from 1 minute to 12 hours, where the price depends on the speed requirement. Amazon S3 is a more expensive, yet still affordable product with real-time data access capabilities. S3 offers four storage classes with different pricing, between which objects can be transitioned according to lifecycle policies. For example, one month after creation, an object might be moved from the S3 Standard class to the S3 Glacier class.

3.3.2 Features Comparison

This section compares the technologies which were the subjects of the research. All technologies are meant in the same context in which they were investigated. Specifically, ZFS means OpenZFS combined with ZFS on Linux, Ceph means the Ceph Object Storage and AWS means the combination of Amazon S3 and Amazon Glacier. Table 3.2 contains a general comparison of the storage technologies. If some property is unknown, it is marked as unk. Here is an explanation of some criteria used in the table, which may not exactly match the definition of the same term:

∙ Resilience is the ability to handle failure of multiple devices, together with the ability to quickly recover from these failures.

∙ Scalability is the ability to scale the performance/capacity, up and down, in an economical way, while still fulfilling the resilience requirement.

∙ Flexibility is the ability to change the configuration of the running system in a significant way, e.g. to achieve a different level of data protection.


∙ Complexity, in this context, means the complexity to which an administrator is exposed, rather than the complexity of the architecture. It somewhat correlates with the complexity of the system, but not in the AWS case.

Table 3.3 contains a list of features and the information whether the feature is present for each of the storage technologies. Only native features are considered; for example, the integration of the Virtual Data Optimizer (VDO) with Gluster for deduplication is not Gluster's native deduplication. If some feature is not production-ready, but it seems it soon will be, it is considered as supported. Here is an explanation of some features used in the table:

∙ Authentication & Authorization has to be user-friendly, e.g. based on name and password. File systems’ way of authenticating with SSH key and authorizing via Access Control List (ACL) is not considered to match this feature.

∙ Encryption means encryption of the stored data, not encryption of communication, like TLS.

∙ SPOF stands for Single Point Of Failure, i.e. an entity whose failure results in the failure of the whole system.

∙ Caching stands for the ability to set up a dedicated data pool which is used as a cache/tier to speed up access to frequently accessed data.

∙ Erasure Coding is a technique using Forward Error Correction (FEC) to recover data when some of it is lost. In distributed systems, it is a popular and cost-saving alternative to replication.

∙ Deduplication is a technique used to detect duplicate data and store it only once to save storage capacity.

All of the storages support the following features: geo-replication, self-healing, scrubbing, snapshotting and compression.

Table 3.2: General Comparison

                      ZFS                            Btrfs        Gluster         Ceph            AWS
Category              Advanced FS                    Advanced FS  Distributed FS  Object Storage  Object Storage
Open-source           X                              X            X               X
Enterprise Support                                                X               X               X
Platform              illumos, FreeBSD, Linux, OS X  Linux        Linux           Linux           Web
Resilience            medium                         low          medium          high            high
Scalability           low                            low          medium          high            high
Flexibility           low                            high         medium          high            medium
Complexity            low                            low          medium          high            low
Community             medium                         medium       medium          large           large

Table 3.3: Presence of Particular Features

                                ZFS  Btrfs  Gluster  Ceph  AWS
Authentication & Authorization                       X     X
Encryption                      X           X        X     X
SPOF                            X    X                     unk
Caching                         X           X        X     X
Erasure Coding                              X        X     unk
Deduplication                   X    X                     unk

4 Analysis

From now on, this thesis describes the OAIS Archival Storage within the context of the ARCLib project. In this chapter, the ARCLib project is introduced and a high-level overview of its Archival Storage module is provided. Then the requirements on the ARCLib Archival Storage are listed and grouped by the Archival Storage functions described in Section 2.4. When OAIS standard terminology is used, the first occurrence of a term is set in italics. These terms are explained in Chapter 2.

4.1 The ARCLib Project

ARCLib [4] is a research project led by the Library of the Czech Academy of Sciences (LIBCAS), together with Masaryk University, the National Library of the Czech Republic and the Moravian Library in Brno, which aims to create a complex solution for long-term preservation of (library) digital collections. It is based on the OAIS standard, on the experience and knowledge of professionals from various fields of LTP, and on the needs of Czech institutions responsible for preserving the cultural heritage. The project started in 2015 and is planned to end in 2020.

ARCLib was preceded, among others, by the LTP-pilot project [2], led by the Institute of Computer Science of Masaryk University (ICS MU), which examined the possibilities of the Archivematica LTP system and concluded that it does not support some of the OAIS requirements [3]. The ARCLib system is designed to integrate Archivematica as well as other systems used by Czech institutions, e.g. ProArc, with those systems acting as OAIS Producers/Consumers.

The scope of the ARCLib project goes beyond the development of an LTP system and also covers the development of LTP methodologies. The Methodology for Logical Preservation of Digital Data (MLOG) [39] is related to the protection of the intellectual content of the document, while the second methodology, further referenced as MBIT [40], which should be finished by the end of 2018, deals with the bit-level protection of data. In addition to the OAIS standard, MBIT is also based on the Preservation Storage Criteria document, a more storage-focused document with community contributions which originated from the iPRES 2016 conference [11]. The Archival Storage component is more related to MBIT.

The author of this thesis participates in the refinement of the analysis and in the design and implementation of the ARCLib system from the position of an employee of the inQool company, an industry partner of the Faculty of Informatics at Masaryk University.

4.2 ARCLib Archival Storage

The whole ARCLib system is designed to be modular. The Archival Storage module is completely separated from the rest of the ARCLib system and is accessible through the RESTful API exposed by the Archival Storage Gateway component, which is the only way of accessing the Archival Storage. The RESTful API is a simple object storage API which may also be accessed directly by clients other than ARCLib. ARCLib itself is a dark archive, accessible only by a small subset of individuals from registered organizations. For now, all communication is initiated by clients accessing the Archival Storage.

Figure 4.1: ARCLib Archival Storage

The Archival Storage Gateway is connected to one or more logical storages, for example an instance of a Ceph cluster, which store the data. Every type of logical storage has to implement the Storage Service Interface defined by the Archival Storage Gateway. All logical storages are then accessed uniformly via this interface. It is also possible for an Archival Storage Gateway to implement this interface itself and be used as a logical storage of another Archival Storage Gateway. Configuration of the logical storages, such as incremental backup of the ZFS storage, is handled by the administrator of the particular technology and is out of the Archival Storage Gateway's scope.

The Archival Storage Administration Module contains functions for the management and monitoring of logical storages, and for reporting their errors and states.

To understand the Archival Storage, some basic understanding of how Information Objects are handled in ARCLib is needed. When a Producer uploads a SIP, it is processed by the Ingest module to extract its metadata and generate its PDI. The resulting AIP is a combination of two files: the original SIP and the generated AIP XML which contains the PDI. This AIP is then sent to the Archival Storage. In addition to this standard OAIS process, ARCLib also supports versioning of the AIP XML, i.e. updates of the PDI. Therefore, the Archival Storage operates not only with AIPs, but also with AIP XMLs.

4.3 Archival Storage Requirements

Specification of the requirements on the ARCLib Archival Storage module is not a one-off process. The initial requirements have been adjusted over time and are likely to be adjusted in the future. The requirements described in this section are based on the OAIS standard (2003) and are defined in these documents:

∙ Technical specification of LIBCAS (2016) [41]

∙ MLOG (2017) [39]

∙ MBIT (2018) [40]

In this section, requirements from all of these documents, together with some that were stated at ARCLib meetings, are grouped by the Archival Storage functions described in Section 2.4.

35 4. Analysis

4.3.1 Receive Data

∙ The Storage Service Gateway of the Archival Storage provides a simple RESTful object storage API which is used to store AIPs or, more generally, any objects.

∙ A storage request contains the AIP data, its Universally Unique Identifier (UUID) generated by the client (which is its only unique identifier) and its checksum.

∙ The client receives a storage confirmation message after all connected storages have successfully stored the AIP.

4.3.2 Manage Storage Hierarchy

∙ Based on the outputs of the Negotiate Submission Agreement function of the Administration module, the Archival Storage may choose the logical storages to which the objects of a particular producer are stored. This requirement, as well as others related to non-uniform distribution of data over the logical storages, was later reversed in MBIT, at least for the first ARCLib version.

∙ Objects are stored mainly on online media. The Archival Storage Administration Module supports offline backups only by exposing an API endpoint which returns the IDs of objects changed since a specified timestamp.

∙ The Archival Storage Administration Module contains functions for management and monitoring of logical storages, and reporting of their errors and states.

4.3.3 Replace Media

∙ Objects are stored on at least two logical storages, which may use different storage technologies. The overall redundancy is further increased by the inner redundancy of the physical storages of each logical storage.

∙ In the first version, there are implementations for Ceph, ZFS and an ordinary file system (for testing purposes).


∙ Replacement of a part of a logical storage, e.g. replacement of a disk in a ZFS RAID array, is handled by the logical storage instance itself, not by the Archival Storage.

∙ Replacement of the whole logical storage instance follows these steps:

– A new storage instance is added and marked as write-only.

– All objects are copied to the new storage. Incoming read requests are directed to the other storages; write requests are written to all storages.

– The old storage is removed and the new one is set to the read/write state.

∙ A logical storage may also be removed without replacement, provided this does not break the policy specifying the minimum number of logical storages.

∙ All of these operations must be performed at runtime, without stopping the system or losing data.

4.3.4 Error Checking

∙ After the Archival Storage Gateway receives an object, it computes its checksum and compares it with the provided checksum. If the checksums do not match, for example because of a transfer error, the operation is ended and the object is not stored to any storage.

∙ Each logical storage computes the checksum of a stored object and compares it with the provided checksum. If the checksums do not match, for example because of a transfer error, the operation is ended and a new storage request must be sent.

∙ During an object get request, the checksum of the object retrieved from a logical storage is compared with the checksum which was provided during the store request. If these do not match, the object is retrieved from another storage and the corrupted copy on the first storage is replaced with the copy from the second storage.


∙ The API provides an endpoint for verification of the state of any object on any storage. This involves the same comparison and error handling as in the previous case.

∙ Each logical storage supports features like scrubbing and self-healing (as mentioned in Subsection 3.1.6) and is able to perform efficient routine integrity checks.

4.3.5 Disaster Recovery

∙ Every logical storage contains a complete set of objects, so if one storage fails, data may be recovered from another one.

∙ Logical storages are situated in different geographical locations. Additionally, a logical storage may support geo-replication, so the data of one logical storage may be distributed to different locations.

∙ The system provides a cleanup function which is used in the case of a system failure to delete objects whose transfer was not successfully completed.

4.3.6 Provide Data

∙ The client gets objects through the Archival Storage Gateway API by sending a request with the UUID of the object.

4.3.7 Other Functional Requirements

∙ ARCLib supports two types of versioning which also have to be supported in the Archival Storage:

– Versioning of the whole AIP involves changes in both the Content Information and the PDI. According to the OAIS terminology described in Subsection 2.3.2, this results in a new AIP. The Archival Storage treats the new AIP the same way as all other AIPs.

– Versioning of the AIP XML involves changes only in the PDI. In this case, the ARCLib conception differs from the OAIS standard, because in the ARCLib conception this Transformation does not result in a new AIP. Instead, the new AIP XML is added to the existing AIP, so the AIP is then composed of the SIP, the old AIP XMLs and the new AIP XML.

∙ An AIP may be logically removed, i.e. made inaccessible, without deleting any data.

∙ An AIP may be physically deleted. The deletion applies to the SIP part of the AIP; the AIP XMLs persist.

∙ The API provides authentication and authorization services.

∙ All data accesses and operations are logged.

4.3.8 Other Non-functional Requirements

∙ Uses open-source technologies

∙ Communication between modules over HTTPS

∙ Platform independent, Linux OS preferred

∙ Java (OpenJDK)

∙ PostgreSQL

∙ Horizontal scalability of performance

∙ Modularity

∙ Third-party tools are used through interfaces


5 Design

The first section of this chapter deals with the Archival Storage prototype which was built to reveal hidden requirements on the system. The next section summarizes the main points which were subject to discussion after the prototyping, and is followed by diagrams describing the system design.

5.1 Archival Storage Prototype

Before the actual development of the ARCLib system began, twelve prototypes covering different functionalities of the system had been developed. The purpose of the prototypes was to revise the analysis, to design and implement solutions to particular problems, and to demonstrate and discuss findings in order to refine the analysis. The prototypes were developed as independent modules, each focused on one particular task. All prototypes may be found at the LIBCAS GitHub [42]. The development of the prototypes of the Ingest module is covered in another ARCLib-related thesis [43]. This thesis elaborates on prototype number 10, dedicated to the Archival Storage.

The goal of the Archival Storage prototype was to develop the basic functionality of the Archival Storage, specifically the creation of an AIP, the retrieval of an AIP and versioning. The development of the prototype involved:

∙ Design of the database layer

∙ Definition of the Storage Service Interface to be implemented by logical storages

∙ Implementation of defined interface for local file system

∙ Definition of the Storage Service Gateway API

∙ Implementation of simple service logic interconnecting the database, storage and API layers

All the prototypes were presented and discussed with the ARCLib research team during three meetings in the fall of 2017. The prototypes have fulfilled their purpose, as they revealed new functional requirements, refined some of the old ones and prepared the ground for the development of the system. Although the later development of the Archival Storage started from scratch (which is a recommended practice in software development), its database layer, Storage Service Gateway API and design of AIP states are based on the outputs of the prototype.

5.2 Refining Requirements and Design

Because the ARCLib system has to follow the methodologies developed by the research team of experts in the LTP field, refining the analysis was a subject of team discussions. Discussions related to the functionality covered by the prototypes took place during the prototype presentations; the others were carried out via online collaborative tools or at other analytical meetings. Some of the outputs of the discussions are already reflected in the requirements in Section 4.3. This section summarizes the most important points of the discussions.

5.2.1 Object Metadata

With respect to the AIP XML versioning requirement, the Archival Storage was designed to store the SIP and the XMLs as separate objects which are logically grouped to form an AIP. During an XML update, the new XML version is simply added to the logical group. This grouping of XMLs and SIPs was initially held only in the Archival Storage database. However, even though there will be a periodical database backup process, the system must not risk any loss of such important information as the mapping of XMLs and SIPs. In other words, the system should not rely on the database as the only source of transactional information. This problem was solved by setting up requirements on the Storage Service Interface and all its implementations to store metadata critical for the system in addition to the objects themselves. The metadata contain the creation time of an object, its initial checksum, its current state and, for an AIP XML, its version and SIP ID. In the case of the local file system and an AIP with ID 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa:

∙ The ID of the SIP is the same as the ID of the AIP.


∙ Creation time is stored automatically

∙ The initial checksum is stored in a file whose name is derived from the object ID and the type of the metadata, e.g. 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa.MD5.

∙ The mapping of an XML to its SIP and its version number are contained in the ID of the XML object, which is derived from the AIP ID, e.g. 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1. Other metadata related to this XML object follow the same pattern based on the ID of the object, so the state would be stored as 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1.PROCESSING, etc. A minimal sketch of this naming scheme is shown after this list.
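
The following sketch only illustrates the naming conventions described above; the helper class and method names are hypothetical and do not correspond to the actual ARCLib code.

    // Sketch of the metadata naming scheme used on the local file system storage.
    class MetadataNamingSketch {

        // e.g. "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa" + "MD5"
        //   -> "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa.MD5"
        static String checksumFileName(String objectId, String checksumType) {
            return objectId + "." + checksumType;
        }

        // e.g. AIP ID + version 1 -> "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1"
        static String xmlObjectId(String aipId, int xmlVersion) {
            return aipId + "_xml_" + xmlVersion;
        }

        // e.g. XML object ID + "PROCESSING"
        //   -> "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1.PROCESSING"
        static String stateFileName(String objectId, String state) {
            return objectId + "." + state;
        }
    }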

With this design, there is a duplication of information which may result in inconsistency between the database and the storages' metadata. However, collisions caused by the possible inconsistency will never occur, because the storages' metadata are never used unless the database has been lost. When this happens, the database is recovered from its last backup and the records which were written after the backup had been created are recreated from the storages' metadata.

5.2.2 Storing Objects

The prototype was designed to accept an HTTP store request, synchronously store the object and return the information about success or failure to the client in the HTTP response. However, in the real system, the store request may take some time, because the store process of each logical storage runs in a separate thread which is obtained from a thread pool. If no thread is currently available, processes wait in a queue to be assigned a thread once one is free. Even if all threads are free, the size of the object to be stored may itself cause a long processing time. Consequently, the requirement was altered: the HTTP response containing the ID of the stored object is sent to the client, the object is stored asynchronously, and the user has to send an additional request for the object state to find out whether the object has been stored, or whether the process has failed or is still in progress.

Another revealed requirement, which was not covered by the prototype, is related to the handling of failures of logical storages during the store process. If one storage fails, the process of storing is immediately stopped at all other storages and the rollback process is started, during which the uncompleted data are deleted. The information that the object was rolled back must be available when the client sends the request for the object state.

5.2.3 Logical Storage Failure

When one storage fails, all write requests to the Archival Storage fail and the whole system stays in read-only mode until the failure is resolved. Read requests are handled by the working storages. The system may be transitioned into the read-write state by:

∙ Detaching the storage, supposing this action does not break the limit of the minimum number of attached storages defined in the Archival Storage policies.

∙ Automatic or manual recovery of the failed storage, e.g. when it failed only due to a power outage.

∙ Replacing the storage with a new one. The replacement process is described in Section 4.3.

5.2.4 Authentication and Authorization

The authentication and especially the authorization to the Archival Storage need to be further discussed. In the first version of the ARCLib system, the whole system runs on a private network, and therefore the modules do not need to authenticate against each other. The following are just current design proposals which delegate the authorization to the systems through which the Archival Storage is accessed.

The client authenticates to the Archival Storage with an API key and API secret pair (username/password) to obtain a JWT (JSON Web Token) used for authentication and authorization in subsequent requests.

When the client accesses the Archival Storage through the ARCLib system, its authentication and authorization are handled in the ARCLib Administration module. The Administration module then uses a system account to authenticate against the Archival Storage. The Archival Storage recognizes the whole ARCLib system as a single user with administration permissions.

44 5. Design

When the client accesses the Archival Storage from outside the ARCLib system, e.g. through another system using the Archival Storage Gateway API, the authorization is again handled by the other system. Data of different users are separated in the Archival Storage so that each user can access only their own data.

5.3 Entity Relationship Diagram

The main responsibility of the Archival Storage is to implement the archival processes and handle the permanent data storages. Its database layer is simple, designed to provide the service layer with the needed transactional data.

Figure 5.1: Entity Relationship Diagram

Every database entity has its own UUID. All three types of objects (AIP SIP, AIP XML and general object) share the same subset of attributes, but each of them is stored in a separate table. All enumerations (checksum type, object state, storage type) are defined in the application logic and stored in their textual form in the database.

The Storage table holds data about the logical storages. If any storage has its reachable attribute set to false, the system is in read-only mode and reads are handled by a random storage from those with the highest priority. Different types of storages are distinguished by the type attribute, and if there is some configuration specific to the particular storage type, it is stored as a JSON (JavaScript Object Notation) string in the config attribute.

5.4 Class Diagram

The decomposition of the Archival Storage Gateway module into service classes is described by the following class diagram.

Figure 5.2: Class Diagram

The API controller accepts HTTP requests and handles the transformation of service layer data to/from DTOs (Data Transfer Objects) used for transfer between the system and the client. The Archival Service obtains instances of the Storage Service via the Storage Provider, which is the service responsible for retrieving data about the currently attached logical storages, instantiating Storage Services of the particular types according to the retrieved data, and providing the instances to the Archival Service.

In the case of read operations, the Archival Service directly uses one of the provided Storage Service instances (the one with the highest priority) to get the data from the storage. However, in the case of write operations, it records the beginning of the operation in the database through the Database Service and delegates the operation to the Asynchronous Service. The Asynchronous Service performs the write operation in a separate thread for each storage and records the result of the operation in the database at the end.

5.5 Object State Diagram

Objects may acquire various states during their lifecycle. The following diagram describes the available states together with the operations (separated by a comma) and conditions (in square brackets) needed for the state transitions.

Figure 5.3: Object State Diagram

The left side of the diagram shows the states which an object may acquire as a result of the successful processing of a client's request. An object is in the PROCESSING state if there is some important and time-consuming operation in progress, such as storing or deleting. Once such an operation successfully ends, the state is switched accordingly, e.g. to the ARCHIVED or DELETED state.

The right side of the diagram shows the error states which an object may acquire due to an error during its processing. A rollback is a process fired at the failure of one logical storage during processing, involving the deletion of uncompleted object data from all storages. If the process succeeds, the object is transitioned to the ROLLED_BACK state. However, errors may occur even during the rollback process. In that case, the object is transitioned to the FAILED state. The cleanup process is a bulk rollback process for all objects which are in the PROCESSING or FAILED state. It is used to clean up the storage after a failure of the whole Archival Storage. For example, after a power outage, the cleanup is performed first to delete the uncompleted data which had been in processing when the outage occurred. After the cleanup, the Archival Storage is ready to accept new requests.

In accordance with the requirements, the SIP object may acquire all states, but the XML object cannot acquire the REMOVED and DELETED states. Working with general objects is not a priority for now, so it has not yet been decided how to handle them and which states they may acquire. As pointed out in Subsection 5.2.1, the states are not only stored in the database but also on every logical storage. The only exception to this is the FAILED state, which is stored only in the database.
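
For illustration, the states described above could be captured by an enumeration similar to the following minimal sketch; the names are taken from the text and Figure 5.3, but the actual ARCLib class may differ.

    // Sketch of the object lifecycle states from the object state diagram.
    enum ObjectState {
        PROCESSING,  // a time-consuming operation (store, delete, ...) is in progress
        ARCHIVED,    // the object is successfully stored on all logical storages
        REMOVED,     // logically removed, i.e. made inaccessible, data kept
        DELETED,     // physically deleted (applies to the SIP part, AIP XMLs persist)
        ROLLED_BACK, // uncompleted data were deleted after a failed operation
        FAILED       // even the rollback failed; stored only in the database
    }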

6 Implementation

This chapter contains a description of the implementation of some key components and functions of the first version of the Archival Storage. Technologies used in a particular component of the system are mentioned in the section dedicated to that component. The first section mentions the technologies used in the whole Archival Storage and the following sections describe particular components of the system.

6.1 Technologies

The Archival Storage application is written in the Java programming language and runs on the Java Enterprise Edition 8 platform. It uses the Apache Maven tool for building the source code and managing dependencies on third-party libraries.

One of the very basic requirements on the system is a loose coupling of components. Such components are easy to manage and test. Loose coupling is achieved by the usage of the Spring Framework and its Inversion of Control (IOC) container, which is responsible for managing the lifecycle of reusable objects, called beans, within the application context. The Spring Framework also provides various modules which extend the functionality of existing libraries, are easy to use and reduce boilerplate code.

Spring Boot is a project which follows the convention over configuration paradigm and is used by the Archival Storage to simplify and speed up the configuration of the Spring application and to manage conventional dependencies of application components. All the configuration is done within Java annotations and configuration classes rather than in XML configuration files. Configuration properties are specified in a property file in a human-readable YAML format and are injected into the application at startup.

The Lombok library is used for annotation-based generation of methods. It makes the code clearer by reducing the amount of code that a developer has to maintain. For example, instead of writing a getter method for each attribute of a class, the @Getter class annotation is used to generate the code automatically, without showing it in the IDE (Integrated Development Environment).
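
A minimal illustration of the Lombok usage mentioned above (the entity is hypothetical and is not taken from the ARCLib code base):

    import lombok.Getter;
    import lombok.Setter;

    // Lombok generates getId()/setId() and getChecksum()/setChecksum() at
    // compile time, so the class body contains no boilerplate accessors.
    @Getter
    @Setter
    class ArchivalObjectDto {
        private String id;
        private String checksum;
    }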

6.2 HTTP API

The HTTP API of the Archival Storage runs in an embedded Apache Tomcat web server which is provided by the Spring Boot web starter. With this design, the application is easily portable and does not require the installation and configuration of a separate web server.

API controllers, their endpoints and input/output parameters are registered using Spring Web module annotations. In most cases, the API controller only validates the input parameters, e.g. that the UUID is present and well-formed, optionally transforms the parameters into a DTO and delegates the rest of the logic to the service layer. Java exceptions which are thrown during the processing of HTTP requests are mapped to proper response codes using Spring exception handlers.

In the case of the AIP get method, the API layer requests the AIP data from the service layer and receives references to the files (SIP and XMLs) stored on the remote logical storage. Because the files, mainly the SIP, may be very large, the data are read from the storage by parts and those parts are immediately compressed and transformed into a stream of a single ZIP file, which is sent by parts to the client in an HTTP response. In other words, the Archival Storage works like a pipe transferring the AIP files from a storage to a client and packing them into a single ZIP during the transfer. The connection to the source storage is closed at the end of the transfer.

The Archival Storage Gateway API documentation is published in the form of a web page using the Swagger UI tool. The web page contains the list of all API endpoints with their descriptions. Furthermore, it provides an interactive GUI for calling the API and viewing the responses. The text of the documentation is written in the source Java classes, right above the methods, using the Swagger annotations, from which the content of the Swagger UI is automatically generated.
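
The streaming described above can be illustrated by the following minimal sketch based on the standard java.util.zip API; the class, the entry names and the method signature are assumptions made for the example and differ from the actual ARCLib controller code.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    // Copies the SIP and XML streams into one ZIP written directly to the HTTP
    // response output stream, so the whole AIP is never held in memory at once.
    class AipZipStreamer {

        void streamAip(String aipId, InputStream sip, InputStream xml, OutputStream httpResponse)
                throws IOException {
            try (ZipOutputStream zip = new ZipOutputStream(httpResponse)) {
                copyEntry(zip, aipId + ".zip", sip);        // SIP content
                copyEntry(zip, aipId + "_xml_1.xml", xml);  // AIP XML content
            } // closing the ZIP stream finishes the HTTP response body
        }

        private void copyEntry(ZipOutputStream zip, String name, InputStream in) throws IOException {
            zip.putNextEntry(new ZipEntry(name));
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                zip.write(buffer, 0, read); // read by parts, compress, send by parts
            }
            zip.closeEntry();
            in.close();
        }
    }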

6.3 Service Layer

The design of the service layer is shown in Figure 5.2. This section further describes the two most complex core methods, which are the AIP store request and the AIP get request.


6.3.1 AIP Store Request

The diagram shows the request when two logical storages are attached.

Figure 6.1: AIP Store Request Activity Diagram


The Archival Service first uses the Storage Provider to obtain instances of the Storage Services for the particular logical storages. If one of the storages is unreachable, an error is propagated and the method ends. Then the XML and SIP are read from the HTTP input stream, temporarily stored at the Archival Storage server, and their checksums are verified. If the checksum from the HTTP request matches the checksum computed from the received data, the AIP is stored to the database with the PROCESSING state, its ID is returned to the client in the HTTP response and the asynchronous operations start.

The Asynchronous Service starts parallel store processes, one for every Storage Service. Besides the input stream of the SIP (which is stored in a temporary folder) and the input stream of the XML (which is stored in main memory), the Storage Service store method accepts a thread-safe, atomic boolean variable. The Storage Service first stores the file to a logical storage and then reads it back to verify the checksum after the store process. If an error occurs, either because of a checksum verification failure or because of any other error, the Storage Service sets the boolean variable and ends. The boolean variable is watched by all other Storage Services, and when its change is detected, their store methods immediately end too. Thanks to this, if one storage fails at the very beginning of the process, the others will immediately fail too and do not waste the performance and capacity resources of the logical storages.

If at least one of the Storage Services fails, parallel rollback processes are started. During the rollback, the AIP or its uncompleted parts are deleted from the logical storage. If the rollback of one Storage Service fails, there is no need to detect it immediately and take an action as in the store process case. Once all rollbacks are finished, the state of the AIP is set according to whether the rollback was successful on all storages or not.
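
The cooperative cancellation described above can be sketched as follows; this is a simplified illustration using java.util.concurrent with hypothetical names, not the actual ARCLib code.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.stream.Collectors;

    // Each storage stores the object in its own thread obtained from a pool; the
    // shared atomic flag lets the first failing storage stop the others early.
    class ParallelStoreSketch {

        interface StorageService {
            void store(String objectId, AtomicBoolean rollback) throws Exception;
            void rollback(String objectId);
        }

        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        void storeOnAll(List<StorageService> storages, String objectId) {
            AtomicBoolean rollback = new AtomicBoolean(false);
            List<CompletableFuture<Void>> stores = storages.stream()
                    .map(s -> CompletableFuture.runAsync(() -> {
                        try {
                            s.store(objectId, rollback); // implementation checks the flag and aborts if set
                        } catch (Exception e) {
                            rollback.set(true);          // signal the other store threads to stop
                        }
                    }, pool))
                    .collect(Collectors.toList());
            CompletableFuture.allOf(stores.toArray(new CompletableFuture[0])).join();

            if (rollback.get()) {
                // at least one storage failed: delete the uncompleted data everywhere
                storages.forEach(s -> s.rollback(objectId));
            }
        }
    }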


6.3.2 AIP Get Request

The diagram below shows the get request when two logical storages are attached.

Figure 6.2: AIP Get Request Activity Diagram

The Archival Service first verifies the AIP state and throws an exception if the state is not ARCHIVED. Then it obtains the Storage Services of the reachable storages and tries to get the AIP from the storage with the highest priority. After the AIP is retrieved, its SIP is stored to a temporary folder and its XMLs to main memory. The checksums of all objects are computed and compared with those which have been stored in the database since the objects' creation. If the checksums match, the AIP is returned and the process ends.

If the checksum verification fails, the AIP is retrieved from the other storage and the checksums are compared again. If the retrieval of the AIP or the verification of the checksums fails again and there is no other storage to use, the error is propagated to the API layer and the process ends. If the AIP retrieved from the second storage is valid and the AIP from the first storage was not valid because it had been missing or corrupted, then the valid copy of the AIP from the second storage is used to repair the invalid copy on the first storage, and the AIP is then returned to the client. If the reason for the failure of the first storage is different from the two mentioned above, e.g. when the first storage becomes unreachable during the transfer, then the copy from the second storage is not copied to the first storage.
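
A minimal sketch of this retrieval-with-fallback logic follows, with the checksum computed via the standard java.security.MessageDigest API. The interfaces and names are hypothetical, and the real ARCLib flow (e.g. the distinction between corrupted, missing and unreachable copies) is more elaborate.

    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.util.List;

    // Tries the storages in priority order, verifies the checksum of the
    // retrieved copy and repairs a corrupted copy from a valid one.
    class AipRetrievalSketch {

        interface StorageService {
            InputStream get(String objectId) throws Exception;
            void store(String objectId, InputStream data) throws Exception;
        }

        InputStream getVerified(List<StorageService> byPriority, String objectId, String expectedMd5)
                throws Exception {
            StorageService corrupted = null;
            for (StorageService storage : byPriority) {
                InputStream data;
                try {
                    data = storage.get(objectId);
                } catch (Exception unreachable) {
                    continue; // e.g. storage unreachable: try the next one, no repair attempted
                }
                if (md5Hex(data).equalsIgnoreCase(expectedMd5)) {
                    if (corrupted != null) {
                        corrupted.store(objectId, storage.get(objectId)); // repair the invalid copy
                    }
                    return storage.get(objectId); // re-open the verified copy for the client
                }
                corrupted = storage; // checksum mismatch: remember the storage for repair
            }
            throw new IllegalStateException("No valid copy of object " + objectId + " found");
        }

        private String md5Hex(InputStream in) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
            in.close();
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }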

6.4 Database Layer

The Archival Storage uses the popular Hibernate Object/Relational Mapping (ORM) framework, an implementation of the Java Persistence API (JPA), for access to the PostgreSQL database. The design of the database layer is based on the experience of the inQool company, gained through years of development of information systems.

To unify and simplify the access to the data and to reduce boilerplate code, all database entities extend the generic Domain Object entity and all repository classes extend the generic Domain Store repository class, which contains the basic CRUD (Create, Read, Update, Delete) operations. With this design, a developer has to implement only those methods which are specific to a particular entity.

For specific queries to the database, the Querydsl framework with its JPA module is used. Querydsl is an easy-to-learn framework for writing type-safe database queries. The type-safety significantly decreases the probability of a programmer's mistake and also allows the programmer to write queries much faster with intelligent code completion. The type-safety is achieved by using special query classes which are generated from the source entity classes and which map their attributes to textual strings representing the attributes in the database.

To compare Querydsl with JPQL (Java Persistence Query Language): in JPQL the developer has to write the whole query string, very similar to native SQL, for example:

    entityManager.createQuery(
        "select aipXml from AipXml as aipXml "
        + "where aipXml.version = 1 "
        + "and aipXml.sipId = '43450d72-bb46-4879-aa9e-0bf0dfbcc3df'");

while in Querydsl the developer just writes standard Java code:

    QAipXml aipXml = QAipXml.aipXml;
    queryFactory.from(aipXml)
        .where(aipXml.version.eq(1)
            .and(aipXml.sipId.eq("43450d72-bb46-4879-aa9e-0bf0dfbcc3df")));

To create, update and manage the database schema, the Liquibase tool is used. Liquibase allows the developer to quickly create the database schema, refactor it, track all the changes and roll back to a previous state if needed. All these operations are written in change sets of an XML change log file and are independent of the underlying database, which makes it possible to change the database provider and keep the defined database schema.

6.5 Storage Service

In the first version of the Archival Storage, there are two implementations of the Storage Service, one for ZFS and one for Ceph. All the methods of the Storage Service interface are documented.

6.5.1 ZFS Storage Service

From the application point of view, ZFS acts just like a common file system and its features, such as replication, are transparent to the client. As an implementation of the Storage Service for a common file system is one of the requirements, the ZFS Storage Service was designed in such a way that it does not differ from the implementation for the common file system, and the code is shared. The only difference is that the ZFS Storage Service implements all methods of the Storage Service, whereas the FS Storage Service does not implement the methods which require the usage of commands specific to the underlying platform. For example, the method to get the storage state involves getting storage space information. On the Linux platform the df command would be used, but that is a command specific to the Linux platform which does not exist on Windows.

The Archival Storage supports access to the local file system or a Network File System (NFS) through the standard Java API. It also supports access to a remote file system over SFTP (SSH File Transfer Protocol) using the SSHJ library. The decision whether to use the local/NFS adapter or the SFTP adapter is made by the Storage Provider. If the host attribute of the logical storage database entity is set to "localhost", then the local/NFS adapter is used; otherwise, the host attribute contains the IP address of the remote storage server and the SFTP adapter is used. If the SFTP adapter is used, SSH public key authentication is used to authenticate the Archival Storage.

Archival objects are stored in a hierarchical folder structure, where the object location is derived from its ID. The folder structure is designed to ensure that the data are spread uniformly across the folders. For example, the object with ID 43450d72-bb46-4879-aa9e-0bf0dfbcc3df is located at the path 43/45/0d/43450d72-bb46-4879-aa9e-0bf0dfbcc3df. The metadata of the objects are stored using the additional files and naming conventions described in Subsection 5.2.1.
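
A minimal sketch of this path derivation, assuming that the first three character pairs of the ID are used as folder names, as in the example above; the class name is hypothetical and the actual ARCLib code may differ.

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Derives a path such as 43/45/0d/43450d72-bb46-4879-aa9e-0bf0dfbcc3df
    // from the object ID, spreading objects uniformly across the folder tree.
    class ObjectPathResolver {

        Path resolve(String storageRoot, String objectId) {
            return Paths.get(storageRoot,
                    objectId.substring(0, 2),
                    objectId.substring(2, 4),
                    objectId.substring(4, 6),
                    objectId);
        }
    }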

6.5.2 Ceph Storage Service

The Ceph object storage may be accessed either through the RGW over the S3/Swift compatible API, or through the lower level LIBRADOS library. In the first version of ARCLib, the S3 API was chosen because of the popularity of Amazon S3 and because the RGW is optimized for typical, simple object storage needs, whereas LIBRADOS is a low-level library which requires knowledge of Ceph's internal functionality. However, it is possible that the Ceph Storage Service will use LIBRADOS in the future to fulfill some new requirements which cannot be fulfilled through the RGW.

The development of the Ceph S3 Storage Service is in some cases simpler than, for example, the development of the ZFS service over SFTP, because the standard Java S3 API is well defined, easy to use and created exactly for object storage needs. However, there are some compatibility problems between the S3 API and the Ceph cluster. For example, the multipart upload, which is used in the object store method, does not work with the AWS signature v4 authentication algorithm and, because of that, the signature v2 algorithm has to be used. Finding the root cause of such issues may be a challenging task, because the AWS error messages do not incorporate any Ceph-related information. For example, the error message for the mentioned problem only says that the signature does not match.

The possibility to store user-defined object metadata is a typical feature of object storages and is easily achieved with the Ceph S3 Storage Service. All the metadata are stored within a Java Map with String keys and String values. However, the problem with the S3 API is that in order to update metadata, the whole object must be uploaded again. Therefore, the Archival Storage stores two Ceph objects for each archival object: one object contains the actual data to be stored, and the other is an empty object carrying all the metadata of the first one. This workaround would not be needed if the Swift API was used, because the Swift API supports updates of the object metadata.
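
A minimal sketch of the two-object approach using the AWS SDK for Java (v1) S3 client; the bucket name, the ".meta" key suffix and the method signature are assumptions made for the example and differ from the actual ARCLib service.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.util.Map;

    // Stores the object data under its ID and the user-defined metadata as a
    // separate, empty companion object, so the metadata can be rewritten later
    // without re-uploading the (potentially large) data object.
    class CephS3StoreSketch {

        void store(AmazonS3 s3, String bucket, String objectId, InputStream data,
                   long dataLength, Map<String, String> userMetadata) {
            // 1) the data object itself
            ObjectMetadata dataMeta = new ObjectMetadata();
            dataMeta.setContentLength(dataLength);
            s3.putObject(bucket, objectId, data, dataMeta);

            // 2) the empty companion object carrying the user-defined metadata
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(0);
            userMetadata.forEach(meta::addUserMetadata);
            s3.putObject(bucket, objectId + ".meta", new ByteArrayInputStream(new byte[0]), meta);
        }
    }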

6.6 Testing

The automated tests, which are written to verify the system behavior in standard and non-standard situations, can be divided into four categories according to which layer of the application is tested. All unit tests use the in-memory HSQLDB database. The integration Spring Boot tests use the in-memory H2 Database. To simulate specific behavior of objects and methods, the popular Mockito framework is used.

Unit tests of the database layer verify that the Querydsl queries written in particular repository classes retrieve the correct data from the database.

The tests of the service layer are unit tests verifying mainly the Archival Service and its interaction with the Asynchronous Service. Because of the active Archival Storage development, these tests are subject to continual refactoring, as the service layer changes very often, old features are replaced with new ones, etc. In the current active development phase, it may happen that some tests are outdated or missing. However, the main functionality is properly tested, and the risk of an unrevealed error is further lowered by the API integration tests, which use the storage layer.

The Storage Service Gateway API integration tests are end-to-end Spring Boot tests which use the Spring web test module to simulate the client's requests. The only parts of the system which are mocked in these tests are the instances of the Storage Service. While the service layer solves how to process the data, the API layer relates more to the definition of the inputs and outputs of the Archival Storage. These definitions originate from the basic requirements on the system and do not change so frequently. Therefore, these tests verify the functionality of all API methods and thus of the whole system.

The last category are unit tests of the Storage Service implementations. Because the Storage Service instances are mocked in the integration tests, these tests are the only automated verification of the Storage Service implementations and are given special attention. There are test classes for the Ceph, ZFS local and ZFS remote Storage Services which implement the Storage Service test interface, which defines 25 storage test methods testing all methods of the Storage Service interface in standard and non-standard situations. The ZFS Storage Service tests run on ZFS installed on a virtual machine on a private server. The ZFS pool consists of four file VDEVs and forms a group similar to RAID 10, where two of the VDEVs are mirrors of the other two. The Ceph Storage Service tests run on a cluster of virtual machines with one monitor and three OSD nodes. The cluster was installed using the ceph-deploy tool and is set up with the default configuration, i.e. with three replicas for every object.

The user testing of the Archival Storage integrated with the other ARCLib modules will start in the summer of 2018. The testing will be done on real-world data and on the real infrastructure. Together with the user tests, stress tests will be performed to verify the performance and scalability of the system.

7 Conclusion

The main objectives of this thesis were the analysis of the OAIS standard, the research of storage technologies suitable for use in the OAIS Archival Storage module and, finally, the design and implementation of the first version of the Archival Storage which provides the main required functionality.

The chapter about the OAIS standard is focused on the Archival Storage, but it also describes the basic OAIS concepts required to understand the interactions of the Archival Storage with the rest of the system. To overcome the abstraction of the OAIS, real-world examples, often related to the ARCLib system, are provided. The following research of five different storage technologies ends with a textual summary pointing out which technology may be suitable for which type of archive, and also with two comparison tables. The terms used in the tables are defined to prevent ambiguity of their meaning.

The requirements on the first version of the Archival Storage originate from the base requirement, which is that the Archival Storage has to be ready to be integrated with the rest of the ARCLib system, whose modules are also in their first versions, supporting the main functions. The activity diagrams from Section 6.3 show how the Archival Storage implements the get and store functions fulfilling the main functional requirements of its first version. The methods are provided through the HTTP API used by the other ARCLib modules, and support for the Ceph and ZFS storages is implemented. As described in Subsection 4.3.7, the versioning of the whole AIP is out of the scope of the Archival Storage, and the versioning of the AIP XML is implemented and follows the rules described in Subsection 5.2.1. All these methods are also covered by 130 automated tests. The requirements on the first version of the Archival Storage are therefore fulfilled.

The author of this thesis designed and implemented the whole Archival Storage prototype, designed the whole Archival Storage and implemented the vast majority of its functionality. He also set up the whole testing environment, including one ZFS virtual machine and four virtual machines for the Ceph cluster. In addition to the Archival Storage, the author also designed and implemented many important components of the rest of the ARCLib system, for example the ARCLib Lightweight Directory Access Protocol authentication, authorization, the OAIS Finding Aids or the handling of incidents occurring in the Ingest module. The author was also present at all seven meetings organized by LIBCAS, where he presented the design and implementation outputs to the whole ARCLib project team and discussed the problem domain with experts in the LTP field to refine the analysis and requirements. Thanks to all of this, the author now has a deep, overall knowledge of the whole ARCLib system and a good general knowledge of the whole ARCLib project.

The Archival Storage, as well as the rest of the ARCLib system, are, at the time of writing, still in very active development. In addition to the functions required for the first version of the Archival Storage, support for changing the AIP state was added, so that an AIP can now acquire all the states defined in the object state diagram in Figure 5.3. Also, the basic API endpoints for managing the logical storages and the endpoint for retrieving the AIP state have recently been implemented. In the near future, authentication and authorization, reporting and auditing are going to be implemented.

The first version of the whole system will be tested from the summer of 2018. The development will continue, and in the fall of 2019 the system will be deployed at LIBCAS and verified in the form of partial operation. The ARCLib project is planned to end at the end of 2020.

A Source Code, Setup and Documentation References

∙ The source code of the prototype number 10 dedicated to the Archival Storage is available in the LIBCAS Git ARCLib-Prototypes repository: https://github.com/LIBCAS/ARCLib-Prototypes.

∙ The source code of the first version of the Archival Storage is available in the LIBCAS Git ARCLib-Archival-Storage repository: https://github.com/LIBCAS/ARCLib-Archival-Storage.

∙ The ID of the commit of the Archival Storage first version is: 9ff1caec3d8727271e7d50aa2e87fd3968a54b85

∙ The setup and run instructions for the Archival Storage module are described in the README.md file located in the Archival Storage repository.

∙ The Storage Service Gateway API documentation is in the api- doc.html file located in the Archival Storage repository.


Bibliography

[1] Jan Hutař and Marek Melichar. "The Long Decade of Digital Preservation in Heritage Institutions in the Czech Republic: 2002–2014". In: International Journal of Digital Curation 10.1 (Mar. 2015). issn: 1746-8256. url: http://www.ijdc.net/article/view/10.1.173 (visited on 05/17/2018).
[2] LTP-pilot website. url: http://ltp-portal.mzk.cz/ltp-pilot (visited on 04/14/2018).
[3] Andrea Miranda. Archivematica z hlediska normy OAIS. Report. CESNET, Oct. 2015. url: https://drive.google.com/file/d/0BzOLuOh094X8S1hPUWstZWIySTA/view (visited on 04/14/2018).
[4] Official ARCLib website. url: https://arclib.cz/ (visited on 04/14/2018).
[5] Brian Lavoie. "The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition)". In: DPC Technology Watch Report (Oct. 2014). issn: 2048-7916. url: https://www.dpconline.org/docs/technology-watch-reports/1359-dpctw14-02/file (visited on 03/09/2018).
[6] The Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (OAIS). Recommended practice. CCSDS 650.0-M-2. CCSDS Secretariat, June 2012. url: https://public.ccsds.org/Pubs/650x0m2.pdf (visited on 03/11/2018).
[7] Official CCSDS public website. url: https://public.ccsds.org (visited on 03/08/2018).
[8] Uwe Borghoff, Peter Rödig, Jan Scheffczyk, and Lothar Schmitz. Long-Term Preservation of Digital Documents. Principles and Practices. Springer-Verlag, June 2012. isbn: 3-540-33639-7.
[9] Official METS website. url: https://www.loc.gov/standards/mets/METSOverview.v2.html (visited on 03/24/2018).
[10] Erin O'Meara and Kate Stratton. Digital Preservation Essentials. Ed. by Christopher J. Prom. With an intro. by Kyle R. Rimkus. Trends in Archives Practice. Society of American Archivists, 2016. isbn: 1-931666-95-4.
[11] Preservation Storage Criteria, Version 2. url: https://goo.gl/1Q9vDe (visited on 04/21/2018).


[12] Preservation and Archiving Special Interest Group discussion. url: http://mail.asis.org/pipermail/pasig-discuss/2017-March/subject.html#471 (visited on 04/21/2018).
[13] Mary Baker et al. "A fresh look at the reliability of long-term digital storage". In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006. Association for Computing Machinery, Apr. 2006. isbn: 1-59593-322-0. url: https://lockss.org/locksswiki/files/Eurosys2006.pdf (visited on 04/21/2018).
[14] Lee Hibberd. Cloudy Culture: Preserving digital culture in the cloud. Part 4: Costs and tools. Report. The National Library of Scotland, Nov. 2017. url: https://www.dpconline.org/blog/cloudy-culture-part-4 (visited on 04/30/2018).
[15] Selena Larson. Pentagon exposed some of its data on Amazon server. Nov. 2017. url: http://money.cnn.com/2017/11/17/technology/centcom-data-exposed/index.html (visited on 04/22/2018).
[16] Karen Turner. Hacked Dropbox login data of 68 million users is now for sale on the dark Web. Sept. 2016. url: https://www.washingtonpost.com/news/the-switch/wp/2016/09/07/hacked-dropbox-data-of-68-million-users-is-now-or-sale-on-the-dark-web/?noredirect=on&utm_term=.074e5213a15d (visited on 04/22/2018).
[17] Official OpenZFS website. url: http://open-zfs.org (visited on 04/22/2018).
[18] Official ZFS on Linux website. url: http://zfsonlinux.org (visited on 04/22/2018).
[19] Dustin Kirkland. ZFS Licensing and Linux. Feb. 2016. url: https://insights.ubuntu.com/2016/02/18/zfs-licensing-and-linux (visited on 04/23/2018).
[20] OpenZFS documentation. url: https://pthree.org/category/zfs (visited on 04/22/2018).
[21] Sanoid's Github Repository. url: https://github.com/jimsalterjrs/sanoid (visited on 05/06/2018).
[22] Jim Salter. rsync.net: ZFS Replication to the cloud is finally here—and it's fast. Dec. 2015. url: https://arstechnica.com/information-technology/2015/12/rsync-net-zfs-replication-to-the-cloud-is-finally-here-and-its-fast (visited on 05/06/2018).


[23] Official Btrfs website. url: https://btrfs.wiki.kernel.org (visited on 04/22/2018).
[24] Michal Růžička. Btrfs vs ZFS – srovnání pro a proti. May 2016. url: http://www.abclinuxu.cz/blog/Drobnosti/2016/2/btrfs-vs-zfs-srovnani-pro-a-proti (visited on 04/27/2018).
[25] Official GlusterFS website. url: https://www.gluster.org (visited on 04/28/2018).
[26] Karthik Shiraly. Distributed File Systems and Object Stores on Linode (Part 1). GlusterFS. Feb. 2017. url: https://medium.com/linode-cube/distributed-file-systems-and-object-stores-on-linode-5f635178aad7 (visited on 04/29/2018).
[27] GlusterFS documentation. url: https://docs.gluster.org (visited on 04/28/2018).
[28] Official Ceph website. url: https://ceph.com (visited on 04/29/2018).
[29] Sage Weil. New in Luminous: Improved Scalability. Sept. 2017. url: https://ceph.com/community/new-luminous-scalability (visited on 04/29/2018).
[30] Java Librados library. url: https://github.com/ceph/rados-java (visited on 04/29/2018).
[31] Ceph documentation. url: http://docs.ceph.com/docs/master (visited on 04/29/2018).
[32] Walter Graf. Understanding a Multi Site Ceph Gateway Installation. White paper. Red Hat, Jan. 2017. url: https://ceph.com/wp-content/uploads/2017/01/Understanding-a-Multi-Site-Ceph-Gateway-Installation-170119.pdf (visited on 04/30/2018).
[33] Official Amazon Web Services website. url: https://aws.amazon.com (visited on 05/01/2018).
[34] Jordan Novet. Amazon lost cloud market share to Microsoft in the fourth quarter: KeyBanc. Jan. 2018. url: https://www.cnbc.com/2018/01/12/amazon-lost-cloud-market-share-to-microsoft-in-the-fourth-quarter-keybanc.html (visited on 04/30/2018).
[35] Giacinto Donvito, Giovanni Marzulli, and Domenico Diacono. "Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis". In: Journal of Physics: Conference Series 513.4 (2014). url: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014 (visited on 05/02/2018).
[36] Loïc M. Roch, Tyanko Aleksiev, Riccardo Murri, and Kim K. Baldridge. "Performance analysis of open-source distributed file systems for practical large-scale molecular ab initio, density functional theory, and GW+BSE calculations". In: International Journal of Quantum Chemistry 118.1 (Jan. 2018). issn: 1097-461X. url: https://doi.org/10.1002/qua.25392 (visited on 05/02/2018).
[37] Red Hat. Red Hat Ceph Storage: Scalable object storage on QCT servers. Performance and sizing guide. Guide. June 2017. url: https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf (visited on 05/02/2018).
[38] Red Hat. Red Hat Gluster Storage on QCT servers. Performance and sizing guide. Guide. Aug. 2016. url: https://www.redhat.com/cms/managed-files/st-RHGS-QCT-config-size-guide-technology-detail-INC0436676-201608-en.pdf (visited on 05/02/2018).
[39] Jan Hutař, Andrea Miranda, Eliška Pavlásková, Zdeněk Vašek, and Zdeněk Hruška. Metodika logické ochrany digitálních dat. Certified methodology. Library of the Czech Academy of Sciences, 2018. url: http://hdl.handle.net/11104/0282107 (visited on 05/06/2018).
[40] Michal Růžička et al. "Metodika bitové ochrany". Working Version. May 2018.
[41] Library of the Czech Academy of Sciences. ARCLib – podrobnější technická specifikace předmětu plnění – funkční požadavky. 2016. url: https://www.tenderarena.cz/profil/zakazka/detailVerzeDokumentu.jsf?id=82214&idDokumentu=756005&idVerze=806283 (visited on 05/06/2018).
[42] ARCLib-Prototypes Git repository of LIBCAS. url: https://github.com/LIBCAS/ARCLib-Prototypes (visited on 05/08/2018).
[43] Šimon Hochla. "Implementation of Open Archival Information System". Master's Thesis. Faculty of Informatics at Masaryk University, Dec. 2017.
