Masaryk University Faculty of Informatics

Design and implementation of Archival Storage component of OAIS Reference Model

Master’s Thesis

Jan Tomášek

Brno, Spring 2018


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jan Tomášek

Advisor: doc. RNDr. Tomáš Pitner Ph.D.


Acknowledgement

I would like to express my gratitude to my supervisor, doc. RNDr. Tomáš Pitner, Ph.D., for valuable advice concerning the formal aspects of my thesis, readiness to help and unprecedented responsiveness. Next, I would like to thank RNDr. Miroslav Bartošek, CSc. for providing me with long-term preservation literature, RNDr. Michal Růžička, Ph.D. for managing the very helpful internal documents of the ARCLib project, and the whole ARCLib team for the great collaboration and willingness to meet up and discuss questions that emerged during the implementation. I would also like to thank my colleagues from the inQool company for providing me with this opportunity, for help in solving practical issues, and for the time flexibility needed to finish this work. Last but not least, I would like to thank my closest family and friends for all the moral support.

Abstract

This thesis deals with the development of the Archival Storage module of the Reference Model for an Open Archival Information System. The theoretical part analyzes the basic concepts of the OAIS Reference Model and its Archival Storage component and then researches current storage technologies suitable for use in the Archival Storage. The practical part deals with the analysis, design and implementation of the Archival Storage and its integration with the Ceph and ZFS storage technologies. The developed first version of the Archival Storage is ready to be integrated with other OAIS modules to create the first functional version of the whole system, which will be tested from the summer of 2018. The result of this thesis is one of the first steps towards the implementation of a complex, open-source, OAIS-compliant solution for the long-term preservation of (library) digital collections which may be used by organizations of any size.

Keywords

OAIS, ISO 14721, archival storage, archival system, LTP, long-term preservation, digital preservation, cultural heritage, Java, Ceph, ZFS


Contents

1 Introduction ...... 1
2 OAIS ...... 3
2.1 The OAIS Standard ...... 3
2.2 Functional Model ...... 4
2.2.1 OAIS Environment ...... 5
2.2.2 Information Objects ...... 6
2.2.3 Functional Entities ...... 7
2.3 Data Migration ...... 9
2.3.1 Refreshment and Replication ...... 9
2.3.2 Repackaging and Transformation ...... 10
2.4 Archival Storage in Detail ...... 11
2.4.1 Receive Data ...... 12
2.4.2 Manage Storage Hierarchy ...... 12
2.4.3 Replace Media ...... 13
2.4.4 Error Checking ...... 13
2.4.5 Disaster Recovery ...... 13
2.4.6 Provide Data ...... 13
3 Software Storage Technologies for LTP ...... 15
3.1 Features of Software Storage Technologies ...... 15
3.1.1 Underlying Hardware ...... 15
3.1.2 Capacity and Speed ...... 15
3.1.3 Open-Source ...... 16
3.1.4 Redundancy ...... 16
3.1.5 Geo-Replication ...... 17
3.1.6 Other Features ...... 18
3.2 Research of Up-to-date Technologies ...... 18
3.2.1 ZFS ...... 19
3.2.2 Btrfs ...... 21
3.2.3 Gluster ...... 21
3.2.4 Ceph ...... 23
3.2.5 Amazon Web Services ...... 26
3.3 Conclusion ...... 29
3.3.1 Summary ...... 29
3.3.2 Features Comparison ...... 30
4 Analysis ...... 33

4.1 The ARCLib Project ...... 33
4.2 ARCLib Archival Storage ...... 34
4.3 Archival Storage Requirements ...... 35
4.3.1 Receive Data ...... 36
4.3.2 Manage Storage Hierarchy ...... 36
4.3.3 Replace Media ...... 36
4.3.4 Error Checking ...... 37
4.3.5 Disaster Recovery ...... 38
4.3.6 Provide Data ...... 38
4.3.7 Other Functional Requirements ...... 38
4.3.8 Other Non-functional Requirements ...... 39
5 Design ...... 41
5.1 Archival Storage Prototype ...... 41
5.2 Refining Requirements and Design ...... 42
5.2.1 Object Metadata ...... 42
5.2.2 Storing Objects ...... 43
5.2.3 Logical Storage Failure ...... 44
5.2.4 Authentication and Authorization ...... 44
5.3 Entity Relationship Diagram ...... 45
5.4 Class Diagram ...... 46
5.5 Object State Diagram ...... 47
6 Implementation ...... 49
6.1 Technologies ...... 49
6.2 HTTP API ...... 50
6.3 Service Layer ...... 50
6.3.1 AIP Store Request ...... 51
6.3.2 AIP Get Request ...... 53
6.4 Database Layer ...... 54
6.5 Storage Service ...... 55
6.5.1 ZFS Storage Service ...... 55
6.5.2 Ceph Storage Service ...... 56
6.6 Testing ...... 57
7 Conclusion ...... 59
A Source Code, Setup and Documentation References ...... 61

List of Figures

2.1 OAIS Functional Model [6, p. 4-1] 5
2.2 OAIS Archival Storage functional entity [6, p. 4-8] 12
3.1 ZFS Architecture [20] 19
3.2 Gluster architecture [26] 22
3.3 Ceph architecture [31] 24
4.1 ARCLib Archival Storage 34
5.1 Entity Relationship Diagram 45
5.2 Class Diagram 46
5.3 Object State Diagram 47
6.1 AIP Store Request Activity Diagram 51
6.2 AIP Get Request Activity Diagram 53


1 Introduction

The digital age provides the institutions responsible for the preservation of cultural heritage with means which can contribute greatly to the fulfillment of their mission. The topic of the long-term preservation of digital documents (LTP) is addressed in the Reference Model for an Open Archival Information System, known as the OAIS standard, a generally recognized standard which has become the starting point for most of the actual LTP systems. In the Czech Republic, the digitization efforts were advanced in 2002, when the flood destroyed many historical collections [1], which could have been saved if the documents had been digitized, replicated and distributed to distinct geographical locations. One of the best known open-source LTP systems, Archivematica, was tested in the LTP-pilot project, led by the Institute of Computer Science of Masaryk University in 2014-2015 [2]. The research shows that while Archivematica is a powerful LTP system, it does not fulfill all OAIS requirements [3]. Using the outputs of the LTP-pilot and other research projects, the ARCLib project, led by the Library of the Czech Academy of Sciences, together with Masaryk University, the National Library of the Czech Republic and the Moravian Library in Brno, has originated with the main goal of creating a complex, open-source and OAIS-compliant LTP solution, integrating Archivematica and other systems and standards used by Czech organizations [4]. The aim of this thesis is to analyze the OAIS standard and its Archival Storage module and then perform research and a comparison of current storage technologies suitable for use in the Archival Storage module. Based on the acquired knowledge, the first version of the Archival Storage module within the ARCLib context is designed and implemented. The first version of the Archival Storage is ready to be integrated with the rest of the ARCLib system through its HTTP API and provides the main functionality, which is storing data to the Archival Storage, retrieving data from the Archival Storage and versioning. The first version also contains the integration of two different storage technologies.


2 OAIS

This chapter describes the OAIS standard. The introductory section is followed by a description of the OAIS functional model, which is needed to understand the OAIS data lifecycle. As this thesis is dedicated to a single component of the OAIS, responsible for the archival data storage, the third section describes the OAIS data migration concept and is followed by the last section, which further elaborates the Archival Storage functional entity of the OAIS standard.

2.1 The OAIS Standard

The Reference Model for an Open Archival Information System (OAIS standard) is a standard developed by the Consultative Committee for Space Data Systems (CCSDS) on request of the ISO (International Organization for Standardization). The first version of the CCSDS standard, designated as Recommended Standard, was released in 1999, accepted as an ISO standard in 2002 and published as such a year later (ISO 14721:2003). In accordance with both organizations' policies, the standards went through a revision process, and after revisions, which were modest with a few exceptions [5, p. 6], the new version of the CCSDS standard, designated as Recommended Practice [6], was published in 2012 together with a new version of the ISO standard (ISO 14721:2012). Both CCSDS standards are available for free at the official public CCSDS websites [7]. Even though the development of the OAIS standard originated in the field of space agencies, it is not tied to the space domain. Being a reference model, the OAIS standard is a conceptual framework with a high level of abstraction and flexibility. It introduces terminology and specifies fundamental requirements, entities, relations and processes of an archival system, but it does not impose specific requirements on the actual implementation or the technologies used to achieve the requirements. It is therefore applicable to any digital archive, mainly those with the need of long-term preservation (LTP), where long-term means long enough to overcome technological changes as well as changes of the user community [6, p. 1-1].


The OAIS standard was developed in open public forums (hence Open Archival Information System) in which any interested party could participate. Besides the advantage of taking shared knowledge into account during the development, this approach also made it possible for OAIS to be integrated and popularized years before its official publication date. For example, the Deposit System for Electronic Publications (DSEP) of the Networked European Deposit Library (NEDLIB) is an OAIS-compliant archive which originated from the OAIS standard even six months before its first official version [8, p. 27]. The OAIS standard has become a starting point not just for OAIS-compliant archives but also for the development of new methodologies, standards, concepts, technologies and formats originating from the requirements specified in the OAIS standard. Now, after twenty years, the OAIS standard is still considered the lingua franca of LTP.

2.2 Functional Model

The OAIS Functional Model is the key to understanding how the individual functional entities of the OAIS interact with each other. The diagram also shows how users interact with the OAIS and how Information Packages (IP) are transferred and transformed. As this thesis further elaborates only one component of the OAIS (Archival Storage), the following description is simplified to a level appropriate for understanding the context and the terminology used later. Examples used to overcome the abstraction of OAIS are mostly from a real OAIS-compliant system which is introduced in Chapter 4. The terms OAIS and system are used interchangeably further in this chapter, if not specified otherwise (e.g. LTP system). When using the OAIS standard terminology, the first time a term occurs it is expressed in italics. Definitions of these terms may be found in the OAIS standard [6]. The diagram below shows six functional entities (modules) within the system boundaries, four information objects and the interaction with the OAIS environment. Arrows represent a unidirectional communication, whereas lines represent a bidirectional communication. Some lines are dashed for clarity reasons.


Figure 2.1: OAIS Functional Model [6, p. 4-1]

2.2.1 OAIS Environment

There are three different roles to be played in the OAIS environment. A role can be played either by an individual or by an organization. In the case of the Producer and the Consumer, it can also be played by another cooperating system.

Producer

At first, the Producer has to establish a Submission Agreement with the OAIS. Such an agreement may contain the specification of the provided data, the process of extracting metadata, as well as payment and authentication requisites. After the agreement is negotiated, the Producer provides data to the system in Data Submission Sessions.

Consumer

To get data from the system, the Consumer has to establish an Order Agreement, which can be either ad-hoc or event-based (e.g. for periodical exports). The Consumer can also use the Finding Aids of the system to search for data which correspond to a particular search query. The Order Agreement in this case can be as simple as completing a form in the Graphical User Interface (GUI) of the system [6, p. 2-11].

Management

Management administers the archive policies and defines the scope of the archive. It may generally do any typical management activity, such as processing measurement results, constructing risk analyses, solving conflicts etc.

2.2.2 Information Objects

The ability to understand or render the archived data in the long term is underpinned by storing additional data, the Representation Information, together with the data which are the subject of archiving. This data couple is defined as the Content Information. For example, when archiving data objects gathered from a web service, their meaning and format as well as the service used could be part of the Representation Information. Besides the Content Information, an Information Package also contains Preservation Description Information (PDI), which includes information required for the archive to fulfill its requirements. An example of a PDI part is the fixity information (checksums) used for periodic checks of the consistency of the data, or the access rights for the data. The last part of the Information Package is the Packaging Information, which links the Content Information and the PDI together into a single logical unit that can be identified, located and managed [5, p. 18]. The Packaging Information itself does not have to be stored in the system. If all files of the Information Package are packed into a single ZIP file, then this fact can be considered the Packaging Information.

Submission Information Package

The Submission Information Package (SIP) is the package that is supplied to the system by the Producer. Its content and format specifications are usually part of the Submission Agreement.


Archival Information Package

The Archival Information Package (AIP) is an SIP transformed within the OAIS before being stored to the media. This is the key Information Package that is actually archived within the OAIS, and therefore it has to contain all the additional data required by the system to work according to the specification. One SIP can result in multiple AIPs and vice versa. The relations can be even more complex.

Dissemination Information Package

The Dissemination Information Package (DIP) is a package returned to the Consumer as a result of the Order Agreement. The DIP can be equal to the AIP, but it can also be different, as the AIP can be transformed within the OAIS for retrieval purposes. It can also be a collection of AIPs.

Descriptive Information

The Descriptive Information contains metadata which are the source of the system's Finding Aids used by the Consumer to find the data of interest. The Descriptive Information can be provided by a Producer within a SIP and enriched by the OAIS process, resulting in the Descriptive Information of the AIP, which may contain additional metadata.

2.2.3 Functional Entities

Ingest

The Ingest process receives the SIP and transforms it into an AIP. The preparation can be a long list of processes containing, for example, SIP format validation, checksum validation, virus scanning, format detection and metadata extraction. The resulting AIP is then stored to the Archival Storage and its metadata to the Data Management module.

Archival Storage

The basic responsibility of the Archival Storage is to receive an AIP from the Ingest, store it to the permanent storage and make it available for a Consumer via the Access module. This module is the subject of this thesis and will be further explained in more detail.

Data Management

The Data Management receives AIP metadata from the Ingest and manages it in a way suitable for the Consumer's search queries delegated through the Access module. Indexing software may be an example of a suitable technology. It also manages other administrative data, e.g. reports about the archiving processes or the state of the Archival Storage media.

Access

The Access module is an interface for a Consumer to search within the metadata in the Data Management and then retrieve the requested data (AIP) from the Archival Storage and transform it into a DIP. It may also handle authentication, authorization and other processes related to access to the system.

Preservation Planning

The purpose of the Preservation Planning is to ensure that the archived information remains representable and understandable over the long term. It has to watch and analyze the evolution of technology as well as the evolution of the Consumers, and create plans for migration/emulation as solutions to information obsolescence. It therefore tends to be a research and monitoring activity done by people rather than an automated process. The Preservation Planning module is an example of community contributions to the OAIS standard, as it was not included in the first OAIS draft of May 1999 [8, p. 26].

Administration

The Administration module is a communication hub for all other modules and external entities of the system [5, p. 13]. Its responsibility is the monitoring of the system, its configuration and, generally, the coordination of the other entities.


For example, when a Producer negotiates the Submission Agreement, this happens via the Administration module, which may later provide the information to the Ingest module to verify the format of the SIP. Another example could be monitoring the state of the Archival Storage and providing the results to the Data Management, which may then make them accessible to the Consumer via the Access module, assuming that the Consumer has a role specified by the policies provided by the Management.

2.3 Data Migration

Before going into the details of the Archival Storage, it is convenient to have some background about the data migration concept. Data migration can refer to a process of transferring data from one media to another without changing its format. It can, however, also refer to the process of transforming data from one format to another. Likewise, the OAIS standard divides migration types into two categories: those which change bit sequences and those which do not. Here is a brief overview of the types of migration defined in the OAIS, together with examples and the motivation for their usage.

2.3.1 Refreshment and Replication

Refreshment and Replication fall into the first category. In both cases, no changes are made to the Information Package. In the case of Refreshment, AIPs are transferred to a medium of the same type in a sufficiently exact form, so that the old medium can be replaced with the new one without any further changes. Example: copying all bits from an old error-prone disk to a new one. During Replication, AIPs are transferred to media of the same or a new type and their location may change. The change of the location can cause a necessary change of the Archival Storage mapping infrastructure, which is responsible for mapping the AIP ID to its Packaging Information. Example: replicating AIPs to media of the same type, a file system with a differently named root, could cause an update of the Archival Storage mapping to match the new name.


Both approaches are widely used to keep the system robust. Migrating from an old media to a newer one, as well as creating a Redundant Array of Independent Disks (RAID), e.g. RAID 5, of replicas, prevents data loss caused by media decay. Another reason for this type of migration is to increase the performance and capacity of the system by migrating to new media types, or again by using RAID variants designed for this purpose. Considering the evolution of software and the limited support for old hardware, compatibility is another key reason for this type of migration.

2.3.2 Repackaging and Transformation

The Repackaging process involves changes in the Packaging Information but not in the Content Information or PDI. It therefore belongs to the second category. Following the example of Packaging Information in Subsection 2.2.2, the repackaging may be just a repackaging from the ZIP file format to the TAR file format. It can, however, be a more complex process. For example, the data of the Information Package do not have to be at one location and their Packaging Information may be represented by a Metadata Encoding and Transmission Standard (METS) file [9]. In this case, a change of the location of some data may require changes in the METS file. Transformation is a migration which involves changes in the Content Information or in the PDI. A typical use-case is transforming an obsolete, unsupported or specific file format to a new or a general one. This can be achieved in an automated way by using a converter; however, in many cases the transformation has to be done manually, either because the converter does not exist, or because it may cause undue information loss or misinterpretation. Transformation can be reversible, e.g. changing the type of fixity information in the PDI from MD5 to SHA-256, or non-reversible, e.g. converting an MS Word document to a PDF or using OCR software to transform a scanned, handwritten document into a sequence of characters. The former example of non-reversible transformation points to the problem of authenticity versus data efficiency, which is often much more complicated. The OAIS standard recognizes three types of the resulting AIP:


∙ AIP version, if the transformation was done for preservation purposes, keeping the information as close as possible to the original

∙ AIP edition, if the transformation was done to improve the original, for example with some additional data

∙ derived AIP, if the transformation was done by extracting data from an existing AIP or aggregating data of multiple AIPs

In all cases, the resulting AIP has a new ID and its PDI is updated to register the transformation and to identify the original AIP. Due to the gradual obsolescence of data formats, it is almost certain that Transformation will take place in every LTP system. However, the amount of transformations needed can be reduced by a wisely defined policy of the archive. Such a policy may, for example, restrict the incoming SIPs to contain only files of widely supported standardized formats like PDF.

2.4 Archival Storage in Detail

The diagram below shows the major data flows of the Archival Storage functions with relations to other modules of the system. It also shows that there can be multiple media used to store the data and that those media can be physically separated. Example media could be a database, a file system or some distributed storage system. The media can be composed of multiple devices, for example to form a RAID.


Figure 2.2: OAIS Archival Storage functional entity [6, p. 4-8]

2.4.1 Receive Data

The Archival Storage receives a Storage Request with an AIP from the Ingest and stores it to the media. When the transfer is done, a Storage Confirmation message including the AIP ID is sent back to the Ingest.

2.4.2 Manage Storage Hierarchy

The Archival Storage can use different media for different data. One of the responsibilities of this function is to place the data on the appropriate media. The mapping can be defined in the policies of the Administration module, together with the Storage Request, or by operational statistics (e.g. free space on the disk). Another responsibility is to monitor the error logs raised by the Error Checking function when an AIP is damaged, and to report the operational statistics of the Archival Storage to the Administration module.

2.4.3 Replace Media

The Replace Media function handles the reproduction of AIPs. Using the migration terminology, Refreshment, Replication and simple Repackaging may be performed completely within this function. Complex Repackagings and Transformations are, however, also subjects of the Archival Information Update function of the Administration module.

2.4.4 Error Checking

Error Checking ensures that no AIP was corrupted anywhere within the Archival Storage. This requires checking the integrity of AIPs at the SW/HW layer and, if an AIP is corrupted, writing this information to the error log, which is monitored by the storage hierarchy management. It should also provide support for verifying the integrity of a particular AIP upon request.
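The fixity verification described above amounts to recomputing a cryptographic digest of the stored AIP and comparing it with the checksum recorded in the PDI. The following is only a minimal Java sketch of such a check (the class and method names and the choice of SHA-256 are illustrative assumptions, not the OAIS specification); a mismatch would be written to the error log monitored by the storage hierarchy management.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixityChecker {

    /** Recomputes the SHA-256 digest of the stored AIP and compares it
     *  with the checksum recorded in the PDI (hex-encoded). */
    public boolean verify(Path aipFile, String expectedSha256Hex)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(aipFile)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest()).equalsIgnoreCase(expectedSha256Hex);
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}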

2.4.5 Disaster Recovery

The Disaster Recovery function provides a mechanism for duplicating the data in the permanent storage to other media, possibly in another location. The disaster recovery policy should be specified in the Administration module. The duplication process may simply involve copying the archive content to a remote storage over the network.

2.4.6 Provide Data

This function receives an AIP request with an AIP ID, looks up the requested AIP in the permanent storage and transfers it to the calling Access module. When the AIP transfer is done, the Access module is notified.
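Summarizing the six functions in programming terms, the externally visible behaviour of the Archival Storage can be reduced to a handful of operations. The interface below is a hypothetical Java illustration of this chapter only, not the OAIS specification and not the ARCLib design described later; all names are assumptions.

import java.io.InputStream;

/** Hypothetical, simplified view of the OAIS Archival Storage functions. */
public interface ArchivalStorage {

    /** Receive Data: stores the AIP and returns its ID for the Storage Confirmation. */
    String store(InputStream aip, String checksum);

    /** Provide Data: looks up the AIP by its ID and returns its content. */
    InputStream get(String aipId);

    /** Error Checking: verifies the fixity of a particular AIP upon request. */
    boolean verifyIntegrity(String aipId);

    /** Manage Storage Hierarchy: reports operational statistics (e.g. free space,
     *  error log entries) to the Administration module. */
    String getStateReport();
}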


3 Software Storage Technologies for LTP

Storage technology and methodology are fundamental to any LTP system. Yet, they are often underestimated and do not get the attention they deserve, especially from people lacking LTP knowledge or IT knowledge generally [10]. This chapter does not describe the methodology for physically storing data and providing bit-level protection; it rather summarizes important features of software storage technologies related to LTP system requirements. It then follows with research of some up-to-date storage technologies and evaluates them against the specified criteria.

3.1 Features of Software Storage Technologies

This section outlines some important features to be considered when choosing the right storage technology for an LTP system. Those which are general, required or affect every storage technology are briefly commented on. Those which are more specific to each technology are only outlined and may be described in Section 3.2.

3.1.1 Underlying Hardware

Disks, mainly HDDs, are used as an online, primary storage for the LTP system to read and write data. Nowadays, due to technological progress and reduced prices, disks are also often used for backups. SSDs may be used for caching or tiering purposes, or for storing frequently accessed data, like the journal and metadata. Magnetic tapes are used as a backup storage. They are ideally kept off-site, i.e. at a different geographical place. They are cheaper and offer higher durability than disks; however, their performance is much worse and their usage may require a significant initial investment into tape drives or even a robotic tape library.

3.1.2 Capacity and Speed

While some LTP systems are meant to be frequently accessed public repositories with lots of daily reads and writes, others can be dark archives1 used for batch ingests of archival, not frequently accessed data. The decision which storage technology to use depends on the nature of the system. It is also common to use more than one technology to achieve both speed and capacity, with redundancy and independence of a specific technology as other significant advantages. Capacity requirements vary a lot among organizations. Where one organization expects a 10 TB annual data increase, another one expects 100 TB [10]. The ongoing increase of produced digital data and the changing requirements on which and how much data to archive require intensive and continuous planning to choose the right storage technologies. Scalability, mainly the horizontal scalability provided by distributed systems, is one of the most widely accepted answers to the unstoppable data increase.

1. Archives with restricted access, typically for a subset of individuals of an organization

3.1.3 Open-Source

Open-source projects are flexible, extensible, free and may benefit from a wide community and shared knowledge. The possibility to integrate open-source software is required by communities [11] and it is one of the main priorities of organizations like, for example, the Rockefeller Archive Center [10, p. 49]. Other requirements usually tied with open-source technologies are a wide community, adoption in big projects, sponsorship, documentation and ongoing development and support. These aspects simply indicate the usability of the technology. However, the human labor needed to integrate and maintain open-source technologies has to be taken into account every time.

3.1.4 Redundancy

In order to prevent data loss caused by media decay or failure, storage technologies have to support redundancy, i.e. write additional data which can be used to reconstruct the original data when a failure occurs. Redundancy may be achieved, for example, by creating a copy of the data, i.e. replicating it, or by striping data and using parity information in a RAID 5 array. Nowadays, with the growth of digital data, erasure coding is also being used as an alternative to traditional RAIDs and replicas. It may lower the cost by using less storage and also provide more resilience but, on the other hand, it adds complexity and I/O latency [12]. Redundancy itself can prevent data loss caused by a disk failure, but it does not prevent data loss caused by correlated faults, i.e. "when one fault causes others or when one error causes multiple faults" [13]. An example of the former case could be a human error, a mistake, which results in data deletion on the disk and all its replicas. Regular backups are the solution to these cases. An example of the latter case may be a natural disaster destroying a whole data center with all replicas and backups. The solution to this is a disaster recovery plan which contains the distribution of data copies to multiple geographically separated locations, a geo-replication.
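The parity idea behind RAID 5 (and, in a generalized form, behind erasure coding) can be illustrated with a toy example: the parity block is the bytewise XOR of the data blocks, and any single lost block can be recomputed from the surviving blocks and the parity. This is only a didactic Java sketch, not how a real RAID controller or erasure-coding library is implemented.

public class XorParityDemo {

    /** Computes the parity block as the bytewise XOR of all blocks (all equally long). */
    static byte[] parity(byte[][] blocks) {
        byte[] p = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= block[i];
            }
        }
        return p;
    }

    public static void main(String[] args) {
        byte[][] data = {"AIP-p".getBytes(), "art-1".getBytes(), "of-3.".getBytes()};
        byte[] parity = parity(data);

        // Simulate the loss of the second block: XOR-ing the surviving blocks
        // with the parity block reconstructs it.
        byte[] recovered = parity(new byte[][] {data[0], data[2], parity});
        System.out.println(new String(recovered)); // prints "art-1"
    }
}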

3.1.5 Geo-Replication

Possibly the least costly way to distribute data is by repeating the process of backup creation and manually transferring the backups to a different location to keep them off-site. In addition to the low cost, this approach may also be preferred for its simplicity and total data control. However, if a disaster happens, all data written in the time window since the last backup procedure would be lost. In many LTP systems, any data loss is unacceptable. Thus, while the manual transfer of backups off-site may still be useful, it is necessary to integrate some distributed storage technology. As a decision whether to pay for a whole commercial LTP product or to integrate and maintain free open-source LTP technologies has to be made, the same decision has to be made for the storage technology. Especially in the case of distributed storage systems, which tend to be complex and require IT infrastructure maintenance, a cloud-based commercial solution may be a better choice. The National Library of Scotland and the Edinburgh Parallel Computing Centre were researching cloud possibilities for LTP in a project called Cloudy Culture. The estimated price of the cloud storage was in the end roughly half of the price of using local storage [14]. A disadvantage of such cloud storage is that further processing of the submitted data is not completely under our control. Security concerns about cloud storage are well founded by known incidents, like the human error on the Pentagon's side resulting in the exposure of the data to any Amazon cloud storage user [15]. Moreover, the incident of the hacked Dropbox [16] confirms the vulnerability of cloud storage providers, where the frequency of attacks is likely to grow together with the popularity of cloud storage. Also, an LTP system should not be locked in by any service provider, so there should always be an exit strategy.

3.1.6 Other Features

Last but not least, there are other features to be considered when choosing the right storage technology, for example:

∙ Extensibility/shrinkage of the storage

∙ Ability to add/replace/remove storage devices to/from a running storage

∙ Bulk data exports used in exit strategies

∙ Self-healing and scrubbing2

∙ Authentication, authorization and auditing

∙ Level of complexity and configurability

For an LTP system it is not a must to have a one-fits-all storage technology. In fact, it may be better to integrate more technologies with different features to achieve the best results.

3.2 Research of Up-to-date Technologies

In this section, five storage technologies of four different kinds are described. From local advanced file systems, through more complex distributed storage systems, to the biggest cloud services provider, this chapter tries to summarize the important aspects of each technology together with its background.

2. Self-healing is an automatic process performed to detect and fix data corruption. In this context, scrubbing usually refers to performing the self-healing process for the whole storage section.


When some feature occurs for the first time, it is briefly commented on to underline its importance. However, the feature does not have to be explicitly mentioned in the subsections dedicated to other technologies, even though it might be present. Each subsection rather describes the specifics of the particular technology than follows some feature-matching pattern. There are also a lot of back references. Therefore, it is better to read this section as a whole. See Section 3.3 for a brief outline and a features comparison.

3.2.1 ZFS

ZFS is a file system and volume manager used in the Solaris OS, which became open-source in 2005. In 2010 it became proprietary again, but the development of its open-source version, OpenZFS, continues [17]. Integration with various platforms is the subject of specific projects, like the open-source ZFS on Linux project [18] (OpenZFS combined with ZFS on Linux is further referenced as ZFS). Due to licensing problems, ZFS may not be contained in the Linux kernel and was added as a kernel module in the Ubuntu distribution just two years ago (2016) [19]. At the top level of each ZFS pool, data are striped across the attached virtual devices (VDEVs) like disks, RAIDs or files (useful for testing). There is also a possibility to provide a dedicated VDEV, typically an SSD, for caching, tiering and logging purposes. Each ZFS dataset is mapped to a directory, and multiple datasets may share the same pool. Each dataset may be configured with pool space quotas etc.

Figure 3.1: ZFS Architecture [20]


Compared to a traditional file system like Ext4, ZFS can store a huge amount of data. Theoretically, a single file may be 16 EiB large, which is 16x more than the maximum size of a whole Ext4 volume. The maximum size of a ZFS pool is 256 quadrillion ZiB. The mentioned ZFS RAIDs (called RAID-Z) are non-standard, enhanced implementations of RAID providing advanced data resilience. The ability to distribute data over these RAID-Zs attached as VDEVs in the pool makes it possible to handle many more disk failures, because disk failures in one VDEV do not affect the resilience of the other VDEVs. For example, as a single RAID-Z3 may handle 3 disk failures without data loss, a pool of 4 RAID-Z3 VDEVs can handle up to 12 disk failures. Moreover, ZFS supports the creation of multiple mirrors for any VDEV of the pool. The disadvantage of the ZFS design is that once the VDEV RAID is created, it is neither extensible nor shrinkable by adding/removing a device. At the higher level, as mentioned, attaching a VDEV to a pool is possible; however, its removal is not supported: once the VDEV is added to a pool, it can't be removed. Therefore, setting up a ZFS instance may be a hard task requiring proper analysis with the right predictions of the storage needs. RAID improves resilience but does not detect silent errors called bit rot, or data decay. To minimize the possibility of data corruption, for example due to a hardware error, ZFS manages checksums of data blocks and uses these checksums when reading data to verify data integrity. If corruption is detected and there is some redundancy level set, ZFS either reads the block from the mirrored device or reconstructs it using parity data and then fixes the bad block. This can also be done for the whole pool on request. ZFS also supports the copy-on-write technique and snapshotting, which can be used for incremental storage backups. Snapshots may be created for the whole pool or a particular dataset. Geo-replication may be achieved using the ZFS send/receive commands to transfer these snapshots from one server to another. To automate this backup process, a third-party tool, for example Syncoid [21], may be used. Compared to Linux's rsync command, which may also be used for incremental backups, ZFS send/receive is much faster [22].
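From the application's point of view, a ZFS dataset is simply a mounted directory, so an archival application can store and retrieve objects through the standard file API while the checksumming, redundancy and snapshotting described above happen below that level. A minimal Java sketch follows; the class name, the mount path and the write-then-atomic-move pattern are illustrative assumptions, not the ARCLib implementation.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FileSystemObjectStore {

    private final Path mountPoint; // e.g. the mount point of a ZFS dataset

    public FileSystemObjectStore(Path mountPoint) {
        this.mountPoint = mountPoint;
    }

    /** Writes the object to a temporary file first and then moves it atomically,
     *  so that a partially written object is never visible under its final name. */
    public void store(String objectId, InputStream content) throws IOException {
        Path tmp = mountPoint.resolve(objectId + ".tmp");
        Path target = mountPoint.resolve(objectId);
        Files.copy(content, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public InputStream retrieve(String objectId) throws IOException {
        return Files.newInputStream(mountPoint.resolve(objectId));
    }
}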


3.2.2 Btrfs

The biggest open-source ZFS competitor in the Linux world is Btrfs [23]. Btrfs was introduced in 2007, with the first stable version in 2014. It is under frequent development supported by Oracle. Unlike ZFS, it is a file system made specially for Linux and may be distributed directly within the Linux kernel. An interesting feature comparison of these two file systems can be read at the Czech Linux portal [24]. In many cases, the Btrfs design is more flexible [24]. An important difference is that, unlike ZFS, Btrfs supports shrinkage of the pool as well as its extension by adding or removing disks to/from the RAID. The supported RAID types are RAID 0, RAID 1, RAID 10, RAID 5 and RAID 6. Moreover, it is possible to easily switch from one RAID type to another. On the other hand, Btrfs supports only a single RAID, whereas ZFS RAIDs may be grouped as VDEVs in a pool. Btrfs is therefore able to handle only 2 disk failures without data loss. Btrfs also provides very useful features like defragmentation and rebalancing (redistributing data across all devices). These are missing in ZFS and a workaround involves copying the data from one dataset to another and back.

3.2.3 Gluster

Even though ZFS and Btrfs are featured file systems, they are not designed to run in a cluster or in a distributed environment. Therefore, they do not meet the increasingly important requirement for horizontal scalability. Gluster [25] is an open-source, scalable, distributed file system released to the community in 2007. Gluster was bought by Red Hat in 2011 and became a part of the Red Hat Storage Server product. In 2014, after Red Hat had bought Inktank, the company behind the Ceph storage, Red Hat Storage Server was renamed to Red Hat Gluster Storage. As many other distributed systems, Gluster may scale up horizontally (up to several petabytes) by utilizing commodity hardware3.

3. To scale up horizontally means to enhance the performance/capacity by adding more machines (servers, even PCs) to the pool of system resources. On the other hand, to scale up vertically means to add or upgrade CPUs, RAM, HDDs etc. on a single machine. Horizontal scaling is typically cheaper.


From the architectural point of view, Gluster is quite simple. Its single Volume, the mountable directory from the client's perspective, is a set of Bricks (disks or mount points) which are distributed across multiple nodes (servers). There is no centralized metadata server that would present a single point of failure. Instead, Gluster uses a distributed hash table to determine the location of files based on their names. Clients can access the cluster in several ways, e.g. via the Gluster Native Client over FUSE, NFS v3 or Samba.

Figure 3.2: Gluster architecture [26]

In terms of redundancy, Gluster supports a replicated volume with a configurable number of mirrors and a distributed replicated volume, which is a distributed volume of replicated volumes, thus something like RAID 10. There is also a distributed striped replicated volume type which stripes large files into chunks to increase performance in cases like HPC (High-Performance Computing); however, as stated in the Administrator Guide section of the documentation, it is currently supported only for Map Reduce workloads [27]. Gluster also supports erasure coding as a form of redundancy. If these redundancy options are not enough, there is a possibility to use another storage technology, for example ZFS, as a Gluster brick and delegate the reliability requirement to it. However, in that case the ZFS structure or data should not be altered outside of Gluster. Extending/shrinking of the Gluster storage is more flexible than with ZFS but less flexible than with Btrfs. In the case of a distributed replicated volume, the number of bricks to be added must be a multiple of the number of replicas set for the volume. The same applies to removing bricks, with the additional condition that the removed bricks have to be from the same sub-volume. Converting between volume types is not supported, although, in some specific cases, a workaround exists. The number of bricks which can fail without data loss is determined by the redundancy setting, e.g. with a 3-replica setting, 2 bricks may fail. Besides being a clustered system, Gluster also supports geo-replication for backup purposes. The replication is a Master/Slave, asynchronous, incremental replication of the whole volume. The replication process involves the detection of changed files on the master node and the transfer of the differences between the master's and the slaves' copies of the files to the slave nodes using the rsync command over SSH. The volume type of the slave node does not need to match the volume type of the master node. The failover process, i.e. promoting a slave to a master when the master is down, has to be done manually, together with reconfiguring the clients' connections. The same applies to the failback process, when the former master is up again and needs to be synced and promoted to a master again.

3.2.4 Ceph

Ceph [28] is a distributed storage system which originated from a PhD project in 2003 and was made open-source in 2006. It aims to provide a self-managing, infinitely scaling storage without a single point of failure. It is made to run on Linux. Even though the previously mentioned systems are all popular, under active development and with wide communities, the amount of attention they get does not seem so high when compared with Ceph, which is used, for example, by CERN. CERN is also an active community member. In 2017, CERN performed a scaling experiment called Big Bang III, in which Ceph successfully scaled up to 65 petabytes [29].


Architecture

Ceph is unique also for its architecture. At its base there is RADOS, an intelligent, self-managing object storage written in C++, which is responsible for all the low-level work with the underlying clusters. To provide the application with the ability to use some low-level functionality, there is the LIBRADOS layer, which is represented by a set of libraries in various programming languages, like Java [30], Python etc. The uniqueness of the design shows up at the top layer: Ceph can be used as a distributed file system, a block device and an object storage. Though it is not possible to share data among these three storage types, it is possible to share the RADOS cluster beneath.

Figure 3.3: Ceph architecture [31]

The architecture of RADOS is scalable and without a single point of failure. There are three types of nodes in the cluster:

∙ OSD: an intelligent daemon which stores objects and communicates with other OSDs via a P2P network to take care of the replication, rebalancing, recovery etc.


∙ Monitor: a daemon which knows the topology of the cluster and takes care of the cluster’s state. It provides OSDs and clients with the cluster map.

∙ Metadata: a daemon used only in the case of CephFS, responsible for handling metadata of the stored files.

Ceph Object Storage

All previously mentioned storage technologies were based on file systems with directories and files, as we know them. However, considering the Ceph options, the object storage may be more suitable for an archival storage application. While file systems store files in a hierarchical structure, object storages store objects, which consist of data and metadata, in a flat structure. The only information needed to retrieve an object is its identifier. The object itself may be stored at various locations but, from the application point of view, it is still accessed the same way. Object storages are usually accessible via an HTTP API and may therefore be integrated with an application more easily than a file system. Last but not least, unlike file metadata, object metadata are very flexible and may contain useful information about the transfer, treatment, nature of the object etc. The Ceph Object Storage is accessible through HTTP at Rados Gateway (RGW) daemons which use Civetweb, an embedded web server. There might be multiple RGWs per cluster. RGW uses the existing HTTP APIs of popular object storage projects and maps them to its own backend logic. Therefore, a client application which uses Amazon S3 or OpenStack Swift (the currently supported RGW APIs) may switch to Ceph with minimal changes, like a reconfiguration of the storage endpoint. Data written through the S3-compatible API may be read through the Swift-compatible API and vice versa. Access through the RGW has further advantages over accessing the cluster directly through LIBRADOS. RGW is not just an additional layer translating HTTP requests into LIBRADOS functions. In fact, it handles complex business logic related to the functionality of the S3 and Swift APIs, like the concept of buckets, authentication and the striping of large RGW objects into smaller RADOS objects, and it also handles other features of the Ceph Object Storage, like geo-replication synchronization.
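Because RGW exposes an S3-compatible API, a Java application can talk to the Ceph Object Storage with an ordinary S3 client just by overriding the endpoint. The sketch below uses the AWS SDK for Java (version 1); the endpoint, credentials, bucket and object key are placeholder assumptions.

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class RgwClientExample {

    public static void main(String[] args) {
        // Point the S3 client at the Rados Gateway instead of Amazon.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration("http://rgw.example.org:7480", "default"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                .withPathStyleAccessEnabled(true) // RGW is usually addressed by path, not by virtual host
                .build();

        s3.putObject("aip-bucket", "aip-12345.zip", new File("/tmp/aip-12345.zip"));
        System.out.println(s3.getObjectMetadata("aip-bucket", "aip-12345.zip").getETag());
    }
}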


Features

Ceph also supports asynchronous geo-replication, or, as it is called in Ceph, a multi-site configuration. In Ceph terminology, individual clusters are mapped to zones and these zones form a zone group. Unlike in Gluster, in Ceph it is possible to choose between the typical Master/Slave (Active/Passive) replication, in which only one node is used for write operations, and Active/Active replication, where all nodes which are set as read/write nodes may receive write operations and replicate the data to the rest of the nodes. However, only object writes may be done on all active nodes. Cluster metadata operations, e.g. the creation of a new user, or bucket operations, like the creation of a new bucket, are handled only by the primary node, of which there is only one per zone group. Synchronization tasks are handled on the RGW nodes. Failover/failback from the primary node to a secondary node and vice versa must be done manually. The Active/Active configuration does not require performing this failover immediately, because the other active nodes are still able to write objects. This is a great advantage, as the primary node failure may be only temporary, e.g. because of a power outage [32]. The resilience, flexibility and scalability of Ceph are much higher than in the ZFS, Btrfs or Gluster case. While those, in the case of a disk failure, passively wait for the disk replacement, exposing the system to the risk of data loss due to a failure of another disk, Ceph's self-management ability automatically allocates copies of the failed data and replicates them to hold the desired replica number. Also, due to its ability to replicate, move and rebalance data in a smart way, it does not require the number of OSDs to match any multiple of the number of copies.

3.2.5 Amazon Web Services

Amazon Web Services (AWS) [33] is a cloud computing provider which was launched in 2002. Even though its biggest competitor, Microsoft Azure, is getting more and more popular, AWS is still the clear leader, with a market share of about 60 percent [34]. Some motivation for and implications of using AWS, or cloud services generally, were already revealed in Subsection 3.1.5, referencing also the Cloudy Culture project of the National Library of Scotland, which shows that the usage of a cloud service may be a valid, economical and easy way of fulfilling archival requirements without the need for one's own IT infrastructure [14]. Sticking to the object storage concept, Amazon offers two services: Simple Storage Service (S3) and Glacier. Both services replicate data across multiple geographically distributed facilities in a particular region, run integrity verification and self-healing processes and provide 99.999999999% durability. Features like authentication, auditing or data retrieval policies are supported by both. They differ in cost and access speed, and they generally serve different purposes. Glacier may be accessed directly or through S3 integration.

Amazon Glacier is a cheap storage suitable for archival purposes of infrequently accessed objects. To retrieve data, a retrieval request specifying the id of the object or a creation date filter must be sent, and once the data are retrieved, they are available for downloading for 24 hours. The retrieval price depends on the retrieval speed and the overall number of retrieval requests. There are three different retrieval options:

∙ Expedited: retrieval time 1-5 minutes; for objects of bigger sizes (250 MB+) it may take longer

∙ Standard: retrieval time 3-5 hours

∙ Bulk: retrieval time 5-12 hours

Amazon S3 is a storage providing low latency and real-time, frequent access possibilities. S3 is also more feature-rich than Glacier, supporting user-defined object metadata, various storage policies, reporting etc. In addition to geo-replication within a region, S3 also supports asynchronous geo-replication between different regions, e.g. from the EU to the US. The S3 lifecycle management feature enables setting up policies for the transition of objects between different storage classes, which provide different levels of protection, access speed, price etc. (a configuration sketch follows the class list below).

∙ S3 Standard storage class provides 99.9% availability in the Service Level Agreement (SLA), which is 8h 45m downtime per year.

∙ S3 Standard-Infrequent Access (S3 IA) provides 99% availability in the SLA (3d 15h downtime per year). The latency and throughput stay the same as in the S3 Standard case, but storage access is additionally charged, e.g. $0.01 for each retrieved GB.


∙ S3 One Zone-Infrequent Access (S3 OZIA) provides 99.5% availability. Data retrieval is additionally charged and there is no geo-replication.

∙ S3 Glacier storage class integrates the Glacier storage for archival purposes.
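The storage class is selected per object at upload time, and lifecycle rules can later move objects between classes automatically. The following is a hedged sketch using the AWS SDK for Java (version 1); the bucket name, the aip/ prefix and the 30-day threshold are arbitrary assumptions.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;
import java.io.File;

public class S3LifecycleExample {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Upload an AIP directly into the Standard-Infrequent Access class.
        s3.putObject(new PutObjectRequest("aip-bucket", "aip/aip-12345.zip",
                new File("/tmp/aip-12345.zip"))
                .withStorageClass(StorageClass.StandardInfrequentAccess));

        // Lifecycle rule: transition objects under the "aip/" prefix to Glacier after 30 days.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("archive-aips")
                .withPrefix("aip/")
                .withStatus(BucketLifecycleConfiguration.ENABLED)
                .addTransition(new BucketLifecycleConfiguration.Transition()
                        .withDays(30)
                        .withStorageClass(StorageClass.Glacier));

        s3.setBucketLifecycleConfiguration("aip-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
    }
}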

The following table gives a general overview of AWS prices for the different storage classes. The first column shows the price of a one-time upload of 10 000 objects of 1 GB each, and the second column shows the price of a one-time, one-by-one download of these files. The last column shows the overall archival price, containing these operations and the price of the storage itself. All prices are computed for the EU-Frankfurt region and do not contain additional charges, like the charge for transitions between storage classes.

Table 3.1: Example costs of AWS

                   Put      Retrieval   GB/month   After a year
S3 Standard        $0.054   $0.0043     $0.0245    $2940.0583
S3 IA              $0.1     $100.0043   $0.0135    $1720.1043
S3 OZIA            $0.1     $100.0043   $0.0108    $1296.1043
Glacier Expedited  $0.6     $480        $0.0045    $1020.6
Glacier Standard   $0.6     $120.6      $0.0045    $661.2
Glacier Bulk       $0.6     $30.3       $0.0045    $570.9
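For instance, the S3 Standard total decomposes into the one-time upload and download fees plus twelve months of storing the 10 000 GB: $0.054 + $0.0043 + 10 000 GB × $0.0245 per GB-month × 12 months = $2940.0583; analogously, the Glacier Standard row gives $0.6 + $120.6 + 10 000 GB × $0.0045 × 12 = $661.2 (assuming the storage price is applied to the full volume for the whole year).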

Amazon provides many other services which may be integrated with AWS object storages, for example: ∙ AWS Direct Connect, a dedicated network connection with AWS to speed up transfer.

∙ AWS Snowball, a portable device (21.3 kg) with a usable capacity of 72 TB, which may be used to import/export data to/from AWS. Once the Snowball device arrives, the customer copies data from/to it locally (e.g. over 10 Gb Ethernet) and sends it back to AWS. Export from Glacier is possible only through S3, therefore the fee for the transition from Glacier to S3 is charged.


∙ AWS Snowmobile, a truck carrying a shipping container capable of transporting 100 PB of data. It may only be used to import data into AWS.

∙ Amazon Athena, a service for querying and analyzing data stored in Amazon S3.

3.3 Conclusion

3.3.1 Summary

ZFS and Btrfs are advanced file systems capable of fulfilling the majority of LTP requirements. Compared to the other researched technologies, they are the simplest, which might be a valid argument for using them, at least for a small archive. ZFS provides more resilience at the price of less flexibility. It is considered to be more mature than Btrfs due to its origin in proprietary software. Therefore, it could be a better choice for mission-critical systems. Neither technology is distributed, both contain a single point of failure and cannot scale horizontally. Therefore, they are not ideal for big archives. However, they might be integrated as virtual devices in more complex technologies. Gluster is a horizontally scalable system with enough features to represent the storage layer of an LTP system. Being a distributed file system, Gluster provides an easy way of sharing data among users. This is also true for CephFS and the Amazon Elastic File System, which were not covered in the research. Compared to Ceph, Gluster's advantage is its simplicity. Performance tests from 2014 [35] and 2017 [36] also show that Gluster may outperform Ceph. On the other hand, Ceph has a bigger community and provides file, block and object interfaces to access the same cluster. It also excels at self-management and scaling up and down, and it provides great resilience. In addition to Active/Passive geo-replication, it also supports Active/Active replication and thus may provide higher availability. Ceph is more complex and harder to set up. Luckily, there are great resources provided by Red Hat, for example performance and sizing guides, which are available for both Ceph [37] and Gluster [38].


AWS is the biggest cloud computing provider in the world. There are some disadvantages of commercial cloud solutions, like security concerns or the loss of control over the data, but the ability to use a professional archival solution without the need to maintain one's own IT infrastructure makes them very attractive, especially for smaller organizations. The cost of using a commercial cloud solution may or may not be lower than maintaining one's own storage technology. Amazon Glacier is a very cheap product suitable for archiving data which does not have to be retrieved in real time. Data retrieval may take from 1 minute to 12 hours, where the price depends on the speed requirement. Amazon S3 is a more expensive, yet still affordable product with real-time data access capabilities. S3 offers four storage classes with different pricing, between which objects can be transitioned according to lifecycle policies. For example, one month after creation, an object might be moved from the S3 Standard class to the S3 Glacier class.

3.3.2 Features Comparison

This section compares the technologies which were the subjects of the research. All technologies are meant in the same context in which they were investigated. Specifically, ZFS means OpenZFS combined with ZFS on Linux, Ceph means the Ceph Object Storage and AWS means the combination of Amazon S3 and Amazon Glacier. Table 3.2 contains a general comparison of the storage technologies. If some property is unknown, it is marked as unk. Here is an explanation of some criteria used in the table, which may not exactly match the definition of the same term:

∙ Resilience is the ability to handle failure of multiple devices, together with the ability to quickly recover from these failures.

∙ Scalability is the ability to scale the performance/capacity, up and down, in an economical way, while still fulfilling the resilience requirement.

∙ Flexibility is the ability to change the configuration of the running system in a significant way, e.g. to achieve a different level of data protection.


∙ Complexity, in this context, means the complexity to which an administrator is exposed, rather than the complexity of the architecture. It somewhat correlates with the complexity of the system, but not in the AWS case.

Table 3.3 contains a list of features and the information whether the feature is present for each of the storage technologies. Only native features are considered; for example, the integration of the Virtual Data Optimizer (VDO) with Gluster for deduplication is not Gluster's native deduplication. If some feature is not production-ready, but it seems it soon will be, it is considered as supported. Here is an explanation of some features used in the table:

∙ Authentication & Authorization has to be user-friendly, e.g. based on name and password. File systems’ way of authenticating with SSH key and authorizing via Access Control List (ACL) is not considered to match this feature.

∙ Encryption means encryption of the stored data, not encryption of communication, like TLS.

∙ SPOF stands for Single Point Of Failure, i.e. an entity whose failure results in the failure of the whole system.

∙ Caching stands for the ability to set up a dedicated data pool which is used as a cache/tier to speed up access to frequently accessed data.

∙ Erasure Coding is a technique using Forward Error Correction (FEC) to recover data when some of it is lost. In distributed systems, it is a popular and cost-saving alternative to replication.

∙ Deduplication is a technique used to detect duplicate data and store it only once to save storage capacity.

All of the storages support the following features: geo-replication, self-healing, scrubbing, snapshotting and compression.

Table 3.2: General Comparison

                      ZFS                            Btrfs        Gluster         Ceph            AWS
Category              Advanced FS                    Advanced FS  Distributed FS  Object Storage  Object Storage
Open-source           X                              X            X               X
Enterprise Support                                                X               X               X
Platform              illumos, FreeBSD, Linux, OS X  Linux        Linux           Linux           Web
Resilience            medium                         low          medium          high            high
Scalability           low                            low          medium          high            high
Flexibility           low                            high         medium          high            medium
Complexity            low                            low          medium          high            low
Community             medium                         medium       medium          large           large

Table 3.3: Presence of Particular Features

                                ZFS  Btrfs  Gluster  Ceph  AWS
Authentication & Authorization                       X     X
Encryption                      X           X        X     X
SPOF                            X    X                     unk
Caching                         X           X        X     X
Erasure Coding                              X        X     unk
Deduplication                   X    X                     unk

4 Analysis

From now on, this thesis describes the OAIS Archival Storage within the context of the ARCLib project. In this chapter, the ARCLib project is introduced and a high-level overview of its Archival Storage module is provided. Then the requirements on the ARCLib Archival Storage are listed and grouped by the Archival Storage functions described in Section 2.4. When OAIS standard terminology is used, the first occurrence of a term is set in italics. These terms are explained in Chapter 2.

4.1 The ARCLib Project

ARCLib [4] is a research project led by the Library of the Czech Academy of Sciences (LIBCAS), together with Masaryk University, the National Library of the Czech Republic and the Moravian Library in Brno, which aims to create a complex solution for long-term preservation of (library) digital collections. It is based on the OAIS standard, on the experience and knowledge of professionals from various fields of LTP, and on the needs of Czech institutions responsible for preserving the cultural heritage. The project started in 2015 and is planned to end in 2020.

ARCLib was preceded, among others, by the LTP-pilot project [2], led by the Institute of Computer Science of Masaryk University (ICS MU), which examined the possibilities of the Archivematica LTP system and concluded that it does not support some of the OAIS requirements [3]. The ARCLib system is designed to integrate Archivematica as well as other systems used by Czech institutions, e.g. ProArc, with those systems acting as OAIS Producers/Consumers.

The scope of the ARCLib project goes beyond the development of an LTP system and also covers the development of LTP methodologies. The Methodology for Logical Preservation of Digital Data (MLOG) [39] is related to the protection of the intellectual content of the document, while the second methodology, further referenced as MBIT [40], which should be finished by the end of 2018, deals with the bit-level protection of data. In addition to the OAIS standard, MBIT is also based on the Preservation Storage Criteria document, a more storage-focused document with community contributions which originated from the iPRES 2016 conference [11]. The Archival Storage component is more related to MBIT.

The author of this thesis participates in the refinement of the analysis and in the design and implementation of the ARCLib system from the position of an employee of the inQool company, an industry partner of the Faculty of Informatics at Masaryk University.

4.2 ARCLib Archival Storage

The whole ARCLib system is designed to be modular. The Archival Storage module is completely separated from the rest of the ARCLib system and is accessible through the RESTful API exposed by the Archival Storage Gateway component, which is the only way of accessing the Archival Storage. The RESTful API is a simple object storage API which may also be accessed directly by clients other than ARCLib. ARCLib itself is a dark archive, accessible only by a small subset of individuals from registered organizations. For now, all communication is initiated by clients accessing the Archival Storage.

Figure 4.1: ARCLib Archival Storage

The Archival Storage Gateway is connected to one or more logical storages, for example an instance of a Ceph cluster, which store the data. Every type of logical storage has to implement the Storage Service Interface defined by the Archival Storage Gateway. All logical storages are then accessed uniformly via this interface. It is also possible for an Archival Storage Gateway to implement this interface itself and be used as a logical storage of another Archival Storage Gateway. Configuration of the logical storages, such as incremental backup of the ZFS storage, is handled by the administrator of the particular technology and is out of the Archival Storage Gateway's scope.

The Archival Storage Administration Module contains functions for the management and monitoring of logical storages, and for reporting their errors and states.

To understand the Archival Storage, some basic understanding of how Information Objects are handled in ARCLib is needed. When a Producer uploads a SIP, it is processed by the Ingest module to extract its metadata and generate its PDI. The resulting AIP is a combination of two files: the original SIP and the generated AIP XML which contains the PDI. This AIP is then sent to the Archival Storage. In addition to this standard OAIS process, ARCLib also supports versioning of the AIP XML, i.e. updates of the PDI. Therefore, the Archival Storage operates not only with AIPs, but also with AIP XMLs.

4.3 Archival Storage Requirements

Specification of the requirements on the ARCLib Archival Storage module is not a one-off process. The initial requirements have been adjusted over time and are likely to be adjusted in the future. The requirements described in this section are based on the OAIS standard (2003) and are defined in these documents:

∙ Technical specification of LIBCAS (2016) [41]

∙ MLOG (2017) [39]

∙ MBIT (2018) [40]

In this section, requirements from all of these documents, together with some that were stated at ARCLib meetings, are grouped by the Archival Storage functions described in Section 2.4.

35 4. Analysis

4.3.1 Receive Data

∙ The Storage Service Gateway of the Archival Storage provides a simple RESTful object storage API which is used to store AIPs or, more generally, any objects.

∙ A storage request contains the AIP data, its Universally Unique Identifier (UUID) generated by the client (which is its only unique identifier) and its checksum.

∙ The client receives a storage confirmation message after all connected storages have successfully stored the AIP.

4.3.2 Manage Storage Hierarchy

∙ Based on the outputs of the Negotiate Submission Agreement function of the Administration module, the Archival Storage may choose the logical storages to which the objects of a particular producer are stored. This requirement, as well as others related to non-uniform distribution of data over the logical storages, was later reversed in MBIT, at least for the first ARCLib version.

∙ Objects are stored mainly on online media. The Archival Storage Administration Module supports offline backups only by exposing an API endpoint which returns the IDs of objects changed since a specified timestamp.

∙ The Archival Storage Administration Module contains functions for management and monitoring of logical storages, and reporting of their errors and states.

4.3.3 Replace Media

∙ Objects are stored on at least two logical storages, which may use different storage technologies. The overall redundancy is further increased by the inner redundancy of the physical storages of each logical storage.

∙ In the first version, there are implementations for Ceph, ZFS and an ordinary file system (for testing purposes).


∙ Replacement of a part of a logical storage, e.g. replacement of a disk in a ZFS RAID array, is handled by the logical storage instance itself, not by the Archival Storage.

∙ Replacement of the whole logical storage instance follows these steps:

– A new storage instance is added and marked as write-only.

– All objects are copied to the new storage. Incoming read requests are directed to the other storages; write requests are written to all storages.

– The old storage is removed and the new one is set to the read/write state.

∙ A logical storage may also be removed without replacement, provided this does not break the policy specifying the minimum number of logical storages.

∙ All of these operations must be performed at runtime, without stopping the system or losing data.

4.3.4 Error Checking

∙ After the Archival Storage Gateway receives an object, it computes its checksum and compares it with the provided checksum. If the checksums do not match, for example because of a transfer error, the operation is ended and the object is not stored to any storage.

∙ Each logical storage computes the checksum of a stored object and compares it with the provided checksum. If the checksums do not match, for example because of a transfer error, the operation is ended and a new storage request must be sent.

∙ During an object get request, the checksum of the object retrieved from a logical storage is compared with the checksum which was provided during the store request. If these do not match, the object is retrieved from another storage and the corrupted copy on the first storage is replaced with the copy from the second storage.


∙ The API provides an endpoint for verification of the state of any object on any storage. This involves the same comparison and error handling as in the previous case.

∙ Each logical storage supports features like scrubbing and self-healing (as mentioned in Subsection 3.1.6) and is able to perform efficient routine integrity checks.

4.3.5 Disaster Recovery

∙ Every logical storage contains a complete set of objects, so if one storage fails, data may be recovered from another one.

∙ Logical storages are situated in different geographical locations. Additionally, a logical storage may support geo-replication, so the data of one logical storage may be distributed to different locations.

∙ The system provides a cleanup function which is used in the case of a system failure to delete objects whose transfer was not successfully completed.

4.3.6 Provide Data

∙ The client gets objects through the Archival Storage Gateway API by sending a request with the UUID of the object.

4.3.7 Other Functional Requirements

∙ ARCLib supports two types of versioning which also have to be supported in the Archival Storage:

– Versioning of the whole AIP involves changes in both the Content Information and the PDI. According to the OAIS terminology described in Subsection 2.3.2, this results in a new AIP. The Archival Storage treats the new AIP the same way as all other AIPs.

– Versioning of the AIP XML involves changes only in the PDI. In this case, the ARCLib conception differs from the OAIS standard, because in the ARCLib conception this Transformation does not result in a new AIP. Instead, the new AIP XML is added to the existing AIP, so the AIP is then composed of the SIP, the old AIP XMLs and the new AIP XML.

∙ An AIP may be logically removed, i.e. made inaccessible, without deleting any data.

∙ An AIP may be physically deleted. The deletion applies to the SIP part of the AIP; the AIP XMLs persist.

∙ The API provides authentication and authorization services.

∙ All data accesses and operations are logged.

4.3.8 Other Non-functional Requirements

∙ Uses open-source technologies

∙ Communication between modules over HTTPS

∙ Platform independent, Linux OS preferred

∙ Java (OpenJDK)

∙ PostgreSQL

∙ Horizontal scalability of performance

∙ Modularity

∙ Third-party tools are used through interfaces


5 Design

The first section of this chapter deals with the Archival Storage prototype which was built to reveal hidden requirements on the system. The next section summarizes the main points which were subject to discussion after the prototyping, and is followed by diagrams describing the system design.

5.1 Archival Storage Prototype

Before the actual development of the ARCLib system began, twelve prototypes covering different functionalities of the system had been developed. The purpose of the prototypes was to revise the analysis, to design and implement solutions to particular problems, and to demonstrate and discuss findings in order to refine the analysis. The prototypes were developed as independent modules, each focused on one particular task. All prototypes may be found at the LIBCAS GitHub [42]. The development of the prototypes of the Ingest module is covered in another ARCLib-related thesis [43]. This thesis elaborates on prototype number 10, dedicated to the Archival Storage.

The goal of the Archival Storage prototype was to develop the basic functionality of the Archival Storage, specifically the creation of an AIP, the retrieval of an AIP and versioning. The development of the prototype involved:

∙ Design of the database layer

∙ Definition of the Storage Service Interface to be implemented by logical storages

∙ Implementation of defined interface for local file system

∙ Definition of the Storage Service Gateway API

∙ Implementation of simple service logic interconnecting the database, storage and API layers

All the prototypes were presented and discussed with the ARCLib research team during three meetings in the fall of 2017. The prototypes have fulfilled their purpose, as they revealed new functional requirements, refined some of the old ones and prepared the ground for the development of the system. Although the later development of the Archival Storage started from scratch (which is a recommended practice in software development), its database layer, Storage Service Gateway API and design of AIP states are based on the outputs of the prototype.

5.2 Refining Requirements and Design

Because the ARCLib system has to follow the methodologies developed by the research team of experts in the LTP field, refining the analysis was a subject of team discussions. Discussions related to the functionality covered by the prototypes took place during the prototype presentations; the others were carried out via online collaborative tools or at other analytical meetings. Some of the outputs of the discussions are already reflected in the requirements in Section 4.3. This section summarizes the most important points of the discussions.

5.2.1 Object Metadata

With respect to the AIP XML versioning requirement, the Archival Storage was designed to store the SIP and the XMLs as separate objects which are logically grouped to form an AIP. During an XML update, the new XML version is simply added to the logical group. This grouping of XMLs and SIPs was initially held only in the Archival Storage database. However, even though there will be a periodical database backup process, the system must not risk any loss of such important information as the mapping of XMLs and SIPs. In other words, the system should not rely on the database as the only source of transactional information. This problem was solved by setting up requirements on the Storage Service Interface and all its implementations to store metadata critical for the system in addition to the objects themselves. The metadata contain the creation time of an object, its initial checksum, its current state and, for an AIP XML, its version and SIP ID. In the case of the local file system and an AIP with ID 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa:

∙ The ID of the SIP is the same as the ID of the AIP.


∙ Creation time is stored automatically

∙ The initial checksum is stored in a file whose name is derived from the object ID and the type of the metadata, e.g. 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa.MD5.

∙ The mapping of an XML to its SIP and its version number are contained in the ID of the XML object, which is derived from the AIP ID, e.g. 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1. Other metadata related to this XML object follow the same pattern based on the ID of the object, so the state would be stored as 6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1.PROCESSING, etc. A minimal sketch of this naming scheme is shown after this list.
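
The following sketch only illustrates the naming conventions described above; the helper class and method names are hypothetical and do not correspond to the actual ARCLib code.

    // Sketch of the metadata naming scheme used on the local file system storage.
    class MetadataNamingSketch {

        // e.g. "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa" + "MD5"
        //   -> "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa.MD5"
        static String checksumFileName(String objectId, String checksumType) {
            return objectId + "." + checksumType;
        }

        // e.g. AIP ID + version 1 -> "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1"
        static String xmlObjectId(String aipId, int xmlVersion) {
            return aipId + "_xml_" + xmlVersion;
        }

        // e.g. XML object ID + "PROCESSING"
        //   -> "6d7e47b1-369b-4bdd-8c62-df7d1c8df4aa_xml_1.PROCESSING"
        static String stateFileName(String objectId, String state) {
            return objectId + "." + state;
        }
    }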

With this design, there is a duplication of information which may result in inconsistency between the database and the storages' metadata. However, collisions caused by the possible inconsistency will never occur, because the storages' metadata are never used unless the database has been lost. When this happens, the database is recovered from its last backup and the records which were written after the backup had been created are recreated from the storages' metadata.

5.2.2 Storing Objects

The prototype was designed to accept an HTTP store request, synchronously store the object and return the information about success or failure to the client in the HTTP response. However, in the real system, the store request may take some time, because the store process of each logical storage runs in a separate thread which is obtained from a thread pool. If no thread is currently available, processes wait in a queue to be assigned a thread once one is free. Even if all threads are free, the size of the object to be stored may itself cause a long processing time. Consequently, the requirement was altered: the HTTP response containing the ID of the stored object is sent to the client, the object is stored asynchronously, and the user has to send an additional request for the object state to find out whether the object has been stored, or whether the process has failed or is still in progress.

Another revealed requirement, which was not covered by the prototype, is related to the handling of failures of logical storages during the store process. If one storage fails, the process of storing is immediately stopped at all other storages and the rollback process is started, during which the uncompleted data are deleted. The information that the object was rolled back must be available when the client sends the request for the object state.

5.2.3 Logical Storage Failure

When one storage fails, all write requests to the Archival Storage fail and the whole system stays in read-only mode until the failure is resolved. Read requests are handled by the working storages. The system may be transitioned into the read-write state by:

∙ Detaching the storage, supposing this action does not break the limit of the minimum number of attached storages defined in the Archival Storage policies.

∙ Automatic or manual recovery of the failed storage, e.g. when it failed only due to a power outage.

∙ Replacing the storage with a new one. The replacement process is described in Section 4.3.

5.2.4 Authentication and Authorization

The authentication and especially the authorization to the Archival Storage need to be further discussed. In the first version of the ARCLib system, the whole system runs on a private network, and therefore the modules do not need to authenticate against each other. The following are just current design proposals which delegate the authorization to the systems through which the Archival Storage is accessed.

The client authenticates to the Archival Storage with an API key and API secret pair (username/password) to obtain a JWT (JSON Web Token) used for authentication and authorization in subsequent requests.

When the client accesses the Archival Storage through the ARCLib system, its authentication and authorization are handled in the ARCLib Administration module. The Administration module then uses a system account to authenticate against the Archival Storage. The Archival Storage recognizes the whole ARCLib system as a single user with administration permissions.

44 5. Design

When the client accesses the Archival Storage from outside the ARCLib system, e.g. through another system using the Archival Storage Gateway API, the authorization is again handled by the other system. Data of different users are separated in the Archival Storage so that each user can access only their own data.

5.3 Entity Relationship Diagram

The main responsibility of the Archival Storage is to implement the archival processes and handle the permanent data storages. Its database layer is simple, designed to provide the service layer with the needed transactional data.

Figure 5.1: Entity Relationship Diagram

Every database entity has its own UUID. All three types of objects (AIP SIP, AIP XML and general object) share the same subset of attributes, but each of them is stored in a separate table. All enumerations (checksum type, object state, storage type) are defined in the application logic and stored in their textual form in the database.

The Storage table holds data about the logical storages. If any storage has its reachable attribute set to false, the system is in read-only mode and reads are handled by a random storage from those with the highest priority. Different types of storages are distinguished by the type attribute, and if there is some configuration specific to the particular storage type, it is stored as a JSON (JavaScript Object Notation) string in the config attribute.

5.4 Class Diagram

The decomposition of the Archival Storage Gateway module into service classes is described by the following class diagram.

Figure 5.2: Class Diagram

The API controller accepts HTTP requests and handles the transformation of service layer data to/from DTOs (Data Transfer Objects) used for transfer between the system and the client. The Archival Service obtains instances of the Storage Service via the Storage Provider, which is the service responsible for retrieving data about the currently attached logical storages, instantiating Storage Services of the particular types according to the retrieved data, and providing the instances to the Archival Service.

In the case of read operations, the Archival Service directly uses one of the provided Storage Service instances (the one with the highest priority) to get the data from the storage. However, in the case of write operations, it records the beginning of the operation in the database through the Database Service and delegates the operation to the Asynchronous Service. The Asynchronous Service performs the write operation in a separate thread for each storage and records the result of the operation in the database at the end.

5.5 Object State Diagram

Objects may acquire various states during their lifecycle. The following diagram describes the available states together with the operations (separated by a comma) and conditions (in square brackets) needed for the state transitions.

Figure 5.3: Object State Diagram

The left side of the diagram shows the states which an object may acquire as a result of the successful processing of a client's request. An object is in the PROCESSING state if there is some important and time-consuming operation in progress, such as storing or deleting. Once such an operation successfully ends, the state is switched accordingly, e.g. to the ARCHIVED or DELETED state.

The right side of the diagram shows the error states which an object may acquire due to an error during its processing. A rollback is a process fired at the failure of one logical storage during processing, involving the deletion of uncompleted object data from all storages. If the process succeeds, the object is transitioned to the ROLLED_BACK state. However, errors may occur even during the rollback process. In that case, the object is transitioned to the FAILED state. The cleanup process is a bulk rollback process for all objects which are in the PROCESSING or FAILED state. It is used to clean up the storage after a failure of the whole Archival Storage. For example, after a power outage, the cleanup is performed first to delete the uncompleted data which had been in processing when the outage occurred. After the cleanup, the Archival Storage is ready to accept new requests.

In accordance with the requirements, the SIP object may acquire all states, but the XML object cannot acquire the REMOVED and DELETED states. Working with general objects is not a priority for now, so it has not yet been decided how to handle them and which states they may acquire. As pointed out in Subsection 5.2.1, the states are not only stored in the database but also on every logical storage. The only exception to this is the FAILED state, which is stored only in the database.
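
For illustration, the states described above could be captured by an enumeration similar to the following minimal sketch; the names are taken from the text and Figure 5.3, but the actual ARCLib class may differ.

    // Sketch of the object lifecycle states from the object state diagram.
    enum ObjectState {
        PROCESSING,  // a time-consuming operation (store, delete, ...) is in progress
        ARCHIVED,    // the object is successfully stored on all logical storages
        REMOVED,     // logically removed, i.e. made inaccessible, data kept
        DELETED,     // physically deleted (applies to the SIP part, AIP XMLs persist)
        ROLLED_BACK, // uncompleted data were deleted after a failed operation
        FAILED       // even the rollback failed; stored only in the database
    }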

6 Implementation

This chapter contains a description of the implementation of some key components and functions of the first version of the Archival Storage. Technologies used in a particular component of the system are mentioned in the section dedicated to that component. The first section mentions the technologies used in the whole Archival Storage and the following sections describe particular components of the system.

6.1 Technologies

The Archival Storage application is written in the Java programming language and runs on the Java Enterprise Edition 8 platform. It uses the Apache Maven tool for building the source code and managing dependencies on third-party libraries.

One of the very basic requirements on the system is a loose coupling of components. Such components are easy to manage and test. Loose coupling is achieved by the usage of the Spring Framework and its Inversion of Control (IOC) container, which is responsible for managing the lifecycle of reusable objects, called beans, within the application context. The Spring Framework also provides various modules which extend the functionality of existing libraries, are easy to use and reduce boilerplate code.

Spring Boot is a project which follows the convention over configuration paradigm and is used by the Archival Storage to simplify and speed up the configuration of the Spring application and to manage conventional dependencies of application components. All the configuration is done within Java annotations and configuration classes rather than in XML configuration files. Configuration properties are specified in a property file in a human-readable YAML format and are injected into the application at startup.

The Lombok library is used for annotation-based generation of methods. It makes the code clearer by reducing the amount of code that a developer has to maintain. For example, instead of writing a getter method for each attribute of a class, the @Getter class annotation is used to generate the code automatically, without showing it in the IDE (Integrated Development Environment).
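
A minimal illustration of the Lombok usage mentioned above (the entity is hypothetical and is not taken from the ARCLib code base):

    import lombok.Getter;
    import lombok.Setter;

    // Lombok generates getId()/setId() and getChecksum()/setChecksum() at
    // compile time, so the class body contains no boilerplate accessors.
    @Getter
    @Setter
    class ArchivalObjectDto {
        private String id;
        private String checksum;
    }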

6.2 HTTP API

The HTTP API of the Archival Storage runs in an embedded Apache Tomcat web server which is provided by the Spring Boot web starter. With this design, the application is easily portable and does not require the installation and configuration of a separate web server.

API controllers, their endpoints and input/output parameters are registered using Spring Web module annotations. In most cases, the API controller only validates the input parameters, e.g. that the UUID is present and well-formed, optionally transforms the parameters into a DTO and delegates the rest of the logic to the service layer. Java exceptions which are thrown during the processing of HTTP requests are mapped to proper response codes using Spring exception handlers.

In the case of the AIP get method, the API layer requests the AIP data from the service layer and receives references to the files (SIP and XMLs) stored on the remote logical storage. Because the files, mainly the SIP, may be very large, the data are read from the storage by parts and those parts are immediately compressed and transformed into a stream of a single ZIP file, which is sent by parts to the client in an HTTP response. In other words, the Archival Storage works like a pipe transferring the AIP files from a storage to a client and packing them into a single ZIP during the transfer. The connection to the source storage is closed at the end of the transfer.

The Archival Storage Gateway API documentation is published in the form of a web page using the Swagger UI tool. The web page contains the list of all API endpoints with their descriptions. Furthermore, it provides an interactive GUI for calling the API and viewing the responses. The text of the documentation is written in the source Java classes, right above the methods, using the Swagger annotations, from which the content of the Swagger UI is automatically generated.
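
The streaming described above can be illustrated by the following minimal sketch based on the standard java.util.zip API; the class, the entry names and the method signature are assumptions made for the example and differ from the actual ARCLib controller code.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    // Copies the SIP and XML streams into one ZIP written directly to the HTTP
    // response output stream, so the whole AIP is never held in memory at once.
    class AipZipStreamer {

        void streamAip(String aipId, InputStream sip, InputStream xml, OutputStream httpResponse)
                throws IOException {
            try (ZipOutputStream zip = new ZipOutputStream(httpResponse)) {
                copyEntry(zip, aipId + ".zip", sip);        // SIP content
                copyEntry(zip, aipId + "_xml_1.xml", xml);  // AIP XML content
            } // closing the ZIP stream finishes the HTTP response body
        }

        private void copyEntry(ZipOutputStream zip, String name, InputStream in) throws IOException {
            zip.putNextEntry(new ZipEntry(name));
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                zip.write(buffer, 0, read); // read by parts, compress, send by parts
            }
            zip.closeEntry();
            in.close();
        }
    }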

6.3 Service Layer

The design of the service layer is shown in Figure 5.2. This section further describes the two most complex core methods, which are the AIP store request and the AIP get request.


6.3.1 AIP Store Request

The diagram shows the request when two logical storages are attached.

Figure 6.1: AIP Store Request Activity Diagram


The Archival Service first uses the Storage Provider to obtain instances of the Storage Services for the particular logical storages. If one of the storages is unreachable, an error is propagated and the method ends. Then the XML and SIP are read from the HTTP input stream, temporarily stored at the Archival Storage server, and their checksums are verified. If the checksum from the HTTP request matches the checksum computed from the received data, the AIP is stored to the database with the PROCESSING state, its ID is returned to the client in the HTTP response and the asynchronous operations start.

The Asynchronous Service starts parallel store processes, one for every Storage Service. Besides the input stream of the SIP (which is stored in a temporary folder) and the input stream of the XML (which is stored in main memory), the Storage Service store method accepts a thread-safe, atomic boolean variable. The Storage Service first stores the file to a logical storage and then reads it back to verify the checksum after the store process. If an error occurs, either because of a checksum verification failure or because of any other error, the Storage Service sets the boolean variable and ends. The boolean variable is watched by all other Storage Services, and when its change is detected, their store methods immediately end too. Thanks to this, if one storage fails at the very beginning of the process, the others will immediately fail too and do not waste the performance and capacity resources of the logical storages.

If at least one of the Storage Services fails, parallel rollback processes are started. During the rollback, the AIP or its uncompleted parts are deleted from the logical storage. If the rollback of one Storage Service fails, there is no need to detect it immediately and take an action as in the store process case. Once all rollbacks are finished, the state of the AIP is set according to whether the rollback was successful on all storages or not.
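
The cooperative cancellation described above can be sketched as follows; this is a simplified illustration using java.util.concurrent with hypothetical names, not the actual ARCLib code.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.stream.Collectors;

    // Each storage stores the object in its own thread obtained from a pool; the
    // shared atomic flag lets the first failing storage stop the others early.
    class ParallelStoreSketch {

        interface StorageService {
            void store(String objectId, AtomicBoolean rollback) throws Exception;
            void rollback(String objectId);
        }

        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        void storeOnAll(List<StorageService> storages, String objectId) {
            AtomicBoolean rollback = new AtomicBoolean(false);
            List<CompletableFuture<Void>> stores = storages.stream()
                    .map(s -> CompletableFuture.runAsync(() -> {
                        try {
                            s.store(objectId, rollback); // implementation checks the flag and aborts if set
                        } catch (Exception e) {
                            rollback.set(true);          // signal the other store threads to stop
                        }
                    }, pool))
                    .collect(Collectors.toList());
            CompletableFuture.allOf(stores.toArray(new CompletableFuture[0])).join();

            if (rollback.get()) {
                // at least one storage failed: delete the uncompleted data everywhere
                storages.forEach(s -> s.rollback(objectId));
            }
        }
    }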


6.3.2 AIP Get Request

The diagram below shows the get request when two logical storages are attached.

Figure 6.2: AIP Get Request Activity Diagram

The Archival Service first verifies the AIP state and throws an exception if the state is not ARCHIVED. Then it obtains the Storage Services of the reachable storages and tries to get the AIP from the storage with the highest priority. After the AIP is retrieved, its SIP is stored to a temporary folder and its XMLs to main memory. The checksums of all objects are computed and compared with those which have been stored in the database since the objects' creation. If the checksums match, the AIP is returned and the process ends.

If the checksum verification fails, the AIP is retrieved from the other storage and the checksums are compared again. If the retrieval of the AIP or the verification of the checksums fails again and there is no other storage to use, the error is propagated to the API layer and the process ends. If the AIP retrieved from the second storage is valid and the AIP from the first storage was not valid because it had been missing or corrupted, then the valid copy of the AIP from the second storage is used to repair the invalid copy on the first storage, and the AIP is then returned to the client. If the reason for the failure of the first storage is different from the two mentioned above, e.g. when the first storage becomes unreachable during the transfer, then the copy from the second storage is not copied to the first storage.
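
A minimal sketch of this retrieval-with-fallback logic follows, with the checksum computed via the standard java.security.MessageDigest API. The interfaces and names are hypothetical, and the real ARCLib flow (e.g. the distinction between corrupted, missing and unreachable copies) is more elaborate.

    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.util.List;

    // Tries the storages in priority order, verifies the checksum of the
    // retrieved copy and repairs a corrupted copy from a valid one.
    class AipRetrievalSketch {

        interface StorageService {
            InputStream get(String objectId) throws Exception;
            void store(String objectId, InputStream data) throws Exception;
        }

        InputStream getVerified(List<StorageService> byPriority, String objectId, String expectedMd5)
                throws Exception {
            StorageService corrupted = null;
            for (StorageService storage : byPriority) {
                InputStream data;
                try {
                    data = storage.get(objectId);
                } catch (Exception unreachable) {
                    continue; // e.g. storage unreachable: try the next one, no repair attempted
                }
                if (md5Hex(data).equalsIgnoreCase(expectedMd5)) {
                    if (corrupted != null) {
                        corrupted.store(objectId, storage.get(objectId)); // repair the invalid copy
                    }
                    return storage.get(objectId); // re-open the verified copy for the client
                }
                corrupted = storage; // checksum mismatch: remember the storage for repair
            }
            throw new IllegalStateException("No valid copy of object " + objectId + " found");
        }

        private String md5Hex(InputStream in) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
            in.close();
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }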

6.4 Database Layer

The Archival Storage uses the popular Hibernate Object/Relational Mapping (ORM) framework, an implementation of the Java Persistence API (JPA), for access to the PostgreSQL database. The design of the database layer is based on the experience of the inQool company, gained through years of development of information systems.

To unify and simplify the access to the data and to reduce boilerplate code, all database entities extend the generic Domain Object entity and all repository classes extend the generic Domain Store repository class, which contains the basic CRUD (Create, Read, Update, Delete) operations. With this design, a developer has to implement only those methods which are specific to a particular entity.

For specific queries to the database, the Querydsl framework with its JPA module is used. Querydsl is an easy-to-learn framework for writing type-safe database queries. The type-safety significantly decreases the probability of a programmer's mistake and also allows the programmer to write queries much faster with intelligent code completion. The type-safety is achieved by using special query classes which are generated from the source entity classes and which map their attributes to textual strings representing the attributes in the database.

To compare Querydsl with JPQL (Java Persistence Query Language): in JPQL the developer has to write the whole query string, very similar to native SQL, for example:

    entityManager.createQuery(
        "select aipXml from AipXml as aipXml "
        + "where aipXml.version = 1 "
        + "and aipXml.sipId = '43450d72-bb46-4879-aa9e-0bf0dfbcc3df'");

while in Querydsl the developer just writes standard Java code:

    QAipXml aipXml = QAipXml.aipXml;
    queryFactory.from(aipXml)
        .where(aipXml.version.eq(1)
            .and(aipXml.sipId.eq("43450d72-bb46-4879-aa9e-0bf0dfbcc3df")));

To create, update and manage the database schema, the Liquibase tool is used. Liquibase allows the developer to quickly create the database schema, refactor it, track all the changes and roll back to a previous state if needed. All these operations are written in change sets of an XML change log file and are independent of the underlying database, which makes it possible to change the database provider and keep the defined database schema.

6.5 Storage Service

In the first version of the Archival Storage, there are two implementations of the Storage Service, one for ZFS and one for Ceph. All the methods of the Storage Service interface are documented.

6.5.1 ZFS Storage Service

From the application point of view, ZFS acts just like a common file system and its features, such as replication, are transparent to the client. As an implementation of the Storage Service for a common file system is one of the requirements, the ZFS Storage Service was designed in such a way that it does not differ from the implementation for the common file system, and the code is shared. The only difference is that the ZFS Storage Service implements all methods of the Storage Service, whereas the FS Storage Service does not implement the methods which require the usage of commands specific to the underlying platform. For example, the method to get the storage state involves getting storage space information. On the Linux platform the df command would be used, but that is a command specific to the Linux platform which does not exist on Windows.

The Archival Storage supports access to the local file system or a Network File System (NFS) through the standard Java API. It also supports access to a remote file system over SFTP (SSH File Transfer Protocol) using the SSHJ library. The decision whether to use the local/NFS adapter or the SFTP adapter is made by the Storage Provider. If the host attribute of the logical storage database entity is set to "localhost", then the local/NFS adapter is used; otherwise, the host attribute contains the IP address of the remote storage server and the SFTP adapter is used. If the SFTP adapter is used, SSH public key authentication is used to authenticate the Archival Storage.

Archival objects are stored in a hierarchical folder structure, where the object location is derived from its ID. The folder structure is designed to ensure that the data are spread uniformly across the folders. For example, the object with ID 43450d72-bb46-4879-aa9e-0bf0dfbcc3df is located at the path 43/45/0d/43450d72-bb46-4879-aa9e-0bf0dfbcc3df. The metadata of the objects are stored using the additional files and naming conventions described in Subsection 5.2.1.
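
A minimal sketch of this path derivation, assuming that the first three character pairs of the ID are used as folder names, as in the example above; the class name is hypothetical and the actual ARCLib code may differ.

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Derives a path such as 43/45/0d/43450d72-bb46-4879-aa9e-0bf0dfbcc3df
    // from the object ID, spreading objects uniformly across the folder tree.
    class ObjectPathResolver {

        Path resolve(String storageRoot, String objectId) {
            return Paths.get(storageRoot,
                    objectId.substring(0, 2),
                    objectId.substring(2, 4),
                    objectId.substring(4, 6),
                    objectId);
        }
    }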

6.5.2 Ceph Storage Service

The Ceph object storage may be accessed either through the RGW over the S3/Swift compatible API, or through the lower level LIBRADOS library. In the first version of ARCLib, the S3 API was chosen because of the popularity of Amazon S3 and because the RGW is optimized for typical, simple object storage needs, whereas LIBRADOS is a low-level library which requires knowledge of Ceph's internal functionality. However, it is possible that the Ceph Storage Service will use LIBRADOS in the future to fulfill some new requirements which cannot be fulfilled through the RGW.

The development of the Ceph S3 Storage Service is in some cases simpler than, for example, the development of the ZFS service over SFTP, because the standard Java S3 API is well defined, easy to use and created exactly for object storage needs. However, there are some compatibility problems between the S3 API and the Ceph cluster. For example, the multipart upload, which is used in the object store method, does not work with the AWS signature v4 authentication algorithm and, because of that, the signature v2 algorithm has to be used. Finding the root cause of such issues may be a challenging task, because the AWS error messages do not incorporate any Ceph-related information. For example, the error message for the mentioned problem only says that the signature does not match.

The possibility to store user-defined object metadata is a typical feature of object storages and is easily achieved with the Ceph S3 Storage Service. All the metadata are stored within a Java Map with String keys and String values. However, the problem with the S3 API is that in order to update metadata, the whole object must be uploaded again. Therefore, the Archival Storage stores two Ceph objects for each archival object: one object contains the actual data to be stored, and the other is an empty object carrying all the metadata of the first one. This workaround would not be needed if the Swift API was used, because the Swift API supports updates of the object metadata.
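
A minimal sketch of the two-object approach using the AWS SDK for Java (v1) S3 client; the bucket name, the ".meta" key suffix and the method signature are assumptions made for the example and differ from the actual ARCLib service.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.util.Map;

    // Stores the object data under its ID and the user-defined metadata as a
    // separate, empty companion object, so the metadata can be rewritten later
    // without re-uploading the (potentially large) data object.
    class CephS3StoreSketch {

        void store(AmazonS3 s3, String bucket, String objectId, InputStream data,
                   long dataLength, Map<String, String> userMetadata) {
            // 1) the data object itself
            ObjectMetadata dataMeta = new ObjectMetadata();
            dataMeta.setContentLength(dataLength);
            s3.putObject(bucket, objectId, data, dataMeta);

            // 2) the empty companion object carrying the user-defined metadata
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(0);
            userMetadata.forEach(meta::addUserMetadata);
            s3.putObject(bucket, objectId + ".meta", new ByteArrayInputStream(new byte[0]), meta);
        }
    }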

6.6 Testing

The automated tests, which are written to verify the system behavior in standard and non-standard situations, can be divided into four categories according to which layer of the application is tested. All unit tests use the in-memory HSQLDB database. The integration Spring Boot tests use the in-memory H2 Database. To simulate specific behavior of objects and methods, the popular Mockito framework is used.

Unit tests of the database layer verify that the Querydsl queries written in particular repository classes retrieve the correct data from the database.

The tests of the service layer are unit tests verifying mainly the Archival Service and its interaction with the Asynchronous Service. Because of the active Archival Storage development, these tests are subject to continual refactoring, as the service layer changes very often, old features are replaced with new ones, etc. In the current active development phase, it may happen that some tests are outdated or missing. However, the main functionality is properly tested, and the risk of an unrevealed error is further lowered by the API integration tests, which use the storage layer.

The Storage Service Gateway API integration tests are end-to-end Spring Boot tests which use the Spring web test module to simulate the client's requests. The only parts of the system which are mocked in these tests are the instances of the Storage Service. While the service layer solves how to process the data, the API layer relates more to the definition of the inputs and outputs of the Archival Storage. These definitions originate from the basic requirements on the system and do not change so frequently. Therefore, these tests verify the functionality of all API methods and thus of the whole system.

The last category are unit tests of the Storage Service implementations. Because the Storage Service instances are mocked in the integration tests, these tests are the only automated verification of the Storage Service implementations and are given special attention. There are test classes for the Ceph, ZFS local and ZFS remote Storage Services which implement the Storage Service test interface, which defines 25 storage test methods testing all methods of the Storage Service interface in standard and non-standard situations. The ZFS Storage Service tests run on ZFS installed on a virtual machine on a private server. The ZFS pool consists of four file VDEVs and forms a group similar to RAID 10, where two of the VDEVs are mirrors of the other two. The Ceph Storage Service tests run on a cluster of virtual machines with one monitor and three OSD nodes. The cluster was installed using the ceph-deploy tool and is set up with the default configuration, i.e. with three replicas for every object.

The user testing of the Archival Storage integrated with the other ARCLib modules will start in the summer of 2018. The testing will be done on real-world data and on the real infrastructure. Together with the user tests, stress tests will be performed to verify the performance and scalability of the system.

7 Conclusion

The main objectives of this thesis were the analysis of the OAIS standard, the research of storage technologies suitable for use in the OAIS Archival Storage module and, finally, the design and implementation of the first version of the Archival Storage which provides the main required functionality.

The chapter about the OAIS standard is focused on the Archival Storage, but it also describes the basic OAIS concepts required to understand the interactions of the Archival Storage with the rest of the system. To overcome the abstraction of the OAIS, real-world examples, often related to the ARCLib system, are provided. The following research of five different storage technologies ends with a textual summary pointing out which technology may be suitable for which type of archive, and also with two comparison tables. The terms used in the tables are defined to prevent ambiguity of their meaning.

The requirements on the first version of the Archival Storage originate from the base requirement, which is that the Archival Storage has to be ready to be integrated with the rest of the ARCLib system, whose modules are also in their first versions, supporting the main functions. The activity diagrams from Section 6.3 show how the Archival Storage implements the get and store functions fulfilling the main functional requirements of its first version. The methods are provided through the HTTP API used by the other ARCLib modules, and support for the Ceph and ZFS storages is implemented. As described in Subsection 4.3.7, the versioning of the whole AIP is out of the scope of the Archival Storage, and the versioning of the AIP XML is implemented and follows the rules described in Subsection 5.2.1. All these methods are also covered by 130 automated tests. The requirements on the first version of the Archival Storage are therefore fulfilled.

The author of this thesis designed and implemented the whole Archival Storage prototype, designed the whole Archival Storage and implemented the vast majority of its functionality. He also set up the whole testing environment, including one ZFS virtual machine and four virtual machines for the Ceph cluster. In addition to the Archival Storage, the author also designed and implemented many important components of the rest of the ARCLib system, for example the ARCLib Lightweight Directory Access Protocol authentication, authorization, the OAIS Finding Aids or the handling of incidents occurring in the Ingest module. The author was also present at all seven meetings organized by LIBCAS, where he presented the design and implementation outputs to the whole ARCLib project team and discussed the problem domain with experts in the LTP field to refine the analysis and requirements. Thanks to all of this, the author now has a deep, overall knowledge of the whole ARCLib system and a good general knowledge of the whole ARCLib project.

The Archival Storage, as well as the rest of the ARCLib system, are, at the time of writing, still in very active development. In addition to the functions required for the first version of the Archival Storage, support for changing the AIP state was added, so that an AIP can now acquire all the states defined in the object state diagram in Figure 5.3. Also, the basic API endpoints for managing the logical storages and the endpoint for retrieving the AIP state have recently been implemented. In the near future, authentication and authorization, reporting and auditing are going to be implemented.

The first version of the whole system will be tested from the summer of 2018. The development will continue, and in the fall of 2019 the system will be deployed at LIBCAS and verified in the form of partial operation. The ARCLib project is planned to end at the end of 2020.

A Source Code, Setup and Documentation References

∙ The source code of the prototype number 10 dedicated to the Archival Storage is available in the LIBCAS Git ARCLib-Prototypes repository: https://github.com/LIBCAS/ARCLib-Prototypes.

∙ The source code of the first version of the Archival Storage is available in the LIBCAS Git ARCLib-Archival-Storage repository: https://github.com/LIBCAS/ARCLib-Archival-Storage.

∙ The ID of the commit of the Archival Storage first version is: 9ff1caec3d8727271e7d50aa2e87fd3968a54b85

∙ The setup and run instructions for the Archival Storage module are described in the README.md file located in the Archival Storage repository.

∙ The Storage Service Gateway API documentation is in the api- doc.html file located in the Archival Storage repository.


Bibliography

[1] Jan Hutař and Marek Melichar. "The Long Decade of Digital Preservation in Heritage Institutions in the Czech Republic: 2002–2014". In: International Journal of Digital Curation 10.1 (Mar. 2015). issn: 1746-8256. url: http://www.ijdc.net/article/view/10.1.173 (visited on 05/17/2018).
[2] LTP-pilot website. url: http://ltp-portal.mzk.cz/ltp-pilot (visited on 04/14/2018).
[3] Andrea Miranda. Archivematica z hlediska normy OAIS. Report. CESNET, Oct. 2015. url: https://drive.google.com/file/d/0BzOLuOh094X8S1hPUWstZWIySTA/view (visited on 04/14/2018).
[4] Official ARCLib website. url: https://arclib.cz/ (visited on 04/14/2018).
[5] Brian Lavoie. "The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition)". In: DPC Technology Watch Report (Oct. 2014). issn: 2048-7916. url: https://www.dpconline.org/docs/technology-watch-reports/1359-dpctw14-02/file (visited on 03/09/2018).
[6] The Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (OAIS). Recommended practice. CCSDS 650.0-M-2. CCSDS Secretariat, June 2012. url: https://public.ccsds.org/Pubs/650x0m2.pdf (visited on 03/11/2018).
[7] Official CCSDS public website. url: https://public.ccsds.org (visited on 03/08/2018).
[8] Uwe Borghoff, Peter Rödig, Jan Scheffczyk, and Lothar Schmitz. Long-Term Preservation of Digital Documents. Principles and Practices. Springer-Verlag, June 2012. isbn: 3-540-33639-7.
[9] Official METS website. url: https://www.loc.gov/standards/mets/METSOverview.v2.html (visited on 03/24/2018).
[10] Erin O'Meara and Kate Stratton. Digital Preservation Essentials. Ed. by Christopher J. Prom. With an intro. by Kyle R. Rimkus. Trends in Archives Practice. Society of American Archivists, 2016. isbn: 1-931666-95-4.
[11] Preservation Storage Criteria, Version 2. url: https://goo.gl/1Q9vDe (visited on 04/21/2018).


[12] Preservation and Archiving Special Interest Group discussion. url: http://mail.asis.org/pipermail/pasig-discuss/2017-March/subject.html#471 (visited on 04/21/2018).
[13] Mary Baker et al. "A fresh look at the reliability of long-term digital storage". In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006. Association for Computing Machinery, Apr. 2006. isbn: 1-59593-322-0. url: https://lockss.org/locksswiki/files/Eurosys2006.pdf (visited on 04/21/2018).
[14] Lee Hibberd. Cloudy Culture: Preserving digital culture in the cloud. Part 4: Costs and tools. Report. The National Library of Scotland, Nov. 2017. url: https://www.dpconline.org/blog/cloudy-culture-part-4 (visited on 04/30/2018).
[15] Selena Larson. Pentagon exposed some of its data on Amazon server. Nov. 2017. url: http://money.cnn.com/2017/11/17/technology/centcom-data-exposed/index.html (visited on 04/22/2018).
[16] Karen Turner. Hacked Dropbox login data of 68 million users is now for sale on the dark Web. Sept. 2016. url: https://www.washingtonpost.com/news/the-switch/wp/2016/09/07/hacked-dropbox-data-of-68-million-users-is-now-or-sale-on-the-dark-web/?noredirect=on&utm_term=.074e5213a15d (visited on 04/22/2018).
[17] Official OpenZFS website. url: http://open-zfs.org (visited on 04/22/2018).
[18] Official ZFS on Linux website. url: http://zfsonlinux.org (visited on 04/22/2018).
[19] Dustin Kirkland. ZFS Licensing and Linux. Feb. 2016. url: https://insights.ubuntu.com/2016/02/18/zfs-licensing-and-linux (visited on 04/23/2018).
[20] OpenZFS documentation. url: https://pthree.org/category/zfs (visited on 04/22/2018).
[21] Sanoid's Github Repository. url: https://github.com/jimsalterjrs/sanoid (visited on 05/06/2018).
[22] Jim Salter. rsync.net: ZFS Replication to the cloud is finally here—and it's fast. Dec. 2015. url: https://arstechnica.com/information-technology/2015/12/rsync-net-zfs-replication-to-the-cloud-is-finally-here-and-its-fast (visited on 05/06/2018).


[23] Official Btrfs website. url: https://btrfs.wiki.kernel.org (visited on 04/22/2018).
[24] Michal Růžička. Btrfs vs ZFS – srovnání pro a proti. May 2016. url: http://www.abclinuxu.cz/blog/Drobnosti/2016/2/btrfs-vs-zfs-srovnani-pro-a-proti (visited on 04/27/2018).
[25] Official GlusterFS website. url: https://www.gluster.org (visited on 04/28/2018).
[26] Karthik Shiraly. Distributed File Systems and Object Stores on Linode (Part 1). GlusterFS. Feb. 2017. url: https://medium.com/linode-cube/distributed-file-systems-and-object-stores-on-linode-5f635178aad7 (visited on 04/29/2018).
[27] GlusterFS documentation. url: https://docs.gluster.org (visited on 04/28/2018).
[28] Official Ceph website. url: https://ceph.com (visited on 04/29/2018).
[29] Sage Weil. New in Luminous: Improved Scalability. Sept. 2017. url: https://ceph.com/community/new-luminous-scalability (visited on 04/29/2018).
[30] Java Librados library. url: https://github.com/ceph/rados-java (visited on 04/29/2018).
[31] Ceph documentation. url: http://docs.ceph.com/docs/master (visited on 04/29/2018).
[32] Walter Graf. Understanding a Multi Site Ceph Gateway Installation. White paper. Red Hat, Jan. 2017. url: https://ceph.com/wp-content/uploads/2017/01/Understanding-a-Multi-Site-Ceph-Gateway-Installation-170119.pdf (visited on 04/30/2018).
[33] Official Amazon Web Services website. url: https://aws.amazon.com (visited on 05/01/2018).
[34] Jordan Novet. Amazon lost cloud market share to Microsoft in the fourth quarter: KeyBanc. Jan. 2018. url: https://www.cnbc.com/2018/01/12/amazon-lost-cloud-market-share-to-microsoft-in-the-fourth-quarter-keybanc.html (visited on 04/30/2018).
[35] Giacinto Donvito, Giovanni Marzulli, and Domenico Diacono. "Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis". In: Journal of Physics: Conference Series 513.4 (2014). url: http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014 (visited on 05/02/2018).
[36] Loïc M. Roch, Tyanko Aleksiev, Riccardo Murri, and Kim K. Baldridge. "Performance analysis of open-source distributed file systems for practical large-scale molecular ab initio, density functional theory, and GW+BSE calculations". In: International Journal of Quantum Chemistry 118.1 (Jan. 2018). issn: 1097-461X. url: https://doi.org/10.1002/qua.25392 (visited on 05/02/2018).
[37] Red Hat. Red Hat Ceph Storage: Scalable object storage on QCT servers. Performance and sizing guide. Guide. June 2017. url: https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf (visited on 05/02/2018).
[38] Red Hat. Red Hat Gluster Storage on QCT servers. Performance and sizing guide. Guide. Aug. 2016. url: https://www.redhat.com/cms/managed-files/st-RHGS-QCT-config-size-guide-technology-detail-INC0436676-201608-en.pdf (visited on 05/02/2018).
[39] Jan Hutař, Andrea Miranda, Eliška Pavlásková, Zdeněk Vašek, and Zdeněk Hruška. Metodika logické ochrany digitálních dat. Certified methodology. Library of the Czech Academy of Sciences, 2018. url: http://hdl.handle.net/11104/0282107 (visited on 05/06/2018).
[40] Michal Růžička et al. "Metodika bitové ochrany". Working Version. May 2018.
[41] Library of the Czech Academy of Sciences. ARCLib – podrobnější technická specifikace předmětu plnění – funkční požadavky. 2016. url: https://www.tenderarena.cz/profil/zakazka/detailVerzeDokumentu.jsf?id=82214&idDokumentu=756005&idVerze=806283 (visited on 05/06/2018).
[42] ARCLib-Prototypes Git repository of LIBCAS. url: https://github.com/LIBCAS/ARCLib-Prototypes (visited on 05/08/2018).
[43] Šimon Hochla. "Implementation of Open Archival Information System". Master's Thesis. Faculty of Informatics at Masaryk University, Dec. 2017.
