Bachelor Thesis
Efficient dataset archiving and versioning at large scale
Bernhard Kurz

Subject Area: Information Business
Degree Programme Code (Studienkennzahl): 033 561
Supervisor: Fernández García, Javier David, Dr.
Co-Supervisor: Neumaier, Sebastian, Dipl.-Ing., B.Sc.
Date of Submission: 23 June 2018

Department of Information Systems and Operations, Vienna University of Economics and Business, Welthandelsplatz 1, 1020 Vienna, Austria

Contents

1 Introduction
  1.1 Motivation
  1.2 Outline of research

2 Requirements
  2.1 Archiving and Versioning of datasets
  2.2 Performance and Scalability

3 Background and Related Work
  3.1 Storage Types
  3.2 Version Control Systems
    3.2.1 DEC's VMS
    3.2.2 Subversion
    3.2.3 GIT
    3.2.4 Scalable Version Control Systems
  3.3 Redundant Data Reduction Techniques
    3.3.1 Compression
  3.4 Deduplication
    3.4.1 Basic Workflow
    3.4.2 Workflow Improvements
    3.4.3 Key Design Decisions

4 Operating Systems and Filesystems
  4.1 Definitions
    4.1.1 Operating System
    4.1.2 File System
    4.1.3 Data Access
  4.2 Classic vs Modern Filesystems
  4.3 ZFS
    4.3.1 Scalability
    4.3.2 Virtual Devices
    4.3.3 ZFS Blocks
    4.3.4 ZFS Pools
    4.3.5 ZFS Architecture
  4.4 Data Integrity and Reliability
    4.4.1 Replication
  4.5 Transactional Semantics
    4.5.1 Copy-on-Write and Snapshots
  4.6 Btrfs
  4.7 NILFS
  4.8 Operating Systems and Solutions

5 Approaches
  5.1 Version Control Systems
  5.2 Snapshots and Clones
  5.3 Deduplication

6 Design of the Benchmarks
  6.1 Datasets
    6.1.1 Portal Watch Datasets
    6.1.2 BEAR Datasets
    6.1.3 Wikipedia Dumps
  6.2 Testsystem and Methods
    6.2.1 Testing Environment
    6.2.2 Benchmarking Tools
    6.2.3 Bonnie++
  6.3 Conditions and Parameters
    6.3.1 Deduplication
    6.3.2 Compression
    6.3.3 Blocksize

7 Results
  7.1 Performance of ext4
  7.2 Performance of ZFS
  7.3 Deduplication Ratios
    7.3.1 Portal Watch Datasets
    7.3.2 BEAR Datasets
    7.3.3 Wikipedia Dumps
  7.4 Blocksize Tuning
  7.5 Compression Performance
    7.5.1 Compression Sizes
    7.5.2 Compression Ratios

8 Conclusion and further research

9 Appendix

List of Figures

1 Redundant Data Reduction Techniques and approximate dates of initial research[92]
2 Overview of a Deduplication Process
3 Block-based vs object-based storage systems[42]
4 Traditional Volumes vs ZFS Pooled Storage[12]
5 ZFS Layers[12]
6 Disk usage of datasets in GB per month
7 Bonnie Benchmark for ext4
8 Block I/O for ZFS
9 Block I/O for ext4 compared to ZFS
10 Deduplication Table for the Portal Watch Datasets
11 Block I/O for ZFS with Blocksize=4k and Compression or Deduplication
12 Block I/O for ZFS with Blocksize=128k and Compression or Deduplication
13 Block I/O for ZFS with Blocksize=1M and Compression or Deduplication
14 Compression sizes in GB for the Datasets
15 Compression Ratios for the Datasets

List of Tables

1 BEAR Datasets Compression Ratios
2 Wikipedia Dumps Compression Ratios
3 Compression Ratios for different Compression Levels per Dataset

Abstract

As the amount of generated data is increasing rapidly, there is a strong need to reduce storage costs. In addition, the requirements for storage systems have evolved, and archiving and versioning data have become a major challenge. Modern filesystems target key features like scalability, flexibility and data integrity. Furthermore, they provide mechanisms for reducing data redundancy and for version control. This thesis discusses recent related research fields and evaluates how well archiving and versioning can be integrated into the filesystem layer to reduce administration overhead. The effectiveness of deduplication and compression is benchmarked on several datasets to assess scalability and feasibility. Finally, a conclusion and an overview of further research are given.

1 Introduction

Because the amount of data is growing exponentially, storage costs are rising as well. On the one hand, prices for hard disk drives have fallen; on the other hand, requirements have increased, so storing large amounts of data can still be expensive. In computing, file systems are used to control how data is handled and stored on at least one physical device.

"Some of today’s most popular filesystems are, in computing scale, ancient. We discard hardware because it’s five years old and too slow to be borne—then put a 30-year-old filesystem on its replacement. Even more modern filesystems like extfs, UFS2, and NTFS use older ideas at their core." [56]

Classic filesystems have become less practical at fulfilling these evolved demands. In many cases it is easier to create duplicates than to organize files or versions. For archiving, there is a demand for version histories, to "look back" in time and be able to recreate older versions of documents or other kinds of data. Due to the nature of full backups, they allocate a lot of storage even if there have been only slight changes. To save disk space, incremental backups are used, which can make recovering older files very complex and laborious. This also has a big impact on performance, and it is difficult to manage versions. Another option would be application-based version control systems like GIT, but they are slow and problematic when it comes to large files, scalability, or non-text and binary files.[9, 47] Another weakness of classic filesystems are size limits and partition caps, which may still be sufficient for the next five to ten years, but

thinking at a larger scale means that they will have to be revisited and adapted in the future. Besides, there is the so-called bitrot, also known as silent data corruption.[6] Bitrot is problematic because it is mostly invisible and can cause serious data inconsistency. Modern filesystems like ZFS and Btrfs deal with these problems. To counter this phenomenon, storage systems use levels of redundancy (RAID) combined with checksums. Overall, it will be tested how well these requirements can be fulfilled within the filesystem layer. The aim of this thesis is to compare these features and find a plausible solution for the specific use case described below.

1.1 Motivation The Department of Information Systems and Operations at the Vienna University of Economics and Business is hosting the Open Data Portal Watch. The project's aim is to provide a "scalable quality assessment and evolution monitoring framework for Open Data (Web) portals."[91] To that end, datasets from around 260 web catalogues are downloaded and saved to disk each week. To reduce storage costs, datasets are hashed and only saved if they have changed. Otherwise, they are logged in a database to enable access to older versions and the datasets' histories. In practice, a change can mean adding just one line or removing a few rows from a dataset. That means there can be thousands of duplicates or, strictly speaking, near-duplicates. Since this is, simply put, deduplication on a file basis, we wanted to test how much additional gain chunk-based deduplication provides.
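The following is a minimal Python sketch of this file-level approach. It is illustrative only: the directory layout, the in-memory seen_hashes index and the log structure are assumptions for the example, not the actual Portal Watch implementation.

    import hashlib
    import pathlib
    import time

    def file_fingerprint(path: pathlib.Path) -> str:
        """SHA-256 over the whole file; byte-identical downloads get the same digest."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def archive_if_changed(download: pathlib.Path, archive_dir: pathlib.Path,
                           seen_hashes: dict, history_log: list) -> bool:
        """Store the downloaded dataset only if its hash is new; always log the
        version so older states can be looked up later."""
        digest = file_fingerprint(download)
        changed = digest not in seen_hashes
        if changed:
            target = archive_dir / digest
            target.write_bytes(download.read_bytes())   # keep exactly one copy per content
            seen_hashes[digest] = str(target)
        history_log.append({"dataset": download.name, "sha256": digest,
                            "fetched": time.strftime("%Y-%m-%d"), "stored": changed})
        return changed

Two byte-identical downloads are stored only once, but a dataset that differs by a single row gets a new digest and is stored in full; this is exactly the gap that chunk-based deduplication (section 3.4) targets.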

1.2 Outline of research The scope of this thesis includes a literature review of modern filesystems like ZFS, Btrfs and the versioning filesystem NILFS. They will be compared with features known from version control systems like GIT and Subversion as well as with deduplication. Particular attention will be paid to the scalability and performance of these implementations. In conclusion, the following features will be reviewed and compared to each other: Targeted Factors

• Saving Disk Space
• IO Performance

Required Factors

• Archiving and Versioning

• Performance and Scalability

Additional Factors

• Hardware Requirements
• Compatibility
• Reliability

Approaches

• Version Control Systems
• ZFS Snapshots and Clones
• ZFS Deduplication

A model showing the best ratio between the two targeted factors would be a possible way to structure and visualize the results. These two factors can be derived from the requirements. It may be difficult to reduce all factors to such a model without losing too much information. The important part is the testing and benchmarking of the approaches. Due to the nature of filesystems and the importance of data consistency, they should be tested over a long period of time to satisfy the demands and to eliminate as many errors as possible. It will also be discussed how well the specific filesystems have been documented and tested.

2 Requirements

Important for the Open Data Portal Watch are the archiving and versioning of datasets and the access to them. On the one hand, this is similar to a backup scheme for user data; on the other hand, it is strictly speaking primary storage because it requires performant read access. For backups, data is often compressed and deduplicated to save space, but for active storage this can be too slow. The storage system in question uses spinning disks. Hard disk drives have a very low rate of I/O operations per second compared to solid state disks.[44] Also, the access pattern is random, because datasets are added from time to time and are not written sequentially. In order to ensure performance standards, the proposed solutions will be benchmarked and tested. The test data will be adjusted to the requirements of the use case to guarantee a basic level of scalability.

2.1 Archiving and Versioning of datasets The main requirement is to archive the history of datasets and to be able to access them. Archiving data means storing older and infrequently used files in a systematic way with the goal of reducing primary storage costs. In contrast to backups, archives are not copies of datasets and do not target data recovery. Nevertheless, they should protect data and may be accessed occasionally. This is similar to the function of a library, which focuses on preserving information and tries to archive as much of it as possible in a systematic way. Regarding file versioning, in many cases it is sufficient to keep older versions read-only and to refuse modifications. In the case of the Open Data Portal Watch, the read performance has to be adequate, but after crawling the data there is a time frame to organize the datasets and to archive them. That means writing them can take some time. Nevertheless, it is important that older file versions are accessible without too high recreation costs and without losing too much I/O performance, as will be discussed in the following subsection.

2.2 Performance and Scalability As mentioned above, it is important that the performance is sufficient. This may be easy for small amounts of data, but the solution has to scale to big datasets and a long file history. Assuming there is a base file and each modification is written as a delta file, there are basically two options. The first is to relate every delta file to the base file, which may not be the best choice in terms of saving disk space. The other possibility is to relate every incremental file to the previous one, which keeps storage low but may cause a high CPU load when a version has to be recreated by applying a long chain of deltas; a small illustrative sketch of this tradeoff follows below. It will be discussed later how the version control system GIT, the versioning file system NILFS and copy-on-write filesystems like ZFS or Btrfs solve this problem. The goal of this thesis is to find the best ratio between performance, scalability and saving primary storage.
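As a rough illustration of the two options, the following Python sketch compares a hypothetical cost model for "deltas against the base" and "deltas against the previous version". The delta sizes and the growth factor are invented, illustrative parameters, not measurements of any of the systems discussed later.

    # Toy model of the storage-recreation tradeoff described above.

    def chain_cost(n_versions, delta_size=1.0):
        """Each delta refers to the previous version: small deltas, but the
        latest version needs n-1 delta applications to be recreated."""
        storage = (n_versions - 1) * delta_size
        recreation_of_last = n_versions - 1
        return storage, recreation_of_last

    def base_cost(n_versions, delta_size=1.0, drift=0.2):
        """Each delta refers to the base version: one application recreates any
        version, but deltas grow as versions drift away from the base
        (drift is an assumed, illustrative growth factor)."""
        storage = sum(delta_size * (1 + drift * (k - 1)) for k in range(1, n_versions))
        recreation_of_last = 1
        return storage, recreation_of_last

    for n in (10, 100):
        print(n, "versions -> chain:", chain_cost(n), " base:", base_cost(n))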

3 Background and Related Work

In this section, we provide an overview of the different approaches and compare them subsequently. We will start with general storage aspects and types, then move to specific filesystems like ZFS, Btrfs, and NILFS and to version control systems like Git and Subversion, then cover features like deduplication and compression, and lastly discuss the operating principles of transactional semantics, in particular copy-on-write and snapshots.

Filesystems A simple desktop computer normally has one hard drive or solid state disk in it, but when it comes to larger and often distributed systems, the storage should be planned and scaled adequately. Therefore, the intended use and expected workload should be considered. For accessing and storing data in a reasonable way, a filesystem is needed. It ensures that the bits and blocks are stored to and retrieved from their correct physical places. In general, storage can be grouped by the type of access, commonly known as DAS, NAS and SAN1. Clients and personal computers normally use DAS and possibly NAS, whereas servers often use block devices delivered from a SAN.[75, 87, 31] For this thesis, we assume that the relevant techniques are performed in the filesystem layer of the server itself. The data may be attached via network protocols, but it is not provided block-wise or used in a distributed fashion. This means the data is saved and delivered via NFS2. The three filesystems (ZFS, Btrfs, NILFS) covered in this thesis will be analyzed further in the following and the next section.

3.1 Storage Types Overall, storage can be categorized into two groups: either it is actively used or it is not. In this thesis, we define main storage as primary storage, because it is the storage that the server or the client actively uses. If the storage is used for archiving or backup purposes, it is often not primarily used and we define it as secondary storage. As performance is also a very important parameter, we can further distinguish between systems with mainly read and systems with mainly write access. In a simplified way, storage can be classified as follows:

1. Primary Storage ("actively used")

(a) Storage with read and write access

1DAS = Direct-Attached-Storage, NAS = Network-Attached-Storage, SAN = Storage-Area-Network 2NFS = Network File System

(b) Storage with mainly read access
(c) Storage with mainly write access
(d) Storage with less frequent read and write

2. Secondary Storage ("not actively used")

(a) Storage with read and write access
(b) Storage with mainly read access
(c) Storage with mainly write access
(d) Storage as data archive

The terms archiving and backup are not clearly defined. The aim of archiving is to save data that is not needed permanently but has to be stored in a reliable way. The reasons may be legal regulations, business-related knowledge or personal preferences. Often archives should preserve the data and protect it against any kind of change. There are different threats to data, which will be discussed in section 4.4. Indeed, the main reason may simply be that people want to keep their data somewhere. According to a study by Wakefield Research, 47% of office workers would rather give up their clothing than their data, and 63% consider themselves data hoarders.[3] The goal of a backup is to have a preferably recent copy of the data in case of any incident. If an accident happens, the data can be restored. How often a backup should be done depends on the importance of the data. An automated backup, a minimum of three copies, two different storage media, and two different places are a good start and practicable nowadays. For the given datasets we may have a temporal mixture of primary and secondary storage. First of all, we need permanent read access and, from time to time, write access. Updates to datasets are written periodically and they have to be accessible. Strictly speaking, this may be primary storage, but on the other hand the scraped datasets should be archived and protected from changes. Therefore, writing the data can take some time, but too high a compression may make the read I/O performance too slow. Additionally, the datasets are often already compressed, and recompressing them may be an unnecessary workload for the server.

3.2 Version Control Systems Version Control Systems3 are commonly used to keep track of changes. Software developers or document writers can commit their changes to the VCS,

3VCS, also known as Revision Control Systems

push them to a remote server and easily collaborate with co-workers and team members. These systems also provide the possibility to look back in time and view or restore older versions. So if anything goes wrong, the code can be restored to the last committed known working state. In their early years they started with local revision control, but over time version control systems improved; nowadays these systems use a client/server model and are often used in a distributed fashion. That means collaboration is possible without a central repository.[47]

3.2.1 DEC's VMS Revision control can be traced back to DEC's VMS operating system.[47] This system never deletes files but rather attaches a version number to each file, resulting in a large number of created files. Overall,

"the system versioned files but did not provide version control."[47]

This is a very interesting approach because this procedure is quite similar to the version control of our given Open Data Portal case. It categorizes files by their timestamps but does not per se provide well-managed revision control.

3.2.2 Subversion Subversion improved many features of its predecessors and implemented support for atomic commits similar to transactional database systems. This means that if a commit fails, the system is resilient enough to reject the changes and keep the repository consistent and collision-free. Binary file support was also introduced.[47]

3.2.3 GIT With the demand for better collaboration and better failure safety, the trend moved away from a central repository and the server/client model towards distributed version control systems. These systems create copies of the whole repository locally and make it possible to work on a project even without a permanent internet connection. The main advantage of these VCS is that changes are only tracked once they are committed. That allows people to revise drafts locally without communicating them to all other contributors. One of the most important distributed version control systems is GIT. It was pushed by Linus Torvalds for the development of the Linux kernel. Overall, distributed version control is much better for big projects and many

different independent workers and coders. A disadvantage, however, is that it is less intuitive from the user's point of view.[54, 47]

3.2.4 Scalable Version Control Systems For the given use case and the related datasets, a version control system would be a good way to reduce storage costs. There is a blog entry from Amol Deshpande's research group about the failure of Git and SVN to manage dataset versions.[21] As mentioned there, version control systems were designed for source code control and are limited by the design assumption that only small text files are stored and versioned. All this leads to suboptimal behavior for large datasets, and revision control systems lack querying functionality, which has become important for data analysis.[21] According to Bhattacherjee et al., the fundamental challenge is the storage-recreation tradeoff, which means that

"the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions."[9]

This is always the case when it comes to data reduction technologies. To counteract it, the algorithms have to be optimized and improved. There is DataHub, which was specifically designed as a Dataset Version Control System (DVCS) to provide a scalable team collaboration platform for analyzing and querying large datasets.[8, 16]

Cloud service providers like Dropbox or OneDrive offer online storage for a broad range of clients, so there is a demand for a variety of features. Both services offer the possibility to restore previous versions, also called version history.4 5 It would be interesting to know how this feature is implemented. For Dropbox there is a research paper, but it focuses on the operating principle of the protocol.[24] Due to their closed-source nature, it is not possible to learn how the versioning is realized.

3.3 Redundant Data Reduction Techniques Redundancy is important when it comes to data integrity, reliability, and availability. But disorganized duplicates cost storage and provide very little benefit.

4https://www.dropbox.com/help/security/recover-older-versions 5https://www.microsoft.com/en-us/microsoft-365/blog/2017/07/19/expanding-onedrive-version-history-support-file-types/

Additionally, the amount of digital data is growing rapidly. According to a study by IDC, the digital universe will grow from 130 exabytes in 2005 to 40 000 exabytes6 in 2020.[29] It has become an important challenge to manage storage in an efficient way. According to case studies conducted by Microsoft Research and a research group at the Johannes Gutenberg University in Mainz, deduplication can save up to 87 percent for backup images[62] and 30 to 70 percent of the data in primary storage systems.[59] Eliminating redundant data will, on the one hand, reduce and save storage space and, on the other hand, also decrease transfer times on low-bandwidth networks. Data reduction techniques have a long history. We will provide an overview of the history and of different approaches to reduce redundant data.

Figure 1: Redundant Data Reduction Techniques and approximate dates of initial research[92]

3.3.1 Compression "The goal of data compression is to represent an information source (e.g. a data file) as accurately as possible using the fewest number of bits"[92]

Basically, compression can be divided into two categories. Lossy compression removes unnecessary information, making it irretrievable, in contrast to lossless compression, which uses algorithms to identify and eliminate redundancy.[92] For example, JPEG or MP4 use lossy compression. For lossless compression there are many different algorithms, such as GZIP[22] and LZW[64]. The early algorithms date back to the 1950s, with Huffman coding and arithmetic coding following a statistical, model-based approach on a byte level. The development continued with model-based coding on a string level with different LZ algorithms.[92] Since

6 40 000 exabytes = 40 trillion gigabytes

the 1980s, variants of LZ compression have arisen that either improved compression ratios or accelerated the compression process. In the 1990s, data compression started to become more intelligent, as delta compression was suggested to achieve better compression on similar files or similar chunks. This is widely used in source code version control, remote synchronization and backup systems.[84, 86, 13] With increasing computing power, deduplication became more usable and an interesting field of research.
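As a small, self-contained illustration of the lossless case (independent of the tools benchmarked later), the following Python snippet uses the DEFLATE implementation in the standard zlib module to show how higher compression levels trade speed for ratio. The synthetic, CSV-like input is an assumption for the example.

    import time
    import zlib

    # Synthetic, repetitive input: CSV-like rows, loosely resembling tabular open data.
    rows = b"".join(b"2018-06-23;portal-%d;dataset;42\n" % (i % 500) for i in range(200_000))

    for level in (1, 6, 9):                 # DEFLATE levels: fast ... best
        start = time.perf_counter()
        compressed = zlib.compress(rows, level)
        elapsed = time.perf_counter() - start
        ratio = len(rows) / len(compressed)
        print(f"level {level}: ratio {ratio:5.1f}, {elapsed * 1000:6.1f} ms")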

3.4 Deduplication Deduplication aims to find redundancy and eliminate it by removing duplicates of data. The early approach was to do so on a file level, but that technique has largely been superseded because chunk-based concepts achieve better reduction.[62, 70, 100, 58] Two of the early adopters were the low-bandwidth network file system (LBFS) and Venti. LBFS used variable, content-based chunking, whereas Venti used fixed-sized chunks. Both of them used SHA1 for computing the fingerprints, which was accepted and considered cryptographically strong enough for a long time.[72, 63] Even though SHA1 has become obsolete because it is no longer collision-proof[60, 80], the probability of two different chunks with the same hash value is lower than the probability of hardware bit errors.[37, 10] Modern deduplication systems prefer stronger hash algorithms like SHA256, as used by ZFS and Dropbox.[24, 56] ZFS even offers the possibility to compare chunks bit by bit, eliminating all doubts about a collision. Generally, there are two ways to create chunks: they can be either fixed-sized or variable-sized. The first approach is quite simple, but it may fail to recognize duplicates because of the well-known boundary-shift problem: if small changes (e.g. an insertion or deletion) shift the boundaries in the data stream, subsequent chunks may fail to deduplicate.[92, 7] The second approach, variable-sized chunking or content-based chunking, divides files into variable-sized chunks based on the content itself. As this largely solves the boundary-shift problem, it is mainly used today.[92] Data characteristics like the type of data and the frequency with which data changes have a high impact on the effectiveness.[25] Deduplication is considered effective for backups and is also becoming more popular for primary storage.[55, 26, 35, 69, 99]
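The following Python sketch contrasts the two chunking strategies on synthetic data. The window, mask and size limits are arbitrary illustrative choices, and the boundary test is a deliberately naive stand-in for the Rabin-style rolling hashes used by real systems such as LBFS; it only demonstrates why content-defined boundaries re-align after a small insertion while fixed boundaries do not.

    import hashlib
    import random

    def fixed_chunks(data, size=4096):
        """Fixed-size chunking: cut every `size` bytes."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def cdc_chunks(data, window=48, mask=0x1FF, min_size=512, max_size=8192):
        """Naive content-defined chunking: declare a boundary whenever the hash of
        the last `window` bytes matches a bit pattern. Boundaries depend only on
        local content, so they re-align after insertions or deletions."""
        chunks, start, n = [], 0, len(data)
        for i in range(n):
            length = i - start + 1
            if length >= min_size:
                digest = hashlib.sha1(data[i - window + 1:i + 1]).digest()
                fingerprint = int.from_bytes(digest[:4], "big")
                if (fingerprint & mask) == 0 or length >= max_size:
                    chunks.append(data[start:i + 1])
                    start = i + 1
        if start < n:
            chunks.append(data[start:])
        return chunks

    def duplicate_fraction(old_chunks, new_chunks):
        """Fraction of bytes in the new version whose chunks already exist."""
        known = {hashlib.sha256(c).digest() for c in old_chunks}
        dup = sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in known)
        return dup / sum(len(c) for c in new_chunks)

    random.seed(42)
    old = bytes(random.getrandbits(8) for _ in range(200_000))
    new = old[:50_000] + b"one inserted line\n" + old[50_000:]   # small edit

    print("fixed-size :", duplicate_fraction(fixed_chunks(old), fixed_chunks(new)))
    print("content-def:", duplicate_fraction(cdc_chunks(old), cdc_chunks(new)))

With fixed-size chunks, only the chunks before the insertion deduplicate, because every later boundary has shifted; with content-defined chunks, the boundaries re-synchronize shortly after the edit and most of the file deduplicates again.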

3.4.1 Basic Workflow

Figure 2: Overview of a Deduplication Process

The workflow of deduplication is illustrated in figure 2 and consists of chunking and fingerprinting, indexing, optionally compressing, and storing.[57, 92] Referring to Xia et al.[92], there are eight considerations, which can be optimized or implemented in various ways. As outlined in their paper, several countermeasures are available for each of the mentioned issues. Since their number is large, they will not be covered comprehensively here and we refer to the paper itself.[92]
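A compact Python sketch of this pipeline is shown below. The fixed chunk size and the in-memory dictionaries are simplifying assumptions; real systems use content-defined chunking, persist the index and usually compress unique chunks before storing them.

    import hashlib

    CHUNK_SIZE = 128 * 1024        # fixed-size chunking keeps the example short

    def store(data: bytes, chunk_store: dict) -> list:
        """Chunk -> fingerprint -> index -> store. Returns the 'recipe': the list
        of fingerprints needed to rebuild the original byte stream."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()    # fingerprinting
            if fp not in chunk_store:                 # index lookup
                chunk_store[fp] = chunk               # store only unique chunks
            recipe.append(fp)
        return recipe

    def restore(recipe: list, chunk_store: dict) -> bytes:
        return b"".join(chunk_store[fp] for fp in recipe)

    chunks = {}
    v1 = b"A" * 300_000
    v2 = v1 + b"B" * 10_000                           # second version, small change
    r1, r2 = store(v1, chunks), store(v2, chunks)
    assert restore(r2, chunks) == v2
    stored = sum(len(c) for c in chunks.values())
    print(f"logical {len(v1) + len(v2)} bytes, stored {stored} bytes")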

3.4.2 Workflow Improvements In "A Comprehensive Study of the Past, Present, and Future of Data Deduplication"[92] the authors have defined the following eight points, which can be used to improve the deduplication workflow:

(1) Detecting Redundancy The first problem is how to design an algorithm that detects as much redundancy as possible in a data stream. As discussed before, there is fixed-sized and variable-sized chunking, which has been studied in detail. The three main evaluation criteria are "(1) space savings, (2) performance (throughput and latency), and (3) resource usage (disk, CPU, and memory). All three metrics are affected by the data used for the evaluation and the specific hardware configuration."[82] Three main taxonomies for content-defined chunking7 are presented in detail, ranging from restrictions on the min/max chunk size, over improving the speed of CDC, to re-chunking non-deduplicated chunks.[92] This is also the subject of recent research.[89, 94]

7CDC = content-defined-chunking

(2) Accelerating the Process As all of these steps consume a lot of computing power, the next improvements can be made by accelerating them. State-of-the-art approaches rely on multicore-based and GPU-based acceleration.[92, 30, 46]

(3) Memory Considerations With growing data volumes, the indexing of the fingerprints has to be considered, as the index may overload the main memory. Indexing has been categorized into four taxonomies, starting with locality- and similarity-based approaches[65], which provide their benefits mostly on hard disk drives. There, fingerprint lookups are limited by expensive disk seeks to around 100 IOPS8, compared to about 100 000 IOPS on solid-state drives.[20] As the index can be stored on faster drives, flash-based approaches have become common. The fourth approach deals with more than one node and is called cluster deduplication. In summary, all of these approaches are effective in improving indexing performance, depending on the respective setup.[92] The costs of deduplication have been studied in various research projects.[34]

(4) Compression Even if deduplication eliminates most redundancy, further compression should also be considered as an improvement.[93, 92, 19] This is also an available feature of ZFS.

(5) Restoring Data As data streams get reduced and compressed, restoring data might be challenging, since the loss of a few bits can have catastrophic consequences. Deduplication systems can be classified by their target environment: primary storage systems, backup storage systems and cloud storage systems.[92]

(6) Garbage Collection Changes in the data stream have to be processed entirely, so garbage collection becomes necessary. That means that if a file is deleted, the affected chunks have to be detected and, once they are no longer referenced, removed from the metadata and the deduplication table.
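A hypothetical reference-counting sketch in Python illustrates the idea; real deduplication tables additionally store block locations, are persisted on disk, and often defer deletion to a background garbage-collection pass.

    import hashlib

    class DedupTable:
        """Maps chunk fingerprints to (chunk, reference count). A chunk is only
        physically removed once no stored file references it any more."""

        def __init__(self):
            self.entries = {}                  # fingerprint -> [chunk, refcount]

        def add_file(self, name, chunks, catalog):
            recipe = []
            for chunk in chunks:
                fp = hashlib.sha256(chunk).hexdigest()
                entry = self.entries.setdefault(fp, [chunk, 0])
                entry[1] += 1                  # one more reference to this chunk
                recipe.append(fp)
            catalog[name] = recipe

        def delete_file(self, name, catalog):
            for fp in catalog.pop(name):
                entry = self.entries[fp]
                entry[1] -= 1
                if entry[1] == 0:              # garbage collection: last reference gone
                    del self.entries[fp]

    ddt, catalog = DedupTable(), {}
    ddt.add_file("week1.csv", [b"header", b"rows-1"], catalog)
    ddt.add_file("week2.csv", [b"header", b"rows-2"], catalog)
    ddt.delete_file("week1.csv", catalog)
    print(len(ddt.entries))    # 2: "header" survives (still referenced), "rows-1" was collected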

(7) Privacy Concerns When it comes to security, there are a few potential privacy risks and also proven attacks that can lead to leaked information. This is a very recent research field for cloud storage, which also includes deduplication on encrypted storage.[71, 95, 36] There is convergent encryption, which derives the encryption key from a hash of the

8IOPS = Input/Output Operations per Second

data itself, so that identical chunks remain deduplicable after encryption.[92] Xia et al. present two taxonomies of vulnerabilities. First, side-channel attacks aim to reveal private information about deduplicated data streams. Additionally, proofs of ownership may also be a privacy concern. These attacks are primarily an issue for cloud storage.[92]

(8) Reliability The last problem is similar to (5) Restoring Data: it deals with reliability and with creating redundancy to store data in a safe way.

3.4.3 Key Design Decisions According to Paulo et al.[68] there are

"six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope."[68]

Storage environments typically have specific requirements. For example, archived data targets immutability and reliability. Further, latency is considered less important than throughput, because data has to be restored infrequently but archived within a limited time window.[68] Deduplication on primary storage has different requirements, as it involves active data: I/O performance may be the critical path, and read requests are obviously more frequent. This leads to a tradeoff between deduplication gain and metadata overhead. As the key design decisions may overlap with the workflow improvements of the previous section, we will point out the differences and similarities.

Granularity is about the method of creating chunks. The finer the granularity, the higher the deduplication gain.[68]

Locality can be used to support caching strategies and has already been mentioned above.

Timing is the third decision. There are two possible options: inline and offline deduplication. The first, also known as in-band deduplication, is performed while the files are written to disk. Its counterpart, also known as out-of-band deduplication, is performed manually or on a defined schedule. For primary storage systems, inline deduplication is strongly preferred, because work would otherwise accumulate.[68]

Indexing For comparing files or chunks, deduplication systems use cryptographic hash algorithms to speed up the comparison. These fingerprints are stored in a global table, similar to a database with unique values. This table is often called the deduplication table9. The decision about the technique deals with aliasing and delta encoding. Aliasing means chunk-based deduplication, where I/O requests are redirected to the already stored chunks.[68] As it requires less processing power and also offers faster recreation times, this approach is considered more efficient. Delta encoding has to deal with the tradeoff between storage costs and recreation costs. This is similar to the approaches covered in the work of Fernández et al.[28] As both techniques have to store metadata, an object-based storage system is strongly preferred, due to the abstraction from the physical to the logical view.[68]

9DDT = deduplication table

Scope The last key decision targets the issues of distributed storage systems. As scalability eventually leads to distributed systems once the limits of a single system are reached, this is an important aspect to think about.[68]

4 Operating Systems and Filesystems

In this section, an overview of operating systems and filesystems will be given. As the topic is very wide, not all aspects of it can be covered. We will focus on the main aspects and functionalities. We will start with some basic aspects of filesystems, continue with some historical background, and finally discuss the main subject, the generation of modern filesystems. This will, as mentioned before, cover useful features and techniques which make managing storage much easier. Overall, we have picked three filesystems that will be analyzed and compared for differences. All of them offer similar features, but they may be implemented differently. For the exact functionality, we will provide references for further reading.

4.1 Definitions It is not easy to specify exact definitions of the terms. This is due to the fact that nearly all of them are very complex and broad areas and, as a consequence, every definition will be rather conceptual and somewhat inaccurate.

4.1.1 Operating System On the one hand, operating systems (OS) provide a software layer to manage hardware and resources efficiently; on the other hand, they try to hide the complexity of hardware by serving simple abstractions to the user. One of the most important abstractions is the file. If a user wants to save and restore data in a uniform and permanent way, he can use files.[81, 43] The user only has to think about his favorite text editor (or the program that produces the files). This is a way of creating layers to reduce complexity and abstract it to a simple and understandable way of working. Simplified, we can say that the user communicates with the application, the application with the operating system and the operating system with the hardware. Because communication with the hardware can be a big security and privacy problem, most operating systems can run services in kernel/supervisor mode/space or in user mode/space.[43]

Kernel Running a service in the kernel means full control over the hardware, but in case of bugs or malfunction it may cause critical damage to the whole system. This can mean a system crash, corrupted files or an opened security vulnerability. Services that are executed in user mode have restricted access to system resources, which makes them less dangerous.[4] Today most operating systems support both modes. To name popular personal computer OS, there are Windows, GNU/Linux, and MacOSX, which provide a kernel and a user mode. In contrast, embedded systems, which are more often used in the industrial sector, commonly may not have a kernel mode.[43] State of the art is that operating system structures follow either a monolithic or a microkernel approach. In the first concept, all system services are executed in kernel mode, whose benefit is high performance. As a disadvantage, they are insecure, the services are not isolated from each other and they have low reliability.[81, 4] Common examples of monolithic kernel-based operating systems are GNU/Linux, FreeBSD, and OpenSolaris. The second concept is microkernel-based and follows the philosophy that only very few services should run in kernel mode. All other processes and services are moved to and executed in user mode. The advantage of user mode is that every service is isolated from the others and bugs have less impact. As a very simplified example, a service in kernel mode could overwrite the first e.g. 100 MB of a device, destroying the partition layout table and thus the whole disk, whereas a userspace service attempting to overwrite these bytes would be rejected. Nevertheless, there are many scenarios in which services need to communicate with each other. This is realized with Interprocess Communication (IPC). While this is fast and

cheap in terms of overhead for monolithic kernels, it is very complex and expensive for microkernels. Most operating systems nowadays use monolithic kernels, but often they do not use pure monolithic approaches.

4.1.2 File System One of the most important abstractions and interactions for the user is the file. As creating a file would require kernel access rights, there is a dedicated software layer that is responsible for dealing with files. It is called the file management system or the file system.[43]

4.1.3 Data Access NFS is a distributed file system protocol and was developed by Sun Microsystems in 1984. It allows a client to access files over the local network as if they were local. The counterpart on Windows is SMB10, which offers similar options. NFS is defined as an open standard and anyone can implement the protocol, which is published as a Request for Comments.[79] Over the years many security and general improvements, as well as features, have been added. iSCSI stands for Internet Small Computer Systems Interface and is an Internet Protocol-based storage networking standard. It delivers SCSI commands over the network and provides block-level access to storage devices. As it is an open standard, it is likewise published as a Request for Comments.[76]

4.2 Classic vs Modern Filesystems There are many traditional filesystems like FAT32, NTFS, ext4 or UFS. Most of them are very old; some are nearly as old as computers themselves. At the time of their creation, computers used slow hard disk drives, and the filesystems were designed to fit and run on any hardware. We rely on them because they have worked for a long time and they just work, and we want to save our important files in a reliable way. Even if traditional file systems are able to handle this, they have bottlenecks regarding scalability, support for distributed use, long-term preservation and access speed, as well as problems with handling hardware failure. Overall, they do not cover all of these aspects and there are many possible improvements. Modern filesystems[41] like ZFS, Btrfs and NILFS are designed to think further, provide many advantages and use the computing power available nowadays. To explain the different techniques and

10SMB = Server Message Block

advantages of a modern filesystem, we will use ZFS as an example. As Btrfs and NILFS have many similarities, their features will be compared and differences will be mentioned.

4.3 ZFS Previously known as the "Zettabyte File System", ZFS was developed by Sun Microsystems as part of the Solaris operating system in 2001. As is the nature of Solaris, ZFS was designed for use in data centers and on servers. Due to its architecture and its small hardware requirements, it can also be used on standard and consumer hardware. One of the main supporters of use at home is, for example, iXsystems, Inc.11, which offers a modified version of FreeBSD called FreeNAS. It is optimized for a network attached storage server and can be easily installed and configured. In its beginnings, the source code for ZFS was closed, but along with OpenSolaris it was published under the CDDL license in 2005. As a consequence, porting ZFS to other operating systems like BSD derivatives or Linux-based systems began. When Sun's acquirer Oracle stopped releasing updated source code, ZFS went closed source in 2010. As a response, illumos was forked out of OpenSolaris and started to maintain and manage the open source projects. In 2013 the OpenZFS project[5, 40] was established; it aims to raise awareness of the quality of ZFS, encourage open communication and ensure the reliability of all ZFS distributions.[40]

Recommendations for further reading are the book FreeBSD Mastery: ZFS by Lucas et al., the documentation of the OpenZFS community12, the ZFS Administration Guide13 and the documentation of the FreeNAS project, which also offers a good summary of the features of ZFS14.

4.3.1 Scalability Overall, ZFS was designed as a 128-bit file system, providing scalability of up to 256 quadrillion zettabytes of storage. Because storage management capabilities move from the hardware into the software, such filesystems are classified as software-defined storage15.[14] That means that, through mirroring,

11https://www.ixsystems.com 12http://www.open-zfs.org/wiki/System_Administration 13https://docs.oracle.com/cd/E19253-01/819-5461/index.html 14http://doc.freenas.org/11/zfsprimer.html#zfs-primer 15SDS = software-defined storage

striping or using a ZFS-specific RAID array, classic RAID controllers become unnecessary. It is possible to stripe over data drives, to mirror them or to put them in RAID-Z1, RAID-Z2, or RAID-Z3. Depending on the redundancy, ZFS is able to check the data for errors and to self-heal it if an intact version is available. This is provided through checksums, which are consistently computed and saved in the metadata. The process of comparing the actual data against the checksums stored in the metadata is called scrubbing. As this is very CPU-intensive, it should be scheduled from time to time or started manually. Overall, scalability is considered fundamental for big data analysis.[39]
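The following toy Python model (not ZFS code) illustrates the interplay of checksums and redundancy that scrubbing and self-healing rely on: the checksum stored in the metadata detects which copy is damaged, and an intact mirror copy is used to repair it.

    import hashlib

    def sha256(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    class MirroredBlock:
        """Toy model of a mirrored block with a checksum kept in metadata."""

        def __init__(self, data: bytes):
            self.copies = [bytearray(data), bytearray(data)]  # two-way mirror
            self.checksum = sha256(data)                      # stored as metadata

        def read(self) -> bytes:
            for i, copy in enumerate(self.copies):
                if sha256(bytes(copy)) == self.checksum:
                    # repair damaged copies from the intact one (self-healing)
                    for j in range(len(self.copies)):
                        if j != i and sha256(bytes(self.copies[j])) != self.checksum:
                            self.copies[j][:] = copy
                    return bytes(copy)
            raise IOError("all copies damaged: unrecoverable block")

    block = MirroredBlock(b"dataset contents")
    block.copies[0][0] ^= 0xFF                      # simulate silent bit rot on one disk
    assert block.read() == b"dataset contents"      # detected and healed via the mirror
    assert bytes(block.copies[0]) == b"dataset contents"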

4.3.2 Virtual Devices ZFS uses virtual devices, also called VDEVs, to combine one or more GEOM providers (that is, e.g. a hard disk drive, a solid state disk or any other kind of compatible storage hardware). A VDEV is the logical unit of ZFS and several types of it are supported; commonly, the mirrored disk or the striped disk are such types. The filesystem also provides support for three varieties of RAID arrays, named RAID-Z1, RAID-Z2, and RAID-Z3. The first, RAID-Z1, needs three or more providers. This increases with each level up to RAID-Z3 with a minimum of five providers, but a very high fault tolerance: RAID-Z3 is still operational with three failed providers. If a disk fails, it can be replaced, and the process of recovering the failed drive and repairing the VDEV is called resilvering. RAID-Z arrays offer benefits compared to traditional RAID arrays. One example is the so-called write hole, where operations that require two steps are interrupted in between, resulting in an inconsistent state. ZFS solves this problem with transactional semantics using copy-on-write: files are never overwritten in place and each update is either completely performed or not at all.[56] There are also special VDEVs providing improved read or write speeds. A Separate Intent Log16 is a dedicated device that holds the ZFS Intent Log, which is otherwise maintained as part of the pool; this device is normally a high-endurance SSD. Requests can be written to the faster log device and are then reported as done, which means they are acknowledged very quickly and flushed to the actual data disks afterward. To increase read speeds, ZFS uses an Adaptive Replacement Cache17. This keeps frequently used data in memory and is optimized for modern hardware. As few systems have enough RAM to cache larger parts of the whole pool, an additional VDEV called Level 2 ARC or L2ARC can be used.[56] For this thesis, we

16SLOG = Separate Intent Log, ZIL = ZFS Intent Log 17ARC = Adaptive Replacement Cache

have used a SAN, which makes mirroring unnecessary, because the disks are already mirrored or in any kind of RAID array in the Storage Area Network.

4.3.3 ZFS Blocks Traditional filesystems like UFS or NTFS use fixed-size blocks to store data on disk. At creation time, the filesystem creates special blocks, the inodes, that map the blocks to the files. This is shown on the left side of Figure 3. As metadata is written as close as possible to the data, we get a mixture of data and metadata. Object-based storage systems use a more intelligent approach with variable-sized objects, shown in the right part of the figure. They do not create pre-configured indexes but use storage blocks, each of which contains information and is linked to its neighbors in a tree. Every block is hashed and this information is stored in the parent block. As a tree needs a starting point, there is one special block, called the uberblock. In the case of ZFS, 128 blocks per pool are reserved for uberblocks. The uberblocks are not the only critical blocks, though: ZFS copies vital information like metadata and information about the pool into multiple ditto blocks, which are stored as far from each other as possible. In case a block is damaged, ZFS checks for a backup and recovers the information from the ditto blocks.[56]

Figure 3: Block-based vs object-based storage systems[42]
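To make the idea of a block tree rooted in an uberblock more concrete, here is a small, hypothetical Python model. It is not ZFS's on-disk format; it only shows how storing each child's checksum in its parent lets a top-down walk (a toy "scrub") pinpoint silently corrupted blocks.

    import hashlib

    class Node:
        """A block that references its children together with their expected checksums."""
        def __init__(self, data=b"", children=None):
            self.data = data
            self.children = children or []          # list of (expected_hash, Node)

        def digest(self) -> bytes:
            h = hashlib.sha256(self.data)
            for expected, child in self.children:
                h.update(expected)
            return h.digest()

    def verify(expected: bytes, node: Node, path="root"):
        """Walk the tree top-down and report blocks whose content no longer
        matches the checksum stored for them in their parent."""
        errors = []
        if node.digest() != expected:
            errors.append(path)
        for i, (child_hash, child) in enumerate(node.children):
            errors.extend(verify(child_hash, child, f"{path}/{i}"))
        return errors

    leaf_a, leaf_b = Node(b"chunk A"), Node(b"chunk B")
    root = Node(children=[(leaf_a.digest(), leaf_a), (leaf_b.digest(), leaf_b)])
    uberblock_hash = root.digest()                  # the tree's single entry point

    leaf_b.data = b"chunk B flipped by bit rot"     # silent corruption
    print(verify(uberblock_hash, root))             # -> ['root/1']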

4.3.4 ZFS Pools When thinking of stripes, one usually thinks of RAID storage. In traditional RAID systems, the data is striped across the physical disks; in the case of ZFS, the data is striped across virtual devices18. In short, a stripe is a chunk of data that is written to disk. Traditional RAID systems normally use a stripe size of 128k. When a file is stored, the software writes it to the disks in parallel, depending on the RAID level. The size can be customized and optimized for the server's workload. The redundancy is provided by the parity bits or the mirrored devices, because stripes themselves do not provide any redundancy. In the case of ZFS, the redundancy comes from the underlying VDEVs. ZFS also uses 128k as a default, but it is smart enough to dynamically change the size and optimize it for the data being written. This can be adjusted for every chunk of data up to 1M.[56]

Figure 4: Traditional Volumes vs ZFS Pooled Storage[12]

Overall, the VDEVs are combined into a ZFS pool, also known as a zpool. It is possible to add VDEVs to a pool, as ZFS will use the new VDEV to stripe the data onto it. As an example, we could have two mirrored virtual devices and add them to a zpool. For a traditional RAID system this would correspond to a RAID-10, which offers very high I/O performance. For classic RAID systems this is limited to a depth of two, but for ZFS it is 2^64.[56] This offers great flexibility and easy administration. As seen in Figure 4, traditional approaches limit and fragment the storage by each volume. ZFS can combine all providers, offers almost the full bandwidth and the whole storage is shared over the zpool.[11, 12]

4.3.5 ZFS Architecture As seen in figure 5 there are three main layers, the Storage Pool Allocator (SPA), the Data Management Unit (DMU) and the ZFS POSIX Layer (ZPL)

18VDEVs

for datasets and non-block-based network protocols at the top.

Figure 5: ZFS Layers[12]

The ZPL performs object-based transactions, which are all atomic. Atomicity is known from databases and means that either all of a transaction's operations are performed or none of them are. The transactional semantics will be discussed in subsection 4.5. Further on, the DMU performs transaction group commits, which are atomic for the entire group, establishing permanent consistency on disk. Finally, the SPA schedules and aggregates I/O at its own discretion. ZFS offers universal storage, because files, blocks and objects are all in the same pool. The DMU provides a general-purpose transactional object store. Distributed filesystems like Lustre19 or block-based protocols like iSCSI20 can use the DMU and access the pool without unnecessary overhead or additional software.[11, 12]

Key Features ZFS supports many features like compression, deduplication, encryption and end-to-end data integrity. By design, they are available for all datasets.[11] Most of them have already been discussed in the previous section 3. For further reading, we refer to the recommendations mentioned in section 4.3.

4.4 Data Integrity and Reliability Keeping digital information is very cheap, because it requires hardly any physical space, it can easily be replicated and, with newer technologies, it can be shared at low cost. But reliable long-term preservation can be hard. As technologies like tape drives or compact discs come and go because they are replaced, it is not easy to find a lasting solution for saving data. The larger the amount of data grows, the higher the probability that an

19http://lustre.org 20https://www.thomas-krenn.com/de/wiki/ISCSI_Grundlagen

error will happen. This could be the fault of the RAM, the disk, the software itself or the filesystem. ZFS uses checksums to prevent silent data corruption. This is also called bitrot, because the files end up with rotten bits of no value.[15, 60] ZFS uses SHA256 sums for the fingerprinting, which, unlike SHA1, is still a cryptographically secure hash.[80] According to the paper by Baker et al., there are nine possible threats to archiving. As they have labeled each threat source as "HW/SW", "environmental", "people" or "institutional", we have picked the ones that are caused by hardware or software failure (HW/SW). Every storage system can fail and cause a loss of data. There are many possible reasons, like hardware failure, software bugs, natural disasters or human error. This is also a problem in the short term, but it is more likely to happen over a longer expected lifetime. Furthermore, there is the problem of bitrot and of outdated media, as mentioned in the paragraph above. There is also the possibility that formats, applications, and systems become obsolete and outdated. It is problematic if a file cannot be used in a useful way anymore because it only worked with a specific program. Finally, an issue could also be the loss of context; this is a problematic one, as many files are useless without context or any kind of metadata.[6] In conclusion, a filesystem that is reliable and scalable and includes features to prevent, repair and deal with hardware and software failure is a good start.[98] It is important that forthcoming features can be easily integrated. As further development and a developer team behind any software project are essential for reliability, this is an aspect to consider too. In the case of OpenZFS, there is a project group behind the filesystem which provides support and improvements.

4.4.1 Replication ZFS offers replication, a feature that provides an easy way to migrate whole filesystems to another ZFS pool. It is based on the snapshot technology, which will be discussed in the following subsection 4.5.1. As the underlying hardware is almost completely decoupled from the filesystem on top, administrators can design the new underlying hardware to provide redundancy, add the VDEVs to the new pool and afterward transfer the old pool to the new one. This can also be used for external targets. The commands are zfs send and zfs receive.[67]

4.5 Transactional Semantics In a transactional file system, operations are performed as transactions.[88] A transaction has four properties: Atomicity, Consistency, Isolation,

and Durability, also known as ACID.[43] Due to that behavior, the state of the file system is always consistent, because transactions are either fully conducted or not conducted at all.

"The component provides data isolation by providing multiple file versions, such that transactional readers do not receive changes until after the transaction commits and the reader reopens the file."[88]

As a consequence, data is never overwritten. This offers advantages over older approaches like journaling. In modern file systems, copy-on-write is used to implement transactional behavior.

4.5.1 Copy-on-Write and Snapshots The copy-on-write mechanism is a technique to optimize and avoid unnecessary duplicates and write accesses. If a file is copied, the new file is linked to the old one, marked as read-only, and the resource is shared between them. If a file is modified, the file is virtually copied, but the unchanged part is linked and the modification is written as a delta. In addition, copy-on-write addresses the problems of scalability, fault tolerance, and easy administration.[27, 56, 38, 43]

"Copy On Write is used on modern file systems for providing (1) metadata and data consistency using transactional semantics, (2) cheap and instant backups using snapshots and clones."[43]

A snapshot is a read-only copy of the current state that shares its resources with the live filesystem. Where implemented, snapshots can be promoted to clones, which provide write access. The essential information is written as metadata to the objects in the storage pool. The performance and reliability of copy-on-write have been studied extensively and are considered an advantage compared to classic file systems.[78, 85, 51, 66, 43]
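The following simplified Python sketch (an assumption-laden toy model, not how ZFS lays out data on disk) shows the core idea: a snapshot copies only references, and a later modification writes a new block while every unchanged block stays shared.

    class CowFile:
        """Toy copy-on-write store: a 'file' is a list of references to immutable blocks."""

        def __init__(self, blocks):
            self.blocks = [bytes(b) for b in blocks]   # block references (immutable)

        def snapshot(self):
            snap = CowFile([])
            snap.blocks = list(self.blocks)            # copy references, not data
            return snap

        def write_block(self, index, data: bytes):
            new = list(self.blocks)                    # copy-on-write: never modify in place
            new[index] = bytes(data)
            self.blocks = new

        def read(self) -> bytes:
            return b"".join(self.blocks)

    live = CowFile([b"2018-01 rows\n", b"2018-02 rows\n"])
    snap = live.snapshot()                              # cheap: only references copied
    live.write_block(1, b"2018-02 rows, corrected\n")   # modification written as a new block

    print(snap.read())                        # old version still intact
    print(live.read())                        # new version
    print(live.blocks[0] is snap.blocks[0])   # True: the unchanged block is shared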

4.6 Btrfs Also known as the "B-Tree Filesystem" or, more rarely, the "Better Filesystem", Btrfs provides features similar to ZFS. As it is unnecessary to reinvent the wheel and most features were already very advanced, most of them were reused, but with slight architectural changes. Oracle started its development in 2007 because the license of ZFS was incompatible with GNU/Linux. As is the nature of filesystems, development takes time, and Btrfs was finally marked as stable in 2014. Btrfs is a 64-bit filesystem and also offers great scalability. It uses a B-tree copy-on-write mechanism and supports snapshots and clones too. One difference to ZFS is that it only supports out-of-band deduplication, which means that userspace tools are required to deduplicate data and this action is executed afterward. Additional features like encryption, which are supported by ZFS, are also not completely finished.

4.7 NILFS The "New Implementation of a Log-Structured File System", also known as NILFS[49], was developed in Japan by Nippon Telegraph and Telephone. It is a log-structured filesystem and is well suited for NAND-based SSDs. A fundamental problem of flash drives is that they are limited in write and erase cycles.[42] One of the first log-structured filesystems was already implemented by John Ousterhout and Fred Douglis in a UNIX-like operating system called Sprite in 1992.[74] The reasons for this first kind of filesystem were the increasing CPU speeds combined with nearly stagnant disk access speeds. As this was obviously before flash drives became popular, I/O performance was a bottleneck for many applications. In addition, memory was cheap and available, so caches offered an answer to the read-speed issues of a filesystem. As a result, mostly random writes remained and disk utilization was quite low. A log-structured filesystem addresses this now write-dominated workload by writing all new data blocks in a sequential structure to disk. This structure is, as the name says, a log. It clearly increases write speeds and reduces the seek times needed to find the relevant locations.[73] In summary, log-structured filesystems provide some very useful advantages. First of all, basically every change can be reverted, because everything is in the log. Because very big and long logs get unwieldy and slow down I/O performance, NILFS implements a garbage collector to clean up old data and metadata. Furthermore, the log can be used to implement continuous snapshotting, which works very well for backup scenarios.[50]
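A minimal Python sketch of the principle follows; the key-value interface and the checkpoint-as-log-offset representation are simplifying assumptions for illustration, not NILFS's actual data structures.

    class LogStructuredStore:
        """Toy log-structured store: every write is appended to a sequential log;
        a checkpoint is just the current log length. Reading 'as of' a checkpoint
        replays the log only up to that point, so every older state stays
        recoverable until a garbage collector trims the log."""

        def __init__(self):
            self.log = []                      # sequential records: (key, value)

        def write(self, key, value):
            self.log.append((key, value))      # never overwrite, always append

        def checkpoint(self):
            return len(self.log)               # continuous snapshot: a log offset

        def read(self, key, as_of=None):
            upto = len(self.log) if as_of is None else as_of
            value = None
            for k, v in self.log[:upto]:       # replay the log up to the checkpoint
                if k == key:
                    value = v
            return value

    store = LogStructuredStore()
    store.write("dataset.csv", "version 1")
    cp = store.checkpoint()
    store.write("dataset.csv", "version 2")

    print(store.read("dataset.csv"))             # 'version 2' (current state)
    print(store.read("dataset.csv", as_of=cp))   # 'version 1' (state at the checkpoint)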

4.8 Operating Systems and Solutions As file systems are a very basic feature of any operating system, there are many different file systems and operating systems available. This section provides a short overview of possible implementations of the three covered file systems. As this is not the main subject of the thesis, it will be a simple list of the most important operating systems.

1. ZFS

(a) UNIX: Solaris, OpenIndiana, OmniOS
(b) FreeBSD, FreeNAS
(c) Linux with ZFS on Linux (Proxmox as Virtual Environment)

2. Btrfs

(a) Linux since kernel 2.6.29 (2009)[23]
(b) Proxmox (Virtual Environment)
(c) Rockstor

3. NILFS2

(a) Linux since kernel 2.6.34 (2010)[52]

5 Approaches

With the covered techniques, there are a few possibilities to realize a well-structured and scalable architecture that provides an archiving and versioning system for a large number of datasets. We will point out three main approaches and compare and analyze them for their feasibility. As many aspects of them have already been covered in the background section 3, only a general description of the mechanisms will be given. The goal of this section is to provide a preliminary conclusion, which will be benchmarked and further tested in the next sections.

5.1 Version Control Systems Version control systems obviously offer a convenient way to track files and provide a file history. As there would be many similarities and recurring data blocks, there are two extreme ways to store them. The first is to save each version on disk without any regard to overlaps, which results in

high storage costs, but very low recreation costs. The other way would be to store the dataset only once and derive the other versions using deltas from the first. This may result in low storage costs on the one hand, but in very high recreation costs on the other.[21] Recreating versions consumes time and I/O performance. A possibility to counteract this may be to store the dataset more than once and create the relevant deltas in a structured and logical way. This would be a concept similar to the one described in the paper "Evaluating Query and Storage Strategies for RDF Archives" by Fernández et al.[28] As this would mean rewriting a version control system and adapting and optimizing it for a large number of datasets, it will not be covered in this thesis. There is a project that has implemented scalable version control, called DataHub21.[8] As an inference, available revision control systems do not fit well as a solution for our given use case and fail because of their poor scalability.

5.2 Snapshots and Clones
As the copy-on-write mechanism offers a very good opportunity to create snapshots with little overhead, this could be a very promising approach. Also, the covered NILFSv222 would grant a flexible possibility to create many snapshots and make version control possible. However, as the datasets have already been downloaded and written to disk, this kind of version control cannot be applied retroactively. There is also a problem similar to the one mentioned before: the recreation costs rise with the number of snapshots, which decreases I/O performance and increases access times. Because it is implemented in the filesystem, the efficiency may nevertheless be much higher than for version control systems. As this is very hard to test in a comprehensive way, it was not part of this work. To sum up, snapshots can be very efficient for version control because they operate on a very low layer, the timestamps etc. are added to the metadata and the versioning is done on a bit and byte basis. For our use case we exclude snapshots, because process changes would be required and it is not possible to do the versioning afterward within reason. Overall, this may be a topic for further research.
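For reference, the sketch below shows what snapshot-based versioning would look like in ZFS; the pool and dataset names are placeholders and this workflow was deliberately not benchmarked in this thesis.

# Take a read-only snapshot after each monthly crawl (names are examples)
zfs snapshot dedup/portalwatch@2016-01
zfs list -t snapshot

# Show what changed between two versions
zfs diff dedup/portalwatch@2016-01 dedup/portalwatch@2016-02

# Create a writable clone of an old version, or roll back to the latest snapshot
zfs clone dedup/portalwatch@2016-01 dedup/portalwatch-2016-01
zfs rollback dedup/portalwatch@2016-02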

5.3 Deduplication
The third approach is deduplication. As already mentioned, the aim of deduplication is to eliminate duplicates of data. Since this requires a lot of computational power, it has only recently become a practical way of reducing data.

21 https://datahub.csail.mit.edu/www/
22 New Implementation of a Log-Structured Filesystem Version 2

In general, the data is divided into parts, which are then fingerprinted with a cryptographically secure hash algorithm. This checksum is indexed in a global deduplication table and also stored in the metadata. If the checksum of a data part is already in the table, the data is deduplicated, which means the metadata is created as usual, but the actual data is only linked to the already available data parts. For the Portal Watch datasets, a kind of deduplication mechanism already exists: the datasets are downloaded and fingerprinted on a per-file basis, which is, in fact, deduplicating data. One big difference is that the chunking is done in a different way. According to the research literature, chunk-based deduplication is much more effective than file-based deduplication.[92, 62, 59] In the next section, we will test deduplication and evaluate how good our decision to go for deduplication was. Because of possible performance issues, it is preferable to only enable it if the ratio is above two. Speaking of results, there is a big gain in saving storage costs, which will be discussed in the results section 7. As this approach offers easy administration and can, in the case of ZFS, be done directly in the filesystem layer, we will continue with it in the next sections.
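Before enabling deduplication permanently, ZFS allows the expected ratio to be simulated on existing data; the sketch below assumes a pool named dedup and is one way to check whether the ratio stays above the threshold of two mentioned above.

# Simulate deduplication on an existing pool without enabling it
# (this reads all data once and can take a long time on large pools)
zdb -S dedup

# Once dedup has been enabled, the achieved ratio is reported here
zpool list dedup
zpool status -D dedup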

6 Design of the Benchmarks

Filesystems should be chosen carefully before putting them into productive use. In the following we introduce our test datasets and the methods used for testing them, together with a description of the testing environment. The exact results are presented in section 7.

6.1 Datasets
As the deduplication ratio differs for different types of data, and to obtain a more comprehensive outcome, we have defined and collected two additional datasets. They are analyzed and described in this section.

6.1.1 Portal Watch Datasets
As we see in Figure 6, the amount of disk usage is very high for the first few months and then levels off, because more and more datasets have already been added and fewer changes occur. As we have not done a specific analysis of the data, we cannot establish the cause of the declining storage growth. There is also no indicator for the deduplication ratio achieved on a per-file basis. We will test how well the deduplication of the filesystem ZFS scales with an increasing amount of data. As we switch to chunk-based deduplication, we can compare the gain in deduplication ratio compared to file-based deduplication.

Figure 6: Disk usage of datasets in GB per month

6.1.2 BEAR Datasets
The BEnchmark of RDF ARchives consists of three main datasets and was published by Fernandez et al.[28] As there is an emerging demand for archiving, querying and versioning semantic web data, it tries to "evaluate its storage space efficiency and the performance of different retrieval operations."23 The first dataset, called BEAR-A, consists of dynamic linked data, which is composed of 58 weekly snapshots from the Dynamic Linked Data Observatory24. That means there are 58 different versions resulting in 331 GB of decompressed data. Compressing the independent copy version results in a file of 22 GB, which can be fetched from the BEAR homepage25.

23 https://aic.ai.wu.ac.at/qadlod/bear.html
24 http://swse.deri.org/dyldo/
25 https://aic.ai.wu.ac.at/qadlod/bear.html#BEAR-A

Table 1: BEAR Datasets Compression Ratios

                     BEAR-A (IC)   BEAR-B (IC)   BEAR-C (IC)
raw (uncompressed)   331 GB        239 GB        3.59 GB
gzip (compressed)    22 GB         12 GB         0.23 GB
Compression Ratio    15            19.9          15.6

Table 2: Wikipedia Dumps Compression Ratios

                     2018-01-01   2018-01-20   2018-02-01
raw (uncompressed)   60 GB        60 GB        60 GB
gzip (compressed)    15 GB        15 GB        15 GB
Compression Ratio    4            4            4

Comparing the uncompressed with the compressed version, we have a compression ratio of 15. The second dataset is called BEAR-B and was compiled from DBpedia Live changesets26 over three months. We picked the version with independent copies and an instant granularity. It uses 239 GB of disk space in an uncompressed format and 12 GB for the gzipped version, which can be fetched as mentioned before for BEAR-A. This gives a compression ratio of nearly 20 for the second BEAR. The third BEAR is named BEAR-C and uses dataset descriptions of the European Open Data Portal27 over 32 weeks. The data was retrieved with 10 complex queries from the Open Data Portal Watch, which was described before. With 0.23 GB compressed and 3.59 GB uncompressed, this dataset is much smaller than the others. Nevertheless, it has a compression ratio of about 15. All ratios and sizes of the datasets are shown in table 1.

6.1.3 Wikipedia Dumps
As the Wikipedia database is dumped once or twice a month and is available for download28, we have picked three different versions with the timestamps 2018-01-01, 2018-01-20 and 2018-02-01, as seen in table 2. As there are many different variants with various characteristics, we decided to use the current revisions without any talk or user pages, and only the English variant. Each version has a size of around 15 GB compressed and around 60 GB uncompressed, resulting in a compression ratio of four.

26 http://live.dbpedia.org/changesets/
27 http://data.europa.eu/euodp/en/data/
28 https://dumps.wikimedia.org/enwiki/

6.2 Testsystem and Methods
6.2.1 Testing Environment
For our scenario, we have a virtualized environment running Ubuntu 16.04 LTS. The virtual machine is hosted on an IBM x3650 M4 machine, which contains two Intel Xeon E5-2650 v2 CPUs clocked at 2.60 GHz. The whole server has 384 GB of RAM available. The guest is assigned 4 vCores and 32 GB of RAM. The storage is provided through a SAN, which uses hard disk drives and two Emulex LightPulse Fibre Channel adapters offering a bandwidth of 8 GBit/s per fibre cable. As there are many other virtual machines, the bandwidth available to the testing environment depends on the overall workload; in practice it is usually well below the theoretical maximum and limited rather by the speed of the spinning disks.

Rsync was used to transfer the datasets from the NFS share to the ZFS pool. It is a small program that compares data and synchronizes it between local paths or paths available over the network. One particular feature is that rsync splits the data into several chunks and hashes them to ensure data integrity.[86]

SSH was used to access our virtual machine. It is a protocol for secure remote login and can be used even over insecure networks.[96]

6.2.2 Benchmarking Tools
There are many ways to benchmark a filesystem. A common starting point for administrators is the tool iostat29. While this may be adequate for some simple speed tests, we were looking for a more extensive tool. There is wrstat30, which was used by Norbert Schramm in his master project for benchmarking.[77] As we would have to recompile the kernel to integrate wrstat and be able to test ZFS, we decided to use a less broad but solid tool called bonnie++31.

29 man iostat, https://linux.die.net/man/1/iostat
30 https://github.com/bolek42/wrstat
31 https://www.coker.com.au/bonnie++/

6.2.3 Bonnie++
"Bonnie++ is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance."[18]

According to the documentation, Bonnie creates files that are twice as big as the available RAM. For our test environment, Bonnie therefore benchmarks with a size of 60 GB. There are three main questions to be answered:

1. How fast can Bonnie write files to the disk?

2. How fast can Bonnie read files from the disk?

3. How fast can Bonnie change internal data structures?

More precisely, Bonnie tests the sequential output, starting with creating and writing the file to disk. Afterward, the file is rewritten in order to test the effectiveness of the filesystem cache and the data transfer speed. Subsequently, Bonnie reads the created file and evaluates the read performance. The last test runs random seeks and, according to the documentation written by Coker and Bray, has a strongly nonlinear effect on the results of this test. As many UNIX systems use available memory to cache files, they report excessively high random I/O rates. This was also the case in our benchmarks. To bypass this, we ran the tests with a nearly full pool, where less RAM is available for caching. Bonnie++ writes files consisting of random alpha-numeric characters.[18]
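A typical invocation is sketched below; since the exact flags are not documented beyond the appendix, the target directory, size and label are assumptions rather than the precise command used for the results in section 7.

# Write, rewrite and read a 60 GB file set in the mounted pool, skip the
# small-file creation tests (-n 0); -u is required when running as root
bonnie++ -d /zfs -s 60g -n 0 -m zfs-test -u root -q >> bonnie-results.csv

# The CSV output can be converted into an HTML table for inspection
bon_csv2html < bonnie-results.csv > bonnie-results.html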

6.3 Conditions and Parameters
The test system uses ext4 as the filesystem for the operating system Ubuntu 16.04 LTS. For the benchmarks, we installed zfsonlinux32, which is already marked as stable for Linux and has been integrated into Ubuntu since version 16.04. Other theses have also used ZFS on Linux.[77, 83] The loaded module is v0.6.5.6-0ubuntu20 with zpool version 5000 and zfs filesystem version 5. To simplify benchmarking, we created two zpools, the first with a size of 2 TB and the second with a size of 500 GB. The 2TB pool is called dedup and the 500GB pool dedup2. They are mounted as

/zfs
/zfs2

32 http://zfsonlinux.org/index.html

To start, we compared the performance of the default filesystem ext4 with ZFS. This is discussed in section 7.1. As there are many parameters that can be used to fine-tune ZFS, we defined three parameters that are modified and used: deduplication, compression and block size. Each of them works on a per-dataset basis.[56] As we only used one dataset per zpool, each command was only specified for the used zpool/dataset.

6.3.1 Deduplication
The default value for deduplication is off. To enable it, we have used the command

# zfs set dedup=on [poolname]

When deduplication is enabled, only newly written files are deduplicated, and it should be mentioned again that files that have been deduplicated remain deduplicated even after disabling deduplication. To disable it, the same command can be used with off instead of on. Restoring the duplicates of the data is not that simple: the best way is to send the data to a new dataset without deduplication enabled, using the commands zfs send and zfs receive.[56] Even though deduplication is enabled per dataset, all unique hashes and the metadata for deduplication are assigned to one global deduplication table per zpool. As this was very impractical for testing the different datasets, we used a second zpool.
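A sketch of this rehydration path, using the placeholder pool names from above:

# Copy a deduplicated dataset into a pool where dedup is disabled
zfs snapshot dedup/data@migrate
zfs send dedup/data@migrate | zfs receive dedup2/data

# The received copy is stored fully expanded again
zfs get dedup,used dedup2/data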

6.3.2 Compression
Compression is also disabled by default. Because the lz4 compression algorithm is quite fast and efficient, it is recommended for almost all cases and is also used if no algorithm is specified. In addition, ZFS supports the gzip algorithm, which can be combined with the desired compression level, at the cost of performance. To enable the lz4 compression, we used the command

# zfs set compression=on [poolname]

The behavior of compression is similar to deduplication: only newly written files are compressed, and files remain compressed after disabling it.
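If one does not want to rely on the mapping of compression=on, the algorithm can also be selected explicitly; the lines below are a sketch with a placeholder pool name, not commands taken from the benchmark runs.

# Select lz4 explicitly instead of relying on the default of compression=on
zfs set compression=lz4 [poolname]

# Use gzip with the highest compression level (slower, but a better ratio)
zfs set compression=gzip-9 [poolname]

# Check the ratio achieved for data written after enabling compression
zfs get compressratio [poolname]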

6.3.3 Blocksize
The SAN provides blocks with a size of 4k, so every block size of 4k or above is aligned with the underlying storage. As the deduplication results for the datasets were not satisfying, we started to fine-tune the block size. Even though ZFS tries to create storage blocks that fit the data, this heuristic can fail in various ways. To find out whether the default block size is already optimal, we benchmarked with different block sizes. As the default is 128k, we set it to (a) 4k, (b) 128k and (c) 1M. The commands for setting the block size (the recordsize property in ZFS) are

# zfs set recordsize=4k [poolname]
# zfs set recordsize=128k [poolname]
# zfs set recordsize=1M [poolname]

7 Results

First, we will introduce the procedure for the benchmarks. For easier handling of the different datasets, we created two zpools. To start, we tested whether the Portal Watch datasets fit into the 2TB pool. The datasets amount to 2.3TB for 2015 and 1TB for 2016 and are therefore bigger than the 2TB pool. For the transfer, rsync was used. Overall, we will proceed in the following order:

1. Performance of ext4 and ZFS

2. Deduplication ratio for Portal Watch Datasets

3. Deduplication ratios for BEAR and Wikipedia dumps

4. Deduplication ratios with regard to blocksizes (BEAR, wiki dumps)

5. Compression ratios with regard to blocksizes (BEAR, wiki dumps)

7.1 Performance of ext4

Figure 7: Bonnie Benchmark for ext4

In order to get a basic view of the speed of the SAN, we benchmarked the filesystem ext4, which is used by Ubuntu as the default. As we see in figure 7, there is a speed of approximately 410 MB/s for the sequential output and around 150 MB/s for rewriting the file. The read speed is around 440 MB/s for sequential input. Considering that the SAN uses hard disk drives, a sequential read and write speed of around 400 MB/s is quite good. Commercially available drives that are optimized for high transfer rates but target the non-business market, like the Western Digital Black series33 or the Seagate Barracuda Pro series34, offer sequential read and write speeds of around 200-230 MB/s. Compared to enterprise disks, they spin at 7,200 RPM and use SATA instead of the SAS interfaces that are often used in data centers. As the speeds of the SAN are obviously very high for hard disk drives, the SAN may use SSDs as caching devices or, more likely, gains these speeds from a large RAID group.

7.2 Performance of ZFS

Figure 8: Block I/O for ZFS

In figure 8 we see the block I/O for a Bonnie benchmark on the 2TB zpool. There are three areas: the first is the sequential block output, the second the block rewrite and the last the block input. The performance depends on the level of storage allocation: as the pool gets fuller, the speed drops.

33 https://www.wdc.com/content/dam/wdc/website/downloadable_assets/deu/spec_data_sheet/2879-771434.pdf
34 https://www.seagate.com/www-content/datasheets/pdfs/barracuda-pro-12-tbDS1901-7-1707-GB-en_GB.pdf

Figure 9: Block I/O for ext4 compared to ZFS

Compared to ZFS, ext4 is nearly three times as fast for sequential block output. For rewriting, ZFS is nearly as fast as ext4. For the read speed, ZFS may even be slightly faster, but starting at an allocation level of 80 percent the speed drops enormously. For the charts we have used bonnie2gchart.35

7.3 Deduplication Ratios
7.3.1 Portal Watch Datasets
After copying the datasets from the year 2015 of the Portal Watch to the first zpool, we ran the commands zpool list dedup and zpool status -D dedup to get the deduplication ratio and a summary of the deduplication table. As mentioned, the Portal Watch uses deduplication on a per-file basis: each file gets fingerprinted and the hash is compared with the already existing files. The approach of ZFS is to go one level below and use chunk-based deduplication. The files are split into chunks, which are also fingerprinted with a cryptographically secure hash sum. A chunk is considered redundant or unique after comparing it with the deduplication table. According to Xia et al.[92] the chunk-based approach is much more efficient than the file-based approach. Running zpool list returns the following output:

35 https://github.com/pommi/bonnie2gchart

NAME    SIZE   ALLOC   FREE   EXPANDSZ   FRAG   CAP   DEDUP
dedup   1.98T  1.31T   695G   -          39%    65%   1.75x

We got a deduplication ratio of 1.75 for data that had already been deduplicated on a file basis. This is a very good rate, considering the already deduped and compressed datasets. As we can see in figure 10, there are 11.4 million different data blocks, which allocate 1.29TB and refer to actually 2.27TB of data. For these files we got 11 959 067 DDT36 entries. ZFS uses sha256 hashes, and each entry in the deduplication table consumes roughly 320 bytes of memory. Multiplying the number of entries by the entry size, we get a metadata size of about 3650MB for the deduplication table. As ZFS on Linux uses up to 50 percent of the available RAM, we have 15 of 30GB of RAM available for the ARC37. Metadata like the DDT can use up to 25% of the ARC,[56] which is 3.75GB. So if ZFS should hold the whole deduplication table for the year 2015 of the Portal Watch, we need about 30GB of RAM. Tuning the RAM usage of ZFS to around 90 percent would mean that 6.75 GB of RAM are usable for metadata like the deduplication table.
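The same estimate can be scripted from the pool statistics; the sketch below assumes the pool name dedup and the rule of thumb of roughly 320 bytes of RAM per DDT entry.

# Extract the number of DDT entries from the pool statistics and estimate
# how much RAM is needed to keep the whole table in memory
entries=$(zpool status -D dedup | sed -n 's/.*DDT entries \([0-9]*\).*/\1/p')
echo "DDT entries: $entries"
echo "Estimated in-core size: $(( entries * 320 / 1024 / 1024 )) MiB"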

Figure 10: Deduplication Table for the Portal Watch Datasets

Looking at the actual stats of the ARC for ZFS we got the following output:

36 DDT = deduplication table
37 ARC = Adaptive Replacement Cache

> cat /proc/spl/kstat/zfs/arcstats | grep c_
c_min           4    33554432
c_max           4    15707279360
arc_meta_max    4    11808078920
arc_meta_min    4    16777216

Above we see a truncated output with four values. c_min and c_max are the minimum and maximum amount of RAM that ZFS on Linux will use. With about 15GB as maximum, ZFS can use up to 50 percent of the whole RAM. The values arc_meta_max and arc_meta_min concern the limits for the metadata that is loaded into the Adaptive Replacement Cache. In our case, we have a maximum of about 11.25GB for metadata. As this seems to be contradictory, it should be mentioned that there is other metadata besides the deduplication table. That means up to one quarter of the available RAM is used for the deduplication table, up to three quarters is used for metadata in the ARC and, as a consequence, one quarter is always reserved for non-metadata. Holding the whole deduplication table in RAM is best practice. It obviously offers the best performance, but another solution is to add a fast SSD as L2ARC to the pool. Even if the L2ARC is slower than the RAM, it is much faster than reading the DDT from disk. There are many interesting blog posts and observations from administrators who have tried deduplication.[32, 33, 2] Additionally, there are many discussions on the FreeBSD forum.38
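Raising the limits discussed above is possible via module parameters of ZFS on Linux; the values below are purely illustrative and were not applied in our benchmarks.

# /etc/modprobe.d/zfs.conf -- example values for a guest with 32 GB of RAM
# Allow the ARC to grow to about 28 GiB instead of 50% of RAM
options zfs zfs_arc_max=30064771072
# Allow up to about 21 GiB of the ARC to hold metadata such as the DDT
options zfs zfs_arc_meta_limit=22548578304

# The new limits take effect after reloading the zfs module or rebooting;
# the current values can be checked in /proc/spl/kstat/zfs/arcstats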

7.3.2 BEAR Datasets
We tested BEAR-A both as a compressed and as a decompressed version for deduplication ratios. The compressed version has a size of 22.5GB and leads to 184378 DDT entries. Unfortunately, no block occurs two or more times in the data. Looking at the decompressed version, we have a size of 354GB, but also no space gain: there are 61 blocks that are redundant and occur twice in the data, while the other 2.77 million blocks are unique. We got the same results for BEAR-B and BEAR-C regarding the deduplication ratios.

7.3.3 Wikipedia Dumps
Looking at the summary of the deduplication table for the Wikipedia dumps also produces a disappointing outcome. The three dumps with a size of 227 GB result in 1.8 million DDT entries and only four blocks with a reference count of two. As these results are very underwhelming, we tried to find an explanation for them.

38 https://forums.freebsd.org/threads/deduplication-with-SSD-instead-of-ram.37289/

7.4 Blocksize Tuning
Even if "ZFS automatically chooses the best block size on a per-file basis"[38], we wanted to test whether the algorithms of ZFS always find the best block size. We ran tests with enabled deduplication for the block sizes (1) 1M, (2) 128k and (3) 4k. The runs for (1) blocksize=1M were quite fast and again returned a ratio of 1.00 for BEAR and the Wikipedia dumps. As one block has a size of 1MB, there are far fewer entries in the DDT. The second setting (2) is the same as the default and also returned the same ratio. The third setting (3) is a very small block size and creates a deduplication table with over 90 million entries for BEAR-A, which corresponds to a metadata overhead of 26.9GB for the deduplication table. As we simultaneously copied and decompressed the datasets, the system crashed at the end and did not respond anymore. Even so, the ratio was again 1.00. The Wikipedia dumps created 48 million DDT entries and also resulted in no space gain. Overall, the small block size dropped the performance drastically and the runs took very long.

7.5 Compression Performance
Based on the poor deduplication results, we started to test the compression integrated in ZFS. We used the default setting with lz4, which offers good performance and good compression. To start, we ran a few simple benchmarks with Bonnie. The block sizes were also considered.

Figure 11: Block I/O for ZFS: with Blocksize=4k and Compression or Deduplication

In figure 11 we see the input and output performance of runs with a block size of 4k, once with compression enabled and once with deduplication enabled. As we can see, the write performance for the combination of 4k and deduplication is very poor, though the read speeds for this combination are more reasonable.

Figure 12: Block I/O for ZFS: with Blocksize=128k and Compression or Deduplication

Looking at the results in figure 12, with the same settings but a block size of 128k, we got a much better performance for deduplication and compression. For this setting we also ran a pass without compression or deduplication; its sequential block output is better, but it is interestingly slower for block rewrites and block inputs. This was also observed by Oracle, as described in its blog.[1] The run with a block size of 1M is shown below in figure 13 and offers slightly lower performance than the default block size of 128k. With regard to performance, we can say that the default block size is already a very good choice for our datasets.

Figure 13: Block I/O for ZFS: with Blocksize=1M and Compression or Deduplication

7.5.1 Compression Sizes
As read and write performance may not be the only significant criterion, we had a closer look at the compression sizes and compression ratios. Figure 14 shows the sizes for the different compression settings. The one called tar.gz is the original dataset as it was compressed for download. Obviously, its size is the smallest and needs the least storage. The settings with a block size of 128k and 1M compress better than the one with a block size of 4k.
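The effective on-disk size, the logical (uncompressed) size and the resulting ratio can be read directly from the ZFS dataset properties; the following line is a sketch with a placeholder dataset name, not the exact procedure used to build table 3.

# Physical size on disk, logical size and the resulting compression ratio
zfs get -p used,logicalused,compressratio dedup2/bear-a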

Table 3: Compression Ratios for different Compression Levels per Dataset

                  Wiki Dumps   BEAR-A IC   BEAR-B IC   BEAR-C IC
blocksize=4k      1.56         3.76        5.85        3.16
blocksize=128k    2.29         8.48        13.68       8.39
blocksize=1M      2.34         8.75        14.51       8.80
tar.gz            4.11         15.10       19.83       14.83

Figure 14: Compression sizes in GB for the Datasets

7.5.2 Compression Ratios
For a more meaningful comparison, we have put the compression ratios into table 3 and plotted them, as shown in figure 15. According to these numbers, compression with a block size of 1M was slightly more efficient than with the default block size of 128k. Even if the ratios of the ZFS lz4 compression are very good, the gzip algorithm that was used for publishing the datasets achieves roughly 50 percent higher ratios. Its ratio may be higher, but its performance is lower: the filesystem-integrated compression works at almost full speed, in contrast to compressing the files manually. There are even scenarios in which compression increases the write speed, as fewer bytes have to be written to disk.[90]

Figure 15: Compression Ratios for the Datasets

8 Conclusion and further research

In conclusion, deduplication was not as effective as expected. It offers only moderate write speeds, but it works well and provides good read performance because the caching is efficient. Furthermore, integrating deduplication in the filesystem layer removes administration overhead, and in the case of ZFS the implementation is reliable and compatible. It has been shown that chunk-based deduplication can additionally remove nearly 43 percent of the data that had already been deduplicated on a file basis. A recommendation to migrate the whole Portal Watch deduplication to the ZFS-integrated deduplication can therefore be made: all redundant files would be eliminated just as well by the chunk-based deduplication. The hardware requirements should be taken into account, as it is recommended to add 5GB of RAM for each TB of deduplicated data.[45] This can be derived from the number of blocks needed for 1TB of data: at the default block size this corresponds to roughly 2.5GB of deduplication entries, and since the DDT has to share the ARC with other data, roughly twice that amount of RAM is needed to be able to load the whole DDT into RAM. Even if deduplication already helps to save storage costs, we experienced that compression is much more efficient and can even provide increased I/O performance. Overall, we would recommend ZFS for this and similar use cases, as it provides flexibility, reliability and many features.

Further research can be done on self-learning caches to improve content-based chunking[53, 97], on mechanisms to improve the detection of redundant data and on utilizing content similarity to improve I/O performance.[48] As access patterns, lifetimes and file sizes differ considerably, there is a huge potential for automatic system improvements.[61] Querying and indexing huge amounts of data has become more and more popular, so this may also be considered important further research.[28, 17] This thesis has mainly examined CSV, XML and RDF datasets; other formats could be evaluated further. As it has become common for crypto mining, the hashing for fingerprinting the blocks can also be offloaded to the GPU.[30, 46] Further research can be done by implementing these performance improvements.

References

[1] ZFS Compression - A Win-Win. https://blogs.oracle.com/solaris/zfs-compression-a-win-win-v2, 2009.

[2] ZFS Deduplication. https://blogs.oracle.com/bonwick/zfs-deduplication-v2, 2009.

[3] In Kürze. Controlling & Management Review, 61(5):82, 2017.

[4] A. S. Tanenbaum, J. N. Herder, and H. Bos. Can we make operating systems reliable and secure? Computer, 39(5):44–51, 2006.

[5] Matthew Ahrens. OpenZFS: A Community of Open Source ZFS Developers. AsiaBSDCon 2014, page 27, 2014.

[6] Mary Baker, Kimberly Keeton, and Sean Martin. Why traditional storage systems don’t help us save stuff forever. In Proc. 1st IEEE Workshop on Hot Topics in System Dependability, pages 2005–2120, 2005.

[7] Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long, and Mark Lillibridge, editors. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. IEEE, 2009.

[8] Anant P. Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. CoRR, abs/1409.0798, 2014.

[9] Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, and Aditya G. Parameswaran. Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff. CoRR, abs/1505.05211, 2015.

[10] John Black, editor. Compare-by-Hash: A Reasoned Analysis, 2006.

[11] Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. The zettabyte file system. In Proc. of the 2nd Usenix Conference on File and Storage Technologies, volume 215, 2003.

[12] Jeff Bonwick and Bill Moore. ZFS. In LISA, 2007.

[13] Randal C. Burns and Darrell D. E. Long. Efficient distributed backup with delta compression. In Proceedings of the fifth workshop on I/O in parallel and distributed systems, pages 27–36, 1997.

[14] Mark Carlson, Alan Yoder, Leah Schoeb, Don Deel, Carlos Pratt, Chris Lionetti, and Doug Voigt. Software defined storage. Storage Networking Industry Assoc. working draft, pages 20–24, 2014.

[15] V. G. Cerf. Avoiding "Bit Rot": Long-Term Preservation of Digital Information [Point of View]. Proceedings of the IEEE, 99(6):915–916, 2011.

[16] Amit Chavan, Silu Huang, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. Towards a unified query language for provenance and versioning. CoRR, abs/1506.04815, 2015.

[17] Francisco Claude, Antonio Farina, Miguel A. Martínez-Prieto, and Gonzalo Navarro. Universal indexes for highly repetitive document collections. Information Systems, 61:1–23, 2016.

[18] Russell Coker. Bonnie++ file-system benchmark. 2012.

[19] C. Constantinescu, J. Glider, and D. Chambliss. Mixing Deduplication and Compression on Active Data Sets. In 2011 Data Compression Conference, pages 393–402, 2011.

[20] Biplob Debnath, Sudipta Sengupta, and Jin Li. FlashStore: high throughput persistent key-value store. Proceedings of the VLDB Endowment, 3(1-2):1414–1425, 2010.

[21] Amol Deshpande. Why Git and SVN Fail at Managing Dataset Versions. http://www.cs.umd.edu/~amol/DBGroup/2015/06/26/datahub.html, 2015.

[22] L. Peter Deutsch. GZIP file format specification version 4.3. 1996.

[23] Oliver Diedrich. Das Dateisystem Btrfs. https://heise.de/-221863, 2009.

[24] Idilio Drago, Marco Mellia, Maurizio M. Munafò, Anna Sperotto, R. Sadre, and Aiko Pras. Inside Dropbox: Understanding Personal Cloud Storage Services. In Proceedings of the 2012 ACM Conference on Internet measurement, IMC 2012, pages 481–494. ACM, 2012.

[25] Mike Dutch. Understanding data deduplication ratios. In SNIA Data Management Forum, page 7, 2008.

[26] Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Ottean, Jin Li, and Sudipta Sengupta. Primary Data Deduplication-Large Scale Study and System Design. In USENIX Annual Technical Conference, volume 2012, pages 285–296, 2012.

[27] Francisco Javier Thayer Fábrega, Francisco Javier, and Joshua D. Guttman. Copy on Write, 1995.

[28] Javier D. Fernández, Jürgen Umbrich, Axel Polleres, and Magnus Knuth. Evaluating Query and Storage Strategies for RDF Archives. In Proceedings of the 12th International Conference on Semantic Systems, SEMANTiCS 2016, pages 41–48, New York, NY, USA, 2016. ACM.

[29] John Gantz and David Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the future, 2007(2012):1–16, 2012.

[30] Abdullah Gharaibeh, Samer Al-Kiswany, Sathish Gopalakrishnan, and Matei Ripeanu. A GPU Accelerated Storage System. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC ’10, pages 167–178, New York, NY, USA, 2010. ACM.

[31] Garth A. Gibson and Rodney van Meter. Network Attached Storage Architecture. Commun. ACM, 43(11):37–45, 2000.

[32] Constantin Gonzalez. OpenSolaris ZFS Deduplication: Everything You Need to Know. https://constantin.glez.de/2010/03/16/-zfs-deduplication-everything-you-need-know/, 2010.

[33] Constantin Gonzalez. ZFS: To Dedupe or not to Dedupe... https://constantin.glez.de/2011/07/27/zfs-to-dedupe-or-not-dedupe/, 2011.

[34] D. Harnik, O. Margalit, D. Naor, D. Sotnikov, and G. Vernik. Estimation of deduplication ratios in large data sets. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–11, 2012.

[35] Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov. Estimating Unseen Deduplication-from Theory to Practice. In FAST, pages 277–290, 2016.

[36] Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Side Channels in Cloud Services: Deduplication in Cloud Storage. 8:40–47, 2011.

[37] Val Henson, editor. An Analysis of Compare-by-hash, 2003.

[38] Val Henson, Matt Ahrens, and Jeff Bonwick. Automatic performance tuning in the Zettabyte File System. File and Storage Technologies (FAST), work in progress report, 2003.

[39] H. Hu, Y. Wen, T. S. Chua, and X. Li. Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access, 2:652–687, 2014.

[40] iXsystems. OpenZFS – Communities co-operating on ZFS code and features. https://www.freebsdnews.com/2013/09/23/-communities-co-operating-on-zfs-code-and-features/, 2013.

[41] Boban Joksimoski and Suzana Loskovska. Overview of Modern File Systems. 2010.

[42] Tim Jones. Next-generation Linux file systems: NiLFS(2) and exofs: Advancing Linux file systems with logs and objects. https://www.ibm.com/developerworks/library/l--exofs/index.html, 2009.

[43] Sakis Kasampalis. Copy on write based file systems performance analysis and implementation. Kongens Lyngby, 2010.

[44] Vamsee Kasavajhala. Solid state drive vs. price and performance study. Proc. Dell Tech. White Paper, pages 8–9, 2011.

[45] Dominic Kay. How to Determine Memory Requirements for ZFS Deduplication. http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-113-size-zfs-dedup-1354231.html, 2011.

[46] Chulmin Kim, Ki-Woong Park, and Kyu Ho Park. GHOST: GPGPU-offloaded High Performance Storage I/O Deduplication for Primary Storage System. In Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM ’12, pages 17–26, New York, NY, USA, 2012. ACM.

[47] Ali Koc and Abdullah Uz Tansel. A survey of version control systems. ICEME 2011, 2011.

[48] Ricardo Koller and Raju Rangaswami. I/O deduplication: Utilizing content similarity to improve I/O performance. ACM Transactions on Storage (TOS), 6(3):13, 2010.

[49] Ryusuke Konishi, Yoshiji Amagai, Koji Sato, Hisashi Hifumi, Seiji Kihara, and Satoshi Moriai. The Linux implementation of a log-structured file system. ACM SIGOPS Operating Systems Review, 40(3):102–107, 2006.

[50] Ryusuke Konishi, Koji Sato, and Yoshiji Amagai. Filesystem support for Continuous Snapshotting, 2007.

[51] E. Lee, J. E. Jang, T. Kim, and H. Bahn. On-Demand Snapshot: An Efficient for Phase-Change Memory. IEEE Transactions on Knowledge and Data Engineering, 25(12):2841–2853, 2013.

[52] Thorsten Leemhuis. Kernel-Log – Was 2.6.34 bringt (2): Dateisysteme. https://heise.de/-984551, 2010.

[53] Bin Lin, Shanshan Li, Xiangke Liao, Jing Zhang, and Xiaodong Liu. Leach: An automatic learning cache for inline primary deduplication system. Frontiers of Computer Science, 8(2):175–183, 2014.

[54] Jon Loeliger and Matthew McCullough. Version Control with Git: Powerful tools and techniques for collaborative software development. O’Reilly Media, Inc, 2012.

[55] Maohua Lu, David Chambliss, Joseph Glider, and Cornel Constantinescu. Insights for Data Reduction in Primary Storage: A Practical Analysis. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR ’12, pages 17:1–17:7, New York, NY, USA, 2012. ACM.

[56] Michael Lucas and Allan Jude. FreeBSD Mastery: ZFS. Tilted Windmill Press, 2015.

[57] J. Malhotra and J. Bakal. A survey and comparative study of data deduplication techniques. In 2015 International Conference on Pervasive Computing (ICPC), pages 1–5, 2015.

[58] Dirk Meister and André Brinkmann, editors. Multi-level comparison of data deduplication in a backup scenario. ACM, 2009.

[59] Dirk Meister, Jurgen Kaiser, Andre Brinkmann, Toni Cortes, Michael Kuhn, and Julian Kunkel. A study on data deduplication in HPC storage systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11, 2012.

[60] Nick Merrill. Better Not to Know? The SHA1 Collision & the Limits of Polemic Computation. In Proceedings of the 2017 Workshop on Computing Within Limits, LIMITS ’17, pages 37–42, New York, NY, USA, 2017. ACM.

[61] Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, and Margo Seltzer. File classification in self-* storage systems. In Autonomic Computing, 2004. Proceedings. International Conference on, pages 44–51, 2004.

[62] Dutch T. Meyer and William J. Bolosky. A study of practical deduplication. ACM Transactions on Storage (TOS), 7(4):14, 2012.

[63] Athicha Muthitacharoen, Benjie Chen, and David Mazieres, editors. A low-bandwidth network file system, volume 35. ACM, 2001.

[64] Mark R. Nelson. LZW data compression. Dr. Dobb’s Journal, 14(10):29–36, 1989.

[65] Mohamad Zaini Nurshafiqah, Nozomi Miyamoto, Hikari Yoshii, Riichi Kodama, Itaru Koike, and Toshiyuki Kinoshita. Data deduplication for Similar Files. 2017.

[66] Martin Christoffer Aasen Oppegaard. Evaluation of Performance and Space Utilisation When Using Snapshots in the ZFS and Hammer File Systems. PhD thesis, 2009.

[67] Oracle. Architectural Overview of the Oracle ZFS Storage Appliance: Oracle White Paper, 2016.

[68] João Paulo and José Pereira. A Survey and Classification of Storage Deduplication Systems. ACM Comput. Surv., 47(1):11:1–11:30, 2014.

[69] João Paulo and José Pereira. Efficient Deduplication in a Distributed Primary Storage Infrastructure. Trans. Storage, 12(4):20:1–20:35, 2016.

[70] Calicrates Policroniades and Ian Pratt. Alternatives for Detecting Redundancy in Storage Systems Data. In USENIX Annual Technical Conference, General Track, pages 73–86, 2004.

[71] Pasquale Puzio, Refik Molva, Melek Önen, and Sergio Loureiro. Block-level De-duplication with Encrypted Data. Open Journal of Cloud Computing (OJCC), 1(1):10–18, 2014.

[72] Sean Quinlan and Sean Dorward, editors. Venti: A New Approach to Archival Storage, volume 2, 2002.

[73] Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-structured File System. SIGOPS Oper. Syst. Rev., 25(5):1–15, 1991.

[74] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.

[75] David Sacks. Demystifying storage networking das, san, nas, nas gateways, fibre channel, and . IBM storage networking, pages 3–11, 2001.

[76] Julian Satran and Kalman Meth. Internet small computer systems interface (iSCSI). 2004.

[77] Norbert Schramm. ZFS on Linux Performance Evaluation. Masterprojekt, Universität Hamburg, Hamburg, 2016.

[78] Bhavana Shah. Disk performance of copy-on-write snapshot logical volumes. master degree thesis, The University Of British Columbia, 2006.

[79] Spencer Shepler, Mike Eisler, David Robinson, Brent Callaghan, Robert Thurlow, David Noveck, and Carl Beame. Network file system (NFS) version 4 protocol. Network, 2003.

[80] Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, and Yarik Markov. The first collision for full SHA-1. In Annual International Cryptology Conference, pages 570–596, 2017.

[81] Andrew S. Tanenbaum. Modern operating system. Pearson Education, Inc, 2009.

[82] Vasily Tarasov, Amar Mudrankit, Will Buik, Philip Shilane, Geoff Kuenning, and Erez Zadok. Generating Realistic Datasets for Deduplication Analysis. In USENIX Annual Technical Conference, pages 261–272, 2012.

[83] Markus Then, Helmut Hayd, and Ingolf Brunner. ZFS unter Linux: Evaluierung und Optimierung. PhD thesis, Berufsakademie Sachsen Leipzig, 2017.

[84] Walter F. Tichy. RCS—a system for version control. Software: Practice and Experience, 15(7):637–654, 1985.

[85] Oracle Tim Chien. Snapshots Are NOT Backups: Comparing Storage-based Snapshot Technologies with Recovery Manager (RMAN) and Fast Recovery Area for Oracle Databases.

[86] Andrew Tridgell, Paul Mackerras, et al. The rsync algorithm. 1996.

[87] Ulf Troppens, Rainer Erkens, and Wolfgang Müller. Speichernetze. Grundlagen und Einsatz von Fibre–Channel SAN, NAS, iSCSI und InfiniBand. iX Edition. Heidelberg: dpunkt–Verlag, 2003.

[88] Surendra Verma, Thomas J. Miller, and Robert G. Atkinson. Transactional file system, 2009.

[89] Jiansheng Wei, Junhua Zhu, and Yong Li. Multimodal Content Defined Chunking for Data Deduplication. 2014.

[90] E. D. Widianto, A. B. Prasetijo, and A. Ghufroni. On the implementation of ZFS (Zettabyte File System) storage system. In 2016 3rd International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), pages 408–413, 2016.

[91] WU Wien. Open Data Portal Watch: Quality Assessment and Monitoring of 260 Open Data Portals. http://data.wu.ac.at/portalwatch/about.

[92] W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang, and Y. Zhou. A Comprehensive Study of the Past, Present, and Future of Data Deduplication. Proceedings of the IEEE, 104(9):1681–1710, 2016.

[93] W. Xia, H. Jiang, D. Feng, and L. Tian. Combining Deduplication and Delta Compression to Achieve Low-Overhead Data Reduction on Backup Datasets. In 2014 Data Compression Conference, pages 203–212, 2014.

[94] Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Qing Liu, and Yucheng Zhang. FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 101–114, Denver, CO, 2016. USENIX Association.

[95] Z. Yan, W. Ding, X. Yu, H. Zhu, and R. H. Deng. Deduplication on Encrypted Big Data in Cloud. IEEE Transactions on Big Data, 2(2):138–150, 2016.

[96] Tatu Ylonen and Chris Lonvick. The secure shell (SSH) protocol architecture. 2006.

[97] Yucheng Zhang, Hong Jiang, Dan Feng, Wen Xia, Min Fu, Fangting Huang, and Yukun Zhou. AE: An Asymmetric Extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication. 2015 IEEE Conference on Computer Communications (INFOCOM), pages 1337–1345, 2015.

[98] Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. End-to-end Data Integrity for File Systems: A ZFS Case Study. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST’10, page 3, Berkeley, CA, USA, 2010. USENIX Association.

[99] Ruijin Zhou, Ming Liu, and Tao Li. Characterizing the efficiency of data deduplication for big management. In workload characterization (IISWC), 2013 IEEE International Symposium on, pages 98–108, 2013.

[100] Benjamin Zhu, Kai Li, and R. Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Fast, volume 8, pages 1–14, 2008.

9 Appendix

The functions of the zfs and zpool commands are well documented in the Linux man pages39 and in the FreeBSD manual pages40 41. In the following, the commands that we used most of the time for checking statistics and exporting them to files via pipes are listed.

General ZFS Statements

sudo zpool list
sudo zpool status -D [poolname]
sudo zfs get all [poolname]

Statements for Data Analysis and Transfer

du -h * -d 3 > /zfs/diskusage.txt
rsync -av --progress --stats -n * [destination]
gunzip -r *
7z x .tar.7z

Benchmarking and setting Parameters

# Deduplication
sudo zfs set dedup=on [poolname]
sudo zpool status -D [poolname]
sudo zpool list [poolname]

# Compression
sudo zfs set compression=on [poolname]
sudo zfs get compressratio [poolname]

# Recordsize
sudo zfs set recordsize=4k [poolname]   # also 128k, 1M
sudo zfs get all [poolname]

# Bonnie Benchmarking
bonnie >> file-name.csv

39’man zfs’ or ’man zpool’ 40https://www.freebsd.org/cgi/man.cgi?query=zfs&sektion=8 41https://www.freebsd.org/cgi/man.cgi?zpool(8)
