Bachelor Thesis
Efficient dataset archiving and versioning at large scale
Bernhard Kurz

Subject Area: Information Business
Degree Programme Code (Studienkennzahl): 033 561
Supervisor: Fernández García, Javier David, Dr.
Co-Supervisor: Neumaier, Sebastian, Dipl.-Ing., B.Sc.
Date of Submission: 23 June 2018

Department of Information Systems and Operations, Vienna University of Economics and Business, Welthandelsplatz 1, 1020 Vienna, Austria

Contents

1 Introduction
  1.1 Motivation
  1.2 Outline of research
2 Requirements
  2.1 Archiving and Versioning of datasets
  2.2 Performance and Scalability
3 Background and Related Work
  3.1 Storage Types
  3.2 Version Control Systems
    3.2.1 DEC's VMS
    3.2.2 Subversion
    3.2.3 GIT
    3.2.4 Scalable Version Control Systems
  3.3 Redundant Data Reduction Techniques
    3.3.1 Compression
  3.4 Deduplication
    3.4.1 Basic Workflow
    3.4.2 Workflow Improvements
    3.4.3 Key Design Decisions
4 Operating Systems and Filesystems
  4.1 Definitions
    4.1.1 Operating System
    4.1.2 File System
    4.1.3 Data Access
  4.2 Classic vs Modern Filesystems
  4.3 ZFS
    4.3.1 Scalability
    4.3.2 Virtual Devices
    4.3.3 ZFS Blocks
    4.3.4 ZFS Pools
    4.3.5 ZFS Architecture
  4.4 Data Integrity and Reliability
    4.4.1 Replication
  4.5 Transactional Semantics
    4.5.1 Copy-on-Write and Snapshots
  4.6 Btrfs
  4.7 NILFS
  4.8 Operating Systems and Solutions
5 Approaches
  5.1 Version Control Systems
  5.2 Snapshots and Clones
  5.3 Deduplication
6 Design of the Benchmarks
  6.1 Datasets
    6.1.1 Portal Watch Datasets
    6.1.2 BEAR Datasets
    6.1.3 Wikipedia Dumps
  6.2 Test System and Methods
    6.2.1 Testing Environment
    6.2.2 Benchmarking Tools
    6.2.3 Bonnie++
  6.3 Conditions and Parameters
    6.3.1 Deduplication
    6.3.2 Compression
    6.3.3 Blocksize
7 Results
  7.1 Performance of ext4
  7.2 Performance of ZFS
  7.3 Deduplication Ratios
    7.3.1 Portal Watch Datasets
    7.3.2 BEAR Datasets
    7.3.3 Wikipedia Dumps
  7.4 Blocksize Tuning
  7.5 Compression Performance
    7.5.1 Compression Sizes
    7.5.2 Compression Ratios
8 Conclusion and further research
9 Appendix

List of Figures

1 Redundant Data Reduction Techniques and approximate dates of initial research[92]
2 Overview of a Deduplication Process
3 Block-based vs object-based storage systems[42]
4 Traditional Volumes vs ZFS Pooled Storage[12]
5 ZFS Layers[12]
6 Disk usage of datasets in GB per month
7 Bonnie Benchmark for ext4
8 Block I/O for ZFS
9 Block I/O for ext4 compared to ZFS
10 Deduplication Table for the Portal Watch Datasets
11 Block I/O for ZFS with blocksize = 4k and compression or deduplication
12 Block I/O for ZFS with blocksize = 128k and compression or deduplication
13 Block I/O for ZFS with blocksize = 1M and compression or deduplication
14 Compression sizes in GB for the datasets
15 Compression ratios for the datasets

List of Tables

1 BEAR Datasets Compression Ratios
2 Wikipedia Dumps Compression Ratios
3 Compression Ratios for different Compression Levels per Dataset

Abstract

As the amount of generated data increases rapidly, there is a strong need to reduce storage costs. In addition, the requirements for storage systems have evolved, and archiving and versioning data have become a major challenge.
Modern filesystems target key features such as scalability, flexibility, and data integrity. Furthermore, they provide mechanisms for reducing data redundancy and for version control. This thesis discusses recent related research fields and evaluates how well archiving and versioning can be integrated into the filesystem layer to reduce administration overhead. The effectiveness of deduplication and compression is benchmarked on several datasets to assess scalability and feasibility. Finally, a conclusion and an overview of further research are given.

1 Introduction

Because the amount of data is growing exponentially, storage costs are increasing as well. On the one hand, prices for hard disk drives have fallen; on the other hand, requirements have grown, so storing large amounts of data can still be expensive. In computing, file systems are used to control how data is handled and stored on one or more physical devices.

"Some of today's most popular filesystems are, in computing scale, ancient. We discard hardware because it's five years old and too slow to be borne—then put a 30-year-old filesystem on its replacement. Even more modern filesystems like extfs, UFS2, and NTFS use older ideas at their core."[56]

Classic filesystems have become less practical at fulfilling these evolved demands. In many cases it is simply easier to create duplicates than to sort files or manage versions by hand. For archiving and backup purposes there is a demand for version histories, so that one can "look back" in time and recreate older versions of documents or other kinds of data. Due to their nature, backups allocate a lot of storage, even if there have been only slight changes. To save disk space, incremental backups are used, but these can make recovering older files complex and laborious; they also have a considerable impact on performance and make versions difficult to manage. Another option is an application-based version control system such as Git, but these systems become slow and problematic when it comes to large files, scalability, or non-text and binary files.[9, 47] Another weakness of classic filesystems is their size and partition limits, which may still be sufficient for the next five to ten years, but thinking at a larger scale means that they will have to be revisited and adapted in the future. In addition, there is so-called bitrot, also known as silent data corruption.[6] Bitrot is problematic because it is largely invisible and can cause serious data inconsistency. Modern filesystems such as ZFS and Btrfs address these problems: to counter this phenomenon, storage systems use levels of redundancy (RAID) combined with checksums. Overall, this thesis examines how well these requirements can be fulfilled within the filesystem layer. The aim is to compare the relevant features and find a plausible solution for the specific use case described below.

1.1 Motivation

The Department of Information Systems and Operations at the Vienna University of Economics and Business hosts the Open Data Portal Watch. The project's aim is to provide a "scalable quality assessment and evolution monitoring framework for Open Data (Web) portals".[91] To this end, datasets from around 260 web catalogues are downloaded and saved to disk every week. To reduce storage costs, each dataset is hashed and only saved if it has changed; otherwise, it is merely logged in a database, which still enables access to older versions and to the datasets' histories.
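A minimal sketch of such a file-level, hash-based archiving step is given below; the paths, database layout, and function names are illustrative assumptions and not the portal's actual implementation.

    import hashlib
    import sqlite3
    from datetime import datetime, timezone
    from pathlib import Path

    # Hypothetical version log: one row per weekly fetch of a dataset.
    db = sqlite3.connect("portalwatch_versions.db")
    db.execute("""CREATE TABLE IF NOT EXISTS versions
                  (dataset TEXT, fetched_at TEXT, sha256 TEXT, stored_as TEXT)""")

    def archive(dataset_id: str, payload: bytes, store_dir: Path) -> None:
        """Store a downloaded dataset only if its content hash is new;
        always log the fetch so the version history stays complete."""
        digest = hashlib.sha256(payload).hexdigest()
        target = store_dir / f"{dataset_id}_{digest}.csv"
        if not target.exists():          # content changed -> keep a new copy
            target.write_bytes(payload)
        db.execute("INSERT INTO versions VALUES (?, ?, ?, ?)",
                   (dataset_id, datetime.now(timezone.utc).isoformat(),
                    digest, str(target)))
        db.commit()

Identical content is thus stored only once, while every fetch is still recorded and older versions remain reconstructible from the log.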
In practice, a change can mean adding just one line or removing a few rows from a dataset, which means there can be thousands of duplicates or, strictly speaking, near-duplicates. Since this scheme is, simply put, deduplication on a file basis, we wanted to test how much additional gain chunk-level (block-level) deduplication can provide (a brief illustration is sketched at the end of Section 2).

1.2 Outline of research

The scope of this thesis includes a literature review of modern filesystems such as ZFS and Btrfs and of the versioning filesystem NILFS. They will be compared with features known from version control systems such as Git and Subversion, as well as with deduplication. Particular attention will be paid to the scalability and performance of these implementations. In summary, the following aspects will be reviewed and compared:

Targeted Factors
• Saving Disk Space
• I/O Performance

Required Factors
• Archiving and Versioning
• Performance and Scalability

Additional Factors
• Hardware Requirements
• Compatibility
• Reliability

Approaches
• Version Control Systems
• ZFS Snapshots and Clones
• ZFS Deduplication

A model that shows the best ratio between the two targeted factors would be one possible way to structure and visualize the results; these two factors can be derived from the requirements. It may, however, be difficult to reduce all factors to such a model without losing too much information. The essential part is the testing and benchmarking of the approaches. Given the nature of filesystems and the importance of data consistency, they should be tested over a long period of time to satisfy the demands and to eliminate as many errors as possible. It will also be discussed how well specific filesystems have been documented and tested.

2 Requirements

Important for the Open Data Portal Watch are the archiving and versioning of datasets and the access to them. On the one hand, this resembles a backup scheme for user data; on the other hand, it is, strictly speaking, primary storage, because it requires performant read access. For backup purposes, data is often compressed and deduplicated to save space, but for active storage this can be too slow. It is given that the storage system uses spinning disks, and hard disk drives have a very low rate of I/O operations per second compared to solid-state disks.[44] Furthermore, the access pattern is random, because datasets are added from time to time rather than written sequentially.
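To make the contrast between the file-level scheme above and the chunk-level deduplication raised in Section 1.1 concrete, the following sketch estimates the deduplication ratio that fixed-size chunking would achieve across several versions of a dataset. The chunk size, file names, and helper functions are illustrative assumptions and do not correspond to ZFS internals.

    import hashlib
    from pathlib import Path

    CHUNK_SIZE = 128 * 1024  # 128 KiB fixed-size chunks, analogous to a filesystem block size

    def chunk_hashes(path: Path) -> list[str]:
        """Split a file into fixed-size chunks and hash each chunk."""
        hashes = []
        with path.open("rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                hashes.append(hashlib.sha256(chunk).hexdigest())
        return hashes

    def dedup_ratio(paths: list[Path]) -> float:
        """Total number of logical chunks divided by the number of distinct
        chunks, i.e. the saving if identical chunks were stored only once
        (metadata overhead is ignored)."""
        total, unique = 0, set()
        for p in paths:
            hashes = chunk_hashes(p)
            total += len(hashes)
            unique.update(hashes)
        return total / len(unique) if unique else 1.0

    # Example with two hypothetical weekly versions of the same dataset:
    # print(dedup_ratio([Path("dataset_week01.csv"), Path("dataset_week02.csv")]))

Note that with fixed-size chunks, inserting a single row near the beginning of a file shifts all subsequent chunk boundaries and reduces the achievable ratio, which is one reason the block size is treated as a tuning parameter in the benchmarks.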