
Adaptive Filesystem Compression for General Purpose Systems

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Jonathan Beaulieu

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

Peter Andrew Harrington Peterson

June 2018

© Jonathan Beaulieu 2018

Acknowledgements

I would first like to thank my advisor Dr. Peter Peterson for his always available support and countless hours of reading my rough writing. He also guided me through picking a thesis topic I found interesting and followed through by providing insightful ideas.

I am grateful for each of my committee members: Professor Peter Peterson, Professor Pete Willemsen and Professor Dalibor Froncek. They have all had a great impact on my education and way of thinking, for which I consider myself very lucky.

I would like to thank my LARS labmates: Brandon Paulsen, Ankit Gupta, Dennis

Asamoah Owusn, Alek Straumann and Maz Jindeel. Each of them has provided support and camaraderie in the lab.

To the man I sat next to for the past two years, Václav Hasenöhrl, thank you for being a great classmate, desk buddy and mathematics consultant.

I would like to send my deepest apologies to all of the reviewers of this thesis:

Xinru Yan, Alek Straumann, Conrad Beaulieu and lastly my committee members. Thanks for correcting my countless spelling mistakes and improving the readability of the thesis.

I would like to thank the department staff, Jim, Lori, Clare and Kelsey, for their work that played a role in this thesis albeit behind the scenes.

This thesis would not have been possible to finish without the support of my family and friends. Especially my girlfriend, fiancée and finally wife, Xinru Yan, for putting up with late nights and a person with water in their head. She spent countless hours of work so that I could spend time on this thesis.

Abstract

File systems have utilized compression to increase the amount of storage available to the user. However, compression comes with the risk of increased latency and wasted computation when compressing incompressible data. We used

Adaptive Compression (AC) to increase disk write speeds using compression, while not falling victim to incompressible data. Our AC system decides which compression algorithm to use from a list of available algorithms based on a model that queries CPU usage, disk bandwidth and current write queue length, and estimates compressibility using

“bytecounting,” an entropy-like statistic. Our model learns over time by building a table of estimated outcomes for each state and averaging previous outcomes for that state. We integrated our AC system into ZFS's write pipeline to reduce overall write times. In benchmarks with several standard datasets, our solution increased average write speeds by up to 48% compared to the baseline. For worst-case (incompressible) data, our system decreased write speeds by 5%, due to system overhead. Compared to a previous ZFS AC system, our system does not perform statistically worse on any dataset and writes datasets with many small compressible files up to 49% faster.

Contents

Contents
List of Tables
List of Figures

1 Introduction

2 Background
2.1 Overview
2.1.1 Compression
2.1.2 Current Modern Compression Algorithms
2.1.3 Adaptive Compression
2.1.4 File System Compression
2.1.5 ZFS
2.2 Related Work
2.2.1 ZFS smart compression
2.2.2 Quality of service improvement in ZFS through compression

3 Implementation
3.1 Adaptive Compression System Design
3.1.1 Methods
3.1.2 Monitors
3.1.3 Model
3.1.4 Estimating Model
3.2 Integration into ZFS
3.2.1 ZFS Write Pipeline
3.2.2 AC Compress Function

4 Experiment
4.1 Methodology
4.2 Environment
4.3 Datasets

5 Results
5.1 Raw Results
5.2 Result Analysis
5.2.1 Write Speed
5.2.2 Compression Choices

6 Conclusions
6.1 Future work

References

List of Tables

2.1 Auto Compress decision process
3.1 Monitor buckets
4.1 Systems
4.2 Experiment procedure
4.3 Sizes of datasets

List of Figures

1.1 State Diagram for Simple on/off
3.1 Diagram of “Bucketization”
3.2 The matrix used for storing estimated values
3.3 ZFS write pipeline with AC
4.1 Compression Ratio per block using gzip-1 for dataset enwiki
4.2 Compression Ratio per block using gzip-1 for dataset mozilla
4.3 Compression Ratio per block using gzip-1 for dataset sao
4.4 Compression Ratio per block using gzip-1 for dataset pdf
4.5 Compression Ratio per block using gzip-1 for dataset web
4.6 Compression Ratio per block using gzip-1 for dataset random
5.1 Total Writing Time for each system to write enwiki
5.2 Total Writing Time for each system to write mozilla
5.3 Total Writing Time for each system to write sao
5.4 Total Writing Time for each system to write pdf
5.5 Total Writing Time for each system to write web
5.6 Total Writing Time for each system to write random
5.7 Average throughput for each system while writing enwiki
5.8 Average throughput for each system while writing mozilla
5.9 Average throughput for each system while writing sao
5.10 Average throughput for each system while writing pdf
5.11 Average throughput for each system while writing web
5.12 Average throughput for each system while writing random
5.13 Compression choices per run made by n1kl and lars for enwiki
5.14 Compression choices per run made by n1kl and lars for mozilla
5.15 Compression choices per run made by n1kl and lars for web

1 Introduction

To understand the crux of this thesis, it is important to understand two fundamental properties of compression algorithms. The first property is that compression algorithms work by exploiting patterns found in the data they are compressing. This fundamental property has resulted in algorithms being better at compressing some data types than others, either by requiring fewer resources or by producing smaller output. A good example of this is the FLAC compression algorithm, which was designed specifically to compress digital audio files. It achieves a great compression ratio on most audio while using relatively few resources. On the other hand, it is very inefficient at compressing images. The second fundamental property is that there is a trade-off between the achievable compression ratio and the required computational resources. Since compression at its core is pattern matching or computing statistics, in general more time, energy and space are required to achieve better compression. This trade-off is so well established that most modern compression algorithms allow for some tunability of how many resources to use. This value is normally referred to as level or strength. A lower value indicates the algorithm will use fewer resources and, as a result, will do less compression and have a worse compression ratio. The opposite is true of larger values. Even with the downsides of these properties, countless software systems use compression algorithms as a means to improve their efficiency.

File systems are a critical component of every operating system, as they handle all long-term information storage. As a result, they can become a major performance bottleneck, especially when it comes to reading and writing data.

Many improvements in performance have come through caching and exploiting the physical structure of the storage devices. However, comparatively little attention has been paid to using compression as a means of increasing disk performance.

Figure 1.1: State Diagram for Simple on/off. (The system starts in the Comp state, moves to NoComp when the compression ratio falls below a threshold, and returns to Comp after some time.)

Compression has long been seen as a way to increase the density of information on the disk. However, this has come with the price of decreased I/O speeds and wasted computational power from compressing already compressed data. A possible solution to these drawbacks is Adaptive Compression (AC). AC systems make a decision about whether or how to compress a given input. Two naïve types of AC systems that have been used in the past are simple on/off methods [7, 12] and making decisions based on file types [5]. Simple on/off methods are based on the idea that data that is close together, either temporally or spatially, should exhibit similar entropy properties.

In other words, if the first few bytes of a file are easily compressible then the next few bytes should also be, and vice versa. This method is normally implemented by compressing all data by default and checking the compression ratio of the last X bytes. If those bytes have a compression ratio worse than a threshold, then the system will stop compressing for a set amount of time or bytes. After disabling compression, the system will reset and start compressing again after a predefined interval. This algorithm is modeled in Figure 1.1. Systems in this group lack the potential to gain

the benefits of using different compression algorithms and/or different compression levels for different data types.
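To make the on/off approach concrete, the sketch below shows the decision logic of Figure 1.1 in Python. It is a minimal illustration rather than any particular implementation; the threshold and back-off size are assumptions chosen only for the example.

```python
# Minimal sketch of the simple on/off adaptive scheme in Figure 1.1.
# RATIO_THRESHOLD and BACKOFF_BYTES are illustrative values, not taken
# from any real system.
RATIO_THRESHOLD = 1.05    # below this, recent data is treated as incompressible
BACKOFF_BYTES = 1 << 20   # how many bytes to skip before trying compression again

class OnOffCompressor:
    def __init__(self, compress):
        self.compress = compress        # callable: bytes -> compressed bytes
        self.skip_remaining = 0         # > 0 means we are in the "NoComp" state

    def write_block(self, block: bytes) -> bytes:
        if self.skip_remaining > 0:               # NoComp: pass data through untouched
            self.skip_remaining -= len(block)
            return block
        out = self.compress(block)                # Comp: compress by default
        if len(block) / len(out) < RATIO_THRESHOLD:
            self.skip_remaining = BACKOFF_BYTES   # recent data compressed poorly; back off
        return out if len(out) < len(block) else block
```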

Using file types as a way to easily identify data types seems like a good solution to the problem with on/off systems; however, it also has its own drawbacks. In practice this group of systems is implemented using a mapping of file types to compression algorithms. The file types are normally identified by the file extension or Multipurpose

Internet Mail Extensions (MIME) type. These systems require a hard-coded table, which would be optimal if it could be complete and accurate. However, a table can only be accurate if the following two assumptions are correct: 1) there is always a single best algorithm per file and 2) the best algorithm is not dependent on the current environment of the computer. These assumptions do not hold true in most cases. Files with the same file type are allowed to contain multiple types of data, some of which could be very compressible for a given algorithm and some of which could be the worst case for that same algorithm. On modern multi-tasking systems it is important to not forget that other tasks may be competing for resources. This means that at times the compression algorithm may only be able to use half of the total resources, causing it to compress much more slowly.

Recently, there has been some work on developing more complex algorithms to increase the quality of the compression choices being made. Niklas Bunge created an AC system that chooses a compression algorithm based on the number of queued blocks waiting to be written and a rolling average of previous write speeds. [2] His system organizes the compression algorithms it uses into a list ordered from fastest to slowest. Notice that it allows for the use of different compression levels by treating each pair of compression algorithm and level as a unique algorithm. It switches to a faster compression algorithm when the estimated time to compress the currently queued writes, given the average compression speed of the current

compression algorithm, exceeds the estimated time to write the data to the disk. Conversely, it switches to a slower compression algorithm when it estimates that the

compression algorithm will require less time than the disk write.

The goal of this research is to explore other general AC methods that can improve

the state-of-the-art file system AC system. We compared a more dynamic model, which uses different factors for choosing an algorithm, against Niklas Bunge's work. We created an

AC system inside a file system based on the general AC system, Datacomp [13], which

uses a list of available compression algorithms and creates a model that learns which

algorithm works best for different types of data over time. We have posted the source

for this system on GitHub.¹ This thesis explores the effectiveness of AC methods that build upon previous work in the field by applying them to file systems. Our results provide valuable insight for future work into AC systems for both the compression and file system fields. We found that simple byte counting is an accurate enough compressibility measure to make compression choices which lead to improved I/O speeds. Other inputs to our

AC system included CPU usage, available bandwidth and current write queue length

(based on Bunge's work). We found that these inputs, together with our system's design, resulted in a system that wrote data up to 48% faster than our baseline, no compression, and 49% faster than Bunge's system.

¹ https://github.com/derpferd/zfs/tree/lars

2 Background

2.1 Overview

The purpose of this thesis is to add adaptive compression to a file system with the goal of providing a speedup for disk I/O.

2.1.1 Compression

Compression is a process with the goal of taking information encoded with bytes and changing the representation of the information to an encoding that requires fewer bytes. There are two types of compression: lossy and lossless. Lossy compression is practical for information when it is acceptable to only keep an approximation of the information. Applications of this include compressing images and audio, because some quality can be lost without a human being able to notice. An example is an algorithm which, like every compression algorithm, uses patterns to compress information. When matching patterns in the image, instead of requiring an exact match, it could allow for some close matches. This slightly distorts the image in a way not noticeable to a human while resulting in a smaller representation of what seems to be the same information to a human. However, there are other types of data like programs, documents and configuration files where this loss of information is not acceptable. For information of this type we use lossless compression, which encodes the information into a smaller representation that is fully reversible, meaning it can

compress a file then decompress the compressed file to get a file exactly the same as the original file. One of the expectations of a file system is that information read back will be exactly the same as when it was written. In this paper we will use many different compression algorithms with different trade-offs; however, all of them will be lossless. The metric used in the compression field to measure how effective a compression algorithm is on some data is the compression ratio. The compression ratio is the size of the original data over the size of the data after the compression algorithm has compressed it.

$$\text{compression ratio} = \frac{\text{size of input data}}{\text{size of output data}}$$

This means that larger compression ratios are better since the size of the output is smaller.

Types of lossless compression

Under the category of lossless compression there are many different algorithms, each with their own trade-offs. There are two main types of general lossless compression methods: 1) statistical methods and 2) dictionary-based methods. Many times one or both of these methods are applied together with other basic compression techniques to create current compression algorithms. [17]

Statistical methods

Statistical methods use the fact that the symbols (e.g., a unique byte) in a file have a non-uniform probability of appearing.[17] Natural languages are a great example. One of the first examples of someone designing a statistical compression scheme is when Samuel Morse created a telegraph encoding: instead of every letter taking the

same amount of dots or dashes, the most used letters require fewer and the least used letters require more. However, his method was designed for the English language only.

A very popular statistical method is Huffman Coding. [6] Huffman Coding assigns variable length codes to symbols based on the probability of each symbol appearing in the input using a binary tree.
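As a brief illustration of the idea, the sketch below builds a Huffman code table with Python's heapq module; symbols that appear more often receive shorter codes. It is a textbook construction, not code from this thesis or from ZFS.

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code table: frequent bytes get shorter bit strings."""
    freq = Counter(data)
    # Heap entries are (frequency, tie-breaker, node); a node is either a byte
    # value (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                       # degenerate input with one distinct byte
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):          # internal node: recurse into both children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the accumulated code
            codes[node] = prefix
    walk(heap[0][2])
    return codes

# Example: in b"to be, or not to be" the space and "o" receive the shortest codes.
```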

Dictionary methods

Dictionary compression is based on the idea that files have repeating sequences in them. [17] For example if we have the phrase “to be, or not to be”, we can see that a group of symbols repeat. Therefore, we could just store “to be, or not” and a description of how the first and second words repeat at the end. This style of compression was first practically applied by Abraham Lempel and Jacob Ziv in their

1977 and 1978 papers. [18, 19] Their compression algorithms, LZ77 and LZ78, are the basis for many other algorithms, forming the LZ family of compression algorithms.
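A rough sketch of the LZ77 idea follows: the compressor scans a sliding window of recent data for the longest match and replaces a repeat with an (offset, length) back-reference, falling back to literal bytes otherwise. This is a simplified greedy illustration under assumed window and match-length parameters, not the original algorithm's exact tokenization.

```python
def lz77_compress(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy LZ77-style parsing into literals and (offset, length) back-references."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):        # search the sliding window
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append(("ref", best_off, best_len))   # e.g. the second "to be" above
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:                                         # copy byte-by-byte so overlapping matches work
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])
    return bytes(out)

# lz77_decompress(lz77_compress(b"to be, or not to be")) == b"to be, or not to be"
```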

2.1.2 Current Modern Compression Algorithms

Data compression remains an open problem, with new and improved solutions appearing periodically. Table 2.1.2 lists the most common lossless compression algorithms in use today.

Each compression algorithm has its own drawbacks due to either speed or poor compression of some data types. The file system that we will be working with in this thesis currently only implements Deflate, in the form of gzip - a “slow” algorithm, and lz4 - a “fast” algorithm.

Table 2.1.2: Common lossless compression algorithms in use today

Name      Date            Methods Used                          Description
Deflate   1990            LZ methods and Huffman Coding         Widely adopted compressor. Medium speed with medium compression ratios. Has been implemented in hardware on some systems.
bzip2     1996            RLE, BWT, MTF, Huffman coding         Slower speed and high compression ratios.
LZMA      1996 or 1998    LZ methods and range encoding         Very slow compression and fast decompression with high compression ratios.
LZ4       2011            LZ methods                            Fast compression and very fast decompression. Poor compression ratio.
zstd      2015            LZ methods and Finite State Entropy   Relatively fast while maintaining high compression ratios.
Snappy    2011            LZ methods                            Fast compression and decompression. Poor compression ratios.

2.1.3 Adaptive Compression

Adaptive Compression (AC) takes some state and a set of compression algorithms and decides which algorithm, if any, to apply to the data. The simplest type of AC utilizes a single compression algorithm and simply decides to compress, or not.

Many applications use this primitive AC method, including OpenVPN and ZFS Smart

Compression.[8, 12] The algorithm for deciding whether to compress samples the compression ratio: if it is above a threshold it will continue compressing; otherwise it will stop compressing for a set amount of time before restarting. Much better algorithms have been made using different approaches. Some theoretical algorithms have been developed that have not been implemented in a real-world file system. [1]

However, for this thesis we have decided to use a general and practical algorithm, and therefore we will not be able to compare with these systems.

8 Datacomp

Peter Peterson developed an AC system, Datacomp, focusing on the compression of network traffic, using AC to increase the rate of transfer. [13] The overall idea here is to have the CPU compress the data before sending it so that the transfer is faster. The difference between their AC system and simply selecting a single compression algorithm to compress all traffic comes down to two goals: if the CPU time is needed by other tasks running on the machine or the data is not very compressible, the system should not waste resources compressing, and it should be able to handle all network traffic in a general fashion. AC is well suited to provide a solution which meets both of these goals. They break apart their AC system into three different generalized subsystems:

Methods, Monitors and Models. Methods are basically a wrapper for a given compression algorithm that includes other parameters like the strength of the algorithm, the amount of input to compress and the thread count. Monitors provide the AC system with information about what type of data is being sent or how many resources are currently available, so that the AC system can make better decisions. Last,

Models take in the information given from the Monitors and the list of available

Methods and make a decision. Datacomp’s user interface consists of the functions

‘dcread’ and ‘dcwrite’, which appear to act exactly the same as the standard POSIX ‘recv’ and ‘send’ functions. However, they wrap the ‘recv’ and ‘send’ functions with the code necessary to adaptively compress outgoing data and decompress incoming data.

For methods they chose the following algorithms: LZO, bzip2, and xz. For monitors they used CPU usage, CPU frequency, available bandwidth and a compressibility estimate. Their compressibility estimate, bytecount, is based on counting the

number of unique bytes that appear in the input more than a set number of times. This number of times is set by calculating the expected average number of appearances of a byte value in the data. The fewer unique byte values that reach this threshold, the more the data is dominated by a few byte values, and therefore the more compressible the data should be. They use this simple measure because it is very fast, requiring only a single pass through the data.
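A minimal sketch of the bytecount idea, as we understand it from the Datacomp description, is shown below; the exact threshold used by Datacomp may differ, so treat the expected-average cut-off here as an assumption.

```python
def bytecount(block: bytes) -> int:
    """Count how many distinct byte values appear more often than expected.

    The threshold is the count each value would have if bytes were uniformly
    distributed (len(block) / 256). A small result means a few byte values
    dominate the block (likely compressible); a result near 256 suggests
    high-entropy, likely incompressible data. Only one pass over the data.
    """
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    expected = len(block) / 256    # assumed threshold: expected average appearances
    return sum(1 for c in counts if c > expected)
```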

Before we describe Datacomp’s model we must understand the concept of Effective

Output Rate (EOR). EOR is simply the transfer rate achieved taking compression into account. Their model is designed to learn over time to make better decisions for each situation. It consists of a multidimensional matrix for each method, where the inputs are the dimensions and the contents of each cell are the data required to maintain a rolling average of the EOR for the given inputs and method. Their rolling average is computed like a typical mean up to 20 values; then, to avoid overflow, it subtracts the current average from the sum and adds the new value, still dividing by 20. Two important things to note about their model are that 1) the model has a hard-coded threshold on bytecount at which point the model will always choose to not compress and 2) they also quantize the input values into predefined buckets. This is done to reduce the size of the matrix in order to make it more manageable.

2.1.4 File System Compression

Applying compression to file systems seems like a good idea to improve space utilization. A paper by Timo Raita in 1987 started the exploration of compression in

file systems. [14] He created a system for identifying unused files in home directories, compressing them and accessing them later. This system would be run at startup and would compress in the background so as not to be noticed. This type of file compression,

however, is not transparent. In a nontransparent system the user can notice what the system is doing or did.

There are many ways in which a user could notice a nontransparent system. In the case of a file system, it could be that the system “randomly” requires more resources than expected or moves files automatically. Raita's system required the user to follow a different sequence of steps to access a file that the system had compressed. This hassle of

Raita's system resulted in minimal adoption of it. For widespread adoption, there would need to be a transparent system.

One of the first practical file system compression designs was created by Michael

Burrows and others from the DEC Systems Research Center in 1992. [3] Their design took the already existing file system, Log-structured File System (LFS), and added compression in order to increase the amount of data which could be stored. They kept their system design transparent so users would not need to change their usage to gain the benefits of compression. This allowed users of any LFS to replace their current file system with DEC’s compressing file system. Since this file system was released, many of the popular file systems have included support for transparent compression however many leave it turned off by default due to the downsides of always on compression.

2.1.5 ZFS

The Zettabyte File System (ZFS), designed by Jeff Bonwick et al., was created to tackle some of the drawbacks of traditional file systems.[16] Some of these problems include lack of scalability, automatic integrity checking and easy administration. To tackle these problems they needed to change the traditional file system architecture.

Their changes to the architecture included a redesign of how the file system is related to a volume, pooled storage, copy-on-write and automated checksums.

The traditional relationship between file system and volume is one or more file systems per volume. They redesigned the architecture to allow multiple volumes per file system. This allowed for one of the main features of ZFS that makes it different from traditional file systems: its pooled storage design. Traditional

file systems are assigned per virtual disk. This has many drawbacks including no space sharing between file systems and difficult administration. When using a traditional system we must assign each file system a certain amount of space and in order for another file system to use some of that space in the future we have to resize and shift the file systems. Storage pooling does away with this. All the disks are “pooled” together into a single storage pool which is in turn shared between all of the file systems.

The designers of ZFS thought that data checking and correcting should not only be left up to hardware or RAID controllers. Instead, the file system should catch errors on all the hardware, software and firmware being used by the file system. Traditionally file systems blindly trust and relay data from disk controllers. This can lead to uncorrected data corruption or in a worst case a system panic. ZFS fixes this problem by storing a checksum along with the data. This serves as a flag to detect if a disk read is returning corrupted data. If the ZFS pool is set up using a mirror in the configuration, ZFS will automatically detect which data is correct then rewrite the data on the corrupted disk.

To further help with error prevention, ZFS always has a consistent structure on disk. This is accomplished through atomic transactions and a copy-on-write design.

The transactions are applied one at a time with all states in between writes being valid. Bonwick et al. also redesigned the architecture of how the file system interacts with the OS and the disks. This new architecture contains three layers, the Storage

Pool Allocator (SPA), the Data Management Unit (DMU) and the ZFS POSIX Layer (ZPL). The SPA takes blocks of a configurable size and virtual addresses from the

DMU and decides where to write the data to. The DMU implements a general

transactional object interface for the ZPL to use. The DMU transforms objects into block transactions that the SPA executes in an atomic way. The ZPL implements a POSIX file system (the Vnode operations) on top of the DMU layer.

Compression in ZFS

Compression in ZFS is handled at the SPA level. The DMU translates everything

into blocks of a configurable size that are then passed to the SPA. ZFS's block size can be set to any power of two in the range from 512 bytes to 128KB. The default value is 128KB. When compression is turned on, the SPA will take

in the block, compress it with the configured compression algorithm, and store the

output with the metadata necessary for decompression.

Ben Rockwood explains in his article “Understanding ZFS: Compression” how and why to use ZFS compression. [15, 9] He states that compression not only saves space but can also speed up disk I/O throughput (something this thesis builds on). ZFS compression is configured per directory and will be inherited by all sub-directories. Only files written after enabling compression will be compressed. In addition, the compression algorithm can be changed at any time, which will also only influence future files. This means we can have many files in the same directory compressed by different compression algorithms. “du” is a useful command to list the actual space taken by a given file (after compression), as opposed to “ls”, which lists sizes independent of the compression.

13 2.2 Related Work

2.2.1 ZFS smart compression

ZFS’s compression option does not allow for some files to be excluded from being

compressed. This can be very wasteful when compressing hard to compress files,

such as encrypted files or files in an already compressed format. Saso Kiselkov noticed

this in 2013 and decided to fix it with his ZFS smart compression. [8, 7] In his talk he points out that ZFS compression settings are per file system and each file

system contains a mix of compressible data (e.g., text documents) and pre-compressed data (e.g., archives and multimedia).

He describes several “bad solutions”, before presenting his solution. The first “bad solution” is to have a file extension table inside ZFS to tell whether the files by that

extension are compressible. This is a poor solution because there are many extensions,

and because renaming a file should not change the behavior of the file system storing

it. This is also a security issue. Someone could easily cause a denial of service attack

by renaming a large incompressible file with an extension of a highly compressible file type, resulting in the file system spending resources on trying to compress it.

This design has been discussed by many people in the file system community. In

2015, Florian Ehmke implemented a system following this design in ZFS as his master's thesis work. [5] His system was not merged into any official ZFS releases due to the aforementioned issues.

The second “bad solution” is to let the admin control which files are to be compressed. This is clearly not general and could not be applied in bulk.

ZFS smart compression is an on/off system as we described in chapter 1. It tries

to compress each file and tracks how well the first part of the file compresses. If the file does not compress, then it will not compress the rest of the file and continue to

not compress the file until a certain number of writes have been made to it. This is a naïve AC system and is far less effective than Datacomp.

2.2.2 Quality of service improvement in ZFS through compression

The latest AC system to be implemented for ZFS is Niklas Bunge's auto compression. [2] He created an AC system, called auto compress, as a part of his thesis, which is titled “Quality of service improvement in ZFS through compression”. Auto compress is more complex than ZFS smart compression; however, it is based on a similar

foundational assumption: the compressibility of data in a file is uniform. This means

that the compressibility of one piece of a file should be the same as the rest of the file.

The second foundational idea auto compress utilizes is: the more bytes in the write queue the more time the system can use to compress data without slowing down the

write speed, while fewer bytes in the queue means less time for compression. The write

queue is a data structure that the SPA layer of ZFS uses to store incoming write

requests until the disk is free to write the data. This queue acts as a buffer where blocks sit while they wait. The third assumption auto compression makes is that compression algorithms can be sorted from fastest to slowest and that a faster compression

algorithm, one that takes less CPU time for the same amount of data, will have a

worse compression ratio than a slower one. This intuitively makes sense: the more

work, the greater the reward. However, as Peterson discussed in his Datacomp paper, this is not strictly true, since some algorithms are better at certain data types regardless of

how much CPU time they take. [13]

Auto compress is implemented inside the ZFS write pipeline. It looks at each

block before it is added to the queue and decides which compression algorithm if

any to compress it with before it is added to the queue. Auto compress decides which algorithm it should use based on the current state and a table of compression algorithms. This table contains all the algorithms available for the system to pick from, sorted from fastest to slowest. Auto compress always starts by selecting the fastest algorithm, no compression. It then adjusts the currently selected algorithm based on two numbers.

The first number that auto compress uses to select a compression algorithm is the estimated delay for the current block to start to be written to the disk. Bunge calls this the queue delay. The queue delay can be estimated by dividing the number of bytes currently in the write queue by the estimated write speed of the disk.

$$\text{queue delay} = \frac{\text{queue size}}{\text{disk write speed}}$$

The second number that auto compress uses is the estimated time it will take to compress the given amount of data. This value is calculated by dividing the number of bytes in the current block by the compression throughput for the given compression algorithm. Compression throughput is calculated by taking the average throughput of compressing past blocks in the same file with the same algorithm. With these two estimates auto compress starts its decision process, which is outlined in table 2.1.

It either selects a compression algorithm that is one algorithm faster or slower, the neighboring algorithms in the algorithm table, or stays with the same algorithm. If the queue delay is smaller than the estimated compression delay then auto compress selects the faster neighboring algorithm, if one exists. If the queue delay is larger than the estimated compression delay for the slower neighboring algorithm then it selects it. Otherwise auto compress stays with the current algorithm.

Case                                                                                    Reaction
Estimated compression delay is larger than the queue delay                             Select faster algorithm
Estimated compression delay for the slower algorithm is smaller than the queue delay   Select slower algorithm
Default                                                                                 Repeat last choice

Table 2.1: Auto Compress decision process.
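A sketch of this decision process in Python is given below. The algorithm ordering, the throughput table and the variable names are illustrative assumptions for the sake of the example; the real auto compress logic lives inside the ZFS write pipeline.

```python
# Sketch of the auto compress rule in Table 2.1. ALGOS is assumed to be sorted
# from fastest to slowest, and comp_throughput holds estimated compression
# throughput (bytes/s) per algorithm, learned from previously compressed blocks.
# "off" can be modeled with an effectively infinite throughput.
ALGOS = ["off", "lz4", "gzip-1"]

def next_algorithm(current, block_bytes, queue_bytes, disk_bps, comp_throughput):
    idx = ALGOS.index(current)
    queue_delay = queue_bytes / disk_bps
    comp_delay = block_bytes / comp_throughput[current]
    if comp_delay > queue_delay and idx > 0:
        return ALGOS[idx - 1]                 # compression can't keep up: go faster
    if idx + 1 < len(ALGOS):
        slower = ALGOS[idx + 1]
        if block_bytes / comp_throughput[slower] < queue_delay:
            return slower                     # the queue gives us slack: go slower/stronger
    return current                            # default: repeat the last choice
```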

3 Implementation

The goal of this thesis is to speed up disk I/O using compression. In this chapter we will layout the details of the methods taken to reach this goal and how these methods were implemented in ZFS.

3.1 Adaptive Compression System Design

We chose to model our AC System after Datacomp’s design. This means we organized the system into three parts: Methods, Monitors and Models.

3.1.1 Methods

Methods are the different compression algorithms available to the AC system. Instead of using the methods found in Datacomp, we used the methods that are currently built into ZFS. ZFS currently supports the following compression algorithms: lz4, lzjb, gzip (1 through 9) and zle. However, lz4 is a replacement for lzjb, therefore lzjb is not a good option. Zle was designed to help with the use case where a user stores large amounts of zeros in a file as placeholders; it only compresses repeating zeros in files. This use case is no longer popular, as it is a poor system design. Therefore the only methods our system uses are lz4, gzip and no compression. We designed our system to allow for easy integration of compression algorithms. We did this to allow for the addition of future algorithms like zstd or lz4fast, which are planned to be introduced into ZFS.

18 3.1.2 Monitors

Monitors provide the AC system with values describing the current environment or attributes of the input data. While similar to Datacomp, our monitors are different because our AC system is working with a hard drive instead of the network. We have monitors for compressibility, queue length, disk bandwidth and CPU usage. An important feature of the monitors is that each must be quick to compute and have a value which is correlated to the total write time.

Compressibility

Compressibility is a function which, given data, returns a value that strongly correlates with the compression ratio achieved when compressing the data with a compression algorithm. The most accurate function would be to simply compress the data using a compression algorithm. This would give the correct answer by definition. However, this requires the same amount of time as compressing the data, which would render it useless when trying to speed up disk operations. Thus our system uses an efficient function which estimates the compressibility based on the variation in the input data. We used the same estimator found in Datacomp, bytecounting (BC). BC estimates the compressibility by sampling the distribution of bytes in the input as described in section 2.1.3.

Queue Length

The main indicator of how much computational power we will be able to use to compress the incoming data, before the pipeline starts waiting for the next block to write to the disk, is the disk queue length. This monitor was implemented as a part of the work Niklas Bunge did in “Quality of service improvement in ZFS through compression”. [2] In his thesis, he used the queue length to estimate how long the algorithm had to compress the current data before the writing process would need to start. We used similar methods to those found in his thesis. We define queue length as the number of bytes currently ready and waiting to be written to the disk.

Bandwidth

Bandwidth is one of the most important monitors, as it allows the system to select the optimal compression strength to strike a balance between increasing the compression ratio and keeping up with the speed of the data transfer. The bandwidth is calculated by a statistics update function called as a callback in the ZFS write pipeline after the actual disk write has completed. A rolling average is kept in bytes per second (BPS) and calculated as described by Equation 3.1. After each disk write, the rolling average is updated using a weight of 1/1000 and a value of the number of “physical” bytes transferred over the number of seconds needed to complete the transfer.

$$BPS_{n+1} = \frac{999 \cdot BPS_n}{1000} + \frac{\text{bytes transferred} / \text{seconds to transfer}}{1000} \qquad (3.1)$$
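In code, the update of Equation 3.1 amounts to a single weighted average per completed write; a minimal sketch:

```python
def update_bps(bps_avg: float, bytes_transferred: int, seconds: float) -> float:
    """One step of the Equation 3.1 rolling average: the previous average keeps
    a weight of 999/1000 and the newest sample gets a weight of 1/1000."""
    sample = bytes_transferred / seconds
    return (999.0 * bps_avg + sample) / 1000.0
```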

CPU Usage

The CPU usage monitor enables the model to make a different decision depending on the current load on the CPU. This is crucial for the performance of the model since, when the CPU is under a high load, every compression algorithm will end up waiting for the necessary compute power. This results in high turnaround times for the compression phase, which in turn delays the rest of the pipeline.

20 3.1.3 Model

We modeled the file system’s total writing time (TWT) as a function of number

of input bytes (IB), compression rate (CR), queue length (QS), disk rate (DR) and

number of bytes after compression (OB). Our final model for TWT is described by

Equation 3.2.

$$TWT = \underbrace{\max\left\{\frac{IB}{CR}, \frac{QS}{DR}\right\}}_{\text{pre-write time}} + \underbrace{\frac{OB}{DR}}_{\text{write time}} \qquad (3.2)$$

The TWT is the sum of the time spent before the write (the pre-write time) and the time spent during the write. Before the write, the block is compressed while the data already in the queue is being written to the disk. Therefore the minimum amount of time spent before the disk starts writing the data is the time it takes to write all the data in the queue to the disk, that is, $QS/DR$. However, sometimes the compression algorithm requires more time than it takes for the queue to empty; in this case the pre-write time would be $IB/CR$. So the pre-write time is simply the maximum of both cases. The write time is $OB/DR$. Given this equation, our system knows how long a write would take and can select the best compression algorithm for the current environment by minimizing

TWT with respect to all compression algorithms. This is a complete model of the system. However, three of these values cannot be known beforehand: DR, CR and OB. DR cannot be known exactly, but disk write rates are very predictable, so our system simply uses historical write rates to estimate this value. The other two unknown values are CR and OB. It is impossible to know how long the compression algorithm will take on some data without compressing it. Similarly, there is currently no known way of computing OB apart from actually compressing the data. Since $OB = \frac{IB}{RT}$, where RT is the compression ratio, and we know IB, we only need to estimate RT. To be able to utilize this model in order to make good compression decisions, we estimate the values of CR and RT using models similar to the model used by

Datacomp. We called this type of model an Estimating Model, which is described in

detail in section 3.1.4. We denote this type of model as $M_y(x_1, \ldots, x_n)$, where $y$ is the variable being estimated and $x_1$ through $x_n$ are the variables which $y$ is dependent on.

Complete System Model

Let $C$ be the set of available compression algorithms. Let $CR(c) = M_{CR}(BC, CPU, c)$, where $BC$ is the byte count, $CPU$ is the percentage of idle CPU and $c$ is a compression algorithm.

Let $RT(c) = M_{RT}(BC, c)$, where $BC$ is the byte count and $c$ is a compression algorithm.

$$\text{Optimal compression algorithm} = \operatorname*{arg\,min}_{c \in C}\left\{\max\left\{\frac{IB}{CR(c)}, \frac{QS}{DR}\right\} + \frac{IB}{RT(c) \cdot DR}\right\}$$
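The sketch below restates this selection rule in Python. The estimating models $M_{CR}$ and $M_{RT}$ are represented as plain callables and all other values are assumed to be measured or estimated already; this is an illustration of the decision rule, not the in-kernel implementation.

```python
def choose_algorithm(algos, IB, QS, DR, bc, cpu_idle, M_CR, M_RT):
    """Pick the algorithm minimizing the predicted total writing time (TWT)."""
    def predicted_twt(c):
        CR = M_CR(bc, cpu_idle, c)           # estimated compression rate (bytes/s)
        RT = M_RT(bc, c)                     # estimated compression ratio
        OB = IB / RT                         # estimated size after compression
        pre_write = max(IB / CR, QS / DR)    # compress while the queue drains
        return pre_write + OB / DR           # then write the (compressed) block
    return min(algos, key=predicted_twt)
```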

3.1.4 Estimating Model

For each of the estimating models we use the values of the environment variables which they are dependent on. The model to estimate compression rate is dependent on

the amount of computational resources available and the compression algorithm. For

this model we drew inspiration from Datacomp’s results, to know which values would

represent the environment well. We settled on using the compressibility monitor and

the CPU load monitor. For the model to estimate compression ratio we only needed two inputs: the compressibility and the compression algorithm. We only need these

two, since the compression ratio is only dependent on the data and the algorithm and is independent of the environment. In the subsections below we will discuss exactly how our estimating model processes the input and “learns”.

Figure 3.1: Diagram of “Bucketization”

Input

The input to the model is a list of features. These features are the current values

of each monitor which the output variable is dependent on. The model quantizes each of the features via simple discretization. Our system’s method of discretization,

Bucketization, will be explained with the help of Figure 3.1. The figure shows the two

components of the Bucketization: predefined cut-offs and their respective buckets.

Each input value is sorted into a bucket by finding the bucket with a low cut-off

less than the value and a high cut-off greater than the value. Using Figure 3.1 as an example, the Bucketization of 25% would be 1. This input preprocessing is important

since it reduces the size of the feature space. Given a feature, which is a percentage,

the feature space would have a size of 100; given two percentage features the size

would be $100^2$. However, after Bucketization, two percentage features would have a feature space of size only $3^2$, since after Bucketization each feature would have 3 possible values.

The downside of Bucketization is the loss of some precision of the input values. As noted by Datacomp, Bucketization allows the model to be more efficient while still

achieving good accuracy. Table 3.1 lists the buckets for each monitor used in the estimating model.

Table 3.1: Monitor buckets

Monitor      Bucket 0   Bucket 1   Bucket 2   Bucket 3
CPU idle     > 66%      > 33%      > 0%       0%
Bytecount    ≥ 100      ≥ 66       ≥ 33       < 33
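A minimal sketch of the bucketization of these two monitors, using the cut-offs from Table 3.1:

```python
def bucketize_cpu_idle(idle_percent: float) -> int:
    """Map a CPU-idle percentage to a bucket index (cut-offs from Table 3.1)."""
    if idle_percent > 66:
        return 0
    if idle_percent > 33:
        return 1
    if idle_percent > 0:
        return 2
    return 3

def bucketize_bytecount(bc: int) -> int:
    """Map a bytecount value (0-256) to a bucket index (cut-offs from Table 3.1)."""
    if bc >= 100:
        return 0
    if bc >= 66:
        return 1
    if bc >= 33:
        return 2
    return 3
```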

Learning

We chose to treat the problem of estimating these variables as a machine learning problem. We need a model that outputs a prediction for the output variable, the variable that we must estimate, given the input variables, the variables that the output is dependent on. The system only knows the value of the

output variable for a given input pair through exploration, where exploration means

trying an input pair and recording the outcome. The system needs to “explore” the feature space by selecting different values for the inputs that it can control.

Our system “learns” through an exploration process, called training, and recording

values in a matrix. The dimensions of the matrix represent each input and the value

stored in the matrix is the estimated output for the given input. An example is if

we have the following inputs: $x_1 \in \{1, 2\}$ and $x_2 \in \{1, 2, 3\}$, then the matrix would be a 2-by-3 matrix. Figure 3.2 shows this example matrix. We store the estimated $y$ value for $x_1 = 1$, $x_2 = 3$ at $A_{1,3}$, where $A$ is the matrix. During the training phase, each compression operation is timed. The time is converted to a rate using IB, like so: $CR = \frac{IB}{\text{time}}$. The matrix value is then updated by summing the weighted old value and the weighted new value. For the weights we choose 0.999 for the old value and 0.001

for the new value. This gives little weight to the new value; in this way the model is more stable, since spikes and noisy values have little influence over the estimate.

24 [ ] A A A A = 1,1 1,2 1,3 A2,1 A2,2 A2,3

where Aa,b = the estimated value when x1 = a and x2 = b. Figure 3.2: The matrix used for storing estimated values

∗ 999 Ax1,x2 new value We express this mathematically like so Ax1,x2 = 1000 + 1000 . Note that the values in the matrix are initialized to the first recorded value.
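Putting the matrix and the update rule together, a sketch of the estimating model might look as follows. The matrix dimensions (4 CPU-idle buckets × 4 bytecount buckets × 3 methods) mirror Table 3.1 and our method list, but are assumptions of this illustration rather than the exact in-kernel data structure.

```python
import numpy as np

class EstimatingModel:
    """Bucketized lookup table with a heavily damped rolling-average update."""
    def __init__(self, shape=(4, 4, 3)):
        self.values = np.zeros(shape)            # current estimate per input combination
        self.seen = np.zeros(shape, dtype=bool)  # whether a cell has been initialized

    def update(self, idx, observed):
        if not self.seen[idx]:
            self.values[idx] = observed          # initialize to the first recorded value
            self.seen[idx] = True
        else:                                    # weights: 0.999 old value, 0.001 new value
            self.values[idx] = (999.0 * self.values[idx] + observed) / 1000.0

    def estimate(self, idx):
        return self.values[idx]

# During training:       model.update((cpu_bucket, bc_bucket, method_id), measured_rate)
# During normal writes:  model.estimate((cpu_bucket, bc_bucket, method_id))
```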

3.2 Integration into ZFS

We integrated our AC system into ZFS’s write pipeline. This pipeline is shown in

Figure 3.3. We will describe this in detail throughout this section. The granularity of ZFS's compression selection is very fine. It allows each block to be compressed using any algorithm ZFS supports by storing the needed metadata per block. This means that to integrate our AC system into ZFS, we only have to modify the compression step of the ZFS write pipeline to select the algorithm which our model deems best; ZFS then records the selected algorithm along with the block. During the decompression step of the reading pipeline, ZFS will use the algorithm that it had recorded.

3.2.1 ZFS Write Pipeline

The ZFS write pipeline is the sequence of steps that ZFS uses to write information to the disk. There are many steps; however, we will focus only on the relevant steps, which are shown in Figure 3.3. It is important to note that by the time any information is passed to the ZIO Write step, the information has already been split into blocks. This means that each write request contains a single block of data.

Figure 3.3: ZFS write pipeline with AC

The first thing the pipeline does after receiving the block of data is compress it. In this step ZFS reads the configuration and decides which compression algorithm to call. To integrate our system, we created a compression configuration value so that when the compression configuration is set to lars, ZFS calls our system's compress function. After the block is compressed, ZFS continues to the DVA Allocate step. This is when space on the disk is found for the block. It is important to note that disks have a smallest addressable unit, a sector, which is normally 512 bytes. During this step ZFS rounds the size of the block up to the nearest multiple of the sector size. Each block is assigned a virtual address that represents the disk and sectors that the block should be written to.

26 The VDEV Write step is the step that actually does the writing. During this step ZFS writes the block to its assigned sectors. When the write is finished the ZIO Done

step is reached. At this point statistics about the write are recorded. Our system

during this step increments the number of bytes written which is used to estimate

the disk’s write speed.

3.2.2 AC Compress Function

Our AC system is implemented inside the lars compress function. Algorithm 1 lays out how the compress function works.

Algorithm 1: Lars Compress Function
 1: gather monitor values
 2: if training then
 3:     start time ← current time
 4:     compress data with random compression algorithm
 5:     compression delay ← current time − start time
 6:     update compression rate model and compression ratio model
 7: else
 8:     calculate the best compression algorithm using model from section 3.1.3
 9:     compress data with best algorithm
10: end if

4 Experiment

4.1 Methodology

In this thesis, we created an AC file system with the goal of improving I/O speeds.

To test this hypothesis we designed an experiment which compares our system with

several alternatives. The ZFS default is to not use compression, which we used as our baseline. The two most commonly used compression algorithms in ZFS are lz4 and gzip. We took these two algorithms and our baseline as our static compression options. Both our

AC system, LARS Adaptive File Compression System (LAFCS), and Niklas Bunge’s

AC system, which we will refer to as n1kl, were set up to pick from any of the static

options. Table 4.1 shows a summary of each system configuration tested and the names by which we will refer to them.

We used a single process to test each system's write speed for each dataset.

This process is outlined in Table 4.2 and discussed in detail below.

1) Mount the ZFS file system. We started each experiment by creating a

clean ZFS partition on a secondary disk and mounting it. After implementing the

lars system in ZFS, we added the n1kl system into the same source in order to have a single version of ZFS with all the systems we wanted to test. Therefore, regardless of which system was being tested, we mounted a ZFS partition using the same kernel module. This allowed us to not have to unload and load a kernel module for each experiment.

Table 4.1: Systems

Name       System   Compression Algorithm
baseline   zfs      off
zfs-lz4    zfs      lz4
zfs-gzip   zfs      gzip-1
n1kl       n1kl     {off, lz4, gzip-1}
lars       LAFCS    {off, lz4, gzip-1}

Table 4.2: Experiment procedure

Step   Description
1      Mount ZFS file system
2      Train model
3      Load dataset into RAM
4      Start timer
5      Write dataset
6      Stop timer when dataset is finished writing to disk

2) Train the model. This step was only done if the lars system was being tested.

The lars system uses models to estimate values as described by section 3.1.4. These models must be trained before the system can reasonably predict values needed for

choosing a compression algorithm. During this step we wrote a training set to the

ZFS partition several times to ensure the model was filled. The training set consisted

of data similar to the data found in the test datasets, which will be introduced in section 4.3. Specifically, we used a compression corpus and several files not found in

the test datasets. The corpus we selected was the Squash Compression Corpus. [11]

The other files were an unused part of the enwiki dataset and files created in a similar

manner to the random dataset and the pdf dataset.

3) Load the dataset into RAM. Before writing anything to the disk we loaded the dataset into RAM. We did this to minimize the overhead during the timed process of writing to disk. With the dataset in RAM, the overhead of reading the data is minimal since RAM has read speeds upwards of 20 GB/s which is orders of magnitude faster than disk write speeds.

29 4) Start the timer. We started a timer to track how long it takes to write the given dataset to the disk.

5) Write the dataset. For the datasets that only contained a single file, we simply opened a file on the ZFS partition and wrote the whole dataset into it. For datasets that contained many files, we used a pool of threads with a size equal to the number of CPU threads, which was created before the experiment started so as to prevent thread creation time from affecting the write time results. We chose

8 threads for our environment since our CPU had 4 cores, each with 2 threads. We used a pool of threads instead of a single thread to ensure that the CPU would not be the bottleneck, even though it is highly unlikely that it would be. It is important to note that this did not change how the blocks were compressed or written, as all the data written to the disk would enter a single write queue inside ZFS, just like when using a single thread to write a file. Each thread in the pool would open a file on the ZFS partition and write the contents of a single file in the dataset into it. When finished, the system would repeat until all the files in the dataset had been copied to the ZFS partition.

6) Stop timer when dataset is finished writing to disk. When a program writes data into a file it calls the write system call. The write call completes almost instantly; however, this does not mean the data has actually been written to the file system. This is a feature of the operating system (OS). It allows a program to move on to other calculations while the OS waits for the disk write to complete. In order to know when the data actually was done being written to the disk, we needed to wait until the data finished writing using a system call, fsync. The fsync system call allows a program to wait until a disk write is complete. We call fsync with the name of the file we wish to wait on. In the case of writing multiple files, we called fsync on each of the files written after all the threads finish writing. This ensures that the experiment

waits until all the files have been written. After the call is complete we stop the timer. We call the time elapsed between starting and stopping the timer the Total Write Time

(TWT), which is the output of our experiment.
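A sketch of this measurement for a single-file dataset is shown below; it writes the in-memory data and stops the clock only after fsync returns, so the elapsed time includes the actual disk write. The file path and the use of Python here are illustrative, not the exact benchmarking harness.

```python
import os
import time

def timed_write(path: str, data: bytes) -> float:
    """Write data to path and return the total write time (TWT) in seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # block until the file system has written the data to disk
    return time.monotonic() - start

# e.g. twt = timed_write("/tank/benchmark/enwiki", dataset_bytes)   # hypothetical path
```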

4.2 Environment

We used a system with a 2.7 GHz Intel Core i7 and 32GB of RAM. We set up

our system with two hard drives: one for the main operating system and one to use for benchmarking the ZFS file system. We did this to reduce background noise

in the experiments from other tasks writing to the main disk. The disk used for

benchmarking was a 5400rpm HDD running at slightly above 50 MB/s. The system

was running Ubuntu 16.04 with Linux kernel version 4.4.0-128.

4.3 Datasets

In order to gain a better understanding of how the systems compared to each

other we used six datasets. Table 4.3 lists the datasets along with their sizes. We used

a standard English dataset that is taken from the English Wikipedia called enwiki.

[10] This dataset is just a representation of the text part of the Wikipedia which is mostly English text and some markup. This means it does not contain the images,

HTML or other data found on the Wikipedia website. In order to know how the systems would react to incompressible data, we used a random corpus which was generated through the use of /dev/urandom on our test machine. We selected two files from the Silesia Corpus, mozilla and sao.[4] The mozilla file is a tarball of executables and the sao file contains a catalog in a binary format. Along with these three standard

compression corpora we created two datasets of our own. The first is a dataset of

pdf files, which are thesis dissertations containing a mix of English text, graphs and images. The second dataset contained CSS, HTML, JavaScript and image files from popular GitHub repositories containing web-based files for various frameworks. Per the above descriptions, each of these datasets contains a different data type with a different size, as shown by Table 4.3. The raw size is the original size of the file.

Table 4.3: Sizes of datasets

Dataset    Raw size (bytes)   LZ4 size (bytes)   Gzip-1 size (bytes)   LZ4 ratio   Gzip-1 ratio
enwiki     100,000,000        57,285,990         42,265,725            1.746       2.366
mozilla    51,220,480         26,441,791         20,686,601            1.937       2.476
sao        7,251,944          6,790,464          5,568,179             1.068       1.302
pdf        6,471,787          5,404,086          5,127,285             1.198       1.262
web        14,428,160         5,348,363          4,396,610             2.698       3.282
random     100,000,000        100,000,111        100,016,915           0.999999    0.999

Note that these numbers are per file. For per-block numbers, see Figures 4.1 through 4.6.

The lz4 and gzip-1 sizes are the sizes after compressing the file as a whole. However, this does not reflect the size after compressing through the ZFS pipeline, since by the time the compression step starts, the file has already been split into blocks. We kept the default block size of 128KB. Figures 4.1 through 4.6 show the compression ratio per block for each dataset after being split into 128KB chunks and compressed using gzip-1. Note that the results shown in these graphs are measuring space, so they are valid, both accurate and precise, since gzip-1 is a deterministic algorithm. An important edge case to mention is that the last block contains a number of bytes equal to the remainder of dividing the dataset size by 128KB. This can result in a small block which may have a different compression ratio. An example is the random dataset's last block, which is 120KB. Since the overhead of gzip is constant in this case and the data is completely incompressible, the 128KB blocks have a “less worse” ratio than the last block.

32 We chose our datasets to get a variety of real world data types. This is clearly visible in the figures showing the compression ratio per block. Figure 4.1 shows

how enwiki is fairly compressible at an average compression ratio of 2.3 and has low

variance with only a couple spikes over the almost 800 blocks. This is exactly what

we would expect from data containing natural language like English. Since the data is mostly uniformly compressible we would expect an AC system to pick a single

compression algorithm for every block. If we knew beforehand that the file system

would only be handling files similar to enwiki, it would be best to simply select a single

algorithm like gzip-1.

The mozilla dataset, shown in figure 4.2, exhibits a large amount of variance. Something to note about this dataset is that executables contain different sections where different types of data are stored. This results in pockets of similar compressibility. Two examples are blocks 8 to 31, which are almost incompressible with an average compression ratio of 1.15, and blocks 160 to 180, which have a compression ratio of 4.25. This means that an AC system has the potential to save time and energy by not compressing blocks 8 to 31 and using a strong compressor, one that takes more time and gets a better compression ratio, on blocks 160 to 180. However, an AC system that does not take a compressibility measure as an input, such as Niklas Bunge's system, will miss this type of opportunity since the ranges are so small.

The binary catalog format that sao contains is only slightly compressible and very uniform, as figure 4.3 shows. The graph shows that when using gzip-1 the average compression ratio is 1.3; what the graph does not show is that the average compression ratio using lz4 is only 1.06. This means that, unlike the other datasets, this dataset is next to incompressible with lz4 but is compressible with gzip-1. A good AC system would choose either to not compress the data

or to compress it with gzip-1, depending on the time available.

Figure 4.1: Compression Ratio per block using gzip-1 for dataset enwiki

The pdf dataset shows that some files contain data with highly varying compressibility. Similar to the mozilla dataset, there are different sections in a pdf. Blocks 38 to 42 have a compression ratio of 6.1, while blocks 5 to 36 have a compression ratio of 1.2. We believe this is due to how the pdf format stores images: the images are stored in a compressed format and therefore are not compressible, while all the text is stored uncompressed. This results in the graph we see in figure 4.4.

The web dataset is unique since it is the only dataset we used that contains more than one file. The web dataset contains 1,099 files. These files contain source code for websites, which is very compressible, as well as images, which are not compressible. This is why figure 4.5 shows the most variance. We believe that this dataset embodies a large part of the real-world data that gets written to disk, and a good AC system should perform well on it.

Figure 4.2: Compression Ratio per block using gzip-1 for dataset mozilla

Figure 4.3: Compression Ratio per block using gzip-1 for dataset sao

Figure 4.4: Compression Ratio per block using gzip-1 for dataset pdf

Figure 4.5: Compression Ratio per block using gzip-1 for dataset web

Figure 4.6: Compression Ratio per block using gzip-1 for dataset random

5 Results

5.1 Raw Results

Figures 5.1 through 5.6 show the raw timing results from running our total write time (TWT) experiment using each system and each dataset 30 times. The orange line marks the mean time for each system. The box shows the upper and lower quartiles of the results, defined as the 75th and 25th percentiles and denoted Q3 and Q1 respectively. Outliers are represented using a circle and were identified as any result that fell farther than 1.5 × IQR from the mean, where IQR = Q3 − Q1 is the interquartile range.
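A minimal sketch of this outlier rule in Python, assuming the 30 timing results for one system and dataset are available as a list of floats (this mirrors the rule described above rather than the exact analysis code):

    import statistics

    def remove_outliers(times):
        """Drop results farther than 1.5 * IQR from the mean, as described above."""
        # quantiles(n=4) returns the three quartile cut points Q1, median, Q3.
        # Its default method may differ slightly from the plotting library's
        # percentile definition.
        q1, _, q3 = statistics.quantiles(times, n=4)
        iqr = q3 - q1
        mean = statistics.mean(times)
        return [t for t in times if abs(t - mean) <= 1.5 * iqr]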

Figure 5.1: Total Writing Time for each system to write enwiki

Figure 5.2: Total Writing Time for each system to write mozilla

Figure 5.3: Total Writing Time for each system to write sao

Figure 5.4: Total Writing Time for each system to write pdf

Figure 5.5: Total Writing Time for each system to write web

Figure 5.6: Total Writing Time for each system to write random

5.2 Result Analysis

5.2.1 Write Speed

After inspecting the distributions of the timing results we concluded that all the

results were normally distributed. Figures 5.7 through 5.12 show the speed of writing

each dataset for the above results after removing the outliers. The bars represent the

confidence intervals for each system’s results with a 99.7% confidence level. The first thing to point out about these results is that our system, lars, is working.

There is no dataset where any system, including static compression, is statistically

faster than lars, although zfs-lz4 comes close on the random dataset. In any case,

the results suggest that using our system will not result in worse write performance compared to any of the other systems.
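For reference, a minimal sketch of how the 99.7% confidence intervals behind Figures 5.7 through 5.12 can be approximated, assuming a normal model for the outlier-free times and using three standard errors of the mean for the 99.7% level (the exact interval construction used for the figures may differ):

    import statistics

    def speed_confidence_interval(times_sec, dataset_bytes, z=3.0):
        """Approximate 99.7% CI for the mean write speed, in MB/s.

        times_sec: outlier-free total write times for one system/dataset pair.
        z=3.0 corresponds to roughly a 99.7% confidence level under a normal model.
        """
        speeds = [dataset_bytes / t / 1e6 for t in times_sec]  # MB/s for each run
        mean = statistics.mean(speeds)
        sem = statistics.stdev(speeds) / len(speeds) ** 0.5    # standard error of the mean
        return mean - z * sem, mean + z * sem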

Under the motivation for creating an AC system, we stated that the optimal

compression algorithm varies depending on the type of data. This can be seen in the

results. For the sao dataset, the optimal compression algorithm is gzip, as seen from

figure 5.9. On the other hand, lz4 is much better for the random dataset, as shown in figure 5.12. This shows that selecting a single compression algorithm is a naïve approach that

can be improved upon.

The lars and n1kl systems perform comparably for the enwiki and pdf datasets (figures 5.7 and 5.10). The author of the n1kl system used the enwiki dataset in his testing, which showed favorable results. However, this dataset is an "easy" dataset because of its uniform compressibility, as discussed in section 4.3, which means that even a naïve AC system, such as a simple on/off system, can be expected to perform well on this dataset.

The results from the pdf dataset, figure 5.10, show that lars and n1kl perform similarly; however, lz4's and gzip-1's means are higher. Given the confidence intervals, we are unsure whether lz4's and gzip-1's higher means are meaningful. All of the systems achieve speeds much lower than the disk bandwidth of 50 MB/s, including the baseline. The large variance in the speeds for the pdf dataset, and the fact that the baseline's speed is not near the expected 50 MB/s, lead us to believe that the pdf dataset is too small to obtain precise write time values using our current experiment. We have decided not to draw any conclusions from this dataset.

While lars and n1kl perform comparably on the enwiki dataset, the mozilla, sao and web datasets, shown in figures 5.8, 5.9 and 5.11, show that n1kl and lars perform statistically differently. We would like an AC system to perform close to, or even the same as, the best static compression algorithm for any dataset. N1kl does not, since it has speeds that are statistically the same as the baseline for these datasets, while lars has speeds similar to the fastest-performing static compression algorithm. Exactly how lars performs better is discussed in more depth in section 5.2.2.

We used the random dataset to get an idea of the overhead involved in each AC system. The results are shown in figure 5.12. Since there are no gains from using compression, the overhead can be seen in the decrease in speed. First, we see that n1kl and lars are not statistically different; however, we can see how the mean changes between them. The mean write speed of the baseline is 47 MB/s. N1kl has a mean of 45 MB/s, which is slower by 2 MB/s, while lars has a mean of 44.2 MB/s, which is 2.8 MB/s less than the baseline. We see that the lars system has a greater overhead, which is due to calculating the bytecount of each block. This is one of the downsides of the lars system, and improvements could be made by using a faster compressibility estimator, computing a bytecount from smaller samples instead of from the whole block, or some other cost-reducing technique. Figure 5.12 also led us to think about why zfs-lz4 is almost as fast as the baseline. Lz4 has a compression speed of around 500 MB/s on our testing system, which means that the lz4 compressor was fast enough to keep up with the disk. However, as figure 5.9 shows, lz4 sometimes has unsatisfying performance with certain data types. Therefore, it is better to use an AC system, despite its overhead, instead of using lz4 on all blocks.
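To make the per-block estimation cost concrete, the sketch below computes a generic entropy-style estimate over byte frequencies. It is only an illustrative stand-in for the bytecount statistic defined earlier in this thesis, not our in-kernel implementation, and the sample_every parameter is a hypothetical knob included to illustrate the sampling idea mentioned above:

    import math

    def byte_entropy(block, sample_every=1):
        """Entropy-like compressibility estimate (bits per byte) from byte frequencies.

        Illustrative stand-in for a bytecount-style estimator; sample_every > 1
        examines only a subset of the block, one possible cost-reducing technique.
        """
        data = block[::sample_every]
        if not data:
            return 0.0
        counts = [0] * 256
        for b in data:          # one pass over (a sample of) the block
            counts[b] += 1
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

Even a single pass like this over every 128 KB block adds work proportional to the amount of data written, which is consistent with the slightly larger overhead observed for lars on the random dataset.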

Figure 5.7: Average throughput for each system while writing enwiki

Figure 5.8: Average throughput for each system while writing mozilla

Figure 5.9: Average throughput for each system while writing sao

Figure 5.10: Average throughput for each system while writing pdf

Figure 5.11: Average throughput for each system while writing web

Figure 5.12: Average throughput for each system while writing random

5.2.2 Compression Choices

In addition to recording the speed of each system, we tracked the compression choices of the lars system and the n1kl system. We plotted the compression choices for three datasets that we found interesting. Figures 5.13 through 5.15 show the compression choices the AC systems make. The x-axis values represent each run of the experiment. The y-axis for the solid lines is the percentage of the blocks per run that were compressed with the algorithm the line represents. The TWT is plotted as a dotted line on top of the solid lines, allowing us to see how each run and the choices an AC system makes are linked to the TWT. The first thing to note about these graphs is that the choices are relatively consistent for each dataset across runs. This means that the AC systems are making consistent choices, which suggests our experimental procedure is robust.

There are two significant effects we would like to point out from figure 5.13. First,

there is a single outlier at run 14 for the lars system. This is likely due to some background process running on the machine during this run of the experiment. The interesting thing to see is that the spike in TWT happens at the same run in which the system makes a different compression choice. This leads us to believe that the system adapted to the interference, which is what a good AC system should do. Second, we can see the reason that the n1kl system is the fastest system to write the enwiki dataset. The n1kl system starts out by not compressing the data, which allows the disk to start writing immediately. After the write queue has filled up a little, the system starts compressing with lz4, which is faster than gzip-1, and finally compresses the rest of the blocks with gzip-1. This is a case where n1kl's design could be better than lars's.

As pointed out in section 4.3, the mozilla dataset has several sections, some of which are incompressible, while most are compressible. Figure 5.14 shows that the lars system uses gzip on 95% of the blocks and no compression on the other 5%. This leads us to believe that the lars system is exploiting the compressibility of the compressible blocks to decrease the write time. This is a case where the n1kl system does not perform well, because it assumes that all blocks in a file have, on average, the same compressibility. However, as seen in figure 4.2, the first blocks in the mozilla dataset have low compressibility on average, leading the n1kl system to conclude that the whole file is incompressible and therefore to compress none of the blocks.

The results shown in figure 5.15 are interesting because they show how each system handles small files. The n1kl system needs to "learn" about each file's compressibility before deciding whether to compress, so when it is given a dataset full of small files it never chooses to compress any blocks. The lars system, on the other hand, tracks its model per file system and uses a compressibility estimator, so it makes good compression choices even for small files. As shown in figure 5.15, the lars system compresses 95.2% of the blocks in the web dataset with gzip-1 and chooses to leave the other 4.8% uncompressed.

Figure 5.13: Compression choices per run made by n1kl and lars for enwiki

Figure 5.14: Compression choices per run made by n1kl and lars for mozilla

Figure 5.15: Compression choices per run made by n1kl and lars for web

6 Conclusions

In this thesis we presented a system that achieved our goal of improving disk write speeds using AC. We built upon related work to create our system. Compared to the latest AC system for ZFS, n1kl, our system handles more types of real-world data. Specifically, our system correctly handles small compressible files and files containing both compressible and incompressible blocks.

Our system does not assume a speed ordering of the compression algorithms. We believe this allows the system to be more general and better suited to a larger variety of data types. Our system also does not assume the compressibility of blocks based on spatial locality, but instead uses a compressibility estimator on every block. This adds to the overhead of our system but also allows good compression choices for small files. All in all, our system is better equipped to handle real-world data than the preceding AC system for ZFS, and we believe our results show that it is currently the best-performing compression option for ZFS among those we tested.

6.1 Future work

Before we discuss the future work that could be done to improve on this thesis, we would like to note that our implementation is openly available on GitHub (https://github.com/derpferd/zfs/tree/lars).

Our results showed that our system performed better than all other methods tested. Even with this positive result, there are several questions that were raised


through the process of creating and testing our system that we believe are worth answering. Most importantly, how would our system perform in a real-world test?

We think creating an experiment that more closely reflects real-world conditions would help shed light on how future systems can do better. In order to better understand how our system would perform in the real world, another variable should be added to our current experiment: test systems with varying amounts of computational power and disk speed. This would allow us to see which systems would benefit from AC, and possibly allow us to improve our system to be even more generalized. It would also be helpful to create datasets of more realistic data, e.g., by tracing what happens on a user's file system as they use their system. We could also trace file system events on a server. With more variety of data, we believe that an AC system would benefit from having access to more compression algorithms.

Another important experiment is to test read speeds as well. With our current experiment we only test write speeds; however, we know that read speeds are affected by our AC system. We believe that our system should improve read speeds as well: since our system speeds up writes by compressing data, and decompression is generally faster than compression, read speeds should also be faster. In any case, actual read speed results are important to analyze before recommending use of this system. Our system also focuses only on single-disk systems. It would be interesting to see whether systems with multiple disks or disks in RAID receive similar benefits from AC systems.

The random dataset allowed us to see that our system has more overhead than n1kl's system. We believe that our system's overhead could be reduced by looking at compressibility estimators other than bytecounting, or by using samples of the data instead of the whole block to compute the bytecount.

References

[1] L. S. Bai, H. Lekatsas, and R. P. Dick. "Adaptive filesystem compression for embedded systems". In: 2008 Design, Automation and Test in Europe. IEEE, 2008, pp. 1374–1377 (cit. on p. 8).

[2] N. Bunge. "Quality of service improvement in ZFS through compression". MA thesis. Universität Hamburg, 2017 (cit. on pp. 3, 15, 20).

[3] M. Burrows, C. Jerian, B. Lampson, and T. Mann. "On-line Data Compression in a Log-structured File System". In: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS V. Boston, Massachusetts, USA: ACM, 1992, pp. 2–9. isbn: 0-89791-534-8. doi: 10.1145/143365.143376. url: http://doi.acm.org/10.1145/143365.143376 (cit. on p. 11).

[4] S. Deorowicz. Silesia compression corpus. url: http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia (visited on 06/01/2018) (cit. on p. 31).

[5] F. Ehmke. "Adaptive Compression for the Zettabyte File System". MA thesis. Universität Hamburg, 2015 (cit. on pp. 2, 14).

[6] D. A. Huffman. "A method for the construction of minimum-redundancy codes". In: Proceedings of the IRE 40.9 (1952), pp. 1098–1101 (cit. on p. 7).

[7] S. Kiselkov. 6400 ZFS smart compression. Oct. 2015. url: https://reviews.csiden.org/r/266/ (cit. on pp. 2, 14).

[8] S. Kiselkov. "ZFS (Smart?) Compression Types of Compression Algos". url: http://open-zfs.org/w/images/4/4d/Compression-Saso_Kiselkov.pdf (cit. on pp. 8, 14).

[9] B. Leonard. ZFS Compression - A Win-Win. Apr. 2009. url: https://blogs.oracle.com/observatory/entry/zfs_compression_a_win_win (cit. on p. 13).

[10] M. Mahoney. About the Test Data. Sept. 2011. url: http://mattmahoney.net/dc/textdata.html (visited on 06/01/2018) (cit. on p. 31).

[11] E. Nemerson. Squash Compression Corpus. url: https://github.com/nemequ/squash-corpus (visited on 06/01/2018) (cit. on p. 29).

[12] OpenVPN Man page. url: https://openvpn.net/index.php/open-source/documentation/manuals/65-openvpn-20x-manpage.html (cit. on pp. 2, 8).

[13] P. A. H. Peterson and P. L. Reiher. "Datacomp: Locally Independent Adaptive Compression for Real-World Systems". In: 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS). June 2016, pp. 211–220. doi: 10.1109/ICDCS.2016.106 (cit. on pp. 4, 9, 15).

[14] T. Raita. "An Automatic System for File Compression". In: The Computer Journal 30.1 (1987), pp. 80–86. doi: 10.1093/comjnl/30.1.80. url: http://comjnl.oxfordjournals.org/content/30/1/80.abstract (cit. on p. 10).

[15] B. Rockwood. Understanding ZFS: Compression. Nov. 2008. url: http://cuddletech.com/?p=473 (cit. on p. 13).

[16] O. Rodeh and A. Teperman. "zFS - A scalable distributed file system using object disks". In: Proceedings - 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, MSST 2003 (2003), pp. 207–218. issn: 1051-9173. doi: 10.1109/MASS.2003.1194858. url: http://ieeexplore.ieee.org/document/1194858/ (cit. on p. 11).

[17] D. Salomon. Data Compression: The Complete Reference. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2007. isbn: 1846286026 (cit. on pp. 6, 7).

[18] J. Ziv and A. Lempel. "A universal algorithm for sequential data compression". In: IEEE Transactions on Information Theory 23.3 (May 1977), pp. 337–343. issn: 0018-9448. doi: 10.1109/TIT.1977.1055714 (cit. on p. 7).

[19] J. Ziv and A. Lempel. "Compression of individual sequences via variable-rate coding". In: IEEE Transactions on Information Theory 24.5 (Sept. 1978), pp. 530–536. issn: 0018-9448. doi: 10.1109/TIT.1978.1055934 (cit. on p. 7).