
Adaptive Filesystem Compression for General Purpose Systems

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Jonathan Beaulieu

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

Peter Andrew Harrington Peterson

June 2018

© Jonathan Beaulieu 2018

Acknowledgements

I would first like to thank my advisor Dr. Peter Peterson for his always available support and countless hours of reading my rough writing. He also guided me through picking a thesis topic I found interesting and followed through by providing insightful ideas.

I am grateful for each of my committee members: Professor Peter Peterson, Professor Pete Willemsen and Professor Dalibor Froncek. They have all had a great impact on my education and way of thinking, for which I consider myself very lucky.

I would like to thank my LARS labmates: Brandon Paulsen, Ankit Gupta, Dennis

Asamoah Owusn, Alek Straumann and Maz Jindeel. Each of them has provided support and camaraderie in the lab.

To the man I sat next to for the past two years, Václav Hasenöhrl, thank you for being a great classmate, desk buddy and mathematics consultant.

I would like to send my deepest apologies to all of the reviewers of this thesis:

Xinru Yan, Alek Straumann, Conrad Beaulieu and lastly my committee members. Thanks for correcting my countless spelling mistakes and improving the readability of the thesis.

I would like to thank the department staff, Jim, Lori, Clare and Kelsey, for their work that played a role in this thesis albeit behind the scenes.

This thesis would not have been possible to finish without the support of my family and friends. Especially my girlfriend, fiancée and finally wife, Xinru Yan, for putting up with late nights and a person with water in their head. She spent countless hours of work so that I could spend time on this thesis.

Abstract

File systems have utilized compression to increase the amount of storage available to the user. However, compression comes with the risk of increased latency and wasted computation when compressing incompressible data. We used

Adaptive Compression (AC) to increase disk write speeds using compression, while not falling victim to incompressible data. Our AC system decides which compression algorithm to use from a list of available algorithms based on a model that queries CPU usage, disk bandwidth and current write queue length, and estimates compressibility using

“bytecounting,” an entropy-like statistic. Our model learns over time by building a table of estimated outcomes for each state and averaging previous outcomes for that state. We integrated our AC system into ZFS's write pipeline to reduce overall write times. In benchmarks with several standard datasets, our solution increased average write speeds by up to 48% compared to the baseline. For worst-case (incompressible) data, our system decreased write speeds by 5%, due to system overhead. Compared to a previous ZFS AC system, our system does not perform statistically worse on any dataset and writes datasets with many small compressible files up to 49% faster.

Contents

Contents
List of Tables
List of Figures

1 Introduction

2 Background
2.1 Overview
2.1.1 Compression
2.1.2 Current Modern Compression Algorithms
2.1.3 Adaptive Compression
2.1.4 File System Compression
2.1.5 ZFS
2.2 Related Work
2.2.1 ZFS smart compression
2.2.2 Quality of service improvement in ZFS through compression

3 Implementation
3.1 Adaptive Compression System Design
3.1.1 Methods
3.1.2 Monitors
3.1.3 Model
3.1.4 Estimating Model
3.2 Integration into ZFS
3.2.1 ZFS Write Pipeline
3.2.2 AC Compress Function

4 Experiment
4.1 Methodology
4.2 Environment
4.3 Datasets

5 Results
5.1 Raw Results
5.2 Result Analysis
5.2.1 Write Speed
5.2.2 Compression Choices

6 Conclusions
6.1 Future work

References

List of Tables

2.1 Auto Compress decision process
3.1 Monitor buckets
4.1 Systems
4.2 Experiment procedure
4.3 Sizes of datasets

List of Figures

1.1 State Diagram for Simple on/off
3.1 Diagram of “Bucketization”
3.2 The matrix used for storing estimated values
3.3 ZFS write pipeline with AC
4.1 Compression Ratio per block using gzip-1 for dataset enwiki
4.2 Compression Ratio per block using gzip-1 for dataset mozilla
4.3 Compression Ratio per block using gzip-1 for dataset sao
4.4 Compression Ratio per block using gzip-1 for dataset pdf
4.5 Compression Ratio per block using gzip-1 for dataset web
4.6 Compression Ratio per block using gzip-1 for dataset random
5.1 Total Writing Time for each system to write enwiki
5.2 Total Writing Time for each system to write mozilla
5.3 Total Writing Time for each system to write sao
5.4 Total Writing Time for each system to write pdf
5.5 Total Writing Time for each system to write web
5.6 Total Writing Time for each system to write random
5.7 Average throughput for each system while writing enwiki
5.8 Average throughput for each system while writing mozilla
5.9 Average throughput for each system while writing sao
5.10 Average throughput for each system while writing pdf
5.11 Average throughput for each system while writing web
5.12 Average throughput for each system while writing random
5.13 Compression choices per run made by n1kl and lars for enwiki
5.14 Compression choices per run made by n1kl and lars for mozilla
5.15 Compression choices per run made by n1kl and lars for web

1 Introduction

To understand the crux of this thesis, it is important to understand two fundamental properties of compression algorithms. The first property is that compression algorithms work by exploiting patterns found in the data they are compressing. This fundamental property has resulted in algorithms being better at compressing some data types than others, either by requiring fewer resources or by producing smaller output. A good example of this is the FLAC compression algorithm, which was designed specifically to compress digital audio files. It achieves a great compression ratio on most audio while using relatively few resources. On the other hand, it is very inefficient at compressing images. The second fundamental property is that there is a trade-off between the achievable compression ratio and the required computational resources. Since compression at its core is pattern matching or computing statistics, in general more time, energy and space are required to achieve better compression. This trade-off is so well established that most modern compression algorithms allow for some tunability of how many resources to use. This value is normally referred to as level or strength. A lower value indicates the algorithm will use fewer resources and, as a result, will do less compression and have a worse compression ratio. The opposite is true of larger values. Even with the downsides of these properties, countless software systems use compression algorithms as a means to improve their efficiency.

File systems are a critical component of every operating system, as they handle all long-term information storage. As a result, they can become a major performance bottleneck, especially when it comes to reading and writing data.

Many improvements in performance have come through caching and exploiting the physical structure of the storage devices. However, comparatively little attention has been paid to using compression as a means of increasing disk performance.

Figure 1.1: State Diagram for Simple on/off. (The system starts in the Comp state, moves to NoComp when the compression ratio falls below a threshold, and returns to Comp after some time.)

Compression has long been seen as a way to increase the density of information on the disk. However, this has come with the price of decreased I/O speeds and wasted computational power from compressing already compressed data. A possible solution to these drawbacks is Adaptive Compression (AC). AC systems make a decision about whether or how to compress a given input. Two naïve types of AC systems that have been used in the past are simple on/off methods [7, 12] and making decisions based on file types [5]. Simple on/off methods are based on the idea that data that is close together, either temporally or spatially, should exhibit similar entropy properties.

In other words, if the first few bytes of a file are easily compressible then the next few bytes should also be, and vice versa. This method is normally implemented by compressing all data by default and checking the compression ratio of the last X bytes. If those bytes have a compression ratio worse than a threshold, then the system will stop compressing for a set amount of time or bytes. After disabling compression, the system will reset and start compressing again after a predefined interval. This algorithm is modeled in Figure 1.1. Systems in this group lack the potential to gain

the benefits of using different compression algorithms and/or different compression levels for different data types.
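To make the on/off approach concrete, the sketch below shows the decision logic of Figure 1.1 in Python. It is a minimal illustration rather than any particular implementation; the threshold and back-off size are assumptions chosen only for the example.

```python
# Minimal sketch of the simple on/off adaptive scheme in Figure 1.1.
# RATIO_THRESHOLD and BACKOFF_BYTES are illustrative values, not taken
# from any real system.
RATIO_THRESHOLD = 1.05    # below this, recent data is treated as incompressible
BACKOFF_BYTES = 1 << 20   # how many bytes to skip before trying compression again

class OnOffCompressor:
    def __init__(self, compress):
        self.compress = compress        # callable: bytes -> compressed bytes
        self.skip_remaining = 0         # > 0 means we are in the "NoComp" state

    def write_block(self, block: bytes) -> bytes:
        if self.skip_remaining > 0:               # NoComp: pass data through untouched
            self.skip_remaining -= len(block)
            return block
        out = self.compress(block)                # Comp: compress by default
        if len(block) / len(out) < RATIO_THRESHOLD:
            self.skip_remaining = BACKOFF_BYTES   # recent data compressed poorly; back off
        return out if len(out) < len(block) else block
```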

Using file types as a way to easily identify data types seems like a good solution to the problem with on/off systems; however, it also has its own drawbacks. In practice this group of systems is implemented using a mapping of file types to compression algorithms. The file types are normally identified by the file extension or Multipurpose

Internet Mail Extensions (MIME) type. These systems require a hard-coded table, which would be optimal if it could be complete and accurate. However, a table can only be accurate if the following two assumptions are correct: 1) there is always a single best algorithm per file and 2) the best algorithm is not dependent on the current environment of the computer. These assumptions do not hold true in most cases. Files with the same file type are allowed to contain multiple types of data, some of which could be very compressible for a given algorithm and some of which could be the worst case for that same algorithm. On modern multi-tasking systems it is important to not forget that other tasks may be competing for resources. This means that at times the compression algorithm may only be able to use half of the total resources, causing it to compress much more slowly.

Recently, there has been some work on developing more complex algorithms to increase the quality of the compression choices being made. Niklas Bunge created an AC system that chooses a compression algorithm based on the number of queued blocks waiting to be written and a rolling average of previous write speeds. [2] His system organizes the compression algorithms it uses into a list ordered from fastest to slowest. Notice that it allows for the use of different compression levels by treating each pair of compression algorithm and level as a unique algorithm. It switches to a faster compression algorithm when the estimated time to compress the currently queued writes, given the average compression speed of the current

compression algorithm, exceeds the estimated time to write the data to the disk. Conversely, it switches to a slower compression algorithm when it estimates that the

compression algorithm will require less time than the disk write.

The goal of this research is to explore other general AC methods that can improve

the state-of-the-art file system AC system. We compared a more dynamic model, which uses different factors for choosing an algorithm, against Niklas Bunge's work. We created an

AC system inside a file system based on the general AC system, Datacomp [13], which

uses a list of available compression algorithms and creates a model that learns which

algorithm works best for different types of data over time. We have posted the source

for this system on GitHub.¹ This thesis explores the effectiveness of AC methods that build upon previous work in the field by applying them to file systems. Our results provide valuable insight for future work into AC systems for both the compression and file system fields. We found that simple byte counting is an accurate enough compressibility measure to make compression choices which lead to improved I/O speeds. Other inputs to our

AC system included CPU usage, available bandwidth and current write queue length

(based on Bunge's work). We found that these inputs, together with our system's design, resulted in a system that wrote data up to 48% faster than our baseline, no compression, and 49% faster than Bunge's system.

¹ https://github.com/derpferd/zfs/tree/lars

2 Background

2.1 Overview

The purpose of this thesis is to add adaptive compression to a file system with the goal of providing a speedup for disk I/O.

2.1.1 Compression

Compression is a process with the goal of taking information encoded with bytes and changing the representation of the information to an encoding that requires fewer bytes. There are two types of compression: lossy and lossless. Lossy compression is practical for information when it is acceptable to only keep an approximation of the information. Applications of this include compressing images and audio, because some quality can be lost without a human being able to notice. An example is an algorithm which, like every compression algorithm, uses patterns to compress information. When matching patterns in the image, instead of requiring an exact match, it could allow for some close matches. This slightly distorts the image in a way not noticeable to a human while resulting in a smaller representation of what seems to be the same information to a human. However, there are other types of data like programs, documents and configuration files where this loss of information is not acceptable. For information of this type we use lossless compression, which encodes the information into a smaller representation that is fully reversible, meaning it can

compress a file then decompress the compressed file to get a file exactly the same as the original file. One of the expectations of a file system is that information read back will be exactly the same as when it was written. In this paper we will use many different compression algorithms with different trade-offs; however, all of them will be lossless. The metric used in the compression field to measure how effective a compression algorithm is on some data is the compression ratio. The compression ratio is the size of the original data over the size of the data after the compression algorithm has compressed it.

$$\text{compression ratio} = \frac{\text{size of input data}}{\text{size of output data}}$$

This means that larger compression ratios are better since the size of the output is smaller.

Types of lossless compression

Under the category of lossless compression there are many different algorithms, each with their own trade-offs. There are two main types of general lossless compression methods: 1) statistical methods and 2) dictionary-based methods. Many times one or both of these methods are applied together with other basic compression techniques to create current compression algorithms. [17]

Statistical methods

Statistical methods use the fact that the symbols (e.g., a unique byte) in a file have a non-uniform probability of appearing.[17] Natural languages are a great example. One of the first examples of someone designing a statistical compression scheme is when Samuel Morse created a telegraph encoding: instead of every letter taking the

same amount of dots or dashes, the most used letters require fewer and the least used letters require more. However, his method was designed for the English language only.

A very popular statistical method is Huffman Coding. [6] Huffman Coding assigns variable length codes to symbols based on the probability of each symbol appearing in the input using a binary tree.
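As a brief illustration of the idea, the sketch below builds a Huffman code table with Python's heapq module; symbols that appear more often receive shorter codes. It is a textbook construction, not code from this thesis or from ZFS.

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code table: frequent bytes get shorter bit strings."""
    freq = Counter(data)
    # Heap entries are (frequency, tie-breaker, node); a node is either a byte
    # value (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                       # degenerate input with one distinct byte
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):          # internal node: recurse into both children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the accumulated code
            codes[node] = prefix
    walk(heap[0][2])
    return codes

# Example: in b"to be, or not to be" the space and "o" receive the shortest codes.
```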

Dictionary methods

Dictionary compression is based on the idea that files have repeating sequences in them. [17] For example if we have the phrase “to be, or not to be”, we can see that a group of symbols repeat. Therefore, we could just store “to be, or not” and a description of how the first and second words repeat at the end. This style of compression was first practically applied by Abraham Lempel and Jacob Ziv in their

1977 and 1978 papers. [18, 19] Their compression algorithms, LZ77 and LZ78, are the basis for many other algorithms, forming the LZ family of compression algorithms.
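A rough sketch of the LZ77 idea follows: the compressor scans a sliding window of recent data for the longest match and replaces a repeat with an (offset, length) back-reference, falling back to literal bytes otherwise. This is a simplified greedy illustration under assumed window and match-length parameters, not the original algorithm's exact tokenization.

```python
def lz77_compress(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy LZ77-style parsing into literals and (offset, length) back-references."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):        # search the sliding window
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append(("ref", best_off, best_len))   # e.g. the second "to be" above
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:                                         # copy byte-by-byte so overlapping matches work
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])
    return bytes(out)

# lz77_decompress(lz77_compress(b"to be, or not to be")) == b"to be, or not to be"
```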

2.1.2 Current Modern Compression Algorithms

Data compression remains an open problem, with new and improved solutions appearing periodically. Table 2.1.2 lists the most common lossless compression algorithms in use today.

Each compression algorithm has its own drawbacks due to either speed or poor compression of some data types. The file system that we will be working with in this thesis currently only implements Deflate, in the form of gzip - a “slow” algorithm, and lz4 - a “fast” algorithm.

Table 2.1.2: Common lossless compression algorithms in use today

Name      Date            Methods Used                          Description
Deflate   1990            LZ methods and Huffman Coding         Widely adopted compressor. Medium speed with medium compression ratios. Has been implemented in hardware on some systems.
bzip2     1996            RLE, BWT, MTF, Huffman coding         Slower speed and high compression ratios.
LZMA      1996 or 1998    LZ methods and range encoding         Very slow compression and fast decompression with high compression ratios.
LZ4       2011            LZ methods                            Fast compression and very fast decompression. Poor compression ratio.
zstd      2015            LZ methods and Finite State Entropy   Relatively fast while maintaining high compression ratios.
Snappy    2011            LZ methods                            Fast compression and decompression. Poor compression ratios.

2.1.3 Adaptive Compression

Adaptive Compression (AC) takes some state and a set of compression algorithms and decides which algorithm, if any, to apply to the data. The simplest type of AC utilizes a single compression algorithm and simply decides to compress, or not.

Many applications use this primitive AC method, including OpenVPN and ZFS Smart

Compression.[8, 12] The algorithm for deciding whether to compress samples the compression ratio: if it is above a threshold it will continue compressing; otherwise it will stop compressing for a set amount of time before restarting. Much better algorithms have been made using different approaches. Some theoretical algorithms have been developed that have not been implemented in a real-world file system. [1]

However, for this thesis we have decided to use a general and practical algorithm, and therefore we will not be able to compare with these systems.

8 Datacomp

Peter Peterson developed an AC system, Datacomp, focusing on the compression of network traffic, using AC to increase the rate of transfer. [13] The overall idea here is to have the CPU compress the data before sending it so that the transfer is faster. The difference between their AC system and simply selecting a single compression algorithm to compress all traffic comes down to two goals: if the CPU time is needed by other tasks running on the machine or the data is not very compressible, the system should not waste resources compressing, and it should be able to handle all network traffic in a general fashion. AC is well suited to provide a solution which meets both of these goals. They break apart their AC system into three different generalized subsystems:

Methods, Monitors and Models. Methods are basically a wrapper for a given compression algorithm that includes other parameters like the strength of the algorithm, the amount of input to compress and the thread count. Monitors provide the AC system with information about what type of data is being sent or how many resources are currently available, so that the AC system can make better decisions. Last,

Models take in the information given from the Monitors and the list of available

Methods and make a decision. Datacomp’s user interface consists of the functions

‘dcread’ and ‘dcwrite’, which appear to act exactly the same as the standard POSIX ‘recv’ and ‘send’ functions. However, they wrap the ‘recv’ and ‘send’ functions with the code necessary to adaptively compress outgoing data and decompress incoming data.

For methods they chose the following algorithms: LZO, bzip2, and xz. For monitors they used CPU usage, CPU frequency, available bandwidth and a compressibility estimate. Their compressibility estimate, bytecount, is based on counting the

number of unique bytes that appear in the input more than a set number of times. This number of times is set by calculating the expected average number of appearances of a byte value in the data. The fewer unique byte values that reach this threshold, the more the data is dominated by a few byte values, and therefore the more compressible the data should be. They use this simple measure because it is very fast, requiring only a single pass through the data.
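A minimal sketch of the bytecount idea, as we understand it from the Datacomp description, is shown below; the exact threshold used by Datacomp may differ, so treat the expected-average cut-off here as an assumption.

```python
def bytecount(block: bytes) -> int:
    """Count how many distinct byte values appear more often than expected.

    The threshold is the count each value would have if bytes were uniformly
    distributed (len(block) / 256). A small result means a few byte values
    dominate the block (likely compressible); a result near 256 suggests
    high-entropy, likely incompressible data. Only one pass over the data.
    """
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    expected = len(block) / 256    # assumed threshold: expected average appearances
    return sum(1 for c in counts if c > expected)
```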

Before we describe Datacomp’s model we must understand the concept of Effective

Output Rate (EOR). EOR is simply the transfer rate achieved taking compression into account. Their model is designed to learn over time to make better decisions for each situation. It consists of a multidimensional matrix for each method, where the inputs are the dimensions and the contents of each cell are the data required to maintain a rolling average of the EOR for the given inputs and method. Their rolling average is computed like a typical mean up to 20 values; then, to avoid overflow, it subtracts the current average from the sum and adds the new value, still dividing by 20. Two important things to note about their model are that 1) the model has a hard-coded threshold on bytecount at which point the model will always choose to not compress and 2) they also quantize the input values into predefined buckets. This is done to reduce the size of the matrix in order to make it more manageable.

2.1.4 File System Compression

Applying compression to file systems seems like a good idea to improve space utilization. A paper by Timo Raita in 1987 started the exploration of compression in

file systems. [14] He created a system for identifying unused files in home directories, compressing them and accessing them later. This system would be run at startup and would compress in the background so as not to be noticed. This type of file compression,

however, is not transparent. In a nontransparent system the user can notice what the system is doing or did.

There are many ways in which a user could notice a nontransparent system. In the case of a file system, it could be that the system “randomly” requires more resources than expected or moves files automatically. Raita's system required the user to follow a different sequence of steps to access a file that the system had compressed. This hassle of

Raita's system resulted in minimal adoption of it. For widespread adoption, there would need to be a transparent system.

One of the first practical file system compression designs was created by Michael

Burrows and others from the DEC Systems Research Center in 1992. [3] Their design took the already existing file system, Log-structured File System (LFS), and added compression in order to increase the amount of data which could be stored. They kept their system design transparent so users would not need to change their usage to gain the benefits of compression. This allowed users of any LFS to replace their current file system with DEC’s compressing file system. Since this file system was released, many of the popular file systems have included support for transparent compression however many leave it turned off by default due to the downsides of always on compression.

2.1.5 ZFS

The Zettabyte File System (ZFS), designed by Jeff Bonwick et al., was created to tackle some of the drawbacks of traditional file systems.[16] Some of these problems include lack of scalability, automatic integrity checking and easy administration. To tackle these problems they needed to change the traditional file system architecture.

Their changes to the architecture included a redesign of how the file system is related to a volume, pooled storage, copy-on-write and automated checksums.

The traditional relationship between file system and volume is one or more file systems per volume. They redesigned the architecture to allow multiple volumes per file system. This allowed for one of the main features of ZFS that makes it different from traditional file systems: its pooled storage design. Traditional

file systems are assigned per virtual disk. This has many drawbacks including no space sharing between file systems and difficult administration. When using a traditional system we must assign each file system a certain amount of space and in order for another file system to use some of that space in the future we have to resize and shift the file systems. Storage pooling does away with this. All the disks are “pooled” together into a single storage pool which is in turn shared between all of the file systems.

The designers of ZFS thought that data checking and correcting should not only be left up to hardware or RAID controllers. Instead, the file system should catch errors on all the hardware, software and firmware being used by the file system. Traditionally file systems blindly trust and relay data from disk controllers. This can lead to uncorrected data corruption or in a worst case a system panic. ZFS fixes this problem by storing a checksum along with the data. This serves as a flag to detect if a disk read is returning corrupted data. If the ZFS pool is set up using a mirror in the configuration, ZFS will automatically detect which data is correct then rewrite the data on the corrupted disk.

To further help with error prevention, ZFS always has a consistent structure on disk. This is accomplished through atomic transactions and a copy-on-write design.

The transactions are applied one at a time with all states in between writes being valid. Bonwick et al. also redesigned the architecture of how the file system interacts with the OS and the disks. This new architecture contains three layers, the Storage

Pool Allocator (SPA), the Data Management Unit (DMU) and the ZFS POSIX Layer (ZPL). The SPA takes blocks of a configurable size and virtual addresses from the

DMU and decides where to write the data to. The DMU implements a general

transactional object interface for the ZPL to use. The DMU transforms objects into block transactions that the SPA executes in an atomic way. The ZPL implements a POSIX file system (the Vnode operations) on top of the DMU layer.

Compression in ZFS

Compression in ZFS is handled at the SPA level. The DMU translates everything

into blocks of a configurable size that are then passed to the SPA. ZFS's block size can be set to any power of two in the range from 512 bytes to 128KB. The default value is 128KB. When compression is turned on, the SPA will take

in the block, compress it with the configured compression algorithm, and store the

output with the metadata necessary for decompression.

Ben Rockwood explains in his article “Understanding ZFS: Compression” how and why to use ZFS compression. [15, 9] He states that compression not only saves space but can also speed up disk I/O throughput (something this thesis builds on). ZFS compression is configured per directory and will be inherited by all sub-directories. Only files written after enabling compression will be compressed. In addition, the compression algorithm can be changed at any time, which will also only influence future files. This means we can have many files in the same directory compressed by different compression algorithms. “du” is a useful command to list the actual space taken by a given file (after compression), as opposed to “ls”, which lists sizes independent of the compression.

13 2.2 Related Work

2.2.1 ZFS smart compression

ZFS’s compression option does not allow for some files to be excluded from being

compressed. This can be very wasteful when compressing hard to compress files,

such as encrypted files or files in an already compressed format. Saso Kiselkov noticed

this in 2013 and decided to fix it with his ZFS smart compression. [8, 7] In his talk he points out that ZFS compression settings are per file system and each file

system contains a mix of compressible data (e.g., text documents) and pre-compressed data (e.g., archives and multimedia).

He describes several “bad solutions”, before presenting his solution. The first “bad solution” is to have a file extension table inside ZFS to tell whether the files by that

extension are compressible. This is a poor solution because there are many extensions,

and because renaming a file should not change the behavior of the file system storing

it. This is also a security issue. Someone could easily cause a denial of service attack

by renaming a large incompressible file with an extension of a highly compressible file type, resulting in the file system spending resources on trying to compress it.

This design has been discussed by many people in the file system community. In

2015, Florian Ehmke implemented a system following this design in ZFS as his master's thesis work. [5] His system was not merged into any official ZFS releases due to the aforementioned issues.

The second “bad solution” is to let the admin control which files are to be compressed. This is clearly not general and could not be applied in bulk.

ZFS smart compression is an on/off system as we described in chapter 1. It tries

to compress each file and tracks how well the first part of the file compresses. If the file does not compress, then it will not compress the rest of the file and continue to

not compress the file until a certain number of writes have been made to it. This is a naïve AC system and is far less effective than Datacomp.

2.2.2 Quality of service improvement in ZFS through compression

The latest AC system to be implemented for ZFS is Niklas Bunge's auto compression. [2] He created an AC system, called auto compress, as a part of his thesis, which is titled “Quality of service improvement in ZFS through compression”. Auto compress is more complex than ZFS smart compression; however, it is based on a similar

foundational assumption: the compressibility of data in a file is uniform. This means

that the compressibility of one piece of a file should be the same as the rest of the file.

The second foundational idea auto compress utilizes is: the more bytes in the write queue the more time the system can use to compress data without slowing down the

write speed, while fewer bytes in the queue means less time for compression. The write

queue is a data structure that the SPA layer of ZFS uses to store incoming write

requests until the disk is free to write the data. This queue acts as a buffer where blocks sit while they wait. The third assumption auto compression makes is that compression algorithms can be sorted from fastest to slowest and that a faster compression

algorithm, one that takes less CPU time for the same amount of data, will have a

worse compression ratio than a slower one. This intuitively makes sense: the more

work, the greater the reward. However, as Peterson discussed in his Datacomp paper, this is not strictly true, since some algorithms are better at certain data types regardless of

how much CPU time they take. [13]

Auto compress is implemented inside the ZFS write pipeline. It looks at each

block before it is added to the queue and decides which compression algorithm if

any to compress it with before it is added to the queue. Auto compress decides which algorithm it should use based on the current state and a table of compression algorithms. This table contains all the algorithms available for the system to pick from, sorted from fastest to slowest. Auto compress always starts by selecting the fastest algorithm, no compression. It then adjusts the currently selected algorithm based on two numbers.

The first number that auto compress uses to select a compression algorithm is the estimated delay for the current block to start to be written to the disk. Bunge calls this the queue delay. The queue delay can be estimated by dividing the number of bytes currently in the write queue by the estimated write speed of the disk.

$$\text{queue delay} = \frac{\text{queue size}}{\text{disk write speed}}$$

The second number that auto compress uses is the estimated time it will take to compress the given amount of data. This value is calculated by dividing the number of bytes in the current block by the compression throughput for the given compression algorithm. Compression throughput is calculated by taking the average throughput of compressing past blocks in the same file with the same algorithm. With these two estimates auto compress starts its decision process, which is outlined in table 2.1.

It either selects a compression algorithm that is one algorithm faster or slower, the neighboring algorithms in the algorithm table, or stays with the same algorithm. If the queue delay is smaller than the estimated compression delay then auto compress selects the faster neighboring algorithm, if one exists. If the queue delay is larger than the estimated compression delay for the slower neighboring algorithm then it selects it. Otherwise auto compress stays with the current algorithm.

Case                                                                                    Reaction
Estimated compression delay is larger than the queue delay                             Select faster algorithm
Estimated compression delay for the slower algorithm is smaller than the queue delay   Select slower algorithm
Default                                                                                 Repeat last choice

Table 2.1: Auto Compress decision process.
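A sketch of this decision process in Python is given below. The algorithm ordering, the throughput table and the variable names are illustrative assumptions for the sake of the example; the real auto compress logic lives inside the ZFS write pipeline.

```python
# Sketch of the auto compress rule in Table 2.1. ALGOS is assumed to be sorted
# from fastest to slowest, and comp_throughput holds estimated compression
# throughput (bytes/s) per algorithm, learned from previously compressed blocks.
# "off" can be modeled with an effectively infinite throughput.
ALGOS = ["off", "lz4", "gzip-1"]

def next_algorithm(current, block_bytes, queue_bytes, disk_bps, comp_throughput):
    idx = ALGOS.index(current)
    queue_delay = queue_bytes / disk_bps
    comp_delay = block_bytes / comp_throughput[current]
    if comp_delay > queue_delay and idx > 0:
        return ALGOS[idx - 1]                 # compression can't keep up: go faster
    if idx + 1 < len(ALGOS):
        slower = ALGOS[idx + 1]
        if block_bytes / comp_throughput[slower] < queue_delay:
            return slower                     # the queue gives us slack: go slower/stronger
    return current                            # default: repeat the last choice
```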

3 Implementation

The goal of this thesis is to speed up disk I/O using compression. In this chapter we will layout the details of the methods taken to reach this goal and how these methods were implemented in ZFS.

3.1 Adaptive Compression System Design

We chose to model our AC System after Datacomp’s design. This means we organized the system into three parts: Methods, Monitors and Models.

3.1.1 Methods

Methods are the different compression algorithms available to the AC system. Instead of using the methods found in Datacomp, we used the methods that are currently built into ZFS. ZFS currently supports the following compression algorithms: lz4, lzjb, gzip (1 through 9) and zle. However, lz4 is a replacement for lzjb, therefore lzjb is not a good option. Zle was designed to help with the use case where a user stores large amounts of zeros in a file as placeholders; it only compresses repeating zeros in files. This use case is no longer popular, as it is a poor system design. Therefore the only methods our system uses are lz4, gzip and no compression. We designed our system to allow for easy integration of compression algorithms. We did this to allow for the addition of future algorithms like zstd or lz4fast, which are planned to be introduced into ZFS.

18 3.1.2 Monitors

Monitors provide the AC system with values describing the current environment or attributes of the input data. While similar to Datacomp, our monitors are different because our AC system is working with a hard drive instead of the network. We have monitors for compressibility, queue length, disk bandwidth and CPU usage. An important feature of the monitors is that each must be quick to compute and have a value which is correlated to the total write time.

Compressibility

Compressibility is a function which, given data, returns a value that strongly correlates with the compression ratio achieved when compressing the data with a compression algorithm. The most accurate function would be to simply compress the data using a compression algorithm. This would give the correct answer by definition. However, this requires the same amount of time as compressing the data, which would render it useless when trying to speed up disk operations. Thus our system uses an efficient function which estimates the compressibility based on the variation in the input data. We used the same estimator found in Datacomp, bytecounting (BC). BC estimates the compressibility by sampling the distribution of bytes in the input as described in section 2.1.3.

Queue Length

The main indicator of how much computational power we will be able to use to compress the incoming data, before the pipeline starts waiting for the next block to write to the disk, is the disk queue length. This monitor was implemented as a part of the work Niklas Bunge did in “Quality of service improvement in ZFS through compression”. [2] In his thesis, he used the queue length to estimate how long the algorithm had to compress the current data before the writing process would need to start. We used similar methods to those found in his thesis. We define queue length as the number of bytes currently ready and waiting to be written to the disk.

Bandwidth

Bandwidth is one of the most important monitors, as it allows the system to select the optimal compression strength to strike a balance between increasing the compression ratio and keeping up with the speed of the data transfer. The bandwidth is calculated by a statistics update function called as a callback in the ZFS write pipeline after the actual disk write has completed. A rolling average is kept in bytes per second (BPS) and calculated as described by Equation 3.1. After each disk write, the rolling average is updated using a weight of 1/1000 and a value of the number of “physical” bytes transferred over the number of seconds needed to complete the transfer.

$$BPS_{n+1} = \frac{999 \cdot BPS_n}{1000} + \frac{\text{bytes transferred} / \text{seconds to transfer}}{1000} \qquad (3.1)$$
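In code, the update of Equation 3.1 amounts to a single weighted average per completed write; a minimal sketch:

```python
def update_bps(bps_avg: float, bytes_transferred: int, seconds: float) -> float:
    """One step of the Equation 3.1 rolling average: the previous average keeps
    a weight of 999/1000 and the newest sample gets a weight of 1/1000."""
    sample = bytes_transferred / seconds
    return (999.0 * bps_avg + sample) / 1000.0
```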

CPU Usage

The CPU usage monitor enables the model to make a different decision depending on the current load on the CPU. This is crucial for the performance of the model since, when the CPU is under a high load, every compression algorithm will end up waiting for the necessary compute power. This results in high turnaround times for the compression phase, which in turn delays the rest of the pipeline.

20 3.1.3 Model

We modeled the file system’s total writing time (TWT) as a function of number

of input bytes (IB), compression rate (CR), queue length (QS), disk rate (DR) and

number of bytes after compression (OB). Our final model for TWT is described by

Equation 3.2.

$$TWT = \underbrace{\max\left\{\frac{IB}{CR}, \frac{QS}{DR}\right\}}_{\text{pre-write time}} + \underbrace{\frac{OB}{DR}}_{\text{write time}} \qquad (3.2)$$

The TWT is the sum of the time spent before the write (the pre-write time) and the time spent during the write. Before the write, the block is compressed while the data already in the queue is being written to the disk. Therefore the minimum amount of time spent before the disk starts writing the data is the time it takes to write all the data in the queue to the disk, that is, $QS/DR$. However, sometimes the compression algorithm requires more time than it takes for the queue to empty; in this case the pre-write time would be $IB/CR$. So the pre-write time is simply the maximum of both cases. The write time is $OB/DR$. Given this equation, our system knows how long a write would take and can select the best compression algorithm for the current environment by minimizing

TWT with respect to all compression algorithms. This is a complete model of the system. However, three of these values cannot be known beforehand: DR, CR and OB. DR cannot be known exactly, but disk write rates are very predictable, so our system simply uses historical write rates to estimate this value. The other two unknown values are CR and OB. It is impossible to know how long the compression algorithm will take on some data without compressing it. Similarly, there is currently no known way of computing OB apart from actually compressing the data. Since $OB = \frac{IB}{RT}$, where RT is the compression ratio, and we know IB, we only need to estimate RT. To be able to utilize this model in order to make good compression decisions, we estimate the values of CR and RT using models similar to the model used by

Datacomp. We called this type of model an Estimating Model, which is described in

detail in section 3.1.4. We denote this type of model as $M_y(x_1, \ldots, x_n)$, where $y$ is the variable being estimated and $x_1$ through $x_n$ are the variables which $y$ is dependent on.

Complete System Model

Let $C$ be the set of available compression algorithms. Let $CR(c) = M_{CR}(BC, CPU, c)$, where $BC$ is the byte count, $CPU$ is the percentage of idle CPU and $c$ is a compression algorithm.

Let $RT(c) = M_{RT}(BC, c)$, where $BC$ is the byte count and $c$ is a compression algorithm.

$$\text{Optimal compression algorithm} = \operatorname*{arg\,min}_{c \in C}\left\{\max\left\{\frac{IB}{CR(c)}, \frac{QS}{DR}\right\} + \frac{IB}{RT(c) \cdot DR}\right\}$$
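The sketch below restates this selection rule in Python. The estimating models $M_{CR}$ and $M_{RT}$ are represented as plain callables and all other values are assumed to be measured or estimated already; this is an illustration of the decision rule, not the in-kernel implementation.

```python
def choose_algorithm(algos, IB, QS, DR, bc, cpu_idle, M_CR, M_RT):
    """Pick the algorithm minimizing the predicted total writing time (TWT)."""
    def predicted_twt(c):
        CR = M_CR(bc, cpu_idle, c)           # estimated compression rate (bytes/s)
        RT = M_RT(bc, c)                     # estimated compression ratio
        OB = IB / RT                         # estimated size after compression
        pre_write = max(IB / CR, QS / DR)    # compress while the queue drains
        return pre_write + OB / DR           # then write the (compressed) block
    return min(algos, key=predicted_twt)
```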

3.1.4 Estimating Model

For each of the estimating models we use the values of the environment variables which they are dependent on. The model to estimate compression rate is dependent on

the amount of computational resources available and the compression algorithm. For

this model we drew inspiration from Datacomp’s results, to know which values would

represent the environment well. We settled on using the compressibility monitor and

the CPU load monitor. For the model to estimate compression ratio we only needed two inputs: the compressibility and the compression algorithm. We only need these

two, since the compression ratio is only dependent on the data and the algorithm and is independent of the environment. In the subsections below we will discuss exactly how our estimating model processes the input and “learns”.

Figure 3.1: Diagram of “Bucketization”

Input

The input to the model is a list of features. These features are the current values

of each monitor which the output variable is dependent on. The model quantizes each of the features via simple discretization. Our system’s method of discretization,

Bucketization, will be explained with the help of Figure 3.1. The figure shows the two

components of the Bucketization: predefined cut-offs and their respective buckets.

Each input value is sorted into a bucket by finding the bucket with a low cut-off

less than the value and a high cut-off greater than the value. Using Figure 3.1 as an example, the Bucketization of 25% would be 1. This input preprocessing is important

since it reduces the size of the feature space. Given a feature, which is a percentage,

the feature space would have a size of 100; given two percentage features the size

would be $100^2$. However, after Bucketization, two percentage features would have a feature space of size only $3^2$, since after Bucketization each feature would have 3 possible values.

The downside of Bucketization is the loss of some precision of the input values. As noted by Datacomp, Bucketization allows the model to be more efficient while still

achieving good accuracy. Table 3.1 lists the buckets for each monitor used in the estimating model.

Table 3.1: Monitor buckets

Monitor      Bucket 0   Bucket 1   Bucket 2   Bucket 3
CPU idle     > 66%      > 33%      > 0%       0%
Bytecount    ≥ 100      ≥ 66       ≥ 33       < 33
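A minimal sketch of the bucketization of these two monitors, using the cut-offs from Table 3.1:

```python
def bucketize_cpu_idle(idle_percent: float) -> int:
    """Map a CPU-idle percentage to a bucket index (cut-offs from Table 3.1)."""
    if idle_percent > 66:
        return 0
    if idle_percent > 33:
        return 1
    if idle_percent > 0:
        return 2
    return 3

def bucketize_bytecount(bc: int) -> int:
    """Map a bytecount value (0-256) to a bucket index (cut-offs from Table 3.1)."""
    if bc >= 100:
        return 0
    if bc >= 66:
        return 1
    if bc >= 33:
        return 2
    return 3
```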

Learning

We chose to treat the problem of estimating these variables as a machine learning problem. We need a model that outputs a prediction for the output variable, the variable that we must estimate, given the input variables, the variables that the output is dependent on. The system only knows the value of the

output variable for a given input pair through exploration, where exploration means

trying an input pair and recording the outcome. The system needs to “explore” the feature space by selecting different values for the inputs that it can control.

Our system “learns” through an exploration process, called training, and recording

values in a matrix. The dimensions of the matrix represent each input and the value

stored in the matrix is the estimated output for the given input. An example is if

we have the following inputs: $x_1 \in \{1, 2\}$ and $x_2 \in \{1, 2, 3\}$, then the matrix would be a 2-by-3 matrix. Figure 3.2 shows this example matrix. We store the estimated $y$ value for $x_1 = 1$, $x_2 = 3$ at $A_{1,3}$, where $A$ is the matrix. During the training phase, each compression operation is timed. The time is converted to a rate using IB, like so: $CR = \frac{IB}{\text{time}}$. The matrix value is then updated by summing the weighted old value and the weighted new value. For the weights we choose 0.999 for the old value and 0.001

for the new value. This gives little weight to the new value; in this way the model is more stable, since spikes and noisy values have little influence over the estimate.

24 [ ] A A A A = 1,1 1,2 1,3 A2,1 A2,2 A2,3

where Aa,b = the estimated value when x1 = a and x2 = b. Figure 3.2: The matrix used for storing estimated values

∗ 999 Ax1,x2 new value We express this mathematically like so Ax1,x2 = 1000 + 1000 . Note that the values in the matrix are initialized to the first recorded value.
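Putting the matrix and the update rule together, a sketch of the estimating model might look as follows. The matrix dimensions (4 CPU-idle buckets × 4 bytecount buckets × 3 methods) mirror Table 3.1 and our method list, but are assumptions of this illustration rather than the exact in-kernel data structure.

```python
import numpy as np

class EstimatingModel:
    """Bucketized lookup table with a heavily damped rolling-average update."""
    def __init__(self, shape=(4, 4, 3)):
        self.values = np.zeros(shape)            # current estimate per input combination
        self.seen = np.zeros(shape, dtype=bool)  # whether a cell has been initialized

    def update(self, idx, observed):
        if not self.seen[idx]:
            self.values[idx] = observed          # initialize to the first recorded value
            self.seen[idx] = True
        else:                                    # weights: 0.999 old value, 0.001 new value
            self.values[idx] = (999.0 * self.values[idx] + observed) / 1000.0

    def estimate(self, idx):
        return self.values[idx]

# During training:       model.update((cpu_bucket, bc_bucket, method_id), measured_rate)
# During normal writes:  model.estimate((cpu_bucket, bc_bucket, method_id))
```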

3.2 Integration into ZFS

We integrated our AC system into ZFS’s write pipeline. This pipeline is shown in

Figure 3.3. We will describe this in detail throughout this section. The granularity of ZFS's compression selection is very fine. It allows each block to be compressed using any algorithm ZFS supports by storing the needed metadata per block. This means that to integrate our AC system into ZFS, we only have to modify the compression step of the ZFS write pipeline to select the algorithm which our model deems best; ZFS then records the selected algorithm along with the block. During the decompression step of the reading pipeline, ZFS will use the algorithm that it had recorded.

3.2.1 ZFS Write Pipeline

The ZFS write pipeline is the sequence of steps that ZFS uses to write information to the disk. There are many steps; however, we will focus only on the relevant steps, which are shown in Figure 3.3. It is important to note that by the time any information is passed to the ZIO Write step, the information has already been split into blocks. This means that each write request contains a single block of data.

Figure 3.3: ZFS write pipeline with AC

The first thing the pipeline does after receiving the block of data is compress it. In this step ZFS reads the configuration and decides which compression algorithm to call. To integrate our system, we created a compression configuration value so that when the compression configuration is set to lars, ZFS calls our system's compress function. After the block is compressed, ZFS continues to the DVA Allocate step. This is when space on the disk is found for the block. It is important to note that disks have a smallest addressable unit, a sector, which is normally 512 bytes. During this step ZFS rounds the size of the block up to the nearest multiple of the sector size. Each block is assigned a virtual address that represents the disk and sectors that the block should be written to.

26 The VDEV Write step is the step that actually does the writing. During this step ZFS writes the block to its assigned sectors. When the write is finished the ZIO Done

step is reached. At this point statistics about the write are recorded. Our system

during this step increments the number of bytes written which is used to estimate

the disk’s write speed.

3.2.2 AC Compress Function

Our AC system is implemented inside the lars compress function. Algorithm 1 lays out how the compress function works.

Algorithm 1: Lars Compress Function
 1: gather monitor values
 2: if training then
 3:     start time ← current time
 4:     compress data with random compression algorithm
 5:     compression delay ← current time − start time
 6:     update compression rate model and compression ratio model
 7: else
 8:     calculate the best compression algorithm using model from section 3.1.3
 9:     compress data with best algorithm
10: end if

4 Experiment

4.1 Methodology

In this thesis, we created an AC file system with the goal of improving I/O speeds.

To test this hypothesis we designed an experiment which compares our system with

several alternatives. The ZFS default is to not use compression, which we used as our baseline. The two most commonly used compression algorithms in ZFS are lz4 and gzip. We took these two algorithms and our baseline as our static compression options. Both our

AC system, LARS Adaptive File Compression System (LAFCS), and Niklas Bunge’s

AC system, which we will refer to as n1kl, were set up to pick from any of the static

options. Table 4.1 shows a summary of each system configuration tested and the names by which we will refer to them.

We used a single process to test each system's write speed for each dataset.

This process is outlined in Table 4.2 and discussed in detail below.

1) Mount the ZFS file system. We started each experiment by creating a

clean ZFS partition on a secondary disk and mounting it. After implementing the

lars system in ZFS, we added the n1kl system into the same source in order to have a single version of ZFS with all the systems we wanted to test. Therefore, regardless of which system was being tested, we mounted a ZFS partition using the same kernel module. This allowed us to not have to unload and load a kernel module for each experiment.

Table 4.1: Systems

Name       System   Compression Algorithm
baseline   zfs      off
zfs-lz4    zfs      lz4
zfs-gzip   zfs      gzip-1
n1kl       n1kl     {off, lz4, gzip-1}
lars       LAFCS    {off, lz4, gzip-1}

Table 4.2: Experiment procedure

Step   Description
1      Mount ZFS file system
2      Train model
3      Load dataset into RAM
4      Start timer
5      Write dataset
6      Stop timer when dataset is finished writing to disk

2) Train the model. This step was only done if the lars system was being tested.

The lars system uses models to estimate values as described by section 3.1.4. These models must be trained before the system can reasonably predict values needed for

choosing a compression algorithm. During this step we wrote a training set to the

ZFS partition several times to ensure the model was filled. The training set consisted

of data similar to the data found in the test datasets, which will be introduced in section 4.3. Specifically, we used a compression corpus and several files not found in

the test datasets. The corpus we selected was the Squash Compression Corpus. [11]

The other files were an unused part of the enwiki dataset and files created in a similar

manner to the random dataset and the pdf dataset.

3) Load the dataset into RAM. Before writing anything to the disk we loaded the dataset into RAM. We did this to minimize the overhead during the timed process of writing to disk. With the dataset in RAM, the overhead of reading the data is minimal since RAM has read speeds upwards of 20 GB/s which is orders of magnitude faster than disk write speeds.

29 4) Start the timer. We started a timer to track how long it takes to write the given dataset to the disk.

5) Write the dataset. For the datasets that only contained a single file, we simply opened a file on the ZFS partition and wrote the whole dataset into it. For datasets that contained many files, we used a pool of threads with a size equal to the number of CPU threads, which was created before the experiment started so as to prevent thread creation time from affecting the write time results. We chose

8 threads for our environment since our CPU had 4 cores, each with 2 threads. We used a pool of threads instead of a single thread to ensure that the CPU would not be the bottleneck, even though it is highly unlikely that it would be. It is important to note that this did not change how the blocks were compressed or written, as all the data written to the disk would enter a single write queue inside ZFS, just like when using a single thread to write a file. Each thread in the pool would open a file on the ZFS partition and write the contents of a single file in the dataset into it. When finished, the system would repeat until all the files in the dataset had been copied to the ZFS partition.

6) Stop timer when dataset is finished writing to disk. When a program writes data into a file it calls the write system call. The write call completes almost instantly; however, this does not mean the data has actually been written to the file system. This is a feature of the operating system (OS). It allows a program to move on to other calculations while the OS waits for the disk write to complete. In order to know when the data actually was done being written to the disk, we needed to wait until the data finished writing using a system call, fsync. The fsync system call allows a program to wait until a disk write is complete. We call fsync with the name of the file we wish to wait on. In the case of writing multiple files, we called fsync on each of the files written after all the threads finish writing. This ensures that the experiment

waits until all the files have been written. After the call is complete we stop the timer. We call the time elapsed between starting and stopping the timer the Total Write Time

(TWT), which is the output of our experiment.
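A sketch of this measurement for a single-file dataset is shown below; it writes the in-memory data and stops the clock only after fsync returns, so the elapsed time includes the actual disk write. The file path and the use of Python here are illustrative, not the exact benchmarking harness.

```python
import os
import time

def timed_write(path: str, data: bytes) -> float:
    """Write data to path and return the total write time (TWT) in seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # block until the file system has written the data to disk
    return time.monotonic() - start

# e.g. twt = timed_write("/tank/benchmark/enwiki", dataset_bytes)   # hypothetical path
```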

4.2 Environment

We used a system with a 2.7 GHz Intel Core i7 and 32GB of RAM. We set up

our system with two hard drives: one for the main operating system and one to use for benchmarking the ZFS file system. We did this to reduce background noise

in the experiments from other tasks writing to the main disk. The disk used for

benchmarking was a 5400rpm HDD running at slightly above 50 MB/s. The system

was running Ubuntu 16.04 with Linux kernel version 4.4.0-128.

4.3 Datasets

In order to gain a better understanding of how the systems compared to each

other we used six datasets. Table 4.3 lists the datasets along with their sizes. We used

a standard English dataset that is taken from the English Wikipedia called enwiki.

[10] This dataset is just a representation of the text part of the Wikipedia which is mostly English text and some markup. This means it does not contain the images,

HTML or other data found on the Wikipedia website. In order to know how the systems would react to incompressible data, we used a random corpus which was generated through the use of /dev/urandom on our test machine. We selected two files from the Silesia Corpus, mozilla and sao.[4] The mozilla file is a tarball of executables and the sao file contains a catalog in a binary format. Along with these three standard

compression corpora we created two datasets of our own. The first is a dataset of

pdf files, which are thesis dissertations containing a mix of English text, graphs and images. The second dataset contained CSS, HTML, JavaScript and image files from popular GitHub repositories containing web-based files for various frameworks. Per the above descriptions, each of these datasets contains a different data type with a different size, as shown by Table 4.3. The raw size is the original size of the file.

Table 4.3: Sizes of datasets

Dataset    Raw size (bytes)   LZ4 size (bytes)   Gzip-1 size (bytes)   LZ4 ratio   Gzip-1 ratio
enwiki     100,000,000        57,285,990         42,265,725            1.746       2.366
mozilla    51,220,480         26,441,791         20,686,601            1.937       2.476
sao        7,251,944          6,790,464          5,568,179             1.068       1.302
pdf        6,471,787          5,404,086          5,127,285             1.198       1.262
web        14,428,160         5,348,363          4,396,610             2.698       3.282
random     100,000,000        100,000,111        100,016,915           0.999999    0.999

Note that these numbers are per file. For per-block numbers, see Figures 4.1 through 4.6.

The lz4 and gzip-1 sizes are the sizes after compressing the file as a whole. However, this does not reflect the size after compressing through the ZFS pipeline, since by the time the compression step starts, the file has already been split into blocks. We kept the default block size of 128KB. Figures 4.1 through 4.6 show the compression ratio per block for each dataset after being split into 128KB chunks and compressed using gzip-1. Note that the results shown in these graphs are measuring space, so they are valid, both accurate and precise, since gzip-1 is a deterministic algorithm. An important edge case to mention is that the last block contains a number of bytes equal to the remainder of dividing the dataset size by 128KB. This can result in a small block which may have a different compression ratio. An example is the random dataset's last block, which is 120KB. Since the overhead of gzip is constant in this case and the data is completely incompressible, the 128KB blocks have a “less worse” ratio than the last block.

32 We chose our datasets to get a variety of real world data types. This is clearly visible in the figures showing the compression ratio per block. Figure 4.1 shows

how enwiki is fairly compressible at an average compression ratio of 2.3 and has low

variance with only a couple spikes over the almost 800 blocks. This is exactly what

we would expect from data containing natural language like English. Since the data is mostly uniformly compressible we would expect an AC system to pick a single

compression algorithm for every block. If we knew beforehand that the file system

would only be handling files similar to enwiki, it would be best to simply select a single

algorithm like gzip-1.

The mozilla dataset, shown in figure 4.2, exhibits a large amount of variance. Something to note about this dataset is that executables contain different sections where different types of data are stored. This results in pockets of similar compressibility. Two examples are blocks 8 to 31, which are almost incompressible with an average compression ratio of 1.15, and blocks 160 to 180, which have a compression ratio of 4.25. This means that an AC system has the potential to save time and energy by not compressing blocks 8 to 31 and using a strong compressor, one that takes more time and gets a better compression ratio, on blocks 160 to 180. However, an AC system that does not take a compressibility measure as an input, such as Niklas Bunge's system, will miss this type of opportunity since the ranges are so small.

The binary catalog format that sao contains is only slightly compressible and very uniform, as figure 4.3 shows. The graph shows that when using gzip-1 the average compression ratio is 1.3; what the graph does not show is that the average compression ratio using lz4 is only 1.06. This means that, unlike the other datasets, this dataset is next to incompressible with lz4 but is compressible with gzip-1. A good AC system would choose either to not compress the data

or to compress it with gzip-1, depending on the time available.

Figure 4.1: Compression Ratio per block using gzip-1 for dataset enwiki

The pdf dataset shows that some files contain data with highly varying compressibility. Similar to the mozilla dataset, there are different sections in a pdf. Blocks 38 to 42 have a compression ratio of 6.1, while blocks 5 to 36 have a compression ratio of 1.2. We believe this is due to how the pdf format stores images: the images are stored in a compressed format and therefore are not compressible, while all the text is stored uncompressed. This results in the graph we see in figure 4.4.

The web dataset is unique since it is the only dataset we used that contains more than one file. The web dataset contains 1,099 files. These files contain source code for websites, which is very compressible, as well as images, which are not compressible. This is why figure 4.5 shows the most variance. We believe that this dataset embodies a large part of the real-world data that gets written to disk, and a good AC system should perform well on it.

Figure 4.2: Compression Ratio per block using gzip-1 for dataset mozilla

Figure 4.3: Compression Ratio per block using gzip-1 for dataset sao

Figure 4.4: Compression Ratio per block using gzip-1 for dataset pdf

Figure 4.5: Compression Ratio per block using gzip-1 for dataset web

Figure 4.6: Compression Ratio per block using gzip-1 for dataset random

5 Results

5.1 Raw Results

Figures 5.1 through 5.6 show the raw timing results from running our total write time (TWT) experiment using each system and each dataset 30 times. The orange line marks the mean time for each system. The box shows the upper and lower quartiles of the results, defined as the 75th and 25th percentiles and denoted Q3 and Q1 respectively. Outliers are represented using a circle and were identified as any result that fell farther than 1.5 × IQR from the mean, where IQR = Q3 − Q1 is the interquartile range.
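A minimal sketch of this outlier rule in Python, assuming the 30 timing results for one system and dataset are available as a list of floats (this mirrors the rule described above rather than the exact analysis code):

    import statistics

    def remove_outliers(times):
        """Drop results farther than 1.5 * IQR from the mean, as described above."""
        # quantiles(n=4) returns the three quartile cut points Q1, median, Q3.
        # Its default method may differ slightly from the plotting library's
        # percentile definition.
        q1, _, q3 = statistics.quantiles(times, n=4)
        iqr = q3 - q1
        mean = statistics.mean(times)
        return [t for t in times if abs(t - mean) <= 1.5 * iqr]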

Figure 5.1: Total Writing Time for each system to write enwiki

Figure 5.2: Total Writing Time for each system to write mozilla

Figure 5.3: Total Writing Time for each system to write sao

Figure 5.4: Total Writing Time for each system to write pdf

Figure 5.5: Total Writing Time for each system to write web

Figure 5.6: Total Writing Time for each system to write random

5.2 Result Analysis

5.2.1 Write Speed

After inspecting the distributions of the timing results we concluded that all the

results were normally distributed. Figures 5.7 through 5.12 show the speed of writing

each dataset for the above results after removing the outliers. The bars represent the

confidence intervals for each system’s results with a 99.7% confidence level. The first thing to point out about these results is that our system, lars, is working.

There is no dataset where any system, including static compression, is statistically

faster than lars, although zfs-lz4 comes close on the random dataset. In any case,

the results suggest that using our system will not result in worse write performance compared to any of the other systems.
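For reference, a minimal sketch of how the 99.7% confidence intervals behind Figures 5.7 through 5.12 can be approximated, assuming a normal model for the outlier-free times and using three standard errors of the mean for the 99.7% level (the exact interval construction used for the figures may differ):

    import statistics

    def speed_confidence_interval(times_sec, dataset_bytes, z=3.0):
        """Approximate 99.7% CI for the mean write speed, in MB/s.

        times_sec: outlier-free total write times for one system/dataset pair.
        z=3.0 corresponds to roughly a 99.7% confidence level under a normal model.
        """
        speeds = [dataset_bytes / t / 1e6 for t in times_sec]  # MB/s for each run
        mean = statistics.mean(speeds)
        sem = statistics.stdev(speeds) / len(speeds) ** 0.5    # standard error of the mean
        return mean - z * sem, mean + z * sem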

Under the motivation for creating an AC system, we stated that the optimal

compression algorithm varies depending on the type of data. This can be seen in the

results. For the sao dataset, the optimal compression algorithm is gzip, as seen from

figure 5.9. On the other hand, lz4 is much better for the random dataset, as shown in figure 5.12. This shows that selecting a single compression algorithm is a naïve approach that

can be improved upon.

The lars and n1kl systems perform comparably for the enwiki and pdf datasets (figures 5.7 and 5.10). The author of the n1kl system used the enwiki dataset in his testing, which showed favorable results. However, this dataset is an "easy" dataset because of its uniform compressibility, as discussed in section 4.3, which means that even a naïve AC system, such as a simple on/off system, can be expected to perform well on this dataset.

The results from the pdf dataset, figure 5.10, show that lars and n1kl perform similarly; however, lz4's and gzip-1's means are higher. Given the confidence intervals, we are unsure whether lz4's and gzip-1's higher means are meaningful. All of the systems achieve speeds much lower than the disk bandwidth of 50 MB/s, including the baseline. The large variance in the speeds for the pdf dataset, and the fact that the baseline's speed is not near the expected 50 MB/s, lead us to believe that the pdf dataset is too small to obtain precise write time values using our current experiment. We have decided not to draw any conclusions from this dataset.

While lars and n1kl perform comparably on the enwiki dataset, the mozilla, sao and web datasets, shown in figures 5.8, 5.9 and 5.11, show that n1kl and lars perform statistically differently. We would like an AC system to perform close to, or even the same as, the best static compression algorithm for any dataset. N1kl does not, since it has speeds that are statistically the same as the baseline for these datasets, while lars has speeds similar to the fastest-performing static compression algorithm. Exactly how lars performs better is discussed in more depth in section 5.2.2.

We used the random dataset to get an idea of the overhead involved in each AC system. The results are shown in figure 5.12. Since there are no gains from using compression, the overhead can be seen in the decrease in speed. First, we see that n1kl and lars are not statistically different; however, we can see how the mean changes between them. The mean write speed of the baseline is 47 MB/s. N1kl has a mean of 45 MB/s, which is slower by 2 MB/s, while lars has a mean of 44.2 MB/s, which is 2.8 MB/s less than the baseline. We see that the lars system has a greater overhead, which is due to calculating the bytecount of each block. This is one of the downsides of the lars system, and improvements could be made by using a faster compressibility estimator, computing a bytecount from smaller samples instead of from the whole block, or some other cost-reducing technique. Figure 5.12 also led us to think about why zfs-lz4 is almost as fast as the baseline. Lz4 has a compression speed of around 500 MB/s on our testing system, which means that the lz4 compressor was fast enough to keep up with the disk. However, as figure 5.9 shows, lz4 sometimes has unsatisfying performance with certain data types. Therefore, it is better to use an AC system, despite its overhead, instead of using lz4 on all blocks.
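To make the per-block estimation cost concrete, the sketch below computes a generic entropy-style estimate over byte frequencies. It is only an illustrative stand-in for the bytecount statistic defined earlier in this thesis, not our in-kernel implementation, and the sample_every parameter is a hypothetical knob included to illustrate the sampling idea mentioned above:

    import math

    def byte_entropy(block, sample_every=1):
        """Entropy-like compressibility estimate (bits per byte) from byte frequencies.

        Illustrative stand-in for a bytecount-style estimator; sample_every > 1
        examines only a subset of the block, one possible cost-reducing technique.
        """
        data = block[::sample_every]
        if not data:
            return 0.0
        counts = [0] * 256
        for b in data:          # one pass over (a sample of) the block
            counts[b] += 1
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

Even a single pass like this over every 128 KB block adds work proportional to the amount of data written, which is consistent with the slightly larger overhead observed for lars on the random dataset.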

Figure 5.7: Average throughput for each system while writing enwiki

Figure 5.8: Average throughput for each system while writing mozilla

Figure 5.9: Average throughput for each system while writing sao

Figure 5.10: Average throughput for each system while writing pdf

Figure 5.11: Average throughput for each system while writing web

Figure 5.12: Average throughput for each system while writing random

5.2.2 Compression Choices

In addition to recording the speed of each system, we tracked the compression choices of the lars system and the n1kl system. We plotted the compression choices for three datasets that we found interesting. Figures 5.13 through 5.15 show the compression choices the AC systems make. The x-axis values represent each run of the experiment. The y-axis for the solid lines is the percentage of the blocks per run that were compressed with the algorithm the line represents. The TWT is plotted as a dotted line on top of the solid lines, allowing us to see how each run and the choices an AC system makes are linked to the TWT. The first thing to note about these graphs is that the choices are relatively consistent for each dataset across runs. This means that the AC systems are making consistent choices, which suggests our experimental procedure is robust.

There are two significant effects we would like to point out from figure 5.13. First,

there is a single outlier at run 14 for the lars system. This is likely due to some background process running on the machine during this run of the experiment. The interesting thing to see is that the spike in TWT happens at the same run in which the system makes a different compression choice. This leads us to believe that the system adapted to the interference, which is what a good AC system should do. Second, we can see the reason that the n1kl system is the fastest system to write the enwiki dataset. The n1kl system starts out by not compressing the data, which allows the disk to start writing immediately. After the write queue has filled up a little, the system starts compressing with lz4, which is faster than gzip-1, and finally compresses the rest of the blocks with gzip-1. This is a case where n1kl's design could be better than lars's.

As pointed out in section 4.3, the mozilla dataset has several sections, some of which are incompressible, while most are compressible. Figure 5.14 shows that the lars system uses gzip on 95% of the blocks and no compression on the other 5%. This leads us to believe that the lars system is exploiting the compressibility of the compressible blocks to decrease the write time. This is a case where the n1kl system does not perform well, because it assumes that all blocks in a file have, on average, the same compressibility. However, as seen in figure 4.2, the first blocks in the mozilla dataset have low compressibility on average, leading the n1kl system to conclude that the whole file is incompressible and therefore to compress none of the blocks.

The results shown in figure 5.15 are interesting because they show how each system handles small files. The n1kl system needs to "learn" about each file's compressibility before deciding whether to compress, so when it is given a dataset full of small files it never chooses to compress any blocks. The lars system, on the other hand, tracks its model per file system and uses a compressibility estimator, so it makes good compression choices even for small files. As shown in figure 5.15, the lars system compresses 95.2% of the blocks in the web dataset with gzip-1 and chooses to leave the other 4.8% uncompressed.

Figure 5.13: Compression choices per run made by n1kl and lars for enwiki

Figure 5.14: Compression choices per run made by n1kl and lars for mozilla

Figure 5.15: Compression choices per run made by n1kl and lars for web

6 Conclusions

In this thesis we presented a system that achieved our goal of improving disk write speeds using AC. We built upon related work to create our system. Compared to the latest AC system for ZFS, n1kl, our system handles more types of real-world data. Specifically, our system correctly handles small compressible files and files containing both compressible and incompressible blocks.

Our system does not assume a speed ordering of the compression algorithms. We believe this allows the system to be more general and better suited to a larger variety of data types. Our system also does not assume the compressibility of blocks based on spatial locality, but instead uses a compressibility estimator on every block. This adds to the overhead of our system but also allows good compression choices for small files. All in all, our system is better equipped to handle real-world data than the preceding AC system for ZFS, and we believe our results show that it is currently the best-performing compression option for ZFS among those we tested.

6.1 Future work

Before we discuss the future work that could be done to improve on this thesis, we would like to note that our implementation is openly available on GitHub (https://github.com/derpferd/zfs/tree/lars).

Our results showed that our system performed better than all other methods tested. Even with this positive result, there are several questions that were raised


through the process of creating and testing our system that we believe are worth answering. Most importantly, how would our system perform in a real-world test?

We think creating an experiment that more closely reflects real-world conditions would help shed light on how future systems can do better. In order to better understand how our system would perform in the real world, another variable should be added to our current experiment: test systems with varying amounts of computational power and disk speed. This would allow us to see which systems would benefit from AC, and possibly allow us to improve our system to be even more generalized. It would also be helpful to create datasets of more realistic data, e.g., by tracing what happens on a user's file system as they use their system. We could also trace file system events on a server. With more variety of data, we believe that an AC system would benefit from having access to more compression algorithms.

Another important experiment is to test read speeds as well. With our current experiment we only test write speeds; however, we know that read speeds are affected by our AC system. We believe that our system should improve read speeds as well: since our system speeds up writes by compressing data, and decompression is generally faster than compression, read speeds should also be faster. In any case, actual read speed results are important to analyze before recommending use of this system. Our system also focuses only on single-disk systems. It would be interesting to see whether systems with multiple disks or disks in RAID receive similar benefits from AC systems.

The random dataset allowed us to see that our system has more overhead than n1kl's system. We believe that our system's overhead could be reduced by looking at compressibility estimators other than bytecounting, or by using samples of the data instead of the whole block to compute the bytecount.

References

[1] L. S. Bai, H. Lekatsas, and R. P. Dick. "Adaptive filesystem compression for embedded systems". In: 2008 Design, Automation and Test in Europe. IEEE, 2008, pp. 1374–1377 (cit. on p. 8).

[2] N. Bunge. "Quality of service improvement in ZFS through compression". MA thesis. Universität Hamburg, 2017 (cit. on pp. 3, 15, 20).

[3] M. Burrows, C. Jerian, B. Lampson, and T. Mann. "On-line Data Compression in a Log-structured File System". In: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS V. Boston, Massachusetts, USA: ACM, 1992, pp. 2–9. isbn: 0-89791-534-8. doi: 10.1145/143365.143376. url: http://doi.acm.org/10.1145/143365.143376 (cit. on p. 11).

[4] S. Deorowicz. Silesia compression corpus. url: http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia (visited on 06/01/2018) (cit. on p. 31).

[5] F. Ehmke. "Adaptive Compression for the Zettabyte File System". MA thesis. Universität Hamburg, 2015 (cit. on pp. 2, 14).

[6] D. A. Huffman. "A method for the construction of minimum-redundancy codes". In: Proceedings of the IRE 40.9 (1952), pp. 1098–1101 (cit. on p. 7).

[7] S. Kiselkov. 6400 ZFS smart compression. Oct. 2015. url: https://reviews.csiden.org/r/266/ (cit. on pp. 2, 14).

[8] S. Kiselkov. "ZFS (Smart?) Compression Types of Compression Algos". url: http://open-zfs.org/w/images/4/4d/Compression-Saso_Kiselkov.pdf (cit. on pp. 8, 14).

[9] B. Leonard. ZFS Compression - A Win-Win. Apr. 2009. url: https://blogs.oracle.com/observatory/entry/zfs_compression_a_win_win (cit. on p. 13).

[10] M. Mahoney. About the Test Data. Sept. 2011. url: http://mattmahoney.net/dc/textdata.html (visited on 06/01/2018) (cit. on p. 31).

[11] E. Nemerson. Squash Compression Corpus. url: https://github.com/nemequ/squash-corpus (visited on 06/01/2018) (cit. on p. 29).

[12] OpenVPN Man page. url: https://openvpn.net/index.php/open-source/documentation/manuals/65-openvpn-20x-manpage.html (cit. on pp. 2, 8).

[13] P. A. H. Peterson and P. L. Reiher. "Datacomp: Locally Independent Adaptive Compression for Real-World Systems". In: 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS). June 2016, pp. 211–220. doi: 10.1109/ICDCS.2016.106 (cit. on pp. 4, 9, 15).

[14] T. Raita. "An Automatic System for File Compression". In: The Computer Journal 30.1 (1987), pp. 80–86. doi: 10.1093/comjnl/30.1.80. url: http://comjnl.oxfordjournals.org/content/30/1/80.abstract (cit. on p. 10).

[15] B. Rockwood. Understanding ZFS: Compression. Nov. 2008. url: http://cuddletech.com/?p=473 (cit. on p. 13).

[16] O. Rodeh and A. Teperman. "zFS - A scalable distributed file system using object disks". In: Proceedings - 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, MSST 2003 (2003), pp. 207–218. issn: 1051-9173. doi: 10.1109/MASS.2003.1194858. url: http://ieeexplore.ieee.org/document/1194858/ (cit. on p. 11).

[17] D. Salomon. Data Compression: The Complete Reference. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2007. isbn: 1846286026 (cit. on pp. 6, 7).

[18] J. Ziv and A. Lempel. "A universal algorithm for sequential data compression". In: IEEE Transactions on Information Theory 23.3 (May 1977), pp. 337–343. issn: 0018-9448. doi: 10.1109/TIT.1977.1055714 (cit. on p. 7).

[19] J. Ziv and A. Lempel. "Compression of individual sequences via variable-rate coding". In: IEEE Transactions on Information Theory 24.5 (Sept. 1978), pp. 530–536. issn: 0018-9448. doi: 10.1109/TIT.1978.1055934 (cit. on p. 7).