Exploring the Capacity-Increasing Potential of an OS-Invisible Hardware Compression Accelerator

Kimberly Lu, Kyle Kovacs
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{kimberlylu, kylekovacs}@berkeley.edu

Abstract—As more and more data are created, considerations of SSD wear-out and storage capacity limitations pose major problems for system designers and consumers alike. Data compression can alleviate both of these problems by reducing physical I/O cycles on disks and increasing effective disk capacity. A hardware-accelerated, transparent block-level compression layer slotted between the file-system and disk controller layers appears promising in the face of these challenges. While integrated transparent compression design efforts are underway, existing research relies on either modifying the file-system or running compression algorithms in software alone. This work presents an OS-invisible hardware accelerator that is capable of compressing data, thereby saving disk write cycles and capacity, all with minimal additional cost to system architects.

I. INTRODUCTION

With the advent of big data and the internet of things (IoT), storage space is in greater demand than ever. Storage cost per GB is declining, but the increased demand for storage is far outstripping this decline; the internet alone generates 2.5 quintillion bytes of data per day [1]. As a result, it is necessary to find ways to use storage space more efficiently.

Data compression offers a way to improve the efficiency of storage space by effectively storing more data per physical block at the cost of greater response time and CPU usage. In addition, as SSD storage grows more prevalent, the problem of SSD wear-out has become a significant issue. SSD wear-out can also be slowed by reducing the amount of data being written. In the past, compression has typically been done on the file level, either by user applications or by file-systems such as NTFS or ZFS. While file-level compression does save space, it has several drawbacks.

First, in cases where most files are very small, file-level compression is not very effective and introduces significant overhead. On the flip side, very large files can be prohibitively expensive to compress depending on the compression algorithm due to the amount of RAM needed. Second, file-level compression requires support from the file-system to keep track of the compressed file's blocks, compress files on writes, and decompress on reads, which restricts the types of file-systems that can be used. Adding support for file-level compression also increases the complexity of the file-system. Moving data compression to the block level helps solve these issues.

Block-level compression occurs on a layer between the file-system and the disk, essentially intercepting block-level writes and reads on the I/O path and performing compression or decompression on the data. Since block-level compression is below the file-system, the file-system need not be aware that compression is happening at all. This type of compression is called "transparent" or "OS-invisible" compression because it is invisible to the rest of the system. This means that block-level compression can be flexibly combined with all kinds of file-systems. Additionally, block-level compression always operates on single data blocks, eliminating both the small file and large file issues of file-level compression.

Implementing block-level compression comes with its own set of challenges. Compression algorithms do not uniformly compress data, so compressed blocks can be of variable size. Let these compressed blocks be referred to as logical blocks, and the physical storage blocks be referred to as physical blocks. One physical block may contain multiple logical blocks, meaning additional overhead and metadata is needed to keep track of these variable-sized logical blocks. This also means that one must read a physical block before writing to it in order to preserve data already on the physical block. Not only that, but the unpredictable size of logical blocks can lead to fragmentation as compressed blocks are modified or overwritten. Dealing with these issues can add increased I/O latency and CPU usage for every disk access.

In this work, we present an implementation of block-level compression using a hardware accelerator, which transparently performs compression and decompression on data blocks. The hardware component is inserted along the I/O path, below the file-system and above the disk. A device driver would intercept block writes and reads and forward the requests to the hardware component, which transparently compresses writes before writing them to the disk, and transparently decompresses blocks on reads. The hardware component keeps track of the location of logical blocks, meaning that the file-system need not be aware that any of this is happening. Utilizing a hardware accelerator moves the cost of block-level compression overhead out of the host machine and onto the separate hardware component. By doing this, it is possible to hide some of the costs of block-level compression. Since the hardware accelerator handles compression, decompression, and keeping track of logical blocks, the CPU cost for disk accesses does not noticeably increase on the host machine. Disk latency is already high, so an incremental increase does not have a huge impact on overall system performance.

Figure 1: The ideal setup of our hardware accelerator attached as an intermediate layer between the operating system and the disk. Notice that the OS and disk are interchangeable, and the accelerator need not have any knowledge of either.

In addition, the increased I/O latency due to compression can potentially be hidden by the hardware accelerator during concurrent I/O operations.

The rest of this paper is organized as follows: Section II explains the details of our implementation and testing strategies. Section III discusses how our accelerator performs and analyzes the results. In Section IV we summarize current research in this area, and talk about the contrast between our work and other methods. We ponder what we could have done differently and how it would have impacted our results in Section V, and finally, Section VI concludes the paper.

II. DESIGN AND IMPLEMENTATION

To implement transparent block-level compression we designed a hardware accelerator that is placed along the I/O path between the file-system and disk. This hardware component communicates with the host machine through a block device driver. The driver intercepts block I/O requests to the kernel and forwards them to the hardware accelerator, which then handles the request. To handle block writes, the accelerator compresses the data, writes the compressed data to physical storage, and updates the metadata of the logical block. To handle block reads, the accelerator looks up the logical block's location, decompresses the data, and returns the data to the requester. We designed the hardware accelerator in Chisel [2], a Scala-embedded language for parameterized hardware generators. The testing infrastructure is built on top of the testers2 [3] library.

A. Compression Algorithm

The compression algorithm used is a simple differential + run-length encoding scheme. Both differential encoding and run-length encoding have completely reversible analogous operations, meaning that the compression is lossless. Using a lossless algorithm is particularly important for our implementation as all reads and writes in the system will undergo compression. This lossless algorithm was chosen for its simplicity and for ease of implementation.

1) Differential Coding: This technique is often used as a pre-compression pass over data to prepare it for one or more compression passes. Input sequences are transformed bytewise to an initial value followed by a series of differences. In the example shown in Figure 2, the sequence [30,31,35,32] is differentially encoded to [30,1,4,-3] (start with 30, add 1 to get to 31, add 4 to get to 35, and subtract 3 to get to 32). It is easy to see how the decode operation works in reverse to recover the original sequence. Note that differential coding does not change the length of an input sequence.

Figure 2: A sequence of bytes can be differential-encoded (or decoded) according to this diagram. The result is an initial value and a series of differences. The reverse transformation simply takes the initial value and adds all the successive values cumulatively to recover the original sequence. The bottom portion of the figure shows a sequence of repeating bytes resulting in a string of zeros, which is important for run-length encoding to work.

This has a couple of functions. First of all, it can reduce the total number of unique symbols required to represent a sequence of bytes. This can make dictionary or sliding-window compression passes more space efficient, but we do not utilize this property here. More importantly, differential encoding will turn any sequence of repeated bytes into a sequence of zeros. Because run-length encoding operates on sequences of zeros, differential encoding is necessary to allow run-length encoding to provide good compression.

2) Run-Length Coding: Run-length encoding compresses "runs" of zeros into a single zero followed by a number of additional zeros. As shown in Figure 3, the sequence [30,0,0,0,0,0,25,26] encodes to [30,0,4,25,26] (start with [30], change the sequence of five zeros to [0,4], and then end with [25,26]). The decode operation is clear to see: the number immediately after any zero is treated as a number of zeros to insert.

Figure 3: Run-length encoding changes sequences of repeated zeros into a single zero and a number of additional zeros. All sequences of up to 256 zeros are compressed into just 2 bytes.
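To make the two passes concrete, the following C sketch (our illustration, not the authors' Chisel implementation) mirrors the behavior described above; the function names and the wrap-around treatment of negative differences (the -3 above becomes 253 modulo 256) are our assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Differential encode: keep the first byte, then emit byte-wise
 * differences (wrapping modulo 256). Output length == input length. */
static void diff_encode(const uint8_t *in, uint8_t *out, size_t n) {
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(in[i] - prev);
        prev = in[i];
    }
}

/* Differential decode: cumulative sum of the differences. */
static void diff_decode(const uint8_t *in, uint8_t *out, size_t n) {
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (uint8_t)(prev + in[i]);
        out[i] = prev;
    }
}

/* Run-length encode: a run of k zeros (1 <= k <= 256) becomes the pair
 * [0, k-1]; non-zero bytes pass through. Returns the output length,
 * which can exceed n when the input contains many singleton zeros. */
static size_t rle_encode(const uint8_t *in, uint8_t *out, size_t n) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] != 0) { out[o++] = in[i++]; continue; }
        size_t run = 0;
        while (i + run < n && in[i + run] == 0 && run < 256) run++;
        out[o++] = 0;
        out[o++] = (uint8_t)(run - 1);   /* number of additional zeros */
        i += run;
    }
    return o;
}

/* Run-length decode: a zero followed by k expands to k+1 zeros.
 * Assumes well-formed input produced by rle_encode. */
static size_t rle_decode(const uint8_t *in, uint8_t *out, size_t n) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] != 0) { out[o++] = in[i++]; continue; }
        size_t zeros = (size_t)in[i + 1] + 1;
        for (size_t z = 0; z < zeros; z++) out[o++] = 0;
        i += 2;
    }
    return o;
}

int main(void) {
    /* Compression is diff_encode followed by rle_encode, as in Figures 2-3. */
    uint8_t block[] = {30, 31, 35, 32, 32, 32, 32, 32, 25, 26};
    uint8_t diff[sizeof block], comp[2 * sizeof block];
    diff_encode(block, diff, sizeof block);
    size_t clen = rle_encode(diff, comp, sizeof block);
    printf("compressed %zu -> %zu bytes\n", sizeof block, clen);
    return 0;
}
```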

Differential + run-length encoding provides good compression on the right kind of data. Any data that has large sequences of repeated values will be compressed very nicely, but random sequences or sequences that repeat at granularities larger than the byte level are not compressible. Furthermore, a bad sequence can sometimes lead to data expansion, rather than compression. This arises due to the fact that a single zero in the input sequence encodes to two zeros in the output (a zero with zero additional zeros). For example, as depicted in Figure 4, the sequence [4,0,0,0,0,3,0,2,0,7,1,0] (length = 12) encodes to [4,0,3,3,0,0,2,0,0,7,1,0,0] (length = 13). Even though a sub-sequence of four repeated zeros got compressed to just two bytes, the singleton zeros were expanded and the overall sequence became longer.

Figure 4: Sometimes this compression scheme can lead to an increase in data size. In this diagram, the sequence got 2 bytes shorter from a run of zeros, but it also got 3 bytes longer due to singleton zeros. See Section III for more detail about how this drawback is affected by different types of data, i.e. sparse versus random.

In the hardware compression block, differential encoding and decoding is done in chunks according to the size of the data bus. Run-length encoding is done one byte at a time. The two functions are implemented as separate modules, and the compressor is written as a composition of the two. This allows for more extensibility and flexibility. For example, another compression scheme that uses differential encoding as a pre-compression pass would not have to re-implement it. Instead, the differential block can be slotted in at the beginning of the compression and at the end of the decompression to form the whole operation.

B. Hardware Architecture

The compression hardware is set up on a unidirectional bus as two separate modules: a compressor and a decompressor. On the write path, data from the CPU is compressed, and on the read path, data from the disk is decompressed.

Figure 5: The read and write sides of the compressor are two separate blocks with two separate functions each. Note that the differential block does not have to modify the header, whereas the run-length coder does. Each of these modules is written as a parameterized Chisel generator.

Transactions on the bus are composed of header beats and data beats. Each transaction is made up of one header beat and one or more data beats. The number of data beats in the transaction is kept track of in the header. This way, the hardware block knows which incoming data beats correspond to which transaction. To process a transaction, the module accepts one header beat, accepts as many data beats as the header beat specified, operates on the data, sends an output header beat, and sends the resulting data in the form of more data beats. The output header specifies the final length of the data. In this way, we deal with the length changes before and after compression and decompression. See Figure 5 for a visual representation of the compressor and decompressor and how they relate to the bus.

C. Hardware Verification

To verify the hardware, we use a form of transaction-level modelling (Figure 6). A high-level transaction just specifies some bytes to process. High-level transactions can be decomposed into low-level transactions. In our bus, a low-level transaction is either a data beat or a header beat. So each high-level transaction is composed of one low-level header beat transaction and one or more low-level data beat transactions. High-level transactions have no notion of beats.

The reason we set it up this way is so that we can use a common stimulus to drive both abstract software models and physical hardware blocks. Software models have no information about the bus on which the hardware operates, so they cannot operate on low-level transactions. Hardware modules, on the other hand, are bound by the physical world and so cannot operate on such a high level. A bus that is hundreds of wires wide would pose a serious problem in actual hardware for power and routing reasons. See Figure 7 for more details about the high- and low-level transactions. Transaction-level modelling provides a powerful abstraction on which to test the hardware under design.

Figure 6: Transaction-level modelling. The red block shows a high-level transaction. On the left is a software model, and on the far right is a hardware module. The high-level transaction is able to drive both software and hardware, and then check to ensure that the results from both are identical. The blue and green blocks show how low-level transactions are constructed from high-level ones in order to drive the RTL.

Figure 7: The software and hardware modules are driven in different ways, but both through decoupled interfaces and both through the same high-level abstraction. The gray oval represents a check to make sure the output from the hardware matches that from the software model.
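As a rough software model of this framing, the sketch below decomposes a high-level transaction (just a buffer of bytes) into one header beat plus data beats. The field names, 8-byte beat width, and header layout are assumptions for illustration, not the actual Chisel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shapes for the two beat types: one header beat announcing
 * how many data beats follow, then that many data beats. */
typedef struct {
    uint16_t num_data_beats;   /* data beats belonging to this transaction */
    uint16_t payload_bytes;    /* length of the (un)compressed payload */
} header_beat_t;

typedef struct {
    uint8_t bytes[8];          /* one bus-width worth of payload */
} data_beat_t;

/* Decompose a high-level transaction into low-level beats, as a software
 * model of the bus would. Returns the number of data beats produced. */
static size_t to_low_level(const uint8_t *payload, size_t len,
                           header_beat_t *hdr, data_beat_t *beats) {
    size_t width = sizeof beats->bytes;
    size_t n = (len + width - 1) / width;
    hdr->num_data_beats = (uint16_t)n;
    hdr->payload_bytes  = (uint16_t)len;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < width; j++) {
            size_t k = i * width + j;
            beats[i].bytes[j] = (k < len) ? payload[k] : 0;  /* zero-pad last beat */
        }
    return n;
}
```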
We were able to use the RocketChip [4] SoC generator project template [5] to attach our block as a memory-mapped peripheral. This is detailed in Figure 8. By connecting the accelerator to a TileLink crossbar via AXI4-Stream nodes, we are able to memory-map the device and generate a Verilator simulator. Arbitrary C code can be compiled for RISC-V and run on the Rocket core (through Verilator), which can in turn utilize the accelerator via MMIO.

Figure 8: The accelerator attached to RocketChip as an MMIO peripheral. Code compiled for RISC-V can be run on the Rocket core and can utilize our accelerator.
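From the perspective of C code running on the Rocket core, driving the accelerator would look roughly like the sketch below; the base address, register offsets, and polling protocol are hypothetical placeholders, not the block's real memory map.

```c
#include <stdint.h>

/* Assumed MMIO layout for illustration only. */
#define ACCEL_BASE      0x2000UL
#define ACCEL_HEADER    (ACCEL_BASE + 0x00)
#define ACCEL_DATA      (ACCEL_BASE + 0x08)
#define ACCEL_STATUS    (ACCEL_BASE + 0x10)
#define ACCEL_RESULT    (ACCEL_BASE + 0x18)

static inline void mmio_write64(uintptr_t addr, uint64_t v) {
    *(volatile uint64_t *)addr = v;
}
static inline uint64_t mmio_read64(uintptr_t addr) {
    return *(volatile uint64_t *)addr;
}

/* Send one transaction: a header giving the beat count, then the data
 * beats, then poll until the compressed result header is available. */
static void compress_block(const uint64_t *beats, uint64_t num_beats) {
    mmio_write64(ACCEL_HEADER, num_beats);
    for (uint64_t i = 0; i < num_beats; i++)
        mmio_write64(ACCEL_DATA, beats[i]);
    while (mmio_read64(ACCEL_STATUS) == 0)
        ;                                /* busy-wait for completion */
    (void)mmio_read64(ACCEL_RESULT);     /* output header: compressed length */
}
```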
D. Logical and Physical Block Mapping

The compression algorithm described above results in variable-sized compressed blocks that we call logical blocks. Since logical blocks may be smaller than a physical data block, there can be one or more logical blocks in a single physical block. As a result, additional metadata is needed to keep track of the location of logical blocks so that they can be retrieved and decompressed when a file-system requests them. Our current prototype has not yet fully implemented logical block mapping. In this section we describe the method we will use to keep track of logical blocks and how we will deal with file modifications. Yim, Bahn, and Koh [6] and Zuck et al. [7] describe various packing methods for handling the changing size of data after compression, but none of their schemes are exactly right for our purposes, so we introduce our own.

Let us first define the physical segment, which is a fixed-size unit of physical storage whose size is some multiple of the block size of the underlying storage. The actual size of the physical segment can be arbitrarily chosen without affecting how logical block mapping works, so let it be any arbitrary multiple of the underlying block size. When a logical block of data is compressed and stored, the hardware component finds a free physical segment to write it to. Pointers to free physical segments are kept in a free segment list. We define a free physical segment to have at least one physical data block of contiguous free space. Information about the logical block's location is then written to a translation table that is stored persistently on a special region of the disk so that it can be found and decompressed when requested later. The translation table stores information about the physical segment that contains each logical block and the index within the physical segment that the logical block occupies. This index number is necessary since the physical segment may contain more than one logical block.

Physical segments themselves consist of a one-way linked list of logical blocks, so free space in a physical segment is contiguous and at the end of the segment. When retrieving a logical block, the entire physical segment that contains it is read. Therefore, given the physical segment a logical block resides in and its index, the logical block can be retrieved by traversing the linked list. A linked list was chosen to allow logical blocks to be easily added and removed at any location in the physical segment. As long as the physical segment size does not grow too large, the linked list will be fairly short and so the time taken to traverse the linked list will be negligible. Additionally, we chose a one-way linked list to minimize the amount of additional metadata stored on the physical segment; only a single "next" pointer is required for each logical block. This simple linked list structure ensures we do not lose the benefits of compression by storing a lot of metadata in the physical segment. The structure of the translation table and physical segment is shown in Figure 9.

Figure 9: Each translation table entry maps a logical block address to a physical segment ID and a block index, and each logical data block in a physical segment carries a next pointer and a valid bit. To retrieve logical data blocks, the hardware component intercepts a block read request and looks up the logical block's address in the translation table. It uses this to get the physical segment that contains the logical block and the block's index in the segment. It then reads in the physical segment and traverses the linked list of logical blocks to find the one requested.

Due to variable-sized logical blocks, in-place updates to data become problematic to implement. Since the updated logical block may be larger than the original, it may not be possible to store it in the same location; however, if the updated logical block is smaller or the same size, it can stay in the same spot. Attempting to perform in-place updates results in additional overhead to perform this size check and can lead to complex, unpredictable behaviors. Instead, we perform out-of-place updates by marking the old logical block as invalid and writing the update to a new location. The valid bit is stored in the physical segment linked list. This approach is simple and reduces overhead for updates but necessitates cleaning up old invalid logical blocks.

To avoid introducing overhead on every block write, we will not clean the invalid block during each update but instead have a cleaner process that periodically iterates through physical segments and deletes invalid logical blocks. The cleaner process also adds physical segments to the free segment list if enough space is freed on them. Since the cleaner process could potentially interfere with I/O operations, we will want to run it infrequently. It can be triggered at set time intervals as well as when the number of free physical segments drops below a certain threshold. Additional optimizations can be made, such as running the cleaner process late at night when there is little activity or restricting the cleaner process to only cleaning physical segments that have not been accessed for some time. Another feature which we do not plan to implement but can be added to the cleaner process is to periodically compact data by taking physical segments with little data on them and combining them together. For instance, three physical segments that are each 1/3 full can have their data all copied into an empty physical segment, and the original 3 segments will be re-added to the free segment list. This helps increase data locality and reduce wasted space.
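A minimal sketch of the metadata just described, with field names and widths that we have assumed (the paper does not fix a layout), might look as follows; the real structures would live on disk in the translation-table region and inside each physical segment, and whether an invalid block keeps its list position until the cleaner runs is likewise our assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One translation table entry: where a logical block lives. */
typedef struct {
    uint64_t physical_segment_id;  /* which segment holds the logical block */
    uint16_t block_index;          /* position within the segment's list */
} translation_entry_t;

/* Per-logical-block header stored inside a physical segment. */
typedef struct {
    uint16_t next_offset;          /* byte offset of next header, 0 = end of list */
    bool     valid;                /* cleared by out-of-place updates */
    uint16_t length;               /* compressed payload length in bytes */
} logical_block_hdr_t;

/* Walk the one-way linked list inside a segment buffer to find the
 * logical block at `index`. Returns NULL if the list is too short. */
static logical_block_hdr_t *find_logical_block(uint8_t *segment,
                                               uint16_t index) {
    logical_block_hdr_t *hdr = (logical_block_hdr_t *)segment;
    for (uint16_t i = 0; i < index; i++) {
        if (hdr->next_offset == 0)
            return NULL;
        hdr = (logical_block_hdr_t *)(segment + hdr->next_offset);
    }
    return hdr;
}

/* Out-of-place update: invalidate the old copy; the new copy is appended
 * to a free segment and the translation table entry is rewritten. */
static void invalidate(logical_block_hdr_t *old_block) {
    old_block->valid = false;
}
```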
We do not consider the operations of the disk controller, namely wear-levelling (even distribution among physical sectors that can prevent certain sectors from going bad due to excessive use), request re-ordering, and various performance-based allocation schemes. We assume no knowledge of the disk internals, including its controller, which allows our device to slot in with any type of disk, be it HDD, SSD, or some other type of persistent storage.

III. RESULTS AND DISCUSSION

To evaluate the performance of our prototype, we implemented the compression scheme described in Section II in software as well as hardware. We are able to compare the time it takes to do compression in software to the number of cycles needed to perform the same compression in hardware, which tells us what clock frequency the hardware would have to run at to achieve comparable performance. Additionally, the software implementation of our compression algorithm allowed us to compare compression times and compression factors to zlib, a more sophisticated and more widely-used compression scheme. All tests were run on a server with 56 Intel Xeon E5-2690 v4 processors running at 2.60 GHz with 264 GB RAM.

A. Hardware Analysis

We synthesized our hardware design using Hammer [8] and Vivado [9] targeting a Xilinx VC707 FPGA. The results are shown in Table I. Since the design does not use block RAMs, there are a lot of flip-flops. Reconfiguring the way that these internal buffers are configured could potentially improve the performance and FPGA mapping.

                     Compressor   Decompressor
Area (LUTs)          29161        29035
Area (FFs)           21403        21431
Critical path (ns)   3.51         5.13

Table I: Synthesis results for the hardware block. A critical path of 3.51 ns corresponds to a clock frequency of 285 MHz, and a critical path of 5.13 ns corresponds to a clock frequency of 195 MHz.

The long critical path of the decompressor comes from the fact that differential decoding requires a series of addition operations that rely on each other (each output byte cannot be determined until the previous output is known). Since the differential module encodes or decodes eight bytes at a time, there are eight arithmetic operations that need to be completed in one clock cycle, not to mention additional logic. Processing one byte per clock cycle could cut this critical path down, but at the cost of latency and greater engineering effort.

B. Compression Analysis

To test our compression scheme's performance, we generated two separate sets of test data. These workloads are
• a set of 100 files, each 512 bytes long, filled with random ASCII characters (called "ASCII data"), and
• a set of 100 files, each 512 bytes long, filled with 80% zeros and 20% ones (called "sparse data").

The ASCII data is classically incompressible, that is, on average there are no trends, sequences, or repeating data whatsoever.

The sparse data, on the other hand, has more strings of zeros and is therefore more compressible. As seen in Figure 10, both zlib [10] and our simple differential + run-length compression schemes compress the sparse data more than they compress the ASCII data. The results in Figure 10 are from software implementations of the compression algorithms.
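For concreteness, the two workloads could be generated with a short program along these lines; the file naming and the use of rand() are illustrative assumptions, not the authors' actual test harness.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_FILES 100
#define FILE_SIZE 512

int main(void) {
    char name[64];
    for (int f = 0; f < NUM_FILES; f++) {
        unsigned char ascii[FILE_SIZE], sparse[FILE_SIZE];
        for (int i = 0; i < FILE_SIZE; i++) {
            ascii[i]  = (unsigned char)(33 + rand() % 94);  /* printable ASCII */
            sparse[i] = (rand() % 100 < 80) ? 0 : 1;        /* 80% zeros, 20% ones */
        }
        snprintf(name, sizeof name, "ascii_%03d.bin", f);
        FILE *fa = fopen(name, "wb");
        if (fa) { fwrite(ascii, 1, FILE_SIZE, fa); fclose(fa); }
        snprintf(name, sizeof name, "sparse_%03d.bin", f);
        FILE *fs = fopen(name, "wb");
        if (fs) { fwrite(sparse, 1, FILE_SIZE, fs); fclose(fs); }
    }
    return 0;
}
```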

In terms of execution time, the zlib algorithm is far faster than our simple algorithm. Our algorithm performs arithmetic computation on every byte, whereas zlib uses a sliding-window approach. It is clear that zlib outperforms our simple compression scheme; however, this is not surprising. We chose to implement this simple differential + run-length compression scheme in our prototype as a proof-of-concept. In order to achieve more efficient and effective compression, we will later implement a more sophisticated algorithm like zlib in our hardware accelerator.

Figure 10: Performance comparison between our simple compression and zlib. We ran software implementations of our simple differential + run-length compression and zlib on the same set of sample data. One panel shows space saved (%) and the other shows compression time (µs), for the ASCII and sparse workloads under each compression algorithm.

Choosing an effective compression algorithm can depend greatly on the type of data being compressed. Kodituwakku et al. [11] compare the speed and effectiveness of various algorithms for compression of text data. A general survey of compression algorithms has also been conducted [12]. These previous works demonstrate that some compression algorithms are more effective than others for certain data types. In addition, there are tradeoffs between compression speed and effectiveness, where compression algorithms that compress data very compactly can take more time to run. Determining what compression algorithm to use requires careful analysis of the workload and determining the right balance between CPU/RAM usage, time, and compression factor.

Since our block-level compression scheme is used on all file writes and reads in a system, a potentially large variety of file types may be compressed. Our goal is to offer flexible compression that can be inserted into any system, so there is no way to know what kind of workload the algorithm will be used on. To deal with this, there are two possible approaches. First is to use an adaptive compression scheme which uses heuristics to choose between different compression algorithms depending on the data type being compressed. The selection of compression algorithms based on classification of data has been explored in Pensieve [13], which used machine learning to categorize data based on its contents to utilize compression dictionaries more efficiently.
However, Pensieve is implemented as a firmware layer, which could be improved upon with dedicated custom hardware. Furthermore, such an adaptive scheme would add additional overhead and require multiple compression algorithms to be implemented in hardware, increasing hardware complexity. We prefer the other option, which is to choose a compression scheme that performs well over a broad category of data types. Strong candidates for future compression schemes are zlib and LZMA [14], the compression algorithm used in 7zip. These algorithms achieve reasonable compression factors and times over many data types.

C. Hardware Compression

We ran the ASCII data through the hardware compressor and decompressor to test the hardware's performance (Table II). On average, the software implementation of our algorithm took 285 µs to compress the ASCII data while our hardware implementation took on average 1752 hardware clock cycles. From these results, we estimate that our block would need to execute at about 6 GHz in order to perform as well as our software implementation. It is unlikely that our hardware accelerator can achieve this speed.

                Cycle count
Compressor      1752
Decompressor    1709

Table II: Cycle counts for the hardware operation.

It is important to note that we ran our software tests on a high-end research server with 56 processors, while our hardware accelerator has nowhere near the same compute power. When running the same software tests on our personal laptop (Intel i7-3667U processor with 2 cores running at 2 GHz and 6 GB RAM), the tests took over 10x longer to run. This yields a much more manageable 600 MHz clock rate. Unfortunately, as our personal laptop runs a Windows OS, the timer measurements for these tests could not accurately time any runtimes below 16 ms and were therefore too imprecise to use for analysis. Still, this demonstrates that the apparent poor performance of our hardware implementation is partly due to the massively greater processing power used for the software tests. It seems feasible that, given a similar number of processor cores as the research server, our hardware implementation could achieve comparable performance.

It is also worth noting that the latency for disk accesses is likely to be much larger than the time it takes to perform compression or decompression even if the hardware implementation achieves lower performance. This would enable compression times to be hidden behind I/O latency in the presence of concurrent I/O operations by performing compression for a pending operation while waiting for a previous operation's disk I/O to complete.

IV. RELATED WORK

Block-level compression schemes have been previously implemented at the file-system, cache, and block device level, but none of these approaches utilize a dedicated hardware component.

Cloop [15] performs block-level compression on data and stores it in a read-only block device that decompresses the data on read. The translation table used to translate logical to physical blocks of data is kept on disk. The read-only nature of this device allows for tightly packed compressed blocks and negates the issue of fragmentation after deletion or modification of logical blocks. On the other hand, cloop is not feasible for any system that requires writes or deletes since it does not allow the block device to be written after the initial compression and storage. By contrast, our work provides online compression, compressing block writes and block reads on the fly. This makes our work more suitable to common use cases like database systems.

Burrows et al. [16] implement block-level compression in a log-structured file-system, Sprite LFS, by compressing each block of log data or metadata as it is written and decompressing as it is read. Log-structured file-systems lend themselves to this approach due to the mostly sequential nature of log writes, which helps reduce fragmentation caused by block-level compression. Burrows also leveraged the file-system metadata in Sprite LFS to keep track of the physical locations of logical data blocks, while our approach requires more complex metadata management.

However, Burrows' approach requires modifications to the file-system itself, making it inflexible. This drawback also applies to other file-systems which support compression, like ZFS and Windows NTFS. Our work implements compression below the file-system, allowing it to be combined with any type of file-system. The Sprite LFS compression also degrades system performance by as much as 1.6 times for intensive operations like copying large files. This is because compression adds latency to every log write in the file-system, and log entries are written sequentially, creating significant overhead when writing large amounts of data. By contrast, our implementation is able to process concurrent I/O operations and hide the compression overhead when there are multiple pending I/O operations. Disk I/O is relatively slow, so when our hardware component processes concurrent I/O operations it is able to do compression or decompression while waiting for the disk read or write of another operation to complete. As a result, the added latency from compressing and decompressing data blocks becomes much less noticeable.
Klonatos' and Makatos' work [17] performs block-level compression through a virtual block-device layer, ZBD, which is placed on the I/O path below the file-system. This is similar to our own approach in that ZBD is independent of the file-system and provides transparent compression and decompression of data blocks. Both ZBD and our approach perform compression in a way that is invisible to the file-system and OS. We both keep track of logical block metadata through a table that is stored on disk and utilize a buffer to accommodate multiple I/O operations by storing intermediate results while waiting for disk reads and writes to complete.

The major difference is that our implementation offloads the work of compressing and decompressing data to a dedicated hardware component, allowing for higher performance on the host machine. Makatos' results show up to 311% increased CPU utilization and up to 34% slowdown on I/O performance. By utilizing a separate hardware device, it is possible to reduce the impact on CPU utilization and potentially even hide the increased I/O latency through leveraging concurrency in multiple pending I/O operations. Furthermore, ZBD uses a buffer in the DRAM for intermediate results while our implementation utilizes a buffer built into the accelerator. This saves DRAM space on the host machine.

Makatos et al. extend their block-level compression scheme to also apply to SSD caches in [18]. Once again this provides block-level compression below the file-system, but for the cache rather than for storage. However, the results from their work indicate that block-level compression can be implemented efficiently on multi-core CPUs and provide significant (11%, 25%, and 99% depending on workload) space savings even on the cache level.

Filgueira et al. [19] propose a transparent file-level compression scheme that uses heuristics to choose whether to compress data and what compression algorithm to use on the fly. Their work is aimed at reducing I/O time in parallel storage systems by compressing data. To this end, they detail several strategies to speed up transparent compression. They use multithreading to enable parallel compression of more than one chunk of data. This is especially suited to parallel storage systems since chunks of data can also be written to the disk in parallel. In our work we do not assume the existence of parallel storage systems, but compressing data blocks in parallel can still be utilized to improve performance since our results show that hardware compression could potentially be slow.

Filgueira et al. also choose between different compression algorithms on the fly to achieve the fastest performance, using heuristics gathered from analyzing compression performance on datasets of different file types. We believe that such an approach would not be suited for our work because additional metadata would need to be stored with every compressed data block to tell us what compression algorithm was used so that it can be properly decompressed. Our work focuses more on saving space rather than achieving the fastest I/O possible.

Finally, Filgueira et al. also use heuristics to estimate the impact compressing a file would have on its I/O latency and then choose whether or not to compress the file at all. In our case, we would always want to compress data as long as it saves space. However, for some types of data compression may actually expand the size of the data block rather than reducing it. If the workload consists of this type of data it may be advantageous to have a similar scheme where we use heuristics to determine whether or not compression would be beneficial. Yet this introduces its own set of problems where we must now include additional metadata indicating whether each logical data block has been compressed or not. Whether the cost of this additional metadata outweighs the benefits of selective compression will depend on workload. Workloads where compression actually results in an increase in data size are unusual, so at the moment it appears best to not use selective compression.

Lee et al. [20] push compression even further down the stack by integrating it below the disk controller itself. They write a custom FTL which allows them to theoretically control things like wear-leveling and exact hardware sector mapping. This approach offers finer control over the disk. The drawback to this is that it is less flexible. System designers could not swap this out with different disks if they wanted to, unless the new disks also supported their same system.
Deduplication is an alternate approach to saving space on storage systems. Deduplication works by removing redundant information on the file or block level and can potentially be combined with compression schemes to further save space. It seems worth exploring whether deduplication can be combined with our work to provide more effective compression. Work done by Meyer et al. [21] finds that block-level deduplication can reduce space more effectively than file-level deduplication but has several drawbacks, including loss of locality in file data (especially problematic if stored on hard disks due to seek times), file fragmentation, and increased complexity of the deduplication algorithm. File-level deduplication has been shown to be highly effective, reducing realistic file-systems to 32% of their original size [21]. Additionally, deduplication can achieve very high throughput, over 210 MB/sec for 4 write data streams [22]. However, deduplication has very high overhead and is therefore mostly used in archival storage for keeping backups of data. Since archival storage is rarely read or written, the high overhead is not an issue. By contrast, our work focuses on providing compression for online data storage which is being actively read and written to, so the overhead of deduplication could cause unacceptable degradation of system performance.

V. FUTURE WORK

While functional, our hardware compressor is only a prototype and its functionality can be improved and expanded in a variety of ways. Encryption and error correction features are already in progress, which will allow optional encryption on writes and perform error correction on compressed data. In future work, the simple compression algorithm used in this analysis will be replaced by a more effective compression scheme such as Snappy [23], zstd [24], zlib, or LZMA. The simple algorithm was chosen because of its simplicity and ease of implementation in hardware. Since the size of the input data is fixed as one block, the algorithm can be selected or tuned to provide good compression for data of that size.

Some optimizations can also be made to the Chisel code. For example, the tradeoffs between critical path length and latency could be explored further. Currently, the module only processes data one byte at a time, but this could be parallelized at the hardware level.

We also plan to explore ways to improve I/O latency by hiding compression overheads in cases where there are several pending I/O operations. Multiple I/O operations executing concurrently allows us to overlap compression with reads or writes, reducing the total latency of the operations while maintaining the same throughput. It is not possible to hide compression overheads during single I/O operations, but due to the block-level granularity of our compression scheme, having multiple pending I/O operations is likely. In the best case, this would allow us to almost completely hide the increased latency in I/O operations caused by compression.

Another option which would increase performance in a many-write workload is to add more parallelism at a higher granularity. Many of these accelerators can be placed side-by-side to increase effective throughput. This would allow many transactions to take place at the same time, thereby eliminating the potential bottleneck. This of course comes at the cost of a few extra bits of state and more thorough design. Filgueira et al. [19] demonstrate the benefits of enabling parallel I/O and compression in this way. Such an approach would be particularly effective in systems with parallel storage units that can be written or read from simultaneously.

It is also possible to increase our hardware accelerator's performance by caching physical segments and parts of the translation table for logical block lookups. Physical segment caching would remove the necessity of constantly writing and reading a physical segment that is being accessed repeatedly, speeding up disk accesses. In addition, caching parts of the logical block translation table will reduce the number of additional I/O operations needed to retrieve the translation table from the disk when looking up logical blocks.

This compressor was designed to fit into the data bus mentioned in Section II. This bus, as mentioned above, requires headers to be sent before data.
The compressor must therefore wait until it has received and processed all the data before it can send an output header. If the output header did not have to precede the output data, the latency of the entire transaction could be halved: the header could be sent after the data, assuming the downstream component knows how to accept and cache data beats before it receives their corresponding header. In the future, we will change the bus semantics to allow for this.

Although not completely implemented, we started off with the notion that a single hardware block can process multiple in-flight transactions at once. This could allow re-ordering of data and header beats, and would potentially increase overall throughput because the file-system or disk controller (in the write or read case, respectively) could optimize the low-level transaction ordering. We ultimately decided not to implement multiple in-flight transactions due to the strict semantics of the data bus.

To test the compressor in software, we came up with another scheme (Figure 11). We could write a custom block device driver that connects to a running simulator of the hardware via a Linux socket. All data normally written to disk via the block device driver would instead go through the hardware simulator to be compressed by the compressor. The sector remapping could then be implemented in Scala, and the data written out to disk from there. This setup avoids the need for an FPGA or any other physical hardware for testing, but allows for more system-level data to be gathered. A setup like this would allow us to fully realize a system-level implementation. The hardware simulator would of course be slower than real hardware, but we would still be able to capture performance data via cycle counts from the simulator. This was not implemented, but would be another way for us to test the hardware for validity and system-level applicability.

Figure 11: The accelerator as a virtual block device behind the kernel block device driver. This setup is theoretically possible, although we did not implement it in this work.

Although this work mainly explores the idea of using compression invisibly, underneath the OS, our accelerator could potentially be used as a file-system-level hardware accelerator. Figure 12 shows a diagram of what this would look like. In this configuration, the file-system would be able to make DMA requests to the accelerator chip, thereby offloading the compression workload and saving CPU cycles. We expect this would be beneficial, but it would require the file-system to be aware of the accelerator. Since the purpose of this work was to develop a hardware component that was completely OS-invisible, we did not explore this option in depth.

Figure 12: The accelerator chip attached as a DMA-accessible device outside of the OS, but accessed by the file-system. This setup would save CPU, but requires modification of the file-system itself.

VI. CONCLUSION

In this work we provide transparent, block-level compression for storage systems through a dedicated hardware component. We analyze the tradeoffs between saved storage and increased I/O latency. Our results show that our current implementation performs compression slower than software compression, but could potentially achieve comparable performance through the optimizations mentioned in Section V.

Previous works have demonstrated that transparent block-level compression can be implemented in software to effectively increase storage space at the cost of some CPU and RAM usage. Though our analysis was conducted using a simulated hardware component, offloading the compression and decompression onto a separate physical hardware component should reduce the CPU usage during I/O. This suggests that our compression accelerator could provide similar block-level compression to previous implementations like ZBD and reduce storage costs at the price of very little, or even no, increased CPU utilization in the host machine. In addition, our hardware accelerator could hide the increased I/O latency caused by compression during concurrent I/O operations.

Our work also offers greater flexibility than file-system-level compression schemes such as ZFS and NTFS. As a specialized hardware component, the accelerator can be flexibly inserted on the I/O path of all sorts of file-systems and operating systems, requiring only a driver allowing it to interface with the host machine.
This makes our work capable of integrating with a wide array of systems that could benefit from transparent compression such as database systems, cloud storage, and more.

ACKNOWLEDGMENT

The authors thank Dr. John Kubiatowicz for his support throughout the semester and for kicking us into gear when we needed it; Tan Nguyen, Vighnesh Iyer, and Bob Zhou for their work on CREEC in collaboration with the hardware design; and Sagar Karandikar and Nathan Pemberton for their help in the exploration of the project space.

REFERENCES

[1] "How much data do we create every day? The mind-blowing stats everyone should read," https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#3c5c600960ba, accessed: 2018-12-08.
[2] "Chisel 3," https://github.com/freechipsproject/chisel3, accessed: 2018-12-04.
[3] "Chisel testers 2," https://github.com/ucb-bar/chisel-testers2, accessed: 2018-12-04.
[4] "Rocket chip generator," https://github.com/freechipsproject/rocket-chip, accessed: 2018-12-04.
[5] "RISC-V project template," https://github.com/ucb-bar/project-template, accessed: 2018-12-04.
[6] K. S. Yim, H. Bahn, and K. Koh, "A flash compression layer for smartmedia card systems," IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 192–197, Feb 2004.
[7] A. Zuck, S. Toledo, D. Sotnikov, and D. Harnik, "Compression and SSDs: Where and how?" in 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW 14). Broomfield, CO: USENIX Association, 2014. [Online]. Available: https://www.usenix.org/conference/inflow14/workshop-program/presentation/zuck
[8] "Hammer: Highly agile masks made effortlessly from RTL," https://github.com/ucb-bar/hammer, accessed: 2018-12-04.
[9] "Vivado," https://www.xilinx.com/products/design-tools/vivado.html, accessed: 2018-12-08.
[10] "zlib," https://zlib.net/, accessed: 2018-12-08.
[11] S. Kodituwakku and U. Amarasinghe, "Comparison of lossless data compression algorithms for text data," Indian Journal of Computer Science and Engineering, vol. 1, no. 4, pp. 416–425, 2010.
[12] D. A. Lelewer and D. S. Hirschberg, "Data compression," ACM Comput. Surv., vol. 19, no. 3, pp. 261–296, Sep. 1987. [Online]. Available: http://doi.acm.org/10.1145/45072.45074
[13] "Pensieve: A machine learning assisted SSD layer for extending the lifetime," in IEEE International Conference on Computer Design, 2018.
[14] "LZMA," https://www.7-zip.org/sdk.html, accessed: 2018-12-10.
[15] "The compressed loopback device," http://www.knoppix.net/wiki/Cloop, accessed: 2018-12-08.
[16] M. Burrows, C. Jerian, B. Lampson, and T. Mann, "On-line data compression in a log-structured file system," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS V. New York, NY, USA: ACM, 1992, pp. 2–9. [Online]. Available: http://doi.acm.org/10.1145/143365.143376
[17] Y. Klonatos, T. Makatos, M. Marazakis, M. D. Flouris, and A. Bilas, "Transparent online storage compression at the block-level," ACM Transactions on Storage, vol. 8, no. 5, 2012.
[18] T. Makatos, Y. Klonatos, M. Marazakis, M. D. Flouris, and A. Bilas, "Using transparent compression to improve SSD-based I/O caches," in Proceedings of the 5th European Conference on Computer Systems, ser. EuroSys '10. New York, NY, USA: ACM, 2010, pp. 1–14. [Online]. Available: http://doi.acm.org/10.1145/1755913.1755915
[19] R. Filgueira, M. Atkinson, Y. Tanimura, and I. Kojima, "Applying selectively parallel I/O compression to parallel storage systems," 08 2014.
[20] S. Lee, J. Park, K. Fleming, Arvind, and J. Kim, "Improving performance and lifetime of solid-state drives using hardware-accelerated compression," IEEE Transactions on Consumer Electronics, vol. 57, no. 4, pp. 1732–1739, 2011.
[21] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," Trans. Storage, vol. 7, no. 4, pp. 14:1–14:20, Feb. 2012. [Online]. Available: http://doi.acm.org/10.1145/2078861.2078864
[22] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, ser. FAST'08. Berkeley, CA, USA: USENIX Association, 2008, pp. 18:1–18:14. [Online]. Available: http://dl.acm.org/citation.cfm?id=1364813.1364831
[23] "Snappy," https://google.github.io/snappy/, accessed: 2018-12-04.
[24] "Zstandard," https://facebook.github.io/zstd/, accessed: 2018-12-04.