Exploring the Capacity-Increasing Potential of an OS-Invisible Hardware Compression Accelerator

Kimberly Lu, Kyle Kovacs
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{kimberlylu, kylekovacs}@berkeley.edu

Abstract—As more and more data are created, considerations of SSD wear-out and storage capacity limitations pose major problems for system designers and consumers alike. Data compression can alleviate both of these problems by reducing physical I/O cycles on disks and increasing effective disk capacity. A hardware-accelerated, transparent block-level compression layer slotted between the file-system and disk controller layers appears promising in the face of these challenges. While integrated transparent compression design efforts are underway, existing research relies on either modifying the file-system or running compression algorithms in software alone. This work presents an OS-invisible hardware accelerator that is capable of compressing data, thereby saving disk write cycles and capacity, all with minimal additional cost to system architects.

I. INTRODUCTION

With the advent of big data and the internet of things (IoT), storage space is in greater demand than ever. Storage cost per GB is declining, but the increased demand for storage is far outstripping this decline; the internet alone generates 2.5 quintillion bytes of data per day [1]. As a result, it is necessary to find ways to use storage space more efficiently.

Data compression offers a way to improve the efficiency of storage space by effectively storing more data per physical block at the cost of greater response time and CPU usage. In addition, as SSD storage grows more prevalent, the problem of SSD wear-out has become a significant issue. SSD wear-out can also be slowed by reducing the amount of data being written. In the past, compression has typically been done on the file level, either by user applications or by file-systems such as NTFS or ZFS. While file-level compression does save space, it has several drawbacks.

First, in cases where most files are very small, file-level compression is not very effective and introduces significant overhead. On the flip side, very large files can be prohibitively expensive to compress depending on the compression algorithm due to the amount of RAM needed. Second, file-level compression requires support from the file-system to keep track of the compressed file's blocks, compress files on writes, and decompress on reads, which restricts the types of file-systems that can be used. Adding support for file-level compression also increases the complexity of the file-system. Moving data compression to the block level helps solve these issues.

Block-level compression occurs on a layer between the file-system and the disk, essentially intercepting block-level writes and reads on the I/O path and performing compression or decompression on the data. Since block-level compression is below the file-system, the file-system need not be aware that compression is happening at all. This type of compression is called "transparent" or "OS-invisible" compression because it is invisible to the rest of the system. This means that block-level compression can be flexibly combined with all kinds of file-systems. Additionally, block-level compression always operates on single data blocks, eliminating both the small file and large file issues of file-level compression.

Implementing block-level compression comes with its own set of challenges. Compression algorithms do not uniformly compress data, so compressed blocks can be of variable size. Let these compressed blocks be referred to as logical blocks, and the physical storage blocks be referred to as physical blocks. One physical block may contain multiple logical blocks, meaning additional overhead and metadata is needed to keep track of these variable-sized logical blocks. This also means that one must read a physical block before writing to it in order to preserve data already on the physical block. Not only that, but the unpredictable size of logical blocks can lead to fragmentation as compressed blocks are modified or overwritten. Dealing with these issues can add increased I/O latency and CPU usage for every disk access.

In this work, we present an implementation of block-level compression using a hardware accelerator, which transparently performs compression and decompression on data blocks. The hardware component is inserted along the I/O path, below the file-system and above the disk. A device driver would intercept block writes and reads and forward the requests to the hardware component, which transparently compresses writes before writing them to the disk, and transparently decompresses blocks on reads. The hardware component keeps track of the location of logical blocks, meaning that the file-system need not be aware that any of this is happening. Utilizing a hardware accelerator moves the cost of block-level compression overhead out of the host machine and onto the separate hardware component. By doing this, it is possible to hide some of the costs of block-level compression. Since the hardware accelerator handles compression, decompression, and keeping track of logical blocks, the CPU cost for disk accesses does not noticeably increase on the host machine. Disk latency is already high, so an incremental increase does not have a huge impact on overall system performance.

Figure 1: The ideal setup of our hardware accelerator attached as an intermediate layer between the operating system and the disk. Notice that the OS and disk are interchangeable, and the accelerator need not have any knowledge of either.

In addition, the increased I/O latency due to compression can potentially be hidden by the hardware accelerator during concurrent I/O operations.

The rest of this paper is organized as follows: Section II explains the details of our implementation and testing strategies. Section III discusses how our accelerator performs and analyzes the results. In Section IV we summarize current research in this area, and talk about the contrast between our work and other methods. We ponder what we could have done differently and how it would have impacted our results in Section V, and finally, Section VI concludes the paper.

II. DESIGN AND IMPLEMENTATION

To implement transparent block-level compression we designed a hardware accelerator that is placed along the I/O path between the file-system and disk. This hardware component communicates with the host machine through a block device driver. The driver intercepts block I/O requests to the kernel and forwards them to the hardware accelerator, which then handles the request. To handle block writes, the accelerator compresses the data, writes the compressed data to physical storage, and updates the metadata of the logical block. To handle block reads, the accelerator looks up the logical block's location, decompresses the data, and returns the data to the requester. We designed the hardware accelerator in Chisel [2], a Scala-embedded language for parameterized hardware generators. The testing infrastructure is built on top of the testers2 [3] library.

A. Compression Algorithm

The compression algorithm used is a simple differential + run-length encoding scheme. Both differential encoding and run-length encoding have completely reversible analogous operations, meaning that the compression is lossless. Using a lossless algorithm is particularly important for our implementation as all reads and writes in the system will undergo compression. This lossless algorithm was chosen for its simplicity and for ease of implementation.

1) Differential Coding: This technique is often used as a pre-compression pass over data to prepare it for one or more compression passes. Input sequences are transformed bytewise to an initial value followed by a series of differences. In the example shown in Figure 2, the sequence [30,31,35,32] is differentially encoded to [30,1,4,-3] (start with 30, add 1 to get to 31, add 4 to get to 35, and subtract 3 to get to 32). It is easy to see how the decode operation works in reverse to recover the original sequence. Note that differential coding does not change the length of an input sequence.

Figure 2: A sequence of bytes can be differential-encoded (or decoded) according to this diagram. The result is an initial value and a series of differences. The reverse transformation simply takes the initial value and adds all the successive values cumulatively to recover the original sequence. The bottom portion of the figure shows a sequence of repeating bytes resulting in a string of zeros, which is important for run-length encoding to work.

This has a couple of functions. First of all, it can reduce the total number of unique symbols required to represent a sequence of bytes. This can make dictionary or sliding-window compression passes more space efficient, but we do not utilize this property here. More importantly, differential encoding will turn any sequence of repeated bytes into a sequence of zeros. Because run-length encoding operates on sequences of zeros, differential encoding is necessary to allow run-length encoding to provide good compression.

2) Run-Length Coding: Run-length encoding compresses "runs" of zeros into a single zero followed by a number of additional zeros. As shown in Figure 3, the sequence [30,0,0,0,0,0,25,26] encodes to [30,0,4,25,26] (start with [30], change the sequence of five zeros to [0,4], and then end with [25,26]). The decode operation is clear to see: the number immediately after any zero is treated as a number of zeros to insert.

Figure 3: Run-length encoding changes sequences of repeated zeros into a single zero and a number of additional zeros. All sequences of up to 256 zeros are compressed into just 2 bytes.
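To make the two passes concrete, the following C sketch (our illustration, not the authors' Chisel implementation) mirrors the behavior described above; the function names and the wrap-around treatment of negative differences (the -3 above becomes 253 modulo 256) are our assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Differential encode: keep the first byte, then emit byte-wise
 * differences (wrapping modulo 256). Output length == input length. */
static void diff_encode(const uint8_t *in, uint8_t *out, size_t n) {
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(in[i] - prev);
        prev = in[i];
    }
}

/* Differential decode: cumulative sum of the differences. */
static void diff_decode(const uint8_t *in, uint8_t *out, size_t n) {
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (uint8_t)(prev + in[i]);
        out[i] = prev;
    }
}

/* Run-length encode: a run of k zeros (1 <= k <= 256) becomes the pair
 * [0, k-1]; non-zero bytes pass through. Returns the output length,
 * which can exceed n when the input contains many singleton zeros. */
static size_t rle_encode(const uint8_t *in, uint8_t *out, size_t n) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] != 0) { out[o++] = in[i++]; continue; }
        size_t run = 0;
        while (i + run < n && in[i + run] == 0 && run < 256) run++;
        out[o++] = 0;
        out[o++] = (uint8_t)(run - 1);   /* number of additional zeros */
        i += run;
    }
    return o;
}

/* Run-length decode: a zero followed by k expands to k+1 zeros.
 * Assumes well-formed input produced by rle_encode. */
static size_t rle_decode(const uint8_t *in, uint8_t *out, size_t n) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] != 0) { out[o++] = in[i++]; continue; }
        size_t zeros = (size_t)in[i + 1] + 1;
        for (size_t z = 0; z < zeros; z++) out[o++] = 0;
        i += 2;
    }
    return o;
}

int main(void) {
    /* Compression is diff_encode followed by rle_encode, as in Figures 2-3. */
    uint8_t block[] = {30, 31, 35, 32, 32, 32, 32, 32, 25, 26};
    uint8_t diff[sizeof block], comp[2 * sizeof block];
    diff_encode(block, diff, sizeof block);
    size_t clen = rle_encode(diff, comp, sizeof block);
    printf("compressed %zu -> %zu bytes\n", sizeof block, clen);
    return 0;
}
```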

Differential + run-length encoding provides good compression on the right kind of data. Any data that has large sequences of repeated values will be compressed very nicely, but random sequences or sequences that repeat at granularities larger than the byte level are not compressible. Furthermore, a bad sequence can sometimes lead to data expansion, rather than compression. This arises due to the fact that a single zero in the input sequence encodes to two zeros in the output (a zero with zero additional zeros). For example, as depicted in Figure 4, the sequence [4,0,0,0,0,3,0,2,0,7,1,0] (length = 12) encodes to [4,0,3,3,0,0,2,0,0,7,1,0,0] (length = 13). Even though a sub-sequence of four repeated zeros got compressed to just two bytes, the singleton zeros were expanded and the overall sequence became longer.

Figure 4: Sometimes this compression scheme can lead to an increase in data size. In this diagram, the sequence got 2 bytes shorter from a run of zeros, but it also got 3 bytes longer due to singleton zeros. See Section III for more detail about how this drawback is affected by different types of data, i.e. sparse versus random.

In the hardware compression block, differential encoding and decoding is done in chunks according to the size of the data bus. Run-length encoding is done one byte at a time. The two functions are implemented as separate modules, and the compressor is written as a composition of the two. This allows for more extensibility and flexibility. For example, another compression scheme that uses differential encoding as a pre-compression pass would not have to re-implement it. Instead, the differential block can be slotted in at the beginning of the compression and at the end of the decompression to form the whole operation.

B. Hardware Architecture

The compression hardware is set up on a unidirectional bus as two separate modules: a compressor and a decompressor. On the write path, data from the CPU is compressed, and on the read path, data from the disk is decompressed.

Figure 5: The read and write sides of the compressor are two separate blocks with two separate functions each. Note that the differential block does not have to modify the header, whereas the run-length coder does. Each of these modules is written as a parameterized Chisel generator.

Transactions on the bus are composed of header beats and data beats. Each transaction is made up of one header beat and one or more data beats. The number of data beats in the transaction is kept track of in the header. This way, the hardware block knows which incoming data beats correspond to which transaction. To process a transaction, the module accepts one header beat, accepts as many data beats as the header beat specified, operates on the data, sends an output header beat, and sends the resulting data in the form of more data beats. The output header specifies the final length of the data. In this way, we deal with the length changes before and after compression and decompression. See Figure 5 for a visual representation of the compressor and decompressor and how they relate to the bus.

C. Hardware Verification

To verify the hardware, we use a form of transaction-level modelling (Figure 6). A high-level transaction just specifies some bytes to process. High-level transactions can be decomposed into low-level transactions. In our bus, a low-level transaction is either a data beat or a header beat. So each high-level transaction is composed of one low-level header beat transaction and one or more low-level data beat transactions. High-level transactions have no notion of beats.

The reason we set it up this way is so that we can use a common stimulus to drive both abstract software models and physical hardware blocks. Software models have no information about the bus on which the hardware operates, so they cannot operate on low-level transactions. Hardware modules, on the other hand, are bound by the physical world and so cannot operate on such a high level. A bus that is hundreds of wires wide would pose a serious problem in actual hardware for power and routing reasons. See Figure 7 for more details about the high- and low-level transactions. Transaction-level modelling provides a powerful abstraction on which to test the hardware under design.

Figure 6: Transaction-level modelling. The red block shows a high-level transaction. On the left is a software model, and on the far right is a hardware module. The high-level transaction is able to drive both software and hardware, and then check to ensure that the results from both are identical. The blue and green blocks show how low-level transactions are constructed from high-level ones in order to drive the RTL.

Figure 7: The software and hardware modules are driven in different ways, but both through decoupled interfaces and both through the same high-level abstraction. The gray oval represents a check to make sure the output from the hardware matches that from the software model.
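As a rough software model of this framing, the sketch below decomposes a high-level transaction (just a buffer of bytes) into one header beat plus data beats. The field names, 8-byte beat width, and header layout are assumptions for illustration, not the actual Chisel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shapes for the two beat types: one header beat announcing
 * how many data beats follow, then that many data beats. */
typedef struct {
    uint16_t num_data_beats;   /* data beats belonging to this transaction */
    uint16_t payload_bytes;    /* length of the (un)compressed payload */
} header_beat_t;

typedef struct {
    uint8_t bytes[8];          /* one bus-width worth of payload */
} data_beat_t;

/* Decompose a high-level transaction into low-level beats, as a software
 * model of the bus would. Returns the number of data beats produced. */
static size_t to_low_level(const uint8_t *payload, size_t len,
                           header_beat_t *hdr, data_beat_t *beats) {
    size_t width = sizeof beats->bytes;
    size_t n = (len + width - 1) / width;
    hdr->num_data_beats = (uint16_t)n;
    hdr->payload_bytes  = (uint16_t)len;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < width; j++) {
            size_t k = i * width + j;
            beats[i].bytes[j] = (k < len) ? payload[k] : 0;  /* zero-pad last beat */
        }
    return n;
}
```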
We were able to use the RocketChip [4] SoC generator project template [5] to attach our block as a memory-mapped peripheral. This is detailed in Figure 8. By connecting the accelerator to a TileLink crossbar via AXI4-Stream nodes, we are able to memory-map the device and generate a Verilator simulator. Arbitrary C code can be compiled for RISC-V and run on the Rocket core (through Verilator), which can in turn utilize the accelerator via MMIO.

Figure 8: The accelerator attached to RocketChip as an MMIO peripheral. Code compiled for RISC-V can be run on the Rocket core and can utilize our accelerator.
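From the perspective of C code running on the Rocket core, driving the accelerator would look roughly like the sketch below; the base address, register offsets, and polling protocol are hypothetical placeholders, not the block's real memory map.

```c
#include <stdint.h>

/* Assumed MMIO layout for illustration only. */
#define ACCEL_BASE      0x2000UL
#define ACCEL_HEADER    (ACCEL_BASE + 0x00)
#define ACCEL_DATA      (ACCEL_BASE + 0x08)
#define ACCEL_STATUS    (ACCEL_BASE + 0x10)
#define ACCEL_RESULT    (ACCEL_BASE + 0x18)

static inline void mmio_write64(uintptr_t addr, uint64_t v) {
    *(volatile uint64_t *)addr = v;
}
static inline uint64_t mmio_read64(uintptr_t addr) {
    return *(volatile uint64_t *)addr;
}

/* Send one transaction: a header giving the beat count, then the data
 * beats, then poll until the compressed result header is available. */
static void compress_block(const uint64_t *beats, uint64_t num_beats) {
    mmio_write64(ACCEL_HEADER, num_beats);
    for (uint64_t i = 0; i < num_beats; i++)
        mmio_write64(ACCEL_DATA, beats[i]);
    while (mmio_read64(ACCEL_STATUS) == 0)
        ;                                /* busy-wait for completion */
    (void)mmio_read64(ACCEL_RESULT);     /* output header: compressed length */
}
```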
D. Logical and Physical Block Mapping

The compression algorithm described above results in variable-sized compressed blocks that we call logical blocks. Since logical blocks may be smaller than a physical data block, there can be one or more logical blocks in a single physical block. As a result, additional metadata is needed to keep track of the location of logical blocks so that they can be retrieved and decompressed when a file-system requests them. Our current prototype has not yet fully implemented logical block mapping. In this section we describe the method we will use to keep track of logical blocks and how we will deal with file modifications. Yim, Bahn, and Koh [6] and Zuck et al. [7] describe various packing methods for handling the changing size of data after compression, but none of their schemes are exactly right for our purposes, so we introduce our own.

Let us first define the physical segment, which is a fixed-size unit of physical storage whose size is some multiple of the block size of the underlying storage. The actual size of the physical segment can be arbitrarily chosen without affecting how logical block mapping works, so let it be any arbitrary multiple of the underlying block size. When a logical block of data is compressed and stored, the hardware component finds a free physical segment to write it to. Pointers to free physical segments are kept in a free segment list. We define a free physical segment to have at least one physical data block of contiguous free space. Information about the logical block's location is then written to a translation table that is stored persistently on a special region of the disk so that it can be found and decompressed when requested later. The translation table stores information about the physical segment that contains each logical block and the index within the physical segment that the logical block occupies. This index number is necessary since the physical segment may contain more than one logical block.

Physical segments themselves consist of a one-way linked list of logical blocks, so free space in a physical segment is contiguous and at the end of the segment. When retrieving a logical block, the entire physical segment that contains it is read. Therefore, given the physical segment a logical block resides in and its index, the logical block can be retrieved by traversing the linked list. A linked list was chosen to allow logical blocks to be easily added and removed at any location in the physical segment. As long as the physical segment size does not grow too large, the linked list will be fairly short and so the time taken to traverse the linked list will be negligible. Additionally, we chose a one-way linked list to minimize the amount of additional metadata stored on the physical segment; only a single "next" pointer is required for each logical block. This simple linked list structure ensures we do not lose the benefits of compression by storing a lot of metadata in the physical segment. The structure of the translation table and physical segment is shown in Figure 9.

Figure 9: Each translation table entry maps a logical block address to a physical segment ID and a block index, and each logical data block in a physical segment carries a next pointer and a valid bit. To retrieve logical data blocks, the hardware component intercepts a block read request and looks up the logical block's address in the translation table. It uses this to get the physical segment that contains the logical block and the block's index in the segment. It then reads in the physical segment and traverses the linked list of logical blocks to find the one requested.

Due to variable-sized logical blocks, in-place updates to data become problematic to implement. Since the updated logical block may be larger than the original, it may not be possible to store it in the same location; however, if the updated logical block is smaller or the same size, it can stay in the same spot. Attempting to perform in-place updates results in additional overhead to perform this size check and can lead to complex, unpredictable behaviors. Instead, we perform out-of-place updates by marking the old logical block as invalid and writing the update to a new location. The valid bit is stored in the physical segment linked list. This approach is simple and reduces overhead for updates but necessitates cleaning up old invalid logical blocks.

To avoid introducing overhead on every block write, we will not clean the invalid block during each update but instead have a cleaner process that periodically iterates through physical segments and deletes invalid logical blocks. The cleaner process also adds physical segments to the free segment list if enough space is freed on them. Since the cleaner process could potentially interfere with I/O operations, we will want to run it infrequently. It can be triggered at set time intervals as well as when the number of free physical segments drops below a certain threshold. Additional optimizations can be made, such as running the cleaner process late at night when there is little activity or restricting the cleaner process to only cleaning physical segments that have not been accessed for some time. Another feature which we do not plan to implement but can be added to the cleaner process is to periodically compact data by taking physical segments with little data on them and combining them together. For instance, three physical segments that are each 1/3 full can have their data all copied into an empty physical segment, and the original 3 segments will be re-added to the free segment list. This helps increase data locality and reduce wasted space.
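A minimal sketch of the metadata just described, with field names and widths that we have assumed (the paper does not fix a layout), might look as follows; the real structures would live on disk in the translation-table region and inside each physical segment, and whether an invalid block keeps its list position until the cleaner runs is likewise our assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One translation table entry: where a logical block lives. */
typedef struct {
    uint64_t physical_segment_id;  /* which segment holds the logical block */
    uint16_t block_index;          /* position within the segment's list */
} translation_entry_t;

/* Per-logical-block header stored inside a physical segment. */
typedef struct {
    uint16_t next_offset;          /* byte offset of next header, 0 = end of list */
    bool     valid;                /* cleared by out-of-place updates */
    uint16_t length;               /* compressed payload length in bytes */
} logical_block_hdr_t;

/* Walk the one-way linked list inside a segment buffer to find the
 * logical block at `index`. Returns NULL if the list is too short. */
static logical_block_hdr_t *find_logical_block(uint8_t *segment,
                                               uint16_t index) {
    logical_block_hdr_t *hdr = (logical_block_hdr_t *)segment;
    for (uint16_t i = 0; i < index; i++) {
        if (hdr->next_offset == 0)
            return NULL;
        hdr = (logical_block_hdr_t *)(segment + hdr->next_offset);
    }
    return hdr;
}

/* Out-of-place update: invalidate the old copy; the new copy is appended
 * to a free segment and the translation table entry is rewritten. */
static void invalidate(logical_block_hdr_t *old_block) {
    old_block->valid = false;
}
```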
We do not consider the operations of the disk controller, namely wear-levelling (even distribution among physical sectors that can prevent certain sectors from going bad due to excessive use), request re-ordering, and various performance-based allocation schemes. We assume no knowledge of the disk internals, including its controller, which allows our device to slot in with any type of disk, be it HDD, SSD, or some other type of persistent storage.

III. RESULTS AND DISCUSSION

To evaluate the performance of our prototype, we implemented the compression scheme described in Section II in software as well as hardware. We are able to compare the time it takes to do compression in software to the number of cycles needed to perform the same compression in hardware, which tells us what clock frequency the hardware would have to run at to achieve comparable performance. Additionally, the software implementation of our compression algorithm allowed us to compare compression times and compression factors to zlib, a more sophisticated and more widely-used compression scheme. All tests were run on a server with 56 Intel Xeon E5-2690 v4 processors running at 2.60 GHz with 264 GB RAM.

A. Hardware Analysis

We synthesized our hardware design using Hammer [8] and Vivado [9] targeting a Xilinx VC707 FPGA. The results are shown in Table I. Since the design does not use block RAMs, there are a lot of flip-flops. Reconfiguring the way that these internal buffers are configured could potentially improve the performance and FPGA mapping.

                     Compressor   Decompressor
Area (LUTs)          29161        29035
Area (FFs)           21403        21431
Critical path (ns)   3.51         5.13

Table I: Synthesis results for the hardware block. A critical path of 3.51 ns corresponds to a clock frequency of 285 MHz, and a critical path of 5.13 ns corresponds to a clock frequency of 195 MHz.

The long critical path of the decompressor comes from the fact that differential decoding requires a series of addition operations that rely on each other (each output byte cannot be determined until the previous output is known). Since the differential module encodes or decodes eight bytes at a time, there are eight arithmetic operations that need to be completed in one clock cycle, not to mention additional logic. Processing one byte per clock cycle could cut this critical path down, but at the cost of latency and greater engineering effort.

B. Compression Analysis

To test our compression scheme's performance, we generated two separate sets of test data. These workloads are
• a set of 100 files, each 512 bytes long, filled with random ASCII characters (called "ASCII data"), and
• a set of 100 files, each 512 bytes long, filled with 80% zeros and 20% ones (called "sparse data").

The ASCII data is classically incompressible, that is, on average there are no trends, sequences, or repeating data whatsoever.

The sparse data, on the other hand, has more strings of zeros and is therefore more compressible. As seen in Figure 10, both zlib [10] and our simple differential + run-length compression schemes compress the sparse data more than they compress the ASCII data. The results in Figure 10 are from software implementations of the compression algorithms.
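For concreteness, the two workloads could be generated with a short program along these lines; the file naming and the use of rand() are illustrative assumptions, not the authors' actual test harness.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_FILES 100
#define FILE_SIZE 512

int main(void) {
    char name[64];
    for (int f = 0; f < NUM_FILES; f++) {
        unsigned char ascii[FILE_SIZE], sparse[FILE_SIZE];
        for (int i = 0; i < FILE_SIZE; i++) {
            ascii[i]  = (unsigned char)(33 + rand() % 94);  /* printable ASCII */
            sparse[i] = (rand() % 100 < 80) ? 0 : 1;        /* 80% zeros, 20% ones */
        }
        snprintf(name, sizeof name, "ascii_%03d.bin", f);
        FILE *fa = fopen(name, "wb");
        if (fa) { fwrite(ascii, 1, FILE_SIZE, fa); fclose(fa); }
        snprintf(name, sizeof name, "sparse_%03d.bin", f);
        FILE *fs = fopen(name, "wb");
        if (fs) { fwrite(sparse, 1, FILE_SIZE, fs); fclose(fs); }
    }
    return 0;
}
```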

In terms of execution time, the zlib algorithm is far faster than our simple algorithm. Our algorithm performs arithmetic computation on every byte, whereas zlib uses a sliding-window approach. It is clear that zlib outperforms our simple compression scheme; however, this is not surprising. We chose to implement this simple differential + run-length compression scheme in our prototype as a proof-of-concept. In order to achieve more efficient and effective compression, we will later implement a more sophisticated algorithm like zlib in our hardware accelerator.

Figure 10: Performance comparison between our simple compression and zlib. We ran software implementations of our simple differential + run-length compression and zlib on the same set of sample data. One panel shows space saved (%) and the other shows compression time (µs), for the ASCII and sparse workloads under each compression algorithm.

Choosing an effective compression algorithm can depend greatly on the type of data being compressed. Kodituwakku et al. [11] compare the speed and effectiveness of various algorithms for compression of text data. A general survey of compression algorithms has also been conducted [12]. These previous works demonstrate that some compression algorithms are more effective than others for certain data types. In addition, there are tradeoffs between compression speed and effectiveness, where compression algorithms that compress data very compactly can take more time to run. Determining what compression algorithm to use requires careful analysis of the workload and determining the right balance between CPU/RAM usage, time, and compression factor.

Since our block-level compression scheme is used on all file writes and reads in a system, a potentially large variety of file types may be compressed. Our goal is to offer flexible compression that can be inserted into any system, so there is no way to know what kind of workload the algorithm will be used on. To deal with this, there are two possible approaches. First is to use an adaptive compression scheme which uses heuristics to choose between different compression algorithms depending on the data type being compressed. The selection of compression algorithms based on classification of data has been explored in Pensieve [13], which used machine learning to categorize data based on its contents to utilize compression dictionaries more efficiently.
However, Pensieve is implemented as a firmware layer, which could be improved upon with dedicated custom hardware. Furthermore, such an adaptive scheme would add additional overhead and require multiple compression algorithms to be implemented in hardware, increasing hardware complexity. We prefer the other option, which is to choose a compression scheme that performs well over a broad category of data types. Strong candidates for future compression schemes are zlib and LZMA [14], the compression algorithm used in 7zip. These algorithms achieve reasonable compression factors and times over many data types.

C. Hardware Compression

We ran the ASCII data through the hardware compressor and decompressor to test the hardware's performance (Table II). On average, the software implementation of our algorithm took 285 µs to compress the ASCII data while our hardware implementation took on average 1752 hardware clock cycles. From these results, we estimate that our block would need to execute at about 6 GHz in order to perform as well as our software implementation. It is unlikely that our hardware accelerator can achieve this speed.

                Cycle count
Compressor      1752
Decompressor    1709

Table II: Cycle counts for the hardware operation.

It is important to note that we ran our software tests on a high-end research server with 56 processors, while our hardware accelerator has nowhere near the same compute power. When running the same software tests on our personal laptop (Intel i7-3667U processor with 2 cores running at 2 GHz and 6 GB RAM), the tests took over 10x longer to run. This yields a much more manageable 600 MHz clock rate. Unfortunately, as our personal laptop runs a Windows OS, the timer measurements for these tests could not accurately time any runtimes below 16 ms and were therefore too imprecise to use for analysis. Still, this demonstrates that the apparent poor performance of our hardware implementation is partly due to the massively greater processing power used for the software tests. It seems feasible that, given a similar number of processor cores as the research server, our hardware implementation could achieve comparable performance.

It is also worth noting that the latency for disk accesses is likely to be much larger than the time it takes to perform compression or decompression even if the hardware implementation achieves lower performance. This would enable compression times to be hidden behind I/O latency in the presence of concurrent I/O operations by performing compression for a pending operation while waiting for a previous operation's disk I/O to complete.

IV. RELATED WORK

Block-level compression schemes have been previously implemented at the file-system, cache, and block device level, but none of these approaches utilize a dedicated hardware component.

Cloop [15] performs block-level compression on data and stores it in a read-only block device that decompresses the data on read. The translation table used to translate logical to physical blocks of data is kept on disk. The read-only nature of this device allows for tightly packed compressed blocks and negates the issue of fragmentation after deletion or modification of logical blocks. On the other hand, cloop is not feasible for any system that requires writes or deletes since it does not allow the block device to be written after the initial compression and storage. By contrast, our work provides online compression, compressing block writes and block reads on the fly. This makes our work more suitable to common use cases like database systems.

Burrows et al. [16] implement block-level compression in a log-structured file-system, Sprite LFS, by compressing each block of log data or metadata as it is written and decompressing as it is read. Log-structured file-systems lend themselves to this approach due to the mostly sequential nature of log writes, which helps reduce fragmentation caused by block-level compression. Burrows also leveraged the file-system metadata in Sprite LFS to keep track of the physical locations of logical data blocks, while our approach requires more complex metadata management.

However, Burrows' approach requires modifications to the file-system itself, making it inflexible. This drawback also applies to other file-systems which support compression, like ZFS and Windows NTFS. Our work implements compression below the file-system, allowing it to be combined with any type of file-system. The Sprite LFS compression also degrades system performance by as much as 1.6 times for intensive operations like copying large files. This is because compression adds latency to every log write in the file-system, and log entries are written sequentially, creating significant overhead when writing large amounts of data. By contrast, our implementation is able to process concurrent I/O operations and hide the compression overhead when there are multiple pending I/O operations. Disk I/O is relatively slow, so when our hardware component processes concurrent I/O operations it is able to do compression or decompression while waiting for the disk read or write of another operation to complete. As a result, the added latency from compressing and decompressing data blocks becomes much less noticeable.
Klonatos' and Makatos' work [17] performs block-level compression through a virtual block-device layer, ZBD, which is placed on the I/O path below the file-system. This is similar to our own approach in that ZBD is independent of the file-system and provides transparent compression and decompression of data blocks. Both ZBD and our approach perform compression in a way that is invisible to the file-system and OS. We both keep track of logical block metadata through a table that is stored on disk and utilize a buffer to accommodate multiple I/O operations by storing intermediate results while waiting for disk reads and writes to complete.

The major difference is that our implementation offloads the work of compressing and decompressing data to a dedicated hardware component, allowing for higher performance on the host machine. Makatos' results show up to 311% increased CPU utilization and up to 34% slowdown on I/O performance. By utilizing a separate hardware device, it is possible to reduce the impact on CPU utilization and potentially even hide the increased I/O latency through leveraging concurrency in multiple pending I/O operations. Furthermore, ZBD uses a buffer in the DRAM for intermediate results while our implementation utilizes a buffer built into the accelerator. This saves DRAM space on the host machine.

Makatos et al. extend their block-level compression scheme to also apply to SSD caches in [18]. Once again this provides block-level compression below the file-system, but for the cache rather than for storage. However, the results from their work indicate that block-level compression can be implemented efficiently on multi-core CPUs and provide significant (11%, 25%, and 99% depending on workload) space savings even on the cache level.

Filgueira et al. [19] propose a transparent file-level compression scheme that uses heuristics to choose whether to compress data and what compression algorithm to use on the fly. Their work is aimed at reducing I/O time in parallel storage systems by compressing data. To this end, they detail several strategies to speed up transparent compression. They use multithreading to enable parallel compression of more than one chunk of data. This is especially suited to parallel storage systems since chunks of data can also be written to the disk in parallel. In our work we do not assume the existence of parallel storage systems, but compressing data blocks in parallel can still be utilized to improve performance since our results show that hardware compression could potentially be slow.

Filgueira et al. also choose between different compression algorithms on the fly to achieve the fastest performance, using heuristics gathered from analyzing compression performance on datasets of different file types. We believe that such an approach would not be suited for our work because additional metadata would need to be stored with every compressed data block to tell us what compression algorithm was used so that it can be properly decompressed. Our work focuses more on saving space rather than achieving the fastest I/O possible.

Finally, Filgueira et al. also use heuristics to estimate the impact compressing a file would have on its I/O latency and then choose whether or not to compress the file at all. In our case, we would always want to compress data as long as it saves space. However, for some types of data compression may actually expand the size of the data block rather than reducing it. If the workload consists of this type of data it may be advantageous to have a similar scheme where we use heuristics to determine whether or not compression would be beneficial. Yet this introduces its own set of problems where we must now include additional metadata indicating whether each logical data block has been compressed or not. Whether the cost of this additional metadata outweighs the benefits of selective compression will depend on workload. Workloads where compression actually results in an increase in data size are unusual, so at the moment it appears best to not use selective compression.

Lee et al. [20] push compression even further down the stack by integrating it below the disk controller itself. They write a custom FTL which allows them to theoretically control things like wear-leveling and exact hardware sector mapping. This approach offers finer control over the disk. The drawback to this is that it is less flexible. System designers could not swap this out with different disks if they wanted to, unless the new disks also supported their same system.
Deduplication is an alternate approach to saving space on storage systems. Deduplication works by removing redundant information on the file or block level and can potentially be combined with compression schemes to further save space. It seems worth exploring whether deduplication can be combined with our work to provide more effective compression. Work done by Meyer et al. [21] finds that block-level deduplication can reduce space more effectively than file-level deduplication but has several drawbacks, including loss of locality in file data (especially problematic if stored on hard disks due to seek times), file fragmentation, and increased complexity of the deduplication algorithm. File-level deduplication has been shown to be highly effective, reducing realistic file-systems to 32% of their original size [21]. Additionally, deduplication can achieve very high throughput, over 210 MB/sec for 4 write data streams [22]. However, deduplication has very high overhead and is therefore mostly used in archival storage for keeping backups of data. Since archival storage is rarely read or written, the high overhead is not an issue. By contrast, our work focuses on providing compression for online data storage which is being actively read and written to, so the overhead of deduplication could cause unacceptable degradation of system performance.

V. FUTURE WORK

While functional, our hardware compressor is only a prototype and its functionality can be improved and expanded in a variety of ways. Encryption and error correction features are already in progress, which will allow optional encryption on writes and perform error correction on compressed data. In future work, the simple compression algorithm used in this analysis will be replaced by a more effective compression scheme such as Snappy [23], zstd [24], zlib, or LZMA. The simple algorithm was chosen because of its simplicity and ease of implementation in hardware. Since the size of the input data is fixed as one block, the algorithm can be selected or tuned to provide good compression for data of that size.

Some optimizations can also be made to the Chisel code. For example, the tradeoffs between critical path length and latency could be explored further. Currently, the module only processes data one byte at a time, but this could be parallelized at the hardware level.

We also plan to explore ways to improve I/O latency by hiding compression overheads in cases where there are several pending I/O operations. Multiple I/O operations executing concurrently allows us to overlap compression with reads or writes, reducing the total latency of the operations while maintaining the same throughput. It is not possible to hide compression overheads during single I/O operations, but due to the block-level granularity of our compression scheme, having multiple pending I/O operations is likely. In the best case, this would allow us to almost completely hide the increased latency in I/O operations caused by compression.

Another option which would increase performance in a many-write workload is to add more parallelism at a higher granularity. Many of these accelerators can be placed side-by-side to increase effective throughput. This would allow many transactions to take place at the same time, thereby eliminating the potential bottleneck. This of course comes at the cost of a few extra bits of state and more thorough design. Filgueira et al. [19] demonstrate the benefits of enabling parallel I/O and compression in this way. Such an approach would be particularly effective in systems with parallel storage units that can be written or read from simultaneously.

It is also possible to increase our hardware accelerator's performance by caching physical segments and parts of the translation table for logical block lookups. Physical segment caching would remove the necessity of constantly writing and reading a physical segment that is being accessed repeatedly, speeding up disk accesses. In addition, caching parts of the logical block translation table will reduce the number of additional I/O operations needed to retrieve the translation table from the disk when looking up logical blocks.

This compressor was designed to fit into the data bus mentioned in Section II. This bus, as mentioned above, requires headers to be sent before data.
The compressor must therefore wait until it has received and processed all the data before it can send an output header. If the output header did not have to precede the output data, the latency of the entire transaction could be halved: the header could be sent after the data, assuming the downstream component knows how to accept and cache data beats before it receives their corresponding header. In the future, we will change the bus semantics to allow for this.

Although not completely implemented, we started off with the notion that a single hardware block can process multiple in-flight transactions at once. This could allow re-ordering of data and header beats, and would potentially increase overall throughput because the file-system or disk controller (in the write or read case, respectively) could optimize the low-level transaction ordering. We ultimately decided not to implement multiple in-flight transactions due to the strict semantics of the data bus.

To test the compressor in software, we came up with another scheme (Figure 11). We could write a custom block device driver that connects to a running simulator of the hardware via a Linux socket. All data normally written to disk via the block device driver would instead go through the hardware simulator to be compressed by the compressor. The sector remapping could then be implemented in Scala, and the data written out to disk from there. This setup avoids the need for an FPGA or any other physical hardware for testing, but allows for more system-level data to be gathered. A setup like this would allow us to fully realize a system-level implementation. The hardware simulator would of course be slower than real hardware, but we would still be able to capture performance data via cycle counts from the simulator. This was not implemented, but would be another way for us to test the hardware for validity and system-level applicability.

Figure 11: The accelerator as a virtual block device behind the kernel block device driver. This setup is theoretically possible, although we did not implement it in this work.

Although this work mainly explores the idea of using compression invisibly, underneath the OS, our accelerator could potentially be used as a file-system-level hardware accelerator. Figure 12 shows a diagram of what this would look like. In this configuration, the file-system would be able to make DMA requests to the accelerator chip, thereby offloading the compression workload and saving CPU cycles. We expect this would be beneficial, but it would require the file-system to be aware of the accelerator. Since the purpose of this work was to develop a hardware component that was completely OS-invisible, we did not explore this option in depth.

Figure 12: The accelerator chip attached as a DMA-accessible device outside of the OS, but accessed by the file-system. This setup would save CPU, but requires modification of the file-system itself.

VI. CONCLUSION

In this work we provide transparent, block-level compression for storage systems through a dedicated hardware component. We analyze the tradeoffs between saved storage and increased I/O latency. Our results show that our current implementation performs compression slower than software compression, but could potentially achieve comparable performance through the optimizations mentioned in Section V.

Previous works have demonstrated that transparent block-level compression can be implemented in software to effectively increase storage space at the cost of some CPU and RAM usage. Though our analysis was conducted using a simulated hardware component, offloading the compression and decompression onto a separate physical hardware component should reduce the CPU usage during I/O. This suggests that our compression accelerator could provide similar block-level compression to previous implementations like ZBD and reduce storage costs at the price of very little, or even no, increased CPU utilization in the host machine. In addition, our hardware accelerator could hide the increased I/O latency caused by compression during concurrent I/O operations.

Our work also offers greater flexibility than file-system-level compression schemes such as ZFS and NTFS. As a specialized hardware component, the accelerator can be flexibly inserted on the I/O path of all sorts of file-systems and operating systems, requiring only a driver allowing it to interface with the host machine.
This makes our work capable of integrating with a wide array of systems that could benefit from transparent compression such as database systems, cloud storage, and more.

ACKNOWLEDGMENT

The authors thank Dr. John Kubiatowicz for his support throughout the semester and for kicking us into gear when we needed it; Tan Nguyen, Vighnesh Iyer, and Bob Zhou for their work on CREEC in collaboration with the hardware design; and Sagar Karandikar and Nathan Pemberton for their help in the exploration of the project space.

REFERENCES

[1] "How much data do we create every day? The mind-blowing stats everyone should read," https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#3c5c600960ba, accessed: 2018-12-08.
[2] "Chisel 3," https://github.com/freechipsproject/chisel3, accessed: 2018-12-04.
[3] "Chisel testers 2," https://github.com/ucb-bar/chisel-testers2, accessed: 2018-12-04.
[4] "Rocket chip generator," https://github.com/freechipsproject/rocket-chip, accessed: 2018-12-04.
[5] "RISC-V project template," https://github.com/ucb-bar/project-template, accessed: 2018-12-04.
[6] K. S. Yim, H. Bahn, and K. Koh, "A flash compression layer for smartmedia card systems," IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 192–197, Feb 2004.
[7] A. Zuck, S. Toledo, D. Sotnikov, and D. Harnik, "Compression and SSDs: Where and how?" in 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW 14). Broomfield, CO: USENIX Association, 2014. [Online]. Available: https://www.usenix.org/conference/inflow14/workshop-program/presentation/zuck
[8] "Hammer: Highly agile masks made effortlessly from RTL," https://github.com/ucb-bar/hammer, accessed: 2018-12-04.
[9] "Vivado," https://www.xilinx.com/products/design-tools/vivado.html, accessed: 2018-12-08.
[10] "zlib," https://zlib.net/, accessed: 2018-12-08.
[11] S. Kodituwakku and U. Amarasinghe, "Comparison of lossless data compression algorithms for text data," Indian Journal of Computer Science and Engineering, vol. 1, no. 4, pp. 416–425, 2010.
[12] D. A. Lelewer and D. S. Hirschberg, "Data compression," ACM Comput. Surv., vol. 19, no. 3, pp. 261–296, Sep. 1987. [Online]. Available: http://doi.acm.org/10.1145/45072.45074
[13] "Pensieve: A machine learning assisted SSD layer for extending the lifetime," in IEEE International Conference on Computer Design, 2018.
[14] "LZMA," https://www.7-zip.org/sdk.html, accessed: 2018-12-10.
[15] "The compressed loopback device," http://www.knoppix.net/wiki/Cloop, accessed: 2018-12-08.
[16] M. Burrows, C. Jerian, B. Lampson, and T. Mann, "On-line data compression in a log-structured file system," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS V. New York, NY, USA: ACM, 1992, pp. 2–9. [Online]. Available: http://doi.acm.org/10.1145/143365.143376
[17] Y. Klonatos, T. Makatos, M. Marazakis, M. D. Flouris, and A. Bilas, "Transparent online storage compression at the block-level," ACM Transactions on Storage, vol. 8, no. 5, 2012.
[18] T. Makatos, Y. Klonatos, M. Marazakis, M. D. Flouris, and A. Bilas, "Using transparent compression to improve SSD-based I/O caches," in Proceedings of the 5th European Conference on Computer Systems, ser. EuroSys '10. New York, NY, USA: ACM, 2010, pp. 1–14. [Online]. Available: http://doi.acm.org/10.1145/1755913.1755915
[19] R. Filgueira, M. Atkinson, Y. Tanimura, and I. Kojima, "Applying selectively parallel I/O compression to parallel storage systems," 08 2014.
[20] S. Lee, J. Park, K. Fleming, Arvind, and J. Kim, "Improving performance and lifetime of solid-state drives using hardware-accelerated compression," IEEE Transactions on Consumer Electronics, vol. 57, no. 4, pp. 1732–1739, 2011.
[21] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," Trans. Storage, vol. 7, no. 4, pp. 14:1–14:20, Feb. 2012. [Online]. Available: http://doi.acm.org/10.1145/2078861.2078864
[22] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, ser. FAST'08. Berkeley, CA, USA: USENIX Association, 2008, pp. 18:1–18:14. [Online]. Available: http://dl.acm.org/citation.cfm?id=1364813.1364831
[23] "Snappy," https://google.github.io/snappy/, accessed: 2018-12-04.
[24] "Zstandard," https://facebook.github.io/zstd/, accessed: 2018-12-04.