QZFS: QAT Accelerated Compression in File System for Application Agnostic and Cost Efficient Data Storage
Xiaokang Hu and Fuzong Wang,* Shanghai Jiao Tong University and Intel Asia-Pacific R&D Ltd.; Weigang Li, Intel Asia-Pacific R&D Ltd.; Jian Li and Haibing Guan, Shanghai Jiao Tong University
*Co-equal first authors

https://www.usenix.org/conference/atc19/presentation/wang-fuzong

This paper is included in the Proceedings of the 2019 USENIX Annual Technical Conference, July 10–12, 2019, Renton, WA, USA. ISBN 978-1-939133-03-8. Open access to the Proceedings is sponsored by USENIX.

Abstract

Data compression can not only provide space efficiency with a lower Total Cost of Ownership (TCO) but also enhance I/O performance because of the reduced read/write operations. However, lossless compression algorithms with a high compression ratio (e.g., gzip) inevitably incur high CPU resource consumption. Prior studies mainly leveraged general-purpose hardware accelerators such as GPUs and FPGAs to offload costly (de)compression operations for application workloads. This paper investigates ASIC-accelerated compression in the file system to transparently benefit all applications running on it and to provide high-performance, cost-efficient data storage. Based on the Intel® QAT ASIC, we propose QZFS, which integrates QAT into the ZFS file system to achieve efficient gzip (de)compression offloading at the file system layer. A compression service engine is introduced in QZFS to serve as an algorithm selector and to implement compressibility-dependent offloading and selective offloading by source data size. More importantly, a QAT offloading module is designed to leverage the vectored I/O model to reconstruct data blocks, making them usable by the QAT hardware without incurring extra memory copies. A comprehensive evaluation validates that QZFS can achieve up to 5x write throughput improvement for the FIO micro-benchmark and more than 6x cost-efficiency enhancement for genomic data post-processing over the software-implemented alternative.

1 Introduction

Data compression has reached proliferation in systems involving storage, high-performance computing (HPC) or big data analysis, such as EMC CLARiiON [14], IBM zEDC [7] and RedHat VDO [18]. A significant benefit of data compression is the reduced storage space requirement for data volumes, along with lower power consumption for cooling per unit of logical storage [12, 51]. Furthermore, if the input data to Hadoop [3], Spark [4] or stream processing jobs [40] is compressed, data processing performance can be effectively enhanced, as compression not only saves bandwidth but also decreases the number of read/write operations from/to storage systems.

It is widely recognized that the benefits of data compression come at the expense of computational cost [1, 9], especially for lossless compression algorithms with a high compression ratio [41]. In a number of fields (e.g., scientific big data or satellite data), lossless compression is the preferred choice due to the requirement for data precision and information availability [12, 43]. Prior studies mainly leveraged general-purpose hardware accelerators such as GPUs and FPGAs to alleviate the computational cost incurred by (de)compression operations [15, 38, 41, 45, 52]. For example, Ozsoy et al. [38] presented a pipelined parallel LZSS compression algorithm for GPGPUs, and Fowers et al. [15] detailed a scalable, fully pipelined FPGA accelerator that performs LZ77 compression. Recently, emerging ASIC (Application-Specific Integrated Circuit) compression accelerators, such as Intel® QuickAssist Technology (QAT) [24], Cavium NITROX [34] and AHA378 [2], are attracting attention because of their advantages in performance and energy efficiency [32].

Data compression can be integrated into different system layers, including the application layer (most common), the file system layer (e.g., ZFS [48] and BTRFS [42]) and the block layer (e.g., ZBD [27] and RedHat VDO [18]). Professional storage products such as IBM Storwize V7000 [46] and HPE 3PAR StoreServ [19] may offer competitive compression features as well. If compression is performed at the file system or a lower layer, all applications running in the system, especially big data processing workloads, can transparently benefit from the enhanced storage efficiency and reduced storage I/O cost per data unit. This also implies that only lossless compression is acceptable, to avoid affecting applications. To the best of our knowledge, there is no practical solution at present that provides hardware-accelerated data compression at the layer of local or distributed file systems.

In this paper, we propose QZFS (QAT accelerated ZFS), which integrates the Intel® QAT accelerator into the ZFS file system to achieve efficient data compression offloading at the file system layer, so as to provide application-agnostic and cost-efficient data storage. QAT [24] is a modern ASIC-based acceleration solution for both cryptography and compression. ZFS [6, 48] is an advanced file system that combines the roles of file system and volume manager and provides a number of features, such as data integrity, RAID-Z, copy-on-write, encryption and compression.
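ZFS exposes its built-in compression as a per-dataset property, which is what makes file-system-layer compression application-agnostic: once the property is set, every write to the dataset is compressed with no change to the applications above. A brief illustration with the standard ZFS administration commands (the pool name `tank` and dataset name `data` are placeholders):

```shell
# Create a dataset on an existing pool and enable gzip for all writes to it
zfs create tank/data
zfs set compression=gzip tank/data   # 'gzip' means gzip-6; gzip-1..gzip-9 are also accepted

# Applications read and write files in tank/data normally; compression is transparent.
# After writing some data, inspect the achieved ratio:
zfs get compressratio tank/data
```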
In consideration of the goal of cost-efficiency, QZFS chooses to offload the costly gzip [1] algorithm to achieve high space efficiency (i.e., a high compression ratio) and low CPU resource consumption at the same time. QZFS disassembles the (de)compression procedures of ZFS to add two new modules for integrating QAT acceleration. First, a compression service engine module is introduced to serve as a selector of diverse compression algorithms, including QAT-accelerated gzip and a number of software-implemented compression algorithms. It implements compressibility-dependent offloading (i.e., a compression/non-compression switch) and selective offloading by source data size (i.e., a hardware/software switch) to optimize system performance. Second, a QAT offloading module is designed to efficiently offload compression and decompression operations to the QAT accelerator. It leverages the vectored I/O model, along with address translation and memory mapping, to reconstruct data blocks prepared by ZFS, making them accessible to the QAT hardware through DMA operations. This data reconstruction avoids expensive memory copies to achieve efficient offloading. Besides, considering QAT characteristics, this module further provides buffer overflow avoidance, load balancing and fail recovery.

In the evaluation, we deploy QZFS as the back-end file system of Lustre [44] in clusters with varying numbers of nodes, and measure the performance with the FIO micro-benchmark and practical genomic data post-processing. For the FIO micro-benchmark, QZFS with QAT-accelerated gzip can achieve up to 5x average write throughput with a similar compression ratio (3.6) and about 80% reduction of CPU resource consumption (from more than 95% CPU utilization to less than 20%), compared to the software-implemented alternative. For practical genomic data post-processing workloads, benefiting from QAT acceleration, QZFS provides a 65.83% reduction of average execution time and a 75.58% reduction of CPU resource consumption over the software gzip implementation. Moreover, as compression acceleration is performed at the file system layer, QZFS also significantly outperforms the traditional approach of accelerating gzip for individual applications when conducting genomic data post-processing.

2.1 Data Compression on Storage Devices

As high-performance storage devices, NVMe SSDs can remarkably improve read/write speed with low energy consumption [25, 49]. Nonetheless, their limited capacity and high price significantly discourage widespread use, and storage devices have accounted for a large proportion of Total Cost of Ownership (TCO) [50]. In the Mistral Climate Simulation System, storage devices occupy more than 20% of the TCO for the entire system [28]. Many studies have investigated data compression on storage devices to improve I/O performance and reduce system TCO simultaneously [31, 36].

[Figure 1: Write throughput on hybrid storage of one 1.6TB NVMe SSD and backup HDDs. Gzip and LZ4 achieve compression ratios of about 3.8 and 1.9, respectively.]

To show the benefits of data compression, we evaluated the performance of a compression-enabled file system (i.e., ZFS) with the FIO tool [5] on a hybrid storage system including one 1.6TB NVMe SSD (Intel® P3700 series) and backup HDDs. Two representative lossless compression algorithms, gzip [16] and LZ4 [35], were used in ZFS to compare against the compression OFF configuration. As shown in Figure 1, the write throughput with data compression (for both gzip and LZ4) outperforms the OFF case because compression can effectively reduce the total data size written into the storage [33, 53]. If the dataset size is larger than the capacity of the 1.6TB NVMe SSD, the excess data are written into the backup HDDs. Due to the poor read/write performance of HDDs, the OFF configuration incurs rapid throughput degradation once the dataset size exceeds 1.6TB. The gzip algorithm achieves a compression ratio of about 3.8 in this evaluation, and its write throughput degrades after the dataset size exceeds 6.1TB. Since LZ4 is a fast compression algorithm (i.e., CPU time for compression is largely reduced), it brings a higher write throughput than gzip, although its compression ratio is lower, at about 1.9.