Assessing the Overhead of Offloading Compression Tasks

Assessing the Overhead of Offloading Compression Tasks Laura Promberger Rainer Schwemmer Holger Fröning [email protected] [email protected] [email protected] CERN CERN heidelberg.de Geneva, Switzerland Geneva, Switzerland Heidelberg University Heidelberg University Heidelberg, Germany Heidelberg, Germany ABSTRACT 1 INTRODUCTION Exploring compression is increasingly promising as trade- The current computational landscape is dominated by in- off between computations and data movement. There are creasingly costly data movements, both in terms of energy two main reasons: First, the gap between processing speed and time, while computations continuously decrease in these and I/O continues to grow, and technology trends indicate costs [10, 13]. Furthermore, the amount of data generated a continuation of this. Second, performance is determined and processed, preferably as fast and interactive as possible, by energy efficiency, and the overall power consumption is growing dramatically. As a consequence, especially for big is dominated by the consumption of data movements. For data applications, it becomes more and more promising to these reasons there is already a plethora of related works on trade computations for communication in order to diminish compression from various domains. Most recently, a couple the implications of data movements. Overall run time bene- of accelerators have been introduced to offload compression fits should be possible, even though this might substantially tasks from the main processor, for instance by AHA, Intel and increase the amount of computations. Of particular interest Microsoft. Yet, one lacks the understanding of the overhead in this context is the recent introduction of dedicated acceler- of compression when offloading tasks. In particular, such of- ators for compression tasks, including accelerators provided floading is most beneficial for overlap with other tasks, ifthe by AHA [12], Intel QuickAssist [14] or Microsoft’s Project associated overhead on the main processor is negligible. This Corsica [22]. work evaluates the integration costs compared to a solely Most published works about compression utilizing acceler- software-based solution considering multiple compression al- ators are focusing on the compression performance, includ- gorithms. Among others, High Energy Physics data are used ing work on GPU [4], addressing SIMD in general [18, 25, 27], as a prime example of big data sources. The results imply and FPGAs [7, 9, 17, 24]. The common part of all these works that on average the zlib implementation on the accelerator is the focus on improving compression performance, albeit achieves a comparable compression ratio to zlib level 2 for various reasons. However, it is commonly neglected to on a CPU, while having up to 17 times the throughput and assess the additional integration costs for the entire system. utilizing over 80 % less CPU resources. These results suggest Only if the overhead of orchestrating compression tasks on that, given the right orchestration of compression and data accelerators and the associated data movements is as small movement tasks, the overhead of offloading compression is as possible, one can truly speak of an off-loaded task. Other- limited but present. Considering that compression is only a wise, the host is not entirely free to perform other, hopefully single task of a larger data processing pipeline, this overhead overlapping, tasks. cannot be neglected. Thus, the present work will assess these integration costs by evaluating the overhead associated with orchestrating offloaded compression tasks, as well as assessing obtained CCS CONCEPTS compression throughput and ratio by comparing it to well- • Computer systems organization ! Reconfigurable known general-purpose compression algorithms solely run computing; • Information systems ! Data compres- on the host. The data sets used are a representative set of sion. standard compression corpora, and additionally, as a prime example of big data, High Energy Physics (HEP) data sets from the European Organization of Nuclear Research (CERN) KEYWORDS are used. For these data sets the following will be assessed: Compression, Offloading, Accelerator Card, Overhead, Mea- compression time, ratio, overhead and power consumption surement of a compression accelerator. Promberger, et al. In particular, this work makes following contributions: 3 METHODOLOGY AND BENCHMARK (1) Proposing and implementing a well-defined, methodi- LAYOUT cal, multi-threaded benchmark; 3.1 Server and Input Data (2) Analyzing the performance of the accelerator in terms The server selected was an Intel Xeon E5-2630 v4. The server’s of machine utilization, power consumption, through- technical specifications are listed in Table 1. put and compression ratio; For the input data four standard compression corpora and (3) Comparing the achieved results to multiple software- three HEP data sets were selected. To be representative for based, i.e. non-offloaded, compression algorithms. big data, compression corpora with at least 200 MB were Note that this work analyzes only compression and not de- preferred: compression. Big data applications, like utilized at CERN, • enwik9[20] consists of about 1 GB of English Wikipedia often have I/O limitations and expectations, where a lowest from 2006 throughput threshold must be passed e.g. to fulfill real-time • proteins.200MB[8] consists protein data of the Pizza requirements. Compression - compared to decompression - & Chili corpus is the throughput limiting task and as a result studied in this • silesia corpus[2] consists of multiple different files, work. Notably, for all selected algorithms the decompression summing up to 203 MB, which were tarred into a single speed is at least the compression speed, if not significantly file higher. • calgary corpus[2] is one of the most famous compres- As a result, this work will provide details for a better sion corpora. Tarred into a single file, it only sums understanding of the offloading costs for compression tasks. up to 3.1 MB. Therefore, it was copied multiple times The methodology allows in particular system designers to to reach the 150 MB to conform with the benchmark identify deployment opportunities given existing compute requirements systems, data sources, and integration constraints. The HEP data came from three different CERN experi- 2 RELATED WORK ments (ATLAS [4.1 GB], ALICE [1 GB] and LHCb [4.9 GB]). At CERN, particles are accelerated in a vacuum very close To diminish the implications on (network) data movements to the speed of light. They are then collided and analyzed by and storage, massive utilization of compression can be found, the experiments. The data created consists of geographical among others, in communities working with databases, graph locations where particles were registered. Due to the nature computations or computer graphics. For theses communities of quantum mechanics this process is close to a random a variety of works exists which addresses compression opti- number generator. A high compression ratio can therefore mizations by utilizing hardware-related features like SIMD not be expected. All data sets contain uncompressed data1, and GPGPUs[4, 18, 25, 27]. even though in their normal setup some type of compression Moreover, a multitude of research has been conducted might already be employed. to efficiently use Field Programmable Gate Array (FPGA) technology, as they excel at integer computations. Works 3.2 Compression Algorithms here range from developing FPGA-specific compression algorithms [9], over integrating the FPGA compression hard- The selection of general-purpose compression algorithms ware into existing systems [17, 24], to implementing well- was based on the requirement to be lossless and their us- known compression algorithms, like gzip, and comparing age in related works and general popularity, including re- the achieved performance to other systems (CPU, GPGPU). cent developments. Many of those algorithms allow tuning 2 3 For example, some work explored CPUs, FPGAs, and CPU- the performance for compression ratio or throughput by FPGA co-design for LZ77 accelerations [7], while other ana- setting the compression level. Generally speaking, a higher lyzed the hardware acceleration capabilities of the IBM Pow- compression level refers to a higher compression ratio, but erEN processor for zlib [16]. Regarding FPGAs, Abdelfattah also a longer computation time. For this study, compression et al. [1] and Qiao et al. [21] implemented their own deflate algorithms run solely on CPU are called software-based (sw- version on an FPGA. based). For them, a pre-study was done to select compression However, the large majority of those works focuses on the compression performance in terms of throughput and 1Parts of the detectors perform already some compression, e.g. similar to compression ratio, but not the resulting integration costs for the representation of sparse matrices in coordinate format. When talking about uncompressed data, it refers to any form of compression applied after the entire system. Overall, little to no published work can be retrieving the data from the detectors. found covering the topic of system integration costs in the 2compression ratio = (uncompressed data) / (compressed data) context of compression accelerators. 3throughput = (uncompressed data) / (run time) Assessing the Overhead of Offloading Compression Tasks Table 1: Technical specifications of the server used for this study

Load more