Unification of Temporary Storage in the NodeKernel Architecture
Patrick Stuedi† Animesh Trivedi‡ Jonas Pfefferle† Ana Klimovic§ Adrian Schuepbach† Bernard Metzler†
†IBM Research ‡Vrije Universiteit §Stanford University

This paper is included in the Proceedings of the 2019 USENIX Annual Technical Conference, July 10–12, 2019, Renton, WA, USA. ISBN 978-1-939133-03-8. Open access to the Proceedings is sponsored by USENIX.
https://www.usenix.org/conference/atc19/presentation/stuedi

Abstract

Efficiently exchanging temporary data between tasks is critical to the end-to-end performance of many data processing frameworks and applications. Unfortunately, the diverse nature of temporary data creates storage demands that often fall between the sweet spots of traditional storage platforms, such as file systems or key-value stores.

We present NodeKernel, a novel distributed storage architecture that offers a convenient new point in the design space by fusing file system and key-value semantics in a common storage kernel while leveraging modern networking and storage hardware to achieve high performance and cost-efficiency. NodeKernel provides hierarchical naming, high scalability, and close to bare-metal performance for a wide range of data sizes and access patterns that are characteristic of temporary data. We show that storing temporary data in Crail, our concrete implementation of the NodeKernel architecture which uses RDMA networking with tiered DRAM/NVMe-Flash storage, improves NoSQL workload performance by up to 4.8× and Spark application performance by up to 3.4×. Furthermore, by storing data across NVMe Flash and DRAM storage tiers, Crail reduces storage cost by up to 8× compared to DRAM-only storage systems.

1 Introduction

Managing temporary data efficiently is key to the performance of cluster computing workloads. For example, application frameworks often cache input data or share intermediate data, both within a job (e.g., shuffle data in a map-reduce job) and between jobs (e.g., pre-processed images in a machine learning training workflow). Temporary data storage is also increasingly important in serverless computing for exchanging data between different stages of tasks [17].

Storing temporary data efficiently is challenging as its characteristics typically lie between the design points of existing storage platforms, such as distributed file systems and key-value stores. For instance, shuffle data in a map-reduce job may consist of a large number of files which are organized hierarchically, vary widely in size, are written randomly, and read sequentially. While file systems (e.g., HDFS) offer a convenient hierarchical namespace and efficiently store large datasets for sequential access, distributed key-value stores are optimized for scalable access to a large number of small objects. Similarly, DRAM-based key-value stores (e.g., Redis) offer the required low latency, but persistent storage platforms (e.g., S3) are more suitable for high capacity at low cost. Overall, we find that existing storage platforms are not able to satisfy all the diverse requirements for temporary data storage and sharing in distributed data processing workloads.

In this paper we present NodeKernel, a new distributed storage architecture designed from the ground up to support fast and efficient storage of temporary data. As its most distinguishing property, the NodeKernel architecture fuses file system and key-value semantics while leveraging modern networking and storage hardware to achieve high performance. NodeKernel is based on two key observations. First, many features offered by long-term storage platforms, such as durability and fault-tolerance, are not critical when storing temporary data. We observe that under such circumstances, the software architectures of file systems and key-value stores begin to look surprisingly similar. The fundamental difference is that file systems require an extra level of indirection to map offsets in file streams to distributed storage resources, while key-value stores map entire key-value pairs to storage resources. The second observation is that low-latency networking hardware and multi-CPU many-core servers dramatically reduce the cost of this indirection in a distributed setting by enabling scalable RPC communication at latencies of a few microseconds.

Based on these insights we develop the NodeKernel architecture by implementing file system and key-value semantics as thin layers on top of a common storage kernel. The storage kernel operates on opaque data objects called "nodes" in a unified namespace. Applications can store arbitrary-size data in nodes, arrange nodes in a hierarchical namespace, and obtain file system or key-value semantics through specialized node types if needed. For instance, key-value and table nodes permit concurrent creation of nodes with the same name, offering last-put-wins semantics. On the other hand, file and directory nodes permit efficient enumeration of data sets at a given level in the storage hierarchy. By splitting functionality between a common storage kernel and custom node types, NodeKernel enables applications to use a single platform to store data that may require different semantics, while generally offering good performance for a wide range of data sizes and access patterns.

NodeKernel is designed explicitly with modern hardware in mind. Following a strict separation of concerns, the storage kernel is composed of a lightweight and scalable metadata plane tailored to low-latency networking hardware and an efficient data plane that provides access to multiple tiers of network-attached storage resources. The metadata plane is trimmed down to offer only the most critical functionality, with low overhead. The data plane runs a lightweight software stack and leverages modern networking and storage hardware to achieve fast access to arbitrary-size data sets while also optimizing cost efficiency. For instance, data attached to "nodes" may either be pinned to a particular storage technology tier or may spill from one storage tier to another depending on performance and cost requirements.

Crail is our concrete implementation of the NodeKernel architecture using RDMA networking and two storage tiers based on DRAM and NVMe SSDs, respectively. We evaluated Crail on a 100Gb/s RoCE cluster equipped with Intel Optane NVMe SSDs using raw storage microbenchmarks as well as the NoSQL YCSB benchmark and different Spark workloads. Our results show that Crail matches the performance of current state-of-the-art file systems and key-value stores when operated in their sweet spot, and outperforms existing systems by up to 3.4× for data accesses outside the sweet spot of file systems and key-value stores. Moreover, Crail creates new opportunities to reduce cost and gain flexibility with almost no performance penalties by using NVMe Flash in addition to DRAM. For instance, using Crail to store shuffle data in Spark allows us to adjust the ratio between DRAM and Flash with only a minimal increase in job runtimes.

In summary, this paper makes the following contributions:

• We propose NodeKernel, a new storage architecture fusing file system and key-value semantics to best meet the needs of temporary data storage in data processing workloads.

• We present Crail, a concrete implementation of the NodeKernel architecture using RDMA, DRAM, and NVMe Flash.

• We show that storing temporary data in Crail reduces the runtime and cost of data processing workloads. For instance, Crail improves performance by up to 4.8× for NoSQL workloads. When integrated in Spark's shuffle and broadcast services, Crail improves application performance by up to 3.4× and reduces cost by up to 8×.

Crail is an open source Apache project [2,3] with the code available for download from the project website as well as directly from GitHub at https://github.com/apache/incubator-crail. Further, all benchmarks used in this paper are open source.

[Figure 1: CDF of the size of intermediate data written or read per compute task in Spark for different workloads (TPC-DS, ML-Cocoa, PR-Twitter); x-axis: data size (1 byte to 1 GB), y-axis: CDF.]

2 Background and Motivation

Temporary data represent a large and important class of in-processing data in analytics frameworks. For example, Zhang et al. report that over 50% of Spark jobs executed at Facebook contain at least one shuffle operation, generating significant amounts of temporary data [37].

We define temporary data as the multitude of all application data being created, handled, or consumed during processing, excluding the original input and final output data. Specifically, we identify three distinct classes of temporary data: intra-job, inter-job, and cached input/output datasets. Intra-job temporary data is generated within a framework when executing a single job such as page-rank or a SQL query. Common examples are datasets generated during shuffle or broadcast operations in frameworks like Spark, Hadoop, or Flink. Such data is typically generated and consumed by the same job, which puts a bound on the lifetime of the data. Inter-job temporary data are intermediate results in multi-job pipelines. For example, there are many pre-processing and post-processing jobs in a typical machine learning pipeline [35], where the output of one job becomes the input of another job. Lastly, examples of cached input/output data are mostly read-only datasets that are pulled into a cache for fast repetitive processing. For instance, users may run many SQL queries on the same table (or a view) over a short period of time. In this case, a copy of the input table might be cached on a fast storage medium.
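The split that Section 1 describes between a generic storage kernel and specialized node types can be illustrated with a minimal sketch. All class and method names below are hypothetical and are not Crail's actual API; the sketch only shows the semantics: one hierarchical namespace of opaque nodes, where key-value nodes allow concurrent creation with last-put-wins, while file nodes reject duplicate names and directories support enumeration at one level.

```python
# Hypothetical sketch of the NodeKernel idea (not Crail's real API):
# a single hierarchical namespace of opaque "nodes", with per-type
# semantics layered on top of the common kernel.

class Node:
    def __init__(self, data=b""):
        self.data = data
        self.children = {}  # child name -> Node (hierarchical namespace)

class NodeKernelSketch:
    def __init__(self):
        self.root = Node()

    def _walk(self, path, create_dirs=False):
        """Resolve the parent of `path`, optionally creating directories."""
        node = self.root
        parts = [p for p in path.split("/") if p]
        for part in parts[:-1]:
            if part not in node.children:
                if not create_dirs:
                    raise KeyError(path)
                node.children[part] = Node()
            node = node.children[part]
        return node, parts[-1]

    def put_keyvalue(self, path, value):
        """Key-value node: concurrent creates of the same name are
        allowed; the last put wins."""
        parent, name = self._walk(path, create_dirs=True)
        parent.children[name] = Node(value)

    def create_file(self, path, value):
        """File node: creating an already-existing name is an error."""
        parent, name = self._walk(path, create_dirs=True)
        if name in parent.children:
            raise FileExistsError(path)
        parent.children[name] = Node(value)

    def get(self, path):
        parent, name = self._walk(path)
        return parent.children[name].data

    def enumerate(self, path):
        """Directory-style enumeration at one level of the hierarchy."""
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]
        return sorted(node.children)
```

For example, two `put_keyvalue` calls on the same name both succeed and the second value wins, whereas a second `create_file` on an existing name raises an error, mirroring the paper's distinction between table/key-value nodes and file nodes.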
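The "extra level of indirection" that Section 1 identifies as the fundamental difference between file systems and key-value stores can be made concrete with a toy lookup sketch. The data structures and server names here are invented for illustration and do not come from the paper.

```python
# Illustrative sketch (not Crail code) of the indirection difference:
# a file system translates (file, offset) -> block index -> storage
# location, while a key-value store maps the whole key directly to a
# storage location.

BLOCK_SIZE = 4  # deliberately tiny so the example stays readable

# File system metadata: file name -> ordered list of block locations.
fs_blocks = {"/tmp/shuffle-0": ["serverA:blk7", "serverB:blk2"]}

def fs_lookup(path, offset):
    # Extra indirection step: offset -> block index -> location.
    return fs_blocks[path][offset // BLOCK_SIZE]

# Key-value metadata: key -> storage location, no offset translation.
kv_index = {"shuffle-0": "serverA:blk7"}

def kv_lookup(key):
    return kv_index[key]
```

In a distributed setting each `fs_lookup` translation step may cost an RPC to a metadata server, which is exactly the cost the paper argues is now small thanks to microsecond-latency networking.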
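The tiering behavior described in Section 1, where node data is either pinned to a storage tier or spills from DRAM to Flash, can be sketched as a simple capacity-driven policy. The class, its methods, and the pinning flag are assumptions for illustration, not Crail's actual tiering implementation.

```python
# Hypothetical sketch of DRAM/Flash tiering (names and policy are
# illustrative, not Crail's actual code): writes land in the fast DRAM
# tier while capacity remains, and spill to the cheaper Flash tier
# otherwise; pinned data must fit in DRAM.

class TieredStore:
    def __init__(self, dram_capacity):
        self.dram_capacity = dram_capacity
        self.dram_used = 0
        self.dram = {}   # fast, expensive tier
        self.flash = {}  # assumed larger, cheaper, higher-latency tier

    def write(self, name, data, pin_to_dram=False):
        fits = self.dram_used + len(data) <= self.dram_capacity
        if pin_to_dram and not fits:
            raise MemoryError("pinned data does not fit in the DRAM tier")
        if fits:
            self.dram[name] = data
            self.dram_used += len(data)
        else:
            self.flash[name] = data  # spill to the cheaper tier

    def read(self, name):
        if name in self.dram:
            return self.dram[name]
        return self.flash[name]
```

Adjusting `dram_capacity` in this sketch plays the same role as adjusting the DRAM/Flash ratio for Spark shuffle data in the evaluation: a smaller DRAM tier lowers cost while more data spills to Flash.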
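A distribution like the one in Figure 1 can be computed from per-task intermediate-data sizes as an empirical CDF. The task sizes below are made-up placeholder values, not measurements from the paper.

```python
# Sketch of computing an empirical CDF of per-task intermediate data
# sizes, as plotted in Figure 1. The sizes here are invented examples.
import bisect

task_sizes = [512, 2_000, 4_096, 1_000_000, 3_000_000, 2_000_000_000]

def cdf_at(sizes, x):
    """Fraction of tasks whose intermediate data size is <= x bytes."""
    ordered = sorted(sizes)
    return bisect.bisect_right(ordered, x) / len(ordered)
```

Evaluating `cdf_at` over a range of sizes (1 B to 1 GB, as on Figure 1's x-axis) yields one curve per workload.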