Henosis: Workload-Driven Small Array Consolidation and Placement for HDF5 Applications on Heterogeneous Data Stores
Donghe Kang, Vedang Patel, Ashwati Nair, Spyros Blanas, Yang Wang, Srinivasan Parthasarathy
{kang.1002,patel.3140,nair.131,blanas.2,wang.7564,parthasarathy.2}@osu.edu
The Ohio State University

ABSTRACT

Scientific data analysis pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. Such datasets commonly arise in domains as diverse as detecting supernovae and post-processing computational fluid dynamics simulations. Furthermore, applications often use inference frameworks such as TensorFlow and PyTorch whose naive I/O methods exacerbate I/O bottlenecks. One solution is to use scientific file formats, such as HDF5 and FITS, to organize small arrays in one big file. However, storing everything in one file does not fully leverage the heterogeneous data storage capabilities of modern clusters.

This paper presents Henosis, a system that intercepts data accesses inside the HDF5 library and transparently redirects I/O to the in-memory Redis object store or the disk-based TileDB array store. During this process, Henosis consolidates small arrays into bigger chunks and intelligently places them in data stores. A critical research aspect of Henosis is that it formulates object consolidation and data placement as a single optimization problem. Henosis carefully constructs a graph to capture the I/O activity of a workload and produces an initial solution to the optimization problem using graph partitioning. Henosis then refines the solution using a hill-climbing algorithm which migrates arrays between data stores to minimize I/O cost. The evaluation on two real scientific data analysis pipelines shows that consolidation with Henosis makes I/O 300× faster than directly reading small arrays from TileDB and 3.5× faster than workload-oblivious consolidation methods. Moreover, jointly optimizing consolidation and placement in Henosis makes I/O 1.7× faster than strategies that perform consolidation and placement independently.

1 INTRODUCTION

The data volume processed by scientific pipelines is increasing rapidly. Scientific pipelines in various domains, such as plasma simulation [9], climate modeling [16], transient detection [19], and computational fluid dynamics, analyze massive amounts of array data that range up to petabytes. For example, the Large Hadron Collider produces approximately 15 petabytes of data annually, and the Sloan Digital Sky Survey (SDSS) [7] archives terabytes of data for hundreds of millions of astronomical objects.

Although the aggregate data volume of many datasets is enormous, individual observations are often stored as independent small files for downstream processing. The supernovae detection pipeline in the ASAS-SN sky survey stores detected transient stars in separate 1.7 KB files [19]; a vortex prediction pipeline on the simulation of a Mach 1.3 jet produces vortices with average size 8 KB [36]; sequencing the human genome produces nearly 30 million files averaging 190 KB each [8]; and 20 million images served by SDSS are less than 1 MB on average [7]. These files are small compared with the typical block sizes of modern file systems. For example, the default block size of the Hadoop Distributed File System (HadoopFS) is 64 MB and the default stripe size of the Lustre parallel file system is 1 MB. Achieving respectable I/O performance when accessing a large number of small files is particularly challenging for a distributed file system. In addition, it is cumbersome and error-prone for users to organize large collections of small files manually.

One solution that has been embraced by scientists is storing such datasets in array-centric file formats like HDF5, netCDF and FITS. These file formats arrange collections of objects in an internal hierarchy, offer richer metadata support than file systems, and store entire collections of objects as a single file. However, these file format libraries adopt a monolithic design that tightly couples an array-centric API with a particular physical data layout. This monolithic design is not well-suited to the heterogeneous I/O capabilities of modern clusters. Storing everything in one file in the parallel file system does not fully utilize nodes with large memory or nodes with locally attached flash-based storage. A large memory node would be an ideal deployment setting for an in-memory key/value store such as Redis, while fast locally-attached storage can be utilized with a locality-conscious file system in user space, such as HadoopFS. Unfortunately, established array-centric file format libraries do not support multiple storage backends with heterogeneous I/O capabilities.

This paper describes Henosis, an I/O library that allows HDF5-based programs to consolidate, place and read small arrays on heterogeneous data stores. The prototype implementation of Henosis that we describe in this paper supports two storage backends: (1) Redis [25], an in-memory object store with a key/value interface, and (2) TileDB [28], an array management system for distributed file systems such as HadoopFS. By leveraging the recent virtual object (VOL) interface of the HDF5 library, existing HDF5 applications can benefit from the I/O optimizations of Henosis without recompilation, as Henosis makes no modifications to the public HDF5 API and can be dynamically loaded at runtime.
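To illustrate what "no modifications to the public HDF5 API" means in practice, the sketch below is an ordinary HDF5 read path, written against the standard C API only. Assuming an HDF5 release with VOL support (1.12 or later), a VOL connector can be selected at runtime through the standard HDF5_PLUGIN_PATH and HDF5_VOL_CONNECTOR environment variables; the connector name "henosis", the file name, and the dataset path below are hypothetical placeholders, not taken from the paper.

    /*
     * Minimal sketch: an HDF5 application that reads one small array through
     * the unmodified public HDF5 API.  Assuming an HDF5 build with VOL
     * support, an alternative VOL connector could be selected at runtime:
     *
     *   export HDF5_PLUGIN_PATH=/path/to/vol/plugins   (plugin search path)
     *   export HDF5_VOL_CONNECTOR="henosis"            (hypothetical name)
     *
     * so the same binary is redirected to a different storage backend
     * without recompilation.  File and dataset names are illustrative.
     */
    #include <hdf5.h>
    #include <stdio.h>

    int main(void)
    {
        float buf[64];  /* assumes the dataset holds at most 64 floats */

        hid_t file = H5Fopen("observations.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        if (file < 0) { fprintf(stderr, "cannot open file\n"); return 1; }

        hid_t dset = H5Dopen2(file, "/transients/obj_000001", H5P_DEFAULT);
        if (dset < 0) { H5Fclose(file); return 1; }

        /* The read is routed through whichever VOL connector is active. */
        herr_t status = H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL,
                                H5P_DEFAULT, buf);
        printf("read %s\n", status < 0 ? "failed" : "succeeded");

        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }

Because the application only ever calls the public HDF5 API, switching the storage backend is an environment change rather than a code change, which mirrors the transparency property claimed above.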
The research question that naturally arises when supporting multiple storage backends is how one should physically lay out array data across data stores with heterogeneous I/O capabilities. Consider, for example, the Redis and TileDB data stores mentioned earlier. If one stores small objects that are 1.7 KB each, the read throughput of Redis outperforms TileDB by nearly 100×, as shown in Figure 1. This is an instance of the data placement problem on heterogeneous storage systems, a topic which has been well-studied in prior literature [11–13, 24, 31, 38]. Yet, if one stores objects that are 1.7 MB each, the read throughputs of TileDB and Redis become statistically indistinguishable. Although consolidation is less explored in prior research, the results clearly show that it has the potential to equalize I/O performance as long as the requested small objects can be consolidated into sufficiently big chunks. However, consolidation and placement are not orthogonal optimizations: a consolidation-oblivious placement algorithm misses opportunities to place data in a manner that benefits from sequential I/O. Conversely, a placement-oblivious consolidation algorithm cannot differentiate between fast and slow I/O devices that often co-exist in modern HPC systems.

Figure 1: Read throughput of Redis, an in-memory object store, and TileDB, a disk-based array store, when accessing small (1.7 KB) and large (1.7 MB) objects. Redis is nearly 100× faster than TileDB with small objects, but read throughput is statistically indistinguishable with large objects.

This paper describes how Henosis formulates consolidation and placement as a single optimization problem.

(3) We design and implement an I/O library prototype, Henosis, to transparently consolidate, place and read small arrays for HDF5 applications. The evaluation, based on two real workloads, shows that Henosis accelerates I/O by 300× compared with native TileDB and 1.7× over workload-oblivious consolidation and placement strategies.

The remainder of the paper is structured as follows. Section 2 presents necessary background on the HDF5 library. Section 3 describes two application drivers that process small arrays. Section 4 describes the Henosis architecture. Section 5 formulates consolidation and placement as a single optimization problem, and proposes a heuristic method to optimize the array storage plan based on graph partitioning and the hill climbing technique. Section 6 follows with more details about the implementation. Section 7 describes the experimental setup and presents the performance evaluation, Section 8 presents related work, and Section 9 concludes.

2 BACKGROUND

This section first introduces the array model, which is prevalent in scientific computing. It then describes two new features of the HDF5 array library, namely the virtual dataset (VDS) and the virtual object layer (VOL), which allow Henosis to transparently store and consolidate arrays on different storage backends. Although this paper focuses on the HDF5 library, the consolidation and placement techniques are also applicable to other array formats.

Scientific datasets in various domains such as astronomy, physics, and medicine can be represented by arrays. Arrays are said to be dense when every cell has an associated value, or sparse when the majority of the cells are empty. Sparse arrays are sometimes stored as dense arrays after filling all empty cells with a null value. Dense arrays are commonly stored in a chunked layout. A chunk is a subarray bounded by a (hyper-)rectangle that covers adjacent cells.
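To make the chunked layout concrete, the sketch below creates a dense 2-D HDF5 dataset divided into fixed-size chunks and assigns a fill value, which is one way a sparse array can be stored densely with empty cells mapped to a null value. The file name, dataset name, extents, chunk shape, and fill value are illustrative assumptions, not values taken from the paper.

    /*
     * Minimal sketch: a dense 2-D dataset stored in a chunked layout.
     * Each chunk is a hyper-rectangle of adjacent cells; the fill value
     * marks cells that were never written, so a sparse array can be
     * stored densely.  All names and sizes are illustrative.
     */
    #include <hdf5.h>

    int main(void)
    {
        hsize_t dims[2]  = {1024, 1024};   /* logical array extent            */
        hsize_t chunk[2] = {128, 128};     /* each chunk covers 128x128 cells */
        float   null_val = -9999.0f;       /* marker for empty cells          */

        hid_t file  = H5Fcreate("vortices.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);

        /* Dataset creation property list: chunked layout plus fill value. */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        H5Pset_fill_value(dcpl, H5T_NATIVE_FLOAT, &null_val);

        hid_t dset = H5Dcreate2(file, "/velocity", H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Chunks are materialized lazily as cells are written. */
        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }

Chunks of this kind are the unit a store can read with one large sequential request, which is why consolidating many small arrays into sufficiently big chunks, as discussed above, can close the throughput gap between backends.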