Byna S, Breitenfeld MS, Dong B et al. ExaHDF5: Delivering efficient parallel I/O on exascale computing systems. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 35(1): 145–160 Jan. 2020. DOI 10.1007/s11390-020-9822-9

ExaHDF5: Delivering Efficient Parallel I/O on Exascale Computing Systems

Suren Byna1,∗, M. Scot Breitenfeld2, Bin Dong1, Quincey Koziol1, Elena Pourmal2, Dana Robinson2, Jerome Soumagne2, Houjun Tang1, Venkatram Vishwanath3, and Richard Warren2

1Lawrence Berkeley National Laboratory, Berkeley, CA 94597, U.S.A. 2The HDF Group, Champaign, IL 61820, U.S.A. 3Argonne National Laboratory, Lemont, IL 60439, U.S.A.

E-mail: [email protected]; brtnfl[email protected]; {dbin, koziol}@lbl.gov E-mail: {epourmal, derobins, jsoumagne}@hdfgroup.org; [email protected]; [email protected] E-mail: [email protected]

Received July 6, 2019; revised August 28, 2019.

Abstract   Scientific applications at exascale generate and analyze massive amounts of data. A critical requirement of these applications is the capability to access and manage this data efficiently on exascale systems. Parallel I/O, the key technology that enables moving data between compute nodes and storage, faces monumental challenges from the new applications, memory, and storage architectures considered in the designs of exascale systems. As the storage hierarchy is expanding to include node-local persistent memory, burst buffers, etc., as well as disk-based storage, data movement among these layers must be efficient. Parallel I/O libraries of the future should be capable of handling file sizes of many terabytes and beyond. In this paper, we describe new capabilities we have developed in Hierarchical Data Format version 5 (HDF5), the most popular parallel I/O library for scientific applications. HDF5 is one of the most used libraries at the leadership computing facilities for performing parallel I/O on existing HPC systems. The state-of-the-art features we describe include: Virtual Object Layer (VOL), Data Elevator, asynchronous I/O, full-featured single-writer and multiple-reader (Full SWMR), and parallel querying. In this paper, we introduce these features, their implementations, and the performance and feature benefits to applications and other libraries.

Keywords   parallel I/O, Hierarchical Data Format version 5 (HDF5), I/O performance, virtual object layer, HDF5 optimizations

Regular Paper. Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics.
This research was supported by the Exascale Computing Project under Grant No. 17-SC-20-SC, a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative. This work is also supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract Nos. DE-AC02-05CH11231 and DE-AC02-06CH11357. This research was funded in part by the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract No. DE-AC02-06CH11357. This research used resources of the National Energy Research Scientific Computing Center, which is a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
∗Alphabetical listing of authors after the first author (the corresponding author).
©Institute of Computing Technology, Chinese Academy of Sciences & Springer Nature Singapore Pte Ltd. 2020

1 Introduction

In pursuit of more accurate modeling of real-world systems, scientific applications at exascale will generate and analyze massive amounts of data. A critical requirement of these applications is the capability to access and manage this data efficiently on exascale systems. Parallel I/O, the key technology behind moving data between compute nodes and storage, faces monumental challenges from the new application workflows as well as the memory, interconnect, and storage architectures considered in the designs of exascale systems. The storage hierarchy in existing pre-exascale computing systems includes node-local persistent memory, burst buffers, etc., as well as disk and tape-based storage○1○2. Data movement among these layers must be efficient, with parallel I/O libraries of the future capable of handling file sizes of many terabytes and beyond. Easy-to-use interfaces to access and move data are required for alleviating the burden on scientific application developers and improving productivity. Exascale I/O systems must also be fault-tolerant to handle the failure of compute, network, memory, and storage, given the number of hardware components at these scales.

Intending to address efficiency, fault tolerance, and other challenges posed by data management and parallel I/O on exascale architectures, we are developing new capabilities in HDF5 (Hierarchical Data Format version 5) [1], the most popular parallel I/O library for scientific applications. As shown in Fig.1, HDF5 is the most used library for performing parallel I/O on existing HPC systems at the National Energy Research Scientific Computing Center (NERSC). (HDF5 and hdf5 are used interchangeably.) HDF5 is among the top libraries used at several US Department of Energy (DOE) supercomputing facilities, including the Leadership Computing Facilities (LCFs). Many of the exascale applications and co-design centers require HDF5 for their I/O, and enhancing the HDF5 software to handle the unique challenges of exascale architectures will play an instrumental role in the success of the Exascale Computing Project (ECP)○4.

ExaHDF5 is a project funded by the ECP to enhance the HDF5 library. In this paper, we describe new capabilities we have developed in the ExaHDF5 project, including the Virtual Object Layer (VOL), Data Elevator, asynchronous I/O, full single-writer and multiple-reader (SWMR), and parallel querying. We describe each of these features and present an evaluation of their benefits.

There are multiple I/O libraries, such as PnetCDF [2] and ADIOS [3], which provide parallel I/O functionality. However, this paper focuses on implementations in the HDF5 library. The remainder of the paper is organized as follows. We provide a brief background to HDF5 in Section 2 and describe the features enhancing HDF5 in Section 3. In Section 4, we describe the experimental setup used for evaluating the various developed features using different benchmarks (presented in Section 5), and conclude our discussion in Section 6.

We have presented the benefits of the developed features individually, as these have been developed for different use cases. For instance, Data Elevator is for using burst buffers on systems that have the hardware. Full SWMR is for applications that require multiple readers analyzing data while a writer is adding data to an HDF5 file. Integrating all these features into the single HDF5 product requires a greater effort, which will occur in the next phase of the project.

(Figure 1: bar chart, on a logarithmic scale, of the number of linking incidences per library at NERSC; hdf5 and its parallel variants are among the most-linked libraries.)
Fig.1. Library usage at NERSC on Cori and Edison in 2017 using the statistics collected by the Automatic Library Tracking Database (ALTD)○3.

○1 OLCF: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/, Nov. 2019.
○2 Cori at NERSC: https://www.nersc.gov/users/computational-systems/cori/, Nov. 2019.
○3 https://www.nersc.gov/assets/altdatNERSC.pdf, Dec. 2019.
○4 ECP homepage: https://www.exascaleproject.org/, Nov. 2019.

2 Background to HDF5

HDF5 provides a data model, library, and file format for storing and managing data. Due to the versatility of its data model, and its portability, longevity, and efficiency, numerous scientific applications use HDF5 as part of their data management solution. The HDF Group (THG)○5 has been designing, developing, and supporting the HDF5 software for more than 20 years. HDF5 is an open-source project with a BSD-style license and comes with C, Fortran, C++, and Java APIs, and third-party developers offer API wrappers in Python, Perl, and many other languages. The HDF5 library has been ported to virtually all existing operating systems and compilers, and the HDF5 file format is portable and machine-independent. The HDF5 community contributes patches for bugs and new features, and actively participates in testing new public releases and feature prototypes.

HDF5 is designed to store and manage high-volume and complex scientific data, including experimental and observational data (EOD). HDF5 allows storing generic data objects within files in a self-describing way, and much of the power of HDF5 stems from the flexibility of the objects that make up an HDF5 file: datasets for storing multi-dimensional arrays of homogeneous elements and groups for organizing related objects. HDF5 attributes are used to store user-defined metadata on objects in files.

The HDF5 library offers several powerful features for managing data. These features include random access to individual objects, partial access to selected dataset values, and internal data compression. HDF5 allows the modification of dataset values, including compressed data, without rewriting the whole dataset. Users can use custom compression methods and other data filtering techniques that are most appropriate for their data, for example, the ZFP compression developed at LLNL○6, along with the compression methods available in HDF5. Data extensibility is another powerful feature of the HDF5 library. Data can be added to an array stored in an HDF5 dataset without rewriting the whole dataset. Modifications of data and metadata stored in HDF5 can be done in both sequential mode and parallel mode using MPI I/O.

HDF5 supports a rich set of pre-defined datatypes as well as the creation of an unlimited variety of complex user-defined datatypes. User-defined datatypes provide a powerful and efficient mechanism for describing users' data. A datatype definition includes information about byte order, size, and floating point representation; it fully describes how the data is stored in the file, ensuring portability between different operating systems and compilers.

Due to the simplicity of the HDF5 data model, and the flexibility and efficient I/O of the HDF5 library, HDF5 supports all types of digitally stored data, regardless of origin or size. Petabytes of remote sensing data collected by satellites, terabytes of computational results from nuclear testing models, and megabytes of high-resolution MRI brain scans are stored in HDF5 files, together with the metadata necessary for efficient data sharing, processing, visualization, and archiving.
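The features described above map onto a small set of C calls. The following is a minimal sketch of creating a group, a chunked, gzip-compressed, extensible dataset, and an attribute, and later growing the dataset without rewriting it; the file, group, dataset, and attribute names are illustrative and not taken from any application discussed in this paper.

#include "hdf5.h"

int main(void)
{
    /* Create a file and a group to organize related objects. */
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/timestep_0", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Dataspace with an unlimited first dimension so the dataset can grow. */
    hsize_t dims[2]    = {100, 64};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 64};
    hid_t   space      = H5Screate_simple(2, dims, maxdims);

    /* Chunking is required for extensible/compressed datasets; enable gzip. */
    hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[2] = {100, 64};
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(group, "density", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* User-defined metadata stored as an attribute on the dataset. */
    hid_t  aspace = H5Screate(H5S_SCALAR);
    double dt     = 0.01;
    hid_t  attr   = H5Acreate2(dset, "time_step", H5T_NATIVE_DOUBLE, aspace,
                               H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &dt);

    /* Grow the dataset later without rewriting the existing data. */
    hsize_t newdims[2] = {200, 64};
    H5Dset_extent(dset, newdims);

    H5Aclose(attr); H5Sclose(aspace); H5Dclose(dset); H5Pclose(dcpl);
    H5Sclose(space); H5Gclose(group); H5Fclose(file);
    return 0;
}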
3 Features

Towards handling the challenges posed by the massive concurrency of exascale computing systems, the deeper storage hierarchies of these systems, and the large amounts of scientific data produced and analyzed, several new features are needed in enhancing HDF5. In the ExaHDF5 project, we have been working on opening the HDF5 API to allow various data storage options, on providing transparent utilization of deep storage hierarchies, on optimizing data movement, on providing coherent and concurrent access to HDF5 data by multiple processes, and on accessing desired data efficiently. In the remainder of this section, we describe the features addressing these aspects of improving HDF5.

3.1 Virtual Object Layer (VOL)

The HDF5 data model is composed of two fundamental objects: groups and datasets, composed as a directed graph within a file (sometimes also called a "container"). This data model is powerful and widely applicable to many applications, including HPC applications. However, the classic HDF5 mechanism of storing data model objects in a single, shared file can be a limitation as HPC systems evolve toward a massively concurrent future. Completely re-designing the HDF5 file format would be a possible way to address this limitation, but the software engineering challenges of supporting two (or more, in the future) file formats directly in the HDF5 library are daunting. Ideally, applications would continue to use the HDF5 programming API and data model, but have those objects stored with the storage mechanism of their choice.

○5 The HDF Group homepage: https://www.hdfgroup.org/, Nov. 2019.
○6 ZFP website at LLNL: https://computation.llnl.gov/projects/floating-point-compression, Nov. 2019.

As shown in Fig.2, the Virtual Object Layer (VOL) adds a new abstraction layer internally to the HDF5 library and is implemented just below the public API. The VOL intercepts all HDF5 API calls that access objects in a file and forwards those calls to an "object driver" connector. A VOL connector can store the data model objects in a variety of ways. For example, a connector could distribute objects remotely over different platforms, provide a direct mapping of the model to the file system, or even store the data in another file format (like the native netCDF or HDF4 format). The user still interacts with the same HDF5 data model and API, where access is done to a single HDF5 "container"; however, the VOL connector translates those operations appropriately to access the data, regardless of how it is actually stored.

While HDF5 function calls can be intercepted by binary instrumentation libraries such as gotcha○7, the VOL feature in HDF5 is designed to work with multiple VOL connectors based on user requirements. This VOL connector stackability allows users to register and unregister multiple connectors. For example, Data Elevator can be stacked with the asynchronous I/O connector and a provenance collection VOL connector. Since each VOL connector serves a different purpose, the connectors can be developed as different modules by different developers and can be stacked to work together. This type of stackability is not feasible with other binary instrumentation libraries. Having the VOL implementation within HDF5 reduces the dependency on external libraries not maintained by the HDF5 developers.

In our recent work, we have integrated the HDF5 VOL feature implementation into the mainstream HDF5 library and released it in HDF5 version 1.12.0. The VOL feature implementation includes new HDF5 APIs to register (H5VLregister*), to unregister (H5VLunregister*), and to obtain VOL information (H5VLget*). We have also developed a "pass through" VOL connector as an example implementation that forwards each VOL callback to an underlying connector.

Our implementation of VOL makes VOL connectors widely accessible to the HDF5 community and will encourage more developers to use or create new connectors for the HDF5 library. The VOL implementation is currently available in the "develop" branch as we write this paper○8, which is available for public access. The HDF Group is working on releasing the feature in the next major public release.
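To illustrate how an application selects a connector through this API, the sketch below registers a connector by name and attaches it to a file access property list; the connector name "pass_through" and the file name are placeholders, and error checking is omitted.

#include "hdf5.h"

int main(void)
{
    /* Register a dynamically loadable VOL connector by name (located via
     * the HDF5 plugin search path); "pass_through" is a placeholder name. */
    hid_t vol_id = H5VLregister_connector_by_name("pass_through", H5P_DEFAULT);

    /* Route all operations on files opened with this fapl through the connector. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_vol(fapl, vol_id, NULL);

    /* The application keeps using the normal HDF5 API and data model. */
    hid_t file = H5Fcreate("vol_example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
    H5VLunregister_connector(vol_id);
    return 0;
}

In recent HDF5 releases a connector can also be selected without code changes through the HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH environment variables.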

(Figure: applications A–D and netCDF-4 call the HDF5 programming API; non-persistent operations use transient data structures, while persistent operations pass through the Virtual Object Layer to VOL plugins targeting the classic HDF5 file format, remote storage, other file formats, or direct file system access.)
Fig.2. Overview of the Virtual Object Layer (VOL) architecture within HDF5.

○7 https://github.com/llnl/gotcha, Nov. 2019.
○8 HDF5 develop branch: https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5, Nov. 2019.

Several VOL connectors have been or are under development but cannot be officially released until the VOL architecture is finalized and integrated with the main HDF5 library. These VOL connectors in development include those aiming to map HDF5 natively on top of object storage, to support remote HDF5 object storage, and to track statistics of HDF5 usage, to name a few. We present two VOL connectors in this paper, i.e., the Data Elevator connector (Subsection 3.2) for transparently moving data between multiple layers of storage and the asynchronous I/O VOL connector (Subsection 3.3) for moving data asynchronously by letting the computation continue while a thread moves the data. The VOL interface will also enable developers to create connectors for representing HDF5 objects in multiple types of non-persistent and persistent memories, to support workflows that take advantage of in-memory data movement between different codes.

3.2 Data Elevator

A key synergistic criterion for achieving scalable data movement in scientific workflows is the effective placement of data. Data generation (e.g., by simulations) and consumption (e.g., for analysis) can span various storage and memory tiers, including near-memory NVRAM, SSD-based burst buffers, fast disk, campaign storage, and archival storage. Effective support for caching and prefetching data based on the needs of the application is critical for scalable performance.

Solutions such as Cray DataWarp○9, the DDN Infinite Memory Engine (IME), and parallel file systems such as Lustre and GPFS each address a specific storage layer, but it is currently left to the users to move data between different layers. It is imperative that we design parallel I/O and data movement systems that can hide the complexity of caching and data movement between tiers from users, to improve their productivity without penalizing performance.

Towards achieving the goal of an integrated parallel I/O system for multi-level storage, as part of the ExaHDF5 research project, we have recently developed the Data Elevator library [4]. Data Elevator intercepts HDF5 write calls, caches data in the fastest persistent storage layer, and then moves it to the final destination specified by an application in the background. We have used the HDF5 Virtual Object Layer (VOL) (described in Subsection 3.1) to intercept HDF5 file open, dataset open and close, dataset write and read, and file close calls. In addition to HDF5, we have extended the Data Elevator library to intercept MPI-IO data write calls. This extension allows a larger number of applications to use the library. To support data reads, we developed prefetching and caching of data in a faster storage layer. We describe the write and read caching methods in this subsection.

3.2.1 Write Caching in Data Elevator

In Fig.3, we show a high-level overview of the write and read caching implementations of Data Elevator. The main issues we target to address with Data Elevator (DE) are 1) to provide a transparent mechanism for using a burst buffer (BB) as part of a hierarchical storage system, and 2) to move data between different layers of a hierarchical storage system efficiently with low resource contention on the BB nodes. The data stored temporarily in the burst buffer can be reused for any analysis. Hence, we call this feature caching instead of simply buffering the data.

Fig.3. Overview of Data Elevator write caching functionality. The arrows from IOCI to DEMT are control requests and the remaining arrows are requests for data. IOCI: I/O Call Interceptor. DEMT: Data Elevator Metadata Table.

The first issue arises because burst buffers are introduced in HPC systems as independent storage spaces. Burst buffers being considered in exascale systems are of two types: those that provide a single namespace and are shared by all the compute nodes, and those installed on each node that do not provide a single namespace. In the Data Elevator, our implementations target burst buffers that are shared by all compute nodes. Due to limited storage capacity, i.e., 2x to 4x the main memory size, burst buffers are typically available to users only during the execution of their programs.

○9 Cray DataWarp: http://docs.cray.com/books/S-2558-5204/S-2558-5204.pdf, Nov. 2019.

Consequently, if users choose to write data to the BB, they are also responsible for moving the data to the PFS for retaining the data. The second issue is caused by Cray DataWarp, the state-of-the-art middleware for managing burst buffers on Cray systems. DataWarp uses a fixed and small number of BB nodes to serve both regular I/O and data movement requests. This typically results in performance degradation caused by interference between the two types of requests.

To address the above issues, Data Elevator performs asynchronous data movement to enable transparent use of hierarchical data storage, and also uses a low-contention data flow path by using compute nodes for transferring data between different storage layers. Data Elevator allows one to use as many data transfer nodes as necessary, which reduces the chance of contention. As shown in Fig.3, Data Elevator has three main components: the I/O Call Interceptor (IOCI), the Data Elevator Metadata Table (DEMT), and the Transparent and Efficient Data Mover (TEDM, or Data Mover in short).

The IOCI component intercepts I/O calls from applications and redirects I/O to fast storage, such as the burst buffer. We implement the IOCI mechanism using the HDF5 VOL feature, by developing an HDF5 VOL plug-in. While the current implementation supports HDF5-based I/O, the implementation can be extended to other I/O interfaces, such as MPI-IO. The DEMT contains a list of metadata records, e.g., file names, for the files to be redirected to the BB. The Data Mover (TEDM) component is responsible for moving the data from a burst buffer to a PFS. The TEDM component can share the nodes with the application job or run using a separate set of compute nodes. In Fig.3, TEDM shares two of the eight CPU cores with a simulation job (that uses the remaining six cores) on a computing node.

To reduce resource contention on the BB during data movement, Data Elevator is instantiated either on separate compute nodes or on the same compute nodes as an application. While this design of Data Elevator increases the number of CPU cores needed for running an application, extensive test results show that using a small portion of computing power to optimize I/O performance reduces the end-to-end time of the whole I/O intensive simulation. When the data is in the BB, Data Elevator also allows in transit analysis tasks to be performed on the data, before it is moved to the final destination of the file specified by the application.

Data Elevator applies various optimizations to improve I/O performance in writing data to the BB. After the application finishes writing a data file, it can continue computations without waiting for the data to be moved to a PFS. Meanwhile, the Data Mover monitors the metadata periodically and, once it finds that the writing process is complete, it starts to move the data from the BB to the PFS. Data Elevator reads the data from the BB to the memory on the computing nodes where Data Elevator is running, and writes the data to the PFS without interfering with other I/O requests on the BB. Data Elevator provides optimizations such as overlapping the reading of data from the BB to memory with writing to the PFS, and aligning the Lustre PFS stripe size with the data request size (assuming the PFS is Lustre) to reduce the overhead of data movement. While the data is in the burst buffer, Data Elevator allows data analysis codes to access the data, which is called in situ or in transit analysis, by redirecting data read accesses to the data stored in the burst buffer.

3.2.2 Read Caching in Data Elevator

In read caching, we have implemented caching and prefetching of chunks of data based on the history of HDF5 chunk accesses [5]. The cached chunks are stored as binary files in a faster persistent storage device, such as an SSD-based burst buffer. When requests to the prefetched data come, DE redirects the read requests to the cached data and thus improves performance significantly.

The data flow path for read caching in Data Elevator is shown in Fig.4. We highlight the prefetching function of Data Elevator, which improves the performance of array-based analysis programs by reading ahead data from the disk-based file system to the SSD-based burst buffer. An array is prefetched as data chunks. In the figure, we show an array with 12 chunks (i.e., a 3 × 4 array of chunks). Let us assume that an analysis task is accessing the first two chunks (marked in green), which are read from the disk-based file system.

Fig.4. High level description of Data Elevator read caching and prefetching functions. All the arrows to and from the ARCHIE component are control signals, and those between the storage (i.e., the disk file system and the SSD file system) are the data movement.
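Because Data Elevator operates transparently underneath HDF5, the analysis code itself issues only ordinary chunk-aligned reads. The sketch below shows that kind of access, reading one chunk-sized block of a 2D array with a hyperslab selection; the file name, dataset name, and chunk extent are illustrative, and a caching layer such as Data Elevator may redirect this read to a copy staged on the burst buffer.

#include "hdf5.h"

int main(void)
{
    hid_t file   = H5Fopen("analysis_input.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "field", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* Select one chunk-sized block of the array in the file. */
    hsize_t start[2] = {0, 0};         /* first chunk of the array  */
    hsize_t count[2] = {1024, 1024};   /* illustrative chunk extent */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Read the selected block into memory. */
    hid_t mspace = H5Screate_simple(2, count, NULL);
    static double buf[1024][1024];
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return 0;
}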


Data Elevator's Array Caching and Prefetching method (ARCHIE) prefetches the following four chunks (shown in blue) into the SSDs for future reads. Data Elevator manages the metadata table to contain data access information extracted from applications. Entries of the metadata table contain the name of the file being accessed by a data analysis application. This filename information is inserted into the table by the read function linked with the analysis applications. The metadata table also contains the "chunk size", "ghost zone size", and "status of the cached file". A chunk access predictor component uses the information of the cached chunks to predict the future read accesses of analysis applications and to prefetch the predicted data chunks into the faster SSD storage.

We used a simple predictor based on recent history, but any prediction algorithm can be used. A parallel reader/writer brings data from the disk file system into the SSD file system or vice versa. Because the data to be analyzed may be large, Data Elevator uses the parallel reader/writer to prefetch data. Specifically, the reader/writer uses multiple MPI processes that span all the computing nodes that are running the analysis application. These processes concurrently prefetch different chunks from the disks into the SSD to increase the prefetch efficiency. When an application modifies a cached chunk, Data Elevator synchronizes the updated cached chunks in the storage layers and writes the chunks back to the disk-based file systems.

Meanwhile, Data Elevator augments each prefetched chunk with a ghost zone layer (shown with a red halo around a blue chunk) to match the access pattern on the array. A user may specify the width of the ghost zone, which can be zero. The first two chunks (colored in white) read into the SSDs are actually empty chunks containing only metadata (e.g., starting and ending offsets). These metadata entries are written by the "read" function and they provide information for the prefetching algorithm in Data Elevator to predict future chunks. We have implemented parallel prefetching to accelerate parallel I/O of analysis applications.

We have also added fault tolerance features to Data Elevator, where either the application or the Data Elevator programs can fail gracefully. If an application running with the Data Elevator fails, and some data files are in the temporary staging area, the Data Elevator moves the files to the destination and then exits. In the case of a Data Elevator failure while a co-located application is running, the Data Elevator task restarts and resumes moving the data. A limitation of the Data Elevator restart capability is its dependency on the SLURM scheduler that is popular in HPC centers.

3.3 Asynchronous I/O

Asynchronous I/O allows an application to overlap I/O with other operations. When an application properly combines asynchronous I/O with non-blocking communication to overlap those operations with its calculation, it can fully utilize an entire HPC system, leaving few or no system components idle. Adding asynchronous I/O to an application's existing ability to perform non-blocking communication is a necessary aspect of maximizing the utilization of valuable exascale computing resources.

We have designed asynchronous operation support in HDF5 to extend to all aspects of interacting with HDF5 containers, not just the typical "bulk data" operations (i.e., raw data read/write). Asynchronous versions of "metadata" operations like file open, close, stat, etc. are not typically supported in I/O interfaces, but based on our previous experience, synchronization around these operations can be a significant barrier to application concurrency and performance. Supporting asynchronous file open and close operations as well as other metadata operations such as object creation and attribute updates in the HDF5 container will allow an application to be decoupled entirely from I/O dependencies.

We have implemented the asynchronous I/O feature as a VOL connector for HDF5. The VOL interface in HDF5 allows developers to create a VOL connector that intercepts all HDF5 API calls that interact with a file. As a result, it requires no change to the current HDF5 API or the HDF5 library source code. We implemented asynchronous task execution by running those tasks in separate background threads. We are currently using Argobots [6], a lightweight runtime system that supports integrated computation and data movement with massive concurrency, to spawn and execute asynchronous tasks in threads. The thread execution interface has been abstracted to allow us to replace Argobots with any other threading model, as desired.
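The execution model is independent of HDF5 itself: a background Argobots task carries out the I/O while the caller continues computing and joins the task later. The following is a minimal sketch of that model with a placeholder I/O function and file name; it is not the ExaHDF5 connector's code, which additionally tracks dependencies between HDF5 operations.

#include <abt.h>
#include <stdio.h>

/* Placeholder for an I/O task executed in the background. */
static void io_task(void *arg)
{
    FILE *fp = fopen("output.dat", "wb");
    fwrite(arg, 1, 1 << 20, fp);   /* write a 1 MiB buffer */
    fclose(fp);
}

int main(int argc, char **argv)
{
    static char buffer[1 << 20];
    ABT_init(argc, argv);

    /* A second execution stream so the task runs concurrently with main. */
    ABT_xstream xstream;
    ABT_pool    pool;
    ABT_thread  task;
    ABT_xstream_create(ABT_SCHED_NULL, &xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);
    ABT_thread_create(pool, io_task, buffer, ABT_THREAD_ATTR_NULL, &task);

    /* ... computation proceeds here while the I/O task progresses ... */

    ABT_thread_free(&task);       /* join and release the background task */
    ABT_xstream_join(xstream);
    ABT_xstream_free(&xstream);
    ABT_finalize();
    return 0;
}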
size of a dataset is extended, both the extensible ar- We have designed and implemented to extend the ray that references chunks of dataset elements and the SWMR feature in HDF5 to support all metadata ope- dataspace message in the dataset object header must be rations for HDF5 files (e.g., object creation and dele- updated with the new size, and the dataspace message tion, attribute updates). In addition to removing the must not be seen by an SWMR reader until the ex- limitation in the types of operations that the SWMR tensible array is available for the reader to access. For feature supports, the limitation of only operating on this situation, and others like it, proxy entries in the files with serial applications must also be lifted, to fully metadata cache are used to guarantee that all the dirty support exascale application workflows. Therefore, we extensible array entries are flushed to the file before designed the extension of the SWMR capability to sup- the object header entry with the dataspace message is port parallel MPI applications for both the writing and written to the file. reading aspects of the SWMR feature. This enables In Fig.5, we show a typical situation in the meta- a single MPI application to function as the writing data cache for a chunked dataset. All the entries for “process” and any number of MPI applications to be the chunk index (an extensible array, in this case) have Suren Byna et al.: ExaHDF5: Delivering Efficient Parallel I/O on Exascale Computing Systems 153 a “top” proxy entry that reflects the dirty state of all metadata updates. This will further improve the per- the entries in the index (i.e., if any index entry is dirty, formance of SWMR readers. the index’s top proxy will be dirty also). Similarly, all the entries in the object header depend on a “bottom” 3.5 Parallel Querying proxy, which will keep any of them from being flushed (i.e., if the bottom proxy is dirty, no dirty object header Methods to store, move, and access data across com- entries can be flushed). Therefore, making the bottom plex Exascale architectures are essential to improve the proxy of the object header the flush dependency par- productivity of scientists. To assist with accessing data ent of the top proxy for the extensible array will force efficiently, the HDF Group has been developing func- all extensible array entries to be written to the file and tionality for querying HDF5 raw data and releases this marked clean in the metadata cache before any dirty feature in HDF5. The interface relies on three main object header entries can be written to the file. components that allow application developers to create complex and high-performance queries on both meta- Object data and data elements within a file: queries, views, Header Entries and indices. Query objects are the foundation of the data ana-

Object Object Object

lysis operations and can be built up from simple com- Header Header Header Chunk 0 Chunk 1 Chunk n ponents in a programmatic way to create complex ope- rations using Boolean operations. View objects allow the user to retrieve an organized set of query results. The core query API is composed of two routines: H5Qcreate and H5Qcombine. H5Qcreate creates new

queries, by specifying an aspect of an HDF5 container, Object Header in particular: Bottom Proxy Entry • H5Q TYPE DATA ELEM (data element)
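For reference, the append-only SWMR mode that already ships in HDF5 1.10 is driven by a small set of existing calls; the sketch below shows the writer side with an illustrative file and dataset (readers open the same file with H5F_ACC_RDONLY | H5F_ACC_SWMR_READ and call H5Drefresh before reading). The full-SWMR design described above keeps this usage model while lifting the append-only restriction.

#include "hdf5.h"

int main(void)
{
    /* SWMR requires the latest file format features. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("swmr.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked, extensible 1D dataset that the writer appends to. */
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset  = H5Dcreate2(file, "samples", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Switch the file into SWMR-write mode; readers may now attach. */
    H5Fstart_swmr_write(file);

    for (int step = 0; step < 10; step++) {
        /* ... extend the dataset with H5Dset_extent and write new elements ... */
        H5Dflush(dset);   /* make newly written data visible to readers */
    }

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
    return 0;
}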

3.5 Parallel Querying

Methods to store, move, and access data across complex exascale architectures are essential to improve the productivity of scientists. To assist with accessing data efficiently, the HDF Group has been developing functionality for querying HDF5 raw data and releases this feature in HDF5. The interface relies on three main components that allow application developers to create complex and high-performance queries on both metadata and data elements within a file: queries, views, and indices.

Query objects are the foundation of the data analysis operations and can be built up from simple components in a programmatic way to create complex operations using Boolean operations. View objects allow the user to retrieve an organized set of query results.

The core query API is composed of two routines: H5Qcreate and H5Qcombine. H5Qcreate creates new queries by specifying an aspect of an HDF5 container, in particular:
• H5Q_TYPE_DATA_ELEM (data element),
as well as a match operator, such as:
• H5Q_MATCH_EQUAL (=),
• H5Q_MATCH_NOT_EQUAL (≠),
• H5Q_MATCH_LESS_THAN (<),
• H5Q_MATCH_GREATER_THAN (>),
and a value for the match operator. Created query objects can be serialized and deserialized using the H5Qencode and H5Qdecode routines○10, and their content can be retrieved using the corresponding accessor routines. H5Qcombine combines two query objects into a new query object, using Boolean operators such as:
• H5Q_COMBINE_AND (∧),
• H5Q_COMBINE_OR (∨).

Queries created with H5Qcombine can be used as input to further calls to H5Qcombine, creating more complex queries. For example, a single call to H5Qcreate could create a query object that would match data elements in any dataset within the container that are equal to the value "17". Another call to H5Qcreate could create a query object that would match link names equal to "Pressure". Calling H5Qcombine with the ∧ operator and those two query objects would create a new query object that matches elements equal to "17" in HDF5 datasets with link names equal to "Pressure".
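As a concrete illustration of the worked example above, the sketch below builds the combined query (value = 17) ∧ (link name = "Pressure"). It follows the prototype interface as described in this section; the exact argument lists (in particular, passing an HDF5 datatype along with the value, the H5Q_TYPE_LINK_NAME aspect, and an H5Qclose release call) are assumptions about the prototype and may differ from the released API.

#include "hdf5.h"

int main(void)
{
    int value = 17;

    /* Match data elements equal to 17 (the datatype argument is an assumption). */
    hid_t q_elem = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_EQUAL,
                             H5T_NATIVE_INT, &value);

    /* Match objects whose link name equals "Pressure" (assumed aspect). */
    hid_t q_name = H5Qcreate(H5Q_TYPE_LINK_NAME, H5Q_MATCH_EQUAL, "Pressure");

    /* Elements equal to 17 in datasets linked as "Pressure". */
    hid_t q = H5Qcombine(q_elem, H5Q_COMBINE_AND, q_name);

    /* ... apply the query or build a view, then release the query objects ... */
    H5Qclose(q); H5Qclose(q_name); H5Qclose(q_elem);
    return 0;
}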

○10 Serialization/deserialization of queries was introduced so that queries can be sent through the network.

Creating the data analysis extensions to HDF5 using a programmatic interface for defining queries avoids defining a text-based query language as a core component of the data analysis interface, and is more in keeping with the design and level of abstraction of the HDF5 API. Table 1 describes the result types for atomic queries and for combining queries of different types.

Table 1. Query Combinations

Query                                 Result Type
H5Q_TYPE_DATA_ELEM                    Dataset region
Dataset element ∧ dataset element     Dataset region
Dataset element ∨ dataset element     Dataset region

Index objects are the core means of accelerating the HDF5 query interface and enable queries to be performed efficiently. The indexing interface uses a plugin mechanism that enables flexible and transparent index selection and allows new indexing packages to be quickly incorporated into HDF5 as they appear, without waiting for a new HDF5 release.

The indexing API includes registration of an indexing plugin (H5Xregister), unregistering it (H5Xunregister), creating and removing indexes (H5Xcreate and H5Xremove, respectively), and getting information about the indexes, such as H5Xget_info and H5Xget_size. The current implementation of indexing in HDF5 supports FastBit bitmap indexing [7], which uses Word-Aligned Hybrid (WAH) compression for the generated bitmaps. The bitmap indexes are generated offline using a utility, and they are stored within the same HDF5 file. These index files are used for accelerating the execution of queries. Without the indexes, query execution scans the entire datasets defined in a query condition.

4 Experimental Setup

We have conducted our evaluation on Cori, a Cray XC40 at NERSC. The Cori data partition (phase 1) consists of 1 630 compute nodes, and each node has 32 Intel Haswell CPU cores and 128 GB memory. The Lustre file system of Cori has 248 object storage targets (OSTs) providing 30 PB of disk space. Cori is also equipped with an SSD-based "Burst Buffer". The Burst Buffer is managed by DataWarp from Cray and has 144 DataWarp server nodes.

We describe the benchmarks and the results in the following section, as each feature used a different set of benchmarks and their evaluation metrics varied. For the VOL feature, a performance evaluation is unavailable, as it is an implementation that enables other features, such as Data Elevator and asynchronous I/O. All the other features have been evaluated using various benchmarks and I/O kernels from real applications.

5 Evaluation

5.1 Data Elevator

We have implemented Data Elevator in C and compiled it with the compilers available on Cori. Our tests for disk-based performance used all 248 OSTs of the Lustre file system. In the tests for the BB, we used all 144 DataWarp server nodes. The striping size for multiple DataWarp servers is fixed at 8 MB, which cannot be modified by normal users. As such, we set the striping size of the Lustre file system also to be 8 MB.

We evaluated various configurations of Data Elevator using two parallel I/O benchmarks: VPIC-IO and Chombo-IO. Both benchmarks have a single time step and generate a single HDF5 file. VPIC-IO is a parallel I/O kernel of a plasma physics simulation code, called VPIC. In our tests, VPIC-IO writes 2 million particles with 8 properties per particle, resulting in a file of 64 GB in size. The I/O pattern of VPIC-IO is similar to particle simulation codes in the ECP, where each particle has a set of properties, and a large number of particles are studied. As shown in Fig.6, using Data Elevator for moving the data related to a single time step achieves 5.6x speedup on average over Cray DataWarp and 5x speedup on average over writing data directly to Lustre.

Fig.6. Evaluation of Data Elevator write caching functionality — I/O time of a plasma physics simulation's (VPIC-IO) data write pattern.
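For context, the VPIC-IO pattern writes each particle property as a 1D dataset into one shared file, with every MPI rank writing a contiguous hyperslab. A minimal sketch of that style of write follows; the particle count per rank, file and dataset names, and the single property shown are illustrative, not the benchmark's source code.

#include "hdf5.h"
#include <mpi.h>
#include <stdlib.h>

#define NPART 2097152  /* particles per rank (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open one shared HDF5 file through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("particles.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One 1D dataset per particle property; each rank writes its own hyperslab. */
    hsize_t gdims[1] = {(hsize_t)NPART * nprocs};
    hsize_t start[1] = {(hsize_t)NPART * rank};
    hsize_t count[1] = {NPART};
    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t memspace  = H5Screate_simple(1, count, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    float *x = calloc(NPART, sizeof(float));
    /* ... fill x with this rank's particle positions ... */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* collective (two-phase) write */
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, x);

    free(x);
    H5Pclose(dxpl); H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
    H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}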

Chombo-IO is derived from Chombo, a popular block-structured adaptive mesh refinement (AMR) library. The generated file has a problem domain of 256 × 256 × 256 and is 146 GB in size. The I/O pattern of this benchmark is representative of the subsurface flow simulation application in the ECP. In Fig.7, we show the performance benefit of the Data Elevator library. The Chombo-IO benchmark achieves a 2x benefit (on average) over Lustre and a 5x benefit (on average) over Cray DataWarp.

Fig.7. Evaluation of Data Elevator write caching functionality — I/O time of a Chombo adaptive mesh refinement (AMR) code's I/O kernel.

In both benchmarks, the benefit over Lustre comes from writing data to the faster SSD-based burst buffer compared with the disk-based Lustre. With regard to Cray DataWarp, the advantage comes from two factors: the selection of the MPI-IO mode for writing data to the SSD-based burst buffer and the dynamic configuration of Lustre striping parameters based on the data size. Data Elevator changes the MPI-IO mode from collective (two-phase) I/O to independent I/O because we observed that the BB on Cori performs much better with the independent mode. For the selection of Lustre striping, while we chose all the Lustre OSTs for both Cray DataWarp and Data Elevator, the stripe size is dynamically set based on the file size in Data Elevator.
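Applications that do not use Data Elevator can apply the same two knobs directly: the MPI-IO transfer mode is selected per transfer, and Lustre striping can be suggested through ROMIO hints at file creation. A sketch under those assumptions follows (the hint names are ROMIO conventions and the values are examples; Data Elevator applies its own settings internally).

#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Suggest Lustre striping via ROMIO hints when the file is created. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "248");    /* number of OSTs (example)   */
    MPI_Info_set(info, "striping_unit", "8388608");  /* 8 MB stripe size (example) */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
    hid_t file = H5Fcreate("striped.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Choose independent (non-collective) transfers for subsequent writes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);

    /* ... create datasets and pass dxpl to H5Dwrite calls ... */

    H5Pclose(dxpl); H5Fclose(file); H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}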

We evaluate DE read caching by comparing it with Lustre to show the benefits of the prefetching function in accelerating I/O operations by converting non-contiguous I/O to contiguous I/O. Cray's DataWarp provides tools, e.g., "stage-in", to manually move a data file from disk space into SSD space. In contrast, DE automatically prefetches the array data from Lustre to the burst buffer in chunks. Our method avoids the application's wait time for the entire file data movement at the start of job execution enforced by DataWarp.

We compare DE reading (labeled ARCHIE in the plots, representing Array Caching in Hierarchical storage) using three scientific analysis kernels: convolutional neural network (CNN)-based analysis on CAM5 data used to detect extreme weather events [8], gradient computation on a plasma physics dataset using 3D magnetic field data generated by a VPIC simulation [9], and vorticity computation on combustion data produced by an S3D simulation that captures key turbulence-chemistry interactions in a combustion engine [10].

In our evaluation of DE reading with CNN analysis of a climate dataset, we focus on the most time-consuming step, i.e., convolution computing, in CNNs. The CAM5 dataset used in this test is a 3D array with size [31, 768, 1152], where 768 and 1152 are the latitude and the longitude dimensions, respectively, and 31 is the number of height levels from the earth into the atmosphere. The filter size for the convolution is [4, 4]. The chunk size is [31, 768, 32], resulting in a total of 36 chunks. The analysis application runs with 9 MPI processes. We run DE with another 9 processes. In this configuration, there are 4 batches of reads, and each batch accesses 9 chunks. We present the read time of the analysis application in Fig.8. Reading all the data from the disks, i.e., Lustre, gives the worst performance, as expected. Using DataWarp to move the data from Lustre to the burst buffer reduces the time for reading data by 1.7x, but it incurs an initial overhead in staging the data in the burst buffer. Using the DE cache on both the disks and the SSDs reduces the time for reading data. The advantage of DE comes from converting non-contiguous reads into contiguous reads. Data Elevator can also prefetch data to be accessed in the future into the burst buffer as chunks, and achieves the best performance. Overall, for the CNN use case, DE is 3.1x faster than Lustre and 1.8x faster than DataWarp.

Fig.8. Time profile of Data Elevator read caching for CNN analysis of CAM5 data. The plot compares the read performance of Lustre and DataWarp with Data Elevator caching (labeled ARCHIE) using the Lustre file system and using the SSD-based DataWarp.

The VPIC dataset is a 3D array with size [2 000, 2 000, 800]. On a meshed 3D field, the gradient can be computed as shown in (1):

    grad f(x, y, z) = ∇f(x, y, z),    (1)

where f represents the magnetic field value and ∇ denotes the gradient (nabla) operator.

The chunk size for parallel array processing is [250, 250, 100], giving a total of 512 chunks. Our tests use 128 processes to run the analysis program and only 64 processes to run DE. In this test, we have a ghost zone with a size of [1, 1, 1]. We present the read time of this analysis in Fig.9. The performance comparison has the same pattern as that of the convolution on CAM5 data. Overall, DE is 2.7x faster than Lustre and 2.4x faster than DataWarp.

Fig.9. Data read time profiles for gradient computation of VPIC data. ARCHIE represents the Data Elevator's array caching optimization.

The S3D simulation code captures key turbulence-chemistry interactions in a combustion engine. An attribute used to study the turbulent motion is vorticity, which defines the local spinning motion at a specific location. Given the z component of the vorticity at a point (x, y), the vorticity analysis accesses four neighbors for each point, at (x+1, y), (x−1, y), (x, y−1) and (x, y+1), as defined in [11]. Our tests use a 3D array with size [1 100, 1 080, 1 408]. The chunk size is [110, 108, 176] and the ghost zone size is [1, 1, 1], giving 800 chunks. We use 100 processes to run the analysis programs and 50 processes to run DE. The read performance comparison is shown in Fig.10. The analysis program reads all 800 chunks in 8 batches. DE outperforms Lustre by 5.8x and is 7x better than DataWarp. In this case, DataWarp performs 1.2x slower than Lustre.
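For clarity, the z-component of vorticity and a standard central-difference discretization consistent with the four-neighbor access pattern above can be written as follows; this particular stencil is our illustration and not necessarily the exact formulation used in [11].

\omega_z(x, y) = \frac{\partial u_y}{\partial x} - \frac{\partial u_x}{\partial y}
\approx \frac{u_y(x+1, y) - u_y(x-1, y)}{2\,\Delta x}
      - \frac{u_x(x, y+1) - u_x(x, y-1)}{2\,\Delta y}

Here u_x and u_y denote the in-plane velocity components, which is why each point's four neighbors must be read.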

Fig.10. Data read time profiles for vorticity computation of combustion simulation (S3D) data. ARCHIE represents the Data Elevator's array caching optimization.

5.2 Asynchronous I/O

We ran experiments with asynchronous I/O enabled in HDF5 on the Cori system. The tests used an I/O kernel from a plasma physics simulation (VPIC) that writes 8 variables per particle at 5 timesteps. The number of MPI processes is varied from 2 to 4k cores, in multiples of 2. During the computation time between subsequent time steps, which is emulated with a 20-second sleep time, Argobots tasks started on threads perform I/O, overlapping the I/O time fully with the computation time.

In Fig.11 and Fig.12, we show a comparison of write time and read time, respectively, with and without asynchronous I/O. In Fig.11, we show the I/O time for writing data produced by a simulation kernel and for reading data to be analyzed, where using asynchronous I/O obtains a 4x improvement. In these cases, the I/O related to 4 of the 5 timesteps is overlapped with computation. The last timestep's I/O has to be completed before the program exits, which is not overlapped. The overall performance benefit is hence dependent on the number of time steps used.

5.3 Full SWMR

To evaluate the performance of full SWMR, we have implemented a version of SWMR-like functionality with locks using MPI (labeled non-SWMR in Fig.13 and Fig.14). We compare the performance of this non-SWMR version with the full SWMR implementation in HDF5 for a single writer writing 1D data to an HDF5 dataset, where the write size is varied between 1 KB and 512 MB, repeated 100 times.

While the single writer process is writing data, three readers open the dataset and read the newly written data concurrently.

Fig.11. Parallel write performance of HDF5 with asynchronous I/O enabled (HDF5-async) compared with the current HDF5 implementation (VPIC data, 256 MB per process per timestep, 5 timesteps).

Fig.12. Parallel read performance of HDF5 with asynchronous I/O enabled (HDF5-async) compared with the current HDF5 implementation (VPIC data, 256 MB per process per timestep, 5 timesteps).

Fig.13. Performance comparison — SWMR vs non-SWMR. One writer writes 1D data to an HDF5 dataset, with individual write sizes ranging from 1 KB to 512 MB, repeated 100 times. At the same time, 3 readers discover and read the newly written data concurrently.

In Fig.13, we show the raw comparison, and in Fig.14, we show the speedup numbers for performing this SWMR functionality. As can be seen, our improved SWMR implementation in HDF5 outperforms the non-SWMR approach by up to 8x at smaller write sizes between 2 KB and 32 KB. Beyond 32 KB, our implementation still performs well, with a 20% advantage at 512 MB writes.

Fig.14. Speedup of SWMR over non-SWMR in writing a dataset with different write sizes.

5.4 Parallel Querying

Initial benchmarking of the parallel querying implementation prototype on Cori resulted in the following scalability plots (Fig.15 and Fig.16), which show the time taken to 1) construct the FastBit index and 2) evaluate a query of the form: finding all elements in the "Energy" dataset whose value is greater than 1.9 (i.e., "Energy > 1.9").


The data used for this query was obtained from a plasma physics simulation (VPIC) run for understanding the magnetic field interactions in a magnetic reconnection phenomenon of a space weather scenario. The HDF5 file contained particle properties, such as the spatial locations in 3D, the corresponding velocities, and the energy of particles. Each property was stored as an HDF5 dataset that contains 623 520 550 elements represented as 32-bit floating point values.

Fig.15. Parallel index generation time, where the indexes are bitmap indexes generated by the FastBit indexing library.

Fig.16. Parallel query evaluation time.

As shown in Fig.15 and Fig.16, both index generation and query evaluation scale well for this dataset up to 8 cores. The advantage of adding more cores diminishes because the amount of data processed by each MPI process becomes smaller. With larger datasets, both of these functions will scale further.

6 Conclusions

HDF5 is a critical parallel I/O library that is used heavily by a large number of applications on supercomputing systems. Due to the increasing data sizes, extreme level of concurrency, and deepening storage hierarchy of upcoming exascale systems, the HDF5 library has to be enhanced to handle several challenges. In the ExaHDF5 project funded by the Exascale Computing Project (ECP), we are developing various features to improve the performance and productivity of HDF5. In this paper, we presented a few of those feature enhancements, including integration of the Virtual Object Layer (VOL) that opens up the HDF5 API for alternate ways of storing data, Data Elevator for storing and retrieving data using faster storage devices, asynchronous I/O for overlapping I/O time with application computation time, full SWMR for allowing concurrent reading of data while a process is writing an HDF5 file, and querying to access data that matches a given condition. We have presented an evaluation of these features, which demonstrates significant performance benefits (up to 6x with Data Elevator, which uses an SSD-based burst buffer, compared with using the disk-based Lustre system, and 1.2x to 8x benefit with SWMR compared with a non-SWMR implementation). We have compared these features separately as the use cases are distinct. For instance, the Data Elevator and the asynchronous I/O features use the VOL infrastructure. Full SWMR and querying serve use cases different from masking the I/O latency. With the ongoing ExaHDF5 effort, these features will be integrated into the HDF5 public release to have a broad performance impact for a large number of HDF5 applications.

References

[1] Folk M, Heber G, Koziol Q, Pourmal E, Robinson D. An overview of the HDF5 technology suite and its applications. In Proc. the 2011 EDBT/ICDT Workshop on Array Databases, March 2011, pp.36-47.
[2] Li J W, Liao W K, Choudhary A N et al. Parallel netCDF: A high-performance scientific I/O interface. In Proc. the 2003 ACM/IEEE Conference on Supercomputing, November 2003, Article No. 39.
[3] Lofstead J, Zheng F, Klasky S, Schwan K. Adaptable, metadata rich IO methods for portable high performance IO. In Proc. the 23rd IEEE International Symposium on Parallel and Distributed Processing, May 2009, Article No. 44.
[4] Dong B, Byna S, Wu K S et al. Data Elevator: Low-contention data movement in hierarchical storage system. In Proc. the 23rd IEEE International Conference on High Performance Computing, December 2016, pp.152-161.
[5] Dong B, Wang T, Tang H, Koziol Q, Wu K, Byna S. ARCHIE: Data analysis acceleration with array caching in hierarchical storage. In Proc. the 2018 IEEE International Conference on Big Data, December 2018, pp.211-220.
[6] Seo S, Amer A, Balaji P et al. Argobots: A lightweight low-level threading and tasking framework. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(3): 512-526.
[7] Wu K. FastBit: An efficient indexing technology for accelerating data-intensive science. Journal of Physics: Conference Series, 2005, 16(16): 556-560.
[8] Racah E, Beckham C, Maharaj T, Kahou S E, Prabhat, Pal C. ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Proc. the 31st Annual Conference on Neural Information Processing Systems, December 2017, pp.3402-3413.

[9] Byna S, Chou J C Y, Rübel O et al. Parallel I/O, analysis, and visualization of a trillion particle simulation. In Proc. the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2012, Article No. 59.
[10] Chen J H, Choudhary A, de Supinski B et al. Terascale direct numerical simulations of turbulent combustion using S3D. Computational Science & Discovery, 2009, 2(1).
[11] Dong B, Wu K S, Byna S, Liu J L, Zhao W J, Rusu F. ArrayUDF: User-defined scientific data analysis on arrays. In Proc. the 26th International Symposium on High-Performance Parallel and Distributed Computing, June 2017, pp.53-64.

Suren Byna received his Master's degree in 2001 and Ph.D. degree in 2006, both in computer science from Illinois Institute of Technology, Chicago. He is a Staff Scientist in the Scientific Data Management (SDM) Group in CRD at Lawrence Berkeley National Laboratory (LBNL). His research interests are in scalable scientific data management. More specifically, he works on optimizing parallel I/O and on developing systems for managing scientific data. He is the PI of the ECP funded ExaHDF5 project, and of the ASCR funded object-centric data management systems (Proactive Data Containers — PDC) and experimental and observational data management (EOD-HDF5) projects.

M. Scot Breitenfeld received his Master's degree in 1998 and his Ph.D. degree in 2014, both in aerospace engineering from the University of Illinois at Urbana-Champaign. He currently works for The HDF Group in parallel computing and HPC software development. His research interests are in scalable I/O performance in HPC and scientific data management.

Bin Dong received his B.S. degree in computer science and technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, in 2008, received his Ph.D. degree in computer system architecture from Beihang University, Beijing, in 2013, and now works at LBNL as a research scientist. Bin's research interests include new and scalable algorithms and data structures for sorting, organizing, indexing, searching, and analyzing big scientific data with supercomputing. Bin has published over 40 research papers in peer-reviewed conferences and journals.

Quincey Koziol received his B.S. degree in electrical engineering from the University of Illinois, Urbana-Champaign. He is a principal data architect at Lawrence Berkeley National Laboratory, where he drives scientific data architecture discussions and participates in NERSC system design activities. He is a principal architect for the HDF5 project and a founding member of The HDF Group.

Elena Pourmal is one of the founders of The HDF Group, a not-for-profit company with the mission to develop and sustain the HDF technology, and to provide free and open access to data stored in HDF. She has been with The HDF Group since 1997 and for more than 20 years has led HDF software maintenance, quality assurance, and user support efforts. Ms. Pourmal currently serves as The HDF Group Engineering Director leading the HDF5 engineering effort and is also a member of The HDF Group Board of Directors. Elena received her M.S. degree in mathematics from Moscow State University and her M.S. degree in theoretical and applied mechanics from the University of Illinois at Urbana-Champaign.

Dana Robinson received his B.S. degree in biochemistry in 2003 and his Ph.D. degree in chemistry in 2010, both from the University of Illinois at Urbana-Champaign. He has been a software engineer and scientific data consultant with The HDF Group since 2009.

Jerome Soumagne received his Master's degree in computer science in 2008 from ENSEIRB Bordeaux and his Ph.D. degree in computer science in 2012 from the University of Bordeaux, Bordeaux, France, while being an HPC software engineer at the Swiss National Supercomputing Centre in Switzerland. He is now a lead HPC software engineer at The HDF Group. His research interests include large scale I/O problems, parallel coupling and data transfer between HPC applications, and in-transit/in-situ analysis and visualization. He is responsible for the development of the Mercury RPC interface and is an active developer of the HDF5 library.

Houjun Tang received his B.Eng. degree in computer science and technology from Shenzhen University, Shenzhen, in 2012, and his Ph.D. degree in computer science from North Carolina State University in 2016. He is a postdoctoral researcher at Lawrence Berkeley National Laboratory. Tang's research interests include data management, parallel I/O, and storage systems in HPC. Tang has published over 20 research papers in peer-reviewed conferences.

Richard Warren received his B.S. degree in electrical and computer engineering from the University of Massachusetts, Amherst. He currently works for The HDF Group, where he continues to pursue his long time interests of parallel computing and HPC software development. He holds several patents related to computer architecture and distributed software algorithms.

Venkatram Vishwanath is a computer scientist at Argonne National Laboratory. He is the Data Science group lead at the Argonne Leadership Computing Facility (ALCF). His current focus is on architectures, algorithms, system software, and workflows to facilitate data-centric applications on supercomputing systems. His interests include scientific applications, parallel I/O and data movement, supercomputing architectures, parallel runtimes, scalable analytics, and collaborative workspaces. Vishwanath received his Ph.D. degree in computer science from the University of Illinois at Chicago in 2009.