High-Performance I/O Programming Models for Exascale Computing

SERGIO RIVAS-GOMEZ

Doctoral Thesis
Stockholm, Sweden 2019

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
TRITA-EECS-AVL-2019:77
ISBN 978-91-7873-344-6
SE-100 44 Stockholm, Sweden

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Technology in High-Performance Computing on Friday, 29 November 2019, at 10:00 in room B1, Brinellvägen 23, Bergs, floor 1, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden.

© Sergio Rivas-Gomez, Nov 2019

Printed by: Universitetsservice US AB

Abstract

The success of the exascale supercomputer is largely dependent on novel breakthroughs that overcome the increasing demands for high-performance I/O on HPC. Scientists are aggressively taking advantage of the available compute power of petascale supercomputers to run larger-scale and higher-fidelity simulations. At the same time, data-intensive workloads have recently become dominant as well. Such use-cases inherently pose additional stress on the I/O subsystem, mostly due to the elevated number of I/O transactions. As a consequence, three critical challenges arise that are of paramount importance at exascale. First, while the concurrency of next-generation supercomputers is expected to increase by up to 1000x, the bandwidth and access latency of the I/O subsystem are projected to remain roughly constant in comparison. Storage is, therefore, on the verge of becoming a serious bottleneck. Second, even though upcoming supercomputers are expected to integrate emerging non-volatile memory technologies to compensate for some of these limitations, existing programming models and interfaces (e.g., MPI-IO) might not provide any clear technical advantage when targeting distributed intra-node storage, let alone byte-addressable persistent memories. And third, even though heterogeneous compute nodes can provide benefits in terms of performance and thermal dissipation, this technological transformation implicitly increases the programming complexity, making it difficult for scientific applications to take advantage of these developments.

In this thesis, we explore how programming models and interfaces must evolve to address the aforementioned challenges. We present MPI storage windows, a novel concept that proposes utilizing the MPI one-sided communication model and MPI windows as a unified interface to program memory and storage. We then demonstrate how MPI one-sided communication can provide benefits to data analytics frameworks following a decoupled strategy, while integrating seamless fault-tolerance and out-of-core execution. Furthermore, we introduce persistent coarrays to enable transparent resiliency in Coarray Fortran, supporting the "failed images" feature recently introduced into the standard. Finally, we propose a global memory abstraction layer, inspired by the memory-mapped I/O mechanism of the OS, to expose different storage technologies using conventional memory operations.

The outcomes of these contributions are expected to have a considerable impact on a wide variety of scientific applications on HPC, both on current and next-generation supercomputers.

Sammanfattning

The success of supercomputers at exascale will largely depend on new breakthroughs that meet the increasing demands for high-performance I/O in high-performance computing. Researchers today exploit the available compute power of petascale supercomputers to run larger, higher-fidelity simulations. At the same time, data-intensive applications have become common. These place additional strain on the I/O subsystem, primarily through the larger number of I/O transactions. Consequently, several critical challenges arise that are of utmost importance for computing at exascale. While the concurrency of next-generation supercomputers is expected to increase by up to three orders of magnitude, the bandwidth and access latency of the I/O subsystem are projected to remain relatively unchanged. Storage is therefore on the verge of becoming a serious bottleneck. Upcoming supercomputers are expected to include new non-volatile memory technologies to compensate for these limitations, but existing programming models and interfaces (e.g., MPI-IO) may not offer any clear technical advantage when applied to distributed intra-node storage, and especially not to byte-addressable persistent memories. Although increasing heterogeneity of compute nodes can provide benefits in terms of performance and thermal dissipation, this technological transformation will entail an increase in programming complexity, which will make it harder for scientific applications to benefit from these developments.

This thesis explores how programming models and interfaces need to evolve to overcome the aforementioned limitations. MPI storage windows are presented, a new concept that uses the MPI one-sided communication model together with MPI windows as a unified interface to program memory and storage. It is then demonstrated how MPI one-sided communication can benefit data analytics frameworks through a decoupled strategy, while integrating seamless fault tolerance and out-of-core execution. Furthermore, persistent coarrays are introduced to enable transparent resilience in Coarray Fortran, supporting the "failed images" feature recently added to the standard. Finally, a global memory abstraction layer is proposed that, inspired by the memory-mapped I/O mechanism of the operating system, exposes different storage technologies using conventional memory operations.

The results of these contributions are expected to have a significant impact on high-performance computing in several scientific application domains, on both current and next-generation supercomputers.

Acknowledgements

Despite my professional success at Microsoft, I always felt that my job was getting in the way of finishing my academic training. The truth is that, from the beginning, I wanted to pursue a Ph.D. in High-Performance Computing. It was a personal challenge that I wanted to accomplish, a missing part of my development that I felt could open up the possibility of becoming a professor in the near-term future. Furthermore, in the worst-case scenario, I could simply go back to industry with a greater set of skills.

However, my academic journey as a Ph.D. student ended up being far from ideal, I must admit, or at least not in terms of how I originally envisioned it. Without getting into too many details, diseases and other personal matters kept coming over and over. I constantly felt that I was being held back from fulfilling my greatest potential: I just could not believe it! But regardless of the circumstances, my effort and dedication always prevailed. I never gave up, no matter how tall the mountain was, or how dark the day sometimes appeared. And thanks to this effort, I now have the chance to reflect and look back on my journey. The reality is that I have plenty to be thankful for.

For instance, I am thankful that I have had the opportunity of experiencing the beauty of teaching, and of providing value to the future bright minds of supercomputing. I am thankful that I have had the pleasure to travel around the world, meeting not only some of my idols in the field (e.g., William Gropp or Rajeev Thakur, founders of MPI), but even scientists from NASA. I am also thankful that I established collaborations with major institutions, that I participated and earned voting rights for KTH inside the MPI Forum, and much more!

Above all, I frequently say that everything I do is the result of a team effort, not a single-person effort. Thus, I am especially thankful to my supervisors, Stefano Markidis and Erwin Laure, who trusted and believed in me since the beginning (even though I did not make it easy for them). I am thankful that I have had the chance to work alongside amazing colleagues, such as Sai Narasimhamurthy, Keeran Brabazon, Roberto Gioiosa, Gokcen Kestor, Alessandro Fanfarillo, Philippe Couvee, and many more, who have provided me with the knowledge, expertise, and even friendship to achieve my goals. I am also thankful that I have had great workmates, like Thor Wikfeldt, Van-Dang Nguyen, Himangshu Saikia, Wiebke Köpp, Anke Friederici, Xavier Aguilar, Harriett Johansson, Anna-Lena Hållner, Maria Engman, and plenty of other people, who made my life at KTH so much easier and more enjoyable.

More importantly, I am very thankful that I am still surrounded by an amazing family and friends that love me as much as I love them. I feel it is unnecessary to list each and every one, because they already know how much I appreciate them for being by my side no matter what.

From the bottom of my heart, I dedicate this Ph.D. to everyone that has made this journey possible: thank you very much for everything, and on to the next chapter!

Contents

Part I: Thesis Overview

1 Introduction
  1.1 Scientific Questions
  1.2 Thesis Outline
  1.3 List of Contributions

2 State-of-the-art of I/O on HPC
  2.1 MPI-IO
  2.2 SAGE Platform
  2.3 Streaming I/O
  2.4 Additional Approaches

3 Unified Programming Interface for I/O in MPI
  3.1 Extending MPI Windows to Storage
  3.2 Data Consistency Considerations
  3.3 Experimental Results
  3.4 Concluding Remarks

4 Decoupled Strategy on Data Analytics Frameworks
  4.1 Designing a Decoupled MapReduce
  4.2 Fault-tolerance and Out-of-Core Support
  4.3 Experimental Results
  4.4 Concluding Remarks

5 Transparent Resilience Support in Coarray Fortran
  5.1 Integrating Persistent Coarrays in Fortran
  5.2 Application Programming Interface
  5.3 Experimental Results
  5.4 Concluding Remarks

6 Global Memory Abstraction for Heterogeneous Clusters
  6.1 Creating a User-level Memory-mapped I/O
  6.2 Evict Policies and Memory Management Modes
  6.3 Experimental Results
  6.4 Concluding Remarks

7 Future Work
  7.1 Object Storage Support in MPI
  7.2 Automatic Data-placement of Memory Objects
  7.3 Failed Images in Coarray Fortran
  7.4 I/O Beyond Exascale

8 Conclusions

Bibliography

Glossary

Part II: Included Papers

Paper I "MPI windows on storage for HPC applications"
Paper II "Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks"
Paper III "Persistent Coarrays: Integrating MPI Storage Windows in Coarray Fortran"
Paper IV "uMMAP-IO: User-level Memory-mapped I/O for HPC"

Part I

Thesis Overview

Chapter 1

Introduction

The I/O landscape of High-Performance Computing (HPC) has remained characterized by the advancements in hardware and software storage technologies introduced almost three decades ago. During the 1990s, the concept of the Parallel File System (PFS) was consolidated in the field with the appearance of GPFS [126], enabling efficient access to multiple storage devices to accomplish high-performance I/O. In addition, the Message-Passing Interface (MPI), the de-facto standard for programming HPC clusters, introduced the first-ever I/O interface designed specifically for this type of use-case: MPI-IO [48, 140]. This interface provides the tools for distributed collective I/O operations at large scale, among other functionality and optimizations that are critical for scientific applications [141]. Due to these developments, storage was not considered a serious concern on HPC. As a matter of fact, the relative performance of compute and network resources lagged remarkably behind for more than a decade in comparison.

The situation, however, worsened over the years. The introduction of multi-core processors and accelerators in the 2000s, alongside better network fabrics and protocols, boosted the compute power of supercomputers to the petascale era [33]. At the same time, the performance of the storage subsystem remained roughly constant, lagging behind this trend. Figure 1.1 (a) illustrates a recent study published by IBM Research in which the compute, network, and storage resources are compared from 1990 until 2017 [152]. The illustration indicates that the storage subsystem is still several orders of magnitude behind the relative performance of the compute and network resources. Only in the current decade can it be observed that the trend is diverging, mainly thanks to the effort invested by the community in preparation for the next-generation supercomputers.

Notwithstanding, scientific applications are still often limited by the I/O subsystem performance [15, 16]. The reality is that HPC applications running on current petascale supercomputers are increasingly demanding high-performance I/O. From weather forecast codes to spectral element codes for magnetohydrodynamics [31, 131], scientists are aggressively taking advantage of the available compute power to run larger-scale and higher-fidelity simulations. This implies the movement of massive files from and to storage, which in turn means a heavier utilization of this resource. For instance, storage devices are often required for pre-/post-processing (e.g., visualization), and also for fault-tolerance (e.g., checkpointing). Hence, these simulations, and the data that they produce, pose an enormous amount of stress on the I/O subsystem.



Figure 1.1: (a) Relative performance of compute, network, and storage resources, in comparison with the status in 1990 [152]. Note the scale of the y-axis. (b) Resource utilization timeline during a magnetic reconnection simulation using iPIC3D [10].

As a reference example, Figure 1.1 (b) shows an approximate resource utilization timeline of a magnetic reconnection simulation using iPIC3D [72], captured with Arm Map [10]. The simulation utilizes 4096 processes distributed across 128 nodes running on Beskow, a Cray XC40 supercomputer at KTH Royal Institute of Technology [94]. iPIC3D is an implementation of the implicit moment Particle-in-Cell (PIC) method, used for the simulation of space and fusion plasmas. The implementation saves intermediate simulation results to binary files in Visualization Toolkit (VTK) format for visualization and data analysis [128], as well as restart information of the current simulation in Hierarchical Data Format v5 (HDF5) [43]. For the field values, such as the magnetic field and electric field, the implementation takes advantage of MPI collective I/O operations to allow all processes to efficiently write into a common file, eliminating the need for merging the individual output files from each process during post-processing. The restart information is, on the other hand, stored as independent HDF5 files containing the process-specific particle information, one per process.

Despite the heavy source code optimizations included in iPIC3D, the aforementioned figure demonstrates that the code currently spends over 40% of the execution time just on I/O operations. This is reflected in light-gray (dotted), while compute and network resources are illustrated in gray and dark-gray, respectively. As Beskow utilizes a PFS [9] with 165 Object Storage Target (OST) servers (footnote 1), it must be noted that the performance could have been much worse if iPIC3D were not using MPI collective I/O (i.e., due to the high process count). Furthermore, the simulation conducted is considerably lightweight, with only one checkpoint scheduled. On larger-scale simulations, the fault-tolerance mechanism of the code would affect the execution time significantly [110]. Thus, the result demonstrates that introducing any improvement in the I/O subsystem has the enormous potential of reducing the overall simulation times of a vast majority of scientific applications. In the case of iPIC3D, the execution time could ideally be reduced by almost half.

In addition to traditional scientific applications, emerging data analytics and machine learning applications are also becoming common on supercomputers [71].

Footnote 1: Files in Lustre are distributed in stripes across one or more OST servers to allow parallel access to storage. The stripe width can be configured when the file is first created, and it usually defaults to 1MB to match the RPC size [113].
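To make the per-process restart pattern described above concrete, the following sketch shows how a simulation code could write one independent HDF5 checkpoint file per MPI rank. This is a minimal illustration only, not the actual iPIC3D implementation: the dataset name, the particle count, and the file naming scheme are assumptions made for the example, and all error handling is omitted.

#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>

#define NUM_PARTICLES 1000               /* assumed local particle count */

/* Write the local particle buffer of this rank into its own restart file */
static void write_restart(const double *particles, int rank)
{
    char fname[64];
    hsize_t dims[1] = { NUM_PARTICLES };

    /* One independent file per process (file-per-process strategy) */
    snprintf(fname, sizeof(fname), "restart_rank%d.h5", rank);

    hid_t file  = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "particles", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank writes only its own data; no inter-process coordination is needed */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, particles);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
}

int main(int argc, char **argv)
{
    int rank;
    static double particles[NUM_PARTICLES];  /* placeholder payload */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    write_restart(particles, rank);
    MPI_Finalize();
    return 0;
}

At scale, the number of such files grows linearly with the process count, which is one of the reasons why the field data is instead written collectively into a common file.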


Figure 1.2: (a) Compute nodes on current supercomputers feature a "CPU + Memory-only" design, with storage provided through a PFS. (b) Upcoming clusters feature local storage technologies alongside specialized compute devices [127].

The recent advances in deep learning and convolutional networks have dramatically influenced the role of machine learning in a wide range of scenarios [12, 147]. This has been motivated by an increase in object classification and detection accuracy, alongside better tools for data mining that allow us to understand large datasets of unstructured information [138, 153]. The inference error rate of machine learning algorithms has become remarkably low as well, reaching a state where the capacity of humans has already been surpassed in certain tasks [63]. For instance, deep learning has fostered significant milestones in 3D protein folding discovery [36], with direct impact on our society. Nonetheless, the irregular nature of these use-cases poses an even larger stress on the I/O subsystem, mainly due to the elevated number of small I/O transactions. In addition, some of these applications require large amounts of main memory and force the integration of out-of-core techniques (footnote 2), especially in data analytics [6].

To overcome some of these limitations, recent trends indicate that both memory and storage subsystems of large-scale supercomputers are becoming heterogeneous and hierarchical [104]. Historically, HPC systems have frequently been designed as a large system for computing with no local storage, and a separate large system that offers persistency support (e.g., through a PFS). Compute nodes have commonly integrated one or more multi-core CPUs following a Non-Uniform Memory Access (NUMA) architecture (footnote 3), with only DRAM as local memory (e.g., 64GB). This configuration is illustrated in Figure 1.2 (a). However, the trend for large-scale computer design is diverging from the traditional compute-only node approach to data-centric node solutions that integrate a diversity of memory and storage technologies, as depicted in Figure 1.2 (b). Emerging storage technologies, such as Spin-Transfer Torque Random-Access Memory (STT-RAM), are evolving so rapidly that the gap between memory and I/O subsystem performances is thinning [64]. For this reason, next-generation supercomputers are expected to feature a variety of Non-Volatile Random-Access Memory (NVRAM), with different characteristics and asymmetric read/write bandwidths, next to traditional hard disks and conventional DRAM [83, 97]. The main goal is to reduce data movement inside the cluster, allowing the computations to be transferred right next to where the information is currently stored.

Footnote 2: The purpose of out-of-core is to utilize storage to move out in-memory pages when the capacity has been exceeded. Thus, storage becomes an extension of main memory [27, 140].

Footnote 3: NUMA is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor [58].

Figure 1.3: Summit, the supercomputer at Oak Ridge National Laboratory, is considered one of the pre-exascale reference machines [40]. Each of its 4600 heterogeneous nodes combines two 22-core IBM POWER9 processors, six NVIDIA Volta GV100 GPUs, 512GB of DDR4+HBM, and a dedicated tier of 1.6 Terabytes of NVRAM, for an aggregate peak performance of 200.8 PFLOPS (0.2 exaFLOPS).

This latter matter is of paramount importance for the success of the exascale supercomputer [14, 84]. An exascale machine features a compute power of 10^18 FLOPS (i.e., one exaFLOPS). This is an increase in concurrency of 100-1000x compared to current petascale machines [32]. The simulation scales and the number of data-intensive applications will, consequently, make the problem of parallel I/O efficiency at exascale a major concern. As a matter of fact, most recent supercomputers are already integrating NVRAM memories in order to accommodate the new wave of use-cases on HPC. Figure 1.3 provides a brief overview of Summit, considered the fastest supercomputer in the world (footnote 4) and the main reference pre-exascale machine [40, 151]. Located at Oak Ridge National Laboratory (ORNL), this supercomputer features over 4600 nodes with dual 22-core IBM POWER9 processors [123] and six NVIDIA Volta GV100 GPUs [89] (i.e., 2.4 million cores in total). The cluster delivers a peak performance of 200.8 petaFLOPS or 0.2 exaFLOPS. More importantly, Summit is the first leadership-class supercomputer to feature 1.6TB of NVRAM per compute node, which complements a traditional PFS provided by GPFS. As a consequence, programming models and interfaces must evolve to accommodate the inherent heterogeneity that supercomputers will feature over the next decade. Without any further considerations, scientists will find it difficult to take advantage of these developments.

On the other hand, it should also be noted that the success of the exascale supercomputer is largely dependent on novel breakthroughs in technology that effectively reduce both power consumption and thermal dissipation requirements [32]. The integration of persistent memory technologies reduces inter-node data movement, which in turn helps minimize the amount of energy required. But in addition to the changes in the storage hierarchy, data-centric applications on HPC also pose several constraints on general-purpose processors, mainly due to the irregularity of the memory accesses that they feature [65, 155]. These accesses have reduced temporal or spatial locality, incurring long memory stalls and large bandwidth demands. As a side effect, the power consumption and thermal dissipation requirements increase [109].

Footnote 4: As of the TOP500 ranking published in June 2019: https://www.top500.org/lists/2019/06


Figure 1.4: (a) High-level representation of the Myriad 2 VPU and one of its SHAVE vector processors [5, 78]. (b) Illustration of the form factor of the Myriad SoC.


Figure 1.5: Inference performance using the ILSVRC 2012 validation dataset [122] running on CPU, GPU, and multi-VPU configurations. The figures illustrate performance (a) per reference subset, (b) per batch size, and (c) per Watt [118].

The industry is shifting towards designing processors where cost, power, and thermal dissipation are key concerns [5]. For instance, recent hardware architectures integrate novel data formats that use tensors with a shared exponent in order to maximize the dynamic range of the traditional 16-bit floating-point data format for machine learning [26, 68]. In addition, and as previously illustrated in Figure 1.2 (b), compute nodes are also expected to embed Field-Programmable Gate Array (FPGA) devices [137], alongside several specialized co-processors that reduce the power envelope constraints while improving the performance on specialized tasks [90]. Intel's Movidius Vision Processing Unit (VPU) is an example (Figure 1.4) [5, 78]. This chip represents a new category of compute devices that aims at providing ultra-low power consumption during tensor acceleration, with an estimated Thermal Design Power (TDP) of less than 1 Watt.


Figure 1.5 demonstrates the possibilities of this type of co-processor during inference in convolutional networks over a large image dataset from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [122]. This is based on preliminary work on the topic [118]. The evaluations utilize a pre-trained network from the Berkeley Vision and Learning Center (BVLC), which follows the GoogLeNet work by Szegedy et al. [138]. The results indicate that a combination of several of these chips can potentially provide up to 3.5x higher throughput compared to reference CPU and GPU implementations, measured as the number of inferences conducted per Watt.

At the same time, while heterogeneous compute nodes can provide clear advantages for scientific applications, this technological transformation implicitly increases the programming complexity and introduces additional challenges for existing interfaces on HPC, such as MPI. Most of the aforementioned compute devices integrate High-Bandwidth Memory (HBM), Multi-Channel DRAM (MCDRAM), and other memory technologies that generally demand explicit data movement from applications. Additionally, allocating and moving data in such systems often requires the use of different programming interfaces as well, especially in regards to storage [114]. In particular, byte-addressable NVRAM technologies are expected to be accessible with a convenient load/store interface in the near-term future [96], while block-addressable NVRAM modules (e.g., PCIe-based 3D XPoint [7, 55]) still require read/write operations and potentially file-system support.

Furthermore, the introduction of local storage technologies will also disrupt how applications have traditionally performed I/O on HPC [48]. Due to historical limitations in the I/O capabilities of many parallel file systems, application developers have broadly utilized two major I/O strategies to improve the performance. These are depicted in Figure 1.6. The file-per-process I/O strategy (a) instructs each process to write into separate files [130]. This simple mechanism reduces the locking overhead on PFS like Lustre, but it is only suitable for a few hundred processes. On the other hand, the aggregator process I/O strategy (b) makes all processes send their data to dedicated processes that gather and write the data to storage [99]. The use of collective I/O operations in MPI-IO is often based on a similar mechanism. Most MPI implementations also provide the opportunity for a generalized two-phase strategy over collective read/write accesses to shared files, largely improving the sustained throughput [141].

Figure 1.6: High-performance I/O on current HPC clusters has historically been accomplished by (a) using a file per process, or (b) assigning a process to aggregate requests with the purpose of relaxing PFS locking [48, 99].

Yet, developers are now left with the open question of which interface is more suitable to access the diversified storage hierarchy of next-generation supercomputers. This fact alone poses complex challenges for scientific applications aiming to take advantage of the increased level of heterogeneity. For instance, while the widely-available dedicated I/O interfaces on HPC efficiently support collective access to parallel file systems (e.g., MPI-IO), these interfaces might not provide any clear technical advantage when targeting local storage, at least in comparison with straightforward solutions like POSIX-IO [1].
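To make the two traditional strategies of Figure 1.6 concrete, the following sketch contrasts them for a small per-process buffer. The file names, the buffer size, and the use of rank 0 as the single aggregator are assumptions made for this example; production codes typically rely on several aggregators or directly on MPI-IO collective operations, and error handling is omitted.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1024  /* integers produced per process (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int data[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int i = 0; i < CHUNK; i++) data[i] = rank;

    /* (a) File-per-process: every rank writes its own file. This avoids PFS lock
     *     contention, but the file count grows linearly with the process count. */
    char fname[64];
    snprintf(fname, sizeof(fname), "output_rank%d.bin", rank);
    FILE *f = fopen(fname, "wb");
    fwrite(data, sizeof(int), CHUNK, f);
    fclose(f);

    /* (b) Aggregator process: all ranks send their data to rank 0, which performs
     *     a single large, contiguous write into a shared file. */
    int *gathered = NULL;
    if (rank == 0) gathered = malloc((size_t)nprocs * CHUNK * sizeof(int));
    MPI_Gather(data, CHUNK, MPI_INT, gathered, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        FILE *shared = fopen("output_shared.bin", "wb");
        fwrite(gathered, sizeof(int), (size_t)nprocs * CHUNK, shared);
        fclose(shared);
        free(gathered);
    }

    MPI_Finalize();
    return 0;
}

The aggregation in (b) trades extra communication for fewer and larger storage requests, which is the same principle that the collective I/O path of MPI-IO generalizes.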
In fact, the absence of a "de-facto interface" for accessing local storage on HPC clusters has motivated the definition of intermediate solutions that feature semantics closer to traditional parallel file systems. The concept of the burst buffer I/O accelerator, incorporated in the Summit and Cori supercomputers, is a common example [91, 92]. Consequently, now that the complexity of compute nodes is increasing, simplified programming models for conducting I/O on HPC are necessary. Avoiding this challenge could lead to a complete lack or under-utilization of the I/O resources at exascale.

1.1 Scientific Questions

From the previous statements, we observe the inherent need to provide a seamless transition between existing and future I/O programming models on HPC. In this thesis, we aim at solving some of the most critical I/O challenges before exascale supercomputers become a reality. For this purpose, we identified three major open scientific questions that we want to address with the Doctoral studies:

[Overview diagram of the three scientific questions, spanning memory and storage: Q1 Heterogeneous I/O, Q2 Unified Programming Models, Q3 High-Performance I/O.]

The significance of each scientific question is briefly given below, based on some of the arguments provided in the introductory content of this chapter. Additionally, the descriptions of how we envision addressing each question are referenced in the next section and included in detail within the conclusions (Chapter 8).

[Q1] How can programming models and interfaces support heterogeneous I/O?
Description: Emerging storage technologies are thinning the existing gap between main memory and I/O subsystem performances [83]. The new non-volatile technologies, such as flash, phase-change, and spin-transfer torque memories, provide bandwidth and access latency close to those of DRAM memories [18]. As a consequence, memory and storage technologies are converging. This raises concerns in regards to most of the existing programming models and interfaces on HPC (e.g., MPI and its dedicated MPI-IO interface [48]). For instance, a vast majority of these interfaces were not designed for distributed intra-node storage, nor for the possibility of accessing byte-addressable persistent memories [64]. The same situation is observed with data analytics frameworks on HPC, whose architecture assumed a lack of local storage [61]. Moreover, there are other implications, like data consistency semantics, classic I/O performance optimizations originally meant for targeting a PFS [141], or resilience support in applications. Hence, it is of paramount importance to consider how programming models and interfaces are going to evolve to support mechanisms for data-placement across the different memory and storage tiers of next-generation supercomputers.

[Q2] How can the integration of emerging storage technologies be simplified?
Description: Over the past two decades, the variety of interfaces to access memory and storage has grown considerably on HPC. When targeting memory technologies like HBM, recent libraries have emerged that enable fine-grained control over memory properties through extensible heap management frameworks, such as memkind [13]. If applications require targeting storage, the diversification is even wider in comparison. From traditional POSIX-IO [1] (e.g., read/write), to dedicated MPI-IO collective operations [48] (e.g., MPI_File_read_all / MPI_File_write_all), the memory-mapped I/O mechanism of the OS [8] (i.e., mmap), or even object-based hierarchical representation libraries like HDF5 [42], the list is extremely long. Scientific application developers are, thus, already struggling to decide which one of the existing APIs is better for their particular use-case, which restricts introducing support for fault-tolerance or out-of-core mechanisms [140]. Hence, it is critical to define unified I/O interfaces that reduce the programming complexity and allow existing applications to take advantage of the changes in the storage hierarchy, ideally incurring minimal source code changes in comparison.
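To illustrate the diversity that this question refers to, the sketch below persists the same buffer twice: once through explicit POSIX-IO calls, and once through the memory-mapped I/O mechanism of the OS. The file names and buffer size are assumptions for the example, and error handling is omitted; the point is only that the two code paths, and their consistency semantics, look entirely different to the application even though the end result on storage is the same.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN 4096  /* bytes to persist (assumed) */

int main(void)
{
    char buffer[LEN];
    memset(buffer, 'x', LEN);

    /* Option 1: explicit POSIX-IO write */
    int fd1 = open("data_posix.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    write(fd1, buffer, LEN);
    close(fd1);

    /* Option 2: memory-mapped I/O; stores become durable only after msync */
    int fd2 = open("data_mmap.bin", O_CREAT | O_RDWR | O_TRUNC, 0644);
    ftruncate(fd2, LEN);                       /* size the backing file first */
    char *map = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd2, 0);
    memcpy(map, buffer, LEN);                  /* plain memory operations */
    msync(map, LEN, MS_SYNC);                  /* flush the mapping to storage */
    munmap(map, LEN);
    close(fd2);
    return 0;
}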

[Q3] How can HPC applications seamlessly improve the I/O performance?
Description: As the concurrency of exascale-grade supercomputers is expected to increase by up to 1000x [32], the bandwidth and access latency of the I/O subsystem are also expected to remain roughly constant in comparison. This makes storage a serious bottleneck. At the same time, scientific applications require retrieving and storing large amounts of data. In most cases, a substantial percentage of the overall execution time is spent just on I/O operations. We motivated this fact by illustrating a resource utilization timeline of iPIC3D [72] (Figure 1.1), in which over 40% of the execution time was dedicated to storing visualization and resilience information. In the case of data analytics frameworks, the storage subsystem is utilized as an input source, as well as to provide checkpoint/restart mechanisms [53]. However, the integration of local storage technologies will not, by itself, provide any direct benefits without further adaptations. Thus, it is crucial that programming models and interfaces seamlessly take advantage of the heterogeneity of future supercomputers to improve the overall performance perceived by applications.

1.2 Thesis Outline

This thesis is divided into two parts. The first part introduces the reader to the concepts and provides a summary of the main research work conducted throughout the Doctoral studies to address the three scientific questions, concluding with a comprehensive outline of future work on the topic. The description of each chapter is provided below:

• Chapter 1 contains the motivation of the Ph.D. and the outline of the thesis. The current section is part of the introductory chapter.
• Chapter 2 provides a brief overview of the state-of-the-art of I/O on HPC, including traditional interfaces (e.g., MPI-IO). The chapter also presents other novel concepts, such as Software-Defined Memory solutions (e.g., MDT).

• Chapter 3 proposes the use of the MPI one-sided communication model and MPI windows as a unique interface for programming different memory and storage technologies [Q1]. Applications benefit from this contribution with minimal source code changes [Q2]. More importantly, this represents one of the few efforts for heterogeneous I/O in MPI, the de-facto standard on HPC.
• Chapter 4 presents a decoupled strategy for data analytics frameworks to improve the performance on highly-concurrent workloads [Q3]. The integration of the MPI one-sided communication model also allows us to support fault-tolerance and out-of-core execution without further source code changes [Q1, Q2].
• Chapter 5 explores the possibility of providing seamless resilience in Coarray Fortran (CAF) applications [Q2]. The approach can also be used as a general-purpose API for I/O [Q1]. This contribution has the potential of improving numerous scientific applications written in Fortran.
• Chapter 6 describes a novel memory abstraction mechanism designed to map storage devices into memory space transparently [Q1, Q2]. Presented as a system library, the approach considerably improves performance compared to existing I/O solutions, which can benefit both MPI and CAF implementations [Q3].
• Chapter 7 outlines future work on the topic, extending some of the contributions presented in the previous chapters.
• Chapter 8 gives a summary in regards to the scientific questions and concludes the work. The chapter is then followed by the list of references and the glossary.

After the first part of the thesis, the second part contains the different papers that represent most of the source material utilized for the chapters.

1.3 List of Contributions

This thesis provides an overview of four peer-reviewed publications that represent the main research work conducted during the Ph.D. studies. The aforementioned chapters are based on these manuscripts, listed in chronological order below:

Paper I "MPI Windows on Storage for HPC Applications"
Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, and Stefano Markidis.
Parallel Computing, Vol. 77, pp. 38–56, Elsevier, 2018.

Paper II "Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks"
Sergio Rivas-Gomez, Sai Narasimhamurthy, Keeran Brabazon, Oliver Perks, Erwin Laure, and Stefano Markidis.
Proceedings of the 20th International Conference on High Performance Computing and Communications (HPCC'18), pp. 921–927, IEEE, 2018.

Paper III "Persistent Coarrays: Integrating MPI Storage Windows in Coarray Fortran"
Sergio Rivas-Gomez, Alessandro Fanfarillo, Sai Narasimhamurthy, and Stefano Markidis.
Proceedings of the 26th European MPI Users' Group Meeting (EuroMPI 2019), pp. 3:1–3:8, ACM, 2019.

Paper IV "uMMAP-IO: User-level Memory-mapped I/O for HPC"
Sergio Rivas-Gomez, Alessandro Fanfarillo, Sebastién Valat, Christophe Laferriere, Philippe Couvee, Sai Narasimhamurthy, and Stefano Markidis.
Proceedings of the 26th IEEE International Conference on High-Performance Computing, Data, and Analytics (HiPC'19), IEEE, 2019.

In addition, the following publications represent an important part of the research conducted during the Ph.D. studies, establishing the foundations of the work described in the manuscripts listed above. Moreover, part of these contributions defines the initial steps for future work on the topic of data-placement of memory objects on heterogeneous clusters (Paper IX). Once again, the publications are listed in chronological order:

Paper V "Extending Message Passing Interface Windows to Storage"
Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2017), pp. 727–730, IEEE, 2017.

Paper VI "MPI Windows on Storage for HPC Applications"
Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, and Stefano Markidis.
Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI/USA 2017), ACM, 2017.
Best Paper Award / Remarks: Invited to publish an extended version of the manuscript in a special edition of Parallel Computing (Paper I).

Paper VII "MPI Concepts for High-Performance Parallel I/O in Exascale"
Sergio Rivas-Gomez, Ivy Bo Peng, Stefano Markidis, and Erwin Laure.
Proceedings of the 13th HiPEAC International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES'17), pp. 167–170, HiPEAC, 2017.

Paper VIII "SAGE: Percipient Storage for Exascale Data-centric Computing"
Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Stefano Markidis, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Dirk Pleiter, and Shaun De Witt.
Parallel Computing, Vol. 83, pp. 22–33, Elsevier, 2018.

Paper IX "Exploring the Vision Processing Unit as Co-processor for Inference"
Sergio Rivas-Gomez, Antonio J. Peña, David Moloney, Erwin Laure, and Stefano Markidis.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW'18), pp. 589–598, IEEE, 2018.

Lastly, the author of this thesis has also been involved in three different Horizon 2020 projects from the European Union, contributing relevant research that has led to several scientific publications and conference presentations (e.g., at the EASC18 conference). The projects are Percipient Storage for Exascale Data Centric Computing (SAGE), its extension (Sage2), and Exascale Programming Models for Heterogeneous Systems (EPiGRAM-HS).

The following deliverables were published and successfully approved by the EU-H2020 reviewing committee during the Ph.D. studies:

SAGE D4.1 "State-of-the-Art of I/O with MPI and Gap Analysis"
Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng, Giuseppe Congiu.
Percipient Storage for Exascale Data Centric Computing (SAGE), 2016.

SAGE D3.1 "Tools Based Application Characterization"
Keeran Brabazon, Oliver Perks, Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng.
Percipient Storage for Exascale Data Centric Computing (SAGE), 2017.

SAGE D4.6 "Global Address Space Concepts for Percipient Storage"
Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng.
Percipient Storage for Exascale Data Centric Computing (SAGE), 2017.

SAGE D4.9 "Validation Report: Message Passing & Global Address Space for Data-centric Computing"
Sergio Rivas-Gomez, Stefano Markidis.
Percipient Storage for Exascale Data Centric Computing (SAGE), 2018.

E-HS D3.2 "Initial Design of a Memory Abstraction Device for Diverse Memories"
Adrian Tate, Tim Dykes, Harvey Richardson, Sergio Rivas-Gomez, Stefano Markidis, Oliver Thomson Brown, Dan Holmes.
Exascale Programming Models for Heterogeneous Systems (EPiGRAM-HS), 2019.

Sage2 D2.4 "Global Memory Abstraction – Architecture Definition"
Sai Narasimhamurthy, Sergio Rivas-Gomez, Stefano Markidis, Hua Huang, Gregory Vaumourin, Sebastién Valat, Christophe Laferriere, Philippe Couvee, and Ganesan Umanesan.
Percipient Storage for Exascale Data Centric Computing 2 (Sage2), 2019.

Sage2 D3.1 "Sage2 Ecosystem Design Document"
Salem El Sayed M., Dirk Pleiter, Philippe Deniel, Thomas Leibovici, Jacques-Charles Lafoucriere, Sergio Rivas-Gomez, Stefano Markidis, Patrick Wohlschlegel, Nikos Nikoleris, Iakovos Panourgias, Magnus Morton, Andrew Davis, Shaun De Witt, Ganesan Umanesan, and Sai Narasimhamurthy.
Percipient Storage for Exascale Data Centric Computing 2 (Sage2), 2019.

The author has also contributed to several open-source projects (e.g., iPIC3D) and has been involved in multiple dissemination events of relevance in the field.

1.3.1 Participation in Papers

The purpose of this subsection is to briefly indicate the participation and role of the author of this thesis in the four main papers presented in the following chapters. The descriptions below exceptionally utilize the term "the student" to refer to the author.

Paper I

The student came up with the idea of integrating MPI storage windows through performance hints, as well as designed and implemented its reference implementation in collaboration with part of the co-authors. This also includes the heterogeneous window allocations. He also conducted the experimental results and the adaptation of the different benchmarks, and wrote part of the methodology, experimental results, and discussion of the paper, including figures and other material. The writing was heavily guided by the main supervisor, as part of the learning process of the Doctoral studies.

Paper II

The student designed and implemented the decoupled strategy of MapReduce in collaboration with some of the co-authors, and proposed the integration of MPI storage windows to support seamless fault-tolerance and out-of-core execution for data analytics. The execution timelines stem from an intensive collaboration between our group and Arm, which provided us with insight into the performance limitations of MPI one-sided communication in MPI implementations. The student also conducted the experimental results and wrote most of the methodology, experimental results, and discussion of the paper, including figures and other material.

Paper III

The student envisioned the possibility of making coarrays persistent in Coarray Fortran, in collaboration mostly with one of the co-authors who is heavily involved in OpenCoarrays. He designed and participated in the implementation of the approach, as well as integrated some of its features (e.g., segment size configuration). While in this work the student was only partially involved in conducting the experimental results, he wrote the full paper (except for the related work), including figures and other material.

Paper IV

After the experience obtained with the memory-mapped I/O mechanism of the OS in MPI storage windows (Paper I) and subsequent publications (Papers II & III), the student observed the need for a Global Memory Abstraction (GMA) mechanism designed for HPC. Thus, he came up with the idea and implementation of uMMAP-IO in collaboration with the co-authors, as well as conducted the experiments and the adaptation of the different applications evaluated. The student wrote the full paper, including figures and other material, fulfilling the objective for autonomous research work of the Doctoral studies.

Chapter 2

State-of-the-art of I/O on HPC

In this chapter, we provide an overview of some of the most relevant I/O programming models and advances available on HPC. From the de-facto standard MPI-IO [48, 142] to novel clusters that aim at revolutionizing the storage paradigm while bringing applications to the exascale [84], the chapter introduces the reader to the I/O landscape before presenting the rest of the contributions of the thesis. The content should ease the understanding of these contributions.

2.1 MPI-IO

The Message-Passing Interface (MPI) began as a standardization process to converge message-passing libraries into a common and truly portable API [47]. As of today, MPI is considered the de-facto standard for programming large-scale HPC systems [119]. The MPI Forum is the organization that manages and updates the standard. It is a large group that involves partners from industry and academia from all around the world (footnote 1). Since the first revision published back in 1994, the MPI standard has been extended twice, with a fourth edition expected during the 2020 decade [82]:

• 1994: MPI-1 release
• 1997: MPI-2 release (MPI-IO and RMA included)
• 2008: MPI-1.3 and MPI-2.1 revisions
• 2009: MPI-2.2 revision
• 2012: MPI-3 release
• 2015: MPI-3.1 revision (current revision)
• 202X: MPI-4 release (expected)

The original specification of MPI, named MPI-1 [79], provided not only point-to-point and collective communication, but also a simple and versatile interface that allowed for both blocking and non-blocking variants. Other advanced functionality was also included, such as virtual topologies of processes that resemble the underlying hardware, or datatypes to describe the layout of the data represented in memory, among others.

Footnote 1: Our group at KTH Royal Institute of Technology has participated in the discussions.

From the beginning, the goal of MPI has always been to allow HPC applications to take advantage of massively parallel systems and their integrated hardware and software technologies.

As a matter of fact, incorporating further functionality based on application requirements was one of the main motivations of the first effort to expand MPI in 1997 with the release of MPI-2 [80]. Shortly after the initial publication of the standard, the new release included support for Remote Direct Memory Access (RDMA) with the MPI one-sided communication model (footnote 2), which complements the traditional two-sided communication model (e.g., point-to-point). It also contained dynamic creation of processes, and enhanced support for multi-threaded communications. More importantly, the MPI-2 specification introduced comprehensive support for parallel I/O with the definition of MPI-IO [48]. This new addition to the MPI standard aims at providing high-performance simultaneous access to individual and shared files under very high process counts, fully exploiting the capabilities of modern parallel file systems on HPC (e.g., GPFS [126] or Lustre [9]) and overcoming most of the previous limitations for parallel I/O [142]. This is accomplished by the following main functionality:

• POSIX-like File Operations. Processes can individually or collectively open/close, seek, and read/write files using an interface comparable to that of the POSIX-IO standard (Subsection 2.4.1) [1].
• File View Representation. Through a powerful derived datatype mechanism inherited from the original MPI-1 standard, the interface allows expressing patterns for selective access to different sections of a file.
• Collective I/O Operations. Processes can perform read/write operations simultaneously over shared files while maintaining high throughput. A vast majority of scientific applications on HPC utilize collective I/O (e.g., Nek5000 [131]).
• Non-Blocking I/O Operations. As in the message-passing counterpart of the API, MPI-IO also defines non-blocking I/O operations to allow efficient, decoupled storage data accesses. The contribution in Chapter 4 partially utilizes this approach.
• Performance Hints. The interface accepts hints that can be passed to the specific implementation to manipulate its behaviour and achieve greater performance. In principle, the purpose of MPI hints is to provide assistance to MPI implementations, but they can also be used to extend their functionality, as motivated in Chapter 3 (a brief sketch of this mechanism is given below).

The last item highlights the importance of MPI implementations, as software libraries that implement the MPI standard. The specification released by the MPI Forum defines the programming model and recommendations that the specific implementations must follow. However, it is up to the implementors to comply with the specification, or even to extend it (e.g., with fault-tolerance support [37]). Thus, the MPI specification does not define the underlying algorithms utilized to accomplish the functionality of the standard.

For instance, ROMIO, the MPI-IO module of one of the most popular MPI implementations, MPICH [50], integrates multiple optimizations that can be configured through performance hints [141]. Figure 2.1 illustrates two critical contributions to the field from ROMIO that have defined the state-of-the-art for I/O on HPC over the years.

Footnote 2: RDMA is a memory access mechanism that enables access from the memory space of one compute node into another. This provides increased throughput and low-latency communications. More information about MPI one-sided communication is provided in the next chapter.
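As a brief illustration of the performance-hint mechanism listed above (the same mechanism that Chapter 3 later reuses to extend MPI windows to storage), the following fragment passes a few well-known ROMIO and striping hints when opening a shared file. The hint values shown are arbitrary examples; which hints are honored, and with what effect, depends entirely on the MPI implementation and the underlying file system, and unrecognized hints are silently ignored.

...
MPI_Info info;
MPI_File fh;

MPI_Info_create (&info);
MPI_Info_set (info, "romio_cb_write", "enable");    // force collective buffering on writes
MPI_Info_set (info, "cb_buffer_size", "16777216");  // 16MB two-phase buffer per aggregator
MPI_Info_set (info, "striping_factor", "8");        // request 8 OSTs when creating the file
MPI_Info_set (info, "striping_unit", "1048576");    // 1MB stripes, matching the Lustre default

MPI_File_open (MPI_COMM_WORLD, "output.dat",
               MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
// ... collective writes (e.g., MPI_File_write_all) ...
MPI_File_close (&fh);
MPI_Info_free (&info);
...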


Figure 2.1: Illustration of the Data Sieving (a) and Two-Phase I/O (b) optimizations, critical for high-performance collective I/O in MPI-IO implementations [141, 142].

The first one is Data Sieving (a), whose purpose is to make as few requests to the PFS as possible, in order to reduce the effects of high latency on storage operations. The technique enables an MPI implementation to make fewer large, contiguous requests to the PFS, even if the user requested several small, non-contiguous accesses. Thereafter, ROMIO re-distributes the content to accommodate the request by the application. On the other hand, as applications provide a file view to represent the access pattern, the Two-Phase I/O optimization (b) aims at splitting the access into two steps to improve the I/O performance. In the first phase, processes access data assuming a distribution in memory that results in each process making single, contiguous accesses. In the second phase, processes redistribute the data among themselves to match the desired access pattern. The advantage of this method is that, by making all file accesses large and contiguous, the I/O time is reduced significantly. In other words, the added cost of inter-process communication for redistribution is relatively small compared with the savings in I/O time.

The following subsections discuss data consistency considerations in MPI-IO, which must be taken into account for the proposed I/O alternative in MPI presented in Chapter 3. The section concludes with a simple source code example and some additional remarks.

2.1.1 Data Consistency in MPI-IO

When multiple processes perform I/O operations over a shared file, there are situations where data consistency cannot be guaranteed, either partially or completely [120]. This is the case when part of these processes access the file through individual storage operations and perform writes. Not using proper collective I/O operations makes it difficult for MPI implementations to understand the dependencies among processes and their desired access pattern. Thus, explicit data consistency semantics are required from applications, which are responsible for organizing their source code following one of these three options:

◦ File Atomicity. After opening a file and before performing any I/O operations, the operational mode of the MPI-IO implementation can be set to atomic. This means that any data dependencies between processes are guaranteed and changes are immediately visible without further action. The application, however, must introduce synchronization barriers to avoid race conditions (e.g., MPI_Barrier).
◦ Explicit Synchronization. The explicit synchronization primitives in MPI-IO, such as MPI_File_sync, guarantee that the changes performed by the process are flushed to the PFS. Alternatively, if the application closes a file, MPI-IO will also implicitly flush any temporarily cached values to the PFS. This means that processes that need to collaborate in a producer-consumer fashion can ensure data consistency if the consumer processes open the file after the producers have closed it.
◦ Write Sequences. Applications are encouraged to design their source code to follow write sequences. In MPI-IO, a write sequence starts when a file is first opened or when a synchronization is triggered (e.g., MPI_File_sync), and ends when the file is closed or when another file synchronization point is generated. All the processes involved in the same communicator must follow this pattern (a short fragment illustrating this pattern is given after Listing 2.1 below).

Note, however, that the use of the file view representation and collective I/O guarantees data consistency according to the semantics specified in the MPI standard. In addition, applications can always guarantee data consistency as long as no overlap between processes exists. For performance reasons, the division should be a multiple of the stripe size of the PFS (e.g., in Lustre, the stripe defaults to 1MB [9]).

2.1.2 Source Code Example

Listing 2.1 illustrates a very simple source code example in MPI-IO, in which a group of processes aims at collectively reading an input file in groups of CHUNK_SIZE integers. The example begins by creating the derived datatype that represents the access pattern requested by the rank, which also encodes the displacement between processes. Thereafter, the target file is opened and the file view is defined based on the derived datatype. At this point, MPI-IO has enough information to perform the collective read operation. The example concludes by closing the file handle.

...
MPI_Aint length = CHUNK_SIZE * sizeof(int);  // Length of request (bytes)
MPI_Aint extent = num_ranks * length;        // Displacement between procs.
MPI_Offset disp = rank * length;             // Displacement of the rank
MPI_Datatype contig, filetype;

// Create the datatype that represents the access pattern
MPI_Type_contiguous (CHUNK_SIZE, MPI_INT, &contig);
MPI_Type_create_resized (contig, 0, extent, &filetype);
MPI_Type_commit (&filetype);

// Open the file in "read-only" mode
MPI_File_open (MPI_COMM_WORLD, "filename", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

// Set the file view and perform the collective read
MPI_File_set_view (fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);
MPI_File_read_all (fh, buffer, CHUNK_SIZE, MPI_INT, MPI_STATUS_IGNORE);

// Close the file handle
MPI_File_close (&fh);
...

Listing 2.1: Source code example that shows how to use MPI-IO to perform a collective read operation of CHUNK_SIZE integers (per process) on a shared file.
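To complement Listing 2.1, the fragment below sketches the explicit synchronization and write sequence options from Subsection 2.1.1 in a producer-consumer setting: every rank writes its own region of a shared file, flushes it with MPI_File_sync, and only after a barrier (which separates the write sequence from the subsequent read sequence) reads back the region of a neighbouring rank. It assumes that fh was opened on MPI_COMM_WORLD with MPI_MODE_RDWR and that CHUNK_SIZE, rank, and num_ranks are defined as in Listing 2.1; the buffer names out_buf and in_buf are introduced here purely for illustration.

...
MPI_Offset my_off   = (MPI_Offset)rank * CHUNK_SIZE * sizeof(int);
MPI_Offset next_off = (MPI_Offset)((rank + 1) % num_ranks) * CHUNK_SIZE * sizeof(int);

// Write sequence: each rank writes its own, non-overlapping region and flushes it
MPI_File_write_at (fh, my_off, out_buf, CHUNK_SIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_sync (fh);

// The barrier guarantees that no rank starts reading before all writes are flushed
MPI_Barrier (MPI_COMM_WORLD);

// Read sequence: sync again so that no stale, locally cached data is observed,
// then read the region written by the neighbouring rank
MPI_File_sync (fh);
MPI_File_read_at (fh, next_off, in_buf, CHUNK_SIZE, MPI_INT, MPI_STATUS_IGNORE);
...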

2.1.3 Concluding Remarks about MPI-IO

Despite the versatility of the MPI-IO interface and some of the advancements described (e.g., Two-Phase I/O), there are several considerations that must be taken into account. First and foremost, MPI-IO has not received any major update since its initial release in 1997. The subsequent releases published by the MPI Forum, such as MPI-3 in 2012, did not substantially evolve the parallel I/O interface. In fact, at the time of writing this thesis, no active effort exists in regards to I/O inside the MPI Forum [82], leaving scientific applications with no clear solution in MPI to accommodate the heterogeneity that HPC clusters already feature. On top of this challenge, the programming complexity of MPI-IO has remained high over the years. The source code example illustrated utilizes the simplest representation of a file view, but this can differ dramatically when integrating indexed or structured datatypes, or when targeting a block-based 2D or 3D grid [48]. In the next chapter, this thesis presents a novel interface for I/O inside MPI that requires only minor changes to the standard. Readers can obtain further information about MPI-IO in the following references: [48, 115].

2.2 SAGE Platform

The Percipient Storage for Exascale Data Centric Computing (SAGE) platform, funded by the Horizon 2020 program of the European Union, is a project that proposes hardware supporting a multi-tiered I/O hierarchy and associated intelligent management software [84]. The goal is to provide a demonstrable path towards exascale. Furthermore, SAGE proposes a radical approach to extreme-scale HPC by moving traditional computations, typically done in the compute cluster, to the storage subsystem. This has the potential of significantly reducing the overall energy footprint of the cluster, helping to move towards the "Performance / Watt" targets of exascale-class supercomputers [136]. In addition, it must be noted that the contributions of this thesis are mostly developed around the SAGE and Sage2 EU-H2020 projects.

The SAGE platform mainly consists of a Unified Object Storage Infrastructure (UOSI) that contains multiple tiers of storage device technologies at the bottom of the stack (Figure 2.2). The system does not require any specific storage device technology type, and it accommodates upcoming NVRAM (e.g., 3D XPoint [55]), reliable Solid-State Drive storage (e.g., V-NAND [35]), high-performance hard disks (e.g., Hybrid SSHD [108]), and archival-grade storage (e.g., LTO-based tape drives [11]). For the NVRAM-based Tier-1, the project currently integrates Intel's 3D XPoint technology in a block device NVMe form-factor [55]. A potential Tier-0 with Non-Volatile Dual In-line Memory Module (NVDIMM) units is proposed in Sage2, allowing for byte-addressable data accesses [125]. The contribution of Chapter 6 is one of the main efforts inside Sage2 to define a Global Memory Abstraction (GMA) mechanism that allows seamless access to Tier-0 [112].

Regardless of the storage technology utilized, the main contribution of SAGE is to incorporate each of these tiers into standard form-factor enclosures that provide their own compute capability. This is enabled by standard x86 embedded processing components, connected through an InfiniBand Fourteen Data Rate (FDR) network [41], and a rich software ecosystem capable of exploiting the Unified Object Storage Infrastructure. The SAGE software consists of the following main layers:

(Figure 2.2 layers, top to bottom: Data-centric Applications; SAGE APIs for Object Storage; Unified Object Storage Infrastructure with Tier-1 / NVRAM (e.g., 3DXP / STT-RAM), Tier-2 / SSD Flash (e.g., 3D V-NAND / NAND), Tier-3 / High-Performance Hard Disks (e.g., Hybrid SSHD), and Tier-4 / Archival-grade Storage Disks (e.g., LTO-based Tape Drives).)

Figure 2.2: SAGE conceptual object storage architecture, integrating a multi-tiered I/O hierarchy with intelligent management software [84]. The contributions of this thesis are mostly developed around the SAGE and Sage2 EU-H2020 projects.

• Mero. The base of the software stack that provides distributed object storage support. Mero is considered an exascale-capable object storage infrastructure, with the ability to read / write objects and the possibility of storing metadata through a Key-Value Store (KVS). The core also provides resource management of caches, locks, extents, and hardware reliability.

• Clovis. Provides the higher-level interface to Mero. Clovis is a rich, transactional storage API that can be used directly by applications and layered with traditional interfaces, such as POSIX-IO. The Clovis I/O interface provides functions related to objects and indices (KVS) for storing and retrieving data. A Clovis object is an array of blocks of power-of-two size in bytes. Objects can be read from and written to at block-level granularity, and can be deleted at the end of their lifetime.

• Function Shipping. This component allows applications to run data-centric, distributed computations directly on the storage nodes where the data resides. Moreover, the computations offloaded to the storage cluster are designed to be resilient to errors. Well-defined functions are shipped from the use cases to storage through simple Remote Procedure Call (RPC) mechanisms.

• Tools and APIs. SAGE contains a set of tools for I/O profiling and optimized data movement across the different platform tiers at the top of the SAGE software stack. Such tools include support for data analytics, telemetry recording (e.g., Arm MAP [10]), and more. Additionally, SAGE offers several high-level APIs designed to support the cluster, which mostly comprise the contributions presented in the next chapters of this thesis.

To conclude this section, it is worth mentioning that the base hardware platform of SAGE was successfully installed at the Jülich Supercomputing Centre in 2017. It is currently being updated with Arm-based Cavium ThunderX2 processors [74], as part of the Extreme Scale Demonstrators (ESD) initiative. More information about SAGE and Sage2 can be found on the official website of the project (http://www.sagestorage.eu) and in the following references: [84, 124].


Figure 2.3: With streaming I/O offloading, some of the processes launched for the application are dedicated to conducting I/O operations [99]. Compared to the use of MPI-IO collective operations, here the I/O processes do not perform computations.

2.3 Streaming I/O

Streaming I/O offloading is a strategy that reserves MPI processes dedicated to performing I/O operations using MPI-IO, while the rest of the MPI processes carry out the computation of the HPC application [99]. The compute processes send data to the I/O processes through asynchronous and irregular data streams, and the I/O processes thereafter carry out the MPI-IO operations either individually or collectively. The main advantage of this approach is that it allows the compute processes to perform only computations in parallel, while the I/O processes carry out the storage operations.

In principle, the streaming I/O mechanism decouples the I/O operations from the majority of the MPI processes. The complexity of the I/O operations depends only on the size of the group of I/O processes; for this reason, a small group of I/O processes can keep the complexity of the storage operations low, even when the number of compute processes increases. This differs considerably from an approach based only on collective I/O operations from MPI-IO, which would require all processes in one communicator to write to or read from a shared file collectively. Hence, the complexity of maintaining the consistency semantics and optimizing the file access in the collective I/O operations increases at the same pace as the number of processes.

Moreover, the technique can be particularly useful for irregular I/O in HPC applications, as the communication between the compute processes and I/O processes is fine-grained and irregular. MPI collective I/O supports multiple processes writing to one shared output file, which, compared to an individual output file per process, can largely reduce the metadata and locking constraints of the PFS (see Section 2.1). Such overhead increases with the number of MPI processes as well. When applications have irregular I/O patterns, each process can have a different amount of data to output, and the amount of data can change at each output cycle. For this reason, at each output cycle, it is necessary to determine the offset in the file view and re-compute it manually per process.

Figure 2.3 illustrates the pipelining of application computation through I/O offloading with the streaming I/O technique, where the light-gray boxes represent the MPI processes that carry out the computations of the HPC application, and the dark-gray boxes the ones dedicated to the I/O operations. These two groups of MPI processes are connected by asynchronous data streams, defined by parallel I/O operations.


Figure 2.4: Weak scaling using iPIC3D comparing (a) MPI_File_write_all and (b) MPI_File_write_shared to the streaming I/O technique. The evaluations run on Beskow [94] and assign one I/O process for every 15 compute processes.

At the runtime of the application, the group of compute processes carries out the simulation. Once the processes generate data that needs to be saved to disk, they stream it out to the group of I/O processes. This streaming operation is an individual non-blocking communication operation, allowing the compute process to return immediately without waiting for completion (a simplified sketch of this pattern is given at the end of this section). The group of I/O processes keeps receiving data from the compute processes, and the requests are processed according to the parallel I/O operation attached to the data streams on a first-come, first-served basis. Furthermore, as these I/O processes are not involved in the computation of the application, they can allocate large memory buffers to hold the data received from multiple compute processes before saving it, providing efficient data aggregation and reducing interactions with the PFS.

The MPI Streams library [100] provides an interface to define the tasks to be performed on the dedicated I/O processes. The parallel I/O processes can also perform post-processing or even online processing with the appropriate configuration, a requirement for the function shipping concept of the SAGE platform. When an application has been adapted, the performance over a PFS (e.g., GPFS) generally improves considerably compared to using plain MPI-IO operations. Figure 2.4 illustrates a weak scaling evaluation utilizing a modified iPIC3D [72] and two MPI-IO write strategies in comparison with the streaming I/O technique. The evaluations run on Beskow, a supercomputer at KTH Royal Institute of Technology [94], and assign one I/O process for every 15 compute processes. From the results, it can be observed that the speedup obtained is over 3.5× when running up to 8192 processes using MPI_File_write_all (i.e., collective I/O operation), and 12.3× when using MPI_File_write_shared (i.e., individual I/O operation).

Despite these results, the streaming I/O technique has several drawbacks. The most important one is that it requires considerable effort to adapt applications to incorporate this model. Worse, the technique is only valid for targeting a PFS and forces part of the processes to remain idle, which can result in under-utilization of the cluster for certain applications. More information can be acquired in the following references: [99, 100].
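The following sketch condenses the offloading pattern described above. It is not the interface of the MPI Streams library [100]; the chunk size, tag, and I/O rank are illustrative assumptions.

// Simplified sketch of streaming I/O offloading (not the MPI Streams API):
// compute ranks stream fixed-size chunks with non-blocking sends, while a
// dedicated I/O rank receives them and writes to a shared file with MPI-IO.
// CHUNK, IO_RANK, and TAG_IO are illustrative assumptions.
#include <mpi.h>

#define CHUNK   1024
#define IO_RANK 0
#define TAG_IO  77

void compute_step(double *data, MPI_Comm comm, MPI_Request *req)
{
    // Returns immediately; the I/O rank consumes the chunk asynchronously.
    MPI_Isend(data, CHUNK, MPI_DOUBLE, IO_RANK, TAG_IO, comm, req);
}

void io_loop(MPI_File fh, MPI_Comm comm, int num_chunks)
{
    double buf[CHUNK];
    MPI_Status status;

    for (int i = 0; i < num_chunks; i++) {
        // First-come, first-served: accept chunks from any compute rank
        MPI_Recv(buf, CHUNK, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_IO, comm, &status);
        MPI_File_write_shared(fh, buf, CHUNK, MPI_DOUBLE, MPI_STATUS_IGNORE);
    }
}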

2.4 Additional Approaches

The previous sections have briefly illustrated some of the most relevant I/O advances in the field of High-Performance Computing. Nonetheless, it must also be mentioned that other alternative approaches are available, such as traditional POSIX-IO [1] or even the native support in programming languages like Fortran [106].

2.4.1 Operating System Support

The Operating System (OS) has traditionally provided support and services for I/O that applications can easily integrate into their workflow [8]. Compared to MPI-IO on HPC, this support is designed mostly for general-purpose, regular applications. The Portable Operating System Interface (POSIX) is an IEEE standard that helps in obtaining compatibility and portability between operating systems [1]. The POSIX-IO specification enables file access and manipulation through a very simple set of system calls, such as open / close, seek, read / write, and even variations like pread / pwrite that provide further flexibility (e.g., seek + read). Moreover, the Portable Operating System Interface for Asynchronous I/O (POSIX-AIO) allows applications to initiate one or more I/O operations to be performed asynchronously by the OS in the background. The application can elect to be notified of the completion of the I/O operation in a variety of ways: for instance, by delivery of a signal, by instantiation of a thread, or not at all.

Furthermore, the OS also integrates mechanisms to bypass buffering in main memory and improve the performance on emerging non-volatile memories [144]. Direct Access for files (DAX) is a kernel feature that prevents the use of the OS page cache to buffer read / write operations from / to files. While such buffering is useful for traditional block devices, byte-addressable NVDIMM or state-of-the-art NVRAM memories do not require the page cache, as it would imply unnecessary copies of the original data. DAX removes the extra copies by performing read / write operations directly on the storage device from the CPU cache. The pmem.io effort by Intel Corporation utilizes this mechanism with special instructions to improve the performance [121].

In addition, the OS features a concept called memory-mapped I/O, or simply mmap (i.e., after its system call). This mechanism creates a mapping of a file in the virtual address space of the process. A memory-mapped file is, thus, a segment of virtual memory that has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically physically present on a storage device (e.g., SSD), but it can also be a device, a shared memory object, or any other resource that the OS can reference through a file descriptor. Once established, the correlation between the file and the memory space permits treating the mapped portion as if it were main memory. In other words, applications utilize simple load / store operations to access the storage device. An HPC-oriented alternative to mmap is presented in Chapter 6, overcoming most of its limitations [112]. More information about OS support for I/O can be found in: [8].
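As a brief illustration of the mmap mechanism described above, the following sketch maps a file, modifies it with a plain store, and flushes the change with msync; the path and the modified byte are placeholders.

// Minimal memory-mapped I/O sketch: the file is mapped into the address
// space, accessed with plain loads/stores, and flushed back with msync.
// The path and the modified byte are illustrative.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int update_first_byte(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat sb;
    fstat(fd, &sb);

    char *addr = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    addr[0] = 'X';                      // store directly into the mapping
    msync(addr, sb.st_size, MS_SYNC);   // flush dirty pages to the device

    munmap(addr, sb.st_size);
    close(fd);
    return 0;
}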

2.4.2 Memory Drive Technology

Memory Drive Technology (MDT) is a recent product by Intel Corporation that combines Software-Defined Memory (SDM) technology with SSD drives to allow the expansion of the system's main memory beyond DRAM [62]. The main idea is to utilize the capacity of the storage device as byte-addressable memory, instead of as block storage. The technology is optimized to take advantage of the latest processors, PCIe-based storage, and the latest persistent memory technology (e.g., 3D XPoint [55]). In fact, Intel claims that MDT has been designed for high-concurrency and in-memory analytics workloads.


Figure 2.5: (a) Traditional “memory + storage” design, where storage is considered a separate entity. (b) Intel’s Memory Drive Technology seamlessly extends main memory from the BIOS without the OS or the applications knowing [62].

Among its benefits, MDT offers optimized performance for up to 8× system memory expansion over the installed DRAM capacity (up to 1.2TB), ultra-low latencies, close-to-DRAM performance for SDM operations, and reliably consistent Quality-of-Service (QoS). It also reduces operating costs considerably, as it enables compute clusters to provide the same amount of memory at a significantly lower cost compared to all-DRAM installations, or, alternatively, to reach much larger amounts of memory than the practical limits of a DRAM-only compute node. A comparison between the traditional configuration and MDT is illustrated in Figure 2.5.

Despite being technically similar to mmap, the main difference of MDT is that it executes directly on the hardware and below the operating system. The system memory is assembled from DRAM and the PCIe-based SSD drives, leveraging the economic benefit of storage devices while operating transparently as volatile system memory. MDT is implemented as an OS-transparent virtual machine [76]. Storage devices are utilized as part of the system memory to present the aggregated capacity of the DRAM and NVDIMM modules installed in the system as one coherent, shared memory address space. No changes are required to the operating system, applications, or any other system components. Additionally, MDT implements advanced memory-access prediction algorithms. While the technology seems to offer tremendous potential, it is still at an experimental stage. Further, storage devices are no longer utilized for persistency, but only to extend main memory. Additional information about Intel's Memory Drive Technology, including preliminary performance evaluations, can be found in these references: [62, 76].

2.4.3 pHDF5

Traditional file systems store data hierarchically in tree-based structures composed of folders and files. They were designed decades ago with the purpose of providing a common procedure for organizing data, accessible through a path within a root directory. The metadata associated with each file mostly describes basic information, like creation date or owner, but it does not give details about its context or purpose in the storage subsystem. Large-scale services and parallel applications, however, mostly generate unstructured data, where the content that the data represents is what matters. The Hierarchical Data Format v5 (HDF5) has emerged as a widely-adopted scientific library that allows mapping objects in a programming language to data stored in the file system [43]. This provides some important benefits, such as portability and flexibility for applications. For instance, HDF5 can be used to store different types of datasets commonly found in scientific computing, such as the following:

− Rectilinear 3D grids with balanced, cubic partitioning.
− Rectilinear 3D grids with unbalanced, rectangular partitioning.
− Contiguous but size-varying arrays for Adaptive Mesh Refinement (AMR).

As a matter of fact, iPIC3D stores the particle information per process in HDF5 format for checkpointing purposes, as motivated during the introduction of Chapter 1 [10, 72]. But HPC applications like iPIC3D utilize what is called the Parallel Hierarchical Data Format v5 (pHDF5) [86]. This library extends the original HDF5 implementation to provide support for parallel I/O using MPI-IO. Thus, an HDF5 file can be opened in parallel from an MPI application by specifying a parallel "file driver" with an MPI communicator and info structure. This information is communicated to HDF5 through a property list, a special HDF5 structure that is used to modify the default behavior of the library.

The MPI-IO file driver defaults to independent mode, where each processor can access the file independently and any conflicting accesses are handled by the underlying parallel file system. Alternatively, the MPI-IO file driver can be set to collective mode to enable collective buffering (i.e., Two-Phase I/O [141]). This is not specified during file creation, but rather during each transfer (read or write) to a dataset, by passing a transfer property list to the specific I/O call (a brief sketch is given below). Further information about HDF5 and pHDF5, including source code examples, can be obtained from the following references: [43, 86].
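The following sketch outlines how the two property lists mentioned above are typically used in the HDF5 C API; the file name, dataset name, and one-dimensional decomposition are illustrative assumptions.

// Sketch of a parallel HDF5 write: the file access property list selects the
// MPI-IO driver, and the transfer property list requests collective I/O.
// "output.h5", "data", and the 1D decomposition are illustrative assumptions.
#include <hdf5.h>
#include <mpi.h>

void write_parallel(MPI_Comm comm, const double *local, hsize_t n)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);      // parallel file driver
    hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t total = n * (hsize_t)nprocs, offset = n * (hsize_t)rank;
    hid_t filespace = H5Screate_simple(1, &total, NULL);
    hid_t memspace  = H5Screate_simple(1, &n, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // Each rank selects its own contiguous slab of the dataset
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &n, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);     // Two-Phase I/O
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
    H5Pclose(fapl); H5Fclose(file);
}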

2.4.4 I/O in Fortran

Fortran is a general-purpose, compiled, imperative programming language that is especially suited for numeric computation and scientific computing [106]. Alongside C, it is considered one of the most popular languages for HPC3. As a matter of fact, the MPI standard provides a Fortran counterpart API for all of its functionality, including collective communication and MPI-IO [48, 81].

Like other programming languages, Fortran contains a solid I/O interface through the Fortran Interface for I/O (Fortran-IO) [67]. Its main purpose is to provide applications with support for printing and / or storing data. Multiple commands and formatting specifications serve these purposes, such as the I/O operations OPEN / CLOSE, FLUSH, or READ / WRITE. Moreover, the language also includes PRINT and the I/O formatting functionality of FORMAT. In Fortran 2008, a significant addition integrated the ability to extend the basic facilities with user-defined routines to output derived types, including those that are themselves composed of other derived types [105]. Together, these commands form a very powerful assembly of facilities for reading and writing formatted and unformatted files with sequential, direct, or asynchronous access.

3More information about Fortran and its extension, Coarray Fortran, is presented in Chapter 5.

Nonetheless, despite the existing support, Fortran still lacks a comprehensive integration of heterogeneous I/O interfaces. Thus, Chapter 5 presents a seamless I/O interface based on Coarray Fortran (CAF), meant to support transparent resilience in the near-term future [111]. Further information about Fortran and Fortran-IO can be acquired in these references: [67, 106].

Chapter 3

Unified Programming Interface for I/O in MPI

The Message-Passing Interface (MPI) is considered the de-facto standard for programming large-scale distributed-memory systems [47]. Since its inception, the standard has provided high performance and portability on massively parallel supercomputers. Together with the conventional message-passing model of MPI (e.g., point-to-point), MPI-2 introduced Remote Direct Memory Access (RDMA) support with the MPI one-sided communication model [80]. These operations are meant to provide access to the local memory of other processes, a feature that resembles traditional shared-memory programming.

In order to understand the main differences between each model, Figure 3.1 illustrates a simple comparison between the classic message-passing of "send + receive" and one-sided communication. In message-passing (a), processes are required to cooperate. The source process must state its intention to transfer data to the target process. This is frequently accomplished by a preliminary envelope exchange to identify the message on the specific target (i.e., tag-matching [75]), followed by a confirmation from the target that the space required for the message is available. The specific interaction can vary depending on the message size and the type of communication primitive utilized, and it is also determined by the Eager and Rendezvous protocols [69]. After this exchange, the data is transferred. The problem with this interaction is that it couples the processes, theoretically forcing the communication to stall until the negotiation is complete.

The MPI one-sided communication model (b), on the other hand, dramatically streamlines this interaction. In this case, each process exposes part of its local memory through the concept of an MPI window [48]. The term window is utilized because only a limited part of the local memory is exposed (i.e., the rest remains private). The size of each window can differ among processes, and some may opt not to share any memory. The critical difference with message-passing resides in the simplicity of MPI one-sided: put / get operations can be utilized to access both local and remote windows, without distinction. Moreover, support for shared-memory programming using MPI windows is also possible since MPI-3, as in OpenMP1 [28, 60]. The new revision of the standard additionally defined atomic operations, such as Compare-And-Swap (CAS).

1In this case, the put / get operations are replaced with simple load / store [49].



Figure 3.1: (a) The message-passing model enforces both processes involved in the communication to interact. (b) The one-sided communication model decouples the processes, enabling asynchronous communication from / to the MPI windows [48].

More importantly, processes become effectively decoupled, allowing for asynchronous communication (i.e., no tag-matching) and reduced latency [45].

In the same way that storage will seamlessly extend memory in the near future, programming interfaces for memory operations will also become interfaces for I/O operations. In this chapter, we propose to raise the level of programming abstraction by using MPI one-sided communication and MPI windows as a unified interface to any of the available memory and storage technologies [114]. MPI windows provide a familiar, straightforward interface that can be utilized to program data movement among memory and storage subsystems in MPI, without the need for separate interfaces for I/O.

Key Contributions

• We design and implement MPI storage windows to map MPI windows onto storage devices. We provide a reference, open-source implementation atop the MPI Profiling Interface, and consider how the approach could be integrated inside MPICH [50].

• We show that MPI storage windows introduce only a relatively small runtime overhead when compared to MPI memory windows and MPI-IO in most cases, while providing a higher level of flexibility and a seamless integration of the memory and storage programming interfaces.

• We present how to use MPI storage windows for out-of-core execution and parallel I/O in reference applications, such as a Distributed Hash Table [45] and the Hardware-Accelerated Cosmology Code I/O kernel [54].

• We illustrate how heterogeneous window allocations provide performance advantages when combining memory and storage allocations on certain HPC workloads.

3.1 Extending MPI Windows to Storage

In order to extend the concept of MPI windows to storage, we observe that MPI does not restrict the type of allocation that a window should be pinned to [81]. The standard only defines how processes can interact with a window and the semantics involved in the communication, but it does not specify the technology that the window must utilize. Thus, MPI storage windows can become a seamless, unified programming interface for I/O inside MPI, allowing access to memory and storage through simple put / get operations (Figure 3.2). Furthermore, no major changes are required to the standard.

We propose to enable MPI windows to be allocated on storage when the proper performance hints are given to the window allocation routines (Subsection 3.1.1) via the MPI Info object [154]. The performance hints are tuples of key-value pairs that provide MPI implementations with information about the underlying hardware. For instance, certain hints can improve the performance of collective communication by providing the network topology. Also, MPI-IO operations can be optimized if the I/O characteristics of an HPC application are provided through hints. Thus, performance hints can determine the location of the mapping to storage and define other hardware-specific settings, such as the file striping of a PFS (e.g., striping_factor) [85]. Moreover, if the specific MPI implementation does not support storage allocations, the hints are simply ignored, which maintains compatibility among implementations. Several performance hints are defined to enable and configure MPI storage windows:

• alloc_type. If set to "storage", it enables the MPI window allocation on storage. Otherwise, the window will be allocated in memory (default).
• storage_alloc_filename. Defines the path and the name of the target file. A block device can also be provided, allowing us to support different storage technologies.
• storage_alloc_offset. Identifies the displacement or starting point of the MPI storage window inside a file. This is especially valuable for accessing shared files.
• storage_alloc_factor. Enables combined window allocations, where a single window contains both memory and storage mappings. A value of "0.5" would associate half of the addresses with memory, and half with storage.
• storage_alloc_order. Defines the order of the allocation in combined windows. A value of "memory_first" sets the first addresses in memory, and then storage (default).
• storage_alloc_unlink. If set to "true", it removes the associated file during the deallocation of an MPI storage window (i.e., useful for writing temporary files).
• storage_alloc_discard. If set to "true", it avoids synchronizing the recent changes to storage during the deallocation of an MPI storage window.

Applications that use MPI one-sided communication can continue to allocate windows in memory by omitting the alloc_type hint, or by setting this hint to "memory" with MPI_Info_set. To enable MPI storage windows, the alloc_type hint has to be set to "storage". Applications are expected to provide, at least, the storage_alloc_filename hint, which is required to specify the path where the window is to be mapped (e.g., a file). The rest of the described hints are optional and strictly depend on the particular use-case where MPI storage windows are integrated. In addition, it must be noted that most of the existing hints for MPI-IO can also be integrated into MPI storage windows [51].

3.1.1 Reference Implementation

Even though support for MPI storage windows can ideally be provided at the MPI implementation level, the feature can easily be integrated at the library level as well. Understanding the type of allocation is possible through the attribute caching mechanism of MPI. This feature enables user-defined cached information on MPI communicators, datatypes, and windows.


Figure 3.2: MPI storage windows enable seamless access to different types of memory and storage technologies, including PFS (e.g., GPFS [126]). Compared to MPI-IO [48, 142], HPC applications benefit from a unified programming interface.

In this case, metadata about the allocation can be attached to the MPI window to differentiate between traditional in-memory allocations, storage-based allocations, or combined allocations. The properties can be retrieved by querying the window object (i.e., MPI_Info_get).

We design and implement MPI storage windows as an open-source reference library2 on top of MPI using the MPI Profiling Interface (PMPI) [66]. We also integrate the approach inside the MPICH MPI implementation (CH3) [50]. The library version allows us to quickly prototype the MPI storage window concept and to understand which features are required for supporting storage-based allocations in the future. The MPICH integration allows us to comprehend the complexity of defining this concept in a production-quality MPI implementation. Both implementations support the same functionality.

Our reference implementation of MPI storage windows is based on the memory-mapped I/O mechanism of the OS [8] (Subsection 2.4.1). Target files or block devices backing an MPI storage window are first opened, mapped into the virtual memory space of the MPI process, and then associated with the MPI window. Several Unix system and I/O functions are required for these purposes. First, mmap is utilized to reserve the range of virtual addresses and also to establish a storage mapping in main memory (e.g., from files or block devices). When targeting files, ftruncate guarantees that the mapping has enough associated storage space. The changes in the mapping are synchronized internally through msync, which flushes all the dirty pages from the page cache of the OS3 to storage. Lastly, munmap allows us to release the mapping of the file or block device from the page table of the process (a simplified sketch of this allocation path is given at the end of this subsection).

Several routines are extended to handle the allocation, deallocation, and synchronization of MPI storage windows through the PMPI interface. The specification of these routines remains unaltered and follows the original definition in the MPI-3 standard [81]:

• MPI_Win_allocate. Allocates an MPI storage window through the MPI Info object. The routine maps the physical file into the virtual memory space of the MPI process.
• MPI_Win_allocate_shared. Equivalent to the previous routine, it defines an MPI shared window on storage [49]. By default, the mapped addresses are consecutive.

2https://github.com/sergiorg-kth/mpi-storagewin
3Memory mappings utilize a global cache in main memory to improve performance [117].


Figure 3.3: MPI storage windows can be created by mapping files into the process address space through the memory-mapped I/O mechanism of the OS [8]. The page cache optimizes the performance by maintaining the mapped space in memory.

• MPI_Win_sync. Synchronizes the memory and storage copies of the MPI storage window4. This routine may return immediately if the pages are already synchronized with storage by the OS (i.e., a selective synchronization is frequently performed).
• MPI_Win_attach / MPI_Win_detach. These two routines allow us to support MPI dynamic windows on storage [45]. The mapping can be pre-established by providing the hints to MPI_Alloc_mem instead.
• MPI_Win_free. Releases the mapping from the page table of the MPI process and deletes the mapped file if requested with the storage_alloc_unlink hint.

Figure 3.3 summarizes the use of these routines with four MPI storage windows. Three files are opened and mapped into the virtual memory space of the four MPI processes. After the mapping is established, the OS automatically moves data from memory (page cache) to storage, and vice-versa. In the example, two MPI storage windows share a file. An offset x is provided as a performance hint during the allocation of the window. The MPI_Win_sync routine ensures that the window copy in memory, within the page cache of the OS, is synchronized with the mapped file on storage.
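The sketch below condenses the allocation path just described; the helper names and the lack of error handling are illustrative simplifications, not the actual code of the reference library.

// Simplified sketch of a file-backed window mapping (illustrative helper
// names, no error handling; the offset is assumed to be page-aligned).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *storage_window_map(const char *path, off_t offset, size_t size, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    ftruncate(fd, offset + size);        // ensure enough backing storage

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, offset);
    *fd_out = fd;
    return base;                         // exposed as the window base pointer
}

void storage_window_sync(void *base, size_t size)
{
    msync(base, size, MS_SYNC);          // flush dirty pages to storage
}

void storage_window_free(void *base, size_t size, int fd)
{
    munmap(base, size);                  // release the mapping
    close(fd);
}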

3.1.2 Source Code Example

Scientific applications on HPC that utilize the MPI one-sided communication model can immediately integrate MPI storage windows. Listing 3.1 shows a source code example that allocates an MPI storage window using some of the described performance hints. The example demonstrates how different MPI processes can write information to the MPI storage window of other processes with a simple put operation; this operation can also be used for local MPI storage windows.

4Even though MPI storage windows resemble the separate memory model of MPI windows [48], in this case, local and remote operations only affect the memory-mapped region.

...
// Define the Info object to enable storage allocations
MPI_Info info = MPI_INFO_NULL;
MPI_Info_create(&info);
MPI_Info_set(info, "alloc_type", "storage");
MPI_Info_set(info, "storage_alloc_filename", "/path/to/file");
MPI_Info_set(info, "storage_alloc_offset", "0");
MPI_Info_set(info, "storage_alloc_unlink", "false");

// Allocate storage window with space for num_procs integers
MPI_Win_allocate(num_procs * sizeof(int), sizeof(int), info, MPI_COMM_WORLD,
                 (void **)&baseptr, &win);

// Limit the operation to half of the processes
if (IS_EVEN_NUM(rank))
{
    // Put a random value on the destination ranks
    for (int drank = 1; drank < num_procs; drank += 2)
    {
        int k = rank + timestamp;
        MPI_Win_lock(MPI_LOCK_SHARED, drank, 0, win);
        MPI_Put(&k, 1, MPI_INT, drank, offset, 1, MPI_INT, win);
        MPI_Win_unlock(drank, win);
    }
}
...

Listing 3.1: Source code example that illustrates how to use MPI storage windows to allocate and write a symbolic value to other remote processes [113].

The code first creates an MPI Info object and sets the performance hints to enable storage allocations. Then, it allocates the MPI storage window passing the Info object, and instructs half of the MPI processes to write to the corresponding MPI storage window of the other half. The MPI_Win_lock / MPI_Win_unlock calls start and end the passive epoch to the MPI window on the target rank.

3.2 Data Consistency Considerations

By using the memory-mapped I/O mechanism, MPI storage windows implicitly integrate demand paging in memory through the page cache of the OS, temporarily holding frequently accessed pages mapped to storage. The flexibility of this mechanism, however, also introduces several challenges for data consistency inside MPI storage windows.

First and foremost, local or remote operations are only guaranteed to affect the memory copy of the window inside the page cache. The semantics of the MPI one-sided communication operations, such as MPI_Win_lock / MPI_Win_unlock or MPI_Win_flush, will only ensure the completion of the local or remote operations inside the memory of the target process (i.e., the storage status is undefined at that point). As a consequence, when consistency of the data within the storage layer has to be preserved, applications are required to use MPI_Win_sync on the window to enforce a synchronization of the modified content inside the page cache of the OS (see Subsection 3.1.1). This operation blocks the MPI process until the data is ensured to be flushed from memory to storage. Therefore, write operations (e.g., MPI_Put) accompanied by a subsequent MPI_Win_sync guarantee data consistency on the storage layer. Read operations (e.g., MPI_Get) are not affected and trigger data accesses to storage through the page-fault mechanism of the OS.

Another important challenge is to prevent remote data accesses during a window synchronization to storage. In this regard, the MPI standard already contains the definition of atomic operations (e.g., Compare-And-Swap), as well as an exclusive lock inside the passive target synchronization [81]. By default, accesses that are protected by an exclusive lock (i.e., MPI_LOCK_EXCLUSIVE) will not be concurrent with other accesses to the same window that is lock-protected. This guarantees that no interference exists during the synchronization of the data, which is useful for neighbour-based checkpointing (Section 7.3) [20].
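A minimal sketch of this consistency pattern is shown below, assuming win is an MPI storage window and that the displacement and value are placeholders; the put completes in memory within the passive-target epoch, and MPI_Win_sync then flushes the mapping to storage.

// Sketch of the consistency pattern described above ('win', 'disp', and
// 'value' are placeholders): the exclusive lock prevents concurrent accesses,
// MPI_Win_flush completes the put in memory, and MPI_Win_sync flushes the
// page cache of the mapping to the storage layer.
#include <mpi.h>

void persist_local_update(MPI_Win win, int my_rank, MPI_Aint disp, int value)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_rank, 0, win);
    MPI_Put(&value, 1, MPI_INT, my_rank, disp, 1, MPI_INT, win);
    MPI_Win_flush(my_rank, win);    // complete the operation in memory
    MPI_Win_sync(win);              // synchronize the mapping with storage
    MPI_Win_unlock(my_rank, win);
}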

3.3 Experimental Results

In this section, we briefly illustrate performance results of MPI storage windows on two different testbeds. The first one is a compute node with local storage (HDD and SSD), allowing us to demonstrate the implications of our approach on HPC clusters with local persistency support. The second testbed is a supercomputer from KTH Royal Institute of Technology [95] with storage provided by a PFS. The specifications are given below:

• Blackdog is a workstation equipped with dual 8-core Xeon E5-2609v2 2.5GHz processors and 72GB DRAM. The storage consists of two 4TB HDDs (non-RAID WD4000F9YZ) and a 250GB SSD (Samsung 850 EVO). The OS is Ubuntu Server 16.04 with kernel 4.4.0-62-generic. Applications are compiled with GCC v5.4.0 and MPICH v3.2. The vm.dirty_ratio is set to 80% for better OS page cache use [146].

• Tegner is a supercomputer with 46 compute nodes equipped with dual 12-core Haswell E5-2690v3 2.6GHz processors and 512GB DRAM. The storage employs a Lustre PFS (client v2.5.2) with 165 OST servers. No local storage is provided. The OS is CentOS v7.3.1611 with kernel 3.10.0-514.6.1.el7.x86_64. Applications are compiled with GCC v6.2.0 and Intel MPI v5.1.3.

Note that all the figures reflect the standard deviation of the samples as error bars. We use the PMPI-based implementation on both testbeds for deployment reasons on Tegner. In addition, we use the default Lustre settings on Tegner, assigning one OST server per MPI process and a striping size of 1MB. The swap partition is disabled on both testbeds.

3.3.1 Microbenchmark

mSTREAM5 is a microbenchmark inspired by STREAM [73] that measures throughput by performing large memory operations over an MPI window allocation. We use mSTREAM to stress the storage subsystem and understand the performance differences with MPI memory windows. The tests utilize a fixed window allocation size of 16GB and a transfer size of 16MB. A single MPI process is used for each test to avoid potential interference.

From the results, we observe that the performance of MPI storage windows is largely dominated by the need to flush the data changes to storage when no other computations are conducted. Figure 3.4 illustrates the throughput achieved by each kernel of mSTREAM using MPI memory and storage windows on Blackdog and Tegner.

5https://github.com/sergiorg-kth/mpi-storagewin/tree/master/benchmark


Figure 3.4: (a,b) mSTREAM performance using MPI memory and storage windows on Blackdog and Tegner. (c) Comparison of read / write throughput on Tegner.

The storage windows are mapped onto the local storage of Blackdog, and onto the Lustre PFS of Tegner. The performance results of (a) indicate that the degradation of MPI storage windows on local storage is 56.4% on average when using the SSD, of which 33.8% of the execution time is spent on transferring unflushed changes to storage. In the case of the HDD, the flush time exceeds 50%. On the other hand, the results of the same experiment running on Tegner (b) illustrate evident differences between MPI memory and storage windows. The storage-based allocation degrades the throughput by over 90% compared to the MPI window allocation in memory. In particular, the throughput of MPI memory windows reaches 10.6GB/s on average, while the peak throughput of MPI storage windows is approximately 1.0GB/s. The reason for this result is the lack of write buffering on Lustre for memory-mapped I/O, producing "asymmetric" performance (c).

3.3.2 Data Analytics and Out-of-Core Evaluation

Due to the recent importance of data analytics on HPC [23, 104], we use a Distributed Hash Table (DHT) implementation6 based on MPI one-sided communication for the next evaluation [45]. The goal is to mimic data analytics applications that perform random accesses to distributed data structures. In this implementation, each MPI process puts and gets values, and also resolves conflicts asynchronously on any of the exposed local volumes of the DHT through Compare-And-Swap (CAS) atomic operations.

The use of atomic operations inside the DHT implementation effectively hides some of the constraints expected due to the memory and storage bandwidth differences. Figure 3.5 presents the average execution time using MPI memory and storage windows on Blackdog and Tegner. Each process inserts up to 80% of the DHT capacity. We use 8 processes on Blackdog and 96 processes (4 nodes) on Tegner. The results in (a) show that the overhead of MPI storage windows on a conventional HDD is approximately 32% on average compared to MPI memory windows, and 20% when using the SSD. By increasing the process count and the number of active nodes on Tegner, the overhead of the network communication and the atomic CAS operations required to maintain the DHT hides the performance limitations. In this case, we observe in (b) that using MPI storage windows barely affects the performance, with only a 2% degradation on average compared to MPI memory windows.

6https://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI


Figure 3.5: (a,b) DHT evaluation using MPI memory and storage windows on Blackdog and Tegner. (c) Out-of-core evaluation with combined window allocation.


Figure 3.6: HACC I/O evaluation using MPI-IO and MPI storage windows on Blackdog and Tegner. The tests vary the process count for a fixed problem size.

Lastly, (c) demonstrates the performance of MPI storage windows when exceeding the main memory capacity of Blackdog. Utilizing a combined allocation with a factor of 0.57, we observe a mere 8% degradation compared to MPI memory windows before exceeding the memory capacity, and only 13% compared to the projected value. In the largest test case, where 59.6GB are on storage, the overhead increases to only 36% on average.

3.3.3 Checkpoint-Restart on Scientific Applications

For the last evaluation, we use the checkpoint-restart mechanism8 of the Hardware-Accelerated Cosmology Code (HACC) [54], a physics particle-based code that simulates the evolution of the Universe after the Big Bang. Our tests evaluate the I/O capabilities of MPI storage windows compared to the performance of the existing MPI-IO implementation. The latter uses individual file I/O (i.e., no collective operations).

The results demonstrate that MPI storage windows can be an alternative to MPI-IO when using local storage and also a PFS. Figure 3.6 shows the average execution time using MPI-IO and MPI storage windows on Blackdog and Tegner.

7This means that 50% of the allocation is located in memory, while the other 50% is on storage.
8https://asc.llnl.gov/CORAL-benchmarks

The evaluations target the local HDD on Blackdog, and the Lustre PFS on Tegner. We use 100 million particles in all the evaluations while doubling the number of processes. The results in (a) indicate that MPI-IO performs slightly better on average on Blackdog (2–4% faster). However, (b) shows that MPI storage windows provide about 32% improvement on average on Tegner compared to the MPI-IO implementation, mostly after the process count increases from 16 to 32 processes, which also raises the number of active nodes from one to two.

3.4 Concluding Remarks

Compute nodes of next-generation supercomputers will include different memory and storage technologies. This chapter has proposed a novel use of MPI windows to hide the heterogeneity of the memory and storage subsystems by providing a single common programming interface for data movement across these layers, without requiring drastic changes to the standard. The experimental results have also demonstrated that MPI storage windows transparently and efficiently integrate storage support into new and existing applications. Moreover, the approach allows the definition of combined window allocations, which merge memory and storage under a unified virtual address space for out-of-core execution. As a matter of fact, the benefits for data analytics led to the work presented in the following chapter.

Nonetheless, we also observe that an efficient implementation of MPI storage windows is required for data-intensive applications. The use of the memory-mapped file I/O mechanism of the OS constrains the performance by introducing asymmetric read / write bandwidth on PFS like Lustre. To overcome these limitations, Chapter 6 presents a high-performance memory-mapped I/O mechanism designed for HPC, which can benefit MPI implementations in the near-term future.

More information about MPI storage windows, including additional source code examples and performance evaluations, can be obtained in these references: [113, 114].

Chapter 4

Decoupled Strategy on Data Analytics Frameworks

During the past decade, data-intensive workloads have become an integral part of large-scale scientific computing [104, 124]. Alongside traditional HPC use-cases, data-centric applications have fostered the emergence of programming models and tools for data mining [153]. This allows scientists to understand large datasets of unstructured information. For instance, such developments have been successfully applied in the past for the trajectory analysis of high-performance Molecular Dynamics (MD) simulations [148].

In this regard, MapReduce has emerged as one of the preferred programming models to hide the complexity of process / data parallelism [34]. MapReduce originated in the context of cloud analytics. Its power resides in the definition of simple Map() and Reduce() functions, which become highly-parallel operations using complex inter-processor communication [61]. Users are solely responsible for implementing the aforementioned functions, while the underlying framework manages everything else. As illustrated in Figure 4.1, the basic principle behind MapReduce is to split a certain input dataset into independent portions or tasks. These are then processed in parallel inside the Map and Reduce phases1. Each phase calls the specific Map() and Reduce() operations defined by the user. More specifically, the former is designed to split the input into a collection of key-value pairs; the meaning of such processing depends on the use-case. Thereafter, each tuple is merged using the latter function, producing an aggregation of all the key-value pairs with identical keys. If no feedback loop is required, the output represents the final result.

Many real-world applications can be expressed following the MapReduce programming model, like high-energy physics data analysis or K-means clustering [21, 34]. In the context of HPC, however, it has also been debated that cloud-based MapReduce frameworks (e.g., Hadoop [133]) pose numerous constraints. Despite supercomputers offering immense potential for data analytics, the elevated memory / storage requirements, alongside the complex job scheduling, make such integration difficult [53]. Moreover, MPI-based implementations of MapReduce for HPC are often limited by the inherent coupling between the different phases of the programming model [102].

1The output from the Map phase might be sorted during an intermediate step called Shuffle, before transferring the control to Reduce [30]. For simplicity, we do not consider this phase.



Figure 4.1: MapReduce analyzes input datasets to generate key-value pairs, that are later reduced and combined to generate the final result. The specific Map() and Reduce() functions, implemented by the user, are called inside each phase [30].

Given the irregular nature of certain input datasets and the expected increase in concurrency of exascale supercomputers [32], performance considerations arise for existing data analytics frameworks on HPC. In this chapter, we present a decoupled strategy for MapReduce, named MapReduce-1S (i.e., MapReduce "One-Sided") [116]. This framework employs MPI one-sided communication and non-blocking I/O to overlap the Map and Reduce phases of the algorithm [48, 141]. Processes synchronize using solely one-sided operations. The distribution of the tasks during Map is also decentralized and self-managed. Furthermore, by integrating MPI storage windows (Chapter 3) [114], MapReduce-1S enables seamless support for fault-tolerance and out-of-core execution for HPC data analytics.

Key Contributions

• We design and implement MapReduce-1S, an open-source MapReduce framework based on MPI one-sided communication and MPI-IO non-blocking operations to reduce the constraints on highly-concurrent workloads.

• We illustrate how MapReduce-1S provides benefits on data analytics use-cases (e.g., Word-Count) compared to a reference MapReduce implementation for HPC [61], both under balanced and unbalanced workloads in weak scaling evaluations.

• We depict the execution timeline differences of our approach to demonstrate how processes become effectively decoupled among the different phases.

• We integrate MPI storage windows (Chapter 3) [114] and evaluate a fault-tolerant implementation that seamlessly allows for out-of-core execution as well.

4.1 Designing a Decoupled MapReduce

Existing MapReduce implementations for HPC are generally based on MPI functionality to take advantage of the high-performance network and storage subsystems of supercomputers [61, 102]. These implementations often integrate collective communication and I/O within the different phases, as illustrated in Figure 4.2 (a). For instance, tasks are commonly distributed employing a master-slave approach with scatter operations (e.g., MPI_Scatter). Fixed-length, associative key-values are used to take advantage of the heavily-optimized reduce operations of MPI (e.g., MPI_Reduce).


Figure 4.2: (a) The communication pattern of MapReduce frameworks can introduce idle periods if the workload is unbalanced. (b) Using MPI one-sided communication, processes can hide the idle periods by following a decoupled strategy [116].

In addition, the intermediate data shuffle is mapped to collective all-to-all operations (e.g., MPI_Alltoall) to optimize the communication between phases [53]. When reading the input, collective I/O operations of MPI-IO are utilized to decrease the overhead of accessing the PFS [44, 141].

Even though these design considerations generally provide major advantages compared to cloud-based alternatives [70], the inherent coupling between the Map and Reduce phases may still produce workload imbalance. This is particularly the case when the input datasets feature an irregular distribution of the data and the process count is elevated. In such cases, it has been demonstrated that the use of decentralized algorithms can provide significant performance benefits [98]. Consequently, a decoupled strategy for MapReduce frameworks can reduce the synchronization overhead by overlapping the different phases.

4.1.1 Reference Implementation

To address this challenge, we propose MapReduce-1S, an open-source C++ framework2 that integrates MPI one-sided communication (Chapter 3) and non-blocking I/O (Section 2.1) [45, 48]. The framework inherits the core principles of traditional MapReduce implementations, with subtle variations. The execution is divided into four different phases:

1. Map. Transforms a given input into key-value pairs. Each key-value is assigned to a target process and stored into designated MPI windows for remote communication.
2. Local Reduce. Aggregates key-value pairs locally during Map, whenever possible, thus decreasing the overall memory footprint and network overhead [44].

2https://github.com/sergiorg-kth/mpi-mapreduce-1s

3. Reduce. Aggregates all the key-value pairs found by the rest of the processes. MPI one-sided operations are utilized to retrieve the tuples asynchronously.
4. Combine. Combines the key-value pairs to generate the result. Despite being similar to Shuffle, the difference is that Reduce already performs the ordering (i.e., this step is lighter).

As illustrated in Figure 4.2 (b), our framework is designed to overlap computations and storage operations during the Map phase, while allowing processes to asynchronously proceed to the Reduce and Combine phases. In fact, instead of following a master-slave approach, we design a mechanism that enables processes to decide the next task to perform based on the rank, task size, and other parameters. The input portion for the task is retrieved using non-blocking MPI-IO operations [141]. Support for remote memory communication is provided through a multi-window configuration per process:

• "Status" Window. Defines the status of each process (e.g., STATUS_REDUCE). The status is updated using atomic operations, like CAS, after completing a phase.
• "Key-Value" Window. This multi-dimensional, dynamic window contains buckets to store the key-value pairs, indexed by the target rank.
• "Combine" Window. Contains a single-dimension, dynamic window with ordered key-values. This window is critical to generate the final result in Combine.
• "Displacement" Window. Two additional displacement windows are required to support the "Key-Value" and "Combine" dynamic windows3.

When a key-value pair is found, a custom memory management scheme is employed to store the tuple. Each key-value pair is mapped inside buckets assigned to the target processes. We use this approach as a mechanism to transfer information concurrently [61]. The target is determined by first generating a 64-bit hash of the key, and thereafter mapping the hash into the corresponding bucket of the "Key-Value" window. Thus, processes can have remote access without affecting the information stored in surrounding buckets. The information is encoded by including a fixed-size header h with the length of the key and value attributes to support variable-length tuples (a small packing sketch is given after the layout below):

    | h (H bytes) | key (K bytes) | value (V bytes) |
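As a rough illustration of this encoding (a sketch in C, not the actual MapReduce-1S code; the header struct and helper names are assumptions), a variable-length tuple could be packed and its target rank selected as follows:

#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size header: key and value lengths in bytes. */
typedef struct {
    uint32_t key_len;
    uint32_t val_len;
} kv_header_t;

/* Packs a variable-length tuple into 'buf' and returns the bytes written. */
static size_t pack_tuple(char *buf, const char *key, uint32_t key_len,
                         const char *val, uint32_t val_len)
{
    kv_header_t h = { key_len, val_len };

    memcpy(buf, &h, sizeof(h));                       /* h     */
    memcpy(buf + sizeof(h), key, key_len);            /* key   */
    memcpy(buf + sizeof(h) + key_len, val, val_len);  /* value */

    return sizeof(h) + key_len + val_len;
}

/* Hypothetical target selection: 64-bit hash of the key mapped to a rank. */
static int target_rank(uint64_t key_hash, int num_procs)
{
    return (int)(key_hash % (uint64_t)num_procs);
}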

After a process finishes the Map phase, it proceeds to the Reduce phase by collecting the groups of key-value pairs assigned to this particular process from all the other processes. This procedure is better illustrated in Figure 4.3. The tuples are retrieved in user-configurable groups with MPI one-sided operations through the passive target synchronization [45]. After a group is retrieved, the process splits the information by interpreting the header and locally reduces each key-value pair. The "Status" window is required to prevent incorrect data accesses to the "Key-Value" window.

When the Map and Reduce phases are completed, the Combine phase sets up a tree-based sorting algorithm that fetches the final tuples from each process. We use an algorithm inspired by merge sort [24, 29]. The number of levels in the tree is given by ⌈log₂(numProcs)⌉ + 1 (e.g., 32 processes yield ⌈log₂ 32⌉ + 1 = 6 levels).

3 Attaching new allocations to an MPI dynamic window requires sharing the displacement for the buckets of the window [48]. Hence, we choose temporary MPI windows for this purpose.


Figure 4.3: Each process stores tuples into dedicated MPI windows that are later accessed by remote processes. The figure shows process P0 in Reduce, notifying the status change and getting its key-values found by other processes.

The initial level of the tree stores the local key-value pairs, in order. Thereafter, processes retrieve the key-values from the previous level using MPI one-sided operations, and generate a new level with all the pairs ordered. Race conditions are prevented through an exclusive lock (i.e., MPI_LOCK_EXCLUSIVE) over the "Combine" window. The task is then repeated until one last process generates the final level, which corresponds to the result.
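The retrieval protocol of Figure 4.3 can be sketched with standard MPI one-sided calls as follows (a hedged example; the window handles, status codes, and displacements are placeholders rather than MapReduce-1S internals):

#include <mpi.h>

#define STATUS_MAP    0   /* hypothetical status codes */
#define STATUS_REDUCE 1

/* Announce the phase change of 'my_rank' and fetch 'nbytes' of packed
 * tuples from 'target'; kv_win and status_win stand for the "Key-Value"
 * and "Status" windows (names and displacements are placeholders).     */
void fetch_group(void *local_buf, int nbytes, int target, MPI_Aint disp,
                 int my_rank, MPI_Win kv_win, MPI_Win status_win)
{
    /* 1. Atomically update this process' status (e.g., entering Reduce). */
    int expected = STATUS_MAP, desired = STATUS_REDUCE, previous;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_rank, 0, status_win);
    MPI_Compare_and_swap(&desired, &expected, &previous, MPI_INT,
                         my_rank, 0, status_win);
    MPI_Win_unlock(my_rank, status_win);

    /* 2. Retrieve the group of tuples with passive target synchronization:
     *    the target process is not actively involved in the transfer.     */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, kv_win);
    MPI_Get(local_buf, nbytes, MPI_BYTE,
            target, disp, nbytes, MPI_BYTE, kv_win);
    MPI_Win_unlock(target, kv_win);   /* completes the MPI_Get */
}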

4.1.2 Source Code Example

The MapReduce-1S framework employs a multi-inheritance mechanism by dividing the responsibilities as a hierarchy of C++ classes, thus allowing applications to easily configure different back-ends over multiple use-cases. Listing 4.1 provides a source code example where a Word-Count job is created using MapReduce-1S as back-end. The example first declares the MapReduce object with WordCount, which contains the definition of the Map() and Reduce() operations. The MapReduce job is initialized by providing several settings, such as the size of each individual task within the Map phase, or the maximum number of bytes that can be transferred simultaneously from remote processes during Reduce and Combine. The execution is then launched and the output result is finally printed.

4.2 Fault-tolerance and Out-of-Core Support

Using MPI one-sided communication inside MapReduce-1S provides us with the opportunity to integrate the concept of MPI storage windows [114], presented in Chapter 3. Such integration extends the framework implementation with two main functionalities:

◦ Fault-tolerance. The use of MPI storage windows enables transparent checkpoint support. By default, MapReduce-1S performs window synchronization points after each Map task, as well as after the Reduce phase is completed4 (a minimal sketch of this synchronization is given after the Out-of-Core item below). Thus, processes can easily remap the multi-window configuration from other processes after a failure.

4 In MPI storage windows, applications guarantee data consistency through MPI_Win_sync (Section 3.2). Hence, this function ensures that the latest window changes are flushed to storage.

...
char  *input_filename   = "/path/to/file";  // Input dataset for MapReduce
size_t initial_win_size = 16777216;         // Initial window bucket size
size_t chunk_size       = 1048576;          // Limit for one-sided operations
size_t task_size        = 67108864;         // Task size per process
bool   storage_enabled  = false;            // Configures MPI storage windows
bool   hash_enabled     = true;             // Configures the hashing function
int8_t input_api        = MR1S_MPIIO_NB;    // Input API preferred during Map
size_t sfactor          = 0;                // PFS stripe factor (0 to ignore)
size_t sunit            = 0;                // PFS stripe unit (0 to ignore)

// Create the MapReduce object, employing MapReduce-1S as back-end
MapReduce1S *map_reduce = new WordCount();

// Initialize the Word-Count job with the specified settings
map_reduce->Init(input_filename, initial_win_size, chunk_size, task_size,
                 storage_enabled, hash_enabled, input_api, sfactor, sunit);

// Launch the execution and output the result
map_reduce->Run();
map_reduce->Print();

// Close the job and release the MapReduce object
map_reduce->Finalize();
delete map_reduce;
...

Listing 4.1: Source code example that illustrates how to use and correctly configure MapReduce-1S to launch a Word-Count job in parallel [116].

◦ Out-of-Core. One of the main challenges for data analytics on HPC is the possibility of embracing and processing extremely large datasets [6]. The integration of MPI dynamic storage windows in MapReduce-1S enables an automatic mechanism to expand main memory through the storage subsystem.

The API designed for MapReduce-1S currently exposes the possibility of enabling MPI storage windows (see Listing 4.1). We expect to be able to configure the synchronization frequency with storage and other characteristics in future work on this topic.
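As a minimal sketch of the checkpoint synchronization mentioned in the fault-tolerance item above (assuming win denotes one of the framework's MPI storage windows; the exact synchronization points remain internal to MapReduce-1S):

#include <mpi.h>

/* Flush the latest changes of an MPI storage window to storage after a
 * Map task (simplified; 'win' holds one of the framework's buckets).   */
void checkpoint_window(MPI_Win win, int my_rank)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_rank, 0, win);
    MPI_Win_sync(win);   /* ensures memory/storage consistency (Section 3.2) */
    MPI_Win_unlock(my_rank, win);
}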

4.3 Experimental Results

In this section, we evaluate MapReduce-1S in comparison with a reference MapReduce framework for HPC [61], which follows a master-slave approach and mostly uses MPI collective operations. We refer to it as MapReduce-2S (i.e., MapReduce "Two-Sided"). The evaluations utilize a supercomputer from KTH Royal Institute of Technology [95]:

• Tegner is a supercomputer with 46 compute nodes equipped with dual 12-core Haswell E5-2690v3 2.6GHz processors and 512GB DRAM. The storage employs a Lustre PFS (client v2.5.2) with 165 OST servers. No local storage is provided. The OS is CentOS v7.4.1708 with Kernel 3.10.0-693.11.6.el7.x86_64.

Each framework evaluated is compiled with Intel C++ Compiler (ICPC) and Intel MPI, both v18.0.1. All the figures reflect the standard deviation of the samples as error bars. In addition, we exclude the initialization time from our results, but account for the time required to retrieve the input datasets and allocate the buckets in our framework. Lastly, the terms "MR-1S" and "MR-2S" are used to refer to MapReduce-1S and MapReduce-2S, respectively.

[Figure 4.4 panels: (a) Weak Scaling (Balanced); (b) Weak Scaling (Unbalanced); (c) Weak Scaling (Bal. / FT); (d) Perf. Overhead (FT). Axes: Execution Time (s) / Normalized Time vs. Num. Processes / Size. Series: MR-2S, MR-1S, MR-2S (MPI-IO), MR-1S (Storage Windows).]

Figure 4.4: (a,b) Weak scaling under balanced / unbalanced workloads using MR-2S and MR-1S on Tegner. (c) Weak scaling under balanced workload with fault-tolerance support. (d) I/O performance overhead of the fault-tolerance evaluation.

[Figure 4.5 panels: (a) MR-2S Execution Timeline (Unbalanced); (b) MR-1S Execution Timeline (Unbalanced). Axes: MPI Rank vs. Time (s). Phases shown: Map, Reduce, Combine.]

Figure 4.5: Execution timeline using (a) MR-2S and (b) MR-1S on Tegner, under unbalanced workload. The symbols represent events from processes at each phase.

4.3.1 Weak Scaling Evaluation

We evaluate the scalability of MapReduce-1S using Word-Count and a large dataset from the Purdue MapReduce Benchmarks Suite (PUMA) [2]. In particular, we use the PUMA-Wikipedia dataset5, which contains files with articles, user discussions, and other content from Wikipedia. We pre-process the dataset off-line to have fine-grained control over the workload per process, hence allowing us to evaluate balanced and unbalanced workloads.

The results confirm that MapReduce-1S can provide performance advantages on unbalanced workloads. Figures 4.4 (a,b) illustrate the performance of MapReduce-2S and MapReduce-1S on Tegner under balanced and unbalanced workloads. The tests use a fixed input size of 1GB per process (weak scaling), and vary the process count up to 256 (11 nodes). From (a), we deduce that MapReduce-1S does not provide any clear benefit when the workload is ideally balanced. However, when the workload per process becomes imbalanced, (b) shows clear performance improvements of up to 23.1% on average.

5 https://engineering.purdue.edu/~puma/datasets.htm

Furthermore, Figures 4.4 (c,d) illustrate the same performance evaluation on Tegner under balanced workloads, but enabling fault-tolerance support. MapReduce-2S integrates collective MPI-IO for checkpointing, while our approach integrates MPI storage windows. The results indicate that MapReduce-1S with checkpoint support is 59.3% faster in the last case of 512 processes, in comparison with MapReduce-2S. As shown in (d), MPI storage windows introduce only a 3.8% penalty on average, considerably lower than the overhead obtained with the collective I/O operations of MPI-IO. This is mainly due to the selective synchronization mechanism integrated inside MPI storage windows.

4.3.2 Timeline Comparison

We confirm the observations provided in the previous subsection by analyzing the execution timeline for each implementation. Figure 4.5 depicts the timelines for MapReduce-2S and MapReduce-1S on Tegner, using a fixed 32GB dataset under unbalanced workloads. Each figure depicts the Map, Reduce, and Combine phases with unique symbols6. The completion of each intermediate step is reflected in light-gray color. Note that, for illustration purposes, we use only 32 processes.

Comparing the figures, it can be observed in (a) that MapReduce-2S is constrained by the use of collective communication and I/O of MPI across the different phases. Processes move through the timeline following certain execution patterns that reflect the result of utilizing this type of operations. On the other hand, (b) demonstrates that the decoupled strategy implemented in MapReduce-1S produces significant performance advantages on unbalanced workloads. Processes immediately begin the Reduce and Combine phases after finishing their respective Map phase, showing a 27.6% improvement in comparison.

4.4 Concluding Remarks

With the emergence of data-centric applications on HPC, MapReduce has become one of the preferred programming models to hide the complexity of process and data parallelism [30, 102]. As a consequence, this chapter has presented MapReduce-1S, a decoupled strategy for MapReduce frameworks based on the integration of MPI one-sided communication and non-blocking I/O operations [48, 141]. The aim is to decrease the workload imbalance among the processes on input datasets with an irregular distribution of the data.

Preliminary results have also indicated that the integration of MPI storage windows [114] opens the opportunity for large-scale resilience support on supercomputers. The selective synchronization mechanism allows applications to easily perform checkpoints on storage with only the recent changes. In comparison, the I/O overhead observed for MPI-IO is explained by the fact that all the buffers of the process must be completely stored. The benefits for fault-tolerance led to the work presented in the next chapter.

Further information about MapReduce-1S, including additional considerations (e.g., memory requirements), can be obtained in the following reference: [116].

6 We capture this information by introducing events in the source code. Each event stores the current phase and time in microsecond precision, alongside other relevant information (e.g., RSS).

Chapter 5

Transparent Resilience Support in Coarray Fortran

Fortran is a general-purpose, compiled imperative programming language that is especially suited for numeric computation and scientific computing [106]. Originally developed by IBM in the 1950s for scientific and engineering applications, it quickly became dominant in this field. For instance, Fortran is utilized in computationally intensive areas, such as numerical weather prediction, finite element analysis, computational fluid dynamics, crystallography, and computational chemistry [131]. Alongside C, it is considered one of the most popular languages for HPC, with extensive support by MPI [81].

Despite representing a relatively old programming language, Fortran is constantly evolving according to the requirements of the community. One clear example is Coarray Fortran (CAF), originally a syntactic extension of Fortran 95 that eventually became part of the Fortran 2008 standard in 2010 (ISO/IEC 1539-1:2010) [88, 107]. The main objective of CAF is to simplify the development of parallel applications without the burden of explicitly invoking communication primitives or directives, such as those available in MPI or OpenMP [28]. The programming model of CAF is based on the Partitioned Global Address Space (PGAS), in which every process is able to access a portion of the memory address space of other processes using shared-memory semantics. The concept is similar to MPI one-sided communication [48], thoroughly discussed in the previous chapters.

A program that uses CAF is treated as a replicated entity, commonly referred to as an image, with a unique index between 1 and the number of images (inclusive). As in MPI, Fortran provides the this_image() and num_images() functions to introduce parallelism in the code. Any variable can be declared as a coarray inside an image, which can be represented by a scalar or array, static or dynamic, and of intrinsic or derived type. Applications access a coarray object on a remote image using square brackets "[index]". The lack of square brackets signifies accessing the object local to the process. For further clarification, Figure 5.1 demonstrates this representation by showing one of the images performing a get operation to retrieve the array x from image N, and then a single put to image 2.

Even though the versatility of Coarray Fortran provides tremendous opportunities to scientific applications, the language is currently not prepared for the integration of novel hardware and software components coming to HPC clusters (Chapter 1). Such integration is expected to considerably aggravate the Mean Time Between Failures (MTBF), while


[Figure 5.1 diagram: images 1..N each hold Coarray x (array of N elements) and Coarray y (single scalar); the statement x(:) = x(:)[N] gets an array from image N, and x(9)[2] = x(1) puts a value into image 2.]

Figure 5.1: Coarray Fortran allows images (i.e., processes) to communicate following shared-memory semantics [107]. The example illustrates several images, in which the first one performs put/get operations using the brackets notation.

simultaneously increasing the programming complexity of these clusters [39]. In this regard, the Fortran 2018 standard has already proposed the concept of failed images to enable fault-tolerance in existing CAF-based applications [106]. The problem with this feature, however, is that it only provides functionality to evaluate the status of a given image. Consequently, we observe an urgency towards the integration of transparent resilience support inside Coarray Fortran.

In this chapter, we propose persistent coarrays [111], an extension of the OpenCoarrays communication library of GNU Fortran (GFortran) [38] that integrates MPI storage windows (Chapter 3) [114] into its transport layer. The integration seamlessly enables mapping coarrays to files on different storage devices, providing the support to recover execution in combination with failed images.

Key Contributions

• We present persistent coarrays to provide transparent resilience support in Coarray Fortran, as well as a general-purpose I/O interface. The goal is to combine such functionality in the future with failed images from Fortran 2018 [106].

• We extend the transport layer of OpenCoarrays [38] with MPI storage windows (Chapter 3) [114] to map coarrays into files through a very simple mechanism based on environment variables, thus easing the integration into CAF-based applications.

• We evaluate the performance of our implementation under representative workloads, and compare the results with non-resilient implementations.

• We provide further insight about how this approach could be integrated into future revisions of the Coarray Fortran standard, including three potential APIs.

5.1 Integrating Persistent Coarrays in Fortran

The failed images feature of Fortran 2018 forces applications to utilize Fortran-IO (Subsection 2.4.4) or other means to recover after a failure. Such support inherently increases the programming complexity, especially on heterogeneous clusters. To overcome this limitation, we propose to extend the concept of a coarray to become persistent on storage devices.

[Figure 5.2 diagram: (a) File-per-process Configuration, where each coarray of an image is backed by an MPI storage window mapped to an independent file (e.g., OpenCoarrays_1_0.win, OpenCoarrays_1_1.win); (b) Persistent + Non-Persistent Configuration, mixing MPI storage windows and MPI memory windows in the LIBCAF_MPI transport layer.]

Figure 5.2: By extending the transport layer of OpenCoarrays [38] with MPI storage windows, CAF images can (a) map coarrays into individual files, or (b) combine persistent / non-persistent coarrays for neighbour-based checkpointing [20].

Such a change requires no major modifications to the Fortran standard. In particular, the Fortran specification mostly describes the semantics and respective functionality to interact between coarrays located on different images. However, it does not restrict the supporting technology to which a specific coarray is pinned (e.g., NVDIMM [64, 83]). Moreover, the different synchronization mechanisms already available for CAF provide an opportunity to guarantee data consistency with storage as necessary. Thus, the integration of resiliency in CAF is expected to become feasible in the near-term future.

5.1.1 Reference Implementation

We design persistent coarrays as an extension of the open-source OpenCoarrays communication library for CAF compilers1 [38]. OpenCoarrays is a collection of transport layers that assist compilers by translating communication and synchronization requests into specific primitives from other communication libraries (e.g., OpenSHMEM [19]). The GNU Fortran (GFortran) compiler has integrated OpenCoarrays since the GNU Compiler Collection (GCC) v5.1.0 release. The front end of GFortran is designed to remain agnostic about the actual transport layer employed in the communication. By default, GFortran uses the LIBCAF_MPI transport layer of OpenCoarrays, based on MPI one-sided communication [45, 48]. Here, coarrays are represented as MPI memory windows and accessed internally with put/get operations through the passive target synchronization of MPI-3 [81].

We extend the MPI-based transport layer of OpenCoarrays to integrate MPI storage windows [114] and transparently expose resilient variables in CAF. This is exemplified

1 https://github.com/sourceryinstitute/OpenCoarrays/tree/MPISWin

in Figure 5.2, where a file-per-process configuration is illustrated, assigning each coarray to an independent file on storage. We also expect to support mixed persistent and non-persistent configurations (see Section 5.2). More importantly, the use of persistent coarrays considerably reduces the programming complexity of HPC applications, while providing the opportunity for an alternative to Fortran-IO [67] or even MPI-IO [142].

Existing CAF-based applications compiled with GFortran and the extended OpenCoarrays implementation will seamlessly convert their traditional coarrays (i.e., memory-based) into persistent coarrays (i.e., storage-based). In most cases, no specific source code changes are required. We also extend the synchronization statements defined in the CAF specification to automatically enforce consistency between main memory and storage. This is accomplished by internally calling MPI_Win_sync with the MPI storage window associated to each coarray (see Section 3.2). Two methods are possible2:

◦ Collective synchronization. Barrier-like synchronization to ensure RDMA completion plus a synchronization with storage. This is accomplished with "sync all".

◦ Individual synchronization. Enforces synchronization between images, without involving the rest. This is accomplished with "sync images([indices])".

Applications can choose whether to always synchronize with storage during each synchronization point (default), or to limit the amount of data synchronizations performed to improve the overall performance. For compatibility reasons with the Fortran standard, persistent coarrays are configured through several environment variables:

• PCAF_ENABLED. If set to "true", enables support for persistent coarrays in the application. Otherwise, the coarrays will be allocated in memory (default).
• PCAF_PATH. Defines the path where the coarray files are mapped (e.g., local or remote storage). The name of each file is generated using the image index.
• PCAF_PREFIX. Sets the prefix for each coarray file. By default, "OpenCoarrays_".
• PCAF_SHAREDFILE. If set to "true", allows the use of a single shared file that contains the coarrays of a given image. Otherwise, a file per coarray is created (default).
• PCAF_UNLINK. If set to "true", removes the files after execution (e.g., for out-of-core).
• PCAF_SYNCFREQ. Specifies the expected synchronization frequency with storage [146].
• PCAF_SEGSIZE. Defines the segment size utilized in the specific MPI storage window where the coarray is mapped. By default, the page-size available at runtime is set.
• PCAF_STRIPESIZE. Sets the striping unit to be used for the mapped file (e.g., stripe size of Lustre). This hint has no effect on existing files and/or local storage.
• PCAF_STRIPECOUNT. Sets the number of I/O devices that the file is striped across (e.g., OSTs of Lustre). This hint has no effect on existing files and/or local storage.

The environment variables described allow us to configure the MPI storage window settings of the coarray, which are included through the MPI performance hints provided to the MPI_Win_allocate call. For instance, applications that use CAF can continue to allocate coarrays in memory by avoiding to set the environment variable PCAF_ENABLED, or by setting it to "false".
To enable persistent coarrays, it is enough to define the aforementioned variable with a "true" value instead, and to provide a base path with PCAF_PATH. The rest of the environment variables are optional and depend on the particular use-case.

2 Fortran also provides locks, critical sections, atomic intrinsics, and events [106]. Using these operations, data consistency can be guaranteed even when multiple images access the same coarray.
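As a rough sketch of how the transport layer could consume these variables when allocating a coarray (not the actual OpenCoarrays code; the helper name and the MPI info keys are assumptions about the MPI storage windows interface):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical transport-layer helper: allocate the window backing a
 * coarray either in memory or on storage, driven by the PCAF_* variables. */
static void *allocate_coarray(MPI_Aint size, int image_idx, int coarray_idx,
                              MPI_Win *win)
{
    void       *baseptr = NULL;
    MPI_Info    info    = MPI_INFO_NULL;
    const char *enabled = getenv("PCAF_ENABLED");
    const char *path    = getenv("PCAF_PATH");
    const char *prefix  = getenv("PCAF_PREFIX");

    if (enabled != NULL && path != NULL && strcmp(enabled, "true") == 0)
    {
        /* One file per coarray, named after the image and coarray index. */
        char filename[1024];
        snprintf(filename, sizeof(filename), "%s/%s%d_%d.win", path,
                 (prefix != NULL) ? prefix : "OpenCoarrays_",
                 image_idx, coarray_idx);

        /* Assumed hints understood by the MPI storage windows library.  */
        MPI_Info_create(&info);
        MPI_Info_set(info, "alloc_type", "storage");
        MPI_Info_set(info, "storage_alloc_filename", filename);
    }

    MPI_Win_allocate(size, 1, info, MPI_COMM_WORLD, &baseptr, win);

    if (info != MPI_INFO_NULL)
        MPI_Info_free(&info);

    return baseptr;
}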

...
! Declare two coarrays and other variables
real, dimension(10), codimension[*] :: x, y  ! Arrays of 10 real numbers
integer :: num_img, img                      ! Aux. variables to parallelize

! Retrieve index and number of images (similar to checking MPI_COMM_WORLD)
img = this_image()
num_img = num_images()

! Perform certain operations across images
x(2) = x(3)[7]   ! Get value from image 7
x(6)[4] = x(1)   ! Put value on image 4
x(:)[2] = y(:)   ! Put a complete array on image 2

! Enforce a collective synchronization point, which additionally guarantees
! data consistency with storage as well
sync all

! Remote array transfer, with individual synchronizations among two images
if (img == 1) then
  y(:)[num_img] = x(:)
  sync images(num_img)
elseif (img == num_img) then
  sync images([1])
endif
...

Listing 5.1: Source code that illustrates how to use coarrays in Fortran. The example will use persistent coarrays if compiled with our version of OpenCoarrays [111].

5.1.2 Source Code Example

The support for persistent coarrays is currently provided at compiler level, meaning that applications that already utilize CAF can be easily converted. Listing 5.1 illustrates a source code example using persistent coarrays. The example begins by declaring two coarrays x and y, with 10 floating-point values each. The use of codimension[*] requests the compiler to define these as CAF coarrays, instead of traditional arrays. The brackets determine the layout and distribution of the images, set by default in this case. Thereafter, the example retrieves the index and also the number of images, and performs multiple operations across images using the square brackets "[index]". Finally, the source code shows how to enforce both global and individual synchronizations between pairs of images.

5.2 Application Programming Interface

The presented implementation in OpenCoarrays seamlessly allows converting existing coarrays into persistent coarrays. Despite the simplicity of our approach, we also consider that it imposes several constraints. For instance, it is currently not possible to differentiate individually between persistent and non-persistent coarrays, thus limiting the possibilities of utilizing persistent coarrays as an alternative I/O interface to Fortran-IO [67]. Even though the design of a specific API for persistent coarrays has been considered out of the scope of this work, we provide below further insight about how the concept could pave its way into a generalized API that might even influence the Fortran standard in the future.

Keyword in Fortran

Extending the declaration of a coarray to become persistent could be possible by integrating a "persistent" keyword into the standard, provided during the declaration of the coarray. Although this is the most elegant solution, the main inconvenience is that it would take time to reach end-users. Furthermore, the path must still be configured by other means (e.g., through environment variables or a data-placement runtime).

real, dimension(n), persistent, codimension[*] :: a  ! Persistent coarray
real, dimension(n), codimension[*] :: b               ! Traditional coarray

Compiler Directives

Instead of forcing changes into the Fortran standard, an alternative could be to define a mechanism using directives, as in OpenMP [28]. Hence, the support is provided at compiler level and offers the opportunity to configure the persistent coarray (e.g., its path).

!dir$ persistent, path, ...               ! Persistent coarray config.
real, dimension(n), codimension[*] :: a   ! No changes in declaration

Library API

OpenCoarrays could expose specific functionality to extend the definition of a coarray to become persistent (e.g., similar to the mechanism used by DMAPP [139]). The main advantage is that it depends on neither the Fortran standard nor the compiler, and that it allows configuring the path individually for each coarray (i.e., instead of a base path).

real, dimension(n), codimension[*] :: a   ! No changes in declaration
call opencoarrays_config(a, "persistent", "path", ...)

5.3 Experimental Results

We evaluate persistent coarrays using a reference testbed from the National Center for Atmospheric Research (NCAR) [25]. This cluster allows us to demonstrate the implications of our approach on upcoming clusters with local persistency and a traditional PFS:

• Casper is a cluster with 28 heterogeneous compute nodes equipped with dual 18-core Skylake Xeon Gold 6140 2.3GHz processors and 384GB DRAM. Storage is provided through 2TB of local NVMe SSD drives per node, alongside a GPFS parallel file system with approximately 15PB. The OS is CentOS 7 with Kernel v3.10.0-693.21.1.el7.x86_64. Applications are compiled with GCC v7.3.0 and OpenMPI v3.1.4. The transport layer is provided by OpenCoarrays v2.6.3 with our extension.

The figures reflect the standard deviation of the samples as error bars, whenever possible. We configure MPI storage windows with the default values for most of the I/O settings of the PMPI-based library (Subsection 3.1.1). In addition, we also set the default settings for GPFS, assigning a block size of 8MB and a sub-block of 16KB. The evaluations are conducted on different days and timeframes, to account for the interferences produced by other users on the cluster (i.e., for fair use, we requested a shared queue in Casper).

[Figure 5.3 panels: (a) Single-Put (Single Node); (b) Single-Get (Single Node); (c) Single-Put (Multi-Node); (d) Single-Get (Multi-Node). Axes: Bandwidth (GB/s) vs. Block Size. Series: Memory, Persistent (NVMe), Persistent (GPFS).]

Figure 5.3: Bandwidth of traditional coarrays in memory, and persistent coarrays on local NVMe storage and GPFS using the EPCC CAF Microbenchmark [59] on Casper. The tests run with 2 processes on a single node (a,b) and two nodes (c,d).

[Figure 5.4 panels: (a) Stencil Kernel Evaluation; (b) Transpose Kernel Evaluation. Axes: Exec. Time (s) vs. Num. Processes. Series: Memory, Persistent (NVMe), Persistent (GPFS).]

Figure 5.4: Performance of traditional coarrays in memory, and persistent coarrays on local NVMe storage and GPFS using reference PRK kernels [149] on Casper. The tests run on 1, 2 and 4 nodes with a fixed grid of 8192 × 8192 (strong scaling).

5.3.1 Remote Memory Operations Performance

Using the Coarray Fortran Microbenchmark suite3 from the University of Edinburgh [59], we attempt to verify that the integration of MPI storage windows in the transport layer of OpenCoarrays does not incur additional overheads when performing remote memory operations. Our evaluations avoid storage synchronizations for reliability of the results.

The evaluations confirm that performing remote memory operations on persistent coarrays does not introduce any relevant overheads compared to traditional coarrays in memory. Figure 5.3 illustrates the measured bandwidth of coarrays in memory, alongside persistent coarrays on the local NVMe storage and GPFS of Casper. The tests conduct single and multi-node evaluations, using two different images. The results from (a,b) indicate that no relevant performance differences are observed when conducting put/get operations, except for a slight 4% improvement in our implementation, possibly due to how OpenMPI and MPI storage windows allocate memory (i.e., shared mapping). When conducting inter-node communication, (c,d) demonstrate almost identical performance.

3 https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-co-array-fortran-micro

5.3.2 Representative Workloads Evaluation

To compare the previous results under representative workloads, we evaluate persistent coarrays using the Parallel Research Kernels (PRK) for Fortran4 [149]. This collection of kernels covers patterns of communication, computation, and synchronization encountered in HPC applications. In particular, we use the Stencil and Transpose kernels of PRK to understand the implications of persistent coarrays when enforcing storage synchronizations. The former applies a data-parallel stencil operation to a two-dimensional array, while the latter is meant to stress communication and memory bandwidth.

The results illustrate that the use of local storage for persistent coarrays can be beneficial, demonstrating the importance of heterogeneous support in I/O interfaces at exascale. Figure 5.4 illustrates the performance of traditional coarrays in memory, alongside persistent coarrays that use the local NVMe storage and GPFS of Casper. The evaluations utilize a fixed grid dimension of 8192 × 8192 (strong scaling) over 10 iterations. For 128 processes (4 nodes), the performance improves 2.1× in comparison when observing the Transpose kernel. Moreover, due to the problem size, the limited amount of I/O operations required for persistent coarrays produces identical results compared to coarrays in memory.

5.4 Concluding Remarks

The inherent increase in the number of hardware and software components on HPC clusters will likely aggravate the Mean Time Between Failures (MTBF) faced by scientific applications [14, 104]. This chapter has presented the concept of persistent coarrays, an extension to the coarray specification of Coarray Fortran that provides seamless resilience support. The approach enables the integration of fault-tolerance mechanisms for CAF, in combination with the failed images feature [106]. In addition, it represents an alternative to Fortran-IO [67], allowing applications to conduct I/O operations by accessing coarrays.

Despite the positive results, we note that the default implementation of MPI storage windows utilized in our experiments is based on the memory-mapped I/O mechanism of the OS (Subsection 2.4.1), which imposes major constraints on certain HPC applications [114]. Thus, future implementations of MPI storage windows are expected to provide persistent coarrays with a higher level of flexibility and performance. The next chapter presents an HPC-centric memory-mapped I/O system library for this particular purpose [112].

More information about persistent coarrays and the concept of failed images of Fortran 2018 can be obtained in the following references: [39, 111].

4 https://github.com/ParRes/Kernels/tree/master/FORTRAN

Chapter 6

Global Memory Abstraction for Heterogeneous Clusters

Scientific applications have traditionally performed I/O by either instructing each process to write to separate files, or by making all processes send their data to dedicated processes that gather and write the data to storage [48, 99]. The MPI standard eased the integration of such techniques in MPI-IO (Section 2.1) [48] with advanced features specifically designed for HPC applications, such as non-contiguous access, collective I/O, and non-blocking I/O variations. These are essential for high-performance I/O over PFS (e.g., GPFS [126]).

Notwithstanding, the integration of local storage technologies on supercomputers raises concerns in regards to the viability of these programming interfaces. For instance, allocating and moving data in heterogeneous clusters often requires different approaches for accessing memory and storage [114]. Consequently, momentum is gathering around new ecosystems that enable domain-specific I/O architectures [152]. The goal is to bring device control and data planes into user space, integrating support for intra-node I/O without affecting the overall system stability. In such cases, mechanisms that resemble memory operations could also streamline the development and increase the adoption rate [150]. But as we motivated in Section 3.3, solutions like the memory-mapped I/O of the OS [8] have been historically neglected on HPC, featuring three major limitations:

1. Lack of user-configurable settings. Despite providing vm files [146], memory-mapped I/O only exposes these as global settings, difficult to set on supercomputers.

2. Absence of client-side buffering. Most PFS do not support the page cache of the OS (e.g., Lustre), producing "asymmetric" performance for read/write [113].

3. No support for recent storage technologies. The DataWarp burst buffer, integrated in the Cori supercomputer from NERSC [91], cannot be accessed.

These limitations increase the constraints on the development of scientific applications, while dismantling opportunities for multidisciplinary and multidomain applications at exascale. Moreover, MPI and other I/O libraries lack the proper underlying system-level support. MPI storage windows (Chapter 3) [114], for example, relies on the memory-mapped I/O mechanism of the OS to map memory and storage (Figure 6.1), indirectly affecting MapReduce-1S [116] and Persistent Coarrays [111] of the previous chapters.


[Figure 6.1 diagram: (a) Storage window allocation, where the reserved virtual memory addresses of an MPI rank are fully mapped to storage after MPI_Win_allocate; (b) Combined window allocation, where the reserved address range is split between a memory mapping and a storage mapping.]

Figure 6.1: MPI storage windows utilizes the memory-mapped I/O mechanism of the OS [8] to (a) establish a map to storage, or (b) define a combined address space that includes memory and storage on the same window (e.g., for out-of-core).

Hence, abstracting the storage subsystem of heterogeneous clusters without the need for separate interfaces for I/O is a necessity on supercomputers. In this chapter, we present uMMAP-IO [112], a library that provides a user-level memory-mapped I/O implementation for HPC. By employing a custom page-fault mechanism that intercepts memory accesses, applications integrate storage using just load/store operations. Thus, our approach seamlessly exposes numerous storage technologies through a unified interface.

Key Contributions

• We propose uMMAP-IO to map memory addresses into storage devices at user-level. As a system library, the approach also represents a clear alternative for the underlying implementation of MPI storage windows (Chapter 3) [114].

• We provide a reference, open-source implementation of uMMAP-IO and describe how it can be used, including some of the characteristics chosen for its novel API.

• We compare the performance of our implementation with memory-only allocations, as well as the reference memory-mapped I/O support of the OS [8]. The evaluations also demonstrate support for recent storage technologies (e.g., Cray DataWarp [92]).

• We briefly discuss how uMMAP-IO supports memory management modes and different evict policies, including a novel WIRO policy that reduces I/O operations.

6.1 Creating a User-level Memory-mapped I/O

The basic principle behind uMMAP-IO resides in its memory abstraction layer to manage allocations. Such abstraction separates virtual address spaces from their corresponding

[Figure 6.2 diagram: HPC applications issue load/store operations on a virtual memory address space that uMMAP-IO divides into segments; each segment is read from / written to a corresponding offset of the mapped file in storage through the memory abstraction layer.]

Figure 6.2: Applications interact with each mapping through a seamless address space using load/store operations. uMMAP-IO abstracts the division into segments and how they are effectively mapped to storage.

mapping to storage (Figure 6.2). When a new memory-mapping is requested, the library reserves a range of virtual addresses and divides it into logical segments of user-configurable size (e.g., 64MB). The concept of a segment here is similar to the concept of a page in the OS. However, in this case, a segment represents the basic unit of operation when conducting I/O data transfers from / to storage (i.e., instead of fixed-size 4KB pages for all mappings). Each segment is mapped to an offset within the file, allowing different processes to map a file at different displacements (e.g., segment i of a mapping with base file offset off is backed by the byte range [off + i · seg_size, off + (i + 1) · seg_size)). Alternatively, a process can remap its base offset inside the same or a different file without any representative cost (e.g., incremental checkpoints).

More importantly, applications interact with a seamless virtual address space using just load/store operations. No distinction is necessary compared to a conventional memory allocation. Accessing any of the bytes assigned to the address space of a segment triggers the physical allocation in main memory and the data migration from storage, if necessary. Further accesses are solely made in main memory, until a data synchronization is either explicitly requested by the user, or implicitly performed by the library.

6.1.1 Reference Implementation

We provide uMMAP-IO as a C (ISO/IEC 9899:1999) open-source library1. This library supports equivalent functionality to the original memory-mapped I/O implementation of the OS [8], but with the flexibility of integrating the majority of its features at user-level. The memory-mapping mechanism of uMMAP-IO is based on protecting anonymous memory allocations and intercepting memory accesses to the associated virtual address space. Each segment is revoked permission by default when first allocated. Thus, the protection

1 https://github.com/sergiorg-kth/ummap-io

is established at segment-level. We use a combination of "mprotect + SIGSEGV" capture to detect memory accesses. This technique is similar to how hypervisors implement post-copy live migration2 [132, 143]. Our library controls when the OS is allowed to maintain the association between virtual and physical addresses in main memory using segment-size granularity, as well as the content and permission established on those regions. Neither the OS nor the applications are aware of the division into segments. Several Unix system functions are employed to support this core functionality of uMMAP-IO (a minimal sketch of the technique follows the list):

• mmap. Utilized to reserve the range of virtual addresses. We specify MAP_ANONYMOUS and MAP_PRIVATE to request a memory-only mapping.
• mprotect. Allows us to change the access permission for each segment. By default, we set PROT_NONE to the whole address space to detect both read and write accesses. The permissions are then set to PROT_READ and/or PROT_WRITE accordingly.
• madvise. Allows us to control which segments reside physically in memory. We set MADV_DONTNEED on a segment when we want to reduce the Resident Set Size (RSS).
• sigaction. We use it to register a page-fault handler for the SIGSEGV signal, alongside a custom SIGEVICT signal for Inter-Process Communication (IPC) in the node.
• sigprocmask. Allows us to block the registered signals to avoid consistency issues.
• munmap. Releases the virtual address space for the mapping when no longer needed.
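The following minimal, self-contained sketch illustrates the "mprotect + SIGSEGV" technique on a single anonymous mapping; it is not the uMMAP-IO implementation, and segment bookkeeping, storage transfers, and error handling are omitted:

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SIZE (1UL << 20)   /* 1MB mapping           */
#define SEG_SIZE (64UL << 10)  /* 64KB logical segments */

static char *base;             /* base address of the protected mapping */

/* Page-fault handler: grant access to the faulting segment and return,
 * which makes the CPU retry the interrupted load/store.                */
static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    size_t offset = (char *)si->si_addr - base;
    char  *seg    = base + (offset / SEG_SIZE) * SEG_SIZE;

    /* A real implementation would read the segment from storage here
     * (if its read bit is set) and update the segment metadata.       */
    mprotect(seg, SEG_SIZE, PROT_READ | PROT_WRITE);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* Anonymous, access-protected reservation of virtual addresses. */
    base = mmap(NULL, MAP_SIZE, PROT_NONE,
                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    base[4096] = 42;   /* triggers SIGSEGV -> handler -> retried store */
    return 0;
}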

When a SIGSEGV signal is intercepted, we evaluate the si_addr field to understand the specific address accessed by the application. We use this field to estimate the base address of the mapping through a custom Pseudo-Least Recently Used (pLRU) cache that contains all the memory-mappings established by the library. The type of page-fault (i.e., read or write) is determined by evaluating the REG_ERR registry from the hardware. Thereafter, the segment structure is accessed and updated. This structure holds properties per segment that represent its status, whether or not it has been modified with respect to the last-known version in storage, and more. The fields per segment are illustrated below:

    ← 64 bits (Header) →                  ← 64 bits (Futex) →        ← 128 bits (Policy Relations) →
    V      D      R     -     TS          WRD          CNT           PREV         NEXT
    Valid  Dirty  Read  Res.  Timestamp   Sync. Word   Rank Count    Prev. Seg.   Next Seg.

The 64-bit header contains encoded bits that define four major properties required for the page-fault mechanism. For instance, the valid bit V is utilized to understand whether or not the segment is currently mapped into main memory, while the dirty bit D is set when the segment has been modified. The read bit R is set when the content of the segment must be retrieved from storage, improving the I/O performance by disabling it (e.g., on temporary files). Lastly, the timestamp TS records when the segment was last modified with respect to the last storage synchronization, useful for data consistency. We envision two types of consistency semantics, which provide users with the flexibility of selecting, per memory-mapping, how to write back the dirty segments to storage:

2 Post-copy live migration is a form of memory externalization consisting of a virtual machine running with part of its memory on a different node [132].

// Establishes and configures a user-level memory-mapped I/O allocation
int ummap(size_t size, size_t seg_size, int prot, int fd, off_t offset,
          int flush_interval, int read_file, int ptype, void **addr);

// Performs a selective synchronization of the dirty segments and, if
// requested, reduces the RSS by evicting the segments from memory
int umsync(void *addr, int evict);

// Allows to re-configure a user-level memory-mapped I/O allocation
int umremap(void *old_addr, size_t size, int fd, off_t offset, int sync,
            void **new_addr);

// Releases a given user-level memory-mapped I/O allocation
int umunmap(void *addr, int sync);

Figure 6.3: List of the interfaces implemented for uMMAP-IO. The specification and semantics for each function resemble the traditional memory-mapped I/O of the OS [8], with slight variations to improve performance and flexibility.

◦ Explicit Consistency. Users can enforce selective data synchronizations with storage at any point (i.e., only dirty segments), similar to msync from the OS.

◦ Implicit Consistency. Each process is assigned an I/O thread, mimicking the pdflush threads [146]. Users can vary the frequency per mapping, or ignore them.

While cold-accesses to a non-valid segment trigger individual data migrations from storage to main memory if the read bit R is set, data synchronizations are conducted in two different modes. The Individual Synchronization mode synchronizes dirty segments independently in blocks of the segment size. On the other hand, the Bulk Synchronization mode utilizes a technique inspired by the Data Sieving algorithm of MPI-IO [141] to reduce the number of I/O operations and aggregate the requests into large groups of segments. In any of the cases, all the storage data transfers are conducted using POSIX-IO [1], allowing us to accept a file descriptor for each mapping. As discussed in Chapter 7, we also expect to leverage the support with dedicated I/O drivers, depending on the storage technology (e.g., collective MPI-IO on PFS).

The last 192 bits of the segment properties define a custom 64-bit Fast User-level Mutex [56] to prevent incoherent data transfers (e.g., if the I/O thread is synchronizing a segment while the application is attempting to modify it), as well as the 128-bit evict policy relations between the segments, which are established and updated during each page-fault.
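As an illustration of the bulk synchronization idea (a simplified sketch, not the uMMAP-IO source; the metadata struct and helper are assumptions), consecutive dirty segments can be coalesced into a single write request:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified per-segment metadata (the real layout packs these fields
 * into the 64-bit header and futex words described above).             */
typedef struct {
    unsigned valid : 1;
    unsigned dirty : 1;
    unsigned read  : 1;
    uint64_t timestamp;
} segment_t;

/* Bulk-style write-back: coalesce runs of consecutive dirty segments
 * into a single pwrite() to reduce the number of I/O operations.       */
static int write_back(int fd, char *base, off_t file_off,
                      segment_t *seg, size_t nsegs, size_t seg_size)
{
    for (size_t i = 0; i < nsegs; )
    {
        if (!seg[i].dirty) { i++; continue; }

        size_t first = i;
        while (i < nsegs && seg[i].dirty)   /* extend the dirty run */
            seg[i++].dirty = 0;

        ssize_t bytes = (ssize_t)((i - first) * seg_size);
        if (pwrite(fd, base + first * seg_size, (size_t)bytes,
                   file_off + (off_t)(first * seg_size)) != bytes)
            return -1;
    }
    return 0;
}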

6.1.2 Application Programming Interface

We design the interface of uMMAP-IO to resemble the memory-mapped I/O implementation of the OS (Figure 6.3). At the same time, we extend the API to include valuable additions, such as the possibility of reducing the process' RSS after data synchronizations, or the ability to remap to different files. We describe below the most relevant aspects:

• ummap. Establishes a user-level memory-mapping of a file. Applications can choose the segment size, offset within the file, I/O thread synchronization frequency (e.g., -1 to ignore), and even whether or not to read the content of the file (i.e., the R bit).
• umsync. Performs a selective data synchronization of the given memory-mapping. The evict flag allows the library to reduce the RSS after synchronization.
• umremap. Allows to re-configure an existing memory-mapping, useful for fault-tolerance. The new_addr reflects the base pointer after re-mapping (it might differ).
• umunmap. Releases a given memory-mapping, allowing to perform a data synchronization before releasing, if required.

In addition, certain settings can be configured through environment variables, meant to affect all the processes. One example is UMMAP_BULK_SYNC to enable bulk synchronization in umsync, or UMMAP_MEM_LIMIT to restrict the main memory consumption per node.

6.1.3 Source Code Example

Listing 6.1 illustrates a reference source code example that demonstrates the use of uMMAP-IO. The example begins by opening the target file to obtain the file descriptor. This is required by the library to create the memory-mapping. Using ummap, we request a 1GB allocation with 16MB segments. In addition, we ask not to read the content on cold-accesses (i.e., read bit R unset), and also disable the I/O thread to avoid implicit synchronizations by the library. At this point, the output baseptr can be utilized exactly like a conventional memory pointer. The example initializes the allocated mapping with a fixed value using the traditional array notation, and even calls memset to demonstrate the simplicity of the mechanism. Lastly, a data synchronization is enforced before releasing.
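Building on Listing 6.1, the following hedged fragment sketches how umremap could be used for fault-tolerance; the checkpoint file name is hypothetical and the semantics of the sync argument are assumed to flush dirty segments before re-mapping:

...
// Hypothetical continuation of Listing 6.1: redirect the mapping to a
// new checkpoint file; the 'sync' argument is assumed to flush the
// dirty segments before the re-mapping takes place
int   ckpt_fd = open("/path/to/checkpoint", flags, mode);
void *new_ptr = NULL;

ftruncate(ckpt_fd, size);
umremap(baseptr, size, ckpt_fd, 0, TRUE, &new_ptr);
close(ckpt_fd);

// The base pointer might differ after the re-mapping
baseptr = (int8_t *)new_ptr;
...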

6.2 Evict Policies and Memory Management Modes

The last 128 bits of the segment properties represent the evict policy relations between segments. Such relations are specified by a doubly linked-list implementation (hence, the PREV/NEXT pointers). We utilize this approach to determine which segments are more likely to be evicted in the condition of running out-of-memory, following user-configurable policies. For instance, a "FIFO" policy evicts from main memory the first segment that was accessed by the application. We also integrate our own "WIRO" (Writes-In, Reads-Out) policy, designed specifically to reduce the number of I/O operations. In this particular case, the goal is to first evict the segments that currently hold a read-only permission.

The policies determine which segments must be evicted after the conditions of the memory management mode are met. Two different modes can be configured:

a) "Static" Mode. Each process equally divides the main memory capacity of the node, and is not allowed to exceed the limit per process (default).

b) "Dynamic" Mode. Akin to the previous mode, but allows processes to exceed their limit temporarily as long as there is space. This is particularly useful in data analytics applications, where the memory consumption can be relatively unknown beforehand.

In the dynamic case, processes behave following a procedure inspired by the additive increase and multiplicative decrease of the congestion-avoidance protocol in TCP

...
int     flags   = (O_CREAT | O_RDWR);        // File flags with access mode
mode_t  mode    = (S_IRUSR | S_IWUSR);       // File creation permission flags
int     prot    = (PROT_READ | PROT_WRITE);  // Permissions for the mapping
size_t  size    = 1073741824;                // 1GB for the requested memory-mapping
size_t  segsize = 16777216;                  // 16MB for the segment size
off_t   offset  = 0;                         // Offset in file (i.e., beginning of file)
int     fd      = -1;                        // File descriptor
int8_t *baseptr = NULL;                      // Allocation pointer to access the mapping

// Open the file descriptor for the mapping
fd = open("/path/to/file", flags, mode);

// Ensure that the file has space (optional)
ftruncate(fd, size);

// Create the memory-mapping with uMMAP-IO, disabling the I/O thread
ummap(size, segsize, prot, fd, offset, -1, FALSE, 0, (void **)&baseptr);

// It is now safe to close the file descriptor
close(fd);

// Set some random value on the allocation using load / store
for (off_t i = 0; i < size; i++)
{
    baseptr[i] = 21;
}

// Alternative: Use traditional memory manipulation functions
memset(baseptr, 21, size);

// When required, synchronize with storage to ensure data consistency
umsync(baseptr, FALSE);

// Finally, release the mapping if no longer needed
umunmap(baseptr, TRUE);
...

Listing 6.1: Source code that illustrates how to use uMMAP-IO to create a 1GB mapping with 16MB segments. Error-checking is skipped for illustration purposes.

Reno [77]. A custom real-time IPC signal (SIGEVICT) and a set of shared memory segments are utilized to notify the processes that are currently exceeding their main memory limit. When a process receives such a signal, it reduces the excess memory by half. This behaviour is illustrated in Figure 6.4. The figure shows the difference in performance for the two memory management modes using mSTREAM (Section 3.3). We run 8 processes with delayed execution, and randomize their consumption between 1GB and 4GB, with a fixed limit of 8GB (i.e., 1GB per process). The results clearly illustrate how the dynamic memory management reduces the execution time from 45.06s to 21.81s, i.e., 2.06× faster.

6.3 Experimental Results

We illustrate the performance of uMMAP-IO using two different testbeds. The first testbed is a supercomputer from the National Energy Research Scientific Computing Center (NERSC) with SSD storage provided through a burst buffer I/O accelerator, alongside a Lustre PFS [4]. The second testbed is a supercomputer from KTH Royal Institute of Technology with storage provided only by Lustre [94]. The specifications are given below:

[Figure 6.4 panels: (a) Static Memory Management; (b) Dynamic Memory Management. Axes: RSS Mem. (MB) vs. Time (s), with the memory limit, individual RSS, and IPC events marked; the executions finish at 45.06s and 21.81s, respectively.]

Figure 6.4: Execution timeline and RSS on a fixed limit of 1GB per process using (a) static and (b) dynamic memory management with uMMAP-IO on Beskow [94].

• Cori is a Cray XC40 supercomputer with 2388 nodes equipped with dual 16-core Haswell E5-2698v3 2.3GHz processors and 128GB DRAM. The storage employs a Cray DataWarp (DVS v0.9.0), and a Lustre PFS with 248 OST servers (client v2.7.5.13). The OS is SUSE ES 12 with Kernel 4.4.178-94.91-default. Applications are compiled with Intel C Compiler (ICC) v18.0.1 and Intel MPI v2018.up1.

• Beskow is a Cray XC40 supercomputer with 1676 nodes equipped with dual 16-core Haswell E5-2698v3 2.3GHz processors and 64GB DRAM. The storage employs a Lustre PFS with 165 OST servers (client v2.5.2). The OS is SUSE Linux ES 11 with Kernel 3.0.101-0.46.1_1.0502.8871-cray_ari_s. Applications are compiled with Cray C Compiler (CCE) v8.3.4 and Cray-MPICH v7.0.4.

The source code utilized in all the tests is available as an open-source repository3. When using Lustre, we follow the guidelines provided by NERSC [85]. The figures reflect the standard deviation of the samples as error bars. In addition, and due to the nature of our evaluations (i.e., heavy I/O stress tests), we take special considerations to prevent affecting the I/O performance for other applications on each cluster4. Lastly, we refer to the memory-mapped I/O and the burst buffer as "MMAP-IO" and "BBF", respectively.

6.3.1 Single-node Microbenchmark

Interleaved Or Random (IOR) [129] is a microbenchmark for evaluating the performance of traditional PFS. We use it to measure sustained throughput performing large amounts of small, non-overlapping I/O accesses over a single shared file across all processes. This represents the worst-case for uMMAP-IO, hence the importance of the evaluations.

With this microbenchmark, we observe that performing small I/O operations can introduce a large overhead when targeting Lustre compared to plain memory allocations,

3 https://github.com/sergiorg-kth/ummap-io-benchmarks
4 The evaluations are restricted on Cori to single-node results, while on Beskow, where we run larger-scale tests, we limit both the execution time and the amount of I/O operations.

[Figure 6.5 panels: (a) IOR Weak Scaling (SEQ Kernel / Ind. BW); (b) IOR Weak Scaling (RND Kernel / Ind. BW). Axes: Throughput (MB/s) vs. Num. Processes. Series: Memory, uMMAP-IO (BBF), uMMAP-IO (Lustre), MMAP-IO (Lustre).]

Figure 6.5: IOR evaluation on a single node of Cori using conventional memory allocations, uMMAP-IO (BBF + Lustre), and MMAP-IO (Lustre). The tests set an I/O transfer block of 256KB, with 1GB alloc. per process using 256KB segments.

[Figure 6.6, panels (a) PRK Stencil (4K×4K block / 100 Iter.) and (b) PRK Transpose (4K×4K block / 100 Iter.): execution time (s) against the number of processes for Memory, uMMAP-IO (Lustre), and MMAP-IO (Lustre), with a 600-second time limit.]

Figure 6.6: PRK-MPI evaluation with up to 256 nodes of Beskow using memory allocations, uMMAP-IO, and MMAP-IO. The tests fix the workload per process to 16M elements (≈128MB) on (a,b), and 32M elements (≈256MB) on (c,d).

Figures 6.5 (a,b) show the performance of the sequential (SEQ) and pseudo-random (RND) kernels of IOR on a single node of Cori using conventional memory allocations, uMMAP-IO, and MMAP-IO. The tests set an I/O transfer block of 256KB, with an allocation size of 1GB per process and a segment size of 256KB for uMMAP-IO. The figures demonstrate that, while the throughput of uMMAP-IO over Lustre is 2.49× higher on average compared to MMAP-IO, targeting the burst buffer can provide a boost in performance. In this case, the difference is 21.95% on average compared to plain memory allocations, with an exceptional 23.69× higher throughput in comparison with MMAP-IO. This demonstrates, once again, the importance of supporting local storage technologies on HPC.

6.3.2 Multi-node Representative Workloads

Extending the evaluations provided in Subsection 5.3.2, we use the Parallel Research Kernels (PRK) [149] to understand the implications of uMMAP-IO under representative workloads running on large process counts. We evaluate the Stencil and Transpose kernels over 100 iterations, with a time limit of 600 seconds and a fixed grid size per process of 4096×4096 (weak scaling). Compared to the previous evaluations, we enforce storage synchronizations every 25 iterations to mimic checkpoints. Lastly, we use individual files per process and set a 1MB segment size for uMMAP-IO to match the stripe size of Lustre.

As expected, the results using the PRK kernels indicate very poor scalability with MMAP-IO, in comparison with uMMAP-IO and conventional memory allocations. Figures 6.6 (a,b) depict the execution time spent for each kernel on Beskow using plain memory allocations, uMMAP-IO, and MMAP-IO. The tests utilize from 8 nodes of the cluster (256 processes) up to 256 nodes (8192 processes). From this figure, we observe that the execution time limit is exceeded using MMAP-IO in all cases5. uMMAP-IO provides good scalability up to 2048 processes, and features very positive results as well for up to 8192 processes. On average, the performance difference is 50.65% in the latter kernel compared to plain memory allocations. This is despite the fact that approximately 8TB of data were written on each evaluation.
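As an illustration of the checkpoint cadence, the sketch below uses the plain memory-mapped I/O of the OS (mmap / msync); in the actual experiments the mapping is provided by uMMAP-IO instead, and the file name, grid size, and iteration counts are only examples matching the description above.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define N          4096            /* 4K x 4K grid per process (~128MB) */
#define ITERS      100
#define SYNC_EVERY 25              /* mimic a checkpoint                */

int main(void)
{
    size_t bytes = (size_t)N * N * sizeof(double);

    int fd = open("stencil_rank0.dat", O_CREAT | O_RDWR, 0644);
    ftruncate(fd, bytes);

    double *grid = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    for (int it = 1; it <= ITERS; it++) {
        /* ... stencil / transpose update of grid[] goes here ... */

        if (it % SYNC_EVERY == 0)
            msync(grid, bytes, MS_SYNC);   /* flush dirty pages to storage */
    }

    munmap(grid, bytes);
    close(fd);
    return 0;
}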

6.4 Concluding Remarks

The limited support for heterogeneous clusters can lead to drastic limitations on scientific applications and even MPI implementations. This chapter has presented uMMAP-IO, a user-level memory-mapped I/O implementation that simplifies data management on multi-tier storage subsystems. Our approach features relatively good scalability in comparison with using only memory allocations without storage support, and provides clear advantages over the memory-mapped I/O of the OS in terms of performance and flexibility. Thus, it immediately benefits our contributions from the previous chapters.

However, the results also suggest that better support for PFS inside uMMAP-IO is required. This is particularly due to the use of individual I/O operations in our current implementation. We consider that integrating the collective I/O operations of MPI-IO could effectively improve the throughput in such cases [48]. As discussed in the next chapter, we plan to extend uMMAP-IO with user-configurable I/O drivers, allowing applications to switch between MPI-IO, POSIX-IO, or even NVDIMM-based drivers, depending on the target storage tier. This could provide applications with tremendous opportunities.

More information about uMMAP-IO, including extensive performance evaluations (e.g., page-fault latency), can be found in [112].
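As a hint of what such a collective driver could look like, the following is a minimal sketch using only standard MPI-IO calls, where every process flushes a segment with a single MPI_File_write_at_all(); the file name and segment size are illustrative, and this is not the actual uMMAP-IO code.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset seg_bytes = 1 << 20;            /* 1MB segment */
    char *segment = malloc(seg_bytes);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its segment at a disjoint offset; the collective
     * call lets the MPI-IO layer aggregate the requests (two-phase I/O). */
    MPI_Offset offset = (MPI_Offset)rank * seg_bytes;
    MPI_File_write_at_all(fh, offset, segment, (int)seg_bytes,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(segment);
    MPI_Finalize();
    return 0;
}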

5 We estimate the execution time of MMAP-IO based on the preliminary time per iteration, multiplied by the number of iterations.

Chapter 7

Future Work

The previous chapters have introduced the reader to the contributions that define the main research conducted throughout the Ph.D. studies. Even though the purpose of this thesis has been to explore and evaluate different approaches to ease the transition towards the integration of scientific applications on heterogeneous clusters, we have also identified four different topics that remain as potential candidates for future work. These candidates are listed below:

• Object Storage in MPI
• Data-placement Runtime
• Failed Images in CAF
• I/O Beyond Exascale

This chapter provides an overview of the aforementioned concepts, as well as briefly introduces a possible plan to address them in the future. Such work mostly leverages the topics described in this thesis, and extends their purpose to address additional requirements that consolidate our contribution to the field. Note that each section includes a short motivation and a subsection with the proposed solution.

7.1 Object Storage Support in MPI

Parallel File Systems on HPC have historically presented persistent storage as a collection of files organized in a directory hierarchy [46]. From the POSIX-IO file system interface [1] to MPI-IO [48,142], each programming interface has provided access to persistent storage following this representation. Despite the success of the “files + directories” abstraction, it has also been argued that large-scale services and parallel applications mostly generate what is often referred to as unstructured data, in which the importance resides in the content and not in its location [42]. Furthermore, the hierarchical storage model presents several limitations that directly affect the performance and scalability of HPC clusters [84]. For instance, accessing a file stored as a single piece of information inside a specific directory requires, in the worst-case scenario, a traversal of the hierarchy in order to retrieve its metadata (e.g., access permissions) and content, without accounting for PFS locking.


[Figure 7.1 diagram: uMMAP-IO manages memory mappings, each of which integrates a Policy (FIFO, pLRU, WIRO, ...), communicates with a Driver (MPI-IO, Clovis, ...), and contains a Target (NVRAM, Lustre, Mero, ...).]

Figure 7.1: By generalizing the target mapping of uMMAP-IO, we can provide different drivers that enable access to multiple storage technologies (e.g., object storage of Mero through the Clovis API [84]).

To overcome these limitations, some of the most popular parallel file systems, such as Lustre, use an object storage abstraction underneath [9,85]. Here, files are mapped onto a set of objects distributed across storage servers (e.g., OST servers). These objects can each be thought of as an independent linear array of bytes with a global identifier and metadata, and no structural location. The main benefits are near-optimal scalability, faster data retrievals due to the flat address space, and a greater utilization of the resources [87]. In fact, these benefits have motivated the appearance of exascale-grade object storage systems, specifically designed to allow applications to access objects directly for better performance, thus hiding the latency introduced by the “file ↔ object” conversion in a traditional PFS. The SAGE / Sage2 project, part of the EU-H2020 program, is leading the effort in providing such an object storage system through Mero and its Clovis API (Section 2.2) [84]. It is one of the first projects that aim at providing multi-tier storage subsystem support, allowing access to objects located both intra-node (e.g., NVRAM) and inter-node (e.g., archival storage). Moreover, because object storage is driven by metadata, the opportunity for analysis enables an understanding of the access frequency and other parameters, which can be useful for automatic data migration purposes.
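One possible way to express the Policy / Driver / Target split of Figure 7.1 is sketched below. All names are hypothetical and do not correspond to the real uMMAP-IO interface; the sketch only illustrates how a mapping could dispatch segment transfers through a single driver indirection.

#include <stddef.h>
#include <stdint.h>

typedef struct io_driver {
    const char *name;   /* "posix", "mpi-io", "clovis", ... */
    int (*read_segment) (void *target, void *buf, size_t len, uint64_t offset);
    int (*write_segment)(void *target, const void *buf, size_t len, uint64_t offset);
} io_driver_t;

typedef struct mapping {
    void              *base;       /* start of the virtual address range     */
    size_t             seg_size;   /* user-configurable segment granularity  */
    const io_driver_t *driver;     /* how to move a segment to/from storage  */
    void              *target;     /* file descriptor, object id (OID), ...  */
    /* an eviction policy (FIFO, pLRU, ...) would also hang off this struct  */
} mapping_t;

/* Called by the page-fault / synchronization path: flush one dirty segment. */
static int flush_segment(const mapping_t *m, size_t seg_index)
{
    uint64_t offset = (uint64_t)seg_index * m->seg_size;
    const char *src = (const char *)m->base + offset;
    return m->driver->write_segment(m->target, src, m->seg_size, offset);
}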

Proposed Solution for Future Work

As object storage systems become more frequent on HPC clusters, support for object storage in traditional programming interfaces like MPI is required. We plan to integrate uMMAP-IO (Chapter 6) [112] inside MPI storage windows (Chapter 3) [114] to enable finer-grained control and better performance when accessing storage compared to the traditional memory-mapped I/O mechanism of the OS [8]. With this integration, we then expect to extend uMMAP-IO to generalize its mapping interface and support multiple storage technologies, as illustrated in Figure 7.1. Here, each memory mapping would be associated with a target and an I/O driver. The custom page-fault mechanism of uMMAP-IO allows us to intercept memory accesses and control the content / permission of the segments available in main memory. The I/O driver, on the other hand, provides us with the specific procedures to transfer data from / to storage in user-configurable segments. While the default I/O driver of uMMAP-IO is meant to conduct storage transfers using POSIX-IO1 (i.e., targeting files in storage), the Clovis API can be utilized through a dedicated I/O driver in order to target Mero objects instead. Furthermore, this integration allows us to accept a 128-bit Object Identifier (OID) to target other object storage systems in the future (e.g., Amazon S3 [93]). Applications would still use load/store operations to access the address space.

More importantly, including uMMAP-IO inside MPI provides us with the opportunity to create MPI object windows. This concept extends the building blocks of MPI storage windows to allow targeting object storage systems instead. The allocation / deallocation mechanism, including the operations allowed, would be almost identical to MPI memory windows and MPI storage windows. In other words, the only requirement is to configure the allocation through the MPI Info object and the associated performance hints. The outcome from this work might open the opportunity for discussions in MPI-4 [82].
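A rough sketch of how such an MPI object window allocation might look is given below. Only the MPI calls are standard; the hint keys and the OID value are invented for illustration and are not defined by MPI or by the current MPI storage windows implementation.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ummap_target", "mero");                 /* assumed hint key   */
    MPI_Info_set(info, "ummap_oid", "0x780000000000000b:0x1");  /* assumed 128-bit OID */

    void   *base;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(1 << 26), 1, info,
                     MPI_COMM_WORLD, &base, &win);

    /* The window would then be used exactly like a memory or storage window:
     * load/store on 'base' locally, MPI_Put / MPI_Get remotely.             */

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}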

7.2 Automatic Data-placement of Memory Objects

Next-generation supercomputers are likely to feature a deeper and heterogeneous memory hierarchy, where memory technologies with different characteristics work next to each other to mitigate both access latency and bandwidth constraints [101]. For example, the Intel Knights Landing processor, integrated in the Cori supercomputer (NERSC) [4], features 16GB of 3D-stacked Multi-Channel DRAM (MCDRAM) alongside 96GB of conventional DDR4 DRAM to support memory-intensive applications [135]. This smaller MCDRAM cache features a peak bandwidth of 460GB/s, 4× faster than its DRAM counterpart. Similarly, and as motivated during the introduction of this thesis (Chapter 1), the Summit supercomputer (ORNL) also features 512GB of DRAM + HBM combined, and 1.6TB of persistent NVRAM [40,127,151]. This new trend is in contrast with traditional supercomputers that employ DRAM technology for main memory, besides caches and, most recently, accelerator on-board memory.

While increasing the diversity of memory and storage technologies provides certain benefits, selecting which memory / storage tier is ideal for a certain scientific application can become a relatively complex task. In fact, several studies have evaluated and demonstrated that memory access patterns directly and considerably impact application performance [97]. In such studies, the results show that applications with regular data accesses are often limited by memory bandwidth, achieving up to 3× better performance when using HBM compared to using solely DRAM. However, when applications feature mostly random memory accesses, the situation is completely different. In this case, performance is largely affected by memory latency. Thus, using HBM would not be beneficial, as its latency is higher than that of DRAM.

Proposed Solution for Future Work

To ease the deployment of scientific applications on such clusters without requiring major source code modifications, we also expect to generalize uMMAP-IO as a runtime for automatic data-placement of memory objects. This work is based on the target concept illustrated in Figure 7.1.

1As motivated in Chapter 6, when targeting a traditional PFS from an HPC cluster, we could integrate a dedicated MPI-IO driver that utilizes collective I/O operations to improve the overall performance, specially during write operations on large process counts [112]. 66 CHAPTER 7. FUTURE WORK

[Figure 7.2 diagram: the virtual address space (e.g., 0x1000000…0x8FFFFFF) is internally handled in segments that can be assigned, migrated, and synchronized across tiers, e.g., Tier 0 MCDRAM, Tier 1 DRAM (default), Tier 2 HBM (GPU), and Tier 3 NVMe SSD, covering both primary (byte-addressable) and secondary (explicit migration) targets.]

Figure 7.2: The memory abstraction mechanism of uMMAP-IO can be extended to provide a runtime for data-placement of memory objects. Applications would still interact with a seamless address space using load/store operations.

In this case, uMMAP-IO would be responsible for the allocation, deallocation, and migration of memory objects across the variety of tiers in the memory / storage hierarchy. Such migration is managed by advisable policies (i.e., similar to the out-of-core mechanism). The memory-mapping of uMMAP-IO is expected to logically divide a virtual address space into user-configurable segments, where each segment describes a different memory object. This representation is illustrated in Figure 7.2. In this case, however, the address space could be mapped onto a diverse range of memory and storage technologies, such as MCDRAM [101], HBM [89], and local / remote storage. Interacting with the address space is possible through load/store operations, as usual. The mapping is based on explicit preferences provided by the application, as well as the considerations given by novel services like the Heterogeneous Memory Management (HMM) of the kernel [145]. Compared to the extension proposed for MPI object windows, matching the aforementioned segments to a memory technology depends on the consideration of separate target types inside uMMAP-IO:

• Primary Targets. Represent the main tier of the applications, allowing direct byte-addressable access (e.g., main memory).

• Secondary Targets. Represent the secondary tier of the applications, where the data management is considered to be explicit (e.g., block devices).

This distinction allows us to effectively migrate data between each target type separately. For instance, secondary targets, such as local storage, can be exposed through primary targets. In addition, segment migration across primary targets can benefit application performance [97]. The migration between each tier would be coordinated through the different components of the runtime, alongside information deduced from the internal representation of each virtual segment based on decision-making strategies (e.g., the access pattern inferred from the application). The relation between the segments provides opportunities for automatic migration policies, particularly useful in tiers where the capacity is limited, like MCDRAM. Different policies allow the runtime to coordinate the migration of the memory objects between each memory tier. Note that this work will further develop the collaborations across the different exascale HPC projects of the EU-H2020 program. In fact, it is one of the main goals established by the dissemination strategy of the EPiGRAM-HS and Sage2 projects.
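The target-type distinction described in this section could be captured, for instance, with a per-segment descriptor such as the hypothetical sketch below; the enum values and the toy migration rule are illustrative assumptions, not an existing API.

#include <stdbool.h>
#include <stddef.h>

typedef enum {          /* primary targets: byte-addressable tiers           */
    TIER_MCDRAM,
    TIER_DRAM,
    TIER_HBM_GPU,
    /* secondary targets: explicit, block-based data management              */
    TIER_NVME_SSD,
    TIER_LUSTRE
} tier_t;

typedef struct segment {
    size_t   index;          /* position inside the virtual address space    */
    tier_t   tier;           /* where the segment currently resides          */
    bool     dirty;          /* must be written back before eviction         */
    unsigned hot_accesses;   /* inferred access frequency (decision input)   */
} segment_t;

static bool is_primary(tier_t t)
{
    return t == TIER_MCDRAM || t == TIER_DRAM || t == TIER_HBM_GPU;
}

/* Toy migration policy: promote frequently touched segments to the fast
 * (but capacity-limited) tier, demote cold ones towards the capacity tiers. */
static tier_t next_tier(const segment_t *s)
{
    if (s->hot_accesses > 64 && is_primary(s->tier))
        return TIER_MCDRAM;           /* promote to the fast tier            */
    if (s->hot_accesses == 0)
        return TIER_NVME_SSD;         /* demote; flush first if dirty        */
    return s->tier;
}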

7.3 Failed Images in Coarray Fortran

The general-purpose technique usually adopted by HPC applications to guarantee resilience relies on checkpointing and rollback recovery. When a failure is detected, the fault-tolerant protocol uses past checkpoints to recover the application into a consistent state [103, 134]. Two main approaches for checkpointing exist: coordinated and uncoordinated checkpointing. In the former, all the processes ensure that their checkpoints are globally consistent. When a failure is detected, the application stops completely and the execution gets restarted from the last checkpoint available. In uncoordinated checkpointing, however, each process checkpoints its own data independently. When a failure is detected, a replacing process recovers the data previously saved by the failed process and restarts the execution. The main benefit of this method is that it does not require global consistency (i.e., processes only roll back to their own checkpoints). It also considerably reduces both the communication and I/O required compared with the global checkpoint approach.

In this context, it has been observed that PGAS languages like Coarray Fortran are well suited for such recovery protocols2 [39]. Using CAF, every image could expose remotely accessible memory for backing up the data belonging to another image, as in a shared-memory environment. This technique, commonly referred to as neighbour-based checkpointing, has also been integrated in certain MPI implementations [20]. Nonetheless, the coarray definition included in the Fortran 2008 standard only defines a simple syntax for accessing data on remote images, synchronization statements, and collective allocation / deallocation mechanisms [88, 107]. Although these features allow writing functional coarray programs, they do not allow expressing more complex and useful mechanisms for failure management. As a matter of fact, the Fortran 2018 standard has recently introduced several extensions to mitigate this issue [106]. Specifically, the concept of failed images provides applications with a mechanism that enables the identification of which images have failed during the execution of a program.

Proposed Solution for Future Work

We observe that the concept of failed images exclusively provides the capability to detect which of the available images of an application has failed. The application is still in charge of implementing a mechanism to store the information of each image, as well as a recovery policy. In other words, the Fortran standard does not define how to store and recover the state of a given image. Using the persistent coarrays concept presented in Chapter 5 [111], we plan to explore how to consolidate the integration of failed images into existing CAF-based applications. Persistent coarrays can seamlessly map each coarray to storage without requiring any major source code modifications. This would allow users to have full or even incremental checkpoints per image, and to recover the execution at runtime. In combination with the failed images interface (i.e., evaluating the image status and other functionality), we consider it necessary to analyze two different alternatives for recovery in Coarray Fortran:

1. Seamless Recovery. After each synchronization in CAF, the transport layer implementation could evaluate whether or not any of the neighbour images has failed. If so, the implementation could spawn new processes and remap their persistent coarrays into the original failed images3.

2 We demonstrated in Chapter 4 that data analytics frameworks that utilize a decoupled strategy using the MPI one-sided communication model and enabling MPI storage windows are also ideal candidates [116].

[Figure 7.3 diagram: two CAF images allocate coarray x (and a temporary x_tmp) as MPI storage windows through MPI_Win_allocate / MPI_Win_sync in the transport layer (LIBCAF_MPI), each backed by an independent file (OpenCoarrays_1_0.win, OpenCoarrays_2_0.win); when an image fails, its file is re-mapped by a surviving image.]

Figure 7.3: Introducing recovery mechanisms inside Coarray Fortran would allow images to seamlessly remap the coarrays of failed images [39]. The example illustrates a possible recovery mechanism, where a temporary version of x is exposed.

2. Explicit Recovery. Applications could manually evaluate if a neighbour image (e.g., based on its identifier) has failed, in which case they could request a remap of its persistent coarrays. The remapping can be set at a different offset inside the same persistent coarrays, or within implicit versions exposed to the application (e.g., coarray x from a neighbour is available at x_tmp).

We plan to explore these and other alternatives (e.g., workload replication [17]), including the potential definition of an API for failure management that complements the original specification of failed images.

3 Note that dynamic process spawning might require a post-allocation of additional compute nodes, which can be rejected by the resource manager of the cluster [22].

7.4 I/O Beyond Exascale

The integration of different kinds of compute devices on HPC, such as GPUs, FPGAs, or even specialized co-processors (see Chapter 1), inherently implies an increased level of programming complexity. Each of these device types contains caches and fast memory technologies that require moving data explicitly from traditional DRAM. Due to the type of operations mostly conducted in scientific applications, however, a recent trend is emerging that proposes to execute certain computations inside the memory components (i.e., instead of moving the data to the CPU and back). This paradigm is commonly known as Processing-In-Memory (PIM) [3, 52]. Its major implication is the possibility of using certain instructions to run arithmetic or logical operations inside the memory slots, offloading the workload from the CPU to the DRAM.

Studies have demonstrated a reduction of almost 90% in the TDP using PIM, while still providing significant speedups in comparison [156]. Furthermore, this type of instruction might show its true potential on persistent memories like NVRAM. As a matter of fact, the SAGE / Sage2 project has proposed the use of function shipping to offload computations to the storage servers, allowing certain operations to be performed directly where the data is stored [84].

Proposed Solution for Future Work

Even though there is no definitive proposal of work to address this challenge yet, we observe that the contributions of this thesis and the expected future work offer the potential to become a critical part of the solution. More specifically, the MPI one-sided communication model could be extended to execute in-memory local or remote operations, as well as in-storage operations. The assert parameter of MPI_Win_lock4 could be utilized to specify the type of operation requested during a new computational epoch. The MPI standard already defines a set of operations for collective reductions5. For instance, specifying MPI_SUM when locking the target window would imply that an MPI_Put operation performs the specified computation on the target address space (i.e., merging the source data into the remote window), while an MPI_Get would do the opposite (i.e., merging the target data into the source buffer). The same observation is valid for the rest of the operations. This contribution could provide a novel use of the MPI one-sided communication model that complements the MPI storage windows presented in this thesis and the MPI object windows outlined for future work. Note that we also expect to evaluate how the PIM mechanism could benefit applications that offload certain computations directly inside storage devices (e.g., through specialized simulations [57]).
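Since the extension above is a proposal rather than standard MPI, the sketch below only shows the closest construct that exists today: a passive-target epoch in which MPI_Accumulate merges local data into the target window with MPI_SUM. In the proposed scheme, the operation would instead be attached to the entire epoch through the assert argument of MPI_Win_lock and executed in memory or in storage.

#include <mpi.h>

void merge_into_remote(double *local, int count, int target_rank, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0 /* assert */, win);

    /* Element-wise sum of 'local' into the remote window at displacement 0. */
    MPI_Accumulate(local, count, MPI_DOUBLE,
                   target_rank, 0 /* target_disp */, count, MPI_DOUBLE,
                   MPI_SUM, win);

    MPI_Win_unlock(target_rank, win);
}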

4 https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node282.htm
5 https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node112.htm

Chapter 8

Conclusions

Throughout the different chapters of this Ph.D. thesis, we have provided a comprehensive overview of the state of the art of I/O in High-Performance Computing and presented how we envision the transition towards heterogeneous supercomputers at exascale. This has included not only technical explanations, but also relevant experimental results and other discussions that might be of interest to HPC applications. In this last chapter, we consider it necessary to provide an overview of our contributions to the field, especially in regards to the scientific questions raised in the motivation of the thesis. Thereafter, we conclude the document with some final observations.

Addressing the Scientific Questions

We presented three major open scientific questions in Section 1.1, which define the main goals of the Doctoral studies. The significance of each scientific question was based on some of the arguments provided in the introductory chapter. We will now briefly emphasize how we have addressed the different scientific questions.

[Q1] How can programming models and interfaces support heterogeneous I/O? Solution: One of the main challenges in exascale supercomputers resides in the increased level of heterogeneity that compute nodes will feature. We have addressed this challenge mainly by presenting a novel contribution to the MPI specification, the de-facto standard on HPC. MPI storage windows allow accessing both memory and storage through a familiar, unified programming interface without the need for separate I/O interfaces. Furthermore, we have also presented how this contribution can benefit data analytics programming models (e.g., MapReduce), especially in fault-tolerance and out-of-core use-cases, as well as provided resiliency support in Coarray Fortran with seamless persistency of coarrays. The latter contribution has major implications for the recent Fortran 2018 specification of failed images. At a system level, the aforementioned contributions will benefit from the memory abstraction mechanism provided by uMMAP-IO, thus opening the opportunity for high-performance I/O over local and traditional PFS-based storage.

[Q2] How can the integration of emerging storage technologies be simplified? Solution: The variety of interfaces to access memory and storage has grown considerably

over the past two decades on HPC, increasing the programming complexity of scientific applications. We have addressed this challenge first by providing a unified I/O interface in MPI (with MPI storage windows) and in Fortran (with Persistent Coarrays). The inclusion of such developments allows existing applications to easily conduct I/O over heterogeneous clusters, while integrating novel features such as transparent resiliency. Data analytics frameworks also expose out-of-core execution to users without any significant source code changes. Moreover, the memory abstraction mechanism of uMMAP-IO enables seamless access to memory and storage using only load/store operations, hence allowing memory (e.g., MCDRAM) or storage (e.g., NVRAM) to be targeted indistinguishably. More importantly, all the proposed solutions incur minimal source code changes on existing applications, easing the integration of these features at exascale.

[Q3] How can HPC applications seamlessly improve the I/O performance? Solution: While scientific applications increasingly demand high-performance I/O, the bandwidth and access latency of the I/O subsystem are also expected to remain roughly constant compared to current petascale supercomputers, making storage a serious bottleneck. We have addressed this challenge by introducing selective synchronization mechanisms in MPI through MPI storage windows to reduce the number of I/O transactions. Such a contribution is implicitly part of persistent coarrays in Fortran. Furthermore, the use of MPI one-sided communication in data analytics frameworks opens the opportunity for a decoupled strategy that reduces the constraints on highly-concurrent workloads, while seamlessly improving the checkpoint efficiency compared to existing solutions based on MPI-IO. Finally, the memory abstraction of uMMAP-IO provides several orders of magnitude better performance compared to the memory-mapped I/O mechanism of the OS on HPC applications, while allowing access to novel storage technologies (e.g., the Cray DataWarp burst buffer) through a friendly programming interface based on load/store operations. This latter approach benefits all of the other contributions.

Final Observations

In this thesis, we have presented several contributions that have the potential of providing value to thousands of scientific applications that utilize MPI and even Coarray Fortran. From the beginning, we set several goals that mainly focused on improving existing applications without incurring drastic source code changes or increasing their programming complexity. We believe that such goals have been successfully met with MPI storage windows [114], MapReduce-1S [116], Persistent Coarrays [111], and uMMAP-IO [112]. To conclude this thesis, we would like to outline the importance of the proposed future work on the topic (Chapter 7) in regards to the scientific questions. Contributions such as the automatic data-placement of memory objects, or the possibility of conducting I/O operations directly over storage devices through RDMA, consolidate our contributions to the field and provide further value to scientific applications.

Bibliography

[1] Standard for Information Technology–Portable Operating System Interface (POSIX(R)) Base Specifications, Issue 7. IEEE Std 1003.1, 2016 Edition (incor- porates IEEE Std 1003.1-2008, IEEE Std 1003.1-2008/Cor 1-2013, and IEEE Std 1003.1-2008/Cor 2-2016), pages 1–3957, Sep. 2016. [2] Faraz Ahmad, Seyong Lee, Mithuna Thottethodi, and TN Vijaykumar. PUMA: Purdue MapReduce Benchmarks Suite. 2012. [3] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News, 44(3):105–117, 2016. [4] Katie Antypas, Nicholas Wright, Nicholas P Cardo, Allison Andrews, and Matthew Cordery. Cori: A Cray XC Pre-exascale System for NERSC. Cray User Group Proceedings. Cray, 2014. [5] Brendan Barry, Cormac Brick, Fergal Connor, David Donohoe, David Moloney, Richard Richmond, Martin O’Riordan, and Vasile Toma. Always-on Vision Pro- cessing Unit for mobile applications. IEEE Micro, 35(2):56–66, 2015. [6] Simona Boboila, Youngjae Kim, Sudharshan S Vazhkudai, Peter Desnoyers, and Galen M Shipman. Active Flash: Out-of-core data analytics onflash storage. In 28th IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1–12. IEEE, 2012. [7] Katherine Bourzac. Has Intel created a universal memory technology? IEEE Spec- trum, 54(5):9–10, 2017. [8] Daniel P Bovet and Marco Cesati. Understanding the Linux Kernel: from I/O ports to process management. O’Reilly, 2005. [9] Peter Braam. The Lustre storage architecture. arXiv preprint arXiv:1903.01955, 2019. [10] Keeran Brabazon, Oliver Perks, Sergio Rivas-Gomez, Stefano Markidis, and Ivy Bo Peng. Tools Based Application Characterization. in: sagestorage.eu, http://www.sagestorage.eu/sites/default/files/D1.3_v1.3.pdf, 2017. [On- Line]. [11] Richard Bradshaw and Carl Schroeder. Fifty years of IBM innovation with infor- mation storage on magnetic tape. IBM Journal of Research and Development, 47 (4):373–383, 2003.


[12] Henrik Brink, Joseph W Richards, Dovi Poznanski, Joshua S Bloom, John Rice, Sahand Negahban, and Martin Wainwright. Using machine learning for discovery in synoptic survey imaging data. Monthly Notices of the Royal Astronomical Society, 435(2):1047–1060, 2013. [13] Christopher Cantalupo, Vishwanath Venkatesan, Jeff Hammond, Krzysztof Czurlyo, and Simon David Hammond. memkind: An Extensible Heap Memory Manager for Heterogeneous Memory Platforms and Mixed Memory Policies. Technical report, Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015. [14] Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward Exascale Resilience. The International Journal of High Performance Computing Applications, 23(4):374–388, 2009. [15] Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, and Robert Ross. Understanding and improving computational science storage access through continuous characterization. ACM Transactions on Storage (TOS), 7(3):8, 2011. [16] Philip Carns, Robert Latham, Robert Ross, Kamil Iskra, Samuel Lang, and Kather- ine Riley. 24/7 characterization of petascale I/O workloads. In Cluster Computing and Workshops, 2009. CLUSTER’09. IEEE International Conference on, pages 1– 10. IEEE, 2009. [17] Henri Casanova, Frédéric Vivien, and Dounia Zaidouni. Using replication for re- silience on exascale systems. In Fault-Tolerance Techniques for High-Performance Computing, pages 229–278. Springer, 2015. [18] Adrian M Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel, Jiahua He, Arun Jagatheesan, Rajesh K Gupta, Allan Snavely, and Steven Swanson. Un- derstanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing. In Proceedings of the 2010 ACM/IEEE International Con- ference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010. [19] Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. Introducing openshmem: Shmem for the pgas community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, page 2. ACM, 2010. [20] Zizhong Chen, Graham E Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra. Building fault survivable MPI programs with FT-MPI using diskless checkpointing. In Proceedings for ACM SIGPLAN Sympo- sium on Principles and Practice of Parallel Programming, pages 213–223, 2005. [21] Cheng-Tao Chu, Sang K Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Kunle Olukotun, and Andrew Y Ng. Map-Reduce for machine learning on multicore. In Advances in neural information processing systems, pages 281–288, 2007. [22] Carsten Clauss, Thomas Moschny, and Norbert Eicker. Dynamic process man- agement with allocation-internal co-scheduling towards interactive supercomputing. In Proceedings of the 1st COSH Workshop on Co-Scheduling of HPC Applications, page 13, 2016. BIBLIOGRAPHY 75

[23] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng An- drew. Deep learning with COTS HPC systems. In Proceedings of the 30th Interna- tional Conference on Machine Learning, pages 1337–1345, 2013. [24] Richard Cole. Parallel merge sort. SIAM Journal on Computing, 17(4):770–785, 1988. [25] Computational Information Systems Laboratory. Casper: Heteroge- neous cluster for data analysis and visualization. in: UCAR.edu, https://www2.cisl.ucar.edu/resources/computational-systems/casper, 2019. [On-Line]. [26] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014. [27] Michael Cox and David Ellsworth. Application-controlled demand paging for out- of-core visualization. In Proceedings of the 8th conference on Visualization’97, pages 235–ff. IEEE Computer Society Press, 1997. [28] Leonardo Dagum and Ramesh Menon. OpenMP: An industry-standard API for shared-memory programming. Computing in Science & Engineering, (1), 1998. [29] Andrew Davidson, David Tarjan, Michael Garland, and John D Owens. Efficient parallel merge sort forfixed and variable length keys. In Innovative Parallel Com- puting (InPar), 2012, pages 1–9. IEEE, 2012. [30] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. [31] Michel O Deville, Paul F Fischer, Paul F Fischer, EH Mund, et al. High-order Methods for Incompressible Fluid Flow, volume 9. Cambridge University Press, 2002. [32] Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean- Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braun- schweig, et al. The International Exascale Software Project Roadmap. The Inter- national Journal of High-Performance Computing Applications, 25(1), 2011. [33] Jack Dongarra, Dennis Gannon, Geoffrey Fox, and Ken Kennedy. The impact of multicore on computational science software. CTWatch Quarterly, 3(1):1–10, 2007. [34] Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox. MapReduce for data inten- sive scientific analyses. In eScience, 2008. eScience’08. IEEE Fourth International Conference on, pages 277–284. IEEE, 2008. [35] Jim Elliott and ES Jung. Ushering in the 3D memory era with V-NAND. In Proceedings of the Flash Memory Summit 2013, 2013. [36] R Evans, J Jumper, J Kirkpatrick, L Sifre, TFG Green, C Qin, A Zidek, A Nelson, A Bridgland, H Penedones, et al. De novo structure prediction with deep-learning based scoring. Annu Rev Biochem, 77:363–382. [37] Graham E Fagg and Jack J Dongarra. FT-MPI: Fault tolerant MPI, supporting dy- namic applications in a dynamic world. In European Parallel Virtual Machine/Mes- sage Passing Interface Users’ Group Meeting, pages 346–353. Springer, 2000. 76 BIBLIOGRAPHY

[38] Alessandro Fanfarillo, Tobias Burnus, Valeria Cardellini, Salvatore Filippone, Dan Nagle, and Damian Rouson. OpenCoarrays: Open-source Transport Layers support- ing Coarray Fortran compilers. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, page 4. ACM, 2014. [39] Alessandro Fanfarillo, Sudip Kumar Garain, Dinshaw Balsara, and Daniel Nagle. Resilient Computational Applications using Coarray Fortran. Parallel Computing, 81:58–67, 2019. [40] Michael Feldman. Oak Ridge readies Summit supercomputer for 2018 debut. in: Top500.org, https://www.top500.org/news/ oak-ridge-readies-summit-supercomputer-for-2018-debut, 2017. [On-Line]. [41] Erin Filliater. InfiniBand Technology and Usage Update. In Mellanox Technologies, Storage Developer Conference (SDC), SNIA, Santa Clara, CA, 2012. [42] Mike Folk, Albert Cheng, and Kim Yates. HDF5: Afile format and I/O library for high-performance computing applications. In Proceedings of supercomputing, volume 99, pages 5–33, 1999. [43] Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. An overview of the HDF5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, pages 36–47. ACM, 2011. [44] Tao Gao, Yanfei Guo, Boyu Zhang, Pietro Cicotti, Yutong Lu, Pavan Balaji, and Michela Taufer. Mimir: Memory-efficient and scalable MapReduce for large super- computing systems. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, pages 1098–1108. IEEE, 2017. [45] Robert Gerstenberger, Maciej Besta, and Torsten Hoefler. Enabling highly-scalable remote memory access programming with MPI-3 one sided. Scientific Programming, 22(2):75–91, 2014. [46] David Goodell, Seong Jo Kim, Robert Latham, Mahmut Kandemir, and Robert Ross. An evolutionary path to Object Storage Access. In 2012 SC Compan- ion: High Performance Computing, Networking Storage and Analysis, pages 36–41. IEEE, 2012. [47] William Gropp, William D Gropp, Argonne Distinguished Fellow Emeritus Ewing Lusk, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable parallel program- ming with the Message-Passing Interface. MIT Press, 3rd edition, 2014. [48] William Gropp, Torsten Hoefler, Rajeev Thakur, and Ewing Lusk. Using advanced MPI: Modern features of the Message-Passing Interface. MIT Press, 1st edition, 2014. [49] William Gropp, Torsten Hoefler, Rajeev Thakur, and Ewing Lusk. MPI + MPI: Using MPI-3 Shared Memory as a Multicore Programming System. in: Rice.edu, https://www.caam.rice.edu/~mk51/presentations/SIAMPP2016_4.pdf, 2016. [On-Line]. [50] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high- performance, portable implementation of the MPI message passing interface stan- dard. Parallel computing, 22(6):789–828, 1996. BIBLIOGRAPHY 77

[51] William D Gropp, Dries Kimpe, Robert Ross, Rajeev Thakur, and Jesper Larsson Träff. Self-consistent MPI-IO performance requirements and expectations. In Eu- ropean Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, pages 167–176. Springer, 2008. [52] Qi Guo, Nikolaos Alachiotis, Berkin Akin, Fazle Sadi, Guanglin Xu, Tze Meng Low, Larry Pileggi, James C Hoe, and Franz Franchetti. 3D-stacked memory-side acceleration: Accelerator and system design. In In the Workshop on Near-Data Processing (WoNDP)(Held in conjunction with MICRO-47), 2014. [53] Yanfei Guo, Wesley Bland, Pavan Balaji, and Xiaobo Zhou. Fault tolerant MapReduce-MPI for HPC clusters. In Proceedings of the International Confer- ence for High Performance Computing, Networking, Storage and Analysis, page 34. ACM, 2015. [54] Salman Habib, Vitali Morozov, Nicholas Frontiere, Hal Finkel, Adrian Pope, Katrin Heitmann, Kalyan Kumaran, Venkatram Vishwanath, Tom Peterka, Joe Insley, et al. HACC: Extreme scaling and performance across diverse architectures. Commu- nications of the ACM, 60(1):97–104, 2016. [55] Jim Handy. Understanding the Intel/Micron 3D XPoint memory. Proc. SDC, 2015. [56] Darren Hart. A futex overview and update. in: LWN.net, http://lwn.net/ Articles/360699, 2009. [On-Line]. [57] Wim Heirman, Trevor Carlson, and Lieven Eeckhout. Sniper: Scalable and accu- rate parallel multi-core simulation. In 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Sys- tems (ACACES-2012), pages 91–94. High Performance and Embedded Architecture and Compilation (HiPEAC), 2012. [58] John L Hennessy and David A Patterson. Computer Architecture: A quantitative approach. Elsevier, 2011. [59] David Henty. A Parallel Benchmark Suite for Fortran Coarrays. In Parallel Com- puting, pages 281–288. Elsevier, 2011. [60] Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, and Rajeev Thakur. MPI+MPI: a new hybrid approach to parallel programming with MPI plus shared memory. Computing, 95(12):1121–1136, 2013. [61] Torsten Hoefler, Andrew Lumsdaine, and Jack Dongarra. Towards efficient MapRe- duce using MPI. In European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, pages 240–249. Springer, 2009. [62] Intel Corporation. Intel Memory Drive Technology Set Up and Configuration Guide. in: Intel.com, https://www.intel.com/content/dam/support/us/en/documents/ memory-and-storage/intel-mdt-setup-guide.pdf, 2019. [On-Line]. [63] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 448–456, 2015. 78 BIBLIOGRAPHY

[64] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R Dulloor, et al. Ba- sic Performance Measurements of the Intel Optane DC Persistent Memory Module. arXiv preprint arXiv:1903.05714, 2019.

[65] Byunghyun Jang, Dana Schaa, Perhaad Mistry, and David Kaeli. Exploiting mem- ory access patterns to improve memory performance in data-parallel architectures. IEEE Transactions on Parallel and Distributed Systems, 22(1):105–118, 2011.

[66] Edward Karrels and Ewing Lusk. Performance analysis of MPI programs. In Proceed- ings of the Workshop on Environments and Tools For Parallel Scientific Computing, pages 195–200, 1994.

[67] Norman Kirkby. Fortran-IO. in: Wikibooks.org, https://en.wikibooks.org/wiki/ Fortran/io, 2018. [On-Line].

[68] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K Bansal, William Constable, Oguz Elibol, Scott Gray, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J Pai, and Naveen Rao. Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 1740–1750, 2017.

[69] LLNL Documentation. MPI Performance Topics. in: LLNL.gov, https:// computing.llnl.gov/tutorials/mpi_performance, 2019. [On-Line].

[70] Xiaoyi Lu, Bing Wang, Li Zha, and Zhiwei Xu. Can MPI benefit Hadoop and MapReduce applications? In Parallel Processing Workshops (ICPPW), 2011 40th International Conference on, pages 371–379. IEEE, 2011.

[71] Yutong Lu, Depei Qian, Haohuan Fu, and Wenguang Chen. Will Supercomputers Be Super-Data and Super-AI Machines? Communications of the ACM, 61(11): 82–87, 2018.

[72] Stefano Markidis, Giovanni Lapenta, et al. Multi-scale simulations of plasma with iPIC3D. Mathematics and Computers in Simulation, 80(7):1509–1519, 2010.

[73] John D McCalpin. A survey of memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, 19:25, 1995.

[74] Simon McIntosh-Smith, James Price, Tom Deakin, and Andrei Poenaru. Compar- ative benchmarking of thefirst generation of HPC-optimised ARM processors on Isambard. Cray User Group, 5, 2018.

[75] Mellanox Technologies Ltd. Understanding MPI Tag Matching and Rendezvous Offloads. in: Mellanox.com, https://community.mellanox.com/s/article/ understanding-mpi-tag-matching-and-rendezvous-offloads--connectx-5-x, 2018. [On-Line].

[76] Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh. Evaluation of Intel Memory Drive Technology Performance for Scientific Applications. In Proceedings of the Workshop on Memory Centric HPC, pages 14–21. ACM, 2018.

[77] Jeonghoon Mo, Richard J La, Venkat Anantharam, and Jean Walrand. Analysis and comparison of TCP Reno and Vegas. In Proceedings of the 18th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’99), volume 3, pages 1556–1563. IEEE, 1999.

[78] David Moloney, Brendan Barry, Richard Richmond, Fergal Connor, Cormac Brick, and David Donohoe. Myriad 2: Eye of the computational vision storm. In 2014 IEEE Hot Chips 26 Symposium (HCS), pages 1–18. IEEE, 2014. [79] MPI Forum. MPI: A Message-Passing Interface Standard, volume 1.3. 2008. https://www.mpi-forum.org/docs/mpi-1.3/mpi-report-1.3-2008-05-30.pdf. [80] MPI Forum. MPI: A Message-Passing Interface Standard, volume 2.2. 2009. https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf. [81] MPI Forum. MPI: A Message-Passing Interface Standard, volume 3.1. 2015. http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf. [82] MPI Forum. MPI: A Message-Passing Interface Standard, volume 4.0. 2019. https://www.mpi-forum.org/mpi-40. [83] Mihir Nanavati, Malte Schwarzkopf, Jake Wires, and Andrew Warfield. Non-volatile storage. Communications of the ACM, 59(1):56–63, 2015.

[84] Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Stefano Markidis, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Dirk Pleiter, and Shaun De Witt. SAGE: Percipient Storage for Exascale Data-centric Computing. Parallel Computing, 2018.

[85] NERSC Documentation. Optimizing I/O performance on the Lus- trefile system. in: NERSC.gov, https://www.nersc.gov/users/ storage-and-file-systems/i-o-resources-for-scientific-applications/ optimizing-io-performance-for-lustre, 2018. [On-Line]. [86] NERSC Documentation. Scientific I/O in HDF5. in: NERSC.gov, https://www.nersc.gov/users/training/online-tutorials/ introduction-to-scientific-i-o/?show_all=1#toc-anchor-4, 2019. [On-Line].

[87] NetApp, Inc. What Is Object Storage? in: NetApp.com, https://www.netapp. com/us/info/what-is-object-storage.aspx, 2019. [On-Line]. [88] Robert W Numrich and John Reid. Co-array fortran for parallel programming. In ACM Sigplan Fortran Forum, volume 17, pages 1–31. ACM, 1998. [89] NVIDIA Corporation. NVIDIA Tesla V100 GPU Architecture. in: NVIDIA.com, http://images.nvidia.com/content/volta-architecture/pdf/ volta-architecture-whitepaper.pdf, 2017. [On-Line]. [90] NVIDIA Corporation. nvJPEG: GPU-accelerated JPEG decoder. in: NVIDIA.com, https://developer.nvidia.com/nvjpeg, 2019. [On-Line].

[91] Glen Overby. Cray DataWarp Overview. in: NERSC.gov, https://www.nersc.gov/assets/Uploads/dw-overview-overby.pdf, 2016. [On-Line].

[92] Andrey Ovsyannikov, Melissa Romanus, Brian Van Straalen, Gunther H Weber, and David Trebotich. Scientific workflows at DataWarp-speed: Accelerated data- intensive science using NERSC’s Burst Buffer. In 2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS), pages 1–6. IEEE, 2016. [93] Mayur R Palankar, Adriana Iamnitchi, Matei Ripeanu, and Simson Garfinkel. Ama- zon S3 for science grids: a viable solution? In Proceedings of the 2008 international workshop on Data-aware , pages 55–64. ACM, 2008. [94] PDC Center for High-Performance Computing. Beskow Supercomputer: One the fastest academic supercomputing systems in Scandinavia. in: KTH.se, https://www.pdc.kth.se/hpc-services/computing-systems/beskow-1.737436, 2019. [On-Line]. [95] PDC Center for High-Performance Computing. Tegner Supercom- puter: The pre-/post-processing system for Beskow. in: KTH.se, https://www.pdc.kth.se/hpc-services/computing-systems/tegner-1.737437, 2019. [On-Line]. [96] Steven Pelley, Peter M Chen, and Thomas F Wenisch. Memory persistency: Se- mantics for byte-addressable nonvolatile memory technologies. IEEE Micro, 35(3): 125–131, 2015. [97] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. Exploring the Performance Benefit of Hybrid Memory System on HPC Environments. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 683–692. IEEE, 2017. [98] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Erwin Laure, and Stefano Markidis. Preparing HPC Applications for the Exascale Era: A Decoupling Strategy. In Parallel Processing (ICPP), 2017 46th International Conference on, pages 1–10. IEEE, 2017. [99] Ivy Bo Peng, Stefano Markidis, Roberto Gioiosa, Gokcen Kestor, and Erwin Laure. MPI Streams for HPC Applications. New Frontiers in High Performance Computing and , 30:75, 2017. [100] Ivy Bo Peng, Stefano Markidis, Erwin Laure, Daniel Holmes, and Mark Bull. A data streaming model in MPI. In Proceedings of the 3rd Workshop on Exascale MPI, page 2. ACM, 2015. [101] Ivy Bo Peng, Stefano Markidis, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa. Exploring Application Performance on Emerging Hybrid-Memory Supercomputers. In High Performance Computing and Communications, pages 473–480. IEEE, 2016. [102] Steven J Plimpton and Karen D Devine. MapReduce in MPI for Large-scale Graph Algorithms. Parallel Computing, 37(9):610–632, 2011. [103] Suraj Prabhakaran, Marcel Neumann, and Felix Wolf. Efficient fault tolerance through dynamic node replacement. In Proceedings of the 18th IEEE/ACM Inter- national Symposium on Cluster, Cloud and Grid Computing, pages 163–172. IEEE Press, 2018. BIBLIOGRAPHY 81

[104] Daniel A Reed and Jack Dongarra. Exascale computing and big data. Communi- cations of the ACM, 58(7):56–68, 2015. [105] John Reid. The new features of Fortran 2008. In ACM SIGPLAN Fortran Forum, volume 27, pages 8–21. ACM, 2008. [106] John Reid. The new features of Fortran 2018. In ACM SIGPLAN Fortran Forum, volume 37, pages 5–44. ACM, 2018. [107] John Reid and Robert W Numrich. Co-arrays in the next Fortran Standard. Sci- entific Programming, 15(1):9–26, 2007. [108] David Reinsel and John Rydning. Breaking the 15K-rpm HDD Performance Barrier with Solid State Hybrid Drives. IDC, Framingham, MA, USA, White Paper, 244250, 2013. [109] Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. A locality-aware memory hierarchy for energy-efficient GPU architectures. In Proceedings of the 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 86–98. IEEE, 2013. [110] Rolf Riesen, Kurt Ferreira, Jon Stearley, Ron Oldfield, James H Laros III, Kevin Pedretti, Ron Brightwell, et al. Redundant computing for exascale systems. Sandia National Laboratories, 2010. [111] Sergio Rivas-Gomez, Alessandro Fanfarillo, Sai Narasimhamurthy, and Stefano Markidis. Persistent Coarrays: Integrating MPI Storage Windows in Coarray For- tran. In Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI’19, pages 1–8. ACM, 2019. [112] Sergio Rivas-Gomez, Alessandro Fanfarillo, Sebastién Valat, Christophe Laferriere, Philippe Couvee, Sai Narasimhamurthy, and Stefano Markidis. uMMAP-IO: User- level Memory-mapped I/O for HPC. In Proceedings of the 26th IEEE Interna- tional Conference on High-Performance Computing, Data, and Analytics (HiPC’19). IEEE, 2019. [113] Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, and Stefano Markidis. MPI Windows on Stor- age for HPC applications. In Proceedings of the 24th European MPI Users’ Group Meeting, EuroMPI/USA’17, page 15. ACM, 2017. [114] Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, and Stefano Markidis. MPI Windows on Storage for HPC Applications. Parallel Computing, 77:38–56, 2018. [115] Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng, and Giuseppe Con- giu. State of the art of I/O with MPI and gap analysis. in: sagestorage.eu, http://www.sagestorage.eu/sites/default/files/SAGE-D4-v1. 1_Public_Updated.pdf, 2017. [On-Line]. [116] Sergio Rivas-Gomez, Stefano Markidis, Erwin Laure, Keeran Brabazon, Oliver Perks, and Sai Narasimhamurthy. Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks. In 2018 IEEE 20th International Conference on High Performance Computing and Communications (HPCC’18 / AHPCN Symposium), pages 921–927. IEEE, 2018. 82 BIBLIOGRAPHY

[117] Sergio Rivas-Gomez, Stefano Markidis, Ivy Bo Peng, Erwin Laure, Gokcen Kestor, and Roberto Gioiosa. Extending Message Passing Interface Windows to Storage. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2017), pages 727–730. IEEE, 2017.

[118] Sergio Rivas-Gomez, Antonio J Pena, David Moloney, Erwin Laure, and Stefano Markidis. Exploring the Vision Processing Unit as Co-processor for Inference. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 589–598. IEEE, 2018.

[119] Sergio Rivas-Gomez, Ivy Bo Peng, Stefano Markidis, and Erwin Laure. MPI Con- cepts for High-Performance Parallel I/O in Exascale. In 13th HiPEAC Interna- tional Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES’17), pages 167–170. HiPEAC, 2017.

[120] Robert Ross, Robert Latham, William Gropp, Rajeev Thakur, and Brian Toonen. Implementing MPI-IO atomic mode withoutfile system support. In CCGrid 2005. IEEE International Symposium on Cluster Computing and the Grid, 2005., vol- ume 2, pages 1135–1142. IEEE, 2005.

[121] Andy Rudoff. Persistent memory programming. Login: The Usenix Magazine, 42: 34–40, 2017.

[122] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexan- der C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[123] Satish Kumar Sadasivam, Brian W Thompto, Ron Kalla, and William J Starke. IBM Power9 processor architecture. IEEE Micro, 37(2):40–51, 2017.

[124] SAGE Consortium. The SAGE project: Data storage for extreme scale. http:// www.sagestorage.eu/sites/default/files/Sage%20White%20Paper%20v1.0.pdf, 2016. [On-Line].

[125] Arthur Sainio. NVDIMM: Changes are here, so what’s next? In-Memory Computing Summit (IMC-Summit’16), 2016.

[126] Frank B Schmuck and Roger L Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In FAST, volume 2, 2002.

[127] David Schneider. US supercomputing strikes back. IEEE Spectrum, 55(1):52–53, 2018.

[128] Will J Schroeder, Bill Lorensen, and Ken Martin. The Visualization Toolkit: An Object-oriented approach to 3D graphics. Kitware, 2004.

[129] Hongzhang Shan, Katie Antypas, and John Shalf. Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, page 42. IEEE Press, 2008.

[130] Hongzhang Shan and John Shalf. Using IOR to analyze the I/O performance for HPC platforms. 2007.

[131] Jaewook Shin, Mary W Hall, Jacqueline Chame, Chun Chen, Paul F Fischer, and Paul D Hovland. Speeding up NEK5000 with autotuning and specialization. In Proceedings of the 24th ACM International Conference on Supercomputing, pages 253–262. ACM, 2010.

[132] Aidan Shribman and Benoit Hudzia. Pre-copy and Post-copy VM Live Migration for Memory Intensive Applications. In European Conference on Parallel Processing, pages 539–547. Springer, 2012.

[133] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.

[134] Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, et al. Addressing failures in exascale computing. The International Journal of High Performance Computing Applications, 28(2):129–173, 2014.

[135] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. Knights Landing: Second-generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–46, 2016.

[136] ASCAC Subcommittee et al. Top ten exascale research challenges. US Department of Energy Report, 2014.

[137] Prasanna Sundararajan. High-Performance Computing using FPGAs. Xilinx white paper: FPGAs, pages 1–15, 2010.

[138] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[139] Monika ten Bruggencate and Duncan Roweth. DMAPP: An API for One-Sided Program Models on Baker Systems. In Cray User Group Conference, 2010.

[140] Rajeev Thakur and Alok Choudhary. An extended two-phase method for accessing sections of out-of-core arrays. Scientific Programming, 5(4):301–317, 1996.

[141] Rajeev Thakur, William Gropp, and Ewing Lusk. Data sieving and collective I/O in ROMIO. In The Seventh Symposium on the Frontiers of Massively Parallel Computation (Frontiers ’99), pages 182–189. IEEE, 1999.

[142] Rajeev Thakur, Ewing Lusk, and William Gropp. Users guide for ROMIO: A high-performance, portable MPI-IO implementation. Technical Report ANL/MCS-TM-234, Mathematics and Computer Science Division, Argonne National Laboratory, 1997.

[143] The Linux Kernel Organization. Userfaultfd - Delegation of page-fault handling to user-space applications. in: Kernel.org, https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt, 2017. [On-Line].

[144] The Linux Kernel Organization. DAX - Direct Access for files. in: Kernel.org, https://www.kernel.org/doc/Documentation/filesystems/dax.txt, 2019. [On-Line].

[145] The Linux Kernel Organization. Heterogeneous Memory Management (HMM). in: Kernel.org, https://www.kernel.org/doc/html/v4.18/vm/hmm.html, 2019. [On- Line].

[146] The Linux Kernel Organization. Virtual memory (VM) subsystem. in: Kernel.org, https://www.kernel.org/doc/Documentation/sysctl/vm.txt, 2019. [On-Line].

[147] Susan E Thompson, Fergal Mullally, Jeff Coughlin, Jessie L Christiansen, Christopher E Henze, Michael R Haas, and Christopher J Burke. A machine learning technique to identify transit shaped signals. The Astrophysical Journal, 812(1):46, 2015.

[148] Tiankai Tu, Charles A Rendleman, David W Borhani, Ron O Dror, Justin Gullingsrud, Morten O Jensen, John L Klepeis, Paul Maragakis, Patrick Miller, Kate A Stafford, et al. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In SC 2008: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE, 2008.

[149] Rob F Van der Wijngaart and Timothy G Mattson. The Parallel Research Kernels. In 2014 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2014.

[150] Brian Van Essen, Henry Hsieh, Sasha Ames, and Maya Gokhale. DI-MMAP: A High-Performance Memory-Map Runtime for Data-intensive Applications. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pages 731–735. IEEE, 2012.

[151] Sudharshan S Vazhkudai, Bronis R de Supinski, Arthur S Bland, Al Geist, James Sexton, Jim Kahle, Christopher J Zimmer, Scott Atchley, Sarp Oral, Don E Maxwell, et al. The design, deployment, and evaluation of the CORAL pre-exascale systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, page 52. IEEE Press, 2018.

[152] Daniel Waddington and Jim Harris. Software challenges for the changing storage landscape. Communications of the ACM, 61(11):136–145, 2018.

[153] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.

[154] Joachim Worringen. Self-adaptive hints for collective I/O. In European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, pages 202–211. Springer, 2006.

[155] Xiangyao Yu, Christopher J Hughes, Nadathur Satish, and Srinivas Devadas. IMP: Indirect Memory Prefetcher. In Proceedings of the 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 178–190. ACM, 2015.

[156] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L Greathouse, Lifan Xu, and Michael Ignatowski. TOP-PIM: throughput-oriented programmable processing in memory. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pages 85–98. ACM, 2014.

Glossary

AMR Adaptive Mesh Refinement. 25
API Application Programming Interface. 10, 11, 15, 16, 20, 25, 42, 46, 49, 54, 57, 64, 65, 68

BVLC Berkeley Vision and Learning Center. 8

CAF Coarray Fortran. 11, 14, 25, 26, 45–49, 51, 52, 67, 68, 71, 72
CAS Compare-And-Swap. 27, 28, 33, 34, 40
CCE Cray C Compiler. 60

DAX Direct Access for files. 23
DHT Distributed Hash Table. 28, 34, 35
DRAM Dynamic Random-Access Memory. 5, 6, 9, 23, 24, 33, 42, 50, 60, 65, 68, 69

EPiGRAM-HS Exascale Programming Models for Heterogeneous Systems. 12, 13, 66
ESD Extreme Scale Demonstrators. 20

FDR InfiniBand Fourteen Data Rate. 19
FIFO First-In, First-Out. 58
FLOPS Floating Point Operations per Second. 6
Fortran-IO Fortran Interface for I/O. 25, 26, 46, 48, 49, 52
FPGA Field-Programmable Gate Array. 7, 68
FUTEX Fast User-level Mutex. 57

GCC GNU Compiler Collection. 33, 47, 50
GFortran GNU Fortran. 46–48
GMA Global Memory Abstraction. 13, 14, 19
GPU Graphics Processing Unit. 6–8, 68

HACC Hardware-Accelerated Cosmology Code. 28, 35
HBM High-Bandwidth Memory. 6, 8, 10, 65, 66


HDD Hard Disk Drive. 33, 34, 36
HDF5 Hierarchical Data Format v5. 4, 10, 25
HMM Heterogeneous Memory Management. 66
HPC High-Performance Computing. 3, 5, 6, 8–11, 14–16, 19, 21–23, 25, 28–31, 33, 34, 36–38, 42, 44, 45, 48, 52–54, 61, 63–68, 71, 72

ICC Intel C Compiler. 60
ICPC Intel C++ Compiler. 42
ILSVRC ImageNet Large Scale Visual Recognition Competition. 7, 8
IOR Interleaved Or Random. 60, 61
IPC Inter-Process Communication. 56, 59

KVS Key Value Store. 20

LTO Linear Tape-Open. 19

MCDRAM Multi-Channel DRAM. 8, 65, 66, 72
MD Molecular Dynamics. 37
MDT Memory Drive Technology. 10, 23, 24
MPI Message-Passing Interface. 3, 4, 8, 9, 11, 13–19, 21, 22, 25, 27–42, 44–48, 50–54, 60–62, 64–67, 69, 71, 72
MPI-IO Message-Passing Interface for Parallel I/O. 3, 8–10, 15–19, 21–23, 25, 28–30, 35, 36, 38–40, 44, 48, 53, 57, 62, 63, 65, 72
MTBF Mean Time Between Failures. 45, 52

NCAR National Center for Atmospheric Research. 50
NERSC National Energy Research Scientific Computing Center. 53, 59, 60, 65
NUMA Non-Uniform Memory Access. 5
NVDIMM Non-Volatile Dual In-line Memory Module. 19, 23, 24, 47, 62
NVMe Non-Volatile Memory Express. 19, 50–52
NVRAM Non-Volatile Random-Access Memory. 5, 6, 8, 19, 23, 64, 65, 69, 72

OID Object Identifier. 65
ORNL Oak Ridge National Laboratory. 6, 65
OS Operating System. 10, 14, 23, 24, 30–33, 36, 42, 50, 52–57, 60, 62, 64, 72
OST Object Storage Target. 4, 33, 42, 48, 60, 64

PFS Parallel File System. 3–6, 8, 9, 16–18, 21, 22, 29, 30, 33–36, 39, 42, 50, 53, 57, 60, 62–65, 71
PGAS Partitioned Global Address Space. 45, 67
pHDF5 Parallel Hierarchical Data Format v5. 25
PIC Particle-in-Cell. 4, 10, 13, 22, 25
PIM Processing-In-Memory. 69
pLRU Pseudo-Least Recently Used. 56
PMPI MPI Profiling Interface. 28, 30, 33, 50
POSIX Portable Operating System Interface. 16, 23
POSIX-AIO Portable Operating System Interface for Asynchronous I/O. 23
POSIX-IO Portable Operating System Interface for I/O. 9, 10, 16, 20, 23, 57, 62, 63, 65
PRK Parallel Research Kernels. 51, 52, 61, 62
PUMA Purdue MapReduce Benchmarks Suite. 43

QoS Quality-of-Service. 24

RDMA Remote Direct Memory Access. 16, 27, 48, 72
RPC Remote Procedure Call. 4, 20
RSS Resident Set Size. 44, 56–58, 60

SAGE Percipient Storage for Exascale Data Centric Computing. 12, 13, 19, 20, 22, 64, 69
Sage2 Percipient Storage for Exascale Data Centric Computing 2. 12, 13, 19, 20, 64, 66, 69
SDM Software-Defined Memory. 10, 23, 24
SHAVE Streaming Hybrid Architecture Vector Engine. 7
SoC System-on-Chip. 7
SSD Solid-State Drive. 19, 23, 24, 33, 34, 50, 59
SSHD Solid State Hybrid Drive. 19
STT-RAM Spin-Transfer Torque Random-Access Memory. 5

TDP Thermal Design Power. 7, 69

UOSI Unified Object Storage Infrastructure. 19

V-NAND Vertical NAND. 19
VPU Vision Processing Unit. 7
VTK Visualization Toolkit. 4

WIRO Writes-In, Reads-Out. 54, 58

Part II

Included Papers

Paper I: MPI Windows on Storage for HPC Applications (p. 93)

Paper II: Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks (p. 115)

Paper III: Persistent Coarrays: Integrating MPI Storage Windows in Coarray Fortran (p. 125)

Paper IV: uMMAP-IO: User-level Memory-mapped I/O for HPC (p. 135)