Towards On-Demand I/O Forwarding in HPC Platforms


Jean Luca Bez1, Francieli Z. Boito2, Alberto Miranda3, Ramon Nou3, Toni Cortes3,4, Philippe O. A. Navaux1
1 Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS) — Porto Alegre, Brazil
2 LaBRI, University of Bordeaux, Inria, CNRS, Bordeaux-INP — Bordeaux, France
3 Barcelona Supercomputing Center (BSC) — Barcelona, Spain
4 Polytechnic University of Catalonia — Barcelona, Spain
{jean.bez, navaux}@inf.ufrgs.br, [email protected], {alberto.miranda, ramon.nou}@bsc.es, [email protected]

Abstract—I/O forwarding is an established and widely-adopted technique in HPC to reduce contention and improve I/O performance in the access to shared storage infrastructure. On such machines, this layer is often physically deployed on dedicated nodes, and their connection to the clients is static. Furthermore, the increasingly heterogeneous workloads entering HPC installations stress the I/O stack, requiring tuning and reconfiguration based on the applications' characteristics. Nonetheless, it is not always feasible in a production system to explore the potential benefits of this layer under different configurations without impacting clients. In this paper, we investigate the effects of I/O forwarding on performance by considering the application's I/O access patterns and system characteristics. We aim to explore when forwarding is the best choice for an application, how many I/O nodes it would benefit from, and whether not using forwarding at all might be the correct decision. To gather performance metrics, explore, and understand the impact of forwarding I/O requests of different access patterns, we implemented FORGE, a lightweight I/O forwarding layer in user-space. Using FORGE, we evaluated the optimal forwarding configurations for several access patterns on MareNostrum 4 (Spain) and Santos Dumont (Brazil). Our results demonstrate that shifting the focus from a static system-wide deployment to an on-demand reconfigurable I/O forwarding layer dictated by application demands can improve I/O performance on future machines.

1. Introduction

Scientific applications impose ever-growing performance requirements on the High-Performance Computing (HPC) field. Increasingly heterogeneous workloads are entering HPC installations, from the traditionally compute-bound scientific simulations to Machine Learning applications and I/O-bound Big Data workflows. While HPC clusters typically rely on a shared storage infrastructure powered by a Parallel File System (PFS) such as Lustre [1], GPFS [2], or Panasas [3], the increasing I/O demands of applications from fundamentally distinct domains stress this shared infrastructure. As systems grow in the number of compute nodes to accommodate larger applications and more concurrent jobs, the PFS is not able to keep providing performance due to increasing contention and interference [4], [5], [6].

I/O forwarding [7] is a technique that seeks to mitigate this contention by reducing the number of compute nodes concurrently accessing the PFS servers. Rather than having applications access the PFS directly, the forwarding technique defines a set of I/O nodes that are responsible for receiving application requests and forwarding them to the PFS, thus allowing the application of optimization techniques such as request scheduling and aggregation. Furthermore, I/O forwarding is file system agnostic and is transparently applied to applications. Due to these benefits, the technique is used by TOP500 machines [8], as shown in Table 1.

TABLE 1. HIGHEST-RANKED TOP500 MACHINES THAT IMPLEMENT THE I/O FORWARDING TECHNIQUE (JUNE 2020).

Rank | Machine               | Compute Nodes | I/O Nodes
4    | Sunway TaihuLight [9] | 40,960        | 240
5    | Tianhe-2A [6]         | 16,000        | 256
10   | Piz Daint [10]        | 6,751         | 54
11   | Trinity [11]          | 19,420        | 576

On such machines, I/O nodes are often physically deployed on special nodes with dedicated hardware, and their connection to the clients is static. Thus, a subset of compute nodes will only forward their requests to a single fixed I/O node. Typically, in this setup, all applications are forced to use forwarding when it is deployed in the machine.

In this paper, we seek to investigate the impact of I/O forwarding on performance, taking into account the application's I/O demands (i.e., its access patterns) and the overall system characteristics. We seek to explore when forwarding is the best choice (and how many I/O nodes an application would benefit from), and when not using it is the most suitable alternative. We hope that the design of the forwarding layer in future machines will build upon this knowledge, leading towards a more re-configurable I/O forwarding layer. The Sunway TaihuLight [9] and Tianhe-2A [6] already provide initial support towards this plasticity.

Due to the physically static deployment of current I/O forwarding infrastructures, it is usually not possible to run experiments with varying numbers of I/O nodes. Furthermore, any reconfiguration of the forwarding layer typically requires acquiring, installing, and configuring new hardware. Thus, most machines are not easily re-configurable, and, in fact, end-users are not allowed to make modifications to this layer to prevent impacting a production system. We argue that a research and exploration alternative is required to obtain an overview of the impact that different I/O access patterns might have under different I/O forwarding configurations. We seek a portable, simple-to-deploy solution capable of covering different deployments and access patterns. Consequently, we implement a lightweight I/O forwarding layer named FORGE that can be deployed as a user-level job to gather performance metrics and aid in understanding the impact of forwarding in an HPC system. Our evaluation was conducted on two different supercomputers: MareNostrum 4 at the Barcelona Supercomputing Center (BSC), in Spain, and Santos Dumont at the National Laboratory for Scientific Computation (LNCC), in Brazil. We demonstrate that the ideal number of I/O nodes differs across 189 scenarios covering distinct access patterns. For 90.5% of them, using forwarding would indeed be the correct course of action. But using two I/O nodes, for instance, would only bring performance improvements for 44% of the scenarios.

The rest of this paper is organized as follows. Section 2 presents FORGE, our user-level implementation of the I/O forwarding technique. Our experimental methodology, the platforms, and the access patterns are described in Section 3, where we also present and discuss our results. Finally, we conclude this paper in Section 5 and discuss future work.
2. FORGE: The I/O Forwarding Explorer

Existing I/O forwarding solutions in production today, such as IBM's CIOD [7] or Cray DVS [12], require dedicated hardware for I/O nodes and/or a restrictive software stack. They are, hence, not a straightforward solution to explore and evaluate I/O forwarding in machines that do not yet have this layer deployed or that may seek to modify existing deployments. Furthermore, modifications to the deployment or tuning of these solutions would have a system-wide impact, even if made just for exploration and testing purposes. Since many supercomputers, such as MareNostrum 4 and Santos Dumont, still do not have an I/O forwarding layer, we implemented such a layer in user-space to allow sysadmins to evaluate the benefits and impact of using different numbers of I/O nodes under different workloads, without the need for production downtime or new hardware.

The goal of the I/O Forwarding Explorer (FORGE) is to evaluate new I/O optimizations (such as new request schedulers) as well as modifications to I/O forwarding deployment and configuration on large-scale clusters and supercomputers. Our solution is designed to be flexible enough to represent multiple I/O access patterns (number of processes, file layout, request spatiality, and request size) derived from Darshan [13]. As mentioned, it can be submitted to HPC queues as a normal user-space job to provide insights regarding the I/O nodes' behavior on the platform. FORGE is open-source and can be downloaded from https://github.com/jeanbez/forge.

Figure 1. Overview of the architecture used by FORGE to implement the I/O forwarding technique in user-space. I/O nodes and compute nodes are distributed according to the hostfile.

We designed FORGE based on the general concepts applied in I/O forwarding tools such as IOFSL [14], IBM CIOD [7], and Cray DVS [12]. The common underlying idea present in these solutions is that clients emit I/O requests, which are then intercepted and forwarded to one I/O node in the forwarding layer. This I/O node receives requests from multiple processes of a given application (or even from many applications), schedules those requests, and reshapes the I/O flow by reordering and aggregating the requests to better suit the underlying PFS. The requests are then dispatched to the back-end PFS, which executes them and replies to the I/O node. Only then does the I/O node respond to the client (with the requested data, if that is the case) and report whether the operation was successful. I/O nodes typically have a pool of threads to handle multiple incoming requests and dispatch them to the PFS concurrently.

FORGE was built as an MPI application where M ranks are split into two subsets: compute nodes and I/O nodes. The first N processes act as I/O forwarding servers, and each should be placed exclusively on a compute node. The remaining M − N processes act as an application and are evenly distributed between the nodes, allocating more than one process per node if necessary, as depicted in Figure 1. The compute nodes issue I/O requests according to a user-defined JSON input file describing the I/O phases of an application. All I/O requests of a given compute node are sent as MPI messages to its assigned I/O node. The operations are synchronous from the client's perspective, i.e., FORGE waits until the I/O operation to the PFS is complete before replying to the client.
Each I/O node has a thread that listens for incoming requests and a pool of threads to dispatch these requests to the PFS and reply to the clients. Note that any metadata-related requests, such as those in open() and close() operations, are executed immediately. For write() and read() operations, the requests are placed in a queue of incoming requests, where they are scheduled using the AGIOS [15] scheduling library, aggregated, and dispatched. AGIOS was chosen because it can be plugged into other forwarding solutions and it contains several available schedulers. Regarding aggregation, in FORGE, requests to the same file and operation are handled in succession. Existing solutions, such as IOFSL, have a particular interface to issue list operations if the PFS supports it. We opted for a more general approach that does not rely on such a feature being available.
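To make the client/forwarder split and the synchronous round-trip described above concrete, the sketch below mimics the architecture with mpi4py. It is a minimal illustration, not FORGE's actual code: the NUM_ION constant, the request dictionary, the message tags, and the static modulo mapping of clients to I/O nodes are our assumptions, and the JSON-driven I/O phases, AGIOS scheduling, and aggregation are omitted.

```python
"""Minimal sketch of a FORGE-like user-space forwarding layer (illustrative only)."""
from mpi4py import MPI

NUM_ION = 2              # ranks 0..NUM_ION-1 act as I/O (forwarding) nodes
REQ_TAG, ACK_TAG = 1, 2

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def io_node():
    """Receive requests from the assigned clients, execute them, acknowledge."""
    clients = [r for r in range(NUM_ION, size) if (r - NUM_ION) % NUM_ION == rank]
    done = 0
    while done < len(clients):
        status = MPI.Status()
        req = comm.recv(source=MPI.ANY_SOURCE, tag=REQ_TAG, status=status)
        if req is None:                       # client finished its I/O phases
            done += 1
            continue
        # A real forwarder would enqueue, schedule (e.g., with AGIOS), and
        # aggregate before dispatching; here we dispatch immediately.
        with open(req["file"], "ab") as f:
            f.write(b"\0" * req["size"])
        comm.send({"ok": True}, dest=status.Get_source(), tag=ACK_TAG)

def client():
    """Issue synchronous writes that are forwarded through one I/O node."""
    ion = (rank - NUM_ION) % NUM_ION          # static client-to-I/O-node mapping
    for _ in range(4):                         # a tiny, fixed "I/O phase"
        comm.send({"op": "write", "file": f"out.{rank}", "size": 1 << 20},
                  dest=ion, tag=REQ_TAG)
        comm.recv(source=ion, tag=ACK_TAG)    # wait: operations are synchronous
    comm.send(None, dest=ion, tag=REQ_TAG)    # signal completion

if rank < NUM_ION:
    io_node()
else:
    client()
```

Run with, for example, mpiexec -n 10 python forge_sketch.py: ranks 0 and 1 then act as forwarders for the remaining eight client ranks.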

3. Study of I/O Forwarding Performance

This section evaluates the impact of I/O forwarding on performance by observing the bandwidth achieved by different applications' access patterns under different forwarding configurations. We seek to determine when forwarding is the best choice and how many I/O nodes an application would benefit from. Our experiments were conducted on MareNostrum 4 (https://www.bsc.es/marenostrum/marenostrum) at the Barcelona Supercomputing Center (BSC), Spain, and on the Santos Dumont supercomputer at the National Laboratory for Scientific Computation (LNCC), Brazil.

The MareNostrum 4 supercomputer comprises 3,456 ThinkSystem SD530 compute nodes on 48 racks. Each node has two Intel Xeon Platinum 8160 24-core chips at 2.1 GHz, for a total of 165,888 cores and 390 TB of main memory. Each node provides an Intel SSD DC S3520 Series drive with 240 GiB of available storage, usable within a job. A 100 Gb Intel Omni-Path full fat-tree network is used for the interconnection, and a total of 14 PB of storage capacity is offered by IBM's GPFS.

Santos Dumont has a total of 36,472 CPU cores. The base system has 756 thin compute nodes with two Intel Xeon E5-2695v2 Ivy Bridge 2.40 GHz 12-core processors, 64 GB of DDR3 RAM, and one 128 GB SSD. There are three types of thin nodes (BullX B700 family), one fat node (BullX MESCA family), and one AI/ML/DL dedicated node (BullSequana X family). Its latest upgrade added 376 X1120 nodes with two Intel Xeon Cascade Lake Gold 6252 2.10 GHz 24-core processors and one 1 TB SSD. 36 of these nodes have 764 GB of DDR3 RAM, whereas the remainder have 384 GB. Compute nodes are connected directly to the Lustre file system through a fat-tree non-blocking InfiniBand FDR network. Each OSS has one OST made of 40 6 TB HDDs in RAID6 (the same for the MDS and its MDT).

We explore FORGE with multiple access patterns and forwarding deployments on MareNostrum in Section 3.1 and expand our analysis using a subset of patterns on Santos Dumont in Section 3.2. We covered 189 scenarios, with at least 5 repetitions of each, considering the following factors (a sketch enumerating how they combine follows the list):

• 8, 16, and 32 compute nodes;
• 12, 24, and 48 client processes per compute node (i.e., 96, 192, 384, 768, and 1536 processes);
• File layout: file-per-process or shared file;
• Spatiality: contiguous or 1D-strided;
• Operation: writes, with O_DIRECT enabled to account for caching effects present in the system;
• Request sizes of 32 KB, 128 KB, 512 KB, 1 MB, 4 MB, 6 MB, and 8 MB, synchronously issued until a given total size is transferred or a stonewall is reached.

As the compute nodes of both supercomputers have 48 cores per node, we evaluate scenarios where an application uses all, half, or a quarter of its cores. Regarding file layout and spatiality, we opted for configurations commonly tested with benchmarks such as IOR [16] and MPI-IO Test [17].
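The paper does not spell out how these factors combine into exactly 189 scenarios. One reading that matches the count, and is consistent with Table 2 never pairing the file-per-process layout with 1D-strided access, is to treat file-per-process as contiguous only: 3 node counts × 3 process counts per node × 3 layout/spatiality combinations × 7 request sizes = 189. The sketch below enumerates the scenario space under that assumption; it is our interpretation, not a statement from the authors.

```python
from itertools import product

nodes = [8, 16, 32]                # compute nodes
ppn = [12, 24, 48]                 # client processes per compute node
layouts = [("file-per-process", "contiguous"),   # assumption: strided access
           ("shared", "contiguous"),              # to private files is not a
           ("shared", "1D-strided")]              # separate configuration
sizes_kb = [32, 128, 512, 1024, 4096, 6144, 8192]

scenarios = [
    {"nodes": n, "processes": n * p, "layout": lay,
     "spatiality": spat, "request_kb": s}
    for n, p, (lay, spat), s in product(nodes, ppn, layouts, sizes_kb)
]
print(len(scenarios))   # 3 * 3 * 3 * 7 = 189
```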
TABLE 2. DESCRIPTION OF THE ACCESS PATTERNS USED WITH FORGE IN THE EXPERIMENTS PRESENTED IN FIGURES 2 AND 4.

#  | Nodes | Processes | File Layout      | Spatiality | Request (KB)
A  | 32    | 1536      | File-per-process | Contiguous | 1024
B  | 32    | 1536      | File-per-process | Contiguous | 128
C  | 32    | 1536      | Shared           | Contiguous | 1024
D  | 32    | 1536      | Shared           | Contiguous | 4096
E  | 32    | 1536      | Shared           | 1D-strided | 512
F  | 16    | 192       | Shared           | Contiguous | 32
G  | 16    | 192       | Shared           | 1D-strided | 128
H  | 8     | 192       | File-per-process | Contiguous | 8192
I  | 8     | 192       | Shared           | 1D-strided | 8192
J  | 16    | 384       | Shared           | Contiguous | 128
K  | 16    | 384       | Shared           | Contiguous | 8192
L  | 32    | 384       | Shared           | 1D-strided | 4096
M  | 32    | 384       | Shared           | 1D-strided | 512
N  | 8     | 384       | Shared           | Contiguous | 4096
O  | 16    | 768       | Shared           | Contiguous | 1024
P  | 32    | 768       | Shared           | Contiguous | 1024
Q  | 16    | 768       | Shared           | 1D-strided | 1024
R  | 8     | 96        | File-per-process | Contiguous | 8192
S  | 8     | 96        | Shared           | 1D-strided | 6144
T  | 8     | 96        | Shared           | Contiguous | 512

3.1. I/O Forwarding on MareNostrum 4

Figure 2 depicts the bandwidth measured at the client side when multiple clients issue their requests following each access pattern, taking into account the number of available I/O nodes (0, 1, 2, 4, and 8). Each experiment was repeated at least 5 times, in random order, spanning different days and periods of the day. Table 2 describes each depicted pattern. In Figure 2, only for scenario A does directly accessing the PFS translate into higher I/O bandwidth. For L and M, the right choice would be to allocate one I/O node each. Patterns J and O achieve higher bandwidth when four I/O nodes are given to the application. On the other hand, eight are required by scenarios B, G, H, and R. For the remaining scenarios, two I/O nodes is the ideal choice. The complete evaluation of the 189 experiments is available at https://doi.org/10.5281/zenodo.4016899.

Figure 2. I/O bandwidth of distinct write access patterns and I/O nodes in the MareNostrum 4 supercomputer (one panel per scenario A-T; bandwidth in MB/s versus 0, 1, 2, 4, and 8 I/O forwarding nodes). The y-axis scale is not the same across panels.

Figure 4. I/O bandwidth of distinct write access patterns and I/O nodes in the Santos Dumont supercomputer (same layout as Figure 2). The y-axis scale is not the same across panels.

For each one of the 189 scenarios, it is possible to determine the different options an application (or access pattern) has regarding how many I/O nodes it could use. For options that translate into similar achieved bandwidth, the smallest number of I/O nodes is the most reasonable choice. To verify how many statistically distinct choices exist for each pattern, we compute Dunn's test [18], which reports the results of the multiple pairwise comparisons, after a Kruskal-Wallis [19] test for stochastic dominance among groups. Dunn's is a non-parametric test that can be used to identify which specific means differ from the others. The null hypothesis (H0) for the test is that there is no difference between groups; the alternative hypothesis is that there is a difference. We used a significance level of α = 0.05, thus rejecting H0 if p ≤ α/2.
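For reference, this two-step procedure can be sketched with SciPy and the scikit-posthocs package (an assumption about tooling; the paper does not state which software was used). The bandwidth samples below are made-up placeholders for a single access pattern:

```python
from scipy.stats import kruskal
import scikit_posthocs as sp

# Hypothetical bandwidth samples (MB/s) for one access pattern: one group per
# forwarding configuration (0, 1, 2, 4, and 8 I/O nodes), 5 repetitions each.
groups = {
    0: [510, 498, 505, 520, 515],
    1: [760, 742, 755, 770, 748],
    2: [1430, 1410, 1455, 1460, 1425],
    4: [1445, 1420, 1462, 1450, 1438],
    8: [1425, 1440, 1418, 1452, 1431],
}

# Kruskal-Wallis: is at least one configuration stochastically different?
h_stat, p_value = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:
    # Dunn's post-hoc test gives the pairwise p-values; configurations whose
    # bandwidths are statistically indistinguishable can be collapsed, and the
    # smallest I/O node count among them is the most reasonable choice.
    pairwise_p = sp.posthoc_dunn(list(groups.values()))
    print(pairwise_p)
```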

Not using forwarding is always an option to be considered, as it yields different bandwidth than when using forwarding, though it might not be the best choice in many cases. Figure 3(a) illustrates how many options an allocation policy would have to consider. For 88 (≈ 46%) of the patterns, three possible choices would impact performance.

Figure 3. (a) Number of choices each access pattern has that translate into statistically distinct I/O performance. (b) Access patterns grouped by the number of I/O nodes that translates into the best performance.

The optimal number of I/O nodes for each of the 189 scenarios, considering the available choices of I/O nodes, is different, as depicted by Figure 3(b). For 12 (6%), 83 (44%), 15 (8%), and 17 (9%) scenarios, the largest bandwidth is achieved by using 1, 2, 4, and 8 I/O nodes, respectively, whereas for 62 scenarios (33%), not using forwarding is instead the best alternative. There does not seem to be a simple rule that fits all applications and system configurations, which is to be expected given the complexity of factors that can influence I/O performance. For instance, consider patterns A and B on MareNostrum. The request size makes not using forwarding the best option for pattern A, whereas for B, using 8 I/O nodes yields more bandwidth. For patterns A and C, where the only difference is the file layout, C should use 2 I/O nodes. On Santos Dumont, 8 I/O nodes should be chosen for C instead of 0 as in A.

Considering the static physical setup of platforms where forwarding is present, applications often are not allowed to access the PFS directly, other than through an I/O node. We can compare the maximum attainable bandwidth when forcibly using forwarding in all cases (with the optimal number of I/O nodes) against not using the technique at all. For 90.5% (171 scenarios), using forwarding would be the right call, while for 9.5% (18 scenarios), directly accessing the PFS would instead increase performance. Hence, both options should be available to applications to optimize I/O performance properly.

3.2. I/O Forwarding on Santos Dumont

We conducted a smaller evaluation on Santos Dumont (due to allocation restrictions) comprising the 20 patterns described in Table 2. Figure 4 depicts the I/O bandwidth of distinct write access patterns with a varying number of I/O nodes. One can notice that the impact of the forwarding layer setup is not the same on MareNostrum and Santos Dumont. A common behavior observed on Santos Dumont is that using more I/O nodes yields better performance. Regardless, the choice of whether or not to use forwarding in this machine is still relevant. For instance, for scenarios A, B, and H, direct access to the Lustre PFS translates into higher I/O bandwidth, whereas for some other scenarios, the benefits of using forwarding are perceived by the application only after more than one I/O node is available. It is possible to notice that using a single I/O node is not the best choice for any of the tested patterns. If two were available, scenarios C, E, N, Q, and S would already see benefits. For D, G, I, J, K, L, M, O, P, and T, performance on this machine would increase only when four I/O nodes are given to the application. Other patterns, such as F and R, require eight I/O nodes to achieve the best performance. In practice, there will be an upper bound on how many of these resources are available to applications. This information could guide allocation policies to arbitrate the I/O nodes by allocating the resources to those that would achieve the highest bandwidth.

3.3. Discussion

As HPC systems get more complex to accommodate new performance requirements and different applications, the forwarding layer might be re-purposed from a must-use into an on-demand approach. This layer has the potential not only to coordinate the access to the back-end PFS or avoid contention, but also to transparently reshape the flow of I/O requests to be more suited to the characteristics and setup of the storage system. Our results demonstrate the intuitive notion that access patterns are impacted differently by using forwarding, and that the choice of the number of I/O nodes also plays an important role. Consequently, taking this information into account, novel forwarding mechanisms could allocate I/O nodes based on demand. Our initial experiments with FORGE demonstrate that an idle compute node (or even a set of compute nodes) could be temporarily allocated to act as a forwarder and improve application and system performance.

Furthermore, as I/O nodes could be given only to the applications that would benefit the most from them, the interference caused by sharing these resources would be reduced or even eliminated. In today's setups, instead, I/O nodes have to handle all requests from a subset of compute nodes, where concurrent applications (with very distinct characteristics) could be running. Yildiz et al. [4] demonstrate the impact of interference in the PFS, and Bez et al. [20] attested to this when forwarding is also present. Moreover, Yu et al. [21] investigated the load imbalance on the forwarding nodes, where recruiting idle I/O nodes would be the solution. We argue instead that a shift of focus from the physical, global deployment of this layer to the I/O demands of applications should guide this layer's future usage and distribution. Hence, I/O forwarding could be seen as an on-demand service for applications that would benefit from it, possibly recruiting temporary I/O nodes to avoid the interference present when sharing these resources.
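One way such an on-demand policy could consume these measurements is sketched below. It is a hypothetical illustration, not an algorithm proposed in the paper: the function name and inputs are ours. It simply grants the smallest configuration, including no forwarding at all, whose bandwidth is statistically tied with the best observed one, mirroring the "smallest number of I/O nodes among similar options" heuristic used in Section 3.1.

```python
def choose_io_nodes(bandwidth, tied_with_best):
    """Pick the cheapest I/O-node count for one access pattern.

    bandwidth      -- dict mapping I/O-node count (0 = no forwarding) to the
                      median measured bandwidth in MB/s
    tied_with_best -- set of counts whose bandwidth is statistically
                      indistinguishable from the maximum (e.g., from Dunn's test)
    """
    best = max(bandwidth, key=bandwidth.get)
    candidates = tied_with_best | {best}
    return min(candidates)          # fewest I/O nodes among the tied options

# Hypothetical measurements for one pattern: forwarding helps, and 2 and 4
# I/O nodes are statistically tied, so only 2 are granted.
measured = {0: 505, 1: 750, 2: 1440, 4: 1450, 8: 1430}
print(choose_io_nodes(measured, tied_with_best={2, 4}))   # -> 2
```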
4. Related Work

The I/O forwarding layer has been the focus of many research efforts [20], [22], [23] due to its potential to transparently improve applications' and systems' performance. Recently, studies have been motivated by the need to optimally allocate these I/O nodes in a cluster. Yu et al. [21] propose to recruit idle I/O nodes to mitigate the load imbalance problem of the I/O forwarding layer. The mapping of extra and primary I/O nodes is exclusive to an application; consequently, that adds some flexibility to the default static solution. Ji et al. [24] propose Dynamic Forwarding Resource Allocation (DFRA), which estimates the number of I/O nodes needed by a certain job based on its history records. DFRA remaps compute nodes to forwarding nodes other than their default assignments, granting either more I/O nodes (for capacity) or unused I/O nodes (for isolation) based on the global application behavior. Their strategy relies on resource over-provisioning and assumes that there are enough idle I/O nodes to satisfy all allocations.

Conversely, we sought to expand such studies with an experimental analysis of how I/O forwarding would benefit performance on machines that do not yet have this layer physically deployed. Furthermore, we opted to explore at access-pattern granularity, targeting the baseline access patterns that describe an application's behavior to determine the number of I/O nodes that better suits each situation. The acquired knowledge can be used to guide allocation decisions for all applications that share the observed patterns. Compared to the current state of the art, FORGE can assist system administrators in understanding how common access patterns behave under different forwarding setups. On its client side, FORGE can be compared to IOR [16] or MPI-IO Test [17], where it is possible to describe various I/O workloads and run tests through the cluster's job queues. FORGE implements a forwarding layer's baseline concepts, including request forwarding, scheduling, and aggregation.

5. Conclusion

I/O forwarding is an established and widely-adopted technique in HPC to reduce contention and improve I/O performance in the access to a shared storage infrastructure. However, it is not always possible to explore its advantages under different setups without impacting or disrupting production systems. This paper investigates I/O forwarding by considering particular applications' I/O access patterns and the system configuration, rather than trying to guess a one-size-fits-all configuration for all applications. By determining when forwarding is the best choice for an application and how many I/O nodes it would benefit from, we can guide allocation policies and help reach better decisions.

To understand the impact of forwarding I/O requests of different access patterns, we implemented FORGE, a lightweight forwarding layer in user-space. We explored 189 scenarios covering distinct access patterns and demonstrated that, as expected, the optimal number of I/O nodes varies depending on the application. While for 90.5% of the patterns forwarding would be the correct course of action, allocating two I/O nodes would only bring performance improvements for 44% of the scenarios. Our results on the MareNostrum and Santos Dumont supercomputers demonstrate that shifting the focus from a static system-wide deployment to an on-demand, reconfigurable I/O forwarding layer dictated by application demands can improve I/O performance on future machines. Future work will harness this knowledge to guide and arbitrate I/O node allocation decisions. We seek to propose a production-ready solution capable of adding re-allocation flexibility to this layer based on the application's access patterns and I/O demands.

Acknowledgments

This study was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. It has also received support from the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil. It is also partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant PID2019-107255GB and by the Generalitat de Catalunya under contract 2014-SGR-1051. The authors thankfully acknowledge the computer resources, technical expertise, and assistance provided by the Barcelona Supercomputing Center - Centro Nacional de Supercomputación. The authors acknowledge the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing the HPC resources of the SDumont supercomputer, which have contributed to the research results reported within this paper. URL: http://sdumont.lncc.br.

References

[1] SUN, "High-Performance Storage Architecture and Scalable Cluster File System," Sun Microsystems, Inc, Tech. Rep., 2007.
[2] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in Proceedings of the 1st USENIX Conference on File and Storage Technologies, ser. FAST '02. USA: USENIX Association, 2002.
[3] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou, "Scalable Performance of the Panasas Parallel File System," in 6th USENIX Conference on File and Storage Technologies, ser. FAST '08. USA: USENIX Association, 2008.
[4] O. Yildiz, M. Dorier, S. Ibrahim, R. Ross, and G. Antoniu, "On the Root Causes of Cross-Application I/O Interference in HPC Storage Systems," in IEEE International Parallel and Distributed Processing Symposium (IPDPS). Chicago, USA: IEEE, 2016, pp. 750–759.
[5] J. Yu, G. Liu, X. Li, W. Dong, and Q. Li, "Cross-layer coordination in the I/O software stack of extreme-scale systems," Concurrency and Computation: Practice and Experience, vol. 30, no. 10, 2018.
[6] W. Xu, Y. Lu, Q. Li, E. Zhou, Z. Song, Y. Dong, W. Zhang, D. Wei, X. Zhang, H. Chen, J. Xing, and Y. Yuan, "Hybrid hierarchy storage system in MilkyWay-2 supercomputer," Frontiers of Computer Science, vol. 8, no. 3, pp. 367–377, 2014.
[7] G. Almási, R. Bellofatto, J. Brunheroto, C. Caşcaval, J. G. Castanos, L. Ceze et al., "An overview of the Blue Gene/L system software organization," in Euro-Par 2003 Parallel Processing, Lecture Notes in Computer Science. Berlin, Heidelberg: Springer-Verlag, 2003, pp. 543–555.
[8] TOP500, TOP500 List – June 2020, 2020 (accessed September 1, 2020). [Online]. Available: https://www.top500.org/lists/2020/06/
[9] B. Yang, X. Ji, X. Ma, X. Wang, T. Zhang, X. Zhu, N. El-Sayed, H. Lan, Y. Yang, J. Zhai, W. Liu, and W. Xue, "End-to-end I/O monitoring on a leading supercomputer," in 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). Boston, MA: USENIX Association, Feb. 2019, pp. 379–394.
[10] S. Gorini, M. Chesi, and C. Ponti, "CSCS Site Update," http://opensfs.org/wp-content/uploads/2017/06/Wed11-GoriniStefano-LUG2017_20170601.pdf, 2017, [Online; Accessed 7-January-2020].
[11] Alliance for Computing at Extreme Scale Team, "Trinity Platform Introduction and Usage Model," https://www.lanl.gov/projects/trinity/_assets/docs/trinity-usage-model-presentation.pdf, 2015, [Online; Accessed 7-January-2020].
[12] S. Sugiyama and D. Wallace, "Cray DVS: Data Virtualization Service," 2008.
[13] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, "Understanding and improving computational science storage access through continuous characterization," in 27th Symposium on Mass Storage Systems and Technologies (MSST), 2011, pp. 1–14.
[14] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan, "Scalable I/O forwarding framework for high-performance computing systems," in 2009 IEEE International Conference on Cluster Computing and Workshops. New Orleans, LA, USA: IEEE, Aug 2009, pp. 1–10.
[15] F. Z. Boito, R. V. Kassick, P. O. A. Navaux, and Y. Denneulin, "AGIOS: Application-guided I/O scheduling for parallel file systems," in 2013 International Conference on Parallel and Distributed Systems (ICPADS). Seoul, South Korea: IEEE, Dec 2013, pp. 43–50.
[16] HPC IO Repository, "IOR," https://github.com/hpc/ior, 2020.
[17] Los Alamos National Laboratory, "MPI-IO Test Benchmark," http://freshmeat.sourceforge.net/projects/mpiiotest, 2008.
[18] O. J. Dunn, "Multiple comparisons among means," Journal of the American Statistical Association, pp. 52–64, 1961.
[19] W. H. Kruskal and W. A. Wallis, "Use of ranks in one-criterion variance analysis," Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[20] J. L. Bez, F. Z. Boito, L. M. Schnorr, P. O. A. Navaux, and J.-F. Méhaut, "TWINS: Server Access Coordination in the I/O Forwarding Layer," in 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). St. Petersburg, Russia: IEEE, March 2017, pp. 116–123.
[21] J. Yu, G. Liu, W. Dong, X. Li, J. Zhang, and F. Sun, "On the load imbalance problem of I/O forwarding layer in HPC systems," in 2017 3rd IEEE International Conference on Computer and Communications (ICCC). Chengdu, China: IEEE, 2017, pp. 2424–2428.
[22] V. Vishwanath, M. Hereld, V. Morozov, and M. E. Papka, "Topology-Aware Data Movement and Staging for I/O Acceleration on Blue Gene/P Supercomputing Systems," in International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC'11. New York, NY, USA: ACM, 2011.
[23] F. Isaila, J. Blas, J. Carretero, R. Latham, and R. Ross, "Design and evaluation of multiple-level data staging for Blue Gene systems," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 946–959, June 2011.
[24] X. Ji, B. Yang, T. Zhang, X. Ma, X. Zhu, X. Wang, N. El-Sayed, J. Zhai, W. Liu, and W. Xue, "Automatic, Application-Aware I/O Forwarding Resource Allocation," in Proceedings of the 17th USENIX Conference on File and Storage Technologies, ser. FAST'19. USA: USENIX Association, 2019, pp. 265–279.