Towards On-Demand I/O Forwarding in HPC Platforms Jean Luca Bez, Francieli Zanon Boito, Alberto Miranda, Ramon Nou, Toni Cortes, Philippe Navaux

Total Page:16

File Type:pdf, Size:1020Kb

Towards On-Demand I/O Forwarding in HPC Platforms Jean Luca Bez, Francieli Zanon Boito, Alberto Miranda, Ramon Nou, Toni Cortes, Philippe Navaux Towards On-Demand I/O Forwarding in HPC Platforms Jean Luca Bez, Francieli Zanon Boito, Alberto Miranda, Ramon Nou, Toni Cortes, Philippe Navaux To cite this version: Jean Luca Bez, Francieli Zanon Boito, Alberto Miranda, Ramon Nou, Toni Cortes, et al.. Towards On-Demand I/O Forwarding in HPC Platforms. PDSW 2020 : 5th IEEE/ACM International Parallel Data Systems Workshop, Nov 2020, Atlanta, Georgia / Virtual, United States. hal-03150024 HAL Id: hal-03150024 https://hal.inria.fr/hal-03150024 Submitted on 23 Feb 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Towards On-Demand I/O Forwarding in HPC Platforms Jean Luca Bez1, Francieli Z. Boito2, Alberto Miranda3, Ramon Nou3, Toni Cortes3,4, Philippe O. A. Navaux1 1Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS) — Porto Alegre, Brazil 2LaBRI, University of Bordeaux, Inria, CNRS, Bordeaux-INP – Bordeaux, France 3Barcelona Supercomputing Center (BSC) — Barcelona, Spain 4Polytechnic University of Catalonia — Barcelona, Spain fjean.bez, [email protected], [email protected], falberto.miranda,[email protected], [email protected] Abstract—I/O forwarding is an established and widely-adopted accommodate larger applications and more concurrent jobs, technique in HPC to reduce contention and improve I/O the PFS is not able to keep providing performance due to performance in the access to shared storage infrastructure. increasing contention and interference [4], [5], [6]. On such machines, this layer is often physically deployed on I/O forwarding [7] is a technique that seeks to mitigate dedicated nodes, and their connection to the clients is static. this contention by reducing the number of compute nodes Furthermore, the increasingly heterogeneous workloads enter- concurrently accessing the PFS servers. Rather than having ing HPC installations stress the I/O stack, requiring tuning applications directly access the PFS, the forwarding tech- and reconfiguration based on the applications’ characteristics.t nique defines a set of I/O nodes that are responsible for Nonetheless, it is not always feasible in a production system receiving application requests and forward them to the PFS, to explore the potential benefits of this layer under different thus allowing the application of optimization techniques configurations without impacting clients. In this paper, we such as request scheduling and aggregation. Furthermore, investigate the effects of I/O forwarding on performance by I/O forwarding is file system agnostic and is transparently considering the application’s I/O access patterns and system applied to applications. Due to these benefits, the technique characteristics. We aim to explore when forwarding is the best is used by TOP500 machines [8], as showed in Table 1. choice for an application, how many I/O nodes it would benefit On such machines, I/O nodes are often physically de- from, and whether not using forwarding at all might be the ployed on special nodes with dedicated hardware, and their correct decision. To gather performance metrics, explore, and connection to the clients is static. Thus, a subset of compute understand the impact of forwarding I/O requests of different nodes will only forward their requests to a single fixed I/O access patterns, we implemented FORGE, a lightweight I/O node. Typically, in this setup, all applications are forced to forwarding layer in user-space. Using FORGE, we evaluated use forwarding when it is deployed in the machine. the optimal forwarding configurations for several access pat- In this paper, we seek to investigate the impact of terns on MareNostrum 4 (Spain) and Santos Dumont (Brazil) I/O forwarding on performance, taking into account the supercomputers. Our results demonstrate that shifting the application’s I/O demands (i.e., their access patterns) and focus from a static system-wide deployment to an on-demand the overall system characteristics. We seek to explore when reconfigurable I/O forwarding layer dictated by application forwarding is the best choice (and how many I/O nodes an demands can improve I/O performance on future machines. application would benefit from), and when not using it is the most suitable alternative. We hope that the design of the forwarding layer in future machines will build upon this 1. Introduction knowledge, leading towards a more re-configurable I/O for- warding layer. The Sunway TaihuLight [9] and Tianhe-2A Scientific applications impose ever-growing performance requirements to the High-Performance Computing (HPC) field. Increasingly heterogeneous workloads are entering TABLE 1. HIGHEST RAKED TOP 500 MACHINES THAT IMPLEMENT HPC installations, from the traditionally compute-bound THE I/O FORWARDING TECHNIQUES (JUNE 2020). scientific simulations to Machine Learning applications and I/O bound Big Data workflows. While HPC clusters typi- Compute I/O Rank Supercomputer cally rely on a shared storage infrastructure powered by a Nodes Nodes Parallel File System (PFS) such as Lustre [1], GPFS [2], or 4 Sunway TaihuLight [9] 40; 960 240 Panasas [3], the increasing I/O demands of applications from 5 Tianhe-2A [6] 16; 000 256 fundamentally distinct domains stress this shared infrastruc- 10 Piz Daint [10] 6; 751 54 ture. As systems grow in the number of compute nodes to 11 Trinity [11] 19; 420 576 [6] already provide an initial support towards this plasticity. Compute Node Compute Node Compute Node Compute Node Due to the physically static deployment of current I/O forwarding infrastructures, it is usually not possible to run 2 6 3 7 4 8 5 9 experiments with varying numbers of I/O nodes. Further- 10 14 11 15 12 16 13 16 more, any reconfiguration of the forwarding layer typically FORGE CN FORGE CN FORGE CN FORGE CN requires acquiring, installing, and configuring new hardware. Thus, most machines are not easily re-configurable, and, in Compute FORGE FORGE Compute fact, end-users are not allowed to make modifications to 0 1 this layer to prevent impacting a production system. We Node ION ION Node argue that a research/exploration alternative is required to enable obtaining an overview of the impact that different I/O MPI Rank Parallel File System access patterns might have under different I/O forwarding Figure 1. Overview of the architecture used by FORGE to implement the configurations. We seek a portable, simple-to-deploy solu- I/O forwarding technique at user-space. I/O nodes and computes nodes are tion, capable of covering different deployments and access distributed according to the hostfile. patterns. Consequently, we implement a lightweight I/O forwarding layer named FORGE that can be deployed as a user-level job to gather performance metrics and aid in size) derived from Darshan [13]. As mentioned, it can be understanding the impact of forwarding in an HPC system. submitted to HPC queues as a normal user-space job to Our evaluation was conducted in two different supercom- provide insights regarding the I/O node’s behavior on the puters: the MareNostrum 4 at Barcelona Supercomputing platform. FORGE is open-source and can be downloaded Center (BSC), in Spain, and the Santos Dumont at the from https://github.com/jeanbez/forge. National Laboratory for Scientific Computation (LNCC), in We designed FORGE based on the general concepts Brazil. We demonstrate that the number of I/O nodes for 189 applied in I/O forwarding tools such as IOFSL [14], IBM scenarios covering distinct access patterns is different. For CIOD [7], and Cray DVS [12]. The common underlying idea 90:5%, using forwarding would indeed be the correct course present in these solutions is that clients emit I/O requests of action. But using two I/O nodes, for instance, would only which are then intercepted and forwarded to one I/O node bring performance improvements for 44% of the scenarios. in the forwarding layer. This I/O node will receive requests The rest of this paper is organized as follows. Section 2 from multiple processes of a given application (or even presents FORGE, our user-level implementation of the I/O many applications), schedule those requests, and reshape the forwarding technique. Our experimental methodology, the I/O flow by reordering and aggregating the requests to better platforms and access patterns, are described in Section 3. We suit the underlying PFS. The requests are then dispatched also present and discuss our results in that section. Finally, to the back-end PFS, which executes them and replies to we conclude this paper in Section 5 and discuss future work. the I/O node. Only then, the I/O node responds to the client (with the requested data, if that is the case) whether the 2. FORGE: The I/O Forwarding Explorer operation was successful. I/O nodes typically have a pool of threads to handle multiple incoming requests and dispatch Existing I/O forwarding solutions in production today, them to the PFS concurrently. such as IBM’s CIOD [7] or Cray DVS [12] require dedicated FORGE was built as an MPI application where M ranks hardware for I/O nodes and/or a restrictive software stack. are split into two subsets: compute nodes and I/O nodes. They are, hence, not a straightforward solution to explore The first N processes act as I/O forwarding servers, and and evaluate I/O forwarding in machines that do not yet each should be placed exclusively on a compute node.
Recommended publications
  • Barcelona Supercomputing Center Puts Beauty of HPC on Display
    CASE STUDY High Performance Computing (HPC) Beauty and High Performance Computing Under Glass at Barcelona Supercomputing Center Intel® Omni-Path Architecture and Intel® Xeon® Scalable Processor Family enable breakthrough European science using 13.7 petaFLOPS MareNostrum 4 Challenge MareNostrum 4 In publicly and privately funded computational science research, dollars (or • Spain’s 13.7 petaFLOPS Euros in this case) follow FLOPS. And when you’re one of the leading computing supercomputer contributes centers in Europe with a reputation around the world of highly reliable, leading- to the Partnership for edge technology resources, you look for the best in supercomputing in order to Advanced Computing in continue supporting breakthrough research. Thus, Barcelona Supercomputing Europe (PRACE) and supports Center (BSC), or Centro Nacional de Supercomputación, is driven to build leading the Spanish Supercomputing supercomputing clusters for its research clients in the public and private sectors. Network (RES) “We have the privilege of users coming back to us each year to run their projects,” said Sergi Girona, BSC’s Operations Department Director. “They return because we • 3,456 nodes of Intel® Xeon® reliably provide the technology and services they need year after year, and because Scalable Processors, plus our systems are of the highest level.” 0.5 petaFLOPS cluster of Intel® Xeon Phi™ Processors Supported by the Spanish and Catalan governments and funded by the Ministry (codenamed Knights Landing of Economy and Competitiveness with €34 million in 2015, BSC sought to take its MareNostrum 3 system to the next-generation of computing capabilities. It and Knights Hill) specified multiple clusters for both general computational needs of ongoing • Intel® Omni-Path Architecture research, and for development of next-generation codes for many integrated core interconnects multiple (MIC) supercomputing and tools for the Exascale computing era.
    [Show full text]
  • TOP500 Supercomputer Sites
    7/24/2018 News | TOP500 Supercomputer Sites HOME | SEARCH | REGISTER RSS | MY ACCOUNT | EMBED RSS | SUPER RSS | Contact Us | News | TOP500 Supercomputer Sites http://top500.org/blog/category/feature-article/feeds/rss Are you the publisher? Claim or contact Browsing the Latest Browse All Articles (217 Live us about this channel Snapshot Articles) Browser Embed this Channel Description: content in your HTML TOP500 News Search Report adult content: 04/27/18--03:14: UK Commits a 0 0 Billion Pounds to AI Development click to rate The British government and the private sector are investing close to £1 billion Account: (login) pounds to boost the country’s artificial intelligence sector. The investment, which was announced on Thursday, is part of a wide-ranging strategy to make the UK a global leader in AI and big data. More Channels Under the investment, known as the “AI Sector Deal,” government, industry, and academia will contribute £603 million in new funding, adding to the £342 million already allocated in existing budgets. That brings the grand total to Showcase £945 million, or about $1.3 billion at the current exchange rate. The UK RSS Channel Showcase 1586818 government is also looking to increase R&D spending across all disciplines by 2.4 percent, while also raising the R&D tax credit from 11 to 12 percent. This is RSS Channel Showcase 2022206 part of a broader commitment to raise government spending in this area from RSS Channel Showcase 8083573 around £9.5 billion in 2016 to £12.5 billion in 2021. RSS Channel Showcase 1992889 The UK government policy paper that describes the sector deal meanders quite a bit, describing a lot of programs and initiatives that intersect with the AI investments, but are otherwise free-standing.
    [Show full text]
  • Interconnect Your Future Enabling the Best Datacenter Return on Investment
    Interconnect Your Future Enabling the Best Datacenter Return on Investment TOP500 Supercomputers, November 2015 InfiniBand FDR and EDR Continue Growth and Leadership The Most Used Interconnect On The TOP500 Supercomputing List EDR InfiniBand Now Used in 4 Systems on the List, with More EDR Systems Being Deployed by End of 2015 Will Result in More Than 2X the Number of EDR Systems Worldwide Compared to June 2015 . Mellanox Delivers Up To 99.8% System Efficiency with InfiniBand, 4X higher versus Ethernet • Delivering highest datacenter return on investment . For the first time many Chinese-based Ethernet Cloud/Web2.0 systems submitted and ranked on the list • Altering the trends perspective of high-performance computing systems on the TOP500 for Ethernet/InfiniBand connectivity © 2015 Mellanox Technologies 2 TOP500 Performance Trends 76% CAGR . Explosive high-performance computing market growth . Clusters continue to dominate with 85% of the TOP500 list Mellanox InfiniBand Solutions Provide the Highest Systems Utilization on the TOP500 © 2015 Mellanox Technologies 3 TOP500 Major Highlights . InfiniBand is the most used interconnect on the TOP500 list, with 237 systems, or 47.4% of the list . FDR InfiniBand is the most used technology on the TOP500 list, connecting 163 systems • 16% increase from Nov’14 to Nov’15 . EDR InfiniBand now used in 4 systems on the List, with more EDR systems being deployed by end of 2015 • More than 2X the number of EDR systems expected compared to July 2015 . InfiniBand enables the most efficient system on the list with 99.8% efficiency – record! • Delivering highest datacenter Return on Investment, 4X higher than 10GbE, 33% higher than Cray interconnect • InfiniBand enables the top 9 most efficient systems on the list, and 28 of the TOP30 .
    [Show full text]
  • BSC Roadmap for HPC in Europe
    BSC roadmap for HPC in Europe Prof. Mateo Valero Director, Barcelona Supercomputing Center Some societal challenges • Ageing population • Climate change • Cybersecurity • Increasing energy needs • Intensifying global competition Images courtesy of The PRACE Scientific Steering Committee, “The Scientific Case for Computing in Europe 2018-2026” Top10, Nov 2018 Rmax Rpeak GFlops Rank Name Site Manufacturer Country Cores Accelerators [TFlop/s] [TFlop/s] /Watts DOE/SC/Oak Ridge 1 Summit IBM US 2,397,824 2,196,480 143,500 200,795 14.67 National Laboratory DOE/NNSA/Lawrence 2 Sierra IBM/NVIDIA US 1,572,480 1,382,400 94,640 125,712 12.72 Livermore National Lab. Sunway National Supercomputing 3 NRCPC China 10,649,600 93,015 125,436 6.05 TaihuLight Center in Wuxi National Super Computer 4 Tianhe-2ª NUDT China 4,981,760 4,554,752 61,445 100,679 3.33 Center in Guangzhou Swiss National 5 Piz Daint Cray Inc. Switz 387,872 319,424 21,230 27.154.3 8.90 Supercomputing Centre 6 Trinity DOE/NNSA/LANL/SNL Cray Inc. US 979,072 20,158,7 41,461 2.66 AI Bridging National Inst. of Adv 7 Fujitsu Japan 391,680 348,160 19,880 32,577 12.05 Cloud Inf. Industrial Science & Tech. German 8 SuperMUC-NG Leibniz Rechenzentrum Lenovo 305,856 19,477 28,872,86 y DOE/SC/Oak Ridge 9 Titan Cray Inc. US 560,640 261,632 17,590 27,113 2.14 National Laboratory DOE/NNSA/Lawrence 10 Sequoia IBM US 1,572,864 17,173 20,133 2.18 Livermore National Lab.
    [Show full text]
  • European Processor Initiative & RISC-V
    European Processor Initiative & RISC-V Prof. Mateo Valero BSC Director 9/May/2018 RISC-V Workshop, Barcelona Barcelona Supercomputing Center Centro Nacional de Supercomputación BSC-CNS objectives Supercomputing services R&D in Computer, PhD programme, to Spanish and Life, Earth and technology transfer, EU researchers Engineering Sciences public engagement Spanish Government 60% BSC-CNS is a consortium Catalan Government 30% that includes Univ. Politècnica de Catalunya (UPC) 10% People and Resources Data as of 31st of December 2017 Context: The international Exascale challenge • Sustained real life application performances, not just Linpack… • Exascale will not just allow present solutions to run faster, but will enable new solutions not affordable with today HPC technology • From simulation to high predictability for precise medicine, energy, climate change, autonomous driven vehicles… • The International context (US, China, Japan and EU…) • The European HPC programme • The European Processor initiative • BSC role Top5 November 2017 list (compact view) Rank Site Computer Procs Rmax Rpeak Mflops/Watt 1 Wuxi, China Sunway SW26010 260C 10.649.600 93.015 125.436 6.051 3.120.000 2 Guangzhou, China Xeon E5-2692+Phi 33.863 54.902 1.902 2736000 361.760 3 CSCS, Switzerland Cray XC50, Xeon E2690 12C+P100 19.590 25.326 8.622 297920 19860000 4 JAMEST, Japan ZettaScaler Xeon D-1571, PEZY-SC2 19.135 28.192 14.173 19840000 560.640 5 DOE/SC/Oak Ridge, US Cray XK7, Opteron 16C+K20 17.590 27.113 2.143 261632 6 DOE/NNSA/LLNL, US BlueGene/Q, BQC 16C 1.572.864
    [Show full text]
  • Artificial Intelligence Understanding Its Applications, Value and Where to Begin
    Intelligent Transformation in the Data Centre Sumir Bhatia | 16 September 2018 President, Asia Pacific – Lenovo Data Canter Group @Sumir_Bhatia 2018 Lenovo Internal. All rights reserved. The Reality We All Face Cloud Transformation Private, hybrid, and public models offer on-demand delivery of resources and applications. Cybersecurity Business impact of security incidents and regulatory compliance. Performance data & analytics Speed-of-insight, actionable business intelligence. Customer Experience Deliver real customer benefit through reliability and trusted partnerships. Artificial Intelligence Understanding its applications, value and where to begin. 2018 Lenovo Internal. All rights reserved. 2018 Lenovo. All rights reserved. Data Centre Group Be the most trusted data centre partner – empowering customers’ Intelligent Transformation and solving humanity’s greatest challenges. INNOVATION FOCUS N O L E G A C Y DEEP PARTNERSHIPS #1 #1 #1 #1 88 #26 Customer TOP500 World Record X86 Server SAP Hana Gartner global Satisfaction Supercomputing & benchmarks reliability performance supply chain TBR - Lenovo's 5 yrs. running fastest growing HPC overall satisfaction company again surpassed scores for Dell 2,000 EMC and HPE 100 20M+ 25 Developers across 7 Years in the Servers per hour Servers shipped global labs data center 5 10,000 #1 6 Nationalities on Service TOP500 Owned/controlled 2018 Lenovo Internal. All rights reserved. DCG SLT professionals Supercomputing manufacturing sites AI & HPC The future is here today 2018 Lenovo Internal. All rights reserved. 4 $1.2 Trillion The expected amount of business that more agile AI-enabled companies will take from less informed peers. 2018 Lenovo Internal. All rights reserved. 5 Beyond The Buzzwords – What is AI? Artificial Intelligence Machines performing tasks in ways that are intelligent.
    [Show full text]
  • SUSE and Lenovo Deliver Some of the World's Most Powerful
    White Paper SUSE Linux Enterprise Server for High Performance Computing SUSE Enterprise Storage SUSE and Lenovo Deliver Some of the World’s Most Powerful Supercomputers Table of Contents page Put the Power of High-Performance Computing to Work for Your Business...........................................................................................................2 The Changing Face of HPC ..............................................................................................................2 Why SUSE and Lenovo? ..........................................................................................................................3 Get the Best of HPC without the Hassle .............................................................3 Components of the SUSE-Lenovo Solution ................................................3 Make the Most of HPC with SUSE and Lenovo .....................................5 White Paper SUSE and Lenovo Deliver Some of the World’s Most Powerful Supercomputers Put the Power of High- Performance Computing to Work for Your Business The demand for high-performance computing (HPC) has grown beyond its traditional home of research institutions and government labs. Today, enterprises of all sizes are realizing the value of HPC for a variety of tasks, such as financial forecasting, physical simulations, fraud detection, business intelligence, engineering, marketing and personalized medicine—and the most common use, data analytics. With an HPC infrastructure, enterprises more cost-effective for enterprises, and
    [Show full text]
  • Infiniband Strengthens Leadership As the Interconnect of Choice by Providing Best Return on Investment
    InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, July 2015 TOP500 Performance Trends 76% CAGR . Explosive high-performance computing market growth . Clusters continue to dominate with 87% of the TOP500 list Mellanox InfiniBand solutions provide the highest systems utilization in the TOP500 for both high-performance computing and clouds © 2015 Mellanox Technologies 2 TOP500 Major Highlights . InfiniBand is the de-facto Interconnect solution for High-Performance Computing • Positioned to continue and expand in Cloud and Web 2.0 . InfiniBand connects the majority of the systems on the TOP500 list, with 257 systems • Increasing 15.8% year-over-year, from 222 system in June’14 to 257 in June’15 . FDR InfiniBand is the most used technology on the TOP500 list, connecting 156 systems . EDR InfiniBand has entered the list with 3 systems . InfiniBand enables the most efficient system on the list with 99.8% efficiency – record! . InfiniBand enables the top 17 most efficient systems on the list . InfiniBand is the most used interconnect for Petascale systems with 33 systems © 2015 Mellanox Technologies 3 InfiniBand in the TOP500 . InfiniBand is the most used interconnect technology for high-performance computing • InfiniBand accelerates 257 systems, 51.4% of the list . FDR InfiniBand connects the fastest InfiniBand systems • TACC (#8), NASA (#11), Government (#13, #14), Tulip Trading (#15), EDRC (#16), Eni (#17), AFRL (#19), LRZ (#20) . InfiniBand connects the most powerful
    [Show full text]