Application Development Environment for Supercomputer Fugaku

Application Development Environment for Supercomputer Fugaku Kensuke Watanabe Takafumi Nose Kiyofumi Suzuki Shuichi Chiba The objectives of developing the supercomputer Fugaku (hereafter, Fugaku) are to achieve a system with up to 100 times greater application performance than that of the K computer and to ensure general versatility to support the operation of a broad range of applications. Accomplishing this required an integrated development environment that supports everything from the development to the execution of applications that extract the maximum performance from the hardware, but there were significant challenges in achieving this goal. First, it required development optimized for the CPU used by Fugaku and support for various applications. Moreover, it required support for new hardware and increased memory usage due to the increase in the total number of processes in the message passing interface (MPI) communication library. In addition, it required improvements of the practical usability of application development support tools. This article describes such initiatives for the application development environment. features of the application development support tools, 1. Introduction and initiatives undertaken to improve performance. Supercomputer Fugaku (hereafter, Fugaku) [1] is equipped with the high performance “A64FX” CPU 2. Compiler that adopts and was developed with Arm Limited’s This section discusses the technological develop- instruction set architecture (hereafter, Arm architec- ment of the compiler for Fugaku. ture), features general versatility that supports various The compiler for Fugaku supports three lan- applications, and achieves massively parallel process- guages, Fortran/C/C++. The development objectives ing through the Tofu Interconnect D (hereafter, TofuD) were to achieve up to 100 times greater application described below. The development goal for Fugaku performance than that of the K computer for applica- was to achieve a system with up to 100 times greater tions relating to social and scientific priority issues that application performance than that of the K computer should be selectively undertaken by Fugaku (hereafter, [2]. Accomplishing this required an integrated devel- priority issue apps) and to enable the operation of vari- opment environment that supports everything from ous applications. In order to achieve these objectives, the development to the execution of applications that it is necessary to understand the features of the A64FX extract the maximum performance from the hardware. and to enhance the appropriate features. The following The developed Fugaku has demonstrated a high level two issues were extracted through cooperative design of performance by taking first place worldwide in the (co-design) with the applications. four categories of TOP500 [4], HPCG [5], HPL-AI [6], • Enhanced optimization to support the Arm and Graph500 [7] in the high-performance comput- architecture ing (HPC) rankings announced at the International • Support for the increasing use of object-oriented Supercomputing Conference (ISC) 2020 [3]. languages in recent years and optimization for This article introduces the application development integer type operations environment compiler targeting Fugaku, the message The following subsections explain the initiatives passing interface (MPI) communication library, new with respect to these issues. Fujitsu Technical Review 1 Application Development Environment for Supercomputer Fugaku 2.1 Supporting Arm architecture latter features a high rate of memory throughput. As we Promoting the application of vector instructions can see, applying loop fission and SWP even for kernels for running applications at high speed and the ap- with different characteristics can achieve a performance plication of optimizations to increase the parallelism improvement of approximately 20%, which significantly of instructions are essential for creating an advanced contributed to the goal of achieving 100 times greater compiler. application performance of the K computer. First, to promote the use of vector instructions, the Scalable Vector Extension (SVE), newly adopted 2.2 Supporting object-oriented languages by the Fugaku, was utilized. With SVE, it has become and integer type operations possible to use a predicate register to specify whether In order to broaden the range of users with or not to execute operations on each element of the Fugaku, the next objective became the ability to ex- vector instructions. As a result, this enables the vector- ecute applications from various fields in addition to ization of complex loops that include IF statements and conventional simulation programs. Therefore, it was makes it possible to execute at high speed. necessary to enhance the optimizations for integer- Next, software pipelining (SWP), which is a related operations and object-oriented programs. strength of the Fujitsu compiler, is important as an op- Accordingly, the Clang/LLVM [9] open source software timization for increasing the parallelism of instructions. (OSS) compiler was adopted, which has a proven record While the K computer has 128 vector registers, Fugaku of accelerating C/C++ language programs, as the base has a limited 32 registers. In order to effectively run and ported the optimization technologies for HPCs SWP, which requires many registers even on Fugaku, that Fujitsu has previously cultivated. In addition, the loop fission optimization was enhanced. Loop fission concurrent use of conventional functions was enabled is an optimization that considers the number of reg- to ensure compatibility with programs used with the K isters and the number of memory accesses for a loop computer and traditional programs to make it easy to with many statements and fissions the statements into port from the K computer. loops to which SWP can be applied. Figure 1 shows the evaluation results of the 3. Communication library physical process kernel and the dynamic process kernel MPI is the de facto standard for the communica- for NICAM [8], which is one of the priority issue apps. tion application programming interface (API) used with The former features loop structures with a lot of com- highly parallel applications in the HPC field, and the putational complexity and branch processing, while the performance of the communication library with MPI implemented significantly impacts the practical usability of an HPC system. With the K computer, Open MPI 1.0 was used, which is an OSS that has implemented the 0.8 MPI standard, as the base and added unique improvements to obtain a high level of performance. Fugaku 0.6 also inherited this library, but there were issues with the support for the new hardware and memory usage 0.4 Tie ati due to the increase in the number of processes. 0.2 This section discusses the communication library initiative for Fugaku with respect to these issues. 0.0 hsical cess enel naic cess enel 3.1 Supporting new hardware Optiiatins alie Optiiatins alie t cute t Fuau With the K computer, we adopted the Torus Fusion The ati when the eecutin tie is set t tiiatins alie Interconnect (hereafter, Tofu) [10] which is a node in- t the cute terconnect with a six-dimensional mesh/torus topology that connects many processors in higher dimensions. Figure 1 Optimization effects in the NICAM kernel. The Fugaku adopted Tofu Interconnect D (hereafter, 2 Fujitsu Technical Review Application Development Environment for Supercomputer Fugaku TofuD) [11] which, in comparison with Tofu, enhances significantly improved the communication bandwidth the number of simultaneous communications, fault tol- (Figure 2). erance, and barrier synchronous communication, etc. Moreover, the communication function used The TofuD installed in the Fugaku has an en- for barrier communication [12] acceleration was enhanced communication speed of 6.8 GB/s compared to hanced in TofuD. By utilizing this enhancement, barrier the 5 GB/s of the Tofu used in the K computer. However, communication acceleration can also be applied to pat- because the ratio of improvement for communication terns in which communications occur due to multiple speed does not extend to that for computing perfor- processes within a node. Compared to the K computer, mance, the conventional collective communication the data length that the TofuD barrier communication algorithm, which performs data broadcasting and (hereafter, Tofu barrier communication) can be applied reduction operations involving the coordination of to has increased from one element to three elements each rank, cannot maximally utilize the computing (floating-point numbers) or six elements (integers). performance under those conditions. Because the Due to the significant acceleration compared to a soft- number of links capable of simultaneous communica- ware implementation, even in a range that exceeds tion was increased from four in the K computer to six these numbers of elements supported by the hardware, in the Fugaku, it was important to utilize this to refine there is a range in which further communication ac- the collective communication algorithm to fully utilize celeration can be achieved (Figure 3) beyond a pure the communication performance of the TofuD. In the software implementation by repeatedly calling the Tofu Fugaku, six instances of simultaneous communication barrier communication in the software. By adding an are transmitted via the five collective communication optional function that repeatedly executes Tofu barrier functions consisting

Application Development Environment for Supercomputer Fugaku

File System and Power Management Enhanced for Supercomputer Fugaku

Supercomputer Fugaku

Management Direction Update

Integrated Report 2019 the Fujitsu Way CONTENTS

Performance Tuning of Graph500 Benchmark on Supercomputer Fugaku

The Blue Gene/Q Compute Chip

MT-Lib: a Topology-Aware Message Transfer Library for Graph500 on Supercomputers Xinbiao Gan

World's Fastest Computer

Supercomputers – Prestige Objects Or Crucial Tools for Science and Industry?

An Analysis of System Balance and Architectural Trends Based on Top500 Supercomputers

Looking Back on Supercomputer Fugaku Development Project

Scalable Graph Traversal on Sunway Taihulight with Ten Million Cores