Application Development Environment for

Kensuke Watanabe Takafumi Nose Kiyofumi Suzuki Shuichi Chiba

The objectives of developing the supercomputer Fugaku (hereafter, Fugaku) are to achieve a system with up to 100 times greater application performance than that of the and to ensure general versatility to support the operation of a broad range of applications. Accomplishing this required an integrated development environment that supports everything from the development to the execution of applications that extract the maximum performance from the hardware, but there were significant challenges in achieving this goal. First, it re- quired development optimized for the CPU used by Fugaku and support for various applications. Moreover, it required support for new hardware and increased memory usage due to the increase in the total number of processes in the message passing interface (MPI) communication library. In addition, it required improvements of the practical usability of application development sup- port tools. This article describes such initiatives for the application development environment.

features of the application development support tools, 1. Introduction and initiatives undertaken to improve performance. Supercomputer Fugaku (hereafter, Fugaku) [1] is equipped with the high performance “A64FX” CPU 2. Compiler that adopts and was developed with Arm Limited’s This section discusses the technological develop- instruction set architecture (hereafter, Arm architec- ment of the compiler for Fugaku. ture), features general versatility that supports various The compiler for Fugaku supports three lan- applications, and achieves massively parallel process- guages, Fortran/C/C++. The development objectives ing through the Tofu Interconnect D (hereafter, TofuD) were to achieve up to 100 times greater application described below. The development goal for Fugaku performance than that of the K computer for applica- was to achieve a system with up to 100 times greater tions relating to social and scientific priority issues that application performance than that of the K computer should be selectively undertaken by Fugaku (hereafter, [2]. Accomplishing this required an integrated devel- priority issue apps) and to enable the operation of vari- opment environment that supports everything from ous applications. In order to achieve these objectives, the development to the execution of applications that it is necessary to understand the features of the A64FX extract the maximum performance from the hardware. and to enhance the appropriate features. The following The developed Fugaku has demonstrated a high level two issues were extracted through cooperative design of performance by taking first place worldwide in the (co-design) with the applications. four categories of TOP500 [4], HPCG [5], HPL-AI [6], • Enhanced optimization to support the Arm and [7] in the high-performance comput- architecture ing (HPC) rankings announced at the International • Support for the increasing use of object-oriented Supercomputing Conference (ISC) 2020 [3]. languages in recent years and optimization for This article introduces the application development integer type operations environment compiler targeting Fugaku, the message The following subsections explain the initiatives passing interface (MPI) communication library, new with respect to these issues.

Fujitsu Technical Review 1 Application Development Environment for Supercomputer Fugaku

2.1 Supporting Arm architecture latter features a high rate of memory throughput. As we Promoting the application of vector instructions can see, applying loop fission and SWP even for kernels for running applications at high speed and the ap- with different characteristics can achieve a performance plication of optimizations to increase the parallelism improvement of approximately 20%, which significantly of instructions are essential for creating an advanced contributed to the goal of achieving 100 times greater compiler. application performance of the K computer. First, to promote the use of vector instructions, the Scalable Vector Extension (SVE), newly adopted 2.2 Supporting object-oriented languages by the Fugaku, was utilized. With SVE, it has become and integer type operations possible to use a predicate register to specify whether In order to broaden the range of users with or not to execute operations on each element of the Fugaku, the next objective became the ability to ex- vector instructions. As a result, this enables the vector- ecute applications from various fields in addition to ization of complex loops that include IF statements and conventional simulation programs. Therefore, it was makes it possible to execute at high speed. necessary to enhance the optimizations for integer- Next, software pipelining (SWP), which is a related operations and object-oriented programs. strength of the compiler, is important as an op- Accordingly, the Clang/LLVM [9] open source software timization for increasing the parallelism of instructions. (OSS) compiler was adopted, which has a proven record While the K computer has 128 vector registers, Fugaku of accelerating C/C++ language programs, as the base has a limited 32 registers. In order to effectively run and ported the optimization technologies for HPCs SWP, which requires many registers even on Fugaku, that Fujitsu has previously cultivated. In addition, the loop fission optimization was enhanced. Loop fission concurrent use of conventional functions was enabled is an optimization that considers the number of reg- to ensure compatibility with programs used with the K isters and the number of memory accesses for a loop computer and traditional programs to make it easy to with many statements and fissions the statements into port from the K computer. loops to which SWP can be applied. Figure 1 shows the evaluation results of the 3. Communication library physical process kernel and the dynamic process kernel MPI is the de facto standard for the communica- for NICAM [8], which is one of the priority issue apps. tion application programming interface (API) used with The former features loop structures with a lot of com- highly parallel applications in the HPC field, and the putational complexity and branch processing, while the performance of the communication library with MPI implemented significantly impacts the practical usabil- ity of an HPC system. With the K computer, Open MPI was used, which is an OSS that has implemented the

MPI standard, as the base and added unique improve- ments to obtain a high level of performance. Fugaku also inherited this library, but there were issues with the support for the new hardware and memory usage

Tie ati due to the increase in the number of processes. This section discusses the communication library initiative for Fugaku with respect to these issues. hsical cess enel naic cess enel 3.1 Supporting new hardware tiiatins alie tiiatins alie t cute t Fuau With the K computer, we adopted the Torus Fusion

The ati when the eecutin tie is set t tiiatins alie Interconnect (hereafter, Tofu) [10] which is a node in- t the cute terconnect with a six-dimensional mesh/torus topology that connects many processors in higher dimensions. Figure 1 Optimization effects in the NICAM kernel. The Fugaku adopted Tofu Interconnect D (hereafter,

2 Fujitsu Technical Review Application Development Environment for Supercomputer Fugaku

TofuD) [11] which, in comparison with Tofu, enhances significantly improved the communication bandwidth the number of simultaneous communications, fault tol- (Figure 2). erance, and barrier synchronous communication, etc. Moreover, the communication function used The TofuD installed in the Fugaku has an en- for barrier communication [12] acceleration was en- hanced communication speed of 6.8 GB/s compared to hanced in TofuD. By utilizing this enhancement, barrier the 5 GB/s of the Tofu used in the K computer. However, communication acceleration can also be applied to pat- because the ratio of improvement for communication terns in which communications occur due to multiple speed does not extend to that for computing perfor- processes within a node. Compared to the K computer, mance, the conventional collective communication the data length that the TofuD barrier communication algorithm, which performs data broadcasting and (hereafter, Tofu barrier communication) can be applied reduction operations involving the coordination of to has increased from one element to three elements each rank, cannot maximally utilize the computing (floating-point numbers) or six elements (integers). performance under those conditions. Because the Due to the significant acceleration compared to a soft- number of links capable of simultaneous communica- ware implementation, even in a range that exceeds tion was increased from four in the K computer to six these numbers of elements supported by the hardware, in the Fugaku, it was important to utilize this to refine there is a range in which further communication ac- the collective communication algorithm to fully utilize celeration can be achieved (Figure 3) beyond a pure the communication performance of the TofuD. In the software implementation by repeatedly calling the Tofu Fugaku, six instances of simultaneous communication barrier communication in the software. By adding an are transmitted via the five collective communication optional function that repeatedly executes Tofu barrier functions consisting of MPI_Bcast, MPI_Reduce, MPI_ communication to the MPI library, users can use opti- Allreduce, MPI_Allgather, and MPI_Alltoall. In particular, mization to increase the applicable data length without because the conventional MPI_Bcast algorithm for large modifying the program. message lengths is restricted to three instances of simultaneous communication, increasing that to six

ew alith n Fuau instances siultaneus cunicatin l alith n Fuau instances siultaneus cunicatin lith n cute

unicatin anwith s

essae lenth tes

Figure 2 Performance of dedicated collection communication algorithm supporting six directions (Bcast).

Fujitsu Technical Review 3 Application Development Environment for Supercomputer Fugaku

s

atenc

cute Tu aie cunicatin when thee is nl eleent

Fuau stwae ileentatin alicatin Fuau Tu aie cunicatin alicatin

ih see

ue eleents

Figure 3 Accelerating barrier communication (MPI_Allreduce).

3.2 Memory usage control To deal with this problem, the Fugaku MPI library In the K computer, one node is equipped with does not allocate memory for all of the communication 16 GiB of memory, and one process monopolizes the partners during initialization with the MPI_Init function, entire memory space. However, while the memory but instead dynamically allocates the memory when it capacity for one node has been increased to 32 GiB in first communicates with a communication partner. As a the Fugaku, it has been split into four pieces through a result, it does not increase simply depending on the total structure within the node called the core memory group number of processes, but is improved by being able to (CMG). Furthermore, because it is recommended to al- limit the allocation to the necessary minimum depending locate multiple processes to a separate CMG, Fugaku on the communication pattern. Moreover, by implement- shares the 32 GiB among the four processes. Therefore, ing a communication mode utilizing a TofuD function that the memory usage per process has decreased to 8 GiB. further reduces the memory consumption, it improves the In addition, when viewed at the scale of the number of flexibility with respect to applications that require a bal- processes for the entire Fugaku system, because there ance between process scale and memory consumption. are 150,000 nodes that can run four processes on each In this communication mode, the maximum memory node simultaneously, we get a total of approximately consumption for 27,648 nodes and 110,592 processes 600,000 processes, which is a significant increase from can be reduced by approximately 34%. the approximately 80,000 processes of the K computer. Generally speaking, because the management infor- 4. Application development support mation required within the MPI library also increases tools in proportion to the number of processes that are con- This section describes the application develop- nection partners of MPI communication, an increase in ment support tools developed for Fugaku. the overall system scale becomes a factor that places The application development support tools are a strain on memory capacity and significantly impacts composed from three types of functions according to application performance. their purpose.

4 Fujitsu Technical Review Application Development Environment for Supercomputer Fugaku

• An easy-to-understand display profiler that col- Report” that can perform efficient performance analysis lects information for performance tuning by using these components in a staged manner de- • Debugger for Parallel Applications for supporting pending on the situation. the investigation of events such as deadlocks and abnormal termination that occur in large-scale 4.2 Understanding performance trends with parallel processing Instant Performance Profiler • Integrated development environment that can The Instant Performance Profiler collects cost perform operations such as compiling and sub- distribution information for the application with a mitting jobs through a GUI sampling method to support an understanding of the The following subsections introduce the features outline of overall application performance by display- and improvements for Fugaku of the most frequently ing the information at a procedure, loop, and line level. used profiler. The hot spots where the costs are concentrated can be identified from the cost distribution. 4.1 Supporting Fugaku With the K computer’s profiler, there were requests A Fugaku-compatible profiler was developed by from users who wanted to obtain the collected informa- inheriting and expanding on the K computer’s pro- tion in an easy to process and general-purpose format for filer, which was highly regarded from the perspective creating graphs and statistical processing, etc. However, of practical usability. As shown in Figure 4, it consists in reality the data was only provided in text and CSV for- of an “Instant Performance Profiler,” an “Advanced mats, so supporting output formats with a high degree Performance Profiler,” and a “CPU Performance Analysis of general usability was an issue with Fugaku. In the

einnes an uses wh nt unestan the hihcst eins alicatins nestanin the eance tens the veall alicatin nstant F value e thuhut an eance la ialances etween theas etc ile entiin hihcst eins ses wh unestan at the ceue l an line levels the hihcst eins alicatins tatistical ata athein hihcst eins vance Tansit euenc e ein an eance the aveae tie euie ile unicatin euenc an the aveae essae lenth

etaile analsis hihcst eatin eins eance cle accuntin e an cache us nalsis Ret cnitins vectiatin cnitins instuctin i etc

F Flatinint eatins e secn

Figure 4 Performance analysis procedure.

Fujitsu Technical Review 5 Application Development Environment for Supercomputer Fugaku

Fugaku profiler, this issue was addressed by supporting information makes it possible to understand the opera- output in XML format in both the Instant Performance tion performance bottlenecks in detail. Profiler and the Advanced Performance Profiler. A function that is similar to the Fugaku CPU Performance Analysis Report was achieved in the K 4.3 Understanding detailed information computer with a precision PA visualization function in with Advanced Performance Profiler Excel format. However, the application had to be run and CPU Performance Analysis Report seven times to gather the information, making the low The Advanced Performance Profiler and CPU level of convenience an issue. Moreover, in the case of Performance Analysis Report are used to obtain more the PRIMEHPC FX100 commercial unit that applied the detailed performance information for hot spots identi- K computer technologies, the application had to be run fied from the results of the Instant Performance Profiler. 11 times. The Advanced Performance Profiler collects MPI With the Fugaku CPU performance analysis re- communication cost information for specific sections port, a report can be created after an application is run and CPU performance analysis information to enable 1 time, 5 times, 11 times, or 17 times, which means a detailed performance analysis of hot spots. The CPU that the volume of information increases with the num- Performance Analysis Report displays the information ber of runs. Therefore, the CPU Performance Analysis of Performance Monitoring Unit (PMU) counter, which Report can be used in a flexible manner to change the is built into the CPU and measures the CPU operating number of runs according to the required volume of in- conditions regarding operation performance, in a struc- formation while also utilizing the newly available types tured and easy-to-understand manner using graph of PMU counter information to achieve a more detailed and table formats (Figure 5). Analyzing PMU counter analysis.

Tale eessin ccle accuntin

eance ache iss ah eessin veview ate ccle accuntin

Figure 5 Example of a CPU Performance Analysis Report.

6 Fujitsu Technical Review Application Development Environment for Supercomputer Fugaku

[8] Abbreviation of Nonhydrostatic ICosahedral Atmospheric 5. Conclusion Model. Adopted as a priority issue application for Fugaku. In a dynamic process, it solves for the wind This article discussed the Fugaku compiler, MPI velocity, air temperature, etc. on a grid and features communication library, functions developed with the structured grid stencil calculations and a loop struc- application development support tools, and initiatives ture with high memory requirements. In a physical to improve performance. Starting with the K computer process, it solves for cloud phase changes and other and its commercial successors such as the PRIMEHPC phenomena and features a loop structure with a lot of FX10 and FX100 and continuing up to the Fugaku, computation and branch processing. [9] Clang/LLVM. changes were made to the architecture and the https://clang.llvm.org/ hardware, which required support for major software [10] Y. Ajima, et al.: “Tofu: A 6D Mesh/Torus Interconnect for changes every time. However, resolving these issues Exascale Computers.” IEEE Computer, Vol. 42, No. 11, by incorporating clever solutions in each software ver- pp. 36–40 (2009). sion significantly contributed to achieving the Fugaku [11] Y. Ajima et al.: The Tofu Interconnect D. IEEE International development objective of reaching 100 times greater Conference on Cluster Computing, pp. 646–654 (2018). [12] Communication for the synchronization of multiple application performance than that of the K computer. MPI processes. With the TofuD, it not only performs Fugaku has taken the top spot on the TOP500 and synchronization, but is also capable of simultaneously other categories at ISC 2020. Going forward, software performing small amounts of Reduce and Allreduce improvements will be essential in order to use Fugaku operations. and commercial units such as the PRIMEHPC FX700 and FX1000 as cutting-edge research and development Kensuke Watanabe platforms. While placing importance on user conve- Fujitsu Limited, Platform Software Business nience, the developers of Fugaku would like to provide Unit Mr. Watanabe is currently engaged in com- high-performance functions and tie those characteris- piler development for the supercomputer tics to innovative results in science and technology. Fugaku.

All company and product names mentioned herein are trademarks or registered trademarks of their respective owners.

Takafumi Nose References and Notes Fujitsu Limited, Platform Software Business Unit [1] The official name of the post-K computer decided by Mr. Nose is currently engaged in MPI com- RIKEN in May 2019. munication library development for the [2] The official supercomputer name decided by RIKEN in supercomputer Fugaku. July 2011. [3] ISC 2020. https://www.isc-hpc.com/ [4] Ranking which takes the execution performance of the LINPACK program for solving simultaneous linear equa- Kiyofumi Suzuki tions with a matrix calculation as an index and uses those Fujitsu Limited, Platform Software Business results to rank the top 500 fastest computer systems. Unit Mr. Suzuki is currently engaged in the de- [5] Benchmark test ranking that uses the conjugate gra- velopment of application development dient method, which is a computational method for support tools for the supercomputer Fugaku. solving simultaneous linear equations composed from a sparse coefficient matrix and is frequently used in actual applications. [6] Benchmark test ranking which improves on the LINPACK benchmark and implements it with low-precision operations. [7] Benchmark test ranking which involves the analysis of large-scale graphs, which indicate the relationships between data through peaks and branches.

Fujitsu Technical Review 7 Application Development Environment for Supercomputer Fugaku

Shuichi Chiba Fujitsu Limited, Platform Software Business Unit Mr. Chiba is a manager of software de- velopment for the supercomputer Fugaku application development environment.

This article first appeared in Fujitsu Technical Review, one of Fujitsu’s technical information media. Please check out the other articles.

Fujitsu Technical Review

https://www.fujitsu.com/global/technicalreview/

8 ©2020 FUJITSU LIMITED Fujitsu Technical Review