
RIKEN R-CCS Annual Report FY2018

Chapter 23

Flagship 2020 Project

23.1 Members

Only primary members are listed.

23.1.1 System Software Development Team

Yutaka Ishikawa (Team Leader)

Masamichi Takagi (Senior Scientist)

Atsushi Hori (Research Scientist)

Balazs Gerofi (Research Scientist)

Masayuki Hatanaka (Research & Development Scientist)

Takahiro Ogura (Research & Development Scientist)

Tatiana Martsinkevich (Postdoctoral Researcher)

Fumiyoshi Shoji (Research & Development Scientist)

Atsuya Uno (Research & Development Scientist)

Toshiyuki Tsukamoto (Research & Development Scientist)

23.1.2 Architecture Development

Mitsuhisa Sato (Team Leader)

Yuetsu Kodama (Senior Scientist)

Miwako Tsuji (Research Scientist)

Jinpil Lee (Postdoctoral Researcher)

Hitoshi Murai (Research Scientist)

Masahiro Nakao (Research Scientist)

Toshiyuki Imamura (Research Scientist)

Kentaro Sano (Research Scientist)

Tetsuya Odajima (Postdoctoral Researcher)


23.1.3 Application Development

Hirofumi Tomita (Team Leader)

Yoshifumi Nakamura (Research Scientist)

Hisashi Yashiro (Research Scientist)

Soichiro Suzuki (Research & Development Scientist)

Kazunori Mikami (Research & Development Scientist)

Kiyoshi Kumahata (Research & Development Scientist)

Mamiko Hata (Technical Staff I)

Hiroshi Ueda (Research Scientist)

Naoki Yoshioka (Research Scientist)

Yiyu Tan (Research Scientist)

23.1.4 Co-Design

Junichiro Makino (Team Leader)

Keigo Nitadori (Research Scientist)

Yutaka Maruyama (Research Scientist)

Masaki Iwasawa (Research Scientist)

Daisuke Namekata (Postdoctoral Researcher)

Kentaro Nomura (Research Associate)

Miyuki Tsubouchi (Technical Staff)

23.2 Project Overview

The Japanese government launched the FLAGSHIP 2020 project¹ in FY2014 with the following missions:

• building the Japanese national flagship supercomputer, the successor to the K computer, tentatively named the post K computer, and

• developing a wide range of HPC applications that will run on the post K computer in order to solve the pressing societal and scientific issues facing our country.

RIKEN is in charge of the co-design of the post K computer and the development of application codes in collaboration with the Priority Issue institutes selected by the Japanese government, as well as research aimed at facilitating the efficient utilization of the post K computer by a broad community of users. Under the co-design concept, RIKEN and the selected institutions have collaborated closely. The official name of the post K computer was decided in May 2019: it is now named Fugaku. As shown in Table 23.1, four development teams are working on Fugaku system development, together with the FLAGSHIP 2020 Planning and Coordination Office, which supports development activities. The primary members are listed in Section 23.1. The Architecture Development team designs the architecture of Fugaku in cooperation with Fujitsu and designs and develops a productive programming language, called XcalableMP (XMP), and its tuning tools. The team provides a cycle-level Fugaku processor simulator for performance tuning and verification. The team also specifies requirements for standard languages such as Fortran and C/C++ and for the mathematical libraries provided by Fujitsu.

¹ FLAGSHIP is an acronym for Future LAtency core-based General-purpose Supercomputer with HIgh Productivity.

Table 23.1: Development Teams

Team Name                      Team Leader
Architecture Development       Mitsuhisa Sato
System Software Development    Yutaka Ishikawa
Co-Design                      Junichiro Makino
Application Development        Hirofumi Tomita

The System Software Development team designs and specifies the system software stack, including MPI and file I/O middleware, for Fugaku in cooperation with Fujitsu, and designs and develops a multi-kernel for manycore architectures, Linux with a light-weight kernel (McKernel), that provides a noiseless runtime environment, extensibility, and adaptability to future application demands. The Co-Design team leads the joint optimization of architectural features and application codes in cooperation with the RIKEN teams and Fujitsu. It also designs and develops an application framework, FDPS (Framework for Developing Particle Simulator), to help HPC users implement advanced algorithms. The Application Development team represents the nine institutions tasked with solving the Priority Issues. The team identifies weaknesses of the target application codes in terms of performance and hardware resource utilization and discusses them with the RIKEN teams and Fujitsu to find the best combination of architectural features and application-code improvements. Published papers, presentations, and posters are summarized in the final section of this chapter.

23.3 Target of System Development and Achievements in FY2018

Fugaku's design targets are as follows:

• A speedup of up to one hundred times over the K computer for some target applications. This will be accomplished through co-design of the system and of the target applications for the nine Priority Issues.

• A maximum electric power consumption between 30 and 40 MW.

23.3.1 Programming Environment and System Software

In FY2018, the fourth phase of the detailed design was completed. The major components of the programming environment and system software are summarized as follows:

• Highly productive programming language, XcalableMP. XcalableMP (XMP) is a directive-based PGAS language for large-scale distributed-memory systems that combines an HPF-like data distribution concept with OpenMP-like directive descriptions. Two memory models are supported: global view and local view. The global view is supported by the PGAS feature, i.e., a large array is distributed into partial arrays across nodes; the local view is provided by one-sided data transfer with coarray notation. (A minimal global-view sketch follows this list.) In FY2018, we worked on XcalableMP 2.0, which newly supports task parallelism and the integration of PGAS models for distributed-memory environments. We are still working on a C++ front-end based on LLVM clang.

• Domain-specific library/language, FDPS. FDPS is a framework for the development of massively parallel particle simulations. Users only need to program the particle interactions and do not need to parallelize the code with the MPI library; FDPS adopts highly optimized communication algorithms, and its scalability has been confirmed on the K computer. In FY2018, we implemented several new algorithms to remove potential bottlenecks that can appear in runs with an extremely large number of MPI processes. We also worked on a high-performance implementation of the particle-particle interaction calculation on Fugaku.

• MPI + OpenMP programming environment. The current de facto standard programming environment, i.e., MPI + OpenMP, is supported. Two MPI implementations are being developed: Fujitsu continues to support its own MPI implementation based on Open MPI, and RIKEN is collaborating with ANL (Argonne National Laboratory) to develop MPICH for Fugaku. (A hybrid MPI + OpenMP sketch follows this list.) The achievements of our MPI implementation are described in Section 1.3.1.
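As a quick illustration of XMP's global-view model, the following is a minimal sketch in XMP's C directive syntax; the array name, sizes, and node count are our own hypothetical choices, not taken from the report. A block-distributed array is initialized and reduced with parallel loop directives.

```c
#define N 1024

/* Declare 4 executing nodes, a template of N elements,
   and distribute the template block-wise over the nodes. */
#pragma xmp nodes p[4]
#pragma xmp template t[N]
#pragma xmp distribute t[block] onto p

double a[N];
/* Align array a with the template: each node owns one block of a. */
#pragma xmp align a[i] with t[i]

#include <stdio.h>

int main(void)
{
  /* Each node initializes only the elements it owns. */
#pragma xmp loop on t[i]
  for (int i = 0; i < N; i++)
    a[i] = (double)i;

  double sum = 0.0;
  /* Parallel loop with a reduction combined across all nodes. */
#pragma xmp loop on t[i] reduction(+:sum)
  for (int i = 0; i < N; i++)
    sum += a[i];

  /* Only node 0 prints the result. */
#pragma xmp task on p[0]
  {
    printf("sum = %f\n", sum);
  }
  return 0;
}
```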
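The hybrid MPI + OpenMP style referred to above is the standard one; the following self-contained C sketch (the problem size and the computed quantity are arbitrary illustrations) combines OpenMP threading within each process with an MPI reduction across processes.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* Request thread support sufficient for OpenMP regions
       that do not themselves call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 20;
    double local = 0.0;

    /* Thread-level parallelism within the node via OpenMP. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < n; i += nprocs)
        local += 1.0 / (double)(i + 1);

    double global = 0.0;
    /* Process-level parallelism across nodes via MPI. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("partial harmonic sum = %.6f\n", global);

    MPI_Finalize();
    return 0;
}
```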

• New file I/O middleware. Fugaku does not employ file staging technology for its layered storage system; users do not need to specify in their job scripts which files must be staged in and staged out. The LLIO middleware, employing asynchronous I/O and caching technologies, has been designed by Fujitsu to provide transparent file access with better performance. LLIO was implemented, and its component-level tests were completed in FY2018.

• Application-oriented file I/O middleware. In scientific big-data applications, such as real-time weather prediction using observed meteorological data, a rapid data transfer mechanism between two jobs, e.g., ensemble simulations and data assimilation, is required to meet deadlines. In FY2018, a framework called the Data Transfer Framework (DTF), based on the PnetCDF file I/O library, which silently replaces file I/O with sending the data directly from one component to another over the network, was designed, and its prototype system was improved and evaluated. (A sketch of the unchanged PnetCDF write path that DTF intercepts follows this list.) The detailed achievements are described in Section 1.3.2.

• Process-in-Process. “Process-in-Process”, or “PiP” for short, is a user-level runtime system for sharing an address space among processes. Unlike the Linux process model, a group of PiP processes shares the address space, so a process context switch among those processes does not involve hardware TLB flushing. In FY2018, PiP was evaluated using SNAP, one of the CORAL benchmarks. The detailed achievements are described in Section 1.3.3.

• Multi-kernel for manycore architectures. A multi-kernel, Linux with a light-weight kernel (McKernel), is being designed and implemented. It provides: i) a noiseless execution environment for bulk-synchronous applications; ii) the ability to easily adapt to new/future system architectures, e.g., manycore CPUs, new process/thread management, memory management, heterogeneous core architectures, deep memory hierarchies, etc.; and iii) the ability to adapt to new/future application demands, such as big-data and in-situ applications that require optimization of data movement. In FY2018, McKernel was ported to an Arm architecture.
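Because DTF intercepts PnetCDF calls transparently, a producer job keeps its ordinary collective write path. The following minimal C sketch (the file name, variable name, and sizes are hypothetical) shows such a write; per the description above, under DTF the same unmodified calls would forward the data directly to the consumer job instead of to the file system.

```c
#include <mpi.h>
#include <pnetcdf.h>

#define NX 16  /* hypothetical per-process slice length */

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double buf[NX];
    for (int i = 0; i < NX; i++)
        buf[i] = rank * NX + i;  /* dummy "simulation output" */

    /* Standard PnetCDF collective write: define a 1-D variable
       spanning all processes, then write each process's slice. */
    ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)NX * nprocs, &dimid);
    ncmpi_def_var(ncid, "field", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    MPI_Offset start = (MPI_Offset)rank * NX, count = NX;
    ncmpi_put_vara_double_all(ncid, varid, &start, &count, buf);
    ncmpi_close(ncid);

    MPI_Finalize();
    return 0;
}
```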

It should be noted that these components are intended not only for Fugaku but also for other manycore-based supercomputers, such as Intel Xeon Phi systems.

23.3.2 Tools

The Architecture Development team is also conducting research on co-design tools, in addition to the design of Fugaku:

GEM5 processor simulator for the Fugaku processor. We have been developing a cycle-level processor simulator for the Fugaku processor based on gem5, a general-purpose processor simulator commonly used in processor architecture research. We enhanced the Armv8 gem5 simulator with the Scalable Vector Extension (SVE), upstreamed by Arm, adding the Fugaku processor's architecture parameters and more precise memory access mechanisms such as prefetching. This enables cycle-level performance evaluation of application kernels. We have been adjusting parameters and calibrating performance against Fujitsu's in-house processor simulator for more accurate performance evaluation. In FY2018, we began providing a “Fugaku performance evaluation environment” including this simulator for performance evaluation and tuning by potential Fugaku users. In addition, for research on more general Arm-based architecture exploration, we published a Docker file containing our Armv8-SVE gem5 simulator with “open” parameters and the Arm-SVE gcc on the Linaro Docker site.

Performance estimation tools for co-design study. We have developed tools for co-design studies of future huge-scale parallel systems. The MPI application replay tool investigates the performance and behavior of parallel applications on a single node using MPI traces. SCAMP (SCAlable Mpi Profiler) is another system, which simulates a large-scale network from a small number of profiling results. In FY2018, we implemented a pseudo MPI event trace file generator based on the analysis of LLVM intermediate representations. We compared the real and estimated execution times of a 2-D stencil code on the K computer, showing that simulation with pseudo trace files approximates simulation with real traces.

Study on performance metrics. We have been developing a new metric, called the Simplified Sustained System Performance (SSSP) metric, based on a suite of simple benchmarks, which enables performance projections that correlate with applications. In FY2018, we applied the SSSP metric to an Arm-based system, the HPE Apollo 70.

In addition to co-design tools, we are working on the evaluation of compilers for Arm SVE. There are two compilers for Arm SVE: the Fujitsu compiler and the Arm compiler. The Fujitsu compiler is a proprietary compiler supporting C/C++ and Fortran. The Arm compiler is developed by Arm based on LLVM; initially, LLVM supported only C and C++, with Fortran support added recently via flang. We are evaluating the quality of the code generated by both compilers in collaboration with Kyoto University. Since these compilers are still immature, we have provided feedback based on examination of the generated code.

23.3.3 Target Applications

The application development team discusses and improves the computational performance of applications on Fugaku, using the nine target applications.

Performance estimation of target applications. So far, we have improved the efficiency of the nine target applications in cooperation with the Priority Issue projects and Fujitsu. Because the system is a general-purpose machine, these applications are chosen to cover a broad range of computational characteristics. In this fiscal year, the system specifications were almost fixed. Through the application review meeting held once a month, we discussed, in cooperation with the compiler and system software development groups, how to write programs that draw out the potential of Fugaku within the power consumption limit. Using kernels extracted from each of the target applications, the computational performance of these kernels was evaluated on several compute nodes of the actual machine. Based on these data, the performance of the full applications on Fugaku was estimated using the speedup factor over the K computer as an index (a simple illustrative reading of this estimate is sketched below). As a result, the target performance is expected to be achieved for all target applications (https://postk-web.r-ccs.riken.jp/perf.html).
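One simple way to read “using the speedup factor over the K computer as an index” — our illustrative model, not the team's published methodology — is that measured per-kernel speedups scale the kernels' K computer times:

```latex
% Illustrative assumption: application time is dominated by the measured kernels.
% T_k^K : time of kernel k on the K computer
% s_k   : measured node-level speedup of kernel k on Fugaku
T_{\mathrm{app}}^{\mathrm{Fugaku}} \;\approx\; \sum_{k} \frac{T_k^{K}}{s_k},
\qquad
S_{\mathrm{app}} \;=\; \frac{T_{\mathrm{app}}^{K}}{T_{\mathrm{app}}^{\mathrm{Fugaku}}}.
```

Under this reading, the design target of Section 23.3 corresponds to S_app approaching one hundred for the best-case applications.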

23.4 International Collaborations

23.4.1 DOE/MEXT Collaboration

The following research topics were pursued under the DOE/MEXT collaboration MOU.

• Efficient MPI for exascale In this research collaboration, RIKEN and Argonne National Laboratory (ANL) have cooperatively developed the next version of the MPICH MPI implementation, which is developed mainly by ANL. The FY2018 achievements are described in Section 1.3.1.

• Metadata and active storage This research collaboration, carried out by the University of Tsukuba under contract, studies metadata management and active storage.

• Storage as a Service This research collaboration explores APIs for delivering specific storage service models. It is also carried out by the University of Tsukuba.

• Parallel I/O Libraries This research collaboration improves the PnetCDF parallel I/O software for extreme-scale computing facilities at both DOE and MEXT. To this end, the RIKEN side has designed DTF, as described in Section 1.3.2.

• OpenMP/XMP Runtime This research collaboration explores the interaction of Argobots/MPI with XcalableMP and PGAS models.

• LLVM for vectorization This research collaboration explores compiler techniques for vectorization in LLVM.

• Exascale co-design and performance modeling tools This collaboration develops application performance modeling tools for extreme-scale applications and a shared catalog of US/JP mini-apps.

• Power Monitoring and Control, and Power Steering This research collaboration explores APIs for monitoring, analyzing, and managing power from the node level to the whole machine, and evaluates power steering techniques for over-provisioned systems. This collaboration was carried out mainly by Dr. Kondo's group.

• FPGA for future architecture This research collaboration explores the use of FPGAs as accelerators for future high performance computing. This collaboration was carried out mainly by Dr. Sano's group.

23.4.2 CEA

RIKEN and CEA, the Commissariat à l'énergie atomique et aux énergies alternatives, signed an MOU in the fields of computational science and computer science concerning high performance computing in January 2017. Collaborations on the following topics are ongoing:

• Programming Language Environment

• Runtime Environment

• Energy-aware batch job scheduler

• Large DFT calculations and QM/MM

• Application of High Performance Computing to Earthquake Related Issues of Nuclear Power Plant Facilities

• Key Performance Indicators (KPIs)

• Human Resource and Training

23.5 Schedule and Future Plan

A prototype implementation and its evaluation were completed in FY2018. The hardware will be installed from the end of 2019, and the development of the system software, conducted by Fujitsu, will be finished in September 2020. Public operation of the service is expected to start in 2021. In FY2019, we will also continue to improve the performance of applications on the actual machine through co-design with the compiler and system software development, including improvements of application algorithms.

23.6 Publications

23.6.1 Presentations and Posters

[1] Tetsuya Odajima, Yuetsu Kodama, Mitsuhisa Sato: Power performance analysis of ARM scalable vector extension. COOL CHIPS 2018: 1-3.

[2] Tetsuya Odajima, Yuetsu Kodama, Miwako Tsuji, Mitsuhisa Sato: Performance and Power Consumption Analysis of ARM Scalable Vector Extension by using Gem5 Processor Simulator. ModSim 2018, Seattle, WA, USA, Aug. 2018.

[3] Tetsuya Odajima, Yuetsu Kodama, Mitsuhisa Sato: Performance and Power Consumption Analysis of ARM Scalable Vector Extension. Arm Research Summit 2018, Cambridge, UK, Sep. 2018.

[4] Kazunori Mikami, Kenji Ono, Jorji Nonaka: Performance evaluation and visualization of scientific applications using Pmlib. 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), Takayama, Japan, Nov. 2018.

[5] Hisashi Yashiro: Co-design of Post-K Priority Issue 4 Application. 3rd Research Result Meeting, Post-K Priority Issue 4, Yokohama, Japan, Mar. 2019.

23.6.2 Articles and Technical Papers

[1] Yuetsu Kodama, Tetsuya Odajima, Akira Asato, Mitsuhisa Sato: Evaluation of the RIKEN Post-K Processor Simulator. CoRR abs/1904.06451 (2019).

[2] Kazunori Mikami, Kenji Ono, Jorji Nonaka: Performance evaluation and visualization of scientific applications using Pmlib. 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), 243-249 (2018).