System Software for Armv8-A with SVE

Yutaka Ishikawa, Leader of FLAGSHIP2020 Project, RIKEN Center for Computational Science

9:00-9:25, 14th of January, 2019
Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China

Background: Flagship2020

• Missions
  • Building the Japanese national flagship supercomputer, post K, and
  • Developing a wide range of HPC applications, running on post K, in order to solve social and scientific issues in Japan
• Project organization
  • Post K Computer development
    • RIKEN AICS is in charge of development
    • Fujitsu is the vendor partner
  • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  • Applications
    • The government selected 9 social & scientific priority issues and 4 exploratory issues, and their R&D organizations

2019/1/14 RIKEN Center for Computational Science

Target Applications (program: brief description)

① GENESIS: MD for proteins
② Genomon: Genome processing (Genome alignment)
③ GAMERA: Earthquake simulator (FEM in unstructured & structured grid)
④ NICAM+LETK: Weather prediction system using Big data (structured grid stencil & ensemble Kalman filter)
⑤ NTChem: molecular electronic (structure calculation)
⑥ FFB: Large Eddy Simulation (unstructured grid)
⑦ RSDFT: an ab-initio program (density functional theory)
⑧ Adventure: Computational Mechanics System for Large Scale Analysis and Design (unstructured grid)
⑨ CCS-QCD: Lattice QCD simulation (structured grid Monte Carlo)

Background: Post-K CPU A64FX

Architecture: Armv8.2-A SVE (512 bit SIMD)
Core: 48 cores for compute and 2/4 for OS activities
      DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
Cache: L1D: 64 KiB, 4 way, 230 GB/s (load), 115 GB/s (store)
       L2: 8 MiB, 16 way, 115 GB/s (load), 57 GB/s (store)
Memory: HBM2 32 GiB, 1024 GB/s
Interconnect: TofuD (28 Gbps x 2 lane x 10 port)
I/O: PCIe Gen3 x 16 lane
Technology: 7nm FinFET

CMG: CPU Memory Group; NOC: Network On Chip
Courtesy of FUJITSU LIMITED
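The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted on the slide (the peak rates and bandwidths above, plus the measured Stream triad and Dgemm figures); the "+" lower bounds are treated as exact values, so every result is approximate. The dictionary keys and the function name are illustrative choices, not anything defined by Fujitsu or RIKEN.

```python
# Inputs are the figures quoted on the slide; "+" lower bounds are taken
# as exact values, so every derived number is only an approximation.
SPEC = {
    "peak_dp_tflops": 2.7,
    "peak_sp_tflops": 5.4,
    "peak_hp_tflops": 10.8,
    "hbm2_gb_per_s": 1024.0,
    "stream_triad_gb_per_s": 830.0,
    "dgemm_tflops": 2.5,
    "tofud_gbps_per_lane": 28,
    "tofud_lanes": 2,
    "tofud_ports": 10,
}


def derived(spec):
    """Return ratios implied by the quoted figures (hypothetical helper)."""
    peak_dp_gflops = spec["peak_dp_tflops"] * 1000.0
    per_port = spec["tofud_gbps_per_lane"] * spec["tofud_lanes"]
    return {
        # Halving the element width doubles the lanes per 512-bit vector.
        "sp_over_dp": spec["peak_sp_tflops"] / spec["peak_dp_tflops"],
        "hp_over_dp": spec["peak_hp_tflops"] / spec["peak_dp_tflops"],
        # Measured Dgemm rate relative to double precision peak.
        "dgemm_efficiency": spec["dgemm_tflops"] / spec["peak_dp_tflops"],
        # Measured Stream triad relative to nominal HBM2 bandwidth.
        "stream_fraction_of_hbm2": spec["stream_triad_gb_per_s"] / spec["hbm2_gb_per_s"],
        # Memory bytes available per double precision floating point operation.
        "bytes_per_dp_flop": spec["hbm2_gb_per_s"] / peak_dp_gflops,
        "tofud_gbps_per_port": per_port,
        "tofud_gbps_total": per_port * spec["tofud_ports"],
    }


if __name__ == "__main__":
    for name, value in derived(SPEC).items():
        print(f"{name}: {value:.3f}")
```

The Dgemm ratio comes out a little above 0.92, which agrees with the "90+% efficiency" quoted for Dgemm, and the Stream triad figure is roughly four fifths of the nominal HBM2 bandwidth.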

Performance
Stream triad: 830+ GB/s
Dgemm: 2.5+ TF (90+% efficiency)

ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.

Background: An Overview of Post-K Hardware

● Compute Node, Compute + I/O Node connected by 6D mesh/torus Interconnect

● 3-level hierarchical storage system

  ● 1st Layer
    ● Cache for global file system
    ● Temporary file systems
      - Local file system for compute node
      - Shared file system for a job

  ● 2nd Layer
    ● Lustre-based global file system

  ● 3rd Layer
    ● Storage for archive
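To make the "6D mesh/torus" wording on this slide concrete, the sketch below lists the one-hop neighbours of a node when some axes wrap around (torus) and others do not (mesh). The axis sizes and the choice of which axes wrap are assumptions made for illustration; the slide does not give the actual Post-K topology parameters.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, ...]


def neighbors(coord: Coord, shape: Sequence[int], wraps: Sequence[bool]) -> List[Coord]:
    """Return the coordinates one hop away from `coord`.

    shape[i] is the number of nodes along axis i; wraps[i] is True when
    axis i is a torus (wrap-around) and False when it is a mesh.
    """
    found: List[Coord] = []
    for axis, (pos, size, wrap) in enumerate(zip(coord, shape, wraps)):
        for step in (-1, 1):
            nxt = pos + step
            if wrap:
                nxt %= size            # torus: step off one edge, arrive at the other
            elif not 0 <= nxt < size:
                continue               # mesh: no link beyond the edge
            candidate = coord[:axis] + (nxt,) + coord[axis + 1:]
            # Skip self-links (axis of size 1) and duplicates (torus of size 2).
            if candidate != coord and candidate not in found:
                found.append(candidate)
    return found


# Assumed, illustrative topology: six axes, three large torus axes and
# three small axes of which only the middle one wraps.
ASSUMED_SHAPE = (4, 4, 4, 2, 3, 2)
ASSUMED_WRAPS = (True, True, True, False, True, False)

if __name__ == "__main__":
    for n in neighbors((0, 0, 0, 0, 0, 0), ASSUMED_SHAPE, ASSUMED_WRAPS):
        print(n)
```

With the assumed shape each node has ten neighbours, which is the same number as the TofuD ports listed on the A64FX slide, but that match depends entirely on the assumed parameters.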

An Overview of System Software Stack

Ease of use is one of our KPIs (Key Performance Indicators)

Providing a wide range of applications/tools/libraries (Linux Distribution Eco-System)

Software stack, from top to bottom:

● Fortran, C/C++, OpenMP, Java, …; Math libraries; Tuning and Debugging Tools; Batch Job System; Hierarchical File System; Parallel File System
● Parallel Programming Environments: XMP, FDPS, …
● Communication: MPI, Low Level Communication
● Application-oriented File I/O
● Process/Thread: PiP
● File I/O for Hierarchical Storage: LLIO
● Multi-Kernel System: Linux and light-weight kernel (McKernel)
● Armv8 + SVE

Post-K Programming Environment

● Programming Languages and Compilers provided by Fujitsu
  ● Fortran2008 & Fortran2018 subset
  ● C11 & GNU and Clang extensions
  ● C++14 & C++17 subset and GNU and Clang extensions
  ● OpenMP 4.5 & OpenMP 5.0 subset
  ● Java
  GCC, LLVM, and Arm compilers will also be available

● Parallel Programming Language & Domain Specific Library provided by RIKEN
  ● XcalableMP
  ● FDPS (Framework for Developing Particle Simulator)

● Process/Thread Library provided by RIKEN
  ● PiP (Process in Process)

● Script Languages provided by Linux distributor
  ● E.g., Python+NumPy, SciPy

● Communication Libraries
  ● MPI 3.1 & MPI4.0 subset
  ● Open MPI base (Fujitsu), MPICH (RIKEN)

● Low-level Communication Libraries
  ● uTofu (Fujitsu), LLC (RIKEN)

● File I/O Libraries provided by RIKEN
  ● pnetCDF, DTF, FTAR
  Scalable is also running on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.

● Math Libraries
  ● BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  ● EigenEXA, Batched BLAS (RIKEN)

● Programming Tools provided by Fujitsu
  ● Profiler, Debugger, GUI

Open Source Management Tools

● EasyBuild
  ● Used at CEA
  ● RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, has been ported to an Arm machine using EasyBuild
  ● CAFFE consists of several open-source packages:
    - boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv

● Spack
  ● Used in the ECP project
  ● RIKEN is also evaluating Spack.
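A tool of this kind has to build a package's dependencies before the package itself. The sketch below shows that ordering step for the CAFFE package list above, using Python's standard graphlib module. It does not use EasyBuild or Spack APIs, and the dependency edges between the packages are assumptions chosen only to make the example non-trivial.

```python
from graphlib import TopologicalSorter

# package -> packages that must be built first.
# The package names come from the slide; the edges are ASSUMED for
# illustration and are not taken from any real build recipe.
ASSUMED_DEPS = {
    "caffe": {"boost", "blas", "cmake", "gflags", "glog", "googletest",
              "snappy", "leveldb", "protobuf", "lmdb", "opencv"},
    "glog": {"gflags", "cmake"},
    "leveldb": {"snappy", "cmake"},
    "opencv": {"cmake"},
    "protobuf": {"cmake"},
    "googletest": {"cmake"},
}


def build_order(deps):
    """Return one valid build order: every package after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, package in enumerate(build_order(ASSUMED_DEPS), start=1):
        print(step, package)
```

Several orders can satisfy the same constraints, so the exact sequence printed is not significant; only the relative position of a package and its dependencies is.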

IHK/McKernel developed at RIKEN

● Partition resources (CPU cores, memory)
  ● Full Linux kernel on some cores
    ● System daemons and in-situ non HPC applications
    ● Device drivers
  ● Light-weight kernel (LWK), McKernel, on other cores
    ● HPC applications

● IHK (Interface for Heterogeneous Kernels): Linux kernel module
  ● Allows dynamic partitioning of node resources: CPU cores, physical memory, …
  ● Enables management of LWKs (assign resources, load, boot, destroy, etc.)
  ● Provides inter-kernel communication, messaging and notification

● McKernel: Light-weight kernel
  ● Is designed for HPC, noiseless, simple
  ● Implements only performance sensitive system calls, e.g., process and memory management, and the rest are offloaded to Linux
  ● Executes the same binary as Linux without any recompilation
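The split between system calls handled by the light-weight kernel and those offloaded to Linux can be pictured as a simple dispatch. The sketch below is a toy model of that idea only: the class names, the method names and the particular set of "local" calls are hypothetical and are not taken from the McKernel source.

```python
# ASSUMED example set of performance sensitive calls served by the LWK.
# The real McKernel list is not given on the slide.
LOCAL_SYSCALLS = {"mmap", "munmap", "brk", "clone", "futex"}


class LinuxSide:
    """Stands in for the full Linux kernel that receives offloaded calls."""

    def __init__(self):
        self.offloaded = []

    def handle(self, name, args):
        self.offloaded.append(name)
        return f"linux:{name}"


class LightWeightKernel:
    """Toy LWK: serves a few calls itself and forwards everything else."""

    def __init__(self, linux):
        self.linux = linux
        self.served_locally = []

    def syscall(self, name, *args):
        if name in LOCAL_SYSCALLS:
            # Performance sensitive path: handled without leaving the LWK.
            self.served_locally.append(name)
            return f"lwk:{name}"
        # Everything else is delegated to Linux (inter-kernel communication
        # is reduced to a plain method call in this model).
        return self.linux.handle(name, args)


if __name__ == "__main__":
    linux = LinuxSide()
    lwk = LightWeightKernel(linux)
    for call in ("mmap", "open", "futex", "ioctl"):
        print(call, "->", lwk.syscall(call))
```

Because the application only ever calls the LWK entry point, it does not need to know which side serves a request, which is in the spirit of running the same binary without recompilation.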

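[Figure: the node's cores, memory and interrupts are split into two partitions. The Linux partition runs system daemons and in-situ non HPC applications on a complex stack: Linux API (glibc, /sys/, /proc/), TCP stack, VFS, file system drivers, device drivers, memory management, general scheduler. The McKernel partition runs HPC applications on a thin LWK with very simple memory management and process/thread management.]

• IHK/McKernel runs on
  • Intel Xeon and Xeon phi
  • Fujitsu FX10 and FX100 (Experiments)

How to deploy IHK/McKernel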

• Linux Kernel with IHK kernel module is resident
  - daemons for the job scheduler etc. run on Linux
• McKernel is dynamically reloaded (rebooted) by IHK for each application
• No hardware reboot

[Figure: App A, requiring LWK-without-scheduler, is invoked and finishes; App B, requiring LWK-with-scheduler, is invoked and finishes; App C, using full Linux capability, is invoked and finishes.]
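The per-application deployment cycle on this slide (Linux stays resident, an LWK is booted for one application and destroyed afterwards) can be modelled in a few lines. The sketch below is only a toy state machine: the class, its methods and the way cores are chosen are assumptions, not the IHK interface.

```python
class Node:
    """Toy model of one node whose cores are repartitioned per application."""

    def __init__(self, ncores):
        self.linux_cores = set(range(ncores))   # Linux is always resident
        self.lwk_cores = set()
        self.lwk_variant = None
        self.lwk_boots = 0
        self.history = []

    def boot_lwk(self, variant, ncores):
        if self.lwk_variant is not None:
            raise RuntimeError("an LWK is already running")
        if not 0 < ncores < len(self.linux_cores):
            raise ValueError("Linux must keep at least one core")
        # Hand the highest-numbered cores to the LWK (arbitrary choice).
        taken = set(sorted(self.linux_cores)[-ncores:])
        self.linux_cores -= taken
        self.lwk_cores = taken
        self.lwk_variant = variant
        self.lwk_boots += 1

    def destroy_lwk(self):
        # Cores go back to Linux; the hardware itself is never rebooted.
        self.linux_cores |= self.lwk_cores
        self.lwk_cores = set()
        self.lwk_variant = None

    def run(self, app, lwk_variant=None, ncores=0):
        if lwk_variant is None:
            self.history.append((app, "linux"))   # full Linux capability
            return
        self.boot_lwk(lwk_variant, ncores)
        try:
            self.history.append((app, lwk_variant))
        finally:
            self.destroy_lwk()


if __name__ == "__main__":
    node = Node(8)
    node.run("App A", "LWK-without-scheduler", ncores=6)
    node.run("App B", "LWK-with-scheduler", ncores=4)
    node.run("App C")
    print(node.history)
```

Each application sees a freshly booted kernel variant suited to it, and after every run all cores are back under Linux.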

miniFE (CORAL benchmark suite)

● Conjugate gradient - strong scaling
● Up to 3.5X improvement (Linux falls over..)

Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo

Results using the same binary

Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017

Support of Software Development/Porting for Post-K

Contribution to Arm HPC (Armv8-A SVE) Ecosystem

[Timeline chart, CY2017 to CY2021, with a NOW marker.
Project phases: Design and Implementation; Manufacturing; Installation and Tuning; Operation.
Specification: Armv8-A + SVE overview; detailed hardware info.; Optimization Guidebook, published incrementally.
Performance evaluation environment: RIKEN performance estimation tool using FX100; RIKEN Simulator.
Early Access Program.]

• CY2018 Q2: Optimization guidebook is incrementally published
• CY2020 Q2: Early access program starts
• CY2021 Q1/Q2: General operation starts

Concluding Remarks

https://postk-web.r-ccs.riken.jp/faq.html

BACKUP

MPI Communication implemented using Tofu2 and TofuD

● Tofu2 and TofuD offloading mechanism

● Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.

● Tofu2 has two packet processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.

● Scheduling Pointer: Commands enqueued in the command queue are processed until the entry pointed to by the Scheduling Pointer is reached. The Scheduling Pointer is updated by a packet sent by a remote node.
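The Session Mode behaviour described above can be sketched as a queue whose progress is gated by a pointer that only a remote packet can advance. The model below is a software illustration, not the hardware interface: the class and method names are hypothetical, and it assumes that the entry the pointer refers to is itself not yet processed (the slide does not say whether the bound is inclusive).

```python
class SessionModeQueue:
    """Toy model of a command queue gated by a Scheduling Pointer."""

    def __init__(self):
        self.entries = []        # posted commands, in posting order
        self.head = 0            # index of the next command to process
        self.sched_ptr = 0       # commands with index < sched_ptr may run
        self.processed = []

    def post(self, command):
        """Host side: enqueue a command such as PUT, GET or NOP."""
        self.entries.append(command)

    def on_remote_packet(self, new_ptr):
        """A packet from a remote node moves the Scheduling Pointer forward."""
        self.sched_ptr = max(self.sched_ptr, new_ptr)

    def progress(self):
        """Network interface side: process commands up to the pointer.

        Returns the commands processed by this call. No host CPU work is
        modelled here, which is the point of the offload.
        """
        done_now = []
        limit = min(self.sched_ptr, len(self.entries))
        while self.head < limit:
            done_now.append(self.entries[self.head])
            self.head += 1
        self.processed.extend(done_now)
        return done_now


if __name__ == "__main__":
    q = SessionModeQueue()
    for cmd in ("PUT", "GET", "NOP"):
        q.post(cmd)
    print(q.progress())          # nothing may run before a remote packet
    q.on_remote_packet(2)
    print(q.progress())
    q.on_remote_packet(3)
    print(q.progress())
```

Commands can therefore be posted ahead of time and released step by step as remote nodes report in, which is how a multi-step collective can advance without the host CPU.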

Evaluation: Latency

    /* As shown on the slide; the MPI 4.0 standard form of this call
       also takes a receive count and an MPI_Info argument. */
    MPI_Neighbor_alltoall_init(sbuf, count, MPI_DOUBLE, rbuf,
                               MPI_DOUBLE, comm, &req[1]);
    for (I = 0; …….) {
        /* Computation */
        MPI_Start(req);      /* start the persistent collective */
        /* Computation */    /* overlaps with the communication */
        MPI_Wait(req, stat); /* wait for completion */
    }

• The offload version is faster.
• Unlike the point-to-point version, the offload version does not need CPU cycles for communication progress. Thus computation and communication overlap is realized by the offload version.

[Figure: latency [us] versus message size [bytes], comparing Tofu2 Offload with persistent pt2pt. (≒ non-blocking pt2pt.). Annotations: Direct Transfers between User Buffers; Completely Asynchronous Progression.]

• Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Offloaded MPI persistent collectives using persistent generalized request interface,” Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI2017), ACM, 2017.
• Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Prototyping of Offloaded Persistent Broadcast on Tofu2 Interconnect,” SC17, 2017 (poster).
• Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Evaluation of Intra Node of Persistent Collective Communication using NIC Offloading,” SWOPP'18, HPC165, 2018 (in Japanese).

OSS Survey (9 priority issues developers)

● Application
  ● MODYLAS, USQCD, OpenFOAM
● Library
  ● Numpy, Scipy, pysam, FFTW, LAPACK95, lapack, blas, Metis, ParMetis, HDF5, NetCDF, NetCDF-fortran, PnetCDF, scalasca, SCOTCH, Zoltan, openmpi1.8, openmpi1.10, mpich2-1.4.1, boost, FFTE, PETSc/SLEPc, Elemental, BWA, Star, Blat, TopHat, TopHat2, MapSplice2, MPDyn2, ELPA, Trilinos, Eigen3, mesa, MesaGLUT, libxml2, C-LIME, EigenExa
● Tool/Visualization Tool
  ● git, git-flow, gnuplot, Paraview, VisIT, ImageMagick, svn, Samtools, bedtools, Biobambam, Picard, GMT, GrADS, HDF-EOS, wgrib, GRIB API, Climate Data Operators
● Build tool
  ● cmake, gnu Autotools, automake, autoconf, gcc, gfortran, C++, libtools
● Shell script / Programming language / Script language
  ● python2, python3, perl5, R, Ruby2, zsh, ksh, NCADS Command Language

OSS Survey (K computer users)

● Application
  ● ABINIT-MP, AkaiKKR, bedtools, Biobambam, BWA, CUBE, ERmod, fdps, FFV-C, FrontFlow/Red, FrontISTR, GAMES, GENESIS, gromacs, GROMACS, HIVE, LAMMPS, MapSplice2, MODYLAS, NEURON, octa, OpenFOAM, PBVR, Picard, PIMD, quantum ESPRESSO, rDock, Samtools, SCALE, Star, TopHat, TopHat 2, WHEEL, xTAPP
● Library
  ● FFTW, matplotlib (python), beautiful soup (python), metis, ParMETIS, NetCDF4, HDF5, NuSDAS1.3, octa, fdps, Zoltan, cgns, Polylib, libsim
● Visualization tool
  ● gnuplot, PBVR, VTK, OSMesa
● Tool
  ● GNU utils, zlib, anaconda (python), itk, PAPI, PMlib, Szip, zip, TextParser, fpzip
● Build tool
  ● make, autoconf, cmake
● Shell script / Programming language / Script language
  ● bash, curl, python, ruby
● ISV
  ● ABAQUS, Advance, AMBER, Ansys fluent, Gaussian, FLUENT, Scryu/Tetra, LS-DYNA, VPS solver (PAM-CRASH), Helyx, HEETAH, iconCFD, LaBS, JMAG, MIZUHO, NuFD, VASP, VSOP
