System Software for Armv8-A with SVE

Yutaka Ishikawa, Leader of FLAGSHIP2020 Project
RIKEN Center for Computational Science
9:00–9:25, 14th of January, 2019
Open Source HPC Collaboration on Arm Architecture
Linaro workshop, Guangzhou, China

Background: Flagship2020
• Missions
  • Building the Japanese national flagship supercomputer, post K, and
  • Developing a wide range of HPC applications, running on post K, in order to solve social and scientific issues in Japan
• Project organization
  • Post K Computer development
    • RIKEN AICS is in charge of development
    • Fujitsu is the vendor partner
    • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  • Applications
    • The government selected 9 social & scientific priority issues and 4 exploratory issues, and their R&D organizations.

2019/1/14 RIKEN Center for Computational Science

Target Applications
  Program          Brief description
  ① GENESIS        MD for proteins
  ② Genomon        Genome processing (Genome alignment)
  ③ GAMERA         Earthquake simulator (FEM in unstructured & structured grid)
  ④ NICAM+LETKF    Weather prediction system using Big data (structured grid stencil & ensemble Kalman filter)
  ⑤ NTChem         Molecular electronic (structure calculation)
  ⑥ FFB            Large Eddy Simulation (unstructured grid)
  ⑦ RSDFT          An ab-initio program (density functional theory)
  ⑧ Adventure      Computational Mechanics System for Large Scale Analysis and Design (unstructured grid)
  ⑨ CCS-QCD        Lattice QCD simulation (structured grid Monte Carlo)

Background: Post-K CPU A64FX
  Architecture     Armv8.2-A SVE (512 bit SIMD)
  Core             48 cores for compute and 2/4 for OS activities
                   DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
  Cache            L1D: 64 KiB, 4 way, 230 GB/s (load), 115 GB/s (store)
                   L2: 8 MiB, 16 way, 115 GB/s (load), 57 GB/s (store)
  Memory           HBM2 32 GiB, 1024 GB/s
  Interconnect     TofuD (28 Gbps x 2 lane x 10 port)
  I/O              PCIe Gen3 x 16 lane
  Technology       7nm FinFET
  Performance      Stream triad: 830+ GB/s
                   Dgemm: 2.5+ TF (90+% efficiency)
[Figure: A64FX block diagram, courtesy of FUJITSU LIMITED. CMG: CPU Memory Group. NOC: Network On Chip.]
ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.
Background: An Overview of Post-K Hardware
● Compute Node, Compute + I/O Node connected by 6D mesh/torus Interconnect
● 3-level hierarchical storage system
  ● 1st Layer
    ● Cache for global file system
    ● Temporary file systems
      - Local file system for compute node
      - Shared file system for a job
  ● 2nd Layer
    ● Lustre-based global file system
  ● 3rd Layer
    ● Storage for archive

An Overview of System Software Stack
Ease of use is one of our KPIs (Key Performance Indicators).
Providing a wide range of applications/tools/libraries/compilers through the Linux distribution eco-system.
[Figure: system software stack, top to bottom]
  Applications; Linux Distribution Eco-System
  Fortran, C/C++, OpenMP, Java, …; Math libraries; Tuning and Debugging Tools; Batch Job System; Hierarchical File System; Parallel File System
  Parallel Programming Environments: XMP, FDPS, …
  Communication: MPI; Low Level Communication
  File I/O: Application-oriented File I/O; File I/O for Hierarchical Storage (LLIO)
  Process/Thread: PIP
  Multi-Kernel System: Linux and light-weight kernel (McKernel)
  Armv8 + SVE

Post-K Programming Environment
● Programming Languages and Compilers provided by Fujitsu
  ● Fortran2008 & Fortran2018 subset
  ● C11 & GNU and Clang extensions
  ● C++14 & C++17 subset and GNU and Clang extensions
  ● OpenMP 4.5 & OpenMP 5.0 subset
  ● Java
  GCC, LLVM, and Arm compiler will also be available
● Parallel Programming Language & Domain Specific Library provided by RIKEN
  ● XcalableMP
  ● FDPS (Framework for Developing Particle Simulator)
● Process/Thread Library provided by RIKEN
  ● PiP (Process in Process)
● Script Languages provided by Linux distributor
  ● E.g., Python+NumPy, SciPy
● Communication Libraries
  ● MPI 3.1 & MPI4.0 subset
    ● Open MPI base (Fujitsu), MPICH (RIKEN)
  ● Low-level Communication Libraries
    ● uTofu (Fujitsu), LLC (RIKEN)
● File I/O Libraries provided by RIKEN
  ● pnetCDF, DTF, FTAR
● Math Libraries
  ● BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  ● EigenEXA, Batched BLAS (RIKEN)
● Programming Tools provided by Fujitsu
  ● Profiler, Debugger, GUI
Note: Scalable also runs on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.

Open Source Management Tools
● EasyBuild
  ● Used at CEA
  ● RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, is ported to an Arm machine using EasyBuild
  ● CAFFE consists of several open-source packages:
    - boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
● Spack
  ● Used at ECP project
  ● RIKEN is also evaluating Spack.

IHK/McKernel developed at RIKEN
● IHK: Interface for Heterogeneous Kernels (Linux kernel module)
  ● Allows dynamic partitioning of node resources: CPU cores, physical memory, …
  ● Enables management of LWKs (assign resources, load, boot, destroy, etc.)
  ● Provides inter-kernel communication, messaging and notification
● McKernel: Light-weight kernel (LWK)
  ● Is designed for HPC, noiseless, simple
  ● Implements only performance sensitive system calls, e.g., process and memory management, and the rest are offloaded to Linux
  ● Executes the same binary as Linux without any recompilation
● Partition resources (CPU cores, memory)
  ● Full Linux kernel on some cores
    ● System daemons and in-situ non HPC applications
    ● Device drivers
  ● Light-weight kernel (LWK), McKernel, on other cores
    ● HPC applications
• IHK/McKernel runs on
  • Intel Xeon and Xeon phi
  • Fujitsu FX10 and FX100 (Experiments)
[Figure: node partitioned between Linux and McKernel. Linux partition: system daemons and in-situ non HPC application; complex Linux API (glibc, /sys/, /proc/); TCP stack, VFS, file system, device drivers, general scheduler, memory management. McKernel partition: HPC applications; thin LWK with very simple memory management and process/thread management. Each partition has its own cores and memory; interrupts are shown between partitions.]

How to deploy IHK/McKernel
• Linux Kernel with IHK kernel module is resident; daemons for the job scheduler, etc. run on Linux
• McKernel is dynamically reloaded (rebooted) by IHK for each application
• No hardware reboot
[Figure: App A, requiring LWK-without-scheduler, is invoked and finishes; App B, requiring LWK-with-scheduler, is invoked and finishes; App C, using full Linux capability, is invoked and finishes.]

miniFE (CORAL benchmark suite)
● Conjugate gradient - strong scaling
● Up to 3.5X improvement (Linux falls over)
Platform: Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo
Results using the same binary
[Figure: miniFE strong-scaling results, Linux vs. IHK/McKernel, with the 3.5X improvement marked.]
Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017

Support of Software Development/Porting for Post-K
Contribution to Arm HPC (Armv8-A SVE) Ecosystem
[Figure: timeline CY2017 to CY2021. Phases: Design and Implementation; Manufacturing, Installation, and Tuning; Operation. Rows: Specification (Armv8-A + SVE Overview, Detailed hardware info.); Optimization Guidebook (Publishing Incrementally); RIKEN Performance estimation tool using FX100; Performance Evaluation Environment (RIKEN Simulator); Early Access Program.]
• CY2018 Q2: Optimization guidebook is incrementally published
• CY2020 Q2: Early access program starts
• CY2021 Q1/Q2: General operation starts

Concluding Remarks
https://postk-web.r-ccs.riken.jp/faq.html

BACKUP

MPI Communication implemented using Tofu2 and TofuD
● Tofu2 and TofuD offloading mechanism
  ● Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.
  ● Tofu2 has two packet processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.
  ● Scheduling Pointer: Commands enqueued in the command queue are processed until reaching the entry pointed to by the Scheduling Pointer. The Scheduling Pointer is updated by a packet sent by a remote node.

Evaluation: Latency
    MPI_Neighbor_alltoall_init(sbuf, count, MPI_DOUBLE, rbuf, count, MPI_DOUBLE,
                               comm, MPI_INFO_NULL, &req);
    for (i = 0; …….) {
        /* Computation */
        MPI_Start(&req);
        /* Computation */
        MPI_Wait(&req, &stat);
    }
• The offload version is faster.
• Unlike the point-to-point version, the offload version does not need CPU cycles for communication progress. Thus computation and communication overlap is realized by the offload version.
[Figure: latency vs. Message Size [Bytes]. Series: Tofu2 Offload; Persistent pt2pt. (≒ Non-blocking pt2pt.). Annotations: Direct Transfers between User Buffers; Completely Asynchronous.]
