System Software for Armv8-A with SVE

Yutaka Ishikawa, Leader of FLAGSHIP2020 Project
RIKEN Center for Computational Science
9:00–9:25, 14th of January, 2019
Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China

Background: Flagship2020
• Missions
  • Building the Japanese national flagship supercomputer, post K, and
  • Developing a wide range of HPC applications, running on post K, in order to solve social and science issues in Japan
• Project organization
  • Post K Computer development
    • RIKEN AICS is in charge of development
    • Fujitsu is the vendor partner.
    • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  • Applications
    • The government selected 9 social & scientific priority issues and 4 exploratory issues, and their R&D organizations.

2019/1/14 RIKEN Center for Computational Science

Background: Flagship2020, Target Applications

  Program         Brief description
  ① GENESIS       MD for proteins
  ② Genomon       Genome processing (Genome alignment)
  ③ GAMERA        Earthquake simulator (FEM in unstructured & structured grid)
  ④ NICAM+LETKF   Weather prediction system using Big data (structured grid stencil & ensemble Kalman filter)
  ⑤ NTChem        Molecular electronic structure calculation
  ⑥ FFB           Large Eddy Simulation (unstructured grid)
  ⑦ RSDFT         An ab-initio program (density functional theory)
  ⑧ Adventure     Computational Mechanics System for Large Scale Analysis and Design (unstructured grid)
  ⑨ CCS-QCD       Lattice QCD simulation (structured grid Monte Carlo)

Background: Post-K CPU A64FX (Courtesy of FUJITSU LIMITED)

  Architecture    Armv8.2-A SVE (512 bit SIMD)
  Core            48 cores for compute and 2/4 for OS activities
                  DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
  Cache           L1D: 64 KiB, 4 way, 230 GB/s (load), 115 GB/s (store)
                  L2: 8 MiB, 16 way, 115 GB/s (load), 57 GB/s (store)
  Memory          HBM2 32 GiB, 1024 GB/s
  Interconnect    TofuD (28 Gbps x 2 lane x 10 port)
  I/O             PCIe Gen3 x 16 lane
  Technology      7nm FinFET
  Performance     Stream triad: 830+ GB/s
                  Dgemm: 2.5+ TF (90+% efficiency)

  [Figure: A64FX block diagram. CMG: CPU Memory Group, NOC: Network On Chip]

ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.

Background: An Overview of Post-K Hardware
● Compute Node, Compute + I/O Node connected by 6D mesh/torus Interconnect
● 3-level hierarchical storage system
  ● 1st Layer
    ● Cache for global file system
    ● Temporary file systems
      - Local file system for compute node
      - Shared file system for a job
  ● 2nd Layer
    ● Lustre-based global file system
  ● 3rd Layer
    ● Storage for archive

An Overview of System Software Stack
● Ease of use is one of our KPIs (Key Performance Indicators)
● Providing a wide range of applications/tools/libraries/compilers (eco-system)

  [Figure: software stack, top to bottom]
  - Linux distribution; Fortran, C/C++, OpenMP, Java, …; math libraries; tuning and debugging tools; batch job system; hierarchical file system; parallel file system
  - Parallel programming environments: XMP, FDPS, …
  - Communication: MPI, Low Level Communication
  - File I/O: application-oriented file I/O, file I/O for hierarchical storage (LLIO)
  - Process/Thread: PIP
  - Multi-Kernel System: Linux and light-weight kernel (McKernel)
  - Armv8 + SVE

Post-K Programming Environment
● Programming Languages and Compilers provided by Fujitsu
  ● Fortran2008 & Fortran2018 subset
  ● C11 & GNU and Clang extensions
  ● C++14 & C++17 subset and GNU and Clang extensions
  ● OpenMP 4.5 & OpenMP 5.0 subset
  ● Java
  GCC, LLVM, and Arm compiler will also be available
● Parallel Programming Language & Domain Specific Library provided by RIKEN
  ● XcalableMP
  ● FDPS (Framework for Developing Particle Simulator)
● Process/Thread Library provided by RIKEN
  ● PiP (Process in Process)
● Script Languages provided by Linux distributor
  ● E.g., Python+NumPy, SciPy
● Communication Libraries
  ● MPI 3.1 & MPI 4.0 subset
    ● Open MPI base (Fujitsu), MPICH (RIKEN)
  ● Low-level Communication Libraries
    ● uTofu (Fujitsu), LLC (RIKEN)
● File I/O Libraries provided by RIKEN
  ● pnetCDF, DTF, FTAR
● Math Libraries
  ● BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  ● EigenEXA, Batched BLAS (RIKEN)
● Programming Tools provided by Fujitsu
  ● Profiler, Debugger, GUI
Note: Scalable … is also running on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.

Open Source Management Tools
● EasyBuild
  ● Used at CEA
  ● RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, has been ported to an Arm machine using EasyBuild
  ● CAFFE consists of several open-source packages:
    - boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
● Spack
  ● Used at ECP project
  ● RIKEN is also evaluating Spack.

IHK/McKernel developed at RIKEN
● IHK (Interface for Heterogeneous Kernels): Linux kernel module
  ● Allows dynamic partitioning of node resources: CPU cores, physical memory, …
  ● Enables management of LWKs (assign resources, load, boot, destroy, etc.)
  ● Provides inter-kernel communication, messaging and notification
● McKernel: light-weight kernel (LWK)
  ● Is designed for HPC: noiseless, simple
  ● Implements only performance-sensitive system calls, e.g., process and memory management; the rest are offloaded to Linux
● Executes the same binary as Linux without any recompilation
● Partition resources (CPU cores, memory)
  ● Full Linux kernel on some cores
    ● System daemons and in-situ non-HPC applications
    ● Device drivers
  ● Light-weight kernel on other cores
    ● HPC applications
• IHK/McKernel runs on
  • Intel Xeon and Xeon Phi
  • Fujitsu FX10 and FX100 (Experiments)

  [Figure: node partitioned between Linux and McKernel]
  - Linux partition: system daemons, in-situ non-HPC application; complex Linux API (glibc, /sys/, /proc/); TCP stack, VFS, memory management, file system drivers, device drivers, general scheduler
  - McKernel partition: HPC applications; thin LWK with very simple memory management and process/thread management
  - Each partition has its own cores and memory; interrupts pass between the partitions

How to deploy IHK/McKernel
• The Linux kernel with the IHK kernel module is resident
  - daemons for the job scheduler etc. run on Linux
• McKernel is dynamically reloaded (rebooted) by IHK for each application
• No hardware reboot

  [Figure: timeline of three jobs]
  - App A, requiring LWK-without-scheduler, is invoked, then finishes
  - App B, requiring LWK-with-scheduler, is invoked, then finishes
  - App C, using full Linux capability, is invoked, then finishes

miniFE (CORAL benchmark suite)
● Conjugate gradient - strong scaling
● Up to 3.5X improvement (Linux falls over..)
● Results using the same binary
● Platform: Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo

Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017

Support of Software Development/Porting for Post-K
Contribution to Arm HPC (Armv8-A SVE) Ecosystem

  [Figure: roadmap, CY2017 to CY2021]
  - Phases: Design and Implementation; Manufacturing, Installation, and Tuning; Operation
  - Specification: Armv8-A + SVE overview, then detailed hardware info.
  - Optimization Guidebook: publishing incrementally
  - Performance Evaluation Environment: RIKEN performance estimation tool using FX100, RIKEN Simulator
  - Early Access Program

• CY2018 Q2: Optimization guidebook is incrementally published
• CY2020 Q2: Early access program starts
• CY2021 Q1/Q2: General operation starts

Concluding Remarks
https://postk-web.r-ccs.riken.jp/faq.html

BACKUP

MPI Communication implemented using Tofu2 and TofuD
● Tofu2 and TofuD offloading mechanism
  ● Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.
  ● Tofu2 has two packet processing modes: Normal Mode and Session Mode.
    In the Session Mode, a special register called the Scheduling Pointer plays an important role.
  ● Scheduling Pointer: Commands enqueued in the command queue are processed until reaching the entry pointed to by the Scheduling Pointer. The Scheduling Pointer is updated by a packet sent by a remote node.

Evaluation: Latency

    MPI_Neighbor_alltoall_init(sbuf, count, MPI_DOUBLE,
                               rbuf, count, MPI_DOUBLE,
                               comm, MPI_INFO_NULL, &req);
    for (i = 0; …….) {
        /* Computation */
        MPI_Start(&req);
        /* Computation */
        MPI_Wait(&req, &stat);
    }

• The offload version is faster.
• Unlike the point-to-point version, the offload version does not need CPU cycles for communication progress. Thus computation and communication overlap is realized by the offload version.

  [Figure: latency vs. Message Size [Bytes]; series: Tofu2 Offload, Persistent pt2pt. (≒ Non-blocking pt2pt.)]

Direct Transfers between User Buffers
Completely Asynchronous
