
Japanese Development and Hybrid Accelerated Supercomputing

Taisuke Boku, Director, Center for Computational Sciences, University of Tsukuba

With courtesy of HPCI and R-CCS (first part)

Agenda
• Development and Deployment of Supercomputers in Japan
  • Tier-1 and Tier-2 systems
  • Supercomputers in national universities
  • Fugaku (Post-K)
• Multi-Hybrid Accelerated Supercomputing at U. Tsukuba
  • Today's accelerated supercomputing
  • New concept of multi-hybrid accelerated computing
  • Combining GPU and FPGA in a system
  • Programming and applications
  • Cygnus supercomputer at U. Tsukuba
• Conclusions

Development and Deployment of Supercomputers in Japan

Towards the Future
[Roadmap chart, 2008-2020, peak performance from 1 PF towards 1000 PF (Exascale)]
• Tier-1: national flagship machines at RIKEN AICS/R-CCS, originally developed MPP systems: the K Computer, then the Fugaku (Post-K) Computer
• Tier-2: university supercomputer centers (clusters, vector machines, GPU systems, etc.); 9 national universities procure their own systems, e.g. T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.), TSUBAME2.0 at Tokyo Tech., and OFP at JCAHPC (U. Tsukuba and U. Tokyo)
• Tier-1 and Tier-2 supercomputers form HPCI and move forward to Exascale computing like two wheels

HPCI – High Performance Computing Infrastructure in Japan
• A national networking program, under MEXT, to share most of the supercomputers in Japan
• The national flagship supercomputer "K" (and "Fugaku" in 2021) and all other representative supercomputers in national university supercomputer centers are connected physically and logically
• A nation-wide supercomputer sharing program based on proposals (called twice a year) from all fields of computational science, reviewed by a selection committee that assigns the computational resources
• A large shared storage capacity (~250 PByte), distributed over two sites and shared by all HPCI supercomputer facilities, connected by a 100 Gbps-class nation-wide network
• A "single sign-on" system (Globus) for easy login and job scheduling across all resources

HPCI Tier 2 Systems Roadmap (As of Jun. 2019)

[Roadmap chart, fiscal years 2016-2027, showing each Tier-2 center's current and planned systems and their power budgets (power is the maximum consumption including cooling). Sites shown: Hokkaido, Tohoku, Tsukuba (HA-PACS, COMA (PACS-IX), PPX, Cygnus 2.4 PF, Oakforest-PACS 25 PF at JCAHPC, PACS-XI 100 PF planned), Tokyo (FX10, Reedbush, Oakbridge-II, BDEC), Tokyo Tech. (TSUBAME 2.5 / 3.0 / 4.0), Nagoya, Kyoto, Osaka, Kyushu, plus JAMSTEC and ISM.]

FUGAKU (富岳): New National Flagship Machine

Slides courtesy of M. Sato, RIKEN R-CCS

FLAGSHIP2020 Project
• Missions
  • Building the Japanese national flagship supercomputer Fugaku (a.k.a. post-K)
  • Developing a wide range of HPC applications, running on Fugaku, in order to solve social and scientific issues in Japan
• Overview of the Fugaku architecture
  • Node: Fujitsu A64FX processor, manycore architecture
  • Armv8-A + SVE (Scalable Vector Extension), SIMD length 512 bits
  • # of cores: 48 + (2/4 for OS), > 2.7 TF / 48 cores
  • Co-design with application developers, and high memory bandwidth utilizing on-package stacked memory (HBM2) at 1 TB/s
  • Low power: 15 GF/W (dgemm)
  • Network: TofuD with chip-integrated NIC, 6D mesh/torus interconnect
• Status and update
  • "Design and Implementation" completed
  • The official contract with Fujitsu to manufacture, ship, and install the hardware for Fugaku is done
  • RIKEN revealed #nodes > 150K
  • The name of the system was decided as "Fugaku"
  • RIKEN announced the Fugaku early access program to begin around Q2/CY2020
[Photo: prototype board]

CPU Architecture: A64FX

• Armv8.2-A (AArch64 only) + SVE (Scalable Vector Extension)
  • FP64 / FP32 / FP16 (https://developer.arm.com/products/architecture/a-profile/docs)
  • SVE 512-bit wide SIMD
• # of cores: 48 + (2/4 for OS)
  • CMG (Core-Memory-Group): a NUMA node of 12+1 cores with 8 GiB of HBM2
  • The "common" programming model will be to run one MPI process per NUMA node (CMG) with OpenMP-MPI hybrid programming; 48-thread OpenMP is also supported (a hybrid sketch follows after this list)
• Co-design with application developers, and high memory bandwidth utilizing on-package stacked memory: HBM2 (32 GiB)
• Leading-edge Si technology (7 nm FinFET), low-power logic design (approx. 15 GF/W at dgemm), and power-controlling knobs
• PCIe Gen3, 16 lanes
• Peak performance
  • > 2.7 TFLOPS (> 90% efficiency on dgemm)
  • Memory B/W 1024 GB/s (> 80% on STREAM)
  • Bytes per flop: approx. 0.4
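A minimal sketch (not from the slides) of the per-CMG hybrid style described above: one MPI rank per CMG, OpenMP threads inside the rank. The launch flags and thread count are assumptions that depend on the actual MPI runtime and job scheduler.

/* Hedged sketch: one MPI rank per CMG (NUMA node), OpenMP threads inside.
 * Assumed launch, e.g.: OMP_NUM_THREADS=12 mpirun -np 4 --map-by numa ./a.out
 * so each rank's threads stay within one CMG and its local HBM2. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    /* Data-parallel work inside one CMG, shared by its OpenMP threads. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (i + 1.0);

    double global_sum = 0.0;
    /* Inter-CMG / inter-node communication is done with MPI. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}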

Peak Performance: Fugaku vs. K (HPL & STREAM)
• > 2.5 TF/node for dgemm (double precision), > 830 GB/s/node for STREAM triad
• Peak DP (double precision): Fugaku 400+ Pflops vs. K 11.3 Pflops (x34+)
• Peak SP (single precision): Fugaku 800+ Pflops vs. K 11.3 Pflops (x70+)
• Peak HP (half precision): Fugaku 1600+ Pflops vs. K -- (x141+)
• Total memory bandwidth: Fugaku 150+ PB/s vs. K 5.2 PB/s (x29+)
[Himeno Benchmark (Fortran90) comparison chart]

† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

Target Applications' Performance
• Performance targets (https://postk-web.r-ccs.riken.jp/perf.html)
  • 100 times faster than K for some applications (tuning included)
  • 30 to 40 MW power consumption
• Predicted performance of the 9 target applications (as of 2019/05/14), speedup over K:
  1. Health and longevity / Innovative computing infrastructure for drug discovery: GENESIS (MD for proteins), x125+
  2. Health and longevity / Personalized and preventive medicine using big data: Genomon (genome processing, genome alignment), x8+
  3. Disaster prevention and environment / Integrated simulation systems induced by earthquake and tsunami: GAMERA (earthquake simulator, FEM on unstructured & structured grids), x45+
  4. Disaster prevention and environment / Meteorological and global environmental prediction using big data: NICAM+LETKF (weather prediction system, structured-grid stencil & ensemble Kalman filter), x120+
  5. Energy issues / New technologies for energy creation, conversion/storage, and use: NTChem (molecular electronic structure calculation), x40+
  6. Energy issues / Accelerated development of innovative clean energy systems: Adventure (computational mechanics system for large-scale analysis and design, unstructured grid), x35+
  7. Industrial competitiveness enhancement / Creation of new functional devices and high-performance materials: RSDFT (ab-initio program, density functional theory), x30+
  8. Industrial competitiveness enhancement / Development of innovative design and production processes: FFB (large eddy simulation, unstructured grid), x25+
  9. Basic science / Elucidation of the fundamental laws and evolution of the universe: LQCD (lattice QCD simulation, structured-grid Monte Carlo), x25+

Performance study using the Post-K simulator
• We have been developing a cycle-level simulator for the post-K processor using gem5
• Collaboration with U. Tsukuba
• Kernel evaluation using a single core:

                          Post-K simulator   KNL
  Execution time [msec]   4.2                5.5
  Number of L1D misses    29569              -
  L1D miss rate           1.19%              -
  Number of L2 misses     20                 -
  L2 miss rate            0.01%              -

• 1.3 times faster than KNL per core
• With further optimization (instruction scheduling), the execution time is reduced to 3.4 msec (1.6 times faster)
• This is an evaluation within L1; OpenMP multicore execution will be much faster due to the HBM memory

Fugaku prototype board and rack
• "Fujitsu Completes Post-K Supercomputer CPU Prototype, Begins Functionality Trials", HPCwire, June 21, 2018

[Photos: A64FX CPU package (HBM2, 60 mm x 60 mm), water cooling, AOC QSFP28 links (X, Y, Z) and electrical signals; CPU die photo by Fujitsu Ltd.]
• CMU: 2 CPUs / CMU
• Shelf: 48 CPUs (24 CMUs)
• Rack: 8 shelves = 384 CPUs (8 x 48)

Advances from the K computer

                              K computer   Fugaku       ratio   key factor
  # of cores                  8            48                   Si tech.
  Si tech. (nm)               45           7
  Core perf. (GFLOPS)         16           > 56         3.5     SVE
  Chip (node) perf. (TFLOPS)  0.128        > 2.7        21      CMG & Si tech.
  Memory BW (GB/s)            64           1024                 HBM
  B/F (Bytes/FLOP)            0.5          0.4
  # of nodes / rack           96           384          4
  Rack perf. (TFLOPS)         12.3         1036.8       84
  # of nodes / system         82,944       > 150,000
  System perf. (DP PFLOPS)    10.6         > 405        38

More than 7.5 M general-purpose cores!
• SVE increases core performance
• Silicon technology and the scalable architecture (CMG) increase node performance
• HBM enables high bandwidth

Multi-Hybrid Accelerated Supercomputing

Work done as part of the PACS-X Project, CCS, U. Tsukuba, in collaboration with M. Umemura, K. Yoshikawa, R. Kobayashi, N. Fujita and Y. Yamaguchi

Accelerators in HPC

• Traditionally...
  • Cell Broadband Engine, ClearSpeed, GRAPE, ...
  • then GPU (the most popular)
• Is the GPU perfect?
  • good for many applications (replacing vector machines)
  • but it depends on very wide and regular parallelism
  • large-scale SIMD (SIMT) mechanism in a chip [photo: Tesla V100 (Volta)]
  • high-bandwidth memory (HBM, HBM2) and local memory, with a PCIe interface
  • insufficient for cases with...
    • not enough parallelism
    • irregular computation (warp splitting)
    • frequent inter-node communication (kernel switch, going back to the CPU)

FPGA in HPC
• Strengths of recent FPGAs for HPC
  • true co-design with applications (essential)
  • programmability improvement: OpenCL and other high-level languages
  • high-performance interconnect: 40 Gb to 100 Gb
  • precision control is possible
  • relatively low power
• Problems
  • programmability: OpenCL is not enough, not efficient
  • low nominal FLOPS: still cannot catch up with GPUs -> "never try what GPUs already do well"
  • memory bandwidth: one generation older than high-end CPU/GPU -> to be improved by HBM (Stratix10)
[Photo: Nallatech 520N with Stratix10 FPGA, equipped with 4x 100 Gbps optical interconnection interfaces]

What is an FPGA?
• FPGA: Field Programmable Gate Array
• A reconfigurable logic circuit built from a user description (low or high level), where all described logic elements are implemented in the circuit as routing, gates and flip-flops
• Low-level languages (HDL) have been used so far; recently HLS (High Level Synthesis) is available with C, C++, OpenCL
  • all "n" elements of a computation are pipelined (see the sketch below)
  • the circuit frequency depends on the complexity and length of each element's path
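An illustrative OpenCL single work-item kernel (not from the slides) of the kind an FPGA HLS flow such as the Intel FPGA SDK for OpenCL turns into a pipeline: once the pipeline is filled, roughly one loop iteration completes per clock. The names (a, b, c, n) are illustrative only.

// Hedged sketch: a single work-item OpenCL kernel; on an FPGA the loop body
// becomes a hardware pipeline (load, multiply, add, store as stages).
__kernel void vec_mad(__global const float* restrict a,
                      __global const float* restrict b,
                      __global float* restrict c,
                      const int n)
{
    for (int i = 0; i < n; i++) {
        // each stage of this expression maps to a pipeline stage
        c[i] = a[i] * b[i] + c[i];
    }
}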

Simple pros/cons

           performance   programming   external communication
           (FLOPS)       cost          (sec, B/s)
  CPU      △             ○             ◎
  GPU      ◎             △             ○
  FPGA     ○             ◎             × ➝ △?

How to compensate with each other toward a large degree of strong scaling?

AiS and PACS-X Project

AiS: conceptual model of Accelerator in Switch
[Figure: within a node, the CPU, GPU and FPGA are connected via PCIe; the CPU invokes GPU/FPGA kernels and transfers data via PCIe (GPU kernels can also be invoked from the FPGA); FPGAs are linked to each other through QSFP+ interconnects and an Ethernet switch]
• The FPGA can work for both computation and communication in a unified manner
• The GPU/CPU can request collective or specialized application-specific communication from the FPGA

Multi-Hybrid Accelerated Computing: PACS-X Project
• Combining the strengths of different types of accelerators: GPU + FPGA
  • the GPU is still an essential accelerator for simple and highly parallel workloads, providing ~10 TFLOPS of peak performance
  • the FPGA is a new type of accelerator for application-specific hardware with programmability, sped up by pipelining the calculation
  • the FPGA is also good for external communication, with advanced high-speed interconnects of up to 100 Gbps x 4 channels
• Multi (two types of devices) - Hybrid (acceleration) Supercomputing
  • a PC cluster with multiple acceleration devices: GPU + FPGA (+ CPU)
  • mixing their strengths for high-performance computational science, especially for strong scaling over various problem sizes in complicated calculations such as multi-physics simulations
  • using the FPGA's external communication links for low-latency, high-bandwidth communication

PPX (Pre-PACS-X): testbed under the AiS concept (x 13 nodes)
[Node block diagram]
• CPU: Xeon (Broadwell) x2, connected via QPI
• GPU: NVIDIA P100 x2 or V100 x2 (coarse-grain offloading)
• FPGA: Intel Arria10 or Xilinx Kintex (fine-grain partial offloading + high-speed interconnect); 40 Gb Ethernet x2 -> upgraded to 100G x4 (Stratix10)
• HCA: Mellanox IB/EDR (100G InfiniBand EDR)
• Storage: 1.6 TB NVMe

OpenCL-enabled high-speed network
• An OpenCL environment is available
  • e.g. Intel FPGA SDK for OpenCL
  • basic computation can be written in OpenCL without Verilog HDL
• But current FPGA boards are not ready for OpenCL access to the interconnect
  • the BSP (Board Support Package) is not complete for the interconnect ➝ we developed one for OpenCL access
• Our goals
  • enabling OpenCL descriptions by users, including inter-FPGA communication
  • providing a basic set of HPC functions such as collective communication and a basic linear algebra library
  • providing 40G to 100G Ethernet access with external switches for large-scale systems
• CoE (Channel over Ethernet)
  • a Verilog HDL module, developed at CCS, that connects OpenCL user kernels to the FPGA's optical interconnect interface hardware (IP)

Enabling networking from OpenCL
• The Board Support Package (BSP) is a hardware component describing a board specification
• We added a network controller into the BSP
  • 40 Gb Ethernet IP (on Arria10) ⇨ 100 Gb on Stratix10
  • bridge logic between the OpenCL kernels and the Ethernet IP
• OpenCL kernels can use it through BSP I/O channels
[Figure: FPGA board (A10PL4) containing the Ethernet IP core with QSFP+ port, DDR4 memory controllers, the PCIe controller and the OpenCL kernel inside the BSP; the host PC runs the host application and driver]

CoE user-level programming

sender code on FPGA1

__kernel void sender(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        float v = x[i];
        // push each element into the CoE network output channel
        write_channel_intel(network_out, v);
    }
}

receiver code on FPGA2

__kernel void receiver(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        // pull each element from the CoE network input channel
        float v = read_channel_intel(network_in);
        x[i] = v;
    }
}
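For context, a hedged sketch (not from the slides) of how a host program might launch such a kernel with the standard OpenCL C API; it assumes a context, queue and program already created for the FPGA device, uses the "sender" kernel name from the example above, and omits error handling.

/* Hedged host-side sketch: enqueue the "sender" kernel as a single
 * work-item task on the FPGA's OpenCL command queue. */
#include <CL/cl.h>

void run_sender(cl_context ctx, cl_command_queue q, cl_program prog,
                const float *host_x, int n)
{
    cl_int err;
    /* copy the host array into an FPGA-side buffer */
    cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * n, (void *)host_x, &err);

    cl_kernel k = clCreateKernel(prog, "sender", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &x);
    clSetKernelArg(k, 1, sizeof(int), &n);

    /* single work-item kernel: global size = 1 */
    size_t gsize = 1;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clFinish(q);

    clReleaseKernel(k);
    clReleaseMemObject(x);
}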

Performance comparison: traditional way vs. CoE
[Figure: two paths for FPGA-to-FPGA communication between nodes. via-IB: FPGA -> CPU over PCIe Gen3 x8 (56 Gbps), CPU -> IB HCA over PCIe Gen3 x16, then IB EDR (100 Gbps) through an InfiniBand switch. CoE: FPGA -> FPGA directly over QSFP+ (40 Gbps) through an Ethernet switch.]

Communication bandwidth (on Arria10)
[Bandwidth chart vs. message size; minimum latency = 950 ns; theoretical peak performance = 40 Gbps]
[Reference] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Taisuke Boku, "Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming", Proc. of Int. Workshop on Accelerators and Hybrid Exascale Systems (AsHES2019) in IPDPS2019 (to be published), May 20th, 2019.

GPU-FPGA communication (via CPU memory)
[Figure: data moves between the FPGA and the GPU through CPU memory, requiring two PCIe copies]

GPU-FPGA communication (DMA)
[Figure: the FPGA's DMA engine moves data directly between GPU memory and FPGA memory over PCIe, without staging in CPU memory]

OpenCL kernel on the FPGA that kicks a GPU-to-FPGA DMA transfer:

__kernel void fpga_dma(__global float *restrict fpga_mem,
                       const ulong gpu_memadr,
                       const uint id_and_len)
{
    cldesc_t desc;
    // DMA transfer GPU -> FPGA: build the descriptor
    desc.src = gpu_memadr;
    desc.dst = (ulong)(&fpga_mem[0]);
    desc.id_and_len = id_and_len;
    // kick the DMA engine, then wait for the completion status
    write_channel_intel(fpga_dma, desc);
    ulong status = read_channel_intel(dma_stat);
}

Communication Bandwidth (on Arria10 – V100)
[Bandwidth chart vs. message size, comparing FPGA<->GPU transfers via CPU with FPGA-GPU DMA in both directions; minimum latency: 0.60 µsec with FPGA-GPU DMA vs. 1.44 µsec via CPU]
[Reference] Ryohei Kobayashi, Norihisa Fujita, Yoshiki Yamaguchi, Ayumi Nakamichi, Taisuke Boku, "GPU-FPGA Heterogeneous Computing with OpenCL-enabled Direct Memory Access", Proc. of Int. Workshop on Accelerators and Hybrid Exascale Systems (AsHES2019) in IPDPS2019 (to be published), May 20th, 2019.

How to Program GPU+FPGA?

King Ghidorah (by Toho)
[Photos: King Ghidorah (by Toho) next to our system: a node with CPU, GPUs, FPGA, IB HCAs and optical links. How to program it??]

Method-1: CUDA + OpenCL

• Call two device kernels, written in CUDA (for the GPU) and OpenCL (for the FPGA)
• CUDA compiler (NVIDIA/PGI) and OpenCL compiler (Intel) → two "host" programs would normally exist
  • the host program's behavior differs between the two systems, but they can be combined → one host program calls the kernels of both systems
• We identified the libraries to be resolved for each compiler and confirmed that they don't conflict → link everything together (a sketch follows below)
[Figure: one host program invoking a CUDA kernel and an OpenCL kernel]
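A hedged sketch of the single-host-program idea of Method-1, under the assumption of a wrapper function launch_gpu_kernel() compiled separately by nvcc and an already-created OpenCL kernel for the FPGA; these names are illustrative, not the actual code used on the slides, and error handling is omitted.

/* Hedged sketch: one host program driving both a CUDA kernel (via an
 * nvcc-compiled wrapper) and an OpenCL kernel on the FPGA. */
#include <CL/cl.h>
#include <cuda_runtime.h>

/* assumed wrapper, defined in a .cu file and compiled by nvcc;
 * it launches the CUDA kernel on data already resident on the GPU */
extern void launch_gpu_kernel(float *d_data, int n);

void run_hybrid(cl_command_queue fpga_q, cl_kernel fpga_kernel,
                cl_mem fpga_buf, float *host_data, int n)
{
    /* GPU side: allocate, copy, launch through the wrapper */
    float *d_data;
    cudaMalloc((void **)&d_data, sizeof(float) * n);
    cudaMemcpy(d_data, host_data, sizeof(float) * n, cudaMemcpyHostToDevice);
    launch_gpu_kernel(d_data, n);

    /* FPGA side: set arguments and enqueue the OpenCL kernel as a task */
    clSetKernelArg(fpga_kernel, 0, sizeof(cl_mem), &fpga_buf);
    clSetKernelArg(fpga_kernel, 1, sizeof(int), &n);
    size_t gsize = 1;
    clEnqueueNDRangeKernel(fpga_q, fpga_kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);

    /* wait for both devices; the two runtimes coexist in one process */
    cudaDeviceSynchronize();
    clFinish(fpga_q);
    cudaFree(d_data);
}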

Method-2: OpenACC high-level coding
• OpenACC is available both for GPU and FPGA
  • under development in collaboration between CCS-Tsukuba and ORNL
  • GPU compilation: PGI OpenACC compiler
  • FPGA compilation: OpenARC (OpenACC for FPGA), developed at the FTG, ORNL -> collaboration between CCS and ORNL
• Solving some conflicts in the host code environment and runtime
  • basic execution is confirmed; example codes are being tested (a directive sketch follows below)
• Issues
  • performance tuning for GPU and FPGA differs completely
    • GPU: horizontal (data) parallelism in SIMD manner
    • FPGA: clock-level pipelining
  • memory model difference: HBM2 vs. BRAM
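A minimal, generic OpenACC example (not from the slides) of the kind of code that could be fed both to the PGI compiler for the GPU and to OpenARC for the FPGA; the loop, sizes and names are illustrative assumptions.

/* Hedged sketch: the same OpenACC-annotated loop can be compiled for a
 * GPU (PGI) or, through OpenARC, translated toward an FPGA backend.
 * A GPU maps iterations to SIMD threads; an FPGA would pipeline the body. */
#include <stdio.h>
#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* copy a and b to the device, copy c back; offload the loop */
    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}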

Method-3: high-level parallel programming paradigm
• XcalableACC
  • under development in collaboration between CCS-Tsukuba and RIKEN-AICS
  • the PGAS language XcalableMP is extended to embed OpenACC, for sophisticated coding of distributed-memory parallelization with accelerators
  • inter-node communication among FPGAs can be implemented over the FPGA-Ethernet direct links
  • data movement between GPU and FPGA
• OpenACC for FPGA
  • (planned) research collaboration with the ORNL FTG
  • OpenACC -> OpenCL -> FPGA compilation by the OpenARC project is under development
• Final goal: XcalableACC with the OpenARC compiler and CoE

Application Example: ARGOT
• ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree)
  • simulator for the early-stage universe where the first stars and galaxies were born
  • radiative transfer code developed at the Center for Computational Sciences (CCS), University of Tsukuba
  • CPU (OpenMP) and GPU (CUDA) implementations are available
  • inter-node parallelism is also supported using MPI
• ART (Authentic Radiation Transfer) method
  • solves the radiative transfer from light sources spreading out in space
  • the dominant computation part (90%+) of the ARGOT program
• In this research, we accelerate the ART method on an FPGA, using the Intel FPGA SDK for OpenCL as the HLS environment

Radiation SPH simulation of Radiative Feedback on First Star Formation
[Simulation snapshot: 1 million dark matter / gas particles; a star emits UV radiation toward a gas cloud]

UV radiation from a star generates an ionized region accompanied by a shock, which collides with a gas cloud. If the cloud density is higher than a threshold value, it can collapse to form a new star.

ARGOT code: radiation transfer simulation
[Figure: radiation from spot light sources is handled by the ARGOT method; radiation from spatially distributed light sources is handled by the ART method]
[Figure (AiS mapping): the ARGOT part runs on the GPU and the ART part on the FPGA within each node, following the AiS (Accelerator in Switch) model]

ART Method
• The ART method is based on ray tracing
  • the 3D target space is split into 3D meshes
  • rays come in from the boundaries and move in straight lines, parallel to each other
  • the directions (angles) are given by the HEALPix algorithm
• The ART method computes the radiative intensity on each mesh as shown in formula (1)
  • the bottleneck of this kernel is the exponential function (expf)
  • there is one expf call per frequency (ν); the number of frequencies is 1 to 6 at maximum, depending on the target problem
  • all computation uses single precision
• The memory access pattern for the mesh data varies depending on the ray's direction
  • not suitable for SIMD-style architectures
  • FPGAs can optimize it using custom memory access logic
(An illustrative per-mesh update is sketched below.)
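Formula (1) itself is not reproduced in this extraction; as a hedged illustration only, a per-mesh, per-frequency attenuation of the form I_out = I_in * exp(-Δτ_ν) is typical of such ray-tracing radiative transfer, which is why one expf per frequency dominates the kernel. The single-precision C sketch below (the names, the source-term handling and the data layout are assumptions, not the actual ARGOT code) shows that structure.

/* Hedged sketch of a per-mesh ART-style update: one expf per frequency,
 * all in single precision. */
#include <math.h>

#define NU_MAX 6  /* number of frequencies, 1..6 depending on the problem */

void art_update_mesh(float intensity[NU_MAX],    /* I_in -> I_out per frequency */
                     const float dtau[NU_MAX],   /* optical depth through this mesh */
                     const float source[NU_MAX], /* local source term (assumed) */
                     int num_freq)
{
    for (int nu = 0; nu < num_freq; nu++) {
        /* the expf call is the dominant cost of the kernel */
        float att = expf(-dtau[nu]);
        intensity[nu] = intensity[nu] * att + source[nu] * (1.0f - att);
    }
}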

Parallelizing the ART method using channels
• PE (Processing Element)
  • the computation kernel of the ART method
  • each PE holds a mesh sub-domain of size 8³ in its local memory
• BE (Boundary Element)
  • handles ray I/O, including boundary processing
  • R/W access to the boundary, or R/W access to the ray buffer
• Channels are used for communication
  • transferring ray data (96 bits x2) to/from the neighboring PE or BE
  • two channels (read, write) are used for each connection because a channel is one-sided
[Figure: a 2D array of PEs surrounded by BEs in the x and y directions, connected by channels carrying ray data; a simplified PE example follows below]
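A hedged, much-simplified OpenCL sketch (not the actual ARGOT kernel) of the channel pattern described above: a PE reads a ray from its incoming channel, updates it against its local mesh, and forwards it to the next PE or BE. The ray_t layout, channel names and the update itself are illustrative assumptions.

// Hedged sketch of one PE in the channel-connected PE/BE array.
#pragma OPENCL EXTENSION cl_intel_channels : enable

typedef struct { float intensity; ushort ix, iy, iz; } ray_t;

channel ray_t ray_in  __attribute__((depth(64)));
channel ray_t ray_out __attribute__((depth(64)));

__kernel void pe(const int num_rays)
{
    // local mesh sub-domain held by this PE (8x8x8, illustrative;
    // loading of the sub-domain is omitted for brevity)
    float dtau_local[8][8][8];

    for (int r = 0; r < num_rays; r++) {
        ray_t ray = read_channel_intel(ray_in);           // from neighbor PE/BE
        float att = exp(-dtau_local[ray.ix][ray.iy][ray.iz]);
        ray.intensity *= att;                             // attenuate along the ray
        write_channel_intel(ray_out, ray);                // forward to next PE/BE
    }
}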

Performance (FPGA vs. GPU vs. CPU)
[Chart: ART kernel performance for several problem sizes; accelerator results of 1282.8, 1165.2, 1111.0 and 1133.5 against CPU results of 112.4 (14C), 165.0 (28C), 183.4 (28C) and 227.2 (28C); annotations: 2.3x faster, 11x faster; 14C = 14 CPU cores = 1 socket, 28C = 28 CPU cores = 2 sockets]
• FPGA is faster than the CPU on all problem sizes
• FPGA is 4.6 times faster on 64³ and 6.9 times faster on 128³
• The performance decrease on the CPU from 64³ to 128³ is caused by cache misses: the mesh data no longer fits into the cache
[Reference] Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku, Yuma Oobata, Yoshiki Yamaguchi, Kohji Yoshikawa, Makino Abe, Masayuki Umemura, "Accelerating Space Radiative Transfer on FPGA using OpenCL", Proc. of HEART2018 (Int. Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies), Toronto, Jun. 21st, 2018.

Cygnus: New Supercomputer at CCS

System name and construction
• System name: "Cygnus"
  • named after "Cygnus A", which has two accelerated radiation streams
• Consisting of two parts
  • GPU-only part: equipped with CPUs and GPUs, used as an ordinary GPU cluster
  • GPU+FPGA part: equipped with GPUs and FPGAs connected via PCIe, plus CPUs, used as an AiS system as well as an ordinary GPU cluster (when not using the FPGAs)
  • the CPUs and GPUs of both parts are identical, as is the PCIe switch configuration
• The two parts are combined by a high-performance InfiniBand interconnect
• When no FPGA-ready job is running, the GPU+FPGA part is also available as a GPU-only part, for a high utilization ratio of the system
• Each computation node of either part has a "fat" configuration: very high performance in both computation and communication

55 HPC-AI Advisory Council @ Perth 2019/08/27 Center for Computational Sciences, Univ. of Tsukuba Deneb • Cygnus has two major starts: Deneb and Albireo • Deneb is alpha-star (largest) • Albireo is beta-star and it is “double-star” ⇒ looks like two accelerators Albireo (GPU and FPGA) work together

Single node configuration (Albireo)
[Node diagram: two CPUs, each with two HCAs to the network switches (100 Gbps x2 per CPU); each CPU connects through a PCIe switch to 2 GPUs and 1 FPGA; the two FPGAs also join the inter-FPGA direct network (100 Gbps x4 each)]
• Each node is equipped with both the IB EDR network and the FPGA-direct network
• Some nodes (Albireo) are equipped with both FPGAs and GPUs; the other nodes have GPUs only

Single node configuration (Deneb)
[Node diagram: same as Albireo but without FPGAs; two CPUs, each with two HCAs (100 Gbps x2) and a PCIe switch connecting 2 GPUs]

Specification of Cygnus

  Item                       Specification
  Peak performance           2.4 PFLOPS DP (GPU: 2.24 PFLOPS, CPU: 0.16 PFLOPS) + 0.64 PFLOPS SP (FPGA)
  # of nodes                 80 (32 Albireo nodes, 48 Deneb nodes)
  CPU / node                 Intel Xeon Gold x2 sockets
  GPU / node                 NVIDIA V100 x4 (PCIe)
  FPGA / node                Nallatech (Bittware) 520N with Intel Stratix10 x2 (each with 100 Gbps x4 links)
  NVMe                       Intel NVMe 1.6 TB, driven by NVMe-oF Target Offload
  Global file system         DDN Lustre, RAID6, 2.5 PB
  Interconnection network    Mellanox InfiniBand HDR100 x4 = 400 Gbps/node (switch: HDR200)
  Total network bandwidth    4 TB/s
  Programming languages      CPU: C, C++, Fortran, OpenMP / GPU: OpenACC, CUDA / FPGA: OpenCL, Verilog HDL
  MPI                        MVAPICH2, IntelMPI with GDR + original FPGA-GPU communication library
  System integrator          NEC

Two types of interconnection network
• Inter-FPGA direct network (only for Albireo nodes)
  • the 64 FPGAs on the Albireo nodes (2 FPGAs/node) are connected by an 8x8 2D torus network without a switch
• InfiniBand HDR100/200 network (100 Gbps x4/node) for parallel processing communication and shared file system access from all nodes
  • all computation nodes (Albireo and Deneb) are connected by a full-bisection Fat-Tree network with 4 channels of InfiniBand HDR100 per node (combined into HDR200 at the switch), used for parallel processing communication such as MPI, and also to access the Lustre shared file system
(A small helper for the torus neighbor mapping is sketched below.)
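Just to illustrate the 8x8 torus topology mentioned above, a hedged C helper (not from the slides; the FPGA-ID-to-coordinate mapping of the real machine is not specified here) that computes the four torus neighbors of an FPGA.

/* Hedged sketch: neighbors in an 8x8 2D torus of 64 FPGAs. The mapping of
 * FPGA IDs to (x, y) coordinates is an assumption for illustration only;
 * the actual Cygnus wiring may differ. */
#include <stdio.h>

#define TORUS_X 8
#define TORUS_Y 8

typedef struct { int xp, xm, yp, ym; } neighbors_t;

neighbors_t torus_neighbors(int id)
{
    int x = id % TORUS_X;
    int y = id / TORUS_X;
    neighbors_t n;
    n.xp = ((x + 1) % TORUS_X) + y * TORUS_X;             /* +x neighbor */
    n.xm = ((x - 1 + TORUS_X) % TORUS_X) + y * TORUS_X;   /* -x neighbor */
    n.yp = x + ((y + 1) % TORUS_Y) * TORUS_X;             /* +y neighbor */
    n.ym = x + ((y - 1 + TORUS_Y) % TORUS_Y) * TORUS_X;   /* -y neighbor */
    return n;
}

int main(void)
{
    neighbors_t n = torus_neighbors(0);
    printf("FPGA 0 neighbors: +x=%d -x=%d +y=%d -y=%d\n", n.xp, n.xm, n.yp, n.ym);
    return 0;
}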

Cygnus Overlook
[Photos of the Cygnus system and an Albireo node: 4 GPUs, 2 FPGAs and 2 CPUs per node; IB HDR100 x4 ⇨ HDR200 x2; 100 Gbps x4 FPGA optical network; IB HDR200 switch (for the full-bisection Fat-Tree)]

Summary
• CCS will introduce a new supercomputer based on the AiS concept as the next generation's multi-hybrid accelerated supercomputer
• The new machine, Cygnus, is equipped with very high performance GPUs and (partially) FPGAs to build a "strong-scaling ready" accelerated system for applications where GPU-only solutions are weak, as well as for all kinds of GPU-ready applications
• FPGA for HPC is a new concept toward the next generation's flexible and low-power solution beyond GPU-only computing
• Multi-physics simulation is the first-stage target of Cygnus, and it will be expanded to a variety of applications where a GPU-only solution has some bottleneck
• New types of applications, including AI-related ones with computational complexity such as non-SIMD-style parallelism, mixed (and unusual) precision floating-point/integer operations, irregular computation, etc., will be challenged
