NEC SX-Aurora TSUBASA


NEC HPC Business

NEC Business

NEC Technology Labs

Solution Areas in EMEA

Solution areas across the Public, Carrier, Enterprise, Smart Energy, and System platform sectors:

 Public: public safety (face & fingerprint recognition, urban surveillance cameras and analytics, emergency service solutions, cyber security), transport solutions, smart cities
 Carrier: business mobility, mobile backhaul, Small Cells as a Service (ScaaS), unified communications & collaboration, SDN/NFV solutions, TOMS (OSS/BSS), value added services
 Enterprise: IT solutions for retail, manufacturing, and transport; broadcast; cloud services
 Smart Energy: Energy Storage Systems (ESS), Generation Energy Management Systems (GEMS), Distributed Energy Management Systems (DEMS/SEMS), Demand Response Energy Management Systems (DREMS)
 System platform: desktop displays, large format displays, projectors, digital cinema, storage systems, high availability servers, mainframes, cloud platform, SDN/NFV solutions

Our Global Scope

[World map legend: marketing & service affiliates, manufacturing affiliates, liaison offices, branch offices, laboratories]

Business activities in over 160 countries and territories; affiliates, offices, and laboratories in 58 countries and territories.

NEC HPC EMEA

NEC HPC portfolio

HPC portfolio: SX-Aurora TSUBASA (vector CPU) | LX Series (x86 CPU & GPU) | Storage (scalable appliances)

NEC expertise: application tuning, service & support, system architecting, R&D, and system integration, all targeting sustained performance.

NEC Offers Direct Full Customer Service

We deliver comprehensive solutions for High Performance Computing (HPC)
▌Turnkey
▌Caring for the customer during the whole solution lifecycle:
 From individual consulting to managed services
 Customers have access to the NEC Aurora Benchmark Center

Solution lifecycle steps:
 Individual presales consulting
 Application-, customer-, and site-specific sizing of the HPC solution
 Benchmarking of different systems
 Burn-in tests of systems
 Software & OS installation
 Application installation
 Onsite hardware installation
 Integration into the customer's environment
 Customer training
 Maintenance, support & managed services
 Continual improvement

NEC key strengths

▌NEC has expertise in software
 Software development
 Benchmarking and code optimization
 Cluster management solution development

▌NEC develops and builds optimized HPC hardware solutions
 SX vector architecture for maximizing sustained performance
 Storage appliances for parallel file systems

▌NEC services
 User and system administrator training
 Onsite or remote administration
 User application support (optimization)
 Workshops and hackathons
 Joint research activities

NEC – Main HPC Customers

Meteorology, Climatology, Environment Research Centers

Universities, HPC Centers

Industry

NEC Top 500 entries as of November 2020

Rank | Site | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s)

33 | National Institute for Fusion Science (NIFS), Japan | Plasma Simulator: SX-Aurora TSUBASA A412-8, Vector Engine Type10AE 8C 1.58GHz, InfiniBand HDR 200 (NEC) | 34,560 | 7,892.7 | 10,510.7

140 | Deutscher Wetterdienst, Germany | SX-Aurora TSUBASA A412-8, Vector Engine Type10AE 8C 1.58GHz, InfiniBand HDR 100 (NEC) | 14,848 | 2,595.6 | 4,360.0

152 | RWTH Aachen University IT Center, Germany | CLAIX (2018): Intel HNS2600BPB, Xeon Platinum 8160 24C 2.1GHz, Intel Omni-Path (NEC) | 61,200 | 2,483.6 | 4,112.6

251 | Universitaet Mainz, Germany | Mogon II: NEC Cluster, Xeon Gold 6130 16C 2.1GHz / MEGWARE MiriQuid, Xeon E5-2630v4, Intel Omni-Path (NEC/MEGWARE) | 49,432 | 1,967.8 | 2,800.9

298 | Institute for Molecular Science, Japan | Molecular Simulator: NEC LX Cluster, Xeon Gold 6148/6154, Intel Omni-Path (NEC) | 38,552 | 1,785.6 | 3,072.8

304 | German Aerospace Center (DLR), Germany | CARA: NEC LX Cluster, AMD EPYC 7601 32C 2.2GHz, InfiniBand HDR (NEC) | 145,920 | 1,746.0 | 2,568.2

437 | Center for Computational Sciences, University of Tsukuba, Japan | Cygnus: NEC LX Cluster, Xeon Gold 6126 12C 2.6GHz, NVIDIA Tesla V100, InfiniBand HDR100 (NEC) | 27,520 | 1,582.0 | 2,399.7

SX-Aurora TSUBASA

Developing supercomputing technologies for 35 years

[Timeline 1980-2020: SX-2 (1983), SX-3 (1990), SX-4, SX-5, SX-6, SX-7, SX-8, SX-9, SX-ACE, and SX-Aurora TSUBASA (2020)]

Development Philosophy

1. Highest memory bandwidth per processor and highest memory bandwidth per core

2. Standard Fortran/C/C++ programming with OpenMP; automatic vectorization/parallelization

3. Power limitation for HPC systems becomes an issue: higher sustained performance with lower power

Architecture of SX-Aurora TSUBASA

[Diagram: the x86 server (controlling side) runs standard Linux plus the VE OS and VE tools; the Vector Engine (processing side) delivers ~1.5TB/s memory access at low power (~200W). The stack on top comprises applications, vector libraries, tools, and the vector compiler.]

Aurora Architecture

x86 node + Vector Engine (VE): VE capability is provided within a standard x86/Linux environment.

▌Product
 VE + x86 node
▌SW environment
 x86 Linux OS, VE driver, VE OS
 Fortran/C/C++ standard programming
 Automatic vectorization by the proven vector compiler
▌Interconnect
 InfiniBand for MPI

Execution modes - Native

 The Vector Engine executes the entire application
 Only system calls (e.g. I/O) are handled by the x86 host
 Direct MPI communications (the x86 host is not involved)

[Diagram: the application and MPI library run on the VEs; system calls are forwarded through the PCIe (PLX) switch to Linux on the x86 processor; MPI traffic flows VE-to-VE over InfiniBand.]

Execution modes - Offload

Two variants:
 The application runs on x86 and offloads vectorized functions to the VE ("accelerator mode", as with a GPU)
 The application runs on the Vector Engine and offloads "inefficient" scalar functions to the x86 host

[Diagram: in accelerator mode, the application on the x86 processor calls functions on the Vector Engine; in reverse offload, the application on the VE calls scalar functions on the x86 processor, whose Linux OS also handles the system calls.]
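To make the accelerator-mode flow concrete, here is a minimal hedged sketch of the x86 side using the VEO (VE Offloading) C API; the kernel library name libvesample.so and the kernel add_arrays are illustrative placeholders, not names from this deck.

/* Offload-mode sketch (x86 side) using the VEO C API.
   Assumes a VE kernel library built separately with the vector compiler,
   e.g.  $ ncc -shared -fpic -o libvesample.so kernel.c
   "libvesample.so" and "add_arrays" are hypothetical names. */
#include <stdio.h>
#include <stdint.h>
#include <ve_offload.h>

int main(void) {
    struct veo_proc_handle *proc = veo_proc_create(0);   /* VE node 0 */
    if (proc == NULL) { fprintf(stderr, "no VE process\n"); return 1; }

    uint64_t lib = veo_load_library(proc, "./libvesample.so");
    struct veo_thr_ctxt *ctx = veo_context_open(proc);   /* VE worker thread */

    struct veo_args *args = veo_args_alloc();
    veo_args_set_i64(args, 0, 1000000);                  /* e.g. problem size */

    /* asynchronous call of the vectorized kernel on the VE */
    uint64_t req = veo_call_async_by_name(ctx, lib, "add_arrays", args);

    uint64_t retval = 0;
    veo_call_wait_result(ctx, req, &retval);             /* join + return value */
    printf("kernel returned %lu\n", (unsigned long)retval);

    veo_args_free(args);
    veo_context_close(ctx);
    veo_proc_destroy(proc);
    return 0;
}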

Large System Configuration

[Diagram: each x86 node (VH) hosts VEs and InfiniBand HCAs behind a PCIe switch; the VHs connect through the InfiniBand network to each other and to the file system.]

MPI communications
 Direct communications between VEs (RDMA, no x86 involvement)
 Hybrid mode with processes on both x86 and Aurora processors

[Diagram: VEs with their own memory sit behind PCIe/IB switches in each VH; VE-to-VE and VE-to-x86 traffic crosses the InfiniBand network, so an x86 cluster can take part in hybrid mode.]
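As a small illustration of native-mode MPI, the sketch below is a plain MPI program in which every rank runs on a VE; the MPI calls are standard, while the compiler-wrapper and launch lines in the comment are assumptions about the NEC MPI toolchain.

/* Native-mode MPI sketch: each rank runs entirely on a Vector Engine;
   VE-to-VE transfers use RDMA over PCIe/InfiniBand without the x86 host.
   Assumed build/run commands (wrapper names are assumptions):
     $ mpincc -O3 ring.c -o ring
     $ mpirun -np 16 ./ring                                            */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* pass a token one step around a ring of VEs */
    int token = rank;
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    MPI_Sendrecv_replace(&token, 1, MPI_INT, next, 0, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d now holds token %d\n", rank, token);

    MPI_Finalize();
    return 0;
}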

Vector Engine (VE) Card Implementation

 Standard PCIe implementation
 Connector: PCIe Gen.3 x16
 Double height (same form factor as NVIDIA GPUs)
 <300W (DGEMM ~210W/VE, STREAM ~200W/VE, HPCG ~215W/VE)

VE20 Processor

VE20 Specifications

Processor version: Type A / Type B
Cores per processor: 10 / 8
Core performance: 307 GF (DP), 614 GF (SP); 0.4TB/s per core
Processor performance: 3.07 TF (DP), 6.14 TF (SP) / 2.45 TF (DP), 4.91 TF (SP)
Cache capacity: 16MB, software-controllable
Cache bandwidth: 3TB/s
Memory capacity: 48GB (HBM2 x 6)
Memory bandwidth: 1.53TB/s
Power: ~300W (TDP), ~200W (application)

Core Architecture

[Diagram, single core: 32 parallel vector pipes, each with VFMA0/VFMA1/VFMA2, ALU0, ALU1, and DIV units; 400GB/s per core and 1.53TB/s per processor memory bandwidth; an SPU (Scalar Processing Unit) drives the vector units.]

Vector Execution

[Diagram: 64 vector registers of 256 elements each (128kB per core); an FMA operation on a 256-element vector is processed as 32 elements per cycle over 8 cycles across 3 FMA pipes.]

Vector length = 256 elements (32 elements x 8 cycles)
Core peak: 307.2 GF = 32 (elements/cycle) x 2 (op per FMA) x 3 (FMA pipes) x 1.6GHz

VE Roadmap

[Roadmap chart, memory bandwidth per VE, 2019-2024:
 VE10 (2019): 8C / 2.45TF, 1.22TB/s
 VE10E: 8C / 2.45TF, 1.35TB/s
 VE20 (2020): 10C / 3.07TF, 1.5TB/s
 VE30: 2+TB/s
 VE40 / VE50: 2++TB/s and beyond]

VE10/10E/20 SKU

[SKU chart: memory capacity/bandwidth vs. processor performance, frequency, and core count]

 Type 10C: 24GB, 0.75TB/s, 8 cores @ 1.4GHz, 2.15TF
 Type 10B: 48GB, 1.22TB/s, 8 cores @ 1.4GHz, 2.15TF
 Type 10A: 48GB, 1.22TB/s, 8 cores @ 1.6GHz, 2.45TF
 Type 10BE: 48GB, 1.35TB/s, 8 cores @ 1.408GHz
 Type 10AE: 48GB, 1.35TB/s, 8 cores @ 1.584GHz
 Type 20B: 48GB, 1.53TB/s, 8 cores, 2.45TF
 Type 20A: 48GB, 1.53TB/s, 10 cores, 3.07TF
 (Type 30: 2+TB/s, future)

Server products

▌A500 / A400 Model
 For large-scale configurations
 DLC with 40C water
▌A300 Rack Mount Model
 Flexible configuration (2VE / 4VE / 8VE)
 Air cooled
▌A100 Tower Model
 For developers/programmers
 Tower implementation, 1VE

B401-8 High Density DLC Model (Components)

Direct liquid cooling (DLC) of the VEs provides higher density and permits hot-water cooling (~45C).

B401-8: 8VE in 2U
 VE x8, AMD Rome processor x1, IB HCA x2
 Performance: 24.5TF per VH
 Memory bandwidth: 12.2TB/s per VH
 Power: ~2.4kW (HPL), ~1.7kW (HPCG)
 Up to 18VH (144VE) per rack

B401-8 High Density DLC

[Photos: front and rear views of a rack with 18VH / 144VE]

Performance

VASP 1/2

The Vienna Ab initio Simulation Package (VASP): proprietary software for atomic-scale materials modelling, e.g. electronic-structure calculations and quantum-mechanical molecular dynamics from first principles (reference: https://www.vasp.at/).

[Charts: performance at 16 cores; scalability]

SX-Aurora TSUBASA: VASP 5.4.4 (patch 5.4.4.16052018); VE Type 20B, partitioning mode ON, NEC compiler 3.0.30, NEC MPI 2.6.0, NLC 2.0.0; VH: B401-8, VE20B x8, EPYC Rome x1, HDR200 x2, DLC/air cooling
Intel Xeon Gold 6252: VASP 5.4.4; 2 sockets per node, 48 cores per node, 2.1GHz (Cascade Lake generation), air cooling

■ Much higher (4.3~9.1x) VASP performance for the same number of cores
■ Much better VASP scalability provides higher sustained performance with fewer cores and nodes

VASP 2/2

[Charts: throughput per node or per VE; power efficiency ratio]

SX-Aurora TSUBASA: VASP 5.4.4 (patch 5.4.4.16052018); VE Type 20B, partitioning mode ON, NEC compiler 3.0.30, NEC MPI 2.6.0, NLC 2.0.0; VH: B401-8, VE20B x8, EPYC Rome x1, HDR200 x2, DLC/air cooling
Intel Xeon Gold 6252: VASP 5.4.4; 2 sockets per node, 48 cores per node, 2.1GHz (Cascade Lake generation), air cooling

■ Node throughput is in the same range as or better than Xeon, while power is much lower (Xeon around 500W per node, SX-Aurora TSUBASA 170~200W per node)
■ Power efficiency is 2.5~5.0x better than Xeon

Weather forecast

[Chart: power efficiency ratio; source: dwd.de]

■ EPYC Rome 7542: 32 cores/socket, 2.9GHz, 2 sockets per node
■ SX-Aurora TSUBASA: VE10AE x8 per VH (single-socket Rome)
■ ICON-ART: status as of 2019

■ The power supply limit is one of the main factors constraining system size
■ Aurora helps accelerate meteorology codes within that power limit
■ For the major meteorology codes, Aurora provides 2-7x higher sustained performance at the same power consumption

Resource Exploration

Power Efficiency Ratio between Xeon/Tesla/VE

■ Xeon 6148: dual-socket Xeon Gold 6148 (2 x 20 cores, 2.4GHz)
■ Tesla V100: one V100 per host
■ SX-Aurora TSUBASA: one VE20B per VH
■ HPC capability is needed for finding oil, producing oil, and managing oil reservoirs
■ Resource-exploration simulations require high memory bandwidth
■ Aurora provides the highest power efficiency compared to x86 and GPGPU

Aurora programming environment

Ease of use: vector cross compiler with automatic vectorization and automatic parallelization

Programming environment:
 Fortran: F2003, F2008
 C/C++: C11, C++14, C++17
 OpenMP: OpenMP 4.5
 Libraries: MPI 3.1, libc, BLAS, LAPACK, etc.
 Debuggers: gdb, Eclipse PTP, ARM DDT
 Tools: PROGINF, FtraceViewer

Cross compilation on the VH:
$ vi sample.c
$ ncc sample.c

Execution environment: the binary is launched on the VH and executes on the VE:

$ ./a.out
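A minimal sketch of what sample.c could contain: a long, dependence-free loop that the vector compiler vectorizes automatically (the runtime PROGINF environment variable named in the comment is an assumption; the PROGINF report itself is shown later in this deck).

/* sample.c: a daxpy-style loop that ncc vectorizes automatically.
   Build and run exactly as above: $ ncc sample.c && ./a.out
   (Assumption: setting VE_PROGINF=YES in the environment prints the
   PROGINF report shown on a later slide.) */
#include <stdio.h>

#define N 1000000
static double x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = (double)i; y[i] = 2.0 * i; }

    /* independent iterations with a long trip count map well onto the
       256-element vector registers of the VE core */
    const double a = 3.0;
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[42] = %f\n", y[42]);
    return 0;
}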

NEC Compiler for Vector Engine

Powerful automatic vectorization and automatic parallelization.

Compiler capabilities:
 Variable-vector-length vectorization
 Masked vector operations can be vectorized
 Many vector mathematical functions are available
 Indirectly accessed vectors can be vectorized
 Reduction operations can be vectorized
 Many kinds of loop transformation to promote vectorization are available

FORMAT LIST
13: P------> for (j = 0; j < m; j++) {
14: |V-----> for (i = 0; i < n; i++) {
15: ||         if (f[i] > 0.0) {
16: ||           a[j][i] = b[j][i] + sin(c[j][i]);
17: ||         }
18: || G       sum += d[i] + e[idx[i]];
19: |V----- }
20: P------ }

DIAGNOSTIC LIST
LINE  DIAGNOSTIC MESSAGE
13: par(1801): Parallel routine generated.: fun$1
13: par(1803): Parallelized by "for".
14: vec( 101): Vectorized loop.
18: vec( 126): Idiom detected.: Sum

The outer loop is parallelized and the inner loop is vectorized automatically.

LLVM-IR Vectorizer

Provides a binary plugin module and runtime library that add automatic vectorization for the VE to your compiler.

▌LLVM-IR Vectorizer
 Soon to be released
 Includes the vectorizer and code generator for VE
 Inputs LLVM-IR from memory or an IR file and outputs assembler source code for VE
 Applies automatic vectorization to the LLVM-IR
 Has APIs to support compiler directives
 Bundled runtime library with vector mathematical functions (sin, cos, etc.) and vector matrix-multiply functions
 MLIR support in the future

[Pipeline: source file → front-end (clang / flang / etc.; modules that create LLVM-IR) → LLVM-IR → LLVM-IR Vectorizer (VE vectorizer + VE code generator) → assembler source file including vectorized loops]
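As a hedged sketch of this clang path: upstream LLVM carries a VE target, so a scalar C kernel like the one below can be cross-compiled for the Vector Engine, with the NEC vectorizer plugin applied on top of that flow; the target triple and flags in the comment are assumptions.

/* saxpy.c: a candidate loop for the LLVM-IR Vectorizer.
   Assumed invocation (triple and flags are assumptions):
     $ clang --target=ve-unknown-linux-gnu -O3 -c saxpy.c              */
void saxpy(int n, float a, const float *x, float *y) {
    /* the vectorizer turns this loop's LLVM-IR into VE vector code */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}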

NEC SDK for VE - PROGINF and FTRACE

CUI Profiler suitable for VE architecture

PROGINF (program-level statistics):

******** Program Information ********
Real Time (sec)               : 121.233126
User Time (sec)               : 121.228955
Vector Time (sec)             : 106.934651
Inst. Count                   : 119280358861
V. Inst. Count                : 29274454500
V. Element Count              : 6389370973939
V. Load Element Count         : 3141249840232
FLOP Count                    : 3182379290112
MOPS                          : 58637.969529
MOPS (Real)                   : 58635.323545
MFLOPS                        : 26251.407833
MFLOPS (Real)                 : 26250.223262
A. V. Length                  : 218.257559
V. Op. Ratio (%)              : 98.733828
L1 Cache Miss (sec)           : 0.639122
VLD LLC Hit Element Ratio (%) : 73.051749
Memory Size Used (MB)         : 27532.000000
Start Time (date)             : Sun Jul 22 20:09:51 2018 JST
End Time (date)               : Sun Jul 22 20:11:52 2018 JST

FTRACE (per-procedure analysis; execution date Sun Jul 22 21:10:50 2018 JST, total CPU time 122.434 sec). Columns: frequency, exclusive time and share, average time, MOPS, MFLOPS, vector-operation ratio, average vector length, vector time, L1 cache miss, CPU port conflict, VLD LLC hit ratio, procedure name. For example, SUB1: 512 calls, 58.110s (47.5%), 68380.5 MOPS, 30870.9 MFLOPS, 99.27% V.OP ratio, average vector length 223.8; program total: 122.434s, 58071.6 MOPS, 25992.7 MFLOPS, 98.59% V.OP ratio.

NEC SDK for VE - NEC Ftrace Viewer

GUI profiler (Ftrace GUI front-end); OpenMP, MPI, and mixed codes are supported.

[Screenshots: exclusive time and V.OP.RATIO of each function; statistics (min/max/avg/stddev) of each function across MPI processes in table form; exclusive time of each MPI process; MPI communication time of each MPI process]

NEC Numerical Library Collection (NLC)

NLC is a collection of mathematical libraries that powerfully supports the development of numerical simulation programs.

▌ASL: scientific library with a wide variety of algorithms for numerical/statistical calculations: linear algebra, Fourier transforms, spline functions, special functions, approximation and interpolation, numerical differentiation and integration, roots of equations, basic statistics, etc.
▌ASL Unified Interface: Fourier transforms and random number generators
▌FFTW3 Interface: interface library for using the Fourier-transform functions of ASL through the FFTW (version 3.x) API
▌BLAS / CBLAS: basic linear algebra subprograms
▌LAPACK: linear algebra package
▌ScaLAPACK: scalable linear algebra package for distributed-memory parallel programs
▌SBLAS: sparse BLAS
▌HeteroSolver: direct sparse solver
▌Stencil Code Accelerator: stencil code acceleration
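Since NLC ships standard BLAS/CBLAS, calling it looks like any CBLAS program; a minimal sketch follows (the link line, which depends on the NLC installation, is not shown and would be an assumption).

/* Small DGEMM through the standard CBLAS interface provided by NLC. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    enum { N = 2 };
    double A[N * N] = {1, 2, 3, 4};
    double B[N * N] = {5, 6, 7, 8};
    double C[N * N] = {0};

    /* C = 1.0 * A * B + 0.0 * C, row-major layout */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}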

Stencil Code Accelerator (SCA) Library

▌ What is "stencil code"?
 A procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc.
 Updates each element in a multidimensional array by referring to the neighbor elements:

B_{i,j,k} = c\,B_{i,j,k} + \sum_{r=-t}^{t} \sum_{q=-t}^{t} \sum_{p=-t}^{t} F_{p,q,r}\,A_{i+p,\,j+q,\,k+r}

 Requires significant performance in both computation and memory access (a naive C sketch follows the examples below).

[Stencil Shape Examples]

▌ Domains where stencil code appears:
 Scientific Simulations: Fluid Dynamics, Thermal Analysis, Electromagnetic Field Analysis, Climate / Weather, etc.
 Signal Processing: Audio, Sonar, Radar, Radio Telescopes, etc.

 Image / Volume Data Processing: Retouch, Data Compression, Recognition, Medical Diagnosis (Biopsy, CT, MRI, …), etc.

 Machine Learning: Deep Learning (Convolutional Neural Networks)
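For reference, the naive C sketch promised above: a 7-point 3-D stencil written out directly. This illustrates the generic access pattern only, not the SCA API.

/* Naive 7-point 3-D stencil: the access pattern SCA accelerates. */
#include <stddef.h>

void stencil7(int nx, int ny, int nz,
              const double *a, double *b, double c) {
#define IDX(i, j, k) (((size_t)(k) * ny + (j)) * nx + (i))
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                /* each point is updated from its six axis neighbours */
                b[IDX(i, j, k)] = c * a[IDX(i, j, k)]
                    + a[IDX(i - 1, j, k)] + a[IDX(i + 1, j, k)]
                    + a[IDX(i, j - 1, k)] + a[IDX(i, j + 1, k)]
                    + a[IDX(i, j, k - 1)] + a[IDX(i, j, k + 1)];
#undef IDX
}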

Stencil Code Accelerator

SCA is a library that highly accelerates stencil computations

▌Library Features  SCA v1: Supports 56 stencil shapes (and stencil shapes can be composed)  SCA v2: Arbitrary stencil shapes can be specified  Supports 1, 2, and 3-dimensional data.  Optimized for Vector Engine nearly to the limit.  For C/C++ and Fortran.

Supported Stencil Shapes

SCA v1 provides 56 widely usable stencil shapes.

▌Stencil Shapes
 SCA supports the following types of stencil shapes:
• {X,Y,Z}-Directional
• {XY,XZ,YZ}-Planar
• {XY,XZ,YZ}-Axial
• {XY,XZ,YZ}-Diagonal
• XYZ-Volumetric
• XYZ-Axial
 For each type, sizes 1, 2, 3, and 4 are supported.

SCA Example

▌Example of Stencil Code – using SCA

Laplace equation \partial^2 A / \partial x^2 + \partial^2 A / \partial y^2 = 0, discretized with finite differences as b_{i,j} = (a_{i,j-1} + a_{i-1,j} + a_{i+1,j} + a_{i,j+1}) / 4:

real :: a(0:ni+1,0:nj+1)
real :: b(1:ni,1:nj)
real,parameter :: c(5)=(/ 0.25,0.25,0.0,0.25,0.25 /)

do itr=1,maxitr
  call sca_compute_with_1x1ya_s( &
       ni,nj,1, &
       ni+2,nj+2,a(1,1), &
       ni,nj,b(1,1),c,0.0)
  a(:,:)=b(:,:)
end do

[Stencil diagram: 5-point cross at (i,j) with weight 0.25 on the four axis neighbours and 0.0 at the centre]

SCA Performance Comparison

SCA on the Vector Engine shows the highest performance.

[Bar chart: single-precision GFLOPS (0 to 1800) for stencil shapes 1x1y1za through 6x6y6za, comparing Aurora VE Type 10B (naive vs. SCA new version), Tesla V100 PCI-E 16GB (naive vs. Physis; GPU memory transfer excluded), and Xeon Gold 6148 x2 (Skylake 2.40GHz, 40C; naive vs. YASK).]

NEC SX-Aurora SW environment in a nutshell

▌Compiler
 ncc, nc++, nfort
• Auto-vectorizing (inner loops), highly optimizing
 clang, clang++: free, open-source LLVM
• Intrinsics, vectorization, OMP target
▌Parallelism
 Vectorization
 OpenMP
 MPI + hybrid MPI
▌Libraries
 NLC: NEC Numerical Library Collection
• ASL incl. FFTW3
• BLAS, LAPACK, ScaLAPACK, BLACS
• SBLAS: sparse BLAS
• HeteroSolver: direct sparse solver
• Stencil Code Accelerator Library
▌Offloading
 VH-call
 VEO / AVEO
 VEDA
• CUDA-like, with long-vector kernels
 OpenMP target in LLVM
• VH→VE
• VE→VH
▌Python
 NLCPy → NumPy on VE
 SOL → PyTorch
 Py-veo
▌Tools
 TAU, PAPI
 Vftrace
 DDT

ML & DL

AI/ML on SX-Aurora TSUBASA

• AI/ML workloads that need memory performance can be accelerated well
• NEC provides frameworks for easy utilization

[Positioning chart, memory performance vs. computation performance: the Vector Engine with Frovedis (statistical machine learning, DataFrame) and NLCPy ranks highest in memory performance; GPUs with SOL (MLP, RNN, BERT, CNN) rank highest in computation; general-purpose CPUs sit below both.]

NLCPy

NLCPy is a package for accelerating the performance of NumPy scripts. It enables NumPy scripts to compute on the Vector Engine of SX-Aurora TSUBASA and provides a subset of NumPy's API.

▌Example of Python Script

NumPy (x86):

import numpy as np
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = a + b
d = np.dot(a, b)

NLCPy (VE), only the import changes:

import nlcpy as np
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = a + b
d = np.dot(a, b)

[Diagram: with NumPy, script and computation stay on the x86 node; with NLCPy, the script runs on the x86 node while the library computes on the Vector Engine, with data transfer between VH and VE.]

Benchmark Result (2-D Heat Simulation)

NLCPy shows higher performance than NumPy (CPU). When the array size is 1024 Mbytes, NLCPy also shows slightly higher performance than CuPy (GPU).

NumPy: Xeon Gold 6126, 48 cores NLCPy : Aurora Type 20B, 8 cores CuPy : Tesla V100 PCI-E, 16GB

NLCPy vs NumPy

NLCPy delivers 3.8x higher performance than NumPy for Haversine and 37x higher performance for the 2-D heat equation.

[Charts: speedup over NumPy vs. problem size; higher is faster. size: number of sets of latitudes and longitudes; Nx, Ny: number of grid points in the x and y directions. Processors: NumPy on Xeon Gold 6126 x 2 sockets (Skylake 2.6 GHz, 48 cores); NLCPy on Aurora 1.4GHz, 8 cores.]

The benchmarking programs are based on the following sites:

https://github.com/weld-project/split-annotations/tree/master/python/benchmarks/haversine

https://benchpress.readthedocs.io/autodoc_benchmarks/heat_equation.html#heat-equation-python-numpy

Frovedis: FRamework Of VEctorized and DIStributed data analytics

▌C++ framework similar to Apache Spark

[Stack: Spark / Python interface on top of the Matrix Library, Machine Learning, and DataFrame components, all built on Frovedis Core. Open source: github.com/frovedis]

▌Supports Spark/Python (scikit-learn) interface on SX-Aurora TSUBASA (also works on X86)

Frovedis for ML

Normal Spark code:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
...
val model = LogisticRegressionWithSGD.train(data)
...

Frovedis code:

import com.nec.frovedis.mllib.classification.LogisticRegressionWithSGD  // change the import
...
FrovedisServer.initialize(...)                                          // start server
val model = LogisticRegressionWithSGD.train(data)
FrovedisServer.shut_down()                                              // stop server
...

ML Acceleration

[Charts: machine learning benchmarks and TPC-H]

■ Xeon Gold 6226: 12 cores/socket, 2 sockets/node, CentOS 8.1
■ SX-Aurora TSUBASA: A311-8, VE Type 10BE, compiler 3.0.7, MPI 2.9.0
■ Spark: 3.0.0
■ Frovedis: as of 2020/09/08

■ Frovedis accelerates Spark executions on the Xeon processor (up to 12x)
■ Frovedis on SX-Aurora TSUBASA provides much higher sustained performance compared to Xeon (up to 144x)

Machine Learning Library

Implemented with Frovedis Core and the Matrix Library
 Supports both dense and sparse data
 Sparse data support is important in large-scale machine learning

▌Supported algorithms:
 Linear model
• Logistic Regression
• Multinomial Logistic Regression
• Linear Regression
• Linear SVM
 ALS
 K-means
 Preprocessing
• SVD, PCA
 Word2vec
 Factorization Machines
 Decision Tree
 Latent Dirichlet Allocation
 Naïve Bayes
 Graph algorithms
• Shortest Path, PageRank, Connected Components

▌Under development:
 Frequent Pattern Mining
 Spectral Clustering
 Hierarchical Clustering
 Deep Learning (MLP, CNN)
 Random Forest
 Gradient Boosting Decision Tree

▌We will support more!

Deep Learning with PyTorch and TensorFlow

▌SOL is an independent, easy-to-plug-in NEC library for deep learning on SX-Aurora TSUBASA
 Optimizes the neural network and generates optimized code for the hardware backend
 Only a few lines are needed to port code that uses DL frameworks
 HPC/AI use case: SOL can generate a C/C++ inference library that is called from the main simulation application; with batch size one, such inference is more memory-bound than compute-bound
▌Currently supported frameworks:

▌VE-TensorFlow:
 Open-source project for native support of TensorFlow 2 on SX-Aurora TSUBASA
 https://github.com/sx-aurora-dev/tensorflow

SOL – Transparent NN Optimizer

Deep Learning with PyTorch and TensorFlow: SOL

▌Very easy to use:

import torch
from torchvision import models
import sol.pytorch as sol

py_model = models.__dict__["…"]()               # any torchvision model
input = torch.rand(1, 3, 224, 224)              # NCHW sample input
sol_model = sol.optimize(py_model, input.size(), batchsize=1)
sol_model.load_state_dict(py_model.state_dict())
sol.device.set(sol.device.ve, 0)
output = sol_model(input)

Training on DenseNet 121 with SOL on SX-Aurora TSUBASA

[Bar chart: training time (ms, 0 to 5000) per iteration, split into forward, backward, and memcopy, for mini-batch sizes 16, 32, 48, and 64 on Xeon 6126, Titan XP, and SX-Aurora.]

ML/DL summary

▌SX-Aurora TSUBASA is an HPC accelerator that supports ML and DL workloads easily with NLCPy, Frovedis, SOL, and VE-TensorFlow.

▌Frovedis shows huge performance improvements on several machine learning use cases.

▌HPC/AI convergence/acceleration will be available with TensorFlow support
