BDEC2 Poznan Japanese HPC Infrastructure Update

Presenter: Masaaki Kondo, RIKEN R-CCS / Univ. of Tokyo (on behalf of Satoshi Matsuoka, Director, RIKEN R-CCS)

Post-K: The Game Changer (2020)
1. Heritage of the K computer: high performance in simulation via extensive co-design
   • High performance: up to 100x the performance of K in real applications
   • Multitudes of scientific breakthroughs via the Post-K application programs
   • Simultaneous high performance and ease of programming
2. New technology innovations of Post-K
   • Global leadership not just in the machine & apps, but as cutting-edge IT: high performance, especially via high memory BW, with performance boosts by "factors" c.f. mainstream CPUs in many HPC & Society 5.0 apps via BW & vector acceleration
   • Very green, e.g. extreme power efficiency: ultra power-efficient design & various power-control knobs
   • Arm global ecosystem & SVE contribution: top CPU in an Arm ecosystem of 21 billion chips/year; SVE co-designed and first implemented worldwide by Fujitsu
   • Arm: massive ecosystem from embedded to HPC; high performance on Society 5.0 apps incl. AI, with architectural features for Big Data, AI/ML, CAE/EDA, blockchain security, etc.
   • Technology not limited to Post-K, but flowing into societal IT infrastructures, e.g. clouds

Arm64fx & Post-K (to be renamed)
• Fujitsu–RIKEN co-designed A64FX, Armv8.2-A (SVE), 48/52-core CPU
• HPC optimized: extremely high package memory BW (1 TB/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 TFLOPS), various AI support (FP16, INT8, etc.)
• General-purpose CPU – Linux, Windows (Word), other SCs/clouds
• Extremely power efficient – >10x power/performance efficiency on a CFD benchmark over a current mainstream x86 CPU
• Largest and fastest supercomputer ever built, circa 2020:
   • >150,000 nodes, superseding LLNL Sequoia
   • >150 PB/s memory BW
   • Tofu-D 6D torus network, 60 Pbps injection BW (10x global IDC traffic)
   • 25–30 PB NVMe L1 storage
   • ~10,000-endpoint 100 Gbps I/O network into Lustre
• The first "exascale" machine (not exa 64-bit flops, but in application performance)

Post-K A64FX processor: a many-core Arm CPU…
• 48 compute cores + 2 or 4 assistant (OS) cores
• Brand-new core design, near Xeon-class integer performance per core
• Armv8 – 64-bit Arm ecosystem
• Tofu-D + PCIe Gen3 external connections
…but also an accelerated, GPU-like processor
• SVE 512-bit vector extension (Arm & Fujitsu): integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
• Cache + scratchpad-like local memory (sector cache)
• HBM2 on-package memory – massive memory BW (bytes per DP flop ~0.4)
• Streaming memory access, strided access, scatter/gather, intra-chip barrier synchronization, and other memory-enhancing features
→ GPU-like high performance in HPC, AI/Big Data, autonomous driving…
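The SVE extension mentioned above is vector-length agnostic, so the same binary exploits the 512-bit vectors of A64FX without being recompiled for a fixed width. Below is a minimal, hedged sketch using the Arm C Language Extensions (ACLE); the function name and loop are illustrative only (not taken from the slides), and it assumes a compiler with arm_sve.h support (e.g. built with -march=armv8.2-a+sve):

    #include <arm_sve.h>
    #include <stdint.h>

    /* y[i] += a * x[i], written vector-length agnostically: on A64FX each
       iteration processes 8 doubles (512-bit vectors); on narrower SVE
       hardware the same code simply processes fewer elements per iteration. */
    void daxpy_sve(int64_t n, double a, const double *x, double *y)
    {
        for (int64_t i = 0; i < n; i += svcntd()) {      /* svcntd(): doubles per vector */
            svbool_t pg = svwhilelt_b64(i, n);            /* predicate masks the loop tail */
            svfloat64_t vx = svld1_f64(pg, &x[i]);
            svfloat64_t vy = svld1_f64(pg, &y[i]);
            vy = svmla_f64_x(pg, vy, vx, svdup_f64(a));   /* vy += vx * a */
            svst1_f64(pg, &y[i], vy);
        }
    }

Predicated loads and stores remove the scalar remainder loop, which also matters for the strided and scatter/gather access patterns listed above.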
A64FX: Spec Summary
• Arm SVE, high performance and high efficiency
• DP performance: 2.7+ TFLOPS, >90% of peak on DGEMM
• Memory BW: 1024 GB/s, >80% of peak on STREAM Triad
• ISA (base, extension): Armv8.2-A, SVE
• Process technology: 7 nm
• SIMD width: 512-bit
• Cores: 48 compute + 4 assistant, organized as 4 CMGs (Core Memory Groups) of 12 compute cores + 1 assistant core, connected by a NoC (Network on Chip)
• Memory capacity: 32 GiB (HBM2 x4)
• Peak memory bandwidth: 1024 GB/s
• I/O: PCIe Gen3 x16 lanes; Tofu-D high-speed interconnect integrated on chip

Preliminary performance evaluation results
• Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
• Himeno benchmark (Fortran 90) † "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

Overview of Post-K System & Storage
• 3-level hierarchical storage:
   • 1st layer: GFS cache + temporary FS (25–30 PB NVMe)
   • 2nd layer: Lustre-based GFS (a few hundred PB of HDD)
   • 3rd layer: off-site cloud storage
• Full machine spec:
   • >150,000 nodes, ~8 million high-performance Armv8.2 cores
   • >150 PB/s memory BW
   • Tofu-D at 60 Pbps injection BW (10x global IDC traffic)
   • ~10,000 I/O fabric endpoints
   • >400 racks, ~40 MW, machine + IDC PUE ~1.1, high-pressure DLC
• NRE pays off: roughly equivalent to 15–30 million state-of-the-art competing CPU cores for HPC workloads (both dense and sparse problems)
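A back-of-the-envelope check of how the per-node A64FX numbers scale to the full-machine figures above, using only values quoted in these slides (the ~0.4 Tbps per-node injection figure is the Tofu-D bandwidth from the earlier slide):

    \[
    150{,}000~\text{nodes} \times 1~\text{TB/s} \approx 150~\text{PB/s aggregate memory BW},
    \qquad
    150{,}000 \times {\sim}0.4~\text{Tbps} \approx 60~\text{Pbps injection BW}
    \]
    \[
    \frac{1024~\text{GB/s}}{2.7\times10^{3}~\text{GFLOP/s}} \approx 0.38~\text{bytes per DP flop}
    \quad (\text{the ``bytes per DP flop} \approx 0.4\text{'' figure quoted for the HBM2 memory})
    \]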
Preparing the 40+ MW facility (actual photo)

What is HPCI?
• World-class computing resources open to the worldwide HPC community
• The operation of the K computer will stop in August 2019

HPCI Computational Resources (sites / machines)
• HPCI provides a wide variety of supercomputer resources

HPCI Tier 2 Systems Roadmap (as of Nov. 2018)
[Roadmap table covering fiscal years 2016–2027: current and planned Tier 2 systems at Hokkaido, Tohoku, Tsukuba (HA-PACS, COMA, Cygnus, PACS-XI), Tokyo (Oakforest-PACS, Reedbush, Oakbridge-II, BDEC), Tokyo Tech (TSUBAME 3.0/4.0), Nagoya, Kyoto, Osaka, Kyushu, JAMSTEC, and ISM, with peak performance and power figures; power is the maximum consumption including cooling.]

Supercomputer roadmap at ITC/U.Tokyo
• 2 big systems, 6-year cycle (FY2011–FY2025)

ITC/U.Tokyo is now operating 2 (or 4) systems
• 2,000+ users (1,000 from outside U.Tokyo)
• Reedbush (HPE, Intel Broadwell + NVIDIA P100 (Pascal)): Integrated Supercomputer System for Data Analyses & Scientific Simulations; Jul. 2016 – Jun. 2020 (RB-U), to Mar. 2021 (RB-H/L)
   • Reedbush-U: CPU only, 420 nodes, 508 TF (Jul. 2016)
   • Reedbush-H: 120 nodes, 2 GPUs/node, 1.42 PF (Mar. 2017)
   • Reedbush-L: 64 nodes, 4 GPUs/node, 1.43 PF (Oct. 2017)
   • 502.2 kW (without cooling; air-cooled)
• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL)); JCAHPC: Joint Center for Advanced HPC (U. Tsukuba & U. Tokyo)
   • 8,208 nodes, 25 PF peak, 13.55 PF HPL; TOP500 #14 (#2 in Japan), HPCG #9 (#3) (Nov. 2018)
   • Omni-Path Architecture, full bisection fat tree
   • DDN IME (burst buffer), 26 PB Lustre; IO-500 #1 (June 2018), #4 (Nov. 2018)
   • 4.24 MW (incl. cooling), 3.44 MW (without cooling)

Oakbridge-CX (OBCX): awarded to Fujitsu
• Intel Xeon Scalable processors (Cascade Lake-SP), Platinum 8280 (28 cores, 2.7 GHz) x 2 sockets
• 1,368 nodes, 6.61 PF peak; aggregate memory bandwidth 385.1 TB/s; total HPL performance 4.2+ PF
• Fast cache: SSDs on 128 nodes (Intel SSD, 1.6 TB/node, 3.20/1.32 GB/s per node for read/write) for staging, checkpointing, and data-intensive applications; 16 of these nodes can directly access external resources (servers, storage, sensor networks, etc.)
• Network: Intel Omni-Path, 100 Gbps, full bisection
• Storage: DDN EXAScaler (Lustre), 12.4 PB, 193.9 GB/s
• Power consumption: 950.5 kVA
• Operation starts July 1st, 2019
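The OBCX peak figure can be reproduced from the per-node configuration above; this is a hedged back-of-the-envelope check that assumes 32 double-precision flops per cycle per core (two AVX-512 FMA pipes), which the slide itself does not state:

    \[
    2~\text{sockets} \times 28~\text{cores} \times 2.7~\text{GHz} \times 32~\tfrac{\text{flops}}{\text{cycle}}
    \approx 4.84~\text{TFLOPS/node},
    \qquad
    1{,}368~\text{nodes} \times 4.84~\text{TFLOPS} \approx 6.6~\text{PF}
    \]

On that basis, the quoted 4.2+ PF HPL result corresponds to roughly 64% of peak.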
Tokyo Tech TSUBAME 3.0 – TSUBAME supercomputing history
• A world-leading supercomputer: ~100,000x speedup in 17 years through world-leading use of massively parallel, many-core technology (general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling)
• 2000: Matsuoka's GSIC appointment; 128-Gigaflops custom supercomputer, 32 cores
• 2002: "TSUBAME0", 1.3 TeraFlops, 800 cores, first "TeraScale" Japanese university supercomputer
• 2006: TSUBAME 1.0, 80 TeraFlops, 10,000 cores, No.1 in Asia, No.7 in the world
• 2008: TSUBAME 1.2, 170 TeraFlops, world's first GPU supercomputer
• 2010: TSUBAME 2.0, 2.4 Petaflops, world No.1 production green supercomputer, ACM Gordon Bell Prize
• 2013: TSUBAME 2.5 (upgrade, 4,118 GPUs), 5.7 Petaflops (17.1 AI-Petaflops), No.2 in Japan
• 2013: TSUBAME-KFC, TSUBAME3 prototype, oil immersion cooling, Green world No.1; upgraded in 2015 as AI prototype (KFC/DL)
• 2017: TSUBAME 3.0, >10 million cores, 12.1 Petaflops (47.2 AI-Petaflops), Green world No.1; HPC and Big Data / AI convergence

ABCI: AI supercomputer to serve academic & industry AI research in Japan, hosted by AIRC-AIST (2018)
• Commoditizes TSUBAME3 technologies to the cloud
• 4,332 Volta GPUs + 2,166 Skylake CPUs
• 0.55 AI-Exaflops, 37 double-precision Petaflops (17 AI-Petaflops and 70 kW per rack, free cooling)
• ~2 Petabytes NVMe, 20 Petabytes HDD
• InfiniBand EDR network optimized for DNN training
• New IDC: single floor, inexpensive & fast build, hard concrete floor with 2 tons/m² weight tolerance
• Racks: 144 max.
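A rough consistency check of the ABCI aggregate numbers (hedged: this assumes NVIDIA V100 SXM2 per-GPU peaks of about 125 TFLOPS FP16 tensor and 7.8 TFLOPS FP64, which are not stated on the slide):

    \[
    4{,}332 \times 125~\text{TFLOPS} \approx 0.54~\text{AI-EFLOPS},
    \qquad
    4{,}332 \times 7.8~\text{TFLOPS} \approx 34~\text{PF (FP64)}
    \]

With the Skylake CPUs' contribution added, these are consistent with the 0.55 AI-Exaflops and 37 double-precision Petaflops figures quoted above.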