Exascale” Supercomputer Fugaku & Beyond

Total Page:16

File Type:pdf, Size:1020Kb

Exascale” Supercomputer Fugaku & Beyond The first “exascale” supercomputer Fugaku & beyond l Satoshi Matsuoka l Director, RIKEN Center for Computational Science l 20190815 Modsim Presentation 1 Arm64fx & Fugaku 富岳 /Post-K are: l Fujitsu-Riken design A64fx ARM v8.2 (SVE), 48/52 core CPU l HPC Optimized: Extremely high package high memory BW (1TByte/s), on-die Tofu-D network BW (~400Gbps), high SVE FLOPS (~3Teraflops), various AI support (FP16, INT8, etc.) l Gen purpose CPU – Linux, Windows (Word), otherSCs/Clouds l Extremely power efficient – > 10x power/perf efficiency for CFD benchmark over current mainstream x86 CPU l Largest and fastest supercomputer to be ever built circa 2020 l > 150,000 nodes, superseding LLNL Sequoia l > 150 PetaByte/s memory BW l Tofu-D 6D Torus NW, 60 Petabps injection BW (10x global IDC traffic) l 25~30PB NVMe L1 storage l many endpoint 100Gbps I/O network into Lustre l The first ‘exascale’ machine (not exa64bitflops but in apps perf.) 3 Brief History of R-CCS towards Fugaku April 2018 April 2012 AICS renamed to RIKEN R-CCS. July 2010 Post-K Feasibility Study start Satoshi Matsuoka becomes new RIKEN AICS established 3 Arch Teams and 1 Apps Team Director August 2010 June 2012 Aug 2018 HPCI Project start K computer construction complete Arm A64fx announce at Hotchips September 2010 September 2012 Oct 2018 K computer installation starts K computer production start NEDO 100x processor project start First meeting of SDHPC (Post-K) November 2012 Nov 2018 ACM Gordon bell Award Post-K Manufacturing approval by Prime Minister’s CSTI Committee 2 006 2011 2013 2014 2019 2010 2018 2012 January 2006 End of FY2013 (Mar 2014) Next Generation Supercomputer Post-K Feasibility Study Reports March 2019 Project (K Computer) start Post-K Manufacturing start May 2019 June 2011 Post-K named “Supercomputer Fugaku” #1 on Top 500 July 2019 November 2011 Post-Moore Whitepaper start #1 on Top 500 > 10 Petaflops Aug 2019 April 2014 K Computer shutdown ACM Gordon Bell Award Post-K project start End of FY 2011 (March 2012) Dec 2019 June 2014 Fugaku installation start (planned) 4 SDHPC Whitepaper #1 on Graph 500 SDHPC (2011-2012) Candidate of ExaScale Architecture https://www.exascale.org/mediawiki/images/a/aa/Talk-3-kondo.pdf }} Four types of architectures are considered }} General Purpose(GP) }} Ordinary CPU-based MPPs Memory }} e.g.) K-Computer, GPU, Blue Gene, capacity x86-based PC-clusters CB oriented }} Capacity-Bandwidth oriented (CB) Memory }} With expensive memory-I/F rather than bandwidth General computing capability purpose Reduced }} e.g.) Vector machines Memory }} Reduced Memory(RM) Compute }} With embedded (main) memory oriented }} e.g.) SoC, MD-GRAPE4, Anton }} ComputeOriented (CO) FLOPS }} Many processing units }} e.g.) ClearSpeed, GRAPE-DR IESP Meeting@Kobe (April 12, 2012) 5 SDHPC (2011-2012) Performance Projection }} Performance projection for an HPC system in 2018 }} Achieved through continuous technology development }} Constraints: 20 – 30MW electricity & 2000sqmspace Total CPU Total Memory Total Memory Performance Bandwidth Capacity Byte / Flop Node Performance (PetaFLOPS) (PetaByte/s) (PetaByte) General Purpose 200~400 20~40 20~40 0.1 Capacity-BW Oriented 50~100 50~100 50~100 1.0 Reduced Memory 500~1000 250~500 0.1~0.2 0.5 Compute Oriented 1000~2000 5~10 5~10 0.005 Network Storage Min Max Total Capacity Total Bandwidth Injection P-to-P Bisection Latency Latency 1 EB 10TB/s High-radix 32GB/s 32 GB/s 2.0PB/s 200ns 1000ns 100 times larger For saving all data (Dragonfly) than main in memory to disks Low-radix 128GB/s 16 GB/s 0.13PB/s 100ns 5000ns memory within 1000-sec. (4DTorus) IESP Meeting@Kobe (April 12, 2012) 6 SDHPC (2011-2012) Gap Between Requirement and Technology Trends }} Mapping four architectures onto science requirement }} Projected performance vs. science requirement }} Big gap between projected and required performance Mapping of Architectures Projected vs. Required Perf. 1.0E+1 2700 CB Gap between 1.0E+0 requirements and F / technoloGy trends B 1800 1.0E-1 (PFLOPS) of RM GP 1.0E-2 CO 900 1.0E-3 Requirement Requirement 1.0E-4 0 1.0E-3 1.0E-2 1.0E-1 1.0E+0 1.0E+1 1.0E+2 1.0E+3 CP RM GP CB Requirement of Memory Capacity (PB) Needs national research project for science-driven HPC systems IESP Meeting@Kobe (April 12, 2012) 7 Post-K Feasibility Study (2012- 2013) l 3 Architecture Teams, from identified architectural types in the SDHPC report l General Purpose --- balanced l Compute Intensive --- high flops and/or low memory capacity & high memory BW l Large Memory Capacity --- also w/high memory BW l The A64fx processor satisfied multiple roles - basically balanced but also compute intensive l Application Team (Tomita, Matsuoka) l Put all the K-Computer applications stakeholders into one room l Templated reporting of science impact possible on exascale machines and their computational algorithms / requirements l 600 page report (English summary available) 8 Post-K Application Feasibility Study 2012- 2013 https://hpci-aplfs.r-ccs.riken.jp/document/roadmap/roadmap_e_1405.pdf 9 Co-Design Activities in Fugaku Multiple Activities since 2011 Science by Computing Science of Computing ・9 Priority App Areas: High Concern to General Public: Medical/Pharma, Environment/Disaster, Energy, Manufacturing, … A 6 4 f x For the Post-K supercomputer Select representatives fr Design systems with param om 100s of applications eters that consider various signifying various compu application characteristics tational characteristics l Extremely tight collabrations between the Co-Design apps centers, Riken, and Fujitsu, etc. l Chose 9 representative apps as “target application” scenario l Achieve up to x100 speedup c.f. K-Computer l Also ease-of-programming, broad SW ecosystem, very low power, … Research Subjects of the Post-K Computer The post K computer will expand the fields pioneered by the K computer, and also challenge new areas. 11 Genesis MD: proteins in a cell environment Protein simulation before K Protein simulation with K n Simulation of a protein in isolation n all atom simulation of a cell interior Folding simulation of Villin, a small protein with 36 amino acids n cytoplasm of Mycoplasma genitalium 400nm DNA Ribosome Proteins GROEL metabolites Ribosome water O R G EL ion TRNA ATP 100nm 12 NICAM: Global Climate Simulation n Global cloud resolving model with 0.87 km-mesh which allows resolution of cumulus clouds n Month-long forecasts of Madden-Julian oscillations in the tropics isrealized. Global cloud resolving model Miyamoto et al (2013) , Geophys. Res. Lett., 40, 4922–4926, doi:10.1002/grl.50944. 13 “Big Data Assimilation” NICAM+LETKF High-precision Simulations Future-generation technologies available 10 years in advance High-precision observations Mutual feedback Co-design from Apps to Architecture l Architectural Parameters to be determined l #SIMD, SIMD length, #core, #NUMA node, O3 resources, specialized hardware l cache (size and bandwidth), memory technologies Target applications representatives of l Chip die-size, power consumption almost all our applications in termsof l Interconnect computational methods and communication patterns in order to l We have selected a set of target applications design architectural features. l Performance estimation tool l Performance projection using Fujitsu FX100 execution profile to a set of arch. parameters. l Co-design Methodology (at early design phase) 1. Setting set of system parameters 2. Tuning target applications under the system parameters 3. Evaluating execution time using prediction tools 4. Identifying hardware bottlenecks and changing the set of system parameters 15 Co-design of Apps for Architecture l Tools for performance tuning l Performance estimation tool l Performance projection using Fujitsu FX100 execution profile l Gives “target” performance l Post-K processor simulator l Based on gem5, O3, cycle-level simulation l Very slow, so limited to kernel-level evaluation TarGet As is TuninG 1 performance l Co-design of apps TuninG 2 ① l 1. Estimate “target” performance using ② performance estimation tool ③ l 2. Extract kernel code for simulator time l 3. Measure exec time using simulator Execution l 4. Feed-back to code optimization Perform- ance estimation Simulator Simulator Simulator l 5. Feed-back to compiler toolcd 16 ARM for HPC - Co-design Opportunities l ARM SVE Vector Length Agnostic feature is very interesting, since we can examine vector performance using the same binary. l We have investigated how to improve the performance of SVEkeeping hardware-resource the same. (in “Rev-A” paper) l ex. “512 bits SVE x 2 pipes” vs. “1024 bits SVE x 1 pipe” l Evaluation of Performance and Power ( in “coolchips” paper) by using our gem-5 simulator (with “white” parameter) and ARM compiler. l Conclusion: Wide vector size over FPU element size will improve performance if there are enough rename registers and the utilization of FPU has room for improvement. Note that these researches are not relevantto 1.40 “post-K” architecture. 1.20 l Y. Kodama, T. Oajima and M. Sato. “Preliminary 1.00 Performance Evaluation of Application Kernels Using Time e ARM SVE with Multiple Vector Lengths”, In Re- 0.80 Emergence of Vector Architectures Workshop(Rev- ast F r A)in 2017 IEEE International Conference on Cluster Execution 0.60 Computing, pp. 677-684, Sep. 2017. 0.40 l T. Odajima, Y. Kodama and M. Sato, “Power Relative Performance Analysis of ARM Scalable Vector 0.20 Extension”, In IEEE Symposium on Low-Power and 0.00 High-Speed Chips and Systems (COOL Chips 21), Apr. triad nbody dgemm 2018 LEN=4 LEN=8 LEN=8 (x2) 17 A64FX Leading-edge Si-technology n TSMC 7nm FinFET & CoWoS TofuD Interface PCIe Interface n HBM2 Broadcom SerDes, HBM I/O, and Core Core Core Core Core Core Core Core Core Core Interface SRAMs interface L2 L2 HBM2 Core Core Core Core n 8.786 billion transistors Cache Cache n 594 signal pins R B Core Core Core Core Core Core u Core Core Core Core Core Core I N s G - Core Core Core Core Core Core Core Core Core Core Core Core HBM2 L2 L2 Core Core Core Core Cache Cache Interface Interface HBM2 Core Core Core Core Core Core Core Core Core Core Post-K Activities, ISC19, Frankfurt 18 Copyright 2019 FUJITSU LIMITED Fugaku: The Game Changer 1.
Recommended publications
  • Petaflops for the People
    PETAFLOPS SPOTLIGHT: NERSC housands of researchers have used facilities of the Advanced T Scientific Computing Research (ASCR) program and its EXTREME-WEATHER Department of Energy (DOE) computing predecessors over the past four decades. Their studies of hurricanes, earthquakes, NUMBER-CRUNCHING green-energy technologies and many other basic and applied Certain problems lend themselves to solution by science problems have, in turn, benefited millions of people. computers. Take hurricanes, for instance: They’re They owe it mainly to the capacity provided by the National too big, too dangerous and perhaps too expensive Energy Research Scientific Computing Center (NERSC), the Oak to understand fully without a supercomputer. Ridge Leadership Computing Facility (OLCF) and the Argonne Leadership Computing Facility (ALCF). Using decades of global climate data in a grid comprised of 25-kilometer squares, researchers in These ASCR installations have helped train the advanced Berkeley Lab’s Computational Research Division scientific workforce of the future. Postdoctoral scientists, captured the formation of hurricanes and typhoons graduate students and early-career researchers have worked and the extreme waves that they generate. Those there, learning to configure the world’s most sophisticated same models, when run at resolutions of about supercomputers for their own various and wide-ranging projects. 100 kilometers, missed the tropical cyclones and Cutting-edge supercomputing, once the purview of a small resulting waves, up to 30 meters high. group of experts, has trickled down to the benefit of thousands of investigators in the broader scientific community. Their findings, published inGeophysical Research Letters, demonstrated the importance of running Today, NERSC, at Lawrence Berkeley National Laboratory; climate models at higher resolution.
    [Show full text]
  • File System and Power Management Enhanced for Supercomputer Fugaku
    File System and Power Management Enhanced for Supercomputer Fugaku Hideyuki Akimoto Takuya Okamoto Takahiro Kagami Ken Seki Kenichirou Sakai Hiroaki Imade Makoto Shinohara Shinji Sumimoto RIKEN and Fujitsu are jointly developing the supercomputer Fugaku as the successor to the K computer with a view to starting public use in FY2021. While inheriting the software assets of the K computer, the plan for Fugaku is to make improvements, upgrades, and functional enhancements in various areas such as computational performance, efficient use of resources, and ease of use. As part of these changes, functions in the file system have been greatly enhanced with a focus on usability in addition to improving performance and capacity beyond that of the K computer. Additionally, as reducing power consumption and using power efficiently are issues common to all ultra-large-scale computer systems, power management functions have been newly designed and developed as part of the up- grading of operations management software in Fugaku. This article describes the Fugaku file system featuring significantly enhanced functions from the K computer and introduces new power management functions. 1. Introduction application development environment are taken up in RIKEN and Fujitsu are developing the supercom- separate articles [1, 2], this article focuses on the file puter Fugaku as the successor to the K computer and system and operations management software. The are planning to begin public service in FY2021. file system provides a high-performance and reliable The Fugaku is composed of various kinds of storage environment for application programs and as- system software for supporting the execution of su- sociated data.
    [Show full text]
  • Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators
    Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators Toshio Endo Akira Nukada Graduate School of Information Science and Engineering Global Scientific Information and Computing Center Tokyo Institute of Technology Tokyo Institute of Technology Tokyo, Japan Tokyo, Japan [email protected] [email protected] Satoshi Matsuoka Naoya Maruyama Global Scientific Information and Computing Center Global Scientific Information and Computing Center Tokyo Institute of Technology/National Institute of Informatics Tokyo Institute of Technology Tokyo, Japan Tokyo, Japan [email protected] [email protected] Abstract—We report Linpack benchmark results on the Roadrunner or other systems described above, it includes TSUBAME supercomputer, a large scale heterogeneous system two types of accelerators. This is due to incremental upgrade equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD of the system, which has been the case in commodity CPU accelerators. With all of 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators and 624 NVIDIA Tesla GPUs, clusters; they may have processors with different speeds as we have achieved 87.01TFlops, which is the third record as a result of incremental upgrade. In this paper, we present a heterogeneous system in the world. This paper describes a Linpack implementation and evaluation results on TSUB- careful tuning and load balancing method required to achieve AME with 10,480 Opteron cores, 624 Tesla GPUs and 648 this performance. On the other hand, since the peak speed is ClearSpeed accelerators. In the evaluation, we also used a 163 TFlops, the efficiency is 53%, which is lower than other systems.
    [Show full text]
  • NVIDIA Tesla Personal Supercomputer, Please Visit
    NVIDIA TESLA PERSONAL SUPERCOMPUTER TESLA DATASHEET Get your own supercomputer. Experience cluster level computing performance—up to 250 times faster than standard PCs and workstations—right at your desk. The NVIDIA® Tesla™ Personal Supercomputer AccessiBLE to Everyone TESLA C1060 COMPUTING ™ PROCESSORS ARE THE CORE is based on the revolutionary NVIDIA CUDA Available from OEMs and resellers parallel computing architecture and powered OF THE TESLA PERSONAL worldwide, the Tesla Personal Supercomputer SUPERCOMPUTER by up to 960 parallel processing cores. operates quietly and plugs into a standard power strip so you can take advantage YOUR OWN SUPERCOMPUTER of cluster level performance anytime Get nearly 4 teraflops of compute capability you want, right from your desk. and the ability to perform computations 250 times faster than a multi-CPU core PC or workstation. NVIDIA CUDA UnlocKS THE POWER OF GPU parallel COMPUTING The CUDA™ parallel computing architecture enables developers to utilize C programming with NVIDIA GPUs to run the most complex computationally-intensive applications. CUDA is easy to learn and has become widely adopted by thousands of application developers worldwide to accelerate the most performance demanding applications. TESLA PERSONAL SUPERCOMPUTER | DATASHEET | MAR09 | FINAL FEATURES AND BENEFITS Your own Supercomputer Dedicated computing resource for every computational researcher and technical professional. Cluster Performance The performance of a cluster in a desktop system. Four Tesla computing on your DesKtop processors deliver nearly 4 teraflops of performance. DESIGNED for OFFICE USE Plugs into a standard office power socket and quiet enough for use at your desk. Massively Parallel Many Core 240 parallel processor cores per GPU that can execute thousands of GPU Architecture concurrent threads.
    [Show full text]
  • Blockchain Database for a Cyber Security Learning System
    Session ETD 475 Blockchain Database for a Cyber Security Learning System Sophia Armstrong Department of Computer Science, College of Engineering and Technology East Carolina University Te-Shun Chou Department of Technology Systems, College of Engineering and Technology East Carolina University John Jones College of Engineering and Technology East Carolina University Abstract Our cyber security learning system involves an interactive environment for students to practice executing different attack and defense techniques relating to cyber security concepts. We intend to use a blockchain database to secure data from this learning system. The data being secured are students’ scores accumulated by successful attacks or defends from the other students’ implementations. As more professionals are departing from traditional relational databases, the enthusiasm around distributed ledger databases is growing, specifically blockchain. With many available platforms applying blockchain structures, it is important to understand how this emerging technology is being used, with the goal of utilizing this technology for our learning system. In order to successfully secure the data and ensure it is tamper resistant, an investigation of blockchain technology use cases must be conducted. In addition, this paper defined the primary characteristics of the emerging distributed ledgers or blockchain technology, to ensure we effectively harness this technology to secure our data. Moreover, we explored using a blockchain database for our data. 1. Introduction New buzz words are constantly surfacing in the ever evolving field of computer science, so it is critical to distinguish the difference between temporary fads and new evolutionary technology. Blockchain is one of the newest and most developmental technologies currently drawing interest.
    [Show full text]
  • Science-Driven Development of Supercomputer Fugaku
    Special Contribution Science-driven Development of Supercomputer Fugaku Hirofumi Tomita Flagship 2020 Project, Team Leader RIKEN Center for Computational Science In this article, I would like to take a historical look at activities on the application side during supercomputer Fugaku’s climb to the top of supercomputer rankings as seen from a vantage point in early July 2020. I would like to describe, in particular, the change in mindset that took place on the application side along the path toward Fugaku’s top ranking in four benchmarks in 2020 that all began with the Application Working Group and Computer Architecture/Compiler/ System Software Working Group in 2011. Somewhere along this path, the application side joined forces with the architecture/system side based on the keyword “co-design.” During this journey, there were times when our opinions clashed and when efforts to solve problems came to a halt, but there were also times when we took bold steps in a unified manner to surmount difficult obstacles. I will leave the description of specific technical debates to other articles, but here, I would like to describe the flow of Fugaku development from the application side based heavily on a “science-driven” approach along with some of my personal opinions. Actually, we are only halfway along this path. We look forward to the many scientific achievements and so- lutions to social problems that Fugaku is expected to bring about. 1. Introduction In this article, I would like to take a look back at On June 22, 2020, the supercomputer Fugaku the flow along the path toward Fugaku’s top ranking became the first supercomputer in history to simul- as seen from the application side from a vantage point taneously rank No.
    [Show full text]
  • An Introduction to Cloud Databases a Guide for Administrators
    Compliments of An Introduction to Cloud Databases A Guide for Administrators Wendy Neu, Vlad Vlasceanu, Andy Oram & Sam Alapati REPORT Break free from old guard databases AWS provides the broadest selection of purpose-built databases allowing you to save, grow, and innovate faster Enterprise scale at 3-5x the performance 14+ database engines 1/10th the cost of vs popular alternatives - more than any other commercial databases provider Learn more: aws.amazon.com/databases An Introduction to Cloud Databases A Guide for Administrators Wendy Neu, Vlad Vlasceanu, Andy Oram, and Sam Alapati Beijing Boston Farnham Sebastopol Tokyo An Introduction to Cloud Databases by Wendy A. Neu, Vlad Vlasceanu, Andy Oram, and Sam Alapati Copyright © 2019 O’Reilly Media Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more infor‐ mation, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Development Editor: Jeff Bleiel Interior Designer: David Futato Acquisitions Editor: Jonathan Hassell Cover Designer: Karen Montgomery Production Editor: Katherine Tozer Illustrator: Rebecca Demarest Copyeditor: Octal Publishing, LLC September 2019: First Edition Revision History for the First Edition 2019-08-19: First Release The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. An Introduction to Cloud Databases, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views.
    [Show full text]
  • Middleware-Based Database Replication: the Gaps Between Theory and Practice
    Appears in Proceedings of the ACM SIGMOD Conference, Vancouver, Canada (June 2008) Middleware-based Database Replication: The Gaps Between Theory and Practice Emmanuel Cecchet George Candea Anastasia Ailamaki EPFL EPFL & Aster Data Systems EPFL & Carnegie Mellon University Lausanne, Switzerland Lausanne, Switzerland Lausanne, Switzerland [email protected] [email protected] [email protected] ABSTRACT There exist replication “solutions” for every major DBMS, from Oracle RAC™, Streams™ and DataGuard™ to Slony-I for The need for high availability and performance in data Postgres, MySQL replication and cluster, and everything in- management systems has been fueling a long running interest in between. The naïve observer may conclude that such variety of database replication from both academia and industry. However, replication systems indicates a solved problem; the reality, academic groups often attack replication problems in isolation, however, is the exact opposite. Replication still falls short of overlooking the need for completeness in their solutions, while customer expectations, which explains the continued interest in developing new approaches, resulting in a dazzling variety of commercial teams take a holistic approach that often misses offerings. opportunities for fundamental innovation. This has created over time a gap between academic research and industrial practice. Even the “simple” cases are challenging at large scale. We deployed a replication system for a large travel ticket brokering This paper aims to characterize the gap along three axes: system at a Fortune-500 company faced with a workload where performance, availability, and administration. We build on our 95% of transactions were read-only. Still, the 5% write workload own experience developing and deploying replication systems in resulted in thousands of update requests per second, which commercial and academic settings, as well as on a large body of implied that a system using 2-phase-commit, or any other form of prior related work.
    [Show full text]
  • The Sunway Taihulight Supercomputer: System and Applications
    SCIENCE CHINA Information Sciences . RESEARCH PAPER . July 2016, Vol. 59 072001:1–072001:16 doi: 10.1007/s11432-016-5588-7 The Sunway TaihuLight supercomputer: system and applications Haohuan FU1,3 , Junfeng LIAO1,2,3 , Jinzhe YANG2, Lanning WANG4 , Zhenya SONG6 , Xiaomeng HUANG1,3 , Chao YANG5, Wei XUE1,2,3 , Fangfang LIU5 , Fangli QIAO6 , Wei ZHAO6 , Xunqiang YIN6 , Chaofeng HOU7 , Chenglong ZHANG7, Wei GE7 , Jian ZHANG8, Yangang WANG8, Chunbo ZHOU8 & Guangwen YANG1,2,3* 1Ministry of Education Key Laboratory for Earth System Modeling, and Center for Earth System Science, Tsinghua University, Beijing 100084, China; 2Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 3National Supercomputing Center in Wuxi, Wuxi 214072, China; 4College of Global Change and Earth System Science, Beijing Normal University, Beijing 100875, China; 5Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; 6First Institute of Oceanography, State Oceanic Administration, Qingdao 266061, China; 7Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China; 8Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China Received May 27, 2016; accepted June 11, 2016; published online June 21, 2016 Abstract The Sunway TaihuLight supercomputer is the world’s first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip.
    [Show full text]
  • BDEC2 Poznan Japanese HPC Infrastructure Update
    BDEC2 Poznan Japanese HPC Infrastructure Update Presenter: Masaaki Kondo, Riken R-CCS/Univ. of Tokyo (on behalf of Satoshi Matsuoka, Director, Riken R-CCS) 1 Post-K: The Game Changer (2020) 1. Heritage of the K-Computer, HP in simulation via extensive Co-Design • High performance: up to x100 performance of K in real applications • Multitudes of Scientific Breakthroughs via Post-K application programs • Simultaneous high performance and ease-of-programming 2. New Technology Innovations of Post-K Global leadership not just in High Performance, esp. via high memory BW • the machine & apps, but as Performance boost by “factors” c.f. mainstream CPUs in many HPC & Society5.0 apps via BW & Vector acceleration cutting edge IT • Very Green e.g. extreme power efficiency Ultra Power efficient design & various power control knobs • Arm Global Ecosystem & SVE contribution Top CPU in ARM Ecosystem of 21 billion chips/year, SVE co- design and world’s first implementation by Fujitsu High Perf. on Society5.0 apps incl. AI • ARM: Massive ecosystem Architectural features for high perf on Society 5.0 apps from embedded to HPC based on Big Data, AI/ML, CAE/EDA, Blockchain security, etc. Technology not just limited to Post-K, but into societal IT infrastructures e.g. Clouds 2 Arm64fx & Post-K (to be renamed) Fujitsu-Riken design A64fx ARM v8.2 (SVE), 48/52 core CPU HPC Optimized: Extremely high package high memory BW (1TByte/s), on-die Tofu-D network BW (~400Gbps), high SVE FLOPS (~3Teraflops), various AI support (FP16, INT8, etc.) Gen purpose CPU – Linux,
    [Show full text]
  • Supercomputer Fugaku
    Supercomputer Fugaku Toshiyuki Shimizu Feb. 18th, 2020 FUJITSU LIMITED Copyright 2020 FUJITSU LIMITED Outline ◼ Fugaku project overview ◼ Co-design ◼ Approach ◼ Design results ◼ Performance & energy consumption evaluation ◼ Green500 ◼ OSS apps ◼ Fugaku priority issues ◼ Summary 1 Copyright 2020 FUJITSU LIMITED Supercomputer “Fugaku”, formerly known as Post-K Focus Approach Application performance Co-design w/ application developers and Fujitsu-designed CPU core w/ high memory bandwidth utilizing HBM2 Leading-edge Si-technology, Fujitsu's proven low power & high Power efficiency performance logic design, and power-controlling knobs Arm®v8-A ISA with Scalable Vector Extension (“SVE”), and Arm standard Usability Linux 2 Copyright 2020 FUJITSU LIMITED Fugaku project schedule 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Fugaku development & delivery Manufacturing, Apps Basic Detailed design & General Feasibility study Installation review design Implementation operation and Tuning Select Architecture & Co-Design w/ apps groups apps sizing 3 Copyright 2020 FUJITSU LIMITED Fugaku co-design ◼ Co-design goals ◼ Obtain the best performance, 100x apps performance than K computer, within power budget, 30-40MW • Design applications, compilers, libraries, and hardware ◼ Approach ◼ Estimate perf & power using apps info, performance counts of Fujitsu FX100, and cycle base simulator • Computation time: brief & precise estimation • Communication time: bandwidth and latency for communication w/ some attributes for communication patterns • I/O time: ◼ Then, optimize apps/compilers etc. and resolve bottlenecks ◼ Estimation of performance and power ◼ Precise performance estimation for primary kernels • Make & run Fugaku objects on the Fugaku cycle base simulator ◼ Brief performance estimation for other sections • Replace performance counts of FX100 w/ Fugaku params: # of inst. commit/cycle, wait cycles of barrier, inst.
    [Show full text]
  • Providing a Robust Tools Landscape for CORAL Machines
    Providing a Robust Tools Landscape for CORAL Machines 4th Workshop on Extreme Scale Programming Tools @ SC15 in AusGn, Texas November 16, 2015 Dong H. Ahn Lawrence Livermore Naonal Lab Michael Brim Oak Ridge Naonal Lab Sco4 Parker Argonne Naonal Lab Gregory Watson IBM Wesley Bland Intel CORAL: Collaboration of ORNL, ANL, LLNL Current DOE Leadership Computers Objective - Procure 3 leadership computers to be Titan (ORNL) Sequoia (LLNL) Mira (ANL) 2012 - 2017 sited at Argonne, Oak Ridge and Lawrence Livermore 2012 - 2017 2012 - 2017 in 2017. Leadership Computers - RFP requests >100 PF, 2 GiB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia Approach • Competitive process - one RFP (issued by LLNL) leading to 2 R&D contracts and 3 computer procurement contracts • For risk reduction and to meet a broad set of requirements, 2 architectural paths will be selected and Oak Ridge and Argonne must choose different architectures • Once Selected, Multi-year Lab-Awardee relationship to co-design computers • Both R&D contracts jointly managed by the 3 Labs • Each lab manages and negotiates its own computer procurement contract, and may exercise options to meet their specific needs CORAL Innova,ons and Value IBM, Mellanox, and NVIDIA Awarded $325M U.S. Department of Energy’s CORAL Contracts • Scalable system solu.on – scale up, scale down – to address a wide range of applicaon domains • Modular, flexible, cost-effec.ve, • Directly leverages OpenPOWER partnerships and IBM’s Power system roadmap • Air and water cooling • Heterogeneous
    [Show full text]