AMD STRATEGY IN EXASCALE SUPERCOMPUTING

TIMOUR PALTASHEV, MICHAEL SCHULTE, AND GREGORY STONER | NOVEMBER 29, 2016

AGENDA

Exascale Goals and Challenges
AMD's Vision and Technologies for Exascale Computing
GPUOpen and Open Compute (ROC) Initiatives
AMD's New HPC Product Lines with the New Zen CPU
Zen CPU Core Details

DEPARTMENT OF ENERGY'S GOALS FOR EXASCALE COMPUTING SYSTEMS

 The Department of Energy (DOE) plans to deliver exascale systems that provide a 50x improvement in application performance over their current highest-performance supercomputers by 2023
 Systems should provide a 50x performance improvement over today's fastest supercomputers with 20 MWatts of power, while not requiring human intervention due to hardware or system faults more than once a week on average
 Important goals for exascale computing include
‒ Enabling new engineering capabilities and scientific discoveries
‒ Continuing U.S. leadership in science and engineering

EXASCALE CHALLENGES

The Top Ten Exascale Research Challenges
1) Energy efficiency
2) Interconnect technology
3) Memory technology
4) Scalable system software
5) Programming systems
6) Data management
7) Exascale algorithms
8) Algorithms for discovery, design, and decision
9) Resilience and correctness
10) Scientific productivity

Requires significant advances in processors, memory, software, and system design

POWER CONSTRAINTS LIMIT PERFORMANCE
 Computing devices are power constrained
‒ This limits performance
‒ Continued growth in performance is crucial for new inventions and discoveries

[Figure: teraflops-class performance then and now: ASCI Red @ Sandia Labs, 1997 (850 kW) vs. current mobile devices, 2016 (< 5 W)]

 Power-constrained computing devices
‒ Exascale supercomputers: over 10^18 FLOPS in 20 MWatts
‒ Mobile computing devices: 1-5 W power budget
‒ Future wearable devices: < 1 mW power budget
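For context, the exascale power budget above directly implies the system-level efficiency target that reappears in the DOE requirements table later in this deck:

```latex
\frac{10^{18}\ \text{FLOPS}}{20 \times 10^{6}\ \text{W}}
  = 5 \times 10^{10}\ \text{FLOPS/W}
  = 50\ \text{GFLOPS/W}
```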

THE NEED FOR GREATER PERFORMANCE AND POWER

[Table: Tianhe-2 (2014) reference column: 54,900 TFLOPS peak, 33,860 TFLOPS Linpack, 3,120,000 cores, 80,000 processors, 12 or 57 cores per processor, 17.8 MW]

Source: Bill Harrod, "DOE Exascale Computing Initiative (ECI) Update," ASCAC Presentation, August 2012. Column for Tianhe-2 added by Michael Schulte, October 2014.

DOE EXASCALE TARGET REQUIREMENTS
 The DOE has aggressive goals and target requirements for exascale systems
‒ Requires research and innovation in a variety of areas
 One of the most important goals is providing supercomputers that can be effectively utilized for important scientific discoveries
 Technologies explored for exascale can be applied to a wide variety of computing systems

Target Requirement: Target Value
System-Level Power Efficiency: 50 GFLOPS/Watt
Compute Performance (per node): 10 TFLOPS
Memory Capacity (per node): 5 TB
Memory Data Rate (per node): 4 TB/sec
Messages per Second (per node): 500 million (MPI), 2 billion (PGAS)
Mean Time to Application Failure: 7 days
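A useful way to read the per-node rows is as derived byte-to-flop ratios; these are simple arithmetic on the table values above, not additional targets from the source:

```latex
\frac{4\ \text{TB/s}}{10\ \text{TFLOPS}} = 0.4\ \text{bytes/FLOP},
\qquad
\frac{5\ \text{TB}}{10\ \text{TFLOP/s}} = 0.5\ \text{bytes of capacity per FLOP/s of compute}
```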

AMD'S VISION FOR SUPERCOMPUTING

EMBRACING HETEROGENEITY

CHAMPIONING OPEN SOLUTIONS

ENABLING LEADERSHIP SYSTEMS

EMBRACING HETEROGENEITY
 Customers must be free to choose the technologies that suit their problems
 Specialization is key to high performance and energy efficiency
 Heterogeneity should be managed by programming environments and runtimes
 The Heterogeneous System Architecture (HSA) provides:
‒ A framework for heterogeneous computing
‒ A platform for diverse programming languages

[Figure: heterogeneity options: languages (C/C++, FORTRAN, Java, UPC/UPC++, Python, MPI, Kokkos/RAJA, OpenMP, OpenACC), processors (CPUs, GPUs, DSPs, FPGAs, network processors, specialized accelerators), memories (DRAM, SRAM, Flash, PCM, ReRAM, STT RAM)]

CHAMPIONING OPEN SOLUTIONS

 Harness the creativity and productivity of the entire industry
 Partner with best-in-class suppliers to enable leading solutions
 Multiple paths to open solutions
‒ Open standards
‒ Open-source software
‒ Open collaborations

ENABLING LEADERSHIP SYSTEMS

Re-usable, high-performance technology building blocks

High-performance network on chip

Modular engineering methodology and tools

Software tools and programming environments

TECHNOLOGIES FOR EXASCALE COMPUTING

CPU CORES
GRAPHICS PROCESSORS
ACCELERATED PROCESSING UNITS

NODE TECHNOLOGIES
ENERGY EFFICIENCY
SOFTWARE

FUTURE HIGH DENSITY COMPUTE CONFIGURATIONS

 Exascale systems require enhanced performance, power-efficiency, reliability, and programmer productivity
‒ Significant advances are needed in multiple areas and technologies
 Exascale systems will be heterogeneous
‒ Programming environments and runtimes should manage this heterogeneity
 New computing technologies provide a path to productive, power-efficient exascale systems

[Figure: (a) cabinets of compute nodes with a network TOR switch, IO nodes, and management nodes; (b) a compute node built from APU modules with DRAM and off-package NVRAM (multiple modules); (c) an APU chip with CPU cores, GPU cores, and die-stacked DRAM on a silicon interposer]

For further details see: "Achieving Exascale Capabilities through Heterogeneous Computing," IEEE Micro, July/August 2015.

THE HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)

 HSA is a platform architecture and software environment for simplified, efficient parallel programming of heterogeneous systems, targeting:
‒ Single-source language support:
‒ Mainstream languages: C, C++, Fortran, Python, OpenMP (see the OpenMP sketch below)
‒ Task-based, domain-specific, and PGAS languages
‒ Extensibility to a variety of accelerators
‒ GPUs, DSPs, FPGAs, etc.
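To make the single-source idea concrete, here is a minimal OpenMP target-offload sketch in plain ISO C++. It illustrates the programming style HSA targets rather than any specific HSA implementation; whether the loop actually runs on a GPU depends on the compiler and runtime used.

```cpp
// Minimal single-source offload sketch using OpenMP target directives.
// With a host-only compiler the loop simply runs on the CPU; with an
// offloading compiler the same source can run the loop on a GPU.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* xp = x.data();
    float* yp = y.data();

    // Map the arrays to the device, run the loop there, copy y back.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] += 2.0f * xp[i];
    }

    std::printf("y[0] = %f\n", yp[0]);  // expected 4.0
    return 0;
}
```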

Supporters  The HSA Foundation promotes HSA via: ‒ Open, royalty-free, multi-vendor specifications ‒ Open-source software stack and tools Contributors ‒ Runtime stack ‒ Compilers, debuggers, and profilers Academic

 See http://www.hsafoundation.com and http://github.com/hsafoundation

GPUOPEN AND RADEON OPEN COMPUTE (ROC)

 Radeon Open Compute provides HSA software and capabilities on CPUs, GPUs, and APUs
‒ The Heterogeneous Compute Compiler (HCC) provides acceleration for C++ with parallel extensions, OpenMP, and Charm++
‒ The HIP tools (with HCC) provide a mechanism to convert CUDA code to C++ code that can run efficiently on GPUs from multiple vendors
‒ Continuum Analytics' Numba software accelerates Python on HSA
 See http://gpuopen.com/professional-compute/ for compilers, tools, libraries, and sample applications

Efforts designed to better engage the programming community
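As an illustration of the HIP model described above, a minimal vector-add might look like the sketch below. It is written against the public HIP runtime API as it exists today (the launch macro has changed across HIP releases) and is meant to show the single-source, CUDA-like style rather than AMD's exact sample code; it assumes a working hipcc from a ROCm install.

```cpp
// vecadd_hip.cpp -- minimal HIP sketch (assumes ROCm/HIP; build with hipcc)
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    float *da, *db, *dc;
    hipMalloc(reinterpret_cast<void**>(&da), n * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&db), n * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&dc), n * sizeof(float));
    hipMemcpy(da, a.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, b.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int block = 256;
    const int grid  = (n + block - 1) / block;
    hipLaunchKernelGGL(vec_add, dim3(grid), dim3(block), 0, 0, da, db, dc, n);

    hipMemcpy(c.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("c[0] = %f\n", c[0]);  // expected 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```

The same source is intended to compile with HCC/hipcc on AMD hardware or with NVCC on NVIDIA hardware, which is the portability point the slide makes.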

INTRODUCING ROCM SOFTWARE PLATFORM
A new, fully "Open Source" foundation for hyperscale- and HPC-class GPU computing

Graphics Core Next, headless Linux® 64-bit, and HSA drive rich capabilities into the ROCm driver

Hardware and Software
• Multi-GPU shared virtual memory
• User mode queues
• Large memory single allocation
• Architected queuing language
• Peer-to-peer multi-GPU
• Flat memory addressing
• Peer-to-peer with RDMA
• Atomic memory transactions
• Systems management API and tools
• Process concurrency & preemption
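The user-mode capabilities listed above are exposed through the ROCr system runtime, which implements the standard HSA runtime C API. The sketch below only enumerates the CPU and GPU agents the runtime can see; it assumes a ROCm installation and uses core HSA 1.0 calls (hsa_init, hsa_iterate_agents, hsa_agent_get_info, hsa_shut_down). The header path may differ between installs.

```cpp
// agents.cpp -- minimal sketch against the HSA (ROCr) runtime C API.
// Assumes a ROCm install; header path (<hsa/hsa.h> vs "hsa.h") can vary.
#include <hsa/hsa.h>
#include <cstdio>

// Callback invoked once per agent (CPU or GPU) known to the runtime.
static hsa_status_t print_agent(hsa_agent_t agent, void* /*data*/) {
    char name[64] = {0};
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    std::printf("agent: %-24s type: %s\n", name,
                type == HSA_DEVICE_TYPE_GPU ? "GPU" : "CPU/other");
    return HSA_STATUS_SUCCESS;
}

int main() {
    if (hsa_init() != HSA_STATUS_SUCCESS) {
        std::fprintf(stderr, "hsa_init failed (is the ROCm driver loaded?)\n");
        return 1;
    }
    hsa_iterate_agents(print_agent, nullptr);  // enumerate CPU and GPU agents
    hsa_shut_down();
    return 0;
}
```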

Rich Compiler Foundation for HPC
• LLVM native GCN ISA code generation
• Offline compilation support
• Standardized loader and code object format
• GCN ISA assembler and disassembler
• Full documentation of the GCN ISA

Developer "Open Source" Tools and Libraries
• Rich set of "Open Source" math libraries
• Tuned "Deep Learning" frameworks
• Optimized parallel programming frameworks
• CodeXL profiler and GDB debugging

DELIVERING AN OPEN PLATFORM FOR GPU COMPUTING
A language-neutral solution to match developer needs as heterogeneous programming models evolve

[Figure: compiler and runtime stack: a compiler front end feeds LLVM optimization passes for both a CPU ISA target and a GCN target (with a GCN compiler/assembly path); language runtime APIs and UCX layer on the ROCr system runtime API, which sits on the ROCm driver under the Linux OS, producing CPU code and GPU code]

Foundation To Enable The Broader Software Engineering Community

INCREASING MARKET ACCESS

[Figure: software landscape through 2015 (Catalyst™ driver): OpenCL, CUDA, ISO C++, and Python, labeled as having no or only limited GPU acceleration]

Radeon Open Compute Platform (ROCm)
• OpenCL™: ROC Runtime + OpenCL, for improved performance
• HIP: ROC Runtime + HIP with the AMD HIP compiler, one code base for multiple platforms
• ISO C++ 11/14: ROC Runtime + the AMD HCC compiler (via the PSTL), rich single-source GPU acceleration
• Python: ROC Runtime + Continuum IO ANACONDA with NUMBA, the simplest path to GPU acceleration

OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.

Bringing Rich Tools to Simplify Bringing Your Application to ROCm

HCC: Heterogeneous Compute Compiler
• Single-source compiler for both the CPU and GPU
• Built on a rich compiler infrastructure: CLANG/LLVM and libC++
• ISO C++ 11/14/17, with the C++17 Parallel Standard Template Library
• OpenMP 3.1 support for CPU hardware; showing OpenMP 4.5 GPU offloading at SC2016
• Performance optimized for accelerators

HIP: Heterogeneous-compute Interface for Portability
• Portable "CUDA"-like programming model
• C++ kernel language with a C runtime
• Tools to simplify porting CUDA applications to HIP: HIPIFY, CMake tools
• Code executes on AMD and NVIDIA hardware
• Compiles with HCC on AMD hardware and with NVCC on NVIDIA hardware
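The ISO C++ path in the HCC column can be illustrated with a plain C++17 Parallel STL sketch. This is standard C++ with no AMD-specific API; whether std::execution::par is mapped to a GPU is a property of the compiler and runtime (HCC's PSTL support being one such path), which is an assumption here rather than something this code controls.

```cpp
// pstl_saxpy.cpp -- illustrative ISO C++17 parallel-algorithms sketch.
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // y[i] = 2*x[i] + y[i], expressed with a standard parallel algorithm.
    std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                   [](float xi, float yi) { return 2.0f * xi + yi; });

    // Parallel reduction over the result.
    float sum = std::reduce(std::execution::par, y.begin(), y.end(), 0.0f);
    std::printf("y[0] = %f, sum = %f\n", y[0], sum);
    return 0;
}
```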

OPTIMIZED TO SUPPORT DENSE GPU COMPUTING TOPOLOGIES

[Figure: two reference platforms: a DNN-training-optimized platform and an oil-and-gas / molecular-dynamics-optimized platform, each combining CPUs and memory with PCIe3 switches, NVMe storage, multiple GPUs, and EDR InfiniBand]

INTRODUCING ROCM 1.3
Our next album for SC 2016

Hardware and OS
• Hawaii, Fiji, Polaris family architecture support
• New Radeon RX480 & RX460 support
• Ubuntu 16.04 & Fedora 24
• Initial loadable module via DKMS for KFD and other kernel services

ROCm Core New Features
• GPU process concurrency
• KVM pass-through support (developer preview)
• Image support in ROCr Runtime
• KFD signals move to 4096 from 256
• Doorbell map into GPUVM
• Improved interrupt service request handling

Compiler Foundation
• LLVM 4.0 native GCN compiler
• HCC CLANG 4.0 developer release

Open Source Tools and Libraries
• HIP 1.0: hipDriver API; hipStream API; hipThreadfence & hipThreadfence_block; HIP BLAS and FFT library interfaces
• ROCm-SMI update with optimized power heuristics
• ROCm profiler
• rocBLAS
• rocFFT
• Tensile (tensor library)
• ATMI (low-level task library)

ROCm Gear For 2016

[Figure: New for ROCm 1.3: entry-level hardware for DIY, hobbyist & student use. Current ROCm Pro gear: Radeon R9 Nano, Radeon Pro Duo, FirePro S9300 x2, FirePro W9100, FirePro S9150, FirePro S9170]

ROCM SUPPORT AND CONSULTING IN RUSSIA AND CIS
AMD COMPILER TEAM IN ST. PETERSBURG ODC
Works in the Moscow time zone, shifted towards the afternoon

 Boris Ivanovsky E-mail: [email protected]

 Nikolay Haustov E-mail: [email protected]

 Ilya Perminov E-mail: [email protected]

EXTENDING SUPPORT TO A BROADER HARDWARE ECOSYSTEM
The ROCm "Open Source" foundation brings a rich foundation to these new ecosystems

• IBM OpenPOWER support: IBM POWER8
• AMD64 support: AMD ZEN, Intel Xeon E5 v3/v4
• ARM AArch64 support: Cavium ThunderX

ROCm is being built to support next generation I/O Interfaces

Gen-Z founding member
CCIX founding member
OpenCAPI founding member

ZEN CPU CORE: PERFORMANCE AND THROUGHPUT

QUANTUM LEAP IN CORE EXECUTION CAPABILITY
 Enhanced branch prediction to select the right instructions
 Micro-op cache for efficient ops issue
 1.75X instruction scheduler window*
 1.5X issue width and execution resources*

Result: instruction-level parallelism designed for dramatic gains in single-threaded performance
*Compared to Excavator

NEW ZEN CPU CORE IN DESKTOPS

“SUMMIT RIDGE”

8 CORES, 16 THREADS
AM4 Platform
• DDR4
• PCI EXPRESS® GEN 3
• NEXT-GEN I/O

ZEN CPU CORE PERFORMANCE

[Figure: 8-core, 16-thread AMD "Summit Ridge" vs. 8-core, 16-thread Intel Core i7 (Broadwell-E), at identical clock speeds and comparable configurations]

"NAPLES" SERVER SOC

32-core, 64-thread
1st public demo: August 2016, San Francisco
Enabling industry software support

ZEN CPU CORE: MEMORY THROUGHPUT

HIGH BANDWIDTH & LOW LATENCY MEMORY HIERARCHY

• Significantly enhanced pre-fetcher
• 8MB of shared L3 cache
• Large unified L2 cache for instructions and data
• Separate low-latency L1 instruction and L1 data caches
• Up to 5X cache bandwidth to a core

ZEN CPU CORE: MEMORY THROUGHPUT

HIGH BANDWIDTH, LOW LATENCY
Significantly enhanced pre-fetcher
8MB of shared L3 cache
Large unified L2 cache for instructions and data
Separate low-latency L1 instruction and L1 data caches

[Figure: per-core cache hierarchy (Core 0): 64K 4-way I-Cache and 32K 8-way D-Cache, each with a 32B/cycle path to a 512K 8-way unified (I+D) L2, which connects at 32B/cycle to an 8M 16-way shared (I+D) L3]

Up to 5X cache bandwidth to a core

ZEN CPU CORE: PROCESSING THROUGHPUT

SIMULTANEOUS MULTI-THREADING

Each thread appears to software the same as an independent core
High-performance cores have gaps in utilization that are now exploited for an additional thread
Excellent synergy with single-thread mode: more execution resources benefit both modes

ZEN CPU CORE: EFFICIENCY

IMPROVED TRANSISTOR DENSITY & EFFICIENCY

Energy-efficient 14nm FinFET design scales from client to enterprise-class products

Chart for illustrative purposes only

ZEN CPU CORE: EFFICIENCY

LOW-POWER DESIGN METHODOLOGIES

Aggressive clock gating with multi-level regions
Write-back L1 cache
Large micro-op cache
Stack engine
Move elimination

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION © 2016 Advanced Micro Devices, Inc. and AMD Advanced Research. All rights reserved. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
