IWOPH workshop, ISC, Germany June 21, 2017

OpenPOWER Innovations for HPC

IBM Research

Christoph Hagleitner, hle@zurich.ibm.com

IBM Research - Zurich Lab IBM Research - Zurich

. Established in 1956
. 45+ different nationalities
. Open Collaboration
– 43 funded projects and 500+ partners in Horizon 2020
– Binnig and Rohrer Nanotechnology Centre opened in 2011 (Public-Private Partnership with ETH Zürich and EMPA)
– 7 European Research Council Grants
. Two Nobel Prizes:
– 1986 for the scanning tunneling microscope (Binnig and Rohrer)
– 1987 for the discovery of high-temperature superconductivity (Müller and Bednorz)

6/22/2017 IBM Research - Zurich Lab 2 Agenda

. Cognitive Computing & HPC

. HPC Software ecosystem

. HPC system roadmap

. HPC Processor

. HPC Accelerators

Cognitive Computing Applications

[Chart: application classes plotted by data volume (MB, GB, TB, PB) against computational complexity. O(N): database queries and information retrieval (Hadoop); O(N^2): uncertainty quantification, knowledge graph creation, dimensionality reduction (HPC); O(N^3): deep learning and graph analytics; classic HPC applications sit at high complexity and lower data volume.]

Cognitive Computing: Integration & Disaggregation

. hadoop-style workloads ... scale-out via network
– main metrics: cost (capital, energy), compute density, scalability
 homogeneous nodes
– compute density, scalability (CPU / FPGA / NVMe plus compute)
 datacenter disaggregation

. complex HPC-like workloads ... scale-up via high-speed buses
– main metrics: memory / accelerator / inter-node BW, optimal mix of heterogeneous resources (CPU / GPU / FPGA / HBM / DRAM / NVMe)
 heterogeneous nodes
 data centric design

Dense, Energy Efficient Computing: Hyperscale FPGA

. Cloud economics
– density (>1000 nodes / rack)
– integrated NICs
– switch card (backplane, no cables)
– medium to low-cost compute chips
. Passive liquid cooling
– ultimate density (cooling >70 W / node)
– energy re-use
. Built to integrate heterogeneous resources
– CPUs
– Accelerators


OpenPOWER, a catalyst for Open Innovation

Open Development OpenPOWER enables greater innovation through both open software and open hardware

Collaboration across multiple thought leaders Collaborative development model drives collective thought leadership, simultaneously across multiple disciplines

Performance of leading POWER architecture Broadens the capability and performance of the POWER platform

The OpenPOWER Foundation creates an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers.

OpenPOWER Foundation: 230+ Partners, 24 countries

SuperVessel: OpenPOWER Cloud

. Open cloud platform based on Power/OpenPOWER and OpenStack technology for business partners, developers, and university students
. Heterogeneous Computing Cloud @ Zurich Research Lab

SuperVessel: OpenPOWER Cloud for Developers

OpenPOWER Software Support

• 50+ IBM Innovation Centers, over 2,300 ISVs developing on Power
• Moving to little endian (almost complete)

OpenPOWER OS & Compiler Support

. Choose your favorite and latest Linux flavor, in little endian (ppc64le)
– RHEL
– ...
. Standard compilers: GCC 4.8.5, MPICH 3.0.4, CUDA 8.0
. AT9.0.3 compilers: GCC 5.3.1, Python 3.4, and more, optimized for POWER
. AT10.0.1 compilers: GCC 6.2.1, Python 3.5, and more, optimized for POWER
. IBM compilers: XLF, XLC, ...
. Optimized libraries: MASS (math functions), ESSL (BLAS), and MPI

Introducing PowerAI: Enterprise Deep Learning Distribution

. Package of pre-compiled major Deep Learning frameworks
. Easy to install & get started with Deep Learning, with enterprise-class support
. Optimized for performance, to take advantage of NVLink

Enabled by High Performance Computing Infrastructure

PowerAI Deep Learning Software Distribution

Deep Learning Frameworks: Caffe, NVCaffe, IBMCaffe, Torch, TensorFlow, Distributed TensorFlow, Theano, Chainer

Supporting Libraries: OpenBLAS, Bazel, DIGITS; Distributed Communications: NCCL

Cluster of NVLink servers; Spectrum Scale: high-speed parallel file system; scale to cloud – accelerated servers and infrastructure for scaling

S822LC: IBM POWER8+ for HPC

OpenPOWER Core Technology Roadmap

Mellanox Interconnect
– 2015: Connect-IB, FDR Infiniband, PCIe Gen3
– 2016: ConnectX-4, EDR Infiniband, CAPI over PCIe Gen3
– 2017: ConnectX-5, Next-Gen Infiniband, Enhanced CAPI over PCIe Gen4

GPUs
– 2015: Kepler, PCIe Gen3
– 2016: Pascal, NVLink
– 2017: Volta, Enhanced NVLink

IBM CPUs
– 2015: POWER8, PCIe Gen3 & CAPI interface
– 2016: POWER8', NVLink & CAPI
– 2017: POWER9, Enhanced NVLink, CAPI, OpenCAPI & PCIe Gen4

Accelerator Links
– 2015: PCIe Gen3, 1x (GPU, FPGA)
– 2016: NVLink, 5x (GPU); 4x (FPGA)
– 2017: 25G Accelerator Link, 7-10x (GPU, FPGA); PCIe Gen4

Looking Ahead: POWER9 Chip

New Core Microarchitecture
. Stronger thread performance
. Efficient agile pipeline
. POWER ISA v3.0

Enhanced Cache Hierarchy
. 120 MB NUCA L3 architecture
. 12 x 20-way associative regions
. Advanced replacement policies
. Fed by 7 TB/s on-chip bandwidth

14nm finFET
. Improved device performance and reduced energy
. 17 layer metal stack and eDRAM
. 8.0 billion transistors

Leadership Hardware Acceleration Platform
. Enhanced on-chip acceleration
. Nvidia NVLink 2.0: high bandwidth, advanced features
. CAPI 2.0: coherent accelerator and storage attach (PCIe G4)
. OpenCAPI: improved latency and bandwidth, open interface

State of the Art I/O Subsystem
. PCIe Gen4 – 48 lanes

High Bandwidth Signaling Technology
. 16 Gb/s interface: local SMP
. 25 Gb/s interface: 25G Link for accelerator and remote SMP

Cloud + Virtualization Innovation
. Quality of service assists
. New interrupt architecture
. Workload optimized frequency
. Hardware enforced trusted execution

POWER9 µArchitecture

Re-factored Core Provides Improved Efficiency & Workload Alignment
. Enhanced pipeline efficiency with modular execution and intelligent pipeline control
. Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
. Shared compute resource optimizes data-type interchange

POWER9: POWER ISA v3.0

Broader data type support
. 128-bit IEEE 754 Quad-Precision Float
– Full width quad-precision for financial and security applications
. Expanded BCD and 128b Decimal Integer
– For database and native analytics
. Half-Precision Float Conversion
– Optimized for accelerator bandwidth and data exchange
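The half-precision trade-off can be illustrated in plain Python; the ISA instructions do this conversion in hardware, so this is only a sketch of the storage effect, using the standard-library `struct` module:

```python
import struct

# Pack a value into 16-bit IEEE 754 half precision ('<e'), halving the
# bytes moved to an accelerator relative to a 32-bit float ('<f').
def to_half(x: float) -> bytes:
    return struct.pack('<e', x)

def from_half(b: bytes) -> float:
    return struct.unpack('<e', b)[0]

h = to_half(0.5)
print(len(h), len(struct.pack('<f', 0.5)))  # 2 4
print(from_half(h))                         # 0.5 (exactly representable)
```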

Support Emerging Algorithms
. Enhanced Arithmetic and SIMD
. Random Number Generation Instruction

Accelerate Emerging Workloads
. Memory Atomics – for high scale data-centric applications
. Hardware Assisted Garbage Collection – optimize response time of interpretive languages

Cloud Optimization
. Enhanced Translation Architecture – optimized for Linux
. New Interrupt Architecture – automated partition routing for extreme virtualization
. Enhanced Accelerator Virtualization
. Hardware Enforced Trusted Execution

Energy & Frequency Management
. POWER9 Workload Optimized Frequency – manage energy between threads and cores with reduced wakeup latency

POWER8 Caches

. L2: 1 MB 8 way per core

. L3: 96 MB (12 x 8 MB 8 way Bank)

. L4: 128 MB (on Centaur)

. “NUCA” Cache policy (Non-Uniform Cache Architecture)

. Cache bandwidth

– 4 TB/sec L2 BW

– 3 TB/sec L3 BW
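The 96 MB aggregate above is just the bank arithmetic:

```python
# POWER8 L3: 12 banks of 8 MB each (each bank 8-way associative)
banks = 12
bank_mb = 8
print(banks * bank_mb)  # 96 MB total
```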

Looking Ahead: POWER9 Data Capacity & Throughput

L3 Cache: 120 MB Shared Capacity NUCA Cache
. 10 MB capacity + 512k L2 per 2x SMT4 core
. Enhanced replacement with reuse & data-type awareness

High-Throughput On-Chip Fabric
. Over 7 TB/s on-chip switch
. Move data in/out at 256 GB/s per 2x SMT4 core

Looking Ahead: POWER9 Accelerator Interfaces

Extreme Accelerator Bandwidth and Reduced Latency
– PCIe Gen4 x 48 lanes: 192 GB/s peak bandwidth (duplex)
– IBM BlueLink 25 Gb/s x 48 lanes: 300 GB/s peak bandwidth (duplex)

Coherent Memory and Virtual Addressing Capability for all Accelerators
– CAPI 2.0: 4x bandwidth of POWER8, using PCIe Gen4
– NVLink 2.0: next generation of GPU/CPU bandwidth and integration, using BlueLink
– OpenCAPI: high bandwidth, low latency and open interface, using BlueLink
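Both peak figures follow from lane count times raw line rate, doubled for duplex; a quick check (encoding overhead, e.g. 128b/130b for PCIe Gen4, is ignored, which is why these are "peak" numbers):

```python
# Peak duplex bandwidth in GB/s from lane count and raw line rate (Gb/s).
def duplex_gb_s(lanes: int, gbit_per_lane: float) -> float:
    per_direction = lanes * gbit_per_lane / 8  # bits -> bytes
    return 2 * per_direction                   # duplex: both directions

print(duplex_gb_s(48, 16))  # 192.0  (PCIe Gen4, 16 GT/s per lane)
print(duplex_gb_s(48, 25))  # 300.0  (BlueLink, 25 Gb/s per lane)
```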

POWER9 + Accelerators: GPUs

See NVIDIA @ GTC 2017

POWER9 + GPU: Unified Memory

Pascal

Volta

See NVIDIA @ GTC 2017

CAPI ... Coherent Accelerator Processor Interface

Standard I/O Model Flow: DD Call  Copy/Pin  MMIO Notify  Accelerate  Poll / Int  Copy/Unpin  Return DD

Flow with a Coherent Model: Shared Mem. Notify Accelerator  Accelerate  Shared Mem. Completion

[Diagram: POWER8 processor with CAPP attaches over PCIe to a CAPI FPGA containing the POWER Service Layer and AFUs 0 to n]

OpenCAPI v3.0 and NVLINK 2.0 with POWER9

POWER9 CPU

[Diagram: POWER9 CPU with two CAPI v2 proxy cores (CAPP) attaching a PCIe Gen4 FPGA accelerator card (PSL, SNAP Action0/Action1); the NVLink processing unit (NPU) and 25G links attach NVIDIA Pascal GPUs and OpenCAPI devices: accelerator cards, storage-class memory, network]

SNAP
– Streaming layer for CAPI v1.0+
– Simplifies accelerator development and use
– Supports High-Level Synthesis (HLS) for FPGA development
– Available as open source

PCIe Gen4 x 48 lanes – 192 GB/s duplex bandwidth
25G Link x 48 lanes – 300 GB/s duplex bandwidth

Available Accelerator Cards

Nallatech team explaining CAPI Flash card: https://www.youtube.com/watch?v=1n_ceKkCRuk

Dense Memory (distributed)

. Prototype Dense Memory integration software stack available
– byte-addressable, distributed, globally accessible DM resource
– exports industry-standard asynchronous RDMA API for DM read and write access
. Implements efficient local and remote DM access
– zero-copy local access via direct DMA, device to application buffer
– zero-copy remote access via IB RDMA, remote host to application buffer
. Performance measurements
– local DM access at NVMe device performance limits (3.5 GB/s read, 1.8 GB/s write of 4k buffers)
– remote DM access at network (100 Gb/s InfiniBand) and device limits: 12.5 GB/s distributed DM random read with 4 storage nodes, each equipped with one NVMe SSD
– close to 900k IOPS for single-device short sequential read/write operations
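The 12.5 GB/s remote-read figure is consistent with a simple bottleneck model, assuming the aggregate is bounded by whichever is slower: the combined NVMe device limit or the 100 Gb/s link (a sketch, not the measured setup):

```python
# Aggregate remote-read bound: min(combined NVMe read limit, link rate).
def aggregate_read_gb_s(nodes: int, dev_gb_s: float, link_gbit_s: float) -> float:
    device_limit = nodes * dev_gb_s   # all storage-node NVMe devices in parallel
    network_limit = link_gbit_s / 8   # InfiniBand link, bits -> bytes
    return min(device_limit, network_limit)

print(aggregate_read_gb_s(4, 3.5, 100))  # 12.5 -> network-bound, matching the measurement
```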

But be willing to take incremental steps when you can!
