OpenPOWER Innovations for HPC - IBM Research
IWOPH workshop, ISC, Germany, June 21, 2017
OpenPOWER Innovations for HPC
Christoph Hagleitner, [email protected]
IBM Research - Zurich Lab

IBM Research - Zurich
. Established in 1956
. 45+ different nationalities
. Open collaboration
  – 43 funded projects and 500+ partners in Horizon 2020
  – Binnig and Rohrer Nanotechnology Centre opened in 2011 (public-private partnership with ETH Zürich and EMPA)
  – 7 European Research Council Grants
. Two Nobel Prizes
  – 1986 for the scanning tunneling microscope (Binnig and Rohrer)
  – 1987 for the discovery of high-temperature superconductivity (Müller and Bednorz)

Agenda
. Cognitive computing & HPC
. HPC software ecosystem
. HPC system roadmap
. HPC processor
. HPC accelerators

Cognitive Computing Applications
[Chart: computational complexity vs. data volume. Database queries and information retrieval sit at O(N) on MB-GB volumes (Hadoop region); knowledge-graph creation, uncertainty quantification, and dimensionality reduction at O(N^2); graph analytics, deep learning, and classic HPC applications at O(N^3) on TB-PB volumes (HPC region).]

Cognitive Computing: Integration & Disaggregation
. Hadoop-style workloads ... scale-out via network
  – main metrics: cost (capital, energy), compute density, scalability
  – homogeneous nodes (CPU / FPGA / NVMe plus compute): datacenter disaggregation
. Complex HPC-like workloads ... scale-up via high-speed buses
  – main metrics: memory / accelerator / inter-node bandwidth; optimal mix of heterogeneous resources (CPU / GPU / FPGA / HBM / DRAM / NVMe); compute density; scalability
  – heterogeneous nodes: data-centric design

Dense, Energy-Efficient Computing: Hyperscale FPGA
. Cloud economics
  – density (>1000 nodes / rack)
  – integrated NICs
  – switch card (backplane, no cables)
  – medium- to low-cost compute chips
. Passive liquid cooling
  – ultimate density (cooling >70 W / node)
  – energy re-use
. Built to integrate heterogeneous resources
  – CPUs
  – accelerators

OpenPOWER, a Catalyst for Open Innovation
. Open development: OpenPOWER enables greater innovation through both open software and open hardware
. Collaboration across multiple thought leaders: a collaborative development model drives collective thought leadership, simultaneously across multiple disciplines
. Performance of the leading POWER architecture: broadens the capability and performance of the POWER platform
The OpenPOWER Foundation creates an open ecosystem, using the POWER architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers.

OpenPOWER Foundation: 230+ partners, 24 countries

SuperVessel: OpenPOWER Cloud
. Open cloud platform based on Power/OpenPOWER and OpenStack technology for business partners, developers, and university students
. Heterogeneous computing cloud @ Zurich Research Lab

SuperVessel: OpenPOWER Cloud for Developers

OpenPOWER Software Support
• 50+ IBM Innovation Centers, over 2,300 Linux ISVs developing on Power
• Moving to little endian (almost complete)

OpenPOWER OS & Compiler Support
. Choose your favorite and latest Linux flavor, in little endian (ppc64le)
  – RHEL
  – Ubuntu
  – ...
. Standard compilers: GCC 4.8.5, MPICH 3.0.4, CUDA 8.0
. AT9.0.3 (IBM Advance Toolchain) compilers: GCC 5.3.1, Python 3.4, and more, optimized for POWER
. AT10.0.1 compilers: GCC 6.2.1, Python 3.5, and more, optimized for POWER
. IBM compilers: XLF, XLC, ...
. Optimized libraries: MASS (math functions), ESSL (BLAS), and MPI (a small BLAS call sketch follows below)
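As a quick illustration of using the optimized BLAS libraries above, the sketch below calls the standard CBLAS DGEMM interface. It assumes a CBLAS-providing implementation such as OpenBLAS (part of the PowerAI stack later in this deck); ESSL offers the same BLAS functionality but with its own headers and link options, so the build line is an assumption, not something from the slides.

/* Minimal DGEMM sketch (not from the slides): C = alpha*A*B + beta*C
 * with 2x2 row-major matrices, via the standard CBLAS interface.
 * Assumed build line: gcc -O2 dgemm_demo.c -lopenblas */
#include <cblas.h>
#include <stdio.h>

int main(void)
{
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K       */
                1.0, A, 2,        /* alpha, A, lda */
                B, 2,             /* B, ldb        */
                0.0, C, 2);       /* beta, C, ldc  */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}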
Introducing PowerAI: Enterprise Deep Learning Distribution
. Package of pre-compiled major deep learning frameworks
. Easy to install and get started with deep learning, with enterprise-class support
. Optimized for performance, to take advantage of NVLink
. Enabled by high-performance computing infrastructure

PowerAI Deep Learning Software Distribution
. Deep learning frameworks: Caffe, NVCaffe, IBMCaffe, Torch, TensorFlow, Distributed TensorFlow, Theano, Chainer
. Supporting libraries: OpenBLAS, NCCL (distributed communications), DIGITS, Bazel
. Accelerated servers and infrastructure for scaling: cluster of NVLink servers, Spectrum Scale (high-speed parallel file system), scale to cloud

S822LC: IBM POWER8+ for HPC

OpenPOWER Core Technology Roadmap (2015 / 2016 / 2017)
. Mellanox interconnect: Connect-IB (FDR InfiniBand, PCIe Gen3) / ConnectX-4 (EDR InfiniBand, CAPI over PCIe Gen3) / ConnectX-5 (next-gen InfiniBand, enhanced CAPI over PCIe Gen4)
. NVIDIA GPUs: Kepler (PCIe Gen3) / Pascal (NVLink) / Volta (enhanced NVLink)
. IBM CPUs: POWER8 (PCIe Gen3 & CAPI) / POWER8' (NVLink & CAPI) / POWER9 (enhanced NVLink, CAPI, OpenCAPI & PCIe Gen4, 25G accelerator link)
. Accelerator links (GPU / FPGA): 1x / PCIe Gen3 (2015), 5x / 4x (2016), 7-10x / PCIe Gen4 (2017)

Looking Ahead: POWER9 Chip
. New core microarchitecture
  – Stronger thread performance
  – Efficient, agile pipeline
  – POWER ISA v3.0
. Enhanced cache hierarchy
  – 120 MB NUCA L3 architecture
  – 12 x 20-way associative regions
  – Advanced replacement policies
  – Fed by 7 TB/s on-chip bandwidth
. 14nm finFET
  – Improved device performance and reduced energy
  – 17-layer metal stack and eDRAM
  – 8.0 billion transistors
. Leadership hardware acceleration platform
  – Enhanced on-chip acceleration
  – NVIDIA NVLink 2.0: high bandwidth, advanced features
  – CAPI 2.0: coherent accelerator and storage attach (PCIe Gen4)
  – OpenCAPI: improved latency and bandwidth, open interface
. State-of-the-art I/O subsystem
  – PCIe Gen4, 48 lanes
. High-bandwidth signaling technology
  – 16 Gb/s interface: local SMP
  – 25 Gb/s interface: 25G link for accelerators and remote SMP
. Cloud + virtualization innovation
  – Quality-of-service assists
  – New interrupt architecture
  – Workload-optimized frequency
  – Hardware-enforced trusted execution

POWER9 µArchitecture
Re-factored core provides improved efficiency and workload alignment
. Enhanced pipeline efficiency with modular execution and intelligent pipeline control
. Increased pipeline utilization with symmetric data-type engines: fixed, float, 128b, SIMD (a small VSX example follows below)
. Shared compute resource optimizes data-type interchange
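To make the symmetric data-type engines a bit more concrete, here is a minimal vector (VSX/AltiVec) sketch in C using GCC's vector extensions on ppc64le; the intrinsic and flags are standard GCC features, but the example itself is an illustration, not taken from the talk.

/* Minimal VSX sketch (not from the slides): element-wise add of two
 * 128-bit vectors of doubles, the kind of work the SIMD engines run.
 * Assumed build line: gcc -O2 -mvsx vsx_add.c */
#include <altivec.h>
#include <stdio.h>

int main(void)
{
    /* Each "vector double" maps onto one 128-bit VSX register (2 x 64-bit). */
    vector double a = {1.0, 2.0};
    vector double b = {10.0, 20.0};
    vector double c = vec_add(a, b);   /* one SIMD add covers both lanes */

    printf("%f %f\n", c[0], c[1]);     /* GCC allows [] access on vector types */
    return 0;
}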
POWER9: POWER ISA v3.0
. Broader data type support
  – 128-bit IEEE 754 quad-precision float: full-width quad precision for financial and security applications
  – Expanded BCD and 128b decimal integer: for database and native analytics
  – Half-precision float conversion: optimized for accelerator bandwidth and data exchange
. Support for emerging algorithms
  – Enhanced arithmetic and SIMD
  – Random number generation instruction
. Accelerate emerging workloads
  – Memory atomics: for high-scale data-centric applications
  – Hardware-assisted garbage collection: optimizes response time of interpretive languages
. Cloud optimization
  – Enhanced translation architecture: optimized for Linux
  – New interrupt architecture: automated partition routing for extreme virtualization
  – Enhanced accelerator virtualization
  – Hardware-enforced trusted execution
. Energy and frequency management
  – POWER9 workload-optimized frequency: manage energy between threads and cores with reduced wake-up latency

POWER8 Caches
. L2: 1 MB, 8-way, per core
. L3: 96 MB (12 x 8 MB, 8-way banks)
. L4: 128 MB (on Centaur)
. "NUCA" cache policy (non-uniform cache architecture)
. Cache bandwidth
  – 4 TB/s L2 bandwidth
  – 3 TB/s L3 bandwidth

Looking Ahead: POWER9 Data Capacity & Throughput
. L3 cache: 120 MB shared-capacity NUCA cache
  – 10 MB capacity + 512 KB L2 per 2x SMT4 core
  – Enhanced replacement with reuse and data-type awareness
. High-throughput on-chip fabric
  – Over 7 TB/s on-chip switch
  – Moves data in/out at 256 GB/s per 2x SMT4 core

Looking Ahead: POWER9 Accelerator Interfaces
. Extreme accelerator bandwidth and reduced latency
  – PCIe Gen4, 48 lanes: 192 GB/s peak bandwidth (duplex)
  – IBM BlueLink, 25 Gb/s x 48 lanes: 300 GB/s peak bandwidth (duplex; 48 x 25 Gb/s = 1200 Gb/s = 150 GB/s per direction)
. Coherent memory and virtual addressing capability for all accelerators
  – CAPI 2.0: 4x the bandwidth of POWER8, using PCIe Gen4
  – NVLink 2.0: next generation of GPU/CPU bandwidth and integration, using BlueLink
  – OpenCAPI: high bandwidth, low latency, open interface, using BlueLink

POWER9 + Accelerators: GPUs (see NVIDIA @ GTC 2017)

POWER9 + GPU: Unified Memory (Pascal, Volta; see NVIDIA @ GTC 2017)

CAPI: Coherent Accelerator Processor Interface
. Standard I/O model flow: device-driver call, copy/pin buffers, MMIO notify, accelerate, poll/interrupt, copy/unpin, return from driver
. Flow with a coherent model: place work in shared memory, notify accelerator, accelerate, completion signaled in shared memory
. [Diagram: CAPI-attached FPGA with the POWER service layer and AFUs (AFU 1 ... AFU n), connected via PCIe to the CAPP on the POWER processor; a mock sketch of the coherent flow follows below]
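To illustrate why the coherent model removes the copy/pin and driver round trips, here is a small mock in C: the "AFU" is simulated by a thread that reads and writes the application's memory directly and signals completion through a shared flag. The structure, names, and threading stand-in are illustrative assumptions only; this is not the real CAPI user-space API.

/* Mock of the coherent-attach flow (not real CAPI code).
 * Assumed build line: gcc -O2 -pthread capi_mock.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct work_element {
    const double *src;   /* input, in ordinary user memory     */
    double       *dst;   /* output, written by the mock "AFU"  */
    int           n;
    atomic_int    done;  /* completion flag set by the "AFU"   */
};

static void *mock_afu(void *arg)          /* stand-in for a CAPI AFU */
{
    struct work_element *we = arg;
    for (int i = 0; i < we->n; i++)
        we->dst[i] = 2.0 * we->src[i];    /* works on user memory, no copies */
    atomic_store(&we->done, 1);           /* "completion" via shared memory  */
    return NULL;
}

int main(void)
{
    double in[4] = {1, 2, 3, 4}, out[4];
    struct work_element we = { in, out, 4, 0 };

    pthread_t afu;
    pthread_create(&afu, NULL, mock_afu, &we);  /* "notify accelerator": hand over a pointer */
    while (!atomic_load(&we.done))              /* wait for the completion flag */
        ;
    pthread_join(afu, NULL);

    printf("out[3] = %f\n", out[3]);
    return 0;
}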