
With Extreme Scale the Rules Have Changed

Jack Dongarra

University of Tennessee / Oak Ridge National Laboratory / University of Manchester

• Overview of High Performance Computing
• With Extreme Computing the "rules" for computing have changed

• Create systems that can apply exaflops of computing power to exabytes of data.
• Keep the United States at the forefront of HPC capabilities.
• Improve HPC application developer productivity.
• Make HPC readily available.
• Establish hardware technology for future HPC systems.

[Chart: Top500 performance development, 1994-2015, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the Sum, N=1, and N=500 lines. November 2015: Sum = 420 PFlop/s, N=1 = 33.9 PFlop/s, N=500 = 206 TFlop/s (June 2015: 362 PFlop/s, 33.9 PFlop/s, 166 TFlop/s). Starting points in 1994: 1.17 TFlop/s, 59.7 GFlop/s, 400 MFlop/s. For comparison, "my laptop" is about 70 Gflop/s and "my iPhone" about 4 Gflop/s.]

[Chart: projected performance development, extrapolating the Sum, N=1, and N=500 trend lines out to 2020, approaching 1 Eflop/s.]

• Pflops (> 10^15 Flop/s) computing is fully established, with 81 systems.
• Three technology architecture possibilities, or "swim lanes," are thriving:
  • Commodity
  • Commodity + accelerator (e.g., GPUs) (104 systems)
  • Special-purpose lightweight cores (e.g., IBM BG, ARM, Knights Landing)
• Interest in supercomputing is now worldwide and growing in many new markets (over 50% of the Top500 are in industry).
• Exascale (10^18 Flop/s) projects exist in many countries and regions.
• Intel processors have the largest share, 89%, followed by AMD at 4%.

June 2015 Top500 list (top 10):

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt

1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Intel Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146
8 | Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489
9 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
10 | DOE / NNSA, L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
500 (422) | Software Comp. | HP Cluster | USA | 18,896 | .309 | 48 | |

November 2015 Top500 list (top 10):

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt

1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon 12C + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | DOE / NNSA, Los Alamos & Sandia | Trinity, Cray XC40, Xeon 16C + Custom | USA | 301,056 | 8.10 | 80 | |
7 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
8 | HLRS Stuttgart | Hazel Hen, Cray XC40, Xeon 12C + Custom | Germany | 185,088 | 5.64 | 76 | |
9 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146
10 | Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489
500 (368) | Karlsruher | MEGAWARE Intel | Germany | 10,800 | .206 | 95 | |

♦ US DOE planning to deploy O(100) Pflop/s systems for 2017-2018 ($525M in hardware)
♦ Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and Nvidia-based systems
♦ Argonne Lab to receive an Intel-based system
! After this: exaflops

♦ The US Dept of Commerce is preventing some Chinese groups from receiving Intel technology
! Citing concerns about nuclear research being done with the systems; February 2015
! On the blockade list:
  ! National SC Center Guangzhou, site of Tianhe-2
  ! National SC Center Tianjin, site of Tianhe-1A
  ! National University for Defense Technology, developer
  ! National SC Center Changsha, location of NUDT

♦ Sunway BlueLight MPP, ShenWei SW1600 975 MHz, Infiniband QDR
  ! Site: National Supercomputing Center in Jinan
  ! Cores: 137,200; Linpack performance (Rmax): 795.9 TFlop/s; theoretical peak (Rpeak): 1,070.16 TFlop/s
  ! Processor: ShenWei SW1600 16C 975 MHz (Alpha architecture)
  ! Interconnect: Infiniband QDR
♦ Rumored to have a 100 Pflop/s system

Peak Performance per Core

Floating-point operations per cycle per core:
! Most recent processors have FMA (fused multiply-add), i.e., x ← x + y*z in one cycle
! Intel Xeon (earlier models) and AMD Opteron with SSE2: 2 flops/cycle DP, 4 flops/cycle SP
! Intel Xeon Nehalem ('09) and Westmere ('10) with SSE4: 4 flops/cycle DP, 8 flops/cycle SP
! Intel Xeon Sandy Bridge ('11) and Ivy Bridge ('12) with AVX: 8 flops/cycle DP, 16 flops/cycle SP
! Intel Xeon Haswell ('13) and Broadwell ('14) with AVX2: 16 flops/cycle DP, 32 flops/cycle SP (we are here); Intel Xeon Phi is also at 16 flops/cycle DP, 32 flops/cycle SP per core
! Intel Xeon Skylake (server, '15) with AVX-512 (coming soon): 32 flops/cycle DP, 64 flops/cycle SP

CPU Access Latencies in Clock Cycles

[Chart: CPU access latencies, in clock cycles, from registers and caches out to main memory.]

In the roughly 167 cycles of a main-memory access, a core can do 2,672 DP flops (167 cycles x 16 flops/cycle).
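A back-of-the-envelope check of these numbers (illustrative only; the clock rate and latency below are assumed values, not measurements from the talk):

    # Rough peak-vs-latency arithmetic for a single core (illustrative sketch).
    flops_per_cycle_dp = 16      # e.g., a Haswell core: AVX2 + FMA, double precision
    clock_ghz = 2.6              # assumed core clock
    mem_latency_cycles = 167     # assumed main-memory access latency, in cycles

    peak_gflops_per_core = flops_per_cycle_dp * clock_ghz
    flops_per_memory_access = flops_per_cycle_dp * mem_latency_cycles

    print(f"per-core DP peak: {peak_gflops_per_core:.1f} Gflop/s")
    print(f"DP flops possible during one memory access: {flops_per_memory_access}")
    # 167 cycles x 16 flops/cycle = 2672, matching the figure quoted above.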

♦ Processors are over-provisioned for floating-point arithmetic
♦ Data movement is extremely expensive
♦ Operation count is not a good indicator of the time to solve a problem
♦ Algorithms that do more ops may actually take less time

Singular Value Decomposition: LAPACK version (1991), using Level 1, 2, & 3 BLAS; the first stage costs 8/3 n^3 ops.

[Chart: speedup over EISPACK for the SVD of a square matrix, with vectors, for matrix sizes up to 20k columns, on an Intel Sandy Bridge at 2.6 GHz (8 flops per core per cycle): LAPACK QR (multicore), LAPACK QR (1 core), and EISPACK (1 core). "QR" refers to the QR algorithm for computing the eigenvalues.]

Bottleneck in the Bidiagonalization
The standard bidiagonal reduction, xGEBRD, has two steps: factor the panel, then update the trailing matrix.

[Spy plots of the reduction: factor panel k, update the trailing matrix, then factor panel k+1, and so on. Each panel factorization requires 2 GEMVs.]

Characteristics (reduction to bidiagonal form, Q^H * A * P):
• Total cost 8n^3/3
• Too many Level 2 BLAS operations: 4/3 n^3 in GEMV and 4/3 n^3 in GEMM
• Performance limited to 2x the performance of GEMV
• Memory-bound algorithm
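To see why this matters, a small NumPy micro-benchmark (a sketch; the matrix size is arbitrary) compares the achievable GEMV and GEMM rates on whatever machine it runs on; xGEBRD is capped near twice the GEMV rate because half of its flops live in GEMV:

    # Compare GEMV vs GEMM throughput to illustrate the memory-bound bottleneck.
    import time
    import numpy as np

    n = 4000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    x = np.random.rand(n)

    t = time.perf_counter(); _ = A @ x; gemv_s = time.perf_counter() - t
    t = time.perf_counter(); _ = A @ B; gemm_s = time.perf_counter() - t

    gemv_gflops = 2 * n**2 / gemv_s / 1e9       # GEMV does 2n^2 flops
    gemm_gflops = 2 * n**3 / gemm_s / 1e9       # GEMM does 2n^3 flops
    print(f"GEMV: {gemv_gflops:5.1f} Gflop/s   GEMM: {gemm_gflops:7.1f} Gflop/s")
    # xGEBRD does 4/3 n^3 of its 8/3 n^3 flops in GEMV, so its rate is
    # limited to roughly 2x the GEMV rate, far below the GEMM rate.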

Hardware: 16-core Intel Sandy Bridge, 2.6 GHz, 20 MB shared L3; theoretical double-precision peak of 20.4 Gflop/s per core; compiled with icc and linked against MKL 2015.3.187.

Recent Work on a 2-Stage Algorithm

[Spy plots: the first stage reduces the full matrix to band form; the second stage (bulge chasing) reduces the band to bidiagonal form.]

Characteristics
• Stage 1 (to band): fully Level 3 BLAS; dataflow, asynchronous execution
• Stage 2 (bulge chasing, to bidiagonal): Level "BLAS-1.5"; asynchronous execution; cache-friendly kernel (reduced communication)

Recent work on developing the new 2-stage algorithm

First stage (reduction to band form): summing the per-panel costs over the n/n_b panels,

    flops_first stage ≈ (10/3) n^3 + O(n_b n^2) ≈ (10/3) n^3   (essentially all in GEMM).

Second stage (bulge chasing, band to bidiagonal):

    flops_second stage = 6 n_b n^2   (in GEMV).

The two-stage approach therefore does more flops than the original one-stage reduction, which does 8/3 n^3.

Maximum performance bound (crediting the 8/3 n^3 reference operation count):

    P_max = reference number of operations / minimum time t_min
          = (8/3 n^3) / [ t_min((10/3) n^3 flops in GEMM) + t_min(6 n_b n^2 flops in GEMV) ]
          = 8 n^3 P_gemm P_gemv / (10 n^3 P_gemv + 18 n_b n^2 P_gemm)

    =>  (8/40) P_gemm <= P_max <= (8/12) P_gemm,
        if P_gemm is about 20x P_gemv and 120 <= n_b <= 240.

Speedup over the one-stage algorithm:

    speedup = time of one-stage / time of two-stage
            = [ 4n^3/(3 P_gemv) + 4n^3/(3 P_gemm) ] / [ 10n^3/(3 P_gemm) + 6 n_b n^2 / P_gemv ]

    =>  84/70 <= speedup <= 84/15,  i.e.  1.2 <= speedup <= 5.6,
        if P_gemm is about 20x P_gemv and 120 <= n_b <= 240.
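A small numerical check of these two formulas (a sketch; the GEMM and GEMV rates below are placeholder values to be replaced with measured ones):

    # Evaluate the two-stage performance bound and the speedup model above.
    def p_max(n, nb, p_gemm, p_gemv):
        """Bound on attainable Gflop/s, crediting the 8/3 n^3 reference flops."""
        ref_flops = 8 * n**3 / 3
        t_min = (10 * n**3 / 3) / p_gemm + (6 * nb * n**2) / p_gemv
        return ref_flops / t_min

    def speedup(n, nb, p_gemm, p_gemv):
        """Model of one-stage time divided by two-stage time."""
        t_one = (4 * n**3 / 3) / p_gemv + (4 * n**3 / 3) / p_gemm
        t_two = (10 * n**3 / 3) / p_gemm + (6 * nb * n**2) / p_gemv
        return t_one / t_two

    p_gemv, p_gemm = 15.0, 300.0          # placeholder rates (Gflop/s), GEMM ~ 20x GEMV
    for nb in (120, 240):
        print(f"nb={nb}: P_max = {p_max(20000, nb, p_gemm, p_gemv):.0f} Gflop/s, "
              f"speedup = {speedup(20000, nb, p_gemm, p_gemv):.1f}x")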

With P_gemm about 22x P_gemv and 120 <= n_b <= 240, the same bound gives 1.8 <= speedup <= 7.

[Chart: measured speedup of the 2-stage reduction over MKL DGEBRD for matrix sizes from 2k to 26k.]

Parallelization of LU and QR: parallelize the update (dgemm)
• Easy and done in any reasonable software.
• This is the 2/3 n^3 term in the flop count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

[Diagram: right-looking blocked LU. Factor the panel (dgetf2), apply dtrsm (+ dswp) to the block row, then a dgemm update of the trailing matrix; repeat on the next panel. Executed as fork-join, bulk synchronous processing: all cores synchronize after each step, leaving cores idle between the join and the next fork.]

Alternatively, the numerical program generates tasks and a runtime system executes the tasks, respecting the data dependences.

[DAG of tasks for a tiled factorization: xGETF2 panel factorizations, xTRSM triangular solves, and xGEMM updates, connected by their data dependences.]

Sparse / dense matrix system: DAG-based factorization and batched LA (see the sketch after this list):
! LU, QR, or Cholesky on small diagonal matrices
! TRSMs, QRs, or LUs
! TRSMs, TRMMs
! Updates (Schur complement): GEMMs, SYRKs, TRMMs
! And many other BLAS/LAPACK operations, e.g., for application-specific solvers, preconditioners, and matrices
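A toy illustration of the batched idea, using NumPy's stacked (batched) linear algebra rather than a real batched BLAS such as the vendor or MAGMA batched routines; the batch and block sizes are arbitrary:

    # Batched LA sketch: factor many small matrices at once with stacked kernels.
    import numpy as np

    batch, nb = 1000, 32
    rng = np.random.default_rng(0)
    R = rng.standard_normal((batch, nb, nb))
    A = R @ R.transpose(0, 2, 1) + nb * np.eye(nb)   # a batch of small SPD blocks

    # One batched call instead of 1000 tiny ones (NumPy broadcasts over the batch axis).
    L = np.linalg.cholesky(A)                        # batched Cholesky factorizations
    B = rng.standard_normal((batch, nb, 4))
    X = np.linalg.solve(A, B)                        # batched solves

    print(L.shape, X.shape)                          # (1000, 32, 32) (1000, 32, 4)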

[Task DAG excerpt: xGETF2, xTRSM, and xGEMM tasks for successive panels.]

♦ Objectives
! High utilization of each core
! Scaling to a large number of cores
! Synchronization-reducing algorithms
♦ Methodology
! Dynamic DAG scheduling
! Explicit parallelism
! Implicit communication
! Fine granularity / block data layout
♦ Arbitrary DAGs with dynamic scheduling (a blocked-LU sketch of the task structure follows below)
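To make the task structure concrete, here is a minimal blocked, right-looking LU sketch (no pivoting, no runtime system; each commented step is one task that a dynamic DAG scheduler would launch as soon as its inputs are ready):

    # Blocked right-looking LU without pivoting: the task structure of the DAG.
    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, nb=64):
        """Return packed LU factors (unit-lower L below the diagonal, U on/above it)."""
        A = A.copy()
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # Task GETF2: unpivoted LU of the diagonal block (panel factorization).
            for j in range(k, e):
                A[j+1:e, j] /= A[j, j]
                A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
            if e < n:
                L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
                U_kk = np.triu(A[k:e, k:e])
                # Task TRSM: update the block row and block column of the panel.
                A[k:e, e:] = solve_triangular(L_kk, A[k:e, e:], lower=True, unit_diagonal=True)
                A[e:, k:e] = solve_triangular(U_kk.T, A[e:, k:e].T, lower=True).T
                # Task GEMM: Schur-complement update of the trailing submatrix.
                A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return A

    n = 512
    A = np.random.rand(n, n) + n * np.eye(n)     # diagonally dominant, safe without pivoting
    LU = blocked_lu(A)
    L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
    print("relative error:", np.linalg.norm(L @ U - A) / np.linalg.norm(A))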

[Execution traces: fork-join parallelism versus DAG-scheduled parallelism. Note the synchronization penalty of fork-join in the presence of heterogeneity.]

Algorithmic Bombardment

• Interested in solving a large sparse non-symmetric system on a parallel computer.
• Many algorithms to choose from:
  – CG Squared (CGS)
  – Bi-CG Stabilized (BiCGSTAB)
  – QMR
  – GMRES
  – etc.
• The best choice is not clear, and some may not converge.
• Develop a "poly-algorithm" combining multiple algorithms in one, leveraging data movement.

Poly-algorithm Bombardment: combine three iterative methods into a multi-iterative solver, iterating in lock step.

Amount of work per iteration:
Method    | α = x^T y | y = αx + y | y = Ax
CGS       | 2         | 6          | 2
Bi-CGSTAB | 4         | 6          | 2
QMR       | 2         | 12         | 2

Number of communications per iteration, and storage:
Method      | α = x^T y | y = αx + y | y = Ax | x = M^-1 y | Storage
CGS         | 2         | 0          | 2      | 2          | A + 6n
Bi-CGSTAB   | 3         | 0          | 2      | 2          | A + 6n
QMR         | 2         | 0          | 2      | 2          | A + 16n
Bombardment | 3         | 0          | 2      | 3          | A + 26n

Algorithmic Bombardment using a Multi-Iterative Method

• Which is the iteration method of choice for a given problem?

• Combine three iterative solvers into one multi-iterative solver: Bombardment runs CGS, QMR, and BiCGSTAB in one method

• Target architecture: Graphics processing units (NVIDIA K40 GPU)

• Bombardment can be optimized for this architecture (a simplified sketch follows below):
  • Benefits from blocking the distinct SpMVs into one SpMM
  • Benefits from blocking the global reductions
  • Benefits from overlapping kernels
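A much-simplified sketch of the bombardment idea with SciPy's Krylov solvers: run the candidate methods on the same system and keep whatever converges. (The real multi-iterative solver advances all three methods in lock step and fuses their SpMVs and reductions, which this sketch does not attempt.)

    # Bombardment sketch: run several Krylov methods on the same sparse system.
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import qmr, cgs, bicgstab

    n = 10000
    A = sp.diags([-1.0, 2.5, -1.2], [-1, 0, 1], shape=(n, n), format="csr")  # toy nonsymmetric matrix
    b = np.ones(n)

    for name, solver in (("QMR", qmr), ("CGS", cgs), ("BiCGSTAB", bicgstab)):
        x, info = solver(A, b, maxiter=5000)
        res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
        status = "converged" if info == 0 else f"did not converge (info={info})"
        print(f"{name:9s} {status}, relative residual {res:.2e}")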

• 311 Sparse matrices from the University of Florida Matrix Collection • One of the solvers converges with < 5,000 iterations • Iteration limit: 100,000 iterations • Runtime is target metric

(Test matrices: https://www.cise.ufl.edu/research/sparse/matrices/)

Algorithmic Bombardment using a Multi-Iterative Method

[Chart: time-to-solution of QMR, CGS, and BiCGSTAB, normalized to the Bombardment time, for each of the 311 matrices (by matrix ID).]

Stats for the 311 test matrices:
                     | QMR       | CGS       | BiCGSTAB
Fastest method       | 38 (12%)  | 93 (30%)  | 181 (58%)
Converges            | 299 (96%) | 262 (84%) | 248 (80%)
Fails                | 12 (4%)   | 49 (16%)  | 63 (20%)
Bombardment speedup* | 2.42x     | 1.11x     | 1.66x

*only considering converging test cases

[Chart: the same normalized runtimes, plotted against the Bombardment runtime in seconds (10^-3 to 10^2 s).]

♦ "Responsibly Reckless" Algorithms
! Try a fast (possibly unstable) algorithm that might fail, but rarely
! Check for instability
! If needed, recompute with a stable algorithm

♦ To solve Ax = b:
! Compute A_r = U^T A V, with U and V random matrices
! Factorize A_r without pivoting (GENP)
! Solve A_r y = U^T b, then x = V y
♦ U and V are recursive butterfly matrices
! Randomization is cheap (O(n^2) operations)
! GENP is fast (can take advantage of the GPU)
! Accuracy is in practice similar to GEPP (with iterative refinement), but...

Think of this as a preconditioner step. The goal is to transform A into a matrix that is sufficiently "random" that, with probability close to 1, pivoting is not needed (see the sketch below).
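A toy NumPy sketch of the idea, using a single-level random butterfly rather than the full recursive construction, and a hand-written Gaussian elimination with no pivoting; the diagonal scaling used for the butterflies is an arbitrary choice for illustration:

    # Random butterfly transformation (depth 1) + Gaussian elimination, no pivoting.
    import numpy as np

    def butterfly(n, rng):
        """One-level random butterfly: B = [[R0, R1], [R0, -R1]] / sqrt(2), R0, R1 random diagonal."""
        m = n // 2
        r0 = np.diag(rng.uniform(0.5, 1.5, m))
        r1 = np.diag(rng.uniform(0.5, 1.5, m))
        return np.block([[r0, r1], [r0, -r1]]) / np.sqrt(2)

    def genp(A):
        """LU factorization with NO pivoting (GENP); returns the packed factors."""
        A = A.copy()
        for k in range(A.shape[0] - 1):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return A

    rng = np.random.default_rng(0)
    n = 512                                   # n even, for the toy butterfly
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    U, V = butterfly(n, rng), butterfly(n, rng)
    Ar = U.T @ A @ V                          # randomized matrix; pivoting rarely needed now
    LU = genp(Ar)
    L = np.tril(LU, -1) + np.eye(n)
    R = np.triu(LU)
    y = np.linalg.solve(R, np.linalg.solve(L, U.T @ b))   # two triangular solves
    x = V @ y

    print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))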

• Mixed precision: use the lowest precision required to achieve a given accuracy outcome
# Improves runtime, reduces power consumption, lowers data movement
# Reformulate to find a correction to the solution rather than the solution itself: Δx rather than x

• Exploit 32-bit floating point as much as possible
  # Especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
  # Compute a 32-bit result,
  # Calculate a correction to the 32-bit result using selected higher precision, and
  # Apply the correction to the 32-bit result using the higher precision.

• Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                     SINGLE   O(n^3)
    x = L\(U\b)                     SINGLE   O(n^2)
    r = b - A*x                     DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)                 SINGLE   O(n^2)
        x = x + z                   DOUBLE   O(n^1)
        r = b - A*x                 DOUBLE   O(n^2)
    END

# Wilkinson, Moler, Stewart, & Higham provide error bounds for single-precision results refined using double precision.
# It can be shown that using this approach we can compute the solution to 64-bit floating point accuracy.
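A runnable NumPy/SciPy sketch of this loop: the O(n^3) factorization is done in float32 and the residual/update in float64 (a sketch of the idea, not the LAPACK mixed-precision routine itself):

    # Mixed-precision iterative refinement: factor in float32, refine in float64.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(1)
    n = 2000
    A = rng.standard_normal((n, n)) + n * np.eye(n)        # well-conditioned test matrix
    b = rng.standard_normal(n)

    lu32, piv = lu_factor(A.astype(np.float32))            # O(n^3) work in single precision
    x = lu_solve((lu32, piv), b.astype(np.float32)).astype(np.float64)

    for it in range(10):
        r = b - A @ x                                      # residual in double precision
        if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
            break
        z = lu_solve((lu32, piv), r.astype(np.float32))    # correction, single precision
        x = x + z                                          # update in double precision

    print(f"relative residual after refinement: {np.linalg.norm(b - A @ x) / np.linalg.norm(b):.2e}")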

• Requires extra storage: total is 1.5 times normal
• O(n^3) work is done in lower precision
• O(n^2) work is done in higher precision
• Problems if the matrix is ill-conditioned in single precision (condition number around 10^8)

Solving general dense linear systems using mixed precision iterative refinement

[Chart: performance (Gflop/s) versus matrix size for the SP solve and the DP solve. Hardware: GPU K20c (13 MP @ 0.7 GHz, peak 1165 GFlop/s); CPU Genuine Intel (2 x 8 cores @ 2.60 GHz, peak 333 GFlop/s).]

Solving general dense linear systems using mixed precision iterative refinement

[Chart: the same comparison with the mixed-precision iteratively refined DP solve added (SP solve, DP solve with MP iterative refinement, DP solve), on the same K20c GPU / Intel CPU system, over a range of matrix sizes.]

[Chart: performance versus matrix size (2,500 to 20,000) for CPOSV, ZCPOSV (mixed precision), and ZPOSV, annotated with a 26x speedup, on a GPU TITAN X (3,072 CUDA cores @ 1.076 GHz, Maxwell; Z/C GEMM peak ~190 / 5,600 GFlop/s) with an Intel Xeon CPU (2 x 6 cores).]

Methods for Matrix Approximation via Random Sampling

• Random sampling algorithms address the size problem by generating much smaller approximate representations of large matrices
  – These are then solved using classic linear algebra algorithms, on platforms that would not otherwise be usable for this purpose
• They deliver results that are approximate solutions
  – They are faster, easier to code, easier to map to modern computer architectures, often more numerically robust, and display a higher level of parallelism than traditional algorithms

Computing a rank-k approximation of A, A ≈ A_k = U_k V_k^T, plays an important role in many applications with different constraints:
– SVD
  • A_k = U_k Σ_k V_k^T to minimize ||A - A_k||
  • Used for latent semantic indexing and population clustering [SC'15]
– CX, and then rank-revealing QR
  • A_k = A(:, J_{k+l}) X_{k+l}, and then A P ≈ Q_k R_k
  • With datasets from the HapMap project [SC'15]
– CUR
  • A_k = A(:, J_k) U_k A(I_k, :) by ACA, etc.
  • Data analysis, linear solvers, etc.
– Matrix completion
  • Minimize Σ_{i,j} (A(i,j) - A_k(i,j))^2 over A(i,j) ≠ 0, by ALS, SGD, etc.
  • Recommendation systems, missing data, etc.
– etc.
• Different algorithms exist for computing these approximations, e.g., for the SVD
• Increasing interest in randomized algorithms
  – Demand from architecture (e.g., high cost of data access)
  – Demand from data (e.g., processing "Big Data")
  – Aim: extract as much information as possible from each data access
  – May not work in all cases; good for low-rank matrices where a full factorization is overkill

Random sampling framework: normalized block power iteration

• Iterate to improve the approximation
  – Sample A and A^T (e.g., SpMM) to approximate the left and right singular vectors
• Block to improve convergence
  – Q has k+l columns, with oversampling parameter l
• "Normalize" to maintain stability
  – Many options, e.g., different algorithms, one-sided
• Potential to integrate different randomization/sampling schemes
  – Gaussian random sampling (SpMM/GEMM with a Gaussian random matrix Q)
  – "Pruned" FFT instead of GEMM [SC'14]
  – Sampling based on a probabilistic distribution (e.g., population clustering by SNP, leverage scores)
• Nice relation to "traditional" algorithms, e.g., block (CA) Lanczos for bidiagonalization [IEEE BigData workshop'14]
• If sampling converges to the desired accuracy in a couple of iterations, it can outperform the others, especially if data access is expensive (a sketch follows below)
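A compact NumPy sketch of the normalized block power iteration with Gaussian sampling, oversampling, and QR as the normalization step (illustrative only; parameter values are arbitrary):

    # Randomized rank-k approximation via normalized block power iteration.
    import numpy as np

    def randomized_lowrank(A, k, oversample=8, iters=3, seed=None):
        rng = np.random.default_rng(seed)
        m, n = A.shape
        Q = rng.standard_normal((n, k + oversample))     # Gaussian random sampling block
        for _ in range(iters):
            Q, _ = np.linalg.qr(A @ Q)                   # sample A, then normalize
            Q, _ = np.linalg.qr(A.T @ Q)                 # sample A^T, then normalize
        Y = A @ Q                                        # project A onto the sampled subspace
        Uy, s, Vt = np.linalg.svd(Y, full_matrices=False)
        return Uy[:, :k], s[:k], (Vt @ Q.T)[:k, :]       # approximate truncated SVD of A

    A = np.random.default_rng(0).standard_normal((2000, 500)) @ np.diag(1.0 / np.arange(1, 501))
    U, s, Vt = randomized_lowrank(A, k=20)
    err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
    print(f"relative error of rank-20 approximation: {err:.2e}")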

Singular value distribution of an SNP matrix (a tall-skinny matrix, O(100K x 500))

[Chart: singular values (roughly 10^2 to 10^4) versus singular value number, 1 to 500.]

• Approximately "low-rank-plus-shift"
  – Big enough gap to obtain the desired (low) accuracy in a couple of iterations
  – Small enough gap for stability (e.g., CholQR for orthogonalization)
• Many matrices of interest seem to have such distributions

Case studies: population clustering and latent semantic indexing

[Left: population clustering of HapMap SNP data (JPT, CHB, MEX, CHD populations), average clustering correlation coefficient = 0.65. Right: latent semantic indexing, average precision versus number of documents (roughly 532 to 1032) for the Recompute, Update, Update-inc, and Sample-2 strategies.]

• Desired accuracy obtained after a couple of iterations

Performance studies on a hybrid CPU/GPU architecture

[Charts: time breakdowns for the Update and Random-1 variants (components include SVD, TSQR, GEMM, SpMM, QRCP/QP3, QR, PRNG, sampling, iteration GEMM, iteration orthogonalization, other), with speedups of 13.4x on 3 GPUs, 14.1x on 12 GPUs, and 7.7x on 48 GPUs; and time versus number of rows m, up to 50,000.]

– Sparse truncated SVD on a supercomputer: each node with two six-core Intel Xeon CPUs and three K20Xm GPUs
– Dense truncated RRQR on one node with two eight-core Intel Sandy Bridge CPUs and three NVIDIA Tesla K40c GPUs

♦ Classical analysis of algorithms is no longer valid: the number of floating-point operations ≠ computation time.
♦ Algorithms and software must take advantage of this by reducing data movement.
  ! Need latency tolerance in our algorithms
♦ Communication- and synchronization-reducing algorithms and software are critical as parallelism grows.
♦ Hardware presents a dynamically changing environment (Turbo Boost, OS jitter).
♦ Many existing algorithms can't fully exploit the features of modern architectures.

• Major challenges are ahead for extreme computing:
  # Parallelism: O(10^9)
    • Programming issues; hybrid architectures
  # Peak and HPL may be very misleading
    • Nowhere near close to peak for most applications
  # Fault tolerance
    • Today the Sequoia BG/Q node failure rate is 1.25 failures/day
  # Power
    • Need 50 Gflops/W (today at 2 Gflops/W)
• We will need completely new approaches and technologies to reach the exascale level