Vector Coding: a simple user's perspective

Rudolf Fischer, NEC Deutschland GmbH, Düsseldorf, Germany

SIMD vs. vector

[Diagram: input → pipeline → result for scalar, SIMD, and vector (SX) execution. SIMD is what many people call "vector"!]
Data Parallelism

▌ 'Vector loop', data parallel:

  real, dimension(n) :: a, b, c
  do i = 1, n
     a(i) = b(i) + c(i)
  end do
  ! equivalently, in array syntax: a = b + c

▌ 'Scalar loop', not data parallel, e.g. a linear recursion:

  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

▌ Reduction?

  Scalar form:

  do i = 1, n
     s = s + v(i) * w(i)
  end do

  Strip-mined form, vectorizable:

  do i = 1, n, VL
     do i_ = i, min( n, i+VL-1 )
        s_(i_) = s_(i_) + v(i_) * w(i_)
     end do
  end do
  s = reduction(s_)        ! (hardware!)
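A minimal, self-contained version of the two forms (a sketch: the problem size, VL = 256, and the use of the SUM intrinsic in place of the hardware reduction are choices made here, the slide leaves them open):

  program reduction_demo
    implicit none
    integer, parameter :: n = 100000, VL = 256
    real :: v(n), w(n), s_(n), s, s_scalar
    integer :: i, i_

    call random_number(v)
    call random_number(w)

    ! scalar form: a recursion on s, recognized as a sum reduction
    s_scalar = 0.0
    do i = 1, n
       s_scalar = s_scalar + v(i) * w(i)
    end do

    ! strip-mined form: each stripe of length VL is one vector operation;
    ! the final sum over s_ stands in for the hardware reduction
    s_ = 0.0
    do i = 1, n, VL
       do i_ = i, min(n, i+VL-1)
          s_(i_) = s_(i_) + v(i_) * w(i_)
       end do
    end do
    s = sum(s_)

    print *, 'scalar:', s_scalar, '   strip-mined:', s
  end program reduction_demo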

Vector Coding Paradigm

▌‘Scalar’ thinking / coding:

There is a (grid-point,particle,equation,element), what am I going to do with it?

▌‘Vector’ thinking / coding:

There is a certain action or operation, to which (grid-points,particles,equations,elements) am I going to apply it simultaneously?
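A hypothetical illustration of the two viewpoints, using an invented particle push (all names below are made up for this example): scalar thinking handles one particle completely before moving on, vector thinking applies each operation to all particles at once.

  ! 'Scalar' thinking: there is a particle, what do I do with it?
  ! (per-particle routine; vectorization across particles is hidden behind the call)
  subroutine push_one(x, v, a, dt)
    implicit none
    real, intent(inout) :: x, v
    real, intent(in)    :: a, dt
    v = v + dt * a
    x = x + dt * v
  end subroutine push_one

  ! 'Vector' thinking: there is an operation, to which particles do I apply it?
  ! (whole-population routine; both loops are data parallel and vectorize)
  subroutine push_all(n, x, v, a, dt)
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: x(n), v(n)
    real, intent(in)    :: a(n), dt
    integer :: i
    do i = 1, n
       v(i) = v(i) + dt * a(i)
    end do
    do i = 1, n
       x(i) = x(i) + dt * v(i)
    end do
  end subroutine push_all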

Identifying the data-parallel structure

▌ Very simple case:

  do j = 2, m-1
     do i = 2, n-1
        rho_new(i,j) = rho_old(i,j) + dt * something_with_fluxes(i+/-1, j+/-1)
     end do
  end do

▌ Does it tell us something?
  • Partial differential equations
  • Local theories
  • Vectorization is very "natural" (a hypothetical example is sketched below)
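For concreteness, one hypothetical way of filling in the placeholder (the central-difference flux expression and all names in this sketch are invented for the example; the slide deliberately leaves something_with_fluxes unspecified):

  subroutine update_rho(n, m, dt, dx, dy, rho_old, fx, fy, rho_new)
    ! Explicit update of a density field from face fluxes fx, fy.
    ! Every (i,j) depends only on old data, so both loops are data parallel.
    implicit none
    integer, intent(in)  :: n, m
    real,    intent(in)  :: dt, dx, dy
    real,    intent(in)  :: rho_old(n,m), fx(n,m), fy(n,m)
    real,    intent(out) :: rho_new(n,m)
    integer :: i, j
    do j = 2, m-1
       do i = 2, n-1
          rho_new(i,j) = rho_old(i,j) - dt * (  (fx(i+1,j) - fx(i-1,j)) / (2.0*dx)  &
                                              + (fy(i,j+1) - fy(i,j-1)) / (2.0*dy) )
       end do
    end do
  end subroutine update_rho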

Identifying the data-parallel subset

▌ Simple case:

  V------> do j = 1, m
  |+-----> do i = 2, n
  ||          a(i,j) = a(i-1,j) + b(i,j)
  |+------ end do
  V------- end do

▌ The compiler will vectorise along j
▌ Non-unit-stride access, suboptimal
▌ Results for a certain case:
  • Totally scalar (directives): 25.516 ms
  • Avoid outer vector loop: 21.032 ms
  • Default compilation: 0.515 ms

Identifying the data-parallel subset

▌ Example: 2D ILU or Gauß-Seidel

  Original ordering, recursive in both i and j (not vectorizable as written):

  +------> do j = 2, n
  |+-----> do i = 2, n
  ||          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
  |+------ end do
  +------- end do

  Hyperplane ordering, all points on one diagonal are mutually independent:

  +------> do idiag = 2, 2*n-1
  |V-----> do i = max( 2, idiag+1-n ), min( n, idiag-1 )
  ||          j = idiag - i + 1
  ||          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
  |V------ end do
  +------- end do

▌ Solution: hyperplane ordering
▌ Results:
  • Default: 36.2 ms
  • Hyperplane: 2.0 ms
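Put into a complete routine (a minimal sketch built from the loop above; the routine name, the square n x n grid, and the assumption that the first row and column of x already hold boundary values are choices made here, not taken from the slide):

  subroutine gs_hyperplane(n, a, b, rhs, x)
    ! One forward sweep of  x(i,j) = rhs(i,j) - a(i,j)*x(i-1,j) - b(i,j)*x(i,j-1)
    ! reordered along hyperplanes i+j = const: all points of one diagonal are
    ! independent of each other, so the inner loop vectorizes.
    implicit none
    integer, intent(in)    :: n
    real,    intent(in)    :: a(n,n), b(n,n), rhs(n,n)
    real,    intent(inout) :: x(n,n)      ! x(1,:) and x(:,1) hold boundary values
    integer :: idiag, i, j
    do idiag = 3, 2*n-1
       do i = max(2, idiag+1-n), min(n, idiag-1)
          j = idiag - i + 1
          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
       end do
    end do
  end subroutine gs_hyperplane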

Data Layout: Sparse Matrix * Vector (SMV)

▌ Normal case: compressed row storage (CRS)
▌ res = A * x

  do irow = 1, nrow
     do jcol = 1, jstart(irow+1) - jstart(irow)
        ind = jstart(irow) + jcol - 1
        res(irow) = res(irow) + a(ind) * x(index(ind))
     end do
  end do

▌ better: jagged diagonal storage (JDS) (Yousef Saad, 90s!)

  do jcol = 1, ncol
     do irow = 1, istart(jcol+1) - istart(jcol)
        ind = istart(jcol) + irow - 1
        res(irow) = res(irow) + a(ind) * x(index(ind))
     end do
  end do

With JDS the long inner loop runs over rows, so the vector length is the number of rows rather than the typically short row length of CRS.

JDS allows for quite a few additional optimizations!
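For completeness, the JDS kernel with declarations and the row permutation handled explicitly (a sketch under the usual JDS conventions, which the slide only implies: rows are sorted by decreasing length, perm maps a permuted row back to its original number, and the arrays istart/index/a hold the jagged diagonals one after another; perm, nrow, nnz, and tmp are names introduced here):

  subroutine smv_jds(nrow, ncol, nnz, istart, index, a, perm, x, res)
    ! res = A * x, with A stored in jagged diagonal storage (JDS)
    implicit none
    integer, intent(in)  :: nrow             ! number of rows
    integer, intent(in)  :: ncol             ! number of jagged diagonals (= longest row)
    integer, intent(in)  :: nnz              ! number of nonzeros
    integer, intent(in)  :: istart(ncol+1)   ! start of each jagged diagonal in a/index
    integer, intent(in)  :: index(nnz)       ! column index of each stored entry
    integer, intent(in)  :: perm(nrow)       ! permuted row -> original row
    real,    intent(in)  :: a(nnz), x(nrow)
    real,    intent(out) :: res(nrow)
    real    :: tmp(nrow)
    integer :: jcol, irow, ind

    tmp = 0.0
    do jcol = 1, ncol                               ! short loop over jagged diagonals
       do irow = 1, istart(jcol+1) - istart(jcol)   ! long, vectorizable loop over rows
          ind = istart(jcol) + irow - 1
          tmp(irow) = tmp(irow) + a(ind) * x(index(ind))
       end do
    end do

    do irow = 1, nrow                               ! undo the row permutation
       res(perm(irow)) = tmp(irow)
    end do
  end subroutine smv_jds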

Conclusion

▌ Vectorization is normal
  • Mother nature tends to provide the necessary structures
  • All contemporary architectures need it for performance anyway
▌ There is a difference between SSE, AVX ... you name it, and a real vector architecture
  • This can lead to different approaches in some cases
  • In many cases they are just identical
▌ The techniques have been known for ages anyway
  • And are by no means intellectually challenging
▌ NEC expects that codes will be adapted gradually, simply because the architecture promises a lot of efficiency
  • And coding for Aurora is easier than CUDA anyway

CPU Package

Memory bandwidth, measured:
• Single core: ~330 GB/s
• Best result: ~980 GB/s

Past vector systems, theoretical peak:
• Cray T90-32: 360 GB/s
• NEC SX-4/32: 512 GB/s

World’s first implementation of 6 HBM memories

HPCG

HPL and STREAM are extreme benchmarks; HPCG sits somewhere in between and better represents real applications.

[Chart: HPCG comparison, Aurora vs. SKL (SKL = dual-socket Intel Skylake 6148): performance, HPCG per node, and HPCG per price shown as ratios of roughly 3x in Aurora's favour; the characteristics row contrasts memory-bandwidth-bound and performance-bound behaviour.]

"green" HPCG

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
2 | NSCC / Guangzhou | Tianhe-2 (MilkyWay-2) – TH-IVB-FEP Cluster, Intel Xeon 12C 2.2GHz, TH Express 2, Intel Xeon Phi 31S1P 57-core | 3120000 | 33.863 | 2 | 0.5800 | 1.10% | 17808.00 | 32.57
3 | DOE/NNSA/LANL/SNL | Trinity – Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 979072 | 14.137 | 7 | 0.5460 | 1.80% | 3843.58 | 142.06
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
5 | National Supercomputing Center in Wuxi | Sunway TaihuLight – Sunway MPP, SW26010 260C 1.45GHz, Sunway | 10649600 | 93.015 | 1 | 0.4810 | 0.40% | 15371.00 | 31.29
6 | Joint Center for Advanced High Performance Computing | Oakforest-PACS – PRIMERGY CX600 M1, Intel Xeon Phi Processor 7250 68C 1.4GHz, Intel Omni-Path Architecture | 557056 | 13.555 | 9 | 0.3850 | 1.50% | 2718.70 | 141.61
7 | DOE/SC/LBNL/NERSC | Cori – XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries | 632400 | 13.832 | 8 | 0.3550 | 1.30% | 3939.00 | 90.12
8 | DOE/NNSA/LLNL | Sequoia – IBM BlueGene/Q, PowerPC A2 1.6GHz 16-core, 5D Torus | 1572864 | 17.173 | 6 | 0.3300 | 1.60% | 7890.00 | 41.83
9 | DOE/SC/Oak Ridge National Laboratory | Titan – Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560640 | 17.590 | 5 | 0.3220 | 1.20% | 8209.00 | 39.23
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
12 | DOE/SC/Argonne National Laboratory | Mira – IBM BlueGene/Q, PowerPC A2 1.6GHz 16-core, 5D Torus | 786432 | 8.587 | 11 | 0.1670 | 1.70% | 3945.00 | 42.33
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28
14 | HLRS/University of Stuttgart | Hazel Hen – Cray XC40, Intel Xeon E5-2680-V3, Cray Aries | 185088 | 5.640 | 19 | 0.1380 | 1.90% | 3615.00 | 38.17
15 | BSC-CNS | MareNostrum4 – ThinkSystem SD530, Intel Xeon Platinum 8160 24C 2.1GHz, Intel OmniPath | 241108 | 6.227 | 16 | 0.1220 | 1.10% | 1632.00 | 74.75

"green" HPCG

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
37 | Information Technology Center, The University of Tokyo | Oakleaf-FX – PRIMEHPC FX10, SPARC64 IXfx 32C 1.848GHz, Tofu interconnect | 76800 | 1.043 | 157 | 0.0570 | 5.00% | 1176.80 | 48.44
35 | Max-Planck-Gesellschaft MPI/IPP | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband FDR | 65320 | 1.283 | 100 | 0.0610 | 4.20% | 1260.00 | 48.41
17 | Japan Aerospace eXploration Agency | SORA-MA – PRIMEHPC FX100, SPARC64 XIfx 32C 1.975GHz, Tofu interconnect 2 | 103680 | 3.157 | 38 | 0.1100 | 3.20% | 1652.40 | 66.57
24 | EPSRC/University of Edinburgh | ARCHER – Cray XC30, Intel Xeon E5 v2 12C 2.700GHz, Aries interconnect | 118080 | 1.643 | 79 | 0.0810 | 3.20% | 3306.24 | 24.50
25 | DOE/SC/LBNL/NERSC | Edison – Cray XC30, Intel Xeon E5-2695v2 12C 2.4GHz, Aries interconnect | 133824 | 1.655 | 78 | 0.0790 | 3.10% | 3747.07 | 21.08
39 | CEA/TGCC-GENCI | Curie thin nodes – Bullx B510, Intel Xeon E5-2680 8C 2.700GHz, Infiniband QDR | 77184 | 1.359 | 93 | 0.0510 | 3.10% | 2132.00 | 23.92
26 | National Institute for Fusion Science | Plasma Simulator – Fujitsu PRIMEHPC FX100, SPARC64 XIfx, Tofu Interconnect 2 | 82944 | 2.376 | 59 | 0.0730 | 2.80% | 1244.16 | 58.67
21 | Information Technology Center, Nagoya University | ITC Nagoya – PRIMEHPC FX100, SPARC64 XIfx, Tofu interconnect 2 | 92160 | 2.910 | 43 | 0.0870 | 2.70% | 1382.40 | 62.93
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28

"green" HPCG

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
55 | National Institute for Environmental Studies | Research Computation Facility for GOSAT-2 (RCF2) – SGI Rackable C1104-GP1, Intel Xeon E5-2650 v4 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 PCIe | 16320 | 0.770 | 320 | 0.0230 | 2.10% | 78.64 | 292.47
52 | Information Technology Center, The University of Tokyo | Reedbush-L – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, EDR Infiniband, NVIDIA Tesla P100 NVLink | 16640 | 0.806 | 292 | 0.0230 | 1.60% | 79.24 | 290.26
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
56 | Information Technology Center, The University of Tokyo | Reedbush-H – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, Infiniband FDR, NVIDIA Tesla P100 NVLink | 17760 | 0.802 | 296 | 0.0220 | 1.70% | 93.57 | 235.12
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
3 | DOE/NNSA/LANL/SNL | Trinity – Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 979072 | 14.137 | 7 | 0.5460 | 1.80% | 3843.58 | 142.06
6 | Joint Center for Advanced High Performance Computing | Oakforest-PACS – PRIMERGY CX600 M1, Intel Xeon Phi Processor 7250 68C 1.4GHz, Intel Omni-Path Architecture | 557056 | 13.555 | 9 | 0.3850 | 1.50% | 2718.70 | 141.61
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
7 | DOE/SC/LBNL/NERSC | Cori – XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries | 632400 | 13.832 | 8 | 0.3550 | 1.30% | 3939.00 | 90.12

"green" HPCG

NEC SX-Aurora TSUBASA:
• 86 GFlops per VE
• HPL: 250 Watt → 340 GFlops/kW
• HPCG: 150 Watt → 550 GFlops/kW

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
55 | National Institute for Environmental Studies | Research Computation Facility for GOSAT-2 (RCF2) – SGI Rackable C1104-GP1, Intel Xeon E5-2650 v4 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 PCIe | 16320 | 0.770 | 320 | 0.0230 | 2.10% | 78.64 | 292.47
52 | Information Technology Center, The University of Tokyo | Reedbush-L – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, EDR Infiniband, NVIDIA Tesla P100 NVLink | 16640 | 0.806 | 292 | 0.0230 | 1.60% | 79.24 | 290.26
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
56 | Information Technology Center, The University of Tokyo | Reedbush-H – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, Infiniband FDR, NVIDIA Tesla P100 NVLink | 17760 | 0.802 | 296 | 0.0220 | 1.70% | 93.57 | 235.12
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
15 | BSC-CNS | MareNostrum4 – ThinkSystem SD530, Intel Xeon Platinum 8160 24C 2.1GHz, Intel OmniPath | 241108 | 6.227 | 16 | 0.1220 | 1.10% | 1632.00 | 74.75
34 | NASA Advanced Supercomputing / NASA Ames Research Center | Electra – HPE SGI 8600 and SGI ICE-X, Intel Xeon Gold 6148 20C 2.4GHz, E5-2680v4 14C 2.4GHz, Dual Rail Infiniband EDR and FDR | 78336 | 3.329 | 33 | 0.0650 | 1.90% | 979.62 | 66.35

Decreasing vector length when scaling up

Workaround in the past: Stripes

Fortran representation!

▌ Trying to describe the hardware functionality in Fortran
  • Second example …

  ! Starting point

  Subroutine sub_2( a, b, n, m )
  ! Workaround not possible without
  ! changing calling routine(s)
     Integer, Parameter :: VL = 256
     real a(n,m), b(2:n-1,2:m-1)

     do j = 2, m-1
        do i = 2, n-1
           a(i,j) = b(i,j) + 1.0
        end do
     end do

     return
  end

Fortran representation!

▌ Trying to describe the hardware functionality in Fortran
  • Second example …

  ! First step: normal strip-mining of the inner loop -> vectorization

  Subroutine sub_2( a, b, n, m )
  ! Workaround not possible without
  ! changing calling routine(s)
     Integer, Parameter :: VL = 256
     real a(n,m), b(2:n-1,2:m-1)

     do j = 2, m-1
        do is = 2, n-1, VL
           do i = is, min( is+VL-1, n-1 )   ! this inner loop is one assembler (vector) instruction
              a(i,j) = b(i,j) + 1.0
           end do
        end do
     end do

     return
  end

Fortran representation!

▌ Trying to describe the hardware functionality in Fortran
  • 2D vectorization, thanks to Uwe Küster!

  ! Second step: "2d-blocking"

  Subroutine sub_2( a, b, n, m )
  ! Workaround not possible without
  ! changing calling routine(s)
     Integer, Parameter :: VL = 16
     real a(n,m), b(2:n-1,2:m-1)

     Do j_ = 2, m-1, VL
        do i_ = 2, n-1, VL
           do j = j_, min( j_+VL-1, m-1 )
              do i = i_, min( i_+VL-1, n-1 )   ! the two innermost loops together are one assembler (vector) instruction
                 a(i,j) = b(i,j) + 1.0
              end do
           end do
        end do
     end do

     return
  end
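A small hypothetical driver to exercise any of the three versions of sub_2 above (the array sizes and the system_clock timing are choices made for this sketch):

  program drive_sub2
    implicit none
    integer, parameter :: n = 2048, m = 2048
    real, allocatable :: a(:,:), b(:,:)
    integer :: c0, c1, crate

    allocate( a(n,m), b(2:n-1,2:m-1) )
    a = 0.0
    b = 1.0
    call system_clock(c0, crate)
    call sub_2(a, b, n, m)     ! link against whichever of the three versions is of interest
    call system_clock(c1)
    print *, 'time [s]:', real(c1-c0) / real(crate), '   a(2,2) =', a(2,2)
  end program drive_sub2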
