Vector Coding: a simple user's perspective

Rudolf Fischer, NEC Deutschland GmbH, Düsseldorf, Germany

SIMD vs. vector

[Diagram: input → pipeline → result for scalar, SIMD, and vector (SX) execution. SIMD is what many people call "vector"!]
Data Parallelism

▌ 'Vector loop', data parallel:

  real, dimension(n) :: a, b, c
  do i = 1, n
     a(i) = b(i) + c(i)
  end do
  ! equivalently, in array syntax: a = b + c

▌ 'Scalar loop', not data parallel, e.g. a linear recursion:

  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

▌ Reduction?

  Scalar form:

  do i = 1, n
     s = s + v(i) * w(i)
  end do

  Strip-mined form, vectorizable:

  do i = 1, n, VL
     do i_ = i, min( n, i+VL-1 )
        s_(i_) = s_(i_) + v(i_) * w(i_)
     end do
  end do
  s = reduction(s_)        ! (hardware!)
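A minimal, self-contained version of the two forms (a sketch: the problem size, VL = 256, and the use of the SUM intrinsic in place of the hardware reduction are choices made here, the slide leaves them open):

  program reduction_demo
    implicit none
    integer, parameter :: n = 100000, VL = 256
    real :: v(n), w(n), s_(n), s, s_scalar
    integer :: i, i_

    call random_number(v)
    call random_number(w)

    ! scalar form: a recursion on s, recognized as a sum reduction
    s_scalar = 0.0
    do i = 1, n
       s_scalar = s_scalar + v(i) * w(i)
    end do

    ! strip-mined form: each stripe of length VL is one vector operation;
    ! the final sum over s_ stands in for the hardware reduction
    s_ = 0.0
    do i = 1, n, VL
       do i_ = i, min(n, i+VL-1)
          s_(i_) = s_(i_) + v(i_) * w(i_)
       end do
    end do
    s = sum(s_)

    print *, 'scalar:', s_scalar, '   strip-mined:', s
  end program reduction_demo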

Vector Coding Paradigm

▌‘Scalar’ thinking / coding:

There is a (grid-point,particle,equation,element), what am I going to do with it?

▌‘Vector’ thinking / coding:

There is a certain action or operation, to which (grid-points,particles,equations,elements) am I going to apply it simultaneously?
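A hypothetical illustration of the two viewpoints, using an invented particle push (all names below are made up for this example): scalar thinking handles one particle completely before moving on, vector thinking applies each operation to all particles at once.

  ! 'Scalar' thinking: there is a particle, what do I do with it?
  ! (per-particle routine; vectorization across particles is hidden behind the call)
  subroutine push_one(x, v, a, dt)
    implicit none
    real, intent(inout) :: x, v
    real, intent(in)    :: a, dt
    v = v + dt * a
    x = x + dt * v
  end subroutine push_one

  ! 'Vector' thinking: there is an operation, to which particles do I apply it?
  ! (whole-population routine; both loops are data parallel and vectorize)
  subroutine push_all(n, x, v, a, dt)
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: x(n), v(n)
    real, intent(in)    :: a(n), dt
    integer :: i
    do i = 1, n
       v(i) = v(i) + dt * a(i)
    end do
    do i = 1, n
       x(i) = x(i) + dt * v(i)
    end do
  end subroutine push_all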

Identifying the data-parallel structure

▌ Very simple case:

  do j = 2, m-1
     do i = 2, n-1
        rho_new(i,j) = rho_old(i,j) + dt * something_with_fluxes(i+/-1, j+/-1)
     end do
  end do

▌ Does it tell us something?
  • Partial differential equations
  • Local theories
  • Vectorization is very "natural" (a hypothetical example is sketched below)
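For concreteness, one hypothetical way of filling in the placeholder (the central-difference flux expression and all names in this sketch are invented for the example; the slide deliberately leaves something_with_fluxes unspecified):

  subroutine update_rho(n, m, dt, dx, dy, rho_old, fx, fy, rho_new)
    ! Explicit update of a density field from face fluxes fx, fy.
    ! Every (i,j) depends only on old data, so both loops are data parallel.
    implicit none
    integer, intent(in)  :: n, m
    real,    intent(in)  :: dt, dx, dy
    real,    intent(in)  :: rho_old(n,m), fx(n,m), fy(n,m)
    real,    intent(out) :: rho_new(n,m)
    integer :: i, j
    do j = 2, m-1
       do i = 2, n-1
          rho_new(i,j) = rho_old(i,j) - dt * (  (fx(i+1,j) - fx(i-1,j)) / (2.0*dx)  &
                                              + (fy(i,j+1) - fy(i,j-1)) / (2.0*dy) )
       end do
    end do
  end subroutine update_rho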

Identifying the data-parallel subset

▌ Simple case:

  V------> do j = 1, m
  |+-----> do i = 2, n
  ||          a(i,j) = a(i-1,j) + b(i,j)
  |+------ end do
  V------- end do

▌ The compiler will vectorise along j
▌ Non-unit-stride access, suboptimal
▌ Results for a certain case:
  • Totally scalar (directives): 25.516 ms
  • Avoid outer vector loop: 21.032 ms
  • Default compilation: 0.515 ms

Identifying the data-parallel subset

▌ Example: 2D ILU or Gauß-Seidel

  Original ordering, recursive in both i and j (not vectorizable as written):

  +------> do j = 2, n
  |+-----> do i = 2, n
  ||          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
  |+------ end do
  +------- end do

  Hyperplane ordering, all points on one diagonal are mutually independent:

  +------> do idiag = 2, 2*n-1
  |V-----> do i = max( 2, idiag+1-n ), min( n, idiag-1 )
  ||          j = idiag - i + 1
  ||          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
  |V------ end do
  +------- end do

▌ Solution: hyperplane ordering
▌ Results:
  • Default: 36.2 ms
  • Hyperplane: 2.0 ms
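Put into a complete routine (a minimal sketch built from the loop above; the routine name, the square n x n grid, and the assumption that the first row and column of x already hold boundary values are choices made here, not taken from the slide):

  subroutine gs_hyperplane(n, a, b, rhs, x)
    ! One forward sweep of  x(i,j) = rhs(i,j) - a(i,j)*x(i-1,j) - b(i,j)*x(i,j-1)
    ! reordered along hyperplanes i+j = const: all points of one diagonal are
    ! independent of each other, so the inner loop vectorizes.
    implicit none
    integer, intent(in)    :: n
    real,    intent(in)    :: a(n,n), b(n,n), rhs(n,n)
    real,    intent(inout) :: x(n,n)      ! x(1,:) and x(:,1) hold boundary values
    integer :: idiag, i, j
    do idiag = 3, 2*n-1
       do i = max(2, idiag+1-n), min(n, idiag-1)
          j = idiag - i + 1
          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
       end do
    end do
  end subroutine gs_hyperplane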

Data Layout: Sparse Matrix * Vector (SMV)

▌ Normal case: compressed row storage (CRS)
▌ res = A * x

  do irow = 1, nrow
     do jcol = 1, jstart(irow+1) - jstart(irow)
        ind = jstart(irow) + jcol - 1
        res(irow) = res(irow) + a(ind) * x(index(ind))
     end do
  end do

▌ better: jagged diagonal storage (JDS) (Yousef Saad, 90s!)

  do jcol = 1, ncol
     do irow = 1, istart(jcol+1) - istart(jcol)
        ind = istart(jcol) + irow - 1
        res(irow) = res(irow) + a(ind) * x(index(ind))
     end do
  end do

With JDS the long inner loop runs over rows, so the vector length is the number of rows rather than the typically short row length of CRS.

JDS allows for quite a few additional optimizations!
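For completeness, the JDS kernel with declarations and the row permutation handled explicitly (a sketch under the usual JDS conventions, which the slide only implies: rows are sorted by decreasing length, perm maps a permuted row back to its original number, and the arrays istart/index/a hold the jagged diagonals one after another; perm, nrow, nnz, and tmp are names introduced here):

  subroutine smv_jds(nrow, ncol, nnz, istart, index, a, perm, x, res)
    ! res = A * x, with A stored in jagged diagonal storage (JDS)
    implicit none
    integer, intent(in)  :: nrow             ! number of rows
    integer, intent(in)  :: ncol             ! number of jagged diagonals (= longest row)
    integer, intent(in)  :: nnz              ! number of nonzeros
    integer, intent(in)  :: istart(ncol+1)   ! start of each jagged diagonal in a/index
    integer, intent(in)  :: index(nnz)       ! column index of each stored entry
    integer, intent(in)  :: perm(nrow)       ! permuted row -> original row
    real,    intent(in)  :: a(nnz), x(nrow)
    real,    intent(out) :: res(nrow)
    real    :: tmp(nrow)
    integer :: jcol, irow, ind

    tmp = 0.0
    do jcol = 1, ncol                               ! short loop over jagged diagonals
       do irow = 1, istart(jcol+1) - istart(jcol)   ! long, vectorizable loop over rows
          ind = istart(jcol) + irow - 1
          tmp(irow) = tmp(irow) + a(ind) * x(index(ind))
       end do
    end do

    do irow = 1, nrow                               ! undo the row permutation
       res(perm(irow)) = tmp(irow)
    end do
  end subroutine smv_jds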

Conclusion

▌ Vectorization is normal
  • Mother nature tends to provide the necessary structures
  • All contemporary architectures need it for performance anyway
▌ There is a difference between SSE, AVX ... you name it, and a real vector architecture
  • This can lead to different approaches in some cases
  • In many cases they are just identical
▌ The techniques have been known for ages anyway
  • And are by no means intellectually challenging
▌ NEC expects that codes will be adapted gradually, simply because the architecture promises a lot of efficiency
  • And coding for Aurora is easier than CUDA anyway

CPU Package

Memory bandwidth, measured:
• Single core: ~330 GB/s
• Best result: ~980 GB/s

Past vector systems, theoretical peak:
• Cray T90-32: 360 GB/s
• NEC SX-4/32: 512 GB/s

World’s first implementation of 6 HBM memories

HPCG

HPL and STREAM are extreme benchmarks; HPCG sits somewhere in between and better represents real applications.

[Chart: HPCG comparison, Aurora vs. SKL (SKL = dual-socket Intel Skylake 6148): performance, HPCG per node, and HPCG per price shown as ratios of roughly 3x in Aurora's favour; the characteristics row contrasts memory-bandwidth-bound and performance-bound behaviour.]

"green" HPCG

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
2 | NSCC / Guangzhou | Tianhe-2 (MilkyWay-2) – TH-IVB-FEP Cluster, Intel Xeon 12C 2.2GHz, TH Express 2, Intel Xeon Phi 31S1P 57-core | 3120000 | 33.863 | 2 | 0.5800 | 1.10% | 17808.00 | 32.57
3 | DOE/NNSA/LANL/SNL | Trinity – Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 979072 | 14.137 | 7 | 0.5460 | 1.80% | 3843.58 | 142.06
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
5 | National Supercomputing Center in Wuxi | Sunway TaihuLight – Sunway MPP, SW26010 260C 1.45GHz, Sunway | 10649600 | 93.015 | 1 | 0.4810 | 0.40% | 15371.00 | 31.29
6 | Joint Center for Advanced High Performance Computing | Oakforest-PACS – PRIMERGY CX600 M1, Intel Xeon Phi Processor 7250 68C 1.4GHz, Intel Omni-Path Architecture | 557056 | 13.555 | 9 | 0.3850 | 1.50% | 2718.70 | 141.61
7 | DOE/SC/LBNL/NERSC | Cori – XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries | 632400 | 13.832 | 8 | 0.3550 | 1.30% | 3939.00 | 90.12
8 | DOE/NNSA/LLNL | Sequoia – IBM BlueGene/Q, PowerPC A2 1.6GHz 16-core, 5D Torus | 1572864 | 17.173 | 6 | 0.3300 | 1.60% | 7890.00 | 41.83
9 | DOE/SC/Oak Ridge National Laboratory | Titan – Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560640 | 17.590 | 5 | 0.3220 | 1.20% | 8209.00 | 39.23
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
12 | DOE/SC/Argonne National Laboratory | Mira – IBM BlueGene/Q, PowerPC A2 1.6GHz 16-core, 5D Torus | 786432 | 8.587 | 11 | 0.1670 | 1.70% | 3945.00 | 42.33
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28
14 | HLRS/University of Stuttgart | Hazel Hen – Cray XC40, Intel Xeon E5-2680-V3, Cray Aries | 185088 | 5.640 | 19 | 0.1380 | 1.90% | 3615.00 | 38.17
15 | BSC-CNS | MareNostrum4 – ThinkSystem SD530, Intel Xeon Platinum 8160 24C 2.1GHz, Intel OmniPath | 241108 | 6.227 | 16 | 0.1220 | 1.10% | 1632.00 | 74.75

"green" HPCG

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
37 | Information Technology Center, The University of Tokyo | Oakleaf-FX – PRIMEHPC FX10, SPARC64 IXfx 32C 1.848GHz, Tofu interconnect | 76800 | 1.043 | 157 | 0.0570 | 5.00% | 1176.80 | 48.44
35 | Max-Planck-Gesellschaft MPI/IPP | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband FDR | 65320 | 1.283 | 100 | 0.0610 | 4.20% | 1260.00 | 48.41
17 | Japan Aerospace eXploration Agency | SORA-MA – PRIMEHPC FX100, SPARC64 XIfx 32C 1.975GHz, Tofu interconnect 2 | 103680 | 3.157 | 38 | 0.1100 | 3.20% | 1652.40 | 66.57
24 | EPSRC/University of Edinburgh | ARCHER – Cray XC30, Intel Xeon E5 v2 12C 2.700GHz, Aries interconnect | 118080 | 1.643 | 79 | 0.0810 | 3.20% | 3306.24 | 24.50
25 | DOE/SC/LBNL/NERSC | Edison – Cray XC30, Intel Xeon E5-2695v2 12C 2.4GHz, Aries interconnect | 133824 | 1.655 | 78 | 0.0790 | 3.10% | 3747.07 | 21.08
39 | CEA/TGCC-GENCI | Curie thin nodes – Bullx B510, Intel Xeon E5-2680 8C 2.700GHz, Infiniband QDR | 77184 | 1.359 | 93 | 0.0510 | 3.10% | 2132.00 | 23.92
26 | National Institute for Fusion Science | Plasma Simulator – Fujitsu PRIMEHPC FX100, SPARC64 XIfx, Tofu Interconnect 2 | 82944 | 2.376 | 59 | 0.0730 | 2.80% | 1244.16 | 58.67
21 | Information Technology Center, Nagoya University | ITC Nagoya – PRIMEHPC FX100, SPARC64 XIfx, Tofu interconnect 2 | 92160 | 2.910 | 43 | 0.0870 | 2.70% | 1382.40 | 62.93
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28

"green" HPCG

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
55 | National Institute for Environmental Studies | Research Computation Facility for GOSAT-2 (RCF2) – SGI Rackable C1104-GP1, Intel Xeon E5-2650 v4 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 PCIe | 16320 | 0.770 | 320 | 0.0230 | 2.10% | 78.64 | 292.47
52 | Information Technology Center, The University of Tokyo | Reedbush-L – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, EDR Infiniband, NVIDIA Tesla P100 NVLink | 16640 | 0.806 | 292 | 0.0230 | 1.60% | 79.24 | 290.26
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
56 | Information Technology Center, The University of Tokyo | Reedbush-H – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, Infiniband FDR, NVIDIA Tesla P100 NVLink | 17760 | 0.802 | 296 | 0.0220 | 1.70% | 93.57 | 235.12
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
3 | DOE/NNSA/LANL/SNL | Trinity – Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 979072 | 14.137 | 7 | 0.5460 | 1.80% | 3843.58 | 142.06
6 | Joint Center for Advanced High Performance Computing | Oakforest-PACS – PRIMERGY CX600 M1, Intel Xeon Phi Processor 7250 68C 1.4GHz, Intel Omni-Path Architecture | 557056 | 13.555 | 9 | 0.3850 | 1.50% | 2718.70 | 141.61
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
7 | DOE/SC/LBNL/NERSC | Cori – XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries | 632400 | 13.832 | 8 | 0.3550 | 1.30% | 3939.00 | 90.12

"green" HPCG

NEC SX-Aurora TSUBASA:
• 86 GFlops per VE
• HPL: 250 Watt → 340 GFlops/kW
• HPCG: 150 Watt → 550 GFlops/kW

Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 rank | HPCG (Pflop/s) | Fraction of peak | Power (kW) | HPCG/Power (Gflops/kW)
55 | National Institute for Environmental Studies | Research Computation Facility for GOSAT-2 (RCF2) – SGI Rackable C1104-GP1, Intel Xeon E5-2650 v4 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 PCIe | 16320 | 0.770 | 320 | 0.0230 | 2.10% | 78.64 | 292.47
52 | Information Technology Center, The University of Tokyo | Reedbush-L – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, EDR Infiniband, NVIDIA Tesla P100 NVLink | 16640 | 0.806 | 292 | 0.0230 | 1.60% | 79.24 | 290.26
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 2.9GHz, Intel Omni-Path Architecture, NVIDIA Tesla P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
56 | Information Technology Center, The University of Tokyo | Reedbush-H – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2.1GHz, Infiniband FDR, NVIDIA Tesla P100 NVLink | 17760 | 0.802 | 296 | 0.0220 | 1.70% | 93.57 | 235.12
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | – | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE | 8192 | 0.487 | – | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | – | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | SX-ACE – NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | – | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | – | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | – | 0.0100 | 10.20% | 105.60 | 94.70
15 | BSC-CNS | MareNostrum4 – ThinkSystem SD530, Intel Xeon Platinum 8160 24C 2.1GHz, Intel OmniPath | 241108 | 6.227 | 16 | 0.1220 | 1.10% | 1632.00 | 74.75
34 | NASA Advanced Supercomputing / NASA Ames Research Center | Electra – HPE SGI 8600 and SGI ICE-X, Intel Xeon Gold 6148 20C 2.4GHz, E5-2680v4 14C 2.4GHz, Dual Rail Infiniband EDR and FDR | 78336 | 3.329 | 33 | 0.0650 | 1.90% | 979.62 | 66.35

Decreasing vector length when scaling up

Workaround in the past: Stripes

Fortran representation!

▌ Trying to describe the hardware functionality in Fortran
  • Second example …

  ! Starting point

  Subroutine sub_2( a, b, n, m )
  ! Workaround not possible without
  ! changing calling routine(s)
     Integer, Parameter :: VL = 256
     real a(n,m), b(2:n-1,2:m-1)

     do j = 2, m-1
        do i = 2, n-1
           a(i,j) = b(i,j) + 1.0
        end do
     end do

     return
  end

Fortran representation!

▌ Trying to describe the hardware functionality in Fortran
  • Second example …

  ! First step: normal strip-mining of the inner loop -> vectorization

  Subroutine sub_2( a, b, n, m )
  ! Workaround not possible without
  ! changing calling routine(s)
     Integer, Parameter :: VL = 256
     real a(n,m), b(2:n-1,2:m-1)

     do j = 2, m-1
        do is = 2, n-1, VL
           do i = is, min( is+VL-1, n-1 )   ! this inner loop is one assembler (vector) instruction
              a(i,j) = b(i,j) + 1.0
           end do
        end do
     end do

     return
  end

Fortran representation!

▌ Trying to describe the hardware functionality in Fortran
  • 2D vectorization, thanks to Uwe Küster!

  ! Second step: "2d-blocking"

  Subroutine sub_2( a, b, n, m )
  ! Workaround not possible without
  ! changing calling routine(s)
     Integer, Parameter :: VL = 16
     real a(n,m), b(2:n-1,2:m-1)

     Do j_ = 2, m-1, VL
        do i_ = 2, n-1, VL
           do j = j_, min( j_+VL-1, m-1 )
              do i = i_, min( i_+VL-1, n-1 )   ! the two innermost loops together are one assembler (vector) instruction
                 a(i,j) = b(i,j) + 1.0
              end do
           end do
        end do
     end do

     return
  end
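A small hypothetical driver to exercise any of the three versions of sub_2 above (the array sizes and the system_clock timing are choices made for this sketch):

  program drive_sub2
    implicit none
    integer, parameter :: n = 2048, m = 2048
    real, allocatable :: a(:,:), b(:,:)
    integer :: c0, c1, crate

    allocate( a(n,m), b(2:n-1,2:m-1) )
    a = 0.0
    b = 1.0
    call system_clock(c0, crate)
    call sub_2(a, b, n, m)     ! link against whichever of the three versions is of interest
    call system_clock(c1)
    print *, 'time [s]:', real(c1-c0) / real(crate), '   a(2,2) =', a(2,2)
  end program drive_sub2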
