Project Aurora 2017 Vector Inheritance
Vector Coding
A simple user's perspective

Rudolf Fischer
NEC Deutschland GmbH, Düsseldorf, Germany


SIMD vs. vector

[Figure: input, pipeline and result shown for a scalar unit, a SIMD unit (which people call "vector"!), and a real vector unit (SX).]


Data Parallelism

▌ 'Vector loop', data parallel:

    do i = 1, n
       a(i) = b(i) + c(i)
    end do

  or, equivalently, in array syntax:

    real, dimension(n) :: a, b, c
    ...
    a = b + c

▌ 'Scalar loop', not data parallel, e.g. a linear recursion:

    do i = 2, n
       a(i) = a(i-1) + b(i)
    end do

▌ Reduction?

    do i = 1, n
       s = s + v(i) * w(i)
    end do

  is strip-mined over the vector length VL into partial sums:

    do i = 1, n, VL
       do i_ = i, min(n, i+VL-1)
          s_(i_) = s_(i_) + v(i_) * w(i_)
       end do
    end do
    s = reduction(s_)      ! (hardware!)


Vector Coding Paradigm

▌ 'Scalar' thinking / coding:
  There is a (grid-point, particle, equation, element): what am I going to do with it?
▌ 'Vector' thinking / coding:
  There is a certain action or operation: to which (grid-points, particles, equations, elements) am I going to apply it simultaneously?


Identifying the data-parallel structure

▌ Very simple case:

    do j = 2, m-1
       do i = 2, n-1
          rho_new(i,j) = rho_old(i,j) + dt * something_with_fluxes(i+/-1,j+/-1)
       end do
    end do

▌ Does it tell us something?
  Partial differential equations
  Local theories
  Vectorization is very "natural"


Identifying the data-parallel subset

▌ Simple case (compiler listing; V marks the vectorized loop):

    V------> do j = 1, m
    |+-----> do i = 2, n
    ||          a(i,j) = a(i-1,j) + b(i,j)
    |+-----  end do
    V------  end do

▌ The compiler will vectorise along j
▌ Non-unit-stride access, suboptimal
▌ Results for a certain case:
    Totally scalar (directives): 25.516 ms
    Avoid outer vector loop:     21.032 ms
    Default compilation:          0.515 ms


Identifying the data-parallel subset

▌ Example: 2d ILU or Gauß-Seidel. The recursion in both i and j forces the compiler to keep the inner loop scalar (S):

    do j = 2, n
       do i = 2, n
          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
       end do
    end do

▌ Solution: hyperplane ordering. Along a diagonal i+j = const the updates are independent of each other, so the inner loop vectorizes (V):

    do idiag = 2, 2*n-1
       do i = max( 2, idiag+1-n ), min( n, idiag-1 )
          j = idiag - i + 1
          x(i,j) = rhs(i,j) - ( a(i,j) * x(i-1,j) + b(i,j) * x(i,j-1) )
       end do
    end do

▌ Results:
    Default:    36.2 ms
    Hyperplane:  2.0 ms


Data Layout: Sparse Matrix * Vector (SMV)

▌ Normal case: compressed row storage (CRS), res = A * x:

    do irow = 1, nrow
       do jcol = 1, jstart(irow+1) - jstart(irow)
          ind = jstart(irow) + jcol - 1
          res(irow) = res(irow) + a(ind) * x(index(ind))
       end do
    end do

▌ Better: jagged diagonal storage (JDS) (Yousef Saad, 90s!):

    do jcol = 1, ncol
       do irow = 1, istart(jcol+1) - istart(jcol)
          ind = istart(jcol) + irow - 1
          res(irow) = res(irow) + a(ind) * x(index(ind))
       end do
    end do

▌ JDS allows for quite some additional optimizations!
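As a complement to the JDS loop above, here is a minimal sketch of how the jagged-diagonal arrays could be built from the CRS arrays jstart, index and a: rows are sorted by decreasing nonzero count, and the jcol-th nonzero of every sufficiently long row is packed into one contiguous diagonal. This is not from the slides; the routine name crs_to_jds and the helpers nnz_row, perm, istart, index_jds, a_jds are illustrative assumptions.

    ! Sketch only: build JDS (jagged diagonal storage) from CRS arrays.
    subroutine crs_to_jds(nrow, nnz, jstart, index, a, perm, istart, index_jds, a_jds, ncol)
      implicit none
      integer, intent(in)  :: nrow, nnz
      integer, intent(in)  :: jstart(nrow+1), index(nnz)
      real(8), intent(in)  :: a(nnz)
      integer, intent(out) :: perm(nrow)       ! permuted row order (longest row first)
      integer, intent(out) :: istart(*)        ! start of each jagged diagonal, size >= ncol+1
      integer, intent(out) :: index_jds(nnz)
      real(8), intent(out) :: a_jds(nnz)
      integer, intent(out) :: ncol             ! number of jagged diagonals

      integer :: nnz_row(nrow)
      integer :: irow, jcol, k, ind, pos, itmp

      ! count nonzeros per row
      do irow = 1, nrow
         nnz_row(irow) = jstart(irow+1) - jstart(irow)
         perm(irow)    = irow
      end do

      ! sort rows by decreasing nonzero count (simple selection sort, fine for a sketch)
      do k = 1, nrow-1
         pos = k
         do irow = k+1, nrow
            if (nnz_row(irow) > nnz_row(pos)) pos = irow
         end do
         itmp = nnz_row(k); nnz_row(k) = nnz_row(pos); nnz_row(pos) = itmp
         itmp = perm(k);    perm(k)    = perm(pos);    perm(pos)    = itmp
      end do
      ncol = nnz_row(1)    ! longest row defines the number of jagged diagonals

      ! the jcol-th nonzero of every sufficiently long (permuted) row forms one diagonal
      pos = 1
      do jcol = 1, ncol
         istart(jcol) = pos
         do k = 1, nrow
            if (nnz_row(k) < jcol) exit       ! rows are sorted by length
            irow           = perm(k)
            ind            = jstart(irow) + jcol - 1
            a_jds(pos)     = a(ind)
            index_jds(pos) = index(ind)
            pos            = pos + 1
         end do
      end do
      istart(ncol+1) = pos
    end subroutine crs_to_jds

With this layout, the JDS loop on the slide (reading a_jds and index_jds instead of the CRS arrays) accumulates res in the permuted row order, so the result still has to be scattered back, e.g. res_orig(perm(k)) = res(k).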
Conclusion

▌ Vectorization is normal
  Mother nature tends to provide the necessary structures
  All contemporary architectures need it for performance anyway
▌ There is a difference between SSE, AVX ... you name it, and a real vector architecture
  This can lead to different approaches in some cases
  In many cases they are just identical
▌ The techniques have been known for ages anyway
  And they are by no means intellectually challenging
▌ NEC expects that codes will be adapted gradually, simply because the architecture promises a lot of efficiency
  And coding for Aurora is easier than CUDA anyway


CPU Package Memory Bandwidth

Measured:
  • Single core: ~330 GB/s
  • Best result: ~980 GB/s
Past vectors, theoretical peak:
  • Cray T90-32: 360 GB/s
  • NEC SX-4/32: 512 GB/s

World's first implementation of 6 HBM memories.


HPCG

▌ HPL and STREAM are extreme benchmarks
▌ HPCG is somewhere "in between" and represents real applications better

[Figure: bar charts comparing Aurora with SKL (dual-socket Intel Skylake 6148): HPCG performance per node and per price, each roughly a 3x ratio in favour of Aurora; characteristics: memory-bandwidth bound vs. performance bound.]


"green" HPCG

HPCG Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 Rank | HPCG (Pflop/s) | Fraction of Peak | Power Consumption (kW) | HPCG/Power (Gflops/kW)
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
2 | NSCC / Guangzhou | Tianhe-2 (MilkyWay-2) – TH-IVB-FEP Cluster, Intel Xeon 12C 2.2GHz, TH Express 2, Intel Xeon Phi 31S1P 57-core | 3120000 | 33.863 | 2 | 0.5800 | 1.10% | 17808.00 | 32.57
3 | DOE/NNSA/LANL/SNL | Trinity – Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 979072 | 14.137 | 7 | 0.5460 | 1.80% | 3843.58 | 142.06
4 | Swiss National Supercomputing Centre (CSCS) | Piz Daint – Cray XC50, Intel Xeon E5-2690v3 12C 2.6GHz, Cray Aries, NVIDIA Tesla P100 16GB | 361760 | 19.590 | 3 | 0.4860 | 1.90% | 2271.99 | 213.91
5 | National Supercomputing Center in Wuxi | Sunway TaihuLight – Sunway MPP, SW26010 260C 1.45GHz, Sunway | 10649600 | 93.015 | 1 | 0.4810 | 0.40% | 15371.00 | 31.29
6 | Joint Center for Advanced High Performance Computing | Oakforest-PACS – PRIMERGY CX600 M1, Intel Xeon Phi Processor 7250 68C 1.4GHz, Intel Omni-Path Architecture | 557056 | 13.555 | 9 | 0.3850 | 1.50% | 2718.70 | 141.61
7 | DOE/SC/LBNL/NERSC | Cori – XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries | 632400 | 13.832 | 8 | 0.3550 | 1.30% | 3939.00 | 90.12
8 | DOE/NNSA/LLNL | Sequoia – IBM BlueGene/Q, PowerPC A2 1.6 GHz 16-core, 5D Torus | 1572864 | 17.173 | 6 | 0.3300 | 1.60% | 7890.00 | 41.83
9 | DOE/SC/Oak Ridge Nat Lab | Titan – Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560640 | 17.590 | 5 | 0.3220 | 1.20% | 8209.00 | 39.23
10 | GSIC Center, Tokyo Institute of Technology | TSUBAME3.0 – SGI ICE XA (HPE SGI 8600), IP139-SXM2, Intel Xeon E5-2680 v4 15120C 2.9GHz, Intel Omni-Path Architecture, NVIDIA TESLA P100 SXM2 with NVLink | 136080 | 8.125 | 13 | 0.1890 | 1.60% | 792.08 | 238.61
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
12 | DOE/SC/Argonne National Laboratory | Mira – IBM BlueGene/Q, PowerPC A2 1.6 GHz 16-core, 5D Torus | 786432 | 8.587 | 11 | 0.1670 | 1.70% | 3945.00 | 42.33
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28
14 | HLRS/University of Stuttgart | Hazel Hen – Cray XC40, Intel Xeon E5-2680-V3, Cray Aries | 185088 | 5.640 | 19 | 0.1380 | 1.90% | 3615.00 | 38.17
15 | BSC-CNS | MareNostrum4 – ThinkSystem SD530, Intel Xeon Platinum 8160 24C 2.1GHz, Intel OmniPath | 241108 | 6.227 | 16 | 0.1220 | 1.10% | 1632.00 | 74.75


"green" HPCG (systems sorted by HPCG fraction of peak)

HPCG Rank | Site | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 Rank | HPCG (Pflop/s) | Fraction of Peak | Power Consumption (kW) | HPCG/Power (Gflops/kW)
61 | Cyberscience Center / Tohoku University | NEC SX – NEC SX-ACE 4C+IXS, NEC SX-ACE 4C+IXS, NEC IXS | 2048 | 0.123 | - | 0.0150 | 11.40% | 140.80 | 106.53
36 | CEIST / JAMSTEC | Earth Simulator – NEC SX-ACE, NEC SX-ACE | 8192 | 0.487 | - | 0.0580 | 11.00% | 563.20 | 102.98
62 | Cybermedia Center, Osaka University | Osaka U ACE – SX-ACE, NEC SX-ACE 2048C 1.0GHz, IXS | 2048 | 0.123 | - | 0.0140 | 10.80% | 140.80 | 99.43
51 | Cyberscience Center, Tohoku University | Cyberscience Center, Tohoku University – SX-ACE, NEC SX-ACE 4096C 1.0GHz, IXS | 4096 | 0.246 | - | 0.0280 | 10.70% | 281.60 | 99.43
83 | Christian-Albrechts-Universitaet zu Kiel | NEC SX-ACE – NEC SX-ACE, NEC SX-ACE 1024C 1.0GHz, NEC IXS | 1024 | 0.062 | - | 0.0068 | 10.50% | 70.40 | 96.59
74 | Center for Global Environmental Research, National Institute for Environmental Studies | SX-ACE – NEC SX-ACE, NEC SX-ACE 1536C 1.0GHz, NEC IXS | 1536 | 0.092 | - | 0.0100 | 10.20% | 105.60 | 94.70
1 | RIKEN Advanced Institute for Computational Science | K computer – SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.510 | 10 | 0.6030 | 5.30% | 12659.89 | 47.63
37 | Information Technology Center, The University of Tokyo | Oakleaf-FX – PRIMEHPC FX10, SPARC64 IXfx 32C 1.848GHz, Tofu interconnect | 76800 | 1.043 | 157 | 0.0570 | 5.00% | 1176.80 | 48.44
35 | Max-Planck-Gesellschaft MPI/IPP | iDataPlex – iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband FDR | 65320 | 1.283 | 100 | 0.0610 | 4.20% | 1260.00 | 48.41
17 | Japan Aerospace eXploration Agency | SORA-MA – PRIMEHPC FX100, SPARC64 XIfx 32C 1.975GHz, Tofu interconnect 2 | 103680 | 3.157 | 38 | 0.1100 | 3.20% | 1652.40 | 66.57
24 | EPSRC/University of Edinburgh | ARCHER – Cray XC30, Intel Xeon E5 v2 12C 2.700GHz, Aries interconnect | 118080 | 1.643 | 79 | 0.0810 | 3.20% | 3306.24 | 24.50
25 | DOE/SC/LBNL/NERSC | Edison – Cray XC30, Intel Xeon E5-2695v2 12C 2.4GHz, Aries interconnect | 133824 | 1.655 | 78 | 0.0790 | 3.10% | 3747.07 | 21.08
39 | CEA/TGCC-GENCI | Curie thin nodes – Bullx B510, Intel Xeon E5-2680 8C 2.700GHz, Infiniband QDR | 77184 | 1.359 | 93 | 0.0510 | 3.10% | 2132.00 | 23.92
26 | National Institute for Fusion Science | Plasma Simulator – Fujitsu PRIMEHPC FX100, SPARC64 XIfx, Tofu Interconnect 2 | 82944 | 2.376 | 59 | 0.0730 | 2.80% | 1244.16 | 58.67
21 | Information Technology Center, Nagoya University | ITC Nagoya – PRIMEHPC FX100, SPARC64 XIfx, Tofu interconnect 2 | 92160 | 2.910 | 43 | 0.0870 | 2.70% | 1382.40 | 62.93
11 | NASA / Mountain View | Pleiades – SGI ICE X, Intel Xeon E5-2670, E5-2680V2, E5-2680V3, E5-2680V4, Infiniband FDR | 243008 | 5.952 | 17 | 0.1750 | 2.50% | 4407.00 | 39.71
13 | TOTAL | Pangea – SGI ICE X, Intel Xeon E5-2670 12C 2.6GHz, Infiniband FDR | 218592 | 5.283 | 21 | 0.1630 | 2.40% | 4150.00 | 39.28
55 | National Institute for Environmental Studies | Research Computation Facility for GOSAT-2 (RCF2) – SGI Rackable C1104-GP1, Intel Xeon E5-2650 v4 2880C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 PCIe | 16320 | 0.770 | 320 | 0.0230 | 2.10% | 78.64 | 292.47
52 | Information Technology Center, The University of Tokyo | Reedbush-L – SGI Rackable C1102-GP8, Intel Xeon E5-2695 v4 2304C 2.1GHz, EDR Infiniband, NVIDIA Tesla P100 NVLink | 16640 | 0.806 | 292 | 0.0230 | 1.60% | 79.24 | 290.26