Performance Engineering for Legacy Codes on a XC40 with Intel Xeon Phi (KNL)

Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld

Zuse Institute Berlin

2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17

North-German Supercomputing Alliance (HLRN)

Applications on the HLRN-III:
• many pure MPI codes
• some MPI+OpenMP
• some vectorised

HLRN-III TDS (Berlin, ZIB):
• 16 KNC nodes (until July 2016)
• 80 KNL nodes (since July 2016)
• DataWarp nodes

“Konrad” (Berlin, ZIB): 1872 Xeon nodes, 44,928 cores

10 Gbps link between the sites (243 km linear distance)

“Gottfried” (Hannover, LUIS): 1680 Xeon nodes, 40,320 cores, plus 64 SMP servers with 256/512 GB

Supercomputers at ZIB

[Chart: peak performance and core counts of the supercomputers at ZIB over time]
• 1984: Cray 1M, 160 MFlops
• 1987: Cray X-MP, 471 MFlops
• 1994: Cray T3D (DEC Alpha), 38 GFlops
• 1997: Cray T3E (DEC Alpha), 486 GFlops
• 2002 (HLRN-I): IBM p690 (Power4), 2.5 TFlops
• 2008 (HLRN-II): SGI ICE/XE (Intel Harpertown, Nehalem), 150 TFlops, 10,240 cores
• 2013 (HLRN-III): Cray XC30/XC40 (Intel Ivy Bridge, Haswell), 1.3 PFlops, 44,928 cores

Then vs. now: a 1990s Cray system drew roughly 200 kW and cost about 10 M€, whereas a single Intel Xeon Phi 7290 (2016) delivers 72 cores (288 threads) and roughly 3 TFLOPS at 245 W for 6,662 € (6/2017).

Research Center for Many-Core HPC at ZIB

Intel Parallel Compute Center (IPCC). Objective: Many-Core High-Performance Computing

Applications:
• GLAT (atomistic thermodynamics)
• VASP (electronic structure)
• BQCD (high-energy physics)
• HEOM (photo-active processes)
• BOSS (time series analysis, phase 2)
• PALM (fluid dynamics, phase 2)
• PHOENIX3D (astrophysics, phase 2)

Challenges:
• adapting data structures to enable SIMD
• vectorising complex code structures
• transition to hybrid MPI + OpenMP
• (offload with Intel LEO vs. OpenMP 4.x)

[Diagram: applications and research: modernization, programming models, OpenMP/MPI, runtime libraries, scalability]

Intel Xeon Phi (Knights Landing) – Architecture

Knights Landing (KNL): Intel’s 2nd-generation Many Integrated Core (MIC) architecture 1)
• self-booting CPU, optionally with integrated Omni-Path fabric
• 64+ cores (based on the Intel Atom Silvermont architecture, x86-64)
• 4-way hardware threading
• 512-bit SIMD vector processing (AVX-512)
• on-chip multi-channel DRAM (MCDRAM): 16 GiB
• DDR4 main memory: up to 384 GiB

1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro vol. 6, April 2016

Intel Xeon Phi (Knights Landing) – Architecture

[Diagram: KNL CPU with up to 36 active tiles connected via a 2D mesh interconnect; 8 MCDRAM devices attached through EDCs; 2 DDR memory controllers with 3 DDR4 channels each; PCIe 3 and DMI]

Tile
• 2 out-of-order cores
• 2 VPUs per core (AVX-512)
• 1 MiB shared L2 cache
• Caching/Home Agent (CHA): interface to the 2D mesh interconnect, distributed tag directory (MESIF cache-coherence protocol)

DDR4 main memory (up to 384 GiB)
• 2 memory controllers
• 6 DDR4 channels

MCDRAM (16 GiB)
• 8 on-chip devices, 2 GiB each

Memory modes
• Flat mode: DDR4 and MCDRAM share the same physical address space
• Cache mode: MCDRAM acts as a direct-mapped cache in front of DDR4
• Hybrid mode: 8 or 4 GiB of MCDRAM in cache mode, the remaining 8 or 12 GiB in flat mode

KNL in the Roofline Model (Samuel Williams, 2008)

[Roofline plot: attainable FLOPS (1 to 4096 GFLOPS, log scale) over arithmetic intensity (1 to 256 FLOP/Byte), with ceilings for peak FLOPS, MCDRAM bandwidth and DRAM bandwidth]

Attainable FLOPS = min( peak FLOPS, memory bandwidth × arithmetic intensity )

⇐ memory bound | compute bound ⇒

Minimal arithmetic intensity necessary to reach peak FLOPS (HLRN nodes):
• KNL, DRAM: 22.7 FLOP/Byte
• KNL, MCDRAM: 5.3 FLOP/Byte
• Haswell: 7.1 FLOP/Byte

Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR4, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s
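As a quick cross-check of the ridge points quoted above (worked out here, not on the original slide), the minimal arithmetic intensity is simply peak FLOPS divided by memory bandwidth:

\[
I_{\min} = \frac{P_{\text{peak}}}{B}, \qquad
\frac{2611.2}{115.2} \approx 22.7, \qquad
\frac{2611.2}{490} \approx 5.3, \qquad
\frac{2 \times 480}{2 \times 68} \approx 7.1 \quad \left[\tfrac{\text{FLOP}}{\text{Byte}}\right]
\]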

Refining the Ceilings - FLOPS

Realistic peak FLOPS
• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 cores × 8 SIMD lanes × 2 VPUs × 2 (FMA) = 3046.4 GFLOPS
• the AVX frequency is only 1.2 GHz and may throttle down further under heavy load
⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings
• without instruction-level parallelism (ILP), i.e. without dual VPUs and FMA:
  1.2 GHz × 68 cores × 8 SIMD lanes = 652.8 GFLOPS (25 %)
• without ILP and without SIMD:
  1.2 GHz × 68 cores = 81.6 GFLOPS (3.1 %)

KNL in the Roofline Model

[Roofline plot as above with the additional compute ceilings; upper compute bounds in GFLOPS: peak 2611.2, w/o ILP 652.8, w/o ILP+SIMD 81.6; MCDRAM and DRAM bandwidth ceilings]
(Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s)
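To make the ceilings concrete: only code that keeps both VPUs busy with fused multiply-adds on full 512-bit vectors can approach the top ceiling. A minimal sketch of such a kernel (my example, not from the talk):

#include <stddef.h>

/* One fused multiply-add per element; with AVX-512 the compiler can emit
   vfmadd instructions for both VPUs (top ceiling). Compiled without SIMD
   and FMA, the same loop is limited to the lowest "w/o ILP+SIMD" ceiling. */
void daxpy(double *restrict y, const double *restrict x, double a, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Note that this particular kernel is itself memory bound (about 1/12 FLOP/Byte), so it sits far left of the ridge point; it only illustrates the instruction mix the compute ceilings assume.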

Case study I: Atmospheric Research with PALM

Application areas: wind energy, turbulence effects on aircraft, urban meteorology and city planning, dust devils, cloud physics

https://palm.muk.uni-hannover.de

The PALM Code

• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)
• Fortran 95/2003
• hybrid MPI + OpenMP code
• 140 kLOC, 79 modules, 171 source files
• highly scalable, tested on up to 43,200 cores
• runs on the HLRN supercomputing facilities in Berlin (ZIB) and Hannover (LUIS)
• modernisation target within the Intel Parallel Computing Center at ZIB

https://palm.muk.uni-hannover.de
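PALM itself is Fortran, but the hybrid MPI + OpenMP structure mentioned above boils down to the following pattern (an illustrative C sketch of my own, not PALM code):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* request a threading level that allows OpenMP threads inside each rank */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* the per-node balance of ranks and threads is tuned later in the
           talk, e.g. 16 ranks x 4 threads per KNL node */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}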

PALM Hackathon at ZIB

[Photo] Left to right: Matthias Noack, Florian Wende, Helge Knoop, Matthias Sühring, Tobias Gronemeier; behind the camera: Thomas Steinke

Parallelization Strategy

• 2D domain decomposition over x and y (figure: 3 × 3 process grid, ranks 0 to 8)
• the outer of 3 nested loops is threaded with OpenMP
• different loop variants for different targets, e.g. cache-optimised with a decomposed inner loop
• no vectorisation:
  ⇒ the use of floating-point exceptions prevents automatic vectorisation
  ⇒ no explicit SIMD constructs

https://palm.muk.uni-hannover.de

KNL Optimisation Strategy

Step 1: get operational on KNL
• build scripts
• job scripts
• bug fixing

Step 2: define some benchmarks
• small
• medium
• large

Step 3: measure a baseline on Xeon

Benchmark Systems and Haswell Baseline

HLRN-III production system, 1872 nodes (Konrad)
• 2 × Intel Xeon E5-2680v3 (Haswell)
• 2 × 12 cores at 2.5 GHz ⇒ 960 GFLOPS per node
• 64 GiB DDR4, 136 GiB/s

Haswell baseline results:
  benchmark   nodes   runtime
  small           4   131.3 s
  medium          8   147.6 s
  large          16   164.6 s

HLRN KNL TDS, 80 nodes
• 1 × Intel Xeon Phi 7250 (KNL) per node
• 68 cores at 1.4 GHz ⇒ 2611.2 GFLOPS at AVX clock (advertised as "3.05 TFLOPS")
• 96 GiB DDR4, 115.2 GiB/s
• 16 GiB MCDRAM, 490 GiB/s

Upper bounds for the per-node speed-up over the dual-Haswell node:
⇒ compute bound: 2.7×
⇒ memory bound: 3.6× (MCDRAM)
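Those bounds are just the ratios of the peak numbers, worked out explicitly:

\[
\frac{2611.2\ \text{GFLOPS}}{960\ \text{GFLOPS}} \approx 2.7, \qquad
\frac{490\ \text{GiB/s}}{136\ \text{GiB/s}} \approx 3.6
\]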

KNL Optimisation Strategy

Status: operational on KNL, benchmarks defined, Xeon baseline measured.
Next step: MCDRAM usage and boot mode.

MCDRAM Usage and Boot Mode

Decision procedure:
• boot in flat mode
  - run in DDR4 (possible as long as the problem is < 16 GiB)
  - run in MCDRAM ⇒ upper bound for the gain from MCDRAM
• boot in cache mode and run again
  - good enough? ⇒ decide about explicit placement

[Chart: MCDRAM usage comparison; runtime in seconds of the small/medium/large benchmarks for quad_flat with DDR4, quad_flat with MCDRAM, and quad_cache; MCDRAM gives speedups of 1.22 to 1.41 over DDR4]

Results:
• 25 to 41 % gain from MCDRAM
• ≤ 3 % loss from cache mode ⇒ no need for explicit placement
• 2 MiB pages worked best
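Here explicit placement was not needed, but had it been, individual allocations can be directed to MCDRAM in flat mode, e.g. via the memkind library's hbwmalloc interface (a minimal sketch, assuming the hbw_malloc/hbw_free API from the memkind documentation; not part of PALM):

#include <hbwmalloc.h>   /* memkind high-bandwidth memory allocator, link with -lmemkind */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1u << 24;                          /* 16 Mi doubles = 128 MiB */

    if (hbw_check_available() != 0)               /* 0 means MCDRAM is usable */
        fprintf(stderr, "flat-mode MCDRAM not available\n");

    double *field = hbw_malloc(n * sizeof *field);   /* placed in MCDRAM */
    if (field == NULL)
        return EXIT_FAILURE;

    /* ... bandwidth-critical work on field ... */

    hbw_free(field);
    return EXIT_SUCCESS;
}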

KNL Optimisation Strategy

Status: MCDRAM and boot mode settled ⇒ quad_cache.
Next step: MPI processes vs. OpenMP threads.

MPI Processes vs. OpenMP Threads

Tuning run for the optimal per-node configuration (quad_cache mode):

[Table: per-node configurations from 64 ranks × 1 thread to 1 rank × 64 threads; for each benchmark the fastest configuration, configurations within 2 % and within 15 % of the best, and worse ones are marked]

Conclusions:
• fastest configuration:
  ⇒ small: 16 ranks × 4 threads
  ⇒ medium: 8 ranks × 8 threads
  ⇒ large: 32 ranks × 2 threads
• 16 × 4 performs well for all setups
• fastest vs. slowest configuration: 2.3×
• very low impact on the speedups between MCDRAM usage models, but absolute numbers vary largely ⇒ do memory first
• always check pinning/affinity
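Pinning is easy to get wrong, so it is worth verifying at runtime where ranks and threads actually end up; a small diagnostic of the kind one might use (an illustrative C sketch of my own; sched_getcpu() is a glibc extension):

#define _GNU_SOURCE
#include <sched.h>    /* sched_getcpu() */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* report which hardware thread each OpenMP thread of each rank runs on */
        printf("rank %d thread %d on cpu %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}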

KNL Optimisation Strategy

Status: per-node configuration fixed ⇒ 16 ranks × 4 threads.
Next step: start the optimisation workflow, guided by
• compiler optimisation reports
• Intel VTune Amplifier XE and Advisor XE

First Code Changes

Floating-point exception handling
• prevents vectorisation
• -fp-model strict → -fp-model source
• remove the exception handling; add NaN/Inf tests when writing checkpoints instead
⇒ still no significant auto-vectorisation
⇒ suboptimal memory layout
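The replacement for trapped floating-point exceptions is an explicit validity test at checkpoint time; schematically (a hypothetical C sketch with an invented helper name, the real code is Fortran):

#include <math.h>
#include <stddef.h>

/* Scan a field for NaN/Inf right before it goes into a checkpoint.
   The compute loops themselves no longer need FP exception traps,
   so the compiler is free to vectorise them. */
int field_is_finite(const double *field, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (!isfinite(field[i]))
            return 0;   /* corrupt value found */
    return 1;
}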

Using MKL FFT
• small gain for the benchmarks ⇒ might be significant for larger setups

CONTIGUOUS keyword
• Fortran 2008 feature
• tells the compiler about contiguously allocated arrays

[Chart: current results with the Intel Compiler 17.0.0; runtime in seconds of small/medium/large for HSW baseline, HSW current and KNL current; speedups over the HSW baseline of up to about 1.13 on HSW and 1.24 on KNL]

Cray Specialities

Hardware/Software
• Cray Aries network
• Cray MPI
• Cray Performance Tools
• Cray Compiler

Rank reordering
• instrumented application run
• optimised mapping of MPI ranks to cores and nodes
⇒ no improvement on KNL

Cray Compiler
• initial KNL support
• crashes with OpenMP ⇒ run with 64 MPI ranks per KNL instead

[Chart: current results, Intel 17.0.0 vs. Cray Compiler 8.5.3; runtime in seconds of small/medium/large for HSW baseline (Intel), HSW current (Intel), KNL current (Intel) and KNL current (Cray); speedups over the HSW baseline reach 1.24 with Intel and 1.29 with the Cray compiler on KNL]

Projected Production Run Performance

• benchmark runs: ≈ 5 min; production runs: ≈ 12 hours
⇒ serial initialisation becomes negligible
⇒ plot speedup based on t_total − t_init

[Chart: projected speedups without initialisation for small/medium/large; HSW baseline 1.00, HSW current up to about 1.14, KNL current (Intel) up to 1.25, KNL current (Cray) up to 1.45]
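In other words, the projected speedup compares the runtimes with the serial initialisation subtracted (my notation, not from the slides):

\[
S = \frac{t^{\mathrm{HSW}}_{\mathrm{total}} - t^{\mathrm{HSW}}_{\mathrm{init}}}
         {t^{\mathrm{KNL}}_{\mathrm{total}} - t^{\mathrm{KNL}}_{\mathrm{init}}}
\]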

PALM - Final Remarks

Bottom line
• getting started on KNL was easy ⇒ way easier than KNC and offloading
• good initial performance (cache mode)
• scalar parts hurt ⇒ initialisation, ...
• Cray Fortran compiler up to 16.5 % faster than Intel on KNL
• speedup over a dual-HSW HLRN node:
  - benchmark: up to 1.29×
  - production (projected): up to 1.45×
  ⇒ even without vectorisation

What's next
• VTune results: increase concurrency, reduce L2 misses on KNL
• adapt the data layout for SIMD ⇒ twice the potential on KNL

https://palm.muk.uni-hannover.de

Case study II: Material Science with VASP

VASP – Vienna Ab-Initio Simulation Package

• plane-wave electronic structure code to model atomic-scale materials from first principles: Hψ = Eψ
• widely used in materials science
• historically grown, large MPI-only Fortran code base

• different approximations (DFT, hybrid functionals, ...) to tackle the physics
• library dependencies: FFTW, BLAS, ScaLAPACK, ELPA

Collaboration of:

ZIB contact: Florian Wende ([email protected])

SIMD Introduction

SIMD – Single Instruction Multiple Data
• multiple words are processed at once, sharing one program counter
• SIMD registers keep getting wider: currently 512 bit with AVX-512
• slightly more logic on the chip, but greatly increased arithmetic throughput
• Xeon Phi KNL with and without SIMD: 3 TFLOPS vs. 0.37 TFLOPS

Scalar execution vs. SIMD execution (8 doubles per 512-bit register):

  // scalar: one element per iteration
  for (i = 0; i < N; ++i)
      y[i] = log(x[i]);

  // SIMD: 8 elements per iteration, executed as y[i:i+8] = vlog(x[i:i+8])
  for (i = 0; i < N; i += 8) {
      y[i + 0] = log(x[i + 0]);
      ...
      y[i + 7] = log(x[i + 7]);
  }
  ⇒ up to 8 times faster

Control-flow divergence can hurt SIMD performance significantly:

  // scalar with a branch
  for (i = 0; i < N; ++i)
      if (p[i]) y[i] = log(x[i]);
      else      y[i] = exp(x[i]);

  // SIMD with masking: both branches are executed, results are blended
  for (i = 0; i < N; i += 8) {
      m = p[i:i+8];
      y[i:i+8] = vlog_mask(y[i:i+8],  m, x[i:i+8]);
      y[i:i+8] = vexp_mask(y[i:i+8], ~m, x[i:i+8]);
  }
  ⇒ only about 4 times faster
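In portable code the masking is usually left to the compiler via OpenMP SIMD directives rather than written by hand; a self-contained sketch of my own (not from the slides), to be compiled with OpenMP SIMD support enabled:

#include <math.h>
#include <stddef.h>

/* The conditional is turned into masked vector operations when the loop
   is vectorised; with AVX-512, 8 doubles are processed per iteration. */
void log_or_exp(double *restrict y, const double *restrict x,
                const int *restrict p, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = p[i] ? log(x[i]) : exp(x[i]);
}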

SIMD Vectorisation in VASP

OpenMP 4.x compiler directives
• portability across compilers
• low code invasiveness
• no SIMD intrinsics available for Fortran
• combine OpenMP 4.x SIMD with “high-level vectors” (loop chunking) to increase flexibility and expressiveness

SIMD Vectorisation in VASP - Example

A non-vectorisable loop is split into parts to enable SIMD vectorisation. C version of the code (not optimised):

  idx = 0;
  for (i = 0; i < ni; ++i) {
      while (“some condition”)
          ++idx;
      d = data[idx];
      for (j = 0; j < nj; ++j)
          res[j] += d * (...);
  }

Why it does not vectorise as it stands:
• nj is rather small: the inner loop is not a candidate for SIMD vectorisation
• the outer loop iterations are not independent: they share idx

Loop chunking (e.g. CHUNKSIZE = 32): compute the idx values in advance to enable SIMD vectorisation afterwards, load the data in a separate loop (leaving it to the compiler to vectorise it or not), then vectorise the now-independent arithmetic:

  idx = 0;
  for (i = 0; i < ni; i += CHUNKSIZE) {
      ii_max = min(CHUNKSIZE, ni - i);

      for (ii = 0; ii < ii_max; ++ii) {     // idx values computed in advance
          while (“some condition”)
              ++idx;
          vidx[ii] = idx;
      }

      for (ii = 0; ii < ii_max; ++ii)       // separate gather loop
          vd[ii] = data[vidx[ii]];

      for (j = 0; j < nj; ++j)              // SIMD across the chunk
          #pragma omp simd
          for (ii = 0; ii < ii_max; ++ii)
              res[j] += vd[ii] * (...);
  }
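For completeness, the same pattern as a compilable C sketch; the while-condition and the "(...)" factor are elided on the slides, so skip[] and weight[] below are placeholders of my own invention, as are the function and helper names:

#include <stddef.h>

#define CHUNKSIZE 32

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void chunked_update(double *restrict res, size_t nj,
                    const double *restrict data,
                    const size_t *restrict skip,    /* stands in for "some condition" */
                    const double *restrict weight,  /* stands in for "(...)", size ni*nj */
                    size_t ni)
{
    size_t idx = 0;
    size_t vidx[CHUNKSIZE];
    double vd[CHUNKSIZE];

    for (size_t i = 0; i < ni; i += CHUNKSIZE) {
        size_t ii_max = min_sz(CHUNKSIZE, ni - i);

        for (size_t ii = 0; ii < ii_max; ++ii) {   /* sequential index part */
            idx += skip[i + ii];
            vidx[ii] = idx;
        }

        for (size_t ii = 0; ii < ii_max; ++ii)     /* gather */
            vd[ii] = data[vidx[ii]];

        for (size_t j = 0; j < nj; ++j) {          /* vectorised arithmetic */
            double sum = 0.0;
            #pragma omp simd reduction(+:sum)
            for (size_t ii = 0; ii < ii_max; ++ii)
                sum += vd[ii] * weight[(i + ii) * nj + j];
            res[j] += sum;
        }
    }
}

Unlike the slide version, the innermost loop accumulates into a scalar with a reduction clause, since all SIMD lanes would otherwise update the same res[j].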

SIMD Vectorisation in VASP - Results

[Chart: time in seconds for the GW0 subroutine only and for the whole program, comparing no-SIMD vs. SIMD on KNL and on 2× Haswell CPUs; Xeon Phi nodes in quadrant mode with all data in MCDRAM. The vectorised GW0 hot spot improves clearly, but for the whole program further optimisation is needed.]

KNL Summary

Xeon Phi (KNL) has a low entry barrier...
• no offloading
• well-known CPU toolchains and workflows

...but getting performance is challenging
• ease of use can mislead towards quick fixes
• code needs to be re-thought and re-written for SIMD
• the effort pays off with significant speed-ups for hot spots
  ⇒ Xeon benefits from the Xeon Phi optimisations as well
• overall application performance suffers from low single-thread performance
  ⇒ working on a few hot spots is not sufficient

With AVX-512 in both Xeon and Xeon Phi, SIMD can no longer be neglected.

Overall Conclusions

A deep knowledge of each hardware platform is necessary to fully exploit its computing power.

Code modernisation for KNL within the IPCCs world-wide is just one example of the effort it takes to keep the huge amount of legacy code in HPC usable.

In an increasingly diverse HPC hardware landscape, larger shares of HPC centres’ budgets will have to be allocated to code modernisation in order to utilise future machines efficiently.