Performance Engineering for Legacy Codes on a Cray XC40 with Intel Xeon Phi (KNL)
Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld
Zuse Institute Berlin
2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17

North-German Supercomputing Alliance (HLRN)
Applications on the HLRN-III
• many pure MPI codes
• some MPI+OpenMP
• some vectorized

TDS (Berlin, ZIB)
• 16 KNC nodes (until July 2016)
• 80 KNL nodes (since July 2016)
• DataWarp nodes

“Konrad” (Berlin, ZIB): 1872 Xeon nodes, 44,928 cores
“Gottfried” (Hanover, LUIS): 1680 Xeon nodes, 40,320 cores, plus 64 SMP servers with 256/512 GB

connected via 10 Gbps (243 km linear distance)
Supercomputers at ZIB (peak performance, cores):
• 1984: Cray 1M, 160 MFlops, 1 core
• 1987: Cray X-MP, 471 MFlops, 2 cores
• 1994: Cray T3D (DEC Alpha), 38 GFlops, 192 cores
• 1997: Cray T3E (DEC Alpha), 486 GFlops, 256 cores
• 2002 (HLRN-I): IBM p690 (IBM Power4), 2.5 TFlops, 384 cores
• 2008 (HLRN-II): SGI ICE, XE (Intel Harpertown, Nehalem), 150 TFlops, 10,240 cores
• 2013 (HLRN-III): Cray XC30/XC40 (Intel Ivy Bridge, Haswell), 1.3 PFlops, 44,928 cores

For scale: a single Intel Xeon Phi 7290 (2016) with 72 cores (288 threads) delivers 3 TFLOPS at 245 Watt for 6662 € (6/2017), while an earlier TFlops-class system in this list filled a machine hall at roughly 200 kWatt and 10 M€.
Research Center for Many-Core HPC at ZIB
Intel Parallel Compute Center (IPCC), objective: Many-Core High-Performance Computing

Applications:
• GLAT (atomistic thermodynamics)
• VASP (electronic structure)
• BQCD (high-energy physics)
• HEOM (photo-active processes)
• BOSS (time series analysis, phase 2)
• PALM (fluid dynamics, phase 2)
• PHOENIX3D (astrophysics, phase 2)

Challenges:
• adapting data structures for enabling SIMD
• vectorising complex code structures
• transition to hybrid MPI + OpenMP
• (offload with Intel LEO vs. OpenMP 4.x)

Applications ↔ Research: modernization, OpenMP/MPI, programming models, scalability, runtime libraries
Intel Xeon Phi (Knights Landing) – Architecture

Knights Landing (KNL): Intel’s second Many Integrated Core (MIC) architecture1)
• self-booting CPU, optionally with integrated Omni-Path fabric
• 64+ cores (based on the Intel Atom Silvermont architecture, x86-64)
• 4-way hardware threading
• 512-bit SIMD vector processing (AVX-512)
• on-chip Multi-Channel DRAM (MCDRAM): 16 GiB
• DDR4 main memory: up to 384 GiB

1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro, vol. 36, 2016
KNL CPU organisation:
• up to 36 active tiles, connected via a 2D mesh interconnect
• 2 DDR memory controllers with 3 DDR4 channels each
• 8 embedded DRAM controllers (EDCs) for the MCDRAM devices
• PCIe 3 and DMI interfaces

Tile:
• 2 out-of-order cores, 2 VPUs per core (AVX-512)
• 1 MiB shared L2 cache
• Caching/Home Agent (CHA): interface to the 2D mesh interconnect, distributed tag directory (MESIF cache coherency protocol)

DDR4 memory (up to 384 GiB): 2 memory controllers, 6 DDR4 channels in total.

MCDRAM (16 GiB): 8 on-chip units of 2 GiB each.

Memory modes:
• Flat mode: DDR4 and MCDRAM share the same physical address space (16 GiB MCDRAM alongside DDR4)
• Cache mode: MCDRAM acts as a direct-mapped cache in front of DDR4
• Hybrid mode: 8 or 4 GiB of MCDRAM in cache mode, the remaining 8 or 12 GiB in flat mode
KNL in the Roofline Model (Samuel Williams, 2008)

Attainable FLOPS = min( Peak FLOPS, Memory Bandwidth × Arithmetic Intensity )

(Roofline plot: attainable FLOPS [GFLOPS] over arithmetic intensity [FLOP/Byte]; left of the ridge point a kernel is memory bound, right of it compute bound. Ceilings shown: peak FLOPS, MCDRAM bandwidth, DRAM bandwidth.)

Numbers for the HLRN nodes: Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR4, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s.

Minimal arithmetic intensity necessary for peak FLOPS:
• KNL DRAM: 22.7 FLOP/Byte
• KNL MCDRAM: 5.3 FLOP/Byte
• Haswell: 7.1 FLOP/Byte

Refining the Ceilings – FLOPS

Realistic peak FLOPS:
• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS
• the AVX frequency is only 1.2 GHz and might throttle down further under heavy load
⇒ actual peak: 2611.2 GFLOPS

Additional FLOPS ceilings:
• without instruction-level parallelism (ILP), i.e. without dual VPUs and FMA: 1.2 GHz × 68 cores × 8 SIMD = 652.8 GFLOPS (25 %)
• without ILP and without SIMD: 1.2 GHz × 68 cores = 81.6 GFLOPS (3.1 %)
Case study I: Atmospheric Research with PALM
Application areas: wind energy; turbulence effects on aircraft; urban meteorology and city planning; dust devils; cloud physics.

https://palm.muk.uni-hannover.de

The PALM Code
• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)
• Fortran 95/2003
• hybrid MPI + OpenMP code
• 140 kLOC, 79 modules and 171 source files
• highly scalable, tested for up to 43,200 cores
• runs on the HLRN supercomputing facilities at Berlin (ZIB) and Hannover (LUIS)
• modernisation target within the Intel Parallel Computing Center at ZIB

https://palm.muk.uni-hannover.de

PALM Hackathon at ZIB

(Photo) left to right: Matthias Noack, Florian Wende, Helge Knoop, Matthias Sühring, Tobias Gronemeier; behind the camera: Thomas Steinke

Parallelization Strategy
• 2D domain decomposition over x and y (MPI ranks arranged in a 2D process grid)
• outermost of 3 nested loops threaded
• different variants for different targets, e.g. cache-optimised with a decomposed inner loop
• no vectorisation:
⇒ the use of floating-point exceptions prevents automatic vectorisation
⇒ no explicit SIMD constructs
https://palm.muk.uni-hannover.de

KNL Optimisation Strategy
1. get operational on KNL: build scripts, job scripts, bug fixing
2. define some benchmarks: small, medium, large
3. measure the baseline on Xeon
Benchmark Systems and Haswell Baseline

HLRN-III production system, 1872 nodes (“Konrad”)
• 2 × Intel Xeon E5-2680v3 (Haswell), 2 × 12 cores at 2.5 GHz ⇒ 960 GFLOPS per node
• 64 GiB DDR3, 136 GiB/s

Haswell results:
  small: 4 nodes, 131.3 s
  medium: 8 nodes, 147.6 s
  large: 16 nodes, 164.6 s

HLRN KNL TDS, 80 nodes
• 1 × Intel Xeon Phi 7250 (KNL), 68 cores at 1.4 GHz ⇒ 2611.2 GFLOPS at AVX clock (advertised: “3.05 TFLOPS”)
• 96 GiB DDR4, 115.2 GiB/s
• 16 GiB MCDRAM, 490 GiB/s

Upper bounds for the per-node speed-up over the dual-Haswell node:
⇒ compute bound: 2.7× (2611.2 / 960 GFLOPS)
⇒ memory bound: 3.6× (490 / 136 GiB/s, MCDRAM)

KNL Optimisation Strategy
get define measure operational some baseline on KNL benchmarks on Xeon
- build scripts - small X - job scripts - medium - bug fixing - large
21 / 44 KNL Optimisation Strategy
get define measure MCDRAM operational some baseline and on KNL benchmarks on Xeon boot mode
- build scripts - small X - job scripts - medium - bug fixing - large
21 / 44 MCDRAM Usage and Boot Mode
Boot in flat mode:
• run in DDR4, and (if the problem fits in < 16 GiB) run in MCDRAM
⇒ upper bound for the gain from MCDRAM
Boot in cache mode:
• run again: good enough? ⇒ decide about explicit placement

Results (quad_at DDR4 vs. quad_at MCDRAM vs. quad_cache; runtimes for small/medium/large):
• 25-41 % gain from MCDRAM
• ≤ 3 % loss from cache mode ⇒ no need for explicit placement
• 2 MiB pages worked best
KNL Optimisation Strategy

MCDRAM usage and boot mode chosen ⇒ quad_cache; next: MPI processes vs. OpenMP threads

MPI processes vs. OpenMP threads
Tuning run for the optimal per-node configuration (ranks × threads, from 64 × 1 to 1 × 64, quad_cache mode):
• fastest: small 16 ranks × 4 threads, medium 8 × 8, large 32 × 2
• 16 × 4 performs well for all setups
• fastest vs. slowest configuration: 2.3×
• very low impact on the speedups between MCDRAM usage models, but absolute numbers vary largely ⇒ do memory first
• always check pinning/affinity

KNL Optimisation Strategy
MPI/OpenMP configuration settled ⇒ 16 × 4; next: start the optimisation workflow, using
• compiler optimisation reports
• Intel VTune Amplifier XE, Advisor XE

First Code Changes
Floating-point exception handling
• prevents vectorisation: -fp-model strict → -fp-model source
• removed the exception handling, added NaN/Inf tests when writing checkpoints
⇒ still no significant auto-vectorisation: suboptimal memory layout

Using the MKL FFT
• small gain for the benchmarks; might be significant for larger setups

CONTIGUOUS keyword
• from Fortran 2008: tells the compiler about contiguously allocated arrays

Current results (Intel Compiler 17.0.0): bar chart of runtimes for small/medium/large on Haswell (baseline and current) and KNL (current); speedups relative to the Haswell baseline between roughly 1.0× and 1.24×.
Cray Specialities
Hardware/Software
• Cray Aries network
• Cray MPI
• Cray Performance Tools
• Cray Compiler

Rank Reordering
• instrumented application run, optimised mapping of MPI ranks to cores and nodes
⇒ no improvement on KNL

Cray Compiler
• initial KNL support
• crashes with OpenMP ⇒ 64 MPI ranks per KNL
• current results (Intel 17.0.0 vs. Cray Compiler 8.5.3): bar chart of runtimes for small/medium/large; the Cray-compiled KNL build reaches speedups up to 1.29× over the Haswell baseline
Projected Production Run Performance

• benchmark runs: ≈ 5 min; production runs: ≈ 12 hours
⇒ serial initialisation becomes negligible
⇒ plot the speedup based on t_total − t_init

Projected speedups (without init): HSW baseline 1.00; HSW current up to ≈ 1.14; KNL current (Intel) up to 1.25; KNL current (Cray) up to 1.45
PALM – Final Remarks

Bottom line:
• getting started on KNL was easy ⇒ way easier than KNC and offloading
• good initial performance (cache mode)
• scalar parts hurt ⇒ initialisation, ...
• Cray Fortran compiler up to 16.5 % faster than Intel on KNL
• speedup over a dual-Haswell HLRN node: benchmark up to 1.29×, production (projected) up to 1.45× ⇒ even without vectorisation

What’s next:
• VTune results ⇒ increase concurrency, reduce L2 misses on KNL
• adapt the data layout for SIMD ⇒ twice the potential on KNL

https://palm.muk.uni-hannover.de

Case study II: Material Science with VASP

VASP – Vienna Ab-Initio Simulation Package
• plane-wave electronic structure code to model materials at the atomic scale from first principles: Hψ = Eψ
• widely used in materials science
• historically grown, large MPI-only Fortran code base
• different approximations (DFT, hybrid functionals, . . . ) to tackle the physics
• library dependencies: FFTW, BLAS, ScaLAPACK, ELPA

ZIB contact: Florian Wende ([email protected])

SIMD Introduction
SIMD – Single Instruction Multiple Data
• multiple words are processed at once, sharing one program counter
• SIMD registers become increasingly larger: currently 512 bit with AVX-512
• slightly increased logic on the chip, but heavily increased arithmetic throughput
• Xeon Phi KNL with vs. without SIMD: 3 TFLOPS vs. 0.37 TFLOPS

Scalar loop:

    for (i = 0; i < N; ++i)
        y[i] = log(x[i]);

With 8-wide SIMD (512-bit registers, double precision) the loop processes 8 elements per iteration, up to 8 times faster:

    for (i = 0; i < N; i += 8)
        y[i:i+8] = vlog(x[i:i+8]);

Control-flow divergence can hurt SIMD performance significantly. Scalar:

    for (i = 0; i < N; ++i)
        if (p[i]) y[i] = log(x[i]);
        else      y[i] = exp(x[i]);

Masked SIMD executes both branches for the whole vector and blends the results under a mask, so here only 4 times faster:

    for (i = 0; i < N; i += 8) {
        m = p[i:i+8];
        y[i:i+8] = vlog_mask(y[i:i+8],  m, x[i:i+8]);
        y[i:i+8] = vexp_mask(y[i:i+8], ~m, x[i:i+8]);
    }
SIMD Vectorisation in VASP

OpenMP 4.x compiler directives:
• portability across compilers
• low code invasiveness
• no SIMD intrinsics for Fortran
• combine OpenMP 4.x SIMD with “high-level vectors” (loop chunking) to increase flexibility and expressiveness
SIMD Vectorisation in VASP – Example

A non-vectorizable loop is split into parts to enable SIMD vectorization (C version of the code, not optimized).

Original loop:

    idx = 0;
    for (i = 0; i < ni; ++i) {
        while (“some condition”)
            ++idx;                 // loop iterations are not independent: idx
        d = data[idx];
        for (j = 0; j < nj; ++j)   // nj rather small: not a candidate for
            res[j] += d * (...);   // SIMD vectorization
    }

Loop chunking (e.g. CHUNKSIZE = 32): compute the idx values in advance to enable SIMD vectorization afterwards, and load the data in a separate loop, leaving it to the compiler to vectorize it or not:

    idx = 0;
    for (i = 0; i < ni; i += CHUNKSIZE) {
        ii_max = min(CHUNKSIZE, ni - i);
        for (ii = 0; ii < ii_max; ++ii) {   // idx values in advance
            while (“some condition”)
                ++idx;
            vidx[ii] = idx;
        }
        for (ii = 0; ii < ii_max; ++ii)     // separate gather loop
            vd[ii] = data[vidx[ii]];
        for (j = 0; j < nj; ++j)
            #pragma omp simd
            for (ii = 0; ii < ii_max; ++ii) // SIMD-vectorizable part
                res[j] += vd[ii] * (...);
    }
SIMD Vectorisation in VASP – Results

(Bar charts: runtime of the GW0 subroutine only and of the whole program, no-SIMD vs. SIMD, on KNL and on 2× Haswell CPUs.)
• Xeon Phi nodes: quadrant mode, all data in MCDRAM
• SIMD pays off for the GW0 hot spot; for the whole program, further optimization is needed
KNL Summary

Xeon Phi (KNL) has a low entry barrier. . .
• no offloading
• well-known CPU toolchains and workflows

. . . but getting performance is challenging
• ease of use can be misleading towards quick fixes
• code needs to be re-thought and re-written for SIMD
• the effort pays off with significant speed-ups for hot spots
⇒ Xeon benefits from the Xeon Phi optimisations as well
• overall application performance suffers from low single-thread performance
⇒ working on a few hotspots is not sufficient

With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.
Overall Conclusions

• A deep knowledge of each hardware platform is necessary to fully exploit its computing power.
• Code modernisation for KNL within the IPCCs world-wide is just one example of the effort it takes to keep the huge amount of legacy code in HPC usable.
• Within the increasingly diverse HPC hardware landscape, larger shares of HPC centres’ budgets will have to be allocated to code modernisation work in order to utilise future machines efficiently.