Performance Engineering on Cray XC40 with Xeon Phi
Total Page:16
File Type:pdf, Size:1020Kb
Performance Engineering for Legacy Codes on a Cray XC40 with Intel Xeon Phi (KNL) Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld Zuse Institute Berlin 2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17 1 / 44 North-German Supercomputing Alliance (HLRN) Applications on the HLRN-III TDS (Berlin, ZIB) - many pure MPI codes - 16 KNC nodes (until July 2016) - 80 KNL nodes (since July 2016) - some MPI+OpenMP - data warp nodes - some vectorized “Konrad” (Berlin, ZIB) - 1872 Xeon nodes - 44928 cores 10 Gbps (243 km linear distance) “Gottfried” (Hanover, LUIS) - 1680 Xeon nodes - 40320 cores + 64 SMP servers, 256/512 GB 2 / 44 44.928 Supercomputer at ZIB 10.240 Intel Ivy Bridge, Haswell Intel Harpertown, Nehalem 384 IBM Power4 256 192 DEC Alpha DEC Alpha [cores] 2 2013 (HLRN-III) 1 Cray XC30/XC40 1,3 PFlops 2008 (HLRN-II) 2002 (HLRN-I) SGI ICE, XE IBM p690 1997 150 TFlops [peak performance] Cray T3E 2,5 TFlops 1994 1987 486 GFlops Cray T3D 1984 Cray X-MP 38 GFlops Cray 1M 471 MFlops 160 MFlops 3 / 44 44.928 Supercomputer at ZIB 10.240 Intel Ivy Bridge, Haswell Intel Harpertown, Nehalem 384 IBM Power4 256 192 DEC Alpha DEC Alpha [cores] 2 2013 (HLRN-III) 1 Cray XC30/XC40 1,3 PFlops 2008 (HLRN-II) 2002 (HLRN-I) SGI ICE, XE IBM p690 1997 150 TFlops [peak performance] Cray T3E 2,5 TFlops 1994 1987 486 GFlops 200 kWatt Cray T3D 1984 Cray X-MP 10 M€ 38 GFlops Cray 1M 471 MFlops Xeon Phi KNL, 2016 160 MFlops Intel Xeon Phi 7290 72 cores (288 threads) 3 TFLOPS Y 245 Watt 6662 € (6/2017) 4 / 44 Research Center for Many-Core HPC at ZIB Intel Parallel Compute Center (IPCC) Applications • GLAT (atomistic thermodynamics) Challenges • VASP (electronic structure) • Adapting data structures for enabling SIMD • BQCD (high-energy physics) OBJECTIVE • Vectorising complex code structures • HEOM (photo-active processes) • Transition to hybrid MPI + OpenMP • BOSS (time series analysis, phase 2) Many-Core High- • (Offload with Intel LEO vs. OpenMP 4.x) • PALM (fluid dynamics, phase 2) Performance Computing • PHOENIX3D (astrophysics, phase 2) APPLICATIONS RESEARCH Modernization, Programming OpenMP/MPI, Models, Scalability Runtime Libraries 5 / 44 Intel Xeon Phi (Knights Landing) – Architecture Knights Landing (KNL): Intel’s 2. Many Integrated Core (MIC) Architecture1) self-booting CPU, optionall with integrated OmniPath fabric 64+ core (based on Intel Atom Silvermont architecture, x86-64) 4-way hardware-threading 512-bit SIMD vector processing (AVX-512) On-Chip Multi-Channel (MC) DRAM: 16 GiB DDR4 main memory: up to 384 GiB 1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro vol. 6, April 2016 6 / 44 Intel Xeon Phi (Knights Landing) – Architecture KNL CPU MCDRAM MCDRAM MCDRAM MCDRAM EDC EDC EDC EDC PCIe 3 DMI Tile up to 36 active tiles DDR MC connected via 2D-Mesh- DDR MC Channels Interconnect 3 Channels DDR4 3 DDR4 EDC EDC Misc EDC EDC MCDRAM MCDRAM MCDRAM MCDRAM 7 / 44 Intel Xeon Phi (Knights Landing) – Architecture KNL CPU Tile MCDRAM MCDRAM MCDRAM MCDRAM 2 VPUs CHA 2 VPUs 1 MiB L2 Cache EDC EDC EDC EDC Core Core PCIe 3 DMI Tile 2 out-of-order cores up to 36 active tiles Channels DDR MC connected via 2D-Mesh- DDR MC 2 VPUs per core (AVX-512) Interconnect 1 MiB shared L2-Cache 3 Channels DDR4 3 DDR4 Caching/Home-Agent (CHA) interface to 2D-Mesh-Interconnect EDC EDC Misc EDC EDC distributed tag directory (MESIF cache coherency protocol) MCDRAM MCDRAM MCDRAM MCDRAM 8 / 44 Intel Xeon Phi (Knights Landing) – Architecture KNL CPU DDR4 Memory (up to 384 GiB) 2 memory controller MCDRAM MCDRAM MCDRAM MCDRAM 6 DDR4 Channels EDC EDC EDC EDC PCIe 3 DMI Tile up to 36 active tiles DDR MC connected via 2D-Mesh- DDR MC Channels Interconnect 3 Channels DDR4 3 DDR4 EDC EDC Misc EDC EDC MCDRAM MCDRAM MCDRAM MCDRAM 9 / 44 Intel Xeon Phi (Knights Landing) – Architecture KNL CPU DDR4 Memory (up to 384 GiB) 2 memory controller MCDRAM MCDRAM MCDRAM MCDRAM 6 DDR4 Channels EDC EDC EDC EDC PCIe 3 DMI MCDRAM (16 GiB) Tile 8 on-chip units, each 2 GiB up to 36 active tiles DDR MC connected via 2D-Mesh- DDR MC Channels Interconnect 3 Channels DDR4 3 DDR4 EDC EDC Misc EDC EDC MCDRAM MCDRAM MCDRAM MCDRAM 10 / 44 Intel Xeon Phi (Knights Landing) – Architecture KNL CPU Memory Modes: Flat-Mode 16 GiB MCDRAM MCDRAM MCDRAM MCDRAM MCDRAM DDR4 and MCDRAM in same address space DDR4 EDC EDC EDC EDC PCIe 3 DMI space address phys. Cache-Mode Tile MCDRAM = direct- mapped Cache up to 36 active tiles 16 GiB DDR MC DDR MC Channels for DDR4 DDR4 connected via 2D-Mesh- MCDRAM Interconnect 3 DDR4 3 Channels DDR4 Hybrid-Mode 8 | 4 GiB MCDRAM in Cache-Mode, 8 | 12 GiB MCDRAM EDC EDC Misc EDC EDC remainder in 8 | 4 GiB Flat-Mode DDR4 space address MCDRAM . MCDRAM MCDRAM MCDRAM MCDRAM hys p 11 / 44 KNL in the Roofline Model (Samuel Williams, 2008) [GFLOPS] 4096 1024 256 64 16 Attainable FLOPS 4 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity 12 / 44 KNL in the Roofline Model - adding peak FLOPS [GFLOPS] 4096 peak FLOPS 1024 256 64 16 Attainable FLOPS 4 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s 12 / 44 KNL in the Roofline Model - adding peak DRAM bandwidth [GFLOPS] 4096 peak FLOPS 1024 256 DRAM BW 64 16 Attainable FLOPS 4 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s 12 / 44 KNL in the Roofline Model [GFLOPS] 4096 peak FLOPS 1024 256 DRAM BW 64 16 Attainable FLOPS 4 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s 12 / 44 KNL in the Roofline Model Peak FLOPS Attainable = min FLOPS Memory Arithmetic [GFLOPS] Bandwidth × Intensity 4096 peak FLOPS 1024 256 DRAM BW 64 16 Attainable FLOPS 4 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s 12 / 44 KNL in the Roofline Model Peak FLOPS Attainable = min FLOPS Memory Arithmetic [GFLOPS] Bandwidth × Intensity 4096 peak FLOPS 1024 256 DRAM BW 64 16 Attainable FLOPS 4 ⇐ memory bound compute bound ⇒ 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s 12 / 44 KNL in the Roofline Model Peak FLOPS = [GFLOPS] Memory Bandwidth 4096 peak FLOPS 1024 256 DRAM BW 64 16 Attainable FLOPS 4 ⇐ memory bound compute bound ⇒ 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s 12 / 44 KNL in the Roofline Model Peak Minimal FLOPS = Arithmetic Intensity [GFLOPS] Memory Bandwidth necessary for peak FLOPS peak FLOPS 4096 (HLRN nodes): 1024 MCDRAM BW KNL DRAM: 22.7 256 DRAM BW KNL MCDRAM: 5.3 Haswell: 7.1 64 16 Attainable FLOPS 4 1 [FLOP/Byte] 1 2 4 8 16 32 64 128 256 Arithmetic Intensity Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/sNumbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR, 490 GB/s MCDRAM; Xeon12 / 44 E5-2680v3: 480 GFLOPS, 68 GiB/s • 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS • AVX frequency is only 1.2 GHz and might throttle down under heavy load ⇒ actual peak: 2611.2 GFLOPS Add more FLOPS ceilings • without instruction level parallelism (ILP), i.e. dual VPUs and FMA • 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25 %) • without ILP, and without SIMD • 1.2 GHz × 68 core = 84.6 GFLOPS (3.2 %) Refining the Ceilings - FLOPS Realistic peak FLOPS • Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP 13 / 44 • AVX frequency is only 1.2 GHz and might throttle down under heavy load ⇒ actual peak: 2611.2 GFLOPS Add more FLOPS ceilings • without instruction level parallelism (ILP), i.e. dual VPUs and FMA • 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25 %) • without ILP, and without SIMD • 1.2 GHz × 68 core = 84.6 GFLOPS (3.2 %) Refining the Ceilings - FLOPS Realistic peak FLOPS • Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP • 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS 13 / 44 ⇒ actual peak: 2611.2 GFLOPS Add more FLOPS ceilings • without instruction level parallelism (ILP), i.e. dual VPUs and FMA • 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25 %) • without ILP, and without SIMD • 1.2 GHz × 68 core = 84.6 GFLOPS (3.2 %) Refining the Ceilings - FLOPS Realistic peak FLOPS • Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP • 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS • AVX frequency is only 1.2 GHz and might throttle down under heavy load 13 / 44 Add more FLOPS ceilings • without instruction level parallelism (ILP), i.e. dual VPUs and FMA • 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25 %) • without ILP, and without SIMD • 1.2 GHz × 68 core = 84.6 GFLOPS (3.2 %) Refining the Ceilings - FLOPS Realistic peak FLOPS • Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP • 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS • AVX frequency is only 1.2 GHz and might throttle down under heavy load ⇒ actual peak: 2611.2 GFLOPS 13 / 44 • without ILP, and without SIMD • 1.2 GHz × 68 core = 84.6 GFLOPS (3.2 %) Refining the Ceilings - FLOPS Realistic peak FLOPS • Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP • 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS • AVX frequency is only 1.2 GHz and might throttle down under heavy load ⇒ actual peak: 2611.2 GFLOPS Add more FLOPS ceilings • without instruction level parallelism (ILP), i.e.