Performance Engineering on Cray XC40 with Xeon Phi

Performance Engineering for Legacy Codes on a Cray XC40 with Intel Xeon Phi (KNL)
Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld
Zuse Institute Berlin
2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17

North-German Supercomputing Alliance (HLRN)
• “Konrad” (Berlin, ZIB)
  - 1872 Xeon nodes
  - 44928 cores
• “Gottfried” (Hanover, LUIS)
  - 1680 Xeon nodes
  - 40320 cores
  - + 64 SMP servers, 256/512 GB
• 10 Gbps link between the two sites (243 km linear distance)
• TDS (Berlin, ZIB)
  - 16 KNC nodes (until July 2016)
  - 80 KNL nodes (since July 2016)
  - data warp nodes
• Applications on the HLRN-III
  - many pure MPI codes
  - some MPI+OpenMP
  - some vectorized

Supercomputers at ZIB
[Figure: timeline of systems with core counts and peak performance]
• 1984: Cray 1M, 1 core, 160 MFlops
• 1987: Cray X-MP, 2 cores, 471 MFlops
• 1994: Cray T3D, 192 cores (DEC Alpha), 38 GFlops
• 1997: Cray T3E, 256 cores (DEC Alpha), 486 GFlops
• 2002 (HLRN-I): IBM p690, 384 cores (IBM Power4), 2.5 TFlops
• 2008 (HLRN-II): SGI ICE, XE, 10240 cores (Intel Harpertown, Nehalem), 150 TFlops
• 2013 (HLRN-III): Cray XC30/XC40, 44928 cores (Intel Ivy Bridge, Haswell), 1.3 PFlops
• Chart annotations for comparison: 200 kWatt, 10 M€
• Xeon Phi KNL, 2016: Intel Xeon Phi 7290, 72 cores (288 threads), 3 TFLOPS, 245 Watt, 6662 € (6/2017)

Research Center for Many-Core HPC at ZIB
Intel Parallel Compute Center (IPCC)
Applications
• GLAT (atomistic thermodynamics)
• VASP (electronic structure)
• BQCD (high-energy physics)
• HEOM (photo-active processes)
• BOSS (time series analysis, phase 2)
• PALM (fluid dynamics, phase 2)
• PHOENIX3D (astrophysics, phase 2)
Challenges
• Adapting data structures for enabling SIMD
• Vectorising complex code structures
• Transition to hybrid MPI + OpenMP
• (Offload with Intel LEO vs. OpenMP 4.x)
[Figure: OBJECTIVE: Many-Core High-Performance Computing. APPLICATIONS: Modernization, OpenMP/MPI, Scalability. RESEARCH: Programming Models, Runtime Libraries.]

Intel Xeon Phi (Knights Landing) – Architecture
Knights Landing (KNL): Intel’s second Many Integrated Core (MIC) architecture 1)
• self-booting CPU, optionally with integrated Omni-Path fabric
• 64+ cores (based on Intel Atom Silvermont architecture, x86-64)
• 4-way hardware threading
• 512-bit SIMD vector processing (AVX-512)
• On-chip Multi-Channel (MC) DRAM: 16 GiB
• DDR4 main memory: up to 384 GiB
1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro vol. 36, April 2016

Intel Xeon Phi (Knights Landing) – Architecture
[Figure: KNL CPU block diagram. Eight MCDRAM units, each attached through an EDC; two DDR memory controllers (DDR MC) with 3 DDR4 channels each; PCIe 3, DMI and Misc blocks; tiles connected via a 2D mesh interconnect.]
KNL CPU
• up to 36 active tiles
• connected via 2D mesh interconnect
Tile
• 2 out-of-order cores
• 2 VPUs per core (AVX-512)
• 1 MiB shared L2 cache
• Caching/Home Agent (CHA)
  - interface to the 2D mesh interconnect
  - distributed tag directory (MESIF cache coherency protocol)
DDR4 memory (up to 384 GiB)
• 2 memory controllers
• 6 DDR4 channels
MCDRAM (16 GiB)
• 8 on-chip units, each 2 GiB

Intel Xeon Phi (Knights Landing) – Architecture
Memory modes:
• Flat mode: DDR4 and MCDRAM in the same physical address space
• Cache mode: 16 GiB MCDRAM = direct-mapped cache for DDR4
• Hybrid mode: 8 or 4 GiB of MCDRAM in cache mode, the remainder (8 or 12 GiB) in flat mode

KNL in the Roofline Model (Samuel Williams, 2008)
[Figure: log-log roofline plot. x-axis: Arithmetic Intensity [FLOP/Byte], 1 to 256. y-axis: Attainable FLOPS [GFLOPS], 1 to 4096. Ceilings: peak FLOPS, MCDRAM BW, DRAM BW.]
• Attainable FLOPS = min(Peak FLOPS, Memory Bandwidth × Arithmetic Intensity)
• ⇐ memory bound (left of the ridge), compute bound ⇒ (right of the ridge)
• Peak FLOPS / Memory Bandwidth = minimal arithmetic intensity necessary for peak FLOPS (HLRN nodes):
  - KNL DRAM: 22.7
  - KNL MCDRAM: 5.3
  - Haswell: 7.1
Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s

Refining the Ceilings - FLOPS
Realistic peak FLOPS
• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS
• AVX frequency is only 1.2 GHz and might throttle down under heavy load
⇒ actual peak: 1.2 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA = 2611.2 GFLOPS
Add more FLOPS ceilings
• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
  - 1.2 GHz × 68 cores × 8 SIMD = 652.8 GFLOPS (25 %)
• without ILP, and without SIMD
  - 1.2 GHz × 68 cores = 81.6 GFLOPS (3.1 %)
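
The ceiling and roofline arithmetic on the last two slides can be checked with a few lines of Python. This is a minimal sketch, not part of the original talk: the function names (`peak_gflops`, `attainable_gflops`, `ridge_intensity`) are made up for illustration, and the bandwidth figures are taken at face value from the slides without converting between GiB/s and GB/s. The ceilings are computed from their factors rather than typed in.

```python
# Sketch of the roofline arithmetic; all names here are illustrative.

def peak_gflops(ghz, cores, simd_lanes=1, vpus=1, fma=1):
    """Peak GFLOPS as the plain product of the per-core factors."""
    return ghz * cores * simd_lanes * vpus * fma

def attainable_gflops(peak, bandwidth, intensity):
    """Roofline: min(peak FLOPS, memory bandwidth x arithmetic intensity)."""
    return min(peak, bandwidth * intensity)

def ridge_intensity(peak, bandwidth):
    """Minimal arithmetic intensity (FLOP/Byte) needed to reach peak FLOPS."""
    return peak / bandwidth

# Xeon Phi 7250 factors from the slides: 68 cores, 8 DP SIMD lanes, 2 VPUs, FMA.
advertised = peak_gflops(1.4, 68, 8, 2, 2)  # nominal 1.4 GHz clock
actual = peak_gflops(1.2, 68, 8, 2, 2)      # 1.2 GHz AVX clock
no_ilp = peak_gflops(1.2, 68, 8)            # SIMD only, no dual VPUs or FMA
scalar = peak_gflops(1.2, 68)               # no ILP and no SIMD

# Bandwidths as quoted on the slides (units not converted).
DDR_BW = 115.2
MCDRAM_BW = 490.0

if __name__ == "__main__":
    ceilings = [("advertised", advertised), ("actual", actual),
                ("no ILP", no_ilp), ("scalar", scalar)]
    for name, value in ceilings:
        share = 100 * value / actual
        print(f"{name:>10}: {value:7.1f} GFLOPS ({share:.1f} % of actual peak)")
    for name, bw in [("DDR4", DDR_BW), ("MCDRAM", MCDRAM_BW)]:
        print(f"{name:>10}: ridge at {ridge_intensity(actual, bw):.1f} FLOP/Byte")
    # A kernel at 1 FLOP/Byte is memory bound on either memory type.
    print(attainable_gflops(actual, DDR_BW, 1.0),
          attainable_gflops(actual, MCDRAM_BW, 1.0))
```

Any kernel whose arithmetic intensity lies below the ridge value for a given memory is limited by that memory's bandwidth, which is why the same code can be memory bound from DDR4 but compute bound from MCDRAM.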
