Best Practice Guide Modern Processors

Best Practice Guide Modern Processors Ole Widar Saastad, University of Oslo, Norway Kristina Kapanova, NCSA, Bulgaria Stoyan Markov, NCSA, Bulgaria Cristian Morales, BSC, Spain Anastasiia Shamakina, HLRS, Germany Nick Johnson, EPCC, United Kingdom Ezhilmathi Krishnasamy, University of Luxembourg, Luxembourg Sebastien Varrette, University of Luxembourg, Luxembourg Hayk Shoukourian (Editor), LRZ, Germany Updated 5-5-2021 1 Best Practice Guide Modern Processors Table of Contents 1. Introduction .............................................................................................................................. 4 2. ARM Processors ....................................................................................................................... 6 2.1. Architecture ................................................................................................................... 6 2.1.1. Kunpeng 920 ....................................................................................................... 6 2.1.2. ThunderX2 .......................................................................................................... 7 2.1.3. NUMA architecture .............................................................................................. 9 2.2. Programming Environment ............................................................................................... 9 2.2.1. Compilers ........................................................................................................... 9 2.2.2. Vendor performance libraries ................................................................................ 10 2.2.3. Scalable Vector Extension (SVE) software support ................................................... 11 2.3. Benchmark performance ................................................................................................. 12 2.3.1. STREAM - memory bandwidth benchmark - Kunpeng 920 ......................................... 12 2.3.2. STREAM - memory bandwidth benchmark - Thunder X2 .......................................... 13 2.3.3. High Performance Linpack ................................................................................... 14 2.4. MPI Ping-pong performance using RoCE .......................................................................... 15 2.5. HPCG - High Performance Conjugated Gradients ............................................................... 15 2.6. Simultaneous Multi Threading (SMT) performance impact ................................................... 17 2.7. IOR ............................................................................................................................ 19 2.8. European ARM processor based systems ........................................................................... 19 2.8.1. Fulhame (EPCC) ................................................................................................ 19 3. Processors Intel Skylake ........................................................................................................... 21 3.1. Architecture ................................................................................................................. 21 3.1.1. Memory Architecture .......................................................................................... 22 3.1.2. Power Management ............................................................................................. 22 3.2. Programming Environment ............................................................................................. 23 3.2.1. Compilers .......................................................................................................... 23 3.2.2. Available Numerical Libraries ............................................................................... 24 3.3. Benchmark performance ................................................................................................. 25 3.3.1. MareNostrum system ........................................................................................... 25 3.3.2. SuperMUC-NG system ........................................................................................ 28 3.4. Performance Analysis .................................................................................................... 33 3.4.1. Intel Application Performance Snapshot .................................................................. 33 3.4.2. Scalasca ............................................................................................................ 42 3.4.3. Arm Forge Reports ............................................................................................. 44 3.4.4. PAPI ................................................................................................................ 45 3.5. Tuning ........................................................................................................................ 48 3.5.1. Compiler Flags ................................................................................................... 48 3.5.2. Serial Code Optimisation ..................................................................................... 50 3.5.3. Shared Memory Programming-OpenMP .................................................................. 53 3.5.4. Distributed memory programming -MPI .................................................................. 56 3.5.5. Environment Variables for Process Pinning OpenMP+MPI ......................................... 61 3.6. European SkyLake processor based systems ....................................................................... 63 3.6.1. MareNostrum 4 (BSC) ......................................................................................... 63 3.6.2. SuperMUC-NG (LRZ) ......................................................................................... 67 4. AMD Rome Processors ............................................................................................................ 69 4.1. System Architecture ....................................................................................................... 70 4.1.1. Cores - «real» vs. virtual/logical ............................................................................ 75 4.1.2. Memory Architecture .......................................................................................... 76 4.1.3. NUMA ............................................................................................................. 79 4.1.4. Balance of AMD/Rome system ............................................................................. 80 4.2. Programming Environment ............................................................................................. 80 4.2.1. Available Compilers ............................................................................................ 80 4.2.2. Compiler Flags ................................................................................................... 81 4.2.3. AMD Optimizing CPU Libraries (AOCL) ............................................................... 82 4.2.4. Intel Math Kernel Library .................................................................................... 83 2 Best Practice Guide Modern Processors 4.2.5. Library performance ............................................................................................ 84 4.3. Benchmark performance ................................................................................................. 84 4.3.1. Stream - memory bandwidth benchmark ................................................................. 84 4.3.2. High Performance Linpack ................................................................................... 85 4.4. Performance Analysis .................................................................................................... 86 4.4.1. perf (Linux utility) .............................................................................................. 86 4.4.2. perfcatch ........................................................................................................... 87 4.4.3. AMD µProf ....................................................................................................... 88 4.4.4. Roof line model ................................................................................................. 91 4.5. Tuning ........................................................................................................................ 93 4.5.1. Introduction ....................................................................................................... 93 4.5.2. Intel MKL pre 2020 version ................................................................................. 93 4.5.3. Intel MKL 2020 version ...................................................................................... 93 4.5.4. Memory bandwidth per core ................................................................................. 94 4.6. European AMD processor based systems ........................................................................... 95 4.6.1. HAWK system (HLRS) ....................................................................................... 95 4.6.2. Betzy system (Sigma2) ........................................................................................ 96 A. Acronyms and Abbreviations ................................................................................................... 100 1. Units ...........................................................................................................................

Best Practice Guide Modern Processors

AMD Athlon™ Processor X86 Code Optimization Guide

Forge and Performance Reports Modules $ Module Load Intel Intelmpi $ Module Use /P/Scratch/Share/VI-HPS/JURECA/Mf/ $ Module Load Arm-Forge Arm-Reports

Adaptive Data Migration in Load-Imbalanced HPC Applications

Benchmark Evaluations of Modern Multi Processor Vlsi Ds Pm Ps

D5.1 Report on Performance and Tuning of Runtime Libraries for ARM

(12) Patent Application Publication (10) Pub. No.: US 2011/0231717 A1 HUR Et Al

Fairborn Camera & Video

Laptops Portfolio

Scalasca User Guide

Computer Architecture Out-Of-Order Execution

Performance Tuning Workshop

State of Lenovo Notebooks Running Coreboot Who Am I