Best Practice Guide Modern Processors

Ole Widar Saastad, University of Oslo, Norway
Kristina Kapanova, NCSA, Bulgaria
Stoyan Markov, NCSA, Bulgaria
Cristian Morales, BSC, Spain
Anastasiia Shamakina, HLRS, Germany
Nick Johnson, EPCC, United Kingdom
Ezhilmathi Krishnasamy, University of Luxembourg, Luxembourg
Sebastien Varrette, University of Luxembourg, Luxembourg
Hayk Shoukourian (Editor), LRZ, Germany

Updated 5-5-2021

Table of Contents

1. Introduction
2. ARM Processors
  2.1. Architecture
    2.1.1. Kunpeng 920
    2.1.2. ThunderX2
    2.1.3. NUMA architecture
  2.2. Programming Environment
    2.2.1. Compilers
    2.2.2. Vendor performance libraries
    2.2.3. Scalable Vector Extension (SVE) software support
  2.3. Benchmark performance
    2.3.1. STREAM - memory bandwidth benchmark - Kunpeng 920
    2.3.2. STREAM - memory bandwidth benchmark - Thunder X2
    2.3.3. High Performance Linpack
  2.4. MPI Ping-pong performance using RoCE
  2.5. HPCG - High Performance Conjugated Gradients
  2.6. Simultaneous Multi Threading (SMT) performance impact
  2.7. IOR
  2.8. European ARM processor based systems
    2.8.1. Fulhame (EPCC)
3. Processors Intel Skylake
  3.1. Architecture
    3.1.1. Memory Architecture
    3.1.2. Power Management
  3.2. Programming Environment
    3.2.1. Compilers
    3.2.2. Available Numerical Libraries
  3.3. Benchmark performance
    3.3.1. MareNostrum system
    3.3.2. SuperMUC-NG system
  3.4. Performance Analysis
    3.4.1. Intel Application Performance Snapshot
    3.4.2. Scalasca
    3.4.3. Arm Forge Reports
    3.4.4. PAPI
  3.5. Tuning
    3.5.1. Compiler Flags
    3.5.2. Serial Code Optimisation
    3.5.3. Shared Memory Programming - OpenMP
    3.5.4. Distributed memory programming - MPI
    3.5.5. Environment Variables for Process Pinning OpenMP+MPI
  3.6. European SkyLake processor based systems
    3.6.1. MareNostrum 4 (BSC)
    3.6.2. SuperMUC-NG (LRZ)
4. AMD Rome Processors
  4.1. System Architecture
    4.1.1. Cores - «real» vs. virtual/logical
    4.1.2. Memory Architecture
    4.1.3. NUMA
    4.1.4. Balance of AMD/Rome system
  4.2. Programming Environment
    4.2.1. Available Compilers
    4.2.2. Compiler Flags
    4.2.3. AMD Optimizing CPU Libraries (AOCL)
    4.2.4. Intel Math Kernel Library
    4.2.5. Library performance
  4.3. Benchmark performance
    4.3.1. Stream - memory bandwidth benchmark
    4.3.2. High Performance Linpack
  4.4. Performance Analysis
    4.4.1. perf (Linux utility)
    4.4.2. perfcatch
    4.4.3. AMD µProf
    4.4.4. Roof line model
  4.5. Tuning
    4.5.1. Introduction
    4.5.2. Intel MKL pre 2020 version
    4.5.3. Intel MKL 2020 version
    4.5.4. Memory bandwidth per core
  4.6. European AMD processor based systems
    4.6.1. HAWK system (HLRS)
    4.6.2. Betzy system (Sigma2)
A. Acronyms and Abbreviations
  1. Units