Combining static and dynamic approaches to model loop performance in HPC Vincent Palomares

To cite this version:

Vincent Palomares. Combining static and dynamic approaches to model loop performance in HPC. Hardware Architecture [cs.AR]. Université de Versailles-Saint Quentin en Yvelines, 2015. English. NNT: 2015VERS040V. tel-01293040

HAL Id: tel-01293040 https://tel.archives-ouvertes.fr/tel-01293040 Submitted on 24 Mar 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Université de Versailles Saint-Quentin-en-Yvelines
École Doctorale “STV”

Combiner Approches Statique et Dynamique pour Modéliser la Performance de Boucles HPC

Combining Static and Dynamic Approaches to Model Loop Performance in HPC

THESIS

publicly presented and defended on September 21, 2015

for the degree of

Doctorate of the Université de Versailles Saint-Quentin-en-Yvelines (specialty: computer science)

by

Vincent Palomares

Jury composition

President: François Bodin - Professor, Université de Rennes

Reviewers: Denis Barthou - Professor, Université de Bordeaux; Henri-Pierre Charles - Research Director, CEA

Examiners: Alexandre Farcy - CPU Architect; David J. Kuck - Fellow, Intel

Thesis advisor: William Jalby - Professor, Université de Versailles

Thanks

I would first like to thank William Jalby, my advisor, for the help and guidance he provided throughout these past few years. His healthy doses of both optimism and skepticism motivated me to keep aiming higher.

I would also like to thank David Wong and David Kuck, with whom collaborating was very pleasant and stimulating. Working on Cape with them was an enriching experience.

I am grateful to Denis Barthou and Henri-Pierre Charles, the reporters, for reviewing my work and making suggestions on how to improve my manuscript. Their input was very helpful and contributed to making this document clearer.

My thesis work was extremely fun, in no small part thanks to Zakaria Bendifallah and José Noudohouenou. We certainly had our share of good laughs in our office. Let’s just hope the quotations we’ve collected over the years1 never get in the wrong hands.

I would like to thank all the members of the lab for their dynamism, competence and friendliness. I do not want to make a comprehensive list here (for fear I may forget a name or two2 and cause jealousy between those mentioned and those not). Surely, a blanket statement should be enough to prevent that?! Some of them are however definitely worthy of a special mention. First, Emmanuel Oseret, with whom I often discussed microarchitectural details on Intel CPUs, and who agreed to implement some features in his CQA tool that were specially tailored to my needs. He also proofread my manuscript, helping me rid it of various mistakes. I also want to mention Mathieu Tribalat, who helped me navigate some of MAQAO’s intricacies, and whose work in taking over DECAN made getting application measurements considerably easier.

The end-of-thesis rush and the tension that came with it were made easier by chatting (and complaining!) with other students in the same situation. I want to thank Nicolas Triquenaux for finding and sharing information about thesis completion procedures, and Zakaria (once again), whose perpetual struggle with paperwork made it easier to see mine was quite mild after all.

On a more personal side, I would like to thank Anna Galusza, my fiancée, for supporting me throughout the writing of this manuscript and reviewing some of its key parts, as well as my parents and the rest of my family for helping me get this far.

Finally, I would like to thank you, the reader: you are the reason why I wrote this manuscript3. This is particularly true if you read it until the end, at which point you should feel free to fill the following blank with your name4: THANK YOU, !

1 Some quotations? What quotations? Move along, there is nothing to see here!
2 Or three...
3 You, and getting my Ph.D., that is.
4 NB: Not all libraries are very fond of this practice, so you might want to be discreet if using a borrowed copy.

Fluffy Canary
I am told I can do anything I want with the Thanks section... Let’s see if it’s true!

Résumé:

La complexité des CPUs s’est accrue considérablement depuis leurs débuts, introduisant des mécanismes comme le renommage de registres, l’exécution dans le désordre, la vectorisation, les préfetchers et les environnements multi-coeurs pour améliorer les performances avec chaque nouvelle génération de processeurs. Cependant, la difficulté a suivi la même tendance pour ce qui est a) d’utiliser ces mêmes mécanismes à leur plein potentiel, b) d’évaluer si un programme utilise une machine correctement, ou c) de savoir si le design d’un processeur répond bien aux besoins des utilisateurs.

Cette thèse porte sur l’amélioration de l’observabilité des facteurs limitants dans les boucles de calcul intensif, ainsi que leurs interactions au sein de microarchitectures modernes.

Nous introduirons d’abord un framework combinant CQA et DECAN (des outils d’analyse respectivement statique et dynamique) pour obtenir des métriques détaillées de performance sur des petits codelets et dans divers scénarios d’exécution.

Nous présenterons ensuite PAMDA, une méthodologie d’analyse de performance tirant parti de l’analyse de codelets pour détecter d’éventuels problèmes de performance dans des applications de calcul à haute performance et en guider la résolution.

Un travail permettant au modèle linéaire Cape de couvrir la microarchitecture Sandy Bridge de façon détaillée sera décrit, lui donnant plus de flexibilité pour effectuer du codesign matériel / logiciel. Il sera mis en pratique dans VP3, un outil évaluant les gains de performance atteignables en vectorisant des boucles.

Nous décrirons finalement UFS, une approche combinant analyse statique et simulation au cycle près pour permettre l’estimation rapide du temps d’exécution d’une boucle en prenant en compte certaines des limites de l’exécution en désordre dans des CPUs modernes.

Mots Clés: codelet, analyse de boucle, analyse statique, analyse dynamique, calcul intensif, HPC, optimisation, modélisation rapide, performance, exécution dans le désordre, simulation au cycle près

Abstract:

The complexity of CPUs has increased considerably since their beginnings, introducing mechanisms such as register renaming, out-of-order execution, vectorization, prefetchers and multi-core environments to keep performance rising with each product generation. However, so has the difficulty in making proper use of all these mechanisms, or even evaluating whether one’s program makes good use of a machine, whether users’ needs match a CPU’s design, or, for CPU architects, knowing how each feature really affects customers.

This thesis focuses on increasing the observability of potential bottlenecks in HPC computational loops and how they relate to each other in modern microarchitectures.

We will first introduce a framework combining CQA and DECAN (respectively static and dynamic analysis tools) to get detailed performance metrics on small codelets in various execution scenarios.

We will then present PAMDA, a performance analysis methodology leveraging elements obtained from codelet analysis to detect potential performance problems in HPC applications and help resolve them.

A work extending the Cape linear model to better cover Sandy Bridge and give it more flexibility for HW/SW codesign purposes will also be described. It will be directly used in VP3, a tool evaluating the performance gains vectorizing loops could provide.

Finally, we will describe UFS, an approach combining static analysis and cycle-accurate simulation to very quickly estimate a loop’s execution time while accounting for out-of-order limitations in modern CPUs.

Keywords: codelet, loop analysis, static analysis, dynamic analysis, HPC, optimization, fast modeling, performance, out-of-order, cycle-accurate simulation

Contents

1 Introduction 1 1.1 High Performance Computing (HPC)...... 1 1.2 Objectives and Contributions...... 2 1.3 Overview...... 2

2 Background 5 2.1 Recent Microarchitectures...... 5 2.1.1 Sandy Bridge...... 5 2.1.2 Ivy Bridge...... 10 2.1.3 Haswell...... 10 2.1.4 Silvermont...... 12 2.2 Performance Analysis...... 14 2.2.1 Static Analysis...... 15 2.2.2 Dynamic Analysis...... 16 2.2.3 Simulation...... 18 2.3 Using Codelets...... 21 2.3.1 Codelet Presentation...... 21 2.3.2 Artificial Codelets...... 22 2.3.3 Extracted Codelets...... 23

3 Codelet Performance Measurement Framework 25 3.1 Introduction...... 25 3.2 Target Loops: Numerical Recipes Codelets...... 26 3.2.1 Obtention and Target Properties...... 26 3.2.2 Presentation and Categories...... 27 3.3 Measurement Methodology...... 29 3.3.1 Placing Probes...... 29 3.3.2 Measurement Quality and Stability...... 30 3.3.3 CQA Reports...... 32 3.4 Varying Experimental Parameters...... 32 3.4.1 DECAN Variants...... 33 3.4.2 Data Sizes...... 33 3.4.3 Machines and Microarchitectures...... 34 3.4.4 Frequencies...... 35 3.4.5 Memory Load (using Memload)...... 36 3.4.6 Overall Structure...... 38 3.5 Results Repository: PCR...... 38 3.5.1 Features...... 40 3.5.2 Technical Details...... 41 3.5.3 Acknowledgments...... 41 3.6 Related Work...... 41 3.7 Future Work...... 41 3.8 Conclusion...... 42

4 PAMDA: Performance Assessment Methodology Using Differential Analysis 43 4.1 Introduction...... 43 4.2 Motivating Example...... 45 4.3 Ingredients: Main Tool Set Components...... 47 4.3.1 MicroTools: Microbenchmarking the Architecture...... 48 4.3.2 CQA: Code Quality Analyzer...... 48 4.3.3 DECAN: Differential Analysis...... 48 4.3.4 MTL: Memory Tracing Library...... 49 4.4 Recipe: PAMDA Tool Chain...... 50 4.4.1 Hotspot identification...... 50 4.4.2 Performance overview...... 51 4.4.3 Loop structure check...... 51 4.4.4 CPU evaluation...... 52 4.4.5 Bandwidth measurement...... 53 4.4.6 Memory evaluation...... 53 4.4.7 OpenMP evaluation...... 54 4.5 Experimental results...... 54 4.5.1 PNBench...... 54 4.5.2 RTM...... 56 4.6 Related Work...... 58 4.7 Acknowledgments...... 58 4.8 Conclusion and Future Work...... 59

5 Extending the Cape Model 61 5.1 Presentation of Cape...... 61 5.1.1 Core Principles...... 61 5.1.2 Identifying Nodes and their Bandwidths...... 62 5.1.3 Getting Node Capacities...... 62 5.1.4 Isolating the Memory Workload...... 62 5.1.5 Saturation Evaluation...... 63 5.1.6 Cape Inputs...... 63 5.2 DECAN Variant Refinements...... 63 5.2.1 Tackling Partial Vector Register Loads...... 64 5.2.2 The Case of Floating Point Divisions...... 64 5.3 Front-End Modeling Subtleties...... 65 5.3.1 Accounting for Unlamination...... 65 5.3.2 Ceiling Effect...... 67 5.3.3 New Front-End DECAN Variant...... 67 5.4 Back-End Modeling Improvements...... 68 5.4.1 Dispatch...... 69 5.4.2 Functional Units...... 69 5.4.3 Memory Hierarchy...... 70 5.5 Handling Unsaturation...... 74 5.5.1 Definition and Effect of Unsaturation...... 74 5.5.2 Overlooked or Mismodeled Nodes...... 75 5.5.3 Buffer-Induced Unsaturation...... 75 5.6 Related Work...... 75 5.7 Future Work...... 77

5.8 Acknowledgments...... 78 5.9 Conclusion...... 78

6 VP3: A Vectorization Potential Performance Prototype 79 6.1 Introduction...... 79 6.2 Tool Operation...... 81 6.2.1 General Objectives for Prediction Tools...... 81 6.2.2 Tool Output...... 82 6.3 Experimental Results...... 83 6.3.1 Motivating Example...... 83 6.3.2 VP3 on YALES2...... 84 6.3.3 VP3 on POLARIS(MD)...... 85 6.4 Tool Principles...... 86 6.4.1 FP/LS Variants Generation and Measurement...... 88 6.4.2 Static Projection of FP/LS Operations...... 88 6.4.3 Refinement of Static Projection of LS Operations...... 88 6.4.4 Combining FP/LS Projection Results...... 89 6.4.5 Tool Speed...... 90 6.5 Validation...... 90 6.5.1 Methodology, Measurements and Experimental Settings... 90 6.5.2 Error Analysis...... 91 6.6 Extensions...... 92 6.7 Related Work...... 92 6.8 Conclusions...... 93

7 Uop Flow Simulation 95 7.1 Introduction...... 95 7.1.1 On Model Accuracy...... 96 7.1.2 On Buffer Sizes...... 96 7.1.3 On Uop Scheduling...... 97 7.1.4 Out-of-Context Analysis...... 97 7.1.5 Motivating Example: Realft2_4_de...... 97 7.1.6 Alternative Motivating Example: Realft_4_de...... 98 7.2 Understanding Out-of-Order Engine Limitations...... 101 7.2.1 In-Order Issue and Retirement...... 101 7.2.2 Finite Out-of-Order Resources...... 102 7.2.3 Dispatching Heuristics...... 102 7.2.4 Inter-iteration Dependencies...... 103 7.3 Input Resource Sizes...... 105 7.4 Pipeline Model...... 106 7.4.1 Principles...... 106 7.4.2 Engine...... 108 7.4.3 Simplified Front-End...... 108 7.4.4 Resource Allocation Table (RAT)...... 109 7.4.5 Out-of-Order Flow...... 112 7.4.6 Retirement...... 114 7.4.7 Things Not Modeled...... 114 7.4.8 Ad-Hoc L1 Modeling...... 116 7.5 Validation...... 116

7.5.1 Another Look at our Motivating Examples...... 117 7.5.2 In Vitro Validation...... 118 7.5.3 In Vivo Validation...... 122 7.5.4 Simulation Speed...... 124 7.6 Related Work...... 128 7.7 Future Work...... 129 7.8 Acknowledgements...... 129 7.9 Conclusion...... 129

8 Conclusion 131 8.1 Contributions...... 131 8.2 Publications...... 131 8.3 Future Work...... 131 8.3.1 Differential Analysis...... 132 8.3.2 Cape Modeling...... 132 8.3.3 Uop Flow Simulation...... 132

A Quantifying Effective Out-of-Order Resource Sizes 133 A.1 Basic Experimental Blocks...... 133 A.2 Quantifying Branch Buffer Entries...... 134 A.3 Quantifying Load Buffer Entries...... 136 A.4 Quantifying PRF Entries...... 137 A.4.1 Quantifying FP PRF Entries...... 137 A.4.2 Quantifying Integer PRF Entries...... 139 A.4.3 Quantifying Overall PRF Entries...... 140 A.5 Quantifying ROB Entries...... 140 A.6 Quantifying RS Entries...... 143 A.7 Quantifying Store Buffer Entries...... 145 A.8 Impact of Microfusion on Resource Consumption...... 145 A.8.1 ROB Microfusion...... 146 A.8.2 RS Microfusion...... 146

B Note on the Load Matrix 151 B.1 Load Matrix Presentation...... 151 B.2 Quantifying Load Matrix Entries...... 151

Bibliography 153

List of Figures

2.1 Simplified Sandy Bridge Front-End...... 6 2.2 Simplified Sandy Bridge Execution Engine...... 7 2.3 Simplified Sandy Bridge Memory Hierarchy...... 9 2.4 Simplified Haswell Execution Engine...... 11 2.5 Simplified Silvermont Front-End...... 12 2.6 Simplified Silvermont Execution Engine...... 14 2.7 Simplified Silvermont Memory Hierarchy...... 15 2.8 Example of DECAN Loop Transformations...... 19

3.1 Codelet Structure and Probe Placement...... 32 3.2 DECAN Performance Decomposition Example (balanc_3_de).... 34 3.3 Example of Codelet Behavior Across Dataset Sizes (toeplz_4_de). 35 3.4 Behavior across Machines (toeplz_1_de)...... 36 3.5 Frequency Scaling Example (elmhes_11_de and svdcmp_13_de).. 37 3.6 Memory Load Example (svdcmp_14_de)...... 38 3.7 Framework Structure...... 39

4.1 Polaris Source Code Sample...... 46 4.2 DECAN Analysis Example...... 47 4.3 Low-Level CQA Output...... 49 4.4 PAMDA Overview...... 50 4.5 Performance Investigation Overview...... 51 4.6 Detecting Structural Issues...... 52 4.7 DL1 Subtree: CPU Performance Evaluation...... 52 4.8 LS Subtree: Memory Performance Evaluation...... 53 4.9 OpenMP Performance Subtree...... 54 4.10 Streams Analysis on PNBench...... 55 4.11 Group cost analysis on PNBench...... 56 4.12 Evaluation of the Cost of Cache Coherence Protocol...... 57

5.1 Exposing the Front-End Ceiling Effect...... 68 5.2 Modeling the FP Add Functional Unit...... 70 5.3 Modeling the Store Functional Unit...... 71 5.4 Impact of TLB Misses...... 73 5.5 DECAN-level System Saturation...... 76

6.1 Operating Space of a Performance Prediction Tool...... 81 6.2 YALES2 loop example...... 83 6.3 VP3 Projection Results for YALES2: Low Prospects for Vectorization 84 6.5 VP3 Projection Results for POLARIS: VP3 vs. Measurements.... 86 6.6 Operating Space of VP3 on POLARIS (θ = 1.2)...... 86 6.7 VP3 Vec. Projection Steps...... 87 6.8 Error Cases/16 Validation Codelets vs. θ Tolerance...... 91

7.1 Realft2_4_de Codelet...... 98 7.2 Realft2_4_de DDG...... 100

7.3 Inter-Iteration Dependency Cases...... 105 7.4 UFS Uop Flow Chart...... 109 7.5 In Vitro Validation for FP [SNB]...... 120 7.6 In Vitro Validation for LS [SNB]...... 121 7.7 In Vitro Validation for REF [SNB]...... 122 7.8 In Vivo Validation for DL1: Y2 / 3D Cylinder [SNB]...... 123 7.9 In Vivo Validation for DL1: AVBP [SNB]...... 124 7.10 UFS Speed Validation for NRs and Maleki Codelets (REF Variant). 125 7.11 UFS Speed Validation for YALES2: 3D Cylinder...... 126 7.12 UFS Speed Validation for AVBP...... 127

A.1 Quantifying Branch Buffer Entries...... 135 A.2 Quantifying Load Buffer Entries...... 137 A.3 Quantifying FP PRF Entries...... 138 A.4 Quantifying Integer PRF Entries...... 140 A.5 Quantifying Overall PRF Entries...... 141 A.6 Quantifying ROB Entries...... 142 A.7 Quantifying RS Entries...... 144 A.8 Quantifying SB Entries...... 146 A.9 ROB Microfusion Evaluation...... 147 A.10 RS Microfusion Evaluation...... 149

B.1 Quantifying Load Matrix Entries...... 152

List of Tables

3.1 NR Codelet Suite...... 28 3.2 Machine List...... 33

4.1 A few typical performance pathologies...... 45 4.2 DECAN variants and transformations...... 50 4.3 Bytes per Cycle for Each Memory Level (Sandy Bridge E5-2680).. 53 4.4 PNBench MTL Results...... 55

5.1 Finding Unlamination Rules...... 66 5.2 Front-End Stress Experiment Example...... 67 5.3 Traffic Count Formulas for HSW, SLM and SNB / IVB...... 72

6.1 BW Scaling Factors (BW Vector / BW Scalar)...... 89

7.1 Realft2_4_de: Measurements and CQA Error...... 98 7.2 Realft2_4_de Assembly Instructions...... 99 7.3 Realft_4_de: Measurements and CQA Error...... 101 7.4 Exposing Pseudo FIFO limitations: Assembly Code for “rs_pb”... 104 7.5 Exposing Pseudo FIFO limitations: Experimental Results...... 104 7.6 Experimental Resource Quantifying Summary...... 106 7.7 Partial UFS Loop Input Example (Realft2_4_de)...... 107 7.8 Needed Resources for Queue Uop Types and Outputs (SNB)..... 111 7.9 Realft2_4_de: UFS Validation (SNB)...... 117 7.10 Realft2_4_de UFS Trace...... 119 7.11 Realft_4_de: UFS Validation (SNB)...... 120

A.1 Resource Quantifying Experiment Example...... 134 A.2 Resource Quantifying Experiment Example for the BB (P = 2)... 135 A.3 BB Size: Measured vs. Official...... 135 A.4 Resource Quantifying Experiment Example for the LB (P = 2)... 136 A.5 LB Size: Measured vs. Official...... 136 A.6 Resource Quantifying Experiment Example for the FP PRF (P = 4) 137 A.7 FP PRF Size: Measured vs. Official...... 139 A.8 Resource Quantifying Experiment Example for the Integer PRF (P = 4)...... 139 A.9 Integer PRF Size: Measured vs. Official...... 140 A.10 RQ Experiment Example for the Overall PRF (P = 4)...... 141 A.11 Overall PRF Size: Measured vs. Official...... 141 A.12 RQ Experiment Example for the ROB (P = 4)...... 142 A.13 ROB Size: Measured vs. Official...... 143 A.14 RQ Experiment Example for the RS (P = 4)...... 143 A.15 RS Size: Measured vs. Official...... 144 A.16 RQ Experiment Example for the SB (P = 4)...... 145 A.17 SB Size: Measured vs. Official...... 145 A.18 Microfusion Evaluation Experiment for the ROB (P = 4)...... 147 A.19 ROB Microfusion Evaluation...... 148 A.20 Microfusion Evaluation Experiment for the RS (P = 4)...... 148 A.21 RS Microfusion Evaluation...... 149

B.1 Resource Quantifying Experiment Example for the LM (P = 2)... 151 B.2 LM Size: Measured vs. Official...... 152

List of Algorithms

1 Simplified Front-End Algorithm...... 110 2 Issue Algorithm...... 112 3 Port Binding Algorithm...... 112 4 Dispatch Algorithm...... 113

Chapter 1 Introduction

The growing complexity behind modern CPU microarchitectures [1,2] makes performance evaluation and modeling a very complex task. Indeed, modern CPUs will typically implement features such as pipelining, register renaming, speculative and out-of-order execution, data prefetchers, vectorization, virtual memory, caches and multiple execution cores. While each of them can be beneficial to performance, they also make performance analysis more difficult.

On the consumer side of the CPU design process, users want to know which product fits their needs best in terms of performance, energy consumption and/or price. Software developers will be more concerned with adjusting their applications to make the best use of existing features, especially when performance represents a direct competitive advantage.

On the designer side, CPU manufacturers need to build microarchitectures offering the characteristics wanted by users while keeping costs low. Furthermore, as product improvements may have to rely on complex mechanisms, they have to guide software developers on how to use them while also having the contrary objective of preventing plagiarism by competitors by controlling the exposure of their performance recipes.

In this context, performance modeling can be used to cost-effectively:

1. Help users find out which hardware would best fit their applications (without actually buying all the considered hardware first).

2. Expose optimization opportunities to software developers (without first testing them).

3. Offer CPU architects insights on which hardware improvements would speed user applications the most (without first implementing them).

This chapter will describe why performance modeling is important in the field of High Performance Computing (HPC) and proceed to present the objectives and contributions of this thesis. It will also provide a quick overview of the document.

1.1 High Performance Computing (HPC)

HPC represents the use of large-scale machines called supercomputers to process compute workloads extremely quickly. It is used (and needed) in areas as diverse as aerodynamic simulations, cryptanalysis, engine design, oil and gas exploration, molecular dynamics or weather forecasting.

It is an environment with very interesting characteristics:

1. Performance is a primary objective and can result in hefty monetary gains. For instance, a faster numerical simulator will be able to provide more results, and/or results of a better quality, in domains as varied as the design of cars, plane wings, nuclear plants or the development of new drugs.

2. The supercomputers used can have millions of execution cores [3], offering potentially tremendous calculation speeds and making energy consumption an unavoidable (and expensive) concern: a poorly used machine is a costly machine.

3. Users and software developers can be strongly tied, or even be the same entities. It creates an interesting dynamic where developers are strongly motivated to optimize their code to make the best use of existing resources, and may also have a say in which machines to buy next.

As HPC applications typically spend very large amounts of time in computational loops due to processing large data sets, loop analysis is a primary go-to approach for HPC performance analysis, optimization and modeling.

1.2 Objectives and Contributions

This thesis focuses on increasing the cost-effective observability of potential bottlenecks in HPC computational loops and how they relate to each other. It aims to do so by combining static and dynamic approaches to identify, quantify, and model both the bottlenecks and their interactions. Its main contributions are:

1. PAMDA, a performance evaluation methodology using a blend of static and dynamic analyses to find bottlenecks and quantify their impact. Its main purpose is to expose optimization opportunities.

2. An adaptation of the Cape linear model to the Sandy Bridge microarchitecture as well as a direct application thereof with VP3, a vectorization gain predictor.

3. Uop Flow Simulation (UFS), a loop performance modeling technique combining static analysis and cycle-level simulation to account for out-of-order limitations at a very low execution cost.

Other less significant contributions include:

1. A loop performance measurement framework combining static and dynamic analysis tools to evaluate loop performance from different angles.

2. An empirical approach to quantify out-of-order resources.

1.3 Overview

Chapter 2 will present some background information to familiarize readers with CPU microarchitectural details and nomenclature, performance analysis approaches and tools, as well as with the use of small benchmarks called codelets.

We will introduce a framework combining CQA and DECAN (respectively static and dynamic analysis tools) in Chapter 3. Its objective is to get detailed performance metrics on small codelets given various execution scenarios.

We will then present PAMDA, a performance analysis methodology, in Chapter 4. It leverages elements obtained from codelet analysis to detect potential performance problems in HPC applications and help resolve them.

Chapter 5 will describe a work extending the Cape linear model to better cover Sandy Bridge and give it more flexibility for HW/SW codesign purposes. It will be directly used in Chapter 6 with VP3, a tool evaluating the performance gains vectorizing loops could provide.

Chapter 7 will introduce UFS, an approach combining static analysis and cycle-accurate simulation to very quickly estimate a loop’s execution time while accounting for out-of-order limitations in modern CPUs, and better identifying out-of-order related issues than PAMDA or Cape modeling.

We will finally conclude in Chapter 8, summarizing our contributions and presenting future work.

Chapter 2 Background

This chapter will focus on presenting the technical context for this thesis and introduce some of the nomenclature used throughout this manuscript. It will first describe modern Intel microarchitectures in detail, before presenting different performance analysis approaches and tools. It will also explain some of the advantages and limitations of codelets, small benchmarks which we will use for modeling purposes in later chapters.

2.1 Recent Microarchitectures

Microarchitectures are the result of different design choices and incremental improvements carried over CPU generations. They are typically pipelined and feature the following components:

1. Front-End (FE): component in charge of reading and decoding instructions, making them available to the rest of the execution pipeline.

2. Back-End (BE): executes the instructions provided by the Front-End.

3. Memory Hierarchy: caches can be used to improve the effective speed of memory accesses for both data and instructions.

We will present some of the microarchitectures particularly relevant to HPC here, using information from official sources [4,5,6,7,8], technical news articles [9, 10, 11, 12, 13, 14, 15], test-based reports [16] and our own observations.

2.1.1 Sandy Bridge

Sandy Bridge (SNB) is a microarchitecture used in the performance-oriented Big Core family of Intel CPUs. It is a tock in the manufacturer’s tick-tock development cycle [17], meaning it keeps the same 32 nm lithography as its Westmere predecessor, but brings important microarchitectural changes.

It will be the microarchitecture this thesis most focuses on. We will present it in detail and later use it as a base point to describe the incremental improvements brought by its Ivy Bridge and Haswell successors.

SNB Stock Keeping Units (SKUs) can have from 1 to 6 cores.

2.1.1.1 Front-End

Sandy Bridge’s decode pipeline (also called legacy decode pipeline) is in charge of fetching instructions from the memory hierarchy and decoding them, producing uops more easily interpretable by the Back-End. It is the component most directly affected by the complexity of the instruction sets, and can produce up to 16 bytes of uops or 4 uops per cycle, whichever is the most restrictive. Furthermore, it has branch prediction abilities, and can decode and provide uops speculatively.

While it can typically only decode up to 4 instructions per cycle, it implements extra features to increase its effective bandwidth:

1. Macrofusion: allows a simple integer instruction and a following branch instruction to be decoded as a single uop in certain circumstances. This actually brings the maximum theoretical number of decoded instructions to 5 per cycle in favorable corner cases.

2. Microfusion: complex instructions may need to get divided in smaller logical operations (or components) when decoded to simplify the Back-End’s work. Microfusion allows instructions having both an arithmetic and a memory component to fit in a single uop for part of the pipeline despite this constraint, potentially doubling the effective FE bandwidth. For instance, MULPD (%rax), %xmm0 will occupy a single uop slot until each component needs to be executed separately.
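To make the micro-fused versus split forms concrete, the following C sketch (illustration only; the function names are made up and the inline assembly assumes a GCC/Clang x86-64 toolchain with a 16-byte-aligned pointer) forces the two instruction sequences discussed above. The memory-operand MULPD carries both a load and a multiply component in one micro-fusable uop slot, whereas the split form consumes a separate MOVAPD uop all along the Front-End.

```c
#include <emmintrin.h>  /* SSE2 types */

/* Memory-operand form: load + multiply encoded in a single instruction,
 * which the decoders can keep as one (micro-fused) uop slot. */
static inline __m128d scale_fused(const double *p, __m128d acc)
{
    __asm__("mulpd (%1), %0" : "+x"(acc) : "r"(p) : "memory");
    return acc;
}

/* Split form: an explicit load followed by a register-register multiply,
 * i.e. two separate uop slots in the Front-End. */
static inline __m128d scale_split(const double *p, __m128d acc)
{
    __m128d tmp;
    __asm__("movapd (%1), %0" : "=x"(tmp) : "r"(p) : "memory");
    __asm__("mulpd %1, %0" : "+x"(acc) : "x"(tmp));
    return acc;
}
```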

Figure 2.1: Simplified Sandy Bridge Front-End Sandy Bridge’s Front-End can produce up to 4 uops using any of three different generation mechanisms: the decoders (legacy pipeline), the Uop Cache and the Uop Queue (when iterating over small loops). Only one uop source may be active at a time.

Furthermore, as decoding is a slow and expensive process, Intel CPU architects designed extra mechanisms to prevent instructions from having to be constantly re-decoded, reducing the pressure on the legacy pipeline and increasing the FE’s bandwidth (see Figure 2.1):

1. The uop queue: queues uops right before the RAT, allowing some FE or Back-End stalls to be absorbed. A loop detection mechanism called Loop Stream Detector detects when uops currently in the queue are part of a loop, and can decide to a) stop taking uops from the legacy pipeline, b) not destroy the uops

it sends to the Back-End and c) replay them as many times as necessary. While its peak bandwidth is still 4 uops per cycle, there is no limit on the number of transferred bytes anymore, increasing the effective FE bandwidth when lengthy uops are present. It has a maximum capacity of 28 uops on SNB.

2. The uop cache (or Decoded ICache): it saves uops decoded by the legacy pipeline, and can serve as an alternative uop provider for the uop queue. As with the legacy pipeline and the uop queue, its peak bandwidth is 4 uops per cycle, though with a maximum of 32 bytes of uop being generated per cycle. It is extremely large compared to the uop queue’s capacity and can contain up to 1536 uops in ideal conditions.

2.1.1.2 Execution Engine

Figure 2.2: Simplified Sandy Bridge Execution Engine The RAT issues uops from the Front-End to the Back-End after having allocated the necessary out-of-order resources and renamed their operands. The ROB keeps track of all in-flight uops (both those pending execution and the ones waiting for retirement), while other resources are more specific (e.g. the LB keeps track of load entries). All resources are allocated at issue time and only reclaimed at retirement, with the notable exception of RS entries (which are released once uops are dispatched to compatible execution ports). The RS can dispatch up to 6 uops per cycle (one to each port) out-of-order.

The Resource Allocation Table (RAT) is the gateway component between the

Front-End and the Back-End. It issues uops in-order, performs register renaming and allocates the resources necessary to their out-of-order execution. While all uops need an entry in the ReOrder Buffer (ROB) to be issued, other out-of-order buffers are allocated on a per-case basis. Interestingly, not all uops need to be sent to the Reservation Station to wait for execution: Sandy Bridge processes nop uops (which have no input nor output) and zero-idioms (whose output is always zero, and hence have no relevant input) directly in the RAT.

Other resources such as the Branch Buffer, Load Buffer, Physical Registers and Store Buffer are intuitively only allocated for, respectively, branches, loads and software prefetches, uops with a register output, and stores. Furthermore, with the exception of Reservation Station entries, resources are only released at the retirement step.

The Reservation Station holds uops until their input operands are ready, and then dispatches them to adequate execution ports. The latter will forward them to the proper Functional Units where they will begin their execution. Most Functional Units are fully pipelined, often giving them a throughput of 1 uop per cycle.

Fully executed uops are retired in-order, at which point their output is committed to the architectural state and their resources freed. Figure 2.2 summarizes our description of Sandy Bridge’s execution engine.
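The following toy sketch illustrates the flow just described: in-order issue gated by free ROB and RS entries, out-of-order dispatch once inputs are ready, RS entries released at dispatch, everything else at in-order retirement. It is not Sandy Bridge's actual algorithm; buffer sizes, latencies, the one-uop-per-cycle limits and the dependency pattern are arbitrary choices made to keep the example short.

```c
#include <stdio.h>

#define N_UOPS   8
#define ROB_SIZE 4
#define RS_SIZE  2

typedef struct {
    int dep;                                    /* producer uop index, or -1  */
    int latency;                                /* execution latency (cycles) */
    int issued, dispatched, completed, retired; /* cycle stamps, -1 = never   */
} uop_t;

int main(void)
{
    uop_t u[N_UOPS];
    for (int i = 0; i < N_UOPS; i++)            /* odd uops depend on the previous one */
        u[i] = (uop_t){ .dep = (i % 2) ? i - 1 : -1, .latency = 3,
                        .issued = -1, .dispatched = -1, .completed = -1, .retired = -1 };

    int head = 0, tail = 0;                     /* ROB window: uops [head, tail) are in flight */
    int rs_used = 0;

    for (int cycle = 0; head < N_UOPS; cycle++) {
        /* Retirement (in order): the oldest uop leaves once it has completed. */
        if (head < tail && u[head].completed >= 0 && u[head].completed < cycle)
            u[head++].retired = cycle;

        /* Completion: dispatched uops finish after their latency has elapsed. */
        for (int i = head; i < tail; i++)
            if (u[i].dispatched >= 0 && u[i].completed < 0 &&
                cycle - u[i].dispatched >= u[i].latency)
                u[i].completed = cycle;

        /* Dispatch (out of order, one per cycle here): any waiting uop whose
         * input is ready may go; its RS entry is freed immediately. */
        for (int i = head; i < tail; i++) {
            int d = u[i].dep;
            if (u[i].dispatched < 0 &&
                (d < 0 || (u[d].completed >= 0 && u[d].completed < cycle))) {
                u[i].dispatched = cycle;
                rs_used--;
                break;
            }
        }

        /* Issue (in order, one per cycle): needs a free ROB entry and a free RS entry. */
        if (tail < N_UOPS && tail - head < ROB_SIZE && rs_used < RS_SIZE) {
            u[tail].issued = cycle;
            rs_used++;
            tail++;
        }
    }

    for (int i = 0; i < N_UOPS; i++)
        printf("uop %d: issue %d, dispatch %d, complete %d, retire %d\n",
               i, u[i].issued, u[i].dispatched, u[i].completed, u[i].retired);
    return 0;
}
```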

2.1.1.3 Memory Hierarchy

The role of the memory hierarchy is to dampen the impact of RAM’s limited bandwidth and latency by acting as intermediaries to the main (RAM) memory. Indeed, caches can be much faster than RAM in both regards due to their being much smaller: as a general rule, the smaller the memory unit is, the faster it can perform. CPUs may consequently have several levels of cache, each offering different levels of capacity and performance. Sandy Bridge’s memory hierarchy is summarized in Figure 2.3.

Load and Store Units can each transfer up to 16 bytes from/to the L1 data cache per cycle, though there are 2 of the former and only 1 of the latter. While they can work concurrently (for an aggregated bandwidth of 48 bytes per cycle), this can only be achieved when using AVX 32-byte vector transfers due to port restrictions (32-byte transfers keep memory units busy for 2 cycles, allowing store address uops to use ports 2 and 3 without penalizing loads).

All of Sandy Bridge’s data caches are write-back: upper cache levels are only made aware of memory writes (or stores) when cache lines from lower levels are evicted. Sandy Bridge’s 32-KB L1 data cache is 8-way associative and virtually indexed. Its 256-KB L2 cache also has an associativity of 8, but is physically indexed and interestingly neither inclusive nor exclusive in regards to L1. The L3’s size is SKU-dependent and can range from 1 to 20 MB. Its associativity is also variable, and is between 12 and 16 depending on the model. All three cache levels use an NRU (Not Recently Used, a variant of Least Recently Used) replacement policy.

Four data prefetchers are also present, whose role is to predict which cache lines are going to be needed in the future and request them ahead of time:

Figure 2.3: Simplified Sandy Bridge Memory Hierarchy The data L1 has a combined bandwidth of 48 bytes per cycle, as load and store accesses can be performed in parallel. The L2’s 32 bytes per cycle bandwidth is shared for all fetch and store accesses from the L1s. However, only the data L1 can write data back to the L2. Each L3 slice also has a dedicated bandwidth of 32 bytes per cycle which is shared for read and write accesses from the L2. The L3 is distributed over all the cores, allowing each core to have their own dedicated access to L3. The bi-directional data ring connecting the slices allows each core to access the entirety of L3, though latency may vary a bit depending on how far the relevant slice is. L3 slices share the same memory controllers for RAM accesses.

1. The DCU Prefetcher (operates in L1): detects ascending-address loads within the same cache line and fetches the following cache line.

2. The Instruction-Pointer-based Prefetcher (operates in L1): detects access stride patterns for individual load instructions and fetches cache lines accordingly.

3. The Spatial Prefetcher (or Adjacent Cache Line Prefetcher; operates in L2): pairs contiguous cache lines in 128-byte blocks. Accesses to the first cache line trigger the fetching of the whole block.

4. The Stream Prefetcher (operates in L2): tries to predict and fetch cache lines that will be needed in the future based on previously accessed addresses. It can keep track of up to 32 different access patterns.

Sandy Bridge has a 2-level Translation Lookaside Buffer system:

1. The L1-TLB is 4-way associative, and can contain up to 64 4KB page entries, 32 2MB entries and 4 1GB entries.

2. The L2-TLB is also 4-way associative, and can contain up to 512 4KB page entries. It cannot hold larger page entries.
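The two micro-kernels below (hypothetical examples, not taken from the thesis) contrast an access pattern these prefetchers are built to detect with one they cannot help much with: a constant-stride walk exposes a regular address stream to the IP-based and stream prefetchers, whereas pointer chasing only reveals the next address once the current load has returned.

```c
#include <stddef.h>

/* Constant, predictable stride: a regular address stream that stride-based
 * prefetchers can detect and run ahead of. */
double strided_sum(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];
    return s;
}

typedef struct node { struct node *next; double value; } node_t;

/* Pointer chasing: the next address depends on the previous load's result,
 * leaving hardware prefetchers with little to predict. */
double chase_sum(const node_t *head)
{
    double s = 0.0;
    for (const node_t *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```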

2.1.2 Ivy Bridge

Ivy Bridge is the tick improvement of Sandy Bridge, carrying the microarchitecture to a 22 nm lithography but only bringing moderate microarchitectural changes. IVB CPUs feature from 1 to 15 cores.

2.1.2.1 Front-End

Sandy Bridge uses two physical 28-entry uop queues to support hyper-threading. Ivy Bridge improves over it by fusing the queues into a single physical one with 56 entries, and virtually splitting it only when hyper-threading is actually used. It improves its ability to absorb pipeline stalls and increases the maximum size of loops replayable with the Loop Stream Detector, helping improve performance and lower power consumption.

2.1.2.2 Execution Engine

Ivy Bridge introduces 0-latency register moves: in some cases, register moves can be achieved by merely making the named register point to the source physical register, which can be done by the RAT.

The architects also improved the divider / square root unit, likely taking advantage of the finer lithography to widen it and improve its bandwidth significantly (as well as its latency to a lesser extent).

2.1.2.3 Memory Hierarchy

While the sizes of L1 and L2 are the same for IVB as for SNB, the maximum L3 size was increased from 20 MB to 37.5 MB. Furthermore, L3 seems to use an adaptive replacement policy [18].

Ivy Bridge also introduces the Next-Page Prefetcher (NPP), which detects memory accesses nearing the beginning or the end of a page to fetch the matching page translation entry. It is not clear whether the NPP fetches entries to the L2 data cache or directly to the TLBs, though an Intel patent [19] suggests the latter.

2.1.3 Haswell

Haswell is the tock after Ivy Bridge, focusing once again on microarchitectural changes and still using the same 22 nm lithography.

2.1.3.1 Front-End

Haswell’s Front-End is largely the same as Ivy Bridge’s.

2.1.3.2 Execution Engine

Figure 2.4: Simplified Haswell Execution Engine The Haswell execution engine brings various improvements over Sandy Bridge, such as more execution ports, new Fused Multiply-Add functional units, larger out-of-order buffers and memory units able to process 32-byte transfers in a single cycle.

The execution engine has some important changes, which are summarized in Figure 2.4. Some of the most important ones include:

1. There are now 8 dispatch ports (against 6 previously), one of which is dedicated to handling store address uops.

2. Pipelined Fused Multiply-Add units being placed behind ports 0 and 1, doubling the potential number of FP operations per cycle. Indeed, they can each execute vector operations such as result = vector1 ∗ vector2 + vector3 with a throughput of 1 per cycle (note: in Haswell’s implementation, the result register must be one of the inputs; see the intrinsics sketch after this list).

3. The sizes of most out-of-order resources are increased.
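A minimal sketch of what these FMA units are fed with, using standard FMA intrinsics (the kernel and its name are made up; compile with FMA and AVX support, e.g. -mfma -mavx2). Each _mm256_fmadd_pd computes a * b + c on four doubles in a single instruction; with two FMA-capable ports, Haswell can in the ideal case sustain two such uops per cycle, i.e. 16 double-precision FP operations per cycle. The ISA-level constraint that the destination register is one of the sources is hidden by the intrinsic and handled by the compiler.

```c
#include <immintrin.h>   /* AVX / FMA intrinsics */

/* out[i] = alpha * x[i] + y[i], vectorized 4 doubles at a time. */
void fma_axpy(const double *x, const double *y, double *out,
              double alpha, long n)
{
    __m256d va = _mm256_set1_pd(alpha);
    for (long i = 0; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);
        __m256d vy = _mm256_loadu_pd(&y[i]);
        /* vx * va + vy in one fused multiply-add instruction. */
        _mm256_storeu_pd(&out[i], _mm256_fmadd_pd(vx, va, vy));
    }
    for (long i = n - (n % 4); i < n; i++)   /* scalar remainder */
        out[i] = x[i] * alpha + y[i];
}
```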

2.1.3.3 Memory Hierarchy

The L1 bandwidth is doubled, allowing the two load units and the store unit to each transfer up to 32 bytes per cycle (for a combined bandwidth of 96 bytes per cycle). The L2 bandwidth was also doubled, allowing a full cache line (64 bytes) to be transferred between L1 and L2 every cycle.

The minimum and maximum L3 sizes were increased to reach respectively 2 MB and 45 MB. Furthermore, some SKUs are equipped with Crystalwell embedded DRAM act- ing as a 128MB L4 victim cache, providing important bandwidth and latency bonuses.

2.1.4 Silvermont

Unlike SNB, IVB and HSW, Silvermont (SLM) is a microarchitecture used for Atom processors, for which the emphasis is on low power consumption. Its Bay Trail variant targets the mobile sector, while Avoton micro-server versions were also designed. Silvermont uses a 22 nm lithography, just like main line processors, and comprises between 1 (in Bay Trail) and 8 cores (in Avoton).

It is a particularly interesting x86 microarchitecture due to how energy consumption considerations impacted its design. Furthermore, future high-performance Knights Landing chips will feature around 70 Silvermont-inspired cores, making it very relevant in the HPC sphere.

2.1.4.1 Front-End

Figure 2.5: Simplified Silvermont Front-End Silvermont’s Front-End can provide the RAT with up to 2 uops per cycle. However, its decoders are limited, and D1 can only decode simple instructions. A feature called Loop Stream Detector can help compensate for the decoders’ weaknesses when executing loops with small loop bodies, pinning down uops in the Uop Queue and replaying them for as long as possible. The Front-End can then consistently reach its peak bandwidth.

Silvermont’s Front-End supports speculative execution and branch prediction. It is 2-wide (see Figure 2.5), meaning it can decode and provide up to 2 uops to the Back-End per cycle. This is half of what Big Core microarchitectures can do, and is further aggravated by SLM’s individual instruction decoders being less potent than those in the main line products. They are however more power-efficient.

There is also a Loop Stream Detector in the uop queue, which takes a very high importance due to the decoders’ weaknesses. While for SNB/IVB/HSW using the LSD is done rather opportunistically, it is an important factor for SLM performance.

2.1.4.2 Execution Engine

The RAT / Allocation / Rename cluster bridges the FE with the BE, inserting uops in-order after a) allocating some of the necessary resources for their execution and b) renaming their input and output registers. All uops need an entry in the ROB. It is also not clear when other resources (e.g. Load Buffer) are allocated, as Intel claims Silvermont uses a late allocation / early resource reclamation scheme [20].

The scheduler system is distributed over different Reservation Stations, each handling a specific execution port (see Figure 2.6), and their own in-order or out-of-order dispatch policy:

1. FP RS 0: handles FP and vector additions, as well as some other arithmetic and logic operations. Dispatches uops in-order (only in regards to other uops in the same RS).

2. FP RS 1: handles FP and vector multiplications, divisions, shuffles and other operations. Dispatches uops in-order (only in regards to other uops in the same RS).

3. Int RS 0: handles integer arithmetic, logic and shift operations. Dispatches uops out-of-order.

4. Int RS 1: handles integer arithmetic and logic as well as branches. Dispatches uops out-of-order.

5. Mem RS: handles address generations, loads and stores. Dispatches uops in-order, but allows accesses to complete out-of-order to absorb latency.

Fully executed uops are retired and committed in-order.

2.1.4.3 Memory Hierarchy

Silvermont has a 24KB L1 with an associativity of 6. The load and store units can transfer up to 16 bytes per cycle from/to it, though they likely cannot do so simultaneously (as only one address per cycle can be generated for loads and stores). It adopts a random cache line replacement policy.

Single-core Silvermont SKUs have a 512KB L2. SKUs with more than 1 core are organized in pairs of cores called modules, with each module having 1 MB of dedicated 16-way associative L2. The overall cache size for 8-core Silvermont CPUs hence reaches 4 MB, though each core is constrained to only use the L2 slice from its own package. The L1 and the L2 can exchange up to 32 bytes per cycle, though this link is shared with all the cores in the package. It uses an NRU cache line replacement policy. The L1 and L2 caches are both write-back.

There are two data prefetchers, the L1 Spatial Prefetcher and an L2 “advanced” prefetcher. They are likely respectively inspired by the L1 DCU and the L2 Streamer prefetchers from the Big Core CPUs, but not many details are given.

Figure 2.6: Simplified Silvermont Execution Engine Unlike Big Core microarchitectures, there is no unified reservation station. Furthermore, each small RS acts with its own rules:

1. The FP Reservation Stations have to dispatch their uops in-order.

2. The Mem RS also has to dispatch uops in-order (to help memory scheduling be as simple as possible), but accesses are non-blocking and can be completed out-of-order.

3. Integer Reservation Stations can dispatch uops out-of-order.

The ports presented here might not actually have a discrete existence (they are absent from any official documentation we could find), but we decided to put them here anyway to simplify the figure. We labelled them in a manner consistent with Big Core CPUs.

Silvermont also has a 2-level TLB cache:

1. The L1 TLB has 48 4KB entries and is fully associative.

2. The L2 TLB has 128 4KB entries and 16 2MB entries, and is 4-way associative. It is not inclusive (nor exclusive) in regards to the L1.

Figure 2.7 summarizes this memory hierarchy.

2.2 Performance Analysis

Different approaches can be adopted to analyze performance, each working with their own tradeoffs. This section will present some of them.

Figure 2.7: Simplified Silvermont Memory Hierarchy The Load and the Store units cannot access L1 in parallel, so the data L1’s bandwidth is 16 bytes per cycle. The L2 cache space and bandwidth are shared between cores from the same package. Finally, the memory controllers are shared by all cores on the CPU.

2.2.1 Static Analysis

Static analysis consists in evaluating software without executing it. This offers the advantage of being particularly fast, at the cost of working with a limited amount of information. We will present tools using this approach in this section.

2.2.1.1 Code Quality Analyzer (CQA)

CQA [21] evaluates the quality of assembly loops and projects their potential peak performance. It is developed inside the MAQAO [22] framework, which handles both the disassembling of target compiled binaries and the detection of the loops within. Operating at the assembly level provides several advantages:

1. The evaluated code is the code executed by the machine. This is not (necessarily) the case when working at the source level due to the optimizations the compiler can apply.

2. Loop instructions can be tightly coupled to known microarchitecture features and functional units, making performance estimates solidly grounded in reality.

However, assembly-level static analysis does not (consistently) make it possible to detect and account for issues such as poor memory strides, which could have been observed in the source code.

CQA’s metrics include:

1. Peak performance in L1 (assuming no memory-related issues, infinite-size buffers and an infinite number of iterations).

2. Front-End performance.

3. Distribution of the workload across the different ports and functional units.

4. Vectorization-efficiency metrics.

5. Impact of inter-iteration dependencies.

CQA can also provide suggestions on how to fix detected performance problems.

2.2.1.2 Intel Architecture Code Analyzer (IACA)

IACA is also a static analysis tool working at the assembly level. Unlike CQA, it uses markers inserted at the source level to locate the code to analyze. It allows users to easily target the parts they want analyzed, but forces them to place these markers and recompile their application. Two types of analyses are possible:

1. Throughput analysis: the target block of code is treated as if it were a loop for the purpose of detecting inter-iteration dependencies, and performance is evaluated in terms of instruction throughput. (This analysis is very close to CQA’s, though with the extra assumption that the Front-End can always deliver 4 uops per cycle.)

2. Latency analysis: IACA evaluates the number of cycles needed to execute the target block once.

In both cases, IACA can highlight the assembly instructions that are part of the performance bottleneck.
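As a sketch of the marker mechanism (assuming the iacaMarks.h header and the IACA_START / IACA_END macros shipped with IACA; the kernel below is a made-up example), the markers are placed around the region of interest at the source level, the program is recompiled as usual, and the resulting binary is then passed to the IACA command-line tool.

```c
#include "iacaMarks.h"   /* header distributed with IACA */

void scale(double *a, double k, long n)
{
    for (long i = 0; i < n; i++) {
        IACA_START        /* marks the beginning of the analyzed block */
        a[i] = a[i] * k;
    }
    IACA_END              /* marks its end, just after the loop */
}
```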

2.2.2 Dynamic Analysis

Dynamic analysis uses runtime information to evaluate performance. It can provide large amounts of information not obtainable through purely static means, but typically requires the target program to be executed at least once. Furthermore, it introduces problems of its own, such as measurement stability and precision.

2.2.2.1 Hardware Support

Hardware can implement features specifically intended for performance evaluation. They can allow dynamic analysis to be faster, easier, more precise and/or collect more relevant information.

The Time Stamp Counter [23] (TSC) provides a very precise way of measuring time at a very low cost. Indeed, it counts the number of spent cycles, and is readable in only around 30 cycles using the RDTSC instruction on Sandy Bridge, making it very cost-effective.

Performance Monitoring Counters [24] can also count events other than just cycles, such as the number of retired instructions, the number of cache lines read from RAM, the number of stalls at different stages of the execution pipeline or even power consumption. All this information can be obtained through other means (value tracing, simulation, hardware probes), but at a much higher complexity and cost.

Furthermore, different methods can be used to access these counters (a minimal tracing-style probe using the TSC is sketched after this list):

1. Sampling: counters are parameterized to keep track of certain events and raise an interrupt when a certain threshold is reached. Sampling software can then identify the last retired instruction and credit it for the overflown counter’s events, reset said counter to 0 and resume the program’s execution. It is a very cheap and non-intrusive way to collect performance counter data.

2. Tracing: sampling’s precision is not perfect as it essentially works by “blaming” the instruction that caused the overflow for the entirety of the event sample without any guarantee that it is indeed responsible for a majority of them. Furthermore, it cannot dissociate events associated to the same code but in different execution contexts (e.g. different function parameters). Tracing addresses these concerns by inserting start and stop probes in strategic parts of the code (e.g. right before and after a loop), allowing to a) precisely count events caused by the loop, and only by the loop and b) separate events caused by the same code area but at different times. It however comes at a higher cost because probes need to be inserted in the first place, and is not a working solution for tiny pieces of code.

3. Multiplexing: multiplexing consists in using the same hardware counter to monitor more than one event. It is done by making the counter alternate between the watched events during the measurement phases. It makes it possible to collect more events simultaneously (as the number of monitoring units is limited), but can degrade the precision of results.

However, the overhead for accessing PMCs can be significant and should be considered when using them [25].
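The sketch below shows the tracing idea in its simplest form, using the TSC as the only counter: read it right before and right after the region of interest and attribute the difference to that region. The kernel and array size are arbitrary, and __rdtsc() is the GCC/Clang intrinsic wrapping the RDTSC instruction; on recent CPUs the TSC ticks at a constant reference rate, so the delta is a time measure rather than an exact count of core cycles.

```c
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define N 4096
static double a[N];

int main(void)
{
    unsigned long long start = __rdtsc();   /* probe before the loop */
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2.0 + 1.0;
    unsigned long long stop = __rdtsc();    /* probe after the loop  */

    printf("reference cycles: %llu (~%.2f per element)\n",
           stop - start, (double)(stop - start) / N);
    return 0;
}
```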

2.2.2.2 Finding Hotspots

Finding an application’s hotspots (i.e. parts of the program needing the most time to execute) is important for performance analysis and optimization as it allows tools and developers to focus on parts of the code that have an important impact on the overall execution time. Different techniques can be used to find them.

For instance, Intel compilers have an option to insert RDTSC probes in the assembly code, tracing the time spent in different loops and functions [26]. The main drawback of this method is that it requires having access to the source code (and recompiling it for performance monitoring purposes).

Tools such as VTune [27] and Perf [28] can detect hotspots using sampling. The Perf module of MAQAO [29] can also do so in applications using OpenMP or OpenMPI frameworks. Though this technique is not perfect [30], it is very efficient and does not require any changes to be made to the target application.

2.2.2.3 Counter-based Analysis

Programs or frameworks such as Oprofile [31], PAPI [32] and Likwid [33] only perform PMC measurements, leaving the analysis to other tools.

VTune [27] is a special case and combines both measurement and analysis abilities. Among its analyses, the Top-Down approach [34] evaluates the performance contributions of the Front-End, Back-End, retirement and memory accesses.

Levinthal [35] offers counter-based performance metrics to evaluate performance bottlenecks. HPC oriented frameworks [36] can also detect and categorize performance issues in parallel applications, as well as suggest potential fixes such as in-lining functions or changing data structures.
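As a hedged illustration of this measurement-only role, here is a minimal counter collection with PAPI's low-level API (the event choice and kernel are arbitrary; link with -lpapi). The framework only returns raw counts; interpreting them is left to the analyst or to other tools, as noted above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define N 100000
static double a[N];

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
    if (PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK) exit(1);  /* total cycles         */
    if (PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK) exit(1);  /* retired instructions */

    PAPI_start(evset);
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 0.5 + 1.0;
    PAPI_stop(evset, counts);

    printf("cycles: %lld, instructions: %lld, IPC: %.2f\n",
           counts[0], counts[1], (double)counts[1] / counts[0]);
    return 0;
}
```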

2.2.2.4 Differential Analysis

Differential Analysis [37] consists in a) transforming a given target code to isolate some of its characteristics and b) comparing the execution times of the original and the modified codes to evaluate the impact of the targeted characteristics.

DECAN is a tool implementing this approach at the assembly loop level. It was developed within MAQAO [22], using the framework’s binary patching features [38] to generate new binaries with modified loops. Figure 2.8 showcases its main transformations:

1. LS (Loads/Stores): only keep memory instructions, address calculations and control flow instructions (which are necessary for the loop to iterate normally). Running this variant makes it possible to see the contribution of memory accesses to the original loop’s execution time.

2. FP (Floating Point): only keep floating point instructions and the control flow. It isolates the impact of FP operations in the original loop.

DECAN can be used on sequential, OpenMP and MPI programs alike. One of the drawbacks of this approach is overhead, as each of the target loop’s variants needs to be executed.

2.2.3 Simulation

Simulation can bring information that is difficult or impossible to otherwise get, and is often used as a means of validating performance models when the modeled hardware does not exist, or is hard to access. However, it is a complex process for which one of the trade-offs is nearly always execution time, though other factors (e.g. memory consumption) may also get in the picture.

Figure 2.8: Example of DECAN Loop Transformations This example assembly code multiplies all elements from an array by a constant. DECAN can be used to patch loops to isolate some of their performance characteristics. In the LS version, the MULPD floating point instructions were removed. In the FP version, loads were transformed into zero-idiom XORPS instructions to avoid creating new dependencies on the XMM registers. Stores were simply removed.
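For readers without the figure at hand, the following C functions are a rough source-level analogue of the three variants it depicts (illustration only: DECAN actually patches the compiled binary, so the compiler cannot optimize the variants away as it might here, and the function names are made up).

```c
/* REF: the original kernel, multiplying an array by a constant. */
void ref(double *a, double k, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = a[i] * k;             /* load + multiply + store           */
}

/* LS: memory accesses and loop control only (multiply removed). */
void ls(double *a, double k, long n)
{
    (void)k;
    for (long i = 0; i < n; i++) {
        double t = a[i];             /* load kept                         */
        a[i] = t;                    /* store kept                        */
    }
}

volatile double fp_sink;             /* keeps the FP chain observable     */

/* FP: floating-point work and loop control only (accesses removed). */
void fp(double *a, double k, long n)
{
    (void)a;
    double t = 0.0;                  /* stands in for the removed loads   */
    for (long i = 0; i < n; i++)
        t = t * k;                   /* multiply kept                     */
    fp_sink = t;
}
```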

2.2.3.1 Simulation Techniques

Different techniques can be used depending on the objectives of the simulation, or the targeted performance trade-off.

Execution-driven simulators [39, 40] simulate both the semantics of a program and its behavior on the modeled system. This allows them to tackle problems from scratch, but at a high cost. Trace-driven simulation offsets part of the problem by first collecting relevant execution details [41] (e.g. the list of all instructions executed for a program, as well as their outputs), and then using these traces to reconstitute the program’s semantics. The simulation then only targets the behavior of the traced instructions, allowing it to be both less complex and faster than an execution-driven equivalent. However, trade-offs for using this technique include a) having to run the target program on a real machine (or an execution-driven simulator...) at least once to collect the traces, b) storing and using trace files which can be extremely large and c) not being able to see the impact of e.g. randomness or different instruction sets without getting different traces.

Simulators can also have varying scopes. Full-system simulators [42, 43, 44] will simulate not just the CPU, but also other components such as graphics cards or network cards so as to be able to run entire operating systems. However, their main objective is more likely to enable e.g. driver development for not-yet-existing hardware or system-level performance analysis, taking their focus further away from pure CPU performance analysis. PTLsim [45] adopts an interesting approach combining virtualization and simulation for performance analysis, allowing users to use virtualization’s near-native execution speed when speed is important, and switching to simulation mode for performance analysis.

2.2.3.2 Cycle-Accurate Simulation

Detailed cycle-accurate simulation is the only way to get a perfect knowledge of a system's performance. It offers full visibility on all relevant mechanisms without any impact on the results (unlike hardware or software measurement probes). However, it comes with several drawbacks:

1. Execution time: simulators modeling everything perfectly will typically simulate at rates of a few thousand cycles per second [46], which represents a slowdown in the order of millions.

2. Amount of data: it can be hard to know which parts of the modeled system are actually relevant, and to tell the cause from the consequence: simulation does not supersede analysis.

3. Purely simulation-based analysis can be very slow, as each potential bottleneck has to be tested separately and results in a whole new simulation run. Furthermore, cycle-accurate simulators modeling all the details of modern CPUs' hardware implementations are not publicly available, limiting their applicability to performance analysis for general application developers.

2.2.3.3 Improving Simulation Speeds

Different approaches have been developed to improve the speed of simulation:

1. Sampled simulation [47, 48]: simulating only parts of the execution in detail makes it possible to use faster techniques (e.g. emulation) for most of the execution time. The quality of the results will then depend on the sampling rate and the quality (representativeness) of the samples.

2. Using dedicated simulation hardware [49, 50]: using FPGAs can greatly help with achieving better simulation speeds, as the hardware running the simulation is specifically fine-tuned for this task.

3. Interval simulation [51]: focuses on modeling the impact of miss events (e.g. cache misses, branch mispredictions) and uses a simplified, fast model for the rest of the execution.

2.2.3.4 Functional Simulation

Specialized functional simulators can also provide interesting information. Such tools focus on reproducing realistic behaviors with no particular regard for timing. For instance, tools like Cachegrind [52] track memory accesses and replay them in a realistic cache structure, evaluating how and whether programs use caches. Other tools [53] can also characterize multi-threaded workloads.
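As an illustration of the kind of processing such functional tools perform, the sketch below (in C) replays a previously collected trace of memory addresses through a toy direct-mapped cache model and counts hits and misses. The raw 64-bit-address trace format, the cache geometry and the file name are assumptions made purely for this example; real tools such as Cachegrind model multi-level set-associative caches and intercept the accesses on the fly rather than reading them from a file.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINES 512           /* 512 lines of 64 B: a 32 KB direct-mapped toy cache */
#define LINE_BITS   6             /* 64-byte cache lines */

int main(int argc, char **argv)
{
    /* hypothetical trace: raw 64-bit memory addresses, one per access */
    FILE *trace = fopen(argc > 1 ? argv[1] : "mem_trace.bin", "rb");
    if (!trace) { perror("fopen"); return 1; }

    uint64_t tags[CACHE_LINES];
    memset(tags, 0xff, sizeof tags);          /* mark every line as invalid */

    uint64_t addr, hits = 0, misses = 0;
    while (fread(&addr, sizeof addr, 1, trace) == 1) {
        uint64_t line = addr >> LINE_BITS;    /* drop the offset within the line */
        uint64_t set  = line % CACHE_LINES;   /* direct-mapped: one possible slot */
        if (tags[set] == line) hits++;
        else { misses++; tags[set] = line; }  /* miss: install the new line */
    }
    fclose(trace);
    printf("hits=%llu misses=%llu\n", (unsigned long long)hits, (unsigned long long)misses);
    return 0;
}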

2.3 Using Codelets

Tackling important problems can often be simplified by decomposing them into smaller ones, which are more easily understood and fixed. The codelet approach applies this principle to modeling and optimization problems, focusing on smaller problems first in order to then approach the bigger ones.

2.3.1 Codelet Presentation

Codelet definitions may vary depending on their granularity and purposes. We will hence clarify our use of the term, as well as some of codelets’ main advantages and drawbacks.

2.3.1.1 Definition

A codelet is a small piece of code that may be evaluated independently. Codelets can be coupled to drivers, intentionally minimal code complements allowing them to be run in a stand-alone manner. In the context of this thesis, codelets will always be loop nests. However, coarser-grain codelets like functions could also be used.
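As a purely illustrative sketch (our actual codelets are Fortran loop nests extracted from real codes, and their drivers are more elaborate), a codelet coupled to its driver could look as follows in C; the daxpy-like kernel and the default sizes are arbitrary choices for the example:

/* codelet.c -- compiled separately so the compiler cannot see that the
 * results are unused and optimize the calls away */
void codelet(int n, double a[], const double b[], double scale)
{
    /* single innermost loop: the piece of code actually under study */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + scale * b[i];
}

/* driver.c -- minimal complement making the codelet runnable stand-alone */
#include <stdlib.h>
void codelet(int n, double a[], const double b[], double scale);

int main(int argc, char **argv)
{
    int n    = argc > 1 ? atoi(argv[1]) : 10000;   /* data-set size */
    int reps = argc > 2 ? atoi(argv[2]) : 1000;    /* repetition count */
    double *a = calloc(n, sizeof *a);
    double *b = calloc(n, sizeof *b);
    if (!a || !b) return 1;
    for (int r = 0; r < reps; r++)                 /* repetition loop */
        codelet(n, a, b, 0.5);
    free(a);
    free(b);
    return 0;
}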

2.3.1.2 Why use codelets?

Whether or not codelets are interesting depends on their intended uses:

1. For modeling, they are particularly interesting if they display some novel and/or unexpected behavior, as their small size helps pinpoint the issue.

2. For optimization, codelets can isolate significant performance problems, allow- ing potential solutions to be tested in a vacuum.

3. HPC scheduling: codelets could be used as a finer schedulable workload than threads [54], making it possible to better use supercomputers' very high numbers of cores.

Codelets can offer important advantages in all of these cases:

1. Evaluating the impact of the environment: codelets' behavior variations depending on the execution environment (e.g. different host machines or CPU operating frequencies) can be determined in a cost-effective way due to their stand-alone nature and their normally small execution times.

2. Varying data sets: some codelets may allow for arbitrary input data sets to be used, making it possible to see how their behavior evolves with varying cache localities.

3. Optimization tests: codelets being stand-alone programs allows programmers to test different optimizations (or compiler optimization flags, compiler versions, etc.) and only impact the intended code. It would not be as simple with e.g. real applications, in which potential optimizations may have an impact on the whole program.

2.3.1.3 Drawbacks of Codelets

While codelets may be very convenient, their stand-alone property may raise problems such as:

1. Static interference: the compiler may alter or reorder statements to try to produce more efficient code (e.g. loop interchange, inlining...), or certain optimizations may be overly optimistic (e.g. optimizing data structures for individual loops whilst they are used by a whole program). Studying a piece of code out of its original (or a realistic) context may hence create unnatural circumstances.

2. Cold/warm cache [55]: the state of the caches when running a codelet may have a strong impact on its observed behavior. Several possibilities exist (completely cold caches, completely warm ones, or various degrees in-between). As there is no universal strategy to create a single most-appropriate execution environment, users have to make this potentially complex decision on a per-case basis.

3. Dynamic influence from and on other loops: in a real application, a loop may have a strong influence on how other loops behave and in turn be influenced by other loops' execution. For instance, a previous loop could warm up caches or on the contrary thrash them, and instructions from different loops can also cohabit in the execution pipeline due to superscalar mechanisms. Combining the codes of two codelets together may hence produce different results than when running them separately.

These issues may ultimately cause modeling or optimization projection errors. They represent a limitation that codelet users should keep in mind.

2.3.2 Artificial Codelets

Developing artificial codelets allows for fine control of experiments. However, they require codelet creators to already have a clear idea of what they want to test, and how.

2.3.2.1 Hand-Crafted Codelets

Some codelets can be handmade to test something specific. Furthermore, small benchmarks do fit our definition of codelets. On the hardware side, [56, 57] use simple benchmarks to characterize the effective bandwidth of a machine. [57] also covers other potential bottlenecks such as memory latency or network bandwidth. On the software side, [58, 59] use handmade (and legally vectorizable) loops to test auto-vectorizing compilers' abilities.
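A minimal hand-crafted bandwidth codelet could look like the sketch below, which streams through an array much larger than the caches and reports an effective read bandwidth; the array size and the simple sum kernel are arbitrary choices for the example, whereas the benchmarks cited above use several kernels and more careful timing and reporting.

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)      /* 512 MB of doubles: far larger than any cache */

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;      /* touch every page once */

    struct timespec t0, t1;
    double sum = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) sum += a[i];     /* stream through the array */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("effective read bandwidth: %.2f GB/s (sum = %g)\n",
           (double)N * sizeof(double) / secs / 1e9, sum);
    free(a);
    return 0;
}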

2.3.2.2 Automated Generation

Codelets can also be generated automatically, following patterns of (potential) interest. Microtools [60] facilitate this process to evaluate the impact of e.g. different memory access patterns, unrolling or vectorization. They can also measure the execution times of the resulting codelets.

Henri Wong [61] generates loops using pointer-chasing to expose the number of out-of-order resources in the ROB and the Physical Register File in Intel microarchitectures, as well as to test the impact of zero-idiom and register move instructions on register consumption.
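The basic building block of such experiments is a chain of dependent loads, as sketched below in C; the node count and the use of a shuffled cyclic permutation are illustrative choices, and the actual experiments in [61] additionally surround the chain with a varying number of filler instructions and time it to detect when a given out-of-order resource fills up.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1u << 22;                    /* 4M pointers (32 MB): RAM-resident */
    void **chain = malloc(n * sizeof *chain);
    size_t *order = malloc(n * sizeof *order);
    if (!chain || !order) return 1;

    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)                /* link the nodes in shuffled order */
        chain[order[i]] = &chain[order[(i + 1) % n]];

    void **p = &chain[order[0]];
    for (size_t i = 0; i < n; i++)                /* each load depends on the previous one */
        p = (void **)*p;

    printf("%p\n", (void *)p);                    /* keep the chase from being optimized out */
    free(chain);
    free(order);
    return 0;
}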

2.3.3 Extracted Codelets

Codelet extraction is the process of isolating a program's hotspot as a codelet. Such codelets offer various advantages over synthetic ones:

1. They represent realistic workloads. For instance, on SNB, there is an important penalty for using both SSE and AVX instructions in the same loop. While a synthetic codelet can target this problem exactly, mixing such instructions is not a mistake compilers would make in realistic circumstances, which greatly lowers its relevance. Extracted codelets expose real day-to-day problems and bottlenecks. Having realistic workloads allows models to be trained on codes directly relevant to their objectives (e.g. using HPC codelets to model HPC applications).

2. They can be used to study each of an application's hotspots separately, allowing said application's performance to be modeled in small bricks rather than as a whole [62].

2.3.3.1 Manual Extraction

Codelets can be extracted manually, e.g. by copying the source code of a loop. This is a slow and error-prone process which can only reasonably be used when the number of interesting codelet candidates is low. On the other hand, it is a fairly simple process, and allows for implementation liberties, e.g. to let arbitrary data sets be processed by the loop.

2.3.3.2 Automated Extraction

Automating the extraction process greatly reduces the costs of codelet extraction, as well as the hazards inherent to manual processing. It also makes it realistic to cover the majority of the execution time in HPC applications potentially comprising hundreds of hotspots. Codelet extraction tools can operate at different levels. For instance, [63, 64, 65] extract hotspots at the source level, also saving the runtime state of the memory so that the resulting codelet can be replayed faithfully. A drawback of this method is that there may be discrepancies between the compiled codes generated for the original and extracted versions. On the other end of the spectrum, [66] identifies codelet-like structures called simulation points. They are pieces of code pinpointed at the assembly level for the purpose of speeding up simulation. This technique is extremely faithful in terms of assembly code and memory states, but limits the adaptability and portability of the identified workloads. An in-between approach is adopted in [67], where hotspots are extracted at a compilation-time Intermediate Representation (IR) level. It offers advantages in terms of extraction complexity as the IR may be language-independent, but creates a dependency on the used compiler instead. Unlike source-level extraction, it does not allow for codelets to be easily modified, but is more flexible than simulation points as it allows for different optimization flags and data sets to be tried. [68] adapts this approach to parallel codes, operating at a coarser granularity to quickly evaluate the scalability of OpenMP parallel regions.

Chapter 3

Codelet Performance Measurement Framework

The complexity of modern CPUs makes loop performance the result of many factors simultaneously coming into play. Identifying these factors, their individual impact and how they interact with one another correspondingly becomes increasingly difficult.

In this chapter, we approach this issue by producing fine-grained data on various (but mostly simple) loops in a controlled in vitro environment. The aim is to produce reliable data that makes it possible to single out certain components' behavior, check implementation hypotheses and validate low-level performance-evaluation tools.

3.1 Introduction

Modern CPUs are developed in a highly competitive industrial environment, pushing companies toward different strategies to protect their competitive advantage:

1. Patenting: while offering legal protection against plagiarism, patents have the downsides of a) only protecting implementations, b) requiring the publication of the material to protect and c) being temporary. Furthermore, as they can be costly to establish, not all innovative solutions may be deemed worth the matching cost.

2. Secrecy: confidentiality can protect ideas (outside the realm of patentable materials), but very little can be done against plagiarism if the material gets leaked or stolen. This forces companies to be conservative in terms of releasing technical implementation details.

In this environment, performance researchers have to deal not just with the complexity brought by state-of-the-art CPUs comprising billions of transistors, but also with the unknowns induced by hardware characteristics not disclosed by CPU manufacturers. We specifically target Intel microarchitectures, for which the manuals provided by the manufacturer still provide a sizeable amount of information on the microarchitectures' pipelines and hence allow for a reliable overview of the core pipelines. Further to this, information from patents can provide interesting leads on implemented hardware mechanisms, though patents can exist without the mechanisms they describe actually being implemented in real-world products. Empirical data can be used to:

1. Independently validate information from manuals.

2. Test implementation hypotheses (and empirically expose undocumented characteristics).

3. Validate performance models and find unexpected, unmodeled and/or undocumented phenomena.

This chapter will focus on the production of such reliable, low-level data, from which other chapters of this thesis will draw information quite extensively. We will present the target codes, the followed methodology and the experimental parameters we played with to capture the codes’ performance in different contexts. We will also present PCR, the data repository we developed to store and exploit the generated data.

3.2 Target Loops: Numerical Recipes Codelets

The choice of the loops to study is important as it determines not just the breadth of captured phenomena but also their relevance. For instance, picking codes that are too similar would expose only a few of the hardware components' behaviors. Furthermore, if said components only have an impact on performance in extreme corner cases, one can question the importance of modeling them in detail. We hence picked loops from the Numerical Recipes [69], which present the advantages of being relatively diverse while also tackling real and significant numerical challenges. We will describe how and why we obtained these loops, and present the main implementations of the codelets we selected.

3.2.1 Obtention and Target Properties

Our NR codelet collection is the result of both automatic loop extraction (using [64]) and manual cherry-picking. The desired properties were as follows:

1. Single innermost loop: there should only be a single innermost loop in the compiled version of the codelet. This allows us to attribute all measured performance characteristics to a narrowly defined piece of code, and helps simplify performance analysis.

2. No innerloop branching: the only branch instruction should be the one allowing the loop to iterate (in compilation terms, the loop comprises a single basic block). Branches can add a lot of complexity to performance analysis, especially if they depend on input data values.

3. Data-value invariant iteration test: it may (and should) vary depending on the dataset size, but should not depend on a dataset's values. This makes the code more DECAN-tolerant, allowing it to suffer more semantic loss when getting patched without causing unwanted side effects (e.g. infinite loops). While some workarounds could be implemented for such issues, we want to keep the simplicity we aim for with these codelets.

4. Work on arbitrary dataset sizes (allowing us to study the codelet's interaction with the caches and main memory).

5. Perform FP operations. Mere memory transfers, for instance, would not be as interesting to analyze with DECAN's classical variants. Furthermore, while integer computation can be present in hotspots, FP performance (i.e. FLOPS)

is often the baseline to evaluate a system's (or an application's) performance in scientific computing, and we hence focused on it.

On top of this, we wrote several versions of the same extracted loops, altering characteristics such as the FP precision, the unroll factor, the instruction set to use or whether or not to vectorize.

3.2.2 Presentation and Categories

We succinctly present our main versions of the NR codelets in Table 3.1. Column Codelet gives the name by which we commonly refer to these codelets, and which we will use throughout this thesis manuscript. The names are composed as follows:

1. The first part of the name (e.g. balanc for balanc_3_de) refers to the function from which the codelet was extracted.

2. The following number (3 here) is an ID given by Astex [64] during the extraction process.

3. Finally, the suffix (de in our example) gives our re-implementation details: the first letter indicates the FP precision used in computations ('d' for double precision, 's' for single precision and 'm' for mixed-precision code), and the second letter represents the used instruction set ('e' for SSE 4.2 and 'x' for AVX). Although they are not the versions we use by default, we also generated AVX versions of all the codelets presented here.

Column NR Context explains the purpose of the code from which the codelet was extracted. We also show the type of loop nest in the codelet (column Loop Structure):

1. 1D codelets do not have "active" nested loops (if they exist, they may be executed once but are not iterated).

2. 2D codelets have "active" nested loops and always operate on square matrices.

3. 2DT codelets are similar to 2D ones, except they follow triangular access patterns (and may consequently have varying numbers of innermost iterations for a given data set), making data accesses and branches harder to predict.

As loop performance is often dictated by memory, and memory performance by data access patterns within the loop, we decided to also show codelets' memory access strides (i.e. the distance between accessed elements) in column Access Pattern, to help appreciate codelets' memory access efficiency:

1. Stride 1 loops make good use of the cache structure by fully using loaded cache lines.

2. LDA (Leading Dimension Access) codelets' innermost loop accesses data following a 2D array's leading dimension, causing them to access elements that are not contiguous in memory. The effective stride depends on the data set.

3. CLDA (Constant LDA) codelets' stride is not 1, but is constant. This allows memory performance to be more consistent across data sizes, and happens due to 1D loops exploring rectangular matrices column-wise (in Fortran).

Table 3.1: NR Codelet Suite

Codelet       | NR Context                                         | Description                                                  | Loop Structure | Access Pattern
balanc_3_de   | Matrix balancer                                    | Array scaling                                                | 1D             | 1
elmhes_10_de  | Hessenberg form reduction                          | Daxpy-like                                                   | 1D             | 1
elmhes_11_de  | Hessenberg form reduction                          | Daxpy-like                                                   | 1D             | CLDA
four1_2_me    | FFT                                                | Complex operations on an array                               | 1D             | 1
hqr_12_se     | Eigenvalues finder                                 | Reduction with absolute value                                | 2DT            | 1
hqr_13_de     | Eigenvalues finder                                 | Reduction with absolute value                                | 1D             | 1
hqr_15_se     | Eigenvalues finder                                 | Subtracting constant from matrix diagonal                    | 1D             | LDA
jacobi_5_se   | Jacobi method to find eigenvalues and eigenvectors | Reduction with absolute value                                | 2DT            | 1
lop_13_de     | Nonlinear elliptic equation solver                 | 5-point stencil computation                                  | 2D             | LDA
ludcmp_4_se   | Lower upper matrix decomposition                   | Complex reduction                                            | 2DT            | 1
matadd_16_de  | Nonlinear elliptic equation solver                 | Adding 2 matrices into a 3rd                                 | 2D             | 1
mprove_8_me   | Linear equation solution improvement               | Reduction of matrix and vector                               | 2D             | 1
mprove_9_de   | Linear equation solution improvement               | Array subtraction                                            | 1D             | 1
realft2_4_de  | Inverse FT                                         | Complex operations on an array                               | 1D             | 1
realft_4_de   | Inverse FT                                         | Realft2_4_de w/ truncated dependencies                       | 1D             | 1
relax2_26_de  | Gauss-Seidel matrix relaxation                     | 5-point stencil computation                                  | 2D             | LDA
rstrct_29_de  | Half-weighting restriction                         | 5-point stencil computation                                  | 2D             | LDA
svbksb_3_se   | Singular value backsubstitution                    | Reduction of matrix and vector                               | 2D             | 1
svdcmp_6_de   | Matrix singular value decomposition                | Reduction with absolute value                                | 1D             | CLDA
svdcmp_11_de  | Matrix singular value decomposition                | Array scaling                                                | 1D             | CLDA
svdcmp_13_de  | Matrix singular value decomposition                | Dividing array elements + reduction                          | 1D             | 1
svdcmp_14_de  | Matrix singular value decomposition                | Dividing array elements, storing results in a separate array | 1D             | 1
toeplz_1_de   | Toeplitz matrix solver                             | Reduction on 3 arrays                                        | 1D             | 1
toeplz_2_de   | Toeplitz matrix solver                             | Daxpy-like                                                   | 1D             | 1
toeplz_4_de   | Toeplitz matrix solver                             | Daxpy-like                                                   | 1D             | 1
tridag_1_de   | Tridiagonal matrix solver                          | Complex operations on 3 arrays                               | 1D             | 1
tridag_2_de   | Tridiagonal matrix solver                          | Daxpy-like                                                   | 1D             | 1

List of the 26 in vitro codelets extracted from the Numerical Recipes [69, 70, 71], and realft_4_de (which we crafted by truncating realft2_4_de).

On an important note, this characterization was applied after first experiments were run, allowing us to account for modifications applied by the compiler. For instance, the compiler interchanges loops in matadd_16_de, causing it to be a stride 1 codelet even though accesses follow the leading dimension in the source code.

3.3 Measurement Methodology

A rigorous measurement methodology is particularly important for our intended purpose, as the worth of our measured data is entirely dictated by how reliable it is.

3.3.1 Placing Probes

Fine-grain computer measurements raise several general problems:

1. Observability: watching the experiment may change the result. Software probes are convenient to measure the CPU's activity, but the CPU processing them incurs a measurement overhead affecting the result value. This can be mitigated with event sampling, but at the cost of precision. Hardware debuggers may also help in this regard as a large part of the observation cost is externalized, though even they may impact the observed behavior.

2. Reproducibility: while many factors can be controlled, some cannot, and it is impossible to guarantee the same intended experiment will consistently produce the same output. This is especially true with parallel programs (whose scheduling is partially random) and memory states (e.g. virtual memory mapping).

3. In pipelined machines, and even more so in out-of-order ones, it is hard to exactly pinpoint when the loop to measure starts, and when it ends. Is it when the first instruction of the loop gets decoded, executed, or retired? What do we do about instructions foreign to the loop cohabiting in the pipeline?

It is hence particularly important to use probes adequately. Here, we will describe our probe setup for time on the one hand, and for general event counters on the other.

3.3.1.1 Measuring Time

We use the RDTSC [23, 72] probe provided by DECAN [37] to measure time. As it is particularly inexpensive, we can place it precisely before and after the target binary loop (see Figure 3.1) while still retaining an acceptable overhead. We define the “cycles per iteration” metric derived from RDTSC as:

CycPI_R = elapsed_time(RDTSC) / nb_assembly_level_iterations

Note: when talking about time, we always use reference cycles, whose length is independent from the active core frequency. On Intel CPUs, they usually tick at the same rate as the maximum non-turbo frequency available on the CPU.
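For illustration only (DECAN inserts the actual RDTSC probes at the binary level, directly around the target assembly loop), the principle behind CycPI_R can be sketched in C as follows; the kernel, the array size and the use of the source-level trip count as the iteration count are assumptions made for the example, since vectorization or unrolling would make the assembly-level count differ:

#include <stdio.h>
#include <x86intrin.h>                            /* __rdtsc() */

static double kernel(const double *a, long n)     /* stand-in for the target loop */
{
    double s = 0.0;
    for (long i = 0; i < n; i++) s += a[i];
    return s;
}

int main(void)
{
    enum { N = 100000 };
    static double a[N];                           /* zero-initialized */
    unsigned long long t0 = __rdtsc();            /* probe just before the loop */
    double s = kernel(a, N);
    unsigned long long t1 = __rdtsc();            /* probe just after the loop */

    /* CycPI_R = elapsed_time(RDTSC) / nb_assembly_level_iterations */
    printf("CycPI_R ~ %.2f (s = %g)\n", (double)(t1 - t0) / N, s);
    return 0;
}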

3.3.1.2 Measuring Hardware Events

The probe for hardware counters (or HWC) is considerably more expensive, as the counters require leaving the Linux userland to get initialized and read. This is particularly true when measuring related events, as different intermediary steps need to be taken (both for client CPUs [73] and for server versions [74]). We hence had to place it at a considerably coarser level to keep the probe overhead low, and measure the whole program in lieu of just the target loop (see Figure 3.1). This is only acceptable under the following conditions, which we ensure are met during measurements (see Section 3.3.2.1):

1. Initialization costs are negligible.

2. The target binary loop is the only significant active loop in the codelet at runtime.

Certain counters can also be used to report time in reference cycles (for instance, CPU_CLK_UNHALTED.REF in Sandy Bridge): we define the "cycles per iteration" metric derived from them as:

CycPI_H = elapsed_time(HWC) / nb_assembly_level_iterations

Tools like Likwid [33] are used for such HWC measurements.

3.3.1.3 Combined Time Metric (ECPI)

We also define an "enhanced" time metric combining information from both the RDTSC and the HWC time values, which we will use as our default time metric throughout this manuscript:

ECPI = min(CycPI_R, CycPI_H)

This metric gives us the best of both worlds: if the neighboring loops are important, CycPI_R is closer to the time actually taken by the loop than its HWC counterpart. On the other hand, for very small loops where the overhead of the RDTSC probe might not be negligible, CycPI_H may be more accurate.

3.3.2 Measurement Quality and Stability

Measurement quality is a more abstract property, aiming to make sure measurements are representative of what an experiment is meant to measure. Measurement stability is important to ensure results are a) reproducible and b) not showing a statistically aberrant behavior. In this section, we will explain how we pursue these two orthogonal objectives.

3.3.2.1 Measurement Quality

Controlling noise is a good way to increase the quality of results:

1. The proportion of iterations spent in the target loop (out of all innermost loop iterations inside the codelet) is checked to make sure only one loop is significant. This metric needs to be close to 1 to reasonably attribute all hardware counter events to the target loop (due to the probe placement described in Section 3.3.1.2).

2. All measurements are performed on dedicated machines, limiting the noise created by other processes in the system. Noise could otherwise be created due to certain resources being shared on the system, such as RAM bandwidth, cache space or execution time slices.

3. A repetition loop ensures that the caches are properly warmed up [55, 75] for all but the first codelet call. It also makes sure that the process startup, array initialization costs and the first codelet call (likely to be an outlier) contribute only insignificantly to the overall result in both time and counter measurements. The number of repetitions is calibrated so that the program runs for at least 1 second, and is never less than 100.

4. The program is pinned to an execution core, forbidding the system from migrating it to different ones. Migrations can impact performance because e.g. some of the caches are core-specific, causing some data locality to be lost. The first core (on each present CPU) is avoided, as it is sometimes more solicited by the OS than the others.

5. All CPU cores are set to a constant frequency, avoiding the variance induced by on-demand or turbo frequency scaling. It also makes it easier to interpret results.

3.3.2.2 Measurement Stability

Measurement stability is obtained by the use of meta-repetitions: the whole experiment is run several times, giving us several sets of measurements. The median value is then calculated independently for each metric, and used as the reference result. A "D" is suffixed to the name of metrics computed this way to clarify their origin (e.g. CycPI_R_D).

The number of meta-repetitions favoured in the literature is 31 [76, 77, 78], though we did not notice significant differences when using only 11 (or even 5) in our in vitro experiments. This is partly due to the quality controls explained earlier, as well as to the absence of parallel code in our target codelets. Hence, some of the results presented may have been produced with numbers of meta-repetitions ranging from 5 to 31. The practical consideration in keeping this number low is the total experimental time needed: a tradeoff must be made.
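A minimal sketch of how such a per-metric median (the "_D" value) could be computed over the meta-repetitions is shown below; taking the lower of the two middle values for even counts is an arbitrary convention chosen for the example:

#include <stdlib.h>

static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* median of one metric over the meta-repetitions (sorts the input in place) */
double median_over_meta_reps(double *values, int nb_meta_reps)
{
    qsort(values, nb_meta_reps, sizeof *values, cmp_double);
    return values[(nb_meta_reps - 1) / 2];
}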

3.3.2.3 Codelet Program Structure

Figure 3.1 presents the overall structure of our codelet processes (as well as where we place our different software probes). The codelet is isolated in its own function and compiled separately, preventing unwanted compiler optimizations (e.g. finding out the codelet's results are never used and removing all codelet calls from the binary). A driver program handles the initialization of the codelet's arguments as well as the repetition loop described earlier.

It is important to note that the repetition loop does little in terms of bettering hardware counter measurements when the instructions represented as "..." in the figure represent a sizeable part of the function's execution time, as they too get repeated. This can happen for very small data sets (in which peel and tail loops can process a significant part of the workload) or in cases of loop splitting (where the compiler decides to distribute the source loop's statements over several assembly loops), regardless of whether there is only one innermost loop at the source level (as is the case for all the NR codelets we selected).

HW Counters (start) // Only when measuring hardware counters
'driver' process {
    read arguments (data size, nb repetitions)
    allocate and initialize codelet arrays

    repetition loop {
        codelet function (arrays, data size, ...) {
            ...
            RDTSC (start) // Only when measuring time
            target binary loop {
                instructions of interest
            }
            RDTSC (stop) // Only when measuring time
            ...
        }
    }

    free allocated memory
}
HW Counters (stop) // Only when measuring hardware counters

Figure 3.1: Codelet Structure and Probe Placement

3.3.3 CQA Reports

Detailed CQA [21] reports are generated for the studied loop, providing valuable insights on its theoretical peak performance and a good performance decomposition for L1 data sets. Such reports can also help confirm that the measurements make sense, i.e. that no aberrant values are produced.

3.4 Varying Experimental Parameters

Codelets can be studied from many different angles: we apply various DECAN [37] transformations to them, and make them run for diverse data sizes, on different machines (see Table 3.2), core frequencies or memory loads. We will present the experimental variations supported by the framework and show example results in this section.

Table 3.2: Machine List

Machine Alias | CPU SKU    | Uarch | Nb of Cores | LLC   | Min Freq | Max Freq | Note
HSW           | E3-1270 v3 | HSW   | 4           | 8 MB  | 0.8 GHz  | 3.5 GHz  |
IVB           | I7-3770    | IVB   | 4           | 8 MB  | 1.6 GHz  | 3.4 GHz  |
SLM           | C2750      | SLM   | 4           | 4 MB  | 1.2 GHz  | 2.4 GHz  |
SNB1          | E5-4640    | SNB   | 8           | 20 MB | 1.2 GHz  | 2.4 GHz  | Transparent Huge Pages enabled
SNB2          | E5-2640    | SNB   | 6           | 15 MB | 1.2 GHz  | 2.5 GHz  | Poor RAM configuration
SNB3          | E5-2640    | SNB   | 6           | 15 MB | 1.2 GHz  | 2.5 GHz  |
SNB4          | E3-1240    | SNB   | 4           | 8 MB  | 1.6 GHz  | 3.3 GHz  |

This table summarizes some of the essential characteristics of the machines we used to run our experiments, and which will be referred to throughout this manuscript. All machines used a 64-bit version of Linux, with a kernel version greater than or equal to 2.6. Hyper-threading was disabled on all machines (except for Silvermont ones, as they do not support it in the first place). The poor RAM configuration on SNB2 is due to not all RAM slots being populated on the motherboard, which prevents taking full advantage of the CPU's and the motherboard's multi-channel capabilities.

3.4.1 DECAN Variants

DECAN is a tool allowing binary loops to be patched so as to extract some of their performance characteristics. The main transformations (or variants) we use are:

1. FP: removes memory accesses from the loop, exposing the time taken by floating point operations.

2. LS: removes floating point operations, exposing the time needed to perform memory accesses.

3. REF: serves as reference point for the original loop’s behavior.

An example of performance decomposition using DECAN (and this framework) is presented in Figure 3.2. This kind of information is valuable for optimization, as it allows developers to know what aspects they should prioritize. For instance, here, it is clear that any significant performance gains can only be achieved by optimizing memory accesses, and optimizing FP operations should not even be considered.

3.4.2 Data Sizes

An interesting property of the codelets we selected is that they can easily be run on arbitrary zero-filled data sets, as their control flow does not depend on the data values. This allows us to observe how their behavior evolves depending on where the data is held in the memory hierarchy.

Figure 3.2: DECAN Performance Decomposition Example (balanc_3_de)
The codelet was run on SNB1 (see Table 3.2), for a data set fitting in L3. All metrics are normalized per iteration. [ECPI_D] represents the measured execution time, L1D_REPLACEMENT_ND the measured number of cache lines coming into L1 (using hardware counters), and P0_ND, P1_ND and P5_ND represent the numbers of uops respectively sent to ports 0, 1 and 5 (according to CQA).
We can see that the patched FP loop preserves the port load on P0 and the other ports mostly responsible for FP operations and control flow, while discarding memory activity (cache traffic is null, and memory port P2 has no activity). We can also see that the patched LS loop preserves the memory traffic and P2 activity while significantly decreasing the number of operations on P0. The activity on P5 is kept and is due to preserving the control flow.
Running the different patched loops makes it easy to see that memory operations are responsible for most of the execution time here.

Each codelet works on arrays whose sizes are determined by an input variable N, which we vary to change data set sizes. However, as codelets are free to do as they wish with this value (e.g. 1D codelets can allocate N elements per array, whilst 2D codelets allocate N^2 of them), may work with elements in single and/or double precision, and can operate on different numbers of arrays anyway, a given value of N does not necessarily correspond to the same data set size across codelets.

An example of the advantage obtained by varying data sizes is given in Figure 3.3. Here, the codelet shows a different behavior depending on where the data is held in the memory hierarchy. This can give interesting insights as to how the codelet should be optimized depending on the target workload: if the codelet is typically called for small data sets, then FP operations should be looked into; otherwise, trying to optimize memory accesses should take priority.

3.4.3 Machines and Microarchitectures

Performance may vary depending on the microarchitecture, the CPU itself or the machine's configuration. It can then be interesting to see how codelets' behavior changes depending on the machine. An example is given in Figure 3.4. Here, the optimization impact is more hardware- than software-related. Developers can use these experiments to know which microarchitecture best fits their needs.

Figure 3.3: Example of Codelet Behavior Across Dataset Sizes (toeplz_4_de)
The codelet was run on SNB1 (see Table 3.2). The y-axis represents cycles per iteration (ECPI_D), while the x-axis shows the input values for N (which controls the size of the dataset the codelet works on). The input values were chosen arbitrarily with the objective of sweeping through the different behaviors codelets can exhibit depending on data cache locality.
We can see the relative contribution of LS to REF's performance level increases as the size of the data set grows, pushing the data further and further in the memory hierarchy. Thanks to LS, we can see the data likely fits in L1 for data points 1000 and 2000, then in L2 until point 10 000 (included), then in L3 until point 800 000, transitioning to RAM on point 1 000 000, and fully in RAM on later points.
The performance for FP is constant across data sets, which makes sense considering this DECAN variant does not read or write data from/to the memory hierarchy, and is hence unimpacted by cache limitations. We can consequently see FP operations are the bottleneck in L1 and L2, but have a lesser impact in L3 and RAM.

3.4.4 Frequencies

Energy consumption is one of the three classical considerations for performance (along with time and memory space), and can be optimized by adjusting the operating frequency of the CPU cores [79, 80]. Indeed, power consumption can be approximated by the following formula [81, 82], which shows that it scales with frequency:

Power = Capacitance * Voltage^2 * Frequency
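As a purely illustrative calculation (the slowdown figure is hypothetical, the voltage is assumed constant and static power is ignored), halving the core frequency of a RAM-bound codelet whose execution time only grows by 30% would give:

Power(f/2) ~ Power(f) / 2
Energy = Power * time
Energy(f/2) ~ 0.5 * 1.3 * Energy(f) = 0.65 * Energy(f)

In practice, the voltage reduction that usually accompanies lower frequencies would make the savings larger still, which is why such frequency scaling experiments are worth running.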

Studying how well codelets scale with frequency changes can help find cases in which frequency scaling is profitable; the framework hence allows measurements to be done at different operating frequencies using the CPUFreq [83] system. Figure 3.5 shows an example of such a frequency scaling analysis using our framework. In this case, there may be room for frequency-driven energy savings for RAM data sets.

Note: all experiments are performed at the highest base frequency available (i.e. the "reference" frequency of the studied CPUs) unless otherwise specified.

Figure 3.4: Behavior across Machines (toeplz_1_de)
The y-axis represents cycles per iteration (ECPI_D). The x-axis shows the input values for the N variable, which governs the size of the dataset. Detailed descriptions of the machines are available in Table 3.2. As a quick summary, SNB1 is a machine with a Sandy Bridge CPU with 20 MB of L3 cache operating at 2.4 GHz. SNB2 is a machine with a Sandy Bridge CPU with 15 MB of L3 cache, operating at 2.5 GHz, and with poorly populated RAM slots (reducing the peak RAM bandwidth). HSW is a machine with a Haswell CPU with 8 MB of L3 cache operating at 3.5 GHz. SLM is a machine with a Silvermont CPU with 4 MB of L2 cache operating at 2.4 GHz. Note: only 2 MB are accessible by a given Silvermont core, so the effective L2 cache is 2 MB here.
We can see that performance per clock varies depending on the microarchitecture, with Silvermont falling far behind Sandy Bridge and Haswell even for RAM data sets. The effect of cache sizes is also particularly visible, with Silvermont's 2 MB no longer holding the data as early as point 60 000.
We can also see Haswell brings a noticeable improvement over Sandy Bridge when the data still fits in L3 (points below 200 000), though performance for SNB1 is slightly better for points 600 000 and 800 000 (likely due to its L3 being significantly bigger). Also, one should keep in mind HSW is running at 3.5 GHz (against respectively 2.4 and 2.5 GHz for SNB1 and SNB2), contributing to make RAM look slow on this graph. Finally, we can see SNB1 works better than SNB2 when RAM gets in the picture, which is to be expected considering it is configured better in this regard.

3.4.5 Memory Load (using Memload)

Our in vitro codelets are run on a single core, with the rest of the machine kept as idle as possible to get stable and clean measurements. However, this also puts them in an ideal scenario where they can access all shared resources with no competition. We developed Memload, a small tool allowing us to observe codelets' behavior when other programs make use of the (shared) RAM bandwidth, providing a more realistic execution environment.

Memload is a multi-threaded program that streams through arrays for no other purpose than to consume bandwidth from the memory hierarchy. Its threads get pinned on the cores not used by the codelet so as to only impact CPU-wide resources, and their bandwidth consumption is frequently monitored and regulated. Inputs can be used to control Memload's activity:

Figure 3.5: Frequency Scaling Example (elmhes_11_de and svdcmp_13_de)
The codelets were run on a Sandy Bridge CPU. The y-axis represents cycles per iteration (ECPI_D). The speedup values are graded on the secondary y-axis, and represent the speedup achieved when switching from the lower frequency to the higher one. The x-axis shows the input values for the N variable, which governs the size of the dataset.
Here, we can see svdcmp_13_de scales perfectly with the frequency change across all data sets: running it at the high frequency pays off. The speedup is also important for elmhes_11_de, but considerably less so when the data comes from RAM: it falls to 1.5x (from 2x) at point 200 000. There may be room for energy consumption savings here.

1. Size of the array to stream through: this allows the user to target a given level of the memory hierarchy. For instance, if the array fits in L2, then Memload will not impact the effective bandwidths of L3 and RAM.

2. Target bandwidth consumption: the user specifies the bandwidth Memload should try to consume (in MB per second). The tool may fail to reach this number (if the target is unrealistic or the bandwidth has reached a saturation point), but will throttle itself if it exceeds it.

3. Access mode: regular loads can be used to impact both the effective bandwidths and the available cache space, or non-temporal stores (a.k.a. streaming stores) to impact the bandwidth without consuming cache space. In the latter case, RAM bandwidth will be impacted regardless of the size of the array specified in 1).

We set the Memload tool to consume as much RAM bandwidth as possible using non-temporal stores: this makes it possible to see the codelet's behavior in the worst-case scenario while keeping regular and memloaded results directly comparable for a given data size (i.e. the codelet's data does not get shifted to a higher level due to cache space being occupied by Memload).
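A minimal single-threaded sketch of this non-temporal-store access mode is given below in C; the buffer size and sweep count are arbitrary, and the real Memload additionally runs several pinned worker threads and throttles them toward a target bandwidth:

#include <emmintrin.h>            /* _mm_stream_si128: SSE2 non-temporal stores */
#include <stdlib.h>

/* stream non-temporal stores over a buffer to consume RAM bandwidth
 * without polluting the caches */
static void consume_bandwidth(char *buf, size_t bytes, long sweeps)
{
    __m128i zero = _mm_setzero_si128();
    for (long s = 0; s < sweeps; s++)
        for (size_t i = 0; i + 16 <= bytes; i += 16)
            _mm_stream_si128((__m128i *)(buf + i), zero);   /* bypasses the caches */
    _mm_sfence();                 /* make the streaming stores globally visible */
}

int main(void)
{
    size_t bytes = (size_t)256 << 20;             /* 256 MB working buffer */
    char *buf = aligned_alloc(16, bytes);         /* 16-byte alignment for the stores */
    if (!buf) return 1;
    consume_bandwidth(buf, bytes, 100);
    free(buf);
    return 0;
}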

Figure 3.6: Memory Load Example (svdcmp_14_de)
The codelet was run on a Sandy Bridge CPU. The y-axis represents cycles per iteration (ECPI_D). The x-axis shows the input values for the N variable, which governs the size of the dataset.
Memload has no impact on the FP variant as it makes no use of the (data) cache hierarchy. LS starts getting impacted at point 800 000 (when not all the data fits in L3 anymore), reaching a peak ECPI_D at point 1 000 000. REF only gets impacted from point 1 000 000 onwards.
Svdcmp_14_de is strongly dominated by FP operations (due to divisions being very slow) in all cases when its execution core has free access to the CPU's RAM bandwidth. However, a strong memory activity makes it memory bound when most of its working data set does not fit in L3 anymore (starting from point 1 000 000), which we would not be able to see with purely single-threaded experiments.

We can see an example in Figure 3.6. Such data can help developers know what their code’s bottleneck really is when parallel workloads coexist on the CPU, and hence guide their optimization effort.

3.4.6 Overall Structure

The framework handles varying the data size, running frequency and Memload parameters. However, it is up to the user to run the framework on different host machines. Other parameters not described here (e.g. which compiler or compiler version to use, or whether CPU features such as hyper-threading and prefetchers should be active) are also left to the user. The framework's general structure is summarized in Figure 3.7.

3.5 Results Repository: PCR

Our codelet performance measurement framework allows us to produce massive amounts of data:

compile codelet
identify target loop with DECAN {
    run the program and collect iteration counts
    select loop with the highest iteration count
}
generate DECAN variants (LS, FP, etc.) for the target loop {
    CycPI_R version (with RDTSC probe)
    HWC version (without RDTSC probe)
}
static analysis for patched loops {
    extract assembly code
    CQA report generation
}
// Changing experimental parameters
for all target data sizes {
    change input file
    run the program and collect iteration counts
    ensure the target loop is still the most important one
    for all target frequencies {
        change core frequency
        for all target memory loads {
            configure Memload (if needed)
            for all DECAN variants {
                for all meta-repetitions {
                    run pinned RDTSC version
                    get HWC measurements {
                        start HWC collection
                        run pinned HWC version
                        stop HWC collection
                    }
                }
                collect median measurements
            }
        }
    }
}

Figure 3.7: Framework Structure

• There are 27 codelets in our NR collection.

• The number of collected metrics per individual experiment is around 30 (for the measured hardware counters and CycPI_R).

• The codelets are run on several machines (roughly 8 different ones for the NRs).

• The number of DECAN variants usually exceeds the 3 classical ones, and tends to be more around 5 instead.

• Each codelet is run for 21 different data sizes.

• We typically run codelets on two different frequencies (the lowest and the highest available).

• Different memory load configurations can be tried, though this is only rarely done. We can assume the Memload parameter is not significant for the purpose of giving a rough estimate.

The number of data points for the NRs' DECAN entries alone (disregarding CQA-generated values) then conservatively reaches 27 × 30 × 8 × 5 × 21 × 2 ≈ 1.3 million. This also ignores re-runs (due to tool updates) and runs with atypical experimental variations (e.g. more DECAN variants, more collected metrics...). We hence developed the Perfcloud Results Repository (PCR), a database tool to keep track of the large volumes of data produced by our framework. Indeed, a major challenge when having so much data is being able to navigate it. PCR allows users to retrieve data and easily draw comparisons, as we did throughout Section 3.4.

3.5.1 Features

PCR's main features include:

1. Remote access through a web interface.

2. Manual exploration of the results in case users are interested in a very specific piece of data.

3. Fetching important amounts of data through a combination of filters (e.g. “get all data about codelet balanc_3_de from the Numerical Recipes which was obtained on machine SNB1 ”). Usable filters include application names, codelet names, machine names or CPU types, data sizes, operating frequency, DECAN variants, etc.

4. User-specified layouts of data: while the data is always presented in tables, users can choose how to structure them. With the example above, the user could decide to have data for different DECAN variants show up in different tables, data for different data sizes in different rows, and data for different frequencies in different columns.

5. On-the-fly generation of “aggregate” metrics computed using stored data. For instance, users can specify they want an extra ECPI_D metric to be computed as the minimum of the CycPI_R and CycPI_H values.

6. Exportation of fetched data to CSV or XLS files. 3.6. Related Work 41

7. Data importation using a simple specialized input format, enabling other frameworks to produce compatible input files.

PCR can also be used by several people at the same time, allowing other researchers to store and share experimental results of their own. Some of the in vivo data presented later in the manuscript was made available to us this way. Most of the data used in this manuscript went through PCR to facilitate its storage and formatting. At the time of writing, our repository indexed over 17.5 million measured values, over more than 4300 codelets (and loops) and 22 machines.

3.5.2 Technical Details

PCR can be accessed through a web front-end, allowing it to be used remotely with any web browser. It was entirely developed in PHP, except for a Java-written module used to generate XLS files. It is backed by a SQL database, which handles most of the complexity behind indexing and retrieving data and ensures the integrity of the repository.

3.5.3 Acknowledgments

We would like to thank Clément Malet and Roméo Azangue for contributing to the development of PCR during their stay at the laboratory.

3.6 Related Work

The Microtools [60] also focus on studying codelets at a low level, using techniques such as pinning and meta-repetitions to get accurate and stable results. However, they mostly focus on purely synthetic workloads, and do not use static analysis. Hardware counters are also not currently supported.

VTune [27] integrates many functionalities similar to those used in our framework, but focuses entirely on program optimization. Our framework focuses on a single codelet, analyzing it from many different angles.

The Codelet Tuning Infrastructure (CTI) project [84] measures loops from whole applications, using both static analysis and dynamic measurements. However, as with VTune, it does not study individual loops as extensively as our framework allows, focusing instead on providing an overview at the application level. It also offers to share experimental results in a multi-user environment, further providing all necessary files to re-run experiments.

3.7 Future Work

A stability metric can be computed for each measurement, using the following formula:

stability = (median - min) / min

Adding this feature to the framework would allow for a very fine-grained evaluation of how stable the results are. The framework in general could be improved to support parallel codelets, better tackling the ever-growing needs in this domain. Supporting loops containing conditional code (other than for iterating) would also open more codelets to further scrutiny.

Some work is still needed to determine how codelets with several loops could best be supported with the described approach. A possible lead is to extend DECAN's patching capacity to single out a given loop and disable the others, allowing each relevant loop to be studied in separate runs. Better measurement speeds could also be achieved by using less strict methods than "start and stop" for counter measurements, such as sampling. Furthermore, inserting probes around the repetition loop to better target the region to measure would allow us to reduce the number of repetitions needed to get quality results. Finally, while the framework was tested on "big core" Intel microarchitectures (Sandy Bridge, Ivy Bridge, Haswell; both on desktop and server versions) and on Silvermont (Avoton), work on both the tools it uses (CQA, DECAN, etc.) and on the framework itself will be needed to support later microarchitectures, or completely different ones such as ARM. Mobile computing with smartphones and tablets represents a large chunk of today's computing use, and studying the microarchitectures lying underneath could bring interesting insights on possible optimization opportunities.

3.8 Conclusion

We have presented the measurement methodology behind our Codelet Performance Measurement Framework, combining static and dynamic analysis tools such as CQA, DECAN and Likwid to produce quality low-level measurements and insights. We have also presented our NR codelet collection, which we tested extensively through our measurement framework and which will make repeated appearances in the rest of this manuscript.

We will use this data to help create PAMDA (Performance Assessment Methodology using DECAN) in Chapter 4, to extend the Cape modeling tool and refine CQA and DECAN variants (see Chapter 5), and as validation data for our static Uop Flow Simulation loop modeling (see Chapter 7).

Chapter 4

PAMDA: Performance Assessment Methodology Using Differential Analysis

Identifying performance bottlenecks in applications is crucial to improve their efficiency, but it may be difficult to precisely assess their impact on performance: in particular, two performance problems can interact, making it difficult to isolate and therefore to correct them. We propose PAMDA [85], a methodology to single out performance problems through hierarchical bottleneck detection. Important potential performance issues are classified in a 'Performance Breakdown Tree', which is used to drive our iterative analysis cycle, prioritizing the most relevant problems. Our system relies on the MAQAO toolset and on differential analysis of the code. While MAQAO is a performance analysis and optimization tool suite, the differential analysis approach, implemented through the DECAN tool, consists in quantifying performance changes when applying controlled transformations to the target code. Our focus will be on performance issues raised by processors and memory sub-systems in multicore architectures. We will demonstrate the approach on loops extracted from real-life HPC applications.

4.1 Introduction

The recent progress of high performance architectures generates new challenges for performance evaluation tools: more complex processors (larger vectors, many cores), more complex memory systems (multiple memory levels including NUMA, multiple-level prefetch mechanisms), more complex systems (large increases in core counts, up to several hundreds of thousands now) are all key issues which need to be simultaneously optimized to get a decent performance level. To work properly, all of these mechanisms require specific properties from the target code. For example, good exploitation of memory hierarchies relies on good spatial and temporal locality within the target code. The lack of such properties induces variable performance penalties: such combinations (mismatches between hardware and software) are denoted performance pathologies. Most of them have been identified (cf. Table 4.1) and efficient workarounds are well known. The current generation of performance tools (TAU [86], PerfExpert [87], VTune [27], ThreadSpotter [88], Scalasca [89], Vampir [90]) is excellent at detecting such pathologies, although some are fairly specialized: for instance, Scalasca/Vampir mainly address MPI/OpenMP issues, requiring the combined use of several tools to get a global overview of all of the performance pathologies present in an application.

Most of the current tools do not provide any direct insight on the potential cost of a pathology. Furthermore, the user has no idea of what the potential benefit of optimizing his code to fix a given pathology is. These two points prevent him from focusing on the right issue. For example, let us consider a program containing two hot routines A and B, respectively consuming 40 % and 20 % of the total execution time. Let us further assume that the potential achievable performance gain on A is 10 % while on B it is up to 60 %. The overall performance impact on B is up to 60 % * 20 % = 12 % while on routine A, it is at best 10 % * 40 % = 4 %. As a consequence, it is preferable to focus on routine B. Additionally, the user has no clue of what the current performance level is, compared with the best one achievable, i.e. he may not know when optimizing is worth the investment. In general, the situation is even worse since a simple loop may simultaneously exhibit several performance pathologies. In such cases, most of the tools cited above give no hint to the user of which ones are dominant and really worth fixing. For instance, a loop can suffer from both a high miss rate and the presence of costly Floating-Point (FP) operations such as div/sqrt: trying to improve the hit rate does not improve the performance if the dominant bottleneck is the div/sqrt operations.

In this paper, we present a coherent set of tools (MicroTools [60], CQA [91, 92, 22, 21], DECAN [37], MTL [93]) to address this lack of user guidance in the tedious and difficult task of program optimization. These tools are integrated in a unified methodology (PAMDA) to help the user quickly identify performance pathologies and assess their cost and impact on the global performance.
The different techniques (static analysis, value profiling, dynamic analysis) are more or less appropriate, and give a more or less accurate answer, depending upon the performance pathology to be fixed: for example, detecting a badly strided access is immediate through value tracing of array addresses, while the same task is extremely tedious when only using static analysis or hardware counters. In any case, such array access tracing should only be triggered when necessary due to its high cost. In this paper, we focus on providing performance insight at the core level and for parallel OpenMP structures. Our analysis can be combined with the MPI analysis provided by tools such as Scalasca, TAU or Vampir. Through the integrated methodology PAMDA, we aim at providing the following contributions:

• Getting a global hierarchical view of performance pathologies/bottlenecks

• Getting an estimate of the impact of a given performance pathology taking into account all other present pathologies

• Demonstrating that different specialized tools can be used for pathology de- tection and analysis

• Performing a hierarchical exploration of bottlenecks according to their cost: the more precise but expensive tools are only used on specific, well-chosen cases.

Section 2 presents a motivating example in detail. Section 3 details the various key components of PAMDA, while Section 4 describes the combined use of these different tools. Section 5 describes some experimental uses of PAMDA. Section 6 gives an overview of related works and the added value of the PAMDA system. Finally, Section 7 gives conclusions and future directions for improvement.

Table 4.1: A few typical performance pathologies

Pathology | Issue | Workaround
ADD/MUL balance | ADD/MUL parallel execution (of fused multiply-add unit) underused | Loop fusion, code rewriting (e.g. use distributivity)
Non-pipelined execution units | Presence of non-pipelined instructions: div, sqrt (e.g. x86: div and sqrt) | Loop hoisting, rewriting code to use other instructions
Vectorization | Unvectorized loop | Use another compiler, check options driving vectorization, use pragmas to help the compiler, manual source rewriting
Complex control flow graph in innermost loops | Prevents vectorization | Loop hoisting or code specialization
Unaligned memory access | Presence of vector-unaligned load/store instructions | Data padding, use pragmas and/or attributes to force the compiler
Bad spatial locality and/or non stride-1 accesses | Loss of bandwidth and cache space | Rearrange data structures or loop interchange
Bad temporal locality | Loss of perf. due to avoidable capacity misses | Loop blocking or data restructuring
4K aliasing | Unneeded serialization of memory accesses | Adding offset during allocation, data padding
Associativity conflict | Loss of performance due to avoidable conflict misses | Loop distribution, rearrange data structures
False sharing | Loss of bandwidth due to coherence traffic and higher latency access | Data padding or rearrange data structures
Cache leaking | Loss of bandwidth and cache space due to poor physical-virtual mapping | Use bigger pages, blocking
Load unbalance | Loss of parallel perf. due to waiting nodes | Balance work among threads or remove unnecessary locks
Bad affinity | Loss of parallel perf. due to conflict for shared resources | Use numactl to pin threads on physical CPUs, try classical affinities (compact, scatter, etc.)
High number of memory streams | Too many streams for hardware prefetcher or conflict miss issues | See conflict misses
Lack of loop unrolling | Significant loop overhead, lack of instruction-level parallelism | Try different unrolling factors, unroll and jam for loop nests

4.2 Motivating Example

Figure 4.1 presents the source code of one of the hottest loops extracted from POLARIS(MD) [94]: a molecular dynamics application developed at CEA DSV. POLARIS(MD) is a multiscale code based on Newton equations: it has been successfully used to model Factor Xa involved in thrombosis. This loop simultaneously presents a few interesting potential pathologies:

• Variable loop trip count

• Fairly complex loop body which might lead to inefficient code generation by the compiler

• Presence of div/sqrt operations

• Strided and indirect access to arrays (scatter/gather type)

• Multiple simultaneous reduction operations leading to inter-iteration dependencies

do j = ni+nvalue1,nato                                          ! variable number of iterations
  nj1 = ndim3d*j + nc ; nj2 = nj1 + nvalue1 ; nj3 = nj2 + nvalue1
  u1 = x11 - x(nj1) ; u2 = x12 - x(nj2) ; u3 = x13 - x(nj3)     ! non-unit stride accesses
  rtest2 = u1*u1 + u2*u2 + u3*u3 ; cnij = eci*qEold(j)
  rij = demi*(rvwi+rvwalc1(j))
  drtest2 = cnij/(rtest2 + rij) ; drtest = sqrt(drtest2)        ! div/sqrt
  Eq = qq1*qq(j)*drtest
  ntj = nti + ntype(j)
  Ed = ceps(ntj)*drtest2*drtest2*drtest2                        ! indirect accesses
  Eqc = Eqc + Eq ; Ephob = Ephob + Ed
  gE = (c6*Ed + Eq)*drtest2 ; virt = virt + gE*rtest2
  u1g = u1*gE ; u2g = u2*gE ; u3g = u3*gE
  g1c = g1c - u1g ; g2c = g2c - u2g ; g3c = g3c - u3g           ! reductions (vector versus scalar)
  gr(nj1,thread_num) = gr(nj1,thread_num) + u1g                 ! high number of statements,
  gr(nj2,thread_num) = gr(nj2,thread_num) + u2g                 ! non-unit stride accesses
  gr(nj3,thread_num) = gr(nj3,thread_num) + u3g
end do

Figure 4.1: Polaris Source Code Sample. The code's main performance pathologies, highlighted in pink in the original figure, are reproduced here as comments.

All these pathologies can be directly identified by simple analysis of the source code. The major difficulty is to assess the cost of each of them and therefore to decide which should be worked on. A first value profiling of the loop iteration count reveals that the trip count varies widely between 1 and 2000. However, the amount of time spent in the small (less than 150 iterations) loop trip count instances remains limited to less than 10 %. The remaining interval of loop trip counts is further divided into 10 deciles and one representative instance is selected for each of them. Further timings analyzing the loop trip count impact indicate that the average cost per iteration globally remains constant, independently of the trip count. Therefore, the data size variation seems to have no impact on performance: the same optimization techniques should apply to instances having a loop trip count between 150 and 2000.

The static analyzer (see Figure 4.3) provides us with the following key information: in the original version, neither Load/Store (LS) operations nor FP ones are vectorized. It further indicates that, due to the presence of div/sqrt operations, the FP operations are the main bottleneck. It also points out that even if the FP operations were vectorized, the bottlenecks due to sqrt/div operations would remain. However, this information has to be taken with caution since the static analyzer assumes that all data accesses are ideal, i.e. performed from L1.

Dynamic analysis using code variants generated by DECAN is presented in Figure 4.2. Initially, the original code (dark blue bars) shows that FP operations (see FP versus LS DECAN variants) clearly are the dominating bottleneck. Furthermore, the good match between CQA and REF clearly indicates that the analysis made by CQA is valid and pertinent. Optimizing this loop is simply achieved by inserting the SIMD pragma '!DEC$ VECTOR ALWAYS', which forces the compiler to vectorize FP operations. However, the compiler does not vectorize loads and stores due to the presence of strides and indirect accesses. Rerunning the DECAN variants of this optimized version (light blue bars in Figure 4.2) shows that even for this optimized version, FP operations still remain the key bottleneck (comparison between LS and FP). Therefore, there is no point in optimizing data access.

Figure 4.2: DECAN Analysis Example

The y-axis represents the number of cycles per source iteration: lower is better. Static estimates obtained by CQA are compared with dynamic measurements performed on different code variants generated by DECAN, for both the original and vectorized versions: REF is the reference binary loop (no binary modifications introduced by DECAN), FP (resp. LS) is the DECAN binary loop variant in which all of the Load/Store (resp. FP) instructions have been suppressed, and REF_NSD (resp. FP_NSD) is the DECAN binary loop variant in which only the FP sqrt and div instructions (resp. all of the Load/Store and FP sqrt/div instructions) have been suppressed.

The only hope of optimization lies in improving the div/sqrt operations: for example, using SP instead of DP. Unfortunately, such a change would alter the numerical stability of the code and cannot be used. The major lesson to be drawn from this case study is that the combined use of CQA and DECAN allows us to quickly identify the optimization to be performed, and also clearly tells us that tackling the other pathologies would not improve overall performance.

4.3 Ingredients: Main Tool Set Components

Performance assessment issues require robust methodologies and tools. Therefore, in order to systematically provide programmers with a performance pathology hierarchy and its related costs, the current work considers two toolsets: MicroTools, for microbenchmarking the architecture, and the MAQAO [91, 92, 22] framework, which is a performance analysis and optimization tool suite. MAQAO's goal is to analyze binary codes and to provide application developers with reports to optimize their code. The tool mixes both static (code quality evaluation) and dynamic (profiling, characterization) analyses, based on the ability to reconstruct low level structures (basic blocks, instructions, etc.) as well as high level ones such as functions and loops. Another key MAQAO feature is its extensibility. Users easily write plugins thanks to an embedded scripting language (Lua), which allows fast prototyping of new MAQAO tools. From MAQAO, PAMDA extensively uses three tools: the Code Quality Analyzer (CQA) exposed in section 4.3.2, the Differential Analysis framework

(DECAN) presented in section 4.3.3, and finally the Memory Tracing Library (MTL) in section 4.3.4. We briefly present the main contribution of each of these tools to PAMDA and then describe their major characteristics.

4.3.1 MicroTools: Microbenchmarking the Architecture

Microbenchmarking [95, 96, 97] is an essential tool to investigate the real potential of a given architecture: more precisely, in PAMDA, microbenchmarking is first used to determine both FP unit performance and the achievable peak bandwidth of various hardware components such as cache/RAM levels, and second to estimate the potential cost of various pathologies (unaligned access, 4K aliasing, high miss rate, etc.). For achieving these goals, PAMDA relies on MicroTools, consisting of two main components: the MicroCreator tool automatically generates a set of benchmark programs, while the MicroLauncher framework executes them in a stable and closed environment.

4.3.2 CQA: Code Quality Analyzer

In PAMDA, the CQA framework is used first for providing a performance target under ideal data access conditions (all operands are supposed to be in L1), second for providing a bottleneck hierarchy analysis between the various hardware components of the core (FP units, load/store ports, etc.) and third for detecting some performance pathologies (presence of inter-iteration dependencies, div/sqrt operations) which are worth investigating via specialized DECAN variants. The ideal assumption (all operands in L1) is essential for CPU bound codes such as the POLARIS(MD) loop studied in the previous section. For memory bound loops, it needs to be complemented with a dynamic analysis.

CQA is a static analysis tool directly dealing with binary code. It extracts key characteristics and detects potential inefficiencies. It provides users with general code metrics such as details on basic loop characteristics, the number of instructions, µops, and used XMM/YMM vector registers. CQA also allows users to obtain more in-depth information on the loop execution on the target architecture. For example, the tool provides a reliable Front-End pipeline execution report, i.e. an estimated number of cycles spent during each Front-End pipeline stage. The tool gives the same type of report for the Back-End. Finally, CQA provides a cycle estimate of loop body performance under ideal conditions: all operands in L1, no branches and infinite loop count (steady state behavior).

CQA is able to report both low and high level metrics/reports (Figure 4.3). For example, when a loop is not fully vectorized, the high level report provides a potential speedup (if all instructions were vectorized) and corresponding hints (compiler flags and source transformations). For the same loop, some low level metrics/reports show the breakdown of vectorization ratios per instruction type (loads, stores, ADDs, etc.), giving the user a more in-depth view of the issue. CQA supports Intel 64 micro-architectures from Core 2 to Haswell.

4.3.3 DECAN: Differential Analysis

In PAMDA, DECAN is used for quantitatively assessing the impact of performance pathologies. The general idea is fairly simple: a given pathology is associated with the

presence of a given subset of instructions, for example div/sqrt operations; DECAN then generates a binary version of the loop in which the corresponding instructions are deleted or properly modified. This altered binary is measured and compared with the original unmodified version. The resulting binary does not in general preserve semantics, i.e. numerical values generated with DECAN variants are not identical to the original ones. For our performance analysis objective, this is not a critical issue, but for the subsequent program execution, control behavior might be altered. To avoid such problems, the original loop is systematically replayed after the execution of the modified binary in order to restore correct memory values.

DECAN starts from the static analysis of the target loop produced by CQA. The goal is to select the instruction subsets to be transformed, the selection process being driven by the type of behavior to highlight. Afterward, instructions are carefully transformed in a manner that minimizes unwanted side effects that may disturb the observations, such as changes in the code layout and instruction dependencies. DECAN also inserts monitoring probes to be able to accurately compare the modified part of the code with the original one. Also, as stated earlier, DECAN is built on top of the MAQAO framework; hence, it uses the MAQAO disassembler/patcher to apply modifications to the instructions.

Figure 4.3: Low-Level CQA Output

Using DECAN's features, PAMDA generates altered binaries, thereby splitting performance problems between CPU, memory, and OpenMP issues. Table 4.2 presents the range of loop variants used within the methodology discussed in Section 4.4.

4.3.4 MTL: Memory Tracing Library

Within PAMDA, MTL provides specific analysis of pathologies related to data access patterns, in particular stride values, alignment characteristics, data sharing issues in multithreaded codes, etc. MTL works by tracing addresses and by generating compact representations of data access patterns.

Table 4.2: DECAN variants and transformations

Variant | Type of SSE/AVX instructions involved | Transformation
LS | All arithmetic instructions | Instructions deleted
FP | All memory instructions | Instructions deleted
DL1 | All memory instructions | Instruction operands modified to target a unique address
NODIV | All division instructions | Instruction operands modified to target a unique address
NORED | All reduction instructions | Instructions deleted
S2L | All store instructions | Converted into load instructions
NO_STORE | All store instructions | Instructions deleted

MTL is not limited to innermost loops but directly deals with multiple nested loops, allowing it to detect more subtle pathologies: for example, row-major instead of column-major accesses for a Fortran array (stored column-wise) are automatically detected. To perform these analyses, MTL uses the MAQAO Instrumentation Language (MIL) [98]. This language makes the development of program analysis tools based on static binary instrumentation easier. MIL is an object-oriented and event-directed language for performing binary instrumentation at a high level of abstraction, using structural objects (functions, loops, etc.), events, filters, and probes.

4.4 Recipe: PAMDA Tool Chain

Figure 4.4: PAMDA Overview

Individual tools are the building blocks that PAMDA glues together through a set of scripts. These scripts are under development, but most of the principles have already been evaluated. Figure 4.4 presents PAMDA's overall organization, which includes application profiling, cost analysis, structural checks, CPU and memory subsystem evaluation, and finally OpenMP evaluation for parallel applications. The current section describes PAMDA's components.

4.4.1 Hotspot identification

To limit the processing cost, we focus on the most time-consuming portions of the code. Our target loops are defined as the loops with a cumulated execution time exceeding 80% of the total execution time. It should be noted that with such an aggregated measurement, we can end up with a large number of loops with small individual contributions. Such target loops are identified using MAQAO sampling.

4.4.2 Performance overview

Figure 4.5: Performance Investigation Overview. T means the condition is true and F that it is false.

The PAMDA approach divides performance bottlenecks into two main categories (Figure 4.5): memory subsystem and CPU. Their respective contributions to the overall execution time are then quantified using the DECAN transformations LS (assessing memory subsystem performance) and DL1 (assessing CPU subsystem performance). The ratio of these contributions reveals whether the loop is memory and/or CPU bound. Ideally, the pipeline and out-of-order mechanisms ensure that cycles spent on memory accesses and on arithmetic operations perfectly overlap: as a result, the time taken by REF should be the maximum of the times taken by LS and DL1. In such a case, only the slower component needs optimizing. If the times taken by LS and DL1 are similar, the workload is said to be balanced: optimizing both components is necessary to improve the loop's performance. Finally, when the cycles taken by the memory and CPU components are poorly covered by one another (unsaturation), optimizing either of them can be sufficient to gain overall performance.
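A minimal sketch of this decision logic is given below, assuming cycle counts per iteration measured on the REF, LS and DL1 DECAN variants. The 10 % tolerance is an illustrative choice, not PAMDA's actual threshold.

```python
def classify(ref, ls, dl1, tol=0.10):
    """Classify a loop from cycles per iteration measured on the REF, LS and DL1
    DECAN variants. LS ~ memory subsystem time, DL1 ~ CPU time (data served from L1)."""
    slowest = max(ls, dl1)
    if abs(ls - dl1) <= tol * slowest:
        verdict = "balanced: both CPU and memory need optimizing"
    elif ls > dl1:
        verdict = "memory bound: follow the LS subtree"
    else:
        verdict = "CPU bound: follow the DL1 subtree"
    if ref > (1 + tol) * slowest:
        verdict += " (poor LS/DL1 overlap: optimizing either component may already help)"
    return verdict

print(classify(ref=12.0, ls=11.5, dl1=6.0))   # -> memory bound: follow the LS subtree
```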

4.4.3 Loop structure check

Loop structure issues can be detrimental to performance, and may be detected using DECAN's loop trip counting feature. Indeed, in the case of unrolling or vectorization, peel and tail scalar codes may have to be generated to cover the remaining iterations. If too much time is spent in these peel and tail codes, this might indicate the unroll factor is too high with respect to the source loop iteration count. To detect such cases, the loop trip counts for each version (peel/tail/main) are determined, and we check whether the main loop is processing at least 90 % of the source code iterations. In some cases, the number of iterations per loop instance may not be large enough to fully benefit from unrolling or vectorization. This is easily highlighted by comparing the dynamic execution time of the DL1 DECAN variant with the CQA estimate, as the latter assumes an infinite trip count. The difficulty of optimizing such loops is exacerbated when the loop trip counts are not constant.
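These two checks reduce to a couple of ratios; a possible sketch follows. The 90 % threshold comes from the text above, everything else (the 1.5x gap, the input values) is illustrative.

```python
def structure_issues(trip_counts, dl1_cycles, cqa_cycles):
    """trip_counts: iterations executed in the peel/main/tail versions of the loop."""
    issues = []
    total = sum(trip_counts.values())
    if trip_counts.get("main", 0) < 0.90 * total:
        issues.append("main loop covers < 90% of source iterations: unroll factor may be too high")
    # CQA assumes an infinite trip count; a large gap with the DL1 measurement
    # suggests iteration counts too low to amortize unrolling/vectorization.
    if dl1_cycles > 1.5 * cqa_cycles:   # arbitrary illustrative threshold
        issues.append("DL1 far above CQA estimate: loop instances may be too short")
    return issues

print(structure_issues({"peel": 3, "main": 16, "tail": 5}, dl1_cycles=9.0, cqa_cycles=4.0))
```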

Figure 4.6: Detecting Structural Issues

4.4.4 CPU evaluation

Figure 4.7: DL1 Subtree: CPU Performance Evaluation

Besides data accesses, CPU performance may be limited by other pathologies such as long dependency chains (deps), reductions (RED), scalar instructions or long-latency floating point operations (div): these pathologies can be detected through the combined use of CQA and DECAN (Figure 4.7). The Front-End can also slow down the execution by failing to provide the Back-End with micro-operations at a sufficient rate. Comparing their contributions to L1 performance (DL1) is a cost-effective way to identify such problems. Finally, CQA can provide us with estimations of the effect of vectorizing a loop. We precisely quantify CPU related issues, enabling us to reliably assess the potential of optimizations such as getting rid of divisions, suppressing dependencies or vectorizing. This information can guide the user's optimization decisions.

4.4.5 Bandwidth measurement

Data access rates from the different cache levels / RAM highly depend on several factors, such as the instructions used or the access pattern. To account for this, we generate microkernels loading data in an ideal stream case, testing different configurations for load operations, with or without various software prefetch instructions, and/or splitting the accessed data into streams accessed in parallel. We also force misaligned addressing for vmovups and movups. Finally, we use MicroLauncher to run these benchmarks for each level of the memory hierarchy.

Table 4.3: Bytes per Cycle for Each Memory Level (Sandy Bridge E5-2680)

Instruction | L1 | L2 | L3 | RAM
vmovaps | 31.74 | 15.05 | 10.81 | 5.10
vmovups | 31.73 | 14.96 | 10.81 | 5.10
movaps | 30.72 | 18.16 | 10.80 | 5.14
movups | 29.53 | 17.07 | 10.79 | 5.23
movsd | 15.67 | 11.55 | 10.61 | 5.36
movss | 7.91 | 6.65 | 6.39 | 4.97

On our target architecture, 128-bit SSE load instructions could roughly achieve the same bandwidth as 256-bit AVX (Table 4.3) throughout the whole memory hierarchy. Except for movss, all instructions could attain similar bandwidths in L3 and RAM: only the type of instruction really matters for data accesses from L1 or L2, and data alignment is not as relevant as it once was.

4.4.6 Memory evaluation

Figure 4.8: LS Subtree: Memory Performance Evaluation

Memory performance can be quite complex to evaluate. In Figure 4.8, we use MTL to find the different access patterns and strides for each memory group (as defined by the grouping analysis [37]). Memory accesses typically are more efficient when targeting contiguous bytes, while discontiguous accesses reduce the spatial locality of data. The worst case scenario is having large and unpredictable strides, as hardware prefetchers may not be able to function properly. MTL also provides the data reuse distance, allowing the temporal locality of groups to be evaluated. Once potential performance caveats are identified, we can use the DECAN transformation del-group to single out offending groups and quantify their contribution to the LS variant's global time. Comparing the bandwidth measured for each group with the bandwidth obtained in ideal conditions during the bandwidth measurement phase may then provide us with an upper limit on achievable performance.
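A sketch of how del-group timings can be turned into per-group costs, and how a group's measured bandwidth can be compared against the ideal stream bandwidth, is given below. Group names and all numbers are illustrative.

```python
def group_costs(t_ls, t_ls_without):
    """t_ls: cycles of the LS variant; t_ls_without[g]: cycles with group g deleted."""
    return {g: t_ls - t for g, t in t_ls_without.items()}

def bandwidth_headroom(measured_bpc, ideal_bpc):
    """Rough upper bound on the achievable speedup for a group's accesses
    (bytes per cycle measured vs. ideal stream bandwidth)."""
    return ideal_bpc / measured_bpc

costs = group_costs(t_ls=20.0, t_ls_without={"G1": 19.0, "G5": 17.5, "G6": 9.0})
print(max(costs, key=costs.get))                              # -> 'G6': the most costly group
print(bandwidth_headroom(measured_bpc=1.3, ideal_bpc=5.1))    # -> ~3.9x headroom at best
```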

4.4.7 OpenMP evaluation

Figure 4.9: OpenMP Performance Subtree. STD represents the standard deviation between threads, while OVH stands for our OpenMP OverHead evaluation.

Some issues are specific to parallel programs using OpenMP (Figure 4.9). The standard deviation (STD) of the execution time for each thread points out workload imbalances. It is particularly important that no thread takes significantly longer than the others to compute its working set, as loop barriers may then cause highly penalizing stalls. Another issue is excessive cache coherency traffic generated by store operations on shared data. Transformation S2L converts all stores into loads: we can quantify coherency penalties by comparing S2L with REF. Furthermore, the OpenMP Overhead (OVH) module of MAQAO is able to measure the portion of time spent in OpenMP routines, providing an OpenMP overhead metric.
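The three checks (imbalance, coherency, overhead) can be summarized as simple ratios, as in the sketch below; the input values are illustrative and the exact metrics PAMDA reports may differ.

```python
from statistics import mean, pstdev

def openmp_report(thread_times, t_ref, t_s2l, t_openmp_runtime):
    report = {}
    # Workload imbalance: relative standard deviation of per-thread times (STD).
    report["imbalance"] = pstdev(thread_times) / mean(thread_times)
    # Coherency cost: speedup obtained once stores become loads (S2L vs. REF).
    report["coherency_gain"] = t_ref / t_s2l
    # OVH: share of time spent inside the OpenMP runtime.
    report["openmp_overhead"] = t_openmp_runtime / t_ref
    return report

print(openmp_report([1.00, 1.05, 1.52, 1.01], t_ref=1.52, t_s2l=1.50, t_openmp_runtime=0.12))
```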

4.5 Experimental results

We applied our methodology on two scientific applications: PNBench and RTM. The analysis processes and test results are presented below.

4.5.1 PNBench

PNBench is an OpenMP/MPI kernel used at CEA (French Department of Energy). Hot loops are memory bound and are ideal to stress tools dedicated to memory optimizations. All tests are performed on a two-socket Sandy Bridge machine, composed of two Intel E5-2680 processors with 8 physical cores each.

The profiling done on the initial MPI version of the code reveals four loops, each consuming more than 8% of the global execution time. For the sake of brevity, we only study the first one, but the three other loops have a similar behavior.

Figure 4.10: Streams Analysis on PNBench. The REF curve corresponds to the performance of the original code. The LS (resp. DL1) curve corresponds to the DECAN variant where all FP instructions have been suppressed (resp. all data accesses are forced to come out of L1).

According to the methodology, the next step consists in gaining more insight into the loop characteristics through the performance overview; hence, the LS and DL1 DECAN variants are used. The corresponding results, shown in Figure 4.10, indicate a strong domination of data accesses, with the LS curve being well over the DL1 curve and matching the REF one. Consequently, the investigation follows the LS subtree.

Table 4.4: PNBench MTL Results

Group | Instructions | Pattern
G1 | Load (Double) | 8*i1
G6 | Load (Double) | 8*i1+217600*i2+1088*i3
G5 | Store (Double) | 8*i1+218688*i2+1088*i3

This table shows the PNBench MTL results for the three most relevant instruction groups.

In order to get more information on data accesses, we use MTL. Six instruction groups are detected, but only three of them contain relevant SSE instructions dealing with FP arrays. The MTL results shown in Table 4.4 indicate a simple access pattern for group G1 (stride 1) and, for groups G6 and G5, more complicated patterns which need to be optimized. As a result, in this step we are able to characterize our memory accesses with precision. However, this leaves us with two accesses and no way of knowing which one is the most important.

Figure 4.11: Group Cost Analysis on PNBench. Each group curve corresponds to the performance of the loop when the target group is deleted. The original code performance (REF) is used as reference.

At this point, we return to our notion of ROI provided through differential analysis and apply the DECAN del-group transformation to each of the three selected groups. The del-group results shown in Figure 4.11 clearly indicate that G6 is by far the most costly group: it should be our first optimization target. With the delinquent instruction group found, the analysis phase comes to its end. The next logical step is to try and optimize the targeted memory access. Fortunately, the information given by MTL reveals an interesting pathology. The access pattern of the instruction of interest has a big stride in the innermost loop (1088*i3) and a small one in the outermost loop (8*i1). In order to diminish the access penalty, we perform a loop interchange between the two loops, which results in a considerable performance gain at the loop level, with a speedup of 7.7x, and consequently a speedup of 1.4x on the overall performance of the application.

4.5.2 RTM

Reverse Time Migration (RTM) [99] is a standard algorithm used for geophysical prospection. The code used in this study is an industrial implementation of the RTM algorithm by the oil & gas company TOTAL.

Our RTM code operates on a regular 3D grid. More than 90% of the application execution time is spent in two functions, Inner and Damping, which execute similar codes on two different parts of the domain: Inner is devoted to the core of the domain while Damping is used on the skin of the domain. Standard domain decomposition techniques are used to spread the workload on multicore target machines. Since the grid is uniform, load balancing can be easily tuned by using rectangular sub-domain decomposition and by properly adjusting the sub-domain size. All experiments are done on a single socket machine, which contains a quad-core Intel E3-1240 processor with a cache hierarchy of 32KB (L1), 256KB (L2) and 8MB (shared L3).

Step 1: The original version of the code is provided with a default, non-optimized blocking. A first analysis of the OpenMP subtree reveals imbalanced work sharing. A second analysis, done at the level of the performance overview subtree, shows that the code is highly bound by memory operations. In order to fix this, we focus on the blocking strategy. It turns out that the default block size is responsible for both the load imbalance between threads and the bad memory behavior. We can then select a strategy which provides a good balance at the work sharing level as well as a good trade-off between the LS and FP streams. However, we note that, to obtain an optimal strategy, a more dedicated tool should be used.

Step 2: The second step of the analysis consists in going further down the OpenMP subtree and checking how the RTM code performs in terms of coherency. As explained earlier, the structure of the code induces non-negligible coherency traffic. Figure 4.12 shows the experimental results after applying the S2L transformation on RTM. While the x-axis details loops respectively identified from Inner and Damping, the y-axis represents speedups over the original loops. The results indicate a negligible gain due to canceling potential coherency modifiers, and a minimal gain, observed on two loops, due to a complete deletion of the stores. Consequently, we can conclude that the cost of maintaining the overall coherency state remains negligible; therefore, there would be no point in going further in this direction.

Figure 4.12: Evaluation of the Cost of the Cache Coherence Protocol. The S2L variants show performance similar to their corresponding reference versions. The NO_STORE variants behave the same, except for two loops which present an observable store cost.

4.6 Related Work

Improving an application's efficiency requires identifying performance problems through measurement and analysis, but assessing the impact of bottlenecks on performance is much harder. To achieve that, most researchers consider a qualitative approach. TAU [86] represents a parallel performance system that addresses diverse requirements for performance observation and analysis. Although performance evaluation issues require robust methodologies and tools, TAU only offers support for performance analysis in various ways, including instrumentation, profiling and trace measurements.

Tools such as Intel VTune [27], GNU profiler (Gprof) [100], Oprofile [31], MemSpy [101], VAMPIR [90], and Scalasca [89] provide considerable insight on the application's profile. In terms of methodology, Scalasca, for instance, proposes an incremental performance-analysis procedure that integrates runtime summaries based on event tracing. While these tools help hardware and software engineers find performance pathologies, significant manual performance tuning remains for software improvements, for example, selecting instructions in particular parts of a program.

PerfExpert [87], HPCToolkit [102], and AutoSCOPE [103] pinpoint performance bottlenecks using performance monitoring events. Furthermore, while PerfExpert suggests performance optimizations, AutoSCOPE extends PerfExpert by automatically determining appropriate source-code optimizations and compiler flags. Contrary to PAMDA, these tools do not provide a methodology presenting the cost related to the identified bottleneck. ThreadSpotter also helps a programmer by presenting a list of high level advice without addressing return on investment issues: what to do in case of multiple bottlenecks? How much do bottlenecks cost?

Interestingly, in [104, 105] the authors present an automated system that fingerprints the pathological patterns of hardware performance events and identifies the pathologies in applications, allowing programmers to reap the architectural insights. The proposed technique is close to the current work and includes pathology description through microbenchmarks as well as pathology identification using a decision tree. However, in order to evaluate usual performance pathologies, PAMDA additionally integrates pathology cost analysis.

The above survey indicates that performance evaluation requires a robust methodology, but traditional methods do not help much with coping with the overall hardware complexity and with guiding the optimization effort. Also, previous works focus on performance bottleneck identification, providing optimization advice without quantifying potential gains. These factors motivate considering PAMDA as the only methodology combining both qualitative and quantitative approaches to drive the optimization process.

4.7 Acknowledgments

We would like to thank Michel Masella for access to his POLARIS(MD) code and Henri Calandra and Asma Farjallah for access to the RTM code. This work has been carried out by the Exascale Computing Research laboratory, thanks to the support of CEA, GENCI, Intel and UVSQ, and by the PRiSM laboratory, thanks to the support of the French Ministry for Economy, Industry, and Employment through the PERFCLOUD project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CEA, GENCI, Intel, or UVSQ.

4.8 Conclusion and Future Work

Application performance analysis is a constantly evolving art. The rapid changes in hardware, combined with new coding paradigms, force analysis tools to handle as many pathologies as possible. This can only be achieved at the expense of usability. In the end, application developers work with extremely powerful tools, but they have to face significant differences between them and difficulties in using them.

This chapter illustrates the usefulness of performance assessment combining static analysis, value profiling and dynamic analysis. The proposed tool chain, PAMDA, helps the user quickly identify performance pathologies and assess their cost and impact on global performance. The goal in using PAMDA is to make sure that the right effort is spent at each step of the analysis and on the right part of the code. Furthermore, we try to create some synergy between the different tools by combining them in a unified methodology. We provide some case studies to illustrate the overall analysis and optimization process. Experimental results clearly demonstrate PAMDA's benefits.

The provided methodology is admittedly far from being finished. Our constant challenge is to keep improving it as well as working towards full automation. We also aim to enlarge it to other kinds of paradigms through the integration of analyses provided by complementary tools such as Scalasca, Vampir and TAU. Additionally, refining the optimization investigations is crucial in order to make the methodology more user-friendly.

Chapter 5 Extending the Cape Model

Cape is a model answering the need for a global view of a machine's performance parameters, and allowing users to evaluate the impact of a HW or SW change across the whole system. In this chapter, we will present the work we have done to help improve and extend the Cape modeling capabilities on modern Intel microarchitectures. These improvements range from sanitizing Cape inputs to adding key model components. This work focuses mainly on Sandy Bridge, but we will also mention characteristics of Ivy Bridge and Haswell when they differ.

5.1 Presentation of Cape

Cape is a loop-centric performance model that is mostly aimed at hardware/software codesign, but can also be used to precisely decompose performance for optimization purposes. While its principles were used and validated in the context of frequency scaling in [70], the work presented in this chapter helped refine its approach, bringing more modeling detail and precision, and matching the hardware more closely.

5.1.1 Core Principles

We will begin our presentation of Cape by introducing the concept of nodes: small linear models targeting specific hardware components or characteristics. Each node has a bandwidth, i.e. a certain amount of work it can process every cycle, which is essentially a hardware characteristic. However, how much these nodes are solicited (which we will call the node's workload) is program dependent, and hence a software attribute. From a node's point of view, the time taken to process a certain workload is defined by:

node time = workload / node bandwidth

As an example, a "branch node" could model the impact of branch instructions on Sandy Bridge by having a workload equal to the number of branches needed to run the code, and a bandwidth of 1 branch per cycle (as there is only one branch unit in the SNB microarchitecture). Branch instructions would then necessarily take at least number of branches / bandwidth cycles to get executed. Cape combines these individual nodes, assuming they are completely independent from one another:

cape time = max over all nodes (node time)

The Cape model is consequently a bandwidth-centric model comprising a system of linear equations. Modifying hardware characteristics is hence particularly easy (as it only involves modifying bandwidth values for individual nodes) and has a trivial numerical complexity: exploring potential designs is a rather computationally trivial task. Another interesting property of this model is that it allows the implementation of a fast inverse performance function, giving a list of hardware changes needed to reach a target performance.
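The whole core of the model fits in a few lines. The sketch below (with made-up nodes, workloads and bandwidths, not taken from an actual Cape configuration) computes per-node times and the Cape time projection.

```python
def cape_time(workloads, bandwidths):
    """workloads[n]: work per iteration for node n (software attribute);
    bandwidths[n]: work per cycle for node n (hardware attribute)."""
    node_times = {n: workloads[n] / bandwidths[n] for n in workloads}
    return max(node_times.values()), node_times

# Illustrative nodes only: a branch node (1 branch/cycle), an FP add node and an FE node.
projection, details = cape_time(
    workloads={"branch": 1, "fp_add": 4, "front_end": 9},
    bandwidths={"branch": 1, "fp_add": 1, "front_end": 4},
)
print(projection, details)   # -> 4.0 cycles/iteration, dominated by the FP add node
```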

5.1.2 Identifying Nodes and their Bandwidths

Nodes may vary both in nature and in bandwidth depending on the target microarchitecture. For instance, one may question the relevance of an FP division node in the context of old CPUs not supporting FP divisions natively. Which nodes to use, and what their bandwidths should be set to, is hence completely hardware-specific and can be obtained through various means:

1. Documentation: manuals such as [106] can give clear information on a microarchitecture's main components and potential bottlenecks.

2. Benchmarks: empirical data (e.g. [107]) can complement (and sometimes supplement) the official documentation, especially to fine-tune bandwidths.

3. Using DECAN variants. We will discuss this in more detail in Section 5.1.4.

4. Trial and error: running Cape through various codes can reveal important error cases, which may be due to overlooked components.

Previous implementations of Cape modeled a restricted number of components, mostly targeting functional units and the CPU’s ability to compute floating point operations. This work is about complementing and refining previously existing nodes to better cover the complexity behind modern microarchitectures such as Sandy Bridge.

5.1.3 Getting Node Capacities

Node capacities are workload-dependent, and can be obtained with:

1. Static analysis: CQA [21] can provide detailed instruction and uop counts for a binary loop, precisely quantifying workloads.

2. Dynamic analysis: certain information is hard or impossible to obtain purely statically, and may be obtained through dynamic runs. This is the case e.g. for memory workloads.

We will see this in more detail in Section 5.4.

5.1.4 Isolating the Memory Workload

The memory subsystem can respond in surprisingly diverse ways to a given workload, depending on characteristics such as the number and type of instructions used (scalar vs. vector, single precision vs. double precision), the access strides or the number of memory streams being accessed simultaneously. The consequence is that setting a constant bandwidth for memory performance is a difficult task and is still a work in progress.

The Cape model we worked with hence uses codelet-specific bandwidths to characterize the memory system. While this goes against the original premise that node bandwidths should be entirely defined by the hardware, it allows us to put this particular issue aside until a better solution is found. Some memory-related nodes still have workload-independent bandwidths, as we will see in Section 5.4.

We hence use DECAN [37]'s "LS" variant to isolate the memory workload from the target loop, and then measure the memory bandwidth obtainable with the loop's specific access pattern. A practical consequence of this lack of "global" bandwidth for some of the memory nodes is that only codelets with similar memory profiles can be considered simultaneously without introducing important modeling errors.

5.1.5 Saturation Evaluation

Measurements can be used for validation purposes, allowing us to check the validity of Cape's time projections for the machine in question. This is typically done through the evaluation of saturation. At the individual node level, saturation is defined as:

node saturation = node time / measured loop time

At the Cape level, system saturation is defined as:

system saturation = max over all nodes (node saturation)

Cape relies on the system being fully saturated to properly decompose performance and estimate the impact of bandwidth changes. On the one hand, undersaturation (i.e. the system saturation being below 100%) will cause Cape to underestimate the time taken to process a given workload, and hence be too optimistic. This may happen when an important node is missing, or when a node's bandwidth was incorrectly set to too high a value. On the other hand, oversaturation (i.e. the system saturation exceeding 100%) will cause Cape to overestimate the execution time needed for a workload and give pessimistic results. This can happen when a node's bandwidth was set to an incorrectly low value.

DECAN can also be used to refine this validation process, isolating the workload for FP or LS related nodes, and hence allowing saturation checks to be done at a finer level.
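Continuing the earlier sketch, saturation is obtained by dividing each node time by the measured loop time; the diagnostic messages below only paraphrase the two failure modes described above, and the numbers are illustrative.

```python
def saturation(node_times, measured_time):
    node_sat = {n: t / measured_time for n, t in node_times.items()}
    system_sat = max(node_sat.values())
    if system_sat < 1.0:
        verdict = "undersaturated: a node may be missing or a bandwidth set too high"
    elif system_sat > 1.0:
        verdict = "oversaturated: some node's bandwidth may be set too low"
    else:
        verdict = "fully saturated"
    return node_sat, system_sat, verdict

# Node times from the previous sketch, against a measured time of 4.4 cycles/iteration.
print(saturation({"branch": 1.0, "fp_add": 4.0, "front_end": 2.25}, measured_time=4.4))
```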

5.1.6 Cape Inputs

We use inputs generated using the framework presented in Chapter 3. The DECAN variant LS is used to get memory bandwidths for the workload-dependent node bandwidths and fine-grain validation data for the workload-independent nodes. Variant REF is used to get reference times for the target loop, used to validate Cape's time projections. Variant FP is used to get fine-grain validation data for the FP nodes' validation.

5.2 DECAN Variant Refinements

As Cape depends on DECAN variants both to get bandwidth inputs for memory nodes and to get validation comparison points, it is particularly important that their transformation rules be as faithful as possible to their intended purpose. We will describe a few refinements made with this intent to the main DECAN variants we use (REF, LS and FP).

5.2.1 Tackling Partial Vector Register Loads

Register loads fill part of a register with data coming from the memory hierarchy. The amount of data copied this way depends on the instruction's type; for instance, MOVSD copies a single double-precision value (i.e. 8 bytes), and MOVAPS copies an entire 128-bit register's worth (i.e. 16 bytes). Perhaps counter-intuitively, most register loads actually set the part of the register that was not fetched from memory to 0. For instance, loading value "5.6" with a MOVSD into a 128-bit register formerly containing DP values "1.2 3.4" will produce a new register value of "0.0 5.6" (and not "1.2 5.6"). This presents the important advantage of causing load operations not to depend on the state of the target register, increasing the potential Instruction Level Parallelism. Special load instructions can be used so as not to erase the not-loaded part of the register, for instance MOVHPD (which loads 8 bytes into the upper half of a 128-bit register, preserving the lower one) or VINSERTF128 (which does the same with 16 bytes and a 256-bit register). However, this comes at the cost of a dependency on the previous register value, and an extra uop is needed to stitch the loaded values and the old register value together. We hence modified the DECAN variant transformation rules to take this into account:

1. For LS: load instructions preserving parts of the target register should be treated like fused load + arithmetic instructions, and the instruction transformed to only keep the load component. For instance, "MOVHPD (%rax), %xmm0" should be transformed into "MOVSD (%rax), %xmm0".

2. For FP: such instructions should not be removed altogether, and should instead be transformed to only keep the arithmetic component. We chose MOVLHPS for this purpose for SSE instructions, as it is similar to the original instruction's arithmetic uop's expected behavior (same dispatch port and same expected latency on SNB, IVB and HSW [107]). For AVX, we had to pick a different instruction due to MOVLHPS not having an AVX version, and chose VANDPS for the same reasons as above.

The same process should be applied to these special instructions when they are employed as stores, as an extra uop is still needed to extract the intended value. This transformation helps keep only memory workloads on LS, and all of the needed arithmetic in FP.

5.2.2 The Case of Floating Point Divisions

Some additions and multiplications can be more or less complex depending on the operands (even setting aside the thorny issue of denormalized numbers). For instance, it is obvious by human standards that adding 0 to a number, or multiplying it by 1, is simple. Similarly, multiplying by 10 is also trivial when using decimal notation. This applies in base 2 as well, leaving some optimization opportunities for hardware designers. Interestingly, using pipelined functional units makes it harder to implement such tricks while also making them less relevant: if a functional unit is going to produce 1 result per cycle anyway, why even bother with operand-dependent optimizations?

However, divisions (and in particular, floating point ones) are not (fully) pipelined on Intel microarchitectures, meaning such data-dependent optimizations can (and do) still improve their throughput [108]. The main factor determining how long a division is going to take is the size of the mantissa (i.e. the position of the least-significant bit) in the divisor. The effective division bandwidth being data-dependent is an issue for Cape, as the bandwidth for floating point divisions cannot be reduced to a simple constant anymore. Furthermore, the transformations applied in the FP DECAN variant can have an impact on the input operands of divisions, creating a discrepancy between the workload's complexity for REF and for FP. We address these two issues by modifying division instruction operands so that they always fall in the worst case (which we assume to be more frequent in real-world workloads) in both variants. For instance, "DIVPD %xmm0, %xmm1" would be preceded by "MOVAPS global variable containing 1.0, %xmm0" and "MOVAPS global variable containing an approximation of pi, %xmm1". This is however not a perfect solution, as it disrupts the original loop's semantics, but it is effective at ensuring the complexity of divisions is always the same. The very same reasoning and fix apply to other variable-latency operations such as square roots.

5.3 Front-End Modeling Subtleties

The x86 and x86-64 instruction sets are complex, in good part due to Intel preserving backward compatibility with previous CPU versions: they can only evolve by growing bigger, and cannot be reworked for efficiency purposes. This complexity is carried over in the Front-End, limiting its operation speed. First overlooked in our Sandy Bridge Cape model implementation, some of its caveats forced us to dive deeper into its limitations to create a realistic FE node.

5.3.1 Accounting for Unlamination

With a theoretical bandwidth of 4 uops per cycle, the FE should rarely be the bottleneck in SNB. Indeed, microfusion allows two different operations to be combined in a single FE uop in some cases, increasing the effective FE bandwidth [109]:

1. Stores: stores require both a "store data" and a "store address" component, which need to be dispatched on different ports. Microfusion allows them to occupy a single slot until they reach the Back-End.

2. Arithmetic instructions fetching data from memory (e.g. MULPS (%rax), %xmm0 multiplies the contents of the address pointed to by %rax with the value of %xmm0): as with stores, the FE can often send the two components to the Back-End as a single uop.

It is hence possible to keep all 6 of SNB's execution ports busy with its FE bandwidth:

• A fused vector multiply and load uop will use port 0 (for the arithmetic component) and one of ports 2 or 3 (for the load).

• A fused store uop will use port 4 (for the "store data" component), and whichever of ports 2 and 3 was not used for the load for the "store address" component.

• A branch uop will use port 5.

• A vector addition will use port 1.

Unfortunately, limitations on microfusion force some uops to be split prematurely (or unlaminated) in the uop queue, degrading the effective FE bandwidth [110].

Table 5.1: Finding Unlamination Rules

Codelet | Measured (SNB) | Measured (HSW) | Criterion: >= 3 regs. | Criterion: >= 3 regs., except stores | Criterion: >= 4 regs. | Criterion: Never
elmhes_10_de | 26 | 18 | 26 | 22 | 18 | 18
elmhes_10_dx | 34 | 30 | 34 | 30 | 30 | 22
svdcmp_6_dx | 49 | 49 | 49 | 49 | 49 | 41
svdcmp_11_dx | 8 | 8 | 8 | 8 | 8 | 8

Results are in post-unlamination uops per iteration. Measured values were obtained using the UOPS_ISSUED.ANY hardware event. The codelets ending with an x are AVX variations of the main codelets, and are interesting here as they can use non-destructive 3-address statements. The presented codelets were selected to cover a wide range of possible unlamination scenarios. Unlamination criteria values in bold (with a blue background in the original table) match the measured values for Sandy Bridge; those in italic (with a red background) match the measurements for Haswell. Values in both bold and italic (with a purple background) match either microarchitecture. The >= 3 regs. unlamination criterion matches the limitations described in the official documentation for Sandy Bridge. However, it is clearly pessimistic for Haswell, so we tested relaxed criteria to try to expose the new unlamination rules. Criterion >= 4 regs. is the only one matching the Haswell measurements on all studied codelets.

The main restriction seems to be related to the number of register operands a uop needs, as can be seen in Table 5.1:

1. On SNB and IVB, fused uops using more than 2 register slots will be unlaminated. For instance, MULPS (%rax), %xmm0 will remain fused, but MULPS (%rax, %r10, 8), %xmm0 will not (3 register operands).

2. On HSW, we observed (through the use of the UOPS_ISSUED.ANY hardware counter, which counts the number of uops issued by the RAT) that this behavior was improved: the number of usable register slots was increased to 4. As a consequence, SSE instructions are not unlaminated anymore (for this specific reason), but the problem still exists for AVX instructions such as VMULPS (%rax, %r10, 8), %xmm0, %xmm0.

Other cases of unlamination exist, but in more fringe circumstances, limiting their impact.

Unlamination makes the FE suddenly relevant again in performance modeling: we hence added a new node whose bandwidth is 4 uops per cycle, but whose workload is based on post-unlamination uop counts.
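A sketch of how the FE node workload can be derived from per-uop register-operand counts, using the criteria of Table 5.1 (>= 3 register slots on SNB/IVB, >= 4 on HSW), is shown below. The (is_fused, register_slots) description of uops is our own simplification, not CQA's format.

```python
def fe_workload(uops, uarch):
    """uops: list of (is_fused, register_slots) pairs for one loop iteration.
    Returns the post-unlamination uop count feeding the FE node (bandwidth: 4 uops/cycle)."""
    limit = 4 if uarch == "HSW" else 3   # SNB/IVB unlaminate at >= 3 register slots
    total = 0
    for fused, regs in uops:
        if fused and regs >= limit:
            total += 2        # the fused uop is split (unlaminated) in the uop queue
        else:
            total += 1
    return total

# MULPS (%rax,%r10,8), %xmm0: a fused load+multiply uop with 3 register operands.
print(fe_workload([(True, 3), (False, 1)], "SNB"))   # -> 3 (unlaminated on SNB)
print(fe_workload([(True, 3), (False, 1)], "HSW"))   # -> 2 (stays fused on HSW)
```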

5.3.2 Ceiling Effect

We also stumbled upon another restriction in Sandy Bridge and Ivy Bridge's FE that prevents uops from different iterations from getting sent to the Back-End in the same cycle.

Table 5.2: Front-End Stress Experiment Example

#Line | Instruction | Nb uops to issue
1 | VXORPS %xmm1, %xmm1, %xmm2 | 1
... | VXORPS %xmm1, %xmm1, %xmm2 | 1
n | VXORPS %xmm1, %xmm1, %xmm2 | 1
n + 1 | SUB $1, %rdi | 0.5 (macrofused)
n + 2 | JG .LOOP | 0.5 (macrofused)

We can adjust the number of uops to be issued (per iteration) by modifying the number of VXORPS instructions. Getting a loop with x uops would be achieved by having n = x − 1 VXORPS and the macrofused uop. Note: VXORPS instructions operating on the same input registers (zero-idiom) are handled in the RAT and made to point to a zero-filled register. They do not need to get dispatched, so execution ports are only involved for the macrofused branch uop.

We developed simple microbenchmarks (see Table 5.2) to expose this phenomenon. Figure 5.1 shows our experimental results on SNB and HSW; IVB has results identical to SNB's, though they are not presented separately there. Accounting for this behavior can hence be achieved by applying an integer ceiling when computing the saturation for the FE node, i.e. issuing x uops per iteration should be estimated as taking ceiling(x / FE bandwidth) cycles instead of just x / FE bandwidth. For instance, issuing 5 uops with an FE bandwidth of 4 should be counted as taking 2 cycles. However, ceiling operations can cause issues when evaluating linear equation systems, as they are discontinuous on integers. We instead shifted the burden onto the workload, i.e. a loop body containing x uops is counted as having its uop count rounded up to the next multiple of the FE bandwidth. Figure 5.1 also shows this issue was addressed in Haswell, so the ceiling does not have to be applied anymore in HSW's FE node.
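The sketch below contrasts the two formulations discussed here: applying the ceiling directly, or pre-rounding the workload up to a multiple of the FE bandwidth so that the plain linear node formula reproduces the ceiling. The FE bandwidth of 4 uops per cycle comes from the text; everything else is illustrative.

```python
import math

FE_BW = 4   # uops issued per cycle on SNB/IVB/HSW

def fe_cycles_ceiling(uops_per_iter):
    """SNB/IVB: uops from different iterations never share an issue cycle."""
    return math.ceil(uops_per_iter / FE_BW)

def fe_workload_adjusted(uops_per_iter):
    """Equivalent workload for a purely linear FE node: rounded up to a multiple of FE_BW,
    so that workload / FE_BW equals the ceiling above."""
    return math.ceil(uops_per_iter / FE_BW) * FE_BW

for x in (5, 9, 12):
    print(x, fe_cycles_ceiling(x), fe_workload_adjusted(x) / FE_BW)
# On HSW the plain x / FE_BW formula can be used instead (no ceiling needed).
```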

5.3.3 New Front-End DECAN Variant

The different limitations the FE suffers from increased our interest in isolating its role in explaining loop performance. We hence created an FES DECAN variant substituting all (non-control-flow) instructions with an equivalent number of multibyte NOPs. Multibyte NOPs are preferred over regular one-byte ones to avoid complications with the uop cache when too many tiny instructions are placed contiguously [111].

Figure 5.1: Exposing the Front-End Ceiling Effect

Experiments start from 4 uops per iteration (the point where the FE starts being a bottleneck for our generated codelets) and stop at 41, where the worst case misprediction for not taking into account the ceiling effect (0.75 cycles) would only represent a 0.75 / 11 = 7% error. First, we can notice steps on SNB's curve, confirming the existence of the ceiling effect on this microarchitecture. The SNB FE node error (w/ ceiling) curve captures it well, with errors only on points 12 and 28, due to unfortunate instruction alignments preventing the macrofusion of the CMP and JB instructions and increasing the effective uop count by 1. On the other hand, the SNB FE node error (w/o ceiling) curve does not, causing errors of up to nearly 0.8 cycles (against the expected 0.75). The HSW results are interesting, and show the cycles per iteration increasing in a nice linear fashion until 11 uops. They evolve in an in-between manner between 12 and 28 uops, before joining back SNB's curve from 29 uops onwards. It is not clear why 29 is the turning point, as the available uop queue size should be 56 in Haswell, but the improvement tackles the cases in which the phenomenon is most important. The IVB results are extremely close to SNB's on the presented range (1% difference in the worst case) and were consequently left out to simplify the graph. It is hence better to use a ceiling-adjusted formula to model the FE's bandwidth in SNB and IVB, and one without for HSW.

Single-byte NOPs are inserted when necessary to account for unlamination (e.g. a store with a complex address computation will be replaced by both a multibyte NOP and a single-byte one). Just as with the FP variant, FES can be used to provide validation numbers for Cape modeling.

5.4 Back-End Modeling Improvements

The Back-End part of the core pipeline was also significantly overhauled by adding fine-grain nodes for the execution engine and the memory hierarchy. While improving the model’s precision was the primary motivation for adding most of these new nodes, some were added with the model’s flexibility in mind. We will present these changes and their purpose in this section. 5.4. Back-End Modeling Improvements 69

5.4.1 Dispatch

Dispatch-related nodes cover execution ports and the handling of inter-iteration dependencies.

5.4.1.1 Execution Ports

Execution ports act as access points (or gateways) for the functional units (FUs) they manage. Each FU is dedicated to a given port. There are 6 execution ports in SNB and IVB, and 8 in HSW. Each port can forward one uop to an FU per cycle, and can do so even if one of its FUs is currently active: they consequently all have a bandwidth of 1 uop / cycle. Their workload is obtained thanks to CQA's port distribution metrics, but capacities could alternatively be obtained using hardware counters. For instance, UOPS_DISPATCHED_PORT.PORT_5 counts the number of uops dispatched on port 5 in SNB.

5.4.1.2 Inter-iteration Dependencies

Inter-iteration dependencies are dependencies that carry over from one iteration to the next, potentially severely reducing the Instruction Level Parallelism exploitable by the out-of-order engine. They are governed more by instruction latencies than by bandwidths (as the execution of instructions from the same dependency chain cannot be pipelined), making them hard to evaluate for a bandwidth-centric model such as Cape. The length of the inter-iteration dependency chains represents a lower bound on the time needed to complete an iteration. Our solution was to add a pseudo-node keeping track of such dependencies: it can figuratively eliminate inter-iteration dependencies at a rate of 1 cycle of latency per cycle, and its workload is simply the length of the longest inter-iteration dependency chain as determined by CQA. It could be improved by keeping track of the different instructions involved in all the inter-iteration dependency chains and giving users the ability to modify the latency of each instruction type (allowing Cape to recompute the effective workload on the fly), but it is helpful in its current form as a first way of tackling this issue.

5.4.2 Functional Units

Key functional units (e.g. FP adder) were part of the base Cape nodes and did not need adding. However, they could be split into more detailed nodes to add more flexibility to the model and improve the overview of performance bottlenecks. For instance, on SNB, the FP adder:

1. Can process 1 FP add uop per cycle (original bottleneck).

2. Can execute a maximum of 1 SP or DP scalar addition per cycle (scalar performance bandwidth).

3. Can execute up to 4 SP (2 DP) additions when using SSE vector instructions.

4. Can execute up to 8 SP (4 DP) additions when using AVX vector instructions.

Assigned workload (per add node):

Operation             Node 1  Node 2  Node 3  Node 4  Node 5  Node 6  Node 7  Node 8
256-bit vector add    1       1       1       1       1       1       1       1
128-bit vector add    1       1       1       1       -       -       -       -
DP scalar add         1       1       -       -       -       -       -       -
SP scalar add         1       -       -       -       -       -       -       -
Physical adder length: 256-bit

Figure 5.2: Modeling the FP Add Functional Unit This figure represents the modeling of the “FP add” functional unit in Sandy Bridge. All FP additions are theorized to be performed by a functional unit whose full potential is only harnessed for 256-bit vector additions. Different nodes hence allow accounting for the maximum theoretical bandwidth of scalar additions being 1 per cycle (on Sandy Bridge) while keeping track of how well the unit is used. Furthermore, it makes it possible for users to model the rate of scalar and vector operations separately.

The adder is hence virtually split into 8 nodes (as illustrated in Figure 5.2), representing each of the possible addition “lanes” in the abstracted adder:

1. An AVX full-vector addition uses all lanes.

2. An SSE full-vector addition uses only the first 4 ones.

3. A DP addition uses the first two lanes.

4. An SP addition only uses the first one.

Having more nodes allows Cape to tackle the same issue from different points of view, and e.g. evaluate how often vector units really are used for the considered loops, and how relevant the unit's width really is. This flexibility will also be used in Chapter 6 when evaluating loops' vectorization potential.
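As an illustration, a minimal sketch of how per-lane workloads could be accumulated from an instruction mix, following the assignment of Figure 5.2; the operation counts are hypothetical.

N_LANES = 8  # SNB FP adder abstracted as 8 SP-sized lanes (256 bits)

# Lanes used per operation type, as in Figure 5.2.
lanes_used = {"add_256": 8, "add_128": 4, "add_dp_scalar": 2, "add_sp_scalar": 1}

# Hypothetical per-iteration operation counts for some loop.
op_counts = {"add_256": 2, "add_dp_scalar": 1}

lane_workload = [0.0] * N_LANES
for op, count in op_counts.items():
    for lane in range(lanes_used[op]):
        lane_workload[lane] += count  # each used lane performs one add per operation

# With a bandwidth of 1 add per lane per cycle, the busiest lane (lane 0) gives the
# FP-add contribution to the loop's cycles per iteration (3 cycles here).
fp_add_time = max(lane_workload)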

5.4.3 Memory Hierarchy

Memory modeling went through important changes with the introduction of detailed memory nodes for each step of the memory hierarchy and an explicit handling of TLB misses.

5.4.3.1 L1 Accesses

L1 accesses can be approached in the same fashion as scalar and vector FP operations seen earlier. There are two load units, each behind separate ports (P2 and P3). On SNB and IVB, each of these units can fetch:

1. 128 bits per cycle in case of vector loads.

2. 8 bytes per cycle in case of scalar DP load.

3. 4 bytes per cycle in case of scalar SP load.

Interestingly, an implication is that a 256-bit load keeps a load unit busy for 2 cycles. We can hence split the 128-bit data paths into 4 lanes of 4 bytes each, and create as many matching nodes.

Assigned workload (per store node):

Operation              Node 1  Node 2  Node 3  Node 4
256-bit vector store   2       2       2       2
128-bit vector store   1       1       1       1
DP scalar store        1       1       -       -
SP scalar store        1       -       -       -
Store data path length: 128-bit

Figure 5.3: Modeling the Store Functional Unit This figure represents the modeling of the “data store” functional unit in Sandy Bridge. The reasoning behind splitting the functional unit into different nodes is exactly the same as for FP additions (seen in Figure 5.2). The main difference resides in the physical data path being only 128 bits wide in Sandy Bridge, making 256-bit stores keep each store node busy for twice the regular amount of time. This limit is addressed in Haswell, with data paths being extended to 256 bits.

Store units operate in an identical fashion, though there is only one in the studied microarchitectures (behind P4). The node modeling for stores is shown in Figure 5.3. Capacities for each node are obtained with CQA, which classifies and counts memory instructions depending on how many bytes they move. Note: the width of the units' data paths was increased to 256 bits in HSW, doubling the number of 4-byte lanes and allowing 256-bit vector loads and stores to complete in a single cycle.
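A minimal sketch of the SNB store node split of Figure 5.3, where the narrower 128-bit data path makes a 256-bit store occupy each lane for 2 cycles; the store mix is hypothetical.

# Lanes used and per-lane cost for each store type, following Figure 5.3.
store_lane_cost = {"store_256": 2, "store_128": 1, "store_dp_scalar": 1, "store_sp_scalar": 1}
store_lanes_used = {"store_256": 4, "store_128": 4, "store_dp_scalar": 2, "store_sp_scalar": 1}

# Hypothetical per-iteration store mix.
store_counts = {"store_256": 1, "store_sp_scalar": 2}

lane_workload = [0.0] * 4
for op, count in store_counts.items():
    for lane in range(store_lanes_used[op]):
        lane_workload[lane] += count * store_lane_cost[op]

# Lane 0 carries 1*2 + 2*1 = 4 cycles of work: stores alone need 4 cycles per iteration.
store_time = max(lane_workload)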

5.4.3.2 L2, L3 and RAM

The memory hierarchy provides data in small chunks of 64 bytes called “cache lines”. We can thus create nodes representing the read and write traffic for these cache lines between these different levels:

1. L2RW for L1 and L2.

2. L3RW for L2 and L3.

3. RAMRW for L3 and RAM.

The exact number of nodes may vary depending on the microarchitecture and the exact SKU. For instance, SLM does not have an L3 cache, and CPUs equipped with Crystalwell on-chip DRAM have the equivalent of an L4. The workload for these nodes is the number of cache line transfers caused by the loop’s execution, which can be hard to get statically for various reasons:

1. Access patterns are not always determinable statically. They may even change depending on the input data set (e.g. Leading Dimension strides in multi-dimensional arrays).

2. Hardware prefetchers can add unknowns as to which lines really are fetched.

3. Cache structures may react in unexpected ways depending on runtime parameters (e.g. associativity conflicts), causing some lines to be evicted prematurely and fetched several times.

We hence use hardware counter measurements to obtain capacities, allowing us to use the actual workload instead of speculated values.

Table 5.3: Traffic Count Formulas for HSW, SLM and SNB / IVB

HSW:        L2RW  = L1D.REPLACEMENT + L2_TRANS.L1D_WB
            L3RW  = L2_LINES_IN.ALL + L2_TRANS.L2_WB
            RAMRW = CAS_COUNT.RD + CAS_COUNT.WR
SLM:        L2RW  = MEM_UOPS_RETIRED.L1_MISS_LOADS + PF_L1_DATA_RD + DEMAND_RFO + (COREWB - L2_WB)
            L3RW  = N/A
            RAMRW = LONGEST_LAT_CACHE.MISS + OFFCORE_RESPONSE.COREWB.L2_MISS.ANY
SNB / IVB:  L2RW  = L1D.REPLACEMENT + L1D_WB_RQST.ALL
            L3RW  = L2_LINES_IN.ALL + L2_TRANS.L2_WB + L1D_WB_RQST.MISS
            RAMRW = CAS_COUNT.RD + CAS_COUNT.WR

These counters and formulas represent both the read and the write traffic at each level of the memory hierarchy. For instance, on SNB, L1D.REPLACEMENT represents the number of cache lines imported into L1 from upper cache levels, and L1D_WB_RQST.ALL the number of cache lines written back to upper levels after getting modified by the execution engine. In SNB, the L3RW formula has 3 components due to writebacks being counted separately depending on whether the involved cache lines are present in L2. Indeed, the L2 is not inclusive, and there is no guarantee a line evicted from L1 is present there. The SLM L2RW formula is more complicated than its peers due to the lack of straightforward read and write counters. The number of fetched lines is hence decomposed into: explicit demand loads, loads due to the fetch-on-write mechanism and prefetched lines. The number of written lines is obtained by counting all writebacks everywhere (L1 and L2 writebacks) and subtracting L2's. The CAS_COUNT counters are only available on the server variants of SNB, IVB and HSW. They can be substituted with DRAM_DATA_READS and DRAM_DATA_WRITES [73]. Adjustments may be necessary in case of non-temporal stores, which we did not encounter in studied loops. Official counter descriptions can be obtained from [112].

Table 5.3 shows counters and formulas that can be used for this purpose. As explained in Section 5.1.4, the difficulty in determining a single global bandwidth for memory nodes pushed us to instead determine bandwidths on a per-codelet basis. Bandwidths for the L2RW, L3RW and RAMRW nodes are hence defined using the LS DECAN variant for the considered loops:

node bw = node workload / LS variant execution time

It can be further refined to instead use the maximum effective LS bandwidth across all data sets (when more than one data point is available for a given codelet). This shortcut limits our ability to accurately consider codelets with different memory characteristics at the same time.
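A minimal sketch of how these per-codelet memory bandwidths could be derived, assuming SNB counter values (in cache lines) and the LS variant's execution time have already been collected; all numeric values are placeholders.

# Hypothetical SNB counter values collected over one run of the loop's LS variant.
counters = {
    "L1D.REPLACEMENT": 2.0e5, "L1D_WB_RQST.ALL": 1.0e5, "L1D_WB_RQST.MISS": 2.0e4,
    "L2_LINES_IN.ALL": 1.5e5, "L2_TRANS.L2_WB": 5.0e4,
    "CAS_COUNT.RD": 8.0e4, "CAS_COUNT.WR": 4.0e4,
}
iterations = 1.0e6
ls_cycles_per_iteration = 12.0  # measured on the LS DECAN variant

# Traffic workloads (cache lines per iteration), following Table 5.3 for SNB / IVB.
workloads = {
    "L2RW": (counters["L1D.REPLACEMENT"] + counters["L1D_WB_RQST.ALL"]) / iterations,
    "L3RW": (counters["L2_LINES_IN.ALL"] + counters["L2_TRANS.L2_WB"]
             + counters["L1D_WB_RQST.MISS"]) / iterations,
    "RAMRW": (counters["CAS_COUNT.RD"] + counters["CAS_COUNT.WR"]) / iterations,
}

# Per-codelet effective bandwidths: node bw = node workload / LS variant execution time.
bandwidths = {node: wl / ls_cycles_per_iteration for node, wl in workloads.items()}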

5.4.3.3 Translation Lookaside Buffer

Page walking (i.e. fetching a page translation from the memory hierarchy rather than from the Translation Lookaside Buffers) is a costly process, both a) latency-bound and b) non-pipelined, with the potential of being a significant bottleneck in loops with large strides.

Figure 5.4: Impact of TLB Misses Results show the LS variant of hqr_15_se and are normalized per iteration. ECPI represents the execution time in cycles. The TLB_MISS data is shown on the secondary y-axis. All data sets fit in L3 for this codelet. Hqr_15_se triggers TLB misses starting from point 800, reaching 2 TLB misses per iteration at point 1008 for SNB2. SNB1 prevents most of the problem by using Transparent Huge Pages [113], and hence reveals page walks are indeed the bottleneck on this codelet. We can see that each page walk takes around 26 cycles on SNB2. We hence set the matching page walk (or TLB miss) bandwidth to 1/26 per cycle, though this number may change depending on the size of the pages used, as well as the type of virtual memory indexation.

We hence created a TLB miss node defined as follows on SNB:

1. Workload: number of L2 TLB misses as measured with hardware counters (DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK).

2. Bandwidth: defined empirically (see Figure 5.4).

TLB miss bw = 1 / (measured TLB miss cost) = 1 / 26

Which counters to use is microarchitecture-dependent. The effective page walk bandwidth may vary depending on the machine and the CPU's microarchitecture. For instance, Broadwell should be able to process 2 TLB misses in parallel [114]. The TLB miss node could be refined, as the exact penalty of a TLB miss depends on how far in the memory hierarchy the page translations have to be fetched. It could then be decomposed into different nodes, each accounting for misses of a different severity. A global node could then aggregate the results of these specialized nodes in an additive manner. However, we found this degree of detail to be unnecessary so far.
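A minimal sketch of the resulting node on SNB, with a placeholder counter value:

# Hypothetical measurement for one loop: L2 TLB misses per iteration, obtained from
# DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK divided by the iteration count.
tlb_misses_per_iteration = 0.5

tlb_miss_bandwidth = 1.0 / 26.0   # empirically measured ~26-cycle page walk cost (Figure 5.4)
tlb_node_time = tlb_misses_per_iteration / tlb_miss_bandwidth
# Here the node accounts for 13 cycles per iteration, which would dominate most loops.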

5.5 Handling Unsaturation

Unsaturation happens when none of the model’s nodes is fully saturated for a given workload and an actual measurement. In other words, the model fails to explain the entirety of the execution time, creating a mismatch between performance projections and reality. We will present this phenomenon, some of its potential causes, and approaches that can be used to improve the model’s precision.

5.5.1 Definition and Effect of Unsaturation

Unsaturation is the degree to which the modeled system is not fully saturated (in the sense of the system saturation metric defined earlier). We can hence define an unsaturation metric as follows:

unsaturation = 1 − system saturation

Unsaturation causes the Cape model to overestimate performance as it needs to fill this gap by scaling performance up until a node does reach a saturation point (and its system saturation assumption is fulfilled). The following issues can cause unsaturation:

1. An existing node is mismodeled. For instance, not taking into account unlamination can cause the Front-End's workload to be underestimated, and hence create system level unsaturation if the Front-End is the actual bottleneck in the considered loop.

2. An important node is missing: some hardware features may have been overlooked when building the model. For instance, retirement is ignored in our current implementation as we assume it works at a speed equivalent to the Front-End's. This assumption could cause unsaturation if it broke for a codelet.

3. Limited buffer sizes: buffers (out-of-order resources in this context) allow the hardware to absorb some latency, e.g. from memory accesses or individual instructions. Cape nodes being bandwidth-centric makes them mostly oblivious to buffer restrictions, and unable to account for cases where the lack of buffer entries penalizes actual performance.

The root problem for the first two cases can (and should) be fixed by correcting the wrong assumptions in existing nodes, or creating new ones as the need arises. The third case is harder to handle as it stems from non-bandwidth-related hardware characteristics, making the usual bandwidth-centric approach less applicable. We hence used the observation that system unsaturation is typically highest when the arithmetic and memory workloads are equivalent to introduce an adjustment applied after the Cape calculation. We will describe both these approaches in this section.

5.5.2 Overlooked or Mismodeled Nodes

The primary reasons why hardware components can be overlooked (and hence not have a node of their own) are:

1. Wrongful assumptions that they cannot be (or rarely ever are) bottlenecks. For instance, SNB's uop cache can make one think the Front-End should not be a problem anymore (compared to earlier Core microarchitectures), whilst the restrictions explored earlier show it is counter-intuitively still an important component.

2. They did not enter the picture in previously considered loops. Cape is in large part an evolving model, where nodes are added as they are needed. This is the case for the TLB miss node (which was unimportant for as long as we studied loops with hardware-friendly access strides) or inter-iteration dependencies.

Components can be mismodeled when their behavior is not as simple as anticipated, and:

1. The bandwidth is incorrect: this could e.g. happen for the divider nodes when we were not addressing the variable complexity of divisions in the considered codelets.

2. The workload is misevaluated: this would happen for the Front-End node if unlamination were not accounted for.

5.5.3 Buffer-Induced Unsaturation

The difficulty in estimating how a code will be impacted by the necessarily limited size of out-of-order buffers makes buffer-induced unsaturation the hardest type of unsaturation to address. This is partly due to DECAN transformations changing the level of stress on out-of-order resources in the execution pipeline. Our approach here rests on the observation that out-of-order buffers tend to have the biggest impact when FP and memory instructions take similar amounts of time to process (see Figure 5.5). Intuitively, this situation fosters competition for acquiring shared resources such as Reservation Station or ROB entries. We can then build statistics for the level of interference due to buffers depending on the time taken by FP and memory nodes, and heuristically adjust the model's output based on measured precedents. From a user's perspective, it can be thought of similarly to a Fourier transform for signal processing. Instead of switching to the frequency domain, using Cape allows one to see hardware and software from a bandwidth-centric view, helping perform changes on the SW or HW sides with a trivial complexity (compared to other existing options). However, unlike Fourier transforms, switching to the “bandwidth” domain induces some performance distortion, which we try to dampen with our heuristic adjustment.
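A minimal sketch of the DECAN-level saturation metric used in Figure 5.5 to quantify this overlap, with placeholder timings for one codelet and data set:

# Cycles per iteration measured on the DECAN variants of one codelet (placeholder values).
t_fes = 7.0   # Front-End-only variant
t_fp  = 6.5   # FP-arithmetic-only variant
t_ls  = 7.5   # load/store-only variant
t_ref = 8.0   # unmodified loop

# DECAN-level saturation: min(FES, FP, LS) / REF, as defined in Figure 5.5's caption.
saturation = min(t_fes, t_fp, t_ls) / t_ref
# x-axis of Figure 5.5; ratios near 1 tend to correspond to lower saturation values.
ls_fp_ratio = t_ls / t_fp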

Figure 5.5: DECAN-level System Saturation This figure shows DECAN-level saturation across all our classical NR codelets (on SNB2 and for all data sets) as a function of the ratio cycles per iteration (LS) / cycles per iteration (FP). DECAN-level saturation is the ratio min(FES, FP, LS) / REF (where each variant name represents the time spent in said variant). It quantifies the degree of overlap between the Front-End, floating point and memory workloads and can be used as a baseline to estimate Cape-level system saturation. Points with an LS / FP ratio over 2 do exist but were not shown to clarify the figure; their saturation values neared 1. 317 data points are shown on the graph. We can see saturation tends to be lower when the LS / FP ratio nears 1. The triangle with dashed lines represents the generally expected realm of saturation values for a codelet given an LS / FP ratio. We can see outlier data points for realft_4_de in the bottom left side of the triangle, reaching a saturation of only 0.8 despite having an LS / FP ratio of 0.4 (which is quite distant from the expected worst-case scenario of 1). Most points near the upper right point of the triangle represent toeplz_1_de, though their impact is lesser due to being above the 0.9 line. Another outlier is the point in the top right part of the graph with a saturation exceeding 1, which should not happen. This is likely due to a measurement error, with the measured LS time exceeding that of REF. Other codelets with saturation typically below 0.90 are four1_2_me, lop_13_de, mprove_8_me and rstrct_29_de. Causes behind this phenomenon will be explored in more detail in Chapter 7.

5.6 Related Work

Other mechanistic models such as [115, 116] target HW mechanisms to perform performance projections, though they focus on miss events to compute performance and assume execution goes unimpeded outside them (in contrast to Cape's bandwidth-centric approach). Treibig and Hager [117] propose a mechanistic performance model for memory workloads assigning constant costs for cache line transfers based on the estimated cache activity. This differs from the still codelet-specific effective bandwidths for our L2, L3 and RAM nodes and is actually more in line with what we would seek for Cape nodes. However, it is not precise enough yet for our purposes. The Roofline model [118] evaluates the relationship between RAM bandwidth and operational intensity (the number of operations performed per byte transferred from/to RAM) to offer insights on a program's maximum theoretical performance. While the model provides an admittedly limited accuracy, it is also very simple and inexpensive, aiming more at guiding the optimization process than at correctly predicting performance. Linear regression models built with [119] provide high speed and flexibility in terms of exploring potential designs. They also take more factors into consideration than Cape by modeling latency, buffer sizes and cache sizes explicitly. However, they are trained on a per-program basis using a cycle-accurate simulator to predetermine the relationship between key microarchitectural characteristics. Cape only needs cost-effective analyses or measurements, and is directly applied to existing microarchitectures. Techniques such as [120] can also be used to harness the precise design insights provided by simulators despite their extremely low speed. A smart selection of parameter combinations can help reduce the number of simulation runs needed to explore design spaces. Cape also operates at a strict loop granularity, preventing compensation effects in a program's execution when validating the model.

5.7 Future Work

Explicit support for queue modeling would remove the need for ad hoc heuristic solutions to fix performance projections, make the model more accurate, and make it overall more flexible in terms of the performance tuning knobs made available to users. Furthermore, queues have an important role in absorbing latency and improving the performance of the memory subsystem. Queue modeling could be very beneficial in improving memory modeling, possibly to the point where memory nodes' bandwidths may be used across codelets. This is an important challenge, but one that may significantly improve Cape modeling as a whole. One of the strongest advantages of Cape is that users can modify both hardware features (node bandwidths) and software ones (node capacities). Adjusting these parameters can lead to interesting performance tuning insights in terms of vectorization potential (as will be seen in Chapter 6), or other optimizations such as loop blocking: what would happen if the workload for L3 was increased, and that of RAM decreased by the same amount? Cape could also help determine whether parallelization efforts could be worthwhile by modeling the RAM bandwidth achievable by the CPU and that achievable by a single core separately, and exposing the leeway between the two. Finally, Cape could be used to project performance from a given machine to another, if a list of bandwidth modifications can be supplied. For instance, what would happen on a machine with RAM that is twice as fast? In a more complex fashion, projections between microarchitectures could also be done. For instance, projecting HSW performance from SNB measurements could be tried by distributing some of the dispatch workload on more ports, increasing the bandwidth of the divider to match its speedup, doubling the L1 and L2 bandwidths... While validation data is hard to get when making arbitrary hardware modifications due to not having reference measurements to compare projections to, more validation data on software-related changes (e.g. L3 blocking) could strengthen the credibility of the model in hard-to-verify areas.

5.8 Acknowledgments

All changes to the Sandy Bridge version of the Cape model were done in collaboration with David C. Wong, David J. Kuck (Intel Corporation) and William Jalby (UVSQ). Vish Viswanathan's (Intel Corporation) help and expertise on hardware counters was key in finding the adequate hardware events for memory node capacities. Zakaria Bendifallah, Emmanuel Oseret and Mathieu Tribalat's (UVSQ) collaboration was vital in keeping the CQA and DECAN tools in line with the project's advancement.

5.9 Conclusion

Cape is a powerful model combining simple linear micromodels called nodes to decompose loop performance and allow very fast performance projections to be done when modifying hardware or software characteristics. We presented Cape improvements aiming at adapting it to recent CPUs, and particularly Sandy Bridge. This work was a necessary step to give Cape a detailed awareness and control of the main bottlenecks in these CPUs, opening the door to works such as VP3 (see Chapter 6). The Cape performance decomposition performed with modeling in mind can also be used for just what it is, i.e. a very detailed and hierarchical list of bottlenecks established at a relatively low cost. In this sense, Cape could be a direct competitor to PAMDA (presented in Chapter 4). The work done to find missing nodes, improve existing ones and explain mismatches also helped refine other tools and approaches. For instance, a new basic transformation called FES was added to DECAN to estimate the impact of the Front-End, and unlamination is now compensated for in other DECAN variants aiming to preserve the Front-End workload. The effects of both unlamination and uop queue restrictions on uops from different iterations were also integrated in CQA. Furthermore, the importance of out-of-order queues was exposed thanks to the mismatches we could not explain with regular bandwidth-centered nodes, leading us to explore projects such as Uop Flow Simulation (see Chapter 7).

Chapter 6 VP3: A Vectorization Potential Performance Prototype

We present VP3 [121], a new methodology and tool prototype for vectorization insight. It works on real applications and data, and offers accurate performance gain bounds. VP3 guides expert developers toward the highest potential performance gain sections of the application and helps them avoid those with low potential. Choices of vectorization legality are left to the developer. Examples of its effectiveness are given. The ideas and insights provided may also be of use to compiler designers as well as system architects.

6.1 Introduction

High end microprocessors rely more and more on vector hardware units and instruction sets to increase their peak performance, including VIS and MMX (64 bits), AltiVec, SSE and NEON (128 bits), AVX (256 bits) and soon AVX-512 (512 bits). On AVX-512, maximum DP scalar code performance is 1/8th of the peak. Using full vector length versus scalar results in a modest increase in power with potentially important performance gains [122], leading to major energy/performance gains, so there is a strong incentive to exploit vector units as much as possible. To that end, much effort has been spent on automatic vectorization technology. Vectorizers detect dependencies that prevent vectorization, and also try to remove harmful dependencies using loop transformations such as splitting, unroll and jam, array renaming, interchange, etc. Finally, they emit vector code, dealing with target instruction set limitations. For example, non-unit stride vector loads are not supported in SSE or AVX, so the compiler applies transformations to avoid non-unit stride accesses. As a result, vectorizers have reached a very high level of complexity. In practice, advanced analyses are limited to control compilation time, so many vectorization opportunities are not exploited [59]. Furthermore, some vectorization restructuring needs are very costly or beyond automatic tools (e.g. tracing for exact data dependence analysis), so developers must hand-modify codes. All of this makes vectorization expensive, so vectorization efforts must be spent wisely. When automatic vectorization results are disappointing, developers need simple ways to understand their reasonable options, given the available time and skills.

(VIS: SPARC Visual Instruction Set; MMX: Intel MMX technology; AltiVec: IBM AltiVec; SSE: Intel Streaming SIMD Extensions; NEON: ARM NEON; AVX: Intel Advanced Vector Extensions; AVX-512: Intel Advanced Vector Extensions 512.)

After profiling, their best current option is often adding SIMD directives to experiment with potential performance gains. This can force compiler vectorization by ignoring data dependencies. The generated code can reveal potential performance gains, but less robustly than VP3, for several reasons:

1. Compilers may fail to vectorize some loops, even with directives.

2. Compiled code may crash because it is incorrect (due to directives forcing the compiler to ignore actually important dependencies).

3. The compiler does not produce the detailed quantitative information that VP3 can.

As a developer tool, VP3 would be used whenever a vectorizing compiler gave weak results. VP3 would allow developers to understand their best restructuring options, together with the potential performance payoff of each. Both SIMD directives and VP3 suffer from the major aggravation that potential gains depend on dynamic code properties such as loop iteration count (short vectors benefit less from vectorization than long ones), operand locations (too many RAM accesses lead to minor payoff) or data alignment. Vectorization is only relevant when it can be done legally (i.e. preserving the final output) and with a satisfying speedup. VP3 and SIMD directive insertion can both violate program semantics, and evaluating vectorization legality is very costly, so first getting an idea of potential performance gains can be an efficient way of trimming out low-incentive candidates from the list of loops to consider. Vectorization is not an all-or-nothing activity; partial vectorization (e.g. vectorizing only the floating-point operations) can yield solid performance gains. With SIMD directives this is hard to control, whereas VP3 fully exploits partial vectorization automatically. VP3 has options allowing it to assume non-unit stride, gather, or scatter architectural features, and gives performance estimates even where a given compiler fails. Manual optimization choices can be delicate because they depend upon target architectures' vector lengths and instruction set characteristics. For example, the vector units of the recent Silvermont and Haswell processors are very different. VP3 architecture options can be used to choose the best-performing system for a given code. The major contribution of this chapter is VP3, a methodology and tool prototype, which assesses potential vectorization performance gains, guides the user toward loops with potentially high benefits, and has the following properties:

1. Quantitative assessment of vectorization performance gains, in terms of the target architecture’s details, taking into account dynamic constraints such as loop iteration count and operand location.

2. Single pass, no-fail operation, in contrast with the complexities of deciding where to insert SIMD directives.

3. Practicality relative to running speed (few times real time), in vivo analysis of real production applications and data, and at least 80% coverage of total real time.

(Silvermont: Intel Silvermont microarchitecture; Haswell: 4th Generation Intel Core processors.)

Section 6.2 describes VP3 operation from a user's point of view, while Section 6.3 has examples of usage on real applications. Section 6.4 gives the general architecture/principles of the tool. Section 6.5 presents VP3 methodology and validation of the performance gains predicted versus real measurements. Section 6.6 presents tool extensions under development.

6.2 Tool Operation

This section will present general objectives for prediction tools, as well as VP3’s output.

6.2.1 General Objectives for Prediction Tools

Performance tools usually work on the principle of informing users about the current situation inside a computation. For an existing program and data set, profilers show where the execution time is spent and performance problem summaries are developed to point to problems on the basis of compiler and run-time information. Seldom does a tool predict performance gains in specific areas of a source code. VP3's capabilities are to distinguish good performance from bad according to a user-defined threshold, and to make accurate predictions about performance above the threshold. It can do this for the cache-neighborhood of any data set provided for measurement, and report that neighborhood to the user (e.g. L3 contained).

Figure 6.1: Operating Space of a Performance Prediction Tool

Fig. 6.1 shows the overall situation for any such tool, relative to performance and to the prediction quality. The axes are defined as speedup by vectorization over scalar time and are marked to show good and bad performance. The x-axis represents VP3 predicted speedup, and the y-axis shows actual speedup obtained by compiler or manual vectorization. The point θ marked on each axis marks a threshold performance that the tool user chooses as a worthwhile vectorization performance goal. It may depend on the maximum theoretical speedup possible, the time allowed for the vectorization work, and the skill of the SW development team. The band between 0 and 1 represents various types of performance that we call erroneous, where vectorization may lead to slowdowns. Such issues have various HW and SW sources beyond the scope of this paper. Within the graph, labels show the tool quality regions. The dashed line with slope of 1 represents perfect predictions made by an ideal tool. The four quadrants show prediction quality. We can state 3 tool objectives, in rank order:

1. Reject candidates with performance potential below threshold θ (“correctly dismissed”).

2. Quantify gains for the complementary region of potential performance im- provement (“correctly targeted”).

3. Carry out these two types of predictions with relatively low respective errors.

The lower left quadrant is most important because it saves developers time and effort by not having to even consider certain loops. The upper right quadrant forms the target loops, and the solid lines with slopes above and below 1 denote regions with large errors, which may also be set by user parameters. If a tool projects a sufficiently large gain, and the gain obtained is even greater, few users will be unhappy. Even if the gain is less than projected but above the threshold, the tool is useful. The lower right quadrant of “misdirected efforts” shows the region where a tool gives a high projection but the actual gain is below the threshold. The upper left quadrant of “overlooked opportunities” shows rejected loops that are actually worth vectorizing. Both quadrants are to be avoided.
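As an illustration, a minimal sketch of how a (predicted, actual) speedup pair maps onto the quadrants of Fig. 6.1, given a user threshold θ (classification only; the error-band slopes are left out):

def classify(predicted: float, actual: float, theta: float = 1.2) -> str:
    # Quadrants of Fig. 6.1: the threshold theta splits both axes.
    if predicted >= theta and actual >= theta:
        return "correctly targeted"
    if predicted < theta and actual < theta:
        return "correctly dismissed"
    if predicted >= theta and actual < theta:
        return "misdirected efforts"
    return "overlooked opportunities"

# Example: a loop projected at 1.5x that only reaches 1.1x lands in "misdirected efforts".
print(classify(1.5, 1.1))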

6.2.2 Tool Output

VP3 output for each target loop is an estimated vectorization speedup gain. Such output is augmented by an analysis of the main source of performance losses, including:

• Whether the loop performance is dominated by data access (Load Store instructions) or arithmetic (Floating Point instructions) and by how much. This is useful to guide the user in optimization and in particular through partial vectorization.

• For data access bound code, non-unit stride access is reported, as is the most heavily used memory hierarchy level. More precise data on non-unit stride values can be provided at the cost of an additional run. Also, VP3 will accept (by default) non-aligned vector loads and stores. An estimate of potential gain by alignment can also be provided.

• For Floating Point bound code, potential recurrence limitations, slow operations or code bloated by address computations or data reorganizations are reported.

This reporting can be further refined by compiling the code with the -g option, enabling a correlation to be established between assembly load/store instructions and arrays in the source code.

6.3 Experimental Results

This section will present results for VP3 on real-world industry applications.

6.3.1 Motivating Example

YALES2 [123, 124] is a numerical simulator of turbulent reactive flows using the Large Eddy Simulation method. It is a finite volume code for unstructured meshes, with an innovative 4th order spatial scheme for the discretization of convective and diffusive terms. It is based on the low-Mach number approximations of the Navier-Stokes equations, which solves an elliptic Poisson equation at each iteration and scales well to over 16K cores. The MPI version uses subdomain decomposition with adjustable domain size, allowing efficient cache usage.

do ip = 1, el_grp%npair
S1    ino1 = pair2node1_v(ip)
S2    ino2 = pair2node2_v(ip)
S3    co = sym_op_v(ip)*(r1_p_v(ino2) - r1_p_v(ino1))
S4    prod_r1_v(ino1) = prod_r1_v(ino1) + co
S5    prod_r1_v(ino2) = prod_r1_v(ino2) - co
end do

Figure 6.2: YALES2 loop example

Consider vectorizing the YALES2 loop of Figure 6.2. Statements S1 and S2 obviously vectorize. S3 is more challenging because the indirect addressing needs hardware support (vector gather instructions) which is available on the latest generation of Intel Haswell processors. S4 and S5 are much more difficult because indexes ino1 and ino2 can take values throughout the loop execution that create dependencies within S4 and S5 themselves, and between them. So there are 3 levels of vectorization opportunities in Figure 6.2.

1. The easiest to vectorize are S1, S2 and S3 and the FP operations in S3, S4 and S5. While within autovectorizer capabilities, the current vectorizers did not vectorize them.

2. The LS dependencies of S4 and S5, due to indices ino1 and ino2, can be removed by introducing coloring to ensure that ino1 and ino2 values are distinct. This is well beyond autovectorizer capabilities, but VP3 can suggest, e.g., that scatter instructions would be usable if the developer introduced coloring. Forced vectorization using SIMD directives can cause the generated code to be wrong, with the application producing errors or crashing.

3. To avoid all of the issues related to indirect addressing, VP3 could advise the code developer to first use coloring to remove dependencies, and second to entirely restructure the arrays to generate stride-1 accesses, which will then be easy to vectorize. This second step is a major undertaking, requiring one to copy portions of irregular data structures into regular ones.

Solutions 2 and 3 put an increasing burden on the code developer (several weeks of code rewriting). Before embarking on such efforts, developers want to know how much performance gain to expect. The main goal of VP3 is to provide such answers. However, succeeding in vectorizing is not enough, because vectorization is highly dependent upon data access location: if data is in L1, performance gains will be much larger than if operands have to be fetched from memory. In the latter case, it means that code developers will not only have to perform data restructuring but also blocking, further increasing the cost of vectorization.

6.3.2 VP3 on YALES2

Vectorizing YALES2 is a major challenge due to the indirect accesses induced by the irregular mesh structure: many of the loops have a structure very similar to the one presented in Figure 6.2. Of the 200 loops necessary to achieve 80% coverage of the total application time, about 2/3 are data access bound, the rest being FP bound. Some loop bodies are fairly complex; two contain more than 300 assembly instructions. On Sandy Bridge architectures, the lack of scatter or gather operations would be a performance killer for vectorization.

Figure 6.3: VP3 Projection Results for YALES2: Low Prospects for Vectorization

For YALES2, the results shown in Fig. 6.3 correspond to a Sandy Bridge target. Since Sandy Bridge does not support Scatter/Gather instructions, VP3 was set not to perform Load/Store vectorization. The performance gains remain modest (1.2 to 1.3X). Since the cost for vectorizing this code was extremely high, and VP3 showed limited performance gain potential, we did not push for this optimization for Sandy Bridge. On Haswell, the loads could be vectorized but the overall gain would remain limited due to performance limitations of the gather instructions.

6.3.3 VP3 on POLARIS(MD)

POLARIS(MD) (developed at CEA DSV) is a parallel code that simulates microscopic molecular systems using sophisticated interatomic potentials (including polarization effects and beyond). It includes an efficient multi-level polarizable coarse grained approach to modeling an extended chemical environment (from nanometer to micrometer scale). It is well-suited to investigate the properties of solvated protein systems and of heavy ions in complex chemical environments [125].

Figure 6.4: Source Code for POLARIS Loop Example Out of all the potential performance issues, only the division and square root operations impede performance. Disclaimer: we already evaluated this particular loop with PAMDA (in Section 4.2), but we will analyze it again from the VP3 perspective here.

A typical loop computing interactions between pairs of atoms is shown in Fig. 6.4. This loop was originally not vectorized to preserve roundoff. The key issue was to determine what was worth vectorizing. A first VP3 run showed that the cost of FP instructions was 4 times higher than the cost of Load Store instructions; the effort had to be focused on FP vectorization. VP3 further revealed that partial vectorization of FP would provide a 2X performance improvement simply due to the SQRT and DIV vectorization. SIMD directives were added, full runs were made to check that round off errors did not change, and the loop got a 2X speedup. The code contained 40 similar loops (i.e. containing DIV/SQRT) for which VP3 gave similar diagnostics. The SIMD directive was applied to all 40 loops, resulting in a 1.5X speedup on the whole application. VP3 was run on 7 fairly large POLARIS codelets (up to 111 assembly instructions in the loop body), including the FP saturated one discussed above. Fig. 6.5 shows the scalar (Sca), projected vector (Projected Vec) and hand-vectorized (Vec) results, as well as their speedup ratio (Real Speedup). One of them actually shows an unforeseen slowdown due to unexpectedly complex address computations, causing a 25% over-prediction. Furthermore, it is in the lower right quadrant in Fig. 6.6: the user would be misled into vectorizing it. The 6 others are correctly estimated to highly profit from vectorization. Despite 2 being over-predicted by around 30%, the user's minimum expectation would be met.

Figure 6.5: VP3 Projection Results for POLARIS: VP3 vs. Measurements

Figure 6.6: Operating Space of VP3 on POLARIS (θ = 1.2)

6.4 Tool Principles

The key principles of the method are to first isolate the performance contributions of FP (floating-point) arithmetic operations and LS (load/store) data access operations, then model the performance impact of vectorization on each of these components and finally recombine them to get the global performance.

Starting with a scalar binary loop, vector performance (in cycles/iteration) prediction proceeds in 4 phases (see Fig. 6.7):

1. FP/LS variants generation and measurement (Section 6.4.1):

(a) Generation of FP and LS variants of the original scalar loop using DECAN [37]. (b) Execute and measure the FP and LS variants, as well as the original binary.

2. Static projection of FP and LS vectorized performance using the generated FP/LS variants (Section 6.4.2).

3. Refinement of projected LS vectorized performance with memory traffic information gathered from scalar execution to account for memory hierarchy performance bottlenecks that limit vectorization benefits (Section 6.4.3).

4. Combine the projected FP and LS vectorized performance to get overall vectorized performance (Section 6.4.4).

The vector performance can be refined further by adding extra time for compiler-generated peel/tail loops (10).

Figure 6.7: VP3 Vec. Projection Steps (flow diagram: the scalar binary is split into FP and LS variants (1a); the variants and the original binary are executed and measured (1b); FP and LS vectorized performance are projected statically (2); the LS projection is refined with memory traffic data (3); and the FP and LS projections are combined, using a formula relating FP, LS and overall performance, into a projected overall vectorized performance (4)).

(10) These loops are generated to get proper data alignment for vectorized loop execution and deal with leftover iterations.

6.4.1 FP/LS Variants Generation and Measurement

Starting with a scalar binary loop, VP3 invokes DECAN [37] (a binary loop transformer) to generate two scalar binary variants: LS (in which all FP arithmetic instructions have been suppressed) and FP (in which all data access instructions have been replaced by XORPS instructions on the same register in order to avoid introducing extra dependencies). These scalar variants, along with the original binary loop, are executed with different input data sizes to explore memory accesses from L1, L2, L3 and RAM. For each execution, in addition to measuring cycles per iteration, performance evaluation counters are used to collect memory traffic information (less than 10 counters are required to model data traffic accurately between all cache levels). These dynamic measurements are used to refine the LS projection (Section 6.4.3).

6.4.2 Static Projection of FP/LS Operations

This phase is entirely built around CQA [21], a tool for analyzing binaries statically. CQA estimates a binary loop's performance (cycles/iteration) without executing it, by assuming all operands are in L1 and the loop trip count is infinite. For each scalar FP/LS binary loop generated in Section 6.4.1, CQA can project the vectorized performance by generating a vectorized version of the corresponding scalar binary loop (called vector mockup code) and statically analyzing the vectorized loop. To generate the vector mockup code for a loop, CQA unrolls and jams consecutive iterations to regroup the unrolled instances of scalar instructions into “packs”, which become candidates for vectorization. The jamming process is performed taking dependences into consideration. CQA examines each pack of scalar instructions and tries to replace it by vector equivalents using the same register operands. There are three kinds of scalar instruction packs:

1. Simple scalar FP instructions (e.g. ADDSS instruction with only register operands) are simply replaced by their vector equivalent (e.g. ADDPS).

2. Simple scalar Load/Store instructions (e.g. MOVSS instruction with a memory operand) are replaced by a vector equivalent (e.g. MOVAPS), if CQA detects a stride-1 memory access pattern. Otherwise, for non-unit stride accesses, the instruction is not vectorized and special instructions (insert/merge) are added to pack scalar operands into a vector register.

3. Scalar FP-LS mixed instructions (e.g. ADDSS instruction with a memory operand) are first split into a simple FP and a simple Load/Store instruction. These instructions are then handled as described in 1 and 2 above.

Sometimes the user may want VP3 to project the vectorization performance when only FP operations are vectorized. This is done by generating the insert/merge instruction even when the Load/Store instruction performs stride-1 access.
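A minimal sketch of the pack-and-replace idea described above; the mnemonic table is a small illustrative subset, not CQA's actual rules.

# Illustrative scalar -> SSE vector replacements for packs of 4 SP scalar instructions.
VECTOR_EQUIVALENT = {"ADDSS": "ADDPS", "MULSS": "MULPS", "MOVSS": "MOVAPS"}

def vectorize_pack(mnemonic: str, unit_stride: bool) -> list:
    # A "pack" regroups the unrolled instances of one scalar instruction.
    if mnemonic == "MOVSS" and not unit_stride:
        # Non-unit stride loads stay scalar; insert/merge instructions pack the operands
        # into a vector register.
        return ["MOVSS", "MOVSS", "MOVSS", "MOVSS", "INSERT/MERGE"]
    return [VECTOR_EQUIVALENT[mnemonic]]

# Example: a stride-1 load pack becomes a single MOVAPS in the vector mockup.
print(vectorize_pack("MOVSS", unit_stride=True))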

6.4.3 Refinement of Static Projection of LS Operations

To refine the vector performance of LS operations to account for performance degradation due to L2, L3, RAM and TLB activities, VP3 uses Cape [70] with inputs from CQA static analysis (Section 6.4.2) and dynamic measurements from Section 6.4.1.

Cape projects cycles per iteration of a loop by modeling a system consisting of N nodes. Each node abstracts a hardware subsystem using software-related metrics. A maximum operation rate called bandwidth is associated with each node (B_i for node i). Per loop iteration, node i will perform O_i operations. Cape uses Equation 6.1 to compute the cycles per iteration T of the loop:

T = max_{1 ≤ i ≤ N} (O_i / B_i)    (6.1)

To get the vector performance of LS operations, the relevant nodes are the following (a small sketch of this computation is given after the list):

1. L1 Load, L1 Store, Front-End, issue ports. The operation counts are obtained from CQA analysis results. The microarchitecture bandwidths are obtained from [106].

2. L2, L3, RAM. The operation/traffic counts are obtained from performance counters (Section 6.4.1). Here, we assume the data access pattern of a vectorized loop is the same as the scalar version of the loop, so we can use operation counts from scalar loop runs. The bandwidths are obtained by using scaling factors from Table 6.1 to scale up the bandwidths determined from scalar loop runs. The scalar bandwidths are empirically determined by observing the peak traffic rate when the scalar loop is executed with different input data sizes.

3. TLB. The operation count is obtained from a performance counter, multiplied by a TLB miss penalty. The bandwidth is assumed to be 1. Similar to the L2/L3/RAM nodes, we assume the TLB behavior of a vectorized loop is the same as that of the scalar loop.
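As an illustration, a minimal sketch of Equation 6.1 applied to such a node set, with placeholder operation counts and bandwidths (including a Table 6.1-style scaling of the memory bandwidths):

# Placeholder per-iteration operation counts (O_i) for the vectorized LS mockup, and
# node bandwidths (B_i); real values come from CQA, counters and microbenchmarks.
operations = {"L1 Load": 2.0, "L1 Store": 1.0, "Front-End": 6.0, "L2": 0.5, "RAM": 0.1}
scalar_bandwidths = {"L1 Load": 2.0, "L1 Store": 1.0, "Front-End": 4.0, "L2": 0.25, "RAM": 0.04}

# Scale the scalar memory bandwidths with the single-precision factors of Table 6.1.
scaling = {"L2": 2.04, "RAM": 1.25}
bandwidths = {n: bw * scaling.get(n, 1.0) for n, bw in scalar_bandwidths.items()}

# Equation 6.1: T = max over nodes of O_i / B_i (cycles per iteration).
t_ls_vec = max(o / bandwidths[n] for n, o in operations.items())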

Table 6.1: BW Scaling Factors (BW Vector / BW Scalar)

Type of Scalar      L2      L3      RAM
Single Precision    2.04    1.63    1.25
Double Precision    1.79    1.40    1.10

In bullet 2 above, Table 6.1 was used for bandwidth scaling for SSE vector instructions. It is measured by running various kernels using different types of vector instructions (loads/stores, SP/DP). Numbers of streams were measured, and compared with equivalent kernels using scalar instructions. Table 6.1 shows that vector instructions nearly double the bandwidth of L2 but the increase drops when accessing the farthest levels (L3 and RAM).

6.4.4 Combining FP/LS Projection Results

With performance projections for FP (TFP, Section 6.4.2) and LS (TLS, Section 6.4.3) in cycles per iteration, VP3 combines them to generate the overall performance estimate (TREF). VP3 assumes the relationship between TFP, TLS and TREF can be captured by a fitting function f such that TREF = f(TFP, TLS). Therefore, to compute the overall performance projection, VP3 applies f with the projected performance of FP and LS as inputs. The function is continually fitted by using the scalar loop measurements described in Section 6.4.1, where the scalar performance of FP, LS and overall are measured. VP3 assumes the same function can be applied for both scalar and vector versions of the loop.

6.4.5 Tool Speed

Highly time-consuming tools can only be useful for finding a few hot loops. VP3 needs no trace and requires only several runs of the instrumented program, as follows. First, generation and execution of microbenchmarks for Table 6.1 is made once for all applications running on the same target architecture and is therefore negligible. Second, CQA passes (Section 6.4.2) and Cape (Sections 6.4.3 and 6.4.4) computations are extremely fast (under a few seconds). So the cost of VP3 is low enough to deal with a few hundred loops per application. This is often essential because many real-life applications have a very flat profile; reaching 80% coverage of total execution time may require 200 loops. The most time-consuming part of VP3 is the application-specific measurements described in Section 6.4.1. A static analysis CQA pass allows the identification of unvectorized (or partially vectorized) target loops. Then running the application provides information about target loops, e.g. their weight in the program's total execution time and iteration counts. This can be done using Intel compilers, which provide max, min, and average, at modest overhead cost (20%), or using the MIL infrastructure [98] to provide more detailed information on the loop iteration count distribution at the cost of a higher overhead. Gathering traffic information via counters requires 3 further runs. Typically, the overhead associated with such collection remains under 20%. Next we execute the FP and LS DECAN variants. This can be achieved in a single run by using a rollback mechanism after each target loop; the overhead is 2X the execution time coverage. In total, 5 application runs are needed, with various slowdowns. Assuming a target coverage of 80%, the total time is around 1.2 + 3 × 1.2 + (1 + 2 × 0.8) = 7.4 full application runs. Once the data is available, it only takes a few seconds for a user to load it, choose whether to project the impact of fully or partially vectorizing and get the results.
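A minimal sketch of the run-count arithmetic above, with the overheads treated as rough constants:

profiling_run = 1.2            # loop profiling with ~20% overhead
counter_runs = 3 * 1.2         # three runs gathering traffic counters, ~20% overhead each
coverage = 0.8                 # fraction of execution time covered by target loops
decan_run = 1 + 2 * coverage   # FP and LS variants replayed with rollback: ~2x the covered time

total_cost = profiling_run + counter_runs + decan_run
# total_cost == 7.4 equivalent full application runs, matching the figure quoted above.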

6.5 Validation

This section will present the validation process we used and its results.

6.5.1 Methodology, Measurements and Experimental Settings

To assess the quality/limitations of VP3, we picked a set of loops which were vectorized using a standard vectorizer. Then we forced the compiler to generate scalar versions using proper flags. These scalar versions were fed into VP3 to generate estimates of potential vectorization gains. These estimates were compared with measurements of the vectorized loops. The loops used for validation were extracted from Numerical Recipes (NR) [69] and previously used to validate Cape's features [70]. They were carefully selected to cover a wide range of different behaviors: 1D/2D loops, 1D/2D arrays, unit and non-unit stride accesses. Both vector and scalar variants were compiled using the Intel Fortran Compiler (v12.1.3), using compiler flag “-O3” (as well as “-no-vec” for the scalar versions of the NRs, to prevent auto-vectorization). We made a few tests with compiler version v14 but (in most cases) differences were minimal. All measurements were performed on a quad-socket Intel Xeon E5-4640 (Sandy Bridge, 2nd Generation Intel Core processors) with Intel Hyper-Threading Technology deactivated, on RHEL 6.3 x86-64 with Transparent Huge Pages. On every CPU, each of the 8 cores has a dedicated L1 instruction cache of 32KB, L1 data cache of 32KB and L2 cache of 256KB, plus access to a 20MB L3 cache shared by all cores of the chip.

6.5.2 Error Analysis

To capture the expected vectorization benefits/effort, VP3 allows developers to set a threshold θ (see Fig. 6.1), to focus only on loops for which potential vectorization performance gains are above the threshold. As an example, we set a default θ = 1.2, but for a given type of app and developer skills, any value may be chosen. Fig. 6.8 summarizes, for the 16 validation codelets, the number of errors (mispredictions) and the breakdown between over-predictions and under-predictions. We allow a default tolerance of ±0.02, i.e. when the user asks for θ = 1.2, we guarantee a prediction for 1.18 ≤ θ ≤ 1.22. Note that the number of errors decreases linearly to 1 as the tolerance goes to 0.03. Developers could adjust the tolerance to study error sensitivity for a given set of apps, before doing their work.

Figure 6.8: Error Cases/16 Validation Codelets vs. θ Tolerance

Our performance prediction methodology has the following sources of deviations from compiler vectorization results (cases 1 and 2) and potential errors (case 3):

1. Dependence violation: during the mockup vector generation in CQA, there could exist vectorized statements with RAW dependencies between memory accesses. Since we do not have such dependence information, our vectorization process could generate incorrect code and therefore an erroneous (conservative) prediction. Other types of dependence (WAR or WAW) are not an issue since a powerful vectorizer can easily work around them to generate vector code.

2. Poor scalar code: since the VP3 vectorization process is a line by line replacement of scalar by vector instructions, it is very sensitive to the code quality of the initial scalar code. This is why we explore several unroll variants (via directives) to ensure that our process uses a high quality starting point. Note that VP3 will not alter cache traffic. This is acceptable because apart from some corner cases, cache-optimization transformations are profitable to both scalar and vector versions. Therefore, if loop interchange is beneficial to scalar code, our starting point should be a loop interchanged scalar binary.

3. Load-store variation: to account for the increase in performance over scalar code for vector data access, we use a simple scaling table obtained by microbenchmarking (Tab. 6.1). For the moment, this table does not distinguish between loads and stores and the number of streams (separate array accesses). This can be improved by using more extensive benchmarks, and static analysis which will give us the number of load and store streams.

6.6 Extensions

In this paper, we have assumed that loops do not contain any branches. Recent vector instruction sets support masked operations which obtain acceptable vector efficiency for loop structures containing simple branch structures, i.e. (if...then...else). Our framework can be extended to cover such cases by using static analysis along the two potential paths. The mockup generation in CQA will assume that the two paths are generated using masked instructions. The dynamic analysis then has the extra task of modeling the branch outcome time accounting for the branching probabilities. Assuming that traffic remains unchanged, the LS prediction can be performed in a similar manner to loops without branches. This is a valid assumption if the branch alternates targets at a fine level. Otherwise, a different method would have to be used. Besides predicting the impact of moving from scalar to vector, VP3 can also be used for moving from SSE to AVX, from AVX to AVX-512, or any pair of similar vector instruction sets. Some care has to be taken when moving among architectures because cache size changes may alter traffic.

6.7 Related Work

Due to the renewed importance of vector units, a lot of research effort has been recently devoted to vectorization. First, [59] analyzed in depth the major deficiencies of current vectorizers and proposed to (re)incorporate classical optimization techniques to improve the automatic vectorization process. Then, to improve classical static dependence analysis, Vector Seeker [126] used on-line dynamic memory tracing analysis to obtain an accurate dependence analysis. The on-line tracing process is costly but it can detect candidate loops for vectorization which could not be analyzed satisfactorily with classic static techniques. Holewinski [127] took a similar angle but restricted their effort to regularly indexed loops. However, they provided useful information on loops for which unit-stride access could be obtained, therefore improving the vectorization process. Rane [128] proposed combining static information (compiler optimization reports detailing the main reasons why the compiler failed to vectorize) with dynamic information (loop coverage, iteration count, stride, alignment) to help the user in selecting target loops for vectorization. Finally, [129] tried to vectorize directly at the assembly level, and therefore is subject to the same issues as SIMD directives. None of the above provides estimates of vectorization performance gains. VP3 is in fact complementary to the previous work. Running VP3 first will allow the selection of loops of interest, and then tools like Vector Seeker can be used to discover whether or not vectorization is legal.

6.8 Conclusions

We have presented our fast and flexible prototype VP3 for vectorization insight. VP3 can be used by compilers and software developers to focus on loops with high vectorization gains, taking into account dynamic constraints like loop iteration count and operand location. To validate VP3, we used 16 validation codelets extracted from [69] and [59]. We observed that VP3 can project the vectorized performance from scalar performance within reasonable error thresholds. We also performed further investigations and identified sources of errors, which will drive our future improvements of the tool. In addition to evaluating VP3 with respect to the difference between projected and measured vectorized performance, we also evaluated the results considering the impact on the user. To evaluate VP3 on real world problems, we used it to analyze POLARIS and YALES2. For POLARIS, the results matched the previous human optimization effort well. We believe VP3, as it stands, is a useful tool for those who want to improve software performance through vectorization. With the improvements we are making, VP3 will deliver higher quality insight to its users.

Chapter 7 Uop Flow Simulation

Different performance models can be built with very different purposes in mind. For instance, CPU architects may need very low-level cycle-accurate simulators to find and fix bugs in their designs, in which case accuracy is so important that practical aspects like processing time and memory footprint become completely secondary. At the other end of the spectrum, models like CQA [21] and Cape [70] aim to provide good-enough predictions at minimum cost, both in terms of time and space. Many models exist as compromises between these extremes. In this chapter, we will present Uop Flow Simulation (UFS), a technique simulating a CPU's out-of-order engine's behavior on a cycle-accurate basis and offering the following advantages:

1. Awareness of out-of-order engine limitations.

2. Very fast speed: several hundred thousand simulated cycles per second in typical cases.

3. Low memory consumption: only a few MB of RAM are required.

4. Small input files: only limited, statically extracted information needs to be used for a given codelet.

We will explain key limitations of the out-of-order engine in Intel Core microarchitectures and use them to craft experiments exposing the effective sizes of out-of-order resources. We will then describe our model in detail and validate our UFS implementation both on in vitro codelets extracted from the Numerical Recipes [69] or Maleki's codelets [59], and on real-world loops from the industry applications YALES2 [123] and AVBP [130], using CQA as a comparison point to highlight our model's contribution.

7.1 Introduction

In a way, understanding performance is about understanding what prevents peak performance from being reached. For instance, Sandy Bridge's peak FP performance is only obtained when using the vector multiplication and addition units every cycle (respectively located behind ports 0 and 1), reaching 16 SP FP operations per cycle. Code that only uses additions would only be able to use one of these units, bringing the achievable peak down to 8 SP FP operations per cycle. If, on top of that, the code is not vectorized, then only 1 SP FP operation per cycle can be computed. Furthermore, codelets solicit different components more or less intensively, causing one (or some) of them to fall behind the others and drag the system's performance down by acting as figurative bottlenecks. CQA and Cape shine at detecting and modeling the peak achievable performance for a given code and finding the matching bottlenecks when their performance can be expressed in terms of bandwidth. However, they often do not consider Execution Bubbles, cycles in which such components are not used (or are underused) due to microarchitectural constraints. Hence, a key question is: how are such bubbles created, and how can they be detected, quantified and modeled?
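To make the arithmetic behind these figures explicit (a back-of-the-envelope sketch, assuming 256-bit AVX vectors, i.e. 8 single-precision lanes per vector operation):

\begin{align*}
\text{vector add and mul units used:} \quad & 2 \times 8 \text{ SP lanes} = 16 \text{ SP FP / cycle}\\
\text{vector additions only:} \quad & 1 \times 8 \text{ SP lanes} = 8 \text{ SP FP / cycle}\\
\text{scalar additions only:} \quad & 1 \times 1 \text{ SP lane} = 1 \text{ SP FP / cycle}
\end{align*}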

7.1.1 On Model Accuracy

We will quantify the accuracy of our model using the error and fidelity metrics, whose respective definitions are presented in Equations 7.1 and 7.2.

error = |measured execution time − predicted execution time| / measured execution time    (7.1)

fidelity = 1 − error (7.2)
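As a minimal illustration (our own sketch, not part of the UFS tooling), these two metrics can be computed as follows, using the realft2_4_de REF numbers that will appear in Table 7.9:

def error(measured_cycles, predicted_cycles):
    # Relative error between a measurement and a model projection (Equation 7.1).
    return abs(measured_cycles - predicted_cycles) / measured_cycles

def fidelity(measured_cycles, predicted_cycles):
    # Fidelity is the complement of the error (Equation 7.2).
    return 1.0 - error(measured_cycles, predicted_cycles)

# Realft2_4_de REF: measured 23.36 cycles, CQA projection 16.00 cycles.
print(round(100 * fidelity(23.36, 16.00), 2))   # 68.49 (%)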

7.1.2 On Buffer Sizes

Performance in superscalar out-of-order CPUs relies on queues throughout the pipeline. Queues are a well-studied topic, partly because they apply to considerably more fields than just computer architecture. Little's Law [131] is a central element of queuing theory, and states that

average number of waiting customers = arrival rate ∗ average wait time.

In other words [132]:

used queue entries = bandwidth ∗ latency

The consequence is that the higher the bandwidth or the latency, the bigger the queue (or buffer) needs to be to avoid losing performance. A balanced system is a system in which buffer sizes are perfectly matched to both bandwidth and latency; in other words, resource scarcity and latency never prevent full bandwidth from being used. While Little's formula is hard to apply directly to in-flight uops, due to data dependencies, the varying latencies of different types of instructions and the particular relationship between the different out-of-order buffers, the point remains that buffer size and latency are two important performance factors that the largely bandwidth-centric Cape and CQA models mostly overlook. This would not be a problem if buffer sizes were always sufficiently large, but the exact number of resources needed to reach peak theoretical performance is program-dependent and has no upper bound (we will actually use this property in Section A). Considering architectures have to work with limited resources, buffer sizes can never be large enough to cover all cases perfectly. A direct consequence is that systems can only be balanced for a given workload. Here, we will focus on codelets causing some imbalance on our target systems, and on how it impacts CQA's fidelity.
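A hedged numerical illustration of this balance argument (the figures below are merely plausible orders of magnitude for loads, not a characterization of a specific chip):

# Little's law applied to buffer sizing:
# entries needed = bandwidth (uops per cycle) * latency (cycles per uop).
def entries_needed(bandwidth_per_cycle, latency_cycles):
    return bandwidth_per_cycle * latency_cycles

# Sustaining 2 loads per cycle with a 4-cycle L1 latency requires at least
# 8 load buffer entries to be in flight at all times...
print(entries_needed(2, 4))     # 8
# ...while the same rate against a ~200-cycle RAM access would call for ~400
# entries, far beyond any realistic buffer: no system stays balanced there.
print(entries_needed(2, 200))   # 400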

7.1.3 On Uop Scheduling

The dispatcher (in our case, the Reservation Station) has many uops to manage, with up to 6 uops to dispatch every cycle on Sandy Bridge (respectively 6 and 8 on Ivy Bridge and Haswell), and limited information to do so intelligently. The complexity of the task forces the selection process to be as simple as possible while remaining legal. Heuristics have to be used to gain as much simplicity as needed while losing as little performance as possible. We will describe some of them, as well as their practical implications.

7.1.4 Out-of-Context Analysis

We will describe UFS as performing out-of-context (or contextless) analyses, as it does not use any contextual information to run. Such context could be static information like the presence, outside the target loop, of instructions that could impact its dependencies, which would not be expensive to extract. More importantly to our approach, though, UFS does not use dynamic context information such as register values, the number of loop iterations, branch traces or accessed memory addresses. This makes UFS extremely fast, as the studied loop (or the program it comes from) never needs to be run, and typically expensive processes such as execution trace collection can be completely avoided. It also makes UFS very simple as far as cycle-accurate simulations go. However, it limits the number of phenomena that can be accurately modeled and makes issues such as memory dependencies invisible to the model. CQA is a tool that also works contextlessly, making it a good comparison point for our model.

7.1.5 Motivating Example: Realft2_4_de

We will use codelet realft2_4_de as a motivating example to show the kind of problems raised by limited out-of-order resource sizes, as well as showcase how UFS can pinpoint the issue.

7.1.5.1 Presentation

Realft2_4_de (Figure 7.1a) is a codelet from our Numerical Recipes suite, and is part of an inverse Fourier transform algorithm. It is a particularly interesting codelet as it suffers from an important CQA error even for L1 data sets. Indeed, CQA overestimates the loop's speed by 46% (see Table 7.1), making it a good case for precise study because:

1. It shows our target systems can be imbalanced for real world loops even with the very low latency provided by the L1 cache.

2. It proves said imbalance can result in significant performance drops.

3. The L1 latency and bandwidth are easy to determine statically (provided accesses do not conflict with each other, as is the case here), unlike with other cache levels or RAM. This allows us to take the memory access variable out of our analysis.

4. It reduces the list of potential culprit components involved in the phenomenon to those found in the execution core itself.

Figure 7.1: Realft2_4_de Codelet

(a) Source Code (b) Source Level DDG

Realft2_4_de is a codelet from our Numerical Recipes suite, and is part of an inverse Fourier transform algorithm. Here, the DDG was made at the source level for clarity purposes. However, an assembly level version, more consistent with low-level analyses, is presented in Figure 7.2 (in conjunction with Table 7.2).

Table 7.1: Realft2_4_de: Measurements and CQA Error

Metric                     REF        LS     FP         FES
Measured Duration          23.36      5.21   20.21      16.26
CQA Duration Projection    16.00      5.00   16.00      14.50
Error                      31.51%     4.03%  20.83%     10.82%
Measured Resource Stalls   7.85 [RS]  0.01   7.31 [RS]  0.01

Measurements are in cycles per assembly loop iteration. Here, we can see the error for REF is noticeably higher than for its LS, FP and FES components. Hardware counters suggest the Reservation Station (RS) may have a role in this unmodeled performance degradation.

7.1.5.2 Low-Level View

Detailed investigation using hardware counters (Table 7.1) revealed that stalls due to out-of-order resource scarcity may have had an impact on the actual performance. While not all stalls necessarily impede execution, this was still an interesting lead, directly incriminating the RS. As it is the buffer holding uops until their operands are ready, it gets particularly stressed when there are dependencies between instructions: we gave these particular scrutiny in Figure 7.1b.

7.1.6 Alternative Motivating Example: Realft_4_de

Realft_4_de is an alternative version of the realft2_4_de codelet in which inter-iteration dependencies were removed. In practical terms, statements 13, 14 and 15 were removed from the source code presented in Figure 7.1a. Whilst it is no longer a valid Numerical Recipes codelet, the generated code is interesting and shifts the focus from inter-iteration dependencies to intra-iteration ones. Measurement results for this codelet are presented in Table 7.3. They show that the stalls and CQA error for FP and LS are completely disconnected from those for REF.

Table 7.2: Realft2_4_de Assembly Instructions

#Id  Instruction                         Type           Dispatch Ports
1    MOVAPS %XMM1, %XMM10                FP Move        5
2    LEA -0x1(, %RAX, 2), %RDI           Address Comp.  0, 5
3    MOVSXD %EDI, %R8                    Address Comp.  0, 1, 5
4    MOVAPS %XMM4, %XMM11                FP Move        5
5    NEG %R8                             Address Comp.  0, 1, 5
6    INC %RAX                            Control        0, 1, 5
7    ADD %RCX, %R8                       Address Comp.  0, 1, 5
8    MOVSD (%RSI, %RDI, 8), %XMM15       Load           2, 3
9    MOVAPS %XMM15, %XMM13               FP Move        5
10   MOVSD -0x8(%RSI, %RDI, 8), %XMM14   Load           2, 3
11   MOVSD -0x8(%RSI, %R8, 8), %XMM6     Load           2, 3
12   MOVAPS %XMM14, %XMM12               FP Move        5
13   MOVSD -0x10(%RSI, %R8, 8), %XMM7    Load           2, 3
14   ADDSD %XMM6, %XMM15                 FP Add         1
15   ADDSD %XMM7, %XMM12                 FP Add         1
16   SUBSD %XMM7, %XMM14                 FP Sub         1
17   SUBSD %XMM6, %XMM13                 FP Sub         1
18   MULSD %XMM2, %XMM15                 FP Mul         0
19   MULSD %XMM3, %XMM12                 FP Mul         0
20   MULSD %XMM3, %XMM14                 FP Mul         0
21   MULSD %XMM15, %XMM10                FP Mul         0
22   MULSD %XMM3, %XMM13                 FP Mul         0
23   MULSD %XMM14, %XMM11                FP Mul         0
24   MULSD %XMM4, %XMM15                 FP Mul         0
25   MULSD %XMM1, %XMM14                 FP Mul         0
26   MOVAPS %XMM12, %XMM8                FP Move        5
27   MOVAPS %XMM13, %XMM9                FP Move        5
28   MOVAPS %XMM1, %XMM6                 FP Move        5
29   MOVAPS %XMM4, %XMM7                 FP Move        5
30   MULSD %XMM5, %XMM6                  FP Mul         0
31   ADDSD %XMM10, %XMM8                 FP Add         1
32   ADDSD %XMM15, %XMM9                 FP Add         1
33   MULSD %XMM0, %XMM7                  FP Mul         0
34   SUBSD %XMM10, %XMM12                FP Sub         1
35   SUBSD %XMM13, %XMM15                FP Sub         1
36   SUBSD %XMM11, %XMM8                 FP Sub         1
37   ADDSD %XMM14, %XMM9                 FP Add         1
38   ADDSD %XMM1, %XMM6                  FP Add         1
39   ADDSD %XMM11, %XMM12                FP Add         1
40   ADDSD %XMM14, %XMM15                FP Add         1
41   MOVSD %XMM8, -0x8(%RSI, %RDI, 8)    Store          (2, 3) + 4
42   MOVAPS %XMM4, %XMM8                 FP Move        5
43   MULSD %XMM5, %XMM8                  FP Mul         0
44   MOVSD %XMM9, (%RSI, %RDI, 8)        Store          (2, 3) + 4
45   MOVAPS %XMM1, %XMM9                 FP Move        5
46   MULSD %XMM0, %XMM9                  FP Mul         0
47   ADDSD %XMM4, %XMM8                  FP Add         1
48   MOVAPS %XMM6, %XMM1                 FP Move        5
49   MOVAPS %XMM8, %XMM4                 FP Move        5
50   MOVSD %XMM12, -0x10(%RSI, %R8, 8)   Store          (2, 3) + 4
51   SUBSD %XMM7, %XMM1                  FP Sub         1
52   ADDSD %XMM9, %XMM4                  FP Add         1
53   MOVSD %XMM15, -0x8(%RSI, %R8, 8)    Store          (2, 3) + 4
54   CMP + JLE %RDX, %RAX                Control        5

The two instructions in line 54 will be decoded into a single macrofused uop by the FE.

Figure 7.2: Realft2_4_de DDG

Table 7.3: Realft_4_de: Measurements and CQA Error

Metric                     REF        LS     FP         FES
Measured Duration          15.66      5.16   12.54      12.18
CQA Duration Projection    12.00      5.00   12.00      11.00
Error                      23.37%     3.10%  4.31%      9.69%
Measured Resource Stalls   4.36 [RS]  0.01   3.17 [RS]  0.01

Measurements are in cycles per assembly loop iteration. As with the original version of the code, we can see the error for REF is considerably higher than for its subcomponents. Hardware counters still flag the Reservation Station (RS) as being potentially problematic.

7.2 Understanding Out-of-Order Engine Limitations

While the out-of-order engine implemented in modern CPUs is a powerful mechanism to extract Instruction Level Parallelism, it comes with significant limitations.

7.2.1 In-Order Issue and Retirement

An important thing to keep in mind is that out-of-order execution only applies to a limited part of the execution core. The issuing and retiring of uops is still done in-order, which provides very interesting properties. For instance:

1. Transparency: speculative states are completely hidden from programs, for which the execution appears to be completely in-order.

2. A legal architectural state can always be retrieved in case of interruption or branch misprediction (e.g. mispredicted uops never impact the architectural state).

3. Resource allocation conflict avoidance: older uops never have to wait for younger uops to release resources to get issued. This prevents deadlock situations and spares the hardware from implementing costly detection and remediation mechanisms.

This hard ordering constraint can sometimes get in the way of performance, as it can prevent both the Front-End and the retirement unit from operating at full speed. In the case of retirement, it can also delay the time when certain out-of-order resources are freed and made available to newer uops. Proposals to get around this limitation exist, but are not implemented in the studied hardware as far as we know. For instance, [133] and [134] show that performance could be indirectly gained by extremely speculative execution, where the execution of certain uops can be pursued not to advance the program's state, but to prefetch data to cache intelligently.

7.2.2 Finite Out-of-Order Resources

The execution window represents the uops being currently processed by the out-of-order engine. As uops may need to keep resources from the time they are issued until they are retired, its maximum size is constrained by the number of entries available in each buffer. The Reservation Station is a special case, as its resources are released at dispatch time and not retirement, but it can be a window size limiter too. Having a large execution window is important, as it gives the hardware more opportunities to find instruction level parallelism (or more precisely, uop level parallelism) and improve performance. However, not only is increasing buffer sizes increasingly complex, the performance return per entry also decreases as sizes grow [135]. CPU architects consequently have to find a compromise between performance gain and resource sizes, which results in performance drops for certain codes.

7.2.3 Dispatching Heuristics

The out-of-order dispatching of uops is a complex task, for which flawless implementations would be extremely costly:

1. Only uops whose operands are ready can be handled by functional units: only those should be considered for dispatching.

2. Some uops require functional units only present behind a single execution port, making it easy to determine which one to send them to. However, this is not always the case and many uops can be processed by any of several ports, meaning extra intelligence is required to maximize performance.

3. The critical path for uop execution is hard to estimate.

Heuristics are consequently used to reduce dispatch complexity.

7.2.3.1 Port Binding

Uops being able to go to different ports gives the dispatch mechanism some leeway about which port to use, potentially reducing the number of port conflicts for uops in the RS. However, not all algorithms are adequate for this task. The ideal one would evaluate dispatch opportunities every cycle, maximizing the number of dispatched uops without actively penalizing any. For instance, if there are three uops in the RS for which respectively going to ports (1 | 2 | 3), (1) and (1 | 2) would be legal, an ideal dispatcher would make sure to dispatch the first uop on port 3, the second on port 1, and the third on port 2. However, this is extremely complex, as the dispatch port for the first uop has to be determined using information from uops that may be dispatched in the very same cycle. In practice, simplifications are used. [136] strongly suggests the dispatch port is chosen depending on the number of uops assigned to each port in the RS in some Intel microarchitectures. There are different ways this could be done, but a possibility is that uops are bound to a single port when entering the RS, favoring the legal port to which the fewest uops are already assigned. With our earlier example, the first uop would be assigned to e.g. port 1 when entering the RS, the second would be assigned to port 1 (as it is the only port it can be legally dispatched to), and the third to port 2 (due to port 1 already having 2 uops assigned to it, vs. 0 for port 2). This version is particularly easy to implement in hardware, as once a uop is in the RS it can only ever go to a single port, and the dispatch algorithm is basic (for each port, dispatch the oldest uop whose operands are ready and which is assigned to this port).

Another possibility is that this is computed dynamically. In this case, the count of eligible uops at the beginning of the dispatch would be (P1: 2, P2: 2, P3: 1). The dispatcher would send the first uop to port 3 as the least loaded legal port (leaving ports 1 and 2 to do work for which they cannot be avoided), the second uop to port 1 due to not having a choice, and the third to port 2 due to port 1 having been used already. However, this assumes that the dispatcher can use the information that a port was already chosen to intelligently dispatch other uops in the same cycle. Also, heuristics would have to be used when all ports are equally loaded anyway, unless the count itself can be updated several times within a cycle. [137] describes a technique for dynamic port assignment inside the RS, but it uses certain heuristics too, and its description and recentness suggest it was probably not used before at least Haswell.
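The static (bind-at-issue) variant described above can be sketched as follows; this is a simplified Python model under our assumptions, not a description of Intel's actual hardware:

def bind_port(eligible_ports, in_flight_per_port):
    # Bind a uop to the legal port with the fewest uops already assigned to it;
    # ties are broken in favor of the lowest-numbered port.
    return min(sorted(eligible_ports), key=lambda p: in_flight_per_port[p])

# The (1 | 2 | 3), (1), (1 | 2) scenario from the text, with counters
# updated as each uop gets bound when entering the RS:
in_flight = {1: 0, 2: 0, 3: 0}
for ports in ({1, 2, 3}, {1}, {1, 2}):
    chosen = bind_port(ports, in_flight)
    in_flight[chosen] += 1
    print(sorted(ports), "->", chosen)
# Output: [1, 2, 3] -> 1, [1] -> 1, [1, 2] -> 2, matching the example above.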

7.2.3.2 Pseudo FIFO Dispatching

An ideal dispatch strategy would ensure critical uops are always dispatched first. However, this would require extensive knowledge of instructions' dependencies, and hence be complex to implement. Instead, the hardware can (and does, as far as we know) use the following heuristic: for each port, always dispatch the oldest compatible uop whose operands are ready. This presents several advantages:

1. Relative simplicity: no dependence graph or history of instruction dependencies needs to be computed by the hardware, and only basic information is used: operand readiness, assigned port, and age of the RS entry.

2. Good resource management: as retirement is done in-order, older uops waiting for execution prevent all younger uops from retiring and releasing their resources. This strategy can hence reduce the amount of time for which each resource entry is allocated, and make better use of existing buffers.

While it often works well, there are cases in which it is not optimal. We wrote a pair of codelets (“rs_pb” and “rs_fix”) exposing the issue: their assembly code can be found in Table 7.4, and results can be seen in Table 7.5.
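A minimal sketch of this oldest-ready-first rule (our own simplification: uops are assumed to be already bound to a single port, and ages are plain integers):

def dispatch_one_cycle(rs_uops, ready_ages, ports):
    # For each port, dispatch the oldest RS uop bound to it whose operands are ready.
    # rs_uops: list of (age, port) pairs, lower age = older uop.
    dispatched = []
    for port in ports:
        candidates = [u for u in rs_uops if u[1] == port and u[0] in ready_ages]
        if candidates:
            oldest = min(candidates)          # smallest age wins
            dispatched.append(oldest)
            rs_uops.remove(oldest)
    return dispatched

# Two ready multiplications bound to port 0: the older one is picked, even if
# the younger one happens to be on the critical path (the rs_pb situation).
rs = [(42, 0), (10, 0)]
print(dispatch_one_cycle(rs, ready_ages={10, 42}, ports=[0, 1]))   # [(10, 0)]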

7.2.4 Inter-iteration Dependencies

Inter-iteration dependencies are dependencies that are carried throughout the entire loop call. While uops from the same dependency chain have to be executed serially, out-of-order mechanisms can allow independent uops to be dispatched in parallel and absorb the induced latency. However, if the inter-iteration dependency chain is long enough, it can cause latencies nearly impossible to absorb even with very large buffers and a perfect dispatcher: they impose a de facto upper bound on loop performance. Figure 7.3 shows 3 examples of problematic inter-iteration dependencies with varying degrees of complexity.

Table 7.4: Exposing Pseudo FIFO limitations: Assembly Code for “rs_pb”

#Line  Instruction                        Purpose
1      XORPS %xmm0, %xmm0                 Resets dependency chain
2      MULPS %xmm0, %xmm0                 Starts dependency chain
3      ADDPS %xmm0, %xmm0                 Extends dependency chain
...    ADDPS %xmm0, %xmm0                 Extends dependency chain
14     ADDPS %xmm0, %xmm0                 Extends dependency chain
15     MULPS %xmm0, %xmm1                 Dependent operation
16     MULPS %xmm0, %xmm2                 Dependent operation
17     MULPS %xmm0, %xmm3                 Dependent operation
...    MULPS %xmm0, %xmm[1+(...-15)%15]   Dependent operation
68     MULPS %xmm0, %xmm9                 Dependent operation
69     ADD $1, %eax                       Iteration count
70     SUB $2, %rdi                       Loop control
71     JGE .LOOP                          Loop control

The only difference in “rs_fix” is that the instruction on line 2 is an ADDPS instead of a MULPS. As FP additions and multiplications go on different ports, it prevents uops from instruction #2 from competing with instructions labeled as dependent operations to get dispatched.

Table 7.5: Exposing Pseudo FIFO limitations: Experimental Results

Metric              rs_pb   rs_fix
Measured Cycles     95.11   54.82
CQA Cycles          55      54
Measured RS Stalls  77.53   37.29

Results are normalized by the number of iterations. In both versions of the codelet, the dependent operation uops fill up the RS as the dependency chain gets executed, preventing the next iteration's uops from getting issued and considered for dispatching. As instruction #14 is executed, dependent instructions can finally get dispatched and make room for the next iteration's dependency chain. In rs_pb, instruction/uop #2 is only dispatched after all MULPS from the previous iteration: as all their operands are ready, the older uops take priority even though they are not on the critical path (pseudo FIFO). This results in additions getting stuck in the RS, waiting for instruction #2's execution to progress. Multiplications and additions are consequently never executed in parallel, causing a huge performance loss not foreseen by CQA. The very high number of RS stalls is a consequence of this phenomenon. In rs_fix, instruction #2 is now an addition: it can be dispatched as soon as it gets in the RS, as older uops do not compete for the same execution port. It unlocks the execution of the rest of the ADDPS uops in the iteration's dependency chain, allowing full parallelism of the multiplications from iteration N with the additions of iteration N + 1, and reaching the performance peak predicted by CQA.

Figure 7.3: Inter-Iteration Dependency Cases

(a) Simple Case    (b) Recoverable Dispatch Conflict    (c) Irrecoverable Conflict

Nodes represent instructions, and arrows dependencies between them. Arrow labels represent the weight of the dependency in cycles (matching the duration of the instruction from which it originates). Red arrows represent dependencies crossing the boundary of an iteration. Cycles in such graphs are what we call inter-iteration dependencies. For instance, in 7.3a, instruction 2 depends on the instance of instruction 1 from the same iteration, while instruction 1 depends on the instance of instruction 2 from the previous one. There is a cycle between 1 and 2 with a summed weight of 5 + 3 = 8 cycles: each loop iteration will last at least 8 cycles, regardless of any other hardware specification (out-of-order buffer sizes, number of execution ports, etc.). Dependency chains like the one in 7.3a can be detected and accounted for statically (which CQA does). However, the one in 7.3b is trickier: dependency-aware hardware would be able to maintain an upper performance bound of 8 cycles per iteration by prioritizing instructions from the inter-iteration chain (i.e. instructions 1 and 3 in this case), but real-world dispatch algorithms do not take this into account, and instances of instruction 2 will be prioritized over instances of instruction 3 from the same iteration due to being older. Assuming only one multiplication can be dispatched in a given cycle, a 1-cycle delay will appear on top of the chain's original 8 cycles due to this dispatch imperfection, bringing the bound to 9 cycles per iteration instead. In 7.3c, there is nothing the dispatcher can do to prevent this extra delay: the order in which instructions 2 and 3 are dispatched is now irrelevant. Only having at least 2 multipliers could eliminate the extra delay.
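The bound such cycles impose can be computed statically from the graph; a minimal sketch (ours, not CQA's actual implementation) for hand-listed cycles such as the one in Figure 7.3a:

# A loop iteration cannot take fewer cycles than the heaviest dependency cycle
# in its DDG (the sum of the edge latencies along that cycle).
def iteration_lower_bound(dependency_cycles):
    # dependency_cycles: list of cycles, each given as a list of edge latencies.
    return max(sum(cycle) for cycle in dependency_cycles)

# Figure 7.3a: the addition's edge weighs 3 cycles, the multiplication's 5.
print(iteration_lower_bound([[3, 5]]))   # 8 cycles per iteration, at best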

7.3 Input Resource Sizes

A key aspect of building our UFS model is getting inputs faithful to how the target microarchitectures behave in the real world. This applies to e.g. instruction latencies (for which we rely on [107]), but also to other characteristics such as effective out-of-order resource sizes. While many of the theoretical sizes for the resources in question are publicly available, they may differ from the sizes experienced in practice [61]. We wrote experiments to measure practical resource sizes and use the results as default input for UFS. A summary of our results can be found in Table 7.6, while the details and methodology were placed in Appendix A. Gaps between on-paper figures and our results are likely due to microarchitectural constraints on resource allocation or deallocation we are not aware of, the impact of which we aim to alleviate by using these adjusted values.

Table 7.6: Experimental Resource Quantifying Summary

Uarch            SNB  IVB  HSW
BB               48   48   48
LB               64   64   72
FP PRF           112  113  138
Integer PRF      128  130  144
Overall PRF      141  165  177
ROB              165  168  192
RS               48   51   51
SB               36   36   42
ROB Microfusion  No   Yes  Yes
RS Microfusion   No   No   No

We found the practical sizes of the PRFs, the ROB and the RS to be different from the official ones in all three microarchitectures. Such differences can be explained by e.g. the mode the processor is working in (64-bit mode exposes more named registers than the 32-bit one), the number of pipeline stages involved in allocating or releasing resources, or by technical limitations unknown to us.

7.4 Pipeline Model

In this section, we will present the UFS models we built, targeting Intel Big Core microarchitectures.

7.4.1 Principles

The purpose of the model is to account for limits of the out-of-order engine not captured by CQA using a limited cycle-accurate simulation, while still operating with very limited amounts of information on target loops. The semantics of instructions is completely disregarded; only the flow of uops is computed, estimating the speed at which uops may travel through the pipeline.

7.4.1.1 Cycle-Accurate Simulation

Detailed interactions between uops and the pipeline are simulated. For instance, the simulation keeps track of in-flight uops and the number of available resources, constraining the flow of simulated uops as a real system would. Dispatching constraints and heuristics are also implemented, allowing for a realistic estimation of port load in complex loops.

7.4.1.2 Limited Input

The simulator uses two types of inputs:

1. Loop information: as the model only tracks the flow of uops and not their semantic purpose, register and memory values are not needed. It uses only basic information obtainable from static analysis, such as the type, register operands and outputs for each instruction in the studied loop. It also uses Agner's instruction tables [107] as reference for instruction dispatch port(s) and latency. Just as with CQA, loop inputs are generated using the MAQAO [138] framework. An example input is provided in Table 7.7.

2. Microarchitecture information: simple parameters such as the size of each out-of-order resource or the Front-End and retirement uop bandwidth are needed. Default values (see Section 7.3) can be provided for each target microarchitecture. All the studied microarchitectures have an issue and retire bandwidth of 4 uops per cycle. A few behaviors are also microarchitecture specific, such as the status of microfused uops in the ROB.

Table 7.7: Partial UFS Loop Input Example (Realft2_4_de)

#Insn  Nb_FE  Type        Input     Output  Latency  Ports
1      1      compute     XMM1      XMM10   1        P5
2      1      compute     RAX       RDI     1        P1, P5
...
8      1      load        RSI, RDI  XMM15   4        P2, P3
...
              store_addr  RSI, R8           1        P2, P3
53     1      store       XMM15             3        P4
54     1      branch      RDX, RAX  test    1        P5

The specifics of what an instruction exactly does are irrelevant for UFS. Instead, only characteristics such as the number of uops it takes in the Front-End (Nb_FE), input/output registers and latency are important. The type category allows us to determine what type of resource the uop will need to get issued (e.g. a branch buffer entry for branch uops), coupled with the output field: for instance, instruction 1 is going to need a vector register as it produces an XMM value, while instruction 2 will need an integer one. The ports column provides the list of ports compatible with the uop. In some cases, a single instruction gets split into several uops. As each of them can potentially have different attributes, differences between them need to be described explicitly. This is the case for instruction 53. Complementary attributes are sometimes needed, e.g. for division uops, as they are going to use special resources exclusively for variable amounts of time.

The number of loop iterations to simulate can also be specified, with a default value of 1000. The limited information needed to run UFS makes it possible to analyze loops outside their execution context, and hence to save considerable amounts of time by not needing to simulate or execute uninteresting parts of the code to obtain legal register and/or memory states.
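To make this input format concrete, here is a hedged sketch of how such a loop description could be represented (field names mirror Table 7.7; this is our illustration, not MAQAO's or UFS's actual on-disk format):

from dataclasses import dataclass

@dataclass
class UopDescriptor:
    insn_id: int       # source instruction (#Insn)
    nb_fe: int         # number of uops taken in the Front-End (Nb_FE)
    kind: str          # 'compute', 'load', 'store', 'store_addr', 'branch', 'nop', ...
    inputs: tuple      # input register names
    output: str        # output register ('' if none, 'test' for flags)
    latency: int       # cycles before the result becomes usable
    ports: tuple       # dispatch ports compatible with the uop

# Instructions 1 and 8 of realft2_4_de, as listed in Table 7.7:
loop = [
    UopDescriptor(1, 1, 'compute', ('XMM1',), 'XMM10', 1, ('P5',)),
    UopDescriptor(8, 1, 'load', ('RSI', 'RDI'), 'XMM15', 4, ('P2', 'P3')),
]
print(loop[1].latency)   # 4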

7.4.2 Engine

UFS simulates as many cycles as needed for the last uop of the Nth iteration to retire, with N being the number of iterations to simulate. The different simulation steps for a cycle are as follows (in order of simulation):

1. Update [no order]: flags uops as being executed L cycles after they were dispatched, with L being their attributed latency. It is the equivalent of uop outputs being written back to the RS, allowing dependent uops to get dispatched.

2. Retire [in-order]: removes executed uops from the pipeline and releases their attributed resources. In a real microarchitecture, this is the step at which executed uops’ outputs would be committed to the architectural state.

3. Dispatch [out-of-order]: removes uops from the RS when all the uops they depend on were properly executed and a compatible execution port is available.

4. Issue [in-order]: inserts new uops in the ROB (and RS if need be) and allocates all needed resources, but only if the latter are available.

The current cycle count (i.e. the number of cycles simulated so far) is maintained and updated every cycle. Figure 7.4 presents an overview of the model. We will describe details for each simulated component in the coming sections.
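The per-cycle ordering above can be summarized by the following toy skeleton (a deliberately minimal sketch: everything except the phase ordering and the retirement bandwidth is stubbed out, and all names are ours):

class ToyState:
    # Stand-in for the real UFS state (ROB, RS, PRFs, ports, ...):
    # here we only count uops left to retire.
    def __init__(self, uops_to_retire):
        self.remaining = uops_to_retire

    def last_uop_retired(self):
        return self.remaining <= 0

def simulate(uops_per_iteration, iterations=1000, retire_bw=4):
    state = ToyState(uops_per_iteration * iterations)
    cycle = 0
    while not state.last_uop_retired():
        # 1. Update: mark uops whose latency has elapsed as executed.  (no-op here)
        # 2. Retire (in-order): drain up to retire_bw executed uops per cycle.
        state.remaining -= retire_bw
        # 3. Dispatch (out-of-order) and 4. Issue (in-order) would follow here.
        cycle += 1
    return cycle

print(simulate(uops_per_iteration=54))   # 13500: retirement-bandwidth bound only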

7.4.3 Simplified Front-End

As UFS targets loops, we can safely assume that all the uops sent to the uop queue came from the uop cache, hence ignoring the legacy decode pipeline and its limitations and providing a constant uop bandwidth of 4 per cycle. While the uop cache has limits of its own (e.g. it cannot generate more than 32B worth of uops in a cycle), we decided to ignore them as we could not find real world cases where they got in the picture. This is partly due to compilers being able to avoid dangerous situations by padding the code. We also assume the branch predictor is perfect and never makes mistakes, meaning we do not need to simulate any rollback mechanisms. This is a decently safe assumption for the loops we study due to their high numbers of iterations, but limits the applicability of UFS for loops whose iteration counts are both small and unpredictable. We consequently model Front-End performance in a simplified way:

1. For SNB / IVB: 4 uops can be generated every cycle, except that a limitation of the uop queue prevents uops from different iterations from being sent to the RAT in the same cycle. For instance, if a loop has 10 uops, the uop queue will send 4 uops in the first two cycles, but only 2 in the third.

2. For HSW: 4 uops can be generated every cycle; the limit experienced in SNB and IVB was apparently lifted.

In some cases, the uop queue has to unfuse microfused uops (a process also called unlamination) before being able to send them to the RAT [110], causing more issue bandwidth to be consumed (and sometimes, also more out-of-order resources). We take common cases into account using the following rules:

Figure 7.4: UFS Uop Flow Chart

Several types of uops are used in the pipeline. For instance, Front-End uops (FE uops) are different from Queue uops or ROB uops. Generally, the closer to the Front-End, the closer to the original instruction the uop is. As different components can split uops when processing them, Back-End uops may contain less of the original information and semantics than their earlier counterparts. In other words, more Back-End uops than FE uops may be needed to describe the same instruction. Such transformations will be detailed in the matching component's modeling description. The ROB and the RS are both resources and uop containers. Other resources have a more passive role and do not describe uops, acting instead as mere dependencies.

1. For SNB / IVB, unlaminate when the number of register inputs for the whole instruction is greater than 2.

2. For HSW, unlaminate when the number of register inputs and outputs for an AVX instruction is greater than 4. This rule was obtained empirically.

A sketch of this check is given below; the resulting Front-End model is described in Algorithm 1.
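A small sketch of how this check can be expressed (a hedged illustration under the two rules above; argument names are ours):

def has_to_be_unlaminated(uarch, is_avx, nb_reg_inputs, nb_reg_outputs):
    if uarch in ('SNB', 'IVB'):
        # Rule 1: more than 2 register inputs for the whole instruction.
        return nb_reg_inputs > 2
    if uarch == 'HSW':
        # Rule 2 (empirical): AVX instruction with more than 4 register operands.
        return is_avx and (nb_reg_inputs + nb_reg_outputs) > 4
    return False

# E.g. a microfused SNB store using base + index + data registers (3 inputs):
print(has_to_be_unlaminated('SNB', is_avx=False, nb_reg_inputs=3, nb_reg_outputs=0))  # True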

7.4.4 Resource Allocation Table (RAT)

The simulated RAT is in charge of issuing uops from the Uop Queue to the ROB and the RS, as well as allocating the resources necessary for their proper execution and binding them to specific ports. It does not have any bandwidth limit other than that induced by the Uop Queue's output.

/* Prevents uop queue starvation (assuming perfect uop cache) */
fill_queue = FE_bw - nb_uops (uop queue);
while fill_queue > 0 do
    insert (new uop cache uop, uop queue);
    if has_to_be_unlaminated (youngest_queue_uop) then
        unlaminate (youngest_queue_uop);
    end
    fill_queue--;
end

/* Adapts uop queue behavior depending on the target microarchitecture */
if target_uarch in {SNB, IVB} then
    issuable_uops = min (FE_bw, nb queue uops before next iteration);
else if target_uarch == HSW then
    issuable_uops = FE_bw;
end

/* Tries to get uops issued (by the RAT) */
while issuable_uops > 0 do
    if issue (oldest_queue_uop) == SUCCESS then
        issuable_uops--;
    else
        break;
    end
end

Algorithm 1: Simplified Front-End Algorithm

7.4.4.1 Resource Allocation

Typical resource allocation for queue uops (i.e. uops as sent by the uop queue) is described in Table 7.8. In regular cases, a simple table look-up is enough to find the resource attribution. However, it gets more complex when an instruction is decomposed into more than a single uop. In this case, all the resources needed at the instruction level will be allocated when the first uop reaches the Back-End. For instance, stores are decomposed into a store-address and a store-data uop: in this case, a Store Buffer entry will be reserved as soon as the store-address uop is issued, and the second uop will be assumed to use the same entry. However, individual uop resources (ROB or RS entries) will still be allocated at the uop granularity. It is important to note that if any resource is missing for the uop currently being considered, the RAT will stall and not issue any other uop until resources for the first one are made available. This is commonly referred to as a resource stall. Algorithm 2 summarizes how we model the issue mechanism.

7.4.4.2 Port Binding

Available information about dispatch algorithms in recent Intel microprocessors is rare and limited. We decided to bind uops to single ports in the RAT, sparing the RS from having to do a complex cycle-per-cycle evaluation of dispatch opportunities. Smarter strategies could be used, but we preferred to keep our simulation rules as simple as reasonably possible.

Table 7.8: Needed Resources for Queue Uop Types and Outputs (SNB)

Uop Characteristics           Needed Resources
Type            Output        BB  LB  All-P  FP-P  Int-P  ROB  RS  SB
Branch                        1                          1    1
Branch          GPR, test     1       1             1    1    1
Compute         GPR                   1             1    1    1
Compute         Vector Reg.           1      1           1    1
Fused compute   GPR               1   1             1    2*   2
Fused compute   Vector Reg.       1   1      1           2*   2
Unlam. 1st uop  GPR               1   1             1    1    1
Unlam. 1st uop  Vector Reg.       1   1      1           1    1
Unlam. 2nd uop  Any                                      1    1
Secondary uop   Any                                      1    1
Load            GPR               1   1             1    1    1
Load            Vector Reg.       1   1      1           1    1
Nop                                                      1
Nop             GPR                   1*            1*   1
Nop             Vector Reg.           1*     1*          1
Store_addr                                               1    1   1
Store                                                    1    1
Fused store                                              2*   2   1

This table presents the resources needed to issue uops with given characteristics (i.e. send them to the Back-End) in UFS modeling. GPR stands for “General Purpose Register”, while Vector Reg. refers to vector registers, a.k.a. XMM, YMM or ZMM registers. All-P, FP-P and Int-P respectively refer to the All, FP and Integer PRFs. For instance, a load uop with a vector register output will need entries in the LB, the All and FP PRFs, the ROB and the RS. Nop-typed uops do not need to be dispatched as they are taken care of directly at issue time. This applies to regular NOP instructions, which do not have any input or output. However, other instructions are handled similarly, such as XORPS instructions applied to identical registers: the hardware can recognize such cases as always generating null values, making dispatching the uop unnecessary. A XORPS uop can hence fall in either the compute category (if it operates on different registers) or the nop one (if it does not), changing the way the simulation will later treat it. Macrofused uops are a special case in which consecutive flag-modifying and conditional branch instructions are represented by a single uop. While other conditions apply in the hardware [139], we assume macrofusion to always be successful, and the resulting uops to be in the branch category with a GPR, test output. In case of unlaminated uops (i.e. uops that were fused in the FE, but were split in the uop queue), we assume the output register to be allocated when issuing the first RAT uop, and that the second uop consequently only needs RS and ROB slots. The same rationale is applied to multi-uop instructions, for which only the first uop will be given physical registers. Values with stars (*) are decremented for IVB and HSW, as the hardware seems to make better use of physical registers in zero-idiom cases [61].

The simulated RAT keeps track of the number of in-flight uops assigned to each port, and assigns any queue uop with several port options to the least loaded one.

input : target /* a uop from the uop queue */
output: success status

if enough resources are available for target then
    allocate needed resources to target;
    port binding (target);
    insert target in ROB and RS;
    remove target from uop queue;
    return SUCCESS;
end
return FAILURE;

Algorithm 2: Issue Algorithm

In case of equality, the port with the lowest digit is assigned (which creates a slight bias towards low-digit ports). This process is repeated on a per-uop basis, i.e. the simulated RAT uses the knowledge generated by issuing younger uops in the same cycle, rather than using counts only updated once per cycle, which may in turn be optimistic. It is described in Algorithm 3. Arbitrary numbers of ports can be activated, as their use is regulated by the loop input file anyway (see 7.4.1.2). New microarchitectures with more (or fewer) ports could be simulated by tweaking the input files' uop port attribution scheme to match the target's.

input : target /* a uop being issued */

Least_Loaded_Port = -1;
Least_Loaded_Port_Load = +∞;
foreach Eligible Port P for target in ascending digit order do
    if in_flight_uop_nb (P) < Least_Loaded_Port_Load then
        Least_Loaded_Port = P;
        Least_Loaded_Port_Load = in_flight_uop_nb (P);
    end
end
make Least_Loaded_Port the only eligible one for target;

Algorithm 3: Port Binding Algorithm

7.4.5 Out-of-Order Flow

Uops may progress in an out-of-order fashion from the moment they arrive in the Reservation Station until they are completed.

7.4.5.1 Reservation Station (Uop Scheduler)

The Reservation Station holds uops until their operands are ready and the needed port and functional unit are available. When arriving in the RS, queue uops that are still microfused get split in two (see Section A.8.2), simplifying the dispatch mechanism. The dispatch is done prioritizing older uops, following Algorithm 4.

for P = 0; P <= Nb Ports; P++ do
    foreach RS uop U, in descending order of age do
        if U is bound to port P and U's operands are ready and port P's functional unit for U is available then
            dispatch uop U on port P;
            remove uop U from the RS;
            break;
        end
    end
end

Algorithm 4: Dispatch Algorithm

7.4.5.2 Port and Functional Unit Modeling

Ports act as gateways to the functional units they manage. They are modeled as all being completely identical, and being able to process any uop sent to them by the RS. Functional units are not modeled distinctly, and constraints over them are modeled inside their respective port instead. Several rules are applied to match realistic settings:

1. A port can only process a single uop per cycle (enforced by dispatch algorithm).

2. Uops can be flagged as needing exclusive use of certain functional units for several cycles. For instance, division uops will make exclusive use of the divider unit for (potentially) dozens of cycles. A port processing such a uop will flag itself as not being able to handle other uops needing this particular unit for the specified duration. The same mechanism is also used for 256-bit memory operations on SNB and IVB. A port with busy functional units can still service uops not needing them.

3. While the port itself does not check whether it should legally be able to process a given uop, the RS verifies this a priori, preventing such situations in the first place.

7.4.5.3 Uops’ Execution Status Modeling

ROB uops have a time stamp field used to mark their status, holding the cycle count at which they will be fully executed. By convention, the default value for newly issued uops is −1: a uop's output is available if current cycle count ≥ the uop's execution time stamp > −1. Updating ROB uops' execution time stamps is typically done at dispatch time: as we deal with constant latencies, we can know in advance in what cycle the uop's output is going to be ready (current cycle count + uop latency). In the case of typical nop-typed instructions (such as NOP and zero-idiom instructions like XOR %some_reg, %same_reg), their time stamp is directly populated with a correct value at issue time, reflecting the RAT being able to process them completely in the studied microarchitectures. As they also have 0 cycles of latency, their stamp is simply set to the current cycle count.
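A small sketch of this time stamp convention (field and function names are ours):

UNKNOWN = -1   # default execution stamp for newly issued uops

def mark_dispatched(uop, current_cycle, latency):
    # With constant latencies, the completion cycle is known at dispatch time.
    uop['exec_stamp'] = current_cycle + latency

def output_available(uop, current_cycle):
    # "current cycle count >= execution time stamp > -1"
    return uop['exec_stamp'] > UNKNOWN and current_cycle >= uop['exec_stamp']

uop = {'exec_stamp': UNKNOWN}
mark_dispatched(uop, current_cycle=100, latency=4)              # e.g. an L1 load
print(output_available(uop, 103), output_available(uop, 104))   # False True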

However, we found an extra simulation step to be necessary to handle zero-latency register moves (implemented in IVB and HSW), which are nop-typed and entirely handled at issue time too. Contrary to NOPs or zero-idioms, register moves have register inputs, the availability of which is not necessarily established yet when the move uop is issued. We tackle this issue by inserting such uops with a negative time stamp if their input operand's availability is not known yet, and letting a new “uop status update” simulation step update them when it is. This step is performed at the very beginning of every cycle, ensuring zero-latency moves do not stall retirement or dispatch.

7.4.6 Retirement

The retirement unit removes uops from the ROB, releases their resources and makes them available for uops issued later. Not all the following rules for modeling retirement are documented or strictly necessary, so we will explain our reasoning and degree of certainty as well:

1. (Needed) Retirement is done in-order: no uop can be retired if an older uop still exists in the ROB. This is necessary to be able to handle precise exceptions and rollback to a legal state.

2. (Certain) Retirement peak bandwidth is at least equal to that of the Front-End. Otherwise, retirement would become a bottleneck when non-unlaminated uops are issued, which we empirically did not find to be the case. We implemented this rule by setting the default retirement bandwidth to be the same as the FE's (4 uops per cycle), and by allowing ROB uops that were still microfused in the RAT to only count as a single uop in this regard.

3. (Assumption) Resources released in a given cycle cannot be reused in the same cycle. Our understanding is that it would be extremely complex to implement a solution allowing this, with very little performance to be gained (potentially increasing each resource’s effective size by a maximum of 4). Note: we apply the same reasoning to the RS, even though its entries are freed at dispatch time instead of at retirement.

4. (Assumption) All resources allocated at the instruction level at issue time are released when retiring the last uop for the instruction in question. (This is not certain because the resource allocation scheme we use is an extrapolation to begin with.)

In the case of register-writing uops, real hardware retirement cannot free the very physical registers that were attributed at issue time, but rather the ones formerly used to hold the matching named registers’ architectural state. Strictly speaking, this is contradicting the simplified modeling described above; however it is irrelevant for our purposes as a) we do not model individual registers, but only the number of used ones and b) we use the numbers of registers available for speculation as simulation input.

7.4.7 Things Not Modeled

Many aspects of the target microarchitectures' behavior are not simulated. Some of them are inherently so due to our approach, such as:

1. Cache and RAM behavior: accurate prediction of cache behavior would require simulating the program's semantics to identify the accessed cache lines and/or generating a dynamic trace of memory accesses, either of which would be extremely time consuming and would preclude the out-of-context analysis doable with UFS.

2. Read-after-write (RAW) memory accesses and conflicts: while some of them can be detected statically (which would be compatible with our limited input objective), finding them all would require using the same processes as for cache behavior. Furthermore, as compilers can typically fix statically detectable occurrences preventively anyway (especially in innermost loops), the value of partial coverage of RAW hazards would be questionable for the HPC loops we study.

3. Branch mispredictions: as with memory accesses, extensive dynamic information is required for the detailed simulation of branch mispredictions. Hence, we do not model branch predictors or the pipeline flushes needed when a mispredict (or exception) occurs. This restricts our ability to model short or branchy loops accurately.

Others are implementation choices, more subject to change:

1. The Load Matrix (LM) (see Appendix B): a little-documented out-of-order resource that keeps track of load uops from the moment they are issued until they are completed [140]. It was not yet implemented at the time of writing this chapter due to time constraints, but is likely rarely a limiting factor when operating in L1. Implementing it will be necessary to achieve good accuracy when targeting higher latencies and cache levels.

2. The impact of pending stores on later load and/or store uops: stores whose address calculation was not dispatched or completed yet might prevent later load and store uops from getting dispatched due to memory ordering constraints. More experiments and tests should be done to ascertain this.

3. There are fewer simulated stages than exist in the target microarchitectures. While the number of stages mostly does not impact us as we do not model branch misprediction, it could have a slight impact on resource consumption and could be adjusted in the future.

4. Writeback conflicts and execution stack value transfers: while they are well detailed in [141, 142], we expect them to very rarely play a visible role in performance degradations, and implementing them to hence have a low effort / reward ratio.

5. Partial integer register stalls: some registers' subparts have explicit architectural names (e.g. AL represents the lowest 8 bits of the 64-bit register RAX), allowing them to be accessed using different granularities. However, mixing such accesses sometimes incurs various penalties, as described in [143]. As with writeback bus conflicts, we expect the potential implementation gain not to be worth the extra complexity, and consequently opted to leave this issue aside for the time being.

6. Multi-output instructions: some instructions modify several registers. The current model does not support attributing several physical registers to the same instruction, but it could be easily implemented; the only reason why we have not done so yet is that they are infrequent in the HPC codes we target.

Furthermore, while a lot of information is available in terms of how Intel CPUs work, many hardware implementation details are not publicly available. We could fill some of the gaps using reasonable guesses, but they are probably flawed to some degree, restricting the precision attainable by our model.

7.4.8 Ad-Hoc L1 Modeling

While our model does not support detailed cache modeling, we can still change L1 performance in two manners:

1. Change the latency of load and store uops, which reflects L1 latency accurately.

2. Change the rate at which load uops’ output is made available to other uops, impacting the effective L1 bandwidth.

We hence implemented the following (optional) features, focusing on loads:

1. Load latency change: the simulator ignores the latency specified for load uops in input files, and sets it to a user-specified value instead. This is a convenience option, as it allows users to quickly change L1's load latency, but finer adjustments can be made by modifying loop input files.

2. Cycles per load restriction: the uop status update simulation phase described in 7.4.5.3 was tweaked to actively delay load uops by incrementing their execution time stamp when the defined quota is temporarily exceeded. This prevents them from retiring, and dependent uops from getting dispatched. This bandwidth constraint is expressed in cycles per load. Its default value matches the hardware's bandwidth (with each load keeping L1 busy for 0.5 cycles, i.e. up to 2 loads executable per cycle), but it is a freely modifiable parameter (see the sketch below).

This latter feature has no other purpose than to allow for a crude evaluation of how bandwidth constrained a codelet is.
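A hedged sketch of how such a throttle can fit into the uop status update phase (our illustration; the actual UFS implementation may differ in its details):

class LoadThrottle:
    # Caps load completions per cycle; with cycles_per_load = 0.5, at most
    # 2 loads may complete in any given cycle, matching L1's peak bandwidth.
    def __init__(self, cycles_per_load=0.5):
        self.cycles_per_load = cycles_per_load
        self.busy_until = 0.0

    def try_complete(self, uop, cycle):
        # Called when a load uop reaches execution maturity.
        if self.busy_until > cycle + 1 - self.cycles_per_load:
            uop['exec_stamp'] = cycle + 1   # quota exceeded: delay to next cycle
            return False
        self.busy_until = max(self.busy_until, float(cycle)) + self.cycles_per_load
        return True

throttle = LoadThrottle()
loads = [{'exec_stamp': 10} for _ in range(3)]
print([throttle.try_complete(u, 10) for u in loads])   # [True, True, False]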

7.5 Validation

The validation work for UFS is two-fold:

1. Accuracy: checking whether the model provides faithful time estimations for loops operating in L1.

2. Speed: making sure simulations are not prohibitively slow for their intended use.

We will focus on SNB validation, as it is the microarchitecture we targeted most during this thesis, and identifying SNB performance drops was the primary motivation for developing UFS in the first place. Furthermore, its modeling is used as the basis for IVB and HSW support, making SNB validation particularly important.

We will use the fidelity metric described in Section 7.1.1 to represent UFS accuracy for each studied loop, and systematically compare UFS results with CQA projections to highlight our model's contributions. A short study of the time taken by our UFS prototype will be made, and results will be presented in terms of simulated cycles per second.

7.5.1 Another Look at our Motivating Examples

We will have a second look at the motivating examples presented earlier to see what light our UFS implementation could shed on them.

7.5.1.1 Realft2_4_de

Running UFS on the REF variant of realft2_4_de gives encouraging results (see Table 7.9), completely filling the gap between the CQA cycle estimation and the measurement.

Table 7.9: Realft2_4_de: UFS Validation (SNB)

Metric                Cycles per Iteration  Fidelity
Measured              23.36                 N/A
CQA                   16.00                 68.49%
UFS (Normal buffers)  23.01                 98.50%
UFS (Large buffers)   19.03                 N/A

Fidelity for UFS (with regular SNB parameters) is 98.50%, an important improvement over CQA's 68.49%. Running UFS again with all out-of-order resources set to have 1000 entries (UFS (Large buffers) row) shows nearly 4 (23.01 − 19.03 = 3.98) cycles can be gained by merely increasing buffer sizes.

Table 7.10 shows a cycle-per-cycle trace of the UFS simulation for this codelet. Its columns show the following for every cycle:

1. C / RS: respectively the cycle count and number of RS entries available at the beginning of the cycle. Other resource counters were left out for simplicity as they do not impact performance for this codelet.

2. Issued: issued uops.

3. Cpltd: uops reaching execution maturity; they can potentially be retired and dependent uops can now be dispatched.

4. Dispatched: where and when each uop is dispatched. Note that the color codes are the same as the ones used in Table 7.2 and Figure 7.2.

5. Retired: retired uops.

We chose to show cycles 189 to 211 as a) they show an entire assembly iteration being issued and b) a steady state is reached. Indeed, early iterations can behave in a significantly different manner as the pipeline fills up. Here, the trace for later iterations is strictly identical to the one presented, except for shifted iteration numbers. Several observations can be made:

1. Up to 3 different iterations cohabit in the pipeline: on cycle 189, uops from iteration 10 get issued while uops from iteration 8 are still being retired. However, the distance between the oldest and the youngest uops is only uops remaining from iteration 8 + uops from iteration 9 + uops from iteration 10 = (54 − 46 + 1) + 54 + 4 = 67, which gives an idea of the window size for this codelet.

2. The port binding algorithm was particularly efficient here, and the workload is as balanced as can legally be. As a consequence, P1 (the bottleneck according to CQA) only handles uops for which it is the only legal option.

3. The number of instructions dispatched per cycle varies from 1 (cycle 207) to 4 (cycle 192), with an average of 54/23 = 2.35.

4. The RAT gets stalled by the lack of RS entries in most cycles (e.g. only issuing 2 uops on cycle 190). It only issues uops at full speed on cycles 189 and 193.

5. The reason only 1 uop is issued on cycle 211 is the SNB RAT limitation on issuing uops from different iterations in the same cycle.

6. P1 execution bubbles are revealed, explaining the mismatch between CQA and measurements.

This accuracy gain is due to two main factors:

1. RS size awareness: running UFS again with a virtually infinite RS size shows that 4 cycles could be gained from having larger out-of-order buffers (Table 7.9).

2. Realistic dispatch: the dispatch heuristics sometimes prioritized uops that are not on the critical path.

The CQA evaluation represents the performance the loop would attain if not for these issues.

7.5.1.2 Realft_4_de

UFS results for the REF variant of realft_4_de (see Table 7.11) also explain a sizeable chunk of the gap between the CQA-predicted and the measured performance. However, unlike with realft2_4_de, only the size of the RS comes into the picture: running UFS again with larger buffers produces the same time estimate as CQA.

7.5.2 In Vitro Validation

We generalized our validation procedure to codelets we had already studied previously, and for which experimental data was readily available (see Chapter 3). We narrowed down our selection to those for which the smallest used data sets fit entirely in L1 in their SSE version. We hence used a) 14 codelets from the Numerical Recipes (NRs) [69] and b) 2 of Maleki's vectorization test codelets [59], present in both their vectorized and scalar versions (see Section 6.5).

Table 7.10: Realft2_4_de UFS Trace

[Table body: a cycle-by-cycle trace for cycles 189 to 211, one row per cycle, with the columns described above: C, RS, Issued, Cpltd, Dispatched (per port, P0 to P5) and Retired. Each cell lists the uops concerned, identified with the triplet format described below. The original multi-column layout cannot be faithfully rendered in this text version.]

Uops are represented by triplets of the following format: [iteration, instruction number, uop rank / number of uops for the instruction]. A fourth element is needed for codelets where some uops are still microfused in the Back-End, but that is not the case here. Uop ids are consistent with the ones detailed in Table 7.2 and used in Figure 7.2.
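For readers wishing to post-process such traces, the identifier format is easy to decode mechanically. The following Python snippet is a hypothetical helper (not part of the UFS prototype) that parses the triplet notation described above.

    import re

    # Matches uop identifiers such as "[10,2,1/1]":
    # iteration, instruction number, uop rank / uops in the instruction.
    UOP_ID = re.compile(r"\[(\d+),(\d+),(\d+)/(\d+)\]")

    def parse_uop_id(text):
        match = UOP_ID.fullmatch(text.strip())
        if match is None:
            raise ValueError("not a uop identifier: %r" % text)
        return tuple(int(group) for group in match.groups())

    print(parse_uop_id("[10,2,1/1]"))   # -> (10, 2, 1, 1)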

Table 7.11: Realft_4_de: UFS Validation (SNB)

Metric                  Cycles per Iteration    Fidelity
Measured                15.66                   N/A
CQA                     12.00                   76.63%
UFS (Normal buffers)    14.53                   92.78%
UFS (Large buffers)     12.04                   N/A

Fidelity for UFS (with regular SNB parameters) is 92.78%, improving over CQA (76.63%), but not as drastically as with realft2_4_de. Running UFS again with all out-of-order resources set to have 1000 entries (the "UFS (Large buffers)" row) shows 2.5 (14.53 − 12.04 = 2.49) cycles can be gained with bigger buffers, matching CQA's projection.

7.5.2.1 Experimental Setup

NR codelets were compiled with icc v.12.1.3, with compilation flags -O3 -xSSE4.2 -g; Maleki codelets were compiled with -O3 -xSSE4.2 -std=c99 -g for their vector versions, and -O3 -xSSE4.2 -std=c99 -no-vec -g for their scalar counterparts. Experimental measurements were made on SNB CPUs, with hyper-threading deactivated, and following the methodology described in Chapter 3. The data L1 size is 32 KB, as with all SNB CPUs. As we focus exclusively on per-cycle, single-core L1 performance, the exact model number is not relevant. We perform our validation for the three main DECAN transformations: FP, LS and REF.

7.5.2.2 FP Variant

Figure 7.5: In Vitro Validation for FP [SNB]. UFS gives accurate results for FP variants. However, CQA already gives good results in most cases, and UFS only provides a significant improvement for realft2_4_de (with a .15 point gain). More importantly, UFS's fidelity is always higher than (or equal to) CQA's.

Figure 7.5 shows the results for the FP versions of the selected codelets. UFS improves projections (compared to CQA) for certain codelets, but in most cases the gain is rather negligible. The average fidelity for UFS is 96.02%, versus 94.55% for CQA. Hence, UFS has a high fidelity for FP, but does not establish a strong advantage over CQA there.

7.5.2.3 LS Variant

Figure 7.6: In Vitro Validation for LS [SNB]. UFS provides exactly the same time estimations as CQA for LS variants.

UFS results for LS (Figure 7.6) differentiate themselves from CQA even less than the FP ones, and this time there is no corner case that UFS handles better. This is however not surprising, as UFS's main added value is the handling of pipeline hazards, which are unlikely to occur with codes composed almost entirely of pure memory instructions: the number of dependencies is reduced, and with it the difficulty in keeping the relevant dispatch ports busy. Still, as with FP, UFS fails to show a clear advantage over regular CQA analyses.

7.5.2.4 REF Variant

REF is the most interesting variant for several reasons:

1. For the user, it is the one that actually matters in terms of experienced performance.

2. For our validation purposes, REF is the variant where the instruction flows from FP and LS are combined, increasing the overall complexity of the whole uop flow.

Results for REF can be found in Figure 7.7. UFS fills important gaps left by CQA, the most important case of which (realft2_4_de) we saw in detail in Section 7.5.1.1.

Figure 7.7: In Vitro Validation for REF [SNB]. CQA modeling is sufficient in the majority of cases, but UFS provides a significant fidelity gain for several codelets (four1_2_me, realft2_4_de, realft_4_de and toeplz_1_de) without compromising accuracy in any of the others. There are still sizeable gaps for certain codelets, with a fidelity of only 71% in the worst case (s1244_se).

The improvements for four1_2_me, realft2_4_de, realft_4_de and toeplz_1_de were respectively of .21, .30, .16 and .06 points. However, not everything is explained yet, with projections for s1244_se still having an error of around 30% (29% for UFS, 30% for CQA). The average fidelity for UFS is 92.73%, against 88.49% for CQA: the gap is more than twice as large for REF as it is for FP.

7.5.3 In Vivo Validation

We then shifted our validation focus from in vitro codelets to loops from actual industrial applications. They are typically more complex than NR or Maleki codes, and may give a legitimate idea of how UFS fares in real-world cases.

7.5.3.1 Experimental Setup

Performance measurements for these applications were performed in vivo using DECAN. The host machine was a dual-socket system with E5-2670 SNB CPUs (32 KB of data L1 cache and 256 KB of L2 per core, and 20 MB of L3 per socket). It also had 32 GB of DDR3 RAM. For each tested application, we selected loops that:

1. Are hot spots: the studied loops are relevant to the application's performance.

2. Are innermost, have no conditional code and can therefore be analyzed out of context (see Section 7.1.4).

3. Have a measured time of more than 500 cycles per loop call for all the classical FP, LS and REF variants. This is needed to make sure measurements are

reliable (small ones can be inconsistent [144]). This may exclude small loops that are called numerous times.

We use the DECAN variant DL1 [37] to force all memory accesses to hit constant locations, and thereby get a precise idea of what the original loop's performance would be if its working set fit in L1. This also allows us to make direct comparisons between measured cycles per iteration and the UFS and CQA projections, as the other components of the memory hierarchy are artificially withdrawn from the picture.

7.5.3.2 YALES2: 3D Cylinder

YALES2 [123, 124] is a numerical simulator of turbulent reactive flows using the Large Eddy Simulation method. Its performance scales almost linearly with the number of execution cores even with thousands of cores.

Figure 7.8: In Vivo Validation for DL1: Y2 / 3D Cylinder [SNB]. Gaps between CQA projections and measurements are frequent and often important, but UFS covers many of them at least partially. In the most favorable case (loop 22062), UFS makes a perfect projection (fidelity of 1.00), while CQA's fidelity is at only .65. Certain loops' performance is still unexplainedly poor, though. For instance, both UFS and CQA have a fidelity of .65 for loop 3754.

Figure 7.8 shows UFS and CQA results for the 3D cylinder part of this application. UFS shows fidelity gains of more than 0.05 points for 12 loops out of 26, with a maximum gain of .35 for loop 22062. Other particularly important gains include .28 and .24 for loops 22040 and 4389 respectively. Some loops' performance is degraded by factors apparently not modeled by UFS, with disappointing fidelities of respectively .65 and .75 for loops 3754 and 3424. The average fidelity is 91.67% for UFS, versus 82.93% for CQA.

7.5.3.3 AVBP

AVBP [130] is a parallel CFD numerical simulator targeting reactive unsteady flows. Its performance scales nearly linearly for up to 4K nodes [145].

Figure 7.9: In Vivo Validation for DL1: AVBP [SNB]. UFS improves the static modeling of the studied loops significantly, with all projections having a fidelity higher than 75%. Indeed, the worst case is 78.18% for loop 13906. There is however still room for improvement.

Figure 7.9 shows UFS and CQA results for 29 AVBP hot loops on Sandy Bridge. UFS shows fidelity gains of more than 0.05 points for 9 of them, with a maximum gain of .27 points for loops 7507 and 7510. Other important gains include .20 for loops 7719 and 3665. The worst fidelity for UFS is 78.18% for loop 13906 (against 66.76% for CQA on loop 3665). The average fidelity is 91.73% for UFS, versus 86.34% for CQA.

7.5.4 Simulation Speed

Speed is very important for performance evaluation tools, especially in the context of optimization: various versions of a program can be tested, e.g. trying different compiler flags or hand optimizations. A model's quality can be thought of in terms of return on investment: are the model's insights worth their cost? We will hence study UFS's speed in this section, and evaluate the cost of UFS analyses.

7.5.4.1 Experimental Setup

Simulations were run serially on a desktop machine with an i7-4770 HSW CPU, running at 3.4 GHz. They were run on a single core, with 32 KB of L1 data cache, 256 KB of L2 cache and 8 MB of L3. It also had 16 GB of DDR3 RAM.

The targeted microarchitecture was SNB, with its default microarchitectural parameters, but simulating different numbers of iterations: 1000 and 100 000. The former is the default one and the most relevant to our analysis, while the latter was run to give an idea of sustained simulation speeds past the initialization phase (slowed down by I/O). Execution times were measured using the time Linux tool, with a resolution of 10 ms. While other measurement methods would be more precise, we deemed this one to be enough for our purposes here. Furthermore, the time needed to generate the loop input files with MAQAO is not counted here. Measurements were performed with 11 meta-repetitions to stabilize results.

7.5.4.2 NRs and Maleki Codelets

Figure 7.10: UFS Speed Validation for NRs and Maleki Codelets (REF Variant). Our UFS prototype is very fast, and simulates hundreds of thousands of cycles per second. When simulating 100 000 iterations (which is excessively high for projecting a loop's performance, but may give a better idea of the simulation performance when ignoring initialization costs), this number can be above 1M cycles per second (e.g. hqr_15_se). In practice, simulation times are around .05 seconds for each loop, though some of them reach .20 seconds (e.g. svdcmp_14_de).

Figure 7.10 shows simulation speeds for the REF versions of the in vitro codelets studied earlier. We can see that simulating 1000 iterations usually takes around .05 seconds per loop, with peaks at around .20 seconds for certain codelets. UFS can also simulate many cycles per second, with peaks going as high as 500K cycles / second (for svdcmp_13_de) when simulating 1000 iterations, and 1M cycles / second (for hqr_15_se) when simulating 100 000. The discrepancy is due to initialization costs being absorbed in the latter case. For NR and Maleki codelets, we achieve on average:

1. Simulation times (for 1000 iterations) of approximately .04 seconds per loop, allowing us to sequentially simulate up to 21.18 loops per second.

2. The simulation of 314K cycles per second for 1000 iterations (and less relevantly, 664K for 100 000 iterations).

7.5.4.3 YALES2: 3D Cylinder

Figure 7.11: UFS Speed Validation for YALES2: 3D Cylinder. Our UFS prototype typically simulates around 200K cycles per second here. This number doubles when simulating 100 000 iterations. In practice, simulation times are around .10 seconds for each loop. Some of them are particularly long to simulate, though, and can take around .60 seconds to complete (e.g. loop 12557).

Figure 7.11 shows simulation speeds for the YALES2 (3D Cylinder) loops we studied. Here, the time needed to simulate 1000 iterations has a high variability, and can go from as low as .02 seconds to as high as .56 seconds. This is because some of the loops are very complex and comprise hundreds of instructions; each iteration hence needs more simulated cycles to complete. Furthermore, the number of instructions can impact the locality of our UFS prototype's data structures, with large loops consequently being simulated less quickly. However, the number of simulated cycles per second is still on par with the ones experienced with in vitro codelets, ranging from 200K (loops 9632 and 36505) to 430K (loop 5268) when simulating 1000 iterations. The range is greater when simulating 100 000 iterations, starting at 305K (loop 12578) and ending at 1.136M (loop 3651). For YALES2: 3D Cylinder, we achieve on average:

1. Simulation times (for 1000 iterations) of approximately .13 seconds per loop, allowing us to sequentially simulate up to 8 loops per second.

2. The simulation of 281K cycles per second for 1000 iterations (and 549K for 100 000 iterations).

Figure 7.12: UFS Speed Validation for AVBP. Our UFS prototype simulates an average of 300K cycles per second for the studied loops; this average is 1.6x higher when simulating 100 000 iterations. The average simulation time is around .28 seconds for each loop. Loop 7578 takes particularly long to simulate (2.73 seconds) and presents an interesting case, as it is also the loop for which simulating cycles is the fastest (with 416K cycles per second). The long duration is due to the loop containing 1337 instructions (including divisions), making each of the 1000 iterations require many simulated cycles to complete. The divisions are also what causes the high simulated-cycles-per-second ratio: they slow down the flow of uops in the pipeline (causing stalls), and as a result the average number of changes to the pipeline's state between cycles is quite low. The average simulation time would be .19 seconds without this outlier.

7.5.4.4 AVBP

Figure 7.12 shows simulation speeds for the AVBP loops we studied. As with YALES2 loops, the simulation time for 1000 iterations is highly variable, going from as low as .02 seconds for loop 3685 to as high as 2.73 seconds for loop 7578. However, simulations take .28 seconds on average, which is longer than for YALES2 (.13 seconds). The difference is due to AVBP loops being more complex, consisting of 200 assembly statements on average, while this metric was at an already high value of 110 for YALES2. We can hence sequentially simulate ∼3.57 AVBP loops per second.

7.5.4.5 Comparison with CQA

We will quickly assess CQA’s speed to give an idea of how UFS compares to it. As CQA can process hundreds of loops per second, measuring individual loops’ processing time would prove tricky. Instead, we chose to run CQA on the YALES2 (3D cylinder) binary for:

1. All loops (including many that are never actually called at runtime).

2. The 26 loops studied earlier (in a single run).

When removing the overhead due to the MAQAO framework (mostly consisting of disassembling the binary) to make comparisons with UFS fair, we found that CQA could process a) 303.78 loops per second in the first case and b) 42.85 in the second. If we assume the average complexity of all the loops present in the YALES2 binary to be equal to the average complexity of the studied NR and Maleki codelets, we can roughly estimate UFS to be 303/21 ≈ 15x slower than CQA for low-complexity cases, and 43/8 ≈ 5x for more complex loops.

Applied to AVBP, the same methodology shows that CQA can process 260.25 loops per second when handling all the loops in the binary, and 30.98 when targeting the hot loops we studied earlier. This brings the overhead for using UFS to 260.25/21 ≈ 12.39 for simple cases (using the same assumption of average simplicity as above), and to 30.98/3.57 ≈ 8.68 for complex cases.

7.6 Related Work

Code Quality Analyzer (CQA) [21], to which we compared UFS throughout this chapter, is the tool closest to UFS that we know of: both analyze loops at the binary / assembly level, rely on purely static inputs and have a special emphasis on L1 performance. They actually both use the MAQAO framework to generate their inputs. CQA works in terms of bandwidth, which it assumes to be unimpeded by execution hazards. As its name suggests, it assesses the quality of targeted loops, for which it provides a detailed bottleneck decomposition as well as optimization suggestions and projections. UFS differs by focusing solely on time estimations, accounting for dispatch inefficiencies and limited buffer sizes. It does so by simulating the pipeline's behavior on a cycle-accurate basis, adding precision at the cost of speed. Finally, CQA supports more microarchitectures than UFS.

IACA [146] works similarly to CQA, and estimates the throughput of a target code based on uop port binding and latency in ideal conditions. It can target arbitrary code sections using delimiting markers, while both CQA and UFS only operate at the loop level. It does not account for the hazards UFS was tailored to detect, and we consequently expect it to be faster but less accurate. Like CQA, IACA also supports more microarchitectures than UFS.

Zesto [46, 147] is an x86 cycle-accurate simulator built on top of SimpleScalar [148] and implements a very detailed simulation of the out-of-order engine similar to that of UFS. However, as with other detailed simulators like [149], the approaches are very different: it works as a regular CPU simulator and handles the semantics of the simulated program. Its simulation scope is also much wider, with a detailed simulation of branch prediction, caches and RAM. UFS focuses solely on the execution pipeline, and particularly the out-of-order engine. It disregards the semantics, and targets loops directly with no need for contextual information (such as register values, memory state, etc.), making it considerably faster due to both not having to simulate regions of little interest and simulating significantly fewer things. Furthermore, UFS targets SNB, IVB and HSW, while to the best of our knowledge Zesto only supports older microarchitectures.

Very fast simulators exist, but typically focus on different problematics. For instance, Sniper [150, 151] uses both interval simulation (an approach focusing on miss events) and parallelism to simulate multicore CPUs efficiently. As said events (cache misses and branch mispredictions) are irrelevant in the cases targeted by

UFS (memory accesses always hit L1, loops have no if statements and have large numbers of iterations), the use cases are completely disjoint. UFS is to our knowledge the only model targeting binary / assembly loops that both disregards the execution context and accounts for dispatch hazards and limited out-of-order resources.

7.7 Future Work

Evaluating the impact of unmodeled hardware constraints would be interesting to determine whether or not implementing them in UFS could be profitable. Such constraints include writeback bus conflicts and partial register stalls. The impact of simulating fewer loop iterations should also be studied, as our current default value of 1000 may be unnecessarily high and time consuming.

As our base UFS model is aimed at Sandy Bridge, we could easily construct models for incremental improvements such as Ivy Bridge and Haswell on top of it. However, validation work is necessary to evaluate their respective fidelities, and to see if more microarchitecture-specific rules have to be implemented. Expanding the model to support further “Big Core” microarchitectures (e.g. Broadwell, Skylake...) would also be of interest.

The idea of Uop Flow Simulation can be applied to vastly different microarchitectures (such as the one used in Silvermont cores, or even ARM CPUs), and could have interesting applications beyond performance evaluation tools. For instance, the fact that it works out of context means it could easily be used by compilers to better evaluate and improve the quality of generated code. In terms of codesign, UFS models could be used to quickly estimate the impact of a microarchitectural change on thousands of loops in a few minutes. Coupling this modeling technique with a bandwidth-centric fast-simulation model such as Cape [70] would allow for non-L1 cases to be handled efficiently as well.

7.8 Acknowledgements

We would like to thank Gabriel Staffelbach (CERFACS) for having provided our laboratory with the AVBP application, as well as Ghislain Lartigue and Vincent Moureau (CORIA) for providing us with YALES2.

We would also like to thank Mathieu Tribalat (UVSQ) and Emmanuel Oseret (Exascale Computing Research) for performing and providing the in vivo measurements we used to validate UFS on the aforementioned applications.

7.9 Conclusion

We demonstrated UFS, a cycle-accurate loop performance model allowing for the static, out-of-context analysis of assembly loops. It takes into account many of the low-level details used by tools like CQA or IACA, and goes further by estimating the impact of out-of-order resource sizes and various pipeline hazards. Our UFS prototype shows that UFS is very accurate, and exposes formerly unexplained performance drops in loops from industrial applications and in vitro codelets alike. Furthermore, it offers very high simulation speeds and can serially process dozens of loops per second, making it very cost effective.

Chapter 8 Conclusion

This chapter will summarize the contributions of this thesis and present leads for future work.

8.1 Contributions

This thesis has explored the identification, quantification and modeling of HPC loops’ bottlenecks, and provided the following major contributions:

1. PAMDA (in Chapter 4), a performance assessment methodology that combines static and dynamic analysis tools to help software developers find performance bottlenecks, quantify them and expose optimization opportunities.

2. An extension of the Cape linear model to Sandy Bridge (in Chapter 5), and an application thereof with VP3 (in Chapter 6), a tool evaluating potential vectorization gains for scalar loops.

3. UFS (in Chapter 7), an approach combining static analysis and cycle-accurate simulation to predict loop execution times a) with realistic out-of-order engine constraints and b) at a very fast speed. UFS identifies and quantifies performance problems not well captured by PAMDA and/or Cape modeling.

More minor contributions include:

1. A framework and methodology to study single-loop codelets from different angles, using static and dynamic tools and different execution parameters (in Chapter 3).

2. An experimental approach to quantify the effective sizes of out-of-order buffers (in Appendix A).

8.2 Publications

Some of the work performed for this thesis was also presented in the following publications:

1. Simsys: A Performance Simulation Framework [70].

2. PAMDA: Performance Assessment Using MAQAO Toolset and Differential Analysis [85].

3. VP3: A Vectorization Potential Performance Prototype [121].

8.3 Future Work

We will present the main research leads and challenges related to our work here.

8.3.1 Differential Analysis

Differential analysis is the main objective of the DECAN tool, which is used as a component of most of the tools and methodologies presented in this thesis.

The buffer-related interactions between the FP and LS instruction streams in the out-of-order engine represent a challenge to the differential analysis approach. Indeed, we could not find loop transformation rules that would directly isolate the performance impact of limited buffer sizes, so we can only estimate this impact by attributing to it the leftover cycles not explained by other transformations. Studying how safe this assumption is using detailed cycle-accurate simulators would be interesting, as would actually trying to find an acceptable way to directly measure the impact of buffer sizes. A possible lead is to use dependency-breaking instructions (such as the zero-idiom XORPS) to reset dependencies after each load and before each store. However, this potential transformation would have an impact on the Front-End which should be accounted for.

8.3.2 Cape Modeling

While the Cape bandwidth-centric modeling is very fast, flexible and usable for various purposes, it is currently negatively affected by latency- and buffer-size-related phenomena. This needs to be addressed to improve the model's precision, especially when explored potential designs get too distant from the original machine's characteristics. This could be done by involving simulator tools to handle part of the projection work (and e.g. refine final designs). While this would come at a cost in terms of projection speed, it would fit Cape's objective of exploring designs at a low cost while leaving the implementation details to more specialized tools.

Another possibility is to explicitly model latency and buffer sizes within Cape. It may be possible to use specialized pseudo-nodes to adjust bandwidth-centric projections, but this would require a more extensive knowledge of all the key queues and latencies within the modeled system.

8.3.3 Uop Flow Simulation

Testing and validating UFS on Ivy Bridge, Haswell and Broadwell would be a worthwhile endeavor, especially as Core CPUs evolve at Intel's fast-paced cycles of ticks and tocks [17]. However, the interest could be even greater on microarchitectures that are different altogether, such as Intel Silvermont, AMD's Bulldozer and Zen, ARM's ARMv8 and IBM's POWER8.

UFS could also be used to perform sensitivity analyses on loops regarding out-of-order buffers or access latencies, and to evaluate their behavior beyond just L1.

Finally, as UFS is very fast and works completely out of context, it could reasonably be used by compilers instead of less detailed heuristics to better evaluate the quality of generated assembly code.

Appendix A Quantifying Effective Out-of-Order Resource Sizes

We can quantify the effective sizes of out-of-order resources experimentally by crafting loops that purposefully strain them, using the out-of-order weaknesses presented in Section 7.2. [61] achieved this for the ROB and the PRF using linked lists with poor spatial locality, also finding that the FP and Integer registers are not entirely independent. We extend this work by:

1. Using square root operations instead of linked lists to generate stable high latencies, and to get constant execution times for as long as the number of targeted resources is not exceeded.

2. Measuring more resources (Branch, Load and Store buffers), as well as the Reservation Station (for which a slightly different technique has to be used).

3. Doing the measurements on HSW in addition to SNB and IVB.

A.1 Basic Experimental Blocks

We use different techniques to handle the two resource deallocation schemes existing in the target systems:

1. The “allocated at issue, released at retirement” scheme used for the BB, LB, PRFs, ROB and SB. Here, we use blocks of slow FP square roots to jam uop retirements, and independent instructions targeting different resources. This setup is described in Table A.1. The dispatcher is able to keep the divider unit permanently busy for as long as the next iteration’s jam’s first uop is issued before the current iteration’s jam’s last uop is fully executed. This is not the case anymore when the number of resources required by payload uops (and the control instructions) exceeds those available and causes issue stalls.

2. The “allocated at issue, released at dispatch” scheme, only used for the RS. The same idea is applied, but this time the jam prevents payload uops from getting dispatched. The latter are made to depend on the last square root operation for this purpose, making sure they stay in the RS until the jam's last uop is executed.

We can then try to determine the size of a given resource by varying the number of resources needed in the payload, and looking for the performance break point. We will refer to the number of instructions in the payload as the Payload Size, or P.

The number of square root instructions in the jam may have to be tweaked depending on the expected size of the target buffer. As the slowness of square root operations is what allows us to stack more uops behind them without their becoming a bottleneck, we use an approximation of Pi as input to increase their effective latency [1]. As a side note, we use the AVX instruction set to benefit from the 3-operand form and use input operands non-destructively. We do however only use 128-bit vectors, to simplify uop counting and dispatching: for instance, VSQRTPD would need 3 uops in its 256-bit version on SNB.

Table A.1: Resource Quantifying Experiment Example

#Line   Instruction                                  Purpose
1       VSQRTPD %xmm0, %xmm1                         Jamming
2       VSQRTPD %xmm0, %xmm1                         Jamming
3       [instruction needing targeted resource]      Payload
4       SUB $1, %rdi                                 Loop Control
5       JG .LOOP                                     Loop Control

The jam consists of slow FP square root operations, leaving plenty of time for the payload instructions to be issued and executed out-of-order. Retirement being done in-order prevents the payload's resources from being released until the jam itself is retired. We can adjust the number of required resources by modifying the number of instructions in the payload.
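To make this structure concrete, the sketch below generates the body of such a loop in the spirit of Table A.1. It is purely illustrative (Python is used arbitrarily here); it is not the generator actually used for the measurements.

    def make_quantifying_loop(payload_insn, payload_size, jam_size=5):
        # Emits the assembly body of a resource-quantifying loop: a VSQRTPD
        # jam, `payload_size` copies of the instruction consuming the
        # targeted resource, then the loop control (cf. Table A.1).
        lines = [".LOOP:"]
        lines += ["    VSQRTPD %xmm0, %xmm1"] * jam_size      # jam retirement
        lines += ["    " + payload_insn] * payload_size       # stress the resource
        lines += ["    SUB $1, %rdi",                         # loop control
                  "    JG .LOOP"]
        return "\n".join(lines)

    # Example: Load Buffer experiment with a payload of 2 loads (cf. Table A.4).
    print(make_quantifying_loop("VMOVUPS 0(%rsi), %xmm1", payload_size=2))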

A.2 Quantifying Branch Buffer Entries

The Branch Buffer keeps track of all branch uops, allowing the out-of-order engine to restore the Instruction Pointer to the correct address when the branch predictor is wrong. We use JE 0 branch instructions to fill up the Branch Buffer, ensuring they are never taken by setting the test flags with an always-false CMP $-1, %rdi comparison. The throughput for conditional branch instructions is 1 uop per cycle on SNB and IVB, and 2 on HSW. Furthermore, we used 5 VSQRTPD instructions to create a 5 ∗ 21 = 105-cycle jam on SNB (5 ∗ 14 = 70 for IVB/HSW), leaving enough time for around 105 ∗ 1 = 105 branch uops to be dispatched in parallel (resp. 70 ∗ 1 = 70 for IVB, and 70 ∗ 2 = 140 for HSW). An example assembly code for P = 2 is presented in Table A.2. As a side note, the target loops containing extra branches made it impossible to process them with the usual framework from Chapter 3, so measurements were made using source-level RDTSC probes instead.

Results for varying payload sizes are shown in Figure A.1. A performance degradation appears at point 48 on SNB, IVB and HSW: when taking into account the regular loop branch, it shows only 48 branches can be in flight at the same time. This matches official figures perfectly (see Table A.3).

[1] The effective complexity of FP square root or division operations depends on the complexity of the input numbers. Dividing by 1 can consequently be done faster than dividing by 1.123.
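The jam-sizing arithmetic used above recurs throughout this appendix; the small helper below makes it explicit. It assumes the per-uop square root latencies quoted in the text (21 cycles on SNB, 14 on IVB and HSW) and a payload throughput expressed in uops per cycle; it is only a convenience sketch.

    SQRT_LATENCY = {"SNB": 21, "IVB": 14, "HSW": 14}   # cycles per VSQRTPD uop

    def jam_capacity(uarch, jam_size, payload_throughput):
        # Returns the number of cycles covered by the jam and how many
        # payload uops can be processed in its shadow at the given
        # throughput (in uops per cycle).
        jam_cycles = jam_size * SQRT_LATENCY[uarch]
        return jam_cycles, jam_cycles * payload_throughput

    # Branch Buffer experiment: 5 VSQRTPDs, 1 branch uop per cycle on SNB...
    print(jam_capacity("SNB", 5, 1))   # (105, 105)
    # ... and 2 branch uops per cycle on HSW.
    print(jam_capacity("HSW", 5, 2))   # (70, 140)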

Table A.2: Resource Quantifying Experiment Example for the BB (P = 2)

#Line   Instruction              Purpose
1       CMP $-1, %rdi            Changing test flag
2       VSQRTPD %xmm0, %xmm1     Jamming retirement
3       VSQRTPD %xmm0, %xmm1     Jamming retirement
4       VSQRTPD %xmm0, %xmm1     Jamming retirement
5       VSQRTPD %xmm0, %xmm1     Jamming retirement
6       VSQRTPD %xmm0, %xmm1     Jamming retirement
7       JE 0                     Payload
8       JE 0                     Payload
9       SUB $1, %rdi             Loop Control
10      JG .LOOP                 Loop Control

This is the BB quantifying experiment's assembly code for 2 payload instructions. Retirement is jammed with VSQRTPD instructions, and a constant test flag is prepared so that payload branches are never taken.

Figure A.1: Quantifying Branch Buffer Entries. The performance drops at point 48 for all studied architectures. When accounting for the experimental loop's base branch, it shows SNB, IVB and HSW all have 48 entries in their Branch Buffer.

Table A.3: BB Size: Measured vs. Official

Uarch   Measured   Official   Difference
SNB     [47-48]    48         0
IVB     48         48         0
HSW     48         48         0

Measurements match official numbers perfectly for the Branch Buffer.

A.3 Quantifying Load Buffer Entries

The LB keeps track of all load uops from the time they are issued until they are retired. It allows the out-of-order engine to prevent potential read-write conflicts between load and store uops. The experiment used to quantify LB entries is a simple variation of the generic one presented in Table A.1, in which the payload is made of varying numbers of load instructions. The instruction “VMOVUPS mem, xmm” was chosen due to its high throughput of 2 loads per cycle on all target microarchitectures (provided the memory location it targets is properly aligned on a 16B boundary, as is the case here). Furthermore, we used 5 VSQRTPD instructions to create a 5 ∗ 21 = 105-cycle jam on SNB (5 ∗ 14 = 70 for IVB/HSW), leaving enough time for around 105 ∗ 2 = 210 (resp. 70 ∗ 2 = 140) loads to be executed in parallel. An example assembly code for P = 2 is presented in Table A.4.

Results for varying payload sizes are shown in Figure A.2. A performance degradation appears at point 65 on SNB and IVB, showing that the 65th instruction needing an LB entry was one too many for these microarchitectures. This is strong evidence that both SNB and IVB have 64 LB entries. Haswell's performance break point occurs at point 73, suggesting it has 72 entries. These numbers match official figures perfectly (see Table A.5).

Table A.4: Resource Quantifying Experiment Example for the LB (P = 2)

#Line   Instruction                  Purpose
1       VSQRTPD %xmm0, %xmm1         Jamming retirement
2       VSQRTPD %xmm0, %xmm1         Jamming retirement
3       VSQRTPD %xmm0, %xmm1         Jamming retirement
4       VSQRTPD %xmm0, %xmm1         Jamming retirement
5       VSQRTPD %xmm0, %xmm1         Jamming retirement
6       VMOVUPS 0(%rsi), %xmm1       Payload
7       VMOVUPS 0(%rsi), %xmm1       Payload
8       SUB $1, %rdi                 Loop Control
9       JG .LOOP                     Loop Control

This is the LB quantifying experiment's assembly code for 2 payload instructions. In this case, the payload consists of register loads, each of which needs 1 LB entry to be issued. Its result can be seen in Figure A.2 at coordinate X = 2.

Table A.5: LB Size: Measured vs. Official

Uarch   Measured   Official   Difference
SNB     64         64         0
IVB     64         64         0
HSW     72         72         0

Measurements match official numbers perfectly for the LB.

Figure A.2: Quantifying Load Buffer Entries. The performance drops significantly at point 65 for SNB and IVB, and at point 73 for HSW. This suggests SNB and IVB have 64 LB entries, and HSW 72. We can also notice that SNB experiences performance degradation spikes; this seems to be fixed in IVB and HSW.

A.4 Quantifying PRF Entries

The PRF keeps track of the register values attributed by each uop. It allows registers to hold speculative values while also keeping the architectural state, making sure there is always a legal state to get back to in case of e.g. branch misprediction.

A.4.1 Quantifying FP PRF Entries

The FP PRF keeps track of vector register values.

Table A.6: Resource Quantifying Experiment Example for the FP PRF (P = 4)

#Line   Instruction                      Purpose
1       VSQRTPD %xmm0, %xmm1             Jamming retirement
...     VSQRTPD %xmm0, %xmm1             Jamming retirement
15      VSQRTPD %xmm0, %xmm1             Jamming retirement
16      VXORPD %xmm3, %xmm5, %xmm4       Payload
17      VADDPD %xmm3, %xmm3, %xmm4       Payload
18      VXORPD %xmm3, %xmm5, %xmm4       Payload
19      VADDPD %xmm3, %xmm3, %xmm4       Payload
20      SUB $1, %rdi                     Loop Control
21      JG .LOOP                         Loop Control

The jam comprises 15 VSQRTPD instructions. The payload consists of intertwined XORs and additions, each needing an FP PRF entry to be issued.

As with the LB, the experiment used to quantify FP PRF entries is a variation of the one presented in Table A.1. The payload comprises different instructions to increase the uop throughput per cycle:

1. “VXORPD %xmm3, %xmm5, %xmm4” only has a throughput of 1, but gets dispatched on a different port from VSQRTPD, avoiding potential dispatch conflicts.

2. “VADDPD %xmm3, %xmm3, %xmm4” also has a throughput of 1, but gets dispatched on ports different from either VSQRTPD or VXORPD.

Hence, alternating between VXORPD and VADDPD allows us to solicit different execution ports, and increase the payload's potential throughput to 2 uops per cycle. We used 15 VSQRTPD instructions to create a 15 ∗ 21 = 315-cycle jam on SNB (15 ∗ 14 = 210 for IVB/HSW), leaving enough time for around 315 ∗ 2 = 630 (resp. 210 ∗ 2 = 420) payload uops to be executed in parallel. An example assembly code for P = 4 is presented in Table A.6.

Figure A.3: Quantifying FP PRF Entries. The performance breaking points for IVB and HSW are respectively around [108-112] and [131-137]. The SNB curve is considerably less clear, but there is a clear trend upwards starting from point 80.

Results for varying payload sizes are shown in Figure A.3. Unfortunately, results are not as clear cut as for the LB, and require more interpretation:

1. The experiments actually count registers available for the speculative state, rather than all those in the FP PRF.

2. The jam’s uops also require vector registers to be issued: this creates an offset of 1 between the size of the payload and the tested number of FP PRF entries. For instance, for P = X, X + 1 entries are needed to get visibility on the next iteration’s jam.

SNB exhibits strange performance bumps while decidedly still within the FP PRF's size. It is also harder to pinpoint a particular break point for SNB than for IVB or HSW.

Table A.7: FP PRF Size: Measured vs. Official

Uarch   Measured    Official   Difference
SNB     [80-112]    144        32
IVB     [109-113]   144        31
HSW     [132-138]   168        30

A 16-register difference is expected, as there are 16 named vector registers, and as many physical registers have to be used to maintain their latest known-to-be-valid values (i.e. the values set by already-retired uops). We do not know why 16 others apparently cannot be used for speculative states.

Table A.7 shows the measured sizes vs. the official ones. A constant difference of 32 registers seems to appear. While 16 can easily be explained due to having to maintain the architectural state for the 16 named vector registers (XMM[0-15]), it is not clear why twice this number is apparently unusable for speculative states.

A.4.2 Quantifying Integer PRF Entries

The Integer PRF is the equivalent of the FP PRF for general purpose registers.

Table A.8: Resource Quantifying Experiment Example for the Integer PRF (P = 4)

#Line   Instruction              Purpose
1       VSQRTPD %xmm0, %xmm1     Jamming retirement
...     VSQRTPD %xmm0, %xmm1     Jamming retirement
15      VSQRTPD %xmm0, %xmm1     Jamming retirement
16      ADD $0, %r11             Payload
17      ADD $0, %r12             Payload
18      ADD $0, %r13             Payload
19      ADD $0, %r10             Payload
20      SUB $1, %rdi             Loop Control
21      JG .LOOP                 Loop Control

The jam comprises 15 VSQRTPD instructions. The payload consists of “ADD $0, %reg” instructions, which have a throughput of 3 per cycle.

The experiment used to quantify Int PRF entries is very similar to the one used for the FP PRF. The payload only uses ADD instructions, but they can be dispatched on three different ports, allowing for a throughput of 3 uops per cycle on SNB/IVB, and of 4 on HSW. However, unlike previously, we cannot ascertain there will be no dispatch conflict between uops from the jam and those from the payload. We used 15 VSQRTPD instructions to create a 15 ∗ 21 = 315-cycle jam on SNB (15 ∗ 14 = 210 for IVB/HSW), leaving enough time for around 315 ∗ 3 = 945 (resp. 210 ∗ 3 = 630) payload uops to be executed in parallel. An example assembly code for P = 4 is presented in Table A.8. Table A.9 shows the measured sizes vs. the official ones. The difference between them seems to improve with Haswell, but does not quite reach the ideal limit of 16.

Figure A.4: Quantifying Integer PRF Entries

Table A.9: Integer PRF Size: Measured vs. Official

Uarch   Measured    Official   Difference
SNB     [114-128]   160        32
IVB     [126-130]   160        30
HSW     [136-144]   168        24

The gap between the official and the effective sizes of the Int PRF is roughly the same as the one for the FP PRF on both SNB and IVB. This makes sense, as there are as many named GP registers as vector ones in the studied mode. However, some progress is apparent in HSW.

A.4.3 Quantifying Overall PRF Entries

[61] shows something seems to limit the overall number of registers allocated at a given point. We hypothesize that this limit results from the existence of a resource shared by the Integer and the FP PRFs, and call it the Overall PRF. The experiment used to quantify Overall PRF entries mixes the payloads used for the Int and the FP PRFs. It results in a payload with a throughput of 2 uops per cycle, which allocates roughly as many registers from both PRFs.

Table A.11 shows the measured sizes vs. the ones we would expect. Here, considering we could find no official description of a limit on overall register allocations, we used the size of the ROB as the “official” limit. While IVB seems little affected by this problem, both SNB and HSW are shown to suffer from it. This confirms the observation made in [61].

A.5 Quantifying ROB Entries

The ReOrder Buffer keeps track of uops from the moment they are issued until they are retired. It is the biggest out-of-order buffer, as all Back-End uops need to be held in it during their lifetime, including those that do not need to be dispatched (e.g. NOPs).

Table A.10: RQ Experiment Example for the Overall PRF (P = 4)

#Line   Instruction                      Purpose
1       VSQRTPD %xmm0, %xmm1             Jamming retirement
...     VSQRTPD %xmm0, %xmm1             Jamming retirement
15      VSQRTPD %xmm0, %xmm1             Jamming retirement
16      ADD $0, %r10                     Payload
17      VADDPD %xmm3, %xmm3, %xmm4       Payload
18      ADD $0, %r10                     Payload
19      VADDPD %xmm3, %xmm3, %xmm4       Payload
20      SUB $1, %rdi                     Loop Control
21      JG .LOOP                         Loop Control

The jam comprises 15 VSQRTPD instructions. The payload consists of intertwined GP and vector additions, for a combined throughput of 2 uops per cycle.

Figure A.5: Quantifying Overall PRF Entries

Table A.11: Overall PRF Size: Measured vs. Official

Uarch   Measured    Official   Difference
SNB     [115-141]   168?       27?
IVB     [152-165]   168?       3?
HSW     [164-177]   192?       15?

The number of entries attributed to the “Overall PRF” is increased in IVB, but did not follow the same progression as the FP and Int PRFs in HSW: more registers can be allocated simultaneously (177 vs. 165), but the gap increases (15 vs. 3).

The Instruction Window represents all the uops in the ROB, from which the out-of-order engine can try to extract instruction-level parallelism. The experiment used to quantify ROB entries uses the same light jam as for

Table A.12: RQ Experiment Example for the ROB (P = 4)

#Line   Instruction              Purpose
1       VSQRTPD %xmm0, %xmm1     Jamming retirement
...     VSQRTPD %xmm0, %xmm1     Jamming retirement
5       VSQRTPD %xmm0, %xmm1     Jamming retirement
6       NOP                      Payload
7       NOP                      Payload
8       NOP                      Payload
9       NOP                      Payload
10      SUB $1, %rdi             Loop Control
11      JG .LOOP                 Loop Control

The jam comprises 5 VSQRTPD instructions. The payload consists of NOP instructions, which do not need to get dispatched, and have a throughput of 4 per cycle.

LB entries: even though the ROB is considerably bigger, the need to jam the execution is lessened by our use of NOP instructions in the payload and their high throughput of 4 uops per cycle. Indeed, NOP (“No Operation”) does nothing more than change the Instruction Pointer, and is “executed” at the same time as it is issued. However, it does need to be retired in order. Our use of 5 VSQRTPD instructions creates a 5 ∗ 21 = 105-cycle jam on SNB (5 ∗ 14 = 70 for IVB/HSW), meaning as many as 105 ∗ 4 = 420 (resp. 70 ∗ 4 = 280) NOPs could be issued in parallel. An example assembly code for P = 4 is presented in Table A.12.

Figure A.6: Quantifying ROB Entries

Figure A.6 shows the results of the ROB measurement experiments. To interpret this figure, one needs to keep in mind that the loop control and jam instructions also use up ROB slots, causing a slight shift between the payload size and the number of entries actually reserved.

Table A.13: ROB Size: Measured vs. Official

Uarch   Measured    Official   Difference
SNB     [159-165]   168        3
IVB     [155-168]   168        0
HSW     [185-192]   192        0

The difference between the effective (measured) and the expected (official) ROB sizes is very small. However, the measured degradation window for Ivy Bridge is not negligible, with a span of over 13 entries.

Table A.13 shows the measured sizes vs. the ones we would expect. The difference is very small, though performance degradations occur when nearing the total size. Haswell improves on this behavior without completely fixing it.

A.6 Quantifying RS Entries

The Reservation Station holds all uops that need to be dispatched, from the moment they are issued until they are dispatched. It keeps track of the dependencies between them, and dispatches uops once their operands are ready. This latter characteristic makes it unique among out-of-order resources, as its entries are released before retirement.

Table A.14: RQ Experiment Example for the RS (P = 4)

#Line   Instruction                      Purpose
1       VSQRTPD %xmm0, %xmm1             Jamming dispatch
...     VSQRTPD %xmm0, %xmm1             Jamming dispatch
5       VSQRTPD %xmm0, %xmm1             Jamming dispatch
6       VADDPS %xmm1, %xmm1, %xmm2       Payload
7       VADDPS %xmm1, %xmm1, %xmm2       Payload
8       VADDPS %xmm1, %xmm1, %xmm2       Payload
9       VADDPS %xmm1, %xmm1, %xmm2       Payload
10      SUB $1, %rdi                     Loop Control
11      JG .LOOP                         Loop Control

The jam comprises 5 VSQRTPD instructions. The payload consists of VADDPS instructions with a dependency on a register produced by the jam, forcing them to stay in the RS until XMM1 is ready. They have a throughput of 1 uop per cycle.

The experiments to quantify the number of RS entries have to take this into account, as RS slots getting freed out of order makes our classical set-up inadequate. We hence adjust the payload to actually depend on the jam, preventing its uops from being dispatched prematurely and ensuring they stay in the RS for as long as the jam is still being executed. Table A.14 represents an instance of our RS size experiment for P = 4.

The jams will interestingly only execute in order (despite their uops never depending on one another) due to the RS always selecting, for each port, the oldest uop whose input operands are ready: as VSQRTPD uops can only get dispatched on P0 and have the same unmodified input register, they will necessarily execute in order. This makes it impossible for the last uop from a particular jam to be dispatched before its predecessors (preventing payload uops from getting dispatched before all previous jam uops are completely executed), and also prevents jams from later iterations from interfering with the current one. The question is then: how many uops can fit in the RS before preventing the visibility of the next iteration's jam?

As before, we use 5 VSQRTPD instructions to create a 5 ∗ 21 = 105-cycle jam on SNB (5 ∗ 14 = 70 for IVB/HSW). We use VADDPS instructions in the payload, which have a throughput of 1 per cycle, meaning we can dispatch 105 ∗ 1 = 105 (resp. 70 ∗ 1 = 70) payload instructions in parallel with the next jam without impacting performance (provided the RS is large enough).
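Reusing the hypothetical generator sketched in Section A.1, the RS experiment only differs in that the payload reads the jam's output register, as in Table A.14; a minimal variant:

    def make_rs_loop(payload_size, jam_size=5):
        # The VADDPS payload reads %xmm1, which is written by the jam's
        # VSQRTPDs; payload uops therefore cannot leave the RS before the
        # jam finishes executing (cf. Table A.14).
        lines = [".LOOP:"]
        lines += ["    VSQRTPD %xmm0, %xmm1"] * jam_size
        lines += ["    VADDPS %xmm1, %xmm1, %xmm2"] * payload_size
        lines += ["    SUB $1, %rdi", "    JG .LOOP"]
        return "\n".join(lines)

    print(make_rs_loop(payload_size=4))   # reproduces the loop of Table A.14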

Figure A.7: Quantifying RS Entries. The transition from enough to not enough RS entries is interestingly much smoother for SNB than for IVB and HSW.

Figure A.7 shows results for our measurement experiments on the RS. Here, control instructions are not a problem as they do not depend on the jam and can thus be executed out-of-order. The number of entries occupying the RS between two jams is then exactly the number of instructions in the payload.

Table A.15: RS Size: Measured vs. Official

Uarch   Measured   Official   Difference
SNB     48         54         6
IVB     [49-51]    54         3
HSW     51         60         9

The mismatch is more important on SNB than on IVB, with a drop of 11% vs. 6%. It increases again on HSW, to 15%. The origin of these gaps is unknown.

Table A.15 shows the measured sizes vs. the ones we would expect. The difference is small on SNB and IVB (respectively 6 and 3), but it gets annoyingly high with HSW (9), as the RS is of limited size in the first place.

A.7 Quantifying Store Buffer Entries

The SB keeps track of all store uops from the time they were issued until they are retired. It is complementary to the LB in terms of preventing illegal access orders, but also holds the stored values as they are only committed to memory at retirement.

Table A.16: RQ Experiment Example for the SB (P = 4)

#Line   Instruction                  Purpose
1       VSQRTPD %xmm0, %xmm1         Jamming retirement
...     VSQRTPD %xmm0, %xmm1         Jamming retirement
5       VSQRTPD %xmm0, %xmm1         Jamming retirement
6       VMOVUPS %xmm3, 64(%rsi)      Payload
7       VMOVUPS %xmm3, 64(%rsi)      Payload
8       VMOVUPS %xmm3, 64(%rsi)      Payload
9       VMOVUPS %xmm3, 64(%rsi)      Payload
10      SUB $1, %rdi                 Loop Control
11      JG .LOOP                     Loop Control

The jam comprises 5 VSQRTPD instructions. The payload consists of vector store instructions with a throughput of 1 instruction per cycle.

We use a light jam of 5 VSQRTPD instructions, as the expected SB size is rather small. The payload comprises VMOVUPS store instructions, with a throughput of 1 store uop per cycle. As with the RS, it means we can dispatch up to 105 store uops on SNB (70 for IVB and HSW) without stores becoming lengthier than the square root operations. Figure A.8 shows our experimental results for the SB size. They are very clear cut, as were the ones for the LB.

Table A.17: SB Size: Measured vs. Official

Uarch   Measured   Official   Difference
SNB     36         36         0
IVB     36         36         0
HSW     42         42         0

The measured SB sizes match the official ones perfectly.

Table A.17 shows the measured and the expected sizes match perfectly, revealing that the LB and the SB behave in a similar fashion here too.

A.8 Impact of Microfusion on Resource Consumption

Microfusion is an interesting case of hardware optimization, keeping uops that need to be dispatched to different functional units fused as a single uop for a large part of the pipeline.

Figure A.8: Quantifying SB Entries. The transition points due to the scarcity of SB entries are very clear on all 3 studied microarchitectures: 37 for SNB and IVB, and 43 for HSW.

This happens for stores, as well as for instructions having both data load and arithmetic components, such as MULPD (%rax), %xmm1. We saw in Chapter 3 that in unlamination cases, the splitting of microfused uops is done in the uop queue, i.e. before getting to the RAT. Here, we will try to find out when it is done in other cases, so as to determine how many entries microfused uops should be expected to consume in each out-of-order resource.

A.8.1 ROB Microfusion

We try to single out the resource consumption of microfused uops in terms of ROB entries by using a special preamble. We use 15 VSQRTPDs for the usual purpose of jamming retirement, and add 20 vector load [simple address] + vector add instructions to put microfused uops in the ROB. The payload consists of varying numbers of NOPs, as with the ROB resource-quantifying experiment. Table A.18 represents such an experiment for P = 4.

We can then evaluate whether not-unlaminated microfused uops remain fused in the ROB by comparing the results from these experiments (in Figure A.9) with those from the regular ROB experiment. We roughly expect a shift of 20 entries if they do remain fused, or 40 if they do not. Table A.19 shows that in SNB, microfused uops seem to consume 2 distinct entries in the ROB, but that they only take 1 in IVB and HSW. This improvement increases the effective ROB size when microfused operations are used.

A.8.2 RS Microfusion

We adopt the same strategy for determining the status of microfusion in the RS as the one used for the ROB. The experiments are based on the ones used for finding the size of the RS, except that an extra “base load” (consisting of 15 instructions whose uops are still microfused in the uop queue) is added, and the jam is adjusted so that load operations indirectly depend on it. An example for P = 4 is provided in Table A.20.

Table A.18: Microfusion Evaluation Experiment for the ROB (P = 4)

#Line   Instruction                          Purpose
1       VSQRTPD %xmm0, %xmm1                 Jamming retirement
...     VSQRTPD %xmm0, %xmm1                 Jamming retirement
15      VSQRTPD %xmm0, %xmm1                 Jamming retirement
16      VADDPS 64(%r12), %xmm0, %xmm2        Base Load
...     VADDPS 64(%r12), %xmm0, %xmm2        Base Load
35      VADDPS 64(%r12), %xmm0, %xmm2        Base Load
36      NOP                                  Payload
37      NOP                                  Payload
38      NOP                                  Payload
39      NOP                                  Payload
40      SUB $1, %rdi                         Loop Control
41      JG .LOOP                             Loop Control

The jam comprises 15 VSQRTPD instructions. The base load consists of (vector load + addition) instructions with a throughput of 1 instruction per cycle. The payload then adjusts the ROB stress with simple NOPs.

Figure A.9: ROB Microfusion Evaluation. The performance drop is shifted to the left (compared to Figure A.6) as expected, due to the extra instructions in the preamble.

Figure A.10 shows the results of these adjusted experiments, with performance breakpoints occurring noticeably earlier than in the original curves (from Figure A.7). The following elements should be considered when evaluating this shift:

1. There are 2 extra entries taken in the RS due to the extra instructions in the jam.

Table A.19: ROB Microfusion Evaluation

Uarch   New Bump    Original Bump   Difference   Fusion
SNB     119         [158-164]       {39, 45}     No
IVB     [133-145]   [154-167]       {21, 22}     Yes
HSW     [161-169]   [184-191]       {23, 22}     Yes

The New Bump column indicates the position of the performance break point in the ROB Microfusion experiments in terms of P. Original Bump does the same for the ROB Size experiments. The Difference is the subtraction of New Bump from Original Bump (or of the matching mins and maxes when relevant). The expected difference is 40 if microfused operations take 2 slots in the ROB, and 20 if they only take 1. Here, it is clear SNB is in the former case, and IVB and HSW in the latter.

Table A.20: Microfusion Evaluation Experiment for the RS (P = 4)

#Line  Instruction                      Purpose
1      VSQRTPD %xmm0, %xmm1             Jamming dispatch
...    VSQRTPD %xmm0, %xmm1             Jamming dispatch
5      VSQRTPD %xmm0, %xmm1             Jamming dispatch
6      VMOVMSKPD %xmm1, %r13            Jamming dispatch
7      ADD %r13, %r12                   Jamming dispatch
8      VADDPS 64(%r12), %xmm1, %xmm2    Base Load
...    VADDPS 64(%r12), %xmm1, %xmm2    Base Load
23     VADDPS 64(%r12), %xmm1, %xmm2    Base Load
24     VADDPS %xmm1, %xmm1, %xmm2       Payload
25     VADDPS %xmm1, %xmm1, %xmm2       Payload
26     VADDPS %xmm1, %xmm1, %xmm2       Payload
27     VADDPS %xmm1, %xmm1, %xmm2       Payload
28     SUB $1, %rdi                     Loop Control
29     JG .LOOP                         Loop Control

The jam comprises 5 VSQRTPD instructions, as well as a VMOVMSKPD and an integer ADD. The base experiment consists of (vector load + addition) instructions with a throughput of 1 instruction per cycle. The payload instructions (vector additions) have an overall throughput of 1 per cycle. Each of the uops in both the base load and the payload depends on the jam: FP additions depend on the last VSQRTPD, while address calculations depend on the ADD. Increasing the size of the payload allows us to detect how many entries the microfused instructions from the base load really take inside the RS.

2. If microfused uops take a single entry in the RS, then 15 extra slots would be used by the new base load.

3. If not, then 30 extra slots would be taken.

Hence, if the shift is 15 + 2 = 17, then microfused uops take a single slot. If it is 30 + 2 = 32, then microfused uops are split in the RS and need 2 slots.
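This decision rule can be written as a small sketch (illustrative only; the thresholds 17 and 32 come from the reasoning above, and the sample values from Table A.21):

    # Classify RS microfusion from the measured breakpoint shift:
    # a shift close to 17 means 1 RS slot per microfused uop, close to 32 means 2.
    def rs_fusion_status(original_bump, new_bump):
        shift = original_bump - new_bump
        return "fused (1 slot)" if abs(shift - 17) < abs(shift - 32) else "split (2 slots)"

    print(rs_fusion_status(48, 14))  # SNB values from Table A.21 -> "split (2 slots)"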

Figure A.10: RS Microfusion Evaluation
The performance drop gets shifted to the left compared to the regular RS size experiments (from Figure A.7), due to a) the extra instructions in the jam and b) the base load.

Table A.21: RS Microfusion Evaluation

Uarch   New Bump   Original Bump   Difference   Fusion
SNB     14         48              {34}         No
IVB     [16-20]    [49-51]         {33, 31}     No
HSW     [23-26]    51              {28, 25}     No

The New Bump column indicates the position of the performance break point in the RS Microfusion experiments in terms of P. Original Bump does the same for the RS Size experiments. The Difference is the subtraction of New Bump from Original Bump (or of the matching mins and maxes when relevant). In all cases, the Difference is closer to 32 than it is to 17, indicating uops are not kept microfused in the RS in any of the studied microarchitectures. However, the result for HSW is not as clear-cut as for SNB and IVB, which, combined with the significant offset between the measured size of the RS and the official one (see Table A.15), suggests the full capacity of the RS might only be reachable when microfused uops are used.

Table A.21 provides a detailed comparison between the curves, and shows that microfused uops take 2 entries in the RS on all studied microarchitectures.

Appendix B Note on the Load Matrix

This section was placed in a separate appendix because, unlike Appendix A, its results were not used throughout the rest of the manuscript. However, they will be very relevant in future work, especially when trying to use UFS on loops operating beyond L1.

B.1 Load Matrix Presentation

The Load Matrix (LM) keeps track of all load uops from the time they are issued until they are completed [140]. Its exact role is not clear, but it probably works hand-in-hand with the LB to prevent read-write conflicts between load and store uops.

B.2 Quantifying Load Matrix Entries

The experiment used to quantify LM entries is a variation of the one used to quantify RS entries (in Section A.6), except that the payload is made entirely of a varying number of load instructions. The “VMOVSS mem, xmm” instruction was chosen because of its high throughput of 2 loads per cycle on all target microarchitectures. Furthermore, we used 5 VSQRTPD instructions to create a 5 ∗ 21 = 105-cycle jam on SNB (5 ∗ 14 = 70 for IVB/HSW), leaving enough time for around 105 ∗ 4 = 420 (resp. 70 ∗ 4 = 280) uops to be issued during its execution.
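The jam-sizing arithmetic can be written out explicitly (a minimal sketch, assuming the per-VSQRTPD costs of 21 cycles on SNB and 14 on IVB/HSW quoted above, and an issue width of 4 uops per cycle):

    # Number of uops the front-end can issue while the VSQRTPD jam executes.
    def uops_issued_during_jam(n_sqrt, cycles_per_sqrt, issue_width=4):
        jam_cycles = n_sqrt * cycles_per_sqrt
        return jam_cycles * issue_width

    print(uops_issued_during_jam(5, 21))  # SNB: 105-cycle jam -> 420 uops
    print(uops_issued_during_jam(5, 14))  # IVB / HSW: 70-cycle jam -> 280 uops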

Table B.1: Resource Quantifying Experiment Example for the LM (P = 2)

#Line  Instruction                Purpose
1      VSQRTPD %xmm0, %xmm1       Jamming dispatch
...    VSQRTPD %xmm0, %xmm1       Jamming dispatch
5      VSQRTPD %xmm0, %xmm1       Jamming dispatch
6      VMOVMSKPD %xmm1, %r13      Jamming dispatch
7      ADD %r13, %r12             Jamming dispatch
8      VMOVSS (%r12), %xmm2       Base Load
9      VMOVSS (%r12), %xmm2       Base Load
10     SUB $1, %rdi               Loop Control
11     JG .LOOP                   Loop Control

This is the LM Quantifying Experiment’s assembly code for 2 payload instructions. In this case, the payload consists of register loads, each of which needs 1 LM entry to be issued (and keeps it until it is dispatched and completed).

An example assembly code for P = 2 is presented in Table B.1.

Figure B.1: Quantifying Load Matrix Entries
The performance drops significantly at point 33 for SNB, IVB and HSW. This suggests they all have 32 LM entries. We can also observe that IVB’s and HSW’s curves are smoother than SNB’s, suggesting better resource allocation / reclaiming mechanisms were implemented.

Results for varying payload sizes are shown in Figure B.1. The result from the experiment described in Table B.1 is presented at coordinate X = 2. A performance degradation appears at point 33 for all microarchitectures, showing that the 33rd load was the one too many in all cases. This strongly suggests that the LM has 32 entries in SNB, IVB and HSW.
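The breakpoint itself can be located mechanically from the measured curves; the sketch below (hypothetical helper, with made-up sample data) returns the first payload size whose cycles-per-iteration leaves the small-P plateau:

    # Find the first payload size P at which performance degrades,
    # i.e. the "one too many" point of Figure B.1.
    def find_breakpoint(cycles_by_P, tolerance=1.05):
        baseline = cycles_by_P[min(cycles_by_P)]      # plateau value at the smallest P
        for P in sorted(cycles_by_P):
            if cycles_by_P[P] > baseline * tolerance:
                return P
        return None

    # e.g. find_breakpoint({31: 105.0, 32: 105.0, 33: 118.0, 34: 131.0}) returns 33,
    # pointing to 32 usable LM entries.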

Table B.2: LM Size: Measured vs. Official

Uarch   Measured   Official
SNB     32         ?
IVB     32         ?
HSW     32         ?

All studied microarchitectures appear to have 32 LM entries. We could not find official numbers to compare these results with.

We could not find any official figures to compare these numbers with (see Table B.2).

Bibliography

[1] G. E. Moore et al., “Cramming more components onto integrated circuits,” Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.1

[2] R. R. Schaller, “Moore’s law: past, present and future,” Spectrum, IEEE, vol. 34, no. 6, pp. 52–59, 1997.1

[3] TOP500 project, “Top #1 systems,” http://top500.org/featured/top-systems, 2015.2

[4] Intel, “2.2: Intel microarchitecture code name sandy bridge,” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014.5

[5] ——, “2.2.7: Intel microarchitecture code name ivy bridge,” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014.5

[6] ——, “2.1: The haswell microarchitecture,” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014.5

[7] ——, “Product specifications,” http://ark.intel.com.5

[8] P. Taylor, “Baytrail performance monitoring events,” Apr. 2014. [Online]. Available: https://software.intel.com/en-us/articles/baytrail-uncore-performance-monitoring-events 5

[9] A. L. Shimpi, “Intel’s sandy bridge architecture exposed,” http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed. 5

[10] ——, “Intel’s ivy bridge architecture exposed,” http://www.anandtech.com/show/4830/intels-ivy-bridge-architecture-exposed. 5

[11] ——, “Intel’s haswell architecture analyzed: Building a new pc and a new intel,” http://www.anandtech.com/show/6355/intels-haswell-architecture.5

[12] ——, “Intel’s silvermont architecture revealed: Getting serious about mobile,” http://www.anandtech.com/show/6936/intels-silvermont-architecture-revealed-getting-serious-about-mobile. 5

[13] D. Kanter, “Intel’s sandy bridge microarchitecture,” http://www.realworldtech.com/sandy-bridge/7/.5

[14] ——, “Intel’s haswell cpu microarchitecture,” http://www.realworldtech.com/haswell-cpu/.5

[15] ——, “Silvermont, intel’s low power architecture,” http://www.realworldtech.com/silvermont/.5

[16] A. Fog, “The microarchitecture of intel, amd and via cpus/an optimization guide for assembly programmers and compiler makers,” 2014. 5

[17] S. R. Shenoy and A. Daniel, “Intel architecture and silicon cadence: The catalyst for industry innovation,” Technology at Intel Magazine, 2006.5, 132

[18] H. Wong, “Intel ivy bridge cache replacement policy,” Jan. 2013. [Online]. Available: http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ 10

[19] S. Raikin, D. J. Sager, Z. Sperber, E. Krimer, O. Lempel, S. Shwartsman, A. Yoaz, and O. Golz, “Tracking mechanism coupled to retirement in reorder buffer for indicating sharing logical registers of physical register in record indexed by logical register,” Dec. 16 2014, US Patent 8,914,617. 10

[20] B. Kuttana, “Technology insight: Intel silvermont,” 2013. [Online]. Available: https://software.intel.com/sites/default/files/managed/bb/2c/02_Intel_Silvermont_Microarchitecture.pdf 13

[21] E. Oseret et al., “CQA: A code quality analyzer tool at binary level,” HiPC ’14, 2014. 15, 32, 44, 62, 88, 95, 128

[22] MAQAO, “Maqao project,” http://www.maqao.org, 2013. 15, 18, 44, 47

[23] J. Muir, “Using the rdtsc instruction for performance monitoring,” Technical report, Intel Corporation, Tech. Rep., 1997. 17, 29

[24] B. Sprunt, “The basics of performance-monitoring hardware,” IEEE Micro, no. 4, pp. 64–71, 2002. 17

[25] D. Zaparanuks, M. Jovic, and M. Hauswirth, “Accuracy of performance counter measurements,” in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 2009, pp. 23–32. 17

[26] Intel, “Profile function or loop execution time,” Intel C++ Compiler XE 13.1 User and Reference Guides. [Online]. Available: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-96F454BF-364A-40C9-9B55-BFFAA8FD171D.htm 17

[27] ——, “Intel vtune amplifier xe,” www.intel.com/software/products/vtune, 2013. 18, 41, 43, 58

[28] A. C. de Melo, “The new linux perf tools,” in Slides from Linux Kongress, 2010. 18

[29] A. S. Charif-Rubial, “Maqao performance analysis and optimization tool,” http://www.vi-hps.org/upload/material/tw17/MAQAO.pdf, 2015. 18

[30] Q. Wu and O. Mencer, “Evaluating sampling based hotspot detection,” in Architecture of Computing Systems–ARCS 2009. Springer, 2009, pp. 28–39. 18

[31] J. Levon and P. Elie, “Oprofile: A system profiler for linux,” http://oprofile.sourceforge.net, 2013. 18, 58

[32] J. Dongarra, K. London, S. Moore, P. Mucci, and D. Terpstra, “Using papi for hardware performance monitoring on linux systems,” in Proc. Conf. on Linux Clusters: The HPC Revolution, 2001, pp. 25–27. 18

[33] J. Treibig, G. Hager, and G. Wellein, “Likwid: A lightweight performance- oriented tool suite for x86 multicore environments,” in Parallel Processing Workshops (ICPPW), 2010 39th International Conference on. IEEE, 2010, pp. 207–216. 18, 30

[34] A. Yasin, “A top-down method for performance analysis and counters archi- tecture,” in Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on. IEEE, 2014, pp. 35–44. 18

[35] D. Levinthal, “Performance analysis guide for intel core i7 processor and intel xeon 5500 processors,” Intel Performance Analysis Guide, vol. 30, 2009. 18

[36] P. Calafiura, S. Eranian, D. Levinthal, S. Kama, and R. A. Vitillo, “Gooda: The generic optimization data analyzer,” in Journal of Physics: Conference Series, vol. 396, no. 5. IOP Publishing, 2012, p. 052072. 18

[37] S. Koliaï, Z. Bendifallah, M. Tribalat, C. Valensi, J.-T. Acquaviva, and W. Jalby, “Quantifying performance bottleneck cost through differential anal- ysis,” in Proceedings of the 27th international ACM conference on supercom- puting. ACM, 2013, pp. 263–272. 18, 29, 32, 44, 53, 63, 87, 88, 123

[38] C. Valensi, “Madras: Multi-architecture binary rewriting tool technical re- port,” 2013. 18

[39] T. Austin, E. Larson, and D. Ernst, “Simplescalar: An infrastructure for com- puter system modeling,” Computer, vol. 35, no. 2, pp. 59–67, 2002. 19

[40] V. S. Pai, P. Ranganathan, and S. V. Adve, “Rsim: An execution-driven simulator for ilp-based shared-memory multiprocessors and uniprocessors,” in Proceedings of the Third Workshop on Computer Architecture Education, vol. 178. Citeseer, 1997. 19

[41] B. Cmelik and D. Keppel, Shade: A fast instruction-set simulator for execution profiling. Springer, 1995. 19

[42] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation platform,” Computer, vol. 35, no. 2, pp. 50–58, 2002. 19

[43] A. Patel, F. Afram, S. Chen, and K. Ghose, “Marss: a full system simulator for multicore x86 cpus,” in Proceedings of the 48th Design Automation Conference. ACM, 2011, pp. 1050–1055. 19

[44] P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith, R. Rockhold, C. Lefurgy, H. Shafi, T. Nakra, R. Simpson et al., “Mambo: a full system simulator for the architecture,” ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 4, pp. 8–12, 2004. 19

[45] M. T. Yourst, “Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator,” in Performance Analysis of Systems & Software, 2007. ISPASS 2007. IEEE International Symposium on. IEEE, 2007, pp. 23–34. 20

[46] G. H. Loh, S. Subramaniam, and Y. Xie, “Zesto: A cycle-level simulator for highly detailed microarchitecture exploration,” in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 2009, pp. 53–64. 20, 128

[47] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, “Using simpoint for accurate and efficient simulation,” in ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 1. ACM, 2003, pp. 318–319. 20

[48] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sampled simulation of multi- threaded applications,” in Performance Analysis of Systems and Software (IS- PASS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 2–12. 20

[49] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. Johnson, J. Keefe, and H. Angepat, “Fpga-accelerated simulation technologies (fast): Fast, full-system, cycle-accurate simulators,” in Proceedings of the 40th Annual IEEE/ACM international Symposium on Microarchitecture. IEEE Computer Society, 2007, pp. 249–261. 20

[50] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson, and K. Asanović, “Ramp gold: an fpga-based architecture simulator for multipro- cessors,” in Proceedings of the 47th Design Automation Conference. ACM, 2010, pp. 463–468. 20

[51] D. Genbrugge, S. Eyerman, and L. Eeckhout, “Interval simulation: Raising the level of abstraction in architectural simulation,” in High Performance Com- puter Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12. 20

[52] N. Nethercote, “Cachegrind,” http://valgrind.org/docs/manual/cg-manual.html. 20

[53] A. Jaleel, R. S. Cohn, C.-K. Luk, and B. Jacob, “Cmp$im: A pin-based on-the- fly multi-core cache simulator,” in Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA, 2008, pp. 28–36. 20

[54] S. Zuckerman, J. Suetterlein, R. Knauerhase, and G. R. Gao, “Using a codelet program execution model for exascale machines: position paper,” in Proceed- ings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era. ACM, 2011, pp. 64–69. 21

[55] M. C. Easton and R. Fagin, “Cold-start vs. warm-start miss ratios,” Commu- nications of the ACM, vol. 21, no. 10, pp. 866–872, 1978. 22, 31

[56] J. McCalpin, “Stream benchmark,” Link: www.cs.virginia.edu/stream/ref.html#what, 1995. 22

[57] L. W. McVoy, C. Staelin et al., “lmbench: Portable tools for performance analysis.” in USENIX annual technical conference. San Diego, CA, USA, 1996, pp. 279–294. 22

[58] D. Callahan, J. Dongarra, and D. Levine, “Vectorizing compilers: A test suite and results,” in Proceedings of the 1988 ACM/IEEE conference on Supercomputing. IEEE Computer Society Press, 1988, pp. 98–105. 22

[59] S. Maleki et al., “An evaluation of vectorizing compilers,” in Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’11, 2011, pp. 372–382. [Online]. Available: http://dx.doi.org/10.1109/PACT.2011.68 22, 79, 92, 93, 95, 118

[60] J. C. Beyler, N. Triquenaux, V. Palomares, F. Chabane, T. Fighiera, J.-P. Halimi, and W. Jalby, “Microtools: Automating program generation and per- formance measurement,” in ICPPW, 2012. IEEE, 2012, pp. 424–433. 22, 41, 44

[61] H. Wong, “Measuring reorder buffer capacity,” May 2013. [Online]. Available: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ 23, 106, 111, 133, 140

[62] C. Akel, Y. Kashnikov, P. de Oliveira Castro, and W. Jalby, “Is source- code isolation viable for performance characterization?” in Parallel Processing (ICPP), 2013 42nd International Conference on. IEEE, 2013, pp. 977–984. 23

[63] Y.-J. Lee and M. Hall, “A code isolator: Isolating code fragments from large programs,” in Languages and Compilers for High Performance Computing. Springer, 2005, pp. 164–178. 23

[64] E. Petit, G. Papaure, F. Dru, and F. Bodin, “Astex: a hot path based thread extractor for distributed memory ,” in High-Performance Em- bedded Architecture and Compilation Industrial Workshop (HiPEAC), 2006. 23, 26, 27

[65] “Codelet finder.” [Online]. Available: https://runtime.bordeaux.inria.fr/prohmpt/Meetings/r01/anr-08-cosi-013-02_r01_caps_entreprise.pdf 23

[66] T. Sherwood, E. Perelman, and B. Calder, “Basic block distribution analysis to find periodic behavior and simulation points in applications,” in Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques. IEEE, 2001, pp. 3–14. 23

[67] P. D. O. Castro, C. Akel, E. Petit, M. Popov, and W. Jalby, “Cere: Llvm-based codelet extractor and replayer for piecewise benchmarking and optimization,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 1, p. 6, 2015. 23

[68] M. Popov, C. Akel, F. Conti, W. Jalby, and P. d. O. Castro, “Pcere: Fine- grained parallel benchmark decomposition for scalability prediction,” in Paral- lel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. IEEE, 2015, pp. 1151–1160. 24

[69] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, “Numerical recipes: The art of scientific computing,” 1992. 26, 28, 90, 93, 95, 118

[70] J. Noudohouenou, V. Palomares, W. Jalby, D. C. Wong, D. J. Kuck, and J. C. Beyler, “Simsys: A performance simulation framework,” in Proceedings of the 2013 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, ser. RAPIDO ’13. ACM, 2013. 28, 61, 88, 90, 95, 129, 131

[71] J. Noudohouenou, “Performance prediction based on codelet driven ap- plication characterization,” Ph.D. dissertation, Versailles-Saint-Quentin-en- Yvelines, 2013. 28

[72] M. H. Jamal and A. Waheed, “Precise measurement of execution time of con- current, symmetric, and short tasks,” in Int. CMG Conference, 2008, pp. 149–160. 29

[73] R. Dementiev, “Monitoring integrated memory controller requests in the 2nd, 3rd and 4th generation intel core processors,” https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel. 30, 72

[74] Intel, “1.5: Uncore pmu summary tables,” Intel Xeon Processor E5 v2 and E7 v2 Product Families Uncore Performance Monitoring Reference Manual, Feb. 2014. 30

[75] S. Laha, J. H. Patel, and R. K. Iyer, “Accurate low-cost methods for perfor- mance evaluation of cache memory systems,” Computers, IEEE Transactions on, vol. 37, no. 11, pp. 1325–1336, 1988. 31

[76] R. Jain, “Techniques for experimental design, measurement, simulation, and modeling, 1991,” 1991. 31

[77] D. J. Lilja, Measuring computer performance: a practitioner’s guide. Cam- bridge University Press, 2000. 31

[78] S. Touati, “Towards a statistical methodology to evaluate program speedups and their optimisation techniques,” arXiv preprint arXiv:0902.1035, 2009. 31

[79] C.-h. Hsu and W.-c. Feng, “A power-aware run-time system for high-performance computing,” in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, ser. SC ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 1–. [Online]. Available: http://dx.doi.org/10.1109/SC.2005.3 35

[80] K. Livingston, N. Triquenaux, T. Fighiera, J. C. Beyler, and W. Jalby, “Com- puter using too much power? give it a rest (runtime energy saving tech- nology),” Computer Science-Research and Development, vol. 29, no. 2, pp. 123–130, 2014. 35

[81] M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” in Low Power Electronics, 1994. Digest of Technical Papers., IEEE Symposium, Oct 1994, pp. 8–11. 35

[82] T. Mudge, “Power: A first class design constraint for future architectures,” in High Performance Computing - HiPC 2000. Springer, 2000, pp. 215–224. 35

[83] D. Brodowski, “Linux kernel cpufreq subsystem,” URL http://www.kernel.org/pub/linux/utils/kernel/cpufreq/cpufreq.html. 35

[84] F. Talbart et al., “Codelet tuning infrastructure,” https://github.com/franck-talbart/codelet_tuning_infrastructure, 2015. 41

[85] Z. Bendifallah, W. Jalby, J. Noudohouenou, E. Oseret, V. Palomares, and A. C. Rubial, “Pamda: Performance assessment using maqao toolset and dif- ferential analysis,” Tools for High Performance Computing 2013, pp. 107–127, 2014. 43, 131

[86] S. S. Shende and A. D. Malony, “The tau parallel performance system,” Int. J. High Perform. Comput. Appl., vol. 20, no. 2, pp. 287–311, May 2006. [Online]. Available: http://dx.doi.org/10.1177/1094342006064482 43, 58

[87] M. Burtscher, B.-D. Kim, J. R. Diamond, J. D. McCalpin, L. Koesterke, and J. C. Browne, “Perfexpert: An easy-to-use performance diagnosis tool for hpc applications.” in SC. IEEE, 2010, pp. 1–11. 43, 58

[88] Acumem, “Acumem threadspotter,” http://www.paratools.com/ threadspotter. 43

[89] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahàm, D. Becker, B. Mohr, and F. Jülich, “The scalasca performance toolset architecture,” in STHEC, 2008. 43, 58

[90] W. E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, “Vam- pir: Visualization and analysis of mpi resources,” Supercomputer, vol. 12, pp. 69–80, 1996. 43, 58

[91] D. Barthou, A. C. Rubial, W. Jalby, S. Koliai, and C. Valensi, “Performance tuning of x86 openmp codes with maqao,” in Parallel Tools Workshop. Dres- den, Germany: Springer-Verlag, sep 2009. 44, 47

[92] S. Koliai, S. Zuckerman, E. Oseret, M. Ivascot, T. Moseley, D. Quang, and W. Jalby, “A balanced approach to application performance tuning,” in LCPC, 2009, pp. 111–125. 44, 47

[93] A. S. Charif-Rubial, “On code performance analysis and optimisation for multicore architectures,” Ph.D. dissertation, Oct. 2012. [Online]. Available: http://tel.archives-ouvertes.fr/tel-00842601 44

[94] F. Real, M. Trumm, V. Vallet, B. Schimmelpfennig, M. Masella, and J.-P. Flament, “Quantum Chemical and Molecular Dynamics Study of the Coordination of Th(IV) in Aqueous Solvent,” J. Phys. Chem. B, vol. 114, no. 48, pp. 15 913–15 924, 2010. [Online]. Available: http://dx.doi.org/10.1021/jp108061s 45

[95] C. Staelin and H. packard Laboratories, “lmbench: Portable tools for perfor- mance analysis,” in USENIX Annual Technical Conference, 1996, pp. 279–294. 48

[96] J. Liu, W. Yu, J. Wu, D. Buntinas, S. Kini, D. K, and P. Wyckoff, “Mi- crobenchmark performance comparison of high-speed cluster interconnects,” IEEE Micro, 2004. 48

[97] S. R. Alam, R. F. Barrett, J. A. Kuehn, P. C. Roth, and J. S. Vetter, “Characterization of scientific workloads on systems with multi-core processors,” in IISWC, 2006, pp. 225–236. 48

[98] A. Charif-Rubial et al., “MIL: A language to build program analysis tools...” ser. HiPC, 2013. 50, 90

[99] E. Baysal, D. Kosloff, and J. Sherwood, “Reverse time migration,” Geophysics, 1983. 56

[100] Gprof, “The gnu profiler,” http://sourceware.org/binutils/docs-2.18/gprof/index.html, 2013. 58

[101] M. Martonosi, A. Gupta, and T. Anderson, “Memspy: Analyzing memory sys- tem bottlenecks in programs,” in Proc. ACM SIGMETRICS Conf. on Mea- surement and Modeling of Computer Systems, 1992, pp. 1–12. 58

[102] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor- Crummey, and N. R. Tallent, “Hpctoolkit: tools for performance analysis of optimized parallel programs http://hpctoolkit.org,” Concurr. Comput. : Pract. Exper., vol. 22, no. 6, pp. 685–701, Apr. 2010. [Online]. Available: http://dx.doi.org/10.1002/cpe.v22:6 58

[103] O. Sopeju, M. Burtscher, A. Rane, and J. Browne, “Autoscope: Automatic suggestions for code optimizations using perfexpert,” in 2011 ICPDPTA, Jul. 2011, pp. 19–25. 58

[104] W. Yoo, K. Larson, L. Baugh, S. Kim, and R. H. Campbell, “Adp: automated diagnosis of performance pathologies using hardware events.” in SIGMETRICS, P. G. Harrison, M. F. Arlitt, and G. Casale, Eds. ACM, 2012, pp. 283–294. [Online]. Available: http://dblp.uni-trier.de/db/conf/sigmetrics/sigmetrics2012.html#YooLBKC12 58

[105] W. Yoo, K. Larson, S. Kim, W. Ahn, R. H. Campbell, and L. Baugh, “Au- tomated fingerprinting of performance pathologies using performance moni- toring units (pmus),” in 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar ’11), USENIX. Berkeley, CA: USENIX, 05/2011 2011. 58

[106] Intel Corporation, Intel® 64 and IA-32 Architectures Optimization Reference Manual, Sept. 2014, no. 248966-030. 62, 89

[107] A. Fog, “Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for intel, amd and via cpus,” Mar. 2015. [Online]. Available: http://www.agner.org/optimize/instruction_tables.pdf 62, 64, 105, 107

[108] Intel, “2.3.3: Execution core (operations with data-dependant latencies),” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014. 65

[109] ——, “2.2.2.1: Legacy decode pipeline - microfusion,” Intel 64 and IA-32 Ar- chitectures Optimization Reference Manual, Sep. 2014. 65

[110] ——, “2.2.2.4: Micro-op queue and the loop stream detector (LSD),” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014. 66, 108

[111] ——, “2.2.2.2: Decoded icache,” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014. 67

[112] Intel Open Source Technology Center, “Perfmon,” https://download.01.org/perfmon/, 2015. 72

[113] A. Arcangeli. (2010, Aug.) Transparent hugepage support. [Online]. Available: http://www.linux-kvm.org/images/9/9e/2010-forum-thp.pdf 73

[114] M. B. Rani Borkar and S. Jourdan, “Advancing moore’s law in 2014 - broadwell converged core,” Aug. 2014. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/documents/presentation/advancing-moores-law-in-2014-presentation.pdf 73

[115] T. S. Karkhanis and J. E. Smith, “A first-order superscalar processor model,” in Proceedings of the 31st Annual International Symposium on Computer Architecture, ser. ISCA ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 338–. [Online]. Available: http://dl.acm.org/citation.cfm?id=998680.1006729 75

[116] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM Trans. Comput. Syst., vol. 27, no. 2, pp. 3:1–3:37, May 2009. [Online]. Available: http://doi.acm.org/10.1145/1534909.1534910 75

[117] J. Treibig and G. Hager, “Introducing a performance model for bandwidth- limited loop kernels,” in Parallel Processing and Applied Mathematics. Springer, 2010, pp. 615–624. 76

[118] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. 76

[119] P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “Construction and use of linear regression models for processor performance analysis,” in High- Performance Computer Architecture, 2006. The Twelfth International Sym- posium on. IEEE, 2006, pp. 99–108. 77

[120] J. J. Yi, D. J. Lilja, and D. M. Hawkins, “A statistically rigorous approach for improving simulation methodology,” in High-Performance Computer Architec- ture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on. IEEE, 2003, pp. 281–291. 77

[121] D. C. Wong, V. Palomares, E. Oseret, Z. Bendifallah, M. Tribalat, W. Jalby, and D. J. Kuck, “Vp3: A vectorization potential performance prototype,” ser. WPMVP ’15, 2015. 79, 131

[122] J. M. Cebrián, L. Natvig, and J. C. Meyer, “Performance and energy impact of parallelization and vectorization techniques in modern microprocessors,” Computing, 2013. 79

[123] “YALES2 public page.” [Online]. Available: http://www.coria-cfd.fr/index.php/YALES2 83, 95, 123

[124] V. Moureau et al., “From large-eddy simulation to direct numerical simulation of a lean premixed swirl flame: Filtered laminar flame-pdf modeling,” Combustion and Flame, 2011. 83, 123

[125] F. Réal et al., “Further insights in the ability of classical nonadditive po- tentials to model actinide ion-water interactions,” Journal of Computational Chemistry, 2013. 85

[126] G. C. Evans et al., “Vector seeker: A tool for finding vector potential,” ser. WPMVP ’14, 2014. 92

[127] J. Holewinski et al., “Dynamic trace-based analysis of vectorization potential of applications,” SIGPLAN Not., 2012. 92

[128] A. Rane, R. Krishnaiyer, C. J. Newburn, J. Browne, L. Fialho, and Z. Matveev, “Unification of static and dynamic analyses to enable vector- ization,” pp. 367–381, 2014. 92

[129] C. Haine, O. Aumage, P. Enguerrand, and D. Barthou, “Exploring and eval- uating array layout restructuration for SIMDization,” 2014. 92

[130] “The AVBP code.” [Online]. Available: http://www.cerfacs.fr/4-26334-The-AVBP-code.php 95, 124

[131] J. D. Little, “A proof for the queuing formula: L= λ w,” Operations research, vol. 9, no. 3, pp. 383–387, 1961. 96

[132] D. H. Bailey, “Little’s law and high performance computing,” in In RNR Tech- nical Report. Citeseer, 1997. 96

[133] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, “Dynamically al- locating processor resources between nearby and distant ILP,” in Computer Architecture, 2001. Proceedings. 28th Annual International Symposium on. IEEE, 2001, pp. 26–37. 101

[134] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execution: An alternative to very large instruction windows for out-of-order processors,” in High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on. IEEE, 2003, pp. 129–140. 101

[135] K. Skadron, P. S. Ahuja, M. Martonosi, and D. W. Clark, “Branch prediction, instruction-window size, and cache size: Performance trade-offs and simulation techniques,” Computers, IEEE Transactions on, vol. 48, no. 11, pp. 1260–1281, 1999. 102

[136] J. S. Griffith, S. R. Gupta, and G. J. Hinton, “Method and apparatus for binding instructions to dispatch ports of a reservation station,” Nov. 18 1997, US Patent 5,689,674. 102

[137] B. Sutanto, S. T. Srinivasan, M. C. Merten, C. Y. K. Lai, A. J. Christiansen, and J. M. Deinlein, “Method and apparatus for implementing dynamic port-binding within a reservation station,” Jan. 1 2015, US Patent 20,150,007,188. 103

[138] L. Djoudi, D. Barthou, O. Tomaz, A. Charif-Rubial, J.-T. Acquaviva, and W. Jalby, “The design and architecture of maqao profile: an instrumentation maqao module,” in Sixth Workshop on Explicitly Parallel Instruction Computing Architectures and Compiler Technology (EPIC-6), March 11, 2007, in conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization, San Jose, CA. IEEE, 2007, p. 13. 107

[139] Intel, “2.2.2.1: Legacy decode pipeline - macro-fusion,” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014. 111

[140] C. Hewett. (2012) Back end memory bound. [Online]. Available: https://software.intel.com/en-us/forums/topic/328062 115, 151

[141] Intel, “2.2.4: The execution core,” Intel 64 and IA-32 Architectures Optimiza- tion Reference Manual, Sep. 2014. 115

[142] ——, “2.1.3: Execution engine,” Intel 64 and IA-32 Architectures Optimization Reference Manual, Sep. 2014. 115

[143] ——, “3.5.2.4: Partial register stalls,” Intel 64 and IA-32 Architectures Opti- mization Reference Manual, Sep. 2014. 115

[144] G. Paoloni, “How to benchmark code execution times on intel ia-32 and ia-64 instruction set architectures,” Intel Corporation, September, 2010. 123

[145] “AVBP application progress.” [Online]. Available: http://www.cerfacs.fr/~cfdbib/repository/TR_CFD_12_133.pdf 124

[146] Intel, “Intel architecture code analyzer (IACA),” Jun. 2012. [Online]. Available: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer 128

[147] G. H. Loh, S. Subramaniam, and Y. Xie, “Zesto,” Jan. 2009. [Online]. Available: http://zesto.cc.gatech.edu 128

[148] D. Burger and T. M. Austin, “The simplescalar tool set, version 2.0,” ACM SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13–25, 1997. 128

[149] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simu- lator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011. 128

[150] T. E. Carlson, W. Heirman, and L. Eeckhout, “Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation,” in Pro- ceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 52. 128

[151] W. Heirman, T. Carlson, and L. Eeckhout, “Sniper: scalable and accurate par- allel multi-core simulation,” in 8th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES-2012). High-Performance and Embedded Architecture and Compilation Network of Excellence (HiPEAC), 2012, pp. 91–94. 128