Trace-Based Performance Analysis for Hardware Accelerators


Trace-based Performance Analysis for Hardware Accelerators

Dissertation for the academic degree Doktoringenieur (Dr.-Ing.), submitted to the Faculty of Computer Science of Technische Universität Dresden by Diplom-Ingenieur für Informationssystemtechnik Guido Juckeland, born 7 July 1979 in Nordhausen. Supervising professor: Prof. Dr. rer. nat. Wolfgang E. Nagel. Dresden, 8 October 2012.

To J.

Kurzfassung (translated from German)

This dissertation shows how the execution of application parts that are offloaded to hardware accelerators can be recorded in an event log. The established technique of trace-based application performance analysis is thereby extended to also capture this new level of parallelism. The power-consumption limits of computer systems have led to a growing number of hybrid computer architectures. High-performance computers as well as workstations and mobile devices today use hardware accelerators to offload compute-intensive, parallel program parts, relieving the scalar host processor so that it handles only the non-parallel program parts. This execution scheme is typically asynchronous: the scalar processor can continue its own work while the hardware accelerator computes. The performance analysis tools provided by hardware accelerator vendors cover the standard case (one host system with one hardware accelerator) very well, but offer no support for highly parallel computer systems. This dissertation investigates to what extent multi-hybrid applications, too, can record the activity of hardware accelerators; to this end, the existing method for generating event logs of highly parallel applications is extended accordingly. The thesis first develops a general methodology with which an event log can be produced for any API-based hardware acceleration. Building on this, a dedicated programming interface is designed that allows additional performance-relevant data to be recorded; its implementation is illustrated using NVIDIA CUPTI. A further part of the work deals with the visualization of event logs that contain recordings from the different levels of parallelism. To overcome the limitations of classic performance profiles and timeline displays, Parallel Program Flow Graphs (PPFGs) are introduced as a new graph-based form of representation. This novel approach shows an event log as a sequence of program states with shared and diverging execution paths, so that divergent program behavior and load imbalances can be located much more easily. The thesis concludes with a detailed analysis of PIConGPU, a multi-hybrid plasma physics simulation, which has benefited greatly from the analysis capabilities developed in this work.

Abstract

This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has also led to an interest in hybrid computing architectures. High-end computers, workstations, and mobile devices employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well, yet they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications so that they also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. To overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Program Flow Graphs (PPFGs) is introduced. This novel approach uses program states to display similarities and differences between the potentially very large number of event streams and thus enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation, that benefited greatly from the developed performance analysis methods.
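To make the "generic approach suitable for any API-based acceleration paradigm" more concrete, the following is a minimal sketch of one common way to realize host-side API interception on Linux: a preloaded shared object wraps a single CUDA runtime call, timestamps it, and forwards it to the real library. This is an illustration under stated assumptions, not the thesis's actual tracing framework; the choice of cudaMemcpy, the log format, writing to stderr, and the availability of the CUDA toolkit header are all assumptions made for the example.

    /* Minimal sketch of API interception via LD_PRELOAD (assumes a Linux host
     * and the CUDA runtime headers; wrapping cudaMemcpy is illustrative only). */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>   /* declares cudaError_t and cudaMemcpyKind */

    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    }

    /* Same name and signature as the real call: the dynamic linker resolves the
     * application's calls to this wrapper first, which forwards to the original
     * implementation found via RTLD_NEXT. */
    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                           enum cudaMemcpyKind kind)
    {
        typedef cudaError_t (*memcpy_fn)(void *, const void *, size_t,
                                         enum cudaMemcpyKind);
        static memcpy_fn real_memcpy = NULL;
        if (real_memcpy == NULL)
            real_memcpy = (memcpy_fn)dlsym(RTLD_NEXT, "cudaMemcpy");

        double enter = now_seconds();
        cudaError_t result = real_memcpy(dst, src, count, kind);
        double leave = now_seconds();

        /* A real tracer would emit an enter/leave event pair into its trace
         * buffer here instead of printing. */
        fprintf(stderr, "[trace] cudaMemcpy %zu bytes, %.6f s\n",
                count, leave - enter);
        return result;
    }

Built with something like gcc -shared -fPIC wrap_cudamemcpy.c -o libwrap.so -ldl and activated via LD_PRELOAD=./libwrap.so ./app, every cudaMemcpy call of the unmodified application passes through the wrapper; the same pattern can be generated automatically for any other API function.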
Contents

Nomenclature
1 Introduction
2 Hardware Accelerators
  2.1 Rise and Fall of Special Purpose Solutions
    2.1.1 Cell Broadband Engine
    2.1.2 ClearSpeed Accelerators
  2.2 State-of-the-Art Accelerator Technology
    2.2.1 Floating-Point Acceleration
    2.2.2 Field Programmable Gate Arrays (FPGAs)
    2.2.3 Graphics Processing Units
  2.3 Generic Accelerator Programming Paradigms
    2.3.1 OpenCL
    2.3.2 Directive Based Accelerator Programming
  2.4 Common Characteristics of Hardware Accelerators
3 State-of-the-Art and Related Work
  3.1 Gathering Performance Data Using Event Logs
  3.2 Performance Profiles and Timelines
  3.3 Repeating Patterns in Event Logs
  3.4 Existing Performance Tools from Hardware Accelerator Vendors
    3.4.1 IBM Cell Broadband Engine
    3.4.2 ClearSpeed Accelerators
    3.4.3 Convey
    3.4.4 NVIDIA Visual Profiler
    3.4.5 NVIDIA Parallel Nsight
    3.4.6 AMD APP Profiler
  3.5 Current Third Party Performance Tools with Accelerator Support
    3.5.1 VampirTrace/Vampir
    3.5.2 TAU
    3.5.3 Integrated Performance Monitoring (IPM)
    3.5.4 CEPBA MPItrace
    3.5.5 PAPI
    3.5.6 GPU Ocelot
  3.6 Classification of the Existing Performance Tools
4 Generic Extension of Performance Analysis to Accelerators
  4.1 Automatically Intercepting Library Calls
    4.1.1 Framework for Logging any API Invocation
    4.1.2 Additional Semantics to Automatically Generated Library Wrappers
    4.1.3 CUDA/OpenCL: Exposing Details on the API Level
  4.2 Design of a Host Based Accelerator Performance API
    4.2.1 Additional Device Performance Data
    4.2.2 Methods of Acquiring the Device Data
    4.2.3 The Implementation of NVIDIA CUPTI
  4.3 Accelerator Based Event Logging
    4.3.1 Accessing Performance Counter Information
    4.3.2 Kernel Wrapping
    4.3.3 Kernel Instrumentation
  4.4 Effects of Overlapping Metrics
  4.5 Parallel Program Flow Graphs
    4.5.1 Offline Construction of a PPFG
    4.5.2 Online Construction of a PPFG
    4.5.3 Display Aliasing of a PPFG
    4.5.4 Other Presentation Approaches Using the PPFG Data
    4.5.5 Using PPFGs for Event Logs from Hybrid Applications
5 Studying Multi-Hybrid Application Performance
  5.1 Particle-in-Cell Simulations
  5.2 Towards a Hardware-Accelerated PIC Solution
  5.3 Current Status and Future Work
  5.4 Application of the Developed Methodology
6 Conclusion and Future Work
Bibliography
List of Figures
List of Tables

Nomenclature

API: An application programming interface (API) specifies how software components communicate with one another. It typically defines routines and data structures.
ASIC: An application-specific integrated circuit (ASIC) is a special-purpose processor designed to perform one specific task very well. It cannot execute an arbitrary computable algorithm.
DDR: Double data rate (DDR) memory is a form of computer memory that transmits data on both the rising and the falling edge of the clock signal.
die: The die is one coherent piece of semiconductor substrate that contains electric circuitry.
DirectCompute: DirectCompute is a component of Microsoft's DirectX that allows non-graphical computation on data stored in the graphics accelerator's memory.
DMA: Direct memory access (DMA) allows a device to read or write a region of main memory directly, without involving the processor in every transfer. It is typically used for data transfers from and to external devices.
GUI: A graphical user interface (GUI) enables interaction between a computer program and its user through images and gestures (via a mouse, or directly with a finger) rather than text commands.
IDE: An integrated development environment (IDE) is a GUI-based piece of software that contains or integrates all tools necessary for software development. It usually contains a source code editor, an automated build tool, and a debugger. Some IDEs also contain performance analysis tools.
Inlining: Inlining is a compiler technique that copies small subroutines directly into the place where they are called, saving the overhead of a subroutine invocation.
ISA: An instruction set architecture (ISA) defines the interface between software and hardware. It specifies supported data types and operations and provides a format for machine instructions.
Recommended publications
  • Parallel Rendering on Hybrid Multi-GPU Clusters
    Eurographics Symposium on Parallel Graphics and Visualization (2012), H. Childs and T. Kuhlen (Editors). Parallel Rendering on Hybrid Multi-GPU Clusters. S. Eilemann†1, A. Bilgili1, M. Abdellah1, J. Hernando2, M. Makhinya3, R. Pajarola3, F. Schürmann1 (1Blue Brain Project, EPFL; 2CeSViMa, UPM; 3Visualization and MultiMedia Lab, University of Zürich). Abstract: Achieving efficient scalable parallel rendering for interactive visualization applications on medium-sized graphics clusters remains a challenging problem. Frame rates of up to 60 Hz require a carefully designed and fine-tuned parallel rendering implementation that fits all required operations into the 16 ms time budget available for each rendered frame. Furthermore, modern commodity hardware increasingly embraces a NUMA architecture, where multiple processor sockets each have their locally attached memory and where auxiliary devices such as GPUs and network interfaces are directly attached to one of the processors. Such so-called fat NUMA processing and graphics nodes are increasingly used to build cost-effective hybrid shared/distributed memory visualization clusters. In this paper we present a thorough analysis of the asynchronous parallelization of the rendering stages and we derive and implement important optimizations to achieve highly interactive frame rates on such hybrid multi-GPU clusters. We use both a benchmark program and a real-world scientific application used to visualize, navigate and interact with simulations of cortical neuron circuit models. Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems—Distributed Graphics; I.3.m [Computer Graphics]: Miscellaneous—Parallel Rendering. 1. Introduction: Hybrid GPU clusters are often built from so-called fat render nodes using a NUMA architecture with multiple CPUs and GPUs per machine.
  • CUDA Particle-Based Fluid Simulation
    CUDA Particle-based Fluid Simulation. Simon Green, © NVIDIA Corporation 2008. Overview: fluid simulation techniques, CUDA particle simulation, spatial subdivision techniques, rendering methods, future work. Fluid Simulation Techniques: various approaches exist: grid based (Eulerian), e.g. stable fluids and particle level set; particle based (Lagrangian), e.g. SPH (smoothed particle hydrodynamics) and MPS (Moving-Particle Semi-Implicit); height field, e.g. FFT (Tessendorf); wave propagation, e.g. Kass and Miller. CUDA N-Body Demo: computes gravitational attraction between n bodies, evaluating all n² interactions and using shared memory to reduce memory bandwidth; 16K bodies @ 44 FPS × 20 FLOPS/interaction × 16K² interactions/frame = 240 GFLOP/s on a GeForce 8800 GTX. Particle-based Fluid Simulation: advantages are that conservation of mass is trivial, the free surface is easy to track, computation is only performed where necessary, the method is not necessarily constrained to a finite grid, and it is easy to parallelize; disadvantages are that it is hard to extract a smooth surface from the particles and that realistic results require a large number of particles. Particle Fluid Simulation Papers: Particle-Based Fluid Simulation for Interactive Applications, M. Müller, 2003 (3000 particles, 5 fps); Particle-based Viscoelastic Fluid Simulation, Clavet et al., 2005 (1000 particles, 10 fps; 20,000 particles, 2 secs/frame). CUDA SDK Particles Demo: particles with simple collisions, using a uniform grid based on sorting with the fast CUDA radix sort; current performance: >100 fps for 65K interacting particles.
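    The "uniform grid based on sorting" mentioned for the SDK particles demo can be sketched as follows: every particle is mapped to the linear index of the grid cell that contains it, and sorting the particles by this index (the slides use the CUDA radix sort) groups spatial neighbours together so collision tests only need to look at adjacent cells. The grid dimensions and cell size below are illustrative assumptions, and the function is written as plain host C; in the actual demo this step runs as a CUDA kernel with one thread per particle.

        #include <math.h>

        typedef struct { float x, y, z; } point3;

        /* Map a particle position to the linear index of its grid cell.
         * Positions are assumed to lie in [0, grid_dim * cell_size) per axis;
         * out-of-range particles are clamped to the border cells. */
        unsigned int cell_index(point3 p, float cell_size, unsigned int grid_dim)
        {
            float max_cell = (float)(grid_dim - 1);
            unsigned int ix = (unsigned int)fminf(fmaxf(floorf(p.x / cell_size), 0.0f), max_cell);
            unsigned int iy = (unsigned int)fminf(fmaxf(floorf(p.y / cell_size), 0.0f), max_cell);
            unsigned int iz = (unsigned int)fminf(fmaxf(floorf(p.z / cell_size), 0.0f), max_cell);
            return (iz * grid_dim + iy) * grid_dim + ix;   /* row-major 3D -> 1D index */
        }

    Sorting (particle index, cell index) pairs by cell index then yields a contiguous range of particles per cell, so each particle only has to test the 27 surrounding cells for collisions instead of all other particles.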
  • NVIDIA PhysX SDK EULA
    NVIDIA CORPORATION NVIDIA® PHYSX® SDK END USER LICENSE AGREEMENT Welcome to the new world of reality gaming brought to you by PhysX® acceleration from NVIDIA®. NVIDIA Corporation (“NVIDIA”) is willing to license the PHYSX SDK and the accompanying documentation to you only on the condition that you accept all the terms in this License Agreement (“Agreement”). IMPORTANT: READ THE FOLLOWING TERMS AND CONDITIONS BEFORE USING THE ACCOMPANYING NVIDIA PHYSX SDK. IF YOU DO NOT AGREE TO THE TERMS OF THIS AGREEMENT, NVIDIA IS NOT WILLING TO LICENSE THE PHYSX SDK TO YOU. IF YOU DO NOT AGREE TO THESE TERMS, YOU SHALL DESTROY THIS ENTIRE PRODUCT AND PROVIDE EMAIL VERIFICATION TO [email protected] OF DELETION OF ALL COPIES OF THE ENTIRE PRODUCT. NVIDIA MAY MODIFY THE TERMS OF THIS AGREEMENT FROM TIME TO TIME. ANY USE OF THE PHYSX SDK WILL BE SUBJECT TO SUCH UPDATED TERMS. A CURRENT VERSION OF THIS AGREEMENT IS POSTED ON NVIDIA’S DEVELOPER WEBSITE: www.developer.nvidia.com/object/physx_eula.html 1. Definitions. “Licensed Platforms” means the following: - Any PC or Apple Mac computer with a NVIDIA CUDA-enabled processor executing NVIDIA PhysX; - Any PC or Apple Mac computer running NVIDIA PhysX software executing on the primary central processing unit of the PC only; - Any PC utilizing an AGEIA PhysX processor executing NVIDIA PhysX code; - Microsoft XBOX 360; - Nintendo Wii; and/or - Sony Playstation 3 “Physics Application” means a software application designed for use and fully compatible with the PhysX SDK and/or NVIDIA Graphics processor products, including but not limited to, a video game, visual simulation, movie, or other product.
  • Parallel Texture-Based Vector Field Visualization on Curved Surfaces Using GPU Cluster Computers
    Eurographics Symposium on Parallel Graphics and Visualization (2006), Alan Heirich, Bruno Raffin, and Luis Paulo dos Santos (Editors). Parallel Texture-Based Vector Field Visualization on Curved Surfaces Using GPU Cluster Computers. S. Bachthaler1, M. Strengert1, D. Weiskopf2, and T. Ertl1 (1Institute of Visualization and Interactive Systems, University of Stuttgart, Germany; 2School of Computing Science, Simon Fraser University, Canada). Abstract: We adopt a technique for texture-based visualization of flow fields on curved surfaces for parallel computation on a GPU cluster. The underlying LIC method relies on image-space calculations and allows the user to visualize a full 3D vector field on arbitrary and changing hypersurfaces. By using parallelization, both the visualization speed and the maximum data set size are scaled with the number of cluster nodes. A sort-first strategy with image-space decomposition is employed to distribute the workload for the LIC computation, while a sort-last approach with an object-space partitioning of the vector field is used to increase the total amount of available GPU memory. We specifically address issues for parallel GPU-based vector field visualization, such as reduced locality of memory accesses caused by particle tracing, dynamic load balancing for changing camera parameters, and the combination of image-space and object-space decomposition in a hybrid approach. Performance measurements document the behavior of our implementation on a GPU cluster with AMD Opteron CPUs, NVIDIA GeForce 6800 Ultra GPUs, and Infiniband network connection. Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Viewing algorithms; I.3.3 [Three-Dimensional Graphics and Realism]: Color, shading, shadowing, and texture.
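    As a rough illustration of the sort-first, image-space decomposition described in the abstract, the sketch below statically splits the viewport into one horizontal strip per cluster node, so each node computes the LIC image only for its strip. The strip shape and the static split are assumptions made for illustration; the paper itself combines this with dynamic load balancing and an object-space partitioning of the vector field.

        typedef struct { int x, y, width, height; } viewport;

        /* Return the horizontal strip of the full viewport assigned to one node,
         * spreading any remainder rows over the first nodes. */
        viewport strip_for_node(viewport full, int node, int num_nodes)
        {
            viewport s = full;
            int base = full.height / num_nodes;
            int rest = full.height % num_nodes;
            s.height = base + (node < rest ? 1 : 0);
            s.y = full.y + node * base + (node < rest ? node : rest);
            return s;
        }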
  • PathScale ENZO GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives, C. Bergström
    PathScale ENZO, GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives. C. Bergström, May 14th, 2012. ENZO Overview & Goals: speed the transition to GPU & many-core systems; simplify the task of migrating software written in C, C++ & Fortran; uses the OpenHMPP standard (easy migration); CAPS HMPP compatible; performance & HPC focused; fully exploits NVIDIA GPU features; generates native instructions optimized for NVIDIA GPUs. Project Schedule: ENZO production release June 2012 (OpenHMPP 2.5; C, C++ and Fortran); next ENZO production release October 2012 (more tools and better support for libraries, x86_64 OpenHMPP task parallelism similar to OMP3 tasks, more optimizations (IPA / CG2 / textures), OpenHMPP 3.0, CUDA 4.x, Kepler). Project Status: OpenHMPP 2.5 (running CAPS C & Fortran labs, PathScale-written HMPP test suite, customer code); new C++ compiler (Perennial C++VS and CVSA regression free, corner-case compile-time issues, corner-case runtime issues); ongoing effort (performance tuning & benchmarking, compiler robustness, nightly compiler builds to address issues). Performance: NVIDIA Tesla 2050 - "Lab2" SGEMM – ENZO – /opt/enzo/bin/pathcc -hmpp
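    For readers unfamiliar with the OpenHMPP directives that ENZO compiles, the following is a hedged sketch of the basic codelet/callsite pattern in C. The directive spelling follows published OpenHMPP examples, but the exact clauses and the scale function are illustrative assumptions rather than ENZO-specific documentation.

        /* A compute function marked as an OpenHMPP codelet: the compiler is asked
         * to generate a CUDA version of it plus the data transfers it needs. */
        #pragma hmpp scale codelet, target=CUDA, args[v].io=inout
        void scale(int n, float factor, float v[n])
        {
            int i;
            for (i = 0; i < n; ++i)
                v[i] *= factor;
        }

        int main(void)
        {
            enum { N = 1 << 20 };
            static float data[N];

            /* The callsite directive selects the accelerated version of the
             * codelet at this call instead of the plain CPU function. */
            #pragma hmpp scale callsite
            scale(N, 2.0f, data);

            return 0;
        }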
  • Evolution of the Graphical Processing Unit
    University of Nevada, Reno. Evolution of the Graphical Processing Unit. A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science with a major in Computer Science, by Thomas Scott Crow; Dr. Frederick C. Harris, Jr., Advisor; December 2004. Dedication: To my wife Windee, thank you for all of your patience, intelligence and love. Acknowledgements: I would like to thank my advisor Dr. Harris for his patience and the help he has provided me. The field of Computer Science needs more individuals like Dr. Harris. I would like to thank Dr. Mensing for unknowingly giving me an excellent model of what a Man can be and for his confidence in my work. I am very grateful to Dr. Egbert and Dr. Mensing for agreeing to be committee members and for their valuable time. Thank you jeffs. Abstract: In this paper we discuss some major contributions to the field of computer graphics that have led to the implementation of the modern graphical processing unit. We also compare the performance of matrix‐matrix multiplication on the GPU to the same computation on the CPU. Although the CPU performs better in this comparison, modern GPUs have a lot of potential, since their rate of growth far exceeds that of the CPU: the GPU's transistor count has doubled every 6 months, whereas that of the CPU doubles only every 18 months. There is currently much research on general purpose computing on GPUs, and although there has been moderate success, several issues keep the commodity GPU from expanding beyond pure graphics computing, limited cache bandwidth being one.
  • Multiprocessing Contents
    Multiprocessing. Contents:
    1 Multiprocessing: 1.1 Pre-history; 1.2 Key topics (1.2.1 Processor symmetry, 1.2.2 Instruction and data streams, 1.2.3 Processor coupling, 1.2.4 Multiprocessor Communication Architecture); 1.3 Flynn’s taxonomy (1.3.1 SISD multiprocessing, 1.3.2 SIMD multiprocessing, 1.3.3 MISD multiprocessing, 1.3.4 MIMD multiprocessing); 1.4 See also; 1.5 References
    2 Computer multitasking: 2.1 Multiprogramming; 2.2 Cooperative multitasking; 2.3 Preemptive multitasking; 2.4 Real time; 2.5 Multithreading; 2.6 Memory protection; 2.7 Memory swapping; 2.8 Programming; 2.9 See also; 2.10 References
  • Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
    Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño. 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014), July 3-4, 2014 — Amsterdam, Netherlands. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work. (J.M. Andión et al., HLPP 2014.) [Figure: single-processor performance growth over time, axis labelled "Performance (vs. ...)", with data points for DEC Alpha, IBM POWER, AMD Athlon, and Intel Pentium/Xeon/Core systems and annotated growth rates of 52%/year and 22%/year.]
  • Nvidia Corporation 2016 Annual Report
    2017 NVIDIA CORPORATION ANNUAL REVIEW NOTICE OF ANNUAL MEETING PROXY STATEMENT FORM 10-K THE AGE OF THE GPU IS UPON US THE NEXT PLATFORM A decade ago, we set out to transform the GPU into a powerful computing platform—a specialized tool for the da Vincis and Einsteins of our time. GPU computing has since opened a floodgate of innovation. From gaming and VR to AI and self-driving cars, we’re at the center of the most promising trends in our lifetime. The world has taken notice. Jensen Huang CEO and Founder, NVIDIA NVIDIA GEFORCE HAS MOVED FROM GRAPHICS CARD TO GAMING PLATFORM FORBES The PC drives the growth of computer gaming, the largest entertainment industry in the world. The mass popularity of eSports, the evolution of gaming into a social medium, and the advance of new technologies like 4K, HDR, and VR will fuel its further growth. Today, 200 million gamers play on GeForce, the world’s largest gaming platform. Our breakthrough NVIDIA Pascal architecture delighted gamers, and we built on its foundation with new capabilities, like NVIDIA Ansel, the most advanced in-game photography system ever built. To serve the 1 billion new or infrequent gamers whose PCs are not ready for today’s games, we brought the GeForce NOW game streaming service to Macs and PCs. Mass Effect: Andromeda. Courtesy of Electronic Arts. THE BEST ANDROID TV DEVICE JUST GOT BETTER ENGADGET NVIDIA SHIELD TV, controller, and remote. NVIDIA SHIELD is taking NVIDIA gaming and AI into living rooms around the world. The most advanced streamer now boasts 1,000 games, has the largest open catalog of media in 4K, and can serve as the brain of the AI home.
  • Literature Review and Implementation Overview: High Performance Computing with Graphics Processing Units for Classroom and Research Use
    LITERATURE REVIEW AND IMPLEMENTATION OVERVIEW: HIGH PERFORMANCE COMPUTING WITH GRAPHICS PROCESSING UNITS FOR CLASSROOM AND RESEARCH USE. A PREPRINT. Nathan George, Department of Data Sciences, Regis University, Denver, CO 80221, [email protected]. May 18, 2020. arXiv:2005.07598v1 [cs.DC] 13 May 2020. Abstract: In this report, I discuss the history and current state of GPU HPC systems. Although high-power GPUs have only existed a short time, they have found rapid adoption in deep learning applications. I also discuss an implementation of a commodity-hardware NVIDIA GPU HPC cluster for deep learning research and academic teaching use. Keywords: GPU · Neural Networks · Deep Learning · HPC. 1 Introduction, Background, and GPU HPC History. High performance computing (HPC) is typically characterized by large amounts of memory and processing power. HPC, sometimes also called supercomputing, has been around since the 1960s with the introduction of the CDC STAR-100, and continues to push the limits of computing power and capabilities for large-scale problems [1, 2]. However, the use of graphics processing units (GPUs) in HPC supercomputers only started in the mid-to-late 2000s [3, 4]. Although graphics processing chips have been around since the 1970s, GPUs were not widely used for computations until the 2000s. During the early 2000s, GPU clusters began to appear for HPC applications. Most of these clusters were designed to run large calculations requiring vast computing power, and many clusters are still designed for that purpose [5]. GPUs have been increasingly used for computations due to their commodification, following Moore's Law (demonstrated in Figure 1), and usage in specific applications like neural networks.
  • How to Write Code That Will Survive
    Programming Heterogeneous Many-cores Using Directives: HMPP - OpenACC. F. Bodin, CAPS CTO (CC 2012, www.caps-entreprise.com).
    Introduction • Programming many-core systems faces the following dilemma: o Achieve "portable" performance • Multiple forms of parallelism cohabiting – multiple devices (e.g. GPUs) with their own address space – multiple threads inside a device – vector/SIMD parallelism inside a thread • Massive parallelism – tens of thousands of threads needed o The constraint of keeping a unique version of the code, preferably mono-language • Reduces maintenance cost • Preserves code assets • Less sensitive to fast-moving hardware targets • Codes last several generations of hardware architecture • For legacy codes, a directive-based approach may be an alternative o And may benefit from auto-tuning techniques.
    Profile of a Legacy Application • Written in C/C++/Fortran • Mix of user code and library calls, e.g. while(many){ ... mylib1(A,B); ... myuserfunc1(B,A); ... mylib2(A,B); ... myuserfunc2(B,A); ... } • Hotspots may or may not be parallel • Lifetime in 10s of years • Cannot be fully re-written • Migration can be risky and mandatory.
    Overview of the Presentation • Many-core architectures o Definition and forecast o Why usual parallel programming techniques won't work per se • Directive-based programming o OpenACC sets of directives o HMPP directives o Library integration issue • Toward a portable infrastructure for auto-tuning o Current auto-tuning directives in HMPP 3.0 o CodeletFinder for offline auto-tuning o Toward a standard auto-tuning interface.
    Many-Core Architectures / Heterogeneous Many-Cores • Many general-purpose cores coupled with a massively parallel accelerator (HWA) • Data/stream/vector parallelism to be exploited by the HWA • CPU and HWA linked with a PCIx bus.
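    Since the ENZO entry above already sketches the OpenHMPP codelet style, here is the OpenACC flavour of the same directive-based idea for a legacy hotspot loop: the loop is annotated, the surrounding code stays unchanged, and the compiler generates the accelerator kernel and data transfers. The saxpy-like loop and the data clauses are illustrative assumptions, not taken from the slides.

        #include <stddef.h>

        /* Hotspot loop from a legacy code, annotated with OpenACC so it can be
         * offloaded without rewriting the caller. */
        void saxpy(size_t n, float a, const float *x, float *y)
        {
            /* copyin/copy make the host-device transfers explicit; the compiler
             * turns the loop body into an accelerator kernel. */
            #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (size_t i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }

    Compared with the HMPP codelet/callsite pair, the OpenACC directive lives directly on the loop, which is often the least intrusive starting point for migrating the kind of legacy application profiled in these slides.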
  • AMD PCIe® ADD-IN BOARD
    NVIDIA GT610 PCIe® for passive ADD-IN BOARD Datasheet, AEGX-N3A1-01FSS1. Contents: 1. Feature; 2. Functional Overview (2.1 GPU Block diagram, 2.2 GPU, 2.3 Memory Interface, 2.4 Features and Technologies, 2.5 Display Support, 2.6 Digital Audio, 2.7 Video); 3. Pin Assignment and Description (3.1 DVI-I Connector Pinout, 3.2 HDMI 1.4a Connector Pinout, 3.3 VGA Connector Pinout, 3.4 VGA Header Pinout)