Scalable Parallel Performance Measurement and Analysis Tools – State-Of-The-Art and Future Challenges B

DOI: 10.14529/jsfi140207 Scalable parallel performance measurement and analysis tools – state-of-the-art and future challenges B. Mohr1 Current large-scale HPC systems consist of complex configurations with a huge number of potentially heterogeneous components. As the systems get larger, their behavior becomes more and more dynamic and unpredictable because of hard- and software re-configurations due to fault recovery and power usage optimizations. Deep software hierarchies of large, complex system software and middleware components are required to operate such systems. Therefore, porting, adapting and tuning applications to today’s complex systems is a complicated and time-consuming task. Sophisticated integrated performance measurement, analysis, and optimization capabilities are required to efficiently utilize such systems. This article will summarize the state-of-the-art of scalable and portable parallel performance tools and the challenges these tools are facing on future extreme-scale and big data systems. Keywords: parallel programming, performance tools, extreme scale computing. Introduction Current large-scale HPC systems consist of complex configurations with a huge number of components. Each node has multiple multi-core sockets and often one or more additional accelerator units in the form of many-core nodes (e.g., Intel Xeon Phi), GPGPUs or FPGAs, resulting in a heterogeneous system architecture. Caches, memory and storage are attached to the components on various levels and are shared between components in varying degrees. Typically, there is a (huge) imbalance between the computational power of the components and the amount of memory available to these components. Thus, deep software hierarchies of large, complex system software and middleware components are required to efficiently use such systems. Therefore, porting, adapting and tuning applications to today’s complex systems is a complicated and time-consuming task. Sophisticated integrated performance measurement, analysis, and optimization capabilities are required to efficiently utilize such systems. However, designing and implementing efficient and useful performance tools is extremely difficult because of the same reasons. While the system and middleware software layers are designed to be transparent, they are typically not transparent with respect to performance. This performance intransparency will result in escalation of unforeseen problems to higher layers, including the application. This is not really a new problem, but certain properties of current large-scale and future extreme-scale systems significantly increase its severity and significance: At extreme-scale scale, there always will be failing components in the system, which has • a large impact on performance. A real-world application will probably never run on the exact same configuration twice. Load balancing issues limit the success even on moderately parallel systems, and the • challenge of data locality will become another severe issue which has to be addressed by appropriate mechanisms and tools. Dynamic power management, e.g., at hardware level inside a CPU or accelerator, will result • in performance variability between cores, nodes, and across different runs. The alternative to run at lower but system-wide consistent speed without dynamic power adjustments may not be an option in the future. 1Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany 108 Supercomputing Frontiers and Innovations B. Mohr The challenge of predicting application performance at extreme scale will make it difficult • to detect a performance problem if it is escalated undetected to the application level. The ever growing higher integration of components into a single chip and the use of more • and more hardware accelerators makes it more difficult to monitor application performance and move performance data out of the system unless special hardware support will be integrated into future systems. Altogether this will require an integrated, collaborative, and holistic approach to handle performance issues and correctly detect and analyze performance problems. In the following, we will summarize the state-of-the-art of scalable and portable parallel performance tools. Next, we discuss the various methods and approaches regarding tool portability, scalability, and integration. The article closes with an outlook on the challenges tools will face on future extreme-scale and big data systems. 1. Overview of state-of-the-art parallel performance tools Research on parallel performance tools has a long history. First tools appeared at the same time as the first parallel computer systems back in the 1980s and early 90s [1]. Meanwhile, many performance instrumentation, measurement, analysis and visualization tools exist, and summarizing and listing all major methods, approaches and tools is impossible in a short journal paper. Therefore, in the following section we concentrate on the major tool sets that are (i) portable, i.e., they can be used on more than one of today’s dominating architectures: Cray, IBM BlueGene, Fujitsu K computer, and Linux/UNIX clusters, (ii) scalable, i.e., it was demonstrated that they can be successfully used for an application executing on a couple of thousand nodes, (iii) versatile, i.e., they allow the performance analysis of all levels of today’s HPC systems: message passing between nodes, multi-threading and multi-tasking inside nodes, and offloading to accelerators, and (iv) supported, i.e., there are well-established groups or organizations behind them which maintain and further develop them. The structure of the following subsections is modelled after and re-uses excerpts from the VI-HPS Tools Guide [2], which also includes descriptions of further tools. 1.1. TAU TAU [3, 4] is a comprehensive profiling and tracing toolkit that supports performance eval- uation of programs written in C++, C, UPC, Fortran, Python, and Java. It is a robust, flexible, portable, and integrated framework and toolset for performance instrumentation, measurement, analysis, and visualization of large-scale parallel computer systems and applications. TAU supports both direct measurement as well as sampling modes of instrumentation and interfaces with external packages such as PAPI [25] or tools like Score-P, Scalasca, and Vampir (all described in the following sections). TAU is available under a BSD-style license. Typical Workflow TAU allows the user to instrument the program in a variety of ways including rewriting the binary using tau rewrite or runtime pre-loading of shared objects using tau exec. Source- level instrumentation typically involves substituting a compiler in the build process with a TAU compiler wrapper. This wrapper uses a given TAU configuration to link in the TAU library. At 2014, Vol. 1, No. 2 109 Scalable Parallel Performance Measurement and Analysis Tools — State-of-the-Art and... runtime, a user may specify different TAU environment variables to control the measurement options chosen for the performance experiment. This allows the user to generate call-path pro- files, specify hardware performance counters, turn on event-based sampling, generate traces, or specify memory instrumentation options. Figure 1. TAU main profile window. Each line shows a flattened histogram of a selected performance metric for a specific thread or task. The different colors represent the different program regions (e.g., functions, loops, parallel sections, etc.). The display allows to get a quick overview via comparing the differences/commonalities of the different threads/tasks. Clicking on a line label brings up more detailed information about the selected thread or task. Clicking on a col- ored region brings up more detailed information about the selected program region. The menu bar allows to choose the performance metric shown or to launch further, more detailed displays like a call-graph or communication matrix display. Results from larger performance experiments can be more easily analyzed with the 3D displays of TAU, see fig. 2. Performance-analysis results may be stored in TAUdb, a database for cross-experiment analysis and advanced performance data mining operations using TAU’s PerfExplorer tool [5]. It may be visualized using ParaProf, TAU’s profile browser that can show the extent of performance variation and compare executions, see fig. 1. Supported platforms IBM Blue Gene/P/Q, NVIDIA and AMD GPUs and Intel MIC systems, Cray XE/XK/XC30, SGI Altix, Fujitsu K Computer (FX10), NEC SX-9, Solaris & Linux clusters (x86/x86 64, MIPS, ARM), Windows, Apple Mac OS X. Supported Runtime Layers MPI, OpenMP (using GOMP, OMPT, and Opari instrumentation), Pthread, MPC Threads, Java Threads, Windows Threads, CUDA, OpenCL, OpenACC. 110 Supercomputing Frontiers and Innovations B. Mohr Figure 2. TAU 3D profile view. It displays the distribution of two selected performance metrics (encoded via height and color) over the axes program regions and threads/tasks. 1.2. HPCToolkit HPCToolkit [6, 7] is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multi-core desktop systems to the largest supercomput- ers. HPCToolkit provides accurate measurements of a program’s work, resource consumption, and inefficiency, correlates these metrics with the program’s source code, works with multilingual, fully optimized binaries, and has very low measurement overhead. HPCToolkit’s measurements provide support for analyzing a program’s execution cost, inefficiency, and scaling characteristics

Scalable Parallel Performance Measurement and Analysis Tools – State-Of-The-Art and Future Challenges B

D5.1 Report on Performance and Tuning of Runtime Libraries for ARM

Scalasca User Guide

TAU, PAPI, Scalasca and Vampir

Score-P – a Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

Profiling and Tracing Tools for Performance Analysis of Large Scale Applications

Best Practice Guide Modern Processors

CUBE 4.3.4 – User Guide Generic Display for Application Performance Data

Scalasca 2.3 User Guide Scalable Automatic Performance Analysis

The Scalasca Performance Toolset Architecture

Score-P – a Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

The Arm Architecture for Exascale HPC

Large-Scale Performance Analysis of PFLOTRAN with Scalasca