Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information

David Eberius, Thananon Patinyasakdikul, George Bosilca
Innovative Computing Laboratory, University of Tennessee, Knoxville
1122 Volunteer Blvd, Knoxville, Tennessee 37996

ABSTRACT
This paper details the implementation and usage of software-based performance counters to understand the performance of a particular implementation of the MPI standard, Open MPI. Such counters can expose intrinsic features of the software stack that are not otherwise available in a generic and portable way. The PMPI interface is useful for instrumenting MPI applications at the user level, however it is insufficient for providing meaningful internal MPI performance details. While the Peruse interface provides more detailed information on state changes within Open MPI, it has not seen widespread adoption. We introduce a simple low-level approach that instruments the Open MPI code at key locations to provide fine-grained MPI performance metrics. We evaluate the overhead associated with adding these counters to Open MPI, as well as their use in determining bottlenecks and areas for improvement both in user code and in the MPI implementation itself.

CCS CONCEPTS
• General and reference → Metrics; Evaluation; • Networks → Network measurement; • Computing methodologies → Massively parallel and high-performance simulations;

KEYWORDS
MPI, Tools, Performance Counters, Profiling

ACM Reference format:
David Eberius, Thananon Patinyasakdikul, George Bosilca. 2017. Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information. In Proceedings of EuroMPI/USA '17, Chicago, IL, USA, September 25–28, 2017, 8 pages. https://doi.org/10.1145/3127024.3127039

1 INTRODUCTION
The standard paradigm for distributed memory parallelization is the Message Passing Interface (MPI) [3]. MPI has been used extensively for achieving high performance on distributed systems within academia and industry alike. Parallelizing an application is the first step toward a potentially shorter time to solution, and it is usually supplemented by a performance analysis stage where performance bottlenecks (either in the algorithm itself or in the communication pattern) are identified and addressed. The MPI standard defines a profiling interface (PMPI) that allows tools to preempt all MPI function calls and add instrumentation or other functionality. Many tools like Vampir [2], Paraver [8], and TAU [12] use the PMPI interface to profile MPI applications, mainly by inserting timing functionality to track when MPI functions start and complete. This information is generally stored in a binary trace file, available to the tools post-mortem for thorough analysis. This method provides an overview of how the application progressed overall, but cannot expose low-level details and therefore provides little insight into what was happening within MPI.
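To make the interception mechanism concrete, the following minimal C sketch (illustrative only, not code from any of the tools above) shows the pattern a PMPI-based profiler relies on: the tool defines its own MPI_Send, records timing around the call, and forwards the arguments to the underlying implementation through the matching PMPI_Send entry point.

    #include <mpi.h>
    #include <stdio.h>

    /* PMPI-style wrapper: every MPI function has a PMPI_ twin, so a tool can
     * shadow MPI_Send with its own version and still reach the real one. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();                      /* call entry time */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        double elapsed = MPI_Wtime() - start;            /* call duration   */

        /* A real tool would append this record to a trace buffer; printing
         * keeps the sketch self-contained. */
        printf("MPI_Send: %d elements, %.6f s\n", count, elapsed);
        return rc;
    }

Linking such wrappers ahead of the MPI library is enough to time every call, but, as noted above, it reveals nothing about what the library does internally while the call executes.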
The MPI performance revealing extension interface (Peruse) [9] was developed to complement the lack of fine-grained detail in the PMPI interface and to provide more insight into MPI implementation performance. As an example, using the Peruse interface a tool could expose detailed MPI state-change information, such as when a send request enters the queue of posted messages or when a communication request is completed. This interface had great potential for performing in-depth analysis of the state changes experienced by each communication; however, it does not provide information about what is happening within those states.

In this paper we seek to provide low-level MPI performance metrics that help identify the root cause of performance bottlenecks, as well as a simple overview of MPI application properties. Our metrics are modeled after the hardware performance counters that are exposed through PAPI [11]. Unlike PAPI, however, our counters are based entirely in software.

This paper introduces an implementation of Software-Based Performance Counters (SPCs) within the Open MPI [4] implementation. We start our discussion in Section 2, where we introduce related work, and continue in Section 3, where we elaborate on our motivations for implementing SPCs. In Section 4 we explain our design and implementation of SPCs within Open MPI and address the extent of the details we decided to expose through our counters. Section 5 shows the overhead associated with adding our counters to Open MPI, and Section 6 shows examples of our counters being used within both synthetic and real-world applications. The final section concludes our work and outlines prospects for future work.

2 RELATED WORK
There are a variety of performance analysis tools available to parallel application developers that offer different views of application performance. Each of these tools has its own strategy for meeting the challenge of scalable performance analysis. Performance analysis tools in a High Performance Computing (HPC) environment tend to either collect data during runtime or perform post-mortem analysis. Within data collection tools, there are two approaches to data collection: instrumentation-based and interrupt-based. Our SPC approach uses manual code instrumentation to collect performance information during runtime.

Before analyzing the performance of an application, one must first collect detailed profiling data. The Performance Application Programming Interface (PAPI) provides an interface for accessing the hardware performance counters available on many modern microprocessors. PAPI is highly portable, with support for most hardware vendors and operating systems, and provides a wealth of information with its counters, such as the number of cache misses and the total number of instructions issued. The Tuning and Analysis Utilities (TAU) tool is another option for code instrumentation, and has support for both compiler-based and manual instrumentation of applications [12]. TAU keeps track of the information gathered and can generate binary trace files supported by visualization tools like Vampir and Paraver. There are many other HPC profiling tools, and the measurement functionality they provide is often redundant. The Score-P tool was created to provide a common infrastructure for these redundant performance measurement capabilities, and provides support for several other tools, including Scalasca and TAU [10].

Once data has been collected, tools are needed for analyzing and visualizing it. Trace files are a simple way to represent what happened during a parallel execution, often including a series of time stamps for when certain events started and stopped, along with which processing unit they executed on. The Paraver and Vampir tools are designed to parse such trace files and display their data in a meaningful layout showing what was happening on each computational unit.

The MPI tools information interface (MPI_T) is divided into two parts. The first part provides information and editing capability for control variables used to tune the MPI implementation's configuration [3]. The second part provides users access to internal performance variables in the MPI implementation. Our SPCs are similar to these internal performance variables and could be integrated into the MPI_T standard as performance variables.
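As a rough sketch of how such performance variables are reached from a tool's perspective, the standard MPI_T calls below enumerate whatever pvars an implementation chooses to export; counters like the SPCs proposed here would simply appear as additional entries in this list. (This is generic MPI_T usage, not the paper's SPC interface.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, num_pvars, i;

        /* The tools interface has its own initialization, separate from MPI_Init. */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_pvar_get_num(&num_pvars);
        for (i = 0; i < num_pvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, var_class, bind, readonly, continuous, atomic;
            MPI_Datatype datatype;
            MPI_T_enum enumtype;

            /* Query the name, class, and description of each performance variable. */
            MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                                &datatype, &enumtype, desc, &desc_len,
                                &bind, &readonly, &continuous, &atomic);
            printf("pvar %3d: %s -- %s\n", i, name, desc);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }

A variable found this way can then be read at runtime through an MPI_T pvar session (MPI_T_pvar_session_create, MPI_T_pvar_handle_alloc, MPI_T_pvar_read), which is the route an SPC exported as a pvar would take.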
3 MOTIVATION
When profiling MPI applications, it can often be difficult to tell what is causing performance issues, particularly when the problem lies within MPI itself. There are many factors that can affect the performance of an MPI implementation, such as handling unexpected or out-of-sequence messages. An unexpected message is one that arrived before the corresponding receive was posted. It is well known that searching the unexpected message queue can quickly become a bottleneck, particularly when there are a large number of messages in the queue [14]. Out-of-sequence messages are messages that were delivered out of the MPI-imposed order (MPI is expected to deliver the messages in FIFO order between each pair of processes within a communicator), due to multiple network paths between the processes. These messages block the matching queue on the target process, as all matching must be delayed until the FIFO order can be guaranteed. We wanted a tool that would be able to report such internal MPI information to the user or tool, to provide a more precise picture of which conditions could have affected the application performance.

Such a tool has the potential to be generic enough to be of use not only to users, but also to be particularly useful to those who are developing an MPI implementation. Having metrics on the