Tools for Monitoring CPU Usage and Affinity in Multicore Supercomputers

Lei Huang, Kent Milfeld, Si Liu
Texas Advanced Computing Center, The University of Texas at Austin
10100 Burnet Rd., Austin, TX 78758

Abstract: Performance boosts in HPC nodes have come from making SIMD units wider and aggressively packing more and more cores in each processor. With multiple processors and so many cores it has become necessary to understand and manage process and thread affinity and pinning. However, affinity tools have not been designed specifically for HPC users to quickly evaluate process affinity and execution location. To fill in the gap, three HPC user-friendly tools, core_usage, show_affinity, and amask, have been designed to eliminate barriers that frustrate users and impede them from evaluating and analyzing affinity for applications. These tools focus on providing convenient methods, easy-to-understand affinity representations for large process counts, process locality, and run-time core load with socket aggregation. These tools will significantly help HPC users, developers and site administrators easily monitor processor utilization from an affinity perspective.

Index Terms: Supercomputers, User support tool, Multicore system, Affinity, Resource utilization, Core binding, Real-time monitoring, Debugging

I. INTRODUCTION

Up to the millennium, the processor frequency of commodity CPUs increased exponentially year after year. High CPU frequency had been one of the major driving forces to boost CPU performance, other than the introduction of vector processor units. However, it ceased to grow significantly in recent years due to both technical reasons and market forces. To accommodate the high demand of computing power in HPC, significantly more cores are being packed into a single compute node [15].

The needs of HPC and the use of core-rich processors are exemplified in the extraordinary large-scale supercomputers found throughout the world. The Sierra supercomputer [17] at the Lawrence Livermore National Laboratory and the Summit supercomputer [19] at the Oak Ridge National Lab have 44 processing cores per compute node with two IBM Power9 CPUs [13]. The Sunway TaihuLight supercomputer [18] at the National Supercomputing Center in Wuxi deploys Sunway SW26010 manycore processors, containing 256 processing cores and 4 additional auxiliary cores for system management [32] per node. The Stampede2 supercomputer [28] at the Texas Advanced Computing Center (TACC) provides Intel Knights Landing (KNL) nodes with 68 cores per node. The Stampede2 [28] and Frontera [27] supercomputers at TACC provide 48 and 56 processing cores per node with Intel’s Skylake (SKX) and Cascade Lake (CLX) processors [31], respectively. These, and other HPC processors, also support Simultaneous Multi-Threading (SMT) to a level of 2 to 4 per core. Consequently, there could be 2x to 4x more logical processors than physical processors on a node.

When working with nodes of such large core counts, the performance of HPC applications is not only dependent upon the number and speed of the cores, but also upon proper scheduling of processes and threads. HPC application runs with proper affinity settings will take full advantage of resources like local memory, reusable caches, etc., and will obtain a distinct benefit in performance.

II. BACKGROUND

A. Process and Thread Affinity

A modern computer often has more than one socket per node and therefore HPC applications may have non-uniform access to memory. Ideally, an application process should be placed on a processor that is close to the data in memory it accesses, to get the best performance. Process and thread affinity/pinning allows a process or a thread to bind to a single processor or a set of (logical) processors. The processes or threads with specific affinity settings will then only run on the designated processor(s). For Single-Program Multiple-Data (SPMD) applications, managing this affinity can be difficult. Moreover, the present-day workflows on modern supercomputers have moved beyond the SPMD approach and now include hierarchical levels of Multiple-Program Multiple-Data (MPMD), demanding even more attention to affinity.

Intel MPI (IMPI), MVAPICH2 (MV2), Open MPI (OMPI), and IBM Spectrum MPI (SMPI) have a variety of mechanisms for setting affinity. IMPI relies solely on “I_MPI_x” environment variables, as does MV2 (MV2_CPU/HYBRID_BINDING_x, MV2_CPU_MAPPING_x, etc.). SMPI uses both environment variables (MP_TASK/CPU_x) and mpirun command-line options (-map-by, -bind-to, -aff shortcuts, etc.). Similarly, OMPI uses mpirun options (-bind-to-core, --cpus-per-proc, etc.) and also accepts a rankfile with a map (slot-list) for each rank.

When no affinity is specified these MPIs evaluate a node’s hardware configuration (for example with hwloc for MV2 and OMPI) and make appropriate default affinity settings. OpenMP affinity for hybrid runs can be specified by various “vendor” methods. However, since all of these MPIs accept OpenMP’s OMP_PLACES/OMP_PROC_BIND specifications, it is best to use the standard’s mechanism. Hence, for portable hybrid computing a user must deal with many ways of setting each rank’s affinity. (When a master thread encounters a parallel region it inherits the MPI rank’s mask, and OpenMP affinity specifications take over.)

Fig. 1 shows a schematic of the affinity process. A mask is maintained for each process by the kernel that describes which processor(s) the process can run on. The mask consists of a bit for each processor, and the process can execute on any processor where a mask bit is set. There are a myriad of ways to set and alter the affinity mask for processes of a parallel application. For instance, vendors have their own way to set affinities for MPI and OpenMP, usually through environment variables. Only recently has OpenMP 4.5 [20], [21] provided a standard way to set affinity for threads, and MPI has yet to do this for MPI tasks. As shown in Fig. 1, the affinity can be affected not only before an application is launched but also while it is running. There are utilities such as numactl [2] and util-linux taskset [7] to do this. Furthermore, the affinity can even be changed within a program with the sched_setaffinity [6] function.

Fig. 1. The left box indicates mechanisms for setting the affinity mask. The right box illustrates how a BIOS setting has designated the processor ids for the hardware (cores). The center section shows a mask with bits set for execution on cores 1, 3, 5, and 7.

Understanding the “vernaculars” of all these methods can be challenging. Even the default settings are sometimes unknown to users. In addition, users are commonly uncertain of their attempts to set the affinity for processes of their parallel applications.

B. Related Work

There are many ways to view CPU loads and the affinity mask of a process. Moreover, some methods are not well-known or only available through unexpected means.

The Linux command-line tool top [8] and its more recent counterparts (htop or atop) can be used to monitor CPU loads, and manage process and thread affinity in real time. The Linux command ps [3] can report which core a process is running on. However, it does not report the affinity mask explicitly. It only reports the core the process is presently running on. The taskset [7] command-line utility is normally more helpful since it can query and modify the binding affinity of a given thread or process. Linux also provides API functions sched_getaffinity [5] and pthread_setaffinity_np [4] for a process or thread to query and set the affinity (kernel mask) of itself. While these tools are pervasive and do provide the information needed, they are sometimes cumbersome to use, particularly for supercomputer users working with large core counts.

For HPC users, these tools may provide too much administrative information, and it may not be apparent how to get HPC-relevant information for their applications. Users need to remember extra options or give extra instructions to obtain relevant CPU information. For instance, top does not show the loads of individual processors by default: pressing “1” within a top session is required to display the performance data of individual CPUs, and pressing “z” is needed to display running processes in color. The htop utility does show usage information for all logical processors and the load on each individual core is represented by a progress bar in text mode. It works up to about one hundred sixty cores. However, the progress bar is distracting on a computer with many cores. Furthermore, such tools were originally designed for administrators to display multi-user information, not for a single-user screen display of HPC information. Therefore, there is a real need for convenient HPC tools to readily display each CPU utilization and affinity information for multicore compute nodes.

The MPI libraries and OpenMP 5.0 [20] implementations themselves can present affinity mask information on the processes or threads they instantiate. For instance, by setting the I_MPI_DEBUG environment variable to 4 or above, the Intel MPI runtime will print a list of processor ids (mask bits set) for each process (rank) at launch time. Likewise for OpenMP 5.0 implementations, setting OMP_DISPLAY_AFFINITY to TRUE will have the runtime print a line for each thread (number) reporting the processor ids associated with its binding, at the beginning of the first parallel region. However, it is difficult to
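
As an illustration of the kernel affinity mask described in Section II-A, the following minimal Python sketch converts between a list of processor ids and the bit mask in which one bit is set per allowed processor, using the mask from Fig. 1 (cores 1, 3, 5, and 7) as input. The helper names cpus_to_mask, mask_to_cpus, mask_row, and current_affinity are hypothetical and introduced only for this example. The one-character-per-processor row is an assumed rendering chosen for readability, not the output format of amask or show_affinity. The call os.sched_getaffinity is part of the Python standard library on Linux and wraps the sched_getaffinity function mentioned above; on other platforms the sketch simply reports that the mask is unavailable.

```python
import os


def cpus_to_mask(cpus):
    """Return an integer bit mask with bit i set for every processor id i."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def mask_to_cpus(mask):
    """Return the sorted processor ids whose bits are set in the mask."""
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:
            cpus.append(cpu)
        mask >>= 1
        cpu += 1
    return cpus


def mask_row(cpus, ncpus):
    """Render one text row (assumed format, for illustration only):
    '-' where the bit is unset, the last digit of the id where it is set."""
    allowed = set(cpus)
    return "".join(str(i % 10) if i in allowed else "-" for i in range(ncpus))


def current_affinity():
    """Processor ids the calling process may run on, or None if the
    platform does not expose sched_getaffinity (it is Linux-specific)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))  # 0 means the calling process
    return None


if __name__ == "__main__":
    fig1_cpus = [1, 3, 5, 7]  # bits set in the mask shown in Fig. 1
    print(hex(cpus_to_mask(fig1_cpus)))
    print(mask_row(fig1_cpus, 8))
    print(current_affinity())
```

Running the script under a launcher that restricts the mask (for example taskset) changes the last printed line accordingly, which gives a quick check that an affinity setting took effect. The integer mask is the same quantity that taskset accepts and reports in hexadecimal form.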
