Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System

Charlene Yang, Thorsten Kurth
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
{cjyang, tkurth}@lbl.gov

Samuel Williams
Computational Research Division
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
[email protected]

Abstract—The Roofline performance model provides an intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In preparation for the next-generation supercomputer Perlmutter at NERSC, this paper presents a methodology to construct a hierarchical Roofline on NVIDIA GPUs and extends it to support reduced precision and Tensor Cores. The hierarchical Roofline incorporates L1, L2, device memory, and system memory bandwidths into one single figure, and it offers more profound insights into performance analysis than the traditional DRAM-only Roofline. We use our Roofline methodology to analyze three proxy applications: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow. In so doing, we demonstrate the ability of our methodology to readily understand various aspects of performance and performance bottlenecks on NVIDIA GPUs and motivate code optimizations.

Keywords—Cray, NVIDIA GPU, Roofline, Tensor Core, Performance Analysis, Code Optimization

I. INTRODUCTION

NERSC's next supercomputer Perlmutter will be an NVIDIA GPU-accelerated Cray supercomputer with AMD host CPUs and an Ethernet-compatible Slingshot network. Although NERSC users are generally familiar with performance optimization on Intel and AMD CPUs, there are a number of new facets of performance optimization on GPUs, including thread predication, deep memory hierarchies, mixed precision computation, and Tensor Cores, that need to be better understood.

Rather than forcing users to embrace a 'trial-and-error' approach to performance optimization or dig through numerous profiler metrics, the Roofline performance model [1] provides a visually-intuitive way for users to identify performance bottlenecks and motivate code optimization strategies. Roofline is a throughput-oriented performance model centered around the interplay between computational capabilities (e.g. peak GFLOP/s), memory bandwidth (e.g. STREAM GB/s), and data locality (i.e. reuse of data once it is loaded from memory). Data locality is commonly expressed as arithmetic intensity, which is the ratio of floating-point operations performed to data movement (FLOPs:Byte). Performance (GFLOP/s) is bound by:

\text{GFLOP/s} \leq \min(\text{Peak GFLOP/s},\ \text{Peak GB/s} \times \text{Arithmetic Intensity}) \quad (1)

which produces the traditional Roofline formulation when plotted on a log-log plot.

Previously, the Roofline model was expanded to support the full memory hierarchy [2], [3] by adding additional bandwidth "ceilings". Similarly, additional ceilings beneath the Roofline can be added to represent performance bottlenecks arising from lack of vectorization or the failure to exploit fused multiply-add (FMA) instructions.

Orthogonal to the Roofline description of hardware is characterizing applications in terms of Roofline-related coordinates — Performance (GFLOP/s) and Arithmetic Intensity (FLOPs/Byte). One can employ a variety of methods to calculate these terms, ranging from hand counting FLOPs and estimating bytes, to performance counters [4], [5], to software simulators [2] that trade performance for accuracy.

Over the last decade, Roofline analysis has proven a great success, especially with the hierarchical Roofline on Intel CPUs [2], and was a benefit to understanding performance on NERSC's previous KNL-based Cori supercomputer [6], [7]. However, Roofline has yet to be fully developed on NVIDIA GPUs. This paper builds upon the previous work on CPU architectures as well as the HBM-only Roofline methodology we developed for NVIDIA GPUs [8] and expands the model into a hierarchical Roofline methodology that also captures the performance effects associated with reduced precision and Tensor Cores on NVIDIA's latest V100 GPUs.

Our expanded methodology includes:
• Empirical measurement of peak performance (GFLOP/s) and bandwidth (GB/s)
• Accurate measurement of the total number of FLOPs in the code
• Accurate measurement of data movement in the code throughout the memory/cache hierarchy, i.e. Bytes_L1, Bytes_L2, Bytes_HBM, Bytes_SystemMemory
• Calculation of arithmetic intensities on various memory/cache levels, i.e. FLOPs:Byte_L1, FLOPs:Byte_L2, FLOPs:Byte_HBM, FLOPs:Byte_SystemMemory
• Quantifying the performance implications of FMA, FPADD, and FPMUL in the instruction mix
• Quantifying the performance implications of reduced precision (FP16 and FP32) and Tensor Cores, and
• Plotting application performance against architecture peaks

In this paper, we provide a detailed description of our Roofline methodology for NVIDIA GPUs. We then apply this methodology to three proxy applications: GPP from the Material Science code BerkeleyGW [9], HPGMG from the Adaptive Mesh Refinement framework AMReX [10], and conv2d from TensorFlow [11]. For each of these applications, we include multiple variants of the same code in order to highlight the ability of our methodology to capture the different nuances of performance analysis on NVIDIA GPUs. Throughout this process, we provide a detailed analysis of the information our Roofline methodology extracts. Finally, we conclude the paper with some high-level insights and observations, and espouse several directions for future work.

II. ROOFLINE METHODOLOGY ON NVIDIA GPUS

In order to effect Roofline analysis of GPU-accelerated applications, one must perform three steps. First, one must characterize the underlying GPU's computational capabilities in terms of the Roofline model. In effect, this is measuring peak performance (GFLOP/s) and bandwidth (GB/s) as a function of data precision, operation, and memory/cache level. Second, one must characterize the execution of an application and extract the relevant Roofline-related parameters, including data movement at each level of the memory hierarchy, floating-point operations performed (by precision), and kernel run times. Finally, one must synthesize these two data sets together and plot them in a single figure.

A. Architectural Characterization

The Empirical Roofline Toolkit (ERT) [12] was developed to characterize multicore, manycore, and GPU-accelerated systems. It was written in MPI+OpenMP+CUDA in order to replicate the most common programming environments on DOE (Department of Energy) supercomputers. ERT defines a kernel of varying L1 arithmetic intensity on a parameterized vector.
By sweeping a range of arithmetic intensities and vector sizes, it can extract the peak performance of a target platform as well as the bandwidth at each level of the memory hierarchy. In this paper, we used the MPI+CUDA implementation of ERT to characterize a single Volta V100 GPU. Unfortunately, ERT, as written, consistently fails to identify the L1 cache on NVIDIA GPUs. To that end, throughout this paper, we use a theoretical L1 bandwidth coupled with empirical (ERT) bandwidths for L2 and HBM for the Roofline ceilings.

ERT, as written, is solely a double-precision benchmark. As such, in this paper, we use a simple linear extrapolation for single- and half-precision performance. Moreover, whereas ERT's kernels are optimized for fused multiply-add (FMA) instruction-set architectures, we estimate the penalty of not exploiting FMA by defining a "no FMA" ceiling that is half the FMA performance. NVIDIA GPUs implement 16-bit (FP16) Tensor Core matrix-matrix multiplications. Throughout this paper, we use the theoretical peak Tensor Core performance. This may seem optimistic, but we will show it does not skew our analysis.

Ultimately, we collect 10 performance numbers for our target GPU: L1, L2, and HBM bandwidth, FP16/FP32/FP64 FMA and FP16/FP32/FP64 "no FMA" performance, and Tensor Core peak performance. For brevity, Figure 1 plots measured ERT and theoretical performance for FP64 FMA, no-FMA, L2, and HBM on an NVIDIA Volta V100 GPU. Clearly, theoretical performance generally overestimates attainable performance by about 10%.

Figure 1. NVIDIA V100 Hierarchical Roofline Ceilings. Observe V100 advertised performance is very close to empirical performance. (Theoretical vs. empirical: FMA 7833.6 vs. 7068.9 GFLOP/s; no-FMA 3916.8 vs. 3535.8 GFLOP/s; L2 4198.4 vs. 2996.8 GB/s; HBM 900.0 vs. 828.8 GB/s.)

B. Application Characterization

In this paper, we leverage the proof-of-concept methodology developed by Yang et al. [8] and extend it to support both hierarchical (L1, L2, HBM, System Memory) Roofline analysis as well as FP32 and FP16 precision (including Tensor Cores). To that end, we use nvprof to collect a set of metrics for each kernel in an application. We then synthesize those metrics together in order to plot each kernel on a Roofline using its Arithmetic Intensity (x) and GFLOP/s (y) coordinates.

In order to calculate a kernel's arithmetic intensity (AI) and GFLOP/s performance, we must collect three raw quantities — kernel run time, FLOPs executed (for FP64, FP32, and FP16), and bytes read and written at each level of the memory hierarchy (L1, L2, HBM, and System Memory):

\text{AI}_{\langle\text{level}\rangle,\langle\text{precision}\rangle} = \frac{\text{FLOPs}_{\langle\text{precision}\rangle}}{\text{Bytes}_{\langle\text{level}\rangle}} \quad (2)

\text{FLOP/s}_{\langle\text{precision}\rangle} = \frac{\text{FLOPs}_{\langle\text{precision}\rangle}}{\text{Run Time}} \quad (3)

where <level> can be L1, L2, HBM (Device Memory), or System Memory, and <precision> can be FP64, FP32, FP16, or Tensor Core.

Kernel Run Time: To collect application run time, we use the following commands to obtain either the timing of a particular invocation of a kernel or the average timing of a kernel over multiple invocations.

    nvprof --print-gpu-trace ./application
    nvprof --print-gpu-summary ./application

Kernel FLOPs: nvprof provides a rich set of metrics to measure the total number of FLOPs executed in a kernel. These metrics only account for non-predicated threads, so operations that are masked out are not included. For complex operations such as divides, logarithms, and exponentials, each operation is implemented with multiple instructions and hence is counted as multiple FLOPs. To collect the FLOP counts, we use:

    nvprof --kernels <kernel name> --metrics <metric name> ./application

where <metric name> can be flop_count_dp, flop_count_sp, or flop_count_hp for FP64, FP32, and FP16 respectively.

The aforementioned floating-point metrics can account for the majority of FLOPs in a large range of applications. However, FLOPs executed inside the NVIDIA V100 Tensor Cores are not captured by these counters. The Tensor Cores are designed to accelerate matrix-FMA operations, that is, operations of the form D = A·B + C, where A and B are real-valued 4 × 4 matrices in half precision (FP16) and C and D are real-valued 4 × 4 matrices in 16-bit (FP16) or 32-bit (FP32) precision. In the latter case, the accumulation in the Tensor Core operation is performed using FP32 arithmetic.

As of early 2019, nvprof does not offer an accurate flop_count_ metric for Tensor Cores, as it does for the normal SM cores, but rather a "utilization" metric, tensor_precision_fu_utilization. This metric spans an integer range from 0 (not used) to 10 (fully utilized). In order to estimate the Tensor Core FLOP count, we assume that a utilization value of 10 corresponds to 125 TFLOP/s and then multiply this number by the run time of the kernel to estimate the total number of FLOPs. It is expected that NVIDIA's next-generation Nsight profiling tool will have enhanced capabilities to measure Tensor Core FLOPs, and we will investigate that in our future work.
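As a concrete illustration of this estimate, the short Python sketch below converts the utilization metric and a kernel's run time into an approximate Tensor Core FLOP count. It is only a sketch of the calculation described above; the function and variable names and the example values are ours, and the 125 TFLOP/s scaling assumes, as we do throughout this paper, that a utilization of 10 corresponds to the V100's theoretical Tensor Core peak.

    # Estimate Tensor Core FLOPs from nvprof's utilization metric (0..10).
    # Assumption: utilization 10 corresponds to ~125 TFLOP/s on a V100.
    TENSOR_CORE_PEAK_FLOPS = 125.0e12

    def tensor_core_flops(utilization, run_time_s):
        # utilization : integer 0..10 from tensor_precision_fu_utilization
        # run_time_s  : kernel run time in seconds
        return (utilization / 10.0) * TENSOR_CORE_PEAK_FLOPS * run_time_s

    # Example with hypothetical numbers: a 2.5 ms kernel at utilization 8
    flops = tensor_core_flops(8, 2.5e-3)   # ~2.5e11 FLOPs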
Bytes: The data moved between each pair of adjacent levels in the memory/cache hierarchy must be collected in order to construct the hierarchical Roofline. We use nvprof to collect the total number of read and write transactions and multiply the total by the size of each transaction in bytes:

\text{Bytes} = (\text{read transactions} + \text{write transactions}) \times \text{transaction size} \quad (4)

The invocation of nvprof on the command line is the same as when collecting FLOPs, but the metrics are more involved (see Table I). Note, in this paper, all applications fit in the GPU's HBM memory. As such, system transactions are virtually zero, as there is no data movement over PCIe/NVLINK. Thus, they will not be presented in Section IV.

Table I. NVPROF METRICS FOR MEASURING DATA TRAFFIC IN THE MEMORY/CACHE HIERARCHY

Level         Metrics                                                Transaction Size
L1 Cache      gld_transactions, gst_transactions,                    32B
              atomic_transactions, local_load_transactions,
              local_store_transactions, shared_load_transactions,
              shared_store_transactions
L2 Cache      l2_read_transactions, l2_write_transactions            32B
HBM Memory    dram_read_transactions, dram_write_transactions        32B
PCIe/NVLINK   system_read_transactions, system_write_transactions    32B

(Note: add surface- and texture-related metrics if surface and texture memory are in use, for example, in graphics applications.)

C. Roofline Visualization

With both the empirical ceilings collected via ERT and the application kernels' AI (2) and GFLOP/s (3) coordinates determined through the collection of nvprof metrics, we plot the resultant Roofline model using Python Matplotlib [13]. Some of our Matplotlib scripts are available on GitHub [14], and users are free to tweak them based on their specific needs. The most basic example, plot_roofline.py, takes an input file that specifies the memory ceilings, the compute ceilings, and the AI and GFLOP/s performance for each kernel.
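The sketch below shows, in a handful of lines, how the pieces above fit together: it applies Equations (2)-(4) to per-kernel counts of the kind nvprof reports and then draws a minimal hierarchical Roofline with Matplotlib. It is a simplified stand-in for plot_roofline.py [14], not the script itself; the data layout, variable names, and the example transaction counts are ours, while the ceiling values are the ones used in Section IV (theoretical L1, empirical L2/HBM, empirical FP64 FMA).

    import matplotlib.pyplot as plt
    import numpy as np

    PEAK_GFLOPS = 7068.9                                 # FP64 FMA, GFLOP/s
    BW = {"L1": 14336.0, "L2": 2996.8, "HBM": 828.8}     # GB/s

    def transactions_to_bytes(reads, writes, transaction_size=32):
        # Equation (4): transactions -> bytes.
        return (reads + writes) * transaction_size

    def kernel_coordinates(flops, bytes_per_level, run_time_s):
        # Equations (2)-(3): per-level AI and GFLOP/s for one kernel.
        ai = {lvl: flops / b for lvl, b in bytes_per_level.items()}
        gflops = flops / run_time_s / 1.0e9
        return ai, gflops

    # Hypothetical kernel: 1e11 FLOPs in 25 ms with these transaction counts.
    bytes_per_level = {
        "L1":  transactions_to_bytes(9.0e8, 3.5e8),
        "L2":  transactions_to_bytes(4.5e8, 1.8e8),
        "HBM": transactions_to_bytes(2.2e8, 0.9e8),
    }
    ai, gflops = kernel_coordinates(1.0e11, bytes_per_level, 25.0e-3)

    x = np.logspace(-1, 3, 200)
    for lvl, bw in BW.items():
        plt.loglog(x, np.minimum(bw * x, PEAK_GFLOPS), label=f"{lvl}: {bw} GB/s")
    plt.axhline(PEAK_GFLOPS, ls="--", label=f"FMA: {PEAK_GFLOPS} GFLOP/s")
    for lvl in BW:
        plt.plot(ai[lvl], gflops, "o", label=f"kernel ({lvl})")
    plt.xlabel("Arithmetic Intensity [FLOPs/Byte]")
    plt.ylabel("Performance [GFLOP/sec]")
    plt.legend(fontsize=7)
    plt.savefig("roofline.png", dpi=150)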
III. EXPERIMENTAL SETUP

In this section, we describe our test machine, software configuration, and the applications we use to evaluate our hierarchical GPU Roofline methodology.

A. Hardware and Software Configuration

Results presented in this paper were all obtained on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory (LBNL). To prepare for the arrival of the next-generation supercomputer Perlmutter, NERSC has installed a GPU-accelerated partition on Cori, comprised of nodes with Intel Skylake CPUs and NVIDIA V100 GPUs (4:1 GPU:CPU ratio). This partition will enable the NESAP (NERSC Exascale Science Applications Program) teams in their prototyping, debugging, porting, and development activities in preparation for migrating to Perlmutter.

CUDA 10 is installed on the Cori GPU partition, and nvprof, frequently used in this paper, is part of the CUDA Toolkit. Additionally, ERT is from the BitBucket repository [12], with example configuration scripts deployed for the Cori GPU partition. For the conv2d benchmark from TensorFlow, we used TensorFlow v1.12.0 linked against CUDA 9.0 and cuDNN 7.3.1.

B. Benchmarks

In this paper, we evaluate our Roofline methodology for NVIDIA GPUs using three benchmarks: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow. These benchmarks were selected because they exhibit a range of computational characteristics, including a range of memory access patterns, data types, data locality, and thread divergence properties.

GPP: The General Plasmon Pole (GPP) kernel [9] is a proxy application based on the BerkeleyGW Material Science code [15]. It calculates the electron self-energy using the common General Plasmon Pole approximation [16]. GPP is written in C++ and accelerated with CUDA. The computation in the kernel is tensor-contraction like, wherein a few pre-calculated complex double-precision arrays are multiplied and summed over one dimension and collapsed into a small matrix. The problem we chose in this paper is comprised of 512 electrons and 32768 plane-wave basis elements, and is a medium-sized problem for real-world materials science. It requires around 1.5GB of memory and fits well into the HBM memory on a V100 GPU.

The pseudo code of the GPP kernel can be described as

    do band = 1, nbands          #threadblocks Idx.x
      do igp = 1, ngpown         #threadblocks Idx.y
        do ig = 1, ncouls        #threads Idx.x
          do iw = 1, nw          #unrolled
            load wtilde_array(ig,igp)
            load aqsntemp(ig,band)
            load eps(ig,igp)
            compute wdiff, delw, sch_array
            update achtemp(iw)

Not only does GPP offer abundant parallelism that can be mapped to threads, warps, and thread blocks, but it is also heavily parameterized in order to capture the full spectrum of realistic problem configurations. One of these parameters, nw, enables arbitrary increases in arithmetic intensity by increasing reuse of the arrays accessed within the iw loop. Similarly, we can modify the ig loop to effect strided memory accesses. Ultimately, not only does this single, well-understood kernel act as a stand-in for a range of potential application kernels, but it also enables us to test the limits of our Roofline methodology for NVIDIA GPUs.

HPGMG GSRB Smoother: HPGMG [10], [17], [18] is a geometric multigrid benchmark designed to proxy the multigrid solves found in block-structured AMR (Adaptive Mesh Refinement) applications that use the AMReX framework [19]. HPGMG solves the 4th-order, variable-coefficient Laplacian on a unit cube with Dirichlet boundary conditions using a multigrid F-cycle. As such, it is only moderately compute intensive (FP64 FLOPs:Byte > 1), but it is highly demanding of SIMT and cache locality.

HPGMG was originally implemented in C with MPI and OpenMP parallelization and has shown scalability to 8.5 million cores. It was subsequently extended to support GPUs on the finer (larger) mesh levels [10].

The Gauss-Seidel, Red-Black (GSRB) smoother generally dominates HPGMG's run time. However, the performance of the smoother varies substantially among the various multigrid levels (mesh sizes). GSRB smoothers perform two stencil kernel invocations per smooth (red and black). Cells are marked as either red or black in a 3D checkerboard pattern. Cells matching the sweep color are updated, while the others are simply copied to the result array. Thus, for a 128³ box, one must perform 128³/2 = 1M stencils per kernel invocation.

In order to balance the quality of a compiler against hardware's ability to efficiently execute strided memory access patterns or predication, HPGMG includes three different implementations of its GSRB smoother (GSRB_FP, GSRB_BRANCH, and GSRB_STRIDE2). All three implementations perform the same computation and touch the same data over the course of a threadblock's execution. As such, they represent an ideal testbed for using Roofline and performance tools to understand the subtle interactions between compiler and hardware.

The GSRB_FP implementation realizes red-black updates via multiplication by a precomputed auxiliary array of 1.0's and 0.0's. Such an implementation is trivially vectorized, but requires twice the computation and an additional load from cache.

The GSRB_BRANCH version uses a branch to perform stencils only on cells whose color matches the sweep, while copying data for the others. Such an implementation can be realized through either predication (masking) or loop fissioning into two stride-2 updates. Regardless, the number of floating-point operations is minimized, but execution on SIMD or SIMT hardware may preclude this.

Finally, in the GSRB_STRIDE2 implementation, each CUDA thread is responsible for two adjacent cells within a plane. The thread updates the cell whose color matches the sweep color and copies the other. This implementation minimizes computation and avoids predication of computation, but requires the compiler/hardware to execute efficient stride-2 L1 cache accesses (HBM accesses will always be coalesced into unit-stride transactions).
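To make the three strategies concrete, the NumPy sketch below contrasts them on a simplified 2D, 5-point stencil. This is our own illustration rather than HPGMG's CUDA code: the array names, the stencil, and the single-sweep structure are all hypothetical, but the trade-offs mirror the descriptions above (redundant FLOPs plus an extra mask load for GSRB_FP, minimal FLOPs behind a color test for GSRB_BRANCH, and stride-2 stores for GSRB_STRIDE2).

    import numpy as np

    def stencil(u, rhs):
        # Simplified 5-point update shared by all three variants.
        return 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                       np.roll(u, 1, 1) + np.roll(u, -1, 1) - rhs)

    def gsrb_fp(u, rhs, mask):
        # GSRB_FP-style: compute everywhere, then blend with a precomputed
        # array of 1.0s and 0.0s (redundant FLOPs, extra load).
        new = stencil(u, rhs)
        return u * (1.0 - mask) + new * mask

    def gsrb_branch(u, rhs, color):
        # GSRB_BRANCH-style: update only cells whose color matches the sweep
        # (minimal FLOPs, but divergence/predication on SIMT hardware).
        i, j = np.indices(u.shape)
        sel = ((i + j) % 2) == color
        out = u.copy()
        out[sel] = stencil(u, rhs)[sel]
        return out

    def gsrb_stride2(u, rhs, color):
        # GSRB_STRIDE2-style: each "thread" owns two adjacent cells per row
        # and writes only the matching color, i.e. stride-2 accesses.
        new = stencil(u, rhs)
        out = u.copy()
        for i in range(u.shape[0]):
            start = (i + color) % 2
            out[i, start::2] = new[i, start::2]
        return out

    # Tiny usage example: all three variants produce the same red sweep.
    u, rhs = np.random.rand(8, 8), np.zeros((8, 8))
    i, j = np.indices(u.shape)
    red_mask = ((i + j) % 2 == 0).astype(float)
    assert np.allclose(gsrb_fp(u, rhs, red_mask), gsrb_branch(u, rhs, 0))
    assert np.allclose(gsrb_branch(u, rhs, 0), gsrb_stride2(u, rhs, 0))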
For this paper, we run HPGMG with eight 128³ boxes (./hpgmg-fv 7 8). This results in levels 5-8 running on the GPU and levels 1-4 on the CPU. As this paper is focused on Roofline on GPUs, we only examine levels 5-8.

TensorFlow conv2d: TensorFlow [11], [20] is a deep learning framework that allows users to express complicated neural network graphs in a reasonable number of lines of code. Besides flexibility, it also provides performance portability across a variety of computing architectures by using optimized libraries such as cuDNN. 2D convolution layers are the most compute intensive kernels in most modern deep neural networks, so in this paper we chose to analyze tf.nn.conv2d [21] for our Roofline validation.

In a forward pass, tf.nn.conv2d performs a 2D convolution of an input tensor and a convolution kernel, and produces another tensor as the output. Assume an input tensor A of shape N × H × W × C, where N is the number of samples in the batch, H and W are the height and width, and C is the number of channels. With a convolution kernel K of shape K_H × K_W × C × C′, the resultant tensor B is of shape N × H′ × W′ × C′, where H′ = H − K_H + 1 and W′ = W − K_W + 1, and the individual elements are given by

B_{nhwc} = \sum_{m=0}^{C-1} \sum_{k_h=0}^{K_H-1} \sum_{k_w=0}^{K_W-1} A_{n,\,h+k_h,\,w+k_w,\,m}\; K_{k_h,\,k_w,\,m,\,c} \quad (5)

In the more computationally expensive backward pass, the derivative of Equation 5 is computed with respect to K. The TensorFlow routine further allows for generalizations such as non-unit stride or dilation. Values at the image boundary are treated separately, e.g. by replication or zero-padding [22]. In this study, we will focus on the performance characteristics of a typical convolution kernel found in one of the most commonly used networks for image analysis, ResNet50 [23]. We analyze this kernel for forward and backward passes as well as the effects of parameters such as batch, input, kernel, and stride sizes on their performance.

Most TensorFlow routines support specifying different precisions via a dtype argument. For deep learning applications, the most relevant are FP16 and FP32, and we will focus on these two precisions in this paper. Convolution operations such as those in Equation 5 are essentially matrix-matrix multiplications, and on NVIDIA's Volta GPUs, TensorFlow can leverage specialized instructions such as HMMAs on the Tensor Cores for more performance. Due to this GEMM-like operation, the conv2d kernel will have a much higher arithmetic intensity than our previous two examples, GPP and HPGMG.

The pseudo code of the conv2d kernel is as follows.

    #generate random input tensor
    input_image = tf.random_uniform(
        shape=input_size, minval=0.,
        maxval=1., dtype=dtype)

    #create network
    output_result = conv2d(input_image,
        'NHWC', kernel_size,
        stride_size, dtype)

    #choose operation depending on pass
    if pass=="forward":
        with tf.device(gpu_dev):
            exec_op = output_result
    elif pass=="backward":
        with tf.device(gpu_dev):
            opt = tf.train.GradientDescentOptimizer(0.5)
            exec_op = opt.compute_gradients(output_result)
    elif pass=="calibrate":
        with tf.device(gpu_dev):
            exec_op = input_image
    ...

    with tf.Session(config=...) as sess:
        ...
        #warm-up
        for i in range(n_warm):
            result = sess.run(exec_op)

        #measurement
        pyc.driver.start_profiler()
        for i in range(n_iter):
            result = sess.run(exec_op)
        pyc.driver.stop_profiler()

The main measurement part is towards the end of this code block (the last four lines). The pyc.driver commands are from PyCUDA [24], and they allow for precise instrumentation of this region. To facilitate this, nvprof also needs to be launched with --profile-from-start off, to disable profiling from the start of the application.

Measuring the performance of a TensorFlow kernel is not straightforward. First, every kernel can translate into a series of subkernel calls. These subkernels include the kernel that does the essential computation, but also kernels that perform housekeeping work before and after the main kernel, such as data layout transformation, data type transformation, and index calculation. Second, the autotuning mechanism in TensorFlow and the cuDNN library adds to the complexity of measuring a kernel's performance. TensorFlow selects the best-performing sequence of subkernels based on input parameters and some heuristic data, and the cuDNN library, frequently called by TensorFlow, also performs certain algorithm selection.
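Because one conv2d call fans out into several subkernels, we reduce the per-subkernel profile to a single Roofline data point by summing FLOPs and bytes over the subkernels of interest and dividing by their combined run time. The Python sketch below illustrates that bookkeeping; the record layout and the example entries are hypothetical, not nvprof's output format.

    # Aggregate per-subkernel measurements into one conv2d data point.
    # Each record: (name, flops, bytes_hbm, run_time_s) -- hypothetical values.
    subkernels = [
        ("core_convolution_subkernel", 9.0e10, 6.0e8, 1.6e-3),
        ("layout_transform_subkernel", 0.0,    2.4e8, 0.2e-3),  # housekeeping
        ("padding_subkernel",          0.0,    1.1e8, 0.1e-3),  # housekeeping
    ]

    total_flops = sum(k[1] for k in subkernels)
    total_bytes = sum(k[2] for k in subkernels)
    total_time  = sum(k[3] for k in subkernels)

    ai_hbm = total_flops / total_bytes          # Equation (2), HBM level
    gflops = total_flops / total_time / 1.0e9   # Equation (3)
    print(f"AI(HBM) = {ai_hbm:.1f} FLOPs/Byte, {gflops:.1f} GFLOP/s")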

In this paper, we include the housekeeping subkernels in our performance measurement of conv2d, as they are necessary to the core computation subkernel, but we exclude the autotuning subkernels, because we want to focus on the best implementation of the conv2d kernel for the given parameters. To achieve this, we split the loop in tf.Session into two parts: a warm-up part with trip count n_warm=5, and a measurement part with trip count n_iter=20. Depending on the value of pass, exec_op will either compute the convolution (forward pass) or the convolution along with its derivatives (backward pass). The extra calibrate option allows for the exclusion of subkernels that are associated with random input tensor generation, which we would like to exclude from our measurement as well.

In Section IV, when calculating arithmetic intensity and GFLOP/s performance, we use the sum of the following three measurements as the FLOP count of conv2d: flop_count_sp, flop_count_hp, and the FLOPs derived from tensor_precision_fu_utilization (see Section II-B). Here we include the flop_count_sp metric (the metric for FP32 operations) even for the FP16 kernels, because even if the input and output tensors are set to be in FP16 precision, TensorFlow could still decide to deploy subkernels that are in FP32 precision, based on its autotuning mechanism. An example of this is the backward pass in Figure 10. However, all the other kernels in Figures 8, 9, and 10 execute as expected, i.e. FP16 kernels are run on the Tensor Cores and FP32 kernels are run on the regular cores.

In this paper, if not otherwise stated, we use input tensor size 112×112×64 and kernel size 3×3×64×C′, where C′ is the number of output filters as defined in Equation 5, C′=64, and the stride size is 2.

IV. RESULTS

In this section we discuss our observations and insights from applying our Roofline methodology for NVIDIA GPUs to our three benchmarks.

A. GPP Performance Analysis

Figure 2 presents a hierarchical Roofline model for GPP running on a V100 GPU as a function of the parameter nw. Recall, nw increases arithmetic intensity at all levels of the memory hierarchy as it creates a tighter inner loop that reuses data loaded from the cache. We observe several effects.

Figure 2. GPP hierarchical Roofline on the NVIDIA V100 GPU as a function of nw (stride=1).

First, both L1 and HBM arithmetic intensity (the x coordinate of each dot) increase linearly with nw. Interestingly, L2 intensity shows a much more complex pattern. A linear increase in arithmetic intensity for kernels that are linearly increasing the number of FLOPs (the numerator of arithmetic intensity) implies roughly constant data movement (the denominator of arithmetic intensity). As such, we can infer very good locality in the register file (constant L1 intensity) as well as good locality in the L2 (constant HBM intensity). However, the only slight improvement in L2 intensity implies substantial increases in L2 data movement, or substantial losses in L1 locality.

Second, HBM intensity is consistently much larger than L1 intensity, implying there is substantially higher locality in the caches than in the register file. Moreover, L2 intensity is initially (nw = 1) very close to HBM intensity, but it approaches L1 intensity as nw approaches 6. This is indicative of a transition from a regime where there is high L1 locality and virtually no L2 locality (nw = 1) to a regime where there is virtually no L1 locality but high L2 locality (nw = 6).

Third, at low nw, performance is clearly bound by HBM bandwidth (the green curve tracks the HBM ceiling). However, as nw increases, performance quickly saturates. Our hierarchical Roofline analysis demonstrates GPP is clearly not bound by either L1 or L2 bandwidth (the red and blue curves are far from their respective ceilings). This indicates other effects have manifested that limit performance.

Unlike linear algebra routines (e.g. matrix-matrix multiplications), GPP includes a mix of floating-point adds, multiplies, and fused multiply-adds (FMA). As 100% of the dynamic instructions are not FMA, peak performance will never be attainable.

In fact, we can create an effective ceiling by using nvprof to collect the number of FMA and non-FMA floating-point instructions. Using Equation 6, it is observed that at nw = 6, only 60% of the floating-point instructions GPP executes are FMA. Moreover, Equation 7 bounds GPP performance at 80% of the V100's (double-precision) FMA peak performance. However, Figure 3 shows that the observed performance from GPP is roughly 66% of the full FMA peak. This clearly indicates that other aspects of GPU execution are ultimately limiting performance.

\alpha = \frac{\text{FMA FP64 instr.}}{\text{FMA FP64 instr.} + \text{non-FMA FP64 instr.}} = 60\% \quad (6)

\beta = \frac{\alpha \times 2 + (1 - \alpha)}{2} = 80\% \quad (7)
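The partial-FMA ceiling of Equations (6) and (7) is a one-line calculation once the instruction counts are in hand; the Python sketch below spells it out. The instruction counts are hypothetical placeholders for the values nvprof reports, and the FP64 FMA peak is the empirical value from Figure 1.

    # Partial-FMA ceiling from the FP64 instruction mix (Equations 6 and 7).
    FP64_FMA_PEAK_GFLOPS = 7068.9        # empirical V100 FMA peak (Figure 1)

    fma_instr     = 6.0e9                # hypothetical nvprof counts
    non_fma_instr = 4.0e9

    alpha = fma_instr / (fma_instr + non_fma_instr)      # Eq. (6): 0.60
    beta  = (2.0 * alpha + (1.0 - alpha)) / 2.0          # Eq. (7): 0.80
    partial_fma_ceiling = beta * FP64_FMA_PEAK_GFLOPS    # ~5655 GFLOP/s

    print(f"alpha = {alpha:.0%}, beta = {beta:.0%}, "
          f"ceiling = {partial_fma_ceiling:.1f} GFLOP/s")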

Figure 3. GPP Roofline on the NVIDIA V100 GPU as a function of the FMA instruction mix. Note, 60% of GPP's floating-point instructions are FMA.

Figure 4. GPP hierarchical Roofline on the NVIDIA V100 GPU as a function of stride and nw = 6.

As mentioned, GPP is highly parameterizable. To that end, Figure 4 shows a Roofline model for the strided implementation of GPP. Here threads within a warp access every nth element (threads stride by 32 ∗ n words instead of the nominal Stride-32). Unlike our previous work [8], which focused solely on HBM Rooflines for GPUs, the GPP hierarchical Roofline shows that the L1 and L2 caches behave quite differently from HBM. Whereas HBM intensity decreases linearly with increasing stride up to Stride-4 (4 double-complex words = 64 Bytes), L1 and L2 intensity stops decreasing beyond Stride-2 (32B). One might conclude the cache line size in the L2 (or at least its behavior) is larger than the L1 line size or the L1 transaction size.

B. HPGMG Performance Analysis

We evaluate HPGMG's three implementations of its GSRB smoother using the hierarchical Roofline model. As all variants are memory-intensive (arithmetic intensity is always less than machine balance), unlike GPP, the FMA fraction of the instruction mix will play no role in our analysis. Moreover, in lieu of an algorithmic parameter to affect changes in arithmetic intensity, in all of our analysis we examine performance on each level of the HPGMG multigrid hierarchy that runs on the GPU. Naively, one might think the same stencil should have the same intensity. However, the deep ghost zone can reduce arithmetic intensity for the smaller boxes (lower levels), while cache effects can manifest and reduce intensity for the larger boxes (upper levels).

Figure 5. HPGMG GSRB hierarchical Roofline on the NVIDIA V100 GPU for the GSRB_FP implementation.

Figure 5 presents the hierarchical Roofline for the GSRB_FP variant of HPGMG's smoother as a function of level (Level 5 with eight 16³ boxes to Level 8 with eight 128³ boxes). Roofline clearly provides several immediate observations. First, performance is highly correlated with HBM bandwidth (the green line tracks the HBM ceiling). Second, HBM intensity increases with level. This should come as no surprise, as intensity should scale as O(dim³/(dim + 4)³) because there is a ghost zone (2 elements deep) on the high and low faces of each box. This substantially reduces intensity for small boxes (dim = 16). Third, L1 intensity is roughly constant with box size. Once again, this should come as no surprise as, internally, the CUDA implementation uses a fixed thread block dimension to tile each box. The constant thread block dimension exerts a constant pressure on the L1 cache. Fourth, there is substantial reuse in the L1 cache (the L1 (blue) and L2 (red) intensities are widely separated), while there is virtually no reuse in the L2 cache (the L2 (red) and HBM (green) intensities are very close). This implies that virtually all reuse in HPGMG is captured by the L1 cache or in register reuse, and there is very little inter-thread-block bandwidth filtering (something expected for tiled stencil computations). Finally, the very astute will notice that the empirical HBM arithmetic intensity is twice the theoretical HPGMG intensity. This is an artifact of the GSRB_FP implementation redundantly performing the stencil on every point and quashing the results by multiplying by the array of 1's and 0's — something nvprof dutifully observes.
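The ghost-zone effect on HBM intensity is easy to quantify: with a ghost region two cells deep on each face, a box of dimension dim performs O(dim³) stencils but streams O((dim+4)³) cells. The snippet below evaluates that ratio for the box sizes used here; it is merely the arithmetic behind the scaling argument above, not a measurement.

    # Ghost-zone scaling of HBM arithmetic intensity: O(dim^3 / (dim + 4)^3).
    for dim in (16, 32, 64, 128):
        ratio = dim**3 / (dim + 4) ** 3
        print(f"dim = {dim:4d}: useful fraction ~ {ratio:.2f}")
    # dim = 16 keeps only ~0.51 of the streamed data useful, dim = 128 ~0.91,
    # which is why HBM intensity grows with multigrid level.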

Figure 6. HPGMG GSRB hierarchical Roofline on the NVIDIA V100 GPU for the GSRB_BRANCH implementation.

Figure 6 presents the hierarchical Roofline for the GSRB_BRANCH implementation. Recall, this implementation differs from the GSRB_FP implementation in that it uses optimized modulo-2 arithmetic and a branch to avoid redundant computation and an extra (L1) load. As such, it performs half as many (non-predicated) floating-point operations, and thus has half the performance and half the arithmetic intensity on each level of multigrid and each level of the memory hierarchy as the GSRB_FP variant shown in Figure 5. All analysis and insights derived from GSRB_FP apply to GSRB_BRANCH.

On paper, HPGMG's GSRB_STRIDE2 implementation seems like the ideal implementation. It performs no redundant work, all computation remains converged/non-predicated, and the stride-2 memory access pattern presented to the L1 should be filtered into a unit-stride pattern presented to the L2/HBM. As such, it can be quite puzzling as to why the GSRB_STRIDE2 implementation underperforms the GSRB_BRANCH and GSRB_FP variants. Our nvprof-based hierarchical Roofline model helps elucidate the causes.

Figure 7. HPGMG GSRB hierarchical Roofline on the NVIDIA V100 GPU for the GSRB_STRIDE2 implementation. Observe the unexpected loss in L2 and HBM arithmetic intensity for the larger levels.

Figure 7 shows the hierarchical Roofline for the GSRB_STRIDE2 variant. It should be immediately obvious that it looks quite different from the GSRB_BRANCH version shown in Figure 6, with performance on the largest boxes (those that dominate HPGMG's solve time) substantially lower. First, the trend in L1 intensity (x-coordinate) is very similar to GSRB_BRANCH. This indicates that, as expected, each thread block accesses memory in a similar manner to the GSRB_BRANCH variant. However, when looking at L2 and HBM intensity, we observe very different behaviors. As one proceeds from level 5 (16³) to level 6 (32³), one observes increases in performance (ultimately bound by HBM bandwidth), but only slight increases in arithmetic intensity. Conversely, from level 6 to level 7 (64³) and level 8 (128³), we see substantial reductions in L2 and HBM intensity. The former implies that, unlike the GSRB_BRANCH variant, the GSRB_STRIDE2 variant is failing to capture locality in the L1 cache and is flooding the L2 with additional data movement. The fact that HBM intensity is correlated with L2 intensity implies that the L2 is also failing to capture any locality, and transactions received by the L2 are passing through and becoming increased HBM data movement. Increasing data movement when HBM-bound results in performance sliding down along the HBM Roofline.
C. TensorFlow conv2d Performance Analysis

In this section we investigate the effects of different input parameters on the arithmetic intensity and performance of the TensorFlow conv2d kernel. More precisely, we start with the baseline parameters described in Section III-B and then vary one parameter at a time. The parameters we examine are the batch size, the number of output filters, and the kernel (or filter) size. The two data types, FP32 and FP16, are the precisions of the input and output data, but not necessarily those of the FLOPs or bytes measured, i.e. TensorFlow may decide to execute in FP32 on FP16 inputs.

Batch Size: Figure 8 depicts the impact of the batch size on performance (the batch size is the parameter N in Equation 5). For FP32 kernels, i.e. when both input and output tensors are of FP32 type, the arithmetic intensity and GFLOP/s performance of conv2d do not change much, simply because the underlying algorithm is the same. There is an exception, though, with the backward pass, where TensorFlow has decided to call a different wgrad subkernel from cuDNN for batch size 64, which has raised the performance a little compared to the other two batch sizes.

Figure 8. Effects of batch size on performance for the forward (left panel) and backward pass (right panel) on the second convolution layer from the ResNet50 [23] network. Open symbols represent kernels with FP32 input and output, and filled symbols represent FP16.

The FP16 kernels should follow the same trend, i.e. the same intensity and performance for all batch sizes. However, in the raw data, we observe a significant increase in data movement for larger batch sizes, and because we include housekeeping subkernels such as padding and shuffling in our measurement, the performance of this fairly small kernel is severely affected by these essentially bandwidth-bound subkernels. In the next experiment (Number of Output Filters), where the kernels are larger, this effect may be better amortized. However, for TensorFlow applications in general, this could be one of the reasons why the peak performance of 125 TFLOP/s is very hard to reach.

The cuDNN library is called very frequently in TensorFlow, and it is observed in our raw data that cuDNN makes heavy use of the shared memory on Volta. On the hierarchical Roofline charts, this is reflected by the gaps between the L1 symbols and their respective L2 symbols, i.e. these conv2d kernels have good cache locality in the level-one cache (including shared memory).

Number of Output Filters: Figure 9 shows the hierarchical Roofline for conv2d when the number of output filters, i.e. C′ in Equation 5, increases. The batch size for this set of results is fixed at 16. In both the FP16 and FP32 cases, and for both forward and backward passes, the arithmetic intensity and the GFLOP/s performance increase with the number of filters because the kernel becomes more compute intensive, i.e. more computation is done for the same amount of data movement. At the highest filter count of 512, the FP16 kernel in the backward pass reaches 80% of the peak (at about 100 TFLOP/s). In the meantime, the FP32 kernel gets even closer to the FMA (FP32) peak, at about 13 TFLOP/s. All of this shows great promise of achieving Volta's compute capability with certain input parameters.

Figure 9. Effects of the number of output filters on performance for the forward (left panel) and backward pass (right panel) on the second convolution layer from the ResNet50 [23] network. Open symbols represent kernels with FP32 input and output, and filled symbols represent FP16.

Kernel Size: Figure 10 shows the hierarchical Roofline of conv2d as a function of the kernel size, i.e. K_H × K_W, from 3 × 3, to 7 × 7, to 9 × 9. The batch size here is fixed at 16, and the number of output filters is 64. The stride size is 2.

Figure 10. Effects of kernel size on performance for the forward (left panel) and backward pass (right panel) on the second convolution layer from the ResNet50 [23] network. Open symbols represent kernels with FP32 input and output, and filled symbols represent FP16.

The increase in kernel size should have the same effect as the number of filters, i.e. increased computation for the same amount of data movement, and hence increased arithmetic intensity and performance. However, there are two exceptions. One is the sudden drop in both arithmetic intensity and performance at kernel size 9 × 9 in the forward pass for FP32 input and output. In this case, the TensorFlow framework decides to run a different set of subkernels — instead of the wgrad subkernels used for the 3×3 and 7×7 kernel sizes, it calls FFT subkernels underneath. This could be due to mistakes in the autotuning decision-making process, where 9×9 seems large enough but is still not at the level where FFT could be run optimally. This phenomenon is not observed in the backward pass, which suggests there is a certain sensitivity in TensorFlow's autotuning mechanism to the heuristic data, possibly from the warm-up stage.

The other exception is with the FP16 kernel in the backward pass at kernel size 9 × 9. Even though the input and output tensors are specified to be FP16, the kernel is run in FP32 precision, i.e. not on the Tensor Cores. The input data is first converted to FP32, then the FP32 subkernels are executed, and finally the output is converted back to FP16. These unnecessary conversions lead to the overall kernel performance being even worse than that of the natural FP32 kernel (with FP32 input and output, open triangles in Figure 10, right panel). It also suggests that the robustness of the autotuning mechanism in TensorFlow could be improved.

V. SUMMARY, CONCLUSIONS, AND FUTURE WORK

In this paper, we extend the nvprof-based HBM Roofline methodology we developed for NVIDIA GPUs [8] to capture the full NVIDIA GPU memory hierarchy, the effects of FPADD/FPMUL in the instruction mix, the effects of reduced precision via FP16 and FP32 Rooflines, and the benefits of using FP16 Tensor Cores (HMMA instructions). To demonstrate the value of this hierarchical GPU Roofline methodology, we used it to analyze three benchmarks: the moderately compute intensive GPP Material Science proxy application, the cache-intensive HPGMG AMR-multigrid proxy application, and the reduced precision and Tensor Core-accelerated 2D convolution kernel from TensorFlow.

We observe that the hierarchical Roofline can capture insights into compute, cache, or memory performance bottlenecks as well as properties of locality within each level of the cache hierarchy, with performance being highly correlated with Roofline in the memory-intensive GPP and HPGMG benchmarks. However, there were several cases in both HPGMG and TensorFlow where empirical performance and arithmetic intensity diverged from Roofline or theoretical expectations. Similarly, the ultimate performance of GPP at high nw was only roughly correlated with the "partial" FMA performance ceiling derived from the FMA fraction of the instruction mix. We found that although Roofline provides observations of performance metrics (e.g. decreased arithmetic intensity), it does not inform users as to exactly what went wrong in their application's execution or the code changes required to fix it. Nevertheless, it does provide some key first steps in potentially identifying areas of interest and may motivate further experiments.

In the future, we will extend our Roofline methodology and usage along three axes — more ceilings, expanded application characterization, and instruction-based Rooflines. As for the first axis, we see several potential extensions to Roofline. In lieu of simply scaling performance or relying on marketing numbers, we will extend ERT to support reduced precision (FP16 and FP32) and Tensor Cores (HMMA instructions). Additionally, whereas the FMA fraction of the instruction mix was insufficient for determining the performance asymptote in GPP, we will extend our nvprof-based Roofline methodology to incorporate occupancy in order to determine whether we are expressing sufficient thread-level parallelism to hide the GPU's high latencies. Moreover, echoing our efforts to understand the impact of FPADD and FPMUL on FP64 performance, we will develop the requisite methodology to capture and visualize the performance bottlenecks arising from mixed precision or Tensor Core-accelerated applications.

Although the traditional, operation-oriented (GFLOP/s) Roofline model can readily assess performance and memory bottlenecks, it is poorly suited for assessing either an application's exploitation of complex instruction set computing (CISC) or SIMD instruction set architectures (ISAs), or the degree to which an application utilizes a processor's functional units — something that can manifest in mixed precision or partially vectorized code. To that end, we will create an alternate Roofline methodology focused on floating-point instructions per cycle (IPC_FP) and floating-point instructions per byte. Recasting Roofline as such does not express performance (as it is agnostic of SIMD and FMA), but allows users to understand when functional units can be fully utilized executing a mix of reduced precision, Tensor Core, SIMD, or scalar instructions.

Finally, we will continue to apply our GPU Roofline methodology to more applications from a wider set of domains, as well as extending our methodology to other accelerated architectures.

ACKNOWLEDGEMENTS

This material is based on work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under award number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.
REFERENCES

[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, 2009.
[2] T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, "A Novel Multi-level Integrated Roofline Model Approach for Performance Characterization," in ISC, 2018.
[3] S. Williams, "Auto-tuning Performance on Multicore Computers," Ph.D. dissertation, EECS Department, University of California, Berkeley, December 2008.
[4] NERSC LIKWID Documentation. [Online]. Available: https://www.nersc.gov/users/software/performance-and-debugging-tools/likwid/
[5] NERSC SDE Documentation. [Online]. Available: https://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/
[6] T. Barnes, B. Cook, J. Deslippe, D. Doerfler, B. Friesen, Y. H. He, T. Kurth, T. Koskela, M. Lobet, T. M. Malas, L. Oliker, A. Ovsyannikov, A. Sarje, J. Vay, H. Vincenti, S. Williams, P. Carrier, N. Wichmann, M. Wagner, P. R. C. Kent, C. Kerr, and J. M. Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing," in PMBS Workshop at SC, 2016, pp. 43–53.
[7] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth, M. Lobet, T. M. Malas, J.-L. Vay, and H. Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor," in IXPUG Workshop at ISC, 2016, pp. 339–353.
[8] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability," in P3HPC Workshop at SC, 2018.
[9] General Plasmon Pole (GPP) Kernel. [Online]. Available: https://github.com/cyanguwa/nersc-roofline
[10] HPGMG CUDA Code. [Online]. Available: https://bitbucket.org/nsakharnykh/hpgmg-
[11] TensorFlow. [Online]. Available: https://tensorflow.org
[12] Empirical Roofline Toolkit (ERT). [Online]. Available: https://bitbucket.org/berkeleylab/cs-roofline-toolkit
[13] Python Matplotlib. [Online]. Available: https://matplotlib.org
[14] Example Scripts for Plotting Roofline. [Online]. Available: https://github.com/cyanguwa/nersc-roofline
[15] BerkeleyGW. [Online]. Available: https://berkeleygw.org
[16] J. Soininen, J. Rehr, and E. L. Shirley, "Electron Self-Energy Calculation using a General Multi-Pole Approximation," Journal of Physics: Condensed Matter, vol. 15, no. 17, p. 2573, 2003.
[17] HPGMG Website. [Online]. Available: https://hpgmg.org/
[18] HPGMG-FV Documentation. [Online]. Available: http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg
[19] AMReX Documentation. [Online]. Available: https://amrex-codes.github.io/amrex/
[20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems." [Online]. Available: https://www.tensorflow.org/
[21] tf.nn.conv2d Kernel. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/nn/conv2d
[22] T. Ben-Nun and T. Hoefler, "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis," arXiv e-prints, arXiv:1802.09941, Feb 2018.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," arXiv e-prints, arXiv:1512.03385, Dec 2015.
[24] PyCUDA Website. [Online]. Available: https://mathema.tician.de/software/pycuda