An Instruction Roofline Model for GPUs

Nan Ding, Samuel Williams
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
{NanDing, [email protected]

Abstract—The Roofline performance model provides an intuitive approach to identifying performance bottlenecks and guiding performance optimization. However, the classic FLOP-centric approach is inappropriate for emerging applications that perform more integer operations than floating-point operations. In this paper, we propose an Instruction Roofline Model for NVIDIA GPUs. The Instruction Roofline incorporates instructions and memory transactions across the full memory hierarchy and provides more performance insight than the FLOP-oriented Roofline Model: instruction throughput, strided memory access patterns, bank conflicts, and thread predication. We use our Instruction Roofline methodology to analyze five proxy applications: HPGMG from AMReX, BatchSW from merAligner, Matrix Transpose benchmarks, cudaTensorCoreGemm, and cuBLAS. We demonstrate the ability of our methodology to explain various aspects of performance and performance bottlenecks on NVIDIA GPUs and to motivate code optimizations.

Keywords—Instruction Roofline Model, NVIDIA GPUs, memory patterns

I. INTRODUCTION

Migrating an application to a new architecture is a challenge, not only in porting the code but also in understanding and tuning its performance. Rather than performing such analysis manually, developers tend to use tools to motivate optimization, and performance analysis tools are therefore becoming one of the most critical components for modern architectures. Performance modeling, the key technique for quantifying performance characteristics and identifying potential performance bottlenecks relative to machine capabilities, is becoming an indispensable tool for understanding performance behavior and guiding performance optimization.

The Roofline model [1] is a visually intuitive method for understanding performance: it couples floating-point performance, data locality (arithmetic intensity), and memory performance in a two-dimensional graph. The Roofline model [2–4] can tell whether code is memory-bound across the full memory hierarchy or compute-bound. Unfortunately, even with sufficient data locality, one cannot guarantee high performance. Many applications perform more integer operations than floating-point operations, and applications in emerging domains such as graph analytics and genomics perform no floating-point operations at all. The classic, FLOP-centric Roofline model is inappropriate for such domains. To that end, we develop an Instruction Roofline Model for GPUs to enable performance analysis of integer-heavy computations.

The contributions of this paper include:
• Definition of Instruction Roofline ceilings that characterize peak machine instruction throughput.
• Creation of memory walls that define a range of efficiency for memory accesses to global or shared memory.
• Development of an application characterization methodology that incorporates instruction throughput and thread predication into the Roofline model.
• Development of an application execution characterization methodology that allows easy visualization of global and shared memory access patterns.
• Evaluation of the Instruction Roofline Model and methodology on NVIDIA's latest V100 GPUs using five proxy applications: HPGMG (a mix of floating-point and integer instructions), BatchSW (integer-only), Matrix Transpose (load/store), cudaTensorCoreGemm (tensor core Warp-Matrix-Multiply-Accumulate operations), and cuBLAS (the cublasGemmEx API).

II. THE CLASSIC ROOFLINE MODEL

The Roofline model characterizes a kernel's performance in GigaFLOPs per second (GFLOP/s) as a function of its arithmetic intensity (AI), as described by Eq. (1). AI is expressed as the ratio of floating-point operations performed to data movement (FLOPs/Bytes). For a given kernel, we can find a point on the X-axis based on its AI; the Y-axis represents the measured GFLOP/s. This performance number can be compared against the bounds set by the peak compute performance (Peak GFLOP/s) and the memory bandwidth of the system (Peak GB/s) to determine what is limiting performance: memory or compute.

GFLOP/s ≤ min(Peak GFLOP/s, Peak GB/s × Arithmetic Intensity)    (1)
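To make the bound concrete, the following is a minimal host-side sketch of Eq. (1), not part of the paper's tooling; the peak values are nominal V100 figures (7.8 FP64 TFLOP/s, 900 GB/s HBM2) and the arithmetic intensity is an arbitrary example.

```cuda
// Minimal sketch of the classic Roofline bound, Eq. (1).
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_gflops = 7800.0; // nominal V100 FP64 peak, GFLOP/s
    const double peak_gbs    = 900.0;  // nominal V100 HBM2 bandwidth, GB/s
    const double ai          = 0.25;   // example arithmetic intensity, FLOPs/Byte

    // Performance is capped by either the compute peak or bandwidth times AI.
    const double bound = std::min(peak_gflops, peak_gbs * ai);
    std::printf("attainable: %.1f GFLOP/s\n", bound); // 225.0, i.e., memory-bound
    return 0;
}
```

For any kernel whose AI falls left of the ridge point (peak_gflops / peak_gbs, about 8.7 FLOPs/Byte here), the bandwidth term governs, which is exactly how the model separates memory-bound from compute-bound kernels.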
The classic Roofline model has been successfully used for performance analysis on different architectures. In prior work, researchers created additional compute ceilings (e.g., a "no FMA" peak) and memory ceilings (e.g., cache levels) [3, 5, 6]. However, such refinement is misplaced when the bottleneck is not floating-point in nature but pertains to integer instruction throughput or memory access.

III. INSTRUCTION ROOFLINE MODEL FOR GPUS

Intel Advisor [7] introduced the ability to analyze integer-heavy applications and generate scalar and vector integer operation (IntOP) ceilings. At its core, the Roofline model in Intel Advisor is based on operations (floating-point or integer). In either case, the vertical axis remains performance (FLOP/s, int/s, or int/s+float/s) while the horizontal axis remains arithmetic intensity (operations per byte).

Whereas Intel Advisor can be quite effective in this regard, it suffers from two shortcomings. First, it only captures pipeline throughput and may not detect instruction fetch-decode-issue bottlenecks. Second, it is an x86 CPU-only solution. The latter is particularly troublesome given the ascendancy of accelerated computing.

In order to enable Instruction Roofline analysis for GPU-accelerated applications, we need to target a different set of metrics. First, rather than counting floating-point and/or integer operations, we count instructions. Counting instructions allows us both to identify fetch-decode-issue bottlenecks and, when instructions are categorized by type, to assess pipeline utilization. Thread predication is another critical performance factor on GPUs. When a branch is executed, threads that do not take the branch are predicated (masked, in vector parlance) so that they do not execute subsequent operations. When predication is frequent, one may observe poor kernel performance, as very few threads execute work on any given cycle.
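As a concrete illustration, consider the hypothetical kernel below (our own example, not one of the paper's proxy applications): the guard leaves only one lane per 32-thread warp active, and the compiler will typically implement such a short guarded body with predication.

```cuda
// Hypothetical CUDA kernel exhibiting heavy thread predication: only one
// lane in each 32-thread warp passes the guard, so the remaining 31 lanes
// are masked off and the warp retires little useful work per issued
// instruction.
__global__ void mostlyPredicated(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (threadIdx.x & 31) == 0)  // active in 1 of 32 lanes
        out[i] = 2.0f * in[i];             // executed under a predicate mask
}
```

Such a kernel issues nearly as many warp-level instructions as its unguarded counterpart while performing a fraction of the work, which is precisely the gap that incorporating predication into the model exposes.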
Finally, the nature of GPU computing makes efficient data movement a critical factor in application execution time. As such, it is essential that we also characterize global and shared memory access patterns to assess the efficiency of data motion and motivate future code optimizations. Although developers can use nvprof [8] and nvvp [9] to diagnose the performance bottlenecks discussed above, the Instruction Roofline Model provides an approachable means of characterizing performance bottlenecks in a single figure.

A. Architectural Characterization

First, we describe how we define the Instruction Roofline ceilings and memory-pattern walls. Here we use NVIDIA's latest V100 GPU (GV100) [10] to describe the methodology, but it is applicable to any GPU architecture.

Instructions and Bandwidth Ceilings: Each GV100 Streaming Multiprocessor (SM) consists of four processing blocks (warp schedulers), and each warp scheduler can dispatch one instruction per cycle. As such, the theoretical maximum (warp-based) instruction rate is 80 (SMs) × 4 (warp schedulers) × 1 (instruction/cycle) × 1.53 (GHz) = 489.6 GIPS. Memory accesses are coalesced into transactions. The transaction size for global/local memory, the L2 cache, and HBM is 32 bytes; the shared memory transaction size is 128 bytes. In practice, a warp-level load may generate anywhere from 1 to 32 transactions depending on the memory access pattern. This makes the "transaction" the natural unit for characterizing data movement, and it lets us recast the Roofline model in terms of transaction bandwidth (GTXN/s), Instruction Intensity, and machine peak GIPS. "Instruction Intensity" on the GPU is defined as warp-based instructions per transaction. Figure 1 shows the resultant Instruction Roofline ceilings for the GV100: L1, L2, and HBM bandwidths of 14000, 2996, and 828 GB/s sustain 437, 93.6, and 25.9 GTXN/s when normalized to the typical 32-byte transaction size.

GIPS ≤ min(Peak GIPS, Peak GTXN/s × Instruction Intensity)    (2)

[Figure 1: Instruction Roofline Model for the GV100. Axes: Performance (warp GIPS) versus Instruction Intensity (warp instructions per transaction). Ceilings: theoretical peak of 489.6 warp GIPS; L1 at 437.5 GTXN/s, L2 at 93.6 GTXN/s, HBM at 25.9 GTXN/s. 1 GTXN/s is 10^9 transactions per second.]

Global Memory Walls: When a warp executes an instruction to access global memory, it is crucial to consider the access pattern of the threads in that warp, because inefficient memory access can lower performance by generating superfluous transactions.

As discussed, a warp-level load instruction can generate anywhere from 1 to 32 transactions. We can recast this ratio into two key instruction intensities: 1 and 1/32 warp-level global loads per global transaction. The former arises when all threads in a warp reference the same memory location and only a single transaction is generated; we call this "stride-0". Conversely, the other extreme can occur in a number of scenarios, including random access and striding by more than 32 bytes ("stride-8" if FP32/INT32 and "stride-4" if FP64). Unit-stride ("stride-1") memory access provides a global LD/ST intensity of 1/8 (FP64) or 1/4 (FP32, INT32). Thus, on the Instruction Roofline, we may plot three intensity "walls" representing stride-0, stride-1 (unit-stride), and stride-8; these patterns are illustrated in the code sketch at the end of this subsection.

Shared Memory Walls: GPU shared memory is a highly banked structure within each SM providing 4-byte access. In theory, this allows all 32 threads of a warp to be served concurrently, provided each thread's 4-byte word maps to a distinct bank.
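To tie the walls to code, the hypothetical kernels below (our own sketches; names, array sizes, and launch shapes are assumed for illustration) realize the three global FP32 load patterns and a shared-memory access that is either conflict-free or a 32-way bank conflict. The per-warp transaction counts in the comments follow the 32-byte global and 128-byte shared transaction sizes given above.

```cuda
// Hypothetical kernels sketching the memory walls. Counts assume FP32
// (4-byte) accesses and a fully active 32-thread warp.

__global__ void stride0(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[0];      // all 32 lanes read one address: 1 load transaction/warp
}

__global__ void stride1(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];      // 32 contiguous floats = 128 bytes: 4 transactions/warp
}

__global__ void stride8(float *out, const float *in)  // in must hold 8x as many floats
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[8 * i];  // lanes land 32 bytes apart: 32 transactions/warp
}

__global__ void sharedBanks(float *out)  // assumes a single warp: 32 threads/block
{
    __shared__ float tile[32 * 32];
    int lane = threadIdx.x & 31;
    for (int j = lane; j < 32 * 32; j += 32)
        tile[j] = (float)j;              // fill the tile; lane i hits bank i: conflict-free
    __syncwarp();
    out[lane] = tile[lane * 32];         // every lane hits bank 0: a 32-way conflict
}
```

In Instruction Roofline terms, stride0, stride1, and stride8 land at global-load intensities of 1, 1/4, and 1/32 warp instructions per transaction, the walls plotted in the model; the conflicted shared-memory read likewise costs 32 transactions where the conflict-free access costs 1.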