Understanding GPGPU Vector Register File Usage

Mark Wyse*
[email protected]
AMD Research, Advanced Micro Devices, Inc.
Paul G. Allen School of Computer Science & Engineering, University of Washington

ABSTRACT
Graphics processing units (GPUs) have emerged as a favored compute accelerator for servers and other computing platforms. At their core, GPUs are massively-multithreaded compute engines, capable of concurrently supporting over one hundred thousand active threads. Supporting this many threads requires storing context for every thread on-chip, and results in large vector register files consuming a significant amount of die area and power. Thus, it is imperative that the vast number of registers are used effectively, efficiently, and to maximal benefit.

This work evaluates the usage of the vector register file in a modern GPGPU architecture. We confirm the results of prior studies, showing vector registers are reused in small windows by few consumers and that vector registers are a key limiter of workgroup dispatch. We then evaluate the effectiveness of previously proposed techniques at reusing register values and hiding bank access conflict penalties. Lastly, we study the performance impact of introducing additional vector registers and show that additional parallelism is not always beneficial, somewhat counter-intuitive to the "more threads, better throughput" view of GPGPU acceleration.

* This work was completed while the author was a Post-Grad Scholar at AMD Research in Bellevue, WA.

AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

1. INTRODUCTION
Contemporary graphics processing units (GPUs) are incredibly powerful data-parallel compute accelerators. Originally designed exclusively for graphics workloads, GPUs have evolved into programmable, general-purpose compute devices. GPUs are now used to solve some of the most computationally demanding problems, in areas ranging from molecular dynamics to machine intelligence. The rapid adoption of GPUs into general-purpose computing has given rise to a new term describing these devices and their use: General-Purpose GPU (GPGPU) computing. In this context, GPUs are no longer bound to their traditional domain of graphics, but they are commonly viewed as the workhorse for computationally intense applications.

As the use of GPUs has expanded, the architecture of GPGPU devices has evolved. GPGPUs are massively-multithreaded devices, concurrently operating on tens to hundreds of thousands of threads. Unlike CPUs, which target low-latency computation, GPUs excel at high-throughput computation. Achieving high throughput requires supporting many threads, each requiring on-chip context. This context typically includes shared memory space, program counters, synchronization resources, and private storage registers. Maintaining context on-chip enables multithreading among the thousands of active threads, with single-cycle context switching between groups of threads. However, the required context consumes millions of bytes of storage, orders of magnitude more than the context of the few threads present in a traditional CPU. The vector register file storage space alone is typically larger than the L1 data caches and consumes as much as 16 MB in a state-of-the-art, fully configured AMD Radeon™ RX "VEGA" GPU [8][9]. With a considerable amount of storage, die area, and energy being consumed by the vector register files, it is important to understand the use of this structure in GPGPU applications so that it may be optimized for performance and/or energy-efficiency.

This paper examines modern GPGPU architectures, focusing on their use of vector general-purpose registers and the vector register subsystem architecture. Our study consists of three main parts. First, we replicate experiments from prior work revealing the vector register usage patterns for a set of compute applications. We confirm the results of prior work, despite modeling a GPGPU architecture based on products from a different device vendor. Second, we evaluate the effectiveness of operand buffering and register file caching as proposed in prior work. Our experiments show these structures to be highly effective at hiding bank access conflict penalties and enabling vector register value reuse. Third, we examine the potential parallelism and occupancy benefit of a GPGPU architecture providing (physically or logically) twice the number of vector general-purpose registers. We show that the benefit of higher wave-level parallelism and device occupancy is application dependent.


For many developers this notion remains counter-intuitive.

The remainder of the paper is organized as follows. Section 2 provides background on GPGPU architecture and execution. Section 3 describes our analysis and simulation methodology. Sections 4, 5, 6, and 7 detail our experimental results. Section 8 covers related work, Section 9 provides thoughts on future research directions, and we conclude in Section 10.

Figure 1. Sample CU Architecture. (Blocks shown: instruction fetch, per-wavefront contexts, dependency logic, instruction arbitration and scheduler, two SIMD VALU/SALU execution units with their Vector and Scalar RFs, scalar/local/global memory pipelines, and the scalar cache, LDS, data cache, and I-cache.)

2. BACKGROUND
GPUs are massively-multithreaded processing devices that support over one hundred thousand active threads. Supporting this many active threads requires an architecture that is modular and compartmentalized, as well as a programming model to express data-parallel computation. This section details the GPGPU programming model, describes the hardware execution model, and details the specific GPU architecture used in this study.

2.1 GPGPU Programming Model
GPGPUs use a data-parallel, streaming computation programming model. In this model, a program, or kernel, is executed by a collection of work-items (threads). The programming model typically uses the single instruction, multiple thread (SIMT) execution model. Work-items within a kernel are subdivided into workgroups by the programmer, which are further subdivided into wavefronts by hardware. The work-items within a wavefront are logically executed in lock-step. All work-items within a workgroup may perform synchronization operations with one another.

The wavefront size is a hardware parameter that may change across architecture generations or between devices capable of executing the same Instruction Set Architecture (ISA) generation. Programmers should not rely on the wavefront size remaining constant across hardware generations and should not have dependencies on a specific wavefront size in their code.

2.2 GPGPU Hardware Execution Model
Modern GPU architectures execute kernels using a SIMD (Single Instruction, Multiple Data) hardware model. As mentioned above, a kernel is composed of many work-items that are collected into workgroups. The workgroup is the unit of dispatch to the Compute Units (CUs), the hardware units responsible for executing workgroups. A CU must be able to support at least one full-sized workgroup, but may be able to execute additional workgroups concurrently if hardware resources allow. All work-items from the same workgroup are executed on the same CU. A GPU device contains at least one CU, but it may contain more to facilitate execution of many workgroups concurrently.

Within a CU, the SIMD unit is the hardware component responsible for executing wavefronts. Each wavefront within a workgroup is assigned to a single SIMD within the CU the workgroup is dispatched to. The SIMD unit is responsible for executing all work-items in a wavefront in lock-step. Each SIMD has access to a scalar ALU (SALU), a branch and message unit, and memory pipelines.

AMD's GCN architecture [2] also includes scalar instructions that are executed on the scalar ALU. These scalar instructions are generated by the compiler, transparent to the programmer, and are intermixed with vector instructions in the instruction stream. Scalar instructions are used for control flow or operations that produce a single result shared by all work-items in a wavefront.

2.3 Baseline GPGPU Architecture
In this section we detail the CU architecture employed in our study. Figure 1 depicts the architecture of the CU we model, which is capable of executing AMD's GCN3 ISA [3]. Without loss of generality, we elect to use AMD's terminology where applicable. The CU used in our study contains two SIMD Vector ALUs (VALUs), two Scalar ALUs (SALUs), Vector Register Files (VRFs), Scalar Register Files (SRFs), a Local Data Share (LDS), forty wavefront slots, Local Memory (LM), Global Memory (GM), and Scalar Memory (ScM) pipelines, and the CU is connected to scalar, data, and instruction caches. The following subsections detail the main blocks within the CU. Note that the Scalar Cache and I-Cache are shared between multiple CUs, while all other blocks are private per CU.


Figure 2. Vector Register File Subsystem Architecture. (Shown: four Vector RF banks feeding the Operand Buffer and Register File Cache, which deliver operands across lanes 0-63 to the SIMD VALU.)

2.3.1 Wavefront Context
Each CU contains a total of forty wavefront context slots [2]. The wavefront slots are divided equally among the SIMD VALUs, and all instructions from a wavefront are executed by the same SIMD/SALU pair for the duration of the wavefront's life. The wavefront context consists of the program counter, register state information, synchronization and memory counters, and an instruction buffer.

2.3.2 SIMD VALU
Each SIMD within the CU is a sixty-four wide Vector ALU (VALU), capable of issuing for execution one sixty-four wide vector instruction per cycle.

2.3.3 Vector Register File Subsystem
The Vector Register File (VRF) subsystem consists of banked vector register files, containing 1024 64-wide by 32-bit Vector General-Purpose Registers (VGPRs) [3], Operand Buffers (OB) [12][15], and register file caches (RFC) [12]. There is a private VRF, OB, and RFC per SIMD VALU. Figure 2 depicts the various components and operand delivery paths in the VRF subsystem.

2.3.3.1 Banked VRF
The vector register file associated with each SIMD unit contains 128 KB of storage. A CU with two VALUs and two VRFs contains 256 KB of VGPR storage [2]. The VRF comprises multiple SRAM-based banks. Each bank has one read port and one write port, and both a read and a write may occur in the same cycle. In this study, we configure the VRF to have four banks, with each bank holding 128 VGPRs of 64 by 32-bit values. There are 512 VGPRs distributed across the four banks per VRF, with a total of 1024 VGPRs per CU [3]. The bank width matches the wavefront size to facilitate reading and writing an entire VGPR per cycle per bank.

2.3.3.2 Operand Buffer
The Operand Buffer (OB) [12][15] is responsible for reading the vector source operands of each VALU instruction. The primary purpose of the OB is to hide bank access conflict latency penalties. It is a FIFO queue, and instructions enter and leave the OB in-order. However, the OB may read source operands for any instruction present in the FIFO in any cycle (i.e., out-of-order with respect to the execution order). In this study, an oldest-first-then-greedy policy is used to read source operands, but this may be changed in future implementations. The OB attempts to read the operands of the oldest instruction first, but will greedily read operands for younger instructions to avoid bank conflicts or if there are banks with available read ports that contain operands for younger instructions. Source operands are read from the VRF unless the operand exists in the Register File Cache or will be produced by an instruction in the VALU pipeline. Reading all operands for an instruction may take multiple cycles due to bank conflicts. Bank conflicts may occur both within (intra-instruction) and between (inter-instruction) instructions.

2.3.3.3 Register File Cache
The register file cache (RFC) used in this work is inspired by the RFC proposed by Gebhart et al. [12]. The RFC sits between the VALU and the VRF banks. It receives results from the VALU pipeline, forwards those results to the OB and VALU pipeline for future instructions if needed, and lazily writes back results to the VRF.

The RFC holds data for one or more VGPR-sized entries. Each entry is one complete 64-wide by 32-bit VGPR. The RFC is an on-demand allocation and eviction cache, with strict LRU eviction and replacement. The RFC's primary purposes are: (a) forwarding source operands to the OB and VALU pipeline, thereby reducing the number of VRF reads, and (b) hiding the latency penalty of bank write access conflicts.

The RFC can provide up to three VGPRs of 64 32-bit values to the instruction being dispatched from the OB to the VALU pipeline over forwarding paths. This path operates similar to bypass paths in traditional computational pipelines, and allows source operands to be delivered directly to the VALU pipeline without waiting for the values to be read from and written to the VRF, saving access time and energy.

Operands may also be forwarded from the RFC to the OB as the OB attempts to read source operands for instructions. This path is activated when an existing RFC entry is evicted, and may further reduce source operand reads performed from the VRF. The evicted RFC entry is provided to every instruction in the OB that needs it.
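As a quick check of the capacities quoted above (each VGPR is 64 lanes of 32 bits, or 256 B):

    4 banks x 128 VGPRs/bank x 256 B/VGPR = 131,072 B = 128 KB per VRF
    2 VRFs/CU x 128 KB = 256 KB of VGPR storage per CU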

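To make the oldest-first-then-greedy operand read policy concrete, the following C++ sketch models one cycle of operand collection over the banked VRF. It is our own simplification rather than the simulator's code: the structure and function names are invented, and the modulo mapping of VGPR index to bank is an assumption.

    // Minimal sketch of one cycle of the oldest-first-then-greedy operand
    // read policy (Section 2.3.3.2); names and bank mapping are assumptions.
    #include <array>
    #include <deque>
    #include <vector>

    constexpr int kNumBanks = 4;                      // banks per VRF (Section 2.3.3.1)

    struct PendingInstruction {
        std::vector<int> unreadSrcRegs;               // VGPR indices still to be read
    };

    int bankOf(int vgpr) { return vgpr % kNumBanks; } // assumed interleaving

    // Each bank supplies at most one read per cycle. The oldest instruction is
    // served first; remaining bank ports are then granted greedily to younger
    // instructions in FIFO order.
    void collectOperands(std::deque<PendingInstruction>& operandBuffer) {
        std::array<bool, kNumBanks> bankBusy{};       // one read port per bank
        for (auto& inst : operandBuffer) {            // oldest first
            auto& regs = inst.unreadSrcRegs;
            for (auto it = regs.begin(); it != regs.end();) {
                int b = bankOf(*it);
                if (!bankBusy[b]) {                   // port free: read this cycle
                    bankBusy[b] = true;
                    it = regs.erase(it);
                } else {
                    ++it;                             // bank conflict: retry next cycle
                }
            }
        }
        // An instruction whose unreadSrcRegs is empty is ready to dispatch.
    }

    int main() {
        std::deque<PendingInstruction> ob;
        ob.push_back({{0, 4}});   // two sources that map to the same bank (0)
        ob.push_back({{1, 2}});   // sources in banks 1 and 2
        collectOperands(ob);      // cycle 1: reads v0, v1, v2; v4 waits (conflict)
        collectOperands(ob);      // cycle 2: reads v4
    }

In the example, the oldest instruction has an intra-instruction conflict, so one of its reads is deferred while the younger instruction's operands are read greedily in the same cycle.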

2.3.4 Scalar ALU and Scalar Register File
AMD's GCN architecture includes a Scalar ALU (SALU) to handle execution of scalar instructions. Unlike vector instructions that operate on each individual work-item in a wavefront, scalar instructions are executed once for all work-items in a wavefront. The primary purpose of scalar instructions is to handle control flow and perform thread-independent computation for the wavefront.

Our CU model includes two Scalar ALUs, with each SALU associated with one of the VALUs. Each SALU has a private Scalar Register File (SRF) containing 800 32-bit Scalar General-Purpose Registers (SGPRs) [3][8]. The SGPRs are assigned at dispatch time to wavefronts being executed by the VALU/SALU pair.

2.3.5 Memory Subsystem
The memory subsystem used in the baseline architecture in this paper is modeled after the GCN device architecture [2][3][8][9]. In this setup, a CU contains a private L1 vector data cache and Local Data Share (LDS) scratchpad memory. A CU shares a scalar data cache and instruction cache with a collection of other CUs in the system. All three caches (vector, scalar, instruction) are supported by a shared L2 cache, which in turn connects to main memory.

The vector L1 data cache is a 16 KB, 16-way set associative, 64-byte cache block SRAM cache [2]. The shared instruction cache is a 32 KB, 8-way set associative, 64-byte cache block SRAM cache [2]. The L2 cache is a 512 KB, 16-way set associative, 64-byte cache block SRAM cache [17]. The L2 cache unifies the scalar data, vector data, and instruction caches, and is connected to system memory.

Each CU also contains a 64 KB Local Data Share (LDS). The LDS is a software-managed cache, with 32 banks [3][8][9]. This structure provides a high-bandwidth, low-latency, software-managed memory and acts as a data cache bandwidth amplifier.

3. METHODOLOGY
This section describes the benchmark analysis and simulation methodologies used for the experiments presented.

3.1 Benchmark and Kernel Analysis
In this subsection we detail the static and dynamic analysis performed to assess register usage and dependency characteristics.

3.1.1 Kernel Analysis
To evaluate dispatch limits for the benchmarks under study, we rely on data produced by the compiler and disassembly tools for AMD's GCN3 ISA [3][4]. These tools provide the number of vector and scalar general-purpose registers required per work-item and wavefront, respectively. Simulation (methodology below) provides the number of workgroups executed per kernel dispatch. Combining these data with architectural parameters of our system, we are able to determine the resources that limit kernel and workgroup dispatch and evaluate dispatch limits as architectural parameters are varied.

3.1.2 Dynamic Register Profiling
We use the gem5 simulator [14], which includes a modified version of AMD's APU gem5 model [7] (details below), to collect register reuse and producer-consumer data. A simple implementation of the register file system is sufficient to provide both the number of consumers per value producer and the distance between producer and consumer for all vector register values.

3.2 The gem5 Simulator
The gem5 simulator is an execution-driven, cycle-level simulator that is capable of executing real ISAs on simulated hardware. AMD's recent APU extension has added support for GPU Compute Units within the simulation framework. The APU model is compatible with gem5's system call emulation (SE) mode, where system calls invoked by simulated applications are either emulated in the simulator or passed to the host for execution. In this study, we use an updated version of the AMD APU compute model that faithfully implements the GCN3 ISA and runs an unmodified, publicly-released ROCm [6] version 1.1 software stack, with only kernel driver functionality being emulated. We simulate an APU with one CPU and a single CU to stress the CU and VRF to the greatest extent.

The following subsections detail the instruction readiness, dispatch, and execution flow in our Compute Unit implementation. The remaining structures (VALU, SALU, VRF, SRF, etc.) are implemented faithfully to the descriptions provided in Section 2.3 above.

3.2.1 Wavefront and Instruction Readiness
As described in the GCN3 architecture, each SIMD VALU has many associated wavefronts it is responsible for executing, with our simulated architecture supporting twenty wavefronts per SIMD. Every cycle, each SIMD evaluates all wavefronts for readiness. A wavefront is deemed ready to execute if it is active, not waiting for a synchronization operation to complete, has at least one instruction to execute, and all true (RAW) register dependencies are resolved. Register dependencies are tracked using a scoreboard indicating which, if any, source operands have not been produced yet by the various functional units (VALU, SALU, memory pipelines) and are busy. Each wavefront that is ready is presented to the instruction dispatch unit as a candidate for execution.
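The readiness check itself reduces to a few predicate tests against per-wavefront state and the scoreboard. The C++ sketch below illustrates the idea; the field names are our own and the real gem5 model tracks considerably more state.

    // Sketch of the per-cycle readiness check of Section 3.2.1 (field names
    // are assumptions, not the simulator's).
    #include <bitset>
    #include <cstdio>
    #include <vector>

    constexpr int kVgprsPerSimd = 512;

    struct Wavefront {
        bool active = false;
        bool waitingOnBarrier = false;
        int  pendingInstructions = 0;          // instructions in its buffer
        std::vector<int> srcRegsOfNextInst;    // VGPRs read by the next instruction
    };

    struct Scoreboard {
        std::bitset<kVgprsPerSimd> busy;       // set while a result is still pending
    };

    // A wavefront is a dispatch candidate only if it is active, not blocked on
    // synchronization, has an instruction buffered, and all true (RAW)
    // dependencies of that instruction are resolved.
    bool isReady(const Wavefront& wf, const Scoreboard& sb) {
        if (!wf.active || wf.waitingOnBarrier || wf.pendingInstructions == 0)
            return false;
        for (int vgpr : wf.srcRegsOfNextInst)
            if (sb.busy.test(vgpr))            // producer has not written back yet
                return false;
        return true;
    }

    int main() {
        Scoreboard sb;
        sb.busy.set(7);                        // v7 still being produced
        Wavefront wf{true, false, 1, {3, 7}};
        std::printf("wavefront ready: %s\n", isReady(wf, sb) ? "yes" : "no");
    }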


3.2.2 Instruction Dispatch and Execution
After wavefronts are checked for readiness, the instruction dispatch unit selects and attempts to schedule for execution up to one wave per execution resource. The execution resources are the VALUs, SALUs, and memory pipelines. Each cycle, a scheduler selects a candidate wave for each resource, typically using an oldest-first policy.

To be dispatched for execution, a selected wave must first gather all source operands from the register files. Each non-scalar instruction is sent to the vector register file, with VALU operations requesting a slot in the Operand Buffer, and vector memory (VMEM) instructions requesting access to the VRF banks. VMEM instructions receive priority for reading VRF banks. Once operands are read from the register files, the appropriate execution resources are checked for readiness. An execution resource may disallow instruction issue due to certain conditions, such as issue period limitations or full buffers (e.g., for vector memory coalescing).

After all source operands and execution resources are ready, an instruction is deemed ready for execution. At this point all non-vector ALU operations will be issued for execution. VALU operations must make one final request to the register file cache to allocate slots for destination registers. If the RFC is unable to allocate slots for the instruction, VALU instruction issue will stall. Once the RFC accepts the destination slot allocation request, the VALU operation will be issued to the pipeline for execution.

At the end of instruction execution, the destination values will be written into the RFC and the scoreboard updated to indicate result data are available for use. Register file write-back operations occur lazily from the RFC. Memory loads are enqueued in the memory pipelines and return data in variable latencies depending on memory system behavior and contention. Loads update the scoreboard and write back results to the VRF once data return from the memory system.

3.3 Benchmarks
Table 1 lists the benchmarks used in this study. These applications are obtained from the AMD compute applications GitHub [1][5]. The applications used in this study represent common kernels from HPC and scientific computing workloads that are of interest in the GPGPU community.

The selected applications are written using the heterogeneous compute (HC) C++ API. Source code is compiled using the heterogeneous compute compiler (HCC) [4], which is based on Clang and LLVM. HCC is an open-source compiler for heterogeneous compute applications that target the ROCm stack.

Table 1. Description of evaluated workloads.
Array-BW: Memory streaming
Bitonic Sort: Parallel merge sort
CoMD: DOE molecular-dynamics algorithms
FFT: Digital signal processing
HPGMG: Ranks HPC systems
MD: Generic molecular-dynamics algorithms
SNAP: Discrete ordinates neutral particle transport application
SpMV: Sparse matrix-vector multiplication
XSBench: Monte Carlo particle transport simulation

4. VECTOR REGISTER USAGE
Prior works [10][12] have examined the usage of vector register values in GPGPU architectures and concluded that most values produced are consumed a small number of times within a small instruction window from the producer instruction, and many registers do not contain live values for significant portions of execution.

The authors of [12] claim up to 70% of values are read only once, and only around 10% of values are read more than 2 times. This prior study evaluates an architecture modeled after those from NVIDIA, thus it is worth asking the question: do the same patterns hold for AMD GPUs and GCN3 ISA code?

Figure 3. Number of reads per vector register value. (Per-benchmark breakdown of values read 0, 1, 2, or more than 2 times, as a percent of all vector values produced.)
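The profiling of Section 3.1.2 that produces the distributions in Figure 3 (and Figure 4, below) can be thought of as a pass over the dynamic instruction stream that tracks, for each VGPR, the instruction that produced its current value. The C++ sketch below is our own illustration over a hypothetical three-instruction trace; it counts reads per produced value and producer-to-consumer distances.

    // Sketch of the dynamic register profiling: reads per value and
    // producer->consumer distance. The trace is hypothetical.
    #include <cstdio>
    #include <numeric>
    #include <unordered_map>
    #include <vector>

    struct TraceOp {                       // one dynamic instruction
        std::vector<int> srcs;             // VGPRs read
        std::vector<int> dsts;             // VGPRs written
    };

    int main() {
        // Hypothetical trace: v2 = v0 + v1; v3 = v2 * v0; v2 = v4 + v5
        std::vector<TraceOp> trace = {
            {{0, 1}, {2}},
            {{2, 0}, {3}},
            {{4, 5}, {2}},                 // overwrites v2, ending its first lifetime
        };

        struct Producer { long inst; int reads; };
        std::unordered_map<int, Producer> live;       // VGPR -> producer of current value
        std::unordered_map<int, long> readHistogram;  // reads-per-value distribution
        std::vector<long> distances;                  // producer->consumer distances

        for (long i = 0; i < (long)trace.size(); ++i) {
            for (int s : trace[i].srcs) {
                auto it = live.find(s);
                if (it != live.end()) {
                    ++it->second.reads;
                    distances.push_back(i - it->second.inst);
                }
            }
            for (int d : trace[i].dsts) {
                auto it = live.find(d);
                if (it != live.end())                 // value dies on overwrite
                    ++readHistogram[it->second.reads];
                live[d] = {i, 0};
            }
        }
        for (auto& [reads, count] : readHistogram)
            std::printf("%ld value(s) read %d time(s)\n", count, reads);
        double avg = distances.empty() ? 0.0
                   : double(std::accumulate(distances.begin(), distances.end(), 0L))
                     / distances.size();
        std::printf("average producer->consumer distance: %.1f instructions\n", avg);
    }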


We replicate two of their studies and find that 60-90% (75% average) of vector values produced are read exactly once, and 4-13% (10% average) of vector values produced are read more than twice, as shown in Figure 3. Further, we find 24-57% (40% average) of all values consumed were produced within 3 instructions prior, as shown in Figure 4. In both experiments, our results are in line with prior work. While not overly surprising, this adds confidence to the remainder of our evaluation and experiments, and confirms that the codes being used for evaluation exhibit similar behavior despite differences in implementation, compilers, and ISA.

Prior studies also examine the liveness of register values over the course of execution [10][11]. The authors' conclusions are that many registers are short-lived and thus contain no live values for a significant percentage of a kernel's execution. We replicated their results (data not shown) and confirm that the applications used in our study exhibit similar behavior. No application utilized all of its compiler-allocated registers, and all applications had register usage patterns with significant variation in the number of live register values throughout execution.

Figure 4. Lifetime of vector register values. (Per-benchmark breakdown of consumed values produced 1, 2, 3, or more than 3 instructions before the consumer, as a percent of all vector values consumed.)

5. REDUCING THE NUMBER OF REGISTER FILE READS
One function of the OB and RFC is to reduce the number of reads from the vector register file. As described above, the RFC is able to both recycle operands to the OB and forward source operands to the VALU pipeline at instruction dispatch. Each of these paths reduces the number of VGPR reads performed from the main register file by the OB. In this experiment, we examine the number of reads required by the OB for VALU instructions that can be saved by resizing the RFC, and discuss the performance and implementation implications of such changes.

Figure 5 shows the number of reads saved for each RFC configuration. We sweep RFC sizes from 2 through 512 total entries, with the y-axis showing the percentage of vector sources that are provided by the RFC to the OB, or equivalently, the number of VRF reads saved (higher is better) for VALU instructions. Figure 6 shows the relative performance for each RFC configuration. The y-axis is IPC normalized to an RFC with eight entries (higher is better).

At small RFC sizes (2 or 4 entries), we observe that up to 25% of all possible reads required by the OB from the VRF are avoided. At these sizes, we observe performance degradation, caused by the timing of RFC entry allocation in our simulator implementation. An RFC slot is allocated when an instruction is dispatched to the VALU pipeline, and because the pipeline latency is larger than the number of RFC slots for small sizes, instruction issue stalls on RFC allocation.
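A minimal model of the RFC's allocate-on-dispatch and LRU-eviction behavior, which is what this sweep varies, might look like the following C++ sketch. The class and method names are ours, and the real structure also holds the 64-lane data and the lazy write-back machinery.

    // Minimal LRU model of the register file cache (Sections 2.3.3.3 and 5);
    // a simplification under our own naming, not the simulator's code.
    #include <cstdio>
    #include <cstddef>
    #include <list>
    #include <optional>
    #include <unordered_map>

    class RegisterFileCache {
    public:
        explicit RegisterFileCache(std::size_t entries) : capacity_(entries) {}

        // Called at VALU dispatch for each destination VGPR. Returns the VGPR
        // evicted (to be lazily written back to the VRF), if any.
        std::optional<int> allocate(int vgpr) {
            touch(vgpr);
            if (lru_.size() <= capacity_) return std::nullopt;
            int victim = lru_.back();          // least recently used entry
            lru_.pop_back();
            index_.erase(victim);
            return victim;                     // forwarded to the OB, then written back
        }

        // Operand lookup by the OB: a hit avoids a VRF bank read.
        bool lookup(int vgpr) {
            if (!index_.count(vgpr)) return false;
            touch(vgpr);
            return true;
        }

    private:
        void touch(int vgpr) {
            auto it = index_.find(vgpr);
            if (it != index_.end()) lru_.erase(it->second);
            lru_.push_front(vgpr);
            index_[vgpr] = lru_.begin();
        }

        std::size_t capacity_;
        std::list<int> lru_;                                     // MRU at front
        std::unordered_map<int, std::list<int>::iterator> index_;
    };

    int main() {
        RegisterFileCache rfc(2);                 // a two-entry RFC, as in RFC-2
        rfc.allocate(1);
        rfc.allocate(2);
        auto evicted = rfc.allocate(3);           // VGPR 1 is LRU and is evicted
        std::printf("evicted: v%d, v3 cached: %s\n",
                    evicted ? *evicted : -1, rfc.lookup(3) ? "yes" : "no");
    }

The example shows why very small configurations behave like the RFC-2 point in the sweep: the third allocation already forces an eviction and a write-back.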

Figure 5. Number of VGPR reads saved by RFC forwarding paths. (Per-benchmark percentage of VGPR VRF reads saved, for RFC sizes of 2 through 512 entries.)


Figure 6. Relative performance (IPC) for RFC sizes of 2 through 32 entries, per benchmark. Performance at sizes greater than 32 is stable at a relative IPC of 1.

Pipeline bubbles are introduced, diminishing performance. This behavior is an artifact of the simulator implementation, and correcting it is left as future work.

At larger RFC sizes (16 or more entries), we observe the number of saved reads increases significantly with RFC size. More than half of the benchmarks studied are able to serve more than 80% of all required VGPR operands for VALU instructions from the RFC at the largest RFC configuration. Although the RFC is as large as the VRF at size 512, not all source operands will be captured for reuse by the RFC. Any value loaded by a memory instruction into VGPRs must be read from the VRF instead of the RFC to ensure correctness, until the physical VGPR loaded to is overwritten by a VALU instruction and the result stored in the RFC. Additionally, some VGPR values are initialized by hardware at wavefront dispatch. Until the physical VGPRs holding these values are overwritten by a VALU instruction, if they ever are, those values will be read from the VRF, not the RFC. At larger RFC sizes, no performance benefit is observed. This implies that the RFC and OB are effective at hiding latency penalties from bank access conflicts at default sizes of eight RFC entries and four OB entries (details in Section 6). Although increasing the RFC size does not lead to better performance, it may lead to reduced VRF access energy as more operands can be provided by the lower-energy forwarding paths. However, there are trade-off costs in implementation (area, power, and latency) as RFC size increases that may make larger RFCs less energy-efficient.

6. REGISTER BANK CONFLICTS IN A COMPUTE-OPTIMIZED GPU ARCHITECTURE
The VGPRs are physically stored in a banked register file to provide high-bandwidth access without the overhead of multi-ported register files. Each bank is the width of a GCN3 wavefront, or 64 32-bit entries, or equivalently, one VGPR wide, and can read and write one VGPR per cycle.

In both the GCN3 and our simulated architecture, both intra- and inter-instruction bank conflicts may occur. Intra-instruction read conflicts occur when two or more source operands in the same instruction reside in the same physical VGPR bank. Inter-instruction read conflicts occur when multiple instructions have operands residing in the same bank and attempt to read them in the same cycle. Given our GPU architecture detailed above, the natural question to ask is: how effective are the RFC and OB at hiding the latency penalties of bank access conflicts?

We answer this question by performing a limit study on the performance benefit of removing bank conflicts. As discussed in Section 5, the RFC and OB significantly reduce the number of accesses to the VRF.
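For concreteness, the following C++ sketch classifies intra- and inter-instruction read conflicts for a pair of hypothetical instructions, again assuming a four-bank VRF with a modulo mapping of VGPR index to bank; the mapping and the register indices are our own assumptions.

    // Classifying intra- vs. inter-instruction VRF read conflicts (Section 6).
    #include <array>
    #include <cstdio>
    #include <vector>

    constexpr int kNumBanks = 4;
    int bankOf(int vgpr) { return vgpr % kNumBanks; }   // assumed interleaving

    int main() {
        // Two instructions attempting reads in the same cycle (hypothetical VGPRs).
        std::vector<int> instA = {4, 8};    // both map to bank 0: intra-instruction
        std::vector<int> instB = {12, 5};   // v12 also maps to bank 0: inter-instruction

        std::array<int, kNumBanks> readsThisCycle{};
        for (int r : instA) ++readsThisCycle[bankOf(r)];
        bool intra = false;
        for (int n : readsThisCycle) intra |= (n > 1);   // two reads hit one bank

        bool inter = false;
        for (int r : instB) inter |= (readsThisCycle[bankOf(r)] > 0);

        std::printf("intra-instruction conflict: %d, inter-instruction conflict: %d\n",
                    int(intra), int(inter));
    }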


Table 2. Maximum workgroups per CU per application when limited only by Wavefront Slots (WF), VGPRs, or SGPRs.
Application: WF / VGPR / SGPR / Limiter
Array-BW: 2 / 1 / 12 / VGPR
Bitonic-Sort: 2 / 4 / 12 / WF
CoMD: 40 / 28 / 100 / VGPR
FFT: 40 / 12 / 66 / VGPR
HPGMG: 2 / 1 / 4 / VGPR
MD: 2 / 1 / 6 / VGPR
SNAP: 40 / 14 / 100 / VGPR
SpMV: 20 / 21 / 100 / WF
XSBench: 10 / 4 / 25 / VGPR

Figure 8. Relative performance with 2x VGPR per SIMD (baseline vs. 2x VGPR, per benchmark).

However, these structures are also meant to hide the latency penalty of bank access conflicts. Specifically, the OB acts as a read buffer to opportunistically read operands when banks become available and prevent the insertion of bubbles into the VALU pipeline. To assess the effectiveness of the OB at hiding conflict penalties, we compare our baseline architecture to one without bank conflicts in the VRF. We reconfigure the simulator to place each VGPR in its own VRF bank. By definition, two different VGPRs will not conflict with one another in this setup.

Figure 7 shows the results of this experiment and displays normalized performance of the baseline and bank conflict free configurations. The y-axis is performance in IPC, normalized to the baseline configuration (higher is better). As shown, there is negligible change in IPC between the two configurations (typically less than +/- 1%), indicating that the RFC and OB are well-suited to handling all bank access conflicts encountered during dynamic execution.

Figure 7. Relative performance comparing the baseline architecture to a bank conflict free configuration (note the y-axis begins at 95%).

There are a few different conclusions that may be drawn from these results. Perhaps most obvious is that the RFC and OB are effective at hiding the latency penalty of any conflicts that occur. This may be due to the OB's ability to gather operands out-of-order with respect to instruction issue order, or that the RFC's forwarding paths are able to adequately reduce the number of VRF accesses required by the OB. Fewer required accesses means less register file pressure and a lower probability of bank conflicts. Another possible explanation is that the applications studied have lower than expected dynamic register usage. Although most of the applications are limited by VGPRs in dispatch (see Section 7), the dynamic usage may not be as great as the static demand. It is possible that codes in other domains may have greater dynamic register usage.

7. DISPATCH LIMITS AND WAVE-LEVEL PARALLELISM
In this section we examine the resource requirements for workgroup dispatch for each benchmark studied and examine the impacts of increased resource availability in terms of performance and parallelism.

7.1 Workgroup Dispatch Limits
Launching, or dispatching, a kernel for execution on a GPU requires a set of resources to be available. These resources are wavefront slots, vector registers, scalar registers, and scratch memory. For our applications, we find that the majority are vector register limited for dispatch. Table 2 shows the results of offline analysis on the compute kernels and lists the maximum number of workgroups that can be dispatched per CU when only considering one resource at a time (other resources assumed infinite). Each application has a single kernel, except FFT, which has both a forward and an inverse FFT kernel. The FFT kernels, however, have identical resource requirements and are executed sequentially. The three columns give the maximum number of workgroups per CU when limited only by wavefront slots (WF), VGPR availability (VGPR), and SGPR availability (SGPR). The final column (Limiter) lists which resource limits dispatch and prevents further workgroups from being dispatched. Of the nine applications we study, only Bitonic-Sort and SpMV are not VGPR limited for dispatch.

Because many applications are limited by VGPR availability, it is natural to ask: does providing additional VGPRs result in improved wave-level parallelism and/or performance?
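To illustrate how the per-resource limits in Table 2 are derived, the following C++ sketch applies the CU capacities from Section 2.3 to a hypothetical kernel. The kernel's per-wavefront requirements are invented for the example and do not correspond to any benchmark in Table 2; the calculation also assumes a workgroup's waves divide evenly across the two SIMDs.

    // Per-resource dispatch-limit calculation (Section 7.1), illustrative only.
    #include <algorithm>
    #include <cstdio>

    int main() {
        // Per-CU capacities (Section 2.3; two SIMD/SALU pairs per CU).
        const int wfSlotsPerCu = 40;
        const int vgprsPerSimd = 512, numSimds = 2;
        const int sgprsPerSalu = 800, numSalus = 2;

        // Hypothetical kernel: 4 wavefronts per workgroup, 48 VGPRs and
        // 32 SGPRs per wavefront (illustrative numbers, not from a benchmark).
        const int wavesPerWg = 4, vgprsPerWave = 48, sgprsPerWave = 32;

        int byWf   = wfSlotsPerCu / wavesPerWg;                              // 10
        int byVgpr = (vgprsPerSimd / vgprsPerWave) * numSimds / wavesPerWg;  // 5
        int bySgpr = (sgprsPerSalu / sgprsPerWave) * numSalus / wavesPerWg;  // 12

        std::printf("WG/CU limited by WF slots: %d, by VGPRs: %d, by SGPRs: %d\n",
                    byWf, byVgpr, bySgpr);
        std::printf("dispatch limit: %d (VGPR limited)\n",
                    std::min({byWf, byVgpr, bySgpr}));
    }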


7.2 Increasing Wave-Level Parallelism and Performance with Additional VGPRs
General-purpose registers are shown to be the limiting factor in workgroup dispatch for seven of the nine benchmarks in this study. GPGPUs are throughput accelerators, and it is often thought that improving the amount of available work, or occupancy, will result in improved performance. However, prior work has shown that additional parallelism may not always be beneficial [16].

In this experiment, we assess the benefits of having additional VGPRs available per SIMD VALU and CU. For the benchmarks studied, do the additional VGPRs (a) allow more workgroups to be dispatched and increase the wave-level parallelism (WLP), and (b) if WLP is increased, is there a resulting performance benefit?

To estimate the best-case improvement, we modify our simulator configuration to have twice the number of VGPRs per SIMD/CU (1024 VGPRs per SIMD, 512 KB per CU), but do not modify any timing for VGPR access. Assuming no timing penalty for a larger VRF is optimistic, especially when the register files are as large as those used in GPUs, but it allows us to study the upper-bound benefit of additional VGPRs.

Figure 8 and Figure 9 show the results of our experiment. Figure 8 displays relative performance in IPC for the baseline and 2x VRF configurations, normalized to the baseline. The y-axis is improvement in IPC over the baseline, and higher is better. Figure 9 shows the realized wave-level parallelism (WLP) for the baseline and 2x VRF configurations. The y-axis is the average WLP observed normalized to the baseline configuration, and a value greater than one indicates greater observed WLP. We measure WLP by counting the number of already active wavefronts when each new wavefront is dispatched, then averaging this count over all wavefront dispatches.

Figure 9. Relative WLP with 2x VGPR per SIMD (baseline vs. 2x VGPR, per benchmark).

Our experimental results are mixed. All seven applications that are VGPR limited for dispatch observe increased average WLP. However, the performance benefit of higher WLP is mixed. CoMD, MD, and XSBench experience performance degradation, FFT sees no performance change, and Array-BW, HPGMG, and SNAP see improved performance.

For the benchmarks with a performance loss, averaging around 7% worse IPC, we suspect memory access divergence is at fault. Prior analysis (data not shown) revealed MD and XSBench have greater memory divergence per memory instruction. These two benchmarks also have long tails in their memory access latency distributions compared to other applications, as shown in Figure 10. We only show a subset of the benchmarks for clarity in the figure; however, the benchmarks not shown have CDFs that fall between those shown for FFT and SpMV. The divergence in memory requests increases the number of post-coalescing memory requests per instruction and appears to cause an increase in average memory
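The WLP metric described above can be computed from a simple event stream of wavefront dispatches and completions. The C++ sketch below is a self-contained illustration with made-up events; it samples the number of already-active wavefronts at each dispatch and averages the samples.

    // Sketch of the wave-level parallelism (WLP) metric used for Figure 9.
    // The event stream is hypothetical.
    #include <cstdio>
    #include <vector>

    int main() {
        // +1 = a wavefront is dispatched, -1 = a wavefront completes.
        std::vector<int> events = {+1, +1, +1, -1, +1, +1, -1, -1};

        int active = 0;
        long samples = 0, sum = 0;
        for (int e : events) {
            if (e > 0) {                 // dispatch: sample WLP before adding it
                sum += active;
                ++samples;
                ++active;
            } else {
                --active;                // completion
            }
        }
        std::printf("average WLP = %.2f\n", samples ? double(sum) / samples : 0.0);
    }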