Understanding GPGPU Vector Register File Usage

Mark Wyse*
[email protected]
AMD Research, Advanced Micro Devices, Inc.
Paul G. Allen School of Computer Science & Engineering, University of Washington

ABSTRACT
Graphics processing units (GPUs) have emerged as a favored compute accelerator for servers and other computing platforms. At their core, GPUs are massively-multithreaded compute engines, capable of concurrently supporting over one hundred thousand active threads. Supporting this many threads requires storing context for every thread on-chip, and results in large vector register files consuming a significant amount of die area and power. Thus, it is imperative that the vast number of registers are used effectively, efficiently, and to maximal benefit.

This work evaluates the usage of the vector register file in a modern GPGPU architecture. We confirm the results of prior studies, showing vector registers are reused in small windows by few consumers and that vector registers are a key limiter of workgroup dispatch. We then evaluate the effectiveness of previously proposed techniques at reusing register values and hiding bank access conflict penalties. Lastly, we study the performance impact of introducing additional vector registers and show that additional parallelism is not always beneficial, somewhat counter-intuitive to the "more threads, better throughput" view of GPGPU acceleration.

* This work was completed while the author was a Post-Grad Scholar at AMD Research in Bellevue, WA.

AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

1. INTRODUCTION
Contemporary graphics processing units (GPUs) are incredibly powerful data-parallel compute accelerators. Originally designed exclusively for graphics workloads, GPUs have evolved into programmable, general-purpose compute devices. GPUs are now used to solve some of the most computationally demanding problems, in areas ranging from molecular dynamics to machine intelligence. The rapid adoption of GPUs into general-purpose computing has given rise to a new term describing these devices and their use: General-Purpose GPU (GPGPU) computing. In this context, GPUs are no longer bound to their traditional domain of graphics, but they are commonly viewed as the workhorse for computationally intense applications.

As the use of GPUs has expanded, the architecture of GPGPU devices has evolved. GPGPUs are massively-multithreaded devices, concurrently operating on tens to hundreds of thousands of threads. Unlike CPUs, which target low-latency computation, GPUs excel at high-throughput computation. Achieving high throughput requires supporting many threads, each requiring on-chip context. This context typically includes shared memory space, program counters, synchronization resources, and private storage registers. Maintaining context on-chip enables multithreading among the thousands of active threads, with single-cycle context switching between groups of threads. However, the required context consumes millions of bytes of storage, orders of magnitude more than the context of the few threads present in a traditional CPU. The vector register file storage space alone is typically larger than the L1 data caches and consumes as much as 16 MB in a state-of-the-art, fully configured AMD Radeon™ RX "VEGA" GPU [8][9]. With a considerable amount of storage, die area, and energy being consumed by the vector register files, it is important to understand the use of this structure in GPGPU applications so that it may be optimized for performance and/or energy-efficiency.

This paper examines modern GPGPU architectures, focusing on their use of vector general-purpose registers and the vector register subsystem architecture. Our study consists of three main parts. First, we replicate experiments from prior work revealing the vector register usage patterns for a set of compute applications. We confirm the results of prior work, despite modeling a GPGPU architecture based on products from a different device vendor. Second, we evaluate the effectiveness of operand buffering and register file caching as proposed in prior work. Our experiments show these structures to be highly effective at hiding bank access conflict penalties and enabling vector register value reuse. Third, we examine the potential parallelism and occupancy benefit of a GPGPU architecture providing (physically or logically) twice the number of vector general-purpose registers. We show that the benefit of higher wave-level parallelism and device occupancy is application dependent.


For many developers this notion remains counter-intuitive.

The remainder of the paper is organized as follows. Section 2 provides background on GPGPU architecture and execution. Section 3 describes our analysis and simulation methodology. Sections 4, 5, 6, and 7 detail our experimental results. Section 8 covers related work, Section 9 provides thoughts on future research directions, and we conclude in Section 10.

Figure 1. Sample CU Architecture. (Blocks shown: instruction fetch, per-wavefront contexts, dependency logic, instruction arbitration and scheduler, two SIMD VALU/SALU execution units with their Vector and Scalar RFs, scalar/local/global memory pipelines, and the scalar cache, LDS, data cache, and I-cache.)

2. BACKGROUND
GPUs are massively-multithreaded processing devices that support over one hundred thousand active threads. Supporting this many active threads requires an architecture that is modular and compartmentalized, as well as a programming model to express data-parallel computation. This section details the GPGPU programming model, describes the hardware execution model, and details the specific GPU architecture used in this study.

2.1 GPGPU Programming Model
GPGPUs use a data-parallel, streaming computation programming model. In this model, a program, or kernel, is executed by a collection of work-items (threads). The programming model typically uses the single instruction, multiple thread (SIMT) execution model. Work-items within a kernel are subdivided into workgroups by the programmer, which are further subdivided into wavefronts by hardware. The work-items within a wavefront are logically executed in lock-step. All work-items within a workgroup may perform synchronization operations with one another.

The wavefront size is a hardware parameter that may change across architecture generations or between devices capable of executing the same Instruction Set Architecture (ISA) generation. Programmers should not rely on the wavefront size remaining constant across hardware generations and should not have dependencies on a specific wavefront size in their code.

2.2 GPGPU Hardware Execution Model
Modern GPU architectures execute kernels using a SIMD (Single Instruction, Multiple Data) hardware model. As mentioned above, a kernel is composed of many work-items that are collected into workgroups. The workgroup is the unit of dispatch to the Compute Units (CUs), the hardware units responsible for executing workgroups. A CU must be able to support at least one full-sized workgroup, but may be able to execute additional workgroups concurrently if hardware resources allow. All work-items from the same workgroup are executed on the same CU. A GPU device contains at least one CU, but it may contain more to facilitate execution of many workgroups concurrently.

Within a CU, the SIMD unit is the hardware component responsible for executing wavefronts. Each wavefront within a workgroup is assigned to a single SIMD within the CU the workgroup is dispatched to. The SIMD unit is responsible for executing all work-items in a wavefront in lock-step. Each SIMD has access to a scalar ALU (SALU), a branch and message unit, and memory pipelines.

AMD's GCN architecture [2] also includes scalar instructions that are executed on the scalar ALU. These scalar instructions are generated by the compiler, transparent to the programmer, and are intermixed with vector instructions in the instruction stream. Scalar instructions are used for control flow or operations that produce a single result shared by all work-items in a wavefront.

2.3 Baseline GPGPU Architecture
In this section we detail the CU architecture employed in our study. Figure 1 depicts the architecture of the CU we model, which is capable of executing AMD's GCN3 ISA [3]. Without loss of generality, we elect to use AMD's terminology where applicable. The CU used in our study contains two SIMD Vector ALUs (VALUs), two Scalar ALUs (SALUs), Vector Register Files (VRFs), Scalar Register Files (SRFs), a Local Data Share (LDS), forty wavefront slots, Local Memory (LM), Global Memory (GM), and Scalar Memory (ScM) pipelines, and the CU is connected to scalar, data, and instruction caches. The following subsections detail the main blocks within the CU. Note that the Scalar Cache and I-Cache are shared between multiple CUs, while all other blocks are private per CU.


Figure 2. Vector Register File Subsystem Architecture. (Shown: four Vector RF banks feeding the Operand Buffer and Register File Cache, which deliver operands across lanes 0-63 to the SIMD VALU.)

2.3.1 Wavefront Context
Each CU contains a total of forty wavefront context slots [2]. The wavefront slots are divided equally among the SIMD VALUs, and all instructions from a wavefront are executed by the same SIMD/SALU pair for the duration of the wavefront's life. The wavefront context consists of the program counter, register state information, synchronization and memory counters, and an instruction buffer.

2.3.2 SIMD VALU
Each SIMD within the CU is a sixty-four wide Vector ALU (VALU), capable of issuing for execution one sixty-four wide vector instruction per cycle.

2.3.3 Vector Register File Subsystem
The Vector Register File (VRF) subsystem consists of banked vector register files, containing 1024 64-wide by 32-bit Vector General-Purpose Registers (VGPRs) [3], Operand Buffers (OB) [12][15], and register file caches (RFC) [12]. There is a private VRF, OB, and RFC per SIMD VALU. Figure 2 depicts the various components and operand delivery paths in the VRF subsystem.

2.3.3.1 Banked VRF
The vector register file associated with each SIMD unit contains 128 KB of storage. A CU with two VALUs and two VRFs contains 256 KB of VGPR storage [2]. The VRF comprises multiple SRAM-based banks. Each bank has one read port and one write port, and both a read and a write may occur in the same cycle. In this study, we configure the VRF to have four banks, with each bank holding 128 VGPRs of 64 by 32-bit values. There are 512 VGPRs distributed across the four banks per VRF, with a total of 1024 VGPRs per CU [3]. The bank width matches the wavefront size to facilitate reading and writing an entire VGPR per cycle per bank.

2.3.3.2 Operand Buffer
The Operand Buffer (OB) [12][15] is responsible for reading the vector source operands of each VALU instruction. The primary purpose of the OB is to hide bank access conflict latency penalties. It is a FIFO queue, and instructions enter and leave the OB in-order. However, the OB may read source operands for any instruction present in the FIFO in any cycle (i.e., out-of-order with respect to the execution order). In this study, an oldest-first-then-greedy policy is used to read source operands, but this may be changed in future implementations. The OB attempts to read the operands of the oldest instruction first, but will greedily read operands for younger instructions to avoid bank conflicts or if there are banks with available read ports that contain operands for younger instructions. Source operands are read from the VRF unless the operand exists in the Register File Cache or will be produced by an instruction in the VALU pipeline. Reading all operands for an instruction may take multiple cycles due to bank conflicts. Bank conflicts may occur both within (intra-instruction) and between (inter-instruction) instructions.

2.3.3.3 Register File Cache
The register file cache (RFC) used in this work is inspired by the RFC proposed by Gebhart et al. [12]. The RFC sits between the VALU and the VRF banks. It receives results from the VALU pipeline, forwards those results to the OB and VALU pipeline for future instructions if needed, and lazily writes back results to the VRF.

The RFC holds data for one or more VGPR-sized entries. Each entry is one complete 64-wide by 32-bit VGPR. The RFC is an on-demand allocation and eviction cache, with strict LRU eviction and replacement. The RFC's primary purposes are: (a) forwarding source operands to the OB and VALU pipeline, thereby reducing the number of VRF reads, and (b) hiding the latency penalty of bank write access conflicts.

The RFC can provide up to three VGPRs of 64 32-bit values to the instruction being dispatched from the OB to the VALU pipeline over forwarding paths. This path operates similar to bypass paths in traditional computational pipelines, and allows source operands to be delivered directly to the VALU pipeline without waiting for the values to be read from and written to the VRF, saving access time and energy.

Operands may also be forwarded from the RFC to the OB as the OB attempts to read source operands for instructions. This path is activated when an existing RFC entry is evicted, and may further reduce source operand reads performed from the VRF. The evicted RFC entry is provided to every instruction in the OB that needs it.
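As a quick check of the capacities quoted above (each VGPR is 64 lanes of 32 bits, or 256 B):

    4 banks x 128 VGPRs/bank x 256 B/VGPR = 131,072 B = 128 KB per VRF
    2 VRFs/CU x 128 KB = 256 KB of VGPR storage per CU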

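To make the oldest-first-then-greedy operand read policy concrete, the following C++ sketch models one cycle of operand collection over the banked VRF. It is our own simplification rather than the simulator's code: the structure and function names are invented, and the modulo mapping of VGPR index to bank is an assumption.

    // Minimal sketch of one cycle of the oldest-first-then-greedy operand
    // read policy (Section 2.3.3.2); names and bank mapping are assumptions.
    #include <array>
    #include <deque>
    #include <vector>

    constexpr int kNumBanks = 4;                      // banks per VRF (Section 2.3.3.1)

    struct PendingInstruction {
        std::vector<int> unreadSrcRegs;               // VGPR indices still to be read
    };

    int bankOf(int vgpr) { return vgpr % kNumBanks; } // assumed interleaving

    // Each bank supplies at most one read per cycle. The oldest instruction is
    // served first; remaining bank ports are then granted greedily to younger
    // instructions in FIFO order.
    void collectOperands(std::deque<PendingInstruction>& operandBuffer) {
        std::array<bool, kNumBanks> bankBusy{};       // one read port per bank
        for (auto& inst : operandBuffer) {            // oldest first
            auto& regs = inst.unreadSrcRegs;
            for (auto it = regs.begin(); it != regs.end();) {
                int b = bankOf(*it);
                if (!bankBusy[b]) {                   // port free: read this cycle
                    bankBusy[b] = true;
                    it = regs.erase(it);
                } else {
                    ++it;                             // bank conflict: retry next cycle
                }
            }
        }
        // An instruction whose unreadSrcRegs is empty is ready to dispatch.
    }

    int main() {
        std::deque<PendingInstruction> ob;
        ob.push_back({{0, 4}});   // two sources that map to the same bank (0)
        ob.push_back({{1, 2}});   // sources in banks 1 and 2
        collectOperands(ob);      // cycle 1: reads v0, v1, v2; v4 waits (conflict)
        collectOperands(ob);      // cycle 2: reads v4
    }

In the example, the oldest instruction has an intra-instruction conflict, so one of its reads is deferred while the younger instruction's operands are read greedily in the same cycle.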

2.3.4 Scalar ALU and Scalar Register File
AMD's GCN architecture includes a Scalar ALU (SALU) to handle execution of scalar instructions. Unlike vector instructions that operate on each individual work-item in a wavefront, scalar instructions are executed once for all work-items in a wavefront. The primary purpose of scalar instructions is to handle control flow and perform thread-independent computation for the wavefront.

Our CU model includes two Scalar ALUs, with each SALU associated with one of the VALUs. Each SALU has a private Scalar Register File (SRF) containing 800 32-bit Scalar General-Purpose Registers (SGPRs) [3][8]. The SGPRs are assigned at dispatch time to wavefronts being executed by the VALU/SALU pair.

2.3.5 Memory Subsystem
The memory subsystem used in the baseline architecture in this paper is modeled after the GCN device architecture [2][3][8][9]. In this setup, a CU contains a private L1 vector data cache and Local Data Share (LDS) scratchpad memory. A CU shares a scalar data cache and instruction cache with a collection of other CUs in the system. All three caches (vector, scalar, instruction) are supported by a shared L2 cache, which in turn connects to main memory.

The vector L1 data cache is a 16 KB, 16-way set associative, 64-byte cache block SRAM cache [2]. The shared instruction cache is a 32 KB, 8-way set associative, 64-byte cache block SRAM cache [2]. The L2 cache is a 512 KB, 16-way set associative, 64-byte cache block SRAM cache [17]. The L2 cache unifies the scalar data, vector data, and instruction caches, and is connected to system memory.

Each CU also contains a 64 KB Local Data Share (LDS). The LDS is a software-managed cache, with 32 banks [3][8][9]. This structure provides a high-bandwidth, low-latency, software-managed memory and acts as a data cache bandwidth amplifier.

3. METHODOLOGY
This section describes the benchmark analysis and simulation methodologies used for the experiments presented.

3.1 Benchmark and Kernel Analysis
In this subsection we detail the static and dynamic analysis performed to assess register usage and dependency characteristics.

3.1.1 Kernel Analysis
To evaluate dispatch limits for the benchmarks under study, we rely on data produced by the compiler and disassembly tools for AMD's GCN3 ISA [3][4]. These tools provide the number of vector and scalar general-purpose registers required per work-item and wavefront, respectively. Simulation (methodology below) provides the number of workgroups executed per kernel dispatch. Combining these data with architectural parameters of our system, we are able to determine the resources that limit kernel and workgroup dispatch and evaluate dispatch limits as architectural parameters are varied.

3.1.2 Dynamic Register Profiling
We use the gem5 simulator [14], which includes a modified version of AMD's APU gem5 model [7] (details below), to collect register reuse and producer-consumer data. A simple implementation of the register file system is sufficient to provide both the number of consumers per value producer and the distance between producer and consumer for all vector register values.

3.2 The gem5 Simulator
The gem5 simulator is an execution-driven, cycle-level simulator that is capable of executing real ISAs on simulated hardware. AMD's recent APU extension has added support for GPU Compute Units within the simulation framework. The APU model is compatible with gem5's system call emulation (SE) mode, where system calls invoked by simulated applications are either emulated in the simulator or passed to the host for execution. In this study, we use an updated version of the AMD APU compute model that faithfully implements the GCN3 ISA and runs an unmodified, publicly-released ROCm [6] version 1.1 software stack, with only kernel driver functionality being emulated. We simulate an APU with one CPU and a single CU to stress the CU and VRF to the greatest extent.

The following subsections detail the instruction readiness, dispatch, and execution flow in our Compute Unit implementation. The remaining structures (VALU, SALU, VRF, SRF, etc.) are implemented faithfully to the descriptions provided in Section 2.3 above.

3.2.1 Wavefront and Instruction Readiness
As described in the GCN3 architecture, each SIMD VALU has many associated wavefronts it is responsible for executing, with our simulated architecture supporting twenty wavefronts per SIMD. Every cycle, each SIMD evaluates all wavefronts for readiness. A wavefront is deemed ready to execute if it is active, not waiting for a synchronization operation to complete, has at least one instruction to execute, and all true (RAW) register dependencies are resolved. Register dependencies are tracked using a scoreboard indicating which, if any, source operands have not been produced yet by the various functional units (VALU, SALU, memory pipelines) and are busy. Each wavefront that is ready is presented to the instruction dispatch unit as a candidate for execution.
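The readiness check itself reduces to a few predicate tests against per-wavefront state and the scoreboard. The C++ sketch below illustrates the idea; the field names are our own and the real gem5 model tracks considerably more state.

    // Sketch of the per-cycle readiness check of Section 3.2.1 (field names
    // are assumptions, not the simulator's).
    #include <bitset>
    #include <cstdio>
    #include <vector>

    constexpr int kVgprsPerSimd = 512;

    struct Wavefront {
        bool active = false;
        bool waitingOnBarrier = false;
        int  pendingInstructions = 0;          // instructions in its buffer
        std::vector<int> srcRegsOfNextInst;    // VGPRs read by the next instruction
    };

    struct Scoreboard {
        std::bitset<kVgprsPerSimd> busy;       // set while a result is still pending
    };

    // A wavefront is a dispatch candidate only if it is active, not blocked on
    // synchronization, has an instruction buffered, and all true (RAW)
    // dependencies of that instruction are resolved.
    bool isReady(const Wavefront& wf, const Scoreboard& sb) {
        if (!wf.active || wf.waitingOnBarrier || wf.pendingInstructions == 0)
            return false;
        for (int vgpr : wf.srcRegsOfNextInst)
            if (sb.busy.test(vgpr))            // producer has not written back yet
                return false;
        return true;
    }

    int main() {
        Scoreboard sb;
        sb.busy.set(7);                        // v7 still being produced
        Wavefront wf{true, false, 1, {3, 7}};
        std::printf("wavefront ready: %s\n", isReady(wf, sb) ? "yes" : "no");
    }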


3.2.2 Instruction Dispatch and Execution
After wavefronts are checked for readiness, the instruction dispatch unit selects and attempts to schedule for execution up to one wave per execution resource. The execution resources are the VALUs, SALUs, and memory pipelines. Each cycle, a scheduler selects a candidate wave for each resource, typically using an oldest-first policy.

To be dispatched for execution, a selected wave must first gather all source operands from the register files. Each non-scalar instruction is sent to the vector register file, with VALU operations requesting a slot in the Operand Buffer, and vector memory (VMEM) instructions requesting access to the VRF banks. VMEM instructions receive priority for reading VRF banks. Once operands are read from the register files, the appropriate execution resources are checked for readiness. An execution resource may disallow instruction issue due to certain conditions, such as issue period limitations or full buffers (e.g., for vector memory coalescing).

After all source operands and execution resources are ready, an instruction is deemed ready for execution. At this point all non-vector ALU operations will be issued for execution. VALU operations must make one final request to the register file cache to allocate slots for destination registers. If the RFC is unable to allocate slots for the instruction, VALU instruction issue will stall. Once the RFC accepts the destination slot allocation request, the VALU operation will be issued to the pipeline for execution.

At the end of instruction execution, the destination values will be written into the RFC and the scoreboard updated to indicate result data are available for use. Register file write-back operations occur lazily from the RFC. Memory loads are enqueued in the memory pipelines and return data in variable latencies depending on memory system behavior and contention. Loads update the scoreboard and write back results to the VRF once data return from the memory system.

3.3 Benchmarks
Table 1 lists the benchmarks used in this study. These applications are obtained from the AMD compute applications GitHub [1][5]. The applications used in this study represent common kernels from HPC and scientific computing workloads that are of interest in the GPGPU community.

The selected applications are written using the heterogeneous compute (HC) C++ API. Source code is compiled using the heterogeneous compute compiler (HCC) [4], which is based on Clang and LLVM. HCC is an open-source compiler for heterogeneous compute applications that target the ROCm stack.

Table 1. Description of evaluated workloads.
Array-BW: Memory streaming
Bitonic Sort: Parallel merge sort
CoMD: DOE molecular-dynamics algorithms
FFT: Digital signal processing
HPGMG: Ranks HPC systems
MD: Generic molecular-dynamics algorithms
SNAP: Discrete ordinates neutral particle transport application
SpMV: Sparse matrix-vector multiplication
XSBench: Monte Carlo particle transport simulation

4. VECTOR REGISTER USAGE
Prior works [10][12] have examined the usage of vector register values in GPGPU architectures and concluded that most values produced are consumed a small number of times within a small instruction window from the producer instruction, and many registers do not contain live values for significant portions of execution.

The authors of [12] claim up to 70% of values are read only once, and only around 10% of values are read more than 2 times. This prior study evaluates an architecture modeled after those from NVIDIA, thus it is worth asking the question: do the same patterns hold for AMD GPUs and GCN3 ISA code?

Figure 3. Number of reads per vector register value. (Per-benchmark breakdown of values read 0, 1, 2, or more than 2 times, as a percent of all vector values produced.)
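The profiling of Section 3.1.2 that produces the distributions in Figure 3 (and Figure 4, below) can be thought of as a pass over the dynamic instruction stream that tracks, for each VGPR, the instruction that produced its current value. The C++ sketch below is our own illustration over a hypothetical three-instruction trace; it counts reads per produced value and producer-to-consumer distances.

    // Sketch of the dynamic register profiling: reads per value and
    // producer->consumer distance. The trace is hypothetical.
    #include <cstdio>
    #include <numeric>
    #include <unordered_map>
    #include <vector>

    struct TraceOp {                       // one dynamic instruction
        std::vector<int> srcs;             // VGPRs read
        std::vector<int> dsts;             // VGPRs written
    };

    int main() {
        // Hypothetical trace: v2 = v0 + v1; v3 = v2 * v0; v2 = v4 + v5
        std::vector<TraceOp> trace = {
            {{0, 1}, {2}},
            {{2, 0}, {3}},
            {{4, 5}, {2}},                 // overwrites v2, ending its first lifetime
        };

        struct Producer { long inst; int reads; };
        std::unordered_map<int, Producer> live;       // VGPR -> producer of current value
        std::unordered_map<int, long> readHistogram;  // reads-per-value distribution
        std::vector<long> distances;                  // producer->consumer distances

        for (long i = 0; i < (long)trace.size(); ++i) {
            for (int s : trace[i].srcs) {
                auto it = live.find(s);
                if (it != live.end()) {
                    ++it->second.reads;
                    distances.push_back(i - it->second.inst);
                }
            }
            for (int d : trace[i].dsts) {
                auto it = live.find(d);
                if (it != live.end())                 // value dies on overwrite
                    ++readHistogram[it->second.reads];
                live[d] = {i, 0};
            }
        }
        for (auto& [reads, count] : readHistogram)
            std::printf("%ld value(s) read %d time(s)\n", count, reads);
        double avg = distances.empty() ? 0.0
                   : double(std::accumulate(distances.begin(), distances.end(), 0L))
                     / distances.size();
        std::printf("average producer->consumer distance: %.1f instructions\n", avg);
    }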


We replicate two of their studies and find that 60-90% (75% average) of vector values produced are read exactly once, and 4-13% (10% average) of vector values produced are read more than twice, as shown in Figure 3. Further, we find 24-57% (40% average) of all values consumed were produced within 3 instructions prior, as shown in Figure 4. In both experiments, our results are in line with prior work. While not overly surprising, this adds confidence to the remainder of our evaluation and experiments, and confirms that the codes being used for evaluation exhibit similar behavior despite differences in implementation, compilers, and ISA.

Prior studies also examine the liveness of register values over the course of execution [10][11]. The authors' conclusions are that many registers are short-lived and thus contain no live values for a significant percentage of a kernel's execution. We replicated their results (data not shown) and confirm that the applications used in our study exhibit similar behavior. No application utilized all of its compiler-allocated registers, and all applications had register usage patterns with significant variation in the number of live register values throughout execution.

Figure 4. Lifetime of vector register values. (Per-benchmark breakdown of consumed values produced 1, 2, 3, or more than 3 instructions before the consumer, as a percent of all vector values consumed.)

5. REDUCING THE NUMBER OF REGISTER FILE READS
One function of the OB and RFC is to reduce the number of reads from the vector register file. As described above, the RFC is able to both recycle operands to the OB and forward source operands to the VALU pipeline at instruction dispatch. Each of these paths reduces the number of VGPR reads performed from the main register file by the OB. In this experiment, we examine the number of reads required by the OB for VALU instructions that can be saved by resizing the RFC, and discuss the performance and implementation implications of such changes.

Figure 5 shows the number of reads saved for each RFC configuration. We sweep RFC sizes from 2 through 512 total entries, with the y-axis showing the percentage of vector sources that are provided by the RFC to the OB, or equivalently, the number of VRF reads saved (higher is better) for VALU instructions. Figure 6 shows the relative performance for each RFC configuration. The y-axis is IPC normalized to an RFC with eight entries (higher is better).

At small RFC sizes (2 or 4 entries), we observe that up to 25% of all possible reads required by the OB from the VRF are avoided. At these sizes, we observe performance degradation, caused by the timing of RFC entry allocation in our simulator implementation. An RFC slot is allocated when an instruction is dispatched to the VALU pipeline, and because the pipeline latency is larger than the number of RFC slots for small sizes, instruction issue stalls on RFC allocation.
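A minimal model of the RFC's allocate-on-dispatch and LRU-eviction behavior, which is what this sweep varies, might look like the following C++ sketch. The class and method names are ours, and the real structure also holds the 64-lane data and the lazy write-back machinery.

    // Minimal LRU model of the register file cache (Sections 2.3.3.3 and 5);
    // a simplification under our own naming, not the simulator's code.
    #include <cstdio>
    #include <cstddef>
    #include <list>
    #include <optional>
    #include <unordered_map>

    class RegisterFileCache {
    public:
        explicit RegisterFileCache(std::size_t entries) : capacity_(entries) {}

        // Called at VALU dispatch for each destination VGPR. Returns the VGPR
        // evicted (to be lazily written back to the VRF), if any.
        std::optional<int> allocate(int vgpr) {
            touch(vgpr);
            if (lru_.size() <= capacity_) return std::nullopt;
            int victim = lru_.back();          // least recently used entry
            lru_.pop_back();
            index_.erase(victim);
            return victim;                     // forwarded to the OB, then written back
        }

        // Operand lookup by the OB: a hit avoids a VRF bank read.
        bool lookup(int vgpr) {
            if (!index_.count(vgpr)) return false;
            touch(vgpr);
            return true;
        }

    private:
        void touch(int vgpr) {
            auto it = index_.find(vgpr);
            if (it != index_.end()) lru_.erase(it->second);
            lru_.push_front(vgpr);
            index_[vgpr] = lru_.begin();
        }

        std::size_t capacity_;
        std::list<int> lru_;                                     // MRU at front
        std::unordered_map<int, std::list<int>::iterator> index_;
    };

    int main() {
        RegisterFileCache rfc(2);                 // a two-entry RFC, as in RFC-2
        rfc.allocate(1);
        rfc.allocate(2);
        auto evicted = rfc.allocate(3);           // VGPR 1 is LRU and is evicted
        std::printf("evicted: v%d, v3 cached: %s\n",
                    evicted ? *evicted : -1, rfc.lookup(3) ? "yes" : "no");
    }

The example shows why very small configurations behave like the RFC-2 point in the sweep: the third allocation already forces an eviction and a write-back.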

Figure 5. Number of VGPR reads saved by RFC forwarding paths. (Per-benchmark percentage of VGPR VRF reads saved, for RFC sizes of 2 through 512 entries.)


Figure 6. Relative performance (IPC) for RFC sizes of 2 through 32 entries, per benchmark. Performance at sizes greater than 32 is stable at a relative IPC of 1.

Pipeline bubbles are introduced, diminishing performance. This behavior is an artifact of the simulator implementation, and correcting it is left as future work.

At larger RFC sizes (16 or more entries), we observe the number of saved reads increases significantly with RFC size. More than half of the benchmarks studied are able to serve more than 80% of all required VGPR operands for VALU instructions from the RFC at the largest RFC configuration. Although the RFC is as large as the VRF at size 512, not all source operands will be captured for reuse by the RFC. Any value loaded by a memory instruction into VGPRs must be read from the VRF instead of the RFC to ensure correctness, until the physical VGPR loaded to is overwritten by a VALU instruction and the result stored in the RFC. Additionally, some VGPR values are initialized by hardware at wavefront dispatch. Until the physical VGPRs holding these values are overwritten by a VALU instruction, if they ever are, those values will be read from the VRF, not the RFC. At larger RFC sizes, no performance benefit is observed. This implies that the RFC and OB are effective at hiding latency penalties from bank access conflicts at default sizes of eight RFC entries and four OB entries (details in Section 6). Although increasing the RFC size does not lead to better performance, it may lead to reduced VRF access energy as more operands can be provided by the lower-energy forwarding paths. However, there are trade-off costs in implementation (area, power, and latency) as RFC size increases that may make larger RFCs less energy-efficient.

6. REGISTER BANK CONFLICTS IN A COMPUTE-OPTIMIZED GPU ARCHITECTURE
The VGPRs are physically stored in a banked register file to provide high-bandwidth access without the overhead of multi-ported register files. Each bank is the width of a GCN3 wavefront, or 64 32-bit entries, or equivalently, one VGPR wide, and can read and write one VGPR per cycle.

In both the GCN3 and our simulated architecture, both intra- and inter-instruction bank conflicts may occur. Intra-instruction read conflicts occur when two or more source operands in the same instruction reside in the same physical VGPR bank. Inter-instruction read conflicts occur when multiple instructions have operands residing in the same bank and attempt to read them in the same cycle. Given our GPU architecture detailed above, the natural question to ask is: how effective are the RFC and OB at hiding the latency penalties of bank access conflicts?

We answer this question by performing a limit study on the performance benefit of removing bank conflicts. As discussed in Section 5, the RFC and OB significantly reduce the number of accesses to the VRF.
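For concreteness, the following C++ sketch classifies intra- and inter-instruction read conflicts for a pair of hypothetical instructions, again assuming a four-bank VRF with a modulo mapping of VGPR index to bank; the mapping and the register indices are our own assumptions.

    // Classifying intra- vs. inter-instruction VRF read conflicts (Section 6).
    #include <array>
    #include <cstdio>
    #include <vector>

    constexpr int kNumBanks = 4;
    int bankOf(int vgpr) { return vgpr % kNumBanks; }   // assumed interleaving

    int main() {
        // Two instructions attempting reads in the same cycle (hypothetical VGPRs).
        std::vector<int> instA = {4, 8};    // both map to bank 0: intra-instruction
        std::vector<int> instB = {12, 5};   // v12 also maps to bank 0: inter-instruction

        std::array<int, kNumBanks> readsThisCycle{};
        for (int r : instA) ++readsThisCycle[bankOf(r)];
        bool intra = false;
        for (int n : readsThisCycle) intra |= (n > 1);   // two reads hit one bank

        bool inter = false;
        for (int r : instB) inter |= (readsThisCycle[bankOf(r)] > 0);

        std::printf("intra-instruction conflict: %d, inter-instruction conflict: %d\n",
                    int(intra), int(inter));
    }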


Table 2. Maximum workgroups per CU per application when limited only by Wavefront Slots (WF), VGPRs, or SGPRs.
Application: WF / VGPR / SGPR / Limiter
Array-BW: 2 / 1 / 12 / VGPR
Bitonic-Sort: 2 / 4 / 12 / WF
CoMD: 40 / 28 / 100 / VGPR
FFT: 40 / 12 / 66 / VGPR
HPGMG: 2 / 1 / 4 / VGPR
MD: 2 / 1 / 6 / VGPR
SNAP: 40 / 14 / 100 / VGPR
SpMV: 20 / 21 / 100 / WF
XSBench: 10 / 4 / 25 / VGPR

Figure 8. Relative performance with 2x VGPR per SIMD (baseline vs. 2x VGPR, per benchmark).

However, these structures are also meant to hide the latency penalty of bank access conflicts. Specifically, the OB acts as a read buffer to opportunistically read operands when banks become available and prevent the insertion of bubbles into the VALU pipeline. To assess the effectiveness of the OB at hiding conflict penalties, we compare our baseline architecture to one without bank conflicts in the VRF. We reconfigure the simulator to place each VGPR in its own VRF bank. By definition, two different VGPRs will not conflict with one another in this setup.

Figure 7 shows the results of this experiment and displays normalized performance of the baseline and bank conflict free configurations. The y-axis is performance in IPC, normalized to the baseline configuration (higher is better). As shown, there is negligible change in IPC between the two configurations (typically less than +/- 1%), indicating that the RFC and OB are well-suited to handling all bank access conflicts encountered during dynamic execution.

Figure 7. Relative performance comparing the baseline architecture to a bank conflict free configuration (note the y-axis begins at 95%).

There are a few different conclusions that may be drawn from these results. Perhaps most obvious is that the RFC and OB are effective at hiding the latency penalty of any conflicts that occur. This may be due to the OB's ability to gather operands out-of-order with respect to instruction issue order, or that the RFC's forwarding paths are able to adequately reduce the number of VRF accesses required by the OB. Fewer required accesses means less register file pressure and a lower probability of bank conflicts. Another possible explanation is that the applications studied have lower than expected dynamic register usage. Although most of the applications are limited by VGPRs in dispatch (see Section 7), the dynamic usage may not be as great as the static demand. It is possible that codes in other domains may have greater dynamic register usage.

7. DISPATCH LIMITS AND WAVE-LEVEL PARALLELISM
In this section we examine the resource requirements for workgroup dispatch for each benchmark studied and examine the impacts of increased resource availability in terms of performance and parallelism.

7.1 Workgroup Dispatch Limits
Launching, or dispatching, a kernel for execution on a GPU requires a set of resources to be available. These resources are wavefront slots, vector registers, scalar registers, and scratch memory. For our applications, we find that the majority are vector register limited for dispatch. Table 2 shows the results of offline analysis on the compute kernels and lists the maximum number of workgroups that can be dispatched per CU when only considering one resource at a time (other resources assumed infinite). Each application has a single kernel, except FFT, which has both a forward and an inverse FFT kernel. The FFT kernels, however, have identical resource requirements and are executed sequentially. The three columns give the maximum number of workgroups per CU when limited only by wavefront slots (WF), VGPR availability (VGPR), and SGPR availability (SGPR). The final column (Limiter) lists which resource limits dispatch and prevents further workgroups from being dispatched. Of the nine applications we study, only Bitonic-Sort and SpMV are not VGPR limited for dispatch.

Because many applications are limited by VGPR availability, it is natural to ask: does providing additional VGPRs result in improved wave-level parallelism and/or performance?
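To illustrate how the per-resource limits in Table 2 are derived, the following C++ sketch applies the CU capacities from Section 2.3 to a hypothetical kernel. The kernel's per-wavefront requirements are invented for the example and do not correspond to any benchmark in Table 2; the calculation also assumes a workgroup's waves divide evenly across the two SIMDs.

    // Per-resource dispatch-limit calculation (Section 7.1), illustrative only.
    #include <algorithm>
    #include <cstdio>

    int main() {
        // Per-CU capacities (Section 2.3; two SIMD/SALU pairs per CU).
        const int wfSlotsPerCu = 40;
        const int vgprsPerSimd = 512, numSimds = 2;
        const int sgprsPerSalu = 800, numSalus = 2;

        // Hypothetical kernel: 4 wavefronts per workgroup, 48 VGPRs and
        // 32 SGPRs per wavefront (illustrative numbers, not from a benchmark).
        const int wavesPerWg = 4, vgprsPerWave = 48, sgprsPerWave = 32;

        int byWf   = wfSlotsPerCu / wavesPerWg;                              // 10
        int byVgpr = (vgprsPerSimd / vgprsPerWave) * numSimds / wavesPerWg;  // 5
        int bySgpr = (sgprsPerSalu / sgprsPerWave) * numSalus / wavesPerWg;  // 12

        std::printf("WG/CU limited by WF slots: %d, by VGPRs: %d, by SGPRs: %d\n",
                    byWf, byVgpr, bySgpr);
        std::printf("dispatch limit: %d (VGPR limited)\n",
                    std::min({byWf, byVgpr, bySgpr}));
    }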


7.2 Increasing Wave-Level Parallelism and Performance with Additional VGPRs
General-purpose registers are shown to be the limiting factor in workgroup dispatch for seven of the nine benchmarks in this study. GPGPUs are throughput accelerators, and it is often thought that improving the amount of available work, or occupancy, will result in improved performance. However, prior work has shown that additional parallelism may not always be beneficial [16].

In this experiment, we assess the benefits of having additional VGPRs available per SIMD VALU and CU. For the benchmarks studied, do the additional VGPRs (a) allow more workgroups to be dispatched and increase the wave-level parallelism (WLP), and (b) if WLP is increased, is there a resulting performance benefit?

To estimate the best-case improvement, we modify our simulator configuration to have twice the number of VGPRs per SIMD/CU (1024 VGPRs per SIMD, 512 KB per CU), but do not modify any timing for VGPR access. Assuming no timing penalty for a larger VRF is optimistic, especially when the register files are as large as those used in GPUs, but it allows us to study the upper-bound benefit of additional VGPRs.

Figure 8 and Figure 9 show the results of our experiment. Figure 8 displays relative performance in IPC for the baseline and 2x VRF configurations, normalized to the baseline. The y-axis is improvement in IPC over the baseline, and higher is better. Figure 9 shows the realized wave-level parallelism (WLP) for the baseline and 2x VRF configurations. The y-axis is the average WLP observed normalized to the baseline configuration, and a value greater than one indicates greater observed WLP. We measure WLP by counting the number of already active wavefronts when each new wavefront is dispatched, then averaging this count over all wavefront dispatches.

Figure 9. Relative WLP with 2x VGPR per SIMD (baseline vs. 2x VGPR, per benchmark).

Our experimental results are mixed. All seven applications that are VGPR limited for dispatch observe increased average WLP. However, the performance benefit of higher WLP is mixed. CoMD, MD, and XSBench experience performance degradation, FFT sees no performance change, and Array-BW, HPGMG, and SNAP see improved performance.

For the benchmarks with a performance loss, averaging around 7% worse IPC, we suspect memory access divergence is at fault. Prior analysis (data not shown) revealed MD and XSBench have greater memory divergence per memory instruction. These two benchmarks also have long tails in their memory access latency distributions compared to other applications, as shown in Figure 10. We only show a subset of the benchmarks for clarity in the figure; however, the benchmarks not shown have CDFs that fall between those shown for FFT and SpMV. The divergence in memory requests increases the number of post-coalescing memory requests per instruction and appears to cause an increase in average memory
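The WLP metric described above can be computed from a simple event stream of wavefront dispatches and completions. The C++ sketch below is a self-contained illustration with made-up events; it samples the number of already-active wavefronts at each dispatch and averages the samples.

    // Sketch of the wave-level parallelism (WLP) metric used for Figure 9.
    // The event stream is hypothetical.
    #include <cstdio>
    #include <vector>

    int main() {
        // +1 = a wavefront is dispatched, -1 = a wavefront completes.
        std::vector<int> events = {+1, +1, +1, -1, +1, +1, -1, -1};

        int active = 0;
        long samples = 0, sum = 0;
        for (int e : events) {
            if (e > 0) {                 // dispatch: sample WLP before adding it
                sum += active;
                ++samples;
                ++active;
            } else {
                --active;                // completion
            }
        }
        std::printf("average WLP = %.2f\n", samples ? double(sum) / samples : 0.0);
    }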