This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Bank Stealing for a Compact and Efficient Register File Architecture in GPGPU

Naifeng Jing, Shunning Jiang, Shuang Chen, Jingjie Zhang, Li Jiang, Chao Li, Member, IEEE, and Xiaoyao Liang

Abstract— Modern general-purpose graphic processing units (GPGPUs) have emerged as pervasive alternatives for parallel high-performance computing. The extreme multithreading in modern GPGPUs demands a large register file (RF), which is typically organized into multiple banks to support the massive parallelism. Although a heavily banked structure benefits RF throughput, its associated area and energy costs with diminishing performance gains greatly limit future RF scaling. In this paper, we propose an improved RF design with bank stealing techniques, which enable a high RF throughput with a compact area. By deeply investigating the GPGPU microarchitecture, we find that the state-of-the-art RF design is far from optimal due to the deficiency in bank utilization, which is the intrinsic limitation to a high RF throughput and a compact RF area. We investigate the causes for bank conflicts and identify that most conflicts can be eliminated by leveraging the fact that the highly banked RF oftentimes experiences underutilization. This is especially true in GPGPUs, where multiple ready warps are available at the scheduling stage with their operands to be wisely coordinated. In this paper, we propose two lightweight bank stealing techniques that can opportunistically fill the idle banks and register entries for better operand service. Using the proposed architecture, the average GPGPU performance can be improved under a smaller energy budget with significant area saving, which makes it promising for sustainable RF scaling.

Index Terms— Area efficiency, bank conflict, bank stealing, general purpose graphic processing unit (GPGPU), register file (RF), register liveness.

Manuscript received December 16, 2015; revised March 9, 2016 and May 4, 2016; accepted June 4, 2016. This work was supported in part by the National Natural Science Foundation of China under Grant 61402285, Grant 61202026, and Grant 61332001, in part by the China Postdoctoral Science Foundation under Grant 2014T70418, and in part by the Program of the China National 1000 Young Talent Plan and Shanghai Science and Technology Committee under Grant 15YF1406000. (Corresponding author: Xiaoyao Liang.)
N. Jing, L. Jiang, C. Li, and X. Liang are with Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]).
S. Jiang and S. Chen were with Shanghai Jiao Tong University, Shanghai 200240, China. They are now with Cornell University, Ithaca, NY 14850 USA (e-mail: [email protected]; [email protected]).
J. Zhang is with Fudan University, Shanghai 200433, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVLSI.2016.2584623
1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

MODERN general-purpose graphic processing units (GPGPUs) have emerged as pervasive alternatives for high-performance computing. GPGPU exploits the inherent massive parallelism by utilizing a large number of concurrent threads executed in a single instruction multiple threads (SIMT) manner. To hide the long memory access latency during the execution, the SIMT execution is featured by a fast thread switching mechanism, termed zero-cost context switching, supported by a large-sized register file (RF), where the active contexts of the concurrent threads can be held in each streaming multiprocessor (SM). For instance, the classic Nvidia Fermi architecture includes 15 SMs, where each SM is equipped with 32 768 32-bit registers to support 1536 concurrent threads, resulting in nearly 2 MB of on-chip RF storage in a single GPU [1]. The capacity keeps increasing in each product generation, e.g., doubled in the Kepler architecture [2], in pursuit of even higher thread-level parallelism (TLP).

To manage such a large RF, a multibanked structure is widely used [3]–[5]. A multibanked RF sustains the multiported RF throughput by serving concurrent operand accesses as long as there are no bank conflicts. If a conflict occurs, the requests have to be serialized, and the RF throughput is reduced, leading to performance loss. As a result, in the face of RF capacity scaling in future GPGPUs, adding more banks is a natural way to accommodate the increasing number of registers without worsening the bank conflict problem. Otherwise, the performance gained from higher TLP could be compromised because more threads from the schedulers will be competing for an insufficient number of RF banks, which aggravates the conflicts and reduces the throughput.

However, banks are not free. Adding banks requires extra column circuitry, such as sense amplifiers, prechargers, bitline drivers, and a bundle of wires [6]. Meanwhile, the crossbar between banks and operand collectors grows significantly with the increasing number of banks. All these issues affect the RF area, timing, and power [7]. As modern GPGPUs tend to become area-limited, and they have already been power-limited as pointed out in [5], it is important to incorporate a suitable number of banks while seeking more economic ways to mitigate the conflicts for the ever-increasing RF capacity.

In this paper, we explicitly address the issue of an efficient RF design to sustain the capacity scaling in future GPGPUs. We first identify the causes for the bank conflicts. Then, in the presence of the many bubbles observed in bank utilization and RF entries, we propose to apply fine-grained architectural control by stealing the idle banks to serve concurrent operand accesses, which effectively recovers the throughput loss due to bank conflicts.


Fig. 1. Block diagram of a typical GPGPU, highlighting the banked RF with operand collectors and a crossbar unit.

By leveraging the multiwarp and multibank characteristics in GPGPUs, we make lightweight modifications to the compiler as well as the pipeline for a more balanced utilization of the existing RF resources instead of simply adding more banks. Experimental results demonstrate that our proposed RF with bank stealing can deliver better performance, less energy, and significant area saving.

This paper differs from the previous proposals in many ways. For example, we leverage single-ported static random access memory (SRAM) to build the RF. Although it may cause more conflicts, as demonstrated in Section III-B, it is more area- and power-efficient than the multiported SRAM or pseudodual-port RF solutions for the conflict problem, which add at least 30% area, as reported in [8]. At the same time, we stick to the conventional monolithic banking structure, which has lower design complexity and a more compact area compared with the multilevel RF designs [4]. We also step forward from our prior work [9] by identifying all the possible bank conflicts and proposing a combined solution. Using the proposed bank stealing techniques and combined structure, the performance of our single-ported and monolithic RF architecture is comparable to or even better than that of other types of RF designs, while the much reduced silicon area makes it a promising alternative for RF capacity scaling in future GPGPUs.

The remainder of this paper is organized as follows. Section II reviews the background and preliminaries of modern GPGPU architecture. Section III introduces our motivation. Section IV proposes our techniques with detailed architectural support. Sections V and VI present the evaluation methodology and experimental results, respectively. Finally, Section VII describes related work, and Section VIII concludes this paper.

II. BACKGROUND AND PRELIMINARIES

A. High-Throughput GPGPU Architecture

A modern GPGPU consists of a scalable array of multithreaded SMs, which execute thousands of concurrent threads to enable the massive TLP for the extraordinary computational throughput. Inside each SM, threads are often issued in a warp. At any given clock cycle, a ready warp is selected and issued to the data processing lanes by the scheduler. For example, in the classic Nvidia Fermi GPGPU architecture, each SM typically has 32 data processing lanes, two schedulers, and a large number of execution units.

To keep the execution units busy even if some warps are stalled for long memory operations, a large number of warps can be switched with zero-cycle penalty, which requires that every thread in an SM is allocated some dedicated physical registers in the RF. As a result, the RF size is generally very large. More importantly, the RF capacity keeps doubling for every product generation in pursuit of even larger parallelism.

B. Register File

The requirement for simultaneous multiple read/write operations poses challenges to the RF design. For example, in the Nvidia PTX standard [10], each instruction can read up to three registers and write one register. To guarantee the throughput, the simplest solution is to use a multiported SRAM, which requires two more transistors per bit for an additional port. Obviously, constructing such a heavily ported (six-read and two-write ports for dual-issue) large RF (hundreds of kilobytes) is typically not feasible or economical.

To solve this problem, the banked RF structure [3]–[5] has been proposed to avoid the unaffordable area and power overhead of the multiported designs. While there are various possible RF organizations, one of the most commonly used is to combine multiple single-ported banks into a monolithic RF, which reduces the power and design complexity. Each bank has its own decoder, sense amplifiers, and so on to operate independently. Registers associated with each warp are mapped to the banks according to the register/warp identifiers and the maximum registers a thread can use [3], such that different registers in a warp and the same registers in different warps can be evenly distributed across the multiple physical banks. In this way, the mapping offers fair balancing and a higher possibility for the multiple operands in an instruction to be fetched across multiple banks in a single cycle with no conflicts. Otherwise, they have to be serialized, degrading the RF throughput.

A typical banked RF structure is shown in Fig. 1. For a scheduled instruction to be dispatched into execution, it first reserves a free operand collector. The collectors can receive multiple operands via the crossbar concurrently. If running out of collectors, the instruction waits until a collector becomes available. The arbiter receives RF requests from the collectors, and then tries to distribute them to different banks, but it only allows a single access to the same bank at a time. Once bank conflicts occur, the requests have to be serialized.
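As a concrete illustration of the register-to-bank mapping described above, the following sketch interleaves the registers of a warp across banks and skews by warp identifier. The modulo function and the constants are our own assumption; the paper only states that the mapping uses the register/warp identifiers [3].

```python
NUM_BANKS = 16  # banks per SM in a Fermi-like baseline

def bank_of(warp_id: int, reg_id: int, num_banks: int = NUM_BANKS) -> int:
    """Map warp register (warp_id, reg_id) to a physical bank.

    Consecutive registers of one warp fall into consecutive banks, and
    the same register of different warps is skewed across banks, so both
    the operands of one instruction and interwarp accesses spread out.
    """
    return (reg_id + warp_id) % num_banks

# Registers r0..r3 of warp 5 occupy four distinct banks:
assert len({bank_of(5, r) for r in range(4)}) == 4
# Register r2 of warps 0..3 also lands in four distinct banks:
assert len({bank_of(w, 2) for w in range(4)}) == 4
```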


For example, write operations usually take higher priority over reads accessing the same bank [7], [11], [12], anticipating an early clearance of the dependent instructions on the scoreboard. After all the operands for an instruction are collected, it is dispatched to a dedicated execution unit, e.g., the integer, special function, or ld/st units, and the hosting collector is released. Note that the bank conflicts directly affect the operand fetching latency, and then saturate the limited number of collectors dedicated to the execution units, eventually causing issue stalls at the beginning of the execution pipeline. With fewer conflicts, collectors can turn around faster and better serve the execution units.

Fig. 2. Performance with different numbers of banks.

III. MOTIVATION

In pursuit of larger TLP, the RF capacity keeps doubling for every product generation. Conventional designs simply apply coarse-grained structural expansion by adding more banks to accommodate more registers. However, this solution is not scalable due to the significant area and power overheads with marginal performance gains.

A. Design Space Exploration

Conflicts have been acknowledged as the major reason for the reduced RF throughput. Adding banks can improve the throughput, because the possibility of conflicts is reduced by distributing registers into more independent banks. In this section, we conduct a design space exploration to experimentally study how banking impacts the performance, area, delay, and power of a pilot RF design described in Section V.

1) Performance: We vary the number of banks with the total RF capacity fixed to 128 KB per SM, which is typical in Fermi-like GPGPUs [1]. The performance results are averaged across all the benchmarks and shown in Fig. 2 by normalizing to the 4-bank RF. Fig. 2 shows that adding more banks benefits the performance significantly when a small number of banks is used. With increasing banks, the benefit diminishes, as the 16-bank RF delivers almost the same performance as the 32-bank case due to the much reduced conflicts. The exploration reveals that 16 banks can be optimal for the 128-KB RF.

We also conduct the exploration with 2× TLP on a 256-KB RF and 4× TLP on a 512-KB RF per SM to study the features in newer GPGPU architectures, such as Kepler. Detailed results are given in Section VI-C. Briefly, it is observed that the optimal design points shift toward more banks with a larger RF for higher throughput. When more warp instructions are competing for RF accesses, the number of banks needs to be increased to reduce the chance of conflicts. Otherwise, the performance gains from larger TLP can be compromised.

2) RF Area: We use a 128-KB RF as a case study. Fig. 3(a) shows the breakdown of the banking area, such as the array cells, peripheral circuitry, and crossbar. It clearly shows that more banks cause significant area overhead. The area of the array cells stays constant across different schemes due to the fixed RF capacity. In contrast, the area of the peripheral circuitry, such as decoders, prechargers, equalizers, bitline drivers, sense amplifiers, and output drivers, expands notably with increased banks. This is because the GPGPU registers are of 128 bytes,1 much wider than those in CPUs. This significantly increases the number of bitlines and widens the data buses. Each individual bank duplicates the peripheral circuitry for independent array control, resulting in even larger area expansion compared with the conventional CPUs [13]. Meanwhile, the dimension of the crossbar between RF banks and collectors grows significantly with the number of banks. As a result, nearly 50% additional area is observed from a 16-bank RF to a 32-bank RF, although the two designs deliver quite similar performance, as shown in Fig. 2.

1There are 32 threads per warp, each with a register of 32 bits. These threads access the RF simultaneously for the 32 registers, resulting in a combination of 128 bytes for each warped register access.

3) RF Timing: Because we fix the RF capacity, a larger bank count naturally results in shorter access time inside the memory array, as shown in Fig. 3(b). However, array access time is just a portion of the total RF delay, and the advantage is compromised by the wire transfer time via a larger crossbar. With an increasing number of banks, the crossbar delay tends to dominate. However, we find that even the worst case scheme studied in this paper can still meet the sub-1-GHz GPGPU frequency, so we do not consider their timing differences.

4) RF Power: Fig. 3(c) shows the dynamic power of the banks. The cell array power slightly decreases with more banks, because the bitlines are shorter under fixed RF capacity. In contrast, the dynamic power spent on data transfers via the crossbar increases considerably due to the larger RF area. Meanwhile, the leakage power increases significantly, as shown in Fig. 3(d), because the banks duplicate the peripheral circuitry.

5) Summary: The design space exploration reveals that fewer banks benefit the RF physical design, such as area, timing, and power. However, simply reducing the number of banks is not acceptable from the performance point of view. For example, there is around 5% performance loss from a 16-bank design to an eight-bank design, as shown in Fig. 2, while this paper shows that there is still room for performance improvement even above the conventional 16-bank RF, if the conflicts can be effectively eliminated, as we will demonstrate with our proposed 8-bank RF design.

B. Causes for Bank Conflicts

Without conflicts, the operands should pass through the RF with one cycle latency, offering maximum throughput. In reality, the operands have to go through many steps until they can be dispatched to the execution units.
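The diminishing return from adding banks, visible in Fig. 2, also follows from a simple probabilistic argument. The sketch below estimates the per-cycle chance of a bank conflict under uniformly random bank targets, which is our idealization of real operand traffic; the trial count and request count are illustrative.

```python
import random

def conflict_rate(num_banks: int, num_requests: int,
                  trials: int = 20000, seed: int = 1) -> float:
    """Estimate the fraction of cycles in which at least two of
    `num_requests` simultaneous operand accesses hit the same bank,
    assuming each access targets a uniformly random bank."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        banks = [rng.randrange(num_banks) for _ in range(num_requests)]
        if len(set(banks)) < num_requests:  # some bank got >1 request
            hits += 1
    return hits / trials

# Four concurrent requests: each doubling of banks helps less and less.
r8, r16, r32 = (conflict_rate(b, 4) for b in (8, 16, 32))
assert r8 > r16 > r32            # conflicts keep dropping...
assert (r8 - r16) > (r16 - r32)  # ...but with diminishing returns
```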


Fig. 3. Design space exploration with a varying number of banks. (a) Normalized area. (b) Normalized delay. (c) Normalized dynamic power. (d) Normalized leakage power.

Fig. 4. Latency penalty due to bank conflicts (16-bank RF).

Fig. 5. Utilization rates for different benchmarks (16-bank RF).

An operand starts its journey by first going through the arbiter for granted access to a bank, then passing through the crossbar to temporarily reside in the collector, and finally being dispatched. In the case of a conflict, only one operand can proceed, causing a latency penalty to the other competing operands. We identify two major types of conflicts that lead to the latency penalty.

1) Interread Conflict: This occurs between different warp instructions. For example, reading register ra may conflict with a read to the same bank from another warp instruction.

2) Interwrite Conflict: For example, reading ra may conflict with a write to the same bank from another warp instruction. When the write is prioritized, ra has to be deferred.

We conduct architectural simulation to quantify the conflict impact in terms of the penalty on average RF access latency. A higher penalty means longer operand waiting, which results in reduced RF throughput and more issue stalls. From Fig. 4, we see that the interread and interwrite conflicts are quite common, resulting in an extra 0.1 to 0.6 cycle latency that directly causes performance loss. Another type of conflict (i.e., the intrainstruction conflict) comes from the competing operands in the same instruction, which rarely happens and can be effectively eliminated by compiler renaming.

C. RF Utilization Rate

Even for the optimal design point with 16 banks in Fig. 2, we still observe an underutilization of the RF resources. Fig. 5 shows the averaged utilization rates over all the banks in the RF for different benchmarks. On average, the banks are accessed only 25% of the total execution time.

The phenomenon of low utilization can be explained by runtime factors, such as data and control hazards, and misses on memory units that might incur waiting cycles in the RF. Unbalanced operand loading of PTX instructions is another factor. Not all instructions will fully use the three source operands in the PTX format, leading to unbalanced stress on the RF resources. In addition, the ideal case of accessing all the banks in the same cycle can hardly be sustained in a real program. At runtime, operands may not be evenly distributed across banks, leaving some banks free of accesses.

Note that a high utilization does not necessarily mean no bank conflicts. For example, the rate can be increased by using fewer banks. However, if the bank conflicts are not adequately handled, a noticeable performance loss can be observed, as shown in Fig. 2. Instead, the low utilization observed in the current RF reveals the necessity and possibility of an intelligent operand coordination, which can eliminate the conflicts and accelerate the operand fetching throughout the program execution.

IV. PROPOSED TECHNIQUES

A. Overview

In this paper, to reduce the conflicting accesses for faster operand fetching, we propose the bank stealing techniques to opportunistically steal unused banks for better operand coordination. For an interread conflict, it steals a free bank at the current cycle for an operand that is supposed to be fetched in the next cycle, avoiding the potential upcoming conflict. For an interwrite conflict, it steals an idle bank with a free register entry to temporarily store the write-back value to avoid the conflict with a read. The bank stealing technique effectively leverages the underutilized banks to maximally serve the pending RF accesses.

In fact, the stealing techniques adequately exploit the multiwarp, multibank feature of GPGPUs. Unlike a CPU, the GPGPU pipeline is likely to have multiple ready warps waiting at the issue stage (14 out of 48 warps as observed in our simulation). This makes a wise operand coordination possible, especially with the underutilized banking resources.
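The utilization rates behind this observation reduce to a simple per-cycle measure. A minimal sketch follows; the trace format is our stand-in for the architectural simulator, and the numbers are illustrative only.

```python
def bank_utilization(trace, num_banks: int) -> float:
    """Average fraction of banks servicing an access per cycle.

    `trace` holds one set of accessed bank ids per cycle; the arbiter
    admits at most one access per bank per cycle, hence the sets.
    """
    busy = sum(len(cycle) for cycle in trace)
    return busy / (len(trace) * num_banks)

# Three cycles on a 16-bank RF with 4, 2, and 6 banks busy: 25%
# overall, the ballpark reported for the benchmarks in Fig. 5.
trace = [{0, 3, 7, 12}, {1, 9}, {2, 4, 5, 8, 10, 15}]
assert bank_utilization(trace, 16) == 12 / 48
```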


Fig. 6. Operand read with bank stealing for interread conflict removal.

In addition, as the banked RF structures are very common and essential for SIMT, the proposed schemes can be broadly applicable to a wide range of GPGPU architectures.

Fig. 7. Warp selection for read stealing aided by the default scheduler.

B. Bank Stealing for Interread Conflicts

1) Basic Idea: In the conventional RF design, whenever a conflict occurs, one of the register reads has to be deferred, causing a pipeline stall. However, if one of the conflicting reads can be moved to the previous cycle to steal the bank when vacant, the conflict and pipeline stall can be eliminated.

As shown in Fig. 6, at time T, warps Wa and Wb are issued at the pipeline issue stage. Here, we assume that both instructions have only one operand to read for simplicity. Their operands ra and rb are granted to read the RF by passing the conflict checking at the arbitration stage at T + 1. In addition, at T + 1, warps Wc and Wd are issued, and their operands rc and rd are found to conflict in the arbitration stage at T + 2. Conventionally, either rc or rd has to be delayed, resulting in serialized reads at T + 3 and T + 4. Differently, in our bank stealing scheme, the read of rd can be moved one cycle earlier if it is not conflicting with ra and rb by passing conflict checking at T + 1. In that case, Wc and Wd can finish operand reading at T + 2 and T + 3 without a pipeline stall, saving one cycle of penalty. However, if rd conflicts with ra or rb at T + 1, the stealing is aborted and the pipeline behaves as before. In fact, the bank stealing increases the opportunity for read success by stealing the unused RF bandwidth in the previous cycle.

2) Warp Scheduling for Read Stealing: To enable the read stealing, we need to identify the warps (Wc and Wd) to be issued in the next cycle (T + 1) at the arbitration stage of the current warps (Wa and Wb), and check for the bank conflicts of operands (ra, rb, and rd) between the current and next instructions. The question is how to identify the candidate warps to be issued one cycle ahead for bank stealing with the current warps, so that we can put their operands into the bank conflict checking in the arbitration stage of the current warp instructions.

In fact, this is feasible due to the warp organization featured in GPGPUs. It is very common that multiple warps are ready and waiting for issue in the instruction pool. The selection of the candidate warp can be done with the aid of the default warp scheduler. As shown in Fig. 7, a scheduler typically uses a hierarchical selection tree after the ready checking to select and wake up a warp for issue, as indicated in [14]. The ready checking considers only the warps that are eligible for execution without hazards, and the hierarchical tree selects the warp with the highest priority at the bottom level. Note that at one stage above the bottom level of the tree, there are two warps. In our scheme, we always treat the other warp that is not selected for issue at the current cycle as the candidate warp for stealing. The operands of the current and candidate warps will be latched into the arbitration stage to check for conflicts. Upon conflicts, the operands can be read ahead if two conditions are met: 1) there must be a free collector to hold the stolen operands and 2) the stealing should be able to resolve the conflicts in the next cycle and not cause a conflict in the current cycle. Note that the scheduler should be overridden in the next cycle to prioritize the candidate warp for issue if its operands have been read ahead.

3) Design Cost and Scheduler Impact: As discussed earlier, the candidate warp is selected by utilizing the default scheduling and then latched to the next arbitration stage. To enable the read stealing, the default scheduler should be overridden in the next cycle when a stealing has been performed. As shown in Fig. 7, this logic can be simple, involving only three multiplexors. Multiplexor m1 selects the current warp from the selection tree if the read stealing has failed. Otherwise, it overrides the scheduler by selecting the candidate warp latched in the last cycle. On the other hand, to update the candidate warp accordingly, m2 still selects the other warp from one level above the bottom as the candidate warp if the read stealing has failed. Otherwise, it selects the current warp as the next candidate warp for stealing, because the stolen warp from the last cycle is being issued now. m3 gives the selecting signal depending on the final scheduling decision left_sel.

Regarding the timing impact, both m1 and m2 depend on the left_sel signal from the selection tree and the read-stealing-succeeds feedback signal from the arbitration stage. To minimize the impact, the feedback signal from the arbitration stage should be available as early as possible. Fortunately, it is out of the critical path in the arbitration stage and can be asserted immediately once we find that there are no requests of higher priority. Therefore, m3 does not add extra delay, because it aligns with the multiplexor in the last level of the selection tree. m2 aligns with m1 to give the current warp and candidate warp. Then, the two paths can have the same gate-level depth by adding a 2-to-1 multiplexor to the baseline, whose delay is estimated as 1.2× of an FO4 inverter delay.
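The two stealing conditions above can be condensed into one arbitration predicate. The sketch below is our simplification: a single candidate warp, bank ids already resolved, and no modeling of the scheduler override.

```python
def try_read_steal(current_banks: set, candidate_banks: set,
                   free_collectors: int) -> bool:
    """Return True if the candidate warp's operands may be read one
    cycle ahead: a collector must be free to hold the stolen operands
    (condition 1), and the read-ahead must not collide with the banks
    already granted in the current cycle (condition 2)."""
    if free_collectors == 0:
        return False
    return not (candidate_banks & current_banks)

# rd targets bank 3 while ra/rb occupy banks 0 and 1: steal succeeds.
assert try_read_steal({0, 1}, {3}, free_collectors=2)
# rd targets bank 0, already granted to ra: stealing is aborted and
# the pipeline behaves as before.
assert not try_read_steal({0, 1}, {0}, free_collectors=2)
```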


Fig. 8. Write stealing for interwrite conflict removal.

Fig. 9. Book-keeping the stolen register Rfree with its actual destination register Rdest on each bank. Note that the register organization with eight banks is just for illustration. Any other forms are applicable as long as the compiler knows it for a bank-aware liveness analysis.

With stealing enabled, the default instruction flow might be changed from the original scheduler, but the impact turns out to be small on the performance, as observed in our experiments. That is because the candidate warps are chosen from the second-to-bottom level of the selection tree, and they are likely to be scheduled in the next cycle by the default scheduler anyway, albeit not always with the next highest priority. It is also possible to select other ready warps as the candidate warp for stealing, but the performance can be subject to more changes. We select the candidate warp as proposed in Fig. 7, which is compatible with the conventional scheduler, and a similar performance can be expected. For example, when the read stealing is built upon the greedy-then-oldest (GTO) scheduler, the original scheduling sequence of Wa, Wa, Wb, Wb, ... may be changed to Wa, Wb, Wa, Wb, ..., depending on the actual conflicts encountered. In this way, the scheduler with stealing does not target conflicts alone. It can maximally inherit the prior wisdom on scheduler designs, such as GTO, round-robin, and other schedulers [14], [15] that are essentially built upon them, while taking the bank conflicts into account.

The essence of bank stealing lies in the fact that the RF banks experience underutilization, as shown in Fig. 5. By operand coordination across multiple warps, it leverages the idle banks at the current cycle and opportunistically fills them with predicted future reads. Note that read stealing does not reduce the total number of reads. It just makes a better and more balanced utilization of the available RF bandwidth.

C. Bank Stealing for Interwrite Conflicts

1) Basic Idea: In this section, we propose another type of bank stealing to eliminate the interwrite conflicts. As the default scheme for the write conflicts, the read accesses on the same bank have to yield due to the higher priority of register writes [7], [11], [12], anticipating an earlier release of the dependent instructions from the scoreboard. Note that the compiler is not able to avoid this conflict beforehand, because the write occasion is unknown for load operations.

Due to the multiwarp nature of GPGPUs, oftentimes the written value is not immediately used, because the dependent instructions are held in the issue pool. In that case, deferring the write will allow the read to proceed, successfully reducing the pipeline stalls and overlapping the latency among warps. The write stealing is shown in Fig. 8. A register read of r0 arrives at the same cycle as a write to r8 in the arbitration stage at time T, causing a bank conflict. Conventionally, the arbiter will grant the write access to r8, which blocks the read of r0. Instead, in our scheme, r8 can be deferred. We also clear the dependent instructions on the scoreboard and resort to a dependence checking to enforce the dependence. Actually, since there are many pending warps, the dependent instructions are likely to be trapped in the issue pool, which offers even more opportunities for the write back of r8. Unlike in CPUs, this turns out to be a common case in our simulation, as we find that there is a 60% chance that the first dependent read arrives more than three cycles later. This implies the potential of write stealing, and our experiment shows that the majority of stolen writes can find a write-back slot within three cycles.

Note that there can be other alternatives applicable for interwrite conflict removal. For instance, a multiport or pseudodual-port RF helps, at the cost of significant area overhead but with infrequent usage. Another solution is a fully associative write buffer as widely used in CPUs [16], or dedicated buffers distributed on each bank. However, it turns out to be costly because, in GPGPUs, each warp register covers 32 threads, each with 32 bits. This means that we need to add 1 kb of storage to buffer just one register entry. Wiring the results to the buffers is also costly. For example, the crossbar has to be widened to connect each individual bank, and multiple write ports are required for concurrent write backs. These modifications further increase the area and power, given the super-wide datapath in GPGPUs. Since the RF itself is experiencing underutilization most of the time, it is meaningful to use it for the buffering to maximize the resource reuse, which is one of the primary goals of this paper.

2) Temporary Storage for Write Stealing: The prerequisite of the write stealing is to find a place to temporarily hold the write-back value. We propose to use the off-the-shelf inactive register entries in the RF with lightweight compiler support. This is reasonable in the GPGPU architecture given the two observations that: 1) the entries and banks are oftentimes underutilized for GPGPU applications and 2) the banks are already connected to the result bus, which offers a direct datapath for the operand to be written into the stolen bank.

To find the inactive register entries, we associate a small status table holding the free register ID for each bank, i.e., Rfree with a valid bit f, as shown in Fig. 9. It is analyzed with compile-time liveness information that can be embedded into the instruction encoding and then assigned to hardware by passing along the pipeline, as similarly done in [17] and [18]. For instance, Rfree is initialized as an inactive register, e.g., ra, in the first instruction encoding. When ra is used by a following instruction, its encoding will embed another register, e.g., rb if inactive, which also carries an update of Rfree to rb at runtime. Note that any warp running on the SM is able to update Rfree. Therefore, it is expected to assign Rfree as the least used register if the program is limited by registers, or a register that is never used if the program is not limited by registers. We have found that a single entry for each bank is sufficient for the write stealing, and therefore, a single Rfree is enough to identify the current inactive register.

Once an inactive register Rfree is found upon an interwrite conflict, a link between the temporary Rfree and the actual destination register Rdest is established and recorded into the status table. This bookkeeping amounts to only one register ID and a valid bit per bank, i.e., a handful of bits for eight banks. Compared with the traditional buffering schemes incurring at least thousands of bits of overhead, the hardware cost can be negligible. In addition, given the highly parallel structure of GPGPUs, our proposal applies distributed buffering in each bank by using inactive register entries. This can eliminate the wiring and multiport problems that embarrass the centralized write buffer solution.

D. Arbitration for Bank Stealing

Finally, the arbiter is modified to enable the bank stealing. For each bank in any clock cycle, there can be three more types of incoming accesses besides the normal writes and reads, i.e., stealing reads, stealing writes, and forced writes. A stealing read means that the read tries to steal a bank for interread conflict removal.
A stealing write means that the table with a valid bit m and the destination register/bank ID, write tries to use a free bank with a free register to temporarily as shown in Fig. 9. The information will be used in the arbiter hold the write-back value. A forced write means that the write- to determine when it can be written back to the destination back value will be forced to write back to the destination register and removed from the temporary register. Typically, register. the arbiter will issue a write back when: 1) the result bus is 1) Arbitration Changes: Fig. 10(a) shows the logic to free and 2) the bank holding the temporary register is free. The arbitrate the read/write request to a bank. At first, all operand write-back value is first read out from the temporary register requests are decoded into bank requests as in ①. Then, the and put onto the result bus. In the next cycle, the value will requests are passed through a set of cascaded selection in ②, be written back to the destination register. Nevertheless, it is where the bank requesting signal will be generated for each still possible that the write back to the destination register can type of accesses, such as normal read or stealing write, as conflict with a read to that bank. If that happens, we will not shown in Fig. 10(b). Finally, the priority logics in ④ will grant steal again but complete the write by stalling the read. the ultimate access according to the priority order. Note that Write stealing offers a second chance to write back the all these steps are required in the conventional design. value, hoping to avoid the bank conflict in the first round. To support bank stealing, we add logic to check the data In case the temporary register does not manage to write dependence in ②. 
In general, the forced write takes the highest back before the incoming dependent accesses from other priority, because it either comes from the previously stealing instructions (e.g., read-after-write (RAW) or write-after-write write that cannot be deferred for a second time or this is (WAW)), a forced write-back is required resulting in a pipeline the first write back but it cannot borrow a vacant place for stall to ensure the data integrity. This is achieved when the a stealing write. In either case, all other conflicting accesses dependent instruction arrives at the arbitration stage, the arbiter have to yield. If there are more forced writes to the same bank, can prioritize the write-back over the dependent instruction to they will be executed in the consecutive cycles. avoid any hazard. However, because registers are not always When a normal write conflicts with other types of accesses, immediately used in the multiwarp GPGPUs, most of the time, it becomes a stealing write if a free register in a free bank the temporary register can manage the write-back without can serve as temporary storage. This stealing checking is pipeline stalls. Finally, because the free registers are dead reg- performed in ③, where the normal writes are arbitrated with isters, very rarely they become alive in the program execution, bank vacant status, as shown in Fig. 10(c). We use a round- and the temporarily stored value needs an immediate write- robin policy to arbitrate the banks qualified for stealing. back, which otherwise might cause pipeline stall. In summary, the priorities for all these accesses in ④ are Compared with the traditional scheme, write stealing adds ordered as: forced write > normal read > stealing read > extra read and write operations, which is negative on energy. normal write > stealing write. 
For each cycle, only one access However, the added energy can be small due to the limited with the highest priority is granted to the bank while all other write stealing operations as observed in our experiments. accesses have to stall. Given the write energy only accounts for a portion of the 2) Design Cost and Analysis: As shown in Fig. 10, the dynamic RF energy (around 30%Ð40%), and the leakage arbitration changes mainly lie on the cascaded selection, the tends to dominate with shrinking technology nodes (up to write stealing checking, and the prioritization. The added logic 70%) as exposed in [19], the overall RF energy can be is symmetric, modular, and scalable, and therefore, we believe reduced. that the design complexity is manageable. We design the The required datapath changes are minimal. The result bus proposed circuitry and find it to be less than 1.2% of the already exists and serves as the regular channel for writing RF area. Because the logic mainly involves 1-bit signals, back the results to the RF banks. The status table contains the area overhead can be negligible when compared with 19 bits per bank (two valid bits, 7-bit free register ID, and the extra buffers of 1024 bit wide, which requires similar 10-bit destination register/bank IDs) and altogether 152 bits for arbitration logic changes as well. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
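The per-bank arbitration described above can be sketched as a small cycle-level model. This is an illustrative reconstruction, not the authors' circuitry: only the priority order (forced write > normal read > stealing read > normal write > stealing write) and the round-robin stealing check come from the text; all function and variable names are our own.

```python
# Illustrative cycle-level model of the per-bank arbiter described in the
# text. Priority order (from the paper): forced write > normal read >
# stealing read > normal write > stealing write. Names are ours.

# Lower value = higher priority, mirroring the order given in the text.
PRIORITY = {
    "forced_write": 0,
    "normal_read": 1,
    "stealing_read": 2,
    "normal_write": 3,
    "stealing_write": 4,
}

def arbitrate(requests):
    """Grant exactly one access per bank per cycle; the rest stall.

    `requests` maps a bank id to a list of access-type strings.
    Returns {bank: granted_type} for banks with at least one request.
    """
    return {bank: min(reqs, key=PRIORITY.__getitem__)
            for bank, reqs in requests.items() if reqs}

def try_stealing_write(write_bank, bank_busy, bank_has_free_reg, rr_start=0):
    """A normal write that loses arbitration becomes a stealing write if some
    other free bank also holds a free (dead) register; banks are scanned
    round-robin as the text describes. Returns the stolen bank id, or None,
    in which case the write must eventually become a forced write.
    """
    n = len(bank_busy)
    for i in range(n):
        b = (rr_start + i) % n
        if b != write_bank and not bank_busy[b] and bank_has_free_reg[b]:
            return b
    return None
```

For example, a bank that sees a normal read and a normal write in the same cycle grants the read, and the losing write can then probe the remaining banks for a stealing slot.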

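The bookkeeping behind write stealing (the status table of Fig. 9) can likewise be sketched. The field widths below (two valid bits, a 7-bit free register ID, and 10 bits of destination register/bank IDs, i.e., 19 bits per bank and 152 bits for eight banks) are taken from the text; the class and method names are hypothetical.

```python
# Sketch of the per-bank status table from Fig. 9 (our reconstruction).
# Field widths follow the cost quoted in the text: 19 bits per bank,
# i.e., 152 bits for eight banks.

VALID_BITS, FREE_ID_BITS, DEST_ID_BITS = 2, 7, 10
BITS_PER_BANK = VALID_BITS + FREE_ID_BITS + DEST_ID_BITS  # 19 bits

class BankStatus:
    def __init__(self):
        self.f = False      # Rfree currently names a dead register
        self.rfree = None   # 7-bit free register ID
        self.m = False      # a stolen write is parked in rfree
        self.dest = None    # destination register/bank IDs (10 bits)

    def update_rfree(self, reg_id):
        # Runtime update driven by liveness info embedded in the encoding.
        self.f, self.rfree = True, reg_id

    def steal_write(self, dest_ids):
        # Park a conflicting write in the free register; record the link.
        assert self.f and not self.m, "need a free register and no pending link"
        self.m, self.dest = True, dest_ids

    def complete_writeback(self):
        # Result bus and destination bank are free: retire the stolen write.
        dest, self.m, self.dest = self.dest, False, None
        return dest
```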

Fig. 10. Hardware support in the arbiter for bank stealing. (a) Overview. (b) Cascaded selection. (c) Write stealing check on banks. Note that the arbitration on normal reads forms the critical path.

In terms of timing, the shaded line in Fig. 10 shows that the normal read is the original critical path in the arbiter. This is because there can be many more normal reads to be arbitrated, and the cascaded selection on them is generally long enough to overlap the delay of all other types of accesses that run in parallel. For example, the decoding of stolen reads in ① can be done in parallel with the decoding of the normal reads. The checking and selection of stolen writes in ② and ③ can be done in parallel with the selection of the normal reads. The only logic added is in the prioritization logic ④ on two paths. For the read enable signal path ren, a two-input OR gate converges the normal read and the introduced stealing read if no forced write blocks them. For the write enable signal path wen, a four-input AND gate converges the normal write and the stealing write if no read request blocks them. Because the normal read is assumed to be the critical signal, the AND gate only functions after the normal read settles. Note that the normal read also forms the original critical path, which requires a two-input AND gate as well to assert the read request with no blocking normal write. Therefore, the two paths align approximately in timing by adding one gate-level depth, e.g., a two-input AND and a two-input OR gate, whose delay can be estimated as 0.73x of an FO4 inverter delay. We simulate and prove that tighter delay constraints can absorb its delay under the sub-1-GHz nominal frequency.

V. EVALUATION METHODOLOGY

As discussed earlier, the design overhead has been analyzed from a hardware perspective. In our experiments, we turn to an architectural perspective by evaluating the RF designs with different numbers of banks, regarding their performance, area, and energy efficiency using our proposed bank stealing.

A. Design of a Pilot RF

The behavior of a multibanked RF has been introduced in Section II-B, and we use the same register mapping and organization as the baseline RF design. To learn the physical parameters of such a multibanked RF, we synthesize a pilot RF with multiple single-ported banks by using the SMIC commercial memory compiler tool [20] with an industrial cell library. It gives standard SRAM designs with four metal layers for over-the-cell routing. We build the RF by varying the number of banks, with the area, power, and delay numbers reported by the memory compiler.

We also evaluate the crossbar, which connects the banks and the operand collectors with a 128-byte bus for each operand. We use the bit-sliced structure [21], [22] due to its higher density for such a wide crossbar. The 128-byte slices are organized into a 32x32 square array. This structure is regular and easy to scale to various numbers of inputs and outputs, which well fits the needs of our design space exploration. After implementing the layout, it turns out that the area is severely dominated by wiring, even using the minimum pitch size. This is due to the large operand bitwidth, which requires the wires to be routed into the innermost slice in each direction. Therefore, we use two metal layers to split the horizontal and vertical wires to the four edges of the square to further reduce the wiring size. Finally, the area, power, and delay of the crossbar structure are calculated based on the HSPICE simulation of a single-bit slice, due to its regularity.

B. Experiment Setting

For the performance measurement, we use the cycle-level architectural simulator GPGPU-Sim v3.2 [23]. If not otherwise specified, each SM has 32 execution lanes and a 128-KB RF serving at most two warp instructions at a time. Other key configuration parameters for the simulator are shown in Table I, which generally conform to the Fermi architecture that is faithfully modeled by the simulator. We also apply our techniques to the 256-KB and 512-KB RF designs featured in newer GPGPU architectures for a scalability study.
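As a sanity check on the configuration above, the per-bank organization follows from the 128-KB RF and the 32-thread x 32-bit warp register mentioned earlier. The arithmetic below is ours, assuming the eight-bank case-study design.

```python
# Back-of-the-envelope sizing for the simulated RF (our own arithmetic,
# using numbers stated in the text: 32 threads/warp, 32-bit registers,
# a 128-KB RF per SM, and the eight-bank case-study organization).
THREADS_PER_WARP = 32
BITS_PER_THREAD_REG = 32
warp_reg_bits = THREADS_PER_WARP * BITS_PER_THREAD_REG   # 1 Kb per entry
warp_reg_bytes = warp_reg_bits // 8                      # 128 B per entry

RF_BYTES = 128 * 1024                                    # 128-KB RF per SM
entries_total = RF_BYTES // warp_reg_bytes               # warp registers per SM

BANKS = 8                                                # case-study design
entries_per_bank = entries_total // BANKS

# The 7-bit free-register ID in the write-stealing status table is just
# enough to index one bank's entries in this configuration.
FREE_ID_BITS = 7
```

Note how the 7-bit free-register ID quoted for the status table is exactly wide enough to name the 128 entries of one bank in the eight-bank, 128-KB configuration.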


Fig. 11. Design space study using different banks. Proposed RFs are marked with ∗. (a) Normalized performance. (b) Normalized area. (c) Normalized area efficiency. (d) Normalized energy efficiency.

TABLE I
KEY PARAMETER SETTINGS IN THE GPGPU-SIM SIMULATOR

Table II shows the benchmarks in our evaluation. We investigate 14 benchmarks from the CUDA SDK [24] and Parboil [25] suites, which are widely used in GPGPU studies. We use CUDA SDK 4.2 for compilation, and the simulator is configured to use the PTXPlus syntax for simulation. This is a hardware instruction set extended by GPGPU-Sim with register allocation information to enable more accurate simulation results, as reported in [26]. Due to the lack of an open-source compiler alternative to the commercial nvcc that generates code with register liveness information, we intercept the compiled kernel before running the simulation. This interception binds our liveness analysis to each register. The simulator pipeline is also modified accordingly to accept the liveness information and identify the RF holes at runtime.

In our delay and power evaluation, we use the primitives from the pilot RF design under different banking schemes. Detailed primitives are shown in Table III under 32-nm technology; they are obtained from the SMIC memory compiler tool and our custom layout. We collect RF statistics from GPGPU-Sim and use these primitives to compute the RF power, including the normal bank accesses, crossbar transmission, arbiter, and bank stealing if enabled. We also consider the extra read and write operations on each stolen write. For all other non-RF components, we use GPUWattch [27] to calculate the power.

VI. EXPERIMENTAL RESULTS

We first show the overall performance improvement by our proposed techniques with different numbers of banks. In all the results, the straight RF stands for the traditional RF without the proposed techniques, and the proposed RF stands for the RF equipped with the proposed techniques. As a case study, for a better understanding of our proposed techniques, we choose representative design points for detailed analysis. We also sweep different RF sizes to demonstrate the scalability.

A. Design Space Study

For a thorough study, we apply our techniques to RF designs with a fixed capacity of 128 KB but different numbers of banks (4, 8, 16, and 32). The results given in Fig. 11 are averaged over all the benchmarks, including performance, area, area efficiency, and energy efficiency. The area efficiency is calculated as normalized performance over area, while the energy efficiency is calculated as normalized performance over energy.

From Fig. 11(a) and (b), we see that RF designs with our bank stealing techniques always achieve better performance than the straight RFs with the same configuration. The RFs with the same configurations have almost the same area, due to the negligible hardware cost of bank stealing. The overall performance improvement ranges from 8% to 13%. Given the fact that we only make a change in a single GPGPU unit, this improvement is significant.

The most important observation is that the proposed RF can often outperform the straight RFs equipped with more banks. For instance, the proposed 8-bank RF outperforms the straight RFs with 16 or 32 banks. This exhibits the effectiveness of our proposed techniques, which adequately eliminate the potential bank conflicts; it is more efficient than purely adding more banks. As a result, our proposed techniques are much more area-efficient, achieving nearly 25% and 50% area savings over the 16- and 32-bank straight RFs. At the same time, the performance is also superior, with 7.2% and 5.7% improvements over the 16- and 32-bank straight RFs.

We also compare the area efficiency and energy efficiency between the proposed RFs and the straight RFs in Fig. 11(c) and (d), respectively. All the results demonstrate that our proposed RF is superior in terms of performance, area, and energy efficiency. Because the RF size keeps increasing in modern GPGPUs, the much improved area efficiency with better RF performance will be beneficial for future scaling.

B. Case Study

As a case study, we choose the RF design that exhibits the highest area efficiency among the proposed RFs (eight banks).
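The two derived metrics used in the design space study are simple ratios of normalized quantities. A minimal sketch, with hypothetical placeholder numbers rather than measured results from the paper:

```python
# Area and energy efficiency as defined in the text: normalized performance
# over area, and over energy. All input values below are hypothetical
# placeholders, normalized to the baseline design.
perf   = {"straight-16": 1.00, "proposed-8": 1.05}  # hypothetical
area   = {"straight-16": 1.00, "proposed-8": 0.75}  # hypothetical
energy = {"straight-16": 1.00, "proposed-8": 0.90}  # hypothetical

area_eff   = {k: perf[k] / area[k] for k in perf}    # perf per unit area
energy_eff = {k: perf[k] / energy[k] for k in perf}  # perf per unit energy
```

With these placeholders, a design that is slightly faster yet much smaller comes out well ahead on area efficiency, which is the shape of the result reported for the proposed 8-bank RF.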

10 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE II
BENCHMARK CHARACTERISTICS

TABLE III
AREA, TIMING, AND POWER PRIMITIVES FOR 128-KB RF WITH 16 BANKS

Fig. 13. Latency penalty comparison (for each benchmark, the left bar is the straight 8-bank RF and the right bar is our proposed 8-bank RF).

Fig. 12. Performance comparison on the straight RF designs without and with our proposed schemes.

Accordingly, we choose the eight-bank and 16-bank straight RF designs for comparison. The 16-bank straight RF acts as the baseline, and all the results are normalized to it.

Fig. 12 shows the performance comparison in detail. From the bottom bar in Fig. 12, we see that the straight 8-bank RF degrades the performance by nearly 5% compared with the 16-bank straight RF design. That is mainly due to the increased bank conflicts.

1) Evaluation on Read Stealing: In Fig. 12, we see that the average performance is improved by around 7% over the straight 8-bank RF by read stealing. That means an average performance 2% better than the straight 16-bank RF. Read stealing alone provides better performance than the baseline straight 16-bank RF design for all of the 14 benchmarks except M.Car. The experimental results confirm that read stealing can effectively reduce the potential conflicts and improve the overall GPGPU performance.

It is worth mentioning that we have done the same experiments using different types of warp schedulers, including round-robin, two-level, and so on. The conclusion stays the same as for our results using GTO. Although read stealing may occasionally modify the instruction flow, the experiments verify that this scheme is generally applicable and not sensitive to the host scheduling policies.

2) Evaluation on Write Stealing: We further apply write stealing on the eight-bank straight RF. The results are stacked on top of read stealing in Fig. 12, manifesting the total performance gain of the combined schemes. We find that the two schemes are complementary to each other in performance. They remove the conflicts and reduce the latency, which is essential to improve the RF throughput. When the two bank stealing techniques are applied together, the enhanced 8-bank RF outperforms the straight 16-bank RF by around 7%. Specifically, we see that it outperforms the straight 16-bank design for all the benchmarks.

3) RF Access Latency: We investigate the RF access latency by applying our proposed schemes on the straight RF design. We take the latency as the indicator for the impact of conflicts, as it is a more direct measurement of RF throughput.

Fig. 13 shows the latency penalty measured in pipeline cycles when accessing the RF. Compared with Fig. 4(a) for the 16-bank straight RF, we find that the straight 8-bank RF nearly doubles the latency penalty due to aggravated conflicts. Instead, our proposed 8-bank RF effectively shortens the operand latency by reducing the interread and interwrite penalties by around 58% and 88%, respectively. The write stealing is more effective at removing the interwrite conflicts, while the read stealing largely depends on the prediction of the potential interread conflicts in the next cycle.

From this plot, we see that the fetching latency can be treated as a performance indicator, albeit not a linear one. For applications with significantly reduced operand latency, the proposed RF usually achieves better performance than the straight 16-bank RF (Q.Gen, S.qrng, and Tpacf). In a word, our proposal is effective at reducing the operand latency penalty with fine-grained control of the precious RF resources.

We also notice an increased bank utilization of 54% by the proposed 8-bank RF with bank stealing, against 23% in the straight


16-bank RF. This is mainly because of the greatly reduced number of banks and the more balanced usage of the RF banks, which accelerate the operand fetching for better performance.

4) More Evaluation on Write Stealing: In general, write stealing can eliminate the interwrite conflicts. This is under the assumption that the temporary register can find a slot to write back to the destination register before the first dependent access. However, this is not always true. Whenever there is a pending RAW or WAW hazard, the value residing in the temporary register has to be written back, incurring a pipeline stall.

Fig. 14. Write-back breakdown with write stealing.

In Fig. 14, we find that the majority of registers manage to write back without any intervention in the execution pipeline. First, 83% of the registers are normally written without stealing. Therefore, the power overhead introduced by write stealing is limited. Of the registers that do require stealing, over 25% are written back within one cycle after stealing. Another 50% are written back successfully after waiting several more cycles, but still before their first dependent access. Compared with the baseline, where 17% of the registers are written back conflicting with reads, only 2.5% of the writes are left that negatively stall the pipeline.

5) Evaluation on Power and Energy Efficiency: We show the power numbers and use the metric perf/power to quantify the power efficiency and perf/energy for the energy efficiency. The power results reported in this section are chip-wide power, counting all the GPGPU core units.

Fig. 15. Power and energy efficiency.

Fig. 15 shows the power results. From Fig. 15, we confirm that RF designs using fewer banks consume less power. For instance, the straight 8-bank RF consumes 6% less power than the straight 16-bank RF. This is because fewer banks slightly increase the dynamic banking power but reduce the crossbar power and the leakage, as explored in Fig. 3(b). However, using traditional designs, fewer banks are inferior in terms of energy efficiency when performance is accounted for.

In contrast, our proposed RF is more energy efficient. Although it introduces additional power due to write stealing, for the additional bank accesses and the associated arbitration logic, the overhead is relatively small, as shown in Fig. 15. Taking all the power overhead of our scheme into account, the proposed 8-bank RF increases the total power by around 3% over the straight 8-bank RF but still consumes less power than the straight 16-bank RF. The energy efficiency of our proposed design beats the straight 16-bank RF by 15%, owing to the superior performance gains.

C. Scalability Study

Fig. 16. Scaling of area and energy efficiency with and without our schemes.

Finally, we demonstrate the scalability of our techniques on RF designs in newer GPGPU architectures. For example, Kepler doubles the RF capacity, allows 255 registers per thread with ISA support, and simplifies the register scoreboarding hardware with compiler information, as documented in [2]. Regarding the VLSI design, we believe that there are not too many changes, as the basic building block, i.e., the SRAM array, is hard to change. Therefore, our scalability study focuses on the bank organization and capacity increase by applying our techniques to RFs with different capacities (128, 256, and 512 KB per SM). We also increase the number of thread blocks and schedulers accordingly to represent the higher pressure from the increasing TLP. For each RF size, we select a pair of optimal RF designs with and without our schemes by considering a balanced performance and area. Then, we compare their area and energy efficiency in Fig. 16.

We find that the most area- and energy-efficient configuration is generally not the design with the largest number of banks, due to the marginal performance gains at the cost of excessive power and area overhead. The optimal point varies with the RF capacity, meaning that the traditional RF design has to leverage expensive area expansion to accommodate the increasing TLP. In contrast, the most area-efficient RFs with our schemes all use eight banks, but can deliver comparable performance to the traditional RFs with a larger number of banks. This promises more silicon area savings with technology scaling. Our proposed bank stealing can effectively reduce the conflicts with varying RF sizes. We find a steady improvement from 40% to 80% in the area efficiency and significantly better energy efficiency by 15%-28% over the traditional design.
The results demonstrate the superiority of our schemes and well justify them for better scalability in future GPGPUs.
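As an aside, the deferred write-back behavior that Fig. 14 breaks down can be mimicked with a toy timeline model. This is our own sketch under simplified assumptions, not the simulator's logic: a stolen write retires in the first cycle where its destination bank and the result bus are both idle, and is forced (stalling the pipeline) only if a dependent access arrives first.

```python
# Toy timeline for a deferred (stolen) write, in the spirit of the scheme
# evaluated above. The bank_free/bus_free schedules are example inputs.
def writeback_cycle(park_cycle, bank_free, bus_free, dependent_cycle):
    """Return (cycle, forced): the cycle when the parked write retires, and
    whether it had to be forced (stalling the pipeline) because a dependent
    access arrived before an idle slot was found."""
    for t in range(park_cycle, dependent_cycle):
        if bank_free[t] and bus_free[t]:
            return t, False          # opportunistic write-back, no stall
    return dependent_cycle, True     # forced write-back on the dependence

# Destination bank busy at cycles 0-1, free from cycle 2; the first
# dependent instruction arrives at cycle 5.
cycle, forced = writeback_cycle(
    park_cycle=0,
    bank_free=[False, False, True, True, True, True],
    bus_free=[True] * 6,
    dependent_cycle=5,
)
```

In this example the parked value retires at cycle 2 without forcing, which mirrors the common case reported above where writes complete a few cycles after stealing but before their first dependent access.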


VII. RELATED WORK

There has been an ample amount of work studying efficient RF structures in different architectures, including conventional CPUs, very long instruction word (VLIW) processors, and streaming processors such as GPGPUs. For example, in general microprocessors, the SRAM-dynamic random access memory (DRAM) hybrid macrocell [28] and the 3T1D embedded DRAM [29] have been applied to L1D caches. In the GPGPU domain, a variety of new RF implementations, such as the hybrid SRAM-DRAM RF [7], the spin-transfer torque magnetic random access memory (STT-MRAM) RF [30], and embedded DRAM (eDRAM)-based RFs [17], [31], [32], have been proposed. As these new types of memory generally compromise performance, these works focus on smart strategies to maximally recover the performance loss.

Techniques exploring architectural modifications are intensively studied to improve performance or power. One study related to this paper in the CPU domain [16] proposes an area-efficient CPU RF by reducing ports and applies prefetching and delayed write back to sustain performance. They leverage queues with dozens of entries and heavily augment the CPU pipeline. However, in GPGPUs it is impractical to apply their methods, because there is no such pipeline stage and the operand width is too wide for queuing. Prior work also studies multicore caches [33], [34] or the cache-conflict problem [35], [36] with a prediction mechanism. Other researchers are interested in hierarchical RF structures in the CPU domain, such as [37] and [38] for less complexity, and RF caching [30], [39], [40] for performance enhancement.

In the GPGPU domain, the work in [4] aims to reduce the main RF access energy by introducing a small register cache for frequently accessed operands. To prevent the register cache from thrashing, they further propose a two-level thread scheduler to maintain a small set of active threads on deck. According to their configuration, they require an additional 6-KB fully associative cache per SM to keep the performance without loss. This might aggravate the already congested RF layout with degraded timing parameters. In contrast, our previous work [9] reduces the RF area and improves the performance by identifying the interread conflicts. This work steps forward by identifying the interwrite conflicts and proposing a combined structure to combat them, enabling further performance improvement. Another work [3] proposes to reduce the dynamic and leakage power by identifying the inactive or unallocated registers, which is complementary to our study for further reduction of the RF power. For better performance, caches in GPGPUs are also intensively studied recently, such as [41]-[43]. Another work proposes a unified on-chip memory design [5], which is able to reconfigure the capacities of the RF, shared memory, and cache to accommodate diverse applications. The unified structure merges all the on-chip storage, which leads to longer interconnecting wires and longer data access latency.

VIII. CONCLUSION

In this paper, we propose an efficient RF design with explicit conflict-eliminating techniques to sustain a high RF throughput facing the ever-increasing TLP. We first reveal how different alternatives affect the overall performance. For the bank conflict problem, we identify the causes and propose two effective techniques accordingly. The read stealing scheme explicitly exploits the available bandwidth in the highly banked RF structure to eliminate conflicts and leverages the massive parallelism in GPGPUs to access the RF more intelligently. As a complement, the write stealing scheme leverages the idle banks to further hide the latency penalty of unnecessary operand waiting. Their combination improves the performance using half or even a quarter of the number of banks with a smaller area and energy budget, which implies that our proposed RF design with bank stealing can sustain the fast-increasing RF capacity in future GPGPUs.

The importance of the RF to performance has been widely acknowledged. Always keeping low overhead in mind, our proposal explicitly leverages fine-grained control of the existing resources without adding costly buffers or other hierarchical memory designs. Our proposed bank stealing techniques are generally applicable to other banked RF structures. The microarchitectural investigation enables a low-cost and more efficient architecture with acceptable complexity.

REFERENCES

[1] NVIDIA Corporation, "NVIDIA's next generation CUDA compute architecture: Fermi," NVIDIA, Santa Clara, CA, USA, White Paper, 2009.
[2] NVIDIA Corporation, "NVIDIA's next generation CUDA compute architecture: Kepler GK110," NVIDIA, Santa Clara, CA, USA, White Paper, 2012.
[3] M. Abdel-Majeed and M. Annavaram, "Warped register file: A power efficient register file for GPGPUs," in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 412-423.
[4] M. Gebhart et al., "Energy-efficient mechanisms for managing thread context in throughput processors," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 235-246.
[5] M. Gebhart, S. W. Keckler, B. Khailany, R. Krashinsky, and W. J. Dally, "Unifying primary cache, scratch, and register file memories in a throughput processor," in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2012, pp. 96-106.
[6] X. Liang, K. Turgay, and D. Brooks, "Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques," in Proc. Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2007, pp. 824-830.
[7] W.-K. S. Yu, R. Huang, S. Q. Xu, S.-E. Wang, E. Kan, and G. E. Suh, "SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading," in Proc. 38th Annu. Int. Symp. Comput. Archit., Jun. 2011, pp. 247-258.
[8] C. H. Jung, "Pseudo-dual port memory where ratio of first to second memory access is clock duty cycle independent," U.S. Patent 2007 0 109 909, May 17, 2007.
[9] N. Jing, S. Chen, S. Jiang, L. Jiang, C. Li, and X. Liang, "Bank stealing for conflict mitigation in GPGPU register file," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Jul. 2015, pp. 55-60.
[10] NVIDIA Corporation, "Parallel thread execution ISA version 3.0," NVIDIA, Santa Clara, CA, USA, White Paper, 2012.
[11] J. H. Tseng and K. Asanović, "A speculative control scheme for an energy-efficient banked register file," IEEE Trans. Comput., vol. 54, no. 6, pp. 741-751, Jun. 2005.
[12] C. Gou and G. N. Gaydadjiev, "Elastic pipeline: Addressing GPU on-chip shared memory bank conflicts," in Proc. 8th ACM Int. Conf. Comput. Frontiers (CF), 2011, Art. no. 3.
[13] J. H. Tseng and K. Asanović, "Banked multiported register files for high-frequency superscalar microprocessors," in Proc. 30th Annu. Int. Symp. Comput. Archit., 2003, pp. 62-71.
[14] T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2012, pp. 72-83.

JING et al.: BANK STEALING FOR A COMPACT AND EFFICIENT REGISTER FILE ARCHITECTURE IN GPGPU

[15] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in Proc. 44th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2011, pp. 308–317.
[16] N. S. Kim and T. Mudge, "Reducing register ports using delayed write-back queues and operand pre-fetch," in Proc. 17th Annu. Int. Conf. Supercomput. (ICS), 2003, pp. 172–182.
[17] N. Jing, H. Liu, Y. Lu, and X. Liang, "Compiler assisted dynamic register file in GPGPU," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2013, pp. 3–8.
[18] S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, "Warped-compression: Enabling power efficient GPUs through register compression," in Proc. 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 502–514.
[19] M.-T. Chang, P. Rosenfeld, S.-L. Lu, and B. Jacob, "Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM," in Proc. 19th IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 143–154.
[20] Single-Port Register-File User Guide, Semicond. Manuf. Int. Corp., Shanghai, China, Jul. 2012.
[21] G. Passas, M. Katevenis, and D. Pnevmatikatos, "A 128×128×24 Gb/s crossbar interconnecting 128 tiles in a single hop and occupying 6% of their area," in Proc. 4th ACM/IEEE Int. Symp. Netw.-Chip, May 2010, pp. 87–95.
[22] C.-H. O. Chen, S. Park, T. Krishna, and L.-S. Peh, "A low-swing crossbar and link generator for low-power networks-on-chip," in Proc. ACM/IEEE Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 779–786.
[23] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., Apr. 2009, pp. 163–174.
[24] NVIDIA. NVIDIA CUDA Toolkit, accessed on 2013. [Online]. Available: https://developer.nvidia.com/cuda-toolkit
[25] J. A. Stratton et al., "Parboil: A revised benchmark suite for scientific and commercial throughput computing," Center Rel. High-Perform. Comput., Univ. Illinois Urbana–Champaign, Champaign, IL, USA, Tech. Rep., Mar. 2012.
[26] T. M. Aamodt, W. W. L. Fung, and T. H. Hetherington. GPGPU-Sim 3.x Simulator, accessed on 2014. [Online]. Available: http://gpgpu-sim.org/manual
[27] J. Leng et al., "GPUWattch: Enabling energy optimizations in GPGPUs," in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 487–498.
[28] A. Valero et al., "An hybrid eDRAM/SRAM macrocell to implement first-level data caches," in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp. 213–221.
[29] X. Liang, R. Canal, G.-Y. Wei, and D. Brooks, "Process variation tolerant 3T1D-based cache architectures," in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2007, pp. 15–26.
[30] N. Goswami, B. Cao, and T. Li, "Power-performance co-optimization of throughput core architecture using resistive memory," in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 342–353.
[31] N. Jing et al., "An energy-efficient and scalable eDRAM-based register file architecture for GPGPU," in Proc. 40th Annu. Int. Symp. Comput. Archit. (ISCA), 2013, pp. 344–355.
[32] N. Jing, L. Jiang, T. Zhang, C. Li, F. Fan, and X. Liang, "Energy-efficient eDRAM-based on-chip storage architecture for GPGPUs," IEEE Trans. Comput., vol. 65, no. 1, pp. 122–135, Jan. 2016.
[33] S. Sardashti and D. A. Wood, "Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching," in Proc. 46th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2013, pp. 62–73.
[34] H. Kasture and D. Sanchez, "Ubik: Efficient cache sharing with strict QoS for latency-critical workloads," in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 729–742.
[35] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation techniques for improving load related instruction scheduling," in Proc. 26th Int. Symp. Comput. Archit., 1999, pp. 42–53.
[36] M.-J. Wu, M. Zhao, and D. Yeung, "Studying multicore processor scaling via reuse distance analysis," in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 499–510.
[37] I. Park, M. D. Powell, and T. N. Vijaykumar, "Reducing register ports for higher speed and lower energy," in Proc. 35th Annu. IEEE/ACM Int. Symp. Microarchitecture, Nov. 2002, pp. 171–182.
[38] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, "Reducing the complexity of the register file in dynamic superscalar processors," in Proc. 34th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2001, pp. 237–248.
[39] H. Zeng and K. Ghose, "Register file caching for energy efficiency," in Proc. Int. Symp. Low Power Electron. Design, Oct. 2006, pp. 244–249.
[40] T. M. Jones, M. F. P. O'Boyle, J. Abella, A. González, and O. Ergin, "Energy-efficient register caching with compiler assistance," ACM Trans. Archit. Code Optim., vol. 6, no. 4, Oct. 2009, Art. no. 13.
[41] W. Jia, K. A. Shaw, and M. Martonosi, "MRPB: Memory request prioritization for massively parallel processors," in Proc. 20th Int. Symp. High Perform. Comput. Archit., Feb. 2014, pp. 272–283.
[42] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu, "Adaptive cache management for energy-efficient GPU computing," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2014, pp. 343–355.
[43] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor, and T. M. Aamodt, "Cache coherence for GPU architectures," in Proc. IEEE 19th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2013, pp. 578–590.

Naifeng Jing received the B.E., M.E., and Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, in 2005, 2008, and 2012, respectively. He is currently an Assistant Professor with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University. His current research interests include high performance and high reliability computing architecture and system, and computer-aided VLSI design.

Shunning Jiang received the B.S. degree in computer science from Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China, in 2015. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA. He has been with the Advanced Computer Architecture Laboratory, Shanghai Jiao Tong University, since 2013, during his undergraduate studies. He visited the Xtra Computing Group with Nanyang Technological University, Singapore, as a Research Assistant, from 2014 to 2015. His current research interests include computer architecture and computer system, especially in parallel architecture, approximate computing, and memory system.

Shuang Chen received the B.S. degree in computer science and technology from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 2015. She is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA. She was an Undergraduate Research Assistant with the Advanced Computer Architecture Laboratory, SJTU, from 2013 to 2015, where she was involved in research on general purpose graphic processing units. In 2014, she was a Research Assistant with Nanyang Technological University, Singapore, where she was involved in research on both computer architecture and database. She is currently a member of the Computer Systems Laboratory with Cornell University. Her current research interests include chip multiprocessors with an emphasis on the memory system.


Jingjie Zhang is currently pursuing the bachelor's degree with the Department of Electrical Engineering, Fudan University, Shanghai, China. He has been an Undergraduate Research Assistant with the Advanced Computer Architecture Laboratory, Shanghai Jiao Tong University, Shanghai, since 2015. His current research interests include signal processing, computer aided design, and computer architecture.

Chao Li (M'10) received the Ph.D. degree from the University of Florida, Gainesville, FL, USA, in 2014. He is currently an Assistant Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His current research interests include computer architecture, energy-efficient systems, and green computing. Dr. Li has received best paper awards from the IEEE HPCA and the IEEE COMPUTER ARCHITECTURE LETTERS.

Li Jiang received the B.S. degree in computer science and technology from Shanghai Jiao Tong University, Shanghai, China, in 2007, and the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, in 2013. He has been an Assistant Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, since 2014. His current research interests include computer architecture, computer aided design, VLSI testing, and fault tolerance, with emphasis on emerging technologies and applications. Dr. Jiang was a recipient of the Best Doctoral Thesis Award at the Asian Test Symposium in 2014.

Xiaoyao Liang received the Ph.D. degree from Harvard University, Cambridge, MA, USA. He is currently a Professor and the Associate Dean of the Department of Computer Science and Engineering with Shanghai Jiao Tong University, Shanghai, China. He has been a Senior Architect or an IC Designer with NVIDIA, Intel, and IBM, USA. His current research interests include computer processor and system architectures, energy efficient and resilient design, high throughput computing, general purpose graphic processing units, hardware/software co-design for the cloud and mobile computing, IC design, VLSI methodology, FPGA innovations, and compiler technology.