Elastic Pipeline: Addressing GPU On-Chip Shared Memory Bank Conflicts
Chunyang Gou    Georgi N. Gaydadjiev
Computer Engineering Lab
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology, The Netherlands
{C.Gou, G.N.Gaydadjiev}@tudelft.nl

ABSTRACT
One of the major problems with the GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These varied latencies result in conflicts at the writeback stage of the in-order pipeline and in pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed elastic pipeline, together with the co-designed bank-conflict aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for our benchmark applications, at trivial hardware overhead.

Categories and Subject Descriptors: C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: SIMD; B.3.2 [Design Styles]: Interleaved memories
General Terms: Design, Performance

1. INTRODUCTION
The trend is quite clear that multi/many-core processors are becoming pervasive computing platforms nowadays. The GPU is one example that uses massive numbers of lightweight cores to achieve high aggregate performance, especially for highly data-parallel workloads. Although GPUs were originally designed for graphics processing, the performance of many well-tuned general purpose applications on GPUs has established them as one of the most attractive computing platforms in a more general context, leading to the GPGPU (General-purpose Processing on GPUs) domain [2].

In manycore systems such as GPUs, massive multithreading is used to hide the long latencies of the core pipeline, the interconnect and the different memory hierarchy levels. On such heavily multithreaded execution platforms, the overall system performance is significantly affected by the efficiency of both on-chip and off-chip memory resources. As a rule, the factors impacting on-chip memory efficiency have quite different characteristics compared to the off-chip case. For example, on-chip memories tend to be more sensitive to dynamically changing latencies, while bandwidth limitations are more severe for off-chip memories. In the particular case of GPUs, the on-chip first-level memories, including both the software-managed shared memory and the hardware cache, are heavily banked in order to provide high bandwidth for the parallel SIMD lanes. Even with adequate bandwidth provided by the parallel memory banks, however, applications can still suffer drastic pipeline stalls, resulting in significant performance loss. This is due to unbalanced accesses to the on-chip memory banks. It increases the overhead of using on-chip shared memories, since the programmer has to take care of the bank conflicts. Furthermore, the GPGPU shared memory utilization range is often constrained due to such overhead.

In this paper, we observed that the throughput of the GPU processor core is often hampered neither by the on-chip memory bandwidth, nor by the on-chip memory latency (as long as it stays constant), but rather by the varied latencies due to memory bank conflicts, which end up as writeback conflicts and pipeline stalls in the in-order pipeline, thus degrading system throughput. To address this problem, we investigate a novel elastic pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput. This paper makes the following specific contributions:

• We analyzed the GPU on-chip shared memory bank conflict problem, and identified how the bank conflicts are translated into pipeline performance degradation;

• We investigate and propose a novel elastic pipeline design that minimizes the negative impact of on-chip shared memory conflicts on overall system throughput, by cutting the link between bank conflicts and pipeline stalls;

• We co-designed bank-conflict aware warp scheduling to assist our elastic pipeline hardware; and

• We carefully simulated our proposal and have shown pipeline stall reductions of up to 64.0%, leading to overall system performance improvements of up to 20.7%.

The remainder of the paper is organized as follows. In Section 2, we provide the background and motivation for this work. In Section 3, we discuss our proposed elastic pipeline design. The co-designed bank-conflict aware warp scheduling technique is elaborated in Section 4. Simulated performance of our proposed elastic pipeline in GPGPU applications is evaluated in Section 5, followed by some general discussions of our simulated GPU core architecture along with the elastic pipeline in Section 6. The major differences between our proposal and related art are described in Section 7. Finally, Section 8 concludes the paper.

Figure 1: (a) CUDA threads hierarchy; (b) thread execution in the GPU core pipeline, annotated with the shared memory access pattern ld.shared.f32 %f1, [addr] where addr = 20*tid.y + 4*(tid.x%4) + 0x00; (c) GPU chip organization.
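The access pattern annotated in Figure 1(b) can be made concrete with a small host-side check. This is only a sketch: it assumes 4-byte shared memory words and the five banks drawn in the figure (real GPUs typically have 16 or 32 banks), and simply maps each SIMD lane's address to a bank:

#include <cstdio>

int main() {
    const int NUM_BANKS = 5;           // as drawn in Figure 1(b); 16 or 32 on real GPUs
    const int WORD_BYTES = 4;          // ld.shared.f32 accesses 4-byte words
    for (int lane = 0; lane < 5; ++lane) {
        int tid_x = lane, tid_y = 0;   // one row of the thread block
        int addr = 20 * tid_y + 4 * (tid_x % 4) + 0x00;
        int bank = (addr / WORD_BYTES) % NUM_BANKS;
        printf("lane %d -> addr 0x%02x -> bank %d\n", lane, addr, bank);
    }
    return 0;                          // lanes 0 and 4 both report bank 0: a 2-way conflict
}

Lanes 0 and 4 collide on bank 0, so with single-ported banks their accesses must be serialized; it is exactly this kind of access-dependent, variable latency that the elastic pipeline targets.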
2. BACKGROUND AND MOTIVATION
In this section, we first introduce some GPGPU-related background and the corresponding shared memory accesses. Then we provide a motivation example.

2.1 Shared Memory Access on GPGPU
GPU utilization has spanned far beyond graphics rendering, covering a wide spectrum of general purpose computing known as GPGPU [2]. The programming models of GPGPU (such as OpenCL [4] and CUDA [19]) are generally referred to as explicitly-parallel, bulk-synchronous SPMD (Single Program Multiple Data). In such programming models, the programmer extracts the data-parallel section of the sequential application code, identifies the basic working unit (typically an element in the problem domain), and explicitly expresses the same sequence of operations on each working unit in a kernel. Multiple kernel instances (called threads in CUDA) run independently on GPU cores. The parallel threads are organized into a two-level hierarchy, in which a kernel (grid in CUDA) consists of parallel CTAs (Cooperating Thread Arrays, or blocks in CUDA), with each CTA composed of parallel threads, as shown in Figure 1(a). Explicit, localized synchronizations and on-chip data sharing mechanisms (such as the shared memory in CUDA) are supported inside each CTA. During execution, the parallel threads of a CTA are grouped into warps, which are scheduled and issued to the pipeline in an interleaved manner, also known as barrel processing [22].
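As an illustration of this hierarchy and of the CTA-local shared memory, a minimal, hypothetical CUDA kernel could look as follows (the kernel name, TILE size and pointer arguments are illustrative and not taken from the paper):

#define TILE 256

// Each CTA (block) stages TILE elements in on-chip shared memory,
// synchronizes locally, and writes scaled results back to global memory.
__global__ void scale_shared(const float *in, float *out, float factor) {
    __shared__ float buf[TILE];                  // software-managed on-chip memory
    int tid = threadIdx.x;                       // thread index within the CTA
    int gid = blockIdx.x * blockDim.x + tid;     // global index across the grid

    buf[tid] = in[gid];                          // each lane loads one element
    __syncthreads();                             // explicit, CTA-local synchronization

    out[gid] = buf[tid] * factor;                // consume the shared copy
}

// Host-side launch: a grid of N/TILE CTAs, each holding TILE parallel threads.
// scale_shared<<<N / TILE, TILE>>>(d_in, d_out, 2.0f);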
GPUs rely mainly on massive hardware multithreading to hide external DRAM latencies. In addition, on-chip memory hierarchies are also deployed in GPUs in order to provide high bandwidth and low latency. Such on-chip memories include software-managed caches (shared memory), hardware caches, or a combination of both [13]. To provide adequate bandwidth for the GPU parallel SIMD lanes, the shared memory is heavily banked. However, when accesses to the shared memory banks are unbalanced, shared memory bank conflicts occur. For example, with the memory access pattern shown on top of Figure 1(b), the data needed by both lanes 0 and 4 reside in the same shared memory bank 0. In this case a hot bank is formed at bank 0, and the two conflicting accesses have to be serialized, assuming a single-port shared memory design¹. As a result, the GPU core throughput may be substantially degraded, as exemplified by the following example.

2.2 Motivation Example
A snapshot of the AES encryption kernel source is shown in Figure 2. The code shown there deals with the second encryption stage. First, the stage input data indexes are loaded from the shared memory region stageBlock2 (phase I). Then the stage input data are loaded from the shared memory regions tBox*Block (phase II), using the indexes from phase I. Afterwards the data is processed (phase III) and finally stored to the shared memory region stageBlock1 (phase IV). The other stages of the encryption process work similarly.
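Figure 2 itself is not reproduced here, so the sketch below only mirrors the four phases described above. The shared memory region names (stageBlock1, stageBlock2, tBox0Block, tBox1Block) are taken from the text, while the sizes, indexing and the phase III computation are purely illustrative placeholders rather than the paper's actual AES code:

// Hedged sketch of one encryption stage; assume a CTA of at most 128 threads
// so the illustrative indexing below stays in bounds.
__global__ void aes_stage2_sketch(void) {
    // Shared-memory regions named after the text; sizes and layout are guesses.
    __shared__ unsigned char stageBlock2[256];   // holds this stage's input indexes
    __shared__ unsigned int  tBox0Block[256];    // lookup tables read in phase II
    __shared__ unsigned int  tBox1Block[256];
    __shared__ unsigned char stageBlock1[256];   // receives this stage's output

    int tid = threadIdx.x;                       // (filling of the input regions elided)

    // Phase I: load the stage input data indexes from shared memory.
    unsigned char idx0 = stageBlock2[2 * tid + 0];
    unsigned char idx1 = stageBlock2[2 * tid + 1];

    // Phase II: load the stage input data from the tBox*Block regions,
    // using the indexes obtained in phase I (a data-dependent access pattern).
    unsigned int t0 = tBox0Block[idx0];
    unsigned int t1 = tBox1Block[idx1];

    // Phase III: process the data (placeholder combination).
    unsigned int result = t0 ^ t1;

    // Phase IV: store the stage output back to shared memory.
    stageBlock1[tid] = (unsigned char)(result & 0xff);
}

Because the phase II addresses depend on the indexes loaded in phase I, the bank distribution of each warp's shared memory accesses can vary at run time, which makes such data-dependent lookups a natural source of the bank conflicts discussed above.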