Supporting Irregularity in Throughput-Oriented Computing by SIMT-SIMD Integration

Daniel Thuerck NEC Laboratories Europe

Abstract—The last two decades have seen continued exponential performance increases in HPC systems, well after the predicted end of Moore's Law for CPUs, largely due to the widespread adoption of throughput-oriented compute accelerators such as GPUs. When faced with irregular yet throughput-oriented applications, however, their simple, grid-based computing model turns into a serious limitation. Instead of repeatedly tackling the issues of irregularity on the application layer, we argue that a generalization of the CUDA model to irregular grids can be supported through minor modifications to already established throughput-oriented architectures. To that end, we propose a unifying approach that combines techniques from both SIMD and MIMD approaches, but adheres to the SIMT principles – based on an unlikely ally: a wide-SIMD vector architecture. We extend CUDA's familiar programming model and implement SIMT-inspired strategies for dealing with data and control flow irregularities. Our approach requires only minimal hardware changes and an additional compiler phase. Using a model-based software simulation, we demonstrate that the proposed system can be a first step towards native support for irregularity on throughput-oriented processors while greatly simplifying the development of irregular applications.

I. INTRODUCTION

The last decade has seen a massive increase in raw compute power due to the inclusion of GPUs in HPC systems. GPUs augment the previously dominant multi-core CPU setups with throughput-oriented computing (TOC). Garland and Kirk [11] mention three important design principles for such systems: an abundance of rather simple processing units, SIMD-style execution and hardware multithreading. In this spirit, the scheduling hardware on GPUs is kept simple in favor of more compute, leading them to perform poorly in the case of control flow irregularity (e.g. branch divergence, fine-grained synchronization) or data irregularity (e.g. differing resource requirements between work items). Modern multi-core CPUs, on the other hand, are optimized for latency and can use simultaneous multithreading (SMT) to swap between executing applications. Their cores operate independently and request resources as needed. Even though latency- and throughput-oriented architectures are diametrical opposites, researchers have tried to transplant features between them to speed up the execution of programs in their native programming models.

Classical domains where TOC proves effective now face the increasing use of irregular applications: sparse matrices, graph neural networks and sum-product networks frequently appear in e.g. the machine learning community. In order to extend the success and simplicity of CUDA's programming model to these applications, we propose software and hardware modifications that are quickly realizable and add native support for data and control flow irregularity.

Our contributions are as follows:
• We extend CUDA's PTX ISA to support an irregular compute model via a mapping to metadata registers in wide-SIMD processors (handling data irregularity).
• We describe a method to use register renaming as a tool for SMT-like processing on a simple vector core, avoiding costly context switches (handling control flow irregularity).
• Lastly, we integrate both ideas into a modified wide-SIMD architecture and validate the effectiveness of our ideas using a model-based software simulation.

This paper focuses on ideas and their integration; realizing a full system is an objective for future work. We back our claims and support the viability of our ideas using a model-based simulation (cf. Section V).
II. RELATED WORK

The architectural continuum between throughput-oriented SIMT/SIMD designs and latency-oriented (e.g. SMT) designs has been explored to great depths. Multiple extensions to SIMT designs have been proposed: scalar co-processors for GPU cores [3], [19], [22] that avoid repeated scalar computations, central schedulers that share the GPU between host threads [14], and dynamic re-grouping of threads from a warp or block into convergent subgroups [9], [10]. Similar to SMT context switches, Frey et al. [8] propose a model for oversubscription of tasks to SIMT cores and a method for faster context switches using the cores' L1 caches. Similarly, work-stealing between warps has been explored [13].

On the other end of the spectrum, multiple works have investigated layering a SIMT scheduler on top of arrays of in-order CPU cores [2]. These arrays switch between a MIMD mode (each core operates independently) and a SIMT mode (control logic is shared by all processors) to save power. Subsequent works extend this idea into a form of "hardware auto-vectorization" [5], [20] of scalar code. In order to group similar instructions on the cores of the array, however, expensive crossbars are required.

Liquid SIMD [4] and Vapour SIMD [17] on the software side and ARM's SVE hardware extensions [1] improve SIMD systems for irregular applications: they offer a convenient way to set the SIMD vector length at runtime, adapting to tasks with varying resource requirements.
III. PROGRAMMING MODEL

We now describe our proposed generalization of CUDA's grid-based compute model to irregular workloads. Traditionally, CUDA kernels are parameterized over a grid of blocks, e.g.

kernel<<<m, n>>>(...);

which launches m blocks of n threads each. Each block is scheduled onto a streaming multiprocessor (SM) which executes warps of 32 threads. All threads within a warp operate in lockstep¹ and branches or conditionals are implemented by predication. In recent GPU architectures, all threads in a warp share access to the SM's register file and communicate through it with low latency. Blocks of variable sizes m_i have to be emulated by setting the block size to m = max_i {m_i} and masking out surplus threads in each block at runtime.

We build our proposed programming model with this issue in mind: First, we drop the block abstraction and directly expose warps to users. In order to handle data irregularity, we make the individual warps' sizes a runtime parameter. Instead of passing parameters m, n to the kernel, this requires passing an explicit list of warp sizes:

const int w_list[] = {4, 11, 3, 2, 8, 3};
kernel<<<w_list>>>(...);

As on GPUs, warps are assigned statically to SMs at initialization. Implementing kernels follows the same principle as for warp-centric models [12]: all threads in a warp execute in a bulk-synchronous manner and communicate through shuffle instructions. To distinguish code for our model from traditional CUDA code, we use the keyword tid for a thread's index in a warp and wid, ntids for a warp's index and size, respectively. As a poster-child example, we use the SpMV kernel in Fig. 1 (left), where each warp handles one row of a sparse CSR matrix (arrays csr_row, csr_col, csr_val). Therein, each thread handles one nonzero entry of the row and the results are accumulated by a warp-wide logarithmic reduction (lines 8 through 12). This simple kernel exhibits both data and light control flow irregularity: First, each warp uses a varying number of threads and thus its share of the SM's register file depends on a runtime parameter, whereas the classical CUDA execution model requires the register count at compile time. Second, the number of reduction steps depends on the warp size, leading to different execution paths.

¹With thread-independent scheduling as introduced with the Pascal architecture, the lockstep model has been somewhat relaxed.
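For the SpMV example, a natural choice is to let each row's nonzero count determine the size of its warp. The following sketch shows how a host program might derive the warp-size list from the CSR row pointer and issue the proposed launch; the helper name build_warp_sizes is ours, and the <<<w_list>>> launch form is the extension proposed above, not standard CUDA.

// Hypothetical host-side helper: one warp per CSR row, sized by the row's
// nonzero count (assumes every row has at least one nonzero entry).
#include <vector>

std::vector<int> build_warp_sizes(const int* csr_row, int n_rows) {
    std::vector<int> w_list(n_rows);
    for (int r = 0; r < n_rows; ++r)
        w_list[r] = csr_row[r + 1] - csr_row[r];   // nonzeros in row r
    return w_list;
}

// Usage sketch (proposed extension, not standard CUDA):
//   std::vector<int> w_list = build_warp_sizes(csr_row, n_rows);
//   spmv_kernel<<<w_list>>>(csr_row, csr_col, csr_val, x, b);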
IV. IMPLEMENTATION

Somewhat unconventionally, we propose executing programs written in the programming model from above on traditional SIMD hardware. Specifically, we consider wide-SIMD (i.e. vector) hardware due to its ability to set the vector length at runtime. The proposed system introduces additions to CUDA's C-to-PTX compiler and modifications to vector instruction buffers and register renaming units in hardware. Such buffers and renaming units are frequently found in SIMD hardware (e.g. Intel CPUs with AVX). The fundamental ideas of our system are to translate SIMT code (with SIMD-friendly additions) into SIMD code at runtime (as opposed to e.g. binary translation [7]) and to use register renaming tables to execute multiple warps simultaneously.

A. Front-End

Executing SIMT code efficiently requires hardware support for predicated execution as well as branch and reconvergence handling. In SIMT models, each thread inside a warp executes the same (scalar) code, but is parameterized by its index inside the warp (CUDA: lane_id). SIMD code, on the other hand, operates only on whole vectors at once. Thus, we propose modifications to CUDA's virtual PTX code in order to make it more SIMD-friendly, simplifying processing in the back-end. The changes are visualized using the SpMV example in Fig. 1.

Metadata registers. We follow the SIMT-on-SIMD paradigm of ISPC [18] by mapping the corresponding registers of the threads in a warp to lanes of SIMD registers. We find that it is possible to emulate SIMT execution by explicitly associating keywords such as $tid with metadata registers that hold one value per lane (parameterized SIMD execution). Appearances of those keywords in the code are then translated to registers with suitable contents, as marked by (1) in Fig. 1. Unless there is branching involved, SIMT code translates 1:1 into SIMD code; predicated instructions are realized through masked SIMD registers. One more data register holds a candidate PC per thread.

Scalar branch control. Without such per-lane program counters (PCs), SIMD hardware is unable to track the execution paths of different threads. Instead, all threads must follow the same path of execution, and inactive threads' lanes are masked out. In order to handle divergence and lane masks, we use the technique by Lorie and Strong [15] that inserts JOIN nodes into the code at which the activity of all lanes is tested; if (through e.g. branching) no active threads remain, the candidate PC of an inactive thread is picked up as the warp's sole PC. In the following, we refer to PTX code with these two additions as vector-ready PTX (vrPTX).
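To illustrate parameterized SIMD execution, the sketch below shows how the per-lane contents of the $tid, $ntids and $wid metadata registers could be materialized for a partition that packs several warps; the values reproduce partition P0 of Fig. 2b, while the type and function names are ours and not part of the proposed ISA.

// Sketch: per-lane metadata for a partition packing warps of sizes {4, 3}.
#include <cstdint>
#include <utility>
#include <vector>

struct LaneMetadata {
    std::vector<int32_t> tid;    // lane's thread index within its warp  -> $tid
    std::vector<int32_t> ntids;  // size of the lane's warp              -> $ntids
    std::vector<int32_t> wid;    // index of the lane's warp             -> $wid
};

// warps: list of (wid, ntids) pairs packed into one partition.
LaneMetadata build_metadata(const std::vector<std::pair<int, int>>& warps) {
    LaneMetadata m;
    for (const auto& [wid, ntids] : warps)
        for (int t = 0; t < ntids; ++t) {
            m.tid.push_back(t);
            m.ntids.push_back(ntids);
            m.wid.push_back(wid);
        }
    return m;  // the partition's vector length vl equals m.tid.size()
}

// build_metadata({{0, 4}, {2, 3}}) yields $tid = [0 1 2 3 0 1 2],
// $ntids = [4 4 4 4 3 3 3] and $wid = [0 0 0 0 2 2 2], matching Fig. 2b.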

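The JOIN mechanism can be restated in code as follows. The paper does not prescribe a particular selection policy for the new PC; picking the smallest candidate PC is a common reconvergence heuristic and is used here purely for illustration, with all names being ours.

// Sketch of PC selection at a JOIN node (cf. Lorie and Strong [15]).
// Assumes a warp with at least one lane.
#include <algorithm>
#include <cstdint>
#include <vector>

struct LaneState { bool active; uint64_t candidate_pc; };

// If any lane is still active, the warp keeps its current PC; otherwise the
// candidate PC of an inactive lane (here: the smallest one) becomes the
// warp's sole PC and the lanes waiting there are re-activated.
uint64_t join(std::vector<LaneState>& lanes, uint64_t current_pc) {
    for (const LaneState& l : lanes)
        if (l.active) return current_pc;
    uint64_t next = lanes.front().candidate_pc;
    for (const LaneState& l : lanes)
        next = std::min(next, l.candidate_pc);
    for (LaneState& l : lanes)
        l.active = (l.candidate_pc == next);
    return next;
}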
[Figure 1 spans the page width here. Its left panel lists the SpMV kernel in C source (reproduced below), the middle panel shows the corresponding vrPTX translation, and the right panel illustrates the per-lane vector contents for a warp with ntids = 4.]

1  float a_val, x_val, res;
2  int a_ix, col;
3  a_ix = csr_row[wid] + tid;
4  a_val = csr_val[a_ix];
5  col = csr_col[a_ix];
6  x_val = x[col];
7  res = a_val * x_val;
8  for(int s = 1; s < ntids; s *= 2)
9  {
10   if(tid + s < ntids)
11     res += shuffle_down(res, s);
12 }
13 if(tid == 0)
14   b[wid] = res;

Figure 1: Our solution integrates a modified compilation pass from CUDA to vector-ready PTX that enables execution of SIMT code on SIMD architectures by concatenating threads' registers into SIMD registers. There, we (1) add special registers for warp and thread indices, (2) insert synchronization points for PC selection after thread divergence and (3) turn predicated into masked instructions.

B. Back-End

With parameterized SIMD execution, differently-sized warps all execute the same vrPTX code. Nevertheless, executing each warp on its own SIMD core would often underutilize the hardware and prevent us from hiding latencies, one of the bedrocks of TOC. Therefore, we propose two hardware modifications in SIMD systems:

Partitioning. Following our execution model, executing a single warp whose size is less than the number of SIMD lanes would leave many lanes in wide-SIMD systems unoccupied. Since a lane's execution only depends on its metadata, we use this fact to pack multiple warps' data into the same SIMD registers, increasing the vector length as necessary; we refer to the packed warps as one partition. Since warps in a partition may diverge at branches, we propose the following: in the compile phase, we unroll the vrPTX source for multiple values of ntids and count the resulting number of instructions. We then group the possible values of ntids into buckets according to the difference in their instruction counts; warps that fall into the same bucket may be put into the same partition, hoping that they will behave similarly. We may repeat this experiment for random outcomes of branch instructions to improve the estimate.
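A minimal sketch of this bucketing heuristic is given below, assuming the per-ntids instruction counts from the unrolling step are already available. We interpret the packing threshold t (cf. Section V) as the maximum allowed difference in instruction counts within a bucket; the paper does not spell out the exact metric, and all names are illustrative.

// Group ntids values into buckets whose estimated instruction counts differ
// by at most t; warps whose sizes fall into the same bucket may share a partition.
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

std::vector<std::vector<int>> bucket_by_instruction_count(
    const std::map<int, int>& instrs_per_ntids,  // ntids -> unrolled instruction count
    int t)                                       // packing threshold
{
    std::vector<std::pair<int, int>> sorted(instrs_per_ntids.begin(),
                                            instrs_per_ntids.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto& a, const auto& b) { return a.second < b.second; });

    std::vector<std::vector<int>> buckets;
    int bucket_base = 0;
    for (const auto& [ntids, count] : sorted) {
        if (buckets.empty() || count - bucket_base > t) {
            buckets.push_back({});        // open a new bucket
            bucket_base = count;
        }
        buckets.back().push_back(ntids);  // these warp sizes may be packed together
    }
    return buckets;
}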
Vector code issue and multiplexing. Even with packing, we still need a method to handle warps of drastically different sizes and diverging control flow. Wrapping warps into SMT threads is not an option: with larger SIMD registers, context switches become prohibitively expensive. Instead, we propose a static partition multiplexing scheme that uses a register renaming unit to execute multiple partitions at once. We visualize our idea in Fig. 2b: As long as vl is less than the number of SIMD lanes, all partitions require the same number of SIMD registers. Hence, we can proceed analogously to SIMT processors and divide the register file according to the partitions. In our example in Fig. 2b, partition 0 (packing warps 0 and 2) uses physical SIMD registers v0 through v4. Using this partitioned register table (PRT), the incoming vrPTX instructions can be mapped conflict-free to physical SIMD registers. After the mapping, a lookup table performs the 1:1 translation from vrPTX to SIMD vector instructions and sets the runtime vl accordingly. After renaming, there are no conflicts between streams from different partitions, so all are multiplexed into the same instruction buffer. Through the vector instruction scheduler, this method then automatically hides latencies.
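Functionally, the partitioned register table is a per-partition map from vrPTX register names to physical vector registers plus the partition's vl. The sketch below mirrors the example of Fig. 2b; the type and function names are ours.

// Sketch of a partitioned register table (PRT), mirroring Fig. 2b.
#include <map>
#include <string>

struct Partition {
    int vl;                           // runtime vector length of this partition
    std::map<std::string, int> regs;  // vrPTX register name -> physical vector register
};

// Rename one vrPTX operand for a given partition; operands not yet mapped
// would receive the next register from the free list (omitted here).
int rename(const Partition& p, const std::string& vrptx_reg) {
    return p.regs.at(vrptx_reg);
}

// Example from Fig. 2b: partition P0 (warps 0 and 2, vl = 7) maps $tid->v0,
// $ntids->v1, $wid->v2, %0->v3, %s->v4, so "add %0, $tid, %s" becomes
// "vadd v3, v0, v4" with vl = 7; partition P1 (warp 4, vl = 8) maps the same
// names to v5..v9, yielding "vadd v8, v5, v9" with vl = 8. The free list
// starts at v10.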

C. Integration into SX-Aurora

As a practical example, we consider the modification of a design that is already on the market: NEC's SX-Aurora TSUBASA [21]. Aurora's PCIe cards offer up to 10 cores at 1.6 GHz with up to 3.07 TFLOPs in double precision mode. All cores share 16 MB of last-level cache (LLC) and 48 GB of HBM2 memory with a peak bandwidth of 1.53 TB/s. Each Aurora core includes a scalar processing unit (SPU) and a vector processing unit (VPU) with several vector pipeline processors (VPPs), as shown in Fig. 2a. The VPU uses register renaming to execute vector instructions out of order by dispatching them to vector data and mask registers as well as FMA/ALU execution units. In our design, we focus on the VPU exclusively and use only the instruction fetching capabilities of the SPU. Fig. 3 depicts our modifications: Compared to the original VPU, we pull the register renaming unit before the vector instruction buffer, since we do not use out-of-order execution within partitions. Instead, vrPTX instructions are loaded for multiple partitions and multiplexed into a single vector instruction stream. Hence, we offer a comparatively cheap way to leverage existing IP for efficient irregular processing.
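Put together, the instruction mux of Figs. 2b and 3 can be modeled in software as a round-robin loop over the per-partition program counters that renames each fetched vrPTX instruction through the PRT and pushes it, together with the partition's vl, into the shared vector instruction buffer. The sketch below is such a software model under our naming assumptions, not a description of the actual hardware; the fetch and rename helpers are assumed to exist.

// Software model of the proposed instruction mux (cf. Figs. 2b and 3).
#include <deque>
#include <string>
#include <vector>

struct VectorInstruction { std::string text; int vl; };

// Assumed helpers (not shown): fetch the next vrPTX instruction of a partition
// and rename its operands through the partitioned register table.
std::string fetch_next_vrptx(int partition_id);
std::string rename_operands(int partition_id, const std::string& vrptx_instr);

void multiplex(const std::vector<int>& partition_vls,
               std::deque<VectorInstruction>& vector_instruction_buffer,
               int instructions_per_round) {
    for (int i = 0; i < instructions_per_round; ++i)
        for (int p = 0; p < (int)partition_vls.size(); ++p) {  // round-robin over partitions
            std::string vrptx = fetch_next_vrptx(p);
            vector_instruction_buffer.push_back(
                {rename_operands(p, vrptx), partition_vls[p]});  // one shared stream
        }
}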

[Figure 2 appears here: panel (a) shows the structure of SX-Aurora's VPU with its register renaming unit, forwarding, instruction scheduler, FMA/ALU/store pipelines and VPPs; panel (b) shows two partitions with their partitioned register table, per-partition vl, free list and the resulting multiplexed vector instructions.]

Figure 2: (a) SX-Aurora's vector unit offers 8 VPPs and an out-of-order scheduler that resolves hazards through register renaming. (b) We propose to re-purpose the register renaming unit to multiplex vrPTX instruction streams from multiple warps into one single stream of vector instructions.

[Figure 3 appears here: it shows the modified core with a 4-way instruction fetch and decode (replacing the SPU), warp dispatch, the partitioned register table and register rename stage, the instruction mux, the vector instruction buffer, the vector scheduler, the VPPs and the LLC.]

Figure 3: Integration of our proposed warp dispatch (Figure 2a) into SX-Aurora's vector core. We move the instruction buffer behind it and remove the SPU, except for its fetch unit.

V. SIMULATION RESULTS

Due to a lack of public details regarding NEC's SX-Aurora, we used the available ISA documentation in order to build a model following Fig. 2a and simulate it using the SimPy framework [16]. We express all latencies as multiples of a simple arithmetic vector operation that takes 1 cycle; any complex or store operation's latency is the active vector length. We simulate one core of the vector processor with the same number of memory controllers and ports as Aurora. Our simulation consumes the generated PTX from our SpMV example code (see Fig. 1). We input multiple matrices from the SuiteSparse Matrix Collection's [6] linear programming category, since these matrices often suffer from data irregularity (i.e. different row lengths). This test is meant to showcase two things: First, allowing more partitions per core (which is also the number of required instruction fetch units) results in a larger vector instruction buffer, which leads to better utilization of the execution units and thus fewer empty cycles. Second, packing can save instructions by batching warps – again, we expect less time to termination.
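For reference, this latency model can be restated compactly in code; the actual simulator is written with the SimPy framework [16], and the snippet below is merely an illustrative restatement with names of our choosing.

// Illustrative restatement of the simulation's cost model: simple arithmetic
// vector operations take 1 cycle; complex and store operations take as many
// cycles as the currently active vector length vl.
enum class OpClass { SimpleVector, ComplexVector, Store };

int latency_cycles(OpClass op, int vl) {
    return (op == OpClass::SimpleVector) ? 1 : vl;
}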

Fig. 4 presents simulation results for two matrices that are representative of the test set: lp22 (2,958 × 16,392; 68,512 nonzeros – first row of Fig. 4) and mycielskian11 (1,535 × 1,535; 134,710 nonzeros – second row). Their row-length distributions, and thus warp-size distributions, are visualized in Fig. 4a.

Fig. 4b supports our first hypothesis: independent of the packing setup, more VPP slots result in consistently fewer cycles being used. More slots lead to more, and potentially different, simultaneous instructions in the instruction buffer, which in turn may be executed in parallel (pending availability). Furthermore, a looser threshold for packing (permitting higher warp-size variation inside a partition) further reduces the total number of cycles spent. In Fig. 4c, we visualize the utilization of the execution units in the same experiments: again, both more partitions as well as looser packing thresholds increase the utilization until reaching a plateau.

Lastly, we point to the error bars for t = 4 (Figs. 4b, 4c): for each parameter setting, we ran the simulation 100 times, every time with a random order of the input matrix' rows, and plot the resulting error bars in both the cycle and the utilization plot. We point out that although the variation is relatively large, at times negating the benefit of packing entirely, the average line (black) tends strongly towards the better region (lower cycles, higher utilization). This indicates that a small number of outliers is responsible for such failures. Since we currently do not support work stealing or dynamic allocation, these outliers directly correspond to certain row orders. We leave a closer investigation for future work.

[Figure 4 appears here. Panel (a): row-length histograms (x-axis: Row Length; y-axes: #Rows for lp22 and mycielskian11). Panel (b): cycles (×10^6) over the number of partitions. Panel (c): VPP utilization over the number of partitions. Panels (b) and (c) each compare "No packing", t = 2 and t = 4.]

Figure 4: Results from a model-driven simulation of our setup running an SpMV kernel for two sparse matrices (upper row: lp22, lower row: mycielskian11) with one order of magnitude irregularity. The results show that our architecture modifications help to make efficient use of the execution units and hide latencies.

VI. CONCLUSION

In this paper, we briefly discussed how existing vector accelerators could potentially be improved: embedded in a full stack of programming model, compiler and architecture changes, they can mark a first step towards throughput-oriented computing with full support for irregular workloads while retaining the familiar CUDA programming model. Having so far only validated our ideas using a model-based simulation, we plan to follow that up with a cycle-accurate simulation of the proposed system.

REFERENCES

[1] A. Armejach, H. Caminal, J. M. Cebrian, R. González-Alberquilla, C. Adeniyi-Jones, M. Valero, M. Casas, and M. Moretó. Stencil codes on a vector length agnostic architecture. In PACT'18.
[2] K.-C. Chen and C.-H. Chen. Enabling SIMT Execution Model on Homogeneous Multi-Core System. ACM TACO, 15(1):1–26, 2018.
[3] Z. Chen and D. Kaeli. Balancing Scalar and Vector Execution on GPU Architectures. In IPDPS'16.
[4] N. Clark, A. Hormati, S. Yehia, S. Mahlke, and K. Flautner. Liquid SIMD: Abstracting SIMD hardware using lightweight dynamic mapping. In HPCA'07.
[5] S. Collange. Simty: Generalized SIMT execution on RISC-V. 2017.
[6] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM TOMS, 38:1–25, 2011.
[7] G. Diamos, A. Kerr, and M. Kesavan. Translating GPU Binaries to Tiered SIMD Architectures with Ocelot. Technical Report GIT-CERCS-09-01.
[8] S. Frey, G. Reina, and T. Ertl. SIMT Microscheduling: Reducing Thread Stalling in Divergent Iterative Algorithms. In PDP'20.
[9] W. W. L. Fung and T. M. Aamodt. Thread block compaction for efficient SIMT control flow. In HPCA'11.
[10] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM TACO, 6(2):1–37, 2009.
[11] M. Garland and D. B. Kirk. Understanding throughput-oriented architectures. Communications of the ACM, 53(11):58–66, 2010.
[12] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA graph algorithms at maximum warp. ACM SIGPLAN Notices, 46(8):267–276, 2011.
[13] M. Huzaifa, J. Alsop, A. Mahmoud, G. Salvador, M. D. Sinclair, and S. V. Adve. Inter-kernel Reuse-aware Thread Block Scheduling. ACM TACO, 17(3):1–27, 2020.
[14] J. Kim, J. Cha, J. J. K. Park, D. Jeon, and Y. Park. Improving GPU Multitasking Efficiency Using Dynamic Resource Sharing. IEEE Computer Architecture Letters, 18(1):1–5, 2019.
[15] R. A. Lorie and H. R. Strong Jr. Method for conditional branch execution in SIMD vector processors. U.S. Patent US4435758A, Mar. 1984.
[16] K. G. Müller and T. Vignaux. SimPy. https://github.com/cristiklein/simpy.
[17] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In CGO'11.
[18] M. Pharr and W. R. Mark. ispc: A SPMD compiler for high-performance CPU programming. In InPar'12.
[19] M. Stanic, O. Palomar, T. Hayes, I. Ratkovic, A. Cristal, O. Unsal, and M. Valero. An Integrated Vector-Scalar Design on an In-Order ARM Core. ACM TACO, 14(2):1–26, 2017.
[20] A. Tino, C. Collange, and A. Seznec. SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores. ACM TACO, 17(2):15:1–15:23, 2020.
[21] Y. Yamada and S. Momose. Vector engine processor of NEC's brand-new supercomputer SX-Aurora TSUBASA. In HotChips'18.
[22] Y. Yang, P. Xiang, M. Mantor, N. Rubin, L. Hsu, Q. Dong, and H. Zhou. A Case for a Flexible Scalar Unit in SIMT Architecture. In IPDPS'14.