Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
2018 IEEE International Symposium on High Performance Computer Architecture
2378-203X/18/$31.00 ©2018 IEEE, DOI 10.1109/HPCA.2018.00058

Anthony Gutierrez, Bradford M. Beckmann, Alexandru Dutu, Joseph Gross, John Kalamatianos, Onur Kayiran, Michael LeBeane, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain†, Timothy G. Rogers†
AMD Research, Advanced Micro Devices, Inc.
†School of Electrical and Computer Engineering, Purdue University
{anthony.gutierrez, brad.beckmann, alexandru.dutu, joe.gross, john.kalamatianos, onur.kayiran, michael.lebeane, matthew.poremba, brandon.potter, sooraj.puthoor, matthew.sinclair, mark.wyse, jieming.yin, xianwei.zhang}@amd.com
{jain156, timrogers}@purdue.edu

Abstract—Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA's intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions.

Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2× that of the IL on average, but contention for vector registers is reduced by 3× due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.

Keywords—ABI; GPU; Intermediate Language; Intermediate Representation; ISA; Simulation

INTRODUCTION

Research in GPU microarchitecture has increased dramatically with the advent of GPGPUs. Heterogeneous programming features are now appearing in the most popular programming languages, such as C++ [24] and Python [18]. To evaluate research ideas for these massively parallel architectures, academic researchers typically rely on cycle-level simulators. Due to the long development time required to create and maintain these cycle-level simulators, much of the academic research community relies on a handful of open-source simulators, such as GPGPU-Sim [11], gem5 [14], and Multi2Sim [35].

Simulating a GPU is especially challenging because of the two-phase compilation flow typically used to generate GPU kernel binaries. To maintain portability between different generations of GPU hardware, GPU kernels are first compiled to an intermediate language (IL). Prior to launching a kernel, low-level software, such as the GPU driver or runtime, finalizes the IL into the machine instruction set architecture (ISA) representation that targets the GPU hardware architecture.

Unlike CPU ISAs, GPU ISAs are often proprietary, change frequently, and require the simulation model to incorporate more functional state; thus academic researchers often use IL simulation. In particular, NVIDIA's Parallel Thread Execution (PTX) ISA [33] and the HSA Foundation's Heterogeneous Systems Architecture Intermediate Language (HSAIL) virtual ISA [26] are the most popular ILs. However, AMD has recently disclosed several GPU ISA specifications, including the Graphics Core Next (GCN) 3 ISA [3] explored in this work. AMD has also released their complete GPU software stack under an open-source license [8], thus it is now feasible for academic researchers to simulate at the lower machine ISA level.

From an industrial perspective, this paper makes the case that using IL execution is not sufficient when evaluating many aspects of GPU behavior, and it implores academics to use machine ISA execution, especially when evaluating microarchitecture features. The primary goals of ILs (e.g., abstracting away hardware details and making kernels portable across different hardware architectures) directly conflict with the goals of cycle-level simulation, such as accounting for hardware resource contention and taking advantage of unique hardware features. Furthermore, GPUs are co-designed hardware-software systems where vendors frequently change the machine ISA to simplify the microarchitecture and push more complexity to software.

While several prior works have investigated modeling challenges for the CPU and memory [12] [15] [20] [23], this paper is the first to deeply investigate the unique issues that occur when modeling GPU execution using the IL. To facilitate the investigation, this paper also introduces a new simulation infrastructure capable of simulating both HSAIL and AMD's GCN3 machine ISA using the same microarchitecture model.

This paper explores the space of statistics for which IL simulation provides a faithful representation of the program under test, and those for which it does not. Figure 1 summarizes several key statistics that differ significantly between HSAIL and GCN3 and a few that do not. Specifically, due to significant code expansion, substantial differences in the application binary interface (ABI), and the lack of scalar instructions, HSAIL substantially underestimates the dynamic instruction count and the code footprint. Despite the significant underestimation in the instruction issue rate, HSAIL substantially overestimates GPU cycles, bank conflicts, value uniqueness within the vector register file (VRF), and instruction buffer (IB) flushes. Meanwhile, the lack of a scalar pipeline does not impact HSAIL's accuracy in estimating single instruction, multiple data (SIMD) unit utilization and the program's data footprint.

Figure 1: Average of dissimilar and similar statistics.

In summary, we make the following contributions:
• We add support for the Radeon Open Compute platform (ROCm) [8] and the state-of-the-art GCN3 ISA to gem5's GPU compute model, and we demonstrate how using the actual runtimes and ABIs impacts accuracy.
• We perform the first study quantifying the effects of simulating GPUs using an IL, and we identify when using an IL is acceptable and when it is not.
• Furthermore, we demonstrate that the impact IL simulation has on estimated runtime is particularly hard to predict and application dependent; thus architects cannot simply rely on "fudge factors" to make up the difference.

BACKGROUND

A. GPU Programming

GPUs are programmed using a data-parallel, streaming computation model. In this model a kernel is executed by a collection of threads (named work-items in the terminology applied in this paper). A programmer, compiler, or automation tool is responsible for identifying the computation forming the kernel. Modern GPUs support control flow operations allowing programmers to construct complex kernels.

High-level GPU programming languages define kernels using the single instruction, multiple thread (SIMT) execution model. Threads in a kernel are subdivided into workgroups, which are further subdivided into wavefronts (WFs). All work-items in a WF are executed in lockstep on the SIMD units of a compute unit (CU). Additionally, AMD's GCN3 architecture includes a scalar unit with corresponding scalar instructions. The use of these scalar instructions is transparent to the high-level application programmer, as they are generated by the compilation toolchain and are inserted into the WF's instruction stream, which also includes vector instructions.

B. GPU Hardware

CUs are the scalable execution units that are instantiated many times within a GPU. GPUs also typically include a multi-level cache hierarchy to improve memory bandwidth, and command processors (CPs), also called packet processors in HSA terminology [25], to manage work assignment.

Figure 2 provides a high-level block diagram of the CU model in gem5, which is based on AMD's GCN3 architecture [2]. Each CU contains four SIMD engines, a scalar unit, WF slots, local and global memory pipelines, a branch unit, a scalar register file (SRF), a VRF, a private L1 data cache, memory coalescing logic, and an on-chip local data share (LDS). Each CU is connected to shared L1 scalar data and instruction caches, which connect to memory through a shared L2 cache.

Figure 2: Block diagram of the compute unit pipeline.

The GCN3 design splits the CU into four separate sets of SIMDs, each of which executes a single instruction (same PC) in lockstep on sixteen lanes. Thus, a single 64-work-item WF instruction is executed across 4 cycles. The GCN3 scalar unit's primary purpose is to handle control flow and to aid address generation.
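The timing relationship described above (a 64-work-item WF issued over a 16-lane SIMD in 4 cycles) can be sketched as follows. This is a minimal illustrative model written for this discussion, not the gem5 CU model itself; the function names and the flat issue-cycle accounting are assumptions made for clarity.

```python
# Minimal sketch of GCN3-style wavefront issue: a 64-work-item
# wavefront executes one instruction across a 16-lane SIMD in
# 64 / 16 = 4 cycles. Hypothetical model, not the gem5 code.
WAVEFRONT_SIZE = 64
SIMD_LANES = 16

def cycles_per_wf_instruction(wf_size=WAVEFRONT_SIZE, lanes=SIMD_LANES):
    # Each cycle, one 16-lane slice of the wavefront executes in lockstep,
    # so an instruction occupies the SIMD for ceil(wf_size / lanes) cycles.
    return (wf_size + lanes - 1) // lanes

def kernel_issue_cycles(num_work_items, dyn_instrs_per_wf):
    # Work-items are grouped into wavefronts; each wavefront issues
    # every dynamic instruction over cycles_per_wf_instruction() cycles.
    # Ignores overlap across SIMD engines and memory latency entirely.
    num_wfs = (num_work_items + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE
    return num_wfs * dyn_instrs_per_wf * cycles_per_wf_instruction()

print(cycles_per_wf_instruction())   # 4 cycles per wavefront instruction
print(kernel_issue_cycles(256, 10))  # 4 wavefronts x 10 instrs x 4 cycles = 160
```

A real CU overlaps wavefronts across its four SIMD engines and hides memory latency, so this back-of-the-envelope count is only the serialized issue cost.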
HSAIL:
    # Return Absolute WI ID
    workitemabsid $v0, 0;
    # load kernarg at addr %arg1
    ld_kernarg $v[0:1], [%arg1]

GCN3:
    # Read AQL Pkt
    s_load_dword s10, s[4:5], 0x04
    # mv kernarg base to v[1:2]
    v_mov v1, s6
    v_mov v2, s7
    # Wait for number of s_loads = 0
    s_waitcnt lgkmcnt(0)
    # load kernarg into v3
    flat_load_dword
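The listing above illustrates the code expansion behind the paper's roughly 2× dynamic instruction count gap: a single HSAIL kernarg load finalizes into several GCN3 instructions (register moves, an s_waitcnt, and a flat load). A toy tally of that effect, under the simplifying and hypothetical assumption that each IL instruction maps to a fixed set of machine instructions, might look like:

```python
# Toy illustration of IL-to-machine-ISA code expansion, loosely based
# on the kernarg-load listing above. The expansion table is hypothetical;
# a real finalizer makes context-dependent choices.
hsail_trace = ["workitemabsid", "ld_kernarg"]

expansion = {
    # Assumed GCN3 sequences for each HSAIL op in this example.
    "workitemabsid": ["s_load_dword", "s_waitcnt"],
    "ld_kernarg": ["v_mov", "v_mov", "s_waitcnt", "flat_load_dword"],
}

gcn3_trace = [m for op in hsail_trace for m in expansion[op]]
print(len(gcn3_trace) / len(hsail_trace))  # 3.0: the machine ISA runs more instructions
```

An IL-level simulator sees only the two-instruction trace, so any fetch, issue, or register-pressure statistic derived from it starts from a much shorter instruction stream than the hardware actually executes.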