THE AMD gem5 APU SIMULATOR: MODELING GPUS USING THE MACHINE ISA
THE AMD gem5 APU SIMULATOR: MODELING GPUS USING THE MACHINE ISA
Tony Gutierrez, Sooraj Puthoor, Tuan Ta*, Matt Sinclair, and Brad Beckmann
AMD Research, *Cornell
June 2, 2018

OBJECTIVES AND SCOPE
Objectives
‒ Introduce the Radeon Open Compute platform (ROCm)
‒ Introduce AMD's Graphics Core Next (GCN) architecture and the GCN3 ISA
‒ Describe the gem5-based APU simulator
Modeling scope
‒ Emphasis on the GPU side of the simulator
‒ APU (CPU+GPU) systems model, not a discrete GPU
‒ Covers GPU architecture, the GCN3 ISA, and HW-SW interfaces
Why are we releasing our code?
‒ Encourage AMD-relevant research
‒ Modeling the machine ISA and the real system stack is important [1]
‒ Enhance academic collaborations
‒ Enable intern candidates to get experience before arriving
‒ Enable interns to take their experience back to school
Acknowledgement
‒ AMD Research's gem5 team
[1] Gutierrez et al. Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level. HPCA, 2018.
2 | THE AMD gem5 APU SIMULATOR | JUNE 2, 2018 | ISCA 2018 TUTORIAL

QUICK SURVEY
Who is in our audience?
‒ Graduate students
‒ Faculty members
‒ Working for government research labs
‒ Working for industry
Have you written a GPU program?
‒ CUDA, OpenCL(TM), HIP, HC, C++ AMP, other languages
Have you used these simulators?
‒ GPGPU-Sim
‒ Multi2Sim
‒ gem5
‒ Our HSAIL-based APU model
Are you familiar with our HPCA 2018 paper?
‒ Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

OUTLINE
Topic                          Presenter  Time
Background                     Tony       8:00 – 8:15
ROCm, GCN3 ISA, and GPU Arch   Tony       8:15 – 9:15
HSA Implementation in gem5     Sooraj     9:15 – 10:00
Break                                     10:00 – 10:30
Ruby and GPU Protocol Tester   Tuan       10:30 – 11:15
Demo and Workloads             Matt       11:15 – 11:50
Summary and Questions          All        11:50 – 12:00

BACKGROUND
Overview of gem5
‒ Source tree
GPU terminology and system overview
HSA standard and building blocks
‒ Coherent shared virtual memory
‒ User-level queues
‒ Signals
‒ etc.

OVERVIEW OF gem5
Open-source, modular platform for system architecture research
‒ Integration of M5 (Univ. of Michigan) and GEMS (Univ. of Wisconsin)
‒ Actively used in academia and industry
Discrete-event simulation platform with numerous models
‒ CPU models at various performance/accuracy trade-off points
‒ Multiple ISAs: x86, ARM, Alpha, Power, SPARC, MIPS
‒ Two memory system models: Ruby and "classic" (M5), including caches, DRAM controllers, interconnects, coherence protocols, etc.
‒ I/O devices: disk, Ethernet, video, etc.
‒ Full system or app-only (system-call emulation)
Cycle-level modeling (not "cycle accurate")
‒ Accurate enough to capture first-order performance effects
‒ Flexible enough to allow prototyping new ideas reasonably quickly
See http://www.gem5.org
More information is available from Jason Lowe-Power's tutorial
‒ http://learning.gem5.org/tutorial/

APU SIMULATOR CODE ORGANIZATION
gem5 top-level directory
‒ src/
  ‒ gpu-compute/ — GPU core model
  ‒ mem/protocol/ — APU memory model
  ‒ mem/ruby/ — APU memory model
  ‒ dev/hsa/ — HSA device models
‒ configs/
  ‒ example/apu_se.py — sample script
  ‒ ruby/ — APU protocol configs
For more information about the configuration system, see Jason Lowe-Power's tutorial.
For the remainder of this talk, files without a directory prefix are located in src/gpu-compute/.

GPU TERMINOLOGY
AMD terminology (NVIDIA equivalents in parentheses):
‒ CU: Compute Unit (SM in NVIDIA terminology)
‒ SQC: Sequencer Cache (shared L1 instruction cache)
‒ TCP: Texture Cache per Pipe (private L1 data cache)
‒ TCC: Texture Cache per Channel (shared L2 cache)
Not shown (per GPU core):
‒ LDS: Local Data Share (shared memory in NVIDIA terminology, sometimes called a "scratchpad")

EXAMPLE APU SYSTEM: GPU + CPU CORE-PAIR WITH A SHARED DIRECTORY
[Figure: CPU0 and CPU1 with private L1 data caches, a shared I-cache, and a shared L2, alongside a GPU with CUs, per-CU TCPs, an SQC, a scalar cache, and a shared TCC; both sides connect through a shared memory directory and memory controller.]

AMD TERMINOLOGY IN A NUTSHELL
Heterogeneous System Architecture (HSA) programming abstraction
‒ Standard for heterogeneous compute, supported by AMD hardware
‒ Light abstractions of parallel physical hardware
‒ Captures basic HSA and OpenCL constructs, plus much more
HSA model vs. CUDA terminology:
‒ Grid = grid in CUDA
‒ Workgroup = thread block in CUDA
‒ Wavefront (WF) = warp in CUDA
‒ Work-item (WI) = thread in CUDA
Grid: N-dimensional (N = 1, 2, or 3) index space
‒ Partitioned into workgroups, wavefronts, and work-items

SPECIFICATION BUILDING BLOCKS
HSA hardware building blocks (HSA Platform System Architecture Specification)
‒ Shared virtual memory: single address space; coherent; pageable; fast access from all components; can share pointers
‒ Architected user-level queues
‒ Signals
‒ Platform atomics
‒ Defined memory model
‒ Context switching
Industry-standard, architected requirements for how devices share memory and communicate with each other.
HSA software building blocks
‒ HSA runtime (open source): implemented by the ROCm runtime; create queues; allocate memory; device discovery
‒ Multiple high-level compilers (open source): CLANG/LLVM; C++, HIP, OpenMP, OpenACC, Python
‒ GCN3 ISA specification: kernel state, ISA encodings, program flow control
Industry specifications enable existing programming languages to target the GPU.
http://hsafoundation.com
http://github.com/HSAFoundation

APU SIMULATION SUPPORT
HSA hardware building blocks
‒ Shared virtual memory: single address space; coherent; fast access from all components; can share pointers; pageable
‒ Architected user-level queues: via the Architected Queuing Language (AQL)
‒ Signals
‒ Platform atomics
‒ Defined memory model: basic acquire and release semantics as implemented by the compiler; merging functional and timing models
‒ Context switching
HSA software building blocks
‒ Radeon Open Compute platform (ROCm): AMD's implementation of HSA principles; create queues; device discovery; AQL support; allocate memory
‒ Machine ISA: GCN3
‒ Heterogeneous Compute Compiler (HCC): CLANG/LLVM, direct to GCN3 ISA; C++, C++ AMP, HIP, OpenMP, OpenACC, Python
Legend: included in this release / work in progress, may be released / longer-term work
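The execution hierarchy above (grid partitioned into workgroups, workgroups into wavefronts, wavefronts into work-items) can be made concrete with a small sketch. The 64-wide wavefront matches GCN hardware; the function name and the grid/workgroup sizes are illustrative:

```python
import math

WAVEFRONT_SIZE = 64  # work-items per wavefront on GCN hardware

def decompose(grid, workgroup):
    """Split an N-D grid into workgroups and wavefronts, HSA-style."""
    # Workgroups along each dimension (round up for partial groups at the edge)
    wgs_per_dim = [math.ceil(g / w) for g, w in zip(grid, workgroup)]
    num_workgroups = math.prod(wgs_per_dim)
    # Work-items in one full workgroup, and wavefronts needed to cover them
    wg_items = math.prod(workgroup)
    wavefronts_per_wg = math.ceil(wg_items / WAVEFRONT_SIZE)
    return num_workgroups, wavefronts_per_wg

# A 1-D grid of 1024 work-items with 256-work-item workgroups:
wgs, wfs = decompose(grid=(1024, 1, 1), workgroup=(256, 1, 1))
print(wgs, wfs)  # 4 workgroups, each covered by 4 wavefronts
```

The rounding up matters in practice: a grid that is not a multiple of the workgroup size still launches whole workgroups, and a partially full wavefront still occupies a full SIMD slot.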
HW-SW INTERFACES
‒ ROCm — high-level SW stack
‒ HW-SW interfaces
‒ Kernel launch flow
‒ GCN3 ISA overview

ARE YOU READY TO ROCm? SW STACK AND HIGH-LEVEL SIMULATION FLOW
HCC
‒ Clang front end and LLVM-based backend
‒ Compiles directly to the machine ISA
‒ Produces a multi-ISA binary (x86 ELF + GCN3 ELF + metadata)
ROCm stack
‒ HCC libraries
‒ Runtime layer — ROCr (user space)
‒ Thunk (user-space driver) — ROCt (user space)
‒ Kernel fusion driver (KFD) — ROCk (OS kernel space)
The runtime loader loads the GCN3 ELF into memory.
The GPU is a HW-SW co-designed machine
‒ Command processor (CP) HW aids in implementing the HSA standard
‒ Rich application binary interface (ABI)
‒ The GPU directly executes the GCN3 ISA
‒ Runtime ELF loaders load the GCN3 binary
Relevant files: shader.[hh|cc], compute_unit.[hh|cc], gpu_command_processor.[hh|cc]
See https://rocm.github.io for documentation, source, and more.
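The multi-ISA binary above bundles host x86 code with a GCN3 device ELF; a loader can tell the two apart by the e_machine field in each ELF header. A minimal sketch of that check, assuming a 64-bit little-endian image (the helper name and the hand-packed header are synthetic; EM_AMDGPU = 224 is the registered ELF machine value for AMD GPU code objects):

```python
import struct

EM_X86_64 = 62    # host code
EM_AMDGPU = 224   # AMD GPU code objects (e.g. GCN3 kernels)

def elf_machine(data: bytes) -> int:
    """Return the e_machine field of an ELF image, or raise if not ELF."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    # e_machine is a little-endian u16 at offset 18 (after e_ident and e_type)
    (machine,) = struct.unpack_from("<H", data, 18)
    return machine

# Build a synthetic 64-bit little-endian ELF header for a GPU code object:
e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)  # ELFCLASS64, LSB data
header = e_ident + struct.pack("<HH", 3, EM_AMDGPU)    # e_type=ET_DYN, e_machine
print(elf_machine(header) == EM_AMDGPU)  # True
```

A real loader does much more (program headers, relocations, note sections carrying kernel metadata), but the machine-type check is the dispatch point between the host and device halves of the fat binary.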
DETAILED VIEW OF KERNEL LAUNCH: GPU FRONTEND AND HW-SW INTERFACE
User-space SW talks to the GPU via ioctl()
‒ HCC/ROCr/ROCt are off-the-shelf
‒ ROCk is emulated in gem5 (gpu_compute_driver.[hh|cc]) and handles the ioctl commands
CP frontend has two primary components
‒ HSA packet processor (HSAPP): dev/hsa/hsa_packet_processor.[hh|cc], dev/hsa/hw_scheduler.[hh|cc]
‒ Workgroup dispatcher: dispatches kernels' workgroups to CUs
The runtime creates software HSA queues
‒ HSAPP maps them to hardware queues
‒ HSAPP schedules the active queues
The runtime creates AQL packets and enqueues them on an HSA software queue, which maintains head and tail pointers (hsa_packet.hh, hsa_queue.hh). Packets include:
‒ Kernel resource requirements
‒ Kernel size
‒ Kernel code object pointer
‒ More…

DETAILED VIEW OF KERNEL LAUNCH: DISPATCHER WORKGROUP ASSIGNMENT
Kernel dispatch is resource limited (dispatcher.[hh|cc], hsa_queue_entry.hh)
‒ WGs are scheduled to CUs
‒ The dispatcher tracks the status of in-flight and pending kernels
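The AQL kernel-dispatch packet and the head/tail queue protocol described above can be sketched as follows. The 64-byte field layout follows the HSA System Architecture specification's kernel-dispatch packet; the class, helper names, and addresses are illustrative, and bumping the tail stands in for the real doorbell-signal mechanism:

```python
import struct

# HSA kernel-dispatch AQL packet: fixed 64-byte layout.
# header, setup, workgroup x/y/z, reserved, grid x/y/z,
# private/group segment sizes, kernel_object, kernarg_address,
# reserved, completion_signal
DISPATCH_FMT = "<HHHHHHIIIIIQQQQ"

def make_dispatch_packet(wg=(256, 1, 1), grid=(1024, 1, 1),
                         kernel_object=0xDEAD000, kernarg_address=0xBEEF000):
    """Pack one kernel-dispatch packet (pointers here are fake examples)."""
    header = 2   # kernel-dispatch packet type in the header's type bits
    setup = 1    # number of grid dimensions
    return struct.pack(DISPATCH_FMT, header, setup,
                       *wg, 0,        # workgroup sizes + reserved field
                       *grid,         # grid sizes
                       0, 0,          # private/group segment sizes
                       kernel_object, kernarg_address,
                       0, 0)          # reserved + completion signal handle

class UserLevelQueue:
    """Ring buffer of AQL packets; the runtime writes at the tail,
    the packet processor (HSAPP) consumes from the head."""
    def __init__(self, num_packets=64):
        self.ring = [None] * num_packets
        self.head = 0   # next packet the packet processor will read
        self.tail = 0   # next free slot the runtime will write

    def enqueue(self, packet):
        assert self.tail - self.head < len(self.ring), "queue full"
        self.ring[self.tail % len(self.ring)] = packet
        self.tail += 1  # in real HW, a doorbell signal announces the new tail

    def dequeue(self):
        assert self.head < self.tail, "queue empty"
        packet = self.ring[self.head % len(self.ring)]
        self.head += 1
        return packet

q = UserLevelQueue()
q.enqueue(make_dispatch_packet())
print(len(q.dequeue()))  # 64: one fixed-size AQL packet
```

Because every AQL packet is exactly 64 bytes, the hardware can index the ring directly from the head pointer without parsing variable-length records, which is what makes user-level dispatch cheap.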