ROCm: An open platform for GPU computing exploration

Ben Sander: Senior Fellow, Open Compute
Gregory Stoner: Senior Director, Radeon Open Compute


Radeon Open Compute Platform (ROCm)
A Modern Heterogeneous HPC and Hyper Scale Accelerator Platform for Large Scale Systems

• Performance: rich foundation built for latency reduction and throughput optimization
• Open: first fully "Open Source" professional GPU accelerator computing solution
• Hyper Scale: built from the ground up to service multiple accelerators per node and across the rack

Introducing the ROCm Software Platform
A new, fully "Open Source" foundation for Hyper Scale and HPC-class GPU computing

Graphics Core Next®: Headless 64-bit HSA Driver
Drives rich capabilities into the ROCm hardware and software:

• Large memory single allocation
• Peer-to-Peer Multi-GPU
• Peer-to-Peer with RDMA
• Systems management API and tools
• User mode queues
• Architected queuing language
• Flat memory addressing
• Atomic memory transactions
• Process concurrency & preemption

Rich compiler foundation for the HPC developer:
• LLVM native GCN ISA code generation
• Offline compilation support
• Standardized loader and code object format
• GCN ISA assembler and disassembler
• Full documentation of the GCN ISA

"Open Source" tools and libraries:
• Rich set of "Open Source" math libraries
• Tuned "Deep Learning" frameworks
• Optimized parallel programming frameworks
• CodeXL profiler and GDB debugging

ROCm Programming Model Options

HIP – Convert CUDA to portable C++
• Single-source Host+Kernel
• C++ Kernel Language
• C Runtime
• Platforms: AMD GPU, NVIDIA (same performance as native CUDA)
When to use it?
• Port existing CUDA code
• Developers familiar with CUDA
• New project that needs portability to AMD and NVIDIA

HCC – True single-source C++ accelerator language
• Single-source Host+Kernel
• C++ Kernel Language
• C++ Runtime
• Platforms: AMD GPU
When to use it?
• New projects where a true C++ language is preferred
• Use features from the latest ISO C++ standards

OpenCL – Khronos industry-standard accelerator language
• Split Host/Kernel
• C99-based Kernel Language
• C Runtime
• Platforms: CPU, GPU, FPGA
When to use it?
• Port existing OpenCL code
• New project that needs portability to CPU, GPU, and FPGA

HIP: Key Features

• Strong support for most commonly used parts of the CUDA API

‒ Streams, events, memory allocation/deallocation, profiling

‒ HIP includes driver API support (modules and contexts)

• Full C++ support including templates, namespaces, classes, lambdas

‒ AMD’s open-source GPU compiler based on near-tip clang+llvm

‒ Supports C++11, C++14, and some C++17 features

• Hipified code is portable to AMD/ROCm and NVIDIA/CUDA

‒ On CUDA, developers can use native CUDA tools (nvcc, nvprof, etc.)

‒ On ROCm, developers can use native ROCm tools (hcc, rocm-prof, etc.)

‒ HIP ecosystem includes hipBlas, hipFFT, hipRNG, MIOpen

• Hipify tools automate the translation from CUDA to HIP

‒ Developers should expect some final cleanup and performance tuning
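Taken together, these features let HIP code read almost line-for-line like the CUDA it replaces. A minimal sketch using the HIP API of this deck's era (the hipLaunchParm kernel argument and the hipLaunchKernel macro, as used elsewhere in this deck); the array sizes, grid shape, and values are illustrative:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Scale kernel: hipLaunchParm is the extra first argument early HIP required
    __global__ void scale(hipLaunchParm lp, float* c, const float* a,
                          float s, int n) {
        int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
        if (i < n) c[i] = s * a[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float* hA = new float[n];
        float* hC = new float[n];
        for (int i = 0; i < n; ++i) hA[i] = 1.0f;

        float *dA, *dC;                       // memory allocation, as in CUDA
        hipMalloc(&dA, bytes);
        hipMalloc(&dC, bytes);

        hipStream_t stream;                   // streams, as in CUDA
        hipStreamCreate(&stream);
        hipEvent_t start, stop;               // events for profiling
        hipEventCreate(&start);
        hipEventCreate(&stop);

        hipMemcpyAsync(dA, hA, bytes, hipMemcpyHostToDevice, stream);
        hipEventRecord(start, stream);
        hipLaunchKernel(scale, dim3((n + 255) / 256), dim3(256), 0, stream,
                        dC, dA, 3.0f, n);
        hipEventRecord(stop, stream);
        hipMemcpyAsync(hC, dC, bytes, hipMemcpyDeviceToHost, stream);
        hipStreamSynchronize(stream);

        float ms = 0.0f;
        hipEventElapsedTime(&ms, start, stop);
        printf("c[0] = %f, kernel time = %f ms\n", hC[0], ms);

        hipFree(dA); hipFree(dC);
        delete[] hA; delete[] hC;
        return 0;
    }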

Hipification of CUDA Kernel (CAFFE)
CUDA source → HIPIFY (automated) → HIP source

CUDA (before):

    namespace caffe {
    // C++ features (namespace, templates) unchanged by hipification
    template <typename Dtype>
    __global__ void BNLLForward(const int n, const Dtype* in, Dtype* out) {
      for (int index = blockIdx.x * blockDim.x + threadIdx.x;
           index < (n); index += blockDim.x * gridDim.x) {
        out[index] = in[index] > 0 ?
            in[index] + log(1. + exp(-in[index])) :   // math libs unchanged
            log(1. + exp(in[index]));
      }
    }
    }  // namespace caffe

HIP (after):

    namespace caffe {
    template <typename Dtype>
    __global__ void BNLLForward(hipLaunchParm lp, const int n,
                                const Dtype* in, Dtype* out) {
      for (int index = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
           index < (n); index += hipBlockDim_x * hipGridDim_x) {
        out[index] = in[index] > 0 ?
            in[index] + log(1. + exp(-in[index])) :
            log(1. + exp(in[index]));
      }
    }
    }  // namespace caffe
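The launch site changes in the same mechanical way. A sketch, assuming a HIP of this era where templated kernels are wrapped with HIP_KERNEL_NAME; the blocks/threads expressions are illustrative placeholders rather than CAFFE's actual macros:

    // CUDA launch site:
    BNLLForward<Dtype><<<blocks, threads, 0, stream>>>(n, in, out);

    // HIP launch site (hipify rewrites it to the macro form):
    hipLaunchKernel(HIP_KERNEL_NAME(BNLLForward<Dtype>),
                    dim3(blocks), dim3(threads), 0, stream,
                    n, in, out);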

Hipification of CUDA Runtime APIs (CAFFE)

CUDA source → HIPIFY (automated) → HIP source

CUDA (before):

    void SyncedMemory::async_gpu_push(const cudaStream_t& stream) {
      CHECK(head_ == HEAD_AT_CPU);
      if (gpu_ptr_ == NULL) {
        cudaGetDevice(&gpu_device_);
        cudaMalloc(&gpu_ptr_, size_);
        own_gpu_data_ = true;
      }
      const cudaMemcpyKind put = cudaMemcpyHostToDevice;
      cudaMemcpyAsync(gpu_ptr_, cpu_ptr_, size_, put, stream);
      // Assume caller will synchronize on the stream
      head_ = SYNCED;
    }

HIP (after):

    void SyncedMemory::async_gpu_push(const hipStream_t& stream) {
      CHECK(head_ == HEAD_AT_CPU);
      if (gpu_ptr_ == NULL) {
        hipGetDevice(&gpu_device_);
        hipMalloc(&gpu_ptr_, size_);
        own_gpu_data_ = true;
      }
      const hipMemcpyKind put = hipMemcpyHostToDevice;
      hipMemcpyAsync(gpu_ptr_, cpu_ptr_, size_, put, stream);
      // Assume caller will synchronize on the stream
      head_ = SYNCED;
    }

Porting with the hipify Tool

CUDA source → hipify (~99%+ automatic conversion) → developer cleanup and tuning → Portable HIP C++

• Developer maintains the HIP port
• The resulting C++ code runs on NVIDIA or AMD GPUs

HIP Compilation Process

Portable HIP C++ (kernels + HIP API) compiles down one of two paths:

NVIDIA path (HIP→CUDA header):
• HIP API implemented as inlined calls to the CUDA Runtime
• Compute kernels mostly unchanged
• Result: CUDA C++ (kernels + CUDA API), compiled with NVCC (same as CUDA)
• Can use nvprof, the CUDA debugger, and other native tools
• Produces a CUDA executable

AMD path (HIP→HC header):
• HIP API implemented with a lightweight HIP runtime
• Uses HCC's hc::accelerator, hc::accelerator_view, hc::completion_future; some calls go directly into the ROCm runtime
• Compute kernels mostly unchanged
• Result: HCC C++ (kernels + HC/ROCr), compiled with HCC
• Can use the CodeXL profiler/debugger
• Produces an HCC executable

HIP code is source portable, not binary portable.

ROCm: Deep Learning Gets HIP
A faster path for bringing deep learning applications to AMD GPUs

The Challenge: porting CAFFE
• Popular machine-learning framework
• Tip version on GitHub has 55,000+ lines of code
• GPU-accelerated with CUDA

Results:
• 99.6% of code unmodified or automatically converted
• Port required less than 1 week of developer time
• Supports all CAFFE features (multi-GPU, P2P, FFT filters)
• HIPCAFFE is the fastest CAFFE on AMD hardware – 1.8X faster than CAFFE/OpenCL

Complexity of application porting (lines of code changed): OpenCL port – 32,227 manual; HIP port – 219 manual + 688 automatic.

AMD Internal Data

HCC: Heterogeneous Compute Compiler

Architecture:

• Built on open-source CLANG/LLVM
• Single-source compiler for both CPU & GPU
• Standard object code can be linked with g++, clang, icc
• Performance optimized for accelerators:
  ‒ Explicit and implicit data movement
  ‒ Scratchpad memories
  ‒ Asynchronous commands

Dialects:

• HC:
  ‒ C++ runtime – hc::accelerator, hc::accelerator_view, hc::completion_future
  ‒ Kernels launched with parallel_for_each around a lambda expression
• ISO C++:
  ‒ C++17 Parallel Standard Template Library
  ‒ Next steps include executors and concurrency controls
• OpenMP:
  ‒ OpenMP 3.1 support for CPU
  ‒ OpenMP 4.5 GPU offload demonstrated at SC2016 (see the sketch below)
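Of the three dialects, OpenMP requires no HCC-specific syntax at all. A minimal sketch of an OpenMP 4.5 target-offload loop of the kind demonstrated at SC2016; the pragma is standard OpenMP 4.5, while the saxpy-style loop body is illustrative:

    #include <cstdio>

    int main() {
        const int n = 1000000;
        float* x = new float[n];
        float* y = new float[n];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // OpenMP 4.5: map the arrays to the device and distribute the loop
        // across GPU teams and threads
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += 3.0f * x[i];

        printf("y[0] = %f\n", y[0]);
        delete[] x;
        delete[] y;
        return 0;
    }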

HCC Example – "HC" Syntax
AUTOMATIC MEMORY MANAGEMENT VIA ARRAY_VIEW

    #include <hc.hpp>

    int main(int argc, char* argv[])
    {
      int sizeElements = 1000000;

      // Alloc auto-managed array_views
      hc::array_view<double> A(sizeElements);
      hc::array_view<double> B(sizeElements);
      hc::array_view<double> C(sizeElements);

      // Initialize host memory (illustrative values)
      for (int i = 0; i < sizeElements; i++) {
        A[i] = 1.0 * i;
        B[i] = 2.0 * i;
      }

      // Launch kernel onto default accelerator.
      // The HCC runtime ensures that A and B are available on
      // the accelerator before kernel launch:
      hc::parallel_for_each(hc::extent<1>(sizeElements),
                            [=](hc::index<1> idx) [[hc]] {
        // Kernel is the lambda inside the for-loop
        int i = idx[0];
        C[i] = A[i] + B[i];
      });

      // Accessing C[] on the host synchronizes and copies results back
      return 0;
    }

HCC Example – "HC" Syntax
EXPLICIT MEMORY MANAGEMENT VIA ARRAY

    #include <hc.hpp>

    int main(int argc, char* argv[])
    {
      int sizeElements = 1000000;

      // Alloc GPU arrays
      hc::array<double> Ad(sizeElements);
      hc::array<double> Bd(sizeElements);
      hc::array<double> Cd(sizeElements);

      // Alloc host memory
      double* Ah = (double*)malloc(sizeElements * sizeof(double));
      double* Bh = (double*)malloc(sizeElements * sizeof(double));
      double* Ch = (double*)malloc(sizeElements * sizeof(double));

      // Initialize host memory (illustrative values)
      for (int i = 0; i < sizeElements; i++) {
        Ah[i] = 1.0 * i;
        Bh[i] = 2.0 * i;
      }

      // Explicitly copy host data to the GPU arrays
      copy(Ah, Ad);
      copy(Bh, Bd);

      // Launch kernel onto default accelerator
      hc::parallel_for_each(hc::extent<1>(sizeElements),
                            [&](hc::index<1> idx) [[hc]] {
        // Kernel is the lambda inside the for-loop
        int i = idx[0];
        Cd[i] = Ad[i] + Bd[i];
      });

      copy(Cd, Ch);  // Copy results GPU to host

      // Check result
      for (int i = 0; i < sizeElements; i++) {
        assert(Ch[i] == Ah[i] + Bh[i]);
      }
      return 0;
    }

HCC "HC" Mode
KEY FEATURES

• Many core structures similar to C++AMP

‒ Implementation uses "hc" namespace

‒ hc::accelerator_view, hc::array_view, hc::completion_future

‒ With expanded capabilities…

• Controls over asynchronous kernel and data commands

‒ hc::parallel_for_each returns hc::completion_future

‒ Asynchronous copy commands

‒ C++17 then, when_any, when_all for managing device-side dependencies [under development]

• Memory Management

‒ Approachable hc::array_view for managed memory and implicit synchronization

‒ Explicit pointer-based memory allocation (am_alloc / am_free; see the sketch after this list)

• Language Restrictions

‒ Removes the C++AMP "restrict" keyword

‒ Supports a rich set of C++ language features and data types

‒ Advanced C++ language features (virtual functions, recursion, etc) [under development]
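A minimal sketch of the explicit pointer-based path mentioned above; am_alloc/am_free are the names from the bullet, but the header name and exact signature shown here are assumptions about the HC headers of this era:

    #include <hc.hpp>
    #include <hc_am.hpp>  // header name assumed for the am_* allocation API

    int main() {
        const size_t n = 1000000;
        hc::accelerator acc;  // default accelerator

        // Explicit pointer-based device allocation; signature assumed:
        // am_alloc(size_bytes, accelerator, flags), flags 0 = device memory
        float* d = static_cast<float*>(hc::am_alloc(n * sizeof(float), acc, 0));

        hc::parallel_for_each(hc::extent<1>(n), [=](hc::index<1> idx) [[hc]] {
            d[idx[0]] = 2.0f * idx[0];  // raw pointer used directly in kernel
        }).wait();

        hc::am_free(d);
        return 0;
    }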

ISO C++17 Parallel STL

• Standard Template Library
  ‒ sort(data.begin(), data.end());       // STL
  ‒ sort(par, data.begin(), data.end());  // PSTL (runnable form below)

• Execution policy
  ‒ New first parameter to PSTL functions
  ‒ par indicates the algorithm can be run in parallel
  ‒ Can accelerate and run on a GPU or multicore CPU
  ‒ The abstraction allows architecture-specific optimizations (workgroups, LDS)
  ‒ Formalization of ideas in the TBB, NV Thrust, and Bolt libraries

• Proposal for the C++17 Parallelism Tech Spec
  ‒ Approved at the Jacksonville ISO meeting!

• Next steps
  ‒ Executors to control where (which device) an algorithm runs
  ‒ Provide std::future to track status

Parallel algorithms covered:

adjacent_find, all_of, any_of, copy, copy_if, copy_n, count, count_if, equal, exclusive_scan, fill, fill_n, find, find_end, find_first_of, find_if, find_if_not, for_each, for_each_n, generate, generate_n, includes, inclusive_scan, inplace_merge, is_heap, is_partitioned, is_sorted, is_sorted_until, lexicographical_compare, max_element, merge, min_element, minmax_element, mismatch, move, none_of, nth_element, partial_sort, partial_sort_copy, partition, partition_copy, reduce, remove, remove_copy, remove_copy_if, remove_if, replace, replace_copy, replace_copy_if, reverse, reverse_copy, rotate, rotate_copy, search, search_n, set_difference, set_intersection, set_symmetric_difference, set_union, sort, stable_partition, stable_sort, swap_ranges, transform, uninitialized_copy, uninitialized_copy_n, uninitialized_fill, uninitialized_fill_n, unique, unique_copy
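The two sort lines above, in runnable form. This sketch uses the C++17 spelling std::execution::par that the Parallelism TS was standardized into; the deck's bare par corresponds to it:

    #include <algorithm>
    #include <execution>  // C++17 execution policies
    #include <vector>

    int main() {
        std::vector<float> data(1000000);
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = static_cast<float>(data.size() - i);

        std::sort(data.begin(), data.end());                       // STL: sequential
        std::sort(std::execution::par, data.begin(), data.end());  // PSTL: parallel
        return 0;
    }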

ISO C++: Template Library for Parallel For Loops

• http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0075r1.pdf
• Proposed for C++20 and currently under discussion
• Provides straightforward porting of OpenMP #pragma loops into C++
• Key advantage over Parallel STL: the "position" (i) inside the loop can be easily determined
• for_loop, for_loop_strided, reductions, inductions
• As with the PSTL, the par policy can be extended with Executors to control where/how the kernel is executed

    // Proposed ISO C++ parallel for_loop:
    void saxpy_ref(int n, float a, float x[], float y[]) {
      for_loop(par, 0, n, [&](int i) {
        y[i] += a * x[i];
      });
    }

ISO: Concurrency TS

• GPU Architecture Basics

‒ Memory-based queues used to schedule and execute commands

‒ Commands include data-copies, parallel execution “kernels”, dependencies, configuration

‒ Hardware-based dependency resolution

‒ Efficiently wait for dependencies, signal completion – all without host intervention

• hc::completion_future

‒ Based on C++ std::future

‒ Returned by asynchronous commands

‒ Extend “then” to schedule device-side commands (no host intervention)

‒ HCC implementation identifies accelerator commands via specialization and leverages GPU HW

‒ copy(…).then(for_each(…)).then(copy(…)); (see the sketch below)

‒ when_all, when_any (N4501)

‒ Combine futures, return another future, in a single function

‒ Can leverage dependency resolution hardware
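A sketch of the future-returning style described above, written against the HC names used in this deck (hc::array, hc::copy_async, hc::parallel_for_each); treat the copy_async iterator overloads as assumptions about the HC headers rather than a verified API:

    #include <hc.hpp>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<float> host(n, 1.0f);
        hc::array<float> dev(n);

        // Asynchronous copy returns a completion_future instead of blocking
        hc::completion_future up = hc::copy_async(host.begin(), host.end(), dev);
        up.wait();  // with device-side "then", this host-side wait disappears

        // parallel_for_each also returns a completion_future
        // (see the "HC" Mode key features above)
        hc::completion_future run = hc::parallel_for_each(
            hc::extent<1>(n), [&dev](hc::index<1> idx) [[hc]] {
                dev[idx[0]] *= 2.0f;
            });
        run.wait();

        hc::completion_future down = hc::copy_async(dev, host.begin());
        down.wait();
        return 0;
    }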

Delivering An Open Platform For GPU Computing
A language-neutral solution to match developer needs as heterogeneous programming models evolve

The compiler stack:

• Compiler front end (CLANG) feeding two back ends: a GCN compiler (LLVM optimization passes, GCN target) that emits GPU code, and a CPU compiler (LLVM optimization passes, CPU ISA target) that emits CPU code
• Direct-to-ISA GCN compiler: CLANG/LLVM-based, with public GCN ISA documentation, a GCN assembler, and fully open source
• Language runtime APIs layered on the ROCr system runtime API (with UCX for communication), on the ROCk/AMDGPU driver, on the Linux OS

Benefits from the Open Source Community

Given this source with a deliberate typo ('Myflaot_t' instead of 'MyFloat_t'):

    typedef float MyFloat_t;

    __global__ void scale(MyFloat_t *c, MyFloat_t *a) {
      const Myflaot_t scalar = 3.0;
      const int i = blockDim.x * blockIdx.x + threadIdx.x;
      c[i] = scalar * a[i];
    }

NVIDIA NVCC closed-source compiler:

    typo_type.cpp(8): error: identifier "Myflaot_t" is undefined

AMD HCC open-source compiler:

    typo_type.cpp:8:11: error: unknown type name 'Myflaot_t'; did you mean 'MyFloat_t'?
      const Myflaot_t scalar = 3.0;
            ^~~~~~~~~
            MyFloat_t
    typo_type.cpp:3:15: note: 'MyFloat_t' declared here
    typedef float MyFloat_t;

ROCm Supports OpenCL™

OpenCL 1.2+ Key Features
• New core foundation to best leverage the ROCr runtime
• OpenCL 1.2 compatible runtime
• OpenCL 2.0 kernel language
• Coarse-grain SVM
• C11 atomics
• OpenCL 2.0 images support
• Latency-to-compute optimization
• User-mode DMA – dual engines with async transfer, user-mode queue support

New GCN ISA LLVM Code Generator
• Supports GCN ISA assembly optimization, assembler, disassembler, inline ASM
• Supports offline, ahead-of-time compilation
• Register allocation and occupancy controls

Innovation by Terminology?

Term            | HIP                   | HC                                            | OpenCL 1.2
----------------|-----------------------|-----------------------------------------------|-----------------------
Device          | int deviceId (0..n-1) | hc::accelerator                               | cl_device
Queue           | hipStream_t           | hc::accelerator_view                          | cl_command_queue
Event           | hipEvent_t            | hc::completion_future                         | cl_event
Memory          | void *                | void *; hc::array; hc::array_view             | cl_mem
Grid            | grid                  | extent                                        | NDRange
Block           | block                 | tile                                          | work-group
Thread          | thread                | thread                                        | work-item
Warp            | warp                  | wavefront                                     | sub-group
Device Kernel   | __global__            | lambda inside hc::parallel_for_each or [[hc]] | __kernel
Kernel Launch   | hipLaunchKernel       | hc::parallel_for_each                         | clEnqueueNDRangeKernel
Atomic Builtins | atomicAdd             | hc::atomic_fetch_add                          | atomic_add
Precise Math    | cos(f)                | hc::precise_math::cos(f)                      | cos(f)

Extending Support To A Broader Hardware Ecosystem
The ROCm "Open Source" platform brings a rich foundation to these new ecosystems:

• IBM OpenPOWER support: IBM POWER8
• AMD64 support: AMD "Zen", Intel Xeon E5 v3/v4
• ARM® AArch64 support: Cavium ThunderX

ROCm is being built to support next generation I/O Interfaces

AMD is a founding member of Gen-Z, CCIX, and OpenCAPI.

MIOpen

• Open-source optimized deep learning GPU kernels for OpenCL and HIP:
  ‒ Convolutions
  ‒ Pooling
  ‒ Softmax
  ‒ Normalization
  ‒ Activation functions
  ‒ Data as 4-D tensors

• Describes operations as functions on tensors
• Example: a convolution (see the figure and sketch below)
• Supports major machine-intelligence frameworks, including CAFFE, TensorFlow, Torch [under development]

[Figure: a convolution as a function on tensors – a Weights filter window slides over the Input-Img tensor to produce the Output-Img tensor]
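To make "operations as functions on tensors" concrete, here is a naive direct convolution over flat NCHW input and KCRS weight tensors; this is an illustrative reference loop, not MIOpen's API or its optimized kernels:

    #include <vector>

    // Naive 2-D convolution: output(n,k,y,x) = sum over c,fy,fx of
    // input(n,c,y+fy,x+fx) * weights(k,c,fy,fx).  Tensors are flat vectors
    // indexed in NCHW / KCRS order; "valid" padding, stride 1.  The caller
    // sizes `out` as N x K x (H-R+1) x (W-S+1).
    void conv2d(const std::vector<float>& in, const std::vector<float>& w,
                std::vector<float>& out,
                int N, int C, int H, int W,   // input tensor dims
                int K, int R, int S)          // filters: K outputs, RxS window
    {
        const int OH = H - R + 1, OW = W - S + 1;
        for (int n = 0; n < N; ++n)
          for (int k = 0; k < K; ++k)
            for (int y = 0; y < OH; ++y)
              for (int x = 0; x < OW; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < C; ++c)
                  for (int fy = 0; fy < R; ++fy)
                    for (int fx = 0; fx < S; ++fx)
                      acc += in[((n*C + c)*H + y + fy)*W + x + fx]
                           * w[((k*C + c)*R + fy)*S + fx];
                out[((n*K + k)*OH + y)*OW + x] = acc;
              }
    }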

Open-Source Computing – Who Wins?

• Developers: community delivers superior tools; first access to new language features
• Applications: source access enables control and optimization
• Research: innovate above the infrastructure
• Customers: value and request open solutions

ROCm: the first open GPU compiler

Some ROCm Research Opportunities
OPEN SOURCE COMPILER AND RUNTIME

• GPU Register Allocation and Optimization

‒ Large register files (e.g., up to 256 registers per thread)

‒ Complex relationship between IPC and occupancy

‒ Unique scalar and vector registers; uniform access is an important optimization

‒ ROCm LLVM compiler exposes full compiler stack including register allocator, scheduler

• Feedback-directed Optimization

‒ Best way to identify optimal code generation is to run the code

‒ Can we capture appropriate state from one or more runs and use this to influence future compilation?

• Dynamic Parallelism Done Right

‒ “Architected Queuing Language” :

‒ A standardized command-packet architecture that enables GPUs to send work to themselves or to other GPUs (see the sketch below)
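For reference, a sketch of filling an AQL kernel-dispatch packet through the HSA runtime headers that ROCm implements; the kernel_object, kernarg buffer, and completion signal are assumed to be created elsewhere:

    #include <hsa/hsa.h>

    // Sketch: fill an Architected Queuing Language (AQL) kernel-dispatch
    // packet as defined by the HSA runtime. Any agent (host CPU or a GPU
    // itself) can write such a packet into a user-mode queue, which is what
    // enables device-side work creation.
    void fill_dispatch_packet(hsa_kernel_dispatch_packet_t* pkt,
                              uint64_t kernel_object, void* kernarg,
                              hsa_signal_t signal) {
        pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS; // 1-D
        pkt->workgroup_size_x = 256;
        pkt->workgroup_size_y = 1;
        pkt->workgroup_size_z = 1;
        pkt->grid_size_x = 1024 * 256;
        pkt->grid_size_y = 1;
        pkt->grid_size_z = 1;
        pkt->kernel_object = kernel_object;   // loaded code-object handle
        pkt->kernarg_address = kernarg;       // kernel argument buffer
        pkt->completion_signal = signal;      // decremented on completion
        // The 16-bit packet header (type, barrier, fence scopes) must be
        // stored last, with release semantics, to publish the packet.
    }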

Some ROCm Research Opportunities
OPEN-SOURCE KERNEL DRIVER AND LIBRARIES

• Peer-to-Peer Communication

‒ Large-BAR access from other PCIe devices to all of GPU’s memory

‒ Enables interesting experimentation with other open-source device drivers (FPGAs, NVMe, NICs, etc.)

• Memory Management

‒ Recent GPUs include automated migration of data to GPU

‒ Enables single unified pool of memory from developer perspective

‒ Many heuristics and optimization opportunities for when to migrate

• MIOpen

‒ Innovate with new algorithms, layer fusion, tuning, understanding

Where To Go Deeper On ROCm

https://radeonopencompute.github.io/index.html

Open Source Professional Computing Solution

Foundation For Direct Access To The Hardware

Delivering Choice in Programming Models

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. ARM is a registered trademark of ARM Limited in the UK and other countries. PCIe is a registered trademark of PCI-SIG Corporation. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and used by permission of Khronos. OpenVX is a trademark of Khronos Group, Inc. Other names are for informational purposes only and may be trademarks of their respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD is intended or implied.

radeonopencompute.github.io
