Bringing Next Generation C++ to GPUs
Michael Haidl¹, Michel Steuwer², Lars Klein¹ and Sergei Gorlatch¹
¹University of Muenster, Germany   ²University of Edinburgh, UK

The Problem: Dot Product

  std::vector<int> a(N), b(N), tmp(N);
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<int>());
  auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                                std::plus<int>());

• The STL is the C++ programmer's Swiss army knife
• STL containers, iterators and algorithms introduce a high level of abstraction
• Since C++17 it is also parallel

The two steps can also be factored into separate headers:

f1.hpp:
  template<typename T>
  auto mult(const std::vector<T>& a,
            const std::vector<T>& b){
    std::vector<T> tmp(a.size());
    std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                   std::multiplies<T>());
    return tmp;
  }

f2.hpp:
  template<typename T>
  auto sum(const std::vector<T>& a){
    return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
  }

and composed at the call site:

  std::vector<int> a(N), b(N);
  auto result = sum(mult(a, b));

Performance:
• vectors of size 25,600,000
• Clang/LLVM 5.0.0svn, -O3 optimized
• [bar chart: runtime (ms) of transform/accumulate, transform/accumulate* and inner_product]
  * LLVM patched with extended D17386 (loop fusion)
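The inner_product bar in the chart presumably measures the fused formulation of the same computation, and the "parallel since C++17" bullet refers to the execution-policy overloads of the standard algorithms. A minimal sketch of both, for illustration only (this snippet is not part of the original slides):

  #include <cstddef>
  #include <execution>  // std::execution::par (C++17)
  #include <numeric>    // std::inner_product, std::transform_reduce
  #include <vector>

  int main() {
    const std::size_t N = 25600000;
    std::vector<int> a(N, 1), b(N, 2);

    // Fused, sequential dot product: multiply and accumulate in one pass,
    // without the temporary vector tmp.
    auto r1 = std::inner_product(a.begin(), a.end(), b.begin(), 0);

    // C++17: fused and parallel; the reduction defaults to plus<>,
    // the element-wise combination to multiplies<>.
    auto r2 = std::transform_reduce(std::execution::par,
                                    a.begin(), a.end(), b.begin(), 0);
    return (r1 == r2) ? 0 : 1;
  }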
The Problem: Dot Product on GPUs

  thrust::device_vector<int> a(N), b(N), tmp(N);
  thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                    thrust::multiplies<int>());
  auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                               thrust::plus<int>());

• Thrust is a highly tuned STL-like library for GPU programming
• Thrust offers containers, iterators and algorithms
• Based on CUDA

Same experiment:
• nvcc -O3 (from CUDA 8.0)
• [bar chart: runtime (ms) of transform/reduce and inner_product]
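Here, too, the inner_product bar presumably corresponds to a fused formulation. A minimal sketch with thrust::inner_product, which multiplies and reduces without materializing the temporary vector (an illustration, not code from the slides; requires CUDA/Thrust):

  #include <thrust/device_vector.h>
  #include <thrust/inner_product.h>

  int main() {
    const int N = 25600000;
    thrust::device_vector<int> a(N, 1), b(N, 2);

    // Fused multiply-and-reduce on the GPU: no temporary vector tmp.
    // The reduction defaults to plus, the combination to multiplies.
    int result = thrust::inner_product(a.begin(), a.end(), b.begin(), 0);

    return result == 2 * N ? 0 : 1;
  }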
The Next Generation: Ranges for the STL

• range-v3 prototype implementation by E. Niebler
• Proposed as N4560 for the C++ Standard

  std::vector<int> a(N), b(N);
  auto mult = [](auto tpl){ return get<0>(tpl) * get<1>(tpl); };
  auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

• Views describe lazy, non-mutating operations on ranges
• Evaluation happens inside an algorithm (e.g., accumulate)
• Fusion is guaranteed by the implementation

Performance?
• Clang/LLVM 5.0.0svn, -O3 optimized
• [bar chart: runtime (ms) of transform/accumulate and inner_product]

Ranges for GPUs

• Extended range-v3 with GPU-enabled containers and algorithms
• Original code of range-v3 remains unmodified

  std::vector<int> a(N), b(N);
  auto mult = [](auto tpl){ return get<0>(tpl) * get<1>(tpl); };
  auto ga = gpu::copy(a);
  auto gb = gpu::copy(b);
  auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);

Programming Accelerators with C++ (PACXX)

[diagram: in the offline stage, the PACXX offline compiler (Clang frontend, LLVM libc++) translates C++ source into an executable that embeds LLVM IR; in the online stage, the PACXX runtime's LLVM-based online compiler lowers that IR either through the "LLVM IR to SPIR" backend to SPIR for the OpenCL runtime (AMD GPU, Intel MIC) or through the NVPTX backend to PTX for the CUDA runtime (Nvidia GPU)]

• Based entirely on LLVM / Clang
• Supports C++14 for GPU programming
• Just-in-time compilation of LLVM IR for the target accelerators

Multi-Stage Programming

  template <typename InRng, typename T, typename Fun>
  auto reduce(InRng&& in, T init, Fun&& fun){
    // 1. preparation of kernel call
    ...
    // 2. create GPU kernel
    auto kernel = pacxx::kernel(
      [fun](auto&& in, auto&& out, int size, auto init){
        // 2a. stage elements per thread
        auto ept = stage([&]{ return size / get_block_size(0); });
        // 2b. start reduction computation
        auto sum = init;
        for (int x = 0; x < ept; ++x){
          sum = fun(sum, *(in + gid));
          gid += glbSize;
        }
        // 2c. perform reduction in shared memory
        ...
        // 2d. write result back
        if (lid == 0) *(out + bid) = shared[0];
      }, blocks, threads);
    // 3. execute kernel
    kernel(in, out, distance(in), init);
    // 4. finish reduction on the CPU
    return std::accumulate(out, init, fun);
  }

MSP Integration into PACXX

[diagram: the executable now carries MSP IR alongside the kernel IR; the PACXX runtime gains an MSP engine next to the LLVM-based online compiler and the SPIR/PTX backends]

• The MSP engine JIT compiles the MSP IR,
• evaluates stage prior to a kernel launch, and
• replaces the calls to stage in the kernel's IR with the results.
• This enables more optimizations (e.g., loop unrolling) in the online stage.

Performance Impact of MSP

[plot: speedup of gpu::reduce on an Nvidia K20c for Dot and Sum, with and without multi-staging (+MS), over input sizes 2^15 to 2^25]

Up to 35% better performance compared to the non-MSP version.

Just-In-Time Compilation Overhead

Comparing MSP in PACXX with Nvidia's nvrtc library:

[plot: compilation time (ms) for Dot and Sum using nvrtc from CUDA 7.5, nvrtc from CUDA 8.0 RC, and PACXX]

PACXX compiles 10 to 20 times faster because no front-end actions have to be performed in the online stage.

Benchmarks: range-v3 + PACXX vs. Nvidia's Thrust

[bar chart: speedup of PACXX relative to Thrust for vadd, saxpy, sum, dot, Monte Carlo, Mandelbrot and Voronoi, with labels 82.32%, 73.31%, 33.6%, 8.37%, 1.61%, −3.13% and −7.38%]

• Evaluated on an Nvidia K20c GPU
• 11 different input sizes
• 1000 runs for each benchmark

Competitive performance with a composable GPU programming API.

Going Native: Work in Progress

[diagram: the PACXX runtime is extended with a native CPU path: kernel IR is vectorized by WFV [1], JIT compiled by MCJIT, and executed on Intel / AMD / IBM CPUs, alongside the existing SPIR (OpenCL) and PTX (CUDA) backends]

• PACXX is extended by a native CPU backend
• The kernel IR is modified to be runnable on a CPU
• Kernels are vectorized by the Whole-Function Vectorizer (WFV) [1]
• MCJIT compiles the kernels and TBB executes them in parallel

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO'11, pp. 141–150.

Benchmarks: range-v3 + PACXX vs. OpenCL on x86_64

[bar chart: speedup of PACXX relative to Intel's OpenCL for vadd, saxpy, sum, dot and Mandelbrot, with labels −1.45%, −13.94%, −12.64%, −20.02% and −36.48%]

• Running on 2x Intel Xeon E5-2620 CPUs
• Intel's auto-vectorizer optimizes the OpenCL C code

Benchmarks: range-v3 + PACXX vs. OpenCL on x86_64

[bar chart: speedup of PACXX relative to AMD's OpenCL for vadd, saxpy, sum, dot and Mandelbrot, with visible labels 153.82%, 58.12% and 58.45%]

• Running on 2x Intel Xeon E5-2620 CPUs
• The AMD OpenCL SDK has no auto-vectorizer
• Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)

Benchmarks: range-v3 + PACXX vs. OpenMP on IBM Power8

[bar chart: speedup of PACXX relative to IBM's OpenMP for vadd, saxpy, sum, dot and Mandelbrot, with labels 35.09%, 17.97%, 0.45%, −20.05% and −42.95%]

• Running on a PowerNV 8247-42L with 4x IBM Power8e CPUs
• No OpenCL implementation from IBM available
• Loops parallelized with #pragma omp parallel for simd
• Compiled with XL C++ 13.1.5

Questions?
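The OpenMP baseline above is described only as loops parallelized with #pragma omp parallel for simd; the actual benchmark code is not shown in the slides. A minimal sketch of such a loop for the dot product, for illustration only:

  #include <cstddef>
  #include <vector>

  // Dot product parallelized across threads and vectorized within each thread,
  // in the spirit of the "#pragma omp parallel for simd" loops mentioned above.
  int dot(const std::vector<int>& a, const std::vector<int>& b) {
    int sum = 0;
    const std::size_t n = a.size();
    #pragma omp parallel for simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
      sum += a[i] * b[i];
    return sum;
  }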