Bringing Next Generation C++ to GPUs


Michael Haidl (University of Muenster, Germany), Michel Steuwer (University of Edinburgh, UK), Lars Klein and Sergei Gorlatch (University of Muenster, Germany)

The Problem: Dot Product

    std::vector<int> a(N), b(N), tmp(N);
    std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                   std::multiplies<int>());
    auto result = std::accumulate(tmp.begin(), tmp.end(), 0, std::plus<int>());

• The STL is the C++ programmer's Swiss Army knife
• STL containers, iterators and algorithms introduce a high level of abstraction
• Since C++17 it is also parallel

The same computation, factored into two reusable headers:

f1.hpp:

    template<typename T>
    auto mult(const std::vector<T>& a, const std::vector<T>& b){
      std::vector<T> tmp(a.size());
      std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                     std::multiplies<T>());
      return tmp;
    }

f2.hpp:

    template<typename T>
    auto sum(const std::vector<T>& a){
      return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
    }

The composed call site:

    std::vector<int> a(N), b(N);
    auto result = sum(mult(a, b));

Performance:

• vectors of size 25,600,000
• Clang/LLVM 5.0.0svn, -O3 optimized

[Figure: runtime (ms) of transform/accumulate, transform/accumulate* and inner_product; * LLVM patched with extended D17386 (loop fusion).]

The Problem: Dot Product on GPUs

    thrust::device_vector<int> a(N), b(N), tmp(N);
    thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                      thrust::multiplies<int>());
    auto result = thrust::reduce(tmp.begin(), tmp.end(), 0, thrust::plus<int>());

• Highly tuned STL-like library for GPU programming
• Thrust offers containers, iterators and algorithms
• Based on CUDA

Same experiment on the GPU:

• nvcc -O3 (from CUDA 8.0)

[Figure: runtime (ms) of transform/reduce vs. inner_product.]
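Both runtime plots compare against a fused, single-pass dot product. For reference, a minimal sketch of those baselines using the standard and Thrust inner_product algorithms; the benchmark harness and timing code are not shown on the slides, so only the kernels of the comparison appear here:

    #include <numeric>                    // std::inner_product
    #include <vector>
    #include <thrust/device_vector.h>
    #include <thrust/inner_product.h>

    // CPU baseline: one pass over both inputs, no temporary vector.
    int dot_cpu(const std::vector<int>& a, const std::vector<int>& b) {
      return std::inner_product(a.begin(), a.end(), b.begin(), 0);
    }

    // GPU baseline: the corresponding fused Thrust algorithm.
    int dot_gpu(const thrust::device_vector<int>& a,
                const thrust::device_vector<int>& b) {
      return thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
    }

The point of the comparison: the unfused transform/accumulate and transform/reduce versions materialize tmp and traverse the data twice, while inner_product makes a single fused pass.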
The Next Generation: Ranges for the STL

• range-v3 prototype implementation by E. Niebler
• Proposed as N4560 for the C++ Standard

    std::vector<int> a(N), b(N);
    auto mult = [](auto tpl){ return get<0>(tpl) * get<1>(tpl); };
    auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

Performance?

• Clang/LLVM 5.0.0svn, -O3 optimized

[Figure: runtime (ms) of the range-based transform/accumulate vs. inner_product.]

• Views describe lazy, non-mutating operations on ranges
• Evaluation happens inside an algorithm (e.g., accumulate)
• Fusion is guaranteed by the implementation

Ranges for GPUs

• Extended range-v3 with GPU-enabled containers and algorithms
• Original code of range-v3 remains unmodified

    std::vector<int> a(N), b(N);
    auto mult = [](auto tpl){ return get<0>(tpl) * get<1>(tpl); };
    auto ga = gpu::copy(a);
    auto gb = gpu::copy(b);
    auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);

Programming Accelerators with C++ (PACXX)

[Figure: PACXX architecture. Offline stage: the Clang frontend compiles the C++ program (against libc++) to LLVM IR, which is embedded into the executable together with the PACXX runtime. Online stage: an LLVM-based online compiler lowers the kernel IR through the SPIR or NVPTX backend to SPIR or PTX and executes it via the OpenCL or CUDA runtime on AMD GPUs, Intel MIC or Nvidia GPUs.]

• Based entirely on LLVM / Clang
• Supports C++14 for GPU programming
• Just-in-time compilation of LLVM IR for the target accelerators

Multi-Stage Programming

    template <typename InRng, typename T, typename Fun>
    auto reduce(InRng&& in, T init, Fun&& fun){
      // 1. preparation of kernel call
      ...
      // 2. create GPU kernel
      auto kernel = pacxx::kernel(
        [fun](auto&& in, auto&& out, int size, auto init){
          // 2a. stage elements per thread
          auto ept = stage([&]{ return size / get_block_size(0); });
          // 2b. start reduction computation
          auto sum = init;
          for (int x = 0; x < ept; ++x){
            sum = fun(sum, *(in + gid));
            gid += glbSize;
          }
          // 2c. perform reduction in shared memory
          ...
          // 2d. write result back
          if (lid == 0) *(out + bid) = shared[0];
        }, blocks, threads);
      // 3. execute kernel
      kernel(in, out, distance(in), init);
      // 4. finish reduction on the CPU
      return std::accumulate(out, init, fun);
    }

MSP Integration into PACXX

[Figure: MSP integration into PACXX. The offline compiler emits MSP IR alongside the kernel IR; at run time the MSP engine feeds its results into the kernel IR before the online compiler generates SPIR or PTX.]

• MSP Engine JIT compiles the MSP IR,
• evaluates stage prior to a kernel launch, and
• replaces the calls to stage in the kernel's IR with the results.
• Enables more optimizations (e.g., loop unrolling) in the online stage.

Performance Impact of MSP

[Figure: speedup of gpu::reduce on an Nvidia K20c for the Dot and Sum benchmarks, with and without multi-staging, for input sizes from 2^15 to 2^25.]

Up to 35% better performance compared to the non-MSP version.
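To see why staging pays off, compare a reduction loop whose per-thread element count is a compile-time constant with one where it is only known at run time. This is a minimal CPU-only sketch of the idea, not PACXX API; the function and parameter names are made up for illustration:

    // Staged variant: the trip count is baked in when the code is compiled,
    // so the compiler can fully unroll the loop. The MSP engine achieves the
    // same effect by evaluating stage(...) and specializing the kernel IR.
    template <int ElementsPerThread, typename T, typename Fun>
    T reduce_chunk_staged(const T* in, T init, Fun fun) {
      T sum = init;
      for (int i = 0; i < ElementsPerThread; ++i)
        sum = fun(sum, in[i]);
      return sum;
    }

    // Unstaged variant: the trip count is a run-time value,
    // so the loop stays a generic loop.
    template <typename T, typename Fun>
    T reduce_chunk_dynamic(const T* in, int elementsPerThread, T init, Fun fun) {
      T sum = init;
      for (int i = 0; i < elementsPerThread; ++i)
        sum = fun(sum, in[i]);
      return sum;
    }

In PACXX this specialization happens per launch: the MSP engine evaluates size / get_block_size(0) for the actual launch configuration and rewrites the kernel IR before the online compiler runs.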
Just-In-Time Compilation Overhead

Comparing MSP in PACXX with Nvidia's nvrtc library:

[Figure: compilation time (ms) for the Dot and Sum kernels with nvrtc from CUDA 7.5, nvrtc from CUDA 8.0 RC, and PACXX.]

PACXX compiles 10 to 20 times faster, since the front-end work is already done in the offline stage.

Benchmarks: range-v3 + PACXX vs. Nvidia's Thrust

[Figure: speedup of PACXX over Thrust for vadd, saxpy, sum, dot, Monte Carlo, Mandelbrot and Voronoi; annotated differences of 82.32%, 73.31%, 33.6%, 8.37%, 1.61%, −3.13% and −7.38%.]

• Evaluated on an Nvidia K20c GPU
• 11 different input sizes
• 1000 runs for each benchmark

Competitive performance with a composable GPU programming API.

Going Native: Work in Progress

[Figure: the PACXX online stage extended with a native CPU path: the kernel IR is vectorized, then compiled with MCJIT and executed on Intel, AMD or IBM CPUs, alongside the existing SPIR and NVPTX paths.]

• PACXX is extended by a native CPU backend
• The kernel IR is modified to be runnable on a CPU
• Kernels are vectorized by the Whole Function Vectorizer (WFV) [1]
• MCJIT compiles the kernels and TBB executes them in parallel

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO'11, pp. 141–150.

Benchmarks: range-v3 + PACXX vs. OpenCL on x86_64

[Figure: speedup of PACXX over Intel's OpenCL for vadd, saxpy, sum, dot and Mandelbrot; annotated differences of −1.45%, −13.94%, −12.64%, −20.02% and −36.48%.]

• Running on 2x Intel Xeon E5-2620 CPUs
• Intel's auto-vectorizer optimizes the OpenCL C code

[Figure: speedup of PACXX over AMD's OpenCL for vadd, saxpy, sum, dot and Mandelbrot; annotated gains include 153.82%, 58.12% and 58.45%.]

• Running on 2x Intel Xeon E5-2620 CPUs
• The AMD OpenCL SDK has no auto-vectorizer
• Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)

Benchmarks: range-v3 + PACXX vs. OpenMP on IBM Power8

[Figure: speedup of PACXX over IBM's OpenMP for vadd, saxpy, sum, dot and Mandelbrot; annotated differences of 35.09%, 17.97%, 0.45%, −20.05% and −42.95%.]

• Running on a PowerNV 8247-42L with 4x IBM Power8e CPUs
• No OpenCL implementation from IBM available
• Loops parallelized with #pragma omp parallel for simd (see the sketch below)
• Compiled with XL C++ 13.1.5

Questions?
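The sketch referenced on the Power8 slide: an assumed shape of the OpenMP-parallelized dot loop, since the exact benchmark sources are not shown; the function name and element type are placeholders.

    #include <cstddef>
    #include <vector>

    // Hypothetical OpenMP baseline for the dot benchmark: one fused loop,
    // distributed across threads and vectorized via 'parallel for simd'.
    int dot_openmp(const std::vector<int>& a, const std::vector<int>& b) {
      int result = 0;
      #pragma omp parallel for simd reduction(+ : result)
      for (std::size_t i = 0; i < a.size(); ++i)
        result += a[i] * b[i];
      return result;
    }

The simd clause asks the compiler to vectorize the loop body in addition to distributing the iterations across threads.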
