Harnessing the Power of Heterogeneous Computing using Boost.Compute + OpenCL

Armin Sobhani [email protected] SHARCNET University of Ontario Institute of Technology (UOIT) August 15, 2018 Outline

• A Quick introduction to Boost.Compute and OpenCL

• Tutorial for developing Boost.Compute applications on SHARCNet clusters:

https://git.sharcnet.ca/asobhani/bc-tutorial

Heterogeneous Computing using Boost.Compute + OpenCL 2 Standard (STL)

• Software library for the C++ Containers • Influenced many parts of the C++ Standard Iterators Library Algorithms • Consisting of 4 components: Functions

Heterogeneous Computing using Boost.Compute + OpenCL 3 Parallel STL

Why? Available Implementations

C++17 Parallel Algorithms

• Intel’s open source Parallel STL Familiar to most C++ programmers • KhronosGroup’s SYCL Parallel STL

Third-Party C++ Libraries Simplifies porting existing applications to parallel • Boost.Compute architectures • Nvidia’s Thrust • AMD’s Bolt

Heterogeneous Computing using Boost.Compute + OpenCL 4 Boost.Compute

header-only template library

for

based on OpenCL

available in Boost starting with version 1.61

Heterogeneous Computing using Boost.Compute + OpenCL 5 OpenCL

Open Computing Language

Open standard for parallel programming of heterogenous systems

Latest version is 2.2

Views a computing system as consisting of a number of compute devices

Functions executed on an OpenCL device are called kernels

Works by compiling C99 code at run-time to generate kernel objects

Heterogeneous Computing using Boost.Compute + OpenCL 6 Boost.Compute vs. Thrust vs. Bolt

Nvidia’s Thrust AMD’s Bolt

Doesn’t work Only works with with AMD GPUs AMD CPUs/GPUs

Heterogeneous Computing using Boost.Compute + OpenCL 7 Boost.Compute 101 Boost.Compute

Multi-Core CPUs Intel AMD

GPUs STL for Accelerators AMD Xeon Phi Intel Parallel Epiphany Nvidia Devices

FPGAs Altera Xilinx

Heterogeneous Computing using Boost.Compute + OpenCL 9 Library Architecture

Boost.Compute

STL-like API containers Lambda Iterators RNGs Expressions algorithms functions

Low-Level API

OpenCL

Compute Devices (Multi-core CPU, GPU, Accelerators, FPGA)

Heterogeneous Computing using Boost.Compute + OpenCL 10 Low-Level API

OpenCL C++ Wrapper

OpenCL objects Utility buffer Ta k e s c a r e o f context reference counting Functions command_queue error checking e.g. setting up etc. default device

Heterogeneous Computing using Boost.Compute + OpenCL 11 Low-Level API

for (auto const& device : boost::compute::system::devices()) std::cout << "device = " << device.name() << std::endl; •

Heterogeneous Computing using Boost.Compute + OpenCL 12 Low-Level API Example

#include The default device is selected based on a set of heuristics and can be influenced using one of the // lookup default compute device following environment variables: auto device = boost::compute::system::default_device(); • BOOST_COMPUTE_DEFAULT_DEVICE – // create OpenCL context for the device name of the compute device (e.g. "AMD Radeon") auto ctx = boost::compute::context(device);

// get default command queue • BOOST_COMPUTE_DEFAULT_DEVICE_TYPE – type of the auto queue = boost::compute::command_queue(ctx, device); compute device (e.g. "GPU" or "CPU")

// print device name • BOOST_COMPUTE_DEFAULT_PLATFORM – name of the std:cout << "device = " << device.name() << std::endl; platform (e.g. "NVIDIA CUDA")

• BOOST_COMPUTE_DEFAULT_VENDOR – name of the device vendor (e.g. "Intel")

Heterogeneous Computing using Boost.Compute + OpenCL 13 High-Level API – Parallel STL

array buffer_iterator bernoulli_distribution basic_string constant_buffer_iterator default_random_engine dynamic_bitset<> constant_iterator discrete_distribution linear_congruential_engine

flat_map counting_iterator RNGs mersenne_twister_engine flat_set discard_iterator normal_distribution mapped_view Iterators function_input_iterator uniform_int_distribution stack permutation_iterator

Containers uniform_real_distribution string strided_iterator valarray transform_iterator vector zip_iterator

Heterogeneous Computing using Boost.Compute + OpenCL 14 High-Level API – Parallel STL

accumulate() for_each_n() none_of() set_difference() adjacent_difference() gather() nth_element() set_intersection() adjacent_find() generate() partial_sum() set_symmetric_ all_of() generate_n() partition() difference() any_of() includes() partition_copy() set_union() binary_search() inclusive_scan() partition_point() sort() copy() inner_product() prev_permutation() sort_by_key() copy_if() inplace_merge() random_shuffle() stable_partition() copy_n() iota() reduce() stable_sort() count() is_partitioned() reduce_by_key() stable_sort_by_key() count_if() is_permutation() remove() swap_ranges() equal() is_sorted() remove_if() transform() equal_range() lower_bound() replace() transform_reduce() exclusive_scan() lexicographical_ replace_copy() unique() fill() compare() reverse() unique_copy() fill_n() max_element() reverse_copy() upper_bound() find() merge() rotate() find_end() min_element() rotate_copy() Algorithms find_if() minmax_element() scatter() find_if_not() mismatch() search() for_each() next_permutation() search_n()

Heterogeneous Computing using Boost.Compute + OpenCL 15 Host Sort Example

STL Boost.Compute

#include #include #include //#include #include std::vector v{…}; std::vector v{…};

// fill the vector with some data // fill the vector with some data std::sort(v.begin(), v.end()); boost::compute::sort(v.begin(), v.end(), queue);

Heterogeneous Computing using Boost.Compute + OpenCL 16 Host Sort Example

Heterogeneous Computing using Boost.Compute + OpenCL 17 Algorithm Internals

generate produce run on the OpenCL compile at OpenCL kernels in kernel compute run-time device C99 objects

C++ OpenCL

Heterogeneous Computing using Boost.Compute + OpenCL 18 Custom Functions

BOOST_COMPUTE_FUNCTION(int, add_three, (int x), • BOOST_COMPUTE_FUNCTION() { return x + 3; macro produces Boost.Compute }); function object boost::compute::vector vector = { ... }; boost::compute::transform( • Body of the function is checked vector.begin(), during OpenCL compilation vector.end(), vector.begin(), add_three, queue );

Heterogeneous Computing using Boost.Compute + OpenCL 19 Lambda Expressions

• Not the same as C++11 lambdas • Easy way for specifying custom function for algorithms • Fully type-checked by the C++ compiler

using boost::compute::_1;

boost::compute::vector vec = { ... }; boost::compute::transform(vec.begin(), vec.end(), vec.begin(), _1 + 5, queue);

Heterogeneous Computing using Boost.Compute + OpenCL 20 Boost.Compute Tu t o r i a l Available on SHARCNet GitLab

https://git.sharcnet.ca/asobhani/bc-tutorial

Heterogeneous Computing using Boost.Compute + OpenCL 22