
Agenda
• What is SYCL and why Intel needs it
• SYCL language/API
• SYCL programming model overview
• Kernel execution model
• “Hello world” example
• Language evolution
• DPC++/SYCL compilers
• App compilation and execution flow
• SYCL implementations
• References

Optimization Notice: Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Heterogeneous Compute Platforms
A modern platform includes:
- One or more CPUs
- One or more GPUs
- DSP processors
- FPGAs
- Individual processors have many (possibly heterogeneous) cores
Programmers want to write a single portable program that uses ALL resources in the heterogeneous platform.
[Diagram: Intel Ice Lake architecture]

What is SYCL?
Single-source heterogeneous programming using STANDARD C++
▪ Use C++ templates and lambda functions for host & device code
▪ Aligns the hardware acceleration of OpenCL with the direction of the C++ standard

Developer choice — the development of the two specifications is aligned so code can be easily shared between the two approaches:
▪ C++ kernel language: low-level control and programmer familiarity; “GPGPU”-style separation of device-side kernel source code and host code
▪ Single-source C++: the approach also taken by C++ AMP and OpenMP

Why SYCL? Reactive and Proactive Motivation
Reactive to OpenCL pros and cons:
• OpenCL has a well-defined, portable execution model.
• OpenCL is prohibitively verbose for many application developers.
• OpenCL remains a C API and only recently supported C++ kernels.
• The just-in-time compilation model and disjoint source code are awkward and contrary to HPC usage models.
Proactive about future C++:
• SYCL is based on purely modern C++ and should feel familiar to C++11 users.
• SYCL is expected to run ahead of C++Next regarding heterogeneity and parallelism; the ISO C++ of tomorrow may look a lot like SYCL.
• SYCL is not held back by C99 or C++03 compatibility goals.

Terminology: SYCL Platform
A SYCL implementation consists of:
▪ A Host: makes SYCL API calls
▪ SYCL Devices: run SYCL kernels
▪ The Host and the SYCL Devices make up a SYCL Platform
▪ One system may have multiple installed SYCL Platforms, e.g. an Intel Platform (CPU) plus an NVIDIA Platform (GPU)
▪ The SYCL Host Device is a native C++ implementation of the SYCL API on the host
[Diagram: a Host connected to CPU, GPU, FPGA, and other SYCL Devices]

SYCL Software Architecture
Applications create SYCL Command Queues per SYCL Device
▪ Multiple Command Queues can be created for a device, e.g. one per thread
▪ No cross-device Command Queues, e.g. for automatic load balancing
Applications execute SYCL Command Groups via Command Queues
▪ Example commands: copy memory, fill buffers, launch kernels, etc.
[Diagram: per-device queues feeding CPU, GPU, FPGA, and other SYCL Devices]

Kernel Execution Model
• Explicit ND-range for control
• 1-, 2- or 3-dimensional range
• Kernel code is SPMD/SIMT
• The global work space (ND-range) is decomposed into work-groups, each containing work-items
[Diagram: GEN11 slice architecture]
Subgroups
• An additional level of the execution-model hierarchy, giving access to the SIMD unit
• A GEN EU can execute 16 FP32 floating-point operations per clock [2 ALUs × SIMD-4 × 2 ops (Add + Mul)]
• Sub-group decomposition of a work-group: global work space → work-group → sub-group → work-item

SYCL Constructs to Describe Parallelism
1. Basic data parallel:
       parallel_for( num_work_items )
2. Explicit ND-range:
       parallel_for( NDRange( global_size, work_group_size ) )
   ▪ Enables SPMD/SIMT coding for OpenCL and CUDA experts
3. Hierarchical parallelism:
       parallel_for_work_group( num_work_groups ) {
           parallel_for_work_item( items_per_work_group )
       }
   ▪ Exploits scope to control execution granularity
   ▪ Similar to Kokkos hierarchical parallelism
4. Single task:
       single_task() {}

Hierarchical Parallelism (Logical View)
    parallel_for_work_group (…) {
        …
        parallel_for_work_item (…) {
            …
        }
    }
▪ Fundamentally a top-down expression of parallelism
▪ Many embedded features and details, not covered here

Memory Model
• Private Memory – per work-item (kernel invocation)
• Local Memory – shared within a work-group
• Global/Constant Memory – visible to all work-groups on a compute device
• Host Memory – on the host
Memory management is explicit: we must explicitly move data from host -> global -> local and back.
SYCL Device/Kernel Features
Supported features:
₊ templates
₊ classes
₊ operator overloading
₊ static polymorphism
₊ lambdas
₊ short vector types (2/3/4/8/16-wide)
₊ rich library of built-in functions
Unsupported features:
₋ dynamic memory allocation
₋ dynamic polymorphism
₋ runtime type information
₋ exception handling
₋ function pointers*
₋ pointer structure members*
₋ static variables*

SYCL Vector Add Example
    void vector_add(const std::vector<double> &A,
                    const std::vector<double> &B,
                    std::vector<double> &C) {
        ...
    }
Let’s compute C = A + B on a GPU! (C is the output vector, so it is passed by non-const reference.)

Step 1: create buffers (they represent both host and device memory):
    sycl::buffer<double> d_A{ A.data(), A.size() };
    sycl::buffer<double> d_B{ B.data(), B.size() };
    sycl::buffer<double> d_C{ C.data(), C.size() };

Step 2: create a device queue (the developer can specify a device type via a device selector or use the default selector):
    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);
Multiple queues can be created to target multiple devices from the same (or multiple) threads.

Step 3: submit a command group for (asynchronous) execution:
    q.submit([&](sycl::handler &h) {
        ...
    });

Step 4: create buffer accessors inside the command group:
    auto acc_A = …