Agenda
• What is SYCL and why Intel needs it
• SYCL language/API
• SYCL programming model overview
• Kernel execution model
• "Hello world" example
• Language evolution
• DPC++/SYCL compilers
• App compilation and execution flow
• SYCL implementations
• References
Optimization Notice: Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Heterogeneous Compute Platforms
A modern platform includes:
- One or more CPUs
- One or more GPUs
- DSP processors
- FPGAs
Individual processors have many (possibly heterogeneous) cores.

Programmers want to write a single portable program that uses ALL resources in the heterogeneous platform.

[Diagram: Intel Ice Lake architecture]

What is SYCL?
Single-source heterogeneous programming using STANDARD C++
▪ Use C++ templates and lambda functions for host & device code
▪ Aligns the hardware acceleration of OpenCL with the direction of the C++ standard
Developer Choice: the development of the two specifications is aligned so code can be easily shared between the two approaches
C++ Kernel Language
▪ Low-level control
▪ 'GPGPU'-style separation of device-side kernel source code and host code

Single-source C++
▪ Programmer familiarity
▪ Approach also taken by C++ AMP and OpenMP
Why SYCL? Reactive and Proactive Motivation:
Reactive to OpenCL pros and cons:
• OpenCL has a well-defined, portable execution model.
• OpenCL is prohibitively verbose for many application developers.
• OpenCL remains a C API and only recently supported C++ kernels.
• The just-in-time compilation model and disjoint source code are awkward and contrary to HPC usage models.

Proactive about future C++:
• SYCL is based on purely modern C++ and should feel familiar to C++11 users.
• SYCL is expected to run ahead of C++Next regarding heterogeneity and parallelism; the ISO C++ of tomorrow may look a lot like SYCL.
• Not held back by C99 or C++03 compatibility goals.
Terminology: SYCL platform
A SYCL implementation consists of:
▪ A Host: makes SYCL API calls
▪ SYCL Devices: run SYCL kernels
▪ The Host and the SYCL Devices make up a SYCL Platform
▪ One system may have multiple installed SYCL Platforms
  Example: Intel Platform (CPU) + NVIDIA Platform (GPU)
▪ The SYCL Host Device is a native C++ implementation of the SYCL API on the host

[Diagram: a SYCL Platform containing the Host plus CPU, GPU, FPGA, and other SYCL Devices]
SYCL Software Architecture
Applications create SYCL Command Queues per SYCL Device
▪ Multiple Command Queues can be created for one device, e.g. one per thread
▪ No cross-device Command Queues, i.e. no automatic load balancing

Applications execute SYCL Command Groups via Command Queues
▪ Example commands: copy memory, fill buffers, launch kernels, etc.

[Diagram: CPU and GPU queues on the Host feeding the CPU, GPU, FPGA, and other SYCL Devices]
Kernel Execution Model
• Explicit ND-range for control
• 1, 2, or 3 dimensional range
• Kernel code is SPMD/SIMT

[Diagram: ND-range, a global work space divided into work-groups of work-items]
GEN11 slice architecture diagram
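The decomposition above comes down to simple index arithmetic. The sketch below is plain C++ for illustration only, not the SYCL API; the helper names `global_id` and `num_groups` are invented for the sketch:

```cpp
#include <cstddef>

// In an ND-range, each work-item's global id is determined by its
// work-group id, the work-group size, and its local id within the group.
std::size_t global_id(std::size_t group_id, std::size_t local_id,
                      std::size_t local_size) {
  return group_id * local_size + local_id;
}

// Number of work-groups launched for a given global size and work-group
// size (in SYCL 1.2.1 the global size must divide evenly).
std::size_t num_groups(std::size_t global_size, std::size_t local_size) {
  return global_size / local_size;
}
```

For example, a 1-D range of 12 work-items with work-group size 4 launches 3 work-groups, and work-item 1 of work-group 2 has global id 9 (what `nd_item::get_global_id()` would report).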
Subgroups
• Additional level of execution-model hierarchy that gives access to the SIMD unit
• A GEN EU can execute 16 FP32 operations per clock [2 ALUs x SIMD-4 x 2 ops (Add + Mul)]
[Diagram: sub-group decomposition of a work-group, with work-items grouped into sub-groups within the global work space]
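The sub-group decomposition can be modeled the same way. A plain C++ sketch, assuming work-items map onto SIMD lanes in consecutive local-id order (the helper names are invented; the actual mapping is implementation-defined):

```cpp
#include <cstddef>

// How many sub-groups a work-group of wg_size splits into on hardware
// with SIMD lanes of width simd_width (rounding up for a partial group).
std::size_t sub_group_count(std::size_t wg_size, std::size_t simd_width) {
  return (wg_size + simd_width - 1) / simd_width;
}

// Which sub-group a work-item belongs to, given its local id.
std::size_t sub_group_id(std::size_t local_id, std::size_t simd_width) {
  return local_id / simd_width;
}

// The work-item's lane within its sub-group.
std::size_t sub_group_local_id(std::size_t local_id, std::size_t simd_width) {
  return local_id % simd_width;
}
```

E.g. a work-group of 16 work-items on SIMD-8 hardware decomposes into 2 sub-groups; the work-item with local id 11 is lane 3 of sub-group 1.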
SYCL Constructs to Describe Parallelism
1. Basic data parallel: parallel_for( num_work_items )

2. Explicit ND-range: parallel_for( NDRange( global_size, work_group_size ) )
   ▪ Enables SPMD/SIMT coding for OpenCL and CUDA experts

3. Hierarchical parallelism:
   parallel_for_work_group( num_work_groups ) {
     parallel_for_work_item( items_per_work_group )
   }
   ▪ Exploits scope to control execution granularity
   ▪ Similar to Kokkos hierarchical parallelism

4. Single task: single_task() {}
Hierarchical parallelism (logical view)
parallel_for_work_group (…) {
  …
  parallel_for_work_item (…) {
    …
  }
}
▪ Fundamentally a top-down expression of parallelism
▪ Many embedded features and details, not covered here
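The logical semantics of the two nested constructs can be modeled as ordinary nested loops. A sequential plain C++ sketch (invented names, only to illustrate the scopes): the outer loop body runs once per work-group, the inner loop body once per work-item, and code between them executes at work-group scope.

```cpp
#include <cstddef>
#include <vector>

// Model of the hierarchical form: each "work-group" computes a value at
// group scope, then each "work-item" in it writes one output element.
std::vector<int> hierarchical_fill(std::size_t num_groups,
                                   std::size_t items_per_group) {
  std::vector<int> out(num_groups * items_per_group);
  for (std::size_t g = 0; g < num_groups; ++g) {        // parallel_for_work_group
    int group_offset = static_cast<int>(g) * 100;       // runs once per group
    for (std::size_t i = 0; i < items_per_group; ++i) { // parallel_for_work_item
      out[g * items_per_group + i] = group_offset + static_cast<int>(i);
    }
  }
  return out;
}
```

With 2 groups of 3 items this produces {0, 1, 2, 100, 101, 102}: the group-scope value is shared by all items of a group, which is exactly what the scoping rules of the hierarchical form guarantee.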
Memory Model
• Private Memory: per work-item (kernel invocation)
• Local Memory: shared within a work-group
• Global/Constant Memory: visible to all work-groups on the compute device
• Host Memory: on the host

Memory management is explicit: we must explicitly move data from host -> global -> local and back.

[Diagram: work-items with private memory inside work-groups with local memory, above global/constant memory on the compute device and host memory on the host]
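That staging can be illustrated with a plain C++ model of a per-work-group reduction (a sketch with invented names; no real SYCL memory spaces are involved): each work-group copies its slice from "global" memory into a "local" scratch buffer, reduces it through "private" storage, and writes its partial sum back to "global" memory.

```cpp
#include <cstddef>
#include <vector>

// One partial sum per work-group, staged global -> local -> private -> global.
std::vector<int> group_partial_sums(const std::vector<int>& global_in,
                                    std::size_t wg_size) {
  std::size_t groups = global_in.size() / wg_size;
  std::vector<int> global_out(groups, 0);
  for (std::size_t g = 0; g < groups; ++g) {
    std::vector<int> local(wg_size);            // "local memory", per group
    for (std::size_t i = 0; i < wg_size; ++i)   // global -> local copy
      local[i] = global_in[g * wg_size + i];
    int sum = 0;                                // "private memory" accumulator
    for (std::size_t i = 0; i < wg_size; ++i)
      sum += local[i];
    global_out[g] = sum;                        // local result -> global
  }
  return global_out;
}
```

On a real device the pay-off is that the many accesses during the reduction hit fast local memory instead of global memory.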
SYCL Device/Kernel Features
Supported features:
₊ templates
₊ classes
₊ operator overloading
₊ static polymorphism
₊ lambdas
₊ short vector types (2/3/4/8/16-wide)
₊ rich library of built-in functions

Unsupported features:
₋ dynamic memory allocation
₋ dynamic polymorphism
₋ runtime type information
₋ exception handling
₋ function pointers*
₋ pointer structure members*
₋ static variables*
SYCL Vector Add Example

The example is built up step by step:

void vector_add(const std::vector<float> &A,
                const std::vector<float> &B,
                std::vector<float> &C) {
  // Step 1: create device buffers over the host data
  sycl::buffer<float, 1> d_A(A.data(), sycl::range<1>(A.size()));
  sycl::buffer<float, 1> d_B(B.data(), sycl::range<1>(B.size()));
  sycl::buffer<float, 1> d_C(C.data(), sycl::range<1>(C.size()));

  // Step 2: select a device and create a command queue for it
  sycl::gpu_selector device_selector;
  sycl::queue q(device_selector);

  // Step 3: submit a command group for (asynchronous) execution
  q.submit([&](sycl::handler &h) {
    // Step 4: create buffer accessors; accessors create the DAG that
    // triggers data movement and represents execution dependencies
    auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
    auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
    auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

    // Step 5: launch a data-parallel kernel, one work-item per element
    h.parallel_for<class VectorAdd>(sycl::range<1>(A.size()),
                                    [=](sycl::id<1> i) {
      acc_C[i] = acc_A[i] + acc_B[i];
    });
  });
} // buffer destructors wait for kernel completion and copy results back
Graph of Asynchronous Executions

Three command groups are submitted to the same queue. The accessors each group requests define a DAG of execution dependencies: add1 and add2 are independent and may run concurrently, while add3 depends on their results.

Queue.submit([&](handler& h) {
  auto A = a.get_access<…>(h);
  …
  h.parallel_for<class add1>(…);
});

Queue.submit([&](handler& h) {
  auto A = a.get_access<…>(h);
  …
  h.parallel_for<class add2>(…);
});

Queue.submit([&](handler& h) {
  auto C = c.get_access<…>(h);
  …
  h.parallel_for<class add3>(…);
});
Framing: DPC++ and SYCL
Intel has pushed many improvements / fixes into this spec (now rev5); DPC++ is the Intel product built on it.

SYCL specs:
▪ SYCL 1.2 (2015), aligned with OpenCL 1.2
▪ SYCL 2.2 provisional (2016), aligned with OpenCL 2.2
▪ SYCL 1.2.1 (2017), aligned with OpenCL 1.2; SYCL 1.2.1 = 2 years of evolution over the 2.2 provisional
▪ SYCL 2020 / 2022? …: generalized backend interoperability
Key Intel DPC++ extensions on top of SYCL
Extensions and implementation status:
▪ Unified shared memory (USM): PoC complete
▪ ND-range subgroups: done on GPU, Q3'19 on CPU
▪ Ordered queue: prototyping
▪ Function pointers: prototyping
▪ Data flow pipes (spatial): ETA Q3'19
▪ ND-range reductions: TBD
▪ Lambda naming: prototyping
Unified Shared Memory (USM)
A pointer-based alternative to the buffer model
▪ Defines several capability levels; higher levels require more advanced hardware
▪ Lets the programmer choose the desired level of control:
  – explicit data movement APIs
  – implicit data movement
▪ Simplifies porting to the device: port, profile, tune cycle to get best performance
▪ Aligns with CUDA Unified Memory
Simple USM Example
…
float* a = (float*) sycl_malloc_shared(100 * sizeof(float));
float* b = (float*) sycl_malloc_shared(100 * sizeof(float));
for (int i = 0; i < 100; i++) { a[i] = func(); }
// pointers from sycl_malloc_shared are usable directly in kernels,
// with no buffers or accessors required:
q.parallel_for<…>(…);
…
Ordered Queues
SYCL first solved the hard problem (offload compute graph with dependencies) ▪ This extension makes the simple cases trivial to code
Enables in-order work queues in DPC++
▪ Simplifies common patterns
▪ Reduces boilerplate
▪ Reduces effort to move from CUDA to DPC++
Lambda naming
Separate compilation (different host and device compilers) is first class in SYCL
▪ Powerful, but adds language requirements that are:
  1. annoying for users doing simple things
  2. a problem for libs (e.g. PSTL)
  3. a problem for frameworks (e.g. Kokkos)
▪ Fixing for typical use cases (in Intel compilers) until an ISO C++ solution is invented
SYCL 1.2.1 (kernel name required):

class name;
int main() {
  queue().submit([&](handler& h){
    h.single_task<class name>([]{…});
  });
}

DPC++ extension (unnamed kernel lambdas):

int main() {
  queue().submit([&](handler& h){
    h.single_task([]{…});
  });
}
Build Time (developer workstation), Execution Time (end-user system)
Regular C++ Compilation Flow

[Diagram: main.cpp (with its #includes) -> C++ compiler -> object file -> Linker -> Executable!]
SYCL Mode Compilation Flow

clang++ -fsycl main.cpp -o main.exe -lOpenCL -lsycl

[Diagram: main.cpp -> host compiler -> object file, and -> device compiler -> kernel IR/ISA (SPIR-V, Gen ISA); a bundler/wrapper/linker then produces a fat Executable! embedding the kernels (SPIR-V? ISA?)]
SYCL Execution Flow

[Diagram: Executable (ISV) -> SYCL Runtime -> either the Host Device (a native C++ implementation; the host device is partially supported by Intel DPC++) or a per-device OpenCL Runtime: Intel CPU OpenCL Runtime -> CPU Device, Intel GPU OpenCL Runtime -> GPU Device, plus FPGA, …]
SYCL Implementations (5 are known today)
▪ Intel open source project
▪ Only conformant SYCL implementation today (Codeplay)

* Source: Aksel Alpay: https://twitter.com/illuhad/status/1083863225479892993

SYCL Implementations (2)
▪ Codeplay hired the developer

SYCL Implementations (3)
▪ Intel open source project

Intel DPC++ toolchain
The DPC++ compiler passes 88% of the Khronos SYCL certification tests
• Missing features: image class, hierarchical parallelism

The DPC++ toolchain includes:
• Debugger: GDB support
• Profiler: Intel VTune
• A compatibility tool to simplify migration of legacy CUDA codebases to SYCL
Please try it out and give us your feedback!
Resources
Intel internal
• https://soco.intel.com/groups/dpc-incremental-releases-for-internal/ - Intel oneAPI internal forum (includes releases, useful documents, etc.)

Public
• Intel SYCL compiler: https://github.com/intel/llvm/tree/sycl
• http://sycl.tech/ - digest of SYCL projects, articles, presentations, news, etc.
• https://www.khronos.org/sycl/ - SYCL page at Khronos
• Latest spec: SYCL 1.2.1
Use Cases: TensorFlow “TensorFlow™ is an open source software library for numerical computation using data flow graphs.” For heavy math computation Eigen is used - “a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.”
Frontend architecture (TensorFlow): a math expression is represented as a compile-time type via a tree-based DSEL (Eigen's expression templates).

Backend architecture: the SYCL back-end enables OpenCL hardware alongside the existing C++ and CUDA back-ends.

WIP links: Eigen ( https://goo.gl/0tRlSo ); TensorFlow ( https://goo.gl/fngQPb )
Use Cases: SYCL Parallel STL
SYCL Parallel STL is an implementation of the C++17 parallel algorithms from the Standard Template Library.
https://github.com/KhronosGroup/SyclParallelSTL
Function Pointers
Indirection through function pointers is not available on all accelerators
▪ Being enabled in DPC++ for some workloads (e.g. ray tracing)
▪ Basic example:
int foo(int bar) [[device_indirectly_callable]] { return ++bar; }
...
myQueue.submit([&](handler& h) {
  …
  h.parallel_for<…>(…);
});
Enabling Performance
1. Explicit execution model familiar to developers
2. Ability to tune work and scratchpad sizes for architecture/application
   parallel_for( nd_range<2>( range<2>(6, 4), range<2>(3, 2) ) )
3. Primitives libraries
[Diagram: sub-group decomposition of a work-group]

4. Exposing hardware features
   ▪ SIMD and other intrinsics
   ▪ Extensions for device-specific features
OpenCL Model: Explicit ND-Range (2D)

parallel_for( NDRange( global_size, work_group_size ) )
▪ Explicit global + work-group sizes
▪ Direct performance tuning for the architecture

myQueue.submit([&](handler& h) {
  stream os(1024, 80, h);
  range<2> global_range(6, 4);
  range<2> local_range(3, 2);
  h.parallel_for<…>(nd_range<2>(global_range, local_range), …);
});

[Diagram: a 6x4 global range tiled into 3x2 work-groups]
GEN11 Memory Model
[Diagram: GEN11 memory model - work-items with private memory grouped into work-groups sharing local memory, above global/constant memory on the compute device and the host; shown alongside the GEN11 slice architecture diagram]