Heterogeneous Computing

Yu Zhang

Course web site: http://staff.ustc.edu.cn/~yuzhang/pldpa

Heterogeneous Computing and oneAPI

Resources

- https://www.linux.org/
  • TLB, Page Table Management
  • https://github.com/numactl/numactl NUMA support for Linux
- https://www.khronos.org/: OpenCL, SYCL, SPIR…
- Intel (oneAPI), NVIDIA (CUDA)
- https://github.com/S4Plus/ABC/tree/master/oneAPI
- Book
  • The Art of Multiprocessor Programming

Multiprocessor-Multicores

Modern Systems

• Almost all current systems have more than one CPU/core
• Multiprocessor
  - More than one physical CPU
  - SMP: Symmetric multiprocessing
    • Each CPU is identical to every other
    • Each has the same capabilities and privileges
  - Each CPU is plugged into the system via its own slot/socket

• Multicore
  - More than one CPU in a single physical package
  - Multiple CPUs connect to the system via a shared slot/socket
  - Currently most multicores are SMP

SMP architecture

• Each CPU in the system can
  - perform the same tasks: execute the same set of instructions, access memory, interact with devices
  - connect to the system in the same way: interact with memory/devices via communication over the shared bus/interconnect
• This can easily lead to chaos
  - This is why we need synchronization

Multiprocessor-Multicores architecture

• During the early/mid 2000s, CPUs became multicores
  - Clock speeds could no longer increase exponentially
  - But transistor density was still increasing
• SMP with multicores
• cat /proc/cpuinfo

There are 64 processors in the machine, and here is the cpuinfo of processor 0.

yuzhang@user-SYS-2049U-TR4:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x2006a08
cpu MHz         : 1000.354
cache size      : 22528 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 4200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual

Multiprocessor-Multicores architecture

• What does this mean for the OS?
  - Mostly hidden by HW
  - OS sees N CPUs that are identical

• But the similarity does not always hold for memory
  - More on that in a minute
  - L1 Cache, L2 Cache, Last-level Cache, …

Latency Numbers Every Programmer Should Know

• Latency Comparison Numbers (~2012)

execute typical instruction            1/1,000,000,000 sec = 1 nanosec
fetch from L1 cache memory             0.5 nanosec
branch misprediction                   5 nanosec
fetch from L2 cache memory             7 nanosec
Mutex lock/unlock                      25 nanosec
fetch from main memory                 100 nanosec
send 2K bytes over 1Gbps network       20,000 nanosec
read 1MB sequentially from memory      250,000 nanosec
fetch from new disk location (seek)    8,000,000 nanosec
read 1MB sequentially from disk        20,000,000 nanosec
send packet US to Europe and back      150 milliseconds = 150,000,000 nanosec

By Jeff Dean: https://research.google/people/jeff/
https://gist.github.com/hellerbarde/2843375
https://i.imgur.com/k0t1e.png

TLB: translation lookaside buffer

[Figure: address translation on a 64-bit architecture with 48 address lines; the TLB is the faster path, the page table structure the slower one]

A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip's memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache.

Page Table Structure

A page table is the data structure used by a virtual memory system in a computer operating system to store the mapping between virtual addresses and physical addresses.

Cross CPU Communication (Shared Memory)

• OS must still track state of entire system
  - Global data structure updated by each core
  - cat /proc/loadavg provides a look at the load average in regard to both the CPU and IO over time, as well as additional data used by uptime and other commands

0.08 0.06 0.10 1/442 8347

  - 0.08 0.06 0.10: CPU and IO utilization of the last one, five, and 15 minute periods
  - 1/442: the number of currently running processes and the total number of processes
  - 8347: the last process ID used
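The five fields of the sample line can be pulled apart with ordinary stream extraction. The following is a minimal C++ sketch, not part of the original slides; the names LoadAvg and parse_loadavg are hypothetical, and it assumes the input has exactly the five-field layout shown above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical record for one /proc/loadavg line.
struct LoadAvg {
    double one = 0, five = 0, fifteen = 0;  // load averages over 1, 5 and 15 minutes
    int running = 0;                        // currently running processes
    int total = 0;                          // total number of processes
    int last_pid = 0;                       // last process ID used
};

// Parses a line such as "0.08 0.06 0.10 1/442 8347".
// Returns false if the line does not match the expected layout.
bool parse_loadavg(const std::string& line, LoadAvg& out) {
    std::istringstream in(line);
    char slash = 0;
    if (!(in >> out.one >> out.five >> out.fifteen
             >> out.running >> slash >> out.total >> out.last_pid)) {
        return false;
    }
    return slash == '/';
}
```

On Linux the same function could be fed the first line read from /proc/loadavg with std::getline.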

Cross CPU Communication (Shared Memory)

• OS must still track state of entire system
  - Global data structure updated by each core

• Traditional approach
  - Single copy of data, protected by locks
  - Bad scalability: every CPU constantly takes a global lock to update its own state

• Modern approach
  - Replicate state across all CPUs/cores
  - Each core updates its own local copy (so NO locks!)
  - Contention only when state is read
    • Global lock is required, but concurrent reads are rare
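The replicated-state idea can be illustrated in user space with one counter slot per worker thread. This is a sketch under stated assumptions, not how any particular kernel does it: ReplicatedCounter and count_in_parallel are hypothetical names, the 64-byte padding assumes the cache_alignment value reported by /proc/cpuinfo earlier, C++17 is assumed for over-aligned allocation, and the reader uses relaxed atomic loads in place of the global read lock mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One slot per worker, padded to an assumed 64-byte cache line so that
// two workers never write to the same line.
struct alignas(64) Slot {
    std::atomic<std::uint64_t> value{0};
};

class ReplicatedCounter {
public:
    explicit ReplicatedCounter(std::size_t workers) : slots_(workers) {}

    // Update path: a worker touches only its own slot, so no lock is taken.
    void add(std::size_t worker, std::uint64_t n) {
        assert(worker < slots_.size());
        slots_[worker].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Read path: the (rare) reader sums every replica.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const Slot& s : slots_) {
            total += s.value.load(std::memory_order_relaxed);
        }
        return total;
    }

private:
    std::vector<Slot> slots_;
};

// Starts `workers` threads that each add 1 to their own slot `per_worker` times.
std::uint64_t count_in_parallel(std::size_t workers, std::uint64_t per_worker) {
    ReplicatedCounter counter(workers);
    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&counter, w, per_worker] {
            for (std::uint64_t i = 0; i < per_worker; ++i) counter.add(w, 1);
        });
    }
    for (auto& t : threads) t.join();
    return counter.read();  // exact here, because all writers have finished
}
```

While writers are still running, read() returns only an approximate total; that is the price of keeping the update path free of shared locks.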

Cross CPU Communication (Signals)

• System allows CPUs to explicitly signal each other
  - Two approaches: notifications and cross-calls
  - Almost always built on top of interrupts
    • X86: Inter Processor Interrupts (IPIs)
• Notifications
  - CPU is notified that “something” has happened
  - No other information
  - Mostly used to wake up a remote CPU
• Cross Calls
  - The target CPU jumps to a specified instruction
    • Source CPU makes a function call that executes on the target CPU
  - Synchronous or asynchronous?
    • Can be both, up to the programmer

CPU Interconnects

• Mechanism by which CPUs communicate
  - Old way: Front Side Bus (FSB)
    • Slow with limited scalability
    • With potentially 100s of CPUs in a system, a bus won’t work
  - Modern Approach: Exploit HPC networking techniques
    • Embed a true interconnect into the system
    • Intel: QPI (QuickPath Interconnect)
    • AMD: HyperTransport

• Interconnects allow point to point communication
  - Multiple messages can be sent in parallel if they don’t intersect

Multiprocessing and Memory

• Shared memory is by far the most popular approach to multiprocessing
  - Each CPU can access all of a system’s memory
  - Conflicting accesses resolved via synchronization (locks)
  - Benefits
    • Easy to program, allows direct communication
  - Disadvantages
    • Limits scalability and performance

• Requires more advanced caching behavior
  - Systems contain a cache hierarchy with different scopes

Multiprocessor Caching

• On multicore CPUs some (but not all) caches are shared
  - Each core has its own private L1 cache
  - L2 cache can either be private to a core, or shared between cores
  - L3 cache is almost always shared between cores
  - Caches are not shared across physical CPU dies
• What if two CPUs update the same memory location stored in their L1 caches?
  - Shared memory systems require an absolute ordering of operations
  - Cache coherency ensures this ordering
    • Implemented in hardware to ensure that memory updates are propagated throughout the entire system
    • Utilizes CPU interconnect for communication

Memory Issues

• As core count increases shared memory becomes harder
  - Increasingly difficult for HW to provide shared memory behavior to all CPU cores
    • Manycore CPUs: Need to cross other cores to access memory
    • Some cores are closer to memory and thus faster
    • Memory is slow or fast depending on which CPU is accessing it
  - This is called Non Uniform Memory Access (NUMA)

Dell R710

NUMA: Non Uniform Memory Access

https://github.com/numactl/numactl NUMA support for Linux

• Memory is organized in a non uniform manner
  - It’s closer to some CPUs than others
  - Far away memory is slower than close memory
  - Not required to be cache coherent, but usually is
    • ccNUMA: Cache Coherent NUMA
• Typical organization is to divide system into “zones”
  - A zone usually contains a CPU socket/slot and a portion of the system memory
  - Memory is “local” if it’s in the CPU’s zone
    • Fast to access
    • Accessing memory in the local zone does not impact performance in other zones
  - Interconnect is point to point

Dealing with NUMA

• Programming a NUMA system is hard
  - Ultimately it’s a failed abstraction
  - Goal: Make all memory ops the same
    • But they aren’t, because some are slower
    • AND the abstraction hides the details
• Result: Very few people explicitly design an application with NUMA support
  - Those that do are generally in the HPC community
  - So it’s up to the user and the OS to deal with it
    • But mostly people just ignore it…

Querying the NUMA Layout

• numactl --hardware
• 64 CPU cores
  - 4 16-core CPUs

yuzhang@user-SYS-2049U-TR4:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 46949 MB
node 0 free: 45626 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 48380 MB
node 1 free: 47536 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 48381 MB
node 2 free: 47643 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 48380 MB
node 3 free: 47738 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

Querying the NUMA Layout

• numactl --hardware
• 48 CPU cores
  - 2 24-core CPUs

yuzhang@user-SYS-7048GR-TR:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 128826 MB
node 0 free: 122559 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 128993 MB
node 1 free: 122041 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Dealing with NUMA (users)

• Users can force the OS to confine a process to a specific zone
  - Restricts what memory a process gets allocated
  - Restricts which CPUs the process can run on
• Per process via command line
  - numactl --physcpubind=
• Groups of processes using scheduling domains
  - Linux: cgroups and containers (LXC)

sudo apt-get install libcgroup1 cgroup-tools
sudo apt-get install lxc
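A process can also restrict its own CPU set from inside the program. The sketch below is Linux-only and uses the glibc calls sched_getaffinity and sched_setaffinity; the function name pin_to_first_allowed_cpu is hypothetical. It only limits which CPU the calling thread runs on and does not by itself control which NUMA node memory is allocated from.

```cpp
#include <cassert>
#include <sched.h>  // Linux/glibc: cpu_set_t, sched_getaffinity, sched_setaffinity

// Pins the calling thread to the lowest-numbered CPU it is currently
// allowed to run on. Returns that CPU number, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t only;
            CPU_ZERO(&only);
            CPU_SET(cpu, &only);  // new mask contains exactly one CPU
            if (sched_setaffinity(0, sizeof(only), &only) != 0) {
                return -1;
            }
            return cpu;
        }
    }
    return -1;  // no allowed CPU found in the mask
}
```

Starting from the currently allowed set, rather than hard-coding CPU 0, keeps the sketch usable inside containers or cgroups that hide some CPUs.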

Dealing with NUMA (OS)

• An OS can deal with NUMA systems by restricting its own behavior
  - Force processes to always execute in a zone, and always allocate memory from the same zone
  - This makes balancing resource utilization tricky

• However, nothing prevents an application from forcing bad behavior
  - E.g. two applications in separate zones want to communicate using shared memory…

Multiprocessing and Power

• More cores require more energy (and produce more heat)
  - Managing the energy consumption of a system is becoming critically important
  - Modern systems cannot fully utilize all resources for very long
• Approaches
  - Slow down processors periodically
    • CPUs no longer identical (some faster, some slower)
  - Shut down entire cores
    • System dynamically powers down CPUs
    • OS must deal with processors coming and going

Heterogeneous CPUs

• Heterogeneous computing resources across system
  - Core specialization: CPU resources tailored to specific workloads
  - GPUs, lightweight cores, I/O cores, stream processors
• OS must manage these dynamically
  - What to schedule where and when?
  - How should the OS approach this issue?
    • Active area of current research

GPU

Motivation

Raytracing:

for all pixels (i,j) in image:
    From camera eye point, calculate ray point and direction in 3d space
    if ray intersects object:
        calculate lighting at closest object point
        store color of (i,j)
Assemble into image file

[Figure: Superquadric Cylinders, exponent 0.1, yellow glass balls, Barr, 1981]

Each pixel could be computed simultaneously, with enough parallelism!

Simple Example

• Add two arrays
• A[ ] + B[ ] -> C[ ]

• On the CPU:

float *C = malloc(N * sizeof(float));
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
return C;

On CPUs the above code operates sequentially, but can we do better on CPUs?

Simple Example

• On the CPU (multi-threaded, pseudocode):

(allocate memory for C)
Create # of threads equal to number of cores on processor (around 2, 4, perhaps 8?)
(Indicate portions of A, B, C to each thread...)

...

In each thread,
    For (i from beginning region of thread)
        C[i] <- A[i] + B[i]
        //lots of waiting involved for memory reads, writes, ...
Wait for threads to synchronize...

This is slightly faster: 2-8x (slightly more with other tricks)
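The pseudocode above can be written out with std::thread. This is a minimal sketch rather than a tuned implementation; the function name parallel_add and the even chunking scheme are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Adds a and b element-wise, giving each thread one contiguous region.
std::vector<float> parallel_add(const std::vector<float>& a,
                                const std::vector<float>& b,
                                unsigned num_threads = std::thread::hardware_concurrency()) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n);                 // allocate memory for C
    if (num_threads == 0) num_threads = 1;   // hardware_concurrency() may return 0

    // Ceiling division so that the regions cover all n elements.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;             // nothing left to hand out
        workers.emplace_back([&a, &b, &c, begin, end] {
            for (std::size_t i = begin; i < end; ++i) c[i] = a[i] + b[i];
        });
    }
    for (auto& w : workers) w.join();        // wait for threads to synchronize
    return c;
}
```

Because every thread writes a disjoint region of c, no lock is needed; the only synchronization point is the final join.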

A Simple Problem …

• How many threads are available on the CPUs? How can the performance scale with thread count?

• Context switching:
  - The action of switching which thread is being processed
  - High penalty on the CPU (main computer)
  - Not a big issue on the GPU

A Simple Problem …

• On the GPU

(allocate memory for A, B, C on GPU)
Create the “kernel”: each thread will perform one (or a few) additions
Specify the following kernel operation:

    For all i's (indices) assigned to this thread:
        C[i] <- A[i] + B[i]

Start ~20000 (!) threads all at the same time!
Wait for threads to synchronize...

CUDA

• Compute Unified Device Architecture
• NVCC and g++
  - .cu/.cuh is compiled by nvcc to produce a .o file
    • Since CUDA 7.0 / 9.0, NVCC supports most C++11 / C++14 language features, but make sure to read the restrictions for device code
      - https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-cplusplus-language-support
  - .cpp/.hpp is compiled by g++; the host code picks up the CUDA declarations with #include "xxx.cuh", and the .o file from the CUDA code is simply linked in
    • No different from how you link in .o files from normal C++ code

CUDA: Compute Unified Device Architecture

• The Kernel
  - A parallel function given to each thread

- Indexing

https://cs.calvin.edu/courses/cs/374/CUDA/CUDA-Thread-Indexing-Cheatsheet.pdf

Calling the Kernel

Calling the Kernel (2)

CUDA Architecture

Streaming Multiprocessors (SMs)

Each SM has a set of execution units, a set of registers and a chunk of shared memory.

The basic unit of execution is the warp.

A warp is a collection of threads, usually 32.

Thread Block Organization

• CUDA Programming Model
  - Thread: distributed by the CUDA runtime (threadIdx)
  - Block: a user defined group of 1 to ~512 threads (blockIdx)
  - Grid: a group of one or more blocks. A grid is created for each CUDA kernel function called.

Imagine thread organization as an array of thread indices

When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to SMs with available execution capacity. The threads of a thread block execute concurrently on one SM, and multiple thread blocks can execute concurrently on one SM.

Block and Grid Dimensions

• For many parallelizable problems involving arrays, it’s useful to think of multidimensional arrays.
  - E.g. linear algebra, physical modelling, etc., where we want to assign unique thread indices over a multidimensional object
  - So, CUDA provides built-in multidimensional thread indexing capabilities with a struct called dim3 (defined in vector_types.h)!

Grid/Block/Thread Visualized

When a kernel is started, the number of blocks per grid and the number of threads per block are fixed (gridDim and blockDim). Each thread is identified by its thread index (threadIdx) within its block and by the block index (blockIdx) of that block within the grid.
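To see how blockIdx, blockDim and threadIdx combine into one array index, the launch can be emulated sequentially on the CPU. The sketch below is plain C++, not CUDA: the two nested loops stand in for the hardware scheduling of a 1-dimensional grid, and the name emulated_grid_add is hypothetical. The index formula blockIdx * blockDim + threadIdx is the usual one for 1-dimensional launches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of a 1-D launch that adds two arrays.
// Each (block_idx, thread_idx) pair plays the role of one GPU thread.
std::vector<float> emulated_grid_add(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     unsigned block_dim) {
    assert(a.size() == b.size());
    assert(block_dim > 0);
    const std::size_t n = a.size();
    // Enough blocks to cover all n elements (ceiling division).
    const std::size_t grid_dim = (n + block_dim - 1) / block_dim;

    std::vector<float> c(n);
    for (std::size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (unsigned thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            // Global index of this emulated thread.
            const std::size_t i = block_idx * block_dim + thread_idx;
            if (i < n) {  // the last block may be only partly used
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

On a real GPU the loop bodies run concurrently as separate threads; the bounds check stays, because the grid is rounded up to a whole number of blocks.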

https://nyu-cds.github.io/python-gpu/02-cuda/

Thread Identity

• The index of a thread and its thread ID relate to each other as follows:
  - For a 1-dimensional block, the thread index and thread ID are the same
  - For a 2-dimensional block of size (Dx, Dy), the thread index (x, y) has thread ID = x + y * Dx
  - For a 3-dimensional block of size (Dx, Dy, Dz), the thread index (x, y, z) has thread ID = x + y * Dx + z * Dx * Dy
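The three cases collapse into one formula if unused dimensions have size 1 and index 0. A small C++ sketch of that arithmetic follows; Dim3 is a hypothetical stand-in for CUDA's dim3, and flat_thread_id is an illustrative name, not a CUDA function.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3: three unsigned components.
struct Dim3 {
    unsigned x, y, z;
};

// Thread ID = x + y * Dx + z * Dx * Dy for index (x, y, z) in a block of
// size (Dx, Dy, Dz). For 1-D and 2-D blocks, set the unused sizes to 1
// and the unused indices to 0.
inline unsigned flat_thread_id(Dim3 idx, Dim3 block) {
    assert(idx.x < block.x && idx.y < block.y && idx.z < block.z);
    return idx.x + idx.y * block.x + idx.z * block.x * block.y;
}
```

For example, in a 4 x 4 block the thread with index (1, 2) gets ID 1 + 2 * 4 = 9.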

Memory Hierarchy

• Global memory
  - Accessible to all threads as well as the host (CPU)
    • Global memory is allocated and deallocated by the host
    • Used to initialize the data that the GPU will work on

Memory Hierarchy

• Shared memory
  - Each thread block has its own shared memory
    • Accessible only by threads within the block
    • Much faster than local or global memory
    • Requires special handling to get maximum performance
    • Only exists for the lifetime of the block

Memory Hierarchy

• Local memory
  - Each thread has its own private local memory
    • Only exists for the lifetime of the thread
    • Generally handled automatically by the compiler

• Constant and texture memory
  - Read-only memory spaces accessible by all threads
    • Constant memory is used to cache values that are shared by all functional units
    • Texture memory is optimized for texturing operations provided by the hardware

SIMD: Single Instruction, Multiple Data

• SIMD describes a class of instructions which perform the same operation on multiple data elements (held in registers) simultaneously.
• Example: Add some scalar to 3 registers, storing the output for each addition in those registers.
  - Used to increase the brightness of a pixel
• CPUs also have SIMD instructions, which are very important for applications that need to do a lot of number crunching
  - Video codecs like x264/x265 make extensive use of SIMD instructions to speed up video encoding and decoding.
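The brightness example maps onto a loop that applies one operation to every colour channel. The C++ sketch below is scalar source code, not hand-written SIMD; whether a compiler turns the loop into SIMD instructions depends on the compiler and its optimization flags, which is an assumption here. The function name brighten and the clamping to 255 are illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Adds `delta` to every 8-bit channel value, clamping at 255.
// The same operation is applied to each element independently, which is
// the access pattern that SIMD instructions are designed for.
void brighten(std::vector<std::uint8_t>& channels, std::uint8_t delta) {
    for (std::uint8_t& value : channels) {
        const int sum = static_cast<int>(value) + static_cast<int>(delta);
        value = static_cast<std::uint8_t>(std::min(sum, 255));
    }
}
```

The loop has no dependency between iterations, so each channel value can be processed without waiting for any other.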

SIMD cont’d

• Converting an algorithm to use SIMD is usually called “Vectorizing”
  - Not every algorithm can benefit from this or even be vectorized at all, e.g. parsing.
  - Using SIMD instructions is not always beneficial though.
    • Even using the SIMD hardware requires additional power, and thus produces waste heat.
    • If the gains are small it probably isn’t worth the additional complexity.
  - Optimizing compilers like GCC and LLVM are still being trained to vectorize code usefully, though there have been many exciting developments on this front in the last 2 years, and it is an active area of study.
    • https://polly.llvm.org/

SIMT (Single Instruction, Multiple Thread) Architecture

• A looser extension of SIMD which is what CUDA’s computational model uses
  - Key differences:
    • Single instruction, multiple register sets
      - Wastes some registers, but mostly necessary for the following two points
    • Single instruction, multiple addresses (i.e. parallel memory access!)
      - Memory access conflicts!
    • Single instruction, multiple flow paths (i.e. if statements are allowed!!!)
      - Introduces slowdowns, called ‘warp-divergence.’
• Good description of differences
  - https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
  - https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation

Accelerated API Jungle

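[Figure: landscape of acceleration APIs, including oneDNN, MIOpen and HIP. Recoverable callouts: graph-based vision and inference acceleration; C++-based heterogeneous parallel programming framework; cross-API intermediate representation supporting parallel execution and graphics; GPU]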

Data Parallel C++ / Cross-Architecture Compiling

C++ 17 and SYCL; initial support for C++ 20
Subset of OpenMP* 4.5 and 5.0 for GPU offload

DevCloud for oneAPI

https://devcloud.intel.com/oneapi/

Access to DevCloud for oneAPI

• Sign up: https://intelsoftwaresites.secure.force.com/devcloud/oneapi
• Configure SSH Connection
  - Get your ssh key file from here and save it in your ~/.ssh
  - Run (XXXX is your user ID): bash setup-devcloud-access-XXXX.txt
  - You get your private key devcloud-access-key-XXXX.txt

Data Parallel C++

• Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL
  - Authors: Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., Tian, X.
• https://www.apress.com/jp/book/9781484255735 (Preview, access for free)
• Samples: https://github.com/Apress/data-parallel-CPP
• Tools
  - Intel® oneAPI DPC++/C++ Compiler
  - DPC++/C++ Compiler Developer Guide and Reference
  - https://github.com/intel/llvm
  - https://github.com/oneapi-src/oneDPL


The Khronos Group is an open, non-profit, member-driven consortium of over 150 industry-leading companies creating advanced, royalty-free interoperability standards for 3D graphics, augmented and virtual reality, parallel programming, vision acceleration and machine learning. Khronos members are enabled to contribute to the development of Khronos specifications, are empowered to vote at various stages before public deployment and are able to accelerate the delivery of their cutting-edge accelerated platforms and applications through early access to specification drafts and conformance tests.

Khronos Group

https://www.khronos.org/spir/

Other Slides for this Lecture

• Khronos
  - OpenCL: OpenCL-3.0-Launch-Apr 2020.pdf
    • Performance and safety for Heterogeneous (Embedded) Computing
  - SYCL: SYCL-2020-Launch-Feb 2021.pdf
    • https://www.iwocl.org/iwocl-2020/sycl-tutorials/
  - for Accelerating Embedded Vision and Inferencing: Industry Overview of Options and Trade-offs, Neil Trevett, Khronos President, NVIDIA VP Developer Ecosystems, Embedded World 2020
• Intel
  - SYCL compiler: zero-cost abstraction and type safety for heterogeneous computing, Andrew Savonichev, 2019 European LLVM Developers Meeting
• https://www.iwocl.org/ 9th International Workshop on OpenCL and SYCL, 26-29 April 2021 | VIRTUAL
