Some hot topics in HPC: NUMA and GPU
So, I know how to use MPI and OpenMP... is that all?
• (Un)fortunately, no
Today's lecture is about two "hot topics" in HPC:
• NUMA nodes and thread affinity
• GPUs (accelerators)
2 / 52

Outline
1 UMA and NUMA (Review · Remote access · Thread scheduling)
2 Cache memory (Review · False sharing)
3 GPUs (What's that? · Architecture · A first example (CUDA) · Let's get serious · Asynchronous copies)
3 / 52

UMA and NUMA (Review) — What's inside a modern cluster?
1. A network
2. Interconnected nodes
3. Nodes with multiple processors/sockets (and accelerators)
4. Processors/sockets with multiple cores
5 / 52

UMA and NUMA (Review) — And what about memory?
From the network point of view:
• Each node (a collection of processors) has access to its own memory
• The nodes communicate by sending messages
• We called that distributed memory and used MPI to handle it
From the node point of view:
• The node's own memory is shared among the cores
• We called that shared memory and used OpenMP to handle it
• OK, but how is it shared?
  −→ Uniform Memory Access (UMA)
  −→ Non-Uniform Memory Access (NUMA)
6 / 52

UMA and NUMA (Review) — The UMA way
[diagram: one memory block connected to cores c0 to c7 by a single bus]
The cores and the memory modules are interconnected by a bus
Every core can access any part of the memory at the same speed
Pros:
• It does not matter where the data are located
• It does not matter where the computations are done
Cons:
• If the number of cores increases, the bus has to be faster
  −→ Does not scale
  −→ Stuck at around 8 cores on the same memory bus
7 / 52

UMA and NUMA (Review) — The NUMA way I
[diagram: cores c0 to c3 attached to Memory0, cores c4 to c7 attached to Memory1]
The cores are split into groups
• NUMA nodes
Each NUMA node has fast access to a part of the memory (the UMA way)
The NUMA nodes are interconnected by a bus (or a set of buses)
If a core of a NUMA node needs data it does not own:
• It "asks" the corresponding NUMA node
• This mechanism is slower than accessing its own memory
8 / 52

UMA and NUMA (Review) — The NUMA way II
Pros:
• Scales
Cons:
• Data location does matter
9 / 52

UMA and NUMA (Remote access) — The beast I
We will use the following machine:
%> hwloc-info --no-io --of txt
We have two NUMA nodes
We have 6 UMA cores per NUMA node
10 / 52

UMA and NUMA (Remote access) — The beast II
[figure-only slide: topology of the machine]
11 / 52

UMA and NUMA (Remote access) — UNIX and malloc() I
Let's try the following code:
%> gcc -O3 firsttouch.c
%> ./a.out
Time to allocate 100000000 bytes
Call to malloc(): 0.000045 [s]
First Touch     : 0.037014 [s]
Second Touch    : 0.001181 [s]
12 / 52

UMA and NUMA (Remote access) — UNIX and malloc() II
Is it possible to allocate 100 MB in 45 µs?
• That would mean a memory bandwidth of about 2 TB/s −→ Hum.
Why do the two loops have different timings?
• The first loop is actually allocating the memory
malloc() just informs the kernel of a possible future allocation
Memory is actually allocated in chunks of (usually) 4 KiB (a page)
The allocation is done when a page is first touched
13 / 52

UMA and NUMA (Remote access) — First touch policy I
In a multithreaded context it is the first touch policy that is used:
when a page is first touched by a thread, it is allocated on the NUMA node that runs this thread
14 / 52

UMA and NUMA (Remote access) — First touch policy II
Let's try the following code:
%> gcc -O3 -fopenmp numacopy.c
%> ./a.out
Time to copy 80000000 bytes
One: 0.009035 [s]
Two: 0.017308 [s]
Ratio: 1.915637 [-]
One: NUMA aware allocation
Two: NUMA unaware allocation
15 / 52

UMA and NUMA (Remote access) — First touch policy III
The NUMA unaware allocation is two times slower
Larger NUMA interconnects may have even slower remote accesses
libnuma may help with handling memory placement
numactl may help for non-NUMA-aware codes
16 / 52

UMA and NUMA (Remote access) — First touch policy IV
numactl can (among other things) enforce interleaved allocation:
%> gcc -O3 -fopenmp numacopy.c
%> numactl --interleave=all ./a.out
Time to copy 80000000 bytes
One: 0.010230 [s]
Two: 0.009739 [s]
Ratio: 0.952014 [-]
One: NUMA aware allocation
Two: NUMA unaware allocation
17 / 52
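The firsttouch.c program itself is not reproduced on the slides. A minimal sketch of such a timing experiment, assuming one byte is touched per 4 KiB page (constants and output labels are illustrative, not the original source), could look like this:

/* Illustrative sketch (not the original firsttouch.c): time malloc(), a
 * first pass that touches one byte per 4 KiB page, and a second pass over
 * the pages that are now mapped.  The volatile pointer keeps -O3 from
 * optimizing the touch loops away.                                       */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    100000000                   /* 100 MB */
#define PAGE 4096                        /* usual page size              */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double t0 = now();
    volatile char *buf = malloc(N);      /* only reserves address space  */
    double t1 = now();

    for (size_t i = 0; i < N; i += PAGE)
        buf[i] = 0;                      /* first touch: pages are faulted in  */
    double t2 = now();

    for (size_t i = 0; i < N; i += PAGE)
        buf[i] = 1;                      /* second touch: pages already mapped */
    double t3 = now();

    printf("Call to malloc(): %f [s]\n", t1 - t0);
    printf("First Touch     : %f [s]\n", t2 - t1);
    printf("Second Touch    : %f [s]\n", t3 - t2);
    free((void *)buf);
    return 0;
}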
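numacopy.c is not shown either. The usual way to get a "NUMA aware" allocation is to let each OpenMP thread first-touch exactly the pages it will later work on, with the same loop schedule as the copy, while a "NUMA unaware" allocation lets a single thread touch everything. A sketch under those assumptions (illustrative names, not the original source):

/* Illustrative sketch of NUMA aware vs. unaware allocation with OpenMP.
 * "One" (aware): every thread first-touches the part of the arrays it will
 * later copy, so the first touch policy places those pages on its own NUMA
 * node.  "Two" (unaware): one thread touches everything, so all pages end
 * up on a single NUMA node and the other node pays for remote accesses.  */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (10 * 1000 * 1000)             /* 10 M doubles = 80 MB per array */

static double copy_time(double *dst, const double *src)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        dst[i] = src[i];
    return omp_get_wtime() - t0;
}

int main(void)
{
    double *a1 = malloc(N * sizeof(double)), *b1 = malloc(N * sizeof(double));
    double *a2 = malloc(N * sizeof(double)), *b2 = malloc(N * sizeof(double));

    /* One: NUMA aware, parallel first touch with the same static schedule */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) { a1[i] = i; b1[i] = 0.0; }

    /* Two: NUMA unaware, serial first touch by the initial thread only */
    for (int i = 0; i < N; i++) { a2[i] = i; b2[i] = 0.0; }

    double one = copy_time(b1, a1);
    double two = copy_time(b2, a2);

    printf("Time to copy %zu bytes\n", N * sizeof(double));
    printf("One: %f [s]\nTwo: %f [s]\nRatio: %f [-]\n", one, two, two / one);
    printf("check: %f %f\n", b1[N - 1], b2[N - 1]);   /* keep the copies alive */

    free(a1); free(b1); free(a2); free(b2);
    return 0;
}

For the timings to be meaningful the threads must not migrate between NUMA nodes between the first touch and the copy, e.g. by running with OMP_PROC_BIND=true; and, as the numactl slide above shows, running a NUMA unaware code under numactl --interleave=all spreads the pages over all nodes and recovers most of the lost bandwidth.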
UMA and NUMA (Thread scheduling) — Kernel panic?
OK, I am allocating memory with a thread running on NUMA node i
• Now this thread has fast access to this segment of NUMA node i
What if the kernel's scheduler moves this thread to NUMA node i + 1?
• This has to be avoided!
It can be done in OpenMP:
• OMP_PROC_BIND = [true | false]
• Threads are not allowed to move between processors if set to true
More control with POSIX threads:
• sched_setaffinity() and sched_getaffinity()
• Linux specific: sched.h
libnuma can also help
18 / 52

Outline
1 UMA and NUMA (Review · Remote access · Thread scheduling)
2 Cache memory (Review · False sharing)
3 GPUs (What's that? · Architecture · A first example (CUDA) · Let's get serious · Asynchronous copies)
19 / 52

Cache memory (Review) — You said cache?
Cache memory allows fast access to a part of the memory
Each core has its own cache
[diagram: Core0 with its private Cache0 in front of the Memory]
20 / 52

Cache memory (False sharing) — In parallel?
What happens if two cores are modifying the same cache line?
[diagram: Cache0 and Cache1 both hold a copy of Mem[128-135]; one copy is marked invalid]
The caches may not be coherent any more!
• Synchronization is needed −→ takes time
False sharing:
• From a software point of view: the data are not shared
• From a hardware point of view: the data are shared (same cache line)
21 / 52

Cache memory (False sharing) — Reduction
Let's try the following code:
%> gcc -fopenmp -O3 false.c
%> ./a.out
Test without padding: 0.001330 [s]
Test with padding   : 0.000754 [s]
Ratio: 1.764073 [-]
22 / 52
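false.c is not reproduced here. The classic demonstration of this effect is a reduction whose per-thread partial sums are stored in an array, with and without padding so that each slot gets its own cache line. A minimal sketch along those lines (assumed structure and a 64-byte cache line, not the original file; at most 64 threads are assumed):

/* Illustrative false sharing demo: each thread accumulates a partial sum
 * into its own array slot.  Without padding, the slots of neighbouring
 * threads share a cache line and the cores keep invalidating each other's
 * copy; with padding, every slot owns a full 64-byte line.  volatile forces
 * a store on every iteration so -O3 cannot keep the sum in a register.    */
#include <stdio.h>
#include <omp.h>

#define N (1 << 22)
#define MAX_THREADS 64
#define LINE 64                             /* assumed cache line size */

static volatile double nopad[MAX_THREADS];
static struct { volatile double v; char pad[LINE - sizeof(double)]; }
    padded[MAX_THREADS];

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            nopad[id] += 1.0;               /* neighbours share a cache line */
    }
    double t_nopad = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            padded[id].v += 1.0;            /* one cache line per thread */
    }
    double t_pad = omp_get_wtime() - t0;

    printf("Test without padding: %f [s]\n", t_nopad);
    printf("Test with padding   : %f [s]\n", t_pad);
    printf("Ratio: %f [-]\n", t_nopad / t_pad);
    return 0;
}

On a single socket the invalidation traffic goes through a shared higher level of cache; across sockets it has to go through the RAM, which is the point of the next slide.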
Cache memory (False sharing) — High level cache
Synchronization can be achieved through a shared higher level of cache
On multi-socket motherboards the synchronization must go through the RAM
23 / 52

Outline
1 UMA and NUMA (Review · Remote access · Thread scheduling)
2 Cache memory (Review · False sharing)
3 GPUs (What's that? · Architecture · A first example (CUDA) · Let's get serious · Asynchronous copies)
24 / 52

GPUs (What's that?) — Surely about graphics!
GPU: Graphics Processing Unit
Handles 3D graphics:
• Projection of a 3D scene onto a 2D plane
• Rasterisation of the 2D plane
Most modern devices also handle shading, reflections, etc.
Specialized hardware for intensive work
A healthy video game industry is pushing this technology
• Relatively fair price
25 / 52

GPUs (What's that?) — Unreal Engine and Malcolm
https://en.wikipedia.org/wiki/File:Unreal_Engine_Comparison.jpg © Epic Games
26 / 52

GPUs (What's that?) — What about HPC? I
Highly parallel processors
• More than 2000 processing units on an NVIDIA GeForce GTX 780
Better thermal efficiency than a CPU
Cheap
27 / 52

GPUs (What's that?) — What about HPC? II
Since around 2007 GPUs use standard floating point arithmetic (IEEE 754)
Introduction of C extensions:
• CUDA: proprietary (NVIDIA)
• OpenCL: open (Khronos Group)
28 / 52

GPUs (Architecture) — CPU vs GPU
Let's look at the chips:
• On a CPU, control logic and memory are dominant
• On a GPU, floating point units are dominant
29 / 52

GPUs (Architecture) — Vocabulary
A device is composed of:
• The GPU
• The GPU's own RAM memory
• An interface with the motherboard (PCI-Express)
A host is composed of:
• The CPU
• The CPU's own RAM memory
• The motherboard
30 / 52

GPUs (Architecture) — Inside the GPU
A GPU is composed of streaming multiprocessors (SM(X))
• 12 on an NVIDIA GeForce GTX 780
A high capacity RAM (from 0.5 GB to 4 GB) is shared among the SMs
An SM is composed of streaming processors (SP)
• Floating point units
• 192 on an NVIDIA GeForce GTX 780
• 32 on an NVIDIA GeForce GT 430 (single precision)
• 16 on an NVIDIA GeForce GT 430 (double precision)
• 8 on an NVIDIA GeForce 8500 GT (single precision only)
An SM is also composed of:
• Memory units
• Control units
• Special function units (SFUs)
31 / 52

GPUs (Architecture) — A streaming multiprocessor
An SM has one instruction cache and one instruction fetch unit:
inside an SM the same instruction is executed by all the SPs
Single Instruction Multiple Threads (SIMT)
• SIMD without locality
32 / 52

GPUs (Architecture) — The big picture
[diagram: the host and the device, connected through the motherboard]
33 / 52

GPUs (Architecture) — Host job
The host:
• Allocates and deallocates memory on the device
• Sends data to the device (synchronously or asynchronously)
• Fetches data from the device (synchronously or asynchronously)
• Sends code to the device and executes it there
  −→ This code is called a kernel
  −→ Kernel calls are asynchronous
• The host needs to statically split the threads among the SMs
  −→ Blocks of threads
  −→ Blocks are distributed among the SMs
  −→ Some dynamism has been introduced in newer architectures
• It is good practice to have many more threads than SPs
  −→ Keeps the SPs busy during memory accesses
  −→ A kind of Simultaneous MultiThreading (SMT)
34 / 52

GPUs (Architecture) — Device job
The device can:
• Distribute the blocks of threads among the SMs
• Execute the kernel
• Handle the send/fetch requests of the host at the same time
  −→ Task parallelism
35 / 52

GPUs (Architecture) — More?
I could keep talking about:
• Memory limitations
• Branching operations
• Thread block limitations
• ...
but let's stop here for the architecture
• I think we have the basics
What is important to remember:
• Many floating point units
• Little "near core" memory
• Copies between host and device
• SIMT
Let's try some code
36 / 52

GPUs (A first example (CUDA)) — hello, world!
Let's try the vector addition c = a + b
37 / 52

GPUs (A first example (CUDA)) — main
int main(void) {
  int N = 1742;  // Vector size
  float *a, *b, *c;

  // Allocate host memory //
  a = (float *)malloc(sizeof(float) * N);
  b = (float *)malloc(sizeof(float) * N);
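The extract stops in the middle of this host code. Assuming the example follows the usual CUDA pattern for a vector addition (the kernel name, block size and initialization below are illustrative choices, not necessarily what the original slides contain), the complete program would look roughly like this:

// Illustrative completion of the vector addition c = a + b in CUDA.
// Kernel name, block size and the initialization are assumptions.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread adds one pair of elements; surplus threads simply return.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int N = 1742;                                  // Vector size
    float *a, *b, *c;                              // Host arrays
    float *d_a, *d_b, *d_c;                        // Device arrays

    // Allocate host memory //
    a = (float *)malloc(sizeof(float) * N);
    b = (float *)malloc(sizeof(float) * N);
    c = (float *)malloc(sizeof(float) * N);

    // Initialize the input vectors //
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Allocate device memory //
    cudaMalloc((void **)&d_a, sizeof(float) * N);
    cudaMalloc((void **)&d_b, sizeof(float) * N);
    cudaMalloc((void **)&d_c, sizeof(float) * N);

    // Copy the inputs from host to device //
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Launch the kernel: enough blocks of 256 threads to cover N //
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, N);

    // Copy the result back (this also waits for the kernel to finish) //
    cudaMemcpy(c, d_c, sizeof(float) * N, cudaMemcpyDeviceToHost);

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);

    // Free device and host memory //
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}

A file like this is compiled with nvcc rather than gcc.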