Some hot topics in HPC: NUMA and GPU

So, I know how to use MPI and OpenMP...
. . . is that all ? • (Un)fortunately no
Today’s lecture is about two “hot topics” in HPC: • NUMA nodes and thread affinity • GPUs (accelerators)
2 / 52 Outline
1 UMA and NUMA Review Remote access Thread scheduling
2 Cache memory Review False sharing
3 GPUs What’s that ? Architecture A first example (CUDA) Let’s get serious Asynchronous copies
4 / 52 UMA and NUMA (Review) — What’s inside a modern cluster
1. A network 2. Interconnected nodes 3. Nodes with multiple processors/sockets (and accelerators) 4. Processors/sockets with multiple cores
5 / 52 UMA and NUMA (Review) — And what about memory ?
From the network point of view: • Each node (a collection of processors) has access to its own memory • The nodes are communicating by sending messages • We called that distributed memory and used MPI to handle it
From the node point of view: • The (node’s own) memory is shared among the cores • We called that shared memory and used OpenMP to handle it
• Ok, but how is it shared ? −→ Uniform Memory Access (UMA) −→ Non-Uniform Memory Access (NUMA)
6 / 52 UMA and NUMA (Review) — The UMA way
[Figure: cores c0 . . . c7 and the memory connected by a single bus]
The cores and the memory modules are interconnected by a bus Every core can access any part of the memory at the same speed Pros: • No matter where the data are located • No matter where the computations are done Cons: • If the number of cores increases, the bus has to be faster −→ Does not scale −→ Stuck at around 8 cores on the same memory bus
7 / 52 UMA and NUMA (Review) — The NUMA way I
[Figure: cores c0 . . . c3 attached to Memory0 and cores c4 . . . c7 attached to Memory1, the two groups linked by an interconnect]
The cores are split in groups • NUMA nodes Each NUMA node has a fast access to a part of the memory (UMA way) The NUMA nodes are interconnected by a bus (or set of buses) If a core of a NUMA node needs data it does not own: • It “asks” the corresponding NUMA node • Slower mechanism than accessing its own memory
8 / 52 UMA and NUMA (Review) — The NUMA way II
[Figure: same two-node NUMA layout as on the previous slide]
Pros: • Scales Cons: • Data location does matter
9 / 52 UMA and NUMA (Remote access) — The beast I
We will use the following machine:
%> hwloc-info --no-io --of txt
We have two NUMA nodes We have 6 cores per NUMA node
10 / 52 UMA and NUMA (Remote access) — The beast II
11 / 52 UMA and NUMA (Remote access) — UNIX and malloc() I
Let’s try the following code:
%> gcc -O3 firsttouch.c %> ./a.out Time to allocate 100000000 bytes
Call to malloc(): 0.000045 [s] First Touch : 0.037014 [s] Second Touch : 0.001181 [s]
12 / 52 UMA and NUMA (Remote access) — UNIX and malloc() II
Is it possible to allocate 100 MB in 45 µs ? • That would mean a memory bandwidth of over 2 TB/s −→ Hum. . . Why do the two touch loops have different timings ? • The first loop is actually allocating the memory
malloc() just informs the kernel of a possible future allocation Memory is actually allocated in chunks of (usually) 4 KiB (pages) The allocation is done when a page is first touched
13 / 52 UMA and NUMA (Remote access) — First touch policy I
In a multithreaded context, it is the first touch policy that is used
When a page is first touched by a thread, it is allocated on the NUMA node that runs this thread
14 / 52 UMA and NUMA (Remote access) — First touch policy II
Let’s try the following code:
%> gcc -O3 -fopenmp numacopy.c %> ./a.out Time to copy 80000000 bytes
One: 0.009035 [s] Two: 0.017308 [s] Ratio: 1.915637 [-]
One: NUMA aware allocation Two: NUMA unaware allocation
15 / 52 UMA and NUMA (Remote access) — First touch policy III
NUMA unaware allocation is two times slower Larger NUMA interconnects may have even slower remote access libnuma may help with handling memory placement numactl may help for non-NUMA-aware codes
16 / 52 UMA and NUMA (Remote access) — First touch policy IV
numactl can (among other things) allow interleaved allocation
%> gcc -O3 -fopenmp numacopy.c %> numactl --interleave=all ./a.out Time to copy 80000000 bytes
One: 0.010230 [s] Two: 0.009739 [s] Ratio: 0.952014 [-]
One: NUMA aware allocation Two: NUMA unaware allocation
17 / 52 UMA and NUMA (Thread scheduling) — Kernel panic ?
Ok, I’m allocating memory with a thread on NUMA node i • Now, this thread has fast access to this segment on NUMA node i What if the kernel’s scheduler moves this thread to NUMA node i + 1 ? • This has to be avoided !
Can be done in OpenMP: • OMP_PROC_BIND = [true | false] • Threads are not allowed to move between processors when set to true More control with system calls: • sched_setaffinity() and sched_getaffinity() • Linux specific: sched.h libnuma can also help
18 / 52 Outline
1 UMA and NUMA Review Remote access Thread scheduling
2 Cache memory Review False sharing
3 GPUs What’s that ? Architecture A first example (CUDA) Let’s get serious Asynchronous copies
19 / 52 Cache memory (Review) — You said cache ?
Cache memory allows fast access to a part of the memory Each core has its own cache
[Figure: Core0 accessing Memory through its private Cache0]
20 / 52 Cache memory (False sharing) — In parallel ?
What happens if two cores are modifying the same cache line ?
[Figure: Cache0 and Cache1 both hold the line Mem[128-135]; a write by one core marks the other copy Invalid]
Caches may not be coherent any more ! • Synchronization needed −→ Takes time. . . False sharing • From a software point of view: data are not shared • From a hardware point of view: data are shared (they sit on the same cache line)
21 / 52 Cache memory (False sharing) — Reduction
Let’s try the following code:
%> gcc -fopenmp -O3 false.c %> ./a.out Test without padding: 0.001330 [s] Test with padding: 0.000754 [s] Ratio: 1.764073 [-]
22 / 52 Cache memory (False sharing) — High level cache
Synchronization can be achieved through a shared higher level of cache On multi-socket motherboards, synchronization must go through RAM
24 / 52 GPUs (What’s that ?) — Surely about graphics !
GPU: Graphics Processing Unit Handles 3D graphics: • Projection of a 3D scene onto a 2D plane • Rasterisation of the 2D plane Most modern devices also handle shading, reflections, etc.
Specialized hardware for intensive work Healthy video game industry pushing this technology • Relatively fair price
25 / 52 GPUs (What’s that ?) — Unreal Engine and Malcom
https://en.wikipedia.org/wiki/File:Unreal_Engine_Comparison.jpg
© Epic Games 26 / 52 GPUs (What’s that ?) — What about HPC ? I
Highly parallel processors • More than 2000 processing units on NVIDIA GeForce GTX 780 Better thermal efficiency than a CPU Cheap
27 / 52 GPUs (What’s that ?) — What about HPC ? II
Around 2007, GPUs started using standard floating point arithmetic (IEEE 754) Introduction of C extensions: • CUDA: proprietary (NVIDIA) • OpenCL: open (Khronos Group)
28 / 52 GPUs (Architecture) — CPU vs GPU
Let’s look at the chips: • On a CPU, control logic and cache memory are dominant • On a GPU, floating point units are dominant
29 / 52 GPUs (Architecture) — Vocabulary
A device is composed of: • The GPU • The GPU’s own RAM • An interface with the motherboard (PCI-Express) A host is composed of: • The CPU • The CPU’s own RAM • The motherboard
30 / 52 GPUs (Architecture) — Inside the GPU
A GPU is composed of streaming multiprocessors (SM(X)) • 12 on NVIDIA GeForce GTX 780 A high-capacity RAM (from 0.5 GB to 4 GB) is shared among the SMs
An SM is composed of streaming processors (SP) • Floating point units • 192 on NVIDIA GeForce GTX 780 • 32 on NVIDIA GeForce GT 430 (single precision) • 16 on NVIDIA GeForce GT 430 (double precision) • 8 on NVIDIA GeForce 8500 GT (single precision only) An SM is also composed of: • Memory units • Control units • Special function units (SFUs)
31 / 52 GPUs (Architecture) — A streaming multiprocessor
[Figure: an SM, with its instruction cache feeding the SPs]
One instruction fetch unit: inside an SM, the same instruction is executed by all the SPs Single Instruction Multiple Threads (SIMT) • SIMD without locality
32 / 52 GPUs (Architecture) — The big picture
[Figure: the big picture — the host connected to the device]
33 / 52 GPUs (Architecture) — Host job
The host: • Allocates and deallocates memory on the device • Sends data to the device (synchronously or asynchronously) • Fetches data from the device (synchronously or asynchronously) • Sends and executes code on the device −→ This code is called a kernel −→ Calls are asynchronous • The host needs to statically split the threads among the SMs −→ Blocks of threads −→ Blocks distributed among the SMs −→ Some dynamism introduced in newer architectures • Good practice to have many more threads than SPs −→ Keeps the SPs busy during memory accesses −→ A kind of Simultaneous MultiThreading (SMT)
34 / 52 GPUs (Architecture) — Device job
The device can: • Distribute the blocks of threads among the SMs • Execute the kernel • Handle the send/fetch requests of the host at the same time −→ Task parallelism
35 / 52 GPUs (Architecture) — More ?
I could keep talking about: • Memory limitations • Branching operations • Thread block limitations • . . . . . . but let’s stop here for the architecture • I think we have the basics What is important to remember: • Many floating point units • Little “near core” memory • Copies between host and device • SIMT Let’s try some code
36 / 52 GPUs (A first example (CUDA)) — hello, world!
Let’s try the vector addition c = a + b
37 / 52 GPUs (A first example (CUDA)) — main
int main(void) {
  int N = 1742; // Vector size
  float *a, *b, *c;

  // Allocate host memory //
  a = (float *)malloc(sizeof(float) * N);
  b = (float *)malloc(sizeof(float) * N);
  c = (float *)malloc(sizeof(float) * N);

  // Compute addition (of noise vectors) //
  vectAdd(a, b, c, N);

  // Now, we have c = a + b :-) //

  // Free host memory //
  free(a);
  free(b);
  free(c);

  return 0;
}
38 / 52 GPUs (A first example (CUDA)) — vectAdd I
void vectAdd(float *aHost, float *bHost, float *cHost, int N) {
  // Host pointers with device addresses
  float *aDevice, *bDevice, *cDevice;

  // Allocate device memory //
  cudaMalloc((void **)&aDevice, sizeof(float) * N);
  cudaMalloc((void **)&bDevice, sizeof(float) * N);
  cudaMalloc((void **)&cDevice, sizeof(float) * N);

  // Copy aHost and bHost on device //
  cudaMemcpy(aDevice, aHost, sizeof(float) * N, cudaMemcpyHostToDevice);
  cudaMemcpy(bDevice, bHost, sizeof(float) * N, cudaMemcpyHostToDevice);
// See next slide//
39 / 52 GPUs (A first example (CUDA)) — vectAdd II
// See previous slide//
  // Launch kernel //
  dim3 TpB(256); // Threads per block
  dim3 BpG((N + TpB.x - 1) / TpB.x); // Number of blocks
  vectAddKernel<<<BpG, TpB>>>(aDevice, bDevice, cDevice, N);
  // Wait for results from device & get them //
  cudaMemcpy(cHost, cDevice, sizeof(float) * N, cudaMemcpyDeviceToHost);
  // Free device memory //
  cudaFree(aDevice);
  cudaFree(bDevice);
  cudaFree(cDevice);
}
40 / 52 GPUs (A first example (CUDA)) — vectAddKernel
__global__ void vectAddKernel(float *a, float *b, float *c, int N) {
  // Thread global ID //
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // We could have more than N threads //
  if (i < N)
    // Each thread does a part of the job //
    c[i] = a[i] + b[i];
}
41 / 52 GPUs (A first example (CUDA)) — Compiler et al.
CUDA compiler: nvcc • Compiles the kernel (vectAddKernel) and the calling function (vectAdd) • The other parts are handled by your favorite host compiler File format: .cu Header: cuda.h Library: libcudart.so
42 / 52 GPUs (Let’s get serious) — BLAS and cuBLAS
cuBLAS: A proprietary (NVIDIA) library implementing BLAS on GPU Let’s compare: • Two Intel Xeon E5645 with OpenBLAS (12 cores) • NVIDIA Tesla M2075 (448 SPs) • Double precision GPU code: • Send the data to the GPU • Do BLAS3 on square matrices of increasing size • Do BLAS1 on vectors of increasing size • Get the results back CPU code: • Do BLAS3 on square matrices of increasing size • Do BLAS1 on vectors of increasing size
43 / 52 GPUs (Let’s get serious) — BLAS3
[Plot: time [s] (log scale, 10^-2 to 10^0) versus matrix size (1,000 to 6,000), GPU and CPU curves]
44 / 52 GPUs (Let’s get serious) — BLAS1
[Plot: time [s] (linear scale, ×10^-2) versus vector size (1 to 6 ×10^6), GPU and CPU curves]
45 / 52 GPUs (Let’s get serious) — BLAS3: Transfers
[Plot: time [s] (log scale, 10^-2 to 10^0) versus matrix size (1,000 to 6,000), transfers and computations curves]
46 / 52 GPUs (Let’s get serious) — BLAS1: Transfers
[Plot: time [s] (linear scale, ×10^-2) versus vector size (1 to 6 ×10^6), transfers and computations curves]
47 / 52 GPUs (Asynchronous copies) — Let’s do it pipeline style !
Sometimes it is difficult to handle the host – device transfers • Happens usually when a CPU code is ported to a GPU version • We have two choices: −→ Rewrite a large part of the CPU version −→ Use task parallelism and asynchronous copies A GPU can handle at the same time: • A host – device copy • A kernel execution • A device – host copy
The data can be split in small chunks • A copy – execute – copy pipeline can be built −→ Host – device transfers are handled by the DMA module −→ Direct Memory Access −→ The CPU is no longer involved in the copies
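Such a pipeline is typically built with CUDA streams: operations queued in one stream run in order, while different streams may overlap. A hedged sketch, compiled with nvcc (the chunk sizes, names and trivial kernel are illustrative; error checking is omitted):

```cuda
#include <stdio.h>

#define NCHUNKS 4
#define CHUNK (1 << 20)  /* elements per chunk (illustrative) */

__global__ void scaleKernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] *= 2.0f;
}

int main(void) {
  float *h, *d;
  // Pinned host memory: required for truly asynchronous copies //
  cudaMallocHost((void **)&h, sizeof(float) * NCHUNKS * CHUNK);
  cudaMalloc((void **)&d, sizeof(float) * NCHUNKS * CHUNK);

  cudaStream_t s[NCHUNKS];
  for (int i = 0; i < NCHUNKS; i++)
    cudaStreamCreate(&s[i]);

  dim3 TpB(256);
  dim3 BpG((CHUNK + TpB.x - 1) / TpB.x);

  for (int i = 0; i < NCHUNKS; i++) {
    size_t off = (size_t)i * CHUNK;
    // Queued in the same stream: in order for one chunk,
    // but overlapped across chunks //
    cudaMemcpyAsync(d + off, h + off, sizeof(float) * CHUNK,
                    cudaMemcpyHostToDevice, s[i]);
    scaleKernel<<<BpG, TpB, 0, s[i]>>>(d + off, CHUNK);
    cudaMemcpyAsync(h + off, d + off, sizeof(float) * CHUNK,
                    cudaMemcpyDeviceToHost, s[i]);
  }
  cudaDeviceSynchronize();  // wait for the whole pipeline //

  for (int i = 0; i < NCHUNKS; i++)
    cudaStreamDestroy(s[i]);
  cudaFree(d);
  cudaFreeHost(h);
  return 0;
}
```

With enough chunks, the upload of chunk i + 1 hides behind the kernel of chunk i, as on the timeline of the next slide.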
48 / 52 GPUs (Asynchronous copies) — Task parallelism
[Figure: timeline showing the upload, kernel execution and download of successive chunks overlapping in a pipeline]
49 / 52 GPUs (Asynchronous copies) — Are you insane ?
Wait a minute. . . • The CPU is no longer involved in the copy mechanism • What if the OS kernel swaps out the memory segment that is being copied ?
We need to tell the OS kernel that the copied pages cannot be swapped ! • Memory pinning −→ The available system memory is decreasing ! −→ Need to pin and unpin, or the system will run out of memory • cudaHostRegister() • cudaHostUnregister()
We also need the guarantee that our segment starts on a page boundary • posix_memalign()
50 / 52 GPUs (Asynchronous copies) — Did it work ? I
We ported a CPU-optimized Discontinuous Galerkin code to GPU Only the most time consuming function was rewritten Task parallelism was used
[Plot: time [s] (100 to 800) versus iterations (150 to 350), GPU and CPU curves almost superimposed]
51 / 52 GPUs (Asynchronous copies) — Did it work ? II
The two versions have almost the same execution times The GPU version frees the CPU • The CPU can do other things: hybrid programming The CPU cost around 100 € The GPU cost around 30 €
52 / 52 game over: you win !
%> shutdown -h now "merry christmas and happy new year"