Some hot topics in HPC: NUMA and GPU
So, I know how to use MPI and OpenMP...

... is that all? • (Un)fortunately no

Today’s lecture is about two “hot topics” in HPC: • NUMA nodes and affinity • GPUs (accelerators)

2 / 52 Outline

1 UMA and NUMA: Review, Remote access, Thread scheduling

2 Cache memory: Review, False sharing

3 GPUs: What’s that?, Architecture, A first example (CUDA), Let’s get serious, Asynchronous copies

3 / 52 Outline

1 UMA and NUMA: Review, Remote access, Thread scheduling

2 Cache memory: Review, False sharing

3 GPUs: What’s that?, Architecture, A first example (CUDA), Let’s get serious, Asynchronous copies

4 / 52 UMA and NUMA (Review) — What’s inside a modern cluster

1. A network 2. Interconnected nodes 3. Nodes with multiple processors/sockets (and accelerators) 4. Processors/sockets with multiple cores

5 / 52 UMA and NUMA (Review) — And what about memory?

From the network point of view: • Each node (a collection of processors) has access to its own memory • The nodes communicate by sending messages • We called that distributed memory and used MPI to handle it

From the node point of view: • The (node’s own) memory is shared among the cores • We called that shared memory and used OpenMP to handle it

• OK, but how is it shared? −→ Uniform Memory Access (UMA) −→ Non-Uniform Memory Access (NUMA)

6 / 52 UMA and NUMA (Review) — The UMA way

[Figure: eight cores (c0–c7) sharing a single memory over one bus]

The cores and the memory modules are interconnected by a bus • Every core can access any part of the memory at the same speed
Pros: • It does not matter where the data are located • It does not matter where the computations are done
Cons: • If the number of cores increases, the bus has to be faster −→ Does not scale −→ Stuck at around 8 cores on the same memory bus

7 / 52 UMA and NUMA (Review) — The NUMA way I

[Figure: eight cores split into two NUMA nodes, each attached to its own memory (Memory0, Memory1)]

The cores are split into groups • NUMA nodes
Each NUMA node has fast access to a part of the memory (the UMA way)
The NUMA nodes are interconnected by a bus (or a set of buses)
If a core of a NUMA node needs data it does not own: • It “asks” the corresponding NUMA node • Slower mechanism than accessing its own memory

8 / 52 UMA and NUMA (Review) — The NUMA way II

[Same figure: two NUMA nodes, each attached to its own memory]

Pros: • Scales
Cons: • Data location does matter

9 / 52 UMA and NUMA (Remote access) — The beast I

We will use the following machine:

%> hwloc-info --no-io --of txt

We have two NUMA nodes • We have 6 cores per NUMA node

10 / 52 UMA and NUMA (Remote access) — The beast II

11 / 52 UMA and NUMA (Remote access) — UNIX and malloc() I

Let’s try the following code:

%> gcc -O3 firsttouch.c
%> ./a.out
Time to allocate 100000000 bytes
  Call to malloc(): 0.000045 [s]
  First Touch     : 0.037014 [s]
  Second Touch    : 0.001181 [s]
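The listing of firsttouch.c is not reproduced here; a minimal sketch of what it could look like, based on the output above (the timing helper and the final printf that keeps the loops from being optimized away are additions of this sketch):

/* firsttouch.c (sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {                 /* wall-clock time in seconds */
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
  const size_t size = 100000000;          /* 100 000 000 bytes, as above */
  double t0;

  printf("Time to allocate %zu bytes\n", size);

  t0 = now();
  char *buf = malloc(size);               /* only reserves virtual pages */
  printf("Call to malloc(): %f [s]\n", now() - t0);

  t0 = now();
  for (size_t i = 0; i < size; i++)       /* pages are really allocated here */
    buf[i] = 0;
  printf("First Touch     : %f [s]\n", now() - t0);

  t0 = now();
  for (size_t i = 0; i < size; i++)       /* pages already mapped: much faster */
    buf[i] = 1;
  printf("Second Touch    : %f [s]\n", now() - t0);

  printf("%d\n", buf[0]);                 /* use the data so the loops are kept */
  free(buf);
  return 0;
}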

12 / 52 UMA and NUMA (Remote access) — UNIX and malloc() II

Is it possible to allocate 100 MB in 45 µs? • That would mean a memory bandwidth of about 2 TB/s −→ Hmm...
Why do the loops have different timings? • The first loop is actually allocating the memory

malloc() just informs the kernel of a future possible allocation • Memory is actually allocated in chunks of (usually) 4 KiB (a page) • The allocation is done when a page is first touched

13 / 52 UMA and NUMA (Remote access) — First touch policy I

In a multithreaded context, it is the first touch policy that is used

When a page is first touched by a thread, it is allocated on the NUMA node that runs this thread

14 / 52 UMA and NUMA (Remote access) — First touch policy II

Let’s try the following code:

%> gcc -O3 -fopenmp numacopy.c
%> ./a.out
Time to copy 80000000 bytes
  One: 0.009035 [s]
  Two: 0.017308 [s]
  Ratio: 1.915637 [-]

One: NUMA-aware allocation • Two: NUMA-unaware allocation
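numacopy.c is not reproduced either; a minimal sketch of the idea (names, sizes and the static schedules are assumptions of this sketch): for “One” the arrays are first-touched by the same parallel loop layout that later does the copy, for “Two” they are first-touched by a single thread, so all their pages land on one NUMA node:

/* numacopy.c (sketch) — compile with gcc -O3 -fopenmp */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000                        /* 10 000 000 doubles = 80 000 000 bytes */

static double timed_copy(double *src, double *dst) {
  double t0 = omp_get_wtime();
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < N; i++)
    dst[i] = src[i];
  return omp_get_wtime() - t0;
}

int main(void) {
  double *a = malloc(N * sizeof(double)), *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double)), *d = malloc(N * sizeof(double));

  /* One: NUMA-aware — first touch done by the threads that will do the copy */
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < N; i++) { a[i] = i; b[i] = 0; }

  /* Two: NUMA-unaware — first touch done by a single thread */
  for (int i = 0; i < N; i++) { c[i] = i; d[i] = 0; }

  printf("Time to copy %zu bytes\n", N * sizeof(double));
  double one = timed_copy(a, b);
  double two = timed_copy(c, d);
  printf("One: %f [s]\nTwo: %f [s]\nRatio: %f [-]\n", one, two, two / one);

  free(a); free(b); free(c); free(d);
  return 0;
}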

15 / 52 UMA and NUMA (Remote access) — First touch policy III

NUMA-unaware allocation is about two times slower • Larger NUMA interconnects may have even slower remote access • libnuma may help with controlling memory placement • numactl may help for non-NUMA-aware codes

16 / 52 UMA and NUMA (Remote access) — First touch policy IV

numactl can (among other things) allow interleaved allocation

%> gcc -O3 -fopenmp numacopy.c
%> numactl --interleave=all ./a.out
Time to copy 80000000 bytes
  One: 0.010230 [s]
  Two: 0.009739 [s]
  Ratio: 0.952014 [-]

One: NUMA-aware allocation • Two: NUMA-unaware allocation

17 / 52 UMA and NUMA (Thread scheduling) — Kernel panic ?

OK, I’m allocating memory with a thread on NUMA node i • Now, this thread has fast access to this segment on NUMA node i
What if the kernel’s scheduler moves this thread to NUMA node i + 1? • This has to be avoided!

Can be done in OpenMP: • OMP_PROC_BIND = [true | false] • Threads are not allowed to move between processors if set to true
More control with POSIX threads: • sched_setaffinity() and sched_getaffinity() • Linux-specific, declared in sched.h (see the sketch below)
libnuma can also help
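A minimal pinning sketch with sched_setaffinity() (Linux-specific; it assumes there are at least as many cores as OpenMP threads and simply maps thread number to core number):

/* pin.c (sketch) — compile with gcc -O3 -fopenmp */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(omp_get_thread_num(), &set);            /* core = thread number */
    if (sched_setaffinity(0, sizeof(set), &set))    /* 0 = calling thread   */
      perror("sched_setaffinity");

    printf("Thread %d now runs on core %d\n",
           omp_get_thread_num(), sched_getcpu());
  }
  return 0;
}

For most OpenMP codes, simply exporting OMP_PROC_BIND=true before the run is the easier option.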

18 / 52 Outline

1 UMA and NUMA: Review, Remote access, Thread scheduling

2 Cache memory: Review, False sharing

3 GPUs: What’s that?, Architecture, A first example (CUDA), Let’s get serious, Asynchronous copies

19 / 52 Cache memory (Review) — You said cache ?

Cache memory allows fast access to a part of the memory • Each core has its own cache

[Figure: Core0 accesses Memory through its own Cache0]

20 / 52 Cache memory (False sharing) — In parallel ?

What happens if two cores are modifying the same cache line ?

[Figure: Cache0 and Cache1 each hold a copy of Mem[128-135]; a write by one core marks the other core’s copy invalid]

The caches may not be coherent any more! • Synchronization is needed −→ Takes time...
False sharing: • From a software point of view, the data are not shared • From a hardware point of view, the data are shared (same cache line)

21 / 52 Cache memory (False sharing) — Reduction

Let’s try the following code:

%> gcc -fopenmp -O3 false.c
%> ./a.out
Test without padding: 0.001330 [s]
Test with padding:    0.000754 [s]
Ratio: 1.764073 [-]
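false.c is not reproduced; a minimal sketch of such a test (the slot layout, the 64-byte padding and the volatile qualifier are assumptions of this sketch): every thread repeatedly updates its own slot of a shared array, with the slots either packed into one cache line or padded one line apart:

/* false.c (sketch) — compile with gcc -fopenmp -O3 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ITER 1000000
#define PAD  8                                  /* 8 doubles = 64 bytes = one cache line */

static double run(volatile double *slot, int stride, int nthreads) {
  double t0 = omp_get_wtime();
  #pragma omp parallel num_threads(nthreads)
  {
    int me = omp_get_thread_num();
    for (int i = 0; i < ITER; i++)
      slot[me * stride] += 1.0;                 /* volatile: every update goes to memory */
  }
  return omp_get_wtime() - t0;
}

int main(void) {
  int nthreads = omp_get_max_threads();
  double *slots = calloc(nthreads * PAD, sizeof(double));

  double without = run(slots, 1,   nthreads);   /* packed slots: false sharing */
  double with    = run(slots, PAD, nthreads);   /* padded slots: one line each */

  printf("Test without padding: %f [s]\n", without);
  printf("Test with padding:    %f [s]\n", with);
  printf("Ratio: %f [-]\n", without / with);

  free(slots);
  return 0;
}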

22 / 52 Cache memory (False sharing) — High level cache

Synchronization can be achieved through a shared higher level of cache • On multi-socket motherboards, synchronization must go through RAM

23 / 52 Outline

1 UMA and NUMA: Review, Remote access, Thread scheduling

2 Cache memory: Review, False sharing

3 GPUs: What’s that?, Architecture, A first example (CUDA), Let’s get serious, Asynchronous copies

24 / 52 GPUs (What’s that ?) — Surely about graphics !

GPU: Graphics Processing Unit
Handles 3D graphics: • Projection of a 3D scene onto a 2D plane • Rasterisation of the 2D plane
Most modern devices also handle shading, reflections, etc.

Specialized hardware for intensive work • A healthy video game industry is pushing this technology −→ Relatively fair price

25 / 52 GPUs (What’s that ?) — and Malcom

https://en.wikipedia.org/wiki/File:Unreal_Engine_Comparison.jpg

© Epic Games
26 / 52 GPUs (What’s that?) — What about HPC? I

Highly parallel processors • More than 2000 processing units on an NVIDIA GeForce GTX 780 • Better thermal efficiency than a CPU • Cheap

27 / 52 GPUs (What’s that?) — What about HPC? II

Around 2007, GPUs started supporting standard floating-point arithmetic (IEEE 754)
Introduction of C extensions: • CUDA: proprietary (NVIDIA) • OpenCL: open (Khronos Group)

28 / 52 GPUs (Architecture) — CPU vs GPU

Let’s look at the chips: • On a CPU, control logic and memory dominate • On a GPU, floating-point units dominate

29 / 52 GPUs (Architecture) — Vocabulary

A device is composed of: • The GPU • The GPU’s own RAM • An interface with the motherboard (PCI-Express)
A host is composed of: • The CPU • The CPU’s own RAM • The motherboard

30 / 52 GPUs (Architecture) — Inside the GPU

A GPU is composed of streaming multiprocessors (SM(X)) • 12 on an NVIDIA GeForce GTX 780
A high-capacity RAM (from 0.5 GB to 4 GB) is shared among the SMs

An SM is composed of streaming processors (SP) • Floating-point units • 192 on an NVIDIA GeForce GTX 780 • 32 on an NVIDIA GeForce GT 430 (single precision) • 16 on an NVIDIA GeForce GT 430 (double precision) • 8 on an NVIDIA GeForce 8500 GT (single precision only)
An SM is also composed of: • Memory units • Control units • Special function units (SFUs)

31 / 52 GPUs (Architecture) — A streaming multiprocessor

[Figure: block diagram of a streaming multiprocessor — instruction cache, SPs, control and memory units, SFUs]

One instruction fetch • Within an SM, the same instruction is executed by all the SPs • Single Instruction Multiple Threads (SIMT) −→ SIMD without locality

32 / 52 GPUs (Architecture) — The big picture

[Figure: the big picture — the host (CPU and its RAM) connected to the device (GPU and its RAM) through PCI-Express]

33 / 52 GPUs (Architecture) — Host job

The host: • Allocates and deallocates memory on the device • Sends data to the device (synchronously or asynchronously) • Fetches data from the device (synchronously or asynchronously) • Sends and executes code on the device −→ This code is called a kernel −→ Calls are asynchronous
• The host needs to statically split the threads among the SMs −→ Blocks of threads −→ Blocks distributed among the SMs −→ Some kind of dynamism introduced in newer architectures?
• Good practice to have many more threads than SPs −→ Keeps the SPs busy during memory accesses −→ A kind of Simultaneous MultiThreading (SMT)

34 / 52 GPUs (Architecture) — Device job

The device can: • Distribute the blocks of threads among the SMs • Execute the kernel • Handle the host’s send/fetch requests at the same time

35 / 52 GPUs (Architecture) — More ?

I could keep talking about: • Memory limitations • Branching operations • Thread block limitations • ...
... but let’s stop here for the architecture • I think we have the basics
What is important to remember: • Many floating-point units • Little “near-core” memory • Copies between host and device • SIMT
Let’s try some code

36 / 52 GPUs (A first example (CUDA)) — hello, world!

Let’s try the vector addition c = a + b

37 / 52 GPUs (A first example (CUDA)) — main

int main(void) {
  int N = 1742;                            // Vector size
  float *a, *b, *c;

  // Allocate host memory //
  a = (float *)malloc(sizeof(float) * N);
  b = (float *)malloc(sizeof(float) * N);
  c = (float *)malloc(sizeof(float) * N);

  // Compute addition (of noise vectors) //
  vectAdd(a, b, c, N);

  // Now, we have c = a + b :-) //

  // Free host memory //
  free(a);
  free(b);
  free(c);

  return 0;
}

38 / 52 GPUs (A first example (CUDA)) — vectAdd I

void vectAdd(float *aHost, float *bHost, float *cHost, int N) {
  // Host pointers with device addresses //
  float *aDevice, *bDevice, *cDevice;

  // Allocate device memory //
  cudaMalloc((void **)&aDevice, sizeof(float) * N);
  cudaMalloc((void **)&bDevice, sizeof(float) * N);
  cudaMalloc((void **)&cDevice, sizeof(float) * N);

  // Copy aHost and bHost to the device //
  cudaMemcpy(aDevice, aHost, sizeof(float) * N, cudaMemcpyHostToDevice);
  cudaMemcpy(bDevice, bHost, sizeof(float) * N, cudaMemcpyHostToDevice);

  // See next slide //

39 / 52 GPUs (A first example (CUDA)) — vectAdd II

  // See previous slide //

  // Launch kernel //
  dim3 TpB(256);                           // Threads per block
  dim3 BpG((N + TpB.x - 1) / TpB.x);       // Number of blocks
  vectAddKernel<<<BpG, TpB>>>(aDevice, bDevice, cDevice, N);

  // Wait for the results from the device & get them //
  cudaMemcpy(cHost, cDevice, sizeof(float) * N, cudaMemcpyDeviceToHost);

  // Free device memory //
  cudaFree(aDevice);
  cudaFree(bDevice);
  cudaFree(cDevice);
}

40 / 52 GPUs (A first example (CUDA)) — vectAddKernel

__global__ void vectAddKernel(float *a, float *b, float *c, int N) {
  // Thread global ID //
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // We could have more than N threads //
  if (i < N)
    // Each thread does a part of the job //
    c[i] = a[i] + b[i];
}

41 / 52 GPUs (A first example (CUDA)) — Compiler et al.

CUDA compiler: nvcc • Compiles the kernel (vectAddKernel) and the calling function (vectAdd) • The other parts are handled by your favorite compiler
File format: .cu • Header: .h • Library: libcudart.so
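For instance, the split compilation could look like this (the file names and the CUDA library path are assumptions, and the final link can also be done with nvcc itself):

%> nvcc -O3 -c vectAdd.cu
%> gcc -O3 -c main.c
%> gcc main.o vectAdd.o -L/usr/local/cuda/lib64 -lcudart -o vectadd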

42 / 52 GPUs (Let’s get serious) — BLAS and cuBLAS

cuBLAS: a proprietary (NVIDIA) library implementing BLAS on the GPU
Let’s compare: • Two Intel Xeon E5645 with OpenBLAS (12 cores) • NVIDIA Tesla M2075 (448 SPs) • Double precision
GPU code: • Send the data to the GPU • Do BLAS3 on square matrices of increasing size • Do BLAS1 on vectors of increasing size • Get the results back
CPU code: • Do BLAS3 on square matrices of increasing size • Do BLAS1 on vectors of increasing size
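The benchmark code itself is not shown; for reference, a single double-precision BLAS3 call (DGEMM) with cuBLAS could look like this (dA, dB and dC are assumed to be n-by-n column-major matrices already allocated and filled on the device; error checking is omitted):

/* dgemm_gpu.cu (sketch) — link with -lcublas */
#include <cublas_v2.h>

void gpu_dgemm(cublasHandle_t handle, int n,
               const double *dA, const double *dB, double *dC) {
  const double alpha = 1.0, beta = 0.0;
  /* C = alpha * A * B + beta * C, leading dimension n */
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
}

The handle comes from cublasCreate(), and the host–device copies around such calls are what the “Transfers” curves below measure.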

43 / 52 GPUs (Let’s get serious) — BLAS3

[Figure: BLAS3 time in seconds (log scale) versus matrix size from 1,000 to 6,000 — GPU vs. CPU]

44 / 52 GPUs (Let’s get serious) — BLAS1

[Figure: BLAS1 time in seconds versus size from 1×10⁶ to 6×10⁶ — GPU vs. CPU]

45 / 52 GPUs (Let’s get serious) — BLAS3: Transfers

[Figure: BLAS3 on the GPU — transfer time vs. computation time (log scale) versus matrix size from 1,000 to 6,000]

46 / 52 GPUs (Let’s get serious) — BLAS1: Transfers

[Figure: BLAS1 on the GPU — transfer time vs. computation time versus size from 1×10⁶ to 6×10⁶]

47 / 52 GPUs (Asynchronous copies) — Let’s do it style !

Sometimes it is difficult to handle the host–device transfers • This usually happens when a CPU code is ported to a GPU version • We have two choices: −→ Rewrite a large part of the CPU version −→ Use task parallelism and asynchronous copies
A GPU can handle at the same time: • A host–device copy • A kernel execution • A device–host copy

The data can be split in small chunks • A copy – execute – copy pipeline can be built −→ Host–device transfers are handled by the DMA module (Direct Memory Access) −→ The CPU is no longer involved in the copies

48 / 52 GPUs (Asynchronous copies) — Task parallelism

[Figure: timelines showing the Upload, Kernel, and Download stages overlapping in time when the work is pipelined]

49 / 52 GPUs (Asynchronous copies) — Are you ?

Wait a minute... • The CPU is no longer involved in the copy mechanism • What if the (operating system) kernel swaps out the memory segment that is being copied?

We need to tell the kernel that the copied pages cannot be swapped out! • Memory pinning −→ The available system memory decreases! −→ Need to pin and unpin, or the system will run out of memory • cudaHostRegister() • cudaHostUnregister()

We also need the guarantee that our segment starts at a page boundary • posix_memalign()
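Putting the pieces together, a sketch of a chunked copy – execute – copy pipeline (the number of streams, the placeholder kernel process() and the assumption that n is a multiple of NSTREAMS are all choices of this sketch):

/* pipeline.cu (sketch) */
#include <stdlib.h>
#include <cuda_runtime.h>

#define NSTREAMS 4

__global__ void process(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;                       /* placeholder work */
}

void pipeline(size_t n) {                           /* n: multiple of NSTREAMS */
  size_t chunk = n / NSTREAMS, bytes = chunk * sizeof(float);

  float *host;
  posix_memalign((void **)&host, 4096, n * sizeof(float));            /* page-aligned */
  cudaHostRegister(host, n * sizeof(float), cudaHostRegisterDefault); /* pinned       */

  float *dev;
  cudaMalloc((void **)&dev, n * sizeof(float));

  cudaStream_t s[NSTREAMS];
  for (int k = 0; k < NSTREAMS; k++) cudaStreamCreate(&s[k]);

  for (int k = 0; k < NSTREAMS; k++) {              /* copies and kernels overlap */
    size_t off = k * chunk;
    cudaMemcpyAsync(dev + off, host + off, bytes, cudaMemcpyHostToDevice, s[k]);
    process<<<(unsigned)((chunk + 255) / 256), 256, 0, s[k]>>>(dev + off, (int)chunk);
    cudaMemcpyAsync(host + off, dev + off, bytes, cudaMemcpyDeviceToHost, s[k]);
  }
  cudaDeviceSynchronize();                          /* wait for the whole pipeline */

  for (int k = 0; k < NSTREAMS; k++) cudaStreamDestroy(s[k]);
  cudaFree(dev);
  cudaHostUnregister(host);                         /* unpin before freeing */
  free(host);
}

How much really overlaps depends on the device: cards with a single copy engine can only overlap one copy direction with computation.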

50 / 52 GPUs (Asynchronous copies) — Did it work ? I

We ported a CPU-optimized Discontinuous Galerkin code to the GPU • Only the most time-consuming function was rewritten • Task parallelism was used

[Figure: execution time in seconds versus the number of iterations (100–350) for the GPU and CPU versions]

51 / 52 GPUs (Asynchronous copies) — Did it work? II

The two versions have almost the same execution times
The GPU version frees the CPU • The CPU can do other things: hybrid programming
The CPU cost around 100 € • The GPU cost around 30 €

52 / 52 game over: you win !

%> shutdown -h now "merry christmas and happy new year"