Introduction to GPU/Parallel Computing

Ioannis E. Venetis, University of Patras

1 Introduction to GPU/Parallel Computing www.prace-ri.eu Introduction to High Performance Systems

2 Introduction to GPU/Parallel Computing www.prace-ri.eu Wait, what?

Aren’t we here to talk about GPUs? And how to program them with CUDA? Yes, but first we need to understand their place and their purpose in modern High Performance Systems. This will make it clear when it is beneficial to use them.

3 Introduction to GPU/Parallel Computing www.prace-ri.eu Top 500 (June 2017)

Rank 1: Sunway TaihuLight, National Supercomputing Center in Wuxi, China (NRCPC). Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway interconnect. 10.649.600 CPU cores, no accelerator cores, Rmax 93.014,6 TFlop/s, Rpeak 125.435,9 TFlop/s, power 15.371 kW.
Rank 2: Tianhe-2 (MilkyWay-2), National Super Computer Center in Guangzhou, China (NUDT). TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P. 3.120.000 CPU cores, 2.736.000 accelerator cores, Rmax 33.862,7 TFlop/s, Rpeak 54.902,4 TFlop/s, power 17.808 kW.
Rank 3: Piz Daint, Swiss National Supercomputing Centre (CSCS) (Cray Inc.). Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100. 361.760 CPU cores, 297.920 accelerator cores, Rmax 19.590,0 TFlop/s, Rpeak 25.326,3 TFlop/s, power 2.272 kW.
Rank 4: Titan, DOE/SC/Oak Ridge National Laboratory, United States (Cray Inc.). Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x. 560.640 CPU cores, 261.632 accelerator cores, Rmax 17.590,0 TFlop/s, Rpeak 27.112,5 TFlop/s, power 8.209 kW.
Rank 5: Sequoia, DOE/NNSA/LLNL, United States (IBM). BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect. 1.572.864 CPU cores, no accelerator cores, Rmax 17.173,2 TFlop/s, Rpeak 20.132,7 TFlop/s, power 7.890 kW.

4 Introduction to GPU/Parallel Computing www.prace-ri.eu How do we build an HPC system?

Limitations in technology: it is impossible to fit all the computational resources we require into a single chip.

We have to build our system hierarchically

5 Introduction to GPU/Parallel Computing www.prace-ri.eu

All modern processors are “multi-core”: multiple, independent cores are placed on the same chip. They might also support Simultaneous Multi-Threading (SMT): every core can execute more than one instruction stream (threads), which however share most of the functional units of the core. This is the 1st level of parallelism. Typically 4, 8 or 16 cores.
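As a small illustration (a minimal sketch added here, not part of the original slides), an OpenMP program can query how many processors the runtime sees on such a multi-core chip and how many threads it would use by default; omp_get_num_procs() and omp_get_max_threads() are standard OpenMP runtime calls:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Number of processors (cores / hardware threads) available to the program */
    printf("Available processors: %d\n", omp_get_num_procs());

    /* Upper bound on the number of threads used in a parallel region */
    printf("Maximum OpenMP threads: %d\n", omp_get_max_threads());

    return 0;
}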

6 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card

One or more processors are placed on a compute card. Typically, a single compute card operates as a shared-memory system. It usually contains 1, 2 or 4 processors.

7 Introduction to GPU/Parallel Computing www.prace-ri.eu Node

Multiple compute cards are placed in a node. There is no shared memory among compute cards. The interconnection network among compute cards can be implemented in many different ways. Usually there are one or more additional compute cards dedicated to managing communication with the rest of the nodes.

8 Introduction to GPU/Parallel Computing www.prace-ri.eu Rack

Multiple nodes are placed in a rack. There is no shared memory among the nodes of a rack. The interconnection network among nodes can be implemented in many different ways, not necessarily in the same way that compute cards are connected within a single node.

9 Introduction to GPU/Parallel Computing www.prace-ri.eu The whole system

Multiple racks are connected. Typically, there are dedicated nodes that handle I/O.

10 Introduction to GPU/Parallel Computing www.prace-ri.eu Hierarchical parallelism

(Figure: the levels of hierarchical parallelism, illustrated on an IBM BlueGene/P system.)

11 Introduction to GPU/Parallel Computing www.prace-ri.eu Examples of modern High Performance Systems

12 Introduction to GPU/Parallel Computing www.prace-ri.eu Sunway TaihuLight (No 1, Top 500 list, June 2017)

Computing node: the basic element of the architecture. 256 computing nodes form a super node. Super nodes are connected through the central switch network.

Sources of images:
• Fu, H., Liao, J., Yang, J. et al., “The Sunway TaihuLight: system and applications”, Sci. China Inf. Sci. (2016) 59: 072001, doi:10.1007/s11432-016-5588-7
• Dongarra, J., “Report on the Sunway TaihuLight System”, Tech Report UT-EECS-16-742, June 2016.

13 Introduction to GPU/Parallel Computing www.prace-ri.eu Processor

SW26010: one of the few systems that rely on a custom-made processor, designed by the Shanghai High Performance IC Design Center. A characteristic example of a heterogeneous many-core processor, composed of 2 different types of cores.

14 Introduction to GPU/Parallel Computing www.prace-ri.eu Processor

Contains 4 Core Groups (CGs), connected through a Network on Chip (NoC). Each CG is composed of: 1 Management Processing Element (MPE) and 64 Computing Processing Elements (CPEs), placed on an 8x8 grid.

15 Introduction to GPU/Parallel Computing www.prace-ri.eu Processor

Each CG has a distinct address space, connected to the MPE and the CPEs through a Memory Controller (MC). Each processor connects to the rest of the system through the System Interface (SI).

16 Introduction to GPU/Parallel Computing www.prace-ri.eu The two types of cores

Management Processing Element (MPE)
Complete 64-bit RISC core
Executes instructions in user and system modes, handles interrupts, memory management, superscalar, out-of-order execution, …
Performs all management and communication tasks
Computing Processing Element (CPE)
Reduced-capability 64-bit RISC core
Executes instructions only in user mode, does not handle interrupts, …
Objectives of the design: maximum overall performance, reduced design complexity
The CPEs are placed on an 8x8 grid, which allows for fast exchange of data directly between registers

17 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card

2 processors

18 Introduction to GPU/Parallel Computing www.prace-ri.eu Node

4 compute cards, 2 on each side

19 Introduction to GPU/Parallel Computing www.prace-ri.eu Supernode

32 nodes (256 processors)

20 Introduction to GPU/Parallel Computing www.prace-ri.eu Cabinet

4 supernodes (1024 processors)

21 Introduction to GPU/Parallel Computing www.prace-ri.eu Sunway TaihuLight

40 cabinets

22 Introduction to GPU/Parallel Computing www.prace-ri.eu Overview

Cores: 10.649.600
Peak performance: 125,436 PFlops
Linpack performance: 93,015 PFlops
CPU frequency: 1,45 GHz
Peak performance of a CPU: 3,06 TFlops
Total memory: 1310,72 TB
Total memory bandwidth: 5591,5 TB/s
Network link bandwidth: 16 GB/s
Network bisection bandwidth: 70 TB/s
Network diameter: 7
Total storage: 20 PB
Total I/O bandwidth: 288 GB/s
Power consumption when running the Linpack test: 15,371 MW
Performance/power ratio: 6,05 GFlops/W
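As a quick check of the last line (added here for illustration; the slide uses a comma as the decimal separator), the performance/power ratio follows directly from the Linpack performance and the power consumption:

\[
\frac{93.015 \times 10^{6}\ \mathrm{GFlops}}{15.371 \times 10^{6}\ \mathrm{W}} \approx 6.05\ \mathrm{GFlops/W}
\]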

23 Introduction to GPU/Parallel Computing www.prace-ri.eu Tianhe-2 (No 2, Top 500 list, June 2017)

In contrast to Sunway TaihuLight, it has typical/commercial processors: Intel Xeon E5-2692, 12 cores, 2.2 GHz. To achieve high performance it uses coprocessors: Intel Xeon Phi 31S1P, 57 cores, 4-way SMT, 1.1 GHz, PCI-E 2.0 interconnect with the host system.

24 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card

Contains 2 processors and 3 Xeon Phi

25 Introduction to GPU/Parallel Computing www.prace-ri.eu Node

Contains 2 compute cards Special interconnection

26 Introduction to GPU/Parallel Computing www.prace-ri.eu Frame

16 nodes

27 Introduction to GPU/Parallel Computing www.prace-ri.eu Rack

4 frames

28 Introduction to GPU/Parallel Computing www.prace-ri.eu Tianhe-2

125 racks

29 Introduction to GPU/Parallel Computing www.prace-ri.eu Overview

Cores: 3.120.000
Peak performance: 54,902 PFlops
Linpack performance: 33,863 PFlops
CPU frequency: 2,2 GHz / 1,1 GHz
Total memory: 1.404 TB
Total storage: 12,4 PB
Total I/O bandwidth: 100 GB/s
Power consumption when running Linpack: 17,808 MW
Performance/power ratio: 1,9 GFlops/W

30 Introduction to GPU/Parallel Computing www.prace-ri.eu Titan (No 4, Top 500 list, June 2017)

Also consists of typical/commercial processors: AMD Opteron 6274, 16 cores, 2.2 GHz. To achieve high performance it uses coprocessors: NVIDIA K20x, 2688 cores, 732 MHz, PCI-E 2.0 interconnect with the host system.

31 Introduction to GPU/Parallel Computing www.prace-ri.eu Compute card / Node

Contains 1 processor + 1 GPU. 2 nodes share the router of the interconnection network.

(Figure: two nodes attached to one network router; X, Y, Z denote the directions of the interconnection network.)

32 Introduction to GPU/Parallel Computing www.prace-ri.eu Blade / Cabinet

Each blade contains 4 nodes Each cabinet contains 24 blades

33 Introduction to GPU/Parallel Computing www.prace-ri.eu Titan

200 cabinets

34 Introduction to GPU/Parallel Computing www.prace-ri.eu Overview

Cores: 560.640
Peak performance: 27,113 PFlops
Linpack performance: 17,590 PFlops
CPU frequency: 2,2 GHz / 2,2 GHz
Total memory: 710 TB
Total storage: 40 PB
Total I/O bandwidth: 1,4 TB/s
Power consumption when running Linpack: 8,209 MW
Performance/power ratio: 2,1 GFlops/W

35 Introduction to GPU/Parallel Computing www.prace-ri.eu Comparison

Metric | Sunway TaihuLight | Tianhe-2 | Titan
Cores | 10.649.600 | 3.120.000 | 560.640
Peak performance | 125,436 PFlops | 54,902 PFlops | 27,113 PFlops
Linpack performance | 93,015 PFlops | 33,863 PFlops | 17,590 PFlops
CPU frequency | 1,45 GHz | 2,2 GHz / 1,1 GHz | 2,2 GHz / 2,2 GHz
Total memory | 1310,72 TB | 1.404 TB | 710 TB
Total storage | 20 PB | 12,4 PB | 40 PB
Total I/O bandwidth | 288 GB/s | 100 GB/s | 1,4 TB/s
Power consumption for Linpack | 15,371 MW | 17,808 MW | 8,209 MW
Performance/power ratio | 6,05 GFlops/W | 1,9 GFlops/W | 2,1 GFlops/W

36 Introduction to GPU/Parallel Computing www.prace-ri.eu Power consumption

Average daily power consumption per household: 11 kWh (http://www.cres.gr/pepesec/apotelesmata.html, a small study, but it gives a picture). Tianhe-2: 17.808 kW * 24 hours = 427.392 kWh. It consumes in one day as much as 38.854 households! If on average 3 people live in a household: 38.854 * 3 = 116.562 people. Volos, the 6th largest city in Greece, has about 120.000 citizens (2011 census).

37 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming High Performance Systems

38 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming High Performance Systems

As we have outlined previously, HPC systems are composed of different parts with different architectural features: shared memory (processor, compute card), distributed memory (node, rack, complete system), coprocessors.

How to exploit all computational resources? A different programming model has to be used for each level in the hierarchy

39 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming level

(Figure: IBM BlueGene/P hierarchy, with the distributed-memory level highlighted.)

Distributed memory 40 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming for distributed memory

Any programming model that targets distributed-memory systems: MPI, …

41 Introduction to GPU/Parallel Computing www.prace-ri.eu MPI (Message Passing Interface)

A standard, not a specific implementation. A library to pass messages among processes. Layered design: at a high level it provides an API to the programmer, at a low level it takes care of the communication through the interconnection network. Portable across different distributed-memory systems. It supports C, C++, Fortran 77 and Fortran 90.

42 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with MPI (1/3) (Just to give an intuition of its use…)

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank, p, i, num, size;
    int A[100], B[100], C[100], loc_A[100], loc_B[100], loc_C[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Only process 0 reads the input vectors */
    if (my_rank == 0) {
        printf("Enter size of vectors: ");
        scanf("%d", &size);
        printf("Enter the values of the %d elements of vector A: ", size);
        for (i = 0; i < size; i++) {
            scanf("%d", &A[i]);
        }
        printf("Enter the values of the %d elements of vector B: ", size);
        for (i = 0; i < size; i++) {
            scanf("%d", &B[i]);
        }
    }

43 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with MPI (2/3)

    /* Process 0 broadcasts the vector size to all processes */
    MPI_Bcast(&size, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Number of elements handled by each process (assumes size is divisible by p) */
    num = size / p;

    /* Distribute equal chunks of A and B to all processes */
    MPI_Scatter(A, num, MPI_INT, loc_A, num, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(B, num, MPI_INT, loc_B, num, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each process adds its local chunks */
    for (i = 0; i < num; i++) {
        loc_C[i] = loc_A[i] + loc_B[i];
    }

    printf("\nLocal results for process %d:\n", my_rank);
    for (i = 0; i < num; i++) {
        printf("%d ", loc_C[i]);
    }
    printf("\n\n");

44 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with MPI (3/3)

    /* Gather the partial results back to process 0 */
    MPI_Gather(loc_C, num, MPI_INT, C, num, MPI_INT, 0, MPI_COMM_WORLD);

    if (my_rank == 0) {
        printf("\nFinal result:\n");
        for (i = 0; i < size; i++) {
            printf("%d ", C[i]);
        }
        printf("\n\n");
    }

    MPI_Finalize();

    return(0);
}

45 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming level

(Figure: IBM BlueGene/P hierarchy, with the shared-memory level highlighted.)

Shared memory 46 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming for shared memory

Any programming model that targets shared-memory systems:
OpenMP (most used)
OpenCL (remember this!)
Threading libraries (POSIX Threads, …)
Cilk Plus
…

47 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP

Programming model for parallel computation, based on the idea of directives. The programmer marks the points in the program that have to be parallelized and specifies how they should be parallelized. The compiler creates and integrates the parallelism into the executable file.

48 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP

A standard, not a specific implementation. It provides an API to program shared-memory systems through: directives to the compiler, function calls to a run-time library, environment variables. It supports C, C++ and Fortran. Portable across different shared-memory systems. It supports different patterns of parallelism.

49 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (1/2) (Just to give an intuition of its use…)

#include <stdio.h>
#include <stdlib.h>

int *A, *B, *C;
int N;

int main(int argc, char *argv[])
{
    int i;

    if (argc != 2) {
        printf("Provide the problem size.\n");
        exit(0);
    }

    N = atoi(argv[1]);

    A = (int *)malloc(N * sizeof(int));
    B = (int *)malloc(N * sizeof(int));
    C = (int *)malloc(N * sizeof(int));
    if ((A == NULL) || (B == NULL) || (C == NULL)) {
        printf("Could not allocate memory.\n");
        exit(0);
    }

50 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (2/2)

    /*
     * Here A and B should be initialized
     */

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
    }

    printf("C = ");
    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }

    return(0);
}   /* main() ends here */

51 Introduction to GPU/Parallel Computing www.prace-ri.eu Advantages

High level of abstraction: no threads are (directly) involved and no details of the hardware.
Single source code: the same source code can be used for serial or parallel compilation/execution.
Performance: acceptable performance compared to highly optimized code with threads.
Scalability: relatively easy to achieve for different systems.
Incremental parallelization: porting and optimization of parts of the code, depending on the available resources and profiling; no rewriting/reorganization of code is required (as with threads); shorter time for parallelization, with fewer errors.

52 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP 4.5: Parallelization levels

Cluster: group of nodes that communicate through a fast interconnection network
Coprocessor/Accelerator: processing units that attach to the main processing unit through a special interconnect (the level targeted by OpenMP for coprocessors)
Node: group of processors that communicate through shared memory
Socket: group of cores that communicate through shared memory and/or shared cache memory
Core: group of functional units that communicate through registers
Hyper-Threads: group of hardware-level threads that share the functional units of a core
Superscalar: group of instructions that share functional units
Pipeline: sequence of instructions that share functional units
Vector: a single instruction that uses multiple functional units

53 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming level

Coprocessor

(Figure: IBM BlueGene/P hierarchy, with the coprocessor level highlighted.)

54 Introduction to GPU/Parallel Computing www.prace-ri.eu Programming for a coprocessor

Appropriate programming model for the coprocessor used:
NVIDIA GPU: CUDA, OpenACC, OpenMP, OpenCL
Intel Xeon Phi: OpenMP, Cilk Plus, OpenCL

55 Introduction to GPU/Parallel Computing www.prace-ri.eu CUDA

Programming model for NVIDIA GPUs. An extension to the C/C++ programming languages: new keywords, new predefined structs, functions that are called from the main program (kernels), predefined macros. General-purpose programming.

56 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with CUDA (1/3) (Foretaste…)

__global__ void vecAdd(float *A_d, float *B_d, float *C_d, int n)
{
    /* Global index of the thread: one thread per vector element */
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < n) {
        C_d[i] = A_d[i] + B_d[i];
    }
}

57 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with CUDA (2/3)

#include <stdlib.h>    /* malloc(), rand() */

int main(int argc, char *argv[])
{
    const unsigned int n = 2048;
    float *A_h, *B_h, *C_h;

    A_h = (float *)malloc(n * sizeof(float));
    for (unsigned int i = 0; i < n; i++) {
        A_h[i] = (rand() % 100) / 100.0;
    }

    B_h = (float *)malloc(n * sizeof(float));
    for (unsigned int i = 0; i < n; i++) {
        B_h[i] = (rand() % 100) / 100.0;
    }

    C_h = (float *)malloc(n * sizeof(float));

    float *A_d, *B_d, *C_d;

    /* Allocate the vectors in the global memory of the GPU */
    cudaMalloc((void**)&A_d, n * sizeof(float));
    cudaMalloc((void**)&B_d, n * sizeof(float));
    cudaMalloc((void**)&C_d, n * sizeof(float));

58 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with CUDA (3/3)

    /* Copy the input vectors from host to device memory */
    cudaMemcpy(A_d, A_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, n * sizeof(float), cudaMemcpyHostToDevice);

    /* One thread per element; enough blocks to cover all n elements */
    const unsigned int THREADS_PER_BLOCK = 512;
    const unsigned int numBlocks = (n - 1) / THREADS_PER_BLOCK + 1;
    dim3 gridDim(numBlocks, 1, 1), blockDim(THREADS_PER_BLOCK, 1, 1);

    vecAdd<<< gridDim, blockDim >>>(A_d, B_d, C_d, n);

    /* Copy the result back from device to host memory */
    cudaMemcpy(C_h, C_d, sizeof(float) * n, cudaMemcpyDeviceToHost);

    free(A_h); free(B_h); free(C_h);

    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);

    return(0);
}

59 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenACC

Programming model based on directives, comparable to OpenMP. Targets coprocessors. It started as a branch of OpenMP with the aim of being merged back; this never happened…

60 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenACC (1/2) (Just to give an intuition of its use…)

#include <stdio.h>
#include <stdlib.h>

int *A, *B, *C;
int N;

int main(int argc, char *argv[])
{
    int i;

    if (argc != 2) {
        printf("Provide the problem size.\n");
        exit(0);
    }

    N = atoi(argv[1]);

    A = (int *)malloc(N * sizeof(int));
    B = (int *)malloc(N * sizeof(int));
    C = (int *)malloc(N * sizeof(int));
    if ((A == NULL) || (B == NULL) || (C == NULL)) {
        printf("Could not allocate memory.\n");
        exit(0);
    }

61 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenACC (2/2)

    /*
     * Here A and B should be initialized
     */

    /* Offload the loop; copy A and B to the device and C back to the host */
    #pragma acc kernels copyin(A[0:N], B[0:N]), copyout(C[0:N])
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    printf("C = ");
    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }

    return(0);
}   /* main() ends here */

62 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenMP

Starting with version 4.0, it provides directives to support offloading of computations and data to coprocessors

63 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (1/2) (Just to give an intuition of its use…)

#include <stdio.h>
#include <stdlib.h>

int *A, *B, *C;
int N;

int main(int argc, char *argv[])
{
    int i;

    if (argc != 2) {
        printf("Provide the problem size.\n");
        exit(0);
    }

    N = atoi(argv[1]);

    A = (int *)malloc(N * sizeof(int));
    B = (int *)malloc(N * sizeof(int));
    C = (int *)malloc(N * sizeof(int));
    if ((A == NULL) || (B == NULL) || (C == NULL)) {
        printf("Could not allocate memory.\n");
        exit(0);
    }

64 Introduction to GPU/Parallel Computing www.prace-ri.eu Vector addition with OpenMP (2/2)

    /*
     * Here A and B should be initialized
     */

    /* Offload to the device: map N, A and B to it and map C back from it */
    #pragma omp target map(to: N, A[0:N], B[0:N]) map(from: C[0:N])
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    printf("C = ");
    for (i = 0; i < N; i++) {
        printf("%d ", C[i]);
    }

    return(0);
}   /* main() ends here */

65 Introduction to GPU/Parallel Computing www.prace-ri.eu OpenCL

Programming model for executing code on heterogeneous computational resources: CPU, GPU, FPGA, DSP, … Based on function calls defined by the standard. Many similarities to CUDA, but programming is a bit more complicated.
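To give a flavour of this (a minimal sketch added here for illustration, not part of the original slides; all error checking is omitted and the variable names are made up), vector addition in OpenCL needs explicit setup of platform, device, context, queue, buffers and kernel:

#include <stdio.h>
#include <CL/cl.h>

/* The kernel is plain text, compiled at run time for the selected device */
const char *src =
    "__kernel void vecAdd(__global const float *A, __global const float *B, \n"
    "                     __global float *C, int n)                          \n"
    "{                                                                       \n"
    "    int i = get_global_id(0);                                           \n"
    "    if (i < n) { C[i] = A[i] + B[i]; }                                  \n"
    "}                                                                       \n";

int main(void)
{
    const int n = 2048;
    float A[2048], B[2048], C[2048];
    for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2 * i; }

    /* 1. Select a platform and a device (here: the first GPU found) */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* 2. Create a context and a command queue for that device */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 3. Create device buffers; A and B are copied from the host arrays */
    cl_mem A_d = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), A, NULL);
    cl_mem B_d = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), B, NULL);
    cl_mem C_d = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                n * sizeof(float), NULL, NULL);

    /* 4. Build the kernel and set its arguments */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vecAdd", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &A_d);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &B_d);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &C_d);
    clSetKernelArg(kernel, 3, sizeof(int), &n);

    /* 5. Launch n work-items and read the result back */
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, C_d, CL_TRUE, 0, n * sizeof(float), C, 0, NULL, NULL);

    printf("C[10] = %f\n", C[10]);
    return 0;
}

Compared with the CUDA version shown earlier, the kernel itself is almost identical; the extra complexity is in the host-side setup.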

66 Introduction to GPU/Parallel Computing www.prace-ri.eu Software development issues on HPC environments

67 Introduction to GPU/Parallel Computing www.prace-ri.eu HPC Programmers

Most of them are actually not computer scientists or computer engineers. Their aim is computational science: the production of scientific results in their respective field using computers. Almost none of them has knowledge of software engineering issues.

68 Introduction to GPU/Parallel Computing www.prace-ri.eu The HPC community

Victor R. Basili, Jeffrey C. Carver, Daniela Cruzes, Lorin M. Hochstein, Jeffrey K. Hollingsworth, Forrest Shull, Marvin V. Zelkowitz, "Understanding the High-Performance-Computing Community: A Software Engineer's Perspective", IEEE Software, Vol. 25(4), pp. 29-36, July/August 2008.

69 Introduction to GPU/Parallel Computing www.prace-ri.eu HPC Programmers thoughts

The aim is research in the respective field, not research in programming.
FLOPS do not measure how good the research in the respective field is; the quality of the research matters, not the performance of the applications.
Writing code that achieves good performance on an HPC system is only one component towards achieving the goals, not a goal in itself.
There is not always a willingness to achieve better performance, especially if this leads to problems in code maintenance.

70 Introduction to GPU/Parallel Computing www.prace-ri.eu Technologies

There is skepticism about the use of new technologies: C and Fortran are used traditionally, as are OpenMP and MPI. If we take the step to use a new technology and that technology does not gain wide acceptance, what happens? The key is for well-established and new technologies to coexist.
Sharing of computational resources: many users run their applications simultaneously on a single system; submission of jobs is performed through a batch system, which makes debugging difficult; remote access also causes difficulties.

71 Introduction to GPU/Parallel Computing www.prace-ri.eu Other approaches and issues

Achieving better performance: changing the method instead of optimizing the code. Verification of results: comparison with data acquired from experiments.

72 Introduction to GPU/Parallel Computing www.prace-ri.eu Developments in technology

Everything already mentioned is intensified by developments in technology. Different technologies have been used on HPC systems over time: vector machines, multiprocessors, multi-core, coprocessors, clusters, …

73 Introduction to GPU/Parallel Computing www.prace-ri.eu An example that leads to a shift in technology in HPC systems

(Figure: percentage of improvement per year and per core for processors and memory; the original plot shows rates of +52%, +25%, +20% and +7% in different periods. Copyright © 2011 Elsevier Inc. All rights reserved.)

“Memory wall”: the speed of memory improves at a much lower rate than the speed of processors.

74 Introduction to GPU/Parallel Computing www.prace-ri.eu CPU and GPU: A different approach in design

(Figure: side-by-side chip layouts. The CPU has a few latency-oriented cores, each with its own registers, control logic, SIMD unit and local cache. The GPU has many throughput-oriented compute units, each with registers, SIMD units, hardware threading and cache/local memory.)

75 Introduction to GPU/Parallel Computing www.prace-ri.eu Latency vs. Throughput

Latency: the time between the initiation of a task and the moment its results become available.
Throughput: the number of tasks that complete per unit of time.

CPU: low latency, low throughput. GPU: high latency, high throughput.

76 Introduction to GPU/Parallel Computing www.prace-ri.eu CPU: Designed for high latency accesses to the memory

Large cache memories: they convert slow accesses to main memory into fast accesses to the cache memory.
Sophisticated flow control: branch prediction to reduce delays; data forwarding within the pipeline to reduce delays.
Powerful ALUs: faster execution per instruction.

(Figure: simplified CPU layout with control logic, a few ALUs, a large cache and DRAM.)

77 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU: Design for achieving high throughput

Small cache memories: mainly for instructions.
Simple flow control: no branch prediction, no data forwarding.
Power-efficient ALUs: many of them, with a longer execution time per instruction but many stages in the pipeline.
Requires a different approach to programming: a large number of threads is needed to hide the long latency of accesses to main memory.

(Figure: simplified GPU layout with many small ALUs and DRAM.)

78 Introduction to GPU/Parallel Computing www.prace-ri.eu The most efficient applications use both the CPU and the GPU

CPUs for serial execution of the parts of the program where latency is important: CPUs can be >10x faster than GPUs for serial code.
GPUs for parallel execution of the parts of the program where throughput is important: GPUs can be >10x faster than CPUs for parallel code.

79 Introduction to GPU/Parallel Computing www.prace-ri.eu Hybrid programming is becoming mainstream

Data Intensive Analytics, Scientific Simulation, Engineering Simulation, Medical Imaging, Financial Analysis, Electronic Design Automation, Digital Audio Processing, Digital Video Processing, Computer Vision, Biomedical Informatics, Ray Tracing Rendering, Statistical Modeling, Interactive Physics, Numerical Methods

80 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU Gems

A series of three books on highly efficient parallelization of applications for the GPU. There were 480 submissions to GPU Computing Gems; 150 articles have been included in the books. Freely available: https://developer.nvidia.com/gpugems/GPUGems/gpugems_pref01.html

81 Introduction to GPU/Parallel Computing www.prace-ri.eu A catch - Scalability

(Figure: execution time versus number of compute units (0-80) for two scalable applications, “Scalable 1” and “Scalable 2”, and one application that is not scalable.)

82 Introduction to GPU/Parallel Computing www.prace-ri.eu Complexity of algorithms

(Figure: number of operations versus data size (1000-9000) for algorithms of linear, n*log(n) and quadratic complexity.)

83 Introduction to GPU/Parallel Computing www.prace-ri.eu Scalability of data

Any algorithmic complexity other than linear is not scalable: execution time increases dramatically as the amount of data increases, even for algorithms of n*log(n) complexity. Parallel algorithms are meant to process large amounts of data. A serial algorithm with linear complexity will eventually be faster than a parallel algorithm with n*log(n) complexity: the log(n) term grows faster than the amount of parallelism that an HPC system can provide, so for large enough data the parallel algorithm becomes slower than the serial one, as sketched below.
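A rough way to see this (an illustrative argument added here, with P the number of compute units and c1, c2 constant per-operation costs):

\[
T_{serial}(n) = c_1\, n,
\qquad
T_{parallel}(n) = c_2\, \frac{n \log n}{P},
\qquad
T_{parallel}(n) < T_{serial}(n) \iff \log n < \frac{c_1}{c_2}\, P .
\]

For a fixed machine, P is bounded, while log n keeps growing with the problem size, so beyond some n the linear serial algorithm wins.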

84 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU architecture

85 Introduction to GPU/Parallel Computing www.prace-ri.eu Design principles

GPUs have different design principles than CPUs: they favor throughput instead of latency.

What are the architectural characteristics that contribute towards this design philosophy?

86 Introduction to GPU/Parallel Computing www.prace-ri.eu Streaming Processor (SP)

The basic computational unit, also known as a “CUDA core”. Composed of 1 Arithmetic Logic Unit (ALU) and 1 Floating Point Unit (FPU), both fully pipelined, single-issue, in-order. It does not include fetch, decode and dispatch units; multiple SPs share these units at the next level.

87 Introduction to GPU/Parallel Computing www.prace-ri.eu Streaming Multiprocessor (SM)

Composed of a number of SPs, supported by additional computational and management units. The number of SPs that constitute an SM differs among generations of GPUs:

Generation / Name of GPU: SPs per SM
1st generation / Tesla: 8
2nd generation / Fermi: 32
3rd generation / Kepler: 192
4th generation / Maxwell: 128
5th generation / Pascal: 64
6th generation / Volta: 64

88 Introduction to GPU/Parallel Computing www.prace-ri.eu Tesla SM

Multithreaded instruction unit: G80 up to 768 active threads, GT200 up to 1024 active threads; scheduling of threads in hardware
8 SPs: IEEE 754 32-bit floating point; 32-bit and 64-bit integer arithmetic
16K 32-bit registers, shared among the SPs
2 SFUs (Special Function Units): sin, cos, log, exp
16KB shared memory; GT200: 48KB
GT200 only: 1 DPU (Double Precision Unit): IEEE 754 64-bit floating point, fused multiply-add assembly instruction

89 Introduction to GPU/Parallel Computing www.prace-ri.eu Thread execution

Threads created in an application that will execute on a GPU are divided into groups called “warps”. All GPU generations so far have a warp size of 32 threads; however, the size of a warp could change in future GPU generations. All threads in the same warp execute the same instruction (typically on different data): the SIMT (Single Instruction, Multiple Threads) model of execution, illustrated below.
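As a small illustration (a sketch added here, not part of the original slides), consider the following CUDA kernel; the inner branch makes the two halves of each warp follow different paths, so the hardware executes the two paths one after the other (warp divergence), with part of the warp idle each time:

__global__ void simtExample(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < n) {
        /* All 32 threads of a warp issue the same instruction each cycle.  */
        /* This condition splits every warp: threads 0-15 of the warp take  */
        /* one branch and threads 16-31 the other, so the two branches are  */
        /* executed serially, one after the other.                          */
        if ((threadIdx.x % 32) < 16) {
            data[i] = data[i] * 2.0f;
        } else {
            data[i] = data[i] + 1.0f;
        }
    }
}

Code in which all threads of a warp take the same branch avoids this serialization.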

90 Introduction to GPU/Parallel Computing www.prace-ri.eu Problem

How are the 32 threads of a warp executed on only 8 SPs? The threads of a warp are further divided into 4 smaller groups of 8 threads each; 4 clock cycles are required for the current instruction to be executed by all threads of the warp.

91 Introduction to GPU/Parallel Computing www.prace-ri.eu Kepler SM (SMX) - ARIS

Up to 2048 active threads
192 SPs: IEEE 754 32-bit floating point; 32-bit and 64-bit integer arithmetic
64K or 128K 32-bit registers, shared among the SPs
4 warp schedulers
32 Load/Store units: allow the calculation of 32 memory addresses for reading/writing data
32 SFUs (Special Function Units)
48KB or 112KB shared memory
64 DPUs (Double Precision Units): IEEE 754 64-bit floating point, fused multiply-add assembly instruction

92 Introduction to GPU/Parallel Computing www.prace-ri.eu Thread execution

192 SPs / 32 = 6 warps, but there are only 4 warp schedulers; how are all SPs used? Each warp scheduler is dual issue: for each warp that is selected for execution, 2 instructions are fetched. They can be executed at the same time, as long as there are no data dependencies; if there are data dependencies, they are executed serially. The other functional units are used simultaneously with the SPs, so 8 warps can truly execute in parallel.

93 Introduction to GPU/Parallel Computing www.prace-ri.eu Volta SM - Latest

Up to 2048 active threads
64 SPs, grouped into 4 processing blocks; all resources are equally shared among the processing blocks
64K 32-bit registers
4 warp schedulers, with 1 dispatch unit per warp scheduler
16 Load/Store units: allow the calculation of 16 memory addresses for reading/writing data
32 SFUs
8 Tensor Cores, targeted towards Deep Learning
Up to 96KB shared memory (configurable)

94 Introduction to GPU/Parallel Computing www.prace-ri.eu Texture/Processor Cluster (TPC)

Multiple SMs are grouped into TPCs. The number of SMs that constitute a TPC depends on the specific GPU. In more recent architectures (Maxwell, Pascal) TPCs have been replaced by Graphics Processing Clusters (GPCs). They also contain other shared resources.

95 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU

Multiple TPCs are placed on a chip and constitute a GPU. G80 (Tesla): 8 TPCs, a total of 128 SPs.

96 Introduction to GPU/Parallel Computing www.prace-ri.eu GT200 (Tesla)

10 TPC * 3 SM/TPC * 8 SP/SM = 240 SPs

97 Introduction to GPU/Parallel Computing www.prace-ri.eu GK100 (Kepler)

13 SMs * 192 SP/SM = 2496 SPs

98 Introduction to GPU/Parallel Computing www.prace-ri.eu GV100 (Volta)

6 GPC * 14 SM/GPC * 64 SP/SM = 5376 SPs

99 Introduction to GPU/Parallel Computing www.prace-ri.eu Memory hierarchy

At the architectural level: registers, shared memory, constant or read-only memory, L1 and L2 cache memory, global memory. We will discuss later the view of the memory hierarchy from the programmer's point of view.

100 Introduction to GPU/Parallel Computing www.prace-ri.eu Characteristics of the memory hierarchy

Registers: access time 1 clock cycle; accessible by only 1 thread; life span: the computational kernel.
Shared memory: access time 30-50 clock cycles; accessible by all threads in the same block; life span: the computational kernel.
Constant memory: access time 1-7 clock cycles (cache hit) or 80-2750 (cache miss); accessible by all threads of a computational kernel; life span: the application (preserved across calls to computational kernels).
Global memory: access time 80-2750 clock cycles; accessible by all threads of a computational kernel; life span: the application (preserved across calls to computational kernels).

“Dissecting GPU Memory Hierarchy through Microbenchmarking”, Xinxin Mei, Xiaowen Chu, https://arxiv.org/abs/1509.02308, Last accessed 04/10/2016

101 Introduction to GPU/Parallel Computing www.prace-ri.eu Interconnection with the host

Typically through a PCI Express (PCIe) bus. NVIDIA developed NVLink: 5 to 12× faster transfer speed between the CPU and the GPU. It allows data transfer to/from the GPU at almost the same speed at which the main memory of the GPU is accessed. Available in the Pascal and Volta architectures.

102 Introduction to GPU/Parallel Computing www.prace-ri.eu Dynamic Parallelism

Supported starting with the Kepler architecture. Nested parallelism: code executing on the GPU can start the execution of parallel code that will also execute on the GPU, with synchronization for creating/using the results. Scheduling happens at the hardware level; the CPU is not involved. Example use: dynamic adaptation of a grid during a simulation. A sketch follows below.
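A minimal sketch of what this looks like in CUDA (added here for illustration; the kernel names are made up, error checking is omitted, and it assumes a device of compute capability 3.5 or higher with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true; note that device-side cudaDeviceSynchronize() has been deprecated in recent CUDA versions):

__global__ void childKernel(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        data[i] *= 2.0f;   /* some per-element work */
    }
}

__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        /* Launch a child grid directly from GPU code; the CPU is not involved */
        childKernel<<<(n + 255) / 256, 256>>>(data, n);

        /* Wait for the child grid to finish before its results are used */
        cudaDeviceSynchronize();
    }
}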

103 Introduction to GPU/Parallel Computing www.prace-ri.eu The concept of Compute Capability

104 Introduction to GPU/Parallel Computing www.prace-ri.eu What is Compute Capability?

It is a “version number” for the hardware of a GPU and indicates its computational capabilities. It does not indicate processing speed; however, newer GPUs with a higher compute capability are usually also faster, due to their improved design. A program can query it at run time, as sketched below.
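A small sketch of how a program can read the compute capability through the CUDA runtime API (added here for illustration; error checking omitted):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* prop.major and prop.minor form the compute capability, e.g. 3.5 */
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }

    return 0;
}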

105 Introduction to GPU/Parallel Computing www.prace-ri.eu GPU generations

GPUs with the same major revision number are based on the same architecture. The minor revision number indicates gradual improvements of the basic architecture, possibly with the addition of new features.

Compute capability: Microarchitecture
1.x: Tesla
2.x: Fermi
3.x: Kepler
5.x: Maxwell
6.x: Pascal
7.x: Volta

106 Introduction to GPU/Parallel Computing www.prace-ri.eu Feature support

Compute capability Feature support 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 6.0 6.1 Integer atomic functions operating on 32-bit words in global memory No Yes atomicExch() operating on 32-bit floating point values in global memory Integer atomic functions operating on 32-bit words in shared memory atomicExch() operating on 32-bit floating point values in shared memory No Yes Integer atomic functions operating on 64-bit words in global memory Warp vote functions Double-precision floating-point operations No Yes Atomic functions operating on 64-bit integer values in shared memory Floating-point atomic addition operating on 32-bit words in global and shared memory _ballot() _threadfence_system() No Yes _syncthreads_count(), _syncthreads_and(), _syncthreads_or() Surface functions 3D grid of thread block Warp shuffle functions No Yes Funnel shift No Yes Dynamic107 parallelismIntroduction to GPU/Parallel Computing www.prace-ri.eu Technical specifications (1/3) Compute capability Technical specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 5.3 Maximum dimensionality of grid of thread blocks 2 3 Maximum x-dimension of a grid of thread blocks 65535 231 − 1 Maximum y-, or z-dimension of a grid of thread blocks 65535 Maximum dimensionality of thread block 3 Maximum x- or y -dimension of a block 512 1024 Maximum z-dimension of a block 64 Maximum number of threads per block 512 1024 Warp size 32 Maximum number of resident blocks per multiprocessor 8 16 32 Maximum number of resident warps per multiprocessor 24 32 48 64 Maximum number of resident threads per multiprocessor 768 1024 1536 2048 Number of 32-bit registers per multiprocessor 8 K 16 K 32 K 64 K 128 K 64 K Maximum number of 32-bit registers per thread block N/A 32 K 64 K 32 K Maximum number of 32-bit registers per thread 124 63 255 Maximum amount of shared memory per multiprocessor 16 KB 48 KB 112 KB 64 KB 96 KB 64 KB Number of shared memory banks 16 32 Amount of local memory per thread 16 KB 512 KB Constant memory size 64 KB Cache working set per multiprocessor for constant memory 8 KB 10 KB

Cache working set per multiprocessor for texture memory 6 – 8 KB 12 KB 12 – 48 KB 24 KB 48 KB N/A

108 Introduction to GPU/Parallel Computing www.prace-ri.eu Technical specifications (2/3)

Compute capability Technical specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 5.3 Maximum width for 1D texture reference bound to a CUDA 8192 65536 array Maximum width for 1D texture reference bound to linear 227 memory Maximum width and number of layers for a 1D layered 8192 × 512 16384 × 2048 texture reference Maximum width and height for 2D texture reference bound 65536 × 32768 655362 to a CUDA array Maximum width and height for 2D texture reference bound 650002 to a linear memory Maximum width and height for 2D texture reference bound N/A 163842 to a CUDA array supporting texture gather Maximum width, height, and number of layers for a 2D 8192 × 8192 × 512 16384 × 16384 × 2048 layered texture reference Maximum width, height and depth for a 3D texture 20483 40963 reference bound to linear memory or a CUDA array Maximum width and number of layers for a cubemap N/A 16384 × 2046 layered texture reference Maximum number of textures that can be bound to a kernel 128 256

109 Introduction to GPU/Parallel Computing www.prace-ri.eu Technical specifications (3/3)

Compute capability Technical specifications 1.0 1.1 1.2 1.3 2.x 3.0 3.5 3.7 5.0 5.2 5.3 Maximum width for a 1D surface reference bound to a 65536 CUDA array Maximum width and number of layers for a 1D layered 65536 × 2048 surface reference Maximum width and height for a 2D surface reference 65536 × 32768 bound to a CUDA array Maximum width, height, and number of layers for a 2D Not 65536 × 32768 × 2048 layered surface reference supported Maximum width, height, and depth for a 3D surface 65536 × 32768 × 2048 reference bound to a CUDA array Maximum width and number of layers for a cubemap 32768 × 2046 layered surface reference Maximum number of surfaces that can be bound to a kernel 8 16 Maximum number of instructions per kernel 2 million 512 million

110 Introduction to GPU/Parallel Computing www.prace-ri.eu Architecture specifications

Compute capability (version) Architecture specifications 1.0 1.1 1.2 1.3 2.0 2.1 3.0 3.5 3.7 5.0 5.2 6.0 6.1

Number of ALU lanes for integer and single-precision 8 32 48 192 128 64 128 floating-point arithmetic operations

Number of special function units for single-precision 2 4 8 32 16 32 floating-point transcendental functions

Number of texture filtering units for every 2 4 8 16 8 texture address unit or render output unit (ROP)

Number of warp schedulers 1 2 4 2 4

Number of instructions issued at once by scheduler 1 2

111 Introduction to GPU/Parallel Computing www.prace-ri.eu THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu

112 Introduction to GPU/Parallel Computing www.prace-ri.eu