
COSC 6339 Accelerators in Big Data

Edgar Gabriel Fall 2018

Motivation

• Programming models such as MapReduce and Spark provide a high-level view of parallelism
  – not easy for all problems, e.g. recursive algorithms, many graph problems, etc.
• How to handle problems that do not have inherent high-level parallelism?
  – sequential processing: time to solution takes too long for large problems
  – exploit low-level parallelism: often groups of very few instructions
• Problem with instruction-level parallelism: the cost of exploiting the parallelism exceeds the benefit if regular threads/processes/tasks are used

Historic context: SIMD Instructions

• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations
  – E.g. a 256-bit adder can be utilized for eight 32-bit add operations simultaneously
• All elements in a register have to be on the same memory page to avoid page faults within the instruction

Comparison of instructions

• Example: add operation on eight 32-bit integers with and without SIMD instructions

  LOOP: LOAD  R2, 0(R4)    /* load x(i) */
        LOAD  R0, 0(R6)    /* load y(i) */
        ADD   R2, R2, R0   /* x(i)+y(i) */
        STORE R2, 0(R4)    /* store x(i) */
        ADD   R4, R4, #4   /* increment x */
        ADD   R6, R6, #4   /* increment y */
        BNEQ  R4, R20, LOOP
  -------------------------------------------
  LOAD256  YMM1, 0(R4)      /* loads 256 bits of data */
  LOAD256  YMM2, 0(R6)      /* ditto */
  VADDSP   YMM1, YMM1, YMM2 /* AVX ADD operation */
  STORE256 YMM1, 0(R4)

• In the scalar loop, 3 instructions are required just for managing the loop (i.e. they do not contribute to the actual solution of the problem)
• Branch instructions typically lead to processor stalls, since you have to wait for the outcome of the comparison before you can decide what the next instruction to execute is
• Note: not actual Intel assembly instructions and registers used
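As a rough illustration of how the SIMD version can be written in C without dropping to assembly, here is a minimal sketch using AVX compiler intrinsics. The function name vec_add_avx is made up for illustration, the arrays are assumed to be floats, and n is assumed to be a multiple of 8:

  /* Minimal sketch: the same eight-element 32-bit add expressed with
   * AVX intrinsics in C (one VADDPS per iteration). */
  #include <immintrin.h>

  void vec_add_avx(float *x, const float *y, int n)
  {
      for (int i = 0; i < n; i += 8) {
          __m256 vx = _mm256_loadu_ps(&x[i]);   /* load 256 bits (8 floats) of x */
          __m256 vy = _mm256_loadu_ps(&y[i]);   /* load 256 bits (8 floats) of y */
          vx = _mm256_add_ps(vx, vy);           /* 8 adds with one instruction   */
          _mm256_storeu_ps(&x[i], vx);          /* store the result back into x  */
      }
  }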

SIMD Instructions

• MMX (Multi-Media Extension) – 1996
  – Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
  – Successor to the MMX instructions
  – Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
  – Added support for double precision operations
• AVX (Advanced Vector Extensions) – 2010
  – 256-bit registers added
• AVX2 – 2013, AVX-512 – 2016
  – AVX-512 added 512-bit registers

Graphics Processing Units (GPU)

• Hardware in graphics units is similar to SIMD units
  – Works well with data-level parallel problems
  – Scatter-gather transfers
  – Mask registers
  – Large register files
• Using NVIDIA GPUs as an example

Graphics Processing Units (II)

• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as CUDA
  – Programming model is “Single Instruction Multiple Threads”
• GPU hardware handles thread management, not applications or the OS

Example: Vector Addition

• Sequential code:

  int main ( int argc, char **argv )
  {
      int i, A[N], B[N], C[N];

      for ( i=0; i<N; i++ ) {
          C[i] = A[i] + B[i];
      }

      return (0);
  }

CUDA: replace the loop by N threads, each executing one element of the vector add operation

Example: Vector Addition (II)

• CUDA: replace the loop by N threads, each executing one element of the vector add operation
• Question: How does each thread know which element to execute?
  – threadIdx: each thread has an id which is unique in the thread block
    • of type dim3, which is a struct { int x, y, z; }
  – blockDim: total number of threads in the thread block
    • a thread block can be 1D, 2D or 3D

Example: Vector Addition (III)

• Initial CUDA kernel:

  void vecadd ( int *d_A, int *d_B, int *d_C )
  {
      int i = threadIdx.x;
      d_C[i] = d_A[i] + d_B[i];
      return;
  }

• Assuming a 1-D thread block -> only the x-dimension is used
• This code is limited by the maximum number of threads in a thread block
  – there is an upper limit on the max. number of threads in one block
  – if the vector is longer, we have to create multiple thread blocks

How does the compiler know which code to compile for the CPU and which one for the GPU?

• A specifier tells the compiler where a function will be executed -> the compiler can generate code for the corresponding processor

• Executed on the CPU, called from the CPU (default if not specified)
  __host__ void func(...);

• CUDA kernel to be executed on the GPU, called from the CPU
  __global__ void func(...);

• Function to be executed on the GPU, called from GPU code
  __device__ void func(...);
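A minimal sketch (not from the slides) showing the three specifiers together; the function names scale, scaleAll and launch are made up for illustration, and d_x is assumed to be a device array with at least 256 elements:

  __device__ float scale(float v)          /* runs on the GPU, callable from GPU code        */
  {
      return 2.0f * v;
  }

  __global__ void scaleAll(float *d_x)     /* kernel: runs on the GPU, launched from the CPU */
  {
      int i = threadIdx.x;
      d_x[i] = scale(d_x[i]);
  }

  __host__ void launch(float *d_x)         /* runs on the CPU (the default without specifier) */
  {
      scaleAll<<<1, 256>>>(d_x);
  }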

Example: Vector Addition (IV)

• so the CUDA kernel is in reality:

  __global__ void vecAdd ( int *d_A, int *d_B, int *d_C )
  {
      int i = threadIdx.x;
      d_C[i] = d_A[i] + d_B[i];
      return;
  }

• Note:
  – d_A, d_B, and d_C are in global memory
  – int i is in the local memory of the thread

If you have multiple thread blocks

  __global__ void vecAdd ( int *d_A, int *d_B, int *d_C )
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      d_C[i] = d_A[i] + d_B[i];
      return;
  }

• blockIdx.x: ID of the thread block that this thread is part of
• blockDim.x: number of threads in a thread block
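One detail the kernel above leaves implicit: if the vector length N is not an exact multiple of the block size, the last block contains threads whose index falls past the end of the arrays. A minimal sketch of the usual guard; the extra parameter N and the host-side block calculation are additions made here, not part of the slide's kernel:

  __global__ void vecAdd ( int *d_A, int *d_B, int *d_C, int N )
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if ( i < N )                  /* threads past the end simply do nothing */
          d_C[i] = d_A[i] + d_B[i];
      return;
  }

  /* host side: round the number of blocks up so all N elements are covered */
  int threadsPerBlock = 256;
  int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;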

Using more than one element per thread

  __global__ void vecAdd ( int *d_A, int *d_B, int *d_C )
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j;

      for ( j = i*NUMELEMENTS; j < (i+1)*NUMELEMENTS; j++ )
          d_C[j] = d_A[j] + d_B[j];
      return;
  }
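A commonly used alternative to the fixed NUMELEMENTS chunking above is a grid-stride loop, where each thread walks over the vector with a stride equal to the total number of threads. A minimal sketch (not from the slides); the kernel name vecAddStride and the extra parameter N are additions for illustration:

  __global__ void vecAddStride ( int *d_A, int *d_B, int *d_C, int N )
  {
      int stride = gridDim.x * blockDim.x;      /* total number of threads in the grid */
      for ( int j = blockIdx.x * blockDim.x + threadIdx.x; j < N; j += stride )
          d_C[j] = d_A[j] + d_B[j];
  }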

Nvidia GT200

• The GT200 is a multi-core chip with a two-level hierarchy
  – focuses on high throughput for data-parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
• 2nd level of hierarchy: each TPC has
  – 3 Streaming Multiprocessors (SM) (an SM corresponds to 1 core in a conventional processor)
  – a texture pipeline (used for memory access)
• Global Block Scheduler:
  – issues thread blocks to SMs with available capacity
  – simple round-robin, but taking resource availability into account

Nvidia GT200

Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1

Streaming multi-processor (I)

• Instruction fetch, decode and issue logic
• 8 32-bit ALU units (often referred to as Streaming Processors (SP), or, confusingly, called a ‘core’ by Nvidia)
• 8 branch units: no branch prediction or speculation, branch delay: 4 cycles
• Can execute up to 8 thread blocks / 1024 threads concurrently
• Each SP has access to 2048 register file entries, each 32 bits wide
  – a double precision number has to utilize two adjacent registers
  – the register file can be used by up to 128 threads concurrently

CUDA Memory Model

CUDA Memory Model (II)

• cudaError_t cudaMalloc(void** devPtr, size_t size)
  – allocates size bytes of device (global) memory pointed to by *devPtr
  – returns cudaSuccess for no error
• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  – dst = destination memory address
  – src = source memory address
  – count = number of bytes to copy
  – kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
• cudaError_t cudaFree(void* devPtr)
  – frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf
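All of the calls above return a cudaError_t, which the example on the next slide ignores for brevity. A minimal sketch (not from the slides) of checking one of the return values; d_a and N are assumed to be declared as in that example, and <stdio.h> is assumed to be included:

  cudaError_t err = cudaMalloc( (void**)&d_a, N*sizeof(float) );
  if ( err != cudaSuccess ) {
      /* cudaGetErrorString converts the error code into a readable message */
      fprintf( stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err) );
      return 1;
  }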

Example: Vector Addition (V)

  int main ( int argc, char **argv )
  {
      float a[N], b[N], c[N];
      float *d_a, *d_b, *d_c;

      cudaMalloc( (void**)&d_a, N*sizeof(float) );
      cudaMalloc( (void**)&d_b, N*sizeof(float) );
      cudaMalloc( (void**)&d_c, N*sizeof(float) );

      cudaMemcpy( d_a, a, N*sizeof(float), cudaMemcpyHostToDevice );
      cudaMemcpy( d_b, b, N*sizeof(float), cudaMemcpyHostToDevice );

      dim3 threadsPerBlock(256);        // 1-D array of threads
      dim3 blocksPerGrid(N/256);        // 1-D grid

      vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

      cudaMemcpy( c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost );

      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      return 0;
  }

Nvidia Tesla V100 GPU

• Most recent Nvidia GPU architecture
• Architecture: each V100 contains
  – 6 GPU Processing Clusters (GPCs)
  – Each GPC has
    • 7 Texture Processing Clusters (TPCs)
    • 14 Streaming Multiprocessors (SMs)
  – Each SM has
    • 64 32-bit floating point cores
    • 64 32-bit integer cores
    • 32 64-bit floating point cores
    • 8 Tensor Cores
    • 4 Texture Units
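Multiplying these counts out gives the per-chip totals implied by the slide (a back-of-the-envelope calculation based purely on the numbers listed above; shipping parts may have some SMs disabled):

  \[ 6 \times 14 = 84 \ \text{SMs}, \qquad 84 \times 64 = 5376 \ \text{FP32 cores}, \qquad 84 \times 8 = 672 \ \text{Tensor Cores} \]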

Nvidia V100 Tensor Cores

• Specifically designed to support neural networks
• Designed to execute
  – D = A×B + C for 4x4 matrices
  – operating on FP16 input data with FP32 accumulation

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
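Written out elementwise, the tensor core operation above is (A and B are the FP16 inputs, the sum and C, D are accumulated in FP32 as stated on the slide):

  \[ D_{ij} = \sum_{k=1}^{4} A_{ik} \, B_{kj} + C_{ij}, \qquad i, j \in \{1, \dots, 4\} \]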

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

NVLink

• GPUs traditionally utilize a PCIe slot for moving data and instructions from the CPU to the GPU and between GPUs
  – PCIe 3.0 x8: 8 GB/s bandwidth
  – PCIe 3.0 x16: 16 GB/s bandwidth
  – motherboards are often restricted in the number of PCIe lanes they can manage, i.e. using multiple PCIe cards will reduce the bandwidth available for each card

• NVLink: high speed connection between multiple GPUs
  – higher bandwidth per link (25 GB/s) than PCIe
  – V100 supports up to 6 NVLinks per GPU (see the comparison below)
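A quick comparison based on the numbers above, assuming all six links of a V100 are used and counting one direction only:

  \[ 6 \times 25 \ \text{GB/s} = 150 \ \text{GB/s aggregate NVLink bandwidth} \quad \text{vs.} \quad 16 \ \text{GB/s for PCIe 3.0 x16} \]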

NVLink2: multi-GPU, no CPU support

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

NVLink2 with CPU support

• only supports IBM Power 9 CPUs at the moment

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

Other V100 enhancements

• In earlier GPUs
  – a group of threads (a warp) executed a single instruction
  – a single program counter was used in combination with an active mask that specified which threads of the warp are active at any given point in time
  – divergent paths (e.g. if-then-else statements) lead to some threads being inactive
• V100 introduces a program counter and call stack per thread
  – independent thread scheduling allows the GPU to yield execution of any thread
  – a schedule optimizer dynamically determines how to group active threads into SIMT units
  – threads can diverge and converge at sub-warp granularity
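A minimal CUDA sketch (not from the slides) of the kind of divergent code being described; on pre-Volta GPUs the two branches of this kernel are executed one after the other, with half of the warp masked off each time:

  __global__ void divergentKernel ( int *d_out )
  {
      int i = threadIdx.x;
      if ( i % 2 == 0 )          /* even lanes take this path ...          */
          d_out[i] = i * 2;
      else                       /* ... odd lanes take this one, so the    */
          d_out[i] = i * 3;      /* warp diverges and the paths serialize  */
  }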

Google Tensor Processing Unit (TPU)

• Google’s DNN ASIC
• Coprocessor on the PCIe bus
• Large software-managed scratchpad

• Scratchpad:
  – high speed memory (similar to a cache)
  – content controlled by the application instead of the system

TPU Microarchitecture

• The matrix multiply unit contains 256x256 ALUs that can perform 8-bit multiply-and-add operations, generating 16-bit products
• Accumulators are used for updating partial results
• Weights for the matrix multiply operations come from an off-chip 8 GiB weight memory and are provided through the Weight FIFO
• Intermediate results are held in a 24 MiB on-chip unified buffer

• The host server sends instructions over the PCIe bus to the TPU
• Programmable DMA transfers data to or from host memory
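The size of the matrix unit explains the throughput figure quoted in the comparison table later (counting a multiply and an add as two operations):

  \[ 256 \times 256 = 65{,}536 \ \text{MAC units}, \qquad 65{,}536 \times 2 = 131{,}072 \approx 128\text{K operations per cycle} \]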

Google TPU Architecture

Image source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

TPU Instruction Set Architecture

• No program counter
• No branch instructions
• Instructions contain a repeat field
• Very high CPI (10 – 20)

  TPU Instruction            Function
  Read_Host_Memory           Read data from memory
  Read_Weights               Read weights from memory
  MatrixMultiply/Convolve    Multiply or convolve with the data and weights, accumulate the results
  Activate                   Apply activation functions
  Write_Host_Memory          Write result to memory

TPU microarchitecture

• Goal: hide the cost of the other instructions and keep the Matrix Multiply Unit busy
• Systolic array: 2-D collection of arithmetic units that each independently compute a partial result
• Data arrives at the cells from different directions at regular intervals
• Data flows through the array similar to a wave front -> systolic execution (see the sketch below)

Image source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
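A toy C sketch (not from the slides, and not a model of the actual hardware data paths): it only mimics the wavefront order in which the multiply-accumulates of a matrix-vector product y = W·x would fire in a systolic array, with cell (i,j) firing at "cycle" i+j:

  /* Illustrative only: wavefront-ordered y = W*x for an N x N matrix W.
   * Assumes y[] was initialized to 0 beforehand. */
  for (int t = 0; t <= 2*(N-1); t++) {      /* t = the "cycle" of the wave   */
      for (int i = 0; i < N; i++) {
          int j = t - i;                    /* cell (i,j) fires when i+j == t */
          if (j >= 0 && j < N)
              y[i] += W[i][j] * x[j];       /* one multiply-accumulate        */
      }
  }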

Systolic execution: Matrix-Vector Example

Image source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

Systolic execution: Matrix-Matrix Example

Image source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

  Architecture                  Operations per cycle
  CPU                           a few
  CPU w/ vector extensions      tens – hundreds
  GPU                           tens of thousands
  TPU                           up to 128K

Image source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

TPU software

• At this point mostly limited to TensorFlow
• Code that is expected to run on the TPU is compiled using an API that can target a GPU, TPU or CPU
