COSC 6385 Computer Architecture - Data Level Parallelism (II)

Edgar Gabriel Fall 2013

SIMD Instructions

• Originally developed for multimedia applications
• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain so that the same functional unit can be used for multiple operations
  – E.g. a 64-bit adder can be used for two 32-bit add operations simultaneously
• Instructions were originally not intended to be used by compilers, but just for handcrafting specific operations in device drivers
• All elements in a register have to be on the same memory page to avoid page faults within the instruction
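The partitioned carry chain can be imitated in software. The following is a minimal sketch in plain C, not how any particular SIMD unit is implemented: it adds two 32-bit lanes packed into one 64-bit word and keeps the carry of the low lane from reaching the high lane. The function name packed_add32 is made up for this illustration.

```c
#include <stdint.h>

/* Add two pairs of 32-bit lanes packed into 64-bit words.
   A carry out of the low lane must not reach the high lane, which is
   what a partitioned carry chain guarantees in hardware. */
uint64_t packed_add32(uint64_t a, uint64_t b)
{
    const uint64_t MSB = 0x8000000080000000ULL;    /* top bit of each lane */

    /* Add the lower 31 bits of each lane; the result fits inside the lane. */
    uint64_t low_bits = (a & ~MSB) + (b & ~MSB);

    /* Add the top bit of each lane with XOR, discarding its carry-out. */
    return low_bits ^ ((a ^ b) & MSB);
}
```

With a = 0x00000001FFFFFFFF and b = 0x0000000200000001, the low lane wraps around to 0 and the high lane becomes 3, whereas an ordinary 64-bit addition would carry into the high lane and produce 4 there.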

COSC 6385 – Computer Architecture Edgar Gabriel

SIMD Instructions

• MMX (Multi-Media Extension) – 1996
  – Existing 64-bit floating-point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extensions) – 1999
  – Successor to the MMX instructions
  – Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
  – Added support for double-precision operations
• AVX (Advanced Vector Extensions) – 2010
  – 256-bit registers added

AVX Instructions

AVX Instruction   Description
VADDPD            Add four packed double-precision operands
VSUBPD            Subtract four packed double-precision operands
VMULPD            Multiply four packed double-precision operands
VDIVPD            Divide four packed double-precision operands
VFMADDPD          Multiply and add four packed double-precision operands
VFMSUBPD          Multiply and subtract four packed double-precision operands
VCMPxx            Compare four packed double-precision operands for EQ, NEQ, LT, LE, GT, GE, …
VMOVAPD           Move aligned four packed double-precision operands
VBROADCASTSD      Broadcast one double-precision operand to four locations in a 256-bit register
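To make "four packed double-precision operands" concrete, here is a plain-C model of three of the instructions in the table. It is only a sketch of their lane-wise semantics: the type vec4d and the functions broadcast_sd, mul_pd, add_pd and axpy4 are hypothetical names chosen for this example, not Intel intrinsics, and the loops run sequentially where the hardware would process all four lanes at once.

```c
#include <stddef.h>

/* Models one 256-bit register holding four doubles (hypothetical type). */
typedef struct { double lane[4]; } vec4d;

/* Models VBROADCASTSD: copy one scalar into all four lanes. */
vec4d broadcast_sd(double s)
{
    vec4d r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = s;
    return r;
}

/* Models VMULPD: lane-wise multiplication. */
vec4d mul_pd(vec4d a, vec4d b)
{
    vec4d r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Models VADDPD: lane-wise addition. */
vec4d add_pd(vec4d a, vec4d b)
{
    vec4d r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Four elements of y = a*x + y: broadcast, multiply, add. */
vec4d axpy4(double a, vec4d x, vec4d y)
{
    return add_pd(mul_pd(broadcast_sd(a), x), y);
}
```

A loop over a long vector would call axpy4 once per group of four elements, which is the pattern a compiler or programmer maps onto VBROADCASTSD, VMULPD and VADDPD.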

Graphics Processing Units (GPU)

• Hardware in Graphics Units is similar to Vector Processors
  – Works well with data-level parallel problems
  – Scatter-gather transfers
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Graphics Processing Units (II)

• Using NVIDIA GPUs as an example
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as CUDA threads
  – Programming model is “Single Instruction Multiple Thread”
• GPU hardware handles thread management, not applications or the OS

Example: Vector Addition

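• Sequential code:

    #define N 1024   /* example length; the slide does not give a value */

    int main ( int argc, char **argv )
    {
        int A[N], B[N], C[N];
        int i;

        /* A and B are assumed to be initialized here */
        for ( i=0; i<N; i++ )
            C[i] = A[i] + B[i];

        return (0);
    }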

Example: Vector Addition (II)

• CUDA: replace the loop by N threads, each executing one element of the vector add operation
• Question: How does each thread know which elements to process?
  – threadIdx: each thread has an ID which is unique within the thread block
    • of type uint3, a struct with the components x, y, z
  – blockDim: total number of threads in the thread block
    • of type dim3, which has the same components x, y, z
    • a thread block can be 1D, 2D or 3D

Example: Vector Addition (III)

• Initial CUDA kernel:

    void vecadd ( int *d_A, int *d_B, int* d_C)
    {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• Assuming a 1-D thread block -> only the x-dimension is used
• This code is limited by the maximum number of threads in a thread block
  – Compute capability 1.3: 512 threads max.
  – if the vector is longer, we have to create multiple thread blocks

How does the compiler know which code to compile for CPU and which one for GPU?

• Specifier tells compiler where function will be executed -> compiler can generate code for corresponding processor

• Executed on CPU, called from CPU (default if not specified)
    __host__ void func(...);

• CUDA kernel to be executed on GPU, called from CPU
    __global__ void func(...);

• Function to be executed on GPU, called from GPU
    __device__ void func(...);

Example: Vector Addition (IV)

• so the CUDA kernel is in reality:

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
    {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• Note:
  – d_A, d_B, and d_C are in global memory
  – int i is in local memory of the thread

If you have multiple thread blocks

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• blockIdx.x: ID of the thread block that this thread is part of
• blockDim.x: number of threads in a thread block
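The index computation can be checked on the host without a GPU. The sketch below is plain C, not CUDA: the parameters blockIdx_x, blockDim_x and threadIdx_x are ordinary integers that merely mirror the names of CUDA's built-in variables, and the function names global_index and vec_add_emulated are invented for this illustration. The two nested loops run sequentially what the GPU would run as parallel threads.

```c
/* Host-side model of the global index used by the kernel. */
int global_index(int blockIdx_x, int blockDim_x, int threadIdx_x)
{
    return blockIdx_x * blockDim_x + threadIdx_x;
}

/* Executes the kernel body once for every (block, thread) pair.
   The arrays must hold num_blocks * threads_per_block elements. */
void vec_add_emulated(const int *a, const int *b, int *c,
                      int num_blocks, int threads_per_block)
{
    for (int blk = 0; blk < num_blocks; blk++) {
        for (int t = 0; t < threads_per_block; t++) {
            int i = global_index(blk, threads_per_block, t);
            c[i] = a[i] + b[i];
        }
    }
}
```

For example, thread 5 of block 2 with 256 threads per block handles element 2 * 256 + 5 = 517, so consecutive blocks cover consecutive, non-overlapping slices of the vector.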

Using more than one element per thread

    __global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j;

        for ( j=i*NUMELEMENTS; j<(i+1)*NUMELEMENTS; j++)
            d_C[j] = d_A[j] + d_B[j];
        return;
    }
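A quick way to convince oneself that this loop covers every element exactly once is to replay the index ranges on the host. The following plain-C sketch does that; NUMELEMENTS is given an example value of 4 (the slide does not fix one), and count_writes and threads_needed are hypothetical helper names.

```c
#define NUMELEMENTS 4   /* elements per thread; example value only */

/* Replays the loop bounds of total_threads threads and counts how often
   each element would be written. writes must be zero-initialized and hold
   total_threads * NUMELEMENTS entries. */
void count_writes(int *writes, int total_threads)
{
    for (int i = 0; i < total_threads; i++)   /* i = global thread index */
        for (int j = i * NUMELEMENTS; j < (i + 1) * NUMELEMENTS; j++)
            writes[j]++;
}

/* Number of threads needed to cover n elements, rounded up. */
int threads_needed(int n)
{
    return (n + NUMELEMENTS - 1) / NUMELEMENTS;
}
```

If the vector length is not a multiple of NUMELEMENTS, the last thread's range extends past the end of the vector, so a real kernel would also need a bounds check on j.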

NVIDIA Instruction Set Architecture

• Parallel Thread Execution (PTX)
  – is an abstraction of the hardware instruction set
  – Uses virtual registers
  – Translation to machine code is performed in software
  – Example for one iteration of a loop executing y[i] = a*x[i] + y[i] with a block size of 512 threads per block

    shl.s32        R8, blockIdx, 9     ; Thread Block ID * Block size
    add.s32        R8, R8, threadIdx   ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]         ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]         ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4       ; Product in RD0 = RD0 * a
    add.f64        RD0, RD0, RD2       ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0         ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Branch hardware uses internal masks
  – Branch synchronization stack to support nested branch instructions
    • Entries consist of masks for each SIMD lane (CUDA thread)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• For equal-length IF-ELSE conditions, code will operate at 50% efficiency
  – Either the IF or the ELSE part is not executing
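The 50% figure follows directly from how masks work, and a small host-side model shows it. The sketch below is plain C under simplifying assumptions: 8 lanes (an illustrative width), a single non-nested IF-ELSE, and both paths always issued even if one mask happens to be empty. The names lane_usage and masked_abs_or_double are made up for this example, which models the statement "if (x < 0) x = -x; else x = 2*x;".

```c
#include <stdint.h>

#define LANES 8   /* illustrative lane count; real SIMD widths differ */

typedef struct {
    int useful_slots;   /* lane slots that committed a result */
    int issued_slots;   /* lane slots occupied by an issued instruction */
} lane_usage;

/* Executes "if (x < 0) x = -x; else x = 2*x;" for all lanes using a mask. */
lane_usage masked_abs_or_double(int x[LANES])
{
    lane_usage u = {0, 0};
    uint8_t mask = 0;

    /* Every lane evaluates the condition; the results form the mask. */
    for (int lane = 0; lane < LANES; lane++)
        if (x[lane] < 0)
            mask |= (uint8_t)(1u << lane);

    /* IF path: issued to all lanes, only lanes with mask bit 1 commit. */
    for (int lane = 0; lane < LANES; lane++) {
        u.issued_slots++;
        if ((mask >> lane) & 1u) {
            x[lane] = -x[lane];
            u.useful_slots++;
        }
    }

    /* ELSE path: issued to all lanes, only lanes with mask bit 0 commit. */
    for (int lane = 0; lane < LANES; lane++) {
        u.issued_slots++;
        if (!((mask >> lane) & 1u)) {
            x[lane] = 2 * x[lane];
            u.useful_slots++;
        }
    }
    return u;
}
```

Each lane commits in exactly one of the two paths, so 8 of the 16 issued lane slots do useful work, which is the 50% efficiency stated above.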

Nvidia GT200

• A GT200 is a multi-core chip with a two-level hierarchy
  – focuses on high throughput on data-parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
• 2nd level of hierarchy: each TPC has
  – 3 Streaming Multiprocessors (SM) (an SM corresponds to 1 core in a conventional processor)
  – a texture pipeline (used for memory access)
• Global Block Scheduler:
  – issues thread blocks to SMs with available capacity
  – simple round-robin algorithm, but taking resource availability (e.g. of shared memory) into account

Nvidia GT200

Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1

Streaming multi-processor (I)

• Instruction fetch, decode and issue logic
• Eight 32-bit ALUs (often referred to as Streaming Processors (SP), or confusingly called ‘cores’ by Nvidia)
• 8 branch units
  – a thread encountering a branch will stall until it is resolved (no speculation), branch delay: 4 cycles
• Two 64-bit special units for less frequent operations
  – 64-bit operations are 8-12 times slower than 32-bit operations!
• 1 special function unit for ‘unusual’ instructions
  – transcendental functions, interpolations, reciprocal square roots
  – take anywhere from 16 to 32 cycles to execute

Streaming multi-processor (II)

• Single issue with SIMD capabilities
• Can execute up to 8 thread blocks/1024 threads concurrently
• Does not support speculative execution or branch prediction
• Instructions are scoreboarded to reduce stalls
• Each SP has access to 2048 register file entries with 32 bits each
  – a double-precision number has to use two adjacent registers
  – register file can be used by up to 128 threads concurrently
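The register numbers on this slide fit together, which the following plain-C sketch spells out. It assumes, for illustration only, that threads are spread evenly over the 8 SPs of an SM and that an SP's registers are divided evenly among its threads; the constant and function names are invented for this example.

```c
/* Figures taken from the GT200 slides. */
enum {
    SPS_PER_SM         = 8,
    REG_ENTRIES_PER_SP = 2048,
    BYTES_PER_ENTRY    = 4,      /* 32-bit entries */
    MAX_THREADS_PER_SM = 1024
};

/* Total register file size of one SM in bytes. */
int register_file_bytes(void)
{
    return SPS_PER_SM * REG_ENTRIES_PER_SP * BYTES_PER_ENTRY;
}

/* Threads sharing one SP when the SM is fully occupied. */
int threads_per_sp(void)
{
    return MAX_THREADS_PER_SM / SPS_PER_SM;
}

/* 32-bit registers available to each thread at full occupancy. */
int registers_per_thread(void)
{
    return REG_ENTRIES_PER_SP / threads_per_sp();
}

/* Double-precision values per thread, two adjacent registers each. */
int doubles_per_thread(void)
{
    return registers_per_thread() / 2;
}
```

Under these assumptions the register file is 8 * 2048 * 4 = 65536 bytes (the 64KB quoted elsewhere in these slides), 1024 / 8 = 128 threads share one SP, and each thread is left with 2048 / 128 = 16 registers, or 8 double-precision values.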

Streaming multi-processor (III)

Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1

Streaming multi-processor (IV)

• Execution units of an SM run at twice the frequency of the fetch and issue logic as well as the memory and registers
• 64 KB register file that is partitioned across all SPs
• 16 KB shared memory that can be used for communication between the threads running on the SPs of the same SM
  – organized in 4096 entries, 16 banks (= 32-bit bank width)
  – accessing shared memory is as fast as accessing a register!
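The bank organization can be made concrete with a few lines of plain C. The sketch assumes that successive 32-bit words are assigned to successive banks, which is how NVIDIA documents shared memory for this hardware generation; the helper names shared_entries, bank_of and worst_bank_load are hypothetical. Accesses that land in the same bank are serialized, so register-like speed only applies to conflict-free access patterns.

```c
enum {
    NUM_BANKS        = 16,
    BANK_WIDTH_BYTES = 4,           /* 32-bit bank width */
    SHARED_BYTES     = 16 * 1024    /* 16 KB shared memory per SM */
};

/* Number of 32-bit entries in shared memory. */
int shared_entries(void)
{
    return SHARED_BYTES / BANK_WIDTH_BYTES;
}

/* Bank that holds the 32-bit word at the given byte address. */
int bank_of(unsigned byte_addr)
{
    return (int)((byte_addr / BANK_WIDTH_BYTES) % NUM_BANKS);
}

/* Largest number of threads that hit the same bank when thread t
   accesses the 32-bit word with index t * stride_words. */
int worst_bank_load(int num_threads, int stride_words)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;

    for (int t = 0; t < num_threads; t++) {
        int bank = bank_of((unsigned)(t * stride_words) * BANK_WIDTH_BYTES);
        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

For 16 threads, a stride of one word touches every bank once, a stride of two words puts two threads on each of eight banks, and a stride of 16 words sends all 16 threads to bank 0.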

Load/Store operations

• Generated in SMs, but handled by the SM controller in the TPC
  – load pipeline shares hardware with the texture pipeline
  – shared by three SMs
  – mutually exclusive usage of load and texture pipelines
  – effective address calculation + mapping of 40-bit virtual addresses to physical addresses by the MMU
• Texture cache:
  – 2-D addressing
  – read-only caches without cache coherence
    • entire cache hierarchy invalidated if a data item is modified
  – texture caches are used to save bandwidth and power, not really faster than texture memory

CUDA Memory Model

CUDA Memory Model (II)

• cudaError_t cudaMalloc(void** devPtr, size_t size)
  – Allocates size bytes of device (global) memory pointed to by *devPtr
  – Returns cudaSuccess for no error
• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  – dst = destination memory address
  – src = source memory address
  – count = bytes to copy
  – kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
• cudaError_t cudaFree(void* devPtr)
  – Frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo, http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf

Example: Vector Addition (V)

    int main ( int argc, char ** argv)
    {
        int a[N], b[N], c[N];        // int, to match the vecAdd kernel signature
        int *d_a, *d_b, *d_c;

        cudaMalloc( &d_a, N*sizeof(int));
        cudaMalloc( &d_b, N*sizeof(int));
        cudaMalloc( &d_c, N*sizeof(int));

        cudaMemcpy( d_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy( d_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

        dim3 threadsPerBlock(256);   // 1-D array of threads
        dim3 blocksPerGrid(N/256);   // 1-D grid

        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

        // destination (host array c) comes first, source (d_c) second
        cudaMemcpy( c, d_c, N*sizeof(int), cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        return 0;
    }
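The grid size N/256 in the host code is exact only when N is a multiple of 256, because integer division drops the remainder. A common remedy is to round the block count up and let each thread check its index against N. The plain-C sketch below illustrates that arithmetic on the host; blocks_per_grid and active_threads are hypothetical helper names, and the check i < n stands in for a guard that the kernel itself would need.

```c
/* Number of thread blocks needed to cover n elements, rounded up. */
int blocks_per_grid(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Counts the launched threads whose global index passes the guard
   "i < n"; the remaining threads of the last block would do nothing. */
int active_threads(int n, int threads_per_block)
{
    int launched = blocks_per_grid(n, threads_per_block) * threads_per_block;
    int active = 0;

    for (int i = 0; i < launched; i++)
        if (i < n)
            active++;
    return active;
}
```

For N = 1000 and 256 threads per block, N/256 gives 3 blocks and leaves the last 232 elements untouched, while the rounded-up version launches 4 blocks (1024 threads), of which exactly 1000 pass the guard.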

Nvidia Fermi processor

• Next generation of Nvidia processors
• Removed one level of hierarchy
  – contains 16 SM processors, but no notion of TPCs anymore
• Each SM processor has
  – 32 ALU units (Nvidia ‘cores’, SIMD ‘lanes’ in the book)
    • compared to 8 on the GT200
    • organized as two units with 16 ALUs each
  – 16 load/store units
    • compared to 1 for three SMs in GT200
  – 64 KB local SRAM that can be split into L1 cache and shared memory (16 KB/48 KB or 48 KB/16 KB)
  – 4 special function units
    • compared to 1 in GT200

Nvidia Fermi SM processor

Image Source: Peter N. Glaskowsky, “Nvidia’s Fermi: The First Complete GPU Architecture”, http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf

Nvidia Fermi processor

• Can manage up to 1,536 threads simultaneously per SM
  – compared to 1024 per SM on the GT200
• Register file increased to 128 KB (32k entries)
• New: modified address space using 40-bit addresses
  – global, shared and local addresses are ranges within that address space
• New: support for atomic read-modify-write operations
• New: support for predicated instructions

Similarities and Differences between GPU and Vector Processors

• Memory organization and management
  – All GPU memory accesses are gather-scatter
    -> special hardware to recognize address coalescing
    -> hides memory latency due to large number of threads and scoreboarding
  – Loading data into a vector register is contiguous by default
    -> special support for gather-scatter operations
    -> costs of load/store operations amortized due to large number of elements accessed at once

Similarities and Differences between GPU and Vector Processors (II)

• Processor organization and ISA
  – Vector register holds an entire vector <-> vector is distributed across registers in different ALUs on the GPU
  – Much higher number of ALUs/threads supported in a GPU than no. of lanes in a vector processor
  – A PTX instruction is similar to a vector instruction
  – Both approaches use mask registers to handle conditional instructions
    -> mask set by compiler for vector processors
    -> mask set at runtime by hardware for GPU

Similarities and Differences between GPU and Vector Processors (III)

• Scalar processor executes scalar operations in a vector processor
• GPU could use the regular CPU for scalar operations
  – High costs of data transfer between GPU and CPU memory
  – Scalar code often executed on GPU
