Computer Architecture

COSC 6339 Accelerators in Big Data
Edgar Gabriel, Fall 2018

Motivation

• Programming models such as MapReduce and Spark provide a high-level view of parallelism
  – not easy for all problems, e.g. recursive algorithms, many graph problems, etc.
• How to handle problems that do not have inherent high-level parallelism?
  – sequential processing: time to solution takes too long for large problems
  – exploit low-level parallelism: often groups of very few instructions
• Problem with instruction-level parallelism: the cost of exploiting the parallelism exceeds the benefit if regular threads/processes/tasks are used

Historic context: SIMD Instructions

• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations
  – e.g. a 256-bit adder can be utilized for eight 32-bit add operations simultaneously
• All elements in a register have to be on the same memory page to avoid page faults within the instruction

Comparison of instructions

• Example: add operation on eight 32-bit integers, with and without SIMD instructions

    LOOP: LOAD  R2, 0(R4)    /* load x(i) */
          LOAD  R0, 0(R6)    /* load y(i) */
          ADD   R2, R2, R0   /* x(i)+y(i) */
          STORE R2, 0(R4)    /* store x(i) */
          ADD   R4, R4, #4   /* increment x */
          ADD   R6, R6, #4   /* increment y */
          BNEQ  R4, R20, LOOP
    ---------------------
    LOAD256  YMM1, 0(R4)       /* loads 256 bits of data */
    LOAD256  YMM2, 0(R6)       /* ditto */
    VADDSP   YMM1, YMM1, YMM2  /* AVX ADD operation */
    STORE256 YMM1, 0(R4)

• 3 instructions per iteration are required for managing the loop (i.e. they do not contribute to the actual solution of the problem)
• Branch instructions typically lead to processor stalls, since the processor has to wait for the outcome of the comparison before it can decide which instruction to execute next
• Note: not actual Intel assembly instructions and registers used

SIMD Instructions

• MMX (Multi-Media Extensions) – 1996
  – existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extensions) – 1999
  – successor to the MMX instructions
  – separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
  – added support for double precision operations
• AVX (Advanced Vector Extensions) – 2010
  – 256-bit registers added
• AVX2 – 2013
  – most integer operations extended to the 256-bit registers
• AVX-512 – announced 2013
  – 512-bit registers added
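On current compilers these SIMD instructions are most easily reached through intrinsics. The sketch below is a minimal illustration (not from the slides) of the eight-way 32-bit integer add from the comparison above, written with Intel's AVX2 intrinsics; it assumes n is a multiple of 8 and an AVX2-capable CPU (compile with e.g. gcc -mavx2):

    /* vecadd_avx2.c: eight 32-bit integer adds per instruction,
       mirroring the x(i) = x(i) + y(i) loop above */
    #include <immintrin.h>

    void vecadd_avx2(int *x, const int *y, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256i vx = _mm256_loadu_si256((const __m256i *)&x[i]); /* load 256 bits of x */
            __m256i vy = _mm256_loadu_si256((const __m256i *)&y[i]); /* load 256 bits of y */
            vx = _mm256_add_epi32(vx, vy);             /* eight 32-bit adds at once */
            _mm256_storeu_si256((__m256i *)&x[i], vx); /* store back to x */
        }
    }

Note that the loop-management and branch instructions are still present, but they are now amortized over eight elements per iteration.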
Graphics Processing Units (GPU)

• Hardware in graphics units is similar to SIMD units
  – works well with data-level parallel problems
  – scatter-gather transfers
  – mask registers
  – large register files
• Using NVIDIA GPUs as an example

Graphics Processing Units (II)

• Basic idea:
  – heterogeneous execution model
    • CPU is the host, GPU is the device
  – develop a C-like programming language for the GPU
  – unify all forms of GPU parallelism as CUDA threads
  – programming model is "Single Instruction Multiple Threads" (SIMT)
• GPU hardware handles thread management, not applications or the OS

Example: Vector Addition

• Sequential code:

    int main ( int argc, char **argv ) {
        int i, A[N], B[N], C[N];
        for ( i=0; i<N; i++ ) {
            C[i] = A[i] + B[i];
        }
        return (0);
    }

Example: Vector Addition (II)

• CUDA: replace the loop by N threads, each executing one element of the vector add operation
• Question: how does each thread know which elements to execute?
  – threadIdx: each thread has an id which is unique within its thread block
    • of type dim3, a struct with the three integer components x, y and z
  – blockDim: the number of threads in the thread block (per dimension)
    • a thread block can be 1D, 2D or 3D

Example: Vector Addition (III)

• Initial CUDA kernel:

    void vecadd ( int *d_A, int *d_B, int *d_C ) {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

  Assuming a 1-D thread block -> only the x-dimension is used
• This code is limited by the maximum number of threads in a thread block
  – there is an upper limit on the number of threads in one block
  – if the vector is longer, we have to create multiple thread blocks

How does the compiler know which code to compile for the CPU and which for the GPU?

• A specifier tells the compiler where a function will be executed -> the compiler can generate code for the corresponding processor
• executed on the CPU, called from the CPU (default if not specified):

    __host__ void func(...);

• CUDA kernel to be executed on the GPU, called from the CPU:

    __global__ void func(...);

• CUDA kernel to be executed on the GPU, called from the GPU:

    __device__ void func(...);

Example: Vector Addition (IV)

• So the CUDA kernel is in reality:

    __global__ void vecAdd ( int *d_A, int *d_B, int *d_C ) {
        int i = threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

• Note:
  – d_A, d_B, and d_C are in global memory
  – int i is in the local memory of the thread

If you have multiple thread blocks

    __global__ void vecAdd ( int *d_A, int *d_B, int *d_C ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_C[i] = d_A[i] + d_B[i];
        return;
    }

  – blockIdx.x: id of the thread block that this thread is part of
  – blockDim.x: number of threads in a thread block

Using more than one element per thread

    __global__ void vecAdd ( int *d_A, int *d_B, int *d_C ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j;
        for ( j = i*NUMELEMENTS; j < (i+1)*NUMELEMENTS; j++ )
            d_C[j] = d_A[j] + d_B[j];
        return;
    }
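A common variant of this kernel (not shown in the slides) is the grid-stride loop: instead of assigning each thread a contiguous chunk of NUMELEMENTS elements, each thread starts at its global id and strides over the array by the total number of threads in the grid. A minimal sketch, assuming the vector length n is passed in as an extra parameter:

    __global__ void vecAddStride ( int *d_A, int *d_B, int *d_C, int n ) {
        /* total number of threads launched in the (1-D) grid */
        int stride = gridDim.x * blockDim.x;
        /* each thread starts at its global id and jumps by the grid size */
        for ( int j = blockIdx.x * blockDim.x + threadIdx.x; j < n; j += stride )
            d_C[j] = d_A[j] + d_B[j];
    }

The explicit bound j < n also handles vector lengths that are not a multiple of the grid size, and neighboring threads touch neighboring elements in every pass, which keeps memory accesses coalesced.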
Nvidia GT200

• The GT200 is a multi-core chip with a two-level hierarchy
  – focuses on high throughput for data-parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPCs)
• 2nd level of hierarchy: each TPC has
  – 3 Streaming Multiprocessors (SMs) (an SM corresponds to 1 core in a conventional processor)
  – a texture pipeline (used for memory access)
• Global block scheduler:
  – issues thread blocks to SMs with available capacity
  – simple round-robin algorithm, but taking resource availability (e.g. shared memory) into account

Image source: David Kanter, "Nvidia GT200: Inside a Parallel Processor", http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1

Streaming multiprocessor (I)

• Instruction fetch, decode and issue logic
• 8 32-bit ALU units (often referred to as Streaming Processors (SPs), or confusingly called 'cores' by Nvidia)
• 8 branch units: no branch prediction or speculation; branch delay: 4 cycles
• Can execute up to 8 thread blocks / 1024 threads concurrently
• Each SP has access to 2048 register file entries, each 32 bits wide
  – a double precision number has to utilize two adjacent registers
  – the register file can be used by up to 128 threads concurrently

CUDA Memory Model

CUDA Memory Model (II)

• cudaError_t cudaMalloc(void** devPtr, size_t size)
  – allocates size bytes of device (global) memory pointed to by *devPtr
  – returns cudaSuccess on no error
• cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  – dst = destination memory address
  – src = source memory address
  – count = number of bytes to copy
  – kind = type of transfer (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
• cudaError_t cudaFree(void* devPtr)
  – frees memory allocated with cudaMalloc

Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo: http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf

Example: Vector Addition (V)

    int main ( int argc, char **argv ) {
        float a[N], b[N], c[N];      /* (initialization of a and b omitted) */
        float *d_a, *d_b, *d_c;

        cudaMalloc( (void**)&d_a, N*sizeof(float) );
        cudaMalloc( (void**)&d_b, N*sizeof(float) );
        cudaMalloc( (void**)&d_c, N*sizeof(float) );

        cudaMemcpy( d_a, a, N*sizeof(float), cudaMemcpyHostToDevice );
        cudaMemcpy( d_b, b, N*sizeof(float), cudaMemcpyHostToDevice );

        dim3 threadsPerBlock(256);   /* 1-D array of threads */
        dim3 blocksPerGrid(N/256);   /* 1-D grid; assumes N is a multiple of 256 */

        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

        /* copy the result back: destination is the host array c */
        cudaMemcpy( c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost );

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Nvidia Tesla V100 GPU

• Most recent Nvidia GPU architecture (as of Fall 2018)
• Architecture: each V100 contains
  – 6 GPU Processing Clusters (GPCs)
  – each GPC has
    • 7 Texture Processing Clusters (TPCs)
    • 14 Streaming Multiprocessors (SMs)
  – each SM has
    • 64 32-bit floating point cores
    • 64 32-bit integer cores
    • 32 64-bit floating point cores
    • 8 Tensor Cores
    • 4 texture units

Nvidia V100 Tensor Cores

• Specifically designed to support neural networks
• Designed to execute D = A×B + C for 4x4 matrices
  – operate on FP16 input data with FP32 accumulation

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
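From CUDA code the Tensor Cores are reached through the warp-level matrix (WMMA) API in mma.h, which is not covered in the slides; it exposes tiles of 16x16 rather than the 4x4 hardware operation. A minimal sketch for a single tile, assuming compilation for compute capability 7.0 (-arch=sm_70) and one warp per tile:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    /* One warp computes D = A×B + C for one 16x16 tile:
       half = FP16 input, float = FP32 accumulation, as on the slide. */
    __global__ void tensorTile(const half *A, const half *B,
                               const float *C, float *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::load_matrix_sync(a, A, 16);    /* leading dimension 16 */
        wmma::load_matrix_sync(b, B, 16);
        wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);
        wmma::mma_sync(acc, a, b, acc);      /* acc = a×b + acc on Tensor Cores */
        wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
    }

A launch such as tensorTile<<<1, 32>>>(A, B, C, D) runs exactly one warp, the granularity at which the WMMA operations are defined.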
NVLink

• GPUs traditionally utilize a PCIe slot for moving data and instructions from the CPU to the GPU and between GPUs
  – PCIe 3.0 x8: 8 GB/s bandwidth
  – PCIe 3.0 x16: 16 GB/s bandwidth
  – motherboards are often restricted in the number of PCIe lanes they manage: i.e. using multiple PCIe cards will reduce the bandwidth available to each card
• NVLink: high-speed connection between multiple GPUs (a peer-to-peer copy sketch is given at the end of this section)
  – higher bandwidth per link (25 GB/s) than PCIe
  – the V100 supports up to 6 NVLinks per GPU

NVLink2, multi-GPU, no CPU support

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

NVLink2 with CPU support

• only supports IBM POWER9 CPUs at the moment

Image source: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

Other V100 enhancements

• In earlier GPUs
  – a group of threads (a warp) executed a single instruction
  – a single program counter was used in combination with an active mask that specified which threads of the warp are active at any given point in time
  – divergent paths (e.g. if-then-else statements) were therefore serialized: the two sides of a branch executed one after the other, with the threads not on the current path masked off
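As an illustration of such divergence (not from the slides), consider this minimal sketch of a kernel in which the two halves of each warp take different paths; with a single program counter per warp, the hardware runs the two paths back to back:

    __global__ void divergent(float *data)
    {
        int lane = threadIdx.x % 32;     /* position within the 32-thread warp */
        if (lane < 16)
            data[threadIdx.x] *= 2.0f;   /* executed first; lanes 16-31 masked off */
        else
            data[threadIdx.x] += 1.0f;   /* executed second; lanes 0-15 masked off */
    }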

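Tying back to the NVLink slides: whether two GPUs are connected by NVLink or PCIe is transparent to CUDA code, since device-to-device transfers go through the same peer-to-peer runtime API. A minimal sketch (not from the slides), assuming at least two visible devices and omitting error checking:

    #include <cuda_runtime.h>

    int main(void)
    {
        int canAccess = 0;
        /* ask whether device 0 can access device 1's memory directly */
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);   /* flags must be 0 */
        }

        float *d0, *d1;
        size_t bytes = 1 << 20;
        cudaSetDevice(0); cudaMalloc((void**)&d0, bytes);
        cudaSetDevice(1); cudaMalloc((void**)&d1, bytes);

        /* direct GPU-to-GPU copy; uses NVLink when available, PCIe otherwise */
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);

        cudaFree(d1);
        cudaSetDevice(0); cudaFree(d0);
        return 0;
    }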