Computing using GPUs

● Introduction, architecture overview
● CUDA and OpenCL programming models (recap)
● Programming tools: CUDA Fortran, debuggers, profilers, libraries
● Software and applications: Mathematica, MATLAB, Linpack, CFD, imaging, ...
● Visualization, enhancing the browser experience (WebGL)
● GPU clusters, power issues
● Future trends?

High performance computing

● “Grand computational challenges” – Lattice QCD, astrophysics, Earth system models, full-scale traffic simulation, LHC data analysis, etc. – large synergies, sometimes custom hardware
● Lots of compute-intensive scientific and engineering problems (e.g. CFD)
● Supercomputers built increasingly from off-the-shelf hardware
● www.top500.org
● Requirements from multimedia

Trends in architecture

● Feature size (32 nm) and clock frequency (~3 GHz) are approaching their limits (the 'power wall')
● At 3 GHz, light travels a distance of the order of a processor's size in one clock cycle (~10 cm; see the quick check below)
● Power (~160 W per processor) is becoming an issue
● Instruction-level parallelism is already fully exploited
● Chip makers are now actively exploring multi-core architectures
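The ~10 cm figure is just the distance light covers in one clock period:

    d = \frac{c}{f} = \frac{3\times10^{8}\ \mathrm{m/s}}{3\times10^{9}\ \mathrm{s^{-1}}} = 0.1\ \mathrm{m} = 10\ \mathrm{cm}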

Memory

● Memory performance has increased more slowly than CPU performance
● Mitigated by a hierarchy of caches
● A large portion of the processor die is dedicated to memory management

[Figure: die comparison, Nehalem vs. Pentium (1993)]

Multi-core and GPUs

● Chip makers are looking into future multi-core architectures
● See e.g. www.intel.com/go/terascale/

● On the other hand, VGA controllers have been evolving into powerful programmable multiprocessors since the 1990s
● A larger share of the transistor budget is allocated to arithmetic units
● Initially they could be programmed only with graphics-specific languages
● Early graphics processors lacked various features (instructions, memory hardware, floating point) required to support general-purpose computing

GPGPUs

● NVIDIA recognized the potential to use GPUs for general-purpose computing and launched CUDA (Compute Unified Device Architecture) around 2007
● Allows using recent NVIDIA GPUs as general-purpose processors
● Programs are written in an extension of C99

Architecture

Programming model

● A host (normally a CPU) + several devices (GPUs)
● A program is structured into host code and device code (kernels)
● Workflow: move data to the device(s) → load programs (kernels) onto the device and run them → move results back
● Kernel launches, host/device data transfers and coarse synchronization are managed through queues
● SIMT model – kernels are executed by many independent threads
● Multiple threads per processor
● Threads are scheduled in blocks and executed in groups ('warps')
● Threads communicate through local and global memory
● Partitioning of the problem is programmed manually
● Programming for performance – the main challenges:
● Aligned and coalesced memory access
● Use local (shared) memory instead of global memory ('manual caching', see the sketch below)
● Minimize host/device data transfers
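A minimal sketch of the 'manual caching' idea (the kernel name, block size and the tile-reversal task are illustrative, not from the lecture): each block stages a tile of global memory in fast on-chip __shared__ memory with a coalesced load, synchronizes, and then lets its threads reuse the tile.

    __global__ void reverse_tiles(const float *in, float *out)
    {
        extern __shared__ float tile[];          // per-block on-chip buffer ('manual cache')
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[gid];             // coalesced load from global memory
        __syncthreads();                         // wait until the whole tile is in shared memory

        int rev = blockDim.x - 1 - threadIdx.x;  // each thread now reads a different
        out[gid] = tile[rev];                    // element of the tile from fast memory
    }

    // launch, assuming the array length n is a multiple of the block size (256 here):
    //   reverse_tiles<<<n/256, 256, 256*sizeof(float)>>>(d_in, d_out);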

CUDA C

● See e.g. the tutorial at www.drdobbs.com/high-performance-computing/207200659
● NVIDIA nvcc compiler, a C99 extension

#include <stdio.h>
#include <cuda.h>

__global__ void incrementArrayOnDevice(float *a, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + 1.f;
}

int main(void)
{
    float *a_h, *b_h;                 // host arrays
    float *a_d;                       // device array
    int N = 10;
    size_t nBytes = N*sizeof(float);

    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **)&a_d, nBytes);

    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

    int blockSize = 4;
    int nBlocks = (N + blockSize - 1)/blockSize;
    incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

    cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    free(a_h); free(b_h); cudaFree(a_d);
    return 0;
}
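Assuming the listing above is saved as increment.cu (file name chosen here for illustration), it can be built and run with

    nvcc increment.cu -o increment && ./increment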

OpenCL

● Open standard for GPU computing: http://www.khronos.org/opencl/
● Khronos Group, 100+ members
● Supported by AMD
● Less mature than CUDA, but gaining momentum
● Based on JIT compilation of kernels (see the example below)
● Check out http://www.khronos.org/developers/resources/opencl/
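The host-code fragment below (a 1D FFT example; variable declarations and error handling are omitted) shows the typical OpenCL workflow: create a context and command queue, allocate buffers, JIT-compile the kernel source with clBuildProgram, set the kernel arguments, and enqueue the kernel over an index space.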

context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
queue   = clCreateCommandQueue(context, NULL, 0, NULL);

memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA, NULL);

memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL, NULL);

program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);

clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

kernel = clCreateKernel(program, "fft1D_1024", NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

global_work_size[0] = num_entries;
local_work_size[0]  = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
.....

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];
    ......
}
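In a complete program, the kernel source string (fft1D_1024_kernel_src above) is typically read from a file at run time; clBuildProgram performs the JIT compilation, and if it fails the build log can be retrieved with clGetProgramBuildInfo using CL_PROGRAM_BUILD_LOG.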

CUDA Fortran

● From the Portland Group: http://www.pgroup.com/resources/cudafortran.htm

subroutine mmul( A, B, C )
   use cudafor
   real, dimension(:,:) :: A, B, C
   integer :: N, M, L
   real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
   type(dim3) :: dimGrid, dimBlock

   N = size(A,1) ; M = size(A,2) ; L = size(B,2)
   allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
   Adev = A(1:N,1:M)
   Bdev = B(1:M,1:L)
   dimGrid  = dim3( N/16, L/16, 1 )
   dimBlock = dim3( 16, 16, 1 )
   call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
   C(1:N,1:L) = Cdev
   deallocate( Adev, Bdev, Cdev )
end subroutine

attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
   real, device :: A(N,M), B(M,L), C(N,L)
   ...
   real, shared :: Ab(16,16), Bb(16,16)
   real :: Cij

   tx = threadidx%x ; ty = threadidx%y

   .....
end subroutine
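CUDA Fortran source typically goes in a .cuf file and is compiled with the PGI compiler, e.g. pgfortran mmul.cuf (the file name is just an example); for .f90 sources the -Mcuda flag enables CUDA Fortran.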

PyCUDA and PyOpenCL

● PyCUDA: a scripting (Python) wrapper for CUDA C, http://mathema.tician.de/software/pycuda
● Also CUDA.NET, jCUDA, ...

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

mod = comp.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400,1,1))

print(dest - a*b)
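drv.In and drv.Out are PyCUDA argument handlers that copy the NumPy arrays to the device before the call and back afterwards, so no explicit memory management is needed here; the printed result should be an array of zeros.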

Thrust

● STL-like template library for CUDA: code.google.com/p/thrust/

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 32M random numbers on the host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device (846M keys per second on a GeForce GTX 480)
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to the host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}
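Thrust is header-only, so the example compiles directly with nvcc. Other STL-style algorithms follow the same pattern; for instance, a reduction over the same device vector (an extra line for illustration, not part of the original example; it needs <thrust/reduce.h> and <thrust/functional.h>):

    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());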

Debuggers and profilers

● cuda-gdb, cuda-memcheck (part of the NVIDIA SDK; example invocations below)
● Allinea DDT: www.allinea.com/
● Rogue Wave TotalView
● NVIDIA Parallel Nsight (Windows only)
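Typical command lines (./myapp stands in for any CUDA executable):

    cuda-memcheck ./myapp    # detect out-of-bounds and misaligned device memory accesses
    cuda-gdb ./myapp         # source-level debugging of host and device code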

Software exploiting GPUs

● CUDA libraries: CUFFT, CUBLAS, math.h, CURAND, NPP (see the CUFFT sketch below)
● MATLAB and Mathematica
● CST Microwave Studio, ANSYS, AMBER 11, CERN level-2 trigger software, CFD (OpenCurrent), H.264 video codec, biostatistics, finance, medical imaging (Siemens, CERA CT software, MRI software from GE), microtomography (ESRF, CFEL, ...), visualization, autonomous car navigation, ...
● Comparing a Nehalem quad-core to a C2050 (1-2 per system), most applications report 5x-20x speedups
● PCIe is becoming a bottleneck (8 GB/s max)
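As an illustration of how these libraries are called, a minimal CUFFT fragment (NX and the device buffer d_data are assumed to be set up elsewhere; error checking omitted):

    #include <cufft.h>

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);               // plan one 1D complex-to-complex FFT of length NX
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward transform on the GPU
    cufftDestroy(plan);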

Visualization – iray

● iray, from mental images
● Realistic rendering of CAD-generated scenes with ray tracing
● Uses a GPU cluster in the cloud to compute the scene
● Navigable in real time

Visualization – WebGL

● http://www.khronos.org/webgl/
● Accelerated graphics in the browser, embedded in HTML5 (canvas) tags and programmable with a JavaScript API similar to OpenGL (ES)
● Expected to be supported by all major browsers during 2011
● Demos: https://sites.google.com/a/chromium.org/dev/developers/demos-gpu-acceleration-and-webgl
● Probably a great choice for scientific visualization (integrated with the browser)
● http://dan.lecocq.us/wordpress/tag/webglot/ – a high-performance visualization project

GPU clusters

● In the November 2010 Top500 list, #1 (Tianhe-1A, 2.5 PFlops) and #3 (Nebulae, 1.27 PFlops) are GPU clusters
● Longhorn visualization cluster: 256 nodes, 2 GPUs/node, InfiniBand, Lustre, attached to the Ranger supercomputer (#14 in the Top500), www.tacc.utexas.edu
● Power consumption is becoming a serious issue (Jaguar at Oak Ridge: 6.9 MW)
● With current trends, future ('exascale') HPC machines might consume ~100 MW (!)
● Which will cost several $M/year in electricity (see the rough estimate below)
● US data centres accounted for about 2% of energy consumption in 2007
● Memory consumes a substantial portion of the energy for systems with >128 GB (a 4 GB DIMM draws ~5 W)
● GPUs (T1060 at ~4 GFlops/Watt) are more power-efficient than CPUs (Core i7 at ~0.8 GFlops/Watt), but heat remains a serious problem; there is room to improve power efficiency further
● Ongoing research on reducing GPU power consumption by clocking down and by software techniques such as efficient scheduling and virtualization (gVirtuS)
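A rough electricity estimate, assuming an industrial rate of about $0.10/kWh (the rate is an assumption, not from the slides): Jaguar's 6.9 MW corresponds to 6.9 MW × 8760 h ≈ 60 GWh/year, i.e. roughly $6M/year, while a 100 MW exascale machine would scale to of order $90M/year.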

NVIDIA's roadmap

● More energy-efficient GPUs

AMD's roadmap

● An active player in the market
● ATI Stream SDK for GPU programming
● OpenCL support
● New 'Fusion' architecture announced (fusion.amd.com/): an x86 CPU with programmable vector processing engines on a single die (Accelerated Processing Unit)

Conclusions/outlook

● GPGPU computing has happened.
● Will GPU-based HPC be sustainable?
● Will sufficient open source software emerge?
● Will open standards mature and become established?
● Writing software is the hardest part (much of the software mentioned was developed in collaboration with NVIDIA engineers)
● To make use of the available FLOPS, a lot of applied software will need to be rewritten
● Software usually outlives hardware (what did gcc run on in 1987?) → sustainable open programming models (like MPI) are required
● There is space and time to rethink algorithms and approaches
● 2011 promises to be an exciting year for GPU computing

Amdahl's law

● First optimize the parts where the code spends most of its time (Amdahl's law, stated explicitly below)
● Profile your code
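In its standard form (not spelled out on the slide): if a fraction p of the run time can be sped up by a factor s, the overall speedup is

    S = \frac{1}{(1 - p) + p/s}

so even accelerating 90% of the run time (p = 0.9) by an arbitrarily large factor gives at most S = 1/0.1 = 10; the untouched serial part dominates. Hence: profile first, and attack the largest p.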