Parallel programming languages: A new renaissance or a return to the dark ages?

Simon McIntosh-Smith
University of Bristol, Microelectronics Research Group
[email protected]

The Microelectronics Group at the University of Bristol
http://www.cs.bris.ac.uk/Research/Micro/

The team

Simon McIntosh-Smith (Head of Group)
Prof David May
Prof Dhiraj Pradhan
Dr Jose Nunez-Yanez
Dr Kerstin Eder
Dr Simon Hollis

Group expertise

Energy Aware COmputing (EACO):
• Multi-core and many-core computer architectures
• High Performance Computing (GPUs, OpenCL)
• Network on Chip (NoC)
• Reconfigurable architectures
• Design verification (formal and simulation-based)
• Formal specification and analysis
• Silicon variation
• Fault tolerant design (hardware and software)

Heterogeneous many-core
www.innovateuk.org/mathsktn

Heterogeneous many-core: XMOS

Didn’t use to be a niche?

When I were a lad…

But now parallelism is mainstream

Quad-core ARM Cortex-A9 CPU
Quad-core Imagination SGX543MP4+ GPU

HPC stronger than ever

548,352 processor cores delivering 8.16 PetaFLOPS (8.16 × 10^15 floating-point operations per second)

Big computing is mainstream too

http://www.nytimes.com/2006/06/14/technology/14search.html

Report: Google Uses About 900,000 Servers (Aug 1st 2011)
http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/

A renaissance in parallel programming

OpenMP, OpenCL, Erlang, Unified Parallel C, Fortress, XC, Go, HMPP, CHARM++, CUDA, Co-Array Fortran, Chapel, Linda, X10, MPI, Pthreads, C++ AMP, …

Groupings of || languages

Partitioned Global Address Space (PGAS):
• Fortress
• X10
• Chapel
• Co-array Fortran
• Unified Parallel C

GPU languages:
• OpenCL
• CUDA
• HMPP

Object oriented:
• C++ AMP
• CHARM++

CSP: XC

Multi-threaded:
• Cilk
• Go

Message passing: MPI

Directive-based: OpenMP

2nd fastest computer in the world

Emerging GPGPU standards

• OpenCL, DirectCompute, C++ AMP, …

• NVIDIA’s ‘de facto’ standard CUDA

OpenCL

It’s a Heterogeneous world

A modern system includes:
– One or more CPUs
– One or more GPUs
– DSP processors
– …other?

(Diagram: CPUs and a GPU attached to the GMCH, which connects to DRAM and the ICH.)

OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform.

GMCH = graphics memory control hub, ICH = Input/output control hub

Industry Standards for Programming Heterogeneous Platforms

CPUs: multiple cores driving performance increases. GPUs: increasingly general-purpose data-parallel computing. Heterogeneous computing is the emerging intersection of the two, where multi-processor programming (e.g. OpenMP) meets graphics shading languages.

OpenCL – Open Computing Language
An open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing: CPUs, GPUs, and other processors.

The origins of OpenCL
Apple, tired of recoding for many-core CPUs and GPUs, wrote a rough draft (straw-man) API and pushed vendors to standardize. The Khronos Compute group was formed, including AMD and ATI (newly merged, needing commonality across products), NVIDIA (a GPU vendor wanting to steal market share from CPUs), Intel (a CPU vendor wanting to steal market share from GPUs), plus IBM, Ericsson, Nokia, Sony, IMG Tech, Freescale, TI and many more. OpenCL 1.0 was released in Dec 2008. Third party names are the property of their owners.

System level architecture of OpenCL

OpenCL programs
↓
OpenCL dynamic libraries
↓
Vendor-supplied drivers
↓
Devices (GPUs – could also be multi-core CPUs)

OpenCL Platform Model

• One Host + one or more Compute Devices
• Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements

The BIG idea behind OpenCL
• Replace loops with functions (a kernel) executing at each point in a problem domain (index space).
• E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.

Traditional loop:

void trad_mul(const int n,
              const float *a,
              const float *b,
              float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];     // multiply each element pair in turn
}

Data-parallel OpenCL:

kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *c)
{
    int id = get_global_id(0);  // one work-item per element
    c[id] = a[id] * b[id];
}
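To make the index-space idea concrete, here is a minimal host-side sketch (not from the slides) of launching a kernel over that 1024 x 1024 image domain; it assumes a cmd_queue and kernel have already been created as shown on the following slides.

// Launch one work-item per pixel of a 1024 x 1024 image.
size_t global_work_size[2] = {1024, 1024};          // whole problem space
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             2,                     // 2-D index space
                             NULL,                  // no global offset
                             global_work_size,
                             NULL,                  // runtime picks the work-group size
                             0, NULL, NULL);        // no event dependencies

Each of the 1,048,576 work-items then finds its own pixel with get_global_id(0) and get_global_id(1).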

An N-dimensional domain of work-items
• Global dimensions: 1024 x 1024 (whole problem space)
• Local dimensions: 128 x 128 (work-group, executes together)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize outside of a work-group
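As an illustration (a sketch, not from the slides; the kernel and variable names are invented), the ID functions and a work-group barrier look like this:

__kernel void ids_demo(__global float *out)
{
    size_t gx = get_global_id(0);    // position in the 1024 x 1024 global space
    size_t gy = get_global_id(1);
    size_t lx = get_local_id(0);     // position within the 128 x 128 work-group

    out[gy * get_global_size(0) + gx] = (float)lx;

    // Synchronization is only possible within a work-group:
    barrier(CLK_LOCAL_MEM_FENCE);    // every work-item in this group waits here
}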

OpenCL Memory Model

• Private Memory: per work-item
• Local Memory: shared within a work-group
• Global / Constant Memory: visible to all work-groups
• Host Memory: on the CPU

Memory management is explicit: you must move data from host → global → local and back.
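A hedged kernel sketch (not from the slides; names are illustrative) of that explicit movement, staging data from global into local memory before using it:

__kernel void stage_to_local(__global const float *in,
                             __global float *out,
                             __local float *tile)   // one tile per work-group
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];             // global -> local (explicit copy)
    barrier(CLK_LOCAL_MEM_FENCE);    // make the tile visible to the whole group

    out[gid] = tile[lid] * 2.0f;     // work on the fast local copy, write back
}

The host sizes the __local argument with clSetKernelArg(kernel, 2, bytes, NULL).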

OpenCL C Language
• Derived from ISO C99
• A “supersubset” (Tim Mattson, Intel)
  • No standard C99 headers, function pointers, recursion, variable-length arrays, or bit fields
• Additions to the language for parallelism
  • Work-items and work-groups
  • Vector types
  • Synchronization
• Address space qualifiers, e.g. global, local
• Optimized image access
• Built-in functions
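For instance, a sketch (assumed, not from the slides) combining two of the additions listed above, vector types and built-in functions:

__kernel void vec4_madd(__global const float4 *a,
                        __global const float4 *b,
                        __global float4 *c)
{
    int id = get_global_id(0);
    // float4 arithmetic operates on all four lanes at once;
    // fma() is one of the built-in math functions.
    c[id] = fma(a[id], b[id], (float4)(1.0f));
}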

Contexts and Queues

Contexts are used to contain and manage the state of the “world”

Contexts are defined & manipulated by the host and contain:
• Devices
• Kernels - OpenCL functions
• Program objects - kernel sources and executables
• Memory objects

Command-queue - coordinates execution of kernels:
• Kernel execution commands
• Memory commands - transfer or mapping of memory object data
• Synchronization commands - constrain the order of commands
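A minimal host-side sketch (not from the slides; it assumes device, kernel, buf, n and dst already exist, and omits error checking) showing a context, a command-queue, and an event used as a synchronization command:

cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

cl_event done;
size_t global = n;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL,
                       0, NULL, &done);      // kernel execution command

// Memory command, constrained to start only after the kernel finishes:
clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), dst,
                    1, &done, NULL);
clReleaseEvent(done);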

OpenCL architecture

Inside a context (managed by the host, spanning CPU and GPU devices):
• Programs: kernel source plus per-device binaries (e.g. a CPU program binary and a GPU program binary)
• Kernels: e.g. dp_mul, with its argument values arg[0], arg[1], arg[2]
• Memory objects: buffers and images
• Command queues: one per compute device, in-order or out-of-order

__kernel void dp_mul(global const float *a,
                     global const float *b,
                     global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

Example: vector addition

• The “hello world” program of data parallel programming is a program to add two vectors

C[i] = A[i] + B[i] for i=0 to N-1

• For the OpenCL solution, there are two parts:
  • Kernel code
  • Host code

Vector Addition - Kernel

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

Vector Addition - Host Program

// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values (note: size comes before the pointer)
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, NULL, 0, NULL, NULL);

// read the output array back to the host
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

Vector Addition - Host Program

The same host program again, this time annotated by phase:
1. Define the platform and queues
2. Define memory objects
3. Create the program
4. Build the program
5. Create and set up the kernel
6. Execute the kernel
7. Read results back to the host

It’s complicated, but most of this is “boilerplate” and not as bad as it looks.
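One common way to tame that boilerplate (a sketch of a general pattern, not something from the slides) is a small error-checking macro so every call doesn’t need its own if statement:

#define CL_CHECK(call)                                        \
    do {                                                      \
        cl_int _e = (call);                                   \
        if (_e != CL_SUCCESS) {                               \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",     \
                    _e, __FILE__, __LINE__);                  \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)

// Usage:
// CL_CHECK(clBuildProgram(program, 0, NULL, NULL, NULL, NULL));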

C++ to the rescue

#include <CL/cl.hpp>   // the Khronos C++ wrapper API
#include <iostream>
#include <string>

const std::string hwstring("Hello World\n");
using namespace cl;

int main(void)
{
    char *outH = new char[hwstring.length()+1];
    Buffer outCL(CL_MEM_WRITE_ONLY, hwstring.length()+1);

    // loadProgram: helper that reads the kernel source file
    Program program(loadProgram("helloWorld.cl"));
    program.build();
    auto hello = make_kernel<Buffer>(program, "hello");
    hello(EnqueueArgs(NDRange(hwstring.length()+1)), outCL);
    enqueueReadBuffer(outCL, CL_TRUE, 0, hwstring.length()+1, outH);
    std::cout << outH << std::endl;
}

C++ API published on the Khronos website: http://www.khronos.org/registry/cl/
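The helloWorld.cl kernel itself isn’t shown on the slide; a minimal sketch consistent with the host code might look like this (the message contents and the one-character-per-work-item scheme are assumptions):

__constant char msg[] = "Hello World\n";     // 13 bytes including the NUL

__kernel void hello(__global char *out)
{
    size_t i = get_global_id(0);             // one work-item per character
    out[i] = msg[i];
}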

OpenCL case study

Bristol GPU-based drug docking success story – BUDE

Collaborators: Richard Sessions, Amaurys Avila Ibarra

Supermicro GPU server

Relative performance

1,120 seconds per simulation
162 seconds per simulation

Less than 2 days to screen a library of 1 million drug candidates on 1000 GPUs

Relative energy efficiency

0.034 kWh per simulation

0.011 kWh per simulation

0.011 kWh = 0.16 pence per simulation; 1 million simulations → £1,600 on energy for one experiment

Summary

• Parallel languages are going through a renaissance

• Not just for the niche high end any more

• No silver bullets, lots of “wheel reinventing”

• In HPC, GPUs being adopted quickly at the high-end

• In embedded computing, OpenCL gaining ground

• Everything going heterogeneous many-core…

http://www.cs.bris.ac.uk/Research/Micro/
