Parallel programming languages: A new renaissance or a return to the dark ages?
Simon McIntosh-Smith University of Bristol Microelectronics Research Group [email protected]
© Simon McIntosh-Smith
The Microelectronics Group at the University of Bristol
http://www.cs.bris.ac.uk/Research/Micro/
The team

• Simon McIntosh-Smith (Head of Group)
• Dr Jose Nunez-Yanez
• Prof David May
• Dr Kerstin Eder
• Prof Dhiraj Pradhan
• Dr Simon Hollis

Group expertise

Energy Aware COmputing (EACO):
• Multi-core and many-core computer architectures
• High Performance Computing (GPUs, OpenCL)
• Network on Chip (NoC)
• Reconfigurable architectures
• Design verification (formal and simulation-based)
• Formal specification and analysis
• Silicon process variation
• Fault tolerant design (hardware and software)
Heterogeneous many-core
www.innovateuk.org/mathsktn
Heterogeneous many-core: XMOS
Didn’t parallel computing used to be a niche?
When I were a lad…
But now parallelism is mainstream
Quad-core ARM Cortex A9 CPU
Quad-core SGX543MP4+ Imagination GPU
HPC stronger than ever

548,352 processor cores delivering 8.16 PetaFLOPS (8.16 × 10^15 FLOPS)
Big computing is mainstream too
http://www.nytimes.com/2006/06/14/technology/14search.html
Report: Google Uses About 900,000 Servers (Aug 1st 2011)
http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/
A renaissance in parallel programming

OpenMP, OpenCL, Erlang, Unified Parallel C, Fortress, XC, Go, Cilk, HMPP, CHARM++, CUDA, Co-Array Fortran, Chapel, Linda, X10, MPI, Pthreads, C++ AMP
Groupings of || languages

Partitioned Global Address Space (PGAS):
• Fortress
• X10
• Chapel
• Co-array Fortran
• Unified Parallel C

GPU languages:
• OpenCL
• CUDA
• HMPP

Object oriented:
• C++ AMP
• CHARM++

CSP: XC
Multi-threaded: Cilk, Go
Message passing: MPI
Shared memory: OpenMP
2nd fastest computer in the world
Emerging GPGPU standards
• OpenCL, DirectCompute, C++ AMP, …
• NVIDIA’s ‘de facto’ standard CUDA
OpenCL
It’s a Heterogeneous world

A modern system includes:
• One or more CPUs
• One or more GPUs
• DSP processors
• …other?
(Typical system diagram: CPUs and a GPU attached to the GMCH, DRAM, and ICH. GMCH = graphics memory control hub, ICH = input/output control hub.)

OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform.

Industry Standards for Programming Heterogeneous Platforms
CPUs (multi-processor programming, e.g. OpenMP): multiple cores driving performance increases.
GPUs (graphics APIs and shading languages): increasingly general-purpose data-parallel computing.
The emerging intersection of the two: Heterogeneous Computing.
OpenCL – Open Computing Language

Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing: CPUs, GPUs, and other processors.

The origins of OpenCL

Apple was tired of recoding for many-core CPUs and GPUs, pushed the vendors to standardize, and wrote a rough draft straw man API. The Khronos Compute group was formed, delivering the standard in Dec 2008. Among the participants:
• AMD/ATI – had merged, needed commonality across products
• Nvidia – GPU vendor, wants to steal market share from the CPU
• Intel – CPU vendor, wants to steal market share from the GPU
• Ericsson, Nokia, IBM, Sony, IMG Tech, Freescale, TI + many more

Third party names are the property of their owners.

System level architecture of OpenCL
OpenCL programs
OpenCL dynamic libraries
Vendor-supplied drivers
Could also be multi-core CPUs
OpenCL Platform Model

• One Host + one or more Compute Devices
• Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements

The BIG idea behind OpenCL

• Replace loops with functions (a kernel) executing at each point in a problem domain (index space).
• E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.

Traditional loops:

    void trad_mul(const int n,
                  const float *a,
                  const float *b,
                  float *c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }

Data Parallel OpenCL:

    kernel void dp_mul(global const float *a,
                       global const float *b,
                       global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }

An N-dimensional domain of work-items

• Global Dimensions: 1024 x 1024 (whole problem space)
• Local Dimensions: 128 x 128 (work-group, executes together)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize outside of a work-group

OpenCL Memory Model

• Private Memory – per Work-Item
• Local Memory – shared within a Work-Group
• Global / Constant Memory – visible to all Work-Groups on a Compute Device
• Host Memory – on the CPU

Memory management is explicit: you must move data from host → global → local and back.

OpenCL C Language

• Derived from ISO C99
• A “supersubset” (Tim Mattson, Intel)
• No standard C99 headers, function pointers, recursion, variable length arrays, or bit fields
• Additions to the language for parallelism:
  • Work-items and work-groups
  • Vector types
  • Synchronization
  • Address space qualifiers, e.g. global, local
• Optimized image access
• Built-in functions

Contexts and Queues

Contexts are used to contain and manage the state of the “world”. Contexts are defined & manipulated by the host and contain:
• Devices
• Kernels – OpenCL functions
• Program objects – kernel sources and executables
• Memory objects

Command-queues coordinate execution of kernels:
• Kernel execution commands
• Memory commands – transfer or mapping of memory object data
• Synchronization commands – constrain the order of commands

OpenCL architecture

A context spans the CPU and GPU and holds programs (with both CPU and GPU binaries), kernels such as dp_mul with their argument values, memory objects (buffers and images), and in-order or out-of-order command queues per compute device:

    __kernel void dp_mul(global const float *a,
                         global const float *b,
                         global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }

Example: vector addition

• The “hello world” program of data parallel programming is a program to add two vectors:

      C[i] = A[i] + B[i]  for i = 0 to N-1

• For the OpenCL solution, there are two parts:
  • Kernel code
  • Host code

Vector Addition – Kernel

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }

Vector Addition – Host Program

    // Define platform and queues:
    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                      NULL, NULL, NULL);

    // get the list of GPU devices associated with the context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // Define memory objects:
    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context,
                     CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                     sizeof(cl_float)*n, srcA, NULL);
    memobjs[1] = clCreateBuffer(context,
                     CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                     sizeof(cl_float)*n, srcB, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                     sizeof(cl_float)*n, NULL, NULL);

    // Create and build the program:
    program = clCreateProgramWithSource(context, 1,
                     &program_source, NULL, NULL);
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create and set up the kernel:
    kernel = clCreateKernel(program, "vec_add", NULL);

    // set the arg values
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

    // Execute the kernel:
    // set work-item dimensions, then enqueue
    global_work_size[0] = n;
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                     global_work_size, NULL, 0, NULL, NULL);

    // Read results back to the host:
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                     n*sizeof(cl_float), dst, 0, NULL, NULL);

It’s complicated, but most of this is “boilerplate” and not as bad as it looks.
C++ to the rescue

    #include <…>
    Program program(loadProgram("helloWorld.cl"));
    program.build();
    std::function<…>

OpenCL case study

Bristol GPU-based drug docking success story – BUDE
Collaborators: Richard Sessions, Amaurys Avila Ibarra

Supermicro GPU server

Relative performance
• 1,120 seconds per simulation vs 162 seconds per simulation
• Less than 2 days to screen a library of 1 million drug candidates on 1000 GPUs

Relative energy efficiency
• 0.034 kWh per simulation vs 0.011 kWh per simulation
• 0.011 kWh = 0.16 pence per simulation
• 1 million simulations → £1,600 on energy for one experiment

Summary
• Parallel languages are going through a renaissance
• Not just for the niche high end any more
• No silver bullets, lots of “wheel reinventing”
• In HPC, GPUs being adopted quickly at the high-end
• In embedded computing, OpenCL gaining ground
• Everything going heterogeneous many-core…

http://www.cs.bris.ac.uk/Research/Micro/