Parallel programming languages: A new renaissance or a return to the dark ages?

Simon McIntosh-Smith
University of Bristol, Microelectronics Research Group
[email protected]

The Microelectronics Group at the University of Bristol
http://www.cs.bris.ac.uk/Research/Micro/

The team

Simon McIntosh-Smith (Head of Group)
Prof David May
Prof Dhiraj Pradhan
Dr Jose Nunez-Yanez
Dr Kerstin Eder
Dr Simon Hollis

Group expertise

Energy Aware COmputing (EACO):
• Multi-core and many-core computer architectures
• High Performance Computing (GPUs, OpenCL)
• Network on Chip (NoC)
• Reconfigurable architectures
• Design verification (formal and simulation-based)
• Formal specification and analysis
• Silicon variation
• Fault tolerant design (hardware and software)

Heterogeneous many-core
www.innovateuk.org/mathsktn

Heterogeneous many-core: XMOS

Didn’t use to be a niche?

When I were a lad…

But now parallelism is mainstream

Quad-core ARM Cortex-A9 CPU
Quad-core Imagination SGX543MP4+ GPU

HPC stronger than ever

548,352 processor cores delivering 8.16 PetaFLOPS (8.16 × 10^15 floating-point operations per second)

Big computing is mainstream too

http://www.nytimes.com/2006/06/14/technology/14search.html

Report: Google Uses About 900,000 Servers (Aug 1st 2011)
http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/

A renaissance in parallel programming

OpenMP, OpenCL, Erlang, Unified Parallel C, Fortress, XC, Go, HMPP, CHARM++, CUDA, Co-Array Fortran, Chapel, Linda, X10, MPI, Pthreads, C++ AMP, …

Groupings of || languages

Partitioned Global Address Space (PGAS):
• Fortress
• X10
• Chapel
• Co-array Fortran
• Unified Parallel C

GPU languages:
• OpenCL
• CUDA
• HMPP

Object oriented:
• C++ AMP
• CHARM++

CSP: XC

Multi-threaded:
• Cilk
• Go

Message passing: MPI

Directive-based: OpenMP

2nd fastest computer in the world

Emerging GPGPU standards

• OpenCL, DirectCompute, C++ AMP, …

• NVIDIA’s ‘de facto’ standard CUDA

OpenCL

It’s a Heterogeneous world

A modern system includes:
– One or more CPUs
– One or more GPUs
– DSP processors
– …other?

(Diagram: CPUs and a GPU attached to the GMCH, which connects to DRAM and the ICH.)

OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform.

GMCH = graphics memory control hub, ICH = Input/output control hub

Industry Standards for Programming Heterogeneous Platforms

CPUs: multiple cores driving performance increases. GPUs: increasingly general-purpose data-parallel computing. Heterogeneous computing is the emerging intersection of the two, where multi-processor programming (e.g. OpenMP) meets graphics shading languages.

OpenCL – Open Computing Language
An open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing: CPUs, GPUs, and other processors.

The origins of OpenCL
Apple, tired of recoding for many-core CPUs and GPUs, wrote a rough draft (straw-man) API and pushed vendors to standardize. The Khronos Compute group was formed, including AMD and ATI (newly merged, needing commonality across products), NVIDIA (a GPU vendor wanting to steal market share from CPUs), Intel (a CPU vendor wanting to steal market share from GPUs), plus IBM, Ericsson, Nokia, Sony, IMG Tech, Freescale, TI and many more. OpenCL 1.0 was released in Dec 2008. Third party names are the property of their owners.

System level architecture of OpenCL

OpenCL programs
↓
OpenCL dynamic libraries
↓
Vendor-supplied drivers
↓
Devices (GPUs – could also be multi-core CPUs)

OpenCL Platform Model

• One Host + one or more Compute Devices
• Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements

The BIG idea behind OpenCL
• Replace loops with functions (a kernel) executing at each point in a problem domain (index space).
• E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.

Traditional loop:

void trad_mul(const int n,
              const float *a,
              const float *b,
              float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];     // multiply each element pair in turn
}

Data-parallel OpenCL:

kernel void dp_mul(global const float *a,
                   global const float *b,
                   global float *c)
{
    int id = get_global_id(0);  // one work-item per element
    c[id] = a[id] * b[id];
}
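To make the index-space idea concrete, here is a minimal host-side sketch (not from the slides) of launching a kernel over that 1024 x 1024 image domain; it assumes a cmd_queue and kernel have already been created as shown on the following slides.

// Launch one work-item per pixel of a 1024 x 1024 image.
size_t global_work_size[2] = {1024, 1024};          // whole problem space
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             2,                     // 2-D index space
                             NULL,                  // no global offset
                             global_work_size,
                             NULL,                  // runtime picks the work-group size
                             0, NULL, NULL);        // no event dependencies

Each of the 1,048,576 work-items then finds its own pixel with get_global_id(0) and get_global_id(1).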

An N-dimensional domain of work-items
• Global dimensions: 1024 x 1024 (whole problem space)
• Local dimensions: 128 x 128 (work-group, executes together)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize outside of a work-group
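As an illustration (a sketch, not from the slides; the kernel and variable names are invented), the ID functions and a work-group barrier look like this:

__kernel void ids_demo(__global float *out)
{
    size_t gx = get_global_id(0);    // position in the 1024 x 1024 global space
    size_t gy = get_global_id(1);
    size_t lx = get_local_id(0);     // position within the 128 x 128 work-group

    out[gy * get_global_size(0) + gx] = (float)lx;

    // Synchronization is only possible within a work-group:
    barrier(CLK_LOCAL_MEM_FENCE);    // every work-item in this group waits here
}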

OpenCL Memory Model

• Private Memory: per work-item
• Local Memory: shared within a work-group
• Global / Constant Memory: visible to all work-groups
• Host Memory: on the CPU

Memory management is explicit: you must move data from host → global → local and back.
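A hedged kernel sketch (not from the slides; names are illustrative) of that explicit movement, staging data from global into local memory before using it:

__kernel void stage_to_local(__global const float *in,
                             __global float *out,
                             __local float *tile)   // one tile per work-group
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];             // global -> local (explicit copy)
    barrier(CLK_LOCAL_MEM_FENCE);    // make the tile visible to the whole group

    out[gid] = tile[lid] * 2.0f;     // work on the fast local copy, write back
}

The host sizes the __local argument with clSetKernelArg(kernel, 2, bytes, NULL).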

OpenCL C Language
• Derived from ISO C99
• A “supersubset” (Tim Mattson, Intel)
  • No standard C99 headers, function pointers, recursion, variable-length arrays, or bit fields
• Additions to the language for parallelism
  • Work-items and work-groups
  • Vector types
  • Synchronization
• Address space qualifiers, e.g. global, local
• Optimized image access
• Built-in functions
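For instance, a sketch (assumed, not from the slides) combining two of the additions listed above, vector types and built-in functions:

__kernel void vec4_madd(__global const float4 *a,
                        __global const float4 *b,
                        __global float4 *c)
{
    int id = get_global_id(0);
    // float4 arithmetic operates on all four lanes at once;
    // fma() is one of the built-in math functions.
    c[id] = fma(a[id], b[id], (float4)(1.0f));
}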

Contexts and Queues

Contexts are used to contain and manage the state of the “world”

Contexts are defined & manipulated by the host and contain:
• Devices
• Kernels - OpenCL functions
• Program objects - kernel sources and executables
• Memory objects

Command-queue - coordinates execution of kernels:
• Kernel execution commands
• Memory commands - transfer or mapping of memory object data
• Synchronization commands - constrain the order of commands
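A minimal host-side sketch (not from the slides; it assumes device, kernel, buf, n and dst already exist, and omits error checking) showing a context, a command-queue, and an event used as a synchronization command:

cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

cl_event done;
size_t global = n;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL,
                       0, NULL, &done);      // kernel execution command

// Memory command, constrained to start only after the kernel finishes:
clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), dst,
                    1, &done, NULL);
clReleaseEvent(done);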

OpenCL architecture

Inside a context (managed by the host, spanning CPU and GPU devices):
• Programs: kernel source plus per-device binaries (e.g. a CPU program binary and a GPU program binary)
• Kernels: e.g. dp_mul, with its argument values arg[0], arg[1], arg[2]
• Memory objects: buffers and images
• Command queues: one per compute device, in-order or out-of-order

__kernel void dp_mul(global const float *a,
                     global const float *b,
                     global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

Example: vector addition

• The “hello world” program of data parallel programming is a program to add two vectors

C[i] = A[i] + B[i] for i=0 to N-1

• For the OpenCL solution, there are two parts:
  • Kernel code
  • Host code

Vector Addition - Kernel

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

Vector Addition - Host Program

// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values (note: size comes before the pointer)
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, NULL, 0, NULL, NULL);

// read the output array back to the host
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

Vector Addition - Host Program

The same host program again, this time annotated by phase:
1. Define the platform and queues
2. Define memory objects
3. Create the program
4. Build the program
5. Create and set up the kernel
6. Execute the kernel
7. Read results back to the host

It’s complicated, but most of this is “boilerplate” and not as bad as it looks.
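One common way to tame that boilerplate (a sketch of a general pattern, not something from the slides) is a small error-checking macro so every call doesn’t need its own if statement:

#define CL_CHECK(call)                                        \
    do {                                                      \
        cl_int _e = (call);                                   \
        if (_e != CL_SUCCESS) {                               \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",     \
                    _e, __FILE__, __LINE__);                  \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)

// Usage:
// CL_CHECK(clBuildProgram(program, 0, NULL, NULL, NULL, NULL));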

C++ to the rescue

#include <CL/cl.hpp>   // the Khronos C++ wrapper API
#include <iostream>
#include <string>

const std::string hwstring("Hello World\n");
using namespace cl;

int main(void)
{
    char *outH = new char[hwstring.length()+1];
    Buffer outCL(CL_MEM_WRITE_ONLY, hwstring.length()+1);

    // loadProgram: helper that reads the kernel source file
    Program program(loadProgram("helloWorld.cl"));
    program.build();
    auto hello = make_kernel<Buffer>(program, "hello");
    hello(EnqueueArgs(NDRange(hwstring.length()+1)), outCL);
    enqueueReadBuffer(outCL, CL_TRUE, 0, hwstring.length()+1, outH);
    std::cout << outH << std::endl;
}

C++ API published on the Khronos website: http://www.khronos.org/registry/cl/
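The helloWorld.cl kernel itself isn’t shown on the slide; a minimal sketch consistent with the host code might look like this (the message contents and the one-character-per-work-item scheme are assumptions):

__constant char msg[] = "Hello World\n";     // 13 bytes including the NUL

__kernel void hello(__global char *out)
{
    size_t i = get_global_id(0);             // one work-item per character
    out[i] = msg[i];
}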

OpenCL case study

Bristol GPU-based drug docking success story – BUDE

Collaborators: Richard Sessions, Amaurys Avila Ibarra

Supermicro GPU server

Relative performance

1,120 seconds per simulation
162 seconds per simulation

Less than 2 days to screen a library of 1 million drug candidates on 1000 GPUs

Relative energy efficiency

0.034 kWh per simulation

0.011 kWh per simulation

0.011 kWh = 0.16 pence per simulation; 1 million simulations → £1,600 on energy for one experiment

Summary

• Parallel languages are going through a renaissance

• Not just for the niche high end any more

• No silver bullets, lots of “wheel reinventing”

• In HPC, GPUs being adopted quickly at the high-end

• In embedded computing, OpenCL gaining ground

• Everything going heterogeneous many-core…

http://www.cs.bris.ac.uk/Research/Micro/
