
Parallel programming languages: A new renaissance or a return to the dark ages?

Simon McIntosh-Smith
University of Bristol Microelectronics Research Group
[email protected]
© Simon McIntosh-Smith

The Microelectronics Group at the University of Bristol
http://www.cs.bris.ac.uk/Research/Micro/

The team
Simon McIntosh-Smith (Head of Group), Prof David May, Prof Dhiraj Pradhan, Dr Jose Nunez-Yanez, Dr Kerstin Eder, Dr Simon Hollis

Group expertise
Energy Aware COmputing (EACO):
• Multi-core and many-core computer architectures
• High Performance Computing (GPUs, OpenCL)
• Network on Chip (NoC)
• Reconfigurable architectures
• Design verification (formal and simulation-based)
• Formal specification and analysis
• Silicon process variation
• Fault tolerant design (hardware and software)

Heterogeneous many-core
[Image slides: www.innovateuk.org/mathsktn; XMOS]

Didn’t parallel computing use to be a niche?
When I were a lad…

But now parallelism is mainstream
Quad-core ARM Cortex-A9 CPU; quad-core Imagination SGX543MP4+ GPU

HPC stronger than ever
548,352 processor cores delivering 8.16 PetaFLOPS (8.16 × 10^15 floating-point operations per second)

Big computing is mainstream too
http://www.nytimes.com/2006/06/14/technology/14search.html
Report: Google Uses About 900,000 Servers (Aug 1st 2011)
http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/

A renaissance in parallel programming
OpenMP, OpenCL, Erlang, Unified Parallel C, Fortress, XC, Go, Cilk, HMPP, CHARM++, CUDA, Co-Array Fortran, Chapel, Linda, X10, MPI, Pthreads, C++ AMP

Groupings of || languages
• Partitioned Global Address Space (PGAS): Fortress, X10, Chapel, Co-array Fortran, Unified Parallel C
• GPU languages: OpenCL, CUDA, HMPP
• Object oriented: C++ AMP, CHARM++
• CSP: XC
• Multi-threaded: Cilk, Go
• Message passing: MPI
• Shared memory: OpenMP

2nd fastest computer in the world
[Image slide]

Emerging GPGPU standards
• OpenCL, DirectCompute, C++ AMP, …
• NVIDIA’s ‘de facto’ standard CUDA

OpenCL

It’s a heterogeneous world
A modern system includes:
• One or more CPUs
• One or more GPUs
• DSP processors
• …other?
OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform.
(GMCH = graphics memory control hub; ICH = input/output control hub)

Industry standards for programming heterogeneous platforms
CPUs: multiple cores driving performance increases. GPUs: increasingly general-purpose data-parallel computing. Their intersection, heterogeneous computing, is where multi-processor programming models (e.g. OpenMP) meet graphics APIs and shading languages.

OpenCL – Open Computing Language
An open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing: CPUs, GPUs, and other processors.

The origins of OpenCL
Apple was tired of recoding for many-core CPUs and GPUs, and pushed vendors to standardize. Apple wrote a rough draft “straw man” API, and the Khronos Compute group was formed, including Ericsson, Nokia, AMD, IBM, Sony, ATI, IMG Tech, Freescale, Nvidia, TI, Intel and many more. AMD and ATI had merged and needed commonality across products; Nvidia, a GPU vendor, wanted to steal market share from CPUs; Intel, a CPU vendor, wanted to steal market share from GPUs. The standard arrived in Dec 2008.
Third party names are the property of their owners.

System level architecture of OpenCL
OpenCL programs sit on top of OpenCL dynamic libraries, which in turn sit on vendor-supplied drivers for the devices (the devices could also be multi-core CPUs).

OpenCL Platform Model
• One Host plus one or more Compute Devices
• Each Compute Device is composed of one or more Compute Units
• Each Compute Unit is further divided into one or more Processing Elements
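To make the platform model concrete, here is a minimal host-side sketch (my illustration, not from the deck) that walks the hierarchy with standard OpenCL 1.x calls; the cap of eight devices is an arbitrary assumption to keep the sketch short:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        /* Platform -> devices -> compute units, per the platform model. */
        cl_platform_id platform;
        cl_device_id devices[8];      /* assumption: at most 8 devices */
        cl_uint num_devices;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);
        if (num_devices > 8) num_devices = 8;

        for (cl_uint i = 0; i < num_devices; i++) {
            char name[256];
            cl_uint units;
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(units), &units, NULL);
            printf("Device %u: %s (%u compute units)\n", i, name, units);
        }
        return 0;
    }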
The BIG idea behind OpenCL
• Replace loops with functions (a kernel) executing at each point in a problem domain (index space).
• E.g., to process a 1024 x 1024 image, run one kernel invocation per pixel: 1024 x 1024 = 1,048,576 kernel executions.

Traditional loops:

    void trad_mul(const int n,
                  const float *a,
                  const float *b,
                  float *c)
    {
      int i;
      for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
    }

Data-parallel OpenCL:

    kernel void dp_mul(global const float *a,
                       global const float *b,
                       global float *c)
    {
      int id = get_global_id(0);
      c[id] = a[id] * b[id];
    }
    // execute over “n” work-items

An N-dimensional domain of work-items
• Global dimensions: 1024 x 1024 (the whole problem space)
• Local dimensions: 128 x 128 (a work-group, which executes together)
• Synchronization between work-items is possible only within work-groups, using barriers and memory fences
• Work-items cannot synchronize outside of a work-group

OpenCL Memory Model
• Private Memory: per work-item
• Local Memory: shared within a work-group
• Global / Constant Memories: visible to all work-groups
• Host Memory: on the CPU
Memory management is explicit: you must move data from host → global → local and back.
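As an illustration of that explicit data movement (a sketch only, not from the deck), the kernel below computes one partial sum per work-group by staging data through local memory. It assumes the work-group size is a power of two and that the host allocates the scratch buffer with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL):

    // Sketch: one partial sum per work-group via local memory.
    __kernel void group_sum(__global const float *in,
                            __global float *partial_sums,
                            __local float *scratch)
    {
      int lid = get_local_id(0);

      scratch[lid] = in[get_global_id(0)];   // global -> local
      barrier(CLK_LOCAL_MEM_FENCE);          // whole work-group syncs

      // Tree reduction within the work-group
      for (int offset = (int)get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
          scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
      }

      if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];  // local -> global
    }

Note that only work-item 0 writes the group’s result: combining the partial sums is left to the host (or a second kernel), because, as above, work-items cannot synchronize outside their work-group.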
OpenCL C Language
• Derived from ISO C99: a “supersubset” (Tim Mattson, Intel)
• No standard C99 headers, function pointers, recursion, variable-length arrays, or bit fields
• Additions to the language for parallelism:
  • Work-items and work-groups
  • Vector types
  • Synchronization
  • Address space qualifiers, e.g. global and local
• Optimized image access
• Built-in functions

Contexts and Queues
Contexts are used to contain and manage the state of the “world”. They are defined and manipulated by the host, and contain:
• Devices
• Kernels: OpenCL functions
• Program objects: kernel sources and executables
• Memory objects
A command-queue coordinates the execution of kernels via:
• Kernel execution commands
• Memory commands: transfer or mapping of memory object data
• Synchronization commands: constrain the order of commands

OpenCL architecture
[Diagram: a Context spans the CPU and GPU and contains Programs (compiled into CPU and GPU binaries), Kernels (e.g. dp_mul with its argument values), Memory Objects (buffers and images), and per-device Command Queues, which may be in-order or out-of-order.]

Example: vector addition
• The “hello world” program of data-parallel programming is a program to add two vectors:
    C[i] = A[i] + B[i]  for i = 0 to N-1
• The OpenCL solution has two parts:
  • Kernel code
  • Host code

Vector Addition - Kernel

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];
    }

Vector Addition - Host Program

    // create the OpenCL context on a GPU device
    context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                      NULL, NULL, NULL);

    // get the list of GPU devices associated with the context
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcA, NULL);
    memobjs[1] = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcB, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_float)*n, NULL, NULL);

    // create the program
    program = clCreateProgramWithSource(context, 1, &program_source,
                                        NULL, NULL);

    // build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel
    kernel = clCreateKernel(program, "vec_add", NULL);

    // set the argument values
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

    // set work-item dimensions
    global_work_size[0] = n;

    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, NULL, 0, NULL, NULL);

    // read the output array
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n*sizeof(cl_float), dst, 0, NULL, NULL);
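One caveat: the host code above ignores the result of clBuildProgram. A common addition (my sketch, using the standard clGetProgramBuildInfo call; not part of the original slide) is to print the compiler log when the build fails, in place of the bare build line:

    // build the program, fetching the compiler log on failure
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        /* Query the log size first, then fetch the log itself. */
        size_t log_size;
        clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = malloc(log_size);
        clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        fprintf(stderr, "OpenCL build log:\n%s\n", log);
        free(log);
    }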
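The slide also stops at the read-back. For completeness, a teardown sketch (standard OpenCL release calls; the variable names follow the host program above):

    clFinish(cmd_queue);              /* wait for all enqueued work */
    clReleaseMemObject(memobjs[0]);
    clReleaseMemObject(memobjs[1]);
    clReleaseMemObject(memobjs[2]);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
    free(devices);                    /* allocated by the host program */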