Novel Architectures

Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

By John E. Stone, David Gohara, and Guochun Shi

The OpenCL standard offers a common API for program execution on systems composed of different types of computational devices such as multicore CPUs, GPUs, or other accelerators.

The strong need for increased computational performance in science and engineering has led to the use of heterogeneous computing, with GPUs and other accelerators acting as coprocessors for arithmetic-intensive data-parallel workloads.1–4 OpenCL is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs.5 This trend toward heterogeneous computing and highly parallel architectures has created a strong need for software development infrastructure in the form of parallel programming languages and subroutine libraries that can support heterogeneous computing on multiple vendors' hardware platforms. To address this, developers adapted many existing science and engineering applications to take advantage of multicore CPUs and massively parallel GPUs using toolkits such as Threading Building Blocks (TBB), OpenMP, Compute Unified Device Architecture (CUDA),6 and others.7,8 Existing programming toolkits, however, were either limited to a single microprocessor family or didn't support heterogeneous computing.

OpenCL provides easy-to-use abstractions and a broad set of programming interfaces based on past successes with CUDA and other programming toolkits. OpenCL defines core functionality that all devices support, as well as optional functionality for high-function devices; it also includes an extension mechanism that lets vendors expose unique hardware features and experimental programming interfaces for application developers' benefit. Although OpenCL can't mask significant differences in hardware architecture, it does guarantee portability and correctness. This makes it much easier for developers to start with a correctly functioning OpenCL program tuned for one architecture and produce a correctly functioning program optimized for another architecture.

The OpenCL Programming Model

In OpenCL, a program is executed on a computational device, which can be a CPU, GPU, or another accelerator (see Figure 1). Devices contain one or more compute units (processor cores). These units are themselves composed of one or more single-instruction multiple-data (SIMD) processing elements (PEs) that execute instructions in lock-step.

By providing a common language, common programming interfaces, and common hardware abstractions, OpenCL lets developers accelerate applications with task- or data-parallel computations in a heterogeneous computing environment consisting of the host CPU and any attached OpenCL devices. Such devices might or might not share memory with the host CPU, and typically have a different machine instruction set. The OpenCL programming interfaces therefore assume heterogeneity between the host and all attached devices.

OpenCL's key programming interfaces include functions for

• enumerating available target devices (CPUs, GPUs, and various accelerators);
• managing the target devices' contexts;
• managing memory allocations;
• performing host-device memory transfers;
• compiling the OpenCL programs and kernel functions that the devices will execute;
• launching kernels on the target devices;
• querying execution progress; and
• checking for errors.
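As a minimal sketch of the first of these interfaces, the following host code enumerates the available OpenCL platforms and devices and prints each device's name. It assumes an OpenCL 1.0 implementation and the standard CL/cl.h header (OpenCL/opencl.h on Mac OS X), and it omits error checking for brevity.

/* Sketch: enumerate OpenCL platforms and devices and print their names.
   Error handling is omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
  cl_platform_id platforms[8];
  cl_uint nplat = 0;
  clGetPlatformIDs(8, platforms, &nplat);

  for (cl_uint p = 0; p < nplat; p++) {
    cl_device_id devices[16];
    cl_uint ndev = 0;
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev);

    for (cl_uint d = 0; d < ndev; d++) {
      char name[256];
      clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
      printf("platform %u, device %u: %s\n", p, d, name);
    }
  }
  return 0;
}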




OpenCL Device Management

Although developers can compile and link OpenCL programs into binary objects using an offline compilation methodology, OpenCL encourages runtime compilation that lets OpenCL programs run natively on the target hardware—even on platforms unavailable to the original software developer. Runtime compilation eliminates dependencies on instruction sets, letting hardware vendors significantly change instruction sets, drivers, and supporting libraries from one hardware generation to the next.2 Applications that use OpenCL's runtime compilation features will automatically take advantage of the target device's latest hardware and software features without having to recompile the main application itself.

Figure 1. OpenCL describes hardware in terms of a hierarchy of devices, compute units, and clusters of single-instruction multiple-data (SIMD) processing elements. Before becoming accessible to an application, devices must first be incorporated into an OpenCL context. OpenCL programs contain one or more kernel functions as well as supporting routines that kernels can use.

Because OpenCL targets a broad range of microprocessor designs, it must support a multiplicity of programming idioms that match the target architectures. Although OpenCL guarantees kernel portability and correctness across a variety of hardware, it doesn't guarantee that a particular kernel will achieve peak performance on different architectures; the hardware's underlying nature might make some programming strategies more appropriate for particular platforms than for others. As an example, a GPU-optimized kernel might achieve peak memory performance when a single work-group's work-items collectively perform loads and stores, whereas a Cell-optimized kernel might perform better using a double-buffering strategy combined with calls to async_work_group_copy(). Applications select the most appropriate kernel for the target devices by querying the installed devices' capabilities and hardware attributes at runtime.

OpenCL Device Contexts and Memory

OpenCL defines four types of memory systems that devices can incorporate:

• a large high-latency global memory,
• a small low-latency read-only constant memory,
• a shared local memory accessible from multiple PEs within the same compute unit, and
• a private memory, or device registers, accessible within each PE.

Devices can implement local memory using high-latency global memory, fast on-chip static RAM, or a shared register file. Applications can query device attributes to determine properties of the available compute units and memory systems and use them accordingly.

Before an application can compile OpenCL programs, allocate device memory, or launch kernels, it must first create a context associated with one or more devices. Because OpenCL associates memory allocations with a context rather than a specific device, developers should exclude devices with inadequate memory capacity when creating a context; otherwise the least-capable device will limit the maximum memory allocation. Similarly, they should exclude devices from a context if they don't support features that OpenCL programs require to run on that context.

Once a context is created, OpenCL programs can be compiled at runtime by passing the source code to OpenCL compilation functions as arrays of strings. After an OpenCL program is compiled, handles can be obtained for the kernel functions contained in the program. The kernels can then be launched on devices within the OpenCL context. OpenCL host-device memory I/O operations and kernels are executed by enqueueing them into one of the command queues associated with the target device.
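A compact host-side sketch of this sequence follows. The scale kernel, the single fixed device, and the omission of most error checking are purely illustrative, but the calls shown (clCreateContext, clCreateCommandQueue, clCreateProgramWithSource, clBuildProgram, clCreateKernel, clCreateBuffer, clSetKernelArg, and clEnqueueNDRangeKernel) are the standard OpenCL 1.0 entry points for each step.

/* Sketch: create a context, compile a program from source at runtime,
   and enqueue a kernel. Placeholder kernel; most error checks omitted. */
#include <CL/cl.h>

static const char *src =
  "__kernel void scale(__global float *x, float s) {"
  "  int i = get_global_id(0);"
  "  x[i] = x[i] * s;"
  "}";

void run_scale(cl_device_id dev, float *hostbuf, size_t n, float s) {
  cl_int err;
  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

  /* Runtime compilation: source is passed as an array of strings. */
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
  cl_kernel k = clCreateKernel(prog, "scale", &err);

  /* Memory allocations belong to the context, not to a single device. */
  cl_mem xbuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
  clEnqueueWriteBuffer(q, xbuf, CL_TRUE, 0, n * sizeof(float), hostbuf, 0, NULL, NULL);

  clSetKernelArg(k, 0, sizeof(cl_mem), &xbuf);
  clSetKernelArg(k, 1, sizeof(float), &s);

  size_t global = n;   /* one work-item per array element */
  clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
  clEnqueueReadBuffer(q, xbuf, CL_TRUE, 0, n * sizeof(float), hostbuf, 0, NULL, NULL);

  clReleaseMemObject(xbuf); clReleaseKernel(k); clReleaseProgram(prog);
  clReleaseCommandQueue(q); clReleaseContext(ctx);
}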



OpenCL and Modern Processor Architectures

State-of-the-art microprocessors contain several architectural features that, historically, have been poorly supported or difficult to use in existing programming languages. This has led vendors to create their own programming tools, language extensions, vector intrinsics, and subroutine libraries to close the programmability gap created by these hardware features. To help clarify the relationship between the OpenCL programming model and the diversity of potential target hardware, we compare the architectural characteristics of three exemplary microprocessor families and relate them to key OpenCL abstractions and OpenCL programming model features.

Multicore CPUs

Modern CPUs are typically composed of a few high-frequency processor cores with advanced features such as out-of-order execution and branch prediction. CPUs are generalists that perform well for a wide range of applications, including latency-sensitive sequential workloads and coarse-grained task- or data-parallel workloads. Because they're typically used for latency-sensitive workloads with minimal parallelism, CPUs require large caches to hide main-memory latency. Many CPUs also incorporate small-scale use of SIMD arithmetic units to boost the performance of dense arithmetic and multimedia workloads. Because conventional programming languages like C and Fortran don't directly expose these units, their use requires calling vectorized subroutine libraries or proprietary vector intrinsic functions, or trial-and-error source-level restructuring and autovectorizing compilers. AMD, Apple, and IBM provide OpenCL implementations that target multicore CPUs, and support the use of SIMD instruction set extensions such as SSE and Power/VMX (vector multimedia extensions). The current CPU implementations for x86 processors often make best use of SSE when OpenCL kernels are written with explicit use of float4 types. CPU implementations often map all memory spaces onto the same hardware cache, so a kernel that explicitly uses constant and local memory spaces might actually incur more overhead than a simple kernel that uses only global memory references.

The Cell Processor

The Cell Broadband Engine Architecture (CBEA) is a heterogeneous chip architecture consisting of one 64-bit Power-compliant PE (PPE), multiple Synergistic PEs (SPEs), a memory-interface controller, and I/O units, connected with an internal high-speed bus.9 The PPE is a general-purpose processor based on the Power architecture, and it's designed to run a conventional OS and control-intensive code to coordinate tasks running on SPEs. The SPE is a SIMD streaming processor optimized for massive data processing that provides most of the Cell systems' computing power. Developers can realize an application's task parallelism using multiple SPEs, while achieving data and instruction parallelism using the SIMD instructions and the SPEs' dual execution pipelines. Each SPE has a local store, a small, local software-managed cache-like fast memory. Applications can load data from system memory to local store or vice versa using direct memory access (DMA) requests, with the best bandwidth achieved when both source and destination are aligned to 128 bytes. Cell can execute data transfer and instructions simultaneously, letting application developers hide memory latency using techniques such as double buffering. (We describe the architecture and a sample application ported to the Cell processor elsewhere.1)

IBM has recently released an OpenCL toolkit supporting both the Cell and Power processors on the Linux platform. The IBM OpenCL implementation supports the embedded profile for the Cell SPUs, and uses software techniques to smooth over some architectural differences between the Cell SPUs and conventional CPUs. On the Cell processor, global memory accesses perform best when operands are a multiple of 16 bytes (such as an OpenCL float4 type). The use of larger vector types such as float16 lets the compiler unroll loops, further increasing performance. The program text and OpenCL local and private variables share the 256-Kbyte Cell SPU local store, which limits the practical work-group size because each work-item requires private data storage. The Cell DMA engine performs most effectively using double-buffering strategies combined with calls to async_work_group_copy() to load data from global memory into local store.
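The staging pattern described above can be sketched as a small kernel. The example below is hypothetical (sum_charges, tile, and partial are illustrative names) and is single-buffered for clarity; a true double-buffered version would overlap the copy of the next block with processing of the current one.

/* Sketch: stage blocks of a global array in local memory with
   async_work_group_copy() before processing them, as suggested for
   Cell-like devices. Single-buffered for clarity. */
__kernel void sum_charges(__global const float *charge,
                          __local float *tile,      /* staging buffer, tilesize floats */
                          __global float *partial,  /* one partial sum per work-item */
                          int natoms, int tilesize) {
  float acc = 0.0f;
  for (int base = 0; base < natoms; base += tilesize) {
    int n = min(tilesize, natoms - base);
    /* Collective copy: every work-item in the group makes the same call. */
    event_t ev = async_work_group_copy(tile, charge + base, n, 0);
    wait_group_events(1, &ev);
    for (int i = get_local_id(0); i < n; i += get_local_size(0))
      acc += tile[i];
    /* Don't refill the tile until all work-items have finished reading it. */
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  partial[get_global_id(0)] = acc;
}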


Graphics Processing Units

Contemporary GPUs are composed of hundreds of processing units running at low to moderate frequency, designed for throughput-oriented, latency-insensitive workloads. To hide global memory latency, GPUs contain small or moderate-sized on-chip caches and extensively use hardware multithreading, executing tens of thousands of threads concurrently across the pool of processing units. The GPU processing units are typically organized in SIMD clusters. The SIMD clusters execute machine instructions in lock-step; branch divergence is handled by executing both branch paths and masking off results from inactive processing units as necessary.

for i = 1 to M do {loop over grid points on face}
  grid potential ⇐ 0.0
  for j = 1 to N do {loop over all atoms}
    grid potential ⇐ grid potential + (potential from atom j)
  end for
end for
return grid potential

Figure 2. Summary of a serial multiple Debye-Hückel algorithm. The MDH algorithm calculates the total potential at each grid point on a grid's face, as described in equation 1.
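Written as plain serial C, the loops of Figure 2 might look like the following sketch, which uses the same array and parameter names as the OpenCL kernel in Figure 4.

#include <math.h>

/* Serial sketch of the MDH loops in Figure 2: for every grid point,
   accumulate the contribution of every atom. */
void mdh_serial(const float *ax, const float *ay, const float *az,
                const float *charge, const float *size,
                const float *gx, const float *gy, const float *gz,
                float *val, float pre1, float xkappa,
                int ngrid, int natoms) {
  for (int igrid = 0; igrid < ngrid; igrid++) {      /* loop over grid points */
    float v = 0.0f;
    for (int jatom = 0; jatom < natoms; jatom++) {   /* loop over all atoms */
      float dx = gx[igrid] - ax[jatom];
      float dy = gy[igrid] - ay[jatom];
      float dz = gz[igrid] - az[jatom];
      float dist = sqrtf(dx*dx + dy*dy + dz*dz);
      v += pre1 * (charge[jatom] / dist)
           * expf(-xkappa * (dist - size[jatom]))
           / (1.0f + xkappa * size[jatom]);
    }
    val[igrid] = v;                                  /* total potential at this grid point */
  }
}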



__kernel void mdh(__global float *ax, __global float *ay, __global float *az,
                  __global float *charge, __global float *size,
                  __global float *gx, __global float *gy, __global float *gz,
                  __global float *val, float pre1, float xkappa, cl_int natoms) {
  int igrid = get_global_id(0);
  float v = 0.0f;
  for (int jatom = 0; jatom < natoms; jatom++) {
    float dx = gx[igrid] - ax[jatom];
    float dy = gy[igrid] - ay[jatom];
    float dz = gz[igrid] - az[jatom];
    float dist = sqrt(dx*dx + dy*dy + dz*dz);
    v += pre1 * (charge[jatom] / dist) *
         exp(-xkappa * (dist - size[jatom])) / (1.0f + xkappa * size[jatom]);
  }
  val[igrid] = v;
}

Figure 4. A simple OpenCL kernel for the multiple Debye-Hückel method. This kernel is similar to the original sequential C loops, except that the outer loop over grid points has been replaced by a parallel instantiation of independent grid points as OpenCL work-items, and the igrid index is determined by the OpenCL work-item index.

Because the potential at each grid point can be calculated independently, we can use standard methods (such as pthreads or OpenMP) to parallelize the earlier example on a CPU. We can also perform this type of calculation with OpenCL with almost no modification. The kernel is simply the inner loop over the atoms, as Figure 4 shows. In effect, the OpenCL dispatcher becomes the outer loop over grid points. We set the OpenCL global work-group size to the total number of grid points, and each work-item that OpenCL dispatches is responsible for calculating a potential using the above kernel. Each OpenCL work-item determines its grid point index from its global work-item index.

Figure 4's code runs approximately 20 times faster on an Nvidia GeForce GTX 285 GPU than the serial code on a 2.5 GHz Nehalem CPU. For reference, on 16 cores, the parallel CPU performance is almost 13 times faster. However, the kernel in Figure 4 doesn't take advantage of locality of concurrent accesses to the atom data. In this form, each work-item (grid point) is responsible for loading each atom's data (x, y, z, charge, and size), resulting in global memory transactions that could be avoided by changing the algorithm.

By taking advantage of each OpenCL compute unit's fast on-chip local memory, data can be staged in local memory and then efficiently broadcast to all work-items within the same work-group. This greatly amplifies the effective memory bandwidth available to the algorithm, improving performance. Although the global work-group size remains the same, the local work-group size increases from 1 to some multiple of the hardware SIMD width.

As Figure 5's example shows, the work-group size is limited by the amount of data that can be loaded into local memory (typically 16 Kbytes on Nvidia GPUs). The on-chip local memory is partitioned so that each of a work-group's work-items loads a block of the position, charge, and size data into local memory at a specific offset. Local memory barriers ensure that data isn't overwritten before all work-items in a work-group have accessed it. This coordinated data loading and sharing reduces the number of slow global memory accesses.

Another optimization involves using vector types such as float4 or float16, which makes it easier for the OpenCL compiler to effectively fill very long instruction word (VLIW) slots and enables wider memory transfer operations. With vector types, individual work-items process multiple grid points at a time. This reduces the global work dimensions accordingly, with a corresponding increase in register usage. Calculating multiple grid points per work-item increases the ratio of arithmetic operations to memory operations because the same atom data is referenced multiple times. On the AMD and Nvidia GPUs, the use of vector types yields a 20 percent increase in performance. On the Cell processor, using float16 vectors yields an 11-fold increase in kernel performance.

Figure 5 shows one variation of these concepts; many others are possible.


typedef float4 floatvec;

__kernel void mdh(__global float *ax, __global float *ay, __global float *az,
                  __global float *charge, __global float *size,
                  __global floatvec *gx, __global floatvec *gy, __global floatvec *gz,
                  __global floatvec *val, __local float *shared,
                  float pre1, float xkappa, cl_int natoms) {
  int igrid = get_global_id(0);
  int lsize = get_local_size(0);
  int lid = get_local_id(0);
  floatvec lgx = gx[igrid];
  floatvec lgy = gy[igrid];
  floatvec lgz = gz[igrid];
  floatvec v = 0.0f;
  for (int jatom = 0; jatom < natoms; jatom += lsize) {
    if ((jatom + lsize) > natoms)
      lsize = natoms - jatom;

    if ((jatom + lid) < natoms) {
      shared[lid]           = ax[jatom + lid];
      shared[lid + lsize]   = ay[jatom + lid];
      shared[lid + 2*lsize] = az[jatom + lid];
      shared[lid + 3*lsize] = charge[jatom + lid];
      shared[lid + 4*lsize] = size[jatom + lid];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Process the cached block of atoms; the arithmetic mirrors Figure 4,
       but atom data is read from local memory at the offsets set up above. */
    for (int i = 0; i < lsize; i++) {
      floatvec dx = lgx - shared[i];
      floatvec dy = lgy - shared[i + lsize];
      floatvec dz = lgz - shared[i + 2*lsize];
      floatvec dist = sqrt(dx*dx + dy*dy + dz*dz);
      v += pre1 * (shared[i + 3*lsize] / dist)
           * exp(-xkappa * (dist - shared[i + 4*lsize]))
           / (1.0f + xkappa * shared[i + 4*lsize]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  val[igrid] = v;
}

Figure 5. The optimized OpenCL kernel for the multiple Debye-Hückel method. This kernel is similar to the simple OpenCL kernel in Figure 4, but each work-group collectively loads and processes blocks of atom data in fast on-chip local memory. OpenCL barrier instructions enforce coordination between the loading and processing phases to maintain local memory consistency. The inner loop uses vector types to process multiple grid points per work-item, and atom data is processed entirely from on-chip local memory, greatly increasing both arithmetic intensity and effective memory bandwidth.
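On the host side, the __local staging buffer that this kernel declares as a parameter is sized by calling clSetKernelArg() with a byte count and a NULL pointer. The sketch below shows one way the optimized kernel might be configured and launched; the work-group size of 64 and the assumption that the number of grid points is a multiple of four times the work-group size are illustrative, not requirements of the method.

#include <CL/cl.h>

/* Sketch: host-side launch of the optimized MDH kernel in Figure 5.
   Buffer handles are assumed to be created and filled already. */
void launch_mdh_opt(cl_command_queue queue, cl_kernel kernel,
                    cl_mem ax, cl_mem ay, cl_mem az, cl_mem charge, cl_mem size,
                    cl_mem gx, cl_mem gy, cl_mem gz, cl_mem val,
                    float pre1, float xkappa, cl_int natoms, size_t ngrid) {
  size_t wgsize = 64;                       /* some multiple of the hardware SIMD width */
  size_t global = ngrid / 4;                /* each work-item processes a float4 of grid points */
  size_t localbytes = 5 * wgsize * sizeof(float);   /* x, y, z, charge, and size blocks */

  cl_mem bufs[9] = { ax, ay, az, charge, size, gx, gy, gz, val };
  for (cl_uint i = 0; i < 9; i++)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);

  clSetKernelArg(kernel,  9, localbytes, NULL);     /* __local float *shared */
  clSetKernelArg(kernel, 10, sizeof(float), &pre1);
  clSetKernelArg(kernel, 11, sizeof(float), &xkappa);
  clSetKernelArg(kernel, 12, sizeof(cl_int), &natoms);

  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &wgsize, 0, NULL, NULL);
}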

When compared to the original serial CPU code, the approximate speedups are 17 times for an IBM Cell blade (using float16), 134 times for an AMD 5870 GPU, and 141 times for an Nvidia GeForce GTX 285 GPU. With further platform-specific tuning, each of these platforms could undoubtedly achieve even higher performance. A more important outcome still is that the numerical result is exact—within the floating-point rounding mode and summation order—to the CPU methods (both serial and parallel).

Our initial experiences in adapting molecular modeling applications such as APBS and VMD11 to OpenCL 1.0 have been generally positive. In the coming year, we expect OpenCL to incorporate new features and promote previously optional features to core features of OpenCL 1.1 and later versions.



Recent GPUs allow direct access to host memory over peripheral component interconnect express (PCI-E), and some enable global GPU memory mapping into the host address space, allowing fast zero-copy access to data that's read or written only once during kernel execution. Although zero-copy APIs exist in CUDA, OpenCL currently doesn't include this capability. As OpenCL matures, we hope to see increased support for thread-safety, increased OpenGL interoperability, and extension with advanced features found in APIs such as CUDA.

OpenCL provides correctness guarantees such that code written and optimized for one device will run correctly on any other device, although not necessarily with peak performance. The only exception to this is when optional OpenCL features are used that aren't available on the target device. As future hardware platforms that operate on wider vectors—such as Intel's advanced vector extensions (AVX)—arrive, we're eager to see OpenCL implementations incorporate a greater degree of autovectorization, enabling efficient kernels to be written with less vector-width specificity and improving the performance portability of OpenCL kernels across the board. Our experiences show that OpenCL holds great promise as a standard low-level parallel programming interface for heterogeneous computing devices.

Acknowledgments
We are grateful to Nathan Baker and Yong Huang for development and support of the APBS project, and to Ian Ollmann and Aaftab Munshi for assistance with OpenCL. The US National Institutes of Health funds APBS development through grant R01-GM069702. Performance experiments were made possible with hardware donations and OpenCL software provided by AMD, IBM, and Nvidia, and with support from the US National Science Foundation's Computer and Network Systems grant 05-51665, the US National Center for Supercomputing Applications, and NIH grant P41-RR05969.

References
1. G. Shi et al., "Application Acceleration with the Cell Broadband Engine," Computing in Science & Eng., vol. 12, no. 1, 2010, pp. 76–81.
2. J. Cohen and M. Garland, "Solving Computational Problems with GPU Computing," Computing in Science & Eng., vol. 11, no. 5, 2009, pp. 58–63.
3. A. Bayoumi et al., "Scientific and Engineering Computing Using ATI Stream Technology," Computing in Science & Eng., vol. 11, no. 6, 2009, pp. 92–97.
4. K.J. Barker et al., "Entering the Petaflop Era: The Architecture and Performance of Roadrunner," Proc. 2008 ACM/IEEE Conf. Supercomputing, IEEE Press, 2008, pp. 1–11.
5. A. Munshi, OpenCL Specification Version 1.0, The Khronos Group, 2008; www.khronos.org/registry/cl.
6. J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, 2008, pp. 40–53.
7. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618–2640.
8. J.E. Stone et al., "High-Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multicore CPUs," Proc. 2nd Workshop on General-Purpose Processing on Graphics Processing Units, vol. 383, 2009, pp. 9–18.
9. H.P. Hofstee, "Power Efficient Processor Architecture and the Cell Processor," Proc. 11th Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, 2005, pp. 258–262.
10. N.A. Baker et al., "Electrostatics of Nanosystems: Application to Microtubules and the Ribosome," Proc. Nat'l Academy of Sciences, vol. 98, no. 18, 2001, pp. 10037–10041.
11. W. Humphrey, A. Dalke, and K. Schulten, "VMD—Visual Molecular Dynamics," J. Molecular Graphics, vol. 14, no. 1, 1996, pp. 33–38.

John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the CUDA Center of Excellence, both at the University of Illinois at Urbana-Champaign. He is lead developer of the molecular visualization program VMD. His research interests include scientific visualization and high-performance computing. Stone earned his MS in computer science from the Missouri University of Science and Technology. Contact him at [email protected].

David Gohara is director of research computing in the Department of Biochemistry and Biophysics and the Center for Computational Biology at the Washington University School of Medicine in St. Louis. His research interests are in high-performance computing for improving computational methods used for the study of biological processes. Gohara has a PhD in biochemistry and molecular biology from the Pennsylvania State University, and did his postdoctoral research in x-ray crystallography at Harvard Medical School. Contact him at [email protected].

Guochun Shi is a research programmer at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign. His research interests are in high-performance computing. Shi has an MS in computer science from the University of Illinois at Urbana-Champaign. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

