Novel Architectures

Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

By John E. Stone, David Gohara, and Guochun Shi

The OpenCL standard offers a common API for program execution on systems composed of different types of computational devices such as multicore CPUs, GPUs, or other accelerators.

The strong need for increased computational performance in science and engineering has led to the use of heterogeneous computing, with GPUs and other accelerators acting as coprocessors for arithmetic-intensive data-parallel workloads.1–4 OpenCL is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs.5 This trend toward heterogeneous computing and highly parallel architectures has created a strong need for software development infrastructure in the form of parallel programming languages and subroutine libraries that can support heterogeneous computing on multiple vendors' hardware platforms. To address this, developers adapted many existing science and engineering applications to take advantage of multicore CPUs and massively parallel GPUs using toolkits such as Threading Building Blocks (TBB), OpenMP, Compute Unified Device Architecture (CUDA),6 and others.7,8 Existing programming toolkits, however, were either limited to a single microprocessor family or didn't support heterogeneous computing.

OpenCL provides easy-to-use abstractions and a broad set of programming interfaces based on past successes with CUDA and other programming toolkits. OpenCL defines core functionality that all devices support, as well as optional functionality for high-function devices; it also includes an extension mechanism that lets vendors expose unique hardware features and experimental programming interfaces for application developers' benefit. Although OpenCL can't mask significant differences in hardware architecture, it does guarantee portability and correctness. This makes it much easier for developers to start with a correctly functioning OpenCL program tuned for one architecture and produce a correctly functioning program optimized for another architecture.

The OpenCL Programming Model

In OpenCL, a program is executed on a computational device, which can be a CPU, GPU, or another accelerator (see Figure 1). Devices contain one or more compute units (processor cores). These units are themselves composed of one or more single-instruction multiple-data (SIMD) processing elements (PEs) that execute instructions in lock-step.

By providing a common language, common programming interfaces, and common hardware abstractions, OpenCL lets developers accelerate applications with task- or data-parallel computations in a heterogeneous computing environment consisting of the host CPU and any attached OpenCL devices. Such devices might or might not share memory with the host CPU, and typically have a different machine instruction set. The OpenCL programming interfaces therefore assume heterogeneity between the host and all attached devices.

OpenCL's key programming interfaces include functions for

• enumerating available target devices (CPUs, GPUs, and various accelerators);
• managing the target devices' contexts;
• managing memory allocations;
• performing host-device memory transfers;
• compiling the OpenCL programs and kernel functions that the devices will execute;
• launching kernels on the target devices;
• querying execution progress; and
• checking for errors.
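As a minimal sketch of the first of these interfaces, the following host code enumerates the available OpenCL platforms and devices and prints each device's name. It assumes an OpenCL 1.0 implementation and the standard CL/cl.h header (OpenCL/opencl.h on Mac OS X), and it omits error checking for brevity.

/* Sketch: enumerate OpenCL platforms and devices and print their names.
   Error handling is omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
  cl_platform_id platforms[8];
  cl_uint nplat = 0;
  clGetPlatformIDs(8, platforms, &nplat);

  for (cl_uint p = 0; p < nplat; p++) {
    cl_device_id devices[16];
    cl_uint ndev = 0;
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev);

    for (cl_uint d = 0; d < ndev; d++) {
      char name[256];
      clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
      printf("platform %u, device %u: %s\n", p, d, name);
    }
  }
  return 0;
}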




OpenCL Device Management

Although developers can compile and link OpenCL programs into binary objects using an offline compilation methodology, OpenCL encourages runtime compilation that lets OpenCL programs run natively on the target hardware—even on platforms unavailable to the original software developer. Runtime compilation eliminates dependencies on instruction sets, letting hardware vendors significantly change instruction sets, drivers, and supporting libraries from one hardware generation to the next.2 Applications that use OpenCL's runtime compilation features will automatically take advantage of the target device's latest hardware and software features without having to recompile the main application itself.

Figure 1. OpenCL describes hardware in terms of a hierarchy of devices, compute units, and clusters of single-instruction multiple-data (SIMD) processing elements. Before becoming accessible to an application, devices must first be incorporated into an OpenCL context. OpenCL programs contain one or more kernel functions as well as supporting routines that kernels can use.

Because OpenCL targets a broad range of microprocessor designs, it must support a multiplicity of programming idioms that match the target architectures. Although OpenCL guarantees kernel portability and correctness across a variety of hardware, it doesn't guarantee that a particular kernel will achieve peak performance on different architectures; the hardware's underlying nature might make some programming strategies more appropriate for particular platforms than for others. As an example, a GPU-optimized kernel might achieve peak memory performance when a single work-group's work-items collectively perform loads and stores, whereas a Cell-optimized kernel might perform better using a double-buffering strategy combined with calls to async_work_group_copy(). Applications select the most appropriate kernel for the target devices by querying the installed devices' capabilities and hardware attributes at runtime.

OpenCL Device Contexts and Memory

OpenCL defines four types of memory systems that devices can incorporate:

• a large high-latency global memory,
• a small low-latency read-only constant memory,
• a shared local memory accessible from multiple PEs within the same compute unit, and
• a private memory, or device registers, accessible within each PE.

Devices can implement local memory using high-latency global memory, fast on-chip static RAM, or a shared register file. Applications can query device attributes to determine properties of the available compute units and memory systems and use them accordingly.

Before an application can compile OpenCL programs, allocate device memory, or launch kernels, it must first create a context associated with one or more devices. Because OpenCL associates memory allocations with a context rather than a specific device, developers should exclude devices with inadequate memory capacity when creating a context; otherwise the least-capable device will limit the maximum memory allocation. Similarly, they should exclude devices from a context if they don't support features that OpenCL programs require to run on that context.

Once a context is created, OpenCL programs can be compiled at runtime by passing the source code to OpenCL compilation functions as arrays of strings. After an OpenCL program is compiled, handles can be obtained for the kernel functions contained in the program. The kernels can then be launched on devices within the OpenCL context. OpenCL host-device memory I/O operations and kernels are executed by enqueueing them into one of the command queues associated with the target device.
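A compact host-side sketch of this sequence follows. The scale kernel, the single fixed device, and the omission of most error checking are purely illustrative, but the calls shown (clCreateContext, clCreateCommandQueue, clCreateProgramWithSource, clBuildProgram, clCreateKernel, clCreateBuffer, clSetKernelArg, and clEnqueueNDRangeKernel) are the standard OpenCL 1.0 entry points for each step.

/* Sketch: create a context, compile a program from source at runtime,
   and enqueue a kernel. Placeholder kernel; most error checks omitted. */
#include <CL/cl.h>

static const char *src =
  "__kernel void scale(__global float *x, float s) {"
  "  int i = get_global_id(0);"
  "  x[i] = x[i] * s;"
  "}";

void run_scale(cl_device_id dev, float *hostbuf, size_t n, float s) {
  cl_int err;
  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

  /* Runtime compilation: source is passed as an array of strings. */
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
  cl_kernel k = clCreateKernel(prog, "scale", &err);

  /* Memory allocations belong to the context, not to a single device. */
  cl_mem xbuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
  clEnqueueWriteBuffer(q, xbuf, CL_TRUE, 0, n * sizeof(float), hostbuf, 0, NULL, NULL);

  clSetKernelArg(k, 0, sizeof(cl_mem), &xbuf);
  clSetKernelArg(k, 1, sizeof(float), &s);

  size_t global = n;   /* one work-item per array element */
  clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
  clEnqueueReadBuffer(q, xbuf, CL_TRUE, 0, n * sizeof(float), hostbuf, 0, NULL, NULL);

  clReleaseMemObject(xbuf); clReleaseKernel(k); clReleaseProgram(prog);
  clReleaseCommandQueue(q); clReleaseContext(ctx);
}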



OpenCL and Modern Processor Architectures

State-of-the-art microprocessors contain several architectural features that, historically, have been poorly supported or difficult to use in existing programming languages. This has led vendors to create their own programming tools, language extensions, vector intrinsics, and subroutine libraries to close the programmability gap created by these hardware features. To help clarify the relationship between the OpenCL programming model and the diversity of potential target hardware, we compare the architectural characteristics of three exemplary microprocessor families and relate them to key OpenCL abstractions and OpenCL programming model features.

Multicore CPUs

Modern CPUs are typically composed of a few high-frequency processor cores with advanced features such as out-of-order execution and branch prediction. CPUs are generalists that perform well for a wide range of applications, including latency-sensitive sequential workloads and coarse-grained task- or data-parallel workloads. Because they're typically used for latency-sensitive workloads with minimal parallelism, CPUs require large caches to hide main-memory latency. Many CPUs also incorporate small-scale use of SIMD arithmetic units to boost the performance of dense arithmetic and multimedia workloads. Because conventional programming languages like C and Fortran don't directly expose these units, their use requires calling vectorized subroutine libraries or proprietary vector intrinsic functions, or trial-and-error source-level restructuring and autovectorizing compilers. AMD, Apple, and IBM provide OpenCL implementations that target multicore CPUs, and support the use of SIMD instruction set extensions such as SSE and Power/VMX (vector multimedia extensions). The current CPU implementations for x86 processors often make best use of SSE when OpenCL kernels are written with explicit use of float4 types. CPU implementations often map all memory spaces onto the same hardware cache, so a kernel that explicitly uses constant and local memory spaces might actually incur more overhead than a simple kernel that uses only global memory references.

The Cell Processor

The Cell Broadband Engine Architecture (CBEA) is a heterogeneous chip architecture consisting of one 64-bit Power-compliant PE (PPE), multiple Synergistic PEs (SPEs), a memory-interface controller, and I/O units, connected with an internal high-speed bus.9 The PPE is a general-purpose processor based on the Power architecture, and it's designed to run a conventional OS and control-intensive code to coordinate tasks running on SPEs. The SPE is a SIMD streaming processor optimized for massive data processing that provides most of the Cell systems' computing power. Developers can realize an application's task parallelism using multiple SPEs, while achieving data and instruction parallelism using the SIMD instructions and the SPEs' dual execution pipelines. Each SPE has a local store, a small, local software-managed cache-like fast memory. Applications can load data from system memory to local store or vice versa using direct memory access (DMA) requests, with the best bandwidth achieved when both source and destination are aligned to 128 bytes. Cell can execute data transfer and instructions simultaneously, letting application developers hide memory latency using techniques such as double buffering. (We describe the architecture and a sample application ported to the Cell processor elsewhere.1)

IBM has recently released an OpenCL toolkit supporting both the Cell and Power processors on the Linux platform. The IBM OpenCL implementation supports the embedded profile for the Cell SPUs, and uses software techniques to smooth over some architectural differences between the Cell SPUs and conventional CPUs. On the Cell processor, global memory accesses perform best when operands are a multiple of 16 bytes (such as an OpenCL float4 type). The use of larger vector types such as float16 lets the compiler unroll loops, further increasing performance. The program text and OpenCL local and private variables share the 256-Kbyte Cell SPU local store, which limits the practical work-group size because each work-item requires private data storage. The Cell DMA engine performs most effectively using double-buffering strategies combined with calls to async_work_group_copy() to load data from global memory into local store.
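The staging pattern described above can be sketched as a small kernel. The example below is hypothetical (sum_charges, tile, and partial are illustrative names) and is single-buffered for clarity; a true double-buffered version would overlap the copy of the next block with processing of the current one.

/* Sketch: stage blocks of a global array in local memory with
   async_work_group_copy() before processing them, as suggested for
   Cell-like devices. Single-buffered for clarity. */
__kernel void sum_charges(__global const float *charge,
                          __local float *tile,      /* staging buffer, tilesize floats */
                          __global float *partial,  /* one partial sum per work-item */
                          int natoms, int tilesize) {
  float acc = 0.0f;
  for (int base = 0; base < natoms; base += tilesize) {
    int n = min(tilesize, natoms - base);
    /* Collective copy: every work-item in the group makes the same call. */
    event_t ev = async_work_group_copy(tile, charge + base, n, 0);
    wait_group_events(1, &ev);
    for (int i = get_local_id(0); i < n; i += get_local_size(0))
      acc += tile[i];
    /* Don't refill the tile until all work-items have finished reading it. */
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  partial[get_global_id(0)] = acc;
}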


Graphics Processing Units

Contemporary GPUs are composed of hundreds of processing units running at low to moderate frequency, designed for throughput-oriented, latency-insensitive workloads. To hide global memory latency, GPUs contain small or moderate-sized on-chip caches and extensively use hardware multithreading, executing tens of thousands of threads concurrently across the pool of processing units. The GPU processing units are typically organized in SIMD clusters. The SIMD clusters execute machine instructions in lock-step; branch divergence is handled by executing both branch paths and masking off results from inactive processing units as necessary.

for i = 1 to M do {loop over grid points on face}
  grid potential ⇐ 0.0
  for j = 1 to N do {loop over all atoms}
    grid potential ⇐ grid potential + (potential from atom j)
  end for
end for
return grid potential

Figure 2. Summary of a serial multiple Debye-Hückel algorithm. The MDH algorithm calculates the total potential at each grid point on a grid's face, as described in equation 1.
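Written as plain serial C, the loops of Figure 2 might look like the following sketch, which uses the same array and parameter names as the OpenCL kernel in Figure 4.

#include <math.h>

/* Serial sketch of the MDH loops in Figure 2: for every grid point,
   accumulate the contribution of every atom. */
void mdh_serial(const float *ax, const float *ay, const float *az,
                const float *charge, const float *size,
                const float *gx, const float *gy, const float *gz,
                float *val, float pre1, float xkappa,
                int ngrid, int natoms) {
  for (int igrid = 0; igrid < ngrid; igrid++) {      /* loop over grid points */
    float v = 0.0f;
    for (int jatom = 0; jatom < natoms; jatom++) {   /* loop over all atoms */
      float dx = gx[igrid] - ax[jatom];
      float dy = gy[igrid] - ay[jatom];
      float dz = gz[igrid] - az[jatom];
      float dist = sqrtf(dx*dx + dy*dy + dz*dz);
      v += pre1 * (charge[jatom] / dist)
           * expf(-xkappa * (dist - size[jatom]))
           / (1.0f + xkappa * size[jatom]);
    }
    val[igrid] = v;                                  /* total potential at this grid point */
  }
}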



__kernel void mdh(__global float *ax, __global float *ay, __global float *az,
                  __global float *charge, __global float *size,
                  __global float *gx, __global float *gy, __global float *gz,
                  __global float *val, float pre1, float xkappa, cl_int natoms) {
  int igrid = get_global_id(0);
  float v = 0.0f;
  for (int jatom = 0; jatom < natoms; jatom++) {
    float dx = gx[igrid] - ax[jatom];
    float dy = gy[igrid] - ay[jatom];
    float dz = gz[igrid] - az[jatom];
    float dist = sqrt(dx*dx + dy*dy + dz*dz);
    v += pre1 * (charge[jatom] / dist) *
         exp(-xkappa * (dist - size[jatom])) / (1.0f + xkappa * size[jatom]);
  }
  val[igrid] = v;
}

Figure 4. A simple OpenCL kernel for the multiple Debye-Hückel method. This kernel is similar to the original sequential C loops, except that the outer loop over grid points has been replaced by a parallel instantiation of independent grid points as OpenCL work-items, and the igrid index is determined by the OpenCL work-item index.

Because the potential at each grid point can be calculated independently, we can use standard methods (such as pthreads or OpenMP) to parallelize the earlier example on a CPU. We can also perform this type of calculation with OpenCL with almost no modification. The kernel is simply the inner loop over the atoms, as Figure 4 shows. In effect, the OpenCL dispatcher becomes the outer loop over grid points. We set the OpenCL global work-group size to the total number of grid points, and each work-item that OpenCL dispatches is responsible for calculating a potential using the above kernel. Each OpenCL work-item determines its grid point index from its global work-item index.

Figure 4's code runs approximately 20 times faster on an Nvidia GeForce GTX 285 GPU than the serial code on a 2.5 GHz Nehalem CPU. For reference, on 16 cores, the parallel CPU performance is almost 13 times faster. However, the kernel in Figure 4 doesn't take advantage of locality of concurrent accesses to the atom data. In this form, each work-item (grid point) is responsible for loading each atom's data (x, y, z, charge, and size), resulting in global memory transactions that could be avoided by changing the algorithm.

By taking advantage of each OpenCL compute unit's fast on-chip local memory, data can be staged in local memory and then efficiently broadcast to all work-items within the same work-group. This greatly amplifies the effective memory bandwidth available to the algorithm, improving performance. Although the global work-group size remains the same, the local work-group size increases from 1 to some multiple of the hardware SIMD width.

As Figure 5's example shows, the work-group size is limited by the amount of data that can be loaded into local memory (typically 16 Kbytes on Nvidia GPUs). The on-chip local memory is partitioned so that each of a work-group's work-items loads a block of the position, charge, and size data into local memory at a specific offset. Local memory barriers ensure that data isn't overwritten before all work-items in a work-group have accessed it. This coordinated data loading and sharing reduces the number of slow global memory accesses.

Another optimization involves using vector types such as float4 or float16, which makes it easier for the OpenCL compiler to effectively fill very long instruction word (VLIW) slots and enables wider memory transfer operations. With vector types, individual work-items process multiple grid points at a time. This reduces the global work dimensions accordingly, with a corresponding increase in register usage. Calculating multiple grid points per work-item increases the ratio of arithmetic operations to memory operations because the same atom data is referenced multiple times. On the AMD and Nvidia GPUs, the use of vector types yields a 20 percent increase in performance. On the Cell processor, using float16 vectors yields an 11-fold increase in kernel performance.

Figure 5 shows one variation of these concepts; many others are possible.


typedef float4 floatvec;

__kernel void mdh(__global float *ax, __global float *ay, __global float *az,
                  __global float *charge, __global float *size,
                  __global floatvec *gx, __global floatvec *gy, __global floatvec *gz,
                  __global floatvec *val, __local float *shared,
                  float pre1, float xkappa, cl_int natoms) {
  int igrid = get_global_id(0);
  int lsize = get_local_size(0);
  int lid = get_local_id(0);
  floatvec lgx = gx[igrid];
  floatvec lgy = gy[igrid];
  floatvec lgz = gz[igrid];
  floatvec v = 0.0f;
  for (int jatom = 0; jatom < natoms; jatom += lsize) {
    if ((jatom + lsize) > natoms)
      lsize = natoms - jatom;

    if ((jatom + lid) < natoms) {
      shared[lid]           = ax[jatom + lid];
      shared[lid + lsize]   = ay[jatom + lid];
      shared[lid + 2*lsize] = az[jatom + lid];
      shared[lid + 3*lsize] = charge[jatom + lid];
      shared[lid + 4*lsize] = size[jatom + lid];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Process the cached block of atoms; the arithmetic mirrors Figure 4,
       but atom data is read from local memory at the offsets set up above. */
    for (int i = 0; i < lsize; i++) {
      floatvec dx = lgx - shared[i];
      floatvec dy = lgy - shared[i + lsize];
      floatvec dz = lgz - shared[i + 2*lsize];
      floatvec dist = sqrt(dx*dx + dy*dy + dz*dz);
      v += pre1 * (shared[i + 3*lsize] / dist)
           * exp(-xkappa * (dist - shared[i + 4*lsize]))
           / (1.0f + xkappa * shared[i + 4*lsize]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  val[igrid] = v;
}

Figure 5. The optimized OpenCL kernel for the multiple Debye-Hückel method. This kernel is similar to the simple OpenCL kernel in Figure 4, but each work-group collectively loads and processes blocks of atom data in fast on-chip local memory. OpenCL barrier instructions enforce coordination between the loading and processing phases to maintain local memory consistency. The inner loop uses vector types to process multiple grid points per work-item, and atom data is processed entirely from on-chip local memory, greatly increasing both arithmetic intensity and effective memory bandwidth.
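On the host side, the __local staging buffer that this kernel declares as a parameter is sized by calling clSetKernelArg() with a byte count and a NULL pointer. The sketch below shows one way the optimized kernel might be configured and launched; the work-group size of 64 and the assumption that the number of grid points is a multiple of four times the work-group size are illustrative, not requirements of the method.

#include <CL/cl.h>

/* Sketch: host-side launch of the optimized MDH kernel in Figure 5.
   Buffer handles are assumed to be created and filled already. */
void launch_mdh_opt(cl_command_queue queue, cl_kernel kernel,
                    cl_mem ax, cl_mem ay, cl_mem az, cl_mem charge, cl_mem size,
                    cl_mem gx, cl_mem gy, cl_mem gz, cl_mem val,
                    float pre1, float xkappa, cl_int natoms, size_t ngrid) {
  size_t wgsize = 64;                       /* some multiple of the hardware SIMD width */
  size_t global = ngrid / 4;                /* each work-item processes a float4 of grid points */
  size_t localbytes = 5 * wgsize * sizeof(float);   /* x, y, z, charge, and size blocks */

  cl_mem bufs[9] = { ax, ay, az, charge, size, gx, gy, gz, val };
  for (cl_uint i = 0; i < 9; i++)
    clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);

  clSetKernelArg(kernel,  9, localbytes, NULL);     /* __local float *shared */
  clSetKernelArg(kernel, 10, sizeof(float), &pre1);
  clSetKernelArg(kernel, 11, sizeof(float), &xkappa);
  clSetKernelArg(kernel, 12, sizeof(cl_int), &natoms);

  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &wgsize, 0, NULL, NULL);
}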

When compared to the original serial CPU code, the approximate speedups are 17 times for an IBM Cell blade (using float16), 134 times for an AMD 5870 GPU, and 141 times for an Nvidia GeForce GTX 285 GPU. With further platform-specific tuning, each of these platforms could undoubtedly achieve even higher performance. A more important outcome still is that the numerical result is exact—within the floating-point rounding mode and summation order—to the CPU methods (both serial and parallel).

Our initial experiences in adapting molecular modeling applications such as APBS and VMD11 to OpenCL 1.0 have been generally positive. In the coming year, we expect OpenCL to incorporate new features and promote previously optional features to core features of OpenCL 1.1 and later versions.



Recent GPUs allow direct access to host memory over peripheral component interconnect express (PCI-E), and some enable global GPU memory mapping into the host address space, allowing fast zero-copy access to data that's read or written only once during kernel execution. Although zero-copy APIs exist in CUDA, OpenCL currently doesn't include this capability. As OpenCL matures, we hope to see increased support for thread-safety, increased OpenGL interoperability, and extension with advanced features found in APIs such as CUDA.

OpenCL provides correctness guarantees such that code written and optimized for one device will run correctly on any other device, although not necessarily with peak performance. The only exception to this is when optional OpenCL features are used that aren't available on the target device. As future hardware platforms that operate on wider vectors—such as Intel's advanced vector extensions (AVX)—arrive, we're eager to see OpenCL implementations incorporate a greater degree of autovectorization, enabling efficient kernels to be written with less vector-width specificity and improving the performance portability of OpenCL kernels across the board. Our experiences show that OpenCL holds great promise as a standard low-level parallel programming interface for heterogeneous computing devices.

Acknowledgments
We are grateful to Nathan Baker and Yong Huang for development and support of the APBS project, and to Ian Ollmann and Aaftab Munshi for assistance with OpenCL. The US National Institutes of Health funds APBS development through grant R01-GM069702. Performance experiments were made possible with hardware donations and OpenCL software provided by AMD, IBM, and Nvidia, and with support from the US National Science Foundation's Computer and Network Systems grant 05-51665, the US National Center for Supercomputing Applications, and NIH grant P41-RR05969.

References
1. G. Shi et al., "Application Acceleration with the Cell Broadband Engine," Computing in Science & Eng., vol. 12, no. 1, 2010, pp. 76–81.
2. J. Cohen and M. Garland, "Solving Computational Problems with GPU Computing," Computing in Science & Eng., vol. 11, no. 5, 2009, pp. 58–63.
3. A. Bayoumi et al., "Scientific and Engineering Computing Using ATI Stream Technology," Computing in Science & Eng., vol. 11, no. 6, 2009, pp. 92–97.
4. K.J. Barker et al., "Entering the Petaflop Era: The Architecture and Performance of Roadrunner," Proc. 2008 ACM/IEEE Conf. Supercomputing, IEEE Press, 2008, pp. 1–11.
5. A. Munshi, OpenCL Specification Version 1.0, The Khronos Group, 2008; www.khronos.org/registry/cl.
6. J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, 2008, pp. 40–53.
7. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618–2640.
8. J.E. Stone et al., "High-Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multicore CPUs," Proc. 2nd Workshop on General-Purpose Processing on Graphics Processing Units, vol. 383, 2009, pp. 9–18.
9. H.P. Hofstee, "Power Efficient Processor Architecture and the Cell Processor," Proc. 11th Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, 2005, pp. 258–262.
10. N.A. Baker et al., "Electrostatics of Nanosystems: Application to Microtubules and the Ribosome," Proc. Nat'l Academy of Sciences, vol. 98, no. 18, 2001, pp. 10037–10041.
11. W. Humphrey, A. Dalke, and K. Schulten, "VMD—Visual Molecular Dynamics," J. Molecular Graphics, vol. 14, no. 1, 1996, pp. 33–38.

John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the CUDA Center of Excellence, both at the University of Illinois at Urbana-Champaign. He is lead developer of the molecular visualization program VMD. His research interests include scientific visualization and high-performance computing. Stone earned his MS in computer science from the Missouri University of Science and Technology. Contact him at [email protected].

David Gohara is director of research computing in the Department of Biochemistry and Biophysics and the Center for Computational Biology at the Washington University School of Medicine in St. Louis. His research interests are in high-performance computing for improving computational methods used for the study of biological processes. Gohara has a PhD in biochemistry and molecular biology from the Pennsylvania State University, and did his postdoctoral research in x-ray crystallography at Harvard Medical School. Contact him at [email protected].

Guochun Shi is a research programmer at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign. His research interests are in high-performance computing. Shi has an MS in computer science from the University of Illinois at Urbana-Champaign. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.

