2011 International Conference on Parallel Architectures and Compilation Techniques

An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence

Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, and Jaejin Lee
Center for Manycore Programming, School of Computer Science and Engineering
Seoul National University, Seoul, 151-744, Korea
{jun, jungwon, junghyun, sangmin}@aces.snu.ac.kr, [email protected]
http://aces.snu.ac.kr

Abstract—Recently, Intel has introduced a research prototype called the Single-chip Cloud Computer (SCC). The SCC is an experimental processor created by Intel Labs. It contains 48 cores in a single chip, and each core has its own L1 and L2 caches without any hardware support for cache coherence. It allows a maximum of 64GB of external memory that can be accessed by all cores, and each core dynamically maps the external memory into its own address space. In this paper, we introduce the design and implementation of an OpenCL framework (i.e., runtime and compiler) for such manycore architectures with no hardware cache coherence. We have found that the OpenCL coherence and consistency model fits well with the SCC architecture. OpenCL's weak memory consistency model requires a relatively small amount of messages and coherence actions to guarantee coherence and consistency between the memory blocks in the SCC. The dynamic memory mapping mechanism enables our framework to preserve the semantics of the buffer object operations in OpenCL with a small overhead. We have implemented the proposed OpenCL runtime and compiler and evaluate their performance on the SCC with OpenCL applications.

Keywords—Single-chip Cloud Computer, OpenCL, Compiler, Runtime, Cache coherence, Memory consistency

I. INTRODUCTION

These days, most processor manufacturers put more cores in a single chip to improve performance and power efficiency. 2-, 4-, 6-, 8-, and 12-core homogeneous processors are already common in the market, and most of them support hardware cache coherence to provide a single, coherent address space shared between the cores. Many more cores will be integrated together in a single chip in the near future. Although the hardware cache coherence mechanism has been successful so far, it is quite doubtful that it will still work well with future manycore processors.

As the number of cores increases, the on-chip interconnect becomes a major performance bottleneck and takes up a significant amount of the power budget. A complicated cache coherence protocol worsens the situation by increasing the amount of messages that go through the interconnect. For this reason, there have been many studies on the relationship between on-chip interconnects and cache coherence protocols for manycores to find an alternative to the current hardware cache coherence mechanism [1], [2], [3], [4]. As one of these efforts, Intel recently introduced the Single-Chip Cloud Computer (SCC). The SCC experimental processor [5], [6] is a 48-core "concept vehicle" created by Intel Labs as a platform for many-core research. It contains 48 cores in a single chip, and each core has its own L1 and L2 caches, but there is no hardware cache coherence support. Instead, it has a small but fast local memory called the message passing buffer (MPB) to support the message passing programming model. In addition, each core can dynamically map any external main memory block to its own address space at any time.

Recently, the Open Computing Language (OpenCL) [7] has been introduced. OpenCL is a framework for building parallel applications that are portable across heterogeneous parallel platforms. It provides a common hardware abstraction layer across different multicore architectures. It allows application developers to focus their efforts on the functionality of their application, rather than the lower-level details of the underlying architecture. OpenCL defines its own memory hierarchy and memory consistency model to provide software developers with a complete abstraction layer of the target architecture.

In this paper, we introduce the design and implementation of an OpenCL framework (i.e., runtime and compiler) for manycore processors with no hardware cache coherence mechanism, such as the SCC. We have found that OpenCL's coherence and relaxed memory consistency model fits well with the SCC architecture. The memory model requires a relatively small amount of messages exchanged between cores to guarantee coherence and consistency in the SCC. This enables us to efficiently implement the OpenCL framework on top of the SCC's message passing architecture. Moreover, the SCC's dynamic memory mapping mechanism enables our framework to easily preserve the semantics of the buffer object operations in OpenCL with a small overhead. It only requires modifying the control registers of each core to transfer memory blocks between different cores, without using any expensive memory copy operations, such as message passing and DMA operations.

1089-795X/11 $26.00 © 2011 IEEE  DOI 10.1109/PACT.2011.12

Except for the core that takes the role of the host processor in OpenCL, each core in the SCC is viewed as an individual OpenCL compute unit in a single OpenCL compute device. In other words, our OpenCL framework can execute an OpenCL program without any modification to the source as long as the program is written for a single compute device, such as a GPU or CPU.

Our OpenCL-C-to-C compiler translates the OpenCL kernel code to C code that runs on the SCC cores. The OpenCL runtime distributes the workload in the kernel to each core based on a technique called symbolic array bound analysis. It also balances the workload and minimizes coherence actions between cores. The OpenCL-C-to-C compiler also translates the OpenCL kernel code to code that performs the symbolic array bound analysis at run time. The runtime merges the data distributed across each core's private memory to guarantee memory consistency with a negligible overhead by exploiting the SCC's dynamic memory mapping mechanism.

We have implemented our OpenCL runtime and OpenCL-C-to-C translator. We evaluate our approach with 9 real OpenCL applications and show its effectiveness. We analyze the characteristics of our OpenCL framework on the SCC and show how it exploits the features of the SCC efficiently. As far as we know, ours is the first work that builds a complete and transparent software layer for the SCC architecture to improve ease of programming and to achieve high performance.

II. BACKGROUND

We briefly describe the SCC architecture and the OpenCL platform in this section.

A. SCC Architecture

The Single-Chip Cloud Computer (SCC) [5], [6] is an experimental homogeneous manycore processor created by Intel as a platform for manycore software research. Figure 1 shows the block diagram of the SCC architecture. It contains 48 IA-32 cores on a single die in the form of a 6x4 2D-mesh network of tiles. Each tile contains two cores. Each core has its own private 16KB L1 data and instruction caches and a 256KB unified L2 cache, but there is no hardware cache coherence support. Instead, each tile has a local memory buffer, also called the message passing buffer (MPB). Each MPB is 16KB in size, and any core can access all 24 on-die MPBs.

Figure 1. The organization of the SCC architecture.

The MPB is a small but fast memory area that is shared by all cores to enable fast message passing between cores. Operations on the MPB also go through each core's cache, but coherence between the caches and the MPBs is not guaranteed either. To solve this problem, the SCC architecture provides a new instruction called CL1INVMB and a new memory type called MPBT. MPBT data is cached by the L1 cache only. When a core wants to update a data item in the MPB, it can invalidate the cached copy using the CL1INVMB instruction. CL1INVMB invalidates all MPBT cache lines. Since explicit management of the MPBs for message passing is quite burdensome, Intel provides an MPI-like message passing interface called RCCE [8]. RCCE provides basic message passing operations, such as send, receive, broadcast, put, and get. It also provides synchronization operations, such as acquire, release, and barrier. The interface looks just like the MPI interface, but the internal implementation is based on shared MPB management and the CL1INVMB instruction. Our OpenCL runtime exploits the RCCE library to handle communication between the cores.

There are four memory controllers in the SCC. Each controller supports up to 16GB of DDR3 memory. While this implies that a maximum capacity of 64GB of memory is provided by the SCC, each core is able to access only 4GB of memory because it is based on the IA-32 architecture. In order to overcome this limitation, the SCC allows each core to alter its own memory map. Each core has a lookup table, called the LUT, which is a set of configuration registers that map the core's 32-bit physical addresses to the 64GB system memory. Each LUT has 256 entries, and each entry handles a 16MB segment of the core's 4GB physical address space. An LUT entry can point to a 16MB system memory segment, an MPB, or a memory-mapped configuration register. Each LUT entry can be modified at any time. On an L2 cache miss, the requested 32-bit physical address is translated into a 46-bit system address. The system memory address space consists of 4 different 16GB regions of the external memory, 24 16KB regions of MPBs, and regions for the memory-mapped configuration registers of each core.

Each core's LUT is initialized during the bootstrap process, but the memory map can be altered by any core at any time. Consequently, software can freely map any memory region as a shared or private region. However, sharing the same memory region would cause a serious problem because there is no hardware support for cache coherence. Our OpenCL framework provides a convenient and safe way to exploit this type of architecture efficiently while guaranteeing coherence and consistency.

B. OpenCL Platform Model

The OpenCL platform model (Figure 2) consists of a host processor connected to one or more OpenCL compute devices. A compute device is divided into one or more compute units (CUs), each of which contains one or more processing elements (PEs).

An OpenCL program consists of two parts: kernels and a host program. The kernels are executed on one or more compute devices. The host program runs on the host processor and enqueues commands to a command-queue that is attached to a compute device. A kernel command executes a kernel on the PEs within the compute device. A memory command controls a buffer object, and a synchronization command enforces an ordering between commands. The OpenCL runtime schedules an enqueued kernel command on the associated compute device and executes an enqueued memory or synchronization command directly.

When a kernel command is enqueued, an abstract index space is defined. The index space, called an NDRange, is an N-dimensional space, where N is 1, 2, or 3. An NDRange is defined by an N-tuple of integers and specifies the extent of the index space (the dimension and the size). An instance of the kernel executes for each point in this index space. This kernel instance is called a work-item and is uniquely identified by its global ID (an N-tuple) defined by its point in the index space. Each work-item executes the same code, but the specific execution pathway and the accessed data can vary.

Figure 2. The OpenCL platform model.

One or more work-items compose a work-group, which provides a more coarse-grained decomposition of the index space. Each work-group has a unique work-group ID in the work-group index space and assigns a unique local ID to each work-item within itself. Thus, a work-item is identified either by its global ID or by a combination of its local ID and work-group ID. The work-items in a given work-group execute concurrently on the PEs of a single CU.

As shown in Figure 2, each work-item running on a PE accesses four distinct memory regions: global, constant, local, and private. The global memory and constant memory are shared by all CUs in a compute device (i.e., all work-items in all work-groups). The constant memory is a region that remains constant during the execution of a kernel. The local memory is shared by all PEs in a CU (i.e., all work-items in a single work-group). The private memory is private to each PE (i.e., private to each work-item).

C. Memory Consistency Model

OpenCL defines a relaxed memory consistency model. An update to a memory location by a work-item may not be visible to all the other work-items at the same time. Instead, the local view of memory from each work-item is guaranteed to be consistent only at synchronization points. Synchronization points include work-group barriers, command-queue barriers, and events. A work-group barrier synchronizes the work-items in a single work-group, and a command-queue barrier command synchronizes the commands in a single command-queue. To synchronize commands in different command-queues, events are used. Between work-groups during kernel execution, there is no synchronization mechanism available in OpenCL.

III. THE OPENCL FRAMEWORK

In this section, we describe the design and implementation of our OpenCL framework for the SCC.

A. OpenCL Platform Model and SCC

Before elaborating on our OpenCL framework, we first explain how the OpenCL platform model is mapped to the SCC. Table I summarizes the mapping between the OpenCL platform and the SCC. A single core in the SCC becomes the host processor. We will call the host processor the host core in this paper. In our approach, the OpenCL runtime runs on the host core, where the OpenCL host program also runs. Our OpenCL runtime maps the entire SCC except the host core as a single OpenCL compute device. Each core becomes a CU, and there are a total of 47 CUs (SCC cores).

Table I
MAPPING BETWEEN THE OPENCL PLATFORM MODEL AND THE SCC

OpenCL Platform               SCC
Host processor                A single SCC core
Main memory                   Host core's private memory
Compute device                The SCC itself except the host core
Global/constant memory        Host core's private memory
CUs                           SCC cores except the host core
Local memory                  Each core's private memory
PEs                           Virtual PEs emulated by a CU core
Private memory                Each core's private memory
Global/constant memory cache  Caches and the runtime protocol

We assume each core in the SCC has a distinct private system memory region (PSMR) that consists of multiple system memory sections. Sections in the host core's PSMR are configured as the sections of the OpenCL main memory, global memory, and constant memory to reduce the data transfer overhead between the main memory and the global/constant memory.

A space in each CU core's PSMR becomes the CU's local memory. Since our OpenCL runtime emulates PEs with the associated CU core, the private memory of each PE is mapped to a space in the CU core's PSMR.

The remaining space of the CU core's PSMR is used for buffering data to maintain coherence and consistency. The OpenCL global/constant cache is implemented by each CU core's caches and the coherence/consistency management protocol provided by the runtime.

B. Emulating PEs

In OpenCL, a CU executes a set of work-groups, and the work-items in a work-group are executed by one or more PEs in the same CU concurrently. Our OpenCL runtime emulates the PEs using a kernel transformation technique called work-item coalescing [9], [10], provided by our OpenCL-C-to-C translator. The work-item coalescing technique makes the CU core (i.e., an SCC core) execute each work-item in the work-group one by one sequentially, using a loop that iterates over the local work-item index space. Figure 3 (a) shows an OpenCL kernel code, and Figure 3 (b) shows the C code generated by our OpenCL-C-to-C translator. The triply nested loop in Figure 3 (b) is such a loop after the work-item coalescing technique has been applied. Since the index space of the kernel program in Figure 3 (a) is one-dimensional, __local_size[2] and __local_size[1] in Figure 3 (b) will be one.

  __kernel void foo(__global float *A, __global float *B,
                    __global float *C, int P, int Q) {
    int id = get_global_id(0);
    C[id] = A[id] + B[P+Q*id];
  }
                          (a)

  void foo(float *A, float *B, float *C, int P, int Q,
           uint memOffA, uint memOffB, uint memOffC) {
    int id;
    for (int __k = 0; __k < __local_size[2]; __k++) {
      for (int __j = 0; __j < __local_size[1]; __j++) {
        for (int __i = 0; __i < __local_size[0]; __i++) {
          id = get_global_id(0);
          (C-memOffC)[id] = (A-memOffA)[id] + (B-memOffB)[P+Q*id];
        }
      }
    }
  }
                          (b)

Figure 3. (a) An OpenCL kernel code. (b) The C code generated by our OpenCL-C-to-C translator.

When work-group barriers are contained in the kernel (i.e., each work-item in the work-group needs to wait at the barrier until the others arrive), the work-item coalescing technique divides the code into work-item coalescing regions (WCRs) [9]. A WCR is a maximal code region that does not contain any barrier. Since a work-item private variable whose value is defined in one WCR and used in another needs a separate location for each work-item to transfer the variable's value between different WCRs, the variable expansion technique [9] is applied to the WCRs. Then, the work-item coalescing technique executes each WCR using a loop that iterates over the local work-item index space. The execution order of WCRs in work-item coalescing preserves the barrier semantics.

C. Workload Distribution

Our OpenCL runtime divides and distributes the kernel workload to each CU core. Since there is no dependence between work-groups in OpenCL, the unit of workload distribution is a work-group. The runtime distributes the work-groups in a kernel across the CU cores. The set of work-groups that is executed by a CU core is called a work-group assignment. The runtime also distributes the data in the global memory that are accessed by the work-group assignments.

To distribute the workload, the runtime partitions the work-group index space into disjoint work-group assignments in a well-balanced and well-formed manner. The resulting partition is well balanced in the number of work-groups in each assignment, to keep all CU cores busy. A partition is well formed if the work-group IDs in each assignment are contiguous in each dimension of the work-group ID. Figure 4 shows some examples of well-balanced and well-formed work-group partitions.

Figure 4. Examples of well-balanced and well-formed partitions. (a) and (b) are well-formed and well-balanced. (c) is well-balanced but not well-formed. (d) is well-formed but not well-balanced.

The runtime has a built-in partition table that is indexed by the dimension (1, 2, or 3) of the work-group index space. Note that the dimension of the work-group index space will be known at run time. Each table entry records all well-balanced and well-formed partitions a priori. When the dimension of the work-group index space is known at run time, the runtime selects the partition in the table that minimizes the memory management overhead.

A buffer object is a memory object that stores a linear collection of bytes. An array is typically allocated as a buffer object in OpenCL and accessed by a kernel using a pointer. The host program can create the buffer object in the OpenCL global memory and pass the pointer as an argument to the kernel.

If the IDs of all work-groups in the work-group assignment of a CU are contiguous in each dimension, the host launches the kernel only once for all the work-groups in the assignment. However, if they are non-contiguous (i.e., the partition is not well formed), the host needs to launch the same kernel for each set of work-groups with contiguous IDs in the assignment. That is, the kernel is launched multiple times.

In addition, a non-well-formed partition typically introduces multiple non-contiguous buffer access ranges for a work-group assignment in the partition. These non-contiguous buffer access ranges incur much more memory management and transfer overhead. To select an optimal partition that has a minimal memory management overhead, our OpenCL framework relies on a symbolic array bound analysis [11], [12]. The analysis gives the maximal buffer access ranges for each work-group assignment in each candidate partition in the selected table entry.

D. Symbolic Array Bound Analysis

Our OpenCL-C-to-C translator performs the symbolic array bound analysis and generates code for the runtime to determine the maximal buffer access ranges. It conservatively identifies the access ranges of the buffers (i.e., arrays) that are accessed by each work-group assignment. Note that each work-item of an OpenCL kernel typically accesses a buffer using its global ID, work-group ID, and local ID in an SPMD [13] manner. The result provides symbolic lower and upper bounds for each array reference in the kernel. Our OpenCL runtime exploits this information by resolving the symbols at run time, when the size of the index space is determined.

Table II
SYMBOLIC BOUNDS

Array  Read Range                               Write Range
A      [GL:GU]                                  -
B      [P+Q*GL:P+Q*GU] or [P+Q*GU:P+Q*GL]       -
C      -                                        [GL:GU]

For example, Table II shows the result of the symbolic array bound analysis for the kernel in Figure 3 (a). GL and GU are the symbolic lower and upper bounds of get_global_id(0) in Figure 3 (a), respectively. Read Range shows the maximal array regions that may be read by any execution of the kernel. Similarly, Write Range shows the maximal array regions that may be modified by any execution of the kernel. These regions are defined by unresolved symbols, such as GL, GU, P, and Q. The values of these symbols are defined when the host program enqueues the kernel command into a command queue at run time. Thus, just before the kernel is issued to a compute device, our runtime resolves the values of these symbols by running the code generated by the OpenCL-C-to-C translator and obtains the buffer access ranges of the work-group assignment for each CU core. With this buffer access range information, the runtime selects a partition that has a minimal memory management cost.

Figure 5 shows the code generated by our OpenCL-C-to-C translator for the kernel in Figure 3 (a) after the symbolic analysis. The runtime runs this code for each work-group assignment to obtain its buffer access ranges. Each buffer (i.e., array) argument in the kernel is replaced with a pointer to a RANGE data item. A RANGE data item contains four bound variables to represent the buffer access range: lower and upper bounds for read and write. set_read_range and set_write_range set the read and write bound variables for the runtime. GLOBAL_ID_L(N) and GLOBAL_ID_U(N) are macros that calculate the lower and upper bounds of get_global_id(N) at run time, using the index space determined by each work-group assignment. The macros use run-time information, such as the ID of the first work-group in the work-group assignment (FWG), the number of work-groups in each work-group assignment (NWG), the number of work-items in a work-group (WGS), and the number of work-groups in the entire kernel index space (ENWG). Since the dimension of the index space can be one, two, or three, each of these values is a three-dimensional array.

  #define LOCAL_ID_L(N)  (0)
  #define LOCAL_ID_U(N)  (WGS[N]-1)
  #define GROUP_ID_L(N)  (FWG[N])
  #define GROUP_ID_U(N)  (FWG[N]+NWG[N]-1)
  #define GLOBAL_ID_L(N) (GROUP_ID_L(N)*WGS[N]+LOCAL_ID_L(N))
  #define GLOBAL_ID_U(N) (GROUP_ID_U(N)*WGS[N]+LOCAL_ID_U(N))

  void bound_analysis(RANGE *A, RANGE *B, RANGE *C, int P, int Q,
                      size_t FWG[3], size_t NWG[3],
                      size_t WGS[3], size_t ENWG[3]) {
    set_write_range(C, GLOBAL_ID_L(0), GLOBAL_ID_U(0));
    set_read_range(A, GLOBAL_ID_L(0), GLOBAL_ID_U(0));

    if (Q > 0)
      set_read_range(B, P+Q*GLOBAL_ID_L(0), P+Q*GLOBAL_ID_U(0));
    else
      set_read_range(B, P+Q*GLOBAL_ID_U(0), P+Q*GLOBAL_ID_L(0));
  }

Figure 5. The code generated by our OpenCL-C-to-C translator for the kernel in Figure 3 (a) after the symbolic array bound analysis.

To perform the symbolic analysis, our translator uses program slicing techniques [14], [15]. First, it inlines user-defined function calls in the kernel. Then it performs constant propagation, loop invariant and induction variable recognition, and reaching definitions analysis on the kernel. After performing these analyses, the translator extracts the index computation slice for each array reference. Then the translator checks whether the slice satisfies all of the following conditions:

• Each index of the array reference is an affine function of the global ID, work-group ID, and local ID.
• When the array reference is not contained in a loop, no variable used in its indices has more than one reaching definition.
• If the array reference is contained in a well-behaved affine loop, each of its indices is an affine function of the induction variables of the loops, the global ID, the work-group ID, and the local ID.

A well-behaved loop [16] is a for loop of the form for(e1; e2; e3) stmt, in which e1 assigns a value to an integer variable i, e2 compares i to a loop-invariant expression, e3 increments or decrements i by a loop-invariant expression, and stmt contains no assignment to i. We say a well-behaved loop is affine in the global ID, work-group ID, and local ID if both the lower bound and the upper bound of the loop index variable are affine functions of the global ID, work-group ID, and local ID. If the index subscripts of an array reference satisfy these conditions, the translator iteratively applies forward substitution to the slice until there is no more forward-substitutable variable. If a kernel contains an array reference that does not satisfy the above conditions, our runtime conservatively assumes that the access range of each CU core is the entire buffer.

E. Identifying an Optimal Partition

To select an optimal partition, we estimate the memory management overhead of each partition using a cost model that is based on the buffer access range information for each kernel.

When a CU core writes to a buffer, and the buffer access range of the work-groups assigned to the CU core overlaps with that of the work-groups assigned to another core, there is a memory consistency problem. Each CU core has to copy the global memory region of its buffer access range to its own private space and update the local copy to avoid the problem. When the host reads the buffer from the device after executing the kernel, it merges the updates done in the local copies and updates the original copy of the buffer.

Otherwise, the writer CU core just alters its LUT and calls the mmap() function to map the global memory region of the buffer access range into its own address space. This incurs only the LUT modification and mmap() call overhead. There is no memory copy or message passing overhead involved. Moreover, at the end of a kernel execution, the CU core has to do nothing but call the munmap() function to unmap the region from its own address space. The LUT modification and mmap()/munmap() call overhead is small enough to be negligible.

Our memory management overhead (Cost_manage(b)) for a buffer b basically consists of the memory copy overhead Cost_copy(b) and the merge overhead Cost_merge(b) due to the overlapping buffer access ranges.

Consider the case in Figure 6. Each CU writes to a buffer b. A gray bar for each CU core is the buffer write range of the work-groups assigned to that CU core. We divide the buffer into disjoint segment intervals (SIi). A segment (16MB in the SCC) is the unit of mapping between the 64GB system memory address space and a CU core's 32-bit physical address space in the SCC. A part of a segment cannot be mapped separately. Using the lower and upper bounds of the segment intervals and the buffer access range of each CU core, we also divide the buffer into disjoint overlapping intervals (OIi). For each segment interval, we record the number of CUs that have the segment interval in their buffer access range. The same is done for each overlapping interval. When an overlapping interval's CU count is more than one and at least one of the CUs writes to the interval, we say that the overlapping interval is write-shared. When a segment interval overlaps with a write-shared overlapping interval, the segment interval is also said to be write-shared.

Figure 6. Buffer access ranges and overlapping intervals. (The buffer spans segment intervals SI0-SI3 and overlapping intervals OI0 (16MB), OI1 (6MB), OI2 (10MB), OI3 (4MB), OI4 (8MB), OI5 (4MB), and OI6 (16MB); OI2, OI3, and OI5 are each written by two CUs.)

The memory copy overhead Cost_copy(b) for a buffer b is given by

  Cost_copy(b) = Σ_{OI∈S} Count(OI) · Length(OI) · α

where S is the set of write-shared overlapping intervals of b, Count(OI) is the number of CUs that have OI in their access range, Length(OI) is the size of OI in bytes, and α is the cost of copying a byte, obtained by measuring the actual cost on the target machine.

In Figure 6, OI2, OI3, and OI5 are write-shared. They need to be copied to each associated CU core's private address space. Thus, Cost_copy(b) = (2·Length(OI2) + 2·Length(OI3) + 2·Length(OI5)) · α = 36MB · α.

When the runtime updates the original buffer by merging the local copies distributed to the CU cores, it compares the local copies word by word with the original buffer. Cost_merge(b) is given by

  Cost_merge(b) = Σ_{OI∈S} (Count(OI) · Length(OI) · β + Length(OI) · α)

where S is the set of write-shared overlapping intervals of b, and β is the cost of comparing two words (the word size depends on the buffer type), obtained by measuring the actual cost on the target machine. The first term, Count(OI) · Length(OI) · β, is the worst-case word-by-word comparison cost. The second term, Length(OI) · α, is the update cost for a write-shared overlapping interval. It is conservatively assumed that all words in the overlapping interval are updated.

61 model does not guarantee a consistent memory view between physical address of the 16MB segment in the 32-bit address different work-groups during the kernel execution, and there space. The second field is the 46-bit system memory address is no synchronization between work-groups, we may choose of the 16MB segment. An MPB or configuration register can any update as the last update when there are multiple updates be also mapped by a single LUT entry. to the same location in the global memory. This is the reason Buffer object creation. Whenever the host program sends why we do not multiply Count(OI) to the second term. an OpenCL buffer object creation request to the runtime by In Figure 6, Costmerge(b)=(2· Length(OI2)+2· invoking the clCreateBuffer() API function, the runtime Length(OI3)+2· Length(OI5)) · β +((Length(OI2)+ allocates the buffer in the free space of the host core’s PSMR Length(OI3)+Length(OI5)) · α =38MB · β +18MB · α. (note that the OpenCL global memory is mapped to an Cost Now, the total memory management cost manage of area in the host core’s PSMR). Since the runtime manages buffers for a kernel is given by,  each core’s PSMR with 32-bit physical addresses through the core’s LUT, the runtime uses mmap() and munmap() to Costmanage = (Costcopy(b)+Costmerge(b)) b∈B obtain the virtual address of the buffer. Then the runtime returns it to the host program. In addition, the runtime where B is the set of buffers that is accessed in a kernel. The obtains the system memory address of the buffer object from runtime computes Cost for each partition and selects manage the LUT and records it in the data structure of the buffer the partition that has the minimum Cost . Using this manage object for later use. partition, the runtime distributes the work-groups and the The host program invokes clEnqueueWriteBuffer() buffer chunks to each CU core. when it wants to write to a buffer object that has been created

The host program invokes clEnqueueWriteBuffer() when it wants to write to a buffer object that has been created by clCreateBuffer(). The OpenCL runtime copies the data from the OpenCL main memory (allocated in the host core's PSMR) to the designated buffer object in the OpenCL global memory (also allocated in the host core's PSMR).

Collecting runtime parameters. By invoking clSetKernelArg(), the host program passes arguments to the kernel. The runtime records this information in a reserved space and passes it to the kernel when the kernel is launched. By invoking clEnqueueNDRangeKernel(), the host program delivers the information (i.e., size and dimension) about the work-item and work-group index spaces of the kernel to the runtime.

F. Memory Management

To describe how the runtime manages the memory hierarchy and consistency in OpenCL, we explain what is done by our OpenCL runtime step by step.

Initialization. In the initialization phase of the OpenCL runtime, the runtime defines LUT entries to distribute the available system memory to each core. The system memory region allocated to each core is private to that core. For example, when we have a total of 32GB system memory, the runtime allocates a contiguous 2GB system memory region to the host core and a contiguous 624MB system memory region to each CU core. The OS kernel running on each core uses about 320MB of the allocated system memory, and the runtime manages the remaining. The LUT in Figure 7 shows this memory allocation for a CU core. The CU core's LUT has a total of 256 entries and is indexed by the 16MB segment number in the 32-bit physical address space.

[Figure 7. The LUT setting. A CU core's 256-entry LUT: entries 224-247 (0xE0000000-0xF7000000) map the system configuration registers of tiles 00-23 (24 segments); entries 192-215 (0xC0000000-0xD7000000) map the MPB area (24 segments, 384KB of MPBs); entries 0-38 map the core's 624MB private system memory region (39 segments, e.g., entry 21 at 0x15000000 and entry 38 at 0x26000000), of which 320MB is for the OS kernel and 304MB is free space managed by the runtime.]

[Figure 8. Address translation. A buffer chunk (e.g., elements A[20] to A[80] of the host core's buffer A[0] to A[100]) is mapped at 16MB-segment granularity into the CU core's address space, where its first element appears as A[0].]

Distributing buffer objects. When the runtime dequeues a kernel command from the command queue and issues the kernel to the CU cores for execution, the runtime knows all the information about the kernel that is required to resolve the symbols in the symbolic array bound analysis. It calls the analysis function for the kernel at this point. Based on the buffer access range information obtained from the analysis and the cost model described in the previous section, the runtime selects an optimal work-group partition. The runtime assigns the work-group assignments in the selected partition to the CU cores. Then, the runtime distributes the buffer objects that are accessed by the CU cores.
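Buffer distribution relies on the 16MB-segment LUT mapping described above. The arithmetic can be sketched as follows; these helpers are illustrative (a real SCC LUT entry carries more fields than modeled here):

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_BITS 24u                    /* 16MB segments */
#define SEGMENT_SIZE (1u << SEGMENT_BITS)

/* A CU core's 32-bit physical address space is covered by 256 LUT
 * entries, one per 16MB segment. */
unsigned lut_index(uint32_t phys_addr) {
    return phys_addr >> SEGMENT_BITS;       /* 0..255 */
}

uint32_t segment_offset(uint32_t phys_addr) {
    return phys_addr & (SEGMENT_SIZE - 1);
}

/* Translate a core-local physical address to a system memory address,
 * given the segment base stored in the selected LUT entry. */
uint64_t to_system_address(uint64_t lut_segment_base, uint32_t phys_addr) {
    return lut_segment_base + segment_offset(phys_addr);
}
```

For example, physical address 0x15000000 falls in segment 21 and 0xC0000000 in segment 192, matching the sample LUT entries of Figure 7.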

The host core sends a message to a CU core C after flushing its own L1 and L2 caches. The message contains the system memory addresses of the segment intervals in the buffer access ranges. After receiving the message, C flushes its L1 and L2 caches and alters its own LUT to map the segment intervals to its own 32-bit physical address space. Then C sends an acknowledgement message to the host core.

If there is no write-shared overlapping interval in the buffer access ranges of C, there is no memory copy. If C modifies a buffer chunk, the updates are directly reflected to the buffer object located in the host core's PSMR.

If there are write-shared overlapping intervals, the host core sends another message to C after receiving the acknowledgement. The message makes C copy the write-shared overlapping intervals to C's PSMR. After receiving the message, C copies the overlapping intervals from the host core's PSMR (i.e., the global memory) to its own PSMR. After copying, C sends an acknowledgement message to the host core.

Merging local updates. After executing a kernel, the host program typically invokes clEnqueueReadBuffer() to enqueue a memory command that copies the result from the OpenCL global memory to the OpenCL main memory. The buffer object in the host core's PSMR (i.e., the global memory) is either consistent or inconsistent with the result of the kernel execution. When there is a write-shared interval, the associated distributed buffer chunks in the PSMRs of the CU cores are inconsistent with the original buffer chunk in the global memory. Otherwise, the buffer chunk in the global memory is consistent with the result of the kernel execution.

To obtain a correct result, the runtime checks for write-shared overlapping intervals when clEnqueueReadBuffer() is invoked. If there is no write-shared interval for a CU, the host core just makes the CU core flush its L1 and L2 caches. Note that no write-shared interval implies that the CU core directly updates the original buffer located in the host core's PSMR, exploiting the dynamic memory mapping mechanism in the SCC.
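For write-shared intervals, the merge the runtime performs can be sketched as follows. This is illustrative only: the word size is fixed at 32 bits here, and the first differing value is treated as the last update, which OpenCL's relaxed consistency between work-groups permits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge the CU cores' copies of one write-shared overlapping interval
 * back into the original buffer in the host core's PSMR.
 * copies[c] points to CU core c's copy of the interval.
 * Any update may be chosen as the last update, so the comparison for
 * a word stops at the first CU whose value differs. */
void merge_interval(uint32_t *orig, uint32_t **copies,
                    int num_cus, size_t num_words) {
    for (size_t w = 0; w < num_words; w++) {
        for (int c = 0; c < num_cus; c++) {
            if (copies[c][w] != orig[w]) {
                orig[w] = copies[c][w];  /* first updated value wins */
                break;
            }
        }
    }
}
```

In the real runtime this scan runs over LUT-mapped segments, and its worst-case cost is exactly the Count(OI) · Length(OI) · β comparison term of the cost model.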
Otherwise, a set (say P) of CU cores has write-shared overlapping intervals. First, the host core flushes its own caches and makes the CU cores in P flush their caches. The host core modifies its LUT to map the write-shared segment intervals. Then, the host core scans the entire buffer that contains write-shared overlapping intervals. When it meets a write-shared overlapping interval, it compares, word by word (the word size depends on the buffer type), the contents of the interval from the different CU cores to the original buffer. If a different word is found, the host core updates this word in the original buffer in its PSMR. As mentioned before, since the OpenCL memory consistency model does not guarantee a consistent memory view between different work-groups during the kernel execution, and there is no ordering enforced by synchronization between work-groups, we may choose any update as the last update when there are multiple updates to the same location. Thus, the comparison for a word is stopped when the first updated value for the word is detected. To perform this scan, the host core needs at least one LUT entry available to read the segments in a CU core. Figure 9 shows an example of this process.

[Figure 9. The merge process. The host core maps the distributed buffer chunks of CU 0, CU 1, and CU 2 segment by segment and merges their updates into the original buffer.]

When a buffer object is divided and distributed to each core, the addresses of the array references in the original kernel code need to be modified. For example, consider the example of Figure 8. The black area in the host core's buffer object is the access range of the work-group assignment that is assigned to a CU core. By invoking the mmap() system call with the 32-bit physical address of the buffer chunk obtained from the LUT, the CU core obtains the virtual address A of the buffer object. However, if the runtime passes this virtual address to the kernel code, then an array reference A[20] in the host core's buffer object becomes a reference A[0] in the CU core. To make a reference A[20] in the CU core's kernel access the location of A[0] in the CU core's private memory, we need to shift the starting address of the array in the CU's kernel code by memOffA (= 20 in this example). Thus, a reference A[id] in the original OpenCL kernel becomes (A - memOffA)[id] in the CU's kernel. memOffA is passed to the CU core's kernel as an argument. A sample kernel code for a CU core is shown in Figure 3(b).

After receiving the acknowledgement message from all the CU cores, the host core sends a message to each CU core to launch the kernel for execution. Each CU core notifies the host core when it finishes executing the assigned work-groups.

IV. EVALUATION

This section describes the evaluation results of the SCC with our OpenCL framework. To show the effectiveness of our approach, we compare the scalability of the SCC to that of a multicore system with four 12-core chip-multiprocessors.

A. Target Machine

The SCC core is a 32-bit P54C Pentium design that has been altered to increase the L1 data and instruction cache sizes to 16KB each. The ISA has been extended with a new CL1INVMB instruction, as described in Section II. There are 48 cores in one SCC chip. Although the SCC has multiple

frequency domains and the frequency of each domain can be dynamically varied, we fixed the frequency of all cores at 533MHz in this experiment. Each core has its own private 256KB L2 cache, and each tile has a 16KB MPB. The SCC system has 32GB of system memory, 8GB for each memory controller, for this experiment. This system memory can be shared by all cores using LUTs. Each core runs a Linux OS that supports virtual memory but does not support paging to hard disk. We use the RCCE library[8] provided by Intel to implement the message passing mechanism between cores in our OpenCL runtime. Since the P54C Pentium core does not provide an L2 cache flush instruction, we flush the caches by reading a contiguous space of 256KB memory. This takes about 600ms.

The counterpart of the SCC for the comparison is a multicore system with four 2.2 GHz 12-core AMD Opteron processors. Each core has its own L1 data and instruction caches (64KB each) and a 512KB L2 cache. The 12MB L3 cache is shared between all cores, and coherence between the L1 and L2 caches is supported by the hardware.

B. Benchmark Applications

We evaluate our approach with nine OpenCL applications from various sources: AMD[17], NAS[18], NVIDIA[19], Parboil[20], and PARSEC[21]. The characteristics of the applications are summarized in Table V. Only EP and TPACF contain some array references that do not satisfy the symbolic array bound analysis conditions. In this case, our runtime assumes that each work-group assignment in the partition accesses the entire buffer for those array references.

C. Communication Costs

Our OpenCL runtime uses the RCCE library[8] for basic communication, such as passing kernel arguments. Table III shows the cost of an RCCE_send and RCCE_recv pair for different message sizes. Our runtime never uses messages whose size is over 128 bytes.

Table III
COST OF RCCE FOR SMALL MESSAGES

Size | 8 Bytes | 16 Bytes | 32 Bytes | 64 Bytes | 128 Bytes
Time | 9 μs | 9 μs | 9 μs | 10 μs | 14 μs

When the runtime distributes buffer chunks, it exploits mmap() instead of the RCCE library. As Table IV shows, using the mmap() function is three orders of magnitude more efficient than using the RCCE library for the buffer chunk distribution.

Table IV
COMPARISON OF RCCE AND MMAP

Size | 16 MB | 32 MB | 64 MB | 128 MB | 256 MB
RCCE | 1.92 s | 3.84 s | 7.68 s | 15.36 s | 30.72 s
mmap | 2.23 ms | 3.87 ms | 5.36 ms | 9.58 ms | 18.24 ms

D. Effect of Partitioning

As we discussed in Section III, the existence of write-shared overlapping intervals is critical to the performance of our OpenCL framework. If the runtime selects a suboptimal partition that has a bigger memory management cost than an optimal partition, the memory copy and merge overhead will degrade the performance. For example, MatrixMul and CP have large write-shared overlapping intervals with a worst-case partition. An optimal partition for each of them does not have any write-shared overlapping interval. We see the effect of the optimal partition in Figure 10. It shows the speedup with 2, 4, 8, 16, 24, and 48 cores over a single core for both cases: the optimal partition (Best) and the worst-case partition (Worst). There is a significant difference between them as the number of cores increases.

[Figure 10. Performance of different partitions. Bars show the speedup with 2, 4, 8, 16, 24, and 48 cores for the best and worst partitions of CP and MatrixMul.]

E. Scalability

Figure 11 compares the speedup of each application over a single CU core for the SCC with our framework (SCC) and over a single AMD Opteron core for the 48-core AMD Opteron machine with the OpenCL framework available from AMD (Opteron)[17]. We vary the number of cores from 2 to 48. All applications but TPACF scale well on the SCC with our approach when compared to the Opteron machine.

As shown in Table V, repetitive kernel executions cause the poor scalability of RPES. The ideal speedup of RPES with 48 cores is 24, computed using Amdahl's law. At every kernel launch, the runtime thread incurs a non-negligible overhead for distributing the workload, such as sending messages that contain the kernel arguments and the system memory addresses of the buffer objects. A total of 71 kernel launches repeatedly generates this overhead, and as the number of cores increases, this fact manifests itself in the speedup. Table VI also explains why RPES does not scale well. It shows the ratio of communication overhead to the total execution time. Since the communication overhead dominates the execution time of RPES as the number of cores increases, it does not scale well.

The runtime overheads, such as the symbolic analysis, the local-copy merge process for consistency, and the communication overhead, are sensitive to the number of CU cores.
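The ideal-speedup figure for RPES above follows from Amdahl's law, taking column E of Table V (the kernels' share of sequential execution time, 97.80% for RPES) as an approximation of the parallelizable fraction:

```c
#include <assert.h>

/* Amdahl's law: with parallel fraction p and n cores, the speedup is
 * bounded by 1 / ((1 - p) + p / n). */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.978 and n = 48, the bound evaluates to roughly 23.6, i.e., about 24.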

Table V
CHARACTERISTICS OF THE OPENCL APPLICATIONS

Application | Source | Description | Input | A | B | C | D | E | F | G
BinomialOption | AMD | Binomial option pricing | 45120 samples | 1.4MB | 1 | 1 | o | 99.97% | 0.00% | 0.00%
Blackscholes | PARSEC | Black-Scholes PDE | 7219200 options, 1000 iters | 170MB | 1 | 1 | o | 99.83% | 0.00% | 0.00%
CP | Parboil | Coulombic potential | 9024x9024, 1000 atoms | 311MB | 1 | 1 | o | 99.63% | 0.00% | 0.00%
EP | NAS | | CLASS B | 1.5MB | 1 | 1 | x | 99.98% | 0.01% | 0.64%
MatrixMul | NVIDIA | Matrix multiplication | 4512x4512 | 233MB | 1 | 1 | o | 99.16% | 0.00% | 0.00%
MRI-Q | Parboil | Magnetic resonance imaging Q | Large | 5MB | 2 | 1, 2 | o | 99.96% | 0.01% | 0.00%
Nbody | NVIDIA | N-Body simulation | 48128 bodies | 3MB | 1 | 1 | o | 99.90% | 0.01% | 0.00%
RPES | Parboil | Rys polynomial equation solver | Default | 79.4MB | 2 | 54, 17 | o | 97.80% | 0.19% | 0.00%
TPACF | Parboil | Two-point angular correlation function | Default | 9.5MB | 1 | 1 | x | 99.95% | 0.01% | 0.32%

A: total data size in the device global memory, B: # of kernels, C: # of kernel executions, D: satisfying symbolic analysis conditions, E: % seq. exec. time of kernels, F: % exec. time of symbolic analysis run for 48 cores, G: % exec. time of merge process for 48 cores.
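Column D of Table V marks whether every array reference satisfies the symbolic array bound analysis conditions. For a simple affine reference A[id*stride + offset], the access range of a work-group assignment follows directly from the bounds of its global work-item ids. The helper below is a hypothetical single-dimension sketch, not the analysis itself:

```c
#include <assert.h>

typedef struct { long lower, upper; } range_t;

/* Access range of the affine reference A[id*stride + offset] over the
 * global work-item ids [first, last] of one work-group assignment. */
range_t affine_access_range(long first, long last, long stride, long offset) {
    long a = first * stride + offset;
    long b = last * stride + offset;
    range_t r;
    r.lower = a < b ? a : b;   /* a negative stride flips the bounds */
    r.upper = a < b ? b : a;
    return r;
}
```

When a reference does not fit this form (as in EP and TPACF), the runtime conservatively assumes that the work-group assignment accesses the entire buffer.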

[Figure 11. The speedup comparison between the SCC and the Opteron system. The speedup is obtained over a single core. Bars show the speedup with 2, 4, 8, 16, 24, and 48 cores for each of the nine applications on both systems.]

However, as shown in columns F and G of Table V, the ratio of the symbolic array bound analysis and merge process is under 1% for all applications with 48 cores. For all applications but EP and TPACF, the merge process does not need to be performed. As shown in Table VI, with 48 cores, the ratio of communication overhead is less than or equal to about 10% of the total execution time for all applications but RPES.

Table VI
RATIO OF COMMUNICATION OVERHEAD

Application | 1 core | 2 cores | 4 cores | 8 cores | 16 cores | 24 cores | 48 cores
BinomialOption | 0.00% | 0.00% | 0.00% | 0.01% | 0.02% | 0.05% | 0.18%
Blackscholes | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.01% | 0.03%
CP | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.01%
EP | 0.02% | 0.03% | 0.10% | 0.42% | 1.61% | 3.48% | 11.07%
MatrixMul | 0.00% | 0.00% | 0.00% | 0.01% | 0.03% | 0.06% | 0.21%
MRI-Q | 0.00% | 0.00% | 0.00% | 0.01% | 0.03% | 0.05% | 0.19%
Nbody | 0.00% | 0.02% | 0.05% | 0.20% | 0.71% | 1.59% | 5.66%
RPES | 0.52% | 1.56% | 3.59% | 9.98% | 23.92% | 37.75% | 63.21%
TPACF | 0.00% | 0.00% | 0.00% | 0.01% | 0.05% | 0.12% | 0.34%

TPACF is not a case that shows poor scalability due to the runtime overhead. The reason is load imbalance. Each work-group in TPACF has a different amount of workload. As we mentioned in Section III-C, the unit of workload distribution is a work-group, and we assume the same amount of workload across different work-groups to identify an optimal partition. With our static approach, the actual amount of workload distributed across the CU cores is therefore different, and this results in poor scalability. This problem can be solved by adding a more sophisticated dynamic scheduling mechanism that considers load balancing in addition to the memory management overhead in the SCC.

Overall, our experimental results show that the SCC with our OpenCL framework achieves scalability better than or comparable to that of the cache-coherent manycore system with a commodity OpenCL framework.

V. RELATED WORK

There have been many studies on scalable hardware cache coherence mechanisms for chip-multicore processors[1], [2], [3], [4]. These studies focus on the problem of how to improve the scalability and power consumption of the hardware cache coherence mechanism itself. There are very few studies performed on non-cache-coherent

homogeneous multicore architectures. To our knowledge, our OpenCL proposal is the first work building a complete and transparent software layer for the SCC architecture to improve ease of programming and to achieve high performance.

Saha et al.[22] propose a programming model for a heterogeneous x86 platform. The target architecture consists of a general-purpose x86 CPU and a Larrabee processor with 24 cores[23]. The two processors are connected to a PCI express bus. They provide a partly shared memory model between all cores in the system. Since their method relies on the hardware cache coherence mechanism, it is not applicable to our non-cache-coherent target architecture.

Gummaraju et al.[24] present their OpenCL platform that is capable of exploiting multicore CPUs. Their approach is similar to our work because they combine multiple parallel threads into one loop to emulate PEs with a single CPU core. They heavily rely on setjmp() and longjmp() to solve the work-group barrier problem, and choose a different way to reduce the context switching overhead between work-items: they implement their own light-weight setjmp() and longjmp(), while our runtime relies on the work-item coalescing technique. Moreover, since their runtime also depends on the hardware cache coherence support, it is not applicable to our non-cache-coherent target architecture.

The Cell BE processor[25] has some similarity with the SCC, but it is a heterogeneous multicore processor. The SPEs have their own private local stores that are not coherent with the main memory. Instead, DMA operations on the local stores and some communication mechanisms are provided. Consistency of the local stores should be fully managed by the programmer or the software. Lee et al.[9] propose an OpenCL framework for the Cell BE processor. They propose the work-item coalescing technique and the variable expansion technique to emulate the PEs on an SPE. Instead of directly managing coherence and consistency of the local stores, they exploit software caches to guarantee coherence and consistency. They also propose a buffering technique that exploits preloading.

Stratton et al.[10] propose a compiler technique to emulate parallel work units on a single processor. Their compiler translates a CUDA kernel program to code that runs on homogeneous multicore processors. They propose a technique similar to the work-item coalescing technique to emulate parallel work units. Their compiler divides a parallel region by barriers and packs a chunk of parallel work units in a loop. They also expand the local variables to distinguish them using the thread id.

Kim et al.[26] present an OpenCL framework that exploits multiple GPU devices and automatically builds a single GPU image on top of them. Their runtime manages virtual buffers in the host main memory and copies them to each device memory when required. They analyze the buffer access ranges using a sampling technique at run time. Using the buffer access ranges, they try to select an optimal work-group partition that minimizes the data transfer cost through the PCI-E bus between the host and the GPUs. Our approach relies on a symbolic array bound analysis to identify buffer access ranges, and it tries to minimize the consistency management overhead in addition to the memory copy overhead when selecting an optimal partition.

VI. CONCLUSIONS

In this paper, we describe the design and implementation of an OpenCL framework for non-cache-coherent homogeneous manycore processors, such as the Intel SCC. We find that the OpenCL parallel programming model can be an ideal software layer for such homogeneous manycores because of its relaxed memory consistency and data parallel programming model. We do not have any difficulties providing the standard OpenCL API functions, and standard OpenCL applications do not need to be modified to exploit the underlying message passing architecture.

Our OpenCL runtime deploys each SCC core as a CU of the OpenCL platform model. To emulate PEs in a single core efficiently, our OpenCL-C-to-C compiler applies the work-item coalescing and variable expansion techniques to the kernels. The runtime exploits the SCC's dynamic memory mapping mechanism together with the symbolic array bound analysis technique to optimally distribute the kernel workload between CU cores. When the distributed buffer chunks overlap each other and at least one of them is written, the runtime compares the values of the chunks to eliminate any inconsistencies after kernel execution.

With nine OpenCL applications, we compare the scalability of our framework on the SCC with that of a 48-core, cache-coherent commodity multicore system. Our experimental results show that the SCC with our OpenCL framework achieves scalability better than or comparable to that of the commodity manycore system. This partly justifies the effectiveness of non-cache-coherent homogeneous manycore processors and the appropriateness of the OpenCL framework for such an architecture to achieve high performance.

ACKNOWLEDGMENT

This work was supported in part by grant 2009-0081569 (Creative Research Initiatives: Center for Manycore Programming) from the National Research Foundation of Korea. This work was also supported in part by the Ministry of Education, Science and Technology of Korea under the BK21 Project. ICT at Seoul National University provided research facilities for this study.

REFERENCES

[1] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu, M. M. K. Martin, and D. A. Wood, "Improving Multiple-CMP Systems Using Token Coherence," in Proceedings of the 11th International Symposium on High-Performance Computer Architecture, February 2005, pp. 328–339.
[2] N. D. E. Jerger, L.-S. Peh, and M. H. Lipasti, "Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence," in Proceedings of the 41st IEEE/ACM International Symposium on Microarchitecture, November 2008, pp. 35–46.
[3] J. A. Brown, R. Kumar, and D. Tullsen, "Proximity-aware directory-based coherence for multi-core processor architectures," in Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA '07, 2007, pp. 126–134.
[4] M. M. Michael and A. K. Nanda, "Design and performance of directory caches for scalable shared memory multiprocessors," in Proceedings of the 5th International Symposium on High Performance Computer Architecture, ser. HPCA '99, 1999.
[5] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. V. D. Wijngaart, and T. Mattson, "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," in ISSCC '10: Proceedings of the 57th International Solid-State Circuits Conference, February 2010, pp. 19–21.
[6] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, "The 48-core SCC processor: the programmer's view," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '10, 2010, pp. 1–11.
[7] Khronos OpenCL Working Group, "The OpenCL Specification Version 1.1," 2010, http://khronos.org/opencl.
[8] R. F. van der Wijngaart, T. G. Mattson, and W. Haas, "Light-weight communications on Intel's single-chip cloud computer processor," SIGOPS Oper. Syst. Rev., vol. 45, pp. 73–83, February 2011.
[9] J. Lee, J. Kim, S. Seo, S. Kim, J. Park, H. Kim, T. T. Dao, Y. Cho, S. J. Seo, S. H. Lee, S. M. Cho, H. J. Song, S.-B. Suh, and J.-D. Choi, "An OpenCL Framework for Heterogeneous Multicores with Local Memory," in PACT '10: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, September 2010, pp. 193–204.
[10] J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W.-m. W. Hwu, "Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs," in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO '10, 2010, pp. 111–119.
[11] M. Wolfe, High Performance Compilers for Parallel Computing, 1995.
[12] R. Rugina and M. Rinard, "Symbolic bounds analysis of pointers, array indices, and accessed memory regions," in Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, ser. PLDI '00, 2000, pp. 182–195.
[13] F. Darema, "The SPMD Model: Past, Present and Future," in Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2001.
[14] M. Weiser, "Program slicing," in Proceedings of the 5th International Conference on Software Engineering, ser. ICSE '81, 1981, pp. 439–449.
[15] F. Tip, "A survey of program slicing techniques," Journal of Programming Languages, vol. 3, pp. 121–189, 1995.
[16] S. S. Muchnick, Advanced Compiler Design and Implementation, 1997.
[17] AMD, "AMD Accelerated Parallel Processing SDK v2.3," http://developer.amd.com/gpu/AMDAPPSDK/Pages/default.aspx.
[18] NASA Advanced Supercomputing Division, "NAS Parallel Benchmarks."
[19] NVIDIA, "NVIDIA GPU Computing Software Development Kit," http://developer.nvidia.com/object/cuda_3_1_downloads.html.
[20] The IMPACT Research Group, "Parboil Benchmark suite," http://impact.crhc.illinois.edu/parboil.php.
[21] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '08, 2008, pp. 72–81.
[22] B. Saha, X. Zhou, H. Chen, Y. Gao, S. Yan, M. Rajagopalan, J. Fang, P. Zhang, R. Ronen, and A. Mendelson, "Programming model for a heterogeneous x86 platform," in Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '09, 2009, pp. 431–440.
[23] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: a many-core x86 architecture for visual computing," in ACM SIGGRAPH 2008 Papers, ser. SIGGRAPH '08, 2008, pp. 18:1–18:15.
[24] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. R. Gaster, and B. Zheng, "Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '10, 2010, pp. 205–216.
[25] IBM, Sony, and Toshiba, "Cell Broadband Engine Architecture," October 2007.
[26] J. Kim, H. Kim, J. H. Lee, and J. Lee, "Achieving a single compute device image in OpenCL for multiple GPUs," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '11, 2011, pp. 277–288.
