Assessing One-to-One Parallelism Levels Mapping for OpenMP Offloading to GPUs

Chen Shen, Dept. of Computer Science, University of Houston, Houston, TX, [email protected]
Xiaonan Tian, Dept. of Computer Science, University of Houston, Houston, TX, [email protected]
Dounia Khaldi, Institute for Advanced Computational Science, Stony Brook University, Stony Brook, NY, [email protected]

Barbara Chapman, Institute for Advanced Computational Science, Stony Brook University, Stony Brook, NY, [email protected]

Abstract

The proliferation of accelerators in modern clusters makes efficient coprocessor programming a key requirement if application codes are to achieve high levels of performance with acceptable energy consumption on such platforms. This has led to considerable effort to provide suitable programming models for these accelerators, especially within the OpenMP community. While OpenMP 4.5 offers a rich set of directives, clauses and runtime calls to fully utilize accelerators, an efficient implementation of OpenMP 4.5 for GPUs remains a non-trivial task, given their multiple levels of thread parallelism.

In this paper, we describe a new implementation of the corresponding features of OpenMP 4.5 for GPUs based on a one-to-one mapping of its loop hierarchy parallelism to the GPU thread hierarchy. We assess the impact of this mapping, in particular the use of GPU warps to handle innermost loop execution, on the performance of GPU execution via a set of benchmarks that include a version of the NAS parallel benchmarks specifically developed for this research; we also used the Matrix-Matrix multiplication, Jacobi, Gauss and Laplacian kernels.

Keywords  OpenMP, OpenUH, LLVM, GPUs

1. Introduction

Most modern clusters include nodes with attached accelerators, NVIDIA GPU accelerators being the most commonly used; more recently, accelerators such as AMD GPUs, Intel Xeon Phis, DSPs, and FPGAs have been explored as well. The energy efficiency of accelerators and their resulting proliferation have made coprocessor programming essential if application codes are to efficiently exploit HPC platforms. As a result, considerable effort has been made to provide suitable programming models.

One may first distinguish between low-level and high-level programming libraries and languages; the user has to choose a programming API that suits her application code, level of programming, and time constraints. CUDA (cud 2015) and OpenCL (OpenCL Standard) are low-level programming interfaces. On the other hand, high-level directive-based APIs for programming accelerators, such as OpenMP (Chapman et al. 2008) and OpenACC (OpenACC), are easy to use by scientists, preserve high-level language features, and provide good performance portability. Recently, multiple open-source compilers (e.g., GCC, LLVM, OpenUH) and commercial compilers such as Cray, IBM, Intel and PGI have added support for these directive-based APIs, or are in the process of doing so.

OpenMP's multithreading features are already employed in many codes to exploit multicore systems, and the broadly available standard has been rapidly evolving to meet the requirements of heterogeneous nodes. In 2013, Version 4.0 of the OpenMP specification made it possible for user-selected computations to be offloaded onto accelerators, and further important features were added in Version 4.5 (Beyer et al. 2011; Board 2015); implementations within vendor compilers are under way. OpenMP 4.0 added a number of device constructs: target can be used to specify a region that should be launched on a device and to define a device data environment, and target data maps variables onto that device. The teams directive can be used inside target to spawn a set of teams, each containing multiple OpenMP threads; note that a team can be mapped to a CUDA thread block. Finally, distribute is used to partition the iterations of an enclosed loop among the teams of such a set.
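As a point of reference, the following self-contained sketch (not taken from the paper's benchmarks; array names and sizes are illustrative) shows how these device constructs combine in C code: target creates the device data environment, teams distribute spreads the outer loop over a league of teams, and parallel for workshares the inner loop within each team.

    /* Illustrative sketch only: the OpenMP 4.x device constructs described above. */
    #include <stdio.h>
    #define N 1024

    static float a[N][N], b[N][N];

    int main(void)
    {
        /* target: run the region on the device; map copies data in and out. */
        #pragma omp target map(to: a) map(from: b)
        /* teams distribute: create a league of teams (each team may become a
         * CUDA thread block) and partition the i iterations among them. */
        #pragma omp teams distribute
        for (int i = 0; i < N; i++) {
            /* parallel for: workshare the j loop among the threads of a team. */
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
                b[i][j] = 2.0f * a[i][j];
        }
        printf("%f\n", b[0][0]);
        return 0;
    }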
Compared to OpenMP 4.0, OpenMP 4.5 changed the semantics of the mapping of scalar variables in C/C++ target regions. It also provides support for asynchronous offloading using the nowait and depend clauses on the target construct. The mapping of data can likewise be performed asynchronously in OpenMP 4.5 using target enter data and target exit data. The is_device_ptr clause was added to target to indicate that a variable is a device pointer that is already in the device data environment, so that it can be used directly. The use_device_ptr clause was added to target data to convert variables into device pointers to the corresponding variables in the device data environment. Finally, device memory routines, such as omp_target_alloc, were added to support explicit allocation of memory and memory transfers between hosts and offloading devices.
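A minimal sketch of these OpenMP 4.5 additions, assuming a simple SAXPY-like routine (the function, array names and size are assumptions for illustration, not taken from the paper):

    /* Illustrative sketch of the OpenMP 4.5 features listed above: unstructured
     * data mapping, asynchronous target execution, and device memory routines. */
    #include <omp.h>
    #define N 4096

    void saxpy_offload(float *x, float *y, float alpha)
    {
        int dev = omp_get_default_device();

        /* Unstructured mapping: create device copies of x and y. */
        #pragma omp target enter data map(to: x[0:N], y[0:N])

        /* Asynchronous offload: nowait turns the target region into a deferred
         * task; depend orders it against other tasks touching y. */
        #pragma omp target nowait depend(inout: y[0:N])
        for (int i = 0; i < N; i++)
            y[i] += alpha * x[i];

        #pragma omp taskwait   /* wait for the deferred target task */

        /* Copy y back to the host and release both device copies. */
        #pragma omp target exit data map(from: y[0:N]) map(delete: x[0:N])

        /* Device memory routines: explicit device allocation; the resulting
         * pointer is handed to a target region via is_device_ptr. */
        float *dbuf = (float *) omp_target_alloc(N * sizeof(float), dev);
        #pragma omp target is_device_ptr(dbuf)
        for (int i = 0; i < N; i++)
            dbuf[i] = 0.0f;
        omp_target_free(dbuf, dev);
    }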
A typical GPU architecture, for which some of the above-mentioned features were designed, provides three levels of parallelism, namely thread blocks, warps, and threads inside each warp. OpenMP 4.5 likewise provides three levels of loop hierarchy parallelism, namely teams, parallel, and simd. For traditional CPU architectures and CPU-like accelerators, the mapping of these three levels of parallelism is straightforward. However, a typical GPU architecture differs substantially from a CPU, and thus the design of an efficient implementation of OpenMP 4.5 on GPUs is not a trivial task. In this work, we describe an efficient implementation of the OpenMP offloading constructs in the open-source OpenUH compiler (Chapman et al. 2012) for NVIDIA GPUs. Our implementation of OpenMP 4.x offloading is based on a one-to-one mapping of the OpenMP levels of parallelism to CUDA's levels of parallelism, i.e., grid, thread block, and warp. To assess the suitability of this approach, we also implemented an OpenMP accelerator version of the NAS Parallel Benchmarks and used it to test our implementation.

This paper makes the following contributions:

- we propose a one-to-one mapping of the loop hierarchy parallelism available in OpenMP 4.x to the GPU thread hierarchy and implement this mapping in the OpenUH compiler;
- we use a set of benchmarks to assess the impact of this mapping, especially the use of GPU warps to handle innermost loop execution, on the performance of GPU execution;
- we adapt the NAS parallel benchmarks to use OpenMP offloading, and make them available for use by the OpenMP community to test other implementations and platforms.

The organization of this paper is as follows. Section 2 provides an overview of the NVIDIA GPU architecture and the OpenUH compiler. Section 3 describes the one-to-one loop hierarchy mapping to the GPU thread hierarchy that we have adopted. Performance results are discussed in Section 4. Section 5 highlights related work in this area. Finally, we conclude in Section 6.

2. Background

In this section, we provide a brief overview of the architecture of GPUs and of the OpenUH compiler.

2.1 GPU Architecture

Modern GPUs consist of multiple streaming multiprocessors (SMs or SMXs); each SM consists of many scalar processors (SPs, also referred to as cores). Each GPU supports the concurrent execution of hundreds to thousands of threads following the Single Instruction Multiple Threads (SIMT) paradigm, and each thread is executed by a scalar core. The smallest scheduling and execution unit is called a warp, which is typically composed of 32 threads. Warps of threads are grouped together into a thread block, and blocks are grouped into a grid. Both the thread blocks and the grid can be organized into a one-, two- or three-dimensional topology.

SIMT execution within a warp of a GPU can be compared to SIMD execution on a CPU: in each case, the same instruction is broadcast to multiple arithmetic units. However, the CPU relies on vector instructions to process several elements of short arrays in parallel, whereas in SIMT mode several elements of an array are processed in parallel by the threads of a warp.

Modern GPUs also deploy a deep memory hierarchy, which comprises several different memory spaces, each with special properties. The global memory is the main memory of the GPU, which all threads can access. The so-called shared memory is shared by the threads of a thread block only. The texture memory is read-only memory that allows adjacent reads within a warp.

Each kind of memory requires specific attention. For example, accesses to global memory may or may not be coalesced; accesses to texture memory could come with a spatial locality penalty; accesses to shared memory might suffer from bank conflicts; and accesses to constant memory may be broadcast. Memory coalescing is a key enabler of data locality in modern GPU architectures: memory requests by the individual threads of a warp are grouped together into large transactions, so that when consecutive threads access consecutive memory addresses, spatial data locality is exploited within the warp.
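The effect of coalescing can be illustrated with a pair of hand-written CUDA kernels (not part of the paper's benchmarks): consecutive threads of a warp reading consecutive addresses are served by a few wide transactions, whereas a strided pattern is not.

    /* Illustrative CUDA sketch: coalesced versus strided global-memory access. */
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread t reads in[t]       */
        if (i < n)
            out[i] = in[i];                             /* few wide transactions/warp */
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  /* in[t*stride]    */
        if (i < n)
            out[i] = in[i];                             /* many narrow transactions   */
    }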

2.2 OpenUH Compiler

The OpenUH compiler (Chapman et al. 2012) is a branch of the open-source Open64 compiler suite for C, C++, and Fortran 95/2003. It offers a complete implementation of OpenACC 1.0 (Tian et al. 2014), Fortran Coarrays (Eachempati et al. 2012), and OpenMP 3.0 (Liao et al. 2007). OpenUH is an industrial-strength optimizing compiler that integrates all the components of a modern compiler; it serves as a parallel programming infrastructure for the compiler research community. OpenUH has been the basis for a broad range of research endeavors, such as language research (Ghosh et al. 2013; Huang et al. 2010; Eachempati et al. 2012), static analysis of parallel programs (Chapman et al. 2003), performance analysis (Pophale et al. 2014), task scheduling (Ahmad Qawasmeh 2014) and dynamic optimization (Besar Wicaksono 2011).

The major functional parts of the compiler are the front ends (the Fortran 95 front end was originally developed by Cray Research and the C/C++ front end comes from GNU GCC 4.2.0), the inter-procedural analyzer/optimizer (IPA/IPO), and the middle-end/backend, which is further subdivided into the loop nest optimizer (LNO), including an auto-parallelizer (with an OpenMP optimization module), the global optimizer (WOPT), and the code generator (CG). Currently, x86-64, IA-64, IA-32, MIPS, and PTX are supported by its backend. OpenUH may also be used as a source-to-source compiler for other machines using its IR (Intermediate Representation)-to-source tools. OpenUH uses a tree-based IR called WHIRL, which comprises five levels, from Very High (VH) to Very Low (VL), to enable a broad range of optimizations; this design allows the compiler to perform various optimizations at different levels.

3. OpenUH-Based OpenMP Offloading Support

In the OpenMP offloading model, it is assumed that the main program runs on the host while the compute-intensive regions of the program are offloaded to the attached accelerator. OpenMP constructs for device and host execution are similar: the target directive is used to distinguish between the two. A target region consists of two parts: (1) a device data environment, which maps host variables to the device memory, and (2) a computing region that will run on the device. The device and the host may have separate memories; data movements between them may be controlled explicitly. Currently, OpenMP provides a rich set of data transfer directives, clauses and runtime routines as part of the standard to create device data environments and offload regions.

OpenMP also provides multiple levels of parallelism. First, the teams directive can be used to spawn a league of teams; each team contains a number of OpenMP threads. Next, distribute may be attached to a loop to partition its iterations among the master threads of these teams. Then, the parallel for directive can be used to workshare the enclosed loop within each team. Finally, the simd directive gives a hint to the compiler to vectorize the loop iterations using SIMD instructions.
3.1 Frontend and Backend Support

We have implemented the OpenMP accelerator constructs in the OpenUH compiler. Note that OpenUH also supports OpenACC, another directive-based API for device programming; the existing OpenACC implementation generates CUDA and OpenCL kernel functions for NVIDIA and AMD GPUs, respectively. In this work, we target NVIDIA GPUs because of their wide popularity in the HPC community; we plan to address AMD GPUs in future work.

The creation of an OpenMP compiler for accelerators requires a significant implementation effort to meet the challenges of mapping high-level loop constructs to low-level threading architectures (as shown in Figure 1). The implementation is divided into the following phases.

Figure 1: OpenUH compiler framework for OpenMP offloading constructs (NVCC is the NVIDIA CUDA compiler). The source code with OpenMP directives flows from the frontends through PRELOWER (preprocessing of OpenMP), Pre-OPT (SSA analysis), LNO (loop nest optimizer), LOWERING (transformation of OpenMP), WOPT (global scalar optimizer), and WHIRL2CUDA/OpenCL; the CPU path goes through CG (code generation for IA-32, IA-64, x86_64) and the linker to a CPU binary, while the CUDA path is compiled by NVCC into PTX that is loaded dynamically by the runtime library.

Frontend. We have extended the existing frontend to support the following directives and clauses of OpenMP 4.5: target, target data, and the loop parallelism directives, namely distribute, teams, and simd (the parallel for construct was already implemented in OpenUH). We parse these constructs and generate new corresponding intermediate representation nodes.

Backend. In this phase, we transform the generated IR nodes into runtime library calls, which in turn invoke lower-level CUDA library routines to allocate and deallocate memory on the GPU; this corresponds to the LOWERING module in Figure 1. Then, we outline each OpenMP target region as a CUDA kernel function that is fed to another module of Figure 1, called WHIRL2CUDA, in order to generate CUDA source code. Finally, we replace the original outlined target region with a runtime function call that is used to launch the kernel. Note that, during the outlining step, we perform a one-to-one mapping of parallel loops to the threading architecture; we detail this mapping in the following subsection.

CUDA binary generation. Since our translation handles target regions by source-to-source translation, we use the CUDA SDK environment to translate the CUDA kernel functions into NVIDIA GPU assembly code, also called PTX code.
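Schematically, the outlining step can be thought of as producing a CUDA kernel holding the body of the target region plus a host-side launch. The code below is a hand-written sketch of that idea, not the code OpenUH actually emits; the names are illustrative, and the real launch goes through the offloading runtime library rather than a direct kernel call.

    /* Hand-written sketch of outlining a target region; not OpenUH output. */
    #include <cuda_runtime.h>

    __global__ void outlined_target_region(float *a, int isize)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= 1 && i <= isize - 1)
            a[i] = 2.0f * a[i];        /* body of the original target region */
    }

    void launch_outlined_region(float *d_a, int isize)
    {
        /* In the compiler, this launch is issued through the offloading
         * runtime library; a direct CUDA launch is shown for illustration. */
        int threads = 128;
        int blocks  = (isize + threads - 1) / threads;
        outlined_target_region<<<blocks, threads>>>(d_a, isize);
        cudaDeviceSynchronize();
    }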
Our main goal is to follow the OpenMP target model specification while attaining high performance on GPUs as well as on CPU-like accelerators, despite the significant differences in their architectures. The individual cores of CPU-like accelerators contain SIMD units, and their ability to issue SIMD instructions makes it easy for these cores to exploit the SIMD directives in OpenMP, which are important for performance on such platforms. In contrast, GPUs are designed for massively parallel processing; their thread hierarchy is flatter than a CPU's, and warps are the basic scheduling units, which makes mapping the different levels of parallelism challenging.

With a traditional CPU architecture and with CPU-like accelerators, the mapping of the three levels of parallelism is straightforward. On the latest 72-core Intel Knights Landing (KNL) accelerator, for example, each core has two 512-bit vector units and supports four threads; in this case, we might create 72 teams of 4 threads each, with each thread performing 512-bit SIMD operations. The typical GPU architecture does not permit this straightforward approach, and thus designing an efficient implementation of OpenMP 4.x on the GPU is not a trivial task. In this work, we introduce, and subsequently assess, an approach based on a one-to-one mapping of the OpenMP levels of parallelism to the levels of parallelism found in CUDA, i.e., grid, thread block, and warp (cud 2015).

3.2 One-to-One Loop Hierarchy Mapping of OpenMP 4.5 Parallelism

Performance portability, where a single source code runs well across different target platforms, is increasingly demanded. Progress toward this goal in the context of OpenMP requires that the accelerator directives be implemented efficiently on GPUs as well as on CPU-like accelerators. To achieve this, it is necessary to effectively exploit each level of architectural parallelism, between and within groups of threads, and to map all three levels of OpenMP 4.x parallelism onto the GPU. Mapping the simd construct of OpenMP is critical to fully exploit the corresponding computing capabilities.

Table 1 shows the mapping of OpenMP parallelism concepts to CUDA parallelism levels that was used in this work.

Table 1: OpenUH loop mapping in terms of CUDA terminology

  OpenMP abstraction   CUDA abstraction
  teams                thread blocks within the grid
  parallel             warps within a thread block
  simd                 32 threads within a warp

The teams construct is translated directly into thread blocks in CUDA, because thread blocks are independent of one another and consist of a group of threads that run on the same execution unit (SMX). In the OpenMP target region, the environment of the compute kernel is initiated, which includes device data creation and CUDA kernel generation. At runtime, target data are transferred between host and device according to the information in the map clauses, and then the CUDA kernel is launched.

When the program encounters the omp teams distribute directive, a league of thread teams is created and the execution of the associated loop is distributed across them. This mapping strategy creates a grid consisting of thread blocks; the computational load is then spread across those thread blocks. Strictly following the OpenMP specification, the workload is shared among the master threads of the thread blocks. Since we generate CUDA kernel calls, there is no way to specify synchronization among different teams; thread synchronization calls are only effective within a thread block.

The implementation of the omp parallel for directive inside an omp target region is based on a single thread block: a thread block gets a chunk of data and distributes the associated loop across its threads. In OpenUH, the distribution is actually accomplished among the warps within a thread block, and the master thread of each warp finishes its workload and waits at a synchronization point. This mapping strategy makes it possible to utilize the shared memory resources of each thread team on GPU-like accelerators. Each warp consists of 32 threads; the first thread of each warp is the thread with the smallest thread id. A parallel region is executed by the first thread of every warp within the same thread block; if for is combined with parallel, these first threads workshare the parallel loop region.

The GPU architecture supports the SIMT approach by executing the same instruction on several elements concurrently within a warp. Therefore, a single instruction executed by a warp can be considered a SIMD operation as defined by the OpenMP standard. Thus, in our implementation of OpenMP offloading for GPUs, we interpret the simd construct as the execution unit of a warp: the iterations of a loop annotated with the simd directive are distributed, or vectorized, among the threads of the same warp.
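The mapping of Table 1 can be pictured with the following hand-written CUDA sketch (illustrative only, not code generated by OpenUH): block indices carry the distribute (teams) loop, the warp index inside a block carries the parallel loop, and the 32 lanes of each warp carry the simd loop, so that consecutive lanes touch consecutive addresses.

    /* Illustrative sketch of the Table 1 mapping; assumes blockDim.x is a
     * multiple of 32 and uses a flattened, made-up indexing of lhsX. */
    #define WARP_SIZE 32

    __global__ void mapped_loop_nest(float *lhsX, int isize, int jsize, int ksize)
    {
        int warps_per_block = blockDim.x / WARP_SIZE;
        int warp_id = threadIdx.x / WARP_SIZE;   /* which warp within the block */
        int lane_id = threadIdx.x % WARP_SIZE;   /* which lane within the warp  */

        /* teams distribute: i iterations spread across thread blocks */
        for (int i = blockIdx.x + 1; i <= isize - 1; i += gridDim.x)
            /* parallel for: j iterations spread across the warps of the block */
            for (int j = warp_id + 1; j <= jsize; j += warps_per_block)
                /* simd: k iterations spread across the 32 lanes of the warp   */
                for (int k = lane_id + 1; k <= ksize; k += WARP_SIZE)
                    lhsX[(i * (jsize + 1) + j) * (ksize + 1) + k] = 1.0f;
    }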
Figure 2 shows an example code that contains three levels of OpenMP offloading parallelism in a triple-nested loop and demonstrates the one-to-one mapping of this nested loop to the thread hierarchy. For this triple-nested loop, three thread blocks are generated and each thread block contains multiple warps; since the simd hint precedes the innermost loop, the computation is distributed among the threads inside each warp with a coalesced memory access pattern.

    #pragma omp target teams distribute
    for (i=1;i<=isize-1;i++)
      #pragma omp parallel for
      for (j=1;j<=jsize;j++)
        #pragma omp simd
        for (k=1;k<=ksize;k++)
          temp1 = dt * tx1;
          temp2 = dt * tx2;
          lhsX[i][j][k] = ...

Figure 2: One-to-one mapping of the three levels of OpenMP parallelism (top) onto a GPU (bottom); the grid consists of thread blocks (Block0, Block1, Block2), and each block contains warps (Warp0, Warp1, Warp2, Warp3).

Figure 3 shows an example code that contains two levels of OpenMP offloading parallelism in a double-nested loop. For this code snippet, on a CPU-like accelerator, using parallel for is better than using simd for the inner loop, because the inner-loop control flow could be costly: simd on a CPU should execute identical instructions, so using the simd construct instead of parallel for may make the inner loop less efficient because of the divergent branching. CPU cores are also typically much more powerful than GPU cores. On a GPU-like platform, in contrast, using parallel for involves only one thread inside each warp in the computation, which results in poor performance. In the OpenUH implementation, we therefore automatically attach the simd construct to the innermost loop to get reasonable performance. This transformation still spreads the computation across teams, but in a more efficient way; since it happens on the innermost loop, it is semantically legal. The performance of the code is heavily dependent on the target accelerator type. In OpenUH, the current compilation strategy was designed to take advantage of the warp-wise execution model, which assumes that the workload of the innermost loop will always be spread across the threads within warps.

    #pragma omp target teams distribute
    for (i=1;i<=isize-1;i++)
      #pragma omp parallel for
      for (j=1;j<=jsize;j++)
        temp1 = dt * tx1;
        temp2 = dt * tx2;
        lhsX[i][j][k] = ...
        if () ...
        else ...

Figure 3: One-to-one mapping of the two levels of OpenMP parallelism onto a GPU.

Figure 4 shows an example code that contains just a single level of OpenMP offloading parallelism in a single loop. Since there is no teams construct, only one thread block is created and the workload is shared among the threads inside this thread block. As described above, in a naive implementation only the master thread in each warp is working and the other threads simply wait at the synchronization point; the large overhead of synchronizing threads further degrades performance. In this kind of scenario, OpenUH's implementation automatically spreads the workload across the warps, which has an effect similar to that of using a combined parallel for simd construct. Note that this approach, like the automatic insertion of simd constructs described above, does not affect the semantics of the program.

    #pragma omp target
    #pragma omp parallel for
    for (i=1;i<=isize-1;i++)
      a[i] = ...
      if () ...
      else ...

Figure 4: One-to-one mapping of one level of OpenMP parallelism onto a GPU.
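The effect of the implicit transformations described above can be sketched as a source-level rewrite (simplified; OpenUH performs this on its IR, and the code here is only an illustration of the idea): the innermost loop behaves as if a simd clause had been attached to it.

    /* Illustration only: what the user writes (cf. Figure 3) ... */
    void user_version(float *a, int isize, int jsize)
    {
        #pragma omp target teams distribute
        for (int i = 1; i <= isize - 1; i++) {
            #pragma omp parallel for
            for (int j = 1; j <= jsize; j++)
                a[i * (jsize + 1) + j] = 1.0f;
        }
    }

    /* ... and the form it is effectively compiled as: the innermost loop gains
     * a simd hint, so its iterations are spread over the lanes of each warp. */
    void effective_version(float *a, int isize, int jsize)
    {
        #pragma omp target teams distribute
        for (int i = 1; i <= isize - 1; i++) {
            #pragma omp parallel for simd
            for (int j = 1; j <= jsize; j++)
                a[i * (jsize + 1) + j] = 1.0f;
        }
    }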

4. Evaluation

In this section, we present the results of experiments that used our implementation of the OpenMP 4.5 offloading model to translate the NAS parallel benchmarks (NPB) for a GPU. We also compare our implementation with that of LLVM.

4.1 Experimental Setup

The evaluation platform consists of a host with two Intel Xeon E5-2640 CPUs (16 threads per CPU) and 32 GB of main memory; the attached GPU is an NVIDIA K20Xm with 5 GB of global memory. CUDA 6.5 is used for the OpenUH backend GPU code compilation with the "-O3" optimization level. For the comparative analysis, we used the Clang 3.8 implementation of OpenMP that is available at https://github.com/clang-omp/libomptarget, with the compiler options "-fopenmp -omptargets=nvptx64sm_35-nvidia-linux -O3". To obtain reliable results, all experiments were performed five times and the average performance was computed. We also compare the GPU execution with execution on the two Intel Xeon E5-2640 CPUs.

4.2 Performance Evaluation using NAS Benchmarks

In order to assess the performance of the OpenUH implementation of the OpenMP offloading model, we modified the NAS parallel benchmarks (NPB) to make use of OpenMP offloading directives, including data movement and parallel region constructs with the corresponding clauses. We then measured the execution time for data movement and for launching and running the kernels. The data sizes used for the NAS benchmarks are CLASS A (not discussed here), CLASS B, and CLASS C; the data sizes of each class are different.
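The measurement harness itself is not listed in the paper; a minimal sketch of how the data-movement and kernel times can be separated with omp_get_wtime, assuming a simple offloaded loop, is shown below (names and sizes are illustrative).

    /* Assumed measurement sketch, not the actual NPB harness. */
    #include <omp.h>
    #include <stdio.h>
    #define N (1 << 20)

    static float a[N];

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp target enter data map(to: a[0:N])     /* host-to-device copy */
        double t1 = omp_get_wtime();

        #pragma omp target teams distribute parallel for  /* offloaded kernel    */
        for (int i = 0; i < N; i++)
            a[i] = 0.5f * a[i] + 1.0f;
        double t2 = omp_get_wtime();

        #pragma omp target exit data map(from: a[0:N])    /* device-to-host copy */
        double t3 = omp_get_wtime();

        printf("map-in %.3f s, kernel %.3f s, map-out %.3f s\n",
               t1 - t0, t2 - t1, t3 - t2);
        return 0;
    }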

Figures 5 and 6 show the performance of our offloading implementation on the NPB variant that we specifically developed for this work. Note that we did not perform comparisons with LLVM here, because the backend of the currently available implementation does not support some of the constructs used in NPB-OpenMP4.5.

Figure 5: NPB-OpenMP offloading vs. NPB-OpenMP using OpenUH (Class B); execution time in seconds for BT, CG, EP, FT, LU, MG and SP on the GPU and on the CPU (16 threads).

Figure 6: NPB-OpenMP offloading vs. NPB-OpenMP using OpenUH (Class C); execution time in seconds for BT, CG, EP, FT, LU, MG and SP on the GPU and on the CPU (16 threads).

4.3 Comparison with Clang OpenMP Implementation

We compared the performance of OpenUH with the LLVM implementation, which uses libomptarget as its offloading runtime, on four kernels: 2D Jacobi, matrix-matrix multiplication, Gauss, and Laplacian. On all four kernels, our implementation outperforms the LLVM implementation (see Figures 7 to 10).

Figure 7: Performance comparison between OpenUH and LLVM on the Jacobi OpenMP 4.x kernel (20000 max iterations), for data sizes 512x512, 1024x1024 and 2048x2048.

Figure 8: Performance comparison between OpenUH and LLVM on the matrix multiplication kernel, for data sizes 2048x2048, 4096x4096 and 8192x8192 (logarithmic time axis).

Figure 9: Performance comparison between OpenUH and LLVM on the Gauss kernel, for data sizes 512x512, 1024x1024 and 2048x2048.

Figure 10: Performance comparison between OpenUH and LLVM on the Laplacian kernel, for data sizes 128x128x128, 256x256x256 and 512x512x512.
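The kernel sources are not reproduced in the paper; a representative matrix-matrix multiplication written with the three offloading levels, i.e., the style of code whose mapping is being compared here, might look as follows (a sketch; names and the exact clauses are assumptions).

    /* Representative sketch only; not necessarily the kernel used in Figure 8. */
    void matmul_offload(const float *A, const float *B, float *C, int n)
    {
        #pragma omp target teams distribute map(to: A[0:n*n], B[0:n*n]) \
                                            map(from: C[0:n*n])
        for (int i = 0; i < n; i++) {
            #pragma omp parallel for
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                #pragma omp simd reduction(+:sum)
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        }
    }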

For the matrix-matrix multiplication kernel, the performance gap between OpenUH and LLVM is huge for very large data sizes (note the logarithmic scale on the y axis of Figure 8). This is because Clang maps the loop hierarchy to the GPU threads in an inefficient way. In particular, LLVM does not fully utilize GPU warps in the way we believe loops with a parallel for clause should be handled. Indeed, LLVM mainly addresses two levels of parallelism on the GPU: distribute and parallel. In the current implementations (Bertolli et al. 2014, 2015), the teams construct is mapped to the x dimension of the thread blocks in the grid, and the parallel directive is mapped to the x dimension of the threads within each thread block. As a result, LLVM does not handle the innermost loop efficiently; when only the master threads of the warps are executing, a large overhead for thread synchronization is incurred, as described in the previous section.

For the Laplacian kernel, the performance gap between OpenUH and LLVM is smaller than for the other three kernels. The Laplacian has three levels of parallelism, distribute, parallel and simd; in this case, LLVM can make use of GPU warps for the innermost loop, which can then be spread across the threads within each warp, resulting in much better performance.

In OpenUH, the fact that we implicitly use GPU warps to handle innermost loop execution leads to much better performance than LLVM's approach in general. We therefore believe that the one-to-one mapping introduced in this paper should be used to fully exploit the thread hierarchy and to obtain better thread occupancy on GPUs.

5. Related Work

Support for OpenMP 4.5 in vendor compilers is still a work in progress. The Intel compiler 16.0, released in August 2015, fully supports OpenMP 4.0 for the Intel Xeon Phi coprocessor. The Cray Compiling Environment 8.4, released in September 2015, provides support for the OpenMP 4.0 specification for C, C++, and Fortran, enabling the execution of OpenMP 4.0 target regions on NVIDIA GPUs.

Regarding open-source compilers, GCC 6 was released in April 2016; it fully supports the OpenMP 4.5 specification for C and C++. GCC targets the Intel Xeon Phi Knights Landing and AMD's HSAIL (Heterogeneous System Architecture Intermediate Language). LLVM 3.8.0 was released in March 2016; its frontend Clang supports all of OpenMP 3.1 and some elements of OpenMP 4.0 and 4.5. LLVM uses an OpenMP offloading library called libomptarget, whose current implementation can be divided into three components: target-agnostic offloading, target-specific offloading plugins, and a target-specific runtime library. The libomptarget library was designed with the goal of supporting multiple devices; it currently targets the IBM Power architecture, x86_64, NVIDIA GPUs, and the Intel Xeon Phi. In (Bertolli et al. 2015), its OpenMP 4.0 offloading strategy is described and some optimizations are applied, such as a control-loop scheme (Bertolli et al. 2014) that avoids dynamic spawning of GPU threads inside the target region; this implementation targets NVIDIA GPUs. As discussed in the experimental section, the current LLVM implementation does not map the simd level of OpenMP to GPU warps, thereby hurting performance.

6. Conclusion

In this paper, we describe our design and implementation of a compilation scheme based on a one-to-one mapping of the loop hierarchy parallelism available in OpenMP 4.x to the thread hierarchy found in GPUs. The implementation was carried out in the OpenUH compiler, which is well suited for prototyping novel techniques to support parallel languages and new compiler optimizations. A set of benchmarks was employed to assess how this mapping, especially the use of GPU warps to handle innermost loop execution, impacts the performance of GPU execution. The benchmarks include new versions of the NAS parallel benchmarks that we specifically developed for this research; we also used matrix-matrix multiplication and Jacobi kernels. We show that our one-to-one mapping technique significantly outperforms the Clang implementation of OpenMP on these last two kernels. In future work, we plan to fully support the OpenMP 4.5 specification as well as target other accelerators such as AMD GPUs and the Intel Xeon Phi.

References

CUDA C Programming Guide. http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, September 2015.

A. Qawasmeh, A. M. Malik, and B. M. Chapman. OpenMP Task Scheduling Analysis via OpenMP Runtime API and Tool Visualization. In Parallel & Distributed Processing Symposium Workshops, 2014 IEEE International. IEEE, 2014.

C. Bertolli, S. F. Antao, A. E. Eichenberger, K. O'Brien, Z. Sura, A. C. Jacob, T. Chen, and O. Sallenave. Coordinating GPU Threads for OpenMP 4.0 in LLVM. In Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, LLVM-HPC '14, pages 12-21, Piscataway, NJ, USA, 2014. IEEE Press.

C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, and K. O'Brien. Integrating GPU Support for OpenMP Offloading Directives into Clang. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM '15, pages 5:1-5:11, New York, NY, USA, 2015. ACM.

B. Wicaksono, R. C. Nanjegowda, and B. Chapman. A Dynamic Optimization Framework for OpenMP. In 7th International Workshop on OpenMP, IWOMP 2011, Chicago, IL, USA, June 13-15, 2011, Proceedings. Springer, 2011.

J. C. Beyer, E. J. Stotzer, A. Hart, and B. R. de Supinski. OpenMP for Accelerators. In OpenMP in the Petascale Era, pages 108-121. Springer, 2011.

OpenMP Architecture Review Board. OpenMP 4.5 Specification. http://www.openmp.org/mp-documents/openmp-4.5.pdf, 2015.

B. Chapman, O. Hernandez, L. Huang, T.-H. Weng, Z. Liu, L. Adhianto, and Y. Wen. Dragon: An Open64-Based Interactive Program Analysis Tool for Large Applications. In Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'2003), pages 792-796, Aug. 2003.

B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming, volume 10. MIT Press, 2008.

B. Chapman, D. Eachempati, and O. Hernandez. Experiences Developing the OpenUH Compiler and Runtime Infrastructure. International Journal of Parallel Programming, pages 1-30, 2012.

D. Eachempati, A. Richardson, T. Liao, H. Calandra, and B. Chapman. A Coarray Fortran Implementation to Support Data-Intensive Application Development. In The International Workshop on Data-Intensive Scalable Computing Systems (DISCS), in conjunction with SC'12, November 2012.

P. Ghosh, Y. Yan, D. Eachempati, and B. Chapman. A Prototype Implementation of OpenMP Task Dependency Support. In OpenMP in the Era of Low Power Devices and Accelerators, volume 8122 of Lecture Notes in Computer Science, pages 128-140. Springer Berlin Heidelberg, 2013.

L. Huang, H. Jin, L. Yi, and B. Chapman. Enabling Locality-Aware Computations in OpenMP. Scientific Programming, 18(3-4):169-181, Aug. 2010.

C. Liao, O. Hernandez, B. Chapman, W. Chen, and W. Zheng. OpenUH: An Optimizing, Portable OpenMP Compiler. Concurrency and Computation: Practice and Experience, 19(18):2317-2332, 2007.

OpenACC. http://www.openacc-standard.org, 2015.

OpenCL Standard. http://www.khronos.org/, 2015.

S. Pophale, O. Hernandez, S. Poole, and B. Chapman. Extending the OpenSHMEM Analyzer to Perform Synchronization and Multi-Valued Analysis. In OpenSHMEM and Related Technologies: Experiences, Implementations, and Tools, volume 8356 of Lecture Notes in Computer Science, pages 134-148. Springer International Publishing, 2014.

X. Tian, R. Xu, Y. Yan, Z. Yun, S. Chandrasekaran, and B. Chapman. Compiling a High-Level Directive-Based Programming Model for GPGPUs. In Languages and Compilers for Parallel Computing: 26th International Workshop, LCPC 2013, Revised Selected Papers, pages 105-120. Springer International Publishing, 2014.