Advanced OpenMP Tutorial: OpenMP for Heterogeneous Computing


Advanced OpenMP Tutorial – Device Constructs
Tim Cramer
Slides designed by / based on: Christian Terboven, Michael Klemm, James C. Beyer, Kelvin Li, Bronis R. de Supinski

Topics
• Heterogeneous device execution model
• Mapping variables to a device
• Accelerated workshare
• Examples

Device Model
• OpenMP supports heterogeneous systems.
• Device model: one host device and one or more target devices.
• Typical configurations: a heterogeneous SoC, a host with co-processors, or a host with GPUs.

Terminology
• Device: an implementation-defined (logical) execution unit.
• Device data environment: the storage associated with a device.
• The execution model is host-centric: the host device offloads target regions to target devices.

OpenMP Offloading
Without device directives, the SAXPY kernel runs entirely on the host:

```c
void saxpy() {
    int n = 10240;
    float a = 42.0f;
    float b = 23.0f;
    float *x, *y;
    // Allocate and initialize x, y
    // Run SAXPY
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    saxpy();
}
```

Adding a target construct offloads the loop to a target device; map clauses make x and y available there, with x copied to the device and y copied both to and from it:

```c
void saxpy() {
    int n = 10240;
    float a = 42.0f;
    float b = 23.0f;
    float *x, *y;
    // Allocate and initialize x, y
    // Run SAXPY
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```
OpenMP Device Constructs

Execute code on a target device:
• omp target [clause[[,] clause]…] followed by a structured block
• omp declare target / omp end declare target around function definitions or declarations

Map variables to a target device:
• map([[map-type-modifier[,]] map-type:] list)
  map-type := alloc | tofrom | to | from | release | delete
  map-type-modifier := always
• omp target data clause[[[,] clause]…] followed by a structured block
• omp target enter data clause[[[,] clause]…]
• omp target exit data clause[[[,] clause]…]
• omp target update clause[[[,] clause]…]
• omp declare target around variable definitions or declarations

Workshare for acceleration:
• omp teams [clause[[,] clause]…] followed by a structured block
• omp distribute [clause[[,] clause]…] followed by for-loops

Device Runtime Support

Runtime routines:
• void omp_set_default_device(int dev_num)
• int omp_get_default_device(void)
• int omp_get_num_devices(void)
• int omp_get_num_teams(void)
• int omp_get_team_num(void)
• int omp_is_initial_device(void)
• int omp_get_initial_device(void)

Environment variables:
• OMP_DEFAULT_DEVICE controls the default device.
• OMP_TARGET_OFFLOAD controls offloading behaviour.

Offloading Computation
• Use the target construct to transfer control from the host to the target device.
• Use the map clause to map variables between the host and target device data environments.
• By default, the host thread waits until the offloaded region has completed; use the nowait clause for asynchronous execution.

```c
count = 500;
#pragma omp target map(to: b, c, d) map(from: a)
{
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        a[i] = b[i] * c + d;
    }
}
a0 = a[0];
```
Asynchronous Offloading
• With the nowait clause, a host task is generated that encloses the target region.
• The nowait clause indicates that the encountering thread does not wait for the target region to complete.
• The depend clause can be used for host synchronization with other tasks.

```c
count = 500;
#pragma omp target map(to: b, c, d) map(from: a) nowait
{
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        a[i] = b[i] * c + d;
    }
}
do_some_other_work();
a0 = a[0]; // Synchronization missing here!
```

Target task: a mergeable and untied task that is generated by a target, target enter data, target exit data, or target update construct.

target Construct
• Transfers control from the host to the target device.
• Syntax (C/C++):

```c
#pragma omp target [clause[[,] clause]…]
structured-block
```

• Syntax (Fortran):

```fortran
!$omp target [clause[[,] clause]…]
structured-block
!$omp end target
```

• Clauses:
  device(scalar-integer-expression)
  map([always[,]] alloc | to | from | tofrom | delete | release: list)
  if([target:] scalar-expr)
  private(list)
  firstprivate(list)
  is_device_ptr(list)
  defaultmap(tofrom: scalar)
  nowait
  depend(dependence-type: list)
Tasks and Target Example

All aspects of this example will be explained in the following.

```c
#pragma omp declare target
#include <stdlib.h>
#include <omp.h>
extern void compute(float*, float*, int);
#pragma omp end declare target

void vec_mult_async(float* p, float* v1, float* v2, int N)
{
    int i;
    #pragma omp target enter data map(alloc: v1[:N], v2[:N])

    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work(); // execute asynchronously on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
        #pragma omp distribute parallel for
        for (i = 0; i < N; i++)
            p[i] = v1[i] * v2[i];
    }

    #pragma omp taskwait
    #pragma omp target exit data map(release: v1[:N], v2[:N])
}
```

Terminology
• Mapped variable: the corresponding variable in a device data environment to an original variable in a (host) data environment.
• Mappable type: a type that is valid for mapped variables (bitwise copyable, plus additional restrictions).

map Clause
• Maps a variable or an array section to a device data environment.
• Syntax: map([[map-type-modifier[,]] map-type:] list)
• Where map-type is:
  alloc: allocate storage for the corresponding variable
  to: alloc, and assign the value of the original variable to the corresponding variable on entry
  from: alloc, and assign the value of the corresponding variable to the original variable on exit
  tofrom: the default; both to and from
  delete: the corresponding variable is removed
  release: the reference count is decremented
• Where map-type-modifier is:
  always: the mapping operation is always performed

Device Data Environment
• The map clauses determine how an original variable in a data environment is mapped to a corresponding variable in a device data environment.
(Figure: a host pointer pA and its device counterpart, with steps 1 alloc, 2 to on entry, 3 computation on the device, and 4 from on exit, for a region of the form below.)

```c
#pragma omp target \
    map(alloc: ...) \
    map(to: ...) \
    map(from: ...)
{
    ...
}
```
map Clause

```c
void vec_mult(float *p, float *v1, float *v2, int N)
{
    int i;
    init(v1, v2, N);
    #pragma omp target map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            p[i] = v1[i] * v2[i];
    }
    output(p, N);
}
```

```fortran
subroutine vec_mult(p, v1, v2, N)
  real, dimension(*) :: p, v1, v2
  integer :: N, i
  call init(v1, v2, N)
  !$omp target map(to: v1(1:N), v2(1:N)) map(from: p(1:N))
  !$omp parallel do
  do i=1, N
    p(i) = v1(i) * v2(i)
  end do
  !$omp end target
  call output(p, N)
end subroutine
```

• The array sections for v1, v2, and p are explicitly mapped into the device data environment.
• The variable N is implicitly firstprivate in the device data environment.
• Note: in OpenMP 4.0, a scalar variable referenced inside a target construct is implicitly mapped as map(inout: ...); in 4.5, such a scalar is implicitly firstprivate.

The same example, step by step:
• On entry to the target region:
  – Allocate corresponding variables v1, v2, and p in the device data environment.
  – Assign the corresponding variables v1 and v2 the values of their respective original variables.
  – The corresponding variable p is undefined.
• On exit from the target region:
  – Assign the original variable p the value of its corresponding variable.
  – The original variables v1 and v2 are undefined.
  – Remove the corresponding variables v1, v2, and p from the device data environment.

MAP is not necessarily a copy
(Figure: in a shared-memory system, processor and accelerator access the same memory, so host and device may see a single copy of A; in a distributed-memory system, the accelerator has its own memory holding a separate copy of A.)
• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may change the value of the original variable.
Map Variables Across Multiple target Regions
• Optimize sharing of data between host and device.
• The target data, target enter data, and target exit data constructs map variables but do not offload code.
• Corresponding variables remain in the device data environment for the extent of the target data region.