OpenCL – OpenACC & Exascale

Total Pages: 16

File Type: PDF, Size: 1020 KB

OpenCL – OpenACC & Exascale
F. Bodin

Introduction

• Exascale architectures may be
  o Massively parallel
  o Heterogeneous compute units
  o Hierarchical memory systems
  o Unreliable
  o Asynchronous
  o Strongly oriented toward energy saving
  o …
• The Exascale roadmap needs to be built on programming standards
  o Nobody can afford to re-write applications again and again
  o The Exascale roadmap, HPC, mass-market many-core and embedded systems share many common issues
  o Exascale is not about a heroic technology development
  o Exascale projects must provide technology for a large industrial base and wide set of uses
• OpenACC and OpenCL may be candidates
  o They deal with parallelism inside the node
  o Both are part of a standardization initiative
  o OpenACC is complementary to OpenCL
• This presentation tries to forecast OpenACC and OpenCL in the light of the Exascale challenges
  o Challenges as identified by ETP4HPC (http://www.etp4hpc.eu)

Programming Environments Context

1. Standardization initiatives
   o Software developments need visibility
2. Intellectual property issues
   o Impact on tools development and interactions
   o Fundamental for creating an ecosystem
   o Making everything open source is not the answer
3. Software engineering, applications and user expectations
   o Minimizing maintenance effort: one code for all targets
4. Tools development strategy
   o How to create a coherent ecosystem?

Outline of the Presentation

• A very short overview of OpenCL
• A very short overview of OpenACC
• OpenACC and OpenCL versus the Exascale challenges

OpenCL Overview

• Open Computing Language
  o C-based cross-platform programming interface
  o Subset of ISO C99 with language extensions
  o Data- and task-parallel compute model
• Host plus compute devices (GPUs) model
• Platform-layer API and runtime API
  o Hardware abstraction layer, …
  o Manage resources
• Portable syntax

Memory Model

• Four distinct memory regions
  o Global Memory
  o Local Memory
  o Constant Memory
  o Private Memory
• Global and Constant memories are common to all work-items (WI)
  o May be cached depending on the hardware capabilities
• Local memory is shared by all work-items of a work-group (WG)
• Private memory is private to each work-item

[Figure: the OpenCL memory hierarchy, from Aaftab Munshi's talk at SIGGRAPH 2008]

Data-Parallelism in OpenCL

• A kernel is executed by the work-items
  o Same parallel model as CUDA (< 4.x)

    // OpenCL kernel function for element-by-element vector addition
    __kernel void VectorAdd(__global const float8* a,
                            __global const float8* b,
                            __global float8* c)
    {
        // get oct-float index into global data array
        int iGID = get_global_id(0);
        // read inputs into registers
        float8 f8InA = a[iGID];
        float8 f8InB = b[iGID];
        float8 f8Out = (float8)0.0f;
        // add the vector elements
        f8Out.s0 = f8InA.s0 + f8InB.s0;
        f8Out.s1 = f8InA.s1 + f8InB.s1;
        f8Out.s2 = f8InA.s2 + f8InB.s2;
        f8Out.s3 = f8InA.s3 + f8InB.s3;
        f8Out.s4 = f8InA.s4 + f8InB.s4;
        f8Out.s5 = f8InA.s5 + f8InB.s5;
        f8Out.s6 = f8InA.s6 + f8InB.s6;
        f8Out.s7 = f8InA.s7 + f8InB.s7;
        // write back out to GMEM
        c[get_global_id(0)] = f8Out;
    }
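The deck does not show how such a kernel reaches the device. A minimal sketch of the usual build step (assuming kernel_src holds the kernel source above as a string, and that my_context and devices come from platform-layer calls not covered here):

    // Hedged sketch: compile the kernel source at run time and create a
    // kernel object from the __kernel function by name.
    cl_int err;
    cl_program program = clCreateProgramWithSource(my_context, 1,
                                                   &kernel_src, NULL, &err);
    // Build for all devices of the context; the options string (here NULL)
    // may carry -cl-* compiler flags.
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    cl_kernel my_kernel = clCreateKernel(program, "VectorAdd", &err);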
OpenCL vs CUDA

• OpenCL and CUDA share the same parallel programming model
• The runtime APIs are different
  o OpenCL is lower level than CUDA
• OpenCL and CUDA may use different implementations, which can lead to different execution times for a similar kernel on the same hardware
• Terminology mapping:

    OpenCL         CUDA
    kernel         kernel
    host pgm       host pgm
    NDRange        grid
    work item      thread
    work group     block
    global mem     global mem
    cst mem        cst mem
    local mem      shared mem
    private mem    local mem

Basic OpenCL Runtime Operations

• Create a command-queue
• Then queue up OpenCL events
  o Data transfers
  o Kernel launches
• Allocate the accelerator's memory
  o Before transferring data
• Free the memory
• Manage errors

The Command Queue (1)

• A command-queue can be used to queue a set of operations
• Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization
• Create an OpenCL command-queue:

    cl_command_queue clCreateCommandQueue(cl_context context,
                                          cl_device_id device,
                                          cl_command_queue_properties properties,
                                          cl_int *errcode_ret)

The Command Queue (2)

• Example: allocation of a queue for device 0

    cl_command_queue my_cmd_queue;
    my_cmd_queue = clCreateCommandQueue(my_context, devices[0], 0, NULL);

• Flush a command queue (all commands have started):

    cl_int clFlush(cl_command_queue command_queue)

• Finish a command queue, i.e. synchronize (all commands have terminated):

    cl_int clFinish(cl_command_queue command_queue)

The Command Queue (3)

• It is possible to have multiple command queues on a device
  o Command queues are asynchronous
  o The programmer must synchronize when needed

How to Allocate Memory on a Device?

• Memory objects are categorized into two types
  o Buffer objects: 1D memory
  o Image objects: 2D-3D memory
• Elements can be
  o A scalar data type
  o A vector data type
  o A user-defined structure
• Memory objects are described by a cl_mem object
• Kernels take cl_mem objects as input or output

Allocate 1D Memory

• Create a buffer:

    cl_mem clCreateBuffer(cl_context context,
                          cl_mem_flags flags,
                          size_t size_in_bytes,
                          void *host_ptr,
                          cl_int *errcode_ret)

• Example: allocate a single-precision float matrix as input

    size_t memsize = nb_elements * sizeof(float);
    cl_mem mat_a_gpu = clCreateBuffer(my_context, CL_MEM_READ_ONLY,
                                      memsize, NULL, &err);

  and as output

    cl_mem mat_res_gpu = clCreateBuffer(my_context, CL_MEM_WRITE_ONLY,
                                        memsize, NULL, &err);

Transfer Data to the Device (1)

• Any data transfer to/from the device implies
  o A host pointer
  o A device memory object
  o The size of the data (in bytes)
  o The command queue
  o Whether the transfer is blocking or not
  o …
• In case of a non-blocking transfer (see the sketch after this list)
  o Link a cl_event to the transfer
  o Check that the transfer has finished with:

    cl_int clWaitForEvents(cl_uint num_events, const cl_event *event_list)
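A minimal sketch of the non-blocking variant (assuming the same cmd_queue, mat_a_gpu, memsize and host array mat_a as in the surrounding examples):

    // Hedged sketch: enqueue a non-blocking write. The call returns
    // immediately; mat_a must stay valid until the transfer completes.
    cl_event write_evt;
    cl_int err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_FALSE,
                                      0, memsize, (void*) mat_a,
                                      0, NULL, &write_evt);
    // ... do independent host work here ...
    // Block until the transfer attached to write_evt has finished.
    err = clWaitForEvents(1, &write_evt);
    clReleaseEvent(write_evt);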
Transfer Data to the Device (2)

• Write into a buffer:

    cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                                cl_mem buffer,
                                cl_bool blocking_write,
                                size_t offset,
                                size_t size_in_bytes,
                                const void *ptr,
                                cl_uint num_events_in_wait_list,
                                const cl_event *event_wait_list,
                                cl_event *event)

• Example: transferring mat_a synchronously

    err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_TRUE, 0, memsize,
                               (void*) mat_a, 0, NULL, &evt);

Kernel Arguments

• A kernel needs arguments
  o So we must set these arguments
  o Arguments can be scalar, vector or user-defined data types
• Set a kernel argument:

    cl_int clSetKernelArg(cl_kernel kernel,
                          cl_uint arg_index,
                          size_t arg_size,
                          const void *arg_value)

• Example: set the a, res and size arguments

    err = clSetKernelArg(my_kernel, 0, sizeof(cl_mem), (void*) &mat_a_gpu);
    err = clSetKernelArg(my_kernel, 1, sizeof(cl_mem), (void*) &mat_res_gpu);
    err = clSetKernelArg(my_kernel, 2, sizeof(int), (void*) &nb_elements);

Settings for Kernel Launching

• Set the NDRange (grid) geometry:

    size_t global_work_size[2] = {nb_elements_x, nb_elements_y};
    size_t local_work_size[2] = {16, 16};

• The task-parallel model is used for CPUs
  o General tasks: complex, independent, …
• The data-parallel model is used for GPUs
  o Need to set a grid, an NDRange in OpenCL

Kernel Launch (1)

• For a task kernel
  o Use the enqueue-task command:

    cl_int clEnqueueTask(cl_command_queue command_queue,
                         cl_kernel kernel,
                         cl_uint num_events_in_wait_list,
                         const cl_event *event_wait_list,
                         cl_event *event)

• For a data-parallel kernel
  o Use the enqueue-NDRange-kernel command:

    cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                                  cl_kernel kernel,
                                  cl_uint work_dim,
                                  const size_t *global_work_offset,
                                  const size_t *global_work_size,
                                  const size_t *local_work_size,
                                  cl_uint num_events_in_wait_list,
                                  const cl_event *event_wait_list,
                                  cl_event *event)

Kernel Launch (2)

• The launch of the kernel is asynchronous by default:

    err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL,
                                 &global_work_size[0], &local_work_size[0],
                                 0, NULL, &evt);
    clFinish(cmd_queue);

Copy Back the Results

• About the same as the copy in, but from device to host:

    cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                               cl_mem buffer,
                               cl_bool blocking_read,
                               size_t offset,
                               size_t size_in_bytes,
                               void *ptr,
                               cl_uint num_events_in_wait_list,
                               const cl_event *event_wait_list,
                               cl_event *event)

• Example:

    err = clEnqueueReadBuffer(cmd_queue, res_mem, CL_TRUE, 0, size,
                              (void*) mat_res, 0, NULL, NULL);
    clFinish(cmd_queue);

Free the Device's Memory

• At the end you need to release the allocated memory:

    cl_int clReleaseMemObject(cl_mem memobj)

• Example: release matrices a and res on the GPU

    clReleaseMemObject(mat_a_gpu);
    clReleaseMemObject(mat_res_gpu);

Release Objects

• At the end, you must also release
  o The programs
  o The kernels
  o The command queues
  o And the context

    cl_int clReleaseKernel(cl_kernel kernel)
    cl_int clReleaseProgram(cl_program program)
    cl_int clReleaseCommandQueue(cl_command_queue command_queue)
    cl_int clReleaseContext(cl_context context)

Error Management

• Do not forget to …
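Taken together, the calls above form a complete host-side sequence. A minimal end-to-end sketch (assuming <stdio.h> and <stdlib.h> for the error macro, and my_context, devices, my_kernel, mat_a, mat_res and nb_elements set up as in the earlier examples; the error handling is one plausible way to follow the error-management advice, not code from the deck):

    /* Hedged end-to-end sketch of the OpenCL runtime steps covered above. */
    #define CHECK(e, msg) \
        if ((e) != CL_SUCCESS) { fprintf(stderr, "%s: error %d\n", msg, (int)(e)); exit(1); }

    cl_int err;
    cl_command_queue q = clCreateCommandQueue(my_context, devices[0], 0, &err);
    CHECK(err, "clCreateCommandQueue");

    size_t memsize = nb_elements * sizeof(float);
    cl_mem a_gpu   = clCreateBuffer(my_context, CL_MEM_READ_ONLY,  memsize, NULL, &err);
    CHECK(err, "clCreateBuffer(a)");
    cl_mem res_gpu = clCreateBuffer(my_context, CL_MEM_WRITE_ONLY, memsize, NULL, &err);
    CHECK(err, "clCreateBuffer(res)");

    // Blocking write: host -> device.
    err = clEnqueueWriteBuffer(q, a_gpu, CL_TRUE, 0, memsize, mat_a, 0, NULL, NULL);
    CHECK(err, "clEnqueueWriteBuffer");

    err  = clSetKernelArg(my_kernel, 0, sizeof(cl_mem), &a_gpu);
    err |= clSetKernelArg(my_kernel, 1, sizeof(cl_mem), &res_gpu);
    err |= clSetKernelArg(my_kernel, 2, sizeof(int), &nb_elements);
    CHECK(err, "clSetKernelArg");

    // One work-item per element; a NULL local size lets the implementation
    // choose the work-group geometry.
    size_t global_ws = nb_elements;
    err = clEnqueueNDRangeKernel(q, my_kernel, 1, NULL, &global_ws, NULL, 0, NULL, NULL);
    CHECK(err, "clEnqueueNDRangeKernel");

    // Blocking read: device -> host.
    err = clEnqueueReadBuffer(q, res_gpu, CL_TRUE, 0, memsize, mat_res, 0, NULL, NULL);
    CHECK(err, "clEnqueueReadBuffer");

    clReleaseMemObject(a_gpu);
    clReleaseMemObject(res_gpu);
    clReleaseKernel(my_kernel);
    clReleaseCommandQueue(q);
    clReleaseContext(my_context);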
Recommended publications
  • Introduction to OpenACC, 2018 HPC Workshop: Parallel Programming
    Introduction to OpenACC. 2018 HPC Workshop: Parallel Programming. Alexander B. Pacheco, Research Computing, July 17-18, 2018.

    CPU vs GPU
    • CPU: consists of a few cores optimized for sequential serial processing.
    • GPU: has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.
    • GPU-enabled applications.

    [Slides: Accelerate Applications for GPU; GPU Accelerated Libraries; GPU Programming Languages]

    What is OpenACC?
    • The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator.
    • It provides portability across operating systems, host CPUs and accelerators.

    History
    • OpenACC was developed by The Portland Group (PGI), Cray, CAPS and NVIDIA.
    • PGI, Cray, and CAPS have spent over 2 years developing and shipping commercial compilers that use directives to enable GPU acceleration as core technology.
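    A minimal sketch of such a directive in C (a hypothetical example assuming an OpenACC-capable compiler such as PGI; it is not taken from the workshop slides):

        // Hedged sketch: offload a saxpy loop to an attached accelerator.
        // The pragma asks the compiler to parallelize the loop and to manage
        // the data movement named in the copyin/copy clauses.
        void saxpy(int n, float a, const float *restrict x, float *restrict y)
        {
            #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }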
  • PathScale ENZO GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives
    PathScale ENZO. GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives. C. Bergström, May 14th, 2012.

    ENZO Overview & Goals
    • Speed the transition to GPU & many-core systems.
    • Simplify the task of migrating software written in C, C++ & Fortran.
    • Uses the OpenHMPP standard (easy migration); CAPS HMPP compatible.
    • Performance & HPC focused: fully exploits NVIDIA GPU features and generates native instructions optimized for NVIDIA GPUs.

    Project Schedule
    • ENZO production release June 2012: OpenHMPP 2.5; C, C++ and Fortran.
    • Next ENZO production release October 2012: more tools and better support for libraries; x86-64 OpenHMPP task parallelism (similar to OMP3 tasks); more optimizations (IPA / CG2 / textures); OpenHMPP 3.0; CUDA 4.x; Kepler.

    Project Status
    • OpenHMPP 2.5: running CAPS C & Fortran labs; PathScale-written HMPP test suite; customer code.
    • New C++ compiler: Perennial C++VS and CVSA regression free; corner-case compile-time issues; corner-case runtime issues.
    • Ongoing effort: performance tuning & benchmarking; compiler robustness; nightly compiler builds to address issues.

    Performance
    • NVIDIA Tesla 2050 – "Lab2" SGEMM – ENZO – /opt/enzo/bin/pathcc -hmpp
  • GPU Computing with OpenACC Directives
    Introduction to OpenACC Directives. Duncan Poole, NVIDIA; Thomas Bradley, NVIDIA.

    GPUs Reaching a Broader Set of Developers
    • From 100,000s of early adopters in 2004 (supercomputing centers, oil & gas, defense, weather, climate and plasma-physics research) to 1,000,000s of developers at present (CAE, CFD, finance, rendering, universities, data analytics, life sciences).

    3 Ways to Accelerate Applications
    • Libraries: "drop-in" acceleration.
    • OpenACC directives: easily accelerate applications.
    • Programming languages: maximum flexibility.
    • CUDA libraries and CUDA languages are interoperable with OpenACC.

    GPU Accelerated Libraries: "Drop-in" Acceleration for Your Applications
    • NVIDIA cuBLAS (GPU-accelerated linear algebra), NVIDIA cuRAND, NVIDIA cuSPARSE (sparse linear algebra for CUDA), NVIDIA NPP (image processing), NVIDIA cuFFT, vector signal processing, matrix algebra on GPU and multicore, building-block algorithms for CUDA, C++ STL features for CUDA, IMSL Library.

    OpenACC Directives
    • Simple compiler hints; the compiler parallelizes the code.
    • Works on many-core GPUs & multicore CPUs.

        Program myscience
          ... serial code ...
          !$acc kernels
          do k = 1,n1
            do i = 1,n2
              ... parallel code ...
            enddo
          enddo
          !$acc end kernels
          ...
        End Program myscience
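    A hedged C analogue of the same kernels construct (a hypothetical sketch, not from the NVIDIA deck):

        // Hedged sketch: the kernels directive asks the compiler to find and
        // offload the parallelism in the nested loops, as in the Fortran
        // example above.
        void scale(int n1, int n2, float a, float x[n1][n2])
        {
            #pragma acc kernels
            for (int k = 0; k < n1; ++k)
                for (int i = 0; i < n2; ++i)
                    x[k][i] = a * x[k][i];   /* parallel code */
        }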
  • OpenACC Course October 2017. Lecture 1 Q&As
    OpenACC Course October 2017. Lecture 1 Q&As.

    Q: I am currently working on accelerating code compiled with gcc; in your experience, should I grab gcc-7 or try to compile the code with pgi-c?
    A: The GCC compilers are lagging behind the PGI compilers in terms of OpenACC feature implementation, so I'd recommend that you use the PGI compilers. PGI provides a community edition, which is free and can be used to compile OpenACC codes.

    Q: New to OpenACC. Other than the PGI compiler, what other compilers support OpenACC?
    A: You can find all the supported compilers here: https://www.openacc.org/tools

    Q: Is NVIDIA GPU support available on TK1?
    A: Currently there isn't an OpenACC implementation that supports ARM processors, including TK1. PGI is considering adding this support sometime in the future, but for now you'll need to use CUDA.

    Q: Do OpenACC directives work with the Intel Xeon Phi platform?
    A: Xeon Phi is treated as a multicore x86 CPU. With PGI you can use the "-ta=multicore" flag to target multicore CPUs when using OpenACC.

    Q: Does it have any application in the field of software engineering?
    A: In my opinion, OpenACC enables good software-engineering practices by letting you write a single source code, which is more maintainable than having to maintain different code bases for CPUs, GPUs, and so on.

    Q: Do the CPU comparisons include SIMD vectorization?
    A: I think the CPU comparisons include the standard vectorization that the PGI compiler applies, but no specific hand-coded vectorization optimizations or intrinsics work.

    Q: Does OpenMP also enable parallelization on GPUs?
    A: OpenMP does have support for GPUs, but the compilers are just now becoming available.
  • Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
    Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño. 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014), July 3-4, 2014, Amsterdam, Netherlands.

    Outline
    • Motivation: General Purpose Computation with GPUs
    • GPGPU with CUDA & OpenHMPP
    • The KIR: an IR for the Detection of Parallelism
    • Locality-Aware Generation of Efficient GPGPU Code
    • Case Studies: CONV3D & SGEMM
    • Performance Evaluation
    • Conclusions & Future Work

    [Figure: single-processor performance over time for Digital, HP, IBM, AMD and Intel systems, growing at roughly 52%/year before slowing to about 22%/year.]
  • How to Write Code That Will Survive
    Programming Heterogeneous Many-Cores Using Directives. HMPP – OpenACC. F. Bodin, CAPS CTO.

    Introduction
    • Programming many-core systems faces the following dilemma:
      o Achieve "portable" performance
        - Multiple forms of parallelism cohabiting: multiple devices (e.g. GPUs) with their own address space; multiple threads inside a device; vector/SIMD parallelism inside a thread
        - Massive parallelism: tens of thousands of threads needed
      o The constraint of keeping a unique version of the code, preferably mono-language
        - Reduces maintenance cost
        - Preserves code assets
        - Less sensitive to fast-moving hardware targets
        - Code lasts several generations of hardware architecture
    • For legacy codes, a directive-based approach may be an alternative
      o And may benefit from auto-tuning techniques

    Profile of a Legacy Application
    • Written in C/C++/Fortran
    • Mix of user code and library calls
    • Hotspots may or may not be parallel
    • Lifetime in tens of years
    • Cannot be fully re-written
    • Migration can be risky and mandatory

        while (many) {
          ...
          mylib1(A,B);
          ...
          myuserfunc1(B,A);
          ...
          mylib2(A,B);
          ...
          myuserfunc2(B,A);
          ...
        }

    Overview of the Presentation
    • Many-core architectures
      o Definition and forecast
      o Why usual parallel programming techniques won't work per se
    • Directive-based programming
      o OpenACC set of directives
      o HMPP directives
      o Library integration issue
    • Toward a portable infrastructure for auto-tuning
      o Current auto-tuning directives in HMPP 3.0
      o CodeletFinder for offline auto-tuning
      o Toward a standard auto-tuning interface

    Many-Core Architectures: Heterogeneous Many-Cores
    • Many general-purpose cores coupled with a massively parallel accelerator (HWA)
    • CPU and HWA linked with a PCIx bus
    • Data/stream/vector parallelism to be exploited by the HWA, e.g.
  • Multi-Threaded GPU Acceleration of ORBIT with Minimal Code Modifications
    Princeton Plasma Physics Laboratory, PPPL-4996. Multi-threaded GPU Acceleration of ORBIT with Minimal Code Modifications. Ante Qu, Stephane Ethier, Eliot Feibush and Roscoe White. February 2014. Prepared for the U.S. Department of Energy under Contract DE-AC02-09CH11466.

    Princeton Plasma Physics Laboratory Report Disclaimers

    Full Legal Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors or their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or any third party's use or the results of such use of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or its contractors or subcontractors. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

    Trademark Disclaimer: Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or its contractors or subcontractors.

    PPPL Report Availability
    • Princeton Plasma Physics Laboratory: http://www.pppl.gov/techreports.cfm
    • Office of Scientific and Technical Information (OSTI): http://www.osti.gov/bridge
    • Related Links: U.S.
  • HPVM: Heterogeneous Parallel Virtual Machine
    HPVM: Heterogeneous Parallel Virtual Machine. Maria Kotsifakou*, Prakalp Srivastava*, Matthew D. Sinclair, Vikram Adve and Sarita Adve (Department of Computer Science, University of Illinois at Urbana-Champaign); Rakesh Komuravelli (Qualcomm Technologies Inc.).

    Abstract: We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. […] hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.

    CCS Concepts: • Computer systems organization → Heterogeneous (hybrid) systems

    Keywords: Virtual ISA, Compiler, Parallel IR, Heterogeneous Systems, GPU, Vector SIMD

    1 Introduction
  • OpenACC Getting Started Guide
    OPENACC GETTING STARTED GUIDE, Version 2018

    Table of Contents
    Chapter 1. Overview
      1.1. Terms and Definitions
      1.2. System Prerequisites
      1.3. Prepare Your System
      1.4. Supporting Documentation and Examples
    Chapter 2. Using OpenACC with the PGI Compilers
      2.1. OpenACC Directive Summary
      2.2. CUDA Toolkit Versions
      2.3. C Structs in OpenACC
      2.4. C++ Classes in OpenACC
      2.5. Fortran Derived Types in OpenACC
      2.6. Fortran I/O
        2.6.1. OpenACC PRINT Example
      2.7. OpenACC Atomic Support
  • CAPS OpenACC Compiler
    CAPS OpenACC Compiler, HMPP Workbench 3.2. IDDN.FR.001.490007.000.S.P.2008.000.10600. This information is the property of CAPS entreprise and cannot be used, reproduced or transmitted without authorization.

    Headquarters – France: CAPS, Immeuble CAP Nord, 4A Allée Marie Berhaut, 35000 Rennes, France. Tel.: +33 (0)2 22 51 16 00, Fax: +33 (0)2 23 20 16 43, [email protected]
    CAPS – USA: 4701 Patrick Drive Bldg 12, Santa Clara, CA 95054. Tel.: +1 408 550 2887 x70, [email protected]
    CAPS – CHINA: Suite E2, 30/F, JuneYao International Plaza, 789 Zhaojiabang Road, Shanghai 200032. Tel.: +86 21 3363 0057, Fax: +86 21 3363 0067, [email protected]
    N° d'agrément formation: 53 35 08397 35. Visit our website: http://www.caps-entreprise.com

    Summary
    1. Introduction
      1.1. Revisions history
      1.2. Introduction
      1.3. What is HMPP Workbench? What is the CAPS OpenACC Compiler?
      1.4. Execution Model
      1.5. Memory Model
    2. OpenACC Directives
      2.1. kernels
  • Using CAPS Compiler on NVIDIA Kepler and CARMA Systems
    Using CAPS Compiler on NVIDIA Kepler and CARMA Systems. F. Bodin, CTO – CAPS entreprise.

    Introduction
    • CAPS develops programming tools to help write a unique source code that can be executed on existing accelerator technologies (C / C++ / Fortran).
    • Fast-moving hardware systems require two directive sets:
      o OpenHMPP: easy to extend; integrates new hardware features.
      o OpenACC: standardized; longer-term view but moving slowly.
    • Generates CUDA or OpenCL code
      o Portable on AMD GPU and APU, Intel MIC, NVIDIA Kepler/CARMA, …

    CAPS Technology
    • Provides OpenACC and OpenHMPP directives
      o OpenHMPP is codelet based
      o OpenACC is code-region based

        #pragma hmpp myfunc codelet, ...
        void saxpy(int n, float alpha, float x[n], float y[n]) {
          #pragma hmppcg gridify(i)
          for (int i = 0; i < n; ++i)
            y[i] = alpha*x[i] + y[i];
        }

        #pragma acc kernels ...
        {
          for (int i = 0; i < n; ++i)
            y[i] = alpha*x[i] + y[i];
        }

    Compilation Process
    • Source-to-source technology
      o C++ / C / Fortran frontends
      o Extraction module: codelets (Fun#1, Fun#2, Fun#3) are separated from the host code
      o Instrumentation module and OpenCL/CUDA generation
      o Host code compiled with the native CPU compiler (gcc, ifort, …)
      o Executable (mybin.exe) linked with the CAPS HWA runtime (dynamic library)

    A Few Typical Situations
    1. Simple nested loops
    2. Data transfer optimization
    3. Complex loop nests
    4. Code tuning
    5. Integrating auto-tuning techniques
    6. Dealing with accelerated libraries
    7. Dealing with dynamic accelerated task scheduling
    8. Using multiple accelerators
    9. Nested parallelism using native

    Simple Nested Loops (1)
    • The simplest construct is to declare a parallel loop to be compiled and executed on an accelerator
      o Iterations of the loop nest are converted into threads
      o [Diagram: host CPU code and accelerator timeline: send A, B, C; execute kern.]
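    A codelet is launched from the host at a call site marked with the same label. A hedged sketch following the OpenHMPP scheme above (a hypothetical example, not code from the slides):

        // Hedged sketch: invoke the "myfunc" codelet declared above; the
        // callsite directive tells the runtime to offload this call.
        int main(void)
        {
            int n = 1024;
            float alpha = 2.0f;
            float x[1024], y[1024];
            for (int i = 0; i < n; ++i) { x[i] = (float)i; y[i] = 1.0f; }
            #pragma hmpp myfunc callsite
            saxpy(n, alpha, x, y);
            return 0;
        }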
  • Introduction to GPU Programming with CUDA and OpenACC
    Introduction to GPU Programming with CUDA and OpenACC. Alabama Supercomputer Center, Alabama Research and Education Network.

    Contents
    • Why GPU chips and CUDA?
    • GPU chip architecture overview
    • CUDA programming
    • Queue system commands
    • Other GPU programming options
    • OpenACC programming
    • Comparing GPUs to other processors

    What is a GPU chip?
    • A Graphics Processing Unit (GPU) chip is an adaptation of the technology in a video rendering chip to be used as a math coprocessor.
    • The earliest graphics cards simply mapped memory bytes to screen pixels, e.g. the Apple ][ in 1980.
    • The next generation of graphics cards (1990s) had 2D rendering capabilities for rendering lines and shaded areas.
    • Graphics cards started accelerating 3D rendering with standards like OpenGL and DirectX in the early 2000s.
    • The most recent graphics cards have programmable processors, so that game physics can be offloaded from the main processor to the GPU.
    • A series of GPU chips sometimes called GPGPU (General Purpose GPU) have double-precision capability so that they can be used as math coprocessors.

    Why GPUs?
    • [Figure: comparison of peak theoretical GFLOPs and memory bandwidth for NVIDIA GPUs and Intel CPUs over the past few years; graphs from the NVIDIA CUDA C Programming Guide 4.0.]

    CUDA Programming Language
    • GPU chips are massively multithreaded, manycore SIMD processors. SIMD stands for Single Instruction Multiple Data.
    • Previously, chips were programmed using standard graphics APIs (DirectX, OpenGL).
    • CUDA, an extension of C, is the most popular GPU programming language. CUDA can also be called from a C++ program.
    • The CUDA standard has no Fortran support, but the Portland Group sells a third-party CUDA Fortran.