Hybrid Programming Using Openshmem and Openacc
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Introduction to Openacc 2018 HPC Workshop: Parallel Programming
Introduction to OpenACC 2018 HPC Workshop: Parallel Programming Alexander B. Pacheco Research Computing July 17 - 18, 2018 CPU vs GPU CPU : consists of a few cores optimized for sequential serial processing GPU : has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously GPU enabled applications 2 / 45 CPU vs GPU CPU : consists of a few cores optimized for sequential serial processing GPU : has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously GPU enabled applications 2 / 45 CPU vs GPU CPU : consists of a few cores optimized for sequential serial processing GPU : has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously GPU enabled applications 2 / 45 CPU vs GPU CPU : consists of a few cores optimized for sequential serial processing GPU : has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously GPU enabled applications 2 / 45 3 / 45 Accelerate Application for GPU 4 / 45 GPU Accelerated Libraries 5 / 45 GPU Programming Languages 6 / 45 What is OpenACC? I OpenACC Application Program Interface describes a collection of compiler directive to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator. I provides portability across operating systems, host CPUs and accelerators History I OpenACC was developed by The Portland Group (PGI), Cray, CAPS and NVIDIA. I PGI, Cray, and CAPs have spent over 2 years developing and shipping commercial compilers that use directives to enable GPU acceleration as core technology. -
NVIDIA CUDA on IBM POWER8: Technical Overview, Software Installation, and Application Development
Redpaper Dino Quintero Wei Li Wainer dos Santos Moschetta Mauricio Faria de Oliveira Alexander Pozdneev NVIDIA CUDA on IBM POWER8: Technical overview, software installation, and application development Overview The exploitation of general-purpose computing on graphics processing units (GPUs) and modern multi-core processors in a single heterogeneous parallel system has proven highly efficient for running several technical computing workloads. This applied to a wide range of areas such as chemistry, bioinformatics, molecular biology, engineering, and big data analytics. Recently launched, the IBM® Power System S824L comes into play to explore the use of the NVIDIA Tesla K40 GPU, combined with the latest IBM POWER8™ processor, providing a unique technology platform for high performance computing. This IBM Redpaper™ publication discusses the installation of the system, and the development of C/C++ and Java applications using the NVIDIA CUDA platform for IBM POWER8. Note: CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and programming model created by NVIDIA and implemented by the GPUs that they produce. The following topics are covered: Advantages of NVIDIA on POWER8 The IBM Power Systems S824L server Software stack System monitoring Application development Tuning and debugging Application examples © Copyright IBM Corp. 2015. All rights reserved. ibm.com/redbooks 1 Advantages of NVIDIA on POWER8 The IBM and NVIDIA partnership was announced in November 2013, for the purpose of integrating IBM POWER®-based systems with NVIDIA GPUs, and enablement of GPU-accelerated applications and workloads. The goal is to deliver higher performance and better energy efficiency to companies and data centers. This collaboration produced its initial results in 2014 with: The announcement of the first IBM POWER8 system featuring NVIDIA Tesla GPUs (IBM Power Systems™ S824L). -
Pathscale ENZO GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives C
PathScale ENZO GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives C. Bergström | May 14th, 2012 Brief Introduction to ENZO 2 | PathScale GTC12 S0631 Tutorial | May 14th, 2012 ENZO Overview & Goals Speed transition to GPU & many-core systems • Simplify the task of migrating software written in C, C++ & Fortran • Uses OpenHMPP Standard (easy migration) • CAPS HMPP compatible Performance & HPC focused • Fully exploits NVIDIA GPU features • Generates native instructions optimized for NVIDIA GPU 3 | PathScale GTC12 S0631 Tutorial | May 14th, 2012 Project Schedule & Status 4 | PathScale GTC12 S0631 Tutorial | May 14th, 2012 Project Schedule . ENZO Production release June 2012 – OpenHMPP 2.5 C, C++ and Fortran . Next ENZO Production release October 2012 – More tools and better support for libraries – x8664 OpenHMPP task parallelism (similar to OMP3 tasks) – More optimizations (IPA / CG2 / textures) – OpenHMPP 3.0 – CUDA 4.x – Kepler 5 | PathScale GTC12 S0631 Tutorial | May 14th, 2012 Project Status . OpenHMPP 2.5 – Running CAPS C & Fortran Labs – PathScale written HMPP test suite – Customer code . New C++ compiler – Perennial C++VS and CVSA regression free – Corner case compile time issues – Corner case runtime issues . Ongoing effort – Performance tuning & benchmarking – Compiler robustness – Nightly compiler builds to address issues 6 | PathScale GTC12 S0631 Tutorial | May 14th, 2012 Performance 7 | PathScale GTC12 S0631 Tutorial | May 14th, 2012 Performance . NVIDIA Tesla 2050 - “Lab2” SGEMM – ENZO – /opt/enzo/bin/pathcc -hmpp -
GPU Computing with Openacc Directives
Introduction to OpenACC Directives Duncan Poole, NVIDIA Thomas Bradley, NVIDIA GPUs Reaching Broader Set of Developers 1,000,000’s CAE CFD Finance Rendering Universities Data Analytics Supercomputing Centers Life Sciences 100,000’s Oil & Gas Defense Weather Research Climate Early Adopters Plasma Physics 2004 Present Time 3 Ways to Accelerate Applications Applications OpenACC Programming Libraries Directives Languages “Drop-in” Easily Accelerate Maximum Acceleration Applications Flexibility 3 Ways to Accelerate Applications Applications OpenACC Programming Libraries Directives Languages CUDA Libraries are interoperable with OpenACC “Drop-in” Easily Accelerate Maximum Acceleration Applications Flexibility 3 Ways to Accelerate Applications Applications OpenACC Programming Libraries Directives Languages CUDA Languages are interoperable with OpenACC, “Drop-in” Easily Accelerate too! Maximum Acceleration Applications Flexibility NVIDIA cuBLAS NVIDIA cuRAND NVIDIA cuSPARSE NVIDIA NPP Vector Signal GPU Accelerated Matrix Algebra on Image Processing Linear Algebra GPU and Multicore NVIDIA cuFFT Building-block Sparse Linear C++ STL Features IMSL Library Algorithms for CUDA Algebra for CUDA GPU Accelerated Libraries “Drop-in” Acceleration for Your Applications OpenACC Directives CPU GPU Simple Compiler hints Program myscience Compiler Parallelizes code ... serial code ... !$acc kernels do k = 1,n1 do i = 1,n2 OpenACC ... parallel code ... Compiler Works on many-core GPUs & enddo enddo Hint !$acc end kernels multicore CPUs ... End Program myscience -
Openacc Course October 2017. Lecture 1 Q&As
OpenACC Course October 2017. Lecture 1 Q&As. Question Response I am currently working on accelerating The GCC compilers are lagging behind the PGI compilers in terms of code compiled in gcc, in your OpenACC feature implementations so I'd recommend that you use the PGI experience should I grab the gcc-7 or compilers. PGI compilers provide community version which is free and you try to compile the code with the pgi-c ? can use it to compile OpenACC codes. New to OpenACC. Other than the PGI You can find all the supported compilers here, compiler, what other compilers support https://www.openacc.org/tools OpenACC? Is it supporting Nvidia GPU on TK1? I Currently there isn't an OpenACC implementation that supports ARM think, it must be processors including TK1. PGI is considering adding this support sometime in the future, but for now, you'll need to use CUDA. OpenACC directives work with intel Xeon Phi is treated as a MultiCore x86 CPU. With PGI you can use the "- xeon phi platform? ta=multicore" flag to target multicore CPUs when using OpenACC. Does it have any application in field of In my opinion OpenACC enables good software engineering practices by Software Engineering? enabling you to write a single source code, which is more maintainable than having to maintain different code bases for CPUs, GPUs, whatever. Does the CPU comparisons include I think that the CPU comparisons include standard vectorisation that the simd vectorizations? PGI compiler applies, but no specific hand-coded vectorisation optimisations or intrinsic work. Does OpenMP also enable OpenMP does have support for GPUs, but the compilers are just now parallelization on GPU? becoming available. -
Locality-Aware Automatic Parallelization for GPGPU with Openhmpp Directives
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014) July 3-4, 2014 — Amsterdam, Netherlands Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work J.M. Andión et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. HLPP 2014. Outline • Motivation: General Purpose Computation with GPUs • GPGPU with CUDA & OpenHMPP • The KIR: an IR for the Detection of Parallelism • Locality-Aware Generation of Efficient GPGPU Code • Case Studies: CONV3D & SGEMM • Performance Evaluation • Conclusions & Future Work J.M. Andión et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. HLPP 2014. 100,000 Intel Xeon 6 cores, 3.3 GHz (boost to 3.6 GHz) Intel Xeon 4 cores, 3.3 GHz (boost to 3.6 GHz) Intel Core i7 Extreme 4 cores 3.2 GHz (boost to 3.5 GHz) 24,129 Intel Core Duo Extreme 2 cores, 3.0 GHz 21,871 Intel Core 2 Extreme 2 cores, 2.9 GHz 19,484 10,000 AMD Athlon 64, 2.8 GHz 14,387 AMD Athlon, 2.6 GHz 11,865 Intel Xeon EE 3.2 GHz 7,108 Intel D850EMVR motherboard (3.06 GHz, Pentium 4 processor with Hyper-Threading Technology) 6,043 6,681 IBM Power4, 1.3 GHz 4,195 3,016 Intel VC820 motherboard, 1.0 GHz Pentium III processor 1,779 Professional Workstation XP1000, 667 MHz 21264A Digital AlphaServer 8400 6/575, 575 MHz 21264 1,267 1000 993 AlphaServer 4000 5/600, 600 MHz 21164 649 Digital Alphastation 5/500, 500 MHz 481 Digital Alphastation 5/300, 300 MHz 280 22%/year Digital Alphastation 4/266, 266 MHz 183 IBM POWERstation 100, 150 MHz 117 100 Digital 3000 AXP/500, 150 MHz 80 HP 9000/750, 66 MHz 51 IBM RS6000/540, 30 MHz 24 52%/year Performance (vs. -
How to Write Code That Will Survive
Programming Heterogeneous Many-cores Using Directives HMPP - OpenAcc F. Bodin, CAPS CTO Introduction • Programming many-core systems faces the following dilemma o Achieve "portable" performance • Multiple forms of parallelism cohabiting – Multiple devices (e.g. GPUs) with their own address space – Multiple threads inside a device – Vector/SIMD parallelism inside a thread • Massive parallelism – Tens of thousands of threads needed o The constraint of keeping a unique version of codes, preferably mono- language • Reduces maintenance cost • Preserves code assets • Less sensitive to fast moving hardware targets • Codes last several generations of hardware architecture • For legacy codes, directive-based approach may be an alternative o And may benefit from auto-tuning techniques CC 2012 www.caps-entreprise.com 2 Profile of a Legacy Application • Written in C/C++/Fortran • Mix of user code and while(many){ library calls ... mylib1(A,B); ... • Hotspots may or may not be myuserfunc1(B,A); parallel ... mylib2(A,B); ... • Lifetime in 10s of years myuserfunc2(B,A); ... • Cannot be fully re-written } • Migration can be risky and mandatory CC 2012 www.caps-entreprise.com 3 Overview of the Presentation • Many-core architectures o Definition and forecast o Why usual parallel programming techniques won't work per se • Directive-based programming o OpenACC sets of directives o HMPP directives o Library integration issue • Toward a portable infrastructure for auto-tuning o Current auto-tuning directives in HMPP 3.0 o CodeletFinder for offline auto-tuning o Toward a standard auto-tuning interface CC 2012 www.caps-entreprise.com 4 Many-Core Architectures Heterogeneous Many-Cores • Many general purposes cores coupled with a massively parallel accelerator (HWA) Data/stream/vector CPU and HWA linked with a parallelism to be PCIx bus exploited by HWA e.g. -
Multi-Threaded GPU Accelerration of ORBIT with Minimal Code
Princeton Plasma Physics Laboratory PPPL- 4996 4996 Multi-threaded GPU Acceleration of ORBIT with Minimal Code Modifications Ante Qu, Stephane Ethier, Eliot Feibush and Roscoe White FEBRUARY, 2014 Prepared for the U.S. Department of Energy under Contract DE-AC02-09CH11466. Princeton Plasma Physics Laboratory Report Disclaimers Full Legal Disclaimer This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors or their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or any third party’s use or the results of such use of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or its contractors or subcontractors. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. Trademark Disclaimer Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof or its contractors or subcontractors. PPPL Report Availability Princeton Plasma Physics Laboratory: http://www.pppl.gov/techreports.cfm Office of Scientific and Technical Information (OSTI): http://www.osti.gov/bridge Related Links: U.S. -
HPVM: Heterogeneous Parallel Virtual Machine
HPVM: Heterogeneous Parallel Virtual Machine Maria Kotsifakou∗ Prakalp Srivastava∗ Matthew D. Sinclair Department of Computer Science Department of Computer Science Department of Computer Science University of Illinois at University of Illinois at University of Illinois at Urbana-Champaign Urbana-Champaign Urbana-Champaign [email protected] [email protected] [email protected] Rakesh Komuravelli Vikram Adve Sarita Adve Qualcomm Technologies Inc. Department of Computer Science Department of Computer Science [email protected]. University of Illinois at University of Illinois at com Urbana-Champaign Urbana-Champaign [email protected] [email protected] Abstract hardware, and that runtime scheduling policies can make We propose a parallel program representation for heteroge- use of both program and runtime information to exploit the neous systems, designed to enable performance portability flexible compilation capabilities. Overall, we conclude that across a wide range of popular parallel hardware, including the HPVM representation is a promising basis for achieving GPUs, vector instruction sets, multicore CPUs and poten- performance portability and for implementing parallelizing tially FPGAs. Our representation, which we call HPVM, is a compilers for heterogeneous parallel systems. hierarchical dataflow graph with shared memory and vector CCS Concepts • Computer systems organization → instructions. HPVM supports three important capabilities for Heterogeneous (hybrid) systems; programming heterogeneous systems: a compiler interme- diate representation (IR), a virtual instruction set (ISA), and Keywords Virtual ISA, Compiler, Parallel IR, Heterogeneous a basis for runtime scheduling; previous systems focus on Systems, GPU, Vector SIMD only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for het- 1 Introduction erogeneous systems. -
Openacc Getting Started Guide
OPENACC GETTING STARTED GUIDE Version 2018 TABLE OF CONTENTS Chapter 1. Overview............................................................................................ 1 1.1. Terms and Definitions....................................................................................1 1.2. System Prerequisites..................................................................................... 2 1.3. Prepare Your System..................................................................................... 2 1.4. Supporting Documentation and Examples............................................................ 3 Chapter 2. Using OpenACC with the PGI Compilers...................................................... 4 2.1. OpenACC Directive Summary........................................................................... 4 2.2. CUDA Toolkit Versions....................................................................................6 2.3. C Structs in OpenACC....................................................................................8 2.4. C++ Classes in OpenACC.................................................................................9 2.5. Fortran Derived Types in OpenACC...................................................................13 2.6. Fortran I/O............................................................................................... 15 2.6.1. OpenACC PRINT Example......................................................................... 15 2.7. OpenACC Atomic Support............................................................................. -
CAPS Openacc Compiler
CAPS OpenACC Compiler HMPP Workbench 3.2 IDDN.FR.001.490007.000.S.P.2008.000.10600 This information is the property of CAPS entreprise and cannot be used, reproduced or transmitted without authorization. Headquarters – France CAPS – USA CAPS – CHINA Immeuble CAP Nord 4701 Patrick Drive Bldg 12 Suite E2, 30/F 4A Allée Marie Berhaut Santa Clara JuneYao International Plaza 35000 Rennes CA 95054 789, Zhaojiabang Road, France Shanghai 200032 Tel.: +33 (0)2 22 51 16 00 Tel.: +1 408 550 2887 x70 Tel.: +86 21 3363 0057 Fax: +33 (0)2 23 20 16 43 Fax: +86 21 3363 0067 [email protected] [email protected] [email protected] N° d’agrément formation : 53 35 08397 35 Visit our website: http://www.caps-entreprise.com CAPS OpenACC Compiler SUMMARY 1. Introduction 5 1.1. Revisions history .................................................................................................................................... 5 1.2. Introduction ............................................................................................................................................ 6 1.3. What is HMPP Workbench? What is the CAPS OpenACC Compiler? ................................................. 6 1.4. Execution Model .................................................................................................................................... 8 1.5. Memory Model ....................................................................................................................................... 8 2. OpenACC Directives 9 2.1. kernels -
Using CAPS Compiler on NVIDIA Kepler and CARMA Systems
Using CAPS Compiler on NVIDIA Kepler and CARMA Systems F. Bodin CTO – CAPS entreprise Introduction • CAPS develops programming tools to help writing a unique source code that can be executed on existing accelerator technologies o C / C++ / Fortran • Fast moving hardware systems require two directives sets o OpenHMPP - easy to extend – integrate new HW features o OpenACC - standardized – longer term view but moving slowly • Generates CUDA or OpenCL codes o Portable on AMD GPU and APU, Intel MIC, Nvidia Kepler-Carma, … nvidia SC 2012 www.caps-entreprise.com 2 CAPS Technology • Provide OpenACC and OpenHMPP directives o OpenHMPP codelet based o OpenACC code region based #pragma hmpp myfunc codelet, … void saxpy(int n, float alpha, float x[n], float y[n]){ #pragma hmppcg gridify(i) for(int i = 0; i<n; ++i) y[i] = alpha*x[i] + y[i]; } #pragma acc kernels … { for(int i = 0; i<n; ++i) y[i] = alpha*x[i] + y[i]; } nvidia SC 2012 www.caps-entreprise.com 3 Compilation Process • Source-to-source technology C++ C Fortran Frontend Frontend Frontend Extraction module codelets Host Fun#1 Fun #2 code Fun #3 Instrumentation OpenCL/Cuda module Generation CPU compiler Native (gcc, ifort, …) compilers Executable CAPS HWA Code (mybin.exe) Runtime (Dyn. library) www.caps-entreprise.com nvidia SC 2012 4 A Few Typical Situations 1. Simple nested loops 6. Dealing with accelerated library 2. Data transfer optimization 7. Dealing with dynamic accelerated tasks scheduling 3. Complex loop nests 8. Using multiple accelerators 4. Code tuning 9. Nested parallelism using native 5. Integrating auto-tuning techniques nvidia SC 2012 www.caps-entreprise.com 5 Simple nested loops - 1 Host • The simple construct is to CPU declare a parallel loop to be code Accelerator compiled and executed on an accelerator send A,B,C o Iterations of the loop nests are converted into threads execute kern.