GPU Programming with CUDA and OpenACC
Axel Koehler – NVIDIA

© NVIDIA Corporation 2012

Add GPUs: Accelerate Applications

CPUs are designed to run a few tasks quickly; GPUs are designed to run many tasks efficiently.

Minimum Change, Big Speed-up

Application Code: the compute-intensive functions are offloaded to the GPU, while the rest of the sequential code continues to run on the CPU.

Ways to Accelerate Applications

Applications can be accelerated in several ways:
• Libraries
• Directives
• Programming Languages (CUDA, ..)
• High-Level Languages (Matlab, ..)

These approaches range from the easiest (no need for GPU programming expertise) to those offering maximum performance.

GPU Accelerated Libraries: “Drop-in” Acceleration for Your Applications

• NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP, NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore
• Sparse Linear Algebra
• Building-block Algorithms for CUDA
• C++ STL Features for CUDA
• IMSL Library for CUDA

OpenACC Directives

Easy, Open, Powerful

• Simple compiler hints
• Compiler parallelizes code
• Works on multicore CPUs & many-core GPUs
• Future integration into the OpenMP standard planned

    Program myscience
      ... serial code ...
    !$acc region
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end region
      ...
    End Program myscience

Your original Fortran or C code; the !$acc directives are the OpenACC compiler hints.
http://www.openacc-standard.org

OpenACC

Compiler directives specify parallel regions in C, C++, and Fortran. OpenACC compilers offload these parallel regions from the host to an accelerator. The code remains portable across operating systems, host CPUs, and accelerators.

Create high-level heterogeneous programs without explicit accelerator initialization and without explicit data or program transfers between host and accelerator.

The programming model allows programmers to start simple, then enhance the code with additional guidance for the compiler on loop mappings, data location, and other performance details.

Basic Concepts

Data is transferred between CPU memory and GPU memory over the PCI bus, and computation is offloaded from the CPU to the GPU.

For efficiency, decouple data movement from compute offload.

Directive Syntax

Fortran:
  !$acc directive [clause [[,] clause]…]
Often paired with a matching end directive surrounding a structured code block:
  !$acc end directive

C:
  #pragma acc directive [clause [[,] clause]…]
Often followed by a structured code block.

OpenACC Directive Set

Parallel Constructs:
  #pragma acc parallel [clause [[,] clause]…] new-line
Data Constructs:
  #pragma acc data [clause [[,] clause]…] new-line
Loop Constructs:
  #pragma acc loop [clause [[,] clause]…] new-line
Runtime Library Routines
Cache Directives

And a few others…

Jacobi Relaxation

Iterate until converged:

iter = 0
do while ( err .gt. tol .and. iter .lt. iter_max )

  iter = iter + 1
  err = 0.0
  ! Iterate across matrix elements; calculate new value from neighbours
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
      err = max( err, abs(Anew(i,j)-A(i,j)) )
    end do
  end do

  if( mod(iter,100).eq.0 .or. iter.eq.1 ) print*, iter, err
  A = Anew
end do

OpenMP CPU Implementation

iter = 0
do while ( err .gt. tol .and. iter .lt. iter_max )
  iter = iter + 1
  err = 0.0
!$omp parallel do shared(m,n,Anew,A) reduction(max:err)   ! parallelise code inside region
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
      err = max( err, abs(Anew(i,j)-A(i,j)) )
    end do
  end do
!$omp end parallel do                                     ! close off region
  if( mod(iter,100).eq.0 ) print*, iter, err
  A = Anew
end do

OpenACC GPU Implementation

!$acc data copy(A,Anew)               ! copy arrays into GPU memory within region
iter = 0
do while ( err .gt. tol .and. iter .lt. iter_max )
  iter = iter + 1
  err = 0.0
!$acc parallel reduction( max:err )   ! parallelise code inside region
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
      err = max( err, abs(Anew(i,j)-A(i,j)) )
    end do
  end do
!$acc end parallel                    ! close off parallel region
  if( mod(iter,100).eq.0 ) print*, iter, err
  A = Anew
end do
!$acc end data                        ! close off data region, copy data back

Improved OpenACC GPU Implementation

!$acc data copyin(A), copyout(Anew)   ! reduced data movement
iter = 0
do while ( err .gt. tol .and. iter .lt. iter_max )
  iter = iter + 1
  err = 0.0
!$acc parallel reduction( max:err )
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) &
                         + A(i,j-1) + A(i,j+1) )
      err = max( err, abs(Anew(i,j)-A(i,j)) )
    end do
  end do
!$acc end parallel
  if( mod(iter,100).eq.0 ) print*, iter, err
  A = Anew
end do
!$acc end data

More Parallelism

!$acc data copyin(A), create(Anew)    ! Anew now only exists on the GPU
iter = 0
do while ( err .gt. tol .and. iter .lt. iter_max )
  iter = iter + 1
  err = 0.0
!$acc parallel reduction( max:err )   ! find maximum over all iterations
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) &
                         + A(i,j-1) + A(i,j+1) )
      err = max( err, abs(Anew(i,j)-A(i,j)) )
    end do
  end do
!$acc end parallel
  if( mod(iter,100).eq.0 ) print*, iter, err
!$acc parallel                        ! second parallel region inside the data region
  A = Anew
!$acc end parallel
end do
!$acc end data

More Performance

!$acc data copyin(A), create(Anew)
iter = 0
do while ( err .gt. tol .and. iter .lt. iter_max )
  iter = iter + 1
  err = 0.0
!$acc kernels loop reduction( max:err ), gang(32), worker(8)   ! 30% faster than the default schedule
  do j=1,m
    do i=1,n
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) &
                         + A(i,j-1) + A(i,j+1) )
      err = max( err, abs(Anew(i,j)-A(i,j)) )
    end do
  end do
!$acc end kernels loop
  if( mod(iter,100).eq.0 ) print*, iter, err
!$acc parallel
  A = Anew
!$acc end parallel
end do
!$acc end data

OpenACC: Small Effort, Real Impact

• Large Oil Company: 3x in 7 days — solving billions of equations iteratively for oil production at the world’s largest petroleum reservoirs.
• Univ. of Houston (Prof. M.A. Kayali): 20x in 2 days — studying magnetic systems for innovations in magnetic storage media and memory, field sensors, and biomagnetism.
• Univ. of Melbourne (Prof. Kerry Black): 65x in 2 days — better understanding the complex lifecycles of snapper fish in Port Phillip Bay.
• Ufa State Aviation (Prof. Arthur Yuldashev): 7x in 4 weeks — generating stochastic geological models of oilfield reservoirs with borehole data.
• GAMESS-UK (Dr. Wilkinson, Prof. Naidoo): 10x — used in various fields, such as investigating biofuel production and molecular sensors.

CUDA 4.1 Highlights

Advanced Application Development:
• New LLVM-based compiler
• 3D surfaces & cube maps
• Peer-to-Peer between processes
• GPU reset with nvidia-smi

GPU-Accelerated Libraries:
• 1000+ new imaging functions
• Tri-diagonal solver 10x faster vs. MKL
• MRG32k3a & MTGP11213 RNGs
• New Bessel functions in Math lib
• 2x faster matrix-vector w/ HYB-ELL
• Batched GEMM for small matrices
• -style placeholders in Thrust
• New GrabCut sample shows interactive foreground extraction
• New code samples for optical flow, volume filtering and more…

New & Improved Developer Tools:
• Re-designed Visual Profiler
• Parallel Nsight 2.1
• Multi-context debugging
• assert() in device code
• Enhanced CUDA-MEMCHECK

New LLVM-based CUDA Compiler

Delivers up to 10% faster application performance

Faster compilation for increased developer productivity

Modern compiler with broad support

Will bring more languages to the GPU

Makes it easier to bring CUDA to more platforms

NVIDIA Opens Up CUDA Platform

The LLVM-based CUDA compiler source is opened to researchers and tools vendors (NVIDIA, PGI compilers and tools for C, C++, and Fortran). This enables new language support beyond CUDA C/C++ and new processor support beyond NVIDIA GPUs, such as x86 CPUs.

Apply for early access at http://developer.nvidia.com/cuda-source

GPU Technology Conference 2012 May 14-17 | San Jose, CA

The one event you can’t afford to miss:
• Learn about leading-edge advances in GPU computing
• Explore the research as well as the commercial applications
• Discover advances in computational visualization
• Take a deep dive into parallel programming

Ways to participate:
• Speak – share your work and gain exposure as a thought leader
• Register – learn from the experts and network with your peers
• Exhibit/Sponsor – promote your company as a key player in the GPU ecosystem

www.gputechconf.com

Summary

Heterogeneous/hybrid computing is the future.
• OpenACC directives provide a standardized path to hybrid computing: easy, open, and powerful.
• Use highly optimized GPU libraries.
• Use CUDA for maximum performance.

Don’t wait – start today!

GPU Programming with CUDA and OpenACC
Axel Koehler – NVIDIA ([email protected])
