CUDA Fortran 2013

Brent Leback
The Portland Group
[email protected]

Why Fortran?

. Rich legacy in the scientific community
. Semantics – easier to vectorize/parallelize
. Array descriptors
. Modules
. Fortran 2003

Today’s Fortran delivers many of the abstraction features of C++ without compromising performance.

Commodity Multicore + Commodity Manycore GPUs

4 – 24 CPU Cores
Tesla C1060: 240 GPU/Accelerator Cores
Tesla C2050 “Fermi”: 448 GPU/Accelerator Cores
Tesla K20 “Kepler”

GPU Programming Constants

The Program must:

. Initialize/Select the GPU to run on
. Allocate data on the GPU
. Move data from host, or initialize data on GPU
. Launch kernel(s)
. Gather results from GPU
. Deallocate data
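Those steps map onto just a few lines of CUDA Fortran. Below is a minimal end-to-end sketch; the kernel scale_kernel, the module name, and the sizes are illustrative assumptions, not from the original slides:

  module example_kernels
  contains
    attributes(global) subroutine scale_kernel(a, s)
      real :: a(*)
      real, value :: s
      integer :: i
      i = (blockidx%x-1)*blockdim%x + threadidx%x
      a(i) = a(i) * s
    end subroutine
  end module

  program gpu_skeleton
    use cudafor
    use example_kernels
    real, device, allocatable :: a_d(:)
    real :: a(1024)
    integer :: istat
    istat = cudaSetDevice(0)                ! initialize/select the GPU to run on
    allocate (a_d(1024))                    ! allocate data on the GPU
    a = 1.0
    a_d = a                                 ! move data from host
    call scale_kernel<<<4,256>>>(a_d, 2.0)  ! launch kernel(s)
    a = a_d                                 ! gather results from GPU
    deallocate (a_d)                        ! deallocate data
  end program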

What Does CUDA Fortran Look Like?

CPU Code:

  real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
  . . .
  allocate (Adev(N,M), Bdev(M,L), Cdev(N,L))
  Adev = A(1:N,1:M)
  Bdev = B(1:M,1:L)

  call mm_kernel <<<dim3(N/16,L/16,1),dim3(16,16,1)>>> &
                 ( Adev, Bdev, Cdev, N, M, L )

  C(1:N,1:L) = Cdev
  deallocate ( Adev, Bdev, Cdev )
  . . .

GPU Code:

  attributes(global) subroutine mm_kernel ( A, B, C, N, M, L )
    real :: A(N,M), B(M,L), C(N,L), Cij
    integer, value :: N, M, L
    integer :: i, j, kb, k, tx, ty
    real, shared :: Asub(16,16), Bsub(16,16)
    tx = threadidx%x
    ty = threadidx%y
    i = (blockidx%x-1) * 16 + tx
    j = (blockidx%y-1) * 16 + ty
    Cij = 0.0
    do kb = 1, M, 16
      Asub(tx,ty) = A(i,kb+ty-1)
      Bsub(tx,ty) = B(kb+tx-1,j)
      call syncthreads()
      do k = 1, 16
        Cij = Cij + Asub(tx,k) * Bsub(k,ty)
      enddo
      call syncthreads()
    enddo
    C(i,j) = Cij
  end subroutine mm_kernel

Using the CUDA API

  use cudafor
  real, allocatable, device :: a(:)
  real :: b(10), b2(2), c(10)
  integer(kind=cuda_stream_kind) :: istrm
  . . .
  istat = cudaMalloc( a, 10 )
  istat = cudaMemcpy( a, b, 10 )
  istat = cudaMemcpy( a(2), b2, 2 )
  istat = cudaMemcpy( c, a, 10 )
  istat = cudaFree( a )
  istat = cudaMemcpyAsync(a, x, 10, istrm)

Building a CUDA Fortran Program

% pgfortran a.cuf
    .cuf suffix implies CUDA Fortran (free form)
    .CUF suffix runs the preprocessor
    Use the -Mfixed option for F77-style fixed format

% pgfortran -Mcuda a.f90
    -Mcuda[=emu|cc10|cc13|cc20|cc30|cc35]
    -Mcuda=[cuda4.2,cuda5.0]
    -Mcuda=fastmath,flushz,keep[bin|gpu|ptx]
    -Mcuda=rdc

Must use -Mcuda when linking from object files; the driver then pulls in the correct paths and libraries.
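For example, a two-step build might look like the following (the file names are illustrative):

  % pgfortran -c kernels.cuf main.cuf
  % pgfortran -Mcuda main.o kernels.o -o app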

PGI CUDA Fortran 2013 New Features

. Texture memory support
. CUDA 5.0 Dynamic Parallelism
    Chevron launches within global subroutines
    Support for allocate, deallocate within global subroutines
      real, device, allocatable :: a_d_local(:)
. Pre-defined Fortran interfaces to the supported CUDA API and cublas
. CUDA 5.0 support for separate compilation, creating and linking static device
  objects and libraries. Relocatable device code.

Texture Memory as a Read-only Cache

CUDA Fortran module declaration:

  real, texture, pointer :: q(:)

CUDA C/C++ global declaration:

  texture<float> tq;

CUDA Fortran host code:

  real, device, target, allocatable :: p(:)
  . . .
  allocate(p(n))
  q => p

CUDA C/C++ host code:

  float *d_p;
  . . .
  cudaMalloc((void **) &d_p, size);
  cudaChannelFormatDesc desc =
      cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
  cudaBindTexture(0, tq, d_p, desc, size);

CUDA Fortran device code:

  i = threadIdx%x
  j = (blockIdx%x-1)*blockDim%x + i
  s(j) = s(j) + q(i)

CUDA C/C++ device code:

  i = threadIdx.x;
  j = blockIdx.x*blockDim.x + i;
  s[j] = s[j] + tex1Dfetch(tq, i);

Texture Memory Performance (Measured GBytes/sec)

  module memtests
    integer, texture, pointer :: t(:)
  contains
    attributes(global) subroutine stridem(m,b,s)
      integer, device :: m(*), b(*)
      integer, value :: s
      i = blockDim%x*(blockIdx%x-1) + threadIdx%x
      j = i * s
      b(i) = m(j) + 1
      return
    end subroutine

    attributes(global) subroutine stridet(b,s)
      integer, device :: b(*)
      integer, value :: s
      i = blockDim%x*(blockIdx%x-1) + threadIdx%x
      j = i * s
      b(i) = t(j) + 1
      return
    end subroutine
  end module memtests

               blockDim = 32          blockDim = 128
  stride    stridem   stridet      stridem   stridet
       1      53.45     50.94       141.93    135.43
       2      49.77     49.93       100.37     99.61
       3      46.96     48.07        75.25     74.91
       4      44.25     45.90        60.19     60.00
       5      39.06     39.89        50.09     49.96
       6      37.14     39.33        42.95     42.93
       7      32.88     35.94        37.56     37.50
       8      32.48     32.98        33.38     33.42
       9      29.90     32.94        30.01     33.38
      10      27.43     32.86        27.28     33.07
      11      25.17     32.77        24.98     32.91
      12      23.23     33.19        23.07     33.01
      13      21.57     33.13        21.40     32.26
      14      20.15     32.98        19.97     31.76
      15      18.87     32.80        18.72     30.80
      16      17.78     32.66        17.60     31.83
      20      14.38     33.37        14.23     26.84
      24      12.08     33.71        11.94     24.30
      28      10.41     33.38        10.27     19.97
      32       9.15     33.42         9.02     20.19

CUDA Fortran Dynamic Parallelism (1)

  attributes(global) subroutine strassen( A, B, C, M, N, K )
    real :: A(M,K), B(K,N), C(M,N)
    integer, value :: N, M, K
    . . .
    if (ntimes == 0) then
      allocate(m1(1:m/2,1:k/2))
      allocate(m2(1:k/2,1:n/2))
      allocate(m3(1:m/2,1:k/2))
      allocate(m4(1:k/2,1:n/2))
      allocate(m5(1:m/2,1:k/2))
      allocate(m6(1:k/2,1:n/2))
      allocate(m7(1:m/2,1:k/2))
      flags = cudaStreamNonBlocking
      do i = 1, 7
        istat = cudaStreamCreateWithFlags(istreams(i), flags)
      end do
    end if
    . . .

Support for Fortran allocate and deallocate in device code.

CUDA Fortran Dynamic Parallelism (2)

  . . .
  ! m1 = (A11 + A22) * (B11 + B22)
  call dgemm16t1<<<...>>>(a(1,1), &
                          a(1+m/2,1+k/2), m, &
                          b(1,1), b(1+k/2,1+n/2), k, &
                          m1(1,1), newn, newn, 1.0d0)

  ! m2 = (A21 + A22) * B11
  call dgemm16t2<<<...>>>(a(1+m/2,1), &
                          a(1+m/2,1+k/2), m, &
                          b(1,1), k, &
                          m2(1,1), newn, newn)

. . .

  ! m7 = (A12 - A22) * (B21 + B22)
  call dgemm16t1<<<...>>>(a(1,1+k/2), &
                          a(1+m/2,1+k/2), m, &
                          b(1+k/2,1), b(1+k/2,1+n/2), k, &
                          m7(1,1), newn, newn, -1.0d0)

  istat = cudaDeviceSynchronize()

  . . .

Support for kernel launch from device code.

CUDA Fortran Dynamic Parallelism (3)

  . . .
  ! C11 = m1 + m4 – m5 + m7
  call add16x4<<<1,devthreads,0,istreams(1)>>>(m1,m4,m5,m7,m/2,c(1,1),m,n/2)

  ! C12 = m3 + m5
  call add16<<<1,devthreads,0,istreams(2)>>>(m3,m/2,m5,m/2,c(1,1+n/2),m,n/2)

  ! C21 = m2 + m4
  call add16<<<1,devthreads,0,istreams(3)>>>(m2,m/2,m4,m/2,c(1+m/2,1),m,n/2)

  ! C22 = m1 + m3 – m2 + m6
  call add16x4<<<1,devthreads,0,istreams(4)>>>(m1,m3,m2,m6,m/2,c(1+m/2,1+n/2),m,n/2)

  . . .
  end subroutine strassen

Compile using -Mcuda=rdc,cc35

CUDA Fortran Separate Compilation

Compile separate modules independently:

  % pgf90 -c -O2 -Mcuda=rdc ddfun90.cuf ddmod90.cuf

Object files can be put into a library:

  % ar rc ddfunc.a ddfun90.o ddmod90.o

Use the modules in device code in typical Fortran fashion:

  % cat main.cuf
  program main
  use ddmodule
  . . .

Link using pgf90 and the rdc option:

  % pgf90 -O2 -Mcuda=rdc main.cuf ddfunc.a

Cool CUDA Fortran Features

Generic Interfaces and Overloading

Allows programmers to define Fortran-like operations:

  module dev_transpose
    interface transpose
      module procedure real4devxspose
      module procedure int4devxspose
    end interface
  contains
    function real4devxspose(adev) result(b)
      real, device :: adev(:,:)
      real :: b(ubound(adev,2),ubound(adev,1))
      . . .
      return
    end function
    . . .
  end module dev_transpose

At the site of the function reference, the look is normal Fortran:

  subroutine s1(a,b,n,m)
    use dev_transpose
    real, device :: a(n,m)
    real :: b(m,n)
    b = transpose(a)
  end subroutine
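The slide elides the body of real4devxspose. A possible implementation, sketched here with a CUF kernel directive (the device temporary and the loop structure are assumptions, not from the original):

  function real4devxspose(adev) result(b)
    real, device :: adev(:,:)
    real, device :: btmp(ubound(adev,2),ubound(adev,1))  ! device temporary (assumed)
    real :: b(ubound(adev,2),ubound(adev,1))
    integer :: i, j
    !$cuf kernel do (2) <<<*,*>>>
    do j = 1, ubound(adev,2)
      do i = 1, ubound(adev,1)
        btmp(j,i) = adev(i,j)   ! transpose on the device
      enddo
    enddo
    b = btmp                    ! device-to-host copy of the result
  end function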

BLAS Overloading

  module cublas
    ! isamax
    interface isamax
      integer function isamax(n, x, incx)
        integer :: n, incx
        real(4) :: x(*)
      end function

      integer function isamaxcu(n, x, incx) bind(c, name='cublasIsamax')
        integer, value :: n, incx
        real(4), device :: x(*)
      end function
    end interface
    . . .

  use cublas
  real(4), device :: xd(N)
  real(4) :: x(N)

  call random_number(x)

  ! On the device
  xd = x
  j = isamax(N,xd,1)

  ! On the host, same name
  k = isamax(N,x,1)

Object-oriented Features

Type extension allows polymorphism:

  type dertype
    integer id, iop, npr
    real, allocatable :: rx(:)
  contains
    procedure :: init  => init_dertype
    procedure :: print => print_dertype
    procedure :: find  => find_dertype
  end type dertype

  type, extends(dertype) :: extdertype
    real, allocatable, device :: rx_d(:)
  contains
    procedure :: init => init_extdertype
    procedure :: find => find_extdertype
  end type extdertype

The class statement allows arguments of the base or extended type:

  subroutine init_dertype(this, n)
    class(dertype) :: this
    . . .
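For instance, a hypothetical caller (not from the original slides) can pass either type and recover the extension with select type:

  subroutine use_poly(d, n)
    class(dertype) :: d
    integer :: n
    call d%init(n)          ! dispatches to init_dertype or init_extdertype
    select type (d)
    type is (extdertype)
      d%rx_d = d%rx         ! extended type stages its data on the device
                            ! (assumes both components are allocated)
    end select
  end subroutine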

!$CUF Kernel Directives

  module madd_device_module
    use cudafor
  contains
    subroutine madd_dev(a,b,c,sum,n1,n2)
      real, dimension(:,:), device :: a,b,c
      real :: sum
      integer :: n1,n2
      type(dim3) :: grid, block
      !$cuf kernel do (2) <<<(*,*),(32,4)>>>
      do j = 1,n2
        do i = 1,n1
          a(i,j) = b(i,j) + c(i,j)
          sum = sum + a(i,j)
        enddo
      enddo
    end subroutine
  end module

Equivalent hand-written CUDA kernels:

  module madd_device_module
    use cudafor
    implicit none
  contains
    attributes(global) subroutine madd_kernel(a,b,c,blocksum,n1,n2)
      real, dimension(:,:) :: a,b,c
      real, dimension(:) :: blocksum
      integer, value :: n1,n2
      integer :: i,j,tindex,tneighbor,bindex
      real :: mysum
      real, shared :: bsum(256)
      ! Do this thread's work
      mysum = 0.0
      do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y
        do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x
          a(i,j) = b(i,j) + c(i,j)
          mysum = mysum + a(i,j)  ! accumulates partial sum per thread
        enddo
      enddo
      ! Now add up all partial sums for the whole thread block
      ! Compute this thread's linear index in the thread block
      ! We assume 256 threads in the thread block
      tindex = threadidx%x + (threadidx%y-1)*blockdim%x
      ! Store this thread's partial sum in the shared memory block
      bsum(tindex) = mysum
      call syncthreads()
      ! Accumulate all the partial sums for this thread block to a single value
      tneighbor = 128
      do while( tneighbor >= 1 )
        if( tindex <= tneighbor ) &
          bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
        tneighbor = tneighbor / 2
        call syncthreads()
      enddo
      ! Store the partial sum for the thread block
      bindex = blockidx%x + (blockidx%y-1)*griddim%x
      if( tindex == 1 ) blocksum(bindex) = bsum(1)
    end subroutine

    ! Add up partial sums for all thread blocks to a single cumulative sum
    attributes(global) subroutine madd_sum_kernel(blocksum,dsum,nb)
      real, dimension(:) :: blocksum
      real :: dsum
      integer, value :: nb
      real, shared :: bsum(256)
      integer :: tindex,tneighbor,i
      ! Again, we assume 256 threads in the thread block
      ! Accumulate a partial sum for each thread
      tindex = threadidx%x
      bsum(tindex) = 0.0
      do i = tindex, nb, blockdim%x
        bsum(tindex) = bsum(tindex) + blocksum(i)
      enddo
      call syncthreads()
      ! This code is copied from the previous kernel
      ! Accumulate all the partial sums for this thread block to a single value
      ! Since there is only one thread block, this single value is the final result
      tneighbor = 128
      do while( tneighbor >= 1 )
        if( tindex <= tneighbor ) &
          bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
        tneighbor = tneighbor / 2
        call syncthreads()
      enddo
      if( tindex == 1 ) dsum = bsum(1)
    end subroutine

    subroutine madd_dev(a,b,c,dsum,n1,n2)
      real, dimension(:,:), device :: a,b,c
      real, device :: dsum
      real, dimension(:), allocatable, device :: blocksum
      integer :: n1,n2,nb
      type(dim3) :: grid, block
      integer :: r
      ! Compute grid/block size; block size must be 256 threads
      grid = dim3((n1+31)/32, (n2+7)/8, 1)
      block = dim3(32,8,1)
      nb = grid%x * grid%y
      allocate(blocksum(1:nb))
      call madd_kernel<<< grid, block >>>(a,b,c,blocksum,n1,n2)
      call madd_sum_kernel<<< 1, 256 >>>(blocksum,dsum,nb)
      r = cudaThreadSynchronize()  ! don't deallocate too early
      deallocate(blocksum)
    end subroutine
  end module

extern "C" { module sortutils

//Sort for integer arrays interface sort void sort_int_wrapper(int *data, int N) module procedure sort_dev_int { thrust::device_ptr dev_ptr(data); . . . thrust::sort(dev_ptr, dev_ptr+N); end interface } } contains subroutine sort_dev_int(devarray) integer, device :: devarray(:) program t interface use sortutils subroutine sort_int(data, n) & integer, parameter :: n = 100 bind(C,name='sort_int_wrapper') integer, device :: a_d(n) integer, device :: data(*) integer :: a_h(n) integer, value :: n end subroutine !$cuf kernel do end interface do i = 1, n n = size(array) a_d(i) = 1 + mod(47*i,n) call sort_int(devarray, n) end do return end subroutine !a_h = sort(a_d) ! Preserves a_d end module call sort(a_d); a_h = a_d print *,n,count(a_h .eq. (/(i,i=1,n)/)) end OpenACC Interoperability module mymod real, dimension(:), allocatable, device :: xDev end module ... use mymod ... allocate( xDev(n) ) call init_kernel <<>> (xDev, n) ... !$acc data copy( y(:) ) ! no need to copy 'xDev' ... !$acc kernels loop do i = 1, n y(i) = y(i) + a * xDev(i) enddo ... !$acc end data Calling CUDA Fortran Subroutines From OpenACC Programs use cublas

  !$acc data region copyin(x)

  ! some compute regions
  . . .

  k = isamax(N,x,1)

  ! maybe some other compute regions
  . . .

  !$acc end data region

. You can call CUDA routines by creating explicit interfaces to the host-resident-data and device-resident-data specific functions behind a generic call, as sketched below.
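A minimal sketch of that pattern; the module name, the saxpy routines, and their bodies are hypothetical, not from the original:

  module saxpy_generic
    interface saxpy                  ! one generic name...
      module procedure saxpy_host    ! ...for host-resident data
      module procedure saxpy_dev     ! ...and for device-resident data
    end interface
  contains
    subroutine saxpy_host(n, a, x, y)
      integer :: n
      real :: a, x(n), y(n)
      integer :: i
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
    end subroutine

    subroutine saxpy_dev(n, a, x, y)
      integer :: n
      real :: a
      real, device :: x(n), y(n)
      integer :: i
      !$cuf kernel do <<<*,*>>>
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
    end subroutine
  end module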

Why CUDA Fortran?

. Fortran’s legacy in the scientific community
. Semantics – easier to parallelize
. Array descriptors
. Modules
. Fortran 2003
. High-level syntax for many CUDA operations
. CUF Kernels
. PGI OpenACC interoperability

Copyright Notice

© Contents copyright 2013, The Portland Group, Inc. This material may not be reproduced in any manner without the expressed written permission of The Portland Group.

PGFORTRAN, PGF95, PGI Accelerator and PGI Unified Binary are trademarks; and PGI, PGCC, PGC++, PGI Visual Fortran, PVF, PGI CDK, Cluster Development Kit, PGPROF, PGDBG, and The Portland Group are registered trademarks of The Portland Group Incorporated. Other brands and names are the property of their respective owners.