Accelerating HPC Applications on NVIDIA GPUs
ACCELERATING HPC APPLICATIONS ON NVIDIA GPUS WITH OPENACC
Doug Miles, PGI Compilers & Tools, NVIDIA
High Performance Computing Advisory Council, February 21, 2018

PGI — THE NVIDIA HPC SDK
Fortran, C & C++ Compilers: optimizing, SIMD vectorizing, OpenMP.
Accelerated Computing Features: CUDA Fortran, OpenACC directives.
Multi-Platform Solution: x86-64 and OpenPOWER multicore CPUs, NVIDIA Tesla GPUs; supported on Linux, macOS, Windows.
MPI/OpenMP/OpenACC Tools: debugger, performance profiler, interoperable with DDT and TotalView.

Programming GPU-Accelerated Systems
The CPU system memory and the GPU memory are separate. In the GPU developer view, data must be moved between system memory and GPU memory over the interconnect: PCIe on x86-64 systems, or NVLink where available.

CUDA FORTRAN

Tesla code:

attributes(global) subroutine mm_kernel( A, B, C, N, M, L )
  real :: A(N,M), B(M,L), C(N,L), Cij
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Asub(16,16), Bsub(16,16)
  tx = threadidx%x
  ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx
  j = (blockidx%y-1) * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    Asub(tx,ty) = A(i,kb+ty-1)
    Bsub(tx,ty) = B(kb+tx-1,j)
    call syncthreads()
    do k = 1, 16
      Cij = Cij + Asub(tx,k) * Bsub(k,ty)
    enddo
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine mm_kernel

CPU code:

real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
...
allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
Adev = A(1:N,1:M)
Bdev = B(1:M,1:L)
call mm_kernel <<<dim3(N/16,M/16),dim3(16,16)>>> ( Adev, Bdev, Cdev, N, M, L )
C(1:N,1:L) = Cdev
deallocate( Adev, Bdev, Cdev )
...

CUDA FORTRAN !$CUF KERNEL Directives

The !$cuf kernel directive generates a device kernel from nested host loops:

module madd_device_module
  use cudafor
contains
  subroutine madd_dev(a, b, c, sum, n1, n2)
    real, dimension(:,:), device :: a, b, c
    real :: sum
    integer :: n1, n2
    type(dim3) :: grid, block
    !$cuf kernel do (2) <<<(*,*),(32,4)>>>
    do j = 1, n2
      do i = 1, n1
        a(i,j) = b(i,j) + c(i,j)
        sum = sum + a(i,j)
      enddo
    enddo
  end subroutine
end module

Equivalent hand-written CUDA kernels:

module madd_device_module
  use cudafor
  implicit none
contains

  attributes(global) subroutine madd_kernel(a, b, c, blocksum, n1, n2)
    real, dimension(:,:) :: a, b, c
    real, dimension(:) :: blocksum
    integer, value :: n1, n2
    integer :: i, j, tindex, tneighbor, bindex
    real :: mysum
    real, shared :: bsum(256)
    ! Do this thread's work
    mysum = 0.0
    do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y
      do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x
        a(i,j) = b(i,j) + c(i,j)
        mysum = mysum + a(i,j)  ! accumulates partial sum per thread
      enddo
    enddo
    ! Now add up all partial sums for the whole thread block
    ! Compute this thread's linear index in the thread block
    ! We assume 256 threads in the thread block
    tindex = threadidx%x + (threadidx%y-1)*blockdim%x
    ! Store this thread's partial sum in the shared memory block
    bsum(tindex) = mysum
    call syncthreads()
    ! Accumulate all the partial sums for this thread block to a single value
    tneighbor = 128
    do while( tneighbor >= 1 )
      if( tindex <= tneighbor ) &
        bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
      tneighbor = tneighbor / 2
      call syncthreads()
    enddo
    ! Store the partial sum for the thread block
    bindex = blockidx%x + (blockidx%y-1)*griddim%x
    if( tindex == 1 ) blocksum(bindex) = bsum(1)
  end subroutine

  ! Add up partial sums for all thread blocks to a single cumulative sum
  attributes(global) subroutine madd_sum_kernel(blocksum, dsum, nb)
    real, dimension(:) :: blocksum
    real :: dsum
    integer, value :: nb
    real, shared :: bsum(256)
    integer :: tindex, tneighbor, i
    ! Again, we assume 256 threads in the thread block
    ! Accumulate a partial sum for each thread
    tindex = threadidx%x
    bsum(tindex) = 0.0
    do i = tindex, nb, blockdim%x
      bsum(tindex) = bsum(tindex) + blocksum(i)
    enddo
    call syncthreads()
    ! This code is copied from the previous kernel
    ! Accumulate all the partial sums for this thread block to a single value
    ! Since there is only one thread block, this single value is the final result
    tneighbor = 128
    do while( tneighbor >= 1 )
      if( tindex <= tneighbor ) &
        bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
      tneighbor = tneighbor / 2
      call syncthreads()
    enddo
    if( tindex == 1 ) dsum = bsum(1)
  end subroutine

  subroutine madd_dev(a, b, c, dsum, n1, n2)
    real, dimension(:,:), device :: a, b, c
    real, device :: dsum
    real, dimension(:), allocatable, device :: blocksum
    integer :: n1, n2, nb
    type(dim3) :: grid, block
    integer :: r
    ! Compute grid/block size; block size must be 256 threads
    grid = dim3((n1+31)/32, (n2+7)/8, 1)
    block = dim3(32,8,1)
    nb = grid%x * grid%y
    allocate(blocksum(1:nb))
    call madd_kernel<<< grid, block >>>(a, b, c, blocksum, n1, n2)
    call madd_sum_kernel<<< 1, 256 >>>(blocksum, dsum, nb)
    r = cudaThreadSynchronize()  ! don't deallocate too early
    deallocate(blocksum)
  end subroutine

end module
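Not shown on the slide: a minimal host-side driver, sketched here to illustrate how the hand-written madd_dev above (with its device-resident result dsum) might be invoked, assuming the module has been compiled. The program name, array sizes, and initial values are invented for illustration.

program test_madd
  use cudafor
  use madd_device_module
  implicit none
  integer, parameter :: n1 = 512, n2 = 256
  real, dimension(n1,n2) :: b_h, c_h
  real, dimension(:,:), allocatable, device :: a_d, b_d, c_d
  real, device :: sum_d
  real :: sum_h

  ! Initialize host inputs and mirror them on the device
  b_h = 1.0
  c_h = 2.0
  allocate(a_d(n1,n2), b_d(n1,n2), c_d(n1,n2))
  b_d = b_h
  c_d = c_h
  sum_d = 0.0

  ! Element-wise add plus reduction, entirely on the GPU
  call madd_dev(a_d, b_d, c_d, sum_d, n1, n2)

  ! Copy the device-resident scalar result back to the host
  sum_h = sum_d
  print *, 'sum =', sum_h   ! expect 3.0 * n1 * n2
  deallocate(a_d, b_d, c_d)
end program test_madd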
OpenACC Directives

Directives manage data movement (acc data), initiate parallel execution (acc parallel), and optimize loop mappings (acc loop). OpenACC is incremental, single source, interoperable, and performance portable across CPUs, GPUs, and manycore processors.

#pragma acc data copyin(a,b) copyout(c)
{
  ...
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
      ...
    }
  }
  ...
}

OpenACC for GPUs in a Nutshell

...
#pragma acc data copy(b[0:n][0:m]) create(a[0:n][0:m])
{
  for (iter = 1; iter <= p; ++iter){
    #pragma acc parallel loop
    for (i = 1; i < n-1; ++i){
      for (j = 1; j < m-1; ++j){
        a[i][j] = w0*b[i][j] +
                  w1*(b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]) +
                  w2*(b[i-1][j-1] + b[i-1][j+1] + b[i+1][j-1] + b[i+1][j+1]);
      }
    }
    #pragma acc parallel loop
    for( i = 1; i < n-1; ++i )
      for( j = 1; j < m-1; ++j )
        b[i][j] = a[i][j];
  }
}
...

(Figure: b is copied between system memory and GPU memory at the data region boundaries; a is created in GPU memory only.)

OpenACC is for Multicore, Manycore & GPUs

 98 !$acc parallel
 99 !$acc loop independent
100 do k=y_min-depth,y_max+depth
101   !$acc loop independent
102   do j=1,depth
103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
104   enddo
105 enddo
106 !$acc end parallel

Multicore CPU:
% pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
...
100, Loop is parallelizable
     Generating Multicore code
     100, !$acc loop gang
102, Loop is parallelizable

Tesla GPU:
% pgfortran -ta=tesla -fast -Minfo=acc -c update_tile_halo_kernel.f90
...
100, Loop is parallelizable
102, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
     100, !$acc loop gang, vector(4)  ! blockidx%y threadidx%y
     102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

SPEC ACCEL 1.2 BENCHMARKS

(Charts: SPEC ACCEL 1.2 geometric-mean run time in seconds, lower is better. OpenACC with PGI 18.1: 2-socket Broadwell vs. 1x Volta V100, about a 4.4x speed-up on the GPU. OpenMP 4.5 with Intel 2018 and PGI 18.1: 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), 2-socket Broadwell (40 cores / 80 threads).)

Performance measured February, 2018. Skylake: two 20-core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: two 24-core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: two 20-core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. Volta: NVIDIA DGX1 system with two 20-core Intel Xeon E5-2698 v4 CPUs @ 2.20GHz, 256GB memory, one NVIDIA Tesla V100-SXM2-16GB GPU @ 1.53GHz. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).
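To tie the single-source message of the preceding slides to something runnable, here is a small OpenACC Fortran sketch. It is not taken from the deck; the program and variable names are invented for illustration, and it assumes the pgfortran flags shown above (-ta=multicore or -ta=tesla).

program vecadd
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: a(:), b(:), c(:)

  allocate(a(n), b(n), c(n))
  ! Initialize inputs on the host
  do i = 1, n
    a(i) = real(i)
    b(i) = 2.0 * real(i)
  enddo

  ! One directive covers both targets: on a GPU the copyin/copyout clauses
  ! move the arrays; on a shared-memory multicore target they are effectively no-ops.
  !$acc parallel loop copyin(a,b) copyout(c)
  do i = 1, n
    c(i) = a(i) + b(i)
  enddo

  print *, 'c(n) =', c(n)   ! expect 3.0 * n
  deallocate(a, b, c)
end program vecadd

Compiling this with pgfortran -ta=multicore -Minfo=acc or -ta=tesla -Minfo=acc should produce compiler feedback similar to the halo-exchange example above.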
OPENACC APPLICATIONS

Parallelization Strategy

Within Gaussian 16, GPUs are used for a small fraction of code that consumes a large fraction of the execution time. The implementation of GPU parallelism conforms to Gaussian's general parallelization strategy. Its main tenets are to avoid changing the underlying source code and to avoid modifications which negatively affect CPU performance. For these reasons, OpenACC was used for GPU parallelization.

The Gaussian approach to parallelization relies on environment-specific parallelization frameworks and tools: OpenMP for shared memory, Linda for cluster and network parallelization across discrete nodes, and OpenACC for GPUs.

The process of implementing GPU support involved many different aspects:
- Identifying places where GPUs could be beneficial. These are a subset of areas which are parallelized for other execution contexts, because using GPUs requires fine-grained parallelism.
- Understanding and optimizing data movement/storage at a high level to maximize ...

PGI Accelerator Compilers with OpenACC

PGI compilers fully support the current OpenACC standard as well as important extensions to it. PGI is an important contributor to the ongoing development of OpenACC. OpenACC enables developers to implement GPU parallelism by adding compiler directives to their source code, often eliminating the need for rewriting or restructuring. For example, the following Fortran compiler directive identifies a loop which the compiler should parallelize:

!$acc parallel loop

Other directives allocate GPU memory, copy data to/from GPUs, specify data to remain on the GPU, combine or split loops and other code sections, and generally provide hints for optimal work distribution management, and more (a short sketch appears at the end of this section).

The OpenACC project is very active, and the specifications and tools are changing fairly rapidly. This has been true throughout the lifetime of this project. Indeed, one of its major challenges has been using OpenACC in the midst of its development. The talented people at PGI were instrumental in addressing issues that arose in one of the very first uses of OpenACC for a large commercial software package.
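To make the directive description above concrete, here is a small sketch. It is not taken from Gaussian; the routine and array names are invented for illustration. It combines the quoted !$acc parallel loop with a data region that allocates GPU memory and keeps data resident on the GPU.

subroutine scale_and_accumulate(x, y, alpha, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in)    :: alpha, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i

  ! The data region allocates GPU copies of x and y: copyin moves x to the
  ! GPU, copy moves y in and back out when the region ends, and both stay
  ! resident on the GPU for any enclosed compute regions.
  !$acc data copyin(x) copy(y)

  ! The directive quoted above: the compiler parallelizes this loop.
  !$acc parallel loop
  do i = 1, n
    y(i) = y(i) + alpha * x(i)
  enddo

  !$acc end data
end subroutine scale_and_accumulate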