An Introduction to GPU Computing

iVEC Training, 5th - 9th November 2012

Introducing the Historical GPU

Graphics Processing Unit (GPU) n : A specialised electronic circuit designed to rapidly manipulate and alter memory in such a way as to accelerate the building of images in a frame buffer intended for output to a display.

Introducing the Modern GPU

Graphics Processing Unit (GPU) n : A general purpose electronic circuit designed to rapidly manipulate and alter memory in such a way as to accelerate computational algorithms that have fine-grained parallelism.

GPU Computing Motivation

But what about...

Central Processing Unit (CPU) n : The portion of a computer system that carries out the instructions of a computer program, performing the basic arithmetical, logical, and input/output operations of the system.

GPU Computing Motivation : Arithmetic Performance

GPU Computing Motivation

How is that possible?
- GPUs have less cache and logic, more arithmetic
- GPUs execute lots of simple threads
- GPUs are physically bigger

[Die photos] Nehalem die (263 mm²) vs Fermi die (530 mm²)

GPU Computing Motivation : Power Efficiency

GPU Computing Motivation : Memory Bandwidth

GPU Computing Motivation : Summary

Pros :
- high arithmetic performance
- high memory bandwidth
- power efficient computation

Cons :
- separated from the CPU by the PCI Express bus
- requires fine-grained parallelism
- requires high arithmetic intensity
- requires lightweight threads
- requires additional software development

For the right algorithm, additional time spent on code development can yield significant performance gains.

GPU Computing Motivation : Further Considerations

CPUs are steadily becoming more “GPU-like”
- increasing number of cores
- increasing vector parallelism

GPUs are steadily becoming more “CPU-like”
- increased caching
- ability to run multiple kernels at once

Heterogeneous processors
- combined CPU and GPU cores
- no longer using PCI Express

What will future architectures look like?

How well will your code adapt to them?

Pawsey Phase 1B : Fornax Supercomputer

Project timeline (Phase 1A, Phase 1B, Building, Phase 2):
- March 2010 : Requirement Gathering
- August 2010 : Specification Development
- March 2011 : CSIRO Tender
- September 2011 : SGI Selected
- April 2012 : System Online

- 96 compute nodes + 4 development nodes
- 2 hex-core Intel Xeon X5650 CPUs per blade (2304 cores)
- 72GB RAM per node (~7TB total)
- single Tesla C2075 GPU per node
- dual Infiniband QDR 32Gbps interconnect
- 700TB local storage, 500TB scratch, 10Gbps external connection
- ~100TF GPU performance
- located at iVEC@UWA

Fornax: System Architecture Overview

[Diagram] Node roles:
- Head Node : fornax-hn (monitoring at http://fornax-monitor.ivec.org/gweb)
- VM Nodes : fornax-xen (ldap, license, monitor)
- Login Nodes : login1, login2
- Copy Nodes : io1, io2
- Compute Nodes : f001 - f096
- Development Nodes : f097 - f100
- Lustre Nodes : fmds1, fmds2 and foss1 - foss8
- Networks : Infiniband and 1G Ethernet internally, 10G Ethernet to the external world

Fornax: Compute Node Architecture (X8DAH)

[Diagram] Compute node (X8DAH) layout:
- 2x Intel Xeon X5650 CPUs, each with 36GB of DDR3 system memory (256 Gbps)
- CPUs linked by Intel QuickPath Interconnect (102 Gbps)
- 2x Intel 5520 chipset I/O Hubs (IOH36D)
- 2x QLogic IBA7322 Infiniband QDR adapters on PCIe2 x8 (32 Gbps each)
- NVIDIA Tesla C2075 GPU on PCIe2 x16 (64 Gbps)
- LSI SAS2008 SAS/SATA controller on PCIe2 x4 (32 Gbps), local storage 7x1TB (8.4 Gbps)

Fornax: NVIDIA Fermi GPU Architecture

[Diagram] GPU memory: 5.25 GB of GDDR5 with ECC, 1152 Gbps memory bandwidth; PCIe2 x16 host link (64 Gbps)

Fornax: NVIDIA Fermi GPU Architecture

The Fermi C2075 GPU consists of :
- Host interface
- GigaThread Engine
- 6x 64-bit GDDR5 1566MHz memory controllers
- 768kB L2 Cache
- Four Graphics Processing Clusters (GPC)

Fornax: NVIDIA Fermi GPU Architecture

Graphics Processing Cluster (GPC) :
- Raster Engine : triangle setup, rasterization, z-cull
- Four Streaming Multiprocessors (SM)

Fornax: NVIDIA Fermi GPU Architecture

Streaming Multiprocessor (SM) :
- Instruction Cache
- Warp Schedulers
- Dispatch Unit
- Register File (32k 32-bit registers)
- 32 CUDA Cores (one FP and one INT unit each)
- 16 LD/ST Units
- 4 Special Function Units (SFU)
- Interconnect Network
- 64 KB Shared Memory / L1 Cache (1:3 or 3:1 configurable)
- Uniform Cache
- Texture Units
- Texture Cache
- Polymorph Engine

NVIDIA Tesla C2075 has 14 SMs enabled :
- 448 CUDA cores @ 1.15 GHz
- one Fused Multiply Add (FMA, 2 FLOPs) per core per clock
- 448 cores x 1.15 GHz x 2 FLOPs = 1.030 SP TFLOPS per GPU
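
The shared memory inside each SM is what lets the threads of a block cooperate. As a minimal sketch (illustrative code, not part of the original material), here is a CUDA kernel in which each block of 256 threads sums its chunk of an array through the SM's shared memory:

// blockSum: each block reduces 256 elements to one partial sum.
// Launch as: blockSum<<<numBlocks, 256>>>(in, out, n);
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];           // lives in the SM's 64 KB shared memory / L1
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;   // coalesced load from global memory
    __syncthreads();                      // wait until the whole tile is loaded

    // tree reduction: every active thread executes the same
    // instruction on different data
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];        // one partial sum per block
}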

Fornax: NVIDIA Fermi GPU Architecture

Summary:
- the GPU is a co-processor, separate from the CPU
- the GPU has its own parallel GDDR5 memory
- transfer between host and GPU memory takes time
- the GPU has hundreds of cores, significantly more than a CPU
- these GPU cores are grouped into SMs
- cores within an SM execute the same instructions, but on different data
- cores within an SM can see the same shared memory
- memory access from the SM is efficient if all cores access adjacent locations in GPU memory
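
Since every transfer crosses the PCI Express bus, it is worth measuring. Below is a minimal sketch using the CUDA runtime's event timers (illustrative code, not from the original material; error checking omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 256UL * 1024 * 1024;   // 256 MB test buffer
    float *h = (float *)malloc(bytes);
    float *d;
    memset(h, 0, bytes);
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("256 MB host->device in %.1f ms (%.2f GB/s)\n",
           ms, (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    free(h);
    return 0;
}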

GPU Programming Paradigms

Timeline (not a comprehensive listing):
- Pre-GPGPU Era (Fixed Function Pipeline) : OpenGL (1992), DirectX (1995)
- GPGPU Era (Programmable Pipeline) : Cg, HLSL, GLSL (2003), BrookGPU (2004)
- GPU Computing Era (Programmable Architectures) : CUDA (2006), OpenCL (2008)

Compute Unified Device Architecture (CUDA)

CUDA is a parallel computing platform and programming model, enabling dramatic increases in computational performance by harnessing the power of the graphics processing unit (GPU).

CUDA is created by NVIDIA: http://www.nvidia.com/object/cuda_home_new.html
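
For a flavour of the programming model, here is a minimal CUDA program (an illustrative sketch, not NVIDIA's example code). Every thread adds one pair of elements, which is exactly the fine-grained, lightweight parallelism described earlier:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// One thread per element: the i-th thread computes c[i] = a[i] + b[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // grid of blocks of 256 threads

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);                     // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}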

Open Computing Language (OpenCL)

OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices.

OpenCL is being created by the Khronos Group: http://www.khronos.org/opencl/

Participating companies and institutions:

3DLABS, Activision Blizzard, AMD, Apple, ARM, Broadcom, Codeplay, Electronic Arts, Ericsson, Freescale, Fujitsu, GE, Graphic Remedy, HI, IBM, Intel, Los Alamos National Laboratory, Motorola, Movidius, Nokia, NVIDIA, Petapath, QNX, Qualcomm, RapidMind, Samsung, Seaweed, S3, ST Microelectronics, Takumi, Texas Instruments, Toshiba and Vivante.

Running GPU programs on Fornax

Logging in:
ssh username@fornax.ivec.org
ssh -l username fornax.ivec.org

Beware the DenyHosts intrusion detection daemon!

Repeatedly entering the wrong username or password will get your IP address blacklisted, resulting in the following error:

ssh_exchange_identification: Connection closed by remote host

If this happens, send an email to [email protected] to get your IP address reinstated.

Running GPU programs on Fornax

PBS Script Path and Filename:
/scratch/projectname/username/etc/subQuery

PBS Script Contents:
#!/bin/bash
#PBS -W group_list=projectname
#PBS -q workq
#PBS -l walltime=00:02:00
#PBS -l select=1:ncpus=1:ngpus=1:mem=64gb
#PBS -l place=excl

module load cuda
module load cuda-sdk

cd /scratch/projectname/username/etc
deviceQuery

Running GPU programs on Fornax

Submitting the job:
qsub subQuery

Viewing the queue:
qstat

Viewing the output:
cat subQuery.o#####
cat subQuery.e#####

Running GPU programs on Fornax

deviceQuery.o#####:

deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "Tesla C2075"
  CUDA Driver Version / Runtime Version:          4.1 / 4.1
  CUDA Capability Major/Minor version number:     2.0
  Total amount of global memory:                  5375 MBytes (5636554752 bytes)
  (14) Multiprocessors x (32) CUDA Cores/MP:      448 CUDA Cores
  GPU Clock Speed:                                1.15 GHz
  Memory Clock rate:                              1566.00 Mhz
  Memory Bus Width:                               384-bit
  L2 Cache Size:                                  786432 bytes
  ...
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        49152 bytes
  Total number of registers available per block:  32768
  ...
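
deviceQuery obtains these values through the CUDA runtime API. Here is a minimal sketch of querying the same properties yourself (illustrative code, not the SDK's actual source):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Found %d CUDA Capable device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        printf("Device %d: \"%s\"\n", dev, p.name);
        printf("  CUDA Capability:      %d.%d\n", p.major, p.minor);
        printf("  Total global memory:  %zu bytes\n", p.totalGlobalMem);
        printf("  Multiprocessors:      %d\n", p.multiProcessorCount);
        printf("  GPU clock rate:       %d kHz\n", p.clockRate);
        printf("  L2 cache size:        %d bytes\n", p.l2CacheSize);
    }
    return 0;
}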

Running GPU programs on Fornax

deviceQuery.e#####:

[deviceQuery] starting...
[deviceQuery] test results...
PASSED
> exiting in 3 seconds: 3...2...1...done!
==========================================
Job ID: 4790.fornax
User ID: charris
Group ID: director100
Job Name: subQuery
Session ID: 9656
Resource List: mem=64gb,ncpus=1,partition=1,place=excl,walltime=00:02:00
Resources Used: cpupercent=0,cput=00:00:03,mem=1396kb,ncpus=1,vmem=106288kb,walltime=00:00:06
Queue Name: workq
Account String: null
Exit Code: 0

Common Problems

Missing qsub
[username@login1 ~]$ qsub
-bash: qsub: command not found

solution
module load pbs

Missing cuda_runtime.h (or cl.h)
[username@login1 ~]$ make
gcc -Wall filename.c -o outputname -lcudart
filename.c:#:#: error: cuda_runtime.h: No such file or directory
etc

solution
module load cuda

(or manually provide the path to your software, 'module display cuda' may be helpful for this)

Common Problems

Job sitting in queue forever
[username@login1 ~]$ qstat
Job id          Name   User          Time Use  S  Queue
--------------- ------ ------------- --------- -- ------
3720[].fornax   jobX   otherperson   0         B  workq
4962.fornax     jobY   somebodyelse  49:32:42  R  workq
4967.fornax     jobZ   somebodyelse  0         H  workq
4970.fornax     myjob  username      0         Q  workq

solutions
- double check the PBS directives in your submission script
- wait until the scheduler decides it is your turn to run
- contact [email protected] if necessary

Missing CUDA libraries (or OpenCL libraries)
[username@login1 device]$ cat myjob.e#####
/home/username/path/to/binary: error while loading shared libraries: libcudart.so.4: cannot open shared object file: No such file or directory

solution
module load cuda

Common Problems

No CUDA devices detected
[username@login1 ~]$ ./device
no CUDA-capable device is detected

solutions
- don't try to run on nodes that don't have GPUs (login, io/copyq)
- contact [email protected] if necessary