CUDA
GPU Computing
K. Cooper
Department of Mathematics, Washington State University
2020

Review of Parallel Paradigms
MIMD Computing
Multiple Instruction – Multiple Data
– Several separate program streams, each executing possibly different sets of instructions
– Each instruction stream operates on different data – each instruction stream may have access to only a fragment of the data
– Examples: MPI; Python parallel packages
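The MIMD idea can be sketched in plain Python with the standard `multiprocessing` module standing in for MPI (an illustration, not code from the slides; the `work` function and data split are made up). Each worker process is a separate instruction stream that sees only its own fragment of the data:

```python
from multiprocessing import Pool

def work(chunk):
    # Each process runs its own instruction stream on its own data fragment.
    return sum(x * x for x in chunk)

# Split the data into fragments, one per process.
data = [list(range(i * 4, (i + 1) * 4)) for i in range(3)]
with Pool(3) as p:
    partial = p.map(work, data)   # independent streams, distributed data
total = sum(partial)              # communication step: gather the pieces
# Note: on platforms using the "spawn" start method (Windows, macOS),
# this needs to sit under an `if __name__ == '__main__':` guard.
```

The final `sum` is the communication step the slides list as MIMD's main disadvantage: the partial results must be gathered back together.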
SIMD Computing
Single Instruction – Multiple Data
– Only one program stream, though that may launch multiple threads
– The instruction stream may be applied simultaneously to many different data elements.
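The SIMD idea is familiar from NumPy (used here purely as an illustration): one instruction is applied to every element of an array at once, rather than a Python loop touching one element at a time.

```python
import numpy as np

# One instruction (multiply by 2) applied to many data elements at once.
x = np.arange(5)   # array([0, 1, 2, 3, 4])
y = 2 * x          # single instruction, all elements simultaneously
```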
Advantages of MIMD
Advantages
• Instructions can be wildly different for individual streams
• Instructions can be separated – even to different nodes
• Memory is distributed – becomes limited only by the number of nodes
• Nodes can be unsophisticated – cheap
Disadvantages
• Communication between nodes is slow
• Really at its best only for simulations
Advantages of SIMD
Disadvantages
• All computations must happen on a single machine – limited memory and processors
• Hardware might be very complex, therefore expensive
Advantages
• All computations happen on a single machine – fast
SIMT
Historically, SIMD computing involved vastly complex CPUs with many ALUs and complicated switch architectures. This is, in some sense, a description of a modern video card. Ever since SGI, video cards have had small specialized processors designed for the arithmetic involved in 3-D projections.
Single Instruction – Multiple Thread
We start one program; that program can launch many threads to perform small tasks in parallel on a Graphics Processing Unit (GPU).

CUDA Computing
NVidia
– The company that really drives this is NVidia
– Makes video cards for 3-D games
– Provides an API for programmers to send instructions to the card:
  CUDA – Compute Unified Device Architecture
– AMD/ATI is playing too, but uses a different API
Model
1. Start one program
2. Write function(s) to handle the core of the computation in parallel – the kernel
3. Allocate memory on the CPU – and also on the video card
4. Copy data from the CPU to the video card
5. Run the kernel on the card
6. Copy data back from the card to the CPU
Automation
This has been around long enough that people have developed frameworks to handle much of the complication.
Anaconda – Numba
• CUDA for Conda. . .
• Translates your code into CUDA for you
• Makes CUDA code from simple scalar code easily
CUDA
1. Install the NVidia drivers for your video card.
2. Install the CUDA libraries – the details depend on the OS.
3. We have all this on three public machines, not listed here.
4. nvidia-smi gives information about the card.
5. Demo. . .
Numba
Numba is available in many package repositories now. To install on your own machine:

    pip3 install numba

or

    conda install numba
    conda install cudatoolkit[=10.1]

CUDA communicates with firmware on the card, so it is important to install the right toolkit to work with the card. As of this writing, CUDA 10.2 is the most recent version.
Example
Consider the simplest collection of ODEs:

    y′ = −y,    y(0) = y0,

where y is a vector.
We can hit this with Euler's method.
    def euler(y0, T, h):
        nSteps = int(T/h)
        for i in range(nSteps):
            y0 = (1. - h)*y0
        return y0
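Since each step multiplies by (1 − h), n steps give (1 − h)ⁿ y0, which approaches the exact solution e^(−T) y0 as h shrinks. A quick pure-Python sanity check:

```python
import math

def euler(y0, T, h):
    nSteps = int(T / h)
    for i in range(nSteps):
        y0 = (1. - h) * y0
    return y0

approx = euler(1.0, 1.0, 1.0e-4)   # 10,000 Euler steps
exact = math.exp(-1.0)             # true solution at T = 1 is e^(-1) * y0
# the two agree to roughly h/2 in relative error
```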
Results
    Method                          Time (sec)
    Scalar                          15.0
    Vectorized w/ cpu option        27.2
    Vectorized w/ parallel option    6.9
    Vectorized w/ cuda option        2.3