
CUDA

GPU

K. Cooper

Department of Mathematics, Washington State University

2020

Review of Parallel Paradigms

MIMD Computing
Multiple Instruction – Multiple Data
– Several separate program streams, each executing possibly different sets of instructions
– Each instruction stream operates on different data – each instruction stream may have access to only a fragment of the data
– MPI
– Python parallel package (see the sketch below)
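As a loose illustration (mine, not from the slides): Python's multiprocessing package gives MIMD-style parallelism – each worker process runs its own instruction stream on its own fragment of the data.

import multiprocessing as mp

def work(chunk):
    # each process runs independently on its own data fragment
    return sum(x*x for x in chunk)

if __name__ == '__main__':
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # split the data four ways
    with mp.Pool(4) as pool:                  # four separate processes
        parts = pool.map(work, chunks)
    print(sum(parts))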

SIMD Computing
Single Instruction – Multiple Data
– Only one program stream, though that may launch multiple threads
– The instruction stream may be applied simultaneously to many different data elements.
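NumPy's vectorized arithmetic is an everyday taste of this style (illustration mine): one operation, applied to every element at once.

import numpy as np

y = np.ones(1_000_000)
y = (1. - 0.01)*y   # one instruction, a million data elements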

Advantages of MIMD
Advantages
• Instructions can be wildly different for individual streams
• Instructions can be separated – even sent to different nodes
• Memory is distributed – capacity is limited only by the number of nodes
• Nodes can be unsophisticated – cheap
Disadvantages
• Communication between nodes is expensive
• Really at its best only for simulations

Advantages of SIMD
Disadvantages
• All computations must happen on a single machine – limited memory and processors
• Hardware might be very complex, and therefore expensive
Advantages
• All computations happen on a single machine – fast

SIMT
Historically, SIMD computing involved vastly complex CPUs with many ALUs and complicated architectures. This is, in some sense, a description of a modern GPU. Ever since SGI, video cards have had small specialized processors designed for the arithmetic involved in 3-d projections.
Single Instruction – Multiple Threads: we start one program, and that program can launch many threads to perform small tasks in parallel on the GPU.

CUDA Computing

NVidia
– The company that really drives this is NVidia
– Makes video cards for 3-d games
– Provides an API for sending instructions to the card:
  CUDA – Compute Unified Device Architecture
– AMD/ATI is playing too, but uses a different API

Model
1. Start one program
2. Write function(s) to handle the core of the computation in parallel – the kernel
3. Allocate memory on the CPU – and also on the video card
4. Copy data from the CPU to the video card
5. Run the kernel on the card
6. Copy data back from the card to the CPU
(see the sketch below)
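A minimal sketch of those six steps using Numba's CUDA interface; the doubling kernel and the array size here are placeholders of my own, not part of the slides.

from numba import cuda
import numpy as np

@cuda.jit
def double(x, out):                        # step 2: the kernel
    i = cuda.grid(1)                       # this thread's global index
    if i < x.size:
        out[i] = 2.0*x[i]

x = np.arange(1024, dtype=np.float64)      # step 3: memory on the CPU
d_x = cuda.to_device(x)                    # step 4: copy to the card
d_out = cuda.device_array_like(d_x)        # step 3: memory on the card
threads = 256
blocks = (x.size + threads - 1) // threads
double[blocks, threads](d_x, d_out)        # step 5: run the kernel
out = d_out.copy_to_host()                 # step 6: copy back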

Automation
This has been around long enough that people have developed frameworks to handle much of the complication.
Numba – from Anaconda
• CUDA for Conda...
• translates your code into CUDA for you
• makes CUDA code from simple scalar code easily – see the sketch below
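For instance (a sketch of my own, assuming a CUDA-capable card): Numba's @vectorize decorator takes a plain scalar function and compiles it into a CUDA kernel that is broadcast over whole arrays.

from numba import vectorize
import numpy as np

@vectorize(['float64(float64, float64)'], target='cuda')
def add(a, b):
    # ordinary scalar code; Numba turns it into a GPU kernel
    return a + b

c = add(np.ones(10), 2.0*np.ones(10))   # runs element-by-element on the card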

CUDA

1. Must install NVidia drivers for your video card.
2. Must install the CUDA libraries – this depends on your OS.
3. We have all this on three public machines, not listed here.
4. nvidia-smi gives information about the card and driver.
5. Demo...

Numba
To install on your own machine, note that Numba is available in many repositories now.

pip3 install numba

or

conda install numba
conda install cudatoolkit[=10.1]

CUDA communicates with the firmware in the card, so it is important to install a toolkit version that works with your card. As of this writing, CUDA 10.2 is the most recent version.

Example
Consider the simplest collection of ODEs:

y′ = −y,   y(0) = y0,

where y is a vector.

We can hit this with Euler’s method: with step size h, each step replaces y by y − h·y = (1 − h)·y.

def euler(y0, T, h):
    nSteps = int(T/h)        # number of Euler steps to reach time T
    for i in range(nSteps):
        y0 = (1. - h)*y0     # one Euler step: y <- (1 - h)*y
    return y0
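The “vectorized” rows in the table below presumably come from compiling the scalar update with Numba and switching the target; here is a sketch of that idea (mine, not the original benchmark code):

from numba import vectorize
import numpy as np

# target may be 'cpu', 'parallel', or 'cuda' – the options timed below
@vectorize(['float64(float64, float64)'], target='cuda')
def euler_step(y, h):
    return (1. - h)*y

def euler_vec(y0, T, h):
    nSteps = int(T/h)
    for i in range(nSteps):
        y0 = euler_step(y0, h)
    return y0

y = euler_vec(np.ones(1_000_000), 1.0, 1e-4)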

Results

Method                          Time (sec)
Scalar                               15.0
Vectorized w/ cpu option             27.2
Vectorized w/ parallel option         6.9
Vectorized w/ cuda option             2.3