
Intro to GPGPU

Dr. Chokchai (Box) Leangsuksun, PhD, Louisiana Tech University, Ruston, LA

CPU vs. GPU

• CPU
  – Fast caches
  – Branching adaptability
  – High performance
• GPU
  – Multiple ALUs
  – Fast onboard memory
  – High throughput on parallel tasks
  – Executes a program on each fragment/vertex

• CPUs are great for task parallelism
• GPUs are great for data parallelism


CPU vs. GPU - Hardware

• More transistors devoted to data processing

CUDA programming guide 3.1

CPU vs. GPU – Computation

CUDA programming guide 3.1


CPU vs. GPU – Memory Bandwidth

CUDA programming guide 3.1

What is GPGPU?

• General-purpose computation using a GPU in applications other than 3D graphics
  – GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Why GPGPU?

• Large number of cores
  – 100-1000 cores in a single card
• Low cost
  – Less than $100-$1500
• Green: low power consumption
  – ~135 watts/card
  – 135 W vs. 30,000 W (300 watts * 100)
• One card can perform the work of > 100 desktops
  – $750 vs. $50,000 ($500 * 100)

Two major players


Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• Tesla T10 1070: from 1-4 TFLOPS
• AMD/ATI 5970 x2: 3200 cores
• NVIDIA Tegra is an all-in-one (system-on-a-chip) architecture derived from the ARM family
• GPU parallelism is better than Moore's Law, more than doubling every year
• GPGPU is a GPU that allows the user to run both graphics and non-graphics applications

(Pictured: ATI 4850, GeForce 8800)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Requirements of a GPU system

• GPGPU is a GPU that allows the user to process both graphics and non-graphics applications.

• GPGPU-capable graphics card
• Power supply
• Cooling
• PCI-Express 16x

(Pictured: Tesla D870, GeForce 8800)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Examples of GPU devices

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

NVIDIA GeForce 8800 (G80)

• The eighth generation of NVIDIA's GeForce graphics cards
• High-performance CUDA-enabled GPGPU
• 128 cores
• Memory: 256-768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth (86.4 GB/s)
• Supports Scalable Link Interface (SLI)


NVIDIA GeForce GTX 295 (GT200)

• The tenth generation of NVIDIA's GeForce graphics cards
• The second generation of the CUDA architecture
• Dual-GPU card
• 480 cores (240 per GPU)
• 1242 MHz processor clock speed
• Memory: 1792 MB (896 MB per GPU)
• 223.8 GB/s memory bandwidth (2 memory interfaces)
• Supports Quad Scalable Link Interface (SLI)

NVIDIA GeForce GTX 480 (Fermi)

• The eleventh generation of NVIDIA's GeForce graphics cards
• The third generation of the CUDA architecture
• 480 cores
• 1401 MHz processor clock speed
• Memory: 1536 MB
• 177.4 GB/s memory bandwidth
• Supports 2-way/3-way Scalable Link Interface (SLI)


NVIDIA Tesla™

• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multi-threaded computing
  – Supercomputing performance
  – Large memory capacity, up to 6 GB in Tesla M2070

NVIDIA Tesla Card >>

• Tesla 10:
  – C-Series (card) = 1 GPU with 1.5 GB
  – D-Series (deskside unit) = 2 GPUs
  – S-Series (1U server) = 4 GPUs
• Tesla 20 (Fermi architecture) = 1 GPU with 3 GB or 6 GB
• Note: 1 G80 GPU = 128 cores = ~500 GFLOPS
• 1 T10 = 240 cores = ~1 TFLOPS

<< NVIDIA G80


NVIDIA Fermi (Tesla series 20)

"I believe history will record Fermi as a significant milestone." – Dave Patterson

• 512 cores (16 SMs * 32 cores)
• 8X faster peak DP floating point calculation
• 520-630 GFLOPS DP
• 3 GB GDDR5 for Tesla 2050
• 6 GB GDDR5 for Tesla 2070
• ECC
• L1 and L2 caches
• Concurrent kernel execution (up to 16 kernels)
• IEEE 754-2008 and FMA (Fused Multiply-Add)

NVIDIA Fermi Architecture

NVIDIA's Fermi white paper


3rd Generation SM Architecture

• 32 cores, 16 load/store units, and 4 Special Function Units
• Configurable 64 KB memory: 16 KB shared memory and 48 KB L1 cache, or 48 KB shared memory and 16 KB L1 cache
• Dual warp scheduler
• Each CUDA core contains one floating point unit and one integer ALU, with double-precision (DP) support

• 8X faster in double-precision operations than GT200

NVIDIA's Fermi white paper

Memory Hierarchy

Each thread in a block can access the shared memory and the L1 cache; each block has access to the L2 cache and global memory.

NVIDIA's Fermi white paper


Dual Warp Scheduler >>

<< Concurrent Kernel Execution

NVIDIA's Fermi white paper

Fermi Products

            GTX460           GTX465     GTX470     GTX480     Tesla 2050        Tesla 2070
Cores       336              352        448        480        448               448
Clock       1350 MHz         1215 MHz   1215 MHz   1401 MHz   SP: 1.05 TFLOPS   SP: 1.05 TFLOPS
                                                              DP: 515 GFLOPS    DP: 515 GFLOPS
Memory      768 MB or 1 GB   1 GB       1280 MB    1.5 GB     3 GB              6 GB
Bandwidth   86.4 or 115.2    102.6      133.9      177.4      148               148      (GB/s)
Power       160 W            200 W      215 W      250 W      225 W             225 W
Price       $199-$249        $299       $349       $499       $2,499            $3,999

NVIDIA.com


CUDA Architecture Generations


Nvidia's Fermi white paper

Fermi vs. GT200

Each SM in the Fermi architecture can perform 16 FMA (Fused Multiply-Add) double-precision operations per clock cycle.

Nvidia's Fermi white paper


Nvidia's Fermi white paper

This slide is from the NVIDIA CUDA tutorial.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


ATI Stream (1)


ATI 4870


ATI 4870 X2


ATI Radeon™ HD 5870

Transistors: 2.15 billion (40 nm)
Stream Cores: 1600
Clock speed: 850 MHz
SP Compute Power: 2.72 TeraFLOPS
DP Compute Power: 544 GigaFLOPS
Memory Type: GDDR5, 4.8 Gbps
Memory Capacity: 1 GB
Memory Bandwidth: 153.6 GB/sec
Board Power: 188 W max / 27 W idle

AMD.com


ATI Radeon™ HD 5970

Transistors: 4.3 billion (40 nm)
Stream Cores: 3200 (2 GPUs)
Clock speed: 725 MHz
SP Compute Power: 4.64 TeraFLOPS
DP Compute Power: 928 GigaFLOPS
Memory Type: GDDR5, 4.0 Gbps
Memory Capacity: 2-4 GB
Memory Bandwidth: 256.0 GB/sec
Board Power: 294 W max / 51 W idle

AMD.com

Architecture of ATI Radeon 4000 series


This slide is from an ATI presentation.


What about Intel?

Intel Larrabee

• A hybrid between a multi-core CPU and a GPU
• Its coherent cache hierarchy and x86 architecture compatibility are CPU-like
• Its wide SIMD vector units and texture sampling hardware are GPU-like


Months after ISC'09, Intel canceled the Larrabee project. At ISC'10 they announced a new project, code-named "Knights Ferry", using an architecture similar to Larrabee called MIC.

Intel Knights Ferry (MIC Architecture)

• 22 nm technology
• 32 cores at 1.2 GHz (MIC is up to 50 cores)
• 128 threads, at 4 threads/core
• 8 MB shared coherent cache
• 1-2 GB GDDR5
• Intel HPC tools

This slide's information is from the ISC'10 Skaugen keynote.


MIC Architecture (Many Integrated Core)

This slide's information is from the ISC'10 Skaugen keynote.

Intel Knights Ferry vs. NVIDIA Fermi

                                Intel MIC        NVIDIA Fermi
MIMD Parallelism                32               32 (28)
SIMD Parallelism                16               16
Instruction-Level Parallelism   2                1
Thread Granularity              coarse           fine
Multithreading                  4                24
Clock                           1.2 GHz          1.1 GHz
L1 cache/processor              32 KB            64 KB
L2 cache/processor              256 KB           24 KB
Programming model               posix threads    CUDA kernels
                                yes              no
Memory shared with host         no               no
Hardware parallelism support    no               yes
Mature tools                    yes              yes

This information is from the article "Compilers and More: Knights Ferry versus Fermi" by Michael Wolfe, The Portland Group, Inc.


Introduction to OpenCL

Toward a new approach in computing

Introduction to OpenCL

• OpenCL stands for Open Computing Language.
• It came from a consortium effort including Apple, NVIDIA, AMD, etc.
• It is managed by the Khronos Group, which was also responsible for OpenGL.
• It took 6 months to come up with the specification.


OpenCL

1. Royalty-free.
2. Supports both task-parallel and data-parallel programming modes.
3. Works with vendor-agnostic GPGPUs,
4. including multi-core CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language (C99).


OpenCL Platform Model

Basic OpenCL program Structure

1. OpenCL Kernel

2. Host program containing:
   a. Device context
   b. Command queue
   c. Memory objects
   d. OpenCL program
   e. Kernel memory arguments
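A minimal host-side sketch of how these components map onto the OpenCL C API is shown below; the kernel name my_kernel, the buffer size n, and the omission of all error checking are illustrative assumptions, not part of the original slides.

#include <CL/cl.h>

/* Sketch: one device, one buffer, one kernel named "my_kernel" (hypothetical). */
void run_opencl_kernel(const char *src, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* a. device context */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* b. command queue */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
    /* c. memory objects */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    /* d. OpenCL program, built from the kernel source string */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    /* e. kernel and its memory arguments */
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    size_t global_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clFinish(queue);
}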


CPUs+GPU platforms


Performance of GPGPU

Note: a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS.


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


CUDA

• "Compute Unified Device Architecture"
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute (graphics-free API)
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

An Example of Physical Reality Behind CUDA CPU (host) GPU w/ local DRAM (device)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application and thread parallelism

(Pictured: GeForce 8800, Tesla D870, Tesla S870)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU host

(Block diagram: Host feeds the Input Assembler and Thread Execution Manager; parallel data caches with texture units and load/store units connect to a shared Global Memory)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Introduction to CUDA programming

These materials are excerpted from David Kirk/NVIDIA and Wen-mei W. Hwu's, and Christian Trefftz / Greg Wolffe's, SC08 GPU tutorials.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Data-parallel Programming

• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel

• Keep it busy! → massive threading
• Keep your data close! → local memory


Pixel / Thread Processing


Steps for CUDA Programming

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (call a __global__ function)
5. Copy data from device memory (retrieve results)
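A minimal CUDA sketch of these five steps is shown below; the incrementKernel example, the array size N, and the block size of 256 are illustrative assumptions, and error checking is omitted.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int N = 1024;
    size_t size = N * sizeof(float);
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    cudaSetDevice(0);                                            // 1. device initialization
    float *d_data;
    cudaMalloc((void**)&d_data, size);                           // 2. device memory allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);    // 3. copy data to device memory
    incrementKernel<<<(N + 255) / 256, 256>>>(d_data, N);        // 4. execute kernel
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);    // 5. copy results back (retrieve)
    cudaFree(d_data);

    printf("h_data[0] = %f\n", h_data[0]);                       // expect 1.0
    return 0;
}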


Initially:

(Figure: the Host's Memory holds array; the GPU Card's Memory is still empty)


Allocate Memory in the GPU card

(Figure: array in the Host's Memory; array_d allocated in the GPU Card's Memory)


Copy content from the host’s memory to the GPU card memory

(Figure: contents of array copied from the Host's Memory to array_d in the GPU Card's Memory)


Execute code on the GPU

(Figure: kernel code runs on the GPU multiprocessors, operating on array_d in the GPU Card's Memory while array stays in the Host's Memory)


Copy results back to the host memory

(Figure: results copied from array_d in the GPU Card's Memory back to array in the Host's Memory)


Steps for CUDA Programming

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (call a __global__ function)
5. Copy data from device memory (retrieve results)


Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
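The main() above passes A, B, and C straight to the kernel, so they must already be device pointers; a hedged sketch of the missing host-side setup (N, h_A, h_B, and h_C are illustrative names not on the slide) might look like this:

#define N 256

int main()
{
    size_t size = N * sizeof(float);
    float h_A[N], h_B[N], h_C[N];                 // host copies of the data
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    float *A, *B, *C;                             // device pointers passed to the kernel
    cudaMalloc((void**)&A, size);
    cudaMalloc((void**)&B, size);
    cudaMalloc((void**)&C, size);
    cudaMemcpy(A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B, h_B, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(A, B, C);                    // one block of N threads, as on the slide

    cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}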


Extended C

• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {

    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads()
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes)

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Initialize Device calls

• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.

• Note: cudaSetDevice() must be called before any __global__ function call; otherwise device 0 is automatically selected.


CUDA Language concept

• CUDA Programming Model • CUDA Memory Model

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Some Terminology

• device = GPU = set of multiprocessors
• multiprocessor = set of processors & shared memory
• kernel = GPU program
• grid = array of thread blocks that execute a kernel
• thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

(Figure: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks such as Block (1, 1), and each block is a 2D array of threads such as Thread (2, 1))

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign. Courtesy: NVIDIA


What are those blockIds and threadIds?

• blockIdx.x is a built-in variable in CUDA that returns the block ID in the x axis of the block that is executing this block of code.
• threadIdx.x is another built-in variable that returns the thread ID in the x axis of the thread that is being executed by this stream processor.
• Example code in the kernel:
    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;
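Wrapped into a complete (hypothetical) kernel, those example lines look like this; block_d and thread_d are assumed to be device arrays with one slot per thread:

#define BLOCK_SIZE 4

__global__ void whoAmI(int *block_d, int *thread_d)
{
    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;   // global index of this thread's element
    block_d[x]  = blockIdx.x;                        // which block handled element x
    thread_d[x] = threadIdx.x;                       // which thread within that block
}

// e.g. whoAmI<<<2, BLOCK_SIZE>>>(block_d, thread_d); gives the two-block layout pictured on the next slide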


In the GPU:

(Figure: the array elements are mapped onto the processing elements as threads 0-3 of Block 0 and threads 0-3 of Block 1)


CUDA Device Memory Model Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories

(Figure: the device grid contains blocks, each with shared memory, per-thread registers, and per-thread local memory; the host connects to global, constant, and texture memory)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Global, Constant, and Texture Memories (Long Latency Accesses)

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by the host
  – Contents visible to all threads

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign. Courtesy: NVIDIA


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates object in the device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees object from the device Global Memory
    • Parameter: pointer to the freed object

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
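A short sketch of the two calls, reusing the WIDTH x WIDTH float matrix that appears later in this deck (Md and size are placeholder names):

float *Md;
int size = WIDTH * WIDTH * sizeof(float);   // assumes WIDTH is defined, as in the matrix example
cudaMalloc((void**)&Md, size);              // parameters: address of the pointer, size in bytes
cudaFree(Md);                               // parameter: pointer to the object being freed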


CUDA Host-Device Data Transfer

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to source
    • Pointer to destination
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
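A short sketch of the call, continuing the Md example above (M and P stand for host arrays; in the actual API the argument order is destination, source, byte count, transfer type):

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(P, Md, size, cudaMemcpyDeviceToHost);   // device -> host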

CUDA Function Declarations

                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void  KernelFunc()    device             host
__host__   float HostFunc()      host               host

• __global__ defines a kernel function
  – Must return void

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
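A small illustrative sketch of the three qualifiers (the function names and bodies are made up for this example):

__device__ float DeviceFunc(float x) { return x * x; }      // runs on the device, callable only from device code

__global__ void KernelFunc(float *out)                       // runs on the device, launched from the host
{
    out[threadIdx.x] = DeviceFunc((float)threadIdx.x);
}

__host__ float HostFunc(float x) { return x + 1.0f; }        // ordinary CPU function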


Language Extensions: Variable Type Qualifiers

                                           Memory     Scope    Lifetime
__device__ __local__    int LocalVar;      local      thread   thread
__device__ __shared__   int SharedVar;     shared     block    block
__device__              int GlobalVar;     global     grid     application
__device__ __constant__ int ConstantVar;   constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
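A sketch of how the qualifiers might appear in practice (names and sizes are illustrative; the kernel assumes a launch with 64 threads per block):

__constant__ float coeff[16];        // constant memory: read-only to the grid, set by the host
__device__ int globalCounter;        // global memory: lives for the whole application

__global__ void demoKernel(float *out)
{
    __shared__ float tile[64];       // shared memory: one copy per block, lives as long as the block
    int i = threadIdx.x;             // automatic variable: per-thread register/local storage
    tile[i] = coeff[i % 16];
    __syncthreads();
    out[i] = tile[i];
}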

Access Times

• Register – dedicated HW – single cycle
• Shared Memory – dedicated HW – single cycle
• Local Memory – DRAM, no cache – *slow*
• Global Memory – DRAM, no cache – *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


CUDA function calls restrictions

• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
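A brief sketch of the kind of device function these rules apply to (names are illustrative); the helper is called from a kernel and avoids recursion, static locals, and variable argument lists:

__device__ float square(float x) { return x * x; }   // its address cannot be taken on the host

__global__ void applySquare(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}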

Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);             // 5000 thread blocks
dim3 DimBlock(4, 8, 8);            // 256 threads per block
size_t SharedMemBytes = 64;        // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synch needed for blocking

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
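Since the launch returns immediately, a host thread that needs the results must synchronize explicitly; a minimal sketch (cudaThreadSynchronize() is the call from the CUDA toolkits of this era; later toolkits renamed it cudaDeviceSynchronize()):

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);   // asynchronous launch
cudaThreadSynchronize();                                    // blocks the host until the kernel finishes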


Resources on line

• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html#
• http://www.nvidia.com/object/cuda_learn.html


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Demo

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


1. Initialize Device

• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.

• Note: cudaSetDevice() must be called before any __global__ function call; otherwise device 0 is automatically selected.

Example: Device Initialization

int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
cudaDeviceProp deviceProp;

for (device = 0; device < deviceCount; device++) {
    …
    cudaGetDeviceProperties(&deviceProp, device);
    …
}


A Simple Running Example: Matrix Multiplication

• A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  – Leave shared memory usage until later
  – Local, register usage
  – Thread ID usage
  – Memory data transfer API between host and device

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Programming Model: Square Matrix Multiplication Example

• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread handles one element of P
  – M and N are loaded WIDTH times from global memory

(Figure: matrices M, N, and P, each WIDTH x WIDTH)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 1: Matrix Data Transfers

// Allocate the device memory where we will copy M to
Matrix Md;
Md.width = WIDTH;
Md.height = WIDTH;
Md.pitch = WIDTH;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md.elements, size);

// Copy M from the host to the device
cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

// Read M from the device to the host into P
cudaMemcpy(P.elements, Md.elements, size, cudaMemcpyDeviceToHost);
...
// Free device memory
cudaFree(Md.elements);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 2: Matrix Multiplication
A Simple Host Code in C

// Matrix multiplication on the (CPU) host in double precision
// for simplicity, we will assume that all dimensions are equal
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
{
    for (int i = 0; i < M.height; ++i)
        for (int j = 0; j < N.width; ++j) {
            double sum = 0;
            for (int k = 0; k < M.width; ++k) {
                double a = M.elements[i * M.width + k];
                double b = N.elements[k * N.width + j];
                sum += a * b;
            }
            P.elements[i * N.width + j] = sum;
        }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Multiply Using One Thread Block

• One block of threads computes matrix P
  – Each thread computes one element of P
• Each thread
  – Loads a row of matrix M
  – Loads a column of matrix N
  – Performs one multiply and one addition for each pair of M and N elements
  – Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

(Figure: Grid 1 contains Block 1; Thread (2, 2) computes one element of P from a row of M and a column of N)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 3: Matrix Multiplication Host-side Main Program Code

int main(void)
{
    // Allocate and initialize the matrices
    Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

    // M * N on the device
    MatrixMulOnDevice(M, N, P);

    // Free matrices
    FreeMatrix(M);
    FreeMatrix(N);
    FreeMatrix(P);
    return 0;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

See the demo code

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 3: Matrix Multiplication Host-side code

// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Load M and N to the device
    Matrix Md = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Md, M);
    Matrix Nd = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Nd, N);

    // Allocate P on the device
    Matrix Pd = AllocateDeviceMatrix(P);
    CopyToDeviceMatrix(Pd, P); // Clear memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 3: Matrix Multiplication Host-side Code (cont.)

    // Setup the execution configuration
    dim3 dimBlock(WIDTH, WIDTH);
    dim3 dimGrid(1, 1);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

    // Read P from the device
    CopyFromDeviceMatrix(P, Pd);

    // Free device matrices
    FreeDeviceMatrix(Md);
    FreeDeviceMatrix(Nd);
    FreeDeviceMatrix(Pd);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 4: Matrix Multiplication Device-side Kernel Function

// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // 2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 4: Matrix Multiplication Device-Side Kernel Function (cont.)

    for (int k = 0; k < M.width; ++k) {
        float Melement = M.elements[ty * M.pitch + k];
        float Nelement = N.elements[k * N.pitch + tx];
        Pvalue += Melement * Nelement;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    P.elements[ty * P.pitch + tx] = Pvalue;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 5: Some Loose Ends

// Allocate a device matrix of same size as M.
Matrix AllocateDeviceMatrix(const Matrix M)
{
    Matrix Mdevice = M;
    int size = M.width * M.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}

// Free a device matrix.
void FreeDeviceMatrix(Matrix M)
{
    cudaFree(M.elements);
}

void FreeMatrix(Matrix M)
{
    free(M.elements);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 5: Some Loose Ends (cont.)

// Copy a host matrix to a device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
    int size = Mhost.width * Mhost.height * sizeof(float);
    cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
}

// Copy a device matrix to a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 6: Handling Arbitrary Sized Square Matrices

• Have each 2D thread block compute a (BLOCK_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (BLOCK_WIDTH)² threads
• Generate a 2D grid of (WIDTH/BLOCK_WIDTH)² blocks

Note: you still need to put a loop around the kernel call for cases where WIDTH is greater than the max grid size!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
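A hedged sketch (not from the original slides) of how the kernel indexing changes once a 2D grid of blocks is used; it reuses the Matrix struct from the earlier steps and adds a bounds check for widths that are not a multiple of BLOCK_WIDTH:

__global__ void MatrixMulKernel2D(Matrix M, Matrix N, Matrix P, int Width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row of the P element
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column of the P element
    if (row < Width && col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += M.elements[row * M.pitch + k] * N.elements[k * N.pitch + col];
        P.elements[row * P.pitch + col] = Pvalue;
    }
}

// Example launch:
//   dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
//   dim3 dimGrid((WIDTH + BLOCK_WIDTH - 1) / BLOCK_WIDTH,
//                (WIDTH + BLOCK_WIDTH - 1) / BLOCK_WIDTH);
//   MatrixMulKernel2D<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WIDTH);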
