12/12/11
Introduction to GPGPU
Dr. Chokchai (Box) Leangsuksun, PhD Louisiana Tech University. Ruston, LA
CPU vs. GPU
• CPU
  – Fast caches
  – Branching adaptability
  – High performance
• GPU
  – Multiple ALUs
  – Fast onboard memory
  – High throughput on parallel tasks
    • Executes program on each fragment/vertex
• CPUs are great for task parallelism
• GPUs are great for data parallelism
Supercomputing 2008 Education Program
CPU vs. GPU - Hardware
• More transistors devoted to data processing
CUDA programming guide 3.1
CPU vs. GPU – Computation Power
CUDA programming guide 3.1
CPU vs. GPU – Memory Bandwidth
CUDA programming guide 3.1
What is GPGPU?
• General-purpose computation using a GPU in applications other than 3D graphics
  – GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
Why GPGPU?
• Large number of cores
  – 100-1000 cores in a single card
• Low cost
  – roughly $100 to $1,500 per card
• Green computing
  – Low power consumption: 135 watts/card
  – 135 W vs. 30,000 W (300 watts * 100)
• 1 card can outperform 100 desktops
  – $750 vs. $50,000 ($500 * 100)
Two major players
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• Tesla T10 1070: from 1-4 TFLOPS
• AMD/ATI 5970 x2: 3200 cores
• NVIDIA Tegra is an all-in-one (system-on-a-chip) processor architecture derived from the ARM family
• GPU parallelism is growing faster than Moore’s law, more than doubling every year
• A GPGPU is a GPU that lets the user run both graphics and non-graphics applications.
[Pictured: ATI 4850, GeForce 8800]
Requirements of a GPU system
• A GPGPU is a GPU that lets the user run both graphics and non-graphics applications.
• GPGPU-capable video card
• Power supply
• Cooling
• PCI-Express 16x
[Pictured: Tesla D870, GeForce 8800]
Examples of GPU devices
NVIDIA GeForce 8800 (G80)
• The eighth generation of NVIDIA’s GeForce graphics cards
• High-performance CUDA-enabled GPGPU
• 128 cores
• Memory: 256-768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth (86.4 GB/s)
• Supports Scalable Link Interface (SLI)
NVIDIA GeForce 295 (G200)
• The tenth generation of NVIDIA’s GeForce graphics cards
• The second generation of the CUDA architecture
• Dual-GPU card
• 480 cores (240 per GPU)
• 1242 MHz processor clock speed
• Memory: 1792 MB (896 MB per GPU)
• 223.8 GB/s memory bandwidth (2 memory interfaces)
• Supports Quad Scalable Link Interface (SLI)
NVIDIA GeForce 480 (Fermi)
• The eleventh generation of NVIDIA’s GeForce graphics cards
• The third generation of the CUDA architecture
• 480 cores
• 1401 MHz processor clock speed
• Memory: 1536 MB
• 177.4 GB/s memory bandwidth
• Supports 2-way/3-way Scalable Link Interface (SLI)
NVIDIA Tesla™
• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multi-threaded computing
  – Supercomputing performance
  – Large memory capacity, up to 6 GB in Tesla M2070
• Tesla 10:
  – C-Series (card) = 1 GPU with 1.5 GB
  – D-Series (deskside unit) = 2 GPUs
  – S-Series (1U server) = 4 GPUs
• Tesla 20 (Fermi architecture) = 1 GPU with 3 GB or 6 GB
• Note: 1 G80 GPU = 128 cores = ~500 GFLOPS
• 1 T10 = 240 cores = 1 TFLOPS
[Pictured: NVIDIA Tesla card, NVIDIA G80]
NVIDIA Fermi (Tesla series 20)
“I believe history will record Fermi as a significant milestone.” Dave Patterson
• 512 cores (16 SM * 32 cores)
• 8X faster peak DP floating point calculation
• 520-630 GFLOPS DP
• 3 GB GDDR5 for Tesla 2050
• 6 GB GDDR5 for Tesla 2070
• ECC
• L1 and L2 cache
• Concurrent kernel execution (up to 16 kernels)
• IEEE 754-2008 and FMA (fused multiply-add)
NVidia Fermi Architecture
NVIDIA's Fermi white paper
3rd Generation SM Architecture
• 32 cores, 16 load/store units, and 4 Special Function Units.
• Configurable 64 KB on-chip memory: 16 KB shared memory and 48 KB L1 cache, or 48 KB shared memory and 16 KB L1 cache.
• Dual warp scheduler.
• Each CUDA core contains one floating point unit and one integer ALU, with double-precision (DP) support.
• 8X faster in double-precision operations than GT200.
NVIDIA's Fermi white paper
Memory Hierarchy
Each thread in a block can access the shared memory and the L1 cache. Each block has access to the L2 cache and the global memory.
NVIDIA's Fermi white paper
Dual Warp Scheduler >>
<< Concurrent Kernel Execution
NVIDIA's Fermi white paper
Fermi Products
                   GTX460           GTX465     GTX470     GTX480     Tesla2050    Tesla2070
Cores              336              352        448        480        448          448
Clock speed        1350 MHz         1215 MHz   1215 MHz   1401 MHz   (see note)   (see note)
Memory             768 MB or 1 GB   1 GB       1280 MB    1.5 GB     3 GB         6 GB
Bandwidth (GB/s)   86.4 or 115.2    102.6      133.9      177.4      148          148
Power              160 W            200 W      215 W      250 W      225 W        225 W
Price              $199-$249        $299       $349       $499       $2,499       $3,999
Note: for the Tesla cards the slide lists peak throughput instead of clock speed: SP 1.05 TFLOPS, DP 515 GFLOPS.
NVIDIA.com
CUDA Architecture Generations
Nvidia's Fermi white paper
Fermi vs. GT200
Each SM in the Fermi architecture can perform 16 double-precision FMA (fused multiply-add) operations per clock cycle.
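As a rough cross-check, the per-SM rate can be scaled up to a whole card and compared with the 515 GFLOPS double-precision figure quoted for the Tesla cards under Fermi Products. This is only a sketch: it assumes a Tesla 2050 with 448 cores, that is 14 SMs of 32 cores, and a 1.15 GHz processor clock. The clock value is an assumption and does not appear in these slides. Each FMA is counted as two floating-point operations.

```latex
\[
\underbrace{14}_{\text{SMs}}
\times \underbrace{16}_{\text{FMA per clock per SM}}
\times \underbrace{2}_{\text{FLOP per FMA}}
\times \underbrace{1.15 \times 10^{9}}_{\text{clocks per second}}
\approx 515 \times 10^{9}\ \text{FLOP/s}
= 515\ \text{GFLOPS (DP)}
\]
```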
Nvidia's Fermi white paper
Nvidia's Fermi white paper
This slide is from the NVIDIA CUDA tutorial
ATI Stream (1)
ATI 4870
ATI 4870 X2
ATI Radeon™ HD 5870
Transistors        2.15 billion (40 nm)
Stream Cores       1600
Clock speed        850 MHz
SP Compute Power   2.72 TeraFLOPS
DP Compute Power   544 GigaFLOPS
Memory Type        GDDR5 4.8 Gbps
Memory Capacity    1 GB
Memory Bandwidth   153.6 GB/sec
Board Power        188 W max / 27 W idle
AMD.com
ATI Radeon™ HD 5970
Transistors        4.3 billion (40 nm)
Stream Cores       3200 (2 GPUs)
Clock speed        725 MHz
SP Compute Power   4.64 TeraFLOPS
DP Compute Power   928 GigaFLOPS
Memory Type        GDDR5 4.0 Gbps
Memory Capacity    2 - 4 GB
Memory Bandwidth   256.0 GB/sec
Board Power        294 W max / 51 W idle
AMD.com
Architecture of ATI Radeon 4000 series
This slide is from ATI presentation!
This slide is from ATI presentation!
What about Intel?
Intel Larrabee
• A hybrid between a multi-core CPU and a GPU
• Its coherent cache hierarchy and x86 architecture compatibility are CPU-like
• Its wide SIMD vector units and texture sampling hardware are GPU-like
Months after ISC’09, Intel canceled the Larrabee project. At ISC’10 it announced a new project, code-named “Knights Ferry”, built on a Larrabee-like architecture called MIC.
Intel Knights Ferry (MIC Architecture)
• 22 nm technology
• 32 cores at 1.2 GHz (MIC is up to 50 cores)
• 128 threads at 4 threads/core
• 8 MB shared coherent cache
• 1-2 GB GDDR5
• Intel HPC tools
This slide’s information is from the ISC’10 Skaugen keynote
MIC Architecture (Many Integrated Core)
This slide’s information is from the ISC’10 Skaugen keynote
Intel Knights Ferry vs. NVIDIA Fermi
                                Intel MIC       NVIDIA Fermi
MIMD Parallelism                32              32 (28)
SIMD Parallelism                16              16
Instruction-Level Parallelism   2               1
Thread Granularity              coarse          fine
Multithreading                  4               24
Clock                           1.2 GHz         1.1 GHz
L1 cache/processor              32 KB           64 KB
L2 cache/processor              256 KB          24 KB
Programming model               POSIX threads   CUDA kernels
Virtual memory                  yes             no
Memory shared with host         no              no
Hardware parallelism support    no              yes
Mature tools                    yes             yes
This information is from the article “Compilers and More: Knights Ferry Versus Fermi” by Michael Wolfe, The Portland Group, Inc.
Introduction to OpenCL
Toward a new approach in computing
Introduction to OpenCL
• OpenCL stands for Open Computing Language.
• It comes from a consortium effort that includes Apple, NVIDIA, AMD, and others.
• It is managed by the Khronos Group, which is also responsible for OpenGL.
• It took 6 months to come up with the specification.
OpenCL
1. Royalty-free.
2. Supports both task-parallel and data-parallel programming modes.
3. Vendor-agnostic: works on GPGPUs from different vendors,
4. including multi-core CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language (C99).
OpenCL Platform Model
Basic OpenCL program Structure
1. OpenCL Kernel
2. Host program containing:
   a. Device context
   b. Command queue
   c. Memory objects
   d. OpenCL program
   e. Kernel memory arguments
CPUs+GPU platforms
Performance of GPGPU
Note: a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  – Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute: graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
An Example of Physical Reality Behind CUDA
[Figure: CPU (host) connected to a GPU with local DRAM (device)]
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application data parallelism and thread parallelism
[Pictured: GeForce 8800, Tesla D870, Tesla S870]
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Figure: GeForce 8800 block diagram. Host, Input Assembler, and Thread Execution Manager feed eight processor groups, each with a parallel data cache and a texture unit; load/store paths connect them to global memory]
Introduction to CUDA programming
These materials are excerpted from David Kirk/NVIDIA and Wen-mei W. Hwu, and from Christian Trefftz / Greg Wolffe’s SC08 GPU tutorials.
Data-parallel Programming
• Think of the GPU as a massively threaded co-processor
• Write “kernel” functions that execute on the device, processing multiple data elements in parallel
• Keep it busy! → massive threading
• Keep your data close! → local memory
Pixel / Thread Processing
Steps for CUDA Programming
1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute the kernel (call a __global__ function)
5. Copy data from device memory (retrieve results)
Initially:
[Figure: array sits in the host’s memory; the GPU card’s memory is empty]
Allocate Memory in the GPU card
[Figure: array in the host’s memory; array_d allocated in the GPU card’s memory]
Copy content from the host’s memory to the GPU card memory
[Figure: contents of array in the host’s memory copied to array_d in the GPU card’s memory]
Execute code on the GPU
[Figure: the GPU multiprocessors (MPs) run the kernel code on array_d in the GPU card’s memory; array stays in the host’s memory]
Copy results back to the host memory
[Figure: contents of array_d in the GPU card’s memory copied back to array in the host’s memory]
Hello World
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}
Hello World
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}
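The slide version leaves out where A, B, and C come from. A minimal sketch of a complete program, following the five steps for CUDA programming, might look like the one below. The value of N, the host arrays hA/hB/hC, the device pointers dA/dB/dC, and the helper runVecAdd are all assumptions made for illustration; they are not part of the original slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // assumed vector length; one block only, so N must not exceed the per-block thread limit

// Kernel definition: thread i adds element i.
__global__ void vecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical helper: runs the five steps and returns the number of
// wrong elements, or -1 if any CUDA call fails.
int runVecAdd(void)
{
    static float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * (float)i; }

    const size_t bytes = N * sizeof(float);
    float *dA = NULL, *dB = NULL, *dC = NULL;
    int bad = -1;

    // 1. Device initialization
    if (cudaSetDevice(0) != cudaSuccess) return -1;

    // 2. Device memory allocation, then 3. copy the inputs to device memory
    if (cudaMalloc((void**)&dA, bytes) == cudaSuccess &&
        cudaMalloc((void**)&dB, bytes) == cudaSuccess &&
        cudaMalloc((void**)&dC, bytes) == cudaSuccess &&
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess)
    {
        // 4. Kernel invocation: 1 block of N threads
        vecAdd<<<1, N>>>(dA, dB, dC);

        // 5. Copy the result back and compare with the host computation
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
        {
            bad = 0;
            for (int i = 0; i < N; ++i)
                if (hC[i] != hA[i] + hB[i]) ++bad;
        }
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return bad;
}

int main()
{
    int bad = runVecAdd();
    printf("vecAdd: %d wrong elements\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Because the launch uses a single block, this form only works while N fits in one thread block; larger vectors need a grid of blocks and a blockIdx-based index.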
Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Initialize Device calls
• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device’s properties.
• Note: cudaSetDevice() must be called before any __global__ function; otherwise device 0 is automatically selected.
CUDA Language concept
• CUDA Programming Model • CUDA Memory Model
Some Terminology
• Device = GPU = set of multiprocessors
• Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1, with blocks (0, 0) to (2, 1), and Kernel 2 on Grid 2; Block (1, 1) is expanded into threads (0, 0) to (4, 2). Courtesy: NVIDIA]
What are those blockIds and threadIds?
• blockIdx.x is a built-in CUDA variable that returns the block ID, along the x axis, of the block executing this code
• threadIdx.x is another built-in variable; it returns the thread ID, along the x axis, of the thread being executed by this stream processor
• Example code in the kernel:
    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;
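To see those IDs, the kernel lines can be wrapped in a small program that copies block_d and thread_d back to the host. The sketch below assumes two blocks of four threads; BLOCK_SIZE = 4, NUM_BLOCKS = 2, the kernel name recordIds, and the helper collectIds are illustrative choices, not names from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 4                    // threads per block (assumed)
#define NUM_BLOCKS 2                    // blocks in the grid (assumed)
#define TOTAL (BLOCK_SIZE * NUM_BLOCKS) // one array element per thread

// Each thread records which block and which thread handled element x.
__global__ void recordIds(int* block_d, int* thread_d)
{
    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;
}

// Hypothetical helper: fills block_h and thread_h (TOTAL ints each).
// Returns 0 on success, -1 on a CUDA error.
int collectIds(int* block_h, int* thread_h)
{
    const size_t bytes = TOTAL * sizeof(int);
    int *block_d = NULL, *thread_d = NULL;
    int status = -1;

    if (cudaMalloc((void**)&block_d, bytes) == cudaSuccess &&
        cudaMalloc((void**)&thread_d, bytes) == cudaSuccess)
    {
        recordIds<<<NUM_BLOCKS, BLOCK_SIZE>>>(block_d, thread_d);

        if (cudaMemcpy(block_h, block_d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
            cudaMemcpy(thread_h, thread_d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
            status = 0;
    }

    cudaFree(block_d);
    cudaFree(thread_d);
    return status;
}

int main()
{
    int block_h[TOTAL], thread_h[TOTAL];
    if (collectIds(block_h, thread_h) != 0) {
        printf("CUDA error\n");
        return 1;
    }
    for (int x = 0; x < TOTAL; ++x)
        printf("element %d: block %d, thread %d\n", x, block_h[x], thread_h[x]);
    return 0;
}
```

With these sizes, elements 0 to 3 report block 0 and elements 4 to 7 report block 1, while the thread IDs 0 to 3 repeat in each block.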
In the GPU:
[Figure: array elements mapped to processing elements; Block 0 runs Threads 0-3 and Block 1 runs Threads 0-3]
CUDA Device Memory Model Overview
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Figure: device grid with Block (0, 0) and Block (1, 0), each holding shared memory plus per-thread registers and local memory; the host connects to global, constant, and texture memory]
Global, Constant, and Texture Memories (Long Latency Accesses)
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by host
  – Contents visible to all threads
[Figure: CUDA device memory model, as in the overview slide. Courtesy: NVIDIA]
CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates object in the device global memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of allocated object
• cudaFree()
  – Frees object from device global memory
    • Pointer to freed object
[Figure: CUDA device memory model, as in the overview slide]
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters, in this order
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA
[Figure: CUDA device memory model, as in the overview slide]
CUDA Function Declarations
                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void KernelFunc()      device             host
__host__ float HostFunc()         host               host
• __global__ defines a kernel function
  – Must return void
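A short sketch of how the three qualifiers fit together: the host launches the kernel, the kernel calls the device function, and the host function provides a reference result. The function bodies, the array size of 100, and the helper compareHostAndDevice are assumptions for illustration; only the three function names come from the table.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the device, callable only from device code.
__device__ float DeviceFunc(float x) { return x * x; }

// __global__: a kernel. Runs on the device, is launched from the host, returns void.
__global__ void KernelFunc(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = DeviceFunc(data[i]);
}

// __host__: an ordinary CPU function (this is the default when no qualifier is given).
__host__ float HostFunc(float x) { return x * x; }

// Hypothetical helper: squares 0..99 on the GPU and counts the elements that
// differ from HostFunc. Returns -1 on a CUDA error.
int compareHostAndDevice(void)
{
    const int n = 100;
    float input[n], output[n];
    for (int i = 0; i < n; ++i) input[i] = (float)i;

    float* data_d = NULL;
    if (cudaMalloc((void**)&data_d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(data_d, input, n * sizeof(float), cudaMemcpyHostToDevice);

    KernelFunc<<<(n + 31) / 32, 32>>>(data_d, n);  // enough 32-thread blocks to cover n

    cudaError_t err = cudaMemcpy(output, data_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(data_d);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (output[i] != HostFunc(input[i])) ++bad;
    return bad;
}

int main()
{
    printf("mismatches between host and device: %d\n", compareHostAndDevice());
    return 0;
}
```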
Language Extensions: Variable Type Qualifiers
                                            Memory     Scope    Lifetime
__device__ __local__ int LocalVar;          local      thread   thread
__device__ __shared__ int SharedVar;        shared     block    block
__device__ int GlobalVar;                   global     grid     application
__device__ __constant__ int ConstantVar;    constant   grid     application
• __device__ is optional when used with __local__, __shared__, or __constant__
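The sketch below shows three of these memory spaces in one kernel: a __constant__ scale factor written by the host, a __shared__ buffer used to reverse the block's data, and a __device__ global variable read back afterwards. Per-thread data is held in a plain automatic variable here, on the assumption that current toolchains place such variables in registers or local memory without an explicit qualifier. The kernel name, BLOCK = 64, and the helper runReverseAndScale are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // assumed block size

__constant__ float ConstantVar;  // constant memory: grid scope, application lifetime
__device__ int GlobalVar;        // global memory: grid scope, application lifetime

__global__ void reverseAndScale(float* data)
{
    __shared__ float SharedVar[BLOCK];  // shared memory: one copy per block
    int t = threadIdx.x;                // automatic variable: private to this thread

    SharedVar[t] = data[t];
    __syncthreads();                    // wait until the whole block has written
    data[t] = ConstantVar * SharedVar[BLOCK - 1 - t];

    if (t == 0) GlobalVar = BLOCK;      // the host can read this after the kernel
}

// Hypothetical helper: returns the number of failed checks, or -1 on a CUDA error.
int runReverseAndScale(void)
{
    float data_h[BLOCK];
    for (int i = 0; i < BLOCK; ++i) data_h[i] = (float)i;

    const float scale = 2.0f;
    float* data_d = NULL;
    int count = 0;

    if (cudaMalloc((void**)&data_d, sizeof(data_h)) != cudaSuccess) return -1;
    cudaMemcpyToSymbol(ConstantVar, &scale, sizeof(float));  // host initializes the constant
    cudaMemcpy(data_d, data_h, sizeof(data_h), cudaMemcpyHostToDevice);

    reverseAndScale<<<1, BLOCK>>>(data_d);

    cudaError_t err = cudaMemcpy(data_h, data_d, sizeof(data_h), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&count, GlobalVar, sizeof(int));
    cudaFree(data_d);
    if (err != cudaSuccess) return -1;

    int bad = (count == BLOCK) ? 0 : 1;
    for (int i = 0; i < BLOCK; ++i)
        if (data_h[i] != scale * (float)(BLOCK - 1 - i)) ++bad;
    return bad;
}

int main()
{
    printf("failed checks: %d\n", runReverseAndScale());
    return 0;
}
```

The __syncthreads() call is what makes the shared-memory read safe: without it a thread could read SharedVar[BLOCK - 1 - t] before the owning thread has written it.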
Access Times
• Register – dedicated HW, single cycle
• Shared Memory – dedicated HW, single cycle
• Local Memory – DRAM, no cache, *slow*
• Global Memory – DRAM, no cache, *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
CUDA function calls restrictions
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments
Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);         // 5000 thread blocks
dim3 DimBlock(4, 8, 8);        // 256 threads per block
size_t SharedMemBytes = 64;    // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
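One way to check this execution configuration is to have every thread count itself. In the sketch below each block tallies its threads in the dynamically sized shared buffer supplied by SharedMemBytes, then adds the block total to a global counter. The kernel body, the counter, and the helper countLaunchedThreads are assumptions for illustration, and cudaDeviceSynchronize() is used as the explicit synchronization point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void KernelFunc(unsigned int* total)
{
    // Dynamically sized shared memory; its size is the third launch parameter.
    extern __shared__ unsigned int blockCount[];

    // Flatten the 3D thread index within the block.
    int tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);

    if (tid == 0) blockCount[0] = 0;
    __syncthreads();
    atomicAdd(&blockCount[0], 1u);                  // every thread counts itself
    __syncthreads();
    if (tid == 0) atomicAdd(total, blockCount[0]);  // one global update per block
}

// Hypothetical helper: returns the number of threads that ran, or 0 on a CUDA error.
unsigned int countLaunchedThreads(void)
{
    unsigned int* total_d = NULL;
    unsigned int total_h = 0;
    if (cudaMalloc((void**)&total_d, sizeof(unsigned int)) != cudaSuccess) return 0;
    cudaMemset(total_d, 0, sizeof(unsigned int));

    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;   // 64 bytes of shared memory per block
    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(total_d);

    // The launch returns immediately; block here until the kernel has finished.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&total_h, total_d, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(total_d);
    return err == cudaSuccess ? total_h : 0;
}

int main()
{
    printf("threads launched: %u\n", countLaunchedThreads());
    return 0;
}
```

With 5000 blocks of 256 threads, the expected count is 5000 * 256 = 1,280,000.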
Online resources
• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html#
• http://www.nvidia.com/object/cuda_learn.html
Demo
1. Initialize Device
• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device’s properties.
• Note: cudaSetDevice() must be called before any __global__ function; otherwise device 0 is automatically selected.
Example: Device Initialization
int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
cudaDeviceProp deviceProp;

for (device = 0; device < deviceCount; device++)
{
    …
    cudaGetDeviceProperties(&deviceProp, device);
    …
}
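The elided parts of the loop depend on what the program needs. As one possibility, a sketch that prints a few fields of cudaDeviceProp for every device is shown below; the helper name listDevices and the choice of fields are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few properties of each CUDA device.
// Returns the number of devices found, or -1 if a CUDA call fails.
int listDevices(void)
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) return -1;

    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        if (cudaGetDeviceProperties(&deviceProp, device) != cudaSuccess) return -1;

        printf("Device %d: %s\n", device, deviceProp.name);
        printf("  compute capability:  %d.%d\n", deviceProp.major, deviceProp.minor);
        printf("  multiprocessors:     %d\n", deviceProp.multiProcessorCount);
        printf("  global memory:       %lu MB\n",
               (unsigned long)(deviceProp.totalGlobalMem >> 20));
        printf("  shared memory/block: %lu bytes\n",
               (unsigned long)deviceProp.sharedMemPerBlock);
        printf("  max threads/block:   %d\n", deviceProp.maxThreadsPerBlock);
    }
    return deviceCount;
}

int main()
{
    int count = listDevices();
    if (count < 0) {
        printf("could not query CUDA devices\n");
        return 1;
    }
    printf("%d CUDA device(s) found\n", count);
    return 0;
}
```

The maxThreadsPerBlock value is the limit that constrains the single-block matrix multiplication in the running example.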
A Simple Running Example: Matrix Multiplication
• A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  – Leave shared memory usage until later
  – Local, register usage
  – Thread ID usage
  – Memory data transfer API between host and device
Programming Model: Square Matrix Multiplication Example
• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread handles one element of P
  – M and N are loaded WIDTH times from global memory
[Figure: M, N, and P drawn as WIDTH x WIDTH squares]
Step 1: Matrix Data Transfers
// Allocate the device memory where we will copy M to
Matrix Md;
Md.width = WIDTH;
Md.height = WIDTH;
Md.pitch = WIDTH;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md.elements, size);

// Copy M from the host to the device
cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

// Read M from the device to the host into P
cudaMemcpy(P.elements, Md.elements, size, cudaMemcpyDeviceToHost);
...
// Free device memory
cudaFree(Md.elements);
Step 2: Matrix Multiplication
A Simple Host Code in C
// Matrix multiplication on the (CPU) host in double precision
// for simplicity, we will assume that all dimensions are equal
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
{
    for (int i = 0; i < M.height; ++i)
        for (int j = 0; j < N.width; ++j) {
            double sum = 0;
            for (int k = 0; k < M.width; ++k) {
                double a = M.elements[i * M.width + k];
                double b = N.elements[k * N.width + j];
                sum += a * b;
            }
            P.elements[i * N.width + j] = sum;
        }
}
Multiply Using One Thread Block
• One block of threads computes matrix P
  – Each thread computes one element of P
• Each thread
  – Loads a row of matrix M
  – Loads a column of matrix N
  – Performs one multiply and one addition for each pair of M and N elements
  – Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1, Block 1; Thread (2, 2) combines a row of M and a column of N into one element of P; BLOCK_SIZE marks the block width]
Step 3: Matrix Multiplication Host-side Main Program Code
int main(void)
{
    // Allocate and initialize the matrices
    Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

    // M * N on the device
    MatrixMulOnDevice(M, N, P);

    // Free matrices
    FreeMatrix(M);
    FreeMatrix(N);
    FreeMatrix(P);
    return 0;
}
See the demo code
Step 3: Matrix Multiplication Host-side code
// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Load M and N to the device
    Matrix Md = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Md, M);
    Matrix Nd = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Nd, N);

    // Allocate P on the device
    Matrix Pd = AllocateDeviceMatrix(P);
    CopyToDeviceMatrix(Pd, P); // Clear memory
Step 3: Matrix Multiplication Host-side Code (cont.)
    // Setup the execution configuration
    dim3 dimBlock(WIDTH, WIDTH);
    dim3 dimGrid(1, 1);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

    // Read P from the device
    CopyFromDeviceMatrix(P, Pd);

    // Free device matrices
    FreeDeviceMatrix(Md);
    FreeDeviceMatrix(Nd);
    FreeDeviceMatrix(Pd);
}
Step 4: Matrix Multiplication Device-side Kernel Function
// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // 2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Matrix Multiplication Device-Side Kernel Function (cont.)
    for (int k = 0; k < M.width; ++k)
    {
        float Melement = M.elements[ty * M.pitch + k];
        float Nelement = N.elements[k * N.pitch + tx];
        Pvalue += Melement * Nelement;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    P.elements[ty * P.pitch + tx] = Pvalue;
}
[Figure: thread (tx, ty) combines row ty of M and column tx of N into one element of P; all matrices are WIDTH x WIDTH]
Step 5: Some Loose Ends
// Allocate a device matrix of same size as M.
Matrix AllocateDeviceMatrix(const Matrix M)
{
    Matrix Mdevice = M;
    int size = M.width * M.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}

// Free a device matrix.
void FreeDeviceMatrix(Matrix M)
{
    cudaFree(M.elements);
}

void FreeMatrix(Matrix M)
{
    free(M.elements);
}
Step 5: Some Loose Ends (cont.)
// Copy a host matrix to a device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
    int size = Mhost.width * Mhost.height * sizeof(float);
    cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
}

// Copy a device matrix to a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
}
Step 6: Handling Arbitrary Sized Square Matrices
• Have each 2D thread block compute a (BLOCK_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each block has (BLOCK_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/BLOCK_WIDTH)^2 blocks
• You still need to put a loop around the kernel call for cases where WIDTH is greater than the max grid size!
[Figure: block (bx, by) and thread (tx, ty) select one tile and one element of the WIDTH x WIDTH result matrix P]
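A minimal sketch of the multi-block version follows. Each thread derives its row and column from blockIdx and threadIdx, and a bounds check lets WIDTH be something other than a multiple of BLOCK_WIDTH. The sketch uses raw float pointers rather than the Matrix struct from the earlier steps, and BLOCK_WIDTH = 16, the test width of 100, and the helper checkMatrixMul are assumptions for illustration. It does not include the extra loop needed when the grid would exceed the maximum grid size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_WIDTH 16  // assumed tile width: 16 x 16 = 256 threads per block

__global__ void MatrixMulKernel(const float* M, const float* N, float* P, int width)
{
    // Global row and column of the P element this thread computes.
    int row = blockIdx.y * BLOCK_WIDTH + threadIdx.y;
    int col = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    if (row >= width || col >= width) return;  // threads past the matrix edge do nothing

    float Pvalue = 0;
    for (int k = 0; k < width; ++k)
        Pvalue += M[row * width + k] * N[k * width + col];
    P[row * width + col] = Pvalue;
}

// Hypothetical helper: multiplies an all-ones matrix by an all-twos matrix and
// returns the number of P elements that differ from 2 * width (-1 on a CUDA error).
int checkMatrixMul(int width)
{
    const size_t count = (size_t)width * width;
    const size_t bytes = count * sizeof(float);
    std::vector<float> M(count, 1.0f), N(count, 2.0f), P(count, 0.0f);

    float *Md = NULL, *Nd = NULL, *Pd = NULL;
    int bad = -1;
    if (cudaMalloc((void**)&Md, bytes) == cudaSuccess &&
        cudaMalloc((void**)&Nd, bytes) == cudaSuccess &&
        cudaMalloc((void**)&Pd, bytes) == cudaSuccess)
    {
        cudaMemcpy(Md, M.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(Nd, N.data(), bytes, cudaMemcpyHostToDevice);

        // Round the grid up so that partial tiles at the edges are covered.
        int blocks = (width + BLOCK_WIDTH - 1) / BLOCK_WIDTH;
        dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
        dim3 dimGrid(blocks, blocks);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

        if (cudaMemcpy(P.data(), Pd, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            bad = 0;
            for (size_t i = 0; i < count; ++i)
                if (P[i] != 2.0f * (float)width) ++bad;
        }
    }
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return bad;
}

int main()
{
    printf("wrong elements for width 100: %d\n", checkMatrixMul(100));
    return 0;
}
```

Every thread still reads a full row of M and a full column of N from global memory, so this version fixes the size limit but not the 1:1 compute to memory access ratio; that is what shared-memory tiling addresses.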