Introduction to GPU Programming

Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

V. Kindratenko, Introduction to GPU Programming (part I), December 2010, The American University in Cairo, Egypt

Tutorial Goals

• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU application development flow
• Be able to write and run simple NVIDIA GPU kernels in CUDA
• Be aware of performance-limiting factors and understand performance tuning strategies

Schedule

• Day 1
  – 2:30-3:45 part I
  – 3:45-4:15 break
  – 4:15-5:30 part II
• Day 2
  – 2:30-3:45 part III
  – 3:45-4:15 break
  – 4:15-5:30 part IV

Part I

• Introduction
• Hands-on: getting started with the NCSA GPU cluster
• Hands-on: anatomy of a GPU application

Introduction

• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture
  – NVIDIA
• GPU programming overview
  – Libraries
  – CUDA C
  – OpenCL
  – PGI x64+GPU

Why GPUs?

[Figure: raw performance trends, 9/2002-3/2008, in GFLOPS, for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) versus Intel CPUs (Intel Xeon quad-core, 3 GHz); the GPU curve approaches 1000 GFLOPS. Graph is courtesy of NVIDIA.]

Why GPUs?

[Figure: memory bandwidth trends, 9/2002-3/2008, in GByte/s, for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285); the curve reaches about 160 GB/s. Graph is courtesy of NVIDIA.]

GPU vs. CPU Silicon Use

[Figure comparing how GPU and CPU dies apportion silicon between compute units and cache/control logic. Graph is courtesy of NVIDIA.]

NVIDIA GPU Architecture

• A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of
  – 8 Scalar Processor (SP) cores
  – 2 special function units for transcendentals
  – A multithreaded instruction unit
  – On-chip shared memory
• GDDR3 SDRAM
• PCIe interface
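This hardware hierarchy is mirrored in the CUDA programming model: thread blocks are scheduled onto SMs, the threads of a block execute on that SM's SP cores, and `__shared__` variables live in the SM's on-chip shared memory. As an illustrative sketch (not from the original slides; the kernel name, block size, and data layout are assumptions), a kernel that stages data through shared memory:

```cuda
// Minimal sketch: each thread block runs on one SM; the __shared__ array
// lives in that SM's on-chip shared memory and is visible to the block only.
__global__ void reverse_in_block(float *d_data)
{
    __shared__ float tile[64];                 // on-chip shared memory

    int t    = threadIdx.x;                    // thread index within the block
    int base = blockIdx.x * blockDim.x;        // this block's slice of the array

    tile[t] = d_data[base + t];                // stage global data on chip
    __syncthreads();                           // wait for the whole block

    d_data[base + t] = tile[blockDim.x - 1 - t];  // block-local reversal
}
```

Launched as, e.g., `reverse_in_block<<<nBlocks, 64>>>(d_data);` the hardware distributes the nBlocks blocks across whatever number of SMs the particular GPU provides, which is what makes the SM array "scalable".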

[Figure is courtesy of NVIDIA.]

NVIDIA GeForce 9400M G GPU

• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth

[Figure: block diagram of one TPC: geometry controller, SMC, two SMs (each with I cache, MT issue, C cache, 8 SPs, SFUs, and shared memory), texture units with texture L1, a 128-bit interconnect to L2/ROP, and DRAM.]

C1060 GPU

• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth

[Figure: block diagram of 10 TPCs (each with geometry controller, SMC, and three SMs with I cache, MT issue, C cache, 8 SPs, SFUs, and shared memory), texture units with texture L1, a 512-bit memory interconnect, L2/ROP, and 8 DRAM partitions.]

NVIDIA Tesla S1070 Computing Server

• 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM

[Figure: S1070 block diagram: four Tesla GPUs attached through two NVIDIA switches to two PCIe x16 host interfaces, plus power supply, thermal management, and system monitoring.]

Graph is courtesy of NVIDIA.

GPU Use/Programming

• GPU libraries
  – NVIDIA’s CUDA BLAS and FFT libraries
  – Many 3rd-party libraries
• Low-abstraction, lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
• High-abstraction, compiler-based tools
  – PGI x64+GPU
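As a sketch of the library route, a SAXPY (y = a*x + y) can run on the GPU without writing any kernel at all, using the CUBLAS interface that shipped with CUDA toolkits of this era (the vector length and scaling factor below are illustrative, and error checking is omitted for brevity):

```cuda
#include <stdio.h>
#include <cublas.h>   // CUBLAS v1 interface bundled with the CUDA toolkit

int main(void)
{
    const int n = 4;
    float x[] = {1, 2, 3, 4}, y[] = {10, 20, 30, 40};
    float *d_x, *d_y;

    cublasInit();                                    // initialize CUBLAS
    cublasAlloc(n, sizeof(float), (void **)&d_x);    // device allocations
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1); // host -> device copies
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);            // y = 2*x + y, on the GPU

    cublasGetVector(n, sizeof(float), d_y, 1, y, 1); // device -> host copy
    for (int i = 0; i < n; i++)
        printf("%g ", y[i]);                         // expected: 12 24 36 48
    printf("\n");

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}
```

The library handles the kernel launch internally; the application only moves data and calls BLAS-named routines, which is why this is the lowest-effort entry point of the three options above.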

CUDA C APIs

• Higher-level API: the CUDA runtime API
  – myKernel<<<grid, block>>>( args );
• Low-level API: the CUDA driver API
  – cuModuleLoad( &module, binfile );
  – cuModuleGetFunction( &func, module, "mykernel" );
  – cuParamSetv( func, 0, &args, 48 );
  – cuParamSetSize( func, 48 );
  – cuFuncSetBlockShape( func, ts[0], ts[1], 1 );
  – cuLaunchGrid( func, gs[0], gs[1] );
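A minimal end-to-end program using the runtime API, the higher-level of the two: allocate device memory, copy data in, launch with the <<<grid, block>>> syntax, and copy results back (the kernel, sizes, and scale factor are illustrative, not from the slides):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: multiply every element by s.
__global__ void scale(float *d_a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        d_a[i] *= s;
}

int main(void)
{
    const int n = 8;
    float h_a[n];
    for (int i = 0; i < n; i++) h_a[i] = (float)i;

    float *d_a;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<1, 32>>>(d_a, 3.0f, n);   // runtime API: 1 block of 32 threads

    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    for (int i = 0; i < n; i++)
        printf("%g ", h_a[i]);        // expected: 0 3 6 9 12 15 18 21
    printf("\n");
    return 0;
}
```

The driver API achieves the same launch with the explicit cuModuleLoad / cuParamSetv / cuLaunchGrid sequence listed above; the runtime API hides module loading and parameter marshalling behind the <<<...>>> syntax.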

Getting Started with NCSA GPU Cluster

• Cluster architecture overview
• How to log in and check out a node
• How to compile and run an existing application

NCSA AC GPU Cluster

GPU Cluster Architecture

• Servers: 32
  – CPU cores: 128
• Accelerator Units: 32
  – GPUs: 128

[Figure: ac (head node) connected to 32 compute nodes, ac01 through ac32.]

GPU Cluster Node Architecture

• HP xw9400 workstation
  – 2216 AMD Opteron, 2.4 GHz, dual-socket dual-core
  – 8 GB DDR2
  – InfiniBand QDR
• Tesla S1070 1U GPU Computing Server
  – 1.3 GHz Tesla T10 processors
  – 4x4 GB GDDR3 SDRAM

[Figure: compute node diagram: the HP xw9400 connects to QDR InfiniBand and attaches to the Tesla S1070 through two PCIe x16 links, each feeding a PCIe interface that serves two T10 processors with their DRAM.]

Accessing the GPU Cluster

• Use a Secure Shell (SSH) client to access AC
  – ssh [email protected] (User: tra1 – tra50; Password: ???)

– You will see something like this printed out:

See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/ If publishing works supported by AC, you can acknowledge it as an IACAT resource. http://www.ncsa.illinois.edu/UserInfo/Allocations/Ack.html

All SSH traffic on this system is monitored. For more information see: https://bw-wiki.ncsa.uiuc.edu/display/bw/Security+Monitoring+Policy

Machine Description and HOW TO USE. See: /usr/local/share/docs/ac.readme CUDA wrapper readme: /usr/local/share/docs/cuda_wrapper.readme These docs also available at