Introduction to GPU Programming

Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

V. Kindratenko, Introduction to GPU Programming (part I), December 2010, The American University in Cairo, Egypt

Tutorial Goals

• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU application development flow
• Be able to write and run simple NVIDIA GPU kernels in CUDA
• Be aware of performance-limiting factors and understand performance tuning strategies

Schedule

• Day 1
  – 2:30-3:45 part I
  – 3:45-4:15 break
  – 4:15-5:30 part II
• Day 2
  – 2:30-3:45 part III
  – 3:45-4:15 break
  – 4:15-5:30 part IV

Part I

• Introduction
• Hands-on: getting started with the NCSA GPU cluster
• Hands-on: anatomy of a GPU application

Introduction

• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture
  – NVIDIA
• GPU programming overview
  – Libraries
  – CUDA C
  – OpenCL
  – PGI x64+GPU

Why GPUs?

[Figure: raw performance trends, 9/2002-3/2008, in GFLOPS, for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) versus Intel CPUs (Intel Xeon quad-core, 3 GHz); the GPU curve approaches 1000 GFLOPS. Graph is courtesy of NVIDIA.]

Why GPUs?

[Figure: memory bandwidth trends, 9/2002-3/2008, in GByte/s, for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285); the curve reaches about 160 GB/s. Graph is courtesy of NVIDIA.]

GPU vs. CPU Silicon Use

[Figure comparing how GPU and CPU dies apportion silicon between compute units and cache/control logic. Graph is courtesy of NVIDIA.]

NVIDIA GPU Architecture

• A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of
  – 8 Scalar Processor (SP) cores
  – 2 special function units for transcendentals
  – A multithreaded instruction unit
  – On-chip shared memory
• GDDR3 SDRAM
• PCIe interface
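This hardware hierarchy is mirrored in the CUDA programming model: thread blocks are scheduled onto SMs, the threads of a block execute on that SM's SP cores, and `__shared__` variables live in the SM's on-chip shared memory. As an illustrative sketch (not from the original slides; the kernel name, block size, and data layout are assumptions), a kernel that stages data through shared memory:

```cuda
// Minimal sketch: each thread block runs on one SM; the __shared__ array
// lives in that SM's on-chip shared memory and is visible to the block only.
__global__ void reverse_in_block(float *d_data)
{
    __shared__ float tile[64];                 // on-chip shared memory

    int t    = threadIdx.x;                    // thread index within the block
    int base = blockIdx.x * blockDim.x;        // this block's slice of the array

    tile[t] = d_data[base + t];                // stage global data on chip
    __syncthreads();                           // wait for the whole block

    d_data[base + t] = tile[blockDim.x - 1 - t];  // block-local reversal
}
```

Launched as, e.g., `reverse_in_block<<<nBlocks, 64>>>(d_data);` the hardware distributes the nBlocks blocks across whatever number of SMs the particular GPU provides, which is what makes the SM array "scalable".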

[Figure is courtesy of NVIDIA.]

NVIDIA GeForce 9400M G GPU

• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth

[Figure: block diagram of one TPC: geometry controller, SMC, two SMs (each with I cache, MT issue, C cache, 8 SPs, SFUs, and shared memory), texture units with texture L1, a 128-bit interconnect to L2/ROP, and DRAM.]

C1060 GPU

• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth

[Figure: block diagram of 10 TPCs (each with geometry controller, SMC, and three SMs with I cache, MT issue, C cache, 8 SPs, SFUs, and shared memory), texture units with texture L1, a 512-bit memory interconnect, L2/ROP, and 8 DRAM partitions.]

NVIDIA Tesla S1070 Computing Server

• 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM

[Figure: S1070 block diagram: four Tesla GPUs attached through two NVIDIA switches to two PCIe x16 host interfaces, plus power supply, thermal management, and system monitoring.]

Graph is courtesy of NVIDIA.

GPU Use/Programming

• GPU libraries
  – NVIDIA’s CUDA BLAS and FFT libraries
  – Many 3rd-party libraries
• Low-abstraction, lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
• High-abstraction, compiler-based tools
  – PGI x64+GPU
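As a sketch of the library route, a SAXPY (y = a*x + y) can run on the GPU without writing any kernel at all, using the CUBLAS interface that shipped with CUDA toolkits of this era (the vector length and scaling factor below are illustrative, and error checking is omitted for brevity):

```cuda
#include <stdio.h>
#include <cublas.h>   // CUBLAS v1 interface bundled with the CUDA toolkit

int main(void)
{
    const int n = 4;
    float x[] = {1, 2, 3, 4}, y[] = {10, 20, 30, 40};
    float *d_x, *d_y;

    cublasInit();                                    // initialize CUBLAS
    cublasAlloc(n, sizeof(float), (void **)&d_x);    // device allocations
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1); // host -> device copies
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);            // y = 2*x + y, on the GPU

    cublasGetVector(n, sizeof(float), d_y, 1, y, 1); // device -> host copy
    for (int i = 0; i < n; i++)
        printf("%g ", y[i]);                         // expected: 12 24 36 48
    printf("\n");

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}
```

The library handles the kernel launch internally; the application only moves data and calls BLAS-named routines, which is why this is the lowest-effort entry point of the three options above.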

CUDA C APIs

• Higher-level API: the CUDA runtime API
  – myKernel<<<grid, block>>>( args );
• Low-level API: the CUDA driver API
  – cuModuleLoad( &module, binfile );
  – cuModuleGetFunction( &func, module, "mykernel" );
  – cuParamSetv( func, 0, &args, 48 );
  – cuParamSetSize( func, 48 );
  – cuFuncSetBlockShape( func, ts[0], ts[1], 1 );
  – cuLaunchGrid( func, gs[0], gs[1] );
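A minimal end-to-end program using the runtime API, the higher-level of the two: allocate device memory, copy data in, launch with the <<<grid, block>>> syntax, and copy results back (the kernel, sizes, and scale factor are illustrative, not from the slides):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: multiply every element by s.
__global__ void scale(float *d_a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        d_a[i] *= s;
}

int main(void)
{
    const int n = 8;
    float h_a[n];
    for (int i = 0; i < n; i++) h_a[i] = (float)i;

    float *d_a;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<1, 32>>>(d_a, 3.0f, n);   // runtime API: 1 block of 32 threads

    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    for (int i = 0; i < n; i++)
        printf("%g ", h_a[i]);        // expected: 0 3 6 9 12 15 18 21
    printf("\n");
    return 0;
}
```

The driver API achieves the same launch with the explicit cuModuleLoad / cuParamSetv / cuLaunchGrid sequence listed above; the runtime API hides module loading and parameter marshalling behind the <<<...>>> syntax.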

Getting Started with NCSA GPU Cluster

• Cluster architecture overview
• How to log in and check out a node
• How to compile and run an existing application

NCSA AC GPU Cluster

GPU Cluster Architecture

• Servers: 32
  – CPU cores: 128
• Accelerator Units: 32
  – GPUs: 128

[Figure: ac (head node) connected to 32 compute nodes, ac01 through ac32.]

GPU Cluster Node Architecture

• HP xw9400 workstation
  – 2216 AMD Opteron, 2.4 GHz, dual-socket dual-core
  – 8 GB DDR2
  – InfiniBand QDR
• Tesla S1070 1U GPU Computing Server
  – 1.3 GHz Tesla T10 processors
  – 4x4 GB GDDR3 SDRAM

[Figure: compute node diagram: the HP xw9400 connects to QDR InfiniBand and attaches to the Tesla S1070 through two PCIe x16 links, each feeding a PCIe interface that serves two T10 processors with their DRAM.]

Accessing the GPU Cluster

• Use a Secure Shell (SSH) client to access AC
  – ssh [email protected] (User: tra1 – tra50; Password: ???)

– You will see something like this printed out:

See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/ If publishing works supported by AC, you can acknowledge it as an IACAT resource. http://www.ncsa.illinois.edu/UserInfo/Allocations/Ack.html

All SSH traffic on this system is monitored. For more information see: https://bw-wiki.ncsa.uiuc.edu/display/bw/Security+Monitoring+Policy

Machine Description and HOW TO USE. See: /usr/local/share/docs/ac.readme CUDA wrapper readme: /usr/local/share/docs/cuda_wrapper.readme These docs also available at