Introduction to Computing With Graphics Processors
Cris Cecka
Computational Mathematics, Stanford University

Lecture 1: Introduction to Massively Parallel Computing

Tutorial Goals

• Learn the architecture and computational environment of GPU computing
  – massively parallel hierarchical threading and memory space
  – principles and patterns of parallel programming
  – architecture features and constraints across future generations
• Introduction to programming in CUDA
  – programming API, tools, and techniques
  – functionality and maintainability

Moore’s Law (paraphrased)

“The number of transistors on an integrated circuit doubles every two years.”
– Gordon E. Moore

Moore’s Law (Visualized)

[Figure: transistor counts over time, illustrating Moore’s Law, with the GF100 marked. Data credit: Wikipedia]

Serial Performance Scaling is Over

• Cannot continue to scale processor frequencies
  – no 10 GHz chips
• Cannot continue to increase power consumption
  – can’t melt the chip
• Can continue to increase transistor density
  – as per Moore’s Law

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
  – Computation: TFLOP/s (GPU) vs. ~100 GFLOP/s (CPU)

[Figure: peak floating-point throughput over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]

• GPU in every PC – massive volume & potential impact

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
  – Bandwidth: ~10x that of CPUs

[Figure: memory bandwidth over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]

• GPU in every PC – massive volume & potential impact

The “New” Moore’s Law

• Computers no longer get faster, just wider

• You must re-think your algorithms to be parallel! Not only parallel, but hierarchically parallel...

• Data-parallel computing is the most scalable solution
  – otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores…
  – you will always have more data than cores – build the computation around the data

Enter the GPU

• Highly scalable
• Massively parallel
  – hundreds of cores
  – thousands of threads
• Cheap
• Available
• Programmable

GPU Evolution

• High throughput computation
  – GeForce GTX 280: 933 GFLOP/s
• High bandwidth memory
  – GeForce GTX 280: 140 GB/s
• High availability to all
  – 180+ million CUDA-capable GPUs in the wild

[Figure: transistor counts by GPU generation, 1995–2010 – RIVA 128 (3M), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]

Graphics Legacy

• Throughput is paramount – Must paint every pixel within frame time – Scalability

• Create, run, & retire lots of threads very rapidly – Measured 14.8 Gthread/s on increment() kernel

• Use multithreading to hide latency
  – 1 stalled thread is OK if 100 are ready to run
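The increment() kernel used for the thread-throughput measurement above is not shown in the slides; a minimal sketch of what such a kernel might look like (the signature and launch configuration are assumptions) is:

    __global__ void increment(int *data, int n)
    {
        // Each thread bumps one element; creating, running, and retiring
        // millions of these tiny threads is the pattern being measured.
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the final partial block
            data[i] += 1;
    }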

Why is this different from a CPU?

• CPU: minimize latency experienced by one thread
  – big on-chip caches
  – sophisticated control logic
• GPU: maximize throughput of all threads
  – # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  – multithreading can hide latency => skip the big caches
• Different goals produce different designs
  – GPU assumes the workload is highly parallel
  – CPU must be good at everything, parallel or not

NVIDIA GPU Architecture: Fermi GF100

[Figure: Fermi GF100 chip diagram – an array of streaming multiprocessors sharing an L2 cache, surrounded by DRAM interfaces, a host interface, and the GigaThread scheduler]

SM Multiprocessor

• 32 CUDA cores per SM (512 total)
• 8x peak FP64 performance
  – 50% of peak FP32 performance
• Direct load/store to memory
  – usual linear sequence of bytes
  – high bandwidth (hundreds of GB/sec)
• 64KB of fast, on-chip RAM
  – software- or hardware-managed
  – shared amongst CUDA cores
  – enables thread communication

[Figure: SM block diagram – instruction cache, dual schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64KB configurable cache/shared memory, uniform cache]
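As a sketch of how this on-chip shared memory lets threads in a block communicate (the kernel and the tile-reversal task are assumptions, not taken from the slides):

    // Each block stages its 256-element tile in the SM's shared memory, then
    // every thread reads back a value that a different thread wrote.
    __global__ void reverse_tile(float *data)   // assumes 256 threads per block
    {
        __shared__ float tile[256];             // lives in the SM's on-chip RAM

        int t    = threadIdx.x;
        int base = blockIdx.x * blockDim.x;

        tile[t] = data[base + t];               // cooperative load into shared memory
        __syncthreads();                        // every thread in the block must arrive here

        data[base + t] = tile[blockDim.x - 1 - t];   // exchange values across threads
    }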

Key Architectural Ideas

• GPU serves as a coprocessor to the CPU
  – has its own device memory on the card
• SIMT (Single Instruction, Multiple Thread) execution
  – threads run in groups of 32 called warps
  – threads in a warp share an instruction unit (IU)
  – HW automatically handles divergence
• Hardware multithreading
  – HW resource allocation & thread scheduling
  – HW relies on threads to hide latency
• Threads have all resources needed to run
  – any warp not waiting for something can run
  – context switching is (basically) free
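To make the SIMT divergence point concrete, here is a small assumed sketch (not from the slides): when threads in one warp take different branches, the warp's shared instruction unit executes the two paths one after the other, masking off the inactive threads each time.

    __global__ void divergent(int *out)
    {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;      // position within the 32-thread warp

        if (lane < 16)
            out[i] = 2 * i;               // first half of the warp takes this path
        else
            out[i] = -i;                  // second half runs afterwards, serialized
    }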

Enter CUDA

• A compiler and toolkit for programming NVIDIA GPUs

• Minimal extensions to the familiar C/C++ environment
  – lets programmers focus on parallel algorithms

• Scalable parallel programming model
  – express parallelism and control a hierarchy of memory spaces
  – but also uses a high-level abstraction from the hardware

• Provides a straightforward mapping onto hardware
  – good fit to the GPU architecture
  – maps well to multi-core CPUs too
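To show how minimal the extensions are, here is a hedged sketch of a complete SAXPY-style program (an illustrative example, not taken from the slides): the kernel is ordinary C marked __global__, and the host drives the GPU as a coprocessor with its own device memory.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // y[i] = a*x[i] + y[i], one element per thread
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        float *d_x, *d_y;                              // buffers in device memory
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y); // launch a grid of thread blocks

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);                   // expect 4.0

        cudaFree(d_x); cudaFree(d_y); free(x); free(y);
        return 0;
    }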

Motivation

[Figure: reported application speedups of GPU over CPU implementations, ranging from roughly 13x to 457x (e.g., 17x, 30x, 35x, 45x, 100x, 110–240x)]

Compute Environment

• Threads are executed by an SP (scalar processor)
  – all execute the same sequential kernel
  – on-chip registers, off-chip local memory
• Thread blocks are executed by an SM
  – threads in the same block can cooperate
  – on-chip synchronization
• Grids of blocks are executed by the device
  – off-chip global memory
  – no synchronization between blocks

[Figure: the thread / thread block / grid hierarchy – thread t; block b containing threads t0, t1, …, tB; a grid of such blocks]
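A hedged sketch of how these three levels typically appear inside a kernel (the kernel name and the per-block sum it computes are assumptions): per-thread registers, per-block shared memory with on-chip synchronization, and grid-wide global memory with no synchronization across blocks.

    // Each block sums its tile of 'in' and writes one partial result;
    // assumes a launch with 256 threads per block.
    __global__ void levels(const float *in, float *block_sums, int n)
    {
        __shared__ float partial[256];                     // per-block, on the SM

        int i   = blockIdx.x * blockDim.x + threadIdx.x;   // 'i' and 'v' live in registers
        float v = (i < n) ? in[i] : 0.0f;                  // read from off-chip global memory

        partial[threadIdx.x] = v;
        __syncthreads();                                   // synchronizes this block only

        if (threadIdx.x == 0) {                            // one thread reduces the block's tile
            float s = 0.0f;
            for (int t = 0; t < blockDim.x; ++t) s += partial[t];
            block_sums[blockIdx.x] = s;                    // one result per block, in global memory
        }
        // there is no way to synchronize with other blocks inside this launch
    }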

CUDA Model of Parallelism

[Figure: a row of blocks, each with its own per-block memory, sitting above a single shared global memory]

• CUDA virtualizes the physical hardware – thread is a virtualized scalar processor (registers, PC, state) – block is a virtualized multiprocessor (threads, shared mem.)

• Scheduled onto physical hardware without pre-emption
  – threads/blocks launch & run to completion
  – blocks should be independent (see the sketch below)
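Because blocks run to completion independently and there is no global barrier inside a launch, a result that depends on every block is usually combined after the kernel returns. A hedged sketch continuing the hypothetical levels() example above:

    // Host side: the end of the kernel launch is the only point where all
    // blocks are known to have finished, so the cross-block combine happens
    // after copying the per-block results back (continues the levels() sketch).
    float sum_on_host(const float *d_in, float *d_block_sums, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;

        levels<<<blocks, threads>>>(d_in, d_block_sums, n);

        float *h = (float*)malloc(blocks * sizeof(float));
        cudaMemcpy(h, d_block_sums, blocks * sizeof(float),
                   cudaMemcpyDeviceToHost);                // implicitly waits for the kernel

        float total = 0.0f;
        for (int b = 0; b < blocks; ++b) total += h[b];
        free(h);
        return total;
    }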

NOT: Flat Multiprocessor

[Figure: many processors all directly attached to one global memory]

• Global synchronization isn’t cheap
• Global memory access times are expensive
• cf. PRAM (Parallel Random Access Machine) model

NOT: Distributed Processors

[Figure: processors, each with its own memory, connected by an interconnection network]

• Distributed computing is a different setting
• cf. BSP (Bulk Synchronous Parallel) model, MPI

A Common Programming Strategy

• Global memory resides in device memory (DRAM)
  – much slower access than shared memory
• Tile data to take advantage of fast shared memory
  – generalize from the adjacent_difference example (see the sketch below)
  – divide and conquer

• Partition data into subsets that fit into shared memory
• Handle each data subset with one thread block
• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Perform the computation on the subset from shared memory
• Copy the result from shared memory back to global memory
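The adjacent_difference kernel the slides refer to is not reproduced here; the following is a hedged sketch of how it might apply the five steps above (the tile size and boundary handling are assumptions).

    // result[0] = input[0]; result[i] = input[i] - input[i-1] for i > 0.
    // Each block stages its tile (plus one halo element) in shared memory so
    // every input value is read from slow global memory only once per block.
    // Assumes a launch with 256 threads per block.
    __global__ void adjacent_difference(const float *input, float *result, int n)
    {
        __shared__ float tile[256 + 1];                  // +1 for the left halo element

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x;

        if (i < n)
            tile[t + 1] = input[i];                      // step 3: cooperative load
        if (t == 0)
            tile[0] = (i > 0) ? input[i - 1] : 0.0f;     // left neighbor from the previous tile
        __syncthreads();

        if (i < n)                                       // step 4: compute from shared memory
            result[i] = (i > 0) ? tile[t + 1] - tile[t]  // step 5: write back to global memory
                                : tile[t + 1];
    }

Launching it with, say, <<<(n + 255)/256, 256>>> follows the same grid/block pattern as the earlier sketches.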