Early 3D Graphics

Perspective study of a chalice, Paolo Uccello, circa 1450

Early Graphics Hardware

Artist using a perspective machine Albrecht Dürer, 1525

Early Electronic Graphics Hardware

SKETCHPAD: A Man-Machine Graphical Communication System Ivan Sutherland, 1963

The Geometry Engine

The Geometry Engine: A VLSI Geometry System for Graphics Jim Clark, 1982

The Graphics Pipeline

Vertex Transform & Lighting

Triangle Setup & Rasterization

Texturing & Pixel Shading

Depth Test & Blending

Framebuffer


The Graphics Pipeline

Key abstraction of real-time graphics

Hardware used to look like this

One chip/board per stage

Fixed data flow through pipeline

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

SGI RealityEngine (1993)

[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Kurt Akeley. RealityEngine Graphics. SIGGRAPH 93.

SGI InfiniteReality (1997)

[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Montrym et al. InfiniteReality: A Real-Time Graphics System. SIGGRAPH 97.

The Graphics Pipeline

Remains a useful abstraction

Hardware used to look like this

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline

Hardware used to look like this:

Vertex, pixel processing became programmable

// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added (Geometry)

[Pipeline diagram: Vertex → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer, with the vecAdd kernel shown again as the programmable-stage example]

The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added (Geometry, Tessellation)

GPU architecture increasingly centers around execution

[Pipeline diagram: Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer, with the vecAdd kernel shown again]

Modern GPUs: Unified Design

[Diagram: discrete design, with Shader A / B / C / D each running on its own unit with input and output buffers, vs. unified design, where all shader work runs on a single Shader Core]

Vertex, pixel shaders, etc. become threads running different programs on a flexible core

GeForce 8: Modern GPU Architecture

[Block diagram: Host → Input Assembler → Setup / Rasterize; Vertex, Geom, and Pixel Thread Issue feed a unified Thread Processor built from SP arrays with texture filtering (TF) units, L1 and L2 caches, and framebuffer partitions]

Modern GPU Architecture: GT200

[Block diagram: GT200 scales the same design up, with a Thread Scheduler feeding larger SP arrays, TF units, L1 and L2 caches, and more framebuffer partitions]

Next-Gen GPU Architecture: Fermi

[Die diagram: Fermi, with host interface, GigaThread engine, shared L2 cache, and multiple DRAM interfaces surrounding the compute array]

GPUs Today

Lessons from Graphics Pipeline

Throughput is paramount

Create, run, & retire lots of threads very rapidly

Use multithreading to hide latency

[Chart: number of simultaneous threads across GPU generations (e.g. GeForce 8800), 1995-2010]

Perspective

1 TeraFLOP in 1993

Computers no longer get faster, just wider

You must re-think your algorithms to be parallel !

Data-parallel computing is the most scalable solution

Why GPU Computing?

Fact: nobody cares about theoretical peak

Challenge: harness GPU power for real application performance

Throughput architecture → massive parallelism → computational horsepower

[Chart: peak GFLOPS over time, GPU vs. CPU]

CUDA Successes

146X: Medical Imaging, U of Utah
36X: Molecular Dynamics, U of Illinois
18X: Video Transcoding, Elemental Tech
50X: Matlab Computing, AccelerEyes
100X: Astrophysics, RIKEN
149X: Financial simulation, Oxford
47X: Linear Algebra, Universidad Jaime
20X: 3D Ultrasound, Techniscan
130X: Quantum Chemistry, U of Illinois
30X: Gene Sequencing, U of Maryland

Accelerating Insight

CPU only:                      4.6 days    2.7 days    3 hours     8 hours
Heterogeneous with Tesla GPU:  30 minutes  27 minutes  16 minutes  13 minutes


GPU Computing 1.0

(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes …)

GPU Computing 1.0: compute pretending to be graphics

Disguise data as textures or geometry

Disguise algorithm as render passes

→ Trick the graphics pipeline into doing your computation!

Term GPGPU coined by Mark Harris

Typical GPGPU Constructs


GPGPU Hardware & Algorithms

GPUs get progressively more capable

Fixed-function → register combiners → shaders

fp32 pixel hardware greatly extends reach

Algorithms get more sophisticated

Cellular automata → PDE solvers → ray tracing

Clever graphics tricks

High-level shading languages emerge

GPU Computing 2.0: Enter CUDA

GPU Computing 2.0: direct compute

Program GPU directly, no graphics-based restrictions

GPU Computing supplants graphics-based GPGPU

November 2006: NVIDIA introduces CUDA

CUDA In One Slide

Thread: per-thread local memory

Block: per-block shared memory; local barrier between the threads of a block

Kernel foo() . . . Kernel bar() . . . : per-device global memory; global barrier between kernels
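To make the hierarchy concrete, here is a minimal sketch, not from the original slides, of a kernel that stages data in per-block shared memory and synchronizes at the local barrier; launching a second kernel afterwards supplies the global barrier. The kernel name and the 256-thread block size are illustrative assumptions.

__global__ void reverse_each_block(float *data)
{
    __shared__ float tile[256];                    // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];                   // each thread stages one element
    __syncthreads();                               // local barrier: all threads in the block wait here

    // safe to read another thread's element now; assumes blockDim.x == 256
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// reverse_each_block<<<nblocks, 256>>>(d_data);
// a kernel launched after this one sees all of d_data updated (per-device global barrier)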

CUDA C Example

Serial C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)  y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

Heterogeneous Programming

Use the right processor for the right job

Serial Code

Parallel Kernel: foo<<< nBlk, nTid >>>(args); . . .

Serial Code

Parallel Kernel: bar<<< nBlk, nTid >>>(args); . . .
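A minimal host-side sketch of this serial/parallel interleaving, not from the original slides, reusing the saxpy_parallel kernel from the CUDA C Example above. Assume it sits inside main() of a .cu file; error checking is omitted and the sizes and fill values are illustrative.

int n = 1 << 20;
size_t bytes = n * sizeof(float);

// serial code: allocate and initialize host arrays
float *x = (float*)malloc(bytes);
float *y = (float*)malloc(bytes);
for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

// copy inputs to the device
float *d_x, *d_y;
cudaMalloc((void**)&d_x, bytes);
cudaMalloc((void**)&d_y, bytes);
cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

// parallel kernel: y = a*x + y, 256 threads per block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

// serial code: bring the result back and clean up
cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_x);  cudaFree(d_y);
free(x);  free(y);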

GPU Computing 3.0: An Ecosystem

GPU Computing 3.0: an emerging ecosystem

Hardware & product lines Education & research

Algorithmic sophistication Consumer applications

Cross-platform standards High-level languages

Products

CUDA is in products from laptops to supercomputers


Emerging HPC Products

New class of hybrid GPU-CPU servers

SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs

Bull Bullx Blade Enclosure: up to 18 Tesla M1060 GPUs

Algorithmic Sophistication

Sort

Sparse matrix

Hash tables

Fast multipole method

Ray tracing (parallel tree traversal)

CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007

Cross-Platform Standards

GPU Computing Applications

CUDA C, OpenCL™, DirectCompute, CUDA Fortran, Java, Python, .NET, MATLAB

NVIDIA GPU with the CUDA Parallel Computing Architecture

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

CUDA Momentum

Over 250 universities teach CUDA

Over 1000 research papers

CUDA-powered TSUBAME: 29th fastest supercomputer in the world

180 Million CUDA GPUs

100,000 active developers

Over 500 CUDA Apps (www.nvidia.com/CUDA)

thrust

Data Structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.

Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.

thrust: a library of data parallel algorithms & data structures with an interface similar to the C++ Standard Template Library for CUDA

C++ template metaprogramming automatically chooses the fastest code path at compile time

thrust::sort

sort.cu

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}

http://thrust.googlecode.com
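The other algorithms listed above compose with the same containers. A small sketch, not from the original slides, that sums a device_vector on the GPU with thrust::reduce:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(void)
{
    // three copies of the value 7, stored on the device
    thrust::device_vector<int> d_vec(3, 7);

    // reduction runs on the GPU; 0 is the initial value
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);

    printf("sum = %d\n", sum);   // prints: sum = 21
    return 0;
}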

GPU Begins to Vanish

Ever-increasing number of codes & commercial packages "just go faster" when a GPU is present

Some bio/chem codes available or porting: NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock, LAMMPS, CHARMM, Q-Chem, Gaussian, AMBER

Consumer applications, e.g. Photoshop, Badaboom, vReveal, Nero MoveIt

OS: Microsoft Windows 7, Mac OS X Snow Leopard


Next-Gen GPU Architecture: Fermi

3 billion transistors
Over 2x the cores (512 total)
~2x the memory bandwidth
L1 and L2 caches
8x the peak DP performance
ECC
C++

SM Microarchitecture

Objective: optimize for GPU computing

New ISA
Revamp issue / control flow
New CUDA core architecture
32 cores per SM (512 total)
64KB of configurable L1$ / shared memory

Ops / clk:  FP32: 32,  FP64: 16,  INT: 32,  SFU: 4,  LD/ST: 16

[SM block diagram: instruction cache, dual schedulers and dispatch units, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, uniform cache]

SM Microarchitecture

New IEEE 754-2008 arithmetic standard

Fused Multiply-Add (FMA) for SP & DP

New integer ALU optimized for 64-bit and extended precision ops

[CUDA core diagram: dispatch port, operand collector, FP unit and INT unit, result queue, fed from the SM's register file]

Hardware Thread Scheduling

Concurrent kernel execution + faster context switch

[Timeline diagram: serial kernel execution runs Kernels 1-5 back to back, while parallel kernel execution overlaps them on the GPU]
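To show how concurrent kernels are expressed, a brief sketch, not from the original slides: independent kernels launched into separate CUDA streams are allowed to overlap on Fermi-class hardware. kernelA/kernelB and their launch configurations are placeholder names.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// two independent kernels, each bound to its own stream (4th launch parameter)
kernelA<<<gridA, blockA, 0, s1>>>(dataA);
kernelB<<<gridB, blockB, 0, s2>>>(dataB);

cudaStreamSynchronize(s1);   // wait for each stream to drain
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);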

More Fermi Goodness

ECC protection for DRAM, L2, L1, RF

Unified 40-bit address space for local, shared, global

5-20x faster atomics

Dual DMA engines for CPU ↔ GPU transfers

ISA extensions for C++ (e.g. virtual functions)

Key CUDA Challenges

Express other programming models elegantly

Producer-consumer: persistent thread blocks reading and writing work queues (see the sketch below)

Task-parallel: kernel, thread block or warp as parallel task

Both common and "power-user" CUDA patterns
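A rough sketch of the persistent-thread-block producer-consumer pattern mentioned above; this is not code from the slides, and the queue layout, names, and per-item work are placeholder assumptions. Each block repeatedly claims the next item from a global counter until the queue is drained.

__global__ void persistent_consumer(const int *work_queue, int num_items,
                                    int *next_item, float *results)
{
    __shared__ int idx;                      // work item claimed by this block
    while (true) {
        if (threadIdx.x == 0)
            idx = atomicAdd(next_item, 1);   // block leader pops the next index
        __syncthreads();                     // broadcast idx to the whole block
        if (idx >= num_items)
            return;                          // queue drained: the block retires

        int item = work_queue[idx];
        // ... process 'item' cooperatively with all threads of the block ...
        if (threadIdx.x == 0)
            results[idx] = 2.0f * item;      // placeholder result

        __syncthreads();                     // don't let the leader grab a new idx
    }                                        // while others still use the old one
}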

Foster more high-level languages & platforms

Improve & mature development environment

Key GPU Workloads

Computational graphics

Scientific and numeric computing

Image processing: video & images

Computer vision

Speech & natural language

Machine learning

The Future of Computing?

Forward-looking statements:

All future interesting problems are throughput problems.

GPUs will evolve to be the general-purpose throughput processors.

CPUs as we know them will become (already are?) "good enough", and shrink to a corner of the die.

Final Thoughts: Education

We should teach parallel computing in CS 1 or CS 2

Computers don't get faster, just wider - now

Manycore is the future of computing

Heapsort and mergesort

Both O(n lg n)

One parallel-friendly, one not: mergesort's recursive halves are independent, while heapsort serializes every step through one shared heap (see the sketch below)

Students need to understand this early
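A plain-C sketch, not from the slides, of why one is parallel-friendly: merge sort's two recursive calls touch disjoint halves and could run as independent tasks, whereas heapsort threads every comparison through a single shared heap.

#include <string.h>

static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void merge_sort(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);   /* independent of the call below:           */
    merge_sort(a, tmp, mid, hi);   /* the two halves never touch the same data */
    merge(a, tmp, lo, mid, hi);
}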


Miscellaneous Thoughts

NVIDIA Resources Available

NVIDIA enables heterogeneous computing research and teaching:

Grants for leading researchers in GPU Computing Ph.D. Fellowship program Professor Partnership program

Discounted hardware

GPU Computing Ventures: venture fund focused on GPU computing

Technical Training and Assistance
