Early 3D Graphics
Perspective study of a chalice, Paolo Uccello, circa 1450
© NVIDIA Corporation 2009

Early Graphics Hardware
Artist using a perspective machine, Albrecht Dürer, 1525
Early Electronic Graphics Hardware
SKETCHPAD: A Man-Machine Graphical Communication System, Ivan Sutherland, 1963
The Graphics Pipeline
The Geometry Engine: A VLSI Geometry System for Graphics, Jim Clark, 1982
The Graphics Pipeline
Vertex Transform & Lighting
Triangle Setup & Rasterization
Texturing & Pixel Shading
Depth Test & Blending
Framebuffer
The Graphics Pipeline

Vertex
Rasterize
Pixel
Test & Blend
Framebuffer

Key abstraction of real-time graphics
Hardware used to look like this:
One chip/board per stage
Fixed data flow through pipeline
SGI RealityEngine (1993)
Vertex
Rasterize
Pixel
Test & Blend
Framebuffer
Kurt Akeley, RealityEngine Graphics, SIGGRAPH 93

SGI InfiniteReality (1997)
Vertex
Rasterize
Pixel
Test & Blend
Framebuffer
Montrym et al., InfiniteReality: A Real-Time Graphics System, SIGGRAPH 97

The Graphics Pipeline
Vertex
Rasterize
Pixel
Test & Blend
Framebuffer

Remains a useful abstraction
Hardware used to look like this
The Graphics Pipeline

// Each thread performs one pairwise addition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Vertex
Rasterize
Pixel
Test & Blend
Framebuffer

Hardware used to look like this:
Vertex, pixel processing became programmable
The Graphics Pipeline

// Each thread performs one pairwise addition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Vertex
Tessellation
Geometry
Rasterize
Pixel
Test & Blend
Framebuffer

Hardware used to look like this
Vertex, pixel processing became programmable
New stages added
GPU architecture increasingly centers around shader execution
Modern GPUs: Unified Design
Discrete Design: Shader A → Shader B → Shader C → Shader D, with input/output buffers between stages
Unified Design: a single Shader Core fed by input/output buffers

Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core

GeForce 8: Modern GPU Architecture
Host
Input Assembler, Setup & Rasterize
Vertex Thread Issue, Geom Thread Issue, Pixel Thread Issue
Thread Processor: SP cores with TF units and L1 caches, shared L2 caches and framebuffer partitions
Modern GPU Architecture: GT200
Host
Input Assembler, Setup & Rasterize
Vertex Thread Issue, Geom Thread Issue, Pixel Thread Issue
Thread Scheduler: two banks of SP cores with TF units and L1 caches, shared L2 caches and framebuffer partitions

Next-Gen GPU Architecture: Fermi

[Diagram: Fermi die with host interface, GigaThread scheduler, six DRAM interfaces, and a shared L2 cache]
NVIDIA next-generation CUDA architecture

GPUs Today
Lessons from Graphics Pipeline
Throughput is paramount
Create, run, & retire lots of threads very rapidly
Use multithreading to hide latency

[Chart: simultaneous threads in flight per GPU generation, 1995–2010]
Perspective
1 TeraFLOP in 1993
Computers no longer get faster, just wider
You must re-think your algorithms to be parallel!
Data-parallel computing is most scalable solution
Why GPU Computing?

Throughput architecture → massive parallelism
Massive parallelism → computational horsepower

Fact: nobody cares about theoretical peak
Challenge: harness GPU power for real application performance

[Chart: GFLOPS over time, GPU vs CPU]

CUDA Successes
146X  Medical Imaging, U of Utah
36X   Molecular Dynamics, U of Illinois
18X   Video Transcoding, Elemental Tech
50X   Matlab Computing, AccelerEyes
100X  Astrophysics, RIKEN
149X  Financial simulation, Oxford
47X   Linear Algebra, Universidad Jaime
20X   3D Ultrasound, Techniscan
130X  Quantum Chemistry, U of Illinois
30X   Gene Sequencing, U of Maryland

Accelerating Insight
CPU Only → Heterogeneous with Tesla GPU
4.6 Days → 30 Minutes
2.7 Days → 27 Minutes
3 Hours → 16 Minutes
8 Hours → 13 Minutes
GPU Computing 1.0
(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes 5)
GPU Computing 1.0: compute pretending to be graphics
Disguise data as textures or geometry
Disguise algorithm as render passes
Trick the graphics pipeline into doing your computation!
Term GPGPU coined by Mark Harris
Typical GPGPU Constructs
GPGPU Hardware & Algorithms
GPUs get progressively more capable
Fixed-function → register combiners → shaders
fp32 pixel hardware greatly extends reach
Algorithms get more sophisticated
Cellular automata → PDE solvers → ray tracing
Clever graphics tricks
High-level shading languages emerge

GPU Computing 2.0: Enter CUDA
GPU Computing 2.0: direct compute
Program GPU directly, no graphics-based restrictions
GPU Computing supplants graphics-based GPGPU
November 2006: NVIDIA introduces CUDA
CUDA In One Slide
Thread: per-thread local memory
Block: per-block shared memory, local barrier
Kernel foo() . . .
Global barrier between kernels
Kernel bar() . . .
Per-device global memory
CUDA C Example
Serial C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

Heterogeneous Programming
Use the right processor for the right job
Serial Code
Parallel Kernel
foo<<<nBlk, nTid>>>(args);
. . .
Serial Code
Parallel Kernel
bar<<<nBlk, nTid>>>(args);
. . .
GPU Computing 3.0: An Ecosystem
GPU Computing 3.0: an emerging ecosystem
Hardware & product lines Education & research
Algorithmic sophistication Consumer applications
Cross-platform standards High-level languages
Products
CUDA is in products from laptops to supercomputers
Emerging HPC Products
New class of hybrid GPU-CPU servers
SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs
Bull Bullx Blade Enclosure: up to 18 Tesla M1060 GPUs
Algorithmic Sophistication
Sort
Sparse matrix
Hash tables
Fast multipole method
Ray tracing (parallel tree traversal)
Results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007

Cross-Platform Standards
GPU Computing Applications
CUDA C, OpenCL, DirectCompute, CUDA Fortran, Java, Python, .NET, MATLAB
NVIDIA GPU with the CUDA Parallel Computing Architecture
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

CUDA Momentum
Over 250 universities teach CUDA
Over 1000 research papers
CUDA-powered TSUBAME: 29th fastest supercomputer in the world
180 Million CUDA GPUs
100,000 active developers
Over 500 CUDA Apps
www.nvidia.com/CUDA

thrust
Data Structures              Algorithms
thrust::device_vector        thrust::sort
thrust::host_vector          thrust::reduce
thrust::device_ptr           thrust::exclusive_scan
Etc.                         Etc.
thrust: a library of data parallel algorithms & data structures with an interface similar to the C++ Standard Template Library for CUDA
C++ template metaprogramming automatically chooses the fastest code path at compile time
thrust::sort
sort.cu

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());
    return 0;
}

http://thrust.googlecode.com
GPU Begins to Vanish
Ever-increasing number of codes & commercial packages "just go faster" when GPU is present
Some bio/chem codes available or porting: NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock, LAMMPS, CHARMM, Q-Chem, Gaussian
Consumer applications, e.g. Photoshop, Badaboom, vReveal, Nero MoveIt
OS: Microsoft Windows 7, Mac OS X Snow Leopard
Next-Gen GPU Architecture: Fermi
3 billion transistors
Over 2x the cores (512 total)
~2x the memory bandwidth
L1 and L2 caches
8x the peak DP performance
ECC
C++
SM Microarchitecture
Objective: optimize for GPU computing

New ISA
Revamp issue / control flow
New CUDA core architecture
32 cores per SM (512 total)
64KB of configurable L1$ / shared memory

[Diagram: Instruction Cache; dual Scheduler/Dispatch; Register File; 32 Cores; Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache]

Ops / clk: FP32 32, FP64 16, INT 32, SFU 4, LD/ST 16
SM Microarchitecture

New IEEE 754-2008 arithmetic standard
Fused Multiply-Add (FMA) for SP & DP
New integer ALU optimized for 64-bit and extended precision ops

[Diagram: CUDA Core with Dispatch Port, Operand Collector, FP Unit, INT Unit, and Result Queue, inside the SM block diagram]
Hardware Thread Scheduling
Concurrent kernel execution + faster context switch
[Timeline: serial execution runs Kernel 1 through Kernel 5 back-to-back; parallel execution overlaps independent kernels]

Serial Kernel Execution vs. Parallel Kernel Execution

More Fermi Goodness
ECC protection for DRAM, L2, L1, RF
Unified 40-bit address space for local, shared, global
5-20x faster atomics
Dual DMA engines for CPU ↔ GPU transfers
ISA extensions for C++ (e.g. virtual functions)
Key CUDA Challenges
Express other programming models elegantly
Producer-consumer: persistent thread blocks reading and writing work queues
Task-parallel: kernel, thread block or warp as parallel task
Both common "power-user" CUDA patterns
Foster more high-level languages & platforms
Improve & mature development environment

Key GPU Workloads
Computational graphics
Scientific and numeric computing
Image processing: video & images
Computer vision
Speech & natural language
Machine learning The Future of Computing?
Forward-looking statements:
All future interesting problems are throughput problems.
GPUs will evolve to be the general-purpose throughput processors.
CPUs as we know them will become (already are?) "good enough", and shrink to a corner of the die.
Final Thoughts: Education
We should teach parallel computing in CS 1 or CS 2
Computers don't get faster, just wider
Manycore is the future of computing
Heapsort and mergesort
Both O(n lg n)
One parallel-friendly, one not
Students need to understand this early
Miscellaneous Thoughts

NVIDIA Resources Available
NVIDIA enables heterogeneous computing research and teaching:
Grants for leading researchers in GPU Computing
Ph.D. Fellowship program
Professor Partnership program
Discounted hardware
GPU Computing Ventures: venture fund focused on GPU computing
Technical Training and Assistance