Early 3D Graphics

Perspective study of a chalice, Paolo Uccello, circa 1450

Early Graphics Hardware

Artist using a perspective machine Albrecht Dürer, 1525

Early Electronic Graphics Hardware

SKETCHPAD: A Man-Machine Graphical Communication System Ivan Sutherland, 1963

The Geometry Engine

The Geometry Engine: A VLSI Geometry System for Graphics Jim Clark, 1982

The Graphics Pipeline

Vertex Transform & Lighting

Triangle Setup & Rasterization

Texturing & Pixel Shading

Depth Test & Blending

Framebuffer


The Graphics Pipeline

Key abstraction of real-time graphics

Hardware used to look like this

One chip/board per stage

Fixed data flow through pipeline

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

SGI RealityEngine (1993)

[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Kurt Akeley. RealityEngine Graphics. SIGGRAPH 93.

SGI InfiniteReality (1997)

[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Montrym et al. InfiniteReality: A Real-Time Graphics System. SIGGRAPH 97.

The Graphics Pipeline

Remains a useful abstraction

Hardware used to look like this

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline

Hardware used to look like this:

Vertex, pixel processing became programmable

// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added (Geometry)

[Pipeline diagram: Vertex → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer, with the vecAdd kernel shown again as the programmable-stage example]

The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added (Geometry, Tessellation)

GPU architecture increasingly centers around execution

[Pipeline diagram: Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer, with the vecAdd kernel shown again]

Modern GPUs: Unified Design

[Diagram: discrete design, with Shader A / B / C / D each running on its own unit with input and output buffers, vs. unified design, where all shader work runs on a single Shader Core]

Vertex, pixel shaders, etc. become threads running different programs on a flexible core

GeForce 8: Modern GPU Architecture

[Block diagram: Host → Input Assembler → Setup / Rasterize; Vertex, Geom, and Pixel Thread Issue feed a unified Thread Processor built from SP arrays with texture filtering (TF) units, L1 and L2 caches, and framebuffer partitions]

Modern GPU Architecture: GT200

[Block diagram: GT200 scales the same design up, with a Thread Scheduler feeding larger SP arrays, TF units, L1 and L2 caches, and more framebuffer partitions]

Next-Gen GPU Architecture: Fermi

[Die diagram: Fermi, with host interface, GigaThread engine, shared L2 cache, and multiple DRAM interfaces surrounding the compute array]

GPUs Today

Lessons from Graphics Pipeline

Throughput is paramount

Create, run, & retire lots of threads very rapidly

Use multithreading to hide latency

[Chart: number of simultaneous threads across GPU generations (e.g. GeForce 8800), 1995-2010]

Perspective

1 TeraFLOP in 1993

Computers no longer get faster, just wider

You must re-think your algorithms to be parallel !

Data-parallel computing is the most scalable solution

Why GPU Computing?

Fact: nobody cares about theoretical peak

Challenge: harness GPU power for real application performance

Throughput architecture → massive parallelism → computational horsepower

[Chart: peak GFLOPS over time, GPU vs. CPU]

CUDA Successes

146X: Medical Imaging, U of Utah
36X: Molecular Dynamics, U of Illinois
18X: Video Transcoding, Elemental Tech
50X: Matlab Computing, AccelerEyes
100X: Astrophysics, RIKEN
149X: Financial simulation, Oxford
47X: Linear Algebra, Universidad Jaime
20X: 3D Ultrasound, Techniscan
130X: Quantum Chemistry, U of Illinois
30X: Gene Sequencing, U of Maryland

Accelerating Insight

CPU only:                      4.6 days    2.7 days    3 hours     8 hours
Heterogeneous with Tesla GPU:  30 minutes  27 minutes  16 minutes  13 minutes


GPU Computing 1.0

(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes …)

GPU Computing 1.0: compute pretending to be graphics

Disguise data as textures or geometry

Disguise algorithm as render passes

→ Trick the graphics pipeline into doing your computation!

Term GPGPU coined by Mark Harris

Typical GPGPU Constructs


GPGPU Hardware & Algorithms

GPUs get progressively more capable

Fixed-function → register combiners → shaders

fp32 pixel hardware greatly extends reach

Algorithms get more sophisticated

Cellular automata → PDE solvers → ray tracing

Clever graphics tricks

High-level shading languages emerge

GPU Computing 2.0: Enter CUDA

GPU Computing 2.0: direct compute

Program GPU directly, no graphics-based restrictions

GPU Computing supplants graphics-based GPGPU

November 2006: NVIDIA introduces CUDA

CUDA In One Slide

Thread: per-thread local memory

Block: per-block shared memory; local barrier between the threads of a block

Kernel foo() . . . Kernel bar() . . . : per-device global memory; global barrier between kernels
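To make the hierarchy concrete, here is a minimal sketch, not from the original slides, of a kernel that stages data in per-block shared memory and synchronizes at the local barrier; launching a second kernel afterwards supplies the global barrier. The kernel name and the 256-thread block size are illustrative assumptions.

__global__ void reverse_each_block(float *data)
{
    __shared__ float tile[256];                    // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];                   // each thread stages one element
    __syncthreads();                               // local barrier: all threads in the block wait here

    // safe to read another thread's element now; assumes blockDim.x == 256
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// reverse_each_block<<<nblocks, 256>>>(d_data);
// a kernel launched after this one sees all of d_data updated (per-device global barrier)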

CUDA C Example

Serial C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)  y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

Heterogeneous Programming

Use the right processor for the right job

Serial Code

Parallel Kernel: foo<<< nBlk, nTid >>>(args); . . .

Serial Code

Parallel Kernel: bar<<< nBlk, nTid >>>(args); . . .
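A minimal host-side sketch of this serial/parallel interleaving, not from the original slides, reusing the saxpy_parallel kernel from the CUDA C Example above. Assume it sits inside main() of a .cu file; error checking is omitted and the sizes and fill values are illustrative.

int n = 1 << 20;
size_t bytes = n * sizeof(float);

// serial code: allocate and initialize host arrays
float *x = (float*)malloc(bytes);
float *y = (float*)malloc(bytes);
for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

// copy inputs to the device
float *d_x, *d_y;
cudaMalloc((void**)&d_x, bytes);
cudaMalloc((void**)&d_y, bytes);
cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

// parallel kernel: y = a*x + y, 256 threads per block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

// serial code: bring the result back and clean up
cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_x);  cudaFree(d_y);
free(x);  free(y);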

GPU Computing 3.0: An Ecosystem

GPU Computing 3.0: an emerging ecosystem

Hardware & product lines Education & research

Algorithmic sophistication Consumer applications

Cross-platform standards High-level languages

Products

CUDA is in products from laptops to supercomputers


Emerging HPC Products

New class of hybrid GPU-CPU servers

SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs

Bull Bullx Blade Enclosure: up to 18 Tesla M1060 GPUs

Algorithmic Sophistication

Sort

Sparse matrix

Hash tables

Fast multipole method

Ray tracing (parallel tree traversal)

CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007

Cross-Platform Standards

GPU Computing Applications

CUDA C, OpenCL™, DirectCompute, CUDA Fortran, Java, Python, .NET, MATLAB

NVIDIA GPU with the CUDA Parallel Computing Architecture

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

CUDA Momentum

Over 250 universities teach CUDA

Over 1000 research papers

CUDA-powered TSUBAME: 29th fastest supercomputer in the world

180 Million CUDA GPUs

100,000 active developers

Over 500 CUDA Apps (www.nvidia.com/CUDA)

thrust

Data Structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.

Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.

thrust: a library of data parallel algorithms & data structures with an interface similar to the C++ Standard Template Library for CUDA

C++ template metaprogramming automatically chooses the fastest code path at compile time

thrust::sort

sort.cu

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}

http://thrust.googlecode.com
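The other algorithms listed above compose with the same containers. A small sketch, not from the original slides, that sums a device_vector on the GPU with thrust::reduce:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main(void)
{
    // three copies of the value 7, stored on the device
    thrust::device_vector<int> d_vec(3, 7);

    // reduction runs on the GPU; 0 is the initial value
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);

    printf("sum = %d\n", sum);   // prints: sum = 21
    return 0;
}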

GPU Begins to Vanish

Ever-increasing number of codes & commercial packages "just go faster" when a GPU is present

Some bio/chem codes available or porting: NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock, LAMMPS, CHARMM, Q-Chem, Gaussian, AMBER

Consumer applications, e.g. Photoshop, Badaboom, vReveal, Nero MoveIt

OS: Microsoft Windows 7, Mac OS X Snow Leopard


Next-Gen GPU Architecture: Fermi

3 billion transistors
Over 2x the cores (512 total)
~2x the memory bandwidth
L1 and L2 caches
8x the peak DP performance
ECC
C++

SM Microarchitecture

Objective: optimize for GPU computing

New ISA
Revamp issue / control flow
New CUDA core architecture
32 cores per SM (512 total)
64KB of configurable L1$ / shared memory

Ops / clk:  FP32: 32,  FP64: 16,  INT: 32,  SFU: 4,  LD/ST: 16

[SM block diagram: instruction cache, dual schedulers and dispatch units, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, uniform cache]

SM Microarchitecture

New IEEE 754-2008 arithmetic standard

Fused Multiply-Add (FMA) for SP & DP

New integer ALU optimized for 64-bit and extended precision ops

[CUDA core diagram: dispatch port, operand collector, FP unit and INT unit, result queue, fed from the SM's register file]

Hardware Thread Scheduling

Concurrent kernel execution + faster context switch

[Timeline diagram: serial kernel execution runs Kernels 1-5 back to back, while parallel kernel execution overlaps them on the GPU]
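To show how concurrent kernels are expressed, a brief sketch, not from the original slides: independent kernels launched into separate CUDA streams are allowed to overlap on Fermi-class hardware. kernelA/kernelB and their launch configurations are placeholder names.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// two independent kernels, each bound to its own stream (4th launch parameter)
kernelA<<<gridA, blockA, 0, s1>>>(dataA);
kernelB<<<gridB, blockB, 0, s2>>>(dataB);

cudaStreamSynchronize(s1);   // wait for each stream to drain
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);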

More Fermi Goodness

ECC protection for DRAM, L2, L1, RF

Unified 40-bit address space for local, shared, global

5-20x faster atomics

Dual DMA engines for CPU ↔ GPU transfers

ISA extensions for C++ (e.g. virtual functions)

Key CUDA Challenges

Express other programming models elegantly

Producer-consumer: persistent thread blocks reading and writing work queues (see the sketch below)

Task-parallel: kernel, thread block or warp as parallel task

Both common and "power-user" CUDA patterns
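A rough sketch of the persistent-thread-block producer-consumer pattern mentioned above; this is not code from the slides, and the queue layout, names, and per-item work are placeholder assumptions. Each block repeatedly claims the next item from a global counter until the queue is drained.

__global__ void persistent_consumer(const int *work_queue, int num_items,
                                    int *next_item, float *results)
{
    __shared__ int idx;                      // work item claimed by this block
    while (true) {
        if (threadIdx.x == 0)
            idx = atomicAdd(next_item, 1);   // block leader pops the next index
        __syncthreads();                     // broadcast idx to the whole block
        if (idx >= num_items)
            return;                          // queue drained: the block retires

        int item = work_queue[idx];
        // ... process 'item' cooperatively with all threads of the block ...
        if (threadIdx.x == 0)
            results[idx] = 2.0f * item;      // placeholder result

        __syncthreads();                     // don't let the leader grab a new idx
    }                                        // while others still use the old one
}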

Foster more high-level languages & platforms

Improve & mature development environment

Key GPU Workloads

Computational graphics

Scientific and numeric computing

Image processing: video & images

Computer vision

Speech & natural language

Machine learning

The Future of Computing?

Forward-looking statements:

All future interesting problems are throughput problems.

GPUs will evolve to be the general-purpose throughput processors.

CPUs as we know them will become (already are?) "good enough", and shrink to a corner of the die.

Final Thoughts: Education

We should teach parallel computing in CS 1 or CS 2

Computers don't get faster, just wider - now

Manycore is the future of computing

Heapsort and mergesort

Both O(n lg n)

One parallel-friendly, one not: mergesort's recursive halves are independent, while heapsort serializes every step through one shared heap (see the sketch below)

Students need to understand this early
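A plain-C sketch, not from the slides, of why one is parallel-friendly: merge sort's two recursive calls touch disjoint halves and could run as independent tasks, whereas heapsort threads every comparison through a single shared heap.

#include <string.h>

static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void merge_sort(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);   /* independent of the call below:           */
    merge_sort(a, tmp, mid, hi);   /* the two halves never touch the same data */
    merge(a, tmp, lo, mid, hi);
}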


Miscellaneous Thoughts

NVIDIA Resources Available

NVIDIA enables heterogeneous computing research and teaching:

Grants for leading researchers in GPU Computing Ph.D. Fellowship program Professor Partnership program

Discounted hardware

GPU Computing Ventures: venture fund focused on GPU computing

Technical Training and Assistance
