Graphics Processing Unit (GPU)

What GPUs Do
[Product lines: GeForce, Quadro, Tegra, Tesla]

Parallel Computing, Illustrated

The "New" Moore's Law
. Computers no longer get faster, just wider
. You must re-think your algorithms to be parallel!
. Data-parallel computing is the most scalable solution

The World's Programmers

Why GPU Computing?
[Charts: single- and double-precision GFlops/sec and GBytes/sec, 2003-2010, for NVIDIA Tesla 8-, 10-, and 20-series GPUs vs. x86 CPUs (Nehalem and Westmere @ 3 GHz); ECC off]

Accelerating Insight
[Chart: application runtimes, CPU only (4.6 days, 2.7 days, 3 hours, 8 hours) vs. heterogeneous with Tesla GPU (30, 27, 16, 13 minutes)]

Tracking Space Junk
. The Air Force monitors 19,000 pieces of space debris
. Even a paint flake can destroy a spacecraft
. 21x CUDA speedup narrows uncertainty bands and reduces false alarms

Modeling Air Traffic
. Air traffic is increasing
. Predictive modeling can avoid airport overloading
. Variables: flight paths, air speed, altitude, descent rates
. NASA ported their model to CUDA
. 10-minute process reduced to 3 seconds

Detecting IEDs
[Chart: roadside scanning speed, CPU: 12 mph vs. GPU: 77 mph]

Reducing Radiation from CT Scans
. 28,000 people/year develop cancer from CT scans
. UCSD: advanced CT reconstruction reduces radiation by 35-70x
. CPUs: 2 hours (unusable); CUDA: 2 minutes (clinically practical)

Operating on a Beating Heart
. Only 2% of surgeons can operate on a beating heart
. The patient stands to lose 1 point of IQ every 10 minutes with the heart stopped
. The GPU enables real-time motion compensation to virtually stop the beating heart for surgeons
Courtesy Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier

Simulating Shampoo (Surfactant Simulation)
"The outcome is quite spectacular… with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind."
Axel Kohlmeyer, Temple University

Cleaning Cotton
. Problem: cotton is over-cleaned, causing fiber damage
. GPU-based machine vision enables real-time feedback during cleaning
. 96% lower fiber damage
. $100M additional potential revenue
"Parallel Algorithm for GPU Processing; for use in High Speed Machine Vision Sensing of Cotton Lint Trash", Mathew G. Pelletier, February 8, 2008

GPUs Power 3 of the Top 5…
[Chart: Gigaflops for Tianhe-1A, Jaguar, Nebulae, Tsubame, Tera 100]

…Using Less Power
[Chart: Gigaflops and Megawatts for Tianhe-1A, Jaguar, Nebulae, Tsubame, Tera 100]

Early 3D Graphics
Perspective study of a chalice (Paolo Uccello, circa 1450)

Early Graphics Hardware
Artist using a perspective machine (Albrecht Dürer, 1525)
Perspective study of a chalice (Paolo Uccello, circa 1450)

Early Electronic Graphics Hardware
SKETCHPAD: A Man-Machine Graphical Communication System (Ivan Sutherland, 1963)

The Graphics Pipeline
The Geometry Engine: A VLSI Geometry System for Graphics (Jim Clark, 1982)

The Graphics Pipeline
Vertex Transform & Lighting → Triangle Setup & Rasterization → Texturing & Pixel Shading → Depth Test & Blending → Framebuffer
© NVIDIA Corporation 2011

The Graphics Pipeline
. Key abstraction of real-time graphics
. Hardware used to look like this: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
. One chip/board per stage
. Fixed data flow through the pipeline

SGI RealityEngine (1993)
RealityEngine Graphics (Kurt Akeley, SIGGRAPH 93)

SGI InfiniteReality (1997)
InfiniteReality: A Real-Time Graphics
System (Montrym et al., SIGGRAPH 97)

The Graphics Pipeline
. Remains a useful abstraction
. Hardware used to look like this
. Vertex, pixel processing became programmable; a Cg-style pixel shader (reconstructed from the slide's truncated listing):

    pixel_out main(uniform sampler2D texture : TEXUNIT0,
                   pixel_in IN)
    {
        pixel_out OUT;
        float d = clamp(1.0 - pow(dot(IN.lightdist, IN.lightdist), 2.0), 0.0, 1.0);
        float3 color = tex2D(texture, IN.texcoord).rgb;
        OUT.color = color * (d + 0.4);
        return OUT;
    }

. New stages added: Geometry, then Tessellation
. Pipeline today: Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer
. GPU architecture increasingly centers around shader execution

Modern GPUs: Unified Design
. Discrete design: separate units for Shader A, B, C, D, each with its own input and output buffers
. Unified design: one shader core behind shared input/output buffers
. Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core

GeForce 8: Modern GPU Architecture
[Diagram: host and input assembler feeding setup & rasterize plus vertex, geometry, and pixel thread issue; an array of streaming processors (SP) with texture fetch (TF) units and L1 caches; a thread processor; L2 caches and framebuffer partitions]

Modern GPU Architecture: GT200
[Diagram: the same design scaled up, with a thread scheduler, more SP/TF clusters, and more L2/framebuffer partitions]

Current GPU Architecture: Fermi
[Diagram: NVIDIA "Fermi" architecture: shader cores around a shared L2 cache, six DRAM interfaces, a host interface, and the GigaThread scheduler]

GPUs Today
Lessons from the graphics pipeline:
. Throughput is paramount
. Create, run, and retire lots of threads very rapidly
. Use multithreading to hide latency
[Chart: GPU transistor counts, 1995-2010: RIVA 128 (3M), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), "Fermi" (3B)]

How to build a parallel machine: SIMD
Thinking Machines CM-2; MasPar MP-1 (front), Goddard MPP (back)

How to build a parallel machine: Hardware Multithreading
Tera MTA

How to build a parallel machine: Symmetric Multiprocessing
Intel Core2 Duo
SGI Challenge

Fermi, Oversimplified
. 32-wide SIMD (two 16-wide datapaths)
. 48-way hardware multithreading
. 16-way SMP
. 24,576 threads in flight @ 512 FMA ops per clock

GPU Computing 1.0: GPGPU
(ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes…)
Compute pretending to be graphics:
. Disguise data as triangles or textures
. Disguise the algorithm as render passes and shaders
Trick the graphics pipeline into doing your computation!

Typical GPGPU Constructs
"A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware", Nolan Goodnight et al., 2003

GPU Computing 2.0: CUDA
. Thread: runs serial code, with per-thread local memory
. Block: threads that synchronize at a local barrier and share per-block shared memory
. Kernels: serial code launches kernel foo(), then kernel bar(); a global barrier separates kernel launches, and all threads see per-device global memory

GPU Computing 3.0: An Ecosystem
[Diagram: languages & APIs (Fortran, …), algorithmic libraries, mathematical packages, integrated development environments, tools & partners, cloud services, hardware & product lines, research & education, arranged by algorithmic sophistication]

GPU Computing by the Numbers
. 300,000,000 CUDA-capable GPUs
. 500,000 CUDA Toolkit downloads
. 100,000 active CUDA developers
. 400 universities teaching CUDA
. 12 CUDA Centers of Excellence

Workloads
. Each GPU is designed to target a mix of known and speculative workloads
. The art of GPU design is choosing