A BRIEF HISTORY OF GPGPU
Mark Harris
Chief Technologist, GPU Computing
UNC Ph.D. 2003


General-Purpose computation on Graphics Processing Units

THE FIRST GPGPU: IKONAS RDS-3000
1978: Nick England & Mary Whitton founded Ikonas Graphics Systems
Tim Van Hook wrote microcode for solid modeling (SIGGRAPH ’86)

From a 1985 video: “All computation is taking place in the Adage 3000 Display” (http://www.virhistory.com/ikonas/ikonas.html)

UNC PIXEL PLANES AND PIXELFLOW
Procedural textures on Pixel Planes 5 (Rhodes et al. 1992)

[Image: cover of the Proceedings of the 1992 Symposium on Interactive 3D Graphics, Cambridge, Massachusetts, 29 March – 1 April 1992]

PixelFlow: 100,000+ 100 MHz 8-bit processors
Early real-time programmable shading (Olano/Lastra ’98)
Kedem et al. (’98) used PixelFlow for Unix password cracking

GEFORCE 1-3: THE DAWN OF GPGPU (’99-’01)
GeForce 256: first “GPU”
GeForce 3: first programmable GPU
Vertex: programmable vertex transforms, 32-bit float
Data-dependent, configurable texturing + register combiners
Enabled early GPGPU results:
Hoff (1999): Voronoi diagrams on TNT2
Larsen & McAllister (2001): first GPU matrix multiplication (8-bit)
Rumpf & Strzodka (2001): first GPU PDEs (diffusion, image segmentation)
NVIDIA SDK Game of Life, Shallow Water (Greg James, 2001)

PHYSICALLY BASED SIMULATION ON GEFORCE 3
Approximate simulation of natural phenomena: boiling liquid, fluid convection, chemical reaction-diffusion
Inaccurate due to low GPU precision

“Physically-Based Visual Simulation on Graphics Hardware.” Harris, Coombe, Scheuermann, and Lastra. Graphics Hardware 2002.

NAMING A TREND
“Application of graphics hardware to non-graphics applications”
“General computations on graphics hardware”
“Exploiting special-purpose hardware for alternative purposes”

Let’s name this thing that people are doing!
I coined “GPGPU” and, in November 2002, created a home page on the web to collect research and resources
Interest grew quickly: launched GPGPU.org in August 2003

GEFORCE FX (2003): FLOATING-POINT PIXELS
True programmability enabled broader simulation research

Ray Tracing (Purcell, 2002), Photon Maps (Purcell, 2003)

Radiosity (Carr et al., 2003 & Coombe et al., 2004)

PDE solvers

Red-black Gauss-Seidel (Harris et al., 2003)

Conjugate gradient (Bolz et al. 2003, Krueger et al. 2003)

Multigrid (Goodnight et al. 2003)

Physically-based simulation

Fluid and cloud simulation (Krueger et al. 2003, Harris et al. 2003)

Cloth simulation (Green, 2003)

Ice crystal formation (Kim and Lin, 2003)

FFT (Moreland and Angel, 2003)

High-level language: Brook for GPUs (Buck et al. 2004)

GPU CLOUD SIMULATION
My Ph.D. dissertation: visually realistic cloud simulation on GPUs
2D & 3D incompressible Navier-Stokes fluid
Thermodynamics (latent heat, diffusion)
Water condensation / evaporation
Light scattering simulation for rendering
Programmed in OpenGL with pixel shaders

“Real-Time Cloud Simulation and Rendering.” Mark Harris, Ph.D. dissertation, University of North Carolina, 2003.

CUDA AND THE G80 GPU (2006)

First GPU architecture and software platform designed for computing
Dedicated computing mode: threads rather than pixels/vertices
General, byte-addressable memory architecture
First C/C++ language and compiler for GPUs
CUDA C++ defines a minimally extended subset of C++ with parallelism
2007 began a massive surge in GPGPU development, and not just by graphics Ph.D.s

ACCELERATING DISCOVERIES

USING A SUPERCOMPUTER POWERED BY 3,000 TESLA PROCESSORS, UNIVERSITY OF ILLINOIS SCIENTISTS PERFORMED THE FIRST ALL-ATOM SIMULATION OF THE HIV VIRUS AND DISCOVERED THE CHEMICAL STRUCTURE OF ITS CAPSID — “THE PERFECT TARGET FOR FIGHTING THE INFECTION.”

WITHOUT GPUS, THE SUPERCOMPUTER WOULD NEED TO BE 5X LARGER FOR SIMILAR PERFORMANCE.

FROM HPC TO ENTERPRISE DATACENTERS

Oil & Gas, Higher Ed, Government, Supercomputing, Finance, Consumer Web

Air Force Research Laboratory

Tokyo Institute of Technology

Naval Research Laboratory

MACHINE LEARNING USING DEEP NEURAL NETWORKS

[Figure: deep neural network, from input to result]

Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009
“Visual Object Recognition Using Deep Convolutional Neural Networks,” Rob Fergus (New York University / Facebook), http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php#2985

GPU-ACCELERATED DEEP LEARNING START-UPS

Image Detection

Face Recognition

Gesture Recognition

Video Search & Analytics

Speech Recognition & Translation

Image and Video Understanding

Recommendation Engines

Indexing & Search

COMMON PROGRAMMING APPROACHES
Across Heterogeneous Architectures

Libraries: AmgX, cuBLAS

Compiler Directives

Programming Languages

UNIFIED MEMORY: DRAMATICALLY LOWER DEVELOPER EFFORT

[Figure: past developer view, with separate System Memory and GPU Memory, versus developer view with Unified Memory, a single memory space]

PARALLELISM IN MAINSTREAM LANGUAGES

Enable more programmers to write parallel software
Give programmers the choice of language to use
Parallel computing support in key languages


FUTURE C++: PARALLEL STL

std::vector vec = ...;

// previous standard sequential loop
std::for_each(vec.begin(), vec.end(), f);

// explicitly sequential loop
std::for_each(std::seq, vec.begin(), vec.end(), f);

// permitting parallel execution
std::for_each(std::par, vec.begin(), vec.end(), f);

Complete set of parallel primitives: for_each, sort, reduce, scan, etc.

N3960 Technical Specification Working Draft: the ISO C++ committee voted unanimously to accept it as an official technical specification working draft (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4352.html)
Prototype: https://github.com/n3554/n3554
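The execution-policy overloads above were later standardized in C++17, where the policies are spelled std::execution::seq and std::execution::par and are declared in <execution>. The following is a minimal sketch, assuming a C++17 standard library that ships <execution> (with GCC’s libstdc++, genuinely parallel execution may require Intel TBB to be installed and linked; otherwise the calls may run serially). The helper names square_all, sort_all, parallel_sum, and sequential_sum are hypothetical and exist only for illustration.

```cpp
#include <algorithm>  // std::for_each, std::sort (policy overloads)
#include <cassert>
#include <execution>  // std::execution::seq, std::execution::par
#include <numeric>    // std::reduce
#include <vector>

// Hypothetical helper: square every element in place. Under std::execution::par
// the library may (but need not) run iterations concurrently, so the callable
// must be free of data races; here each call writes only its own element.
void square_all(std::vector<long long>& vec) {
    std::for_each(std::execution::par, vec.begin(), vec.end(),
                  [](long long& x) { x *= x; });
}

// Hypothetical helper: parallel sort with the same iterator-pair interface as std::sort.
void sort_all(std::vector<long long>& vec) {
    std::sort(std::execution::par, vec.begin(), vec.end());
}

// Hypothetical helper: parallel sum. Unlike std::accumulate, std::reduce may
// regroup and reorder the additions, which is safe for integer addition.
long long parallel_sum(const std::vector<long long>& vec) {
    return std::reduce(std::execution::par, vec.begin(), vec.end(), 0LL);
}

// The same reduction with the explicitly sequential policy, for comparison.
long long sequential_sum(const std::vector<long long>& vec) {
    return std::reduce(std::execution::seq, vec.begin(), vec.end(), 0LL);
}
```

Only the policy argument differs between the sequential and parallel calls, which is the point of the design: opting in to parallelism is a one-token change, while the caller stays responsible for making the callable safe to run concurrently.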

PARTING WORDS OF WISDOM

Stand Up!
“Keep it narrow and doable” – Fred Brooks
“Write a little bit every day” – Fred Brooks
“If you measure it, you can improve it” – Jen-Hsun Huang
“But you have to measure the right thing!”

1999: ADVENT OF THE GPU

NVIDIA GeForce 256
Coined the term “GPU”:

“A single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second.”

Register combiners: configurable multipass shading
Beginning of GPU programmability
[Figure: register combiner pipeline. A register set (texture fetches from Texture 0 and Texture 1, fragment color, specular color, fog color/factor, Spare 0) feeds General Combiners 0 and 1 (4 RGB and 4 alpha inputs, 3 RGB and 3 alpha outputs each), followed by a Final Combiner (6 RGB inputs, 1 alpha input).]

EARLY PC & WORKSTATION GRAPHICS

Rasterizer, texture unit, z-buffer, frame buffer
Fixed-point math, fixed-function interpolation / texturing

Lengyel et al. (1990): robot motion planning
Used the rasterizer to fill Minkowski-sum polygons on an HP 9000 TurboSRX workstation
Hoff (1999): Voronoi diagrams on NVIDIA TNT2
Render one cone per site; the rasterizer and z-buffer compute the Voronoi diagram
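Why the cone trick works can be seen by emulating it in software: a cone with its apex at a site has depth equal to the distance from that site, so a “less than” depth test keeps the nearest site at every pixel, and the color buffer ends up holding Voronoi region IDs. This is a minimal CPU sketch, assuming a small integer pixel grid and Euclidean distance; the names Site and voronoi_zbuffer are hypothetical, and the nested loops stand in for what the rasterizer and z-buffer did in hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical site type: a generator point in pixel coordinates.
struct Site {
    double x, y;
};

// Returns, for each pixel of a width x height grid (row-major), the index of the
// nearest site, or -1 if there are no sites. Each outer iteration plays the role
// of drawing one cone: its depth at a pixel is that pixel's distance to the site.
// Ties go to the site drawn first, because the depth test is a strict "less than".
std::vector<int> voronoi_zbuffer(const std::vector<Site>& sites, int width, int height) {
    const std::size_t n = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    std::vector<double> depth(n, std::numeric_limits<double>::infinity());  // cleared z-buffer
    std::vector<int> color(n, -1);  // color buffer holding site IDs

    for (int id = 0; id < static_cast<int>(sites.size()); ++id) {  // one cone per site
        for (int py = 0; py < height; ++py) {
            for (int px = 0; px < width; ++px) {
                const double dx = px - sites[id].x;
                const double dy = py - sites[id].y;
                const double z = std::sqrt(dx * dx + dy * dy);  // cone depth = distance
                const std::size_t i = static_cast<std::size_t>(py) * width + px;
                if (z < depth[i]) {  // depth test: keep the nearer surface
                    depth[i] = z;
                    color[i] = id;   // "shade" the pixel with this site's ID
                }
            }
        }
    }
    return color;
}
```

In the hardware version each site cost a single draw call and no per-pixel distance code; the cones were tessellated into triangles, so region boundaries were only as accurate as the tessellation and depth precision allowed, whereas this CPU emulation evaluates exact distances.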
