NVidia’s GPU Microarchitectures

By Stephen Lucas and Gerald Kotas

Intro

Discussion Points

- Difference between CPU and GPU
- Uses of GPUs
- Brief History
- Tesla Architecture
- Fermi Architecture
- Kepler Architecture
- Maxwell Architecture
- Pascal Architecture
- Future Advances

CPU vs GPU Architectures

CPU:

● Few Cores
● Lots of Cache
● Handful of Threads
● Single Process Execution

GPU:

● Hundreds of Cores
● Thousands of Threads
● Independent Processes
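To make the contrast concrete, here is a minimal CUDA vector-addition sketch (ours, not from the slides): the GPU covers a million-element array with one lightweight thread per element, where a single CPU core would iterate sequentially.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread handles one element; the GPU covers the array with
// thousands of concurrent threads where a CPU core would loop.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;  // one million elements
    size_t bytes = n * sizeof(float);

    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```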

Uses of GPUs

Gaming

● Graphics
● Focuses on High Frames per Second
● Low Polygon Count
● Predefined Textures

Workstation

● Computation
● Focuses on Floating Point Precision
● CAD Graphics - Billions of Polygons

Brief Timeline of GPUs

1999 - World’s First GPU: GeForce 256

2001 - First Programmable GPU: GeForce3

2004 - Scalable Link Interface (SLI)

2006 - CUDA Architecture Announced

2007 - Launch of Computation GPUs with Tesla Microarchitecture

2009 - Fermi Architecture Introduced

2012 - Kepler Architecture Launched

2016 - Pascal Architecture Launched

Tesla

- First microarchitecture to implement the unified shader model, which uses the same hardware resources for all shader stages (vertex, geometry, and fragment processing)
- Consists of a number of stream processors (SPs), which are scalar and can only operate on one component at a time
- Ran the SPs at a higher clock rate than the rest of the GPU
- Round-robin scheduling for warps
- Contained local, shared, and global memory
- Contains Special-Function Units (SFUs), which are specialized for interpolating attributes and evaluating transcendental functions
- Allowed for two instructions to execute per clock cycle per SP
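Several of these per-generation parameters are visible from software. As an illustrative addition (not from the slides), this sketch uses the standard cudaGetDeviceProperties call to query them:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Warp size has stayed at 32 threads from Tesla onward; the SM
    // count and clock rate are where the generations differ.
    printf("Device:               %s\n", prop.name);
    printf("Multiprocessors:      %d\n", prop.multiProcessorCount);
    printf("Warp size:            %d threads\n", prop.warpSize);
    printf("Core clock:           %d kHz\n", prop.clockRate);
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Total global memory:  %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```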

Fermi

Peak Performance Overview:

● Each SM has 32 CUDA Cores
● PCI-Express v2 Bus connecting CPU and GPU (8 GB/s peak transfer)
● Up to 6 GB GDDR5 DRAM (192 GB/s peak transfer)
● Estimated 1.5 GHz Clock Frequency
● 2 GHz Global Memory Clock Frequency
● Peak Performance of 1.5 TFLOPS
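The 1.5 TFLOPS figure follows from the numbers above once the SM count is included (16 SMs on the full GF100 chip, a detail the slide omits) and each fused multiply-add is counted as two floating-point operations:

```latex
\[
\underbrace{16~\text{SMs} \times 32~\tfrac{\text{cores}}{\text{SM}}}_{512~\text{CUDA cores}}
\times \underbrace{2~\tfrac{\text{FLOPs}}{\text{cycle}}}_{\text{FMA}}
\times 1.5~\text{GHz}
= 1536~\text{GFLOPS} \approx 1.5~\text{TFLOPS}
\]
```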

Fermi Continued

● Desktop GPU Transistor Size: 40 nm
● Mobile GPU Transistor Size: 40 nm or 28 nm
● Fused Multiply-Add (see the sketch after this list)
  ○ A*B + C
  ○ No loss of precision in the intermediate product, while being faster than a separate multiply and add
● Two-Level, Distributed Thread Scheduling
  ○ 2 Warps Issued and Executed concurrently
● Double-Precision has half the performance of Single-Precision on Workstation cards
  ○ Limited to ⅛ on Consumer Cards
● 32K 32-bit Registers per SM
● 64 KB On-Chip Memory per SM
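A short CUDA sketch (ours, not from the slides) of why fusing matters; fmaf() is the standard single-precision fused multiply-add intrinsic, and the inputs are chosen so the intermediate product needs more bits than a float holds:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// fmaf() maps to the fused multiply-add instruction: a*b is kept at
// full precision internally and only the final sum is rounded, unlike
// a separate multiply followed by an add, which rounds twice.
__global__ void fmaDemo(float a, float b, float c, float* out) {
    out[0] = fmaf(a, b, c);  // single rounding
    out[1] = a * b + c;      // nvcc contracts this to FMA by default;
                             // compile with --fmad=false to see the
                             // doubly-rounded result
}

int main() {
    float* out;
    cudaMallocManaged(&out, 2 * sizeof(float));

    // (1 + 2^-12)^2 - 1 = 2^-11 + 2^-24: the product's low-order bits
    // are lost if the multiply is rounded before the add.
    float a = 1.0f + 0.000244140625f;  // 1 + 2^-12, exact in float
    fmaDemo<<<1, 1>>>(a, a, -1.0f, out);
    cudaDeviceSynchronize();

    printf("fused: %.10e  separate: %.10e\n", out[0], out[1]);
    cudaFree(out);
    return 0;
}
```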

Kepler

- Implemented nested kernels (Dynamic Parallelism; see the sketch after this list)
- Allowed multiple CPU cores to launch work on a single GPU simultaneously (Hyper-Q)
- 64K 32-bit registers per SMX
- 32 threads/warp
- 1024 max threads per block
- 64 max warps per SMX
- 16 max thread blocks per SMX
- Each SMX contains 192 single-precision CUDA cores
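A minimal Dynamic Parallelism sketch (ours), assuming compute capability 3.5 or later and compilation with nvcc -arch=sm_35 -rdc=true:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child(int parent) {
    printf("  child thread %d launched by parent %d\n", threadIdx.x, parent);
}

// With Dynamic Parallelism, device code can launch kernels itself,
// so nested work is spawned without a round trip through the CPU.
__global__ void parent() {
    printf("parent thread %d launching a child grid\n", threadIdx.x);
    child<<<1, 4>>>(threadIdx.x);  // nested launch from the GPU
}

int main() {
    parent<<<1, 2>>>();
    cudaDeviceSynchronize();  // wait for parent and child grids
    return 0;
}
```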

Kepler Continued

- SMX cores run at the primary GPU clock, 2x slower than the shader clock of Fermi/Tesla
- Lower power draw, providing better performance per watt
- Includes fused multiply-add like Fermi, allowing for high precision
- Each SMX features four warp schedulers, allowing four warps to be issued and executed concurrently
- Has twice as many instruction dispatch units as warp schedulers, allowing two independent instructions per warp to begin execution concurrently
- Allows double-precision instructions to be paired with other instructions, unlike Fermi
- Register scoreboarding for long-latency operations
- Dynamic inter-warp scheduling
- Ability for thread-block-level scheduling

Maxwell

● Focused on Power Efficiency rather than Additional Features
● L2 Cache Increased from 256 KiB to 2 MiB
  ○ Reduced the memory bus width needed from 192-bit to 128-bit
● Started using Tile Rendering
  ○ Reduces the memory traffic needed when rendering
● Double-Precision Performance is 1/32 of Single-Precision
  ○ Worse than previous generations
● 64K Registers per SM

Pascal

- 64 CUDA Cores per streaming multiprocessor
- HBM2 memory with a 4096-bit bus and memory bandwidth of 720 GB/s
- Unified Memory (see the sketch after this list)
  - CPU and GPU can access the same memory with the help of a "Page Migration Engine"
- NVLink
  - Provides a high-bandwidth bus between CPU and GPU, allowing higher transfer speeds than PCIe
- Twice the number of registers per CUDA core, and more shared memory
- Dynamic load balancing, allowing for asynchronous computation
- Instruction-level and thread-level preemption
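A minimal Unified Memory sketch (ours), using the standard cudaMallocManaged allocation so the same pointer is valid on both the CPU and the GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* data;

    // One allocation visible to both CPU and GPU. On Pascal, the
    // Page Migration Engine moves pages on demand when either side
    // touches them, so no explicit cudaMemcpy is needed.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes

    scale<<<(n + 255) / 256, 256>>>(data, n);    // GPU reads/writes
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);           // CPU reads back: 2.0
    cudaFree(data);
    return 0;
}
```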

Future of GPUs

- Integrated GPU and CPU architectures on one chip
  - Reduces latency, increases bandwidth, and improves cache-coherent memory sharing
- Smaller transistor sizes and larger amounts of memory
- Specialized GPUs for tasks such as machine learning
- Improved interconnections between GPUs
  - NVidia working on better connections in Pascal

Questions?

References

Kirk, D. (2008). Chapter 1: Introduction, CUDA Textbook (pp. 1-13). Retrieved from http://www2.engr.arizona.edu/~ece569a/Readings/Book_Chapters/Chapter1-Introduction.pdf

NVidia. (2017). NVidia History. Retrieved from http://www.nvidia.com/page/corporate_timeline.html

Ieeexplore.ieee.org. (2017). NVIDIA Tesla: A Unified Graphics and Computing Architecture - IEEE Journals & Magazine. [online] Available at: http://ieeexplore.ieee.org/document/4523358/ [Accessed 4 Dec. 2017].

References Cont.

Nvidia (2009). Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Retrieved December 2, 2017, from https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Nvidia (2012). NVIDIA Kepler GK110 Next-Generation CUDA Compute Architecture. Retrieved December 2, 2017, from http://www.nvidia.com/content/PDF/kepler/NV_DS_Tesla_KCompute_Arch_May_2012_LR.pdf

Greengard, S. (2016). GPUs reshape computing. Communications of the ACM, 59(9), 14-16. doi:10.1145/2967979