NVidia's GPU Microarchitectures
By Stephen Lucas and Gerald Kotas

Intro Discussion Points
- Difference between CPU and GPU
- Uses of GPUs
- Brief History
- Tesla Architecture
- Fermi Architecture
- Kepler Architecture
- Maxwell Architecture
- Pascal Architecture
- Future Advances

CPU vs GPU Architectures

CPU:
● Few Cores
● Lots of Cache
● Handful of Threads
● Independent Processes

GPU:
● Hundreds of Cores
● Thousands of Threads
● Single Process Execution

Uses of GPUs

Gaming
● Graphics
● Focuses on High Frames per Second
● Low Polygon Count
● Predefined Textures

Workstation
● Computation
● Focuses on Floating Point Precision
● CAD Graphics: Billions of Polygons

Brief Timeline of NVidia GPUs
1999 - World's First GPU: GeForce 256
2001 - First Programmable GPU: GeForce3
2004 - Scalable Link Interface
2006 - CUDA Architecture Announced
2007 - Launch of Tesla Computation GPUs with Tesla Microarchitecture
2009 - Fermi Architecture Introduced
2012 - Kepler Architecture Launched
2014 - Maxwell Architecture Introduced
2016 - Pascal Architecture

Tesla
- First microarchitecture to implement the unified shader model, which runs all shader stages on the same hardware resources instead of dedicated fragment hardware
- Consists of a number of stream processors (SPs), which are scalar and can only operate on one component at a time
- Increased clock speeds in GPUs
- Round-robin scheduling for warps
- Contained local, shared, and global memory
- Contains Special-Function Units, which are specialized for interpolating points
- Allowed two instructions to execute per clock cycle per SP

Fermi Peak Performance Overview
● Each SM has 32 CUDA Cores
● PCI-Express v2 bus connecting CPU and GPU (8 GB/s peak transfer)
● Up to 6 GB GDDR5 DRAM (192 GB/s peak transfer)
● Estimated 1.5 GHz clock frequency
● 2 GHz global memory clock frequency
● Peak performance of 1.5 TFLOPS

Fermi Continued
● Desktop GPU transistor size: 40 nm
● Mobile GPU transistor size: 40 nm or 28 nm
● Fused Multiply-Add
○ A*B + C
○ Performs the multiply and add with a single rounding, so no precision is lost in the addition, while being faster than separate operations (see the sketch after the Kepler slides)
● Two-level, distributed thread scheduling
○ 2 warps issued and executed concurrently
● Double-precision has half the performance of single-precision on workstation cards
○ Limited to ⅛ on consumer cards
● 32K 32-bit registers per SM
● 64 KB on-chip memory per SM

Kepler
- Implemented nested kernels: a running kernel can launch further kernels from the GPU itself (sketched after the next slide)
- Allowed multiple CPU cores to launch work on a single GPU simultaneously
- 64K 32-bit registers
- 32 threads per warp
- 1,024 max threads per block
- 64 max warps per multiprocessor
- 16 max thread blocks per multiprocessor
- Each SM (SMX) contains 192 single-precision CUDA cores

Kepler Continued
- The SMX runs at the single primary GPU clock, 2x slower than the shader clock of Fermi/Tesla
- Lower power draw, improving performance per watt
- Includes fused multiply-add like Fermi, allowing for high precision
- Each SMX features four warp schedulers, allowing four warps to be issued and executed concurrently
- Has twice as many instruction dispatch units as warp schedulers, allowing two independent instructions per warp to begin execution concurrently
- Allows double-precision instructions to be paired with other instructions, unlike Fermi
- Register scoreboarding for long-latency operations
- Dynamic inter-warp scheduling
- Thread-block-level scheduling
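Both the Fermi and Kepler slides mention hardware fused multiply-add. Below is a minimal CUDA sketch of the rounding behavior, assuming any Fermi-or-later device; fmaf(), __fmul_rn(), and __fadd_rn() are standard CUDA device intrinsics, while the file and kernel names are our own illustration:

```cuda
// fma_demo.cu — a sketch, not code from the slides; names are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_demo(float a, float b, float c, float *out) {
    // fmaf() maps to the hardware FMA unit: a*b + c with a single rounding.
    out[0] = fmaf(a, b, c);
    // __fmul_rn()/__fadd_rn() force a separate multiply and add (two
    // roundings); a plain a*b + c would usually be contracted into an
    // FMA by nvcc anyway.
    out[1] = __fadd_rn(__fmul_rn(a, b), c);
}

int main() {
    float *out;
    cudaMallocManaged(&out, 2 * sizeof(float)); // managed memory keeps the sketch short
    fma_demo<<<1, 1>>>(1.0f / 3.0f, 3.0f, -1.0f, out);
    cudaDeviceSynchronize();
    printf("fused: %g  separate: %g\n", out[0], out[1]);
    cudaFree(out);
    return 0;
}
```

With these inputs the fused path keeps the tiny residual (about 2.98e-8), while the two-rounding path rounds the product to exactly 1.0f and returns 0; that residual is the precision the Fermi slide says FMA preserves.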
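And a minimal sketch of the nested kernels from the Kepler slide (CUDA dynamic parallelism), assuming a compute capability 3.5+ device; the kernels here are hypothetical:

```cuda
// nested_kernels.cu — illustrative sketch of Kepler dynamic parallelism.
// Requires compute capability 3.5+, compiled e.g. with:
//   nvcc -arch=sm_35 -rdc=true nested_kernels.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

__global__ void child(int parentBlock) {
    printf("child of block %d, thread %d\n", parentBlock, threadIdx.x);
}

__global__ void parent() {
    // One thread per block launches more work from the GPU itself,
    // with no round trip through the CPU.
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(blockIdx.x);
    }
    // The parent grid is not complete until its child grids finish.
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Before Kepler only the host could enqueue kernels; here each parent block generates its own child grid, which is what lets irregular, data-dependent workloads stay on the GPU.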
Maxwell
● Focused on power efficiency rather than additional features
● L2 cache increased from 256 KiB to 2 MiB
○ Reduced the needed memory bus from 192-bit to 128-bit
● Started using tile rendering
○ Reduces the amount of memory needed when rendering
● Double-precision performance is 1/32 of single-precision
○ Worse than previous generations
● 64K registers

Pascal
- 64 CUDA cores per streaming multiprocessor
- High Bandwidth Memory 2 (HBM2) with a 4096-bit bus and memory bandwidth of 720 GB/s
- Unified Memory: the CPU and GPU can access the same memory with the help of a "Page Migration Engine"
- NVLink: provides a high-bandwidth bus between CPU and GPU, allowing higher transfer speeds than PCI-Express
- Twice the number of registers per CUDA core, and more shared memory
- Dynamic load balancing, allowing for asynchronous computations
- Instruction-level and thread-level preemption

Future of GPUs
- Integrated GPU and CPU architecture on one chip
  - Reduces latency, increases bandwidth, and improves cache-coherent memory sharing
- Smaller transistor sizes and larger amounts of memory
- Specialized GPUs for tasks such as machine learning
- Improved interconnections between GPUs
  - NVidia working on better connections in Pascal

Questions?

References
Kirk, D. (2008). Chapter 1: Introduction, CUDA Textbook (pp. 1-13). Retrieved from http://www2.engr.arizona.edu/~ece569a/Readings/Book_Chapters/Chapter1-Introduction.pdf

NVidia. (2017). NVidia History. Retrieved from http://www.nvidia.com/page/corporate_timeline.html

Ieeexplore.ieee.org. (2017). NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Journals & Magazine. [online] Available at: http://ieeexplore.ieee.org/document/4523358/ [Accessed 4 Dec. 2017].

References Cont.
Nvidia. (2009). Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Retrieved December 2, 2017, from https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Nvidia. (2012). NVIDIA Kepler GK110 Next-Generation CUDA Compute Architecture. Retrieved December 2, 2017, from http://www.nvidia.com/content/PDF/kepler/NV_DS_Tesla_KCompute_Arch_May_2012_LR.pdf

Greengard, S. (2016). GPUs reshape computing. Communications of the ACM, 59(9), 14-16. doi:10.1145/2967979