Nvidia's GPU Microarchitectures

By Stephen Lucas and Gerald Kotas

Discussion Points
- Difference between CPUs and GPUs
- Uses of GPUs
- Brief History
- Tesla Architecture
- Fermi Architecture
- Kepler Architecture
- Maxwell Architecture
- Future Advances

CPU vs GPU Architectures

CPU:
● Few cores
● Lots of cache
● Handful of threads
● Independent processes

GPU:
● Hundreds of cores
● Thousands of threads
● Single process execution

Uses of GPUs

Gaming:
● Graphics
● Focuses on high frames per second
● Low polygon count
● Predefined textures

Workstation:
● Computation
● Focuses on floating-point precision
● CAD graphics: billions of polygons

Brief Timeline of NVidia GPUs
1999 - World's First GPU: GeForce 256
2001 - First Programmable GPU: GeForce3
2004 - Scalable Link Interface
2006 - CUDA Architecture Announced
2007 - Launch of Tesla Computation GPUs with Tesla Microarchitecture
2009 - Fermi Architecture Introduced
2013 - Kepler Architecture Launched
2016 - Pascal Architecture

Tesla
- First microarchitecture to implement the unified shader model, which uses the same hardware resources for all fragment processing
- Consists of a number of stream processors, which are scalar and can only operate on one component at a time
- Increased clock speed in GPUs
- Round-robin scheduling for warps
- Contained local, shared, and global memory
- Contains Special-Function Units, which are specialized for interpolating points
- Allowed two instructions to execute per clock cycle per SP

Fermi Peak Performance Overview:
● Each SM has 32 CUDA cores
● PCI Express v2 bus connecting CPU and GPU (8 GB/s peak transfer)
● Up to 6 GB GDDR5 DRAM (192 GB/s peak transfer)
● Estimated 1.5 GHz clock frequency
● 2 GHz global memory clock frequency
● Peak performance of 1.5 TFLOPS

Fermi Continued
● Desktop GPU transistor size: 40 nm
● Mobile GPU transistor size: 40 nm or 28 nm
● Fused Multiply-Add
  ○ A*B + C
  ○ No loss of precision in the addition while being faster than separate operations
● Two-level, distributed thread scheduling
  ○ 2 warps issued and executed concurrently
● Double precision has half the performance of single precision on workstation cards
  ○ Limited to 1/8 on consumer cards
● 32K 32-bit registers per SM
● 64 KB on-chip memory per SM

Kepler
- Implemented nested kernels
- Allowed multiple CPU cores to launch work on a single GPU simultaneously
- 64K 32-bit registers
- 32 threads per warp
- 1024 max threads per block
- 64 max warps per SMX
- 16 max thread blocks per SMX
- Each SMX contains 192 single-precision CUDA cores

Kepler Continued
- SMX uses the primary GPU clock, 2x slower than the Fermi/Tesla shader clock
- Lower power draw, improving performance per watt
- Includes fused multiply-add like Fermi, allowing for high precision
- Each SMX features four warp schedulers, allowing four warps to be issued and executed concurrently
- Has twice as many instruction dispatch units as warp schedulers, allowing two independent instructions per warp to begin execution concurrently
- Allows double-precision instructions to be paired with other instructions, unlike Fermi
- Register scoreboarding for long-latency operations
- Dynamic inter-warp scheduling
- Ability for thread-block-level scheduling

Maxwell
● Focused on power efficiency rather than additional features
● L2 cache increased from 256 KiB to 2 MiB
  ○ Reduced the needed memory bus width from 192-bit to 128-bit
● Started using tile rendering
  ○ Reduces the amount of memory needed when rendering
● Double-precision performance is 1/32 of single precision
  ○ Worse than previous generations
● 64K registers

Pascal
- 64 CUDA cores per streaming multiprocessor
- High Bandwidth Memory 2 with a 4096-bit bus and memory bandwidth of 720 GB/s
- Unified Memory: CPU and GPU can access the same memory with the help of a "Page Migration Engine"
- NVLink: provides a high-bandwidth bus between CPU and GPU, allowing higher transfer speeds than PCI Express
- Twice the number of registers per CUDA core, more shared memory
- Dynamic load balancing, allowing for asynchronous computations
- Instruction-level and thread-level preemption

Future of GPUs
- Integrated GPU and CPU architecture on one chip
  - Reduces latency, increases bandwidth, and improves cache-coherent memory sharing
- Smaller transistor sizes and larger amounts of memory
- Specialized GPUs for tasks such as machine learning
- Improved interconnections between GPUs
  - NVidia working on better connections in Pascal

Questions?

References

Kirk, D. (2008). Chapter 1: Introduction, CUDA Textbook (pp. 1-13). Retrieved from http://www2.engr.arizona.edu/~ece569a/Readings/Book_Chapters/Chapter1-Introduction.pdf

NVidia. (2017). NVidia History. Retrieved from http://www.nvidia.com/page/corporate_timeline.html

Ieeexplore.ieee.org. (2017). NVIDIA Tesla: A Unified Graphics and Computing Architecture - IEEE Journals & Magazine. [online] Available at: http://ieeexplore.ieee.org/document/4523358/ [Accessed 4 Dec. 2017].

References Cont.

Nvidia. (2009). Whitepaper: Nvidia's Next Generation CUDA Compute Architecture: Fermi. Retrieved December 2, 2017, from https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Nvidia. (2012). Nvidia Kepler GK110 Next-Generation CUDA Compute Architecture. Retrieved December 2, 2017, from http://www.nvidia.com/content/PDF/kepler/NV_DS_Tesla_KCompute_Arch_May_2012_LR.pdf

Greengard, S. (2016). GPUs reshape computing. Communications of the ACM, 59(9), 14-16. doi:10.1145/2967979
