Lecture 20: Speeds and Feeds When Employing a GPU

ECE408 / CS483 Spring 2020
Applied Parallel Programming
Lecture 20: GPU as part of the PC Architecture
3/23/2020
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/CSE408, ECE 498AL, University of Illinois, Urbana-Champaign

Objectives
• to understand the impact of data transfers on performance when using a GPU as a co-processor
  – speeds and feeds of traditional CPU
  – speeds and feeds when employing a GPU
• to develop a knowledge base for performance tuning for modern GPUs

Review: Canonical CUDA Program Structure
• Global variables declaration
• Kernel functions
  – __global__ void kernelOne(…)
• Main () // host code
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes )
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>( args… );
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
  – (repeat the steps above as needed)
  – optional: compare against golden (host computed) solution

Bandwidth: The Gravity of Modern Computer Systems
• Bandwidth between key components ultimately dictates system performance, especially for GPUs processing large amounts of data.
• Tricks like buffering, reordering, caching can temporarily defy the rules in some cases.
• Ultimately, performance falls back to what the “speeds and feeds” dictate.
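The host-side steps of the canonical structure can be sketched as follows. This is a minimal illustration, not the course's reference code: the kernel body (scaling by a constant), the helper name `runCanonicalExample`, the block size of 256 and the test data are all assumptions made for the example. Only `kernelOne`, `d_GlblVarPtr`, `h_GlblVarPtr` and the order of the steps come from the slide.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel function (illustrative body): multiply every element by `factor`.
__global__ void kernelOne(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host-side steps from the slide, wrapped in a hypothetical helper.
// Returns the number of elements that differ from the host-computed
// golden result, or -1 if an allocation or a CUDA call fails.
int runCanonicalExample(int n) {
    if (n <= 0) return -1;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    const float factor = 2.0f;

    float *h_GlblVarPtr = static_cast<float *>(malloc(bytes));
    float *h_golden = static_cast<float *>(malloc(bytes));
    float *d_GlblVarPtr = nullptr;
    int mismatches = -1;

    if (h_GlblVarPtr && h_golden) {
        for (int i = 0; i < n; ++i) {
            h_GlblVarPtr[i] = static_cast<float>(i % 1024);
            h_golden[i] = factor * h_GlblVarPtr[i];  // golden solution on the host
        }
        // Allocate memory space on the device.
        if (cudaMalloc(&d_GlblVarPtr, bytes) == cudaSuccess) {
            // Transfer data from host to device.
            cudaError_t err = cudaMemcpy(d_GlblVarPtr, h_GlblVarPtr, bytes,
                                         cudaMemcpyHostToDevice);
            if (err == cudaSuccess) {
                // Execution configuration setup, then the kernel call.
                const int threads = 256;
                const int blocks = (n - 1) / threads + 1;
                kernelOne<<<blocks, threads>>>(d_GlblVarPtr, factor, n);
                err = cudaGetLastError();
            }
            // Transfer results from device to host (waits for the kernel).
            if (err == cudaSuccess) {
                err = cudaMemcpy(h_GlblVarPtr, d_GlblVarPtr, bytes,
                                 cudaMemcpyDeviceToHost);
            }
            // Optional: compare against the golden solution.
            if (err == cudaSuccess) {
                mismatches = 0;
                for (int i = 0; i < n; ++i) {
                    if (h_GlblVarPtr[i] != h_golden[i]) ++mismatches;
                }
            }
            cudaFree(d_GlblVarPtr);
        }
    }
    free(h_GlblVarPtr);
    free(h_golden);
    return mismatches;
}
```

Calling `runCanonicalExample(1 << 20)` from `main()` on a machine with a CUDA-capable GPU exercises every step once. Both `cudaMemcpy` calls cross the PCIe link, which is where the rest of the lecture focuses.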
Classic (Historical) PC Architecture
• Northbridge connects 3 components that must communicate at high speed
  – CPU, DRAM, video
  – Video needs first-class access to DRAM
  – Earlier NVIDIA cards connected through AGP, with transfers of up to 2 GB/s
• Southbridge serves as a concentrator for slower I/O devices
[Figure: classic PC block diagram; labels: CPU, Core Logic Chipset]

(Original) PCI Bus Specification: A Humble Beginning
• Connected to the Southbridge
  – Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate
  – Later, 66 MHz, 64-bit, 528 MB/second peak
  – Upstream bandwidth remains slow for devices (~256 MB/s peak)
  – Shared bus with arbitration
    • Winner of arbitration becomes bus master and can connect to CPU or DRAM through the Southbridge and Northbridge

PCI as Memory Mapped I/O
• PCI device registers are mapped into the CPU’s physical address space
  – Accessed through loads/stores (kernel mode)
• Addresses are assigned to the PCI devices at boot time
  – All devices listen for their addresses

PCI Express (PCIe)
• switched, point-to-point connection
  – each card has a dedicated “link” to the central switch, with no arbitration
  – packet switches: messages form virtual channels
  – prioritized packets for QoS (such as for real-time video streaming)
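The PCI peak rates quoted above follow from clock rate times bus width. The short sketch below reproduces that arithmetic. The helper name `busPeakMBps` is hypothetical, and it assumes one full-width transfer per clock cycle with 1 MB taken as 10^6 bytes.

```cpp
#include <cassert>

// Peak transfer rate of a parallel bus in MB/s (1 MB = 10^6 bytes),
// assuming one transfer of `widthBits` bits per clock cycle.
double busPeakMBps(double clockMHz, int widthBits) {
    assert(clockMHz > 0.0 && widthBits > 0 && widthBits % 8 == 0);
    return clockMHz * (widthBits / 8);  // MHz * bytes per transfer
}
```

With the slide's figures, `busPeakMBps(33, 32)` gives 132 and `busPeakMBps(66, 64)` gives 528. These are peak numbers for a bus shared by all devices, so any single device sees less.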
PCIe Generations
• Within a generation, the number of lanes in a link can be scaled
  – using distinct physical channels (more bits / wider transfers)
  – ×1, ×2, ×4, ×8, ×16, ×32, …
• Each new generation aims to double the speed.

PCIe Gen 3 Links and Lanes
• Each link consists of one or more lanes
  – Each lane is 1-bit wide (4 wires, each 2-wire pair can transmit 8 Gb/s in one direction)
    • Upstream and downstream simultaneous and symmetric
  – Each link can combine 1, 2, 4, 8, 12, 16 lanes: x1, x2, etc.
  – Data is 128b/130b encoded, every 128 bits sent as 130 bits with a roughly equal number of 1’s and 0’s; net data rate about 7.877 Gb/s per lane each way.
  – Thus, the net data rates are 985 MB/s (x1), 1.97 GB/s (x2), 3.94 GB/s (x4), 7.9 GB/s (x8), 15.8 GB/s (x16), each way

Foundation: 8/10 bit encoding
• Goal is to maintain DC balance while having sufficient state transitions for clock recovery
• The difference of 1s and 0s in a 20-bit stream should be ≤ 2
• There should be no more than 5 consecutive 1s or 0s in any stream
• 00000000, 00000111, 11000001 bad
• 01010101, 11001100 good
• Find 256 good patterns among 1024 total patterns of 10 bits to encode an 8-bit datum
• a 20% overhead

Current: 128/130 bit encoding
• Same goal: maintain DC balance while having sufficient state transitions for clock recovery
• 1.5% overhead instead of 20%
• Scrambler function: long runs of 0s, 1s vanishingly small
  – Instead of guaranteed run length of 8/10b
• At least one bit shift every 66 bits
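The encoding overheads and the Gen 3 net data rates above can be checked with a few lines of arithmetic. The function names below are hypothetical. Overhead is expressed as a fraction of the bits on the wire, and 1 GB is taken as 10^9 bytes.

```cpp
#include <cassert>

// Fraction of transmitted bits that are encoding overhead,
// e.g. 2 of every 10 bits for 8b/10b.
double encodingOverhead(int payloadBits, int encodedBits) {
    assert(payloadBits > 0 && encodedBits >= payloadBits);
    return static_cast<double>(encodedBits - payloadBits) / encodedBits;
}

// Net payload rate of one lane in one direction, in Gb/s.
double laneNetGbps(double rawGbps, int payloadBits, int encodedBits) {
    return rawGbps * (1.0 - encodingOverhead(payloadBits, encodedBits));
}

// Net payload rate of a link with `lanes` lanes in one direction, in GB/s.
double linkNetGBps(double rawGbps, int payloadBits, int encodedBits, int lanes) {
    assert(lanes > 0);
    return lanes * laneNetGbps(rawGbps, payloadBits, encodedBits) / 8.0;
}
```

For example, `encodingOverhead(8, 10)` is 0.20 and `encodingOverhead(128, 130)` is about 0.015. `laneNetGbps(8.0, 128, 130)` is about 7.877, and `linkNetGBps(8.0, 128, 130, 16)` is about 15.75, which the slide rounds to 15.8 GB/s. These are upper bounds on payload bits only. Packet headers and protocol traffic, which the slides do not quantify, reduce what an application actually sees.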
Patterns Contain Many 0s and 1s
A question for fun:
  – if we need 2^128 code words
  – chosen from all 2^130 130-bit patterns
  – how many 0s/1s must we consider including?
Answer: 63-67 (of either type)
Thus 128b/130b code words are pretty well-balanced, and have lots of 0-1 transitions (for clock recovery).

Recent PCIe PC Architecture
• Today, PCIe forms the interconnect backbone within the PC.
• Northbridge and Southbridge are PCIe switches.
Source: Jon Stokes, PCI Express: An Overview (http://arstechnica.com/articles/paedia/hardware/pcie.ars)

Recent PCIe PC Architecture
How is PCI supported?
• Need a PCI-PCIe bridge, which is
  – sometimes included as part of Southbridge, or
  – can be added as a separate PCIe I/O card.
Current systems integrate PCIe controllers directly on chip with CPU.
Source: Jon Stokes, PCI Express: An Overview (http://arstechnica.com/articles/paedia/hardware/pcie.ars)

GeForce GTX 1080 (Pascal) GPU Consumer Card Details
[Figure: annotated photo of the card; labels:]
• SLI Connector (NVIDIA to NVIDIA)
• DP Out, HDMI Out, DVI Out
• 16x PCI-Express
• 8GB/256-bit GDDR5X
  – 1.25 GHz mem clock
  – 2.5 GHz write clock w/ QDR = 10 Gb/s/pin
  – 256 bit bus = 320 GB/s
  – 8 pieces of 8Gb (16 Mb x 32 x 16 banks)
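The GDDR5X figures on the card slide are related by simple arithmetic, sketched below. The function names are hypothetical, and QDR is taken to mean four transfers per write-clock cycle, which is what the slide's 2.5 GHz to 10 Gb/s/pin step implies.

```cpp
#include <cassert>

// Per-pin data rate in Gb/s: clock frequency times transfers per clock.
double pinRateGbps(double clockGHz, int transfersPerClock) {
    assert(clockGHz > 0.0 && transfersPerClock > 0);
    return clockGHz * transfersPerClock;
}

// Peak memory bandwidth in GB/s for a bus of `busWidthBits` data pins.
double memoryPeakGBps(double pinGbps, int busWidthBits) {
    assert(busWidthBits > 0);
    return pinGbps * busWidthBits / 8.0;
}

// Total capacity in GB for `chips` devices of `gbitPerChip` Gb each.
double capacityGB(int chips, double gbitPerChip) {
    assert(chips > 0 && gbitPerChip > 0.0);
    return chips * gbitPerChip / 8.0;
}
```

`pinRateGbps(2.5, 4)` gives 10, `memoryPeakGBps(10.0, 256)` gives 320, and `capacityGB(8, 8.0)` gives 8, matching the slide. The 320 GB/s on-card figure is roughly 20 times the 15.8 GB/s that a Gen 3 x16 link can deliver in one direction, which is why host-device transfers deserve attention.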
PCIe Data Transfer using DMA
• DMA (Direct Memory Access) is used to fully utilize the bandwidth of an I/O bus
  – DMA uses physical addresses for source and destination
  – Transfers a number of bytes requested by the OS
  – Needs pinned memory
[Figure: DMA path; labels: CPU, Main Memory (DRAM), DMA, Global Memory, GPU card (or other I/O cards)]

Pinned Memory
• DMA uses physical addresses
• The OS could accidentally page out the data that is being read or written by a DMA and page in another virtual page into the same location
• Pinned memory cannot be paged out
• If a source or destination of a cudaMemcpy() in the host memory is not pinned, it needs to be first copied to a pinned memory – extra overhead
• cudaMemcpy is much faster with pinned host memory source or destination

Allocate/Free Pinned Memory (a.k.a. Page Locked Memory)
• cudaHostAlloc()
  – Three parameters
  – Address of pointer to the allocated memory
  – Size of the allocated memory in bytes
  – Option – use cudaHostAllocDefault for now
• cudaFreeHost()
  – One parameter
  – Pointer to the memory to be freed

Using Pinned Memory
• Use the allocated memory and its pointer the same way as those returned by malloc();
• The only difference is that the allocated memory cannot be paged by the OS
• The cudaMemcpy function should be about 2X faster with pinned memory
• Pinned memory is a limited resource whose over-subscription can have serious consequences
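A minimal sketch of the pinned-memory calls follows, timing the same host-to-device copy from a pageable and a pinned buffer. The helper names `timeHostToDeviceMs` and `comparePinnedCopy`, the fill pattern and the single warm-up copy are assumptions of this example. `cudaHostAlloc`, `cudaHostAllocDefault`, `cudaFreeHost` and `cudaMemcpy` are used as described on the slides.

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: time one host-to-device cudaMemcpy in milliseconds
// using CUDA events. Returns a negative value if a CUDA call fails.
static float timeHostToDeviceMs(void *d_dst, const void *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Copies `bytes` bytes to the device from a pageable buffer and from a
// pinned buffer, storing both timings. Returns 0 on success, -1 on failure.
int comparePinnedCopy(size_t bytes, float *pageableMs, float *pinnedMs) {
    void *d_buf = nullptr;
    void *h_pinned = nullptr;
    void *h_pageable = malloc(bytes);  // ordinary pageable host memory
    int status = -1;

    bool okDev = cudaMalloc(&d_buf, bytes) == cudaSuccess;
    // Three parameters: address of the pointer, size in bytes, option flag.
    bool okPin =
        cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault) == cudaSuccess;

    if (h_pageable && okDev && okPin) {
        // Pinned memory is used through its pointer like malloc'd memory.
        memset(h_pageable, 1, bytes);
        memset(h_pinned, 1, bytes);
        // Untimed warm-up copy so one-time setup cost is not measured.
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
        *pageableMs = timeHostToDeviceMs(d_buf, h_pageable, bytes);
        *pinnedMs = timeHostToDeviceMs(d_buf, h_pinned, bytes);
        if (*pageableMs >= 0.0f && *pinnedMs >= 0.0f) status = 0;
    }
    if (okPin) cudaFreeHost(h_pinned);  // one parameter: pointer to free
    if (okDev) cudaFree(d_buf);
    free(h_pageable);
    return status;
}
```

Running `comparePinnedCopy` with a buffer of a few hundred megabytes should show the pinned copy finishing faster, though the ratio depends on the system and is not guaranteed to match the "about 2X" rule of thumb. Keeping the pinned buffer modest in size respects the slides' warning that pinned memory is a limited resource.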
Important Trends
• Knowing yesterday, today, and tomorrow
  – The PC world is becoming flatter
  – CPU and GPU are being fused together
  – Outsourcing of computation is becoming easier…

ANY MORE QUESTIONS?
