Lecture 20: Speeds and Feeds When Employing a GPU
ECE408 / CS483: Applied Parallel Programming, Spring 2020

[Slide 1] Lecture 20: GPU as Part of the PC Architecture

[Slide 2] Objectives
• To understand the impact of data transfers on performance when using a GPU as a co-processor
  – speeds and feeds of a traditional CPU
  – speeds and feeds when employing a GPU
• To develop a knowledge base for performance tuning for modern GPUs

[Slide 3] Review: Canonical CUDA Program Structure
• Global variable declarations
• Kernel functions
  – __global__ void kernelOne(…)
• main() // host code
  – allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call: kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host: cudaMemcpy(h_GlblVarPtr, …)
  – repeat the above steps as needed
  – optional: compare against a golden (host-computed) solution
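To make the outline above concrete, here is a minimal sketch of that canonical structure. Only the call sequence (cudaMalloc, cudaMemcpy, launch, copy-back, golden check) comes from the slide; the kernel body, sizes, and the doubling computation are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Stand-in for "kernelOne" from the outline; the computation (doubling
// each element) is just an illustrative choice.
__global__ void kernelOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_GlblVar = (float *)malloc(bytes);        // host copy of the data
    for (int i = 0; i < n; i++) h_GlblVar[i] = (float)i;

    float *d_GlblVarPtr;
    cudaMalloc(&d_GlblVarPtr, bytes);                 // allocate on the device
    cudaMemcpy(d_GlblVarPtr, h_GlblVar, bytes,
               cudaMemcpyHostToDevice);               // host -> device

    kernelOne<<<(n + 255) / 256, 256>>>(d_GlblVarPtr, n);  // config + launch

    cudaMemcpy(h_GlblVar, d_GlblVarPtr, bytes,
               cudaMemcpyDeviceToHost);               // device -> host

    printf("h_GlblVar[3] = %.1f (golden answer: 6.0)\n", h_GlblVar[3]);
    cudaFree(d_GlblVarPtr);
    free(h_GlblVar);
    return 0;
}
```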
[Slide 4] Bandwidth: The Gravity of Modern Computer Systems
• Bandwidth between key components ultimately dictates system performance, especially for GPUs processing large amounts of data.
• Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases.
• Ultimately, performance falls back to what the "speeds and feeds" dictate.

[Slide 5] Classic (Historical) PC Architecture
• The Northbridge connects the three components that must communicate at high speed: CPU, DRAM, and video
  – Video needs first-class access to DRAM
  – Earlier NVIDIA cards were connected to AGP, with up to 2 GB/s transfers
• The Southbridge serves as a concentrator for slower I/O devices
[Figure: CPU connected to the Core Logic Chipset (Northbridge and Southbridge)]

[Slide 6] (Original) PCI Bus Specification: A Humble Beginning
• Connected to the Southbridge
  – Originally 33 MHz, 32 bits wide, 132 MB/s peak transfer rate
  – Later 66 MHz, 64 bits wide, 528 MB/s peak
  – Upstream bandwidth remained slow for devices (~256 MB/s peak)
  – Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the Southbridge and Northbridge

[Slide 7] PCI as Memory-Mapped I/O
• PCI device registers are mapped into the CPU's physical address space
  – Accessed through loads/stores (kernel mode)
• Addresses are assigned to the PCI devices at boot time
  – All devices listen for their addresses

[Slide 8] PCI Express (PCIe)
• Switched, point-to-point connection
• Each card has a dedicated "link" to the central switch, with no arbitration
• Packet-switched: messages form virtual channels
• Prioritized packets for QoS (e.g., for real-time video streaming)

[Slide 9] PCIe Generations
• Within a generation, the number of lanes in a link can be scaled
  – using distinct physical channels (more bits / wider transfers)
  – ×1, ×2, ×4, ×8, ×16, ×32, …
• Each new generation aims to double the speed.

[Slide 10] PCIe Gen 3 Links and Lanes
• Each link consists of one or more lanes
  – Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 8 Gb/s in one direction)
  – Upstream and downstream traffic is simultaneous and symmetric
  – Each link can combine 1, 2, 4, 8, 12, or 16 lanes (×1, ×2, etc.)
  – Data is 128b/130b encoded (each 128-bit block is transmitted as 130 bits with roughly equal numbers of 1s and 0s); the net data rate is about 7.877 Gb/s per lane each way
  – Thus the net data rates are 985 MB/s (×1), 1.97 GB/s (×2), 3.94 GB/s (×4), 7.9 GB/s (×8), and 15.8 GB/s (×16), each way
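A quick arithmetic check of those lane rates, as a small host-side sketch; the two constants (8 Gb/s line rate, 128/130 encoding efficiency) come straight from the slide, everything else is derived from them:

```c
#include <stdio.h>

int main(void) {
    // PCIe Gen 3: 8 Gb/s line rate per lane per direction; 128b/130b
    // encoding means 128 of every 130 transmitted bits carry payload.
    const double line_rate_gbps = 8.0;
    const double efficiency     = 128.0 / 130.0;
    const double net_gbps       = line_rate_gbps * efficiency;  // ~7.877 Gb/s
    const double net_GBps       = net_gbps / 8.0;               // ~0.985 GB/s

    printf("per lane: %.4f Gb/s = %.3f GB/s each way\n", net_gbps, net_GBps);
    for (int lanes = 1; lanes <= 16; lanes *= 2)
        printf("x%-2d link: %5.2f GB/s each way\n", lanes, lanes * net_GBps);
    return 0;
}
```

This prints 0.98, 1.97, 3.94, 7.88, and 15.75 GB/s for ×1 through ×16; the slide's 7.9 GB/s and 15.8 GB/s figures are the same values rounded.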
[Slide 11] Foundation: 8b/10b Encoding
• Goal: maintain DC balance while having sufficient state transitions for clock recovery
  – The difference between the numbers of 1s and 0s in any 20-bit stream should be ≤ 2
  – There should be no more than 5 consecutive 1s or 0s in any stream
  – Bad patterns: 00000000, 00000111, 11000001; good patterns: 01010101, 11001100
• Find 256 good patterns among the 1024 total 10-bit patterns to encode each 8-bit datum
• A 20% overhead

[Slide 12] Current: 128b/130b Encoding
• Same goal: maintain DC balance while having sufficient state transitions for clock recovery
• A scrambler function makes long runs of 0s or 1s vanishingly unlikely, instead of the guaranteed run length of 8b/10b
• At least one bit transition every 66 bits
• 1.5% overhead instead of 20%

[Slide 13] Patterns Contain Many 0s and 1s
• A question for fun:
  – if we need 2^128 code words,
  – chosen from all 2^130 130-bit patterns,
  – how many 0s/1s per pattern must we consider including?
• Answer: 63-67 (of either type); a small counting check appears at the end of these notes.
• Thus 128b/130b code words are fairly well balanced and have lots of 0-1 transitions (for clock recovery).

[Slide 14] Recent PCIe PC Architecture
• Today, PCIe forms the interconnect backbone within the PC.
• The Northbridge and Southbridge are PCIe switches.
• Source: Jon Stokes, PCI Express: An Overview (http://arstechnica.com/articles/paedia/hardware/pcie.ars)

[Slide 15] Recent PCIe PC Architecture: How Is PCI Supported?
• A PCI-to-PCIe bridge is needed, which is
  – sometimes included as part of the Southbridge, or
  – can be added as a separate PCIe I/O card.
• Current systems integrate PCIe controllers directly on chip with the CPU.

[Slide 16] GeForce GTX 1080 (Pascal) Consumer Card Details
• Board connectors: SLI connector (NVIDIA to NVIDIA), DP out, HDMI out, DVI out
• 16x PCI-Express
• 8 GB of GDDR5X on a 256-bit memory bus
  – 1.25 GHz memory clock; 2.5 GHz write clock with QDR = 10 Gb/s/pin
  – 256-bit bus = 320 GB/s
  – 8 pieces of 8 Gb (16 Mb × 32 × 16 banks)

[Slide 17] PCIe Data Transfer Using DMA
• DMA (Direct Memory Access) is used to fully utilize the bandwidth of an I/O bus
• DMA uses physical addresses for source and destination
• Transfers the number of bytes requested by the OS
• Needs pinned memory
[Figure: CPU and Main Memory (DRAM) on the host, connected over PCIe to the DMA engine and Global Memory on the GPU card (or other I/O cards)]

[Slide 18] Pinned Memory
• DMA uses physical addresses
  – The OS could accidentally page out the data being read or written by a DMA and page another virtual page into the same physical location
  – Pinned memory cannot be paged out
• If the source or destination of a cudaMemcpy() in host memory is not pinned, it is first copied to a pinned memory buffer: extra overhead
• cudaMemcpy is much faster with a pinned host memory source or destination

[Slide 19] Allocate/Free Pinned Memory (a.k.a. Page-Locked Memory)
• cudaHostAlloc(): three parameters
  – Address of the pointer to the allocated memory
  – Size of the allocated memory in bytes
  – Option: use cudaHostAllocDefault for now
• cudaFreeHost(): one parameter
  – Pointer to the memory to be freed

[Slide 20] Using Pinned Memory
• Use the allocated memory and its pointer the same way as those returned by malloc()
• The only difference is that the allocated memory cannot be paged out by the OS
• The cudaMemcpy function should be about 2X faster with pinned memory
• Pinned memory is a limited resource; over-subscribing it can have serious consequences
• (A usage sketch appears at the end of these notes.)

[Slide 21] Important Trends
• Knowing yesterday, today, and tomorrow:
  – The PC world is becoming flatter
  – CPU and GPU are being fused together
  – Outsourcing of computation is becoming easier…

[Slide 22] Any more questions?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018. ECE408/CS483/CSE408, ECE 498AL, University of Illinois, Urbana-Champaign.
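The counting check promised under slide 13: a small host-side program that widens a window of 1-counts around 65 until the window contains at least 2^128 of the 130-bit patterns. The log-gamma formulation of the binomial coefficient is an implementation choice, not something from the slides; the program reproduces the slide's 63-67 answer.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    // C(130, k) = exp(ln 130! - ln k! - ln (130-k)!), via lgamma.
    // Widen a symmetric window of 1-counts around k = 65 until the
    // window holds at least 2^128 patterns.
    for (int w = 0; w <= 65; w++) {
        long double sum = 0.0L;
        for (int k = 65 - w; k <= 65 + w; k++)
            sum += expl(lgammal(131) - lgammal(k + 1) - lgammal(131 - k));
        if (sum >= powl(2.0L, 128)) {
            printf("need 1-counts from %d to %d (%d values)\n",
                   65 - w, 65 + w, 2 * w + 1);
            break;
        }
    }
    return 0;
}
```

It prints "need 1-counts from 63 to 67 (5 values)": counts 64-66 alone supply only about 2^127.7 patterns, so any set of 2^128 code words must reach out to 63 and 67, and no further.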
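And the pinned-memory usage sketch promised under slide 20. Only cudaHostAlloc, cudaHostAllocDefault, cudaMemcpy, and cudaFreeHost come from the slides; the buffer size and the device-buffer round trip are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 64 << 20;   // 64 MB; an arbitrary illustrative size

    // Page-locked (pinned) host buffer: used exactly like malloc'd memory,
    // but the OS cannot page it out, so DMA can access it directly.
    float *h_pinned;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    // With a pinned source, this copy can DMA straight from host memory;
    // with a pageable source it would first be staged through a pinned buffer.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);          // pinned memory has its own free call
    return 0;
}
```

Remember the caveat from slide 20: pinned memory is a limited resource, so reserve it for transfer buffers rather than pinning everything.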