GPUs: The Hype, the Reality, and the Future
Uppsala Programming for Multicore Architectures Research Center
David Black-Schaffer, Assistant Professor, Department of Information Technology, Uppsala University
25/11/2011

Today
1. The hype
2. What makes a GPU a GPU?
3. Why are GPUs scaling so well?
4. What are the problems?
5. What's the future?

THE HYPE

How Good are GPUs?
[Figure: claimed speedups, ranging from 100x down to 3x]

Real World Software
• Press release, 10 Nov 2011:
– "NVIDIA today announced that four leading applications… have added support for multiple GPU acceleration, enabling them to cut simulation times from days to hours."
• GROMACS: 2-3x overall
– Implicit solvers 10x, PME simulations 1x
• LAMMPS: 2-8x for double precision
– Up to 15x for mixed precision
• QMCPACK: 3x
2x is AWESOME! Most research claims 5-10%.

GPUs for Linear Algebra
• 5 CPUs = 75% of 1 GPU (StarPU for MAGMA)

GPUs by the Numbers (Peak and TDP)
[Chart, normalized to the Intel 3960X; values as charted:]

                    Intel 3960X   Nvidia GTX 580   AMD 6970
  Watts             130           244              250
  GFLOP (DP)        172           791              675
  Bandwidth (GB/s)  51            192              176
  GHz               3.3           1.5              0.9
  Transistors (B)   2.3           3.0              2.4

Note: Intel is at 32nm vs. 40nm for the GPUs, 36% smaller per transistor.

Efficiency (DP, peak)
GPUs are enormously more efficient designs for doing FLOPs. But how close to peak do you really get?
[Chart: energy efficiency (GFLOP/W) and design efficiency (FLOP/cycle/B transistors) for the Intel 3960X, GTX 580, and AMD 6970]

Efficiency in Perspective
[Chart: energy efficiency (GFLOP/W), peak for the Intel 3960X, GTX 580, and AMD 6970 vs. actual LINPACK for Blue Gene/Q (Green 500 #1) and the GPU/CPU DEGIMA Cluster (Green 500 #6)]
• Nvidia says that ARM + laptop GPU reaches 7.5 GFLOPS/W.

Intel's Response
• Larrabee
– Manycore x86-"light" to compete with Nvidia/AMD in graphics and compute
– Didn't work out so well (despite huge fab advantages — graphics is hard)
• Repositioned it as an HPC co-processor: Knight's Corner
– 1 TF double precision in a huge (expensive) single 22nm chip
– At 300W this would beat Nvidia's peak efficiency today (40nm)

Show Me the Money
[Chart: Q3 revenue in billions of USD for AMD, Nvidia, and Intel (9.4B); legend: Professional Graphics, Embedded, Processors; callout: 50% of this is GPU in HPC]

WHAT MAKES A GPU A GPU?

GPU Characteristics
• Architecture
– Data parallel processing
– Hardware thread scheduling
– High memory bandwidth
– Graphics rasterization units
– Limited caches (with texture filtering hardware)
• Programming/Interface
– Data parallel kernels
– Throughput-focused
– Limited synchronization
– Limited OS interaction

GPU Innovations
• SMT (for latency hiding)
– Massive numbers of threads
– Programming SMT is far easier than SIMD
• SIMT (thread groups)
– Amortize scheduling/control/data access
– Warps, wavefronts, work-groups, gangs
• Memory systems optimized for graphics
– Special storage formats
– Texture filtering in the memory system
– Bank optimizations across threads
• Limited synchronization
– Improves hardware scalability
(They didn't invent any of these, but they made them successful.)

WHY ARE GPUS SCALING SO WELL?

Lots of Room to Grow the Hardware
• Simple cores
– Short pipelines
– No branch prediction
– In-order
• Simple memory system
– Limited caches*
– No coherence
– Split address space*
• Simple control
– Limited context switching
– Limited processes
– (But they do have to schedule 1000s of threads…)
• Lack of OS-level issues
– Memory allocation
– Preemption
• But, they do have a lot…
– TLBs for VM
– I-caches
– Complex hardware thread schedulers
Trading off simplicity for features and programmability.
[Charts: GFLOPs (SP) and bandwidth (GB/s) across G80, GT200, and GF100; architecture efficiency (FLOP/cycle) and design efficiency (FLOP/cycle/B transistors) across the same chips]

"Nice" Programming Model

  kernel void calcSin(global float *data) {
      int id = get_global_id(0);
      data[id] = sin(data[id]);
  }

• All GPU programs have:
– Explicit parallelism
– Hierarchical structure
– Restricted synchronization
– Data locality
• Inherent in graphics
• Enforced in compute by performance
– Latency tolerance
• Easy to scale!
[Diagram: where synchronization is OK (cheap, within the hierarchy) vs. not allowed (expensive, across it)]
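To make the explicit parallelism and the hierarchy concrete, here is a minimal host-side sketch (my illustration, not from the slides) that launches the calcSin kernel above with OpenCL. The choice of the first platform/GPU device, the work-group size of 64, and the omission of error checking and cleanup are all assumptions for brevity.

  #include <stdio.h>
  #include <CL/cl.h>

  static const char *src =
      "kernel void calcSin(global float *data) {\n"
      "    int id = get_global_id(0);\n"
      "    data[id] = sin(data[id]);\n"
      "}\n";

  int main(void) {
      enum { N = 1024 };
      float data[N];
      for (int i = 0; i < N; i++) data[i] = (float)i / N;

      /* Pick the first platform and GPU device (an assumption). */
      cl_platform_id plat; cl_device_id dev;
      clGetPlatformIDs(1, &plat, NULL);
      clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
      cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
      cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

      /* Explicit data movement: copy the array into device memory. */
      cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  sizeof(data), data, NULL);
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
      clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
      cl_kernel k = clCreateKernel(prog, "calcSin", NULL);
      clSetKernelArg(k, 0, sizeof(buf), &buf);

      /* Explicit parallelism and hierarchy: one work-item per element,
         grouped into work-groups of 64; synchronization is only possible
         within a work-group, never across the whole range. */
      size_t global = N, local = 64;
      clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
      clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
      printf("data[1] = %f\n", data[1]);
      return 0;
  }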
Why are GPUs Scaling So Well?
• Room in the hardware design
• Scalable software
They're not burdened with 30 years of cruft and legacy code…
…lucky them.

WHERE ARE THE PROBLEMS?

Amdahl's Law
• You always have serial code
– GPU single-threaded performance is terrible
• Solution: heterogeneity
– A few fat latency-optimized cores
– Many thin throughput-optimized cores
– Plus hard-coded accelerators
• Nvidia Project Denver: ARM for latency, GPU for throughput
• AMD Fusion: x86 for latency, GPU for throughput
• Intel MIC: x86 for latency, x86-"light" for throughput
[Chart: Limits of Amdahl's Law. Maximum speedup vs. number of cores (1 to 1024): 99% parallel tops out at 91x, 90% at 10x, 75% at 4x.]
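The numbers on that curve follow directly from Amdahl's formula: with parallel fraction p on n cores, speedup = 1 / ((1 - p) + p/n). A small sketch (my illustration, not from the slides) that reproduces them at the chart's 1024-core endpoint:

  #include <stdio.h>

  /* Amdahl's law: the serial fraction (1 - p) is untouched by adding
     cores, so it bounds the achievable speedup. */
  static double amdahl(double p, int n) {
      return 1.0 / ((1.0 - p) + p / n);
  }

  int main(void) {
      double fractions[] = { 0.99, 0.90, 0.75 };
      for (int i = 0; i < 3; i++) {
          double p = fractions[i];
          /* The limit as n grows without bound is 1 / (1 - p). */
          printf("%2.0f%% parallel: %5.1fx at 1024 cores, %4.0fx max\n",
                 p * 100.0, amdahl(p, 1024), 1.0 / (1.0 - p));
      }
      return 0;   /* prints ~91x, ~10x, and ~4x, matching the chart */
  }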
It's an Accelerator…
• Moving data to the GPU is slow…
• Moving data from the GPU is slow…
• Moving data to/from the GPU is really slow.
• Limited data storage
• Limited interaction with the OS
[Diagram: Nvidia GTX 580 (529 mm²) with 1.5 GB of GRAM at 192 GB/s, connected over PCIe 3.0 at 10 GB/s to an Intel 3960X (435 mm²) with 16 GB of DRAM at 51 GB/s]
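A back-of-envelope model (my illustration, using the bandwidth numbers from the diagram above) shows why the transfers dominate for a kernel that touches each byte only once; the 1 GB working set is an arbitrary assumption.

  #include <stdio.h>

  /* Offload break-even sketch: PCIe ~10 GB/s each way, GPU memory
     ~192 GB/s, CPU memory ~51 GB/s, as on the slide. For a
     bandwidth-bound kernel that reads and writes each byte once,
     the GPU's bandwidth advantage must outweigh the transfer cost. */
  int main(void) {
      double bytes = 1e9;                     /* 1 GB working set */
      double pcie = 10e9, gram = 192e9, dram = 51e9;

      double t_cpu = bytes / dram;            /* process in place on the CPU */
      double t_gpu = bytes / pcie             /* copy to the GPU             */
                   + bytes / gram             /* process at GPU bandwidth    */
                   + bytes / pcie;            /* copy the result back        */
      printf("CPU: %.1f ms, GPU incl. transfers: %.1f ms\n",
             t_cpu * 1e3, t_gpu * 1e3);
      /* ~20 ms vs. ~205 ms: the transfers dominate, so the GPU only
         wins when data stays on the device and is reused many times. */
      return 0;
  }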
Legacy Code
• Code lives forever
– Amdahl: even optimizing the 90% of hot code limits speedup to 10x
– Many won't invest in proprietary technology
• Programming models are immature:

  CUDA       mature     low-level   Nvidia, PGI
  OpenCL     immature   low-level   Nvidia, AMD, Intel, ARM, Altera, Apple
  OpenHMPP   mature     high-level  CAPS
  OpenACC    immature   high-level  CAPS, Nvidia, Cray, PGI

Code that Doesn't Look Like Graphics
• If it's not painfully data-parallel, you have to redesign your algorithm
– Example: scan-based techniques for zero counting in JPEG
– Why? It's only 64 entries!
• Single-threaded performance is terrible, so you need to parallelize.
• The overhead of transferring the data to the CPU is too high.
• If it's not accessing memory well, you have to re-order your algorithm
– Example: DCT in JPEG
• You need to make sure your accesses to local memory have no bank conflicts across threads.
• Libraries are starting to help
– Lack of composability
– Example: combining linear algebra operations to keep data on the device
• Most code is not purely data-parallel
– Very expensive to synchronize with the CPU (data transfer)
– No effective support for task-based parallelism
– No ability to launch kernels from within kernels

Corollary: What are GPUs Good For?
• Data parallel code
– Lots of threads
– Easy to express parallelism (no SIMD nastiness)
• High arithmetic intensity and simple control flow
– Lots of FPUs
– No branch predictors
• High reuse of limited data sets
– Very high bandwidth to memory (1-6 GB)
– Extremely high bandwidth to local memory (16-64 kB)
• Code with predictable access patterns
– Small (or no) caches
– User-controlled local memories

WHAT'S THE FUTURE OF GPUS?

big.LITTLE Task Migration Use Model
In the big.LITTLE task migration use model, the OS and applications only ever execute on the Cortex-A15 or the Cortex-A7, and never on both processors at the same time. This use model is a natural extension of the Dynamic Voltage and Frequency Scaling (DVFS) operating points provided by current mobile platforms with a single application processor, which allow the OS to match the performance of the platform to the performance required by the application.

However, in a Cortex-A15/Cortex-A7 platform these operating points are applied to both the Cortex-A15 and the Cortex-A7. When the Cortex-A7 is executing, the OS can tune the operating points as it would for an existing platform with a single applications processor. Once the Cortex-A7 is at its highest operating point, if more performance is required, a task migration can be invoked that picks up the OS and applications and moves them to the Cortex-A15.
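As a rough illustration of that policy, here is a hypothetical sketch in C. The types, names, and function are invented for exposition; they are not ARM's or any kernel's actual API, and the operating-point counts are made up.

  #include <stdio.h>
  #include <stdbool.h>

  /* Illustrative big.LITTLE migration policy: treat the Cortex-A15 as a
     set of "operating points" above the top of the Cortex-A7's DVFS range. */
  enum cluster { CORTEX_A7, CORTEX_A15 };

  struct platform {
      enum cluster active;   /* only one cluster runs the OS at a time */
      int opp;               /* current operating point in that cluster */
      int max_opp;           /* highest operating point of that cluster */
  };

  void on_load_change(struct platform *p, bool need_more_performance) {
      if (need_more_performance) {
          if (p->opp < p->max_opp) {
              p->opp++;                  /* ordinary DVFS step up */
          } else if (p->active == CORTEX_A7) {
              p->active = CORTEX_A15;    /* migrate OS + apps to the big cluster */
              p->opp = 0;                /* continue DVFS in the A15's own range */
              p->max_opp = 4;            /* hypothetical A15 operating points */
          }
      } else if (p->opp > 0) {
          p->opp--;                      /* step down; the mirror-image logic
                                            would migrate back to the A7 */
      }
  }

  int main(void) {
      struct platform p = { CORTEX_A7, 0, 3 };   /* hypothetical A7 range */
      for (int i = 0; i < 5; i++)
          on_load_change(&p, true);              /* rising load */
      printf("cluster=%s opp=%d\n",
             p.active == CORTEX_A15 ? "A15" : "A7", p.opp);
      return 0;   /* after exhausting the A7's range, we run on the A15 */
  }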