Understanding Throughput-Oriented Architectures
Cedric Nugteren
GPU Mini Symposium, 25th of January, 2011

Contents
• Motivation
  • GPUs in supercomputers
  • Performance per Watt
  • Example applications
• Throughput-oriented architectures
• Programming GPUs
• Current and future architectures
• Summary
/ GPU Mini Symposium 25-01-2011 PAGE 2

GPUs in supercomputers
Taken from: Top500 Supercomputer Websites, found at ‘www.top500.org’, November 2010

Performance per Watt
Taken from: Bill Dally, in ‘GPU Computing: To Exascale and Beyond’, 2010

Performance per Watt
[Figure: performance [GFLOPS] versus power [W], both on logarithmic axes. NVIDIA GeForce: G210M, GT220, GTS250, GTX285, GTX480.]

Performance per Watt
[Figure: performance [GFLOPS] versus power [W], both on logarithmic axes. NVIDIA GeForce: G210M, GT220, GTS250, GTX285, GTX480. AMD Radeon: HD5430, HD5770, HD5870. CPU: Atom 330, Core i7-960.]

Performance per Watt
[Figure: performance [GFLOPS] versus power [W], both on logarithmic axes. NVIDIA GeForce: G210M, GT220, GTS250, GTX285, GTX480. AMD Radeon: HD5430, HD5770, HD5870. PowerVR: 545, 543MP8. CPU: Atom 330, Core i7-960.]

Example applications

Contents
• Motivation
• Throughput-oriented architectures
  • Example: The NVIDIA Fermi GPU
  • Hardware multithreading
  • Many simple processing units
  • SIMD execution
• Programming GPUs
• Current and future architectures
• Summary

Example: The NVIDIA Fermi GPU
• High off-chip memory bandwidth (~100 GB/s)
• High floating point performance (~1000 GFLOPS)
• Small on-chip scratchpads (48 KB per SM)
• Shared L2 data cache (768 KB)
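The two headline figures above imply how much computation a kernel has to perform per byte fetched from off-chip memory before bandwidth stops being the limit. The following is a minimal sketch of that arithmetic, using only the approximate values from the slide; the function names are illustrative helpers and not part of any GPU API.

```c
/* Break-even arithmetic intensity from approximate peak figures.
   Both function names are hypothetical helpers for illustration. */

/* Floating point operations needed per byte of off-chip traffic to keep
   the arithmetic units busy: peak GFLOPS divided by peak GB/s. */
double flops_per_byte(double peak_gflops, double bandwidth_gb_per_s)
{
    return peak_gflops / bandwidth_gb_per_s;
}

/* The same ratio expressed per single-precision value (4 bytes). */
double flops_per_float(double peak_gflops, double bandwidth_gb_per_s)
{
    return 4.0 * flops_per_byte(peak_gflops, bandwidth_gb_per_s);
}
```

With roughly 1000 GFLOPS and 100 GB/s, `flops_per_byte(1000.0, 100.0)` gives 10, or about 40 operations per loaded float. Kernels with less data reuse than that are limited by memory bandwidth rather than compute, which is one reason the on-chip scratchpads and the L2 cache matter.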
Taken from: Michael Garland and David Kirk, in ‘Communications of the ACM’, November 2010

Throughput-oriented architectures
• Throughput-oriented architectures are characterised by:
  • Hardware multithreading
  • Many simple processing units
  • SIMD execution
• Example throughput-oriented architectures:
  • GPUs (NVIDIA, AMD, Intel, PowerVR)
  • Intel Larrabee / MIC
  • STI Cell Broadband Engine

Hardware multithreading
“GPUs are specifically designed to execute literally billions of small user-written programs per second.”
Michael Garland and David Kirk, in ‘Communications of the ACM’, November 2010

Hardware multithreading
• Hide latencies through fine-grained hardware multithreading:
  • Hide pipeline latencies
  • Hide (off-chip) DRAM latencies
[Figure: execution timeline of threads 0 to 3 over time; legend: compute instruction, memory instruction, memory latency]
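The latency-hiding effect can be illustrated with a toy cycle-level model in plain C. This is a sketch under simplifying assumptions that are not taken from the slides: a single core that issues at most one instruction per cycle, round-robin thread selection, and threads that all repeat the same pattern of compute instructions followed by one memory access. The names `busy_cycles` and `MAX_THREADS` are made up for this example.

```c
/* Toy model of fine-grained hardware multithreading (illustrative only,
   not a model of any specific GPU).
   One core issues at most one instruction per cycle. Every thread repeats:
   compute_per_mem compute instructions, then one memory instruction that
   blocks that thread for mem_latency cycles. */

#define MAX_THREADS 64

/* Returns in how many of total_cycles the core issued an instruction. */
int busy_cycles(int num_threads, int compute_per_mem, int mem_latency,
                int total_cycles)
{
    int ready_at[MAX_THREADS]; /* first cycle in which the thread may issue */
    int computed[MAX_THREADS]; /* compute instructions since last memory op */
    int busy = 0;
    int next = 0;              /* round-robin starting point */

    if (num_threads < 1)
        return 0;
    if (num_threads > MAX_THREADS)
        num_threads = MAX_THREADS;
    for (int t = 0; t < num_threads; t++) {
        ready_at[t] = 0;
        computed[t] = 0;
    }

    for (int cycle = 0; cycle < total_cycles; cycle++) {
        /* Pick the first ready thread, starting after the last issuer. */
        for (int k = 0; k < num_threads; k++) {
            int t = (next + k) % num_threads;
            if (ready_at[t] > cycle)
                continue;                /* still waiting on memory */
            if (computed[t] < compute_per_mem) {
                computed[t]++;           /* issue a compute instruction */
            } else {
                computed[t] = 0;         /* issue a memory instruction */
                ready_at[t] = cycle + 1 + mem_latency;
            }
            busy++;
            next = (t + 1) % num_threads;
            break;
        }
    }
    return busy;
}
```

In this model, with one compute instruction per memory access and a memory latency of 4 cycles, a single thread keeps the core busy in only 4 out of 12 cycles (`busy_cycles(1, 1, 4, 12)`), whereas six threads keep it busy in all 12 (`busy_cycles(6, 1, 4, 12)`): the stall of one thread is covered by instructions from the others.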

Hardware multithreading
• Hide latencies through fine-grained hardware multithreading:
  • Hide pipeline latencies
  • Hide (off-chip) DRAM latencies
• Caches for load/store operations to off-chip memory are no longer required
[Figure: chip area diagram with blocks labelled Core and Cache]

Hardware multithreading
• Hide latencies through fine-grained hardware multithreading
• Changes inside a core:
  • Remove data caches:
    - Large register file needed
  • Remove branch predictor and out-of-order scheduler:
    - Thread scheduler needed
[Figure: block diagrams of a conventional core (L1 I$ / D$, decoder, branch predictor, O-o-O scheduler, execution unit, MMU, L2 cache) and a multithreaded core (L1 I$ / D$, decoder, thread scheduler, register file, SPM, execution unit, MMU)]

Many simple processing units
“Aggressively throughput-oriented processors, exemplified by the GPU, willingly sacrifice single-thread execution speed to increase total computational throughput across all threads.”
Michael Garland and David Kirk, in ‘Communications of the ACM’, November 2010

Many simple processing units
• Reduce per-core complexity and increase per-chip throughput by using many simple processing units
[Figure: chip with twelve simple cores]

Many simple processing units
• Reduce per-core complexity and increase per-chip throughput by using many simple processing units
• Characteristics:
  • No out-of-order execution, no branch prediction
  • Simple execution unit, simple decoder / scheduler
  • Small memory sizes
[Figure: block diagrams of a conventional core (L1 I$ / D$, decoder, branch predictor, O-o-O scheduler, execution unit, MMU, L2 cache) and a simple core (I$ / SPM, decoder / scheduler, register file, execution unit, load/store)]

SIMD execution
• Increase the amount of resources devoted to functional units rather than control, by enabling SIMD execution
• Characteristics:
  • Small processing elements (PEs)
  • Favours one execution trace (uniform work)
[Figure: block diagrams of a simple core (I$ / SPM, decoder / scheduler, register file, execution unit, load/store) and a SIMD core in which the single execution unit is replaced by 32 PEs]

SIMD execution
• Increase the amount of resources devoted to functional units rather than control, by enabling SIMD execution
• Characteristics:
  • Small processing elements (PEs)
  • Favours one execution trace (uniform work)
• On GPUs:
  • Single Instruction Multiple Thread (SIMT)
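Why SIMD execution favours a single execution trace can be sketched by emulating one group of 32 lanes in plain C. The sketch below assumes the common masking scheme in which both sides of a branch are issued one after the other whenever the lanes disagree; the group size of 32 follows the 32 PEs in the diagram, and all function names are hypothetical.

```c
#include <stdint.h>

#define WARP_SIZE 32

/* Emulates one group of 32 lanes executing
       if (x < 0) x = -x; else x = 2 * x;
   Returns the number of branch paths that had to be issued:
   1 if all lanes agree, 2 if the lanes diverge. */
int warp_abs_or_double(int x[WARP_SIZE])
{
    uint32_t taken = 0;   /* bit i set: lane i takes the 'if' path */
    int paths_issued = 0;

    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (x[lane] < 0)
            taken |= (uint32_t)1 << lane;

    uint32_t not_taken = ~taken;

    if (taken != 0) {         /* 'if' path, all other lanes masked off */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((taken >> lane) & 1u)
                x[lane] = -x[lane];
        paths_issued++;
    }
    if (not_taken != 0) {     /* 'else' path, all other lanes masked off */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((not_taken >> lane) & 1u)
                x[lane] = 2 * x[lane];
        paths_issued++;
    }
    return paths_issued;
}

/* Hypothetical helper: the first num_negative lanes hold -1, the rest +1.
   Reports how many branch paths were issued for that input. */
int paths_for_pattern(int num_negative)
{
    int x[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; lane++)
        x[lane] = (lane < num_negative) ? -1 : 1;
    return warp_abs_or_double(x);
}
```

Uniform inputs (`paths_for_pattern(0)` or `paths_for_pattern(32)`) need a single pass, while any mixed input needs two passes with part of the lanes idle in each, so non-uniform work costs throughput.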

Contents
• Motivation
• Throughput-oriented architectures
• Programming GPUs
  • Programming languages
  • Code example
  • Challenges
• Current and future architectures
• Summary

Programming languages
[Figure: programming languages plotted by ease of programmability versus vendor independence]

Programming languages
• A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
• The hardware converts TLP into DLP at run time.
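Sequential C:

    float A[4][8];
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 8; j++) {
            A[i][j]++;
        }
    }

CUDA:

    // One thread per element of the 4x8 array.
    __global__ void kernelF(float (*A)[8]) {
        int i = blockIdx.x;   // block index selects the row
        int j = threadIdx.x;  // thread index selects the column
        A[i][j]++;
    }

    float (*A)[8];  // must point to a 4x8 array in device memory
    kernelF<<<dim3(4, 1), dim3(8, 1)>>>(A);  // 4 blocks of 8 threads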
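The mapping between the two versions can be made explicit by emulating the kernel launch with ordinary loops. This is a plain-C sketch, not CUDA: `kernel_body`, `launch_emulated` and `elements_incremented_once` are hypothetical names, the block and thread indices are passed as ordinary parameters, and the array is stored flat so that the usual global index (block index times block size plus thread index) is visible.

```c
enum { GRID_DIM = 4, BLOCK_DIM = 8 };   /* mirrors 4 blocks of 8 threads */

/* Work of a single thread, with the built-in indices passed explicitly. */
static void kernel_body(float *A, int block_idx, int thread_idx)
{
    /* Row-major position of A[block_idx][thread_idx] in a flat array. */
    int global = block_idx * BLOCK_DIM + thread_idx;
    A[global] += 1.0f;
}

/* Sequential stand-in for the kernel launch: visits every (block, thread)
   pair exactly once. On a GPU these iterations are independent and may
   run in parallel and in any order. */
void launch_emulated(float *A)
{
    for (int b = 0; b < GRID_DIM; b++)
        for (int t = 0; t < BLOCK_DIM; t++)
            kernel_body(A, b, t);
}

/* Hypothetical helper: starts from zeros, runs the emulated launch and
   counts how many elements were incremented exactly once. */
int elements_incremented_once(void)
{
    float A[GRID_DIM * BLOCK_DIM] = {0};
    int count = 0;

    launch_emulated(A);
    for (int k = 0; k < GRID_DIM * BLOCK_DIM; k++)
        if (A[k] == 1.0f)
            count++;
    return count;
}
```

Every one of the 32 elements is touched by exactly one (block, thread) pair, which is the property that lets the hardware run the threads side by side as data-parallel work.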

Code example: Reduction
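As background for the reduction slides, the tree-shaped summation that parallel reductions are commonly built on can be sketched in plain C. This is not one of the kernels from the cited presentation: it is a sequential illustration on a flat array, and the names `sum_tree`, `tree_sum_of_first` and `MAX_N` are made up for this example.

```c
#define MAX_N 64

/* Tree-shaped sum, performed in place. In the step with a given stride,
   every element at a multiple of 2 * stride adds its neighbour at distance
   stride. All additions within one step are independent of each other, so
   on a parallel machine each could be done by its own thread, with a
   synchronisation between steps. Returns the total, left in data[0]. */
int sum_tree(int *data, int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];
    return n > 0 ? data[0] : 0;
}

/* Hypothetical helper: tree-sums the values 1..n (n is capped at MAX_N). */
int tree_sum_of_first(int n)
{
    int data[MAX_N];

    if (n > MAX_N)
        n = MAX_N;
    for (int i = 0; i < n; i++)
        data[i] = i + 1;
    return sum_tree(data, n);
}
```

For 8 elements the sum is complete after 3 steps instead of 7 dependent additions; the work per step halves each time, which is one reason why keeping all processing elements busy in a reduction takes further optimisation.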
• Sum all elements in a matrix

    int sum = 0;
    for (i=0; i

Taken from: Mark Harris, NVIDIA, in ‘Optimizing Parallel Reduction in CUDA’, 2008

Code example: Reduction
• First CUDA kernel implementation
• Relatively straightforward
• Code still understandable

Taken from: Mark Harris, NVIDIA, in ‘Optimizing Parallel Reduction in CUDA’, 2008

Code example: Reduction
• 7th and final CUDA kernel implementation
• Thoroughly optimized:
  • Templates
  • Unrolling
  • Algorithm adjustments
• Code unrecognizable

Taken from: Mark Harris, NVIDIA, in ‘Optimizing Parallel Reduction in CUDA’, 2008

Code example: Reduction
• Result: 30x speed-up over a naive implementation

Taken from: Mark Harris, NVIDIA, in ‘Optimizing Parallel Reduction in CUDA’, 2008

Challenges
“We stand at the threshold of a many core world. The hardware community is ready to cross this threshold. The parallel software community is not.”
Tim Mattson, Intel, 2010

Contents
• Motivation
• Throughput-oriented architectures
• Programming GPUs
• Current and future architectures
  • Intel Sandy Bridge
  • Intel Many Integrated Core
  • AMD Fusion
  • NVIDIA Project Denver
  • NVIDIA Kepler / Maxwell
  • NVIDIA Echelon
• Summary

Intel Sandy Bridge (Q1 2011)
• New Core-i7 architecture
• Shared L3 cache ‘Last Level Cache’
• 12 shaders in GPU (low-end)

Intel Many Integrated Core (MIC)
• MIC was previously known as Larrabee
• 32-core (x86-compatible) MIC ‘Knights Ferry’
• Future ‘Knights Corner’ with more than 50 cores

AMD Fusion (Q3 2011)
• AMD Llano Fusion ‘APU’ (Accelerated Processing Unit)
• No shared cache between GPU and CPU

NVIDIA Project Denver
“With Project Denver, we are designing a high-performing ARM CPU core in combination with our massively parallel GPU cores to create a new class of processor.”
Jen-Hsun Huang, CEO of NVIDIA, at ‘CES2011’, January 2011

NVIDIA Kepler / Maxwell (2011/2013)
“I’d like to congratulate NVIDIA on taking a giant step towards making GPUs attractive for a broader class of programs. I believe history will record Fermi as a significant milestone along the road to 2020.”
David Patterson, UC Berkeley, 2009

NVIDIA Kepler / Maxwell (2011/2013)
Three promises for Kepler and Maxwell:
• Preemptive
• Virtual memory
• Lower CPU dependence

Taken from: Bill Dally, in ‘GPU Computing: To Exascale and Beyond’, 2010

NVIDIA Echelon (Research)
Exascale research project: performance of 20 TFLOPS per node, 160 TFLOPS per module and 2.6 PFLOPS per cabinet

Taken from: Bill Dally, in ‘GPU Computing: To Exascale and Beyond’, 2010

Contents
• Motivation
• Throughput-oriented architectures
• Programming GPUs
• Current and future architectures
• Summary

Summary
• We characterised throughput-oriented architectures:
  • Hardware multithreading
  • Many simple processing units
  • SIMD execution
• We discussed programming challenges:
  • Achieving a working solution is straightforward
  • Getting maximum performance remains difficult
• We identified new emerging architectures:
  • General purpose GPUs
  • Hybrid CPU/GPUs (or APUs)

Summary
“Building a parallel computer by connecting […] CPUs […], an approach often called multi-core, will not work. This approach is analogous to trying to build an airplane by putting wings on a train.”
Bill Dally, NVIDIA, 2010

Summary
“We expect the client hardware of 2020 will contain hundreds of cores, each tile will contain 1 processor designed for instruction-level parallelism and a descendant of vector and GPU architectures for data-level parallelism.”
ParLab, UC Berkeley, in ‘IEEE Micro’, March/April 2010

Questions?