Understanding Throughput-Oriented Architectures

Cedric Nugteren
GPU Mini Symposium, 25 January 2011

Contents

• Motivation
  • GPUs in supercomputers
  • Performance per Watt
  • Example applications
• Throughput-oriented architectures
• Programming GPUs
• Current and future architectures
• Summary

GPUs in supercomputers

Taken from: the Top500 list, found at 'www.top500.org', November 2010

Performance per Watt

[Figure: performance-per-Watt comparison. Taken from: Bill Dally, in 'GPU Computing: To Exascale and Beyond', 2010]

[Figure: log-log plot of performance (GFLOPS) versus power (W) for NVIDIA GeForce GPUs: G210M, GT220, GTS250, GTX285, GTX480]


[Figure: the same plot extended with AMD Radeon GPUs (HD5430, HD5770, HD5870) and CPUs (Intel Atom 330, Core i7-960)]


[Figure: the same plot further extended with PowerVR GPUs (545 and 543MP8)]

Example applications

[Figure: images of example GPU applications]

Contents

• Motivation
• Throughput-oriented architectures
  • Example: The NVIDIA Fermi GPU
  • Hardware multithreading
  • Many simple processing units
  • SIMD execution
• Programming GPUs
• Current and future architectures
• Summary

Example: The NVIDIA Fermi GPU

[Figure: Fermi architecture overview]

• High off-chip memory bandwidth (~100 GB/s)
• High floating-point performance (~1000 GFLOPS)
• Small on-chip scratchpads (48 KB per SM)
• Shared L2 data cache (768 KB)

Taken from: Michael Garland and David Kirk, in 'Communications of the ACM', November 2010
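As a practical aside (not from the original slides), these figures can be queried from the CUDA runtime; a minimal sketch, assuming a toolkit recent enough to expose the l2CacheSize property:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch: query the memory-hierarchy figures quoted above for device 0.
    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Peak DRAM bandwidth: 2 transfers per clock (DDR), clock in kHz,
        // bus width in bits converted to bytes; result in GB/s.
        double bw = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8.0) / 1.0e6;

        printf("Scratchpad (shared memory) per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
        printf("L2 cache size: %d KB\n", prop.l2CacheSize / 1024);
        printf("Theoretical DRAM bandwidth: %.0f GB/s\n", bw);
        return 0;
    }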

Throughput-oriented architectures

• Throughput-oriented architectures are characterised by:
  • Hardware multithreading
  • Many simple processing units
  • SIMD execution

• Example throughput-oriented architectures:
  • GPUs (NVIDIA, AMD, PowerVR)
  • Intel Larrabee / MIC
  • STI Cell Broadband Engine

Hardware multithreading

“GPUs are specifically designed to execute literally billions of small user-written programs per second.”

Michael Garland and David Kirk In ‘Communications of the ACM’, November 2010

Hardware multithreading

• Hide latencies through fine-grained hardware multithreading:
  • Hide pipeline latencies
  • Hide (off-chip) DRAM latencies

[Figure: execution timeline of four threads; while one thread waits out the memory latency of its memory instruction, compute instructions from the other threads fill the gap]

Hardware multithreading

• Hide latencies through fine-grained hardware multithreading:
  • Hide pipeline latencies
  • Hide (off-chip) DRAM latencies
• Caches for load/store operations to off-chip memory are no longer required (see the kernel sketch below the figure)

[Figure: removing the shared cache frees chip area for additional cores]
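To illustrate, a hypothetical memory-bound kernel (names are illustrative, not from the slides); the comments describe how the warp scheduler hides the DRAM access:

    // Sketch: while one warp waits hundreds of cycles for its global load,
    // the SM issues instructions from other resident warps, so no data
    // cache is needed to keep the pipeline busy.
    __global__ void scale(const float* in, float* out, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = in[i];      // long-latency DRAM read; this warp is parked
            out[i] = alpha * x;   // issued once the load completes
        }
    }
    // Launching many more threads than processing elements (tens of
    // thousands) gives the scheduler enough warps to overlap latency.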

Hardware multithreading

• Hide latencies through fine-grained hardware multithreading
• Changes inside a core:
  • Remove data caches:
    - Large register file needed
  • Remove branch predictor and out-of-order scheduler:
    - Thread scheduler needed

[Figure: side-by-side core diagrams. Latency-oriented core: L1 I$/D$, decoder, branch predictor, out-of-order scheduler, execution unit, MMU, L2 cache. Throughput-oriented core: L1 I$/D$, decoder, thread scheduler, large register file, scratchpad memory (SPM), execution unit, MMU]

Many simple processing units

“Aggressively throughput-oriented processors, exemplified by the GPU, willingly sacrifice single-thread execution speed to increase total computational throughput across all threads.”

Michael Garland and David Kirk In ‘Communications of the ACM’, November 2010

Many simple processing units

• Reduce per-core complexity and increase per-chip throughput by using many simple processing units

[Figure: a chip tiled with twelve small, simple cores]

Many simple processing units

• Reduce per-core complexity and increase per-chip throughput by using many simple processing units
• Characteristics:
  • No out-of-order execution, no branch prediction
  • Simple execution unit, simple decoder/scheduler
  • Small memory sizes

[Figure: a complex core (L1 I$/D$, decoder, branch predictor, out-of-order scheduler, execution unit, MMU, L2 cache) versus a simple core (I$/SPM, decoder/scheduler, register file, execution unit, load/store unit)]

SIMD execution

• Increase the amount of resources devoted to functional units rather than control by enabling SIMD execution
• Characteristics:
  • Small processing elements (PEs)
  • Favours one execution trace (uniform work)

[Figure: a simple scalar core (I$/SPM, decoder/scheduler, register file, execution unit, load/store unit) versus a SIMD core in which the execution unit is replaced by 32 PEs]

SIMD execution

• Increase the amount of resources devoted to functional units rather than control by enabling SIMD execution
• Characteristics:
  • Small processing elements (PEs)
  • Favours one execution trace (uniform work)
• On GPUs: Single Instruction Multiple Thread (SIMT); the sketch below illustrates the cost of divergence
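A hypothetical sketch of why SIMT favours uniform work (kernel names are illustrative): threads of a warp that take different branch directions are serialised, halving throughput in the divergent case.

    // Divergent: within each 32-wide warp, even and odd threads take
    // different paths, which the SIMT hardware executes one after the other.
    __global__ void divergent(float* data) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid % 2 == 0) data[tid] *= 2.0f;
        else              data[tid] += 1.0f;
    }

    // Uniform: all threads of a warp follow the same execution trace,
    // so all 32 processing elements stay busy every cycle.
    __global__ void uniform(float* data) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        data[tid] = data[tid] * 2.0f + 1.0f;
    }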

Contents

• Motivation
• Throughput-oriented architectures
• Programming GPUs
  • Programming languages
  • Code example
  • Challenges
• Current and future architectures
• Summary


Programming languages

[Figure: GPU programming languages plotted by ease of programmability versus vendor independence]

Programming languages

• A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP)
• The hardware converts TLP back into DLP at run time

The sequential loop nest and its CUDA equivalent:

    // Sequential version: DLP expressed as a loop nest
    float A[4][8];
    for (int i = 0; i < 4; i++) {
      for (int j = 0; j < 8; j++) {
        A[i][j]++;
      }
    }

    // CUDA version: DLP expressed as TLP, one thread per element
    __global__ void kernelF(float (*A)[8]) {
      int i = blockIdx.x;   // block index replaces the outer loop
      int j = threadIdx.x;  // thread index replaces the inner loop
      A[i][j]++;
    }

    // Launch a grid of 4 blocks of 8 threads (A must be device memory):
    kernelF<<<dim3(4, 1), dim3(8, 1)>>>(A);
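Note how the loop bounds reappear as the grid and block dimensions: the data-level parallelism of the loops is handed to the hardware as 32 threads, which the scheduler can map back onto SIMD lanes at run time.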

Code example: Reduction

• Sum all elements in a matrix:

    int sum = 0;
    for (int i = 0; i < N; i++) {  // N = total number of elements
      sum += A[i];
    }

Taken from: Mark Harris, NVIDIA, in 'Optimizing Parallel Reduction in CUDA', 2008

Code example: Reduction

• First CUDA kernel implementation

• Relatively straightforward
• Code still understandable (a reconstructed sketch follows)
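The slide's kernel listing is an image and is not reproduced here; the following is a sketch in the spirit of the first version from the Harris whitepaper cited below (reconstructed, so details may differ):

    // First-pass reduction: each block reduces blockDim.x elements in
    // shared memory using interleaved addressing; per-block results are
    // combined in a second pass or on the host.
    __global__ void reduce0(const int* g_idata, int* g_odata) {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // Tree-based reduction in shared memory
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }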

Taken from: Mark Harris, NVIDIA, in 'Optimizing Parallel Reduction in CUDA', 2008

Code example: Reduction

• 7th and final CUDA kernel implementation

• Thoroughly optimized:
  • Templates
  • Unrolling
  • Algorithm adjustments
• Code unrecognizable (a sketch of the unrolled last warp follows)
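To give a flavour of those optimisations, here is the templated last-warp unrolling in the spirit of the whitepaper cited below (reconstructed; consult the original for the full kernel):

    // blockSize is a compile-time constant, so the 'if's are resolved by
    // the compiler and the reduction loop disappears entirely. The code
    // relies on the implicit warp-synchronous execution of that era's GPUs.
    template <unsigned int blockSize>
    __device__ void warpReduce(volatile int* sdata, unsigned int tid) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
        if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
        if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
        if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
    }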

Taken from: Mark Harris, NVIDIA, in 'Optimizing Parallel Reduction in CUDA', 2008

Code example: Reduction

• Result: 30x speed-up over a naive implementation

Taken from: Mark Harris, NVIDIA, in 'Optimizing Parallel Reduction in CUDA', 2008

Challenges

“We stand at the threshold of a many core world. The hardware community is ready to cross this threshold. The parallel software community is not.”

Tim Mattson, Intel 2010

Contents

• Motivation
• Throughput-oriented architectures
• Programming GPUs
• Current and future architectures
  • Intel Sandy Bridge
  • Intel Many Integrated Core
  • AMD Fusion
  • NVIDIA Project Denver
  • NVIDIA Kepler / Maxwell
  • NVIDIA Echelon
• Summary

Intel Sandy Bridge (Q1 2011)

• New Core i7 architecture
• Shared L3 cache ('Last Level Cache')
• 12 execution units in the GPU (low-end)

Intel Many Integrated Core (MIC)

• MIC was previously known as Larrabee
• 32-core (x86-compatible) MIC 'Knights Ferry'
• Future 'Knights Corner' with more than 50 cores

AMD Fusion (Q3 2011)

• AMD Llano Fusion 'APU' (Accelerated Processing Unit)
• No shared cache between GPU and CPU

NVIDIA Project Denver

“With Project Denver, we are designing a high-performing ARM CPU core in combination with our massively parallel GPU cores to create a new class of processor.”

Jen-Hsun Huang, CEO of NVIDIA, at 'CES 2011', January 2011

NVIDIA Kepler / Maxwell (2011/2013)

“I'd like to congratulate NVIDIA on taking a giant step towards making GPUs attractive for a broader class of programs. I believe history will record Fermi as a significant milestone along the road to 2020.”

David Patterson, UC Berkeley 2009

NVIDIA Kepler / Maxwell (2011/2013)

Three promises for Kepler and Maxwell:
• Preemption
• Virtual memory
• Lower CPU dependence

Taken from: Bill Dally, in 'GPU Computing: To Exascale and Beyond', 2010

NVIDIA Echelon (Research)

Exascale research project: performance of 20 TFLOPS per node, 160 TFLOPS per module, and 2.6 PFLOPS per cabinet

Taken from: Bill Dally, in 'GPU Computing: To Exascale and Beyond', 2010

Contents

• Motivation
• Throughput-oriented architectures
• Programming GPUs
• Current and future architectures
• Summary

Summary

• We characterised throughput-oriented architectures:
  • Hardware multithreading
  • Many simple processing units
  • SIMD execution
• We discussed programming challenges:
  • Achieving a working solution is straightforward
  • Getting maximum performance remains difficult
• We identified newly emerging architectures:
  • General-purpose GPUs
  • Hybrid CPU/GPUs (or APUs)

Summary

“Building a parallel computer by connecting […] CPUs […], an approach often called multi-core, will not work. This approach is analogous to trying to build an airplane by putting wings on a train.”

Bill Dally, NVIDIA 2010

Summary

“We expect the client hardware of 2020 will contain hundreds of cores, each tile will contain 1 processor designed for instruction-level parallelism and a descendant of vector and GPU architectures for data-level parallelism.”

ParLab, UC Berkeley In ‘IEEE Micro’, March/April 2010

Questions

?
