GPU System Architecture

Alan Gray EPCC The University of Edinburgh Outline

• Why do we want/need accelerators such as GPUs? • GPU-CPU comparison – Architectural reasons for GPU performance advantages • GPU accelerated systems – From desktop to • Accelerator Technology – – AMD – Intel • Example GPU machines

Alan Gray Outline

• Why do we want/need accelerators such as GPUs? • GPU-CPU comparison – Architectural reasons for GPU performance advantages • GPU accelerated systems – From desktop to supercomputer • Accelerator Technology – NVIDIA – AMD – Intel • Example GPU machines

Alan Gray • The power used by a CPU core is proportional to Clock Frequency x Voltage2 • In the past, computers got faster by increasing the frequency – Voltage was decreased to keep power reasonable. • Now, voltage cannot be decreased any further – 1s and 0s in a system are represented by different voltages – Reducing overall voltage further would reduce this difference to a point where 0s and 1s cannot be properly distinguished

Alan Gray • Instead, performance increases can be achieved through exploiting parallelism • Need a chip which can perform many parallel operations every clock cycle – Many cores and/or many operations per core • Want to keep power/core as low as possible • Much of the power expended by CPU cores is on functionality not generally that useful for HPC – Branch prediction, out-of-order execution etc

Alan Gray • So, for HPC, we want chips with simple, low power, number-crunching cores • But we need our machine to do other things as well as the number crunching – Run an operating system, perform I/O, set up calculation etc • Solution: “Hybrid” system containing both CPU and “accelerator” chips

Alan Gray • It costs a huge amount of money to design and fabricate new chips – Not feasible for relatively small HPC market • Luckily, over the last few years, Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market – And largely possess the right characteristics for HPC – Many number-crunching cores • GPU vendors have tailored existing GPU architectures to the HPC market

Alan Gray GPUs in HPC

• November 2011 Top500 list:

GPUs

(next generation will have GPUs) GPUs

GPUs

• GPUs now firmly established in HPC industry

Alan Gray Outline

• Why do we want/need accelerators such as GPUs? • GPU-CPU comparison – Architectural reasons for GPU performance advantages • GPU accelerated systems – From desktop to supercomputer • Accelerator Technology – NVIDIA – AMD – Intel • Example GPU machines

Alan Gray AMD 12-core CPU

= compute unit (= core)

• Not much space on CPU is dedicated to compute

Alan Gray NVIDIA Fermi GPU

= compute unit (= SM = 32 CUDA cores)

• GPU dedicates much more space to compute – At expense of caches, controllers, sophistication etc

Alan Gray Memory • For many applications, performance is very sensitive to memory

CPUs use DRAM GPUs use Graphics DRAM

• Graphics memory: much higher bandwidth than standard CPU memory

Alan Gray • GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz) 12 (2.1GHz) Operations/cycle 1 4 DP Performance (peak) 515 GFlops 101 GFlops (peak) 144 GB/s 27.5 GB/s

• For these particular products, GPU theoretical advantage is ~5x for both compute and main memory bandwidth • Application performance very much depends on application • Typically a small fraction of peak • Depends how well application is suited to/tuned for architecture

Alan Gray Outline

• Why do we want/need accelerators such as GPUs? • GPU-CPU comparison – Architectural reasons for GPU performance advantages • GPU accelerated systems – From desktop to supercomputer • Accelerator Technology – NVIDIA – AMD – Intel • Example GPU machines

Alan Gray • GPUs cannot be used instead of CPUs – They must be used together – GPUs act as accelerators – Responsible for the computationally expensive parts of the code

DRAM GDRAM

CPU GPU

PCIe I/O I/O

Alan Gray Scaling to larger systems

PCIe I/O I/O

CPU GPU + GDRAM Interconnect DRAM

CPU GPU + Interconnect allows GDRAM multiple nodes to be connected PCIe I/O I/O

• Can have multiple CPUs and GPUs within each “” or “shared memory node” – E.g. 2 CPUs +2 GPUs (above) – CPUs share memory, but GPUs do not Alan Gray GPU Accelerated Supercomputer

GPU/CPU GPU/CPU … GPU/CPU Node Node Node

GPU/CPU GPU/CPU GPU/CPU Node Node … Node

… … …

GPU/CPU GPU/CPU GPU/CPU Node Node … Node

Alan Gray • To utilise a GPU, programs must – Contain parts targeted at host CPU (most lines of source code) – Contain parts targeted at GPU (key computational kernels) – Manage data transfers between distinct CPU and GPU memory spaces – Traditional language (e.g /Fortran) does not provide these facilities • To run on multiple GPUs in parallel – Normally use one host CPU core (thread) per GPU – Program manages communication between host CPUs in the same fashion as traditional parallel programs – e.g. MPI and/or OpenMP (latter shared memory node only)

Alan Gray Outline

• Why do we want/need accelerators such as GPUs? • GPU-CPU comparison – Architectural reasons for GPU performance advantages • GPU accelerated systems – From desktop to supercomputer • Accelerator Technology – NVIDIA – AMD – Intel • Example GPU machines

Alan Gray Latest Technology

• NVIDIA – Tesla HPC specific GPUs have evolved from GeForce series

• AMD – FireStream HPC specific GPUs have evolved from (ATI) series

• Intel – Knights Corner many-core chip is like hybrid between a GPU and many-core CPU

Alan Gray Series GPU

• Chip partitioned into Streaming Multiprocessors (SMs) • Multiple cores per SM • Sophisticated thread engine • Shared L2 cache

Alan Gray NVIDIA GPU SM • Less scheduling units than cores • Threads are scheduled in groups of 32, called a warp • Threads within a warp always execute the same instruction in lock-step (on different data elements) • Configurable L1 Cache/ Shared Memory

Alan Gray NVIDIA Tesla Series “Tesla” “Fermi” “Fermi” “Fermi” “Kepler” 1060 2050 2070 2090 K20 (not yet released) CUDA cores 240 448 448 512 1536

SP Performance 933 GFlops 1030 1030 1330 ?? GFlops GFlops GFlops DP Performance 78 GFlops 515 GFlops 515 GFlops 665 GFlops ??

Memory 102 GB/s 144 GB/s 144 GB/s 178 GB/s ?? Bandwidth Memory 4 GB 3 GB 6 GB 6 GB ??

• Variety of packaging solutions

Alan Gray NVIDIA Roadmap

Alan Gray Programming NVIDIA Tesla • CUDA C: proprietary interface to the architecture. – Extensions to the C language which allow interfacing to the hardware – Now relatively mature, and supported by comprehensive documentation, user forums, etc – Fortran support provided by third party PGI Cuda Fortran compiler • OpenCL – Cross platform API – Similar in design to CUDA, but lower level and not so mature • Directives based approach – OpenMP style directives - abstract complexities away from programmer – Several products with varying syntax etc. – Still lack maturity – OpenMP committee currently considering adopting accelerators as part of OpenMP standard – With participation from all main vendors plus others such as EPCC

Alan Gray AMD FirePro

• AMD acquired ATI in 2006 • AMD FirePro series: derivative of Radeon chips with HPC enhancements • FirePro W9000 – Released Aug 2012 – Peak 1 TFLOP (double precision) – Peak 4 TFLOPS (single precision) – 6GB GDDR5 SDRAM – Now with full ECC support • Programming: OpenCL • Packaging: just cards for at the moment. Server solutions may follow.

Alan Gray Intel • Intel Larrabee: “A Many-Core x86 Architecture for Visual Computing” – Release delayed such that the chip missed competitive window of opportunity. – Larrabee was not released as a competitive product, but instead a platform for research and development (Knight’s Ferry). • Knights Corner derivative chip – Many Integrated Cores (MIC) architecture. No longer aimed at graphics market – Instead “Accelerating Science and Discovery” – “more than 50 cores” – Release expected in 2012 – Now branded as “Intel Xeon Phi” • Hybrid between GPU and many-core CPU

Alan Gray Intel Xeon Phi

CPU-like GPU-like x86 cores: same instruction set as standard CPUs Simple cores, lack sophistication e.g. no out-of- Relatively small number cores/chip order execution Each core contains 512- vector processing Fully cache coherent unit (16 SP or 8 DP numbers)

In principle could support an OS Supports multithreading

Not expected to run OS, but used as accelerator (at least initially)

• Programming: Possible to use standard models such as OpenMP but still need additional directives to manage data

Alan Gray Outline

• Why do we want/need accelerators such as GPUs? • GPU-CPU comparison – Architectural reasons for GPU performance advantages • GPU accelerated systems – From desktop to supercomputer • Accelerator Technology – NVIDIA – AMD – Intel • Example GPU machines

Alan Gray DIY GPU Workstation

• Just need to slot GPU card into PCI-e • Need to make sure there is enough space and power in workstation

Alan Gray GPU Servers

• Several vendors offer GPU Servers • Example Configuration: – 4 GPUs plus 2 (multi-core) CPUs

• Multiple servers can be connected via interconnect

Alan Gray EPCC HECToR GPU Machine

• EPCC own a GPU System which is partly available to UK researchers – As part of the HECToR UK National Service • Comprises 3 GPU servers, each with 4 NVIDIA Fermi GPUs – Connected via infiniband interconnect – Plus another node with single AMD Firestream GPU – Plus a front end with a single NVIDIA Fermi – More details at http://www.hector.ac.uk/howcan/admin/apply/ HECToRGPU.php

Alan Gray Cray XK6

• “The Cray XK6 supercomputer is a trifecta of scalar, network and many-core innovation. It combines Cray’s proven Gemini interconnect, AMD’s leading multi-core scalar processors and NVIDIA’s powerful manycore GPU processors to create a true, productive hybrid supercomputer.” • Each compute node contains 1 CPU + 1 GPU – Can scale up to 500,000 nodes • http://www.cray.com/Products/XK6/KX6.aspx Alan Gray Cray XK6 Compute Node

XK6 Compute Node Characteristics AMD Series 6200 (Interlagos) NVIDIA Tesla X2090 PCIe Gen2 Host Memory 16 or 32GB H 1600 MHz DDR3 H T NVIDIA Tesla X2090 Memory T 3 6GB GDDR5 capacity 3 Z Gemini High Speed Interconnect Y Upgradeable to future GPUs X

Alan Gray Cray XK6 Compute Blade

• Compute Blade: 4 Compute Nodes 4 CPUs (middle) + 4 GPUs (right) + 2 interconnect chips (left) (2 compute nodes share a single interconnect chip)

Alan Gray Summary

• GPUs have higher compute and memory bandwidth capabilities than CPUs • GPUs are not used alone, but work in tandem with CPUs – Leading to complications with task decomposition and memory management • GPU accelerated systems scale from simple workstations to large- scale • Traditional languages are not enough to program such systems • Current/imminent accelerator products from NVIDIA,AMD and Intel – NVIDIA are market leaders at the moment, with the proprietary CUDA the most common programming method – AMD largely rely on the cross-platform OpenCL – Intel will enter the market later this year with a hybrid CPU/GPU product

Alan Gray