GPU System Architecture
Alan Gray
EPCC, The University of Edinburgh

Outline
• Why do we want/need accelerators such as GPUs?
• GPU-CPU comparison
  – Architectural reasons for GPU performance advantages
• GPU accelerated systems
  – From desktop to supercomputer
• Accelerator Technology
  – NVIDIA
  – AMD
  – Intel
• Example GPU machines

Why do we want/need accelerators such as GPUs?

• The power used by a CPU core is proportional to Clock Frequency × Voltage²
• In the past, computers got faster by increasing the frequency
  – Voltage was decreased to keep power reasonable.
• Now, voltage cannot be decreased any further
  – 1s and 0s in a system are represented by different voltages
  – Reducing overall voltage further would reduce this difference to a point where 0s and 1s cannot be properly distinguished

• Instead, performance increases can be achieved through exploiting parallelism
• Need a chip which can perform many parallel operations every clock cycle
  – Many cores and/or many operations per core
• Want to keep power/core as low as possible
• Much of the power expended by CPU cores is on functionality not generally that useful for HPC
  – Branch prediction, out-of-order execution etc.

• So, for HPC, we want chips with simple, low power, number-crunching cores
• But we need our machine to do other things as well as the number crunching
  – Run an operating system, perform I/O, set up calculation etc.
• Solution: “Hybrid” system containing both CPU and “accelerator” chips

• It costs a huge amount of money to design and fabricate new chips
  – Not feasible for relatively small HPC market
• Luckily, over the last few years, Graphics Processing Units (GPUs) have evolved for the highly lucrative gaming market
  – And largely possess
the right characteristics for HPC
  – Many number-crunching cores
• GPU vendors have tailored existing GPU architectures to the HPC market

GPUs in HPC
• November 2011 Top500 list
  [Figure: excerpt of the list, with systems annotated “GPUs” and one annotated “next generation will have GPUs”]
• GPUs now firmly established in HPC industry

GPU-CPU comparison

AMD 12-core CPU
[Figure: chip layout; highlighted block = compute unit (= core)]
• Not much space on CPU is dedicated to compute

NVIDIA Fermi GPU
[Figure: chip layout; highlighted block = compute unit (= SM = 32 CUDA cores)]
• GPU dedicates much more space to compute
  – At expense of caches, controllers, sophistication etc.

Memory
• For many applications, performance is very sensitive to memory bandwidth
  – CPUs use DRAM
  – GPUs use Graphics DRAM
• Graphics memory: much higher bandwidth than standard CPU memory

GPU vs CPU: Theoretical Peak capabilities

                          NVIDIA Fermi     AMD Magny-Cours (6172)
Cores                     448 (1.15GHz)    12 (2.1GHz)
Operations/cycle          1                4
DP Performance (peak)     515 GFlops       101 GFlops
Memory Bandwidth (peak)   144 GB/s         27.5 GB/s

• For these particular products, GPU theoretical advantage is ~5x for both compute and main memory bandwidth
• Application performance very much depends on application
  – Typically a small fraction of peak
  – Depends how well application is suited to/tuned for architecture
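The peak figures in the comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes peak double-precision performance is simply cores × clock frequency × operations per cycle, which is how the table rows appear to combine; the function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(cores, clock_ghz, ops_per_cycle):
    """Theoretical peak in GFlops, assuming cores x clock (GHz) x ops per cycle."""
    return cores * clock_ghz * ops_per_cycle

# Values taken from the comparison table in the slides.
fermi_gflops = peak_gflops(cores=448, clock_ghz=1.15, ops_per_cycle=1)
magny_cours_gflops = peak_gflops(cores=12, clock_ghz=2.1, ops_per_cycle=4)

# Ratios behind the "~5x" statement.
compute_ratio = fermi_gflops / magny_cours_gflops
bandwidth_ratio = 144.0 / 27.5  # peak GB/s, GPU over CPU

print(f"NVIDIA Fermi peak DP:    {fermi_gflops:.1f} GFlops")
print(f"AMD Magny-Cours peak DP: {magny_cours_gflops:.1f} GFlops")
print(f"Compute advantage:       {compute_ratio:.1f}x")
print(f"Bandwidth advantage:     {bandwidth_ratio:.1f}x")
```

Both ratios come out slightly above 5, consistent with the rounded figure quoted in the slides. These are theoretical peaks only; as noted, real applications typically reach a small fraction of them.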
GPU accelerated systems

• GPUs cannot be used instead of CPUs
  – They must be used together
  – GPUs act as accelerators
  – Responsible for the computationally expensive parts of the code
[Figure: node diagram with labels CPU, DRAM, GPU, GDRAM, PCIe, I/O]

Scaling to larger systems
[Figure: node diagram with labels CPU, DRAM, GPU + GDRAM, PCIe, I/O, Interconnect]
• Interconnect allows multiple nodes to be connected
• Can have multiple CPUs and GPUs within each “workstation” or “shared memory node”
  – E.g. 2 CPUs + 2 GPUs (above)
  – CPUs share memory, but GPUs do not

GPU Accelerated Supercomputer
[Figure: grid of GPU/CPU nodes]

• To utilise a GPU, programs must
  – Contain parts targeted at host CPU (most lines of source code)
  – Contain parts targeted at GPU (key computational kernels)
  – Manage data transfers between distinct CPU and GPU memory spaces
  – Traditional languages (e.g. C/Fortran) do not provide these facilities
• To run on multiple GPUs in parallel
  – Normally use one host CPU core (thread) per GPU
  – Program manages communication between host CPUs in the same fashion as traditional parallel programs
  – e.g. MPI and/or OpenMP (latter shared memory node only)
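The “one host CPU core (thread) per GPU” arrangement can be illustrated with a small sketch. It assumes a hypothetical cluster of identical nodes, with host ranks placed on nodes in contiguous blocks, so that a rank maps to a node index and a local GPU index by integer division and remainder. This mapping is a common convention rather than something the slides prescribe, and the name `assign_gpu` is made up for illustration.

```python
def assign_gpu(rank, gpus_per_node):
    """Map a host process rank to a (node index, local GPU index) pair.

    Assumption: one host rank per GPU, ranks filled node by node.
    """
    if rank < 0 or gpus_per_node <= 0:
        raise ValueError("rank must be >= 0 and gpus_per_node must be > 0")
    return rank // gpus_per_node, rank % gpus_per_node

# Example: two nodes, each with 2 GPUs (the "2 CPUs + 2 GPUs" node, replicated).
placement = [assign_gpu(rank, gpus_per_node=2) for rank in range(4)]
for rank, (node, gpu) in enumerate(placement):
    print(f"host rank {rank} -> node {node}, GPU {gpu}")
```

In a real program each host rank would then select its GPU through the accelerator runtime (for example `cudaSetDevice` in CUDA) and exchange data with other ranks through MPI or OpenMP as in any other parallel code.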
Accelerator Technology

Latest Technology
• NVIDIA
  – Tesla HPC specific GPUs have evolved from GeForce series
• AMD
  – FireStream HPC specific GPUs have evolved from (ATI) Radeon series
• Intel
  – Knights Corner many-core x86 chip is like a hybrid between a GPU and a many-core CPU

NVIDIA Tesla Series GPU
• Chip partitioned into Streaming Multiprocessors (SMs)
• Multiple cores per SM
• Sophisticated thread engine
• Shared L2 cache

NVIDIA GPU SM
• Fewer scheduling units than cores
• Threads are scheduled in groups of 32, called a warp
• Threads within a warp always execute the same instruction in lock-step (on different data elements)
• Configurable L1 Cache/Shared Memory

NVIDIA Tesla Series

                    “Tesla” 1060   “Fermi” 2050   “Fermi” 2070   “Fermi” 2090   “Kepler” K20 (not yet released)
CUDA cores          240            448            448            512            1536
SP Performance      933 GFlops     1030 GFlops    1030 GFlops    1330 GFlops    ??
DP Performance      78 GFlops      515 GFlops     515 GFlops     665 GFlops     ??
Memory Bandwidth    102 GB/s       144 GB/s       144 GB/s       178 GB/s       ??
Memory              4 GB           3 GB           6 GB           6 GB           ??

• Variety of packaging solutions

NVIDIA Roadmap
[Figure: NVIDIA roadmap]

Programming NVIDIA Tesla
• CUDA C: proprietary interface to the architecture.
  – Extensions to the C language which allow interfacing to the hardware
  – Now relatively mature, and supported by comprehensive documentation, user forums, etc.
  – Fortran support provided by third party PGI CUDA Fortran compiler
• OpenCL
  – Cross platform API
  – Similar in design to CUDA, but lower level and not so mature
• Directives based approach
  – OpenMP style directives: abstract complexities away from programmer
  – Several products with varying syntax etc.
  – Still lack maturity
  – OpenMP committee currently considering adopting accelerators as part of OpenMP standard
  – With participation from all main vendors plus others such as EPCC

AMD FirePro
• AMD acquired ATI in 2006
• AMD FirePro series: derivative of Radeon chips with HPC enhancements
• FirePro W9000
  – Released Aug 2012
  – Peak 1 TFLOPS (double precision)
  – Peak 4 TFLOPS (single precision)
  – 6GB GDDR5 SDRAM
  – Now with full ECC support
• Programming: OpenCL
• Packaging: just cards for workstations at the moment. Server solutions may follow.

Intel Xeon Phi
• Intel Larrabee: “A Many-Core x86 Architecture for Visual Computing”
  – Release delayed such that the chip missed competitive window of opportunity.
  – Larrabee was not released as a competitive product, but instead as a platform for research and development (Knights Ferry).
• Knights Corner derivative chip
  – Many Integrated Cores (MIC) architecture. No longer aimed at graphics market
  – Instead “Accelerating Science and Discovery”
  – “more than 50 cores”
  – Release expected in 2012
  – Now branded as “Intel Xeon Phi”
• Hybrid between GPU and many-core CPU

Intel Xeon Phi

CPU-like
• x86 cores: same instruction set as standard CPUs
• Relatively small number of cores/chip
• Fully cache coherent
• In principle could support an OS

GPU-like
• Simple cores, lack sophistication e.g. no out-of-order execution
• Each core contains 512-bit vector processing unit (16 SP or 8 DP numbers)
• Supports multithreading
• Not expected to run OS, but used as accelerator (at least initially)

• Programming: Possible to use standard models such as OpenMP but still need additional directives to manage data
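The “16 SP or 8 DP numbers” figure for the Xeon Phi vector unit follows directly from the 512-bit width. The sketch below assumes the standard IEEE 754 sizes of 32 bits for single precision and 64 bits for double precision; the helper `vector_ops_needed` is a made-up name, and it ignores everything except lane count (no memory effects, no masking).

```python
VECTOR_BITS = 512          # vector unit width per core, from the slide
SP_BITS, DP_BITS = 32, 64  # assumed IEEE 754 single/double precision widths

sp_lanes = VECTOR_BITS // SP_BITS  # numbers processed per SP vector operation
dp_lanes = VECTOR_BITS // DP_BITS  # numbers processed per DP vector operation

def vector_ops_needed(n_elements, lanes):
    """Number of vector operations needed to cover n_elements (ceiling division)."""
    return -(-n_elements // lanes)

print(f"{sp_lanes} SP or {dp_lanes} DP numbers per vector operation")
print(f"1000 DP elements need {vector_ops_needed(1000, dp_lanes)} vector operations")
```

The same reasoning explains why code must be vectorisable to approach peak on such a chip: a scalar loop uses only one lane of each vector unit.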
Example GPU machines

DIY GPU Workstation
• Just need to slot GPU card into a PCIe slot
• Need to make sure there is enough space and power in workstation

GPU Servers
• Several vendors offer GPU Servers
• Example Configuration:
  – 4 GPUs plus 2 (multi-core) CPUs
• Multiple servers can be connected via interconnect

EPCC HECToR GPU Machine
• EPCC own a GPU system which is partly available to UK researchers
  – As part of the HECToR UK National Service
• Comprises 3 GPU servers, each with 4 NVIDIA Fermi GPUs
  – Connected via InfiniBand interconnect
  – Plus another node with a single AMD FireStream GPU
  – Plus a front end with a single NVIDIA Fermi
  – More details at http://www.hector.ac.uk/howcan/admin/apply/HECToRGPU.php