High Performance Computing with Accelerators

Volodymyr Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Presentation Outline

• Recent trends in application accelerators
  • FPGAs, Cell, GPUs, …
• Accelerator clusters
• Accelerator clusters at NCSA
• Cluster management
• Programmability Issues
• Production codes running on accelerator clusters
  • Cosmology, molecular dynamics, electronic structure, quantum chromodynamics

HPC on Special-purpose Processors

• Field-Programmable Gate Arrays (FPGAs) – Digital signal processing, embedded computing

• Graphics Processing Units (GPUs) – Desktop graphics accelerators

• Physics Processing Units (PPUs) – Desktop games accelerators

• Sony/Toshiba/IBM Cell Broadband Engine – Game console and digital content delivery systems

• ClearSpeed accelerator – Floating-point accelerator board for compute-intensive applications

GPU Performance Trends: FLOPS

[Chart: peak GFLOP/s vs. release date, 2002–2008, for NVIDIA GPUs (GeForce FX 5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) compared with Intel CPUs; the GPU curve approaches 1000 GFLOP/s while a 3 GHz quad-core Intel Xeon stays far below.]

Courtesy of NVIDIA

GPU Performance Trends: Memory Bandwidth

[Chart: memory bandwidth in GByte/s vs. release date, 2002–2008, for the same NVIDIA GPUs, rising from the GeForce FX 5800 to roughly 160 GB/s on the GTX 285.]

Courtesy of NVIDIA

T10 GPU Architecture

[Diagram: T10 block diagram – PCIe interface, input assembler, and thread execution manager feeding 10 TPCs; each TPC holds a geometry controller, an SMC, and 3 SMs (8 SPs, 2 SFUs, instruction/constant caches, MT issue, and shared memory per SM) plus texture units with an L1 texture cache; a 512-bit memory interconnect links L2/ROP partitions to 8 off-chip DRAM channels.]

• T10 architecture
  • 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides (see the quick check below)
  • 1 TFLOPS SP
  • 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
• 102 GB/s bandwidth
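As a quick sanity check of the quoted single-precision peak (my own arithmetic, not from the slide, assuming each SP dual-issues a multiply-add plus a multiply, i.e. 3 FLOPs per clock as on the GT200 generation):

$240\ \text{SPs} \times 1.3\ \text{GHz} \times 3\ \text{FLOPs/cycle} \approx 936\ \text{GFLOPS} \approx 1\ \text{TFLOPS (SP)}$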

NVIDIA Tesla S1070 GPU Computing Server

• 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM

[Diagram: S1070 1U chassis – four Tesla T10 GPUs (4 GB GDDR3 SDRAM each) connected in pairs through two NVIDIA switches to two PCIe x16 host interfaces, alongside the power supply, thermal management, and system monitoring blocks.]

GPU Clusters at NCSA

• Lincoln
  • Production system available via the standard NCSA/TeraGrid HPC allocation process
• AC
  • Experimental system available for anybody who is interested in exploring GPU computing

Intel 64 Tesla Linux Cluster Lincoln

• Dell PowerEdge 1955 server
  • Intel 64 (Harpertown) 2.33 GHz, dual-socket quad-core
  • 16 GB DDR2
  • InfiniBand SDR
• S1070 1U GPU Computing Server
  • 1.3 GHz Tesla T10 processors
  • 4x4 GB GDDR3 SDRAM
• Cluster
  • Servers: 192
  • CPU cores: 1536
  • Accelerator units: 96
  • GPUs: 384

[Diagram: two Dell PowerEdge 1955 servers, each with an SDR InfiniBand link and a PCIe x8 connection to one of the two PCIe interfaces of a shared Tesla S1070 (two T10 GPUs and their DRAM per host).]

AMD Opteron Tesla Linux Cluster AC

• HP xw9400 workstation
  • 2216 AMD Opteron, 2.4 GHz, dual-socket dual-core
  • 8 GB DDR2
  • InfiniBand QDR
  • Nallatech PCI-X H101 FPGA card
• S1070 1U GPU Computing Server
  • 1.3 GHz Tesla T10 processors
  • 4x4 GB GDDR3 SDRAM
• Cluster
  • Servers: 32
  • CPU cores: 128
  • Accelerator units: 32
  • GPUs: 128
• AC Cluster: 128 TFLOPS (SP)

[Diagram: one HP xw9400 workstation with a QDR InfiniBand link, a Nallatech PCI-X H101 FPGA card, and two PCIe x16 connections to the two PCIe interfaces of a Tesla S1070 (four T10 GPUs and their DRAM).]

NCSA GPU Cluster Management Software

• CUDA SDK Wrapper
  • Enables true GPU resource allocation by the scheduler
    • E.g., qsub -l nodes=1:ppn=1 will allocate 1 CPU core and 1 GPU, fencing the other GPUs in the node from access
  • Virtualizes GPU devices (a minimal interposer sketch follows this list)
  • Sets thread affinity to the CPU core "nearest" to the GPU device
• GPU Memory Scrubber
  • Cleans memory between runs
• GPU Memory Test utility
  • Tests memory for manufacturing defects
  • Tests memory for soft errors
  • Tests whether GPUs have entered an erroneous state
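The deck does not show the wrapper's internals; the following is a minimal, hypothetical sketch of the device-virtualization idea only, not the actual NCSA wrapper: an LD_PRELOAD interposer that remaps whatever device index a program asks for onto the physical GPU the scheduler assigned, here read from a made-up ALLOCATED_GPU environment variable. The real wrapper is considerably more involved (shared-memory segments, NUMA affinity, fencing).

```c
/* gpu_wrap.c -- hypothetical LD_PRELOAD interposer, NOT the actual NCSA wrapper.
 * Build: gcc -shared -fPIC gpu_wrap.c -o libgpuwrap.so -ldl
 * Use:   ALLOCATED_GPU=2 LD_PRELOAD=./libgpuwrap.so ./my_cuda_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef int cudaError_t;                 /* enum in cuda_runtime.h; int-compatible */
typedef cudaError_t (*setdev_fn)(int);

cudaError_t cudaSetDevice(int device)
{
    /* Look up the real cudaSetDevice provided by libcudart. */
    setdev_fn real = (setdev_fn)dlsym(RTLD_NEXT, "cudaSetDevice");
    if (!real) {
        fprintf(stderr, "[gpuwrap] could not resolve cudaSetDevice\n");
        exit(1);
    }

    /* Remap the logical device the job asked for onto the physical GPU
     * granted by the batch scheduler (hypothetical environment variable). */
    const char *alloc = getenv("ALLOCATED_GPU");
    int physical = alloc ? atoi(alloc) : device;

    fprintf(stderr, "[gpuwrap] remapping device %d -> %d\n", device, physical);
    return real(physical);
}
```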

GPU Node Pre/Post Allocation Sequence

• Pre-job (minimized for rapid device acquisition)
  • Assemble the detected-device file unless it already exists
  • Sanity-check the results
  • Check out the requested GPU devices from that file
  • Initialize the CUDA wrapper shared memory segment with a unique key for the user (allows the user to ssh to the node outside the job environment and see the same GPU devices)
• Post-job
  • Run a quick memtest to verify a healthy GPU state (a stand-alone sketch of this step follows the list)
  • If a bad state is detected, mark the node offline when other jobs are present on the node
  • If there are no other jobs, reload the kernel module to "heal" the node (works around a CUDA 2.2 bug)
  • Run the memscrubber utility to clear GPU device memory
  • Notify of any failure events with job details
  • Terminate the wrapper shared memory segment
  • Check the GPUs back into the global file of detected devices
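The memtest and memscrubber utilities themselves are not shown in the deck; the program below is only an illustrative stand-alone sketch of the idea: grab a device buffer, write a known pattern, read it back to catch obvious faults, then zero the memory before the node is handed to the next job. The buffer size and exit codes are arbitrary choices.

```c
/* quick_memcheck.cu -- illustrative sketch, not NCSA's memtest/memscrubber utilities.
 * Build: nvcc quick_memcheck.cu -o quick_memcheck
 */
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t words = 64UL * 1024 * 1024;        /* test 256 MB of device memory */
    const size_t bytes = words * sizeof(unsigned int);

    unsigned int *d_buf = NULL;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed -- GPU may be in a bad state\n");
        return 2;
    }

    /* Write an alternating bit pattern, read it back, and compare. */
    unsigned int *h_in  = (unsigned int *)malloc(bytes);
    unsigned int *h_out = (unsigned int *)malloc(bytes);
    for (size_t i = 0; i < words; ++i) h_in[i] = (i & 1) ? 0xA5A5A5A5u : 0x5A5A5A5Au;

    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);

    size_t errors = 0;
    for (size_t i = 0; i < words; ++i)
        if (h_out[i] != h_in[i]) ++errors;

    /* "Scrub": zero the buffer so no data leaks to the next job. */
    cudaMemset(d_buf, 0, bytes);
    cudaFree(d_buf);
    free(h_in);
    free(h_out);

    printf("%zu mismatching words\n", errors);
    return errors ? 1 : 0;
}
```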

GPU Programming

• 3rd-party programming tools
  • CUDA C 2.2 SDK
  • OpenCL 1.2 SDK
  • PGI x64+GPU Fortran & C99 compilers
• IACAT's on-going work
  • CUDA-auto
  • CUDA-lite
  • GMAC
  • CUDA-tune
  • MCUDA/OpenMP

[Figure: programming-tool stack –
• CUDA-auto: implicitly parallel programming with data-structure and function-property annotations to enable auto-parallelization
• CUDA-lite: locality-annotation programming to eliminate the need for explicit management of memory types and data transfers
• GMAC: unified CPU/GPU address space (next slide)
• CUDA-tune: parameterized CUDA programming using auto-tuning and optimization-space pruning
• NVIDIA SDK: 1st-generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers (a minimal example of this style follows this slide)
• MCUDA/OpenMP: retargets CUDA to the IA multi-core back end
Hardware targets: IA multi-core, NVIDIA GPU & Larrabee.]

Courtesy of Wen-mei Hwu, UIUC
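To make "explicit management of memory types and data transfers" concrete, here is a minimal 1st-generation-style CUDA C example (a generic vector add written for this text, not taken from the deck): the programmer hand-picks the thread organization and moves data to and from the device explicitly, which is exactly the burden CUDA-lite and GMAC aim to remove.

```c
// vecadd.cu -- generic illustration of explicit CUDA memory management.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // hardwired 1-D thread organization
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Explicit device allocations and host-to-device transfers.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Explicit launch configuration chosen by the programmer.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Explicit device-to-host transfer of the result.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```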

GMAC – Designed to Reduce the Accelerator Use Barrier

• Unified CPU/GPU address space
  • Same address on the CPU and the GPU (compare the managed-memory analogue after this slide)
• Customizable implicit data transfers
  • Transfer everything (safe mode)
  • Transfer dirty data before kernel execution
  • Transfer data as it is produced (default)
• Multi-process / multi-thread support
• CUDA compatible

Courtesy of Wen-mei Hwu, UIUC
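GMAC's own API is not shown in the deck, so rather than guess at it, here is the same vector add written against CUDA's later unified-memory API (cudaMallocManaged, introduced well after this talk). It is used here only as an analogue of the "one pointer, no explicit copies" programming model that GMAC describes.

```c
// vecadd_managed.cu -- same computation with a single shared pointer
// (CUDA unified memory, shown as an analogue of GMAC's model, not as GMAC itself).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;

    // One allocation visible from both CPU and GPU; no cudaMemcpy calls.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();                 // make the results visible to the host

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```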

GPU Cluster Applications

• TPACF
  • Cosmology code used to study how matter is distributed in the Universe
• NAMD
  • Molecular dynamics code used to run large-scale MD simulations
• DSCF
  • NSF CyberChem project; DSCF quantum chemistry code for energy calculations
• WRF
  • Weather modeling code
• MILC
  • Quantum chromodynamics code; work just started
• …

TPACF on AC

[Chart: speedup of the TPACF computation on various accelerators relative to the CPU reference – GeForce FX 5600, GeForce GTX 280 (SP), Cell B.E. (SP), GeForce GTX 280 (DP), PowerXCell 8i (DP), and H101 FPGA (DP) – with bars ranging from 6.8× to 154.1×; an inset labeled "QP GPU cluster scaling" shows execution time falling from 567.5 s to 23.7 s as the number of MPI threads increases.]
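For context, the computational core of TPACF is pair counting: the angular separation of every pair of points on the sky is binned into a histogram. The deck includes no source code, so the kernel below is only a simplified, hypothetical sketch of that inner loop (one thread per point of the first set, points stored as unit vectors, cos-theta bin edges precomputed in binEdges); the production code uses a more efficient bin search and blocking.

```c
// tpacf_sketch.cu -- simplified pair-counting sketch, not the production TPACF code.
#include <cuda_runtime.h>

#define NBINS 32   // illustrative number of angular bins

// x1,y1,z1: first point set (n1 unit vectors); x2,y2,z2: second set (n2 unit vectors).
// binEdges: NBINS+1 precomputed cos(theta) bin boundaries in descending order.
__global__ void pairCount(const double *x1, const double *y1, const double *z1, int n1,
                          const double *x2, const double *y2, const double *z2, int n2,
                          const double *binEdges, unsigned long long *hist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n1) return;

    double xi = x1[i], yi = y1[i], zi = z1[i];
    for (int j = 0; j < n2; ++j) {
        // Cosine of the angular separation between points i and j.
        double dot = xi * x2[j] + yi * y2[j] + zi * z2[j];

        // Find the bin whose (edge[k+1], edge[k]] interval contains dot.
        for (int k = 0; k < NBINS; ++k) {
            if (dot <= binEdges[k] && dot > binEdges[k + 1]) {
                atomicAdd(&hist[k], 1ULL);
                break;
            }
        }
    }
}
```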

NAMD on Lincoln (8 cores and 2 GPUs per node, very early results)

[Chart: STMV seconds per step vs. number of CPU cores (4–128) for CPU-only runs at 8, 4, and 2 processes per node and GPU-accelerated runs at 4:1, 2:1, and 1:1 core-to-GPU ratios; annotations mark gains of ~5.6 and ~2.8, with 2 GPUs roughly matching 24 CPU cores and 8 GPUs roughly matching 96 CPU cores.]

Courtesy of James Phillips, UIUC

DSCF on Lincoln

Bovine pancreatic trypsin inhibitor (BPTI): 3-21G basis set, 875 atoms, 4893 basis functions

[Chart: MPI timings and scalability]

Courtesy of Ivan Ufimtsev, Stanford

GPU Clusters: Open Issues / Future Work

• GPU cluster configuration
  • CPU/GPU ratio
  • Host/device memory ratio
  • Cluster interconnect requirements
• Cluster management
  • Resource allocation
  • Monitoring
• CUDA/OpenCL SDK for HPC
• Programming models for accelerator clusters (a sketch of the common MPI+CUDA pattern follows this list)
  • CUDA/OpenCL + pthreads/MPI/OpenMP/Charm++/…
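As an illustration of the last point, a common pattern on clusters like Lincoln and AC is flat MPI with one rank per GPU, where each rank binds to a device by its node-local rank. The sketch below is one possible way to do this; the node-local-rank computation uses MPI_Comm_split_type, an MPI-3 feature that postdates this talk (earlier codes derived the local rank from hostnames instead).

```c
// mpi_cuda_rank.c -- minimal MPI+CUDA device-binding sketch (illustrative only).
// Build: mpicc mpi_cuda_rank.c -o mpi_cuda_rank -lcudart
#include <mpi.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Group the ranks that share a node (MPI-3 feature).
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind this rank to one of the node's GPUs (e.g., 2 per host on Lincoln, 4 on AC).
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0)
        cudaSetDevice(local_rank % num_devices);

    printf("world rank %d -> local rank %d -> GPU %d of %d\n",
           world_rank, local_rank,
           num_devices ? local_rank % num_devices : -1, num_devices);

    // ... per-rank kernels and MPI halo exchanges would go here ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```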