GPU Clusters for High-Performance Computing

Jeremy Enos
Innovative Systems Laboratory

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Presentation Outline
• NVIDIA GPU technology overview
• GPU clusters at NCSA
  • AC
  • Lincoln
• GPU power consumption
• Programming tools
  • CUDA C
  • OpenCL
  • PGI x86+GPU
• Application performance
• GPU cluster management software
  • The need for new tools
  • CUDA/OpenCL wrapper library
  • CUDA memtest
• Balanced GPU-accelerated system design considerations
• Conclusions

Why Clusters for HPC?

• Clusters are a major workhorse in HPC
• Q: How many systems in the Top 500 are clusters?
• A: 410 out of 500

[Charts: Top 500 share by architecture and Top 500 share by performance]

Source: http://www.top500.org

Why GPUs in HPC Clusters?

[Chart: GPU Performance Trends, peak GFLOP/s of NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) vs. Intel CPUs (e.g., Intel Xeon quad-core 3 GHz), 2002 to 2008; the GPU curve climbs toward roughly 1000 GFLOP/s while the CPU curve stays far lower]
Courtesy of NVIDIA

T10 GPU Architecture

• T10 architecture
  • 240 streaming processors arranged as 30 streaming multiprocessors
  • At 1.3 GHz this provides
    • 1 TFLOP SP
    • 86.4 GFLOP DP
  • 512-bit interface to off-chip GDDR3 memory
    • 102 GB/s bandwidth
[Block diagram: PCIe interface, input assembler, and thread execution manager feeding 10 TPCs; each TPC contains a geometry controller, an SMC, three SMs (each with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory), texture units, and a texture L1; a 512-bit memory interconnect with L2/ROP partitions connects to 8 DRAM channels]
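The quoted peaks follow from simple per-clock accounting. Below is a back-of-envelope sketch in plain C; the 3-flops-per-SP-per-clock factor (a dual-issued MAD plus a MUL) and the 1.6 GT/s effective GDDR3 data rate are assumptions not stated on the slide.

#include <stdio.h>

int main(void) {
    /* T10 figures quoted on the slide */
    const double sp_units  = 240;   /* streaming processors (30 SMs x 8 SPs)            */
    const double clock_ghz = 1.3;   /* shader clock, GHz                                */
    const double bus_bits  = 512;   /* memory interface width                           */
    const double mem_gtps  = 1.6;   /* assumed GDDR3 effective data rate (800 MHz DDR)  */

    /* Single precision: assume each SP dual-issues a MAD (2 flops) plus a MUL (1 flop) per clock. */
    double sp_gflops = sp_units * clock_ghz * 3.0;   /* ~936 GFLOP/s, i.e. the quoted "1 TFLOP SP" */

    /* Off-chip bandwidth: bus width in bytes times effective transfer rate. */
    double mem_gbps = (bus_bits / 8.0) * mem_gtps;   /* ~102 GB/s, matching the slide              */

    printf("SP peak      : ~%.0f GFLOP/s\n", sp_gflops);
    printf("Memory peak  : ~%.1f GB/s\n", mem_gbps);
    /* The 86.4 GFLOP DP figure reflects the single double-precision unit per SM. */
    return 0;
}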

NVIDIA Tesla S1070 GPU Computing Server
• 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM
[Block diagram: two PCIe x16 host interfaces, each behind an NVIDIA switch serving two Tesla GPUs, plus power supply, thermal management, and system monitoring]

GPU Clusters at NCSA

• Lincoln: production system available via the standard NCSA/TeraGrid HPC allocation
• AC: experimental system available to anybody who is interested in exploring GPU computing

Intel 64 Tesla Linux Cluster Lincoln

• Dell PowerEdge 1955 server
  • Intel 64 (Harpertown) 2.33 GHz, dual socket, quad core
  • 16 GB DDR2
  • InfiniBand SDR
• Tesla S1070 1U GPU Computing Server
  • 1.3 GHz Tesla T10 processors
  • 4x4 GB GDDR3 SDRAM
• Cluster
  • Servers: 192
  • Accelerator units: 96
[Diagram: two Dell PowerEdge 1955 compute nodes, each on SDR InfiniBand, sharing one Tesla S1070 through PCIe x8 links]

AMD Opteron Tesla Linux Cluster AC

• HP xw9400 workstation
  • 2216 AMD Opteron 2.4 GHz, dual socket, dual core
  • 8 GB DDR2
  • InfiniBand QDR
• Tesla S1070 1U GPU Computing Server
  • 1.3 GHz Tesla T10 processors
  • 4x4 GB GDDR3 SDRAM
• Cluster
  • Servers: 32
  • Accelerator units: 32
[Diagram: one HP xw9400 compute node with QDR InfiniBand, a Nallatech H101 FPGA card on PCI-X, and one Tesla S1070 attached through two PCIe x16 links]

Lincoln vs. AC: Configuration

                           Lincoln        AC
Compute cores
  CPU cores                1536           128
  GPU units                384            128
  CPU/GPU ratio            4              1
Memory
  Host memory              16 GB          8 GB
  GPU memory               8 GB/host      16 GB/host
  Host memory per GPU      8 GB           2 GB
I/O
  PCI-E                    2.0 (x8)       1.0 (x16)
  GPU/host bandwidth       4 GB/s         4 GB/s
  IB bandwidth/host        8 Gbit/s       16 Gbit/s

Both systems originated as extensions of existing clusters; they were not designed as Tesla GPU clusters from the beginning. As a result, their performance with regard to the GPUs is suboptimal.

Lincoln vs. AC: Host-device Bandwidth

[Chart: host-device bandwidth (Mbytes/sec, up to 3500) vs. packet size (Mbytes, 1e-6 to 100) for Lincoln, AC without affinity mapping, and AC with affinity mapping]
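Curves like these come from timing repeated host-to-device copies over a range of packet sizes. A minimal sketch of such a measurement with the CUDA runtime API is shown below; the packet sizes and repetition count are illustrative, not the exact settings used for the plot.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const int reps = 100;
    /* Sweep packet sizes from 1 byte to 100 MB in decade steps. */
    for (size_t bytes = 1; bytes <= (size_t)100 * 1024 * 1024; bytes *= 10) {
        void *h, *d;
        cudaMallocHost(&h, bytes);   /* pinned host buffer */
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < reps; i++)
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double mbytes_per_sec = (double)bytes * reps / (ms / 1000.0) / 1e6;
        printf("%10zu bytes : %8.1f Mbytes/sec host->device\n", bytes, mbytes_per_sec);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        cudaFreeHost(h);
    }
    return 0;
}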

Lincoln vs. AC: HPL Benchmark

[Chart: achieved GFLOPS (up to 2500) and % of peak (up to 45%) vs. system size (1, 2, 4, 8, 16, 32 nodes) for AC and Lincoln]

AC GPU Cluster Power Considerations

State                                           Host peak   Tesla peak   Host pf   Tesla pf
                                                power (W)   power (W)
power off                                       4           10           .19       .31
start-up                                        310         182
pre-GPU-use idle                                173         178          .98       .96
after NVIDIA driver module unload/reload (1)    173         178          .98       .96
after deviceQuery (idle) (2)                    173         365          .99       .99
GPU memtest #10 (stress)                        269         745          .99       .99
after memtest kill (idle)                       172         367          .99       .99
after NVIDIA module unload/reload (idle) (3)    172         367          .99       .99
VMD Madd                                        268         598          .99       .99
NAMD GPU STMV                                   321         521          .97-1.0   .85-1.0 (4)
NAMD CPU only ApoA1                             322         365          .99       .99
NAMD CPU only STMV                              324         365          .99       .99

1. Kernel module unload/reload does not increase Tesla power.
2. Any access to the Tesla (e.g., deviceQuery) results in doubled power consumption after the application exits.
3. Note that a second kernel module unload/reload cycle does not return Tesla power to normal; only a complete reboot can.
4. Power factor stays near one except during load transitions; the range varies with consumption swings.

Cluster Management Tools

• Deployment
  • SystemImager (AC)
  • Perceus (Lincoln)
• Workload management
  • Torque/MOAB
  • access.conf restrictions unless the node is used by a user job
  • Job preemption configuration used to run the GPU memtest during idle periods (AC)
• Monitoring
  • CluMon (Lincoln)
  • Manual monitoring (AC)

GPU Clusters Programming

• Programming tools
  • CUDA C 2.2 SDK
  • CUDA/MPI
  • OpenCL 1.0 SDK
  • PGI x86+GPU compiler
• Some applications (that we are aware of)
  • NAMD (Klaus Schulten’s group at UIUC)
  • WRF (John Michalakes, NCAR)
  • TPACF (Robert Brunner, UIUC)
  • TeraChem (Todd Martinez, Stanford)
  • ...

TPACF

[Chart: TPACF kernel speedup on various accelerators, NVIDIA FX 5600, GeForce GTX 280 (SP and DP), Cell B./E. (SP), PowerXCell 8i (DP), and Nallatech H101 (DP), ranging from 6.8x to 154.1x]

NAMD on Lincoln Cluster Performance
(8 cores and 2 GPUs per node, very early results)

[Chart: STMV seconds/step vs. number of CPU cores (4 to 128) for CPU-only runs at 8, 4, and 2 processes per node and GPU-accelerated runs at 4:1, 2:1, and 1:1 ratios; annotations: 2 GPUs = 24 cores, 8 GPUs = 96 CPU cores; the smallest CPU-only runs are labeled ~5.6 and ~2.8 s/step]

Courtesy of James Phillips, UIUC

TeraChem

• Bovine pancreatic trypsin inhibitor (BPTI), 3-21G, 875 atoms, 4893 basis functions
• MPI timings and scalability
• GPU (computed with SP): K-matrix, J-matrix
• CPU (computed with DP): the rest

Courtesy of Ivan Ufimtsev, Stanford

CUDA Memtest
• The 4 GB of Tesla GPU memory is not ECC protected
  • Hunt for "soft errors"
• Features (a stripped-down sketch of one test pass follows below)
  • Full re-implementation of every test included in memtest86
  • Random and fixed test patterns, error reports, error addresses, test specification
  • Email notification
  • Includes an additional stress test for software and hardware errors
• Usage scenarios
  • Hardware test for defective GPU memory chips
  • Detection of CUDA API/driver software bugs
  • Hardware test for detecting soft errors due to non-ECC memory
    • No soft errors detected in 2 years x 4 GB of cumulative runtime
• Availability
  • NCSA/UofI Open Source License
  • https://sourceforge.net/projects/cudagpumemtest/
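At their core, the memtest86-style tests write a pattern across device memory, read it back, and log any mismatches. The CUDA C sketch below shows one fixed-pattern pass only; it is not the actual tool, which implements many more patterns plus address reporting and notification.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void write_pattern(unsigned int *mem, size_t n, unsigned int pattern) {
    /* grid-stride loop so any buffer size works with a modest grid */
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        mem[i] = pattern;
}

__global__ void check_pattern(const unsigned int *mem, size_t n, unsigned int pattern,
                              unsigned int *errors) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        if (mem[i] != pattern)
            atomicAdd(errors, 1u);   /* the real tool also records the failing address */
}

int main(void) {
    const size_t bytes = (size_t)256 << 20;   /* test 256 MB; the real tool walks all free memory */
    const size_t n     = bytes / sizeof(unsigned int);
    unsigned int *mem, *d_err, h_err = 0;

    cudaMalloc(&mem, bytes);
    cudaMalloc(&d_err, sizeof(unsigned int));
    cudaMemset(d_err, 0, sizeof(unsigned int));

    const unsigned int patterns[] = { 0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u };
    for (int p = 0; p < 4; p++) {
        write_pattern<<<120, 256>>>(mem, n, patterns[p]);
        check_pattern<<<120, 256>>>(mem, n, patterns[p], d_err);
    }
    cudaMemcpy(&h_err, d_err, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("%u pattern mismatches detected\n", h_err);

    cudaFree(mem);
    cudaFree(d_err);
    return 0;
}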

CUDA/OpenCL Wrapper Library
• Basic operation principle:
  • Use /etc/ld.so.preload to overload (intercept) a subset of CUDA/OpenCL functions, e.g. {cu|cuda}{Get|Set}Device, clGetDeviceIDs, etc.
• Purpose:
  • Enables controlled GPU device visibility and access, extending resource allocation to the workload manager
  • Prove or disprove feature usefulness, with the hope of eventual uptake or reimplementation of proven features by the vendor
  • Provides a platform for rapid implementation and testing of HPC-relevant features not available in the NVIDIA APIs
• Features:
  • NUMA affinity mapping
    • Sets thread affinity to the CPU core nearest the GPU device
  • Shared-host, multi-GPU device fencing
    • Only GPUs allocated by the scheduler are visible or accessible to the user
    • GPU device numbers are virtualized, with a fixed mapping to a physical device per user environment
    • The user always sees allocated GPU devices indexed from 0 (a minimal interception sketch follows below)
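A minimal sketch of the interception idea, not the actual NCSA wrapper: a shared library overrides cudaSetDevice(), remaps the user-visible (virtual) device number to the physical device the scheduler allocated, and forwards the call to the real runtime via dlsym(). Built with "cc -shared -fPIC -o libgpuwrap.so gpu_wrap.c -ldl" and listed in /etc/ld.so.preload (or LD_PRELOAD), it is picked up by every CUDA program. The PHYSICAL_GPUS environment variable is a made-up name for illustration; the real wrapper also intercepts cudaGetDeviceCount and the driver/OpenCL equivalents.

/* gpu_wrap.c : illustrative CUDA device-remapping interposer (not the real cuda_wrapper) */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int cudaError_t;   /* matches the runtime's enum underlying type for this sketch */

cudaError_t cudaSetDevice(int device) {
    static cudaError_t (*real_set_device)(int) = NULL;
    if (!real_set_device)
        real_set_device = (cudaError_t (*)(int))dlsym(RTLD_NEXT, "cudaSetDevice");

    /* PHYSICAL_GPUS would be set by the batch prolog, e.g. "2,3" when the job owns
       physical GPUs 2 and 3; virtual device 0 maps to the first entry, and so on. */
    int physical = device;
    const char *alloc = getenv("PHYSICAL_GPUS");
    if (alloc) {
        char buf[256];
        strncpy(buf, alloc, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        int idx = 0;
        for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ","), idx++) {
            if (idx == device) { physical = atoi(tok); break; }
        }
    }
    fprintf(stderr, "wrapper: virtual GPU %d -> physical GPU %d\n", device, physical);
    return real_set_device(physical);
}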

CUDA/OpenCL Wrapper Library
• Features (cont’d):
  • Device rotation (deprecated)
    • Virtual-to-physical device mapping rotated for each process accessing a GPU device
    • Allowed for common execution parameters (e.g., all processes target gpu0; with 4 GPUs available, each of 4 processes gets a separate GPU)
    • CUDA 2.2 introduced a compute-exclusive device mode, which includes fallback to the next device, so the device rotation feature may no longer be needed
  • Memory scrubber (see the sketch below)
    • Independent utility from the wrapper, but packaged with it
    • The Linux kernel does no management of GPU device memory
    • Must run between user jobs to ensure security between users
• Availability
  • NCSA/UofI Open Source License
  • https://sourceforge.net/projects/cudawrapper/
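A memory scrubber can be approximated by grabbing as much device memory as the runtime will hand out, overwriting it, and releasing it. This is a minimal sketch, not the packaged memscrubber utility, which is more careful about allocation granularity and error handling.

#include <stdio.h>
#include <cuda_runtime.h>

/* Zero out (nearly) all allocatable memory on one device so the next job
   cannot read data left behind by the previous user. */
static void scrub_device(int dev) {
    cudaSetDevice(dev);

    const size_t chunk = (size_t)64 << 20;   /* allocate in 64 MB chunks */
    void *blocks[4096];
    int   nblocks = 0;

    /* Keep allocating until the device refuses, wiping each chunk as we go. */
    while (nblocks < 4096 && cudaMalloc(&blocks[nblocks], chunk) == cudaSuccess) {
        cudaMemset(blocks[nblocks], 0, chunk);
        nblocks++;
    }
    printf("device %d: scrubbed %d x 64 MB\n", dev, nblocks);

    for (int i = 0; i < nblocks; i++)
        cudaFree(blocks[i]);
}

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++)
        scrub_device(dev);
    return 0;
}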

GPU Node Pre/Post Allocation Sequence
• Pre-job (minimized for rapid device acquisition)
  • Assemble the detected-device file unless it already exists
  • Sanity-check the results
  • Check out the requested GPU devices from that file
  • Initialize the CUDA wrapper shared memory segment with a unique key for the user (allows the user to ssh to the node outside of the job environment and see the same GPU devices); a sketch of this step follows below
• Post-job
  • Use a quick memtest run to verify a healthy GPU state
  • If a bad state is detected, mark the node offline if other jobs are present on the node
  • If no other jobs, reload the kernel module to "heal" the node (works around a CUDA 2.2 driver bug)
  • Run the memscrubber utility to clear GPU device memory
  • Notify of any failure events with job details
  • Terminate the wrapper shared memory segment
  • Check the GPUs back in to the global file of detected devices
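A sketch of the shared-segment step, assuming a System V segment whose key is derived from the user's uid so that a later ssh session by the same user attaches to the same segment. The key scheme, record layout, and values below are illustrative, not the wrapper's actual format.

#include <stdio.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MAX_GPUS 8

/* Per-user record the wrapper could consult: which physical GPUs this user owns. */
struct gpu_allocation {
    int ngpus;
    int physical_id[MAX_GPUS];
};

int main(void) {
    /* Illustrative key scheme: a fixed tag combined with the uid. */
    key_t key = (key_t)(0x47500000 | (unsigned)getuid());

    int shmid = shmget(key, sizeof(struct gpu_allocation), IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    struct gpu_allocation *alloc = (struct gpu_allocation *)shmat(shmid, NULL, 0);
    if (alloc == (void *)-1) { perror("shmat"); return 1; }

    /* The prolog would fill this in from the detected-device file; hard-coded here. */
    alloc->ngpus = 2;
    alloc->physical_id[0] = 2;
    alloc->physical_id[1] = 3;

    printf("uid %d: published %d GPUs in shm key 0x%x\n",
           (int)getuid(), alloc->ngpus, (unsigned)key);
    shmdt(alloc);
    /* The epilog would remove the segment with shmctl(shmid, IPC_RMID, NULL). */
    return 0;
}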

Designing a Balanced Compute Host
• Consider:
  • Potential performance bottlenecks
    • Memcopy bandwidth is the limiting factor for a significant portion of applications
    • The FLOP:I/O ratio is increasing relative to existing general-purpose clusters
    • The current NVIDIA driver is limited to supporting 8 GPUs
    • PCI-E lane oversubscription
  • Host
  • User feedback
    • Typically want at least as many CPU cores as GPU units
    • Typically want host memory to match or surpass the total attached GPU memory
  • Get a candidate test system before locking in the decision

Test System

[Diagram: Core i7 host (Intel X58 chipset) with four PCIe x16 links to two Tesla S1070 units, 8 T10 GPUs in total]
• CPU cores (Intel Core i7): 4
• Host memory: 24 GB
• Accelerator units (S1070): 2
• GPU memory: 32 GB
• Total GPUs: 8
• CPU cores/GPU ratio: 0.5
• Theoretical peak PCIe bandwidth: 18 GB/s

Designing a Balanced Compute Host

Peak-bandwidth annotations from the X58 block diagram:
• PCI-E Gen 2: 36 lanes x 500 MB/sec/lane = 18 GB/sec peak (1)
• ½ x 25.6 GB/sec = 12.8 GB/sec peak (25.6 GB/sec is a bi-directional bandwidth)
• Host memory: 192 bits x 800 MHz = 17.8 GB/sec peak (assuming 3 banks populated)

1. GPU memcopy traffic is serialized

Future Options

[Figure from Intel's IDF Fall 2008 roadmap briefing]
Source: http://download.intel.com/pressroom/kits/events/idffall_2008/SSmith_briefing_roadmap.pdf

Future Options

[Figure from Intel's IDF Fall 2008 roadmap briefing]
Source: http://download.intel.com/pressroom/kits/events/idffall_2008/SSmith_briefing_roadmap.pdf

Conclusions

• GPU acceleration is producing real results
• System tools are becoming GPU aware, but there are still some gaps to fill
• Balanced system design depends on application requirements, but some basic guidelines apply

Thank you.

Jeremy Enos
[email protected]