HPC Hardware Overview

Luke Wilson
September 13, 2011

Texas Advanced Computing Center, The University of Texas at Austin

Outline

• Some general comments
• Lonestar System
  – Dell blade-based system
  – InfiniBand (QDR)
  – Intel Processors
• Ranger System
  – Sun blade-based system
  – InfiniBand (SDR)
  – AMD Processors
• Longhorn System
  – Dell blade-based system
  – InfiniBand (QDR)
  – Intel Processors

About this Talk

• We will focus on TACC systems, but much of the information applies to HPC systems in general

• As an applications programmer you may not care about hardware details, but…
  – We need to think about it because of performance issues
  – I will try to give you pointers to the most relevant architecture characteristics

• Do not hesitate to ask questions as we go

High Performance Computing

• In our context, it refers to hardware and software tools dedicated to computationally intensive tasks

• Distinction between an HPC center (throughput focused) and a data center (data focused) is becoming fuzzy

• High bandwidth, low latency
  – Memory
  – Network

Lonestar: Intel hexa-core system

Lonestar: Introduction

• Lonestar Cluster
  – Configuration & Diagram
  – Server Blades
• Dell PowerEdge M610 Blade (Intel Hexa-Core) Server Nodes
• Architecture Features
  – Instruction Pipeline
  – Speeds and Feeds
  – Block Diagram
• Node Interconnect
  – Hierarchy
  – InfiniBand Switch and Adapters
  – Performance

Lonestar Cluster Overview

lonestar.tacc.utexas.edu

Hardware          Components                    Characteristics
Peak Performance                                302 TFLOPS
Nodes             2 x Hexa-Core Xeon 5680       1,888 nodes / 22,656 cores
Memory            1333 MHz DDR3                 24 GB/node, 45 TB total
Shared Disk       Lustre parallel file system   1 PB
Local Disk        SATA                          146 GB/node, 276 TB total
Interconnect      InfiniBand (QDR)              4 GB/sec P-2-P

Blade : Rack : System

• 1 node: 2 x 6 cores = 12 cores
• 1 chassis: 16 nodes = 192 cores
• 1 rack: 3 chassis x 16 nodes = 576 cores
• 39⅓ racks: 22,656 cores

Lonestar login nodes

• Dell PowerEdge M610
  – Dual socket Intel hexa-core Xeon, 3.33 GHz
  – 24 GB DDR3 1333 MHz DIMMs
  – Intel 5520 chipset (QPI)
• Dell PowerVault 1200
  – 15 TB HOME disk
  – 1 GB user quota (5x)

Lonestar compute nodes

• 16 blades / 10U chassis
• Dell PowerEdge M610
  – Dual socket Intel hexa-core Xeon
    • 3.33 GHz (1.25x)
    • 13.3 GFLOPS/core (1.25x); see the peak-performance sketch below
    • 64 KB L1 cache per core (independent)
    • 12 MB L3 cache (shared)
  – 24 GB DDR3 1333 MHz DIMMs
  – Intel 5520 chipset (QPI)
  – 2x QPI @ 6.4 GT/s
  – 146 GB 10k RPM SAS/SATA local disk (/tmp)
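To make the peak numbers above concrete, here is a minimal sketch in plain C (the clock rate, FLOPS/cycle, and node count are the figures quoted on these slides, not measured values) that reproduces the per-core, per-node, and system peak performance:

```c
#include <stdio.h>

/* Peak floating-point performance of Lonestar from the figures above.
 * A Westmere core can retire up to 4 double-precision FLOPS per clock
 * (2-wide SSE add + 2-wide SSE multiply). */
int main(void)
{
    const double clock_ghz      = 3.33;   /* core clock                     */
    const int    flops_per_clk  = 4;      /* SSE add + multiply, 2 doubles  */
    const int    cores_per_node = 12;     /* 2 sockets x 6 cores            */
    const int    nodes          = 1888;

    double gflops_core = clock_ghz * flops_per_clk;        /* ~13.3 GFLOPS */
    double gflops_node = gflops_core * cores_per_node;     /* ~160 GFLOPS  */
    double tflops_sys  = gflops_node * nodes / 1000.0;     /* ~302 TFLOPS  */

    printf("per core: %5.1f GFLOPS\n", gflops_core);
    printf("per node: %5.1f GFLOPS\n", gflops_node);
    printf("system  : %5.1f TFLOPS\n", tflops_sys);
    return 0;
}
```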

Motherboard

[Block diagram: two hexa-core Xeon sockets (12 CPU cores) with DDR3-1333 memory channels, connected by QPI to the 5520 I/O Hub (PCIe) and, via DMI, to the ICH I/O controller]

Intel Xeon 5600 (Westmere)

• 32 KB L1 cache per core
• 256 KB L2 cache per core
• Shared 12 MB L3 cache
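These cache sizes are one of the architecture characteristics worth keeping in mind: once a loop's working set spills out of a given level, the effective traversal rate drops. A minimal sketch (plain C, using the OpenMP timer only; the array sizes and repeat counts are arbitrary choices, not anything from the slides) that sweeps working sets across the 32 KB / 256 KB / 12 MB boundaries:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>    /* only for omp_get_wtime(); compile with -fopenmp */

/* Sweep a simple sum over working sets of increasing size and report the
 * effective traversal rate; it typically drops as the working set spills
 * past the L2 and L3 sizes listed above. */
int main(void)
{
    const size_t max_bytes = 64u << 20;          /* 64 MB, beyond L3 */
    double *a = malloc(max_bytes);
    for (size_t i = 0; i < max_bytes / sizeof(double); i++)
        a[i] = 1.0;

    for (size_t bytes = 16u << 10; bytes <= max_bytes; bytes *= 2) {
        size_t n    = bytes / sizeof(double);
        size_t reps = (256u << 20) / bytes;      /* touch ~256 MB per size */
        double sum  = 0.0, t0 = omp_get_wtime();
        for (size_t r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        double t = omp_get_wtime() - t0;
        printf("%8zu KB : %6.1f GB/s (sum=%g)\n",
               bytes >> 10, (double)bytes * reps / t / 1e9, sum);
    }
    free(a);
    return 0;
}
```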

Cluster Interconnect

[Diagram: 16 independent 4 GB/s connections per chassis]

Lonestar Parallel File Systems: Lustre & NFS

[Diagram: Lonestar file systems]
• $HOME: 15 TB, served over Ethernet (NFS), 1 GB user quota
• $WORK: Lustre over InfiniBand, ? TB
• $SCRATCH: Lustre over InfiniBand, 841 TB
• Uses IP over IB

Ranger: AMD Quad-core system

Ranger: Introduction

• Unique instrument for computational scientific research
• Housed at TACC's new machine room
• Over 2½ years of initial planning and deployment efforts
• Funded by the National Science Foundation (Office of Cyberinfrastructure) as part of a unique program to reinvigorate High Performance Computing in the United States

ranger.tacc.utexas.edu

How Much Did it Cost and Who’s Involved?

• TACC selected for the very first NSF 'Track2' HPC system
  – $30M system acquisition
  – Sun Microsystems is the vendor
  – Very large InfiniBand installation
    • ~4,100 endpoint hosts
    • >1,350 MT47396 switches

• TACC, ICES, Cornell Theory Center, and Arizona State HPCI are teamed to operate/support the system for 4 years ($29M)

Ranger: Performance

• Ranger debuted at #4 on the Top 500 list (ranked #11 as of June 2010)

Ranger Cluster Overview

Hardware          Components                                 Characteristics
Peak Performance                                             579 TFLOPS
Nodes             4 x Quad-Core AMD                          3,936 nodes / 62,976 cores
Memory            667 MHz DDR2 DIMMs, 1 GHz HyperTransport   32 GB/node, 123 TB total
Shared Disk       Lustre parallel file system                1.7 PB
Local Disk        NONE (flash-based)                         8 Gb
Interconnect      InfiniBand (generation 2)                  1 GB/sec P-2-P, 2.3 μs latency

Ranger Hardware Summary

• Compute power: 579 TFLOPS
  – 3,936 Sun four-socket blades
  – 15,744 AMD "Barcelona" processors
    • Quad-core, four FLOPS/clock (dual pipelines)
• Memory: 123 TB
  – 2 GB/core, 32 GB/node
  – ~20 GB/sec memory bandwidth per node (667 MHz DDR2); see the triad sketch after this list
• Disk subsystem: 1.7 PB
  – 72 Sun x4500 "Thumper" I/O servers, 24 TB each
  – 40 GB/sec total aggregate I/O bandwidth
  – 1 PB raw capacity in the largest filesystem
• Interconnect: 10 Gbps, 1.6 – 2.9 μs latency
  – Sun InfiniBand-based switches (2), up to 3,456 4x ports each
  – Full non-blocking 7-stage Clos fabric
  – Mellanox ConnectX InfiniBand (second generation)
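The ~20 GB/sec per-node memory bandwidth figure is the kind of number a STREAM-style triad can check. A minimal sketch (plain C with OpenMP; the array size and repetition count are arbitrary choices, and this is not the official STREAM benchmark):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* STREAM-style triad: a[i] = b[i] + s*c[i].
 * Counts 3 x 8 bytes of traffic per element (two loads, one store). */
int main(void)
{
    const size_t n = 20 * 1000 * 1000;           /* ~480 MB total, well beyond cache */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    const double s = 3.0;

    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {             /* first-touch initialization */
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }

    double best = 0.0;
    for (int rep = 0; rep < 10; rep++) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
        double gbs = 3.0 * n * sizeof(double) / (omp_get_wtime() - t0) / 1e9;
        if (gbs > best) best = gbs;
    }
    printf("best triad bandwidth: %.1f GB/s\n", best);
    free(a); free(b); free(c);
    return 0;
}
```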

Ranger Hardware Summary (cont.)

• 25 management servers: Sun 4-socket x4600s
  – 4 login servers, quad-core processors
  – 1 Rocks master, contains the software stack for nodes
  – 2 SGE servers, primary batch server and backup
  – 2 Sun Connection Management servers, monitor hardware
  – 2 InfiniBand subnet managers, primary and backup
  – 6 Lustre metadata servers, enabled with failover
  – 4 archive data-movers, move data to the tape library
  – 4 GridFTP servers, external multi-stream transfer
• Ethernet networking: 10 Gbps connectivity
  – Two external 10 GigE networks: TeraGrid, NLR
  – 10 GigE fabric for login, data-mover, and GridFTP nodes, integrated into the existing TACC network infrastructure
  – Force10 S2410P and E1200 switches

Infiniband Cabling in Ranger

• Sun switch design with reduced cable count; manageable, but still a challenge to cable
  – 1,312 InfiniBand 12x-to-12x cables
  – 78 InfiniBand 12x-to-three-4x splitter cables
  – Cable lengths range from 7 to 16 m, average 11 m
• 9.3 miles of InfiniBand cable total (15.4 km)

Space, Power and Cooling

• System power: 3.0 MW total
• System: 2.4 MW
  – ~90 racks, in a 6-row arrangement
  – ~100 in-row cooling units
  – ~4,000 ft² total footprint
• Cooling: ~0.6 MW
  – In-row units fed by three 400-ton chillers
  – Enclosed hot aisles
  – Supplemental 280 tons of cooling from CRAC units
• Observations:
  – Space is less of an issue than power
  – Cooling > 25 kW per rack is difficult
  – Power distribution is a challenge: more than 1,200 circuits

[Photos: external power and cooling infrastructure; switches in place; InfiniBand cabling in progress]

Ranger Features

• AMD processors:
  – HPC features: 4 FLOPS/CP
  – 4 sockets on a board
  – 4 cores per socket
  – HyperTransport (direct connect between sockets)
  – 2.3 GHz cores
  – Any idea what the peak floating-point performance of a node is?
    • 2.3 GHz * 4 FLOPS/CP * 16 cores = 147.2 GFLOPS peak performance
  – Any idea how much an application can sustain?
    • Can sustain over 80% of peak with DGEMM (matrix-matrix multiply); see the timing sketch after this list
• NUMA node architecture (16 cores per node, think hybrid)
• 2-tier InfiniBand (NEM – "Magnum") switch system
• Multiple Lustre (parallel) file systems
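The "over 80% of peak with DGEMM" figure is easy to check for yourself. A minimal sketch, assuming a threaded BLAS library is linked in and exposes the standard CBLAS dgemm interface; the matrix size is an arbitrary choice, not something from the slides:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>      /* omp_get_wtime() */
#include <cblas.h>    /* link against a threaded BLAS */

/* Time C = A*B for N x N doubles and compare the achieved GFLOPS
 * against the 147.2 GFLOPS node peak quoted above. */
int main(void)
{
    const int N = 4000;                      /* arbitrary; ~128 MB per matrix */
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    double *C = malloc((size_t)N * N * sizeof *C);
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    double t = omp_get_wtime() - t0;

    double gflops = 2.0 * N * (double)N * N / t / 1e9;   /* 2*N^3 flops */
    printf("dgemm: %.1f GFLOPS, %.0f%% of 147.2 GFLOPS node peak\n",
           gflops, 100.0 * gflops / 147.2);
    free(A); free(B); free(C);
    return 0;
}
```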

… “C48” InfiniBand Blades Login X4600 Switches

Nodes 3,456 IB ports, each 12x Line … splits into 3 4x lines. Bisection BW = 110Tbps. I/O Nodes WORK File System 82

Thumper 1 X4500 Metadata Server 1 per

File Sys.

X4600 …

24 TB 72 GigE each InfiniBand Ranger Infiniband Topology

“Magnum” NEM NEM Switch NEM NEM

NEM NEM NEM NEM …78… NEM NEM NEM NEM

NEM NEM NEM NEM

12x InfiniBand 3 cables combined MPI Tests: P2P Bandwidth

[Plot: point-to-point MPI measured bandwidth (MB/sec) vs. message size, Ranger (OFED 1.2, MVAPICH 0.9.9) vs. Lonestar (OFED 1.1, MVAPICH 0.9.8)]

• Shelf latencies: ~1.6 µs
• Rack latencies: ~2.0 µs
• Peak BW: ~965 MB/s
• Effective bandwidth is improved at smaller message sizes
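Numbers like these come from a ping-pong test between two ranks on different nodes. A minimal sketch using standard MPI in C (the message sizes and repetition count are arbitrary choices, and this is not the exact benchmark behind the plot):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Ping-pong between ranks 0 and 1: rank 0 sends, rank 1 echoes back.
 * Bandwidth = 2 * bytes / round-trip time; half the round-trip time for
 * the smallest message approximates the latency. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    char *buf = malloc(1 << 22);                     /* up to 4 MB messages */

    for (int bytes = 1; bytes <= (1 << 22); bytes *= 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / reps;        /* round-trip time */
        if (rank == 0)
            printf("%8d bytes : %8.2f MB/s  (half round trip %6.2f us)\n",
                   bytes, 2.0 * bytes / t / 1e6, t / 2 * 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```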

Ranger: Bisection BW Across 2 Magnums

[Plot: bisection bandwidth (GB/sec) and full-bisection-bandwidth efficiency (%), ideal vs. measured, as a function of the number of Ranger compute racks (1–82)]

• Able to sustain ~73% bisection bandwidth efficiency with all 3,936 nodes communicating simultaneously (82 racks)

Sun Motherboard for AMD Barcelona Chips

[Diagram: compute blade with 4 sockets x 4 cores, 8 memory slots per socket, 1 GHz HyperTransport]

Sun Motherboard for AMD Barcelona Chips

[Diagram: a maximum-neighbor NUMA configuration for 3-port HyperTransport; 8.3 GB/s links; two PCIe x8 (32 Gbps) and one PCIe x4 (16 Gbps) to the passive midplane / NEM switch]

• HyperTransport: 6.4 GB/s bidirectional, 3.2 GB/s unidirectional
• Dual-channel, 533 MHz registered, ECC memory
• Memory placement matters on this four-socket NUMA layout; see the first-touch sketch below
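On a four-socket NUMA board like this, whichever thread first touches a page determines which socket's memory controller owns it. A minimal sketch (C with OpenMP; array size and layout are arbitrary choices, not from the slides) contrasting serial first-touch with parallel first-touch, which typically shows a measurable difference on such a node:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Time a parallel triad over memory that was first touched either by one
 * thread (all pages on one socket) or by all threads (pages spread across
 * the sockets that later use them). */
static double triad(double *a, double *b, double *c, size_t n)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
    return omp_get_wtime() - t0;
}

int main(void)
{
    const size_t n = 40 * 1000 * 1000;            /* ~960 MB total */
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b), *c = malloc(n * sizeof *c);

    /* Case 1: serial first touch -- one socket owns every page. */
    for (size_t i = 0; i < n; i++) { a[i] = 0; b[i] = 1; c[i] = 2; }
    printf("serial first touch  : %.3f s\n", triad(a, b, c, n));

    free(a); free(b); free(c);
    a = malloc(n * sizeof *a); b = malloc(n * sizeof *b); c = malloc(n * sizeof *c);

    /* Case 2: parallel first touch -- same loop schedule as the triad. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) { a[i] = 0; b[i] = 1; c[i] = 2; }
    printf("parallel first touch: %.3f s\n", triad(a, b, c, n));

    free(a); free(b); free(c);
    return 0;
}
```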

Intel/AMD Dual- to Quad-core Evolution

[Diagram: AMD Opteron dual-core vs. AMD Barcelona/Phenom quad-core (integrated memory controller), and Intel Woodcrest vs. Intel Clovertown (external MCH)]

Caches in Quad Core CPUs

[Diagram: Intel quad-core: per-core L1, with L2 caches shared between pairs of cores (L2 caches are not independent). AMD quad-core: per-core L1 and L2 (L2 caches are independent), shared L3, and on-die memory controller.]

Cache sizes in AMD Barcelona

[Diagram: four cores, each with 64 KB L1 and 512 KB L2, sharing a 2 MB L3 behind the system request interface and crossbar switch, which connects to the memory controller and the HT 0/1/2 links]

• HT link bandwidth is 24 GB/s total (8 GB/s per link)
• Memory controller bandwidth up to 10.7 GB/s (667 MHz)

Other Important Features

• AMD quad-core (K10, Barcelona)
• Instruction fetch bandwidth is now 32 bytes/cycle
• 2 MB L3 cache on-die; 4 x 512 KB L2 caches; 64 KB L1 instruction & data caches
• SSE units are now 128 bits wide -> single-cycle throughput; improved ALU and FPU throughput
• Larger branch prediction tables, higher accuracies
• Dedicated stack engine to pull stack-related ESP updates out of the instruction stream

AMD 10h Processor Speeds and Feeds

              Registers   L1 Data    L2         L3         Memory
              (on die)    64 KB      512 KB     2 MB       (external, DDR2 DIMMs)
Load speed                4 W/CP     2 W/CP     0.5 W/CP   2x @ 533 MHz
Store speed               2 W/CP     1 W/CP     0.5 W/CP
Latency                   3 CP       ~15 CP     ~25 CP     ~300 CP

W: FP word (64 bit); CP: clock period; cache line size (L1/L2) is 8 W; 4 FLOPS/CP

• Cache states: MOESI (Modified, Owner, Exclusive, Shared, Invalid)
• MOESI is beneficial when latency/bandwidth between CPUs is significantly better than to main memory (see the false-sharing sketch below)
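One place the MOESI protocol becomes visible to application code is false sharing: threads updating distinct variables that live in the same cache line force the line to bounce between cores. A minimal sketch (C11 with OpenMP; the counter layout and iteration count are arbitrary choices, not from the slides):

```c
#include <stdio.h>
#include <omp.h>

/* False sharing demo: per-thread counters packed into one cache line vs.
 * padded and aligned to one counter per 64-byte line.  The packed version
 * forces the coherence protocol to move the line between cores on every
 * update; volatile keeps each increment as a real load/store. */
#define NTHREADS 4
#define ITERS    (100 * 1000 * 1000L)

struct padded { _Alignas(64) volatile long v; char pad[56]; };

int main(void)
{
    volatile long packed[NTHREADS] = {0};     /* all in one cache line */
    struct padded alone[NTHREADS]  = {{0}};   /* one cache line each   */

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) packed[t]++;
    }
    printf("shared line : %.3f s\n", omp_get_wtime() - t0);

    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) alone[t].v++;
    }
    printf("padded lines: %.3f s\n", omp_get_wtime() - t0);

    return 0;
}
```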

Ranger Disk Subsystem - Lustre

• Disk system (OSS) is based on the Sun x4500 "Thumper"
  – Each server has 48 SATA II 500 GB drives (24 TB total), running internal software RAID
  – Dual socket / dual core @ 2.6 GHz
  – 72 servers total: 1.7 PB raw storage (that's 288 cores just to drive the file systems)

• Metadata Servers (MDS) based on SunFire x4600s

• MDS is Fibre-channel connected to 9TB Flexline Storage

• Target performance
  – Aggregate bandwidth: 40 GB/sec

Ranger Parallel File Systems: Lustre

[Diagram: compute blades (12 blades/chassis) reach the Lustre file systems through the InfiniBand switch]
• $HOME: 6 Thumpers, 96 TB, 36 OSTs, 6 GB/user quota
• $WORK: 193 TB, 72 OSTs
• $SCRATCH: 50 Thumpers, 773 TB, 300 OSTs
• Uses IP over IB

I/O with Lustre over Native InfiniBand

[Plots: $SCRATCH file system throughput and $SCRATCH single-application performance; write speed (GB/sec) vs. number of writing clients (1–10,000), for stripecount=1 and stripecount=4]

• Max. total aggregate performance of 38 or 46 GB/sec depending on stripecount (design target = 32 GB/sec)
• External users have reported performance of ~35 GB/sec with a 4K application run
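Results like these are gathered with many clients writing to the same Lustre file system at once. A minimal sketch of such a test using standard MPI-IO (the file name and per-rank transfer size are arbitrary choices; the stripe count would be set on the target directory beforehand, e.g. with lfs setstripe):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Each rank writes its own contiguous block of a shared file; rank 0
 * reports the aggregate write bandwidth. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int block = 64 << 20;                     /* 64 MB per rank */
    char *buf = malloc(block);
    for (int i = 0; i < block; i++) buf[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "scratch_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block,
                          MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);                            /* flushes to the OSTs */
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d writers: %.2f GB/s aggregate\n",
               nprocs, (double)nprocs * block / t / 1e9);
    free(buf);
    MPI_Finalize();
    return 0;
}
```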

Longhorn: Intel Quad-core system

Longhorn: Introduction

• First NSF eXtreme Digital Visualization grant (XD Vis)
• Designed for scientific visualization and data analysis
  – Very large memory per computational core
  – Two NVIDIA graphics cards per node
  – Rendering performance of 154 billion triangles/second

Longhorn Cluster Overview

Hardware                     Components                      Characteristics
Peak Performance (CPUs)      512 Intel Xeon E5400            20.7 TFLOPS
Peak Performance (GPUs, SP)  128 NVIDIA Quadroplex 2200 S4   500 TFLOPS
System Memory                DDR3 DIMMs                      48 GB/node (240 nodes), 144 GB/node (16 nodes), 13.5 TB total
Graphics Memory              512 FX5800 x 4 GB               2 TB
Disk                         Lustre parallel file system     210 TB
Interconnect                 QDR InfiniBand                  4 GB/sec P-2-P

Longhorn "fat" nodes

• Dell R710
  – Dual socket Intel quad-core Xeon E5400 @ 2.53 GHz
  – 144 GB DDR3 (18 GB/core)
  – Intel 5520 chipset
• NVIDIA Quadroplex 2200 S4
  – 4 NVIDIA Quadro FX5800
    • 240 CUDA cores
    • 4 GB memory
    • 102 GB/s memory bandwidth

Longhorn standard nodes

• Dell R610
  – Dual socket Intel quad-core Xeon E5400 @ 2.53 GHz
  – 48 GB DDR3 (6 GB/core)
  – Intel 5520 chipset
• NVIDIA Quadroplex 2200 S4
  – 4 NVIDIA Quadro FX5800
    • 240 CUDA cores
    • 4 GB memory
    • 102 GB/s memory bandwidth

Motherboard (R610/R710)

[Diagram: two quad-core Intel Xeon 5500-series sockets with DDR3 memory channels (12 DIMM slots on the R610, 18 DIMM slots on the R710), connected to the Intel 5520 chipset: 42 lanes PCI Express or 36 lanes PCI Express 2.0]

Cache Sizes in Intel Nehalem

[Diagram: Nehalem die with four cores, each with 32 KB L1 and 256 KB L2, an 8 MB shared L3 cache, an integrated memory controller, and two QPI links (QPI 0, QPI 1)]

• Total QPI bandwidth up to 25.6 GB/s (@ 3.2 GHz)
• 2 x 20-lane QPI links
• 4 quadrants (10 lanes each)

[Diagram: Nehalem μArchitecture]

Storage

• Ranch: long-term tape storage
• Corral: 1 PB of spinning disk

Ranch Archival System

• Sun StorageTek silo
• 10,000 tapes
• 10 PB capacity
• Used for long-term storage

Corral

• 1.2 PB DataDirect Networks online disk storage
• 8 Dell 1950 and 8 Dell 2950 servers
• High-performance parallel file system
  – Multiple databases
  – iRODS data management
  – Replication to tape archive
• Multiple levels of access control
• Web and other data access available globally

References

Lonestar Related References

User Guide
  services.tacc.utexas.edu/index.php/lonestar-user-guide
Developers
  www.tomshardware.com
  www.intel.com/p/en_US/products/server/processor/xeon5000/technical-documents

Ranger Related References

User Guide
  services.tacc.utexas.edu/index.php/ranger-user-guide
Forums
  forums.amd.com/devforum
  www.pgroup.com/userforum/index.php
Developers
  developer.amd.com/home.jsp
  developer.amd.com/rec_reading.jsp

Longhorn Related References

User Guide
  services.tacc.utexas.edu/index.php/longhorn-user-guide
General Information
  www.intel.com/technology/architecture-silicon/next-gen/
  www.nvidia.com/object/cuda_what_is.html
Developers
  developer.intel.com/design/pentium4/manuals/index2.htm
  developer.nvidia.com/forums/index.php