HPC Hardware Overview
Luke Wilson
September 13, 2011
Texas Advanced Computing Center, The University of Texas at Austin

Outline
• Some general comments
• Lonestar System
– Dell blade-based system
– InfiniBand (QDR)
– Intel processors
• Ranger System
– Sun blade-based system
– InfiniBand (SDR)
– AMD processors
• Longhorn System
– Dell blade-based system
– InfiniBand (QDR)
– Intel processors

About this Talk
• We will focus on TACC systems, but much of the information applies to HPC systems in general
• As an applications programmer you may not care about hardware details, but…
– we need to think about it because of performance issues
– I will try to give you pointers to the most relevant architecture characteristics
• Do not hesitate to ask questions as we go

High Performance Computing
• In our context, it refers to hardware and software tools dedicated to computationally intensive tasks
• The distinction between an HPC center (throughput focused) and a data center (data focused) is becoming fuzzy
• High bandwidth, low latency
– Memory
– Network

Lonestar: Intel Hexa-Core System

Lonestar: Introduction
• Lonestar Cluster
– Configuration & diagram
– Server blades
• Dell PowerEdge M610 Blade (Intel hexa-core) server nodes
• Microprocessor architecture features
– Instruction pipeline
– Speeds and feeds
– Block diagram
• Node interconnect
– Hierarchy
– InfiniBand switch and adapters
– Performance

Lonestar Cluster Overview
lonestar.tacc.utexas.edu
• Peak performance: 302 TFLOPS
• Nodes: 2 hexa-core Xeon 5680 per node; 1,888 nodes / 22,656 cores
• Memory: 1333 MHz DDR3 DIMMs; 24 GB/node, 45 TB total
• Shared disk: Lustre parallel file system, 1 PB
• Local disk: SATA, 146 GB/node, 276 TB total
• Interconnect: QDR InfiniBand, 4 GB/sec point-to-point

Blade : Rack : System
• 1 node: 2 sockets x 6 cores = 12 cores
• 1 chassis: 16 nodes = 192 cores
• 1 rack: 3 chassis x 16 nodes = 576 cores
• 39⅓ racks: 22,656 cores
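The blade/rack arithmetic above can be reproduced in a few lines (a sketch using only the counts from the slide):

```python
# Core-count arithmetic for Lonestar, using the figures from the slide.
cores_per_node = 2 * 6            # two hexa-core Xeon 5680 sockets
nodes_per_chassis = 16
chassis_per_rack = 3
total_nodes = 1888

cores_per_chassis = nodes_per_chassis * cores_per_node   # 192
cores_per_rack = chassis_per_rack * cores_per_chassis    # 576
total_cores = total_nodes * cores_per_node               # 22656
racks = total_cores / cores_per_rack                     # about 39 1/3

print(cores_per_chassis, cores_per_rack, total_cores, round(racks, 2))
```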
Lonestar login nodes
• Dell PowerEdge M610
– dual-socket Intel hexa-core Xeon, 3.33 GHz
– 24 GB DDR3 1333 MHz DIMMs
– Intel 5520 chipset (QPI)
• Dell PowerVault 1200
– 15 TB $HOME disk
– 1 GB user quota (5x)

Lonestar compute nodes
• 16 blades per 10U chassis
• Dell PowerEdge M610
– dual-socket Intel hexa-core Xeon
• 3.33 GHz (1.25x)
• 13.3 GFLOPS/core (1.25x)
• 64 KB L1 cache per core (independent)
• 12 MB L3 cache (shared)
– 24 GB DDR3 1333 MHz DIMMs
– Intel 5520 chipset (QPI)
– 2x QPI at 6.4 GT/s
– 146 GB 10k RPM SAS local disk (/tmp)
[Diagram: node motherboard with two hexa-core Xeons (12 CPU cores) joined by QPI, DDR3-1333 memory on each socket, an I/O hub (IOH) providing PCIe, and DMI to the I/O controller hub (ICH)]

Intel Xeon 5600 (Westmere)
• 32 KB L1 cache/core
• 256 KB L2 cache/core
• Shared 12 MB L3 cache

Cluster Interconnect
• 16 independent 4 GB/s connections per chassis

Lonestar Parallel File Systems: Lustre & NFS
[Diagram: Lonestar file systems]
• $HOME: 15 TB, served over Ethernet; 1 GB user quota
• $WORK: ? TB, served over InfiniBand
• $SCRATCH: 841 TB, served over InfiniBand (uses IP over IB)

Ranger: AMD Quad-Core System

Ranger: Introduction
• Unique instrument for computational scientific research
• Housed in TACC's new machine room
• Over 2½ years of initial planning and deployment effort
• Funded by the National Science Foundation (Office of Cyberinfrastructure) as part of a unique program to reinvigorate High Performance Computing in the United States

ranger.tacc.utexas.edu
How Much Did it Cost and Who’s Involved?
• TACC selected for the very first NSF 'Track2' HPC system
– $30M system acquisition
– Sun Microsystems is the vendor
– very large InfiniBand installation
• ~4,100 endpoint hosts
• >1,350 MT47396 switches
• TACC, ICES, the Cornell Theory Center, and Arizona State HPCI are teamed to operate and support the system for 4 years ($29M)

Ranger: Performance
• Ranger debuted at #4 on the Top 500 list (ranked #11 as of June 2010)
Ranger Cluster Overview
• Peak performance: 579 TFLOPS
• Nodes: 4 quad-core Opterons per node; 3,936 nodes / 62,976 cores
• Memory: 667 MHz DDR2 DIMMs, 1 GHz HyperTransport; 32 GB/node, 123 TB total
• Shared disk: Lustre parallel file system, 1.7 PB
• Local disk: none (8 GB flash-based)
• Interconnect: InfiniBand (generation 2), 1 GB/sec point-to-point, 2.3 μs latency

Ranger Hardware Summary
• Compute power: 579 Teraflops
– 3,936 Sun four-socket blades
– 15,744 AMD "Barcelona" processors
• quad-core, four flops/cycle (dual pipelines)
• Memory: 123 Terabytes
– 2 GB/core, 32 GB/node
– ~20 GB/sec memory bandwidth per node (667 MHz DDR2)
• Disk subsystem: 1.7 Petabytes
– 72 Sun x4500 "Thumper" I/O servers, 24 TB each
– 40 GB/sec total aggregate I/O bandwidth
– 1 PB raw capacity in the largest filesystem
• Interconnect: 10 Gbps, 1.6–2.9 μs latency
– Sun InfiniBand-based switches (2), up to 3,456 4x ports each
– full non-blocking 7-stage Clos fabric
– Mellanox ConnectX InfiniBand (second generation)
Ranger Hardware Summary (cont.)
• 25 management servers: Sun four-socket x4600s
– 4 login servers, quad-core processors
– 1 Rocks master, contains the software stack for nodes
– 2 SGE servers, primary batch server and backup
– 2 Sun Connection Management servers, monitor hardware
– 2 InfiniBand subnet managers, primary and backup
– 6 Lustre metadata servers, enabled with failover
– 4 archive data movers, move data to the tape library
– 4 GridFTP servers, external multi-stream transfer
• Ethernet networking: 10 Gbps connectivity
– two external 10GigE networks: TeraGrid, NLR
– 10GigE fabric for login, data-mover, and GridFTP nodes, integrated into the existing TACC network infrastructure
– Force10 S2410P and E1200 switches
InfiniBand Cabling in Ranger
• Sun switch design with reduced cable count: manageable, but still a challenge to cable
– 1,312 InfiniBand 12x-to-12x cables
– 78 InfiniBand 12x-to-three-4x splitter cables
– cable lengths range from 7 to 16 m, averaging 11 m
• 9.3 miles of InfiniBand cable total (15.4 km)

Space, Power and Cooling
• System power: 3.0 MW total
• System: 2.4 MW
– ~90 racks in a 6-row arrangement
– ~100 in-row cooling units
– ~4,000 ft² total footprint
• Cooling: ~0.6 MW
– in-row units fed by three 400-ton chillers
– enclosed hot aisles
– supplemental 280 tons of cooling from CRAC units
• Observations:
– space is less of an issue than power
– cooling beyond 25 kW per rack is difficult
– power distribution is a challenge: more than 1,200 circuits
[Photos: external power and cooling infrastructure; switches in place; InfiniBand cabling in progress]

Ranger Features
• AMD processors:
– HPC feature: 4 FLOPS/CP
– 4 sockets on a board
– 4 cores per socket
– HyperTransport (direct connection between sockets)
– 2.3 GHz cores
– Any idea what the peak floating-point performance of a node is?
• 2.3 GHz * 4 flops/CP * 16 cores = 147.2 GFlops peak performance
– Any idea how much an application can sustain?
• can sustain over 80% of peak with DGEMM (matrix-matrix multiply)
• NUMA node architecture (16 cores per node; think hybrid)
• 2-tier InfiniBand (NEM / "Magnum") switch system
• Multiple Lustre (parallel) file systems
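The node peak-performance question above works out as follows (a sketch of the slide's arithmetic; the 3,936-node count is taken from the overview table):

```python
# Node and system peak floating-point performance for Ranger (Barcelona).
clock_hz = 2.3e9          # 2.3 GHz core clock
flops_per_cp = 4          # 4 flops/clock (dual 128-bit SSE pipelines)
cores_per_node = 4 * 4    # 4 sockets x 4 cores

node_peak_gflops = clock_hz * flops_per_cp * cores_per_node / 1e9
system_peak_tflops = node_peak_gflops * 3936 / 1000

print(node_peak_gflops)               # 147.2 GFlops per node
print(round(system_peak_tflops, 1))   # consistent with the 579 TFLOPS figure
```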
Ranger Architecture
[Diagram: internet-facing login nodes (X4600s) and "C48" compute blades (X4600, 4 sockets x 4 cores each) connect through "Magnum" InfiniBand switches with 3,456 12x ports each; each 12x line splits into three 4x lines; bisection bandwidth is 110 Tbps. I/O nodes serve the $WORK file system from X4500 "Thumpers" (24 TB each, 72 in total, GigE-attached) with one X4600 metadata server per file system.]

Ranger InfiniBand Topology
[Diagram: two "Magnum" switches connect the chassis NEMs (…78…); each 12x InfiniBand cable carries 3 combined 4x links]

MPI Tests: P2P Bandwidth
[Plot: measured point-to-point MPI bandwidth (MB/sec) vs. message size (1 B to 100 MB): Ranger (OFED 1.2, MVAPICH 0.9.9) vs. Lonestar (OFED 1.1, MVAPICH 0.9.8)]
• Shelf latencies: ~1.6 µs
• Rack latencies: ~2.0 µs
• Peak bandwidth: ~965 MB/s
• Effective bandwidth is improved at smaller message sizes
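A simple latency-plus-throughput model illustrates why effective bandwidth falls off at small message sizes (an illustration using the latency and peak-bandwidth figures above, not a fit to the measured curves):

```python
# Model: transfer time ~= latency + size / peak_bw, so small messages are
# dominated by latency and large messages approach peak bandwidth.
LATENCY_S = 1.6e-6       # ~1.6 us shelf latency
PEAK_BW = 965e6          # ~965 MB/s peak point-to-point bandwidth

def effective_bw(msg_bytes):
    """Effective bandwidth (bytes/sec) for a single message of msg_bytes."""
    return msg_bytes / (LATENCY_S + msg_bytes / PEAK_BW)

for size in (1_000, 100_000, 10_000_000):
    print(f"{size:>10} B: {effective_bw(size) / 1e6:7.1f} MB/s")
```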
Ranger: Bisection BW Across 2 Magnums
[Plot: ideal vs. measured bisection bandwidth (GB/sec, up to ~2,000) and full-bisection bandwidth efficiency (%) vs. number of Ranger compute racks (1 to 82)]
• Able to sustain ~73% bisection-bandwidth efficiency with all 3,936 nodes communicating simultaneously (82 racks)
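The ~73% figure can be related back to link rate and node count (a back-of-the-envelope sketch; the per-node rate is the ~1 GB/s point-to-point figure from the overview table):

```python
# Ideal full-bisection bandwidth: half the nodes stream across the bisection
# to the other half at the point-to-point link rate.
nodes = 3936
link_bw_gbs = 1.0                       # ~1 GB/s per node

ideal_gbs = nodes / 2 * link_bw_gbs     # ~1968 GB/s ideal bisection bandwidth
sustained_gbs = 0.73 * ideal_gbs        # ~73% efficiency at 82 racks

print(round(ideal_gbs), round(sustained_gbs))
```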
Sun Motherboard for AMD Barcelona Chips
[Diagram: compute blade with 4 sockets, 4 cores per socket, 8 memory slots per socket, 1 GHz HyperTransport between sockets (8.3 GB/s links shown), and two PCIe x8 (32 Gbps) plus one PCIe x4 (16 Gbps) connections to the passive midplane and NEM switch]
• A maximum-neighbor NUMA configuration for 3-port HyperTransport
• HyperTransport: 6.4 GB/s bidirectional, 3.2 GB/s unidirectional
• Dual-channel, 533 MHz registered ECC memory

Intel/AMD Dual- to Quad-Core Evolution
[Diagram: AMD's evolution from dual-core Opteron to quad-core Barcelona/Phenom keeps the on-die memory controller; Intel's evolution from Woodcrest to Clovertown keeps an external memory controller hub (MCH)]

Caches in Quad-Core CPUs
[Diagram: in Intel quad-cores, pairs of cores share an L2, so the L2 caches are not independent; in AMD quad-cores each core has its own L2 (independent), with a shared L3 in front of the memory controller]

Cache Sizes in AMD Barcelona
[Diagram: Barcelona die with four cores, each with 64 KB L1 and 512 KB L2, a shared 2 MB L3, a system request interface and crossbar switch, three HyperTransport links (HT 0–2), and the memory controller]
• HT link bandwidth is 24 GB/s aggregate (8 GB/s per link)
• Memory-controller bandwidth up to 10.7 GB/s (667 MHz)

Other Important Features
• AMD quad-core (K10, code name Barcelona)
• Instruction fetch bandwidth now 32 bytes/cycle
• 2 MB L3 cache on-die; 4 x 512 KB L2 caches; 64 KB L1 instruction & data caches
• SSE units are now 128 bits wide, giving single-cycle throughput; improved ALU and FPU throughput
• Larger branch-prediction tables, higher accuracies
• Dedicated stack engine to pull stack-related ESP updates out of the instruction stream

AMD 10h Processor Speeds and Feeds
Registers, L1, L2, and L3 are on-die; memory (dual-channel 533 MHz registered DDR2 DIMMs) is external.

              L1 Data (64 KB)   L2 (512 KB)   L3 (2 MB)   Memory
Load speed    4 W/CP            2 W/CP        0.5 W/CP    2x @ 533 MHz
Store speed   2 W/CP            1 W/CP        0.5 W/CP
Latency       3 CP              ~15 CP        ~25 CP      ~300 CP

• Cache states: MOESI (Modified, Owned, Exclusive, Shared, Invalid)
– MOESI is beneficial when latency/bandwidth between CPUs is significantly better than to main memory
• W: FP word (64 bit); CP: clock period
• Cache line size (L1/L2) is 8 W
• 4 FLOPS/CP
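For intuition, the load rates above convert into bandwidth at the 2.3 GHz clock as follows (a sketch; W = 64-bit word and CP = clock period, both as defined on the slide):

```python
# Convert words-per-clock load rates into GB/s at the Barcelona core clock.
CLOCK_HZ = 2.3e9     # 2.3 GHz
WORD_BYTES = 8       # one FP word = 64 bits

def load_bw_gbs(words_per_cp):
    return words_per_cp * WORD_BYTES * CLOCK_HZ / 1e9

for level, rate in (("L1", 4), ("L2", 2), ("L3", 0.5)):
    print(f"{level}: {load_bw_gbs(rate):5.1f} GB/s")
```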
Ranger Disk Subsystem: Lustre
• Disk system (OSS) is based on the Sun x4500 "Thumper"
– each server has 48 SATA II 500 GB drives (24 TB total), running internal software RAID
– dual-socket, dual-core Opterons @ 2.6 GHz
– 72 servers total: 1.7 PB raw storage (that's 288 cores just to drive the file systems)
• Metadata servers (MDS) based on SunFire x4600s
• MDS is Fibre Channel-connected to 9 TB FlexLine storage
• Target performance: 40 GB/sec aggregate bandwidth
Ranger Parallel File Systems: Lustre
[Diagram: Lustre file systems behind the InfiniBand switch (12 blades/chassis); uses IP over IB]
• $HOME: 96 TB, 6 Thumpers, 36 OSTs; 6 GB user quota
• $WORK: 193 TB, 12 Thumpers, 72 OSTs
• $SCRATCH: 773 TB, 50 Thumpers, 300 OSTs

I/O with Lustre over Native InfiniBand
[Plots: $SCRATCH file-system throughput and single-application performance: write speed (GB/sec, up to ~60) vs. number of writing clients (1 to 10,000), for stripecount=1 and stripecount=4]
• Max total aggregate performance of 38 or 46 GB/sec, depending on stripecount (design target = 32 GB/sec)
• External users have reported performance of ~35 GB/sec with a 4K-core application run
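For intuition about stripecount, here is a toy round-robin layout function (an illustration of the idea only, not Lustre's actual allocator; the 300-OST default mirrors the $SCRATCH figure above):

```python
def stripe_layout(file_bytes, stripe_bytes, stripe_count, start_ost=0, n_osts=300):
    """Map each stripe-sized chunk of a file to an OST index, round-robin."""
    n_chunks = -(-file_bytes // stripe_bytes)   # ceiling division
    return [(start_ost + i % stripe_count) % n_osts for i in range(n_chunks)]

# A 16 MB file with 4 MB stripes and stripe_count=4 lands on four OSTs,
# so writers to one file can hit four servers instead of one.
print(stripe_layout(16 * 2**20, 4 * 2**20, 4))   # [0, 1, 2, 3]
```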
Longhorn: Intel Quad-Core System

Longhorn: Introduction
• First NSF eXtreme Digital Visualization grant (XD Vis)
• Designed for scientific visualization and data analysis
– very large memory per computational core
– two NVIDIA graphics cards per node
– rendering performance of 154 billion triangles/second

Longhorn Cluster Overview
• Peak performance (CPUs): 512 Intel Xeon E5400, 20.7 TFLOPS
• Peak performance (GPUs, single precision): 128 NVIDIA Quadroplex 2200 S4, 500 TFLOPS
• System memory: DDR3 DIMMs; 48 GB/node (240 nodes), 144 GB/node (16 nodes), 13.5 TB total
• Graphics memory: 512 FX5800 x 4 GB = 2 TB
• Disk: Lustre parallel file system, 210 TB
• Interconnect: QDR InfiniBand, 4 GB/sec point-to-point

Longhorn "fat" nodes
• Dell R710
– dual-socket Intel quad-core Xeon E5400 @ 2.53 GHz
– 144 GB DDR3 (18 GB/core)
– Intel 5520 chipset
• NVIDIA Quadroplex 2200 S4
– 4 NVIDIA Quadro FX5800
• 240 CUDA cores
• 4 GB memory
• 102 GB/s memory bandwidth

Longhorn standard nodes
• Dell R610
– dual-socket Intel quad-core Xeon E5400 @ 2.53 GHz
– 48 GB DDR3 (6 GB/core)
– Intel 5520 chipset
• NVIDIA Quadroplex 2200 S4
– 4 NVIDIA Quadro FX5800
• 240 CUDA cores
• 4 GB memory
• 102 GB/s memory bandwidth
Motherboard (R610/R710)
[Diagram: two quad-core Intel Xeon 5500-series sockets, each with local banks of DDR3 DIMMs; 12 DIMM slots (R610) or 18 DIMM slots (R710); Intel 5520 chipset]
• Intel 5520: 42 lanes PCI Express, or 36 lanes PCI Express 2.0

Cache Sizes in Intel Nehalem
[Diagram: Nehalem quad-core with 32 KB L1 and 256 KB L2 per core, an 8 MB shared L3, an integrated memory controller, and two QPI links]
• Total QPI bandwidth up to 25.6 GB/s (@ 3.2 GHz)
• 2 20-lane QPI links
• 4 quadrants (10 lanes each)

[Slide: Nehalem μArchitecture diagram]
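The 25.6 GB/s figure above follows from the QPI signalling rate (a sketch; 2 data bytes per direction per transfer at 6.4 GT/s is the standard QPI width):

```python
# QPI link bandwidth: 6.4 GT/s x 16 data bits per direction, full duplex.
transfers_per_s = 6.4e9   # 6.4 GT/s (3.2 GHz clock, double-pumped)
bytes_per_xfer = 2        # 16 data bits per direction
directions = 2            # full duplex

link_bw_gbs = transfers_per_s * bytes_per_xfer * directions / 1e9
print(link_bw_gbs)        # 25.6 GB/s per link
```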
Storage
• Ranch: long-term tape storage
• Corral: 1 PB of spinning disk

Ranch Archival System
• Sun StorageTek silo
• 10,000 tapes
• 10 PB capacity
• Used for long-term storage

Corral
• 1.2 PB DataDirect Networks online disk storage
• 8 Dell 1950 and 8 Dell 2950 servers
• High-performance parallel file system
– multiple databases
– iRODS data management
– replication to tape archive
• Multiple levels of access control
• Web and other data access available globally
References

Lonestar Related References
User Guide
services.tacc.utexas.edu/index.php/lonestar-user-guide
Developers
www.tomshardware.com
www.intel.com/p/en_US/products/server/processor/xeon5000/technical-documents

Ranger Related References
User Guide
services.tacc.utexas.edu/index.php/ranger-user-guide
Forums
forums.amd.com/devforum
www.pgroup.com/userforum/index.php
Developers
developer.amd.com/home.jsp
developer.amd.com/rec_reading.jsp
Longhorn Related References
User Guide
services.tacc.utexas.edu/index.php/longhorn-user-guide
General Information
www.intel.com/technology/architecture-silicon/next-gen/
www.nvidia.com/object/cuda_what_is.html
Developers
developer.intel.com/design/pentium4/manuals/index2.htm
developer.nvidia.com/forums/index.php