The Space Simulator

Michael S. Warren    Chris Fryer    M. Patrick Goda
[email protected]    [email protected]    [email protected]

Abstract

The Space Simulator is a 294-processor Beowulf cluster with a theoretical peak performance just below 1.5 Teraflop/s. It is based on the Shuttle XPC SS51G mini chassis. Each node consists of a 2.53 GHz processor, 1 Gbyte of 333 MHz DDR SDRAM, an 80 Gbyte Maxtor hard drive, and a 3C996B-T gigabit ethernet card. The network is made up of a Foundry FastIron 1500 and 800 Gigabit Ethernet switch. Each individual node cost less than $1000, and the entire system cost under $500,000. The cluster achieved Linpack performance of 665.1 Gflop/s on 288 processors in October 2002, making it the 85th fastest computer in the world according to the 20th TOP500 list. Performance has since improved to 757.1 Gflop/s. Here we describe the design, implementation and performance of the cluster, as well as some scientific application results obtained with the machine.

Buying a Supercomputer

What would you do if you had $500,000 to buy a computer, and it had to be delivered in 30 days? The considerations include: budget, space, power, cooling, compiler, system administration, purchasing department, delivery deadline, performance, reliability, code, MPI, latency tolerance, SMP, memory bandwidth, bureaucracy, shared memory, Linpack, operating system, network bandwidth, tools.

Instruction #1: Assembly of Beowulf requires great peace of mind.

Space Simulator Architecture

Qty.    Price     Ext.      Description
 294      280     82,320    Shuttle SS51G mini system (bare)
 294      254     74,676    P4/2.53GHz, 533MHz FSB, 512k cache
 588      118     69,384    512Mb DDR333 SDRAM (1024Mb per node)
 294       95     27,930    3com 3c996B-T Gigabit Ethernet PCI card
 294       83     24,402    Maxtor 4K080H4 80Gb 5400rpm Hard Disk
 294       35     10,290    Assembly Labor/Extended Warranty
                   4,000    Cat6 Ethernet cables
                   3,300    Wire shelving/switch rack
                   1,378    Power strips
   1  186,175    186,175    Foundry FastIron 1500+800, 304 Gbit ports

Total: $483,855    $1646 per node    5.06 Gflop/s peak/node

Table 1: Space Simulator architecture and price (September 2002). The total cost per node was $1646, with $728 (44%) of that figure representing the network interface cards and Ethernet switches.
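The peak figures quoted above follow directly from the clock rate, assuming the Pentium 4 retires two double-precision floating-point results per cycle via SSE2:

% Peak rates implied by the 2.53 GHz nodes, assuming 2 flops/cycle (SSE2).
\[
  2.53\ \mathrm{GHz} \times 2\ \frac{\mathrm{flops}}{\mathrm{cycle}} = 5.06\ \mathrm{Gflop/s\ per\ node},
  \qquad
  294 \times 5.06\ \mathrm{Gflop/s} \approx 1.49\ \mathrm{Tflop/s}.
\]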

Space and Power Consumption

Processor                  Athlon XP   P4     P3     P4     TM5600  TM5800
Clock (MHz)                1800        2200   1133   2530   667     933
Performance (Mflop/s)      952         656    595    1357   298     373
Power (watts/proc)         160         130    70     125    22      24
Power eff. (Mflop/s/watt)  5.95        5.05   8.5    10.8   13.5    15.5

Table 2: Power consumption for various cluster configurations running the parallel N-body code. Note that the power figures include the power consumption of the NIC and network switch. The P4 figures listed in column three represent the higher power consumed by the previous generation of processors.

Cluster Temperature

Figure 1: A smoothed representation of the temperature across the cluster obtained from the lm_sensors package on each node.

OS Install

We used PXE and RedHat kickstart.

• Make sure PXE is enabled on your NICs before you get them.
• Turn off spanning tree protocol on your ethernet switches.

Reliability

During the installation of the cluster and the initial large Linpack benchmark runs we identified the following defective hardware:

3 power supplies
6 disk drives
4 motherboards
6 sticks of DRAM
1 ethernet card

During the six-month period since the initial failures, the following hardware has failed:

2 power supplies
5 disk drives
1 motherboard
3 sticks of DRAM
1 loose fan connector

It is possible that the faulty DRAM and motherboard/CPU discovered after the initial installation were defective to begin with, but took longer to identify since the errors occurred very infrequently.

Software Environment and Tools

Simpler is better. What do you really need?

• A stock distribution. Almost any one will work fine.
• A message passing library. LAM or MPICH are good.

Everything else is an optimization.

"Premature optimization is the root of all evil." -Hoare

Network Performance

Figure 2: We show bandwidth (Mbps) vs. message size (bytes) for a variety of message passing libraries (TCP, LAM-O, SWAMPI, LAM, mpich2-0.92, mpich-1.2.5) as measured via NetPIPE.
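A minimal MPI ping-pong measures essentially the same quantity as one point on these curves; the sketch below uses an arbitrary 1 MByte message size and repetition count, and reports megabits per second to match the axis of Figure 2.

/* pingpong.c -- minimal MPI ping-pong bandwidth sketch (illustrative only;
 * not NetPIPE).  Message size and repetition count are arbitrary choices.
 * Run with exactly two ranks, e.g.:  mpirun -np 2 ./pingpong              */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;   /* 1 MByte per message */
    const int reps   = 100;       /* round trips to average over */
    char *buf = malloc(nbytes);
    int rank;

    memset(buf, 0, nbytes);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* Two messages cross the wire per round trip; report Mbps. */
        double mbps = 2.0 * reps * (double)nbytes * 8.0 / (t1 - t0) / 1.0e6;
        printf("%d-byte messages: %.1f Mbps\n", nbytes, mbps);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}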

Memory Bandwidth

The node memory bandwidth is less than optimal for 333 MHz DDR memory because the on-board video system uses system DRAM for its frame buffer. It is possible to disable the on-board VGA controller and increase memory copy bandwidth by 14%, but you must then insert an AGP video card into the system in order for it to boot.

Stream results for the nodes in their normal configuration, with a 4 MByte shared VGA frame buffer, are:

Function  Rate (MB/s)  RMS time  Min time  Max time
Copy:     1144.5249    0.0700    0.0699    0.0702
Scale:    1145.2130    0.0699    0.0699    0.0700
Add:      1262.9053    0.0951    0.0950    0.0951
Triad:    1265.4891    0.0949    0.0948    0.0949

Gravity Kernel Performance

Processor                  libm    Karp
533-MHz Alpha EV56         76.2    242.2
667-MHz TM5600             128.7   297.5
933-MHz Transmeta TM5800   189.5   373.2
375-MHz IBM Power3         298.5   514.4
1133-MHz Intel P3          292.2   594.9
1200-MHz AMD Athlon MP     350.7   614.0
2200-MHz Intel P4          668.0   655.5
2530-MHz Intel P4          779.3   792.6
1800-MHz AMD Athlon XP     609.9   951.9
1250-MHz Alpha 21264C      935.2   1141.0
2530-MHz Intel P4 (icc)    1170.0  1357.0

Table 3: Mflop/s obtained on our gravitational micro-kernel benchmark.
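For context, a sketch of the kind of inner loop such a micro-kernel times is shown below (the libm variant, in plain C with G = 1 and a placeholder particle layout). The Karp column uses Karp's faster method for the reciprocal square root in place of the library sqrt and divide.

/* grav_kernel.c -- illustrative sketch of a gravitational micro-kernel
 * inner loop (libm variant).  The particle data layout and softening
 * value are placeholders, not the benchmark code itself.               */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z, m; } body_t;

/* Accumulate the acceleration on body i from bodies j = 0..n-1 (G = 1). */
void accel(const body_t *p, size_t n, size_t i,
           double eps2,              /* softening length squared */
           double *ax, double *ay, double *az)
{
    double sx = 0.0, sy = 0.0, sz = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j == i) continue;
        double dx = p[j].x - p[i].x;
        double dy = p[j].y - p[i].y;
        double dz = p[j].z - p[i].z;
        double r2 = dx*dx + dy*dy + dz*dz + eps2;
        /* libm variant: one sqrt and one divide per interaction.
         * The Karp variant replaces this step with a cheaper
         * approximation of the reciprocal square root.           */
        double rinv  = 1.0 / sqrt(r2);
        double rinv3 = rinv * rinv * rinv;
        sx += p[j].m * rinv3 * dx;
        sy += p[j].m * rinv3 * dy;
        sz += p[j].m * rinv3 * dz;
    }
    *ax = sx; *ay = sy; *az = sz;
}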

Linpack Benchmark

Below are the results of running the HPL implementation of Linpack on the Space Simulator eight times in a row (which takes about 12 hours). Each run passed the residual check. Some commercial supercomputers have trouble running for 12 hours without an error.

T/V        N       NB   P    Q    Time      Gflops
W25R2C4    180000  160  16   18   5153.50   7.544e+02
W25R2C4    180000  160  16   18   5150.74   7.549e+02
W25R2C4    180000  160  16   18   5146.35   7.555e+02
W25R2C4    180000  160  16   18   5135.46   7.571e+02
W25R2C4    180000  160  16   18   5160.75   7.534e+02
W25R2C4    180000  160  16   18   5191.60   7.489e+02
W25R2C4    180000  160  16   18   5177.24   7.510e+02
W25R2C4    180000  160  16   18   5159.14   7.536e+02
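Measured against the 5.06 Gflop/s per-processor peak, the best of these runs (757.1 Gflop/s on 288 processors) sustains roughly 52% of the theoretical maximum:

% Fraction of theoretical peak achieved by the best HPL run.
\[
  \frac{757.1\ \mathrm{Gflop/s}}{288 \times 5.06\ \mathrm{Gflop/s}}
  = \frac{757.1}{1457.3} \approx 0.52
\]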

NAS Parallel Benchmarks, Class D

Figure 3: Scaling of the NAS 2.4 Class D benchmarks on the Space Simulator. Perfect scaling would be a straight horizontal line for the plot on the right.

NAS Parallel Benchmarks, Class C

Figure 4: Scaling of the NAS 2.4 Class C benchmarks on the Space Simulator. The computational problems are smaller than the Class D results shown in the previous plot, so the scaling is not as good for large numbers of processors.

The Hashed Oct-tree Library

Figure 5: On the left is the self-similar curve used for load-balancing, while the right figure shows a tree data structure in 2D for a group of centrally condensed particles.
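The idea behind the load-balancing curve is to map each particle's coordinates to a single key along a self-similar curve, so that sorting and splitting the key space keeps nearby particles on the same processor. A sketch of one such mapping (bit-interleaved Morton order, with an arbitrary 10 bits per dimension; the library's actual key construction may differ) is:

/* morton_key.c -- illustrative sketch of building a space-filling-curve key
 * by bit interleaving (Morton order).  Bits-per-dimension and the unit-cube
 * normalization are placeholder choices.                                    */
#include <stdint.h>

#define KEY_BITS 10   /* bits per dimension -> 30-bit key */

/* Map a coordinate in [0,1) to an integer grid index with KEY_BITS bits. */
static uint32_t quantize(double x)
{
    return (uint32_t)(x * (double)(1u << KEY_BITS));
}

/* Interleave the bits of the three grid indices: bit k of ix, iy, iz
 * become bits 3k, 3k+1, 3k+2 of the key.                               */
uint64_t morton_key(double x, double y, double z)
{
    uint32_t ix = quantize(x), iy = quantize(y), iz = quantize(z);
    uint64_t key = 0;
    for (int k = 0; k < KEY_BITS; k++) {
        key |= (uint64_t)((ix >> k) & 1u) << (3*k);
        key |= (uint64_t)((iy >> k) & 1u) << (3*k + 1);
        key |= (uint64_t)((iz >> k) & 1u) << (3*k + 2);
    }
    return key;
}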

Core-Collapse Supernovae

This image shows the radial velocities of material in a 0.5° slice across the core of a rotating supernova just before the core bounces. Three-dimensional simulations are required to study the true effects of rotation.

Moore's Law Applied to Beowulf Clusters

It is interesting to note that there have been exactly six years between the completion of the Loki and Space Simulator clusters, which results in four "Moore's Law" doublings. Comparing the Loki architecture and price to the Space Simulator, we can see that Moore's Law scaling has actually been greatly exceeded in some aspects of the architecture. For instance, in 1996, Loki's disks cost $111 per Gigabyte. For the SS, they are close to $1 a Gigabyte, which is a factor of seven beyond the factor of 16 from Moore's Law over six years. For memory, in the Loki days it was $7.35 per Megabyte, and is now 23 cents per Megabyte, a factor of two lower than Moore's Law would have predicted.

These factors of improvement over Moore's Law are also realized in the NPB performance results. For a given cost, the NPB performance exceeds Moore's Law scaling by 25% for BT, and by close to a factor of two for LU and MG.

For the N-body code, the overall price/performance improvement that clusters have obtained over the past six years has not differed much from Moore's Law extrapolations. Loki obtained performance of 1.28 Gflop/s for the N-body code, while the whole SS obtains 180 Gflop/s, an improvement of a factor of 140. The price ratio between the machines is 9.4, which, when multiplied by 16 for four 18-month Moore's Law doublings, results in a ratio of 150.

Historical Performance of the Treecode

Year  Site     Machine            Procs  Gflop/s  Mflops/proc
2003  LANL     ASCI QB            3600   2793     775.8
2003  LANL     Space Simulator    288    179.7    623.9
2002  NERSC    IBM SP-3(375/W)    256    57.70    225.0
2002  LANL     Green Destiny      212    38.9     183.5
2000  LANL     SGI Origin 2000    64     13.10    205.0
1998  LANL     Avalon             128    16.16    126.0
1996  LANL     Loki               16     1.28     80.0
1996  SC '96   Loki+Hyglac        32     2.19     68.4
1996  Sandia   ASCI Red           6800   464.9    68.4
1995  JPL      Cray T3D           256    7.94     31.0
1995  LANL     TMC CM-5           512    14.06    27.5
1993  Caltech  Intel Delta        512    10.02    19.6

Conclusion

Overall, the Space Simulator provides a reliable computing resource with unbeatable price/performance for our applications. It is fortunate that the niche of small, quiet computers that Shuttle targeted with the XPC series was quite well suited to the architecture of the latest generation of Beowulf clusters.