A Personal Supercomputer for Climate Research
Computation Structures Group Memo 425
August 24, 1999

James C. Hoe
Lab for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139

Chris Hill, Alistair Adcroft
Dept. of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139

To appear in Proceedings of SC'99 (SuperComputing'99), Portland, Oregon, 1999.

This work was funded in part by grants from the National Science Foundation and NOAA and with funds from the MIT Climate Modeling Initiative[23]. The hardware development was carried out at the Laboratory for Computer Science, MIT and was funded in part by the Advanced Research Projects Agency of the Department of Defense under the Office of Naval Research contract N00014-92-J-1310 and Ft. Huachuca contract DABT63-95-C-0150. James C. Hoe is supported by an Intel Foundation Graduate Fellowship. The computational resources for this work were made available through generous donations from Intel Corporation. We acknowledge Arvind and John Marshall for their continuing support. We thank Andy Boughton and Larry Rudolph for their invaluable technical input.

Abstract

We describe and analyze the performance of a cluster of personal computers dedicated to coupled climate simulations. This climate modeling system performs comparably to state-of-the-art supercomputers and yet is affordable by individual research groups, thus enabling more spontaneous application of high-end numerical models to climate science. The cluster's novelty centers around the Arctic Switch Fabric and the StarT-X network interface, a system-area interconnect substrate developed at MIT. A significant fraction of the interconnect's hardware performance is made available to our climate model through an application-specific communication library. In addition to reporting the overall application performance of our cluster, we develop an analytical performance model of our application. Based on this model, we define a metric, Potential Floating-Point Performance, which we use to quantify the role of high-speed interconnects in determining application performance. Our results show that a high-performance interconnect, in conjunction with a light-weight application-specific library, provides efficient support for our fine-grain parallel application on an otherwise general-purpose commodity system.

1 Introduction

Cluster computers constructed from low-cost platforms with commodity processors are emerging as a powerful tool for computational science[27]. These clusters are typically interconnected by standard local area networks, such as switched Fast Ethernet. Fast Ethernet is an attractive option because of its low cost and widespread availability.
However, communication over Fast Ethernet, and even Gigabit Ethernet, incurs relatively high overhead and latencies[25]. This creates a communication bottleneck that limits the utility of these clusters and presents a significant challenge in taking advantage of this exciting hardware trend in climate research, where the dominant computational tools are not, in general, embarrassingly parallel.

General circulation models (GCMs) are widely used in climate research. These models numerically step forward the primitive equations[17] that govern the planetary-scale time evolution of the atmosphere (in an AGCM) and ocean (in an OGCM). Though the models possess abundant data-parallelism, they can only tolerate long-latency and high-overhead communication in relatively coarse-grain parallel configurations that have a low communication-to-computation ratio. In many cases, only numerical experiments that take months to complete have a sufficiently large problem size to achieve an acceptable ratio. This limitation precludes capitalizing on one of the most attractive qualities of affordable cluster architectures: their mix of responsiveness with high performance that gives rise to a personal supercomputer suited to exploratory, spontaneous numerical experimentation. A specialized high-performance interconnect can substantially ameliorate communication bottlenecks and allows a fine-grain application to run efficiently on a cluster[8, 2, 9].

In this paper, we present Hyades, an affordable cluster of commodity personal computers with a high-performance interconnect. Hyades is dedicated to parallel coupled climate simulations. By developing a small set of communication primitives tailored specifically to our application, we efficiently support very fine-grain parallel execution on general-purpose hardware. This provides a low-cost cluster system that has significant potential as a climate research tool. The architecture of the cluster is described in Section 2. The climate model we employ is described in Section 3. Section 4 discusses the mapping from the numerical model onto the cluster, and Section 5 analyzes the performance. We conclude with a discussion of the lessons learned.

[Figure 1: (a) PIO Mode Abstraction, showing per-node high- and low-priority Tx and Rx queues attached to the high- and low-priority networks; (b) StarT-X Message Format, with header words carrying reserved/uproute and downroute/user tag/size fields, followed by 32-bit payload words.]

  size (byte) | O_s (usec) | O_r (usec) | T_roundtrip/2 (usec) | L_network (usec)
  8           | 0.4        | 2.0        | 3.7                  | 1.3
  64          | 1.7        | 8.6        | 11.7                 | 1.4

Figure 2: LogP performance characteristics of PIO message passing for 8-byte payload messages and 64-byte payload messages.
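Although the paper's analytical performance model is developed in a later section, the columns of Figure 2 already fit a simple LogP-style accounting in which half of the measured round-trip time is the sum of the send overhead, the network latency, and the receive overhead. This reading of the table is our own inference:

\[ \frac{T_{roundtrip}}{2} = O_s + L_{network} + O_r \]

For the 8-byte payload, 0.4 + 1.3 + 2.0 = 3.7 usec, and for the 64-byte payload, 1.7 + 1.4 + 8.6 = 11.7 usec, matching the half round-trip column exactly.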
2 Cluster Architecture

The Hyades cluster comprises sixteen two-way Symmetric Multiprocessors (SMPs). Each SMP node is connected to the Arctic Switch Fabric through a StarT-X PCI network interface unit (NIU), yielding interprocessor communication that is an order of magnitude faster than a standard local area network. The total cost of the hardware is less than $100,000, about evenly divided between the processing nodes and the interconnect.

2.1 Processing Nodes

Each SMP contains two 400-MHz Intel PII processors with 512 MBytes of 100-MHz SDRAM. These SMPs, based on Intel's 82801AB chipsets, have significantly better memory and I/O performance in comparison to previous generations of PC-class machines.

The SMPs are capable of sustaining over 120 MByte/sec of direct memory access (DMA) by a PCI device. The latency of an 8-byte read of an uncached memory-mapped (mmap) PCI device register is 0.93 usec, while the minimum latency between back-to-back 8-byte writes is 0.18 usec. These I/O characteristics directly govern the performance of interprocessor communication and hence have a significant impact on the performance of our parallel application.

2.2 The Arctic Switch Fabric

Inter-node communication is supported through the Arctic Switch Fabric[5, 6], a system area network designed for *T massively-parallel processors[24, 3]. This packet-switched multi-stage network is organized in a fat-tree topology. The latency through a router stage is less than 0.15 usec. Each link in the fat-tree supports 150 MByte/sec in each direction. In an N-endpoint full fat-tree configuration, the network's bisection bandwidth is N/2 x 150 MByte/sec.

In addition to high performance, Arctic provides features to simplify software communication layers. First, Arctic maintains a FIFO ordering of messages sent between two nodes along the same path in the fat-tree topology. Second, Arctic recognizes two message priorities and guarantees that a high-priority message cannot be blocked by low-priority messages. Lastly, Arctic's link technology is designed such that the software layer can assume error-free operation. The correctness of network messages is verified at every router stage and at the network endpoints using CRC. The software layer only has to check a 1-bit status to detect the unlikely event of a corrupted message due to a catastrophic network failure.

[Figure 3: Cacheable Virtual Interface (VI) Abstraction, showing processor loads and stores to cached virtual queues in host DRAM, which the StarT-X Tx and Rx DMA engines connect to the physical PCI queues.]

2.3 The StarT-X NIU

The StarT-X network interface unit (NIU) provides three simple but powerful message passing mechanisms to user-level applications. Its operation is designed to minimize the need for elaborate software layers between the network and the application. The StarT-X message passing mechanisms are implemented completely in hardware, rather than using an embedded processor, so the peak performance can be attained easily and predictably under a variety of workloads. The StarT-X NIU has been used in clusters consisting of SUN E5000 SMPs as well as previous generations of Intel PCs. In each case, StarT-X's performance has only been limited by the peak performance of the particular host's 32-bit 33-MHz PCI environment.
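To make the PIO-mode picture of Figure 1 concrete, the sketch below shows one way a user-level send onto the high-priority transmit queue could be structured in C. It is an illustration under assumptions, not the StarT-X programming interface: the bit packing, the pio_write/pio_send helpers, and the in-memory stand-in for the memory-mapped transmit queue are invented for this sketch; only the field names and widths from Figure 1(b) (reserved, uproute, downroute, user tag, size, 32-bit payload words) and the per-priority Tx/Rx queues of Figure 1(a) come from the paper.

/* Sketch only: composing a message in the Figure 1(b) format and pushing
 * it word-by-word onto the high-priority transmit queue, as a user-level
 * PIO send might do.  Packing and helper names are assumptions. */
#include <stdint.h>
#include <stdio.h>

/* On the real system this would be a volatile pointer into the NIU's
 * memory-mapped PCI region; a plain array keeps the sketch runnable. */
#define HI_TX_DEPTH 64
static uint32_t hi_tx_queue[HI_TX_DEPTH];
static unsigned hi_tx_head;

static void pio_write(uint32_t word)
{
    /* On real hardware this is an uncached store across the PCI bus
     * (about 0.18 usec back-to-back, per Section 2.1). */
    hi_tx_queue[hi_tx_head % HI_TX_DEPTH] = word;
    hi_tx_head++;
}

/* Send one message: two header words followed by 32-bit payload words.
 * Assumed packing: word 0 = reserved(16) | uproute(14) | 2 spare bits,
 *                  word 1 = downroute(16) | user tag(11) | size(5).    */
static void pio_send(uint32_t uproute, uint32_t downroute, uint32_t tag,
                     const uint32_t *payload, uint32_t nwords)
{
    pio_write((uproute & 0x3FFFu) << 2);
    pio_write(((downroute & 0xFFFFu) << 16) |
              ((tag & 0x7FFu) << 5) |
              (nwords & 0x1Fu));
    for (uint32_t i = 0; i < nwords; i++)
        pio_write(payload[i]);
}

int main(void)
{
    uint32_t data[2] = { 0xDEADBEEFu, 0x12345678u };   /* an 8-byte payload */
    pio_send(5u, 3u, 7u, data, 2u);
    printf("queued %u words on the Hi Tx queue\n", hi_tx_head);
    return 0;
}

A receive-side analogue would presumably read the header and payload words back out of the Rx queue with uncached loads, which is consistent with the receive overheads O_r in Figure 2 growing much faster with payload size than the send overheads: uncached PCI reads cost 0.93 usec versus 0.18 usec for back-to-back writes (Section 2.1).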