CSAIL Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
A Personal Supercomputer for Climate Research
James C. Hoe, Chris Hill
In Proceedings of SuperComputing'99, Portland, Oregon
August 1999
Computation Structures Group Memo 425
The Stata Center, 32 Vassar Street, Cambridge, Massachusetts 02139
A Personal Supercomputer for Climate Research
Computation Structures Group Memo 425
August 24, 1999
James C. Hoe
Lab for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]
Chris Hill, Alistair Adcroft
Dept. of Earth, Atmospheric and Planetary Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139
fcnh,[email protected]
To appear in Proceedings of SC'99
This work was funded in part by grants from the National Science Foundation and NOAA and with funds
from the MIT Climate Modeling Initiative[23]. The hardware development was carried out at the Laboratory
for Computer Science, MIT and was funded in part by the Advanced Research Projects Agency of the
Department of Defense under the Office of Naval Research contract N00014-92-J-1310 and Ft. Huachuca
contract DABT63-95-C-0150. James C. Hoe is supported by an Intel Foundation Graduate Fellowship. The
computational resources for this work were made available through generous donations from Intel Corporation.
We acknowledge Arvind and John Marshall for their continuing support. We thank Andy Boughton and
Larry Rudolph for their invaluable technical input.
Abstract
We describe and analyze the performance of a cluster of personal computers dedicated to coupled
climate simulations. This climate modeling system performs comparably to state-of-the-art supercomputers
and yet is affordable by individual research groups, thus enabling more spontaneous application
of high-end numerical models to climate science. The cluster's novelty centers around the Arctic Switch
Fabric and the StarT-X network interface, a system-area interconnect substrate developed at MIT. A
significant fraction of the interconnect's hardware performance is made available to our climate model
through an application-specific communication library. In addition to reporting the overall application
performance of our cluster, we develop an analytical performance model of our application. Based on
this model, we define a metric, Potential Floating-Point Performance, which we use to quantify the
role of high-speed interconnects in determining application performance. Our results show that a high-performance
interconnect, in conjunction with a lightweight application-specific library, provides efficient
support for our fine-grain parallel application on an otherwise general-purpose commodity system.
1 Introduction
Cluster computers constructed from low-cost platforms with commodity processors are emerging as a powerful
tool for computational science[27]. These clusters are typically interconnected by standard local area
networks, such as switched Fast Ethernet. Fast Ethernet is an attractive option because of its low cost and
widespread availability. However, communication over Fast Ethernet, and even Gigabit Ethernet, incurs
relatively high overhead and latencies[25]. This creates a communication bottleneck that limits the utility
of these clusters and presents a significant challenge in taking advantage of this exciting hardware trend in
climate research, where the dominant computational tools are not, in general, embarrassingly parallel.
General circulation models (GCMs) are widely used in climate research. These models numerically step
forward the primitive equations[17] that govern the planetary-scale time evolution of the atmosphere (in
an AGCM) and ocean (in an OGCM). Though the models possess abundant data-parallelism, they can
only tolerate long-latency and high-overhead communication in relatively coarse-grain parallel configurations
that have a low communication-to-computation ratio. In many cases, only numerical experiments
that take months to complete have a sufficiently large problem size to achieve an acceptable ratio. This
limitation precludes capitalizing on one of the most attractive qualities of affordable cluster architectures:
their mix of responsiveness and high performance that gives rise to a personal supercomputer suited to
exploratory, spontaneous numerical experimentation. A specialized high-performance interconnect can substantially
ameliorate communication bottlenecks and allow a fine-grain application to run efficiently on a
cluster[8, 2, 9].
In this paper, we present Hyades, an affordable cluster of commodity personal computers with a high-performance
interconnect. Hyades is dedicated to parallel coupled climate simulations. By developing a
small set of communication primitives tailored specifically to our application, we efficiently support very
fine-grain parallel execution on general-purpose hardware.
Figure 1: (a) PIO Mode Abstraction and (b) StarT-X Message Format
size (byte)   O_s (µsec)   O_r (µsec)   T_roundtrip/2 (µsec)   L_network (µsec)
8             0.4          2.0          3.7                    1.3
64            1.7          8.6          11.7                   1.4

Figure 2: LogP performance characteristics of PIO message passing for 8-byte payload messages and 64-byte
payload messages.
This provides a low-cost cluster system that has significant potential as a climate research tool. The
architecture of the cluster is described in Section 2. The climate model we employ is described in Section 3.
Section 4 discusses the mapping from the numerical model onto the cluster, and Section 5 analyzes the
performance. We conclude with a discussion of the lessons learned.
2 Cluster Architecture
The Hyades cluster comprises sixteen two-way symmetric multiprocessor (SMP) nodes. Each SMP node
is connected to the Arctic Switch Fabric through a StarT-X PCI network interface unit (NIU), yielding
interprocessor communication that is an order of magnitude faster than a standard local area network. The
total cost of the hardware is less than $100,000, about evenly divided between the processing nodes and the
interconnect.
2.1 Processing Nodes
Each SMP contains two 400-MHz Intel PII processors with 512 MBytes of 100-MHz SDRAM. These SMPs,
based on Intel's 82801AB chipsets, have significantly better memory and I/O performance than previous
generations of PC-class machines. The SMPs are capable of sustaining over 120 MByte/sec of direct
memory access (DMA) by a PCI device. The latency of an 8-byte read of an uncached memory-mapped
(mmap) PCI device register is 0.93 µsec, while the minimum latency between back-to-back 8-byte writes is
0.18 µsec. These I/O characteristics directly govern the performance of interprocessor communication and,
hence, have a significant impact on the performance of our parallel application.
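As an illustration (this sketch is ours, not code from the paper), uncached register-access latencies of this
kind are typically measured by timing a long run of loads to the mmap'd register; the device node path,
mapping offset, and use of the processor's cycle counter below are assumptions rather than the actual
StarT-X driver interface.

    /* Sketch: measuring the latency of uncached reads of a memory-mapped
     * PCI device register.  The device node and offset are placeholders. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    static inline uint64_t cycles(void) {           /* x86 timestamp counter */
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
        int fd = open("/dev/startx0", O_RDWR);      /* placeholder device node */
        volatile uint64_t *reg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        enum { N = 100000 };
        uint64_t sum = 0, t0 = cycles();
        for (int i = 0; i < N; i++)
            sum += reg[0];                          /* uncached 8-byte load */
        uint64_t dt = cycles() - t0;
        /* At 400 MHz, dt / N / 400 is the average read latency in usec. */
        printf("avg read latency: %.2f usec (checksum %llu)\n",
               (double)dt / N / 400.0, (unsigned long long)sum);
        return 0;
    }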
2.2 The Arctic Switch Fabric
Inter-node communication is supported through the Arctic Switch Fabric[5, 6], a system area network designed
for the *T massively-parallel processors[24, 3]. This packet-switched multi-stage network is organized in
a fat-tree topology. The latency through a router stage is less than 0.15 µsec. Each link in the fat-tree supports
150 MByte/sec in each direction. In an N-endpoint full fat-tree configuration, the network's bisection
bandwidth is 2 × N × 150 MByte/sec.
In addition to high performance, Arctic provides features to simplify software communication layers.
First, Arctic maintains a FIFO ordering of messages sent between two nodes along the same path in the
fat-tree topology. Second, Arctic recognizes two message priorities and guarantees that a high-priority
message cannot be blocked by low-priority messages. Lastly, Arctic's link technology is designed such that
the software layer can assume error-free operation. The correctness of network messages is verified at
every router stage and at the network endpoints using a CRC. The software layer only has to check a 1-bit
status to detect the unlikely event of a corrupted message due to a catastrophic network failure.

Figure 3: Cacheable Virtual Interface (VI) Abstraction
2.3 The StarT-X NIU
The StarT-X network interface unit (NIU) provides three simple but powerful message passing mechanisms
to user-level applications. Its operation is designed to minimize the need for elaborate software layers between
the network and the application. The StarT-X message passing mechanisms are implemented completely in
hardware, rather than using an embedded processor, so the peak performance can be attained easily and
predictably under a variety of workloads. The StarT-X NIU has been used in clusters consisting of SUN
E5000 SMPs as well as previous generations of Intel PCs. In each case, StarT-X's performance has only
been limited by the peak performance of the particular host's 32-bit 33-MHz PCI environment.
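For reference (this back-of-the-envelope figure is ours, not a measurement from the paper), the theoretical
peak of a 32-bit, 33-MHz PCI bus follows directly from its width and clock:

    4 bytes/transfer × 33 × 10^6 transfers/sec ≈ 132 MByte/sec,

so the sustained DMA rate of over 120 MByte/sec quoted in Section 2.1 already approaches the bus limit.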
The different StarT-X communication mechanisms are optimized to support different granularities of
communication patterns. Elsewhere, we have described StarT-X's operation in detail[16]. Here, we briefly
present StarT-X's Programmed I/O Interface (PIO) and Cacheable Virtual Interface (VI), the two mechanisms
employed by the GCM code.
PIO Mode: To support fine-grain message-passing applications, the PIO interface presents a simple FIFO-based
network abstraction similar to the CM-5 data network interface[29]. Processes on different nodes
communicate by exchanging messages which contain two 32-bit header words followed by a variable-size
payload of between 2 and 22 32-bit words. The PIO mode abstraction and packet format are depicted
in Figure 1. Due to the relatively high cost of uncached mmap accesses, we can reliably estimate the
performance of PIO-mode communication by summing the cost of the mmap accesses (given in Section 2.1).
For example, when passing an 8-byte message, the sender and the receiver each perform two 8-byte (header
plus payload) mmap accesses to the NIU registers, and thus we can estimate the overhead of sending
and receiving a message to be 0.36 µsec and 1.86 µsec, respectively. The experimentally determined LogP
characteristics[10] of StarT-X's PIO mechanism, summarized in Figure 2, corroborate these estimates.
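To make the cost accounting above concrete, the following sketch (ours, not from the paper) shows what a
PIO-mode send and receive of an 8-byte-payload message look like as uncached accesses to memory-mapped
FIFO registers; the register names and layout are hypothetical stand-ins for the actual StarT-X register
map described in [16].

    /* Sketch: PIO-mode message passing as uncached accesses to mmap'd
     * NIU FIFO registers.  Field layout is hypothetical. */
    #include <stdint.h>

    typedef struct {
        volatile uint64_t tx_hi;   /* high-priority transmit FIFO window */
        volatile uint64_t rx_hi;   /* high-priority receive FIFO window  */
    } niu_regs_t;

    /* Send an 8-byte-payload message: two uncached 8-byte stores (the
     * 2-word header, then the 2-word payload), roughly 0.18 usec each. */
    static void pio_send8(niu_regs_t *niu, uint64_t header, uint64_t payload)
    {
        niu->tx_hi = header;       /* route, user tag, payload size */
        niu->tx_hi = payload;
    }

    /* Receive an 8-byte-payload message: two uncached 8-byte loads,
     * roughly 0.93 usec each, which dominate the receive overhead. */
    static void pio_recv8(niu_regs_t *niu, uint64_t *header, uint64_t *payload)
    {
        *header  = niu->rx_hi;
        *payload = niu->rx_hi;
    }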
VI Mode: In modern architectures, cached accesses to main memory are orders of magnitude faster than
PIO accesses. To take advantage of this performance disparity, the VI mode uses DMA to extend the
physical transmit and receive queues into the memory system. Figure 3 illustrates this abstraction. The user
interacts with StarT-X indirectly through memory and hence avoids costly PIO accesses. The VI mode
makes use of a pinned, contiguous physical memory region for DMA. The VI memory region is mapped into
a cacheable virtual memory region of the user process. To send a VI message, instead of enqueuing directly
to the hardware transmit queue, the user process writes the message to a cacheable VI buffer. The user then
invokes StarT-X's DMA engine to enqueue the message into the physical transmit queue. On the receiving
end, StarT-X delivers the message directly to a pre-specified buffer in the receiving node's VI memory region.
Because DMA invocation and status polling require mmap accesses to StarT-X registers, the VI mode is
most efficient when multiple outbound messages are queued consecutively in memory and are transmitted
with a single DMA invocation. The peak payload transfer bandwidth in VI mode is 110 MByte/sec.
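As a sketch of how that batching might look in code (ours, not the paper's library; the vi_* and startx_dma_*
names are hypothetical stand-ins for the driver interface described in [16]):

    /* Sketch: VI-mode batched transmit.  Messages are composed with
     * ordinary cached stores into the pinned VI region, then handed to
     * the NIU with a single DMA invocation, amortizing the costly
     * uncached accesses needed to start the DMA and poll its status. */
    #include <stdint.h>

    #define VI_MSG_WORDS 24                  /* 2 header + up to 22 payload words */

    typedef struct { uint32_t words[VI_MSG_WORDS]; } vi_msg_t;

    extern vi_msg_t *vi_tx;                  /* mapped, cacheable VI transmit buffer (assumed) */
    extern void startx_dma_kick(unsigned first, unsigned count);   /* hypothetical */
    extern int  startx_dma_done(void);                             /* hypothetical */

    void send_batch(const vi_msg_t *msgs, unsigned count)
    {
        for (unsigned i = 0; i < count; i++)
            vi_tx[i] = msgs[i];              /* cached stores into the VI buffer */
        startx_dma_kick(0, count);           /* one uncached access starts the DMA */
        while (!startx_dma_done())           /* poll for completion */
            ;
    }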
3 The MIT General Circulation Model
The climate model used on Hyades is the MIT general circulation model[21, 20, 15, 14]. The model is
a versatile research tool that can be applied to a wide variety of processes ranging from non-hydrostatic
rotating fluid dynamics[15, 22] to the large-scale general circulation of the atmosphere and ocean[14, 19].
The model is implemented to exploit the mathematical isomorphisms that exist between the equations of
motion for an incompressible fluid (the ocean) and those of a compressible fluid (the atmosphere), allowing
atmosphere and ocean simulations to be performed by the same basic model code[14].
3.1 Numerical Procedure
At the heart of the model is a numerical kernel that steps forward the equations of motion for a fluid in a
rotating frame of reference using a procedure that is a variant on the theme set out by Harlow and Welch[13].
The kernel can be written in semi-discrete form to second order in time, Δt, thus:
    (v^{n+1} - v^n) / Δt + ∇p^{n+1/2} = G_v^{n+1/2}