Computer Science and Artificial Intelligence Laboratory (CSAIL)

Massachusetts Institute of Technology

A Personal Supercomputer for Climate Research

James C. Hoe, Chris Hill

In Proceedings of SuperComputing '99, Portland, Oregon

August 1999

Computation Structures Group Memo 425

The Stata Center, 32 Vassar Street, Cambridge, Massachusetts 02139

A Personal Supercomputer for Climate Research

Computation Structures Group Memo 425

August 24, 1999

James C. Hoe

Lab for Computer Science

Massachusetts Institute of Technology

Cambridge, MA 02139

[email protected]

Chris Hill, Alistair Adcroft

Dept. of Earth, Atmospheric and Planetary Sciences

Massachusetts Institute of Technology

Cambridge, MA 02139

[email protected]

To appear in Proceedings of SC'99

This work was funded in part by grants from the National Science Foundation and NOAA and with funds from the MIT Climate Modeling Initiative[23]. The hardware development was carried out at the Laboratory for Computer Science, MIT and was funded in part by the Advanced Research Projects Agency of the Department of Defense under the Office of Naval Research contract N00014-92-J-1310 and Ft. Huachuca contract DABT63-95-C-0150. James C. Hoe is supported by an Intel Foundation Graduate Fellowship. The computational resources for this work were made available through generous donations from Intel Corporation. We acknowledge Arvind and John Marshall for their continuing support. We thank Andy Boughton and Larry Rudolph for their invaluable technical input.

A Personal Supercomputer for Climate Research

James C. Hoe (Lab for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected])

Chris Hill, Alistair Adcroft (Dept. of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, [email protected])

August 24, 1999

Abstract

We describe and analyze the performance of a cluster of personal computers dedicated to coupled climate simulations. This climate modeling system performs comparably to state-of-the-art supercomputers and yet is affordable by individual research groups, thus enabling more spontaneous application of high-end numerical models to climate science. The cluster's novelty centers around the Arctic Switch Fabric and the StarT-X network interface, a system-area interconnect substrate developed at MIT. A significant fraction of the interconnect's hardware performance is made available to our climate model through an application-specific communication library. In addition to reporting the overall application performance of our cluster, we develop an analytical performance model of our application. Based on this model, we define a metric, Potential Floating-Point Performance, which we use to quantify the role of high-speed interconnects in determining application performance. Our results show that a high-performance interconnect, in conjunction with a light-weight application-specific library, provides efficient support for our fine-grain parallel application on an otherwise general-purpose commodity system.

1 Introduction

Cluster computers constructed from low-cost platforms with commodity processors are emerging as a powerful tool for computational science[27]. These clusters are typically interconnected by standard local area networks, such as switched Fast Ethernet. Fast Ethernet is an attractive option because of its low cost and widespread availability. However, communication over Fast Ethernet, and even Gigabit Ethernet, incurs relatively high overhead and latencies[25]. This creates a communication bottleneck that limits the utility of these clusters and presents a significant challenge in taking advantage of this exciting hardware trend in climate research, where the dominant computation tools are not, in general, embarrassingly parallel.

General circulation models (GCMs) are widely used in climate research. These models numerically step forward the primitive equations[17] that govern the planetary-scale time evolution of the atmosphere (in an AGCM) and ocean (in an OGCM). Though the models possess abundant data-parallelism, they can only tolerate long latency and high-overhead communication in relatively coarse grain parallel configurations that have a low communication-to-computation ratio. In many cases, only numerical experiments that take months to complete have a sufficiently large problem size to achieve an acceptable ratio. This limitation precludes capitalizing on one of the most attractive qualities of affordable cluster architectures - their mix of responsiveness with high performance that gives rise to a personal supercomputer suited to exploratory, spontaneous numerical experimentation. A specialized high-performance interconnect can substantially ameliorate communication bottlenecks and allows a fine-grain application to run efficiently on a cluster[8, 2, 9].

In this paper, we present Hyades, an affordable cluster of commodity personal computers with a high-performance interconnect. Hyades is dedicated to parallel coupled climate simulations. By developing a small set of communication primitives tailored specifically to our application, we efficiently support very fine-grain parallel execution on general-purpose hardware. This provides a low-cost cluster system that has significant potential as a climate research tool.

Figure 1: (a) PIO Mode Abstraction and (b) StarT-X Message Format

size (byte) | O_s (µsec) | O_r (µsec) | T_roundtrip/2 (µsec) | L_network (µsec)
8           | 0.4        | 2.0        | 3.7                  | 1.3
64          | 1.7        | 8.6        | 11.7                 | 1.4

Figure 2: LogP performance characteristics of PIO message passing for 8-byte payload messages and 64-byte payload messages.
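As a quick consistency check (our reading of the table, not a relation stated in the text), one half of the measured round-trip time matches the sum of the send overhead, the receive overhead and the network latency:

\[
\tfrac{1}{2} T_{roundtrip} \approx O_s + O_r + L_{network} = 0.4 + 2.0 + 1.3 = 3.7\ \mu s \quad (8\text{-byte payload}).
\]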

The architecture of the cluster is described in Section 2. The climate model we employ is described in Section 3. Section 4 discusses the mapping from the numerical model onto the cluster, and Section 5 analyzes the performance. We conclude with a discussion of the lessons learned.

2 Cluster Architecture

The Hyades cluster is comprised of sixteen two-way symmetric multiprocessors (SMPs). Each SMP node is connected to the Arctic Switch Fabric through a StarT-X PCI network interface unit (NIU), yielding interprocessor communication that is an order of magnitude faster than a standard local area network. The total cost of the hardware is less than $100,000, about evenly divided between the processing nodes and the interconnect.

2.1 Processing Nodes

Each SMP contains two 400-MHz Intel PII processors with 512 MBytes of 100-MHz SDRAM. These SMPs, based on Intel's 82801AB chipsets, have significantly better memory and I/O performance in comparison to previous generations of PC-class machines. The SMPs are capable of sustaining over 120 MByte/sec of direct memory accesses (DMA) by a PCI device. The latency of an 8-byte read of an uncached memory-mapped (mmap) PCI device register is 0.93 µsec while the minimum latency between back-to-back 8-byte writes is 0.18 µsec. These I/O characteristics directly govern the performance of interprocessor communication, and hence, have a significant impact on the performance of our parallel application.

2.2 The Arctic Switch Fabric

Inter-node communication is supported through the Arctic Switch Fabric[5, 6], a system area network designed for *T massively-parallel processors[24, 3]. This packet-switched multi-stage network is organized in a fat-tree topology. The latency through a router stage is less than 0.15 µsec. Each link in the fat-tree supports 150 MByte/sec in each direction. In an N-endpoint full fat-tree configuration, the network's bisection bandwidth is 2 × N × 150 MByte/sec.

In addition to high performance, Arctic provides features to simplify software communication layers. First, Arctic maintains a FIFO ordering of messages sent between two nodes along the same path in the fat-tree topology. Second, Arctic recognizes two message priorities and guarantees that a high-priority message cannot be blocked by low-priority messages. Lastly, Arctic's link technology is designed such that the software layer can assume error-free operations. The correctness of the network messages is verified at every router stage and at the network endpoints using CRC. The software layer only has to check a 1-bit status to detect the unlikely event of a corrupted message due to a catastrophic network failure.

Figure 3: Cacheable Virtual Interface (VI) Abstraction

2.3 The StarT-X NIU

The StarT-X network interface unit (NIU) provides three simple but powerful message passing mechanisms to user-level applications. Its operation is designed to minimize the need for elaborate software layers between the network and the application. The StarT-X message passing mechanisms are implemented completely in hardware, rather than using an embedded processor, so the peak performance can be attained easily and predictably under a variety of workloads. The StarT-X NIU has been used in clusters consisting of Sun E5000 SMPs as well as previous generations of Intel PCs. In each case, StarT-X's performance has only been limited by the peak performance of the particular host's 32-bit 33-MHz PCI environment.

The different StarT-X communication mechanisms are optimized to support different granularities of communication patterns. Elsewhere, we have described StarT-X's operation in detail[16]. Here, we briefly present StarT-X's Programmed I/O Interface (PIO) and Cacheable Virtual Interface (VI), the two mechanisms employed by the GCM code.

PIO Mode: To support fine-grain message-passing applications, the PIO interface presents a simple FIFO-based network abstraction similar to the CM-5 data network interface[29]. Processes on different nodes communicate by exchanging messages which contain two 32-bit header words followed by a variable size payload of between 2 and 22 32-bit words. The PIO mode abstraction and packet format are depicted in Figure 1. Due to the relatively high cost of the uncached mmap accesses, we can reliably estimate the performance of PIO-mode communication by summing the cost of the mmap accesses (given in Section 2.1). For example, when passing an 8-byte message, the sender and the receiver each perform two 8-byte (header plus payload) mmap accesses to the NIU registers, and thus, we can estimate the overhead of sending and receiving a message to be 0.36 µsec and 1.86 µsec, respectively. The experimentally determined LogP characteristics[10] of StarT-X's PIO mechanism, summarized in Figure 2, corroborate these estimates.
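To make the accounting above concrete, the following C fragment sketches what a PIO-mode send reduces to: a handful of uncached stores into the memory-mapped transmit FIFO. The register layout, header packing and the pio_send name are illustrative assumptions, not the documented StarT-X interface; on the real hardware tx_fifo would be the mmap()ed NIU window rather than a scratch array.

    #include <stdint.h>

    /* Stand-in for the mmap()ed StarT-X transmit window; on real hardware
     * each store below would be an uncached PCI write (~0.18 usec each). */
    static uint32_t tx_scratch[24];
    static volatile uint32_t *tx_fifo = tx_scratch;

    /* Send one PIO-mode message: two 32-bit header words followed by
     * 2..22 32-bit payload words. An 8-byte-payload message therefore
     * costs roughly two 8-byte writes on the send side (~0.36 usec). */
    static void pio_send(uint32_t route, uint32_t usr_tag,
                         const uint32_t *payload, unsigned nwords)
    {
        tx_fifo[0] = route;                   /* header word 0: routing        */
        tx_fifo[1] = (usr_tag << 5) | nwords; /* header word 1: tag and size
                                                 (packing is hypothetical)     */
        for (unsigned i = 0; i < nwords; i++) /* payload words                 */
            tx_fifo[2 + i] = payload[i];
    }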

VI Mode: In modern architectures, cached accesses to main memory are orders of magnitude faster than PIO accesses. To take advantage of this performance disparity, the VI mode uses DMA to extend the physical transmit and receive queues into the memory system. Figure 3 illustrates this abstraction. The user interacts with StarT-X indirectly through memory, and hence, avoids costly PIO accesses. The VI mode makes use of a pinned, contiguous physical memory region for DMA. The VI memory region is mapped into a cacheable virtual memory region of the user process. To send a VI message, instead of enqueuing directly to the hardware transmit queue, the user process writes the message to a cacheable VI buffer. The user then invokes StarT-X's DMA engine to enqueue the message into the physical transmit queue. On the receiver end, StarT-X delivers the message directly to a pre-specified buffer in the receiving node's VI memory region. Because DMA invocation and status polling require mmap accesses to StarT-X registers, the VI mode is most efficient when multiple outbound messages are queued consecutively in memory and are transmitted with a single DMA invocation. The peak payload transfer bandwidth in VI mode is 110 MByte/sec.
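The batching behaviour described above can be pictured with a short sketch: several outbound messages are staged in the cacheable VI buffer with ordinary stores, and the PIO cost of starting the DMA engine is then paid once for the whole batch. The vi_buffer, startx_dma_kick and startx_dma_done names and signatures are placeholders for whatever the real driver interface provides; the empty stubs stand in for the hardware interaction.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define VI_WORDS 4096

    /* Stands in for the pinned, contiguous, cacheable VI memory region
     * that the driver maps into the user process. */
    static uint32_t vi_buffer[VI_WORDS];

    /* Placeholder driver hooks: one mmap access kicks the DMA engine on a
     * range of the VI region, another polls its completion status bit. */
    static void startx_dma_kick(const uint32_t *base, size_t nwords)
    { (void)base; (void)nwords; }
    static int startx_dma_done(void) { return 1; }

    /* Stage a batch of outbound messages with cached stores, then hand the
     * whole batch to the NIU with a single DMA invocation. */
    static void vi_send_batch(const uint32_t *msgs, size_t nwords)
    {
        memcpy(vi_buffer, msgs, nwords * sizeof(uint32_t)); /* cached copies  */
        startx_dma_kick(vi_buffer, nwords);                 /* one PIO access */
        while (!startx_dma_done())                          /* poll for done  */
            ;
    }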

3 The MIT General Circulation Model

The climate model used on Hyades is the MIT general circulation model[21, 20, 15, 14]. The model is a versatile research tool that can be applied to a wide variety of processes ranging from non-hydrostatic rotating fluid dynamics[15, 22] to the large-scale general circulation of the atmosphere and ocean[14, 19]. The model is implemented to exploit the mathematical isomorphisms that exist between the equations of motion for an incompressible fluid (the ocean) and those of a compressible fluid (the atmosphere), allowing atmosphere and ocean simulations to be performed by the same basic model code[14].

3.1 Numerical Procedure

At the heart of the model is a numerical kernel that steps forward the equations of motion for a fluid in a rotating frame of reference using a procedure that is a variant on the theme set out by Harlow and Welch[13]. The kernel can be written in semi-discrete form to second order in time, \Delta t, thus:

\[
\frac{v^{n+1} - v^{n}}{\Delta t} = G_v^{\,n+\frac{1}{2}} - \nabla p^{\,n+\frac{1}{2}} \qquad (1)
\]
\[
\nabla \cdot v^{n+1} = 0 \qquad (2)
\]

Equations (1) and (2) describe the time evolution of the flow field (v is the three-dimensional velocity field) in response to forcing due to G (representing inertial, Coriolis, metric, gravitational, and forcing/dissipation terms) and to the pressure gradient \nabla p. For brevity we do not write the equations for thermodynamic variables which must also be stepped forward to find, by making use of an equation of state, the buoyancy b.

The pressure field, p, required in equations (1) and (2) is found by separating the pressure into hydrostatic, surface and non-hydrostatic parts. The flow in the climate scale simulations presented here is hydrostatic, yielding a two-dimensional elliptic equation for the surface pressure p_s that ensures non-divergent depth-integrated flow:

\[
\nabla_h \cdot H \nabla_h p_s = \nabla_h \cdot \widehat{G_{v_h}^{\,n+\frac{1}{2}}}^{\,H} - \nabla_h \cdot \widehat{\nabla_h p_{hy}}^{\,H} \qquad (3)
\]

(\widehat{(\cdot)}^{H} indicates the vertical integral over the local depth, H, of the fluid and the subscript h denotes horizontal). In the hydrostatic limit the non-hydrostatic pressure component is negligible and vertical variations in p are computed hydrostatically from the buoyancy, b, to yield p_hy.

3.2 Spatial Discretization

Finite volume techniques are used to discretize in space, affording some flexibility in sculpting of the model grid to the irregular geometry of land masses[1]. In this approach the continuous domain (either ocean or atmosphere) is carved into "volumes" as illustrated in Figure 4. Discrete forms of the continuous equations (1)-(3) can then be deduced by integrating over the volumes and making use of Gauss's theorem. Finite-volume discretization produces abundant data parallelism and forms the basis for mapping the algorithm to a multi-processor cluster.

4 Mapping to the Cluster

The numerical kernel of the GCM code is written in sequential Fortran to work on a computational domain that has been decomposed horizontally into tiles that extend over the full depth of the model. The tiles form the basic unit on which computation is performed and over which parallelism is obtained. As shown in Figure 5, tiles include a lateral overlap or "halo" region in which data from adjacent tiles are duplicated. For some stages of the calculation computations are duplicated (overcomputed) in the halo region, so that data does not need to be fetched from neighboring tiles. This "overcomputation" is employed to reduce the number of communication and synchronization points required in a model time-step.

Figure 4: Finite volume spatial discretization. This schematic diagram illustrates how an ocean basin might be mapped to a 16-node parallel computer. The cutout for node 15 (N_15) shows the volumes within a subdomain. The dark regions indicate land. The finite volume scheme allows both the face area and the volume of a cell that is open to flow to vary in space, so that the volumes can be made to fit irregular geometries. The vertical dimension stays within a single node.

Figure 5: Flexible tiled domain decomposition. Tile sizes and distributions can be defined to produce long strips consistent with vector memories (upper panel). Alternatively small, compact blocks can be created which are better suited to deep memory hierarchies (lower panel). Connectivity between tiles can be tuned to reduce the overall computational load. The left panel shows a close-up of a single tile. The shaded area indicates the "halo" region. The halo region surrounds the tile interior and holds duplicate copies of data "belonging" to the interior regions of neighboring tiles.

INITIALIZE. Define topography, initial flow and tracer distributions
FOR each time step n DO
    PS
        Step forward state:         v^n = v^{n-1} + \Delta t \, (G_v^{\,n-\frac{1}{2}} - \nabla p^{\,n-\frac{1}{2}})
        Calculate time derivatives: G_v^{\,n+\frac{1}{2}} = gv(v^n, b)
        Calculate hydrostatic p:    p_{hy}^{\,n+\frac{1}{2}} = hy(b)
    DS
        Solve for pressure:         \nabla_h \cdot H \nabla_h p_s^{\,n+\frac{1}{2}} = \nabla_h \cdot \widehat{G_{v_h}^{\,n+\frac{1}{2}}}^{\,H} - \nabla_h \cdot \widehat{\nabla_h p_{hy}^{\,n+\frac{1}{2}}}^{\,H}
END FOR

Figure 6: Outline of the GCM algorithm. The model iterates repeatedly over a time-stepping loop comprised of two main blocks, PS and DS. A numerical experiment may entail many millions of time-steps. In PS, time tendencies (G terms) are calculated using the model state at previous time levels (n, n-1, n-2, etc.). DS involves finding a pressure field p_s such that the flow field v at the succeeding time level will satisfy the continuity relation in equation (2). For clarity the calculation of G terms for thermodynamic variables (temperature and salinity in the ocean, and temperature and water vapor in the atmosphere) has been omitted from the PS outline. These terms are calculated by functions that have a similar form to the gv() function and yield the buoyancy, b.

Figure 6 shows the high-level structure of the GCM code. The two main stages of each model time-step are the Prognostic Step (PS) and the Diagnostic Step (DS). Both stages employ finite volume techniques and are formulated to compute on a single tile at a time, but the two stages differ in the amount and style of parallel communication that they require. All terms in PS can be calculated from quantities contained within a local stencil of 3 × 3 grid points. Accordingly PS is formulated to employ "overcomputation", so that all PS terms for a given tile can be calculated using only data within that tile and its halo region. In this way, the communication and synchronization cost for PS is isolated to one occurrence within a time step, and the ratio of on-node computation to inter-node communication is relatively high. In contrast, the form of equation (3) implies global connectivity between grid points during DS. A pre-conditioned conjugate-gradient[21, 15, 26] iterative solver is employed in this phase. This procedure reflects the inherent global connectivity of equation (3), and thus does not lend itself to overcomputation. Accordingly the ratio of on-node computation to inter-node communication is low for DS. In climate modeling scenarios, DS is also distinct from PS because DS operates on the vertically integrated model state (a two-dimensional field) whereas PS operates on the full, three-dimensional model state.

Maximizing the parallel performance of the model requires optimizing the two key primitives that communicate data amongst tiles in PS and DS. The first of these primitives is exchange, which brings halo regions into a consistent state. In DS, the iterative solver requires an exchange to be applied to two fields at every solver iteration. DS exchanges update a halo region that is only one element wide and operate on a two-dimensional field. In PS, an exchange must be performed for each of the model's three-dimensional state variables over a halo width of at least three points. The second key primitive is global sum. This primitive orchestrates the summing of a single floating-point number from each tile and returns the result to every tile. Two global sum operations are required at every solver iteration in DS; PS does not require any global sum operations.
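The per-iteration communication pattern just described (two halo exchanges and two global sums, with everything else local to a tile) is summarized by the schematic C skeleton below. exchange_2d and global_sum are stand-ins for the GCM's actual Fortran primitives, and the floating-point work of the conjugate-gradient iteration is elided; the skeleton only shows where the communication calls fall.

    /* Schematic of one DS solver iteration: communication calls only. */
    typedef struct { double *data; int nx, ny; } field2d;

    /* Stand-ins for the optimized primitives of Sections 4.1 and 4.2.  */
    static void exchange_2d(field2d *f) { (void)f; }   /* 1-wide halo   */
    static double global_sum(double local) { return local; }

    static void ds_iteration(field2d *p, field2d *q, double *alpha, double *beta)
    {
        exchange_2d(p);              /* halo update, first 2-D field    */
        exchange_2d(q);              /* halo update, second 2-D field   */
        double dot = 0.0;            /* ... local stencil/dot products  */
        *alpha = global_sum(dot);    /* first global sum                */
        double norm = 0.0;           /* ... local residual norm         */
        *beta = global_sum(norm);    /* second global sum               */
    }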

The GCM software architecture isolates the communication code from the numerical code. Non-critical communication is implemented in a portable way using MPI or shared memory, but performance-critical communication, exchange and global sum, can be customized for the specific hardware. High-performance implementations of these two primitives are vital to fine-grain parallel performance.

Figure 7: Transfer bandwidth (MByte/sec) as a function of block size (4 Bytes to 128 KBytes).

4.1 Optimized Exchange

An exchange operation on Hyades is implemented as two separate VI-mode transfers in opposite directions. The two transfers are carried out sequentially because a single transfer alone can saturate the PCI bus. Since the StarT-X NIU does not have address translation facilities, it can only DMA to and from a pinned and contiguous physical memory region. To utilize the VI mode, the sending processor must copy the data from the source buffer into a special VI memory region. For efficiency, the sender copies the data in several small chunks and initiates DMA on a chunk immediately after each copy to overlap the DMA transfer with the next round of copying. Similarly, the receiver copies the data, in chunks, from the hardware inbound message queue to the destination buffer as soon as the messages arrive. When the transfer in one direction of the exchange completes, the sender and receiver reverse roles and continue in the opposite direction.
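The copy/DMA overlap described above amounts to a small software pipeline: while the NIU drains chunk i out of the VI region, the processor is already copying chunk i+1 into the next staging slot. The chunk size, number of slots and driver hooks in this C sketch are illustrative assumptions rather than the actual Hyades implementation.

    #include <stddef.h>
    #include <string.h>

    #define CHUNK_BYTES 2048                     /* illustrative chunk size     */
    #define NSLOTS      8                        /* staging slots in VI region  */

    static char vi_region[NSLOTS * CHUNK_BYTES]; /* stands in for pinned VI mem */

    /* Placeholder driver hooks (not the real StarT-X API). */
    static void dma_enqueue(const char *src, size_t len) { (void)src; (void)len; }
    static void dma_wait_for_free_slot(void) { }

    /* One direction of an exchange: copy outgoing halo data into the VI
     * region chunk by chunk, starting a DMA after each copy so that the
     * transfer of one chunk overlaps the memcpy of the next. */
    static void exchange_send(const char *src, size_t len)
    {
        size_t off = 0;
        unsigned slot = 0;
        while (off < len) {
            size_t n = (len - off < CHUNK_BYTES) ? (len - off) : CHUNK_BYTES;
            char *stage = vi_region + (slot % NSLOTS) * CHUNK_BYTES;
            dma_wait_for_free_slot();      /* reclaim a staging slot         */
            memcpy(stage, src + off, n);   /* cached copy of one chunk       */
            dma_enqueue(stage, n);         /* kick DMA; overlaps next memcpy */
            off += n;
            slot++;
        }
    }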

A node can sustain 110 MByte/sec of peak data transfer bandwidth during an exchange. However, there is a one-time 8.6 µsec overhead to negotiate a transfer between two nodes. This small overhead is important because, for fine-grain problem sizes, the exchanges in DS only transfer a few kilobytes of data. Without a long transfer to amortize the overhead, this 8.6 µsec overhead reduces the perceived transfer bandwidth to only 56.8 MByte/sec for a 1-KByte transfer. The perceived bandwidth reaches 90% of the peak 110 MByte/sec for 9-KByte transfers. The perceived transfer bandwidth is plotted as a function of the transfer block size in Figure 7. Arctic's fat-tree interconnect can handle multiple simultaneous transfers with undiminished pair-wise bandwidth.
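The perceived-bandwidth figures quoted here are consistent with a simple two-parameter model (our reading of the measurements, not a formula given in the text) in which a transfer of S bytes pays the fixed 8.6 µsec negotiation overhead plus the streaming time at the 110 MByte/sec peak rate:

\[
B_{perceived}(S) = \frac{S}{8.6\,\mu s + S / (110\ \mathrm{MByte/sec})},
\qquad B_{perceived}(1\ \mathrm{KByte}) \approx 57\ \mathrm{MByte/sec}.
\]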

Hyades uses one StarT-X NIU in each two-processor SMP node. The discussion above pertains to operating a single processor at each network endpoint. This usage offers the maximum ratio of communication to computation performance. When multiple processors per SMP participate in the application, the exchange primitive operates in a mixed-mode fashion which uses shared memory to handle intra-SMP communication. In this mode, one processor on each SMP is designated as the communication master, which has sole control of the NIU and processes remote exchanges on behalf of slave processors. The slaves post remote exchange requests to the master through a shared-memory semaphore. The slave-to-slave exchange bandwidth is about 30% lower than a master-to-master exchange.

4.2 Optimized Global Sum

With ample network bandwidth, our implementation of global sum minimizes latency at the expense of more messages. For an N-node global sum where N is a power of two, N × log_2 N messages are sent over log_2 N rounds. The global sum algorithm computes N reductions concurrently such that after i rounds, every node has the partial sum for the group of nodes whose node identifiers only differ in the lowest i bits. The communication pattern and the partial sum for each of the three rounds in an 8-way global sum is shown in Figure 8.

        | start | Round 0 | Round 1     | Round 2
Node 0  | d0    | d0+d1   | d0+d1+d2+d3 | d0+d1+d2+d3+d4+d5+d6+d7
Node 1  | d1    | d0+d1   | d0+d1+d2+d3 | d0+d1+d2+d3+d4+d5+d6+d7
Node 2  | d2    | d2+d3   | d0+d1+d2+d3 | d0+d1+d2+d3+d4+d5+d6+d7
Node 3  | d3    | d2+d3   | d0+d1+d2+d3 | d0+d1+d2+d3+d4+d5+d6+d7
Node 4  | d4    | d4+d5   | d4+d5+d6+d7 | d0+d1+d2+d3+d4+d5+d6+d7
Node 5  | d5    | d4+d5   | d4+d5+d6+d7 | d0+d1+d2+d3+d4+d5+d6+d7
Node 6  | d6    | d6+d7   | d4+d5+d6+d7 | d0+d1+d2+d3+d4+d5+d6+d7
Node 7  | d7    | d6+d7   | d4+d5+d6+d7 | d0+d1+d2+d3+d4+d5+d6+d7

Figure 8: Communication pattern in an eight-way global sum.
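The pattern of Figure 8 is a recursive-doubling (butterfly) reduction, sketched below in C. In round i each node pairs with the node whose identifier differs in bit i, trades its current partial sum, and accumulates, so that every node holds the full sum after log_2 N rounds. send_to and recv_from are stand-ins for 8-byte-payload PIO messages, not the actual StarT-X calls.

    /* Recursive-doubling global sum over nnodes nodes (a power of two). */
    static void send_to(int node, double value) { (void)node; (void)value; }
    static double recv_from(int node) { (void)node; return 0.0; }

    static double butterfly_global_sum(double local, int my_id, int nnodes)
    {
        double sum = local;
        for (int bit = 1; bit < nnodes; bit <<= 1) {
            int partner = my_id ^ bit;   /* differs from my_id only in this bit */
            send_to(partner, sum);       /* both partners exchange partial sums */
            sum += recv_from(partner);   /* after log2(nnodes) rounds: full sum */
        }
        return sum;
    }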

The critical path in this algorithm involves successive sending and receiving of log_2 N messages. Ignoring the second-order effect of varying network transit time, the elapsed time for an N-way global sum can be approximated by the simple formula:

\[
t_{gsum} = C \log_2 N
\]

where the constant C corresponds to the user-to-user latency of an 8-byte payload message plus the CPU processing latency in each round. The measured latencies for 2-way, 4-way, 8-way and 16-way global sums are 4.0 µsec, 8.3 µsec, 12.8 µsec and 18.2 µsec, respectively. A least-squares fit to these measurements is $t_{gsum} = (4.67 \log_2 N - 0.95)\ \mu s$.

When multiple processors per SMP are participating in the application, the processors on each SMP first generate a local global sum using shared-memory semaphores. Next, a master processor from each SMP enters into the system-wide global sum operation. Finally, the master processor distributes the overall global sum to the local processors using shared memory. The local summing operation adds about 1 µsec to the global sum latency. On our two-way SMPs, the measured latencies for 2x2-way, 2x4-way, 2x8-way and 2x16-way global sums are 4.8 µsec, 9.1 µsec, 13.5 µsec and 19.5 µsec, respectively.

5 Performance

To demonstrate the utility of the cluster, a representative climate research atmosphere-ocean simulation was analyzed. Figure 9 shows a typical output from this simulation. The atmosphere and ocean simulations run at 2.8125° horizontal resolution (the lateral global grid size is 128 × 64 points). This experiment uses an intermediate complexity atmospheric physics package[14, 12] which has been designed for exploratory climate simulations. The configuration is especially well suited to predictability studies of the contemporary climate and to paleo-climate investigations. Running this configuration profitably on a cluster system is a challenging proposition. The total size of the model domain is not that large, so tiles arising from the domain decomposition illustrated in Figure 4 are small. As a result, processors must communicate relatively frequently, and parallel speed-up is very sensitive to communication overheads.

5.1 Overall Performance

In coupled simulations, the ocean and atmosphere isomorphs must run concurrently, periodically exchanging boundary conditions. During full-scale production runs, each isomorph occupies half of the cluster, sixteen processors on eight SMPs. In Figure 10, we compare Hyades to contemporary vector machines in terms of their performance on the GCM code. As the table shows, GCM performs competitively on all platforms, and the performance on sixteen processors of our cluster is comparable to a one-processor vector machine. On Hyades, for each of the component models, the sustained Flop rate on sixteen processors is fifteen times higher than the single processor rate. For a full-scale production run, the sustained combined floating-point performance of both the atmosphere and the ocean isomorphs is between 1.6-1.8 GFlop/sec.

Figure 9: Currents obtained from the atmosphere and ocean isomorphs of the MIT general circulation model. (Upper panel: atmosphere zonal velocity at 250 mb in January, in m/s; lower panel: ocean currents at 25 m depth at t = 1000 years.)

Processor Count | Machine   | Sustained performance (10^9 Flop/sec)
1               | Cray Y-MP | 0.4
4               | Cray Y-MP | 1.5
1               | Cray C90  | 0.6
4               | Cray C90  | 2.2
1               | NEC SX-4  | 0.7
4               | NEC SX-4  | 2.7
1               | Hyades    | 0.054
16              | Hyades    | 0.8

Figure 10: Performance of the ocean isomorph of our coarse resolution climate model. Because it is based on the same kernel, the atmospheric counterpart has an almost identical profile.

5.2 Performance Model

A simple performance model can be used to examine the roles that floating-point capability and network capability play in setting the overall performance of this simulation. The performance model formula is derived by breaking down the numerical model phases PS and DS into constituent computation and communication stages.

An approximation for the time t_ps taken by a single pass through the PS phase is

\[
t_{ps} = t_{ps\_compute} + t_{ps\_exch} \qquad (4)
\]

where

\[
t_{ps\_compute} = \frac{N_{ps}\, n_{xyz}}{F_{ps}} \qquad (5)
\]
\[
t_{ps\_exch} = 5\, t_{exch\_xyz} \qquad (6)
\]

The PS phase processor compute time t_ps_compute is the total number of floating-point operations performed by each processor divided by the processor's floating-point operation rate. The total number of floating-point operations per processor in the PS phase is the product of N_ps and n_xyz. The term N_ps is the number of floating-point operations per grid cell in a single PS phase, and it can be determined by inspecting the model code. n_xyz is the number of grid cells in the 3-D volume assigned to a single processor. F_ps is the measured floating-point performance on the PS phase single-processor kernel.

The communication time t_ps_exch for the PS phase is the time for applying the three-dimensional exchange primitive to five separate model fields. For a given grid decomposition, the exchange block size is known, and t_exch_xyz can either be measured experimentally from a stand-alone benchmark or be estimated from the bandwidth curve in Figure 7. In PS, the exchange primitives are typically called with data blocks ranging from tens to hundreds of kilobytes.

The time t_ds taken by a single iteration of the DS phase solver can be expressed as

\[
t_{ds} = t_{ds\_compute} + t_{ds\_exch} + t_{ds\_gsum} \qquad (7)
\]

where

\[
t_{ds\_compute} = \frac{N_{ds}\, n_{xy}}{F_{ds}} \qquad (8)
\]
\[
t_{ds\_exch} = 2\, t_{exch\_xy} \qquad (9)
\]
\[
t_{ds\_gsum} = 2\, t_{gsum} \qquad (10)
\]

The DS phase processor compute time t_ds_compute is the total number of floating-point operations performed by each processor (N_ds n_xy) divided by the processor's floating-point performance F_ds. n_xy is the number of vertical columns assigned to a single processor.

PS phase parameters (Atmosphere)
N_ps | n_xyz | t_exch_xyz (µsecs) | F_ps (MFlop/sec)
781  | 5120  | 1640               | 50

PS phase parameters (Ocean)
N_ps | n_xyz | t_exch_xyz (µsecs) | F_ps (MFlop/sec)
751  | 15360 | 4573               | 50

DS phase parameters
N_ds | n_xy  | t_gsum (µsecs) | t_exch_xy (µsecs) | F_ds (MFlop/sec)
36   | 1024  | 13.5           | 115               | 60

Figure 11: Performance model parameters of a coupled ocean-atmosphere simulation at 2.8125° resolution. Each isomorph occupies sixteen processors on eight SMPs. For the PS phase the atmosphere and the ocean model configurations have different vertical resolutions and contain slightly different numerical computations. This results in slightly different performance model parameters for the PS phase. The DS phase parameters are the same for both components.

Again, F_ds can be measured using a stand-alone single-processor DS kernel. The communication time t_ds_exch for the DS phase is the time for applying the two-dimensional exchange primitive to two separate model fields plus the time for applying the global sum primitive twice. The data blocks exchanged in the DS phase are typically an order of magnitude smaller than those in PS.

The total runtime T_run for a numerical experiment with N_t time steps and with a mean number of solver iterations, N_i, in DS, is

\[
T_{run} = N_t\, t_{ps} + N_t\, N_i\, t_{ds} \qquad (11)
\]

5.3 Validation of the Performance Model

We tested our performance model against a one-year atmospheric simulation running on sixteen processors over eight SMPs. The performance model parameters corresponding to this simulation are given in Figure 11. The exchange and global sum costs are determined using stand-alone benchmarks. The same is true for F_ps and F_ds. The one-year simulation requires 183 minutes of wall-clock time.

The performance model predicts the total communication time T_comm should be given by

\[
T_{comm} = 2\, N_t\, N_i\, t_{gsum} + 5\, N_t\, t_{exch\_xyz} + 2\, N_t\, N_i\, t_{exch\_xy} \qquad (12)
\]

For a one-year atmospheric simulation, N_t = 77760 and N_i = 60. The predicted total communication time, using parameter values from Figure 11, is 30.1 minutes. The performance model also predicts that the total computation time T_comp is given by

\[
T_{comp} = N_t\, \frac{N_{ps}\, n_{xyz}}{F_{ps}} + N_t\, N_i\, \frac{N_{ds}\, n_{xy}}{F_{ds}} \qquad (13)
\]

Substituting values from Figure 11, the predicted T_comp is 151 minutes. T_comm and T_comp total to 181 minutes which agrees well with the observed 183 minutes of wall-clock time.
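For completeness, the short self-contained program below re-derives these predictions from Equations (12) and (13) and the Figure 11 parameters (atmospheric PS phase, shared DS phase). It is an illustrative re-calculation, not code from the paper; it prints roughly 30 minutes of communication and 151 minutes of computation, in line with the totals quoted above.

    #include <stdio.h>

    int main(void)
    {
        /* Figure 11 parameters (atmosphere) and Section 5.3 run lengths.   */
        const double Nt = 77760.0, Ni = 60.0;     /* time steps, solver iters    */
        const double Nps = 781.0, nxyz = 5120.0;  /* PS flop/cell, cells/proc    */
        const double Nds = 36.0,  nxy  = 1024.0;  /* DS flop/column, columns/proc*/
        const double Fps = 50e6,  Fds  = 60e6;    /* measured Flop/sec per proc  */
        const double t_exch_xyz = 1640e-6;        /* 3-D exchange (seconds)      */
        const double t_exch_xy  = 115e-6;         /* 2-D exchange (seconds)      */
        const double t_gsum     = 13.5e-6;        /* global sum, 16 procs/8 SMPs */

        /* Equation (12): total communication time.                          */
        double T_comm = 2*Nt*Ni*t_gsum + 5*Nt*t_exch_xyz + 2*Nt*Ni*t_exch_xy;
        /* Equation (13): total computation time.                            */
        double T_comp = Nt*(Nps*nxyz/Fps) + Nt*Ni*(Nds*nxy/Fds);

        printf("T_comm = %.1f min\n", T_comm / 60.0);             /* ~30 min  */
        printf("T_comp = %.1f min\n", T_comp / 60.0);             /* ~151 min */
        printf("Total  = %.1f min (observed 183 min)\n",
               (T_comm + T_comp) / 60.0);
        return 0;
    }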

       | t_gsum (µsec) | t_exch_xy (µsec) | t_exch_xyz (µsec) | P_fpp,ps (MFlop/sec) | P_fpp,ds (MFlop/sec) | F_ps (MFlop/sec) | F_ds (MFlop/sec)
F.E.   | 942           | 10008            | 100000            | 8.0                  | 1.6                  | 50               | 60
G.E.   | 1193          | 1789             | 5742              | 139                  | 6.2                  | 50               | 60
Arctic | 13.5          | 115              | 1640              | 487                  | 143                  | 50               | 60

Figure 12: Potential Floating-Point Performance of a 2.8125°-resolution atmospheric simulation on a sixteen-processor, eight-SMP cluster interconnected by Fast Ethernet (F.E.), Gigabit Ethernet (G.E.), and the Arctic Switch Fabric (Arctic). t_gsum, t_exch_xy and t_exch_xyz are determined using stand-alone benchmarks.

5.4 Analysis using the Performance Model

Based on this performance model, we introduce a performance metric which we call Potential Floating-Point Performance, P_fpp. This quantity is the per-processor floating-point performance a numerical application would have if the numerical computations took zero time. This metric quantifies, for a given application configuration and hardware, the role that the communication performance plays in determining overall performance. P_fpp not only can characterize the balance of a system, but it can also determine the direction for improvement. If P_fpp is significantly greater than the current processor compute performance then straightforward investments in faster or more processors are a viable route to improving overall application performance. Conversely, if P_fpp is several times smaller than the current compute performance then there is little point in investing in hardware that only improves compute performance.

P_fpp is computed as the total number of floating-point operations per processor divided by the communication time. For the PS and the DS phases of the GCM algorithm, P_fpp is

\[
P_{fpp,ps} = \lim_{F_{ps} \to \infty} \frac{N_{ps}\, n_{xyz}}{t_{ps}} = \frac{N_{ps}\, n_{xyz}}{5\, t_{exch\_xyz}} \qquad (14)
\]
\[
P_{fpp,ds} = \lim_{F_{ds} \to \infty} \frac{N_{ds}\, n_{xy}}{t_{ds}} = \frac{N_{ds}\, n_{xy}}{2\, t_{gsum} + 2\, t_{exch\_xy}} \qquad (15)
\]
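Plugging the Figure 11 parameters and the Arctic communication times into Equations (14) and (15) reproduces the Arctic row of Figure 12 (a consistency check only):

\[
P_{fpp,ps} = \frac{781 \times 5120}{5 \times 1640\,\mu s} \approx 487\ \mathrm{MFlop/sec},
\qquad
P_{fpp,ds} = \frac{36 \times 1024}{2 \times 13.5\,\mu s + 2 \times 115\,\mu s} \approx 143\ \mathrm{MFlop/sec}.
\]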

Figure 12 summarizes the P_fpp achieved by a variety of architectures during the PS phase and the DS phase of an atmospheric simulation at 2.8125°. The table compares the results from simulations running on a sixteen-processor, eight-SMP cluster interconnected by Arctic, Fast Ethernet, and Gigabit Ethernet. (The GCM code uses MPI to communicate on the Ethernet clusters.) Given a reference floating-point performance of about 50 MFlop/sec, the Gigabit Ethernet cluster's P_fpp,ps value suggests that this architecture is viable for coarse grain scenarios where communications are large and infrequent. However, the table also indicates that the performance in the fine-grain DS phase of the GCM code would be severely limited by the performance of Fast Ethernet and Gigabit Ethernet. To achieve a P_fpp,ds of 60 MFlop/sec, the sum of t_gsum and t_exch_xy cannot exceed 306 µsec. The Gigabit Ethernet hardware is nearly a factor of ten away from this threshold. This number suggests that Gigabit Ethernet clusters are not suitable for this resolution of climate model.

6 Discussion

It has been noted elsewhere[7] that single message overheads of 100 microseconds or greater present a serious challenge to fine-grain parallel applications. As we illustrate here, eliminating this bottleneck enables the application of cluster technology to a much wider field. The GCM implementation we have analyzed is not limited to coupled climate simulations. The MIT GCM algorithm is designed to apply to a wide variety of geophysical fluid problems. The performance model we have derived is valid for all these scenarios. Therefore, our analysis suggests that, for many useful geophysical fluid simulations, commodity-off-the-shelf (COTS) processors significantly outperform COTS interconnect solutions. In these cases, advanced networking hardware and software can alleviate the performance disparity by decreasing the denominator terms in Equations (14) and (15).

Obviously, floating-point performance comparable with that of Hyades can be achieved on state-of-the-art supercomputers[11]. Big supercomputers, however, are typically shared resources where the CPU time can often be "dwarfed" by the amount of time spent in the job queue. In contrast, the affordability of our cluster makes it possible to build a system that can be dedicated to a single research endeavor such that the turn-around time is simply the CPU time. As a consequence the Hyades cluster is a platform on which a century-long synchronous climate simulation, coupling an atmosphere at 2.8° resolution to a 1° ocean, can be completed within a two-week period.

In a related work, the HPVM (High Performance Virtual Machine) communication suite[8] also allows a powerful cluster to be easily constructed from low-cost PCs and a high-performance interconnect, Myrinet[4]. HPVM provides a collection of communication APIs as well as a suite of system management packages that make the cluster suitable for a broad span of supercomputing applications. The Hyades cluster differs from HPVM in its purpose and approach. Our cluster has a much narrower focus that emphasizes supporting a single application as efficiently as possible, and can be viewed as an application-specific supercomputer. This perspective motivates us to strive for every last bit of performance by developing communication primitives that are tailored to the application. Although an HPVM cluster's hardware components have comparable peak performance to the Hyades cluster, a sixteen-way global barrier on a comparable HPVM cluster takes more than 50 µsec (more than 2.5 times longer than Hyades's context-specific primitive). Similarly, an HPVM cluster's transfer bandwidth for 1-KByte blocks is about 42 MByte/sec (25% slower than Hyades's exchange primitive).

The Hyades cluster does have general-purpose, high-level programming interfaces, like MPI-StarT[18] and Cilk[28], that can make use of the high-performance interconnect. However, in an application-specific cluster, there is little reason to give up any performance for an API that is more general than required. The one-time investment in developing customized primitives (it took less than one man-month to develop the two custom primitives) can easily be recuperated over the lifetime of the cluster. Much greater effort has already been spent on developing the science and the simulation aspects of the application. Any real application, like GCM, is going to be used repeatedly and routinely to produce meaningful and useful results. We believe developing customized primitives that only support what is needed is actually a powerful and efficient strategy for rapidly adapting an application to on-going innovations in networking hardware.

7 Acknowledgments

This work was funded in part by grants from the National Science Foundation and NOAA and with funds from the MIT Climate Modeling Initiative[23]. The hardware development was carried out at the Laboratory for Computer Science, MIT and was funded in part by the Advanced Research Projects Agency of the Department of Defense under the Office of Naval Research contract N00014-92-J-1310 and Ft. Huachuca contract DABT63-95-C-0150. James C. Hoe is supported by an Intel Foundation Graduate Fellowship. The computational resources for this work were made available through generous donations from Intel Corporation. We acknowledge Arvind and John Marshall for their continuing support. We thank Andy Boughton and Larry Rudolph for their invaluable technical input.

References

[1] A. Adcroft, C. Hill, and J. Marshall. Representation of topography by shaved cells in a height coordinate ocean model. Mon. Wea. Rev., pages 2293-2315, 1997.

[2] T. E. Anderson, D. E. Culler, D. A. Patterson, and the NOW Team. Case for networks of workstations: NOW. IEEE Micro, February 1995.

[3] B. S. Ang, D. Chiou, D. L. Rosenband, M. Ehrlich, L. Rudolph, and Arvind. StarT-Voyager: A flexible platform for exploring scalable SMP issues. In Proceedings of SC'98, November 1998.

[4] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: A gigabit per second local area network. IEEE Micro Magazine, February 1995.

[5] G. A. Boughton. Arctic routing chip. In Proceedings of Hot Interconnects II, August 1994.

[6] G. A. Boughton. Arctic Switch Fabric. In Proceedings of the 1997 Parallel Computing, Routing and Communications Workshop, June 1997.

[7] P. Buonadonna, A. Geweke, and D. E. Culler. An Implementation and Analysis of the Virtual Interface Architecture. In Proceedings of SC'98, 1998.

[8] A. Chien, S. Pakin, M. Lauria, M. Buchanan, K. Hane, L. Giannini, and J. Prusakova. High performance virtual machines (HPVM): Clusters with supercomputing APIs and performance. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[9] D. E. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel computing on the Berkeley NOW. In Proceedings of 9th Joint Symposium on Parallel Processing, 1997.

[10] D. E. Culler, L. T. Liu, R. P. Martin, and C. O. Yoshikawa. LogP: Performance assessment of fast network interfaces. IEEE Micro, Vol 16, February 1996.

[11] M. Desgagne, S. J. Thomas, R. Benoit, and M. Valin. Shared and distributed memory implementations of the Canadian MC2 model. In Proceedings of the Seventh ECMWF Workshop on the Use of Parallel Processors in Meteorology, pages 155-181. World Scientific, 1997.

[12] F. Molteni. Development of simplified parametrization schemes for a 5-level primitive-equation model of the atmospheric circulation. CINECA, 1999. Available from http://www.cineca.it/adamo/doc/pe5l param.html.

[13] F. H. Harlow and J. E. Welch. Numerical calculation of time-dependent viscous incompressible flow. Phys. Fluids, 8:2182, 1965.

[14] C. Hill, A. Adcroft, D. Jamous, and J. Marshall. A strategy for tera-scale climate modeling. In Proceedings of the Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology. World Scientific, 1999.

[15] C. Hill and J. Marshall. Application of a parallel Navier-Stokes model to ocean circulation. In Proceedings of Parallel Computational Fluid Dynamics, pages 545-552. Elsevier, 1995.

[16] J. C. Hoe. StarT-X: a one-man-year exercise in network interface engineering. In Proceedings of Hot Interconnects VI, August 1998.

[17] J. R. Holton. An Introduction to Dynamic Meteorology. Wiley, 1979.

[18] P. Husbands and J. C. Hoe. MPI-StarT: Delivering network performance to numerical applications. In Proceedings of SC'98, November 1998.

[19] J. Marotzke, R. Giering, K. Q. Zhang, D. Stammer, C. Hill, and T. Lee. Construction of the Adjoint MIT Ocean General Circulation Model and Application to Atlantic Heat Transport Sensitivity. J. Geophys. Res., submitted 1999.

[20] J. Marshall, A. Adcroft, C. Hill, and L. Perelman. A finite-volume, incompressible Navier-Stokes model for studies of the ocean on parallel computers. J. Geophys. Res., 102(C3):5753-5766, 1997.

[21] J. Marshall, C. Hill, L. Perelman, and A. Adcroft. Hydrostatic, quasi-hydrostatic and non-hydrostatic ocean modeling. J. Geophys. Res., 102(C3):5733-5752, 1997.

[22] J. Marshall, H. Jones, and C. Hill. Efficient ocean modeling using non-hydrostatic algorithms. J. Marine Systems, 18:115-134, 1998.

[23] MIT Center for Global Change Science Climate Modeling Initiative. See http://web.mit.edu/cgcs/www/cmi.html.

[24] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. In Proceedings of the 19th International Symposium on Computer Architecture, May 1992.

[25] J. Salmon, C. Stein, and T. Sterling. Scaling of Beowulf-class Distributed Systems. In Proceedings of SC'98, 1998.

[26] A. Shaw, K. C. Cho, C. Hill, R. P. Johnson, J. Marshall, and Arvind. A comparison of implicitly parallel multithreaded and data parallel implementations of an ocean model based on the Navier-Stokes equations. J. Parallel and Distributed Computing, pages 1-51, 1998.

[27] T. Sterling, D. J. Becker, D. Savarese, J. E. Dorband, and U. A. Ranawake. BEOWULF: A parallel workstation for scientific computation. In International Conference on Scientific Computation, 1995.

[28] Supercomputing Technology Group, MIT Laboratory for Computer Science. Cilk-5.1 Reference Manual, September 1997.

[29] Thinking Machines Corporation. Connection Machine CM-5 Technical Summary, November 1993.