ATLAS DAQ Note 77

18 Sept 1997

R.K.Bo ck, A.Bogaerts, R. Hauser, C. Hortnagl,

K. Koski, P.Werner, A.Guglielmi, O.Orel

Benchmarking Communication Systems

(Submitted, in modified form, to Parallel Computing)

1 Introduction

Following the market trend of high-performance computing towards parallel systems available at decreasing cost, we believe that there is a finite chance that much or maybe all of the computing load of level-2 triggers in ATLAS, and certainly all of level 3, can be executed in a farm constituted by commercial parallel systems.

The computer performance evaluation of systems uses, among other criteria, results of benchmarking, viz. measured execution times obtained by running a representative job mix, usually without investing substantial effort in optimising for the system at hand. In order to assess how far parallel systems can contribute to the solution of our trigger problem, we have designed a comparatively naive set of application-independent communication benchmarks; they are documented in ATLAS DAQ notes 48 and 61. The results are, in the first instance, large tables of measured communication times.

Our goal was to derive from these detailed results several basic communication parameters. They include obvious ones like bandwidth, but also various overheads associated with switching technologies or arising from interfacing to operating systems, and measures of traffic interference. We will eventually use these parameters for comparing different system possibilities, in particular as input to detailed and full-scale ATLAS modelling. While not replacing more detailed benchmarking of applications, they do give more useful information than the combination of CPU benchmarks with bandwidth numbers.

The tested systems include a number of different architectures, from clusters of workstations to tightly coupled massively parallel systems. We also included the technologies that were prominent in the ATLAS demonstrator program, SCI and ATM. The benchmark package includes versions for many different communication technologies and programming interfaces, such as shared memory, MPI, Digital's Memory Channel, the Cray T3E shared memory API and the Meiko CS-2.

2 Description of the benchmarks

To assess the performance of a number of commercially available parallel systems, we defined a set of abstract basic communication benchmarks, which are not specific to our application. We also added two more application-oriented benchmarks, which represent much simplified versions of the currently proposed second-level trigger solutions [2].

All abstract basic benchmarks are executed for a varying number of packet sizes (minimum, 64, 256, 1024 bytes) and, where applicable, a varying number of processors (2, 4, 8, 16, 32, 64). Packet sizes are restricted to those expected in our application, although some implementations have scanned a wider parameter space.


A more detailed definition of the benchmarks can be found in CERN ATLAS documents (http://atlasinfo.cern.ch/Atlas/GROUPS/DAQTRIG/NOTES/note61/rudi.ps.Z and http://atlasinfo.cern.ch/Atlas/GROUPS/DAQTRIG/NOTES/note48/ATLAS DAQ 48.ps.Z). An example implementation in C is also available from a web site (http://www.cern.ch/RD11/combench/combench.tar.Z). Default implementations for MPI and shared memory are available, and the software has also been adapted to several low-level libraries from different vendors, including the Cray T3E, the Meiko CS-2 and Digital's Memory Channel.

The following benchmark programs have been used (N is the total number of nodes):

Ping-pong
One node sends data to another node and waits for the data to be sent back. The benchmark measures the round-trip time.

Two-way
Both nodes are sending and receiving data simultaneously.

All-to-all
All nodes are sending to each other simultaneously. Increasing the number of nodes in the system increases the overall throughput.

Pairs
N/2 nodes send to N/2 receivers one-directionally.

Outfarming and Broadcast
For outfarming, one node sends packets to N − 1 receivers, while broadcast uses hardware broadcast, if present. Thus in outfarming the data could be different in each send, whereas broadcast always sends the same data.

Funnel and Push-farm
In the funnel benchmark N − 1 senders send data to one receiver.
The push-farm represents a type of communication in which the data is sent from N − 1 nodes to one receiver, much the same way as in the funnel. The difference is that in the push-farm benchmark additional computing cycles can be included; the computing represents the analysis of the received data, and we execute dummy code lasting 0, 100, 400 or 1600 microseconds. Each time, before the computing cycle is started, the request for the next data item has already been issued, allowing overlap of computing and communication.

Pull-farm
Pull-farm represents a type of communication in which first a control message (64 bytes) is sent from the receiver to N − 1 senders, and subsequently an amount of data (1024 bytes) is received back from each sender. Computing cycles can be included in the same way as in push-farm. A minimal sketch of the receiver side of this pattern is given after the list.
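For illustration, the pull-farm receiver can be pictured in MPI terms roughly as shown below; this is only a sketch under assumptions of our own (receiver on rank 0, senders on ranks 1 to N − 1, hypothetical function name), not the actual benchmark code, which sits on a per-technology low-level interface.

    #include <mpi.h>

    #define NSENDERS 4   /* example: four senders, one receiver */

    /* Pull-farm receiver sketch: send a 64-byte control message to each
     * sender, then collect 1024 bytes of data back from each of them. */
    static void pull_farm_receiver(void)
    {
        char request[64] = {0};
        static char data[NSENDERS][1024];
        MPI_Request req[NSENDERS];

        for (int s = 0; s < NSENDERS; s++)      /* request the fragments */
            MPI_Send(request, 64, MPI_BYTE, s + 1, 0, MPI_COMM_WORLD);
        for (int s = 0; s < NSENDERS; s++)      /* collect the replies */
            MPI_Irecv(data[s], 1024, MPI_BYTE, s + 1, 1, MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSENDERS, req, MPI_STATUSES_IGNORE);
    }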

A graphical representation of the benchmark topologies is given in Figure 1.

Of particular relevance to our application are the benchmarks ping-pong, pairs, push-farm and pull-farm. The latter two have obviously been specifically designed to correspond to communication patterns typical for our application. Ping-pong tests the request-acknowledge cycle, which is needed in several kinds of transmissions. The pairs benchmark tests one-way communication performance from point to point, which is characteristic for communication without need for acknowledgement.
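As an illustration of the kind of measurement involved, the core of a ping-pong run in the default MPI implementation could look roughly as follows; this is a minimal sketch, not the benchmark package's actual code, and the packet size and repetition count are example values.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal ping-pong sketch: rank 0 sends a packet to rank 1 and waits for
     * it to be returned; the average round-trip time is reported. */
    int main(int argc, char **argv)
    {
        int rank, reps = 1000, size = 64;      /* example values only */
        char buf[1024];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double rtt_us = (MPI_Wtime() - t0) / reps * 1e6;
        if (rank == 0)
            printf("%d-byte round trip: %.1f us (latency ~ %.1f us)\n",
                   size, rtt_us, rtt_us / 2.0);
        MPI_Finalize();
        return 0;
    }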

Figure 1: A graphical representation of the communication benchmarks (ping-pong, two-way, all-to-all, pairs, outfarming and broadcast, funnel, push-farm and pull-farm). The dot indicates which time is measured.

3 Implementation

The benchmarks have been implemented on different technologies by designing a separate intermediate layer for each technology, as illustrated in Figure 2. This layer contains the message passing routines, such as sending and receiving a message, initialisation, cleaning up and broadcasting. The routines include non-blocking send and receive operations.

By using this kind of layered approach, the porting of the benchmarks has been made more straightforward. When implementing a version for a new technology, only the intermediate layer has to be changed.
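For illustration, the intermediate layer can be pictured as a small set of C routines of the following kind; the names below are hypothetical and not the actual interface of the benchmark package, but each technology would provide its own implementation behind such a fixed interface.

    /* Hypothetical sketch of a per-technology low-level interface; the
     * benchmark code above it is written once, and only this layer is
     * reimplemented for each technology. */
    #ifndef COMBENCH_LOWLEVEL_H
    #define COMBENCH_LOWLEVEL_H

    #include <stddef.h>

    typedef struct cb_request cb_request;  /* handle for non-blocking operations */

    int  cb_init(int *argc, char ***argv); /* set up the technology              */
    void cb_finalize(void);                /* clean up                           */

    int  cb_rank(void);                    /* this node's id, 0 .. cb_nodes()-1  */
    int  cb_nodes(void);                   /* total number of nodes, N           */

    /* Blocking and non-blocking point-to-point transfers. */
    int  cb_send(int dest, const void *buf, size_t len);
    int  cb_recv(int src, void *buf, size_t len);
    int  cb_isend(int dest, const void *buf, size_t len, cb_request *req);
    int  cb_irecv(int src, void *buf, size_t len, cb_request *req);
    int  cb_wait(cb_request *req);

    /* Broadcast, mapped to hardware broadcast where the technology offers it. */
    int  cb_broadcast(int root, void *buf, size_t len);

    #endif /* COMBENCH_LOWLEVEL_H */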

Since two different ATM libraries were used, two different implementations of the low-level interface had to be programmed for that particular technology. Since the application programming interface of different ATM hardware, for example, is usually proprietary, the interface has to be implemented separately for each system. The push-farm and pull-farm results for ATM use specific traffic generators as data sources. Additionally, the programs used in these measurements differ slightly from the other benchmarks.

The benchmarks for the Mercury RACEway system (RACE is a registered trademark of Mercury Computer Systems, Inc.; see http://www.mc.com/Technical bulletins/mtb4smp-race/smp-v-race.html) have been designed using the PeakWare toolkit (PeakWare is a trademark of MATRA SYSTEMES & INFORMATION), previously known as CapCASE [14].


Figure 2: The structure of the benchmark implementation: the ATLAS communication benchmarks are written once per application on top of a low-level interface that is written once per technology (MPI, SCI, ATM, shared memory, MEMORY CHANNEL and other communication hardware). The low-level interface has been programmed separately for each technology.

4 Platforms

The measurements have been done on a number of technologies:

- Scalable Coherent Interface (SCI) on a RIO2 8061 embedded processor board (http://www.ces.ch/Products/Products.html) using the LynxOS operating system, on PCs under Windows NT, and on Digital Alphas under Digital UNIX
- Digital Memory Channel connecting Digital Alphas
- Asynchronous Transfer Mode (ATM), on RIO2 (and RTPC) embedded processor boards with the LynxOS operating system, on PCs under Windows NT, and on Digital Alphas under Digital UNIX
- Cray T3E shared memory Application Programming Interface
- RACEway bus, using the Matra Systemes & Information PeakWare toolkit
- T9000 using IEEE 1355 DS links (GPMIMD)
- Meiko CS-2 communication library

In addition, shared memory and the Message Passing Interface (MPI) have been used on multiple systems, as opposed to the lower-level APIs. Shared memory has been benchmarked on Digital 8400, Silicon Graphics Challenge and Origin systems. The tested MPI platforms include shared memory multiprocessors, such as the Digital 8400 system and the Silicon Graphics Challenge; clusters, such as Digital's Memory Channel; and conventional distributed memory systems, such as the IBM SP2, Cray T3E and Meiko CS-2.

We will describe here in more detail the test procedure for three of these technologies: Digital's Memory Channel, SCI and ATM.

4.1 Memory Channel

Memory Channel [5] is a proprietary network technology from Digital Equipment Corporation (Memory Channel is a registered trademark of Digital Equipment Corporation); it is commercially targeted as an inter-node transport medium in Digital UNIX TruCluster configurations. It typically interconnects several AlphaServers with multiple processors each, and thus extends the scalability of Digital's product line to installations with currently up to 96 (= 8 × 12) parallel Alpha processors.

Figure 3: Configuration in the Memory Channel tests (one AlphaServer 4000 5/300 and four AlphaStation 200 4/166 systems connected to a MEMORY CHANNEL hub). Later an AlphaStation 500 system was added to the configuration, replacing one of the older stations.

Memory Channel presents the abstraction of a unique shared address space to all processes, regardless of their attachment to remote CPUs. Inter-node communication can be pursued with low overhead because, after an initial phase of memory-mapping, single user-level CPU store and load instructions suffice to launch communication, and the required protection mechanisms are enforced in hardware.

Unlike cache-coherent non-uniform memory architectures (CC-NUMAs) in particular, Memory Channel opts for a slimmer solution which exposes the differences between local and remote memories at the application level. It does not provide a strong memory coherency model; thus senders and receivers have to follow software-based protocols for distinguishing between outdated and fresh copies of data. Furthermore, its memory mappings are asymmetric, i.e. applications must be prepared to use different virtual addresses for reading from and writing to physically unique remote locations.
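The resulting programming style can be sketched as follows; this is an illustration only, not the Memory Channel API: tx_slot and rx_slot stand for hypothetical pointers obtained during the initial memory-mapping phase, and they deliberately have different virtual addresses for the same remote region, as described above.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout of a one-way message slot in a mapped region: a
     * payload followed by a sequence flag that marks the data as fresh. */
    typedef struct {
        char              payload[1024];
        volatile uint32_t sequence;     /* 0 means "empty"; written last */
    } mc_slot;

    /* Sender side: plain user-level stores through the transmit mapping. */
    static void mc_send(mc_slot *tx_slot, const void *msg, size_t len, uint32_t seq)
    {
        memcpy(tx_slot->payload, msg, len);
        /* a real implementation would need a write barrier here */
        tx_slot->sequence = seq;        /* publish: flag is written last */
    }

    /* Receiver side: reads through the receive mapping (a different virtual
     * address); software polls the flag, since no hardware coherency model
     * distinguishes outdated from fresh data. */
    static void mc_recv(mc_slot *rx_slot, void *msg, size_t len, uint32_t seq)
    {
        while (rx_slot->sequence != seq)
            ;                           /* spin until the data is fresh */
        memcpy(msg, rx_slot->payload, len);
    }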

The quoted results were obtained with the following experimental setup: one AlphaServer 4000 5/300 (299 MHz Alpha 21164 EV5, 96 KB + 2 MB off-chip cache) and four AlphaStations 200 4/166 (166 MHz Alpha 21064 EV4.5, 512 KB off-chip cache) were combined in a Digital UNIX TruCluster. Inter-node connectivity was established over Memory Channel-to-PCI adapter cards (revision 1.5), a Memory Channel hub with five line-cards and copper link cables. All workstations had 128 MB RAM configurations and ran copies of the Digital UNIX 4.0B operating system. The test environment is illustrated in Figure 3.

The tests used a two-copy implementation of the low-level interface's message passing library that ran on top of the Memory Channel API library [6]. It is understood that the results can be improved by avoiding the second copy, at the price of violating the protocol stack of Figure 2.

4.2 Scalable Coherent Interface (SCI)

SCI [8] is an enabling technology for cache-coherent non-uniform memory architectures (CC-NUMAs). It aims at better scalability for shared-memory multiprocessors than what bus-based schemes can achieve, by using multiple point-to-point links and a directory-based cache coherency scheme. The internal composition of the network, which aggregates links into topologies of ringlets and switches, is effectively hidden from clients behind an interface which resembles backplane buses.

Figure 4: Configuration in the SCI tests (three RIO2 8061 boards, an AlphaStation 500/400 and an AlphaServer 4000 5/300 connected through an SCI switch).

In comparison to Memory Channel, which has been adopted by a single vendor for market-ready solutions, SCI is an IEEE 1596 standard, with increasingly strong commitments e.g. from Data General (http://www.dg.com/numaliine/html/sci interconnect chipset and adapter.html), Sequent (http://www.sequent.com/numaq/), Sun (http://www.sun.com/hpc/tech/interconnect.html) and Siemens-Nixdorf (http://www.sni.de/public/sni.htm).

For our tests we used Dolphin Interconnect Solutions PCI-SCI cards (rev. B) [7], which conform to the 32-bit PCI local-bus specification. This PCI implementation did not offer SCI's cache coherency; it enabled us to utilise a variety of different nodes, in particular also including embedded processor boards, which is of great importance to some of the conceived parallel applications in our domain.

The cards, attached to 18-DE-200 link cables for a 16-bit parallel, electrical implementation of the physical layer, offer up to 200 MB/s of aggregate bandwidth on the medium. We observed that the obtained bandwidths (about 70 MB/s maximum) were limited by the performance of the PCI buses. This restriction is avoided by systems which integrate SCI at the system-bus level, at the obvious price of giving up reusable interface cards.

Our results were obtained with the following equipment: the point-to-point measurements refer to our fastest pair of nodes, i.e. an AlphaStation 500/400 (400 MHz Alpha 21164 EV5.6, 96 KB + 2 MB off-chip cache) and an AlphaServer 4000 5/300. Tests involving more than two nodes ran on the AlphaServer and a pool of RIO2 8061 (100 MHz PowerPC 604) VME-embedded processor boards.

The low-level interface's two-copy message passing library operated on top of an implementation of a draft version of the SCI PHY-API [9], whose general aim is to provide a standard for low-level software access to SCI services.

The SCI tests have been done using the configuration illustrated in Figure 4.

4.3 ATM

4.3.1 Abstract benchmarks

ATM tests were carried out by using point-to-point connections between a number of different systems. Additionally, a testbed consisting of Digital AlphaStation 200, RIO2 and RTPC systems and a FORE ATM switch was used. The testbed configuration is illustrated in Figure 5. The maximum number of nodes available during the tests was five. In addition, new-generation Digital AlphaStation 500 and AlphaServer 4000 systems were available for ping-pong tests.

Figure 5: ATM testbed configuration (two AlphaStation 200 4/166 systems, an RTPC and two RIO2 8061 boards connected to a FORE switch; equipment on loan from RD31).

A general problem for the ATM benchmarking arose from the limited availability of the systems; our testbed was relatively heterogeneous: two different types of system and two different ATM libraries were used. The Digital systems used Digital's ATMSOCK library version 1.0, which will become a commercial product [10] (it had not been released at the time of the tests). The RIO2s and the RTPC used the ATMNicLib library [13], which was an efficient implementation reducing overheads by bypassing the kernel (a library of utility functions was called instead of a device driver) and avoiding data copies (the NicStar network interface has direct access to user buffers). Thus the ATM library implementation running on the RIO2s and the RTPC was especially tuned.

Most of the benchmarks, including all the testbed runs, have been run using AlphaStation 200 systems, since newer systems were not available for testing at that moment. The point-to-point results from the newer Digital systems (AlphaServer 4000, AlphaStation 500) were added later.

Minimum overheads correspond to the smallest packet size used in the measurements, which was either 1 byte or 8 bytes. The largest packet size used here was 1024 bytes, although some of the ping-pong tests were additionally run with larger packet sizes.

In the testbed measurements full optimisation was not used. The point-to-point measurements have been done with full optimisation between two AlphaStation 200 systems, between an AlphaStation 500 and an AlphaServer 4000, and between a RIO2 and the RTPC.

The basic implementation of the benchmarks uses Unspecified Bit Rate (UBR) connections and sends at full speed; as ATM has no flow control, the receiver can in some cases lose packets. This is especially apparent in the benchmarks pairs, funnel and push farm. In the funnel benchmarks, when four senders and large packets were used, no meaningful measurements were possible. To avoid the problem of losing packets, Constant Bit Rate (CBR) connections with reduced bandwidth for each sender could be used instead. Tests using CBR have been carried out in addition [11], [12].

The minimum round-trip time divided by two for the RIO2 was around 80-100 microseconds in the ping-pong benchmark, in which both sides receive and send. The same overhead for the Digital AlphaStation 200 is around 200 microseconds. For the newer Digital systems the ping-pong overhead is close to the one obtained on the RIO2s. The performance difference between the measurements with Digital processors of different generations is surprisingly large; it should be attributed not so much to the faster processor technology (communication benchmarks are not very CPU intensive), but to other architectural changes which have taken place during recent years.

For receiving only (benchmark pairs), the RIO2 can achieve a very low receiver overhead, less than 10 microseconds for one byte. Digital Alpha's receiver overhead in pairs is around 20-30 microseconds for one byte. The fast RIO2 results demonstrate the performance gain which can be achieved by investing in low-level design work on ATM drivers [4], compared with directly available commercial implementations.

The ATM link speed was nominally 155 Mbit/s. The maximum user bandwidth, i.e. the link speed minus the control-data overhead, was around 135 Mbit/s. For example, when sending at full speed (UBR) one-directionally between RIO2s, already with a 1024-byte packet size the link speed is almost entirely used (129.6 Mbit/s). However, since the nominal speed of 155 Mbit/s is not very high compared with some other communication technologies of today, it would be interesting to see the effects of ATM connections with higher speed (e.g. 622 Mbit/s).
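For orientation, the 135 Mbit/s figure is consistent with the standard framing overheads of a 155 Mbit/s (OC-3) link; the short sketch below works this out under the usual assumptions (149.76 Mbit/s of cell payload after SONET overhead, 48 payload bytes per 53-byte cell, an 8-byte AAL5 trailer), and is an estimate rather than a measured value.

    #include <stdio.h>

    /* Rough estimate of the usable bandwidth of a 155 Mbit/s ATM link,
     * assuming standard SONET and AAL5 framing overheads (illustration only). */
    int main(void)
    {
        const double cell_stream_mbit = 149.76;        /* after SONET overhead   */
        const double cell_payload = 48.0 / 53.0;       /* payload bytes per cell */
        printf("maximum user bandwidth ~ %.1f Mbit/s\n",
               cell_stream_mbit * cell_payload);       /* ~135.6 Mbit/s          */

        /* User-data rate for 1024-byte packets: payload plus an 8-byte AAL5
         * trailer, padded to a whole number of 48-byte cell payloads. */
        int payload = 1024, trailer = 8;
        int cells = (payload + trailer + 47) / 48;     /* 22 cells               */
        printf("1024-byte packets: ~ %.1f Mbit/s of user data\n",
               cell_stream_mbit * payload / (cells * 53.0));  /* ~131 Mbit/s     */
        return 0;
    }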

4.3.2 Application benchmarks: push farm and pull farm

The performance of the receiver processor in the push farm tests was measured on the upgraded demonstrator of the RD31 project [11]. In this system, sender processors are replaced by ATM traffic generators which emulate the senders. The receiver processor establishes a Constant Bit Rate connection with each sender. The bandwidth of this channel is 1/(number of senders) of the link bandwidth, to avoid congestion in the switching network and the receiver. We measured the maximum event rate that a receiver can handle for various packet sizes, numbers of messages to be grouped and processing times. For small messages, when the total amount of event data does not exceed 2 kBytes, the performance of the system is determined by the software and hardware overhead of the receiver, T_oh,push = 30 µs + 8 µs × Number_of_Senders.

For large messages the maximum event rate is limited by the usable bandwidth of the 155 Mbit/s ATM links. For example, when each of the four senders sends 1 kbyte of data, the total data transmission time is 250 µs and the maximum event rate is 4 kHz.

The measurements for the pull farm benchmark were made on the demonstrator for ATLAS described in [12]. For this implementation of the pull protocol, the overhead to handle small events (less than 2 kBytes) in the receiver is T_oh,pull = 200 µs + 13 µs × Number_of_Senders.

For events bigger than 8 kBytes the link bandwidth limits the maximum event rate per receiver. For intermediate event sizes no simple formula can be derived. For example, when four senders send 1 kbyte of data each, the maximum event rate is about 2.66 kHz, which corresponds to a T_oh,pull of 376 µs.

For both the push farm and the pull farm, when the sum of overhead and processing time is larger than the event transfer time, the maximum event rate is 1/(overhead + processing time). When the data transfer time is dominant, the maximum event rate is 1/(transfer time).
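As an illustration of how these two regimes combine, a minimal sketch with the overhead constants quoted above (and a hypothetical helper name) is given below; it reproduces the 4 kHz figure for the push farm with four senders and 1 kbyte fragments.

    #include <stdio.h>

    /* Maximum event rate from the rules quoted in the text: the larger of
     * (overhead + processing time) and the transfer time dominates. */
    static double max_event_rate_hz(double overhead_us, double processing_us,
                                    double transfer_us)
    {
        double busy_us = overhead_us + processing_us;
        double limit_us = (busy_us > transfer_us) ? busy_us : transfer_us;
        return 1e6 / limit_us;
    }

    int main(void)
    {
        int senders = 4;
        double t_oh_push = 30.0 + 8.0 * senders;   /* 62 us receiver overhead     */
        double transfer_us = 250.0;                /* 4 x 1 kbyte on the ATM link */
        printf("push farm, 4 x 1 kbyte: %.1f kHz\n",
               max_event_rate_hz(t_oh_push, 0.0, transfer_us) / 1e3);  /* ~4.0 */
        return 0;
    }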

5 Results

5.1 An overview of the parameters

We have condensed the large number of different benchmark results into a few meaningful parameters, as shown in Figure 6. A complete set of results is available from the web site (http://www.cern.ch/RD11/combench/results.html).

The parameters are the following:

In ping-pong we define one parameter: the overhead, derived by dividing the round-trip time for the smallest packet size by two. This parameter represents the latency of the communication.

Latency has also been measured in two-way. However, in this measurement the time has not been divided by two, since both nodes are sending and receiving data simultaneously. Note that the two-way latency is larger than in ping-pong, since the setup times of both sending and receiving are included.

In the pairs benchmark we define two parameters: overhead and effective bandwidth. The overhead is the one-directional communication time for the smallest packet. The effective bandwidth has been calculated from this benchmark (and not from the previous ones), since in one-directional communication the speed is not limited by waiting for an acknowledgement each time a message is sent. We extract the effective bandwidth using the one-kilobyte packet as a reference. It should be noted that this is in most cases not the upper limit for bandwidth; some systems achieve substantially higher bandwidth only with packets much larger than 1 kbyte.

A parameter describing the broadcasting capabilities of the system has been extracted from the outfarming and broadcast benchmarks by dividing the outfarming ratio by the broadcasting ratio. The ratios have been calculated by dividing the time for 2 × N nodes by the time for N receiving nodes, from 2 to 4, 4 to 8 and 8 to 16 processors where possible, and taking the average. The parameter shows how well broadcasting has been implemented in each system; the larger the number, the more efficient the broadcast is compared to the outfarming performance of the same system.
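As a sketch of how this parameter could be computed from raw timings, the fragment below uses a hypothetical helper and invented example values (times at 2, 4, 8 and 16 receiving nodes); it is not taken from the measured data.

    #include <stdio.h>

    /* Average scaling ratio t(2N)/t(N) over the available node-count doublings. */
    static double avg_ratio(const double t[], int n)
    {
        double sum = 0.0;
        for (int i = 0; i + 1 < n; i++)
            sum += t[i + 1] / t[i];
        return sum / (n - 1);
    }

    int main(void)
    {
        /* Hypothetical times (us) at 2, 4, 8 and 16 receiving nodes. */
        double t_outfarm[]   = {20.0, 42.0, 90.0, 190.0};
        double t_broadcast[] = {20.0, 26.0, 34.0, 45.0};
        /* Broadcast parameter: outfarming ratio divided by broadcast ratio;
         * larger values mean broadcast is implemented more efficiently. */
        printf("broadcast parameter = %.2f\n",
               avg_ratio(t_outfarm, 4) / avg_ratio(t_broadcast, 4));
        return 0;
    }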

The funnel (and push-farm) benchmark represents a typical data collection approach, in which a number of nodes send their data to one receiver. A typical example has been chosen: four nodes sending a 1 kbyte packet each to one node. The cycle time to complete the operation, i.e. the inverse frequency, is given as a parameter. The results of funnel and push-farm differ from each other only slightly. We present mainly the results from funnel, since it has been run on a larger number of systems. Where up-to-date funnel results have not been available, push-farm results were used instead.

Pull-farm represents another type of data collection, in which first a read request is sent. Also here a typical example of four senders and one receiver has been chosen. In pull-farm the packet sizes have been fixed: 64 bytes for the control message requesting the data and 1 kbyte for the actual data. The cycle time, i.e. the inverse frequency, is presented.

The current implementations of the push-farm and pull-farm communication benchmarks allow, in principle, overlap between computation and communication. They use non-blocking communication primitives to start gathering fragments belonging to the (n + 1)-th event before starting the calculations on the n-th event. Neither benchmark attempts to request fragments belonging to more than one future event in advance. This is partly because only a few implementations of the communication layer allow multiple outstanding send or receive operations between two processes at a time.
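This overlap mechanism can be illustrated with a minimal sketch of a push-farm receiver loop written against MPI non-blocking receives; the sketch is ours (the per-technology low-level interface of the package differs), and compute_dummy, the rank assignment and the buffer sizes are hypothetical.

    #include <mpi.h>

    #define NSENDERS 4
    #define FRAGSIZE 1024

    /* Hypothetical stand-in for the dummy calculation of 0/100/400/1600 us. */
    static void compute_dummy(double microseconds) { (void)microseconds; }

    /* Push-farm receiver: the receives for event n+1 are posted before the
     * calculation on event n starts, so communication and computation can
     * overlap; no fragments beyond one future event are requested. */
    static void push_farm_receiver(int nevents, double d_us)
    {
        static char buf[2][NSENDERS][FRAGSIZE];   /* double-buffered fragments */
        MPI_Request req[2][NSENDERS];
        int cur = 0;

        for (int s = 0; s < NSENDERS; s++)        /* receives for the first event */
            MPI_Irecv(buf[cur][s], FRAGSIZE, MPI_BYTE, s + 1, 0,
                      MPI_COMM_WORLD, &req[cur][s]);

        for (int n = 0; n < nevents; n++) {
            int nxt = 1 - cur;
            if (n + 1 < nevents)                  /* prefetch event n+1 */
                for (int s = 0; s < NSENDERS; s++)
                    MPI_Irecv(buf[nxt][s], FRAGSIZE, MPI_BYTE, s + 1, 0,
                              MPI_COMM_WORLD, &req[nxt][s]);

            MPI_Waitall(NSENDERS, req[cur], MPI_STATUSES_IGNORE); /* event n done */
            compute_dummy(d_us);                  /* "analysis" on event n */
            cur = nxt;
        }
    }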

Architecture        ping-pong  two-way   pairs    pairs    broadcast  funnel      pull-farm   overlap
                    latency    latency   overhd   bandw.   ratio      cycle time  cycle time
                    (µs)       (µs)      (µs)     (MB/s)              (µs)        (µs)        (%)

ATM, RIO              104.8      108.9      9.1     16.2      2.05      250 **      376 **       -
ATM, DEC               85.4       90.3     33.3     13.6      -           -           -          -
Cray T3E                4.8        5.5      5.0     57.4      1.55       35.7        57.2       12
Cray T3E (MPI)         21.7       34.4     19.2     19.1      1.67      243.7       349.6        0
DEC MC                  6.4        7.8      3.7     31.2      1.92       83.3 *      90.0       71
DEC MC (MPI)           26.5       51.3     21.3     11.4      1.28        -           -          -
DEC 8400                4.0        7.8      4.5     32.2      1.15      128.3 *     138.3        0
DEC 8400 (MPI)         13.3       22.9     11.0     13.1      1.60        -           -          -
GPMIMD                  6.6        -       13.1      3.1      1.05     1132.4         -          -
IBM SP2 (MPI)          74.2       88.5     31.0     10.8      1.56        -           -          -
Matra, Raceway          8.8       12.5     10.0     59.0      -          47.5         -          -
Meiko (Channel)        20.3       34.4     22.4     11.4      1.88      124.0 *     172.0       78
Meiko (MPI)           128.5      137.0    102.0      6.2      1.23        -           -          -
Parsytec              217.0      354.0    188.0      3.7      1.50     1103           -          -
SCI, DEC                9.9       14.9      8.3     38.5      -           -           -          -
SCI, RIO/DEC           12.5       18.1      9.9     21.2      -          76.6 *      98.3       61
SGI Origin              7.1       12.7      8.8     31.3      1.25      178.3       207.2        -
SGI Challenge          12.9       20.1     14.0     12.2      1.23      262.6         -          -
SGI Chall. (MPI)       66.1       81.4     34.8      5.8      -         546.4         -          -

Figure 6: Parameters extracted from the benchmark results. Push-farm results were used instead of funnel when either funnel results were not available or the push-farm results were considerably newer; these push-farm results are marked with an asterisk (*). The push-farm and pull-farm results for ATM, marked with a double asterisk (**), have been obtained from slightly different measurements and are described in more detail in section 4.3.2.

We parameterised the observed amount of overlap by 1 − a/t_{r,0} (in per cent), where a is the (application-specific) communication overhead, and t_{r,0} is the time spent by the receiver for gathering fragments (communication) with d = 0 (no calculation). The overhead a was obtained as the average of t_{r,d} − d for different values of d, that is, the overall excess over the time d that is required for calculating alone. We used results from measurements in which four senders relayed fragments of 1024 bytes each to one receiver. The overlap presented in the table has been taken from the push-farm.
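A minimal sketch of this parameter extraction is shown below; the helper is hypothetical and the example values are invented, not measured data (the measured push-farm times would be used in practice).

    #include <stdio.h>

    /* Overlap parameter (in per cent) from push-farm receiver times t_{r,d}
     * measured at computation times d, following 1 - a/t_{r,0} with
     * a = average of (t_{r,d} - d).  All times in microseconds. */
    static double overlap_percent(const double t_r[], const double d[], int n)
    {
        double a = 0.0;
        for (int i = 0; i < n; i++)
            a += t_r[i] - d[i];
        a /= n;
        return 100.0 * (1.0 - a / t_r[0]);  /* t_r[0] corresponds to d = 0 */
    }

    int main(void)
    {
        /* Invented example values, not measured results. */
        double d[]   = {0.0, 100.0, 400.0, 1600.0};
        double t_r[] = {120.0, 160.0, 440.0, 1640.0};
        printf("overlap = %.0f %%\n", overlap_percent(t_r, d, 4));
        return 0;
    }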

5.2 Discussion of the results

The results provide a large amount of data. The parameters extracted from them attempt to compress this large amount of benchmark data into a few meaningful numbers, which can be used to compare the communication performance of different systems.

From the parameters in Figure 6 a number of observations can be made. The latencies vary considerably, from a few to a few hundred microseconds. The same observation applies to the overhead in the pairs benchmark. The lowest overheads are not necessarily obtained by the tightly coupled shared memory systems, even though that might be expected: for example, Digital's Memory Channel reaches less than 4 µs and the Cray T3E 5 µs.

The bandwidths measured by sending a kilobyte packet vary between 3.7 and 59.0 MB/s, although many of the systems can do better with a larger packet size.

MPI results were obtained on multiple systems on which lower-level APIs were also available. The latencies and overheads with MPI, at least with the current MPI versions, are large, typically 3-6 times larger than with the lower-level APIs. On the other hand, since MPI is available on multiple platforms, it provides portability; it can be debated whether the difference in performance justifies the additional programming work.

The overlap results show the best overlap for the Meiko CS-2, as is explicable from its powerful per-node communication co-processors. The AlphaServer 8400 occupies the other end of the spectrum as a typical SMP multiprocessor, with no overlap at all (and low absolute latencies at the same time). Some other observed overlaps must be attributed to artifacts of the intermediate software layers, which stemmed from different authors, and are not so easily explained.

The scalability of the benchmarked systems does not appear strongly from the parameters. This is partly due to the fact that in many of the benchmarked configurations only a few processors were available. A few of the systems, the Silicon Graphics Origin and the Cray T3E, could be tested with a large number of nodes. These systems demonstrated quite good communication network scalability up to the tested 44 (SGI) and 64 (Cray) processors.

The large number of benchmark results has been obtained over a time span of more than a year. During that time a number of hardware and software upgrades took place, so the results do not necessarily represent the most up-to-date situation. In addition, the maximum number of processors presented in some benchmark results depended on our access to the systems.

It would be expected that in the pairs benchmark the overhead would be consistently smaller than the latency in ping-pong, since only the receiving setup time is present in addition to the transfer time, and no flow control from the receiver back to the sender is used. However, in some cases the pairs overhead is larger. In one case this behaviour can be explained by background load (the tested systems could not always be dedicated); in another there was a configuration change during some of the measurements. The difference is in these cases quite small, and the suspicious times are also at the lower end (mostly less than 10 µs, with few exceptions), so the statistical error of the measurements might also influence the results.

The parameters in Figure 6 represent critical aspects of the systems, but are in no way sufficient to generate all measured benchmark results. They may be seen, however, as parameters that can be used in a model.

The current implementation does a memory copy at each end of the data transfer, which is not optimal for some of the technologies, for example for shared memory.

6 Summary

Benchmark suites such as Parkbench (http://www.netlib.org/parkbench/) measure the multiprocessor performance of a system by running a set of predefined applications or kernels of applications. Only a small part of the Parkbench suite deals with communication, however. We feel that our benchmark suite can serve as a useful tool in comparing the raw communication performance of parallel systems, as it is available to application-level programs.

There is a large number of results available for these benchmarks. This makes it possible to compare the communication performance of different systems widely. Many of the latest-generation parallel technologies have been measured.

Several of the systems show communication overheads below 10 µs for small packets. Some of the systems have additionally proven good scalability, which has been tested in some cases with up to 64 processors. Given the good scalability of some of the systems within the tested range, it can be expected that scaling to hundreds of processors will, either already now or at least in the near future, also be quite efficient, for example with some of the tightly coupled parallel systems or with shared memory systems using a NUMA (non-uniform memory access) architecture.

The communication parameter most typical for our application, pull-farm with four senders, is completed in some systems in around 60-90 µs. The best result for the push-farm, for the same number of processors and 1 kbyte of data, is around 40 µs. We consider these numbers promising for our trigger work.

7 Future work

The benchmarks described in this document have been run on a large number of different systems. This provides an extensive set of results, from which information about the different communication networks can be extracted. However, new and faster systems constantly arrive on the market; we intend to subject them to the same procedures.

Parallel systems are evolving, too. Many of the main vendors are developing systems based on clustering shared memory systems, in which each node thus is a multiprocessor system itself. This kind of two-level (or more) communication hierarchy creates new challenges for future releases of the communication benchmarks.

8 Acknowledgements

We would like to thank Digital Equipment Corporation for their close co-operation during this benchmark work.

We would like to thank the Center for Scientific Computing (CSC) in Finland for the usage of their supercomputer systems.

We would also like to thank Irakli Mandjavidze (DSM/DAPNIA), Andreu Pacheco (CERN), Denis Calvet (DSM/DAPNIA) and the CERN RD31 group for technical assistance, and for the opportunity to use the ATM switch and related hardware in building the testbed for the measurements.

In addition, we thank the following persons, who have contributed to running the benchmarks: Igor Zacharov (Silicon Graphics), Raynald Huaulme (Matra Systemes & Information), Iosif Legrand (DESY), Ruud van Wijk (NIKHEF), Roger Heeley (CERN) and John Apostolakis (CERN).

References

[1] ATLAS Technical Proposal. CERN/LHCC/94-43, 1994.

[2] ATLAS Level-2 Trigger Groups, ATLAS Second-Level Trigger Options. CHEP'97 conference proceedings, http://sgi.ifh.de/CHEP97/paper/paper/466.ps.

[3] J. Apostolakis et al., Abstract Communication Benchmarks on Parallel Systems for Real-time Applications. CHEP'97 conference proceedings, http://sgi.ifh.de/CHEP97/paper/paper/460.ps.

[4] Private communications from Irakli Mandjavidze and Denis Calvet.

[5] R. B. Gillett. Memory Channel Network for PCI. IEEE Micro, February 1996. http://www.digital.com:80/info/hpc/ref/gillett ieee.pdf

[6] Digital Equipment Corporation. TruCluster Production Server Software. MEMORY CHANNEL Application Programming Interfaces. Part Number AA-QTN4B-TE, September 1996. http://www.unix.digital.com/faqs/publications/cluster doc/PS MC API/TOC.HTM#TOC

[7] Dolphin Interconnect Solutions A.S., Oslo, Norway. PCI-SCI Bridge Functional Specification. Version 3.1 (confidential), November 1996.

[8] IEEE Computer Society. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std 1596-1992 (recognised as an American National Standard (ANSI)), August 1993.

[9] IEEE Computer Society. Physical layer Application Programming Interface for the Scalable Coherent Interface (SCI PHY-API). IEEE Std P1596.9/Draft 0.41b, March 23, 1997. http://sci.lbl.gov/sciapi/draft/book041b.pdf

[10] Digital Equipment Corporation. Digital UNIX Native ATM Application Programming Interface. Programmer's reference for PVC operations. Version 2.0, September 16, 1996.

[11] M. Costa et al. Lessons from ATM-based event builder demonstrators and challenges for LHC-scale systems. Proceedings of the Second Workshop on Electronics for LHC Experiments, Balatonfüred, Hungary, 23-27 September 1996.

[12] D. Calvet et al. Operation and Performance of an ATM based Demonstrator for the Sequential Option of the ATLAS Trigger. Proceedings of the Tenth IEEE Real Time Conference, Beaune, France, 21-26 September 1997.

[13] D. Calvet et al. Performance Analysis of ATM Network Interfaces for Data Acquisition Applications. Proceedings of the Second International Data Acquisition Workshop on Networked Data Acquisition Systems. World Scientific Publishing, 1997, pp. 73-80.

[14] A. Clouard et al. CapCASE: A Graphical Development Tool Supporting Scalable, Heterogeneous Multicomputers. Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '96), pp. 873-879, Boston, USA, October 7-10, 1996.
